<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Matplotlib - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Matplotlib - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 24 May 2026 16:30:48 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/matplotlib/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Get Started with Matplotlib – With Code Examples and Visualizations ]]>
                </title>
                <description>
                    <![CDATA[ One of the key steps in data analysis is data visualization, as it helps you notice certain features, tendencies, and relevant patterns that may not be obvious in raw data. Matplotlib is one of the most effective libraries for Python, and it allows t... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/getting-started-with-matplotlib/</link>
                <guid isPermaLink="false">67046b93639bbc713991b8a3</guid>
                
                    <category>
                        <![CDATA[ Matplotlib ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Oyedele Tioluwani ]]>
                </dc:creator>
                <pubDate>Mon, 07 Oct 2024 23:15:31 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1727947002230/9ab7fb41-65fe-4bf5-bb59-8f514a3e9396.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>One of the key steps in data analysis is data visualization, as it helps you notice certain features, tendencies, and relevant patterns that may not be obvious in raw data. Matplotlib is one of the most effective libraries for Python, and it allows the plotting of static, animated, and interactive graphics.</p>
<p>This guide explores Matplotlib's capabilities, focusing on solving specific data visualization problems and offering practical examples to apply to your projects.</p>
<p>Here’s what we are going to cover in this article:</p>
<ul>
<li><p><a class="post-section-overview" href="#heading-importance-of-data-visualization-in-data-analysis">Importance of Data Visualization in Data Analysis</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-brief-overview-of-matplotlib">Brief Overview of Matplotlib</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-getting-started-with-matplotlib">Getting Started with Matplotlib</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-installation-and-setup">Installation and Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-your-first-plot">How to Create Your First Plot</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-exploring-different-types-of-plots">Exploring Different Types of Plots</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-advanced-plot-customizations">Advanced Plot Customizations</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-how-to-work-with-multiple-plots">How to Work with Multiple Plots</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-enhance-plot-aesthetics">How to Enhance Plot Aesthetics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-save-and-export-plots">How to Save and Export Plots</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-interactive-plotting-and-animation">Interactive Plotting and Animation</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-interactive-features-in-matplotlib">Interactive Features in Matplotlib</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-animations">How to Create Animations</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-optimize-plots-for-large-datasets">How to Optimize Plots for Large Datasets</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-efficient-plotting-techniques-for-large-datasets">Efficient Plotting Techniques for Large Datasets</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-statistical-data-visualization">Statistical Data Visualization</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-common-visualization-pitfalls-and-how-to-avoid-them">Common Visualization Pitfalls and How to Avoid Them</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-overplotting">Overplotting</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-misleading-scales-and-axes">Misleading Scales and Axes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-color-misuse">Color Misuse</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-misleading-use-of-3d-plots">Misleading Use of 3D Plots</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-misleading-use-of-area-charts">Misleading Use of Area Charts</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-importance-of-data-visualization-in-data-analysis">Importance of Data Visualization in Data Analysis</h2>
<p>Suppose you are working with the sales data of a large chain of stores. The raw data may contain hundreds or thousands of rows, with columns such as product category, sales region, and monthly revenue. In that form, the data is too dense for anyone to read patterns from directly.</p>
<p>By visualizing the data, however, you get a broad view of what is happening, such as which product category is succeeding or which region is lagging.</p>
<p>Data visualization is the process of turning data into forms that are easier to understand and analyze for decision-making. Matplotlib is particularly effective here for data scientists and analysts because of the wide range of plot types and customization options it offers.</p>
<h2 id="heading-brief-overview-of-matplotlib">Brief Overview of Matplotlib</h2>
<p>Matplotlib, now one of the most popular plotting libraries in the Python ecosystem, was started by John Hunter in 2003. It lets you create static, interactive, and animated plots, making it an indispensable tool for scientists, engineers, and data analysts.</p>
<p>Some common problems that Matplotlib can help solve include:</p>
<ul>
<li><p>Visualizing large datasets to identify patterns and outliers.</p>
</li>
<li><p>Designing complex, publication-quality graphics for academic articles.</p>
</li>
<li><p>Combining data gathered from different sources into interactive and informative illustrations.</p>
</li>
<li><p>Customizing plots so that the trends they show are clear.</p>
</li>
</ul>
<h2 id="heading-getting-started-with-matplotlib">Getting Started with Matplotlib</h2>
<h3 id="heading-installation-and-setup">Installation and Setup</h3>
<p>Before we dive into creating plots, let's get Matplotlib installed and set up. You can install Matplotlib using <code>pip</code> or <code>conda</code>:</p>
<pre><code class="lang-python">pip install matplotlib
</code></pre>
<p>Alternatively, if you're using Anaconda:</p>
<pre><code class="lang-python">conda install matplotlib
</code></pre>
<p>To verify the installation:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib
print(matplotlib.__version__)
</code></pre>
<h3 id="heading-how-to-create-your-first-plot">How to Create Your First Plot</h3>
<p>Let’s start with a common problem: you have a dataset that records the daily temperature for a given month, and you want to study how the temperature varies.</p>
<p>Here’s how you can create a simple line plot to visualize this trend:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Simulating daily temperature data</span>
days = np.arange(<span class="hljs-number">1</span>, <span class="hljs-number">32</span>)  <span class="hljs-comment"># days 1 through 31 of August</span>
temperature = np.random.normal(loc=<span class="hljs-number">25</span>, scale=<span class="hljs-number">5</span>, size=len(days))

plt.plot(days, temperature, marker=<span class="hljs-string">'o'</span>)
plt.title(<span class="hljs-string">'Daily Temperatures in August'</span>)
plt.xlabel(<span class="hljs-string">'Day'</span>)
plt.ylabel(<span class="hljs-string">'Temperature (°C)'</span>)
plt.grid(<span class="hljs-literal">True</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727733970801/479efd1e-0324-4c93-b12e-50942b78f183.png" alt="A simple  plot" width="576" height="465" loading="lazy"></p>
<ul>
<li><p>We used <code>np.arange</code> to construct a series of days.</p>
</li>
<li><p><code>np.random.normal</code> models temperature data with a mean (<code>loc</code>) of 25 degrees Celsius and a standard deviation (<code>scale</code>) of 5 degrees Celsius.</p>
</li>
<li><p><code>plt.plot</code> creates a line plot with markers for each day.</p>
</li>
<li><p>Titles and labels were added to make the plot informative.</p>
</li>
</ul>
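<p>Because <code>np.random.normal</code> draws new values on every run, the plot above changes each time the script executes. A minimal sketch of one way to make the simulated data repeatable follows, using NumPy's seeded <code>default_rng</code> generator. The seed value <code>42</code> and the 31-day range are assumptions chosen for illustration, not part of the original example.</p>

```python
import numpy as np

# Seeded generator: the seed value 42 is an arbitrary choice for illustration.
rng = np.random.default_rng(42)

days = np.arange(1, 32)  # days 1..31, assuming a 31-day month such as August
temperature = rng.normal(loc=25, scale=5, size=len(days))

# Re-creating the generator with the same seed reproduces the same values.
rng_again = np.random.default_rng(42)
temperature_again = rng_again.normal(loc=25, scale=5, size=len(days))

print(np.array_equal(temperature, temperature_again))
```

<p>With a fixed seed, anyone who reruns the script gets the same figure, which helps when you share a plot alongside the code that produced it.</p>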
<h3 id="heading-exploring-different-types-of-plots">Exploring Different Types of Plots</h3>
<p>Matplotlib supports various plot types, each suited to specific data visualization problems.</p>
<h4 id="heading-line-plots">Line Plots</h4>
<p>Line plots are ideal for visualizing trends over time or continuous data. For example, tracking the monthly sales of a product:</p>
<pre><code class="lang-python">months = np.arange(<span class="hljs-number">1</span>,<span class="hljs-number">13</span>)
sales = np.random.randint(<span class="hljs-number">2000</span>, <span class="hljs-number">4000</span>, size=len(months))
plt.plot(months, sales, color=<span class="hljs-string">'red'</span>, linestyle=<span class="hljs-string">'--'</span>, marker=<span class="hljs-string">'o'</span>)
plt.title(<span class="hljs-string">"Monthly Sales of Product "</span>)
plt.xlabel(<span class="hljs-string">"Month"</span>)
plt.ylabel(<span class="hljs-string">"Sales (Units)"</span>)
plt.grid(<span class="hljs-literal">True</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727734299673/80917af9-81c1-4adc-aeef-63aac02d6b66.png" alt="Using line plots to track monthly sales" width="576" height="448" loading="lazy"></p>
<h4 id="heading-scatter-plots">Scatter Plots</h4>
<p>Scatter plots show the relationship between two variables by drawing one point per observation. For instance, visualizing the relationship between advertisement spending and sales:</p>
<pre><code class="lang-python">ad_spend = np.random.randint(<span class="hljs-number">50</span>, <span class="hljs-number">1000</span>, size=<span class="hljs-number">50</span>)
sales = ad_spend * np.random.uniform(<span class="hljs-number">0.8</span>, <span class="hljs-number">1.2</span>, size=<span class="hljs-number">50</span>)

plt.scatter(ad_spend, sales, color=<span class="hljs-string">'blue'</span>)
plt.title(<span class="hljs-string">"Advertisement Spending vs. Sales"</span>)
plt.xlabel(<span class="hljs-string">"Ad Spend (USD)"</span>)
plt.ylabel(<span class="hljs-string">"Sales (Units)"</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727734341461/0ebd072d-3bd8-498c-9e6f-3917497ba5a9.png" alt="A scatter plot representation" width="576" height="450" loading="lazy"></p>
<h4 id="heading-bar-charts">Bar Charts</h4>
<p>Bar charts are effective for comparing categorical data. For example, visualizing the total revenue generated by several product groupings:</p>
<pre><code class="lang-python">groupings = [<span class="hljs-string">'Musical Instruments'</span>, <span class="hljs-string">'Furniture'</span>, <span class="hljs-string">'Clothing'</span>, <span class="hljs-string">'Food'</span>]
revenue = [<span class="hljs-number">50000</span>, <span class="hljs-number">30000</span>, <span class="hljs-number">20000</span>, <span class="hljs-number">40000</span>]

plt.bar(groupings, revenue, color=<span class="hljs-string">'green'</span>)
plt.title(<span class="hljs-string">"Revenue by Product Grouping"</span>)
plt.xlabel(<span class="hljs-string">"Group"</span>)
plt.ylabel(<span class="hljs-string">"Revenue (EURO)"</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727734374042/a81a751d-2b5f-4c8a-98e9-3170756c440e.png" alt="A bar chart visualization" width="576" height="446" loading="lazy"></p>
<h4 id="heading-histograms">Histograms</h4>
<p>Histograms show the distribution of numerical data by counting how many values fall into each bin. For example, visualizing the distribution of customer ages in a survey:</p>
<pre><code class="lang-python">ages = np.random.randint(<span class="hljs-number">18</span>, <span class="hljs-number">65</span>, size=<span class="hljs-number">2000</span>)

plt.hist(ages, bins=<span class="hljs-number">10</span>, color=<span class="hljs-string">'purple'</span>, edgecolor=<span class="hljs-string">'black'</span>)
plt.title(<span class="hljs-string">"Age Distribution of Survey Participants"</span>)
plt.xlabel(<span class="hljs-string">"Age"</span>)
plt.ylabel(<span class="hljs-string">"Number of Participants"</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727734397041/19eeae57-97b9-4a29-9773-393627cb0d1c.png" alt="Histogram showing the distribution of customer ages" width="576" height="462" loading="lazy"></p>
<h4 id="heading-pie-charts">Pie Charts</h4>
<p>Pie charts show how a whole is divided into proportional parts. For example, visualizing the market share of different companies:</p>
<pre><code class="lang-python">companies = [<span class="hljs-string">'Company W'</span>, <span class="hljs-string">'Company X'</span>, <span class="hljs-string">'Company Y'</span>, <span class="hljs-string">'Company Z'</span>]
market_share = [<span class="hljs-number">40</span>, <span class="hljs-number">30</span>, <span class="hljs-number">20</span>, <span class="hljs-number">10</span>]

plt.pie(market_share, labels=companies, autopct=<span class="hljs-string">'%1.1f%%'</span>, colors=[<span class="hljs-string">'blue'</span>, <span class="hljs-string">'orange'</span>, <span class="hljs-string">'green'</span>, <span class="hljs-string">'red'</span>])
plt.title(<span class="hljs-string">"Market Share by Company"</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727734413363/a626ebb2-f3bc-4fb9-98e1-57d2dfdead8e.png" alt="A pie chart representation" width="576" height="484" loading="lazy"></p>
<h2 id="heading-advanced-plot-customizations">Advanced Plot Customizations</h2>
<h3 id="heading-how-to-work-with-multiple-plots">How to Work with Multiple Plots</h3>
<p>In some situations, you’ll need to compare multiple datasets in a single figure, for example, sales trends across different regions. You can do this with subplots (the code reuses the <code>months</code> array from the line plot example):</p>
<pre><code class="lang-python">regions = [<span class="hljs-string">'North'</span>, <span class="hljs-string">'South'</span>, <span class="hljs-string">'East'</span>, <span class="hljs-string">'West'</span>]
sales_data = np.random.randint(<span class="hljs-number">500</span>, <span class="hljs-number">5000</span>, size=(<span class="hljs-number">4</span>, <span class="hljs-number">12</span>))

fig, axs = plt.subplots(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>, figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">8</span>))
fig.suptitle(<span class="hljs-string">'Monthly Sales by Region'</span>)

<span class="hljs-keyword">for</span> i, region <span class="hljs-keyword">in</span> enumerate(regions):
    ax = axs[i // <span class="hljs-number">2</span>, i % <span class="hljs-number">2</span>]
    ax.plot(months, sales_data[i], marker=<span class="hljs-string">'o'</span>)
    ax.set_title(region)
    ax.set_xlabel(<span class="hljs-string">"Month"</span>)
    ax.set_ylabel(<span class="hljs-string">"Sales (Units)"</span>)

plt.tight_layout()
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727734447574/336f9425-183a-4035-8f14-6462d0e1c358.png" alt="multiple plot diagrams comparing sales trend" width="576" height="456" loading="lazy"></p>
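<p>When panels are meant to be compared, it can help to give them identical axis limits. The sketch below is one possible variation on the subplot grid: it passes <code>sharex=True</code> and <code>sharey=True</code> to <code>plt.subplots</code> and iterates over <code>axs.flat</code> instead of computing row and column indices. The seeded generator and the <code>Agg</code> backend line are assumptions added so the sketch runs without a display.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatable fake data
regions = ['North', 'South', 'East', 'West']
months = np.arange(1, 13)
sales_data = rng.integers(500, 5000, size=(len(regions), len(months)))

# sharex/sharey give every panel the same axis limits, so heights are comparable.
fig, axs = plt.subplots(2, 2, figsize=(10, 8), sharex=True, sharey=True)
fig.suptitle('Monthly Sales by Region')

# axs.flat walks the 2x2 grid row by row, replacing the i // 2, i % 2 arithmetic.
for ax, region, sales in zip(axs.flat, regions, sales_data):
    ax.plot(months, sales, marker='o')
    ax.set_title(region)

# Label only the outer edges, since inner tick labels are hidden on shared axes.
for ax in axs[-1, :]:
    ax.set_xlabel("Month")
for ax in axs[:, 0]:
    ax.set_ylabel("Sales (Units)")

fig.tight_layout()
# In an interactive session, plt.show() would display the figure here.
```

<p>Shared axes trade detail for comparability: a region with small sales looks flat next to a large one, which may or may not be what you want to emphasize.</p>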
<h3 id="heading-how-to-enhance-plot-aesthetics">How to Enhance Plot Aesthetics</h3>
<p>Matplotlib lets you control nearly every aspect of a plot's appearance, so you can make it both informative and aesthetically pleasing.</p>
<p>Here’s an example:</p>
<pre><code class="lang-python">plt.plot(days, temperature, color=<span class="hljs-string">'orange'</span>, marker=<span class="hljs-string">'x'</span>, linestyle=<span class="hljs-string">'-'</span>)
plt.title(<span class="hljs-string">"Daily Temperatures in August"</span>, fontsize=<span class="hljs-number">16</span>)
plt.xlabel(<span class="hljs-string">"Day"</span>, fontsize=<span class="hljs-number">12</span>)
plt.ylabel(<span class="hljs-string">"Temperature (°C)"</span>, fontsize=<span class="hljs-number">12</span>)
plt.grid(<span class="hljs-literal">True</span>)
plt.legend([<span class="hljs-string">'Temperature'</span>], loc=<span class="hljs-string">'upper right'</span>)
coldest = np.argmin(temperature)  <span class="hljs-comment"># index of the lowest simulated temperature</span>
plt.annotate(<span class="hljs-string">'Coldest Day'</span>, xy=(days[coldest], temperature[coldest]),
             xytext=(days[coldest] + <span class="hljs-number">2</span>, temperature[coldest] + <span class="hljs-number">3</span>),
             arrowprops=dict(facecolor=<span class="hljs-string">'black'</span>, arrowstyle=<span class="hljs-string">'-&gt;'</span>))
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727734492330/12638dc3-dd99-427f-ba1d-2ec59cadd03a.png" alt="Image showing an aesthetically pleasing plot" width="576" height="457" loading="lazy"></p>
<p>The code sets the line color, marker, and line style, adds a title and axis labels with explicit font sizes, turns on the grid, adds a legend, and annotates the coldest day with an arrow. Together, these changes make the plot clearer and more professional.</p>
<h3 id="heading-how-to-save-and-export-plots">How to Save and Export Plots</h3>
<p>Once you've created a plot, you might need to save it in a specific format for a report or presentation. Below is an example of how to save a plot as both a PNG and a PDF:</p>
<pre><code class="lang-python">plt.plot(days, temperature)
plt.title(<span class="hljs-string">"Daily Temperatures in August"</span>)
plt.xlabel(<span class="hljs-string">"Day"</span>)
plt.ylabel(<span class="hljs-string">"Temperature (°C)"</span>)

<span class="hljs-comment"># Saving the plot</span>
plt.savefig(<span class="hljs-string">"daily_temperatures_august.png"</span>, dpi=<span class="hljs-number">300</span>, bbox_inches=<span class="hljs-string">'tight'</span>)
plt.savefig(<span class="hljs-string">"daily_temperatures_august.pdf"</span>, format=<span class="hljs-string">'pdf'</span>, bbox_inches=<span class="hljs-string">'tight'</span>)
</code></pre>
<p>The <code>dpi</code> parameter controls the resolution of the saved plot, and <code>bbox_inches='tight'</code> trims extra whitespace around the figure.</p>
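<p>In a script that only writes files, such as a scheduled report job, you may not want a plot window at all. The following is a minimal sketch of that workflow, assuming the non-interactive <code>Agg</code> backend and a temporary output directory. The directory, the seed, and the 31-day range are illustrative assumptions.</p>

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: files are written without opening a window
import matplotlib.pyplot as plt
import numpy as np

days = np.arange(1, 32)
temperature = np.random.default_rng(1).normal(loc=25, scale=5, size=len(days))

fig, ax = plt.subplots()
ax.plot(days, temperature)
ax.set_title("Daily Temperatures in August")
ax.set_xlabel("Day")
ax.set_ylabel("Temperature (°C)")

# A temporary directory keeps the sketch from cluttering the working directory.
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "daily_temperatures_august.png")
pdf_path = os.path.join(out_dir, "daily_temperatures_august.pdf")

fig.savefig(png_path, dpi=300, bbox_inches='tight')  # raster: dpi sets the pixel density
fig.savefig(pdf_path, bbox_inches='tight')           # vector: format inferred from ".pdf"
plt.close(fig)  # release the figure's memory once it is on disk

print(os.path.getsize(png_path), os.path.getsize(pdf_path))
```

<p>Calling <code>savefig</code> on the figure object rather than on <code>plt</code> makes it explicit which figure is written, which matters once a script holds more than one figure.</p>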
<h2 id="heading-interactive-plotting-and-animation">Interactive Plotting and Animation</h2>
<h3 id="heading-interactive-features-in-matplotlib">Interactive Features in Matplotlib</h3>
<p>You can also make your plots interactive. For example, you might zoom in on a region of interest instead of viewing the entire plot, or update the plot in response to user input. The example below responds to mouse clicks:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

x = np.linspace(<span class="hljs-number">0</span>, <span class="hljs-number">10</span>, <span class="hljs-number">100</span>)
y = np.cos(x)

fig, ax = plt.subplots()
ax.plot(x, y)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_click</span>(<span class="hljs-params">event</span>):</span>
    <span class="hljs-comment"># This function is called when the plot is clicked</span>
    print(<span class="hljs-string">f"The Coordinates were clicked at: (<span class="hljs-subst">{event.xdata}</span>, <span class="hljs-subst">{event.ydata}</span>)"</span>)

fig.canvas.mpl_connect(<span class="hljs-string">'button_press_event'</span>, on_click)
plt.show()
</code></pre>
<p>The code plots a cosine wave and registers a click event handler named <code>on_click</code>. When you click anywhere on the plot, the handler prints the coordinates of the click to the Python console. If the click lands outside the axes, both coordinates are <code>None</code>.</p>
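<p>A click that misses the axes still triggers the handler, with <code>event.inaxes</code>, <code>event.xdata</code>, and <code>event.ydata</code> all set to <code>None</code>. The sketch below shows one way a handler might guard against that case. The helper name <code>describe_click</code> and the stand-in event objects are hypothetical, introduced only so the logic can be exercised without a GUI window.</p>

```python
from types import SimpleNamespace


def describe_click(event):
    """Return a message for a click event, or None if the click missed the axes."""
    # Matplotlib sets event.inaxes to None when the click lands outside every Axes;
    # in that case xdata and ydata are None as well.
    if event.inaxes is None:
        return None
    return f"Clicked at: ({event.xdata:.2f}, {event.ydata:.2f})"


def on_click(event):
    message = describe_click(event)
    if message is not None:
        print(message)


# With a real figure, the handler is registered the same way as before:
#   fig.canvas.mpl_connect('button_press_event', on_click)

# Stand-in objects that mimic the three attributes the handler reads.
inside = SimpleNamespace(inaxes=object(), xdata=2.5, ydata=-0.8)
outside = SimpleNamespace(inaxes=None, xdata=None, ydata=None)
print(describe_click(inside))
print(describe_click(outside))
```

<p>Separating the formatting logic from the printing keeps the handler easy to test, since the formatting function needs only an object with the right attributes.</p>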
<h3 id="heading-how-to-create-animations">How to Create Animations</h3>
<p>Animations are handy for showing how things evolve over time, such as the movement of a stock price or the progression of a disease. The example below reuses <code>x</code> and <code>y</code> from the previous code block:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.animation <span class="hljs-keyword">as</span> animation

fig, ax = plt.subplots()
line, = ax.plot(x, y)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">update</span>(<span class="hljs-params">frame</span>):</span>
    line.set_ydata(np.cos(x + frame / <span class="hljs-number">10</span>))
    <span class="hljs-keyword">return</span> line,

ani = animation.FuncAnimation(fig, update, frames=range(<span class="hljs-number">100</span>), blit=<span class="hljs-literal">True</span>)
plt.show()
</code></pre>
<p>The code animates a cosine wave that appears to travel from right to left over time. Animations like this are useful when the data is best understood as change over time.</p>
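<p>An animation shown with <code>plt.show()</code> disappears when the window closes. If you want to share it, one option is to write it to a GIF. The sketch below does so with <code>PillowWriter</code>, assuming the Pillow package is installed. The output path, frame count, and frame rate are arbitrary illustrative choices.</p>

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render frames without a window
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
line, = ax.plot(x, np.cos(x))


def update(frame):
    # Shift the phase a little on every frame so the wave appears to travel.
    line.set_ydata(np.cos(x + frame / 10))
    return line,


# Keep a reference to the animation object; without one it can be garbage-collected.
ani = animation.FuncAnimation(fig, update, frames=range(20), blit=True)

gif_path = os.path.join(tempfile.mkdtemp(), "cosine_wave.gif")
ani.save(gif_path, writer=animation.PillowWriter(fps=10))
plt.close(fig)

print(os.path.getsize(gif_path))
```

<p>Fewer frames and a lower frame rate keep the file small. For longer animations, a video format written through a different writer may be a better fit.</p>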
<h2 id="heading-how-to-optimize-plots-for-large-datasets">How to Optimize Plots for Large Datasets</h2>
<p>With large datasets, performance becomes a real concern: plotting a large number of points is often slow and memory-intensive. Here are some techniques for keeping your plots responsive.</p>
<h3 id="heading-efficient-plotting-techniques-for-large-datasets">Efficient Plotting Techniques for Large Datasets</h3>
<h4 id="heading-downsampling">Downsampling</h4>
<p>Downsampling means plotting only a subset of the points in the original dataset.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Generate large dataset</span>
x_huge = np.linspace(<span class="hljs-number">0</span>, <span class="hljs-number">100</span>, <span class="hljs-number">10000</span>)
y_huge = np.sin(x_huge) + np.random.normal(<span class="hljs-number">0</span>, <span class="hljs-number">0.1</span>, size=x_huge.shape)

<span class="hljs-comment"># Downsample the data</span>
x_downsampled = x_huge[::<span class="hljs-number">10</span>]
y_downsampled = y_huge[::<span class="hljs-number">10</span>]

plt.plot(x_downsampled, y_downsampled)
plt.title(<span class="hljs-string">"Downsampled Plot"</span>)
plt.xlabel(<span class="hljs-string">"X"</span>)
plt.ylabel(<span class="hljs-string">"Y"</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727734654600/041f79ce-0519-4d6c-830f-099b0be2ac4f.png" alt="A downsampled plot image" width="576" height="456" loading="lazy"></p>
<p>Here, we keep every 10th point, which cuts the number of plotted points from 10,000 to 1,000. This reduces the rendering load while preserving the general structure of the data.</p>
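<p>One limitation of keeping every 10th point is that a short spike can fall between the kept samples and vanish from the plot. A minimal sketch of one alternative, min-max decimation, follows: it keeps the lowest and highest value in each chunk of points. The function name <code>minmax_downsample</code> and the injected spike are illustrative assumptions, not part of the original example.</p>

```python
import numpy as np


def minmax_downsample(x, y, chunk=10):
    """Keep the minimum and maximum y of every `chunk` consecutive points."""
    n = (len(y) // chunk) * chunk  # drop the incomplete trailing chunk, if any
    x_chunks = x[:n].reshape(-1, chunk)
    y_chunks = y[:n].reshape(-1, chunk)

    lo = y_chunks.argmin(axis=1)
    hi = y_chunks.argmax(axis=1)

    # Order the two picks by position so the x values stay in increasing order.
    first, second = np.minimum(lo, hi), np.maximum(lo, hi)
    cols = np.stack([first, second], axis=1).ravel()
    rows = np.repeat(np.arange(len(y_chunks)), 2)
    return x_chunks[rows, cols], y_chunks[rows, cols]


x_huge = np.linspace(0, 100, 10000)
y_huge = np.sin(x_huge)
y_huge[5003] = 5.0  # a single artificial spike that y_huge[::10] skips over

x_small, y_small = minmax_downsample(x_huge, y_huge, chunk=10)
print(len(y_small), y_small.max(), y_huge[::10].max())
```

<p>This variant keeps two points per chunk, so a chunk size of 10 reduces the data fivefold rather than tenfold. The extra points are the price of keeping extremes visible.</p>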
<h4 id="heading-data-aggregation">Data Aggregation</h4>
<p>Data aggregation groups numerical data into bins and summarizes each bin with a single value, such as the mean.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Generate large dataset</span>
x_huge = np.linspace(<span class="hljs-number">0</span>, <span class="hljs-number">100</span>, <span class="hljs-number">10000</span>)
y_huge = np.sin(x_huge) + np.random.normal(<span class="hljs-number">0</span>, <span class="hljs-number">0.1</span>, size=x_huge.shape)

<span class="hljs-comment"># Aggregate the data into bins</span>
bins = np.linspace(<span class="hljs-number">0</span>, <span class="hljs-number">100</span>, <span class="hljs-number">100</span>)
y_aggregated = [np.mean(y_huge[(x_huge &gt;= bins[i]) &amp; (x_huge &lt; bins[i+<span class="hljs-number">1</span>])]) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(bins)<span class="hljs-number">-1</span>)]

plt.plot(bins[:<span class="hljs-number">-1</span>], y_aggregated)
plt.title(<span class="hljs-string">"Aggregated Plot"</span>)
plt.xlabel(<span class="hljs-string">"X"</span>)
plt.ylabel(<span class="hljs-string">"Average Y"</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727734688145/696c2a72-64d8-4dd1-97ad-217979f91707.png" alt="An aggregated plot image" width="576" height="446" loading="lazy"></p>
<p>This process reduces the number of data points needed to represent the data distribution, making the plot easier to read and interpret while still capturing the overall trend of the original data.</p>
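<p>The list comprehension above scans the whole array once per bin, which can become slow as the number of bins grows. The sketch below shows one vectorized alternative based on <code>np.histogram</code> with the <code>weights</code> argument. It differs from the original in two labeled ways: it uses 101 edges to get exactly 100 bins, and it plots against bin centers rather than left edges. The seed is an arbitrary assumption.</p>

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatable noise
x_huge = np.linspace(0, 100, 10000)
y_huge = np.sin(x_huge) + rng.normal(0, 0.1, size=x_huge.shape)

edges = np.linspace(0, 100, 101)  # 101 edges -> 100 bins of width 1

# Sum of y per bin divided by the number of points per bin gives the mean per bin.
sums, _ = np.histogram(x_huge, bins=edges, weights=y_huge)
counts, _ = np.histogram(x_huge, bins=edges)
y_mean = sums / counts  # safe here because every bin contains points

centers = (edges[:-1] + edges[1:]) / 2  # midpoints of the bins, for the x-axis

print(len(centers), len(y_mean), counts.min())
```

<p>Plotting <code>centers</code> against <code>y_mean</code> gives the aggregated curve. If some bins could be empty, the division would need a guard, for example by masking bins where <code>counts</code> is zero.</p>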
<h3 id="heading-statistical-data-visualization">Statistical Data Visualization</h3>
<p>Statistical plots are useful for summarizing and understanding large datasets, some of which include the following:</p>
<h4 id="heading-box-plots">Box Plots</h4>
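<p>A box plot summarizes a distribution with five numbers: minimum, first quartile, median, third quartile, and maximum. By default, Matplotlib extends the whiskers only to the most extreme points within 1.5 times the interquartile range of the box and draws anything beyond them as individual outlier points.</p>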
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Generate random data</span>
data = np.random.randn(<span class="hljs-number">1000</span>)
plt.boxplot(data)
plt.title(<span class="hljs-string">"Box Plot"</span>)
plt.ylabel(<span class="hljs-string">"Values"</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727734716151/ddfa3387-8aa7-47ba-af4b-31faef782c32.png" alt="A box plot representation" width="576" height="443" loading="lazy"></p>
<p>Box plots are especially useful for spotting outliers and for comparing the spread and symmetry of several variables.</p>
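<p>To see where the box, whiskers, and outlier points come from, it can help to compute the same quantities by hand. The sketch below does this with <code>np.percentile</code>, following the 1.5 × IQR whisker rule that <code>plt.boxplot</code> applies by default. The seed and the variable names are illustrative assumptions.</p>

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatable data
data = rng.standard_normal(1000)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Default whisker rule (whis=1.5): whiskers reach the most extreme data points
# that lie within 1.5 * IQR of the box; anything beyond is drawn as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1={q1:.2f}  median={median:.2f}  Q3={q3:.2f}")
print(f"whiskers: {whisker_low:.2f} to {whisker_high:.2f}")
print(f"outliers: {len(outliers)}")
```

<p>Passing a different <code>whis</code> value to <code>plt.boxplot</code> changes the whisker rule, so the outlier count depends on that choice as well as on the data.</p>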
<h4 id="heading-violin-plot">Violin Plot</h4>
<p>A violin plot combines a box plot with a density plot to show the shape of a variable's distribution in more detail.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Generate random data</span>
data = np.random.randn(<span class="hljs-number">1000</span>)
plt.violinplot(data)
plt.title(<span class="hljs-string">"Violin Plot"</span>)
plt.ylabel(<span class="hljs-string">"Values"</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727734737336/4b2e08a7-68ab-4060-ab6b-c925eb4e38e4.png" alt="A violin plot representation" width="576" height="445" loading="lazy"></p>
<p>Violin plots are useful when you need to show the full shape of a distribution, such as multiple peaks, rather than just its summary statistics.</p>
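<p>The difference between the two plot types is easiest to see on data with more than one peak. The sketch below draws a box plot and a violin plot of the same sample side by side. The bimodal sample, built from two normal clusters, is made up for illustration, and the <code>Agg</code> backend line is an assumption so the sketch runs without a display.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatable data
# Made-up bimodal sample: two normal clusters centered at -2 and +2.
bimodal = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)

ax_box.boxplot(bimodal)
ax_box.set_title("Box Plot")
ax_box.set_ylabel("Values")

# violinplot returns a dict of the drawn artists; showmedians adds a median line.
parts = ax_violin.violinplot(bimodal, showmedians=True)
ax_violin.set_title("Violin Plot")

fig.tight_layout()
# In an interactive session, plt.show() would display the figure here.
```

<p>The box plot shows a single wide box around zero, where few values actually lie, while the violin's outline narrows in the middle and widens around each cluster.</p>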
<h2 id="heading-common-visualization-pitfalls-and-how-to-avoid-them">Common Visualization Pitfalls and How to Avoid Them</h2>
<h3 id="heading-overplotting">Overplotting</h3>
<p>Overplotting occurs when many observations are drawn on top of one another, which clutters the figure and obscures individual points and patterns. It is particularly common in scatter plots and line plots of large datasets.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Generate large dataset</span>
x = np.random.rand(<span class="hljs-number">10000</span>)
y = np.random.rand(<span class="hljs-number">10000</span>)

<span class="hljs-comment"># Plot without transparency (over-plotting)</span>
plt.scatter(x, y)
plt.title(<span class="hljs-string">"Scatter Plot with Over-plotting"</span>)
plt.xlabel(<span class="hljs-string">"X"</span>)
plt.ylabel(<span class="hljs-string">"Y"</span>)
plt.show()

<span class="hljs-comment"># Plot with transparency to reduce over-plotting</span>
plt.scatter(x, y, alpha=<span class="hljs-number">0.1</span>)  <span class="hljs-comment"># Set alpha for transparency</span>
plt.title(<span class="hljs-string">"Scatter Plot with Reduced Over-plotting"</span>)
plt.xlabel(<span class="hljs-string">"X"</span>)
plt.ylabel(<span class="hljs-string">"Y"</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727734768136/4de79ed6-6f57-45d3-909b-e2b547a26232.png" alt="An image of over-plotting and reduced over-plotting" width="363" height="576" loading="lazy"></p>
<p>In the first plot, without transparency, the data points overlap significantly, making it hard to identify any patterns or density areas. In the second plot, transparency (<code>alpha=0.1</code>) is applied to the data points, allowing denser regions to become more apparent while reducing clutter. This technique makes it easier to interpret the plot's data distribution.</p>
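<p>Transparency helps up to a point, but with very dense data even faint markers saturate. Another option is to count points per cell and color by count. The sketch below uses <code>hist2d</code> for this. The bin count of 50 and the seed are arbitrary assumptions, and the <code>Agg</code> backend line is there so the sketch runs without a display.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatable data
x = rng.random(10000)
y = rng.random(10000)

fig, ax = plt.subplots()
# hist2d splits the plane into a 50 x 50 grid and counts the points in each cell.
counts, xedges, yedges, image = ax.hist2d(x, y, bins=50)
fig.colorbar(image, ax=ax, label="Points per bin")
ax.set_title("2D Histogram of the Same Data")
ax.set_xlabel("X")
ax.set_ylabel("Y")
# In an interactive session, plt.show() would display the figure here.

print(counts.shape, int(counts.sum()))
```

<p>Unlike a scatter plot, the rendering cost here depends on the number of bins rather than the number of points, so it also scales well to much larger datasets.</p>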
<h3 id="heading-misleading-scales-and-axes">Misleading Scales and Axes</h3>
<p>The choice of scales and axis limits can change how a plot is perceived. Misleading scales distort the picture an analyst gets of the data and can lead to incorrect conclusions.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Generate data</span>
x = np.arange(<span class="hljs-number">10</span>)
y1 = np.random.randint(<span class="hljs-number">50</span>, <span class="hljs-number">100</span>, size=<span class="hljs-number">10</span>)
y2 = y1 + np.random.randint(<span class="hljs-number">-5</span>, <span class="hljs-number">5</span>, size=<span class="hljs-number">10</span>)

<span class="hljs-comment"># Plot with truncated y-axis</span>
plt.plot(x, y1, label=<span class="hljs-string">'Data 1'</span>)
plt.plot(x, y2, label=<span class="hljs-string">'Data 2'</span>)
plt.ylim(<span class="hljs-number">90</span>, <span class="hljs-number">100</span>)  <span class="hljs-comment"># Truncated y-axis</span>
plt.title(<span class="hljs-string">"Plot with Truncated Y-Axis"</span>)
plt.xlabel(<span class="hljs-string">"X"</span>)
plt.ylabel(<span class="hljs-string">"Y"</span>)
plt.legend()
plt.show()

<span class="hljs-comment"># Plot with full y-axis</span>
plt.plot(x, y1, label=<span class="hljs-string">'Data 1'</span>)
plt.plot(x, y2, label=<span class="hljs-string">'Data 2'</span>)
plt.title(<span class="hljs-string">"Plot with Full Y-Axis"</span>)
plt.xlabel(<span class="hljs-string">"X"</span>)
plt.ylabel(<span class="hljs-string">"Y"</span>)
plt.legend()
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727734992341/e257fe37-1b47-4ed4-82fd-5c2e9a458202.png" alt="Truncated Y-axis vs Full Y-axis" width="364" height="576" loading="lazy"></p>
<p>In the first plot, the y-axis is truncated to the range 90 to 100, so any values below 90 fall outside the visible area and small differences between the two series look much larger than they are. The second plot lets Matplotlib choose limits that cover all of the data, providing a more accurate representation.</p>
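<p>To see why a truncated axis exaggerates differences, it can help to compare how tall two values are drawn relative to the bottom of the axis. The sketch below is a simple illustration under assumed numbers: the values 95 and 92 and the helper name <code>apparent_ratio</code> are hypothetical and not taken from the plot above.</p>

```python
def apparent_ratio(a, b, axis_min=0.0):
    """Ratio of the drawn heights of a and b when the axis starts at axis_min."""
    return (a - axis_min) / (b - axis_min)


# Hypothetical values chosen only for illustration
a, b = 95, 92

true_ratio = apparent_ratio(a, b)           # axis starts at 0
truncated_ratio = apparent_ratio(a, b, 90)  # axis starts at 90

print(f"Full axis:      {true_ratio:.2f}x")
print(f"Truncated axis: {truncated_ratio:.2f}x")
```

<p>With the axis starting at 0, the two values look about 3% apart. With the axis starting at 90, the first value is drawn 2.5 times as high as the second, even though the data has not changed.</p>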
<h3 id="heading-color-misuse">Color Misuse</h3>
<p>Color is one of the most frequently misused elements in data visualization. Common problems include low contrast, colors that people with color blindness cannot tell apart, and color differences that suggest an importance the data does not have.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Generate data</span>
x = np.linspace(<span class="hljs-number">0</span>, <span class="hljs-number">10</span>, <span class="hljs-number">100</span>)
y1 = np.sin(x)
y2 = np.cos(x)

<span class="hljs-comment"># Plot with non-colorblind-friendly palette</span>
plt.plot(x, y1, color=<span class="hljs-string">'red'</span>, label=<span class="hljs-string">'sin(x)'</span>)
plt.plot(x, y2, color=<span class="hljs-string">'green'</span>, label=<span class="hljs-string">'cos(x)'</span>)
plt.title(<span class="hljs-string">"Plot with Non-Colorblind-Friendly Colors"</span>)
plt.xlabel(<span class="hljs-string">"X"</span>)
plt.ylabel(<span class="hljs-string">"Y"</span>)
plt.legend()
plt.show()

<span class="hljs-comment"># Plot with colorblind-friendly palette</span>
plt.plot(x, y1, color=<span class="hljs-string">'#0072B2'</span>, label=<span class="hljs-string">'sin(x)'</span>)  <span class="hljs-comment"># Blue</span>
plt.plot(x, y2, color=<span class="hljs-string">'#D55E00'</span>, label=<span class="hljs-string">'cos(x)'</span>)  <span class="hljs-comment"># Vermillion (reddish orange)</span>
plt.title(<span class="hljs-string">"Plot with Colorblind-Friendly Colors"</span>)
plt.xlabel(<span class="hljs-string">"X"</span>)
plt.ylabel(<span class="hljs-string">"Y"</span>)
plt.legend()
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727735130128/9dc47d8a-9d2a-4982-9185-29e76014f9c5.png" alt="image highlighting color importance" width="376" height="576" loading="lazy"></p>
<p>The first plot uses red and green, a combination that is notoriously difficult for people with red-green color blindness to tell apart. The second plot uses a colorblind-friendly palette so that everyone can distinguish the two lines.</p>
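<p>Color does not have to carry the distinction alone. As a small sketch (an addition to the example above, not part of it), the same two curves can also differ in line style, so they remain distinguishable in grayscale or for readers who cannot separate the hues. The variable names <code>sin_line</code> and <code>cos_line</code> are illustrative.</p>

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots()
# Solid vs. dashed lines stay distinguishable even without color
sin_line, = ax.plot(x, np.sin(x), color="#0072B2", linestyle="-", label="sin(x)")
cos_line, = ax.plot(x, np.cos(x), color="#D55E00", linestyle="--", label="cos(x)")
ax.set_title("Color Plus Line Style")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.legend()
```

<p>Call <code>plt.show()</code> to display the figure. The legend picks up both the color and the line style, so either cue is enough to match a curve to its label.</p>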
<h3 id="heading-misleading-use-of-3d-plots">Misleading Use of 3D Plots</h3>
<p>3D plots can be visually appealing, but they often add unnecessary complexity and can mislead if not used appropriately. They are most effective when the third dimension genuinely adds value, such as when displaying multivariate data. Even then, perspective makes it hard to compare values accurately.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">from</span> mpl_toolkits.mplot3d <span class="hljs-keyword">import</span> Axes3D
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Generate data</span>
x = np.linspace(<span class="hljs-number">-5</span>, <span class="hljs-number">5</span>, <span class="hljs-number">100</span>)
y = np.linspace(<span class="hljs-number">-5</span>, <span class="hljs-number">5</span>, <span class="hljs-number">100</span>)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**<span class="hljs-number">2</span> + Y**<span class="hljs-number">2</span>))

<span class="hljs-comment"># 3D plot</span>
fig = plt.figure()
ax = fig.add_subplot(<span class="hljs-number">111</span>, projection=<span class="hljs-string">'3d'</span>)
ax.plot_surface(X, Y, Z, cmap=<span class="hljs-string">'viridis'</span>)
plt.title(<span class="hljs-string">"3D Plot"</span>)
plt.show()

<span class="hljs-comment"># 2D contour plot</span>
plt.contourf(X, Y, Z, cmap=<span class="hljs-string">'viridis'</span>)
plt.colorbar(label=<span class="hljs-string">'Z value'</span>)
plt.title(<span class="hljs-string">"2D Contour Plot"</span>)
plt.xlabel(<span class="hljs-string">"X"</span>)
plt.ylabel(<span class="hljs-string">"Y"</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727735152366/965da133-cfb7-4db4-bbbc-c98d7271bd82.png" alt="3D plot vs 2D contour plot" width="388" height="576" loading="lazy"></p>
<p>The 3D plot shows the surface in three dimensions, but perspective makes it hard to judge the exact height differences between regions. The 2D contour plot instead encodes the Z values as color, which makes it easier to compare areas accurately. In many cases, a 2D plot is the clearer representation.</p>
<h3 id="heading-misleading-use-of-area-charts">Misleading Use of Area Charts</h3>
<p>Area charts can effectively show trends over time or how a whole divides into parts. However, they become confusing when areas overlap or when it is unclear whether the series are stacked.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Generate data</span>
x = np.arange(<span class="hljs-number">0</span>, <span class="hljs-number">10</span>, <span class="hljs-number">1</span>)
y1 = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">9</span>, <span class="hljs-number">10</span>])
y2 = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">3</span>, <span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>, <span class="hljs-number">5</span>, <span class="hljs-number">7</span>, <span class="hljs-number">6</span>, <span class="hljs-number">8</span>])

<span class="hljs-comment"># Overlapping area chart (potentially misleading)</span>
plt.fill_between(x, y1, color=<span class="hljs-string">'skyblue'</span>, alpha=<span class="hljs-number">0.5</span>)
plt.fill_between(x, y2, color=<span class="hljs-string">'orange'</span>, alpha=<span class="hljs-number">0.5</span>)
plt.title(<span class="hljs-string">"Misleading Stacked Area Chart"</span>)
plt.xlabel(<span class="hljs-string">"X"</span>)
plt.ylabel(<span class="hljs-string">"Y"</span>)
plt.show()

<span class="hljs-comment"># Improved area chart with non-overlapping areas</span>
plt.fill_between(x, y1, color=<span class="hljs-string">'skyblue'</span>, alpha=<span class="hljs-number">0.5</span>)
plt.fill_between(x, y1 + y2, y1, color=<span class="hljs-string">'orange'</span>, alpha=<span class="hljs-number">0.5</span>)
plt.title(<span class="hljs-string">"Improved Stacked Area Chart"</span>)
plt.xlabel(<span class="hljs-string">"X"</span>)
plt.ylabel(<span class="hljs-string">"Y"</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727735170489/d5e4bea1-713f-4e75-9cd7-b8e3a15f6331.png" alt="A representation of use of area charts" width="372" height="576" loading="lazy"></p>
<p>In the first area chart, the areas overlap, which can create confusion about the contribution of each category to the whole. The second plot improves clarity by stacking the areas on top of each other without overlap, clearly showing the cumulative nature of the data.</p>
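<p>Matplotlib also provides <code>stackplot</code>, which computes the cumulative baselines for you instead of requiring <code>y1 + y2</code> to be written by hand. The following is a minimal sketch using the same arrays as above. The labels "Series 1" and "Series 2" are placeholders, since the original example does not name the series.</p>

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10, 1)
y1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y2 = np.array([1, 3, 2, 5, 4, 6, 5, 7, 6, 8])

fig, ax = plt.subplots()
# stackplot draws each series on top of the running total of the previous ones
layers = ax.stackplot(
    x, y1, y2,
    labels=["Series 1", "Series 2"],  # placeholder names
    colors=["skyblue", "orange"],
    alpha=0.5,
)
ax.set_title("Stacked Area Chart with stackplot")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.legend(loc="upper left")
```

<p>Call <code>plt.show()</code> to display the figure. Adding a legend also addresses the other source of confusion mentioned above, because it tells the reader which band belongs to which series.</p>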
<h2 id="heading-conclusion">Conclusion</h2>
<p>Matplotlib offers many features for solving specific visualization problems in data analysis. You can use it for line plots, complex or large datasets, animated plots, and much more.</p>
<p>In this guide, we explored the important aspects of Matplotlib and connected them to real problems you may face in your day-to-day programming work.</p>
<p>We also included detailed examples to support these applications. Whether you work with data as a data scientist, engineer, or analyst, Matplotlib helps you tell your data’s story effectively.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Data Visualization with Matplotlib – a Step by Step Guide ]]>
                </title>
                <description>
                    <![CDATA[ SEE is a beautiful Apple TV series that depicts a dystopia where humans have lost their sight. Hundreds of years later, it was considered a myth that people could ever see.  Jason Momoa is one of the leads and plays Baba Voss, an elite warrior. Jason... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/introduction-to-data-vizualization-using-matplotlib/</link>
                <guid isPermaLink="false">66c71f7455df43f1418b5b21</guid>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Matplotlib ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mene-Ejegi Ogbemi ]]>
                </dc:creator>
                <pubDate>Mon, 24 Apr 2023 18:32:32 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/04/data-visualization-1.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p><a target="_blank" href="https://tv.apple.com/us/show/see/umc.cmc.3s4mgg2y7h95fks9gnc4pw13m">SEE</a> is a beautiful Apple TV series that depicts a dystopia where humans have lost their sight. Hundreds of years later, it was considered a myth that people could ever see. </p>
<p>Jason Momoa is one of the leads and plays Baba Voss, an elite warrior. Baba Voss's wife gives birth to sighted twins, and years later, during battle, he sometimes needs the aid of his sighted children. Even with his battlefield mastery, they help him understand the terrain better. We could say his children help him visualize things.</p>
<p>In ancient times, before digital devices, data visualization was also a myth. Even so, early humans understood the need for visualization, so they created resources like maps, hieroglyphs, and rock art. Eyewitnesses typically drew their paths and other relevant information on stones, wood, or scrolls. </p>
<p>Like Baba Voss's kids, these resources made it easier for people to gain a visual perspective on things and environments. </p>
<p>So what does visualization actually mean in this context? We can define visualization as "any technique for creating images, diagrams, or animations to communicate a message." (<a target="_blank" href="https://en.wikipedia.org/wiki/Visualization_(graphics)">source</a>)</p>
<p>In this article, we'll explore what data visualization is and how you can use the data visualization tool Matplotlib to explore and analyze data. You'll learn how to use it to create charts that help business owners and stakeholders get more insight about data and make informed decisions.</p>
<h2 id="heading-what-is-data-visualization">What is Data Visualization?</h2>
<p>Data visualization refers to the integration of data and visual elements like images, charts, diagrams, and so on to communicate messages to different stakeholders. </p>
<p>These stakeholders can be users, team members, managers, or top executive members of an organization. </p>
<p>Data in this context refers to the inputs gathered from the organization's own database or obtained from external sources, such as public databases or private organizations that provide access through their APIs.</p>
<p>We'll work with an employee layoff dataset, which contains details of employees who were laid off in different industries from 2020 to 2022. The columns in the dataset include the company name, location, industry, total laid off, percentage laid off, date, country, and other relevant fields. </p>
<p>Below is a snapshot of the data frame:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/layoff-table.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-what-is-matplotlib">What is Matplotlib?</h2>
<p>Matplotlib is a popular Python library for displaying data and creating static, animated, and interactive plots. It lets you draw appealing and informative graphics like line plots, scatter plots, histograms, and bar charts. </p>
<p>Matplotlib is highly customizable and flexible, which makes it a preferred choice for data analysts and scientists working in fields such as finance, science, engineering, and social sciences. </p>
<p>In this article, I'll show you how to create a bar chart, a pie chart, and a line plot to explain how you can do data visualization using Matplotlib.</p>
<p>The first step is to import Matplotlib and the other libraries we need, such as pandas and NumPy, along with the relevant submodules.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Import packages</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> matplotlib.dates <span class="hljs-keyword">as</span> mdates
<span class="hljs-keyword">from</span> matplotlib.ticker <span class="hljs-keyword">import</span> MaxNLocator
</code></pre>
<p>In the code above, we import the pandas package, which we'll use to analyze and manipulate our data. We also import Matplotlib's <code>pyplot</code> module for data visualization. </p>
<p>We'll use the NumPy package, imported in the third line, for numerical computations. The <code>matplotlib.dates</code> module handles date formatting when plotting our chart, and the last import, <code>MaxNLocator</code> from the <code>ticker</code> module, controls how many ticks appear on plot axes. With these modules, you can analyze, manipulate, compute, and visualize your data.</p>
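<p>The examples that follow assume the dataset has already been loaded into a DataFrame named <code>df_layoffs</code>. The article does not show that step, so the sketch below builds a tiny placeholder with the column names the later code relies on (<code>company</code>, <code>industry</code>, <code>total_laid_off</code>, and <code>date</code>). All of the values are invented for illustration. With the real dataset, you would typically load your own copy with <code>pd.read_csv</code> instead.</p>

```python
import pandas as pd

# Placeholder rows with invented values, used only so the later snippets can run.
# The column names match the ones the article's code relies on.
df_layoffs = pd.DataFrame({
    "company": ["Company A", "Company B", "Company A", "Company C"],
    "industry": ["Retail", "Finance", "Retail", "Food"],
    "total_laid_off": [120, 80, 40, 60],
    "date": ["2020-04-01", "2022-06-15", "2022-11-03", "2022-11-03"],
})

# Quick checks before plotting: column types and missing values
print(df_layoffs.dtypes)
print(df_layoffs.isna().sum())
```

<p>Checking the column types early is useful here because the <code>date</code> column starts out as plain text and has to be converted before it can be filtered or plotted as a date.</p>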
<h2 id="heading-how-to-create-a-bar-chart">How to Create a Bar Chart</h2>
<p>Bar charts are well suited to categorical data: if you want to compare a quantity across different entities, a bar chart is an excellent way to visualize it. In the layoff dataset, we'll compare companies by the number of staff they laid off.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Total layoffs per company, keeping the ten largest</span>
plt.figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>))
company_val = df_layoffs.groupby(<span class="hljs-string">'company'</span>)[<span class="hljs-string">'total_laid_off'</span>].sum().sort_values(ascending=<span class="hljs-literal">False</span>).head(<span class="hljs-number">10</span>)
company_val.plot(label=<span class="hljs-string">""</span>, kind=<span class="hljs-string">'bar'</span>)
plt.show()
</code></pre>
<p>The code above is one way to create a bar chart. It shows the top 10 companies with the highest number of layoffs. </p>
<p>We first set the size of the figure to 8 inches by 6 inches. Then, we group the data by company and sum the number of employees each company laid off. We sort the totals in descending order and select the 10 companies with the highest layoffs. Finally, we create our bar chart from the selected data. The last line (<code>plt.show()</code>) displays the chart, which is shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/04/layoff-bar-chart.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>From the chart above, you will notice that Meta and Amazon laid off the most staff, while Twitter had the fewest layoffs among these ten companies.</p>
<h3 id="heading-how-to-create-a-pie-chart">How to Create a Pie Chart</h3>
<p>A pie chart represents a whole, with each slice sized according to the share of one part. The industry column is a good fit for a pie chart. We'll see which industries had the most and the fewest layoffs.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Group the data by industry and sum the total laid off employees</span>
industry_val = df_layoffs.groupby(<span class="hljs-string">'industry'</span>)[<span class="hljs-string">'total_laid_off'</span>].sum().sort_values(ascending=<span class="hljs-literal">False</span>).head()

<span class="hljs-comment"># create the pie chart and display the labels and values inside the pie</span>
plt.figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>))
plt.pie(industry_val, labels=industry_val.index, autopct=<span class="hljs-string">'%1.1f%%'</span>)
plt.title(<span class="hljs-string">'Laid Off Employees by Industry'</span>)
plt.show()
</code></pre>
<p>First, the code groups the data by industry and sums up the total number of laid-off employees for each industry. It then sorts the industries in descending order based on the total number of laid-off employees and selects the top five using the <code>head()</code> method, which returns five rows by default.</p>
<p>Next, we create a pie chart to visualize the data. The size of each slice in the pie represents the proportion of laid-off employees in that industry. The pie chart labels show the names of the industries. The percentage values inside the slices show the proportion of laid-off employees in that industry. The chart is titled "Laid Off Employees by Industry."</p>
<p>Finally, the pie chart is displayed using the <code>plt.show()</code> function. Like we did in the bar chart, the <code>plt.figure(figsize=(8, 6))</code> function sets the chart size to be 8 inches wide and 6 inches tall.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/04/layoff-industry.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The chart above shows how layoffs are distributed across the five most affected industries. The transportation and consumer sectors were affected the most, followed by the retail, finance, and food industries.</p>
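<p>Because <code>head()</code> keeps only the top five industries, the percentages in the pie are shares of those five, not of all layoffs. One way to keep the whole in view is to fold the remaining industries into a single "Other" slice. The sketch below assumes a Series of totals per industry. The helper name <code>top_n_with_other</code> and the totals are invented for illustration.</p>

```python
import pandas as pd


def top_n_with_other(totals, n=5):
    """Keep the n largest entries and sum the rest into an 'Other' entry."""
    ordered = totals.sort_values(ascending=False)
    top = ordered.head(n).copy()
    rest = ordered.iloc[n:].sum()
    if rest != 0:
        top["Other"] = rest
    return top


# Invented totals per industry, for illustration only
totals = pd.Series({"A": 50, "B": 30, "C": 10, "D": 6, "E": 4})
slices = top_n_with_other(totals, n=3)
print(slices)
```

<p>The result can be passed to <code>plt.pie(slices, labels=slices.index, autopct='%1.1f%%')</code> in the same way as before, and the percentages then add up to the full dataset.</p>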
<h3 id="heading-how-to-create-a-line-chart">How to Create a Line Chart</h3>
<p>Line charts show changes over time for an entity. With our dataset, a line chart could be used to show the trend of layoffs over the past year or two. This depends on what you are trying to communicate, but we'll work with a one-year analysis.</p>
<pre><code class="lang-python"><span class="hljs-comment"># convert date column to datetime object</span>
df_layoffs[<span class="hljs-string">'date'</span>] = pd.to_datetime(df_layoffs[<span class="hljs-string">'date'</span>])

<span class="hljs-comment"># select data for one-year duration starting from January 1st, 2022</span>
start_date = pd.Timestamp(<span class="hljs-string">'2022-01-01'</span>)
end_date = start_date + pd.DateOffset(years=<span class="hljs-number">1</span>)
df_one_year = df_layoffs.loc[(df_layoffs[<span class="hljs-string">'date'</span>] &gt;= start_date) &amp; (df_layoffs[<span class="hljs-string">'date'</span>] &lt; end_date)]

<span class="hljs-comment"># plot the selected data</span>
df_date = df_one_year.groupby(<span class="hljs-string">'date'</span>)[<span class="hljs-string">'total_laid_off'</span>].sum()
plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">4</span>))
plt.plot(df_date.index, df_date.values)
plt.xlabel(<span class="hljs-string">'Date'</span>)
plt.ylabel(<span class="hljs-string">'Total Laid Off'</span>)
plt.title(<span class="hljs-string">'Laid Off Trend for 2022'</span>)
plt.xticks(rotation=<span class="hljs-number">45</span>)
<span class="hljs-comment"># set the format of the x-axis labels to show Month-Year</span>
date_fmt = mdates.DateFormatter(<span class="hljs-string">'%b-%Y'</span>)
plt.gca().xaxis.set_major_formatter(date_fmt)

<span class="hljs-comment"># Use MaxNLocator to reduce the number of xticks</span>
locator = MaxNLocator(nbins=<span class="hljs-number">10</span>)
plt.gca().xaxis.set_major_locator(locator)

plt.show()
</code></pre>
<p>Compared to the bar and pie charts, this code has more steps, so let's walk through it piece by piece.</p>
<p>The first line converts the 'date' column of the DataFrame (<code>df_layoffs</code>) to pandas datetime values so that the dates are easy to filter and plot.</p>
<pre><code class="lang-python"><span class="hljs-comment"># convert date column to datetime object</span>
df_layoffs[<span class="hljs-string">'date'</span>] = pd.to_datetime(df_layoffs[<span class="hljs-string">'date'</span>])
</code></pre>
<p>Next, we select the data for a one-year period starting on January 1st, 2022. The start date is defined as a <code>Timestamp</code> object, and the end date is set to one year after the start date using <code>pd.DateOffset</code>. The <code>loc</code> indexer then filters the DataFrame, keeping only the rows whose date falls within this period.</p>
<pre><code class="lang-python"><span class="hljs-comment"># select data for one-year duration starting from January 1st, 2022</span>
start_date = pd.Timestamp(<span class="hljs-string">'2022-01-01'</span>)
end_date = start_date + pd.DateOffset(years=<span class="hljs-number">1</span>)
df_one_year = df_layoffs.loc[(df_layoffs[<span class="hljs-string">'date'</span>] &gt;= start_date) &amp; (df_layoffs[<span class="hljs-string">'date'</span>] &lt; end_date)]
</code></pre>
<p>After that, we group the selected data by date and calculate the total number of layoffs on each date using <code>groupby</code> and <code>sum</code>. The result is a pandas Series called <code>df_date</code>, indexed by date.</p>
<pre><code class="lang-python"><span class="hljs-comment"># total layoffs per date</span>
df_date = df_one_year.groupby(<span class="hljs-string">'date'</span>)[<span class="hljs-string">'total_laid_off'</span>].sum()
</code></pre>
<p>Then, we create a plot of the layoff trend for 2022 using Matplotlib. The figure size is set to 10 by 4 inches using the <code>figure</code> function.</p>
<pre><code class="lang-python">plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">4</span>))
</code></pre>
<p>The x-axis represents the date, and the y-axis represents the total number of layoffs. The xlabel function labels the x-axis as 'Date,' and the ylabel function labels the y-axis as 'Total Laid Off.'</p>
<pre><code class="lang-python">plt.plot(df_date.index, df_date.values)
plt.xlabel(<span class="hljs-string">'Date'</span>)
plt.ylabel(<span class="hljs-string">'Total Laid Off'</span>)
</code></pre>
<p>The plot title is set to 'Laid Off Trend for 2022' using the title function.</p>
<pre><code class="lang-python">plt.title(<span class="hljs-string">'Laid Off Trend for 2022'</span>)
</code></pre>
<p>The x-axis labels are rotated by 45 degrees using the xticks function to avoid overcrowding.</p>
<pre><code class="lang-python">plt.xticks(rotation=<span class="hljs-number">45</span>)
</code></pre>
<p>The x-axis labels are formatted as Month-Year (for example, Jan-2022) using the <code>DateFormatter</code> class.</p>
<pre><code class="lang-python"><span class="hljs-comment"># set the format of the x-axis labels to show Month-Year</span>
date_fmt = mdates.DateFormatter(<span class="hljs-string">'%b-%Y'</span>)
plt.gca().xaxis.set_major_formatter(date_fmt)
</code></pre>
<p>Finally, <code>MaxNLocator</code> limits the number of ticks on the x-axis. With <code>nbins=10</code>, the axis is divided into at most 10 intervals.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Use MaxNLocator to reduce the number of xticks</span>
locator = MaxNLocator(nbins=<span class="hljs-number">10</span>)
plt.gca().xaxis.set_major_locator(locator)
</code></pre>
<p>The plot is then displayed using the show function.</p>
<pre><code class="lang-python">plt.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/04/layoff-line-chart.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The chart above shows layoff trends and patterns for 2022.</p>
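<p>Daily totals can produce a spiky line. A common variation is to aggregate by month before plotting. The sketch below is an assumption-laden illustration: it uses a small DataFrame with invented rows in place of <code>df_one_year</code>, keeping only the <code>date</code> and <code>total_laid_off</code> columns from the article.</p>

```python
import pandas as pd

# Invented rows standing in for the filtered 2022 data
df_one_year = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-05", "2022-01-20", "2022-02-10", "2022-04-01"]),
    "total_laid_off": [100, 50, 30, 70],
})

# Sum layoffs per calendar month; "MS" labels each bin by the month's first day
monthly = df_one_year.set_index("date")["total_laid_off"].resample("MS").sum()
print(monthly)
```

<p>Months with no rows (March in this made-up data) appear with a total of 0 rather than being skipped. The resulting Series can be plotted with <code>plt.plot(monthly.index, monthly.values)</code>, and <code>mdates.MonthLocator()</code> can be used as the major locator to place one tick per month.</p>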
<p>You can also compare how an entity performed over different periods of time. The next example compares employee layoffs in 2020 with those in 2022.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> matplotlib.dates <span class="hljs-keyword">as</span> mdates
<span class="hljs-keyword">from</span> matplotlib.ticker <span class="hljs-keyword">import</span> MaxNLocator

<span class="hljs-comment"># convert date column to datetime object</span>
df_layoffs[<span class="hljs-string">'date'</span>] = pd.to_datetime(df_layoffs[<span class="hljs-string">'date'</span>])

<span class="hljs-comment"># filter data to only include 2020 and 2022 (copy so a new column can be added safely)</span>
df_filtered = df_layoffs[(df_layoffs[<span class="hljs-string">'date'</span>].dt.year == <span class="hljs-number">2020</span>) | (df_layoffs[<span class="hljs-string">'date'</span>].dt.year == <span class="hljs-number">2022</span>)].copy()

<span class="hljs-comment"># group data by year and calculate total layoffs</span>
df_filtered[<span class="hljs-string">'year'</span>] = df_filtered[<span class="hljs-string">'date'</span>].dt.year
df_yearly = df_filtered.groupby([<span class="hljs-string">'year'</span>, <span class="hljs-string">'date'</span>])[<span class="hljs-string">'total_laid_off'</span>].sum().reset_index()

<span class="hljs-comment"># create subplots and plot the data for each year in separate charts</span>
fig, axs = plt.subplots(ncols=<span class="hljs-number">2</span>, figsize=(<span class="hljs-number">14</span>, <span class="hljs-number">8</span>))
<span class="hljs-keyword">for</span> i, year <span class="hljs-keyword">in</span> enumerate(df_yearly[<span class="hljs-string">'year'</span>].unique()):
    df_year = df_yearly.loc[df_yearly[<span class="hljs-string">'year'</span>] == year]
    axs[i].plot(df_year[<span class="hljs-string">'date'</span>], df_year[<span class="hljs-string">'total_laid_off'</span>])
    axs[i].set_xlabel(<span class="hljs-string">'Date'</span>)
    axs[i].set_ylabel(<span class="hljs-string">'Total Laid Off'</span>)
    axs[i].set_title(<span class="hljs-string">f'Laid Off Trend for <span class="hljs-subst">{year}</span>'</span>)
    axs[i].xaxis.set_major_formatter(mdates.DateFormatter(<span class="hljs-string">'%b-%Y'</span>))
    axs[i].tick_params(axis=<span class="hljs-string">'x'</span>, rotation=<span class="hljs-number">45</span>)
    locator = MaxNLocator(nbins=<span class="hljs-number">10</span>)
    axs[i].xaxis.set_major_locator(locator)

<span class="hljs-comment"># set y-axis limit to 0-14000 for each subplot</span>
<span class="hljs-keyword">for</span> ax <span class="hljs-keyword">in</span> axs:
    ax.set_ylim([<span class="hljs-number">0</span>, <span class="hljs-number">14000</span>])

plt.show()
</code></pre>
<p>Let's review the different components of the code above.</p>
<p>The 'date' column in the DataFrame is converted to a datetime object.</p>
<pre><code class="lang-python"><span class="hljs-comment"># convert date column to datetime object</span>
df_layoffs[<span class="hljs-string">'date'</span>] = pd.to_datetime(df_layoffs[<span class="hljs-string">'date'</span>])
</code></pre>
<p>Next, the code filters the data to only include layoffs from the years 2020 and 2022. It then groups the filtered data by year and date and calculates the total number of layoffs for each date.</p>
<pre><code class="lang-python"><span class="hljs-comment"># filter data to only include 2020 and 2022 (copy so a new column can be added safely)</span>
df_filtered = df_layoffs[(df_layoffs[<span class="hljs-string">'date'</span>].dt.year == <span class="hljs-number">2020</span>) | (df_layoffs[<span class="hljs-string">'date'</span>].dt.year == <span class="hljs-number">2022</span>)].copy()

<span class="hljs-comment"># group data by year and calculate total layoffs</span>
df_filtered[<span class="hljs-string">'year'</span>] = df_filtered[<span class="hljs-string">'date'</span>].dt.year
df_yearly = df_filtered.groupby([<span class="hljs-string">'year'</span>, <span class="hljs-string">'date'</span>])[<span class="hljs-string">'total_laid_off'</span>].sum().reset_index()
</code></pre>
<p>We then create two subplots and plot the total number of layoffs for each year in separate charts. We set the x-axis labels to the date format 'MMM-YYYY' (for example, Jan-2022) and rotate them by 45 degrees. We also set the y-axis label to 'Total Laid Off' and the chart title to 'Laid Off Trend for {year}' (for example, Laid Off Trend for 2020). In the full listing above, both subplots are also given the same y-axis range (0 to 14,000) so that the two years can be compared on the same scale. Finally, we show the charts using the <code>plt.show()</code> command.</p>
<pre><code class="lang-python"><span class="hljs-comment"># create subplots and plot the data for each year in separate charts</span>
fig, axs = plt.subplots(ncols=<span class="hljs-number">2</span>, figsize=(<span class="hljs-number">14</span>, <span class="hljs-number">8</span>))
<span class="hljs-keyword">for</span> i, year <span class="hljs-keyword">in</span> enumerate(df_yearly[<span class="hljs-string">'year'</span>].unique()):
    df_year = df_yearly.loc[df_yearly[<span class="hljs-string">'year'</span>] == year]
    axs[i].plot(df_year[<span class="hljs-string">'date'</span>], df_year[<span class="hljs-string">'total_laid_off'</span>])
    axs[i].set_xlabel(<span class="hljs-string">'Date'</span>)
    axs[i].set_ylabel(<span class="hljs-string">'Total Laid Off'</span>)
    axs[i].set_title(<span class="hljs-string">f'Laid Off Trend for <span class="hljs-subst">{year}</span>'</span>)
    axs[i].xaxis.set_major_formatter(mdates.DateFormatter(<span class="hljs-string">'%b-%Y'</span>))
    axs[i].tick_params(axis=<span class="hljs-string">'x'</span>, rotation=<span class="hljs-number">45</span>)
    locator = MaxNLocator(nbins=<span class="hljs-number">10</span>)
    axs[i].xaxis.set_major_locator(locator)

plt.show()
</code></pre>
<p>Overall, the code filters, groups, and visualizes company layoff data, focusing on the trends for 2020 and 2022. You can see the result in the chart below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/04/line-chart-comparison-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We started by discussing what visualization is and why data visualization matters for turning raw numbers into insight and business sense. </p>
<p>Then we used Matplotlib, a popular Python library for data visualization, to create bar charts, pie charts, and line charts. Matplotlib also supports other chart types not covered in this article, such as histograms, scatter plots, and box plots. </p>
<p>With these visualizations, we can make sense of our data and spot patterns that are hard to see in raw numbers. Data visualization is just as useful in other areas such as finance, science, and engineering. For further study, you can check the official Matplotlib documentation <a target="_blank" href="https://matplotlib.org/stable/index.html">here</a>.</p>
<p>Thank you for reading! Please follow me on <a target="_blank" href="https://www.linkedin.com/in/ogbemi-ejegi/">LinkedIn</a>, where I also post more data-related content.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Conda Remove Package - How To Remove Matplotlib in Anaconda ]]>
                </title>
                <description>
                    <![CDATA[ You can use Conda to create and manage different environments and their packages. It is mostly used for data science and machine learning projects.  In this article, you'll learn how to remove an environment's package using in built Conda commands.  ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-remove-a-package-in-anaconda/</link>
                <guid isPermaLink="false">66b0a2e06428eb897141f87c</guid>
                
                    <category>
                        <![CDATA[ anaconda ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Matplotlib ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ihechikara Abba ]]>
                </dc:creator>
                <pubDate>Wed, 12 Apr 2023 12:21:34 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/04/joshua-woroniecki-lzh3hPtJz9c-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You can use Conda to create and manage different environments and their packages. It is mostly used for data science and machine learning projects. </p>
<p>In this article, you'll learn how to remove a package from an environment using built-in Conda commands. </p>
<p>You'll learn the following: </p>
<ul>
<li>How to create an environment. </li>
<li>How to install packages in an environment. </li>
<li>How to remove/delete an environment's package. </li>
</ul>
<p>Let's get started!</p>
<h2 id="heading-how-to-create-an-environment-in-conda">How To Create an Environment in Conda</h2>
<p>You can use the <code>conda create -n environment-name</code> command to create a new environment in Conda. </p>
<p>Here's an example:</p>
<pre><code class="lang-bash">conda create -n package-tutorial
</code></pre>
<p>The command above creates an environment called <code>package-tutorial</code>.</p>
<p>You can activate or switch to the <code>package-tutorial</code> environment using the <code>conda activate environment-name</code> command. That is:</p>
<pre><code class="lang-bash">conda activate package-tutorial
</code></pre>
<h2 id="heading-how-to-install-packages-in-a-conda-environment">How To Install Packages in a Conda Environment</h2>
<p>In the last section, we created and activated an environment called <code>package-tutorial</code>. </p>
<p>In this section, you'll see how to install a package in that environment. Let's install Matplotlib. </p>
<p>You can install a package using the <code>conda install package-name</code> command.</p>
<p>Here's one of the commands for installing Matplotlib in Conda: </p>
<pre><code class="lang-bash">conda install -c conda-forge matplotlib
</code></pre>
<p>Downloading and extracting the package might take a while. Once the installation is complete, use the <code>conda list</code> command to verify that the package has been installed in your environment. </p>
<h2 id="heading-how-to-remove-a-package-in-conda">How To Remove a Package in Conda</h2>
<p>You can remove a package in the current environment by running the <code>conda remove package-name</code> command. </p>
<p>In our case, we want to remove Matplotlib from the current environment (<code>package-tutorial</code> environment):</p>
<pre><code class="lang-bash">conda remove matplotlib
</code></pre>
<p>The command above removes Matplotlib from the current environment. When you run the <code>conda list</code> command, Matplotlib will no longer be listed as a package.</p>
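<p>If you'd rather confirm this from Python than from the <code>conda list</code> output, the sketch below checks whether a package can be found by the interpreter of the active environment. The helper name <code>is_installed</code> is made up for this example, and it only uses the standard library.</p>

```python
import importlib.util


def is_installed(package_name):
    """Return True if the running interpreter can find the package."""
    # find_spec returns None when no top-level module with that name is found.
    return importlib.util.find_spec(package_name) is not None


# Expected to print False once Matplotlib has been removed from the environment.
print(is_installed("matplotlib"))
```

<p>Run it with the Python interpreter of the environment you want to check, since the result depends on which environment is active.</p>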
<h2 id="heading-summary">Summary</h2>
<p>In this article, we talked about packages in Conda. They can be installed in Conda environments. </p>
<p>We saw how to create and activate a Conda environment. We also saw how to install and remove packages in Conda. </p>
<p>Happy coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Matplotlib Marker - How To Create a Marker in Matplotlib ]]>
                </title>
                <description>
                    <![CDATA[ In this article, you'll learn how to use markers in Matplotlib to indicate specific points in a plot.  The marker parameter can be used to create "markers" in a plot. You can specify the shape of the marker by passing a value to the parameter.  Here'... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-a-marker-in-matplotlib/</link>
                <guid isPermaLink="false">66b0a2d178fa885fe5a09ccb</guid>
                
                    <category>
                        <![CDATA[ Matplotlib ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ihechikara Abba ]]>
                </dc:creator>
                <pubDate>Tue, 14 Mar 2023 15:05:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/03/isaac-smith-6EnTPvPPL6I-unsplash--2--1.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this article, you'll learn how to use markers in Matplotlib to indicate specific points in a plot. </p>
<p>The <code>marker</code> parameter can be used to create "markers" in a plot. You can specify the shape of the marker by passing a value to the parameter. </p>
<p>Here's what a normal Matplotlib plot looks like:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

x = [<span class="hljs-number">2</span>,<span class="hljs-number">4</span>,<span class="hljs-number">6</span>,<span class="hljs-number">8</span>]
y = [<span class="hljs-number">1</span>,<span class="hljs-number">3</span>,<span class="hljs-number">9</span>,<span class="hljs-number">7</span>]

plt.plot(x,y)
plt.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/plot-without-marker.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>a matplotlib plot without a marker</em></p>
<p>Here's a plot with a marker:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

x = [<span class="hljs-number">2</span>,<span class="hljs-number">4</span>,<span class="hljs-number">6</span>,<span class="hljs-number">8</span>]
y = [<span class="hljs-number">1</span>,<span class="hljs-number">3</span>,<span class="hljs-number">9</span>,<span class="hljs-number">7</span>]

plt.plot(x,y, marker = <span class="hljs-string">'o'</span>)
plt.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/plot-with-markers.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>a matplotlib plot with an "o" marker</em></p>
<p>As can be seen in the image above, every data point in the plot (each pair of x and y values) is denoted by a marker that looks like a circle.</p>
<p>We're able to do that by setting the value of the <code>marker</code> parameter to the letter "o": <code>plt.plot(x,y, marker = 'o')</code>.</p>
<h2 id="heading-list-of-matplotlib-markers">List of Matplotlib Markers</h2>
<p>Here is a list (from the <a target="_blank" href="https://matplotlib.org/stable/api/markers_api.html">Matplotlib documentation</a>) of marker values that can be assigned to the <code>marker</code> parameter:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Marker</td><td>Description</td></tr>
</thead>
<tbody>
<tr>
<td>"."</td><td>point</td></tr>
<tr>
<td>","</td><td>pixel</td></tr>
<tr>
<td>"o"</td><td>circle</td></tr>
<tr>
<td>"v"</td><td>triangle_down</td></tr>
<tr>
<td>"^"</td><td>triangle_up</td></tr>
<tr>
<td>"&lt;"</td><td>triangle_left</td></tr>
<tr>
<td>"&gt;"</td><td>triangle_right</td></tr>
<tr>
<td>"1"</td><td>tri_down</td></tr>
<tr>
<td>"2"</td><td>tri_up</td></tr>
<tr>
<td>"3"</td><td>tri_left</td></tr>
<tr>
<td>"4"</td><td>tri_right</td></tr>
<tr>
<td>"8"</td><td>octagon</td></tr>
<tr>
<td>"s"</td><td>square</td></tr>
<tr>
<td>"p"</td><td>pentagon</td></tr>
<tr>
<td>"P"</td><td>plus (filled)</td></tr>
<tr>
<td>"h"</td><td>hexagon1</td></tr>
<tr>
<td>"H"</td><td>hexagon2</td></tr>
<tr>
<td>"+"</td><td>plus</td></tr>
<tr>
<td>"*"</td><td>star</td></tr>
<tr>
<td>"x"</td><td>x</td></tr>
<tr>
<td>"X"</td><td>x (filled)</td></tr>
<tr>
<td>"D"</td><td>diamond</td></tr>
<tr>
<td>"d"</td><td>thin_diamond</td></tr>
<tr>
<td>"_"</td><td>hline</td></tr>
<tr>
<td>"|"</td><td>vline</td></tr>
<tr>
<td>0</td><td>tickleft</td></tr>
<tr>
<td>1</td><td>tickright</td></tr>
<tr>
<td>2</td><td>tickup</td></tr>
<tr>
<td>3</td><td>tickdown</td></tr>
<tr>
<td>4</td><td>caretleft</td></tr>
<tr>
<td>5</td><td>caretright</td></tr>
<tr>
<td>6</td><td>caretup</td></tr>
<tr>
<td>7</td><td>caretdown</td></tr>
<tr>
<td>8</td><td>caretleft (centered at base)</td></tr>
<tr>
<td>9</td><td>caretright (centered at base)</td></tr>
<tr>
<td>10</td><td>caretup (centered at base)</td></tr>
<tr>
<td>11</td><td>caretdown (centered at base)</td></tr>
</tbody>
</table>
</div><p>The list above shows the different values you can use to change the style of a marker in a plot. </p>
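<p>As a quick way to compare a few of these values side by side, here is a minimal sketch that draws the same data four times, each with a different marker from the table. The choice of markers and the vertical offset are arbitrary and only there to keep the lines apart.</p>

```python
import matplotlib.pyplot as plt

x = [2, 4, 6, 8]
y = [1, 3, 9, 7]
markers = ["o", "s", "^", "D"]  # circle, square, triangle_up, diamond

fig, ax = plt.subplots()
for offset, marker in enumerate(markers):
    # Shift each line up by a different amount so the markers do not overlap.
    shifted = [value + 2 * offset for value in y]
    ax.plot(x, shifted, marker=marker, label=f"marker={marker!r}")

# The legend shows which marker belongs to which line.
ax.legend()
```

<p>Add <code>plt.show()</code> at the end to display the figure.</p>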
<h2 id="heading-summary">Summary</h2>
<p>In this article, we talked about markers in Matplotlib. They can be used to mark/indicate specific points in a plot. </p>
<p>We saw some code examples showing the application of the <code>marker</code> parameter. </p>
<p>Lastly, we saw a list of <code>marker</code> values that can be used to change the style of a marker. </p>
<p>Happy coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How To Change Legend Font Size in Matplotlib ]]>
                </title>
                <description>
                    <![CDATA[ You can modify different properties of a plot — color, size, label, title and so on — when working with Matplotlib.  In this article, you'll learn what a legend is in Matplotlib, and how to use some of its parameters to make your plots more relatable... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-change-legend-fontsize-in-matplotlib/</link>
                <guid isPermaLink="false">66b0a2c65e73cf343a5cc00e</guid>
                
                    <category>
                        <![CDATA[ Matplotlib ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ihechikara Abba ]]>
                </dc:creator>
                <pubDate>Tue, 14 Mar 2023 15:04:13 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/03/isaac-smith-6EnTPvPPL6I-unsplash--2-.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You can modify different properties of a plot — color, size, label, title and so on — when working with Matplotlib. </p>
<p>In this article, you'll learn what a legend is in Matplotlib, and how to use some of its parameters to make your plots easier to read. </p>
<p>You'll then learn how to change the font size of a Matplotlib legend using:</p>
<ul>
<li>The <code>fontsize</code> parameter. </li>
<li>The <code>prop</code> parameter.</li>
</ul>
<h2 id="heading-what-is-a-legend-in-matplotlib">What Is a Legend in Matplotlib?</h2>
<p>A legend is a box on a graph that describes the elements that make it up. In Matplotlib, you create one with the <code>legend()</code> function. </p>
<p>Consider the graph below:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-comment"># create a plot</span>
x = [<span class="hljs-number">1</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>, <span class="hljs-number">8</span>]
y = [<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">2</span>]

plt.plot(x, y)

plt.legend([<span class="hljs-string">"Data"</span>], loc=<span class="hljs-string">"upper right"</span>)

plt.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/matplotlib-legend.png" alt="Image" width="600" height="400" loading="lazy">
<em>matplotlib graph with a legend</em></p>
<p>In the graph above, we described the plot using a <code>legend</code>. A description of "Data" was assigned to the legend, and was placed in the upper right corner of the graph using the <code>upper right</code> value of the <code>loc</code> parameter. </p>
<p>With the <code>legend</code> function, you can assign different descriptions to each line of a graph. </p>
<p>Here's an example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

age = [<span class="hljs-number">1</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>, <span class="hljs-number">8</span>]
number = [<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">2</span>, <span class="hljs-number">1</span>]

plt.plot(age)
plt.plot(number)

plt.legend([<span class="hljs-string">"age"</span>, <span class="hljs-string">"number"</span>], loc =<span class="hljs-string">"upper right"</span>)

plt.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/matplotlib-legend.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>two line graph with different legend descriptions</em></p>
<p>In the graph above, we've used the <code>legend</code> function to describe each line in the plot. </p>
<p>This makes it easier for anyone viewing the graph to know that the blue line denotes <code>age</code> while the orange line denotes <code>number</code> in the graph. </p>
<p>You can change the position of the legend using the following values of the <code>loc</code> parameter: </p>
<ul>
<li><code>best</code></li>
<li><code>upper right</code></li>
<li><code>upper left</code></li>
<li><code>lower left</code></li>
<li><code>lower right</code></li>
<li><code>right</code></li>
<li><code>center left</code></li>
<li><code>center right</code></li>
<li><code>lower center</code></li>
<li><code>upper center</code></li>
<li><code>center</code></li>
</ul>
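<p>Here is a minimal sketch that uses one of these values. As an alternative to passing a list of names to <code>legend()</code>, it sets each description with the <code>label</code> parameter of <code>plot()</code>, which keeps the description next to the line it belongs to. The data is the same as in the example above.</p>

```python
import matplotlib.pyplot as plt

age = [1, 4, 6, 8]
number = [4, 5, 6, 2, 1]

fig, ax = plt.subplots()
# Each line carries its own description through the label parameter.
ax.plot(age, label="age")
ax.plot(number, label="number")

# legend() picks up the labels; loc places the box in the lower right corner.
legend = ax.legend(loc="lower right")
print([text.get_text() for text in legend.get_texts()])
```

<p>Add <code>plt.show()</code> at the end to display the figure.</p>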
<h2 id="heading-how-to-change-legend-font-size-in-matplotlib-using-the-fontsize-parameter">How To Change Legend Font Size in Matplotlib Using the <code>fontsize</code> Parameter</h2>
<p>You can change the font size of a Matplotlib legend by specifying a font size value for the <code>fontsize</code> parameter. </p>
<p>Here's what the default legend font size looks like:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

age = [<span class="hljs-number">1</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>, <span class="hljs-number">8</span>]
number = [<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">2</span>, <span class="hljs-number">1</span>]

plt.plot(age)
plt.plot(number)

plt.legend([<span class="hljs-string">"age"</span>, <span class="hljs-string">"number"</span>], loc =<span class="hljs-string">"upper right"</span>)

plt.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/matplotlib-legend-1.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>matplotlib graph with default legend font size</em></p>
<p>Here's another code example with the <code>fontsize</code> parameter included:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

age = [<span class="hljs-number">1</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>, <span class="hljs-number">8</span>]
number = [<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">2</span>, <span class="hljs-number">1</span>]

plt.plot(age)
plt.plot(number)

plt.legend([<span class="hljs-string">"age"</span>, <span class="hljs-string">"number"</span>], fontsize=<span class="hljs-string">"20"</span>, loc =<span class="hljs-string">"upper left"</span>)

plt.show()
</code></pre>
<p>Here's what the legend would look like:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/matplotlib-legend-fontsize-parameter-1.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>matplotlib legend size using fontsize parameter</em></p>
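<p>We assigned a font size of 20 to the <code>fontsize</code> parameter to get the legend size in the image above: <code>fontsize="20"</code>. The parameter also accepts a plain number (<code>fontsize=20</code>) or a named size such as <code>"small"</code> or <code>"x-large"</code>. </p>
<p>You'll also notice that the legend was placed in the upper left corner of the graph using the <code>loc</code> parameter.</p>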
<h2 id="heading-how-to-change-legend-font-size-in-matplotlib-using-the-prop-parameter">How To Change Legend Font Size in Matplotlib Using the <code>prop</code> Parameter</h2>
<p>Another way of changing the font size of a legend is by using the <code>legend</code> function's <code>prop</code> parameter. </p>
<p>Here's how to use it:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

age = [<span class="hljs-number">1</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>, <span class="hljs-number">8</span>]
number = [<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">2</span>, <span class="hljs-number">1</span>]

plt.plot(age)
plt.plot(number)

plt.legend([<span class="hljs-string">"age"</span>, <span class="hljs-string">"number"</span>], prop = { <span class="hljs-string">"size"</span>: <span class="hljs-number">20</span> }, loc =<span class="hljs-string">"upper left"</span>)

plt.show()
</code></pre>
<p>Using the <code>prop</code> parameter, we specified a font size of 20: <code>prop = { "size": 20 }</code>. </p>
<p>Here's the output:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/matplotlib-legend-fontsize-parameter-2.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>matplotlib legend size using prop parameter</em></p>
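<p>To check that both approaches lead to the same result, the sketch below draws the same data on two side-by-side axes, sets the legend size with <code>fontsize</code> on one and with <code>prop</code> on the other, and reads the resulting text size back from each legend. The two-axes layout is only for comparison.</p>

```python
import matplotlib.pyplot as plt

age = [1, 4, 6, 8]
number = [4, 5, 6, 2, 1]

fig, (ax_left, ax_right) = plt.subplots(ncols=2, figsize=(10, 4))
for ax in (ax_left, ax_right):
    ax.plot(age, label="age")
    ax.plot(number, label="number")

# Same target size, set in two different ways.
legend_fontsize = ax_left.legend(fontsize=20, loc="upper left")
legend_prop = ax_right.legend(prop={"size": 20}, loc="upper left")

# Read the font size (in points) back from the first entry of each legend.
size_a = legend_fontsize.get_texts()[0].get_fontsize()
size_b = legend_prop.get_texts()[0].get_fontsize()
print(size_a, size_b)
```

<p>The <code>prop</code> dictionary is the better fit when you also want to change other font properties, such as the family or weight, in the same place.</p>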
<h2 id="heading-summary">Summary</h2>
<p>In this article, we talked about the <code>legend</code> function in Matplotlib. It can be used to describe the elements that make up a graph. </p>
<p>We first saw what a legend is in Matplotlib, and some examples to show its basic usage and parameters. </p>
<p>We then saw how to use the <code>fontsize</code> and <code>prop</code> parameters to change the font size of a Matplotlib legend. </p>
<p>Happy coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Matplotlib Add Color – How To Change Line Color in Matplotlib ]]>
                </title>
                <description>
                    <![CDATA[ Matplotlib is a Python library used for data visualization, and creating interactive plots and graphs.  In this article, you'll learn how to add colors to your Matplotlib plots using parameter values provided by the Matplotlib plot() function. You'll... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-change-color-in-matplotlib/</link>
                <guid isPermaLink="false">66b0a2c43ac4671a1e5802f0</guid>
                
                    <category>
                        <![CDATA[ Matplotlib ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ihechikara Abba ]]>
                </dc:creator>
                <pubDate>Mon, 13 Mar 2023 21:55:25 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/03/isaac-smith-6EnTPvPPL6I-unsplash--1-.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Matplotlib is a Python library used for <a target="_blank" href="https://www.freecodecamp.org/news/data-visualization-tools-guide">data visualization</a>, and creating interactive plots and graphs. </p>
<p>In this article, you'll learn how to add colors to your Matplotlib plots using parameter values provided by the Matplotlib <code>plot()</code> function.</p>
<p>You'll learn how to change the color of a plot using:</p>
<ul>
<li>Color names. </li>
<li>Color abbreviations.</li>
<li>RGB/RGBA values. </li>
<li>Hex values.</li>
</ul>
<p>Let's get started!</p>
<h2 id="heading-how-to-change-line-color-in-matplotlib">How To Change Line Color in Matplotlib</h2>
<p>By default, the first line you plot in Matplotlib is blue. That is:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

x = [<span class="hljs-number">5</span>,<span class="hljs-number">10</span>,<span class="hljs-number">15</span>,<span class="hljs-number">20</span>]
y = [<span class="hljs-number">10</span>,<span class="hljs-number">20</span>,<span class="hljs-number">30</span>,<span class="hljs-number">40</span>]

plt.plot(x,y)
plt.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/matplotlib-default-line-color.PNG" alt="Image" width="600" height="400" loading="lazy"></p>
<p>To change the color of a plot, simply add a <code>color</code> parameter to the <code>plot</code> function and specify the value of the color. </p>
<p>Here are some examples:</p>
<h3 id="heading-how-to-change-line-color-in-matplotlib-example-1">How To Change Line Color in Matplotlib Example #1</h3>
<p>In this example, we'll change the color of the plot using a color name. </p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

x = [<span class="hljs-number">5</span>,<span class="hljs-number">10</span>,<span class="hljs-number">15</span>,<span class="hljs-number">20</span>]
y = [<span class="hljs-number">10</span>,<span class="hljs-number">20</span>,<span class="hljs-number">30</span>,<span class="hljs-number">40</span>]

plt.plot(x,y, color=<span class="hljs-string">'red'</span>)
plt.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/matplotlib-red-line-color.PNG" alt="Image" width="600" height="400" loading="lazy"></p>
<p>In the example above, we assigned a value of 'red' to the <code>color</code> parameter: <code>color='red'</code>.</p>
<h3 id="heading-how-to-change-line-color-in-matplotlib-example-2">How To Change Line Color in Matplotlib Example #2</h3>
<p>You can make use of abbreviations when specifying the color to be used for the plot. That is: </p>
<ul>
<li><code>'b'</code> = blue</li>
<li><code>'g'</code> = green</li>
<li><code>'r'</code> = red</li>
<li><code>'c'</code> =  cyan</li>
<li><code>'m'</code> = magenta</li>
<li><code>'y'</code> = yellow</li>
<li><code>'k'</code> = black</li>
<li><code>'w'</code> = white</li>
</ul>
<p>Here's a code example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

x = [<span class="hljs-number">5</span>,<span class="hljs-number">10</span>,<span class="hljs-number">15</span>,<span class="hljs-number">20</span>]
y = [<span class="hljs-number">10</span>,<span class="hljs-number">20</span>,<span class="hljs-number">30</span>,<span class="hljs-number">40</span>]

plt.plot(x,y, color=<span class="hljs-string">'m'</span>)
plt.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/matplotlib-magenta-line-color.PNG" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-how-to-change-line-color-in-matplotlib-example-3">How To Change Line Color in Matplotlib Example #3</h3>
<p>You can also make use of RGB and RGBA (red, green, blue, alpha), and hex values. </p>
<p>Here's an example that creates a plot with a yellow color using RGB:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

x = [<span class="hljs-number">5</span>,<span class="hljs-number">10</span>,<span class="hljs-number">15</span>,<span class="hljs-number">20</span>]
y = [<span class="hljs-number">10</span>,<span class="hljs-number">20</span>,<span class="hljs-number">30</span>,<span class="hljs-number">40</span>]

plt.plot(x,y, color=(<span class="hljs-number">1.0</span>, <span class="hljs-number">0.92</span>, <span class="hljs-number">0.23</span>))
plt.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/matplotlib-yellow-line-color.PNG" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Here's another example that uses a hex value to create a green plot:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

x = [<span class="hljs-number">5</span>,<span class="hljs-number">10</span>,<span class="hljs-number">15</span>,<span class="hljs-number">20</span>]
y = [<span class="hljs-number">10</span>,<span class="hljs-number">20</span>,<span class="hljs-number">30</span>,<span class="hljs-number">40</span>]

plt.plot(x,y, color=<span class="hljs-string">'#00FF00'</span>)
plt.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/matplotlib-green-line-color.PNG" alt="Image" width="600" height="400" loading="lazy"></p>
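<p>All of these formats describe colors in the same underlying way. As a small sketch, the snippet below uses the <code>to_rgba()</code> and <code>to_hex()</code> helpers from <code>matplotlib.colors</code> to show that a color name, an abbreviation, an RGB tuple, an RGBA tuple, and a hex string can all refer to the same red.</p>

```python
import matplotlib.colors as mcolors

# Five ways of writing the same red.
specs = ["red", "r", (1.0, 0.0, 0.0), (1.0, 0.0, 0.0, 1.0), "#FF0000"]

# to_rgba converts any accepted color format to a (red, green, blue, alpha) tuple.
rgba_values = [mcolors.to_rgba(spec) for spec in specs]
for spec, rgba in zip(specs, rgba_values):
    print(spec, rgba)

# to_hex goes the other way and returns a lowercase hex string.
red_hex = mcolors.to_hex("red")
print(red_hex)
```

<p>The fourth value in an RGBA tuple is the alpha channel, which controls transparency: 1.0 is fully opaque and 0.0 is fully transparent.</p>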
<h2 id="heading-summary">Summary</h2>
<p>In this article, we talked about how to change the color of plots in Matplotlib. </p>
<p>We saw examples that showed how to use color name, abbreviations, RGB/RGBA values, and hex values to change the color of a plot in Matplotlib. </p>
<p>Happy coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Matplotlib Figure Size – How to Change Plot Size in Python with plt.figsize() ]]>
                </title>
                <description>
                    <![CDATA[ When creating plots using Matplotlib, you get a default figure size of 6.4 for the width and 4.8 for the height (in inches). In this article, you'll learn how to change the plot size using the following:  The figsize() attribute.  The set_figwidth()... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/matplotlib-figure-size-change-plot-size-in-python/</link>
                <guid isPermaLink="false">66b0a322b30dd4d00547bba5</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Matplotlib ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ihechikara Abba ]]>
                </dc:creator>
                <pubDate>Thu, 12 Jan 2023 15:29:17 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/01/isaac-smith-6EnTPvPPL6I-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When creating plots using Matplotlib, you get a default figure size of 6.4 for the width and 4.8 for the height (in inches).</p>
<p>In this article, you'll learn how to change the plot size using the following: </p>
<ul>
<li>The <code>figsize</code> parameter. </li>
<li>The <code>set_figwidth()</code> method.</li>
<li>The <code>set_figheight()</code> method.</li>
<li>The <code>rcParams</code> settings.</li>
</ul>
<p>Let's get started!</p>
<h2 id="heading-how-to-change-plot-size-in-matplotlib-with-pltfigsize">How to Change Plot Size in Matplotlib with <code>plt.figsize()</code></h2>
<p>As stated in the previous section, the default parameters (in inches) for Matplotlib plots are 6.4 wide and 4.8 high. Here's a code example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

x = [<span class="hljs-number">2</span>,<span class="hljs-number">4</span>,<span class="hljs-number">6</span>,<span class="hljs-number">8</span>]
y = [<span class="hljs-number">10</span>,<span class="hljs-number">3</span>,<span class="hljs-number">20</span>,<span class="hljs-number">4</span>]

plt.plot(x,y)

plt.show()
</code></pre>
<p>In the code above, we first imported <code>matplotlib</code>. We then created two lists — <code>x</code> and <code>y</code> — with values to be plotted. </p>
<p>Using <code>plt.plot()</code>, we plotted list <code>x</code> on the x-axis and list <code>y</code> on the y-axis: <code>plt.plot(x,y)</code>. </p>
<p>Lastly, <code>plt.show()</code> displays the plot. Here's what the plot looks like with the default figure size: </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/matplotlib.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>matplotlib plot with default figure size parameters</em></p>
<p>We can change the size of the plot above using the <code>figsize</code> parameter of the <code>figure()</code> function. </p>
<p>The <code>figsize</code> parameter takes a pair of values: the first is the width and the second is the height, both in inches. </p>
<h3 id="heading-heres-what-the-syntax-looks-like">Here's what the syntax looks like:</h3>
<pre><code class="lang-txt">figure(figsize=(WIDTH_SIZE,HEIGHT_SIZE))
</code></pre>
<p>Here's a code example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

x = [<span class="hljs-number">2</span>,<span class="hljs-number">4</span>,<span class="hljs-number">6</span>,<span class="hljs-number">8</span>]
y = [<span class="hljs-number">10</span>,<span class="hljs-number">3</span>,<span class="hljs-number">20</span>,<span class="hljs-number">4</span>]

plt.figure(figsize=(<span class="hljs-number">10</span>,<span class="hljs-number">6</span>))
plt.plot(x,y)

plt.show()
</code></pre>
<p>We've added one new line of code: <code>plt.figure(figsize=(10,6))</code>. This sets the width of the plot to 10 inches and its height to 6 inches. </p>
<p>Here's what the plot would look like:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/matplotlib1.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>matplotlib plot with modified figure size</em></p>
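<p>If you want to confirm the size without judging it by eye, you can ask the figure for its dimensions. The sketch below is one way to do that. It assumes you keep a reference to the figure in a variable (here called <code>fig</code>, a name chosen for illustration) and it selects the non-interactive <code>Agg</code> backend so it runs without a display:</p>

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook
import matplotlib.pyplot as plt

x = [2, 4, 6, 8]
y = [10, 3, 20, 4]

# Keep a reference to the figure so we can inspect it afterwards
fig = plt.figure(figsize=(10, 6))
plt.plot(x, y)

# get_size_inches() returns the current width and height in inches
width, height = fig.get_size_inches()
print(width, height)

plt.close(fig)
```

<p>The printed values should match the pair passed to <code>figsize</code>.</p>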
<h2 id="heading-how-to-change-plot-width-in-matplotlib-with-setfigwidth">How to Change Plot Width in Matplotlib with <code>set_figwidth()</code></h2>
<p>You can use the <code>set_figwidth()</code> method to change the width of a plot. </p>
<p>We'll pass in the value the width should be changed to as a parameter to the method. </p>
<p>This method will not change the default or preset value of the plot's height.</p>
<p>Here's a code example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

x = [<span class="hljs-number">2</span>,<span class="hljs-number">4</span>,<span class="hljs-number">6</span>,<span class="hljs-number">8</span>]
y = [<span class="hljs-number">10</span>,<span class="hljs-number">3</span>,<span class="hljs-number">20</span>,<span class="hljs-number">4</span>]

plt.figure().set_figwidth(<span class="hljs-number">15</span>)
plt.plot(x,y)

plt.show()
</code></pre>
<p>Using the <code>set_figwidth()</code> method, we set the width of the plot to 15. Here's what the plot would look like:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/matplotlib2.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>matplotlib plot with modified width</em></p>
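<p>To see that only the width changes, you can read the height before and after the call. This is a minimal sketch: the variable names are illustrative, and the <code>Agg</code> backend is an assumption made so the code runs without a display.</p>

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook
import matplotlib.pyplot as plt

fig = plt.figure()
height_before = fig.get_figheight()  # whatever the current default height is

fig.set_figwidth(15)                 # change the width only
plt.plot([2, 4, 6, 8], [10, 3, 20, 4])

new_width = fig.get_figwidth()
height_after = fig.get_figheight()
print(new_width, height_before, height_after)

plt.close(fig)
```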
<h2 id="heading-how-to-change-plot-height-in-matplotlib-with-setfigheight">How to Change Plot Height in Matplotlib with <code>set_figheight()</code></h2>
<p>You can use the <code>set_figheight()</code> method to change the height of a plot. </p>
<p>This method will not change the default or preset value of the plot's width. </p>
<p>Here's a code example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

x = [<span class="hljs-number">2</span>,<span class="hljs-number">4</span>,<span class="hljs-number">6</span>,<span class="hljs-number">8</span>]
y = [<span class="hljs-number">10</span>,<span class="hljs-number">3</span>,<span class="hljs-number">20</span>,<span class="hljs-number">4</span>]

plt.figure().set_figheight(<span class="hljs-number">2</span>)
plt.plot(x,y)

plt.show()
</code></pre>
<p>Using the <code>set_figheight()</code> in the example above, we set the plot's height to 2. Here's what the plot would look like:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/matplotlib3.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>matplotlib plot with modified height</em></p>
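<p>If you need to change both dimensions of an existing figure, calling the two setters one after the other works. As an alternative that this article doesn't otherwise cover, Matplotlib figures also have a <code>set_size_inches()</code> method that takes the width and height together. A small sketch, with illustrative variable names and the headless <code>Agg</code> backend assumed:</p>

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook
import matplotlib.pyplot as plt

fig = plt.figure()

# Change only the height, as in the example above
fig.set_figheight(2)
height_only = fig.get_figheight()

# Change width and height in one call
fig.set_size_inches(8, 3)
final_size = tuple(float(v) for v in fig.get_size_inches())
print(height_only, final_size)

plt.close(fig)
```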
<h2 id="heading-how-to-change-default-plot-size-in-matplotlib-with-rcparams">How to Change Default Plot Size in Matplotlib with <code>rcParams</code></h2>
<p>You can override the default plot size in Matplotlib using <code>rcParams</code>, a dictionary-like object that stores Matplotlib's default settings. </p>
<p>This is useful when you want all your plots to follow a particular size. This means you don't have to change the size of every plot you create. </p>
<p>Here's an example with two plots:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

x = [<span class="hljs-number">2</span>,<span class="hljs-number">4</span>,<span class="hljs-number">6</span>,<span class="hljs-number">8</span>]
y = [<span class="hljs-number">10</span>,<span class="hljs-number">3</span>,<span class="hljs-number">20</span>,<span class="hljs-number">4</span>]

plt.rcParams[<span class="hljs-string">'figure.figsize'</span>] = [<span class="hljs-number">4</span>, <span class="hljs-number">4</span>]
plt.plot(x,y)

plt.show()
</code></pre>
<pre><code class="lang-python">a = [<span class="hljs-number">5</span>,<span class="hljs-number">10</span>,<span class="hljs-number">15</span>,<span class="hljs-number">20</span>]
b = [<span class="hljs-number">10</span>,<span class="hljs-number">20</span>,<span class="hljs-number">30</span>,<span class="hljs-number">40</span>]

plt.plot(a,b)
</code></pre>
<p>Using the <code>figure.figsize</code> parameter, we set the default width and height to 4: <code>plt.rcParams['figure.figsize'] = [4, 4]</code>. These parameters will change the default width and height of the two plots. </p>
<p>Here are the plots:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/matplotlib4.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>matplotlib plot with modified default size</em></p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/matplotlib5-1.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>matplotlib plot with modified default size</em></p>
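<p>Assigning to <code>rcParams</code> affects every figure created afterwards in the same session. If you only want a different default for part of your code, one option is <code>plt.rc_context()</code>, which applies settings inside a <code>with</code> block and restores the previous values when the block ends. The sketch below assumes the headless <code>Agg</code> backend, and the variable names are illustrative:</p>

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook
import matplotlib.pyplot as plt

default_size = list(plt.rcParams["figure.figsize"])

# The override only applies to figures created inside this block
with plt.rc_context({"figure.figsize": (4, 4)}):
    fig_inside = plt.figure()
    plt.plot([2, 4, 6, 8], [10, 3, 20, 4])
    size_inside = tuple(float(v) for v in fig_inside.get_size_inches())

# After the block, the previous default is back in place
restored_size = list(plt.rcParams["figure.figsize"])
print(size_inside, restored_size)

plt.close("all")
```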
<h2 id="heading-summary">Summary</h2>
<p>In this article, we talked about the different ways you can change the size of a plot in Matplotlib. </p>
<p>We saw code examples and visual representation of the plots. This helped us understand how each method can be used to change the size of a plot. </p>
<p>We discussed the following methods used in changing the plot size in Matplotlib:</p>
<ul>
<li>The <code>figsize</code> parameter can be used when you want to set the size of a specific plot. </li>
<li>The <code>set_figwidth()</code> method can be used to change only the width of a plot.</li>
<li>The <code>set_figheight()</code> method can be used to change only the height of a plot.</li>
<li>The <code>rcParams</code> settings can be used when you want to override the default plot size for all your plots. Unlike the <code>figsize</code> parameter, which targets a specific plot, <code>rcParams</code> affects every plot created after you change it.</li>
</ul>
<p>Happy coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ What is Data Analysis? How to Visualize Data with Python, Numpy, Pandas, Matplotlib & Seaborn Tutorial ]]>
                </title>
                <description>
                    <![CDATA[ By Aakash NS Data Analysis is the process of exploring, investigating, and gathering insights from data using statistical measures and visualizations.  The objective of data analysis is to develop an understanding of data by uncovering trends, relati... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/exploratory-data-analysis-with-numpy-pandas-matplotlib-seaborn/</link>
                <guid isPermaLink="false">66d45d5ab3016bf139028cff</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Matplotlib ]]>
                    </category>
                
                    <category>
                        <![CDATA[ numpy ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 24 Jun 2021 00:11:01 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/05/blog-cover-4.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Aakash NS</p>
<p>Data Analysis is the process of exploring, investigating, and gathering insights from data using statistical measures and visualizations. </p>
<p>The objective of data analysis is to develop an understanding of data by uncovering trends, relationships, and patterns.</p>
<p>Data analysis is both a science and an art. On the one hand, it requires that you know statistics, visualization techniques, and data analysis tools like Numpy, Pandas, and Seaborn. </p>
<p>On the other hand, it requires that you ask interesting questions to guide the investigation, and then interpret the numbers and figures to generate useful insights.</p>
<p>This tutorial on data analysis covers the following topics:</p>
<ol>
<li><a class="post-section-overview" href="#heading-what-is-numerical-computation-python-and-numpy-for-beginners">What is Numerical Computation? Python and Numpy for Beginners</a></li>
<li><a class="post-section-overview" href="#heading-how-to-analyze-tabular-data-using-python-and-pandas">How to Analyze Tabular Data using Python and Pandas</a></li>
<li><a class="post-section-overview" href="#heading-data-visualization-using-python-matplotlib-and-seaborn">Data Visualization using Python, Matplotlib, and Seaborn</a></li>
</ol>
<h2 id="heading-what-is-numerical-computation-python-and-numpy-for-beginners">What is Numerical Computation? Python and Numpy for Beginners</h2>
<p><img src="https://i.imgur.com/mg8O3kd.png" alt="Image" width="1385" height="480" loading="lazy">
<em>Source: <a target="_blank" href="https://github.com/elegant-scipy/elegant-scipy/blob/master/figures/NumPy_ndarrays_v2.png">Elegant Scipy</a></em></p>
<p>You can follow along with the tutorial and run the code here: <a target="_blank" href="https://jovian.ai/aakashns/python-numerical-computing-with-numpy">https://jovian.ai/aakashns/python-numerical-computing-with-numpy</a></p>
<p>This section covers the following topics:</p>
<ul>
<li>How to work with numerical data in Python</li>
<li>How to turn Python lists into Numpy arrays</li>
<li>Multi-dimensional Numpy arrays and their benefits</li>
<li>Array operations, broadcasting, indexing, and slicing</li>
<li>How to work with CSV data files using Numpy</li>
</ul>
<h3 id="heading-how-to-work-with-numerical-data-in-python">How to Work with Numerical Data in Python</h3>
<p>The "data" in <em>Data Analysis</em> typically refers to numerical data, like stock prices, sales figures, sensor measurements, sports scores, database tables, and so on. </p>
<p>The <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fnumpy.org">Numpy</a> library provides specialized data structures, functions, and other tools for numerical computing in Python. Let's work through an example to see why and how to use Numpy to work with numerical data.</p>
<p>Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples. </p>
<p>A simple approach to do this would be to formulate the relationship between the annual yield of apples (tons per hectare) and the climatic conditions like the average temperature (in degrees Fahrenheit), rainfall (in millimeters), and average relative humidity (in percentage) as a linear equation.</p>
<p><code>yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity</code></p>
<p>We're expressing the yield of apples as a weighted sum of the temperature, rainfall, and humidity. </p>
<p>This equation is an approximation, since the actual relationship may not necessarily be linear, and there may be other factors involved. But a simple linear model like this often works well in practice.</p>
<p>Based on some statistical analysis of historical data, we might come up with reasonable values for the weights <code>w1</code>, <code>w2</code>, and <code>w3</code>. Here's an example set of values:</p>
<pre><code class="lang-py">w1, w2, w3 = <span class="hljs-number">0.3</span>, <span class="hljs-number">0.2</span>, <span class="hljs-number">0.5</span>
</code></pre>
<p>Given some climate data for a region, we can now predict the yield of apples. Here's some sample data:</p>
<p><img src="https://i.imgur.com/TXPBiqv.png" alt="Image" width="846" height="330" loading="lazy"></p>
<p>To begin, we can define some variables to record climate data for a region.</p>
<pre><code class="lang-py">kanto_temp = <span class="hljs-number">73</span>
kanto_rainfall = <span class="hljs-number">67</span>
kanto_humidity = <span class="hljs-number">43</span>
</code></pre>
<p>We can now substitute these variables into the linear equation to predict the yield of apples.</p>
<pre><code class="lang-py">kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3
kanto_yield_apples
<span class="hljs-comment"># 56.8</span>

print(<span class="hljs-string">"The expected yield of apples in Kanto region is {} tons per hectare."</span>.format(kanto_yield_apples))
<span class="hljs-comment"># The expected yield of apples in Kanto region is 56.8 tons per hectare.</span>
</code></pre>
<p>To make it slightly easier to perform the above computation for multiple regions, we can represent the climate data for each region as a vector, that is a list of numbers.</p>
<pre><code class="lang-py">kanto = [<span class="hljs-number">73</span>, <span class="hljs-number">67</span>, <span class="hljs-number">43</span>]
johto = [<span class="hljs-number">91</span>, <span class="hljs-number">88</span>, <span class="hljs-number">64</span>]
hoenn = [<span class="hljs-number">87</span>, <span class="hljs-number">134</span>, <span class="hljs-number">58</span>]
sinnoh = [<span class="hljs-number">102</span>, <span class="hljs-number">43</span>, <span class="hljs-number">37</span>]
unova = [<span class="hljs-number">69</span>, <span class="hljs-number">96</span>, <span class="hljs-number">70</span>]
</code></pre>
<p>The three numbers in each vector represent the temperature, rainfall, and humidity data, respectively.</p>
<p>We can also represent the set of weights used in the formula as a vector.</p>
<pre><code class="lang-py">weights = [w1, w2, w3]
</code></pre>
<p>We can now write a function <code>crop_yield</code> to calculate the yield of apples (or any other crop) given the climate data and the respective weights.</p>
<pre><code class="lang-py"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">crop_yield</span>(<span class="hljs-params">region, weights</span>):</span>
    result = <span class="hljs-number">0</span>
    <span class="hljs-keyword">for</span> x, w <span class="hljs-keyword">in</span> zip(region, weights):
        result += x * w
    <span class="hljs-keyword">return</span> result

crop_yield(kanto, weights)
<span class="hljs-comment"># 56.8</span>

crop_yield(johto, weights)
<span class="hljs-comment"># 76.9</span>

crop_yield(unova, weights)
<span class="hljs-comment"># 74.9</span>
</code></pre>
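<p>Since every region uses the same weights, you can also collect the regions in a dictionary and call the function in a loop. This is a minimal sketch using the data above. The <code>regions</code> and <code>predicted</code> names are introduced here for illustration, and the results are rounded to one decimal place to hide floating point noise:</p>

```python
w1, w2, w3 = 0.3, 0.2, 0.5
weights = [w1, w2, w3]

# Temperature, rainfall, and humidity for each region
regions = {
    "kanto": [73, 67, 43],
    "johto": [91, 88, 64],
    "hoenn": [87, 134, 58],
    "sinnoh": [102, 43, 37],
    "unova": [69, 96, 70],
}

def crop_yield(region, weights):
    # Weighted sum of the climate values
    return sum(x * w for x, w in zip(region, weights))

predicted = {name: round(crop_yield(values, weights), 1)
             for name, values in regions.items()}

for name, value in predicted.items():
    print(name, value)
```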
<h3 id="heading-how-to-turn-python-lists-into-numpy-arrays">How to Turn Python Lists into Numpy Arrays</h3>
<p>The calculation performed by the <code>crop_yield</code> function (element-wise multiplication of two vectors followed by a sum of the results) is also called the <em>dot product</em>. Learn more about dot products <a target="_blank" href="https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/dot-cross-products/v/vector-dot-product-and-vector-length">here</a>.</p>
<p>The Numpy library provides a built-in function to compute the dot product of two vectors. However, we must first convert the lists into Numpy arrays.</p>
<p>Let's install the Numpy library using the <code>pip</code> package manager.</p>
<pre><code class="lang-py">!pip install numpy --upgrade --quiet
</code></pre>
<p>Next, let's import the <code>numpy</code> module. It's common practice to import numpy with the alias <code>np</code>.</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
</code></pre>
<p>We can now use the <code>np.array</code> function to create Numpy arrays.</p>
<pre><code class="lang-py">kanto = np.array([<span class="hljs-number">73</span>, <span class="hljs-number">67</span>, <span class="hljs-number">43</span>])

kanto
<span class="hljs-comment"># array([73, 67, 43])</span>

weights = np.array([w1, w2, w3])

weights
<span class="hljs-comment"># array([0.3, 0.2, 0.5])</span>
</code></pre>
<p>Numpy arrays have the type <code>ndarray</code>.</p>
<pre><code class="lang-py">type(kanto)
<span class="hljs-comment"># numpy.ndarray</span>

type(weights)
<span class="hljs-comment"># numpy.ndarray</span>
</code></pre>
<p>Just like lists, Numpy arrays support the indexing notation <code>[]</code>.</p>
<pre><code class="lang-py">weights[<span class="hljs-number">0</span>]
<span class="hljs-comment"># 0.3</span>

kanto[<span class="hljs-number">2</span>]
<span class="hljs-comment"># 43</span>
</code></pre>
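<p>Indexing on arrays goes a little further than a single position. As with lists, negative indices count from the end and slices return a range of elements. A short sketch (the variable names are illustrative):</p>

```python
import numpy as np

kanto = np.array([73, 67, 43])

last_value = kanto[-1]     # last element: 43
first_two = kanto[:2]      # slice with the first two elements: array([73, 67])
as_list = kanto.tolist()   # convert back to a plain Python list

print(last_value, first_two, as_list)
```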
<h3 id="heading-how-to-operate-on-numpy-arrays">How to Operate on Numpy arrays</h3>
<p>We can now compute the dot product of the two vectors using the <code>np.dot</code> function.</p>
<pre><code class="lang-py">np.dot(kanto, weights)
<span class="hljs-comment"># 56.8</span>
</code></pre>
<p>We can achieve the same result with low-level operations supported by Numpy arrays: performing an element-wise multiplication and calculating the resulting numbers' sum.</p>
<pre><code class="lang-py">(kanto * weights).sum()
<span class="hljs-comment"># 56.8</span>
</code></pre>
<p>The <code>*</code> operator performs an element-wise multiplication of two arrays if they have the same size. The <code>sum</code> method calculates the sum of numbers in an array.</p>
<pre><code class="lang-py">arr1 = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])
arr2 = np.array([<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>])

arr1 * arr2
<span class="hljs-comment"># array([ 4, 10, 18])</span>

arr2.sum()
<span class="hljs-comment"># 15</span>
</code></pre>
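<p>Arrays don't always need identical shapes for arithmetic. Numpy can <em>broadcast</em> a scalar or a smaller array across a larger one when the shapes are compatible, and it raises a <code>ValueError</code> when they aren't. The following sketch illustrates both cases with made-up arrays:</p>

```python
import numpy as np

arr1 = np.array([1, 2, 3])

# A scalar is applied to every element
doubled = arr1 * 2                      # array([2, 4, 6])

# A length-3 vector is applied to each row of a 2x3 matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
scaled = matrix * arr1                  # array([[1, 4, 9], [4, 10, 18]])

# Shapes (3,) and (2,) cannot be broadcast together
try:
    arr1 * np.array([1, 2])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

print(doubled, scaled, mismatch_error)
```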
<h3 id="heading-what-are-the-benefits-of-using-numpy-arrays">What are the Benefits of Using Numpy Arrays?</h3>
<p>Numpy arrays offer the following benefits over Python lists for operating on numerical data:</p>
<ul>
<li><strong>Ease of use</strong>: You can write small, concise, and intuitive mathematical expressions like <code>(kanto * weights).sum()</code> rather than using loops and custom functions like <code>crop_yield</code>.</li>
<li><strong>Performance</strong>: Numpy operations and functions are implemented internally in C, which makes them much faster than using Python statements and loops that are interpreted at runtime.</li>
</ul>
<p>Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with a million elements each.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Python lists</span>
arr1 = list(range(<span class="hljs-number">1000000</span>))
arr2 = list(range(<span class="hljs-number">1000000</span>, <span class="hljs-number">2000000</span>))

<span class="hljs-comment"># Numpy arrays</span>
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

%%time
result = <span class="hljs-number">0</span>
<span class="hljs-keyword">for</span> x1, x2 <span class="hljs-keyword">in</span> zip(arr1, arr2):
    result += x1*x2
result

<span class="hljs-comment"># CPU times: user 300 ms, sys: 3.26 ms, total: 303 ms</span>
<span class="hljs-comment"># Wall time: 302 ms</span>
<span class="hljs-comment"># 833332333333500000</span>

%%time
np.dot(arr1_np, arr2_np)

<span class="hljs-comment"># CPU times: user 2.11 ms, sys: 951 µs, total: 3.07 ms</span>
<span class="hljs-comment"># Wall time: 1.58 ms</span>
<span class="hljs-comment"># 833332333333500000</span>
</code></pre>
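<p>As you can see, using <code>np.dot</code> is more than 100 times faster than using a <code>for</code> loop in this run (a wall time of about 1.58 ms compared to 302 ms). The exact numbers will vary from machine to machine, but the gap makes Numpy especially useful while working with really large datasets with tens of thousands or millions of data points.</p>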
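<p>The <code>%%time</code> magic only works inside Jupyter. If you're running a plain Python script, a rough equivalent is to time each version with <code>time.perf_counter()</code>. The sketch below does that on smaller vectors of 100,000 elements (a size chosen here to keep it quick) and sets the integer type explicitly so the result doesn't depend on the platform default:</p>

```python
import time
import numpy as np

n = 100_000
list1 = list(range(n))
list2 = list(range(n, 2 * n))

arr1_np = np.array(list1, dtype=np.int64)
arr2_np = np.array(list2, dtype=np.int64)

# Dot product with a plain Python loop
start = time.perf_counter()
loop_result = 0
for x1, x2 in zip(list1, list2):
    loop_result += x1 * x2
loop_seconds = time.perf_counter() - start

# Dot product with Numpy
start = time.perf_counter()
numpy_result = int(np.dot(arr1_np, arr2_np))
numpy_seconds = time.perf_counter() - start

print(f"loop:  {loop_seconds:.6f} s")
print(f"numpy: {numpy_seconds:.6f} s")
```

<p>Both versions produce the same number. Only the time taken differs, and a single measurement like this is noisy, so treat it as a rough indication.</p>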
<h3 id="heading-multi-dimensional-numpy-arrays">Multi-Dimensional Numpy Arrays</h3>
<p>We can now go one step further and represent the climate data for all the regions using a single 2-dimensional Numpy array.</p>
<pre><code class="lang-py">climate_data = np.array([[<span class="hljs-number">73</span>, <span class="hljs-number">67</span>, <span class="hljs-number">43</span>],
                         [<span class="hljs-number">91</span>, <span class="hljs-number">88</span>, <span class="hljs-number">64</span>],
                         [<span class="hljs-number">87</span>, <span class="hljs-number">134</span>, <span class="hljs-number">58</span>],
                         [<span class="hljs-number">102</span>, <span class="hljs-number">43</span>, <span class="hljs-number">37</span>],
                         [<span class="hljs-number">69</span>, <span class="hljs-number">96</span>, <span class="hljs-number">70</span>]])

climate_data
<span class="hljs-comment"># array([[ 73,  67,  43],</span>
<span class="hljs-comment">#        [ 91,  88,  64],</span>
<span class="hljs-comment">#        [ 87, 134,  58],</span>
<span class="hljs-comment">#        [102,  43,  37],</span>
<span class="hljs-comment">#        [ 69,  96,  70]])</span>
</code></pre>
<p>If you've taken a linear algebra class in high school, you may recognize the above 2-d array as a matrix with five rows and three columns. Each row represents one region, and the columns represent temperature, rainfall, and humidity, respectively.</p>
<p>Numpy arrays can have any number of dimensions and different lengths along each dimension. We can inspect the length along each dimension using the <code>.shape</code> property of an array.</p>
<p><img src="https://fgnt.github.io/python_crashkurs_doc/_images/numpy_array_t.png" alt="Image" width="1440" height="805" loading="lazy">
<em>Source: <a target="_blank" href="https://github.com/elegant-scipy/elegant-scipy/blob/master/figures/NumPy_ndarrays_v2.png">Elegant Scipy</a></em></p>
<pre><code class="lang-py"><span class="hljs-comment"># 2D array (matrix)</span>
climate_data.shape
<span class="hljs-comment"># (5, 3)</span>

weights
<span class="hljs-comment"># array([0.3, 0.2, 0.5])</span>

<span class="hljs-comment"># 1D array (vector)</span>
weights.shape
<span class="hljs-comment"># (3,)</span>

<span class="hljs-comment"># 3D array </span>
arr3 = np.array([
    [[<span class="hljs-number">11</span>, <span class="hljs-number">12</span>, <span class="hljs-number">13</span>], 
     [<span class="hljs-number">13</span>, <span class="hljs-number">14</span>, <span class="hljs-number">15</span>]], 
    [[<span class="hljs-number">15</span>, <span class="hljs-number">16</span>, <span class="hljs-number">17</span>], 
     [<span class="hljs-number">17</span>, <span class="hljs-number">18</span>, <span class="hljs-number">19.5</span>]]])

arr3.shape
<span class="hljs-comment"># (2, 2, 3)</span>
</code></pre>
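<p>Alongside <code>.shape</code>, arrays expose a couple of related properties that are handy when you're exploring data: <code>.ndim</code> gives the number of dimensions and <code>.size</code> gives the total number of elements. A quick sketch on the climate data from above:</p>

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])

n_dims = climate_data.ndim       # number of dimensions: 2
n_elements = climate_data.size   # total number of elements: 5 * 3 = 15
n_rows = len(climate_data)       # length along the first dimension: 5

print(climate_data.shape, n_dims, n_elements, n_rows)
```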
<p>All the elements in a numpy array have the same data type. You can check the data type of an array using the <code>.dtype</code> property.</p>
<pre><code class="lang-py">weights.dtype
<span class="hljs-comment"># dtype('float64')</span>

climate_data.dtype
<span class="hljs-comment"># dtype('int64')</span>
</code></pre>
<p>If an array contains even a single floating point number, all the other elements are also converted to floats.</p>
<pre><code class="lang-py">arr3.dtype
<span class="hljs-comment"># dtype('float64')</span>
</code></pre>
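<p>You don't have to rely on this automatic conversion. As a small sketch, you can request a data type when creating an array with the <code>dtype</code> argument, or convert an existing array with <code>astype</code>. The array names below are illustrative:</p>

```python
import numpy as np

ints = np.array([1, 2, 3])               # all integers, so an integer dtype
mixed = np.array([1, 2, 3.5])            # one float, so everything becomes float64

# Ask for floats explicitly when creating the array
explicit = np.array([1, 2, 3], dtype=np.float64)

# Convert an existing array (this returns a new array)
converted = ints.astype(np.float64)

print(ints.dtype, mixed.dtype, explicit.dtype, converted.dtype)
```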
<p>We can now compute the predicted yields of apples in all the regions, using a single matrix multiplication between <code>climate_data</code> (a 5x3 matrix) and <code>weights</code> (a vector of length 3). Here's what it looks like visually:</p>
<p><img src="https://i.imgur.com/LJ2WKSI.png" alt="Image" width="578" height="334" loading="lazy"></p>
<p>You can learn about matrices and matrix multiplication by watching the first 3-4 videos of <a target="_blank" href="https://www.youtube.com/watch?v=xyAuNHPsq-g&amp;list=PLFD0EB975BA0CC1E0&amp;index=1">this YouTube playlist</a>.</p>
<p>We can use the <code>np.matmul</code> function or the <code>@</code> operator to perform matrix multiplication.</p>
<pre><code class="lang-py">np.matmul(climate_data, weights)
<span class="hljs-comment"># array([56.8, 76.9, 81.9, 57.7, 74.9])</span>

climate_data @ weights
<span class="hljs-comment"># array([56.8, 76.9, 81.9, 57.7, 74.9])</span>
</code></pre>
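<p>One way to convince yourself of what the matrix multiplication does here is to compare it with a dot product computed row by row. In this sketch, <code>yields_matmul</code> and <code>yields_rows</code> are names introduced for illustration:</p>

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# (5, 3) matrix times a length-3 vector gives a length-5 vector
yields_matmul = climate_data @ weights

# The same numbers, one region (row) at a time
yields_rows = np.array([np.dot(row, weights) for row in climate_data])

print(yields_matmul.shape)
print(np.allclose(yields_matmul, yields_rows))
```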
<h3 id="heading-how-to-work-with-csv-data-files">How to Work with CSV Data Files</h3>
<p>Numpy also provides helper functions for reading from and writing to files. Let's download a file <code>climate.txt</code>, which contains 10,000 climate measurements (temperature, rainfall, and humidity) in the following format:</p>
<pre><code>temperature,rainfall,humidity
<span class="hljs-number">25.00</span>,<span class="hljs-number">76.00</span>,<span class="hljs-number">99.00</span>
<span class="hljs-number">39.00</span>,<span class="hljs-number">65.00</span>,<span class="hljs-number">70.00</span>
<span class="hljs-number">59.00</span>,<span class="hljs-number">45.00</span>,<span class="hljs-number">77.00</span>
<span class="hljs-number">84.00</span>,<span class="hljs-number">63.00</span>,<span class="hljs-number">38.00</span>
<span class="hljs-number">66.00</span>,<span class="hljs-number">50.00</span>,<span class="hljs-number">52.00</span>
<span class="hljs-number">41.00</span>,<span class="hljs-number">94.00</span>,<span class="hljs-number">77.00</span>
<span class="hljs-number">91.00</span>,<span class="hljs-number">57.00</span>,<span class="hljs-number">96.00</span>
<span class="hljs-number">49.00</span>,<span class="hljs-number">96.00</span>,<span class="hljs-number">99.00</span>
<span class="hljs-number">67.00</span>,<span class="hljs-number">20.00</span>,<span class="hljs-number">28.00</span>
...
</code></pre><p>This format of storing data is known as <em>comma-separated values</em> or CSV.</p>
<blockquote>
<p><strong>CSVs</strong>: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)</p>
</blockquote>
<p>To read this file into a numpy array, we can use the <code>genfromtxt</code> function.</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> urllib.request

urllib.request.urlretrieve(
    <span class="hljs-string">'https://hub.jovian.ml/wp-content/uploads/2020/08/climate.csv'</span>, 
    <span class="hljs-string">'climate.txt'</span>)

climate_data = np.genfromtxt(<span class="hljs-string">'climate.txt'</span>, delimiter=<span class="hljs-string">','</span>, skip_header=<span class="hljs-number">1</span>)

climate_data
<span class="hljs-comment"># array([[25., 76., 99.],</span>
<span class="hljs-comment">#        [39., 65., 70.],</span>
<span class="hljs-comment">#        [59., 45., 77.],</span>
<span class="hljs-comment">#        ...,</span>
<span class="hljs-comment">#        [99., 62., 58.],</span>
<span class="hljs-comment">#        [70., 71., 91.],</span>
<span class="hljs-comment">#        [92., 39., 76.]])</span>

climate_data.shape
<span class="hljs-comment"># (10000, 3)</span>
</code></pre>
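<p>If you can't download the file, you can still try <code>genfromtxt</code> on a few rows. The function also accepts file-like objects, so the sketch below wraps the first three rows shown earlier in an <code>io.StringIO</code> object instead of reading from disk. The <code>csv_text</code> and <code>sample</code> names are illustrative:</p>

```python
import io
import numpy as np

# The header plus the first three rows of the sample shown above
csv_text = """temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
"""

# skip_header=1 ignores the line with the column names
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(sample)
print(sample.shape)
```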
<p>We can now perform a matrix multiplication using the <code>@</code> operator to predict the yield of apples for the entire dataset using a given set of weights.</p>
<pre><code class="lang-py">weights = np.array([<span class="hljs-number">0.3</span>, <span class="hljs-number">0.2</span>, <span class="hljs-number">0.5</span>])

yields = climate_data @ weights
yields
<span class="hljs-comment"># array([72.2, 59.7, 65.2, ..., 71.1, 80.7, 73.4])</span>

yields.shape
<span class="hljs-comment"># (10000,)</span>
</code></pre>
<p>Let's add the <code>yields</code> to <code>climate_data</code> as a fourth column using the <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fnumpy.org%2Fdoc%2Fstable%2Freference%2Fgenerated%2Fnumpy.concatenate.html"><code>np.concatenate</code></a> function.</p>
<pre><code class="lang-py">climate_results = np.concatenate((climate_data, yields.reshape(<span class="hljs-number">10000</span>, <span class="hljs-number">1</span>)), axis=<span class="hljs-number">1</span>)

climate_results
<span class="hljs-comment"># array([[25. , 76. , 99. , 72.2],</span>
<span class="hljs-comment">#        [39. , 65. , 70. , 59.7],</span>
<span class="hljs-comment">#        [59. , 45. , 77. , 65.2],</span>
<span class="hljs-comment">#        ...,</span>
<span class="hljs-comment">#        [99. , 62. , 58. , 71.1],</span>
<span class="hljs-comment">#        [70. , 71. , 91. , 80.7],</span>
<span class="hljs-comment">#        [92. , 39. , 76. , 73.4]])</span>
</code></pre>
<p>There are a couple of subtleties here:</p>
<ul>
<li>Since we wish to add new columns, we pass the argument <code>axis=1</code> to <code>np.concatenate</code>. The <code>axis</code> argument specifies the dimension for concatenation.</li>
<li>The arrays should have the same number of dimensions, and the same length along each except the dimension used for concatenation. We use the <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fnumpy.org%2Fdoc%2Fstable%2Freference%2Fgenerated%2Fnumpy.reshape.html"><code>np.reshape</code></a> function to change the shape of <code>yields</code> from <code>(10000,)</code> to <code>(10000,1)</code>.</li>
</ul>
<p>Here's a visual explanation of <code>np.concatenate</code> along <code>axis=1</code> (can you guess what <code>axis=0</code> results in?):</p>
<p><img src="https://www.w3resource.com/w3r_images/python-numpy-image-exercise-58.png" alt="Image" width="576" height="536" loading="lazy">
<em>Source: <a target="_blank" href="https://www.w3resource.com">w3resource.com</a></em></p>
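<p>A small experiment makes the difference between the two axes concrete. The sketch below uses tiny made-up arrays, and it passes <code>-1</code> to <code>reshape</code> so that Numpy works out the number of rows on its own:</p>

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

# axis=0 stacks along rows: the column counts must match
b = np.array([[5, 6]])
rows_added = np.concatenate((a, b), axis=0)                  # shape (3, 2)

# axis=1 stacks along columns: the row counts must match
c = np.array([7, 8])                                         # shape (2,)
cols_added = np.concatenate((a, c.reshape(-1, 1)), axis=1)   # shape (2, 3)

print(rows_added)
print(cols_added)
```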
<p>The best way to understand what a Numpy function does is to experiment with it and read the documentation to learn about its arguments and return values. Try <code>np.concatenate</code> and <code>np.reshape</code> on a few small arrays in your own notebook.</p>
<p>Let's write the final results from our computation above back to a file using the <code>np.savetxt</code> function.</p>
<pre><code class="lang-py">np.savetxt(<span class="hljs-string">'climate_results.txt'</span>, 
           climate_results, 
           fmt=<span class="hljs-string">'%.2f'</span>, 
           delimiter=<span class="hljs-string">','</span>,
           header=<span class="hljs-string">'temperature,rainfall,humidity,yield_apples'</span>, 
           comments=<span class="hljs-string">''</span>)
</code></pre>
<p>The results are written back in the CSV format to the file <code>climate_results.txt</code>.</p>
<pre><code>temperature,rainfall,humidity,yield_apples
<span class="hljs-number">25.00</span>,<span class="hljs-number">76.00</span>,<span class="hljs-number">99.00</span>,<span class="hljs-number">72.20</span>
<span class="hljs-number">39.00</span>,<span class="hljs-number">65.00</span>,<span class="hljs-number">70.00</span>,<span class="hljs-number">59.70</span>
<span class="hljs-number">59.00</span>,<span class="hljs-number">45.00</span>,<span class="hljs-number">77.00</span>,<span class="hljs-number">65.20</span>
<span class="hljs-number">84.00</span>,<span class="hljs-number">63.00</span>,<span class="hljs-number">38.00</span>,<span class="hljs-number">56.80</span>
...
</code></pre><p>Numpy provides hundreds of functions for performing operations on arrays. Here are some commonly used functions:</p>
<ul>
<li>Mathematics: <code>np.sum</code>, <code>np.exp</code>, <code>np.round</code>, arithmetic operators</li>
<li>Array manipulation: <code>np.reshape</code>, <code>np.stack</code>, <code>np.concatenate</code>, <code>np.split</code></li>
<li>Linear Algebra: <code>np.matmul</code>, <code>np.dot</code>, <code>np.transpose</code>, <code>np.linalg.eigvals</code></li>
<li>Statistics: <code>np.mean</code>, <code>np.median</code>, <code>np.std</code>, <code>np.max</code></li>
</ul>
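<p>To make the categories above concrete, here is a short sketch that calls one or two functions from each group on tiny example matrices. The arrays <code>a</code> and <code>b</code> are invented for this illustration. Note that the eigenvalue routine lives in the <code>np.linalg</code> submodule.</p>

```python
import numpy as np

a = np.array([[2., 0.],
              [0., 3.]])
b = np.array([[1., 2.],
              [3., 4.]])

# Mathematics
total = np.sum(b)                    # 10.0
rounded = np.round(np.exp(1.0), 2)   # 2.72

# Array manipulation: stack two (2, 2) arrays along a new first axis
stacked = np.stack((a, b))
print(stacked.shape)                 # (2, 2, 2)

# Linear algebra
product = np.matmul(a, b)            # [[2., 4.], [9., 12.]]
eigenvalues = np.linalg.eigvals(a)   # the diagonal entries, 2 and 3

# Statistics
average = np.mean(b)                 # 2.5
largest = np.max(b)                  # 4.0

print(total, rounded, average, largest)
print(product)
print(eigenvalues)
```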
<p><strong>So how do you find the function you need?</strong> The easiest way to find the right function for a specific operation or use case is to do a web search. For instance, searching for "How to join numpy arrays" leads to <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fcmdlinetips.com%2F2018%2F04%2Fhow-to-concatenate-arrays-in-numpy%2F">this tutorial on array concatenation</a>.</p>
<p>You can find a <a target="_blank" href="https://numpy.org/doc/stable/reference/routines.html">full list of array functions here</a>.</p>
<h3 id="heading-numpy-arithmetic-operations-broadcasting-and-comparison">Numpy Arithmetic Operations, Broadcasting, and Comparison</h3>
<p>Numpy arrays support arithmetic operators like <code>+</code>, <code>-</code>, <code>*</code>, etc. You can perform an arithmetic operation with a single number (also called a scalar) or with another array of the same shape. </p>
<p>Operators make it easy to write mathematical expressions with multi-dimensional arrays.</p>
<pre><code class="lang-py">arr2 = np.array([[<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>], 
                 [<span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>], 
                 [<span class="hljs-number">9</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>]])

arr3 = np.array([[<span class="hljs-number">11</span>, <span class="hljs-number">12</span>, <span class="hljs-number">13</span>, <span class="hljs-number">14</span>], 
                 [<span class="hljs-number">15</span>, <span class="hljs-number">16</span>, <span class="hljs-number">17</span>, <span class="hljs-number">18</span>], 
                 [<span class="hljs-number">19</span>, <span class="hljs-number">11</span>, <span class="hljs-number">12</span>, <span class="hljs-number">13</span>]])

<span class="hljs-comment"># Adding a scalar</span>
arr2 + <span class="hljs-number">3</span>

<span class="hljs-comment"># array([[ 4,  5,  6,  7],</span>
<span class="hljs-comment">#        [ 8,  9, 10, 11],</span>
<span class="hljs-comment">#        [12,  4,  5,  6]])</span>

<span class="hljs-comment"># Element-wise subtraction</span>
arr3 - arr2

<span class="hljs-comment"># array([[10, 10, 10, 10],</span>
<span class="hljs-comment">#        [10, 10, 10, 10],</span>
<span class="hljs-comment">#        [10, 10, 10, 10]])</span>

<span class="hljs-comment"># Division by scalar</span>
arr2 / <span class="hljs-number">2</span>

<span class="hljs-comment"># array([[0.5, 1. , 1.5, 2. ],</span>
<span class="hljs-comment">#        [2.5, 3. , 3.5, 4. ],</span>
<span class="hljs-comment">#        [4.5, 0.5, 1. , 1.5]])</span>

<span class="hljs-comment"># Element-wise multiplication</span>
arr2 * arr3

<span class="hljs-comment"># array([[ 11,  24,  39,  56],</span>
<span class="hljs-comment">#        [ 75,  96, 119, 144],</span>
<span class="hljs-comment">#        [171,  11,  24,  39]])</span>

<span class="hljs-comment"># Modulus with scalar</span>
arr2 % <span class="hljs-number">4</span>

<span class="hljs-comment"># array([[1, 2, 3, 0],</span>
<span class="hljs-comment">#        [1, 2, 3, 0],</span>
<span class="hljs-comment">#        [1, 1, 2, 3]])</span>
</code></pre>
<h4 id="heading-numpy-array-broadcasting"><strong>Numpy Array Broadcasting</strong></h4>
<p>Numpy arrays also support <em>broadcasting</em>, allowing arithmetic operations between two arrays with different numbers of dimensions but compatible shapes. Let's look at an example to see how it works.</p>
<pre><code class="lang-py">arr2 = np.array([[<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>], 
                 [<span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>], 
                 [<span class="hljs-number">9</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>]])               
arr2.shape
<span class="hljs-comment"># (3, 4)</span>

arr4 = np.array([<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>])
arr4.shape
<span class="hljs-comment"># (4,)</span>

arr2 + arr4
<span class="hljs-comment"># array([[ 5,  7,  9, 11],</span>
<span class="hljs-comment">#        [ 9, 11, 13, 15],</span>
<span class="hljs-comment">#        [13,  6,  8, 10]])</span>
</code></pre>
<p>When the expression <code>arr2 + arr4</code> is evaluated, <code>arr4</code> (which has the shape <code>(4,)</code>) is replicated three times to match the shape <code>(3, 4)</code> of <code>arr2</code>. Numpy performs the replication without actually creating three copies of the lower-dimensional array, which improves performance and uses less memory.</p>
<p><img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/02.05-broadcasting.png" alt="Image" width="432" height="324" loading="lazy">
<em>Source: <a target="_blank" href="https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html">Python Data Science Handbook</a></em></p>
<p>Broadcasting only works if one of the arrays can be replicated to match the other array's shape.</p>
<pre><code class="lang-py">arr5 = np.array([<span class="hljs-number">7</span>, <span class="hljs-number">8</span>])
arr5.shape
<span class="hljs-comment"># (2,)</span>

arr2 + arr5
<span class="hljs-comment"># ValueError: operands could not be broadcast together with shapes (3,4) (2,)</span>
</code></pre>
<p>In the above example, even if <code>arr5</code> is replicated three times, it will not match the shape of <code>arr2</code>. So <code>arr2 + arr5</code> cannot be evaluated successfully. <a target="_blank" href="https://numpy.org/doc/stable/user/basics.broadcasting.html">Learn more about broadcasting here</a>.</p>
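<p>The compatibility rule can be checked directly. Numpy compares shapes from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. The sketch below assumes NumPy 1.20 or later, where <code>np.broadcast_shapes</code> is available. The <code>column</code> array is a made-up example.</p>

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])

# A (3, 1) column is stretched across the 4 columns of arr2
column = np.array([[10], [20], [30]])
print(arr2 + column)
# [[11 12 13 14]
#  [25 26 27 28]
#  [39 31 32 33]]

# np.broadcast_shapes returns the result shape for compatible shapes
print(np.broadcast_shapes((3, 4), (4,)))    # (3, 4)
print(np.broadcast_shapes((3, 4), (3, 1)))  # (3, 4)

# Incompatible shapes raise a ValueError, just like arr2 + arr5
try:
    np.broadcast_shapes((3, 4), (2,))
except ValueError as error:
    print("Incompatible shapes:", error)
```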
<h4 id="heading-numpy-array-comparison"><strong>Numpy Array Comparison</strong></h4>
<p>Numpy arrays also support comparison operations like <code>==</code>, <code>!=</code>, <code>&gt;</code> and so on. The result is an array of booleans.</p>
<pre><code class="lang-py">arr1 = np.array([[<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>], [<span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>]])
arr2 = np.array([[<span class="hljs-number">2</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>], [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">5</span>]])

arr1 == arr2
<span class="hljs-comment"># array([[False,  True,  True],</span>
<span class="hljs-comment">#        [False, False,  True]])</span>

arr1 != arr2
<span class="hljs-comment"># array([[ True, False, False],</span>
<span class="hljs-comment">#        [ True,  True, False]])</span>

arr1 &gt;= arr2
<span class="hljs-comment"># array([[False,  True,  True],</span>
<span class="hljs-comment">#        [ True,  True,  True]])</span>

arr1 &lt; arr2
<span class="hljs-comment"># array([[ True, False, False],</span>
<span class="hljs-comment">#        [False, False, False]])</span>
</code></pre>
<p>Array comparison is frequently used to count the number of equal elements in two arrays using the <code>sum</code> method. Remember that <code>True</code> evaluates to <code>1</code> and <code>False</code> evaluates to <code>0</code> when you use booleans in arithmetic operations.</p>
<pre><code class="lang-py">(arr1 == arr2).sum()
<span class="hljs-comment"># 3</span>
</code></pre>
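<p>The same idea extends to fractions and per-row counts. Here is a small sketch using the arrays from above. The variable names <code>matches</code>, <code>matching_fraction</code>, and <code>matches_per_row</code> are chosen just for this illustration.</p>

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [3, 4, 5]])
arr2 = np.array([[2, 2, 3], [1, 2, 5]])

# Boolean array marking the positions where the two arrays agree
matches = arr1 == arr2

# Total number of equal elements
print(matches.sum())          # 3

# Fraction of equal elements: the mean of 0s and 1s
matching_fraction = matches.mean()
print(matching_fraction)      # 0.5

# Number of equal elements in each row
matches_per_row = matches.sum(axis=1)
print(matches_per_row)        # [2 1]
```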
<h3 id="heading-numpy-array-indexing-and-slicing">Numpy Array Indexing and Slicing</h3>
<p>Numpy extends Python's list indexing notation using <code>[]</code> to multiple dimensions in an intuitive fashion. You can provide a comma-separated list of indices or ranges to select a specific element or a subarray (also called a slice) from a Numpy array.</p>
<pre><code class="lang-py">arr3 = np.array([
    [[<span class="hljs-number">11</span>, <span class="hljs-number">12</span>, <span class="hljs-number">13</span>, <span class="hljs-number">14</span>], 
     [<span class="hljs-number">13</span>, <span class="hljs-number">14</span>, <span class="hljs-number">15</span>, <span class="hljs-number">19</span>]], 

    [[<span class="hljs-number">15</span>, <span class="hljs-number">16</span>, <span class="hljs-number">17</span>, <span class="hljs-number">21</span>], 
     [<span class="hljs-number">63</span>, <span class="hljs-number">92</span>, <span class="hljs-number">36</span>, <span class="hljs-number">18</span>]], 

    [[<span class="hljs-number">98</span>, <span class="hljs-number">32</span>, <span class="hljs-number">81</span>, <span class="hljs-number">23</span>],      
     [<span class="hljs-number">17</span>, <span class="hljs-number">18</span>, <span class="hljs-number">19.5</span>, <span class="hljs-number">43</span>]]])

arr3.shape
<span class="hljs-comment"># (3, 2, 4)</span>

<span class="hljs-comment"># Single element</span>
arr3[<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>]

<span class="hljs-comment"># 36.0</span>

<span class="hljs-comment"># Subarray using ranges</span>
arr3[<span class="hljs-number">1</span>:, <span class="hljs-number">0</span>:<span class="hljs-number">1</span>, :<span class="hljs-number">2</span>]

<span class="hljs-comment"># array([[[15., 16.]],</span>
<span class="hljs-comment"># </span>
<span class="hljs-comment">#        [[98., 32.]]])</span>

<span class="hljs-comment"># Mixing indices and ranges</span>
arr3[<span class="hljs-number">1</span>:, <span class="hljs-number">1</span>, <span class="hljs-number">3</span>]

<span class="hljs-comment"># array([18., 43.])</span>

arr3[<span class="hljs-number">1</span>:, <span class="hljs-number">1</span>, :<span class="hljs-number">3</span>]
<span class="hljs-comment"># array([[63. , 92. , 36. ],</span>
<span class="hljs-comment">#        [17. , 18. , 19.5]])</span>

<span class="hljs-comment"># Using fewer indices</span>
arr3[<span class="hljs-number">1</span>]

<span class="hljs-comment"># array([[15., 16., 17., 21.],</span>
<span class="hljs-comment">#        [63., 92., 36., 18.]])</span>

arr3[:<span class="hljs-number">2</span>, <span class="hljs-number">1</span>]
<span class="hljs-comment"># array([[13., 14., 15., 19.],</span>
<span class="hljs-comment">#        [63., 92., 36., 18.]])</span>

<span class="hljs-comment"># Using too many indices</span>
arr3[<span class="hljs-number">1</span>,<span class="hljs-number">3</span>,<span class="hljs-number">2</span>,<span class="hljs-number">1</span>]

<span class="hljs-comment"># IndexError: too many indices for array: array is 3-dimensional, but 4 were indexed</span>
</code></pre>
<p>The notation and its results can seem confusing at first, so take your time to experiment and become comfortable with it. </p>
<p>Try out some examples of array indexing and slicing on your own, with different combinations of indices and ranges. Here are some more examples demonstrated visually:</p>
<p><img src="https://scipy-lectures.org/_images/numpy_indexing.png" alt="Image" width="772" height="383" loading="lazy">
<em>Source: <a target="_blank" href="https://scipy-lectures.org/intro/numpy/array_object.html">Scipy Lectures</a></em></p>
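<p>One pattern that helps when reading this notation: a single index removes a dimension from the result, while a range keeps it (possibly with length 1). The sketch below uses a made-up array built with <code>np.arange</code> and the same <code>(3, 2, 4)</code> shape as above, and prints only the shapes of the results.</p>

```python
import numpy as np

# A (3, 2, 4) array holding the numbers 0 to 23
arr = np.arange(24).reshape(3, 2, 4)

# A single index drops that dimension
print(arr[1].shape)             # (2, 4)

# A range of length 1 keeps the dimension
print(arr[1:2].shape)           # (1, 2, 4)

# Mixing ranges and indices
print(arr[1:, 1, :3].shape)     # (2, 3)
print(arr[1:, 0:1, :2].shape)   # (2, 1, 2)

# Indexing every dimension returns a single element
print(arr[2, 1, 3])             # 23
```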
<h3 id="heading-how-to-create-numpy-arrays-other-methods">How to Create Numpy Arrays – Other Methods</h3>
<p>Numpy also provides some handy functions to create arrays of desired shapes with fixed or random values. Check out the <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fnumpy.org%2Fdoc%2Fstable%2Freference%2Froutines.array-creation.html">official documentation</a> or use the <code>help</code> function to learn more.</p>
<pre><code># All zeros
np.zeros((<span class="hljs-number">3</span>, <span class="hljs-number">2</span>))

# array([[<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>],
#        [<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>],
#        [<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>]])

# All ones
np.ones([<span class="hljs-number">2</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])

# array([[[<span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>],
#         [<span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>]],
#
#        [[<span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>],
#         [<span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>]]])

# Identity matrix
np.eye(<span class="hljs-number">3</span>)

# array([[<span class="hljs-number">1.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>],
#        [<span class="hljs-number">0.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">0.</span>],
#        [<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">1.</span>]])

# Random vector
np.random.rand(<span class="hljs-number">5</span>)

# array([<span class="hljs-number">0.92929562</span>, <span class="hljs-number">0.11301864</span>, <span class="hljs-number">0.64213555</span>, <span class="hljs-number">0.8600434</span> , <span class="hljs-number">0.53738656</span>])

# Random matrix
np.random.randn(<span class="hljs-number">2</span>, <span class="hljs-number">3</span>) # rand vs. randn - what's the difference?

# array([[ 0.09906435, -1.64668094,  0.08073528],
#        [ 0.1437016 ,  0.80715712,  1.27285476]])

# Fixed value
np.full([2, 3], 42)

# array([[42, 42, 42],
#        [42, 42, 42]])

# Range with start, end and step
np.arange(10, 90, 3)

# array([10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58,
#        61, 64, 67, 70, 73, 76, 79, 82, 85, 88])

# Equally spaced numbers in a range
np.linspace(3, 27, 9)

# array([ 3.,  6.,  9., 12., 15., 18., 21., 24., 27.])
</code></pre><h3 id="heading-exercises">Exercises</h3>
<p>Try the following exercises to become familiar with Numpy arrays and practice your skills:</p>
<ul>
<li>Assignment on Numpy array functions: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fjovian.ml%2Faakashns%2Fnumpy-array-operations">https://jovian.ml/aakashns/numpy-array-operations</a></li>
<li>(Optional) 100 numpy exercises: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fjovian.ml%2Faakashns%2F100-numpy-exercises">https://jovian.ml/aakashns/100-numpy-exercises</a></li>
</ul>
<h3 id="heading-summary-and-further-reading">Summary and Further Reading</h3>
<p>With this, we complete our discussion of numerical computing with Numpy. We've covered the following topics in this part of the tutorial:</p>
<ul>
<li>How to go from Python lists to Numpy arrays</li>
<li>How to operate on Numpy arrays</li>
<li>The benefits of using Numpy arrays over lists</li>
<li>Multi-dimensional Numpy arrays</li>
<li>How to work with CSV data files</li>
<li>Arithmetic operations and broadcasting</li>
<li>Array indexing and slicing</li>
<li>Other ways of creating Numpy arrays</li>
</ul>
<p>Check out the following resources for learning more about Numpy:</p>
<ul>
<li><a target="_blank" href="https://numpy.org/devdocs/user/quickstart.html">Official tutorial</a></li>
<li><a target="_blank" href="https://www.freecodecamp.org/news/the-ultimate-guide-to-the-numpy-scientific-computing-library-for-python/">Numpy course on freeCodeCamp</a></li>
<li><a target="_blank" href="http://scipy-lectures.org/advanced/advanced_numpy/index.html">Advanced Numpy (exploring the internals)</a></li>
</ul>
<h3 id="heading-review-questions-to-check-your-comprehension">Review Questions to Check Your Comprehension</h3>
<p>Try answering the following questions to test your understanding of the topics covered in this notebook:</p>
<ol>
<li>What is a vector?</li>
<li>How do you represent vectors using a Python list? Give an example.</li>
<li>What is a dot product of two vectors?</li>
<li>Write a function to compute the dot product of two vectors.</li>
<li>What is Numpy?</li>
<li>How do you install Numpy?</li>
<li>How do you import the <code>numpy</code> module?</li>
<li>What does it mean to import a module with an alias? Give an example.</li>
<li>What is the commonly used alias for <code>numpy</code>?</li>
<li>What is a Numpy array?</li>
<li>How do you create a Numpy array? Give an example.</li>
<li>What is the type of Numpy arrays?</li>
<li>How do you access the elements of a Numpy array?</li>
<li>How do you compute the dot product of two vectors using Numpy?</li>
<li>What happens if you try to compute the dot product of two vectors which have different sizes?</li>
<li>How do you compute the element-wise product of two Numpy arrays?</li>
<li>How do you compute the sum of all the elements in a Numpy array?</li>
<li>What are the benefits of using Numpy arrays over Python lists for operating on numerical data?</li>
<li>Why do Numpy array operations have better performance compared to Python functions and loops?</li>
<li>Illustrate the performance difference between Numpy array operations and Python loops using an example.</li>
<li>What are multi-dimensional Numpy arrays?</li>
<li>Illustrate how you'd create Numpy arrays with 2, 3, and 4 dimensions.</li>
<li>How do you inspect the number of dimensions and the length along each dimension in a Numpy array?</li>
<li>Can the elements of a Numpy array have different data types?</li>
<li>How do you check the data types of the elements of a Numpy array?</li>
<li>What is the data type of a Numpy array?</li>
<li>What is the difference between a matrix and a 2D Numpy array?</li>
<li>How do you perform matrix multiplication using Numpy?</li>
<li>What is the <code>@</code> operator used for in Numpy?</li>
<li>What is the CSV file format?</li>
<li>How do you read data from a CSV file using Numpy?</li>
<li>How do you concatenate two Numpy arrays?</li>
<li>What is the purpose of the <code>axis</code> argument of <code>np.concatenate</code>?</li>
<li>When are two Numpy arrays compatible for concatenation?</li>
<li>Give an example of two Numpy arrays that can be concatenated.</li>
<li>Give an example of two Numpy arrays that cannot be concatenated.</li>
<li>What is the purpose of the <code>np.reshape</code> function?</li>
<li>What does it mean to “reshape” a Numpy array?</li>
<li>How do you write a Numpy array into a CSV file?</li>
<li>Give some examples of Numpy functions for performing mathematical operations.</li>
<li>Give some examples of Numpy functions for performing array manipulation.</li>
<li>Give some examples of Numpy functions for performing linear algebra.</li>
<li>Give some examples of Numpy functions for performing statistical operations.</li>
<li>How do you find the right Numpy function for a specific operation or use case?</li>
<li>Where can you see a list of all the Numpy array functions and operations?</li>
<li>What are the arithmetic operators supported by Numpy arrays? Illustrate with examples.</li>
<li>What is array broadcasting? How is it useful? Illustrate with an example.</li>
<li>Give some examples of arrays that are compatible for broadcasting.</li>
<li>Give some examples of arrays that are not compatible for broadcasting.</li>
<li>What are the comparison operators supported by Numpy arrays? Illustrate with examples.</li>
<li>How do you access a specific subarray or slice from a Numpy array?</li>
<li>Illustrate array indexing and slicing in multi-dimensional Numpy arrays with some examples.</li>
<li>How do you create a Numpy array with a given shape containing all zeros?</li>
<li>How do you create a Numpy array with a given shape containing all ones?</li>
<li>How do you create an identity matrix of a given shape?</li>
<li>How do you create a random vector of a given length?</li>
<li>How do you create a Numpy array with a given shape with a fixed value for each element?</li>
<li>How do you create a Numpy array with a given shape containing randomly initialized elements?</li>
<li>What is the difference between <code>np.random.rand</code> and <code>np.random.randn</code>? Illustrate with examples.</li>
<li>What is the difference between <code>np.arange</code> and <code>np.linspace</code>? Illustrate with examples.</li>
</ol>
<p>You are ready to move on to the next section of this tutorial.</p>
<h2 id="heading-how-to-analyze-tabular-data-using-python-and-pandas">How to Analyze Tabular Data using Python and Pandas</h2>
<p><img src="https://i.imgur.com/zfxLzEv.png" alt="Image" width="3175" height="1414" loading="lazy"></p>
<p>Follow along and run the code here: <a target="_blank" href="https://jovian.ai/aakashns/python-pandas-data-analysis">https://jovian.ai/aakashns/python-pandas-data-analysis</a>.</p>
<p>This section covers the following topics:</p>
<ul>
<li>How to read a CSV file into a Pandas data frame</li>
<li>How to retrieve data from Pandas data frames</li>
<li>How to query, sort, and analyze data</li>
<li>How to merge, group, and aggregate data</li>
<li>How to extract useful information from dates</li>
<li>Basic plotting using line and bar charts</li>
<li>How to write data frames to CSV files</li>
</ul>
<h3 id="heading-how-to-read-a-csv-file-using-pandas">How to Read a CSV File Using Pandas</h3>
<p><a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fpandas.pydata.org%2F">Pandas</a> is a popular Python library used for working with tabular data (similar to the data stored in a spreadsheet). It provides helper functions to read data from various file formats like CSV, Excel spreadsheets, HTML tables, JSON, SQL, and more. </p>
<p>Let's download a file <code>italy-covid-daywise.csv</code> which contains day-wise Covid-19 data for Italy in the following format:</p>
<pre><code>date,new_cases,new_deaths,new_tests
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-21</span>,<span class="hljs-number">2256.0</span>,<span class="hljs-number">454.0</span>,<span class="hljs-number">28095.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-22</span>,<span class="hljs-number">2729.0</span>,<span class="hljs-number">534.0</span>,<span class="hljs-number">44248.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-23</span>,<span class="hljs-number">3370.0</span>,<span class="hljs-number">437.0</span>,<span class="hljs-number">37083.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-24</span>,<span class="hljs-number">2646.0</span>,<span class="hljs-number">464.0</span>,<span class="hljs-number">95273.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-25</span>,<span class="hljs-number">3021.0</span>,<span class="hljs-number">420.0</span>,<span class="hljs-number">38676.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-26</span>,<span class="hljs-number">2357.0</span>,<span class="hljs-number">415.0</span>,<span class="hljs-number">24113.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-27</span>,<span class="hljs-number">2324.0</span>,<span class="hljs-number">260.0</span>,<span class="hljs-number">26678.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-28</span>,<span class="hljs-number">1739.0</span>,<span class="hljs-number">333.0</span>,<span class="hljs-number">37554.0</span>
...
</code></pre><p>This format of storing data is known as <em>comma-separated values</em> or CSV. Here's a reminder in case you need a definition of what the CSV format is:</p>
<blockquote>
<p><strong>CSVs</strong>: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)</p>
</blockquote>
<p>We'll download this file using the <code>urlretrieve</code> function from the <code>urllib.request</code> module.</p>
<pre><code class="lang-py"><span class="hljs-keyword">from</span> urllib.request <span class="hljs-keyword">import</span> urlretrieve

urlretrieve(<span class="hljs-string">'https://hub.jovian.ml/wp-content/uploads/2020/09/italy-covid-daywise.csv'</span>, <span class="hljs-string">'italy-covid-daywise.csv'</span>)
</code></pre>
<p>To read the file, we can use the <code>read_csv</code> function from Pandas. First, let's install the Pandas library.</p>
<pre><code class="lang-py">!pip install pandas --upgrade --quiet
</code></pre>
<p>We can now import the <code>pandas</code> module. As a convention, it is imported with the alias <code>pd</code>.</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

covid_df = pd.read_csv(<span class="hljs-string">'italy-covid-daywise.csv'</span>)
</code></pre>
<p>Data from the file is read and stored in a <code>DataFrame</code> object – one of the core data structures in Pandas for storing and working with tabular data. We typically use the <code>_df</code> suffix in the variable names for dataframes.</p>
<pre><code class="lang-py">type(covid_df)
<span class="hljs-comment"># pandas.core.frame.DataFrame</span>

covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-108.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Here's what we can tell by looking at the dataframe:</p>
<ul>
<li>The file provides three day-wise counts for COVID-19 in Italy</li>
<li>The metrics reported are new cases, deaths, and tests</li>
<li>Data is provided for 248 days: from Dec 31, 2019, to Sep 3, 2020</li>
</ul>
<p>Keep in mind that these are officially reported numbers. The actual number of cases and deaths may be higher, as not all cases are diagnosed.</p>
<p>We can view some basic information about the data frame using the <code>.info</code> method.</p>
<pre><code class="lang-py">covid_df.info()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-109.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>It appears that each column contains values of a specific data type. You can view statistical information for numerical columns (mean, standard deviation, minimum/maximum values, and the number of non-empty values) using the <code>.describe</code> method.</p>
<pre><code class="lang-py">covid_df.describe()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-110.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The <code>columns</code> property contains the list of columns within the data frame.</p>
<pre><code class="lang-py">covid_df.columns
<span class="hljs-comment"># Index(['date', 'new_cases', 'new_deaths', 'new_tests'], dtype='object')</span>
</code></pre>
<p>You can also retrieve the number of rows and columns in the data frame using the <code>.shape</code> property.</p>
<pre><code class="lang-py">covid_df.shape
<span class="hljs-comment"># (248, 4)</span>
</code></pre>
<p>Here's a summary of the functions and methods we've looked at so far:</p>
<ul>
<li><code>pd.read_csv</code> – Read data from a CSV file into a Pandas <code>DataFrame</code> object</li>
<li><code>.info()</code> – View basic information about rows, columns, and data types</li>
<li><code>.describe()</code> – View statistical information about numeric columns</li>
<li><code>.columns</code> – Get the list of column names</li>
<li><code>.shape</code> – Get the number of rows and columns as a tuple</li>
</ul>
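<p>If you'd like to try these without downloading the file, here is a minimal sketch that reads a few of the sample rows shown earlier from an in-memory string. Using <code>io.StringIO</code> and the names <code>sample_csv</code> and <code>sample_df</code> are choices made for this illustration. The tutorial itself reads the full file from disk.</p>

```python
import io

import pandas as pd

# A few rows in the same format as italy-covid-daywise.csv
sample_csv = io.StringIO(
    "date,new_cases,new_deaths,new_tests\n"
    "2020-04-21,2256.0,454.0,28095.0\n"
    "2020-04-22,2729.0,534.0,44248.0\n"
    "2020-04-23,3370.0,437.0,37083.0\n"
)

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(sample_csv)

# Number of rows and columns
print(sample_df.shape)           # (3, 4)

# Column names
print(list(sample_df.columns))   # ['date', 'new_cases', 'new_deaths', 'new_tests']

# Rows, columns, and data types
sample_df.info()

# Summary statistics for the numeric columns
print(sample_df.describe())
```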
<h3 id="heading-how-to-retrieve-data-from-a-data-frame-in-pandas">How to Retrieve Data from a Data Frame in Pandas</h3>
<p>The first thing you might want to do is retrieve data from this data frame, like the counts of a specific day or the list of values in a particular column. </p>
<p>To do this, you should understand the internal representation of data in a data frame. Conceptually, you can think of a dataframe as a dictionary of lists: keys are column names, and values are lists/arrays containing data for the respective columns.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Pandas format is similar to this</span>
covid_data_dict = {
    <span class="hljs-string">'date'</span>:       [<span class="hljs-string">'2020-08-30'</span>, <span class="hljs-string">'2020-08-31'</span>, <span class="hljs-string">'2020-09-01'</span>, <span class="hljs-string">'2020-09-02'</span>, <span class="hljs-string">'2020-09-03'</span>],
    <span class="hljs-string">'new_cases'</span>:  [<span class="hljs-number">1444</span>, <span class="hljs-number">1365</span>, <span class="hljs-number">996</span>, <span class="hljs-number">975</span>, <span class="hljs-number">1326</span>],
    <span class="hljs-string">'new_deaths'</span>: [<span class="hljs-number">1</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>, <span class="hljs-number">8</span>, <span class="hljs-number">6</span>],
    <span class="hljs-string">'new_tests'</span>: [<span class="hljs-number">53541</span>, <span class="hljs-number">42583</span>, <span class="hljs-number">54395</span>, <span class="hljs-literal">None</span>, <span class="hljs-literal">None</span>]
}
</code></pre>
<p>Representing data in the above format has a few benefits:</p>
<ul>
<li>All values in a column typically have the same type, so it's more efficient to store them in a single array.</li>
<li>Retrieving the values for a particular row simply requires extracting the elements at a given index from each column array.</li>
<li>The representation is more compact (column names are recorded only once) compared to other formats that use a dictionary for each row of data (see the example below).</li>
</ul>
<pre><code class="lang-py"><span class="hljs-comment"># Pandas format is not similar to this</span>
covid_data_list = [
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-08-30'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">1444</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'new_tests'</span>: <span class="hljs-number">53541</span>},
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-08-31'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">1365</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">4</span>, <span class="hljs-string">'new_tests'</span>: <span class="hljs-number">42583</span>},
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-09-01'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">996</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">6</span>, <span class="hljs-string">'new_tests'</span>: <span class="hljs-number">54395</span>},
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-09-02'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">975</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">8</span> },
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-09-03'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">1326</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">6</span>},
]
</code></pre>
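<p>To make the comparison concrete, here is a minimal sketch that builds a data frame from each representation. The two-row sample is shortened from the data above, and the variable names <code>columns</code>, <code>rows</code>, <code>df_from_columns</code>, and <code>df_from_rows</code> are only illustrative.</p>

```python
import pandas as pd

# Column-oriented: one list per column (shortened sample from above)
columns = {
    'date': ['2020-09-01', '2020-09-02'],
    'new_cases': [996, 975],
    'new_tests': [54395, None],
}

# Row-oriented: one dictionary per row, with a missing key in the second row
rows = [
    {'date': '2020-09-01', 'new_cases': 996, 'new_tests': 54395},
    {'date': '2020-09-02', 'new_cases': 975},
]

df_from_columns = pd.DataFrame(columns)
df_from_rows = pd.DataFrame(rows)

# Both frames have the same columns, and the missing value shows up as NaN
print(df_from_columns)
print(df_from_rows)
```

<p>Pandas accepts both inputs, but it stores the result column by column either way.</p>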
<p>With the dictionary of lists analogy in mind, you can now guess how to retrieve data from a data frame. For example, we can get a list of values from a specific column using the <code>[]</code> indexing notation.</p>
<pre><code class="lang-py">covid_data_dict[<span class="hljs-string">'new_cases'</span>]
<span class="hljs-comment"># [1444, 1365, 996, 975, 1326]</span>

covid_df[<span class="hljs-string">'new_cases'</span>]
<span class="hljs-comment"># 0         0.0</span>
<span class="hljs-comment"># 1         0.0</span>
<span class="hljs-comment"># 2         0.0</span>
<span class="hljs-comment"># 3         0.0</span>
<span class="hljs-comment"># 4         0.0</span>
<span class="hljs-comment">#         ...  </span>
<span class="hljs-comment"># 243    1444.0</span>
<span class="hljs-comment"># 244    1365.0</span>
<span class="hljs-comment"># 245     996.0</span>
<span class="hljs-comment"># 246     975.0</span>
<span class="hljs-comment"># 247    1326.0</span>
<span class="hljs-comment"># Name: new_cases, Length: 248, dtype: float64</span>
</code></pre>
<p>Each column is represented using a data structure called <code>Series</code>, which is essentially a numpy array with some extra methods and properties.</p>
<pre><code class="lang-py">type(covid_df[<span class="hljs-string">'new_cases'</span>])
<span class="hljs-comment"># pandas.core.series.Series</span>
</code></pre>
<p>Like arrays, you can retrieve a specific value from a series using the indexing notation <code>[]</code>.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'new_cases'</span>][<span class="hljs-number">246</span>]
<span class="hljs-comment"># 975.0</span>

covid_df[<span class="hljs-string">'new_tests'</span>][<span class="hljs-number">240</span>]
<span class="hljs-comment"># 57640.0</span>
</code></pre>
<p>Pandas also provides the <code>.at</code> method to retrieve the element at a specific row &amp; column directly.</p>
<pre><code class="lang-py">covid_df.at[<span class="hljs-number">246</span>, <span class="hljs-string">'new_cases'</span>]
<span class="hljs-comment"># 975.0</span>

covid_df.at[<span class="hljs-number">240</span>, <span class="hljs-string">'new_tests'</span>]
<span class="hljs-comment"># 57640.0</span>
</code></pre>
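<p>As a small sketch of how these lookups relate, the following uses a three-row toy frame (the index labels and values are copied from the sample above, and the frame name <code>df</code> is an assumption). It also shows <code>.loc</code> with a row and column pair, which returns the same value.</p>

```python
import pandas as pd

df = pd.DataFrame({
    'new_cases': [996.0, 975.0, 1326.0],
    'new_tests': [54395.0, None, None],
}, index=[245, 246, 247])

chained = df['new_cases'][246]       # column first, then row label
with_at = df.at[246, 'new_cases']    # row label and column together
with_loc = df.loc[246, 'new_cases']  # .loc also accepts a (row, column) pair

print(chained, with_at, with_loc)
```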
<p>Instead of using the indexing notation <code>[]</code>, Pandas also allows accessing columns as properties of the dataframe using the <code>.</code> notation. However, this only works for columns whose names are valid Python identifiers (no spaces or special characters) and do not clash with an existing attribute or method of the data frame.</p>
<pre><code class="lang-py">covid_df.new_cases
<span class="hljs-comment"># 0         0.0</span>
<span class="hljs-comment"># 1         0.0</span>
<span class="hljs-comment"># 2         0.0</span>
<span class="hljs-comment"># 3         0.0</span>
<span class="hljs-comment"># 4         0.0</span>
<span class="hljs-comment">#         ...  </span>
<span class="hljs-comment"># 243    1444.0</span>
<span class="hljs-comment"># 244    1365.0</span>
<span class="hljs-comment"># 245     996.0</span>
<span class="hljs-comment"># 246     975.0</span>
<span class="hljs-comment"># 247    1326.0</span>
<span class="hljs-comment"># Name: new_cases, Length: 248, dtype: float64</span>
</code></pre>
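<p>Here is a brief sketch of where the <code>.</code> notation stops working. The frame below is made up for illustration: one column name contains a space, and another is named <code>count</code>, which is also the name of a data frame method.</p>

```python
import pandas as pd

df = pd.DataFrame({
    'new_cases': [996, 975],
    'new tests': [54395, 41000],   # made-up values; the name contains a space
    'count': [1, 2],               # name clashes with the DataFrame.count method
})

print(df.new_cases.tolist())       # dot notation works for a simple name
print(df['new tests'].tolist())    # a name with a space needs []
print(callable(df.count))          # df.count is the method, not the column
print(df['count'].tolist())        # [] always returns the column
```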
<p>Further, you can also pass a list of columns within the indexing notation <code>[]</code> to access a subset of the data frame with just the given columns.</p>
<pre><code class="lang-py">cases_df = covid_df[[<span class="hljs-string">'date'</span>, <span class="hljs-string">'new_cases'</span>]]
cases_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-111.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The new data frame <code>cases_df</code> contains just the selected columns. Depending on the operation and the Pandas version, a selection can be a "view" that shares memory with the original data frame or an independent copy. Selecting a list of columns, as we did here, generally gives you a copy, so you should not rely on changes made to <code>cases_df</code> showing up in <code>covid_df</code>. </p>
<p>Where Pandas can share data between data frames, it avoids the overhead of copying thousands or millions of rows every time you create a new data frame by operating on an existing one.</p>
<p>Sometimes you might need a full copy of the data frame, in which case you can use the <code>copy</code> method.</p>
<pre><code class="lang-py">covid_df_copy = covid_df.copy()
</code></pre>
<p>The data within <code>covid_df_copy</code> is completely separate from <code>covid_df</code>, and changing values inside one of them will not affect the other.</p>
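<p>A quick way to convince yourself of this is to change a value in the copy and check the original. The sketch below uses a tiny stand-in frame rather than <code>covid_df</code>; the names <code>original</code> and <code>duplicate</code> are illustrative.</p>

```python
import pandas as pd

original = pd.DataFrame({'new_cases': [996.0, 975.0]})
duplicate = original.copy()

# Change a value in the copy only
duplicate.at[0, 'new_cases'] = 0.0

print(original.at[0, 'new_cases'])   # still the original value
print(duplicate.at[0, 'new_cases'])  # the modified value
```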
<p>To access a specific row of data, Pandas provides the <code>.loc</code> method.</p>
<pre><code class="lang-py">covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-112.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">covid_df.loc[<span class="hljs-number">243</span>]
<span class="hljs-comment"># date          2020-08-30</span>
<span class="hljs-comment"># new_cases         1444.0</span>
<span class="hljs-comment"># new_deaths           1.0</span>
<span class="hljs-comment"># new_tests        53541.0</span>
<span class="hljs-comment"># Name: 243, dtype: object</span>
</code></pre>
<p>Each retrieved row is also a <code>Series</code> object.</p>
<pre><code class="lang-py">type(covid_df.loc[<span class="hljs-number">243</span>])
<span class="hljs-comment"># pandas.core.series.Series</span>
</code></pre>
<p>We can use the <code>.head</code> and <code>.tail</code> methods to view the first or last few rows of data.</p>
<pre><code class="lang-py">covid_df.head(<span class="hljs-number">5</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-113.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">covid_df.tail(<span class="hljs-number">4</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-114.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Notice above that while the first few values in the <code>new_cases</code> and <code>new_deaths</code> columns are <code>0</code>, the corresponding values within the <code>new_tests</code> column are <code>NaN</code>. That is because the CSV file does not contain any data for the <code>new_tests</code> column for specific dates (you can verify this by looking into the file). These values may be missing or unknown.</p>
<pre><code class="lang-py">covid_df.at[<span class="hljs-number">0</span>, <span class="hljs-string">'new_tests'</span>]
<span class="hljs-comment"># nan</span>

type(covid_df.at[<span class="hljs-number">0</span>, <span class="hljs-string">'new_tests'</span>])
<span class="hljs-comment"># numpy.float64</span>
</code></pre>
<p>The distinction between <code>0</code> and <code>NaN</code> is subtle but important. In this dataset, <code>NaN</code> indicates that daily test numbers were not reported on specific dates. Italy started reporting daily tests on Apr 19, 2020, by which point it had already conducted 935,310 tests.</p>
<p>We can find the first index that doesn't contain a <code>NaN</code> value using a column's <code>first_valid_index</code> method.</p>
<pre><code class="lang-py">covid_df.new_tests.first_valid_index()
<span class="hljs-comment"># 111</span>
</code></pre>
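<p>The same idea on a small series, as a sketch: the index labels and the single reported value below are chosen for illustration and are not taken from the dataset. <code>last_valid_index</code> and <code>isna</code> are related helpers that are not used elsewhere in this tutorial.</p>

```python
import numpy as np
import pandas as pd

tests = pd.Series([np.nan, np.nan, 54395.0, np.nan], index=[109, 110, 111, 112])

print(tests.first_valid_index())  # label of the first non-NaN entry
print(tests.last_valid_index())   # label of the last non-NaN entry
print(tests.isna().sum())         # number of missing entries
```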
<p>Let's look at a few rows before and after this index to verify that the values change from <code>NaN</code> to actual numbers. We can do this by passing a range to <code>loc</code>.</p>
<pre><code class="lang-py">covid_df.loc[<span class="hljs-number">108</span>:<span class="hljs-number">113</span>]
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-115.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can use the <code>.sample</code> method to retrieve a random sample of rows from the data frame.</p>
<pre><code class="lang-py">covid_df.sample(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-116.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Notice that even though we have taken a random sample, each row's original index is preserved. This is a useful property of data frames.</p>
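<p>Because the labels are preserved, label-based and position-based lookups can differ on a sampled frame. The following sketch assumes a five-row toy frame with the labels 243 to 247; <code>random_state=0</code> is an arbitrary choice that only makes the sample repeatable.</p>

```python
import pandas as pd

df = pd.DataFrame(
    {'new_cases': [1444.0, 1365.0, 996.0, 975.0, 1326.0]},
    index=[243, 244, 245, 246, 247],
)

# random_state makes the sample repeatable; the value 0 is arbitrary
subset = df.sample(3, random_state=0)

first_label = subset.index[0]
# .iloc selects by position, .loc selects by the preserved label
by_position = subset.iloc[0]['new_cases']
by_label = subset.loc[first_label, 'new_cases']

print(first_label, by_position, by_label)
```

<p>Both lookups return the same row here, and the label still points at the same row in the original frame.</p>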
<p>Here's a summary of the functions and methods we looked at in this section:</p>
<ul>
<li><code>covid_df['new_cases']</code> – Retrieving columns as a <code>Series</code> using the column name</li>
<li><code>new_cases[243]</code> – Retrieving values from a <code>Series</code> using an index</li>
<li><code>covid_df.at[243, 'new_cases']</code> – Retrieving a single value from a data frame</li>
<li><code>covid_df.copy()</code> – Creating a deep copy of a data frame</li>
<li><code>covid_df.loc[243]</code> – Retrieving a row or range of rows of data from the data frame</li>
<li><code>head</code>, <code>tail</code>, and <code>sample</code> – Retrieving multiple rows of data from the data frame</li>
<li><code>covid_df.new_tests.first_valid_index</code> – Finding the first non-empty index in a series</li>
</ul>
<h3 id="heading-how-to-analyze-data-from-data-frames-in-pandas">How to Analyze Data from Data Frames in Pandas</h3>
<p>Let's try to answer some questions about our data.</p>
<p><strong>Q: What is the total number of reported cases and deaths related to Covid-19 in Italy?</strong></p>
<p>Similar to Numpy arrays, a Pandas series supports the <code>sum</code> method to answer these questions.</p>
<pre><code class="lang-py">total_cases = covid_df.new_cases.sum()
total_deaths = covid_df.new_deaths.sum()

print(<span class="hljs-string">'The number of reported cases is {} and the number of reported deaths is {}.'</span>.format(int(total_cases), int(total_deaths)))
<span class="hljs-comment"># The number of reported cases is 271515 and the number of reported deaths is 35497.</span>
</code></pre>
<p><strong>Q: What is the overall death rate (ratio of reported deaths to reported cases)?</strong></p>
<pre><code class="lang-py">death_rate = covid_df.new_deaths.sum() / covid_df.new_cases.sum()

print(<span class="hljs-string">"The overall reported death rate in Italy is {:.2f} %."</span>.format(death_rate*<span class="hljs-number">100</span>))
<span class="hljs-comment"># The overall reported death rate in Italy is 13.07 %.</span>
</code></pre>
<p><strong>Q: What is the overall number of tests conducted? A total of 935,310 tests were conducted before daily test numbers were reported.</strong></p>
<pre><code class="lang-py">initial_tests = <span class="hljs-number">935310</span>
total_tests = initial_tests + covid_df.new_tests.sum()

total_tests
<span class="hljs-comment"># 5214766.0</span>
</code></pre>
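<p>The sum above works even though <code>new_tests</code> contains <code>NaN</code> values, because <code>sum</code> skips missing values by default. Here is a minimal sketch on a three-value series (the first entry is deliberately missing, and the variable names are illustrative):</p>

```python
import numpy as np
import pandas as pd

new_tests = pd.Series([np.nan, 53541.0, 42583.0])

total_default = new_tests.sum()             # NaN is skipped by default
total_strict = new_tests.sum(skipna=False)  # NaN propagates into the result

print(total_default, total_strict)
```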
<p><strong>Q: What fraction of tests returned a positive result?</strong></p>
<pre><code class="lang-py">positive_rate = total_cases / total_tests

print(<span class="hljs-string">'{:.2f}% of tests in Italy led to a positive diagnosis.'</span>.format(positive_rate*<span class="hljs-number">100</span>))
<span class="hljs-comment"># 5.21% of tests in Italy led to a positive diagnosis.</span>
</code></pre>
<p>Try asking and answering some more questions about the data.</p>
<h3 id="heading-how-to-query-and-sort-rows-in-pandas">How to Query and Sort Rows in Pandas</h3>
<p>Let's say we only want to look at the days which had more than 1,000 reported cases. We can use a boolean expression to check which rows satisfy this criterion.</p>
<pre><code class="lang-py">high_new_cases = covid_df.new_cases &gt; <span class="hljs-number">1000</span>

high_new_cases
<span class="hljs-comment"># 0      False</span>
<span class="hljs-comment"># 1      False</span>
<span class="hljs-comment"># 2      False</span>
<span class="hljs-comment"># 3      False</span>
<span class="hljs-comment"># 4      False</span>
<span class="hljs-comment">#        ...  </span>
<span class="hljs-comment"># 243     True</span>
<span class="hljs-comment"># 244     True</span>
<span class="hljs-comment"># 245    False</span>
<span class="hljs-comment"># 246    False</span>
<span class="hljs-comment"># 247     True</span>
<span class="hljs-comment"># Name: new_cases, Length: 248, dtype: bool</span>
</code></pre>
<p>The boolean expression returns a series containing <code>True</code> and <code>False</code> boolean values. You can use this series to select a subset of rows from the original dataframe, corresponding to the <code>True</code> values in the series.</p>
<pre><code class="lang-py">covid_df[high_new_cases]
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-117.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can write this more succinctly on a single line by passing the boolean expression directly as an index to the data frame.</p>
<pre><code class="lang-py">high_cases_df = covid_df[covid_df.new_cases &gt; <span class="hljs-number">1000</span>]

high_cases_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-118.png" alt="Image" width="600" height="400" loading="lazy"></p>
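<p>Boolean expressions can also be combined. The sketch below reuses the five-row sample from earlier and assumes a second, arbitrary threshold of more than 5 deaths. Note that each condition needs its own parentheses, and that <code>&amp;</code> is used rather than <code>and</code>.</p>

```python
import pandas as pd

df = pd.DataFrame({
    'new_cases': [1444, 1365, 996, 975, 1326],
    'new_deaths': [1, 4, 6, 8, 6],
})

# Rows where both conditions hold; the thresholds are only for illustration
busy_days = df[(df.new_cases > 1000) & (df.new_deaths > 5)]

print(busy_days)
```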
<p>The data frame contains 72 rows, but only the first &amp; last five rows are displayed by default with Jupyter for brevity. We can change some display options to view all the rows.</p>
<pre><code class="lang-py"><span class="hljs-keyword">from</span> IPython.display <span class="hljs-keyword">import</span> display
<span class="hljs-keyword">with</span> pd.option_context(<span class="hljs-string">'display.max_rows'</span>, <span class="hljs-number">100</span>):
    display(covid_df[covid_df.new_cases &gt; <span class="hljs-number">1000</span>])
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-119.png" alt="Image" width="600" height="400" loading="lazy">
<em>This is just part of the data frame. Check out the rest <a target="_blank" href="https://jovian.ai/embed?url=https://jovian.ai/aakashns/python-pandas-data-analysis">here</a>.</em></p>
<p>We can also formulate more complex queries that involve multiple columns. As an example, let's try to determine the days when the ratio of cases reported to tests conducted is higher than the overall <code>positive_rate</code>.</p>
<pre><code class="lang-py">positive_rate
<span class="hljs-comment"># 0.05206657403227681</span>

high_ratio_df = covid_df[covid_df.new_cases / covid_df.new_tests &gt; positive_rate]

high_ratio_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-120.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The result of performing an operation on two columns is a new series.</p>
<pre><code class="lang-py">covid_df.new_cases / covid_df.new_tests
<span class="hljs-comment"># 0           NaN</span>
<span class="hljs-comment"># 1           NaN</span>
<span class="hljs-comment"># 2           NaN</span>
<span class="hljs-comment"># 3           NaN</span>
<span class="hljs-comment"># 4           NaN</span>
<span class="hljs-comment">#          ...   </span>
<span class="hljs-comment"># 243    0.026970</span>
<span class="hljs-comment"># 244    0.032055</span>
<span class="hljs-comment"># 245    0.018311</span>
<span class="hljs-comment"># 246         NaN</span>
<span class="hljs-comment"># 247         NaN</span>
<span class="hljs-comment"># Length: 248, dtype: float64</span>
</code></pre>
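<p>One consequence worth noting is that any comparison with <code>NaN</code> evaluates to <code>False</code>, so rows with missing test counts silently drop out of a query like the one above. A small sketch, assuming a two-row frame and an arbitrary threshold of <code>0.01</code>:</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'new_cases': [996.0, 975.0],
    'new_tests': [54395.0, np.nan],
})

ratio = df.new_cases / df.new_tests
mask = ratio > 0.01   # 0.01 is an arbitrary threshold for the illustration

print(mask.tolist())
print(df[mask])
```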
<p>We can use this series to add a new column to the data frame.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'positive_rate'</span>] = covid_df.new_cases / covid_df.new_tests

covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-121.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>However, keep in mind that sometimes it takes a few days to get the results for a test, so we can't compare the number of new cases with the number of tests conducted on the same day. Any inference based on this <code>positive_rate</code> column is likely to be incorrect. </p>
<p>It's essential to watch out for such subtle relationships that are often not conveyed within the CSV file and require some external context. It's always a good idea to read through the documentation provided with the dataset or ask for more information.</p>
<p>For now, let's remove the <code>positive_rate</code> column using the <code>drop</code> method.</p>
<pre><code class="lang-py">covid_df.drop(columns=[<span class="hljs-string">'positive_rate'</span>], inplace=<span class="hljs-literal">True</span>)
</code></pre>
<p>Can you figure out the purpose of the <code>inplace</code> argument?</p>
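<p>If you would like to check your answer, this sketch compares the two behaviours on a toy frame (the values are made up):</p>

```python
import pandas as pd

df = pd.DataFrame({'new_cases': [996, 975], 'positive_rate': [0.018, 0.020]})

# Default: drop returns a new data frame and leaves df unchanged
trimmed = df.drop(columns=['positive_rate'])
print('positive_rate' in df.columns, 'positive_rate' in trimmed.columns)

# inplace=True modifies df itself and returns None
result = df.drop(columns=['positive_rate'], inplace=True)
print(result, 'positive_rate' in df.columns)
```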
<h4 id="heading-how-to-sort-rows-using-column-values-in-pandas"><strong>How to Sort Rows Using Column Values in Pandas</strong></h4>
<p>You can also sort the rows by a specific column using <code>.sort_values</code>. Let's sort to identify the days with the highest number of cases, then chain it with the <code>head</code> method to list just the first ten results.</p>
<pre><code class="lang-py">covid_df.sort_values(<span class="hljs-string">'new_cases'</span>, ascending=<span class="hljs-literal">False</span>).head(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-122.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>It looks like the last two weeks of March had the highest number of daily cases. Let's compare this to the days where the highest number of deaths were recorded.</p>
<pre><code class="lang-py">covid_df.sort_values(<span class="hljs-string">'new_deaths'</span>, ascending=<span class="hljs-literal">False</span>).head(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-123.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>It appears that daily deaths hit a peak just about a week after the peak in daily new cases.</p>
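<p>As an aside, Pandas also offers <code>nlargest</code> and <code>nsmallest</code>, which combine the sort and <code>head</code> steps. This is an alternative to the approach used in this tutorial, sketched here on the five-row sample from earlier:</p>

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2020-08-30', '2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03'],
    'new_cases': [1444, 1365, 996, 975, 1326],
})

top_sorted = df.sort_values('new_cases', ascending=False).head(2)
top_direct = df.nlargest(2, 'new_cases')   # same rows as top_sorted
lowest = df.nsmallest(1, 'new_cases')      # the row with the fewest cases

print(top_direct)
print(lowest)
```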
<p>Let's also look at the days with the smallest number of cases. We might expect to see the first few days of the year on this list.</p>
<pre><code class="lang-py">covid_df.sort_values(<span class="hljs-string">'new_cases'</span>).head(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-124.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>It seems like the count of new cases on Jun 20, 2020, was <code>-148</code>, a negative number! Not something we might have expected, but that's the nature of real-world data. It could be a data entry error, or the government may have issued a correction to account for miscounting in the past. </p>
<p>Can you dig through news articles online and figure out why the number was negative?</p>
<p>Let's look at some days before and after Jun 20, 2020.</p>
<pre><code class="lang-py">covid_df.loc[<span class="hljs-number">169</span>:<span class="hljs-number">175</span>]
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-125.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>For now, let's assume this was indeed a data entry error. We can use one of the following approaches for dealing with the missing or faulty value:</p>
<ol>
<li>Replace it with <code>0</code></li>
<li>Replace it with the average of the entire column</li>
<li>Replace it with the average of the values on the previous and next date</li>
<li>Discard the row entirely</li>
</ol>
<p>Which approach you pick requires some context about the data and the problem. In this case, since we are dealing with data ordered by date, we can go ahead with the third approach.</p>
<p>You can use the <code>.at</code> method to modify a specific value within the dataframe.</p>
<pre><code class="lang-py">covid_df.at[<span class="hljs-number">172</span>, <span class="hljs-string">'new_cases'</span>] = (covid_df.at[<span class="hljs-number">171</span>, <span class="hljs-string">'new_cases'</span>] + covid_df.at[<span class="hljs-number">173</span>, <span class="hljs-string">'new_cases'</span>])/<span class="hljs-number">2</span>
</code></pre>
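<p>The same replacement on a toy frame makes the arithmetic easy to check. The three values below are made up, with the faulty entry in the middle, and are not the real figures from the dataset.</p>

```python
import pandas as pd

df = pd.DataFrame(
    {'new_cases': [300.0, -148.0, 200.0]},   # made-up values around a faulty entry
    index=[171, 172, 173],
)

# Replace the faulty value with the average of its two neighbours
df.at[172, 'new_cases'] = (df.at[171, 'new_cases'] + df.at[173, 'new_cases']) / 2

print(df)
```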
<p>Here's a summary of the functions and methods we looked at in this section:</p>
<ul>
<li><code>covid_df.new_cases.sum()</code> – Computing the sum of values in a column or series</li>
<li><code>covid_df[covid_df.new_cases &gt; 1000]</code> – Querying a subset of rows satisfying the chosen criteria using boolean expressions</li>
<li><code>df['pos_rate'] = df.new_cases/df.new_tests</code> – Adding new columns by combining data from existing columns</li>
<li><code>covid_df.drop(columns=['positive_rate'])</code> – Removing one or more columns from the data frame</li>
<li><code>sort_values</code> – Sorting the rows of a data frame using column values</li>
<li><code>covid_df.at[172, 'new_cases'] = ...</code> – Replacing a value within the data frame</li>
</ul>
<h3 id="heading-how-to-work-with-dates-in-pandas">How to Work with Dates in Pandas</h3>
<p>While we've looked at overall numbers for the cases, tests, positive rate, and more, it would also be useful to study these numbers on a month-by-month basis. </p>
<p>The <code>date</code> column might come in handy here, as Pandas provides many utilities for working with dates.</p>
<pre><code class="lang-py">covid_df.date
<span class="hljs-comment"># 0      2019-12-31</span>
<span class="hljs-comment"># 1      2020-01-01</span>
<span class="hljs-comment"># 2      2020-01-02</span>
<span class="hljs-comment"># 3      2020-01-03</span>
<span class="hljs-comment"># 4      2020-01-04</span>
<span class="hljs-comment">#           ...    </span>
<span class="hljs-comment"># 243    2020-08-30</span>
<span class="hljs-comment"># 244    2020-08-31</span>
<span class="hljs-comment"># 245    2020-09-01</span>
<span class="hljs-comment"># 246    2020-09-02</span>
<span class="hljs-comment"># 247    2020-09-03</span>
<span class="hljs-comment"># Name: date, Length: 248, dtype: object</span>
</code></pre>
<p>The data type of the <code>date</code> column is currently <code>object</code>, so Pandas does not know that this column contains dates. We can convert it into a <code>datetime</code> column using the <code>pd.to_datetime</code> function.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'date'</span>] = pd.to_datetime(covid_df.date)

covid_df[<span class="hljs-string">'date'</span>]
<span class="hljs-comment"># 0     2019-12-31</span>
<span class="hljs-comment"># 1     2020-01-01</span>
<span class="hljs-comment"># 2     2020-01-02</span>
<span class="hljs-comment"># 3     2020-01-03</span>
<span class="hljs-comment"># 4     2020-01-04</span>
<span class="hljs-comment">#          ...    </span>
<span class="hljs-comment"># 243   2020-08-30</span>
<span class="hljs-comment"># 244   2020-08-31</span>
<span class="hljs-comment"># 245   2020-09-01</span>
<span class="hljs-comment"># 246   2020-09-02</span>
<span class="hljs-comment"># 247   2020-09-03</span>
<span class="hljs-comment"># Name: date, Length: 248, dtype: datetime64[ns]</span>
</code></pre>
<p>You can see that it now has the datatype <code>datetime64</code>. We can now extract different parts of the date into separate columns, using the <code>DatetimeIndex</code> class (<a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fpandas.pydata.org%2Fpandas-docs%2Fversion%2F0.23.4%2Fgenerated%2Fpandas.DatetimeIndex.html">view docs</a>).</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'year'</span>] = pd.DatetimeIndex(covid_df.date).year
covid_df[<span class="hljs-string">'month'</span>] = pd.DatetimeIndex(covid_df.date).month
covid_df[<span class="hljs-string">'day'</span>] = pd.DatetimeIndex(covid_df.date).day
covid_df[<span class="hljs-string">'weekday'</span>] = pd.DatetimeIndex(covid_df.date).weekday

covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-126.png" alt="Image" width="600" height="400" loading="lazy"></p>
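<p>An alternative to <code>DatetimeIndex</code>, not used in this tutorial, is the <code>.dt</code> accessor available on datetime columns. Here is a short sketch on three dates from the dataset:</p>

```python
import pandas as pd

df = pd.DataFrame({'date': ['2020-08-30', '2020-08-31', '2020-09-01']})
df['date'] = pd.to_datetime(df['date'])

df['month'] = df['date'].dt.month
df['weekday'] = df['date'].dt.weekday   # Monday is 0, Sunday is 6

print(df)
```

<p>The weekday numbering is the same as with <code>DatetimeIndex</code>, so Sunday is <code>6</code> in both cases.</p>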
<p>Let's check the overall metrics for May. We can query the rows for May, choose a subset of columns, and use the <code>sum</code> method to aggregate each selected column's values.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Query the rows for May</span>
covid_df_may = covid_df[covid_df.month == <span class="hljs-number">5</span>]

<span class="hljs-comment"># Extract the subset of columns to be aggregated</span>
covid_df_may_metrics = covid_df_may[[<span class="hljs-string">'new_cases'</span>, <span class="hljs-string">'new_deaths'</span>, <span class="hljs-string">'new_tests'</span>]]

<span class="hljs-comment"># Get the column-wise sum</span>
covid_may_totals = covid_df_may_metrics.sum()

covid_may_totals
<span class="hljs-comment"># new_cases       29073.0</span>
<span class="hljs-comment"># new_deaths       5658.0</span>
<span class="hljs-comment"># new_tests     1078720.0</span>
<span class="hljs-comment"># dtype: float64</span>

type(covid_may_totals)
<span class="hljs-comment"># pandas.core.series.Series</span>
</code></pre>
<p>We can also combine the above operations into a single statement.</p>
<pre><code class="lang-py">covid_df[covid_df.month == <span class="hljs-number">5</span>][[<span class="hljs-string">'new_cases'</span>, <span class="hljs-string">'new_deaths'</span>, <span class="hljs-string">'new_tests'</span>]].sum()
<span class="hljs-comment"># new_cases       29073.0</span>
<span class="hljs-comment"># new_deaths       5658.0</span>
<span class="hljs-comment"># new_tests     1078720.0</span>
<span class="hljs-comment"># dtype: float64</span>
</code></pre>
<p>As another example, let's check if the number of cases reported on Sundays is higher than the average number of cases reported every day. This time, we might want to aggregate columns using the <code>.mean</code> method.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Overall average</span>
covid_df.new_cases.mean()

<span class="hljs-comment"># 1096.6149193548388</span>

<span class="hljs-comment"># Average for Sundays</span>
covid_df[covid_df.weekday == <span class="hljs-number">6</span>].new_cases.mean()

<span class="hljs-comment"># 1247.2571428571428</span>
</code></pre>
<p>It seems like more cases were reported on Sundays compared to other days.</p>
<p>Try asking and answering some more date-related questions about the data.</p>
<h3 id="heading-how-to-group-and-aggregate-data-in-pandas">How to Group and Aggregate Data in Pandas</h3>
<p>As a next step, we might want to summarize the day-wise data and create a new dataframe with month-wise data. We can use the <code>groupby</code> function to create a group for each month, select the columns we wish to aggregate, and aggregate them using the <code>sum</code> method.</p>
<pre><code class="lang-py">covid_month_df = covid_df.groupby(<span class="hljs-string">'month'</span>)[[<span class="hljs-string">'new_cases'</span>, <span class="hljs-string">'new_deaths'</span>, <span class="hljs-string">'new_tests'</span>]].sum()

covid_month_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-127.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The result is a new data frame that uses unique values from the column passed to <code>groupby</code> as the index. Grouping and aggregation is a powerful method for progressively summarizing data into smaller data frames.</p>
<p>Instead of aggregating by sum, you can also aggregate by other measures like mean. Let's compute the average number of daily new cases, deaths, and tests for each month.</p>
<pre><code class="lang-py">covid_month_mean_df = covid_df.groupby(<span class="hljs-string">'month'</span>)[[<span class="hljs-string">'new_cases'</span>, <span class="hljs-string">'new_deaths'</span>, <span class="hljs-string">'new_tests'</span>]].mean()

covid_month_mean_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-128.png" alt="Image" width="600" height="400" loading="lazy"></p>
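<p>If you want the sum and the mean side by side, <code>agg</code> can compute several aggregates in one pass. This is a sketch on the five-row sample from earlier; the output column names <code>total</code> and <code>average</code> are arbitrary.</p>

```python
import pandas as pd

df = pd.DataFrame({
    'month': [8, 8, 9, 9, 9],
    'new_cases': [1444, 1365, 996, 975, 1326],
})

# One row per month, one column per aggregate
summary = df.groupby('month')['new_cases'].agg(total='sum', average='mean')

print(summary)
```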
<p>Apart from grouping, another form of aggregation is the running or cumulative sum of cases, tests, or deaths up to each row's date. We can use the <code>cumsum</code> method to compute the cumulative sum of a column as a new series. </p>
<p>Let's add three new columns: <code>total_cases</code>, <code>total_deaths</code>, and <code>total_tests</code>.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'total_cases'</span>] = covid_df.new_cases.cumsum()
covid_df[<span class="hljs-string">'total_deaths'</span>] = covid_df.new_deaths.cumsum()
covid_df[<span class="hljs-string">'total_tests'</span>] = covid_df.new_tests.cumsum() + initial_tests
</code></pre>
<p>We've also included the initial test count in <code>total_tests</code> to account for tests conducted before daily reporting was started.</p>
<pre><code class="lang-py">covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-129.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Notice how the <code>NaN</code> values in the <code>total_tests</code> column remain unaffected.</p>
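<p>A small sketch shows this behaviour in isolation: <code>cumsum</code> leaves <code>NaN</code> entries as they are and keeps accumulating across them. The four values are made up.</p>

```python
import numpy as np
import pandas as pd

new_tests = pd.Series([np.nan, 10.0, np.nan, 5.0])
running = new_tests.cumsum()

print(running)
```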
<h3 id="heading-how-to-merge-data-from-multiple-sources-in-pandas">How to Merge Data from Multiple Sources in Pandas</h3>
<p>To determine other metrics like tests per million, cases per million, and so on, we require some more information about the country, namely its population. </p>
<p>Let's download another file <code>locations.csv</code> that contains health-related information for many countries, including Italy.</p>
<pre><code class="lang-py">urlretrieve(<span class="hljs-string">'https://gist.githubusercontent.com/aakashns/8684589ef4f266116cdce023377fc9c8/raw/99ce3826b2a9d1e6d0bde7e9e559fc8b6e9ac88b/locations.csv'</span>, <span class="hljs-string">'locations.csv'</span>)

locations_df = pd.read_csv(<span class="hljs-string">'locations.csv'</span>)
locations_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-130.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">locations_df[locations_df.location == <span class="hljs-string">"Italy"</span>]
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-131.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can merge this data into our existing data frame by adding more columns. However, to merge two data frames, we need at least one common column. Let's insert a <code>location</code> column in the <code>covid_df</code> dataframe with all values set to <code>"Italy"</code>.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'location'</span>] = <span class="hljs-string">"Italy"</span>

covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-132.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can now add the columns from <code>locations_df</code> into <code>covid_df</code> using the <code>.merge</code> method.</p>
<pre><code class="lang-py">merged_df = covid_df.merge(locations_df, on=<span class="hljs-string">"location"</span>)

merged_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-133.png" alt="Image" width="600" height="400" loading="lazy">
<em>Check out the full data frame <a target="_blank" href="https://jovian.ai/embed?url=https://jovian.ai/aakashns/python-pandas-data-analysis">here</a>.</em></p>
<p>The location data for Italy is appended to each row within <code>covid_df</code>. If the <code>covid_df</code> data frame contained data for multiple locations, then the respective country's location data would be appended for each row.</p>
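<p>The multi-location case can be sketched with two small data frames. Everything here (the names <code>cases_df</code>, <code>population_df</code>, and <code>merged_sample_df</code>, and all the figures) is made up for illustration.</p>

```python
import pandas as pd

# Made-up figures, for illustration only
cases_df = pd.DataFrame({
    'location': ['Italy', 'Italy', 'France'],
    'new_cases': [10, 20, 5],
})
population_df = pd.DataFrame({
    'location': ['Italy', 'France', 'Spain'],
    'population': [60_000_000, 65_000_000, 47_000_000],
})

# Each row of cases_df picks up the population of its own location.
# The default inner join drops Spain, which has no matching rows in cases_df.
merged_sample_df = cases_df.merge(population_df, on='location')

print(merged_sample_df)
```

<p>Both Italy rows receive Italy's population and the France row receives France's. Rows are matched on the values in the shared <code>location</code> column, not on their position.</p>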
<p>We can now calculate metrics like cases per million, deaths per million, and tests per million.</p>
<pre><code class="lang-py">merged_df[<span class="hljs-string">'cases_per_million'</span>] = merged_df.total_cases * <span class="hljs-number">1e6</span> / merged_df.population
merged_df[<span class="hljs-string">'deaths_per_million'</span>] = merged_df.total_deaths * <span class="hljs-number">1e6</span> / merged_df.population
merged_df[<span class="hljs-string">'tests_per_million'</span>] = merged_df.total_tests * <span class="hljs-number">1e6</span> / merged_df.population

merged_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-134.png" alt="Image" width="600" height="400" loading="lazy">
<em>Check out the full data frame <a target="_blank" href="https://jovian.ai/embed?url=https://jovian.ai/aakashns/python-pandas-data-analysis">here</a>.</em></p>
<h3 id="heading-how-to-write-data-back-to-files-in-pandas">How to Write Data Back to Files in Pandas</h3>
<p>After completing your analysis and adding new columns, you should write the results back to a file. Otherwise, the data will be lost when the Jupyter notebook shuts down. </p>
<p>Before writing to file, let's first create a data frame containing just the columns we wish to record.</p>
<pre><code class="lang-py">result_df = merged_df[[<span class="hljs-string">'date'</span>,
                       <span class="hljs-string">'new_cases'</span>, 
                       <span class="hljs-string">'total_cases'</span>, 
                       <span class="hljs-string">'new_deaths'</span>, 
                       <span class="hljs-string">'total_deaths'</span>, 
                       <span class="hljs-string">'new_tests'</span>, 
                       <span class="hljs-string">'total_tests'</span>, 
                       <span class="hljs-string">'cases_per_million'</span>, 
                       <span class="hljs-string">'deaths_per_million'</span>, 
                       <span class="hljs-string">'tests_per_million'</span>]]

result_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-135.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>To write the data from the data frame into a file, we can use the <code>to_csv</code> function.</p>
<pre><code class="lang-py">result_df.to_csv(<span class="hljs-string">'results.csv'</span>, index=<span class="hljs-literal">None</span>)
</code></pre>
<p>By default, <code>to_csv</code> also writes an additional column that stores the index of the dataframe. We pass <code>index=None</code> to turn off this behavior. You can now verify that <code>results.csv</code> has been created and contains the data from the data frame in CSV format:</p>
<pre><code class="lang-py">date,new_cases,total_cases,new_deaths,total_deaths,new_tests,total_tests,cases_per_million,deaths_per_million,tests_per_million
<span class="hljs-number">2020</span><span class="hljs-number">-02</span><span class="hljs-number">-27</span>,<span class="hljs-number">78.0</span>,<span class="hljs-number">400.0</span>,<span class="hljs-number">1.0</span>,<span class="hljs-number">12.0</span>,,,<span class="hljs-number">6.61574439992122</span>,<span class="hljs-number">0.1984723319976366</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-02</span><span class="hljs-number">-28</span>,<span class="hljs-number">250.0</span>,<span class="hljs-number">650.0</span>,<span class="hljs-number">5.0</span>,<span class="hljs-number">17.0</span>,,,<span class="hljs-number">10.750584649871982</span>,<span class="hljs-number">0.28116913699665186</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-02</span><span class="hljs-number">-29</span>,<span class="hljs-number">238.0</span>,<span class="hljs-number">888.0</span>,<span class="hljs-number">4.0</span>,<span class="hljs-number">21.0</span>,,,<span class="hljs-number">14.686952567825108</span>,<span class="hljs-number">0.34732658099586405</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-03</span><span class="hljs-number">-01</span>,<span class="hljs-number">240.0</span>,<span class="hljs-number">1128.0</span>,<span class="hljs-number">8.0</span>,<span class="hljs-number">29.0</span>,,,<span class="hljs-number">18.656399207777838</span>,<span class="hljs-number">0.47964146899428844</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-03</span><span class="hljs-number">-02</span>,<span class="hljs-number">561.0</span>,<span class="hljs-number">1689.0</span>,<span class="hljs-number">6.0</span>,<span class="hljs-number">35.0</span>,,,<span class="hljs-number">27.93498072866735</span>,<span class="hljs-number">0.5788776349931067</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-03</span><span class="hljs-number">-03</span>,<span class="hljs-number">347.0</span>,<span class="hljs-number">2036.0</span>,<span class="hljs-number">17.0</span>,<span class="hljs-number">52.0</span>,,,<span class="hljs-number">33.67413899559901</span>,<span class="hljs-number">0.8600467719897585</span>,
...
</code></pre>
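<p>One way to check the effect of the <code>index</code> argument is to write a small data frame twice and read both files back. This is a sketch under a few assumptions: the file names are hypothetical, the files go to a temporary folder, and it uses <code>index=False</code>, the boolean form described in the Pandas documentation.</p>

```python
import os
import tempfile

import pandas as pd

sample_df = pd.DataFrame({'date': ['2020-02-27', '2020-02-28'],
                          'new_cases': [78.0, 250.0]})

out_dir = tempfile.mkdtemp()  # scratch folder so no real files are overwritten
with_index = os.path.join(out_dir, 'with_index.csv')
without_index = os.path.join(out_dir, 'without_index.csv')

sample_df.to_csv(with_index)                  # default: index written as the first column
sample_df.to_csv(without_index, index=False)  # index column omitted

cols_with_index = pd.read_csv(with_index).columns.tolist()
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

<p>The file written with the default settings comes back with an extra unnamed first column holding the old index, while the second file contains only <code>date</code> and <code>new_cases</code>.</p>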
<h3 id="heading-bonus-basic-plotting-with-pandas">Bonus: Basic Plotting with Pandas</h3>
<p>We generally use a library like <code>matplotlib</code> or <code>seaborn</code> to plot graphs within a Jupyter notebook. However, Pandas dataframes and series provide a handy <code>.plot</code> method for quick and easy plotting.</p>
<p>Let's plot a line graph showing how the number of daily cases varies over time.</p>
<pre><code class="lang-py">result_df.new_cases.plot();
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-137.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>While this plot shows the overall trend, it's hard to tell where the peak occurred, as there are no dates on the X-axis. We can use the <code>date</code> column as the index for the data frame to address this issue.</p>
<pre><code class="lang-py">result_df.set_index(<span class="hljs-string">'date'</span>, inplace=<span class="hljs-literal">True</span>)

result_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-138.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Notice that the index of a data frame doesn't have to be numeric. Using the date as the index also allows us to get the data for a specific date using <code>.loc</code>.</p>
<pre><code class="lang-py">result_df.loc[<span class="hljs-string">'2020-09-01'</span>]
<span class="hljs-comment"># new_cases             9.960000e+02</span>
<span class="hljs-comment"># total_cases           2.696595e+05</span>
<span class="hljs-comment"># new_deaths            6.000000e+00</span>
<span class="hljs-comment"># total_deaths          3.548300e+04</span>
<span class="hljs-comment"># new_tests             5.439500e+04</span>
<span class="hljs-comment"># total_tests           5.214766e+06</span>
<span class="hljs-comment"># cases_per_million     4.459996e+03</span>
<span class="hljs-comment"># deaths_per_million    5.868661e+02</span>
<span class="hljs-comment"># tests_per_million     8.624890e+04</span>
<span class="hljs-comment"># Name: 2020-09-01 00:00:00, dtype: float64</span>
</code></pre>
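<p>A date index also supports selecting a range of dates with a slice. The sketch below uses a tiny stand-in data frame. Only the <code>new_cases</code> value for 2020-09-01 is taken from the output above. The other values and the names <code>sample_df</code> and <code>indexed_df</code> are assumptions.</p>

```python
import pandas as pd

# Only the 2020-09-01 value comes from the output above; the rest are made up
sample_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-09-01', '2020-09-02', '2020-09-03']),
    'new_cases': [996.0, 975.0, 1326.0],
})

# Without inplace=True, set_index returns a new data frame
indexed_df = sample_df.set_index('date')

single_day = indexed_df.loc['2020-09-01']               # one row, returned as a Series
date_range = indexed_df.loc['2020-09-02':'2020-09-03']  # both endpoints are included

print(single_day)
print(date_range)
```

<p>Unlike slicing a Python list, slicing by label with <code>.loc</code> includes the end date.</p>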
<p>Let's plot the new cases and new deaths per day as line graphs.</p>
<pre><code class="lang-py">result_df.new_cases.plot()
result_df.new_deaths.plot();
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-139.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can also compare the total cases vs. total deaths.</p>
<pre><code class="lang-py">result_df.total_cases.plot()
result_df.total_deaths.plot();
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-140.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Let's see how the death rate and positive testing rates vary over time.</p>
<pre><code class="lang-py">death_rate = result_df.total_deaths / result_df.total_cases

death_rate.plot(title=<span class="hljs-string">'Death Rate'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-141.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">positive_rates = result_df.total_cases / result_df.total_tests

positive_rates.plot(title=<span class="hljs-string">'Positive Rate'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-142.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Finally, let's plot some month-wise data using a bar chart to visualize the trend at a higher level.</p>
<pre><code class="lang-py">covid_month_df.new_cases.plot(kind=<span class="hljs-string">'bar'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-143.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">covid_month_df.new_tests.plot(kind=<span class="hljs-string">'bar'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-144.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-pandas-exercises">Pandas Exercises</h3>
<p>Try the following exercises to become familiar with Pandas dataframes and practice your skills:</p>
<ul>
<li><a target="_blank" href="https://jovian.ml/aakashns/pandas-practice-assignment">Assignment on Pandas dataframes</a></li>
<li><a target="_blank" href="https://github.com/guipsamora/pandas_exercises">Additional exercises on Pandas</a></li>
<li><a target="_blank" href="https://www.kaggle.com/datasets">Try downloading and analyzing some data from Kaggle</a></li>
</ul>
<h3 id="heading-summary-and-further-reading-1">Summary and Further Reading</h3>
<p>We've covered the following topics in this tutorial:</p>
<ul>
<li>How to read a CSV file into a Pandas data frame</li>
<li>How to retrieve data from Pandas data frames</li>
<li>How to query, sort, and analyze data</li>
<li>How to merge, group, and aggregate data</li>
<li>How to extract useful information from dates</li>
<li>Basic plotting using line and bar charts</li>
<li>How to write data frames to CSV files</li>
</ul>
<p>Check out the following resources to learn more about Pandas:</p>
<ul>
<li><a target="_blank" href="https://pandas.pydata.org/docs/user_guide/index.html">User guide for Pandas</a></li>
<li><a target="_blank" href="https://www.oreilly.com/library/view/python-for-data/9781491957653/">Python for Data Analysis (book by Wes McKinney - creator of Pandas)</a></li>
</ul>
<h3 id="heading-review-questions-to-check-your-comprehension-1">Review Questions to Check Your Comprehension</h3>
<p>Try answering the following questions to test your understanding of the topics covered in this notebook:</p>
<ol>
<li>What is Pandas? What makes it useful?</li>
<li>How do you install the Pandas library?</li>
<li>How do you import the <code>pandas</code> module?</li>
<li>What is the common alias used while importing the <code>pandas</code> module?</li>
<li>How do you read a CSV file using Pandas? Give an example.</li>
<li>What are some other file formats you can read using Pandas? Illustrate with examples.</li>
<li>What are Pandas dataframes?</li>
<li>How are Pandas dataframes different from Numpy arrays?</li>
<li>How do you find the number of rows and columns in a dataframe?</li>
<li>How do you get the list of columns in a dataframe?</li>
<li>What is the purpose of the <code>describe</code> method of a dataframe?</li>
<li>How are the <code>info</code> and <code>describe</code> dataframe methods different?</li>
<li>Is a Pandas dataframe conceptually similar to a list of dictionaries or a dictionary of lists? Explain with an example.</li>
<li>What is a Pandas <code>Series</code>? How is it different from a Numpy array?</li>
<li>How do you access a column from a dataframe?</li>
<li>How do you access a row from a dataframe?</li>
<li>How do you access an element at a specific row and column of a dataframe?</li>
<li>How do you create a subset of a dataframe with a specific set of columns?</li>
<li>How do you create a subset of a dataframe with a specific range of rows?</li>
<li>Does changing a value within a dataframe affect other dataframes created using a subset of the rows or columns? Why is it so?</li>
<li>How do you create a copy of a dataframe?</li>
<li>Why should you avoid creating too many copies of a dataframe?</li>
<li>How do you view the first few rows of a dataframe?</li>
<li>How do you view the last few rows of a dataframe?</li>
<li>How do you view a random selection of rows of a dataframe?</li>
<li>What is the "index" in a dataframe? How is it useful?</li>
<li>What does a <code>NaN</code> value in a Pandas dataframe represent?</li>
<li>How is <code>NaN</code> different from <code>0</code>?</li>
<li>How do you identify the first non-empty row in a Pandas series or column?</li>
<li>What is the difference between <code>df.loc</code> and <code>df.at</code>?</li>
<li>Where can you find a full list of methods supported by Pandas <code>DataFrame</code> and <code>Series</code> objects?</li>
<li>How do you find the sum of numbers in a column of a dataframe?</li>
<li>How do you find the mean of numbers in a column of a dataframe?</li>
<li>How do you find the number of non-empty numbers in a column of a dataframe?</li>
<li>What is the result obtained by using a Pandas column in a boolean expression? Illustrate with an example.</li>
<li>How do you select a subset of rows where a specific column's value meets a given condition? Illustrate with an example.</li>
<li>What is the result of the expression <code>df[df.new_cases &gt; 100]</code> ?</li>
<li>How do you display all the rows of a pandas dataframe in a Jupyter cell output?</li>
<li>What is the result obtained when you perform an arithmetic operation between two columns of a dataframe? Illustrate with an example.</li>
<li>How do you add a new column to a dataframe by combining values from two existing columns? Illustrate with an example.</li>
<li>How do you remove a column from a dataframe? Illustrate with an example.</li>
<li>What is the purpose of the <code>inplace</code> argument in dataframe methods?</li>
<li>How do you sort the rows of a dataframe based on the values in a particular column?</li>
<li>How do you sort a pandas dataframe using values from multiple columns?</li>
<li>How do you specify whether to sort by ascending or descending order while sorting a Pandas dataframe?</li>
<li>How do you change a specific value within a dataframe?</li>
<li>How do you convert a dataframe column to the <code>datetime</code> data type?</li>
<li>What are the benefits of using the <code>datetime</code> data type instead of <code>object</code>?</li>
<li>How do you extract different parts of a date column like the month, year, weekday, and so on into separate columns? Illustrate with an example.</li>
<li>How do you aggregate multiple columns of a dataframe together?</li>
<li>What is the purpose of the <code>groupby</code> method of a dataframe? Illustrate with an example.</li>
<li>What are the different ways in which you can aggregate the groups created by <code>groupby</code>?</li>
<li>What do you mean by a running or cumulative sum?</li>
<li>How do you create a new column containing the running or cumulative sum of another column?</li>
<li>What are other cumulative measures supported by Pandas dataframes?</li>
<li>What does it mean to merge two dataframes? Give an example.</li>
<li>How do you specify the columns that should be used for merging two dataframes?</li>
<li>How do you write data from a Pandas dataframe into a CSV file? Give an example.</li>
<li>What are some other file formats you can write to from a Pandas dataframe? Illustrate with examples.</li>
<li>How do you create a line plot showing the values within a column of a dataframe?</li>
<li>How do you convert a column of a dataframe into its index?</li>
<li>Can the index of a dataframe be non-numeric?</li>
<li>What are the benefits of using a non-numeric dataframe index? Illustrate with an example.</li>
<li>How do you create a bar plot showing the values within a column of a dataframe?</li>
<li>What are some other types of plots supported by Pandas dataframes and series?</li>
</ol>
<p>You are ready to move on to the next section of the tutorial.</p>
<h2 id="heading-data-visualization-using-python-matplotlib-and-seaborn">Data Visualization using Python, Matplotlib, and Seaborn</h2>
<p><img src="https://i.imgur.com/9i806Rh.png" alt="Image" width="2314" height="1092" loading="lazy"></p>
<p>Notebook link: <a target="_blank" href="https://jovian.ai/aakashns/python-matplotlib-data-visualization">https://jovian.ai/aakashns/python-matplotlib-data-visualization</a></p>
<p>Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers. </p>
<p>Visualizing data is an essential part of data analysis and machine learning. We'll use Python libraries <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org">Matplotlib</a> and <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fseaborn.pydata.org">Seaborn</a> to learn and apply some popular data visualization techniques. We'll use the words <em>chart</em>, <em>plot</em>, and <em>graph</em> interchangeably in this tutorial.</p>
<p>To begin, let's install and import the libraries. We'll use the <code>matplotlib.pyplot</code> module for basic plots like line and bar charts. It is often imported with the alias <code>plt</code>. We'll use the <code>seaborn</code> module for more advanced plots. It is commonly imported with the alias <code>sns</code>.</p>
<pre><code class="lang-py">!pip install matplotlib seaborn --upgrade --quiet

<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
%matplotlib inline
</code></pre>
<p>Notice that we also include the special command <code>%matplotlib inline</code> to ensure that our plots are shown and embedded within the Jupyter notebook itself. Without this command, plots may sometimes show up in pop-up windows.</p>
<h3 id="heading-how-to-create-a-line-chart-in-python">How to Create a Line Chart in Python</h3>
<p>The line chart is one of the simplest and most widely used data visualization techniques. A line chart displays information as a series of data points or markers connected by straight lines. </p>
<p>You can customize the shape, size, color, and other aesthetic elements of the lines and markers for better visual clarity.</p>
<p>Here's a Python list showing the yield of apples (tons per hectare) over six years in an imaginary country called Kanto.</p>
<pre><code class="lang-py">yield_apples = [<span class="hljs-number">0.895</span>, <span class="hljs-number">0.91</span>, <span class="hljs-number">0.919</span>, <span class="hljs-number">0.926</span>, <span class="hljs-number">0.929</span>, <span class="hljs-number">0.931</span>]
</code></pre>
<p>We can visualize how the yield of apples changes over time using a line chart. To draw a line chart, we can use the <code>plt.plot</code> function.</p>
<pre><code class="lang-py">plt.plot(yield_apples)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-145.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Calling the <code>plt.plot</code> function draws the line chart as expected. It also returns a list of the lines drawn, <code>[&lt;matplotlib.lines.Line2D at 0x7ff70aa20760&gt;]</code>, which is shown within the output. We can include a semicolon (<code>;</code>) at the end of the last statement in the cell to avoid showing the output and display just the graph.</p>
<pre><code class="lang-py">plt.plot(yield_apples);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-146.png" alt="Image" width="600" height="400" loading="lazy"></p>
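<p>The value returned by <code>plt.plot</code> can also be stored and inspected. In this minimal sketch, the variable names <code>lines</code>, <code>line</code>, <code>x_values</code>, and <code>y_values</code> are our own choices.</p>

```python
import matplotlib.pyplot as plt

yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

lines = plt.plot(yield_apples)  # a list with one Line2D object per plotted line
line = lines[0]

# The Line2D object keeps the data it was drawn from
x_values = [float(x) for x in line.get_xdata()]
y_values = [float(y) for y in line.get_ydata()]

print(x_values)
print(y_values)
```

<p>Since only one list was passed to <code>plt.plot</code>, the X values stored on the line are the list positions, starting at 0.</p>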
<p>Let's enhance this plot step-by-step to make it more informative and beautiful.</p>
<h4 id="heading-how-to-customize-the-x-axis-in-matplotlib"><strong>How to Customize the X-axis in Matplotlib</strong></h4>
<p>The X-axis of the plot currently shows list element indices 0 to 5. The plot would be more informative if we could display the year for which we're plotting the data. We can do this by passing two arguments to <code>plt.plot</code>: the X values followed by the Y values.</p>
<pre><code class="lang-py">years = [<span class="hljs-number">2010</span>, <span class="hljs-number">2011</span>, <span class="hljs-number">2012</span>, <span class="hljs-number">2013</span>, <span class="hljs-number">2014</span>, <span class="hljs-number">2015</span>]
yield_apples = [<span class="hljs-number">0.895</span>, <span class="hljs-number">0.91</span>, <span class="hljs-number">0.919</span>, <span class="hljs-number">0.926</span>, <span class="hljs-number">0.929</span>, <span class="hljs-number">0.931</span>]

plt.plot(years, yield_apples)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-147.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-axis-labels-in-matplotlib"><strong>Axis Labels in Matplotlib</strong></h4>
<p>We can add labels to the axes to show what each axis represents using the <code>plt.xlabel</code> and <code>plt.ylabel</code> methods.</p>
<pre><code class="lang-py">plt.plot(years, yield_apples)
plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-148.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-how-to-plot-multiple-lines-in-matplotlib"><strong>How to Plot Multiple Lines in Matplotlib</strong></h4>
<p>You can invoke the <code>plt.plot</code> function once for each line to plot multiple lines in the same graph. Let's compare the yields of apples vs. oranges in Kanto.</p>
<pre><code class="lang-py">years = range(<span class="hljs-number">2000</span>, <span class="hljs-number">2012</span>)
apples = [<span class="hljs-number">0.895</span>, <span class="hljs-number">0.91</span>, <span class="hljs-number">0.919</span>, <span class="hljs-number">0.926</span>, <span class="hljs-number">0.929</span>, <span class="hljs-number">0.931</span>, <span class="hljs-number">0.934</span>, <span class="hljs-number">0.936</span>, <span class="hljs-number">0.937</span>, <span class="hljs-number">0.9375</span>, <span class="hljs-number">0.9372</span>, <span class="hljs-number">0.939</span>]
oranges = [<span class="hljs-number">0.962</span>, <span class="hljs-number">0.941</span>, <span class="hljs-number">0.930</span>, <span class="hljs-number">0.923</span>, <span class="hljs-number">0.918</span>, <span class="hljs-number">0.908</span>, <span class="hljs-number">0.907</span>, <span class="hljs-number">0.904</span>, <span class="hljs-number">0.901</span>, <span class="hljs-number">0.898</span>, <span class="hljs-number">0.9</span>, <span class="hljs-number">0.896</span>, ]

plt.plot(years, apples)
plt.plot(years, oranges)
plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-149.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-chart-title-and-legend-in-matplotlib"><strong>Chart Title and Legend in Matplotlib</strong></h4>
<p>To differentiate between multiple lines, we can include a legend within the graph using the <code>plt.legend</code> function. We can also set a title for the chart using the <code>plt.title</code> function.</p>
<pre><code class="lang-py">plt.plot(years, apples)
plt.plot(years, oranges)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-150.png" alt="Image" width="600" height="400" loading="lazy"></p>
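<p>Passing a list to <code>plt.legend</code> relies on the order in which the lines were plotted. A common alternative is to attach a <code>label</code> to each <code>plt.plot</code> call and then call <code>plt.legend()</code> with no arguments. The sketch below uses the same data. The <code>plt.figure()</code> call and the <code>legend_labels</code> variable are our own additions.</p>

```python
import matplotlib.pyplot as plt

years = range(2000, 2012)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931,
          0.934, 0.936, 0.937, 0.9375, 0.9372, 0.939]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908,
           0.907, 0.904, 0.901, 0.898, 0.9, 0.896]

plt.figure()  # start from a fresh figure

# Each line carries its own label, so reordering the calls cannot mislabel it
plt.plot(years, apples, label='Apples')
plt.plot(years, oranges, label='Oranges')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')
plt.title("Crop Yields in Kanto")

legend = plt.legend()  # picks up the labels set above
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```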
<h4 id="heading-how-to-use-line-markers-in-matplotlib"><strong>How to Use Line Markers in Matplotlib</strong></h4>
<p>We can also show markers for the data points on each line using the <code>marker</code> argument of <code>plt.plot</code>. </p>
<p>Matplotlib provides many different markers like a circle, cross, square, diamond, and more. You can find the full list of marker types here: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.1.1%2Fapi%2Fmarkers_api.html">https://matplotlib.org/3.1.1/api/markers_api.html</a> .</p>
<pre><code class="lang-py">plt.plot(years, apples, marker=<span class="hljs-string">'o'</span>)
plt.plot(years, oranges, marker=<span class="hljs-string">'x'</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-151.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-how-to-style-lines-and-markers-in-matplotlib"><strong>How to Style Lines and Markers in Matplotlib</strong></h4>
<p>The <code>plt.plot</code> function supports many arguments for styling lines and markers:</p>
<ul>
<li><code>color</code> or <code>c</code> – Set the color of the line (<a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.1.0%2Fgallery%2Fcolor%2Fnamed_colors.html">supported colors</a>)</li>
<li><code>linestyle</code> or <code>ls</code> – Choose between a solid or dashed line</li>
<li><code>linewidth</code> or <code>lw</code> – Set the width of a line</li>
<li><code>markersize</code> or <code>ms</code> – Set the size of markers</li>
<li><code>markeredgecolor</code> or <code>mec</code> – Set the edge color for markers</li>
<li><code>markeredgewidth</code> or <code>mew</code> – Set the edge width for markers</li>
<li><code>markerfacecolor</code> or <code>mfc</code> – Set the fill color for markers</li>
<li><code>alpha</code> – Opacity of the plot</li>
</ul>
<p>Check out the documentation for <code>plt.plot</code> to learn more: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2Fapi%2F_as_gen%2Fmatplotlib.pyplot.plot.html%23matplotlib.pyplot.plot">https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot</a> .</p>
<pre><code class="lang-py">plt.plot(years, apples, marker=<span class="hljs-string">'s'</span>, c=<span class="hljs-string">'b'</span>, ls=<span class="hljs-string">'-'</span>, lw=<span class="hljs-number">2</span>, ms=<span class="hljs-number">8</span>, mew=<span class="hljs-number">2</span>, mec=<span class="hljs-string">'navy'</span>)
plt.plot(years, oranges, marker=<span class="hljs-string">'o'</span>, c=<span class="hljs-string">'r'</span>, ls=<span class="hljs-string">'--'</span>, lw=<span class="hljs-number">3</span>, ms=<span class="hljs-number">10</span>, alpha=<span class="hljs-number">.5</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-152.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The <code>fmt</code> argument provides a shorthand for specifying the marker shape, line style, and line color. You can provide it as the third argument to <code>plt.plot</code>.</p>
<pre><code class="lang-py">fmt = <span class="hljs-string">'[marker][line][color]'</span>

plt.plot(years, apples, <span class="hljs-string">'s-b'</span>)
plt.plot(years, oranges, <span class="hljs-string">'o--r'</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-153.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>If you don't specify a line style in <code>fmt</code>, only the markers are drawn.</p>
<pre><code class="lang-py">plt.plot(years, oranges, <span class="hljs-string">'or'</span>)
plt.title(<span class="hljs-string">"Yield of Oranges (tons per hectare)"</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-154.png" alt="Image" width="600" height="400" loading="lazy"></p>
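<p>A format string such as <code>'or'</code> names a marker and a color but no line style, so Matplotlib draws the markers without a connecting line. One way to confirm this is to inspect the <code>Line2D</code> objects that <code>plt.plot</code> returns. The variable names <code>markers_only</code> and <code>with_line</code> below are our own.</p>

```python
import matplotlib.pyplot as plt

years = range(2000, 2012)
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908,
           0.907, 0.904, 0.901, 0.898, 0.9, 0.896]

plt.figure()  # start from a fresh figure

# plt.plot returns a list with one Line2D; unpack its single element
(markers_only,) = plt.plot(years, oranges, 'or')   # marker + color, no line style
(with_line,) = plt.plot(years, oranges, 'o--r')    # marker + dashed line + color

print(markers_only.get_marker(), markers_only.get_linestyle(), markers_only.get_color())
print(with_line.get_marker(), with_line.get_linestyle(), with_line.get_color())
```

<p>The first line reports the string <code>'None'</code> as its line style, while the second reports <code>'--'</code>.</p>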
<h4 id="heading-how-to-change-the-figure-size-in-matplotlib"><strong>How to Change the Figure Size in Matplotlib</strong></h4>
<p>You can use the <code>plt.figure</code> function to change the size of the figure.</p>
<pre><code class="lang-py">plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">6</span>))

plt.plot(years, oranges, <span class="hljs-string">'or'</span>)
plt.title(<span class="hljs-string">"Yield of Oranges (tons per hectare)"</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-155.png" alt="Image" width="600" height="400" loading="lazy"></p>
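<p>The <code>figsize</code> argument is a <code>(width, height)</code> pair measured in inches. As a quick check, you can keep a reference to the figure and ask for its size. The names <code>fig</code> and <code>size_in_inches</code> in this sketch are our own.</p>

```python
import matplotlib.pyplot as plt

years = range(2000, 2012)
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908,
           0.907, 0.904, 0.901, 0.898, 0.9, 0.896]

fig = plt.figure(figsize=(12, 6))  # width and height in inches

plt.plot(years, oranges, 'or')
plt.title("Yield of Oranges (tons per hectare)")

# Read the size back from the figure object
size_in_inches = [float(value) for value in fig.get_size_inches()]
print(size_in_inches)
```

<p>Call <code>plt.figure</code> before the plotting functions, so that the lines are drawn on the newly sized figure.</p>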
<h4 id="heading-how-to-improve-default-styles-using-seaborn"><strong>How to Improve Default Styles using Seaborn</strong></h4>
<p>An easy way to make your charts look beautiful is to use some default styles from the Seaborn library. You can apply them globally using the <code>sns.set_style</code> function. You can see a full list of predefined styles here: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fseaborn.pydata.org%2Fgenerated%2Fseaborn.set_style.html">https://seaborn.pydata.org/generated/seaborn.set_style.html</a> .</p>
<pre><code class="lang-py">sns.set_style(<span class="hljs-string">"whitegrid"</span>)
plt.plot(years, apples, <span class="hljs-string">'s-b'</span>)
plt.plot(years, oranges, <span class="hljs-string">'o--r'</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-156.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">sns.set_style(<span class="hljs-string">"darkgrid"</span>)

plt.plot(years, apples, <span class="hljs-string">'s-b'</span>)
plt.plot(years, oranges, <span class="hljs-string">'o--r'</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-157.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">plt.plot(years, oranges, <span class="hljs-string">'or'</span>)
plt.title(<span class="hljs-string">"Yield of Oranges (tons per hectare)"</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-158.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can also edit default styles directly by modifying the <code>matplotlib.rcParams</code> dictionary. Learn more: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.2.1%2Ftutorials%2Fintroductory%2Fcustomizing.html%23matplotlib-rcparams">https://matplotlib.org/3.2.1/tutorials/introductory/customizing.html#matplotlib-rcparams</a>.</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> matplotlib

matplotlib.rcParams[<span class="hljs-string">'font.size'</span>] = <span class="hljs-number">14</span>
matplotlib.rcParams[<span class="hljs-string">'figure.figsize'</span>] = (<span class="hljs-number">9</span>, <span class="hljs-number">5</span>)
matplotlib.rcParams[<span class="hljs-string">'figure.facecolor'</span>] = <span class="hljs-string">'#00000000'</span>
</code></pre>
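<p>Assigning to <code>matplotlib.rcParams</code> changes the defaults for every chart that follows. If you only want the change for one chart, Matplotlib also provides the <code>matplotlib.rc_context</code> context manager. The sketch below assumes you want a temporary override. It restores the previous defaults when the <code>with</code> block ends. The <code>Agg</code> backend line is only there so the snippet runs outside a notebook.</p>
<pre><code class="lang-py">import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed in a notebook
import matplotlib.pyplot as plt

original_size = matplotlib.rcParams['font.size']

# Settings inside the block apply only to figures created within it
with matplotlib.rc_context({'font.size': 14, 'figure.figsize': (9, 5)}):
    fig, ax = plt.subplots()
    inside_size = matplotlib.rcParams['font.size']
    inside_figsize = tuple(fig.get_size_inches())

# After the block, the earlier default is back
restored_size = matplotlib.rcParams['font.size']
print(inside_size, inside_figsize, restored_size)
</code></pre>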
<h3 id="heading-scatter-plots-in-matplotlib">Scatter Plots <strong>in Matplotlib</strong></h3>
<p>In a scatter plot, the values of two variables are plotted as points on a 2-dimensional grid. You can also use a third variable to determine the size or color of the points. Let's try out an example.</p>
<p>The <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FIris_flower_data_set">Iris flower dataset</a> provides sample measurements of sepals and petals for three species of flowers. The Iris dataset is included with the Seaborn library and you can load it as a Pandas data frame.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Load data into a Pandas dataframe</span>
flowers_df = sns.load_dataset(<span class="hljs-string">"iris"</span>)

flowers_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-159.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">flowers_df.species.unique()
<span class="hljs-comment"># array(['setosa', 'versicolor', 'virginica'], dtype=object)</span>
</code></pre>
<p>Let's try to visualize the relationship between sepal length and sepal width. Our first instinct might be to create a line chart using <code>plt.plot</code>.</p>
<pre><code class="lang-py">plt.plot(flowers_df.sepal_length, flowers_df.sepal_width);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-160.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The output is not very informative: <code>plt.plot</code> connects the points in the order they appear in the data frame, and there are too many combinations of the two properties within the dataset. There doesn't seem to be a simple relationship between them.</p>
<p>We can use a scatter plot to visualize how sepal length and sepal width vary using the <code>scatterplot</code> function from the <code>seaborn</code> module (imported as <code>sns</code>).</p>
<pre><code class="lang-py">sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-161.png" alt="Image" width="600" height="400" loading="lazy"></p>
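<p>Plain Matplotlib can draw the same kind of chart with <code>ax.scatter</code>, including a third variable that controls the size and color of the points. This is a rough sketch rather than a replacement for the Seaborn call above. The three short lists are made-up values, not rows from the Iris dataset, and the factor of 40 for the marker area is an arbitrary choice.</p>
<pre><code class="lang-py">import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed in a notebook
import matplotlib.pyplot as plt

# Hypothetical measurements, not taken from the Iris dataset
sepal_length = [5.1, 4.9, 6.3, 5.8, 7.1]
sepal_width = [3.5, 3.0, 3.3, 2.7, 3.0]
petal_length = [1.4, 1.4, 6.0, 5.1, 5.9]

fig, ax = plt.subplots()
points = ax.scatter(
    sepal_length,
    sepal_width,
    s=[40 * value for value in petal_length],  # third variable sets marker area
    c=petal_length,                            # and also the marker color
    cmap='viridis',
)
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
fig.colorbar(points, ax=ax, label='petal_length')
</code></pre>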
<h4 id="heading-how-to-add-hues-in-matplotlib"><strong>How to Add Hues in Seaborn</strong></h4>
<p>Notice how the points in the above plot seem to form distinct clusters with some outliers. We can color the dots using the flower species as a <code>hue</code>. We can also make the points larger using the <code>s</code> argument.</p>
<pre><code class="lang-py">sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width, hue=flowers_df.species, s=<span class="hljs-number">100</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-162.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Adding hues makes the plot more informative. We can immediately tell that Setosa irises have a smaller sepal length but a larger sepal width, while the opposite is true for Virginica irises.</p>
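<p>To see roughly what a hue does, you can reproduce the effect in plain Matplotlib by drawing one scatter series per category. The sketch below assumes a small made-up data frame with the same column names as the Iris data. It illustrates the idea and does not show how Seaborn implements <code>hue</code> internally.</p>
<pre><code class="lang-py">import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows standing in for the Iris data frame
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
})

fig, ax = plt.subplots()
# One scatter call per category; Matplotlib picks the next color each time
for species_name, group in df.groupby('species'):
    ax.scatter(group['sepal_length'], group['sepal_width'],
               s=100, label=species_name)
ax.legend(title='species')

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
</code></pre>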
<h4 id="heading-how-to-customize-seaborn-figures"><strong>How to Customize Seaborn Figures</strong></h4>
<p>Since Seaborn uses Matplotlib's plotting functions internally, we can use functions like <code>plt.figure</code> and <code>plt.title</code> to modify the figure.</p>
<pre><code class="lang-py">plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">6</span>))
plt.title(<span class="hljs-string">'Sepal Dimensions'</span>)

sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species,
                s=<span class="hljs-number">100</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-163.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-how-to-plot-data-using-pandas-data-frames-with-seaborn"><strong>How to Plot Data using Pandas Data Frames with Seaborn</strong></h4>
<p>Seaborn has built-in support for Pandas data frames. Instead of passing each column as a series, you can provide column names and use the <code>data</code> argument to specify a data frame.</p>
<pre><code class="lang-py">plt.title(<span class="hljs-string">'Sepal Dimensions'</span>)
sns.scatterplot(x=<span class="hljs-string">'sepal_length'</span>, 
                y=<span class="hljs-string">'sepal_width'</span>, 
                hue=<span class="hljs-string">'species'</span>,
                s=<span class="hljs-number">100</span>,
                data=flowers_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-164.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-histograms-in-matplotlib">Histograms <strong>in Matplotlib</strong></h3>
<p>A histogram represents the distribution of a variable by creating bins (intervals) along the range of values and showing vertical bars to indicate the number of observations in each bin.</p>
<p>For example, let's visualize the distribution of values of sepal width in the Iris dataset. We can use the <code>plt.hist</code> function to create a histogram.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Load data into a Pandas dataframe</span>
flowers_df = sns.load_dataset(<span class="hljs-string">"iris"</span>)

flowers_df.sepal_width
<span class="hljs-comment"># 0      3.5</span>
<span class="hljs-comment"># 1      3.0</span>
<span class="hljs-comment"># 2      3.2</span>
<span class="hljs-comment"># 3      3.1</span>
<span class="hljs-comment"># 4      3.6</span>
<span class="hljs-comment">#       ... </span>
<span class="hljs-comment"># 145    3.0</span>
<span class="hljs-comment"># 146    2.5</span>
<span class="hljs-comment"># 147    3.0</span>
<span class="hljs-comment"># 148    3.4</span>
<span class="hljs-comment"># 149    3.0</span>
<span class="hljs-comment"># Name: sepal_width, Length: 150, dtype: float64</span>
</code></pre>
<pre><code class="lang-py">plt.title(<span class="hljs-string">"Distribution of Sepal Width"</span>)
plt.hist(flowers_df.sepal_width);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-165.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can immediately see that the sepal widths lie in the range 2.0 - 4.5, and around 35 values are in the range 2.9 - 3.1, which seems to be the most populous bin.</p>
<h4 id="heading-how-to-control-the-size-and-number-of-bins"><strong>How to Control the Size and Number of Bins</strong></h4>
<p>We can control the number of bins or the size of each one using the <code>bins</code> argument. Pass an integer to set the number of equal-width bins, or a sequence of values to set the bin edges yourself.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Specifying the number of bins</span>
plt.hist(flowers_df.sepal_width, bins=<span class="hljs-number">5</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-166.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Specifying the boundaries of each bin</span>
plt.hist(flowers_df.sepal_width, bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>));
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-167.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py"><span class="hljs-comment"># Bins of unequal sizes</span>
plt.hist(flowers_df.sepal_width, bins=[<span class="hljs-number">1</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">4.5</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-168.png" alt="Image" width="600" height="400" loading="lazy"></p>
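<p>If you want to check the numbers behind a histogram, NumPy's <code>np.histogram</code> accepts the same kinds of <code>bins</code> values and returns the counts and bin edges without drawing anything. This is a small sketch on a made-up list of widths, not the Iris column.</p>
<pre><code class="lang-py">import numpy as np

# Hypothetical sepal widths, chosen only for illustration
widths = np.array([2.0, 2.4, 2.9, 3.0, 3.0, 3.1, 3.4, 3.8, 4.4])

# An integer asks for that many equal-width bins across the data range
counts, edges = np.histogram(widths, bins=5)

# A sequence gives the bin edges explicitly, so bins can be unequal
custom_counts, custom_edges = np.histogram(widths, bins=[1, 3, 4, 4.5])

print(counts, edges)
print(custom_counts, custom_edges)
</code></pre>
<p>Five bins need six edges, and every value falls into exactly one bin, so the counts add up to the number of observations.</p>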
<h4 id="heading-how-to-manage-multiple-histograms-in-matplotlib"><strong>How to Manage Multiple Histograms in Matplotlib</strong></h4>
<p>Similar to line charts, we can draw multiple histograms in a single chart. We can reduce each histogram's opacity so that one histogram's bars don't hide the others'.</p>
<p>Let's draw separate histograms for each species of flowers.</p>
<pre><code class="lang-py">setosa_df = flowers_df[flowers_df.species == <span class="hljs-string">'setosa'</span>]
versicolor_df = flowers_df[flowers_df.species == <span class="hljs-string">'versicolor'</span>]
virginica_df = flowers_df[flowers_df.species == <span class="hljs-string">'virginica'</span>]

plt.hist(setosa_df.sepal_width, alpha=<span class="hljs-number">0.4</span>, bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>));
plt.hist(versicolor_df.sepal_width, alpha=<span class="hljs-number">0.4</span>, bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>));
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-169.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can also stack multiple histograms on top of one another.</p>
<pre><code class="lang-py">plt.title(<span class="hljs-string">'Distribution of Sepal Width'</span>)

plt.hist([setosa_df.sepal_width, versicolor_df.sepal_width, virginica_df.sepal_width], 
         bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>), 
         stacked=<span class="hljs-literal">True</span>);

plt.legend([<span class="hljs-string">'Setosa'</span>, <span class="hljs-string">'Versicolor'</span>, <span class="hljs-string">'Virginica'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-170.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-bar-charts-in-matplotlib">Bar Charts <strong>in Matplotlib</strong></h3>
<p>Bar charts are quite similar to line charts: they show a sequence of values. However, a bar is shown for each value, rather than points connected by lines. We can use the <code>plt.bar</code> function to draw a bar chart.</p>
<pre><code class="lang-py">years = range(<span class="hljs-number">2000</span>, <span class="hljs-number">2006</span>)
apples = [<span class="hljs-number">0.35</span>, <span class="hljs-number">0.6</span>, <span class="hljs-number">0.9</span>, <span class="hljs-number">0.8</span>, <span class="hljs-number">0.65</span>, <span class="hljs-number">0.8</span>]
oranges = [<span class="hljs-number">0.4</span>, <span class="hljs-number">0.8</span>, <span class="hljs-number">0.9</span>, <span class="hljs-number">0.7</span>, <span class="hljs-number">0.6</span>, <span class="hljs-number">0.8</span>]

plt.bar(years, oranges);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-171.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Like histograms, we can stack bars on top of one another. We use the <code>bottom</code> argument of <code>plt.bar</code> to achieve this.</p>
<pre><code class="lang-py">plt.bar(years, apples)
plt.bar(years, oranges, bottom=apples);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-172.png" alt="Image" width="600" height="400" loading="lazy"></p>
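<p>The <code>bottom</code> argument sets where each bar starts on the y-axis, which is what makes the stacking work. The sketch below repeats the chart using the axes-level <code>ax.bar</code> method and reads the bar positions back to confirm this. The <code>Agg</code> backend line and the variable names <code>orange_starts</code> and <code>orange_tops</code> are additions for illustration.</p>
<pre><code class="lang-py">import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed in a notebook
import matplotlib.pyplot as plt

years = list(range(2000, 2006))
apples = [0.35, 0.6, 0.9, 0.8, 0.65, 0.8]
oranges = [0.4, 0.8, 0.9, 0.7, 0.6, 0.8]

fig, ax = plt.subplots()
apple_bars = ax.bar(years, apples, label='Apples')
orange_bars = ax.bar(years, oranges, bottom=apples, label='Oranges')
ax.legend()

# Each orange bar starts at the top of the apple bar for the same year
orange_starts = [bar.get_y() for bar in orange_bars]
orange_tops = [bar.get_y() + bar.get_height() for bar in orange_bars]
print(orange_starts)
print(orange_tops)
</code></pre>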
<h4 id="heading-bar-plots-with-averages-in-seaborn"><strong>Bar Plots with Averages in Seaborn</strong></h4>
<p>Let's look at another sample dataset included with Seaborn called <code>tips</code>. The dataset contains information about the sex, time of day, total bill, and tip amount for customers visiting a restaurant over a week.</p>
<pre><code class="lang-py">tips_df = sns.load_dataset(<span class="hljs-string">"tips"</span>);

tips_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-173.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We might want to draw a bar chart to visualize how the average bill amount varies across different days of the week. One way to do this would be to compute the day-wise averages and then use <code>plt.bar</code> (try it as an exercise).</p>
<p>However, since this is a very common use case, the Seaborn library provides a <code>barplot</code> function which can automatically compute averages.</p>
<pre><code class="lang-py">sns.barplot(x=<span class="hljs-string">'day'</span>, y=<span class="hljs-string">'total_bill'</span>, data=tips_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-174.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The lines cutting each bar are error bars. By default, Seaborn draws a 95% confidence interval for the mean, estimated by bootstrapping. For instance, the estimate of the average total bill is relatively uncertain on Fridays and more precise on Saturdays.</p>
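<p>For comparison, the manual approach suggested as an exercise above could be sketched as follows. It assumes a tiny made-up data frame with the same <code>day</code> and <code>total_bill</code> columns instead of the real <code>tips</code> dataset, and it draws the averages only, without error bars.</p>
<pre><code class="lang-py">import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical bills, not rows from the tips dataset
tips = pd.DataFrame({
    'day': ['Thur', 'Thur', 'Fri', 'Fri', 'Sat'],
    'total_bill': [10.0, 20.0, 30.0, 50.0, 25.0],
})

# Average bill per day; sort=False keeps the days in order of appearance
day_means = tips.groupby('day', sort=False)['total_bill'].mean()

fig, ax = plt.subplots()
ax.bar(list(day_means.index), list(day_means.values))
ax.set_xlabel('day')
ax.set_ylabel('Average total bill')
print(day_means)
</code></pre>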
<p>We can also specify a <code>hue</code> argument to compare bar plots side-by-side based on a third feature, for example sex.</p>
<pre><code class="lang-py">sns.barplot(x=<span class="hljs-string">'day'</span>, y=<span class="hljs-string">'total_bill'</span>, hue=<span class="hljs-string">'sex'</span>, data=tips_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-175.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can make the bars horizontal simply by switching the axes.</p>
<pre><code class="lang-py">sns.barplot(x=<span class="hljs-string">'total_bill'</span>, y=<span class="hljs-string">'day'</span>, hue=<span class="hljs-string">'sex'</span>, data=tips_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-176.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-heatmaps-in-seaborn">Heatmaps in Seaborn</h3>
<p>A heatmap is used to visualize 2-dimensional data like a matrix or a table using colors. The best way to understand it is by looking at an example. </p>
<p>We'll use another sample dataset from Seaborn, called <code>flights</code>, to visualize monthly passenger footfall at an airport over 12 years.</p>
<pre><code class="lang-py">flights_df = sns.load_dataset(<span class="hljs-string">"flights"</span>).pivot(index=<span class="hljs-string">"month"</span>, columns=<span class="hljs-string">"year"</span>, values=<span class="hljs-string">"passengers"</span>)

flights_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-177.png" alt="Image" width="600" height="400" loading="lazy"></p>
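<p>The <code>pivot</code> step reshapes a long table (one row per observation) into a wide one (one row per month, one column per year). Here is a minimal sketch on a made-up four-row table. The passenger numbers are illustrative only, and the call uses keyword arguments, which newer pandas releases expect.</p>
<pre><code class="lang-py">import pandas as pd

# Hypothetical long-format data: one row per (month, year) pair
long_df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'year': [1949, 1949, 1950, 1950],
    'passengers': [112, 118, 115, 126],
})

# Rows come from 'month', columns from 'year', cell values from 'passengers'
wide_df = long_df.pivot(index='month', columns='year', values='passengers')

print(wide_df)
print(wide_df.loc['Jan', 1950])
</code></pre>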
<p><code>flights_df</code> is a matrix with one row for each month and one column for each year. The values show the number of passengers (in thousands) that visited the airport in a specific month of a year. We can use the <code>sns.heatmap</code> function to visualize the footfall at the airport.</p>
<pre><code class="lang-py">plt.title(<span class="hljs-string">"No. of Passengers (1000s)"</span>)
sns.heatmap(flights_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-178.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The brighter colors indicate a higher footfall at the airport. By looking at the graph, we can infer two things:</p>
<ul>
<li>The footfall at the airport in any given year tends to be the highest around July and August.</li>
<li>The footfall at the airport in any given month tends to grow year by year.</li>
</ul>
<p>We can also display the actual values in each block by specifying <code>annot=True</code>, and use the <code>cmap</code> argument to change the color palette. The <code>fmt="d"</code> argument formats the annotations as integers.</p>
<pre><code class="lang-py">plt.title(<span class="hljs-string">"No. of Passengers (1000s)"</span>)
sns.heatmap(flights_df, fmt=<span class="hljs-string">"d"</span>, annot=<span class="hljs-literal">True</span>, cmap=<span class="hljs-string">'Blues'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-179.png" alt="Image" width="600" height="400" loading="lazy"></p>
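<p>To get a feel for what the annotated heatmap involves, here is a rough plain-Matplotlib sketch that colors a small matrix with <code>ax.imshow</code> and writes each value into its cell with <code>ax.text</code>. The 2x2 matrix is made up, and this is an illustration of the idea rather than a description of Seaborn's implementation.</p>
<pre><code class="lang-py">import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical matrix: rows could be months, columns could be years
data = np.array([[112, 115],
                 [118, 126]])

fig, ax = plt.subplots()
image = ax.imshow(data, cmap='Blues')  # color each cell by its value

# Write the integer value at the center of each cell
for row in range(data.shape[0]):
    for col in range(data.shape[1]):
        ax.text(col, row, format(data[row, col], 'd'),
                ha='center', va='center')

fig.colorbar(image, ax=ax)
cell_labels = [text.get_text() for text in ax.texts]
print(cell_labels)
</code></pre>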
<h3 id="heading-images-in-matplotlib">Images <strong>in Matplotlib</strong></h3>
<p>We can also use Matplotlib to display images. Let's download an image from the internet.</p>
<pre><code class="lang-py"><span class="hljs-keyword">from</span> urllib.request <span class="hljs-keyword">import</span> urlretrieve

urlretrieve(<span class="hljs-string">'https://i.imgur.com/SkPbq.jpg'</span>, <span class="hljs-string">'chart.jpg'</span>);
</code></pre>
<p>Before an image can be displayed, it has to be read into memory. We'll use the <code>PIL</code> module for this.</p>
<pre><code class="lang-py"><span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image

img = Image.open(<span class="hljs-string">'chart.jpg'</span>)
</code></pre>
<p>An image loaded using PIL can be converted into a 3-dimensional NumPy array containing pixel intensities for the red, green &amp; blue (RGB) channels of the image. We can do the conversion using <code>np.array</code>.</p>
<pre><code class="lang-py">img_array = np.array(img)

img_array.shape
<span class="hljs-comment"># (481, 640, 3)</span>
</code></pre>
<p>We can display the PIL image using <code>plt.imshow</code>.</p>
<pre><code class="lang-py">plt.imshow(img);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-180.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can turn off the axes &amp; grid lines and show a title using the relevant functions.</p>
<pre><code class="lang-py">plt.grid(<span class="hljs-literal">False</span>)
plt.title(<span class="hljs-string">'A data science meme'</span>)
plt.axis(<span class="hljs-string">'off'</span>)
plt.imshow(img);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-181.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>To display a part of the image, we can simply select a slice from the numpy array.</p>
<pre><code class="lang-py">plt.grid(<span class="hljs-literal">False</span>)
plt.axis(<span class="hljs-string">'off'</span>)
plt.imshow(img_array[<span class="hljs-number">125</span>:<span class="hljs-number">325</span>,<span class="hljs-number">105</span>:<span class="hljs-number">305</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-182.png" alt="Image" width="600" height="400" loading="lazy"></p>
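<p>The slice works because the first axis of the array indexes rows (height), the second indexes columns (width), and the third holds the color channels. The sketch below uses a synthetic all-red array with the same shape as the image above, so it runs without downloading anything. The array contents are an assumption for illustration.</p>
<pre><code class="lang-py">import numpy as np

# Synthetic stand-in for the image: 481 rows, 640 columns, 3 channels
img_array = np.zeros((481, 640, 3), dtype=np.uint8)
img_array[:, :, 0] = 255  # fill the red channel, leave green and blue at 0

# Rows 125 to 324 and columns 105 to 304, all three channels
crop = img_array[125:325, 105:305]

# A single channel is a 2-dimensional array
red_channel = img_array[:, :, 0]

print(crop.shape)
print(red_channel.shape)
</code></pre>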
<h3 id="heading-how-to-plot-multiple-charts-in-a-grid-in-matplotlib-and-seaborn">How to Plot Multiple Charts in a Grid <strong>in Matplotlib and Seaborn</strong></h3>
<p>Matplotlib and Seaborn also support plotting multiple charts in a grid, using <code>plt.subplots</code>, which returns a set of axes for plotting.</p>
<p>Here's a single grid showing the different types of charts we've covered in this tutorial.</p>
<pre><code class="lang-py">fig, axes = plt.subplots(<span class="hljs-number">2</span>, <span class="hljs-number">3</span>, figsize=(<span class="hljs-number">16</span>, <span class="hljs-number">8</span>))

<span class="hljs-comment"># Use the axes for plotting</span>
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].plot(years, apples, <span class="hljs-string">'s-b'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].plot(years, oranges, <span class="hljs-string">'o--r'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_xlabel(<span class="hljs-string">'Year'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_title(<span class="hljs-string">'Crop Yields in Kanto'</span>)


<span class="hljs-comment"># Pass the axes into seaborn</span>
axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>].set_title(<span class="hljs-string">'Sepal Length vs. Sepal Width'</span>)
sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species, 
                s=<span class="hljs-number">100</span>, 
                ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>]);

<span class="hljs-comment"># Use the axes for plotting</span>
axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>].set_title(<span class="hljs-string">'Distribution of Sepal Width'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>].hist([setosa_df.sepal_width, versicolor_df.sepal_width, virginica_df.sepal_width], 
         bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>), 
         stacked=<span class="hljs-literal">True</span>);

axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>].legend([<span class="hljs-string">'Setosa'</span>, <span class="hljs-string">'Versicolor'</span>, <span class="hljs-string">'Virginica'</span>]);

<span class="hljs-comment"># Pass the axes into seaborn</span>
axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>].set_title(<span class="hljs-string">'Restaurant bills'</span>)
sns.barplot(x=<span class="hljs-string">'day'</span>, y=<span class="hljs-string">'total_bill'</span>, hue=<span class="hljs-string">'sex'</span>, data=tips_df, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>]);

<span class="hljs-comment"># Pass the axes into seaborn</span>
axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>].set_title(<span class="hljs-string">'Flight traffic'</span>)
sns.heatmap(flights_df, cmap=<span class="hljs-string">'Blues'</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>]);

<span class="hljs-comment"># Plot an image using the axes</span>
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].set_title(<span class="hljs-string">'Data Science Meme'</span>)
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].imshow(img)
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].grid(<span class="hljs-literal">False</span>)
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].set_xticks([])
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].set_yticks([])

plt.tight_layout(pad=<span class="hljs-number">2</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-183.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>See this page for a full list of supported functions: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.3.1%2Fapi%2Faxes_api.html%23the-axes-class">https://matplotlib.org/3.3.1/api/axes_api.html#the-axes-class</a>.</p>
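<p>With more than one row and column, <code>plt.subplots</code> returns the axes as a 2-dimensional NumPy array, which is why the grid above is indexed as <code>axes[row, column]</code>. Here is a short sketch of that structure, with placeholder titles standing in for real charts.</p>
<pre><code class="lang-py">import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed in a notebook
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(16, 8))
print(axes.shape)

# axes.flat walks the grid row by row, which is handy for loops
for index, ax in enumerate(axes.flat):
    ax.set_title(f'Chart {index + 1}')

fig.tight_layout(pad=2)
last_title = axes[1, 2].get_title()
print(last_title)
</code></pre>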
<h4 id="heading-pair-plots-with-seaborn"><strong>Pair Plots with Seaborn</strong></h4>
<p>Seaborn also provides a helper function <code>sns.pairplot</code> to automatically plot several different charts for pairs of features within a dataframe.</p>
<pre><code class="lang-py">sns.pairplot(flowers_df, hue=<span class="hljs-string">'species'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-184.png" alt="Image" width="600" height="400" loading="lazy">
<em>See the full output <a target="_blank" href="https://jovian.ai/embed?url=https://jovian.ai/aakashns/python-matplotlib-data-visualization/">here</a>.</em></p>
<pre><code class="lang-py">sns.pairplot(tips_df, hue=<span class="hljs-string">'sex'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-185.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-summary-and-further-reading-2">Summary and Further Reading</h3>
<p>We have covered the following topics in this tutorial:</p>
<ul>
<li>How to create and customize line charts using Matplotlib</li>
<li>How to visualize relationships between two or more variables using scatter plots</li>
<li>How to study distributions of variables using histograms and bar charts</li>
<li>How to visualize two-dimensional data using heatmaps</li>
<li>How to display images using Matplotlib's <code>plt.imshow</code></li>
<li>How to plot multiple Matplotlib and Seaborn charts in a grid</li>
</ul>
<p>In this tutorial we've covered some of the fundamental concepts and popular techniques for data visualization using Matplotlib and Seaborn. Data visualization is a vast field and we've barely scratched the surface here. Check out these references to learn and discover more:</p>
<ul>
<li>Data Visualization cheat sheet: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fjovian.ml%2Faakashns%2Fdataviz-cheatsheet">https://jovian.ml/aakashns/dataviz-cheatsheet</a></li>
<li>Seaborn gallery: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fseaborn.pydata.org%2Fexamples%2Findex.html">https://seaborn.pydata.org/examples/index.html</a></li>
<li>Matplotlib gallery: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.1.1%2Fgallery%2Findex.html">https://matplotlib.org/3.1.1/gallery/index.html</a></li>
<li>Matplotlib tutorial: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fgithub.com%2Frougier%2Fmatplotlib-tutorial">https://github.com/rougier/matplotlib-tutorial</a></li>
</ul>
<h3 id="heading-review-questions-to-check-your-comprehension-2">Review Questions to Check Your Comprehension</h3>
<p>Try answering the following questions to test your understanding of the topics covered in this notebook:</p>
<ol>
<li>What is data visualization?</li>
<li>What is Matplotlib?</li>
<li>What is Seaborn?</li>
<li>How do you install Matplotlib and Seaborn?</li>
<li>How do you import Matplotlib and Seaborn? What are the common aliases used while importing these modules?</li>
<li>What is the purpose of the magic command <code>%matplotlib inline</code>?</li>
<li>What is a line chart?</li>
<li>How do you plot a line chart in Python? Illustrate with an example.</li>
<li>How do you specify values for the X-axis of a line chart?</li>
<li>How do you specify labels for the axes of a chart?</li>
<li>How do you plot multiple line charts on the same axes?</li>
<li>How do you show a legend for a line chart with multiple lines?</li>
<li>How do you set a title for a chart?</li>
<li>How do you show markers on a line chart?</li>
<li>What are the different options for styling lines and markers in line charts? Illustrate with examples.</li>
<li>What is the purpose of the <code>fmt</code> argument to <code>plt.plot</code>?</li>
<li>Where can you see a list of all the arguments accepted by <code>plt.plot</code>?</li>
<li>How do you change the size of the figure using Matplotlib?</li>
<li>How do you apply the default styles from Seaborn globally for all charts?</li>
<li>What are the predefined styles available in Seaborn? Illustrate with examples.</li>
<li>What is a scatter plot?</li>
<li>How is a scatter plot different from a line chart?</li>
<li>How do you draw a scatter plot using Seaborn? Illustrate with an example.</li>
<li>How do you decide when to use a scatter plot vs a line chart?</li>
<li>How do you specify the colors for dots on a scatter plot using a categorical variable?</li>
<li>How do you customize the title, figure size, legend, and so on for Seaborn plots?</li>
<li>How do you use a Pandas dataframe with <code>sns.scatterplot</code>?</li>
<li>What is a histogram?</li>
<li>When should you use a histogram vs a line chart?</li>
<li>How do you draw a histogram using Matplotlib? Illustrate with an example.</li>
<li>What are "bins" in a histogram?</li>
<li>How do you change the sizes of bins in a histogram?</li>
<li>How do you change the number of bins in a histogram?</li>
<li>How do you show multiple histograms on the same axes?</li>
<li>How do you stack multiple histograms on top of one another?</li>
<li>What is a bar chart?</li>
<li>How do you draw a bar chart using Matplotlib? Illustrate with an example.</li>
<li>What is the difference between a bar chart and a histogram?</li>
<li>What is the difference between a bar chart and a line chart?</li>
<li>How do you stack bars on top of one another?</li>
<li>What is the difference between <code>plt.bar</code> and <code>sns.barplot</code>?</li>
<li>What do the lines cutting the bars in a Seaborn bar plot represent?</li>
<li>How do you show bar plots side-by-side?</li>
<li>How do you draw a horizontal bar plot?</li>
<li>What is a heat map?</li>
<li>What type of data is best visualized with a heat map?</li>
<li>What does the <code>pivot</code> method of a Pandas dataframe do?</li>
<li>How do you draw a heat map using Seaborn? Illustrate with an example.</li>
<li>How do you change the color scheme of a heat map?</li>
<li>How do you show the original values from the dataset on a heat map?</li>
<li>How do you download images from a URL in Python?</li>
<li>How do you open an image for processing in Python?</li>
<li>What is the purpose of the <code>PIL</code> module in Python?</li>
<li>How do you convert an image loaded using PIL into a Numpy array?</li>
<li>How many dimensions does a Numpy array for an image have? What does each dimension represent?</li>
<li>What are "color channels" in an image?</li>
<li>What is RGB?</li>
<li>How do you display an image using Matplotlib?</li>
<li>How do you turn off the axes and gridlines in a chart?</li>
<li>How do you display a portion of an image using Matplotlib?</li>
<li>How do you plot multiple charts in a grid using Matplotlib and Seaborn? Illustrate with examples.</li>
<li>What is the purpose of the <code>plt.subplots</code> function?</li>
<li>What are pair plots in Seaborn? Illustrate with an example.</li>
<li>How do you export a plot into a PNG image file using Matplotlib?</li>
<li>Where can you learn about the different types of charts you can create using Matplotlib and Seaborn?</li>
</ol>
<p>Congratulations on making it to the end of this tutorial! You can now apply these skills to analyze real world datasets from sources like <a target="_blank" href="https://kaggle.com/datasets">Kaggle</a>. </p>
<p>If you're pursuing a career in data science and machine learning, consider joining the <a target="_blank" href="https://zerotodatascience.com">Zero to Data Science Bootcamp by Jovian</a>. It's a 20-week part-time program where you'll complete 7 courses, 12 coding assignments and 4 real-world projects. You will also receive 6 months of career support to help you find your first data science job.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.jovian.ai/zero-to-data-science-bootcamp">https://www.jovian.ai/zero-to-data-science-bootcamp</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Python Data Analysis: How to Visualize a Kaggle Dataset with Pandas, Matplotlib, and Seaborn ]]>
                </title>
                <description>
                    <![CDATA[ By Srijan The Indian Premier League or IPL is a T20 cricket tournament organized annually by the Board of Control for Cricket In India (BCCI). Eight city-based franchises compete with each other over 6 weeks to find the winner. In this article, I'm g... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/kaggle-dataset-analysis-with-pandas-matplotlib-seaborn/</link>
                <guid isPermaLink="false">66d4614a73634435aafcefdc</guid>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analytics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ kaggle ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Matplotlib ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 22 Oct 2020 17:49:27 +0000</pubDate>
                <media:content url="https://cdn-media-2.freecodecamp.org/w1280/5f9c9822740569d1a4ca1855.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Srijan</p>
<p>The <strong>Indian Premier League</strong> or IPL is a T20 cricket tournament organized annually by the Board of Control for Cricket In India (BCCI). Eight city-based franchises compete with each other over 6 weeks to find the winner.</p>
<p>In this article, I'm going to analyze data from the IPL's past seasons to see which teams have won the most games, how teams behave when winning a toss, who has the greatest legacy, and so on. </p>
<p>I have done this analysis from a historical point of view, giving an overview of what has happened in the IPL over the years. I have used tools such as <em>Pandas</em>, <em>Matplotlib</em> and <em>Seaborn</em> along with <em>Python</em> to give a visual as well as numeric representation of the data in front of us.</p>
<p><strong>Pandas</strong> stands for <em>Python Data Analysis</em> library. It is typically used for working with tabular data (similar to the data stored in a spreadsheet). Pandas provides helper functions to read data from various file formats like CSV, Excel spreadsheets, HTML tables, JSON, SQL and perform operations on them.</p>
<p><strong>Matplotlib</strong> and <strong>Seaborn</strong> are two Python libraries that are used to produce plots. Matplotlib is generally used for plotting lines, pie charts, and bar graphs. </p>
<p>Seaborn provides some more advanced visualization features with less syntax and more customizations. I switch back-and-forth between them during the analysis.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><a class="post-section-overview" href="#heading-1-getting-the-dataset">Getting the Dataset</a></li>
<li><a class="post-section-overview" href="#heading-2-data-preparation-and-cleaning">Data Preparation and Cleaning</a></li>
<li><a class="post-section-overview" href="#heading-3-exploratory-analysis-and-visualization">Exploratory Analysis and Visualization</a></li>
<li><a class="post-section-overview" href="#heading-4-asking-and-answering-questions-from-the-data">Asking and Answering Questions</a></li>
<li><a class="post-section-overview" href="#heading-5-inferences-from-the-analysis">Inferences From the Analysis</a></li>
<li><a class="post-section-overview" href="#heading-6-conclusion">Conclusion</a></li>
</ol>
<h2 id="heading-1-getting-the-dataset">1. Getting the Dataset</h2>
<p>I downloaded the dataset from <a target="_blank" href="https://www.kaggle.com/nowke9/ipldata">Kaggle</a>. You will see there are two CSV (Comma Separated Value) files, matches.csv and deliveries.csv. I chose to do my analysis on matches.csv.</p>
<p>To find more interesting datasets, you can look at <a target="_blank" href="https://jovian.ml/forum/t/recommended-datasets-for-course-project/11711">this</a> page.</p>
<h2 id="heading-2-data-preparation-and-cleaning">2. Data Preparation and Cleaning</h2>
<p>A dataset contains many columns and rows. It is always possible that certain rows have missing values or <code>NaN</code> for one or more columns. </p>
<p>It is also possible that there might be certain columns or rows that you want to discard from your analysis. You can also combine two or more datasets for an in-depth analysis.</p>
<p>Cleaning the data involves making corrections to that data, leaving out unnecessary columns or rows, merging datasets, and so on.</p>
<p>Before taking these steps, I needed to install and import the tools (<em>libraries</em>) to be used during the analysis. I imported the libraries with different aliases such as <code>pd</code>, <code>plt</code> and <code>sns</code>.  I then set some basic styles for the plots.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=5" height="308" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>Notice the special command <code>%matplotlib inline</code>. It makes sure that plots are shown and embedded within the Jupyter notebook itself. Without this command, sometimes plots may show up in pop-up windows.</p>
<p>Using the <code>read_csv()</code> function from the <em>Pandas</em> library, I loaded the <em>matches.csv</em> file.</p>
<p>Data from the file is read and stored in a <code>DataFrame</code> object - one of the core data structures in Pandas for storing and working with tabular data. I used the <code>_df</code> suffix in the variable names for data frames.</p>
<p>I used the name <code>matches_raw_df</code> for the data frame. This indicates that this is unprocessed data that I will clean, filter, and modify to prepare a data frame that's ready for analysis.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=9" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=10" height="308" width="800" title="Embedded content" loading="lazy"></iframe></div>
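<p>Since the notebook cells above are embedded, here is a minimal sketch of the same loading step. The CSV text is a made-up stand-in for <em>matches.csv</em> (three invented rows using a few of the column names mentioned in this article), so the values are assumptions for illustration only.</p>

```python
import io

import pandas as pd

# Made-up stand-in for matches.csv: three invented rows, not real IPL data.
csv_text = """id,season,team1,team2,winner
1,2008,Team A,Team B,Team A
2,2008,Team C,Team D,Team D
3,2009,Team A,Team C,
"""

# read_csv accepts a file path or any file-like object.
matches_raw_df = pd.read_csv(io.StringIO(csv_text))

# The empty winner field in the last row is read as NaN.
print(matches_raw_df)
```

<p>With the real file, you would pass its path (for example <code>'matches.csv'</code>) to <code>pd.read_csv()</code> instead of the in-memory text.</p>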

<p>Using the <code>shape</code> property of a <code>DataFrame</code> object, I found that the dataset contains 756 rows and 18 columns. To find the names of those columns I used the <code>columns</code> property. It returned an <code>Index</code> object holding the column labels of the data frame.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=11" height="138" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=13" height="222" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>To get a summary of what the data frame contains, I used <code>info()</code>. This gives information about columns, number of non-null values in each column, their data type, and memory usage.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=15" height="717" width="800" title="Embedded content" loading="lazy"></iframe></div>
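<p>As a rough sketch of these three inspection tools, the snippet below runs them on a tiny invented data frame (the rows are assumptions, not the IPL data).</p>

```python
import pandas as pd

# Tiny invented data frame, used only to show the inspection calls.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "season": [2008, 2008, 2009],
    "winner": ["Team A", None, "Team C"],
})

print(df.shape)          # (rows, columns) as a tuple
print(list(df.columns))  # df.columns is an Index; list() gives plain labels
df.info()                # prints non-null counts, dtypes and memory usage
```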

<p>Every column except <code>umpire3</code> has few or no null values. The presence of null values could result from a lack of information or an incorrect data entry. </p>
<p>An interesting thing to observe is that, although there are no null values for the <code>result</code> column, there are some for <code>winner</code> and <code>player_of_match</code> columns. Let's find out why.</p>
<p>I first accessed the <code>result</code> column using <em>dot notation</em> (<code>matches_raw_df.result</code>). Then I used the <code>value_counts()</code> method on the <code>result</code> column.</p>
<p><code>value_counts()</code> returns a <em>series</em> which contains counts of unique values. Here, it tells us about the different values present in <code>result</code> and the total number for each of them.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=18" height="218" width="800" title="Embedded content" loading="lazy"></iframe></div>
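<p>The same check can be sketched on a few invented rows. The values below are assumptions chosen so that one match has no result and therefore no winner.</p>

```python
import pandas as pd

# Invented rows: one match ends with no result, so its winner is missing.
matches_raw_df = pd.DataFrame({
    "result": ["normal", "normal", "tie", "no result", "normal"],
    "winner": ["Team A", "Team B", "Team A", None, "Team C"],
})

# Count how often each distinct value appears in the result column.
result_counts = matches_raw_df.result.value_counts()
print(result_counts)

# Select the rows where the winner is missing to see what they have in common.
no_winner_df = matches_raw_df[matches_raw_df.winner.isna()]
print(no_winner_df)
```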

<p>So, out of 756 matches (rows), 4 matches ended as <em>no result</em>. </p>
<p>Cricket is an outdoor sport and unlike, say, football, play isn't possible when it's raining. It is very common for matches to be abandoned because of persistent rain. Therefore, we have no winners or player of the match for these 4 matches.</p>
<p>For this analysis, the <code>umpire3</code> column isn't needed. So I removed the column using the <code>drop()</code> method by passing the column name and axis value. If you want to remove multiple columns, the column names are to be given in a list.</p>
<p>I assigned this <strong>cleaned</strong> data frame to <code>matches_df</code>. I used this data frame for further analysis.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=22" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=23" height="308" width="800" title="Embedded content" loading="lazy"></iframe></div>
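<p>A minimal sketch of dropping columns, again on invented rows. Note that <code>drop()</code> returns a new data frame and leaves the original untouched unless you reassign it.</p>

```python
import pandas as pd

# Invented rows with two of the column names mentioned in this article.
matches_raw_df = pd.DataFrame({
    "id": [1, 2],
    "player_of_match": ["Player X", "Player Y"],
    "umpire3": [None, None],
})

# Drop a single column: axis=1 means "columns" (axis=0 would mean rows).
matches_df = matches_raw_df.drop("umpire3", axis=1)

# Drop several columns at once by passing their names in a list.
ids_only_df = matches_raw_df.drop(["umpire3", "player_of_match"], axis=1)

print(list(matches_df.columns))
print(list(ids_only_df.columns))
```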

<h2 id="heading-3-exploratory-analysis-and-visualization">3. Exploratory Analysis and Visualization</h2>
<p>Exploratory analysis involves performing operations on the dataset to understand the data and find patterns. It helps us make sense of the data we have. </p>
<p>Visualization is the graphic representation of data. It involves producing charts that communicate those patterns among the represented data to viewers.</p>
<p>Now, let's take a look at the data I analyzed and what I learned in the process.</p>
<h3 id="heading-number-of-matches-and-teams">Number of matches and teams</h3>
<p>I tried to find the number of matches played in each season in the IPL from its inception to 2019.</p>
<p>Since I needed matches played each season, it made sense to group our data according to different seasons. Pandas has a <code>groupby()</code> method to achieve this, wherein I passed <code>season</code> as an argument.</p>
<p>Since an <code>id</code> is unique for each match (row), counting the number of ids for each season leads to what we want. I used the <code>count()</code> method on the <code>id</code> column to find the number of matches held each season. This series is assigned to the variable <code>matches_per_season</code>.</p>
<p>I then used the <code>barplot()</code> function from the Seaborn library to plot the series. The index of the series (the seasons) supplied the x-values, and the corresponding match counts supplied the y-values.</p>
<p>I used various <code>matplotlib.pyplot</code> functions such as <code>figure()</code>, <code>xticks()</code> and <code>title()</code> to set the size of the plot, title of the plot, and so on. </p>
<p><code>figure()</code> takes a parameter, <code>figsize</code>, which I set to <code>(12,6)</code>. Notice that the size was given as a tuple. To <code>xticks()</code>, I gave the <code>rotation</code> parameter a value of <code>75</code> to make the tick labels easier to read. </p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=30" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=31" height="565" width="800" title="Embedded content" loading="lazy"></iframe></div>
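<p>Here is a small sketch of the grouping and plotting steps on invented data. To keep the dependencies minimal, it uses Matplotlib's <code>plt.bar()</code> as a stand-in for <code>sns.barplot()</code>, and the <code>Agg</code> backend so that it also runs outside a notebook. Both of those choices are assumptions of this sketch, not what the notebook does.</p>

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Invented match ids and seasons.
matches_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "season": [2008, 2008, 2009, 2009, 2009, 2010],
})

# One row per match, so counting ids per season gives matches per season.
matches_per_season = matches_df.groupby("season").id.count()

plt.figure(figsize=(12, 6))  # figure size is given as a (width, height) tuple
plt.bar(matches_per_season.index.astype(str), matches_per_season.values)
plt.xticks(rotation=75)      # rotate tick labels so they do not overlap
plt.title("Number of matches per season")
```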

<p>Each season, almost 60 matches were played. However, we see a spike in the number of matches from 2011 to 2013. This is because two new franchises, the <strong>Pune Warrior</strong>s and <strong>Kochi Tuskers Kerala</strong>, were introduced, increasing the number of teams to 10.</p>
<p>However, Kochi was removed in the very next season, while the Pune Warriors were removed in 2013, bringing the number down to 8 from 2014 onwards.</p>
<p>Before the start of the 2016 season, two teams, the <strong>Chennai Super Kings</strong> and <strong>Rajasthan Royals</strong> were banned for two seasons. To make up for their absence, two new teams (the <strong>Rising Pune Supergiants</strong> and <strong>Gujarat Lions</strong>) entered the competition.</p>
<p>When the Chennai Super Kings and Rajasthan Royals returned, these two teams were removed from the competition.</p>
<h3 id="heading-analyzing-the-toss-results">Analyzing the Toss results</h3>
<p>One of the most significant events in any cricket match is the toss, which happens at the very start of a match. The toss winner can choose whether they want to bat first or second (fielding first). </p>
<p>Let's see what the trend has been amongst the teams across different seasons.</p>
<p>Again I grouped the rows by season and then counted the different values of the <code>toss_decision</code> column by using <code>value_counts()</code>. </p>
<p>Since a percentage gives a clearer picture, I divided the above result by <code>matches_per_season</code> and multiplied it by 100. This series was assigned to <code>toss_decision_percentage</code>.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=35" height="105" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=36" height="643" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>Here, <code>toss_decision_percentage</code> is a series with <em>multi-index</em>. If we print the index of the series using the <code>index</code> property, we see it is of the form <code>(2008, 'bat'), (2008, 'field')</code> and so on. </p>
<p>The series used both <code>season</code> and <code>toss_decision</code> as an index. But I only wanted the seasons to be an index. I used <code>unstack()</code> to achieve this. </p>
<p>Calling <code>unstack()</code> on the series converted the values of <code>toss_decision</code> (that is, <code>bat</code> and <code>field</code>) into separate columns.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/85&amp;cellId=38" height="490" width="800" title="Embedded content" loading="lazy"></iframe></div>
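<p>The whole toss calculation can be sketched on invented data as follows. This version uses <code>Series.div()</code> with <code>level="season"</code> to make the alignment between the multi-index series and <code>matches_per_season</code> explicit. That is a choice made for this sketch and may differ from the exact expression in the notebook.</p>

```python
import pandas as pd

# Invented toss decisions: four matches in each of two seasons.
matches_df = pd.DataFrame({
    "id": range(1, 9),
    "season": [2008] * 4 + [2009] * 4,
    "toss_decision": ["bat", "bat", "field", "field",
                      "field", "field", "field", "bat"],
})

matches_per_season = matches_df.groupby("season").id.count()

# Multi-index series keyed by (season, toss_decision).
toss_counts = matches_df.groupby("season").toss_decision.value_counts()

# Divide each count by that season's total, matching on the season level.
toss_decision_percentage = toss_counts.div(matches_per_season, level="season") * 100

# Move toss_decision out of the index and into columns: one row per season.
toss_decision_df = toss_decision_percentage.unstack()
print(toss_decision_df)
```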

<p>Next I used the data frame's <code>plot()</code> method, which draws with Matplotlib, to represent these values as bar charts. <code>plot()</code> has a parameter <code>kind</code> which decides what type of plot to draw. The value was set to <code>bar</code>.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/85&amp;cellId=39" height="484" width="800" title="Embedded content" loading="lazy"></iframe></div>
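<p>A short sketch of this plotting call, using an invented two-season table in place of the real percentages. The title and axis label are assumptions of this sketch.</p>

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd

# Invented percentages shaped like the unstacked toss table.
toss_decision_df = pd.DataFrame(
    {"bat": [50.0, 25.0], "field": [50.0, 75.0]},
    index=pd.Index([2008, 2009], name="season"),
)

# kind="bar" draws one group of bars per row and one bar per column.
ax = toss_decision_df.plot(kind="bar", figsize=(12, 6),
                           title="Toss decisions by season")
ax.set_ylabel("Percentage of tosses")
```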

<p>For 2008-2013, teams seemed to favour both batting first and second. For this period, teams chose to bat first more in 2009, 2010 and 2013. On the other hand, they chose fielding first more in 2008 and 2011. Things were even-steven in 2012.</p>
<p>This could be because IPL and T20 cricket in general was in its budding stages. So, teams were probably learning and trying to figure out which option would be more beneficial.</p>
<p>However, since 2014, teams have overwhelmingly chosen to bat second. Especially since 2016, teams have chosen to field first <strong>more than 80%</strong> of the time.</p>
<p>Batting first requires that the team gauge the conditions and the pitch and then set a target accordingly. Chasing is less complicated, as there is a fixed target to achieve. </p>
<p>Conditions have also become more batsman-friendly and the skills of the batsmen have increased tremendously (<em>read more</em> <a target="_blank" href="https://www.espncricinfo.com/story/_/id/18568387/tim-wigmore-how-batting-second-become-more-fruitful-more-popular"><em>here</em></a>).</p>
<h3 id="heading-number-of-wins">Number of Wins</h3>
<p>We saw how teams in the recent past have chosen to bat second more than 4 out of 5 times. Did this decision transform the results? Let's see.</p>
<p>For <code>wins_batting_first</code>, the value of <code>win_by_wickets</code> has to be 0. Also, the <code>result</code> column should have a value of <code>normal</code>, since tied matches also have a win margin of 0. This condition was stored as <code>filter1</code>.</p>
<p>Similarly, for <code>wins_fielding_first</code>, the value of <code>win_by_runs</code> has to be 0 and the <code>result</code> column should have a value of <code>normal</code>. This condition was stored as <code>filter1</code>.</p>
<p>For both series, I used the <code>count()</code> method on the <code>winner</code> column to count the matches won under each condition. I divided the results by <code>matches_per_season</code>, calculated earlier, so that seasons with different numbers of matches can be compared.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/88&amp;cellId=43" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/88&amp;cellId=44" height="105" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/88&amp;cellId=45" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/89&amp;cellId=46" height="105" width="800" title="Embedded content" loading="lazy"></iframe></div>
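<p>The two conditions can be sketched as boolean masks on invented rows. The mask variable names below are hypothetical and chosen for readability. They are not the names used in the notebook.</p>

```python
import pandas as pd

# Invented results: a team that wins by runs batted first,
# a team that wins by wickets fielded first, and ties have both margins at 0.
matches_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "season": [2008, 2008, 2008, 2009, 2009],
    "result": ["normal", "normal", "tie", "normal", "normal"],
    "winner": ["Team A", "Team B", "Team A", "Team B", "Team C"],
    "win_by_runs": [20, 0, 0, 5, 0],
    "win_by_wickets": [0, 6, 0, 0, 4],
})

matches_per_season = matches_df.groupby("season").id.count()

# Combine conditions with & and wrap each one in parentheses.
batting_first_mask = (matches_df.win_by_wickets == 0) & (matches_df.result == "normal")
fielding_first_mask = (matches_df.win_by_runs == 0) & (matches_df.result == "normal")

wins_batting_first = (
    matches_df[batting_first_mask].groupby("season").winner.count() / matches_per_season
)
wins_fielding_first = (
    matches_df[fielding_first_mask].groupby("season").winner.count() / matches_per_season
)
print(wins_batting_first)
print(wins_fielding_first)
```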

<p>To plot these two series together, I combined them using Pandas' <code>concat()</code> method. I passed the two series names as a list and set the value of <code>axis</code> as <code>1</code>. This gives us a new data frame which was stored as <code>combined_wins_df</code>.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/89&amp;cellId=47" height="547" width="800" title="Embedded content" loading="lazy"></iframe></div>
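<p>As a small sketch of the combining step, the two series below hold invented win fractions. Their <code>name</code> attributes become the column names of the resulting data frame.</p>

```python
import pandas as pd

seasons = pd.Index([2008, 2009], name="season")

# Invented win fractions for two seasons.
wins_batting_first = pd.Series([0.4, 0.5], index=seasons, name="batting_first")
wins_fielding_first = pd.Series([0.6, 0.5], index=seasons, name="fielding_first")

# axis=1 places the series side by side as columns, aligned on the index.
combined_wins_df = pd.concat([wins_batting_first, wins_fielding_first], axis=1)
print(combined_wins_df)
```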

<p>Next I plotted <code>combined_wins_df</code> as a bar chart using <code>plot()</code>.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=44" height="484" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>We saw earlier that for 2008-2013, teams faced a conundrum whether to bat first or field first. This is partially visible in the results as well. </p>
<p>The wins from batting first are very close to those from fielding first. However, there is just one season where teams batting first won more, with things being equal in 2013.</p>
<p>Again, since 2014, things have been in favour of teams chasing, except in 2015. Leaving out 2015, things have been overwhelmingly in favour of teams fielding first.</p>
<p>So, teams choosing to field more have been justified in their decisions.</p>
<h3 id="heading-teams-with-history">Teams with "History"</h3>
<p>In leagues across different sports, there is always talk about teams with "history" – teams that have played the most in the league and continue to do so. Let's find those teams in the IPL.</p>
<p>Now, between two teams A and B, it can be "A vs B" or "B vs A", depending on how the data entry has been done. So I decided to count the total number of different values for both the <code>team1</code> and <code>team2</code> columns using <code>value_counts()</code>. Then I added them together.</p>
<p>I sorted the results in descending order using the <code>sort_values()</code> method from Pandas. The <code>ascending</code> parameter was set to <code>False</code>.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=48" height="470" width="800" title="Embedded content" loading="lazy"></iframe></div>
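<p>Here is a sketch of the counting step with invented team names. One detail worth knowing: if a team appears in only one of the two columns, a plain <code>+</code> gives <code>NaN</code> for that team, so this sketch uses <code>add()</code> with <code>fill_value=0</code>. Whether the notebook needs this depends on the data, so treat it as a defensive assumption.</p>

```python
import pandas as pd

# Invented fixtures: Team A never appears as team2, Team D never as team1.
matches_df = pd.DataFrame({
    "team1": ["Team A", "Team A", "Team B", "Team C"],
    "team2": ["Team B", "Team C", "Team C", "Team D"],
})

total_matches_played = (
    matches_df.team1.value_counts()
    .add(matches_df.team2.value_counts(), fill_value=0)  # treat missing counts as 0
    .astype(int)
    .sort_values(ascending=False)                        # most matches first
)
print(total_matches_played)
```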

<p>Here, I used <code>sns.barplot()</code> to plot the graph.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=49" height="451" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>The <strong>Mumbai Indians</strong> have played the most matches. They are followed by the Royal Challengers Bangalore, Kolkata Knight Riders, Kings XI Punjab and Chennai Super Kings.</p>
<p>The Chennai Super Kings and Rajasthan Royals could have been higher had they not been banned.</p>
<p>You will see there are two teams from Delhi, the <strong>Delhi Daredevils</strong> and <strong>Delhi Capitals</strong>. This resulted from a change in ownership and then team name in 2018.</p>
<p>It's a similar story for the <strong>Deccan Chargers</strong> and <strong>Sunrisers Hyderabad</strong>, as the Deccan Chargers were removed from the IPL in 2013 and the Sunrisers came in their place.</p>
<p>Also, there are two teams with almost the same name: the <strong>Rising Pune Supergiants</strong> and <strong>Rising Pune Supergiant</strong>. They are the same team, and there was no change in ownership – it has more to do with superstitions.</p>
<p>In the 2016 season, the Rising Pune Supergiants finished 7th. The owners changed the captain for 2017 and also <strong>dropped the 's'</strong> from Supergiants. Well, it paid off as they finished as runner-up that season!</p>
<h3 id="heading-teams-with-legacy">Teams with "Legacy"</h3>
<p>Now, teams may have a lot of history but it's their "legacy" – how often they win – that makes them popular and attracts new and neutral fans.</p>
<p>To find such teams, I simply used <code>value_counts()</code> on the <code>winner</code> column. This gives us the number of matches that each team has won.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=53" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=54" height="433" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>So Mumbai has the most wins. But a better metric to judge would be the win percentage. To find the win percentage, I divided <code>most_wins</code> by <code>total_matches_played</code> to find the <code>win_percentage</code> for each team.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=57" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=58" height="444" width="800" title="Embedded content" loading="lazy"></iframe></div>
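<p>The win percentage calculation can be sketched like this on invented results. Teams that never won are missing from <code>most_wins</code>, so the sketch passes <code>fill_value=0</code> to <code>div()</code> to give them 0% rather than <code>NaN</code>. That handling is an assumption of this sketch.</p>

```python
import pandas as pd

# Invented fixtures and winners.
matches_df = pd.DataFrame({
    "team1": ["Team A", "Team A", "Team B", "Team C"],
    "team2": ["Team B", "Team C", "Team C", "Team D"],
    "winner": ["Team A", "Team A", "Team C", "Team C"],
})

total_matches_played = matches_df.team1.value_counts().add(
    matches_df.team2.value_counts(), fill_value=0
)
most_wins = matches_df.winner.value_counts()

# Wins divided by matches played, aligned on team name, as a percentage.
win_percentage = (
    most_wins.div(total_matches_played, fill_value=0)
    .mul(100)
    .sort_values(ascending=False)
)
print(win_percentage.round(2))
```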

<p>The Rising Pune Supergiant and Delhi Capitals have the highest win percentage. This is largely because they have played fewer matches compared to most teams. Especially Rising Pune Supergiant, which technically became a new team after dropping the 's'.</p>
<p>The Chennai Super Kings, despite playing two fewer seasons than the Mumbai Indians, had only 9 fewer victories. They, along with the Mumbai Indians, are the only two teams in the top 5 that were also part of the IPL in 2008.</p>
<p><strong>Chennai</strong> and <strong>Mumbai</strong> are the teams with the most legacy.</p>
<h2 id="heading-4-asking-and-answering-questions-from-the-data">4. Asking and Answering Questions from the Data</h2>
<p>We've already gained some insights about the IPL by exploring various columns of our dataset. </p>
<p>Let's ask some specific questions, and try to answer them using data frame operations and interesting visualizations.</p>
<h3 id="heading-q-who-has-won-the-ipl-tournament">Q. Who has won the IPL tournament?</h3>
<ul>
<li>Group the rows according to seasons using <code>groupby()</code>.</li>
<li>Find the last match of each season, that is, the final, using <code>tail()</code>. It returns the last n rows from a <code>DataFrame</code> object or series based on position.</li>
<li>Sort the values per season using <code>sort_values()</code>.</li>
<li>Count the different winners and the times they won using <code>value_counts()</code> on <code>winner</code>.</li>
</ul>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=65" height="134" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=66" height="264" width="800" title="Embedded content" loading="lazy"></iframe></div>
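<p>The steps above can be sketched as follows on invented rows. The sketch assumes that, within each season, rows are stored in chronological order so that the last row is the final. The intermediate name <code>finals_df</code> is hypothetical.</p>

```python
import pandas as pd

# Invented rows; seasons are deliberately out of order to show the sort step.
matches_df = pd.DataFrame({
    "season": [2009, 2009, 2008, 2008, 2010, 2010],
    "winner": ["Team B", "Team C", "Team A", "Team A", "Team C", "Team C"],
})

# tail(1) on a groupby keeps the last row of every season.
finals_df = matches_df.groupby("season").tail(1).sort_values("season")

# Count how many finals each team has won.
ipl_winners = finals_df.winner.value_counts()
print(finals_df)
print(ipl_winners)
```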

<p>Then I plotted the series <code>ipl_winners</code> using <code>sns.barplot()</code>.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=67" height="353" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>Mumbai and Chennai, our <em>legacy</em> teams, have won the IPL at least 3 times. The Sunrisers Hyderabad are the only team that joined the league later and won the trophy.</p>
<h3 id="heading-q-which-are-the-most-and-least-consistent-teams-across-all-seasons">Q. Which are the most and least consistent teams across all seasons?</h3>
<ul>
<li>Create a data frame relating the different values of <code>winner</code> and <code>season</code> using <code>pd.crosstab()</code>.</li>
<li>Plot the data frame as a heatmap.</li>
</ul>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=71" height="105" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=72" height="208" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p><code>pd.crosstab()</code> gives a simple cross-tabulation of the <code>winner</code> and <code>season</code> columns. For each different value of <code>winner</code>, <code>pd.crosstab()</code> finds its frequency for each different value in <code>season</code>. </p>
<p>Then I plotted  <code>matches_won_each_season</code> using <code>sns.heatmap()</code>. I passed the data frame <code>matches_won_each_season</code>, with <code>annot</code> as <code>True</code> to have the values shown as well. Here, the darker color indicates more matches won.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=73" height="496" width="800" title="Embedded content" loading="lazy"></iframe></div>
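<p>To make the cross-tabulation concrete, here is a sketch on six invented matches. It shows only the <code>pd.crosstab()</code> step and leaves out the plotting call.</p>

```python
import pandas as pd

# Invented winners across two seasons.
matches_df = pd.DataFrame({
    "winner": ["Team A", "Team A", "Team B", "Team A", "Team B", "Team B"],
    "season": [2008, 2008, 2008, 2009, 2009, 2009],
})

# Rows are winners, columns are seasons, cells are numbers of wins.
matches_won_each_season = pd.crosstab(matches_df.winner, matches_df.season)
print(matches_won_each_season)
```

<p>A table shaped like this is what gets passed to <code>sns.heatmap()</code> with <code>annot=True</code>, as described above.</p>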

<p>The <strong>Chennai Super Kings</strong> have been the most consistent team, winning at least 8 matches in each of the seasons they have played. This is backed up by the fact that they are the <strong>only</strong> team to reach the playoffs stage every season.</p>
<p>At the other end of the spectrum are 3 teams, the <strong>Delhi Daredevils</strong>, <strong>Kings XI Punjab</strong> and <strong>Rajasthan Royals</strong>. All three of them have had two seasons where they performed really well. However, they have been pretty average during the other seasons.</p>
<h3 id="heading-q-what-has-been-the-biggest-margin-of-victory-in-terms-of-runs-in-the-ipl">Q. What has been the biggest margin of victory in terms of runs in the IPL?</h3>
<ul>
<li>Filter the data frame using the required condition.</li>
<li>Sort the values in descending order using <code>sort_values()</code>.</li>
<li>Find the biggest 10 victories in the list using the <code>head()</code> method. It works opposite to <code>tail()</code>, returning the first n rows.</li>
</ul>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=81" height="134" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>I plotted the filtered data frame <code>highest_wins_by_runs_df</code> using <code>sns.scatterplot()</code>. For the <code>x</code> parameter I used <code>season</code>, and I used <code>win_by_runs</code> as the <code>y</code> parameter. I made the size of the points bigger for the top 10 victories using the <code>s</code> parameter.</p>
<p>To put emphasis on the top 10 victories, I used a different color as well as annotated those data points using <code>plt.annotate()</code>. The first parameter is the text of the annotation. The position of the point to be annotated is given as a tuple.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=82" height="501" width="800" title="Embedded content" loading="lazy"></iframe></div>
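<p>Below is a rough sketch of the filter, sort and annotate steps on invented margins. It uses Matplotlib's <code>scatter()</code> rather than <code>sns.scatterplot()</code>, picks the top 3 instead of the top 10, and runs on the <code>Agg</code> backend. All of these are simplifications assumed for this sketch.</p>

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Invented margins; a margin of 0 runs means the win was by wickets.
matches_df = pd.DataFrame({
    "season": [2008, 2009, 2010, 2011, 2012],
    "winner": ["Team A", "Team B", "Team C", "Team A", "Team B"],
    "win_by_runs": [140, 0, 92, 33, 111],
})

# Keep wins by runs, sort from largest margin down, take the first rows.
highest_wins_by_runs_df = (
    matches_df[matches_df.win_by_runs > 0]
    .sort_values("win_by_runs", ascending=False)
    .head(3)
)

fig, ax = plt.subplots(figsize=(12, 6))
ax.scatter(matches_df.season, matches_df.win_by_runs, s=20)
# Redraw the top wins with bigger points (s) in a different color.
ax.scatter(highest_wins_by_runs_df.season, highest_wins_by_runs_df.win_by_runs,
           s=60, color="red")

# annotate(text, (x, y)): the point to label is given as a tuple.
for row in highest_wins_by_runs_df.itertuples():
    ax.annotate(row.winner, (row.season, row.win_by_runs))
```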

<p>The biggest margin of victory by runs is <strong>146 runs</strong>. In 2017, the Mumbai Indians defeated the Delhi Daredevils by this margin. The Royal Challengers Bangalore have 3 victories amongst the top 5.</p>
<h3 id="heading-q-mumbai-and-chennai-are-the-two-most-successful-teams-so-far-which-team-leads-in-the-head-to-head-record">Q. Mumbai and Chennai are the two most successful teams so far. Which team leads in the head-to-head record?</h3>
<ul>
<li>Filter the data frame using the required condition to find the matches played between the two teams.</li>
<li>Use <code>value_counts()</code> on the <code>winner</code> column to find how many times each team has won.</li>
</ul>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=105" height="105" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=108" height="180" width="800" title="Embedded content" loading="lazy"></iframe></div>
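<p>A minimal sketch of the two steps, under the assumption that the two sides of each match are stored in columns named <code>team1</code> and <code>team2</code> (these names are not shown in this section). The rows and the resulting counts are made up and are not the real record.</p>

```python
import pandas as pd

# Toy data, made up for illustration; team1/team2 are assumed column names.
matches_df = pd.DataFrame({
    "team1": ["Mumbai Indians", "Chennai Super Kings", "Mumbai Indians",
              "Team C", "Chennai Super Kings"],
    "team2": ["Chennai Super Kings", "Mumbai Indians", "Team C",
              "Chennai Super Kings", "Mumbai Indians"],
    "winner": ["Mumbai Indians", "Chennai Super Kings", "Mumbai Indians",
               "Team C", "Mumbai Indians"],
})

rivals = ["Mumbai Indians", "Chennai Super Kings"]

# Keep a match only when both sides are one of the two rivals.
both_rivals = matches_df["team1"].isin(rivals) & matches_df["team2"].isin(rivals)
head_to_head = matches_df[both_rivals]

# Count how many of those matches each team won.
mivcsk = head_to_head["winner"].value_counts()
print(mivcsk)
```

<p>Checking both columns with <code>isin()</code> avoids having to care which team is listed first in a given match.</p>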

<p>I plotted the series <code>mivcsk</code> as a bar chart for a better visualization.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=109" height="507" width="800" title="Embedded content" loading="lazy"></iframe></div>
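<p>Plotting such a series as a bar chart could look roughly like this. The series is typed in by hand from the 17-11 record reported in this article, and the colors and axis label are arbitrary choices.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Head-to-head wins, entered by hand from the record quoted in the article.
mivcsk = pd.Series({"Mumbai Indians": 17, "Chennai Super Kings": 11})

fig, ax = plt.subplots(figsize=(6, 4))
# One bar per team: index values on the x-axis, win counts as bar heights.
bars = ax.bar(mivcsk.index, mivcsk.values, color=["tab:blue", "gold"])
ax.set_ylabel("Matches won")
ax.set_title("MI vs CSK head-to-head")
```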

<p>MI have dominated CSK and lead the head-to-head record 17-11. Their dominance is especially clear in the 2019 season, when MI defeated CSK in all 4 of their meetings, including the playoff and the final.</p>
<h2 id="heading-5-inferences-from-the-analysis">5. Inferences from the Analysis</h2>
<p>We have drawn some interesting inferences and now know more about the IPL than when we started. Here's a summary of what we learned through our analysis:</p>
<ul>
<li>Almost 60 matches are played in every IPL season amongst 8 teams.</li>
<li>There was an attempt to expand the IPL to 10 teams, but the league went back to 8 teams and has kept that format since.</li>
<li>For the first six seasons (2008-2013), teams were figuring out whether batting first or chasing would be better after winning the toss. This could be down to the fact that the IPL and T20 cricket were both in their early stages so teams were trying different strategies.</li>
<li>Since 2014, however, teams have preferred chasing, especially in the past 4 seasons (2016-2019), where teams have chosen to field more than 4 times out of 5. This is likely because having a set total to chase makes things simpler. It could also reflect teams' preference for chasing in ODIs.</li>
<li>Though teams have overwhelmingly chosen to field first, the win percentage after choosing to bat or field is not that one-sided. However, their difference is on the rise.</li>
<li>Mumbai Indians have played the most matches in the IPL. Due to the brief expansion, change of owners, and removal and banning of teams, there have been 15 teams who have played in the IPL.</li>
<li>Chennai and Mumbai are the two teams with the highest win percentage. They are also the only two teams in the top 5 that were part of the first season, which shows their dominance.</li>
<li>Mumbai Indians have won the IPL 4 times, the most of any team. They are followed by Chennai with 3 titles and Kolkata Knight Riders with 2. Sunrisers Hyderabad, Deccan Chargers and Rajasthan Royals complete the list of IPL champions with one title each.</li>
<li>146 runs is the largest margin of victory by runs. Mumbai Indians defeated Delhi Daredevils by this margin in 2017. The largest margin of victory by wickets is 10, which has been achieved many times.</li>
<li>The two heavyweights, Mumbai and Chennai, have a head-to-head record in favour of Mumbai at 17-11. Mumbai have had the upper hand in the 2019 season every time they met, including the final.</li>
</ul>
<h2 id="heading-6-conclusion">6. Conclusion</h2>
<p>In this article, we did a bunch of analysis and saw some interesting visualizations. However, this was just scratching the surface.</p>
<p>You can perform more interesting analysis on <em>matches.csv</em> as a standalone data set. But combining <em>deliveries.csv</em> with this dataset could lead to more in-depth analysis.</p>
<p>I did this data analysis and visualization as a project for the 6-week course <a target="_blank" href="https://zerotopandas.com">Data Analysis with Python: Zero to Pandas</a>. This course was conducted by <a target="_blank" href="https://jovian.ml">Jovian.ml</a> in partnership with <a target="_blank" href="https://www.freecodecamp.org">freeCodeCamp.org</a>. Check out the project <a target="_blank" href="https://jovian.ml/srijansrj5901/ipl-data-analysis">here</a>.</p>
<p>Also, the IPL is on right now. Go watch it and enjoy!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Matplotlib Course – Learn Python Data Visualization ]]>
                </title>
                <description>
                    <![CDATA[ Learn the basics of Matplotlib in this crash course tutorial. Matplotlib is an amazing data visualization library for Python. You will also learn how to apply Matplotlib to real-world problems. You can watch the full course here (90 minute watch). Co... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/matplotlib-course-learn-python-data-visualization/</link>
                <guid isPermaLink="false">66c35b2dc5e11f7a9c40686c</guid>
                
                    <category>
                        <![CDATA[ Matplotlib ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 20 May 2020 21:10:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2020/08/maxresdefault--5-.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Learn the basics of Matplotlib in this crash course tutorial. Matplotlib is an amazing data visualization library for Python. You will also learn how to apply Matplotlib to real-world problems.</p>
<p>You can <a target="_blank" href="https://www.youtube.com/watch?v=3Xc3CA655Y4">watch the full course here</a> (90 minute watch).</p>
<h2 id="heading-course-notes">Course Notes</h2>
<p><a target="_blank" href="https://github.com/KeithGalli/matplotlib_tutorial/">Source Code</a></p>
<p><a target="_blank" href="https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html">Matplotlib Pyplot Documentation</a></p>
<p><a target="_blank" href="http://jonathansoma.com/lede/data-studio/matplotlib/list-all-fonts-available-in-matplotlib-plus-samples/">Font List</a></p>
<p><a target="_blank" href="https://matplotlib.org/3.1.0/gallery/style_sheets/style_sheets_reference.html">Matplotlib Style Options</a></p>
<p><a target="_blank" href="https://www.kaggle.com/karangadiya/fifa19">Kaggle Data Link</a></p>
<h2 id="heading-how-to-install-libraries-needed-for-matplotlib">How to Install Libraries Needed for Matplotlib</h2>
<h3 id="heading-option-1-how-to-install-matplotlib-directly-using-pip-install">Option 1: How to Install Matplotlib Directly Using pip install</h3>
<ol>
<li>Open up a terminal window.</li>
<li>Run <code>pip install matplotlib</code>.</li>
<li>Run <code>pip install numpy</code>.</li>
<li>Run <code>pip install pandas</code>.</li>
</ol>
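<p>After installing, one quick way to confirm that all three packages can be imported is a short Python check like the one below. This is only a suggested sanity check, not part of the course material.</p>

```python
import importlib

# Import each package by name and record the version it reports.
versions = {}
for name in ("matplotlib", "numpy", "pandas"):
    module = importlib.import_module(name)  # raises ImportError if missing
    versions[name] = module.__version__
    print(f"{name} {versions[name]}")
```

<p>If any of the imports fails, rerun the matching <code>pip install</code> command in the same environment that runs Python.</p>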
<h3 id="heading-option-2-how-to-install-anaconda">Option 2: How to Install Anaconda</h3>
<p>Download Anaconda, which will contain all the packages we need. Here's <a target="_blank" href="https://youtu.be/YJC6ldI3hWk">a video tutorial that walks you through how to do this</a>.</p>
<p>Again, you can <a target="_blank" href="https://www.youtube.com/watch?v=3Xc3CA655Y4">watch the full course here</a> (90 minute watch).</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
