<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ numpy - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ numpy - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Tue, 09 Jun 2026 10:26:13 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/numpy/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ What is Data Analysis? How to Visualize Data with Python, Numpy, Pandas, Matplotlib & Seaborn Tutorial ]]>
                </title>
                <description>
                    <![CDATA[ By Aakash NS Data Analysis is the process of exploring, investigating, and gathering insights from data using statistical measures and visualizations.  The objective of data analysis is to develop an understanding of data by uncovering trends, relati... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/exploratory-data-analysis-with-numpy-pandas-matplotlib-seaborn/</link>
                <guid isPermaLink="false">66d45d5ab3016bf139028cff</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Matplotlib ]]>
                    </category>
                
                    <category>
                        <![CDATA[ numpy ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 24 Jun 2021 00:11:01 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/05/blog-cover-4.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Aakash NS</p>
<p>Data Analysis is the process of exploring, investigating, and gathering insights from data using statistical measures and visualizations. </p>
<p>The objective of data analysis is to develop an understanding of data by uncovering trends, relationships, and patterns.</p>
<p>Data analysis is both a science and an art. On the one hand it requires that you know statistics, visualization techniques, and data analysis tools like Numpy, Pandas, and Seaborn. </p>
<p>On the other hand, it requires that you ask interesting questions to guide the investigation, and then interpret the numbers and figures to generate useful insights.</p>
<p>This tutorial on data analysis covers the following topics:</p>
<ol>
<li><a class="post-section-overview" href="#heading-what-is-numerical-computation-python-and-numpy-for-beginners">What is Numerical Computation? Python and Numpy for Beginners</a></li>
<li><a class="post-section-overview" href="#heading-how-to-analyze-tabular-data-using-python-and-pandas">How to Analyze Tabular Data using Python and Pandas</a></li>
<li><a class="post-section-overview" href="#heading-data-visualization-using-python-matplotlib-and-seaborn">Data Visualization using Python, Matplotlib, and Seaborn</a></li>
</ol>
<h2 id="heading-what-is-numerical-computation-python-and-numpy-for-beginners">What is Numerical Computation? Python and Numpy for Beginners</h2>
<p><img src="https://i.imgur.com/mg8O3kd.png" alt="Image" width="1385" height="480" loading="lazy">
<em>Source: <a target="_blank" href="https://github.com/elegant-scipy/elegant-scipy/blob/master/figures/NumPy_ndarrays_v2.png">Elegant Scipy</a></em></p>
<p>You can follow along with the tutorial and run the code here: <a target="_blank" href="https://jovian.ai/aakashns/python-numerical-computing-with-numpy">https://jovian.ai/aakashns/python-numerical-computing-with-numpy</a></p>
<p>This section covers the following topics:</p>
<ul>
<li>How to work with numerical data in Python</li>
<li>How to turn Python lists into Numpy arrays</li>
<li>Multi-dimensional Numpy arrays and their benefits</li>
<li>Array operations, broadcasting, indexing, and slicing</li>
<li>How to work with CSV data files using Numpy</li>
</ul>
<h3 id="heading-how-to-work-with-numerical-data-in-python">How to Work with Numerical Data in Python</h3>
<p>The "data" in <em>Data Analysis</em> typically refers to numerical data, like stock prices, sales figures, sensor measurements, sports scores, database tables, and so on. </p>
<p>The <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fnumpy.org">Numpy</a> library provides specialized data structures, functions, and other tools for numerical computing in Python. Let's work through an example to see why and how to use Numpy to work with numerical data.</p>
<p>Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples. </p>
<p>A simple approach to do this would be to formulate the relationship between the annual yield of apples (tons per hectare) and the climatic conditions like the average temperature (in degrees Fahrenheit), rainfall (in millimeters), and average relative humidity (in percentage) as a linear equation.</p>
<p><code>yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity</code></p>
<p>We're expressing the yield of apples as a weighted sum of the temperature, rainfall, and humidity. </p>
<p>This equation is an approximation, since the actual relationship may not necessarily be linear, and there may be other factors involved. But a simple linear model like this often works well in practice.</p>
<p>Based on some statistical analysis of historical data, we might come up with reasonable values for the weights <code>w1</code>, <code>w2</code>, and <code>w3</code>. Here's an example set of values:</p>
<pre><code class="lang-py">w1, w2, w3 = <span class="hljs-number">0.3</span>, <span class="hljs-number">0.2</span>, <span class="hljs-number">0.5</span>
</code></pre>
<p>Given some climate data for a region, we can now predict the yield of apples. Here's some sample data:</p>
<p><img src="https://i.imgur.com/TXPBiqv.png" alt="Image" width="846" height="330" loading="lazy"></p>
<p>To begin, we can define some variables to record climate data for a region.</p>
<pre><code class="lang-py">kanto_temp = <span class="hljs-number">73</span>
kanto_rainfall = <span class="hljs-number">67</span>
kanto_humidity = <span class="hljs-number">43</span>
</code></pre>
<p>We can now substitute these variables into the linear equation to predict the yield of apples.</p>
<pre><code class="lang-py">kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3
kanto_yield_apples
<span class="hljs-comment"># 56.8</span>

print(<span class="hljs-string">"The expected yield of apples in Kanto region is {} tons per hectare."</span>.format(kanto_yield_apples))
<span class="hljs-comment"># The expected yield of apples in Kanto region is 56.8 tons per hectare.</span>
</code></pre>
<p>To make it easier to perform the above computation for multiple regions, we can represent the climate data for each region as a vector, that is, a list of numbers.</p>
<pre><code class="lang-py">kanto = [<span class="hljs-number">73</span>, <span class="hljs-number">67</span>, <span class="hljs-number">43</span>]
johto = [<span class="hljs-number">91</span>, <span class="hljs-number">88</span>, <span class="hljs-number">64</span>]
hoenn = [<span class="hljs-number">87</span>, <span class="hljs-number">134</span>, <span class="hljs-number">58</span>]
sinnoh = [<span class="hljs-number">102</span>, <span class="hljs-number">43</span>, <span class="hljs-number">37</span>]
unova = [<span class="hljs-number">69</span>, <span class="hljs-number">96</span>, <span class="hljs-number">70</span>]
</code></pre>
<p>The three numbers in each vector represent the temperature, rainfall, and humidity data, respectively.</p>
<p>We can also represent the set of weights used in the formula as a vector.</p>
<pre><code class="lang-py">weights = [w1, w2, w3]
</code></pre>
<p>We can now write a function <code>crop_yield</code> to calculate the yield of apples (or any other crop) given the climate data and the respective weights.</p>
<pre><code class="lang-py"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">crop_yield</span>(<span class="hljs-params">region, weights</span>):</span>
    result = <span class="hljs-number">0</span>
    <span class="hljs-keyword">for</span> x, w <span class="hljs-keyword">in</span> zip(region, weights):
        result += x * w
    <span class="hljs-keyword">return</span> result

crop_yield(kanto, weights)
<span class="hljs-comment"># 56.8</span>

crop_yield(johto, weights)
<span class="hljs-comment"># 76.9</span>

crop_yield(unova, weights)
<span class="hljs-comment"># 74.9</span>
</code></pre>
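<p>To apply the same function to every region at once, one option is to collect the lists in a dictionary and loop over it. This is only a sketch: the <code>regions</code> and <code>predictions</code> names are introduced here for illustration and are not part of the original notebook.</p>

```python
def crop_yield(region, weights):
    """Weighted sum of the climate values for one region."""
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result

weights = [0.3, 0.2, 0.5]

# Hypothetical container: region name mapped to [temperature, rainfall, humidity]
regions = {
    "kanto": [73, 67, 43],
    "johto": [91, 88, 64],
    "hoenn": [87, 134, 58],
    "sinnoh": [102, 43, 37],
    "unova": [69, 96, 70],
}

# One prediction per region, computed with plain Python
predictions = {name: crop_yield(data, weights) for name, data in regions.items()}

for name, value in predictions.items():
    print("{}: {:.1f} tons per hectare".format(name, value))
```

<p>This works, but every new calculation needs another loop, which is the motivation for switching to Numpy arrays.</p>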
<h3 id="heading-how-to-turn-python-lists-into-numpy-arrays">How to Turn Python Lists into Numpy Arrays</h3>
<p>The calculation performed by the <code>crop_yield</code> function (element-wise multiplication of two vectors, followed by a sum of the results) is also called the <em>dot product</em>. Learn more about dot products <a target="_blank" href="https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/dot-cross-products/v/vector-dot-product-and-vector-length">here</a>.</p>
<p>The Numpy library provides a built-in function to compute the dot product of two vectors. However, we must first convert the lists into Numpy arrays.</p>
<p>Let's install the Numpy library using the <code>pip</code> package manager.</p>
<pre><code class="lang-py">!pip install numpy --upgrade --quiet
</code></pre>
<p>Next, let's import the <code>numpy</code> module. It's common practice to import numpy with the alias <code>np</code>.</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
</code></pre>
<p>We can now use the <code>np.array</code> function to create Numpy arrays.</p>
<pre><code class="lang-py">kanto = np.array([<span class="hljs-number">73</span>, <span class="hljs-number">67</span>, <span class="hljs-number">43</span>])

kanto
<span class="hljs-comment"># array([73, 67, 43])</span>

weights = np.array([w1, w2, w3])

weights
<span class="hljs-comment"># array([0.3, 0.2, 0.5])</span>
</code></pre>
<p>Numpy arrays have the type <code>ndarray</code>.</p>
<pre><code class="lang-py">type(kanto)
<span class="hljs-comment"># numpy.ndarray</span>

type(weights)
<span class="hljs-comment"># numpy.ndarray</span>
</code></pre>
<p>Just like lists, Numpy arrays support the indexing notation <code>[]</code>.</p>
<pre><code class="lang-py">weights[<span class="hljs-number">0</span>]
<span class="hljs-comment"># 0.3</span>

kanto[<span class="hljs-number">2</span>]
<span class="hljs-comment"># 43</span>
</code></pre>
<h3 id="heading-how-to-operate-on-numpy-arrays">How to Operate on Numpy arrays</h3>
<p>We can now compute the dot product of the two vectors using the <code>np.dot</code> function.</p>
<pre><code class="lang-py">np.dot(kanto, weights)
<span class="hljs-comment"># 56.8</span>
</code></pre>
<p>We can achieve the same result with low-level operations supported by Numpy arrays: performing an element-wise multiplication and calculating the resulting numbers' sum.</p>
<pre><code class="lang-py">(kanto * weights).sum()
<span class="hljs-comment"># 56.8</span>
</code></pre>
<p>The <code>*</code> operator performs an element-wise multiplication of two arrays if they have the same size. The <code>sum</code> method calculates the sum of numbers in an array.</p>
<pre><code class="lang-py">arr1 = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])
arr2 = np.array([<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>])

arr1 * arr2
<span class="hljs-comment"># array([ 4, 10, 18])</span>

arr2.sum()
<span class="hljs-comment"># 15</span>
</code></pre>
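<p>As a quick sanity check, the three approaches seen so far (the Python loop, <code>np.dot</code>, and element-wise multiplication followed by <code>sum</code>) can be compared directly. The sketch below redefines <code>crop_yield</code> so that it runs on its own, and it uses <code>np.allclose</code> because floating point results may differ in the last few digits.</p>

```python
import numpy as np

def crop_yield(region, weights):
    """Pure-Python dot product, as defined earlier in the tutorial."""
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result

kanto = [73, 67, 43]
weights = [0.3, 0.2, 0.5]

loop_result = crop_yield(kanto, weights)
dot_result = np.dot(np.array(kanto), np.array(weights))
elementwise_result = (np.array(kanto) * np.array(weights)).sum()

# Compare with a tolerance rather than == because these are floats
all_close = np.allclose([loop_result, dot_result], elementwise_result)
print(loop_result, dot_result, elementwise_result, all_close)
```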
<h3 id="heading-what-are-the-benefits-of-using-numpy-arrays">What are the Benefits of Using Numpy Arrays?</h3>
<p>Numpy arrays offer the following benefits over Python lists for operating on numerical data:</p>
<ul>
<li><strong>Ease of use</strong>: You can write small, concise, and intuitive mathematical expressions like <code>(kanto * weights).sum()</code> rather than using loops and custom functions like <code>crop_yield</code>.</li>
<li><strong>Performance</strong>: Numpy operations and functions are implemented internally in compiled code (mostly C), which makes them much faster than equivalent Python statements and loops that are interpreted at runtime.</li>
</ul>
<p>Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with a million elements each.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Python lists</span>
arr1 = list(range(<span class="hljs-number">1000000</span>))
arr2 = list(range(<span class="hljs-number">1000000</span>, <span class="hljs-number">2000000</span>))

<span class="hljs-comment"># Numpy arrays</span>
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

%%time
result = <span class="hljs-number">0</span>
<span class="hljs-keyword">for</span> x1, x2 <span class="hljs-keyword">in</span> zip(arr1, arr2):
    result += x1*x2
result

<span class="hljs-comment"># CPU times: user 300 ms, sys: 3.26 ms, total: 303 ms</span>
<span class="hljs-comment"># Wall time: 302 ms</span>
<span class="hljs-comment"># 833332333333500000</span>

%%time
np.dot(arr1_np, arr2_np)

<span class="hljs-comment"># CPU times: user 2.11 ms, sys: 951 µs, total: 3.07 ms</span>
<span class="hljs-comment"># Wall time: 1.58 ms</span>
<span class="hljs-comment"># 833332333333500000</span>
</code></pre>
<p>As you can see, using <code>np.dot</code> is more than 100 times faster than using a <code>for</code> loop in this run (about 1.6 ms versus 302 ms of wall time). The exact ratio depends on your machine, but the gap makes Numpy especially useful while working with large datasets containing tens of thousands or millions of data points.</p>
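<p>The <code>%%time</code> magic only works inside Jupyter. Outside a notebook, a similar comparison can be sketched with the standard library's <code>timeit</code> module. The function names and the smaller array size below are choices made for this example, and the timings you get will vary by machine.</p>

```python
import timeit
import numpy as np

n = 100_000  # smaller than the tutorial's one million to keep the run short
list1 = list(range(n))
list2 = list(range(n, 2 * n))
arr1_np = np.array(list1, dtype=np.int64)
arr2_np = np.array(list2, dtype=np.int64)

def loop_dot():
    """Dot product with a plain Python loop."""
    result = 0
    for x1, x2 in zip(list1, list2):
        result += x1 * x2
    return result

def numpy_dot():
    """Dot product with Numpy's compiled implementation."""
    return np.dot(arr1_np, arr2_np)

# Take the best of five runs to reduce noise from other processes
loop_seconds = min(timeit.repeat(loop_dot, number=1, repeat=5))
numpy_seconds = min(timeit.repeat(numpy_dot, number=1, repeat=5))
print(f"loop: {loop_seconds:.6f}s, numpy: {numpy_seconds:.6f}s")
```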
<h3 id="heading-multi-dimensional-numpy-arrays">Multi-Dimensional Numpy Arrays</h3>
<p>We can now go one step further and represent the climate data for all the regions using a single 2-dimensional Numpy array.</p>
<pre><code class="lang-py">climate_data = np.array([[<span class="hljs-number">73</span>, <span class="hljs-number">67</span>, <span class="hljs-number">43</span>],
                         [<span class="hljs-number">91</span>, <span class="hljs-number">88</span>, <span class="hljs-number">64</span>],
                         [<span class="hljs-number">87</span>, <span class="hljs-number">134</span>, <span class="hljs-number">58</span>],
                         [<span class="hljs-number">102</span>, <span class="hljs-number">43</span>, <span class="hljs-number">37</span>],
                         [<span class="hljs-number">69</span>, <span class="hljs-number">96</span>, <span class="hljs-number">70</span>]])

climate_data
<span class="hljs-comment"># array([[ 73,  67,  43],</span>
<span class="hljs-comment">#        [ 91,  88,  64],</span>
<span class="hljs-comment">#        [ 87, 134,  58],</span>
<span class="hljs-comment">#        [102,  43,  37],</span>
<span class="hljs-comment">#        [ 69,  96,  70]])</span>
</code></pre>
<p>If you've taken a linear algebra class in high school, you may recognize the above 2-d array as a matrix with five rows and three columns. Each row represents one region, and the columns represent temperature, rainfall, and humidity, respectively.</p>
<p>Numpy arrays can have any number of dimensions and different lengths along each dimension. We can inspect the length along each dimension using the <code>.shape</code> property of an array.</p>
<p><img src="https://fgnt.github.io/python_crashkurs_doc/_images/numpy_array_t.png" alt="Image" width="1440" height="805" loading="lazy">
<em>Source: <a target="_blank" href="https://github.com/elegant-scipy/elegant-scipy/blob/master/figures/NumPy_ndarrays_v2.png">Elegant Scipy</a></em></p>
<pre><code class="lang-py"><span class="hljs-comment"># 2D array (matrix)</span>
climate_data.shape
<span class="hljs-comment"># (5, 3)</span>

weights
<span class="hljs-comment"># array([0.3, 0.2, 0.5])</span>

<span class="hljs-comment"># 1D array (vector)</span>
weights.shape
<span class="hljs-comment"># (3,)</span>

<span class="hljs-comment"># 3D array </span>
arr3 = np.array([
    [[<span class="hljs-number">11</span>, <span class="hljs-number">12</span>, <span class="hljs-number">13</span>], 
     [<span class="hljs-number">13</span>, <span class="hljs-number">14</span>, <span class="hljs-number">15</span>]], 
    [[<span class="hljs-number">15</span>, <span class="hljs-number">16</span>, <span class="hljs-number">17</span>], 
     [<span class="hljs-number">17</span>, <span class="hljs-number">18</span>, <span class="hljs-number">19.5</span>]]])

arr3.shape
<span class="hljs-comment"># (2, 2, 3)</span>
</code></pre>
<p>All the elements in a numpy array have the same data type. You can check the data type of an array using the <code>.dtype</code> property.</p>
<pre><code class="lang-py">weights.dtype
<span class="hljs-comment"># dtype('float64')</span>

climate_data.dtype
<span class="hljs-comment"># dtype('int64')</span>
</code></pre>
<p>If an array contains even a single floating point number, all the other elements are also converted to floats.</p>
<pre><code class="lang-py">arr3.dtype
<span class="hljs-comment"># dtype('float64')</span>
</code></pre>
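<p>The data type can also be controlled explicitly. The sketch below shows the automatic conversion described above next to two explicit alternatives, <code>astype</code> and the <code>dtype</code> argument of <code>np.array</code>. The default integer type depends on your platform, so no output is shown for it.</p>

```python
import numpy as np

ints = np.array([1, 2, 3])                         # platform default integer type
mixed = np.array([1, 2, 3.5])                      # one float upcasts every element
as_float = ints.astype(np.float64)                 # explicit conversion, returns a new array
explicit = np.array([1, 2, 3], dtype=np.float32)   # dtype chosen at creation

print(ints.dtype, mixed.dtype, as_float.dtype, explicit.dtype)
```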
<p>We can now compute the predicted yields of apples in all the regions, using a single matrix multiplication between <code>climate_data</code> (a 5x3 matrix) and <code>weights</code> (a vector of length 3). Here's what it looks like visually:</p>
<p><img src="https://i.imgur.com/LJ2WKSI.png" alt="Image" width="578" height="334" loading="lazy"></p>
<p>You can learn about matrices and matrix multiplication by watching the first 3-4 videos of <a target="_blank" href="https://www.youtube.com/watch?v=xyAuNHPsq-g&amp;list=PLFD0EB975BA0CC1E0&amp;index=1">this YouTube playlist</a>.</p>
<p>We can use the <code>np.matmul</code> function or the <code>@</code> operator to perform matrix multiplication.</p>
<pre><code class="lang-py">np.matmul(climate_data, weights)
<span class="hljs-comment"># array([56.8, 76.9, 81.9, 57.7, 74.9])</span>

climate_data @ weights
<span class="hljs-comment"># array([56.8, 76.9, 81.9, 57.7, 74.9])</span>
</code></pre>
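<p>To see what the matrix multiplication is doing, it can help to compute one dot product per row and compare the result with <code>climate_data @ weights</code>. This is a small illustrative check, not a recommended way to write the computation.</p>

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# One dot product per row (region) ...
row_by_row = np.array([np.dot(row, weights) for row in climate_data])
# ... gives the same values as a single matrix-vector product
all_at_once = climate_data @ weights

print(row_by_row.shape, all_at_once.shape)
print(np.allclose(row_by_row, all_at_once))
```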
<h3 id="heading-how-to-work-with-csv-data-files">How to Work with CSV Data Files</h3>
<p>Numpy also provides helper functions for reading from and writing to files. Let's download a file <code>climate.txt</code>, which contains 10,000 climate measurements (temperature, rainfall, and humidity) in the following format:</p>
<pre><code>temperature,rainfall,humidity
<span class="hljs-number">25.00</span>,<span class="hljs-number">76.00</span>,<span class="hljs-number">99.00</span>
<span class="hljs-number">39.00</span>,<span class="hljs-number">65.00</span>,<span class="hljs-number">70.00</span>
<span class="hljs-number">59.00</span>,<span class="hljs-number">45.00</span>,<span class="hljs-number">77.00</span>
<span class="hljs-number">84.00</span>,<span class="hljs-number">63.00</span>,<span class="hljs-number">38.00</span>
<span class="hljs-number">66.00</span>,<span class="hljs-number">50.00</span>,<span class="hljs-number">52.00</span>
<span class="hljs-number">41.00</span>,<span class="hljs-number">94.00</span>,<span class="hljs-number">77.00</span>
<span class="hljs-number">91.00</span>,<span class="hljs-number">57.00</span>,<span class="hljs-number">96.00</span>
<span class="hljs-number">49.00</span>,<span class="hljs-number">96.00</span>,<span class="hljs-number">99.00</span>
<span class="hljs-number">67.00</span>,<span class="hljs-number">20.00</span>,<span class="hljs-number">28.00</span>
...
</code></pre><p>This format of storing data is known as <em>comma-separated values</em> or CSV.</p>
<blockquote>
<p><strong>CSVs</strong>: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)</p>
</blockquote>
<p>To read this file into a numpy array, we can use the <code>genfromtxt</code> function.</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> urllib.request

urllib.request.urlretrieve(
    <span class="hljs-string">'https://hub.jovian.ml/wp-content/uploads/2020/08/climate.csv'</span>, 
    <span class="hljs-string">'climate.txt'</span>)

climate_data = np.genfromtxt(<span class="hljs-string">'climate.txt'</span>, delimiter=<span class="hljs-string">','</span>, skip_header=<span class="hljs-number">1</span>)

climate_data
<span class="hljs-comment"># array([[25., 76., 99.],</span>
<span class="hljs-comment">#        [39., 65., 70.],</span>
<span class="hljs-comment">#        [59., 45., 77.],</span>
<span class="hljs-comment">#        ...,</span>
<span class="hljs-comment">#        [99., 62., 58.],</span>
<span class="hljs-comment">#        [70., 71., 91.],</span>
<span class="hljs-comment">#        [92., 39., 76.]])</span>

climate_data.shape
<span class="hljs-comment"># (10000, 3)</span>
</code></pre>
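<p>If the download is unavailable, the behavior of <code>genfromtxt</code> can still be tried on a few rows held in memory. The sketch below assumes a short string copied from the sample above and wraps it in <code>io.StringIO</code>, which <code>genfromtxt</code> accepts in place of a file name.</p>

```python
import io
import numpy as np

# A few rows copied from the sample shown above; the real file has 10,000 rows
csv_text = """temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
"""

# skip_header=1 drops the column names so that only numbers are parsed
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

print(sample.shape)
print(sample.dtype)
```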
<p>We can now perform a matrix multiplication using the <code>@</code> operator to predict the yield of apples for the entire dataset using a given set of weights.</p>
<pre><code class="lang-py">weights = np.array([<span class="hljs-number">0.3</span>, <span class="hljs-number">0.2</span>, <span class="hljs-number">0.5</span>])

yields = climate_data @ weights
yields
<span class="hljs-comment"># array([72.2, 59.7, 65.2, ..., 71.1, 80.7, 73.4])</span>

yields.shape
<span class="hljs-comment"># (10000,)</span>
</code></pre>
<p>Let's add the <code>yields</code> to <code>climate_data</code> as a fourth column using the <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fnumpy.org%2Fdoc%2Fstable%2Freference%2Fgenerated%2Fnumpy.concatenate.html"><code>np.concatenate</code></a> function.</p>
<pre><code class="lang-py">climate_results = np.concatenate((climate_data, yields.reshape(<span class="hljs-number">10000</span>, <span class="hljs-number">1</span>)), axis=<span class="hljs-number">1</span>)

climate_results
<span class="hljs-comment"># array([[25. , 76. , 99. , 72.2],</span>
<span class="hljs-comment">#        [39. , 65. , 70. , 59.7],</span>
<span class="hljs-comment">#        [59. , 45. , 77. , 65.2],</span>
<span class="hljs-comment">#        ...,</span>
<span class="hljs-comment">#        [99. , 62. , 58. , 71.1],</span>
<span class="hljs-comment">#        [70. , 71. , 91. , 80.7],</span>
<span class="hljs-comment">#        [92. , 39. , 76. , 73.4]])</span>
</code></pre>
<p>There are a couple of subtleties here:</p>
<ul>
<li>Since we wish to add new columns, we pass the argument <code>axis=1</code> to <code>np.concatenate</code>. The <code>axis</code> argument specifies the dimension for concatenation.</li>
<li>The arrays should have the same number of dimensions, and the same length along each except the dimension used for concatenation. We use the <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fnumpy.org%2Fdoc%2Fstable%2Freference%2Fgenerated%2Fnumpy.reshape.html"><code>np.reshape</code></a> function to change the shape of <code>yields</code> from <code>(10000,)</code> to <code>(10000,1)</code>.</li>
</ul>
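<p>Hard-coding the row count in <code>reshape(10000, 1)</code> ties the code to one dataset. As a sketch on a three-row sample, the variant below passes <code>-1</code> so that Numpy infers the number of rows, and it also shows <code>np.column_stack</code>, which accepts the 1-D array directly.</p>

```python
import numpy as np

climate_data = np.array([[25., 76., 99.],
                         [39., 65., 70.],
                         [59., 45., 77.]])
weights = np.array([0.3, 0.2, 0.5])
yields = climate_data @ weights            # shape (3,)

# -1 asks Numpy to infer the row count, so the length is not hard-coded
as_column = yields.reshape(-1, 1)          # shape (3, 1)
with_concat = np.concatenate((climate_data, as_column), axis=1)

# np.column_stack treats a 1-D array as a column, so no reshape is needed
with_stack = np.column_stack((climate_data, yields))

print(with_concat.shape, np.array_equal(with_concat, with_stack))
```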
<p>Here's a visual explanation of <code>np.concatenate</code> along <code>axis=1</code> (can you guess what <code>axis=0</code> results in?):</p>
<p><img src="https://www.w3resource.com/w3r_images/python-numpy-image-exercise-58.png" alt="Image" width="576" height="536" loading="lazy">
<em>Source: <a target="_blank" href="https://www.w3resource.com">w3resource.com</a></em></p>
<p>The best way to understand what a Numpy function does is to experiment with it and read the documentation to learn about its arguments and return values. Try calling <code>np.concatenate</code> and <code>np.reshape</code> on small arrays of your own to see how the shapes change.</p>
<p>Let's write the final results from our computation above back to a file using the <code>np.savetxt</code> function.</p>
<pre><code class="lang-py">np.savetxt(<span class="hljs-string">'climate_results.txt'</span>, 
           climate_results, 
           fmt=<span class="hljs-string">'%.2f'</span>, 
           delimiter=<span class="hljs-string">','</span>,
           header=<span class="hljs-string">'temperature,rainfall,humidity,yield_apples'</span>, 
           comments=<span class="hljs-string">''</span>)
</code></pre>
<p>The results are written back in the CSV format to the file <code>climate_results.txt</code>.</p>
<pre><code>temperature,rainfall,humidity,yield_apples
<span class="hljs-number">25.00</span>,<span class="hljs-number">76.00</span>,<span class="hljs-number">99.00</span>,<span class="hljs-number">72.20</span>
<span class="hljs-number">39.00</span>,<span class="hljs-number">65.00</span>,<span class="hljs-number">70.00</span>,<span class="hljs-number">59.70</span>
<span class="hljs-number">59.00</span>,<span class="hljs-number">45.00</span>,<span class="hljs-number">77.00</span>,<span class="hljs-number">65.20</span>
<span class="hljs-number">84.00</span>,<span class="hljs-number">63.00</span>,<span class="hljs-number">38.00</span>,<span class="hljs-number">56.80</span>
...
</code></pre><p>Numpy provides hundreds of functions for performing operations on arrays. Here are some commonly used functions:</p>
<ul>
<li>Mathematics: <code>np.sum</code>, <code>np.exp</code>, <code>np.round</code>, arithmetic operators</li>
<li>Array manipulation: <code>np.reshape</code>, <code>np.stack</code>, <code>np.concatenate</code>, <code>np.split</code></li>
<li>Linear Algebra: <code>np.matmul</code>, <code>np.dot</code>, <code>np.transpose</code>, <code>np.linalg.eigvals</code></li>
<li>Statistics: <code>np.mean</code>, <code>np.median</code>, <code>np.std</code>, <code>np.max</code></li>
</ul>
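<p>Here is a brief sketch that calls one function from each category on a small 2x2 array. The array values are arbitrary and chosen only so that the results are easy to verify by hand. Note that the eigenvalue routine lives in the <code>np.linalg</code> submodule.</p>

```python
import numpy as np

arr = np.array([[2.0, 0.0],
                [0.0, 3.0]])

total = np.sum(arr)                      # mathematics
halves = np.split(arr, 2, axis=0)        # array manipulation: two (1, 2) arrays
transposed = np.transpose(arr)           # linear algebra
eigenvalues = np.linalg.eigvals(arr)     # eigenvalue routines live in np.linalg
average = np.mean(arr)                   # statistics

print(total, average, sorted(eigenvalues))
```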
<p><strong>So how do you find the function you need?</strong> The easiest way to find the right function for a specific operation or use-case is to do a web search. For instance, searching for "How to join numpy arrays" leads to <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fcmdlinetips.com%2F2018%2F04%2Fhow-to-concatenate-arrays-in-numpy%2F">this tutorial on array concatenation</a>.</p>
<p>You can find a <a target="_blank" href="https://numpy.org/doc/stable/reference/routines.html">full list of array functions here</a>.</p>
<h3 id="heading-numpy-arithmetic-operations-broadcasting-and-comparison">Numpy Arithmetic Operations, Broadcasting, and Comparison</h3>
<p>Numpy arrays support arithmetic operators like <code>+</code>, <code>-</code>, <code>*</code>, etc. You can perform an arithmetic operation with a single number (also called a scalar) or with another array of the same shape. </p>
<p>Operators make it easy to write mathematical expressions with multi-dimensional arrays.</p>
<pre><code class="lang-py">arr2 = np.array([[<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>], 
                 [<span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>], 
                 [<span class="hljs-number">9</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>]])

arr3 = np.array([[<span class="hljs-number">11</span>, <span class="hljs-number">12</span>, <span class="hljs-number">13</span>, <span class="hljs-number">14</span>], 
                 [<span class="hljs-number">15</span>, <span class="hljs-number">16</span>, <span class="hljs-number">17</span>, <span class="hljs-number">18</span>], 
                 [<span class="hljs-number">19</span>, <span class="hljs-number">11</span>, <span class="hljs-number">12</span>, <span class="hljs-number">13</span>]])

<span class="hljs-comment"># Adding a scalar</span>
arr2 + <span class="hljs-number">3</span>

<span class="hljs-comment"># array([[ 4,  5,  6,  7],</span>
<span class="hljs-comment">#        [ 8,  9, 10, 11],</span>
<span class="hljs-comment">#        [12,  4,  5,  6]])</span>

<span class="hljs-comment"># Element-wise subtraction</span>
arr3 - arr2

<span class="hljs-comment"># array([[10, 10, 10, 10],</span>
<span class="hljs-comment">#        [10, 10, 10, 10],</span>
<span class="hljs-comment">#        [10, 10, 10, 10]])</span>

<span class="hljs-comment"># Division by scalar</span>
arr2 / <span class="hljs-number">2</span>

<span class="hljs-comment"># array([[0.5, 1. , 1.5, 2. ],</span>
<span class="hljs-comment">#        [2.5, 3. , 3.5, 4. ],</span>
<span class="hljs-comment">#        [4.5, 0.5, 1. , 1.5]])</span>

<span class="hljs-comment"># Element-wise multiplication</span>
arr2 * arr3

<span class="hljs-comment"># array([[ 11,  24,  39,  56],</span>
<span class="hljs-comment">#        [ 75,  96, 119, 144],</span>
<span class="hljs-comment">#        [171,  11,  24,  39]])</span>

<span class="hljs-comment"># Modulus with scalar</span>
arr2 % <span class="hljs-number">4</span>

<span class="hljs-comment"># array([[1, 2, 3, 0],</span>
<span class="hljs-comment">#        [1, 2, 3, 0],</span>
<span class="hljs-comment">#        [1, 1, 2, 3]])</span>
</code></pre>
<h4 id="heading-numpy-array-broadcasting"><strong>Numpy Array Broadcasting</strong></h4>
<p>Numpy arrays also support <em>broadcasting</em>, allowing arithmetic operations between two arrays with different numbers of dimensions but compatible shapes. Let's look at an example to see how it works.</p>
<pre><code class="lang-py">arr2 = np.array([[<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>], 
                 [<span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>], 
                 [<span class="hljs-number">9</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>]])               
arr2.shape
<span class="hljs-comment"># (3, 4)</span>

arr4 = np.array([<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>])
arr4.shape
<span class="hljs-comment"># (4,)</span>

arr2 + arr4
<span class="hljs-comment"># array([[ 5,  7,  9, 11],</span>
<span class="hljs-comment">#        [ 9, 11, 13, 15],</span>
<span class="hljs-comment">#        [13,  6,  8, 10]])</span>
</code></pre>
<p>When the expression <code>arr2 + arr4</code> is evaluated, <code>arr4</code> (which has the shape <code>(4,)</code>) is replicated three times to match the shape <code>(3, 4)</code> of <code>arr2</code>. Numpy performs the replication without actually creating three copies of the smaller array, which improves performance and uses less memory.</p>
<p><img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/02.05-broadcasting.png" alt="Image" width="432" height="324" loading="lazy">
<em>Source: <a target="_blank" href="https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html">Python Data Science Handbook</a></em></p>
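<p>Broadcasting only works if one of the arrays can be replicated to match the other array's shape. Numpy compares the shapes starting from the last dimension: each pair of lengths must either be equal, or one of them must be 1 (a missing dimension counts as 1).</p>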
<pre><code class="lang-py">arr5 = np.array([<span class="hljs-number">7</span>, <span class="hljs-number">8</span>])
arr5.shape
<span class="hljs-comment"># (2,)</span>

arr2 + arr5
<span class="hljs-comment"># ValueError: operands could not be broadcast together with shapes (3,4) (2,)</span>
</code></pre>
<p>In the above example, even if <code>arr5</code> is replicated three times, it will not match the shape of <code>arr2</code>. So <code>arr2 + arr5</code> cannot be evaluated successfully. <a target="_blank" href="https://numpy.org/doc/stable/user/basics.broadcasting.html">Learn more about broadcasting here</a>.</p>
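<p>As a small sketch of the compatibility rule, the snippet below reuses <code>arr2</code> and <code>arr4</code> from above. The names <code>col</code>, <code>shifted</code>, <code>stretched</code>, and <code>compatible</code> are made up for this illustration. It uses <code>np.broadcast_to</code> only to make the "replicated" array visible.</p>

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])

# Shape (3, 1): one value per row, stretched across the 4 columns of arr2
col = np.array([[10], [20], [30]])
shifted = arr2 + col
# array([[11, 12, 13, 14],
#        [25, 26, 27, 28],
#        [39, 31, 32, 33]])

# np.broadcast_to shows what the (4,) array looks like once it is
# stretched to shape (3, 4); it returns a read-only view, not three copies
arr4 = np.array([4, 5, 6, 7])
stretched = np.broadcast_to(arr4, arr2.shape)
stretched.shape
# (3, 4)

# Shapes (3, 4) and (2,) are incompatible: the trailing lengths 4 and 2
# differ and neither of them is 1
try:
    arr2 + np.array([7, 8])
    compatible = True
except ValueError:
    compatible = False
```

<p>Here the <code>(3, 1)</code> array is stretched along columns, while the <code>(4,)</code> array is stretched along rows.</p>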
<h4 id="heading-numpy-array-comparison"><strong>Numpy Array Comparison</strong></h4>
<p>Numpy arrays also support comparison operations like <code>==</code>, <code>!=</code>, <code>&gt;</code> and so on. The result is an array of booleans.</p>
<pre><code class="lang-py">arr1 = np.array([[<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>], [<span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>]])
arr2 = np.array([[<span class="hljs-number">2</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>], [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">5</span>]])

arr1 == arr2
<span class="hljs-comment"># array([[False,  True,  True],</span>
<span class="hljs-comment">#        [False, False,  True]])</span>

arr1 != arr2
<span class="hljs-comment"># array([[ True, False, False],</span>
<span class="hljs-comment">#        [ True,  True, False]])</span>

arr1 &gt;= arr2
<span class="hljs-comment"># array([[False,  True,  True],</span>
<span class="hljs-comment">#        [ True,  True,  True]])</span>

arr1 &lt; arr2
<span class="hljs-comment"># array([[ True, False, False],</span>
<span class="hljs-comment">#        [False, False, False]])</span>
</code></pre>
<p>Array comparison is frequently used to count the number of equal elements in two arrays using the <code>sum</code> method. Remember that <code>True</code> evaluates to <code>1</code> and <code>False</code> evaluates to <code>0</code> when you use booleans in arithmetic operations.</p>
<pre><code class="lang-py">(arr1 == arr2).sum()
<span class="hljs-comment"># 3</span>
</code></pre>
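<p>The same idea extends to counting matches per row or computing the fraction of matching elements. The following is a minimal sketch using the arrays from above. The variable names <code>matches</code>, <code>num_equal</code>, <code>equal_per_row</code>, and <code>fraction_equal</code> are chosen just for this example.</p>

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [3, 4, 5]])
arr2 = np.array([[2, 2, 3], [1, 2, 5]])

# Boolean array marking the positions where the two arrays agree
matches = arr1 == arr2

# True counts as 1 and False as 0, so sum() counts the matches
num_equal = matches.sum()
# 3

# Passing an axis counts matches within each row
equal_per_row = matches.sum(axis=1)
# array([2, 1])

# The mean of a boolean array is the fraction of True values
fraction_equal = matches.mean()
# 0.5
```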
<h3 id="heading-numpy-array-indexing-and-slicing">Numpy Array Indexing and Slicing</h3>
<p>Numpy extends Python's list indexing notation using <code>[]</code> to multiple dimensions in an intuitive fashion. You can provide a comma-separated list of indices or ranges to select a specific element or a subarray (also called a slice) from a Numpy array.</p>
<pre><code class="lang-py">arr3 = np.array([
    [[<span class="hljs-number">11</span>, <span class="hljs-number">12</span>, <span class="hljs-number">13</span>, <span class="hljs-number">14</span>], 
     [<span class="hljs-number">13</span>, <span class="hljs-number">14</span>, <span class="hljs-number">15</span>, <span class="hljs-number">19</span>]], 

    [[<span class="hljs-number">15</span>, <span class="hljs-number">16</span>, <span class="hljs-number">17</span>, <span class="hljs-number">21</span>], 
     [<span class="hljs-number">63</span>, <span class="hljs-number">92</span>, <span class="hljs-number">36</span>, <span class="hljs-number">18</span>]], 

    [[<span class="hljs-number">98</span>, <span class="hljs-number">32</span>, <span class="hljs-number">81</span>, <span class="hljs-number">23</span>],      
     [<span class="hljs-number">17</span>, <span class="hljs-number">18</span>, <span class="hljs-number">19.5</span>, <span class="hljs-number">43</span>]]])

arr3.shape
<span class="hljs-comment"># (3, 2, 4)</span>

<span class="hljs-comment"># Single element</span>
arr3[<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>]

<span class="hljs-comment"># 36.0</span>

<span class="hljs-comment"># Subarray using ranges</span>
arr3[<span class="hljs-number">1</span>:, <span class="hljs-number">0</span>:<span class="hljs-number">1</span>, :<span class="hljs-number">2</span>]

<span class="hljs-comment"># array([[[15., 16.]],</span>
<span class="hljs-comment"># </span>
<span class="hljs-comment">#        [[98., 32.]]])</span>

<span class="hljs-comment"># Mixing indices and ranges</span>
arr3[<span class="hljs-number">1</span>:, <span class="hljs-number">1</span>, <span class="hljs-number">3</span>]

<span class="hljs-comment"># array([18., 43.])</span>

arr3[<span class="hljs-number">1</span>:, <span class="hljs-number">1</span>, :<span class="hljs-number">3</span>]
<span class="hljs-comment"># array([[63. , 92. , 36. ],</span>
<span class="hljs-comment">#        [17. , 18. , 19.5]])</span>

<span class="hljs-comment"># Using fewer indices</span>
arr3[<span class="hljs-number">1</span>]

<span class="hljs-comment"># array([[15., 16., 17., 21.],</span>
<span class="hljs-comment">#        [63., 92., 36., 18.]])</span>

arr3[:<span class="hljs-number">2</span>, <span class="hljs-number">1</span>]
<span class="hljs-comment"># array([[13., 14., 15., 19.],</span>
<span class="hljs-comment">#        [63., 92., 36., 18.]])</span>

<span class="hljs-comment"># Using too many indices</span>
arr3[<span class="hljs-number">1</span>,<span class="hljs-number">3</span>,<span class="hljs-number">2</span>,<span class="hljs-number">1</span>]

<span class="hljs-comment"># IndexError: too many indices for array: array is 3-dimensional, but 4 were indexed</span>
</code></pre>
<p>The notation and its results can seem confusing at first, so take your time to experiment and become comfortable with it. </p>
<p>Try out some examples of array indexing and slicing, with different combinations of indices and ranges. Here are some more examples demonstrated visually:</p>
<p><img src="https://scipy-lectures.org/_images/numpy_indexing.png" alt="Image" width="772" height="383" loading="lazy">
<em>Source: <a target="_blank" href="https://scipy-lectures.org/intro/numpy/array_object.html">Scipy Lectures</a></em></p>
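<p>One way to build intuition is to check the shape of each result: an integer index removes a dimension, while a range keeps it. The sketch below uses a made-up array <code>arr</code>, filled with the numbers 0 to 23 so that each element's position is easy to work out.</p>

```python
import numpy as np

# Hypothetical 3 x 2 x 4 array holding 0..23 in order
arr = np.arange(24).reshape(3, 2, 4)

# Three integer indices select a single element: 1*8 + 1*4 + 2 = 14
single = arr[1, 1, 2]

# Three ranges keep all three dimensions
ranges_shape = arr[1:, 0:1, :2].shape
# (2, 1, 2)

# One range and two integers leave a single dimension
mixed = arr[1:, 1, 3]
# array([15, 23])

# Fewer indices than dimensions: the remaining dimensions are kept whole
fewer_shape = arr[1].shape
# (2, 4)
```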
<h3 id="heading-how-to-create-numpy-arrays-other-methods">How to Create Numpy Arrays – Other Methods</h3>
<p>Numpy also provides some handy functions to create arrays of desired shapes with fixed or random values. Check out the <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fnumpy.org%2Fdoc%2Fstable%2Freference%2Froutines.array-creation.html">official documentation</a> or use the <code>help</code> function to learn more.</p>
<pre><code># All zeros
np.zeros((<span class="hljs-number">3</span>, <span class="hljs-number">2</span>))

# array([[<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>],
#        [<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>],
#        [<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>]])

# All ones
np.ones([<span class="hljs-number">2</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])

# array([[[<span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>],
#         [<span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>]],
#
#        [[<span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>],
#         [<span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>]]])

# Identity matrix
np.eye(<span class="hljs-number">3</span>)

# array([[<span class="hljs-number">1.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>],
#        [<span class="hljs-number">0.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">0.</span>],
#        [<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">1.</span>]])

# Random vector
np.random.rand(<span class="hljs-number">5</span>)

# array([<span class="hljs-number">0.92929562</span>, <span class="hljs-number">0.11301864</span>, <span class="hljs-number">0.64213555</span>, <span class="hljs-number">0.8600434</span> , <span class="hljs-number">0.53738656</span>])

# Random matrix
np.random.randn(<span class="hljs-number">2</span>, <span class="hljs-number">3</span>) # rand vs. randn - what's the difference?

# array([[ 0.09906435, -1.64668094,  0.08073528],
#        [ 0.1437016 ,  0.80715712,  1.27285476]])

# Fixed value
np.full([2, 3], 42)

# array([[42, 42, 42],
#        [42, 42, 42]])

# Range with start, end and step
np.arange(10, 90, 3)

# array([10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58,
#        61, 64, 67, 70, 73, 76, 79, 82, 85, 88])

# Equally spaced numbers in a range
np.linspace(3, 27, 9)

# array([ 3.,  6.,  9., 12., 15., 18., 21., 24., 27.])
</code></pre><h3 id="heading-exercises">Exercises</h3>
<p>Try the following exercises to become familiar with Numpy arrays and practice your skills:</p>
<ul>
<li>Assignment on Numpy array functions: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fjovian.ml%2Faakashns%2Fnumpy-array-operations">https://jovian.ml/aakashns/numpy-array-operations</a></li>
<li>(Optional) 100 numpy exercises: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fjovian.ml%2Faakashns%2F100-numpy-exercises">https://jovian.ml/aakashns/100-numpy-exercises</a></li>
</ul>
<h3 id="heading-summary-and-further-reading">Summary and Further Reading</h3>
<p>With this, we complete our discussion of numerical computing with Numpy. We've covered the following topics in this part of the tutorial:</p>
<ul>
<li>How to go from Python lists to Numpy arrays</li>
<li>How to operate on Numpy arrays</li>
<li>The benefits of using Numpy arrays over lists</li>
<li>Multi-dimensional Numpy arrays</li>
<li>How to work with CSV data files</li>
<li>Arithmetic operations and broadcasting</li>
<li>Array indexing and slicing</li>
<li>Other ways of creating Numpy arrays</li>
</ul>
<p>Check out the following resources for learning more about Numpy:</p>
<ul>
<li><a target="_blank" href="https://numpy.org/devdocs/user/quickstart.html">Official tutorial</a></li>
<li><a target="_blank" href="https://www.freecodecamp.org/news/the-ultimate-guide-to-the-numpy-scientific-computing-library-for-python/">Numpy course on freeCodeCamp</a></li>
<li><a target="_blank" href="http://scipy-lectures.org/advanced/advanced_numpy/index.html">Advanced Numpy (exploring the internals)</a></li>
</ul>
<h3 id="heading-review-questions-to-check-your-comprehension">Review Questions to Check Your Comprehension</h3>
<p>Try answering the following questions to test your understanding of the topics covered in this section:</p>
<ol>
<li>What is a vector?</li>
<li>How do you represent vectors using a Python list? Give an example.</li>
<li>What is a dot product of two vectors?</li>
<li>Write a function to compute the dot product of two vectors.</li>
<li>What is Numpy?</li>
<li>How do you install Numpy?</li>
<li>How do you import the <code>numpy</code> module?</li>
<li>What does it mean to import a module with an alias? Give an example.</li>
<li>What is the commonly used alias for <code>numpy</code>?</li>
<li>What is a Numpy array?</li>
<li>How do you create a Numpy array? Give an example.</li>
<li>What is the type of Numpy arrays?</li>
<li>How do you access the elements of a Numpy array?</li>
<li>How do you compute the dot product of two vectors using Numpy?</li>
<li>What happens if you try to compute the dot product of two vectors which have different sizes?</li>
<li>How do you compute the element-wise product of two Numpy arrays?</li>
<li>How do you compute the sum of all the elements in a Numpy array?</li>
<li>What are the benefits of using Numpy arrays over Python lists for operating on numerical data?</li>
<li>Why do Numpy array operations have better performance compared to Python functions and loops?</li>
<li>Illustrate the performance difference between Numpy array operations and Python loops using an example.</li>
<li>What are multi-dimensional Numpy arrays?</li>
<li>Illustrate how you'd create Numpy arrays with 2, 3, and 4 dimensions.</li>
<li>How do you inspect the number of dimensions and the length along each dimension in a Numpy array?</li>
<li>Can the elements of a Numpy array have different data types?</li>
<li>How do you check the data types of the elements of a Numpy array?</li>
<li>What is the data type of a Numpy array?</li>
<li>What is the difference between a matrix and a 2D Numpy array?</li>
<li>How do you perform matrix multiplication using Numpy?</li>
<li>What is the <code>@</code> operator used for in Numpy?</li>
<li>What is the CSV file format?</li>
<li>How do you read data from a CSV file using Numpy?</li>
<li>How do you concatenate two Numpy arrays?</li>
<li>What is the purpose of the <code>axis</code> argument of <code>np.concatenate</code>?</li>
<li>When are two Numpy arrays compatible for concatenation?</li>
<li>Give an example of two Numpy arrays that can be concatenated.</li>
<li>Give an example of two Numpy arrays that cannot be concatenated.</li>
<li>What is the purpose of the <code>np.reshape</code> function?</li>
<li>What does it mean to “reshape” a Numpy array?</li>
<li>How do you write a Numpy array into a CSV file?</li>
<li>Give some examples of Numpy functions for performing mathematical operations.</li>
<li>Give some examples of Numpy functions for performing array manipulation.</li>
<li>Give some examples of Numpy functions for performing linear algebra.</li>
<li>Give some examples of Numpy functions for performing statistical operations.</li>
<li>How do you find the right Numpy function for a specific operation or use case?</li>
<li>Where can you see a list of all the Numpy array functions and operations?</li>
<li>What are the arithmetic operators supported by Numpy arrays? Illustrate with examples.</li>
<li>What is array broadcasting? How is it useful? Illustrate with an example.</li>
<li>Give some examples of arrays that are compatible for broadcasting.</li>
<li>Give some examples of arrays that are not compatible for broadcasting.</li>
<li>What are the comparison operators supported by Numpy arrays? Illustrate with examples.</li>
<li>How do you access a specific subarray or slice from a Numpy array?</li>
<li>Illustrate array indexing and slicing in multi-dimensional Numpy arrays with some examples.</li>
<li>How do you create a Numpy array with a given shape containing all zeros?</li>
<li>How do you create a Numpy array with a given shape containing all ones?</li>
<li>How do you create an identity matrix of a given shape?</li>
<li>How do you create a random vector of a given length?</li>
<li>How do you create a Numpy array with a given shape with a fixed value for each element?</li>
<li>How do you create a Numpy array with a given shape containing randomly initialized elements?</li>
<li>What is the difference between <code>np.random.rand</code> and <code>np.random.randn</code>? Illustrate with examples.</li>
<li>What is the difference between <code>np.arange</code> and <code>np.linspace</code>? Illustrate with examples.</li>
</ol>
<p>You are ready to move on to the next section of this tutorial.</p>
<h2 id="heading-how-to-analyze-tabular-data-using-python-and-pandas">How to Analyze Tabular Data using Python and Pandas</h2>
<p><img src="https://i.imgur.com/zfxLzEv.png" alt="Image" width="3175" height="1414" loading="lazy"></p>
<p>Follow along and run the code here: <a target="_blank" href="https://jovian.ai/aakashns/python-pandas-data-analysis">https://jovian.ai/aakashns/python-pandas-data-analysis</a>.</p>
<p>This section covers the following topics:</p>
<ul>
<li>How to read a CSV file into a Pandas data frame</li>
<li>How to retrieve data from Pandas data frames</li>
<li>How to query, sort, and analyze data</li>
<li>How to merge, group, and aggregate data</li>
<li>How to extract useful information from dates</li>
<li>Basic plotting using line and bar charts</li>
<li>How to write data frames to CSV files</li>
</ul>
<h3 id="heading-how-to-read-a-csv-file-using-pandas">How to Read a CSV File Using Pandas</h3>
<p><a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fpandas.pydata.org%2F">Pandas</a> is a popular Python library used for working with tabular data (similar to the data stored in a spreadsheet). It provides helper functions to read data from various file formats like CSV, Excel spreadsheets, HTML tables, JSON, SQL, and more. </p>
<p>Let's download a file <code>italy-covid-daywise.csv</code> which contains day-wise Covid-19 data for Italy in the following format:</p>
<pre><code>date,new_cases,new_deaths,new_tests
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-21</span>,<span class="hljs-number">2256.0</span>,<span class="hljs-number">454.0</span>,<span class="hljs-number">28095.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-22</span>,<span class="hljs-number">2729.0</span>,<span class="hljs-number">534.0</span>,<span class="hljs-number">44248.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-23</span>,<span class="hljs-number">3370.0</span>,<span class="hljs-number">437.0</span>,<span class="hljs-number">37083.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-24</span>,<span class="hljs-number">2646.0</span>,<span class="hljs-number">464.0</span>,<span class="hljs-number">95273.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-25</span>,<span class="hljs-number">3021.0</span>,<span class="hljs-number">420.0</span>,<span class="hljs-number">38676.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-26</span>,<span class="hljs-number">2357.0</span>,<span class="hljs-number">415.0</span>,<span class="hljs-number">24113.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-27</span>,<span class="hljs-number">2324.0</span>,<span class="hljs-number">260.0</span>,<span class="hljs-number">26678.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-28</span>,<span class="hljs-number">1739.0</span>,<span class="hljs-number">333.0</span>,<span class="hljs-number">37554.0</span>
...
</code></pre><p>This format of storing data is known as <em>comma-separated values</em> or CSV. Here's a reminder in case you need a definition of what the CSV format is:</p>
<blockquote>
<p><strong>CSVs</strong>: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)</p>
</blockquote>
<p>We'll download this file using the <code>urlretrieve</code> function from the <code>urllib.request</code> module.</p>
<pre><code class="lang-py"><span class="hljs-keyword">from</span> urllib.request <span class="hljs-keyword">import</span> urlretrieve

urlretrieve(<span class="hljs-string">'https://hub.jovian.ml/wp-content/uploads/2020/09/italy-covid-daywise.csv'</span>, <span class="hljs-string">'italy-covid-daywise.csv'</span>)
</code></pre>
<p>To read the file, we can use the <code>read_csv</code> function from Pandas. First, let's install the Pandas library.</p>
<pre><code class="lang-py">!pip install pandas --upgrade --quiet
</code></pre>
<p>We can now import the <code>pandas</code> module. As a convention, it is imported with the alias <code>pd</code>.</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

covid_df = pd.read_csv(<span class="hljs-string">'italy-covid-daywise.csv'</span>)
</code></pre>
<p>Data from the file is read and stored in a <code>DataFrame</code> object – one of the core data structures in Pandas for storing and working with tabular data. We typically use the <code>_df</code> suffix in the variable names for dataframes.</p>
<pre><code class="lang-py">type(covid_df)
<span class="hljs-comment"># pandas.core.frame.DataFrame</span>

covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-108.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Here's what we can tell by looking at the dataframe:</p>
<ul>
<li>The file has four columns: the date and three day-wise counts for COVID-19 in Italy</li>
<li>The metrics reported are new cases, deaths, and tests</li>
<li>Data is provided for 248 days: from Dec 31, 2019, to Sep 3, 2020</li>
</ul>
<p>Keep in mind that these are officially reported numbers. The actual number of cases and deaths may be higher, as not all cases are diagnosed.</p>
<p>We can view some basic information about the data frame using the <code>.info</code> method.</p>
<pre><code class="lang-py">covid_df.info()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-109.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>It appears that each column contains values of a specific data type. You can view statistical information for numerical columns (mean, standard deviation, minimum/maximum values, and the number of non-empty values) using the <code>.describe</code> method.</p>
<pre><code class="lang-py">covid_df.describe()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-110.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The <code>columns</code> property contains the list of columns within the data frame.</p>
<pre><code class="lang-py">covid_df.columns
<span class="hljs-comment"># Index(['date', 'new_cases', 'new_deaths', 'new_tests'], dtype='object')</span>
</code></pre>
<p>You can also retrieve the number of rows and columns in the data frame using the <code>.shape</code> property.</p>
<pre><code class="lang-py">covid_df.shape
<span class="hljs-comment"># (248, 4)</span>
</code></pre>
<p>Here's a summary of the functions and methods we've looked at so far:</p>
<ul>
<li><code>pd.read_csv</code> – Read data from a CSV file into a Pandas <code>DataFrame</code> object</li>
<li><code>.info()</code> – View basic information about rows, columns, and data types</li>
<li><code>.describe()</code> – View statistical information about numeric columns</li>
<li><code>.columns</code> – Get the list of column names</li>
<li><code>.shape</code> – Get the number of rows and columns as a tuple</li>
</ul>
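<p>To try these without downloading anything, here is a minimal sketch that reads a few rows in the format shown earlier from an in-memory string. The names <code>csv_text</code> and <code>sample_df</code> are made up for this illustration, and <code>io.StringIO</code> simply stands in for a file on disk.</p>

```python
import io
import pandas as pd

# First three data rows from the sample shown earlier
csv_text = """date,new_cases,new_deaths,new_tests
2020-04-21,2256.0,454.0,28095.0
2020-04-22,2729.0,534.0,44248.0
2020-04-23,3370.0,437.0,37083.0
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

sample_df.shape
# (3, 4)

list(sample_df.columns)
# ['date', 'new_cases', 'new_deaths', 'new_tests']

# describe() only summarizes the numeric columns, so 'date' is left out
stats = sample_df.describe()
stats.loc['mean', 'new_cases']
# 2785.0
```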
<h3 id="heading-how-to-retrieve-data-from-a-data-frame-in-pandas">How to Retrieve Data from a Data Frame in Pandas</h3>
<p>The first thing you might want to do is retrieve data from this data frame, like the counts of a specific day or the list of values in a particular column. </p>
<p>To do this, you should understand the internal representation of data in a data frame. Conceptually, you can think of a dataframe as a dictionary of lists: keys are column names, and values are lists/arrays containing data for the respective columns.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Pandas format is similar to this</span>
covid_data_dict = {
    <span class="hljs-string">'date'</span>:       [<span class="hljs-string">'2020-08-30'</span>, <span class="hljs-string">'2020-08-31'</span>, <span class="hljs-string">'2020-09-01'</span>, <span class="hljs-string">'2020-09-02'</span>, <span class="hljs-string">'2020-09-03'</span>],
    <span class="hljs-string">'new_cases'</span>:  [<span class="hljs-number">1444</span>, <span class="hljs-number">1365</span>, <span class="hljs-number">996</span>, <span class="hljs-number">975</span>, <span class="hljs-number">1326</span>],
    <span class="hljs-string">'new_deaths'</span>: [<span class="hljs-number">1</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>, <span class="hljs-number">8</span>, <span class="hljs-number">6</span>],
    <span class="hljs-string">'new_tests'</span>: [<span class="hljs-number">53541</span>, <span class="hljs-number">42583</span>, <span class="hljs-number">54395</span>, <span class="hljs-literal">None</span>, <span class="hljs-literal">None</span>]
}
</code></pre>
<p>Representing data in the above format has a few benefits:</p>
<ul>
<li>All values in a column typically have the same type, so it's more efficient to store them in a single array.</li>
<li>Retrieving the values for a particular row simply requires extracting the elements at a given index from each column array.</li>
<li>The representation is more compact (column names are recorded only once) compared to other formats that use a dictionary for each row of data (see the example below).</li>
</ul>
<pre><code class="lang-py"><span class="hljs-comment"># Pandas format is not similar to this</span>
covid_data_list = [
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-08-30'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">1444</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'new_tests'</span>: <span class="hljs-number">53541</span>},
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-08-31'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">1365</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">4</span>, <span class="hljs-string">'new_tests'</span>: <span class="hljs-number">42583</span>},
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-09-01'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">996</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">6</span>, <span class="hljs-string">'new_tests'</span>: <span class="hljs-number">54395</span>},
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-09-02'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">975</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">8</span> },
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-09-03'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">1326</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">6</span>},
]
</code></pre>
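<p>Both layouts can be turned into a data frame with the <code>pd.DataFrame</code> constructor, which makes the comparison concrete. The sketch below uses a trimmed, three-row version of the data above. The names <code>from_columns</code> and <code>from_rows</code> are chosen for this example only.</p>

```python
import pandas as pd

# Dictionary of lists: one key per column
covid_data_dict = {
    'date':       ['2020-08-31', '2020-09-01', '2020-09-02'],
    'new_cases':  [1365, 996, 975],
    'new_deaths': [4, 6, 8],
    'new_tests':  [42583, 54395, None],
}
from_columns = pd.DataFrame(covid_data_dict)

# List of dictionaries: one dictionary per row, column names repeated
covid_data_list = [
    {'date': '2020-08-31', 'new_cases': 1365, 'new_deaths': 4, 'new_tests': 42583},
    {'date': '2020-09-01', 'new_cases': 996, 'new_deaths': 6, 'new_tests': 54395},
    {'date': '2020-09-02', 'new_cases': 975, 'new_deaths': 8},
]
from_rows = pd.DataFrame(covid_data_list)

# Both produce a table with the same shape and column names
from_columns.shape, from_rows.shape
# ((3, 4), (3, 4))

# The None value and the missing key both end up as missing values
from_columns['new_tests'].isna().sum(), from_rows['new_tests'].isna().sum()
# (1, 1)
```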
<p>With the dictionary of lists analogy in mind, you can now guess how to retrieve data from a data frame. For example, we can get a list of values from a specific column using the <code>[]</code> indexing notation.</p>
<pre><code class="lang-py">covid_data_dict[<span class="hljs-string">'new_cases'</span>]
<span class="hljs-comment"># [1444, 1365, 996, 975, 1326]</span>

covid_df[<span class="hljs-string">'new_cases'</span>]
<span class="hljs-comment"># 0         0.0</span>
<span class="hljs-comment"># 1         0.0</span>
<span class="hljs-comment"># 2         0.0</span>
<span class="hljs-comment"># 3         0.0</span>
<span class="hljs-comment"># 4         0.0</span>
<span class="hljs-comment">#         ...  </span>
<span class="hljs-comment"># 243    1444.0</span>
<span class="hljs-comment"># 244    1365.0</span>
<span class="hljs-comment"># 245     996.0</span>
<span class="hljs-comment"># 246     975.0</span>
<span class="hljs-comment"># 247    1326.0</span>
<span class="hljs-comment"># Name: new_cases, Length: 248, dtype: float64</span>
</code></pre>
<p>Each column is represented using a data structure called <code>Series</code>, which is essentially a numpy array with some extra methods and properties.</p>
<pre><code class="lang-py">type(covid_df[<span class="hljs-string">'new_cases'</span>])
<span class="hljs-comment"># pandas.core.series.Series</span>
</code></pre>
<p>Like arrays, you can retrieve a specific value from a series using the indexing notation <code>[]</code>.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'new_cases'</span>][<span class="hljs-number">246</span>]
<span class="hljs-comment"># 975.0</span>

covid_df[<span class="hljs-string">'new_tests'</span>][<span class="hljs-number">240</span>]
<span class="hljs-comment"># 57640.0</span>
</code></pre>
<p>Pandas also provides the <code>.at</code> accessor to retrieve the element at a specific row &amp; column directly.</p>
<pre><code class="lang-py">covid_df.at[<span class="hljs-number">246</span>, <span class="hljs-string">'new_cases'</span>]
<span class="hljs-comment"># 975.0</span>

covid_df.at[<span class="hljs-number">240</span>, <span class="hljs-string">'new_tests'</span>]
<span class="hljs-comment"># 57640.0</span>
</code></pre>
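<p>The two approaches can be compared on a tiny data frame. This is only a sketch: <code>small_df</code> is a made-up three-row frame built from values shown earlier, and the final assignment is included to show that <code>.at</code> can also set a single value.</p>

```python
import pandas as pd

small_df = pd.DataFrame({
    'date': ['2020-09-01', '2020-09-02', '2020-09-03'],
    'new_cases': [996.0, 975.0, 1326.0],
})

# Select the column first, then the row label within that series
via_series = small_df['new_cases'][1]
# 975.0

# .at takes the row label and the column name together
via_at = small_df.at[1, 'new_cases']
# 975.0

# .at can also assign a single value in place
small_df.at[2, 'new_cases'] = 1300.0
```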
<p>Instead of using the indexing notation <code>[]</code>, Pandas also allows accessing columns as properties of the dataframe using the <code>.</code> notation. However, this only works for columns whose names are valid Python identifiers (no spaces or special characters) and do not clash with an existing dataframe attribute or method.</p>
<pre><code class="lang-py">covid_df.new_cases
<span class="hljs-comment"># 0         0.0</span>
<span class="hljs-comment"># 1         0.0</span>
<span class="hljs-comment"># 2         0.0</span>
<span class="hljs-comment"># 3         0.0</span>
<span class="hljs-comment"># 4         0.0</span>
<span class="hljs-comment">#         ...  </span>
<span class="hljs-comment"># 243    1444.0</span>
<span class="hljs-comment"># 244    1365.0</span>
<span class="hljs-comment"># 245     996.0</span>
<span class="hljs-comment"># 246     975.0</span>
<span class="hljs-comment"># 247    1326.0</span>
<span class="hljs-comment"># Name: new_cases, Length: 248, dtype: float64</span>
</code></pre>
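<p>The limits of the <code>.</code> notation are easiest to see with awkward column names. In the sketch below, <code>demo_df</code> and its columns <code>'new cases'</code> and <code>'count'</code> are hypothetical and exist only to illustrate the point.</p>

```python
import pandas as pd

demo_df = pd.DataFrame({
    'new_deaths': [6, 8],
    'new cases': [996, 975],   # contains a space
    'count': [1, 2],           # same name as the DataFrame.count method
})

# A plain identifier works with either notation
deaths = demo_df.new_deaths

# A name with a space can only be reached with []
cases = demo_df['new cases']

# demo_df.count refers to the built-in method, not to the column
is_method = callable(demo_df.count)
# True

# The [] notation always returns the column
counts = demo_df['count']
```

<p>Using <code>[]</code> consistently avoids both problems.</p>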
<p>Further, you can also pass a list of columns within the indexing notation <code>[]</code> to access a subset of the data frame with just the given columns.</p>
<pre><code class="lang-py">cases_df = covid_df[[<span class="hljs-string">'date'</span>, <span class="hljs-string">'new_cases'</span>]]
cases_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-111.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The new data frame <code>cases_df</code> contains just the <code>date</code> and <code>new_cases</code> columns of <code>covid_df</code>. Whether a selection like this shares memory with the original (a "view") or is an independent copy depends on the operation and the Pandas version. Selecting a list of columns generally returns a copy, so don't rely on changes to one data frame showing up in the other.</p>
<p>Where Pandas can share data between data frames, it avoids the overhead of copying thousands or millions of rows every time you create a new data frame by operating on an existing one.</p>
<p>Sometimes you might need a full copy of the data frame, in which case you can use the <code>copy</code> method.</p>
<pre><code class="lang-py">covid_df_copy = covid_df.copy()
</code></pre>
<p>The data within <code>covid_df_copy</code> is completely separate from <code>covid_df</code>, and changing values inside one of them will not affect the other.</p>
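<p>The independence of a copy can be checked with a small sketch. The data frame <code>toy_df</code> and its values are made up for illustration.</p>

```python
import pandas as pd

# Hypothetical toy data, not taken from the Italy dataset
toy_df = pd.DataFrame({'new_cases': [10.0, 20.0, 30.0]})
toy_copy = toy_df.copy()

# Modify a value in the copy only
toy_copy.at[0, 'new_cases'] = 999.0

original_value = toy_df.at[0, 'new_cases']  # unchanged: 10.0
copied_value = toy_copy.at[0, 'new_cases']  # updated: 999.0
```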
<p>To access a specific row of data, Pandas provides the <code>.loc</code> indexer.</p>
<pre><code class="lang-py">covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-112.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">covid_df.loc[<span class="hljs-number">243</span>]
<span class="hljs-comment"># date          2020-08-30</span>
<span class="hljs-comment"># new_cases         1444.0</span>
<span class="hljs-comment"># new_deaths           1.0</span>
<span class="hljs-comment"># new_tests        53541.0</span>
<span class="hljs-comment"># Name: 243, dtype: object</span>
</code></pre>
<p>Each retrieved row is also a <code>Series</code> object.</p>
<pre><code class="lang-py">type(covid_df.loc[<span class="hljs-number">243</span>])
<span class="hljs-comment"># pandas.core.series.Series</span>
</code></pre>
<p>We can use the <code>.head</code> and <code>.tail</code> methods to view the first or last few rows of data.</p>
<pre><code class="lang-py">covid_df.head(<span class="hljs-number">5</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-113.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">covid_df.tail(<span class="hljs-number">4</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-114.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Notice above that while the first few values in the <code>new_cases</code> and <code>new_deaths</code> columns are <code>0</code>, the corresponding values within the <code>new_tests</code> column are <code>NaN</code>. That is because the CSV file does not contain any data for the <code>new_tests</code> column for specific dates (you can verify this by looking into the file). These values may be missing or unknown.</p>
<pre><code class="lang-py">covid_df.at[<span class="hljs-number">0</span>, <span class="hljs-string">'new_tests'</span>]
<span class="hljs-comment"># nan</span>

type(covid_df.at[<span class="hljs-number">0</span>, <span class="hljs-string">'new_tests'</span>])
<span class="hljs-comment"># numpy.float64</span>
</code></pre>
<p>The distinction between <code>0</code> and <code>NaN</code> is subtle but important. In this dataset, <code>NaN</code> indicates that daily test numbers were not reported on specific dates. Italy started reporting daily tests on Apr 19, 2020, and had already conducted 935,310 tests before that date.</p>
<p>We can find the first index that doesn't contain a <code>NaN</code> value using a column's <code>first_valid_index</code> method.</p>
<pre><code class="lang-py">covid_df.new_tests.first_valid_index()
<span class="hljs-comment"># 111</span>
</code></pre>
<p>Let's look at a few rows before and after this index to verify that the values change from <code>NaN</code> to actual numbers. We can do this by passing a range to <code>loc</code>.</p>
<pre><code class="lang-py">covid_df.loc[<span class="hljs-number">108</span>:<span class="hljs-number">113</span>]
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-115.png" alt="Image" width="600" height="400" loading="lazy"></p>
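<p>The same two steps can be tried on a tiny made-up series. This sketch (with hypothetical values) also shows that a label range passed to <code>loc</code> includes both endpoints, unlike ordinary Python slicing.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first two days have no reported tests
toy_df = pd.DataFrame({'new_tests': [np.nan, np.nan, 7.0, 9.0, 4.0]})

# Label of the first row whose value is not NaN
first_valid = toy_df.new_tests.first_valid_index()  # 2

# One row before and one row after; .loc slices are inclusive on both ends
window = toy_df.loc[first_valid - 1:first_valid + 1]
window_labels = list(window.index)  # [1, 2, 3]
```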
<p>We can use the <code>.sample</code> method to retrieve a random sample of rows from the data frame.</p>
<pre><code class="lang-py">covid_df.sample(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-116.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Notice that even though we have taken a random sample, each row's original index is preserved. This is a useful property of data frames.</p>
<p>Here's a summary of the functions and methods we looked at in this section:</p>
<ul>
<li><code>covid_df['new_cases']</code> – Retrieving columns as a <code>Series</code> using the column name</li>
<li><code>new_cases[243]</code> – Retrieving values from a <code>Series</code> using an index</li>
<li><code>covid_df.at[243, 'new_cases']</code> – Retrieving a single value from a data frame</li>
<li><code>covid_df.copy()</code> – Creating a deep copy of a data frame</li>
<li><code>covid_df.loc[243]</code> – Retrieving a row or range of rows of data from the data frame</li>
<li><code>head</code>, <code>tail</code>, and <code>sample</code> – Retrieving multiple rows of data from the data frame</li>
<li><code>covid_df.new_tests.first_valid_index</code> – Finding the first non-empty index in a series</li>
</ul>
<h3 id="heading-how-to-analyze-data-from-data-frames-in-pandas">How to Analyze Data from Data Frames in Pandas</h3>
<p>Let's try to answer some questions about our data.</p>
<p><strong>Q: What is the total number of reported cases and deaths related to Covid-19 in Italy?</strong></p>
<p>Similar to Numpy arrays, a Pandas series supports the <code>sum</code> method to answer these questions.</p>
<pre><code class="lang-py">total_cases = covid_df.new_cases.sum()
total_deaths = covid_df.new_deaths.sum()

print(<span class="hljs-string">'The number of reported cases is {} and the number of reported deaths is {}.'</span>.format(int(total_cases), int(total_deaths)))
<span class="hljs-comment"># The number of reported cases is 271515 and the number of reported deaths is 35497.</span>
</code></pre>
<p><strong>Q: What is the overall death rate (ratio of reported deaths to reported cases)?</strong></p>
<pre><code class="lang-py">death_rate = covid_df.new_deaths.sum() / covid_df.new_cases.sum()

print(<span class="hljs-string">"The overall reported death rate in Italy is {:.2f} %."</span>.format(death_rate*<span class="hljs-number">100</span>))
<span class="hljs-comment"># The overall reported death rate in Italy is 13.07 %.</span>
</code></pre>
<p><strong>Q: What is the overall number of tests conducted? A total of 935,310 tests were conducted before daily test numbers were reported.</strong></p>
<pre><code class="lang-py">initial_tests = <span class="hljs-number">935310</span>
total_tests = initial_tests + covid_df.new_tests.sum()

total_tests
<span class="hljs-comment"># 5214766.0</span>
</code></pre>
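<p>The sum above works even though <code>new_tests</code> contains <code>NaN</code> values, because <code>sum</code> skips missing values by default. Here is a minimal sketch with made-up numbers that shows this, along with the <code>skipna</code> parameter and the <code>count</code> method.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values
new_tests = pd.Series([np.nan, np.nan, 100.0, 250.0])

# By default, NaN values are ignored when summing
total_skipping_nan = new_tests.sum()  # 350.0

# With skipna=False, a single NaN makes the whole result NaN
total_with_nan = new_tests.sum(skipna=False)

# count() returns the number of non-missing values
reported_days = new_tests.count()  # 2
```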
<p><strong>Q: What fraction of tests returned a positive result?</strong></p>
<pre><code class="lang-py">positive_rate = total_cases / total_tests

print(<span class="hljs-string">'{:.2f}% of tests in Italy led to a positive diagnosis.'</span>.format(positive_rate*<span class="hljs-number">100</span>))
<span class="hljs-comment"># 5.21% of tests in Italy led to a positive diagnosis.</span>
</code></pre>
<p>Try asking and answering some more questions about the data.</p>
<h3 id="heading-how-to-query-and-sort-rows-in-pandas">How to Query and Sort Rows in Pandas</h3>
<p>Let's say we only want to look at the days which had more than 1,000 reported cases. We can use a boolean expression to check which rows satisfy this criterion.</p>
<pre><code class="lang-py">high_new_cases = covid_df.new_cases &gt; <span class="hljs-number">1000</span>

high_new_cases
<span class="hljs-comment"># 0      False</span>
<span class="hljs-comment"># 1      False</span>
<span class="hljs-comment"># 2      False</span>
<span class="hljs-comment"># 3      False</span>
<span class="hljs-comment"># 4      False</span>
<span class="hljs-comment">#        ...  </span>
<span class="hljs-comment"># 243     True</span>
<span class="hljs-comment"># 244     True</span>
<span class="hljs-comment"># 245    False</span>
<span class="hljs-comment"># 246    False</span>
<span class="hljs-comment"># 247     True</span>
<span class="hljs-comment"># Name: new_cases, Length: 248, dtype: bool</span>
</code></pre>
<p>The boolean expression returns a series containing <code>True</code> and <code>False</code> boolean values. You can use this series to select a subset of rows from the original dataframe, corresponding to the <code>True</code> values in the series.</p>
<pre><code class="lang-py">covid_df[high_new_cases]
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-117.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can write this more succinctly on a single line by passing the boolean expression directly as the index to the data frame.</p>
<pre><code class="lang-py">high_cases_df = covid_df[covid_df.new_cases &gt; <span class="hljs-number">1000</span>]

high_cases_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-118.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The data frame contains 72 rows, but only the first and last five rows are displayed by default with Jupyter for brevity. We can change some display options to view all the rows.</p>
<pre><code class="lang-py"><span class="hljs-keyword">from</span> IPython.display <span class="hljs-keyword">import</span> display
<span class="hljs-keyword">with</span> pd.option_context(<span class="hljs-string">'display.max_rows'</span>, <span class="hljs-number">100</span>):
    display(covid_df[covid_df.new_cases &gt; <span class="hljs-number">1000</span>])
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-119.png" alt="Image" width="600" height="400" loading="lazy">
<em>This is just part of the data frame. Check out the rest <a target="_blank" href="https://jovian.ai/embed?url=https://jovian.ai/aakashns/python-pandas-data-analysis">here</a>.</em></p>
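<p>Boolean expressions can also be combined. The sketch below uses a made-up data frame and the element-wise <code>&amp;</code> operator to keep only the rows that satisfy two conditions at once. The thresholds are arbitrary and chosen purely for illustration.</p>

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'new_cases': [500.0, 1500.0, 2500.0, 1200.0],
    'new_deaths': [10.0, 40.0, 120.0, 5.0],
})

# Each comparison needs its own parentheses,
# because & binds more tightly than the comparison operators
mask = (toy_df.new_cases > 1000) & (toy_df.new_deaths > 30)

selected = toy_df[mask]
selected_labels = list(selected.index)  # [1, 2]
```

<p>Use <code>|</code> instead of <code>&amp;</code> to keep rows that satisfy at least one of the conditions.</p>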
<p>We can also formulate more complex queries that involve multiple columns. As an example, let's try to determine the days when the ratio of cases reported to tests conducted is higher than the overall <code>positive_rate</code>.</p>
<pre><code class="lang-py">positive_rate
<span class="hljs-comment"># 0.05206657403227681</span>

high_ratio_df = covid_df[covid_df.new_cases / covid_df.new_tests &gt; positive_rate]

high_ratio_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-120.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The result of performing an operation on two columns is a new series.</p>
<pre><code class="lang-py">covid_df.new_cases / covid_df.new_tests
<span class="hljs-comment"># 0           NaN</span>
<span class="hljs-comment"># 1           NaN</span>
<span class="hljs-comment"># 2           NaN</span>
<span class="hljs-comment"># 3           NaN</span>
<span class="hljs-comment"># 4           NaN</span>
<span class="hljs-comment">#          ...   </span>
<span class="hljs-comment"># 243    0.026970</span>
<span class="hljs-comment"># 244    0.032055</span>
<span class="hljs-comment"># 245    0.018311</span>
<span class="hljs-comment"># 246         NaN</span>
<span class="hljs-comment"># 247         NaN</span>
<span class="hljs-comment"># Length: 248, dtype: float64</span>
</code></pre>
<p>We can use this series to add a new column to the data frame.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'positive_rate'</span>] = covid_df.new_cases / covid_df.new_tests

covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-121.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>However, keep in mind that sometimes it takes a few days to get the results for a test, so we can't compare the number of new cases with the number of tests conducted on the same day. Any inference based on this <code>positive_rate</code> column is likely to be incorrect. </p>
<p>It's essential to watch out for such subtle relationships that are often not conveyed within the CSV file and require some external context. It's always a good idea to read through the documentation provided with the dataset or ask for more information.</p>
<p>For now, let's remove the <code>positive_rate</code> column using the <code>drop</code> method.</p>
<pre><code class="lang-py">covid_df.drop(columns=[<span class="hljs-string">'positive_rate'</span>], inplace=<span class="hljs-literal">True</span>)
</code></pre>
<p>Can you figure out the purpose of the <code>inplace</code> argument?</p>
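<p>As a hint, here is a minimal sketch on a made-up data frame that compares calling <code>drop</code> with and without <code>inplace=True</code>.</p>

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({'new_cases': [1.0, 2.0], 'positive_rate': [0.1, 0.2]})

# Without inplace, drop returns a new data frame and leaves toy_df untouched
dropped = toy_df.drop(columns=['positive_rate'])
columns_after_plain_drop = list(toy_df.columns)  # ['new_cases', 'positive_rate']

# With inplace=True, toy_df itself is modified and the call returns None
return_value = toy_df.drop(columns=['positive_rate'], inplace=True)
columns_after_inplace_drop = list(toy_df.columns)  # ['new_cases']
```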
<h4 id="heading-how-to-sort-rows-using-column-values-in-pandas"><strong>How to Sort Rows Using Column Values in Pandas</strong></h4>
<p>You can also sort the rows by a specific column using <code>.sort_values</code>. Let's sort to identify the days with the highest number of cases, then chain it with the <code>head</code> method to list just the first ten results.</p>
<pre><code class="lang-py">covid_df.sort_values(<span class="hljs-string">'new_cases'</span>, ascending=<span class="hljs-literal">False</span>).head(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-122.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>It looks like the last two weeks of March had the highest number of daily cases. Let's compare this to the days where the highest number of deaths were recorded.</p>
<pre><code class="lang-py">covid_df.sort_values(<span class="hljs-string">'new_deaths'</span>, ascending=<span class="hljs-literal">False</span>).head(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-123.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>It appears that daily deaths hit a peak just about a week after the peak in daily new cases.</p>
<p>Let's also look at the days with the smallest number of cases. We might expect to see the first few days of the year on this list.</p>
<pre><code class="lang-py">covid_df.sort_values(<span class="hljs-string">'new_cases'</span>).head(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-124.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>It seems like the count of new cases on Jun 20, 2020, was <code>-148</code>, a negative number! Not something we might have expected, but that's the nature of real-world data. It could be a data entry error, or the government may have issued a correction to account for miscounting in the past. </p>
<p>Can you dig through news articles online and figure out why the number was negative?</p>
<p>Let's look at some days before and after Jun 20, 2020.</p>
<pre><code class="lang-py">covid_df.loc[<span class="hljs-number">169</span>:<span class="hljs-number">175</span>]
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-125.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>For now, let's assume this was indeed a data entry error. We can use one of the following approaches for dealing with the missing or faulty value:</p>
<ol>
<li>Replace it with <code>0</code>.</li>
<li>Replace it with the average of the entire column.</li>
<li>Replace it with the average of the values on the previous and next date.</li>
<li>Discard the row entirely.</li>
</ol>
<p>Which approach you pick requires some context about the data and the problem. In this case, since we are dealing with data ordered by date, we can go ahead with the third approach.</p>
<p>You can use the <code>.at</code> method to modify a specific value within the dataframe.</p>
<pre><code class="lang-py">covid_df.at[<span class="hljs-number">172</span>, <span class="hljs-string">'new_cases'</span>] = (covid_df.at[<span class="hljs-number">171</span>, <span class="hljs-string">'new_cases'</span>] + covid_df.at[<span class="hljs-number">173</span>, <span class="hljs-string">'new_cases'</span>])/<span class="hljs-number">2</span>
</code></pre>
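<p>When the position of the faulty value isn't known in advance, the third approach can also be applied without hard-coding an index. The following is only a sketch on made-up numbers: it assumes that any negative count is an error, masks it as missing, and fills the gap by linear interpolation, which for a single isolated gap equals the average of the two neighboring values.</p>

```python
import pandas as pd

# Hypothetical daily counts with one negative (faulty) entry
new_cases = pd.Series([200.0, 300.0, -148.0, 260.0, 240.0])

# Replace negative values with NaN, then fill each gap from its neighbors
cleaned = new_cases.mask(new_cases.lt(0)).interpolate()

fixed_value = cleaned[2]  # (300.0 + 260.0) / 2 = 280.0
```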
<p>Here's a summary of the functions and methods we looked at in this section:</p>
<ul>
<li><code>covid_df.new_cases.sum()</code> – Computing the sum of values in a column or series</li>
<li><code>covid_df[covid_df.new_cases &gt; 1000]</code> – Querying a subset of rows satisfying the chosen criteria using boolean expressions</li>
<li><code>df['pos_rate'] = df.new_cases/df.new_tests</code> – Adding new columns by combining data from existing columns</li>
<li><code>covid_df.drop(columns=['positive_rate'])</code> – Removing one or more columns from the data frame</li>
<li><code>sort_values</code> – Sorting the rows of a data frame using column values</li>
<li><code>covid_df.at[172, 'new_cases'] = ...</code> – Replacing a value within the data frame</li>
</ul>
<h3 id="heading-how-to-work-with-dates-in-pandas">How to Work with Dates in Pandas</h3>
<p>While we've looked at overall numbers for the cases, tests, positive rate, and more, it would also be useful to study these numbers on a month-by-month basis. </p>
<p>The <code>date</code> column might come in handy here, as Pandas provides many utilities for working with dates.</p>
<pre><code class="lang-py">covid_df.date
<span class="hljs-comment"># 0      2019-12-31</span>
<span class="hljs-comment"># 1      2020-01-01</span>
<span class="hljs-comment"># 2      2020-01-02</span>
<span class="hljs-comment"># 3      2020-01-03</span>
<span class="hljs-comment"># 4      2020-01-04</span>
<span class="hljs-comment">#           ...    </span>
<span class="hljs-comment"># 243    2020-08-30</span>
<span class="hljs-comment"># 244    2020-08-31</span>
<span class="hljs-comment"># 245    2020-09-01</span>
<span class="hljs-comment"># 246    2020-09-02</span>
<span class="hljs-comment"># 247    2020-09-03</span>
<span class="hljs-comment"># Name: date, Length: 248, dtype: object</span>
</code></pre>
<p>The data type of date is currently <code>object</code>, so Pandas does not know that this column is a date. We can convert it into a <code>datetime</code> column using the <code>pd.to_datetime</code> method.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'date'</span>] = pd.to_datetime(covid_df.date)

covid_df[<span class="hljs-string">'date'</span>]
<span class="hljs-comment"># 0     2019-12-31</span>
<span class="hljs-comment"># 1     2020-01-01</span>
<span class="hljs-comment"># 2     2020-01-02</span>
<span class="hljs-comment"># 3     2020-01-03</span>
<span class="hljs-comment"># 4     2020-01-04</span>
<span class="hljs-comment">#          ...    </span>
<span class="hljs-comment"># 243   2020-08-30</span>
<span class="hljs-comment"># 244   2020-08-31</span>
<span class="hljs-comment"># 245   2020-09-01</span>
<span class="hljs-comment"># 246   2020-09-02</span>
<span class="hljs-comment"># 247   2020-09-03</span>
<span class="hljs-comment"># Name: date, Length: 248, dtype: datetime64[ns]</span>
</code></pre>
<p>You can see that it now has the datatype <code>datetime64</code>. We can now extract different parts of the date into separate columns, using the <code>DatetimeIndex</code> class (<a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fpandas.pydata.org%2Fpandas-docs%2Fversion%2F0.23.4%2Fgenerated%2Fpandas.DatetimeIndex.html">view docs</a>).</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'year'</span>] = pd.DatetimeIndex(covid_df.date).year
covid_df[<span class="hljs-string">'month'</span>] = pd.DatetimeIndex(covid_df.date).month
covid_df[<span class="hljs-string">'day'</span>] = pd.DatetimeIndex(covid_df.date).day
covid_df[<span class="hljs-string">'weekday'</span>] = pd.DatetimeIndex(covid_df.date).weekday

covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-126.png" alt="Image" width="600" height="400" loading="lazy"></p>
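<p>An alternative to <code>DatetimeIndex</code> is the <code>.dt</code> accessor available on any datetime column. Here is a minimal sketch on a made-up data frame with two dates. It also shows the weekday numbering, where Monday is <code>0</code> and Sunday is <code>6</code>.</p>

```python
import pandas as pd

# Hypothetical toy data: May 31, 2020 was a Sunday, June 1, 2020 a Monday
toy_df = pd.DataFrame({'date': ['2020-05-31', '2020-06-01']})
toy_df['date'] = pd.to_datetime(toy_df['date'])

# The .dt accessor exposes the parts of each date as new series
toy_df['year'] = toy_df['date'].dt.year
toy_df['month'] = toy_df['date'].dt.month
toy_df['day'] = toy_df['date'].dt.day
toy_df['weekday'] = toy_df['date'].dt.weekday  # Monday is 0, Sunday is 6

weekdays = toy_df['weekday'].tolist()  # [6, 0]
```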
<p>Let's check the overall metrics for May. We can query the rows for May, choose a subset of columns, and use the <code>sum</code> method to aggregate each selected column's values.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Query the rows for May</span>
covid_df_may = covid_df[covid_df.month == <span class="hljs-number">5</span>]

<span class="hljs-comment"># Extract the subset of columns to be aggregated</span>
covid_df_may_metrics = covid_df_may[[<span class="hljs-string">'new_cases'</span>, <span class="hljs-string">'new_deaths'</span>, <span class="hljs-string">'new_tests'</span>]]

<span class="hljs-comment"># Get the column-wise sum</span>
covid_may_totals = covid_df_may_metrics.sum()

covid_may_totals
<span class="hljs-comment"># new_cases       29073.0</span>
<span class="hljs-comment"># new_deaths       5658.0</span>
<span class="hljs-comment"># new_tests     1078720.0</span>
<span class="hljs-comment"># dtype: float64</span>

type(covid_may_totals)
<span class="hljs-comment"># pandas.core.series.Series</span>
</code></pre>
<p>We can also combine the above operations into a single statement.</p>
<pre><code class="lang-py">covid_df[covid_df.month == <span class="hljs-number">5</span>][[<span class="hljs-string">'new_cases'</span>, <span class="hljs-string">'new_deaths'</span>, <span class="hljs-string">'new_tests'</span>]].sum()
<span class="hljs-comment"># new_cases       29073.0</span>
<span class="hljs-comment"># new_deaths       5658.0</span>
<span class="hljs-comment"># new_tests     1078720.0</span>
<span class="hljs-comment"># dtype: float64</span>
</code></pre>
<p>As another example, let's check if the number of cases reported on Sundays is higher than the average number of cases reported every day. This time, we might want to aggregate columns using the <code>.mean</code> method.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Overall average</span>
covid_df.new_cases.mean()

<span class="hljs-comment"># 1096.6149193548388</span>

<span class="hljs-comment"># Average for Sundays</span>
covid_df[covid_df.weekday == <span class="hljs-number">6</span>].new_cases.mean()

<span class="hljs-comment"># 1247.2571428571428</span>
</code></pre>
<p>It seems like more cases were reported on Sundays compared to other days.</p>
<p>Try asking and answering some more date-related questions about the data.</p>
<h3 id="heading-how-to-group-and-aggregate-data-in-pandas">How to Group and Aggregate Data in Pandas</h3>
<p>As a next step, we might want to summarize the day-wise data and create a new dataframe with month-wise data. We can use the <code>groupby</code> function to create a group for each month, select the columns we wish to aggregate, and aggregate them using the <code>sum</code> method.</p>
<pre><code class="lang-py">covid_month_df = covid_df.groupby(<span class="hljs-string">'month'</span>)[[<span class="hljs-string">'new_cases'</span>, <span class="hljs-string">'new_deaths'</span>, <span class="hljs-string">'new_tests'</span>]].sum()

covid_month_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-127.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The result is a new data frame that uses unique values from the column passed to <code>groupby</code> as the index. Grouping and aggregation is a powerful method for progressively summarizing data into smaller data frames.</p>
<p>Instead of aggregating by sum, you can also aggregate by other measures like mean. Let's compute the average number of daily new cases, deaths, and tests for each month.</p>
<pre><code class="lang-py">covid_month_mean_df = covid_df.groupby(<span class="hljs-string">'month'</span>)[[<span class="hljs-string">'new_cases'</span>, <span class="hljs-string">'new_deaths'</span>, <span class="hljs-string">'new_tests'</span>]].mean()

covid_month_mean_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-128.png" alt="Image" width="600" height="400" loading="lazy"></p>
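<p>Several aggregations can also be computed in one go by passing a list of function names to the <code>agg</code> method. The sketch below uses made-up numbers for two months.</p>

```python
import pandas as pd

# Hypothetical toy data for March (3) and April (4)
toy_df = pd.DataFrame({
    'month': [3, 3, 4, 4, 4],
    'new_cases': [100.0, 300.0, 50.0, 70.0, 90.0],
})

# One row per month, one column per aggregation
summary = toy_df.groupby('month')['new_cases'].agg(['sum', 'mean', 'max'])

march_total = summary.loc[3, 'sum']    # 400.0
april_mean = summary.loc[4, 'mean']    # 70.0
```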
<p>Apart from grouping, another form of aggregation is the running or cumulative sum of cases, tests, or deaths up to each row's date. We can use the <code>cumsum</code> method to compute the cumulative sum of a column as a new series. </p>
<p>Let's add three new columns: <code>total_cases</code>, <code>total_deaths</code>, and <code>total_tests</code>.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'total_cases'</span>] = covid_df.new_cases.cumsum()
covid_df[<span class="hljs-string">'total_deaths'</span>] = covid_df.new_deaths.cumsum()
covid_df[<span class="hljs-string">'total_tests'</span>] = covid_df.new_tests.cumsum() + initial_tests
</code></pre>
<p>We've also included the initial test count in <code>total_tests</code> to account for tests conducted before daily reporting was started.</p>
<pre><code class="lang-py">covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-129.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Notice how the <code>NaN</code> values in the <code>total_tests</code> column remain unaffected.</p>
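<p>This behavior can be reproduced on a small made-up series: <code>cumsum</code> leaves <code>NaN</code> entries as <code>NaN</code> and keeps accumulating across the remaining values. The initial count of <code>1000</code> is an arbitrary stand-in for illustration.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical daily test counts; the first two days were not reported
new_tests = pd.Series([np.nan, np.nan, 100.0, 250.0])
initial_tests = 1000  # arbitrary stand-in for tests done before reporting began

# NaN rows stay NaN; the running total continues over the valid rows
total_tests = new_tests.cumsum() + initial_tests

reported_totals = total_tests.dropna().tolist()  # [1100.0, 1350.0]
```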
<h3 id="heading-how-to-merge-data-from-multiple-sources-in-pandas">How to Merge Data from Multiple Sources in Pandas</h3>
<p>To determine other metrics like tests per million, cases per million, and so on, we require some more information about the country, namely its population. </p>
<p>Let's download another file <code>locations.csv</code> that contains health-related information for many countries, including Italy.</p>
<pre><code class="lang-py">urlretrieve(<span class="hljs-string">'https://gist.githubusercontent.com/aakashns/8684589ef4f266116cdce023377fc9c8/raw/99ce3826b2a9d1e6d0bde7e9e559fc8b6e9ac88b/locations.csv'</span>, <span class="hljs-string">'locations.csv'</span>)

locations_df = pd.read_csv(<span class="hljs-string">'locations.csv'</span>)
locations_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-130.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">locations_df[locations_df.location == <span class="hljs-string">"Italy"</span>]
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-131.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can merge this data into our existing data frame by adding more columns. However, to merge two data frames, we need at least one common column. Let's insert a <code>location</code> column in the <code>covid_df</code> dataframe with all values set to <code>"Italy"</code>.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'location'</span>] = <span class="hljs-string">"Italy"</span>

covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-132.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can now add the columns from <code>locations_df</code> into <code>covid_df</code> using the <code>.merge</code> method.</p>
<pre><code class="lang-py">merged_df = covid_df.merge(locations_df, on=<span class="hljs-string">"location"</span>)

merged_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-133.png" alt="Image" width="600" height="400" loading="lazy">
<em>Check out the full data frame <a target="_blank" href="https://jovian.ai/embed?url=https://jovian.ai/aakashns/python-pandas-data-analysis">here</a>.</em></p>
<p>The location data for Italy is appended to each row within <code>covid_df</code>. If the <code>covid_df</code> data frame contained data for multiple locations, then the respective country's location data would be appended for each row.</p>
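<p>The multi-location case can be sketched with two small made-up data frames. The population figures below are rounded placeholders for illustration, not values from <code>locations.csv</code>.</p>

```python
import pandas as pd

# Hypothetical daily data covering two countries
cases_df = pd.DataFrame({
    'location': ['Italy', 'Italy', 'Spain'],
    'new_cases': [10.0, 20.0, 30.0],
})

# Hypothetical lookup table; the populations are rounded placeholders
population_df = pd.DataFrame({
    'location': ['Italy', 'Spain', 'France'],
    'population': [60_000_000, 47_000_000, 65_000_000],
})

# Each row of cases_df receives the population of its own location
merged = cases_df.merge(population_df, on='location')

populations = merged.population.tolist()  # [60000000, 60000000, 47000000]
```

<p>By default, <code>merge</code> keeps only the locations that appear in both data frames, so the row for France is not part of the result.</p>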
<p>We can now calculate metrics like cases per million, deaths per million, and tests per million.</p>
<pre><code class="lang-py">merged_df[<span class="hljs-string">'cases_per_million'</span>] = merged_df.total_cases * <span class="hljs-number">1e6</span> / merged_df.population
merged_df[<span class="hljs-string">'deaths_per_million'</span>] = merged_df.total_deaths * <span class="hljs-number">1e6</span> / merged_df.population
merged_df[<span class="hljs-string">'tests_per_million'</span>] = merged_df.total_tests * <span class="hljs-number">1e6</span> / merged_df.population

merged_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-134.png" alt="Image" width="600" height="400" loading="lazy">
<em>Check out the full data frame <a target="_blank" href="https://jovian.ai/embed?url=https://jovian.ai/aakashns/python-pandas-data-analysis">here</a>.</em></p>
<h3 id="heading-how-to-write-data-back-to-files-in-pandas">How to Write Data Back to Files in Pandas</h3>
<p>After completing your analysis and adding new columns, you should write the results back to a file. Otherwise, the data will be lost when the Jupyter notebook shuts down. </p>
<p>Before writing to file, let's first create a data frame containing just the columns we wish to record.</p>
<pre><code class="lang-py">result_df = merged_df[[<span class="hljs-string">'date'</span>,
                       <span class="hljs-string">'new_cases'</span>, 
                       <span class="hljs-string">'total_cases'</span>, 
                       <span class="hljs-string">'new_deaths'</span>, 
                       <span class="hljs-string">'total_deaths'</span>, 
                       <span class="hljs-string">'new_tests'</span>, 
                       <span class="hljs-string">'total_tests'</span>, 
                       <span class="hljs-string">'cases_per_million'</span>, 
                       <span class="hljs-string">'deaths_per_million'</span>, 
                       <span class="hljs-string">'tests_per_million'</span>]]

result_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-135.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>To write the data from the data frame into a file, we can use the <code>to_csv</code> function.</p>
<pre><code class="lang-py">result_df.to_csv(<span class="hljs-string">'results.csv'</span>, index=<span class="hljs-literal">False</span>)
</code></pre>
<p>By default, <code>to_csv</code> also writes an additional column containing the index of the data frame. We pass <code>index=False</code> to turn off this behavior. You can now verify that <code>results.csv</code> is created and contains data from the data frame in CSV format:</p>
<pre><code class="lang-py">date,new_cases,total_cases,new_deaths,total_deaths,new_tests,total_tests,cases_per_million,deaths_per_million,tests_per_million
<span class="hljs-number">2020</span><span class="hljs-number">-02</span><span class="hljs-number">-27</span>,<span class="hljs-number">78.0</span>,<span class="hljs-number">400.0</span>,<span class="hljs-number">1.0</span>,<span class="hljs-number">12.0</span>,,,<span class="hljs-number">6.61574439992122</span>,<span class="hljs-number">0.1984723319976366</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-02</span><span class="hljs-number">-28</span>,<span class="hljs-number">250.0</span>,<span class="hljs-number">650.0</span>,<span class="hljs-number">5.0</span>,<span class="hljs-number">17.0</span>,,,<span class="hljs-number">10.750584649871982</span>,<span class="hljs-number">0.28116913699665186</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-02</span><span class="hljs-number">-29</span>,<span class="hljs-number">238.0</span>,<span class="hljs-number">888.0</span>,<span class="hljs-number">4.0</span>,<span class="hljs-number">21.0</span>,,,<span class="hljs-number">14.686952567825108</span>,<span class="hljs-number">0.34732658099586405</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-03</span><span class="hljs-number">-01</span>,<span class="hljs-number">240.0</span>,<span class="hljs-number">1128.0</span>,<span class="hljs-number">8.0</span>,<span class="hljs-number">29.0</span>,,,<span class="hljs-number">18.656399207777838</span>,<span class="hljs-number">0.47964146899428844</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-03</span><span class="hljs-number">-02</span>,<span class="hljs-number">561.0</span>,<span class="hljs-number">1689.0</span>,<span class="hljs-number">6.0</span>,<span class="hljs-number">35.0</span>,,,<span class="hljs-number">27.93498072866735</span>,<span class="hljs-number">0.5788776349931067</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-03</span><span class="hljs-number">-03</span>,<span class="hljs-number">347.0</span>,<span class="hljs-number">2036.0</span>,<span class="hljs-number">17.0</span>,<span class="hljs-number">52.0</span>,,,<span class="hljs-number">33.67413899559901</span>,<span class="hljs-number">0.8600467719897585</span>,
...
</code></pre>
<h3 id="heading-bonus-basic-plotting-with-pandas">Bonus: Basic Plotting with Pandas</h3>
<p>We generally use a library like <code>matplotlib</code> or <code>seaborn</code> to plot graphs within a Jupyter notebook. However, Pandas dataframes and series provide a handy <code>.plot</code> method for quick and easy plotting.</p>
<p>Let's plot a line graph showing how the number of daily cases varies over time.</p>
<pre><code class="lang-py">result_df.new_cases.plot();
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-137.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>While this plot shows the overall trend, it's hard to tell where the peak occurred, as there are no dates on the X-axis. We can use the <code>date</code> column as the index for the data frame to address this issue.</p>
<pre><code class="lang-py">result_df.set_index(<span class="hljs-string">'date'</span>, inplace=<span class="hljs-literal">True</span>)

result_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-138.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Notice that the index of a data frame doesn't have to be numeric. Using the date as the index also allows us to get the data for a specific date using <code>.loc</code>.</p>
<pre><code class="lang-py">result_df.loc[<span class="hljs-string">'2020-09-01'</span>]
<span class="hljs-comment"># new_cases             9.960000e+02</span>
<span class="hljs-comment"># total_cases           2.696595e+05</span>
<span class="hljs-comment"># new_deaths            6.000000e+00</span>
<span class="hljs-comment"># total_deaths          3.548300e+04</span>
<span class="hljs-comment"># new_tests             5.439500e+04</span>
<span class="hljs-comment"># total_tests           5.214766e+06</span>
<span class="hljs-comment"># cases_per_million     4.459996e+03</span>
<span class="hljs-comment"># deaths_per_million    5.868661e+02</span>
<span class="hljs-comment"># tests_per_million     8.624890e+04</span>
<span class="hljs-comment"># Name: 2020-09-01 00:00:00, dtype: float64</span>
</code></pre>
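<p>As a minimal, self-contained sketch of the same idea, the snippet below builds a tiny data frame, sets the date column as its index, and looks up a single day with <code>.loc</code>. The name <code>sample_df</code> and the numbers in it are assumptions made up for illustration, not values from the actual dataset.</p>

```python
import pandas as pd

# Hypothetical sample data (made-up numbers, not the real dataset)
sample_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-09-01', '2020-09-02', '2020-09-03']),
    'new_cases': [996.0, 975.0, 1326.0],
    'new_deaths': [6.0, 8.0, 6.0],
})

# Use the date column as the index; without inplace=True a new data frame is returned
indexed_df = sample_df.set_index('date')

# Look up the data for one date using its label
row = indexed_df.loc['2020-09-01']
print(row.new_cases)

# A row label and a column name together select a single value
cases_on_day = indexed_df.loc[pd.Timestamp('2020-09-02'), 'new_cases']
print(cases_on_day)
```

<p>Because <code>set_index</code> is called without <code>inplace=True</code> here, <code>sample_df</code> keeps its original numeric index, which can be handy while experimenting.</p>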
<p>Let's plot the new cases and new deaths per day as line graphs.</p>
<pre><code class="lang-py">result_df.new_cases.plot()
result_df.new_deaths.plot();
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-139.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can also compare the total cases vs. total deaths.</p>
<pre><code class="lang-py">result_df.total_cases.plot()
result_df.total_deaths.plot();
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-140.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Let's see how the death rate and positive testing rates vary over time.</p>
<pre><code class="lang-py">death_rate = result_df.total_deaths / result_df.total_cases

death_rate.plot(title=<span class="hljs-string">'Death Rate'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-141.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">positive_rates = result_df.total_cases / result_df.total_tests

positive_rates.plot(title=<span class="hljs-string">'Positive Rate'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-142.png" alt="Image" width="600" height="400" loading="lazy"></p>
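<p>Dividing one column by another works element by element, so each row gets its own rate. Here is a small sketch with made-up totals (the frame <code>totals_df</code> and its values are assumptions for illustration). It also suggests why a rate plot can start late: where the number of tests is missing, the computed rate is <code>NaN</code>.</p>

```python
import pandas as pd

# Hypothetical cumulative totals; the first day has no test count recorded
totals_df = pd.DataFrame({
    'total_cases': [100.0, 200.0, 400.0],
    'total_deaths': [2.0, 5.0, 8.0],
    'total_tests': [None, 4000.0, 10000.0],
})

# Element-wise division produces one rate per row
death_rate = totals_df.total_deaths / totals_df.total_cases
positive_rate = totals_df.total_cases / totals_df.total_tests

print(death_rate)
print(positive_rate)
```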
<p>Finally, let's plot some month-wise data using a bar chart to visualize the trend at a higher level.</p>
<pre><code class="lang-py">covid_month_df.new_cases.plot(kind=<span class="hljs-string">'bar'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-143.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">covid_month_df.new_tests.plot(kind=<span class="hljs-string">'bar'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-144.png" alt="Image" width="600" height="400" loading="lazy"></p>
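<p>The sketch below shows one way a month-wise series could be built from daily data and drawn as a bar chart. It is only an illustration under assumed data: <code>daily_df</code> and its numbers are hypothetical, and it is not necessarily how <code>covid_month_df</code> was constructed.</p>

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; omit this line inside a Jupyter notebook
import pandas as pd

# Hypothetical daily counts spanning two months (made-up numbers)
daily_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-30', '2020-03-31', '2020-04-01', '2020-04-02']),
    'new_cases': [10.0, 20.0, 30.0, 40.0],
})

# Extract the month number, then add up the daily values within each month
daily_df['month'] = daily_df.date.dt.month
monthly_cases = daily_df.groupby('month')['new_cases'].sum()

# Draw one bar per month
ax = monthly_cases.plot(kind='bar')
print(monthly_cases)
```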
<h3 id="heading-pandas-exercises">Pandas Exercises</h3>
<p>Try the following exercises to become familiar with Pandas dataframes and practice your skills:</p>
<ul>
<li><a target="_blank" href="https://jovian.ml/aakashns/pandas-practice-assignment">Assignment on Pandas dataframes</a></li>
<li><a target="_blank" href="https://github.com/guipsamora/pandas_exercises">Additional exercises on Pandas</a></li>
<li><a target="_blank" href="https://www.kaggle.com/datasets">Try downloading and analyzing some data from Kaggle</a></li>
</ul>
<h3 id="heading-summary-and-further-reading-1">Summary and Further Reading</h3>
<p>We've covered the following topics in this tutorial:</p>
<ul>
<li>How to read a CSV file into a Pandas data frame</li>
<li>How to retrieve data from Pandas data frames</li>
<li>How to query, sort, and analyze data</li>
<li>How to merge, group, and aggregate data</li>
<li>How to extract useful information from dates</li>
<li>Basic plotting using line and bar charts</li>
<li>How to write data frames to CSV files</li>
</ul>
<p>Check out the following resources to learn more about Pandas:</p>
<ul>
<li><a target="_blank" href="https://pandas.pydata.org/docs/user_guide/index.html">User guide for Pandas</a></li>
<li><a target="_blank" href="https://www.oreilly.com/library/view/python-for-data/9781491957653/">Python for Data Analysis (book by Wes McKinney - creator of Pandas)</a></li>
</ul>
<h3 id="heading-review-questions-to-check-your-comprehension-1">Review Questions to Check Your Comprehension</h3>
<p>Try answering the following questions to test your understanding of the topics covered in this notebook:</p>
<ol>
<li>What is Pandas? What makes it useful?</li>
<li>How do you install the Pandas library?</li>
<li>How do you import the <code>pandas</code> module?</li>
<li>What is the common alias used while importing the <code>pandas</code> module?</li>
<li>How do you read a CSV file using Pandas? Give an example.</li>
<li>What are some other file formats you can read using Pandas? Illustrate with examples.</li>
<li>What are Pandas dataframes?</li>
<li>How are Pandas dataframes different from Numpy arrays?</li>
<li>How do you find the number of rows and columns in a dataframe?</li>
<li>How do you get the list of columns in a dataframe?</li>
<li>What is the purpose of the <code>describe</code> method of a dataframe?</li>
<li>How are the <code>info</code> and <code>describe</code> dataframe methods different?</li>
<li>Is a Pandas dataframe conceptually similar to a list of dictionaries or a dictionary of lists? Explain with an example.</li>
<li>What is a Pandas <code>Series</code>? How is it different from a Numpy array?</li>
<li>How do you access a column from a dataframe?</li>
<li>How do you access a row from a dataframe?</li>
<li>How do you access an element at a specific row and column of a dataframe?</li>
<li>How do you create a subset of a dataframe with a specific set of columns?</li>
<li>How do you create a subset of a dataframe with a specific range of rows?</li>
<li>Does changing a value within a dataframe affect other dataframes created using a subset of the rows or columns? Why is it so?</li>
<li>How do you create a copy of a dataframe?</li>
<li>Why should you avoid creating too many copies of a dataframe?</li>
<li>How do you view the first few rows of a dataframe?</li>
<li>How do you view the last few rows of a dataframe?</li>
<li>How do you view a random selection of rows of a dataframe?</li>
<li>What is the "index" in a dataframe? How is it useful?</li>
<li>What does a <code>NaN</code> value in a Pandas dataframe represent?</li>
<li>How is <code>NaN</code> different from <code>0</code>?</li>
<li>How do you identify the first non-empty row in a Pandas series or column?</li>
<li>What is the difference between <code>df.loc</code> and <code>df.at</code>?</li>
<li>Where can you find a full list of methods supported by Pandas <code>DataFrame</code> and <code>Series</code> objects?</li>
<li>How do you find the sum of numbers in a column of a dataframe?</li>
<li>How do you find the mean of numbers in a column of a dataframe?</li>
<li>How do you find the number of non-empty numbers in a column of a dataframe?</li>
<li>What is the result obtained by using a Pandas column in a boolean expression? Illustrate with an example.</li>
<li>How do you select a subset of rows where a specific column's value meets a given condition? Illustrate with an example.</li>
<li>What is the result of the expression <code>df[df.new_cases &gt; 100]</code> ?</li>
<li>How do you display all the rows of a pandas dataframe in a Jupyter cell output?</li>
<li>What is the result obtained when you perform an arithmetic operation between two columns of a dataframe? Illustrate with an example.</li>
<li>How do you add a new column to a dataframe by combining values from two existing columns? Illustrate with an example.</li>
<li>How do you remove a column from a dataframe? Illustrate with an example.</li>
<li>What is the purpose of the <code>inplace</code> argument in dataframe methods?</li>
<li>How do you sort the rows of a dataframe based on the values in a particular column?</li>
<li>How do you sort a pandas dataframe using values from multiple columns?</li>
<li>How do you specify whether to sort by ascending or descending order while sorting a Pandas dataframe?</li>
<li>How do you change a specific value within a dataframe?</li>
<li>How do you convert a dataframe column to the <code>datetime</code> data type?</li>
<li>What are the benefits of using the <code>datetime</code> data type instead of <code>object</code>?</li>
<li>How do you extract different parts of a date column like the year, month, weekday, and so on into separate columns? Illustrate with an example.</li>
<li>How do you aggregate multiple columns of a dataframe together?</li>
<li>What is the purpose of the <code>groupby</code> method of a dataframe? Illustrate with an example.</li>
<li>What are the different ways in which you can aggregate the groups created by <code>groupby</code>?</li>
<li>What do you mean by a running or cumulative sum?</li>
<li>How do you create a new column containing the running or cumulative sum of another column?</li>
<li>What are other cumulative measures supported by Pandas dataframes?</li>
<li>What does it mean to merge two dataframes? Give an example.</li>
<li>How do you specify the columns that should be used for merging two dataframes?</li>
<li>How do you write data from a Pandas dataframe into a CSV file? Give an example.</li>
<li>What are some other file formats you can write to from a Pandas dataframe? Illustrate with examples.</li>
<li>How do you create a line plot showing the values within a column of a dataframe?</li>
<li>How do you convert a column of a dataframe into its index?</li>
<li>Can the index of a dataframe be non-numeric?</li>
<li>What are the benefits of using a non-numeric index in a dataframe? Illustrate with an example.</li>
<li>How do you create a bar plot showing the values within a column of a dataframe?</li>
<li>What are some other types of plots supported by Pandas dataframes and series?</li>
</ol>
<p>You are ready to move on to the next section of the tutorial.</p>
<h2 id="heading-data-visualization-using-python-matplotlib-and-seaborn">Data Visualization using Python, Matplotlib, and Seaborn</h2>
<p><img src="https://i.imgur.com/9i806Rh.png" alt="Image" width="2314" height="1092" loading="lazy"></p>
<p>Notebook link: <a target="_blank" href="https://jovian.ai/aakashns/python-matplotlib-data-visualization">https://jovian.ai/aakashns/python-matplotlib-data-visualization</a></p>
<p>Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers. </p>
<p>Visualizing data is an essential part of data analysis and machine learning. We'll use the Python libraries <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org">Matplotlib</a> and <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fseaborn.pydata.org">Seaborn</a> to learn and apply some popular data visualization techniques. We'll use the words <em>chart</em>, <em>plot</em>, and <em>graph</em> interchangeably in this tutorial.</p>
<p>To begin, let's install and import the libraries. We'll use the <code>matplotlib.pyplot</code> module for basic plots like line and bar charts. It is often imported with the alias <code>plt</code>. We'll use the <code>seaborn</code> module for more advanced plots. It is commonly imported with the alias <code>sns</code>.</p>
<pre><code class="lang-py">!pip install matplotlib seaborn --upgrade --quiet

<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
%matplotlib inline
</code></pre>
<p>Notice that we also include the special command <code>%matplotlib inline</code> to ensure that our plots are shown and embedded within the Jupyter notebook itself. Without this command, plots may sometimes show up in pop-up windows.</p>
<h3 id="heading-how-to-create-a-line-chart-in-python">How to Create a Line Chart in Python</h3>
<p>The line chart is one of the simplest and most widely used data visualization techniques. A line chart displays information as a series of data points or markers connected by straight lines. </p>
<p>You can customize the shape, size, color, and other aesthetic elements of the lines and markers for better visual clarity.</p>
<p>Here's a Python list showing the yield of apples (tons per hectare) over six years in an imaginary country called Kanto.</p>
<pre><code class="lang-py">yield_apples = [<span class="hljs-number">0.895</span>, <span class="hljs-number">0.91</span>, <span class="hljs-number">0.919</span>, <span class="hljs-number">0.926</span>, <span class="hljs-number">0.929</span>, <span class="hljs-number">0.931</span>]
</code></pre>
<p>We can visualize how the yield of apples changes over time using a line chart. To draw a line chart, we can use the <code>plt.plot</code> function.</p>
<pre><code class="lang-py">plt.plot(yield_apples)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-145.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Calling the <code>plt.plot</code> function draws the line chart as expected. It also returns a list of the lines drawn, <code>[&lt;matplotlib.lines.Line2D at 0x7ff70aa20760&gt;]</code>, which is shown in the cell output. We can include a semicolon (<code>;</code>) at the end of the last statement in the cell to avoid showing this output and display just the graph.</p>
<pre><code class="lang-py">plt.plot(yield_apples);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-146.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Let's enhance this plot step-by-step to make it more informative and beautiful.</p>
<h4 id="heading-how-to-customize-the-x-axis-in-matplotlib"><strong>How to Customize the X-axis in MatPlotLib</strong></h4>
<p>The X-axis of the plot currently shows list element indices 0 to 5. The plot would be more informative if we could display the year for which we're plotting the data. We can do this by passing two arguments to <code>plt.plot</code>: the X values (years) followed by the Y values (yields).</p>
<pre><code class="lang-py">years = [<span class="hljs-number">2010</span>, <span class="hljs-number">2011</span>, <span class="hljs-number">2012</span>, <span class="hljs-number">2013</span>, <span class="hljs-number">2014</span>, <span class="hljs-number">2015</span>]
yield_apples = [<span class="hljs-number">0.895</span>, <span class="hljs-number">0.91</span>, <span class="hljs-number">0.919</span>, <span class="hljs-number">0.926</span>, <span class="hljs-number">0.929</span>, <span class="hljs-number">0.931</span>]

plt.plot(years, yield_apples)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-147.png" alt="Image" width="600" height="400" loading="lazy"></p>
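<p>To see the difference between the two call styles, you can inspect the line objects that <code>plt.plot</code> returns. The following is a small sketch; the variable names <code>index_line</code> and <code>year_line</code> are only illustrative.</p>

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; omit this line inside a Jupyter notebook
import matplotlib.pyplot as plt

years = [2010, 2011, 2012, 2013, 2014, 2015]
yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

# With one argument, the values are plotted against their positions 0 to 5
index_line = plt.plot(yield_apples)[0]

# With two arguments, the first list supplies the X values
year_line = plt.plot(years, yield_apples)[0]

print(list(index_line.get_xdata()))
print(list(year_line.get_xdata()))
```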
<h4 id="heading-axis-labels-in-matplotlib"><strong>Axis Labels in MatPlotLib</strong></h4>
<p>We can add labels to the axes to show what each axis represents using the <code>plt.xlabel</code> and <code>plt.ylabel</code> methods.</p>
<pre><code class="lang-py">plt.plot(years, yield_apples)
plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-148.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-how-to-plot-multiple-lines-in-matplotlib"><strong>How to Plot Multiple Lines in MatPlotLib</strong></h4>
<p>You can invoke the <code>plt.plot</code> function once for each line to plot multiple lines in the same graph. Let's compare the yields of apples vs. oranges in Kanto.</p>
<pre><code class="lang-py">years = range(<span class="hljs-number">2000</span>, <span class="hljs-number">2012</span>)
apples = [<span class="hljs-number">0.895</span>, <span class="hljs-number">0.91</span>, <span class="hljs-number">0.919</span>, <span class="hljs-number">0.926</span>, <span class="hljs-number">0.929</span>, <span class="hljs-number">0.931</span>, <span class="hljs-number">0.934</span>, <span class="hljs-number">0.936</span>, <span class="hljs-number">0.937</span>, <span class="hljs-number">0.9375</span>, <span class="hljs-number">0.9372</span>, <span class="hljs-number">0.939</span>]
oranges = [<span class="hljs-number">0.962</span>, <span class="hljs-number">0.941</span>, <span class="hljs-number">0.930</span>, <span class="hljs-number">0.923</span>, <span class="hljs-number">0.918</span>, <span class="hljs-number">0.908</span>, <span class="hljs-number">0.907</span>, <span class="hljs-number">0.904</span>, <span class="hljs-number">0.901</span>, <span class="hljs-number">0.898</span>, <span class="hljs-number">0.9</span>, <span class="hljs-number">0.896</span>, ]

plt.plot(years, apples)
plt.plot(years, oranges)
plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-149.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-chart-title-and-legend-in-matplotlib"><strong>Chart Title and Legend in MatPlotLib</strong></h4>
<p>To differentiate between multiple lines, we can include a legend within the graph using the <code>plt.legend</code> function. We can also set a title for the chart using the <code>plt.title</code> function.</p>
<pre><code class="lang-py">plt.plot(years, apples)
plt.plot(years, oranges)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-150.png" alt="Image" width="600" height="400" loading="lazy"></p>
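<p>As an alternative sketch, each line can carry its own name through the <code>label</code> argument of <code>plt.plot</code>, after which <code>plt.legend</code> can be called without arguments. This is a variation on the approach above, not a replacement for it.</p>

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; omit this line inside a Jupyter notebook
import matplotlib.pyplot as plt

years = range(2000, 2012)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931, 0.934, 0.936, 0.937, 0.9375, 0.9372, 0.939]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908, 0.907, 0.904, 0.901, 0.898, 0.9, 0.896]

plt.figure()  # start from a fresh figure when running outside a notebook

# Attach a label to each line at the time it is drawn
plt.plot(years, apples, label='Apples')
plt.plot(years, oranges, label='Oranges')

plt.title("Crop Yields in Kanto")

# With no arguments, the legend is built from the labels given above
legend = plt.legend()
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```

<p>Keeping the label next to the data means the legend stays correct even if the order of the <code>plt.plot</code> calls changes later.</p>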
<h4 id="heading-how-to-use-line-markers-in-matplotlib"><strong>How to Use Line Markers in MatPlotLib</strong></h4>
<p>We can also show markers for the data points on each line using the <code>marker</code> argument of <code>plt.plot</code>. </p>
<p>Matplotlib provides many different markers like a circle, cross, square, diamond, and more. You can find the full list of marker types here: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.1.1%2Fapi%2Fmarkers_api.html">https://matplotlib.org/3.1.1/api/markers_api.html</a> .</p>
<pre><code class="lang-py">plt.plot(years, apples, marker=<span class="hljs-string">'o'</span>)
plt.plot(years, oranges, marker=<span class="hljs-string">'x'</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-151.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-how-to-style-lines-and-markers-in-matplotlib"><strong>How to Style Lines and Markers in MatPlotLib</strong></h4>
<p>The <code>plt.plot</code> function supports many arguments for styling lines and markers:</p>
<ul>
<li><code>color</code> or <code>c</code> – Set the color of the line (<a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.1.0%2Fgallery%2Fcolor%2Fnamed_colors.html">supported colors</a>)</li>
<li><code>linestyle</code> or <code>ls</code> – Choose between a solid or dashed line</li>
<li><code>linewidth</code> or <code>lw</code> – Set the width of a line</li>
<li><code>markersize</code> or <code>ms</code> – Set the size of markers</li>
<li><code>markeredgecolor</code> or <code>mec</code> – Set the edge color for markers</li>
<li><code>markeredgewidth</code> or <code>mew</code> – Set the edge width for markers</li>
<li><code>markerfacecolor</code> or <code>mfc</code> – Set the fill color for markers</li>
<li><code>alpha</code> – Opacity of the plot</li>
</ul>
<p>Check out the documentation for <code>plt.plot</code> to learn more: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2Fapi%2F_as_gen%2Fmatplotlib.pyplot.plot.html%23matplotlib.pyplot.plot">https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot</a> .</p>
<pre><code class="lang-py">plt.plot(years, apples, marker=<span class="hljs-string">'s'</span>, c=<span class="hljs-string">'b'</span>, ls=<span class="hljs-string">'-'</span>, lw=<span class="hljs-number">2</span>, ms=<span class="hljs-number">8</span>, mew=<span class="hljs-number">2</span>, mec=<span class="hljs-string">'navy'</span>)
plt.plot(years, oranges, marker=<span class="hljs-string">'o'</span>, c=<span class="hljs-string">'r'</span>, ls=<span class="hljs-string">'--'</span>, lw=<span class="hljs-number">3</span>, ms=<span class="hljs-number">10</span>, alpha=<span class="hljs-number">.5</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-152.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The <code>fmt</code> argument provides a shorthand for specifying the marker shape, line style, and line color. You can provide it as the third argument to <code>plt.plot</code>.</p>
<pre><code class="lang-py">fmt = <span class="hljs-string">'[marker][line][color]'</span>

plt.plot(years, apples, <span class="hljs-string">'s-b'</span>)
plt.plot(years, oranges, <span class="hljs-string">'o--r'</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-153.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>If you don't specify a line style in <code>fmt</code>, only markers are drawn.</p>
<pre><code class="lang-py">plt.plot(years, oranges, <span class="hljs-string">'or'</span>)
plt.title(<span class="hljs-string">"Yield of Oranges (tons per hectare)"</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-154.png" alt="Image" width="600" height="400" loading="lazy"></p>
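<p>The sketch below compares the <code>fmt</code> shorthand with the equivalent keyword arguments by reading the properties back from the returned line objects. It also checks a format string with no line style. The variable names are illustrative only.</p>

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; omit this line inside a Jupyter notebook
import matplotlib.pyplot as plt

years = range(2000, 2012)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931, 0.934, 0.936, 0.937, 0.9375, 0.9372, 0.939]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908, 0.907, 0.904, 0.901, 0.898, 0.9, 0.896]

plt.figure()  # start from a fresh figure when running outside a notebook

# Shorthand: square markers, solid line, blue
short_line = plt.plot(years, apples, 's-b')[0]

# The same styling spelled out with keyword arguments
long_line = plt.plot(years, apples, marker='s', linestyle='-', color='b')[0]

# A format string with a marker and a color but no line style
markers_only = plt.plot(years, oranges, 'or')[0]

print(short_line.get_marker(), short_line.get_linestyle())
print(long_line.get_marker(), long_line.get_linestyle())
print(markers_only.get_marker(), markers_only.get_linestyle())
```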
<h4 id="heading-how-to-change-the-figure-size-in-matplotlib"><strong>How to Change the Figure Size in MatPlotLib</strong></h4>
<p>You can use the <code>plt.figure</code> function to change the size of the figure.</p>
<pre><code class="lang-py">plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">6</span>))

plt.plot(years, oranges, <span class="hljs-string">'or'</span>)
plt.title(<span class="hljs-string">"Yield of Oranges (tons per hectare)"</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-155.png" alt="Image" width="600" height="400" loading="lazy"></p>
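<p>The <code>figsize</code> tuple is given as (width, height) in inches. As a quick sketch, you can keep the figure object returned by <code>plt.figure</code> and read its size back to confirm the setting.</p>

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; omit this line inside a Jupyter notebook
import matplotlib.pyplot as plt

years = range(2000, 2012)
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908, 0.907, 0.904, 0.901, 0.898, 0.9, 0.896]

# Create a figure that is 12 inches wide and 6 inches tall
fig = plt.figure(figsize=(12, 6))

plt.plot(years, oranges, 'or')
plt.title("Yield of Oranges (tons per hectare)")

# Read the size back from the figure object
width, height = fig.get_size_inches()
print(width, height)
```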
<h4 id="heading-how-to-improve-default-styles-using-seaborn"><strong>How to Improve Default Styles using Seaborn</strong></h4>
<p>An easy way to make your charts look beautiful is to use some default styles from the Seaborn library. You can apply them globally using the <code>sns.set_style</code> function. You can see a full list of predefined styles here: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fseaborn.pydata.org%2Fgenerated%2Fseaborn.set_style.html">https://seaborn.pydata.org/generated/seaborn.set_style.html</a> .</p>
<pre><code class="lang-py">sns.set_style(<span class="hljs-string">"whitegrid"</span>)
plt.plot(years, apples, <span class="hljs-string">'s-b'</span>)
plt.plot(years, oranges, <span class="hljs-string">'o--r'</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-156.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">sns.set_style(<span class="hljs-string">"darkgrid"</span>)

plt.plot(years, apples, <span class="hljs-string">'s-b'</span>)
plt.plot(years, oranges, <span class="hljs-string">'o--r'</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-157.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">plt.plot(years, oranges, <span class="hljs-string">'or'</span>)
plt.title(<span class="hljs-string">"Yield of Oranges (tons per hectare)"</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-158.png" alt="Image" width="600" height="400" loading="lazy"></p>
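<p>Since <code>sns.set_style</code> changes the style for every later chart, it can be useful to apply a style to just one plot. A minimal sketch, assuming you want the change to be temporary, is to use <code>sns.axes_style</code> as a context manager. Axes created inside the <code>with</code> block pick up the style, and the previous settings are restored afterwards.</p>

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; omit this line inside a Jupyter notebook
import matplotlib.pyplot as plt
import seaborn as sns

years = range(2000, 2012)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931, 0.934, 0.936, 0.937, 0.9375, 0.9372, 0.939]

# Remember the current grid setting so we can compare later
grid_before = plt.rcParams['axes.grid']

with sns.axes_style("darkgrid"):
    # The style is active only inside this block
    grid_inside = plt.rcParams['axes.grid']
    plt.figure()
    plt.plot(years, apples, 's-b')
    plt.title("Crop Yields in Kanto")

# Outside the block the earlier setting is back
grid_after = plt.rcParams['axes.grid']
print(grid_before, grid_inside, grid_after)
```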
<p>You can also edit default styles directly by modifying the <code>matplotlib.rcParams</code> dictionary. Learn more: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.2.1%2Ftutorials%2Fintroductory%2Fcustomizing.html%23matplotlib-rcparams">https://matplotlib.org/3.2.1/tutorials/introductory/customizing.html#matplotlib-rcparams</a> .</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> matplotlib

matplotlib.rcParams[<span class="hljs-string">'font.size'</span>] = <span class="hljs-number">14</span>
matplotlib.rcParams[<span class="hljs-string">'figure.figsize'</span>] = (<span class="hljs-number">9</span>, <span class="hljs-number">5</span>)
matplotlib.rcParams[<span class="hljs-string">'figure.facecolor'</span>] = <span class="hljs-string">'#00000000'</span>
</code></pre>
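<p>Changes made to <code>matplotlib.rcParams</code> stay in effect for the rest of the session. If you only want to override the defaults for a few charts, one option is <code>matplotlib.rc_context</code>, sketched below with the same font size and figure size as above. The variable names are illustrative.</p>

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; omit this line inside a Jupyter notebook
import matplotlib.pyplot as plt

# Remember the current default so we can confirm it is restored
default_font_size = matplotlib.rcParams['font.size']

# Override two defaults only for the duration of the with block
with matplotlib.rc_context({'font.size': 14, 'figure.figsize': (9, 5)}):
    inside_font_size = matplotlib.rcParams['font.size']
    fig = plt.figure()
    inside_figsize = tuple(fig.get_size_inches())

restored_font_size = matplotlib.rcParams['font.size']
print(inside_font_size, inside_figsize, restored_font_size)
```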
<h3 id="heading-scatter-plots-in-matplotlib">Scatter Plots <strong>in MatPlotLib</strong></h3>
<p>In a scatter plot, the values of 2 variables are plotted as points on a 2-dimensional grid. Additionally, you can also use a third variable to determine the size or color of the points. Let's try out an example.</p>
<p>The <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FIris_flower_data_set">Iris flower dataset</a> provides sample measurements of sepals and petals for three species of flowers. The Iris dataset is included with the Seaborn library and you can load it as a Pandas data frame.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Load data into a Pandas dataframe</span>
flowers_df = sns.load_dataset(<span class="hljs-string">"iris"</span>)

flowers_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-159.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">flowers_df.species.unique()
<span class="hljs-comment"># array(['setosa', 'versicolor', 'virginica'], dtype=object)</span>
</code></pre>
<p>Let's try to visualize the relationship between sepal length and sepal width. Our first instinct might be to create a line chart using <code>plt.plot</code>.</p>
<pre><code class="lang-py">plt.plot(flowers_df.sepal_length, flowers_df.sepal_width);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-160.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The output is not very informative as there are too many combinations of the two properties within the dataset. There doesn't seem to be a simple relationship between them.</p>
<p>We can use a scatter plot to visualize how sepal length and sepal width vary using the <code>scatterplot</code> function from the <code>seaborn</code> module (imported as <code>sns</code>).</p>
<pre><code class="lang-py">sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-161.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-how-to-add-hues-in-matplotlib"><strong>How to Add Hues in MatPlotLib</strong></h4>
<p>Notice how the points in the above plot seem to form distinct clusters with some outliers. We can color the dots using the flower species as a <code>hue</code>. We can also make the points larger using the <code>s</code> argument.</p>
<pre><code class="lang-py">sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width, hue=flowers_df.species, s=<span class="hljs-number">100</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-162.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Adding hues makes the plot more informative. We can immediately tell that Setosa irises have a smaller sepal length but higher sepal widths. In contrast, the opposite is true for Virginica irises.</p>
<h4 id="heading-how-to-customize-seaborn-figures"><strong>How to Customize Seaborn Figures</strong></h4>
<p>Since Seaborn uses Matplotlib's plotting functions internally, we can use functions like <code>plt.figure</code> and <code>plt.title</code> to modify the figure.</p>
<pre><code class="lang-py">plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">6</span>))
plt.title(<span class="hljs-string">'Sepal Dimensions'</span>)

sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species,
                s=<span class="hljs-number">100</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-163.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-how-to-plot-data-using-pandas-data-frames-with-seaborn"><strong>How to Plot Data using Pandas Data Frames with Seaborn</strong></h4>
<p>Seaborn has built-in support for Pandas data frames. Instead of passing each column as a series, you can provide column names and use the <code>data</code> argument to specify a data frame.</p>
<pre><code class="lang-py">plt.title(<span class="hljs-string">'Sepal Dimensions'</span>)
sns.scatterplot(x=<span class="hljs-string">'sepal_length'</span>, 
                y=<span class="hljs-string">'sepal_width'</span>, 
                hue=<span class="hljs-string">'species'</span>,
                s=<span class="hljs-number">100</span>,
                data=flowers_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-164.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-histograms-in-matplotlib">Histograms <strong>in MatPlotLib</strong></h3>
<p>A histogram represents the distribution of a variable by creating bins (intervals) along the range of values and showing vertical bars to indicate the number of observations in each bin.</p>
<p>For example, let's visualize the distribution of values of sepal width in the Iris dataset. We can use the <code>plt.hist</code> function to create a histogram.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Load data into a Pandas dataframe</span>
flowers_df = sns.load_dataset(<span class="hljs-string">"iris"</span>)

flowers_df.sepal_width
<span class="hljs-comment"># 0      3.5</span>
<span class="hljs-comment"># 1      3.0</span>
<span class="hljs-comment"># 2      3.2</span>
<span class="hljs-comment"># 3      3.1</span>
<span class="hljs-comment"># 4      3.6</span>
<span class="hljs-comment">#       ... </span>
<span class="hljs-comment"># 145    3.0</span>
<span class="hljs-comment"># 146    2.5</span>
<span class="hljs-comment"># 147    3.0</span>
<span class="hljs-comment"># 148    3.4</span>
<span class="hljs-comment"># 149    3.0</span>
<span class="hljs-comment"># Name: sepal_width, Length: 150, dtype: float64</span>
</code></pre>
<pre><code class="lang-py">plt.title(<span class="hljs-string">"Distribution of Sepal Width"</span>)
plt.hist(flowers_df.sepal_width);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-165.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can immediately see that the sepal widths lie in the range 2.0 - 4.5, and around 35 values are in the range 2.9 - 3.1, which seems to be the most populous bin.</p>
<h4 id="heading-how-to-control-the-size-and-number-of-bins"><strong>How to Control the Size and Number of Bins</strong></h4>
<p>We can control the number of bins or the size of each one using the <code>bins</code> argument.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Specifying the number of bins</span>
plt.hist(flowers_df.sepal_width, bins=<span class="hljs-number">5</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-166.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Specifying the boundaries of each bin</span>
plt.hist(flowers_df.sepal_width, bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>));
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-167.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py"><span class="hljs-comment"># Bins of unequal sizes</span>
plt.hist(flowers_df.sepal_width, bins=[<span class="hljs-number">1</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">4.5</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-168.png" alt="Image" width="600" height="400" loading="lazy"></p>
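<p>To see what the three forms of the <code>bins</code> argument do without drawing anything, here is a minimal sketch using <code>np.histogram</code>, which accepts the same <code>bins</code> values as <code>plt.hist</code>. The <code>widths</code> array is a small hypothetical sample, not the Iris data.</p>

```python
import numpy as np

# Hypothetical sample standing in for flowers_df.sepal_width
widths = np.array([2.0, 2.3, 2.9, 3.0, 3.0, 3.1, 3.4, 3.5, 3.8, 4.4])

# An integer asks for that many equal-width bins across the data range
counts, edges = np.histogram(widths, bins=5)

# np.arange excludes the stop value, so the last edge here is 4.75, not 5
quarter_edges = np.arange(2, 5, 0.25)
quarter_counts, _ = np.histogram(widths, bins=quarter_edges)

# Explicit edges may be unevenly spaced; n edges define n - 1 bins
uneven_counts, _ = np.histogram(widths, bins=[1, 3, 4, 4.5])

print(len(edges))                             # 6 edges for 5 bins
print(len(quarter_edges), quarter_edges[-1])  # 12 4.75
print(uneven_counts)                          # [3 6 1]
```

<p>Any value that falls outside the outermost edges is left out of the counts, so it is worth checking that the edges cover the full range of the data.</p>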
<h4 id="heading-how-to-manage-multiple-histograms-in-matplotlib"><strong>How to Manage Multiple Histograms in MatPlotLib</strong></h4>
<p>Similar to line charts, we can draw multiple histograms in a single chart. We can reduce each histogram's opacity so that one histogram's bars don't hide the others'.</p>
<p>Let's draw separate histograms for each species of flowers.</p>
<pre><code class="lang-py">setosa_df = flowers_df[flowers_df.species == <span class="hljs-string">'setosa'</span>]
versicolor_df = flowers_df[flowers_df.species == <span class="hljs-string">'versicolor'</span>]
virginica_df = flowers_df[flowers_df.species == <span class="hljs-string">'virginica'</span>]

plt.hist(setosa_df.sepal_width, alpha=<span class="hljs-number">0.4</span>, bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>));
plt.hist(versicolor_df.sepal_width, alpha=<span class="hljs-number">0.4</span>, bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>));
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-169.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can also stack multiple histograms on top of one another.</p>
<pre><code class="lang-py">plt.title(<span class="hljs-string">'Distribution of Sepal Width'</span>)

plt.hist([setosa_df.sepal_width, versicolor_df.sepal_width, virginica_df.sepal_width], 
         bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>), 
         stacked=<span class="hljs-literal">True</span>);

plt.legend([<span class="hljs-string">'Setosa'</span>, <span class="hljs-string">'Versicolor'</span>, <span class="hljs-string">'Virginica'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-170.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-bar-charts-in-matplotlib">Bar Charts <strong>in MatPlotLib</strong></h3>
<p>Bar charts are quite similar to line charts: they show a sequence of values. However, a bar is shown for each value, rather than points connected by lines. We can use the <code>plt.bar</code> function to draw a bar chart.</p>
<pre><code class="lang-py">years = range(<span class="hljs-number">2000</span>, <span class="hljs-number">2006</span>)
apples = [<span class="hljs-number">0.35</span>, <span class="hljs-number">0.6</span>, <span class="hljs-number">0.9</span>, <span class="hljs-number">0.8</span>, <span class="hljs-number">0.65</span>, <span class="hljs-number">0.8</span>]
oranges = [<span class="hljs-number">0.4</span>, <span class="hljs-number">0.8</span>, <span class="hljs-number">0.9</span>, <span class="hljs-number">0.7</span>, <span class="hljs-number">0.6</span>, <span class="hljs-number">0.8</span>]

plt.bar(years, oranges);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-171.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Like histograms, we can stack bars on top of one another. We use the <code>bottom</code> argument of <code>plt.bar</code> to achieve this.</p>
<pre><code class="lang-py">plt.bar(years, apples)
plt.bar(years, oranges, bottom=apples);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-172.png" alt="Image" width="600" height="400" loading="lazy"></p>
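<p>Stacking more than two series follows the same pattern: each layer's <code>bottom</code> is the running total of the layers beneath it. The sketch below assumes a hypothetical third series, <code>pears</code>, and converts the data to NumPy arrays so that <code>+</code> adds element by element (adding two Python lists would concatenate them instead). The <code>Agg</code> backend line is only there so the snippet runs outside a notebook.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

years = range(2000, 2006)
apples = np.array([0.35, 0.6, 0.9, 0.8, 0.65, 0.8])
oranges = np.array([0.4, 0.8, 0.9, 0.7, 0.6, 0.8])
pears = np.array([0.2, 0.3, 0.25, 0.4, 0.35, 0.3])  # hypothetical third series

fig, ax = plt.subplots()
ax.bar(years, apples, label='Apples')
ax.bar(years, oranges, bottom=apples, label='Oranges')
# Each new layer starts where the layers below it end
top_layer = ax.bar(years, pears, bottom=apples + oranges, label='Pears')
ax.legend()
```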
<h4 id="heading-bar-plots-with-averages-in-seaborn"><strong>Bar Plots with Averages in Seaborn</strong></h4>
<p>Let's look at another sample dataset included with Seaborn called <code>tips</code>. The dataset contains information about the sex, time of day, total bill, and tip amount for customers visiting a restaurant over a week.</p>
<pre><code class="lang-py">tips_df = sns.load_dataset(<span class="hljs-string">"tips"</span>);

tips_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-173.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We might want to draw a bar chart to visualize how the average bill amount varies across different days of the week. One way to do this would be to compute the day-wise averages and then use <code>plt.bar</code> (try it as an exercise).</p>
<p>However, since this is a very common use case, the Seaborn library provides a <code>barplot</code> function which can automatically compute averages.</p>
<pre><code class="lang-py">sns.barplot(x=<span class="hljs-string">'day'</span>, y=<span class="hljs-string">'total_bill'</span>, data=tips_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-174.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The lines cutting each bar are error bars: by default, Seaborn draws a 95% confidence interval around each average. For instance, it seems like the variation in the total bill is relatively high on Fridays and low on Saturdays.</p>
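<p>The bar heights themselves are plain per-group averages, which can be reproduced with a Pandas <code>groupby</code>. This is a minimal sketch on a hypothetical six-row frame, <code>sample_df</code>, that mimics only the two <code>tips_df</code> columns used in the plot.</p>

```python
import pandas as pd

# Hypothetical miniature of tips_df with only the columns needed here
sample_df = pd.DataFrame({
    'day': ['Thur', 'Thur', 'Fri', 'Fri', 'Sat', 'Sat'],
    'total_bill': [10.0, 20.0, 12.0, 18.0, 25.0, 35.0],
})

# Average bill per day: the quantity sns.barplot uses for each bar's height
day_means = sample_df.groupby('day')['total_bill'].mean()
print(day_means)
```

<p>The resulting series could then be drawn with <code>plt.bar(day_means.index, day_means.values)</code>, although that version would not include the error bars.</p>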
<p>We can also specify a <code>hue</code> argument to compare bar plots side-by-side based on a third feature, for example sex.</p>
<pre><code class="lang-py">sns.barplot(x=<span class="hljs-string">'day'</span>, y=<span class="hljs-string">'total_bill'</span>, hue=<span class="hljs-string">'sex'</span>, data=tips_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-175.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can make the bars horizontal simply by switching the axes.</p>
<pre><code class="lang-py">sns.barplot(x=<span class="hljs-string">'total_bill'</span>, y=<span class="hljs-string">'day'</span>, hue=<span class="hljs-string">'sex'</span>, data=tips_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-176.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-heatmaps-in-seaborn">Heatmaps in Seaborn</h3>
<p>A heatmap is used to visualize 2-dimensional data like a matrix or a table using colors. The best way to understand it is by looking at an example. </p>
<p>We'll use another sample dataset from Seaborn, called <code>flights</code>, to visualize monthly passenger footfall at an airport over 12 years.</p>
<pre><code class="lang-py">flights_df = sns.load_dataset(<span class="hljs-string">"flights"</span>).pivot(index=<span class="hljs-string">"month"</span>, columns=<span class="hljs-string">"year"</span>, values=<span class="hljs-string">"passengers"</span>)

flights_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-177.png" alt="Image" width="600" height="400" loading="lazy"></p>
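<p>The <code>pivot</code> call reshapes long-form data (one row per observation) into a wide table. Here is a small sketch of that reshaping on a hypothetical four-row frame, <code>long_df</code>, using the same column names as the <code>flights</code> dataset.</p>

```python
import pandas as pd

# Hypothetical long-form data: one row per (month, year) observation
long_df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'year': [1949, 1949, 1950, 1950],
    'passengers': [112, 118, 115, 126],
})

# index -> row labels, columns -> column labels, values -> cell contents
wide_df = long_df.pivot(index='month', columns='year', values='passengers')
print(wide_df.shape)              # (2, 2)
print(wide_df.loc['Jan', 1950])   # 115
```

<p>Passing the arguments by keyword keeps the call readable and works on recent Pandas releases, which no longer accept them positionally.</p>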
<p><code>flights_df</code> is a matrix with one row for each month and one column for each year. The values show the number of passengers (in thousands) that visited the airport in a specific month of a year. We can use the <code>sns.heatmap</code> function to visualize the footfall at the airport.</p>
<pre><code class="lang-py">plt.title(<span class="hljs-string">"No. of Passengers (1000s)"</span>)
sns.heatmap(flights_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-178.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The brighter colors indicate a higher footfall at the airport. By looking at the graph, we can infer two things:</p>
<ul>
<li>The footfall at the airport in any given year tends to be the highest around July and August.</li>
<li>The footfall at the airport in any given month tends to grow year by year.</li>
</ul>
<p>We can also display the actual values in each block by specifying <code>annot=True</code>, and use the <code>cmap</code> argument to change the color palette.</p>
<pre><code class="lang-py">plt.title(<span class="hljs-string">"No. of Passengers (1000s)"</span>)
sns.heatmap(flights_df, fmt=<span class="hljs-string">"d"</span>, annot=<span class="hljs-literal">True</span>, cmap=<span class="hljs-string">'Blues'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-179.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-images-in-matplotlib">Images <strong>in MatPlotLib</strong></h3>
<p>We can also use Matplotlib to display images. Let's download an image from the internet.</p>
<pre><code class="lang-py"><span class="hljs-keyword">from</span> urllib.request <span class="hljs-keyword">import</span> urlretrieve

urlretrieve(<span class="hljs-string">'https://i.imgur.com/SkPbq.jpg'</span>, <span class="hljs-string">'chart.jpg'</span>);
</code></pre>
<p>Before we can display an image, we have to read it into memory using the <code>PIL</code> module.</p>
<pre><code class="lang-py"><span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image

img = Image.open(<span class="hljs-string">'chart.jpg'</span>)
</code></pre>
<p>An image loaded using PIL can be converted into a 3-dimensional NumPy array containing pixel intensities for the red, green &amp; blue (RGB) channels of the image. We can do the conversion using <code>np.array</code>.</p>
<pre><code class="lang-py">img_array = np.array(img)

img_array.shape
<span class="hljs-comment"># (481, 640, 3)</span>
</code></pre>
<p>We can display the PIL image using <code>plt.imshow</code>.</p>
<pre><code class="lang-py">plt.imshow(img);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-180.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can turn off the axes &amp; grid lines and show a title using the relevant functions.</p>
<pre><code class="lang-py">plt.grid(<span class="hljs-literal">False</span>)
plt.title(<span class="hljs-string">'A data science meme'</span>)
plt.axis(<span class="hljs-string">'off'</span>)
plt.imshow(img);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-181.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>To display a part of the image, we can simply select a slice from the numpy array.</p>
<pre><code class="lang-py">plt.grid(<span class="hljs-literal">False</span>)
plt.axis(<span class="hljs-string">'off'</span>)
plt.imshow(img_array[<span class="hljs-number">125</span>:<span class="hljs-number">325</span>,<span class="hljs-number">105</span>:<span class="hljs-number">305</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-182.png" alt="Image" width="600" height="400" loading="lazy"></p>
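<p>The slice reads as rows first, then columns, with the color channel axis left untouched. The sketch below uses a blank array of the same shape as <code>img_array</code> (an assumption, so that it runs without downloading the image) to show the shapes involved.</p>

```python
import numpy as np

# Hypothetical stand-in for img_array: a blank 481 x 640 RGB image
img_array = np.zeros((481, 640, 3), dtype=np.uint8)

# First axis is rows (height), second is columns (width), third is the color channel
crop = img_array[125:325, 105:305]
print(crop.shape)   # (200, 200, 3)

# Selecting a single channel drops the last axis
red = img_array[:, :, 0]
print(red.shape)    # (481, 640)
```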
<h3 id="heading-how-to-plot-multiple-charts-in-a-grid-in-matplotlib-and-seaborn">How to Plot Multiple Charts in a Grid <strong>in MatPlotLib and Seaborn</strong></h3>
<p>Matplotlib and Seaborn also support plotting multiple charts in a grid, using <code>plt.subplots</code>, which returns a set of axes for plotting.</p>
<p>Here's a single grid showing the different types of charts we've covered in this tutorial.</p>
<pre><code class="lang-py">fig, axes = plt.subplots(<span class="hljs-number">2</span>, <span class="hljs-number">3</span>, figsize=(<span class="hljs-number">16</span>, <span class="hljs-number">8</span>))

<span class="hljs-comment"># Use the axes for plotting</span>
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].plot(years, apples, <span class="hljs-string">'s-b'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].plot(years, oranges, <span class="hljs-string">'o--r'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_xlabel(<span class="hljs-string">'Year'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_title(<span class="hljs-string">'Crop Yields in Kanto'</span>)


<span class="hljs-comment"># Pass the axes into seaborn</span>
axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>].set_title(<span class="hljs-string">'Sepal Length vs. Sepal Width'</span>)
sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species, 
                s=<span class="hljs-number">100</span>, 
                ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>]);

<span class="hljs-comment"># Use the axes for plotting</span>
axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>].set_title(<span class="hljs-string">'Distribution of Sepal Width'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>].hist([setosa_df.sepal_width, versicolor_df.sepal_width, virginica_df.sepal_width], 
         bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>), 
         stacked=<span class="hljs-literal">True</span>);

axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>].legend([<span class="hljs-string">'Setosa'</span>, <span class="hljs-string">'Versicolor'</span>, <span class="hljs-string">'Virginica'</span>]);

<span class="hljs-comment"># Pass the axes into seaborn</span>
axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>].set_title(<span class="hljs-string">'Restaurant bills'</span>)
sns.barplot(x=<span class="hljs-string">'day'</span>, y=<span class="hljs-string">'total_bill'</span>, hue=<span class="hljs-string">'sex'</span>, data=tips_df, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>]);

<span class="hljs-comment"># Pass the axes into seaborn</span>
axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>].set_title(<span class="hljs-string">'Flight traffic'</span>)
sns.heatmap(flights_df, cmap=<span class="hljs-string">'Blues'</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>]);

<span class="hljs-comment"># Plot an image using the axes</span>
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].set_title(<span class="hljs-string">'Data Science Meme'</span>)
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].imshow(img)
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].grid(<span class="hljs-literal">False</span>)
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].set_xticks([])
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].set_yticks([])

plt.tight_layout(pad=<span class="hljs-number">2</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-183.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>See this page for a full list of supported functions: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.3.1%2Fapi%2Faxes_api.html%23the-axes-class">https://matplotlib.org/3.3.1/api/axes_api.html#the-axes-class</a>.</p>
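<p>The <code>axes</code> object returned by <code>plt.subplots</code> is a NumPy array of <code>Axes</code>, which is why it can be indexed as <code>axes[row, col]</code>. As a minimal sketch, it can also be iterated in row-major order with <code>axes.flat</code>. The panel titles are placeholders, and the <code>Agg</code> backend line is only needed outside a notebook.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(16, 8))
print(axes.shape)   # (2, 3)

# axes.flat visits all six Axes row by row
for i, ax in enumerate(axes.flat):
    ax.set_title(f'Panel {i}')  # placeholder titles

plt.tight_layout(pad=2)
```

<p>Note that a single-row grid such as <code>plt.subplots(1, 3)</code> returns a 1-dimensional array by default, so its panels are indexed as <code>axes[0]</code>, <code>axes[1]</code>, and <code>axes[2]</code>.</p>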
<h4 id="heading-pair-plots-with-seaborn"><strong>Pair Plots with Seaborn</strong></h4>
<p>Seaborn also provides a helper function <code>sns.pairplot</code> to automatically plot several different charts for pairs of features within a dataframe.</p>
<pre><code class="lang-py">sns.pairplot(flowers_df, hue=<span class="hljs-string">'species'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-184.png" alt="Image" width="600" height="400" loading="lazy">
<em>See the full output <a target="_blank" href="https://jovian.ai/embed?url=https://jovian.ai/aakashns/python-matplotlib-data-visualization/">here</a>.</em></p>
<pre><code class="lang-py">sns.pairplot(tips_df, hue=<span class="hljs-string">'sex'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-185.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-summary-and-further-reading-2">Summary and Further Reading</h3>
<p>We have covered the following topics in this tutorial:</p>
<ul>
<li>How to create and customize line charts using Matplotlib</li>
<li>How to visualize relationships between two or more variables using scatter plots</li>
<li>How to study distributions of variables using histograms and bar charts</li>
<li>How to visualize two-dimensional data using heatmaps</li>
<li>How to display images using Matplotlib's <code>plt.imshow</code></li>
<li>How to plot multiple Matplotlib and Seaborn charts in a grid</li>
</ul>
<p>In this tutorial we've covered some of the fundamental concepts and popular techniques for data visualization using Matplotlib and Seaborn. Data visualization is a vast field and we've barely scratched the surface here. Check out these references to learn and discover more:</p>
<ul>
<li>Data Visualization cheat sheet: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fjovian.ml%2Faakashns%2Fdataviz-cheatsheet">https://jovian.ml/aakashns/dataviz-cheatsheet</a></li>
<li>Seaborn gallery: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fseaborn.pydata.org%2Fexamples%2Findex.html">https://seaborn.pydata.org/examples/index.html</a></li>
<li>Matplotlib gallery: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.1.1%2Fgallery%2Findex.html">https://matplotlib.org/3.1.1/gallery/index.html</a></li>
<li>Matplotlib tutorial: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fgithub.com%2Frougier%2Fmatplotlib-tutorial">https://github.com/rougier/matplotlib-tutorial</a></li>
</ul>
<h3 id="heading-review-questions-to-check-your-comprehension-2">Review Questions to Check Your Comprehension</h3>
<p>Try answering the following questions to test your understanding of the topics covered in this notebook:</p>
<ol>
<li>What is data visualization?</li>
<li>What is Matplotlib?</li>
<li>What is Seaborn?</li>
<li>How do you install Matplotlib and Seaborn?</li>
<li>How do you import Matplotlib and Seaborn? What are the common aliases used while importing these modules?</li>
<li>What is the purpose of the magic command <code>%matplotlib inline</code>?</li>
<li>What is a line chart?</li>
<li>How do you plot a line chart in Python? Illustrate with an example.</li>
<li>How do you specify values for the X-axis of a line chart?</li>
<li>How do you specify labels for the axes of a chart?</li>
<li>How do you plot multiple line charts on the same axes?</li>
<li>How do you show a legend for a line chart with multiple lines?</li>
<li>How do you set a title for a chart?</li>
<li>How do you show markers on a line chart?</li>
<li>What are the different options for styling lines and markers in line charts? Illustrate with examples.</li>
<li>What is the purpose of the <code>fmt</code> argument to <code>plt.plot</code>?</li>
<li>Where can you see a list of all the arguments accepted by <code>plt.plot</code>?</li>
<li>How do you change the size of the figure using Matplotlib?</li>
<li>How do you apply the default styles from Seaborn globally for all charts?</li>
<li>What are the predefined styles available in Seaborn? Illustrate with examples.</li>
<li>What is a scatter plot?</li>
<li>How is a scatter plot different from a line chart?</li>
<li>How do you draw a scatter plot using Seaborn? Illustrate with an example.</li>
<li>How do you decide when to use a scatter plot vs a line chart?</li>
<li>How do you specify the colors for dots on a scatter plot using a categorical variable?</li>
<li>How do you customize the title, figure size, legend, and so on for Seaborn plots?</li>
<li>How do you use a Pandas dataframe with <code>sns.scatterplot</code>?</li>
<li>What is a histogram?</li>
<li>When should you use a histogram vs a line chart?</li>
<li>How do you draw a histogram using Matplotlib? Illustrate with an example.</li>
<li>What are "bins" in a histogram?</li>
<li>How do you change the sizes of bins in a histogram?</li>
<li>How do you change the number of bins in a histogram?</li>
<li>How do you show multiple histograms on the same axes?</li>
<li>How do you stack multiple histograms on top of one another?</li>
<li>What is a bar chart?</li>
<li>How do you draw a bar chart using Matplotlib? Illustrate with an example.</li>
<li>What is the difference between a bar chart and a histogram?</li>
<li>What is the difference between a bar chart and a line chart?</li>
<li>How do you stack bars on top of one another?</li>
<li>What is the difference between <code>plt.bar</code> and <code>sns.barplot</code>?</li>
<li>What do the lines cutting the bars in a Seaborn bar plot represent?</li>
<li>How do you show bar plots side-by-side?</li>
<li>How do you draw a horizontal bar plot?</li>
<li>What is a heat map?</li>
<li>What type of data is best visualized with a heat map?</li>
<li>What does the <code>pivot</code> method of a Pandas dataframe do?</li>
<li>How do you draw a heat map using Seaborn? Illustrate with an example.</li>
<li>How do you change the color scheme of a heat map?</li>
<li>How do you show the original values from the dataset on a heat map?</li>
<li>How do you download images from a URL in Python?</li>
<li>How do you open an image for processing in Python?</li>
<li>What is the purpose of the <code>PIL</code> module in Python?</li>
<li>How do you convert an image loaded using PIL into a Numpy array?</li>
<li>How many dimensions does a Numpy array for an image have? What does each dimension represent?</li>
<li>What are "color channels" in an image?</li>
<li>What is RGB?</li>
<li>How do you display an image using Matplotlib?</li>
<li>How do you turn off the axes and gridlines in a chart?</li>
<li>How do you display a portion of an image using Matplotlib?</li>
<li>How do you plot multiple charts in a grid using Matplotlib and Seaborn? Illustrate with examples.</li>
<li>What is the purpose of the <code>plt.subplots</code> function?</li>
<li>What are pair plots in Seaborn? Illustrate with an example.</li>
<li>How do you export a plot into a PNG image file using Matplotlib?</li>
<li>Where can you learn about the different types of charts you can create using Matplotlib and Seaborn?</li>
</ol>
<p>Congratulations on making it to the end of this tutorial! You can now apply these skills to analyze real world datasets from sources like <a target="_blank" href="https://kaggle.com/datasets">Kaggle</a>. </p>
<p>If you're pursuing a career in data science and machine learning, consider joining the <a target="_blank" href="https://zerotodatascience.com">Zero to Data Science Bootcamp by Jovian</a>. It's a 20-week part-time program where you'll complete 7 courses, 12 coding assignments, and 4 real-world projects. You will also receive 6 months of career support to help you find your first data science job.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.jovian.ai/zero-to-data-science-bootcamp">https://www.jovian.ai/zero-to-data-science-bootcamp</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Python NumPy Crash Course – How to Build N-Dimensional Arrays for Machine Learning ]]>
                </title>
                <description>
                    <![CDATA[ NumPy is a Python library for performing large scale numerical computations. It is extremely useful, especially in machine learning. Let's look at what NumPy has to offer. Introduction to NumPy NumPy is a Python library used to perform numerical comp... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/numpy-crash-course-build-powerful-n-d-arrays-with-numpy/</link>
                <guid isPermaLink="false">66d0360912c679876b0602e7</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ numpy ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Tue, 22 Sep 2020 17:34:36 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2020/09/numpy-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>NumPy is a Python library for performing large scale numerical computations. It is extremely useful, especially in machine learning. Let's look at what NumPy has to offer.</p>
<h1 id="heading-introduction-to-nympy">Introduction to NumPy</h1>
<p>NumPy is a Python library used to perform numerical computations with large datasets. The name stands for Numerical Python and it is a popular library used by data scientists, especially for machine learning problems. </p>
<p>NumPy is useful for pre-processing data before you train a machine learning model on it.</p>
<p>Working with n-dimensional arrays is easier in NumPy compared to Python lists. NumPy arrays are also faster than Python lists since, unlike lists, NumPy arrays are stored in one contiguous block of memory. This enables the processor to perform computations efficiently.</p>
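<p>As a rough illustration of the difference, the sketch below doubles a thousand numbers first with a Python list and then with a NumPy array. The variable names are made up for this example, and the actual speedup depends on the array size and your hardware.</p>

```python
import numpy as np

values = list(range(1_000))
arr = np.array(values)

# Python list: an explicit loop visits each element
doubled_list = [v * 2 for v in values]

# NumPy array: one vectorized expression over the whole array
doubled_arr = arr * 2

print(doubled_arr[:5])             # [0 2 4 6 8]
print(arr.flags['C_CONTIGUOUS'])   # True
```

<p>Both approaches give the same numbers. The NumPy version hands the loop to compiled code that walks over a single block of memory.</p>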
<p>In this article, we will look at the basics of working with NumPy including array operations, matrix transformations, generating random values, and so on.</p>
<h1 id="heading-installation">Installation</h1>
<p>Clear installation instructions are provided on NumPy's official website, so I am not going to repeat them in this article. <a target="_blank" href="https://numpy.org/install/">Please find those instructions here</a>.</p>
<h1 id="heading-working-with-numpy">Working with NumPy</h1>
<h2 id="heading-importing-numpy">Importing NumPy</h2>
<p>To start using NumPy in your script, you have to import it.</p>
<pre><code><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
</code></pre><h2 id="heading-converting-arrays-to-numpy-arrays">Converting Arrays to NumPy Arrays</h2>
<p>You can convert your existing Python lists into NumPy arrays using the <code>np.array()</code> function, like this:</p>
<pre><code>arr = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]
np.array(arr)
</code></pre><p>This also applies to multi-dimensional arrays. NumPy will keep track of the shape (dimensions) of the array.</p>
<pre><code>nested_arr = [[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>],[<span class="hljs-number">3</span>,<span class="hljs-number">4</span>],[<span class="hljs-number">5</span>,<span class="hljs-number">6</span>]]
np.array(nested_arr)
</code></pre><h2 id="heading-numpy-arrange-function">NumPy Arange Function</h2>
<p>When working with data, you will often come across use cases where you need to generate data.</p>
<p>NumPy has an <code>arange()</code> function that generates a range of evenly spaced values between two numbers. It takes a start value, a stop value (which is excluded from the output), and an optional step.</p>
<pre><code>print(np.arange(<span class="hljs-number">0</span>,<span class="hljs-number">10</span>)) # without the step parameter
<span class="hljs-attr">OUTPUT</span>:[<span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">2</span> <span class="hljs-number">3</span> <span class="hljs-number">4</span> <span class="hljs-number">5</span> <span class="hljs-number">6</span> <span class="hljs-number">7</span> <span class="hljs-number">8</span> <span class="hljs-number">9</span>]

print(np.arange(<span class="hljs-number">0</span>,<span class="hljs-number">10</span>,<span class="hljs-number">2</span>)) # <span class="hljs-keyword">with</span> the step parameter
<span class="hljs-attr">OUTPUT</span>: [<span class="hljs-number">0</span> <span class="hljs-number">2</span> <span class="hljs-number">4</span> <span class="hljs-number">6</span> <span class="hljs-number">8</span>]
</code></pre><h2 id="heading-zeroes-and-ones">Zeroes and Ones</h2>
<p>You can also generate an array or matrix of zeroes or ones using NumPy (trust me, you will need it). Here's how.</p>
<pre><code>print(np.zeros(<span class="hljs-number">3</span>))
<span class="hljs-attr">OUTPUT</span>: [<span class="hljs-number">0.</span> <span class="hljs-number">0.</span> <span class="hljs-number">0.</span>]

print(np.ones(<span class="hljs-number">3</span>))
<span class="hljs-attr">OUTPUT</span>: [<span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span>]
</code></pre><p>Both these functions support n-dimensional arrays as well. Pass the shape as a tuple, for example (rows, columns) for a matrix.</p>
<pre><code>print(np.zeros((<span class="hljs-number">4</span>,<span class="hljs-number">5</span>)))
<span class="hljs-attr">OUTPUT</span>:
[
 [<span class="hljs-number">0.</span> <span class="hljs-number">0.</span> <span class="hljs-number">0.</span> <span class="hljs-number">0.</span> <span class="hljs-number">0.</span>]
 [<span class="hljs-number">0.</span> <span class="hljs-number">0.</span> <span class="hljs-number">0.</span> <span class="hljs-number">0.</span> <span class="hljs-number">0.</span>]
 [<span class="hljs-number">0.</span> <span class="hljs-number">0.</span> <span class="hljs-number">0.</span> <span class="hljs-number">0.</span> <span class="hljs-number">0.</span>]
 [<span class="hljs-number">0.</span> <span class="hljs-number">0.</span> <span class="hljs-number">0.</span> <span class="hljs-number">0.</span> <span class="hljs-number">0.</span>]
]

print(np.ones((<span class="hljs-number">4</span>,<span class="hljs-number">5</span>)))
<span class="hljs-attr">OUTPUT</span>:
[
 [<span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span>]
 [<span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span>]
 [<span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span>]
 [<span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span>]
]
</code></pre><h2 id="heading-identity-matrix">Identity Matrix</h2>
<p>You can also generate an <a target="_blank" href="https://en.wikipedia.org/wiki/Identity_matrix">identity matrix</a> using a built-in NumPy function called “eye”.</p>
<pre><code>np.eye(<span class="hljs-number">5</span>)
<span class="hljs-attr">OUTPUT</span>:
[[<span class="hljs-number">1.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>]
[<span class="hljs-number">0.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>]
[<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>]
[<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">0.</span>]
[<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">1.</span>]]
</code></pre><h2 id="heading-numpy-linspace-function">NumPy Linspace Function</h2>
<p>NumPy has a linspace method that generates evenly spaced points between two numbers.</p>
<pre><code>print(np.linspace(<span class="hljs-number">0</span>,<span class="hljs-number">10</span>,<span class="hljs-number">3</span>))
<span class="hljs-attr">OUTPUT</span>:[ <span class="hljs-number">0.</span>  <span class="hljs-number">5.</span> <span class="hljs-number">10.</span>]
</code></pre><p>In the above example, the first and second params are the start and the end points, while the third param is the total number of points to generate, including both the start and the end.</p>
<p>Here is the same range with 20 points.</p>
<pre><code>print(np.linspace(<span class="hljs-number">0</span>,<span class="hljs-number">10</span>,<span class="hljs-number">20</span>))
<span class="hljs-attr">OUTPUT</span>:[ <span class="hljs-number">0.</span> <span class="hljs-number">0.52631579</span>  <span class="hljs-number">1.05263158</span>  <span class="hljs-number">1.57894737</span>  <span class="hljs-number">2.10526316</span>  <span class="hljs-number">2.63157895</span>   <span class="hljs-number">3.15789474</span>  <span class="hljs-number">3.68421053</span>  <span class="hljs-number">4.21052632</span>  <span class="hljs-number">4.73684211</span>  <span class="hljs-number">5.26315789</span>  <span class="hljs-number">5.78947368</span>   <span class="hljs-number">6.31578947</span>  <span class="hljs-number">6.84210526</span>  <span class="hljs-number">7.36842105</span>  <span class="hljs-number">7.89473684</span>  <span class="hljs-number">8.42105263</span>  <span class="hljs-number">8.94736842</span>   <span class="hljs-number">9.47368421</span> <span class="hljs-number">10.</span>]
</code></pre><h2 id="heading-random-number-generation">Random Number Generation</h2>
<p>When you are working on machine learning problems, you will often need to generate random numbers. NumPy has in-built functions for that as well.</p>
<p>But before we start generating random numbers, let's look at two major types of distributions.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/09/distro-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-normal-distribution">Normal Distribution</h3>
<p>In a <a target="_blank" href="https://www.mathsisfun.com/data/standard-normal-distribution.html">standard normal distribution</a>, the values peak in the middle. </p>
<p>The normal distribution is a very important concept in statistics since it is seen in many natural phenomena. It is also called a “bell curve”.</p>
<h3 id="heading-uniform-distribution">Uniform Distribution</h3>
<p>If every value in a distribution has the same probability of occurring, it is called a <a target="_blank" href="https://www.investopedia.com/terms/u/uniform-distribution.asp">uniform distribution</a>. </p>
<p>For example, a coin toss has a uniform distribution since the probability of getting either heads or tails in a coin toss is the same.</p>
<p>Now that you know how the two main distributions work, let's generate some random numbers.</p>
<ul>
<li>To generate random numbers from a uniform distribution over the interval from 0 (inclusive) to 1 (exclusive), use the <strong>rand()</strong> function from <strong>np.random</strong>:</li>
</ul>
<pre><code>print(np.random.rand(<span class="hljs-number">10</span>)) # array
<span class="hljs-attr">OUTPUT</span>: [<span class="hljs-number">0.46015141</span> <span class="hljs-number">0.89326339</span> <span class="hljs-number">0.22589334</span> <span class="hljs-number">0.29874476</span> <span class="hljs-number">0.5664353</span>  <span class="hljs-number">0.39257603</span>  <span class="hljs-number">0.77672998</span> <span class="hljs-number">0.35768031</span> <span class="hljs-number">0.95087408</span> <span class="hljs-number">0.34418542</span>]

print(np.random.rand(<span class="hljs-number">3</span>,<span class="hljs-number">4</span>)) # <span class="hljs-number">3</span>x4 matrix
<span class="hljs-attr">OUTPUT</span>:[[<span class="hljs-number">0.63775985</span> <span class="hljs-number">0.91746663</span> <span class="hljs-number">0.41667645</span> <span class="hljs-number">0.28272243</span>]  [<span class="hljs-number">0.14919547</span> <span class="hljs-number">0.72895922</span> <span class="hljs-number">0.87147748</span> <span class="hljs-number">0.94037953</span>]  [<span class="hljs-number">0.5545835</span>  <span class="hljs-number">0.30870297</span> <span class="hljs-number">0.49341904</span> <span class="hljs-number">0.27852723</span>]]
</code></pre><ul>
<li>To generate random numbers from a standard normal distribution (mean 0, standard deviation 1), use the <strong>randn()</strong> function from <strong>np.random</strong>:</li>
</ul>
<pre><code>print(np.random.randn(<span class="hljs-number">10</span>))
<span class="hljs-attr">OUTPUT</span>:[<span class="hljs-number">-1.02087155</span> <span class="hljs-number">-0.75207769</span> <span class="hljs-number">-0.22696798</span>  <span class="hljs-number">0.86739858</span>  <span class="hljs-number">0.07367362</span> <span class="hljs-number">-0.41932541</span>   <span class="hljs-number">0.86303979</span>  <span class="hljs-number">0.13739312</span>  <span class="hljs-number">0.13214285</span>  <span class="hljs-number">1.23089936</span>]

print(np.random.randn(<span class="hljs-number">3</span>,<span class="hljs-number">4</span>))
<span class="hljs-attr">OUTPUT</span>: [[ <span class="hljs-number">1.61013773</span>  <span class="hljs-number">1.37400445</span>  <span class="hljs-number">0.55494053</span>  <span class="hljs-number">0.23133522</span>]  [ <span class="hljs-number">0.31290971</span> <span class="hljs-number">-0.30866402</span>  <span class="hljs-number">0.33093618</span>  <span class="hljs-number">0.34868954</span>]  [<span class="hljs-number">-0.11659865</span> <span class="hljs-number">-1.22311073</span>  <span class="hljs-number">0.36676476</span>  <span class="hljs-number">0.40819545</span>]]
</code></pre><ul>
<li>To generate random integers between a low value (inclusive) and a high value (exclusive), use the <strong>randint()</strong> function from <strong>np.random</strong>:</li>
</ul>
<pre><code>print(np.random.randint(<span class="hljs-number">1</span>,<span class="hljs-number">100</span>,<span class="hljs-number">10</span>))
<span class="hljs-attr">OUTPUT</span>:[<span class="hljs-number">64</span> <span class="hljs-number">37</span> <span class="hljs-number">62</span> <span class="hljs-number">27</span>  <span class="hljs-number">4</span> <span class="hljs-number">33</span> <span class="hljs-number">23</span> <span class="hljs-number">52</span> <span class="hljs-number">70</span>  <span class="hljs-number">7</span>]

print(np.random.randint(<span class="hljs-number">1</span>,<span class="hljs-number">100</span>,(<span class="hljs-number">2</span>,<span class="hljs-number">3</span>)))
<span class="hljs-attr">OUTPUT</span>:[[<span class="hljs-number">92</span> <span class="hljs-number">42</span> <span class="hljs-number">38</span>]  [<span class="hljs-number">87</span> <span class="hljs-number">69</span> <span class="hljs-number">38</span>]]
</code></pre><p>A <a target="_blank" href="https://en.wikipedia.org/wiki/Random_seed">seed value</a> is used if you want your random numbers to be the same each time you run the computation.</p>
<ul>
<li>To set a seed value in NumPy, do the following:</li>
</ul>
<pre><code>np.random.seed(<span class="hljs-number">42</span>)
print(np.random.rand(<span class="hljs-number">4</span>))
<span class="hljs-attr">OUTPUT</span>:[<span class="hljs-number">0.37454012</span> <span class="hljs-number">0.95071431</span> <span class="hljs-number">0.73199394</span> <span class="hljs-number">0.59865848</span>]
</code></pre><p>As long as you set the same seed before generating, you will get the same sequence of random numbers on every run.</p>
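<p>The effect of a seed can be checked with a short sketch like the one below. The first half uses <strong>np.random.seed()</strong> as in the example above. The second half uses <strong>np.random.default_rng()</strong>, which is an assumption on my part that you are running NumPy 1.17 or newer, where this newer generator interface is available.</p>

```python
import numpy as np

# Re-seeding the global generator restarts the same sequence.
np.random.seed(42)
first = np.random.rand(4)

np.random.seed(42)
second = np.random.rand(4)

print(np.array_equal(first, second))  # True

# Assumption: NumPy 1.17+ is installed, so default_rng() exists.
# Two generators built from the same seed produce the same numbers.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
same_from_generators = np.array_equal(rng_a.random(4), rng_b.random(4))
print(same_from_generators)  # True
```

<p>A separate generator object keeps the seed local to your own code instead of changing global state, which may be easier to reason about in larger projects.</p>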
<h2 id="heading-reshaping-arrays">Reshaping Arrays</h2>
<p>As a data scientist, you will work with re-shaping the data sets for different types of computations. In this section, we will look at how to work with the shapes of arrays.</p>
<ul>
<li>To get the shape of an array, use the <strong>shape</strong> property.</li>
</ul>
<pre><code>arr = np.random.rand(<span class="hljs-number">2</span>,<span class="hljs-number">2</span>)
print(arr)
print(arr.shape)
<span class="hljs-attr">OUTPUT</span>:[
[<span class="hljs-number">0.19890857</span> <span class="hljs-number">0.00806693</span>]
[<span class="hljs-number">0.48199837</span> <span class="hljs-number">0.55373954</span>]
]
(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)
</code></pre><ul>
<li>To reshape an array, use the <strong>reshape()</strong> function.</li>
</ul>
<pre><code>print(arr.reshape(<span class="hljs-number">1</span>,<span class="hljs-number">4</span>))
<span class="hljs-attr">OUTPUT</span>: [[<span class="hljs-number">0.19890857</span> <span class="hljs-number">0.00806693</span> <span class="hljs-number">0.48199837</span> <span class="hljs-number">0.55373954</span>]]
print(arr.reshape(<span class="hljs-number">4</span>,<span class="hljs-number">1</span>))
<span class="hljs-attr">OUTPUT</span>:[
[<span class="hljs-number">0.19890857</span>]
[<span class="hljs-number">0.00806693</span>]
[<span class="hljs-number">0.48199837</span>]
[<span class="hljs-number">0.55373954</span>]
]
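</code></pre><p>The reshape() function returns a new array and leaves ‘arr’ unchanged. In order to keep the new shape, you have to assign the reshaped array back to the ‘arr’ variable. </p>
<p>Also, reshape only works if the new shape holds the same total number of elements as the original. You cannot reshape a 2x2 array (4 elements) into a 3x1 array (3 elements).</p>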
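<p>Here is a small sketch of that rule. The array and variable names are made up for illustration. It also shows the value -1, which asks NumPy to work out one dimension from the number of elements.</p>

```python
import numpy as np

arr = np.arange(6)            # 6 elements: 0 to 5

grid = arr.reshape(2, 3)      # 2 * 3 == 6, so this works
print(grid.shape)             # (2, 3)

# -1 lets NumPy infer this dimension from the element count.
column = arr.reshape(-1, 1)
print(column.shape)           # (6, 1)

# 4 * 2 == 8 does not match 6 elements, so NumPy raises ValueError.
try:
    arr.reshape(4, 2)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```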
<h2 id="heading-slicing-data">Slicing Data</h2>
<p>Let's look at fetching data from NumPy arrays. NumPy arrays work similarly to Python lists during fetch operations.</p>
<ul>
<li>To slice an array, do this:</li>
</ul>
<pre><code>myarr = np.arange(<span class="hljs-number">0</span>,<span class="hljs-number">11</span>)
print(myarr)
<span class="hljs-attr">OUTPUT</span>:[ <span class="hljs-number">0</span>  <span class="hljs-number">1</span>  <span class="hljs-number">2</span>  <span class="hljs-number">3</span>  <span class="hljs-number">4</span>  <span class="hljs-number">5</span>  <span class="hljs-number">6</span>  <span class="hljs-number">7</span>  <span class="hljs-number">8</span>  <span class="hljs-number">9</span> <span class="hljs-number">10</span>]

sliced = myarr[<span class="hljs-number">0</span>:<span class="hljs-number">5</span>]
print(sliced)
<span class="hljs-attr">OUTPUT</span>: [<span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">2</span> <span class="hljs-number">3</span> <span class="hljs-number">4</span>]

sliced[:] = <span class="hljs-number">99</span>
print(sliced)
<span class="hljs-attr">OUTPUT</span>: [<span class="hljs-number">99</span> <span class="hljs-number">99</span> <span class="hljs-number">99</span> <span class="hljs-number">99</span> <span class="hljs-number">99</span>]

print(myarr)
<span class="hljs-attr">OUTPUT</span>:[<span class="hljs-number">99</span> <span class="hljs-number">99</span> <span class="hljs-number">99</span> <span class="hljs-number">99</span> <span class="hljs-number">99</span>  <span class="hljs-number">5</span>  <span class="hljs-number">6</span>  <span class="hljs-number">7</span>  <span class="hljs-number">8</span>  <span class="hljs-number">9</span> <span class="hljs-number">10</span>]
</code></pre><p>If you look at the above example, even though we assigned the slice of “myarr” to the variable “sliced”, changing the value of “sliced” affects the original array. This is because a slice of a NumPy array is a view that points to the original array's data, not a separate copy.</p>
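<p>If you want to confirm that a slice shares memory with the original array, a quick check along these lines should work. It relies on <strong>np.shares_memory()</strong> and the <strong>base</strong> attribute of the array.</p>

```python
import numpy as np

myarr = np.arange(0, 11)
sliced = myarr[0:5]          # basic slicing gives a view, not a copy

print(np.shares_memory(myarr, sliced))  # True
print(sliced.base is myarr)             # True

# Writing through the view changes the original array as well.
sliced[0] = 99
print(myarr[0])                         # 99
```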
<p>To make an independent section of an array, use the <strong>copy()</strong> function.</p>
<pre><code>sliced = myarr[<span class="hljs-number">0</span>:<span class="hljs-number">5</span>].copy()
</code></pre><ul>
<li>Slicing multi-dimensional arrays works similarly to slicing one-dimensional arrays.</li>
</ul>
<pre><code>my_matrix = np.random.randint(<span class="hljs-number">1</span>,<span class="hljs-number">30</span>,(<span class="hljs-number">3</span>,<span class="hljs-number">3</span>))
print(my_matrix)
<span class="hljs-attr">OUTPUT</span>: [
[<span class="hljs-number">21</span>  <span class="hljs-number">1</span> <span class="hljs-number">20</span>]
[<span class="hljs-number">22</span> <span class="hljs-number">16</span> <span class="hljs-number">27</span>]
[<span class="hljs-number">24</span> <span class="hljs-number">14</span> <span class="hljs-number">22</span>]
]

print(my_matrix[<span class="hljs-number">0</span>]) # print a single row
<span class="hljs-attr">OUTPUT</span>: [<span class="hljs-number">21</span>  <span class="hljs-number">1</span> <span class="hljs-number">20</span>]

print(my_matrix[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]) # print a single value or row <span class="hljs-number">0</span>, column <span class="hljs-number">0</span>
<span class="hljs-attr">OUTPUT</span>: <span class="hljs-number">21</span>

print(my_matrix[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>]) #alternate way to print value <span class="hljs-keyword">from</span> row0,col0
<span class="hljs-attr">OUTPUT</span>: <span class="hljs-number">21</span>
</code></pre><h2 id="heading-array-computations">Array Computations</h2>
<p>Now let's look at array computations. NumPy is known for its speed when performing complex computations on large multi-dimensional arrays.</p>
<p>Let’s try a few basic operations.</p>
<pre><code>new_arr = np.arange(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)
print(new_arr)
<span class="hljs-attr">OUTPUT</span>: [ <span class="hljs-number">1</span>  <span class="hljs-number">2</span>  <span class="hljs-number">3</span>  <span class="hljs-number">4</span>  <span class="hljs-number">5</span>  <span class="hljs-number">6</span>  <span class="hljs-number">7</span>  <span class="hljs-number">8</span>  <span class="hljs-number">9</span> <span class="hljs-number">10</span>]
</code></pre><ul>
<li>Addition</li>
</ul>
<pre><code>print(new_arr + <span class="hljs-number">5</span>)
<span class="hljs-attr">OUTPUT</span>: [ <span class="hljs-number">6</span>  <span class="hljs-number">7</span>  <span class="hljs-number">8</span>  <span class="hljs-number">9</span> <span class="hljs-number">10</span> <span class="hljs-number">11</span> <span class="hljs-number">12</span> <span class="hljs-number">13</span> <span class="hljs-number">14</span> <span class="hljs-number">15</span>]
</code></pre><ul>
<li>Subtraction</li>
</ul>
<pre><code>print(new_arr - <span class="hljs-number">5</span>)
<span class="hljs-attr">OUTPUT</span>: [<span class="hljs-number">-4</span> <span class="hljs-number">-3</span> <span class="hljs-number">-2</span> <span class="hljs-number">-1</span>  <span class="hljs-number">0</span>  <span class="hljs-number">1</span>  <span class="hljs-number">2</span>  <span class="hljs-number">3</span>  <span class="hljs-number">4</span>  <span class="hljs-number">5</span>]
</code></pre><ul>
<li>Array Addition</li>
</ul>
<pre><code>print(new_arr + new_arr)
<span class="hljs-attr">OUTPUT</span>: [ <span class="hljs-number">2</span>  <span class="hljs-number">4</span>  <span class="hljs-number">6</span>  <span class="hljs-number">8</span> <span class="hljs-number">10</span> <span class="hljs-number">12</span> <span class="hljs-number">14</span> <span class="hljs-number">16</span> <span class="hljs-number">18</span> <span class="hljs-number">20</span>]
</code></pre><ul>
<li>Array Division</li>
</ul>
<pre><code>print(new_arr / new_arr)
<span class="hljs-attr">OUTPUT</span>:[<span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span> <span class="hljs-number">1.</span>]
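</code></pre><p>Unlike plain Python, NumPy does not raise <a target="_blank" href="https://airbrake.io/blog/python-exception-handling/zerodivisionerror-2">zero division errors</a> for array division. It emits a warning and sets the value to NaN (not a number) for 0 divided by 0, or to infinity (inf) when a non-zero value is divided by 0.</p>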
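<p>The sketch below illustrates this behaviour with made-up values. It uses <strong>np.errstate()</strong> to silence the warning for this one block, which is optional.</p>

```python
import numpy as np

numerator = np.array([0.0, 1.0, -1.0])
denominator = np.zeros(3)

# errstate suppresses the RuntimeWarning inside this block only.
with np.errstate(divide="ignore", invalid="ignore"):
    result = numerator / denominator

print(np.isnan(result))   # [ True False False]
print(np.isinf(result))   # [False  True  True]
```

<p>You can use <strong>np.isnan()</strong> and <strong>np.isinf()</strong> like this to find and clean up such values before further computation.</p>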
<p>There are also a few in-built computation methods available in NumPy to calculate values like mean, standard deviation, variance, and others.</p>
<ul>
<li>Sum — np.sum()</li>
<li>Square Root — np.sqrt()</li>
<li>Mean — np.mean()</li>
<li>Variance — np.var()</li>
<li>Standard Deviation — np.std()</li>
</ul>
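<p>As a quick illustration, here is how these functions could be called on an array of the numbers 1 to 10. Note that np.var() and np.std() compute the population variance and standard deviation by default.</p>

```python
import numpy as np

new_arr = np.arange(1, 11)     # the integers 1 to 10

total = np.sum(new_arr)
mean = np.mean(new_arr)
variance = np.var(new_arr)     # population variance (ddof=0 by default)
std_dev = np.std(new_arr)      # square root of the variance
roots = np.sqrt(new_arr)       # element-wise square roots

print(total)      # 55
print(mean)       # 5.5
print(variance)   # 8.25
```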
<p>While working with 2d arrays, you will often need to calculate row-wise or column-wise sum, mean, variance, and so on. You can use the optional axis parameter to choose the direction: axis=0 gives one result per column, and axis=1 gives one result per row.</p>
<pre><code>arr2d = np.arange(<span class="hljs-number">25</span>).reshape(<span class="hljs-number">5</span>,<span class="hljs-number">5</span>)
print(arr2d)
<span class="hljs-attr">OUTPUT</span>: [
[ <span class="hljs-number">0</span>  <span class="hljs-number">1</span>  <span class="hljs-number">2</span>  <span class="hljs-number">3</span>  <span class="hljs-number">4</span>]
[ <span class="hljs-number">5</span>  <span class="hljs-number">6</span>  <span class="hljs-number">7</span>  <span class="hljs-number">8</span>  <span class="hljs-number">9</span>]
[<span class="hljs-number">10</span> <span class="hljs-number">11</span> <span class="hljs-number">12</span> <span class="hljs-number">13</span> <span class="hljs-number">14</span>]
[<span class="hljs-number">15</span> <span class="hljs-number">16</span> <span class="hljs-number">17</span> <span class="hljs-number">18</span> <span class="hljs-number">19</span>]
[<span class="hljs-number">20</span> <span class="hljs-number">21</span> <span class="hljs-number">22</span> <span class="hljs-number">23</span> <span class="hljs-number">24</span>]
]

print(arr2d.sum())
<span class="hljs-attr">OUTPUT</span>: <span class="hljs-number">300</span>

print(arr2d.sum(axis=<span class="hljs-number">0</span>))  # sum <span class="hljs-keyword">of</span> columns
<span class="hljs-attr">OUTPUT</span>: [<span class="hljs-number">50</span> <span class="hljs-number">55</span> <span class="hljs-number">60</span> <span class="hljs-number">65</span> <span class="hljs-number">70</span>]

print(arr2d.sum(axis=<span class="hljs-number">1</span>)) #sum <span class="hljs-keyword">of</span> rows
<span class="hljs-attr">OUTPUT</span>: [ <span class="hljs-number">10</span>  <span class="hljs-number">35</span>  <span class="hljs-number">60</span>  <span class="hljs-number">85</span> <span class="hljs-number">110</span>]
</code></pre><h2 id="heading-conditional-operations">Conditional Operations</h2>
<p>You can also do conditional filtering with NumPy using the square bracket notation. Here is an example:</p>
<pre><code>arr = np.arange(<span class="hljs-number">0</span>,<span class="hljs-number">10</span>)
print(arr)
<span class="hljs-attr">OUTPUT</span>: [<span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">2</span> <span class="hljs-number">3</span> <span class="hljs-number">4</span> <span class="hljs-number">5</span> <span class="hljs-number">6</span> <span class="hljs-number">7</span> <span class="hljs-number">8</span> <span class="hljs-number">9</span>]

print(arr &gt; <span class="hljs-number">4</span>)
<span class="hljs-attr">OUTPUT</span>: [False False False False False  True  True  True  True  True]

print(arr[arr &gt; <span class="hljs-number">4</span>])
<span class="hljs-attr">OUTPUT</span>: [<span class="hljs-number">5</span> <span class="hljs-number">6</span> <span class="hljs-number">7</span> <span class="hljs-number">8</span> <span class="hljs-number">9</span>]
</code></pre><h1 id="heading-summary">Summary</h1>
<p>When it comes to working with large datasets, NumPy is a powerful tool to have in your toolkit. It is capable of handling advanced numeric computations and complex n-dimensional array operations.</p>
<p>I highly recommend that you learn NumPy if you plan to start a career in machine learning.</p>
<p><a target="_blank" href="https://colab.research.google.com/drive/1Oa8J_sZXACQJEiMqANIHkftMgUrqSpVt#scrollTo=ITrCTnT6RkWP">Here is a Google colab notebook if you want to try out these examples</a>.</p>
<p><a target="_blank" href="https://tinyletter.com/manishmshiva"><strong>Get a summary of my articles</strong></a> and videos sent to your email every Monday morning. You can also <a target="_blank" href="https://www.manishmshiva.com/"><strong>connect with me</strong></a> here.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Ultimate Guide to the NumPy Package for Scientific Computing in Python ]]>
                </title>
                <description>
                    <![CDATA[ By Nick McCullum NumPy (pronounced "numb pie") is one of the most important packages to grasp when you’re starting to learn Python. The package is known for a very useful data structure called the NumPy array. NumPy also allows Python developers to q... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-ultimate-guide-to-the-numpy-scientific-computing-library-for-python/</link>
                <guid isPermaLink="false">66d46054d14641365a050929</guid>
                
                    <category>
                        <![CDATA[ numpy ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 06 Jul 2020 17:18:57 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2020/07/numpy.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Nick McCullum</p>
<p>NumPy (pronounced "numb pie") is one of the most important packages to grasp when you’re starting to <a target="_blank" href="https://courses.nickmccullum.com/courses/enroll/python-for-finance/">learn Python</a>.</p>
<p>The package is known for a very useful data structure called the NumPy array. NumPy also allows Python developers to quickly perform a wide variety of numerical computations.</p>
<p>This tutorial will teach you the fundamentals of NumPy that you can use to build numerical Python applications today.</p>
<h2 id="heading-table-of-contents"><strong>Table of Contents</strong></h2>
<p>You can skip to a specific section of this NumPy tutorial using the table of contents below:</p>
<ul>
<li><a class="post-section-overview" href="#heading-introduction-to-numpy">Introduction to NumPy</a></li>
<li><a class="post-section-overview" href="#heading-numpy-arrays">NumPy Arrays</a></li>
<li><a class="post-section-overview" href="#heading-numpy-methods-and-operations">NumPy Methods and Operations</a></li>
<li><a class="post-section-overview" href="#heading-numpy-indexing-and-assignment">NumPy Indexing and Assignment</a></li>
<li><a class="post-section-overview" href="#heading-final-thoughts-amp-special-offer">Final Thoughts &amp; Special Offer</a></li>
</ul>
<h2 id="heading-introduction-to-numpy"><strong>Introduction to NumPy</strong></h2>
<p>In this section, we will introduce the <a target="_blank" href="https://nickmccullum.com/advanced-python/numpy/">NumPy library</a> in Python.</p>
<h3 id="heading-what-is-numpy"><strong>What is NumPy?</strong></h3>
<p>NumPy is a Python library for scientific computing. NumPy stands for Numerical Python. Here is the official description of the library from <a target="_blank" href="https://numpy.org/">its website</a>:</p>
<p><em>“NumPy is the fundamental package for scientific computing with Python. It contains among other things:</em></p>
<ul>
<li><em>a powerful N-dimensional array object</em></li>
<li><em>sophisticated (broadcasting) functions</em></li>
<li><em>tools for integrating C/C++ and Fortran code</em></li>
<li><em>useful linear algebra, Fourier transform, and random number capabilities</em></li>
</ul>
<p><em>Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.</em></p>
<p><em>NumPy is licensed under the <a target="_blank" href="https://numpy.org/license.html#license">BSD license</a>, enabling reuse with few restrictions.”</em></p>
<p>NumPy is such an important Python library that there are other libraries (including pandas) that are built on top of NumPy.</p>
<h3 id="heading-the-main-benefit-of-numpy"><strong>The Main Benefit of NumPy</strong></h3>
<p>The main benefit of NumPy is that it allows for extremely fast data generation and handling. NumPy has its own built-in data structure called an <code>array</code> which is similar to the normal Python <code>list</code>, but can store and operate on data much more efficiently.</p>
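<p>To make the difference concrete, here is a small sketch that doubles some numbers with a list and with an array. The <code>prices</code> data is a made-up sample. The sketch only compares how the code is written. Any speed difference depends on the size of your data and your machine, so it is not measured here.</p>

```python
import numpy as np

prices = [10.0, 20.0, 30.0]    # hypothetical sample data

# With a list, element-wise maths needs a loop or a comprehension.
doubled_list = [p * 2 for p in prices]

# Multiplying a list by 2 repeats it instead of doubling the values.
repeated = prices * 2

# With an array, the same maths is one vectorised expression.
doubled_array = np.array(prices) * 2

print(doubled_list)     # [20.0, 40.0, 60.0]
print(len(repeated))    # 6
print(doubled_array)    # [20. 40. 60.]
```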
<h3 id="heading-what-we-will-learn-about-numpy"><strong>What We Will Learn About NumPy</strong></h3>
<p>Advanced Python practitioners will spend much more time working with pandas than they spend working with NumPy. Still, given that pandas is built on NumPy, it is important to understand the most important aspects of the NumPy library.</p>
<p>Over the next several sections, we will cover the following information about the NumPy library:</p>
<ul>
<li>NumPy Arrays</li>
<li>NumPy Indexing and Assignment</li>
<li>NumPy Methods and Operations</li>
</ul>
<h3 id="heading-moving-on"><strong>Moving On</strong></h3>
<p>Let’s move on to learning about NumPy arrays, the core data structure that every NumPy practitioner must be familiar with.</p>
<h2 id="heading-numpy-arrays"><strong>NumPy Arrays</strong></h2>
<p>In this section, we will be learning about <a target="_blank" href="https://nickmccullum.com/advanced-python/numpy-arrays/">NumPy arrays</a>.</p>
<h3 id="heading-what-are-numpy-arrays"><strong>What Are NumPy Arrays?</strong></h3>
<p>NumPy arrays are the main way to store data using the NumPy library. They are similar to normal lists in Python, but have the advantage of being faster and having more built-in methods.</p>
<p>NumPy arrays are created by calling the <code>array()</code> method from the NumPy library. Within the method, you should pass in a list.</p>
<p>An example of a basic NumPy array is shown below. Note that while I run the <code>import numpy as np</code> statement at the start of this code block, it will be excluded from the other code blocks in this section for brevity’s sake.</p>
<pre><code><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

sample_list = [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>]

np.array(sample_list)
</code></pre><p>The last line of that code block will result in an output that looks like this.</p>
<pre><code>array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])
</code></pre><p>The <code>array()</code> wrapper indicates that this is no longer a normal Python list. Instead, it is a NumPy array.</p>
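<p>If you would rather check the type in code than rely on the printed wrapper, something like the following sketch should do it. It also shows <code>tolist()</code>, which converts the array back into a normal Python list.</p>

```python
import numpy as np

sample_list = [1, 2, 3]
sample_array = np.array(sample_list)

print(type(sample_list).__name__)            # list
print(type(sample_array).__name__)           # ndarray
print(isinstance(sample_array, np.ndarray))  # True

# tolist() converts the array back into a plain Python list.
round_trip = sample_array.tolist()
print(round_trip)                            # [1, 2, 3]
```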
<h3 id="heading-the-two-different-types-of-numpy-arrays"><strong>The Two Different Types of NumPy Arrays</strong></h3>
<p>There are two different types of NumPy arrays: vectors and matrices.</p>
<p>Vectors are one-dimensional NumPy arrays, and look like this:</p>
<pre><code>my_vector = np.array([<span class="hljs-string">'this'</span>, <span class="hljs-string">'is'</span>, <span class="hljs-string">'a'</span>, <span class="hljs-string">'vector'</span>])
</code></pre><p>Matrices are two-dimensional arrays and are created by passing a list of lists into the <code>np.array()</code> method. An example is below.</p>
<pre><code>my_matrix = [[<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>],[<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>],[<span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">9</span>]]

np.array(my_matrix)
</code></pre><p>You can also expand NumPy arrays to deal with three-, four-, five-, six- or higher-dimensional arrays, but they are rare and largely outside the scope of this course (after all, this is a course on Python programming, not linear algebra).</p>
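<p>If you are ever unsure how many dimensions an array has, the <code>ndim</code> attribute reports it. A minimal sketch, using illustrative variable names that are not part of the original examples:</p>
<pre><code>import numpy as np

vector = np.array([1, 2, 3])
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

#A three-dimensional array is built from a list of lists of lists
cube = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

vector.ndim
#Returns 1

matrix.ndim
#Returns 2

cube.ndim
#Returns 3
</code></pre>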
<h3 id="heading-numpy-arrays-built-in-methods"><strong>NumPy Arrays: Built-In Methods</strong></h3>
<p>NumPy arrays come with a number of useful built-in methods. We will spend the rest of this section discussing these methods in detail.</p>
<h4 id="heading-how-to-get-a-range-of-numbers-in-python-using-numpy"><strong>How To Get A Range Of Numbers in Python Using NumPy</strong></h4>
<p>NumPy has a useful method called <code>arange</code> that takes in two numbers and gives you an array of integers that are greater than or equal to (<code>&gt;=</code>) the first number and less than (<code>&lt;</code>) the second number.</p>
<p>An example of the <code>arange</code> method is below.</p>
<pre><code>np.arange(<span class="hljs-number">0</span>,<span class="hljs-number">5</span>)

#Returns array([<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>])
</code></pre><p>You can also include a third argument in the <code>arange</code> method that sets the step size between consecutive values. Passing in <code>2</code> as the third argument will return every 2nd number in the range, passing in <code>5</code> will return every 5th number in the range, and so on.</p>
<p>An example of using the third argument in the <code>arange</code> method is below.</p>
<pre><code>np.arange(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>,<span class="hljs-number">2</span>)

#Returns array([<span class="hljs-number">1</span>, <span class="hljs-number">3</span>, <span class="hljs-number">5</span>, <span class="hljs-number">7</span>, <span class="hljs-number">9</span>])
</code></pre><h3 id="heading-how-to-generates-ones-and-zeros-in-python-using-numpy"><strong>How To Generate Ones and Zeros in Python Using NumPy</strong></h3>
<p>While programming, you will from time to time need to create arrays of ones or zeros. NumPy has built-in methods that allow you to do either of these.</p>
<p>We can create arrays of zeros using NumPy’s <code>zeros</code> method. You pass in the number of elements you’d like to create as the argument of the function. The zeros are stored as floating-point numbers by default. An example is below.</p>
<pre><code>np.zeros(<span class="hljs-number">4</span>)

#Returns array([<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>])
</code></pre><p>You can also do something similar with two-dimensional arrays by passing the shape as a tuple. For example, <code>np.zeros((5, 5))</code> creates a 5x5 matrix that contains all zeros.</p>
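<p>As a hedged sketch of the two-dimensional case: the shape is passed as a single tuple, and the values are floating-point numbers by default. The variable name <code>zero_matrix</code> and the 2x3 shape are only for illustration.</p>
<pre><code>import numpy as np

zero_matrix = np.zeros((2, 3))

zero_matrix.shape
#Returns (2, 3)

zero_matrix.dtype
#Returns dtype('float64')
</code></pre>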
<p>We can create arrays of ones using a similar method named <code>ones</code>. An example is below.</p>
<pre><code>np.ones(<span class="hljs-number">5</span>)

#Returns array([<span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>])
</code></pre><h4 id="heading-how-to-evenly-divide-a-range-of-numbers-in-python-using-numpy"><strong>How To Evenly Divide A Range Of Numbers In Python Using NumPy</strong></h4>
<p>There are many situations in which you have a range of numbers and you would like to divide that range into evenly spaced values. NumPy’s <code>linspace</code> method is designed to solve this problem. <code>linspace</code> takes in three arguments:</p>
<ol>
<li>The start of the interval</li>
<li>The end of the interval (included in the output by default)</li>
<li>The number of evenly spaced points to return, which is one more than the number of subintervals</li>
</ol>
<p>An example of the <code>linspace</code> method is below. Asking for 11 points divides the interval from 0 to 1 into 10 equal subintervals.</p>
<pre><code>np.linspace(<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">11</span>)

#Returns array([<span class="hljs-number">0.</span> , <span class="hljs-number">0.1</span>, <span class="hljs-number">0.2</span>, <span class="hljs-number">0.3</span>, <span class="hljs-number">0.4</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">0.6</span>, <span class="hljs-number">0.7</span>, <span class="hljs-number">0.8</span>, <span class="hljs-number">0.9</span>, <span class="hljs-number">1.</span> ])
</code></pre><h4 id="heading-how-to-create-an-identity-matrix-in-python-using-numpy"><strong>How To Create An Identity Matrix In Python Using NumPy</strong></h4>
<p>Anyone who has studied linear algebra will be familiar with the concept of an ‘identity matrix’, which is a square matrix whose diagonal values are all <code>1</code> and whose other values are all <code>0</code>. NumPy has a built-in function for building identity matrices that takes in one argument: the number of rows (and columns). The function is <code>eye</code>.</p>
<p>Examples are below:</p>
<pre><code>np.eye(<span class="hljs-number">1</span>)

#Returns a <span class="hljs-number">1</span>x1 identity matrix

np.eye(<span class="hljs-number">2</span>) 

#Returns a <span class="hljs-number">2</span>x2 identity matrix

np.eye(<span class="hljs-number">50</span>)

#Returns a <span class="hljs-number">50</span>x50 identity matrix
</code></pre><h4 id="heading-how-to-create-random-numbers-in-python-using-numpy"><strong>How To Create Random Numbers in Python Using NumPy</strong></h4>
<p>NumPy has a number of built-in methods that allow you to create arrays of random numbers. Each of these methods is accessed through <code>np.random</code>. A few examples are below:</p>
<pre><code>np.random.rand(sample_size)

#Returns a sample <span class="hljs-keyword">of</span> random numbers between <span class="hljs-number">0</span> and <span class="hljs-number">1.</span>

#Sample size can either be one integer (<span class="hljs-keyword">for</span> a one-dimensional array) or two integers separated by commas (<span class="hljs-keyword">for</span> a two-dimensional array).

np.random.randn(sample_size)

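#Returns a sample <span class="hljs-keyword">of</span> random numbers drawn from the standard normal distribution (mean <span class="hljs-number">0</span>, standard deviation <span class="hljs-number">1</span>), so values can be negative or greater than <span class="hljs-number">1.</span>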

#Sample size can either be one integer (<span class="hljs-keyword">for</span> a one-dimensional array) or two integers separated by commas (<span class="hljs-keyword">for</span> a two-dimensional array).

np.random.randint(low, high, sample_size)

#Returns a sample <span class="hljs-keyword">of</span> integers that are greater than or equal to <span class="hljs-string">'low'</span> and less than <span class="hljs-string">'high'</span>
</code></pre><h4 id="heading-how-to-reshape-numpy-arrays"><strong>How To Reshape NumPy Arrays</strong></h4>
<p>It is very common to take an array with certain dimensions and transform that array into a different shape. For example, you might have a one-dimensional array with 10 elements and want to switch it to a 2x5 two-dimensional array.</p>
<p>An example is below:</p>
<pre><code>arr = np.array([<span class="hljs-number">0</span>,<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">5</span>])

arr.reshape(<span class="hljs-number">2</span>,<span class="hljs-number">3</span>)
</code></pre><p>The output of this operation is:</p>
<pre><code>array([[<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>],

       [<span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>]])
</code></pre><p>Note that in order to use the <code>reshape</code> method, the original array must have the same number of elements as the array that you’re trying to reshape it into.</p>
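<p>A short sketch of that rule, using an illustrative array named <code>six_items</code>: six elements fit a 3x2 shape but not a 4x2 shape, which NumPy rejects with a <code>ValueError</code>. Passing <code>-1</code> for one dimension asks NumPy to work out that dimension from the element count.</p>
<pre><code>import numpy as np

six_items = np.arange(6)

three_by_two = six_items.reshape(3, 2)
#three_by_two is a 3x2 array: [[0, 1], [2, 3], [4, 5]]

#-1 lets NumPy infer the missing dimension (3 columns here)
inferred = six_items.reshape(2, -1)
#inferred is a 2x3 array: [[0, 1, 2], [3, 4, 5]]

reshape_failed = False
try:
    six_items.reshape(4, 2)
except ValueError:
    #6 elements cannot fill a 4x2 array, so NumPy raises an error
    reshape_failed = True
</code></pre>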
<p>If you’re curious about the current shape of a NumPy array, you can determine its shape using NumPy’s <code>shape</code> attribute. Using our previous <code>arr</code> variable structure, an example of how to call the <code>shape</code> attribute is below:</p>
<pre><code>arr = np.array([<span class="hljs-number">0</span>,<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">5</span>])

arr.shape

#Returns (<span class="hljs-number">6</span>,) - note that there is no second element since it is a one-dimensional array

arr = arr.reshape(<span class="hljs-number">2</span>,<span class="hljs-number">3</span>)

arr.shape

#Returns (<span class="hljs-number">2</span>,<span class="hljs-number">3</span>)
</code></pre><p>You can also combine the <code>reshape</code> method with the <code>shape</code> attribute on one line like this:</p>
<pre><code>arr.reshape(<span class="hljs-number">2</span>,<span class="hljs-number">3</span>).shape

#Returns (<span class="hljs-number">2</span>,<span class="hljs-number">3</span>)
</code></pre><h4 id="heading-how-to-find-the-maximum-and-minimum-value-of-a-numpy-array"><strong>How To Find The Maximum and Minimum Value Of A NumPy Array</strong></h4>
<p>To conclude this section, let’s learn about four useful methods for identifying the maximum and minimum values within a NumPy array. We’ll be working with this array:</p>
<pre><code>simple_array = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>])
</code></pre><p>We can use the <code>max</code> method to find the maximum value of a NumPy array. An example is below.</p>
<pre><code>simple_array.max()

#Returns <span class="hljs-number">4</span>
</code></pre><p>We can also use the <code>argmax</code> method to find the index of the maximum value within a NumPy array. This is useful for when you want to find the location of the maximum value but you do not necessarily care what its value is.</p>
<p>An example is below.</p>
<pre><code>simple_array.argmax()

#Returns <span class="hljs-number">3</span>
</code></pre><p>Similarly, we can use the <code>min</code> and <code>argmin</code> methods to find the value and index of the minimum value within a NumPy array.</p>
<pre><code>simple_array.min()

#Returns <span class="hljs-number">1</span>

simple_array.argmin()

#Returns <span class="hljs-number">0</span>
</code></pre><h3 id="heading-moving-on-1"><strong>Moving On</strong></h3>
<p>In this section, we discussed various attributes and methods of NumPy arrays. We will follow up by working through some NumPy array practice problems in the next section.</p>
<h2 id="heading-numpy-methods-and-operations"><strong>NumPy Methods and Operations</strong></h2>
<p>In this section, we will be working through <a target="_blank" href="https://nickmccullum.com/advanced-python/numpy-methods-operations/">various operations included in the NumPy library.</a></p>
<p>Throughout this section, we will be assuming that the <code>import numpy as np</code> command has already been run.</p>
<h3 id="heading-the-array-used-in-this-section"><strong>The Array Used In This Section</strong></h3>
<p>For this section, I will be working with an array of length 4 created using <code>np.arange</code> in all of the examples.</p>
<p>If you’d like to compare my array with the outputs used in this section, here is how I created and printed the array:</p>
<pre><code>arr = np.arange(<span class="hljs-number">4</span>)

arr
</code></pre><p>The array values are below.</p>
<pre><code>array([<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])
</code></pre><h3 id="heading-how-to-perform-arithmetic-in-python-using-number"><strong>How To Perform Arithmetic In Python Using NumPy</strong></h3>
<p>NumPy makes it very easy to perform arithmetic with arrays. You can either perform arithmetic using the array and a single number, or you can perform arithmetic between two NumPy arrays.</p>
<p>We explore each of the major mathematical operations below.</p>
<h4 id="heading-addition"><strong>Addition</strong></h4>
<p>When adding a single number to a NumPy array, that number is added to each element in the array. An example is below:</p>
<pre><code><span class="hljs-number">2</span> + arr

#Returns array([<span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>])
</code></pre><p>You can add two NumPy arrays using the <code>+</code> operator. The arrays are added on an element-by-element basis (meaning the first elements are added together, the second elements are added together, and so on).</p>
<p>An example is below.</p>
<pre><code>arr + arr

#Returns array([<span class="hljs-number">0</span>, <span class="hljs-number">2</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>])
</code></pre><h4 id="heading-subtraction"><strong>Subtraction</strong></h4>
<p>Like addition, subtraction is performed on an element-by-element basis for NumPy arrays. You can find examples for both a single number and another NumPy array below.</p>
<pre><code>arr - <span class="hljs-number">10</span>

#Returns array([<span class="hljs-number">-10</span>,  <span class="hljs-number">-9</span>,  <span class="hljs-number">-8</span>,  <span class="hljs-number">-7</span>])

arr - arr

#Returns array([<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>])
</code></pre><h4 id="heading-multiplication"><strong>Multiplication</strong></h4>
<p>Multiplication is also performed on an element-by-element basis for both single numbers and NumPy arrays.</p>
<p>Two examples are below.</p>
<pre><code><span class="hljs-number">6</span> * arr

#Returns array([ <span class="hljs-number">0</span>,  <span class="hljs-number">6</span>, <span class="hljs-number">12</span>, <span class="hljs-number">18</span>])

arr * arr

#Returns array([<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">4</span>, <span class="hljs-number">9</span>])
</code></pre><h4 id="heading-division"><strong>Division</strong></h4>
<p>By this point, you’re probably not surprised to learn that division performed on NumPy arrays is done on an element-by-element basis. An example of dividing <code>arr</code> by a single number is below:</p>
<pre><code>arr / <span class="hljs-number">2</span>

#Returns array([<span class="hljs-number">0.</span> , <span class="hljs-number">0.5</span>, <span class="hljs-number">1.</span> , <span class="hljs-number">1.5</span>])
</code></pre><p>Division does have one notable difference compared to the other mathematical operations we have seen in this section. Since we cannot divide by zero, NumPy does not return an ordinary number in that position. Dividing zero by zero populates the corresponding field with a <code>nan</code> value, which is shorthand for “Not A Number”, while dividing a nonzero number by zero produces <code>inf</code>. Jupyter Notebook will also print a warning that looks like this:</p>
<pre><code>RuntimeWarning: invalid value encountered <span class="hljs-keyword">in</span> true_divide
</code></pre><p>An example of dividing by zero with a NumPy array is shown below.</p>
<pre><code>arr / arr

#Returns array([nan,  <span class="hljs-number">1.</span>,  <span class="hljs-number">1.</span>,  <span class="hljs-number">1.</span>])
</code></pre><p>We will learn how to deal with <code>nan</code> values in more detail later in this course.</p>
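<p>As a hedged sketch of what happens around division by zero: zero divided by zero gives <code>nan</code>, a nonzero number divided by zero gives <code>inf</code>, and <code>np.isnan</code> can be used to locate the <code>nan</code> entries. The <code>np.errstate</code> context manager is used here only to silence the warnings for this block, and the variable names are illustrative.</p>
<pre><code>import numpy as np

numerators = np.array([0, 1, 2, 3])
denominators = np.array([0, 0, 2, 3])

#Silence the division warnings inside this block only
with np.errstate(divide='ignore', invalid='ignore'):
    quotient = numerators / denominators

quotient
#Returns array([nan, inf,  1.,  1.])

np.isnan(quotient)
#Returns array([ True, False, False, False])
</code></pre>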
<h3 id="heading-complex-operations-in-numpy-arrays"><strong>Complex Operations in NumPy Arrays</strong></h3>
<p>Many operations cannot simply be performed by applying the normal syntax to a NumPy array. In this section, we will explore several mathematical operations that have built-in methods in the NumPy library.</p>
<h4 id="heading-how-to-calculate-square-roots-using-numpy"><strong>How To Calculate Square Roots Using NumPy</strong></h4>
<p>You can calculate the square root of every element in an array using the <code>np.sqrt</code> method:</p>
<pre><code>np.sqrt(arr)

#Returns array([<span class="hljs-number">0.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.41421356</span>, <span class="hljs-number">1.73205081</span>])
</code></pre><p>Many other examples are below (note that you will not be tested on these, but it is still useful to see the capabilities of NumPy):</p>
<pre><code>np.exp(arr)

#Returns e^element <span class="hljs-keyword">for</span> every element <span class="hljs-keyword">in</span> the array

np.sin(arr)

#Calculate the trigonometric sine <span class="hljs-keyword">of</span> every value <span class="hljs-keyword">in</span> the array

np.cos(arr)

#Calculate the trigonometric cosine <span class="hljs-keyword">of</span> every value <span class="hljs-keyword">in</span> the array

np.log(arr)

#Calculate the natural (base-e) logarithm <span class="hljs-keyword">of</span> every value <span class="hljs-keyword">in</span> the array (use np.log10 for the base-ten logarithm)
</code></pre><h3 id="heading-moving-on-2"><strong>Moving On</strong></h3>
<p>In this section, we explored the various methods and operations available in the NumPy Python library. We will test your knowledge of these concepts in the practice problems presented next.</p>
<h2 id="heading-numpy-indexing-and-assignment"><strong>NumPy Indexing and Assignment</strong></h2>
<p>In this section, we will explore <a target="_blank" href="https://nickmccullum.com/advanced-python/numpy-indexing-assignment/">indexing and assignment in NumPy arrays.</a></p>
<h3 id="heading-the-array-ill-be-using-in-this-section"><strong>The Array I’ll Be Using In This Section</strong></h3>
<p>As before, I will be using a specific array throughout this section. This time it will be generated using the <code>np.random.rand</code> method. Here’s how I generated the array:</p>
<pre><code>arr = np.random.rand(<span class="hljs-number">5</span>)
</code></pre><p>Here is the actual array:</p>
<pre><code>array([<span class="hljs-number">0.69292946</span>, <span class="hljs-number">0.9365295</span> , <span class="hljs-number">0.65682359</span>, <span class="hljs-number">0.72770856</span>, <span class="hljs-number">0.83268616</span>])
</code></pre><p>To make this array easier to look at, I will round every element of the array to 2 decimal places using NumPy’s <code>round</code> method:</p>
<pre><code>arr = np.round(arr, <span class="hljs-number">2</span>)
</code></pre><p>Here’s the new array:</p>
<pre><code>array([<span class="hljs-number">0.69</span>, <span class="hljs-number">0.94</span>, <span class="hljs-number">0.66</span>, <span class="hljs-number">0.73</span>, <span class="hljs-number">0.83</span>])
</code></pre><h3 id="heading-how-to-return-a-specific-element-from-a-numpy-array"><strong>How To Return A Specific Element From A NumPy Array</strong></h3>
<p>We can select (and return) a specific element from a NumPy array in the same way that we could using a normal Python list: using square brackets.</p>
<p>An example is below:</p>
<pre><code>arr[<span class="hljs-number">0</span>]

#Returns <span class="hljs-number">0.69</span>
</code></pre><p>We can also reference multiple elements of a NumPy array using the colon operator. For example, the index <code>[2:]</code> selects every element from index 2 onwards. The index <code>[:3]</code> selects every element up to and excluding index 3. The index <code>[2:4]</code> returns every element from index 2 to index 4, excluding index 4. The higher endpoint is always excluded.</p>
<p>A few examples of indexing using the colon operator are below.</p>
<pre><code>arr[:]

#Returns the entire array: array([<span class="hljs-number">0.69</span>, <span class="hljs-number">0.94</span>, <span class="hljs-number">0.66</span>, <span class="hljs-number">0.73</span>, <span class="hljs-number">0.83</span>])

arr[<span class="hljs-number">1</span>:]

#Returns array([<span class="hljs-number">0.94</span>, <span class="hljs-number">0.66</span>, <span class="hljs-number">0.73</span>, <span class="hljs-number">0.83</span>])

arr[<span class="hljs-number">1</span>:<span class="hljs-number">4</span>] 

#Returns array([<span class="hljs-number">0.94</span>, <span class="hljs-number">0.66</span>, <span class="hljs-number">0.73</span>])
</code></pre><h3 id="heading-element-assignment-in-numpy-arrays"><strong>Element Assignment in NumPy Arrays</strong></h3>
<p>We can assign new values to an element of a NumPy array using the <code>=</code> operator, just like regular Python lists. A few examples are below (note that this is all one code block, which means that the element assignments are carried forward from step to step).</p>
<pre><code>arr[<span class="hljs-number">0</span>] = <span class="hljs-number">0.12</span>

arr

#Returns array([<span class="hljs-number">0.12</span>, <span class="hljs-number">0.94</span>, <span class="hljs-number">0.66</span>, <span class="hljs-number">0.73</span>, <span class="hljs-number">0.83</span>])

arr[:] = <span class="hljs-number">0</span>

arr

#Returns array([<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>])

arr[<span class="hljs-number">2</span>:<span class="hljs-number">5</span>] = <span class="hljs-number">0.5</span>

arr

#Returns array([<span class="hljs-number">0.</span> , <span class="hljs-number">0.</span> , <span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>])
</code></pre><h3 id="heading-array-referencing-in-numpy"><strong>Array Referencing in NumPy</strong></h3>
<p>NumPy makes use of a concept called ‘array referencing’ which is a very common source of confusion for people that are new to the library.</p>
<p>To understand array referencing, let’s first consider an example:</p>
<pre><code>
new_array = np.array([<span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">9</span>])

second_new_array = new_array[<span class="hljs-number">0</span>:<span class="hljs-number">2</span>]

second_new_array

#Returns array([<span class="hljs-number">6</span>, <span class="hljs-number">7</span>])

second_new_array[<span class="hljs-number">1</span>] = <span class="hljs-number">4</span>

second_new_array 

#Returns array([<span class="hljs-number">6</span>, <span class="hljs-number">4</span>]), <span class="hljs-keyword">as</span> expected

new_array 

#Returns array([<span class="hljs-number">6</span>, <span class="hljs-number">4</span>, <span class="hljs-number">8</span>, <span class="hljs-number">9</span>]) 

#which is DIFFERENT <span class="hljs-keyword">from</span> its original value <span class="hljs-keyword">of</span> array([<span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">9</span>])

#What the heck?
</code></pre><p>As you can see, modifying <code>second_new_array</code> also changed the value of <code>new_array</code>.</p>
<p>Why is this?</p>
<p>By default, NumPy does not create a copy of the data when you slice an array or assign an array to a new variable using the <code>=</code> assignment operator. Instead, the new variable refers to the same underlying data as the original, which allows changes made through the second variable to modify the original - even if this is not your intention.</p>
<p>This may seem bizarre, but it does have a logical explanation. The purpose of array referencing is to conserve memory and avoid unnecessary copying. When working with large data sets, you would quickly run out of RAM if you created a new array every time you wanted to work with a slice of the array.</p>
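<p>One way to check whether two variables point at the same data is NumPy’s <code>np.shares_memory</code> function. A minimal sketch, with illustrative variable names:</p>
<pre><code>import numpy as np

original = np.array([6, 7, 8, 9])

#A slice refers to the same data as the array it came from
slice_of_original = original[0:2]

np.shares_memory(original, slice_of_original)
#Returns True

#Plain assignment does not copy either: both names refer to one array
same_array = original

same_array is original
#Returns True
</code></pre>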
<p>Fortunately, there is a workaround to array referencing. You can use the <code>copy</code> method to explicitly copy a NumPy array.</p>
<p>An example of this is below.</p>
<pre><code>array_to_copy = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])

copied_array = array_to_copy.copy()

array_to_copy

#Returns array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])

copied_array

#Returns array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])
</code></pre><p>As you can see below, making modifications to the copied array does not alter the original.</p>
<pre><code>copied_array[<span class="hljs-number">0</span>] = <span class="hljs-number">9</span>

copied_array

#Returns array([<span class="hljs-number">9</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])

array_to_copy

#Returns array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])
</code></pre><p>So far in the section, we have only explored how to reference one-dimensional NumPy arrays. We will now explore the indexing of two-dimensional arrays.</p>
<h3 id="heading-indexing-two-dimensional-numpy-arrays"><strong>Indexing Two-Dimensional NumPy Arrays</strong></h3>
<p>To start, let’s create a two-dimensional NumPy array named <code>mat</code>:</p>
<pre><code>mat = np.array([[<span class="hljs-number">5</span>, <span class="hljs-number">10</span>, <span class="hljs-number">15</span>],[<span class="hljs-number">20</span>, <span class="hljs-number">25</span>, <span class="hljs-number">30</span>],[<span class="hljs-number">35</span>, <span class="hljs-number">40</span>, <span class="hljs-number">45</span>]])

mat

<span class="hljs-string">""</span><span class="hljs-string">"

Returns:

array([[ 5, 10, 15],

       [20, 25, 30],

       [35, 40, 45]])

"</span><span class="hljs-string">""</span>
</code></pre><p>There are two ways to index a two-dimensional NumPy array:</p>
<ul>
<li><code>mat[row, col]</code></li>
<li><code>mat[row][col]</code></li>
</ul>
<p>I personally prefer to index using the <code>mat[row][col]</code> nomenclature because it is easier to visualize in a step-by-step fashion. For example:</p>
<pre><code>#First, let's get the first row:

mat[<span class="hljs-number">0</span>]

#Next, let's get the last element of the first row:

mat[<span class="hljs-number">0</span>][<span class="hljs-number">-1</span>]
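</code></pre><p>You can also select a block of rows from a two-dimensional NumPy array using this notation. Note that chained slices are both applied to the rows: <code>mat[1:]</code> keeps the last two rows, and <code>[:2]</code> then keeps the first two of those rows, so every column is returned. To slice rows and columns together into a true sub-matrix, use the comma form, such as <code>mat[1:, :2]</code>.</p>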
<pre><code>mat[<span class="hljs-number">1</span>:][:<span class="hljs-number">2</span>]

<span class="hljs-string">""</span><span class="hljs-string">"

Returns:

array([[20, 25, 30],

       [35, 40, 45]])

"</span><span class="hljs-string">""</span>
</code></pre><p>Array referencing also applies to two-dimensional arrays in NumPy, so be sure to use the <code>copy</code> method if you want to avoid inadvertently modifying an original array after saving a slice of it into a new variable name.</p>
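<p>A hedged sketch of both points, re-creating <code>mat</code> from above (the names <code>sub_matrix</code> and <code>safe_copy</code> are illustrative). The comma form slices rows and columns in one step, and calling <code>copy</code> on the slice protects the original from later edits.</p>
<pre><code>import numpy as np

mat = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

#Rows from index 1 onwards, columns up to (but excluding) index 2
sub_matrix = mat[1:, :2]
#sub_matrix is a 2x2 array: [[20, 25], [35, 40]]

#Copy the slice before modifying it
safe_copy = mat[1:, :2].copy()
safe_copy[0, 0] = 0

mat[1, 0]
#Returns 20, so the original array is unchanged
</code></pre>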
<h3 id="heading-conditional-selection-using-numpy-arrays"><strong>Conditional Selection Using NumPy Arrays</strong></h3>
<p>NumPy arrays support a feature called conditional selection, which allows you to generate a new array of boolean values that state whether each element within the array satisfies a particular condition.</p>
<p>An example of this is below (I also re-created our original <code>arr</code> variable since it’s been a while since we’ve seen it):</p>
<pre><code>arr = np.array([<span class="hljs-number">0.69</span>, <span class="hljs-number">0.94</span>, <span class="hljs-number">0.66</span>, <span class="hljs-number">0.73</span>, <span class="hljs-number">0.83</span>])

arr &gt; <span class="hljs-number">0.7</span>

#Returns array([False,  True, False,  True,  True])
</code></pre><p>You can also generate a new array of values that satisfy this condition by passing the condition into the square brackets (just like we do for indexing).</p>
<p>An example of this is below:</p>
<pre><code>arr[arr &gt; <span class="hljs-number">0.7</span>]

#Returns array([<span class="hljs-number">0.94</span>, <span class="hljs-number">0.73</span>, <span class="hljs-number">0.83</span>])
</code></pre><p>Conditional selection can become significantly more complex than this. We will explore more examples in this section’s associated practice problems.</p>
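<p>As one hedged example of a more involved condition, several tests can be combined with <code>&amp;</code> (and) and <code>|</code> (or). Each test needs its own parentheses, because these operators bind more tightly than comparisons. The thresholds and variable names below are arbitrary choices for illustration.</p>
<pre><code>import numpy as np

arr = np.array([0.69, 0.94, 0.66, 0.73, 0.83])

#Keep values that are greater than 0.7 AND less than 0.9
middle_values = arr[(arr &gt; 0.7) &amp; (arr &lt; 0.9)]
#middle_values is array([0.73, 0.83])

#Keep values that are less than 0.68 OR greater than 0.9
outer_values = arr[(arr &lt; 0.68) | (arr &gt; 0.9)]
#outer_values is array([0.94, 0.66])
</code></pre>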
<h3 id="heading-moving-on-3"><strong>Moving On</strong></h3>
<p>In this section, we explored NumPy array indexing and assignment in thorough detail. We will solidify your knowledge of these concepts further by working through a batch of practice problems in the next section.</p>
<h2 id="heading-final-thoughts-amp-special-offer">Final Thoughts &amp; Special Offer</h2>
<p>Thanks for reading this article on NumPy, which is one of my favorite Python packages and a must-know library for every Python developer.</p>
<p><strong>This tutorial is an excerpt from my course</strong> <strong><a target="_blank" href="https://courses.nickmccullum.com/courses/enroll/python-for-finance/">Python For Finance and Data Science</a>. If you're interested in learning more core Python skills, the course is 50% off for the first 50 freeCodeCamp readers that sign up - <a target="_blank" href="https://courses.nickmccullum.com/courses/enroll/python-for-finance/">click here to get your discounted course now</a>!</strong></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn NumPy and start doing scientific computing in Python ]]>
                </title>
                <description>
                    <![CDATA[ Learn the basics of the NumPy library for Python in this tutorial from Keith Galli. The tutorial explains how NumPy works and how to write code with NumPy. You will learn about creating arrays, indexing, math, statistics, reshaping, and more. Here ar... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/numpy-python-tutorial/</link>
                <guid isPermaLink="false">66b205e7a5be9a107f434194</guid>
                
                    <category>
                        <![CDATA[ numpy ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Fri, 09 Aug 2019 17:51:50 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2020/09/numpy.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Learn the basics of the NumPy library for Python in this tutorial from Keith Galli. The tutorial explains how NumPy works and how to write code with NumPy. You will learn about creating arrays, indexing, math, statistics, reshaping, and more.</p>
<p>Here are the topics covered:</p>
<ul>
<li>What is NumPy</li>
<li>NumPy vs Lists (speed, functionality)</li>
<li>Applications of NumPy</li>
<li>The Basics (creating arrays, shape, size, data type)</li>
<li>Accessing/Changing Specific Elements, Rows, Columns, etc (slicing)</li>
<li>Initializing Different Arrays (1s, 0s, full, random, etc...)</li>
<li>Copying variables</li>
<li>Basic Mathematics (arithmetic, trigonometry, etc.)</li>
<li>Linear Algebra</li>
<li>Statistics</li>
<li>Reorganizing Arrays (reshape, vstack, hstack)</li>
<li>Load data in from a file</li>
<li>Advanced Indexing and Boolean Masking</li>
</ul>
<p>You can watch the full video course on the <a target="_blank" href="https://www.youtube.com/watch?v=QUT1VHiLmmI">freeCodeCamp.org YouTube channel</a> (1 hour watch).</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ If you have slow loops in Python, you can fix it…until you can’t ]]>
                </title>
                <description>
                    <![CDATA[ By Maxim Mamaev Let’s take a computational problem as an example, write some code, and see how we can improve the running time. Here we go. Setting the scene: the knapsack problem This is the computational problem we’ll use as the example: The knapsa... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/if-you-have-slow-loops-in-python-you-can-fix-it-until-you-cant-3a39e03b6f35/</link>
                <guid isPermaLink="false">66c3578fd372f14b49bdcb9f</guid>
                
                    <category>
                        <![CDATA[ numpy ]]>
                    </category>
                
                    <category>
                        <![CDATA[ optimization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 02 Aug 2018 18:12:10 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*t5vZrkc3PdQZ78RX7Jx8Lg.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Maxim Mamaev</p>
<p>Let’s take a computational problem as an example, write some code, and see how we can improve the running time. Here we go.</p>
<h3 id="heading-setting-the-scene-the-knapsack-problem">Setting the scene: the knapsack problem</h3>
<p>This is the computational problem we’ll use as the example:</p>
<p>The knapsack problem is a well-known problem in combinatorial optimization. In this section, we will review its most common flavor, the <strong>0–1 knapsack problem</strong>, and its solution by means of dynamic programming. If you are familiar with the subject, you can <a class="post-section-overview" href="#467d">skip this part</a>.</p>
<p>You are given a knapsack of capacity <strong>C</strong> and a collection of <strong>N</strong> items. Each item has weight <strong>w[i]</strong> and value <strong>v[i]</strong>. Your task is to pack the knapsack with the most valuable items. In other words, you are to maximize the total value of the items that you put into the knapsack, subject to a constraint: the total weight of the taken items <strong>cannot</strong> exceed the capacity of the knapsack.</p>
<p>Once you’ve got a solution, the total weight of the items in the knapsack is called “solution weight,” and their total value is the “solution value”.</p>
<p>The problem has many practical applications. For example, you’ve decided to invest $1600 into the famed FAANG stock (the collective name for the shares of Facebook, Amazon, Apple, Netflix, and Google aka Alphabet). Each share has a current market price and the one-year price estimate. As of one day in 2018, they are as follows:</p>
<pre><code>========= ======= ======= =========
Company   Ticker  Price   Estimate
========= ======= ======= =========
Alphabet  GOOG    <span class="hljs-number">1030</span>    <span class="hljs-number">1330</span>
Amazon    AMZN    <span class="hljs-number">1573</span>    <span class="hljs-number">1675</span>
Apple     AAPL    <span class="hljs-number">162</span>     <span class="hljs-number">193</span>
Facebook  FB      <span class="hljs-number">174</span>     <span class="hljs-number">216</span>
Netflix   NFLX    <span class="hljs-number">312</span>     <span class="hljs-number">327</span>
========= ======= ======= =========
</code></pre><p>For the simplicity of the example, we’ll assume that you’d never put all your eggs in one basket. You are willing to buy <strong>no more</strong> than one share of each stock. What shares do you buy to maximize your profit?</p>
<p>This is a knapsack problem. Your budget ($1600) is the sack’s <strong>capacity (C)</strong>. The shares are the items to be packed. The current prices are the <strong>weights (w)</strong>. The price estimates are the <strong>values (v)</strong>. The problem looks trivial. However, the solution is not evident at first glance: should you buy one share of Amazon, or one share of Google plus one each of some combination of Apple, Facebook, and Netflix?</p>
<p>Of course, in this case, you may do quick calculations by hand and arrive at the solution: you should buy Google, Netflix, and Facebook. This way you spend $1516 and expect the shares to be worth $1873 in a year.</p>
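<p>With only five stocks, the hand calculation can be double-checked by brute force. The snippet below is a minimal sketch that tries every subset of the five tickers from the table; the names <code>stocks</code>, <code>budget</code> and <code>best_picks</code> are my own choices for illustration.</p>

```python
from itertools import combinations

# Ticker -> (price, one-year estimate), copied from the table above.
stocks = {
    "GOOG": (1030, 1330),
    "AMZN": (1573, 1675),
    "AAPL": (162, 193),
    "FB": (174, 216),
    "NFLX": (312, 327),
}
budget = 1600

best_value, best_weight, best_picks = 0, 0, ()
# Enumerate every subset of the five tickers (2**5 = 32 of them).
for size in range(len(stocks) + 1):
    for picks in combinations(stocks, size):
        weight = sum(stocks[t][0] for t in picks)
        value = sum(stocks[t][1] for t in picks)
        # Keep the most valuable subset that still fits the budget.
        if weight <= budget and value > best_value:
            best_value, best_weight, best_picks = value, weight, picks

print(best_picks, best_weight, best_value)
```

<p>Checking 32 subsets is instant, which is exactly why this approach only works for toy inputs.</p>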
<p>Now you believe that you’ve discovered a Klondike. You shatter your piggy bank and collect $10,000. Despite your excitement, you stay adamant about the rule “one stock, one buy”. Therefore, with that larger budget, you have to broaden your options. You decide to consider all stocks from the NASDAQ 100 list as candidates for buying.</p>
<p>The future has never been brighter, but suddenly you realize that, in order to identify your ideal investment portfolio, you will have to check around 2¹⁰⁰ combinations. Even if you are super optimistic about the imminence and the ubiquity of the digital economy, any economy requires, at the least, a universe where it runs. Unfortunately, in a few trillion years when your computation ends, our universe probably won’t exist.</p>
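<p>To get a feel for the scale, here is a back-of-the-envelope estimate. The rate of ten billion portfolio checks per second is purely an assumption picked for illustration, not a measurement.</p>

```python
# Hypothetical throughput: ten billion portfolio checks per second.
CHECKS_PER_SECOND = 10**10
SECONDS_PER_YEAR = 365 * 24 * 3600

# Each of the 100 stocks is either bought or not: 2**100 portfolios.
combinations_total = 2**100
years = combinations_total / (CHECKS_PER_SECOND * SECONDS_PER_YEAR)

print(f"{combinations_total:.3e} combinations, about {years:.1e} years")
```

<p>Under that assumption the exhaustive search lands in the trillions of years, so a faster machine alone will not save us.</p>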
<h4 id="heading-dynamic-programming-algorithm">Dynamic programming algorithm</h4>
<p>We have to drop the brute force approach and program some clever solution. Small knapsack problems (and ours is a small one, believe it or not) are solved by dynamic programming. The basic idea is to start from a trivial problem whose solution we know and then add complexity step-by-step.</p>
<p>If you find the following explanations too abstract, here is an <a target="_blank" href="https://github.com/mmamaev/looping_python/blob/master/ks_dp_example.pdf">annotated illustration</a> of the solution to a very small knapsack problem. This will help you visualize what is happening.</p>
<p>Assume that, given the first <strong>i</strong> items of the collection, we know the solution values <strong>s(i, k)</strong> for all knapsack capacities <strong>k</strong> in the range from 0 to <strong>C</strong>.</p>
<p>In other words, we sewed <strong>C+1</strong> “auxiliary” knapsacks of all sizes from 0 to <strong>C</strong>. Then we sorted our collection, took the first <strong>i</strong> items and temporarily put aside all the rest. And now we assume that, by some magic, we know how to optimally pack each of the sacks from this working set of <strong>i</strong> items. The items that we pick from the working set may be different for different sacks, but at the moment we are not interested in which items we take or skip. It is only the solution value <strong>s(i, k)</strong> that we record for each of our newly sewn sacks.</p>
<p>Now we fetch the next, <strong>(i+1)</strong>th, item from the collection and add it to the working set. Let’s find solution values for all auxiliary knapsacks with this new working set. In other words, we find <strong>s(i+1, k)</strong> for all <strong>k=0..C</strong> given <strong>s(i, k)</strong>.</p>
<p>If <strong>k</strong> is less than the weight of the new item <strong>w[i+1]</strong>, we cannot take this item. Indeed, even if we took <strong>only</strong> this item, it alone would not fit into the knapsack. Therefore, <strong>s(i+1, k) = s(i, k)</strong> for all <strong>k &lt; w[i+1]</strong>.</p>
<p>For the values <strong>k &gt;= w[i+1]</strong> we have to make a choice: either we take the new item into the knapsack of capacity <strong>k</strong> or we skip it. We need to evaluate these two options to determine which one gives us more value packed into the sack.</p>
<p>If we take the <strong>(i+1)</strong>th item, we acquire the value <strong>v[i+1]</strong> and consume the part of the knapsack’s capacity to accommodate the weight <strong>w[i+1]</strong>. That leaves us with the capacity <strong>k–w[i+1]</strong> which we have to optimally fill using (some of) the first <strong>i</strong> items. This optimal filling has the solution value <strong>s(i, k–w[i+1])</strong>. This number is already known to us because, by assumption, we know all solution values for the working set of <strong>i</strong> items. Hence, the candidate solution value for the knapsack <strong>k</strong> with the item <strong>i+1</strong> taken would be<br><strong>s(i+1, k | i+1 taken) = v[i+1] + s(i, k–w[i+1])</strong>.</p>
<p>The other option is to skip the item <strong>i+1</strong>. In this case, nothing changes in our knapsack, and the candidate solution value would be the same as <strong>s(i, k)</strong>.</p>
<p>To decide on the best choice we compare the two candidates for the solution values:<br><strong>s(i+1, k | i+1 taken) = v[i+1] + s(i, k–w[i+1])</strong><br><strong>s(i+1, k | i+1 skipped) = s(i, k)</strong></p>
<p>The maximum of these becomes the solution <strong>s(i+1, k)</strong>.</p>
<p>In summary:</p>
<pre><code><span class="hljs-keyword">if</span> k &lt; w[i+<span class="hljs-number">1</span>]:
    s(i+<span class="hljs-number">1</span>, k) = s(i, k)
<span class="hljs-keyword">else</span>:
    s(i+<span class="hljs-number">1</span>, k) = max( v[i+<span class="hljs-number">1</span>] + s(i, k-w[i+<span class="hljs-number">1</span>]), s(i, k) )
</code></pre><p>Now we can solve the knapsack problem step-by-step. We start with the empty working set <strong>(<em>i=0</em>)</strong>. Obviously, <strong>s(0, k) = 0</strong> for any <strong>k</strong>. Then we take steps by adding items to the working set and finding solution values <strong>s(i, k)</strong> until we arrive at <strong>s(i+1=N, k=C)</strong> which is the solution value of the original problem.</p>
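<p>The recurrence can be transcribed into Python almost literally. The sketch below is only an illustration on the five-stock example: it shifts the index so that <code>s(i, k)</code> is expressed through <code>s(i-1, k)</code>, pads the lists with a placeholder at index 0 so that items are numbered from 1 as in the text, and uses <code>functools.lru_cache</code> to remember values that were already computed. The memoization is my addition, not part of the description above.</p>

```python
from functools import lru_cache

# Prices (weights) and estimates (values) from the FAANG table; index 0 is a
# placeholder so that item numbers run from 1 to N as in the text.
w = [None, 1030, 1573, 162, 174, 312]
v = [None, 1330, 1675, 193, 216, 327]
N, C = 5, 1600


@lru_cache(maxsize=None)
def s(i, k):
    """Solution value for the first i items and a knapsack of capacity k."""
    if i == 0:
        return 0                    # empty working set
    if k < w[i]:
        return s(i - 1, k)          # item i does not fit, so it is skipped
    # Otherwise pick the better of "item i taken" and "item i skipped".
    return max(v[i] + s(i - 1, k - w[i]), s(i - 1, k))


print(s(N, C))
```

<p>This form is handy for checking small cases, but it is the grid-based version described next that the rest of the article measures.</p>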
<p>Note that, in doing this, we have built a grid of <strong><em>NxC</em></strong> solution values.</p>
<p>Yet, despite having learned the solution value, we do not know exactly what items have been taken into the knapsack. To find this out, we backtrack the grid. Starting from <strong>s(i=N, k=C)</strong>, we compare <strong>s(i, k) with s(i–1, k)</strong>.</p>
<p>If <strong>s(i, k) = s(i–1, k)</strong>, the <strong>i</strong>th item has not been taken. We repeat the comparison with <strong>i=i–1</strong>, keeping the value of <strong>k</strong> unchanged. Otherwise, the <strong>i</strong>th item has been taken, and for the next examination step we shrink the knapsack by <strong>w[i]</strong>: we set <strong>k=k–w[i]</strong> and then <strong>i=i–1</strong>.</p>
<p>This way we examine all items from the <strong>N</strong>th to the first, and determine which of them have been put into the knapsack. This gives us the solution to the knapsack problem.</p>
<h3 id="heading-code-and-analysis">Code and analysis</h3>
<p>Now, as we have the algorithm, we will compare several implementations, starting from a straightforward one. The code is available on <a target="_blank" href="https://github.com/mmamaev/looping_python/blob/master/ks_dp_solvers.py">GitHub</a>.</p>
<p>The data is the <strong>Nasdaq 100</strong> list, containing current prices and price estimates for one hundred stock equities (as of one day in 2018). Our investment budget is $10,000.</p>
<p>Recall that share prices are not round dollar numbers, but come with cents. Therefore, to get an accurate solution, we have to count everything in cents, because we definitely want to avoid floating-point numbers. Hence the capacity of our knapsack is $10,000 x 100 = 1,000,000 cents, and the total size of our problem is <strong>N x C</strong> = 100 x 1,000,000 = 100,000,000.</p>
<p>With an integer taking 4 bytes of memory, we expect that the algorithm will consume roughly 400 MB of RAM. So, the memory is not going to be a limitation. It is the execution time we should care about.</p>
<p>Of course, all our implementations will yield the same solution. For your reference, the investment (the solution weight) is 999930 ($9999.30) and the expected return (the solution value) is 1219475 ($12194.75). The list of stocks to buy is rather long (80 of 100 items). You can obtain it by running the code.</p>
<p>And, please, remember that <strong>this is a programming exercise, not investment advice</strong>. By the time you read this article, the prices and the estimates will have changed from what is used here as an example.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*SEihq9zvMPCd8qLZHaUHIg.jpeg" alt="Image" width="800" height="533" loading="lazy">
<em>Credit: <a target="_blank" href="https://www.snapwi.re/user/mavoro">Martin von Rotz</a></em></p>
<h4 id="heading-plain-old-for-loops">Plain old “for” loops</h4>
<p>The straightforward implementation of the algorithm is given below.</p>
<p>There are two parts.</p>
<p>In the first part (lines 3–7 above), two nested <code>for</code> loops are used to build the solution grid.</p>
<p>The outer loop adds items to the working set until we reach <strong>N</strong> (the value of <strong>N</strong> is passed in the parameter <code>items</code>). The row of solution values for each new working set is initialized with the values computed for the previous working set.</p>
<p>The inner loop for each working set iterates the values of <code>k</code> from the weight of the newly added <code>item</code> to <strong>C</strong> (the value of <strong>C</strong> is passed in the parameter <code>capacity</code>).</p>
<p>Note that we do not need to start the loop from <strong>k=0</strong>. When <code>k</code> is less than the weight of <code>item</code>, the solution values are always the same as those computed for the previous working set, and these numbers have already been copied to the current row by initialisation.</p>
<p>When the loops are completed, we have the solution grid and the solution value.</p>
<p>The second part (lines 9–17) is a single <code>for</code> loop of <strong>N</strong> iterations. It backtracks the grid to find what items have been taken into the knapsack.</p>
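<p>Here is a minimal sketch of a solver that follows the two-part description above. It is an illustrative reconstruction and may differ from the listing in the GitHub repository, so its line numbers do not correspond to the ones cited in the text. The parameters <code>items</code> and <code>capacity</code> follow the text; the function name <code>solve_with_loops</code> and the parameters <code>weights</code> and <code>values</code> are assumptions.</p>

```python
def solve_with_loops(items, capacity, weights, values):
    """0-1 knapsack by dynamic programming with plain nested for loops."""
    # Part 1: build the solution grid, one row per working set.
    grid = [[0] * (capacity + 1)]
    for item in range(items):
        grid.append(grid[item].copy())      # start from the previous row
        for k in range(weights[item], capacity + 1):
            grid[item + 1][k] = max(
                grid[item][k],
                grid[item][k - weights[item]] + values[item])

    # Part 2: backtrack the grid to find which items were taken.
    solution_value = grid[items][capacity]
    solution_weight = 0
    taken = []
    k = capacity
    for item in range(items, 0, -1):
        if grid[item][k] != grid[item - 1][k]:
            taken.append(item - 1)          # zero-based index of the item
            k -= weights[item - 1]
            solution_weight += weights[item - 1]
    return solution_value, solution_weight, taken


prices = [1030, 1573, 162, 174, 312]        # GOOG, AMZN, AAPL, FB, NFLX
estimates = [1330, 1675, 193, 216, 327]
print(solve_with_loops(len(prices), 1600, prices, estimates))
```

<p>On the five-stock example this reports Netflix, Facebook and Google (indices 4, 3 and 0), in the order the backtracking visits them.</p>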
<p>Further on, we will focus exclusively on the first part of the algorithm as it has <strong>O(N*C)</strong> time and space complexity. The backtracking part requires just <strong>O(N)</strong> time and does not spend any additional memory — its resource consumption is relatively negligible.</p>
<p>It takes <strong>180 seconds</strong> for the straightforward implementation to solve the <strong>Nasdaq 100</strong> knapsack problem on my computer.</p>
<p>How bad is it? On the one hand, with the speeds of the modern age, we are not used to spending three minutes waiting for a computer to do stuff. On the other hand, the size of the problem — a hundred million — looks indeed intimidating, so, maybe, three minutes are ok?</p>
<p>To obtain some benchmark, let’s program the same algorithm in another language. We need a statically-typed compiled language to ensure the speed of computation. No, not C. It is not fancy. We’ll stick to fashion and write in Go:</p>
<p>As you can see, the Go code is quite similar to that in Python. I even copy-pasted one line, the longest, as is.</p>
<p>What is the running time? <strong>400 milliseconds</strong>! In other words, Python came out about 450 times slower than Go. The gap would probably be even bigger if we tried it in C. This is definitely a disaster for Python.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*lmi8rlKeei1hcMRkeShzIw.jpeg" alt="Image" width="800" height="533" loading="lazy">
<em>Quote from J. K. Rowling’s “Harry Potter and the Chamber of Secrets”. <a target="_blank" href="https://pixabay.com/en/snail-rainy-day-spring-animal-slow-3385348/">Source of original image here</a>.</em></p>
<p>To find out what slows down the Python code, let’s run it with <a target="_blank" href="https://github.com/rkern/line_profiler">line profiler</a>. You can find profiler’s output for this and subsequent implementations of the algorithm at <a target="_blank" href="https://github.com/mmamaev/looping_python/blob/master/ks_dp_solvers_profiles.txt">GitHub</a>.</p>
<p>In the straightforward solver, 99.7% of the running time is spent in two lines. These two lines make up the inner loop, which is executed 98 million times:</p>
<p>I apologize for the excessively long lines, but the line profiler cannot properly handle line breaks within the same statement.</p>
<p>I’ve heard that Python’s <code>for</code> operator is slow but, interestingly, most of the time is spent not in the <code>for</code> line but in the loop’s body.</p>
<p>We can break down the loop’s body into individual operations to see if any particular operation is too slow:</p>
<p>It appears that no particular operation stands out. The running times of individual operations within the inner loop are pretty much the same as the running times of analogous operations elsewhere in the code.</p>
<p>Note how breaking the code down increased the total running time. The inner loop now takes 99.9% of the running time. The dumber your Python code, the slower it gets. Interesting, isn’t it?</p>
<h4 id="heading-built-in-map-function">Built-in map function</h4>
<p>Let’s make the code more optimised and replace the inner <code>for</code> loop with a built-in <code>map()</code> function:</p>
<p>The execution time of this code is <strong>102 seconds</strong>, which is 78 seconds faster than the straightforward implementation. Indeed, <code>map()</code> runs noticeably, but not overwhelmingly, faster.</p>
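<p>As a rough illustration of the idea, the inner loop can be expressed with <code>map()</code> along these lines. This is a sketch under my own naming (<code>solve_with_map</code>, <code>prev</code>, <code>new_row</code>), not the profiled listing, and it returns only the solution value.</p>

```python
def solve_with_map(items, capacity, weights, values):
    """Build the solution grid row by row, using map() for the inner loop."""
    grid = [[0] * (capacity + 1)]
    for item in range(items):
        this_weight, this_value = weights[item], values[item]
        prev = grid[item]
        # Capacities below this_weight keep the previous values; the rest
        # are produced by map() instead of an explicit inner for loop.
        new_row = prev[:this_weight] + list(map(
            lambda k: max(prev[k], prev[k - this_weight] + this_value),
            range(this_weight, capacity + 1)))
        grid.append(new_row)
    return grid[items][capacity]


print(solve_with_map(5, 1600, [1030, 1573, 162, 174, 312],
                     [1330, 1675, 193, 216, 327]))
```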
<h4 id="heading-list-comprehension">List comprehension</h4>
<p>You may have noticed that each run of the inner loop produces a list (which is added to the solution grid as a new row). The Pythonic way of creating lists is, of course, list comprehension. Let’s try it instead of <code>map()</code>.</p>
<p>This finished in <strong>81 seconds</strong>. We’ve achieved another improvement and cut the running time by more than half in comparison to the straightforward implementation (180 sec). Out of context, this would be praised as significant progress. Alas, we are still light years away from our benchmark 0.4 sec.</p>
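<p>A list-comprehension variant might look like the sketch below. As before, the function name <code>solve_with_comprehension</code> and the helper variable <code>prev</code> are assumptions made for illustration, and only the solution value is returned.</p>

```python
def solve_with_comprehension(items, capacity, weights, values):
    """Build each new grid row with a list comprehension."""
    grid = [[0] * (capacity + 1)]
    for item in range(items):
        this_weight, this_value = weights[item], values[item]
        prev = grid[item]
        # Copy the part of the row the new item cannot affect, then
        # compute the remaining solution values in one comprehension.
        grid.append(prev[:this_weight] + [
            max(prev[k], prev[k - this_weight] + this_value)
            for k in range(this_weight, capacity + 1)])
    return grid[items][capacity]


print(solve_with_comprehension(5, 1600, [1030, 1573, 162, 174, 312],
                               [1330, 1675, 193, 216, 327]))
```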
<h4 id="heading-numpy-arrays">NumPy arrays</h4>
<p>At last, we have exhausted built-in Python tools. Yes, I can hear the roar of the audience chanting “NumPy! NumPy!” But to appreciate NumPy’s efficiency, we should have put it into context by trying <code>for</code>, <code>map()</code> and list comprehension beforehand.</p>
<p>Ok, now it is NumPy time. So, we abandon lists and put our data into numpy arrays:</p>
<p>Suddenly, the result is discouraging. This code runs 1.5 times slower than the vanilla list comprehension solver (<strong>123 sec</strong> versus 81 sec). How can that be?</p>
<p>Let’s examine the line profiles for both solvers.</p>
<p>Initialization of <code>grid[0]</code> as a numpy array (line 274) is three times faster than when it is a Python list (line 245). Inside the outer loop, initialization of <code>grid[item+1]</code> is 4.5 times faster for a NumPy array (line 276) than for a list (line 248). So far, so good.</p>
<p>However, the execution of line 279 is 1.5 times slower than its numpy-less analog in line 252. The problem is that list comprehension creates a <strong>list</strong> of values, but we store these values in a <strong>NumPy array</strong> which is found on the left side of the expression. Hence, this line implicitly adds an overhead of converting a list into a NumPy array. With line 279 accounting for 99.9% of the running time, all the previously noted advantages of numpy become negligible.</p>
<p>But we still need a means to <strong>iterate</strong> through arrays in order to do the calculations. We have already learned that list comprehension is the fastest iteration tool. (By the way, if you try to build NumPy arrays within a plain old <code>for</code> loop, avoiding list-to-NumPy-array conversion, you’ll get a whopping 295 sec running time.) So, are we stuck, and is NumPy of no use? Of course not.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*0yrZox6O3EEEKregnri0Vw.jpeg" alt="Image" width="800" height="533" loading="lazy">
<em>Credit: <a target="_blank" href="https://www.pexels.com/@taras-makarenko-188506">Taras Makarenko</a></em></p>
<h4 id="heading-proper-use-of-numpy">Proper use of NumPy</h4>
<p>Just storing data in NumPy arrays does not do the trick. The real power of NumPy comes with the functions that run calculations over NumPy arrays. They take arrays as parameters and return arrays as results.</p>
<p>For example, there is the function <code>where()</code>, which takes three arrays as parameters: <code>condition</code>, <code>x</code>, and <code>y</code>, and returns an array built by picking elements either from <code>x</code> or from <code>y</code>. The first parameter, <code>condition</code>, is an array of booleans. It tells where to pick from: if an element of <code>condition</code> evaluates to <code>True</code>, the corresponding element of <code>x</code> is sent to the output, otherwise the element from <code>y</code> is taken.</p>
<p>Note that the NumPy function does all this in a single call. Looping through the arrays is put away under the hood.</p>
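<p>A tiny example makes the behaviour concrete. The arrays below are made-up values chosen only to show which element is picked at each position.</p>

```python
import numpy as np

condition = np.array([True, False, True, False])
x = np.array([10, 20, 30, 40])
y = np.array([-1, -2, -3, -4])

# Take the element from x where condition is True, otherwise from y.
picked = np.where(condition, x, y)
print(picked)
```

<p>Positions 0 and 2 come from <code>x</code>, positions 1 and 3 from <code>y</code>.</p>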
<p>This is how we use <code>where()</code> as a substitute for the inner <code>for</code> loop in the first solver or, respectively, the list comprehension of the latest:</p>
<p>There are three pieces of code that are interesting: line 8, line 9 and lines 10–13 as numbered above. Together, they substitute for the inner loop which would iterate through all possible sizes of knapsacks to find the solution values.</p>
<p>Until the knapsack’s capacity reaches the weight of the item newly added to the working set (<code>this_weight</code>), we have to ignore this item and set solution values to those of the previous working set. This is pretty straightforward (line 8):</p>
<pre><code>grid[item+<span class="hljs-number">1</span>, :this_weight] = grid[item, :this_weight]
</code></pre><p>Then we build an auxiliary array <code>temp</code> (line 9):</p>
<pre><code>temp = grid[item, :-this_weight] + this_value
</code></pre><p>This code is analogous to, but much faster than:</p>
<pre><code>[grid[item, k - this_weight] + this_value <span class="hljs-keyword">for</span> k <span class="hljs-keyword">in</span> range(this_weight, capacity+<span class="hljs-number">1</span>)]
</code></pre><p>It calculates would-be solution values if the new item were taken into each of the knapsacks that can accommodate this item.</p>
<p>Note how the <code>temp</code> array is built by adding a <strong>scalar</strong> to an array. This is another powerful feature of NumPy called “broadcasting”. When NumPy sees operands with different dimensions, it tries to expand (that is, to “broadcast”) the lower-dimensional operand to match the dimensions of the other. In our case, the scalar is expanded to an array of the same size as <code>grid[item, :-this_weight]</code> and these two arrays are added together. As a result, the value of <code>this_value</code> is added to each element of <code>grid[item, :-this_weight]</code>; no loop is needed.</p>
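<p>To see the slice and the broadcast in isolation, consider this small sketch. The row contents and the item (<code>this_weight = 2</code>, <code>this_value = 327</code>) are hypothetical numbers chosen for readability, not taken from the real grid.</p>

```python
import numpy as np

row = np.array([0, 0, 193, 216, 409])    # illustrative previous row
this_weight, this_value = 2, 327          # hypothetical new item

# row[:-this_weight] drops the last this_weight elements, so position j of
# the result lines up with capacity j + this_weight in the new row.
# The scalar this_value is broadcast across the whole slice.
temp = row[:-this_weight] + this_value
print(temp)
```

<p>The slice has three elements, so <code>temp</code> has three elements too, each increased by <code>this_value</code>. The slice <code>:-this_weight</code> assumes a positive weight; with a weight of zero it would be empty.</p>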
<p>In the next piece (lines 10–13) we use the function <code>where()</code> which does exactly what is required by the algorithm: it compares two would-be solution values for each size of knapsack and selects the one which is larger.</p>
<pre><code>grid[item + <span class="hljs-number">1</span>, <span class="hljs-attr">this_weight</span>:] = np.where(
    temp &gt; grid[item, <span class="hljs-attr">this_weight</span>:],
    temp,
    grid[item, <span class="hljs-attr">this_weight</span>:])
</code></pre><p>The comparison is done by the <code>condition</code> parameter, which is calculated as <code>temp &gt; grid[item, this_weight:]</code>. This is an element-wise operation that produces an array of boolean values, one for each size of an auxiliary knapsack. A <code>True</code> value means that the corresponding item is to be packed into the knapsack. Therefore, the solution value is taken from the array that is the second argument of the function, <code>temp</code>. Otherwise, the item is to be skipped, and the solution value is copied from the previous row of the grid, the third argument of the <code>where()</code> function.</p>
<p>At last, the warp drive engaged! This solver executes in <strong>0.55 sec</strong>. This is 145 times faster than the list comprehension-based solver and 329 times faster than the code using the <code>for</code> loop. Although we did not outrun the solver written in Go (0.4 sec), we came quite close to it.</p>
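<p>Putting the three pieces together, a NumPy-based grid builder could be sketched as follows. This is an illustration assembled from the fragments quoted above, not the exact listing from the repository. The function name <code>solve_with_numpy</code> and the <code>int32</code> data type (matching the 4-byte estimate made earlier) are assumptions, and the code assumes all weights are positive integers.</p>

```python
import numpy as np


def solve_with_numpy(items, capacity, weights, values):
    """Build the solution grid with one np.where() call per item."""
    grid = np.zeros((items + 1, capacity + 1), dtype=np.int32)
    for item in range(items):
        this_weight, this_value = weights[item], values[item]
        # Knapsacks too small for the new item keep the previous values.
        grid[item + 1, :this_weight] = grid[item, :this_weight]
        # Would-be values if the new item were taken.
        temp = grid[item, :-this_weight] + this_value
        # Element-wise choice between "taken" and "skipped".
        grid[item + 1, this_weight:] = np.where(
            temp > grid[item, this_weight:],
            temp,
            grid[item, this_weight:])
    return int(grid[items, capacity])


print(solve_with_numpy(5, 1600, [1030, 1573, 162, 174, 312],
                       [1330, 1675, 193, 216, 327]))
```

<p>The only Python-level loop left is the outer one over items; everything that used to happen per capacity now happens inside NumPy.</p>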
<h4 id="heading-some-loops-are-to-stay">Some loops are to stay</h4>
<p>Wait, but what about the outer <code>for</code> loop?</p>
<p>In our example, the outer loop code, which is not part of the inner loop, is run only 100 times, so we can get away without tinkering with it. However, other times the outer loop can turn out to be as long as the inner.</p>
<p>Can we rewrite the outer loop using a NumPy function in a similar manner to what we did to the inner loop? The answer is no.</p>
<p>Despite both being <code>for</code> loops, the outer and inner loops are quite different in what they do.</p>
<p>The inner loop produces a 1D-array based on another 1D-array whose elements are <strong>all known</strong> when the loop starts. It is this prior availability of the input data that allowed us to substitute the inner loop with either <code>map()</code>, list comprehension, or a NumPy function.</p>
<p>The outer loop produces a 2D-array from 1D-arrays whose elements are <strong>not</strong> known when the loop starts. Moreover, these component arrays are computed by a recursive algorithm: we can find the elements of the <strong>(i+1)</strong>th array only after we have found the <strong>i</strong>th.</p>
<p>Suppose the outer loop could be presented as a function:<br><code>grid = g(row0, row1, … rowN)</code><br>All function parameters must be evaluated before the function is called, yet only <code>row0</code> is known beforehand. Since the computation of the <strong>(i+1)</strong>th row depends on the availability of the <strong>i</strong>th, we need a loop going from <code>1</code> to <code>N</code> to compute all the <code>row</code> parameters. Therefore, to substitute the outer loop with a function, we need another loop which evaluates the parameters of this function. This other loop is exactly the loop we are trying to replace.</p>
<p>The other way to avoid the outer <code>for</code> loop is to use recursion. One can easily write a recursive function <code>calculate(i)</code> that produces the <strong>i</strong>th row of the grid. In order to do the job, the function needs to know the <strong>(i-1)</strong>th row, thus it calls itself as <code>calculate(i-1)</code> and then computes the <strong>i</strong>th row using the NumPy functions as we did before. The entire outer loop can then be replaced with <code>calculate(N)</code>. To make the picture complete, a recursive knapsack solver can be found in the source code accompanying this article on <a target="_blank" href="https://github.com/mmamaev/looping_python/blob/master/ks_dp_solvers.py">GitHub</a>.</p>
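<p>For illustration, such a function might be sketched like this. It is not the solver from the repository: the wrapper <code>solve_recursively</code>, the shared <code>grid</code> array and the <code>int32</code> data type are my assumptions, and weights are assumed to be positive integers.</p>

```python
import numpy as np


def solve_recursively(items, capacity, weights, values):
    """Fill the grid by recursion over rows instead of an outer for loop."""
    grid = np.zeros((items + 1, capacity + 1), dtype=np.int32)

    def calculate(i):
        """Fill row i of the grid, first making sure row i-1 is filled."""
        if i == 0:
            return                       # row 0 is already all zeros
        calculate(i - 1)                 # one stack frame per item
        this_weight, this_value = weights[i - 1], values[i - 1]
        grid[i, :this_weight] = grid[i - 1, :this_weight]
        temp = grid[i - 1, :-this_weight] + this_value
        grid[i, this_weight:] = np.where(
            temp > grid[i - 1, this_weight:],
            temp,
            grid[i - 1, this_weight:])

    calculate(items)
    return int(grid[items, capacity])


print(solve_recursively(5, 1600, [1030, 1573, 162, 174, 312],
                        [1330, 1675, 193, 216, 327]))
```

<p>Each item costs one level of recursion, so the call depth grows linearly with the number of items.</p>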
<p>However, the recursive approach is clearly not scalable. Python does not optimize tail calls, and the depth of the recursion stack is, by default, limited to about one thousand frames. This limit is surely conservative but, when we require a depth of millions, stack overflow is highly likely. Moreover, the experiment shows that recursion does not even provide a performance advantage over a NumPy-based solver with the outer <code>for</code> loop.</p>
<p>This is where we run out of the tools provided by Python and its libraries (to the best of my knowledge). If you absolutely need to speed up the loop that implements a recursive algorithm, you will have to resort to Cython, or to a JIT-compiled version of Python, or to another language.</p>
<h3 id="heading-takeaways">Takeaways</h3>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*YvnvnzC2wMwZ_EF3qFHBcg.png" alt="Image" width="800" height="325" loading="lazy">
<em>Running times of knapsack problem solvers</em></p>
<ul>
<li>Do numerical calculations with NumPy functions. They are two orders of magnitude faster than Python’s built-in tools.</li>
<li>Of Python’s built-in tools, list comprehension is faster than <code>map()</code>, which is significantly faster than <code>for</code>.</li>
<li>For deeply recursive algorithms, loops are more efficient than recursive function calls.</li>
<li>You cannot replace recursive loops with <code>map()</code>, list comprehension, or a NumPy function.</li>
<li>“Dumb” code (broken down into elementary operations) is the slowest. Use built-in functions and tools.</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
