<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ kaggle - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ kaggle - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Thu, 21 May 2026 10:21:43 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/kaggle/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Improve Your Data Science Skills by Solving Kaggle Challenges ]]>
                </title>
                <description>
                    <![CDATA[ Data science competitions can help you improve your data science skills. We just posted a course on the freeCodeCamp.org YouTube channel that is designed to help you understand and complete Kaggle competitions, from data exploration to model building... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/improve-your-data-science-skills-by-solving-kaggle-challenges/</link>
                <guid isPermaLink="false">66fb0551040c07e67e1261dd</guid>
                
                    <category>
                        <![CDATA[ kaggle ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Mon, 30 Sep 2024 20:08:49 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1727726892494/958dce2c-ac80-4bee-87a3-fb8fd5bd2aef.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Data science competitions can help you improve your data science skills.</p>
<p>We just posted a course on the <a target="_blank" href="http://freeCodeCamp.org">freeCodeCamp.org</a> YouTube channel that is designed to help you understand and complete Kaggle competitions, from data exploration to model building and leaderboard submissions. Rohan Kumar from S.M.D.S developed this course.</p>
<h3 id="heading-why-kaggle"><strong>Why Kaggle?</strong></h3>
<p>Kaggle is the premier platform for data science competitions, offering a unique opportunity to apply your skills to real-world problems. Whether you're a beginner eager to learn or an experienced data scientist looking to refine your techniques, Kaggle provides a dynamic environment to test and expand your capabilities.</p>
<h3 id="heading-course-overview"><strong>Course Overview</strong></h3>
<p>This course teaches you how to complete Kaggle competitions, focusing on three specific challenges. The course covers every step of the process, ensuring you gain practical experience and insights along the way. Here's what you can expect:</p>
<ul>
<li><p><strong>Selecting the Right Competition:</strong> Learn how to choose competitions that match your skill level and interests, setting you up for success from the start.</p>
</li>
<li><p><strong>Data Exploration and Preprocessing:</strong> Discover techniques for understanding and preparing datasets, a crucial step in any data science project.</p>
</li>
<li><p><strong>Feature Engineering:</strong> Unlock the power of feature engineering to extract valuable insights and improve model performance.</p>
</li>
<li><p><strong>Model Selection and Evaluation:</strong> Explore popular machine learning algorithms and learn how to evaluate their effectiveness.</p>
</li>
<li><p><strong>Hyperparameter Tuning:</strong> Fine-tune your models to achieve optimal accuracy and performance.</p>
</li>
<li><p><strong>Submission Strategies:</strong> Gain insights into preparing and submitting predictions to the Kaggle leaderboard.</p>
</li>
</ul>
<p>This course provides a hands-on learning experience. By following along with the tutorial and working on competition projects, you'll develop a solid understanding of the entire data science workflow. You'll learn practical skills applicable to real-world projects, from data manipulation to model evaluation.</p>
<h3 id="heading-conclusion"><strong>Conclusion</strong></h3>
<p>Whether you're looking to enhance your competition skills or gain practical data science experience, our course offers the guidance and insights you need. Watch the full course on <a target="_blank" href="https://www.youtube.com/watch?v=BV03sQ0srcU">the freeCodeCamp.org YouTube channel</a> (2-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/BV03sQ0srcU" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Download a Kaggle Dataset Directly to a Google Colab Notebook ]]>
                </title>
                <description>
                    <![CDATA[ Kaggle is a popular data science-based competition platform that has a large online community of data scientists and machine learning engineers. The platform contains a ton of datasets and notebooks that you can use to learn and practice your data sc... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-download-kaggle-dataset-to-google-colab/</link>
                <guid isPermaLink="false">66b902cb941d2f900bad52a6</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Google Colab ]]>
                    </category>
                
                    <category>
                        <![CDATA[ kaggle ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Md. Fahim Bin Amin ]]>
                </dc:creator>
                <pubDate>Thu, 08 Feb 2024 19:39:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/02/Kaggle-to-Colab.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p><a target="_blank" href="https://www.kaggle.com/">Kaggle</a> is a popular data science-based competition platform that has a large online community of data scientists and machine learning engineers.</p>
<p>The platform contains a ton of datasets and notebooks that you can use to learn and practice your data science and machine learning skills. They even have competitions you can participate in.</p>
<p>Kaggle offers a 100% free platform for all users – but there are some restrictions depending on the resources you're using. </p>
<p>For example, you can use their CPU system for an unlimited amount of time. But there are strict limitations on GPU and TPU usage. You can use their GPU for 30 hours and TPU for 20 hours in a week. The quota resets each week, so you get a fresh 30 hours of GPU usage and 20 hours of TPU usage at the start of the new week.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_14-21.png" alt="Image" width="600" height="400" loading="lazy">
<em>Kaggle Website</em></p>
<p>Alongside Kaggle, there are other popular platforms for machine learning engineers and data scientists – like <a target="_blank" href="https://colab.google/">Google Colaboratory</a>, or Google Colab for short.</p>
<p>In Google Colab, you can also use their CPU and GPU, but the free version has more limitations than a free Kaggle account. In Google Colab, you cannot get any GPU compute until they allocate it from their free units. You don't know how many hours you can use, or whether you will get any units over the next few days.</p>
<p>In order to get all the features, you need to subscribe to their pro plans which are quite expensive.</p>
<p>But sometimes you still may want to use Colab, in most cases for short tasks. In Colab, you can directly connect your Google Drive and use your datasets from there. You can also store your output from the notebook to Google Drive if you want.</p>
<p>When you're working on a project, though, sometimes you'll want to use datasets from Kaggle in Google Colab. So you'll need to download the dataset from Kaggle and upload that to Colab's temporary storage or your Google Drive. </p>
<p>You can probably guess that this is a very time-consuming process. </p>
<p>But there is a way that you can directly download a Kaggle dataset using an API call in the Google Colab's notebook! In this article, I am going to show you how you can do that.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<p>I've broken this tutorial down into separate parts for better understanding. You can get a clear overview of the entire article here:</p>
<ul>
<li><a class="post-section-overview" href="#heading-types-of-kaggle-datasets">Types of Kaggle datasets</a></li>
<li><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></li>
<li><a class="post-section-overview" href="#setup-google-colab-for-using-kaggle-api">Setup Google Colab for using Kaggle API</a></li>
<li><a class="post-section-overview" href="#install-kaggle-library">Install Kaggle library</a></li>
<li><a class="post-section-overview" href="#heading-mount-google-drive-to-colab">Mount Google Drive to Colab</a></li>
<li><a class="post-section-overview" href="#add-the-kaggle-api-token-to-colab-notebook">Add the Kaggle API Token to Colab Notebook</a></li>
<li><a class="post-section-overview" href="#download-kaggle-dataset">Download Kaggle dataset</a></li>
<li><a class="post-section-overview" href="#download-kaggle-competition-dataset">Download Kaggle Competition dataset</a></li>
<li><a target="_blank" href="https://www.freecodecamp.org/news/p/906afd5c-ae59-4f19-9fe3-662d110d63a7/download-specifc-file-from-kaggle-competition-dataset">Download Specific file from Kaggle Competition dataset</a></li>
<li><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></li>
</ul>
<h2 id="heading-video">Video</h2>
<p>If you would like to watch all of the steps from a video, you're in luck – I made this video just for you:</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/7Z0s-XDXR1E" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<h2 id="heading-types-of-kaggle-datasets">Types of Kaggle Datasets</h2>
<p>Normally Kaggle provides two types of datasets: typical datasets that anyone can upload, and competition datasets. In the competition datasets, the competition organizers typically add/upload the datasets. </p>
<p>Even though you can download a Kaggle dataset easily, you can't download a competition dataset if you don't participate in that competition. But some competitions remain open, and you can access their datasets via "Late Submission". So just make sure to check.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To go through this tutorial and get the most out of it, you'll need a Kaggle account, which is completely free. Simply head over to the official website of <a target="_blank" href="https://www.kaggle.com/">Kaggle</a>, and create an account if you don't have one already.</p>
<p>You'll also need Kaggle's API. Head over to the <a target="_blank" href="https://www.kaggle.com/settings">settings</a> of your Kaggle account. Go to the API section, and click "Create New Token". Keep in mind that Kaggle does not allow you to keep multiple tokens. You can use only one active token for your Kaggle account.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_14-52.png" alt="Image" width="600" height="400" loading="lazy">
<em>Kaggle API Token</em></p>
<p>This will give you a <code>kaggle.json</code> file. Keep it safe, as you'll need to use it later.</p>
<p>You also need a Google account if you want to use Google Colab. You may already have one, but if you don't, go ahead and create a new account in Google.</p>
<p>Now, you can store your Kaggle JSON in your Google drive. I prefer to create a new folder and keep my JSON file there so that I can call that in Colab whenever I want.</p>
<h2 id="heading-how-to-setup-google-colab-to-use-the-kaggle-api">How to Setup Google Colab to Use the Kaggle API</h2>
<p>You can simply open any Colab notebook where you want to use the Kaggle API to download the dataset.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_15-45.png" alt="Image" width="600" height="400" loading="lazy">
<em>Google Colab</em></p>
<h3 id="heading-install-the-kaggle-library">Install the Kaggle library</h3>
<p>You need to install the Kaggle Python library before you start working with Kaggle. You can simply install it in the Colab notebook using the command <code>! pip install kaggle</code>.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_15-46.png" alt="Image" width="600" height="400" loading="lazy">
<em>Install Kaggle library in colab</em></p>
<h3 id="heading-mount-google-drive-to-colab">Mount Google Drive to Colab</h3>
<p>Now you need to mount your Google Drive to the Colab notebook, since you've uploaded your <code>kaggle.json</code> file inside your Google drive.</p>
<p>You can simply do that by using the two lines of code given below:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> google.colab <span class="hljs-keyword">import</span> drive
drive.mount(<span class="hljs-string">'/content/drive'</span>)
</code></pre>
<p>Make sure to give it permission to access your Google Drive:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_15-48.png" alt="Image" width="600" height="400" loading="lazy">
<em>Give access to Google Drive</em></p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_15-49.png" alt="Image" width="600" height="400" loading="lazy">
<em>Mount Google Drive</em></p>
<p>If you refresh the mounted folder icon, you will see your Google Drive and all of the content in the notebook.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_15-49_1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Find MyDrive in Notebook</em></p>
<h3 id="heading-add-the-kaggle-api-token-to-the-colab-notebook">Add the Kaggle API Token to the Colab Notebook</h3>
<p>Now you need to add the Kaggle API token to the notebook. But before that, you can simply create a temporary directory for Kaggle at the temporary instance location on the Colab drive by using the command <code>! mkdir ~/.kaggle</code>.</p>
<p>Now you need to copy your uploaded JSON file to that temporary Kaggle directory. You need the path where you uploaded your JSON file earlier. You can grab that path directly from the drive folder in the notebook.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/Screenshot-2024-02-08-155504.png" alt="Image" width="600" height="400" loading="lazy">
<em>Copy JSON file location</em></p>
<p>You can get the path directly like this. </p>
<p>Then you can use the copy command like below:</p>
<pre><code class="lang-bash">! cp kaggle_json_path ~/.kaggle/
</code></pre>
<p>For example, my JSON file is located at "/content/drive/MyDrive/Kaggle_API/kaggle.json", so my command would be:</p>
<pre><code class="lang-bash">! cp /content/drive/MyDrive/Kaggle_API/kaggle.json ~/.kaggle/
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_15-58_1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Copy JSON file</em></p>
<p>Now you need to restrict the file permissions to read/write for the owner only, for safety.</p>
<p>You can use the command below to achieve that:</p>
<pre><code class="lang-bash">! chmod 600 ~/.kaggle/kaggle.json
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_15-59.png" alt="Image" width="600" height="400" loading="lazy">
<em>Change file permission of kaggle.json file</em></p>
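<p>If you prefer to keep everything in Python cells, the three shell steps above (create the directory, copy the token, tighten the permissions) can be sketched with the standard library. This is only an illustration: the function name <code>install_kaggle_token</code> and its <code>home</code> parameter are made up for this example and are not part of the Kaggle library.</p>

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy kaggle.json into ~/.kaggle and restrict it to owner read/write.

    `home` is a hypothetical override used here so the sketch can be
    tried against a scratch directory instead of the real home folder.
    """
    home_dir = Path(home) if home is not None else Path.home()
    kaggle_dir = home_dir / ".kaggle"
    # Like `mkdir -p ~/.kaggle`: no error if the directory already exists
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    target = kaggle_dir / "kaggle.json"
    # Like `cp kaggle_json_path ~/.kaggle/`
    shutil.copy(token_path, target)
    # Like `chmod 600 ~/.kaggle/kaggle.json`
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

<p>With the Drive path used in this article, the call would be <code>install_kaggle_token("/content/drive/MyDrive/Kaggle_API/kaggle.json")</code>.</p>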
<h2 id="heading-how-to-download-the-kaggle-dataset">How to Download the Kaggle Dataset</h2>
<p>For downloading a typical Kaggle dataset, you have to find the dataset on Kaggle first.</p>
<p>Let's say I want to download the following dataset from Kaggle:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_16-01.png" alt="Image" width="600" height="400" loading="lazy">
<em>Sample dataset</em></p>
<p>Check the complete URL of the dataset, which in this case is:</p>
<p><a target="_blank" href="https://www.kaggle.com/datasets/mdfahimbinamin/fastsurfer-processed-3d-brain-mri-from-adni">https://www.kaggle.com/datasets/mdfahimbinamin/fastsurfer-processed-3d-brain-mri-from-adni</a></p>
<p>We need the "account_name_of_the_dataset_owner/dataset_path" string. From the URL, the account name of the dataset owner is mdfahimbinamin. The dataset path is fastsurfer-processed-3d-brain-mri-from-adni.</p>
<p>So to download this exact dataset from Kaggle to your Google Colab notebook, your command would be:</p>
<pre><code class="lang-bash">! kaggle datasets download mdfahimbinamin/fastsurfer-processed-3d-brain-mri-from-adni
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_16-06.png" alt="Image" width="600" height="400" loading="lazy">
<em>Downloading the Kaggle dataset to your Colab notebook</em></p>
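<p>Picking the "owner/dataset" string out of the URL by hand is easy to get wrong, so here is a small sketch that does it with Python's standard <code>urllib.parse</code>. The helper name <code>dataset_slug</code> is hypothetical, and it assumes the URL has the <code>/datasets/owner/name</code> shape shown above.</p>

```python
from urllib.parse import urlparse


def dataset_slug(url):
    """Return 'owner/dataset' from a kaggle.com/datasets/... URL."""
    # Split the URL path into its non-empty segments
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expect: datasets / owner / dataset-name
    if parts[:1] != ["datasets"] or len(parts[1:3]) != 2:
        raise ValueError(f"not a Kaggle dataset URL: {url}")
    owner, name = parts[1:3]
    return f"{owner}/{name}"


url = "https://www.kaggle.com/datasets/mdfahimbinamin/fastsurfer-processed-3d-brain-mri-from-adni"
print(dataset_slug(url))
```

<p>The printed string is the argument you pass to <code>kaggle datasets download</code>.</p>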
<p>The entire download happens on Google's cloud machines, so the download speed should be quite fast.</p>
<p>By default, the datasets come as a <code>.zip</code> file. So if you need to unzip that, you can simply use the command below:</p>
<pre><code class="lang-bash">! unzip dataset-path.zip
</code></pre>
<p>For example, my dataset name/path was "fastsurfer-processed-3d-brain-mri-from-adni". So I will use the following command:</p>
<pre><code class="lang-bash">! unzip fastsurfer-processed-3d-brain-mri-from-adni.zip
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_16-09.png" alt="Image" width="600" height="400" loading="lazy">
<em>Unzip Kaggle Dataset</em></p>
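<p>The same extraction can be done from Python with the standard <code>zipfile</code> module, which is handy when you want the list of extracted files back. This is a sketch, not the article's method: the helper name <code>extract_dataset</code> is made up, and unlike the plain <code>unzip</code> command above it extracts into a folder named after the archive by default.</p>

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, dest_dir=None):
    """Extract a downloaded dataset archive and return its sorted member names."""
    zip_path = Path(zip_path)
    # Default destination: a folder named after the archive (my-data.zip becomes my-data/)
    dest = Path(dest_dir) if dest_dir is not None else zip_path.with_suffix("")
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sorted(archive.namelist())
```

<p>For the dataset in this article, the call would be <code>extract_dataset("fastsurfer-processed-3d-brain-mri-from-adni.zip")</code>.</p>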
<p>That's it! 😊</p>
<h2 id="heading-how-to-download-a-kaggle-competition-dataset">How to Download a Kaggle Competition Dataset</h2>
<p>Before downloading a competition dataset, you need to make sure that you have either joined that competition or selected "Late Submission" using the same Kaggle account that you're using for the Kaggle API token.</p>
<p>Suppose I'm joining the ConnectX competition on Kaggle.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_16-15.png" alt="Image" width="600" height="400" loading="lazy">
<em>Connect X competition</em></p>
<p>I need to click "Join Competition" to get access to their dataset.</p>
<p>But if I want to download a dataset from a past competition, I need to join its "Late Submission" to get access to its dataset.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_16-16.png" alt="Image" width="600" height="400" loading="lazy">
<em>Join a past competition</em></p>
<p>After clicking on "Late Submission", I need to grab the URL. This time, I'm using the Binary Classification with a Bank Churn Dataset. The complete URL is: <a target="_blank" href="https://www.kaggle.com/competitions/playground-series-s4e1/overview">https://www.kaggle.com/competitions/playground-series-s4e1/overview</a></p>
<p>From the URL, I can see that the dataset is located at "playground-series-s4e1". So I will use the following command to download the dataset to my Google Colab notebook:</p>
<pre><code class="lang-bash">! kaggle competitions download playground-series-s4e1
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_16-19.png" alt="Image" width="600" height="400" loading="lazy">
<em>Download dataset</em></p>
<p>That's it! 😊</p>
<h2 id="heading-how-to-download-a-specific-file-from-a-kaggle-competition-dataset">How to Download a Specific File from a Kaggle Competition Dataset</h2>
<p>Let's say I want to download only one specific file from a Kaggle competition dataset. I can do that, too.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_16-21.png" alt="Image" width="600" height="400" loading="lazy">
<em>dataset</em></p>
<p>In the dataset used above, you can see that there are 3 files. Let's say I want to download the <code>test.csv</code> file only. </p>
<p>To do this, the command would be structured like this: <code>! kaggle competitions download dataset-path -f file_name_with_extension</code>.</p>
<p>So my command would be:</p>
<pre><code class="lang-bash">! kaggle competitions download playground-series-s4e1 -f test.csv
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2024-02-08_16-23.png" alt="Image" width="600" height="400" loading="lazy">
<em>Download specific file</em></p>
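<p>If you download from several competitions, it can help to build the command in Python instead of retyping it. The sketch below only assembles the same arguments used in the two commands above. The function name <code>competition_download_command</code> is hypothetical, and it does not call Kaggle itself.</p>

```python
def competition_download_command(competition, file_name=None):
    """Build the argument list for `kaggle competitions download`."""
    command = ["kaggle", "competitions", "download", competition]
    if file_name is not None:
        # -f limits the download to a single file, as in the command above
        command += ["-f", file_name]
    return command


print(competition_download_command("playground-series-s4e1"))
print(competition_download_command("playground-series-s4e1", "test.csv"))
```

<p>You can pass the resulting list to <code>subprocess.run(command, check=True)</code> in a notebook cell once your API token is in place.</p>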
<p>That's it! 😊</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>I hope you have gained some valuable insights from the article.</p>
<p>If you have enjoyed the procedures step-by-step, then don't forget to let me know on <a target="_blank" href="https://twitter.com/Fahim_FBA">Twitter/X</a> or <a target="_blank" href="https://www.linkedin.com/in/fahimfba/">LinkedIn</a>.</p>
<p>You can follow me on <a target="_blank" href="https://github.com/FahimFBA">GitHub</a> as well if you are interested in open source. Make sure to check <a target="_blank" href="https://fahimbinamin.com/">my website</a> (<a target="_blank" href="https://fahimbinamin.com/">https://fahimbinamin.com/</a>) as well!</p>
<p>If you like to watch programming and technology-related videos, then you can check my <a target="_blank" href="https://www.youtube.com/@FahimAmin?sub_confirmation=1">YouTube channel</a>, too. You can also check my other writings on <a target="_blank" href="https://dev.to/fahimfba">Dev.to</a>.</p>
<p>All the best for your programming and development journey. 😊</p>
<p>You can do it! Don't give up, never! ❤️</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Python Data Analysis: How to Visualize a Kaggle Dataset with Pandas, Matplotlib, and Seaborn ]]>
                </title>
                <description>
                    <![CDATA[ By Srijan The Indian Premier League or IPL is a T20 cricket tournament organized annually by the Board of Control for Cricket In India (BCCI). Eight city-based franchises compete with each other over 6 weeks to find the winner. In this article, I'm g... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/kaggle-dataset-analysis-with-pandas-matplotlib-seaborn/</link>
                <guid isPermaLink="false">66d4614a73634435aafcefdc</guid>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analytics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ kaggle ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Matplotlib ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 22 Oct 2020 17:49:27 +0000</pubDate>
                <media:content url="https://cdn-media-2.freecodecamp.org/w1280/5f9c9822740569d1a4ca1855.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Srijan</p>
<p>The <strong>Indian Premier League</strong> or IPL is a T20 cricket tournament organized annually by the Board of Control for Cricket In India (BCCI). Eight city-based franchises compete with each other over 6 weeks to find the winner.</p>
<p>In this article, I'm going to analyze data from the IPL's past seasons to see which teams have won the most games, how teams behave when winning a toss, who has the greatest legacy, and so on. </p>
<p>I have done this analysis from a historical point of view, giving an overview of what has happened in the IPL over the years. I have used tools such as <em>Pandas</em>, <em>Matplotlib</em> and <em>Seaborn</em> along with <em>Python</em> to give a visual as well as numeric representation of the data in front of us.</p>
<p><strong>Pandas</strong> stands for <em>Python Data Analysis</em> library. It is typically used for working with tabular data (similar to the data stored in a spreadsheet). Pandas provides helper functions to read data from various file formats like CSV, Excel spreadsheets, HTML tables, JSON, SQL and perform operations on them.</p>
<p><strong>Matplotlib</strong> and <strong>Seaborn</strong> are two Python libraries that are used to produce plots. Matplotlib is generally used for plotting lines, pie charts, and bar graphs. </p>
<p>Seaborn provides some more advanced visualization features with less syntax and more customizations. I switch back-and-forth between them during the analysis.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><a class="post-section-overview" href="#heading-1-getting-the-dataset">Getting the Dataset</a></li>
<li><a class="post-section-overview" href="#heading-2-data-preparation-and-cleaning">Data Preparation and Cleaning</a></li>
<li><a class="post-section-overview" href="#heading-3-exploratory-analysis-and-visualization">Exploratory Analysis and Visualization</a></li>
<li><a class="post-section-overview" href="#asking-and-answering-questions">Asking and Answering Questions</a></li>
<li><a class="post-section-overview" href="#heading-5-inferences-from-the-analysis">Inferences From the Analysis</a></li>
<li><a class="post-section-overview" href="#heading-6-conclusion">Conclusion</a></li>
</ol>
<h2 id="heading-1-getting-the-dataset">1. Getting the Dataset</h2>
<p>I downloaded the dataset from <a target="_blank" href="https://www.kaggle.com/nowke9/ipldata">Kaggle</a>. You will see there are two CSV (Comma Separated Value) files, matches.csv and deliveries.csv. I chose to do my analysis on matches.csv.</p>
<p>To find more interesting datasets, you can look at <a target="_blank" href="https://jovian.ml/forum/t/recommended-datasets-for-course-project/11711">this</a> page.</p>
<h2 id="heading-2-data-preparation-and-cleaning">2. Data Preparation and Cleaning</h2>
<p>A dataset contains many columns and rows. It is always possible that certain rows have missing values or <code>NaN</code> for one or more columns. </p>
<p>It is also possible that there might be certain columns or rows that you want to discard from your analysis. You can also combine two or more datasets for an in-depth analysis.</p>
<p>Cleaning the data involves making corrections to that data, leaving out unnecessary columns or rows, merging datasets, and so on.</p>
<p>Before taking these steps, I needed to install and import the tools (<em>libraries</em>) to be used during the analysis. I imported the libraries with different aliases such as <code>pd</code>, <code>plt</code> and <code>sns</code>.  I then set some basic styles for the plots.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=5" height="308" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>Notice the special command <code>%matplotlib inline</code>. It makes sure that plots are shown and embedded within the Jupyter notebook itself. Without this command, sometimes plots may show up in pop-up windows.</p>
<p>Using the <code>read_csv()</code> method from the <em>Pandas</em> library, I loaded the <em>matches.csv</em> file.</p>
<p>Data from the file is read and stored in a <code>DataFrame</code> object - one of the core data structures in Pandas for storing and working with tabular data. I used the <code>_df</code> suffix in the variable names for data frames.</p>
<p>I used the name <code>matches_raw_df</code> for the data frame. This indicates that this is unprocessed data that I will clean, filter, and modify to prepare a data frame that's ready for analysis.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=9" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=10" height="308" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>Using the <code>shape</code> property of a <code>DataFrame</code> object, I found that the dataset contains 756 rows and 18 columns. To find the names of those columns I used the <code>columns</code> property. It returned the column labels of the data frame.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=11" height="138" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=13" height="222" width="800" title="Embedded content" loading="lazy"></iframe></div>
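<p>As a rough illustration of these steps, the sketch below loads a tiny stand-in for <em>matches.csv</em> from an in-memory string and inspects its <code>shape</code> and <code>columns</code>. The rows and the reduced column set are made up for the example, so they are not the real IPL data. With the real file you would pass its path to <code>read_csv()</code> instead.</p>

```python
import io

import pandas as pd

# Toy stand-in for matches.csv: three made-up rows, five columns
csv_text = """id,season,winner,result,umpire3
1,2017,Team A,normal,
2,2017,Team B,normal,
3,2018,,no result,
"""

matches_raw_df = pd.read_csv(io.StringIO(csv_text))

print(matches_raw_df.shape)          # tuple of (rows, columns)
print(list(matches_raw_df.columns))  # column labels as a plain list
```

<p>Empty fields, such as every <code>umpire3</code> entry here, are read as <code>NaN</code>.</p>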

<p>To get a summary of what the data frame contains, I used <code>info()</code>. This gives information about columns, number of non-null values in each column, their data type, and memory usage.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=15" height="717" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>Every column except <code>umpire3</code> has few or no null values. Null values can result from missing information or from incorrect data entry. </p>
<p>An interesting thing to observe is that, although there are no null values for the <code>result</code> column, there are some for <code>winner</code> and <code>player_of_match</code> columns. Let's find out why.</p>
<p>I first accessed the <code>result</code> column using <em>dot notation</em> (<code>matches_raw_df.result</code>). Then I used the <code>value_counts()</code> method on the <code>result</code> column.</p>
<p><code>value_counts()</code> returns a <em>series</em> which contains counts of unique values. Here, it tells us about the different values present in <code>result</code> and the total number for each of them.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=18" height="218" width="800" title="Embedded content" loading="lazy"></iframe></div>
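<p>To illustrate, here is a small sketch on invented rows. It counts the values of <code>result</code> and then looks at the <em>no result</em> rows to confirm that they have no winner. The variable names <code>result_counts</code> and <code>no_result_rows</code> are only for this example.</p>

```python
import pandas as pd

# Made-up rows for illustration only.
matches_raw_df = pd.DataFrame({
    "result": ["normal", "normal", "tie", "no result", "normal"],
    "winner": ["Team A", "Team B", "Team A", None, "Team B"],
})

# Count how often each distinct value appears in the result column.
result_counts = matches_raw_df.result.value_counts()

# Keep only the abandoned matches and inspect their winner column.
no_result_rows = matches_raw_df[matches_raw_df.result == "no result"]

print(result_counts)
print(no_result_rows)
```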

<p>So, out of 756 matches (rows), 4 matches ended as <em>no result</em>. </p>
<p>Cricket is an outdoor sport and, unlike football, play isn't possible when it's raining. It is common for matches to be abandoned because of persistent rain. Therefore, we have no winner or player of the match for these 4 matches.</p>
<p>For this analysis, the <code>umpire3</code> column isn't needed. So I removed the column using the <code>drop()</code> method by passing the column name and axis value. If you want to remove multiple columns, pass the column names as a list.</p>
<p>I assigned this <strong>cleaned</strong> data frame to <code>matches_df</code>. I used this data frame for further analysis.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=22" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=23" height="308" width="800" title="Embedded content" loading="lazy"></iframe></div>
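<p>A short sketch of <code>drop()</code> on made-up data, showing both the single-column and the list form. The name <code>trimmed_df</code> is hypothetical and only there to show the list variant.</p>

```python
import pandas as pd

# Made-up rows for illustration only.
matches_raw_df = pd.DataFrame({
    "id": [1, 2],
    "umpire1": ["X", "Y"],
    "umpire2": ["P", "Q"],
    "umpire3": [None, None],
})

# axis=1 means "drop a column"; drop() returns a new data frame.
matches_df = matches_raw_df.drop("umpire3", axis=1)

# Several columns at once: pass the names as a list.
trimmed_df = matches_raw_df.drop(["umpire2", "umpire3"], axis=1)

print(list(matches_df.columns))
print(list(trimmed_df.columns))
```

<p>Because <code>drop()</code> returns a new object by default, <code>matches_raw_df</code> still holds the original columns afterwards.</p>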

<h2 id="heading-3-exploratory-analysis-and-visualization">3. Exploratory Analysis and Visualization</h2>
<p>Exploratory analysis involves performing operations on the dataset to understand the data and find patterns. It helps us make sense of the data we have. </p>
<p>Visualization is the graphic representation of data. It involves producing charts that communicate those patterns among the represented data to viewers.</p>
<p>Now, let's take a look at the data I analyzed and what I learned in the process.</p>
<h3 id="heading-number-of-matches-and-teams">Number of matches and teams</h3>
<p>I tried to find the number of matches played in each season in the IPL from its inception to 2019.</p>
<p>Since I needed matches played each season, it made sense to group our data according to different seasons. Pandas has a <code>groupby()</code> method to achieve this, wherein I passed <code>season</code> as an argument.</p>
<p>Since an <code>id</code> is unique for each match (row), counting the number of ids for each season leads to what we want. I used the <code>count()</code> method on the <code>id</code> column to find the number of matches held each season. This series is assigned to the variable <code>matches_per_season</code>.</p>
<p>I then used the <code>barplot()</code> function from the Seaborn library to plot the series. The index of the series (the seasons) was passed as the x-values, and the corresponding counts as the y-values.</p>
<p>I used various <code>matplotlib.pyplot</code> functions such as <code>figure()</code>, <code>xticks()</code> and <code>title()</code> to set the size of the plot, the title of the plot, and so on. </p>
<p><code>figure()</code> takes a parameter, <code>figsize</code>, which I set to <code>(12,6)</code>. Notice that the size is given as a tuple. For <code>xticks()</code>, I set the <code>rotation</code> parameter to <code>75</code> so that the tick labels are easier to read. </p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=30" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=31" height="565" width="800" title="Embedded content" loading="lazy"></iframe></div>
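<p>The steps above can be sketched roughly as follows. The rows are invented, and the bars are drawn with Matplotlib's <code>plt.bar()</code> rather than Seaborn's <code>barplot()</code> so that the example needs only one plotting library. The <code>Agg</code> backend is an assumption that lets the script run without a display.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows: one row per match.
matches_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "season": [2008, 2008, 2009, 2009, 2009],
})

# Each id is unique, so counting ids per season gives matches per season.
matches_per_season = matches_df.groupby("season").id.count()

plt.figure(figsize=(12, 6))                      # size given as a tuple
seasons = [str(s) for s in matches_per_season.index]
plt.bar(seasons, matches_per_season.values)      # seasons on x, counts on y
plt.xticks(rotation=75)                          # tilt the tick labels
plt.title("Matches per season (toy data)")
plt.close("all")

print(matches_per_season)
```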

<p>Each season, almost 60 matches were played. However, we see a spike in the number of matches from 2011 to 2013. This is because two new franchises, the <strong>Pune Warriors</strong> and <strong>Kochi Tuskers Kerala</strong>, were introduced, increasing the number of teams to 10.</p>
<p>However, Kochi was removed in the very next season, while the Pune Warriors were removed in 2013, bringing the number down to 8 from 2014 onwards.</p>
<p>Before the start of the 2016 season, two teams, the <strong>Chennai Super Kings</strong> and <strong>Rajasthan Royals</strong> were banned for two seasons. To make up for their absence, two new teams (the <strong>Rising Pune Supergiants</strong> and <strong>Gujarat Lions</strong>) entered the competition.</p>
<p>When the Chennai Super Kings and Rajasthan Royals returned, these two teams were removed from the competition.</p>
<h3 id="heading-analyzing-the-toss-results">Analyzing the Toss results</h3>
<p>One of the most significant events in any cricket match is the toss, which happens at the very start of a match. The toss winner can choose whether they want to bat first or second (fielding first). </p>
<p>Let's see what the trend has been amongst the teams across different seasons.</p>
<p>Again I grouped the rows by season and then counted the different values of the <code>toss_decision</code> column by using <code>value_counts()</code>. </p>
<p>Since a percentage gives a clearer picture, I divided the above result by <code>matches_per_season</code> and multiplied it by 100. This series was assigned to <code>toss_decision_percentage</code>.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=35" height="105" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=36" height="643" width="800" title="Embedded content" loading="lazy"></iframe></div>
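<p>Here is one way this calculation might look on invented rows. The sketch uses <code>Series.div()</code> with <code>level="season"</code> to make the alignment between the two series explicit. That is a choice made for this example and may differ from the notebook, and <code>toss_counts</code> is a hypothetical intermediate name.</p>

```python
import pandas as pd

# Made-up rows for illustration only.
matches_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "season": [2008, 2008, 2008, 2008, 2009, 2009],
    "toss_decision": ["field", "field", "bat", "field", "bat", "bat"],
})

matches_per_season = matches_df.groupby("season").id.count()

# Counts per (season, toss_decision) pair; the result has a multi-index.
toss_counts = matches_df.groupby("season").toss_decision.value_counts()

# Divide each count by that season's total, matching on the season level.
toss_decision_percentage = toss_counts.div(matches_per_season, level="season") * 100

print(toss_decision_percentage)
```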

<p>Here, <code>toss_decision_percentage</code> is a series with <em>multi-index</em>. If we print the index of the series using the <code>index</code> property, we see it is of the form <code>(2008, 'bat'), (2008, 'field')</code> and so on. </p>
<p>The series used both <code>season</code> and <code>toss_decision</code> as an index. But I only wanted the seasons to be an index. I used <code>unstack()</code> to achieve this. </p>
<p>By using the <code>unstack()</code> method on the series, it converted the values of <code>toss_decision</code> (that is, <code>bat</code> and <code>field</code>) into separate columns.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/85&amp;cellId=38" height="490" width="800" title="Embedded content" loading="lazy"></iframe></div>
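<p>A small sketch of what <code>unstack()</code> does to a multi-indexed series. The percentages are invented, and <code>toss_table</code> is a hypothetical name for the result.</p>

```python
import pandas as pd

# Made-up percentages indexed by (season, toss_decision).
toss_decision_percentage = pd.Series(
    [25.0, 75.0, 60.0, 40.0],
    index=pd.MultiIndex.from_tuples(
        [(2008, "bat"), (2008, "field"), (2009, "bat"), (2009, "field")],
        names=["season", "toss_decision"],
    ),
)

# unstack() moves the innermost index level (toss_decision) into columns,
# leaving season as the only row index.
toss_table = toss_decision_percentage.unstack()

print(toss_table)
```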

<p>Next I used the data frame's <code>plot()</code> method from Pandas, which draws with Matplotlib, to represent these values as bar charts. <code>plot()</code> has a parameter <code>kind</code> which decides what type of plot to draw. The value was set to <code>bar</code>.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/85&amp;cellId=39" height="484" width="800" title="Embedded content" loading="lazy"></iframe></div>
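<p>As an illustration, the sketch below plots a tiny made-up table with <code>plot(kind="bar")</code>. The <code>Agg</code> backend, the title, and the axis label are assumptions added so the example runs on its own.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up percentages: one row per season, one column per toss decision.
toss_table = pd.DataFrame(
    {"bat": [25.0, 60.0], "field": [75.0, 40.0]},
    index=pd.Index([2008, 2009], name="season"),
)

# kind="bar" draws one group of bars per row and one bar per column.
ax = toss_table.plot(kind="bar", figsize=(12, 6), title="Toss decisions (toy data)")
ax.set_ylabel("Percentage of tosses")

n_bar_groups = len(ax.containers)  # one container per plotted column
plt.close("all")
print(n_bar_groups)
```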

<p>For 2008-2013, teams seemed to favour both batting first and second. For this period, teams chose to bat first more in 2009, 2010 and 2013. On the other hand, they chose fielding first more in 2008 and 2011. Things were even-steven in 2012.</p>
<p>This could be because the IPL and T20 cricket in general were in their budding stages. So, teams were probably learning and trying to figure out which option would be more beneficial.</p>
<p>However, since 2014, teams have overwhelmingly chosen to bat second. Especially since 2016, teams have chosen to field first <strong>more than 80%</strong> of the time.</p>
<p>Batting first requires that the team gauge the conditions and the pitch and then set a target accordingly. Chasing is less complicated, as there is a fixed target to achieve. </p>
<p>Conditions have also become more batsman-friendly and the skills of the batsmen have increased tremendously (<em>read more</em> <a target="_blank" href="https://www.espncricinfo.com/story/_/id/18568387/tim-wigmore-how-batting-second-become-more-fruitful-more-popular"><em>here</em></a>).</p>
<h3 id="heading-number-of-wins">Number of Wins</h3>
<p>We saw how teams in the recent past have chosen to bat second more than 4 out of 5 times. Did this decision transform the results? Let's see.</p>
<p>For <code>wins_batting_first</code>, the value of <code>win_by_wickets</code> has to be 0. Also, the <code>result</code> column should have a value of <code>normal</code>, since tied matches also have a win margin of 0. This condition was stored as <code>filter1</code>.</p>
<p>Similarly, for <code>wins_fielding_first</code>, the value of <code>win_by_runs</code> has to be 0 and the <code>result</code> column should have a value of <code>normal</code>. This condition was stored as <code>filter1</code>.</p>
<p>For both series, I used the <code>count()</code> method on the <code>winner</code> column to count the matches won under each filter. I divided the results by <code>matches_per_season</code>, calculated earlier, to express them as a share of the matches played each season.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/88&amp;cellId=43" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/88&amp;cellId=44" height="105" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/88&amp;cellId=45" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/89&amp;cellId=46" height="105" width="800" title="Embedded content" loading="lazy"></iframe></div>
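<p>The filtering described above could be sketched like this on invented rows. The filter names <code>batting_first_filter</code> and <code>fielding_first_filter</code> are hypothetical and chosen for readability. They are not the names used in the notebook.</p>

```python
import pandas as pd

# Made-up rows for illustration only.
matches_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "season": [2008, 2008, 2008, 2009, 2009],
    "result": ["normal", "normal", "tie", "normal", "normal"],
    "winner": ["A", "B", "A", "B", "B"],
    "win_by_runs": [10, 0, 0, 5, 0],
    "win_by_wickets": [0, 6, 0, 0, 4],
})

matches_per_season = matches_df.groupby("season").id.count()

# A win batting first has no wicket margin; result == "normal" excludes ties.
batting_first_filter = (matches_df.win_by_wickets == 0) & (matches_df.result == "normal")
# A win fielding first (chasing) has no run margin.
fielding_first_filter = (matches_df.win_by_runs == 0) & (matches_df.result == "normal")

wins_batting_first = (
    matches_df[batting_first_filter].groupby("season").winner.count() / matches_per_season
)
wins_fielding_first = (
    matches_df[fielding_first_filter].groupby("season").winner.count() / matches_per_season
)

print(wins_batting_first)
print(wins_fielding_first)
```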

<p>To plot these two series together, I combined them using Pandas' <code>concat()</code> method. I passed the two series names as a list and set the value of <code>axis</code> as <code>1</code>. This gives us a new data frame which was stored as <code>combined_wins_df</code>.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/89&amp;cellId=47" height="547" width="800" title="Embedded content" loading="lazy"></iframe></div>
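<p>A minimal sketch of <code>pd.concat()</code> with <code>axis=1</code>, using two short made-up series. The series names <code>batting_first</code> and <code>fielding_first</code> are assumptions. They become the column labels of the combined data frame.</p>

```python
import pandas as pd

seasons = pd.Index([2008, 2009], name="season")

# Made-up win shares for illustration only.
wins_batting_first = pd.Series([0.45, 0.40], index=seasons, name="batting_first")
wins_fielding_first = pd.Series([0.53, 0.58], index=seasons, name="fielding_first")

# axis=1 places the series side by side as columns, aligned on the index.
combined_wins_df = pd.concat([wins_batting_first, wins_fielding_first], axis=1)

print(combined_wins_df)
```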

<p>Next I plotted <code>combined_wins_df</code> as a bar chart using <code>plot()</code>.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=44" height="484" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>We saw earlier that for 2008-2013, teams faced a conundrum whether to bat first or field first. This is partially visible in the results as well. </p>
<p>The wins from batting first are very close to those from fielding first. However, there is just one season where teams batting first won more, with things being equal in 2013.</p>
<p>Again, since 2014, results have favoured the chasing teams in every season except 2015. Leaving out 2015, they have been overwhelmingly in favour of teams fielding first.</p>
<p>So, teams choosing to field more have been justified in their decisions.</p>
<h3 id="heading-teams-with-history">Teams with "History"</h3>
<p>In leagues across different sports, there is always talk about teams with "history" – teams that have played the most in the league and continue to do so. Let's find those teams in the IPL.</p>
<p>Now, between two teams A and B, it can be "A vs B" or "B vs A", depending on how the data entry has been done. So I decided to count the total number of different values for both the <code>team1</code> and <code>team2</code> columns using <code>value_counts()</code>. Then I added them together.</p>
<p>I sorted the results in descending order using the <code>sort_values()</code> method from Pandas. The <code>ascending</code> parameter was set to <code>False</code>.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=48" height="470" width="800" title="Embedded content" loading="lazy"></iframe></div>
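<p>Here is a rough sketch of this counting step on invented fixtures. It uses <code>add()</code> with <code>fill_value=0</code> instead of a plain <code>+</code>. This is a defensive choice for this example: a team that appears in only one of the two columns would otherwise end up as <code>NaN</code>.</p>

```python
import pandas as pd

# Made-up fixtures; team D only ever appears in the team2 column.
matches_df = pd.DataFrame({
    "team1": ["A", "B", "A", "C"],
    "team2": ["B", "C", "C", "D"],
})

team1_counts = matches_df.team1.value_counts()
team2_counts = matches_df.team2.value_counts()

# Add the two counts label by label, treating a missing label as 0,
# then sort so that the team with the most matches comes first.
total_matches_played = team1_counts.add(team2_counts, fill_value=0).sort_values(ascending=False)

print(total_matches_played)
```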

<p>Here, I used <code>sns.barplot()</code> to plot the graph.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=49" height="451" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>The <strong>Mumbai Indians</strong> have played the most matches. They are followed by the Royal Challengers Bangalore, Kolkata Knight Riders, Kings XI Punjab and Chennai Super Kings.</p>
<p>The Chennai Super Kings and Rajasthan Royals could have been higher had they not been banned.</p>
<p>You will see there are two teams from Delhi, the <strong>Delhi Daredevils</strong> and <strong>Delhi Capitals</strong>. This resulted from a change in ownership and then team name in 2018.</p>
<p>It's a similar story for the <strong>Deccan Chargers</strong> and <strong>Sunrisers Hyderabad</strong>, as the Deccan Chargers were removed from the IPL in 2013 and the Sunrisers came in their place.</p>
<p>Also, there are two teams with almost the same name: the <strong>Rising Pune Supergiants</strong> and <strong>Rising Pune Supergiant</strong>. They are the same team, and there was no change in ownership – it has more to do with superstitions.</p>
<p>In the 2016 season, the Rising Pune Supergiants finished 7th. The owners changed the captain for 2017 and also <strong>dropped the 's'</strong> from Supergiants. Well, it paid off as they finished as runner-up that season!</p>
<h3 id="heading-teams-with-legacy">Teams with "Legacy"</h3>
<p>Now, teams may have a lot of history but it's their "legacy" – how often they win – that makes them popular and attracts new and neutral fans.</p>
<p>To find such teams, I simply used <code>value_counts()</code> on the <code>winner</code> column. This gives us the number of matches that each team has won.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=53" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=54" height="433" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>So Mumbai has the most wins. But win percentage is a better metric to judge by. I divided <code>most_wins</code> by <code>total_matches_played</code> to get the <code>win_percentage</code> for each team.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=57" height="88" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=58" height="444" width="800" title="Embedded content" loading="lazy"></iframe></div>
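<p>A small sketch of the division with invented numbers. Multiplying by 100 and sorting are assumptions about presentation. Team C is given only a handful of matches to show how a short history can push a team to the top of the table.</p>

```python
import pandas as pd

# Made-up totals for illustration only.
total_matches_played = pd.Series({"A": 10, "B": 8, "C": 4})
most_wins = pd.Series({"A": 6, "C": 3, "B": 2})

# Division aligns on the team labels, so the order of the entries
# in the two series does not matter.
win_percentage = (most_wins / total_matches_played * 100).sort_values(ascending=False)

print(win_percentage)
```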

<p>The Rising Pune Supergiant and Delhi Capitals have the highest win percentage. This is largely because they have played fewer matches than most teams, especially Rising Pune Supergiant, which technically became a new team after dropping the 's'.</p>
<p>The Chennai Super Kings, despite playing two fewer seasons than the Mumbai Indians, had only 9 fewer victories. They, along with the Mumbai Indians, are the only two teams in the top 5 that were also part of the IPL in 2008.</p>
<p><strong>Chennai</strong> and <strong>Mumbai</strong> are the teams with the most legacy.</p>
<h2 id="heading-4-asking-and-answering-questions-from-the-data">4. Asking and Answering Questions from the Data</h2>
<p>We've already gained some insights about the IPL by exploring various columns of our dataset. </p>
<p>Let's ask some specific questions, and try to answer them using data frame operations and interesting visualizations.</p>
<h3 id="heading-q-who-has-won-the-ipl-tournament">Q. Who has won the IPL tournament?</h3>
<ul>
<li>Group the rows according to seasons using <code>groupby()</code>.</li>
<li>Find the last match of each season, that is, the final, using <code>tail()</code>. It returns the last n rows from a <code>DataFrame</code> object or series based on position.</li>
<li>Sort the values per season using <code>sort_values()</code>.</li>
<li>Count the different winners and the times they won using <code>value_counts()</code> on <code>winner</code>.</li>
</ul>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=65" height="134" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=66" height="264" width="800" title="Embedded content" loading="lazy"></iframe></div>
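<p>The four steps above can be sketched as follows on invented rows. The sketch assumes that the rows within each season are in chronological order, so that the last row of a season is its final. The name <code>finals_df</code> is hypothetical.</p>

```python
import pandas as pd

# Made-up rows; within each season the last row stands for the final.
matches_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "season": [2008, 2008, 2009, 2009, 2010, 2010],
    "winner": ["A", "B", "C", "A", "A", "B"],
})

# tail(1) keeps the last row of every season group.
finals_df = matches_df.groupby("season").tail(1).sort_values("season")

# Count how many finals each team has won.
ipl_winners = finals_df.winner.value_counts()

print(finals_df)
print(ipl_winners)
```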

<p>Then I plotted the series <code>ipl_winners</code> using <code>sns.barplot()</code>.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=67" height="353" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>Mumbai and Chennai, our <em>legacy</em> teams, have won the IPL at least 3 times. The Sunrisers Hyderabad are the only team that joined the league later and won the trophy.</p>
<h3 id="heading-q-which-are-the-most-and-least-consistent-teams-across-all-seasons">Q. Which are the most and least consistent teams across all seasons?</h3>
<ul>
<li>Created a data frame between different values of <code>winner</code> and <code>season</code> using <code>pd.crosstab()</code>.</li>
<li>Plotted the data frame as a heatmap.</li>
</ul>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=71" height="105" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=72" height="208" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p><code>pd.crosstab()</code> gives a simple cross-tabulation of the <code>winner</code> and <code>season</code> columns. For each different value of <code>winner</code>, <code>pd.crosstab()</code> finds its frequency for each different value in <code>season</code>. </p>
<p>Then I plotted <code>matches_won_each_season</code> using <code>sns.heatmap()</code>. I passed the data frame <code>matches_won_each_season</code>, with <code>annot</code> as <code>True</code> to have the values shown as well. Here, the darker color indicates more matches won.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=73" height="496" width="800" title="Embedded content" loading="lazy"></iframe></div>
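<p>To show what the cross-tabulation produces, here is a sketch on a few invented rows. Only the table is built here. The resulting data frame is the kind of object that would then be handed to the heatmap call.</p>

```python
import pandas as pd

# Made-up rows for illustration only.
matches_df = pd.DataFrame({
    "season": [2008, 2008, 2009, 2009, 2010, 2010],
    "winner": ["A", "B", "C", "A", "A", "B"],
})

# Rows are winners, columns are seasons, cells are numbers of wins.
matches_won_each_season = pd.crosstab(matches_df["winner"], matches_df["season"])

print(matches_won_each_season)
```

<p>Combinations that never occur, such as team B in 2009 in this toy data, appear as 0 rather than being left out.</p>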

<p>The <strong>Chennai Super Kings</strong> have been the most consistent team, winning at least 8 matches in each of the seasons they have played. This is backed up by the fact that they are the <strong>only</strong> team to reach the playoffs stage every season.</p>
<p>At the other end of the spectrum are 3 teams, the <strong>Delhi Daredevils</strong>, <strong>Kings XI Punjab</strong> and <strong>Rajasthan Royals</strong>. All three of them have had two seasons where they performed really well. However, they have been pretty average during the other seasons.</p>
<h3 id="heading-q-what-has-been-the-biggest-margin-of-victory-in-terms-of-runs-in-the-ipl">Q. What has been the biggest margin of victory in terms of runs in the IPL?</h3>
<ul>
<li>Filter the data frame using the required condition.</li>
<li>Sort the values in descending order using <code>sort_values()</code>.</li>
<li>Find the biggest 10 victories in the list using the <code>head()</code> method. It works opposite to <code>tail()</code>, returning the first n rows.</li>
</ul>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=81" height="134" width="800" title="Embedded content" loading="lazy"></iframe></div>
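<p>A rough sketch of these steps on invented rows. The filter condition <code>win_by_runs &gt; 0</code> is an assumption about what "the required condition" is, and <code>top_n</code> is set to 3 instead of 10 only because the toy data is so small.</p>

```python
import pandas as pd

# Made-up rows; a win_by_runs of 0 means the match was not won by runs.
matches_df = pd.DataFrame({
    "season": [2008, 2009, 2010, 2011, 2012, 2013],
    "winner": ["A", "B", "C", "A", "B", "C"],
    "win_by_runs": [0, 146, 10, 0, 97, 33],
})

top_n = 3  # the article uses 10 on the full dataset

# Keep wins by runs, largest margin first, then take the first top_n rows.
highest_wins_by_runs_df = (
    matches_df[matches_df.win_by_runs > 0]
    .sort_values("win_by_runs", ascending=False)
    .head(top_n)
)

print(highest_wins_by_runs_df)
```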

<p>I plotted the filtered data frame <code>highest_wins_by_runs_df</code> using <code>sns.scatterplot()</code>. For the <code>x</code> parameter I used <code>season</code>, and I used <code>win_by_runs</code> as the <code>y</code> parameter. I made the size of the points bigger for the top 10 victories using the <code>s</code> parameter.</p>
<p>To put emphasis on the top 10 victories, I used a different color as well as annotated those data points using <code>plt.annotate()</code>. The first parameter is the text of the annotation. The position of the point to be annotated is given as a tuple.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=82" height="501" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>The biggest margin of victory by runs is <strong>146 runs</strong>. In 2017, the Mumbai Indians defeated the Delhi Daredevils by this margin. The Royal Challengers Bangalore have 3 victories amongst the top 5.</p>
<h3 id="heading-q-mumbai-and-chennai-are-the-two-most-successful-teams-so-far-which-team-leads-in-the-head-to-head-record">Q. Mumbai and Chennai are the two most successful teams so far. Which team leads in the head-to-head record?</h3>
<ul>
<li>Filter the data frame using the required condition to find the matches played between the two teams.</li>
<li>Use <code>value_counts()</code> on the <code>winner</code> column to find how many times each of the teams has won.</li>
</ul>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=105" height="105" width="800" title="Embedded content" loading="lazy"></iframe></div>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=108" height="180" width="800" title="Embedded content" loading="lazy"></iframe></div>
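<p>One possible way to write this filter, shown on invented fixtures. Using <code>isin()</code> on both team columns is an assumption made for this sketch and may differ from the condition used in the notebook. <code>head_to_head_df</code> is a hypothetical name.</p>

```python
import pandas as pd

mi = "Mumbai Indians"
csk = "Chennai Super Kings"

# Made-up fixtures and results for illustration only.
matches_df = pd.DataFrame({
    "team1": [mi, csk, mi, "Team X", csk],
    "team2": [csk, mi, "Team X", csk, mi],
    "winner": [mi, mi, "Team X", csk, csk],
})

# A team never plays itself, so both columns being one of the two teams
# means the match was played between them, in either order.
both_teams = matches_df.team1.isin([mi, csk]) & matches_df.team2.isin([mi, csk])
head_to_head_df = matches_df[both_teams]

mivcsk = head_to_head_df.winner.value_counts()

print(mivcsk)
```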

<p>I plotted the series <code>mivcsk</code> as a bar chart for a better visualization.</p>
<div class="embed-wrapper"><iframe src="https://jovian.ai/embed?url=https://jovian.ai/srijansrj5901/ipl-data-analysis/v/83&amp;cellId=109" height="507" width="800" title="Embedded content" loading="lazy"></iframe></div>

<p>MI have dominated CSK and are leading the head-to-head record 17-11. We can see their dominance especially in the 2019 season, where MI defeated CSK all 4 times they met, including the playoff and the final.</p>
<h2 id="heading-5-inferences-from-the-analysis">5. Inferences from the Analysis</h2>
<p>We have drawn some interesting inferences and now know more about the IPL than when we started. Here's a summary of what we learned through our analysis:</p>
<ul>
<li>Almost 60 matches are played in every IPL season amongst 8 teams.</li>
<li>There has been an attempt to expand the IPL to 10 teams but the 8 teams idea was brought back and has been continued since.</li>
<li>For the first six seasons (2008-2013), teams were figuring out whether batting first or chasing would be better after winning the toss. This could be down to the fact that the IPL and T20 cricket were both in their early stages so teams were trying different strategies.</li>
<li>But, since 2014, teams have preferred chasing, especially in the past 4 seasons (2016-2019) where teams have chosen to field more than 4 times out of 5. This is likely because having a set total to chase makes things simpler. This could also result from teams preferring to chase in ODIs as well.</li>
<li>Though teams have overwhelmingly chosen to field first, the win percentage after choosing to bat or field is not that one-sided. However, their difference is on the rise.</li>
<li>Mumbai Indians have played the most matches in the IPL. Due to the brief expansion, change of owners, and removal and banning of teams, there have been 15 teams who have played in the IPL.</li>
<li>Among the teams that were part of the first season, Chennai and Mumbai have the highest win percentages. The fact that they are the only two such teams in the top 5 shows their dominance.</li>
<li>Mumbai Indians have won the IPL 4 times, the most. They are followed by Chennai at 3 and Kolkata Knight Riders at 2. Sunrisers Hyderabad, Deccan Chargers and Rajasthan Royals complete the IPL Champions list, all winning once each.</li>
<li>146 runs is the largest margin of victory by runs. Mumbai Indians defeated Delhi Daredevils by this margin in 2017. The largest margin for victory by wickets is 10, which has been achieved many times.</li>
<li>The two heavyweights, Mumbai and Chennai, have a head-to-head record in favour of Mumbai at 17-11. Mumbai have had the upper hand in the 2019 season every time they met, including the final.</li>
</ul>
<h2 id="heading-6-conclusion">6. Conclusion</h2>
<p>In this article, we did a bunch of analysis and saw some interesting visualizations. However, this was just scratching the surface.</p>
<p>You can perform more interesting analysis on <em>matches.csv</em> as a standalone data set. But combining <em>deliveries.csv</em> with this dataset could lead to more in-depth analysis.</p>
<p>I did this data analysis and visualization as a project for the 6-week course <a target="_blank" href="https://zerotopandas.com">Data Analysis with Python: Zero to Pandas</a>. This course was conducted by <a target="_blank" href="https://jovian.ml">Jovian.ml</a> in partnership with <a target="_blank" href="https://www.freecodecamp.org">freeCodeCamp.org</a>. Check out the project <a target="_blank" href="https://jovian.ml/srijansrj5901/ipl-data-analysis">here</a>.</p>
<p>Also, the IPL is on right now. Go watch it and enjoy!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ I did a Kaggle competition as a semester project at uni. Here’s what I learned. ]]>
                </title>
                <description>
                    <![CDATA[ By Ane Berasategi It was my first competition and my first semester. I didn’t know what I was doing. _Photo by [Unsplash](https://unsplash.com/@miguel_photo?utm_source=medium&utm_medium=referral" rel="noopener" target="_blank" title="">Miguel Henriq... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/i-did-a-kaggle-competition-as-a-semester-project-at-uni-heres-what-i-learned-afe36a99d309/</link>
                <guid isPermaLink="false">66c357447ef110ecbf367b04</guid>
                
                    <category>
                        <![CDATA[ kaggle ]]>
                    </category>
                
                    <category>
                        <![CDATA[ learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ university ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 24 Apr 2019 17:36:04 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/0*Q7IdllDE47WuWP-H" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Ane Berasategi</p>
<p>It was my first competition and my first semester. I didn’t know what I was doing.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/9pu2uThG1h1G6Uasfc6HKW9SEEOuTNbYdcNv" alt="Image" width="800" height="533" loading="lazy">
<em>Photo by <a href="https://unsplash.com/@miguel_photo?utm_source=medium&amp;utm_medium=referral" rel="noopener" target="_blank" title="">Miguel Henriques</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral" rel="noopener" target="_blank" title="">Unsplash</a></em></p>
<p>This is the story of how I decided to be creative in a semester-long project, how my initial topic choice was crushed and how doing a Kaggle competition at the last minute saved my grade.</p>
<p>As for my background, I studied Telecommunications Engineering and I have some experience with software development and machine learning, but I had no idea about NLP back then.</p>
<p>I started my Master’s degree last semester and one of the first subjects I attended was called ‘Classification and clustering’ in NLP. The professor explained the basics of text processing during the first weeks and afterwards each student had to pick a classification or clustering problem in NLP, document the theory, implement a solution, and present it to the class. The implementation could be done at the end of the semester, not right away with the presentation.</p>
<blockquote>
<p>I was new at the uni so I decided to wait and observe what the other students did.</p>
</blockquote>
<h4 id="heading-choosing-the-topic">Choosing the topic</h4>
<p>Very quickly, the topics of decision trees, naïve Bayes, random forests, SVMs, logistic regression, etc. were picked. I barely knew what they were so I was excited at the thought of my peers squeezing these topics into 30 min presentations, and I gave them all my attention and wrote all the notes I could.</p>
<p>Unfortunately, these first presentations were <strong>purely theoretical,</strong> since no one had had time to implement anything so early in the semester. I learnt later that the motivation behind presenting so early had been the urge to ‘pick an easy topic before someone else took it’ and ‘get the presentation over with’, postponing the implementation of the algorithm until later on in the semester.</p>
<p>As much as I tried, I didn’t understand much of the presentations. I need to visualize things, see the code, see examples. It’s not easy for me to follow a presentation full of mathematical notation and formulas.</p>
<blockquote>
<p>Weeks passed, the professor started to urge us to choose a topic and set a date for the presentation, and I had nothing. I waited for a few more presentations before starting to panic.</p>
</blockquote>
<p>The next presentations were a little more advanced: LDA, LSI, perceptrons, NNs, TensorFlow, Keras and word embeddings among others.</p>
<p>I was completely ignorant of some topics (LDA and LSI), but I did know some minimal ML. These presentations did include <strong>code</strong>, sometimes even too much. There was a lot of scrolling and very little time spent on analyzing the code; the focus was purely on the results. I learnt about the origins of TensorFlow and Keras, and I was left exhausted and confused at the end of each presentation. As much as I’d tried, I hadn’t learnt much.</p>
<blockquote>
<p>I was one of the last students left to choose a topic, and the professor was looking at me every time he mentioned the ‘friendly reminder’. I got the message.</p>
</blockquote>
<p>I tried to think rationally: there weren’t many obvious topics left, and I wanted a topic that was interesting for me and for the other students, where I could put everything to use, not just a data structure or an ML model. The subject was worth 6 ECTS credits and I wanted to use the time to produce something I could be proud of.</p>
<p>I asked my friend to Google for classification problems in NLP, and after some searching I found out about <strong>sentiment analysis</strong>. It wrapped everything together beautifully, and I had my topic. I checked if someone had already picked it up, no one had, I told the professor, he said ‘Finally!’, and I started gathering my references. The wheel was in motion.</p>
<p>The following week, at another lecture, a guest lecturer gave a very interesting talk about his Master’s thesis, on <strong>sentiment analysis</strong>. Of course. My fellow classmates and I spent 90 mins learning about it: the motivation for using it, the applications, the development, the code, the results, <em>everything</em>. It was a majestic Master’s thesis and a very illustrative talk, and it ruined my presentation.</p>
<p>I could still have done my project on the same topic, but everyone had just heard an experienced researcher give a thorough 90-minute talk on it. There was no way I could produce the equivalent of his Master’s thesis in a couple of months, so I decided to keep looking for something unique, something I could present that would make people say: “oooh”.</p>
<blockquote>
<p>At this point, panic mode was on.</p>
</blockquote>
<p>My presentation date was in 2 months, my awesome topic was no more, and I needed something, fast. I was scrolling through Twitter trying to ignore the pressure when I saw that Kaggle had announced their brand new <a target="_blank" href="https://www.kaggle.com/c/quora-insincere-questions-classification/"><strong>Quora insincere questions classification competition</strong></a>, and I remember thinking:</p>
<ul>
<li>Quora? I like Quora</li>
<li>Insincere questions? Sounds like fun!</li>
<li>Classification? Could this be…</li>
</ul>
<p>I went to the webpage, and it was indeed a text classification problem. It was as if Kaggle had seen me drowning and lent me a helping hand. This competition could solve all my problems.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/u9Ukm6X1wwRQ6aLue1jH8XrByi1scWr8z8CO" alt="Image" width="800" height="169" loading="lazy"></p>
<ol>
<li>Had I ever done a Kaggle competition before? I had done some small ML projects, but never a competition.</li>
<li>Was the competition for beginners? No, it was hosted by Quora with real prizes, and professional people competing hard for it.</li>
<li>Did I have the slightest clue where to begin? No I did not.</li>
</ol>
<p>So I went for it. Doing this project meant doing something completely different from the rest of the class, and of course I was afraid. This was the mental dialogue I had with myself:</p>
<ul>
<li>What’s the worst that could happen?</li>
<li>Well, the professor might reject the topic.</li>
<li>Okay assuming it’s accepted, the worst thing?</li>
<li>Not finishing in time, not having something complete to present.</li>
<li>Fair point, what if I have something complete?</li>
<li>It could be terrible, worse than random classification.</li>
<li>That would indeed be bad.</li>
</ul>
<p>So I set my goal to have something finished and ideally with a decent result in two months.</p>
<p>I enthusiastically pitched the idea to the professor, he listened and nodded and said: “Sure, you can change the topic”. I also heard “if you can pull it off” but I’m somewhat sure that last part came from inside my mind and he didn’t actually say it.</p>
<p>I was going to do the documentation and implementation of my submission at the same time, so I set to work.</p>
<h4 id="heading-the-kaggle-competition">The Kaggle competition</h4>
<p>Since my ambitions were humble, I didn’t bother with the imposter syndrome. I made a list of the popular kernels on the website, went through them, understood them, combined them, tweaked them, and made my own.</p>
<h4 id="heading-1-eda">1. EDA</h4>
<p>The first thing to do was exploratory data analysis (EDA). In hindsight I spent way too much time exploring the questions, but in my defense, I didn’t know what I was doing, and I have to admit that some of the insincere questions were funny. I gathered all the questions Quora classifies as insincere and extracted some that I personally find funny. <a target="_blank" href="https://github.com/anebz/kaggle/tree/master/quora_insincere_questions">You can see them on my GitHub</a>. And you can see my <a target="_blank" href="https://www.kaggle.com/anebzt/quora-eda">EDA on Kaggle</a>.</p>
<h4 id="heading-2-preprocessing">2. Preprocessing</h4>
<p>Strategies were a bit different in the preprocessing, and it took some more time to understand what people were doing. I learnt how to use word embeddings and how to adjust the input text so that the text coverage is as high as possible and the number of unknown words is as low as possible. I was quite proud of how much I learnt about text processing in such a short time.</p>
<p>I used GloVe as pretrained embeddings. This was the text coverage at the beginning:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/MdCtirsEuSPJSJEbIyBEYKFxtsFPp6l5GKOJ" alt="Image" width="388" height="90" loading="lazy"></p>
<p>Of all the distinct words that were used, 31.5% are recognized by the embeddings, and of all the text used, 88%. Some words are far more frequent than others, such as ‘the’, ‘a’, etc., which is why that 31.5% of the vocabulary makes up 88% of the total text.</p>
<p>After lowercasing the text, expanding the contractions and removing special characters and punctuation, the coverage is as follows:</p>
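<p>To make the difference between the two percentages concrete, here is a minimal sketch of how such coverage figures can be computed. It is not the code from my kernel: the function names, the toy questions and the <code>known_words</code> set (standing in for the keys of the pretrained embeddings) are all hypothetical.</p>

```python
from collections import Counter


def build_vocab(sentences):
    """Count how often each whitespace-separated token appears."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts


def embedding_coverage(vocab, known_words):
    """Return (vocabulary coverage, text coverage, out-of-vocabulary counts)."""
    covered = {word: n for word, n in vocab.items() if word in known_words}
    oov = Counter({word: n for word, n in vocab.items() if word not in known_words})
    # Share of distinct words that have an embedding.
    vocab_coverage = len(covered) / len(vocab)
    # Share of all tokens in the text that have an embedding.
    text_coverage = sum(covered.values()) / sum(vocab.values())
    return vocab_coverage, text_coverage, oov


# Toy data: hypothetical questions and a hypothetical set of embedding keys.
questions = ["Why is the sky blue?", "What is the best way to learn NLP?"]
known_words = {"is", "the", "sky", "best", "way", "to", "learn"}

vocab = build_vocab(questions)
vocab_cov, text_cov, oov = embedding_coverage(vocab, known_words)
print(f"vocabulary coverage: {vocab_cov:.1%}, text coverage: {text_cov:.1%}")
print(oov.most_common(5))
```

<p>Because frequent words are counted once in the vocabulary but many times in the text, the text coverage is usually the higher of the two numbers. The out-of-vocabulary counter shows which unknown words are worth fixing first.</p>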
<p><img src="https://cdn-media-1.freecodecamp.org/images/ZhXNxFC2JkSAWduggHBqRU9vt5TpLEo4PQRq" alt="Image" width="378" height="93" loading="lazy"></p>
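<p>The cleaning steps themselves can be sketched roughly like this. Again, this is only an illustration under assumptions: the contraction map is a tiny hypothetical one, and the regular expression simply discards everything that is not an ASCII letter, a digit or whitespace, which is cruder than what a real kernel would do.</p>

```python
import re

# Hypothetical, deliberately tiny contraction map; a real one would be much longer.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "what's": "what is",
    "i'm": "i am",
    "don't": "do not",
}


def clean_text(text):
    """Lowercase, expand contractions, and replace special characters with spaces."""
    text = text.lower()
    # Normalise curly apostrophes so the contraction map matches.
    text = text.replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Keep only ASCII letters, digits and whitespace; everything else becomes a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the repeated whitespace left behind by the substitutions.
    return " ".join(text.split())


cleaned = clean_text("What's the BEST way to learn NLP?! I can\u2019t decide...")
print(cleaned)
```

<p>The order matters: contractions have to be expanded before the punctuation is stripped, otherwise the apostrophes they depend on are already gone.</p>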
<p>Out of vocabulary words (those not recognized by the embeddings) include the following, along with their frequency:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/rG861sZlXpZO7nWxO5iT9Xq59PL9W1X7C-i3" alt="Image" width="236" height="253" loading="lazy"></p>
<p>You can see my <a target="_blank" href="https://www.kaggle.com/anebzt/quora-preprocessing-model">preprocessing kernel on Kaggle</a>.</p>
<h4 id="heading-4-the-model">4. The model</h4>
<p>Here my limited knowledge of ML helped me move a bit faster; the only bottleneck was deciding which architecture to use. People were using models from RNNs to LSTMs to BERT even, adding KFold, cyclical learning rates, bidirectional models, what?</p>
<p>My stress level went up: the presentation date was in two weeks and I didn’t understand any of the architectures. I picked the simplest one that could give me a decent score and started with an LSTM architecture.</p>
<p>I connected everything together, and I got a result. A terrible one, but a result nonetheless. My basic needs fulfilled, I started working on the presentation while I left model tuning as my procrastination activity. Eventually I added an Attention layer, and finally turned it into a bidirectional LSTM. The score was decent.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/RL7l3MyvHRdeMR3R40GoeCkDC1VVhr9GzwEQ" alt="Image" width="566" height="554" loading="lazy"></p>
<p>The final architecture I used was a BiLSTM with an Attention layer. It trained quite fast and gave a relatively good result. As before, you can see <a target="_blank" href="https://www.kaggle.com/anebzt/quora-preprocessing-model">the whole kernel on Kaggle</a>.</p>
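<p>If you are wondering what an Attention layer does on top of a BiLSTM, the sketch below shows one common formulation in plain Python: score every time step, turn the scores into weights with a softmax, and take the weighted sum of the hidden states. This is an assumption-laden illustration, not the layer from my kernel; the function name, the scoring vector and the toy hidden states are all made up.</p>

```python
import math


def attention_pool(hidden_states, weights, bias=0.0):
    """Collapse a sequence of hidden-state vectors into one context vector.

    hidden_states: list of T vectors (one per time step), each of length D.
    weights: scoring vector of length D (learned in a real model, made up here).
    """
    # Score each time step: tanh of a dot product with the scoring vector.
    scores = [
        math.tanh(sum(h_i * w_i for h_i, w_i in zip(h, weights)) + bias)
        for h in hidden_states
    ]
    # Softmax turns the scores into attention weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # The context vector is the attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [
        sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)
    ]
    return context, alphas


# Toy BiLSTM outputs: 3 time steps, 4 features (forward and backward halves concatenated).
states = [[0.1, 0.3, -0.2, 0.5], [0.9, -0.4, 0.7, 0.2], [0.0, 0.1, 0.0, -0.1]]
context, alphas = attention_pool(states, weights=[0.5, -0.25, 0.75, 0.1])
print(alphas)
print(context)
```

<p>The point of the layer is that the classifier no longer relies only on the last hidden state: every word in the question contributes, and the words with the highest scores contribute the most.</p>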
<h4 id="heading-5-the-preparation">5. The preparation</h4>
<p>For the first time in my life, I had too much material for my presentation. I had to cut enough to fit into 30 minutes, but no more, lest I make my talk too general. I had to show code but not only code, since in my experience it’s difficult to focus on just code for half an hour.</p>
<p>I spent the last two weeks documenting my code, adding all the references I had used, just in case someone somewhere thought I had made that project by myself and had retrieved the information from my imagination.</p>
<blockquote>
<p>The openness of Kaggle and the availability of public and well-documented code is one of the greatest incentives for using Kaggle, in my opinion.</p>
</blockquote>
<p>I polished my presentation and rehearsed with classmates to make sure I didn’t talk for over 30 mins. I did, and they gave me tips to reduce repetition in what I was saying, showing in the slides, and showing again in the code; I made much simpler slide-code transitions as a result.</p>
<h4 id="heading-6-the-presentation">6. The presentation</h4>
<p>For my presentation, I only used slides to explain the specifics of the competition: motivation, problem definition, input data, metrics, etc.</p>
<p>For the EDA and preprocessing, I had a slide explaining what I would show in the code, then I switched to the code, and then came back to the slides and showed a recap of what I had just shown. At the end, I included all the advanced model architecture additions I hadn’t had time to consider.</p>
<p>The presentation went very well. I spoke for only 30 mins, and there was a follow-up discussion of another 30 mins in which the whole class discussed different strategies to classify insincere questions. The professor praised my creativity and said he would consider changing the structure of the semester so that more students could do projects similar to mine.</p>
<p>I consider that a successful project!</p>
<h4 id="heading-7-conclusion">7. Conclusion</h4>
<p>Since I didn’t know what I was doing throughout the project, I had many doubts. It’s risky to do something completely different from the rest of the class: it can end very well or terribly.</p>
<p>I learnt that being creative can sometimes be rewarded, and that calculated risks are worth taking. In this case, I consulted with the professor before doing anything and he approved, so the risk was smaller.</p>
<p>I learnt a lot doing the Kaggle competition, and I scored in the top 29%, which is not so terrible! I’m quite proud of it, considering it was my first competition.</p>
<p>If there’s anything I can say as a takeaway, it’s this:</p>
<blockquote>
<p>If you’re at university or in a course/program, use the time to learn, experiment, and put yourself in situations where you could fail, but also succeed. My professional relationship with the professor got stronger because of my project.</p>
<p>If you can afford to do more than just completing the subject, consider going beyond what the professor says. Read the references, research online, propose topics. Who knows where your initiative could take you.</p>
<p>And lastly, <strong>you don’t have to do exactly what the other students do</strong>. Just because everyone follows a certain structure or submission format doesn’t mean it’s the correct one. Talk to the professor or teaching assistants, ask students who had the subject the previous year, and then decide consciously how you want to handle the subject.</p>
</blockquote>
<p>I hope you liked my story! If you want to hear more about it or contact me in any way, you can reach me <a target="_blank" href="https://twitter.com/aberasategi">on twitter.</a></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
