kaggle - freeCodeCamp.org

Improve Your Data Science Skills by Solving Kaggle Challenges

Beau Carnes — Mon, 30 Sep 2024 20:08:49 +0000

Data science competitions can help you improve your data science skills.

We just posted a course on the freeCodeCamp.org YouTube channel that is designed to help you understand and complete Kaggle competitions, from data exploration to model building and leaderboard submissions. Rohan Kumar from S.M.D.S developed this course.

Why Kaggle?

Kaggle is the premier platform for data science competitions, offering a unique opportunity to apply your skills to real-world problems. Whether you're a beginner eager to learn or an experienced data scientist looking to refine your techniques, Kaggle provides a dynamic environment to test and expand your capabilities.

Course Overview

This course teaches you how to complete Kaggle competitions, focusing on three specific challenges. It covers every step of the process so that you gain practical experience and insights along the way. Here's what you can expect:

Selecting the Right Competition: Learn how to choose competitions that match your skill level and interests, setting you up for success from the start.
Data Exploration and Preprocessing: Discover techniques for understanding and preparing datasets, a crucial step in any data science project.
Feature Engineering: Unlock the power of feature engineering to extract valuable insights and improve model performance.
Model Selection and Evaluation: Explore popular machine learning algorithms and learn how to evaluate their effectiveness.
Hyperparameter Tuning: Fine-tune your models to achieve optimal accuracy and performance.
Submission Strategies: Gain insights into preparing and submitting predictions to the Kaggle leaderboard.

This course provides a hands-on learning experience. By following along with the tutorial and working on competition projects, you'll develop a solid understanding of the entire data science workflow. You'll learn practical skills applicable to real-world projects, from data manipulation to model evaluation.

Conclusion

Whether you're looking to enhance your competition skills or gain practical data science experience, our course offers the guidance and insights you need. Watch the full course on the freeCodeCamp.org YouTube channel (2-hour watch).

How to Download a Kaggle Dataset Directly to a Google Colab Notebook

Md. Fahim Bin Amin — Thu, 08 Feb 2024 19:39:00 +0000

Kaggle is a popular data science-based competition platform that has a large online community of data scientists and machine learning engineers.

The platform contains a ton of datasets and notebooks that you can use to learn and practice your data science and machine learning skills. They even have competitions you can participate in.

Kaggle offers a 100% free platform for all users – but there are some restrictions depending on the resources you're using.

For example, you can use their CPU system for an unlimited amount of time. But there are strict limits on GPU and TPU usage: you can use their GPU for 30 hours and their TPU for 20 hours per week. The quota resets each week, so you get a fresh 30 hours of GPU usage and 20 hours of TPU usage at the start of the new week.

Kaggle Website

Alongside Kaggle, there are other popular platforms for machine learning engineers and data scientists – like Google Colaboratory, or Google Colab for short.

In Google Colab, you can also use their CPU and GPU, but the free version has more limitations than a free Kaggle account. In Google Colab, you cannot get any GPU computational power until they allocate it from their free units. You don't know how many hours you can use, and you don't even know whether you'll get any units over the next few days.

In order to get all the features, you need to subscribe to their pro plans which are quite expensive.

But sometimes you still may want to use Colab, in most cases for short tasks. In Colab, you can directly connect your Google Drive and use your datasets from there. You can also store your output from the notebook to Google Drive if you want.

When you're working on a project, though, sometimes you'll want to use datasets from Kaggle in Google Colab. So you'll need to download the dataset from Kaggle and upload that to Colab's temporary storage or your Google Drive.

You can probably guess that this is a very time-consuming process.

But there is a way that you can directly download a Kaggle dataset using an API call in the Google Colab's notebook! In this article, I am going to show you how you can do that.

I've broken this tutorial down into separate parts for better understanding. You can get a clear overview of the entire article here:

Types of Kaggle datasets
Prerequisites
Set up Google Colab for using Kaggle API
Install Kaggle library
Mount Google Drive to Colab
Add the Kaggle API Token to Colab Notebook
Download Kaggle dataset
Download Kaggle Competition dataset
Download Specific file from Kaggle Competition dataset
Conclusion

Video

If you would like to watch all of the steps from a video, you're in luck – I made this video just for you:

Types of Kaggle Datasets

Normally Kaggle provides two types of datasets: typical datasets that anyone can upload, and competition datasets. In the competition datasets, the competition organizers typically add/upload the datasets.

Even though you can download a Kaggle dataset easily, you can't download a competition dataset if you don't participate in that competition. But some competitions remain open, and you can access their datasets via "Late Submission". So just make sure to check.

Prerequisites

To go through this tutorial and get the most out of it, you'll need a Kaggle account, which is completely free. Simply head over to the official Kaggle website and create an account if you don't have one already.

You'll also need Kaggle's API. Head over to the settings of your Kaggle account. Go to the API section, and click "Create New Token". Keep in mind that Kaggle does not allow you to keep multiple tokens. You can use only one active token for your Kaggle account.

Kaggle API Token

This will give you a kaggle.json file. Keep it safe, as you'll need to use it later.

You also need a Google account if you want to use Google Colab. You may already have one, but if you don't, go ahead and create a new account in Google.

Now, you can store your Kaggle JSON in your Google drive. I prefer to create a new folder and keep my JSON file there so that I can call that in Colab whenever I want.

How to Set Up Google Colab to Use the Kaggle API

You can simply open any Colab notebook where you want to use the Kaggle API to download the dataset.

Google Colab

Install the Kaggle library

You need to install the Kaggle Python library before you start working with Kaggle. You can simply install it in the Colab notebook using the command ! pip install kaggle.

Install Kaggle library in colab

Mount Google Drive to Colab

Now you need to mount your Google Drive to the Colab notebook, since you've uploaded your kaggle.json file inside your Google drive.

You can simply do that by using the two lines of code given below:

from google.colab import drive
drive.mount('/content/drive')

Make sure to give it permission to access your Google Drive:

Give access to Google Drive

Mount Google Drive

If you refresh the mounted folder icon, you will see your Google Drive and all of the content in the notebook.

Find MyDrive in Notebook

Add the Kaggle API Token to the Colab Notebook

Now you need to add the Kaggle API token to the notebook. But before that, you can simply create a temporary directory for Kaggle at the temporary instance location on the Colab drive by using the command ! mkdir ~/.kaggle.

Now you need to copy your uploaded JSON file to that temporary Kaggle directory. You need the path of the location where you uploaded your JSON file earlier. You can copy that path directly from the drive folder in the notebook.

Copy JSON file location

You can get the path directly like this.

Then you can use the copy command like below:

! cp kaggle_json_path ~/.kaggle/

For example, my JSON file is located at "/content/drive/MyDrive/Kaggle_API/kaggle.json", so my command would be:

! cp /content/drive/MyDrive/Kaggle_API/kaggle.json ~/.kaggle/

Copy JSON file

Now you need to change the file permissions so that only the owner can read and write the file. This keeps your token safe.

You can use the command below to achieve that:

! chmod 600 ~/.kaggle/kaggle.json

Change file permission of kaggle.json file
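If you prefer to do these three steps (create the directory, copy the token, restrict the permissions) from Python instead of shell commands, here is a minimal sketch using only the standard library. The function name install_kaggle_token and its home_dir parameter are my own and are not part of the Kaggle library; they just mirror the mkdir, cp, and chmod commands above.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home_dir=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner.

    install_kaggle_token and home_dir are hypothetical names used for
    illustration; home_dir defaults to the current user's home directory.
    """
    home = Path(home_dir) if home_dir else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # like: mkdir ~/.kaggle

    target = kaggle_dir / "kaggle.json"
    shutil.copy(token_path, target)                # like: cp <path> ~/.kaggle/
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)      # like: chmod 600
    return target
```

In a Colab notebook with Drive mounted, calling install_kaggle_token("/content/drive/MyDrive/Kaggle_API/kaggle.json") should have the same effect as the commands above, assuming your token sits at that path.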

How to Download the Kaggle Dataset

For downloading a typical Kaggle dataset, you have to find the dataset on Kaggle first.

Let's say I want to download the following dataset from Kaggle:

Sample dataset

Check the complete URL of the dataset, which in this case is:

https://www.kaggle.com/datasets/mdfahimbinamin/fastsurfer-processed-3d-brain-mri-from-adni

We need the "account_name_of_the_dataset_owner/dataset_path" string. From the URL, the account name of the dataset owner is mdfahimbinamin. The dataset path is fastsurfer-processed-3d-brain-mri-from-adni.
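If you download datasets often, you can pull that string out of the URL automatically. This is a small sketch with the standard library; dataset_slug is a helper name I made up, and it assumes the URL follows the /datasets/owner/dataset-name layout seen above.

```python
from urllib.parse import urlparse


def dataset_slug(url):
    """Return 'owner/dataset-name' from a Kaggle dataset URL (hypothetical helper)."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    # Assumed path layout: /datasets/<owner>/<dataset-name>
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


url = "https://www.kaggle.com/datasets/mdfahimbinamin/fastsurfer-processed-3d-brain-mri-from-adni"
print(dataset_slug(url))
```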

So to download this exact dataset from Kaggle to your Google Colab notebook, your command would be:

! kaggle datasets download mdfahimbinamin/fastsurfer-processed-3d-brain-mri-from-adni

Downloading the Kaggle dataset to your Colab notebook

The entire process happens on Google's Cloud PC. So the downloading speed should be quite fast.

By default, the datasets come as a .zip file. So if you need to unzip it, you can simply use the command below:

! unzip dataset-path.zip

For example, my dataset name/path was "fastsurfer-processed-3d-brain-mri-from-adni". So I will use the following command:

! unzip fastsurfer-processed-3d-brain-mri-from-adni.zip

Unzip Kaggle Dataset
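You can also unzip the archive from Python, which is handy if you want to extract into a specific folder. This is a minimal sketch using the standard zipfile module; extract_archive and dest_dir are names I picked for illustration.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every file in zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

For the dataset above, extract_archive("fastsurfer-processed-3d-brain-mri-from-adni.zip", "data") would place the files in a data folder next to your notebook.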

That's it! 😊

How to Download a Kaggle Competition Dataset

Before downloading a Competition dataset, you need to make sure that either you have joined that competition or that you've selected "Late Submission" using the same Kaggle account that you're using for Kaggle API token.

Suppose I'm joining the ConnectX competition on Kaggle.

Connect X competition

I need to click "Join Competition" to get access to their dataset.

But if I want to download a dataset from a past competition, I need to join via "Late Submission" to get access to the dataset.

Join a past competition

After clicking on "Late Submission", I need to grab the URL. This time, I'm using the Binary Classification with a Bank Churn Dataset. The complete URL is: https://www.kaggle.com/competitions/playground-series-s4e1/overview

From the URL, I can see that the dataset is located at "playground-series-s4e1". So I will use the following command to download the dataset to my Google Colab notebook:

! kaggle competitions download playground-series-s4e1

Download dataset

That's it! 😊

How to Download a Specific File from a Kaggle Competition Dataset

Let's say, I want to download a specific file from a Kaggle competition dataset. I can also do that.

dataset

In the dataset used above, you can see that there are 3 files. Let's say I want to download the test.csv file only.

To do this, the command would be structured like this: ! kaggle competitions download dataset-path -f file_name_with_extension.

So my command would be:

! kaggle competitions download playground-series-s4e1 -f test.csv

Download specific file
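If you want to build these competition commands in Python, for example to loop over several files, here is a rough sketch. The helpers competition_slug and download_command are my own names, not part of the Kaggle library. They only assemble the same command shown above, and they assume the URL follows the /competitions/name layout.

```python
from urllib.parse import urlparse


def competition_slug(url):
    """Return the competition name from a Kaggle competition URL (hypothetical helper)."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    # Assumed path layout: /competitions/<name>/...
    if len(parts) < 2 or parts[0] != "competitions":
        raise ValueError(f"Not a Kaggle competition URL: {url}")
    return parts[1]


def download_command(slug, file_name=None):
    """Build the argument list for 'kaggle competitions download'."""
    command = ["kaggle", "competitions", "download", slug]
    if file_name:
        command += ["-f", file_name]  # download only this file
    return command


slug = competition_slug("https://www.kaggle.com/competitions/playground-series-s4e1/overview")
print(download_command(slug, "test.csv"))
```

You could then pass the resulting list to subprocess.run(command, check=True) in a notebook where the Kaggle library and your token are already set up.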

That's it! 😊

Conclusion

I hope you have gained some valuable insights from the article.

If you enjoyed this step-by-step guide, don't forget to let me know on Twitter/X or LinkedIn.

You can follow me on GitHub as well if you are interested in open source. Make sure to check my website (https://fahimbinamin.com/) as well!

If you like to watch programming and technology-related videos, then you can check my YouTube channel, too. You can also check my other writings on Dev.to.

All the best for your programming and development journey. 😊

You can do it! Never give up! ❤️

Python Data Analysis: How to Visualize a Kaggle Dataset with Pandas, Matplotlib, and Seaborn

freeCodeCamp — Thu, 22 Oct 2020 17:49:27 +0000

By Srijan

The Indian Premier League or IPL is a T20 cricket tournament organized annually by the Board of Control for Cricket In India (BCCI). Eight city-based franchises compete with each other over 6 weeks to find the winner.

In this article, I'm going to analyze data from the IPL's past seasons to see which teams have won the most games, how teams behave when winning a toss, who has the greatest legacy, and so on.

I have done this analysis from a historical point of view, giving an overview of what has happened in the IPL over the years. I have used tools such as Pandas, Matplotlib and Seaborn along with Python to give a visual as well as numeric representation of the data in front of us.

Pandas stands for Python Data Analysis library. It is typically used for working with tabular data (similar to the data stored in a spreadsheet). Pandas provides helper functions to read data from various file formats like CSV, Excel spreadsheets, HTML tables, JSON, SQL and perform operations on them.

Matplotlib and Seaborn are two Python libraries that are used to produce plots. Matplotlib is generally used for plotting lines, pie charts, and bar graphs.

Seaborn provides some more advanced visualization features with less syntax and more customizations. I switch back-and-forth between them during the analysis.

Getting the Dataset
Data Preparation and Cleaning
Exploratory Analysis and Visualization
Asking and Answering Questions
Inferences From the Analysis
Conclusion

1. Getting the Dataset

I downloaded the dataset from Kaggle. You will see there are two CSV (Comma Separated Value) files, matches.csv and deliveries.csv. I chose to do my analysis on matches.csv.

To find more interesting datasets, you can look at this page.

2. Data Preparation and Cleaning

A dataset contains many columns and rows. It is always possible that certain rows have missing values or NaN for one or more columns.

It is also possible that there might be certain columns or rows that you want to discard from your analysis. You can also combine two or more datasets for an in-depth analysis.

Cleaning the data involves making corrections to that data, leaving out unnecessary columns or rows, merging datasets, and so on.

Before taking these steps, I needed to install and import the tools (libraries) to be used during the analysis. I imported the libraries with different aliases such as pd, plt and sns. I then set some basic styles for the plots.

Notice the special command %matplotlib inline. It makes sure that plots are shown and embedded within the Jupyter notebook itself. Without this command, sometimes plots may show up in pop-up windows.

Using the read_csv() method from the Pandas library, I loaded the matches.csv file.

Data from the file is read and stored in a DataFrame object - one of the core data structures in Pandas for storing and working with tabular data. I used the _df suffix in the variable names for data frames.

I used the name matches_raw_df for the data frame. This indicates that this is unprocessed data that I will clean, filter, and modify to prepare a data frame that's ready for analysis.

Using the shape property of a DataFrame object, I found that the dataset contains 756 rows and 18 columns. To find the names of those columns I used the columns property. It returned a list of the columns in the data frame.

To get a summary of what the data frame contains, I used info(). This gives information about columns, number of non-null values in each column, their data type, and memory usage.
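Here is a minimal sketch of the loading and inspection steps. Since matches.csv isn't bundled with this article, I read a tiny in-memory CSV instead. Its rows are made up for illustration and only the column names come from the analysis, so the shape printed here is not the real 756 by 18.

```python
import io

import pandas as pd

# Made-up rows standing in for matches.csv (assumption: illustration only).
sample_csv = io.StringIO(
    "id,season,result,winner,umpire3\n"
    "1,2008,normal,Team A,\n"
    "2,2008,no result,,\n"
    "3,2009,normal,Team B,\n"
)

# With the real file this would be pd.read_csv("matches.csv").
matches_raw_df = pd.read_csv(sample_csv)

print(matches_raw_df.shape)    # (rows, columns)
print(matches_raw_df.columns)  # column names
matches_raw_df.info()          # non-null counts, dtypes, memory usage
```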

Almost all columns except umpire3 have no or very few null values. The presence of null values could result from a lack of information or an incorrect data entry.

An interesting thing to observe is that, although there are no null values for the result column, there are some for winner and player_of_match columns. Let's find out why.

I first accessed the result column using dot notation (matches_raw_df.result). Then I used the value_counts() method on the result column.

value_counts() returns a series which contains counts of unique values. Here, it tells us about the different values present in result and the total number for each of them.

So, out of 756 matches (rows), 4 matches ended as no result.

Cricket is an outdoor sport and unlike, say, football, play isn't possible when it's raining. It is very common to have matches abandoned due to incessant raining. Therefore, we have no winners or player of the match for these 4 matches.
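The check described above can be sketched like this. The data frame is a made-up stand-in with only the two relevant columns, so the counts differ from the real dataset.

```python
import pandas as pd

# Made-up rows (assumption: illustration only, not the real matches.csv).
matches_raw_df = pd.DataFrame({
    "result": ["normal", "normal", "tie", "no result", "normal"],
    "winner": ["Team A", "Team B", "Team A", None, "Team B"],
})

# Count how often each distinct value appears in the result column.
result_counts = matches_raw_df.result.value_counts()
print(result_counts)

# Rows without a winner should be the ones that ended as "no result".
no_winner_df = matches_raw_df[matches_raw_df.winner.isna()]
print(no_winner_df)
```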

For this analysis, the umpire3 column isn't needed. So I removed the column using the drop() method by passing the column name and axis value. If you want to remove multiple columns, the column names are to be given in a list.

I assigned this cleaned data frame to matches_df. I used this data frame for further analysis.
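A minimal sketch of the drop step, again on a made-up data frame that only has a few of the columns:

```python
import pandas as pd

# Made-up rows (assumption: illustration only).
matches_raw_df = pd.DataFrame({
    "id": [1, 2],
    "winner": ["Team A", "Team B"],
    "umpire3": [None, None],
})

# axis=1 tells drop() to remove a column rather than a row.
matches_df = matches_raw_df.drop("umpire3", axis=1)

# To remove several columns at once, pass a list of names instead:
# matches_raw_df.drop(["umpire3", "winner"], axis=1)
print(matches_df.columns)
```

Note that drop() returns a new data frame, so matches_raw_df still contains the umpire3 column.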

3. Exploratory Analysis and Visualization

Exploratory analysis involves performing operations on the dataset to understand the data and find patterns. It helps us make sense of the data we have.

Visualization is the graphic representation of data. It involves producing charts that communicate those patterns among the represented data to viewers.

Now, let's take a look at the data I analyzed and what I learned in the process.

Number of matches and teams

I tried to find the number of matches played in each season in the IPL from its inception to 2019.

Since I needed matches played each season, it made sense to group our data according to different seasons. Pandas has a groupby() method to achieve this, wherein I passed season as an argument.

Since an id is unique for each match (row), counting the number of ids for each season leads to what we want. I used the count() method on the id column to find the number of matches held each season. This series is assigned to the variable matches_per_season.
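The grouping and counting can be sketched as follows. The rows are made up, so the counts are not the real number of IPL matches.

```python
import pandas as pd

# Made-up rows (assumption: illustration only).
matches_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "season": [2008, 2008, 2008, 2009, 2009],
})

# One group per season, then count the match ids in each group.
matches_per_season = matches_df.groupby("season")["id"].count()
print(matches_per_season)
```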

I then used the barplot() method from the Seaborn library to plot the series. The index of the series, that is the seasons, was given as the x-value while the values of the series were given as the y-values.

I used various matplotlib.pyplot methods such as figure(), xticks() and title() to set the size of the plot, title of the plot, and so on.

figure takes a parameter, figsize, which I set to (12,6). Notice that the size was given as a tuple. To xticks(), I gave the rotation parameter a value of 75 to make it easier to read.

Each season, almost 60 matches were played. However, we see a spike in the number of matches from 2011 to 2013. This is because two new franchises, the Pune Warriors and Kochi Tuskers Kerala, were introduced, increasing the number of teams to 10.

However, Kochi was removed in the very next season, while the Pune Warriors were removed in 2013, bringing the number down to 8 from 2014 onwards.

Before the start of the 2016 season, two teams, the Chennai Super Kings and Rajasthan Royals were banned for two seasons. To make up for their absence, two new teams (the Rising Pune Supergiants and Gujarat Lions) entered the competition.

When the Chennai Super Kings and Rajasthan Royals returned, these two teams were removed from the competition.

Analyzing the Toss results

One of the most significant events in any cricket match is the toss, which happens at the very start of a match. The toss winner can choose whether they want to bat first or second (fielding first).

Let's see what the trend has been amongst the teams across different seasons.

Again I grouped the rows by season and then counted the different values of the toss_decision column by using value_counts().

Since a percentage gives a clearer picture, I divided the above result by matches_per_season and multiplied it by 100. This series was assigned to toss_decision_percentage.

Here, toss_decision_percentage is a series with a multi-index. If we print the index of the series using the index property, we see it is of the form (2008, 'bat'), (2008, 'field') and so on.

The series used both season and toss_decision as an index. But I only wanted the seasons to be an index. I used unstack() to achieve this.

By using the unstack() method on the series, it converted the values of toss_decision (that is, bat and field) into separate columns.
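Here is a sketch of the whole toss calculation on made-up rows. One difference from the description above: instead of a plain division, I use div() with level="season" so that the alignment between the multi-index and the per-season counts is explicit. That is my own choice for this sketch.

```python
import pandas as pd

# Made-up rows (assumption: illustration only).
matches_df = pd.DataFrame({
    "id": list(range(1, 9)),
    "season": [2008] * 4 + [2009] * 4,
    "toss_decision": ["bat", "field", "field", "field",
                      "bat", "bat", "field", "field"],
})

matches_per_season = matches_df.groupby("season")["id"].count()

# Series indexed by (season, toss_decision).
toss_decision = matches_df.groupby("season")["toss_decision"].value_counts()

# Divide each count by the matches of its season, then scale to a percentage.
toss_decision_percentage = toss_decision.div(matches_per_season, level="season") * 100

# Move toss_decision from the index into columns: one row per season.
toss_table = toss_decision_percentage.unstack()
print(toss_table)
```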

Next I used the plot() method of the Pandas data frame, which draws with Matplotlib, to represent these values as bar charts. plot() has a parameter kind which decides what type of plot to draw. The value was set to bar.
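A minimal sketch of the bar chart, assuming Matplotlib is installed. The percentages are made up, and the title, the axis label, and the off-screen "Agg" backend are choices I made so the snippet also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed inside a Jupyter notebook

import pandas as pd

# Made-up percentages (assumption: illustration only).
toss_table = pd.DataFrame(
    {"bat": [25.0, 50.0], "field": [75.0, 50.0]},
    index=pd.Index([2008, 2009], name="season"),
)

# kind="bar" draws one group of bars per season, one bar per column.
ax = toss_table.plot(kind="bar", figsize=(12, 6),
                     title="Toss decisions by season (made-up data)")
ax.set_ylabel("Percentage of matches")
ax.figure.savefig("toss_decisions.png")
```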

For 2008-2013, teams seemed to favour both batting first and second. For this period, teams chose to bat first more in 2009, 2010 and 2013. On the other hand, they chose fielding first more in 2008 and 2011. Things were even-steven in 2012.

This could be because IPL and T20 cricket in general was in its budding stages. So, teams were probably learning and trying to figure out which option would be more beneficial.

However, since 2014, teams have overwhelmingly chosen to bat second. Especially since 2016, teams have chosen to field first more than 80% of the time.

Batting first requires that the team gauge the conditions and the pitch and then set a target accordingly. Chasing is less complicated, as there is a fixed target to achieve.

Conditions have also become more batsman-friendly and the skills of the batsmen have increased tremendously (read more here).

Number of Wins

We saw how teams in the recent past have chosen to bat second more than 4 out of 5 times. Did this decision transform the results? Let's see.

For wins_batting_first, the value of win_by_wickets has to be 0. Also, the result column should have a value of normal, since tied matches also have win margins of 0. This condition was stored as filter1.

Similarly, for wins_fielding_first, the value of win_by_runs has to be 0 and the result column should have a value of normal. This condition was stored as filter2.

For both series, I used the count() method on the winner column to find the matches won under the filtered conditions. I divided the results by matches_per_season, calculated earlier, to give a better understanding.

To plot these two series together, I combined them using Pandas' concat() method. I passed the two series names as a list and set the value of axis as 1. This gives us a new data frame which was stored as combined_wins_df.
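The filtering and combining steps can be sketched like this. The rows are made up, I name the second condition filter2 so the two conditions don't clash, and the column labels batting_first and fielding_first are my own.

```python
import pandas as pd

# Made-up rows (assumption: illustration only).
matches_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "season": [2008, 2008, 2008, 2009, 2009, 2009],
    "result": ["normal", "normal", "tie", "normal", "normal", "normal"],
    "winner": ["Team A", "Team B", "Team A", "Team B", "Team B", "Team A"],
    "win_by_runs": [10, 0, 0, 0, 0, 25],
    "win_by_wickets": [0, 5, 0, 7, 3, 0],
})
matches_per_season = matches_df.groupby("season")["id"].count()

# Won batting first: no wickets margin, and not a tie or abandoned match.
filter1 = (matches_df.win_by_wickets == 0) & (matches_df.result == "normal")
# Won fielding first: no runs margin, and not a tie or abandoned match.
filter2 = (matches_df.win_by_runs == 0) & (matches_df.result == "normal")

wins_batting_first = (
    matches_df[filter1].groupby("season")["winner"].count() / matches_per_season
)
wins_fielding_first = (
    matches_df[filter2].groupby("season")["winner"].count() / matches_per_season
)

# axis=1 places the two series side by side as columns.
combined_wins_df = pd.concat(
    [wins_batting_first.rename("batting_first"),
     wins_fielding_first.rename("fielding_first")],
    axis=1,
)
print(combined_wins_df)
```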

Next I plotted combined_wins_df as a bar chart using plot().

We saw earlier that for 2008-2013, teams faced a conundrum whether to bat first or field first. This is partially visible in the results as well.

The wins from batting first are very close to that from fielding first. However, there is just one season where teams batting first won more, with things being equal in 2013.

Again, since 2014, things have been in favour of teams chasing except 2015. Leaving out 2015, things have been overwhelmingly in favour of teams fielding first.

So, teams choosing to field more have been justified in their decisions.

Teams with "History"

In leagues across different sports, there is always talk about teams with "history" – teams that have played the most in the league and continue to do so. Let's find those teams in the IPL.

Now, between two teams A and B, it can be "A vs B" or "B vs A", depending on how the data entry has been done. So I decided to count the total number of different values for both the team1 and team2 columns using value_counts(). Then I added them together.

I sorted the results in descending order using the sort_values() method from Pandas. The ascending parameter was set to False.
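A sketch of this counting on made-up fixtures. I use add() with fill_value=0 rather than a plain +, so a team that appears in only one of the two columns would still get a count. That detail is my own addition.

```python
import pandas as pd

# Made-up fixtures (assumption: illustration only).
matches_df = pd.DataFrame({
    "team1": ["Team A", "Team B", "Team A", "Team C"],
    "team2": ["Team B", "Team C", "Team C", "Team A"],
})

# Appearances as team1 plus appearances as team2.
total_matches_played = (
    matches_df.team1.value_counts()
    .add(matches_df.team2.value_counts(), fill_value=0)
    .astype(int)
    .sort_values(ascending=False)  # most matches first
)
print(total_matches_played)
```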

Here, I used sns.barplot() to plot the graph.

The Mumbai Indians have played the most matches. They are followed by the Royal Challengers Bangalore, Kolkata Knight Riders, Kings XI Punjab and Chennai Super Kings.

The Chennai Super Kings and Rajasthan Royals could have been higher had they not been banned.

You will see there are two teams from Delhi, the Delhi Daredevils and Delhi Capitals. This resulted from a change in ownership and then team name in 2018.

It's a similar story for the Deccan Chargers and Sunrisers Hyderabad, as the Deccan Chargers were removed from the IPL in 2013 and the Sunrisers came in their place.

Also, there are two teams with almost the same name: the Rising Pune Supergiants and Rising Pune Supergiant. They are the same team, and there was no change in ownership – it has more to do with superstitions.

In the 2016 season, the Rising Pune Supergiants finished 7th. The owners changed the captain for 2017 and also dropped the 's' from Supergiants. Well, it paid off as they finished as runner-up that season!

Teams with "Legacy"

Now, teams may have a lot of history but it's their "legacy" – how often they win – that makes them popular and attracts new and neutral fans.

To find such teams, I simply used value_counts() on the winner column. This gives us the number of matches that each team has won.

So Mumbai has the most wins. But a better metric to judge would be the win percentage. To find the win percentage, I divided most_wins by total_matches_played to find the win_percentage for each team.
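The win percentage calculation is a division of two series that share the same index. A sketch with made-up totals:

```python
import pandas as pd

# Made-up totals (assumption: illustration only).
total_matches_played = pd.Series({"Team A": 10, "Team B": 8, "Team C": 4})
most_wins = pd.Series({"Team A": 6, "Team B": 2, "Team C": 3})

# Pandas aligns the two series on team name before dividing.
win_percentage = (most_wins / total_matches_played * 100).sort_values(ascending=False)
print(win_percentage)
```

Notice how Team C tops this made-up table with only 4 matches played, which is the same small-sample effect that shows up in the real data.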

The Rising Pune Supergiant and Delhi Capitals have the highest win percentage. This is largely because they have played fewer matches compared to most teams. This is especially true of Rising Pune Supergiant, which technically became a new team after dropping the 's'.

The Chennai Super Kings, despite playing two fewer seasons than the Mumbai Indians, had only 9 fewer victories. They, along with the Mumbai Indians, are the only two teams in the top 5 that were also part of the IPL in 2008.

Chennai and Mumbai are the teams with the most legacy.

4. Asking and Answering Questions from the Data

We've already gained some insights about the IPL by exploring various columns of our dataset.

Let's ask some specific questions, and try to answer them using data frame operations and interesting visualizations.

Q. Who has won the IPL tournament?

Group the rows according to seasons using groupby().
Find the last match of each season, that is, the final, using tail(). It returns the last n rows from a DataFrame object or series based on position.
Sort the values per season using sort_values().
Count the different winners and the times they won using value_counts() on winner.
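The steps above can be sketched as follows. The rows are made up, and the sketch assumes that the rows of each season are stored in playing order, so that the last row of a season is its final.

```python
import pandas as pd

# Made-up rows (assumption: the last row of each season is the final).
matches_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "season": [2008, 2008, 2009, 2009, 2010, 2010],
    "winner": ["Team A", "Team B", "Team A", "Team C", "Team C", "Team B"],
})

# tail(1) keeps the last row of every season group.
finals_df = matches_df.groupby("season").tail(1).sort_values("season")

# Count how many finals each team has won.
ipl_winners = finals_df.winner.value_counts()
print(ipl_winners)
```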

Then I plotted the series ipl_winners using sns.barplot().

Mumbai and Chennai, our legacy teams, have won the IPL at least 3 times. The Sunrisers Hyderabad are the only team that joined the league later and won the trophy.

Q. Which are the most and least consistent teams across all seasons?

Created a data frame between different values of winner and season using pd.crosstab().
Plotted the data frame as a heatmap.

pd.crosstab() gives a simple cross-tabulation of the winner and season columns. For each different value of winner, pd.crosstab() finds its frequency for each different value in season.
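A small sketch of the cross-tabulation on made-up rows:

```python
import pandas as pd

# Made-up rows (assumption: illustration only).
matches_df = pd.DataFrame({
    "season": [2008, 2008, 2008, 2009, 2009],
    "winner": ["Team A", "Team A", "Team B", "Team B", "Team B"],
})

# Rows are winners, columns are seasons, cells are the number of wins.
matches_won_each_season = pd.crosstab(matches_df.winner, matches_df.season)
print(matches_won_each_season)
```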

Then I plotted matches_won_each_season using sns.heatmap(). I passed the data frame matches_won_each_season, with annot as True to have the values shown as well. Here, the darker color indicates more matches won.

The Chennai Super Kings have been the most consistent team, winning at least 8 matches in each of the seasons they have played. This is backed up by the fact that they are the only team to reach the playoffs stage every season.

At the other end of the spectrum are 3 teams, the Delhi Daredevils, Kings XI Punjab and Rajasthan Royals. All three of them have had two seasons where they performed really well. However, they have been pretty average during the other seasons.

Q. What has been the biggest margin of victory in terms of runs in the IPL?

Filter the data frame using the required condition.
Sort the values in descending order using sort_values().
Find the biggest 10 victories in the list using the head() method. It works opposite to tail(), returning the first n rows.
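The steps above can be sketched as shown here. The rows are made up, and I am assuming the "required condition" is win_by_runs greater than 0, meaning matches won by the team batting first. With so few rows I take the top 2 instead of the top 10.

```python
import pandas as pd

# Made-up rows (assumption: illustration only).
matches_df = pd.DataFrame({
    "season": [2008, 2009, 2010, 2011],
    "winner": ["Team A", "Team B", "Team C", "Team A"],
    "win_by_runs": [0, 45, 120, 12],
})

# Assumed condition: keep only matches that were won by runs.
won_by_runs = matches_df.win_by_runs > 0
highest_wins_by_runs_df = matches_df[won_by_runs].sort_values(
    "win_by_runs", ascending=False
)

# head(n) returns the first n rows, here the biggest margins.
top_wins_df = highest_wins_by_runs_df.head(2)
print(top_wins_df)
```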

I plotted the filtered data frame highest_wins_by_runs_df using sns.scatterplot(). For the x parameter I used season, and I used win_by_runs as the y parameter. I made the size of the points bigger for the top 10 victories using the s parameter.

To put emphasis on the top 10 victories, I used a different color as well as annotated those data points using plt.annotate(). The first parameter is the text of the annotation. The position of the point to be annotated is given as a tuple.

The biggest margin of victory by runs is 146 runs. In 2017, the Mumbai Indians defeated the Delhi Daredevils by this margin. The Royal Challengers Bangalore have 3 victories amongst the top 5.

Q. Mumbai and Chennai are the two most successful teams so far. Which team leads in the head-to-head record?

Filter the data frame using the required condition to find the matches played between the two teams.
Use the value_counts() on the winner column to find how many times each of the teams have won.
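A sketch of the head-to-head filter. The fixtures and results below are made up and do not reflect the real record. The condition relies on the fact that a team never plays itself, so if both team1 and team2 are in the pair, the match must be between the two.

```python
import pandas as pd

# Made-up fixtures and results (assumption: illustration only).
matches_df = pd.DataFrame({
    "team1": ["Mumbai Indians", "Chennai Super Kings", "Mumbai Indians",
              "Mumbai Indians", "Kolkata Knight Riders"],
    "team2": ["Chennai Super Kings", "Mumbai Indians", "Chennai Super Kings",
              "Kolkata Knight Riders", "Chennai Super Kings"],
    "winner": ["Mumbai Indians", "Chennai Super Kings", "Mumbai Indians",
               "Mumbai Indians", "Chennai Super Kings"],
})

rivals = ["Mumbai Indians", "Chennai Super Kings"]

# True only when both sides of the fixture belong to the pair.
is_mi_v_csk = matches_df.team1.isin(rivals) & matches_df.team2.isin(rivals)

mivcsk = matches_df[is_mi_v_csk].winner.value_counts()
print(mivcsk)
```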

I plotted the series mivcsk as a bar chart for a better visualization.

MI have dominated CSK and are leading the head-to-head record 17-11. We can see their dominance especially in the 2019 season, where MI defeated CSK all 4 times they met, including the playoff and the final.

5. Inferences from the Analysis

We have drawn some interesting inferences and now know more about the IPL than when we started. Here's a summary of what we learned through our analysis:

Almost 60 matches are played in every IPL season amongst 8 teams.
There has been an attempt to expand the IPL to 10 teams but the 8 teams idea was brought back and has been continued since.
For the first six seasons (2008-2013), teams were figuring out whether batting first or chasing would be better after winning the toss. This could be down to the fact that the IPL and T20 cricket were both in their early stages so teams were trying different strategies.
But, since 2014, teams have preferred chasing, especially in the past 4 seasons (2016-2019) where teams have chosen to field more than 4 times out of 5. This is likely because having a set total to chase makes things simpler. This could also result from teams preferring to chase in ODIs as well.
Though teams have overwhelmingly chosen to field first, the win percentage after choosing to bat or field is not that one-sided. However, their difference is on the rise.
Mumbai Indians have played the most matches in the IPL. Due to the brief expansion, change of owners, and removal and banning of teams, there have been 15 teams who have played in the IPL.
Chennai and Mumbai are the two teams with the highest win percentage. The fact that they are the only two teams that were part of the first season as well, in the top 5, shows their dominance.
Mumbai Indians have won the IPL 4 times, the most. They are followed by Chennai at 3 and Kolkata Knight Riders at 2. Sunrisers Hyderabad, Deccan Chargers and Rajasthan Royals complete the IPL Champions list, all winning once each.
146 runs is the largest margin of victory by runs. Mumbai Indians defeated Delhi Daredevils by this margin in 2017. The largest margin for victory by wickets is 10, which has been achieved many times.
The two heavyweights, Mumbai and Chennai, have a head-to-head record in favour of Mumbai at 17-11. Mumbai have had the upper hand in the 2019 season every time they met, including the final.

6. Conclusion

In this article, we did a bunch of analysis and saw some interesting visualizations. However, this was just scratching the surface.

You can perform more interesting analysis on matches.csv as a standalone data set. But combining deliveries.csv with this dataset could lead to more in-depth analysis.

I did this data analysis and visualization as a project for the 6-week course Data Analysis with Python: Zero to Pandas. This course was conducted by Jovian.ml in partnership with freeCodeCamp.org. Check out the project here.

Also, the IPL is on right now. Go watch it and enjoy!

I did a Kaggle competition as a semester project at uni. Here’s what I learned.

freeCodeCamp — Wed, 24 Apr 2019 17:36:04 +0000

By Ane Berasategi

It was my first competition and my first semester. I didn’t know what I was doing.

Photo by Miguel Henriques (https://unsplash.com/@miguel_photo?utm_source=medium&utm_medium=referral) on Unsplash

Choosing the topic

Very quickly, the topics of decision trees, naïve Bayes, random forests, SVMs, logistic regression, etc. were picked. I barely knew what they were so I was excited at the thought of my peers squeezing these topics into 30-minute presentations, and I gave them all my attention and wrote all the notes I could.

Unfortunately, these first presentations were purely theoretical, since no one had had time to implement anything so early in the semester. I learnt later that the motivation behind presenting so early had been the urge to ‘pick an easy topic before someone else took it’ and ‘get the presentation over with’, postponing the implementation of the algorithm until later on in the semester.

As much as I tried, I didn’t understand much of the presentations. I need to visualize things, see the code, see examples. It’s not easy for me to follow a presentation full of mathematical notation and formulas.

Weeks passed, the professor started to urge us to choose a topic and set a date for the presentation, and I had nothing. I sat through a few more presentations before starting to panic.

The next presentations were a little more advanced: LDA, LSI, perceptrons, NNs, tensorflow, keras and word embeddings among others.

I was completely ignorant of some topics (LDA and LSI), but I did know some minimal ML. These presentations did include code, sometimes even too much. There was a lot of scrolling and very little time spent analyzing the code; the focus was purely on the results. I learnt about the origins of TensorFlow and Keras, and I was left exhausted and confused at the end of each presentation. As much as I’d tried, I hadn’t learnt much.

I was one of the last students left to choose a topic, and the professor was looking at me every time he mentioned the ‘friendly reminder’. I got the message.

I tried to think rationally: there weren’t many obvious topics left, and I wanted a topic interesting for me and for the other students where I could put everything to use, not just a data structure or a ML model. The subject had 6 ECTS and I wanted to use the time to produce something I could be proud of.

I asked my friend to Google for classification problems in NLP, and after some searching I found out about sentiment analysis. It wrapped everything together beautifully, and I had my topic. I checked if someone had already picked it up, no one had, I told the professor, he said ‘Finally!’, and I started gathering my references. The wheel was in motion.

The following week, at another lecture, a guest lecturer gave a very interesting talk about his Master Thesis, on sentiment analysis. Of course. My fellow classmates and I spent 90 mins learning about it, the motivation of using it, the applications, the development, the code, the results, everything. It was a majestic Master Thesis and a very illustrative talk, and it ruined my presentation.

I could still have done my project on the same topic, but everyone had just listened to an experienced researcher talk about it thoroughly for 90 mins. There was no way I could do the equivalent of his Master Thesis in a couple of months, so I decided to keep looking for something unique, something I could present and people would say: “oooh”.

At this point, panic mode was on.

My presentation date was in 2 months, my awesome topic was no more, and I needed something, fast. I was scrolling through Twitter trying to ignore the pressure when I saw Kaggle announced their brand new Quora insincere questions classification competition, and I remember thinking:

Quora? I like Quora
Insincere questions? Sounds like fun!
Classification? Could this be…

I went to the webpage, and it was indeed a text classification problem. It was as if Kaggle had seen me drowning and lent me a helping hand. This competition could solve all my problems.

Had I ever done a Kaggle competition before? I had done some small ML projects but never a competition.
Was the competition for beginners? No, it was hosted by Quora with real prizes, and professional people competing hard for it.
Did I have the slightest clue where to begin? No I did not.

So I went for it. Doing this project would be doing something completely different from the rest of the class, and of course I was afraid. This is the mental dialogue with myself:

What’s the worst that could happen?
Well, the professor might reject the topic.
Okay assuming it’s accepted, the worst thing?
Not finishing in time, not having something complete to present.
Fair point, what if I have something complete?
It could be terrible, worse than random classification.
That would indeed be bad.

So I set my goal to have something finished and ideally with a decent result in two months.

I enthusiastically pitched the idea to the professor, he listened and nodded and said: “Sure, you can change the topic”. I also heard “if you can pull it off” but I’m somewhat sure that last part came from inside my mind and he didn’t actually say it.

I was going to do the documentation and implementation of my submission at the same time, so I set to work.

The Kaggle competition

Since my ambitions were humble, I didn’t bother with the imposter syndrome. I made a list of the popular kernels on the website, went through them, understood them, combined them, tweaked them, and made my own.

1. EDA

The first thing to do was exploratory data analysis (EDA). In hindsight I spent way too much time exploring the questions, but in my defense, I didn’t know what I was doing, and some of the insincere questions were funny, I have to admit. I gathered all the questions Quora classifies as insincere and extracted some that I personally find funny. You can see them on my GitHub. And you can see my EDA on Kaggle.

2. Preprocessing

Strategies were a bit different in the preprocessing, and it took some more time to understand what people were doing. I learnt how to use word embeddings and how to adjust the input text so that the embeddings cover as much of the text as possible and the number of unknown words is kept to a minimum. I was quite proud of how much I learnt about text processing in such a short time.

I used GloVe as the pretrained embeddings. The text coverage at the beginning:

Of all the distinct words used, 31.5% are recognized by the embeddings, and of all the text, 88%. Some words are far more frequent than others, such as ‘the’, ‘a’, etc., so that 31.5% of the vocabulary makes up 88% of the total text.
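
To make the difference between the two percentages concrete, here is a small sketch of how vocabulary coverage and text coverage can be computed. It is not the code from my kernel: the `coverage` function, the toy questions and the tiny `embedding_vocab` set are all made up for illustration, and a real run would use the words of the GloVe file instead.

```python
from collections import Counter

def coverage(texts, embedding_vocab):
    """Return (vocabulary coverage, text coverage, out-of-vocabulary counts).

    Assumes `texts` contains at least one word.
    """
    # Count how often each distinct word appears across all texts.
    counts = Counter(word for text in texts for word in text.split())
    known = {w: c for w, c in counts.items() if w in embedding_vocab}
    # Fraction of distinct words that have an embedding.
    vocab_cov = len(known) / len(counts)
    # Fraction of all word occurrences that have an embedding.
    text_cov = sum(known.values()) / sum(counts.values())
    oov = Counter({w: c for w, c in counts.items() if w not in embedding_vocab})
    return vocab_cov, text_cov, oov

# Hypothetical data: two short "questions" and a four-word embedding vocabulary.
questions = ["the cat sat on the mat", "the qzx cat"]
embedding_vocab = {"the", "cat", "sat", "on"}

vocab_cov, text_cov, oov = coverage(questions, embedding_vocab)
print(f"vocabulary coverage: {vocab_cov:.1%}, text coverage: {text_cov:.1%}")
print(oov.most_common())
```

Because frequent words such as ‘the’ are counted every time they occur, text coverage is usually much higher than vocabulary coverage, which is the same pattern as the 31.5% versus 88% above.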

After lowercasing the text, expanding the contractions and removing special characters and punctuation, the coverage is as follows:

Out of vocabulary words (those not recognized by the embeddings) include the following, along with their frequency:

You can see my preprocessing kernel on Kaggle.
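
The three cleaning steps mentioned above (lowercasing, expanding contractions, removing special characters and punctuation) can be sketched roughly as follows. This is a simplified illustration and not the kernel itself: the `clean` function and the three-entry `CONTRACTIONS` mapping are hypothetical, and a real mapping would be much longer.

```python
import re

# Illustrative contraction mapping; a real one would have many more entries.
CONTRACTIONS = {"don't": "do not", "what's": "what is", "i'm": "i am"}

def clean(text):
    # 1. Lowercase, so that "Best" and "best" share one embedding lookup.
    text = text.lower()
    # Normalise curly apostrophes so the contraction lookup matches.
    text = text.replace("’", "'")
    # 2. Expand contractions (plain substring replacement, good enough for a sketch).
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # 3. Replace anything that is not a letter, digit or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the repeated spaces left behind by step 3.
    return " ".join(text.split())

cleaned = clean("What's the BEST way?! I don't know...")
print(cleaned)
```

Each step turns words the embeddings do not know into words they do, which is why the coverage improves after cleaning.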

4. The model

Here my limited knowledge of ML helped me move a bit faster. The only bottleneck was deciding which architecture to use. People were using models from RNNs to LSTMs to BERT even, adding KFold, cyclical learning rates, bidirectional models, what?

My stress level went up: the presentation date was in two weeks and I didn’t understand any of the architectures. I picked the simplest one that could give me a decent score, so I started with an LSTM architecture.

I connected everything together, and I got a result. A terrible one, but a result nonetheless. My basic needs fulfilled, I started working on the presentation while I left model tuning as my procrastination activity. Eventually I added an Attention layer, and finally turned it into a bidirectional LSTM. The score was decent.

The final architecture I used was a BiLSTM with an Attention layer. It trained quite fast and gave a relatively good result. As before, you can see the whole kernel on Kaggle.
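
For readers wondering what an attention layer on top of a BiLSTM does, here is a NumPy sketch of one common formulation: score each timestep, turn the scores into weights that sum to 1, and return the weighted sum of the hidden states. It is an assumption that this matches the layer in my kernel closely; the `attention_pool` function, the shapes and the random numbers standing in for BiLSTM outputs are all illustrative.

```python
import numpy as np

def attention_pool(hidden, weight, bias):
    """Collapse a sequence of hidden states into one vector.

    hidden: (T, d) array, one d-dimensional state per timestep
    weight: (d,) learnable vector, bias: (T,) learnable vector
    """
    # One scalar score per timestep.
    scores = np.tanh(hidden @ weight + bias)
    # Normalise the scores into positive weights that sum to 1.
    weights = np.exp(scores)
    weights = weights / weights.sum()
    # Weighted sum of the hidden states: timesteps with higher weight count more.
    return weights @ hidden, weights

# Random numbers stand in for the outputs of a BiLSTM over a 5-word question.
rng = np.random.default_rng(0)
T, d = 5, 8
hidden = rng.normal(size=(T, d))
weight = rng.normal(size=d)
bias = np.zeros(T)

context, weights = attention_pool(hidden, weight, bias)
print(context.shape, weights.round(3))
```

In a real model `weight` and `bias` are trained together with the LSTM, and the resulting `context` vector is what gets passed to the final classification layer.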

5. The preparation

For the first time in my life, I had too much material for my presentation. I had to cut enough to fit into 30 minutes, but no more, lest I make my talk too general. I had to show code but not only code, since in my experience it’s difficult to focus on just code for half an hour.

I spent the last two weeks documenting my code, adding all the references I had used, just in case someone somewhere thought I had made that project by myself and had retrieved the information from my imagination.

The openness in Kaggle and the availability of public and well-documented code is one of the greatest incentives of using Kaggle in my opinion.

I polished my presentation and rehearsed with classmates to check that I didn’t talk for over 30 mins. I did, and they gave me tips to reduce repetition between what I was saying, showing in the slides, and showing again in the code; I made much simpler slide-code transitions as a result.

6. The presentation

For my presentation, I only used slides to explain the specifics of the competition: motivation, problem definition, input data, metrics, etc.

For the EDA and preprocessing, I had a slide explaining what I would show in the code, then I switched to the code, and then came back to the slides and showed a recap of what I had just shown. At the end, I included all the advanced model architecture additions I hadn’t had time to consider.

The presentation went very well. I only spoke for 30 mins and there was a follow-up discussion of another 30 mins, where the whole class discussed different strategies to classify insincere questions. The professor praised my creativity and said he would consider changing the structure of the semester so that more students did projects similar to mine.

I consider that a successful project!

7. Conclusion

Since I didn’t know what I was doing throughout the project, I had many doubts. It’s risky doing something completely different from the rest of the class: it can end very well or terribly.

I learnt that being creative can sometimes be rewarded, and that calculated risks are worth taking. In this case, I consulted with the professor before doing anything and he approved, so the risk was smaller.

I learnt a lot doing the Kaggle competition. I scored in the top 29%, which is not so terrible! I’m quite proud of it, considering it was my first competition.

If there’s anything I can say as a takeaway, it’s this:

If you’re at university or at a course/program, use the time to learn, experiment, and put yourself in situations where you could fail, but also succeed. My professional relationship with the professor got stronger because of my project.

If you can afford to do more than just completing the subject, consider going beyond what the professor says. Read the references, research online, propose topics. Who knows where your initiative could take you.

And lastly, you don’t have to do exactly what the other students do. Just because everyone follows a certain structure or submission format doesn’t mean it’s the correct one. Talk to the professor or teaching assistants, ask students who had the subject the previous year, and then decide consciously how you want to handle the subject.

I hope you liked my story! If you want to hear more about it or contact me in any way, you can reach me on Twitter.
