<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ scikit learn - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ scikit learn - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Wed, 13 May 2026 20:29:05 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/scikit-learn/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Machine Learning with Python and Scikit-Learn ]]>
                </title>
                <description>
                    <![CDATA[ Scikit-learn is an open-source machine learning library for Python, known for its simplicity, versatility, and accessibility. The library is well-documented and supported by a large community, making it a popular choice for both beginners and experie... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/machine-learning-with-python-and-scikit-learn/</link>
                <guid isPermaLink="false">66b2058f08bc664c3c097ef2</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Wed, 22 Nov 2023 16:14:14 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/11/machinelearning.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Scikit-learn is an open-source machine learning library for Python, known for its simplicity, versatility, and accessibility. The library is well-documented and supported by a large community, making it a popular choice for both beginners and experienced practitioners in the field of machine learning.</p>
<p>We just published an 18-hour course on the freeCodeCamp.org YouTube channel that is a practical and hands-on introduction to Machine Learning with Python and Scikit-Learn. It is directed at beginners with basic knowledge of Python and statistics.</p>
<p>The course is designed and taught by Aakash N S, CEO and co-founder of Jovian. Aakash has created many popular machine learning courses.  </p>
<p>The course starts with the basics of machine learning by exploring models like linear &amp; logistic regression and then moves on to tree-based models like decision trees, random forests, and gradient-boosting machines.</p>
<p>The course also discusses best practices for approaching and managing machine learning projects, and demonstrates how to build a state-of-the-art machine learning model for a real-world dataset from scratch. Then the course briefly looks at unsupervised learning &amp; recommendations, and walks through the process of deploying a machine-learning model to the cloud using the Flask web framework.</p>
<p>You will learn everything you need to know to start using Scikit-learn for machine learning. Scikit-learn offers a wide range of tools for various machine learning tasks, including classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. Scikit-learn is built upon NumPy, SciPy, and Matplotlib, and its user-friendly interface allows for easy integration into Python applications.</p>
<p>By the end of this course, you'll be able to confidently build, train, and deploy machine learning models in the real world. To get the most out of this course, follow along &amp; type out all the code yourself, and apply the techniques covered here to other real-world datasets &amp; competitions that you can find on platforms like Kaggle.</p>
<p>Here are the lessons in this course:</p>
<ul>
<li>Lesson 1 - Linear Regression and Gradient Descent</li>
<li>Lesson 2 - Logistic Regression for Classification</li>
<li>Lesson 3 - Decision Trees and Random Forests</li>
<li>Lesson 4 - How to Approach Machine Learning Projects</li>
<li>Lesson 5 - Gradient Boosting Machines with XGBoost</li>
<li>Lesson 6 - Unsupervised Learning using Scikit-Learn</li>
<li>Lesson 7 - Machine Learning Project from Scratch</li>
<li>Lesson 8 - Deploying a Machine Learning Project with Flask</li>
</ul>
<p>You can watch the full course on <a target="_blank" href="https://www.youtube.com/watch?v=hDKCxebp88A">the freeCodeCamp.org YouTube channel</a> (18-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/hDKCxebp88A" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Improve Machine Learning Code Quality with Scikit-learn Pipeline and ColumnTransformer ]]>
                </title>
                <description>
                    <![CDATA[ By Yannawut Kimnaruk When you're working on a machine learning project, the most tedious steps are often data cleaning and preprocessing. Especially when you're working in a Jupyter Notebook, running code in many cells can be confusing. The Scikit-le... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/machine-learning-pipeline/</link>
                <guid isPermaLink="false">66d4617123b027d0ff16f2ce</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 08 Sep 2022 16:31:20 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/09/Python-Power-BI-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Yannawut Kimnaruk</p>
<p>When you're working on a machine learning project, the most tedious steps are often data cleaning and preprocessing. Especially when you're working in a Jupyter Notebook, running code in many cells can be confusing.</p>
<p>The Scikit-learn library has tools called Pipeline and ColumnTransformer that can really make your life easier. Instead of transforming the dataframe step by step, the pipeline combines all transformation steps. You can get the same result with less code. It's also easier to understand data workflows and modify them for other projects.</p>
<p>This article will show you step by step how to create a machine learning pipeline, starting with an easy one and working up to a more complicated one. </p>
<p>If you are familiar with the Scikit-learn pipeline and ColumnTransformer, you can jump directly to the part you want to learn more about.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><a class="post-section-overview" href="#heading-what-is-the-scikit-learn-pipeline">What is the Scikit-learn Pipeline?</a></li>
<li><a class="post-section-overview" href="#heading-what-is-the-scikit-learn-columntransformer">What is the Scikit-learn ColumnTransformer?</a></li>
<li><a class="post-section-overview" href="#heading-whats-the-difference-between-the-pipeline-and-columntransformer">What's the Difference between the Pipeline and ColumnTransformer?</a></li>
<li><a class="post-section-overview" href="#heading-how-to-create-a-pipeline">How to Create a Pipeline</a></li>
<li><a class="post-section-overview" href="#heading-how-to-find-the-best-hyperparameter-and-data-preparation-method">How to Find the Best Hyperparameter and Data Preparation Method</a></li>
<li><a class="post-section-overview" href="#heading-how-to-add-custom-transformations-and-find-the-best-machine-learning-model">How to Add Custom Transformations</a></li>
<li><a class="post-section-overview" href="#heading-how-to-add-custom-transformations-and-find-the-best-machine-learning-model">How to Choose the Best Machine Learning Model</a></li>
</ul>
<h2 id="heading-what-is-the-scikit-learn-pipeline">What is the Scikit-learn Pipeline?</h2>
<p>Before training a model, you should split your data into a training set and a test set. Each dataset will go through the data cleaning and preprocessing steps before you put it in a machine learning model. </p>
<p>It's not efficient to write repetitive code for the training set and the test set. This is when the scikit-learn pipeline comes into play.</p>
<p>Scikit-learn pipeline is an elegant way to create a machine learning model training workflow. It looks like this:</p>
<p><img src="https://miro.medium.com/max/1308/1*3cbyBR99wFWklU6Sy85NEA.png" alt="Image" width="600" height="400" loading="lazy">
<em>Pipeline illustration</em></p>
<p>First of all, imagine creating a single pipeline into which you can feed any data. That data will be transformed into the appropriate format before model training or prediction. </p>
<p>The Scikit-learn pipeline is a tool that links all steps of data manipulation together to create a pipeline. It will shorten your code and make it easier to read and adjust. (You can even visualize your pipeline to see the steps inside.) It's also easier to perform GridSearchCV without data leakage from the test set.</p>
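<p>As a minimal sketch (with placeholder steps, not the exact ones used later in this article), a pipeline chains transformers and ends with an estimator:</p>
<pre><code class="lang-python">from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# Each step is a (name, transformer) tuple, and the last step can be a model
toy_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),  # fill missing values
    ('scale', MinMaxScaler()),                   # scale to the 0-1 range
    ('model', LogisticRegression())              # final estimator
])

# toy_pipeline.fit(X_train, y_train) then runs every step in order
</code></pre>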
<h2 id="heading-what-is-the-scikit-learn-columntransformer">What is the Scikit-learn ColumnTransformer?</h2>
<p>As stated on the scikit-learn website, this is the purpose of ColumnTransformer:</p>
<blockquote>
<p>"This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space.   </p>
<p>This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer."</p>
</blockquote>
<p>In short, ColumnTransformer will transform each group of dataframe columns separately and combine them later. This is useful in the data preprocessing process.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/09/image-207.png" alt="Image" width="600" height="400" loading="lazy">
<em>ColumnTransformer Illustration</em></p>
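<p>As a minimal sketch (the column names here are placeholders), a ColumnTransformer routes different groups of columns to different transformers and concatenates the results:</p>
<pre><code class="lang-python">from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Scale the numerical columns, one-hot encode the categorical ones,
# then concatenate the outputs into a single feature matrix
toy_col_trans = ColumnTransformer(transformers=[
    ('num', MinMaxScaler(), ['num_col_a', 'num_col_b']),
    ('cat', OneHotEncoder(), ['cat_col_a'])
])
</code></pre>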
<h2 id="heading-whats-the-difference-between-the-pipeline-and-columntransformer">What's the Difference between the Pipeline and ColumnTransformer?</h2>
<p>There is a big difference between Pipeline and ColumnTransformer that you should understand.</p>
<p><img src="https://miro.medium.com/max/1190/1*I0F-ALOL8J8f6V33CDKyrA.png" alt="Image" width="600" height="400" loading="lazy">
<em>Pipeline VS ColumnTransformer</em></p>
<p><strong>You use the pipeline</strong> for multiple transformations of the same columns.</p>
<p>On the other hand, <strong>you use the ColumnTransformer</strong> to transform each column set separately before combining them later.</p>
<p>Alright, with that out of the way, let’s start coding!!</p>
<h2 id="heading-how-to-create-a-pipeline">How to Create a Pipeline</h2>
<h3 id="heading-get-the-dataset">Get the Dataset</h3>
<p>You can download the data I used in this article from this <a target="_blank" href="https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists?datasetId=1019790&amp;sortBy=voteCount&amp;select=aug_train.csv">kaggle dataset</a>. Here's a sample of the dataset:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/09/image-210.png" alt="Image" width="600" height="400" loading="lazy">
<em>Dataset sample</em></p>
<p>I wrote an article exploring the data from this dataset which you can find <a target="_blank" href="https://medium.com/mlearning-ai/data-analysis-job-change-of-data-scientist-685f3de0a983">here if you're interested.</a></p>
<p>In short, this dataset contains information about job candidates and their decision about whether they want to change jobs or not. The dataset has both numerical and categorical columns.</p>
<p>Our goal is to predict whether a candidate will change jobs based on their information. This is a classification task.</p>
<h2 id="heading-data-preprocessing-plan">Data Preprocessing Plan</h2>
<p><img src="https://miro.medium.com/max/1400/1*ZT7S2SuhMd4Zazb2lVWmcw.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Note that I skipped categorical feature encoding for the simplicity of this article.</p>
<h3 id="heading-here-are-the-steps-well-follow">Here are the steps we'll follow:</h3>
<ol>
<li>Import and encode the data</li>
<li>Define sets of columns to be transformed in different ways</li>
<li>Create pipelines for numerical and categorical features</li>
<li>Create a ColumnTransformer to apply each pipeline to its column set</li>
<li>Add a model to the final pipeline</li>
<li>Display the pipeline</li>
<li>Split the data into train and test sets</li>
<li>Pass data through the pipeline</li>
<li>(Optional) Save the pipeline</li>
</ol>
<h3 id="heading-step-1-import-and-encode-the-data">Step 1: Import and Encode the Data</h3>
<p>After downloading the data, you can import it using Pandas like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

df = pd.read_csv(<span class="hljs-string">"aug_train.csv"</span>)
</code></pre>
<p>Then, encode the ordinal features using mappings to transform their categories into numbers (since the model takes only numerical input).</p>
<pre><code class="lang-python"><span class="hljs-comment"># Making Dictionaries of ordinal features</span>

relevent_experience_map = {
    <span class="hljs-string">'Has relevent experience'</span>:  <span class="hljs-number">1</span>,
    <span class="hljs-string">'No relevent experience'</span>:    <span class="hljs-number">0</span>
}

experience_map = {
    <span class="hljs-string">'&lt;1'</span>      :    <span class="hljs-number">0</span>,
    <span class="hljs-string">'1'</span>       :    <span class="hljs-number">1</span>, 
    <span class="hljs-string">'2'</span>       :    <span class="hljs-number">2</span>, 
    <span class="hljs-string">'3'</span>       :    <span class="hljs-number">3</span>, 
    <span class="hljs-string">'4'</span>       :    <span class="hljs-number">4</span>, 
    <span class="hljs-string">'5'</span>       :    <span class="hljs-number">5</span>,
    <span class="hljs-string">'6'</span>       :    <span class="hljs-number">6</span>,
    <span class="hljs-string">'7'</span>       :    <span class="hljs-number">7</span>,
    <span class="hljs-string">'8'</span>       :    <span class="hljs-number">8</span>, 
    <span class="hljs-string">'9'</span>       :    <span class="hljs-number">9</span>, 
    <span class="hljs-string">'10'</span>      :    <span class="hljs-number">10</span>, 
    <span class="hljs-string">'11'</span>      :    <span class="hljs-number">11</span>,
    <span class="hljs-string">'12'</span>      :    <span class="hljs-number">12</span>,
    <span class="hljs-string">'13'</span>      :    <span class="hljs-number">13</span>, 
    <span class="hljs-string">'14'</span>      :    <span class="hljs-number">14</span>, 
    <span class="hljs-string">'15'</span>      :    <span class="hljs-number">15</span>, 
    <span class="hljs-string">'16'</span>      :    <span class="hljs-number">16</span>,
    <span class="hljs-string">'17'</span>      :    <span class="hljs-number">17</span>,
    <span class="hljs-string">'18'</span>      :    <span class="hljs-number">18</span>,
    <span class="hljs-string">'19'</span>      :    <span class="hljs-number">19</span>, 
    <span class="hljs-string">'20'</span>      :    <span class="hljs-number">20</span>, 
    <span class="hljs-string">'&gt;20'</span>     :    <span class="hljs-number">21</span>
} 

last_new_job_map = {
    <span class="hljs-string">'never'</span>        :    <span class="hljs-number">0</span>,
    <span class="hljs-string">'1'</span>            :    <span class="hljs-number">1</span>, 
    <span class="hljs-string">'2'</span>            :    <span class="hljs-number">2</span>, 
    <span class="hljs-string">'3'</span>            :    <span class="hljs-number">3</span>, 
    <span class="hljs-string">'4'</span>            :    <span class="hljs-number">4</span>, 
    <span class="hljs-string">'&gt;4'</span>           :    <span class="hljs-number">5</span>
}

<span class="hljs-comment"># Transform categorical features into numerical features</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">encode</span>(<span class="hljs-params">df_pre</span>):</span>
    df_pre.loc[:,<span class="hljs-string">'relevent_experience'</span>] = df_pre[<span class="hljs-string">'relevent_experience'</span>].map(relevent_experience_map)
    df_pre.loc[:,<span class="hljs-string">'last_new_job'</span>] = df_pre[<span class="hljs-string">'last_new_job'</span>].map(last_new_job_map)
    df_pre.loc[:,<span class="hljs-string">'experience'</span>] = df_pre[<span class="hljs-string">'experience'</span>].map(experience_map)

    <span class="hljs-keyword">return</span> df_pre

df = encode(df)
</code></pre>
<h3 id="heading-step-2-define-sets-of-columns-to-be-transformed-in-different-ways">Step 2: Define Sets of Columns to be Transformed in Different Ways</h3>
<p>Numerical and categorical data should be transformed in different ways. So I define <code>num_cols</code> for numerical columns and <code>cat_cols</code> for categorical columns.</p>
<pre><code class="lang-python">num_cols = [<span class="hljs-string">'city_development_index'</span>,<span class="hljs-string">'relevent_experience'</span>, <span class="hljs-string">'experience'</span>,<span class="hljs-string">'last_new_job'</span>, <span class="hljs-string">'training_hours'</span>]

cat_cols = [<span class="hljs-string">'gender'</span>, <span class="hljs-string">'enrolled_university'</span>, <span class="hljs-string">'education_level'</span>, <span class="hljs-string">'major_discipline'</span>, <span class="hljs-string">'company_size'</span>, <span class="hljs-string">'company_type'</span>]
</code></pre>
<h3 id="heading-step-3-create-pipelines-for-numerical-and-categorical-features">Step 3: Create Pipelines for Numerical and Categorical Features</h3>
<p>The syntax of the pipeline is:</p>
<pre><code class="lang-python">Pipeline(steps = [(‘step name’, transform function), …])
</code></pre>
<p>For <strong>numerical features</strong>, I perform the following actions:</p>
<ol>
<li>SimpleImputer to fill in the missing values with the mean of that column.</li>
<li>MinMaxScaler to scale each value to the range 0 to 1 (unscaled features can hurt regression performance).</li>
</ol>
<p>For <strong>categorical features</strong>, I perform the following actions: </p>
<ol>
<li>SimpleImputer to fill in the missing values with the most frequent value of that column.</li>
<li>OneHotEncoder to expand each categorical column into multiple binary columns for model training. (handle_unknown='ignore' is specified to prevent errors when the encoder finds an unseen category in the test set.)</li>
</ol>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> OneHotEncoder, MinMaxScaler
<span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline

num_pipeline = Pipeline(steps=[
    (<span class="hljs-string">'impute'</span>, SimpleImputer(strategy=<span class="hljs-string">'mean'</span>)),
    (<span class="hljs-string">'scale'</span>,MinMaxScaler())
])
cat_pipeline = Pipeline(steps=[
    (<span class="hljs-string">'impute'</span>, SimpleImputer(strategy=<span class="hljs-string">'most_frequent'</span>)),
    (<span class="hljs-string">'one-hot'</span>,OneHotEncoder(handle_unknown=<span class="hljs-string">'ignore'</span>, sparse=<span class="hljs-literal">False</span>))
])
</code></pre>
<h3 id="heading-step-4-create-columntransformer-to-apply-the-pipeline-for-each-column-set">Step 4: Create ColumnTransformer to Apply the Pipeline for Each Column Set</h3>
<p>The syntax of the ColumnTransformer is:</p>
<pre><code class="lang-python">ColumnTransformer(transformers=[(‘step name’, transform function,cols), …])
</code></pre>
<p>Pass numerical columns through the numerical pipeline and pass categorical columns through the categorical pipeline created in step 3.</p>
<p><code>remainder='drop'</code> is specified to ignore the other columns in the dataframe.</p>
<p><code>n_jobs=-1</code> means that we'll be using all processors to run in parallel.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer

col_trans = ColumnTransformer(transformers=[
    (<span class="hljs-string">'num_pipeline'</span>,num_pipeline,num_cols),
    (<span class="hljs-string">'cat_pipeline'</span>,cat_pipeline,cat_cols)
    ],
    remainder=<span class="hljs-string">'drop'</span>,
    n_jobs=<span class="hljs-number">-1</span>)
</code></pre>
<h3 id="heading-step-5-add-a-model-to-the-final-pipeline">Step 5: Add a Model to the Final Pipeline</h3>
<p>I'm using the logistic regression model in this example.</p>
<p>Create a new pipeline to combine the ColumnTransformer from step 4 with the logistic regression model. I use a pipeline here because the entire dataframe must pass through the ColumnTransformer step and the modeling step, in that order.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> LogisticRegression

clf = LogisticRegression(random_state=<span class="hljs-number">0</span>)
clf_pipeline = Pipeline(steps=[
    (<span class="hljs-string">'col_trans'</span>, col_trans),
    (<span class="hljs-string">'model'</span>, clf)
])
</code></pre>
<h3 id="heading-step-6-display-the-pipeline">Step 6: Display the Pipeline</h3>
<p>The syntax for this is <code>display(pipeline name)</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> set_config

set_config(display=<span class="hljs-string">'diagram'</span>)
display(clf_pipeline)
</code></pre>
<p><img src="https://miro.medium.com/max/560/1*ZAQ6T65iADOmFx1eCJsjDQ.png" alt="Image" width="600" height="400" loading="lazy">
<em>Displayed pipeline</em></p>
<p>You can click on the displayed image to see the details of each step.<br>How convenient!</p>
<p><img src="https://miro.medium.com/max/1400/1*gahdAdZlFSICnQmiqbQYvg.png" alt="Image" width="600" height="400" loading="lazy">
<em>Expanded displayed pipeline</em></p>
<h3 id="heading-step-7-split-the-data-into-train-and-test-sets">Step 7: Split the Data into Train and Test Sets</h3>
<p>Split 20% of the data into a test set like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

X = df[num_cols+cat_cols]
y = df[<span class="hljs-string">'target'</span>]
<span class="hljs-comment"># train test split</span>
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, stratify=y)
</code></pre>
<p>I will fit the pipeline for the train set and use that fitted pipeline for the test set to prevent data leakage from the test set to the model.</p>
<h3 id="heading-step-8-pass-data-through-the-pipeline">Step 8: Pass Data through the Pipeline</h3>
<p>Here's the syntax for this:</p>
<pre><code class="lang-python">pipeline_name.fit, pipeline_name.predict, pipeline_name.score
</code></pre>
<p><code>pipeline.fit</code> passes data through a pipeline. It also fits the model.</p>
<p><code>pipeline.predict</code> uses the model trained during <code>pipeline.fit</code> to predict on new data.</p>
<p><code>pipeline.score</code> gets a score of the model in the pipeline (accuracy of logistic regression in this example).</p>
<pre><code class="lang-python">clf_pipeline.fit(X_train, y_train)
<span class="hljs-comment"># preds = clf_pipeline.predict(X_test)</span>
score = clf_pipeline.score(X_test, y_test)
print(<span class="hljs-string">f"Model score: <span class="hljs-subst">{score}</span>"</span>) <span class="hljs-comment"># model accuracy</span>
</code></pre>
<p><img src="https://miro.medium.com/max/1400/1*Y5liijw_WH1kRMnO4S3ung.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-optional-step-9-save-the-pipeline">(Optional) Step 9: Save the Pipeline</h3>
<p>The syntax for this is <code>joblib.dump</code>.</p>
<p>Use the joblib library to save the pipeline for later use, so you don’t need to create and fit the pipeline again. When you want to use a saved pipeline, just load the file using joblib.load like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> joblib

<span class="hljs-comment"># Save pipeline to file "pipe.joblib"</span>
joblib.dump(clf_pipeline,<span class="hljs-string">"pipe.joblib"</span>)

<span class="hljs-comment"># Load pipeline when you want to use</span>
same_pipe = joblib.load(<span class="hljs-string">"pipe.joblib"</span>)
</code></pre>
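<p>The loaded pipeline already contains the fitted transformers and model, so you can use it for prediction right away:</p>
<pre><code class="lang-python"># The loaded pipeline preprocesses and predicts in one call
preds = same_pipe.predict(X_test)
</code></pre>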
<h2 id="heading-how-to-find-the-best-hyperparameter-and-data-preparation-method">How to Find the Best Hyperparameter and Data Preparation Method</h2>
<p>A pipeline not only makes your code tidier, it can also help you optimize hyperparameters and data preparation methods.</p>
<h3 id="heading-heres-what-well-cover-in-this-section">Here's what we'll cover in this section:</h3>
<ul>
<li>How to find the changeable pipeline parameters</li>
<li>How to find the best hyperparameter sets: Add a pipeline to Grid Search</li>
<li>How to find the best data preparation method: Skip a step in a pipeline</li>
<li>How to find the best hyperparameter sets and the best data preparation method</li>
</ul>
<h3 id="heading-how-to-find-the-changeable-pipeline-parameters">How to Find the Changeable Pipeline Parameters</h3>
<p>First, let’s see the list of parameters that can be adjusted.</p>
<pre><code class="lang-python">clf_pipeline.get_params()
</code></pre>
<p>The result can be very long. Take a deep breath and continue reading.</p>
<p>The first part is just about the steps of the pipeline.</p>
<p><img src="https://miro.medium.com/max/1400/1*JWw_1l68o9z_D9ptmvIIMA.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Below the first part you'll find what we are interested in: a list of parameters that we can adjust.</p>
<p><img src="https://miro.medium.com/max/926/1*NCkmLiyit676K3M-HfEbnw.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The format is <strong>step1__step2__…__parameter</strong>.</p>
<p>For example, <code>col_trans__cat_pipeline__one-hot__sparse</code> refers to the <code>sparse</code> parameter of the one-hot step.</p>
<p><img src="https://miro.medium.com/max/876/1*ZITc6M2sB8Qxzr5BCnBMHQ.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can change parameters directly using <code>set_params</code>.</p>
<pre><code class="lang-python">clf_pipeline.set_params(model_C = <span class="hljs-number">10</span>)
</code></pre>
<h3 id="heading-how-to-find-the-best-hyperparameter-sets-add-a-pipeline-to-grid-search">How to Find the Best Hyperparameter Sets: Add a Pipeline to Grid Search</h3>
<p>Grid Search is a method you can use to perform hyperparameter tuning. It helps you find the optimum parameter sets that yield the highest model accuracy.</p>
<h4 id="heading-set-the-tuning-parameters-and-their-range">Set the tuning parameters and their range.</h4>
<p>Create a dictionary of tuning parameters (hyperparameters):</p>
<pre><code class="lang-python">{ 'tuning parameter' : 'possible value', ... }
</code></pre>
<p>In this example, I want to find the best penalty type and C of a logistic regression model.</p>
<pre><code class="lang-python">grid_params = {<span class="hljs-string">'model__penalty'</span> : [<span class="hljs-string">'none'</span>, <span class="hljs-string">'l2'</span>],
               <span class="hljs-string">'model__C'</span> : np.logspace(<span class="hljs-number">-4</span>, <span class="hljs-number">4</span>, <span class="hljs-number">20</span>)}
</code></pre>
<h4 id="heading-add-the-pipeline-to-grid-search">Add the pipeline to Grid Search</h4>
<pre><code class="lang-python">GridSearchCV(model, tuning parameter, …)
</code></pre>
<p>Our pipeline has a model step as the final step, so we can input the pipeline directly to the GridSearchCV function.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> GridSearchCV

gs = GridSearchCV(clf_pipeline, grid_params, cv=<span class="hljs-number">5</span>, scoring=<span class="hljs-string">'accuracy'</span>)
gs.fit(X_train, y_train)

print(<span class="hljs-string">"Best Score of train set: "</span>+str(gs.best_score_))
print(<span class="hljs-string">"Best parameter set: "</span>+str(gs.best_params_))
print(<span class="hljs-string">"Test Score: "</span>+str(gs.score(X_test,y_test)))
</code></pre>
<p><img src="https://miro.medium.com/max/1252/1*JP64DvryL62BV2Z8ctyVXw.png" alt="Image" width="600" height="400" loading="lazy">
<em>Result of Grid Search</em></p>
<p>After setting up the grid search, you can fit it to the data and see the results. Let's see what the code is doing:</p>
<ul>
<li><code>.fit</code>: fits the model and tries all sets of parameters in the tuning parameter dictionary</li>
<li><code>.best_score_</code>: the highest accuracy across all sets of parameters</li>
<li><code>.best_params_</code>: The set of parameters that yield the best score</li>
<li><code>.score(X_test,y_test)</code>: The score when trying the best model with the test set.</li>
</ul>
<p>You can read more about GridSearchCV in the documentation <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">here</a>.</p>
<h3 id="heading-how-to-find-the-best-data-preparation-method-skip-a-step-in-a-pipeline">How to Find the Best Data Preparation Method: Skip a Step in a Pipeline</h3>
<p>Finding the best data preparation method can be difficult without a pipeline since you have to create so many variables for many data transformation cases.</p>
<p>With the pipeline, we can create data transformation steps in the pipeline and perform a grid search to find the best step. A grid search will select which step to skip and compare the result of each case.</p>
<h4 id="heading-how-to-adjust-the-current-pipeline-a-little">How to adjust the current pipeline a little</h4>
<p>I want to know which scaling method will work best for my data between MinMaxScaler and StandardScaler.</p>
<p>I add a StandardScaler step to the num_pipeline. The rest doesn't change.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler

num_pipeline2 = Pipeline(steps=[
    (<span class="hljs-string">'impute'</span>, SimpleImputer(strategy=<span class="hljs-string">'mean'</span>)),
    (<span class="hljs-string">'minmax_scale'</span>, MinMaxScaler()),
    (<span class="hljs-string">'std_scale'</span>, StandardScaler()),
])

col_trans2 = ColumnTransformer(transformers=[
    (<span class="hljs-string">'num_pipeline'</span>,num_pipeline2,num_cols),
    (<span class="hljs-string">'cat_pipeline'</span>,cat_pipeline,cat_cols)
    ],
    remainder=<span class="hljs-string">'drop'</span>,
    n_jobs=<span class="hljs-number">-1</span>)

clf_pipeline2 = Pipeline(steps=[
    (<span class="hljs-string">'col_trans'</span>, col_trans2),
    (<span class="hljs-string">'model'</span>, clf)
])
</code></pre>
<p><img src="https://miro.medium.com/max/526/1*K1pdg8EFtGLIhNSEUQ0DsA.png" alt="Image" width="600" height="400" loading="lazy">
<em>Adjusted pipeline</em></p>
<h3 id="heading-how-to-perform-grid-search">How to Perform Grid Search</h3>
<p>In grid search parameters, specify the steps you want to skip and set their value to <strong>passthrough</strong>.</p>
<p>Since MinMaxScaler and StandardScaler should not run at the same time, I will use <strong>a list of dictionaries</strong> for the grid search parameters.</p>
<pre><code class="lang-python">[{case <span class="hljs-number">1</span>},{case <span class="hljs-number">2</span>}]
</code></pre>
<p>If you use a list of dictionaries, grid search will try every parameter combination in case 1, then every parameter combination in case 2. So there is no case where MinMaxScaler and StandardScaler are used together.</p>
<pre><code class="lang-python">grid_step_params = [{<span class="hljs-string">'col_trans__num_pipeline__minmax_scale'</span>: [<span class="hljs-string">'passthrough'</span>]},
                    {<span class="hljs-string">'col_trans__num_pipeline__std_scale'</span>: [<span class="hljs-string">'passthrough'</span>]}]
</code></pre>
<p>Perform Grid Search and print the results (like a normal grid search).</p>
<pre><code class="lang-python">gs2 = GridSearchCV(clf_pipeline2, grid_step_params, scoring=<span class="hljs-string">'accuracy'</span>)
gs2.fit(X_train, y_train)

print(<span class="hljs-string">"Best Score of train set: "</span>+str(gs2.best_score_))
print(<span class="hljs-string">"Best parameter set: "</span>+str(gs2.best_params_))
print(<span class="hljs-string">"Test Score: "</span>+str(gs2.score(X_test,y_test)))
</code></pre>
<p><img src="https://miro.medium.com/max/1354/1*u-TK9RhHn0eSIRbtEUdWsQ.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The best case is minmax_scale : ‘passthrough’, so StandardScaler is the best scaling method for this data.</p>
<h3 id="heading-how-to-find-the-best-hyperparameter-sets-and-the-best-data-preparation-method">How to Find the Best Hyperparameter Sets and the Best Data Preparation Method</h3>
<p>You can find the best hyperparameter sets and the best data preparation method by adding tuning parameters to the dictionary of each case of the data preparation method.</p>
<pre><code class="lang-python">grid_params = {<span class="hljs-string">'model__penalty'</span> : [<span class="hljs-string">'none'</span>, <span class="hljs-string">'l2'</span>],
               <span class="hljs-string">'model__C'</span> : np.logspace(<span class="hljs-number">-4</span>, <span class="hljs-number">4</span>, <span class="hljs-number">20</span>)}

grid_step_params = [{**{<span class="hljs-string">'col_trans__num_pipeline__minmax_scale'</span>: [<span class="hljs-string">'passthrough'</span>]}, **grid_params},
                    {**{<span class="hljs-string">'col_trans__num_pipeline__std_scale'</span>: [<span class="hljs-string">'passthrough'</span>]}, **grid_params}]
</code></pre>
<p>grid_params will be added to both case 1 (skip MinMaxScaler) and case 2 (skip StandardScaler).</p>
<pre><code class="lang-python"><span class="hljs-comment"># You can merge dictionary using the syntax below.</span>

merge_dict = {**dict_1,**dict_2}
</code></pre>
<p>Perform Grid Search and print the results (like a normal grid search).</p>
<pre><code class="lang-python">gs3 = GridSearchCV(clf_pipeline2, grid_step_params2, scoring=<span class="hljs-string">'accuracy'</span>)
gs3.fit(X_train, y_train)

print(<span class="hljs-string">"Best Score of train set: "</span>+str(gs3.best_score_))
print(<span class="hljs-string">"Best parameter set: "</span>+str(gs3.best_params_))
print(<span class="hljs-string">"Test Score: "</span>+str(gs3.score(X_test,y_test)))
</code></pre>
<p><img src="https://miro.medium.com/max/1400/1*fLcVD6j9m2QcdkkYpoJOjA.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can find the best parameter set using <code>.best_params_</code>. Since <code>minmax_scale</code> is set to 'passthrough', StandardScaler is again the best scaling method for this data.</p>
<p>You can show all grid search cases using <code>.cv_results_</code>:</p>
<pre><code class="lang-python">pd.DataFrame(gs3.cv_results_)
</code></pre>
<p><img src="https://miro.medium.com/max/1400/1*Ddwx3CZ1k3kfEXYG2pGkMw.png" alt="Image" width="600" height="400" loading="lazy">
<em>GridSearch result</em></p>
<p>There are 80 cases for this example. The results include the running time and accuracy of each case, which is worth considering, since sometimes you may prefer the fastest model with acceptable accuracy over the most accurate one.</p>
<h2 id="heading-how-to-add-custom-transformations-and-find-the-best-machine-learning-model">How to Add Custom Transformations and Find the Best Machine Learning Model</h2>
<p>Searching for the best machine learning model can be a time-consuming task. The pipeline can make this task much more convenient so that you can shorten the model training and evaluation loop.</p>
<h3 id="heading-heres-what-well-cover-in-this-part">Here's what we'll cover in this part:</h3>
<ul>
<li>Add a custom transformation</li>
<li>Find the best machine learning model</li>
</ul>
<h3 id="heading-how-to-add-a-custom-transformation">How to Add a Custom Transformation</h3>
<p>Apart from standard data transformation functions such as MinMaxScaler from sklearn, you can also create your own transformation for your data.</p>
<p>In this example, I will create a custom transformer class that encodes ordinal features, using a mapping to transform categorical values into numerical ones. In simple words, we'll change data from text to numbers.</p>
<p>First we'll do the required data processing before regression model training.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.base <span class="hljs-keyword">import</span> TransformerMixin

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Encode</span>(<span class="hljs-params">TransformerMixin</span>):</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-comment"># Making Dictionaries of ordinal features</span>
        self.rel_exp_map = {
            <span class="hljs-string">'Has relevent experience'</span>: <span class="hljs-number">1</span>,
            <span class="hljs-string">'No relevent experience'</span>: <span class="hljs-number">0</span>}

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, df, y = None</span>):</span>
        <span class="hljs-keyword">return</span> self

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transform</span>(<span class="hljs-params">self, df, y = None</span>):</span>
        df_pre = df.copy()
        df_pre.loc[:,<span class="hljs-string">'relevent_experience'</span>] = df_pre[<span class="hljs-string">'relevent_experience'</span>]\
                               .map(self.rel_exp_map)
        <span class="hljs-keyword">return</span> df_pre
</code></pre>
<p>Here's an explanation of what's going on in this code:</p>
<ul>
<li>Create a class named Encode which inherits the base class called TransformerMixin from sklearn.</li>
<li>Inside the class, there are 3 necessary methods: <code>__init__</code>, <code>fit</code>, and <code>transform</code></li>
<li><strong><code>__init__</code></strong> will be called when a pipeline is created. It is where we define variables inside the class. I created a variable ‘rel_exp_map’ which is a dictionary that maps categories to numbers.</li>
<li><strong><code>fit</code></strong> will be called when fitting the pipeline. I left it blank for this case.</li>
<li><strong><code>transform</code></strong> will be called when the pipeline transform is used. This method requires a dataframe (df) as input, while y is None by default (the signature must accept a y argument, but it isn't used here).</li>
<li>In <strong>transform</strong>, the dataframe column ‘relevent_experience’ will be mapped with the rel_exp_map.</li>
</ul>
<p>Note that the <code>\</code> is only to continue the code to a new line.</p>
<p>Next, add this Encode class as a pipeline step.</p>
<pre><code class="lang-python">pipeline = Pipeline(steps=[
    (<span class="hljs-string">'Encode'</span>, Encode()),
    (<span class="hljs-string">'col_trans'</span>, col_trans),
    (<span class="hljs-string">'model'</span>, LogisticRegression())
])
</code></pre>
<p>Then you can fit, transform, or grid search the pipeline like a normal pipeline.</p>
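<p>For example, here's a quick sketch, assuming X_train and X_test still contain the raw text categories that the Encode step expects:</p>
<pre><code class="lang-python"># The Encode step runs first, so the raw dataframe can be passed directly
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
</code></pre>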
<h3 id="heading-how-to-find-the-best-machine-learning-model">How to Find the Best Machine Learning Model</h3>
<p>The first solution that came to my mind was adding many model steps in a pipeline and skipping a step by changing the step value to ‘passthrough’ in the grid search. This is like what we did when finding the best data preparation method.</p>
<pre><code class="lang-python">temp_pipeline = Pipeline(steps=[
    (<span class="hljs-string">'model1'</span>, LogisticRegression()),
    (<span class="hljs-string">'model2'</span>,SVC(gamma=<span class="hljs-string">'auto'</span>))
])
</code></pre>
<p>But I saw an error like this:</p>
<p><img src="https://miro.medium.com/max/700/1*2CGj8aBvcPbxDw_p9tpijg.png" alt="Image" width="600" height="400" loading="lazy">
<em>Error when there are 2 classifiers in 1 pipeline</em></p>
<p>Ah ha – you can’t have two classification models in a pipeline!</p>
<p>The solution to this problem is to create a custom transformation that receives a model as an input and performs grid search to find the best model.</p>
<h3 id="heading-here-are-the-steps-well-follow-1">Here are the steps we'll follow:</h3>
<ol>
<li>Create a class that receives a model as an input</li>
<li>Add the class in step 1 to a pipeline</li>
<li>Perform grid search</li>
<li>Print grid search results as a table</li>
</ol>
<h3 id="heading-step-1-create-a-class-that-receives-a-model-as-an-input">Step 1: Create a class that receives a model as an input</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.base <span class="hljs-keyword">import</span> BaseEstimator
<span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> LogisticRegression
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVC

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ClfSwitcher</span>(<span class="hljs-params">BaseEstimator</span>):</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, estimator = LogisticRegression(<span class="hljs-params"></span>)</span>):</span>
        self.estimator = estimator

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y=None, **kwargs</span>):</span>
        self.estimator.fit(X, y)
        <span class="hljs-keyword">return</span> self

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, y=None</span>):</span>
        <span class="hljs-keyword">return</span> self.estimator.predict(X)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
        <span class="hljs-keyword">return</span> self.estimator.predict_proba(X)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">score</span>(<span class="hljs-params">self, X, y</span>):</span>
        <span class="hljs-keyword">return</span> self.estimator.score(X, y)
</code></pre>
<p><strong>Code explanation:</strong></p>
<ul>
<li>Create a class named <code>ClfSwitcher</code> which inherits the base class called BaseEstimator from sklearn.</li>
<li>Inside the class, there are five necessary methods, just like a classification model: <code>__init__</code>, <code>fit</code>, <code>predict</code>, <code>predict_proba</code>, and <code>score</code></li>
<li><strong><code>__init__</code></strong> receives an estimator (model) as an input. I set LogisticRegression() as the default model.</li>
<li><strong><code>fit</code></strong> fits the wrapped model and returns <code>self</code>, as scikit-learn expects.</li>
<li>The other methods delegate to the wrapped estimator, so the class behaves as if it were the model itself.</li>
</ul>
<h3 id="heading-step-2-add-the-class-in-step-1-to-a-pipeline">Step 2: Add the class in step 1 to a pipeline</h3>
<pre><code class="lang-python">clf_pipeline = Pipeline(steps=[
    (<span class="hljs-string">'Encode'</span>, Encode()),
    (<span class="hljs-string">'col_trans'</span>, col_trans),
    (<span class="hljs-string">'model'</span>, ClfSwitcher())
])
</code></pre>
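<p>Because the estimator is just an <code>__init__</code> parameter of ClfSwitcher, you can also swap models on this pipeline directly (a quick sketch):</p>
<pre><code class="lang-python"># Swap the wrapped model without rebuilding the pipeline
clf_pipeline.set_params(model__estimator=SVC(gamma='auto'))
</code></pre>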
<h3 id="heading-step-3-perform-grid-search">Step 3: Perform Grid search</h3>
<p>There are two cases in the grid search parameters, each using a different classification model: logistic regression and a support vector machine.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> GridSearchCV

grid_params = [
    {<span class="hljs-string">'model__estimator'</span>: [LogisticRegression()]},
    {<span class="hljs-string">'model__estimator'</span>: [SVC(gamma=<span class="hljs-string">'auto'</span>)]}
]

gs = GridSearchCV(clf_pipeline, grid_params, scoring=<span class="hljs-string">'accuracy'</span>)
gs.fit(X_train, y_train)

print(<span class="hljs-string">"Best Score of train set: "</span>+str(gs.best_score_))
print(<span class="hljs-string">"Best parameter set: "</span>+str(gs.best_params_))
print(<span class="hljs-string">"Test Score: "</span>+str(gs.score(X_test,y_test)))
</code></pre>
<p><img src="https://miro.medium.com/max/700/1*4rxzC3Wv0y9QOw0G4iHxog.png" alt="Image" width="600" height="400" loading="lazy">
<em>Grid Search Result</em></p>
<p>The result shows that logistic regression yields the best result.</p>
<h3 id="heading-step-4-print-grid-search-results-as-a-table">Step 4: Print grid search results as a table</h3>
<pre><code class="lang-python">pd.DataFrame(gs.cv_results_)
</code></pre>
<p><img src="https://miro.medium.com/max/700/1*bzCWW5AJ3Jb2c5fdIR78LA.png" alt="Image" width="600" height="400" loading="lazy">
<em>Grid Search Result Table</em></p>
<p>Logistic regression has slightly higher accuracy than SVC and is much faster (lower fit time).</p>
<p>Remember that you can apply different data preparation methods for each model as well.</p>
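<p>As a sketch of that idea (assuming a pipeline whose col_trans step wraps num_pipeline2 with both scaler steps, and whose model step is a ClfSwitcher), each model case can carry its own preparation parameters:</p>
<pre><code class="lang-python"># Case 1: logistic regression with StandardScaler skipped
# Case 2: SVC with MinMaxScaler skipped
grid_params_combo = [
    {'model__estimator': [LogisticRegression()],
     'col_trans__num_pipeline__std_scale': ['passthrough']},
    {'model__estimator': [SVC(gamma='auto')],
     'col_trans__num_pipeline__minmax_scale': ['passthrough']}
]
</code></pre>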
<h2 id="heading-conclusion">Conclusion</h2>
<p>You can implement the Scikit-learn pipeline and ColumnTransformer from the data cleaning to the data modeling steps to make your code neater. </p>
<p>You can also find the best hyperparameter, data preparation method, and machine learning model with grid search and the passthrough keyword.</p>
<p>You can find my code in this <a target="_blank" href="https://github.com/Yannawut/ML_Pipeline">GitHub repository</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a GUI Using Gradio for Machine Learning Models ]]>
                </title>
                <description>
                    <![CDATA[ By Edem Gold If you have ever built a Machine Learning model, you've probably thought "well this was cool, but how will other people be able to see how cool it is?"  Model deployment is a part of Machine Learning which isn't talked about as much as i... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-gui-using-gradio-for-machine-learning-models/</link>
                <guid isPermaLink="false">66d84fc563d2055c664a1a63</guid>
                
                    <category>
                        <![CDATA[ deployment ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 27 Jan 2022 21:00:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/01/gradio-image-2.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Edem Gold</p>
<p>If you have ever built a Machine Learning model, you've probably thought "well this was cool, but how will other people be able to see how cool it is?" </p>
<p>Model deployment is a part of Machine Learning which isn't talked about as much as it should be.</p>
<p>So in this article, I will introduce you to a new tool that will help you generate a web app for your Machine Learning model which you can then share with other devs so they can try it out.</p>
<p>I will be building a simple neural network model using scikit-learn and I'll create a GUI for the model using Gradio (this is the cool new tool I spoke about).</p>
<p>Let's get started.</p>
<blockquote>
<p>We cannot solve our problems with the same thinking we used to create them - Albert Einstein</p>
</blockquote>
<h1 id="heading-what-is-gradio">What is Gradio?</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1632054788128/NVI4Jgdrd.png?auto=compress,format&amp;format=webp" alt="gradio cover.png" width="600" height="400" loading="lazy">
<em><strong><strong>image credits: <a target="_blank" href="https://gradio.app/">gradio</a></strong></strong></em></p>
<p>According to the <a target="_blank" href="https://gradio.app/">Gradio website</a>, </p>
<blockquote>
<p>Gradio allows you to quickly create customizable UI components around your TensorFlow or PyTorch models or even arbitrary Python functions.</p>
</blockquote>
<p>Well, that's not terribly informative, is it? 😅.</p>
<p>If you have ever used a Python GUI library like Tkinter, then Gradio is like that.</p>
<p>Gradio is a GUI library that allows you to create customizable GUI components for your Machine Learning model.</p>
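<p>As a minimal sketch of the idea (using the same Interface API this article builds on), wrapping any Python function gives you a working UI:</p>
<pre><code>#a tiny Gradio app: a text box in, a text box out
import gradio as gr

def greet(name):
    return "Hello " + name + "!"

gr.Interface(fn=greet, inputs="text", outputs="text").launch()
</code></pre>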
<p>Now that we understand what Gradio is, let's get into the project.</p>
<h2 id="heading-pre-requisite"><strong>Pre-requisite</strong></h2>
<p>For you to successfully work through this tutorial, you'll need to have Python installed.</p>
<h1 id="heading-lets-get-building">Let's Get Building</h1>
<p>You can check out the GitHub repo for the project <a target="_blank" href="https://github.com/EdemGold/gradio_project">here</a>. Now I'll take you through the project step by step.</p>
<h3 id="heading-install-the-required-packages">Install the required packages</h3>
<p>Let's install the required packages:</p>
<pre><code>pip install scikit-learn
</code></pre><pre><code>pip install pandas
</code></pre><pre><code>pip install numpy
</code></pre><pre><code>pip install gradio
</code></pre><h3 id="heading-get-our-data">Get our data</h3>
<p>Our data is going to be in the .CSV format. You can get the data by clicking <a target="_blank" href="https://raw.githubusercontent.com/EdemGold/gradio_project/main/diabetes.csv">here</a>.</p>
<h3 id="heading-import-the-packages">Import the Packages</h3>
<p>We are going to import the required packages like this:</p>
<pre><code><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-keyword">import</span> gradio <span class="hljs-keyword">as</span> gr
</code></pre><p>Next, we are going to filter the warnings so we don't see them.</p>
<pre><code><span class="hljs-keyword">import</span> warnings

warnings.filterwarnings(<span class="hljs-string">'ignore'</span>)
</code></pre><h3 id="heading-import-the-data">Import the data</h3>
<p>Next, we are going to import our data:</p>
<pre><code>data = pd.read_csv(<span class="hljs-string">'diabetes.csv'</span>)
</code></pre><p>Now let's see a little preview of our dataset with this command:</p>
<pre><code>data.head()
</code></pre><p>Let's see the feature columns in our dataset:</p>
<pre><code>print (data.columns)
</code></pre><h3 id="heading-get-our-variables">Get our Variables</h3>
<p>Next, we get our X and Y variables, so type in these commands:</p>
<pre><code>x = data.drop([<span class="hljs-string">'Outcome'</span>], axis=<span class="hljs-number">1</span>)

y = data[<span class="hljs-string">'Outcome'</span>]
</code></pre><h3 id="heading-split-the-data">Split the data</h3>
<p>Now we are going to split our data using scikit-learn's inbuilt <code>train_test_split</code> function.</p>
<pre><code><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y)
</code></pre><h3 id="heading-scale-our-data">Scale our data</h3>
<p>Next, we are going to scale our data using scikit-learn's inbuilt <em>StandardScaler</em> object.</p>
<pre><code><span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler

#instantiate StandardScaler object
scaler = StandardScaler()

#scale data
x_train_scaled = scaler.fit_transform(x_train)

#transform (not fit) the test data with the scaler fitted on the training data
x_test_scaled = scaler.transform(x_test)
</code></pre><p>In the code above, we scaled our data using the StandardScaler object made available to us through scikit-learn. To learn more about Scaling and why we do it, click <a target="_blank" href="https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/">here</a>.</p>
<h3 id="heading-instantiate-and-train-the-model">Instantiate and train the model</h3>
<p>In this section, we are going to create and train our model. The model we are going to use will be a Multi-Layer Perceptron Classifier, a neural network built into scikit-learn.</p>
<pre><code>#<span class="hljs-keyword">import</span> model object
<span class="hljs-keyword">from</span> sklearn.neural_network <span class="hljs-keyword">import</span> MLPClassifier
model =  MLPClassifier(max_iter=<span class="hljs-number">1000</span>,  alpha=<span class="hljs-number">1</span>)

#train model on training data
model.fit(x_train_scaled, y_train)

#getting model performance on test data
print(<span class="hljs-string">"accuracy:"</span>, model.score(x_test_scaled, y_test))
</code></pre><h3 id="heading-create-the-function-for-gradio">Create the function for Gradio</h3>
<p>Now comes the fun part. Here we are going to create a function that will take in the features of the data set which our model was trained on and pass it as an array to our model to predict. Then we are going to build our Gradio web app based on that function.</p>
<p>To understand why we have to write a function, you must first understand that Gradio builds its GUI components for our Machine Learning model around that function. The function gives Gradio a way to take input from users, pass it to the ML model for processing, and then hand the model's result back to Gradio to display.</p>
<p>Let's write some code...</p>
<p>First, we will get the feature columns which we will then pass onto our function.</p>
<pre><code>#getting our columns

print(data.columns)
</code></pre><p>Now we will create our function like this:</p>
<pre><code>def diabetes(Pregnancies, Glucose, Blood_Pressure, SkinThickness, Insulin, BMI, Diabetes_Pedigree, Age):
    #turn the arguments into a numpy array
    x = np.array([Pregnancies, Glucose, Blood_Pressure, SkinThickness, Insulin, BMI, Diabetes_Pedigree, Age])

    #scale the input with the same scaler the model was trained on
    x_scaled = scaler.transform(x.reshape(1, -1))

    prediction = model.predict(x_scaled)

    return prediction
</code></pre><p>In the code above, we passed the feature columns from our dataset as arguments into a function named <em>diabetes</em>. We turned the arguments into a NumPy array, scaled it with the same scaler we fitted earlier, and passed it to our model for prediction. Finally, we returned the model's predicted result.</p>
<h3 id="heading-create-our-gradio-interface">Create our Gradio Interface</h3>
<p>Now we are going to create our Web App interface using Gradio:</p>
<pre><code>outputs = gr.outputs.Textbox()

app = gr.Interface(fn=diabetes, inputs=[<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>], outputs=outputs,description=<span class="hljs-string">"This is a diabetes model"</span>)
</code></pre><p>The first thing we did above was to create a variable named outputs which holds the GUI component for our model result. The result of our model's prediction will be outputted in a text box.</p>
<p>Then we instantiated the Gradio Interface object and passed in our earlier <em>diabetes</em> function. Then we generated our inputs GUI component and told Gradio to expect 8 inputs in the form of numbers.</p>
<p>The inputs represent the feature columns that are present in our dataset – the same 8 feature column names we passed into our <em>diabetes</em> function.</p>
<p>Then we passed our earlier output variable into the outputs parameter present in the object.</p>
<p>Finally, we passed in the description of our web app into the description parameter.</p>
<h3 id="heading-launch-the-gradio-web-app">Launch the Gradio Web App</h3>
<p>Now we're going to launch our Gradio web app.</p>
<pre><code>app.launch()
</code></pre><p><strong>NOTE:</strong> If you are launching the Gradio app as a script from the command line, you will be given a localhost link which you can copy and paste into your browser to see your web app.</p>
<p>If you are launching the app from a Jupyter notebook, you will see a live preview of the app as you run the cell (and you will also be provided with a link).</p>
<h3 id="heading-host-and-share-your-web-app">Host and Share your Web App</h3>
<p>If you want to share your web app, all you have to do is put in <code>share=True</code> as a parameter in your launch object.</p>
<pre><code>#To provide a shareable link
app.launch(share=True)
</code></pre><p>You'll then get a link with a .gradio extension. But this shareable link lasts for only 24 hours, and it only works while your system is running, because Gradio hosts the web app on your system.</p>
<p>In simple words, for your link to work, your system has to be on. This is because Gradio uses your system to host the web app, so once your system is off the server connection is severed and you get a 500😅.</p>
<p>Luckily for us, Gradio also provides a way for you to permanently host your model. But the service is subscription-based, so you have to pay $7 monthly to access it. Permanent hosting is way out of the scope of this article (partly because the author is broke😅). But if you are interested in it, click <a target="_blank" href="https://www.gradio.app/introducing-hosted">here</a>.</p>
<h2 id="heading-important-resources"><strong>Important resources</strong></h2>
<ul>
<li><a target="_blank" href="https://gradio.app/">Gradio Website</a></li>
<li><a target="_blank" href="https://gradio.app/docs">Gradio Documentation</a></li>
<li><a target="_blank" href="https://github.com/gradio-app/gradio">Gradio on GitHub</a></li>
</ul>
<h2 id="heading-summary"><strong>Summary</strong></h2>
<p>The Gradio library is really cool and it helps solve a huge problem plaguing the Machine Learning community – model deployment.</p>
<p>90% of Machine Learning models built are not deployed, and Gradio is working to fix that.</p>
<p>It also serves as a way for beginners and experts to show off their models and also test the models in real life.</p>
<p>You can't go wrong with the Gradio Library. Give it a try.</p>
<p><a target="_blank" href="https://res.cloudinary.com/crunchbase-production/image/upload/c_lpad,h_256,w_256,f_auto,q_auto:eco,dpr_1/tv8zrejyehjshagvxgt7">Cover image source</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Machine Learning in Python – The Top New Scikit-Learn 0.24 Features You Should Know ]]>
                </title>
                <description>
                    <![CDATA[ By Davis David Scikit-learn is one of the most popular open-source and free machine learning libraries for Python.  The scikit-learn library contains a lot of efficient tools for machine learning and statistical modeling including classification, reg... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/machine-learning-python-new-scikit-learn-features-you-should-know/</link>
                <guid isPermaLink="false">66d84ec54540581f645440e3</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 04 Jun 2021 20:58:07 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/06/1_osadNSUIUZkwDqBC-ozxtg.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Davis David</p>
<p>Scikit-learn is one of the most popular open-source and free machine learning libraries for Python. </p>
<p>The scikit-learn library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering, and dimensionality reduction.</p>
<p>Many data scientists, machine learning engineers, and researchers rely on this library for their <a target="_blank" href="https://hackernoon.com/machine-learning-as-a-service-mlaas-with-sklearn-and-algorithmia-7299fbaed584?ref=hackernoon.com">machine learning</a> projects. I personally love using the scikit-learn library because it offers a ton of flexibility and it’s easy to understand its documentation with a lot of examples.</p>
<p>In this article, I’m happy to share with you the five best new features in scikit-learn 0.24.</p>
<h3 id="heading-first-install-the-latest-version-of-the-scikit-learn-library">First, Install the Latest Version of the Scikit-Learn Library</h3>
<p>Firstly, make sure you install the latest version (with pip):</p>
<pre><code>pip install --upgrade scikit-learn
</code></pre><p>If you are using conda, use the following command:</p>
<pre><code>conda install -c conda-forge scikit-learn
</code></pre><p><strong>Note:</strong> This version supports Python versions <strong>3.6</strong> to <strong>3.9</strong>.</p>
<p>Now, let’s look at the new features!</p>
<h2 id="heading-mean-absolute-percentage-error-mape">Mean Absolute Percentage Error (MAPE)</h2>
<p>The new version of scikit-learn introduces a new evaluation metric for regression problems called Mean Absolute Percentage Error (MAPE). Previously, you had to calculate MAPE yourself with a line of code like this:</p>
<pre><code class="lang-python">np.mean(np.abs((y_test - preds) / y_test))
</code></pre>
<p>But now you can call a function called <strong>mean_absolute_percentage_error</strong> from the <strong>sklearn.metrics</strong> module to evaluate the performance of your regression model.</p>
<p><strong>Example:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> mean_absolute_percentage_error
y_true = [<span class="hljs-number">3</span>, <span class="hljs-number">-0.5</span>, <span class="hljs-number">2</span>, <span class="hljs-number">7</span>]
y_pred = [<span class="hljs-number">2.5</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">2</span>, <span class="hljs-number">8</span>]

print(mean_absolute_percentage_error(y_true, y_pred))
</code></pre>
<p>0.3273809523809524</p>
<p><strong>Note:</strong> Keep in mind that the function does not represent the output as a percentage in the range [0, 100]. Instead, we represent it in the range [0, 1/eps]. The best value is <strong>0.0.</strong></p>
<h2 id="heading-onehotencoder-supports-missing-values">OneHotEncoder Supports Missing Values</h2>
<p><a target="_blank" href="https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f?ref=hackernoon.com">OneHotEncoder</a> can now handle missing values if presented in the dataset. It treats a missing value as a category. Let’s understand more about how it works in the following example.</p>
<p>First import the important packages:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd 
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> OneHotEncoder
</code></pre>
<p>Create a simple data-frame with a categorical feature that has missing values:</p>
<pre><code class="lang-python"><span class="hljs-comment"># intialise data of lists.</span>
data = {<span class="hljs-string">'education_level'</span>:[<span class="hljs-string">'primary'</span>, <span class="hljs-string">'secondary'</span>, <span class="hljs-string">'bachelor'</span>, np.nan,<span class="hljs-string">'masters'</span>,np.nan]}

<span class="hljs-comment"># Create DataFrame</span>
df = pd.DataFrame(data)

<span class="hljs-comment"># Print the output.</span>
print(df)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/06/zVaxL0LohRUpfDQhznRQ9z3y5tj1-1f9314q.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
<p>As you can see, we have two missing values in our <strong>education_level</strong> column.</p>
<p>Create the instance of OneHotEncoder:</p>
<pre><code class="lang-python">enc = OneHotEncoder()
</code></pre>
<p>Then fit and transform our data:</p>
<pre><code class="lang-python">enc.fit_transform(df).toarray()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/06/zVaxL0LohRUpfDQhznRQ9z3y5tj1-pn3531g0.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Our education_level column has been transformed and all missing values treated as a new category (check the last column of the above array).</p>
<h2 id="heading-new-method-for-feature-selection">New Method for Feature Selection</h2>
<p><strong>SequentialFeatureSelector</strong> is a new method for feature selection in scikit-learn. It can be either forward selection or backward selection.</p>
<h3 id="heading-forward-selection">Forward Selection</h3>
<p>Forward Selection iteratively finds the best new feature and then adds it to the set of selected features. </p>
<p>This means we start with zero features and then find a feature that maximizes the cross-validation score of an estimator. The selected feature is added to the set and the procedure is repeated until we reach our desired number of selected features.</p>
<h3 id="heading-backward-selection">Backward Selection</h3>
<p>Backward selection follows the same idea but in the opposite direction. Here we start with all features and then remove a feature from the set at each step until we reach the desired number of selected features.</p>
<h4 id="heading-example">Example</h4>
<p>Import the important packages:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.feature_selection <span class="hljs-keyword">import</span> SequentialFeatureSelector
<span class="hljs-keyword">from</span> sklearn.neighbors <span class="hljs-keyword">import</span> KNeighborsClassifier
<span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> load_iris
</code></pre>
<p>Load the iris dataset and its feature names:</p>
<pre><code class="lang-python">X, y = load_iris(return_X_y=<span class="hljs-literal">True</span>, as_frame=<span class="hljs-literal">True</span>)
feature_names = X.columns
</code></pre>
<p>Create the instance of the estimator:</p>
<pre><code class="lang-python">knn = KNeighborsClassifier(n_neighbors=<span class="hljs-number">3</span>)
</code></pre>
<p>Create the instance of SequentialFeatureSelector, set the number of features to select to be <strong>2</strong>, and set the direction to be “<strong>backward</strong>”:</p>
<pre><code class="lang-python">sfs = SequentialFeatureSelector(knn, n_features_to_select=<span class="hljs-number">2</span>,direction=<span class="hljs-string">'backward'</span>)
</code></pre>
<p>Finally learn the features to select:</p>
<pre><code class="lang-python">sfs.fit(X,y)
</code></pre>
<p>Show selected features:</p>
<pre><code class="lang-python">print(<span class="hljs-string">"Features selected by backward sequential selection: "</span>f{feature_names[sfs.get_support()].tolist()}<span class="hljs-string">")</span>
</code></pre>
<p>Features selected by backward sequential selection: [‘petal length (cm)’, ‘petal width (cm)’].</p>
<p>The only downside of this new feature selection method is that it can be slower than other methods you already know (SelectFromModel &amp; RFE), because it evaluates models with cross-validation.</p>
<h2 id="heading-new-methods-for-hyper-parameter-tuning">New Methods for Hyper-Parameter Tuning</h2>
<p>When it comes to hyper-parameter tuning, GridSearchCV and RandomizedSearchCV from scikit-learn have been the first choice for many data scientists. </p>
<p>But in the new version, we have two new classes for hyper-parameter tuning called <strong>HalvingGridSearchCV</strong> and <strong>HalvingRandomSearchCV</strong>.</p>
<p>HalvingGridSearchCV and HalvingRandomSearchCV use a new approach called <strong>successive halving</strong> to find the best hyperparameters. Successive halving is like a tournament among all hyper-parameter combinations.</p>
<h3 id="heading-how-does-successive-halving-work">How does successive halving work?</h3>
<p>In the first iteration, all candidate combinations of hyper-parameters are trained on a small subset of the observations (training data). </p>
<p>In the next iteration, only the combinations that performed well in the first iteration are kept, and they compete on a larger number of observations.</p>
<p>This selection process repeats at each iteration until the best combination of hyper-parameters is selected in the final iteration.</p>
<p><strong>Note:</strong> These classes are still experimental.</p>
<h4 id="heading-example-1">Example:</h4>
<p>Import the important packages:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> make_classification
<span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> RandomForestClassifier
<span class="hljs-keyword">from</span> sklearn.experimental <span class="hljs-keyword">import</span> enable_halving_search_cv  
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> HalvingRandomSearchCV
<span class="hljs-keyword">from</span> scipy.stats <span class="hljs-keyword">import</span> randint
</code></pre>
<p>Since these new classes are still experimental, to use them, we explicitly import <strong>enable_halving_search_cv</strong>.</p>
<p>Create a classification dataset by using the make_classification method:</p>
<pre><code class="lang-python">X, y = make_classification(n_samples=<span class="hljs-number">1000</span>)
</code></pre>
<p>Create the instance of the estimator. Here we use a Random Forest Classifier:</p>
<pre><code class="lang-python">clf = RandomForestClassifier(n_estimators=<span class="hljs-number">20</span>)
</code></pre>
<p>Create parameter distribution for tuning:</p>
<pre><code class="lang-python">param_dist = {<span class="hljs-string">"max_depth"</span>: [<span class="hljs-number">3</span>, <span class="hljs-literal">None</span>],
              <span class="hljs-string">"max_features"</span>: randint(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>),
              <span class="hljs-string">"min_samples_split"</span>: randint(<span class="hljs-number">2</span>, <span class="hljs-number">11</span>),
              <span class="hljs-string">"bootstrap"</span>: [<span class="hljs-literal">True</span>, <span class="hljs-literal">False</span>],
              <span class="hljs-string">"criterion"</span>: [<span class="hljs-string">"gini"</span>, <span class="hljs-string">"entropy"</span>]}
</code></pre>
<p>Then we instantiate the HalvingRandomSearchCV class with the RandomForestClassifier as the estimator and our parameter distribution:</p>
<pre><code class="lang-python">rsh = HalvingRandomSearchCV(
    estimator=clf,
    param_distributions=param_dist,
    cv = <span class="hljs-number">5</span>,
    factor=<span class="hljs-number">2</span>,
    min_resources = <span class="hljs-number">20</span>)
</code></pre>
<p>There are two important parameters in HalvingRandomSearchCV you need to know.</p>
<ol>
<li><strong>factor</strong> — This determines the proportion of the combination of hyper-parameters that are selected for each subsequent iteration. For example, <strong><em>factor=3</em></strong> means that only one-third of the candidates are selected for the next iteration.</li>
<li><strong>min_resources</strong> is the amount of resources (number of observations) allocated at the first iteration for each combination of hyper-parameters.</li>
</ol>
<p>Finally, we can fit the search object that we have created with our dataset.</p>
<pre><code class="lang-python">rsh.fit(X,y)
</code></pre>
<p>After training, we can inspect different outputs, such as:</p>
<p>The number of iterations:</p>
<pre><code class="lang-python">print(rsh.n_iterations_ )
</code></pre>
<p>which is 6.</p>
<p>Or the number of candidate parameters that were evaluated at each iteration:</p>
<pre><code class="lang-python">print(rsh.n_candidates_ )
</code></pre>
<p>which is <strong>[50, 25, 13, 7, 4, 2]</strong>.</p>
<p>Or the number of resources used at each iteration:</p>
<pre><code class="lang-python">print(rsh.n_resources_)
</code></pre>
<p>which is <strong>[20, 40, 80, 160, 320, 640]</strong>. Notice that the resources double at each iteration: each round multiplies them by <strong>factor</strong> (2 here), starting from <strong>min_resources</strong> (20).</p>
<p>Or the parameter setting that gave the best results on the hold-out data:</p>
<pre><code class="lang-python">print(rsh.best_params_)
</code></pre>
<p><strong>{‘bootstrap’: False,</strong><br><strong>‘criterion’: ‘entropy’,</strong><br><strong>‘max_depth’: None,</strong><br><strong>‘max_features’: 5,</strong><br><strong>‘min_samples_split’: 2}</strong></p>
<h2 id="heading-new-self-training-meta-estimator-for-semi-supervised-learning">New self-training meta-estimator for semi-supervised learning</h2>
<p>Scikit-learn 0.24 has introduced a new self-training implementation for semi-supervised learning called <strong>SelfTrainingClassifier</strong>. You can use SelfTrainingClassifier with any supervised classifier that can return probability estimates for each class.</p>
<p>This means any supervised classifier can function as a semi-supervised classifier, allowing it to learn from unlabeled observations in the dataset.</p>
<p><strong>Note:</strong> The unlabeled values in the target column must have a value of -1.</p>
<p>Let’s understand more about how it works in the following example.</p>
<p>Import the important packages:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> datasets
<span class="hljs-keyword">from</span> sklearn.semi_supervised <span class="hljs-keyword">import</span> SelfTrainingClassifier
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVC
</code></pre>
<p>In this example, we will use the iris dataset and the Support Vector Machine algorithm as the supervised classifier (it implements <strong>fit</strong> and <strong>predict_proba</strong>).</p>
<p>Then we load the dataset and randomly select some of the observations to be unlabeled:</p>
<pre><code class="lang-python">rng = np.random.RandomState(<span class="hljs-number">42</span>)
iris = datasets.load_iris()
random_unlabeled_points = rng.rand(iris.target.shape[<span class="hljs-number">0</span>]) &lt; <span class="hljs-number">0.3</span>
iris.target[random_unlabeled_points] = <span class="hljs-number">-1</span>
</code></pre>
<p>As you can see, unlabeled values in the target column have a value of -1.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/06/zVaxL0LohRUpfDQhznRQ9z3y5tj1-jcah31ok.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Create an instance of the supervised estimator:</p>
<pre><code class="lang-python">svc = SVC(probability=<span class="hljs-literal">True</span>, gamma=<span class="hljs-string">"auto"</span>)
</code></pre>
<p>Create an instance of the self-training meta estimator and add svc as a base_estimator:</p>
<pre><code class="lang-python">self_training_model = SelfTrainingClassifier(base_estimator=svc)
</code></pre>
<p>Finally, we can train self_training_model on the iris dataset that has some unlabeled observations:</p>
<pre><code class="lang-python">self_training_model.fit(iris.data, iris.target)
</code></pre>
<p>SelfTrainingClassifier(base_estimator=SVC(gamma=’auto’, probability=True))</p>
<h2 id="heading-final-thoughts-on-scikit-learn-024">Final Thoughts on Scikit-Learn 0.24</h2>
<p>As I said, scikit-learn remains one of the most popular open-source machine learning libraries. And it has all the <a target="_blank" href="https://towardsdatascience.com/14-lesser-known-impressive-features-of-scikit-learn-library-e7ea36f1149a?ref=hackernoon.com">features</a> you need to build an end-to-end machine learning project. </p>
<p>You can also implement the new impressive features presented in this article in your machine learning project.</p>
<p>You can find the highlights of other features released in scikit-learn 0.24 <a target="_blank" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_0_24_0.html?ref=hackernoon.com">here</a>.</p>
<p>Congratulations 👏👏, you have made it to the end of this article! I hope you have learned something new that will help you on your next machine learning or data science project.</p>
<p>If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!</p>
<p>You can also find me on Twitter <a target="_blank" href="https://twitter.com/Davis_McDavid?ref=hackernoon.com">@Davis_McDavid.</a></p>
<p>You can read <a target="_blank" href="https://hackernoon.com/u/davisdavid">other articles</a> here<em>.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Python scikit-learn Tutorial – Machine Learning Crash Course ]]>
                </title>
                <description>
                    <![CDATA[ Scikit-learn is one of the most popular machine leaning libraries for Python. It provides many unsupervised and supervised learning algorithms that make machine leaning simpler. We just published a scikit-learn course on the freeCodeCamp.org YouTube ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/learn-scikit-learn/</link>
                <guid isPermaLink="false">66b2050620f547d355775792</guid>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Wed, 07 Apr 2021 15:24:06 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/04/scikitlearn-1.png" medium="image" />
                <content:encoded>
<![CDATA[ <p>Scikit-learn is one of the most popular machine learning libraries for Python. It provides many unsupervised and supervised learning algorithms that make machine learning simpler.</p>
<p>We just published a scikit-learn course on the freeCodeCamp.org YouTube channel. This course will teach you the basics of scikit-learn so you can start using it in your own machine learning projects.</p>
<p>Vincent D. Warmerdam created this course. Vincent has taught many machine learning concepts on his <a target="_blank" href="https://calmcode.io/">website</a> and in his job as a research advocate. He has also created some useful open source libraries that work with scikit-learn. </p>
<p>Vincent has a knack for breaking down complex topics in a calm and simple manner.</p>
<p>First, you will get an overview of scikit-learn and learn about some high-level topics.</p>
<p>Next, you will learn about preprocessing tools. Preprocessing has a big impact on the performance of a model.</p>
<p>In the third section you will learn about metrics and how to create custom metrics for judging your machine learning models.</p>
<p>Then, you will learn about meta estimators. These relate to post-processing your data.</p>
<p>Finally, you will learn about a machine learning library that integrates with scikit-learn and tries to make machine learning more human. </p>
<p>Watch the full course below or <a target="_blank" href="https://youtu.be/0B5eIE_1vpU">on the freeCodeCamp.org YouTube channel</a> (2-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/0B5eIE_1vpU" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Machine Learning with Scikit-Learn—Full Course ]]>
                </title>
                <description>
                    <![CDATA[ Scikit-learn is a free machine learning library for the Python programming language. We have released a full course on the freeCodeCamp.org YouTube channel that will teach you about machine learning using scikit-learn (also known as sklearn). First y... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/machine-learning-with-scikit-learn-full-course/</link>
                <guid isPermaLink="false">66b20591297cd6de0bd54682</guid>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Wed, 24 Jun 2020 19:41:37 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2020/09/scikit-learn.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Scikit-learn is a free machine learning library for the Python programming language. We have released a full course on the <a target="_blank" href="https://youtu.be/pqNCD_5r0IU">freeCodeCamp.org YouTube channel</a> that will teach you about machine learning using scikit-learn (also known as sklearn).</p>
<p>First you will learn about the basics of machine learning and scikit-learn. Then you will learn about some common machine learning algorithms and how to implement them with scikit-learn. Finally, you will learn about artificial intelligence and the science behind it.</p>
<p>This course was created by DLAcademy. Throughout the course, machine learning concepts will be taught through practical examples.</p>
<p>Here are the topics covered:</p>
<ul>
<li>Installing scikit-learn</li>
<li>Plotting a graph</li>
<li>Identifying features and labels</li>
<li>Saving and opening a model</li>
<li>Classification</li>
<li>Train / test split</li>
<li>What is KNN?</li>
<li>What is SVM?</li>
<li>Linear regression</li>
<li>Logistic vs linear regression</li>
<li>KMeans</li>
<li>Neural networks</li>
<li>Overfitting and underfitting</li>
<li>Backpropagation</li>
<li>Cost function and gradient descent</li>
<li>CNNs</li>
<li>Implementing a handwritten digits recognizer</li>
</ul>
<p>Watch the course on the <a target="_blank" href="https://youtu.be/pqNCD_5r0IU">freeCodeCamp.org YouTube channel</a> (3 hour watch).</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Two hours later and still running? How to keep your sklearn.fit under control. ]]>
                </title>
                <description>
                    <![CDATA[ By Nathan Toubiana Written by Gabriel Lerner and Nathan Toubiana All you wanted to do was test your code, yet two hours later your Scikit-learn fit shows no sign of ever finishing. Scitime is a package that predicts the runtime of machine learning al... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/two-hours-later-and-still-running-how-to-keep-your-sklearn-fit-under-control-cc603dc1283b/</link>
                <guid isPermaLink="false">66c363c1ef766eb77cd78805</guid>
                
                    <category>
                        <![CDATA[ Data Sc ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                    <category>
                        <![CDATA[ timer ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 13 Mar 2019 15:36:10 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*aVzJTznRRfP1lM7AXe9yLw.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Nathan Toubiana</p>
<p><em>Written by <a target="_blank" href="https://medium.com/@gabi10004">Gabriel Lerner</a> and <a target="_blank" href="https://medium.com/@toubiana.nathan">Nathan Toubiana</a></em></p>
<p>All you wanted to do was test your code, yet two hours later your Scikit-learn fit shows no sign of ever finishing. Scitime is a package that predicts the runtime of machine learning algorithms so that you will not be caught off guard by an endless fit.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*aVzJTznRRfP1lM7AXe9yLw.jpeg" alt="Image" width="600" height="400" loading="lazy">
_Image by Kevin Ku on [unsplash.com](https://unsplash.com/photos/aiyBwbrWWlo" rel="noopener" target="<em>blank" title=")</em></p>
<p>Whether you are in the process of building a machine learning model or deploying your code to production, knowledge of how long your algorithm will take to fit is key to streamlining your workflow. With Scitime, you will be able to estimate in a matter of seconds how long the fit should take for the most commonly used Scikit Learn algorithms.</p>
<p>There have been a couple of research articles (such as <a target="_blank" href="https://www.sciencedirect.com/science/article/pii/S0004370213001082">this one</a>) published on that subject. However, as far as we know, there’s no practical implementation of it. The goal here is not to predict the exact runtime of the algorithm but more to give a rough approximation.</p>
<h3 id="heading-what-is-scitime">What is Scitime?</h3>
<p>Scitime is a python package requiring at least python 3.6 with <a target="_blank" href="https://github.com/pandas-dev/pandas">pandas</a>, <a target="_blank" href="https://github.com/scikit-learn/scikit-learn">scikit-learn</a>, <a target="_blank" href="https://github.com/giampaolo/psutil">psutil</a> and <a target="_blank" href="https://github.com/joblib/joblib">joblib</a> dependencies. You will find the Scitime repo <a target="_blank" href="https://github.com/nathan-toubiana/scitime">here</a>.</p>
<p>The main function in this package is called “<em>time</em>”. Given an input matrix X, an output vector y, and the Scikit Learn model of your choice, <em>time</em> will output both the estimated training time and its confidence interval. The package currently supports the following Scikit Learn algorithms, with plans to add more in the near future:</p>
<ul>
<li><a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html">KMeans</a></li>
<li><a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">RandomForestRegressor</a></li>
<li><a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html">SVC</a></li>
<li><a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">RandomForestClassifier</a></li>
</ul>
<h3 id="heading-quick-start">Quick Start</h3>
<p>Let’s install the package and run the basics.</p>
<p>First create a new virtualenv (this is optional, to avoid any version conflicts!)</p>
<pre><code>❱ virtualenv env
❱ source env/bin/activate
</code></pre><p>and then run:</p>
<pre><code>❱ (env) pip install scitime
</code></pre><p>or with conda:</p>
<pre><code>❱ (env) conda install -c conda-forge scitime
</code></pre><p>Once the installation has succeeded, you are ready to estimate the time of your first algorithm.</p>
<p>Let’s say you wanted to train a kmeans clustering, for example. You would first need to import the scikit-learn package, set the kmeans parameters, and also choose the inputs (a.k.a. <em>X</em>), here generated <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs">randomly</a> for simplicity.</p>
<p>Running this before doing the actual fit would give an approximation of the runtime:</p>
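<p>Here is a minimal sketch of that setup, based on the <em>Estimator</em> class and <em>time</em> function described in the usage guide below (the exact import path is an assumption):</p>
<pre><code class="lang-python">import numpy as np
from sklearn.cluster import KMeans

# assumed import path for the Estimator class described in the usage guide
from scitime import Estimator

# meta_algo defaults to 'RF'; verbose defaults to 0
estimator = Estimator(meta_algo='RF', verbose=0)

# the algo whose fit time we want to estimate, with randomly generated inputs
km = KMeans(n_clusters=10)
X = np.random.rand(100000, 10)

# one extra line of code: outputs the estimated time and its confidence interval
# (y is omitted since kmeans is unsupervised)
estimation, lower_bound, upper_bound = estimator.time(km, X)
print(estimation, lower_bound, upper_bound)
</code></pre>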
<p>As you can see, you can get this info in just one extra line of code! The inputs of the <em>time</em> function are exactly what's needed to run the fit (that is, the algo itself and X), which makes it even easier to use.</p>
<p>Looking more closely at the last line of the above code, the first output (<em>estimation</em>: 15 seconds in this case) is the predicted runtime you're looking for. Scitime will also output it with a confidence interval (<em>lower_bound</em> and <em>upper_bound</em>: 10 and 30 seconds in this case). You can always compare it to the actual training time by running:</p>
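<p>For example, with Python's standard <em>time</em> module (a simple sketch, reusing the <em>km</em> and <em>X</em> from above):</p>
<pre><code class="lang-python">import time

# time the actual fit and compare it to Scitime's estimation
start = time.time()
km.fit(X)
print(f"actual training time: {time.time() - start:.1f} seconds")
</code></pre>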
<p>In this case, on our local machine, the estimation is 15 seconds, whereas the actual training time is 20 seconds (but you might not get the same results, as we’ll explain later).</p>
<p><strong>As a quick usage guide:</strong></p>
<p><em>Estimator(meta_algo, verbose, confidence) class:</em></p>
<ul>
<li><strong>meta_algo</strong>: The estimator used to predict the time, either ‘RF’ or ‘NN’ (see details in next paragraph) — defaults to‘RF’</li>
<li><strong>verbose</strong>: Control of the amount of log output (either 0, 1, 2 or 3) — defaults to 0</li>
<li><strong>confidence</strong>: Confidence for intervals — defaults to 95%</li>
</ul>
<p><em>estimator.time(algo, X, y) function:</em></p>
<ul>
<li><strong>algo</strong>: algo whose runtime the user wants to predict</li>
<li><strong>X</strong>: numpy array of inputs to be trained</li>
<li><strong>y</strong>: numpy array of outputs to be trained (set to <em>None</em> if the algo is unsupervised)</li>
</ul>
<p>Quick note: to avoid any confusion, it’s worth highlighting that <strong>algo</strong> and <strong>meta_algo</strong> are two different things here: <strong>algo</strong> is the algorithm whose runtime we want to estimate, <strong>meta_algo</strong> is the algorithm used by Scitime to predict the runtime.</p>
<h3 id="heading-how-scitime-works">How Scitime works</h3>
<p>We are able to predict the runtime to fit by using our own estimator, which we call the meta algorithm (<em>meta_algo</em>), whose weights are stored in a dedicated pickle file in the package metadata. For each Scikit Learn model, you will find a corresponding meta algo pickle file in Scitime's code base.</p>
<p>You might be thinking:</p>
<blockquote>
<p>Why not manually estimate the time complexity with big O notations?</p>
</blockquote>
<p>That’s a fair point. It’s a valid way of approaching the problem, and something we thought about at the beginning of the project. However, we would need to formulate the complexity explicitly for each algo and set of parameters, which is rather challenging in some cases, given the number of factors playing a role in the runtime. The meta_algo basically does all the work for you, and we’ll explain how.</p>
<p>Two types of meta algos have been trained to estimate the time to fit (both from Scikit Learn):</p>
<ul>
<li>The <strong>RF</strong> meta algo, a <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">RandomForestRegressor</a> estimator.</li>
<li>The <strong>NN</strong> meta algo, a basic <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html">MLPRegressor</a> estimator.</li>
</ul>
<p>These meta algos estimate the time to fit using an array of ‘meta’ features. Here’s a summary of how we build these features:</p>
<p>First, we fetch the shape of your input matrix X and output vector y. Second, the parameters you feed to the Scikit Learn model are taken into consideration, as they will impact the training time as well. Lastly, hardware specific to your machine, such as available memory and cpu count, is also considered.</p>
<p>As shown earlier, we also provide confidence intervals on the time prediction. The way these are computed depends on the meta algo chosen:</p>
<ul>
<li>For <strong>RF</strong>, since any random forest regressor is a combination of multiple trees (also called <em>estimators</em>), the confidence interval will be based on the distribution of the set of predictions computed by each estimator.</li>
<li>For <strong>NN</strong>, the process is a little less straightforward: we first compute a set of <a target="_blank" href="https://en.wikipedia.org/wiki/Mean_squared_error">MSE</a>s along with the number of observations on a test set, grouped by predicted duration bins (that is from 0 to 1 second, 1 to 5 seconds, and so on), and we then compute a <a target="_blank" href="https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html">t-stat</a> to get the lower and upper bounds of the estimation. As we don’t have a lot of data for very long models, the confidence interval for such data might get very broad.</li>
</ul>
<h3 id="heading-how-we-built-it">How we built it</h3>
<p>You might be thinking:</p>
<blockquote>
<p>How did you get enough data on the training time of all these scikit-learn fits over various parameters and hardware configurations?</p>
</blockquote>
<p>The (unglamorous) answer is we generated the data ourselves, using a combination of computers and VM hardware to simulate what the training time would be on different systems. We then fitted our meta algos on these randomly generated data points to build an estimator meant to be reliable regardless of your system.</p>
<p>While the <a target="_blank" href="https://github.com/nathan-toubiana/scitime/blob/master/scitime/estimate.py">estimate.py</a> file handles the runtime prediction, the <a target="_blank" href="https://github.com/nathan-toubiana/scitime/blob/master/scitime/_model.py">_<em>model.py</em></a> file helped us generate data to train our meta algos, using our dedicated Model class. Here’s a corresponding code sample, for kmeans:</p>
<p>Note that you can also use the <a target="_blank" href="https://github.com/nathan-toubiana/scitime/blob/master/_data.py"><em>_data.py</em></a> file directly from the command line to generate data or train a new model. Related instructions can be found in the repo Readme file.</p>
<p>When generating data points, you can edit the parameters of the Scikit Learn models you want to train on. You can head to <a target="_blank" href="https://github.com/nathan-toubiana/scitime/blob/master/scitime/_config.json"><em>scitime/_config.json</em></a> and edit the parameters of the models, as well as the number of rows and columns you want to train with.</p>
<p>We use an <a target="_blank" href="https://docs.python.org/2/library/itertools.html#itertools.product">itertool</a> function to loop through every possible combination, along with a drop rate set between 0 and 1 to control how quickly the loop will jump through the different possible iterations.</p>
<h3 id="heading-how-accurate-is-scitime">How accurate is Scitime?</h3>
<p>Below, we highlight how our predictions perform for the specific case of kmeans. Our generated dataset contains ~100k data points, which we split into train and test sets (75% / 25%).</p>
<p>We grouped training predicted times by different time buckets and computed the <a target="_blank" href="https://en.wikipedia.org/wiki/Mean_absolute_percentage_error">MAPE</a> and <a target="_blank" href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">RMSE</a> over each of those buckets for all our estimators using the RF meta-algo and the NN meta-algo.</p>
<p>Please note that these results were performed on a restricted data set, so they might be different on unexplored data points (such as other systems / extreme values of certain model parameters). For this specific training set, the <a target="_blank" href="https://en.wikipedia.org/wiki/Coefficient_of_determination">R-squared</a> is around 80% for NN and 90% for RF.</p>
<p>As we can see, not surprisingly, the accuracy is consistently higher on the train set than on the test, for both NN and RF. We also see that RF seems to perform way better than NN overall. The MAPE for RF is around 20% on the train set and 40% on the test set. The NN MAPE is surprisingly very high.</p>
<p>Let’s slice the MAPE (on the test set) by the number of predicted seconds.</p>
<p>One important thing to keep in mind is that for some cases the time prediction is sensitive to the meta algo chosen (RF or NN). In our experience RF has performed very well within the data set input ranges, as shown above. However, for out of range points, NN might perform better, as suggested by the end of the above chart. This would explain why NN MAPE is quite high while the RMSE is decent: it performs poorly on small values.</p>
<p>As an example, if you try to predict the runtime of a kmeans with default parameters and with an input matrix of a few thousand lines, the RF meta algo will be precise because our training dataset contains similar data points. However, for predicting very specific parameters (for instance, a very high number of clusters), NN might perform better because it extrapolates from the training set, whereas RF doesn’t. NN performs worse on the above charts because these plots are only based on data close to the set of inputs of the training data.</p>
<p>However, as shown in this graph, the out of range values (thin lines) are extrapolated by the NN estimator, whereas the RF estimator predicts the output stepwise.</p>
<p>Now let’s look at the most important ‘meta’ features for the example of kmeans:</p>
<p>As we can see, only 6 features account for more than 80% of the model variance. Among them, the most important is a parameter of the scikit-learn kmeans class itself (number of clusters), but a lot of external factors have great influence on the runtime such as number of rows/columns and available memory.</p>
<h3 id="heading-limitations">Limitations</h3>
<p>As mentioned earlier, the first limitation is related to the confidence intervals: they may be very wide, especially for NN, and for heavy models (that would take at least an hour).</p>
<p>Additionally, the NN might perform poorly on small to medium predictions. Sometimes, for small durations, the NN might even predict a negative duration, in which case we automatically switch back to RF.</p>
<p>Another limitation of the estimator arises when ‘special’ algo parameter values are used. For example, in a RandomForest scenario, when max_depth is set to <em>None</em>, the depth could take any value. This might result in a much longer time to fit, which is more difficult for the meta algo to pick up, although we did our best to account for these cases.</p>
<p>When running <em>estimator.time(algo, X, y)</em>, we do require the user to enter the actual X and y vectors, which seems unnecessary, as we could simply request the shape of the data to estimate the training time. The reason for this is that we actually try to fit the model before predicting the runtime, in order to raise any instant errors. We run <em>algo.fit(X, y)</em> in a subprocess for one second to check for any fit error, after which we move on to the prediction part. However, there are times when the algo (and/or the input matrix) is so big that running <em>algo.fit(X, y)</em> will eventually throw a memory error, which we can't account for.</p>
<h3 id="heading-future-improvements">Future improvements</h3>
<p>The most effective and obvious way to improve the performance of our current predictions would be to generate more data points on different systems to better support a wide range of hardware/parameters.</p>
<p>We will be looking at adding more supported Scikit Learn algos in the near future. We could also implement other algos such as <a target="_blank" href="https://github.com/Microsoft/LightGBM">lightGBM</a> or <a target="_blank" href="https://github.com/dmlc/xgboost">xgboost</a>. Feel free to contact us if there’s an algorithm you would like us to implement in the next iterations of Scitime!</p>
<p>Other interesting avenues for improving the performance of the estimator would be to include more granular information about the input matrix such as variance, or correlation with output. We currently generate data completely randomly, for which the fit time might be higher than for real world datasets. So in some cases it might overestimate the training time.</p>
<p>In addition we could track finer hardware specific information such as frequency of the cpu, or current cpu usage.</p>
<p>Ideally, as the algorithm might change from a scikit-learn version to another, and thus have an impact on the runtime, we would also account for it, for example by using the version as a ‘meta’ feature.</p>
<p>As we acquire more data to fit our meta algos, we might think of using more complex meta algos, such as sophisticated neural networks (using regularization techniques like dropout or batch normalization). We could even consider using <a target="_blank" href="https://www.tensorflow.org">tensorflow</a> to fit the meta algo (and add it as optional): it would not only help us get a better accuracy, but also build more robust confidence intervals using <a target="_blank" href="https://towardsdatascience.com/uncertainty-estimation-for-neural-network-dropout-as-bayesian-approximation-7d30fc7bc1f2">dropout</a>.</p>
<h3 id="heading-contributing-to-scitime-and-sending-us-your-feedback">Contributing to Scitime and sending us your feedback</h3>
<p>First, any kind of feedback, especially on the performance of the predictions and on ideas to improve this process of generating data, is very much appreciated!</p>
<p>As discussed before, you can use our repo to generate your own data points in order to train your own meta algorithm. When doing so, you can help make Scitime better by sharing the data points found in the result csv (<em>~/scitime/scitime/[algo]_results.csv</em>) so that we can integrate them into our model.</p>
<p>To generate your own data you can run a command similar to this one (from the package repo source):</p>
<pre><code>❱ python _data.py --verbose <span class="hljs-number">3</span> --algo KMeans --drop_rate <span class="hljs-number">0.99</span>
</code></pre><p>Note: if run directly using the code source (with the <em>Model</em> class), do not forget to set <em>write_csv</em> to true, otherwise the generated data points will not be saved.</p>
<p><em>We use GitHub issues to track all bugs and feature requests. Feel free to open an issue if you have found a bug or wish to see a new feature implemented. More info can be found about how to contribute in the Scitime repo.</em></p>
<p><em>For issues with training time predictions, when submitting feedback, including the full dictionary of parameters you are fitting into your model might help, so that we can diagnose why the performance is subpar for your specific use case. To do so simply set the verbose parameter to 3 and copy paste the log of the parameter dic in the issue description.</em></p>
<p><em>Find the <a target="_blank" href="https://github.com/nathan-toubiana/scitime">code source</a></em></p>
<p><em>Find the <a target="_blank" href="https://scitime.readthedocs.io">documentation</a></em></p>
<h3 id="heading-credits">Credits</h3>
<ul>
<li><a target="_blank" href="https://github.com/gabrielRTR"><em>Gabriel Lerner</em></a> <em>&amp; <a target="_blank" href="https://github.com/nathan-toubiana">Nathan Toubiana</a> are the main contributors of this package and co-authors of this article</em></li>
<li><em>Special thanks to <a target="_blank" href="https://github.com/philippemizrahi">Philippe Mizrahi</a> for helping along the way</em></li>
<li><em>Thanks for all the help we got from early reviews / beta testing</em></li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ A beginner’s guide to training and deploying machine learning models using Python ]]>
                </title>
                <description>
                    <![CDATA[ By Ivan Yung When I was first introduced to machine learning, I had no idea what I was reading. All the articles I read consisted of weird jargon and crazy equations. How could I figure all this out? I opened a new tab in Chrome and looked for easier... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/a-beginners-guide-to-training-and-deploying-machine-learning-models-using-python-48a313502e5a/</link>
                <guid isPermaLink="false">66c341dfccd54aa295e92c5c</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 27 Jun 2018 16:33:23 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*-W-ioBNBUF5eSDYWc-ZHxQ.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Ivan Yung</p>
<p>When I was first introduced to machine learning, I had no idea what I was reading. All the articles I read consisted of weird jargon and crazy equations. How could I figure all this out?</p>
<p>I opened a new tab in Chrome and looked for easier solutions. I found APIs from Amazon, Microsoft, and Google that did all the machine learning for me. Each hackathon project I made would call their servers and WOW — it looked so smart! I was hooked.</p>
<p>But, after a year, I realized that I wasn’t learning anything. Everything I was doing was described by this Nedroid comic that I modified:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*1YwLOx3wkKoLjRUD-NoiZA.png" alt="Image" width="600" height="400" loading="lazy">
_[Original image source](https://nedroidcomics.tumblr.com/post/41879001445/the-internet" rel="noopener" target="<em>blank" title=").</em></p>
<p>Eventually, I sat down and learned how to use machine learning without megacorporations. And it turns out, anyone can do it. The current libraries we have in Python are amazing. In this article, I will explain how I use these libraries to create a proper machine learning back end.</p>
<h3 id="heading-getting-a-dataset">Getting a dataset</h3>
<p>Machine learning projects are reliant on finding good datasets. If the dataset is bad, or too small, we cannot make accurate predictions. You can find some good datasets at <a target="_blank" href="http://kaggle.com">Kaggle</a> or the <a target="_blank" href="https://archive.ics.uci.edu/ml/index.php">UC Irvine Machine Learning Repository</a>.</p>
<p>In this article, I am using a <a target="_blank" href="https://archive.ics.uci.edu/ml/datasets/Wine+Quality">wine quality dataset</a> with many features and one label. <strong>Features</strong> are independent variables which affect the dependent variable called the <strong>label</strong>. In this case, we have one <strong>label</strong> column — wine quality — that is affected by all the other columns (features like pH, density, acidity, and so on).</p>
<p>In the following Python code, I use a library called <a target="_blank" href="https://pandas.pydata.org/">pandas</a> to control my dataset. pandas wraps datasets with many functions for selecting and manipulating data.</p>
<p>First, I load the dataset into a pandas DataFrame and split it into the label and its features. I grab the label column by its name (quality) and then drop the column to get all the features.</p>
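<p>A minimal sketch of that step (the file name is an assumption; the UCI red wine csv uses ';' as its separator):</p>
<pre><code class="lang-python">import pandas as pd

# load the wine quality dataset
data = pd.read_csv('winequality-red.csv', sep=';')

# grab the label column by its name, then drop it to keep only the features
labels = data['quality']
features = data.drop('quality', axis=1)
</code></pre>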
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*Kybbe-8PK1jHttWyP0adow.png" alt="Image" width="600" height="400" loading="lazy">
<em>Scikit-learn, the library we will use for machine learning</em></p>
<h3 id="heading-training-a-model">Training a model</h3>
<p>Machine learning works by finding a relationship between a label and its features. We do this by showing an object (our model) a bunch of examples from our dataset. Each example helps define how each feature affects the label. We refer to this process as <strong>training our model</strong>.</p>
<p>I use the estimator object from the <a target="_blank" href="http://scikit-learn.org/stable/index.html">Scikit-learn</a> library for simple machine learning. <strong>Estimators</strong> are empty models that create relationships through a predefined algorithm.</p>
<p>For this wine dataset, I create a model from a linear regression estimator. (Linear regression attempts to draw a straight line of best fit through our dataset.) The model is able to get the regression data through the fit function. I can use the model by passing in a fake set of features through the predict function. The example below shows the features for one fake wine. The model will output an answer based on its training.</p>
<p>The code for this model, and fake wine, is below:</p>
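<p>Here is a minimal sketch (the fake wine’s feature values are invented for illustration and follow the dataset’s column order):</p>
<pre><code>from sklearn.linear_model import LinearRegression

# Create an empty linear regression estimator and train it
# on our features and labels.
model = LinearRegression()
model.fit(features, labels)

# One fake wine: fixed acidity, volatile acidity, citric acid,
# residual sugar, chlorides, free sulfur dioxide, total sulfur
# dioxide, density, pH, sulphates, alcohol.
fake_wine = [[7.4, 0.66, 0.0, 1.8, 0.075, 13.0, 40.0, 0.9978, 3.51, 0.56, 9.4]]

# The model outputs a predicted quality based on its training.
print(model.predict(fake_wine))
</code></pre>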
<h3 id="heading-importing-and-exporting-our-python-model">Importing and exporting our Python model</h3>
<p>The <a target="_blank" href="https://docs.python.org/2/library/pickle.html">pickle</a> library makes it easy to serialize my trained models into files. I can also load a model back into my code later. This lets me keep my model-training code separate from the code that deploys the model.</p>
<p>I can import or export my Python model for use in other Python scripts with the code below:</p>
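<p>A sketch of both directions (the filename <code>model.pkl</code> is arbitrary):</p>
<pre><code>import pickle

# Export: serialize the trained model to a file.
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Import: load the model back, for example in another script.
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
</code></pre>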
<h3 id="heading-creating-a-simple-web-server">Creating a simple web server</h3>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*wv3umUu_u8r7dgeXHX38uw.png" alt="Image" width="600" height="400" loading="lazy">
<em>Flask, the framework we will use to create a web server.</em></p>
<p>To deploy my model, I first have to create a server. Servers listen to web traffic, and run functions when they find a request addressed to them. The function that runs can depend on the request’s route and other data that it has. Afterwards, the server can send a message of confirmation back to the requester.</p>
<p>The <a target="_blank" href="http://flask.pocoo.org/">Flask</a> Python framework allows me to create web servers in record time.</p>
<p>In the code below, I use Flask to run a simple one-route web server. My one route listens for POST requests and sends a hello back. POST requests carry their data in the request body; in our case, that data is a JSON object.</p>
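<p>A minimal version of that server might look like this (I use the “/echo” route here because we will reuse it below):</p>
<pre><code>from flask import Flask, request, jsonify

app = Flask(__name__)

# One route that listens for POST requests and sends a hello back.
@app.route('/echo', methods=['POST'])
def echo():
    data = request.get_json()
    return jsonify({'message': 'hello', 'received': data})

if __name__ == '__main__':
    app.run()
</code></pre>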
<h3 id="heading-adding-the-model-to-my-server">Adding the model to my server</h3>
<p>With the pickle library, I am able to load our trained model into my web server.</p>
<p>Our server now loads the trained model during its initialization. I can access it by sending a POST request to my “/echo” route. The route grabs an array of features from the request body and gives it to the model. The model’s prediction is then sent back to the requester.</p>
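<p>Putting the pieces together, here is a sketch of the full server (storing the features under a <code>features</code> key in the JSON body is my own convention):</p>
<pre><code>import pickle

from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the trained model during the server's initialization.
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/echo', methods=['POST'])
def echo():
    # Grab the array of features from the request body
    # and give it to the model.
    features = request.get_json()['features']
    prediction = model.predict([features])
    # Send the model's prediction back to the requester.
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run()
</code></pre>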
<h3 id="heading-conclusion">Conclusion</h3>
<p>After reading this article, you should be able to create your own machine learning back end. For more detail, you can find a full example that I made at <a target="_blank" href="https://github.com/iYung/sklearn-flask-example">this</a> repository.</p>
<p>When you have time, I recommend taking a step back from coding and reading more about machine learning. This article teaches only the bare necessities for creating a model. There are topics, like loss reduction and neural nets, that you still need to learn.</p>
<p>Good luck and happy coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Text classification and prediction using the Bag Of Words approach ]]>
                </title>
                <description>
                    <![CDATA[ By gk_ There are a number of approaches to text classification. In other articles I’ve covered Multinomial Naive Bayes and Neural Networks. One of the simplest and most common approaches is called “Bag of Words.” It has been used by commercial analyt... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/text-classification-and-prediction-using-bag-of-words-8aeb1396cded/</link>
                <guid isPermaLink="false">66c3608139357f9446976639</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 23 Mar 2018 21:40:55 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*wdtdcVQQRzc7xPNZzyCsUg.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By gk_</p>
<p>There are a number of approaches to text classification. In other articles I’ve covered <a target="_blank" href="https://chatbotslife.com/text-classification-using-algorithms-e4d50dcba45">Multinomial Naive Bayes</a> and <a target="_blank" href="https://machinelearnings.co/text-classification-using-neural-networks-f5cd7b8765c6">Neural Networks</a>.</p>
<p>One of the simplest and most common approaches is called “Bag of Words.” It has been used by commercial analytics products including <a target="_blank" href="https://www.clarabridge.com/">Clarabridge</a>, <a target="_blank" href="https://www.webanalyticsworld.net/analytics-measurement-and-management-tools/radian-6-overview">Radian6</a>, and others.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*j3HUg18QwjDJTJwW9ja5-Q.png" alt="Image" width="512" height="190" loading="lazy">
<em>Image <a target="_blank" href="https://machinelearnings.co/text-classification-using-neural-networks-f5cd7b8765c6">source</a>.</em></p>
<p>The approach is relatively simple: given a set of topics and a set of terms associated with each topic, determine which topic(s) exist within a document (for example, a sentence).</p>
<p>While other, more exotic algorithms also organize words into “bags,” in this technique we don’t create a model or apply mathematics to the way in which this “bag” intersects with a classified document. A document’s classification will be polymorphic, as it can be associated with multiple topics.</p>
<p>Does this seem too simple to be useful? Try it before you jump to conclusions. In NLP, a simple approach can often go a long way.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*aIUBmmPz2K44OdZnWCj4jw.png" alt="Image" width="150" height="225" loading="lazy">
<em>Credit: Smitha Milli, <a target="_blank" href="https://twitter.com/smithamilli/status/837153616116985856">https://twitter.com/smithamilli</a></em></p>
<p>We will need three things:</p>
<ul>
<li>A topics/words definition file</li>
<li>A classifier function</li>
<li>A notebook to test our classifier</li>
</ul>
<p>And then we will venture a bit further and build and test a predictive model using our classification data.</p>
<h4 id="heading-topics-and-words">Topics and Words</h4>
<p>Our definition file is in JSON format. We will use it to classify messages between patients and a nurse assigned to their care.</p>
<h4 id="heading-topicsjson">topics.json</h4>
<p>There are two items of note in this definition.</p>
<p>First, let’s look at some of the terms. For example, “bruis” is a <strong>stem.</strong> It will cover supersets such as “bruise,” “bruising,” and so on. Second, terms containing <strong>*</strong> are actually <strong>patterns</strong>. For example, <strong>*dpm</strong> is a pattern for a numeric <strong>d</strong>igit followed by “pm.”</p>
<p>To keep things simple, we are only handling numeric pattern matching, but this could be expanded to a broader scope.</p>
<p>This ability to find patterns within a term is very useful when classifying documents containing dates, times, monetary values, and so on.</p>
<p>Let’s try out some classification.</p>
<p>The classifier returns a JSON result set containing the sentence(s) associated with each topic found in the message. A message can contain multiple sentences, and a sentence can be associated with none, one, or multiple topics.</p>
<p>Let’s take a look at our classifier. The code is <a target="_blank" href="https://github.com/ugik/notebooks/blob/master/msgClassify.py">here</a>.</p>
<h4 id="heading-msgclassifypy">msgClassify.py</h4>
<p>The code is relatively straightforward, and includes a convenience function to split a document into sentences.</p>
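<p>The linked file is the real implementation; what follows is only a rough sketch of the idea, with function names and details of my own choosing:</p>
<pre><code>import json
import re

def split_sentences(document):
    # Convenience function to split a document into sentences.
    return [s.strip() for s in re.split(r'[.!?]+', document) if s.strip()]

def term_matches(term, word):
    # A term like "bruis" is a stem: it matches "bruise", "bruising", etc.
    # A term containing * is a pattern: "*dpm" matches digits followed by "pm".
    if '*' in term:
        return re.fullmatch(term.replace('*d', r'\d+'), word) is not None
    return word.startswith(term)

def classify(document, topics):
    # Return each topic found, with the sentence(s) associated with it.
    result = {}
    for sentence in split_sentences(document):
        words = re.findall(r'\w+', sentence.lower())
        for topic, terms in topics.items():
            if any(term_matches(t, w) for t in terms for w in words):
                result.setdefault(topic, []).append(sentence)
    return result

with open('topics.json') as f:
    topics = json.load(f)

print(json.dumps(classify('Thanks! I noticed a bruise at 3pm.', topics), indent=2))
</code></pre>
<p>With the illustrative definition above, this would associate the first sentence with the <strong>thanks</strong> topic and the second with both <strong>medical terms</strong> and <strong>time</strong>.</p>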
<h4 id="heading-predictive-modeling">Predictive Modeling</h4>
<p>The aggregate classification for <strong>a set of documents associated with an outcome</strong> can be used to build a predictive model.</p>
<p>In this use-case, we wanted to see if we could predict hospitalizations based on the messages between patient and nurse prior to the incident. We compared messages for patients who did and did not incur hospitalizations.</p>
<p>You could use a similar technique for other types of messaging associated with some binary outcome.</p>
<p>This process takes a number of steps:</p>
<ul>
<li>A set of messages is classified, and each topic receives a count for this set. The result is <strong>a fixed list of topics with a % allocation from the messages.</strong></li>
<li>The topic allocation is then <strong>assigned a binary value</strong>, in our case a 0 if there was no hospitalization and a 1 if there was a hospitalization</li>
<li>A <strong>logistic regression</strong> algorithm is used to build a predictive model</li>
<li>The model is used to <strong>predict the outcome from new input</strong></li>
</ul>
<p>Let’s look at our input data. Your data should have a similar structure. We’re using a pandas <a target="_blank" href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html">DataFrame</a>.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*SRMLWhU-cEgK_ludaN9gMQ.png" alt="Image" width="800" height="147" loading="lazy"></p>
<p><strong>“incident”</strong> is the binary outcome, and it needs to be the first column in the input data.</p>
<p>Each subsequent column is a topic and the % of classification from the set of messages belonging to the patient.</p>
<p>In row 0, we see that roughly a quarter of the messages for this patient are about the <strong>thanks</strong> topic, and none are about <strong>medical terms</strong> or <strong>money</strong>. Thus each row is a binary outcome and a <strong>messaging classification profile</strong> across topics.</p>
<p>Your input data will have different topics, different column labels, and a different binary condition, but otherwise will be a similar structure.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*SE1UtYrUBvtca6qmwN3P2g.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Let’s use <a target="_blank" href="http://scikit-learn.org/stable/">scikit-learn</a> to build a Logistic Regression and test our model.</p>
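<p>A minimal sketch of this step, assuming the DataFrame shown above is named <code>df</code>:</p>
<pre><code>from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# The first column ("incident") is the binary outcome; the rest
# are the per-topic classification percentages.
y = df.iloc[:, 0]
X = df.iloc[:, 1:]

# Shuffle, then hold out half of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

model = LogisticRegression()
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
</code></pre>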
<p>Here’s our output:</p>
<pre><code>             precision    recall  f1-score   support

          0       0.66      0.69      0.67       191
          1       0.69      0.67      0.68       202

avg / total       0.68      0.68      0.68       393
</code></pre><p>The <a target="_blank" href="https://en.wikipedia.org/wiki/Precision_and_recall">precision and recall</a> of this model against the test data are in the high-60’s — <strong>slightly better than a guess</strong>, and not accurate enough to be of much value, unfortunately.</p>
<p>In this example, the amount of data was relatively small (a thousand patients, ~30 messages sampled per patient). Remember that only half of the data can be used for training, while the other half (after shuffling) is used to test.</p>
<p>By including structured data such as age, gender, condition, past incidents, and so on, we could strengthen our model and produce a stronger signal. Having more data would also be helpful as the number of training data columns is fairly large.</p>
<p>Try this with your structured/unstructured data and see if you can get a highly predictive model. You may not get the kind of precision that leads to automated actions, but a “risk” probability could be used as a filter or sorting function or as an early warning sign for human experts.</p>
<p>The “Bag of Words” approach is well suited to certain kinds of text classification work, particularly where the language is not nuanced.</p>
<p><strong>Enjoy.</strong></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
