<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ scikit learn - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ scikit learn - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Wed, 13 May 2026 20:29:05 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/scikit-learn/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Machine Learning with Python and Scikit-Learn ]]>
                </title>
                <description>
                    <![CDATA[ Scikit-learn is an open-source machine learning library for Python, known for its simplicity, versatility, and accessibility. The library is well-documented and supported by a large community, making it a popular choice for both beginners and experie... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/machine-learning-with-python-and-scikit-learn/</link>
                <guid isPermaLink="false">66b2058f08bc664c3c097ef2</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Wed, 22 Nov 2023 16:14:14 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/11/machinelearning.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Scikit-learn is an open-source machine learning library for Python, known for its simplicity, versatility, and accessibility. The library is well-documented and supported by a large community, making it a popular choice for both beginners and experienced practitioners in the field of machine learning.</p>
<p>We just published an 18-hour course on the freeCodeCamp.org YouTube channel that is a practical and hands-on introduction to Machine Learning with Python and Scikit-Learn. It is directed at beginners with basic knowledge of Python and statistics.</p>
<p>The course is designed and taught by Aakash N S, CEO and co-founder of Jovian. Aakash has created many popular machine learning courses.  </p>
<p>The course starts with the basics of machine learning by exploring models like linear &amp; logistic regression and then moves on to tree-based models like decision trees, random forests, and gradient-boosting machines.</p>
<p>The course also discusses best practices for approaching and managing machine learning projects, and demonstrates how to build a state-of-the-art machine learning model for a real-world dataset from scratch. Then the course briefly looks at unsupervised learning &amp; recommendations, and walks through the process of deploying a machine-learning model to the cloud using the Flask web framework.</p>
<p>You will learn everything you need to know to start using Scikit-learn for machine learning. Scikit-learn offers a wide range of tools for various machine learning tasks, including classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. Scikit-learn is built upon NumPy, SciPy, and Matplotlib, and its user-friendly interface allows for easy integration into Python applications.</p>
<p>By the end of this course, you'll be able to confidently build, train, and deploy machine learning models in the real world. To get the most out of this course, follow along &amp; type out all the code yourself, and apply the techniques covered here to other real-world datasets &amp; competitions that you can find on platforms like Kaggle.</p>
<p>Here are the lessons in this course:</p>
<ul>
<li>Lesson 1 - Linear Regression and Gradient Descent</li>
<li>Lesson 2 - Logistic Regression for Classification</li>
<li>Lesson 3 - Decision Trees and Random Forests</li>
<li>Lesson 4 - How to Approach Machine Learning Projects</li>
<li>Lesson 5 - Gradient Boosting Machines with XGBoost</li>
<li>Lesson 6 - Unsupervised Learning using Scikit-Learn</li>
<li>Lesson 7 - Machine Learning Project from Scratch</li>
<li>Lesson 8 - Deploying a Machine Learning Project with Flask</li>
</ul>
<p>You can watch the full course on <a target="_blank" href="https://www.youtube.com/watch?v=hDKCxebp88A">the freeCodeCamp.org YouTube channel</a> (18-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/hDKCxebp88A" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Improve Machine Learning Code Quality with Scikit-learn Pipeline and ColumnTransformer ]]>
                </title>
                <description>
                    <![CDATA[ By Yannawut Kimnaruk When you're working on a machine learning project, the most tedious steps are often data cleaning and preprocessing. Especially when you're working in a Jupyter Notebook, running code in many cells can be confusing. The Scikit-le... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/machine-learning-pipeline/</link>
                <guid isPermaLink="false">66d4617123b027d0ff16f2ce</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 08 Sep 2022 16:31:20 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/09/Python-Power-BI-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Yannawut Kimnaruk</p>
<p>When you're working on a machine learning project, the most tedious steps are often data cleaning and preprocessing. Especially when you're working in a Jupyter Notebook, running code in many cells can be confusing.</p>
<p>The Scikit-learn library has tools called Pipeline and ColumnTransformer that can really make your life easier. Instead of transforming the dataframe step by step, the pipeline combines all transformation steps. You can get the same result with less code. It's also easier to understand data workflows and modify them for other projects.</p>
<p>This article will show you step by step how to create a machine learning pipeline, starting with an easy one and working up to a more complicated one. </p>
<p>If you are familiar with the Scikit-learn pipeline and ColumnTransformer, you can jump directly to the part you want to learn more about.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><a class="post-section-overview" href="#heading-what-is-the-scikit-learn-pipeline">What is the Scikit-learn Pipeline?</a></li>
<li><a class="post-section-overview" href="#heading-what-is-the-scikit-learn-columntransformer">What is the Scikit-learn ColumnTransformer?</a></li>
<li><a class="post-section-overview" href="#heading-whats-the-difference-between-the-pipeline-and-columntransformer">What's the Difference between the Pipeline and ColumnTransformer?</a></li>
<li><a class="post-section-overview" href="#heading-how-to-create-a-pipeline">How to Create a Pipeline</a></li>
<li><a class="post-section-overview" href="#heading-how-to-find-the-best-hyperparameter-and-data-preparation-method">How to Find the Best Hyperparameter and Data Preparation Method</a></li>
<li><a class="post-section-overview" href="#heading-how-to-add-custom-transformations-and-find-the-best-machine-learning-model">How to Add Custom Transformations</a></li>
<li><a class="post-section-overview" href="#heading-how-to-add-custom-transformations-and-find-the-best-machine-learning-model">How to Choose the Best Machine Learning Model</a></li>
</ul>
<h2 id="heading-what-is-the-scikit-learn-pipeline">What is the Scikit-learn Pipeline?</h2>
<p>Before training a model, you should split your data into a training set and a test set. Each dataset will go through the data cleaning and preprocessing steps before you put it in a machine learning model. </p>
<p>It's not efficient to write repetitive code for the training set and the test set. This is when the scikit-learn pipeline comes into play.</p>
<p>Scikit-learn pipeline is an elegant way to create a machine learning model training workflow. It looks like this:</p>
<p><img src="https://miro.medium.com/max/1308/1*3cbyBR99wFWklU6Sy85NEA.png" alt="Image" width="600" height="400" loading="lazy">
<em>Pipeline illustration</em></p>
<p>First of all, imagine creating a single pipeline into which you can feed any data. That data will be transformed into the appropriate format before model training or prediction. </p>
<p>The Scikit-learn pipeline is a tool that links all steps of data manipulation together to create a pipeline. It will shorten your code and make it easier to read and adjust. (You can even visualize your pipeline to see the steps inside.) It's also easier to perform GridSearchCV without data leakage from the test set.</p>
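<p>As a minimal sketch (with placeholder steps, not the exact ones used later in this article), a pipeline chains transformers and ends with an estimator:</p>
<pre><code class="lang-python">from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# Each step is a (name, transformer) tuple, and the last step can be a model
toy_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),  # fill missing values
    ('scale', MinMaxScaler()),                   # scale to the 0-1 range
    ('model', LogisticRegression())              # final estimator
])

# toy_pipeline.fit(X_train, y_train) then runs every step in order
</code></pre>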
<h2 id="heading-what-is-the-scikit-learn-columntransformer">What is the Scikit-learn ColumnTransformer?</h2>
<p>As stated on the scikit-learn website, this is the purpose of ColumnTransformer:</p>
<blockquote>
<p>"This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space.   </p>
<p>This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer."</p>
</blockquote>
<p>In short, ColumnTransformer will transform each group of dataframe columns separately and combine them later. This is useful in the data preprocessing process.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/09/image-207.png" alt="Image" width="600" height="400" loading="lazy">
<em>ColumnTransformer Illustration</em></p>
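<p>As a minimal sketch (the column names here are placeholders), a ColumnTransformer routes different groups of columns to different transformers and concatenates the results:</p>
<pre><code class="lang-python">from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Scale the numerical columns, one-hot encode the categorical ones,
# then concatenate the outputs into a single feature matrix
toy_col_trans = ColumnTransformer(transformers=[
    ('num', MinMaxScaler(), ['num_col_a', 'num_col_b']),
    ('cat', OneHotEncoder(), ['cat_col_a'])
])
</code></pre>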
<h2 id="heading-whats-the-difference-between-the-pipeline-and-columntransformer">What's the Difference between the Pipeline and ColumnTransformer?</h2>
<p>There is a big difference between Pipeline and ColumnTransformer that you should understand.</p>
<p><img src="https://miro.medium.com/max/1190/1*I0F-ALOL8J8f6V33CDKyrA.png" alt="Image" width="600" height="400" loading="lazy">
<em>Pipeline VS ColumnTransformer</em></p>
<p><strong>You use the pipeline</strong> for multiple transformations of the same columns.</p>
<p>On the other hand, <strong>you use the ColumnTransformer</strong> to transform each column set separately before combining them later.</p>
<p>Alright, with that out of the way, let’s start coding!!</p>
<h2 id="heading-how-to-create-a-pipeline">How to Create a Pipeline</h2>
<h3 id="heading-get-the-dataset">Get the Dataset</h3>
<p>You can download the data I used in this article from this <a target="_blank" href="https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists?datasetId=1019790&amp;sortBy=voteCount&amp;select=aug_train.csv">kaggle dataset</a>. Here's a sample of the dataset:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/09/image-210.png" alt="Image" width="600" height="400" loading="lazy">
<em>Dataset sample</em></p>
<p>I wrote an article exploring the data from this dataset which you can find <a target="_blank" href="https://medium.com/mlearning-ai/data-analysis-job-change-of-data-scientist-685f3de0a983">here if you're interested.</a></p>
<p>In short, this dataset contains information about job candidates and their decision about whether they want to change jobs or not. The dataset has both numerical and categorical columns.</p>
<p>Our goal is to predict whether a candidate will change jobs based on their information. This is a classification task.</p>
<h2 id="heading-data-preprocessing-plan">Data Preprocessing Plan</h2>
<p><img src="https://miro.medium.com/max/1400/1*ZT7S2SuhMd4Zazb2lVWmcw.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Note that I skipped categorical feature encoding for the simplicity of this article.</p>
<h3 id="heading-here-are-the-steps-well-follow">Here are the steps we'll follow:</h3>
<ol>
<li>Import and encode the data</li>
<li>Define sets of columns to be transformed in different ways</li>
<li>Create pipelines for numerical and categorical features</li>
<li>Create a ColumnTransformer to apply each pipeline to its column set</li>
<li>Add a model to the final pipeline</li>
<li>Display the pipeline</li>
<li>Split the data into train and test sets</li>
<li>Pass data through the pipeline</li>
<li>(Optional) Save the pipeline</li>
</ol>
<h3 id="heading-step-1-import-and-encode-the-data">Step 1: Import and Encode the Data</h3>
<p>After downloading the data, you can import it using Pandas like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

df = pd.read_csv(<span class="hljs-string">"aug_train.csv"</span>)
</code></pre>
<p>Then, encode the ordinal features using mappings to transform their categories into numbers (since the model takes only numerical input).</p>
<pre><code class="lang-python"><span class="hljs-comment"># Making Dictionaries of ordinal features</span>

relevent_experience_map = {
    <span class="hljs-string">'Has relevent experience'</span>:  <span class="hljs-number">1</span>,
    <span class="hljs-string">'No relevent experience'</span>:    <span class="hljs-number">0</span>
}

experience_map = {
    <span class="hljs-string">'&lt;1'</span>      :    <span class="hljs-number">0</span>,
    <span class="hljs-string">'1'</span>       :    <span class="hljs-number">1</span>, 
    <span class="hljs-string">'2'</span>       :    <span class="hljs-number">2</span>, 
    <span class="hljs-string">'3'</span>       :    <span class="hljs-number">3</span>, 
    <span class="hljs-string">'4'</span>       :    <span class="hljs-number">4</span>, 
    <span class="hljs-string">'5'</span>       :    <span class="hljs-number">5</span>,
    <span class="hljs-string">'6'</span>       :    <span class="hljs-number">6</span>,
    <span class="hljs-string">'7'</span>       :    <span class="hljs-number">7</span>,
    <span class="hljs-string">'8'</span>       :    <span class="hljs-number">8</span>, 
    <span class="hljs-string">'9'</span>       :    <span class="hljs-number">9</span>, 
    <span class="hljs-string">'10'</span>      :    <span class="hljs-number">10</span>, 
    <span class="hljs-string">'11'</span>      :    <span class="hljs-number">11</span>,
    <span class="hljs-string">'12'</span>      :    <span class="hljs-number">12</span>,
    <span class="hljs-string">'13'</span>      :    <span class="hljs-number">13</span>, 
    <span class="hljs-string">'14'</span>      :    <span class="hljs-number">14</span>, 
    <span class="hljs-string">'15'</span>      :    <span class="hljs-number">15</span>, 
    <span class="hljs-string">'16'</span>      :    <span class="hljs-number">16</span>,
    <span class="hljs-string">'17'</span>      :    <span class="hljs-number">17</span>,
    <span class="hljs-string">'18'</span>      :    <span class="hljs-number">18</span>,
    <span class="hljs-string">'19'</span>      :    <span class="hljs-number">19</span>, 
    <span class="hljs-string">'20'</span>      :    <span class="hljs-number">20</span>, 
    <span class="hljs-string">'&gt;20'</span>     :    <span class="hljs-number">21</span>
} 

last_new_job_map = {
    <span class="hljs-string">'never'</span>        :    <span class="hljs-number">0</span>,
    <span class="hljs-string">'1'</span>            :    <span class="hljs-number">1</span>, 
    <span class="hljs-string">'2'</span>            :    <span class="hljs-number">2</span>, 
    <span class="hljs-string">'3'</span>            :    <span class="hljs-number">3</span>, 
    <span class="hljs-string">'4'</span>            :    <span class="hljs-number">4</span>, 
    <span class="hljs-string">'&gt;4'</span>           :    <span class="hljs-number">5</span>
}

<span class="hljs-comment"># Transform categorical features into numerical features</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">encode</span>(<span class="hljs-params">df_pre</span>):</span>
    df_pre.loc[:,<span class="hljs-string">'relevent_experience'</span>] = df_pre[<span class="hljs-string">'relevent_experience'</span>].map(relevent_experience_map)
    df_pre.loc[:,<span class="hljs-string">'last_new_job'</span>] = df_pre[<span class="hljs-string">'last_new_job'</span>].map(last_new_job_map)
    df_pre.loc[:,<span class="hljs-string">'experience'</span>] = df_pre[<span class="hljs-string">'experience'</span>].map(experience_map)

    <span class="hljs-keyword">return</span> df_pre

df = encode(df)
</code></pre>
<h3 id="heading-step-2-define-sets-of-columns-to-be-transformed-in-different-ways">Step 2: Define Sets of Columns to be Transformed in Different Ways</h3>
<p>Numerical and categorical data should be transformed in different ways. So I define <code>num_cols</code> for numerical columns and <code>cat_cols</code> for categorical columns.</p>
<pre><code class="lang-python">num_cols = [<span class="hljs-string">'city_development_index'</span>,<span class="hljs-string">'relevent_experience'</span>, <span class="hljs-string">'experience'</span>,<span class="hljs-string">'last_new_job'</span>, <span class="hljs-string">'training_hours'</span>]

cat_cols = [<span class="hljs-string">'gender'</span>, <span class="hljs-string">'enrolled_university'</span>, <span class="hljs-string">'education_level'</span>, <span class="hljs-string">'major_discipline'</span>, <span class="hljs-string">'company_size'</span>, <span class="hljs-string">'company_type'</span>]
</code></pre>
<h3 id="heading-step-3-create-pipelines-for-numerical-and-categorical-features">Step 3: Create Pipelines for Numerical and Categorical Features</h3>
<p>The syntax of the pipeline is:</p>
<pre><code class="lang-python">Pipeline(steps = [(‘step name’, transform function), …])
</code></pre>
<p>For <strong>numerical features</strong>, I perform the following actions:</p>
<ol>
<li>SimpleImputer to fill in the missing values with the mean of that column.</li>
<li>MinMaxScaler to scale each value to the range 0 to 1 (unscaled features can hurt regression performance).</li>
</ol>
<p>For <strong>categorical features</strong>, I perform the following actions: </p>
<ol>
<li>SimpleImputer to fill in the missing values with the most frequent value of that column.</li>
<li>OneHotEncoder to expand each categorical column into multiple binary columns for model training. (handle_unknown='ignore' is specified to prevent errors when the encoder finds an unseen category in the test set.)</li>
</ol>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> OneHotEncoder, MinMaxScaler
<span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline

num_pipeline = Pipeline(steps=[
    (<span class="hljs-string">'impute'</span>, SimpleImputer(strategy=<span class="hljs-string">'mean'</span>)),
    (<span class="hljs-string">'scale'</span>,MinMaxScaler())
])
cat_pipeline = Pipeline(steps=[
    (<span class="hljs-string">'impute'</span>, SimpleImputer(strategy=<span class="hljs-string">'most_frequent'</span>)),
    (<span class="hljs-string">'one-hot'</span>,OneHotEncoder(handle_unknown=<span class="hljs-string">'ignore'</span>, sparse=<span class="hljs-literal">False</span>))
])
</code></pre>
<h3 id="heading-step-4-create-columntransformer-to-apply-the-pipeline-for-each-column-set">Step 4: Create ColumnTransformer to Apply the Pipeline for Each Column Set</h3>
<p>The syntax of the ColumnTransformer is:</p>
<pre><code class="lang-python">ColumnTransformer(transformers=[(‘step name’, transform function,cols), …])
</code></pre>
<p>Pass numerical columns through the numerical pipeline and pass categorical columns through the categorical pipeline created in step 3.</p>
<p><code>remainder='drop'</code> is specified to ignore the other columns in the dataframe.</p>
<p><code>n_jobs=-1</code> means that we'll be using all processors to run in parallel.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer

col_trans = ColumnTransformer(transformers=[
    (<span class="hljs-string">'num_pipeline'</span>,num_pipeline,num_cols),
    (<span class="hljs-string">'cat_pipeline'</span>,cat_pipeline,cat_cols)
    ],
    remainder=<span class="hljs-string">'drop'</span>,
    n_jobs=<span class="hljs-number">-1</span>)
</code></pre>
<h3 id="heading-step-5-add-a-model-to-the-final-pipeline">Step 5: Add a Model to the Final Pipeline</h3>
<p>I'm using the logistic regression model in this example.</p>
<p>Create a new pipeline to combine the ColumnTransformer from step 4 with the logistic regression model. I use a pipeline here because the entire dataframe must pass through the ColumnTransformer step and the modeling step, in that order.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> LogisticRegression

clf = LogisticRegression(random_state=<span class="hljs-number">0</span>)
clf_pipeline = Pipeline(steps=[
    (<span class="hljs-string">'col_trans'</span>, col_trans),
    (<span class="hljs-string">'model'</span>, clf)
])
</code></pre>
<h3 id="heading-step-6-display-the-pipeline">Step 6: Display the Pipeline</h3>
<p>The syntax for this is <code>display(pipeline name)</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> set_config

set_config(display=<span class="hljs-string">'diagram'</span>)
display(clf_pipeline)
</code></pre>
<p><img src="https://miro.medium.com/max/560/1*ZAQ6T65iADOmFx1eCJsjDQ.png" alt="Image" width="600" height="400" loading="lazy">
<em>Displayed pipeline</em></p>
<p>You can click on the displayed image to see the details of each step.<br>How convenient!</p>
<p><img src="https://miro.medium.com/max/1400/1*gahdAdZlFSICnQmiqbQYvg.png" alt="Image" width="600" height="400" loading="lazy">
<em>Expanded displayed pipeline</em></p>
<h3 id="heading-step-7-split-the-data-into-train-and-test-sets">Step 7: Split the Data into Train and Test Sets</h3>
<p>Split 20% of the data into a test set like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

X = df[num_cols+cat_cols]
y = df[<span class="hljs-string">'target'</span>]
<span class="hljs-comment"># train test split</span>
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, stratify=y)
</code></pre>
<p>I will fit the pipeline for the train set and use that fitted pipeline for the test set to prevent data leakage from the test set to the model.</p>
<h3 id="heading-step-8-pass-data-through-the-pipeline">Step 8: Pass Data through the Pipeline</h3>
<p>Here's the syntax for this:</p>
<pre><code class="lang-python">pipeline_name.fit, pipeline_name.predict, pipeline_name.score
</code></pre>
<p><code>pipeline.fit</code> passes data through a pipeline. It also fits the model.</p>
<p><code>pipeline.predict</code> uses the model trained during <code>pipeline.fit</code> to predict on new data.</p>
<p><code>pipeline.score</code> gets a score of the model in the pipeline (accuracy of logistic regression in this example).</p>
<pre><code class="lang-python">clf_pipeline.fit(X_train, y_train)
<span class="hljs-comment"># preds = clf_pipeline.predict(X_test)</span>
score = clf_pipeline.score(X_test, y_test)
print(<span class="hljs-string">f"Model score: <span class="hljs-subst">{score}</span>"</span>) <span class="hljs-comment"># model accuracy</span>
</code></pre>
<p><img src="https://miro.medium.com/max/1400/1*Y5liijw_WH1kRMnO4S3ung.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-optional-step-9-save-the-pipeline">(Optional) Step 9: Save the Pipeline</h3>
<p>The syntax for this is <code>joblib.dump</code>.</p>
<p>Use the joblib library to save the pipeline for later use, so you don’t need to create and fit the pipeline again. When you want to use a saved pipeline, just load the file using joblib.load like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> joblib

<span class="hljs-comment"># Save pipeline to file "pipe.joblib"</span>
joblib.dump(clf_pipeline,<span class="hljs-string">"pipe.joblib"</span>)

<span class="hljs-comment"># Load pipeline when you want to use</span>
same_pipe = joblib.load(<span class="hljs-string">"pipe.joblib"</span>)
</code></pre>
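<p>The loaded pipeline already contains the fitted transformers and model, so you can use it for prediction right away:</p>
<pre><code class="lang-python"># The loaded pipeline preprocesses and predicts in one call
preds = same_pipe.predict(X_test)
</code></pre>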
<h2 id="heading-how-to-find-the-best-hyperparameter-and-data-preparation-method">How to Find the Best Hyperparameter and Data Preparation Method</h2>
<p>A pipeline not only makes your code tidier, it can also help you optimize hyperparameters and data preparation methods.</p>
<h3 id="heading-heres-what-well-cover-in-this-section">Here's what we'll cover in this section:</h3>
<ul>
<li>How to find the changeable pipeline parameters</li>
<li>How to find the best hyperparameter sets: Add a pipeline to Grid Search</li>
<li>How to find the best data preparation method: Skip a step in a pipeline</li>
<li>How to find the best hyperparameter sets and the best data preparation method</li>
</ul>
<h3 id="heading-how-to-find-the-changeable-pipeline-parameters">How to Find the Changeable Pipeline Parameters</h3>
<p>First, let’s see the list of parameters that can be adjusted.</p>
<pre><code class="lang-python">clf_pipeline.get_params()
</code></pre>
<p>The result can be very long. Take a deep breath and continue reading.</p>
<p>The first part is just about the steps of the pipeline.</p>
<p><img src="https://miro.medium.com/max/1400/1*JWw_1l68o9z_D9ptmvIIMA.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Below the first part you'll find what we are interested in: a list of parameters that we can adjust.</p>
<p><img src="https://miro.medium.com/max/926/1*NCkmLiyit676K3M-HfEbnw.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The format is <strong>step1__step2__…__parameter</strong>.</p>
<p>For example, <code>col_trans__cat_pipeline__one-hot__sparse</code> refers to the <code>sparse</code> parameter of the one-hot step.</p>
<p><img src="https://miro.medium.com/max/876/1*ZITc6M2sB8Qxzr5BCnBMHQ.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can change parameters directly using <code>set_params</code>.</p>
<pre><code class="lang-python">clf_pipeline.set_params(model_C = <span class="hljs-number">10</span>)
</code></pre>
<h3 id="heading-how-to-find-the-best-hyperparameter-sets-add-a-pipeline-to-grid-search">How to Find the Best Hyperparameter Sets: Add a Pipeline to Grid Search</h3>
<p>Grid Search is a method you can use to perform hyperparameter tuning. It helps you find the optimum parameter sets that yield the highest model accuracy.</p>
<h4 id="heading-set-the-tuning-parameters-and-their-range">Set the tuning parameters and their range.</h4>
<p>Create a dictionary of tuning parameters (hyperparameters):</p>
<pre><code class="lang-python">{ 'tuning parameter' : 'possible value', ... }
</code></pre>
<p>In this example, I want to find the best penalty type and C of a logistic regression model.</p>
<pre><code class="lang-python">grid_params = {<span class="hljs-string">'model__penalty'</span> : [<span class="hljs-string">'none'</span>, <span class="hljs-string">'l2'</span>],
               <span class="hljs-string">'model__C'</span> : np.logspace(<span class="hljs-number">-4</span>, <span class="hljs-number">4</span>, <span class="hljs-number">20</span>)}
</code></pre>
<h4 id="heading-add-the-pipeline-to-grid-search">Add the pipeline to Grid Search</h4>
<pre><code class="lang-python">GridSearchCV(model, tuning parameter, …)
</code></pre>
<p>Our pipeline has a model step as the final step, so we can input the pipeline directly to the GridSearchCV function.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> GridSearchCV

gs = GridSearchCV(clf_pipeline, grid_params, cv=<span class="hljs-number">5</span>, scoring=<span class="hljs-string">'accuracy'</span>)
gs.fit(X_train, y_train)

print(<span class="hljs-string">"Best Score of train set: "</span>+str(gs.best_score_))
print(<span class="hljs-string">"Best parameter set: "</span>+str(gs.best_params_))
print(<span class="hljs-string">"Test Score: "</span>+str(gs.score(X_test,y_test)))
</code></pre>
<p><img src="https://miro.medium.com/max/1252/1*JP64DvryL62BV2Z8ctyVXw.png" alt="Image" width="600" height="400" loading="lazy">
<em>Result of Grid Search</em></p>
<p>After setting up the grid search, you can fit it to the data and see the results. Let's see what the code is doing:</p>
<ul>
<li><code>.fit</code>: fits the model and tries all sets of parameters in the tuning parameter dictionary</li>
<li><code>.best_score_</code>: the highest accuracy across all sets of parameters</li>
<li><code>.best_params_</code>: The set of parameters that yield the best score</li>
<li><code>.score(X_test,y_test)</code>: The score when trying the best model with the test set.</li>
</ul>
<p>You can read more about GridSearchCV in the documentation <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">here</a>.</p>
<h3 id="heading-how-to-find-the-best-data-preparation-method-skip-a-step-in-a-pipeline">How to Find the Best Data Preparation Method: Skip a Step in a Pipeline</h3>
<p>Finding the best data preparation method can be difficult without a pipeline since you have to create so many variables for many data transformation cases.</p>
<p>With the pipeline, we can create data transformation steps in the pipeline and perform a grid search to find the best step. A grid search will select which step to skip and compare the result of each case.</p>
<h4 id="heading-how-to-adjust-the-current-pipeline-a-little">How to adjust the current pipeline a little</h4>
<p>I want to know which scaling method will work best for my data between MinMaxScaler and StandardScaler.</p>
<p>I add a StandardScaler step to the num_pipeline. The rest doesn't change.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler

num_pipeline2 = Pipeline(steps=[
    (<span class="hljs-string">'impute'</span>, SimpleImputer(strategy=<span class="hljs-string">'mean'</span>)),
    (<span class="hljs-string">'minmax_scale'</span>, MinMaxScaler()),
    (<span class="hljs-string">'std_scale'</span>, StandardScaler()),
])

col_trans2 = ColumnTransformer(transformers=[
    (<span class="hljs-string">'num_pipeline'</span>,num_pipeline2,num_cols),
    (<span class="hljs-string">'cat_pipeline'</span>,cat_pipeline,cat_cols)
    ],
    remainder=<span class="hljs-string">'drop'</span>,
    n_jobs=<span class="hljs-number">-1</span>)

clf_pipeline2 = Pipeline(steps=[
    (<span class="hljs-string">'col_trans'</span>, col_trans2),
    (<span class="hljs-string">'model'</span>, clf)
])
</code></pre>
<p><img src="https://miro.medium.com/max/526/1*K1pdg8EFtGLIhNSEUQ0DsA.png" alt="Image" width="600" height="400" loading="lazy">
<em>Adjusted pipeline</em></p>
<h3 id="heading-how-to-perform-grid-search">How to Perform Grid Search</h3>
<p>In grid search parameters, specify the steps you want to skip and set their value to <strong>passthrough</strong>.</p>
<p>Since MinMaxScaler and StandardScaler should not run at the same time, I will use <strong>a list of dictionaries</strong> for the grid search parameters.</p>
<pre><code class="lang-python">[{case <span class="hljs-number">1</span>},{case <span class="hljs-number">2</span>}]
</code></pre>
<p>If you use a list of dictionaries, grid search will try every parameter combination in case 1, then every parameter combination in case 2. So there is no case where MinMaxScaler and StandardScaler are used together.</p>
<pre><code class="lang-python">grid_step_params = [{<span class="hljs-string">'col_trans__num_pipeline__minmax_scale'</span>: [<span class="hljs-string">'passthrough'</span>]},
                    {<span class="hljs-string">'col_trans__num_pipeline__std_scale'</span>: [<span class="hljs-string">'passthrough'</span>]}]
</code></pre>
<p>Perform Grid Search and print the results (like a normal grid search).</p>
<pre><code class="lang-python">gs2 = GridSearchCV(clf_pipeline2, grid_step_params, scoring=<span class="hljs-string">'accuracy'</span>)
gs2.fit(X_train, y_train)

print(<span class="hljs-string">"Best Score of train set: "</span>+str(gs2.best_score_))
print(<span class="hljs-string">"Best parameter set: "</span>+str(gs2.best_params_))
print(<span class="hljs-string">"Test Score: "</span>+str(gs2.score(X_test,y_test)))
</code></pre>
<p><img src="https://miro.medium.com/max/1354/1*u-TK9RhHn0eSIRbtEUdWsQ.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The best case is minmax_scale : ‘passthrough’, so StandardScaler is the best scaling method for this data.</p>
<h3 id="heading-how-to-find-the-best-hyperparameter-sets-and-the-best-data-preparation-method">How to Find the Best Hyperparameter Sets and the Best Data Preparation Method</h3>
<p>You can find the best hyperparameter sets and the best data preparation method by adding tuning parameters to the dictionary of each case of the data preparation method.</p>
<pre><code class="lang-python">grid_params = {<span class="hljs-string">'model__penalty'</span> : [<span class="hljs-string">'none'</span>, <span class="hljs-string">'l2'</span>],
               <span class="hljs-string">'model__C'</span> : np.logspace(<span class="hljs-number">-4</span>, <span class="hljs-number">4</span>, <span class="hljs-number">20</span>)}

grid_step_params = [{**{<span class="hljs-string">'col_trans__num_pipeline__minmax_scale'</span>: [<span class="hljs-string">'passthrough'</span>]}, **grid_params},
                    {**{<span class="hljs-string">'col_trans__num_pipeline__std_scale'</span>: [<span class="hljs-string">'passthrough'</span>]}, **grid_params}]
</code></pre>
<p>grid_params will be added to both case 1 (skip MinMaxScaler) and case 2 (skip StandardScaler).</p>
<pre><code class="lang-python"><span class="hljs-comment"># You can merge dictionary using the syntax below.</span>

merge_dict = {**dict_1,**dict_2}
</code></pre>
<p>Perform Grid Search and print the results (like a normal grid search).</p>
<pre><code class="lang-python">gs3 = GridSearchCV(clf_pipeline2, grid_step_params2, scoring=<span class="hljs-string">'accuracy'</span>)
gs3.fit(X_train, y_train)

print(<span class="hljs-string">"Best Score of train set: "</span>+str(gs3.best_score_))
print(<span class="hljs-string">"Best parameter set: "</span>+str(gs3.best_params_))
print(<span class="hljs-string">"Test Score: "</span>+str(gs3.score(X_test,y_test)))
</code></pre>
<p><img src="https://miro.medium.com/max/1400/1*fLcVD6j9m2QcdkkYpoJOjA.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can find the best parameter set using <code>.best_params_</code>. Since <code>minmax_scale</code> is set to 'passthrough', StandardScaler is again the best scaling method for this data.</p>
<p>You can show all grid search cases using <code>.cv_results_</code>:</p>
<pre><code class="lang-python">pd.DataFrame(gs3.cv_results_)
</code></pre>
<p><img src="https://miro.medium.com/max/1400/1*Ddwx3CZ1k3kfEXYG2pGkMw.png" alt="Image" width="600" height="400" loading="lazy">
<em>GridSearch result</em></p>
<p>There are 80 cases for this example. The results include the running time and accuracy of each case, which is worth considering, since sometimes you may prefer the fastest model with acceptable accuracy over the most accurate one.</p>
<h2 id="heading-how-to-add-custom-transformations-and-find-the-best-machine-learning-model">How to Add Custom Transformations and Find the Best Machine Learning Model</h2>
<p>Searching for the best machine learning model can be a time-consuming task. The pipeline can make this task much more convenient so that you can shorten the model training and evaluation loop.</p>
<h3 id="heading-heres-what-well-cover-in-this-part">Here's what we'll cover in this part:</h3>
<ul>
<li>Add a custom transformation</li>
<li>Find the best machine learning model</li>
</ul>
<h3 id="heading-how-to-add-a-custom-transformation">How to Add a Custom Transformation</h3>
<p>Apart from standard data transformation functions such as MinMaxScaler from sklearn, you can also create your own transformation for your data.</p>
<p>In this example, I will create a custom transformer class that encodes ordinal features, using a mapping to transform categorical values into numerical ones. In simple words, we'll change data from text to numbers.</p>
<p>First we'll do the required data processing before regression model training.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.base <span class="hljs-keyword">import</span> TransformerMixin

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Encode</span>(<span class="hljs-params">TransformerMixin</span>):</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-comment"># Making Dictionaries of ordinal features</span>
        self.rel_exp_map = {
            <span class="hljs-string">'Has relevent experience'</span>: <span class="hljs-number">1</span>,
            <span class="hljs-string">'No relevent experience'</span>: <span class="hljs-number">0</span>}

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, df, y = None</span>):</span>
        <span class="hljs-keyword">return</span> self

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transform</span>(<span class="hljs-params">self, df, y = None</span>):</span>
        df_pre = df.copy()
        df_pre.loc[:,<span class="hljs-string">'relevent_experience'</span>] = df_pre[<span class="hljs-string">'relevent_experience'</span>]\
                               .map(self.rel_exp_map)
        <span class="hljs-keyword">return</span> df_pre
</code></pre>
<p>Here's an explanation of what's going on in this code:</p>
<ul>
<li>Create a class named Encode which inherits the base class called TransformerMixin from sklearn.</li>
<li>Inside the class, there are 3 necessary methods: <code>__init__</code>, <code>fit</code>, and <code>transform</code></li>
<li><strong><code>__init__</code></strong> will be called when a pipeline is created. It is where we define variables inside the class. I created a variable ‘rel_exp_map’ which is a dictionary that maps categories to numbers.</li>
<li><strong><code>fit</code></strong> will be called when fitting the pipeline. I left it blank for this case.</li>
<li><strong><code>transform</code></strong> will be called when the pipeline transform is used. This method requires a dataframe (df) as input, while y is None by default (the signature must accept a y argument, but it isn't used here).</li>
<li>In <strong>transform</strong>, the dataframe column ‘relevent_experience’ will be mapped with the rel_exp_map.</li>
</ul>
<p>Note that the <code>\</code> is only to continue the code to a new line.</p>
<p>Next, add this Encode class as a pipeline step.</p>
<pre><code class="lang-python">pipeline = Pipeline(steps=[
    (<span class="hljs-string">'Encode'</span>, Encode()),
    (<span class="hljs-string">'col_trans'</span>, col_trans),
    (<span class="hljs-string">'model'</span>, LogisticRegression())
])
</code></pre>
<p>Then you can fit, transform, or grid search the pipeline like a normal pipeline.</p>
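<p>For example, here's a quick sketch, assuming X_train and X_test still contain the raw text categories that the Encode step expects:</p>
<pre><code class="lang-python"># The Encode step runs first, so the raw dataframe can be passed directly
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
</code></pre>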
<h3 id="heading-how-to-find-the-best-machine-learning-model">How to Find the Best Machine Learning Model</h3>
<p>The first solution that came to my mind was adding many model steps in a pipeline and skipping a step by changing the step value to ‘passthrough’ in the grid search. This is like what we did when finding the best data preparation method.</p>
<pre><code class="lang-python">temp_pipeline = Pipeline(steps=[
    (<span class="hljs-string">'model1'</span>, LogisticRegression()),
    (<span class="hljs-string">'model2'</span>,SVC(gamma=<span class="hljs-string">'auto'</span>))
])
</code></pre>
<p>But I saw an error like this:</p>
<p><img src="https://miro.medium.com/max/700/1*2CGj8aBvcPbxDw_p9tpijg.png" alt="Image" width="600" height="400" loading="lazy">
<em>Error when there are 2 classifiers in 1 pipeline</em></p>
<p>Ah ha – you can’t have two classification models in a pipeline!</p>
<p>The solution to this problem is to create a custom transformation that receives a model as an input and performs grid search to find the best model.</p>
<h3 id="heading-here-are-the-steps-well-follow-1">Here are the steps we'll follow:</h3>
<ol>
<li>Create a class that receives a model as an input</li>
<li>Add the class in step 1 to a pipeline</li>
<li>Perform grid search</li>
<li>Print grid search results as a table</li>
</ol>
<h3 id="heading-step-1-create-a-class-that-receives-a-model-as-an-input">Step 1: Create a class that receives a model as an input</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.base <span class="hljs-keyword">import</span> BaseEstimator
<span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> LogisticRegression
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVC

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ClfSwitcher</span>(<span class="hljs-params">BaseEstimator</span>):</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, estimator = LogisticRegression(<span class="hljs-params"></span>)</span>):</span>
        self.estimator = estimator

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y=None, **kwargs</span>):</span>
        self.estimator.fit(X, y)
        <span class="hljs-keyword">return</span> self

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, y=None</span>):</span>
        <span class="hljs-keyword">return</span> self.estimator.predict(X)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
        <span class="hljs-keyword">return</span> self.estimator.predict_proba(X)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">score</span>(<span class="hljs-params">self, X, y</span>):</span>
        <span class="hljs-keyword">return</span> self.estimator.score(X, y)
</code></pre>
<p><strong>Code explanation:</strong></p>
<ul>
<li>Create a class named <code>ClfSwitcher</code> which inherits the base class called BaseEstimator from sklearn.</li>
<li>Inside the class, there are five necessary methods, just like a classification model: <code>__init__</code>, <code>fit</code>, <code>predict</code>, <code>predict_proba</code>, and <code>score</code></li>
<li><strong><code>__init__</code></strong> receives an estimator (model) as an input. I set LogisticRegression() as the default model.</li>
<li><strong><code>fit</code></strong> fits the wrapped model and returns <code>self</code>, as scikit-learn expects.</li>
<li>The other methods delegate to the wrapped estimator, so the class behaves as if it were the model itself.</li>
</ul>
<h3 id="heading-step-2-add-the-class-in-step-1-to-a-pipeline">Step 2: Add the class in step 1 to a pipeline</h3>
<pre><code class="lang-python">clf_pipeline = Pipeline(steps=[
    (<span class="hljs-string">'Encode'</span>, Encode()),
    (<span class="hljs-string">'col_trans'</span>, col_trans),
    (<span class="hljs-string">'model'</span>, ClfSwitcher())
])
</code></pre>
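<p>Because the estimator is just an <code>__init__</code> parameter of ClfSwitcher, you can also swap models on this pipeline directly (a quick sketch):</p>
<pre><code class="lang-python"># Swap the wrapped model without rebuilding the pipeline
clf_pipeline.set_params(model__estimator=SVC(gamma='auto'))
</code></pre>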
<h3 id="heading-step-3-perform-grid-search">Step 3: Perform Grid search</h3>
<p>There are two cases in the grid search parameters, each using a different classification model: logistic regression and a support vector machine.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> GridSearchCV

grid_params = [
    {<span class="hljs-string">'model__estimator'</span>: [LogisticRegression()]},
    {<span class="hljs-string">'model__estimator'</span>: [SVC(gamma=<span class="hljs-string">'auto'</span>)]}
]

gs = GridSearchCV(clf_pipeline, grid_params, scoring=<span class="hljs-string">'accuracy'</span>)
gs.fit(X_train, y_train)

print(<span class="hljs-string">"Best Score of train set: "</span>+str(gs.best_score_))
print(<span class="hljs-string">"Best parameter set: "</span>+str(gs.best_params_))
print(<span class="hljs-string">"Test Score: "</span>+str(gs.score(X_test,y_test)))
</code></pre>
<p><img src="https://miro.medium.com/max/700/1*4rxzC3Wv0y9QOw0G4iHxog.png" alt="Image" width="600" height="400" loading="lazy">
<em>Grid Search Result</em></p>
<p>The result shows that logistic regression yields the best result.</p>
<h3 id="heading-step-4-print-grid-search-results-as-a-table">Step 4: Print grid search results as a table</h3>
<pre><code class="lang-python">pd.DataFrame(gs.cv_results_)
</code></pre>
<p><img src="https://miro.medium.com/max/700/1*bzCWW5AJ3Jb2c5fdIR78LA.png" alt="Image" width="600" height="400" loading="lazy">
<em>Grid Search Result Table</em></p>
<p>Logistic regression has slightly higher accuracy than SVC and is much faster (lower fit time).</p>
<p>Remember that you can apply different data preparation methods for each model as well.</p>
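<p>As a sketch of that idea (assuming a pipeline whose col_trans step wraps num_pipeline2 with both scaler steps, and whose model step is a ClfSwitcher), each model case can carry its own preparation parameters:</p>
<pre><code class="lang-python"># Case 1: logistic regression with StandardScaler skipped
# Case 2: SVC with MinMaxScaler skipped
grid_params_combo = [
    {'model__estimator': [LogisticRegression()],
     'col_trans__num_pipeline__std_scale': ['passthrough']},
    {'model__estimator': [SVC(gamma='auto')],
     'col_trans__num_pipeline__minmax_scale': ['passthrough']}
]
</code></pre>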
<h2 id="heading-conclusion">Conclusion</h2>
<p>You can implement the Scikit-learn pipeline and ColumnTransformer from the data cleaning to the data modeling steps to make your code neater. </p>
<p>You can also find the best hyperparameter, data preparation method, and machine learning model with grid search and the passthrough keyword.</p>
<p>You can find my code in this <a target="_blank" href="https://github.com/Yannawut/ML_Pipeline">GitHub repository</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a GUI Using Gradio for Machine Learning Models ]]>
                </title>
                <description>
                    <![CDATA[ By Edem Gold If you have ever built a Machine Learning model, you've probably thought "well this was cool, but how will other people be able to see how cool it is?"  Model deployment is a part of Machine Learning which isn't talked about as much as i... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-gui-using-gradio-for-machine-learning-models/</link>
                <guid isPermaLink="false">66d84fc563d2055c664a1a63</guid>
                
                    <category>
                        <![CDATA[ deployment ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 27 Jan 2022 21:00:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/01/gradio-image-2.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Edem Gold</p>
<p>If you have ever built a Machine Learning model, you've probably thought "well this was cool, but how will other people be able to see how cool it is?" </p>
<p>Model deployment is a part of Machine Learning which isn't talked about as much as it should be.</p>
<p>So in this article, I will introduce you to a new tool that will help you generate a web app for your Machine Learning model which you can then share with other devs so they can try it out.</p>
<p>I will be building a simple neural network model using scikit-learn and I'll create a GUI for the model using Gradio (this is the cool new tool I spoke about).</p>
<p>Let's get started.</p>
<blockquote>
<p>We cannot solve our problems with the same thinking we used to create them - Albert Einstein</p>
</blockquote>
<h1 id="heading-what-is-gradio">What is Gradio?</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1632054788128/NVI4Jgdrd.png?auto=compress,format&amp;format=webp" alt="gradio cover.png" width="600" height="400" loading="lazy">
<em><strong><strong>image credits: <a target="_blank" href="https://gradio.app/">gradio</a></strong></strong></em></p>
<p>According to the <a target="_blank" href="https://gradio.app/">Gradio website</a>, </p>
<blockquote>
<p>Gradio allows you to quickly create customizable UI components around your TensorFlow or PyTorch models or even arbitrary Python functions.</p>
</blockquote>
<p>Well, that's not terribly informative, is it? 😅.</p>
<p>If you have ever used a Python GUI library like Tkinter, then Gradio is like that.</p>
<p>Gradio is a GUI library that allows you to create customizable GUI components for your Machine Learning model.</p>
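<p>As a minimal sketch of the idea (using the same Interface API this article builds on), wrapping any Python function gives you a working UI:</p>
<pre><code>#a tiny Gradio app: a text box in, a text box out
import gradio as gr

def greet(name):
    return "Hello " + name + "!"

gr.Interface(fn=greet, inputs="text", outputs="text").launch()
</code></pre>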
<p>Now that we understand what Gradio is, let's get into the project.</p>
<h2 id="heading-pre-requisite"><strong>Pre-requisite</strong></h2>
<p>For you to successfully work through this tutorial, you'll need to have Python installed.</p>
<h1 id="heading-lets-get-building">Let's Get Building</h1>
<p>You can check out the GitHub repo for the project <a target="_blank" href="https://github.com/EdemGold/gradio_project">here</a>. Now I'll take you through the project step by step.</p>
<h3 id="heading-install-the-required-packages">Install the required packages</h3>
<p>Let's install the required packages:</p>
<pre><code>pip install scikit-learn
</code></pre><pre><code>pip install pandas
</code></pre><pre><code>pip install numpy
</code></pre><pre><code>pip install gradio
</code></pre><h3 id="heading-get-our-data">Get our data</h3>
<p>Our data is going to be in the .CSV format. You can get the data by clicking <a target="_blank" href="https://raw.githubusercontent.com/EdemGold/gradio_project/main/diabetes.csv">here</a>.</p>
<h3 id="heading-import-the-packages">Import the Packages</h3>
<p>We are going to import the required packages like this:</p>
<pre><code><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-keyword">import</span> gradio <span class="hljs-keyword">as</span> gr
</code></pre><p>Next, we are going to filter the warnings so we don't see them.</p>
<pre><code><span class="hljs-keyword">import</span> warnings

warnings.filterwarnings(<span class="hljs-string">'ignore'</span>)
</code></pre><h3 id="heading-import-the-data">Import the data</h3>
<p>Next, we are going to import our data:</p>
<pre><code>data = pd.read_csv(<span class="hljs-string">'diabetes.csv'</span>)
</code></pre><p>Now let's see a little preview of our dataset with this command:</p>
<pre><code>data.head()
</code></pre><p>Let's see the feature columns in our dataset:</p>
<pre><code>print (data.columns)
</code></pre><h3 id="heading-get-our-variables">Get our Variables</h3>
<p>Next, we get our X and Y variables, so type in these commands:</p>
<pre><code>x = data.drop([<span class="hljs-string">'Outcome'</span>], axis=<span class="hljs-number">1</span>)

y = data[<span class="hljs-string">'Outcome'</span>]
</code></pre><h3 id="heading-split-the-data">Split the data</h3>
<p>Now we are going to split our data using scikit-learn's inbuilt <code>train_test_split</code> function.</p>
<pre><code><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y)
</code></pre><h3 id="heading-scale-our-data">Scale our data</h3>
<p>Next, we are going to scale our data using scikit-learn's inbuilt <em>StandardScaler</em> object.</p>
<pre><code><span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler

#instantiate StandardScaler object
scaler = StandardScaler()

#scale data
x_train_scaled = scaler.fit_transform(x_train)

#transform (not fit) the test data with the scaler fitted on the training data
x_test_scaled = scaler.transform(x_test)
</code></pre><p>In the code above, we scaled our data using the StandardScaler object made available to us through scikit-learn. To learn more about Scaling and why we do it, click <a target="_blank" href="https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/">here</a>.</p>
<h3 id="heading-instantiate-and-train-the-model">Instantiate and train the model</h3>
<p>In this section, we are going to create and train our model. The model we are going to use will be a Multi-Layer Perceptron Classifier, a neural network built into scikit-learn.</p>
<pre><code>#<span class="hljs-keyword">import</span> model object
<span class="hljs-keyword">from</span> sklearn.neural_network <span class="hljs-keyword">import</span> MLPClassifier
model =  MLPClassifier(max_iter=<span class="hljs-number">1000</span>,  alpha=<span class="hljs-number">1</span>)

#train model on training data
model.fit(x_train_scaled, y_train)

#getting model performance on test data
print(<span class="hljs-string">"accuracy:"</span>, model.score(x_test_scaled, y_test))
</code></pre><h3 id="heading-create-the-function-for-gradio">Create the function for Gradio</h3>
<p>Now comes the fun part. Here we are going to create a function that will take in the features of the data set which our model was trained on and pass it as an array to our model to predict. Then we are going to build our Gradio web app based on that function.</p>
<p>To understand why we have to write a function, you must first understand that Gradio builds its GUI components for our Machine Learning model around that function. The function gives Gradio a way to take input from users, pass it to the ML model for processing, and then hand the model's result back to Gradio to display.</p>
<p>Let's write some code...</p>
<p>First, we will get the feature columns which we will then pass onto our function.</p>
<pre><code>#getting our columns

print(data.columns)
</code></pre><p>Now we will create our function like this:</p>
<pre><code>def diabetes(Pregnancies, Glucose, Blood_Pressure, SkinThickness, Insulin, BMI, Diabetes_Pedigree, Age):
    #turn the arguments into a numpy array
    x = np.array([Pregnancies, Glucose, Blood_Pressure, SkinThickness, Insulin, BMI, Diabetes_Pedigree, Age])

    #scale the input with the same scaler the model was trained on
    x_scaled = scaler.transform(x.reshape(1, -1))

    prediction = model.predict(x_scaled)

    return prediction
</code></pre><p>In the code above, we passed the feature columns from our dataset as arguments into a function named <em>diabetes</em>. We turned the arguments into a NumPy array, scaled it with the same scaler we fitted earlier, and passed it to our model for prediction. Finally, we returned the model's predicted result.</p>
<h3 id="heading-create-our-gradio-interface">Create our Gradio Interface</h3>
<p>Now we are going to create our Web App interface using Gradio:</p>
<pre><code>outputs = gr.outputs.Textbox()

app = gr.Interface(fn=diabetes, inputs=[<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>,<span class="hljs-string">'number'</span>], outputs=outputs,description=<span class="hljs-string">"This is a diabetes model"</span>)
</code></pre><p>The first thing we did above was to create a variable named outputs which holds the GUI component for our model result. The result of our model's prediction will be outputted in a text box.</p>
<p>Then we instantiated the Gradio Interface object and passed in our earlier <em>diabetes</em> function. Then we generated our inputs GUI component and told Gradio to expect 8 inputs in the form of numbers.</p>
<p>The inputs represent the feature columns that are present in our dataset – the same 8 feature column names we passed into our <em>diabetes</em> function.</p>
<p>Then we passed our earlier output variable into the outputs parameter present in the object.</p>
<p>Finally, we passed in the description of our web app into the description parameter.</p>
<h3 id="heading-launch-the-gradio-web-app">Launch the Gradio Web App</h3>
<p>Now we're going to launch our Gradio web app.</p>
<pre><code>app.launch()
</code></pre><p><strong>NOTE:</strong> If you are launching the Gradio app as a script from the command line, you will be given a localhost link which you can copy and paste into your browser to see your web app.</p>
<p>If you are launching the app from a Jupyter notebook, you will see a live preview of the app as you run the cell (and you will also be provided with a link).</p>
<h3 id="heading-host-and-share-your-web-app">Host and Share your Web App</h3>
<p>If you want to share your web app, all you have to do is put in <code>share=True</code> as a parameter in your launch object.</p>
<pre><code>#To provide a shareable link
app.launch(share=True)
</code></pre><p>You'll then get a link with a .gradio extension. But this shareable link lasts for only 24 hours, and it only works while your system is running, because Gradio hosts the web app on your system.</p>
<p>In simple words, for your link to work, your system has to be on. This is because Gradio uses your system to host the web app, so once your system is off the server connection is severed and you get a 500😅.</p>
<p>Luckily for us, Gradio also provides a way for you to permanently host your model. But the service is subscription-based, so you have to pay $7 monthly to access it. Permanent hosting is way out of the scope of this article (partly because the author is broke😅). But if you are interested in it, click <a target="_blank" href="https://www.gradio.app/introducing-hosted">here</a>.</p>
<h2 id="heading-important-resources"><strong>Important resources</strong></h2>
<ul>
<li><a target="_blank" href="https://gradio.app/">Gradio Website</a></li>
<li><a target="_blank" href="https://gradio.app/docs">Gradio Documentation</a></li>
<li><a target="_blank" href="https://github.com/gradio-app/gradio">Gradio on GitHub</a></li>
</ul>
<h2 id="heading-summary"><strong>Summary</strong></h2>
<p>The Gradio library is really cool and it helps solve a huge problem plaguing the Machine Learning community – model deployment.</p>
<p>90% of Machine Learning models built are not deployed, and Gradio is working to fix that.</p>
<p>It also serves as a way for beginners and experts to show off their models and also test the models in real life.</p>
<p>You can't go wrong with the Gradio Library. Give it a try.</p>
<p><a target="_blank" href="https://res.cloudinary.com/crunchbase-production/image/upload/c_lpad,h_256,w_256,f_auto,q_auto:eco,dpr_1/tv8zrejyehjshagvxgt7">Cover image source</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Machine Learning in Python – The Top New Scikit-Learn 0.24 Features You Should Know ]]>
                </title>
                <description>
                    <![CDATA[ By Davis David Scikit-learn is one of the most popular open-source and free machine learning libraries for Python.  The scikit-learn library contains a lot of efficient tools for machine learning and statistical modeling including classification, reg... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/machine-learning-python-new-scikit-learn-features-you-should-know/</link>
                <guid isPermaLink="false">66d84ec54540581f645440e3</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 04 Jun 2021 20:58:07 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/06/1_osadNSUIUZkwDqBC-ozxtg.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Davis David</p>
<p>Scikit-learn is one of the most popular open-source and free machine learning libraries for Python. </p>
<p>The scikit-learn library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering, and dimensionality reduction.</p>
<p>Many data scientists, machine learning engineers, and researchers rely on this library for their <a target="_blank" href="https://hackernoon.com/machine-learning-as-a-service-mlaas-with-sklearn-and-algorithmia-7299fbaed584?ref=hackernoon.com">machine learning</a> projects. I personally love using the scikit-learn library because it offers a ton of flexibility and it’s easy to understand its documentation with a lot of examples.</p>
<p>In this article, I’m happy to share with you the five best new features in scikit-learn 0.24.</p>
<h3 id="heading-first-install-the-latest-version-of-the-scikit-learn-library">First, Install the Latest Version of the Scikit-Learn Library</h3>
<p>Firstly, make sure you install the latest version (with pip):</p>
<pre><code>pip install --upgrade scikit-learn
</code></pre><p>If you are using conda, use the following command:</p>
<pre><code>conda install -c conda-forge scikit-learn
</code></pre><p><strong>Note:</strong> This version supports Python versions <strong>3.6</strong> to <strong>3.9</strong>.</p>
<p>Now, let’s look at the new features!</p>
<h2 id="heading-mean-absolute-percentage-error-mape">Mean Absolute Percentage Error (MAPE)</h2>
<p>The new version of scikit-learn introduces a new evaluation metric for regression problems called Mean Absolute Percentage Error (MAPE). Previously, you had to calculate MAPE yourself with a line of code like this:</p>
<pre><code class="lang-python">np.mean(np.abs((y_test - preds) / y_test))
</code></pre>
<p>But now you can call a function called <strong>mean_absolute_percentage_error</strong> from the <strong>sklearn.metrics</strong> module to evaluate the performance of your regression model.</p>
<p><strong>Example:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> mean_absolute_percentage_error
y_true = [<span class="hljs-number">3</span>, <span class="hljs-number">-0.5</span>, <span class="hljs-number">2</span>, <span class="hljs-number">7</span>]
y_pred = [<span class="hljs-number">2.5</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">2</span>, <span class="hljs-number">8</span>]

print(mean_absolute_percentage_error(y_true, y_pred))
</code></pre>
<p>0.3273809523809524</p>
<p><strong>Note:</strong> Keep in mind that the function does not represent the output as a percentage in the range [0, 100]. Instead, we represent it in the range [0, 1/eps]. The best value is <strong>0.0.</strong></p>
<h2 id="heading-onehotencoder-supports-missing-values">OneHotEncoder Supports Missing Values</h2>
<p><a target="_blank" href="https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f?ref=hackernoon.com">OneHotEncoder</a> can now handle missing values if presented in the dataset. It treats a missing value as a category. Let’s understand more about how it works in the following example.</p>
<p>First import the important packages:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd 
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> OneHotEncoder
</code></pre>
<p>Create a simple data-frame with a categorical feature that has missing values:</p>
<pre><code class="lang-python"><span class="hljs-comment"># intialise data of lists.</span>
data = {<span class="hljs-string">'education_level'</span>:[<span class="hljs-string">'primary'</span>, <span class="hljs-string">'secondary'</span>, <span class="hljs-string">'bachelor'</span>, np.nan,<span class="hljs-string">'masters'</span>,np.nan]}

<span class="hljs-comment"># Create DataFrame</span>
df = pd.DataFrame(data)

<span class="hljs-comment"># Print the output.</span>
print(df)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/06/zVaxL0LohRUpfDQhznRQ9z3y5tj1-1f9314q.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
<p>As you can see, we have two missing values in our <strong>education_level</strong> column.</p>
<p>Create the instance of OneHotEncoder:</p>
<pre><code class="lang-python">enc = OneHotEncoder()
</code></pre>
<p>Then fit and transform our data:</p>
<pre><code class="lang-python">enc.fit_transform(df).toarray()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/06/zVaxL0LohRUpfDQhznRQ9z3y5tj1-pn3531g0.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Our education_level column has been transformed and all missing values treated as a new category (check the last column of the above array).</p>
<h2 id="heading-new-method-for-feature-selection">New Method for Feature Selection</h2>
<p><strong>SequentialFeatureSelector</strong> is a new method for feature selection in scikit-learn. It can be either forward selection or backward selection.</p>
<h3 id="heading-forward-selection">Forward Selection</h3>
<p>Forward Selection iteratively finds the best new feature and then adds it to the set of selected features. </p>
<p>This means we start with zero features and then find a feature that maximizes the cross-validation score of an estimator. The selected feature is added to the set and the procedure is repeated until we reach our desired number of selected features.</p>
<h3 id="heading-backward-selection">Backward Selection</h3>
<p>Backward selection follows the same idea but in the opposite direction. Here we start with all features and then remove a feature from the set at each step until we reach the desired number of selected features.</p>
<h4 id="heading-example">Example</h4>
<p>Import the important packages:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.feature_selection <span class="hljs-keyword">import</span> SequentialFeatureSelector
<span class="hljs-keyword">from</span> sklearn.neighbors <span class="hljs-keyword">import</span> KNeighborsClassifier
<span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> load_iris
</code></pre>
<p>Load the iris dataset and its feature names:</p>
<pre><code class="lang-python">X, y = load_iris(return_X_y=<span class="hljs-literal">True</span>, as_frame=<span class="hljs-literal">True</span>)
feature_names = X.columns
</code></pre>
<p>Create the instance of the estimator:</p>
<pre><code class="lang-python">knn = KNeighborsClassifier(n_neighbors=<span class="hljs-number">3</span>)
</code></pre>
<p>Create the instance of SequentialFeatureSelector, set the number of features to select to be <strong>2</strong>, and set the direction to be “<strong>backward</strong>”:</p>
<pre><code class="lang-python">sfs = SequentialFeatureSelector(knn, n_features_to_select=<span class="hljs-number">2</span>,direction=<span class="hljs-string">'backward'</span>)
</code></pre>
<p>Finally learn the features to select:</p>
<pre><code class="lang-python">sfs.fit(X,y)
</code></pre>
<p>Show selected features:</p>
<pre><code class="lang-python">print(<span class="hljs-string">"Features selected by backward sequential selection: "</span>f{feature_names[sfs.get_support()].tolist()}<span class="hljs-string">")</span>
</code></pre>
<p>Features selected by backward sequential selection: [‘petal length (cm)’, ‘petal width (cm)’].</p>
<p>The only downside of this new feature selection method is that it can be slower than other methods you already know (SelectFromModel &amp; RFE), because it evaluates models with cross-validation.</p>
<h2 id="heading-new-methods-for-hyper-parameter-tuning">New Methods for Hyper-Parameter Tuning</h2>
<p>When it comes to hyper-parameter tuning, GridSearchCV and RandomizedSearchCV from scikit-learn have been the first choice for many data scientists. </p>
<p>But in the new version, we have two new classes for hyper-parameter tuning called <strong>HalvingGridSearchCV</strong> and <strong>HalvingRandomSearchCV</strong>.</p>
<p>HalvingGridSearchCV and HalvingRandomSearchCV use a new approach called <strong>successive halving</strong> to find the best hyperparameters. Successive halving is like a tournament among all hyper-parameter combinations.</p>
<h3 id="heading-how-does-successive-halving-work">How does successive halving work?</h3>
<p>In the first iteration, all candidate combinations of hyper-parameters are trained on a small subset of the observations (training data). </p>
<p>In the next iteration, only the combinations that performed well in the first iteration are kept, and they compete on a larger number of observations.</p>
<p>This selection process repeats at each iteration until the best combination of hyper-parameters is selected in the final iteration.</p>
<p><strong>Note:</strong> These classes are still experimental.</p>
<h4 id="heading-example-1">Example:</h4>
<p>Import the important packages:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> make_classification
<span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> RandomForestClassifier
<span class="hljs-keyword">from</span> sklearn.experimental <span class="hljs-keyword">import</span> enable_halving_search_cv  
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> HalvingRandomSearchCV
<span class="hljs-keyword">from</span> scipy.stats <span class="hljs-keyword">import</span> randint
</code></pre>
<p>Since these new classes are still experimental, to use them, we explicitly import <strong>enable_halving_search_cv</strong>.</p>
<p>Create a classification dataset by using the make_classification method:</p>
<pre><code class="lang-python">X, y = make_classification(n_samples=<span class="hljs-number">1000</span>)
</code></pre>
<p>Create the instance of the estimator. Here we use a Random Forest Classifier:</p>
<pre><code class="lang-python">clf = RandomForestClassifier(n_estimators=<span class="hljs-number">20</span>)
</code></pre>
<p>Create parameter distribution for tuning:</p>
<pre><code class="lang-python">param_dist = {<span class="hljs-string">"max_depth"</span>: [<span class="hljs-number">3</span>, <span class="hljs-literal">None</span>],
              <span class="hljs-string">"max_features"</span>: randint(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>),
              <span class="hljs-string">"min_samples_split"</span>: randint(<span class="hljs-number">2</span>, <span class="hljs-number">11</span>),
              <span class="hljs-string">"bootstrap"</span>: [<span class="hljs-literal">True</span>, <span class="hljs-literal">False</span>],
              <span class="hljs-string">"criterion"</span>: [<span class="hljs-string">"gini"</span>, <span class="hljs-string">"entropy"</span>]}
</code></pre>
<p>Then we instantiate the HalvingRandomSearchCV class with the RandomForestClassifier as the estimator and our parameter distribution:</p>
<pre><code class="lang-python">rsh = HalvingRandomSearchCV(
    estimator=clf,
    param_distributions=param_dist,
    cv = <span class="hljs-number">5</span>,
    factor=<span class="hljs-number">2</span>,
    min_resources = <span class="hljs-number">20</span>)
</code></pre>
<p>There are two important parameters in HalvingRandomSearchCV you need to know.</p>
<ol>
<li><strong>factor</strong> — This determines the proportion of the combination of hyper-parameters that are selected for each subsequent iteration. For example, <strong><em>factor=3</em></strong> means that only one-third of the candidates are selected for the next iteration.</li>
<li><strong>min_resources</strong> is the amount of resources (number of observations) allocated at the first iteration for each combination of hyper-parameters.</li>
</ol>
<p>Finally, we can fit the search object that we have created with our dataset.</p>
<pre><code class="lang-python">rsh.fit(X,y)
</code></pre>
<p>After training, we can inspect different outputs, such as:</p>
<p>The number of iterations:</p>
<pre><code class="lang-python">print(rsh.n_iterations_ )
</code></pre>
<p>which is 6.</p>
<p>Or the number of candidate parameters that were evaluated at each iteration:</p>
<pre><code class="lang-python">print(rsh.n_candidates_ )
</code></pre>
<p>which is <strong>[50, 25, 13, 7, 4, 2]</strong>.</p>
<p>Or the number of resources used at each iteration:</p>
<pre><code class="lang-python">print(rsh.n_resources_)
</code></pre>
<p>which is <strong>[20, 40, 80, 160, 320, 640]</strong>. Notice that the resources double at each iteration: each round multiplies them by <strong>factor</strong> (2 here), starting from <strong>min_resources</strong> (20).</p>
<p>Or the parameter setting that gave the best results on the hold-out data:</p>
<pre><code class="lang-python">print(rsh.best_params_)
</code></pre>
<p><strong>{‘bootstrap’: False,</strong><br><strong>‘criterion’: ‘entropy’,</strong><br><strong>‘max_depth’: None,</strong><br><strong>‘max_features’: 5,</strong><br><strong>‘min_samples_split’: 2}</strong></p>
<h2 id="heading-new-self-training-meta-estimator-for-semi-supervised-learning">New self-training meta-estimator for semi-supervised learning</h2>
<p>Scikit-learn 0.24 has introduced a new self-training implementation for semi-supervised learning called <strong>SelfTrainingClassifier</strong>. You can use SelfTrainingClassifier with any supervised classifier that can return probability estimates for each class.</p>
<p>This means any supervised classifier can function as a semi-supervised classifier, allowing it to learn from unlabeled observations in the dataset.</p>
<p><strong>Note:</strong> The unlabeled values in the target column must have a value of -1.</p>
<p>Let’s understand more about how it works in the following example.</p>
<p>Import the important packages:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> datasets
<span class="hljs-keyword">from</span> sklearn.semi_supervised <span class="hljs-keyword">import</span> SelfTrainingClassifier
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVC
</code></pre>
<p>In this example, we will use the iris dataset and the Support Vector Machine algorithm as the supervised classifier (it implements <strong>fit</strong> and <strong>predict_proba</strong>).</p>
<p>Then we load the dataset and randomly select some of the observations to be unlabeled:</p>
<pre><code class="lang-python">rng = np.random.RandomState(<span class="hljs-number">42</span>)
iris = datasets.load_iris()
random_unlabeled_points = rng.rand(iris.target.shape[<span class="hljs-number">0</span>]) &lt; <span class="hljs-number">0.3</span>
iris.target[random_unlabeled_points] = <span class="hljs-number">-1</span>
</code></pre>
<p>As you can see, unlabeled values in the target column have a value of -1.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/06/zVaxL0LohRUpfDQhznRQ9z3y5tj1-jcah31ok.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Create an instance of the supervised estimator:</p>
<pre><code class="lang-python">svc = SVC(probability=<span class="hljs-literal">True</span>, gamma=<span class="hljs-string">"auto"</span>)
</code></pre>
<p>Create an instance of the self-training meta estimator and add svc as a base_estimator:</p>
<pre><code class="lang-python">self_training_model = SelfTrainingClassifier(base_estimator=svc)
</code></pre>
<p>Finally, we can train self_training_model on the iris dataset that has some unlabeled observations:</p>
<pre><code class="lang-python">self_training_model.fit(iris.data, iris.target)
</code></pre>
<p>SelfTrainingClassifier(base_estimator=SVC(gamma=’auto’, probability=True))</p>
<h2 id="heading-final-thoughts-on-scikit-learn-024">Final Thoughts on Scikit-Learn 0.24</h2>
<p>As I said, scikit-learn remains one of the most popular open-source machine learning libraries. And it has all the <a target="_blank" href="https://towardsdatascience.com/14-lesser-known-impressive-features-of-scikit-learn-library-e7ea36f1149a?ref=hackernoon.com">features</a> you need to build an end-to-end machine learning project. </p>
<p>You can also implement the new impressive features presented in this article in your machine learning project.</p>
<p>You can find the highlights of other features released in scikit-learn 0.24 <a target="_blank" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_0_24_0.html?ref=hackernoon.com">here</a>.</p>
<p>Congratulations 👏👏, you have made it to the end of this article! I hope you have learned something new that will help you on your next machine learning or data science project.</p>
<p>If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!</p>
<p>You can also find me on Twitter <a target="_blank" href="https://twitter.com/Davis_McDavid?ref=hackernoon.com">@Davis_McDavid.</a></p>
<p>You can read <a target="_blank" href="https://hackernoon.com/u/davisdavid">other articles</a> here<em>.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Python scikit-learn Tutorial – Machine Learning Crash Course ]]>
                </title>
                <description>
                    <![CDATA[ Scikit-learn is one of the most popular machine leaning libraries for Python. It provides many unsupervised and supervised learning algorithms that make machine leaning simpler. We just published a scikit-learn course on the freeCodeCamp.org YouTube ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/learn-scikit-learn/</link>
                <guid isPermaLink="false">66b2050620f547d355775792</guid>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Wed, 07 Apr 2021 15:24:06 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/04/scikitlearn-1.png" medium="image" />
                <content:encoded>
<![CDATA[ <p>Scikit-learn is one of the most popular machine learning libraries for Python. It provides many unsupervised and supervised learning algorithms that make machine learning simpler.</p>
<p>We just published a scikit-learn course on the freeCodeCamp.org YouTube channel. This course will teach you the basics of scikit-learn so you can start using it in your own machine learning projects.</p>
<p>Vincent D. Warmerdam created this course. Vincent has taught many machine learning concepts on his <a target="_blank" href="https://calmcode.io/">website</a> and in his job as a research advocate. He has also created some useful open source libraries that work with scikit-learn. </p>
<p>Vincent has a knack for breaking down complex topics in a calm and simple manner.</p>
<p>First, you will get an overview of scikit-learn and learn about some high-level topics.</p>
<p>Next, you will learn about preprocessing tools. Preprocessing has a big impact on the performance of a model.</p>
<p>In the third section you will learn about metrics and how to create custom metrics for judging your machine learning models.</p>
<p>Then, you will learn about meta estimators. These relate to post-processing your data.</p>
<p>Finally, you will learn about a machine learning library that integrates with scikit-learn and tries to make machine learning more human. </p>
<p>Watch the full course below or <a target="_blank" href="https://youtu.be/0B5eIE_1vpU">on the freeCodeCamp.org YouTube channel</a> (2-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/0B5eIE_1vpU" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Machine Learning with Scikit-Learn—Full Course ]]>
                </title>
                <description>
                    <![CDATA[ Scikit-learn is a free machine learning library for the Python programming language. We have released a full course on the freeCodeCamp.org YouTube channel that will teach you about machine learning using scikit-learn (also known as sklearn). First y... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/machine-learning-with-scikit-learn-full-course/</link>
                <guid isPermaLink="false">66b20591297cd6de0bd54682</guid>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Wed, 24 Jun 2020 19:41:37 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2020/09/scikit-learn.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Scikit-learn is a free machine learning library for the Python programming language. We have released a full course on the <a target="_blank" href="https://youtu.be/pqNCD_5r0IU">freeCodeCamp.org YouTube channel</a> that will teach you about machine learning using scikit-learn (also known as sklearn).</p>
<p>First you will learn about the basics of machine learning and scikit-learn. Then you will learn about some common machine learning algorithms and how to implement them with scikit-learn. Finally, you will learn about artificial intelligence and the science behind it.</p>
<p>This course was created by DLAcademy. Throughout the course, machine learning concepts will be taught through practical examples.</p>
<p>Here are the topics covered:</p>
<ul>
<li>Installing scikit-learn</li>
<li>Plotting a graph</li>
<li>Identifying features and labels</li>
<li>Saving and opening a model</li>
<li>Classification</li>
<li>Train / test split</li>
<li>What is KNN?</li>
<li>What is SVM?</li>
<li>Linear regression</li>
<li>Logistic vs linear regression</li>
<li>KMeans</li>
<li>Neural networks</li>
<li>Overfitting and underfitting</li>
<li>Backpropagation</li>
<li>Cost function and gradient descent</li>
<li>CNNs</li>
<li>Implementing a handwritten digits recognizer</li>
</ul>
<p>Watch the course on the <a target="_blank" href="https://youtu.be/pqNCD_5r0IU">freeCodeCamp.org YouTube channel</a> (3 hour watch).</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Two hours later and still running? How to keep your sklearn.fit under control. ]]>
                </title>
                <description>
                    <![CDATA[ By Nathan Toubiana Written by Gabriel Lerner and Nathan Toubiana All you wanted to do was test your code, yet two hours later your Scikit-learn fit shows no sign of ever finishing. Scitime is a package that predicts the runtime of machine learning al... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/two-hours-later-and-still-running-how-to-keep-your-sklearn-fit-under-control-cc603dc1283b/</link>
                <guid isPermaLink="false">66c363c1ef766eb77cd78805</guid>
                
                    <category>
                        <![CDATA[ Data Sc ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                    <category>
                        <![CDATA[ timer ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 13 Mar 2019 15:36:10 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*aVzJTznRRfP1lM7AXe9yLw.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Nathan Toubiana</p>
<p><em>Written by <a target="_blank" href="https://medium.com/@gabi10004">Gabriel Lerner</a> and <a target="_blank" href="https://medium.com/@toubiana.nathan">Nathan Toubiana</a></em></p>
<p>All you wanted to do was test your code, yet two hours later your Scikit-learn fit shows no sign of ever finishing. Scitime is a package that predicts the runtime of machine learning algorithms so that you will not be caught off guard by an endless fit.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*aVzJTznRRfP1lM7AXe9yLw.jpeg" alt="Image" width="600" height="400" loading="lazy">
_Image by Kevin Ku on [unsplash.com](https://unsplash.com/photos/aiyBwbrWWlo" rel="noopener" target="<em>blank" title=")</em></p>
<p>Whether you are in the process of building a machine learning model or deploying your code to production, knowledge of how long your algorithm will take to fit is key to streamlining your workflow. With Scitime, you will be able to estimate in a matter of seconds how long the fit should take for the most commonly used Scikit Learn algorithms.</p>
<p>There have been a couple of research articles (such as <a target="_blank" href="https://www.sciencedirect.com/science/article/pii/S0004370213001082">this one</a>) published on that subject. However, as far as we know, there’s no practical implementation of it. The goal here is not to predict the exact runtime of the algorithm but more to give a rough approximation.</p>
<h3 id="heading-what-is-scitime">What is Scitime?</h3>
<p>Scitime is a python package requiring at least python 3.6 with <a target="_blank" href="https://github.com/pandas-dev/pandas">pandas</a>, <a target="_blank" href="https://github.com/scikit-learn/scikit-learn">scikit-learn</a>, <a target="_blank" href="https://github.com/giampaolo/psutil">psutil</a> and <a target="_blank" href="https://github.com/joblib/joblib">joblib</a> dependencies. You will find the Scitime repo <a target="_blank" href="https://github.com/nathan-toubiana/scitime">here</a>.</p>
<p>The main function in this package is called “<em>time</em>”. Given an input matrix X, an output vector y, and the Scikit Learn model of your choice, <em>time</em> will output both the estimated training time and its confidence interval. The package currently supports the following Scikit Learn algorithms, with plans to add more in the near future:</p>
<ul>
<li><a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html">KMeans</a></li>
<li><a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">RandomForestRegressor</a></li>
<li><a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html">SVC</a></li>
<li><a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">RandomForestClassifier</a></li>
</ul>
<h3 id="heading-quick-start">Quick Start</h3>
<p>Let’s install the package and run the basics.</p>
<p>First create a new virtualenv (this is optional, to avoid any version conflicts!)</p>
<pre><code>❱ virtualenv env
❱ source env/bin/activate
</code></pre><p>and then run:</p>
<pre><code>❱ (env) pip install scitime
</code></pre><p>or with conda:</p>
<pre><code>❱ (env) conda install -c conda-forge scitime
</code></pre><p>Once the installation has succeeded, you are ready to estimate the time of your first algorithm.</p>
<p>Let’s say you wanted to train a kmeans clustering, for example. You would first need to import the scikit-learn package, set the kmeans parameters, and also choose the inputs (a.k.a. <em>X</em>), here generated <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs">randomly</a> for simplicity.</p>
<p>Running this before doing the actual fit would give an approximation of the runtime:</p>
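<p>Here is a minimal sketch of that setup, based on the <em>Estimator</em> class and <em>time</em> function described in the usage guide below (the exact import path is an assumption):</p>
<pre><code class="lang-python">import numpy as np
from sklearn.cluster import KMeans

# assumed import path for the Estimator class described in the usage guide
from scitime import Estimator

# meta_algo defaults to 'RF'; verbose defaults to 0
estimator = Estimator(meta_algo='RF', verbose=0)

# the algo whose fit time we want to estimate, with randomly generated inputs
km = KMeans(n_clusters=10)
X = np.random.rand(100000, 10)

# one extra line of code: outputs the estimated time and its confidence interval
# (y is omitted since kmeans is unsupervised)
estimation, lower_bound, upper_bound = estimator.time(km, X)
print(estimation, lower_bound, upper_bound)
</code></pre>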
<p>As you can see, you can get this info in just one extra line of code! The inputs of the <em>time</em> function are exactly what's needed to run the fit (that is, the algo itself and X), which makes it even easier to use.</p>
<p>Looking more closely at the last line of the above code, the first output (<em>estimation</em>: 15 seconds in this case) is the predicted runtime you're looking for. Scitime will also output it with a confidence interval (<em>lower_bound</em> and <em>upper_bound</em>: 10 and 30 seconds in this case). You can always compare it to the actual training time by running:</p>
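<p>For example, with Python's standard <em>time</em> module (a simple sketch, reusing the <em>km</em> and <em>X</em> from above):</p>
<pre><code class="lang-python">import time

# time the actual fit and compare it to Scitime's estimation
start = time.time()
km.fit(X)
print(f"actual training time: {time.time() - start:.1f} seconds")
</code></pre>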
<p>In this case, on our local machine, the estimation is 15 seconds, whereas the actual training time is 20 seconds (but you might not get the same results, as we’ll explain later).</p>
<p><strong>As a quick usage guide:</strong></p>
<p><em>Estimator(meta_algo, verbose, confidence) class:</em></p>
<ul>
<li><strong>meta_algo</strong>: The estimator used to predict the time, either ‘RF’ or ‘NN’ (see details in next paragraph) — defaults to‘RF’</li>
<li><strong>verbose</strong>: Control of the amount of log output (either 0, 1, 2 or 3) — defaults to 0</li>
<li><strong>confidence</strong>: Confidence for intervals — defaults to 95%</li>
</ul>
<p><em>estimator.time(algo, X, y) function:</em></p>
<ul>
<li><strong>algo</strong>: algo whose runtime the user wants to predict</li>
<li><strong>X</strong>: numpy array of inputs to be trained</li>
<li><strong>y</strong>: numpy array of outputs to be trained (set to <em>None</em> if the algo is unsupervised)</li>
</ul>
<p>Quick note: to avoid any confusion, it’s worth highlighting that <strong>algo</strong> and <strong>meta_algo</strong> are two different things here: <strong>algo</strong> is the algorithm whose runtime we want to estimate, <strong>meta_algo</strong> is the algorithm used by Scitime to predict the runtime.</p>
<h3 id="heading-how-scitime-works">How Scitime works</h3>
<p>We are able to predict the runtime to fit by using our own estimator, which we call the meta algorithm (<em>meta_algo</em>), whose weights are stored in a dedicated pickle file in the package metadata. For each Scikit Learn model, you will find a corresponding meta algo pickle file in Scitime's code base.</p>
<p>You might be thinking:</p>
<blockquote>
<p>Why not manually estimate the time complexity with big O notations?</p>
</blockquote>
<p>That’s a fair point. It’s a valid way of approaching the problem, and something we thought about at the beginning of the project. However, we would need to formulate the complexity explicitly for each algo and set of parameters, which is rather challenging in some cases, given the number of factors playing a role in the runtime. The meta_algo basically does all the work for you, and we’ll explain how.</p>
<p>Two types of meta algos have been trained to estimate the time to fit (both from Scikit Learn):</p>
<ul>
<li>The <strong>RF</strong> meta algo, a <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">RandomForestRegressor</a> estimator.</li>
<li>The <strong>NN</strong> meta algo, a basic <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html">MLPRegressor</a> estimator.</li>
</ul>
<p>These meta algos estimate the time to fit using an array of ‘meta’ features. Here’s a summary of how we build these features:</p>
<p>First, we fetch the shape of your input matrix X and output vector y. Second, the parameters you feed to the Scikit Learn model are taken into consideration, as they will impact the training time as well. Lastly, hardware specific to your machine, such as available memory and cpu count, is also considered.</p>
<p>As shown earlier, we also provide confidence intervals on the time prediction. The way these are computed depends on the meta algo chosen:</p>
<ul>
<li>For <strong>RF</strong>, since any random forest regressor is a combination of multiple trees (also called <em>estimators</em>), the confidence interval will be based on the distribution of the set of predictions computed by each estimator.</li>
<li>For <strong>NN</strong>, the process is a little less straightforward: we first compute a set of <a target="_blank" href="https://en.wikipedia.org/wiki/Mean_squared_error">MSE</a>s along with the number of observations on a test set, grouped by predicted duration bins (that is from 0 to 1 second, 1 to 5 seconds, and so on), and we then compute a <a target="_blank" href="https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html">t-stat</a> to get the lower and upper bounds of the estimation. As we don’t have a lot of data for very long models, the confidence interval for such data might get very broad.</li>
</ul>
<h3 id="heading-how-we-built-it">How we built it</h3>
<p>You might be thinking:</p>
<blockquote>
<p>How did you get enough data on the training time of all these scikit-learn fits over various parameters and hardware configurations?</p>
</blockquote>
<p>The (unglamorous) answer is we generated the data ourselves, using a combination of computers and VM hardware to simulate what the training time would be on different systems. We then fitted our meta algos on these randomly generated data points to build an estimator meant to be reliable regardless of your system.</p>
<p>While the <a target="_blank" href="https://github.com/nathan-toubiana/scitime/blob/master/scitime/estimate.py">estimate.py</a> file handles the runtime prediction, the <a target="_blank" href="https://github.com/nathan-toubiana/scitime/blob/master/scitime/_model.py">_<em>model.py</em></a> file helped us generate data to train our meta algos, using our dedicated Model class. Here’s a corresponding code sample, for kmeans:</p>
<p>Note that you can also use the <a target="_blank" href="https://github.com/nathan-toubiana/scitime/blob/master/_data.py"><em>_data.py</em></a> file directly from the command line to generate data or train a new model. Related instructions can be found in the repo Readme file.</p>
<p>When generating data points, you can edit the parameters of the Scikit Learn models you want to train on. You can head to <a target="_blank" href="https://github.com/nathan-toubiana/scitime/blob/master/scitime/_config.json"><em>scitime/_config.json</em></a> and edit the parameters of the models, as well as the number of rows and columns you want to train with.</p>
<p>We use an <a target="_blank" href="https://docs.python.org/2/library/itertools.html#itertools.product">itertool</a> function to loop through every possible combination, along with a drop rate set between 0 and 1 to control how quickly the loop will jump through the different possible iterations.</p>
<h3 id="heading-how-accurate-is-scitime">How accurate is Scitime?</h3>
<p>Below, we highlight how our predictions perform for the specific case of kmeans. Our generated dataset contains ~100k data points, which we split into train and test sets (75% / 25%).</p>
<p>We grouped training predicted times by different time buckets and computed the <a target="_blank" href="https://en.wikipedia.org/wiki/Mean_absolute_percentage_error">MAPE</a> and <a target="_blank" href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">RMSE</a> over each of those buckets for all our estimators using the RF meta-algo and the NN meta-algo.</p>
<p>Please note that these results were performed on a restricted data set, so they might be different on unexplored data points (such as other systems / extreme values of certain model parameters). For this specific training set, the <a target="_blank" href="https://en.wikipedia.org/wiki/Coefficient_of_determination">R-squared</a> is around 80% for NN and 90% for RF.</p>
<p>As we can see, not surprisingly, the accuracy is consistently higher on the train set than on the test, for both NN and RF. We also see that RF seems to perform way better than NN overall. The MAPE for RF is around 20% on the train set and 40% on the test set. The NN MAPE is surprisingly very high.</p>
<p>Let’s slice the MAPE (on the test set) by the number of predicted seconds.</p>
<p>One important thing to keep in mind is that for some cases the time prediction is sensitive to the meta algo chosen (RF or NN). In our experience RF has performed very well within the data set input ranges, as shown above. However, for out of range points, NN might perform better, as suggested by the end of the above chart. This would explain why NN MAPE is quite high while the RMSE is decent: it performs poorly on small values.</p>
<p>As an example, if you try to predict the runtime of a kmeans with default parameters and with an input matrix of a few thousand lines, the RF meta algo will be precise because our training dataset contains similar data points. However, for predicting very specific parameters (for instance, a very high number of clusters), NN might perform better because it extrapolates from the training set, whereas RF doesn’t. NN performs worse on the above charts because these plots are only based on data close to the set of inputs of the training data.</p>
<p>However, as shown in this graph, the out of range values (thin lines) are extrapolated by the NN estimator, whereas the RF estimator predicts the output stepwise.</p>
<p>Now let’s look at the most important ‘meta’ features for the example of kmeans:</p>
<p>As we can see, only 6 features account for more than 80% of the model variance. Among them, the most important is a parameter of the scikit-learn kmeans class itself (number of clusters), but a lot of external factors have great influence on the runtime such as number of rows/columns and available memory.</p>
<h3 id="heading-limitations">Limitations</h3>
<p>As mentioned earlier, the first limitation is related to the confidence intervals: they may be very wide, especially for NN, and for heavy models (that would take at least an hour).</p>
<p>Additionally, the NN might perform poorly on small to medium predictions. Sometimes, for small durations, the NN might even predict a negative duration, in which case we automatically switch back to RF.</p>
<p>Another limitation of the estimator arises when ‘special’ algo parameter values are used. For example, in a RandomForest scenario, when max_depth is set to <em>None</em>, the depth could take any value. This might result in a much longer time to fit, which is more difficult for the meta algo to pick up, although we did our best to account for these cases.</p>
<p>When running <em>estimator.time(algo, X, y)</em>, we do require the user to enter the actual X and y vectors, which seems unnecessary, as we could simply request the shape of the data to estimate the training time. The reason for this is that we actually try to fit the model before predicting the runtime, in order to raise any instant errors. We run <em>algo.fit(X, y)</em> in a subprocess for one second to check for any fit error, after which we move on to the prediction part. However, there are times when the algo (and/or the input matrix) is so big that running <em>algo.fit(X, y)</em> will eventually throw a memory error, which we can't account for.</p>
<h3 id="heading-future-improvements">Future improvements</h3>
<p>The most effective and obvious way to improve the performance of our current predictions would be to generate more data points on different systems to better support a wide range of hardware/parameters.</p>
<p>We will be looking at adding more supported Scikit Learn algos in the near future. We could also implement other algos such as <a target="_blank" href="https://github.com/Microsoft/LightGBM">lightGBM</a> or <a target="_blank" href="https://github.com/dmlc/xgboost">xgboost</a>. Feel free to contact us if there’s an algorithm you would like us to implement in the next iterations of Scitime!</p>
<p>Other interesting avenues for improving the performance of the estimator would be to include more granular information about the input matrix such as variance, or correlation with output. We currently generate data completely randomly, for which the fit time might be higher than for real world datasets. So in some cases it might overestimate the training time.</p>
<p>In addition we could track finer hardware specific information such as frequency of the cpu, or current cpu usage.</p>
<p>Ideally, as the algorithm might change from a scikit-learn version to another, and thus have an impact on the runtime, we would also account for it, for example by using the version as a ‘meta’ feature.</p>
<p>As we acquire more data to fit our meta algos, we might think of using more complex meta algos, such as sophisticated neural networks (using regularization techniques like dropout or batch normalization). We could even consider using <a target="_blank" href="https://www.tensorflow.org">tensorflow</a> to fit the meta algo (and add it as optional): it would not only help us get a better accuracy, but also build more robust confidence intervals using <a target="_blank" href="https://towardsdatascience.com/uncertainty-estimation-for-neural-network-dropout-as-bayesian-approximation-7d30fc7bc1f2">dropout</a>.</p>
<h3 id="heading-contributing-to-scitime-and-sending-us-your-feedback">Contributing to Scitime and sending us your feedback</h3>
<p>First, any kind of feedback, especially on the performance of the predictions and on ideas to improve this process of generating data, is very much appreciated!</p>
<p>As discussed before, you can use our repo to generate your own data points in order to train your own meta algorithm. When doing so, you can help make Scitime better by sharing the data points found in the result csv (<em>~/scitime/scitime/[algo]_results.csv</em>) so that we can integrate them into our model.</p>
<p>To generate your own data you can run a command similar to this one (from the package repo source):</p>
<pre><code>❱ python _data.py --verbose <span class="hljs-number">3</span> --algo KMeans --drop_rate <span class="hljs-number">0.99</span>
</code></pre><p>Note: if run directly using the code source (with the <em>Model</em> class), do not forget to set <em>write_csv</em> to true, otherwise the generated data points will not be saved.</p>
<p><em>We use GitHub issues to track all bugs and feature requests. Feel free to open an issue if you have found a bug or wish to see a new feature implemented. More info can be found about how to contribute in the Scitime repo.</em></p>
<p><em>For issues with training time predictions, when submitting feedback, including the full dictionary of parameters you are fitting into your model might help, so that we can diagnose why the performance is subpar for your specific use case. To do so simply set the verbose parameter to 3 and copy paste the log of the parameter dic in the issue description.</em></p>
<p><em>Find the <a target="_blank" href="https://github.com/nathan-toubiana/scitime">code source</a></em></p>
<p><em>Find the <a target="_blank" href="https://scitime.readthedocs.io">documentation</a></em></p>
<h3 id="heading-credits">Credits</h3>
<ul>
<li><a target="_blank" href="https://github.com/gabrielRTR"><em>Gabriel Lerner</em></a> <em>&amp; <a target="_blank" href="https://github.com/nathan-toubiana">Nathan Toubiana</a> are the main contributors of this package and co-authors of this article</em></li>
<li><em>Special thanks to <a target="_blank" href="https://github.com/philippemizrahi">Philippe Mizrahi</a> for helping along the way</em></li>
<li><em>Thanks for all the help we got from early reviews / beta testing</em></li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ A beginner’s guide to training and deploying machine learning models using Python ]]>
                </title>
                <description>
                    <![CDATA[ By Ivan Yung When I was first introduced to machine learning, I had no idea what I was reading. All the articles I read consisted of weird jargon and crazy equations. How could I figure all this out? I opened a new tab in Chrome and looked for easier... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/a-beginners-guide-to-training-and-deploying-machine-learning-models-using-python-48a313502e5a/</link>
                <guid isPermaLink="false">66c341dfccd54aa295e92c5c</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 27 Jun 2018 16:33:23 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*-W-ioBNBUF5eSDYWc-ZHxQ.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Ivan Yung</p>
<p>When I was first introduced to machine learning, I had no idea what I was reading. All the articles I read consisted of weird jargon and crazy equations. How could I figure all this out?</p>
<p>I opened a new tab in Chrome and looked for easier solutions. I found APIs from Amazon, Microsoft, and Google that did all the machine learning for me. Each hackathon project I made would call their servers and WOW — it looked so smart! I was hooked.</p>
<p>But, after a year, I realized that I wasn’t learning anything. Everything I was doing was described by this Nedroid comic that I modified:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*1YwLOx3wkKoLjRUD-NoiZA.png" alt="Image" width="600" height="400" loading="lazy">
_[Original image source](https://nedroidcomics.tumblr.com/post/41879001445/the-internet" rel="noopener" target="<em>blank" title=").</em></p>
<p>Eventually, I sat down and learned how to use machine learning without megacorporations. And it turns out, anyone can do it. The current libraries we have in Python are amazing. In this article, I will explain how I use these libraries to create a proper machine learning back end.</p>
<h3 id="heading-getting-a-dataset">Getting a dataset</h3>
<p>Machine learning projects are reliant on finding good datasets. If the dataset is bad, or too small, we cannot make accurate predictions. You can find some good datasets at <a target="_blank" href="http://kaggle.com">Kaggle</a> or the <a target="_blank" href="https://archive.ics.uci.edu/ml/index.php">UC Irvine Machine Learning Repository</a>.</p>
<p>In this article, I am using a <a target="_blank" href="https://archive.ics.uci.edu/ml/datasets/Wine+Quality">wine quality dataset</a> with many features and one label. <strong>Features</strong> are independent variables which affect the dependent variable called the <strong>label</strong>. In this case, we have one <strong>label</strong> column — wine quality — that is affected by all the other columns (features like pH, density, acidity, and so on).</p>
<p>In the following Python code, I use a library called <a target="_blank" href="https://pandas.pydata.org/">pandas</a> to control my dataset. pandas wraps datasets with many functions for selecting and manipulating data.</p>
<p>First, I load the dataset into a pandas DataFrame and split it into the label and its features. I grab the label column by its name (quality) and then drop the column to get all the features.</p>
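<p>A minimal sketch of that step (the file name is an assumption; the UCI red wine csv uses ';' as its separator):</p>
<pre><code class="lang-python">import pandas as pd

# load the wine quality dataset
data = pd.read_csv('winequality-red.csv', sep=';')

# grab the label column by its name, then drop it to keep only the features
labels = data['quality']
features = data.drop('quality', axis=1)
</code></pre>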
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*Kybbe-8PK1jHttWyP0adow.png" alt="Image" width="600" height="400" loading="lazy">
<em>Scikit-learn, the library we will use for machine learning</em></p>
<h3 id="heading-training-a-model">Training a model</h3>
<p>Machine learning works by finding a relationship between a label and its features. We do this by showing an object (our model) a bunch of examples from our dataset. Each example helps define how each feature affects the label. We refer to this process as <strong>training our model</strong>.</p>
<p>I use the estimator object from the <a target="_blank" href="http://scikit-learn.org/stable/index.html">Scikit-learn</a> library for simple machine learning. <strong>Estimators</strong> are empty models that create relationships through a predefined algorithm.</p>
<p>For this wine dataset, I create a model from a linear regression estimator. (Linear regression attempts to draw a straight line of best fit through our dataset.) The model is able to get the regression data through the fit function. I can use the model by passing in a fake set of features through the predict function. The example below shows the features for one fake wine. The model will output an answer based on its training.</p>
<p>The code for this model, and fake wine, is below:</p>
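<p>Here is a minimal sketch (the fake wine’s feature values are invented for illustration and follow the dataset’s column order):</p>
<pre><code>from sklearn.linear_model import LinearRegression

# Create an empty linear regression estimator and train it
# on our features and labels.
model = LinearRegression()
model.fit(features, labels)

# One fake wine: fixed acidity, volatile acidity, citric acid,
# residual sugar, chlorides, free sulfur dioxide, total sulfur
# dioxide, density, pH, sulphates, alcohol.
fake_wine = [[7.4, 0.66, 0.0, 1.8, 0.075, 13.0, 40.0, 0.9978, 3.51, 0.56, 9.4]]

# The model outputs a predicted quality based on its training.
print(model.predict(fake_wine))
</code></pre>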
<h3 id="heading-importing-and-exporting-our-python-model">Importing and exporting our Python model</h3>
<p>The <a target="_blank" href="https://docs.python.org/2/library/pickle.html">pickle</a> library makes it easy to serialize my trained models into files. I can also load a model back into my code later. This lets me keep my model-training code separate from the code that deploys the model.</p>
<p>I can import or export my Python model for use in other Python scripts with the code below:</p>
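<p>A sketch of both directions (the filename <code>model.pkl</code> is arbitrary):</p>
<pre><code>import pickle

# Export: serialize the trained model to a file.
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Import: load the model back, for example in another script.
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
</code></pre>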
<h3 id="heading-creating-a-simple-web-server">Creating a simple web server</h3>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*wv3umUu_u8r7dgeXHX38uw.png" alt="Image" width="600" height="400" loading="lazy">
<em>Flask, the framework we will use to create a web server.</em></p>
<p>To deploy my model, I first have to create a server. Servers listen to web traffic, and run functions when they find a request addressed to them. The function that runs can depend on the request’s route and other data that it has. Afterwards, the server can send a message of confirmation back to the requester.</p>
<p>The <a target="_blank" href="http://flask.pocoo.org/">Flask</a> Python framework allows me to create web servers in record time.</p>
<p>In the code below, I use Flask to run a simple one-route web server. My one route listens for POST requests and sends a hello back. POST requests carry their data in the request body; in our case, that data is a JSON object.</p>
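<p>A minimal version of that server might look like this (I use the “/echo” route here because we will reuse it below):</p>
<pre><code>from flask import Flask, request, jsonify

app = Flask(__name__)

# One route that listens for POST requests and sends a hello back.
@app.route('/echo', methods=['POST'])
def echo():
    data = request.get_json()
    return jsonify({'message': 'hello', 'received': data})

if __name__ == '__main__':
    app.run()
</code></pre>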
<h3 id="heading-adding-the-model-to-my-server">Adding the model to my server</h3>
<p>With the pickle library, I am able to load our trained model into my web server.</p>
<p>Our server now loads the trained model during its initialization. I can access it by sending a POST request to my “/echo” route. The route grabs an array of features from the request body and gives it to the model. The model’s prediction is then sent back to the requester.</p>
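<p>Putting the pieces together, here is a sketch of the full server (storing the features under a <code>features</code> key in the JSON body is my own convention):</p>
<pre><code>import pickle

from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the trained model during the server's initialization.
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/echo', methods=['POST'])
def echo():
    # Grab the array of features from the request body
    # and give it to the model.
    features = request.get_json()['features']
    prediction = model.predict([features])
    # Send the model's prediction back to the requester.
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run()
</code></pre>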
<h3 id="heading-conclusion">Conclusion</h3>
<p>After reading this article, you should be able to create your own machine learning back end. For more detail, you can find a full example that I made at <a target="_blank" href="https://github.com/iYung/sklearn-flask-example">this</a> repository.</p>
<p>When you have time, I recommend taking a step back from coding and reading more about machine learning. This article teaches only the bare necessities for creating a model. There are topics, like loss reduction and neural nets, that you still need to learn.</p>
<p>Good luck and happy coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Text classification and prediction using the Bag Of Words approach ]]>
                </title>
                <description>
                    <![CDATA[ By gk_ There are a number of approaches to text classification. In other articles I’ve covered Multinomial Naive Bayes and Neural Networks. One of the simplest and most common approaches is called “Bag of Words.” It has been used by commercial analyt... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/text-classification-and-prediction-using-bag-of-words-8aeb1396cded/</link>
                <guid isPermaLink="false">66c3608139357f9446976639</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scikit learn ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 23 Mar 2018 21:40:55 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*wdtdcVQQRzc7xPNZzyCsUg.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By gk_</p>
<p>There are a number of approaches to text classification. In other articles I’ve covered <a target="_blank" href="https://chatbotslife.com/text-classification-using-algorithms-e4d50dcba45">Multinomial Naive Bayes</a> and <a target="_blank" href="https://machinelearnings.co/text-classification-using-neural-networks-f5cd7b8765c6">Neural Networks</a>.</p>
<p>One of the simplest and most common approaches is called “Bag of Words.” It has been used by commercial analytics products including <a target="_blank" href="https://www.clarabridge.com/">Clarabridge</a>, <a target="_blank" href="https://www.webanalyticsworld.net/analytics-measurement-and-management-tools/radian-6-overview">Radian6</a>, and others.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*j3HUg18QwjDJTJwW9ja5-Q.png" alt="Image" width="512" height="190" loading="lazy">
<em>Image <a target="_blank" href="https://machinelearnings.co/text-classification-using-neural-networks-f5cd7b8765c6">source</a>.</em></p>
<p>The approach is relatively simple: given a set of topics and a set of terms associated with each topic, determine which topic(s) exist within a document (for example, a sentence).</p>
<p>While other, more exotic algorithms also organize words into “bags,” in this technique we don’t create a model or apply mathematics to the way in which this “bag” intersects with a classified document. A document’s classification will be polymorphic, as it can be associated with multiple topics.</p>
<p>Does this seem too simple to be useful? Try it before you jump to conclusions. In NLP, a simple approach can often go a long way.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*aIUBmmPz2K44OdZnWCj4jw.png" alt="Image" width="150" height="225" loading="lazy">
<em>Credit: Smitha Milli, <a target="_blank" href="https://twitter.com/smithamilli/status/837153616116985856">https://twitter.com/smithamilli</a></em></p>
<p>We will need three things:</p>
<ul>
<li>A topics/words definition file</li>
<li>A classifier function</li>
<li>A notebook to test our classifier</li>
</ul>
<p>And then we will venture a bit further and build and test a predictive model using our classification data.</p>
<h4 id="heading-topics-and-words">Topics and Words</h4>
<p>Our definition file is in JSON format. We will use it to classify messages between patients and a nurse assigned to their care.</p>
<h4 id="heading-topicsjson">topics.json</h4>
<p>There are two items of note in this definition.</p>
<p>First, let’s look at some of the terms. For example, “bruis” is a <strong>stem.</strong> It will cover supersets such as “bruise,” “bruising,” and so on. Second, terms containing <strong>*</strong> are actually <strong>patterns</strong>. For example, <strong>*dpm</strong> is a pattern for a numeric <strong>d</strong>igit followed by “pm.”</p>
<p>To keep things simple, we are only handling numeric pattern matching, but this could be expanded to a broader scope.</p>
<p>This ability to find patterns within a term is very useful when classifying documents containing dates, times, monetary values, and so on.</p>
<p>Let’s try out some classification.</p>
<p>The classifier returns a JSON result set containing the sentence(s) associated with each topic found in the message. A message can contain multiple sentences, and a sentence can be associated with none, one, or multiple topics.</p>
<p>Let’s take a look at our classifier. The code is <a target="_blank" href="https://github.com/ugik/notebooks/blob/master/msgClassify.py">here</a>.</p>
<h4 id="heading-msgclassifypy">msgClassify.py</h4>
<p>The code is relatively straightforward, and includes a convenience function to split a document into sentences.</p>
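<p>The linked file is the real implementation; what follows is only a rough sketch of the idea, with function names and details of my own choosing:</p>
<pre><code>import json
import re

def split_sentences(document):
    # Convenience function to split a document into sentences.
    return [s.strip() for s in re.split(r'[.!?]+', document) if s.strip()]

def term_matches(term, word):
    # A term like "bruis" is a stem: it matches "bruise", "bruising", etc.
    # A term containing * is a pattern: "*dpm" matches digits followed by "pm".
    if '*' in term:
        return re.fullmatch(term.replace('*d', r'\d+'), word) is not None
    return word.startswith(term)

def classify(document, topics):
    # Return each topic found, with the sentence(s) associated with it.
    result = {}
    for sentence in split_sentences(document):
        words = re.findall(r'\w+', sentence.lower())
        for topic, terms in topics.items():
            if any(term_matches(t, w) for t in terms for w in words):
                result.setdefault(topic, []).append(sentence)
    return result

with open('topics.json') as f:
    topics = json.load(f)

print(json.dumps(classify('Thanks! I noticed a bruise at 3pm.', topics), indent=2))
</code></pre>
<p>With the illustrative definition above, this would associate the first sentence with the <strong>thanks</strong> topic and the second with both <strong>medical terms</strong> and <strong>time</strong>.</p>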
<h4 id="heading-predictive-modeling">Predictive Modeling</h4>
<p>The aggregate classification for <strong>a set of documents associated with an outcome</strong> can be used to build a predictive model.</p>
<p>In this use-case, we wanted to see if we could predict hospitalizations based on the messages between patient and nurse prior to the incident. We compared messages for patients who did and did not incur hospitalizations.</p>
<p>You could use a similar technique for other types of messaging associated with some binary outcome.</p>
<p>This process takes a number of steps:</p>
<ul>
<li>A set of messages is classified, and each topic receives a count for this set. The result is <strong>a fixed list of topics with a % allocation from the messages.</strong></li>
<li>The topic allocation is then <strong>assigned a binary value</strong>, in our case a 0 if there was no hospitalization and a 1 if there was a hospitalization</li>
<li>A <strong>logistic regression</strong> algorithm is used to build a predictive model</li>
<li>The model is used to <strong>predict the outcome from new input</strong></li>
</ul>
<p>Let’s look at our input data. Your data should have a similar structure. We’re using a pandas <a target="_blank" href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html">DataFrame</a>.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*SRMLWhU-cEgK_ludaN9gMQ.png" alt="Image" width="800" height="147" loading="lazy"></p>
<p><strong>“incident”</strong> is the binary outcome, and it needs to be the first column in the input data.</p>
<p>Each subsequent column is a topic and the % of classification from the set of messages belonging to the patient.</p>
<p>In row 0, we see that roughly a quarter of the messages for this patient are about the <strong>thanks</strong> topic, and none are about <strong>medical terms</strong> or <strong>money</strong>. Thus each row is a binary outcome and a <strong>messaging classification profile</strong> across topics.</p>
<p>Your input data will have different topics, different column labels, and a different binary condition, but otherwise will be a similar structure.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*SE1UtYrUBvtca6qmwN3P2g.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Let’s use <a target="_blank" href="http://scikit-learn.org/stable/">scikit-learn</a> to build a Logistic Regression and test our model.</p>
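<p>A minimal sketch of this step, assuming the DataFrame shown above is named <code>df</code>:</p>
<pre><code>from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# The first column ("incident") is the binary outcome; the rest
# are the per-topic classification percentages.
y = df.iloc[:, 0]
X = df.iloc[:, 1:]

# Shuffle, then hold out half of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

model = LogisticRegression()
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
</code></pre>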
<p>Here’s our output:</p>
<pre><code>             precision    recall  f1-score   support

          0       0.66      0.69      0.67       191
          1       0.69      0.67      0.68       202

avg / total       0.68      0.68      0.68       393
</code></pre><p>The <a target="_blank" href="https://en.wikipedia.org/wiki/Precision_and_recall">precision and recall</a> of this model against the test data are in the high-60’s — <strong>slightly better than a guess</strong>, and not accurate enough to be of much value, unfortunately.</p>
<p>In this example, the amount of data was relatively small (a thousand patients, ~30 messages sampled per patient). Remember that only half of the data can be used for training, while the other half (after shuffling) is used to test.</p>
<p>By including structured data such as age, gender, condition, past incidents, and so on, we could strengthen our model and produce a stronger signal. Having more data would also be helpful as the number of training data columns is fairly large.</p>
<p>Try this with your structured/unstructured data and see if you can get a highly predictive model. You may not get the kind of precision that leads to automated actions, but a “risk” probability could be used as a filter or sorting function or as an early warning sign for human experts.</p>
<p>The “Bag of Words” approach is well suited to certain kinds of text classification work, particularly where the language is not nuanced.</p>
<p><strong>Enjoy.</strong></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
