neural networks - freeCodeCamp.org

How to Enhance Images with Neural Networks

Manish Shivanandhan — Thu, 04 Sep 2025 00:44:55 +0000

Artificial intelligence is changing how we work with images. What once took hours in Photoshop can now happen in seconds with AI-powered tools. You can take a blurry picture, enlarge it without losing sharpness, fix the lighting, remove unwanted noise, or even bring color to a black-and-white photo, all with a single click.

The magic you see in these tools comes from trained AI models that have learned how images should look and reconstruct them accordingly. These models have studied millions of examples to learn patterns, textures, and details, so they can “predict” what’s missing and fill it in naturally.

For developers, photographers, and content creators, knowing the basics of these algorithms can help you pick the right tools for your workflow. Even if you never plan to code an AI model yourself, this knowledge will help you make better choices for image processing, web apps, or creative projects.

Let’s look at five of the most important algorithms used in AI image enhancement today. Along the way, you’ll see real-world tools that use these algorithms and how you can try them yourself.

Image Colorization
GAN-Based Image Enhancement
Noise Reduction (Denoising Autoencoders)
Image Upscaling using Super-Resolution
Artifact Removal
Why These Algorithms Matter to Developers
Conclusion

Image Colorization

Automatic image colorization might be the most visually dramatic AI enhancement of all. It takes a black-and-white image and predicts the colors that should be there, often producing results that look like the photo was taken in full color.

The AI behind this uses convolutional neural networks (CNNs) trained on huge datasets of color images. The model sees both the grayscale and the color versions during training, so it learns how certain objects typically appear. For example, it might learn that grass is usually green, the sky is often blue, and human skin falls within a certain range of tones.
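To picture what a training pair looks like, here is a minimal sketch that derives the grayscale input from a color target. It assumes RGB values scaled to [0, 1] and uses the common ITU-R BT.601 luma weights; the helper name `to_grayscale` and the tiny 2x2 image are illustrative only, not taken from any specific colorization tool.

```python
import numpy as np

def to_grayscale(rgb):
    """Collapse an (H, W, 3) RGB array in [0, 1] to an (H, W) luminance array."""
    weights = np.array([0.299, 0.587, 0.114])  # ITU-R BT.601 luma weights
    return rgb @ weights

# a tiny 2x2 "color photo": one red pixel, one white pixel, the rest black
color = np.zeros((2, 2, 3))
color[0, 0] = [1.0, 0.0, 0.0]
color[1, 1] = [1.0, 1.0, 1.0]

# (gray, color) is one training pair: the model sees `gray` and learns to predict `color`
gray = to_grayscale(color)
print(gray.shape)
```

A colorization model is trained on many such pairs so that it can later predict plausible colors when only the grayscale version is available.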

One of the most famous models is DeOldify, which combines CNNs with GANs. The GAN setup helps refine the results, making colors more natural and avoiding strange or overly bright tones.

Colorization has practical uses beyond restoring old family photos. It’s used in film restoration, historical projects, digital storytelling, and even concept art.

See Image Colorization in action.

GAN-Based Image Enhancement

GANs, or Generative Adversarial Networks, are one of the most powerful AI techniques in image enhancement. They consist of two neural networks: the generator, which tries to create realistic-looking images, and the discriminator, which evaluates them. Over many iterations, the generator becomes extremely good at producing images that pass as real.

In image retouching, GANs can handle many tasks at once, like fixing lighting, improving sharpness, enhancing textures, and even subtly changing elements to make the picture more appealing. Because GANs learn from real-world images, the results often feel more natural than traditional editing filters.

GAN-based retouching is used in professional portrait editing, e-commerce product photos, real estate listings, and even game asset creation. It’s also behind many “one-click enhance” buttons you see in modern apps.

See a GAN powered photo enhancer here.

Noise Reduction (Denoising Autoencoders)

Noise in images looks like random specks of color or brightness that shouldn’t be there. It often happens in low-light photos or in images taken with high ISO settings. Noise makes photos look grainy and less professional.

Traditional noise removal methods simply blur the image to hide the noise, but this also destroys fine details. AI noise reduction works differently.

Denoising Autoencoders, one of the most common approaches, learn from pairs of images—one clean and one noisy. The AI studies how noise distorts details, then learns to reverse the process.

When you pass a noisy photo through a denoising autoencoder, it removes the noise while preserving edges, textures, and important small details.
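As a rough illustration of the clean/noisy pairing, the sketch below builds one training pair by adding synthetic noise to a clean image. It assumes Gaussian noise and a NumPy array standing in for a photo; the function name `make_noisy_pair` and the noise level are hypothetical choices for this example.

```python
import numpy as np

def make_noisy_pair(clean, sigma=0.1, seed=0):
    """Return (noisy, clean), where noisy is clean plus Gaussian noise clipped to [0, 1]."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=sigma, size=clean.shape)
    noisy = np.clip(clean + noise, 0.0, 1.0)
    return noisy, clean

# a flat gray 8x8 "image" with pixel values in [0, 1]
clean = np.full((8, 8), 0.5)
noisy, target = make_noisy_pair(clean)

# a denoising autoencoder would be trained to map `noisy` back to `target`
mse = float(np.mean((noisy - target) ** 2))
print(f"MSE between noisy input and clean target: {mse:.4f}")
```

During training, the model's reconstruction error against `target` is what gets minimized, which is how it learns to undo the noise instead of just blurring it.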

Noise reduction isn’t just for photography. It’s also used in document scanning to make text easier to read, in medical imaging to clarify scans, and in cleaning up screenshots or UI mockups for presentations.

See Noise Reduction in action here.

Image Upscaling using Super-Resolution

Super-resolution is the process of increasing the resolution of an image to make it sharper and larger without simply stretching the pixels.

In the past, enlarging a small image just made it blurry. AI super-resolution works differently. It studies the image, detects patterns, and then generates new pixels that match what would have been there in a higher-quality original.
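For contrast, here is a minimal sketch of the "just stretch the pixels" baseline, using nearest-neighbor enlargement in NumPy. The helper `upscale_nearest` is a name made up for this example. Notice that it only copies existing values and creates no new detail, which is the gap super-resolution models try to fill.

```python
import numpy as np

def upscale_nearest(image, factor):
    """Enlarge a 2-D array by repeating each pixel `factor` times along both axes."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

small = np.array([[0, 255],
                  [255, 0]])
large = upscale_nearest(small, 2)

# every output pixel is a copy of an input pixel, so no new information appears
print(large)
```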

One of the first big breakthroughs was SRCNN (Super-Resolution Convolutional Neural Network). SRCNN works by breaking the image into patches, analyzing them, and then predicting what higher-resolution patches should look like. This early approach was effective but sometimes produced overly smooth images.

Then came ESRGAN (Enhanced Super-Resolution Generative Adversarial Network), which took things further. ESRGAN uses a GAN architecture: a generator creates enhanced images, while a discriminator judges how real they look. Through this back-and-forth training, the generator learns to produce fine textures like hair strands, fabric weaves, or building details that look realistic to the human eye.

Super-resolution is widely used in e-commerce (for clearer product photos), printing (turning web images into high-resolution posters), and web apps (making user-uploaded images look professional).

See Super resolution powered image upscaler in action.

Artifact Removal

When a JPEG image is heavily compressed, it develops blocky patches, fuzzy edges, and strange halos around lines. These are called compression artifacts, and they appear because JPEG reduces file size by removing fine detail. Traditional fixes blur the image to hide these defects, but that also softens important edges and textures.

FBCNN, or Flexible Blind Convolutional Neural Network, takes a smarter approach. Instead of needing to know the exact compression level beforehand, FBCNN is trained to handle a wide range of artifact severities without extra input. This is what makes it “blind”: it doesn’t require metadata about how the JPEG was compressed. It can adapt its restoration process on the fly.

FBCNN works in two main steps. First, it extracts features from the image, analyzing patterns in edges, textures, and flat areas to identify where artifacts are most likely. Then, it applies a learned mapping to reconstruct what those regions should look like without the damage.

Because it can estimate the compression quality itself, FBCNN avoids the common problem of over-smoothing lightly compressed images or under-restoring heavily compressed ones.

This flexibility makes FBCNN useful in many scenarios: cleaning up low-quality images from social media, restoring graphics and text in screenshots, or preparing old compressed web images for printing. Modern AI tools often integrate FBCNN-style processing as a first step before applying super-resolution or general enhancement.

FBCNN’s ability to adapt without manual tuning makes it one of the most practical and developer-friendly models for real-world JPEG restoration today.

See artifact removal in action.

Why These Algorithms Matter to Developers

Even if you have never trained your own AI model, understanding these algorithms gives you a better sense of what’s possible and how to apply it. Many of the tools mentioned here offer APIs, which means developers can build them into their own apps and websites.

If you run a social platform, you can automatically enhance user-uploaded images before they appear in feeds. If you build e-commerce platforms, you can clean and upscale product images for better sales conversions. If you work in media archiving, you can restore and preserve images without spending hours on manual edits.

The real value comes from knowing which algorithm is right for the problem you’re solving: super-resolution for enlarging, denoising for cleaning, colorization for restoration, artifact removal for fixing compression, and GAN retouching for overall beautification.

Conclusion

AI image enhancement has moved from research labs to everyday tools, making it possible for anyone to transform low-quality images into something sharp, vibrant, and professional. The algorithms behind these tools like super-resolution, denoising, colorization, artifact removal, and GAN retouching are the building blocks of modern visual AI.

Whether you’re a developer looking to integrate image processing into your app or a creator who wants to improve your visuals, knowing how these algorithms work will help you get the most out of AI. This is only the beginning and future models will be even more precise, faster, and capable of things we haven’t yet imagined. Developers who understand these foundations will be ready to make the most of the next wave of AI-powered creativity.

Hope you enjoyed this article. Sign up for my free AI newsletter TuringTalks.ai for more hands-on tutorials on AI. You can also visit my website.

Learn to Build a Multilayer Perceptron with Real-Life Examples and Python Code

Kuriko — Fri, 30 May 2025 18:21:29 +0000

The perceptron is a fundamental concept in deep learning, with many algorithms stemming from its original design.

In this tutorial, I’ll show you how to build both single layer and multi-layer perceptrons (MLPs) across three frameworks:

Custom classifier
Scikit-learn’s MLPClassifier
Keras Sequential classifier using SGD and Adam optimizers.

This will help you learn about their various use cases and how they work.

What is a Perceptron?
How to Build a Single-Layered Classifier
What is a Multi-Layer Perceptron?
How to Build Multi-Layered Perceptrons
Understanding Optimizers
How to Build an MLP Classifier with SGD Optimizer
How to Build an MLP Classifier with Adam Optimizer
Final Results: Generalization
Conclusion

Prerequisites

Mathematics (Calculus, Linear Algebra, Statistics)
Coding in Python
Basic understanding of Machine Learning concepts

What is a Perceptron?

A perceptron is one of the simplest types of artificial neurons used in Machine Learning. It’s a building block of artificial neural networks that learns from labeled data to perform classification and pattern recognition tasks, typically on linearly separable data.

A single-layer perceptron consists of a single layer of artificial neurons, called perceptrons.

But when you connect many perceptrons together in layers, you have a multi-layer perceptron (MLP). This lets the network learn more complex patterns by combining simple decisions from each perceptron. And this makes MLPs powerful tools for tasks like image recognition and natural language processing.

The perceptron consists of four main parts:

Input layer: Takes the initial numerical values into the system for further processing.
Weights and bias: Scale each input value and shift the resulting weighted sum.
Activation function: Determines whether the neuron should fire based on the threshold value.
Output layer: Produces classification result.

It performs a weighted sum of inputs, adds a bias, and passes the result through an activation function – just like logistic regression. It’s sort of like a little decision-maker that says “yes” or “no” based on the information it gets.

So for instance, when we use a sigmoid activation, its output is a probability between 0 and 1, mimicking the behavior of logistic regression.
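Here is a minimal sketch of that computation in plain Python: a weighted sum, a bias, and a sigmoid. The input values, weights, and bias are made up purely for illustration, and `perceptron_forward` is a hypothetical helper name.

```python
import math

def perceptron_forward(x, w, b):
    """Weighted sum of inputs plus bias, passed through a sigmoid activation."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical inputs, weights, and bias chosen only for illustration
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0

# z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, so the sigmoid returns 0.5
p = perceptron_forward(x, w, b)
print(f"probability of class 1: {p:.2f}")
```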

Applications of Perceptrons

Perceptrons are applied to tasks such as:

Image classification: Perceptrons classify images containing specific objects. They achieve this by performing binary classification tasks.
Linear regression: Perceptrons can predict continuous outputs based on input features. This makes them useful for solving linear regression problems.

How the Activation Function Works

For a single perceptron used for binary classification, the most common activation function is the step function (also known as the threshold function):

$$\phi(z) = \begin{cases} 1 &\text{if } z \geq \theta \\ \\ 0 &\text{if } z < \theta \end{cases}$$

where:

ϕ(z): the output of the activation function.
z: the weighted sum of the inputs plus the bias:

$$z = \sum_{i=1}^m w_i x_i + b$$

(x_i: input values, w_i: weight associated with each input, b: bias term)

θ is the threshold. Often, the threshold θ is set to zero, and the bias (b) effectively controls the activation threshold.

In that case, the formula becomes:

$$\phi(z) = \begin{cases} 1 &\text{if } z \geq 0 \\ \\ 0 &\text{if } z < 0 \end{cases}$$

When the step function ϕ(z) outputs one, it signifies that the input belongs to the class labeled one.

This occurs when the weighted sum is greater than or equal to zero (z ≥ 0), leading the perceptron to predict that the input belongs to the positive class.

While the step function is conceptually the original activation for a perceptron, its discontinuity at zero causes computational challenges.

In modern implementations, we can use other activation functions like the sigmoid function:

$$\sigma (z) = \frac {1} {1 + e^{-z}}$$

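Unlike the step function, the sigmoid function outputs a continuous value between zero and one that depends on the weighted sum (z). You can turn it into a zero-or-one label by applying a cutoff, commonly 0.5.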
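To compare the two activations side by side, the short sketch below evaluates both on the same weighted sums. The 0.5 cutoff used to turn sigmoid outputs into labels is a common convention and an assumption of this example.

```python
import math

def step(z, theta=0.0):
    """Step activation from the formula above: 1 if z >= theta, else 0."""
    return 1 if z >= theta else 0

def sigmoid(z):
    """Smooth activation that maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

zs = [-4.0, -0.5, 0.0, 0.5, 4.0]
step_out = [step(z) for z in zs]
sigmoid_out = [sigmoid(z) for z in zs]

# thresholding the sigmoid at 0.5 recovers the same labels as the step function
labels = [1 if s >= 0.5 else 0 for s in sigmoid_out]

for z, s_out, sig in zip(zs, step_out, sigmoid_out):
    print(f"z={z:+.1f}  step={s_out}  sigmoid={sig:.3f}")
```

The labels match, but the sigmoid also tells you how confident the neuron is, and it has a usable gradient everywhere.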

How the Loss Function Works

The loss function is a crucial concept in machine learning that quantifies the error or discrepancy between the model's predictions and the actual target values.

Its purpose is to penalize the model for making incorrect or inaccurate predictions, which guides the learning algorithm (for example, gradient descent) to adjust the model's parameters in a way that minimizes this error and improves performance.

In a binary classification task, the model may adopt the hinge loss function to penalize misclassifications by incurring an additional cost for incorrect predictions:

$$L(y, h(x)) = \max(0, 1 - y \cdot h(x))$$

(h(x): the model's raw prediction score, y: true label, encoded as −1 or +1 for this loss)
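A quick worked example may help here. This sketch assumes the standard hinge loss convention, where the true label is −1 or +1 and the prediction is a raw score rather than a 0/1 label. The three example pairs are invented for illustration.

```python
def hinge_loss(y, score):
    """Hinge loss for one sample; y is -1 or +1, score is the raw model output."""
    return max(0.0, 1.0 - y * score)

# (true label, raw score) pairs, chosen for illustration
examples = [
    (1, 2.0),    # correct and confident: no penalty
    (1, 0.3),    # correct but inside the margin: small penalty
    (-1, 0.5),   # wrong side of the boundary: larger penalty
]
losses = [hinge_loss(y, s) for y, s in examples]

for (y, s), loss in zip(examples, losses):
    print(f"y={y:+d}  score={s:+.1f}  loss={loss:.2f}")
```

Notice that a correct prediction can still be penalized if its score is not confident enough, which pushes the decision boundary away from the training points.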

How to Build a Single-Layered Classifier

Now, let’s build a simple single-layer perceptron for binary classification.

1. Custom Classifier

Initialize the classifier

We’ll first initialize the classifier with its weights, bias, number of epochs (n_iterations), and learning rate (learning_rate).

def __init__(self, learning_rate=0.01, n_iterations=1000):
    self.learning_rate = learning_rate
    self.n_iterations = n_iterations
    self.weights = None
    self.bias = None

Define the activation function

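Use a step function that returns one if the input (x) is greater than the threshold and zero otherwise. By default, the threshold is set to zero. Note that the code uses a strict inequality, so an input exactly equal to the threshold maps to zero, unlike the ≥ in the formula above.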

def _step_function(self, x, threshold: int = 0):
    return np.where(x > threshold, 1, 0)

Train the model

Now it’s time to start training. The learning process involves iteratively updating the perceptron’s internal parameters: weights and bias.

This process is controlled by a specified number of training epochs defined by n_iterations.

In each epoch, the model processes the entire input dataset (X) and adjusts its weights and bias based on the difference between its predictions and the true labels (y), guided by a predefined learning_rate.

def fit(self, X, y):
    n_samples, n_features = X.shape

    self.weights = np.zeros(n_features)
    self.bias = 0

    for _ in range(self.n_iterations):
        for i in range(n_samples):
            # compute weighted sum (z)
            z = np.dot(X[i], self.weights) + self.bias

            # apply the activation function
            y_pred = self._step_function(z)

            # update weights and bias
            self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
            self.bias += self.learning_rate * (y[i] - y_pred)

How the weights work in the iteration loop

The weights in a perceptron define the orientation (slope) of the decision boundary that separates the classes.

Its iterative update in the for loop aims to reduce classification errors such that:

$$\begin{align*} w_j &:= w_j + \Delta w_j \\ &:= w_j + \eta (y_i - \hat y_i)x_{ij} \\ &= \begin{cases} w_j &\text{(a) } y_i - \hat y_i = 0\\ w_j + \eta x_{ij} &\text{(b) } y_i - \hat y_i = 1 \\ w_j - \eta x_{ij} &\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$

(w_j: j-th weight, η: learning rate, (y_i − ŷ_i): error, x_ij: j-th feature of the i-th sample)

This means that:

When the prediction is correct, the error is zero, so the weight is unchanged.
When the prediction is too low (y_i = 1 and ŷ_i = 0), each weight moves in the direction of its input (w_j increases by η·x_ij), which raises the weighted sum for that sample.
When the prediction is too high (y_i = 0 and ŷ_i = 1), each weight moves against its input (w_j decreases by η·x_ij), which pulls the weighted sum lower.

How the bias terms work in the iteration loop

The bias determines the decision boundary’s intercept (position from the origin).

Similar to weights, we adjust the bias terms in each epoch to position the decision boundary:

$$\begin {align*} b &:= b + \Delta b \\ & := b + \eta (y_i - \hat y_i) \\ &= \begin{cases} b &\text{(a) } y_i - \hat y_i = 0\\ b + \eta &\text{(b) } y_i - \hat y_i = 1 \\ b - \eta &\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$

This repeated adjustment aims to optimize the model’s ability to correctly classify the training data.
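To make the update rules concrete, here is a small sketch of a single update step on one sample. It mirrors the weight and bias formulas above and uses the same strict `> 0` step as the classifier code. The sample values and the helper name `perceptron_update` are made up for illustration.

```python
def perceptron_update(w, b, x, y_true, lr):
    """Apply one perceptron update for a single sample and return the new parameters."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    y_pred = 1 if z > 0 else 0          # same strict inequality as the classifier code
    error = y_true - y_pred             # 0, +1, or -1
    new_w = [w_j + lr * error * x_j for w_j, x_j in zip(w, x)]
    new_b = b + lr * error
    return new_w, new_b, error

# start from zero weights, as fit() does
w, b = [0.0, 0.0], 0.0

# the sample is labeled 1, but z = 0 gives a prediction of 0, so this is case (b)
new_w, new_b, error = perceptron_update(w, b, x=[2.0, -1.0], y_true=1, lr=0.1)
print(new_w, new_b, error)
```

Because the prediction was too low, each weight moved by η times its input and the bias moved up by η, so the weighted sum for this sample is now higher.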

Make a prediction

Lastly, we add a function to generate an outcome value (zero or one) for new, unseen data (X):

def predict(self, X):
    linear_output = np.dot(X, self.weights) + self.bias
    predictions = self._step_function(linear_output)
    return predictions

The entire classifier looks like this:

import numpy as np

class Perceptron:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None

    def _step_function(self, x, threshold: int = 0):
        return np.where(x > threshold, 1, 0)

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iterations):
            for i in range(n_samples):
                linear_output = np.dot(X[i], self.weights) + self.bias
                y_pred = self._step_function(linear_output)
                self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
                self.bias += self.learning_rate * (y[i] - y_pred)
        return self

    def predict(self, X):
        linear_output = np.dot(X, self.weights) + self.bias
        y_pred = self._step_function(linear_output)
        return y_pred

Simulate with synthetic datasets

First, we generate a synthetic, linearly separable dataset using make_blobs, then train the classifier we created so it learns a decision boundary.

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import numpy as np

# create a mock dataset
X, y = make_blobs(n_features=2, centers=2, n_samples=1000, random_state=12)

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train the model
perceptron = Perceptron(learning_rate=0.1, n_iterations=1000).fit(X_train, y_train)

# make a prediction
y_pred_train = perceptron.predict(X_train)
y_pred_test = perceptron.predict(X_test)

# evaluate the results
acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(f"Accuracy (Train): {acc_train:.3} \nAccuracy (Test): {acc_test:.3}")

Results

The classifier generated a clear, highly accurate linear decision boundary.

Accuracy (Train): 0.981
Accuracy (Test): 0.975

2. Leverage Scikit-learn’s MLPClassifier

For convenience, we’ll use scikit-learn’s built-in classifier (MLPClassifier) to build a similar, yet more robust classifier:

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(), # intentionally set empty to create a single layer perceptron
    activation='logistic', # choosing a sigmoid function as an activation function
    solver='sgd', # choosing SGD optimizer
    max_iter=1000,
    random_state=42, 
    learning_rate='constant', 
    learning_rate_init=0.1
).fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(f"MLPClassifier\nAccuracy (Train): {acc_train:.3} \nAccuracy (Test): {acc_test:.3}")

Results

The MLPClassifier generated a clear linear decision boundary with slightly better accuracy scores.

Accuracy (Train): 0.985
Accuracy (Test): 0.995

Limitations of Single-Layer Perceptrons

Now, let’s talk about the key differences between the MLPClassifier and our custom single-layer perceptron.

Unlike more general neural networks, single-layer perceptrons use a step function as their activation.

Due to its discontinuity at x=0, the step function is not differentiable over its entire domain (−∞ to ∞).

This fundamental property precludes the use of gradient-based optimization algorithms such as SGD or Adam, as these methods depend on computing gradients, that is, the partial derivatives of the cost function with respect to each parameter.

In contrast, most neural networks employ differentiable activation functions (for example, sigmoid, ReLU) and loss functions (for example, MSE, Cross-Entropy) for effective optimization.

Other challenges of a single-layer perceptron include:

Limited to linear separability: Because they can only learn linear decision boundaries, they are unable to handle complex, non-linearly separable data.
Lack of depth: Being single-layered, they cannot learn complex hierarchical representations.
Limited optimizer options: As mentioned, their non-differentiable activation function precludes the use of major gradient-based optimizers.
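The linear separability limit is easy to see on two tiny logic-gate datasets. The sketch below is a stripped-down, pure-Python version of the perceptron rule (not the class above), with an integer learning rate to keep the arithmetic exact. It learns AND, which is linearly separable, but cannot fit XOR, which is not.

```python
def train_perceptron(samples, labels, lr=1, epochs=100):
    """Train a single perceptron with the step-function update rule."""
    w = [0] * len(samples[0])
    b = 0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            error = y - (1 if z > 0 else 0)
            w = [w_j + lr * error * x_j for w_j, x_j in zip(w, x)]
            b += lr * error
    return w, b

def accuracy(w, b, samples, labels):
    """Fraction of samples the trained perceptron classifies correctly."""
    correct = 0
    for x, y in zip(samples, labels):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        correct += int((1 if z > 0 else 0) == y)
    return correct / len(samples)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]   # linearly separable
xor_labels = [0, 1, 1, 0]   # not linearly separable

and_acc = accuracy(*train_perceptron(inputs, and_labels), inputs, and_labels)
xor_acc = accuracy(*train_perceptron(inputs, xor_labels), inputs, xor_labels)
print(f"AND accuracy: {and_acc:.2f}, XOR accuracy: {xor_acc:.2f}")
```

No straight line can separate the XOR classes, so a single perceptron can get at most three of the four points right no matter how long it trains.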

So, in the next section, you’ll learn about multi-layered perceptrons to overcome the disadvantages.

What is a Multi-Layer Perceptron?

An MLP is a class of feedforward artificial neural network that consists of at least three layers of nodes:

an input layer,
one or more hidden layers, and
an output layer.

Except for the input nodes, each node is a neuron that uses a nonlinear activation function.

MLPs are widely used for classification problems as well as regression:

Classification tasks: MLPs are widely used for classification problems, such as handwriting recognition and speech recognition.
Regression analysis: They are also applied in regression problems where the relationship between input and output is complex.

How to Build Multi-Layered Perceptrons

Let’s handle a binary classification task using a standard MLP architecture.

Outline of the Project

Objective

Detect fraudulent transactions

Evaluation Metrics

Considering the cost of misclassification, we’ll prioritize improving Recall and Precision scores
Then check the accuracy of classification with Accuracy Score ((TP + TN) / (TP + TN + FP + FN))

Cost of Misclassification (from high to low):

False Negative (FN): The model incorrectly identifies a fraudulent transaction as legitimate (Missing actual fraud)
False Positive (FP): The model incorrectly identifies a legitimate transaction as fraudulent (Blocking legitimate customers.)
True Positive (TP): The model correctly identifies a fraudulent transaction as fraud.
True Negative (TN): The model correctly identifies a non-fraudulent transaction as non-fraud.
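As a quick reference, here is how the three metrics fall out of these four counts. The counts below are hypothetical and are not results from this project. Precision and recall use their standard definitions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of everything flagged as fraud, how much was fraud
    recall = tp / (tp + fn)      # of all actual fraud, how much was caught
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

# hypothetical counts for a 400-sample balanced test set
precision, recall, accuracy = classification_metrics(tp=180, tn=190, fp=10, fn=20)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, Accuracy: {accuracy:.3f}")
```

Since false negatives are the costliest error here, recall is the number to watch most closely.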

Planning an MLP Architecture

In the network, 19 input features feed into the first hidden layer’s 30 neurons, which use a ReLU activation function.

Then, their outputs are passed to the second layer, culminating in sigmoid values as the final output.

During the optimization process, we’ll let the optimizer (SGD and Adam) perform forward and backward passes to adjust parameters.
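To get a feel for the forward pass through this architecture, here is a minimal NumPy sketch. It assumes a single sigmoid output neuron after the 30-unit ReLU layer and uses small random weights. The real model is built with a framework, so treat the names `W1`, `b1`, `W2`, `b2`, and `forward` as illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 19, 30

# randomly initialized parameters (an optimizer would adjust these during training)
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(X):
    """Forward pass: 19 inputs -> 30 ReLU units -> 1 sigmoid output."""
    hidden = np.maximum(0.0, X @ W1 + b1)     # ReLU activation
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))      # sigmoid gives a fraud probability

# a batch of 4 made-up transactions with 19 features each
X = rng.normal(size=(4, n_features))
probs = forward(X)
print(probs.shape)
```

The backward pass then computes how much each weight contributed to the loss and nudges it accordingly.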

Image: Standard MLP Architecture for Binary Classification Tasks (Created by Kuriko Iwai using image source)

Especially in deeper networks, ReLU is advantageous in preventing vanishing gradient problems, where gradients become extremely small as they are backpropagated from the output layers.
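A back-of-the-envelope sketch shows why. Backpropagation multiplies activation derivatives layer by layer. The sigmoid's derivative never exceeds 0.25, while ReLU's derivative is exactly 1 for positive inputs. This simplified example ignores the weight matrices and assumes a depth of 10 layers purely for illustration.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid; it peaks at 0.25 when z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

depth = 10  # hypothetical number of layers

# best case for sigmoid (z = 0 everywhere) versus an active ReLU path
sigmoid_chain = sigmoid_grad(0.0) ** depth
relu_chain = relu_grad(1.0) ** depth

print(f"sigmoid chain: {sigmoid_chain:.2e}, ReLU chain: {relu_chain:.1f}")
```

Even in the sigmoid's best case, the gradient signal shrinks by roughly a million times over ten layers, while the active ReLU path passes it through unchanged.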

Learn More: A Comprehensive Guide on Neural Network in Deep Learning

Preprocessing the Datasets

First, we consolidate three datasets – transaction, customer, and credit card – into a single DataFrame, independently sanitizing numerical and categorical data:

import json
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# download the raw data to local
import kagglehub
path = kagglehub.dataset_download("computingvictor/transactions-fraud-datasets")
dir = f'{path}/gd_card_flaud_demo'

def sanitize_df(amount_str):
    """Removes '$' and converts the string to a float."""
    if isinstance(amount_str, str):
        return float(amount_str.replace('$', ''))
    return amount_str

# load transaction data
trx_df = pd.read_csv(f'{dir}/transactions_data.csv')

# sanitize the dataset (drop unnecessary columns and error transactions, convert string to int/float dtype)
trx_df = trx_df[trx_df['errors'].isna()]
trx_df = trx_df.drop(columns=['merchant_city','merchant_state', 'date', 'mcc', 'errors'], axis='columns')
trx_df['amount'] = trx_df['amount'].apply(sanitize_df)

# merge the dataframe with fraud transaction flag.
with open(f'{dir}/train_fraud_labels.json', 'r') as fp:
    fraud_labels_json = json.load(fp=fp)

fraud_labels_dict = fraud_labels_json.get('target', {})
fraud_labels_series = pd.Series(fraud_labels_dict, name='is_fraud')
fraud_labels_series.index = fraud_labels_series.index.astype(int) # convert the datatype from string to integer
merged_df = pd.merge(trx_df, fraud_labels_series, left_on='id', right_index=True, how='left')
merged_df.fillna({'is_fraud': 'No'}, inplace=True)
merged_df['is_fraud'] = merged_df['is_fraud'].map({'Yes': 1, 'No': 0})

# load card data
card_df = pd.read_csv(f'{dir}/cards_data.csv')
card_df = card_df.drop(columns=['client_id', 'acct_open_date', 'card_number', 'expires', 'cvv'], axis='columns')
card_df['credit_limit'] = card_df['credit_limit'].apply(sanitize_df)

# merge transaction and card data
merged_df = pd.merge(left=merged_df, right=card_df, left_on='card_id', right_on='id', how='inner')
merged_df = merged_df.drop(columns=['id_y', 'card_id'], axis='columns')

# converts categorical variables into a new binary column (0 or 1)
categorical_cols = merged_df.select_dtypes(include=['object']).columns
df = merged_df.copy()
df = pd.get_dummies(df, columns=categorical_cols, dummy_na=False, dtype=float) 
df = df.dropna().drop(['client_id', 'id_x'], axis=1)
print('\nDataFrame: \n', df.head(n=3))

DataFrame:

Our DataFrame shows an extremely skewed data distribution with:

Fraud samples: 1,191
Non-fraud samples: 11,477,397

For classification tasks, it's crucial to be aware of sample size imbalances and employ appropriate strategies to mitigate their negative impact on classification model performance, especially regarding the minority class.

For our data, we’ll:

split the 1,191 fraud samples into training, validation, and test sets,
add an equal number of randomly chosen non-fraud samples from the DataFrame, and
adjust split balances later if generalization challenges arise.

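# split the DataFrame by class (df comes from the preprocessing step above)
df_fraud = df[df['is_fraud'] == 1]
df_non_fraud = df[df['is_fraud'] == 0]

# define the desired size of the fraud samples for the validation and test sets
val_size_per_class = 200
test_size_per_class = 200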

# create test sets
X_test_fraud = df_fraud.sample(n=test_size_per_class, random_state=42)
X_test_non_fraud = df_non_fraud.sample(n=test_size_per_class, random_state=42)

# combine to form the balanced test set
X_test = pd.concat([X_test_fraud, X_test_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_test = X_test['is_fraud']
X_test = X_test.drop('is_fraud', axis=1)

# remove sampled rows from the original dataframes to avoid data leakage
df_fraud_remaining = df_fraud.drop(X_test_fraud.index)
df_non_fraud_remaining = df_non_fraud.drop(X_test_non_fraud.index)


# create validation sets
X_val_fraud = df_fraud_remaining.sample(n=val_size_per_class, random_state=42)
X_val_non_fraud = df_non_fraud_remaining.sample(n=val_size_per_class, random_state=42)

# combine to form the balanced validation set
X_val = pd.concat([X_val_fraud, X_val_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_val = X_val['is_fraud']
X_val = X_val.drop('is_fraud', axis=1)

# remove sampled rows from the remaining dataframes
df_fraud_train = df_fraud_remaining.drop(X_val_fraud.index)
df_non_fraud_train = df_non_fraud_remaining.drop(X_val_non_fraud.index)


# create training sets
min_train_samples_per_class = min(len(df_fraud_train), len(df_non_fraud_train))

X_train_fraud = df_fraud_train.sample(n=min_train_samples_per_class, random_state=42)
X_train_non_fraud = df_non_fraud_train.sample(n=min_train_samples_per_class, random_state=42)

X_train = pd.concat([X_train_fraud, X_train_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_train = X_train['is_fraud']
X_train = X_train.drop('is_fraud', axis=1)


print("\n--- Final Dataset Shapes and Distributions ---")
print(f"X_train shape: {X_train.shape}, y_train distribution: {np.unique(y_train, return_counts=True)}")
print(f"X_val shape: {X_val.shape}, y_val distribution: {np.unique(y_val, return_counts=True)}")
print(f"X_test shape: {X_test.shape}, y_test distribution: {np.unique(y_test, return_counts=True)}")

After the operation, we secured 1,582 training, 400 validation, and 400 test samples, each dataset maintaining a 50:50 split between fraud and non-fraud transactions.

Considering the high dimensional feature space with 19 input features, we’ll apply SMOTE to resample the training data (SMOTE should not be applied to validation or test sets to avoid data leakage):

from imblearn.over_sampling import SMOTE
from collections import Counter

train_target = 2000

smote_train = SMOTE(
  sampling_strategy={0: train_target, 1: train_target},  # increase sample size to 2,000
  random_state=12
)
X_train, y_train = smote_train.fit_resample(X_train, y_train)

print(f"\nAfter SMOTE with custom sampling_strategy (target train: {train_target}):")
print(f"X_train_oversampled shape: {X_train.shape}")
print(f"y_train_oversampled distribution: {Counter(y_train)}")

We’ve secured 4,000 training samples, maintaining a 50:50 split between fraud and non-fraud transactions.
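If you’re curious where the extra samples come from, SMOTE creates each synthetic point by interpolating between an existing sample and one of its nearest neighbors from the same class. The sketch below shows only that interpolation step, with two made-up feature vectors. The neighbor search and the rest of the algorithm are handled by imblearn, and `interpolate_sample` is a hypothetical helper.

```python
import numpy as np

def interpolate_sample(x, neighbor, rng):
    """Create one synthetic point on the line segment between x and its neighbor."""
    lam = rng.random()                 # random position in [0, 1) along the segment
    return x + lam * (neighbor - x)

rng = np.random.default_rng(12)

# two hypothetical samples from the same class, with two features each
x = np.array([1.0, 5.0])
neighbor = np.array([3.0, 1.0])

synthetic = interpolate_sample(x, neighbor, rng)
print(synthetic)
```

Because the new points sit between real ones, they add variety without simply duplicating existing rows.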

Lastly, we’ll apply column transformers to numerical and categorical features separately.

Column transformers are advantageous in handling datasets with multiple data types, as they can apply different transformations to different subsets of columns while preventing data leakage.

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),('onehot', OneHotEncoder(handle_unknown='ignore'))])

numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

Understanding Optimizers

In deep learning, an optimizer is a crucial element that fine-tunes a neural network’s parameters during training. Its primary role is to minimize the model’s loss function, enhancing performance.

Various optimization algorithms, known as optimizers, employ distinct strategies to converge towards optimal parameters for improved predictions efficiently.

In this article, we’ll use the SGD Optimizer and Adam Optimizer.

1. How a SGD (Stochastic Gradient Descent) Optimizer Works

SGD is a widely used optimization algorithm that computes the gradient (the partial derivatives of the cost function) from a small mini-batch of examples and updates the parameters at each step:

$$\begin{align*} w_j &:= w_j - \eta \frac {\partial J} {\partial w_j} \\ \\ b &:= b - \eta \frac {\partial J} {\partial b} \end{align*}$$

(w: weight, b: bias, J: cost function, η: learning rate)
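As a quick illustration of the update rule, the sketch below applies a single SGD step to one weight and one bias. The gradient values and the helper name `sgd_step` are assumptions chosen for the example; in practice the gradients come from backpropagation on a mini-batch.

```python
def sgd_step(w, b, grad_w, grad_b, lr):
    """Apply one gradient descent update: move each parameter against its gradient."""
    return w - lr * grad_w, b - lr * grad_b

w, b = 0.5, 0.0
grad_w, grad_b = 0.2, -0.1  # hypothetical gradients from one mini-batch

w_new, b_new = sgd_step(w, b, grad_w, grad_b, lr=0.1)
print(w_new, b_new)  # roughly 0.48 and 0.01
```

A positive gradient lowers the parameter and a negative gradient raises it, with the learning rate controlling the step size.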

In binary classification, the cost function (J) is the binary cross-entropy between the label y and the prediction ŷ, where ŷ is a sigmoid function (σ(z)) of z, the weighted sum of the inputs plus the bias term:

$$\begin{align*} J(y, \hat y) &= -[y \log(\hat y) + (1-y) \log(1-\hat y)] \\ \\ \hat y &= \sigma (z) = \frac {1} {1+e^{-z}} \\ \\ z &= \sum_{i=1}^m w_i x_i + b \end{align*}$$
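The three formulas can be traced by hand on a tiny example. The weights and inputs below are made-up values chosen so that z comes out to exactly zero.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y, y_hat):
    """Loss for a single example with label y (0 or 1) and prediction y_hat."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

weights = [0.4, -0.2]  # hypothetical weights
inputs = [1.0, 2.0]    # hypothetical feature values
bias = 0.0

z = sum(w * x for w, x in zip(weights, inputs)) + bias  # 0.4 - 0.4 + 0.0
y_hat = sigmoid(z)
loss_if_fraud = binary_cross_entropy(1, y_hat)

print(z)              # 0.0
print(y_hat)          # 0.5
print(loss_if_fraud)  # ln(2), about 0.693
```

A prediction of 0.5 is a coin flip, so the loss is the same (ln 2) whichever label is true; the loss shrinks toward 0 as ŷ approaches the true label.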

2. How Adam (Adaptive Moment Estimation) Optimizer Works

Adam is an optimization algorithm that computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.

Adam optimizer combines the advantages of RMSprop (using squared gradients to scale the learning rate) and Momentum (using past gradients to accelerate convergence):

$$w_{j,t+1} = w_{j,t} - \alpha \cdot \frac{\hat{m}_{t,w_j}}{\sqrt{\hat{v}_{t,w_j}} + \epsilon}$$

where:

α: The learning rate (default is 0.001)
ϵ: A small positive constant used to avoid division by zero
m̂: First moment (mean) estimate with a bias correction, leveraging Momentum:

$$\begin{align*} \hat m_t &= \frac {m_t} {1 - \beta_1^t} \\ \\ m_t &= \beta_1 m_{t-1} + (1-\beta_1) \underbrace{ \frac {\partial L} {\partial w_t}}_{\text{gradient}} \end{align*}$$

(β1: Decay rate for the first moment, typically set to β1=0.9)

v̂: Second moment (variance) estimate with a bias correction, leveraging RMSprop:

$$\begin{align*} \hat v_t &= \frac {v_t} {1 - \beta_2^t} \\ \\ v_t &=\beta_2 v_{t-1} + (1- \beta_2) (\frac {\partial L} {\partial w_t})^2 \end {align*}$$

(β2: Decay rate for the second moment, typically set to β2=0.999)

Since both m and v are initialized at zero, Adam computes the bias-corrected estimates to prevent them being biased toward zero.
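To see the bias correction at work, here is a minimal single-parameter sketch of one Adam step using the formulas above. The function name `adam_step` and the gradient value are assumptions for illustration.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at time step t (starting at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (running mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (running mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v, m_hat, v_hat

w0, grad = 1.0, 0.5  # hypothetical starting weight and gradient
w1, m1, v1, m_hat1, v_hat1 = adam_step(w0, grad, m=0.0, v=0.0, t=1)

print(m1, m_hat1)  # raw m is about 0.05, corrected m_hat is about 0.5
print(v1, v_hat1)  # raw v is about 0.00025, corrected v_hat is about 0.25
print(w1)          # about 0.999
```

At t=1 the raw moments are only a fraction of the gradient because they start at zero, while the corrected estimates recover the gradient and its square. As a result the first step has a size close to the learning rate, whatever the gradient's magnitude.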

Learn More: A Comprehensive Guide on Neural Network in Deep Learning

How to Build an MLP Classifier with SGD Optimizer

Custom Classifier

This process involves a forward pass and backpropagation, during which SGD computes optimal weights and biases using gradients:

for i in range(0, n_samples, self.batch_size):
    # SGD starts with randomly selected mini-batch for the epoch
    X_batch = X_shuffled[i : i + self.batch_size]
    y_batch = y_shuffled[i : i + self.batch_size]

    # A. forward pass
    activations, zs = self._forward_pass(X_batch)
    y_pred = activations[-1]  # final output of the network

    # B. backpropagation
    # 1) calculate gradients for the output layer
    delta = y_pred - y_batch
    dW = np.dot(activations[-2].T, delta) / X_batch.shape[0]
    db = np.sum(delta, axis=0) / X_batch.shape[0]

    # 2) update output layer parameters
    self.weights[-1] -= self.learning_rate * dW
    self.biases[-1] -= self.learning_rate * db

    # 3) iterate backward from last hidden layer to the input layer
    for l in range(len(self.weights) - 2, -1, -1):
        delta = np.dot(delta, self.weights[l+1].T) * self._relu_derivative(zs[l]) # d_activation(z)
        dW = np.dot(activations[l].T, delta) / X_batch.shape[0]
        db = np.sum(delta, axis=0) / X_batch.shape[0]

        self.weights[l] -= self.learning_rate * dW
        self.biases[l] -= self.learning_rate * db
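The line `delta = y_pred - y_batch` relies on a standard identity: with a sigmoid output and binary cross-entropy loss, the derivative of the loss with respect to z is ŷ − y. The sketch below checks this numerically with a central finite difference; the values of `z` and `y` are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_from_logit(z, y):
    """Binary cross-entropy written as a function of the pre-activation z."""
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def numerical_gradient(z, y, eps=1e-6):
    """Central finite-difference estimate of dJ/dz."""
    return (loss_from_logit(z + eps, y) - loss_from_logit(z - eps, y)) / (2 * eps)

z, y = 0.7, 1                # arbitrary test point
analytic = sigmoid(z) - y    # the shortcut used in the backpropagation code
numeric = numerical_gradient(z, y)

print(abs(analytic - numeric) < 1e-6)  # True
```

This is why the output-layer gradient needs no explicit call to `_sigmoid_derivative`: the sigmoid derivative cancels against the derivative of the log terms.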

During the forward pass, the network computes z, the weighted sum of each layer's inputs plus the bias, applies an activation function (ReLU) in each hidden layer, and then computes the predicted output (y_pred) using a sigmoid function.

def _forward_pass(self, X):
    activations = [X]
    zs = []

    # forward through hidden layers
    for i in range(len(self.weights) - 1):
        z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
        zs.append(z)
        a = self._relu(z) # using ReLU for hidden layers
        activations.append(a)

    # forward through output layer
    z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
    zs.append(z_output)

    # computes the final output using sigmoid function
    y_pred = 1 / (1 + np.exp(-np.clip(z_output, -500, 500)))
    activations.append(y_pred)
    return activations, zs
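The clip to [-500, 500] in the sigmoid guards against overflow in the exponential. The sketch below shows the idea with Python's `math` module, where an oversized exponent raises `OverflowError` (NumPy instead emits an overflow warning and returns `inf`). The helper name `clipped_sigmoid` is an assumption for illustration.

```python
import math

def clipped_sigmoid(z, limit=500.0):
    """Sigmoid with the input clipped so that exp() stays within float range."""
    z = max(-limit, min(limit, z))
    return 1.0 / (1.0 + math.exp(-z))

# Without clipping, a very large exponent cannot be represented as a float.
try:
    math.exp(1000.0)
    overflowed = False
except OverflowError:
    overflowed = True

low = clipped_sigmoid(-1000.0)   # tiny but still positive
high = clipped_sigmoid(1000.0)   # saturates at 1.0

print(overflowed)  # True
print(low > 0.0)   # True
print(high)        # 1.0
```

Clipping changes nothing in practice because the sigmoid is already saturated long before |z| reaches 500.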

So the final classifier looks like this:

from sklearn.metrics import accuracy_score

class MLP_SGD:
    def __init__(self, hidden_layer_sizes=(10,), learning_rate=0.01, n_epochs=1000, batch_size=32):
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.weights = []
        self.biases = []
        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def _sigmoid_derivative(self, x):
        s = self._sigmoid(x)
        return s * (1 - s)

    def _relu(self, x):
        return np.maximum(0, x)

    def _relu_derivative(self, x):
        return (x > 0).astype(float)

    def _initialize_parameters(self, n_features):
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [1]
        self.weights = []
        self.biases = []

        for i in range(len(layer_sizes) - 1):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+1]
            limit = np.sqrt(6 / (fan_in + fan_out))
            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((1, fan_out)))

    def _forward_pass(self, X):
        activations = [X]
        zs = []

        for i in range(len(self.weights) - 1):
            z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        return activations, zs

    def _compute_loss(self, y_true, y_pred):
        y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(-1, 1)
        X = np.asarray(X)
        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() for w in self.weights])
        self.biases_history.append([b.copy() for b in self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[-1])
        self.loss_history.append(initial_loss)

        for epoch in range(self.n_epochs):
            # shuffle datasets
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            # mini-batch loop
            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[-1]

                delta = y_pred - y_batch
                dW = np.dot(activations[-2].T, delta) / X_batch.shape[0]
                db = np.sum(delta, axis=0) / X_batch.shape[0]
                self.weights[-1] -= self.learning_rate * dW
                self.biases[-1] -= self.learning_rate * db

                for l in range(len(self.weights) - 2, -1, -1):
                    delta = np.dot(delta, self.weights[l+1].T) * self._relu_derivative(zs[l]) # d_activation(z)
                    dW = np.dot(activations[l].T, delta) / X_batch.shape[0]
                    db = np.sum(delta, axis=0) / X_batch.shape[0]

                    self.weights[l] -= self.learning_rate * dW
                    self.biases[l] -= self.learning_rate * db

            self.weights_history.append([w.copy() for w in self.weights])
            self.biases_history.append([b.copy() for b in self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[-1])
            self.loss_history.append(epoch_loss)

            if (epoch + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{self.n_epochs}, Loss: {epoch_loss:.4f}")
        return self

    def predict_proba(self, X):
        activations, _ = self._forward_pass(X)
        return activations[-1]

    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int).flatten() # for 1D output

Training / Prediction

Train the model and make a prediction using training and validation datasets:

# 1. define the model
mlp_sgd = MLP_SGD(
  hidden_layer_sizes=(30, 30, ), # 2 hidden layers with 30 neurons each
  learning_rate=0.001,           # a step size
  n_epochs=1000,                 # number of epochs
  batch_size=32                  # mini-batch size
)

# 2. train the model
mlp_sgd.fit(X_train_processed, y_train)

# 3. make a prediction with training and validation datasets
y_pred_train = mlp_sgd.predict(X_train_processed)
y_pred_val = mlp_sgd.predict(X_val_processed)

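# 4. compute evaluation metrics for the training and validation sets
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

for split, y_true, y_pred in [("Train", y_train, y_pred_train), ("Validation", y_val, y_pred_val)]:
    conf_matrix = confusion_matrix(y_true, y_pred)
    acc = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, pos_label=1)
    recall = recall_score(y_true, y_pred, pos_label=1)
    f1 = f1_score(y_true, y_pred, pos_label=1)

    print(f"\nMLP (Custom SGD) Accuracy ({split}): {acc:.3f}")
    print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")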

Results

Recall: 0.7930 → 0.6650 (from training to validation)
Precision: 0.7790 → 0.6786 (from training to validation)

The model learned the patterns in the training data, reaching a Recall of 79.3% (it identified roughly 80% of the fraud transactions), but Recall dropped by about 13 points on the validation set.
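For readers less familiar with these metrics, the sketch below shows how Precision, Recall, and F1 follow from confusion-matrix counts. The counts are hypothetical and are not taken from the fraud dataset.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp)  # of the transactions flagged as fraud, how many were fraud
    recall = tp / (tp + fn)     # of the actual fraud cases, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 80 frauds caught, 20 false alarms, 20 frauds missed
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```

In fraud detection, Recall is often the metric to watch, since each false negative is a fraudulent transaction that slipped through.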

Loss history:

We visualized the decision boundary using the first two principal components (PCA) as the x and y axes. Note that the boundary is non-linear.
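The plotting code is not shown here, but the projection step can be sketched with NumPy alone. This is one possible approach using the singular value decomposition, assuming a dense feature array; the function name `project_to_2d` and the random stand-in data are assumptions.

```python
import numpy as np

def project_to_2d(X):
    """Project the rows of X onto their first two principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))  # stand-in for the processed feature matrix
X_2d = project_to_2d(X_demo)

print(X_2d.shape)  # (100, 2)
```

The two resulting columns can serve as the x and y axes of a scatter plot, with the classifier's predictions over a grid of points drawn as the decision boundary.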

Leverage Scikit-learn’s MLPClassifier

We can use MLPClassifier to define a similar model, incorporating:

Early stopping using internal validation to prevent overfitting, and
L2 regularization with a small penalty (alpha), plus a small optimization tolerance (tol).

from sklearn.neural_network import MLPClassifier

# define a model
model_sklearn_mlp_sgd = MLPClassifier(
    hidden_layer_sizes=(30, 30),
    activation='relu',
    solver='sgd',
    learning_rate_init=0.001,
    learning_rate='constant',
    momentum=0.9,
    nesterovs_momentum=True,
    alpha=0.00001,           # L2 regularization strength
    max_iter=3000,           # max epochs (keep it high)
    batch_size=16,           # mini-batch size
    random_state=42,
    early_stopping=True,     # apply early stopping
    n_iter_no_change=50,     # stop the iteration if internal validation score doesn't improve for 50 epochs
    validation_fraction=0.1, # proportion of training data for internal validation (default is 0.1)
    tol=1e-4,                # tolerance for optimization
    verbose=False,
)

# training
model_sklearn_mlp_sgd.fit(X_train_processed, y_train)

# make a prediction
y_pred_train_sklearn = model_sklearn_mlp_sgd.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_sgd.predict(X_val_processed)
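The early-stopping behavior configured above through `early_stopping`, `n_iter_no_change`, and `tol` follows a patience rule. The sketch below is a simplified, generic version of that rule, not scikit-learn's actual code; the function name and the validation scores are made up.

```python
def early_stopping_epoch(val_scores, patience, tol=1e-4):
    """Return the epoch index at which training would stop, or None if it never does."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores):
        if score > best + tol:
            best = score                      # meaningful improvement: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1   # no improvement beyond the tolerance
        if epochs_without_improvement >= patience:
            return epoch
    return None

scores = [0.60, 0.65, 0.66, 0.66, 0.655, 0.66, 0.70]  # hypothetical validation scores
print(early_stopping_epoch(scores, patience=3))   # 5
print(early_stopping_epoch(scores, patience=10))  # None
```

With a short patience, training stops at epoch 5 and never sees the later jump to 0.70, which is one reason the configuration above uses a generous value of 50.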

Results

Recall: 0.7830 → 0.6200 (from training to validation)
Precision: 0.8208 → 0.6703 (from training to validation)

The model showed strong performance during training, achieving a Recall of 78.30%. Its performance declined on the validation set.

This suggests that while the model learned effectively from the training data, it may be overfitting and not generalizing as well to unseen data.

Leverage Keras Sequential Classifier

For the sequential classifier, we can further enhance the classifier by:

Initializing the output layer’s bias with the log-odds of positive class occurrences in the training data (y_train) to address dataset imbalance and promote faster convergence,
Integrating 10% dropout between hidden layers to prevent overfitting by randomly deactivating neurons during training,
Including Precision and Recall in the model’s compilation metrics so that classification performance is tracked during training,
Applying class weights to penalize misclassifications of the minority class more heavily, improving the model’s ability to learn rare patterns, and
Utilizing a separate validation dataset for monitoring performance during training to help detect overfitting and guide hyperparameter tuning.
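The log-odds initialization in the first item can be checked with a few lines of plain Python: passing the log-odds through a sigmoid returns the positive-class rate, so an untrained network starts out predicting the base rate. The class counts below are hypothetical, apart from the 2,000/2,000 split that matches the oversampled training set.

```python
import math

def initial_output_bias(n_pos, n_neg):
    """Log-odds of the positive class, used as the output layer's starting bias."""
    return math.log(n_pos / n_neg)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical imbalanced data: 10 fraud cases out of 1,000 transactions
bias_imbalanced = initial_output_bias(n_pos=10, n_neg=990)
print(sigmoid(bias_imbalanced))  # about 0.01, the fraud base rate

# Balanced data, as after oversampling to 2,000 samples per class
bias_balanced = initial_output_bias(n_pos=2000, n_neg=2000)
print(bias_balanced)  # 0.0
```

On a 50:50 training set the log-odds are zero, so this initialization matters most when the training data is left imbalanced.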

import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping
from sklearn.utils import class_weight


# calculates an initial bias for the output layer 
initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])


# defines the model
model_keras_sgd = Sequential([
    Input(shape=(X_train_processed.shape[1],)), 
    Dense(30, activation='relu'),
    Dropout(0.1), # 10% of the neurons in that layer randomly dropped out
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(1, activation='sigmoid', # binary classification
          bias_initializer=tf.keras.initializers.Constant(initial_bias)) # to address the imbalanced datasets
])



# compiles the model with the SGD optimizer
opt = SGD(learning_rate=0.001)
model_keras_sgd.compile(
    optimizer=opt, 
    loss='binary_crossentropy',
    metrics=[
        'accuracy', # add several metrics to return
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall'),
        tf.keras.metrics.AUC(name='auc') 
    ]
)


# defines early stopping to prevent overfitting
early_stopping_callback = EarlyStopping(
    monitor='val_recall',  # monitor recall 
    mode='max',         # maximize recall
    patience=50,        # stop after 50 epochs without improvement in val_recall
    min_delta=1e-4,     # minimum change to be considered an improvement (tol)
    verbose=0
)


# compute the class weight
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))


# train the model
history = model_keras_sgd.fit(
    X_train_processed, y_train,
    epochs=1000,
    batch_size=32,
    validation_data=(X_val_processed, y_val), # use our external val set
    callbacks=[early_stopping_callback], # early stopping to prevent overfitting
    class_weight=class_weights_dict, # penalize misclassification of the minority class more heavily
    verbose=0
)

# evaluate
loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_sgd.evaluate(X_train_processed, y_train, verbose=0)
print(f"\n--- Keras Model Accuracy (Train) ---")
print(f"Loss: {loss_train:.4f}")
print(f"Accuracy: {accuracy_train:.4f}")
print(f"Precision: {precision_train:.4f}")
print(f"Recall: {recall_train:.4f}")
print(f"AUC: {auc_train:.4f}")

loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_sgd.evaluate(X_val_processed, y_val, verbose=0)
print(f"\n--- Keras Model Accuracy (Validation) ---")
print(f"Loss: {loss_val:.4f}")
print(f"Accuracy: {accuracy_val:.4f}")
print(f"Precision: {precision_val:.4f}")
print(f"Recall: {recall_val:.4f}")
print(f"AUC: {auc_val:.4f}")

# display model summary
model_keras_sgd.summary()
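The `'balanced'` option passed to `compute_class_weight` above weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python on hypothetical labels; the helper name `balanced_class_weights` is an assumption.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

imbalanced_labels = [0] * 90 + [1] * 10  # hypothetical 90:10 split
weights_imbalanced = balanced_class_weights(imbalanced_labels)
print(weights_imbalanced[1])  # 5.0, so each minority sample counts five times

balanced_labels = [0] * 50 + [1] * 50
weights_balanced = balanced_class_weights(balanced_labels)
print(weights_balanced)  # {0: 1.0, 1: 1.0}
```

When the classes are already balanced, both weights equal 1.0 and the weighting has no effect on the loss.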

Results

Recall: 0.7125 → 0.7250 (from training to validation)
Precision: 0.7607 → 0.7545 (from training to validation)

Given that the gaps between training and validation are relatively small, the model is generalizing reasonably well.

It suggests that the regularization techniques are likely effective in preventing significant overfitting.
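One of those regularization techniques, dropout, is easy to sketch in plain Python. The version below is the common "inverted dropout" formulation, where surviving activations are scaled up during training so that nothing changes at inference time. It is a simplified illustration with assumed names and values, not the Keras implementation.

```python
import random

def dropout(activations, rate, training, rng):
    """Randomly zero a fraction `rate` of activations during training only."""
    if not training or rate == 0.0:
        return list(activations)  # inference: pass values through unchanged
    keep_prob = 1.0 - rate
    # Scale kept values by 1 / keep_prob so the expected activation stays the same.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10  # hypothetical hidden-layer activations

train_out = dropout(hidden, rate=0.1, training=True, rng=rng)
infer_out = dropout(hidden, rate=0.1, training=False, rng=rng)

print(train_out)  # each value is either 0.0 or about 1.111
print(infer_out)  # identical to `hidden`
```

Because a different random subset of neurons is silenced on each training step, the network cannot rely too heavily on any single neuron.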

How to Build an MLP Classifier with Adam Optimizer

Custom Classifier

The parameter updates happen inside the mini-batch loop, where Adam adjusts the weights and biases at every step:

# apply Adam updates for output layer parameters
# 1) weights (w)
self.m_weights[-1] = self.beta1 * self.m_weights[-1] + (1 - self.beta1) * grad_w_output
self.v_weights[-1] = self.beta2 * self.v_weights[-1] + (1 - self.beta2) * (grad_w_output ** 2)
m_w_hat = self.m_weights[-1] / (1 - self.beta1**t)
v_w_hat = self.v_weights[-1] / (1 - self.beta2**t)
self.weights[-1] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

# 2) bias (b)
self.m_biases[-1] = self.beta1 * self.m_biases[-1] + (1 - self.beta1) * grad_b_output
self.v_biases[-1] = self.beta2 * self.v_biases[-1] + (1 - self.beta2) * (grad_b_output ** 2)
m_b_hat = self.m_biases[-1] / (1 - self.beta1**t)
v_b_hat = self.v_biases[-1] / (1 - self.beta2**t)
self.biases[-1] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)

Following the principles of forward and backward passes, we construct the final classifier by initializing it with beta1 and beta2, built upon an MLP_SGD architecture:

class MLP_Adam:
    def __init__(self, hidden_layer_sizes=(10,), learning_rate=0.001, n_epochs=1000, batch_size=32,
                 beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

        self.weights = [] 
        self.biases = []

        # Adam optimizer internal states for each parameter (weights and biases)
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def _sigmoid_derivative(self, x):
        s = self._sigmoid(x)
        return s * (1 - s)

    def _relu(self, x):
        return np.maximum(0, x)

    def _relu_derivative(self, x):
        return (x > 0).astype(float)

    def _initialize_parameters(self, n_features):
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [1]

        self.weights = []
        self.biases = []
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        for i in range(len(layer_sizes) - 1):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+1]
            limit = np.sqrt(6 / (fan_in + fan_out))

            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((1, fan_out)))

            self.m_weights.append(np.zeros((fan_in, fan_out)))
            self.v_weights.append(np.zeros((fan_in, fan_out)))
            self.m_biases.append(np.zeros((1, fan_out)))
            self.v_biases.append(np.zeros((1, fan_out)))


    def _forward_pass(self, X):
        activations = [X]
        zs = []

        for i in range(len(self.weights) - 1):
            z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        return activations, zs

    def _compute_loss(self, y_true, y_pred):
        y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(-1, 1)
        X = np.asarray(X)

        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() for w in self.weights])
        self.biases_history.append([b.copy() for b in self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[-1])
        self.loss_history.append(initial_loss)

        # global time step for Adam bias correction
        t = 0

        for epoch in range(self.n_epochs):
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            # Mini-batch loop
            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                t += 1

                # 1. forward pass
                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[-1] # Output of the network

                # 2. backpropagation
                delta = y_pred - y_batch
                grad_w_output = np.dot(activations[-2].T, delta) / X_batch.shape[0] # Average over batch
                grad_b_output = np.sum(delta, axis=0) / X_batch.shape[0]

                # apply Adam updates to weights
                self.m_weights[-1] = self.beta1 * self.m_weights[-1] + (1 - self.beta1) * grad_w_output
                self.v_weights[-1] = self.beta2 * self.v_weights[-1] + (1 - self.beta2) * (grad_w_output ** 2)
                m_w_hat = self.m_weights[-1] / (1 - self.beta1**t)
                v_w_hat = self.v_weights[-1] / (1 - self.beta2**t)
                self.weights[-1] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                # apply Adam updates to bias
                self.m_biases[-1] = self.beta1 * self.m_biases[-1] + (1 - self.beta1) * grad_b_output
                self.v_biases[-1] = self.beta2 * self.v_biases[-1] + (1 - self.beta2) * (grad_b_output ** 2)
                m_b_hat = self.m_biases[-1] / (1 - self.beta1**t)
                v_b_hat = self.v_biases[-1] / (1 - self.beta2**t)
                self.biases[-1] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


                # Propagate gradients backward through hidden layers
                for l in range(len(self.weights) - 2, -1, -1):
                    delta = np.dot(delta, self.weights[l+1].T) * self._relu_derivative(zs[l]) # d_activation(z)
                    grad_w_hidden = np.dot(activations[l].T, delta) / X_batch.shape[0]
                    grad_b_hidden = np.sum(delta, axis=0) / X_batch.shape[0]

                    # apply Adam updates to weights
                    self.m_weights[l] = self.beta1 * self.m_weights[l] + (1 - self.beta1) * grad_w_hidden
                    self.v_weights[l] = self.beta2 * self.v_weights[l] + (1 - self.beta2) * (grad_w_hidden ** 2)
                    m_w_hat = self.m_weights[l] / (1 - self.beta1**t)
                    v_w_hat = self.v_weights[l] / (1 - self.beta2**t)
                    self.weights[l] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                    # apply Adam updates to bias
                    self.m_biases[l] = self.beta1 * self.m_biases[l] + (1 - self.beta1) * grad_b_hidden
                    self.v_biases[l] = self.beta2 * self.v_biases[l] + (1 - self.beta2) * (grad_b_hidden ** 2)
                    m_b_hat = self.m_biases[l] / (1 - self.beta1**t)
                    v_b_hat = self.v_biases[l] / (1 - self.beta2**t)
                    self.biases[l] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


            self.weights_history.append([w.copy() for w in self.weights])
            self.biases_history.append([b.copy() for b in self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[-1])
            self.loss_history.append(epoch_loss)

            if (epoch + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{self.n_epochs}, Loss: {epoch_loss:.4f}")
        return self


    def predict_proba(self, X):
        activations, _ = self._forward_pass(X)
        return activations[-1]

    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int).flatten()

Training / Prediction

Train the model and make a prediction using training and validation datasets:

mlp_adam = MLP_Adam(hidden_layer_sizes=(30, 10), learning_rate=0.001, n_epochs=500, batch_size=32)
mlp_adam.fit(X_train_processed, y_train)

y_pred_train = mlp_adam.predict(X_train_processed)
y_pred_val = mlp_adam.predict(X_val_processed)

acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)

print(f"\nMLP (Custom Adam) Accuracy (Train): {acc_train:.3f}")
print(f"MLP (Custom Adam) Accuracy (Validation): {acc_val:.3f}")

Results

Recall: 0.9870 → 0.6150 (from training to validation)
Precision: 0.9811 → 0.6474 (from training to validation)

While the Adam optimizer fit the training data much more closely than SGD, the model overfit significantly: Recall fell by about 37 points and Precision by about 33 points between training and validation.

Loss History

We visualized the decision boundary using the first two principal components (PCA) as the x and y axes.

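Leverage Scikit-learn’s MLPClassifier

We’ve switched the optimizer from SGD to Adam. The SGD-specific momentum settings no longer apply, and alpha is set to 0.0001 here (versus 0.00001 above); all other settings are unchanged: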

model_sklearn_mlp_adam = MLPClassifier(
    hidden_layer_sizes=(30, 30),
    activation='relu',
    solver='adam',             # update the optimizer from SGD to Adam
    learning_rate_init=0.001,
    learning_rate='constant',
    alpha=0.0001,
    max_iter=3000,
    batch_size=16,
    random_state=42,
    early_stopping=True,
    n_iter_no_change=50,
    validation_fraction=0.1,
    tol=1e-4,
    verbose=False,
)

model_sklearn_mlp_adam.fit(X_train_processed, y_train)

y_pred_train_sklearn = model_sklearn_mlp_adam.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_adam.predict(X_val_processed)

Results

Recall: 0.8975 → 0.6400 (from training to validation)
Precision: 0.8864 → 0.6305 (from training to validation)

Despite a performance improvement compared to the SGD optimizer, the significant drop in both Recall (from 0.8975 to 0.6400) and Precision (from 0.8864 to 0.6305) from training to validation data indicates that the model is still overfitting.

Leverage Keras Sequential Classifier

Similar to MLPClassifier, we’ve switched the optimizer from SGD to Adam with all the other conditions remaining the same:

import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping
from sklearn.utils import class_weight


initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])
model_keras_adam = Sequential([
    Input(shape=(X_train_processed.shape[1],)), 
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(1, activation='sigmoid', 
          bias_initializer=tf.keras.initializers.Constant(initial_bias))
])


optimizer_keras = Adam(learning_rate=0.001)
model_keras_adam.compile(
    optimizer=optimizer_keras, 
    loss='binary_crossentropy', 
    metrics=[
        'accuracy',
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall'),
        tf.keras.metrics.AUC(name='auc') 
    ]
)

early_stopping_callback = EarlyStopping(
    monitor='val_recall',
    mode='max',
    patience=50,
    min_delta=1e-4,
    verbose=0
)

class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))

model_keras_adam.fit(
    X_train_processed, y_train,
    epochs=1000,
    batch_size=32,
    validation_data=(X_val_processed, y_val),
    callbacks=[early_stopping_callback],
    class_weight=class_weights_dict,
    verbose=0
)


loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_adam.evaluate(X_train_processed, y_train, verbose=0)
print(f"\n--- Keras Model Accuracy (Train) ---")
print(f"Loss: {loss_train:.4f}")
print(f"Accuracy: {accuracy_train:.4f}")
print(f"Precision: {precision_train:.4f}")
print(f"Recall: {recall_train:.4f}")
print(f"AUC: {auc_train:.4f}")


loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_adam.evaluate(X_val_processed, y_val, verbose=0)
print(f"\n--- Keras Model Accuracy (Validation) ---")
print(f"Loss: {loss_val:.4f}")
print(f"Accuracy: {accuracy_val:.4f}")
print(f"Precision: {precision_val:.4f}")
print(f"Recall: {recall_val:.4f}")
print(f"AUC: {auc_val:.4f}")


model_keras_adam.summary()

Results

Recall: 0.7995 → 0.7500 (from training to validation)
Precision: 0.8409 → 0.8065 (from training to validation)

The model exhibits good performance, with Recall slightly decreasing from 0.7995 (training) to 0.7500 (validation), and Precision similarly dropping from 0.8409 (training) to 0.8065 (validation).

This indicates good generalization, with only minor performance degradation on unseen data.
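To compare generalization across all six models at a glance, the short script below computes the training-to-validation gap in percentage points from the Recall and Precision values reported in the Results sections. The dictionary layout and the names are just one way to organize the numbers.

```python
# (train, validation) pairs as reported in the Results sections
reported = {
    "Custom SGD":   {"recall": (0.7930, 0.6650), "precision": (0.7790, 0.6786)},
    "sklearn SGD":  {"recall": (0.7830, 0.6200), "precision": (0.8208, 0.6703)},
    "Keras SGD":    {"recall": (0.7125, 0.7250), "precision": (0.7607, 0.7545)},
    "Custom Adam":  {"recall": (0.9870, 0.6150), "precision": (0.9811, 0.6474)},
    "sklearn Adam": {"recall": (0.8975, 0.6400), "precision": (0.8864, 0.6305)},
    "Keras Adam":   {"recall": (0.7995, 0.7500), "precision": (0.8409, 0.8065)},
}

def gap_points(train, val):
    """Training-minus-validation gap in percentage points."""
    return round((train - val) * 100, 2)

gaps = {
    name: {metric: gap_points(*pair) for metric, pair in metrics.items()}
    for name, metrics in reported.items()
}

for name, g in gaps.items():
    print(f"{name:<13} recall gap: {g['recall']:6.2f}  precision gap: {g['precision']:6.2f}")
```

The two Keras models show the smallest gaps, while the custom Adam classifier shows the largest, which matches the overfitting observations made for each model.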

Final Results: Generalization

Finally, we’ll evaluate the model’s ultimate performance on the test dataset, which has remained completely separate from all prior training and validation processes.

# Custom classifiers
y_pred_test_custom_sgd = mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_custom_adam = mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

# MLPClassifier
y_pred_test_sk_sgd = model_sklearn_mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_sk_adam = model_sklearn_mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

# Keras Sequential
_, accuracy_test_sgd, precision_test_sgd, recall_test_sgd, auc_test_sgd = model_keras_sgd.evaluate(X_test_processed, y_test, verbose=0)
_, accuracy_test_adam, precision_test_adam, recall_test_adam, auc_test_adam = model_keras_adam.evaluate(X_test_processed, y_test, verbose=0)

Overall, the Keras Sequential model, optimized with SGD, achieved the best performance with an AUPRC (Area Under Precision-Recall Curve) of 0.72.
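The code that produced this AUPRC figure is not shown. As a rough guide to what the metric measures, here is a minimal sketch of average precision, one common way to estimate the area under the precision-recall curve. It assumes no tied scores and at least one positive label, and the labels and scores are made up.

```python
def average_precision(y_true, scores):
    """Average precision: mean of the precision values at each true positive's rank."""
    # Rank examples from the highest predicted score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos

y_demo = [1, 0, 1, 0]               # hypothetical labels
scores_demo = [0.9, 0.8, 0.7, 0.1]  # hypothetical predicted probabilities

ap = average_precision(y_demo, scores_demo)
print(round(ap, 4))  # 0.8333
```

Unlike accuracy, this metric depends only on how well the positive class is ranked, which makes it a useful summary for rare-event problems such as fraud detection.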

Conclusion

In this exploration, we experimented with custom classifiers, Scikit-learn models, and Keras deep learning architectures.

Our findings underscore that effective machine learning hinges on three critical factors:

robust data preprocessing (tailored to objectives and data distribution),
judicious model selection, and
strategic framework or library choices.

Choosing the right framework

Generally speaking, choose MLPClassifier when:

You’re primarily working with tabular data,
You want to prioritize simplicity, quick iteration, and seamless integration,
You have simple, shallow architectures, and
You have a moderate dataset size (manageable on a CPU).

Choose Keras Sequential when:

You’re dealing with image, text, audio, or other sequential data,
You’re building deep learning models such as CNNs, RNNs, LSTMs,
You need fine-grained control over the model architecture, training process, or custom components,
You need to leverage GPU acceleration,
You’re planning for production deployment, and
You want to experiment with more advanced deep learning techniques.

Limitation of MLPs

While Multilayer Perceptrons (MLPs) proved valuable, their susceptibility to computational complexity and overfitting emerged as key challenges.

Looking ahead, we’ll delve into how Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) offer powerful solutions to these inherent MLP limitations.

You can find more info about me on my Portfolio / LinkedIn / Github.

How AI Models Think: The Key Role of Activation Functions with Code Examples

Tiago Capelo Monteiro — Wed, 10 Apr 2024 15:44:31 +0000

In Artificial Intelligence, Machine Learning is the foundation of most revolutionary AI applications. From language processing to image recognition, Machine Learning is everywhere.

Machine Learning relies on algorithms, statistical models, and neural networks. And Deep Learning is the subfield of Machine Learning focused only on neural networks.

Activation functions are a key piece of any neural network. But exactly why they are essential is a common question, and it can be a difficult one to answer.

This tutorial focuses on explaining, in a simple manner with analogies, why exactly activation functions are necessary.

By understanding this, you will understand the process of how AI models think.

Before that, we will explore neural networks in AI. We will also explore the most commonly used activation functions.

We're also going to analyze every line of a very simple PyTorch code example of a neural network.

In this article, we will explore:

Artificial Intelligence and the Rise of Deep Learning
Understanding Activation Functions: Simplifying Neural Network Mechanics
Simple Analogy: The Necessity of Activation Functions
What Happens Without Activation Functions?
PyTorch Activation Function Code Example
Conclusion: The Unsung Heroes of AI Neural Networks

This article won't cover dropout or other regularization techniques, hyperparameter optimization, complex architectures like CNNs, or detailed differences in gradient descent variants.

I just want to showcase why activation functions are needed and what happens when they are not applied to neural networks.

If you don't know much about deep learning, I personally recommend this Deep Learning crash course on freeCodeCamp's YouTube channel:

Artificial Intelligence and the Rise of Deep Learning

What is Deep Learning in Artificial Intelligence?

Deep learning is a subfield of artificial intelligence. It uses neural networks to process complex patterns, just like the strategies a sports team uses to win a match.

The bigger the neural network, the more capable it is of doing awesome things – like ChatGPT, for example, which uses natural language processing to answer questions and interact with users.

To truly understand the basics of neural networks – what every single AI model has in common that enables it to work – we need to understand activation layers.

Deep Learning = Training Neural Networks

Simple neural network

At the core of deep learning is the training of neural networks.

That means basically using data to get the right values of the weights to be able to predict what we want.

Neural networks are made of neurons organized in layers. Each layer extracts unique features from the data.

This layered structure allows deep learning models to analyze and interpret complex data.

Understanding Activation Functions: Simplifying Neural Network Mechanics

Leaky ReLU activation function

Activation functions help neural networks handle complex data. They change the neuron value based on the data they receive.

It is almost like a filter every neuron has before sending its value to the next neuron.

Essentially, activation functions control the information flow of neural networks – they decide which data is relevant and which is not.

Choosing a suitable activation function also helps avoid the vanishing gradients problem, so the network keeps learning properly.

The vanishing gradients problem happens when the neural network's learning signals are too weak to make the weight values change. This makes learning from data very difficult.

Simple Analogy: Why Activation Functions are Necessary

In a soccer game, players decide whether to pass, dribble, or shoot the ball.

These decisions are based on the current game situation, just as neurons in a neural network process data.

Activation functions play the role of this decision-making process.

Without them, neurons would pass data without any selective analysis – like players mindlessly kicking the ball regardless of the game context.

In this way, activation functions introduce complexity into a neural network, allowing it to learn complex patterns.

What Happens Without Activation Functions?

To understand what would happen without activation functions, let's first think about what happens if players mindlessly kick the ball in a soccer match.

They'd likely lose the match because there would be no decision-making processes as a team. That ball still goes somewhere – but most of the time it will not go where it's intended.

This is similar to what happens in a neural network without activation functions: the neural network doesn't make good predictions because the neurons just pass data to each other without any filtering.

We still get a prediction. Just not what we wanted, or what's helpful.

This dramatically limits the capability – of both the soccer team and the neural network.

Intuitive Explanation of Activation Functions

Let's now look at an example so you can understand this intuitively.

ReLU activation function

Let's start with the most widely used activation function in deep learning (it's also one of the simpler ones).

This is a ReLU activation function. It basically acts as a filter before a neuron sends a value to the next neuron.

This filter is essentially two conditions:

If the neuron's value is negative, it becomes 0
If the neuron's value is positive, it is passed on unchanged

With this, we are adding a decision-making process to each neuron. It decides which data to send and which not to send.
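To make the two conditions concrete, here is a minimal sketch in plain Python (no PyTorch). The `relu` helper and the sample values are made up for illustration.

```python
def relu(value):
    """Return the value unchanged if it is positive, otherwise 0."""
    return max(0.0, value)

# Hypothetical values arriving at five neurons
neuron_values = [-2.5, -0.1, 0.0, 0.7, 3.0]

# Negative values are filtered out, positive values pass through
filtered = [relu(v) for v in neuron_values]
print(filtered)  # [0.0, 0.0, 0.0, 0.7, 3.0]
```

In PyTorch, `nn.ReLU()` applies the same rule to every element of a tensor.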

Now let's look at some examples of other activation functions.

Sigmoid Activation Functions

This activation function squashes the input value into the range between 0 and 1. Sigmoids are widely used in binary classification problems, in the output neuron.

Sigmoid activation function

There is a problem with sigmoid activation functions, though. Take the output values from a given linear transformation:

0.00000003
0.99999992
0.00000247
0.99993320

There are some questions about these values we can ask:

Are values like 0.00000003 and 0.000002 really important? Can't they be just 0 so that we have fewer things to run on the computer? Remember, in many of today's models, we have millions of weights in them. Can't millions of 0.00000003 and 0.000002 be 0?
And if it is a positive value, how will it distinguish a big value from a very big value? For example, in 0.99993320 and 0.99999992, where are the input values like 7 and 13 or 7 and 55? 0.99993320 and 0.99999992 do not accurately describe their input values.

How can we distinguish the subtle differences in outputs so that accuracy is maintained?

This is what the ReLU activation functions solved: setting negative numbers to zero while keeping positive ones boosts neural network computational efficiency.
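The short sketch below compares how sigmoid and ReLU treat large positive inputs, such as the 7, 13, and 55 mentioned in the questions above. It uses only Python's `math` module, and the helper names are our own.

```python
import math

def sigmoid(x):
    """Squash x into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Keep positive values, set negative values to 0."""
    return max(0.0, x)

for x in [7, 13, 55]:
    # Sigmoid outputs crowd together just below 1,
    # while ReLU keeps the original distance between the inputs.
    print(x, sigmoid(x), relu(x))

sigmoid_gap = sigmoid(13) - sigmoid(7)
relu_gap = relu(13) - relu(7)
print(sigmoid_gap, relu_gap)
```

The gap between the sigmoid outputs for 7 and 13 is smaller than 0.001, while ReLU preserves the full difference of 6.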

Tanh (Hyperbolic Tangent) Activation Functions

tanh activation function

Tanh activation functions output values between -1 and 1. The curve has an S shape similar to sigmoid, which outputs values between 0 and 1.

They're often used in recurrent neural networks (RNNs) and long short-term memory networks (LSTMs).

Tanh is also used because it is zero-centered. This means that the mean of the output values is around zero. This property tends to make training easier, although tanh can still suffer from vanishing gradients for very large or very small inputs.

Leaky ReLU

Leaky ReLU activation function

Instead of setting negative values to zero, Leaky ReLU activation functions multiply them by a small slope (0.01 by default in PyTorch), so the output stays small and negative.

This way, negative values are also used when training neural networks.

With the ReLU activation function, neurons with negative values are inactive and do not contribute to the learning process.

With the Leaky ReLU activation function, neurons with negative values are active and contribute to the learning process.

This decision-making process is implemented by the activation function. Without it, the neuron would simply send the data to the next neuron (just like a player mindlessly kicking the ball).
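Here is a small sketch of the difference between ReLU and Leaky ReLU in plain Python. The `negative_slope` of 0.01 matches the default of `nn.LeakyReLU`, and the input values are made up for illustration.

```python
def relu(x):
    """Negative values become 0."""
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    """Negative values are scaled down instead of being zeroed out."""
    return x if x > 0 else negative_slope * x

values = [-10.0, -1.0, 0.0, 2.0]

# With ReLU, every negative value collapses to 0
print([relu(v) for v in values])

# With Leaky ReLU, negative values stay negative but become much smaller
print([leaky_relu(v) for v in values])
```

Because the negative side is not flat, a neuron with a negative value still passes a small signal forward and can keep learning.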

Mathematical Explanation of Activation Functions

Neurons do two things:

They apply a linear transformation to the values coming from the previous neurons, using their weights.
They use activation functions to filter the result and selectively pass on values.

Without activation functions, the neural network just does one thing: Linear transformations.

If it only does linear transformations, it is a linear system.

If it is a linear system, in very simple terms without being too technical, the superposition theorem tells us that any mixture of two or more linear transformations can be simplified into one single transformation.

Essentially, it means that, without activation functions, this complex neural network:

Long neural network without activation functions

Is the same as this simple one:

Short neural network without activation functions

This is because each layer in its matrix form is a product of linear transformations of previous layers.

And according to the theorem, since any mixture of two or more linear transformations can be simplified into one single transformation, any mixture of hidden layers (that is, layers between the input layer and the output layer) in a neural network can be simplified into only one layer.

What does this all mean?

It means that it can only model data linearly. But in real life with real data, most systems are non-linear. So we need activation functions.

We introduce non-linearity into a neural network so that it learns non-linear patterns.
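We can check this with a tiny numeric sketch. The weights below are made up and there are no biases, so that the arithmetic is easy to follow by hand. Two stacked linear layers give exactly the same result as one layer whose weight matrix is the product of the two. Adding a ReLU in between breaks that equivalence.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def relu_matrix(m):
    """Apply ReLU to every element of a matrix."""
    return [[max(0.0, v) for v in row] for row in m]

# Hypothetical weights: 2 inputs -> 2 hidden neurons -> 1 output
W1 = [[1.0, -2.0],
      [0.5, 1.0]]
W2 = [[1.0],
      [-1.0]]
x = [[3.0, 2.0]]  # one input sample as a row vector

# Two linear layers applied one after the other
two_layers = matmul(matmul(x, W1), W2)

# One linear layer whose weights are W1 multiplied by W2
one_layer = matmul(x, matmul(W1, W2))

# Same two layers, but with a ReLU between them
with_relu = matmul(relu_matrix(matmul(x, W1)), W2)

print(two_layers, one_layer, with_relu)  # [[8.0]] [[8.0]] [[4.0]]
```

The first two results match, so the extra layer added nothing. The third result differs, which shows that the activation function is what makes the extra layer count.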

PyTorch Activation Function Code Example

In this section, we are going to train the neural network below:

Simple feed forward neural network

This is a simple neural network AI model built from four fully connected layers, which connect:

Input layer with 10 neurons
Two hidden layers with 18 neurons each
One hidden layer with 4 neurons
One output layer with 1 neuron

In the code, we can choose any of the four activation functions mentioned in this tutorial.

Here is the full code – we'll discuss it in detail below:

import torch
import torch.nn as nn
import torch.optim as optim

#Choose which activation function to use in code
defined_activation_function = 'relu'

activation_functions = {
    'relu': nn.ReLU(),
    'sigmoid': nn.Sigmoid(),
    'tanh': nn.Tanh(),
    'leaky_relu': nn.LeakyReLU()
}

# Initializing hyperparameters
num_samples = 100
batch_size = 10
num_epochs = 150
learning_rate = 0.001

# Define a simple synthetic dataset
def generate_data(num_samples):
    X = torch.randn(num_samples, 10)
    y = torch.randn(num_samples, 1)
    return X, y

# Generate synthetic data
X, y = generate_data(num_samples)

class SimpleModel(nn.Module):
    def __init__(self, activation=defined_activation_function):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(in_features=10, out_features=18)
        self.fc2 = nn.Linear(in_features=18, out_features=18)
        self.fc3 = nn.Linear(in_features=18, out_features=4)
        self.fc4 = nn.Linear(in_features=4, out_features=1)
        self.activation = activation_functions[activation]

    def forward(self, x):
        x = self.fc1(x)
        x = self.activation(x)
        x = self.fc2(x) 
        x = self.activation(x)
        x = self.fc3(x) 
        x = self.activation(x)  
        x = self.fc4(x) 
        return x

# Initialize the model, define loss function and optimizer
model = SimpleModel(activation=defined_activation_function)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    for i in range(0, num_samples, batch_size):
        # Get the mini-batch
        inputs = X[i:i+batch_size]
        labels = y[i:i+batch_size]

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)

        # Compute the loss
        loss = criterion(outputs, labels)

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}')

print("Training complete.")

Looks like a lot, doesn't it? Don't worry – we'll take it piece by piece.

1: Importing libraries and defining activation functions

import torch
import torch.nn as nn
import torch.optim as optim

#Choose which activation function to use in code
defined_activation_function = 'relu'

activation_functions = {
    'relu': nn.ReLU(),
    'sigmoid': nn.Sigmoid(),
    'tanh': nn.Tanh(),
    'leaky_relu': nn.LeakyReLU()
}

Importing libraries and defining dictionary with activation functions

In this code:

import torch: Imports the PyTorch library.
import torch.nn as nn: Imports the neural network module from PyTorch.
import torch.optim as optim: Imports the optimization module from PyTorch.

The variable and the dictionary above help you easily define, for this deep learning model, the activation function you want to use.

2: Defining hyperparameters and generating a dataset

# Initializing hyperparameters
num_samples = 100
batch_size = 10
num_epochs = 150
learning_rate = 0.001

# Define a simple synthetic dataset
def generate_data(num_samples):
    X = torch.randn(num_samples, 10)
    y = torch.randn(num_samples, 1)
    return X, y

# Generate synthetic data
X, y = generate_data(num_samples)

Initializing hyperparameters and creating, with a function, a synthetic dataset

In this code:

num_samples is the number of samples in the synthetic dataset.
batch_size is the size of each mini-batch during training.
num_epochs is the number of iterations over the entire dataset during training.
learning_rate is the learning rate used by the optimization algorithm.

We also define a generate_data function that creates two tensors of random values: X with shape (num_samples, 10) and y with shape (num_samples, 1). We then call the function to generate X and y.

3: Creating the deep learning model

class SimpleModel(nn.Module):
    def __init__(self, activation=defined_activation_function):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(in_features=10, out_features=18)
        self.fc2 = nn.Linear(in_features=18, out_features=18)
        self.fc3 = nn.Linear(in_features=18, out_features=4)
        self.fc4 = nn.Linear(in_features=4, out_features=1)
        self.activation = activation_functions[activation]

    def forward(self, x):
        x = self.fc1(x)
        x = self.activation(x)
        x = self.fc2(x) 
        x = self.activation(x)
        x = self.fc3(x) 
        x = self.activation(x)  
        x = self.fc4(x) 
        return x

A simple feed forward neural network deep learning AI model

The __init__ method in the SimpleModel class initializes the neural network architecture. It initializes four fully connected layers and defines the activation function we are going to use.

We create each layer using nn.Linear, while the forward method defines how the data flows through the neural network.
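As a quick sanity check on the architecture, we can count how many trainable parameters the four layers hold. This is a small sketch in plain Python. It assumes each nn.Linear layer keeps its default bias, so every layer has one weight per input-output pair plus one bias per output neuron.

```python
# (in_features, out_features) for fc1 to fc4, copied from SimpleModel
layer_shapes = [(10, 18), (18, 18), (18, 4), (4, 1)]

def linear_param_count(in_features, out_features):
    """Weights (in * out) plus one bias per output neuron."""
    return in_features * out_features + out_features

counts = [linear_param_count(i, o) for i, o in layer_shapes]
total = sum(counts)

print(counts)  # [198, 342, 76, 5]
print(total)   # 621
```

So this small model learns 621 values during training. The activation function itself adds no parameters.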

4: Initializing the model and defining the loss function and optimizer

model = SimpleModel(activation=defined_activation_function)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

Defining the activation function, loss function, and gradient descent variant to be used

In this code:

model = SimpleModel(activation=defined_activation_function) creates a neural network model with a specified activation function.
criterion = nn.MSELoss() defines the Mean Squared Error (MSE) Loss function.
optimizer = optim.Adam(model.parameters(), lr=learning_rate) sets up the Adam optimizer for updating the model parameters during training, with a specified learning rate.
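To see what the Mean Squared Error measures, here is a minimal sketch in plain Python. The `mse_loss` helper and the numbers are our own, and the averaging mirrors the default behavior of nn.MSELoss.

```python
def mse_loss(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical model outputs and the values we wanted
predictions = [2.0, 0.0, 3.0]
targets = [1.0, 0.0, 5.0]

# Squared errors are 1, 0 and 4, so the mean is 5 / 3
loss_value = mse_loss(predictions, targets)
print(loss_value)
```

The larger the gap between a prediction and its target, the more that sample contributes to the loss, since the error is squared.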

5: Training the deep learning model

for epoch in range(num_epochs):
    for i in range(0, num_samples, batch_size):
        # Get the mini-batch
        inputs = X[i:i+batch_size]
        labels = y[i:i+batch_size]

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)

        # Compute the loss
        loss = criterion(outputs, labels)

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}')

Training the model

The outer loop, based on num_epochs (number of epochs) controls how many times the entire dataset is processed.
The inner loop divides the dataset in mini-batches using the range function.

In each mini-batch iteration:

With inputs and labels, we get the data from the mini-batch we want to process
We clear the gradients of the previous mini-batch iteration with optimizer.zero_grad(). Gradients are the values that tell us how to adjust the weights for more accurate predictions, and PyTorch accumulates them by default, so clearing them prevents mixing gradient information between mini-batches.
The forward pass gets us the model predictions (outputs), and the loss is calculated using the specified loss function (criterion).
With loss.backward(), we calculate the gradients for the weights.
Finally, optimizer.step() updates the model's weights based on those gradients to minimize the loss function.
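To show what these steps do under the hood, here is a stripped-down sketch of the same loop in plain Python. It is not the PyTorch implementation: the model is a single weight `w`, the data is made up to follow y = 2 * x, the gradient is worked out by hand, and the update is plain gradient descent instead of Adam.

```python
# Toy dataset following y = 2 * x (hypothetical, for illustration only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0               # the single weight we want to learn
learning_rate = 0.01
batch_size = 2
num_epochs = 200

for epoch in range(num_epochs):
    for i in range(0, len(xs), batch_size):
        # Get the mini-batch
        batch_x = xs[i:i + batch_size]
        batch_y = ys[i:i + batch_size]

        # Forward pass: the model is just w * x
        outputs = [w * x for x in batch_x]

        # Compute the mean squared error loss
        loss = sum((o - y) ** 2 for o, y in zip(outputs, batch_y)) / len(batch_x)

        # Backward pass: d(loss)/dw = mean(2 * (w * x - y) * x).
        # The gradient is computed from scratch for every batch,
        # which is the role optimizer.zero_grad() plays in PyTorch.
        grad = sum(2 * (o - y) * x
                   for o, y, x in zip(outputs, batch_y, batch_x)) / len(batch_x)

        # Update step: move the weight against the gradient
        w -= learning_rate * grad

print(w, loss)
```

After training, `w` ends up very close to 2, which is the value that produces the targets exactly.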

This is the full code to train a very simple deep learning model on a very simple dataset.

It does not have anything more advanced like convolutional neural networks.

Conclusion: The Unsung Heroes of AI Neural Networks

Activation functions are like gatekeepers. By restricting the flow of information, they help the neural network learn better.

Activation functions are just like people when they study, or soccer players when deciding what to do with a ball.

These functions give neural networks the ability to learn and predict correctly.

Mathematically, activation functions are what allow the correct approximation of any linear or non-linear function in neural networks. Without them, neural networks approximate only linear functions.

And I leave you with this:

The mathematical idea of a neural network being able to approximate any non-linear function is called the Universal Approximation Theorem.

You can find the full code on GitHub here:

https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code

Learn Machine Learning and Neural Networks without Frameworks

Beau Carnes — Wed, 30 Aug 2023 16:22:01 +0000

A lot of machine learning courses rely on frameworks that abstract the inner workings of what's going on. If you want to become proficient, it's helpful to understand how things work under the hood.

We just published a course on the freeCodeCamp.org YouTube channel that will teach you how to use machine learning and neural networks without any frameworks.

Dr. Radu Mariescu-Istodor created this course. He has a PhD in computer science and creates engaging and creative software tutorials.

In a world where machine learning is becoming increasingly prevalent, understanding the underlying concepts and algorithms has never been more crucial. The course doesn't rely on pre-existing libraries. Instead, it takes you through building a machine learning system from scratch.

The course centers around a captivating project: creating a web app that learns to recognize drawings. This is Phase 2 of the course, where the focus shifts towards enhancing the accuracy of the methods developed in Phase 1 (but you can still follow along if you did not watch Phase 1).

Through a series of comprehensive sections, you'll explore advanced features, classification methods, data cleaning techniques, and a wide array of fundamental concepts that are essential to machine learning.

The course is structured to gradually build your understanding from the ground up. Here are the topic sections of this course:

Introduction
Phase 1 Code Review
Data Cleaning
Confusion Matrix
Euclidean Distance Marker
Measuring the Elongation
Measuring the Roundness
Vector vs Raster (Pixels)
Neural Networks
Optimizing Neural Networks
Deep Neural Networks

By unveiling the inner workings of machine learning systems, you'll not only develop a profound understanding of the subject but also hone your software development skills.

Watch the full course on the freeCodeCamp.org YouTube channel (4-hour watch).

The Brain-Inspired Approach to AI – Explained for Developers

freeCodeCamp — Mon, 08 May 2023 15:43:38 +0000

By Edem Gold

"Our intelligence is what makes us human, and AI is an extension of that quality." – Yann LeCun

Since the advent of Neural Networks (also known as artificial neural networks), the AI industry has enjoyed unparalleled success. Neural Networks are the driving force behind modern AI systems, and they are modeled after the human brain.

Modern AI research involves creating and implementing algorithms that aim to mimic the neural processes of the human brain. Their goal is to create systems that learn and act in ways similar to human beings.

In this article, we will attempt to understand the brain-inspired approach to building AI systems.

Here's what we'll cover:

How we'll approach this
The history of the brain-inspired approach to AI
How the human brain works and how it's related to AI systems
Core principles behind the brain-inspired approach to AI
Challenges in building brain-inspired AI systems
Summary

How We'll Approach This

This article will begin by providing background history on how researchers began to model AI to mimic the human brain and end by discussing the challenges currently being faced by researchers in attempting to imitate the human brain. Below is an in-depth description of what to expect from each section.

It is worth noting that while this topic is an inherently broad one, I will try to be as brief and succinct as possible to keep this engaging. I plan to treat sub-topics which have more intricate sub-branches as standalone articles. I'll also leave references at the end of the article.

Here's a brief outline of what we'll cover:

History of the brain-inspired approach to AI: Here we'll discuss how scientists Norbert Wiener and Warren McCulloch brought about the convergence of neuroscience and computer science. We'll also discuss how Frank Rosenblatt's Perceptron was the first real attempt to mimic human intelligence. And we'll learn how its failure brought about ground-breaking work which would serve as the platform for Neural Networks.

How the human brain works and how it relates to AI systems: In this section, we'll dive into the biological basis for the brain-inspired approach to AI. We will discuss the basic structure and functions of the human brain, understand its core building block, the neuron, and how they work together to process information and enable complex actions.

The Core Principles behind the brain-inspired Approach to AI: Here we will discuss the fundamental concepts behind the brain-inspired approach to AI. We will explain how concepts such as neural networks, hierarchical processing, and plasticity work. We'll also learn how the techniques of parallel processing, distributed representations, and recurrent feedback help AI mimic the brain's functioning.

Challenges in building AI systems modeled after the human brain: Here we will talk about the challenges and limitations inherent in attempting to build systems that mimic the human brain, such as the complexity of the brain and the lack of a unified theory of cognition. We'll also explore the ways these challenges and limitations are being addressed.

Let's begin!

The History of the Brain-inspired Approach to AI

The drive to build machines that are capable of intelligent behaviour owes much inspiration to MIT Professor Norbert Wiener. Norbert Wiener was a child prodigy who could read by the age of three. He had broad knowledge of various fields such as mathematics, neurophysiology, medicine, and physics.

Norbert Wiener believed that the main opportunities in science lay in exploring what he termed Boundary Regions. These are areas of study that are not clearly within a certain discipline but rather a mixture of disciplines, like the study of medicine and engineering coming together to create the field of Medical Engineering. He was quoted saying:

"If the difficulty of a physiological problem is mathematical in nature, ten physiologists ignorant of mathematics will get precisely as far as one physiologist ignorant of mathematics."

In 1934, Wiener and a couple of other academics gathered monthly to discuss papers involving boundary region science.

Norbert Wiener

He described it as "a perfect catharsis for half-baked ideas, insufficient self-criticism, exaggerated self-confidence and pomposity."

From these sessions and from his own personal research, Wiener learned about new research on biological nervous systems as well as about pioneering work on electronic computers.

His natural inclination was to blend these two fields, so a relationship between neuroscience and computer science was formed. This relationship became the cornerstone for the creation of artificial intelligence as we know it.

After World War II, Wiener began forming theories about intelligence in both humans and machines and this new field was named Cybernetics. Wiener’s foray into Cybernetics was successful in getting scientists talking about the possibility of biology fusing with engineering.

One of these scientists was a neurophysiologist named Warren McCulloch. He dropped out of Haverford College and went to Yale to study philosophy and psychology. While attending a scientific conference in New York, he came across papers written by colleagues on biological feedback mechanisms.

The following year, in collaboration with his brilliant 18-year-old protégé named Walter Pitts, McCulloch proposed a theory about how the brain works. This theory would help foster the widespread perception that computers and brains function essentially in the same way.

They based their conclusions on research by McCulloch on the possibility of neurons processing Binary Numbers (computers communicate via binary numbers). This theory became the foundation for what became the first model of an artificial neural network, which was named the McCulloch-Pitts Neuron (MCP).
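To give a feel for the idea, here is a deliberately simplified sketch of a McCulloch-Pitts style neuron in Python. It only handles excitatory binary inputs and a fixed threshold. The function names and thresholds are chosen for illustration and are not taken from the original paper.

```python
def mcp_neuron(inputs, threshold):
    """Fire (return 1) when enough binary inputs are active, else return 0."""
    return 1 if sum(inputs) >= threshold else 0

def logical_and(a, b):
    # Both inputs must be active for the neuron to fire
    return mcp_neuron([a, b], threshold=2)

def logical_or(a, b):
    # A single active input is enough for the neuron to fire
    return mcp_neuron([a, b], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b))
```

Changing only the threshold turns the same unit into a different logic gate, which is why this model made brains and binary computers look so closely related.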

The MCP served as the foundation for the creation of the first-ever neural network, known as the Perceptron. The Perceptron was created by psychologist Frank Rosenblatt. Inspired by the synapses in the brain, he reasoned that since the human brain could process and classify information through synapses (communication between neurons), then perhaps a digital computer could do the same via a neural network.

The Perceptron essentially scaled the MCP neuron from one artificial neuron into a network of neurons. But unfortunately, the Perceptron had some technical challenges which limited its practical application. The most notable of these limitations was its inability to perform complex operations (like classifying between more than two items – for example, the Perceptron could not perform classification between a cat, a dog, and a bird).

In 1969, Marvin Minsky and Seymour Papert published a book titled Perceptrons, which laid out in detail the flaws of the Perceptron. Because of this, research on Artificial Neural Networks stagnated until the proposal of Back Propagation by Paul Werbos.

Back Propagation aimed to solve the issue of classifying complex data that hindered the industrial application of Neural Networks at the time. It was inspired by synaptic plasticity – the way the brain modifies the strengths of connections between neurons and as such improves performance.

Back Propagation was designed to mimic the process in the brain that strengthens connections between neurons via a process called weight adjustment.

Despite the early proposal by Paul Werbos, the concept of back propagation only gained widespread adoption when researchers such as David Rumelhart, Geoffrey Hinton, and Ronald Williams published papers that demonstrated its effectiveness for training neural networks.

The implementation of back propagation led to the creation of Deep Learning which powers most of the AI systems available in the world.

"People are smarter than today's computers because the brain employs a basic computational architecture that is more suited to deal with a central aspect of the natural information processing tasks that people are so good at." – Parallel Distributed Processing

Illustration of how the brain's cells process information

We have discussed how researchers began to model AI to mimic the human brain. Now let's look at how the brain works and define the relationship between the brain and AI systems.

How the brain works – a simplified description

The human brain essentially processes thoughts via the use of neurons. A neuron is made up of 3 core sections: the dendrite, axon, and the soma.

The dendrite is responsible for receiving signals from other neurons. The soma processes information received from the dendrite, and the axon is responsible for transferring the processed information to the next dendrite in the sequence.

To grasp how the brain processes thought, imagine you see a car coming towards you. Your eyes immediately send electrical signals to your brain through the optic nerve. Then the brain forms a chain of neurons to make sense of the incoming signal.

So the first neuron in the chain collects the signal through its dendrites and sends it to the soma to process the signal. After the soma finishes with its task, it sends the signal to the axon which then sends it to the dendrite of the next neuron in the chain.

The connection between an axon and a dendrite through which information passes is called a synapse. The entire process continues until the brain finds a Spatiotemporal Synaptic Input (that's scientific lingo for: the brain continues processing until it finds an optimal response to the signal sent to it). Then it sends signals to the necessary effectors, for example your legs, telling them to run away from the oncoming car.

The relationship between the brain and AI systems

The relationship between the brain and AI is largely mutually beneficial. The brain is the main source of inspiration behind the design of AI systems, and advances in AI lead to a better understanding of the brain and how it works.

There is a reciprocal exchange of knowledge and ideas when it comes to the brain and AI. There are several examples that attest to the positively symbiotic nature of this relationship:

Neural Networks: Arguably the most significant impact made by the human brain to the field of Artificial Intelligence is the creation of Neural Networks. In essence, Neural Networks are computational models that mimic the function and structure of biological neurons. The architecture of neural networks and their learning algorithms are largely inspired by the way neurons in the brain interact and adapt.
Brain Simulations: AI systems have been used to simulate the human brain and study its interactions with the physical world. For example, researchers have used Machine Learning techniques to simulate the activity of biological neurons involved in visual processing. The result has provided insight into how the brain handles visual information.
Insights into the brain: Researchers have begun using Machine Learning algorithms to analyse and gain insights from brain data such as fMRI scans. These insights serve to identify patterns and relationships which would otherwise have remained hidden. They can help us understand internal cognitive functions, memory, and decision-making. They also help in the treatment of brain-native illnesses such as Alzheimer's.

Core Principles Behind the Brain-inspired Approach to AI

Here we will discuss several concepts which help AI imitate the way the human brain functions. These concepts have helped AI researchers create more powerful and intelligent systems which are capable of performing complex tasks.

Neural Networks

As discussed earlier, neural networks have arguably derived the most significant inspiration from the human brain and have made the biggest impact on the field of Artificial Intelligence.

In essence, Neural Networks are computational models that mimic the function and structure of biological neurons. The networks are made up of various layers of interconnected nodes, called artificial neurons, which aid in the processing and transmitting of information. This is similar to what is done by dendrites, somas, and axons in biological neural networks.

Neural Networks are architected to learn from past experiences the same way the brain does.
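
To make the idea of layers of interconnected artificial neurons concrete, here is a minimal sketch in plain Python. The weights, biases, and the choice of a sigmoid activation are made-up values for illustration, not taken from any real model:

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of its inputs plus a bias, then an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Two inputs feed a hidden layer of two neurons; their outputs feed one output neuron.
hidden = layer([1.0, 0.0], [[0.5, -0.5], [-1.0, 1.0]], [0.0, 0.0])
output = neuron(hidden, [1.0, 1.0], 0.0)
print(hidden, output)
```

Learning, which this sketch leaves out, consists of adjusting the weights and biases so that the output moves closer to the desired answer.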

Distributed Representations

Distributed representations are a way of encoding concepts or ideas in a neural network as a pattern of activity spread across several nodes, rather than in a single dedicated node.

For example, the concept of smoking could be represented (encoded) using a certain set of nodes in a neural network. So if a network comes across an image of a person smoking, it then uses those selected nodes to make sense of the image (it's a lot more complex than that but for the sake of simplicity we'll leave it at that).

This technique helps AI systems remember complex concepts or relationships between concepts the same way the brain recognizes and remembers complex stimuli.
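
As a rough illustration, a concept can be written as a list of activation values over the same set of nodes. The numbers below are hypothetical, chosen only to show that related concepts end up with similar patterns while unrelated ones do not:

```python
import math

def cosine_similarity(a, b):
    """Similarity of two activation patterns: 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical activations of the same four nodes for three concepts.
smoking = [0.9, 0.8, 0.1, 0.0]
cigarette = [0.8, 0.9, 0.2, 0.0]
swimming = [0.0, 0.1, 0.9, 0.8]

print(cosine_similarity(smoking, cigarette))  # close to 1: overlapping patterns
print(cosine_similarity(smoking, swimming))   # close to 0: mostly different nodes
```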

Recurrent Feedback

This is a technique used in training AI models where the output of a neural network is returned as input to allow the network to integrate its output as extra data input in training. This is similar to how the brain makes use of feedback loops in order to adjust its model based on previous experiences.
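
A heavily simplified sketch of the feedback idea: each step combines the new input with the network's own previous output. The two weights are arbitrary values picked for illustration, and a real recurrent network would learn them and use vectors rather than single numbers:

```python
def step(x, prev_output, w_in=0.5, w_back=0.5):
    """Combine the new input with the network's own previous output."""
    return w_in * x + w_back * prev_output

def run(sequence):
    output = 0.0                  # no history before the first input
    history = []
    for x in sequence:
        output = step(x, output)  # the last output is fed back in as extra input
        history.append(output)
    return history

print(run([1.0, 0.0, 0.0]))  # [0.5, 0.25, 0.125]
```

Even though only the first input is non-zero, its effect lingers in later outputs, which is what gives the network a form of memory.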

Parallel Processing

Parallel processing involves breaking up a complex computational task into smaller pieces and processing those pieces simultaneously on multiple processors to improve speed. This approach enables AI systems to process more input data faster, similar to how the brain is able to perform different tasks at the same time (multi-tasking).

Attention Mechanisms

This technique enables AI models to focus on specific parts of the input data. It is commonly employed in fields such as Natural Language Processing, which deal with complex and cumbersome data.

It is inspired by the brain's ability to attend to only specific parts of a largely distracting environment – like your ability to tune into and interact in one conversation out of a cacophony of conversations.
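
One common way to implement this is scaled dot-product attention. The sketch below handles a single query in plain Python. The keys, values, and query are made-up numbers, and real models apply the same idea to large matrices:

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # how much focus each input item receives
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
weights, context = attend([0.0, 5.0], keys, values)
print(weights, context)
```

The query matches the second and third keys far better than the first, so almost all of the weight goes to those two items and the first is effectively tuned out.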

Reinforcement Learning

Reinforcement Learning is a technique used to train AI systems. It was inspired by how human beings learn skills through trial and error. It involves an AI agent receiving rewards or punishments based on its actions. This enables the agent to learn from its mistakes and be more efficient in its future actions (this technique is commonly used to train agents that play games).
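
The trial-and-error loop can be sketched with a toy problem: an agent repeatedly picks one of three actions, receives a fixed reward or punishment, and keeps a running estimate of how good each action is. The reward values, the epsilon-greedy strategy, and the function name are assumptions made for this illustration:

```python
import random

def train_bandit(rewards, steps=1000, epsilon=0.1, seed=0):
    """Learn by trial and error which action pays best."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # the agent's current belief about each action
    counts = [0] * len(rewards)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))  # explore: try a random action
        else:
            action = max(range(len(rewards)), key=lambda a: estimates[a])  # exploit
        reward = rewards[action]  # a negative value acts as a punishment
        counts[action] += 1
        # Move the estimate toward the observed reward (running average).
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates

estimates = train_bandit([-1.0, 0.5, 0.9])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(estimates, best_action)
```

The agent is punished for its very first choice, settles for the second action, and only discovers the best one through occasional exploration.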

Unsupervised Learning

The brain is constantly receiving new streams of data in the form of sounds, visual content, sensory feelings to the skin, and so on. It has to make sense of it all and attempt to form a coherent and logical understanding of how all these seemingly disparate events affect its physical state.

Take this analogy as an example: you feel water drop on your skin, you hear the sound of water droplets dropping quickly on rooftops, you feel your clothes getting heavy and in that instant, you know rain is falling.

You then search your memory bank to ascertain if you carried an umbrella. If you did, you are fine, otherwise you check to see the distance from your current location to your home. If it is close, you are fine, but otherwise you try to gauge how intense the rain is going to become. If it is a light drizzle you can attempt to continue the journey back to your home, but if it is becoming a heavier shower, then you have to find shelter.

The ability to make sense of seemingly disparate data points (water, sound, feeling, distance) is implemented in Artificial Intelligence in the form of a technique called Unsupervised Learning. It is an AI training technique where AI systems are taught to make sense of raw, unstructured data without explicit labelling (no one tells you rain is falling when it is falling, do they?).
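
Clustering is one well-known unsupervised technique. The sketch below uses a tiny one-dimensional k-means to group unlabelled numbers. The data points and starting centers are made up for the example:

```python
def kmeans_1d(points, centers, iterations=10):
    """Group unlabelled numbers around the nearest of a few moving centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)   # [1.5, 9.5]
print(clusters)  # [[1.0, 1.5, 2.0], [9.0, 9.5, 10.0]]
```

Nobody told the algorithm which numbers belong together. It found the two groups from the structure of the data alone.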

Challenges in Building Brain-Inspired AI Systems

So far, you've learned how researchers used the brain as inspiration for AI systems. We've also discussed how the brain relates to AI and the core principles behind brain-inspired AI.

In this section, we are going to talk about some of the technical and conceptual challenges inherent in building AI systems modeled after the human brain.

Complexity

This is a pretty daunting challenge. The brain-inspired approach to AI is based on modeling the brain and building AI systems after that model. But the human brain is an inherently complex system with roughly 100 billion neurons and approximately 600 trillion synaptic connections (each neuron has, on average, several thousand synaptic connections with other neurons). These synapses are constantly interacting in dynamic and unpredictable ways.

Building AI systems that are aimed to mimic, and perhaps exceed, that complexity is in itself a challenge and requires equally complex statistical models.

Data Requirements for Training Large Models

OpenAI's GPT-4, which is, at the moment, the cutting edge of text-based AI models, was reportedly trained on about 47 gigabytes of data. In comparison, its predecessor GPT-3 was reportedly trained on about 17 gigabytes, which is roughly a third as much. Imagine how much GPT-5 will be trained on.

To get acceptable results, brain-inspired AI systems require vast amounts of data for tasks, especially auditory and visual tasks. This places a lot of emphasis on the creation of data collection pipelines. For instance, Tesla has 780 million miles of driving data and its data collection pipeline adds another million every 10 hours.

Energy Efficiency

Building brain-inspired AI systems that emulate the brain's energy efficiency is a huge challenge. The human brain consumes approximately 20 watts of power. In comparison, Tesla's Autopilot, on specialized chips, consumes about 2,500 watts, and it takes around 7.5 megawatt-hours (MWh) to train an AI model the size of ChatGPT.
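
To put those figures side by side, here is a small back-of-the-envelope calculation. It uses only the numbers quoted above (20 watts for the brain, 7.5 MWh for training) and asks how long a brain could run on the energy of one training run:

```python
BRAIN_POWER_WATTS = 20       # figure quoted above
TRAINING_ENERGY_MWH = 7.5    # figure quoted above

training_energy_wh = TRAINING_ENERGY_MWH * 1_000_000    # 1 MWh = 1,000,000 Wh
brain_hours = training_energy_wh / BRAIN_POWER_WATTS    # hours a 20 W brain could run
brain_years = brain_hours / (24 * 365)

print(f"{brain_hours:,.0f} hours, or about {brain_years:.1f} years")
# 375,000 hours, or about 42.8 years
```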

The Explainability Problem

Developing brain-inspired AI systems that can be trusted by users is crucial to the growth and adoption of AI – but therein lies the problem.

The brain, which AI systems are meant to be modeled after, is essentially a black box. The inner workings of the brain are not easy to understand, partly because of a lack of information surrounding how the brain processes thought.

There is no lack of research on the biological structure of the human brain, but there is a certain lack of empirical information on the functional qualities of the brain – that is, how thought is formed, how deja vu occurs, and so on. This leads to problems in the building of brain-inspired AI systems.

The Interdisciplinary Requirements

The act of building brain-inspired AI systems requires the knowledge of experts in different fields, like Neuroscience, Computer Science, Engineering, Philosophy, and Psychology.

But this presents challenges, both logistical and foundational: getting experts from different fields is financially expensive. Also, there's the problem of knowledge conflict – it can be really difficult to get an engineer to care about the psychological effects of what they're building, not to mention the problem of colliding egos.

Summary

While the brain-inspired approach seems like the obvious route to building AI systems, it has its challenges. But we can look to the future with the hope that efforts are being made to solve these problems.



How to Detect Objects in Images Using the YOLOv8 Neural Network

freeCodeCamp — Thu, 04 May 2023 18:17:42 +0000

By Andrey Germanov

Object detection is a computer vision task that involves identifying and locating objects in images or videos. It is an important part of many applications, such as self-driving cars, robotics, and video surveillance.

Over the years, many methods and algorithms have been developed to find objects in images and their positions. The best quality in performing these tasks comes from using convolutional neural networks.

One of the most popular neural networks for this task is YOLO, created in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi in their famous research paper "You Only Look Once: Unified, Real-Time Object Detection".

Since that time, there have been quite a few versions of YOLO. Recent releases can do even more than object detection. The newest release is YOLOv8, which we are going to use in this tutorial.

Here, I will show you the main features of this network for object detection. First, we will use a pre-trained model to detect common object classes like cats and dogs. Then, I will show how to train your own model to detect specific object types that you select, and how to prepare the data for this process. Finally, we will create a web application to detect objects on images right in a web browser using the custom trained model.

To follow this tutorial, you should be familiar with Python and have a basic understanding of machine learning, neural networks, and their application in object detection. You can watch this short video course to familiarize yourself with all required machine learning theory.

Once you've refreshed the theory, let's get started with the practice! Here's what we'll cover:

Problems YOLOv8 Can Solve
How to Get Started with YOLOv8
How to Prepare Data to Train the YOLOv8 Model
How to Train the YOLOv8 Model
How to Create an Object Detection Web Service
How to Create the Frontend
How to Create the Backend
Conclusion

Problems YOLOv8 Can Solve

You can use the YOLOv8 network to solve classification, object detection, and image segmentation problems. All these methods detect objects in images or in videos in different ways, as you can see in the image below:

Common computer vision problems - classification, detection, and segmentation

The neural network that's created and trained for image classification determines a class of object on the image and returns its name and the probability of this prediction.

For example, on the left image, it returned that this is a "cat" and that the confidence level of this prediction is 92% (0.92).

The neural network for object detection, in addition to the object type and probability, returns the coordinates of the object on the image: x, y, width and height, as shown on the second image. Object detection neural networks can also detect several objects in the image and their bounding boxes.

Finally, in addition to object types and bounding boxes, the neural network trained for image segmentation detects the shapes of the objects, as shown on the right image.

There are many different neural network architectures developed for these tasks, and in the past you had to use a separate network for each of them. Fortunately, things changed after YOLO was created. Now you can use a single platform for all these problems.

In this article, we will explore object detection using YOLOv8. I will guide you through how to create a web application that will detect traffic lights and road signs in images. In later articles I will cover other features, including image segmentation.

In the next sections, we will go through all steps required to create an object detector. By the end of this tutorial, you will have a complete AI powered web application.

How to Get Started with YOLOv8

Technically speaking, YOLOv8 is a group of convolutional neural network models, created and trained using the PyTorch framework.

In addition, the YOLOv8 package provides a single Python API to work with all of them using the same methods. That is why, to use it, you need an environment to run Python code. I highly recommend using Jupyter Notebook.

After making sure that you have Python and Jupyter installed on your computer, run the notebook and install the YOLOv8 package in it by running the following command:

!pip install ultralytics

The ultralytics package has the YOLO class, used to create neural network models.

To get access to it, import it to your Python code:

from ultralytics import YOLO

Now everything is ready to create the neural network model:

model = YOLO("yolov8m.pt")

As I mentioned before, YOLOv8 is a group of neural network models. These models were created and trained using PyTorch and exported to files with the .pt extension.

There are three types of models and 5 models of different sizes for each type:

Classification	Detection	Segmentation	Kind
yolov8n-cls.pt	yolov8n.pt	yolov8n-seg.pt	Nano
yolov8s-cls.pt	yolov8s.pt	yolov8s-seg.pt	Small
yolov8m-cls.pt	yolov8m.pt	yolov8m-seg.pt	Medium
yolov8l-cls.pt	yolov8l.pt	yolov8l-seg.pt	Large
yolov8x-cls.pt	yolov8x.pt	yolov8x-seg.pt	Huge

The bigger the model you choose, the better the prediction quality you can achieve, but the slower it will work.

In this tutorial I will cover object detection – which is why, in the previous code snippet, I selected "yolov8m.pt", which is a medium-sized model for object detection.

When you run this code for the first time, it will download the yolov8m.pt file from the Ultralytics server to the current folder. Then it will construct the model object. Now you can train this model, detect objects, and export it to use in production. For all these tasks, there are convenient methods:

train({path to dataset descriptor file}) – used to train the model on the images dataset.
predict({image}) – used to make a prediction for a specified image, for example to detect bounding boxes of all objects that the model can find in the image.
export({format}) – used to export the model from the default PyTorch format to a specified format.

All YOLOv8 models for object detection ship already pre-trained on the COCO dataset, which is a huge collection of images containing objects of 80 different types. So, if you do not have specific needs, then you can just run it as is, without additional training.

For example, you can download this image as "cat_dog.jpg":

A sample image with cat and dog

and run predict to detect all objects in it:

results = model.predict("cat_dog.jpg")

The predict method accepts many different input types, including a path to a single image, an array of paths to images, the Image object of the well-known PIL Python library, and others.

After running the input through the model, it returns an array of results for each input image. As we provided only a single image, it returns an array with a single item that you can extract like this:

result = results[0]

The result contains detected objects and convenient properties to work with them. The most important one is the boxes array with information about detected bounding boxes on the image. You can determine how many objects it detected by running the len function:

len(result.boxes)

When I ran this, I got "2", which means that there are two boxes detected: one for the dog and one for the cat.

Then you can analyze each box either in a loop or manually. Let's get the first one:

box = result.boxes[0]

The box object contains the properties of the bounding box, including:

xyxy – the coordinates of the box as an array [x1,y1,x2,y2]
cls – the ID of object type
conf – the confidence level of the model about this object. If it's very low, like < 0.5, then you can just ignore the box.
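
The advice about ignoring low-confidence boxes can be sketched as a small filter. To stay runnable without the model, this example works on plain dictionaries with made-up values rather than on real box objects. The dictionary keys and the filter_detections name are assumptions for this illustration:

```python
def filter_detections(detections, min_conf=0.5):
    """Keep only detections whose confidence is at least min_conf."""
    return [d for d in detections if d["conf"] >= min_conf]

# Made-up detections: class ID, confidence, and [x1, y1, x2, y2] coordinates.
detections = [
    {"cls": 16, "conf": 0.95, "xyxy": [261, 94, 461, 313]},
    {"cls": 15, "conf": 0.92, "xyxy": [140, 170, 256, 316]},
    {"cls": 56, "conf": 0.31, "xyxy": [10, 20, 50, 80]},  # too uncertain, dropped
]

kept = filter_detections(detections)
print(len(kept))  # 2
```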

Let's print information about the detected box:

print("Object type:", box.cls)
print("Coordinates:", box.xyxy)
print("Probability:", box.conf)

For the first box, you will receive the following information:

Object type: tensor([16.])
Coordinates: tensor([[261.1901,  94.3429, 460.5649, 312.9910]])
Probability: tensor([0.9528])

As I explained above, YOLOv8 contains PyTorch models. The outputs from the PyTorch models are encoded as an array of PyTorch Tensor objects, so you need to extract the first item from each of these arrays:

print("Object type:",box.cls[0])
print("Coordinates:",box.xyxy[0])
print("Probability:",box.conf[0])

Object type: tensor(16.)
Coordinates: tensor([261.1901,  94.3429, 460.5649, 312.9910])
Probability: tensor(0.9528)

Now you see the data as Tensor objects. To unpack the actual values from a Tensor, use the .tolist() method for tensors that contain an array, and the .item() method for tensors that contain a scalar value.

Let's extract the data to the appropriate variables:

cords = box.xyxy[0].tolist()
class_id = box.cls[0].item()
conf = box.conf[0].item()
print("Object type:", class_id)
print("Coordinates:", cords)
print("Probability:", conf)

Object type: 16.0
Coordinates: [261.1900634765625, 94.3428955078125, 460.5649108886719, 312.9909973144531]
Probability: 0.9528293609619141

Now you see the actual data. The coordinates can be rounded, and the probability also can be rounded to two digits after the dot.

The object type is 16 here. What does this mean? Let's talk more about that.

All objects that the neural network can detect have numeric IDs. In case of a YOLOv8 pretrained model, there are 80 object types with IDs from 0 to 79. The COCO object classes are well known and you can easily google them on the Internet. In addition, the YOLOv8 result object contains the convenient names property to get these classes:

print(result.names)

{0: 'person',
 1: 'bicycle',
 2: 'car',
 3: 'motorcycle',
 4: 'airplane',
 5: 'bus',
 6: 'train',
 7: 'truck',
 8: 'boat',
 9: 'traffic light',
 10: 'fire hydrant',
 11: 'stop sign',
 12: 'parking meter',
 13: 'bench',
 14: 'bird',
 15: 'cat',
 16: 'dog',
 17: 'horse',
 18: 'sheep',
 19: 'cow',
 20: 'elephant',
 21: 'bear',
 22: 'zebra',
 23: 'giraffe',
 24: 'backpack',
 25: 'umbrella',
 26: 'handbag',
 27: 'tie',
 28: 'suitcase',
 29: 'frisbee',
 30: 'skis',
 31: 'snowboard',
 32: 'sports ball',
 33: 'kite',
 34: 'baseball bat',
 35: 'baseball glove',
 36: 'skateboard',
 37: 'surfboard',
 38: 'tennis racket',
 39: 'bottle',
 40: 'wine glass',
 41: 'cup',
 42: 'fork',
 43: 'knife',
 44: 'spoon',
 45: 'bowl',
 46: 'banana',
 47: 'apple',
 48: 'sandwich',
 49: 'orange',
 50: 'broccoli',
 51: 'carrot',
 52: 'hot dog',
 53: 'pizza',
 54: 'donut',
 55: 'cake',
 56: 'chair',
 57: 'couch',
 58: 'potted plant',
 59: 'bed',
 60: 'dining table',
 61: 'toilet',
 62: 'tv',
 63: 'laptop',
 64: 'mouse',
 65: 'remote',
 66: 'keyboard',
 67: 'cell phone',
 68: 'microwave',
 69: 'oven',
 70: 'toaster',
 71: 'sink',
 72: 'refrigerator',
 73: 'book',
 74: 'clock',
 75: 'vase',
 76: 'scissors',
 77: 'teddy bear',
 78: 'hair drier',
 79: 'toothbrush'}

This dictionary has everything that this model can detect. Now you can see that 16 is "dog", so this bounding box is the bounding box for the detected dog.

Let's modify the output to show results in a more representative way:

cords = box.xyxy[0].tolist()
cords = [round(x) for x in cords]
class_id = result.names[box.cls[0].item()]
conf = round(box.conf[0].item(), 2)
print("Object type:", class_id)
print("Coordinates:", cords)
print("Probability:", conf)

In this code I rounded all coordinates using Python list comprehension. Then I got the name of the detected object class by ID using the result.names dictionary. I also rounded the probability. You should get the following output:

Object type: dog
Coordinates: [261, 94, 461, 313]
Probability: 0.95

This data is good enough to show in the user interface. Let's now write some code to get this information for all detected boxes in a loop:

for box in result.boxes:
  class_id = result.names[box.cls[0].item()]
  cords = box.xyxy[0].tolist()
  cords = [round(x) for x in cords]
  conf = round(box.conf[0].item(), 2)
  print("Object type:", class_id)
  print("Coordinates:", cords)
  print("Probability:", conf)
  print("---")

This code will do the same for each box and will output the following:

Object type: dog
Coordinates: [261, 94, 461, 313]
Probability: 0.95
---
Object type: cat
Coordinates: [140, 170, 256, 316]
Probability: 0.92
---

This way you can run object detection for other images and see everything that a COCO-trained model can detect in them.

This video shows the whole coding session of this section in Jupyter Notebook, assuming you have it installed.

Using models that are pre-trained on well-known objects is ok to start. But in practice, you may need a solution to detect specific objects for a concrete business problem.

For example, someone may need to detect specific products on supermarket shelves or discover brain tumors on x-rays. It's highly likely that this information is not available in public datasets, and there are no free models that know about everything.

So, you have to teach your own model to detect these types of objects. To do that, you need to create a database of annotated images for your problem and train the model on these images.

How to Prepare Data to Train the YOLOv8 Model

To train the model, you need to prepare annotated images and split them into training and validation datasets.

You'll use the training set to teach the model and the validation set to test the results of training and measure the quality of the trained model. You can put 80% of the images in the training set and 20% in the validation set.
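
A minimal sketch of such an 80/20 split, assuming you have a list of image file names. The split_dataset name, the fixed seed, and the example file names are choices made for this illustration:

```python
import random

def split_dataset(filenames, train_fraction=0.8, seed=42):
    """Shuffle the file names reproducibly and cut them into two lists."""
    files = sorted(filenames)            # sort first so the result does not depend on input order
    random.Random(seed).shuffle(files)   # a fixed seed makes the split repeatable
    cut = int(len(files) * train_fraction)
    return files[:cut], files[cut:]

images = [f"img_{i:03d}.jpg" for i in range(10)]
train_files, val_files = split_dataset(images)
print(len(train_files), len(val_files))  # 8 2
```

Shuffling before cutting matters: if the images are ordered by class or by date, a plain slice would give the two sets very different content.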

These are the steps that you need to follow to create each of the datasets:

Decide on and encode classes of objects you want to teach your model to detect. For example, if you want to detect only cats and dogs, then you can state that "0" is cat and "1" is dog.
Create a folder for your dataset and two subfolders in it: "images" and "labels".
Add the images to the "images" subfolder. The more images you collect, the better for training.
For each image, create an annotation text file in the "labels" subfolder. Annotation text files should have the same names as the image files and the ".txt" extension. In each annotation file, add a record for each object that exists in the corresponding image, in the following format:

{object_class_id} {x_center} {y_center} {width} {height}

Bounding box parameters

This is the most time-consuming manual work in the machine learning process: to measure bounding boxes for all objects and add them to annotation files.

You should also normalize the coordinates to fit in a range from 0 to 1. To calculate them, you need to use the following formulas:

x_center = (box_x_left+box_width/2)/image_width
y_center = (box_y_top+box_height/2)/image_height
width = box_width/image_width
height = box_height/image_height

For example, if you want to add the "cat_dog.jpg" image that we used before to the dataset, you need to copy it to the "images" folder and then measure and collect the following data about the image and its bounding boxes:

Image:

image_width = 612
image_height = 415

Objects:

Object	box_x_left	box_y_top	box_width	box_height
Dog	261	94	200	219
Cat	140	170	116	146

Then, create the "cat_dog.txt" file in the "labels" folder and, using the formulas above, calculate the coordinates:

Dog (class id=1):

x_center = (261+200/2)/612 = 0.589869281
y_center = (94+219/2)/415 = 0.490361446
width = 200/612 = 0.326797386
height = 219/415 = 0.527710843

Cat (class id=0)

x_center = (140+116/2)/612 = 0.323529412
y_center = (170+146/2)/415 = 0.585542169
width = 116/612 = 0.189542484
height = 146/415 = 0.351807229

and add the following lines to the file:

1 0.589869281 0.490361446 0.326797386 0.527710843
0 0.323529412 0.585542169 0.189542484 0.351807229

The first line contains a bounding box for the dog (class id=1). The second line contains a bounding box for the cat (class id=0). Of course, you can have the image with many dogs and many cats at the same time, and you can add bounding boxes for all of them.
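
If you would rather not do this arithmetic by hand, the same formulas can be wrapped in a small helper. The to_yolo_line name and its parameter order are assumptions for this sketch. The input numbers are the measurements of the dog and cat boxes used above:

```python
def to_yolo_line(class_id, x_left, y_top, box_width, box_height,
                 image_width, image_height):
    """Convert a pixel bounding box to one line of a YOLO annotation file."""
    x_center = (x_left + box_width / 2) / image_width
    y_center = (y_top + box_height / 2) / image_height
    width = box_width / image_width
    height = box_height / image_height
    return f"{class_id} {x_center:.9f} {y_center:.9f} {width:.9f} {height:.9f}"

dog_line = to_yolo_line(1, 261, 94, 200, 219, 612, 415)
cat_line = to_yolo_line(0, 140, 170, 116, 146, 612, 415)
print(dog_line)  # 1 0.589869281 0.490361446 0.326797386 0.527710843
print(cat_line)  # 0 0.323529412 0.585542169 0.189542484 0.351807229
```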

After adding and annotating all images, the dataset is ready. You need to create two datasets and place them in different folders. The final folder structure can look like this:

Dataset structure

As you can see, the training dataset is located in the "train" folder and the validation dataset is located in the "val" folder.

Finally, you need to create a dataset descriptor YAML-file that points to the created datasets and describes the object classes in them. This is a sample of this file for the data created above:

train: ../train/images
val: ../val/images

nc: 2
names: ['cat','dog']

In the first two lines, you need to specify paths to the images of the training and the validation datasets. The paths can be either relative to the current folder or absolute.

Then, the nc line specifies the number of classes that exist in these datasets, and names is an array of class names in correct order.

Indexes of these items are numbers that you used when annotating the images, and these indexes will be returned by the model when it detects objects using the predict method. So, if you used "0" for cats, then it should be the first item in the names array.

This YAML file should be passed to the train method of the model to start the training process.

To make the image annotation process easier, there are a lot of programs you can use to visually annotate images for machine learning. You can search for something like "software to annotate images for machine learning" to get a list of these programs.

There are also many online tools that can do all this work, like Roboflow Annotate. Using this service, you just need to upload your images, draw bounding boxes on them, and set classes for each bounding box. Then, the tool will automatically create annotation files, split your data to train and validation datasets, and create a YAML descriptor file. Then you can export and download the annotated data as a ZIP file.

In the video below, I show you how to use Roboflow to create the "cats and dogs" micro-dataset.

For real life problems, that database should be much bigger. To train a good model, you should have hundreds or thousands of annotated images.

Also, when preparing the images database, try to make it balanced. It should have an equal number of objects of each class, that is an equal number of dogs and cats in this example. Otherwise, the model trained on it may predict one class better than another.
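
One quick way to check the balance is to count how many objects of each class appear in your annotation lines. This sketch takes the lines as a list of strings so that it runs on its own. Reading them from the files in your "labels" folder is left out, and count_classes is a name made up for this example:

```python
from collections import Counter

def count_classes(label_lines, names):
    """Count annotated objects per class name across annotation lines."""
    counts = Counter()
    for line in label_lines:
        if line.strip():                      # skip empty lines
            class_id = int(line.split()[0])   # the first field is the class ID
            counts[names[class_id]] += 1
    return counts

lines = [
    "1 0.589869281 0.490361446 0.326797386 0.527710843",
    "0 0.323529412 0.585542169 0.189542484 0.351807229",
]
counts = count_classes(lines, ["cat", "dog"])
print(counts)
```

If one class turns out to have far fewer objects than the others, that is a sign you should collect and annotate more images for it.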

After the data is ready, copy it to the folder with the Python code that you will use for training and return to your Jupyter Notebook to start the training process.

How to Train the YOLOv8 Model

After the data is ready, you need to pass it through the model. To make it more interesting, we will not use this small "cats and dogs" dataset. We will use another custom dataset for training that contains traffic lights and road signs. This is a free dataset that I got from the Roboflow Universe. Press "Download Dataset" and select "YOLOv8" as the format.

If it's not available on Roboflow when you read this, then you can get it from my Google Drive. You can use this dataset to teach YOLOv8 to detect different objects on roads, like you can see in the next screenshot.

Traffic lights detection demo

You can open the downloaded zip file and ensure that it's already annotated and structured using the rules described above. You can find the dataset descriptor file data.yaml in the archive as well.

If you downloaded the archive from Roboflow, it will contain the additional "test" dataset, which is not used by the training process. You can use the images from it for additional testing on your own after training.

Extract the archive to the folder with your Python code and execute the train method to start a training loop:

model.train(data="data.yaml", epochs=30)

The data is the only required option. You have to pass the YAML descriptor file to it. The epochs option specifies the number of training cycles (100 by default). There are other options that can affect the process and quality of the trained model.

Each training cycle consists of two phases: a training phase and a validation phase.

During the training phase, the train method does the following:

Extracts the random batch of images from the training dataset (the number of images in the batch can be specified using the batch option).
Passes these images through the model and receives the resulting bounding boxes of all detected objects and their classes.
Passes the result to the loss function that's used to compare the received output with correct result from annotation files for these images. The loss function calculates the amount of error.
The result of the loss function is passed to the optimizer to adjust the model weights based on the amount of error in the correct direction. This reduces the errors in the next cycle. By default, the SGD (Stochastic Gradient Descent) optimizer is used, but you can try others, like Adam, to see the difference.
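
The four steps above are the same in any gradient-based training loop. The toy sketch below is not YOLOv8's actual code. It fits a single weight w so that w * x matches y, and it uses made-up data, mean squared error as the loss, and plain stochastic gradient descent as the optimizer:

```python
import random

def train_linear(xs, ys, epochs=200, batch_size=2, lr=0.01, seed=0):
    """Fit y = w * x with mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    w = 0.0
    losses = []
    for _ in range(epochs):
        batch = rng.sample(range(len(xs)), batch_size)       # 1. pick a random batch
        preds = [w * xs[i] for i in batch]                   # 2. pass it through the model
        errors = [p - ys[i] for p, i in zip(preds, batch)]
        loss = sum(e * e for e in errors) / batch_size       # 3. loss: amount of error
        grad = sum(2 * e * xs[i] for e, i in zip(errors, batch)) / batch_size
        w -= lr * grad                                       # 4. optimizer adjusts the weight
        losses.append(loss)
    return w, losses

w, losses = train_linear([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(round(w, 3), losses[0], losses[-1])
```

The weight ends up very close to 2, and the loss shrinks from the first step to the last. A real detector has millions of weights and a more elaborate loss, but it follows the same pattern.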

During the validation phase, train does the following:

Extracts the images from the validation dataset.
Passes them through the model and receives the detected bounding boxes for these images.
Compares the received result with true values for these images from annotation text files.
Calculates the precision of the model based on the difference between actual and expected results.

The progress and results of each phase for each epoch are displayed on the screen. This way you can see how the model learns and improves from epoch to epoch.

When you run the train code, you will see a similar output to the following during the training loop:

Training process

For each epoch it shows a summary for both the training and validation phases: lines 1 and 2 show results of the training phase and lines 3 and 4 show the results of the validation phase for each epoch.

The training phase includes a calculation of the amount of error in a loss function, so the most valuable metrics here are box_loss and cls_loss.

box_loss shows the amount of error in detected bounding boxes.
cls_loss shows the amount of error in detected object classes.

Why is the loss split to different metrics? Because the model might correctly detect the bounding box coordinates around the object, but incorrectly detect the object class in this box. For example, in my practice, it detected the dog as a horse, but the dimensions of the object were detected correctly.

If the model really learns something from the data, then you should see these values decrease from epoch to epoch. In the previous screenshot, the box_loss decreased: 0.7751, 0.7473, 0.742 and the cls_loss decreased too: 0.702, 0.6422, 0.6211.

In the validation phase, it calculates the quality of the model after training using the images from the validation dataset.

The most valuable quality metric is mAP50-95, which is Mean Average Precision averaged over IoU (Intersection over Union) thresholds from 0.50 to 0.95. If the model learns and improves, this value should grow from epoch to epoch. In the previous screenshot you can see that it slowly grew: 0.788, 0.788, 0.791.
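
Detection metrics like this one are built on Intersection over Union (IoU), which measures how well a predicted box overlaps the true box: the area shared by the two boxes divided by the area they cover together. Here is a minimal sketch for boxes given as [x1, y1, x2, y2]. The iou name and the sample boxes are made up for illustration:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in [x1, y1, x2, y2] format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0

print(iou([0, 0, 10, 10], [0, 0, 10, 10]))    # 1.0: perfect match
print(iou([0, 0, 10, 10], [20, 20, 30, 30]))  # 0.0: no overlap
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))    # 25 / 175, about 0.143
```

A prediction counts as correct at a given threshold only if its IoU with the true box reaches that threshold, so a higher threshold demands tighter boxes.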

If after the last epoch you did not get acceptable precision, you can increase the number of epochs and run the training again. Also, you can tune other parameters like batch, lr0, lrf or change the optimizer you're using. There are no clear rules on what to do here, but there are a lot of recommendations.

The topic of tuning the parameters of the training process goes beyond the scope of this article. I think it's possible to write a book about this, and many of them already exist. You can easily find them on the Internet. But in a few words, most of them say that you need to experiment, try different options, and compare the results.

In addition to the metrics shown during training, the train method writes a lot of statistics to disk. When training starts, it creates the runs/detect/train subfolder in the current folder, and after each epoch it writes log files to it.

It also saves the trained model after each epoch to the runs/detect/train/weights/last.pt file and the model with the highest precision to the runs/detect/train/weights/best.pt file. So, after training is finished, you can get the best.pt file to use in production.

You can watch this video to learn more about how the training process works. I used Google Colab which is a cloud version of Jupyter Notebook to get access to hardware with more powerful GPU to speed up the training process.

The video shows how to train the model for 5 epochs and download the final best.pt model. In real-world problems, you need to run many more epochs and be prepared to wait hours or maybe days until training finishes.

After it's finished, it's time to run the trained model in production. In the next section, we will create a web service to detect objects in images online in a web browser.

How to Create an Object Detection Web Service

At this point, we're finished experimenting with the model in the Jupyter Notebook. You'll need to write the next batch of code as a separate project, using any Python IDE like VS Code or PyCharm.

The web service that we are going to create will have a web page with a file input field and an HTML5 canvas element.

When the user selects an image file using the input field, the interface will send it to the backend. Then, the backend will pass the image through the model that we created and trained and return the array of detected bounding boxes to the web page.

When it receives this, the frontend will draw the image on the canvas element and the detected bounding boxes on top of it.

The service will look and work as demonstrated in this video:

In the video, I used the model trained on 30 epochs, and it still does not detect some traffic lights. You can try to train it more to get better results. But the best way to improve the quality of a machine learning model is by adding more and more data.

So, as an additional exercise, you can import the dataset folder to Roboflow, add and annotate more images to it, and then use the updated data to continue training the model.

How to Create the Frontend

To start with, create a folder for a new Python project and an index.html file in it for the frontend web page. Here are the contents of this file:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>YOLOv8 Object Detection</title>
    <style>
        canvas {
            display:block;
            border: 1px solid black;
            margin-top:10px;
        }
    </style>
</head>
<body>
    <input id="uploadInput" type="file"/>
    <canvas></canvas>
    <script>
       /**
       * File input "change" event handler: uploads the selected
       * image file to the backend, receives an array of
       * detected objects and draws them on top of the image
       */
       const input = document.getElementById("uploadInput");
       input.addEventListener("change",async(event) => {
           const file = event.target.files[0];
           const data = new FormData();
           data.append("image_file",file,"image_file");
           const response = await fetch("/detect",{
               method:"post",
               body:data
           });
           const boxes = await response.json();
           draw_image_and_boxes(file,boxes);
       })

       /**
       * Function draws the image from provided file
       * and bounding boxes of detected objects on
       * top of the image
       * @param file Uploaded file object
       * @param boxes Array of bounding boxes in format
         [[x1,y1,x2,y2,object_type,probability],...]
       */
       function draw_image_and_boxes(file,boxes) {
          const img = new Image()
          img.src = URL.createObjectURL(file);
          img.onload = () => {
              const canvas = document.querySelector("canvas");
              canvas.width = img.width;
              canvas.height = img.height;
              const ctx = canvas.getContext("2d");
              ctx.drawImage(img,0,0);
              ctx.strokeStyle = "#00FF00";
              ctx.lineWidth = 3;
              ctx.font = "18px serif";
              boxes.forEach(([x1,y1,x2,y2,label]) => {
                  ctx.strokeRect(x1,y1,x2-x1,y2-y1);
                  ctx.fillStyle = "#00ff00";
                  const width = ctx.measureText(label).width;
                  ctx.fillRect(x1,y1,width+10,25);
                  ctx.fillStyle = "#000000";
                  ctx.fillText(label,x1,y1+18);
              });
          }
       }
    </script>
</body>
</html>

The HTML part is tiny: it consists only of the file input field with the "uploadInput" ID and the canvas element below it.

Then, in the JavaScript part, we define the "change" event handler for the input field. When the user selects an image file, the handler uses fetch to make a POST request to the /detect backend endpoint (which we will create later) and sends this image file to it.

The backend should detect objects in this image and return a response with a boxes array as JSON. This response then gets decoded and passed to the draw_image_and_boxes function along with the image file itself.

The draw_image_and_boxes function loads the image from file. As soon as it's loaded, it draws it on the canvas. Then, it draws each bounding box with a class label on top of the canvas with the image.
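To make the data format concrete, here is a small Python sketch of the same arithmetic the frontend does before drawing: it decodes a JSON array of boxes and turns each [x1, y1, x2, y2, label, probability] row into the x, y, width, and height that a rectangle call needs. The payload values and the helper name `to_rects` are made up for illustration.

```python
import json

# A hypothetical /detect response body with two detected objects
payload = (
    '[[10, 20, 110, 220, "traffic light", 0.92],'
    ' [300, 40, 360, 100, "stop sign", 0.81]]'
)

def to_rects(boxes):
    """Convert [x1, y1, x2, y2, label, prob] rows to (x, y, width, height, label)."""
    return [
        (x1, y1, x2 - x1, y2 - y1, label)
        for x1, y1, x2, y2, label, _prob in boxes
    ]

rects = to_rects(json.loads(payload))
for rect in rects:
    print(rect)
```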

So, now let's create the backend with a /detect endpoint for it.

How to Create the Backend

We'll create the backend using Flask. Flask has its own internal web server, but according to many Flask developers, it's not reliable enough for production. So we will use the Waitress web server and run our Flask app in it.

Also, we will use the Pillow library to read uploaded binary files as images. Make sure you have all these packages installed on your system before continuing:

pip3 install flask
pip3 install waitress
pip3 install pillow

The backend will be in a single file. Let's name it object_detector.py:

from ultralytics import YOLO
from flask import request, Response, Flask
from waitress import serve
from PIL import Image
import json

app = Flask(__name__)

@app.route("/")
def root():
    """
    Site main page handler function.
    :return: Content of index.html file
    """
    with open("index.html") as file:
        return file.read()


@app.route("/detect", methods=["POST"])
def detect():
    """
        Handler of /detect POST endpoint
        Receives uploaded file with a name "image_file", 
        passes it through YOLOv8 object detection 
        network and returns an array of bounding boxes.
        :return: a JSON array of objects bounding 
        boxes in format 
        [[x1,y1,x2,y2,object_type,probability],..]
    """
    buf = request.files["image_file"]
    boxes = detect_objects_on_image(Image.open(buf.stream))
    return Response(
      json.dumps(boxes),  
      mimetype='application/json'
    )


def detect_objects_on_image(buf):
    """
    Function receives an image,
    passes it through YOLOv8 neural network
    and returns an array of detected objects
    and their bounding boxes
    :param buf: Input image file stream
    :return: Array of bounding boxes in format 
    [[x1,y1,x2,y2,object_type,probability],..]
    """
    model = YOLO("best.pt")
    results = model.predict(buf)
    result = results[0]
    output = []
    for box in result.boxes:
        x1, y1, x2, y2 = [
          round(x) for x in box.xyxy[0].tolist()
        ]
        class_id = box.cls[0].item()
        prob = round(box.conf[0].item(), 2)
        output.append([
          x1, y1, x2, y2, result.names[class_id], prob
        ])
    return output

serve(app, host='0.0.0.0', port=8080)

First, we import the required libraries:

ultralytics for the YOLOv8 model.
flask to create a Flask web application, to receive requests from the frontend and send responses back to it.
waitress to run a web server and serve the Flask web app in it.
PIL to load an uploaded file as an Image object, which is what YOLOv8 expects as input.
json to convert the array of bounding boxes to JSON before returning it to the frontend.

Then, we define two routes:

/ that serves as the root of the web service. It just returns the content of the "index.html" file.
/detect that responds to an image upload request from the frontend. It converts the raw uploaded file to a Pillow Image object, then passes this image to the detect_objects_on_image function.

The detect_objects_on_image function creates a model object based on the best.pt model that we trained in the previous section. Make sure that this file exists in the folder where you write the code.

Then it calls the predict method for the image. predict returns the detected bounding boxes.

Next, for each box it extracts the coordinates, class name, and probability in the same way as we did in the beginning of the tutorial. It adds this info to the output array.

Finally, the function returns the array of detected object coordinates and their classes.

After this, the array gets encoded to JSON and is returned to the frontend.
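The conversion step can be tried without a model at all. The sketch below mirrors the loop body from detect_objects_on_image using plain numbers instead of Ultralytics box objects; the helper name `box_to_row` and the class map are hypothetical and only serve the illustration.

```python
import json

def box_to_row(xyxy, class_id, confidence, names):
    """Build one [x1, y1, x2, y2, label, probability] row from plain numbers."""
    x1, y1, x2, y2 = [round(v) for v in xyxy]
    # class_id may arrive as a float (for example 1.0), so cast it to int
    return [x1, y1, x2, y2, names[int(class_id)], round(confidence, 2)]

names = {0: "traffic light", 1: "stop sign"}  # hypothetical class map
row = box_to_row([10.4, 20.6, 110.2, 219.7], 1.0, 0.8734, names)

# This is the shape of the JSON that the frontend receives
print(json.dumps([row]))
```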

The last line of code starts the web server on port 8080 that serves the Flask application.

To run the service, execute the following command:

python3 object_detector.py

If everything is working properly, you can open http://localhost:8080 in a web browser. It should show the index.html page. When you select any image file, it will process it and display bounding boxes around all detected objects (or just display the image if nothing is detected in it).

The web service we just created is universal. You can use it with any YOLOv8 model. At the moment, it detects traffic lights and road signs using the best.pt model we created. But you can change it to use another model, like the yolov8m.pt model we used earlier to detect cats, dogs, and all other object classes that pretrained YOLOv8 models can detect.
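One possible way to switch models without editing the code is to read the file name from an environment variable. This is only a sketch: the variable name YOLO_MODEL_PATH and the helper `resolve_model_path` are assumptions made for this example, not something the service above already supports.

```python
import os

def resolve_model_path(default="best.pt"):
    """Return the model file to load; YOLO_MODEL_PATH is a hypothetical variable name."""
    return os.environ.get("YOLO_MODEL_PATH", default)

# Falls back to best.pt when the variable is not set
print(resolve_model_path())
```

With this in place, the backend could call YOLO(resolve_model_path()) instead of YOLO("best.pt").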

Conclusion

In this tutorial, I guided you through the process of creating an AI-powered web application that uses YOLOv8, a state-of-the-art convolutional neural network for object detection.

I showed you how to create models using the pre-trained models and prepare the data to train custom models. And finally we created a web application with a frontend and backend that uses the custom trained YOLOv8 model to detect traffic lights and road signs.

You can find the source code of this app in this GitHub repository.

For all these tasks, we used the Ultralytics high level APIs that come with the YOLOv8 package by default. These APIs are based on the PyTorch framework, which was used to create most of today's neural networks.

It's quite convenient on the one hand, but dependence on these high level APIs has a negative effect as well. If you need to run this web app in production, you should install all these environments there, including Python, PyTorch and the other dependencies.

To run this on a clean new server, you'll need to download and install more than 1 GB of third party libraries! This is definitely not the best way to go.

Also, what if you do not have Python in your production environment? What if all your other code is written in another programming language, and you do not plan to use Python? Or what if you want to run the model on a mobile phone with Android or iOS?

All this is to say that using Ultralytics packages is great for experimenting, training, and preparing the models for production. But in production itself, you have to load and use the model directly and not use those high-level APIs.

To do this, you need to understand how the YOLOv8 neural network works under the hood and write more code to provide input to the model and to process the output from it. This will make your apps faster and less resource-intensive. You will not need to have PyTorch installed to run your object detection model.

Also, you will be able to run your models even without Python, using many other programming languages, including Julia, C++, Go, and Node.js on the backend, or even without a backend at all. You can run the YOLOv8 models right in a browser, using only JavaScript on the frontend.

Want to know how? This will be the topic of my next article about YOLOv8.

You can find me on LinkedIn, Twitter, and Facebook to be the first to know about new articles like this one and other software development news.

Have fun coding and never stop learning!

Deep Learning with Julia – How to Build and Train a Model using a Neural Network

freeCodeCamp — Tue, 07 Mar 2023 21:34:07 +0000

By Andrey Germanov

Julia is a general purpose programming language well suited for numerical analysis and computational science. Some consider it the future of machine learning and the most natural replacement for Python in this field.

In the previous post "Machine learning with Julia – How to Build and Deploy a Trained AI Model as a Web Service" I introduced the basic machine learning features of Julia and explained why it's so good for this.

In this article, I want to move one step forward and explore deep learning features of Julia to show how you can use it to solve computer vision tasks using neural networks.

Computer vision is one of the most impressive areas of artificial intelligence. It includes such interesting tasks as image classification, text recognition, object detection and image segmentation. Neural networks showed the best performance in solving computer vision problems.

In this tutorial, I will guide you through the process of building and training a neural network to recognize handwritten digits using Julia. I will also explain how to create a website that will use the trained model to read handwritten phone numbers.

Here's what we'll cover:

What should you know in advance
Handwritten digits recognition workflow
How to collect initial image data
How to work with images in Julia
How to prepare the image data for machine learning
How to create a machine learning model
How to train the model
How to evaluate the accuracy of the trained model
How to create and train the convolutional neural network
How to export the trained model to a file
How to create a frontend
How to create a backend
Conclusion

What should you know in advance

This tutorial assumes that you have basic Julia knowledge, which you can get by reading my previous article. That article also includes instructions on how to install Julia and integrate it with Jupyter notebook, which will be used to write most of the code.

The "Handwritten digit recognition using deep learning" problem and the theory behind it are well known. That is why I will cover them only briefly. There are many good resources that explain how neural networks are used to solve image classification tasks. Personally, I recommend watching this video and reading the first chapter of this great online book.

The goal of this tutorial is only to show you how to implement the theory, explained in those resources, using Julia.

Handwritten digits recognition workflow

To build a machine learning model we will use the Flux.jl framework which is a pure Julia implementation of most well-known neural network types including feed forward, convolutional and recurrent networks.

Recognizing handwritten numbers is a supervised machine learning task of image classification. To implement it, you need to have a labeled dataset of handwritten digits and use it to train the machine learning model.

This is how the ML workflow looks:

Collect the images of handwritten digits for recognition.
Prepare a labeled dataset for machine learning by cleaning and labeling the data.
Create a machine learning model to recognize handwritten digits.
Train the model using training dataset.
Evaluate the accuracy of the trained model by feeding it with data from a testing dataset.
After achieving good accuracy, export the model to a file to use in applications.

How to collect initial image data

The first step of any machine learning task is to collect the data that will be used for training. Usually this is the largest part of the whole process.

How do you collect handwritten digits for this? Well, for example, you can ask all your friends on social networks to write down digits from 0 to 9 and save them as images. They can also ask their friends to do the same and finally send all these images to you.

The more data you collect, the better for machine learning.

Then, you could create folders with names from "0" to "9" and arrange these images within them. Also, you need to convert the images to the same format: convert to grayscale and resize them. All images should have the same size and color format.

Finally, you'll have a labeled collection of handwritten digits that are ready to work with.

Fortunately, you do not need to do all this manual work, because it was already done in 1998 by the National Institute of Standards and Technology. The database of handwritten digits, called MNIST, is available to download from Kaggle or from many other places. For example, you can download and extract the MNIST archive using this link.

This database is already split into testing and training data in appropriate folders. Each of these folders contains images of handwritten digits, classified to folders from "0" to "9". There are 60000 images in the training folder and 10000 images in the testing folder:

MNIST database images

Each file is a 28x28 grayscale image. We will use the content of the training folder to prepare the dataset for training the neural network model. Then we will use the content of the testing folder to validate the accuracy of the trained model. Before doing that, we need to convert this raw data to datasets.

In order to continue, run the Jupyter notebook and create a new notebook in it, selecting "Julia" as a language. Then, copy the training and testing folders with images to the folder in which you created the notebook.

How to work with images in Julia

An image is not a natural data format for machine learning models. The models understand only numbers. That is why, to prepare the images for machine learning, you need to load them and convert them to numbers.

To work with images in Julia, we will use the Julia Images library. Using this library, you can load the image, convert it to a matrix of pixels, and apply the different transformations that may be required before passing it to the ML model. The transformations include resizing, converting from color to black and white, inverting, cropping, and more.

To start working with these functions, you need to install the Images package and import it to your notebook:

using Pkg
Pkg.add("Images")
using Images

How to load and view the image

You can use the load function to load the image. Let's load the first digit from our training dataset. If this file exists, it should load it to the img variable and display the image itself:

img = load("training/0/1.png")

Loaded digit image

This is a loaded digit. Let's see the shape of the img variable:

size(img)

(28,28)

As you see, the img variable is a 2D array or matrix of image pixels. The first dimension of the array is the number of rows and the second dimension is the number of columns. That is why the height of the image is the first value and the width of the image is the second value.

Let's see the type of this variable now:

typeof(img)

Matrix{Gray{N0f8}} (alias for Array{Gray{Normed{UInt8, 8}}, 2})

It shows that this is a matrix of "Gray" objects. The Gray type defines a gray pixel. It means that the image that we loaded does not have color information.

The Gray data type defines the pixel by a single value – the intensity of gray color in a range between 0 and 1. So, the 0 is completely black and the 1 is completely white.

You can change a color of any pixel using the following code:

img[5,5] = Gray(0.5)

This sets the specified pixel (which was previously black) to medium gray.

The image with modified pixel

If you load the full color image and request its type, it will show something like this:

Matrix{RGB{N0f8}} (alias for Array{RGB{Normed{UInt8, 8}}, 2})

In this case, each pixel has the RGB type, which is defined by 3 values: the intensities of red, green, and blue. The image itself is still a 2D matrix of pixels, but if you run size(channelview(img)) for a color image, you will see a 3D array of channel values, like this:

(3,28,28)

where the first dimension is a number of color channels, the second dimension is a height and the third dimension is a width.

In other words, this color image consists of three matrices of 28x28 size. Each of them contains intensities of the appropriate color.

To set the color of any pixel in this image, you need to specify intensities of 3 channels in the RGB type constructor:

img[5,5] = RGB(1,0.5,0)

This code sets the pixel color to orange.

How to implement basic image transformations

Because the image is an array, you can use the array syntax to get access to any part of the image or even to individual pixels.

For example, you can run this to extract the first 10 rows and 20 columns of this image and write them to the new image:

img2 = img[1:10,1:20]

Part of image

You can crop the image by 5 pixels from all sides:

img3 = img[6:23,6:23]

Cropped image

You can apply different filters to the image by applying the specified function to each element of the matrix, using the Julia broadcasting feature via "dot" syntax.

For example, this code applies the Gray function to each pixel of the image. This approach can be used to convert images from colored to grayscale:

img4 = Gray.(img)

Similarly, you can convert gray images to colored:

img5 = RGB.(img)

You can apply custom functions to each pixel. For example, if you apply the next anonymous function to the gray image this way:

img6 = (x-> Gray(1)-x.val).(img)

it will invert the image colors by subtracting the color value of each pixel from 1. If the img has a white digit on a black background, then the img6 will have a black digit on a white background:

Inverted image

Finally, to resize the image, you can use the imresize function. For example, to resize the img to 50x50 pixels, you can use the following code:

img6 = imresize(img,(50,50))

We will use only the features described above to prepare the images for handwritten digit recognition. But the Images module has many more interesting and fun things. Watch this video to see some of them. Also, you can find a lot of interesting information in this book.

How to convert the image to numeric matrix

The last image preprocessing step is converting the pixels to numbers, because objects of type Gray() or RGB() are not suitable as an input for the machine learning model.

You can do this in two steps. First, you need to apply the channelview function to the image to get the matrix view of the image object, and then, convert the result to float numbers. So, if you run this command:

data = Float32.(channelview(img))

Image matrix

you will get the matrix, where each value is a float number that represents an intensity of the corresponding pixel. This data is ready to go to the neural network.

How to prepare the image data for machine learning

As I wrote in a previous article, the training dataset should consist of data from the feature matrix and from the labels vector. Both should contain only numbers.

Let's go back to our image collections in the training and testing folders. The labels are the names of the subfolders where the images are located. They are already numbers. The features of an image are the pixels. Each pixel is defined by its color intensity.

So, to create a dataset that is ready for training from the images folder, you need to read all files from all subfolders, convert them to matrices of float numbers, and put them in the array.

path = "training"
X = []
y = []
for label in readdir(path)
    for file in readdir("$path/$label")
        img = load("$path/$label/$file")
        data = reshape(Float32.(channelview(img)),28,28,1)
        if length(X) == 0
            X = data
        else
            X = cat(X,data,dims=3)
        end
        push!(y,parse(Float32,label))
    end
end

Ensure that the "training" and the "testing" folders with the MNIST images exist in the current folder before running this program. It will take a while to execute this code, because it will load 60000 images and will convert them to matrices.

In the outer loop, it reads the contents of the "training" folder. There are subfolders with names from 0 to 9 that will be used as labels.

Then, in the inner loop, it reads all image files of each of these subfolders using the load function from the Images package.

Next, it converts each image to the matrix of color intensities and places it in the data variable. After that, it appends this matrix to X.

Finally, it appends the name of the subfolder (which is an actual digit) to the labels vector y.

This way, you will have a dataset with feature matrix in X and labels vector in y. Let's refactor this code to a function to be able to reuse it to convert any folder with images, classified this way, to the dataset.

using Images
function createDataset(path)
    X = []
    y = []
    for label in readdir(path)
        for file in readdir("$path/$label")
            img = load("$path/$label/$file")
            data = reshape(Float32.(channelview(img)),28,28,1)
            if length(X) == 0
                X = data
            else
                X = cat(X,data,dims=3)
            end
            push!(y,parse(Float32,label))
        end
    end
    return X,y
end

Using this function, you can now easily create both training and testing datasets:

x_train, y_train = createDataset("training")
x_test, y_test = createDataset("testing")

How to create a machine learning model

We will use a neural network to create a model and train it using the training data. To work with neural networks we will use the Flux.jl framework which allows you to create and train neural networks of various types, including feed forward, convolutional, and recurrent.

For handwritten image classification, we will implement both the Feed Forward and the Convolutional networks and compare their accuracy. If you need to, you can review the basics of neural networks by watching this video. Now is the best time to watch this before you continue reading.

Neural network basics

A neural network is a chain of layers. Each layer has a defined number of neurons with inputs and outputs.

To convert input to output for each layer, the neurons use the activation function, defined for this layer. Features of the image are the inputs of the first layer, and the classification results are the outputs of the last layer.

The best way to understand all this is to visualize some neural network architecture. Let's see the following basic neural net of 3 layers:

Feed forward neural network for digits recognition. Source: http://neuralnetworksanddeeplearning.com/chap1.html

In this picture, the input layer contains 784 neurons that should receive the features of each image. As you remember, the training dataset consists of 28x28 images, which is 784 pixels. This is how this neural network works:

The color value of each pixel goes to each neuron of the input layer.
Each neuron of the input layer sends its value to each neuron of the hidden layer.
Each neuron of the hidden layer has a weight coefficient for each input. By default, these coefficients are random numbers. So, each neuron on the hidden layer receives input values from the previous layer, multiplies each input by the appropriate weight, sums these products, and applies the activation function to that sum.
Each neuron of the hidden layer sends the resulting value to each neuron of the output layer, which has 10 neurons.
The output layer does exactly the same for each input value as the previous layer and finally accumulates some sum inside.
This sum is treated as a probability of the appropriate digit, for example the first neuron should contain the probability that the input image is "0", the second neuron should contain the probability that the image is "1", and so on.

Then, the application should look at which of these 10 neurons has the highest value and make the appropriate prediction.
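The steps above can be sketched in a few lines of plain Python, shown here as a language-neutral illustration rather than as the Flux implementation used later. The sigmoid activation, the weight range, and the zero biases are assumptions made for this example, and the random pixel values stand in for a real flattened image.

```python
import math
import random

def sigmoid(z):
    """Squash any number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum per neuron, then activation."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def random_layer(n_in, n_out, rng):
    """Random weights (one row per neuron) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
w1, b1 = random_layer(784, 15, rng)   # input layer -> hidden layer
w2, b2 = random_layer(15, 10, rng)    # hidden layer -> output layer

pixels = [rng.random() for _ in range(784)]  # stand-in for a 28x28 image
hidden = dense(pixels, w1, b1, sigmoid)
output = dense(hidden, w2, b2, sigmoid)

# The prediction is the index of the output neuron with the highest value
predicted_digit = max(range(10), key=lambda i: output[i])
print(predicted_digit)
```

Because the weights are random, the predicted digit is meaningless here. Training is what turns these weights into useful ones.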

How to create the neural network with Flux

Let's create this neural network using Flux. If you haven't installed and imported it yet, do this in your notebook:

using Pkg
Pkg.add("Flux")
using Flux

As you have seen, the neural network is a chain of layers with different parameters. So, Flux has a Chain function that you use to construct neural networks. Let's construct that network:

model = Chain(
    Flux.flatten,
    Dense(784=>15,relu),
    Dense(15=>10,sigmoid),
    softmax
)

The Chain receives an array of functions as arguments. Each function defines a layer and its parameters. Each of these functions receives some inputs, performs the appropriate actions, and forwards its outputs as inputs to the next function in the chain.

So, this is how the defined neural network works:

The input image, which is a 28x28 array of pixel color intensities, comes to the Flux.flatten function. This function just converts this 28x28 matrix to a vector with 784 elements. This way we constructed the input for the first Dense layer.
Then, the next Dense function passes these 784 values to 15 neurons. Each neuron multiplies the values by its weights, sums these products, and applies the [relu](https://fluxml.ai/Flux.jl/stable/models/activation/#NNlib.relu) activation function to this sum. The layer then forwards the resulting 15 values to the 10 neurons of the next layer.
Next, the dense layer also multiplies each of the 15 inputs by the weight coefficients, sums them, and applies the sigmoid activation function to convert these sums to fractions of 1.
The final [softmax](https://en.wikipedia.org/wiki/Softmax_function) function actually doesn't build a new layer, but it just converts values that accumulated in the 10 neurons of the output layer to correct probabilities to properly show the probability distribution. Applying this function ensures that the sum of all 10 probabilities is equal to 1. The array of these probabilities will be returned by the model as a result.
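To see what softmax does to the output values, here is a minimal plain Python sketch of the formula. It is an illustration only, not Flux's implementation, and the input scores are made up.

```python
import math

def softmax(values):
    """Convert arbitrary scores to probabilities that sum to 1."""
    largest = max(values)
    # Subtracting the maximum does not change the result but avoids overflow
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.2, 0.9, 0.5]  # made-up outputs of three neurons
probs = softmax(scores)

print(probs)
print(sum(probs))
```

The largest score still gets the largest probability, so softmax changes the scale of the outputs but not which neuron wins.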

You can call the model which you just created as a function by passing an image matrix as an input argument.

You can run the model to predict the digit for the first image from the training dataset using the following code:

predict = model(Flux.unsqueeze(x_train[:,:,1],dims=3))

We use the [unsqueeze](https://fluxml.ai/Flux.jl/stable/utilities/#Flux.unsqueeze) function here to convert the image without channels of the (28,28) shape to the single channel image of the (28,28,1) shape.

This is an important rule for deep neural network processing – that the image is something that has a width, height, and color channels. So, even if it has only a single channel, it must be specified.

The model function receives the input image matrix, passes it through a chain of layers, and returns the array of probabilities.

New neural network probabilities

As you can see, neuron number 2 has the highest probability (0.12457416), which means that the model predicted the digit "1". However, if you check the real answer in the labels vector:

y_train[1]

you will see "0", so the prediction is incorrect. This is because this model is untrained and just uses random weights to calculate the output for each layer. You need to train it to adjust these weights and calculate more accurate probabilities.

How to train the model

Flux.jl has different approaches to training a model. The most obvious one is the [Flux.train!](https://fluxml.ai/Flux.jl/stable/training/reference/#Flux.Optimise.train!-NTuple{4,%20Any}) function. The function runs the following training process:

The function receives the training dataset as an argument, including the features matrix and the labels vector.
The function runs the model for each row of the training dataset and receives the resulting probabilities array.
The function compares these probabilities with the true values from the labels vector and calculates the amount of error (about this later).
Using information about the error, the function adjusts the weights and bias for each neuron on each layer.

Usually you need to run this training process many times in a loop. On each iteration it will adjust the weights for each neuron, decreasing the error value more and more.

This visualization shows how the training process in a loop works for a single neuron on a single layer. For the whole network it works similarly.

The training process in a loop for a single neuron
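The same loop can be sketched in plain Python for one linear neuron. This is a simplified illustration under stated assumptions (squared error, plain gradient descent, a made-up dataset following y = 2x + 1), not what Flux does internally.

```python
def train_neuron(samples, epochs=200, learning_rate=0.1):
    """Fit prediction = w * x + b by gradient descent on the squared error."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        total_error = 0.0
        for x, y in samples:
            prediction = w * x + b
            error = prediction - y
            total_error += error ** 2
            # Move the weight and bias against the gradient of error ** 2
            w -= learning_rate * 2 * error * x
            b -= learning_rate * 2 * error
        history.append(total_error / len(samples))
    return w, b, history

samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # points on y = 2x + 1
w, b, history = train_neuron(samples)

print(round(w, 3), round(b, 3))
print(history[0], history[-1])
```

The average error recorded in history shrinks from epoch to epoch, which is the behavior you want to see when training the full network as well.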

This is the syntax of the train! function:

Flux.train!(loss_function, model, data, optimizer)

Let's break this down:

loss_function – as I described before, during the training process, the train function measures the amount of error. To do this, it uses the loss_function, which you should define and provide here.

This function receives the model, the row of the training data, and the truth label. Based on these arguments, the loss function should make a prediction by passing the row of data through the model, comparing this prediction with the truth label, calculating the difference between them, and returning the amount of error as a float number.

Different algorithms exist to calculate the amount of error for different types of machine learning problems. For classification problems, we will use cross entropy.

model – the neural network model to train.
data – the training data that includes both x_train and y_train assembled to a single array of tuples. You can do this simply by using the [Flux.DataLoader](https://fluxml.ai/Flux.jl/v0.10/data/dataloader/) function, which we will use below.
optimizer – as described above, after measuring the amount of error, the function adjusts the weights to decrease the error. The weights are not adjusted randomly: the optimizer defines the algorithm that moves them in the direction that reduces the error.

Most weight adjustment algorithms are based on gradient descent. In particular, we will use the Adam optimizer, which is very common today.

Let's connect all these parts together in the following code:

# Assemble the training data
data = Flux.DataLoader((x_train,y_train), shuffle=true)

# Initialize the ADAM optimizer with default settings
optimizer = Flux.setup(Adam(), model)

# Define the loss function that uses the cross-entropy to 
# measure the error by comparing model predictions of data 
# row "x" with true data label in the "y"
function loss(model, x, y)
    return Flux.crossentropy(model(x),Flux.onehotbatch(y,0:9))
end

# Train the model 10 times in a loop
for epoch in 1:10
    Flux.train!(loss, model, data, optimizer)
end

For each row of data, Flux.train! calls the loss function, which in turn runs the model. Using cross entropy, it calculates the difference between the predictions and the true values for this row. This difference is returned as an error, and then the optimizer adjusts the weights of the model neurons based on the gradient of this error. With each iteration, the error value should go down.
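If you want to see what cross entropy measures for a single prediction, the following plain Python sketch may help. It is only an illustration for one sample, not the Flux.crossentropy implementation: the function names, the small eps guard against log(0), and the two probability vectors are assumptions.

```python
import math

def one_hot(label, num_classes=10):
    """Encode a digit label as a vector with a single 1.0."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def cross_entropy(probs, target, eps=1e-12):
    """Cross entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p + eps) for p, t in zip(probs, target))

# Two made-up predictions for an image whose true label is 0
confident = [0.91] + [0.01] * 9   # most probability on digit 0
uniform = [0.1] * 10              # an untrained-looking guess

print(cross_entropy(confident, one_hot(0)))  # small error
print(cross_entropy(uniform, one_hot(0)))    # larger error
```

Because the target is one-hot, only the probability assigned to the true digit contributes, so the error shrinks as the model becomes more confident in the right answer.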

Finally, after running the training process, you can check how it predicts the digit for the first image using the trained model:

predict = model(Flux.unsqueeze(x_train[:,:,1],dims=3))

When I did that, I received the following probabilities:

Trained model probabilities

The first one, which corresponds to "0", is the highest, and this is correct. You can check other images, like image number 100 or 200. But it doesn't make much sense to measure model quality this way, because this is training data that the model has already seen. Only the testing data should be used to measure the accuracy of the model.

How to evaluate the accuracy of the trained model

We have the testing dataset in the x_test features matrix and in the y_test labels vector. We will run the model for each row of this data and measure the accuracy: the number of correct predictions divided by the number of all predictions.

Let's create a function for this:

function accuracy()
    correct = 0
    for index in 1:length(y_test)
        probs = model(Flux.unsqueeze(x_test[:,:,index],dims=3))
        predicted_digit = argmax(probs)[1]-1
        if predicted_digit == y_test[index]
            correct +=1
        end
    end
    return correct/length(y_test)
end

The function goes over all items of the testing dataset. For each item it runs the model and receives the probs array. Then it finds the index of the highest probability using the [argmax](https://docs.julialang.org/en/v1/base/collections/#Base.argmax) function and subtracts 1 from it, because Julia indexes start at 1 while the digits start at 0. The result is written to the predicted_digit variable. Next, it compares the predicted digit with the true value from the y_test labels vector and increases the number of correct predictions if they match. The function returns the quotient of the number of correct predictions and the total number of rows.
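The same calculation can be sketched in plain Python, which may help if you are more used to 0-based indexing. This is an illustration on made-up data with three classes, not a replacement for the Julia function above; the names and sample values are assumptions.

```python
def predicted_class(probs):
    """Index of the largest probability. Python indexes start at 0,
    so no "-1" correction is needed, unlike in the Julia version."""
    return max(range(len(probs)), key=lambda i: probs[i])

def accuracy(all_probs, labels):
    """Share of samples whose predicted class equals the true label."""
    correct = sum(
        1 for probs, label in zip(all_probs, labels)
        if predicted_class(probs) == label
    )
    return correct / len(labels)

# Made-up model outputs for four samples and their true labels
all_probs = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.2, 0.6],
]
labels = [1, 0, 1, 2]

print(accuracy(all_probs, labels))
```

Three of the four made-up predictions match their labels, so the function reports three quarters.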

Now you can run this function to see the accuracy. For example, when I ran this, I received 0.9455, which is about 94.6%.

However, it's better to place this function call inside the training loop, right after the Flux.train! line to see how the accuracy changes after each training iteration.

for epoch in 1:10
    Flux.train!(loss, model, data, optimizer)
    println(accuracy())
end

Then run the training again. You should receive output similar to this:

Accuracy of the neural network

It shows that accuracy was going up until the 6th iteration. After that, it started to go down, which could be a sign that the model started to overfit.
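One common way to react to this pattern, although the training loop in this tutorial does not do it, is to remember the best epoch and stop once accuracy has not improved for a few epochs in a row. Here is a minimal Python sketch of that bookkeeping. The helper names, the patience value, and the accuracy history are hypothetical and are not the numbers from the screenshot.

```python
def best_epoch(accuracies):
    """Return (epoch_number, accuracy) of the best epoch, counting from 1."""
    index = max(range(len(accuracies)), key=lambda i: accuracies[i])
    return index + 1, accuracies[index]

def should_stop(accuracies, patience=2):
    """True when the last `patience` epochs did not beat the earlier best."""
    if len(accuracies) <= patience:
        return False
    best_before = max(accuracies[:-patience])
    return all(a <= best_before for a in accuracies[-patience:])

# Hypothetical per-epoch test accuracies
history = [0.90, 0.92, 0.93, 0.94, 0.945, 0.95, 0.948, 0.946]

print(best_epoch(history))
print(should_stop(history[:6]))  # still improving
print(should_stop(history))      # two epochs without improvement
```

In a real training loop you would call something like should_stop after each epoch and keep a copy of the model from the best epoch.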

To increase the prediction quality, you can either add more data to the training dataset or change the model architecture.

For example, you can add more Dense layers, increase the number of neurons on the hidden layer, or change activation functions from relu to sigmoid or vice versa.

When I increased the number of neurons from 15 to 42 on the hidden layer and then removed the sigmoid activation from the output layer, I achieved about 97% accuracy. But when I added one more hidden layer before the output, the accuracy dropped to 90%.

So, building the neural net architecture is like art – you need to try different options a lot of times and finally select the one that works the best.

Regardless of the options I chose, I could never achieve more than 97%. Also, when I finally tried to use this network architecture in production with real handwritten digits from users, the prediction quality was poor. Very often it could not recognize the 7 digit properly, and it recognized 1 as 4 and 6 as 5.

This is because using the feed forward neural network, in which we just put all 784 pixels of the image as an input without any filters, is not the best approach.

For most machine learning tasks with images, a convolutional neural network is the better option. We will create and try one in the next section.

How to create and train the convolutional neural network

The most important step during the machine learning process is data preprocessing. If input features are processed properly, then the prediction accuracy will be better.

To increase the model quality, you need to remove noise from data, or features that are not relevant for the value that you need to predict.

Also, oftentimes you need to create new features from existing ones that could be more relevant to the result.

For example, for the Titanic machine learning problem, you can remove such features as "Passenger ID" and "Passenger name", because they can't help to predict whether the passenger might survive or not.

Also, if you have a task to predict the price of a flat and have input data with fields of room areas like "Area 1", "Area 2" and so on, you can create a new field "Total Flat Area" and write the sum of all room areas to it.

Perhaps this new feature that you generated is more relevant than others for the model, so you can remove the fields from which you generated that new column.

Using these techniques, you generalize the data by keeping and creating the features that are important and by removing others that can only confuse the machine learning model.
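As a small illustration of the flat price example, the following Python sketch drops an identifier column and replaces the individual room areas with their sum. The record, the "Flat ID" and "Floor" fields, and the function name are hypothetical and only serve the example.

```python
def engineer_features(flat):
    """Drop the identifier and replace per-room areas with their total."""
    area_keys = [key for key in flat if key.startswith("Area ")]
    result = {
        key: value for key, value in flat.items()
        if key not in area_keys and key != "Flat ID"
    }
    result["Total Flat Area"] = sum(flat[key] for key in area_keys)
    return result

# A made-up input record
flat = {"Flat ID": 17, "Area 1": 18.5, "Area 2": 12.0, "Area 3": 9.5, "Floor": 3}

print(engineer_features(flat))
```

The model then receives fewer columns, and the one that replaces the room areas is more directly related to the price.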

When working with tabular data, you can use your own experience or statistical methods to find which features to generate or remove from input data. But when working with images, things are not as clear as with strings or numbers.

For example, the model for the handwritten digits recognition task receives the 784 pixel colors in a single row as an input. They have an equal value from a human point of view, and it's unknown which of them are more important and which of them are less.

To help you in this, you can use convolutional neural networks to preprocess this kind of data. They help you do the feature engineering automatically.

You build a convolutional neural network from two types of layers:

Convolution layers used to generate new features from input image pixels.
Pooling layers used to generalize features using some rules and this way reduce their quantity.

By combining these two types of layers in the chain, you can preprocess the input image matrix to receive a reduced number of the most valuable features. Then, you can train the network using these generated features as input data in the same way as you did before.

I think it's difficult to describe CNNs better than it's done in this video, so I highly recommend watching it (or at least the first 15 minutes) before continuing. It clearly explains the theoretical aspects of all the steps that you will do below.

So, let's review the neural network that you have now:

model = Chain(
    Flux.flatten,
    Dense(784=>15,relu),
    Dense(15=>10,sigmoid),
    softmax
)

The only data preprocessing step here is Flux.flatten, which receives the 28x28 pixel image and returns it as a single row of 784 numbers. We need to add some convolution layers before Flux.flatten to give our network the ability to generate better features than just raw pixels.

To create the convolution layer, the Flux.jl has the [Conv](https://fluxml.ai/Flux.jl/stable/models/layers/#Flux.Conv) function with the following main parameters:

Conv(filter,in=>out,activation_function)

filter defines dimensions of the kernel matrix that will be applied to each pixel of the input matrix to create a feature from it. For example, the value (3,3) defines the 3x3 kernel matrix. This is how the convolution using this kernel matrix works to generate the features for an image of 6x6 size:

How convolution layer works

in is the number of input image channels. For our input data, gray images have a single channel. For other layers, the number of in channels of current layer must be equal to the out channels of previous layer.
out is the number of output channels after applying the convolution. In other words, it's the number of features that will be generated for each pixel.
activation_function is the function that will be applied to each feature after convolution and before sending to the next layer of the network, the same as we did before for Dense layers.
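To show what applying a kernel matrix means in practice, here is a plain Python sketch that slides a 3x3 kernel over a 6x6 image without padding. It is an illustration only, not what Flux.jl runs internally: the image, the kernel values, and the function name are assumptions, and the sketch multiplies the kernel as written, while some libraries flip it first. Since kernel values are learned during training, that difference does not matter for this example.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k)
            )
            row.append(total)
        output.append(row)
    return output

# 6x6 image with a vertical edge: left half dark (0), right half bright (1)
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# 3x3 kernel that responds to vertical edges (illustrative values)
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

features = conv2d_valid(image, kernel)
for row in features:
    print(row)
```

The output is a 4x4 matrix whose two middle columns are large, which is where the edge sits. A convolution layer with out=6 would learn six such kernels and produce six feature matrices.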

For example, if you add the following Conv layer on top of the others to the Chain:

model = Chain(
    Conv((5,5),1=>6,relu),
    Flux.flatten,
    Dense(3456=>15,relu),
    Dense(15=>10,sigmoid),
    softmax
)

this network will get a single channel image of the following shape: (28,28,1). It will produce 6 matrices from this image by applying different convolution kernels of 5x5 to the input data.

The output of this layer will be the image of the following shape: (24,24,6). By default, Conv does not pad the input, so a 5x5 kernel fits into only 28-5+1 = 24 positions along each side. In other words, this convolution layer will generate 24*24*6 = 3456 features from 784 input pixels for our network.

But if you have more features, it does not mean that they are all good. Perhaps you need to generalize them and leave only the most valuable ones. This is why the pooling layers are created.

In Flux.jl, the pooling layer can be defined using the [MaxPool](https://fluxml.ai/Flux.jl/stable/models/layers/#Flux.MaxPool) function. It receives the pooling window dimensions as an argument.

For example, if you create the following MaxPool layer:

MaxPool((2,2))

How Max pool layer works

it will apply the 2x2 window to the input image. As you can see, for each window it selects the maximum value and adds it to the output. This way it reduces the input data by leaving only maximums in it. That is why it's called the MAX pool layer.
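Here is the same idea as a few lines of plain Python, in case you want to trace it by hand. The input matrix and the function name are made up for illustration, and the sketch assumes the matrix sides are divisible by the window size.

```python
def max_pool(matrix, window=2):
    """Keep only the maximum of each non-overlapping window x window block."""
    rows = len(matrix) // window
    cols = len(matrix[0]) // window
    return [
        [
            max(
                matrix[r * window + i][c * window + j]
                for i in range(window) for j in range(window)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]

# A made-up 4x4 feature matrix
features = [[1, 3, 2, 4],
            [5, 6, 1, 2],
            [7, 2, 9, 1],
            [3, 4, 0, 8]]

pooled = max_pool(features)
for row in pooled:
    print(row)
```

The 4x4 input becomes a 2x2 output, so the number of values drops by a factor of four while the strongest responses are kept.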

Let's add the MaxPool layer to the chain:

model = Chain(
    Conv((5,5),1=>6,relu),
    MaxPool((2,2)),
    Flux.flatten,
    Dense(864=>15,relu),
    Dense(15=>10,sigmoid),
    softmax
)

So, the MaxPool receives the (24,24,6) sized image from the convolution layer (a 5x5 kernel without padding shrinks each 28-pixel side to 24), applies the 2x2 max pool window to it, and outputs a (12,12,6) image. After this, the 12*12*6 = 864 generalized features are forwarded to the network layers below.

The main question is how to know which number of convolution and max pool layers to add, and which parameters to set for each of them to achieve good prediction accuracy.

Well, the first way is to try different options. But to build a good neural network architecture this way could take days, months, or even years.

Fortunately, for many machine learning tasks, it has already been done by other people. You can find suitable architectures for most of your problems, including the model for the handwritten digit recognition.

The best-known architecture for this task was created by Yann LeCun, and it's named LeNet. You can find a full description and implementations of this model for different ML platforms here. It was created specifically for the digit images from the MNIST dataset. It's relatively old, but still used in many ATMs to recognize digits for processing deposits.

This is how this architecture looks:

LeNet architecture

Just like the network we created, this one consists of a convolutional part and a feed forward part. The convolutional net part consists of 2 Conv and 2 MaxPool layers. The feed forward neural network part consists of 3 dense layers.

You can create this network using Flux.jl this way:

model = Chain(
    Conv((5,5),1 => 6, relu),
    MaxPool((2,2)),
    Conv((5,5),6 => 16, relu),
    MaxPool((2,2)),
    Flux.flatten,
    Dense(256=>120,relu),
    Dense(120=>84, relu),
    Dense(84=>10, sigmoid),
    softmax
)

After applying 2 convolutions and pooling to the input image matrix, the Flux.flatten layer receives the 4x4x16 image and converts it to 4*4*16 = 256 generalized features. Then they go through 3 dense layers to finally calculate probabilities for 10 digits.
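If you wonder where 4x4x16 comes from, you can follow the image size through the layers. The sketch below does this in Python under two assumptions that match the Chain above: the convolutions use no padding and a stride of 1 (the Flux.jl defaults), and each 2x2 max pool halves the side. The helper names are made up for illustration.

```python
def conv_output(size, kernel):
    """Side length after a convolution without padding and with stride 1."""
    return size - kernel + 1

def pool_output(size, window):
    """Side length after non-overlapping pooling."""
    return size // window

def lenet_feature_count(input_size=28):
    size = conv_output(input_size, 5)   # 28 -> 24
    size = pool_output(size, 2)         # 24 -> 12
    size = conv_output(size, 5)         # 12 -> 8
    size = pool_output(size, 2)         # 8 -> 4
    channels = 16                       # out channels of the second Conv
    return size * size * channels

print(lenet_feature_count())
```

The result is the input size that the first Dense layer must declare. If you change a kernel size or add padding, this number changes and the Dense layer has to be updated to match.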

Before training this model using the data from x_train, you need to reshape it a little bit. The convolution layer expects to get the data in the following 4-dimensional shape (width,height,channels,length), but the x_train has the following shape: (28,28,60000) which is 60000 images of 28x28.

To make it compatible, you need to reshape it to (28, 28, 1, 60000). You can do this using the following code:

x_train = reshape(x_train, 28, 28, 1, :)

You'll need to do the same with x_test:

x_test = reshape(x_test, 28, 28, 1, :)

To run this model, you also need to pass a 4 dimensional image structure to the model function. For example, to make a prediction for the first image, you can run this:

model(Flux.unsqueeze(x_test[:,:,:,1],dims=4))

Then you can train the model the same way as you did before.

This is the whole code to define and train the convolutional network:

# Create a LeNet model
model = Chain(
    Conv((5,5),1 => 6, relu),
    MaxPool((2,2)),
    Conv((5,5),6 => 16, relu),
    MaxPool((2,2)),
    Flux.flatten,
    Dense(256=>120,relu),
    Dense(120=>84, relu),
    Dense(84=>10, sigmoid),
    softmax
)

# Function to measure the model accuracy
function accuracy()
    correct = 0
    for index in 1:length(y_test)
        probs = model(Flux.unsqueeze(x_test[:,:,:,index],dims=4))
        predicted_digit = argmax(probs)[1]-1
        if predicted_digit == y_test[index]
            correct +=1
        end
    end
    return correct/length(y_test)
end

# Reshape the data
x_train = reshape(x_train, 28, 28, 1, :)
x_test = reshape(x_test, 28, 28, 1, :)

# Assemble the training data
train_data = Flux.DataLoader((x_train,y_train), shuffle=true)

# Initialize the ADAM optimizer with default settings
optimizer = Flux.setup(Adam(), model)

# Define the loss function that uses the cross-entropy to 
# measure the error by comparing model predictions of 
# data row "x" with true data from label "y"
function loss(model, x, y)
    return Flux.crossentropy(model(x),Flux.onehotbatch(y,0:9))
end

# Train model 10 times in a loop
for epoch in 1:10
    Flux.train!(loss, model, train_data, optimizer)
    println(accuracy())
end

After running this code, I received about 99% accuracy, which is close to ideal:

Accuracy of the convolutional network

Now it's time to save this model to a file and move it to production.

How to export trained model to a file

Flux.jl models can be saved to BSON files. You need to import the BSON package and use the @save macro command to export the model object:

using BSON
BSON.@save "digits.bson" model

This will save the model to the digits.bson file into the current folder.

This is the end of your work in the Jupyter notebook. We'll implement the following code as a new application.

How to create a frontend

The application which you are going to create will allow a user to write their phone number and recognize it using the model that you created and trained before. The frontend page will look like this:

Frontend

Using this interface, the user can draw the digits of a phone number in the boxes using the mouse, then press the "Recognise" button and see the recognised digits in the "Result" input field.

Also, there is a "Switch to eraser" button. When the user presses it, the drawing mode changes to the eraser mode and the user can erase any number in any box.

Let's start building the web application. Create a new folder with any name that you like. Then create an index.html file in it and copy the following code to this file:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Phones reader</title>
</head>
<body>
    <h1>Draw phone number and recognise it</h1>
    <div class="digits">
        <strong>+</strong>
        <canvas width="50" height="50"></canvas>
        <strong>(</strong>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <strong>)</strong>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <strong>-</strong>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <div class="buttons">
            <button id="mode">Switch to eraser</button>
        </div>
    </div>
    <div class="result">
        <button id="recognise">Recognise</button>
        <label>Result:</label>
        <input id="result">
    </div>
</body>
<script>
    let mode = "brush";
    // "Switch" button handler. Switches mode from 
    // brush to eraser and back
    document.querySelector("#mode").addEventListener("click",(event) => {
        if (mode === "brush") {
            mode = "eraser";
            event.target.innerHTML = "Switch to brush";
        } else {
            mode = "brush";
            event.target.innerHTML = "Switch to eraser";
        }
    });
    // Digits canvases mouse move handler.
    // If mouse button pressed while user moves the mouse
    // on canvas, it draws circles in cursor position.
    // If mode="brush" then circles are black, otherwise
    // they are white
    document.querySelectorAll("canvas").forEach(item => {
        const ctx = item.getContext("2d");
        ctx.fillStyle="#FFFFFF";
        ctx.fillRect(0,0,50,50);
        item.addEventListener("mousemove", (event) => {
            if (event.buttons) {
                const ctx = event.target.getContext("2d");
                if (mode === "brush") {
                    ctx.fillStyle = "#000000";         
                } else {
                    ctx.fillStyle = "#FFFFFF";         
                }
                ctx.beginPath();               
                ctx.arc(event.offsetX-1,event.offsetY-1,2,0, 2 * Math.PI);
                ctx.fill();   
            }
        })
    })
    // "Recognise" button handler. Captures
    // content of all digit canvases as blobs,
    // constructs files from these blobs, and
    // posts them to the backend as a
    // multipart form
    document.querySelector("#recognise").addEventListener("click", async() => {
        const data = new FormData();
        const canvases = document.querySelectorAll("canvas");
        const getPng = (canvas) => {
            return new Promise(resolve => {
                canvas.toBlob(png => {
                    resolve(png)
                })
            })
        }
        let index = 0;
        for (let canvas of canvases) {
            const png = await getPng(canvas);
            data.append((++index)+".png",new File([png],index+".png"));
        }
        const response = await fetch("http://localhost:8080/api/recognize", {
            body: data,
            method: "POST"
        })
        document.querySelector("#result").value = await response.text();
    })

</script>
<style>
    body {
        display:flex;
        flex-direction: column;
        justify-content: flex-start;
        align-items: flex-start;
    }
    canvas {
        border-width:1px;
        border-color:black;
        border-style: solid;
        margin-right:5px;
        cursor:crosshair;
    }
    .digits {
        display:flex;
        flex-direction: row;
        align-items: center;
        justify-content: flex-start;
    }
    .digits strong {
        font-size: 72px;
        margin:10px;
    }
    .buttons {
        display:flex;
        flex-direction: column;
        justify-content: flex-start;
        align-items: center;
    }
    button {
        width:100px;
        margin-bottom:5px;
        margin-right:10px;
    }
    .result {
        margin-top:10px;
        display:flex;
        flex-direction: row;
        align-items: flex-start;
        justify-content: flex-start;
    }
    input {
        margin-left:10px;
    }
</style>
</html>

The HTML part of this code contains 11 HTML5 canvas elements that display the boxes where you can draw. Each box has a size of 50x50 pixels and is filled with a white color. Also, the HTML contains "Switch to ..." and "Recognise" buttons and the "Result" input field.

The JavaScript part defines the "mode" global variable, which is equal to "brush" by default. When the user presses the "Switch to ..." button, it changes the mode to the "eraser". If they press it again, it switches back to the "brush".

Next, the JavaScript code defines "mousemove" event handlers for all canvas boxes. If the user presses the left mouse button in the "brush mode" and moves the mouse in the box, it draws black circles in place of the mouse cursor. This way, the user draws the digits. If the mode is "eraser", then it draws white circles. This way, the user can erase the black marks.

Finally, we defined the "Recognise" button click handler. When the user clicks this button, the handler function collects 11 digit images from the canvas elements and converts them to BLOB objects in a PNG image format.

Then it creates a POST request, puts these 11 digit images in it as files with names 1.png, 2.png and so on, and sends them to the /api/recognize endpoint of the backend service on port 8080 of a local host (which we will create in the next section).

The backend should receive these images, recognise digits in them, and return the recognition result as a string. This string will be displayed in the "Result" input field.

Lastly, I defined some CSS to apply basic styles to this page. You can modify them as you want. Now, let's move to the most interesting part – the digits recognition backend.

How to create a backend

As a modern and mature programming language, Julia has a lot of libraries and frameworks for different tasks, and web frameworks are no exception. We will use the Genie.jl framework, which is similar to Express in Node.js or Flask in Python.

With Genie.jl you can run a basic web service in two lines of code:

using Genie
up(8080, async=false)

It will run a web server on port 8080 of a local host.

Using any text editor, for example VSCode with the Julia extension, create a new Julia file like digits.jl in the same folder as index.html. This is where you'll write the next bit of code.

This web service will have two endpoints:

/ to display the index.html web page that you created before.
/api/recognize to receive POST requests with the images of digits, recognize them, and return a string with recognized numbers.

As with most other web frameworks, to receive and process HTTP requests Genie.jl uses routes. This application will have two routes:

using Genie, Genie.Router, Genie.Requests

route("/") do 
    return String(read("index.html"))
end

route("/api/recognize", method=POST) do
    result = ""
    # TODO: in a loop, extract each image 
    # from POST request body, send it to 
    # the digit recognition function, 
    # receive recognized digit and add 
    # it to the result
    return result
end

up(8080, async=false)

To work with routes and requests, you need to import two additional subpackages – Genie.Router and Genie.Requests.

The first route just returns the content of the index.html file.

The second route processes the POST requests to the /api/recognize endpoint. This is how you can define it:

using Images
route("/api/recognize", method=POST) do
    result = ""
    files = filespayload();   
    for index in 1:11
        file = files["$index.png"]
        img = load(IOBuffer(file.data))
        result *= recognizeDigit(img)        
    end    
    return result
end

To load the received file as an image, we will use the Julia Images library that we imported on the first line.

Then, the [filespayload](https://github.com/GenieFramework/Genie.jl/blob/7eb45c9ec32f0e4659abb08559b0b2729451421a/src/Requests.jl#L50)() function extracts all files from the received request.

Then, we assume that the request has 11 files and we process them in a loop. Each file's data is extracted as an array of bytes, but the [load](https://juliaimages.org/stable/function_reference/#FileIO.load) function requires an IO object to read from. That is why [IOBuffer](https://docs.julialang.org/en/v1/base/io-network/#Base.IOBuffer) is used to wrap the array of bytes in a suitable format.

Then, the loaded image gets passed to the recognizeDigit function. This function will be written below. It should receive the image, then recognize it using the trained model and return the recognized digit as a string. This digit will be appended to the result string. Finally, the result with 11 recognized digits will be sent to the web page.

Before writing the recognizeDigit function, ensure that the saved model file digits.bson was copied to the folder with your backend code.

Also, it's important to understand that we can't process the input image as is because it has a size of 50x50, and it is a black digit on a white background.

If the model trained on images with size 28x28, then it can't be used to recognize images of other sizes.

Also, the model that trained on images that had white text written on black background will work poorly for colored images and for images with black text on a white background.

So, before you send the images to the model for recognition, you need to preprocess them using the following steps:

Convert the images to gray
Invert the colors
Resize them to 28x28
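To see what these three steps do to the pixel values, here is a rough sketch in plain Python on nested lists. It is not the code the backend uses: it averages the color channels instead of using a weighted grayscale formula, and it resizes with simple nearest-neighbor sampling, which is an assumption and likely differs from what an image library does. All names and the test canvas are made up.

```python
def to_gray(pixel):
    """Average the (r, g, b) channels and scale to the 0.0-1.0 range."""
    r, g, b = pixel
    return (r + g + b) / 3 / 255

def invert(value):
    """Turn dark strokes on a light background into light on dark."""
    return 1.0 - value

def resize_nearest(image, new_size):
    """Resize a square image by picking the nearest source pixel."""
    old_size = len(image)
    return [
        [image[r * old_size // new_size][c * old_size // new_size]
         for c in range(new_size)]
        for r in range(new_size)
    ]

def preprocess(rgb_image):
    """Grayscale, invert, and resize to the 28x28 input the model expects."""
    gray = [[invert(to_gray(pixel)) for pixel in row] for row in rgb_image]
    return resize_nearest(gray, 28)

# A made-up 50x50 white canvas with a black vertical stroke in the middle
canvas = [
    [(0, 0, 0) if 20 <= c < 30 else (255, 255, 255) for c in range(50)]
    for r in range(50)
]

digit = preprocess(canvas)
print(len(digit), len(digit[0]))
print(digit[0][0], digit[14][14])  # background value, stroke value
```

After preprocessing, the background is 0.0 and the stroke is 1.0, which matches the white-on-black convention of the training images.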

Now you are ready to implement the digits recognition function:

using Flux, MLUtils, BSON
function recognizeDigit(img)
    # load the model
    BSON.@load "digits.bson" model
    # Convert image to grayscale
    img = Gray.(img)
    # Invert each pixel color
    img = (x->Gray(1)-x.val).(img)
    # resize image to 28x28 pixels
    img = imresize(img,(28,28))
    # Get matrix of image
    digit_data = Float32.(channelview(img))
    # predict the digit (get probabilities)
    probs = model(cat(digit_data,dims=4))
    # return the digit with the largest 
    # probability, converted to a string
    return "$(argmax(probs)[1]-1)"
end

When all this is done, you are almost ready to run the app. Before doing that, ensure that all required packages are installed. Run the Julia REPL in the project folder. Then run the following code line by line to install all packages mentioned in the using lines:

using Pkg
Pkg.add("Genie")
Pkg.add("Images")
Pkg.add("Flux")
Pkg.add("MLUtils")
Pkg.add("BSON")

Then exit the REPL using the exit() command.

Now you can run the app. To do that, either execute the julia digits.jl command from the terminal or press Ctrl+F5 in VSCode.

Then, go to http://localhost:8080 in a web browser, draw the digits, press the "Recognise" button, and in a few moments you will see the recognised number as a text in the "Result" field.

Conclusion

In this tutorial, I demonstrated how to create and train both feed forward and convolutional neural networks using Julia. You also learned how to export and use them in a web application.

In addition, I tried to show that you should not reinvent the wheel when creating neural networks.

When solving real life problems, you should not build neural network architectures from scratch. Most of them have already been created by data scientists and enthusiasts around the world. In practice, you will just reuse them.

You'll just need to find the suitable architecture and either use it as is or change the last few layers to adjust the outputs according to your needs.

For example, you can search this collection where you'll find different models classified by problem types. Even if many of them were not created with Julia, you can create them using Flux.jl after reading their descriptions.

The way we created and trained our neural network is not the best or the only possible one. Perhaps in some points I oversimplified things, because I wanted to explain all this as simply as possible.

But if you've understood the examples here, you can learn and reuse the following more advanced Julia solutions of the handwritten digits recognition task:

You can see the source code of this article including the Jupyter Notebook and the web service in this repository.

Have fun coding and never stop learning!

You can find me on LinkedIn, Twitter, and Facebook to know first about new articles like this one and other software development news.

How to Use TensorFlow for Deep Learning – Basics for Beginners

Manish Shivanandhan — Tue, 14 Feb 2023 23:46:51 +0000

TensorFlow is a library that helps engineers build and train deep learning models. It provides all the tools we need to create neural networks.

We can use TensorFlow to train simple to complex neural networks using large sets of data.

TensorFlow is used in a variety of applications, from image and speech recognition to natural language processing and robotics. TensorFlow enables us to quickly and easily build powerful AI models with high accuracy and performance.

TensorFlow also works with GPUs and TPUs, which are types of computer chips well suited to the large matrix computations behind deep learning. These chips make TensorFlow run faster, which is helpful when you have a lot of data to work with.

In this article, we will learn about tensors and how to work with tensors using TensorFlow. Let’s dive right in.

What is a Tensor?

A simple explanation would be that a tensor is a multi-dimensional array.

Scalar, Vector, Matrix and Tensor

A scalar is a single number. A vector is an array of numbers. A matrix is a 2-dimensional array. A tensor is an n-dimensional array.

In TensorFlow, everything can be considered a tensor including a scalar. A scalar would be a tensor of dimension 0, a vector of dimension 1, and a matrix of dimension 2.

Now, this is useful because we are not limited in the kinds of data we can work with in TensorFlow, even with complex datasets. TensorFlow can represent any type of data as tensors and feed it to machine learning models.
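Before moving to TensorFlow itself, it may help to see the idea of dimensions with plain Python lists. The sketch below counts how deeply a value is nested and reads off its shape. The function names are made up for this illustration, and it assumes non-empty, rectangular nested lists.

```python
def rank(value):
    """Number of nesting levels: 0 for a scalar, 1 for a vector, and so on."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0]
    return depth

def shape(value):
    """Length along each dimension, from the outermost to the innermost."""
    dims = []
    while isinstance(value, list):
        dims.append(len(value))
        value = value[0]
    return tuple(dims)

scalar = 7
vector = [10, 10]
matrix = [[10, 11], [12, 13]]
tensor = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

for value in (scalar, vector, matrix, tensor):
    print(rank(value), shape(value))
```

TensorFlow tensors expose the same two pieces of information through their ndim and shape properties.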

What is TensorFlow?

TensorFlow is an open-source software library for building neural networks. The Google Brain team built it, and it is one of the most popular deep learning libraries today.

You can use TensorFlow to build AI models including image and speech recognition, natural language processing, and predictive modeling.

Classification neural network

TensorFlow uses a dataflow graph to represent computations. To put it simply, TensorFlow has made it easy to build complex machine learning models.

TensorFlow takes care of a lot of work behind the scenes which makes it useful while building and training any type of machine learning model. TensorFlow also manages the computation, including parallelization and optimization, on the user’s behalf.

TensorFlow and Keras

TensorFlow has a high-level API called Keras. Keras was a standalone project which is now available within the TensorFlow library. Keras makes it easy to define and train models while TensorFlow provides more control over the computation.

TensorFlow supports a wide range of hardware, including CPUs, GPUs, and TPUs. TPUs are Tensor Processing Units, built specifically to work with tensors and TensorFlow.

We can also run TensorFlow on mobile devices and IoT devices using TensorFlow Lite. TensorFlow also has a large community of developers, and it is updated with new features and capabilities.

How to Build Tensors with TensorFlow

Let’s start writing some code. If you don't have TensorFlow installed, you can use a Google colab notebook to follow along.

Let’s start by importing TensorFlow and printing out the version.

import tensorflow as tf
print(tf.__version__)

OUTPUT:
2.9.2

Let’s first create a scalar using tf.constant. We use tf.constant to create a new constant value. We can also use tf.Variable to create a variable value. We will then print the value and also check the dimension of the scalar using the ndim property. Its dimension will be zero because it is a single value.

scalar = tf.constant(7)
print(scalar)
print(scalar.ndim)

OUTPUT:
tf.Tensor(7, shape=(), dtype=int32)
0

Now let’s create a vector and print its dimensions. You can see that the dimension is 1.

vector = tf.constant([10,10])
print(vector)
print(vector.ndim)

OUTPUT:
tf.Tensor([10 10], shape=(2,), dtype=int32)
1

Now let’s try creating a matrix and printing its dimensions.

matrix = tf.constant([
    [10,11],
    [12,13]
])
print(matrix)
print(matrix.ndim)

OUTPUT:
tf.Tensor(
[[10 11]
 [12 13]], shape=(2, 2), dtype=int32)
2

You will see that the dimension is now 2. You can also see that the shape of the matrix is 2 by 2.

Shapes and dimensions are useful when working with TensorFlow because we will often change them while using these data to train neural networks.

We have seen that these tensors have a default datatype of int32. What if we want to create a tensor with a custom datatype?

tf.constant provides us with the dtype argument. Let’s create a 3-dimensional tensor with float32 as the data type.

tensor_1 = tf.constant([
    [
        [1,2,3]
    ],
    [
        [4,5,6]
    ],
    [
        [7,8,9]
    ]
],dtype='float32')
print(tensor_1)

OUTPUT:
tf.Tensor(
[[[1. 2. 3.]]

 [[4. 5. 6.]]

 [[7. 8. 9.]]], shape=(3, 1, 3), dtype=float32)

Now let’s create the same tensor with the default data type. We will input a 3-dimensional array to tf.constant and also print its dimensions.

tensor = tf.constant([
    [
        [1,2,3]
    ],
    [
        [4,5,6]
    ],
    [
        [7,8,9]
    ]
])
print(tensor)
print(tensor.ndim)

OUTPUT:
tf.Tensor(
[[[1 2 3]]
 [[4 5 6]]
 [[7 8 9]]], shape=(3, 1, 3), dtype=int32)
3

Now we have a tensor of dimension 3 and shape 3 by 1 by 3. This is a simple example of a rank-3 tensor. In real-world scenarios, we will be dealing with tensors of higher dimensions and bigger shapes.

Now let’s look at how to create a variable tensor. We won’t be using variable tensors very often compared to constant tensors, but it is good to know that we have an option.

We will use tf.Variable to create a variable tensor. The difference between the constant tensor and variable tensor is that you can change the data in a variable tensor, but you can’t change the values in a constant tensor. Let’s create a variable tensor and print the dimensions.

var_tensor = tf.Variable([
    [
        [1,2,3]
    ],
    [
        [4,5,6]
    ],
    [
        [7,8,9]
    ]
])
print(var_tensor)

OUTPUT:
<tf.Variable 'Variable:0' shape=(3, 1, 3) dtype=int32, numpy=
array([[[1, 2, 3]],
       [[4, 5, 6]],
       [[7, 8, 9]]], dtype=int32)>
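To make the difference between the two kinds of tensors concrete, here is a minimal sketch that updates a variable tensor in place using its assign and assign_add methods. The names weights and frozen are made up for this illustration.

```python
import tensorflow as tf

# Hypothetical names, chosen only for this illustration.
weights = tf.Variable([1.0, 2.0, 3.0])
frozen = tf.constant([1.0, 2.0, 3.0])

# A variable can be overwritten or incremented in place.
weights.assign([4.0, 5.0, 6.0])
weights.assign_add([1.0, 1.0, 1.0])
print(weights.numpy())  # [5. 6. 7.]

# A constant tensor is immutable: operations on it return new tensors.
shifted = frozen + 1.0
print(frozen.numpy())   # [1. 2. 3.]
print(shifted.numpy())  # [2. 3. 4.]
```

This in-place update is what training relies on: model weights are stored as variables so that an optimizer can adjust them step by step.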

How to Generate and Load Tensors

Let’s look at how to generate tensors. In most cases, you won’t be creating tensors from scratch. You will either load a dataset, convert other datasets like NumPy arrays to tensors, or generate tensors. First, let’s look at how to generate tensors.

Let’s create a tensor with random values. There are two common ways you can do this: generate a normal distribution of data or a uniform distribution of data.

Normal distribution

The normal distribution is a bell-shaped curve that represents the distribution of data. Most of the data will be close to the average and fewer data will be away from the average. This means the probability of getting a value near the average is higher.

Uniform distribution

The uniform distribution is a straight line that represents the distribution of data. All the values in a uniform distribution will have an equal probability of occurring within a given range.

Before we generate random values, you must understand what a seed is. If we use a seed value, we can regenerate the same set of data multiple times. This will be useful when we want to test our machine-learning model against the same data after we tweak its performance.

Let’s create two arrays of random tensors. We will first set a seed and generate the random values using that seed.

seed = tf.random.Generator.from_seed(42)

Now we will create a normal and uniform distribution with the shape of 3 by 2.

normal_tensor = seed.normal(shape=(3,2))
print(normal_tensor)
uniform_tensor = seed.uniform(shape=(3,2))
print(uniform_tensor)

OUTPUT:
tf.Tensor(
[[-0.7565803  -0.06854702]
 [ 0.07595026 -1.2573844 ]
 [-0.23193765 -1.8107855 ]], shape=(3, 2), dtype=float32)
tf.Tensor(
[[0.7647915  0.03845465]
 [0.8506975  0.20781887]
 [0.711869   0.8843919 ]], shape=(3, 2), dtype=float32)

We have two tensors created, one with a normal distribution of random numbers and the other with a uniform distribution of random numbers.
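If you want to see what the seed buys you, the short sketch below creates two generators from the same seed and checks that they produce identical values. The variable names gen_a and gen_b are only illustrative.

```python
import tensorflow as tf

# Two generators created from the same seed produce the same sequence.
gen_a = tf.random.Generator.from_seed(42)
gen_b = tf.random.Generator.from_seed(42)

sample_a = gen_a.normal(shape=(3, 2))
sample_b = gen_b.normal(shape=(3, 2))

same = bool(tf.reduce_all(tf.equal(sample_a, sample_b)))
print(same)  # True

# Drawing again from the same generator advances its internal state,
# so a second call returns a fresh set of values.
sample_a2 = gen_a.normal(shape=(3, 2))
print(sample_a2.shape)  # (3, 2)
```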

Next, we will create a tensor with zeros and ones. In TensorFlow, tensors filled with zeros or ones are often used as a starting point for creating other tensors. They can also be placeholders for inputs in a computational graph.

To create a tensor of zeroes, use the tf.zeros function with a shape as the input argument. To create a tensor with ones, we use tf.ones with the shape as input argument.

zeros = tf.zeros(shape=(3,2))
print(zeros)
ones = tf.ones(shape=(3,2))
print(ones)

OUTPUT:
tf.Tensor(
[[0. 0.]
 [0. 0.]
 [0. 0.]], shape=(3, 2), dtype=float32)
tf.Tensor(
[[1. 1.]
 [1. 1.]
 [1. 1.]], shape=(3, 2), dtype=float32)

Now, let’s look at converting NumPy arrays into tensors. If you don’t know what NumPy is, it is a Python library for numerical computing. It helps us handle large datasets and perform a variety of computations on them.

Let’s import NumPy and create a NumPy array using NumPy’s arange function.

import numpy as np
numpy_arr = np.arange(1,25,dtype=np.int32)

Now, we can create a tensor using the tf.constant function with the NumPy array as input. TensorFlow has built-in support to handle NumPy arrays, so it is just a matter of importing a NumPy array and setting a shape.

print(numpy_arr)
numpy_tensor = tf.constant(numpy_arr,shape=[2,4,3])
print(numpy_tensor)

OUTPUT:
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
tf.Tensor(
[[[ 1  2  3]
  [ 4  5  6]
  [ 7  8  9]
  [10 11 12]]
 [[13 14 15]
  [16 17 18]
  [19 20 21]
  [22 23 24]]], shape=(2, 4, 3), dtype=int32)

You can see both the NumPy array as well as our tensor. The original NumPy array was a 1-dimensional array of 24 values, but our tensor is 2x4x3. This is called reshaping a tensor, which we will often do while training deep neural networks.

Basic Operations using TensorFlow

We have learned how tensors are created in TensorFlow. Now let’s look at some basic operations using tensors.

We will start by getting some information on our tensors. Let’s create a 4D tensor with 0 values with the shape 2x3x4x5.

rank4_tensor = tf.zeros([2,3,4,5])
print(rank4_tensor)

OUTPUT:
tf.Tensor(
[[[[0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]]
  [[0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]]
  [[0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]]]
 [[[0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]]
  [[0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]]
  [[0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0.]]]], shape=(2, 3, 4, 5), dtype=float32)

We have created our rank 4 tensor. Now let's get some information about the size (number of values), shape, and the dimension of the tensor.

We will use tf.size function to get the size. The shape and ndim properties will give us the shape and dimensions of the tensor.

print("Size",tf.size(rank4_tensor))
print("shape",rank4_tensor.shape)
print("Dimension",rank4_tensor.ndim)

OUTPUT: 

Size tf.Tensor(120, shape=(), dtype=int32)
shape (2, 3, 4, 5)
Dimension 4

Let’s look at some simple calculations using the tensor. I will create a new basic tensor.

basic_tensor = tf.constant([[10,11],[12,13]])
print(basic_tensor)

OUTPUT: 

tf.Tensor(
[[10 11]
 [12 13]], shape=(2, 2), dtype=int32)

Let’s try some simple operations. We can add, subtract, multiply, and divide every value in a tensor using the basic operators.

print(basic_tensor + 10)
print(basic_tensor - 10)
print(basic_tensor * 10)
print(basic_tensor / 10)

OUTPUT:
tf.Tensor(
[[20 21]
 [22 23]], shape=(2, 2), dtype=int32)
tf.Tensor(
[[0 1]
 [2 3]], shape=(2, 2), dtype=int32)
tf.Tensor(
[[100 110]
 [120 130]], shape=(2, 2), dtype=int32)
tf.Tensor(
[[1.  1.1]
 [1.2 1.3]], shape=(2, 2), dtype=float64)

Now let’s try matrix multiplication. I will create two simple tensors tensor_011 and tensor_012.

tensor_011 = tf.constant([[2,2],[4,4]])
tensor_012 = tf.constant([[2,3],[4,5]])

Keep in mind that in matrix multiplication, the inner dimensions should match. For example, multiplying a (3, 5) tensor by a (3, 5) tensor won’t work, but multiplying a (3, 5) tensor by a (5, 3) tensor will.

The final shape of the resulting matrix will be given by the outer dimensions. So, a 3x5 tensor multiplied by a 5x3 tensor will give us a 3x3 tensor. We will use the tf.matmul function to perform matrix multiplication.

print(tf.matmul(tensor_011,tensor_012))

OUTPUT:
tf.Tensor(
[[12 16]
 [24 32]], shape=(2, 2), dtype=int32)
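A quick way to check the shape rule for matrix multiplication is to multiply two tensors and look at the result. The sketch below uses tensors of ones with made-up names a and b, and also shows what happens when the inner dimensions do not match.

```python
import tensorflow as tf

a = tf.ones([3, 5])
b = tf.ones([5, 3])

# Inner dimensions match (5 and 5); the result takes the outer dimensions.
product = tf.matmul(a, b)
print(product.shape)  # (3, 3)

# Mismatched inner dimensions (5 and 3) raise an error.
try:
    tf.matmul(a, a)
    failed = False
except (tf.errors.InvalidArgumentError, ValueError):
    failed = True
print(failed)  # True
```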

Next, let’s look at reshaping and transposing a matrix. As we saw before, we will often use reshaping to change our matrix structure while training neural networks.

For example, an image pixel matrix of 28x28 will be converted into a 1-dimensional 784-pixel array for an image classification neural network.

To reshape, we use the tf.reshape function. To transpose, we use the tf.transpose function. If you don't know what a transpose is, it's converting rows into columns and columns into rows.

print(tf.reshape(tensor_011,[4,1]))
print(tf.transpose(tensor_011))

OUTPUT:
tf.Tensor(
[[2]
 [2]
 [4]
 [4]], shape=(4, 1), dtype=int32)
tf.Tensor(
[[2 4]
 [2 4]], shape=(2, 2), dtype=int32)
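The 28x28 image case can be sketched the same way. Here a tensor of zeros stands in for a real image, and the batch size of 32 is an arbitrary choice for the example.

```python
import tensorflow as tf

# Stand-in for one 28x28 grayscale image.
image = tf.zeros([28, 28])
flat = tf.reshape(image, [784])
print(flat.shape)  # (784,)

# With a batch of images, -1 lets TensorFlow infer the remaining dimension.
batch = tf.zeros([32, 28, 28])
flat_batch = tf.reshape(batch, [32, -1])
print(flat_batch.shape)  # (32, 784)
```

Reshaping never changes the number of values, so the product of the new dimensions has to equal the product of the old ones.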

Finally, let’s look at some aggregate operations like min, max, sum, standard deviation, and variance, as well as element-wise operations like square and square root.

To find the minimum and maximum values, we use the tf.reduce_min and tf.reduce_max functions. And to find the sum of the array, we use the tf.reduce_sum function.

tensor_013 = tf.constant([
    [1,2,3],
    [4,5,6],
    [7,8,9]
],dtype='float32')
print(tf.reduce_min(tensor_013))
print(tf.reduce_max(tensor_013))
print(tf.reduce_sum(tensor_013))

OUTPUT:
tf.Tensor(1.0, shape=(), dtype=float32)
tf.Tensor(9.0, shape=(), dtype=float32)
tf.Tensor(45.0, shape=(), dtype=float32)

Now for the standard deviation and variance, we use the tf.math.reduce_std function and tf.math.reduce_variance function.

print(tf.math.reduce_std(tensor_013))
print(tf.math.reduce_variance(tensor_013))

OUTPUT:
tf.Tensor(2.5819888, shape=(), dtype=float32)
tf.Tensor(6.6666665, shape=(), dtype=float32)

Let’s find the square root, square, and natural log of each value in a tensor.

print(tf.sqrt(tensor_013))
print(tf.square(tensor_013))
print(tf.math.log(tensor_013))

OUTPUT:
tf.Tensor(
[[1.        1.4142135 1.7320508]
 [2.        2.236068  2.4494898]
 [2.6457512 2.828427  3.       ]], shape=(3, 3), dtype=float32)
tf.Tensor(
[[ 1.  4.  9.]
 [16. 25. 36.]
 [49. 64. 81.]], shape=(3, 3), dtype=float32)
tf.Tensor(
[[0.        0.6931472 1.0986123]
 [1.3862944 1.609438  1.7917595]
 [1.9459102 2.0794415 2.1972246]], shape=(3, 3), dtype=float32)

We have learned the basics of TensorFlow in this article. You are now equipped to work with TensorFlow and use it to model data.

If you want to start using this knowledge and build a project, you can check out my course on building a handwriting recognition neural network using TensorFlow. You can also learn advanced TensorFlow concepts using the official documentation.

Conclusion

TensorFlow is a powerful library to build deep-learning models. It has all the tools we need to construct neural networks to solve problems like image classification, sentiment analysis, stock market predictions, etc.

With the advent of technologies like ChatGPT, learning TensorFlow will give you a head start in the current job market.

Hope you liked this article. You can learn more about me and my articles/videos at manishmshiva.com.

Learn Neural Networks by Building a Self-Driving Car Sim Using JavaScript

Beau Carnes — Thu, 12 May 2022 16:29:42 +0000

"Any application that can be written in JavaScript, will eventually be written in JavaScript." – Jeff Atwood

It's time for you to create a self-driving car using JavaScript!

We just published a course on the freeCodeCamp.org YouTube channel that will help you learn about neural networks by teaching you how to build a self-driving car simulator in JavaScript (with no libraries!).

Radu Mariescu-Istodor developed this course. Radu has a PhD in computer science and is known for his creative tutorials on machine learning and programming.

In this course you will learn how to implement the car driving mechanics, how to define the environment, how to simulate some sensors, how to detect collisions, and how to make the car control itself using a neural network.

The course covers how artificial neural networks work, by comparing them with the real neural networks in our brain. You will learn how to implement a neural network and how to visualize it so we can see it in action.

Radu uses JavaScript to implement the system and he teaches modern JavaScript techniques. This course is perfect for people interested in becoming software engineers or machine learning specialists (like Radu – he has over 10 years of research experience with machine learning).

Here are the sections covered in this course:

Car driving mechanics
Defining the road
Artificial sensors
Collision detection
Simulating traffic
Neural network
Parallelization
Genetic algorithm

Watch the full course below or on the freeCodeCamp.org YouTube channel (2.5-hour watch).

What Are Graph Neural Networks? How GNNs Work, Explained with Examples

freeCodeCamp — Tue, 01 Feb 2022 16:50:35 +0000

By Rishit Dagli

Graph Neural Networks are getting more and more popular and are being used extensively in a wide variety of projects.

In this article, I help you get started and understand how graph neural networks work while also trying to address the question "why" at each stage.

Finally we will also take a look at implementing some of the methods we talk about in this article in code.

And don't worry – you won't need to know very much math to understand these concepts and learn how to apply them.

What is a graph?

Put quite simply, a graph is a collection of nodes and the edges between the nodes. In the below diagram, the white circles represent the nodes, and they are connected with edges, the red colored lines.

You could continue adding nodes and edges to the graph. You could also add directions to the edges which would make it a directed graph.

A simple representation of a graph

Something quite handy is the adjacency matrix which is a way to express the graph. The values of this matrix (A_{ij}) are defined as:

$$A_{ij} = \left\{\begin{array}{ c l }1 & \quad \textrm{if there exists an edge } j \rightarrow i \\ 0 & \quad \textrm{if no edge exists} \end{array} \right.$$

Another way to represent the adjacency matrix is simply flipping the direction so in the same equation (A_{ij}) will be 1 if there is an edge (i \rightarrow j) instead.

The latter representation is in fact what I studied in school. But often in Machine Learning papers, you will find the first notation used – so for this article we will stick to the first representation.

There are a lot of interesting things you might notice from the adjacency matrix. First of all, you might notice that if the graph is undirected, you essentially end up with a symmetric matrix and more interesting properties, especially with the eigenvalues of this matrix.

One such property which would be helpful in this context is that taking powers of the matrix gives us ((A^n)_{ij}), the number of (directed or undirected) walks of length (n) between nodes (i) and (j).
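Here is a small NumPy sketch of these ideas. The three-node graph and its edge list are made up for illustration, and the matrix follows the first convention, where an entry is 1 if there is an edge from the column node to the row node.

```python
import numpy as np

# Hypothetical directed graph with edges 0 -> 1, 1 -> 2 and 0 -> 2.
edges = [(0, 1), (1, 2), (0, 2)]
num_nodes = 3

# A[i, j] = 1 if there is an edge j -> i.
A = np.zeros((num_nodes, num_nodes), dtype=int)
for src, dst in edges:
    A[dst, src] = 1

# (A^n)[i, j] counts the walks of length n from node j to node i.
A2 = np.linalg.matrix_power(A, 2)
print(A2[2, 0])  # 1, the single walk 0 -> 1 -> 2

# Ignoring edge directions gives a symmetric adjacency matrix.
A_undirected = np.maximum(A, A.T)
print(np.array_equal(A_undirected, A_undirected.T))  # True
```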

Why work with data in Graphs?

Well graphs are used in all kinds of common scenarios, and they have many possible applications.

Probably the most common application of representing data with graphs is using molecular graphs to represent chemical structures. These have helped predict bond lengths, charges, and new molecules.

With molecular graphs, you can use Machine Learning to predict if a molecule is a potent drug.

For example, you could train a graph neural network to predict if a molecule will inhibit certain bacteria and train it on a variety of compounds you know the results for.

Then you could essentially apply your model to any molecule and end up discovering that a previously overlooked molecule would in fact work as an excellent antibiotic. This is how Stokes et al. in their paper (2020) predicted a new antibiotic called Halicin.

Another interesting paper by DeepMind (ETA Prediction with Graph Neural Networks in Google Maps, 2021) modeled transportation maps as graphs and ran a graph neural network to improve the accuracy of ETAs by up to 50% in Google Maps.

In this paper they partition travel routes into super segments, each of which models a part of the route. This gave them a graph structure on which they could run a graph neural network.

There have been other interesting papers that represent naturally occurring data as graphs (social networks, electrical circuits, Feynman diagrams and more) that made significant discoveries as well.

And if you think about it, a standard neural network can be represented as a graph too 🤯.

What can we do with Graph Neural Networks?

Let's first start with what we might want to do with our graph neural network before understanding how we would do that.

One kind of output we might want from our graph neural network is on the entire graph level, to have a single output vector. You could relate this kind of output with the ETA prediction or predicting binding energy from a molecular structure from the examples we talked about.

Another kind of output you might want is the node or edge level predictions and end up with a vector for each node or edge. You could relate this with an example where you need to rank every node in the prediction or probably predict the bond angle for all bonds given the molecular structure.

You might also be interested in answering the question "Where should I place a new edge or a node" or predict where an edge or a node might appear. We could not only get that prediction from the graph, but then we could also turn some other data into a graph.

Defining what we want our GNN to do

As you might have guessed with the graph neural network, we first want to generate an output graph or latents from which we would then be able to work on this wide variety of standard tasks.

So essentially what we need to do from the latent graph (features for each node represented as (\vec{h_i})) for the graph level predictions is:

first figure out some way to aggregate all the vectors (like simply summing), and
then create some function to get the predictions:

$$\vec{Z_G} = f(\sum_i \vec{h_i})$$

And now it is quite simple to show on a high level what we need to do from the latents to get our outputs.

For node level outputs we would just have one node vector passed into our function and get the predictions for that node:

$$\vec{Z_i} = f(\vec{h_i})$$
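As a rough sketch of these two formulas, the snippet below uses random latents and a single random linear layer as a stand-in for the function f. Both are assumptions made only so that the example runs; in practice f would be a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latents: 4 nodes, each with a 3-dimensional vector h_i.
h = rng.normal(size=(4, 3))

# Stand-in for f: one linear layer with random weights (an assumption).
W = rng.normal(size=(3, 2))

def f(x):
    return x @ W

z_graph = f(h.sum(axis=0))  # aggregate first, then predict: one vector per graph
z_nodes = f(h)              # predict per node: one vector for each node

print(z_graph.shape)  # (2,)
print(z_nodes.shape)  # (4, 2)
```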

The problem with variable sized inputs

Now that we know what we can do with the graph neural networks and why you might want to represent your data in graphs, let's see how we would go about training on graph data.

But first off, we have a problem on our hands: graphs are essentially variable size inputs. In a standard neural network, as shown in the figure below, the input layer (shown in the figure as (x_i)) has a fixed number of neurons. In this network you cannot suddenly apply the network to a variable sized input.

Why the standard neural network won't work?

But if you recall, you can apply convolutional neural networks on variable sized inputs.

Let's put this in terms of an example: you have a convolution with the filter count (K=5), spatial extent (F=2), stride (S=4), and no zero padding (P=0). You can pass in ((256 \times 256 \times 3)) inputs and get ((64 \times 64 \times 5)) outputs ((\left \lfloor{\frac{256-2+0}{4}+1}\right \rfloor = 64)) and you can also pass ((96 \times 96 \times 3)) inputs and get ((24 \times 24 \times 5)) outputs and so on – the layer is essentially independent of the spatial size of the input.
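The output sizes in this example can be checked with a few lines of Python. The helper name conv_output_size is made up for this sketch, and it uses the common form of the formula with padding applied on both sides.

```python
def conv_output_size(width, filter_size, stride, padding=0):
    """Spatial output size of a convolution: floor((W - F + 2P) / S) + 1."""
    return (width - filter_size + 2 * padding) // stride + 1

# The same filter settings work for inputs of different spatial sizes.
print(conv_output_size(256, filter_size=2, stride=4))  # 64
print(conv_output_size(96, filter_size=2, stride=4))   # 24
```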

This does make us wonder if we can draw some inspiration from convolutional neural networks.

Another really interesting way of solving the problem of variable input sizes that takes inspiration from Physics comes from the paper Learning to Simulate Complex Physics with Graph Networks by DeepMind (2020).

Let's start off by taking some particles (i), where each of those particles has a certain location (\vec{r_i}) and some velocity (\vec{v_i}). Let's say that these particles have springs in between them to help us understand any interactions.

Now this system is, of course, a graph: you can take the particles to be nodes and the springs to be edges. If you now recall simple high-school physics, (force = mass \cdot acceleration) – and, well, what is another way in this system to denote the total force acting on a particle? It is the sum of the forces exerted on it by all of its neighboring particles.

You can now write ((e_{ij}) represents the properties of the edge or spring between i and j):

$$m\frac{\mathrm{d} \vec{v_i}}{\mathrm{d}t} = \sum_{j \in \textrm{ neighbours of } i } \vec{F}(\vec{r_i}, \vec{r_j}, e_{ij})$$

Something I would like to draw your attention to here is that this force law is always the same. Maybe there are differences in the properties of the spring or edge, but you can still apply the same law. You can have different numbers of nodes and edges and you can still apply the exact same equation of motion.
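To see the "same law, any graph" idea in action, here is a minimal NumPy sketch. The particle positions, the spring stiffness values, and the zero-rest-length spring force are all assumptions picked for the example, not something taken from the paper.

```python
import numpy as np

# Hypothetical system: 3 particles in 2D connected by springs (edges).
positions = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
springs = {(0, 1): 1.0, (0, 2): 0.5}  # (i, j) -> stiffness e_ij

def force(r_i, r_j, e_ij):
    # Assumed force law: a zero-rest-length spring pulling i toward j.
    return e_ij * (r_j - r_i)

total = np.zeros_like(positions)
for (i, j), e_ij in springs.items():
    f_ij = force(positions[i], positions[j], e_ij)
    total[i] += f_ij  # force on i from neighbour j
    total[j] -= f_ij  # equal and opposite force on j

print(total[0])  # [1. 1.]
```

Adding more particles or springs only makes the dictionary longer. The force function itself stays exactly the same.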

Visualizing the presented solutions to variable sized inputs

If you look closely, the intuitions we discussed to get around the problem of fixed inputs have an aspect of similarity to them: it is fairly clear in writing that the second approach takes into account the neighboring nodes and edges and creates some function (here force) of it. I wanted to point out that the way convolutional neural networks work is not much different.

How to learn from data in a graph

Now that we've discussed what might give us inspiration to create a graph neural network, let's now try actually building one. Here we'll see how we can learn from the data residing in a graph.

We will start by talking about "Neural Message Passing" which is analogous to filters in a convolutional neural network or force which we talked about in the earlier section.

So let's say we have a graph with 3 nodes (directed or undirected). As you might have guessed, we have a corresponding value for each node (x_1), (x_2) and (x_3).

Just like any neural network, our goal is to find an algorithm to update these node values which is analogous to a layer in the graph neural network. And then you can of course keep on adding such layers.

So how do you do these updates? One idea would be to use the edges in our graph. For the purposes of this article, let's assume that from the 3 nodes we have an edge pointing from (x_3 \rightarrow x_1). We can send a message along this edge which will carry a value that will be computed by some neural network.

For this case we can write this down like below (and we will break down what this means too):

$$\vec{m_{31}}=f_e(\vec{h_3}, \vec{h_1}, \vec{e_{31}})$$

We will use our same notations:

(\vec{m_{31}}) is the message passed from node 3 to node 1,
(\vec{h_3}) and (\vec{h_1}) are the values nodes 3 and 1 have,
(\vec{e_{31}}) is the value of the edge between node 3 and node 1, and
(f_e) represents the "some neural network" function which depends on all these values, often called the message function.

And let's say we have an edge from (x_2 \rightarrow x_1) as well. We can apply the same expression we created above, just replacing the node numbers.

If you have more nodes, you would want to do this for every edge pointing to node 1. And the easiest way to accumulate all these is to simply sum them up. Look closely and you will see this is really similar to the intuition from particles we had discussed earlier!

Now you have an aggregated value of the messages coming to node 1, but you still need to update its value. So we will use another neural network (f_v), often called the update network. It depends on two things: the original value of node 1, of course, and the aggregate of the messages we had.

Simply putting these together not just for node 1 in our example but for any node in any graph, we can write it down as:

$$\vec{h_i^{\prime}} = f_v(\vec{h_i}, \sum_{j \in N_i} \vec{m_{ji}})$$

(\vec{h_i^{\prime}}) are our updated node values, (N_i) is the set of neighbors sending messages to node (i), and (\vec{m_{ji}}) are the messages coming to node (i) that we calculated earlier.

You would then apply these same two neural networks (f_e) and (f_v) for each of the nodes comprising the graph.

A really important thing to note here is that the two neural networks we use to update our node values operate on fixed sized inputs, like a standard neural network. Generally the two neural networks we spoke of, (f_e) and (f_v), are small MLPs.
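One round of message passing can be sketched in plain NumPy as follows. The graph, the feature size, and the single random linear layers standing in for the message and update networks are all assumptions for illustration. In a real model these would be small trained MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Hypothetical graph with directed edges written as (sender, receiver).
edges = [(2, 0), (1, 0), (0, 1)]
h = rng.normal(size=(3, dim))                       # node values
e = {edge: rng.normal(size=dim) for edge in edges}  # edge values

# Stand-ins for f_e and f_v: one random linear layer + tanh each.
W_e = rng.normal(size=(3 * dim, dim))
W_v = rng.normal(size=(2 * dim, dim))

def f_e(h_sender, h_receiver, e_sr):
    return np.tanh(np.concatenate([h_sender, h_receiver, e_sr]) @ W_e)

def f_v(h_i, aggregated):
    return np.tanh(np.concatenate([h_i, aggregated]) @ W_v)

# 1) Compute one message per edge and sum the messages arriving at each node.
incoming = np.zeros_like(h)
for s, r in edges:
    incoming[r] += f_e(h[s], h[r], e[(s, r)])

# 2) Update every node from its old value and its aggregated messages.
h_new = np.stack([f_v(h[i], incoming[i]) for i in range(len(h))])
print(h_new.shape)  # (3, 4)
```

Notice that both functions always see inputs of the same length, no matter how many nodes or edges the graph has. Only the loops grow with the graph.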

Visualizing Message Passing Neural Networks

Earlier we talked about the different kind of outputs we are interested in obtaining from our graph neural networks. You might have already noticed that when training our model the way we talked about, we will be able to generate the node level predictions: a vector for each node.

To perform graph classification, we want to try and aggregate all the node values we have after training our network. We will use a readout or pooling layer (the name says what it does: it reads a single output out of all the node values).

Generally we can create a function (f_r) depending on the set of node values. But it should also be permutation invariant (it should not depend on how you choose to label the nodes), and it should look something like this:

$$y^{\prime} = f_r(\{x_i \vert i \in \textrm{ graph} \})$$

The simplest way to define a readout function would be by summing over all node values. You could also take the mean, maximum, or minimum, or even a combination of these or other permutation invariant properties that best suit the situation. Your (f_r), as you might have guessed, can also be a neural network, which is often used in practice.
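You can convince yourself of the permutation invariance with a short check. The readout function below, which concatenates the sum, mean, and maximum, is just one possible choice, and the node values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))  # hypothetical node values for 5 nodes
perm = rng.permutation(5)    # a relabelling of the nodes

def readout(values):
    # Combine several permutation-invariant statistics into one vector.
    return np.concatenate([
        values.sum(axis=0),
        values.mean(axis=0),
        values.max(axis=0),
    ])

# Shuffling the node order does not change the readout.
print(np.allclose(readout(x), readout(x[perm])))  # True
```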

The ideas and intuitions we just talked about create the Message Passing Neural Networks (MPNNs), one of the most potent graph neural networks first proposed in Neural Message Passing for Quantum Chemistry (Gilmer et al. 2017).

How to change edge values

It now seems like we have indeed created a general graph neural network. But you can see that our message network requires (e_{ij}), the edge property, which you have to initialize yourself – just as you randomly initialize node values at the start.

But while the node values get changed at each step, the edge values are also initialized by you – but they're not changed. So, we need to try and generalize this as well, an extension to what we just saw.

Understanding how the node updates work, I think you can very easily apply something similar for an edge update function as well.

(U_{edge}) is another standard neural network:

$$e_{ij}^{\prime} = U_{edge}(e_{ij}, x_i, x_j)$$

Something you could also do with this framework is that the outputs by (U_{edge}) are already edge level properties – so why not just use them as my message? Well, you could do this as well.

Message Passing Neural Network discussion

Message Passing Neural Networks (MPNN) are the most general graph neural network layers. But this does require storage and manipulation of edge messages as well as the node features.

This can get a bit troublesome in terms of memory and representation. So sometimes these do suffer from scalability issues, and in practice are applicable to small sized graphs.

As Petar Veličković says "MPNNs are the MLPs of the graph domain". We will be looking at some extensions of MPNNs as well as how to implement an MPNN in code.

You can quite easily apply exactly what we talked about in either PyTorch or TensorFlow – but try doing so and you will see that this just blows up the memory.

Usually what we do with standard neural networks is work on batches of data. So you usually pass in an input array of shape [batch size, # of input neurons] to the neural network to make it work efficiently.

Now, as highlighted earlier, our number of input neurons is not the same for every graph. And yes, convolutional neural networks do deal with arbitrarily sized images, but when you think in terms of batches, you need all the images to have the same dimensions.

There are multiple things you could do:

Operate with a single graph at a time (of course very inefficient)
You could also aggregate your graphs into one big graph and not allow messages to pass from one of the smaller graphs to another smaller graph. This would introduce complications when doing graph level predictions and you would have to adapt your readout function.
You could also use Ragged Tensors which are variable length tensors: a great tutorial can be found here.
Take inspiration from CNNs again: you could use padding so that graphs of different sizes fit in the same batch. For example, with a batch padded to 10 nodes, you take a graph with 7 nodes and set the remaining 3 nodes to be 0. Similarly, for a graph with 8 nodes, you set the remaining 2 nodes to be 0.
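The padding option from the list above could look roughly like this in NumPy. The helper pad_graph and the extra mask it returns are not from any library. They are a sketch of one way to keep track of which nodes are real.

```python
import numpy as np

def pad_graph(adjacency, features, max_nodes):
    """Zero-pad one graph to max_nodes and return a mask marking the real nodes."""
    n = adjacency.shape[0]
    pad = max_nodes - n
    adj_padded = np.pad(adjacency, ((0, pad), (0, pad)))
    feat_padded = np.pad(features, ((0, pad), (0, 0)))
    mask = np.arange(max_nodes) < n
    return adj_padded, feat_padded, mask

# Hypothetical batch: graphs with 7 and 8 nodes, padded to 10 nodes each.
graphs = [
    (np.ones((7, 7)), np.ones((7, 4))),
    (np.ones((8, 8)), np.ones((8, 4))),
]
padded = [pad_graph(a, x, max_nodes=10) for a, x in graphs]

adj_batch = np.stack([p[0] for p in padded])
feat_batch = np.stack([p[1] for p in padded])
mask_batch = np.stack([p[2] for p in padded])

print(adj_batch.shape, feat_batch.shape)  # (2, 10, 10) (2, 10, 4)
print(mask_batch.sum(axis=1))             # [7 8]
```

The mask matters at readout time: without it, a mean over nodes would also average in the padded zeros.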

Other popular GNN architectures

In this section I will give you an overview of some other widely used graph neural network layers.

We won't be looking at the intuition behind any of these layers and how each part pieces together in the update function. Instead I'll just give you a high level overview of these methods. You could most certainly read the original papers to get a better understanding.

Graph Convolutional Networks

One of the most popular GNN architectures is Graph Convolutional Networks (GCN) by Kipf et al. which is essentially a spectral method.

Spectral methods work with the representation of a graph in the spectral domain. Spectral here means that we will utilize the Laplacian eigenvectors.

GCNs are built on top of ChebNets, which propose that the feature representation of any node should be affected only by its k-hop neighborhood. We would compute our convolution using Chebyshev polynomials.

In a GCN this is simplified to (K=1). We will start off by defining a degree matrix (row wise summation of adjacency matrix):

$$\tilde{D}_{ii}=\sum_j\tilde{A}_{ij}$$

The graph convolutional network update rule after using a symmetric normalization can be written as follows, where (H) is the feature matrix and (W) is the trainable weight matrix:

$$H^{\prime}=\sigma(\tilde{D}^{-1/2} \tilde{A}\tilde{D}^{-1/2} HW)$$

Node-wise, you can write this as follows, where (|N_i|) and (|N_j|) are the sizes of the node neighborhoods:

$$\vec{h_i^{\prime}} = \sigma\left(\sum_{j \in N_i} \frac{1}{\sqrt{|N_i||N_j|}} W \vec{h_j}\right)$$
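To make the matrix form of the update rule concrete, here is a small NumPy sketch. It assumes that the adjacency matrix with a tilde means the adjacency matrix plus self-loops (A + I) and uses ReLU for the nonlinearity; the function names and the toy graph are made up for illustration.

```python
import numpy as np


def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])   # assumption: self-loops are added
    degrees = A_tilde.sum(axis=1)      # row-wise sums give the degree matrix diagonal
    D_inv_sqrt = np.diag(degrees ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt


def gcn_layer(A, H, W):
    """One GCN step: sigma(D^{-1/2} A_tilde D^{-1/2} H W), with sigma = ReLU."""
    return np.maximum(normalize_adjacency(A) @ H @ W, 0.0)


# A path graph with three nodes: 0 - 1 - 2.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
H = np.eye(3)           # one-hot node features
W = np.ones((3, 2))     # toy weights; in practice these are trained
A_hat = normalize_adjacency(A)
H_next = gcn_layer(A, H, W)
```

Each entry of `A_hat` is one over the square root of the product of the two nodes' degrees, which is the coefficient that appears in the node-wise form.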

Of course, with GCN you no longer have edge features, and you lose the idea, which we had with the MPNN discussed earlier, that a node can send a value across the graph.

Graph Attention Network

Recall the node-wise update rule in GCN we just saw? (\frac{1}{\sqrt{|N_i||N_j|}}) is derived from the degree matrix of the graph.

In Graph Attention Network (GAT) by Veličković et al., this coefficient (\alpha_{ij}) is computed implicitly. So for a particular edge you take the features of the sender node, receiver node, and the edge features as well and pass them through an attention function.

$$a_{ij}=a(\vec{h_i}, \vec{h_j}, \vec{e_{ij}})$$

(a) could be any learnable, shared, self-attention mechanism like transformers. These could then be normalized with a softmax function across the neighborhood:

$$\alpha_{ij}=\frac{e^{a_{ij}}}{\sum_{k \in N_i} e^{a_{ik}}}$$

This constitutes the GAT update rule. The authors hypothesize that this could be significantly stabilized with multi-head self attention. Here is a visualization by the paper's authors showing a step of the GAT.

A single GAT step

This method is also very scalable because it only has to compute a scalar for the influence from node i to node j, and not a vector as in MPNN. But it is probably not as general as MPNNs.
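The softmax normalization across each neighborhood can be sketched in NumPy as follows. The raw scores are hard-coded here; in a real GAT they come from the learnable attention function. The sketch also assumes that every node has at least one neighbor (self-loops are included in the toy adjacency matrix) and leaves out edge features and the weight matrix.

```python
import numpy as np


def attention_coefficients(scores, adjacency):
    """Softmax of raw scores a_ij over each node's neighborhood (row-wise)."""
    # Non-edges get -inf so they contribute exp(-inf) = 0 to the softmax.
    masked = np.where(adjacency > 0, scores, -np.inf)
    # Subtract the row maximum for numerical stability.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)


# Toy graph 0 - 1 - 2 with self-loops; the scores are made up.
adjacency = np.array([[1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]])
scores = np.array([[0.5, 1.0, 2.0],
                   [0.1, 0.2, 0.3],
                   [3.0, 0.0, 0.0]])
alpha = attention_coefficients(scores, adjacency)

H = np.eye(3)        # toy node features
H_next = alpha @ H   # each node aggregates its neighbors, weighted by alpha
```

Note that only one scalar per edge is stored in `alpha`, which is where the scalability argument comes from.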

Code Implementation for Graph Neural Networks

With multiple frameworks like PyTorch Geometric, TF-GNN, Spektral (based on TensorFlow) and more, it is indeed quite simple to implement graph neural networks. We will see a couple of examples here starting with MPNNs.

Here is how you create a message passing neural network similar to the one in the original paper "Neural Message Passing for Quantum Chemistry" with PyTorch Geometric:

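import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.transforms as T
from torch_geometric.utils import normalized_cut
from torch_geometric.nn import NNConv, global_mean_pool, graclus, max_pool, max_pool_x

# `d` (the dataset) and `transform` (applied after pooling) are not shown in
# this excerpt; they are assumed to be defined as in the complete notebook.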


def normalized_cut_2d(edge_index, pos):
    row, col = edge_index
    edge_attr = torch.norm(pos[row] - pos[col], p=2, dim=1)
    return normalized_cut(edge_index, edge_attr, num_nodes=pos.size(0))


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        nn1 = nn.Sequential(
            nn.Linear(2, 25), nn.ReLU(), nn.Linear(25, d.num_features * 32)
        )
        self.conv1 = NNConv(d.num_features, 32, nn1, aggr="mean")

        nn2 = nn.Sequential(nn.Linear(2, 25), nn.ReLU(), nn.Linear(25, 32 * 64))
        self.conv2 = NNConv(32, 64, nn2, aggr="mean")

        self.fc1 = torch.nn.Linear(64, 128)
        self.fc2 = torch.nn.Linear(128, d.num_classes)

    def forward(self, data):
        data.x = F.elu(self.conv1(data.x, data.edge_index, data.edge_attr))
        weight = normalized_cut_2d(data.edge_index, data.pos)
        cluster = graclus(data.edge_index, weight, data.x.size(0))
        data.edge_attr = None
        data = max_pool(cluster, data, transform=transform)

        data.x = F.elu(self.conv2(data.x, data.edge_index, data.edge_attr))
        weight = normalized_cut_2d(data.edge_index, data.pos)
        cluster = graclus(data.edge_index, weight, data.x.size(0))
        x, batch = max_pool_x(cluster, data.x, data.batch)

        x = global_mean_pool(x, batch)
        x = F.elu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        return F.log_softmax(self.fc2(x), dim=1)

You can find a complete Colab Notebook demonstrating the implementation here, and it is indeed quite heavy. It is quite simple to implement this in TensorFlow as well, and you can find a full length tutorial on Keras Examples here.

Implementing a GCN is also quite simple with PyTorch Geometric. You can easily implement it with TensorFlow as well, and you can find a complete Colab Notebook here.

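import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

# `dataset`, `data`, and `args` are not shown in this excerpt; they are
# assumed to be defined as in the complete notebook.
class Net(torch.nn.Module):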
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_features, 16, cached=True,
                             normalize=not args.use_gdc)
        self.conv2 = GCNConv(16, dataset.num_classes, cached=True,
                             normalize=not args.use_gdc)

    def forward(self):
        x, edge_index, edge_weight = data.x, data.edge_index, data.edge_attr
        x = F.relu(self.conv1(x, edge_index, edge_weight))
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index, edge_weight)
        return F.log_softmax(x, dim=1)

And now let's try implementing a GAT. You can find the complete Colab Notebook here.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class Net(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()

        self.conv1 = GATConv(in_channels, 8, heads=8, dropout=0.6)
        # On the Pubmed dataset, use heads=8 in conv2.
        self.conv2 = GATConv(8 * 8, out_channels, heads=1, concat=False,
                             dropout=0.6)

    def forward(self, x, edge_index):
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=-1)

Conclusion

Thank you for sticking with me until the end. I hope that you've taken away a thing or two about graph neural networks and enjoyed reading through how these intuitions for graph neural networks form in the first place.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!

Lastly, for the motivated reader, I would also encourage you to read the original paper "The Graph Neural Network Model", where GNNs were first proposed, as it is really interesting. An open-access archive of the paper can be found here. This article also takes inspiration from Theoretical Foundations of Graph Neural Networks and CS224W, which I suggest you check out.

You can also find me on Twitter @rishit_dagli, where I tweet about machine learning, and a bit of Android.

How to Improve the Accuracy of Your Image Recognition Models

Jason — Mon, 29 Nov 2021 17:09:30 +0000

These 7 tricks and tips will take you from 50% to 90% accuracy for your image recognition models in literally minutes.

So, you have gathered a dataset, built a neural network, and trained your model.

But despite the hours (and sometimes days) of work you've invested to create the model, it spits out predictions with an accuracy of 50–70%. Chances are, this is not what you expected.

Here are a few strategies, or hacks, to boost your model’s performance metrics.

1. Get More Data

Deep learning models are only as powerful as the data you bring in. One of the easiest ways to increase validation accuracy is to add more data. This is especially useful if you don’t have many training instances.

If you’re working on image recognition models, you may consider increasing the diversity of your available dataset by employing data augmentation. These techniques include anything from flipping an image over an axis and adding noise to zooming in on the image. If you are a strong machine learning engineer, you could also try data augmentation with GANs.
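As a rough illustration of the simpler augmentations mentioned above (flipping, adding noise, and zooming), here is a NumPy sketch. The function names, the noise level, and the nearest-neighbor zoom are assumptions chosen to keep the example short; image libraries offer higher-quality versions of these operations.

```python
import numpy as np

rng = np.random.default_rng(0)


def flip_horizontal(image):
    """Mirror the image over its vertical axis."""
    return image[:, ::-1]


def add_noise(image, std=0.05):
    """Add Gaussian noise and keep pixel values in [0, 1]."""
    noisy = image + rng.normal(0.0, std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def center_zoom(image, factor=2):
    """Crop the center and scale it back up by repeating pixels."""
    h, w = image.shape[:2]
    crop_h, crop_w = h // factor, w // factor
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    crop = image[top:top + crop_h, left:left + crop_w]
    return np.repeat(np.repeat(crop, factor, axis=0), factor, axis=1)


image = rng.random((8, 8))   # a made-up 8x8 grayscale image in [0, 1]
augmented = [flip_horizontal(image), add_noise(image), center_zoom(image)]
```

Each augmented copy keeps the original label, so one labelled image becomes several training instances.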

2. Add More Layers

Adding more layers to your model increases its ability to learn your dataset’s features more deeply. This means that it will be able to recognize subtle differences that you, as a human, might not have picked up on.

This hack entirely relies on the nature of the task you are trying to solve.

For complex tasks, such as differentiating between the breeds of cats and dogs, adding more layers makes sense because your model will be able to learn the subtle features that differentiate a poodle from a Shih Tzu.

For simple tasks, such as classifying cats and dogs, a simple model with few layers will do.

More layers -> More nuanced model.

Photo by [Alvan Nee](https://unsplash.com/@alvannee?utm_source=medium&utm_medium=referral) on Unsplash

3. Change Your Image Size

When you preprocess your images for training and evaluation, there is a lot of experimentation you can do with regards to the image size.

If you choose an image size that is too small, your model will not be able to pick up on the distinctive features that help with image recognition.

Conversely, if your images are too big, it increases the computational resources required by your computer and/or your model might not be sophisticated enough to process them.

Common image sizes include 64x64, 128x128, 28x28 (MNIST), and 224x224 (VGG-16).

Keep in mind that most preprocessing algorithms do not consider the aspect ratio of the image, so smaller-sized images might appear to have shrunk over a certain axis.

Converting an image from a large resolution to a small size, like 28x28, usually ends up with a lot of pixelation that tends to have negative effects on your model’s performance. [Source](https://dribbble.com/shots/4829233-Pixelated-Mona-Lisa)

4. Increase Epochs

Epochs are basically how many times you pass the entire dataset through the neural network. Incrementally train your model with more epochs with intervals of +25, +100, and so on.

Increasing epochs makes sense only if you have a lot of data in your dataset. However, your model will eventually reach a point where increasing epochs will not improve accuracy.

At this point, you should consider playing around with your model’s learning rate. This little hyperparameter dictates whether your model reaches its global minimum (the ultimate goal for neural nets) or gets stuck in a local minimum.

Global Minimum is the ultimate goal for neural networks. [Source](https://www.dna-ghost.com/single-post/2018/03/13/Neural-network-Escaping-from-variety-of-non-global-minimum-traps)

5. Decrease Colour Channels

Colour channels reflect the dimensionality of your image arrays. Most colour (RGB) images are composed of three colour channels, while grayscale images have just one channel.

The more complex the colour channels are, the more complex the dataset is and the longer it will take to train the model.

If colour is not such a significant factor in your model, you can go ahead and convert your colour images to grayscale.

You can even consider other colour spaces, like HSV and Lab.
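Converting RGB to grayscale reduces three channels to one. A minimal NumPy sketch, assuming pixel values in [0, 1] and the commonly used BT.601 luminance weights (other weightings exist, and most image libraries provide their own conversion):

```python
import numpy as np


def rgb_to_grayscale(image):
    """Collapse an (H, W, 3) RGB array into an (H, W) grayscale array."""
    weights = np.array([0.299, 0.587, 0.114])  # red, green, blue
    return image @ weights


rgb = np.zeros((2, 2, 3))
rgb[0, 0] = [1.0, 1.0, 1.0]   # white pixel
rgb[0, 1] = [1.0, 0.0, 0.0]   # pure red pixel
gray = rgb_to_grayscale(rgb)
```

The result has a third of the input values, which is where the savings in training time come from.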

RGB images are composed of three colour channels: red, green, and blue. [Source](https://www.youtube.com/watch?v=ZqUotba3V5Y)

6. Transfer Learning

Transfer learning involves the use of a pre-trained model, such as YOLO and ResNet, as a starting point for most computer vision and natural language processing tasks.

Pre-trained models are state-of-the-art deep learning models that were trained on millions and millions of samples, and often for months. These models have an astonishingly huge capability of detecting nuances in different images.

These models can be used as a base for your model. Most models are so good that you won’t need to add convolutional and pooling layers.

Read more about using transfer learning.

Transfer learning can greatly improve your model’s accuracy from ~50% to 90%! Source: [Nvidia blog](https://www.nvidia.com/content/dam/en-zz/en_sg/ai-innovation-day-2019/assets/pdf/9_NVIDIA-Transfer-Learning-Toolkit-for-Intelligent-Video-Analytics.pdf)

Final Thoughts

The hacks above offer a base for you to optimize a model. To really fine tune a model, you’ll need to consider tuning the various hyperparameters and functions involved in your model, such as the learning rate (as discussed above), activation functions, loss functions, and so on.

This hack comes as an “I hope you know what you’re doing” warning because there is a wider scope to mess up your model.

Always Save Your Models

Always save your model every time you make a change to your deep learning model. This will help you reuse a previous configuration of the model if it provides greater accuracy.

Most deep learning frameworks like TensorFlow and PyTorch have a “save model” method.

# In TensorFlow
model.save('model.h5')  # Saves the entire model to a single artifact

# In PyTorch
torch.save(model, PATH)

There are countless other ways to further optimize your deep learning, but the hacks described above serve as a base in the optimization part of deep learning.

Tweet at me letting me know what your favourite hack is!

Deep Learning Tutorial – How to Use PyTorch and Transfer Learning to Diagnose COVID-19 Patients

Juan Cruz Martinez — Wed, 03 Nov 2021 19:49:35 +0000

Ever since the outbreak of COVID-19 in December 2019, researchers in the field of artificial intelligence and machine learning have been trying to find better ways to diagnose the disease.

They've worked on developing algorithms that would detect the disease within a matter of seconds – and only by looking at chest X-rays and/or CT scan images.

Some of these techniques have proven to be extremely useful and accurate in diagnosing COVID-19 cases.

There are multiple approaches that use both machine learning and deep learning to detect and/or classify the disease. Researchers have also proposed newly developed architectures along with transfer learning approaches.

In this article, we will look at a transfer learning approach that classifies COVID-19 cases using chest X-ray images.

The model we are going to use is one of the seven variants of the EfficientNet architecture. We will use a pre-trained model on the immense ImageNet dataset. EfficientNet is an advanced and complex convolutional neural network-based architecture.

We will further investigate the details of Convolutional Neural Networks, pre-trained models, and EfficientNet during the course of this article. I've divided it into five parts:

What are convolutional neural networks?

A dive into transfer learning.

What is EfficientNet?

An introduction to PyTorch.

Implementation of COVID-19 classifier using EfficientNet with PyTorch.

This tutorial assumes that you have prior knowledge of both machine learning and deep learning. If you want to further develop your foundation in these topics, check out this article on Artificial Intelligence vs Machine Learning vs Deep Learning.

Also, although the dataset we'll work with here is COVID-related, you can apply the actual code implementation and analysis to other datasets.

What is a Convolutional Neural Network?

Convolutional neural networks (CNNs) are a type of deep neural network that works on visual data, that is, images. A CNN takes an image as an input and performs two or three-dimensional convolutional operations on the image with several filters, also referred to as kernels.

These convolution operations output a 2D or 3D matrix, computed with the filters' learnable weights and biases, that captures spatial information about the input image. This output matrix is referred to as the feature map of the image.

Processing a convolutional neural network in the training process can be, in some cases, extremely slow. This is why it's a good idea to use GPUs and TPUs during training for deep learning techniques, especially convolutional neural networks.

Convolutional neural networks learn spatial and temporal information about the image far better than the basic feed forward neural network. Also, CNNs can reduce the size of the image while retaining the most important information in the image, which is crucial for predictive analysis of images.

Source

The starting layers of convolutional neural networks learn the abstract and simpler features in an image, such as lines and edges. But as we move deeper into the network, the feature map turns to the more complex structures in the image.

It starts to learn the more specific features of the image, such as a cat, a dog, or a person, the same way we would, as humans, perceive the world around us. This is a core concept in modern deep learning-based computer vision.

Now before we move on to advanced concepts, it is important to learn the basics of 2D convolution.

What is 2D Convolution?

2D convolution is a bit complex to explain, but here it goes: if the convolutional process (which is extensively used in 1-D signal processing) is performed between two signals – but not just along a single dimension, rather along two mutually perpendicular dimensions – it is called 2D convolution.

In the case of images, the two mutually perpendicular dimensions are the rows and columns of a greyscale image. The convolutional operation is mathematically done by multiplying and then accumulating the values of the overlapping samples of the two input signals, where one of the signals is flipped. The output of this multiplication and accumulation gives a single point on the feature map.

In the case of CNNs, the image is one signal and the filter/kernel is the second signal which is flipped. The size of the kernel is always smaller than that of the image.

The flipped kernel is then swept across the whole image both row by row and column by column to output the feature map.

2d convolution

Here a 3x3 kernel is swept across a 6x6 image to output a 4x4 feature map. As you can see, the dimensions of the output feature map are smaller than the input image. So there are a few concepts used in convolution to control the dimensions of the output feature map. These include padding, stride, and kernel size.

Padding is the manual addition of rows and columns around the input to keep the output dimension the same as the input dimension or vary it.

Stride refers to the jump the kernel takes during the sweep, both in columns and rows. In the example above, the stride of the convolution is 1 as the kernel is moving one unit in both rows and columns.

Kernel size refers to the dimensions of the kernel used. Changing the dimensions of the kernel to be swept changes the output size of the feature map.
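The sweep described above, including padding and stride, can be written out in a few lines of NumPy. This is a deliberately slow, loop-based sketch for illustration; the function name `conv2d` and the toy inputs are assumptions, and deep learning libraries use far faster implementations (and usually skip the kernel flip).

```python
import numpy as np


def conv2d(image, kernel, stride=1, padding=0):
    """2D convolution of a grayscale image with a flipped kernel."""
    flipped = kernel[::-1, ::-1]        # flip the kernel along both axes
    padded = np.pad(image, padding)     # add zero rows/columns around the input
    kh, kw = flipped.shape
    out_h = (padded.shape[0] - kh) // stride + 1
    out_w = (padded.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = padded[r * stride:r * stride + kh,
                            c * stride:c * stride + kw]
            out[r, c] = np.sum(window * flipped)  # multiply and accumulate
    return out


image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.zeros((3, 3))
kernel[1, 1] = 1.0                      # picks out the center pixel of each window

feature_map = conv2d(image, kernel)                        # 6x6 input, 3x3 kernel
strided_map = conv2d(image, kernel, stride=2, padding=1)   # padding 1, stride 2
```

With no padding and a stride of 1, the 6x6 input yields a 4x4 feature map; with a padding of 1 and a stride of 2 it yields a 3x3 one.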

The image below describes the convolution with the same kernel size but with a padding of 1 and stride of 2.

The equation that describes the relationship of stride, padding, and kernel size to input and output dimensions is as follows:
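In its commonly used form, the relationship for one spatial dimension is:

$$O = \left\lfloor \frac{I + 2P - K}{S} \right\rfloor + 1$$

Here (I) is the input size, (K) the kernel size, (P) the padding added on each side, (S) the stride, and (O) the output size. The symbol names are chosen for this sketch. The same relationship as a small helper function:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


no_padding = conv_output_size(6, 3)                        # 3x3 kernel on a 6x6 image
padded_strided = conv_output_size(6, 3, stride=2, padding=1)
same_size = conv_output_size(224, 3, stride=1, padding=1)  # padding that keeps the size
```

Apply it once for the rows and once for the columns when the image or kernel is not square.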

The concept of 3D convolution is just an extension of 2D convolution where both the input image and the kernel are three-dimensional.

Like 2D convolution, we sweep the three-dimensional kernel across the whole image in two mutually perpendicular dimensions, namely the rows and the columns.

We do not usually sweep the kernel across the color channels because the kernel has the same third dimension, that is the channel length, as the original image. This gives an output feature map that is two-dimensional instead of three.

To learn more about the details of 3D convolution, you can read this article.

What is Transfer Learning?

In transfer learning, you take a machine or deep learning model that is pre-trained on a previous dataset and use it to solve a different problem without needing to re-train the whole model.

Instead, you can just use the weights and biases of the pre-trained model to make a prediction. You transfer the weights from one model to your own model and adjust them to your own dataset without re-training all the previous layers of the architecture.

We use transfer learning in the applications of convolutional neural networks and natural language processing because it decreases the computation time and complexity of the training process. And, in many cases, it performs surprisingly well.

This also helps in cases where we have limited data available – since neural networks demand an extremely large amount of data to achieve good performance.

This means that using transfer learning methods can greatly reduce the demand for data since the weights and biases are pre-adjusted and are able to work better with just a small amount of data by tweaking the weights and biases a little.

But transfer learning models do not always give you great performance (although the newer architectures perform efficiently on almost every problem). Still, sometimes the problem at hand needs an architecture that is pre-trained on data that's similar to what you have. This factor depends upon the complexity of the problem you are trying to solve.

There are a couple ways you can perform transfer learning:

Using a pre-trained model.

Developing a new model.

You can use a pre-trained model in two ways. First, you can use the pre-trained weights and biases as initial parameters for your own model, and then train a whole convolutional model using those weights.

The other way is to perform feature extraction from the pre-trained model. You use the parameters of the pre-trained model to extract features from your input image and just train a simple classifier on top of it.

Another option is that if you have a problem with a small amount of data, you develop another model for a similar problem that has a large amount of data and train the model. Then you can use the trained weights from the new model to solve the original problem with less data.

In this tutorial, we will be using a pre-trained model as a feature extractor and we'll train a simple classifier on top of it to output the prediction.
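The feature-extraction approach can be sketched in PyTorch as follows. To keep the example self-contained, a tiny randomly initialized network stands in for the pre-trained model, and the data is random; only the pattern (freeze the base, train a small classifier on top) is the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a pre-trained feature extractor (hypothetical, not pre-trained).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
for param in backbone.parameters():
    param.requires_grad = False      # freeze: no gradients, no updates

head = nn.Linear(8, 2)               # simple classifier trained from scratch
model = nn.Sequential(backbone, head)

# Only the parameters that still require gradients are given to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=0.0005)

images = torch.randn(4, 3, 32, 32)   # made-up batch of four RGB images
labels = torch.tensor([0, 1, 0, 1])
weights_before = backbone[0].weight.clone()

loss = nn.CrossEntropyLoss()(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After the step, the backbone weights are unchanged and only the classifier has been updated.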

There are many well-known architectures in the field of deep learning that are nowadays used for the purpose of transfer learning. Almost all of these are trained on the ImageNet dataset, one of the largest open-source image datasets available. The full dataset has around fifteen million images, and the subset most commonly used for pre-training covers 1000 classes.

Among these pre-trained architectures, LeNet is the first one that was proposed in 1998. Other well-known models include VGG, ResNet, AlexNet, GoogleNet, Inception, and Xception.

EfficientNet is also part of the series that was proposed recently, in 2019.

What is EfficientNet?

EfficientNet (or perhaps it's better to say EfficientNets) is a family of convolutional neural network-based image classification models. They perform extremely well on the ImageNet benchmark and on other popular datasets such as CIFAR-100 and Flowers.

In addition to performing so well, the architecture is small and computes faster than any of the previous models. The architecture has variants ranging from EfficientNet-B0 up to EfficientNet-B7.

The variants B1 to B7 are obtained by scaling up the B0 baseline with the compound scaling method. EfficientNet-B7 acquired a Top-1 accuracy of 84.4% on the ImageNet dataset, which was the highest reported Top-1 accuracy on ImageNet at the time the paper was published.

If you want to learn more about how EfficientNets work, you can read this paper ‘Efficientnet: Rethinking Model Scaling for Convolutional Neural Networks.’

Source

In the coding tutorial further along in this article, we'll be using the EfficientNet-B0 as a feature extractor and a classifier on top of it to classify COVID-19 using chest x-ray images.

An Introduction to PyTorch

PyTorch is a Python-supported library that helps us build deep learning models. Unlike Keras (another deep learning library), PyTorch is flexible and gives the developer more control.

It is similar to NumPy in processing but has a faster GPU acceleration. To learn more about NumPy and its features, you can check out this in-depth guide along with its documentation.

PyTorch has a data structure known as a ‘Tensor’ that is similar to the NumPy ndarray but it has the option to operate on GPU.

PyTorch provides an uncomplicated way to switch computation between a CPU and a GPU. It also supports processing on NumPy arrays by simply providing a built-in module that can convert NumPy arrays into Tensors and vice versa.
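A short sketch of both points, converting between NumPy arrays and tensors and moving a tensor to whichever device is available (the array values are made up):

```python
import numpy as np
import torch

array = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)

tensor = torch.from_numpy(array)   # NumPy -> Tensor (shares memory on the CPU)
back_to_numpy = tensor.numpy()     # Tensor -> NumPy

# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
on_device = tensor.to(device)

# A tensor must be back on the CPU before it can be converted to NumPy.
doubled = (on_device * 2).cpu().numpy()
```

The same code runs unchanged on machines with and without a GPU, since the device is picked at runtime.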

One of the handiest features in PyTorch is automatic differentiation (autograd). It records the operations performed on a tensor during the forward pass so that gradients can be computed without you needing to derive and store them manually.

This gives you greater control of your deep learning operations, specifically back propagation, during the training process. This is helpful when computing the loss function which lets you adjust the parameters of a model.

We can also exclude a tensor from gradient computation for the entire process by setting its requires_grad attribute to False. To learn more about tensors and how to perform gradient computations in PyTorch, you can check out this tutorial and this course.
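A minimal example of both behaviors, with made-up values: a tensor that tracks gradients and one that is excluded from gradient computation.

```python
import torch

# A tensor that tracks gradients.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()      # forward pass: y = x1^2 + x2^2 + x3^2
y.backward()            # backward pass fills in x.grad with dy/dx = 2x

# A tensor excluded from gradient computation.
frozen = torch.tensor([1.0, 2.0], requires_grad=False)
z = (frozen * 3).sum()  # no graph is recorded for this result
```

Calling `z.backward()` here would raise an error, because nothing that `z` depends on requires a gradient.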

How to Implement a COVID-19 Classifier using EfficientNet with PyTorch

Now let's move on to the practical implementation of EfficientNet in PyTorch. We will use the B0 variant of the EfficientNet family.

First, we'll examine the data and preprocess it. Kaggle has a vast library of datasets available for open-source use in projects and research. There are no limits as to what dataset can be used for this project. You can use any dataset containing chest X-ray images of COVID-19 patients and people without COVID.

For the sake of this tutorial, we'll use this dataset here. But for the code to work on your custom dataset, you must divide your data into three directories: train, test, and valid.

Each directory should contain two more directories with the labels covid and normal. These covid and normal folders will contain the images corresponding to the specific class of the directory they are present in.

The original dataset we'll use in this article contains three folders: covid, normal, and pneumonia. We discard the pneumonia folder completely and divide the other data in the same way described above.

We do this to create a logical division between the data used for training and the data used for testing and validation. Also, the ImageFolder dataset class we use later takes the name of the folder an image is in as its class label by default, so we do not need a separate label file for the input dataset.
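The folder layout and the way folder names become labels can be sketched with the standard library alone. The helper `find_classes` below is a simplified stand-in written for this example: it mimics how class folder names are sorted and mapped to integer indices, and it builds the layout in a temporary directory.

```python
import os
import tempfile


def find_classes(root):
    """Map each sub-folder name to an integer label, in sorted order."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: index for index, name in enumerate(classes)}


with tempfile.TemporaryDirectory() as root:
    # Expected layout: <root>/<split>/<label>/<image files>
    for split in ("train", "test", "valid"):
        for label in ("covid", "normal"):
            os.makedirs(os.path.join(root, split, label))

    splits = sorted(os.listdir(root))
    class_to_idx = find_classes(os.path.join(root, "train"))
```

With this layout, every image under a `covid` folder gets label 0 and every image under a `normal` folder gets label 1.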

The data and the architecture

Let's have a look at the data. Below we can see the x-ray images of patients with COVID-19:

And here we can see the normal category’s x-ray images:

There are 237 total layers in the B-0 architecture. The whole architecture can be condensed into the following diagrams. We provide the x-ray data to the input layer.

Source

We will freeze the learning of the weights across all these blocks as we will be using the pre-trained weights to extract the features from our own input.

We'll do the feature extraction after the input passes Module 7. We then transfer the feature map obtained from Module 7 to our own final classification layers (this is why it's called transfer learning). We top the architecture with the following top layers:

BatchNorm1d

Linear(output neurons = 512)

ReLU()

BatchNorm1d()

Linear(output neurons = 128)

ReLU()

BatchNorm1d()

Dropout(probability of zeroing the parameters = 0.4)

Linear(output neurons = 2)

Let's head over to the code

Now before we start the code, there are a couple of dependencies we need to install. First, you'll need to install PyTorch on your local machine. You can do this using the pip install command in your Python environment. Refer here to install it depending on your machine (whether it has GPU available or not).

Before you move on to the code, I strongly recommend that you actually work through the code yourself. This makes it much easier to understand. With that said, you can access the full code in a Jupyter notebook here.

You also need to install Efficientnet support for PyTorch into the same Python environment. Run the command below to install it:

pip install efficientnet_pytorch

Apart from this you will need to import some other dependencies at the start of the code.

Now we start building the classification model. To start, we import all the necessary modules:

# importing required modules
import gdown
import zipfile
import numpy as np
from glob import glob
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torchsummary import summary
from torchvision import datasets, transforms as T
from efficientnet_pytorch import EfficientNet
import os
import torch.optim as optim
from PIL import ImageFile
from sklearn.metrics import accuracy_score

All these modules are essential to perform multiple functions across the model. You can install all the absent modules using the pip command.

Then we download and extract the data we prepared for the model:

# importing data
# Dataset address
url = 'https://drive.google.com/uc?export=download&id=1B75cOYH7VCaiqdeQYvMuUuy_Mn_5tPMY'
output = 'data.zip'
gdown.download(url, output, quiet=False)

# giving zip file name
data_dir = './data.zip'

# Extracting data from zip file
with zipfile.ZipFile(data_dir, 'r') as zf:
    zf.extractall('./data/')

The gdown.download function downloads the data from the URL provided, and the extractall method of zipfile.ZipFile extracts it into a data folder in your current directory (or in the same runtime if you are working on Google Colab).

I highly recommend working on Google Colab for this project in case you do not locally have a GPU available.

Next, create a check variable to check the availability of a GPU.

# Checking the availability of a GPU
use_cuda = torch.cuda.is_available()

This function returns True if a GPU is available and False if not.

Next, we need to apply pre-processing techniques to the data. Since our data is pre-augmented, we do not need to apply many pre-processing techniques to it. We only resize all the images to a single size of (224,224). We do this because the images in our dataset are all of different dimensions and we need a consistent dimension for the model.

We'll also convert the images to tensors to be processed by PyTorch and then we normalize all the images. This normalize function normalizes all the images with a mean and standard deviation of 0.5.
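To see what this normalization does to the pixel values, here is the same arithmetic in NumPy. After conversion to a tensor, pixel values lie in [0, 1]; subtracting a mean of 0.5 and dividing by a standard deviation of 0.5 maps them to [-1, 1]. The sample values are made up.

```python
import numpy as np


def normalize(pixels, mean=0.5, std=0.5):
    """Apply (pixel - mean) / std, the per-channel normalization formula."""
    return (pixels - mean) / std


pixels = np.array([0.0, 0.25, 0.5, 1.0])   # values as they are after ToTensor
normalized = normalize(pixels)
```

Black (0.0) becomes -1, mid-gray (0.5) becomes 0, and white (1.0) becomes 1.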

After that, we create the locations for the train, test and validation sets which will be given as input to the ‘datasets’ module. We do this so that the PyTorch model knows exactly where the data is located and so that the data can be loaded to the GPU. We keep a batch size of 32.

# declaring batch size
batch_size = 32

# applying required transformations on the dataset
img_transforms = {
    'train': T.Compose([
        T.Resize(size=(224, 224)),
        T.ToTensor(),
        T.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    ]),
    'valid': T.Compose([
        T.Resize(size=(224, 224)),
        T.ToTensor(),
        T.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    ]),
    'test': T.Compose([
        T.Resize(size=(224, 224)),
        T.ToTensor(),
        T.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    ]),
}

# creating location of data: train, validation, test
data = './data/'
train_path = os.path.join(data, 'train')
valid_path = os.path.join(data, 'valid')
test_path = os.path.join(data, 'test')

# creating a dataset for each of the folders above
train_file = datasets.ImageFolder(train_path, transform=img_transforms['train'])
valid_file = datasets.ImageFolder(valid_path, transform=img_transforms['valid'])
test_file = datasets.ImageFolder(test_path, transform=img_transforms['test'])

# creating loaders for the datasets
loaders_transfer = {
    'train': torch.utils.data.DataLoader(train_file, batch_size, shuffle=True),
    'valid': torch.utils.data.DataLoader(valid_file, batch_size, shuffle=True),
    'test': torch.utils.data.DataLoader(test_file, batch_size, shuffle=True),
}

After pre-processing, we move on to building the model.

# importing the pretrained EfficientNet model
model_transfer = EfficientNet.from_pretrained('efficientnet-b0')

# Freeze weights
for param in model_transfer.parameters():
    param.requires_grad = False

in_features = model_transfer._fc.in_features

# Defining Dense top layers after the convolutional layers
model_transfer._fc = nn.Sequential(
    nn.BatchNorm1d(num_features=in_features),
    nn.Linear(in_features, 512),
    nn.ReLU(),
    nn.BatchNorm1d(512),
    nn.Linear(512, 128),
    nn.ReLU(),
    nn.BatchNorm1d(num_features=128),
    nn.Dropout(0.4),
    nn.Linear(128, 2),
)

if use_cuda:
    model_transfer = model_transfer.cuda()

First, we import the EfficientNet-B0 model with its pre-trained weights. Next, we disable the training of the parameters of the model because we are going to use the pre-trained parameters to extract features from our data.

Then we replace the top fully connected layers of the model with our own classifier.

BatchNorm normalizes each feature's activations across the batch. The argument is the number of features coming into the layer. This stabilizes and speeds up training, and it has a mild regularizing effect. Dropout is a more direct guard against overfitting: during training, it zeroes out each neuron's output with the probability given as an argument.
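To make these two layers more concrete, here is a minimal pure-Python sketch of what they compute during training. It is an illustration under simplifying assumptions, not PyTorch's implementation: it leaves out BatchNorm's learnable scale and shift and its running statistics, and the helper names `batch_norm` and `dropout` are made up for this example.

```python
import random


def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) to zero mean and unit variance across the batch."""
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for f in range(n_features):
        column = [row[f] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        for i in range(n):
            out[i][f] = (batch[i][f] - mean) / (var + eps) ** 0.5
    return out


def dropout(values, p, rng):
    """Zero each value with probability p and rescale the survivors by 1 / (1 - p)."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]


# Two features on very different scales end up on the same scale.
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = batch_norm(batch)

# With p=0.4, roughly 40% of the activations are zeroed.
dropped = dropout([1.0] * 1000, p=0.4, rng=random.Random(0))
zero_fraction = dropped.count(0.0) / len(dropped)
```

Rescaling the surviving activations keeps their expected sum unchanged, which is why dropout can simply be switched off at evaluation time.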

The Linear layer is a simple fully-connected neural network layer.

Finally, we transfer our model to the GPU, if available.

```python
# selecting loss function
criterion_transfer = nn.CrossEntropyLoss()

# using Adam optimizer
optimizer_transfer = optim.Adam(model_transfer.parameters(), lr=0.0005)
```

Here, we select the loss function and the optimizer for our training phase. We also define the value of the learning rate for the optimizer. You can change this value to see how different learning rates influence the model in different ways.

Next, we move on to the training of the model.

```python
# let PIL load image files that are truncated instead of raising an error
ImageFile.LOAD_TRUNCATED_IMAGES = True

# Creating the function for training
def train(n_epochs, loaders, model, optimizer, criterion, use_cuda, save_path):
    """Train the model; return it with the per-epoch training and validation losses."""
    # initialize tracker for minimum validation loss
    valid_loss_min = np.inf
    trainingloss = []
    validationloss = []

    for epoch in range(1, n_epochs + 1):
        # initialize the variables to monitor training and validation loss
        train_loss = 0.0
        valid_loss = 0.0

        ######################
        # training the model #
        ######################
        model.train()
        for batch_idx, (data, target) in enumerate(loaders['train']):
            # move to GPU
            if use_cuda:
                data, target = data.cuda(), target.cuda()
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            # running mean of the batch losses
            train_loss = train_loss + ((1 / (batch_idx + 1)) * (loss.item() - train_loss))

        ########################
        # validating the model #
        ########################
        model.eval()
        # no parameter updates here, so gradients are not needed
        with torch.no_grad():
            for batch_idx, (data, target) in enumerate(loaders['valid']):
                if use_cuda:
                    data, target = data.cuda(), target.cuda()
                output = model(data)
                loss = criterion(output, target)
                valid_loss = valid_loss + ((1 / (batch_idx + 1)) * (loss.item() - valid_loss))

        # train_loss and valid_loss are already averages over the batches,
        # so they are not divided by the dataset size again
        trainingloss.append(train_loss)
        validationloss.append(valid_loss)

        # printing training/validation statistics
        print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(
            epoch, train_loss, valid_loss
        ))

        # saving the model if validation loss has decreased
        if valid_loss < valid_loss_min:
            torch.save(model.state_dict(), save_path)
            valid_loss_min = valid_loss

    # return trained model
    return model, trainingloss, validationloss
```

We create a function for the training and validation phases of the model. Setting ImageFile.LOAD_TRUNCATED_IMAGES to True lets PIL load image files that are truncated (incomplete) instead of raising an error. We initialize the values of the train and validation losses and start the training loop. We load the data batch by batch from the data loaders and perform the training operations.

After the training loop, we start the validation loop where we only compute the loss and the output predictions and do not update the parameters as we did in the training loop. We save the model which has the minimum loss for the validation set.
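The loss bookkeeping in the loop, `train_loss + (1 / (batch_idx + 1)) * (loss - train_loss)`, is an incremental way of computing the mean of the batch losses seen so far. The small sketch below uses made-up loss values and a hypothetical `running_mean` helper to show that it matches the ordinary average:

```python
def running_mean(values):
    """Average values incrementally, the same way the training loop averages batch losses."""
    mean = 0.0
    for idx, value in enumerate(values):
        mean = mean + (1 / (idx + 1)) * (value - mean)
    return mean


batch_losses = [0.9, 0.7, 0.4, 0.2]  # made-up batch losses for illustration
incremental = running_mean(batch_losses)
direct = sum(batch_losses) / len(batch_losses)
```

Both `incremental` and `direct` come out as the same value (up to floating-point rounding), so the tracked loss is already a per-batch average.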

```python
# training the model
n_epochs = 10
model_transfer, train_loss, valid_loss = train(
    n_epochs, loaders_transfer, model_transfer, optimizer_transfer,
    criterion_transfer, use_cuda, 'model.pt'
)
```

We train the model for 10 epochs, that is, 10 full passes over the training set. You can change the number of epochs and see how the loss values respond. The model with the lowest validation loss is saved under the name model.pt. Now we load the model and move on to the testing phase.

```python
# Defining the test function
def test(loaders, model, criterion, use_cuda):
    # monitoring test loss and accuracy
    test_loss = 0.
    correct = 0.
    total = 0.
    preds = []
    targets = []

    model.eval()
    # no gradients are needed for evaluation
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(loaders['test']):
            # moving to GPU
            if use_cuda:
                data, target = data.cuda(), target.cuda()
            # forward pass
            output = model(data)
            # calculate the loss
            loss = criterion(output, target)
            # updating average test loss
            test_loss = test_loss + ((1 / (batch_idx + 1)) * (loss.item() - test_loss))
            # converting the output scores to the predicted class
            pred = output.max(1, keepdim=True)[1]
            preds.append(pred)
            targets.append(target)
            # compare predictions
            correct += pred.eq(target.view_as(pred)).sum().item()
            total += data.size(0)

    print('Test Loss: {:.6f}'.format(test_loss))
    print('Test Accuracy: {:.2f}% ({}/{})'.format(
        100. * correct / total, int(correct), int(total)))
    return preds, targets

# loading the saved weights with the lowest validation loss
model_transfer.load_state_dict(torch.load('model.pt'))

# calling test function
preds, targets = test(loaders_transfer, model_transfer, criterion_transfer, use_cuda)
```

We now create a test function to apply our model to our test dataset and evaluate its performance.

We pass the dataset batch by batch as we did in the training and validation phases, but we only do it once here instead of for 10 epochs. This is because we just have to test the model and not update the parameters.

The function returns the predictions it computed for the input test set and also the original target values of the test set.

Now we compute the accuracy of the model. First, we need to convert the tensors, that is predictions and targets, into NumPy arrays. We do this by first moving them from the GPU to the CPU and then converting them to NumPy arrays. The following code does this:

```python
# converting the tensor objects to lists for the metric functions
preds2, targets2 = [], []

for i in preds:
    batch = i.cpu().numpy()
    for j in range(len(batch)):
        preds2.append(batch[j])

for i in targets:
    batch = i.cpu().numpy()
    for j in range(len(batch)):
        targets2.append(batch[j])
```

Now we compute the accuracy using the accuracy metric of the sklearn library.

```python
# Computing the accuracy
acc = accuracy_score(targets2, preds2)
print("Accuracy: ", acc)
```

Our model had an accuracy of 95.45%.

The next image is the confusion matrix for the test run of the classifier. In it, you can see the visual of the model’s performance. The actual labels indicate whether the person had COVID or not, while the predicted labels indicate how our model classified the images.

As we can see, our model predicted most of the labels correctly. The small portion of wrongly predicted labels includes 7 people who did not have COVID, but our model predicted they did. These are called false positives. This is not too alarming.

On the other hand, there were 14 examples where our model predicted that they did not have COVID, but they did. In machine learning, these are called false negatives. This is a very alarming situation because we would have sent home people suffering from COVID-19, which would increase the risk of their disease getting worse.
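Accuracy alone hides the difference between these two kinds of errors, which is why metrics such as precision and recall are often reported alongside it. The sketch below shows how they follow from the confusion-matrix counts. The false positive and false negative counts are the ones quoted above; the true positive and true negative counts are hypothetical placeholders, since the article only shows them in the image, and `classification_metrics` is a helper name made up for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # of the predicted positives, how many were right
        "recall": tp / (tp + fn),     # of the actual positives, how many were found
    }


# fp=7 and fn=14 come from the text above; tp and tn are hypothetical values.
metrics = classification_metrics(tp=200, fp=7, fn=14, tn=241)
```

False negatives lower recall, so a screening model for a disease would usually be tuned for high recall even at some cost in precision.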

Conclusion

Convolutional neural networks have proved extremely useful in computer vision techniques, and we can also use them efficiently in medical imaging and diagnosis.

Transfer learning is an effective method for using pre-trained architectures to perform efficiently in other applications.

But as we saw above, how useful these models are depends on what kind of problem we have and what our objectives are. In the detection of COVID-19, for example, we would prefer a model that gives us zero false negatives. Still, there's great potential for deep learning to be useful in COVID diagnosis as well as other medical diagnosis techniques.

Thanks for reading! If you enjoyed the article and would like to read more interesting articles around computer science, Python and JavaScript, please follow me on Twitter.

Data Science Interview Questions for Beginners

freeCodeCamp — Wed, 25 Aug 2021 21:39:37 +0000

By Davis David

In 2012, Harvard Business Review named data science the sexiest job of the 21st century. But if you want to get a job as a data scientist, you'll need to go through a tough interview process.

During data science job interviews, the interviewer will likely ask questions from different data science topics such as statistics, programming, data analysis, data pre-processing, and modeling.

Your skills will be put to the test, and you need to prepare yourself if you want to get through the interview successfully.

In this article, I have compiled a list of common data science interview questions with tips on how you can answer them. I've also shared a list of resources that will help you learn more about the specific topic presented in each interview question.

Data Science Interview Questions

What is Logistic Regression? How Have You Used Logistic Regression Recently?

Logistic regression is a popular algorithm used to solve classification problems. In this question, you need to explain what logistic regression is, how it works, and give an example of a data science problem you solved by using logistic regression.

Here are resources to help you get started crafting your response:

Logistic Regression: The good parts

The Least Squares Regression Method – How to Find the Line of Best Fit

Why do we Need Evaluation Metrics? What is a Confusion Matrix?

Machine learning models must be evaluated to check their performance. In this question, you need to explain how you can use the confusion matrix to evaluate the model's performance. You can further mention other metrics to evaluate regression and classification models.

Here are resources to help you get started crafting your response:

9 Key Machine Learning Algorithms Explained in Plain English

How I used Deep Learning to classify medical images with Fast.ai

How is Data Science Different from Traditional Application Programming?

A good way to answer this question is by using examples of how the program is created in both cases.

Traditional programming approach:

Data science approach:

Here is a good resource to help you get started crafting your response:

Free 6-Hour Data Science Course for Beginners

Explain the Difference Between Supervised and Unsupervised Learning.

Supervised and unsupervised learning are two types of machine learning techniques. The best way to answer this question is by explaining their differences in terms of the kind of datasets you can use in each technique and examples of algorithms.

Here are resources to help you get started crafting your response:

When to use different machine learning algorithms: a simple guide

Want to know how deep learning works? Here's a quick guide

What is a Decision Tree?

A decision tree is another supervised learning algorithm that you can use to solve regression or classification problems.

You should be able to explain how the decision tree algorithm learns from the data and the advantages and disadvantages of using a decision tree algorithm.

Here are resources to help you get started crafting your response:

How to Use Tree-Based Algorithms in Machine Learning

9 Key Machine Learning Algorithms Explained in Plain English

What is Cross-Validation?

The purpose of this question is to determine if you know any techniques used to assess the effectiveness of the machine learning model – for example, when you want to avoid overfitting.

When answering this question, you should explain any methods of cross-validation you have applied in any data science projects.
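If you mention k-fold cross-validation, it helps to be able to sketch how the folds are formed. The example below is a minimal illustration in plain Python; `k_fold_indices` is a hypothetical helper, and in practice you would typically rely on a library implementation and shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "validation:", valid_idx)
```

Every sample lands in the validation set exactly once, so the averaged validation score uses all of the data without ever scoring a model on samples it was trained on.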

Here are resources to help you get started crafting your response:

Get a Grip on Cross-Validation in Machine Learning

Key Machine Learning Concepts Explained

What is a Normal Distribution?

This term is commonly used when you're solving a data science problem. In this question, you can explain the meaning of normal distribution, its properties, and why it is important to check if your data is normally distributed.

Here are resources to help you get started crafting your response:

Normal Distribution Explained in Plain English

Normal Distribution Clearly Explained

What is a Random Forest Algorithm?

Random forest is one of the most popular machine learning algorithms. When answering this question, you should explain how the algorithm learns from the data and when you should use the random forest algorithm over other machine learning algorithms.

Here are resources to help you get started crafting your response:

Random Forest Classifier Tutorial

Dataset Splitting and Random Forest Algorithms

Random Forest Algorithm Explained

Explain Univariate, Bivariate, and Multivariate Analyses

These three types of analyses are used to summarize variables in the dataset and help you get some insights. You can also talk about how they're different and when you can apply them – just make sure to show some examples.

Here are resources to help you get started crafting your response:

Univariate, Bivariate and Multivariate Analysis

How to Select the Best Performing Linear Regression for Univariate Models

How can we Handle Missing Data?

Some datasets may have missing data or values and can cause a problem when training machine learning models.

It is important to mention some techniques that can be used to handle missing data. You can also share your experience of how you handled missing data in your last data science project.

Here are resources to help you get started crafting your response:

The Penalty of Missing Values in Data Science

Feature Engineering and Feature Selection for Beginners

Handling Missing Data Easily Explained

What is the Benefit of Dimensionality Reduction?

Dimensionality reduction is a technique to reduce the number of features or variables in the dataset.

There are different advantages of dimensionality reduction you can explain when answering this question. You should explain why and when you need to apply this technique.

Here are resources to help you get started crafting your response:

How to use dimensionality reduction

Escaping the curse of dimensionality

Pros and Cons of Dimensionality Reduction

How can we deal with Outliers?

An outlier is a data point that deviates significantly from the rest. In this question, you can explain how one can identify outliers and different techniques used to deal with outliers.
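One identification technique you could describe is the interquartile range (IQR) rule, which flags points that lie more than 1.5 × IQR below the first quartile or above the third quartile. Here is a minimal sketch using only Python's standard library; the data is made up and `iqr_outliers` is a hypothetical helper name.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [10, 12, 11, 13, 12, 11, 14, 95]  # made-up measurements
outliers = iqr_outliers(data)
print(outliers)
```

Whether you then remove, cap, or keep the flagged points depends on whether they are measurement errors or genuine rare events.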

Here are resources to help you get started crafting your response:

What is an Outlier in Statistics?

Three Ways to Deal with Outliers

How to Remove Outliers from a Dataset

What is Ensemble Learning?

In machine learning, ensemble learning is a process of using multiple algorithms to obtain better predictive performance than could be obtained from any one algorithm alone.

When answering this question, you can also share your experience the last time you implemented ensemble methods in a data science project.

Here are resources to help you get started crafting your response:

Introduction to Ensemble Learning

Ensemble Learning in Machine Learning

Explain how Machine Learning is Different from Deep Learning

The best way to explain the difference between machine learning and deep learning is the way they solve problems.

You can go further by explaining some of the problems that can be solved by either machine learning or deep learning techniques.

Here are resources to help you get started crafting your response:

A beginner's guide to Machine Learning and Deep Learning

AI vs ML – What's the Difference between Artificial Intelligence and Machine Learning?

Machine Learning Crash Course and Deep Learning Crash Course

What are the Differences Between Overfitting and Underfitting?

The best way to explain the difference between overfitting and underfitting is not just with a definition but through examples.

You can also share your personal experience when faced with overfitting or underfitting problems in a data science project.

Here are resources to help you get started crafting your response:

How to Handle Overfitting in Deep Learning Models

How to Build Better Machine Learning Models

Deep Learning with PyTorch Course

What is Regularisation? Why is it Useful?

When answering this question, you can also go further by explaining the two common regularization techniques L1 norm and L2 norm.

Here are resources to help you get started crafting your response:

How to Build your First Neural Network

Deep Learning Crash Course

What is Selection Bias?

It is not enough just to define Selection Bias. If possible you should explain different types of bias, their effects, and how to avoid them.

Here are resources to help you get started crafting your response:

What is Selection Bias?

Selection Bias – Don't forget about me!

Can you Explain the Difference Between a Validation Set and a Test Set?

In this question, after explaining their differences, you can explain the advantage of having a validation set and a test set in a data science project.

Here are resources to help you get started crafting your response:

Key Machine Learning Concepts Explained

Difference between Test Sets and Validation Sets

What to do when your training and testing data come from different distributions

Machine Learning – Validation vs Testing

What is the Difference Between Regression and Classification ML Techniques?

Regression and classification are both supervised learning techniques, and the main difference is their output: regression predicts a continuous value, while classification predicts a discrete class label. When you answer this question, you can mention a few algorithms that can be used to solve regression problems or classification problems. Also, try to share how their models are evaluated.

Here are resources to help you get started crafting your response:

How to Build and Train Linear and Logistic Regression ML Models

Regression vs Classification in Machine Learning

Machine Learning Basics for Developers

Classification and Regression in Machine Learning

What are Artificial Neural Networks?

In this question don't just define Artificial Neural Networks but also explain their advantages and where you can use them.

Here are resources to help you get started crafting your response:

Overview of Artificial Neural Networks and their Applications

Deep Learning Neural Networks Explained in Plain English

What Tools and Devices do you Plan to use in Your Role as a Data Scientist?

This question is straightforward, but you should mention tools you have used before or that you plan to use in future projects.

You can also share your experience of how various tools help you implement data science projects successfully.

Keep in mind that you will use different tools for different projects. For example, some tools can be used for an NLP project and others for a Time-series project.

Here are resources to help you get started crafting your response:

13 Tools Every Data Scientist Needs to Know

What is Natural Language Processing? State some Real-Life Examples of NLP.

You have to define natural language processing in a simple way and explain how it can be used to solve business problems. Then share some real-life examples. If possible, you can also share some of the NLP projects you have done on your own or in collaboration with others.

Here are resources to help you get started crafting your response:

What is Natural Language Processing? A tutorial for beginners

Learn Natural Language Processing with Python and TensorFlow

What Every Developer Needs to Know about NLP

Applications of NLP

What is Normalisation? Difference between Normalisation and Standardization?

Normalization and standardization are techniques used to pre-process the data before applying machine learning algorithms.

The purpose of the question is for you to explain the differences between these two techniques and under what conditions you should apply one rather than the other.
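A short example can make the distinction concrete. The sketch below assumes the common definitions, where normalization means min-max scaling to the range [0, 1] and standardization means rescaling to zero mean and unit standard deviation. The data and the helper names are made up for illustration.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # made-up data
normalized = min_max_normalize(heights_cm)
standardized = standardize(heights_cm)
```

Min-max scaling depends entirely on the smallest and largest values, so a single extreme point can squash everything else. Standardization is less sensitive to that, which is one condition you could mention in your answer.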

Here are resources to help you get started crafting your response:

The Difference Between Normalization and Standardization

Text Preprocessing for NLP and Machine Learning

Feature Engineering and Feature Selection for Beginners

Standardization vs Normalization – Feature Scaling

Preprocessing for Deep Learning

Final Thoughts on Data Science Interview Questions

Reviewing these common data science interview questions will actually boost your confidence during the interview.

Don't expect the interviewer to ask you all questions mentioned in this article. But most of the interview questions will come from the same topics.

For example, instead of asking "Explain the difference between supervised and unsupervised learning", the interviewer can ask you to “Explain some supervised learning algorithms and how they learn from the data”.

If you are interested in learning and reading more data science interview questions, take your time and read through these additional resources for inspiration.

And don't forget to practice your coding skills because some questions during the interview require you to code the solution.

I hope these data science interview questions will help you prepare for your interview and I wish you the best of luck in your data science career.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!

You can also find me on Twitter @Davis_McDavid.

And you can read more articles like this here.

How to Build Better Machine Learning Models

freeCodeCamp — Fri, 23 Apr 2021 16:22:43 +0000

By Rishit Dagli

Hello developers 👋. If you have built Deep Neural Networks before, you might know that it can involve a lot of experimentation.

In this article, I will share with you some useful tips and guidelines that you can use to build better deep learning models. These tricks should make it a lot easier for you to develop a good network.

You can pick and choose which tips you use, as some will be more helpful for the projects you are working on. Not everything mentioned in this article will straight up improve your models’ performance.

A high-level approach to Hyperparameter tuning🕹️

One of the more painful things about training Deep Neural Networks is the large number of hyperparameters you have to deal with.

These could be your learning rate α, the discounting factor ρ, and epsilon ε if you are using the RMSprop optimizer (Hinton et al.) or the exponential decay rates β₁ and β₂ if you are using the Adam optimizer (Kingma et al.).

You also need to choose the number of layers in the network or the number of hidden units for the layers. You might be using learning rate schedulers and would want to configure those features and a lot more 😩! We definitely need ways to better organize our hyperparameter tuning process.

A common algorithm I tend to use to organize my hyperparameter search process is Random Search. Though there are other algorithms that might be better, I usually end up using it anyway.

Let’s say for the purpose of this example you want to tune two hyperparameters and you suspect that the optimal values for both would be somewhere between one and five.

The idea here is that instead of picking twenty-five values to try out like (1, 1) (1, 2) and so on systematically, it would be more effective to select twenty-five points at random.

Based on Lecture Notes of Andrew Ng
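One way to see why random points can be more effective: a 5 × 5 grid only ever tries five distinct values of each hyperparameter, while twenty-five random points try twenty-five distinct values of each. The sketch below illustrates this in plain Python, with hypothetical helper names and an arbitrary seed.

```python
import random


def grid_points(low, high, steps):
    """All combinations of `steps` evenly spaced values per hyperparameter."""
    values = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    return [(a, b) for a in values for b in values]


def random_points(low, high, count, rng):
    """`count` points sampled uniformly at random from the same square."""
    return [(rng.uniform(low, high), rng.uniform(low, high)) for _ in range(count)]


grid = grid_points(1.0, 5.0, steps=5)
rand = random_points(1.0, 5.0, count=25, rng=random.Random(42))

# How many different values of the first hyperparameter does each strategy try?
distinct_grid_values = len({a for a, _ in grid})
distinct_random_values = len({a for a, _ in rand})
print(distinct_grid_values, distinct_random_values)
```

If one of the two hyperparameters turns out to matter much more than the other, the random strategy has explored it far more finely for the same number of training runs.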

Here is a simple example with TensorFlow where I try to use Random Search on the Fashion MNIST Dataset for the learning rate and the number of units in the first Dense layer:

```python
import keras_tuner as kt
import tensorflow as tf

# Load Fashion MNIST (the dataset named above) and scale pixel values to [0, 1]
(img_train, label_train), (img_test, label_test) = tf.keras.datasets.fashion_mnist.load_data()
img_train = img_train.astype('float32') / 255.0
img_test = img_test.astype('float32') / 255.0

def model_builder(hp):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Flatten(input_shape=(28, 28)))

    # Tune the number of units in the first Dense layer
    # Choose an optimal value between 32-512
    hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
    model.add(tf.keras.layers.Dense(units=hp_units, activation='relu'))
    model.add(tf.keras.layers.Dense(10))

    # Tune the learning rate for the optimizer
    # Choose an optimal value from 0.01, 0.001, or 0.0001
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'])
    return model

tuner = kt.RandomSearch(
    model_builder,
    objective='val_accuracy',
    max_trials=10,
    directory='random_search_starter',
    project_name='intro_to_kt')

tuner.search(img_train, label_train, epochs=10, validation_data=(img_test, label_test))

# Which was the best model?
best_model = tuner.get_best_models(1)[0]

# What were the best hyperparameters?
best_hyperparameters = tuner.get_best_hyperparameters(1)[0]
```

Here I suspect that an optimal number of units in the first Dense layer would be somewhere between 32 and 512, and my learning rate would be one of 1e-2, 1e-3, or 1e-4.

Consequently, as shown in this example, I set my minimum value for the number of units to be 32 and the maximum value to be 512 and have a step size of 32. Then, instead of hardcoding a value for the number of units, I specify a range to try out.

```python
hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
model.add(tf.keras.layers.Dense(units=hp_units, activation='relu'))
```

We do the same for our learning rate, but our learning rate is simply one of 1e-2, 1e-3, or 1e-4 rather than a range.

```python
hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
optimizer = tf.keras.optimizers.Adam(learning_rate=hp_learning_rate)
```

Finally, we perform Random Search and specify that among all the models we build, the model with the highest validation accuracy would be called the best model. Or simply that getting a good validation accuracy is the goal.

```python
tuner = kt.RandomSearch(
    model_builder,
    objective='val_accuracy',
    max_trials=10,
    directory='random_search_starter',
    project_name='intro_to_kt')

tuner.search(img_train, label_train, epochs=10, validation_data=(img_test, label_test))
```

After doing so, I also want to retrieve the best model and the best hyperparameter choice. Though I would like to point out that using the get_best_models is usually considered a shortcut.

To get the best performance you should retrain your model with the best hyperparameters you get on the full dataset.

```python
# Which was the best model?
best_model = tuner.get_best_models(1)[0]

# What were the best hyperparameters?
best_hyperparameters = tuner.get_best_hyperparameters(1)[0]
```

I won't be talking about this code in detail in this article, but you can read about it in this article I wrote some time back if you want.

Use Mixed Precision Training for large networks🎨

The bigger your neural network is, the more accurate your results (in general). As model sizes grow, the memory and compute requirements for training these models also increase.

The idea with using Mixed Precision Training (NVIDIA, Micikevicius et al.) is to train deep neural networks using half-precision floating-point numbers for most operations, which lets you train large neural networks a lot faster with no or negligible decrease in the performance of the networks.

But, I'd like to point out that this technique should only be used for large models with more than 100 million parameters or so.

While mixed-precision would run on most hardware, it will only speed up models on recent NVIDIA GPUs (for example Tesla V100 and Tesla T4) and Cloud TPUs.

I want to give you an idea of the performance gains when using Mixed Precision. When I trained a ResNet model on my GCP Notebook instance (with a Tesla V100), training was almost three times faster, and on a Cloud TPU instance it was almost 1.5 times faster, with almost no difference in accuracy. The code to measure the above speed-ups was taken from this example.

To further increase your training throughput, you could also consider using a larger batch size. Since float16 tensors take half the memory of float32 tensors, you are less likely to run out of memory.

It is also rather easy to implement Mixed Precision with TensorFlow. With TensorFlow you could easily use the tf.keras.mixed_precision Module that allows you to set up a data type policy (to use float16) and also apply loss scaling to prevent underflow.
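To see why loss scaling matters, consider a gradient that is too small to be represented in float16. The sketch below uses Python's `struct` module, which supports the IEEE half-precision format, to round numbers to float16. The gradient value and the scale factor are made up for illustration, and `to_float16` is a hypothetical helper. This is not how TensorFlow implements loss scaling internally.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest float16 value and return it as a float."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


tiny_gradient = 1e-8   # made-up gradient, below float16's smallest positive value
loss_scale = 1024.0    # made-up scale factor

# Stored directly in float16, the gradient underflows to zero.
unscaled = to_float16(tiny_gradient)

# Scaling the loss scales the gradients too, which lifts them into float16's range.
scaled = to_float16(tiny_gradient * loss_scale)

# Dividing by the scale afterwards, in higher precision, recovers the gradient.
recovered = scaled / loss_scale
print(unscaled, recovered)
```

Here `unscaled` is exactly zero, while `recovered` is within about one percent of the original gradient.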

Here is a minimalistic example of using Mixed Precision Training on a network:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(policy)

inputs = keras.Input(shape=(784,))
x = layers.Dense(4096, activation='relu')(inputs)
x = layers.Dense(4096, activation='relu')(x)
x = layers.Dense(10)(x)
# keep the softmax output in float32 to avoid numeric issues
outputs = layers.Activation('softmax', dtype='float32')(x)

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(...)
model.fit(...)
```

In this example we first set the dtype policy to mixed_float16, which means that all of our model layers will automatically compute in float16 while keeping their variables in float32.

After doing so we build a model, but we override the data type for the last or the output layer to be float32 to prevent any numeric issues. Ideally your output layers should be float32.

Note: I've built a model with so many units so we can see some difference in the training time with Mixed Precision Training since it works well for large models.

If you are looking for more inspiration to use Mixed Precision Training, here is an image demonstrating speedup for multiple models by Google Cloud on a TPU:

Speedups on a Cloud TPU

Use Grad Check for backpropagation ✔️

In multiple scenarios, I have had to custom implement a neural network. And implementing backpropagation is typically the aspect that's prone to mistakes and is also difficult to debug.

With incorrect backpropagation your model could learn something which might look reasonable, which makes it even more difficult to debug. So, how cool would it be if we could implement something which could allow us to debug our neural nets easily?

I often use Gradient Check when implementing backpropagation to help me debug it. The idea here is to approximate the gradients using a numerical approach. If it is close to the calculated gradients by the backpropagation algorithm, then you can be more confident that the backpropagation was implemented correctly.

You can use the following expression to compute a vector of approximate gradients, which we will call dθ[approx]:

Calculate approx gradients
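Written out, the usual two-sided (centered) difference approximation for the i-th component looks like this, where J is the cost function, θ is the vector of all parameters, and ε is a small number such as 10⁻⁷. The notation is an assumption on my part and follows the common convention from Andrew Ng's course material:

```latex
d\theta_{\text{approx}}[i] =
  \frac{J(\theta_1, \dots, \theta_i + \varepsilon, \dots)
      - J(\theta_1, \dots, \theta_i - \varepsilon, \dots)}
       {2\varepsilon}
```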

In case you are looking for the reasoning behind this, you can find more about it in this article I wrote.

So, now we have two vectors dθ[approx] and dθ (calculated by backprop). And these should be almost equal to each other. You could simply compute the Euclidean distance between these two vectors and use this reference table to help you debug your nets:

Reference table
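Here is a minimal sketch of a gradient check in plain Python on a toy cost function. The cost function, the helper names, and the choice to normalize the Euclidean distance by the norms of the two vectors are assumptions for illustration. In your own network, `backprop_gradient` would be your backpropagation code.

```python
import math


def cost(theta):
    """Toy cost function: J(theta) = sum of theta_i cubed."""
    return sum(t ** 3 for t in theta)


def backprop_gradient(theta):
    """Analytic gradient of the toy cost, standing in for backpropagation."""
    return [3 * t ** 2 for t in theta]


def numerical_gradient(f, theta, eps=1e-5):
    """Approximate each partial derivative with a two-sided difference."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad


def relative_difference(a, b):
    """Euclidean distance between a and b, normalized by the sum of their norms."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return distance / (norm_a + norm_b)


theta = [1.0, -2.0, 0.5]
approx = numerical_gradient(cost, theta)

good = relative_difference(approx, backprop_gradient(theta))
# A deliberately wrong gradient (2 * t**2 instead of 3 * t**2) to simulate a bug.
buggy = relative_difference(approx, [2 * t ** 2 for t in theta])
print(good, buggy)
```

The correct gradient gives a tiny relative difference, while the buggy one is off by a large margin, which is exactly the signal you are looking for when debugging.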

Cache Your Datasets 💾

Caching datasets is a simple idea but it's not one I have seen used much. The idea here is to go over the dataset in its entirety and cache it either in a file or in memory (if it is a small dataset).

This should save you from performing some expensive CPU operations like file opening and data reading during every single epoch.

This also means that your first epoch would take comparatively more time📉 since you would be performing all operations like opening files and reading data in the first epoch and then caching them. But the subsequent epochs should be a lot faster since you would be using the cached data.

This definitely seems like a very simple to implement idea, right? Here is an example with TensorFlow showing how you can very easily cache datasets. It also shows the speedup 🚀 from implementing this idea. Find the complete code for the below example in this gist of mine.

A simple example of caching datasets and the speedup with it
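In TensorFlow, the `tf.data.Dataset.cache()` method does this for you. To show the idea itself without any framework, here is a minimal plain-Python sketch. `CachedDataset` and `expensive_load` are hypothetical names, and a counter stands in for the cost of opening and decoding files.

```python
class CachedDataset:
    """Wrap an expensive per-item loader and remember its results after the first pass."""

    def __init__(self, keys, load_fn):
        self.keys = keys
        self.load_fn = load_fn
        self.cache = {}

    def __iter__(self):
        for key in self.keys:
            if key not in self.cache:
                self.cache[key] = self.load_fn(key)
            yield self.cache[key]


stats = {"loads": 0}


def expensive_load(key):
    """Stand-in for opening and decoding a file; counts how often it runs."""
    stats["loads"] += 1
    return key * 2


dataset = CachedDataset(keys=range(5), load_fn=expensive_load)

loads_per_epoch = []
for epoch in range(3):
    before = stats["loads"]
    items = list(dataset)
    loads_per_epoch.append(stats["loads"] - before)

print(loads_per_epoch)
```

All of the expensive loads happen in the first epoch, and the later epochs are served entirely from memory. The trade-off is memory use, so for datasets that do not fit in memory you would cache to a file instead.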

How to tackle overfitting ⭐

When you're working with neural networks, overfitting and underfitting might be two of the most common problems you face. This section talks about some common approaches that I use when tackling these problems.

You might know this, but high bias will cause you to miss a relationship between features and labels (underfitting) and high variance will cause the model to capture the noise and overfit to the training data.

I believe the most effective way to solve overfitting is to get more data – though you could also augment your data. A benefit of deep neural networks is that their performance improves as they are fed more and more data.

But in a lot of situations, it might be too expensive to get more data or it simply might not be possible to do so. In that case, let's talk about a couple of other methods you could use to tackle overfitting.

Apart from getting more data or augmenting your data, you could also tackle overfitting either by changing the architecture of the network or by applying some modifications to the network’s weights. Let's look at these two methods.

Changing the Model Architecture

A simple way to change the architecture such that it doesn’t overfit would be to use Random Search to stumble upon a good architecture. Or you could try pruning nodes from your model, essentially lowering the capacity of your model.

We already talked about Random Search, but in case you want to see an example of pruning you could take a look at the TensorFlow Model Optimization Pruning Guide.

Modifying Network Weights

In this section we will see some methods I commonly use to prevent overfitting by modifying a network's weights.

Weight Regularization

To reiterate what we discussed, "simpler models are less likely to overfit than complex ones". We try to cap the complexity of the network by forcing its weights to take only small values.

To do so we add a term to our loss function that penalizes the model if it has large weights. L₁ and L₂ regularization are often used, the difference being:

L1 - The penalty added is ∝ |weight coefficients|

L2 - The penalty added is ∝ |weight coefficients|²

where |x| represents the absolute value of x.

Do you notice the difference between L1 and L2, the square term? Due to this, L1 might push weights to be equal to zero whereas L2 would have weights tending to zero but not zero.
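A small plain-Python sketch can make this difference easier to see. The helper names below are made up for illustration, and the strength `lam = 0.1` is an arbitrary choice. The point is that the pull towards zero from L1 has a constant size, while the pull from L2 shrinks along with the weight itself.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam * sum of |w|."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam * sum of w squared."""
    return lam * sum(w * w for w in weights)


def l1_grad(w, lam):
    """Gradient of the L1 penalty for one weight (0 is used at w == 0)."""
    if w == 0:
        return 0.0
    return lam if w > 0 else -lam


def l2_grad(w, lam):
    """Gradient of the L2 penalty for one weight."""
    return 2 * lam * w


lam = 0.1
weights = [2.0, -0.5, 0.01]

print("L1 penalty:", l1_penalty(weights, lam))
print("L2 penalty:", l2_penalty(weights, lam))

for w in weights:
    # For the tiny weight 0.01, L1 still pulls with size lam,
    # while the L2 pull has almost vanished.
    print(w, l1_grad(w, lam), l2_grad(w, lam))
```

Because the L1 pull does not fade for small weights, it can drive them all the way to zero, whereas the L2 pull gets weaker and weaker as a weight approaches zero.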

In case you are curious about exploring this further, this article goes deep into regularizations and might help.

This is also the exact reason why I tend to use L2 more than L1 regularization. Let's see an example of this with TensorFlow.

Here I show some code to create a simple Dense layer with 3 units and the L2 regularization:

    import tensorflow as tf

    tf.keras.layers.Dense(3, kernel_regularizer=tf.keras.regularizers.L2(0.1))

To make it clearer what this does: as discussed above, it adds a penalty of 0.1 × (the sum of the squared kernel weights) to the loss function, which discourages very large weights. Implementing L1 for your layer is as easy as replacing L2 with L1 in the code above.

Dropouts

The first thing I do when I am building a model and face overfitting is try dropout (Srivastava et al.). The idea is to randomly drop, that is set to zero (ignore), x% of the output features of a layer during training.

We do this to stop individual nodes from relying too heavily on the output of other nodes, which prevents them from co-adapting too much.
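As a rough illustration of the mechanics, here is a minimal plain-Python sketch of "inverted" dropout, the variant where the surviving features are scaled up by 1 / (1 - rate) during training so that nothing needs to change at inference time. The function name and its arguments are assumptions made for this sketch, not a library API.

```python
import random


def dropout(features, rate, training=True, rng=random):
    """Zero out roughly `rate` of the features and rescale the rest."""
    if not training or rate == 0.0:
        # At inference time the features pass through untouched
        return list(features)
    keep = 1.0 - rate
    # Each feature survives with probability `keep`; survivors are scaled up
    return [x / keep if rng.random() < keep else 0.0 for x in features]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = [1.0] * 10

train_out = dropout(features, rate=0.2, rng=rng)
eval_out = dropout(features, rate=0.2, training=False)

print(train_out)  # a mix of 0.0 and 1.25
print(eval_out)   # unchanged
```

Since a different random subset is dropped on every training step, no node can count on any particular input always being present.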

Dropouts are rather easy to implement with TensorFlow since they are available as layers. Here is an example of me trying to build a model to differentiate images of dogs and cats with Dropout to reduce overfitting:

    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(32, (3,3), padding='same', activation='relu',
                               input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
        tf.keras.layers.MaxPooling2D(2,2),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Conv2D(128, (3,3), padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(2,2),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

As you can see in the code above, you can use tf.keras.layers.Dropout directly to implement dropout, passing it the fraction of output features to ignore (here 20% of the output features).

Early stopping

Early stopping is another regularization method I often use. The idea here is to monitor the performance of the model at every epoch on a validation set and terminate the training when you meet some specified condition for the validation performance (for example, stop training when the loss falls below 0.5).

It turns out that the basic condition like we talked about above works like a charm if your training error and validation error look something like in this image. In this case, Early Stopping would just stop training when it reaches the red box (for demonstration) and would straight up prevent overfitting.

It (Early stopping) is such a simple and efficient regularization technique that Geoffrey Hinton called it a "beautiful free lunch". – Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurelien Geron

_Adapted from Lutz Prechelt_

However, for some cases you would not end up with such straightforward choices for identifying the criterion or knowing when Early Stopping should stop training the model.

For the scope of this article we will not be talking about more criteria here, but I would recommend that you check out "Early Stopping — But When, Lutz Prechelt" which I use a lot to help decide criteria.

Let's see an example of Early Stopping in action with TensorFlow:

    import tensorflow as tf

    callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

    model = tf.keras.models.Sequential([...])
    model.compile(...)
    model.fit(..., callbacks=[callback])

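In the above example we create an Early Stopping callback and specify that we want to monitor our loss values. Note that monitor='loss' tracks the training loss; to track performance on a validation set as described earlier, pass validation data to model.fit and monitor 'val_loss' instead. With patience=3, training stops if the monitored loss does not improve for 3 consecutive epochs. Finally, while training the model, we specify that it should use this callback.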

Also, for the purpose of this example I show a Sequential model, but this would work in exactly the same manner with a model created with the functional API or with subclassed models, too.
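If you are curious what a patience-based criterion boils down to, here is a minimal plain-Python sketch. It is a simplified illustration of the idea, not the actual Keras implementation; the function name, the `min_delta` argument, and the loss values are all made up for this example.

```python
def early_stopping_epoch(losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the loss has failed to improve on its best
    value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss  # new best value, reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: improvement stalls after epoch 3
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)
```

Notice that in this made-up run the loss would have improved again at epoch 7, after training had already stopped. This is the kind of trade-off that makes choosing the criterion and the patience value less straightforward than it first appears.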

Thank you for reading!

Thank you for sticking with me until the end. I hope you will benefit from this article and incorporate these tips in your own experiments.

I am excited to see if they help you improve the performance of your neural nets, too. If you have any feedback or suggestions for me please feel free to reach out to me on Twitter.

How to Automate Machine Learning Model Publishing with the Gitlab Package Registry

freeCodeCamp — Thu, 15 Apr 2021 16:33:05 +0000

By Yacine Mahdid

In this tutorial we'll learn how to automatically publish machine learning models in a Gitlab package registry and make them available for your teammates to use. You can also use this technique to share a packaged version of your code as a binary.

If you are a beginner Gitlab user and are unfamiliar with CI/CD techniques, this tutorial is for you! A basic understanding of how machine learning and deep learning work is a plus, but it isn't a requirement to understand the CI/CD publishing part.

Here's what we'll cover:

Gitlab Code Setup

Deep Convolutional Neural Network Code

Image Recognition Code

Branching Methodology

CI/CD Uploading

Conclusion

First, Some Background

At some point during your machine learning engineer career you might need to share a model you've trained with other developers. There are multiple ways of doing this.

Give access to the repository

If you don't mind showing your whole code, this is a very viable option.

If you use a good branching methodology your colleagues will only need to look at the main branch in order to figure out what's the most up to date model they can use. Then they can check the README.md to learn how to use it.

However, giving full access to the repository might not be a viable option for you.

Share the latest model manually

Another way would be to extract the relevant code that you want to make public and send it to them manually.

This can become a bit of a mess if you are working with more than one person because the model you send might not be up to date. It also puts the burden on you to make sure that people are always using the latest version of your model.

Share the latest model automatically

A simpler solution, even in the case where the repository code is available, is to put the packaging burden on a CI/CD pipeline.

This is the topic of this tutorial, and our setup will look like this:

The code repository, CI/CD tool set, and package registry will be on Gitlab

The code we'll be packaging will be a simple trained PyTorch neural network on the MNIST dataset for digit recognition.

All the instructions and the requirements will be available in the package.

🚨 Disclaimer 🚨: This isn't how you should deploy a PyTorch production-ready model! To learn how to do this, check out this tutorial on TorchScript.

Let's get started.

Gitlab Code Setup

For this tutorial we will bundle four files:

model.pth: which is a pickled version of the latest version of the trained model.

run_mnist.py: simple Python script to run the model to detect a digit from a png image.

requirements.txt: text file containing all the dependencies required to run the model.

INSTRUCTION.md: step by step instructions to use the package.

The package can then be used freely by anyone who has access to the package registry and will be automatically updated.

The package will then look like this on Gitlab Package Registry!

Let's jump into the neural network code, which is a modified version of this comprehensive article on digit recognition. The modified code can be found over at my public Gitlab repository.

Deep Convolutional Neural Network Code

In the section below, you will see quite a lot of terminology about deep neural networks. This isn't a tutorial on neural networks, so if you feel a bit overwhelmed by the specifics you can jump directly to the Branching Methodology section.

Just keep in mind that we've trained some sort of image recognition program that, given a .png file representing a digit, will be able to tell you what number it contains.

However, for those who want to get a better understanding of how Deep Neural Networks work under the hood, you can take a look at my tutorial where I build one from scratch or check out the code directly on my Github.

Neural Network Definition

The network definition code is very straightforward since the network we will use is simple. It has the following characteristics:

2 convolutional layers.

Dropout is applied on the second convolutional layer.

ReLU activation functions applied on all neurons.

2 fully connected layers at the end for inference.

    import torch
    import torchvision
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim


    # Define the network
    # It's a 2 convolutional layer with dropout at the 2nd and finally 2 fully connected layer
    # All layers use relu
    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
            self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
            self.conv2_drop = nn.Dropout2d()
            self.fc1 = nn.Linear(320, 50)
            self.fc2 = nn.Linear(50, 10)

        def forward(self, x):
            x = F.relu(F.max_pool2d(self.conv1(x), 2))
            x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
            x = x.view(-1, 320)
            x = F.relu(self.fc1(x))
            x = F.dropout(x, training=self.training)
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)
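You might wonder where the 320 in nn.Linear(320, 50) comes from. The short sketch below walks through the arithmetic in plain Python, assuming the 28 by 28 input used in this tutorial, a stride of 1 and no padding for the convolutions, and non-overlapping 2 by 2 max pooling. The helper names are made up for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel):
    """Spatial size after non-overlapping max pooling."""
    return size // kernel


size = 28                                            # MNIST images are 28 x 28
size = pool_out(conv_out(size, kernel=5), kernel=2)  # conv1 + pool: 28 -> 24 -> 12
size = pool_out(conv_out(size, kernel=5), kernel=2)  # conv2 + pool: 12 -> 8 -> 4

channels = 20                                        # output channels of conv2
flattened = channels * size * size
print(flattened)
```

This is why the forward pass reshapes with x.view(-1, 320): each image leaves the second pooling step as 20 feature maps of 4 by 4 values.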

Training Function

We then created a utility training function in order to iteratively improve our defined network using gradient descent. If you want to learn more about how gradient descent works check out my short tutorial on it.

This training regimen will do the following:

Iterate on batches of training data representing 28 by 28 digits.

Use the negative log likelihood cost function to calculate the loss.

Calculate gradients.

Optimize the weights of the network using gradient descent.

Save the model at fixed intervals.

    def train(network, optimizer, train_loader, epoch_id, log_interval=10):
        """Run the training regimen on the training set using train_loader

        Args:
            network: The instantiated network.
            optimizer: The optimizer used to change the weights.
            train_loader: the loader for the training set already setup
            epoch_id: the current id of the epoch used for cosmetic reason.
            log_interval: interval at which we print an output

        Returns:
            nothing, will save directly at root level the model and the optimizer state
        """
        # Set the network in training mode
        network.train()

        # Iterate over the full training set
        for batch_idx, (data, target) in enumerate(train_loader):
            # Calculate the gradients for this batch of data
            optimizer.zero_grad()
            output = network(data)
            loss = F.nll_loss(output, target)
            loss.backward()

            # Optimize the network
            optimizer.step()

            # Log and save every selected interval
            if batch_idx % log_interval == 0:
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch_id, batch_idx * len(data), len(train_loader.dataset),
                    100. * batch_idx / len(train_loader), loss.item()))

                # This will save the state as a pickled object
                torch.save(network.state_dict(), './model.pth')
                torch.save(optimizer.state_dict(), './optimizer.pth')

The data for training can be found over here on Yann LeCun's website. Here we are using the datasets formatted as 28 by 28 PyTorch tensors for training.
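To give a feel for what the negative log likelihood loss in the training function measures, here is a minimal plain-Python sketch for a single example. It mirrors the combination of log-softmax output and NLL loss used above, but the helper functions and the example scores are assumptions written for this illustration, not PyTorch code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def nll_loss(log_probs, target):
    """Negative log likelihood of the correct class for one example."""
    return -log_probs[target]


logits = [2.0, 1.0, 0.1]          # hypothetical raw scores for 3 classes
log_probs = log_softmax(logits)

loss_if_class_0 = nll_loss(log_probs, target=0)  # the network favors class 0
loss_if_class_2 = nll_loss(log_probs, target=2)  # the network is wrong here
print(loss_if_class_0, loss_if_class_2)
```

The loss is small when the network puts high probability on the correct class and grows as that probability shrinks, which is what the gradients computed by loss.backward() try to correct.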

Testing Function

The next function we create is a testing function to validate if our network has learned something without reusing the same training data. This function is simple in the sense that it will just tally the correct and incorrect predictions.

    def test(network, test_loader):
        """Run the testing regimen on the test set using test_loader

        Args:
            network: The instantiated and trained network.
            test_loader: the loader for the testing set already setup

        Returns:
            nothing, will only print result
        """
        # Variable instantiation
        test_loss = 0
        correct = 0

        # Move the network to evaluate mode instead of training
        network.eval()

        # setup torch so to not track any gradient
        with torch.no_grad():
            # Iterate on all the test data and accumulate the loss
            for data, target in test_loader:
                output = network(data)
                # reduction='sum' replaces the deprecated size_average=False
                test_loss += F.nll_loss(output, target, reduction='sum').item()
                pred = output.data.max(1, keepdim=True)[1]
                # .item() turns the tensor count into a plain Python number
                correct += pred.eq(target.data.view_as(pred)).sum().item()

        # Average loss calculation and printing
        test_loss /= len(test_loader.dataset)
        print('\nTest set: Avg. loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            test_loss, correct, len(test_loader.dataset),
            100. * correct / len(test_loader.dataset)))

This function will be useful to check how well our network has learned after each training iteration.

Training Regimen

Finally, we can tie all of the above together with the main body of the training script! A few things are happening, but the most important points are the following:

We set our hyperparameters statically. A better way to define them would be to use a validation set to figure them out based on the data.

We create our data loaders, which ingest data and spit out tensors in the right shape for the network. These loaders transform the data by normalizing them with the global mean and standard deviation for the MNIST dataset.

We use stochastic gradient descent with momentum as the optimization method, which is one of the many flavors of gradient descent we can use.

We loop over the full training dataset n_epochs times (each full pass is one "epoch"), testing on the held-out test dataset after every epoch.

    # Experimental Parameters that we can tweak
    n_epochs = 3
    batch_size_train = 64
    batch_size_test = 1000
    learning_rate = 0.01
    momentum = 0.5

    # Variable from the dataset that should stay as is
    global_mean_mnist = 0.1307
    global_std_mnist = 0.3081

    # Random Seed for Reproducible Experimentation
    random_seed = 42
    torch.backends.cudnn.enabled = False
    torch.manual_seed(random_seed)

    # Data Loader to gather the data and then normalize them
    train_loader = torch.utils.data.DataLoader(
        torchvision.datasets.MNIST('./data/', train=True, download=True,
            transform=torchvision.transforms.Compose([
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize(
                    (global_mean_mnist,), (global_std_mnist,))
            ])),
        batch_size=batch_size_train, shuffle=True)

    test_loader = torch.utils.data.DataLoader(
        torchvision.datasets.MNIST('./data/', train=False, download=True,
            transform=torchvision.transforms.Compose([
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize(
                    (global_mean_mnist,), (global_std_mnist,))
            ])),
        batch_size=batch_size_test, shuffle=True)

    # Initialize network and optimizer
    network = Net()
    optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)

    # Test first to show that the model didn't learn a thing
    test(network, test_loader)

    # Train on the whole dataset multiple time and test
    for epoch_id in range(1, n_epochs + 1):
        train(network, optimizer, train_loader, epoch_id)
        test(network, test_loader)
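The optimizer in the script above is stochastic gradient descent with momentum. As a rough sketch of what a single update does, here is the basic rule in plain Python: a running "velocity" accumulates past gradients, and the weights move along that velocity. This is a simplified illustration (no dampening, weight decay, or Nesterov variant), and the function name and the toy problem are assumptions made for this example.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.5):
    """One SGD-with-momentum update on plain lists of floats."""
    # The velocity is a decaying sum of past gradients
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    # The parameters move against the velocity, scaled by the learning rate
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Toy problem: minimize f(w) = w^2, whose gradient is 2w
w, v = [1.0], [0.0]
first_w, first_v = sgd_momentum_step(w, [2 * w[0]], v)

for _ in range(200):
    grad = [2 * w[0]]
    w, v = sgd_momentum_step(w, grad, v)

print(first_w, w)
```

With momentum set to 0 this reduces to plain gradient descent. The values lr=0.01 and momentum=0.5 simply mirror the hyperparameters set at the top of the script.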

Note that it's very important to test your network on a held-out set so that you can detect over-fitting on the training data.

All of the above scripts can be found in the file train_mnist.py in the repository.

At this point, we can train a model and have it saved at regular intervals in a pickle format.

We can now use that saved trained model to evaluate a digit in a .png file.

Image Recognition Code

Let's say we have as an input the following image:

a small 0 digit

or this one:

a bigger 7 digit

How can we make our network, which works on a 28 by 28 PyTorch tensor, evaluate the numbers?

It's fairly straightforward if we follow roughly the same process that the training datasets went through, which is:

Have grayscale images (no color or alpha channels)

Resize the images to be 28 by 28 pixels

Normalize the images using the mean and standard deviation of the MNIST datasets.

    import glob
    from optparse import OptionParser

    import torch
    from PIL import Image
    from torchvision import transforms

    # The Net class from the training script must also be defined or imported here

    if __name__ == "__main__":
        # Variable initialization
        global_mean_mnist = 0.1307
        global_std_mnist = 0.3081

        # Loading of the network with right weight
        result_path = './model.pth'
        model = Net()
        model.load_state_dict(torch.load(result_path))
        model.eval()

        # Setup the transform from image to normalized tensors
        transform = transforms.Compose([
            transforms.Resize((28,28)),
            transforms.ToTensor(),
            transforms.Normalize(
                (global_mean_mnist,), (global_std_mnist,))
        ])

        # Parse the input from the user which should be a filename with the --image flag
        parser = OptionParser()
        parser.add_option("--image", dest="input_image_path", help="Input Image Path")
        (options, args) = parser.parse_args()

        # Get the path to the image to decode
        input_image_path = str(options.input_image_path)

        # Open the image(s) and do the inference
        images = glob.glob(input_image_path)
        for image in images:
            # Convert the image to grayscale
            img = Image.open(image).convert('L')

            # Transform the image to a normalized tensor
            img_tensor = transform(img).unsqueeze(0)

            # Make and print the prediction
            output = model(img_tensor).data.max(1, keepdim=True)[1][0][0]
            print(f"Image is a {int(output)}")

As you can see, we use a parser to accept an image path on the command line before applying our transformations. Once they are applied we can feed that to our loaded model and collect the output prediction.
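To see what the last two transformations do to a single pixel, here is a small plain-Python sketch of the arithmetic: the tensor conversion scales 8-bit values from the 0 to 255 range down to 0 to 1, and the normalization step then subtracts the mean and divides by the standard deviation. The helper function is made up for this illustration; the two constants are the ones used in the script above.

```python
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081


def normalize_pixel(value):
    """Map one 8-bit grayscale value to the range the network was trained on."""
    scaled = value / 255.0                      # 0..255 -> 0..1
    return (scaled - MNIST_MEAN) / MNIST_STD    # center and rescale


black = normalize_pixel(0)
white = normalize_pixel(255)
print(f"black -> {black:.3f}, white -> {white:.3f}")
```

Using the same constants at inference time as during training matters, because the network has only ever seen inputs in this normalized range.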

⚠️ Don't forget to include the definition of the network in the script (by importing or copy pasting), otherwise the pickled model will not be able to load properly.

We can now run our code like this:

python run_mnist.py --image NAME_OF_IMAGE.png

This will simply print the model's inference about what that particular image contains.

Now that we have the basic training and evaluation code set up, let's discuss a bit more about how to use git branching to our advantage to publish this model to the package registry.

Branching Methodology

If you are working alone on a project, it is very tempting to simply commit to master/main and be done with it. However, this way of working is very difficult to maintain and it makes incorporating proper CI/CD tools a pain.

A main / develop branch strategy as shown below is more maintainable:

Image from: https://nvie.com/posts/a-successful-git-branching-model/

By always keeping the main branch clean, we can easily flag our CI/CD pipeline to be triggered as soon as we push to main. We will also be free to commit as much as we need in the develop branch while we improve our models.

When we are ready for a new deploy we will only need to merge with the main branch (or better yet do a merge-request / pull-request and then merge).

This merge to main should trigger Gitlab to upload the new version of our model to the package registry.

Let's take a look at the simple way to automate publishing to the package registry using the .gitlab-ci.yml file.

CI/CD Pipeline

The .gitlab-ci.yml file is a special file in your repository used by Gitlab to define what the Gitlab server should do when you push to a repository.

To learn more about how CI/CD works in Gitlab, head over to this Gitlab CI/CD crash course.

In this tutorial our .gitlab-ci.yml file looks like this:

    image: pytorch/pytorch

    variables:
      VERSION: "0.0.4" # To Change if needs be

    stages:
      - upload

    upload:
      stage: upload
      only:
        - master
      script:
        - apt-get update
        - apt-get install -y curl wget
        - 'curl --header "JOB-TOKEN: $CI_JOB_TOKEN" --upload-file ./model.pth "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/example-ml-packaging-pipeline/${VERSION}/model.pth"'
        - 'curl --header "JOB-TOKEN: $CI_JOB_TOKEN" --upload-file ./run_mnist.py "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/example-ml-packaging-pipeline/${VERSION}/run_mnist.py"'
        - 'curl --header "JOB-TOKEN: $CI_JOB_TOKEN" --upload-file ./requirements.txt "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/example-ml-packaging-pipeline/${VERSION}/requirements.txt"'
        - 'curl --header "JOB-TOKEN: $CI_JOB_TOKEN" --upload-file ./INSTRUCTION.md "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/example-ml-packaging-pipeline/${VERSION}/INSTRUCTION.md"'

The anatomy of this .yml file is very bare bones. We have only one stage in our pipeline which is the upload stage.

In the upload stage, we run the script section only when the master branch gets updated. The script simply uses curl to transfer the data from this repository (4 files) into the package registry.

Let's take a look at the anatomy of the curl command we are using:

- 'curl --header "JOB-TOKEN: $CI_JOB_TOKEN" --upload-file ./NAME_OF_FILE "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/example-ml-packaging-pipeline/${VERSION}/NAME_OF_FILE"'

--header is used to tell curl that you will be including an extra header to the request.

JOB-TOKEN is our header and $CI_JOB_TOKEN is its value. It's a variable that Gitlab makes available when a job is created.

--upload-file is a flag to tell that we will transfer a local file to the remote URL.

./NAME_OF_FILE is the name of the local file we want to transfer.

${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/example-ml-packaging-pipeline/${VERSION}/NAME_OF_FILE is the remote URL that we want to transfer the file to.

Here $CI_API_V4_URL is the URL of the Gitlab API we are using, $CI_PROJECT_ID is defined within Gitlab CI as the id for our project, and finally VERSION is the version number we defined at the top of the .yml file.
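To make the structure of that URL easier to see, here is a small Python sketch that assembles it from its parts. The function is hypothetical and only mirrors the pattern in the curl command above; the host name and project id are placeholder values, since Gitlab fills in the real ones inside a CI job.

```python
def package_file_url(api_url, project_id, package_name, version, file_name):
    """Build the generic package URL for one file, following the pattern above."""
    return (
        f"{api_url}/projects/{project_id}/packages/generic/"
        f"{package_name}/{version}/{file_name}"
    )


url = package_file_url(
    api_url="https://gitlab.example.com/api/v4",  # placeholder for $CI_API_V4_URL
    project_id="12345",                           # placeholder for $CI_PROJECT_ID
    package_name="example-ml-packaging-pipeline",
    version="0.0.4",                              # the VERSION variable
    file_name="model.pth",
)
print(url)
```

Only the last segment changes between the four upload commands, so every file ends up under the same package name and version.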

That's it! When you push an update to the branch named in the only: section (master in the file above) on Gitlab, it will fire up a pipeline that runs your packaging job.

The job will then be available and you will be able to check the trace on Gitlab!

You and your teammates will be able to see the document in the package registry section and get the right versioned files in the package:

This is our v.0.0.5 of the example package!

To get a more complete idea of what is possible with the Packages API, head over to the official documentation.

Conclusion

In this tutorial you've learned how to bundle and upload a machine learning model and how to automate its packaging using Gitlab CI/CD.

Congratulations! 🎉🎉🎉

There is still a lot more you can do with Gitlab CI/CD, for instance:

Add a testing stage before the bundling in order to make sure that there is no regression in the code.

Add a testing stage after the bundling to make sure that the performance of your model is satisfactory in terms of inference latency.

Use a more optimized version of the model with TorchScript.

Add automatic social notification of new release after the upload step.

To learn more about Gitlab CI/CD, the official docs are a great place to start, and the get started section is very beginner friendly.

If you want to read more of this type of content, check out my mechanical/software engineering articles. If you want to discuss any of this feel free to send me a DM on LinkedIn or Twitter :)


Learn to Build a Multilayer Perceptron with Real-Life Examples and Python Code

Table of Contents

Prerequisites

What is a Perceptron?

Applications of Perceptrons

How the Activation Function Works

How the Loss Function Works

How to Build a Single-Layered Classifier

1. Custom Classifier

Initialize the classifier

Define the activation function

Train the model

How the weights work in the iteration loop

How the bias terms work in the iteration loop

Make a prediction

Simulate with synthetic datasets

Results

2. Leverage SckitLearn’s MCP Classifier

Results

Limitations of Single-Layer Perceptrons

What is a Multi-Layer Perceptron?

How to Build Multi-Layered Perceptrons

Outline of the Project

Objective

Evaluation Metrics

Planning an MLP Architecture

Preprocessing the Datasets

Understanding Optimizers

1. How a SGD (Stochastic Gradient Descent) Optimizer Works

2. How Adam (Adaptive Moment Estimation) Optimizer Works

How to Build an MLP Classifier with SGD Optimizer

Custom Classifier

Training / Prediction

Results

Leverage SckitLearn’s MCP Classifier

Results

Leverage Keras Sequential Classifier

Results

How to Build an MLP Classifier with Adam Optimizer

Custom Classifier

Training / Prediction

Results

Leverage SckitLearn’s MCP Classifier

Results

Leverage Keras Sequential Classifier

Results

Final Results: Generalization

Conclusion

Choosing the right framework

Limitation of MLPs

How AI Models Think: The Key Role of Activation Functions with Code Examples

In this article, we will explore:

Artificial Intelligence and the Rise of Deep Learning

What is Deep Learning in Artificial Intelligence?

Deep Learning = Training Neural Networks

Understanding Activation Functions: Simplifying Neural Network Mechanics

Simple Analogy: Why Activation Functions are Necessary

What Happens Without Activation Functions?

Intuitive Explanation of Activation Functions

Sigmoid Activation Functions

Tanh (Hyperbolic Tangent) Activation Functions

Leaky reLU

Mathematical Explanation of Activation Functions

PyTorch Activation Function Code Example

1: Importing libraries and defining activation functions

2: Defining hyperparameters and generating a dataset

3: Creating the deep learning model

4: Initializing the model and defining the loss function and optimizer

5: Training the deep learning model

Conclusion: The Unsung Heroes of AI Neural Networks