pytorch - freeCodeCamp.org

Building NMT from Scratch – PyTorch Replications of 7 Landmark Papers

Beau Carnes — Wed, 10 Dec 2025 15:23:41 +0000

Learn about the complete neural machine translation journey.

We just posted a course on the freeCodeCamp.org YouTube channel that is a comprehensive journey through the evolution of sequence models and neural machine translation (NMT). It blends historical breakthroughs, architectural innovations, mathematical insights, and hands-on PyTorch replications of landmark papers that shaped modern NLP and AI.

The course features:

A detailed narrative tracing the history and breakthroughs of RNNs, LSTMs, GRUs, Seq2Seq, Attention, GNMT, and Multilingual NMT.
Replications of 7 landmark NMT papers in PyTorch, so learners can code along and rebuild history step by step.
Explanations of the math behind RNNs, LSTMs, GRUs, and Transformers.
Conceptual clarity with architectural comparisons, visual explanations, and interactive demos like the Transformer Playground.

Here are all the sections in the course:

Evolution of RNN
Evolution of Machine Translation
Machine Translation Techniques
Long Short-Term Memory (Overview)
Learning Phrase Representation using RNN (Encoder–Decoder for SMT)
Learning Phrase Representation (PyTorch Lab – Replicating Cho et al., 2014)
Seq2Seq Learning with Neural Networks
Seq2Seq (PyTorch Lab – Replicating Sutskever et al., 2014)
NMT by Jointly Learning to Align (Bahdanau et al., 2015)
NMT by Jointly Learning to Align & Translate (PyTorch Lab – Replicating Bahdanau et al., 2015)
On Using Very Large Target Vocabulary
Large Vocabulary NMT (PyTorch Lab – Replicating Jean et al., 2015)
Effective Approaches to Attention (Luong et al., 2015)
Attention Approaches (PyTorch Lab – Replicating Luong et al., 2015)
Long Short-Term Memory Network (Deep Explanation)
Attention Is All You Need (Vaswani et al., 2017)
Google Neural Machine Translation System (GNMT – Wu et al., 2016)
GNMT (PyTorch Lab – Replicating Wu et al., 2016)
Google’s Multilingual NMT (Johnson et al., 2017)
Multilingual NMT (PyTorch Lab – Replicating Johnson et al., 2017)
Transformer vs GPT vs BERT Architectures
Transformer Playground (Tool Demo)
Seq2Seq Idea from Google Translate Tool
RNN, LSTM, GRU Architectures (Comparisons)
LSTM & GRU Equations

Watch the full course on the freeCodeCamp.org YouTube channel (7-hour watch).

How to Use Transformers for Real-Time Gesture Recognition

OMOTAYO OMOYEMI — Mon, 06 Oct 2025 13:39:30 +0000

Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are not static images. Rather, they unfold over time. To build more robust, real-time systems, we need models that can capture both spatial details and temporal context.

This is where Transformers come in. Originally built for language, they’ve become state-of-the-art in vision tasks thanks to models like the Vision Transformer (ViT) and video-focused variants such as TimeSformer.

In this tutorial, we’ll use a Transformer backbone to create a lightweight real-time gesture recognition tool, optimized for small datasets and deployable on a regular laptop webcam.

Why Transformers for Gestures?
What You’ll Learn
Prerequisites
Project Setup
Generate a Gesture Dataset
Option 1: Generate a Synthetic Dataset
Training Script: train.py
Export the Model to ONNX
Evaluate Accuracy + Latency
Option 2: Use Small Samples from Public Gesture Datasets
Accessibility Notes & Ethical Limits
Next Steps
Conclusion

Why Transformers for Gestures?

Transformers are powerful because they use self-attention to model relationships across a sequence. For gestures, this means the model doesn’t just see isolated frames, but also learns how movements evolve over time. A wave, for example, looks different from a raised hand only when viewed as a sequence.

Vision Transformers process images as patches, while video Transformers extend this to multiple frames with temporal attention. Even a simple approach, like applying a pre-trained ViT to each frame and pooling across time, can be competitive with traditional CNN-based methods on small datasets.

Combined with Hugging Face’s pre-trained models and ONNX Runtime for optimization, Transformers make it possible to train on a modest dataset and still achieve smooth real-time recognition.
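To make the patch idea concrete, here is a back-of-the-envelope sketch of how many tokens a ViT sees per frame and per clip. It assumes 224×224 frames, 16×16 patches, and 16 frames per clip (the values used later in this tutorial). The helper name is hypothetical, and the count ignores extra tokens such as a class token.

```python
def patch_tokens(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in one square frame."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

frames_per_clip = 16
tokens_per_frame = patch_tokens()  # 14 * 14 patches

# Frame-wise ViT + pooling: attention is computed inside each frame only.
framewise_pairs = frames_per_clip * tokens_per_frame ** 2
# Joint space-time attention: every token attends to every token in the clip.
joint_pairs = (frames_per_clip * tokens_per_frame) ** 2

print("tokens per frame:", tokens_per_frame)
print("joint / frame-wise attention pairs:", joint_pairs // framewise_pairs)
```

Under these assumptions, joint attention over the whole clip compares 16 times as many token pairs as the frame-wise approach, which is one reason per-frame encoding plus pooling is a reasonable starting point on a laptop.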

What You’ll Learn

In this tutorial, you’ll build a gesture recognition system using Transformers. By the end, you’ll know how to:

Create (or record) a tiny gesture dataset
Train a Vision Transformer (ViT) with temporal pooling
Export the model to ONNX for faster inference
Build a real-time Gradio app that classifies gestures from your webcam
Evaluate your model’s accuracy and latency with simple scripts
Understand the accessibility potential and ethical limits of gesture recognition

Prerequisites

To follow along, you should have:

Basic Python knowledge (functions, scripts, virtual environments)
Familiarity with PyTorch (tensors, datasets, training loops) – helpful but not required
Python 3.8+ installed on your system
A webcam (for the live demo in Gradio)
Optionally: GPU access (training on CPU works, but is slower)

Project Setup

Create a new project folder and install the required libraries.

# Create a new project directory and navigate into it
mkdir transformer-gesture && cd transformer-gesture

# Set up a Python virtual environment
python -m venv .venv

# Activate the virtual environment
# Windows PowerShell
.venv\Scripts\Activate.ps1

# macOS/Linux
source .venv/bin/activate

The provided code snippet is a set of commands for setting up a new Python project with a virtual environment. Here's a breakdown of each part:

mkdir transformer-gesture && cd transformer-gesture: This command creates a new directory named "transformer-gesture" and then navigates into it.
python -m venv .venv: This command creates a new virtual environment in the current directory. The virtual environment is stored in a folder named ".venv".
Activating the virtual environment:
- For Windows PowerShell, you can use .venv\Scripts\Activate.ps1 to activate the virtual environment.
- For macOS/Linux, use source .venv/bin/activate to activate the virtual environment.

Activating a virtual environment ensures that the Python interpreter and any packages you install are isolated to this specific project, preventing conflicts with other projects or system-wide packages.
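If you're unsure whether activation worked, a quick check from Python can help. This is a minimal sketch for environments created with python -m venv; the function name is made up for illustration.

```python
import sys

def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

When the environment is active, the interpreter path printed on the second line should sit inside the .venv folder.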

Create a requirements.txt file:

torch>=2.0
torchvision
torchaudio
timm
huggingface_hub

onnx
onnxruntime

gradio

numpy
opencv-python
pillow

matplotlib
seaborn
scikit-learn

The list provided is a set of package dependencies typically found in a requirements.txt file for a Python project. Here's a brief explanation of each package:

torch>=2.0: PyTorch is a popular open-source deep learning framework that provides a flexible and efficient platform for building and training neural networks. Version 2.0 and above includes improvements in performance and new features.
torchvision: This library is part of the PyTorch ecosystem and provides tools for computer vision tasks, including datasets, model architectures, and image transformations.
torchaudio: Also part of the PyTorch ecosystem, Torchaudio provides audio processing tools and datasets, making it easier to work with audio data in deep learning projects.
timm: The PyTorch Image Models (timm) library offers a collection of pre-trained models and utilities for computer vision tasks, facilitating quick experimentation and deployment.
huggingface_hub: This library allows easy access to models and datasets hosted on the Hugging Face Hub, a platform for sharing and collaborating on machine learning models and datasets.
onnx: The Open Neural Network Exchange (ONNX) format is used to represent machine learning models, enabling interoperability between different frameworks.
onnxruntime: This is a high-performance runtime for executing ONNX models, allowing for efficient deployment across various platforms.
gradio: Gradio is a library for creating user interfaces for machine learning models, making them accessible through a web interface for easy interaction and testing.
numpy: A fundamental package for numerical computing in Python, providing support for arrays and a wide range of mathematical functions.
opencv-python: OpenCV is a library for computer vision and image processing tasks, widely used for real-time applications.
pillow: A Python Imaging Library (PIL) fork, Pillow provides tools for opening, manipulating, and saving many different image file formats.
matplotlib: A plotting library for Python, Matplotlib is used for creating static, interactive, and animated visualizations in Python.
seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.
scikit-learn: A machine learning library in Python that provides simple and efficient tools for data analysis and modeling, including classification, regression, clustering, and dimensionality reduction.

Install dependencies:

pip install -r requirements.txt

The command pip install -r requirements.txt is used to install all the Python packages listed in a file named requirements.txt. This file typically contains a list of package dependencies required for a Python project, each specified with a package name and optionally a version number.

By running this command, pip, which is the Python package installer, reads the file and installs each package listed, ensuring that the project has all the necessary dependencies to run properly. This is a common practice in Python projects to manage and share dependencies easily.
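After installation, you can confirm that every package is importable without actually loading the heavy ones. The sketch below only looks each module up; note that a few import names differ from their pip names. The list and function name are illustrative.

```python
import importlib.util

# Import names differ from pip names for a few packages in requirements.txt:
# opencv-python -> cv2, pillow -> PIL, scikit-learn -> sklearn.
IMPORT_NAMES = [
    "torch", "torchvision", "torchaudio", "timm", "huggingface_hub",
    "onnx", "onnxruntime", "gradio", "numpy", "cv2", "PIL",
    "matplotlib", "seaborn", "sklearn",
]

def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_modules(IMPORT_NAMES)
print("Missing:", ", ".join(missing) if missing else "none")
```

If anything is listed as missing, re-run the install command inside the activated virtual environment.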

Generate a Gesture Dataset

To train our Transformer-based gesture recognizer, we need some data. Instead of downloading a huge dataset, we’ll start with a tiny synthetic dataset you can generate in seconds. This makes the tutorial lightweight and ensures that everyone can follow along without dealing with multi-gigabyte downloads.

Option 1: Generate a Synthetic Dataset

We’ll use a small Python script that creates short .mp4 clips of a moving (or still) coloured box. Each class represents a gesture:

swipe_left – box moves from right to left
swipe_right – box moves from left to right
stop – box stays still in the center

Save this script as generate_synthetic_gestures.py in your project root:

import os, cv2, numpy as np, random, argparse

def ensure_dir(p): os.makedirs(p, exist_ok=True)

def make_clip(mode, out_path, seconds=1.5, fps=16, size=224, box_size=60, seed=0, codec="mp4v"):
    rng = random.Random(seed)
    frames = int(seconds * fps)
    H = W = size

    # background + box color
    bg_val = rng.randint(160, 220)
    bg = np.full((H, W, 3), bg_val, dtype=np.uint8)
    color = (rng.randint(20, 80), rng.randint(20, 80), rng.randint(20, 80))

    # path of motion
    y = rng.randint(40, H - 40 - box_size)
    if mode == "swipe_left":
        x_start, x_end = W - 20 - box_size, 20
    elif mode == "swipe_right":
        x_start, x_end = 20, W - 20 - box_size
    elif mode == "stop":
        x_start = x_end = (W - box_size) // 2
    else:
        raise ValueError(f"Unknown mode: {mode}")

    fourcc = cv2.VideoWriter_fourcc(*codec)
    vw = cv2.VideoWriter(out_path, fourcc, fps, (W, H))
    if not vw.isOpened():
        raise RuntimeError(
            f"Could not open VideoWriter with codec '{codec}'. "
            "Try --codec XVID and use .avi extension, e.g. out.avi"
        )

    for t in range(frames):
        alpha = t / max(1, frames - 1)
        x = int((1 - alpha) * x_start + alpha * x_end)
        # small jitter to avoid being too synthetic
        jitter_x, jitter_y = rng.randint(-2, 2), rng.randint(-2, 2)
        frame = bg.copy()
        cv2.rectangle(frame, (x + jitter_x, y + jitter_y),
                      (x + jitter_x + box_size, y + jitter_y + box_size),
                      color, thickness=-1)
        # overlay text
        cv2.putText(frame, mode, (8, 24), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 0), 2, cv2.LINE_AA)
        cv2.putText(frame, mode, (8, 24), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 1, cv2.LINE_AA)
        vw.write(frame)

    vw.release()

def write_labels(labels, out_dir):
    with open(os.path.join(out_dir, "labels.txt"), "w", encoding="utf-8") as f:
        for c in labels:
            f.write(c + "\n")

def main():
    ap = argparse.ArgumentParser(description="Generate a tiny synthetic gesture dataset.")
    ap.add_argument("--out", default="data", help="Output directory (default: data)")
    ap.add_argument("--classes", nargs="+",
                    default=["swipe_left", "swipe_right", "stop"],
                    help="Class names (default: swipe_left swipe_right stop)")
    ap.add_argument("--clips", type=int, default=16, help="Clips per class (default: 16)")
    ap.add_argument("--seconds", type=float, default=1.5, help="Seconds per clip (default: 1.5)")
    ap.add_argument("--fps", type=int, default=16, help="Frames per second (default: 16)")
    ap.add_argument("--size", type=int, default=224, help="Frame size WxH (default: 224)")
    ap.add_argument("--box", type=int, default=60, help="Box size (default: 60)")
    ap.add_argument("--codec", default="mp4v", help="Codec fourcc (mp4v or XVID)")
    ap.add_argument("--ext", default=".mp4", help="File extension (.mp4 or .avi)")
    args = ap.parse_args()

    ensure_dir(args.out)
    write_labels(args.classes, ".")  # writes labels.txt to project root

    print(f"Generating synthetic dataset -> {args.out}")
    for cls in args.classes:
        cls_dir = os.path.join(args.out, cls)
        ensure_dir(cls_dir)
        mode = "stop" if cls == "stop" else ("swipe_left" if "left" in cls else ("swipe_right" if "right" in cls else "stop"))
        for i in range(args.clips):
            filename = os.path.join(cls_dir, f"{cls}_{i+1:03d}{args.ext}")
            make_clip(
                mode=mode,
                out_path=filename,
                seconds=args.seconds,
                fps=args.fps,
                size=args.size,
                box_size=args.box,
                seed=i + 1,
                codec=args.codec
            )
        print(f"  {cls}: {args.clips} clips")

    print("Done. You can now run: python train.py, python export_onnx.py, python app.py")

if __name__ == "__main__":
    main()

The script generates a synthetic gesture dataset by creating video clips of a moving or stationary coloured box, simulating gestures like "swipe left," "swipe right," and "stop," and saves them in a specified output directory.
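The motion in each clip is plain linear interpolation between a start and an end position. The sketch below isolates that logic from make_clip so you can inspect the box positions without writing any video. The function name and the margin parameter are introduced here for illustration; the defaults mirror the script (24 frames for 1.5 seconds at 16 fps).

```python
def box_x_positions(mode, frames=24, width=224, box_size=60, margin=20):
    """Left edge of the box for every frame, before jitter is added."""
    if mode == "swipe_left":
        x_start, x_end = width - margin - box_size, margin
    elif mode == "swipe_right":
        x_start, x_end = margin, width - margin - box_size
    elif mode == "stop":
        x_start = x_end = (width - box_size) // 2
    else:
        raise ValueError(f"Unknown mode: {mode}")
    xs = []
    for t in range(frames):
        alpha = t / max(1, frames - 1)  # 0.0 on the first frame, 1.0 on the last
        xs.append(int((1 - alpha) * x_start + alpha * x_end))
    return xs

left = box_x_positions("swipe_left")
right = box_x_positions("swipe_right")
print("swipe_left starts/ends at:", left[0], left[-1])
print("swipe_right starts/ends at:", right[0], right[-1])
```

Notice that the two swipe classes visit the same range of positions in opposite order, so telling them apart depends on the order of the frames.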

Now run it inside your virtual environment:

python generate_synthetic_gestures.py --out data --clips 16 --seconds 1.5

The command above runs a Python script named generate_synthetic_gestures.py, which generates a synthetic gesture dataset with 16 clips per gesture, each lasting 1.5 seconds, and saves the output in a directory named "data".

This creates a dataset like:

data/
  swipe_left/*.mp4
  swipe_right/*.mp4
  stop/*.mp4
labels.txt

Each folder contains short clips of a moving (or still) box that simulate gestures. This is perfect for testing the pipeline.
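The scripts that follow import read_labels and GestureClips from a dataset module that isn't shown in this tutorial. As a rough sketch of what the label and split logic in such a dataset.py might look like, here is a standard-library version. Everything in it is an assumption: the second return value of read_labels, the 80/20 split, the fixed seed, and the split_clips helper are hypothetical. A complete GestureClips class would also need to decode frames and return tensors shaped (T, C, H, W).

```python
import os
import random

VIDEO_EXTS = (".mp4", ".avi")

def read_labels(path="labels.txt"):
    """Return (labels, label_to_index) from a file with one class name per line."""
    with open(path, encoding="utf-8") as f:
        labels = [line.strip() for line in f if line.strip()]
    return labels, {name: i for i, name in enumerate(labels)}

def split_clips(root, labels, val_fraction=0.2, seed=0):
    """List (path, class_index) pairs and split them into train and val lists."""
    items = []
    for idx, name in enumerate(labels):
        class_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(class_dir)):
            if fname.lower().endswith(VIDEO_EXTS):
                items.append((os.path.join(class_dir, fname), idx))
    # A fixed seed gives the same split on every run (assumed design choice).
    random.Random(seed).shuffle(items)
    n_val = max(1, round(len(items) * val_fraction)) if items else 0
    return items[n_val:], items[:n_val]
```

You would call read_labels("labels.txt") and then split_clips("data", labels). With the 48 clips generated above, this particular split produces 38 training clips and 10 validation clips.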

Training Script: train.py

Now that we have our dataset, let’s fine-tune a Vision Transformer with temporal pooling. This model applies ViT frame-by-frame, averages embeddings across time, and trains a classification head on your gestures.

Here’s the full training script:

# train.py
import torch, torch.nn as nn, torch.optim as optim
from torch.utils.data import DataLoader
import timm
from dataset import GestureClips, read_labels

class ViTTemporal(nn.Module):
    """Frame-wise ViT encoder -> mean pool over time -> linear head."""
    def __init__(self, num_classes, vit_name="vit_tiny_patch16_224"):
        super().__init__()
        self.vit = timm.create_model(vit_name, pretrained=True, num_classes=0, global_pool="avg")
        feat_dim = self.vit.num_features
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):  # x: (B,T,C,H,W)
        B, T, C, H, W = x.shape
        x = x.view(B * T, C, H, W)
        feats = self.vit(x)                  # (B*T, D)
        feats = feats.view(B, T, -1).mean(dim=1)  # (B, D)
        return self.head(feats)

def train():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    labels, _ = read_labels("labels.txt")
    n_classes = len(labels)

    train_ds = GestureClips(train=True)
    val_ds   = GestureClips(train=False)
    print(f"Train clips: {len(train_ds)} | Val clips: {len(val_ds)}")

    # Windows/CPU friendly
    train_dl = DataLoader(train_ds, batch_size=2, shuffle=True,  num_workers=0, pin_memory=False)
    val_dl   = DataLoader(val_ds,   batch_size=2, shuffle=False, num_workers=0, pin_memory=False)

    model = ViTTemporal(num_classes=n_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)

    best_acc = 0.0
    epochs = 5
    for epoch in range(1, epochs + 1):
        # ---- Train ----
        model.train()
        total, correct, loss_sum = 0, 0, 0.0
        for x, y in train_dl:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()

            loss_sum += loss.item() * x.size(0)
            correct += (logits.argmax(1) == y).sum().item()
            total += x.size(0)

        train_acc = correct / total if total else 0.0
        train_loss = loss_sum / total if total else 0.0

        # ---- Validate ----
        model.eval()
        vtotal, vcorrect = 0, 0
        with torch.no_grad():
            for x, y in val_dl:
                x, y = x.to(device), y.to(device)
                vcorrect += (model(x).argmax(1) == y).sum().item()
                vtotal += x.size(0)
        val_acc = vcorrect / vtotal if vtotal else 0.0

        print(f"Epoch {epoch:02d} | train_loss {train_loss:.4f} "
              f"| train_acc {train_acc:.3f} | val_acc {val_acc:.3f}")

        if val_acc > best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), "vit_temporal_best.pt")

    print("Best val acc:", best_acc)

if __name__ == "__main__":
    train()

Running the command python train.py initiates the training process for your gesture recognition model. Here's a breakdown of what happens:

Load your dataset from data/: The script will access and load the gesture dataset stored in the "data" directory. This dataset is used to train the model.
Fine-tune a pre-trained Vision Transformer: The training script will take a Vision Transformer model that has been pre-trained on a larger dataset and fine-tune it using your specific gesture dataset. Fine-tuning helps the model adapt to the nuances of your data, improving its performance on the specific task of gesture recognition.
Save the best checkpoint as vit_temporal_best.pt: After each epoch, the script measures accuracy on the validation split. Whenever validation accuracy improves on the best value so far, the model weights are saved to vit_temporal_best.pt. This file can later be used for inference or further training.
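The temporal pooling step in ViTTemporal.forward is just an average over the per-frame feature vectors. Here is a tiny pure-Python sketch of that one step, using toy numbers rather than real ViT outputs. The function name is hypothetical.

```python
def temporal_mean_pool(frame_features):
    """Average T per-frame feature vectors into one clip-level vector."""
    if not frame_features:
        raise ValueError("need at least one frame")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frame vectors must have the same length")
    t = len(frame_features)
    return [sum(f[d] for f in frame_features) / t for d in range(dim)]

# Three frames with 2-dimensional features (toy numbers).
clip = [[1.0, 0.0], [2.0, 0.0], [3.0, 6.0]]
print(temporal_mean_pool(clip))
# Reversing the frame order does not change the pooled vector.
print(temporal_mean_pool(clip[::-1]) == temporal_mean_pool(clip))
```

One consequence worth keeping in mind: mean pooling ignores frame order, so gestures that differ mainly in direction may be harder for this architecture to separate than gestures that differ in appearance.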

What Training Looks Like

You should see logs similar to this:

Train clips: 38 | Val clips: 10
Epoch 01 | train_loss 1.4508 | train_acc 0.395 | val_acc 0.200
Epoch 02 | train_loss 1.2466 | train_acc 0.263 | val_acc 0.200
Epoch 03 | train_loss 1.1361 | train_acc 0.368 | val_acc 0.200
Best val acc: 0.200

Don’t worry if your accuracy is low at first. With a synthetic dataset this small, that’s normal. The key is proving that the Transformer pipeline works end to end. You can boost results later by:

Adding more clips per class
Training for more epochs
Switching to real recorded gestures

Figure 1. Example training logs from train.py, where the Vision Transformer with temporal pooling is fine-tuned on a tiny synthetic dataset.

Export the Model to ONNX

To make our model easier to run in real time (and lighter on CPU), we’ll export it to the ONNX format.

Note: ONNX, which stands for Open Neural Network Exchange, is an open-source format for exchanging deep learning models between frameworks and runtimes. It lets you train a model in one framework, such as PyTorch, and then run it elsewhere, for example with ONNX Runtime, without rewriting the model. This interoperability comes from a standardized representation of the model's computation graph and parameters.

ONNX supports a wide range of operators and is continually updated to include new features, making it a versatile choice for deploying machine learning models across various platforms and devices.

Create a file called export_onnx.py:

import torch
from train import ViTTemporal
from dataset import read_labels

labels, _ = read_labels("labels.txt")
n_classes = len(labels)

# Load trained model
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load("vit_temporal_best.pt", map_location="cpu"))
model.eval()

# Dummy input: batch=1, 16 frames, 3x224x224
dummy = torch.randn(1, 16, 3, 224, 224)

# Export
torch.onnx.export(
    model, dummy, "vit_temporal.onnx",
    input_names=["video"], output_names=["logits"],
    dynamic_axes={"video": {0: "batch"}},
    opset_version=13
)

print("Exported vit_temporal.onnx")

Run it with python export_onnx.py.

This generates a file vit_temporal.onnx in your project folder. With the model in ONNX format, we can run it with onnxruntime, which is often faster than eager PyTorch for CPU inference.

Create a file called app.py:

import os, tempfile, cv2, torch, onnxruntime, numpy as np
import gradio as gr
from dataset import read_labels

T = 16
SIZE = 224
MODEL_PATH = "vit_temporal.onnx"

labels, _ = read_labels("labels.txt")

# --- ONNX session + auto-detect names ---
ort_session = onnxruntime.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
# detect first input and first output names to avoid mismatches
INPUT_NAME = ort_session.get_inputs()[0].name   # e.g. "input" or "video"
OUTPUT_NAME = ort_session.get_outputs()[0].name # e.g. "logits" or something else

def preprocess_clip(frames_rgb):
    if len(frames_rgb) == 0:
        frames_rgb = [np.zeros((SIZE, SIZE, 3), dtype=np.uint8)]
    if len(frames_rgb) < T:
        frames_rgb = frames_rgb + [frames_rgb[-1]] * (T - len(frames_rgb))
    frames_rgb = frames_rgb[:T]
    clip = [cv2.resize(f, (SIZE, SIZE), interpolation=cv2.INTER_AREA) for f in frames_rgb]
    clip = np.stack(clip, axis=0)                                    # (T,H,W,3)
    clip = np.transpose(clip, (0, 3, 1, 2)).astype(np.float32) / 255 # (T,3,H,W)
    clip = (clip - 0.5) / 0.5
    clip = np.expand_dims(clip, 0)                                   # (1,T,3,H,W)
    return clip

def _extract_path_from_gradio_video(inp):
    if isinstance(inp, str) and os.path.exists(inp):
        return inp
    if isinstance(inp, dict):
        for key in ("video", "name", "path", "filepath"):
            v = inp.get(key)
            if isinstance(v, str) and os.path.exists(v):
                return v
        for key in ("data", "video"):
            v = inp.get(key)
            if isinstance(v, (bytes, bytearray)):
                tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4")
                tmp.write(v); tmp.flush(); tmp.close()
                return tmp.name
    if isinstance(inp, (list, tuple)) and inp and isinstance(inp[0], str) and os.path.exists(inp[0]):
        return inp[0]
    return None

def _read_uniform_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) or 1
    idxs = np.linspace(0, total - 1, max(T, 1)).astype(int)
    want = set(int(i) for i in idxs.tolist())
    j = 0
    while True:
        ok, bgr = cap.read()
        if not ok: break
        if j in want:
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            frames.append(rgb)
        j += 1
    cap.release()
    return frames

def predict_from_video(gradio_video):
    video_path = _extract_path_from_gradio_video(gradio_video)
    if not video_path or not os.path.exists(video_path):
        return {}
    frames = _read_uniform_frames(video_path)

    # If OpenCV choked on the codec (common with recorded webm), re-encode once:
    if len(frames) == 0:
        tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4"); tmp_name = tmp.name; tmp.close()
        cap = cv2.VideoCapture(video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) or 640
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) or 480
        out = cv2.VideoWriter(tmp_name, fourcc, 20.0, (w, h))
        while True:
            ok, frame = cap.read()
            if not ok: break
            out.write(frame)
        cap.release(); out.release()
        frames = _read_uniform_frames(tmp_name)

    clip = preprocess_clip(frames)
    # >>> use the detected ONNX input/output names <<<
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[0]  # (1, C)
    probs = torch.softmax(torch.from_numpy(logits), dim=1)[0].numpy().tolist()
    return {labels[i]: float(probs[i]) for i in range(len(labels))}

def predict_from_image(image):
    if image is None:
        return {}
    clip = preprocess_clip([image] * T)
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[0]
    probs = torch.softmax(torch.from_numpy(logits), dim=1)[0].numpy().tolist()
    return {labels[i]: float(probs[i]) for i in range(len(labels))}

with gr.Blocks() as demo:
    gr.Markdown("# Gesture Classifier (ONNX)\nRecord or upload a short video, then click **Classify Video**.")
    with gr.Tab("Video (record or upload)"):
        vid_in = gr.Video(label="Record from webcam or upload a short clip")
        vid_out = gr.Label(num_top_classes=3, label="Prediction")
        gr.Button("Classify Video").click(fn=predict_from_video, inputs=vid_in, outputs=vid_out)
    with gr.Tab("Single Image (fallback)"):
        img_in = gr.Image(label="Upload an image frame", type="numpy")
        img_out = gr.Label(num_top_classes=3, label="Prediction")
        gr.Button("Classify Image").click(fn=predict_from_image, inputs=img_in, outputs=img_out)

if __name__ == "__main__":
    demo.launch()

Running the command python app.py starts a local Gradio server and prints a URL that you can open in your web browser. Here's what the app does:

Record or upload a clip: In the Video tab, you can record a short clip from your webcam (typically 2–4 seconds) or upload an existing video file. The app classifies one clip at a time rather than streaming predictions continuously.
Classify on demand: When you click Classify Video, the app samples 16 frames evenly from the clip, preprocesses them, and runs them through the exported ONNX model.
Top 3 gesture classes displayed with probabilities: The Prediction panel shows the top three gesture classes along with their probabilities, giving you an idea of the model's confidence.

The app has two tabs. The Video tab is the main workflow described above. The Single Image tab is a fallback: it repeats one uploaded image 16 times to form a clip, so it can only recognize gestures that don't depend on motion.
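The app reduces every recording to 16 evenly spaced frames before classification. The sketch below shows that sampling idea with integer arithmetic only. It is an illustrative stand-in for the np.linspace call in _read_uniform_frames, not the app's exact code, and the function name is hypothetical.

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 16):
    """Pick num_samples frame indices spread evenly from first to last frame."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples == 1:
        return [0]
    last = total_frames - 1
    # Integer arithmetic keeps the endpoints exact: index 0 and index last.
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]

print(uniform_frame_indices(24, 4))
# A clip shorter than the sample count yields repeated indices.
print(len(set(uniform_frame_indices(8, 16))), "unique frames out of 16 requested")
```

In app.py, repeated indices collapse because the wanted indices are stored in a set, and preprocess_clip then pads short clips by repeating the last frame until there are 16.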

Here’s an example where I raised my hand for a stop gesture, and the model predicts “stop” as the top class:

Figure 2. The Gradio app running locally. After recording a short clip, the Transformer model predicts the gesture with class probabilities.
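The class probabilities shown in the app come from a softmax over the model's raw logits. As a minimal sketch of that last step without torch, the helper below converts logits into ranked label and probability pairs. The function name and the example logits are made up for illustration.

```python
import math

def top_k_probabilities(logits, labels, k=3):
    """Softmax over logits, returned as the k most likely (label, prob) pairs."""
    if len(logits) != len(labels):
        raise ValueError("need exactly one logit per label")
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up logits; real values come from the ONNX session output.
example = top_k_probabilities([0.2, -1.0, 2.3], ["swipe_left", "swipe_right", "stop"])
for name, p in example:
    print(f"{name}: {p:.3f}")
```

A high top probability only means the model prefers that class over the others it knows. It does not mean the clip actually contains one of the trained gestures.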

Evaluate Accuracy + Latency

Now that the model runs in a demo app, let’s check how well it performs. There are two sides to this:

Accuracy: does the model predict the right gesture class?
Latency: how fast does it respond, especially on CPU vs GPU?

1. Quick Accuracy Check

Save this as eval.py in the same folder as your other scripts:

import torch
from dataset import GestureClips, read_labels
from train import ViTTemporal

labels, _ = read_labels("labels.txt")
n_classes = len(labels)

# Load validation data
val_ds = GestureClips(train=False)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=2, shuffle=False)

# Load trained model
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load("vit_temporal_best.pt", map_location="cpu"))
model.eval()

correct, total = 0, 0
all_preds, all_labels = [], []

with torch.no_grad():
    for x, y in val_dl:
        logits = model(x)
        preds = logits.argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.size(0)
        all_preds.extend(preds.tolist())
        all_labels.extend(y.tolist())

print(f"Validation accuracy: {correct/total:.2%}")

2. Confusion Matrix

Let’s also visualize which gestures are confused. Add this snippet at the bottom of eval.py:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(all_labels, all_preds)

plt.figure(figsize=(6,6))
sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.tight_layout()
plt.show()

When you run python eval.py, a heatmap like this will pop up:

Figure 3. Confusion matrix on the validation set. Correct predictions appear along the diagonal. Off-diagonal counts show gesture confusions.
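If you want the numbers behind the heatmap, the matrix is easy to compute by hand. This sketch follows the same convention as the plot (rows are true classes, columns are predictions) and adds per-class recall. The labels are toy values, and both function names are hypothetical.

```python
def confusion_counts(true_ids, pred_ids, num_classes):
    """cm[i][j] = number of clips with true class i predicted as class j."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_ids, pred_ids):
        cm[t][p] += 1
    return cm

def per_class_recall(cm):
    """Diagonal count divided by row total; None when a class has no samples."""
    return [row[i] / sum(row) if sum(row) else None for i, row in enumerate(cm)]

# Toy labels for illustration (0=swipe_left, 1=swipe_right, 2=stop).
true_ids = [0, 0, 1, 1, 2, 2]
pred_ids = [0, 1, 1, 1, 2, 0]
cm = confusion_counts(true_ids, pred_ids, 3)
print(cm)
print(per_class_recall(cm))
```

Per-class recall is useful on tiny validation sets because a single overall accuracy number can hide a class that the model never predicts correctly.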

3. Latency Benchmark

Finally, let’s see how fast inference runs. Save the following as benchmark.py:

import time, numpy as np, onnxruntime
from dataset import read_labels

labels, _ = read_labels("labels.txt")

ort = onnxruntime.InferenceSession("vit_temporal.onnx", providers=["CPUExecutionProvider"])
INPUT_NAME = ort.get_inputs()[0].name
OUTPUT_NAME = ort.get_outputs()[0].name

dummy = np.random.randn(1, 16, 3, 224, 224).astype(np.float32)

# Warmup
for _ in range(3):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})

# Benchmark (perf_counter is a monotonic, high-resolution clock meant for timing)
t0 = time.perf_counter()
for _ in range(50):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})
t1 = time.perf_counter()

print(f"Average latency: {(t1 - t0)/50:.3f} seconds per clip")

Run: python benchmark.py

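On CPU, you might see roughly 0.05–0.15s per clip, though the exact number depends heavily on your hardware. The script above only measures CPU latency because the session is created with CPUExecutionProvider. To benchmark on an NVIDIA GPU, install the onnxruntime-gpu package and add CUDAExecutionProvider to the providers list.

Note: If latency is high, you can apply INT8 quantization with the tools in onnxruntime.quantization, which shrinks the model and usually speeds up CPU inference.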
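An average can hide occasional slow runs, so it can help to record each call separately and look at the median and worst case too. The helper below is a generic sketch using only the standard library. The function names are hypothetical, and the lambda is a stand-in workload that you would swap for the ort.run call.

```python
import statistics
import time

def time_calls(fn, warmup=3, runs=50):
    """Return per-call latencies in seconds, after discarding warmup calls."""
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies

def summarize(latencies):
    """Mean, median, and worst-case latency for a list of timings."""
    return {
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }

# Stand-in workload; replace with: lambda: ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})
stats = summarize(time_calls(lambda: sum(range(10_000)), runs=20))
print({k: f"{v * 1000:.3f} ms" for k, v in stats.items()})
```

For an interactive app, the worst-case number often matters more than the mean, since a single slow prediction is what a user notices.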

Option 2: Use Small Samples from Public Gesture Datasets

If you’d prefer to see your model trained on real gesture clips instead of synthetic moving boxes, you can grab a handful of videos from open datasets. You don’t need to download the entire dataset (which can be several GB). Just a few .mp4 samples are enough to follow along.

Recommended sources

20BN Jester Dataset: Contains short clips of hand gestures like swiping, clapping, and pointing.
WLASL: A large-scale dataset of isolated sign language words.

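Both datasets contain short, real-world clips you can use as realistic training examples. Check each project's page for download instructions and license terms. If a dataset ships its clips as folders of image frames rather than video files, convert them to .mp4 before placing them in data/.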

Setting up your dataset folder

Once you download a few clips, place them in the data/ folder under subfolders named after each gesture class. For example:

data/
├── swipe_left/
│   ├── clip1.mp4
│   └── clip2.mp4
├── swipe_right/
│   ├── clip1.mp4
│   └── clip2.mp4
└── stop/
    ├── clip1.mp4
    └── clip2.mp4

And update labels.txt to match the folder names:

swipe_left
swipe_right
stop

Now your dataset is ready, and the same training scripts from earlier (train.py, eval.py) will work without modification.
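Before training on hand-assembled data, it's worth checking that the folders and labels.txt agree. The sketch below counts clips per class and lists folders that labels.txt doesn't mention. The function names and the accepted extensions are assumptions for illustration.

```python
import os

def count_clips(root, labels, exts=(".mp4", ".avi")):
    """Return {class_name: clip_count}; raise if a class folder is missing."""
    counts = {}
    for name in labels:
        class_dir = os.path.join(root, name)
        if not os.path.isdir(class_dir):
            raise FileNotFoundError(f"Missing class folder: {class_dir}")
        counts[name] = sum(
            1 for f in os.listdir(class_dir) if f.lower().endswith(exts)
        )
    return counts

def unlisted_folders(root, labels):
    """Folders under root that labels.txt does not mention."""
    return sorted(
        d for d in os.listdir(root)
        if os.path.isdir(os.path.join(root, d)) and d not in labels
    )

# Only runs when the tutorial's layout is present in the current directory.
if os.path.isdir("data") and os.path.isfile("labels.txt"):
    with open("labels.txt", encoding="utf-8") as f:
        names = [line.strip() for line in f if line.strip()]
    print(count_clips("data", names))
    print("Not in labels.txt:", unlisted_folders("data", names))
```

A class with zero or very few clips is a common cause of confusing validation results, so check the counts before blaming the model.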

Why choose this option?

Gives more realistic results than synthetic coloured boxes
Lets you see how the model handles actual human hand movements
Requires a bit more effort (downloading clips, trimming them if needed)

Tip: If downloading from these datasets feels too heavy, you can also record your own short gestures using your laptop webcam. Just save them as .mp4 files and organize them in the same folder structure.

Accessibility Notes & Ethical Limits

While this project shows the technical workflow for gesture recognition with Transformers, it’s important to step back and consider the human context:

Accessibility first: Tools like this can help students with speech or motor difficulties, but they should always be co-designed with the people who will use them. Don’t assume one-size-fits-all.
Dataset sensitivity: Using publicly available sign or gesture datasets is fine for prototyping, but deploying such a system requires careful consideration of consent and representation.
Error tolerance: Even small misclassifications can have big consequences in accessibility contexts (for example, confusing stop with go). Always plan for fallback options (like manual input or confirmation).
Bias and inclusivity: Models trained on narrow datasets may fail for different skin tones, lighting conditions, or cultural gesture variations. Broad and diverse training data is essential for fairness.

In other words: this demo is a teaching scaffold, not a production-ready accessibility tool. Responsible deployment requires collaboration with educators, therapists, and end users.

Next Steps

If you’d like to push this project further, here are some directions to explore:

Better models: Try video-focused Transformers like TimeSformer or VideoMAE for stronger temporal reasoning.
Larger vocabularies: Add more gesture classes, build your own dataset, or use portions of public datasets like 20BN Jester or WLASL.
Pose fusion: Combine gesture video with human pose keypoints from MediaPipe or OpenPose for more robust predictions.
Real-time smoothing: Implement temporal smoothing or debounce logic in the app so predictions are more stable during live use.
Quantization + edge devices: Convert your ONNX model to an INT8 quantized version and deploy it on a Raspberry Pi or Jetson Nano for classroom-ready prototypes.
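To make the real-time smoothing idea concrete, here is a minimal sketch of a majority-vote debounce over the most recent predictions. It is one possible approach, not the app's actual logic. The class name PredictionSmoother and the window and min_votes defaults are assumptions chosen for illustration.

```python
from collections import Counter, deque


class PredictionSmoother:
    """Report a label only once it wins enough votes in a sliding window."""

    def __init__(self, window=5, min_votes=3):
        self.history = deque(maxlen=window)  # most recent raw predictions
        self.min_votes = min_votes
        self.current = None  # last label that passed the vote threshold

    def update(self, label):
        """Add one raw prediction and return the smoothed label."""
        self.history.append(label)
        top_label, votes = Counter(self.history).most_common(1)[0]
        if votes >= self.min_votes:
            self.current = top_label
        return self.current


smoother = PredictionSmoother(window=5, min_votes=3)
raw = ["stop", "stop", "go", "stop", "go", "go", "go"]
smoothed = [smoother.update(label) for label in raw]
print(smoothed)
```

A single stray "go" in the middle of a run of "stop" predictions does not flip the output. The label only changes once "go" holds a majority of the window.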

Conclusion

In this tutorial, you learned how to create a gesture recognition system using Transformer models, demonstrating the potential of cutting-edge machine learning techniques. By preparing a small dataset, training a Vision Transformer with temporal pooling, exporting the model to ONNX for efficient inference, and deploying a real-time Gradio app, you showcased a practical application of these technologies. The evaluation of accuracy and latency further highlighted the system's effectiveness and responsiveness.

This project illustrates how you can leverage advanced ML methods to enhance accessibility and communication, paving the way for more inclusive learning environments.

Remember: while this demo works with small datasets, real-world applications need larger, more diverse data and careful consideration of accessibility, inclusivity, and ethics.

Here’s the GitHub repo for full source code: transformer-gesture.

Build Your Own ViT Model from Scratch

Beau Carnes — Wed, 28 May 2025 13:40:21 +0000

Vision Transformers have fundamentally changed how we approach computer vision problems, delivering state-of-the-art results that often surpass traditional convolutional neural networks. As the industry shifts toward transformer-based architectures for image classification, object detection, and beyond, understanding how to build and implement these models from scratch has become essential for machine learning practitioners and researchers who want to stay at the forefront of computer vision innovation.

We've just released a comprehensive new course on the freeCodeCamp.org YouTube channel that takes you through the complete process of building a Vision Transformer (ViT) model using PyTorch. This hands-on tutorial guides you through each component, from patch embedding to the Transformer Encoder, while training your custom model on the CIFAR-10 dataset for practical image classification experience. Mohammed Al Abrah developed this course.

What You'll Accomplish

This course provides both theoretical understanding and practical implementation skills. You'll start with the foundational concepts of Vision Transformers, learning how they differ from CNNs and why they've become so effective for computer vision tasks. The tutorial then walks you through setting up your development environment and configuring the necessary hyperparameters for optimal training.

The core of the course focuses on building the ViT architecture from the ground up. You'll implement image transformation operations, download and prepare the CIFAR-10 dataset, and create efficient DataLoaders. Most importantly, you'll construct the complete Vision Transformer model, understanding each component's role in the overall architecture.

Training and Optimization

The course covers the complete machine learning pipeline, including defining appropriate loss functions and optimizers for your ViT model. You'll implement a comprehensive training loop and learn to visualize training progress by comparing training versus testing accuracy. The tutorial also demonstrates how to make predictions with your trained model and visualize the results.

Advanced sections focus on fine-tuning techniques using data augmentation to improve model performance. You'll train the enhanced model and compare results before and after fine-tuning, gaining insights into optimization strategies that can significantly boost your model's effectiveness.

Course Structure

The tutorial is organized into clear, logical sections that build upon each other. Starting with theoretical foundations, you'll progress through environment setup, data preparation, model construction, training procedures, and advanced optimization techniques. Each section includes practical code implementation, ensuring you gain hands-on experience with every aspect of Vision Transformer development.

The course concludes with comprehensive evaluation methods, teaching you to assess model performance and understand the impact of different training strategies. You'll learn to visualize predictions and analyze results, skills that are crucial for real-world machine learning applications.

Why This Matters Now

As transformer architectures continue to dominate both natural language processing and computer vision, the ability to implement these models from scratch provides invaluable insight into their inner workings. This understanding enables you to modify architectures for specific use cases, debug training issues effectively, and adapt to new developments in the field.

Ready to master one of the most important advances in modern computer vision? Watch the full course on the freeCodeCamp.org YouTube channel (2-hour watch).

Learn PyTorch in Five Projects

Beau Carnes — Thu, 06 Mar 2025 17:56:01 +0000

Deep learning has revolutionized the way we approach complex problems like image recognition, natural language processing, and even audio analysis. At the core of many deep learning applications is PyTorch, a powerful and flexible framework that allows developers and researchers to build and train neural networks efficiently. If you're looking to gain hands-on experience with PyTorch and understand its syntax in real-world applications, we've got the perfect course for you.

We just published a course on the freeCodeCamp.org YouTube channel that will teach you all about PyTorch and its syntax through five practical exercises, guided by Omar Atef. This course provides a structured introduction to PyTorch, covering different types of machine learning tasks, from tabular data classification to deep learning applications in image, audio, and text classification. Each section focuses on a specific problem, allowing you to see PyTorch in action and build models that handle various types of data.

What You'll Learn in This Course

🔹 Tabular Data Classification – Learn how to use PyTorch for structured data, a crucial skill for predictive modeling in industries like finance, healthcare, and retail.

🔹 Image Classification – Train a deep learning model to recognize objects in images, a fundamental task in computer vision.

🔹 Pre-trained Models for Image Classification – Discover how to leverage powerful, pre-trained neural networks to achieve high accuracy with minimal training time.

🔹 Audio Classification – Explore how PyTorch can be used to classify sounds and speech, an essential step in applications like voice recognition and music categorization.

🔹 Text Classification with BERT – Learn how to use the BERT model for natural language processing tasks such as sentiment analysis and spam detection.

Why Learn PyTorch?

PyTorch is widely used in both research and industry due to its ease of use, dynamic computation graph, and strong community support. By mastering PyTorch, you'll gain the ability to build and deploy deep learning models efficiently, making it an essential skill for data scientists, AI engineers, and researchers.

This course is beginner-friendly but also provides valuable insights for those already familiar with machine learning. Each section includes hands-on coding exercises that reinforce your understanding and help you apply what you learn to real-world problems.

Watch the full course here: PyTorch Course on freeCodeCamp.org (6-hour watch).

Build a Stable Diffusion VAE From Scratch using Pytorch

Beau Carnes — Wed, 04 Dec 2024 14:53:47 +0000

We just published a course on the freeCodeCamp.org YouTube channel that will teach you everything you need to know about Variational Autoencoders (VAEs). This course is perfect for anyone looking to dive deep into one of the fundamental concepts behind modern image generation techniques, such as those used in latent diffusion models and GANs. Harsh Bhatt developed this course. He is a machine learning engineer.

VAEs are a special type of autoencoder that work with probability distributions instead of fixed points in the latent space. This capability allows VAEs to learn and represent the variability in datasets, such as the different ways the digit "7" might appear in handwritten forms. By learning a mean (μ) and standard deviation (σ), the VAE effectively captures the distribution of the data, making it an essential tool for applications in generative modeling and unsupervised learning.

Why Learn Variational Autoencoders?

VAEs are more than just a stepping stone to understanding image generation. They solve key challenges in dimensionality reduction and data representation. Unlike traditional autoencoders, which focus on compressing data into a fixed latent representation, VAEs leverage probabilistic methods to create smoother and more meaningful latent spaces. This makes them particularly useful for tasks like:

Image synthesis: Generating realistic and diverse images.
Data augmentation: Creating new data samples for training.
Anomaly detection: Identifying outliers in data distributions.

What You'll Learn in This Course

This comprehensive course begins by introducing the basic concepts of autoencoders, including the encoder-decoder architecture. You'll then delve into the differences between standard autoencoders and VAEs, learning why encoding data into probability distributions is a game changer. Key topics covered include:

Latent space representation: How VAEs group similar data points into clusters within the latent space.
The reparameterization trick: Enabling gradient-based optimization by representing random variables in a differentiable way.
Loss functions for VAEs: Combining reconstruction loss and KL divergence to optimize the model.
Implementation with PyTorch: Hands-on coding to build and train your own VAE from scratch.
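The reparameterization trick and the two-part loss listed above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the course's code. It treats the encoder output as a mean and a log-variance (a common parameterization; the course may work with the standard deviation directly), uses squared error as the reconstruction term, and all function names are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives in eps, so z stays a differentiable
    function of mu and log_var.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction error (squared error here) plus the KL term."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + kl_divergence(mu, log_var)
```

When the encoder outputs a mean of 0 and a log-variance of 0 (so sigma is 1), the KL term is zero. It grows as the learned distribution drifts away from the standard normal prior.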

Hands-On Implementation

The course takes you step by step through implementing a VAE using PyTorch, starting with the encoder and decoder architecture. You’ll learn how to:

Encode images into a latent representation.
Decode the latent vectors to reconstruct the original images.
Optimize the model using reconstruction loss and KL divergence.
Visualize and interpret the latent space.

You’ll also gain insights into advanced techniques like self-attention layers for encoding context and residual blocks for efficient neural network training.

Conclusion

Ready to start your journey into generative modeling? Watch the course now on freeCodeCamp.org's YouTube channel and get hands-on with Variational Autoencoders!

PyTorch vs TensorFlow – Which is Better for Deep Learning Projects?

Manish Shivanandhan — Wed, 10 Jan 2024 18:46:30 +0000

In this article, we'll look at two popular deep learning libraries, PyTorch and TensorFlow, and see how they compare.

If you are getting started with deep learning, the available tools and frameworks can be overwhelming. Industry experts may recommend TensorFlow while hardcore ML engineers may prefer PyTorch.

Both these frameworks are powerful deep-learning tools. While TensorFlow is used in Google search and by Uber, PyTorch powers OpenAI’s ChatGPT and Tesla's autopilot.

Choosing between these two frameworks is a common challenge for developers. If you're in this position, this article compares TensorFlow and PyTorch to help you make an informed choice.

Understanding PyTorch and TensorFlow

Let’s start by getting to know our contenders better.

PyTorch, created by Facebook’s AI Research lab, has gained recognition for its simplicity and user-friendliness. PyTorch can efficiently handle dynamic computational graphs.

A computation graph is a visual representation of mathematical operations and their relationships. It’s like a flowchart that shows how data flows through the deep learning model.

Training neural networks involves a lot of computations. So computation graphs help computers organize and execute calculations efficiently when training neural networks.
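To see what a computation graph looks like in code, here is a toy sketch in plain Python. It is not how PyTorch or TensorFlow implement their graphs, and the Node class is hypothetical. Each operation records which values it was computed from, and walking that record backwards yields the gradients used in training.

```python
class Node:
    """One value in a scalar computation graph."""

    def __init__(self, value, parents=(), grad_fns=()):
        self.value = value
        self.parents = parents    # nodes this value was computed from
        self.grad_fns = grad_fns  # maps this node's gradient to each parent's
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, (self, other),
                    (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (lambda g: g * other.value, lambda g: g * self.value))

    def backward(self):
        """Propagate gradients from this node back to its inputs."""
        order, seen = [], set()

        def visit(node):
            # Depth-first walk so every node appears after its parents.
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, grad_fn in zip(node.parents, node.grad_fns):
                parent.grad += grad_fn(node.grad)


# The graph is built as the expression runs: y = w * x + b
x, w, b = Node(2.0), Node(3.0), Node(1.0)
y = w * x + b
y.backward()
print(y.value, w.grad, x.grad, b.grad)
```

Because the graph is recorded while ordinary Python code executes, changing the expression changes the graph. That is the idea behind the "dynamic" graphs mentioned above.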

PyTorch is easy to use, making it a favoured choice among developers and researchers alike. For people who appreciate a straightforward framework for their projects, PyTorch is a perfect choice.

TensorFlow, Google’s brainchild, has robust production capabilities and support for distributed training. TensorFlow excels in scenarios where you need large-scale machine learning models in real-world applications.

Distributed training is a technique used in deep learning to train large and complex models. It spreads the training process across multiple machines or devices, which is useful when dealing with massive datasets.

TensorFlow is the go-to choice for companies that need scalability and reliability in their deep learning models.

So as you may be able to see, the choice between PyTorch and TensorFlow often depends on the specific needs of a project.

PyTorch vs TensorFlow – Which One's Right for You?

Ease of Learning and Use

When you’re starting a new project, it's helpful to have an easier learning curve. It helps both in building the project and in hiring / training engineers for your project.

PyTorch is simpler and has a “Pythonic” way of doing things. It's a favourite for beginners and researchers. And its dynamic computation graph means you can change things on the fly, which is great for experimentation.

TensorFlow offers a more structured approach. Its static computation graph requires a bit more planning ahead. TensorFlow also comes with a steep learning curve. But this can lead to more optimized and high-performance models.

TensorFlow 2.0 has also made strides in simplicity. It has incorporated more of PyTorch’s dynamic nature through its Eager Execution feature.

But when it comes to simplicity and ease of learning, PyTorch is a clear winner.

Performance and Scalability

When it comes to performance and scalability, TensorFlow shines. It can handle large-scale, distributed training with ease. So TensorFlow is a go-to choice for production environments.

TensorFlow’s integrated tool, TensorBoard, is also a powerful tool for visualization and debugging.

PyTorch is catching up, with recent updates improving its scalability.

PyTorch has made improvements to support distributed training and scalability. It provides tools to help you train deep learning models on multiple GPUs and even across multiple machines.

But TensorFlow still holds the lead in deploying large-scale models in production.

Community and Support

The strength of a framework is also partly defined by its community. As these are open-source frameworks, there is no customer support. So you have to depend on the community for help if you get stuck while building a project using these frameworks.

TensorFlow, being older, has a larger community. It also has a vast array of tutorials, courses, and books.

PyTorch, while younger, has seen rapid growth in its community. PyTorch is a favourite, especially among researchers, since it's easy to use PyTorch for experimenting with datasets.

Both frameworks have strong support, but TensorFlow’s maturity gives it a slight edge in this area.

Flexibility and Innovation

If you’re working on cutting-edge research or need more flexibility, PyTorch is your best bet. Its dynamic computation graph allows for more creative and complex model architectures.

As I said before, this flexibility makes PyTorch a beloved tool in the research community. Where rapid prototyping and experimentation are key, PyTorch is your best option.

TensorFlow has been working towards adding more flexibility. But it's a difficult battle to win since PyTorch is built for simplicity from the ground up.

Industry Adoption

PyTorch (blue) vs TensorFlow (red)

TensorFlow has typically had the upper hand, particularly in large companies and production environments. Its robustness and scalability make it a safe choice for businesses.

But PyTorch is quickly gaining ground. As you can see in the trends chart, PyTorch has already overtaken TensorFlow as the most searched deep learning library. You can find the live chart here.

Multiple industries are starting to adopt PyTorch for research and development due to its user-friendliness and flexibility. PyTorch has also proved its capability as a production-grade tool after the release of models like ChatGPT.

Here is a list of companies using TensorFlow and PyTorch.

Products Using TensorFlow

Google Search and Recommendations: Google uses TensorFlow to enhance its search engine and recommendation systems. It helps improve search accuracy and provides personalized recommendations based on user behaviour and preferences.
NVIDIA Deep Learning Accelerator (NVDLA): NVDLA is a hardware accelerator for deep learning applications. It uses TensorFlow to optimize and deploy models on this hardware.
Uber’s Michelangelo: Uber uses TensorFlow in its Michelangelo platform for machine learning. It assists in various tasks, including ETA predictions, fraud detection, and dynamic pricing.

Products Using PyTorch

Facebook: Since PyTorch is from Facebook, Facebook uses PyTorch for various internal AI research and applications, including content recommendations and language translation.
Tesla Autopilot: Tesla’s Autopilot system relies on PyTorch for its deep learning components, such as object detection and navigation.
OpenAI’s GPT Models: Many of OpenAI’s language models, including GPT-2 and GPT-3, are built using PyTorch. These models are used for a wide range of natural language processing tasks, including text generation and language translation.

Conclusion

Choosing between PyTorch and TensorFlow depends on your project’s needs.

For those who need ease of use and flexibility, PyTorch is a great choice. If you prefer scalability from the ground up, production deployment, and a mature ecosystem, TensorFlow might be the way to go.

Both frameworks are evolving, so keep an eye on their development. Your choice today might not be your choice tomorrow. Remember, the best tool is the one that suits your project’s needs and not the popular one.

Thanks for coming this far. If you want weekly machine learning tutorials delivered to your inbox, join my newsletter. To get in touch with me, you can connect with me on LinkedIn.

Learn PyTorch for Deep Learning – Free 26-Hour Course

freeCodeCamp — Thu, 06 Oct 2022 14:48:39 +0000

By Daniel Bourke

My comprehensive PyTorch course is now live on the freeCodeCamp.org YouTube channel.

You can view the full 26 hour course here.
Read the course materials online for free at learnpytorch.io.
See all of the course materials on GitHub.

You can learn more about the course below the embedded video.

The best way to learn is by doing.

And that's just what we'll do in the Learn PyTorch for Deep Learning: Zero to Mastery course.

We'll learn by doing.

Throughout the course, we'll go through many of the most important concepts in machine learning and deep learning by writing PyTorch code.

If you're new to data science and machine learning, consider the course a momentum builder.

By the end, you'll be comfortable navigating the PyTorch documentation, reading PyTorch code, writing PyTorch code, searching for things you don't understand and building your own machine learning projects.

What is PyTorch?

PyTorch is a machine learning framework written in the Python programming language.

It allows you to write machine learning algorithms capable of turning data into models into intelligence.

Why Learn PyTorch?

As of July 2022, 58% of machine learning research papers that contain code use PyTorch. And this number has been growing since PyTorch’s release.

In essence, machine learning researchers love PyTorch.

And typically, industry follows research.

So if all of the best machine learning research is coming out in PyTorch, knowing PyTorch is a fantastic way to start working in machine learning.

What are the prerequisites?

Bad: "I can't learn it" (that's bulls*).

Good: Three to six months of experience writing Python code and a willingness to learn (you're more than ready to go).

The course is as beginner-friendly as possible.

So if you've got more than one year's experience with machine learning, you might learn a few things but the materials are designed for beginners.

How's the Course Taught?

The focus of the course is code, code, code, experiment, experiment, experiment.

There's a reason two of the course mottos are:

If in doubt, run the code!

Experiment, experiment, experiment!

We'll write code together, apprenticeship style.

Meaning in the video version of the course, I'll write PyTorch code and explain it and then you'll follow along by writing the same code.

If we get stuck on something, we'll search for an answer.

You'll notice I leave many of my errors in the videos, this is on purpose.

Because errors happen (often) and being able to troubleshoot them is important.

I'm a big fan of there being no speed limit to learning something.

So that's what we'll be doing.

Learning by coding.

Learning by experimenting.

Fast.

What Does This Course Cover?

You can view and read all of the materials online for free at learnpytorch.io.

But let's get specific.

The course is comprised of 5 modules (or notebooks), best taken sequentially (but feel free to jump around).

00 – PyTorch Fundamentals

We'll start from the ground up.

Answering questions like what is PyTorch (an open-source machine learning framework) and what can PyTorch be used for (manipulating data and writing machine learning algorithms).

Then we'll get familiar with the fundamental building block of deep learning, the tensor.

A tensor is a numerical representation of data (where data can be almost anything, images, text, tables of numbers).

And the whole goal of machine learning is to find patterns in data.

So knowing how to create, interact with and manipulate tensors is paramount.

All of the course materials are available to read in an interactive online book at learnpytorch.io

01 – PyTorch Workflow

The idea of machine learning is to turn data into intelligence.

And the machine learning model that's able to do that the best is the winner.

So how do you go from data to model to intelligence with PyTorch?

That's what PyTorch Workflow focuses on:

Preparing data (turning it into tensors).
Building or picking a pretrained model (to suit your problem).
Fitting the model to the data (or letting the model find patterns in the data).
Evaluating the trained model (after it's learned patterns in data).
Improving the model through experimentation.
Saving and reloading a trained model (so you can export it and use it in applications).
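As a rough illustration of these steps, here is the same workflow on a toy straight-line dataset, written in plain Python rather than PyTorch so it runs without any installs. The dataset, learning rate, and file name are all made up for this sketch. The course itself does this with tensors and PyTorch modules.

```python
import json
import os
import tempfile

# 1. Prepare data: points on the line y = 2x + 1, split into train and test.
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
train_x, train_y = xs[:16], ys[:16]
test_x, test_y = xs[16:], ys[16:]

# 2. Build a model: a line with two learnable parameters.
params = {"weight": 0.0, "bias": 0.0}


def predict(p, x):
    return p["weight"] * x + p["bias"]


def mse(p, inputs, targets):
    """Mean squared error of the model over a dataset."""
    errors = [(predict(p, x) - y) ** 2 for x, y in zip(inputs, targets)]
    return sum(errors) / len(errors)


# 3. Fit the model to the training data with plain gradient descent.
learning_rate = 0.1
for _ in range(2000):
    grad_w = grad_b = 0.0
    for x, y in zip(train_x, train_y):
        error = predict(params, x) - y
        grad_w += 2 * error * x / len(train_x)
        grad_b += 2 * error / len(train_x)
    params["weight"] -= learning_rate * grad_w
    params["bias"] -= learning_rate * grad_b

# 4. Evaluate on data the model has not seen.
test_loss = mse(params, test_x, test_y)

# 5. Improve through experimentation: rerun with a different
#    learning_rate or number of steps and compare test_loss.

# 6. Save and reload the trained parameters.
path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(params, f)
with open(path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(params, test_loss)
```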

We'll use and build upon this workflow throughout the course.

The PyTorch WorkFlow we'll cover and build upon throughout the Learn PyTorch for Deep Learning course.

02 – PyTorch Neural Network Classification

Neural networks are one of the most powerful kinds of machine learning algorithm, if not the most powerful.

They're what power many of today's most advanced artificial intelligence (AI) systems such as search and self-driving cars.

But can you get a neural network to do something simple like classifying whether a dot is red or blue?

A simple problem, yes, but experimenting with toy problems is one of the best ways to learn machine learning.

In doing so, we'll go through all of the major steps for one of the most common machine learning problems, classification: building a neural network to predict if something is one thing or another.

03 – PyTorch Computer Vision

Neural networks changed the game of computer vision forever.

And now PyTorch drives many of the latest advancements in computer vision algorithms.

Tesla uses PyTorch to build their computer vision algorithms for their self-driving software.

Apple uses PyTorch to build models that computationally enhance photos taken with the iPhone.

In PyTorch Computer Vision, we'll write PyTorch code to create a neural network capable of seeing patterns in images and classifying them into different categories.

04 – PyTorch Custom Datasets

The magic of machine learning is building algorithms to find patterns in your own custom data.

There are plenty of existing datasets out there, but how do you load your own custom dataset into PyTorch to build models to find patterns in it?

Perhaps you'd like to build a security system for your home and you'd like to teach it what your family looks like so it recognizes them.

Or perhaps you'd like to build an application capable of classifying the different dog photos you take.

That’s exactly what PyTorch Custom Datasets covers. We'll create our own custom dataset with food images of pizza, steak and sushi to start the major project of the course: FoodVision.

Can't I Learn All of this Myself?

Yes.

You can.

There's a reason I'm calling this course the second best place on the internet to learn PyTorch.

Because the best place is the PyTorch documentation.

Though documentation can be a little intimidating when you first encounter it.

So this course structures things in a way that's a fun warmup before diving into the documentation.

Got another question?

Feel free to leave a discussion on the course's GitHub repository.

Otherwise, happy machine learning and I'll see you in the course.

Let's code!

Real-World Machine Learning—PyTorch and Monai for Healthcare Imaging

Beau Carnes — Thu, 06 Jan 2022 16:30:57 +0000

To improve your skills in machine learning and artificial intelligence, it is important to solve real-world problems. What better problem to solve than helping to save people's lives?

Machine learning is being used more and more in the field of healthcare. PyTorch and Monai can be used to discover tumors in livers.

We just published a course on the freeCodeCamp.org YouTube channel that will teach you how to use PyTorch, Monai, and Python for 3D liver segmentation. You will use machine learning and computer vision to find tumors in livers.

Mohammed El Amine MOKHTARI developed this course. He is a computer vision Ph.D. student and online content creator.

Here are the sections in this course:

What is U-Net
Software Installation
Finding the Datasets
Preparing the Data
Installing the Packages
Preprocessing
Errors you May Face
Dice Loss
Weighted Cross Entropy
The Training Part
The Testing Part
Using the GitHub Repository

Watch the full course below or on the freeCodeCamp.org YouTube channel (5-hour watch).

Deep Learning Tutorial – How to Use PyTorch and Transfer Learning to Diagnose COVID-19 Patients

Juan Cruz Martinez — Wed, 03 Nov 2021 19:49:35 +0000

Ever since the outbreak of COVID-19 in December 2019, researchers in the field of artificial intelligence and machine learning have been trying to find better ways to diagnose the disease.

They've worked on developing algorithms that would detect the disease within a matter of seconds – and only by looking at chest X-rays and/or CT scan images.

Some of these techniques have proven to be extremely useful and accurate in diagnosing COVID-19 cases.

There are multiple approaches that use both machine and deep learning to detect and/or classify the disease. And researchers have proposed newly developed architectures along with transfer learning approaches.

In this article, we will look at a transfer learning approach that classifies COVID-19 cases using chest X-ray images.

The model we are going to use is one of the seven variants of the EfficientNet architecture. We will use a model pre-trained on the immense ImageNet dataset. EfficientNet is an advanced and complex convolutional neural network-based architecture.

We will further investigate the details of Convolutional Neural Networks, pre-trained models, and EfficientNet during the course of this article. I've divided it into five parts:

What are convolutional neural networks?
A dive into transfer learning.
What is EfficientNet?
An introduction to PyTorch.
Implementation of COVID-19 classifier using EfficientNet with PyTorch.

This tutorial assumes that you have prior knowledge of both machine learning and deep learning. If you want to further develop your foundation in these topics, check out this article on Artificial Intelligence vs Machine Learning vs Deep Learning.

Also, although the dataset we'll work with here is COVID-related, you can apply the actual code implementation and analysis to other datasets.

What is a Convolutional Neural Network?

Convolutional neural networks (CNNs) are a type of deep neural network that works on visual data, that is, images. A CNN takes an image as an input and performs two or three-dimensional convolutional operations on the image with several filters, also referred to as kernels.

These convolution operations output a 2D or 3D matrix, computed from the filters' learnable weights and biases, that captures spatial information about the input image. This output matrix is referred to as the feature map of the image.

Processing a convolutional neural network in the training process can be, in some cases, extremely slow. This is why it's a good idea to use GPUs and TPUs during training for deep learning techniques, especially convolutional neural networks.

Convolutional neural networks learn spatial and temporal information about the image far better than the basic feed forward neural network. Also, CNNs can reduce the size of the image while retaining the most important information in the image, which is crucial for predictive analysis of images.


The starting layers of convolutional neural networks learn the abstract and simpler features in an image, such as lines and edges. But as we move deeper into the network, the feature map turns to the more complex structures in the image.

It starts to learn the more specific features of the image, such as a cat, a dog, or a person, the same way we would, as humans, perceive the world around us. This is a core concept in modern deep learning-based computer vision.

Now before we move on to advanced concepts, it is important to learn the basics of 2D convolution.

What is 2D Convolution?

2D convolution is a bit complex to explain, but here it goes: if the convolutional process (which is extensively used in 1-D signal processing) is performed between two signals – but not just along a single dimension, rather along two mutually perpendicular dimensions – it is called 2D convolution.

In the case of images, the two mutually perpendicular dimensions are the rows and columns of a greyscale image. The convolutional operation is mathematically done by multiplying and then accumulating the values of the overlapping samples of the two input signals, where one of the signals is flipped. The output of this multiplication and accumulation gives a single point on the feature map.

In the case of CNNs, the image is one signal and the filter/kernel is the second signal which is flipped. The size of the kernel is always smaller than that of the image.

The flipped kernel is then swept across the whole image both row by row and column by column to output the feature map.
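The multiply-and-accumulate sweep described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming no padding and a stride of 1; the function name conv2d_valid and the toy image and kernel values are made up for the example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2D convolution with a flipped kernel, no padding, stride 1."""
    flipped = np.flip(kernel)  # flip along both rows and columns
    kh, kw = flipped.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # multiply the overlapping samples and accumulate them
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

# toy 6x6 "image" and 3x3 kernel (hypothetical values)
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (4, 4)
```

Note that deep learning libraries such as PyTorch typically skip the flip and compute cross-correlation instead. Since the kernel values are learned, the distinction does not matter in practice.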

2d convolution

Here a 3x3 kernel is swept across a 6x6 image to output a 4x4 feature map. As you can see, the dimensions of the output feature map are smaller than the input image. So there are a few concepts used in convolution to control the dimensions of the output feature map. These include padding, stride, and kernel size.

Padding is the addition of extra rows and columns (usually filled with zeros) around the input, either to keep the output dimensions the same as the input dimensions or to vary them.

Stride refers to the jump the kernel takes during the sweep, both in columns and rows. In the example above, the stride of the convolution is 1 as the kernel is moving one unit in both rows and columns.

Kernel size refers to the dimensions of the kernel used. Changing the dimensions of the kernel to be swept changes the output size of the feature map.

The image below describes the convolution with the same kernel size but with a padding of 1 and stride of 2.

The equation that describes the relationship of stride, padding, and kernel size to input and output dimensions is as follows:
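A commonly used form of this relationship, assuming no dilation, is output = floor((input + 2 × padding − kernel size) / stride) + 1, applied separately to rows and columns. The small helper below (the name conv_output_size is ours, not part of any library) is a sketch of that formula.

```python
def conv_output_size(n_in, kernel_size, padding=0, stride=1):
    """Output length along one dimension of a convolution (no dilation)."""
    return (n_in + 2 * padding - kernel_size) // stride + 1

# 6x6 image, 3x3 kernel, no padding, stride 1
print(conv_output_size(6, 3))                       # 4
# same kernel size with a padding of 1 and a stride of 2
print(conv_output_size(6, 3, padding=1, stride=2))  # 3
```

The first call matches the 6x6 to 4x4 example above.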

The concept of 3D convolution is just an extension of 2D convolution where both the input image and the kernel are three-dimensional.

Like 2D convolution, we sweep the three-dimensional kernel across the whole image in two mutually perpendicular dimensions, namely the rows and the columns.

We do not usually sweep the kernel across the color channels because the kernel has the same third dimension, that is the channel length, as the original image. This gives an output feature map that is two-dimensional instead of three.
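To see this in practice, here is a minimal sketch using torch.nn.functional.conv2d with random values. A single kernel that spans all three channels of a 6x6 image yields a single-channel 4x4 feature map.

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 3, 6, 6)   # batch of 1, 3 channels, 6x6 pixels
kernel = torch.randn(1, 3, 3, 3)  # 1 filter covering all 3 channels, 3x3

# the kernel only slides over rows and columns, not over channels
feature_map = F.conv2d(image, kernel)
print(feature_map.shape)  # torch.Size([1, 1, 4, 4])
```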

To learn more about the details of 3D convolution, you can read this article.

What is Transfer Learning?

In transfer learning, you take a machine or deep learning model that is pre-trained on a previous dataset and use it to solve a different problem without needing to re-train the whole model.

Instead, you can just use the weights and biases of the pre-trained model to make a prediction. You transfer the weights from one model to your own model and adjust them to your own dataset without re-training all the previous layers of the architecture.

We use transfer learning in the applications of convolutional neural networks and natural language processing because it decreases the computation time and complexity of the training process. And, in many cases, it performs surprisingly well.

This also helps in cases where we have limited data available – since neural networks demand an extremely large amount of data to achieve good performance.

This means that using transfer learning methods can greatly reduce the demand for data since the weights and biases are pre-adjusted and are able to work better with just a small amount of data by tweaking the weights and biases a little.

But transfer learning models do not always give you great performance (although the newer architectures perform efficiently on almost every problem). Still, sometimes the problem at hand needs an architecture that is pre-trained on data that's similar to what you have. This factor depends upon the complexity of the problem you are trying to solve.

There are a couple of ways you can perform transfer learning:

Using a pre-trained model.
Developing a new model.

You can use a pre-trained model in two ways. First, you can use the pre-trained weights and biases as initial parameters for your own model, and then train a whole convolutional model using those weights.

The other way is to perform feature extraction from the pre-trained model. You use the parameters of the pre-trained model to extract features from your input image and just train a simple classifier on top of it.

Another option is that if you have a problem with a small amount of data, you develop another model for a similar problem that has a large amount of data and train the model. Then you can use the trained weights from the new model to solve the original problem with less data.

In this tutorial, we will be using a pre-trained model as a feature extractor and we'll train a simple classifier on top of it to output the prediction.

There are many well-known architectures in the field of deep learning that are nowadays used for transfer learning. Almost all of them are pre-trained on ImageNet, one of the largest openly available image datasets. The full dataset has around fourteen million images, and the widely used ILSVRC subset covers 1,000 classes with roughly 1.2 million training images.

LeNet, proposed in 1998, is one of the earliest convolutional architectures, although it predates ImageNet. Well-known pre-trained models include AlexNet, VGG, GoogLeNet (Inception), ResNet, and Xception.

EfficientNet, proposed in 2019, is a more recent addition to this series.

What is EfficientNet?

EfficientNet (or perhaps it's better to say EfficientNets) is a family of convolutional neural network-based image classification models. They perform extremely well on ImageNet and on other popular datasets such as CIFAR-100 and Flowers.

In addition to performing so well, the architecture is smaller and faster than many earlier models of comparable accuracy. The variants range from EfficientNet-B0 up to EfficientNet-B7.

The variants B1 to B7 are obtained by applying the compound scaling method to the B0 baseline. EfficientNet-B7 reached a Top-1 accuracy of 84.4% on the ImageNet dataset, which the paper reported as state of the art at the time of publication.

If you want to learn more about how EfficientNets work, you can read this paper ‘Efficientnet: Rethinking Model Scaling for Convolutional Neural Networks.’

In the coding tutorial further along in this article, we'll be using the EfficientNet-B0 as a feature extractor and a classifier on top of it to classify COVID-19 using chest x-ray images.

An Introduction to PyTorch

PyTorch is a Python library that helps us build deep learning models. Compared to Keras (another deep learning library), PyTorch is more flexible and gives the developer more control.

It is similar to NumPy in how it processes arrays, but it adds GPU acceleration. To learn more about NumPy and its features, you can check out this in-depth guide along with its documentation.

PyTorch has a data structure known as a ‘Tensor’ that is similar to the NumPy ndarray but it has the option to operate on GPU.

PyTorch provides an uncomplicated way to switch computation between a CPU and a GPU. It also supports processing on NumPy arrays by simply providing a built-in module that can convert NumPy arrays into Tensors and vice versa.
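As a quick illustration, the sketch below converts a NumPy array to a tensor, moves it to whichever device is available, and converts the result back. The variable names are ours.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])

t = torch.from_numpy(arr)  # NumPy array -> tensor (shares memory on CPU)
back = t.numpy()           # tensor -> NumPy array

# pick the GPU if one is available, otherwise stay on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
t_dev = t.to(device)

# a tensor must be moved back to the CPU before converting it to NumPy
result = (t_dev * 2).cpu().numpy()
print(result)
```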

One of the handiest features in PyTorch is automatic differentiation (autograd). It tracks the operations applied to a tensor during the forward pass so that gradients can be computed automatically, without you needing to derive and store them by hand.

This gives you greater control of your deep learning operations, specifically back propagation, during the training process. This is helpful when computing the loss function which lets you adjust the parameters of a model.

We can also exclude a tensor from gradient computation for the entire process by setting its requires_grad attribute to False. To learn more about tensors and how to perform gradient computations in PyTorch, you can check out this tutorial and this course.
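Here is a minimal sketch of gradient tracking with toy values. The tensor x tracks gradients, while w does not.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)   # gradient will be tracked
w = torch.tensor(2.0, requires_grad=False)  # gradient will not be tracked

y = w * x ** 2  # forward pass
y.backward()    # backward pass computes dy/dx = 2 * w * x

print(x.grad)  # tensor(12.)
print(w.grad)  # None
```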

How to Implement a COVID-19 Classifier using EfficientNet with PyTorch

Now let's move on to the practical implementation of EfficientNet in PyTorch. We will use the B0 variant of the EfficientNet family.

First, we'll examine the data and preprocess it. Kaggle has a vast library of datasets available for use in projects and research. You are not limited to one particular dataset for this project: you can use any dataset containing chest X-ray images of COVID-19 patients and of people without COVID.

For the sake of this tutorial, we'll use this dataset here. But for the code to work on your custom dataset, you must divide your data into three directories: train, test, and valid.

Each directory should contain two more directories with the labels covid and normal. These covid and normal folders will contain the images corresponding to the specific class of the directory they are present in.

The original dataset we'll use in this article contains three folders: covid, normal, and pneumonia. We discard the pneumonia folder completely and divide the other data in the same way described above.

We do this to create a logical division between the data used for training and the data used for testing and validation. Also, torchvision's ImageFolder dataset, which we use below, takes the name of the folder an image is stored in as its class label, so we do not need a separate label file for the input dataset.

The data and the architecture

Let's have a look at the data. Below we can see the x-ray images of patients with COVID-19:

And here we can see the normal category’s x-ray images:

There are 237 total layers in the B0 architecture. The whole architecture can be condensed into the following diagrams. We provide the x-ray data to the input layer.

We will freeze the learning of the weights across all these blocks as we will be using the pre-trained weights to extract the features from our own input.

We'll do the feature extraction after the input passes Module 7. We then transfer the feature map obtained from Module 7 to our own final classification layers (this is why it's called transfer learning). We top the architecture with the following top layers:

BatchNorm1d()
Linear(output neurons = 512)
ReLU()
BatchNorm1d()
Linear(output neurons = 128)
ReLU()
BatchNorm1d()
Dropout(probability of zeroing an activation = 0.4)
Linear(output neurons = 2)

Let's head over to the code

Now before we start the code, there are a couple of dependencies we need to install. First, you'll need to install PyTorch on your local machine. You can do this using the pip install command in your Python environment. Refer here to install it depending on your machine (whether it has GPU available or not).

Before you move on to the code, I strongly recommend that you actually work through the code yourself. This makes it much easier to understand. With that said, you can access the full code in a Jupyter notebook here.

You also need to install EfficientNet support for PyTorch into the same Python environment. Run the command below to install it:

pip install efficientnet_pytorch

Apart from this you will need to import some other dependencies at the start of the code.

Now we start building the classification model. To start, we import all the necessary modules:

#importing required modules
import gdown
import zipfile
import numpy as np
from glob import glob
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torchsummary import summary
from torchvision import datasets, transforms as T
from efficientnet_pytorch import EfficientNet
import os
import torch.optim as optim
from PIL import ImageFile
from sklearn.metrics import accuracy_score

All these modules are essential to perform multiple functions across the model. You can install all the absent modules using the pip command.

Then we download and extract the data we prepared for the model:

#importing data
#Dataset address
url = 'https://drive.google.com/uc?export=download&id=1B75cOYH7VCaiqdeQYvMuUuy_Mn_5tPMY'
output = 'data.zip'
gdown.download(url, output, quiet=False)
#giving zip file name
data_dir='./data.zip'
#Extracting data from zip file
with zipfile.ZipFile(data_dir, 'r') as zf:
    zf.extractall('./data/')

The gdown.download function downloads the data from the URL provided, and ZipFile.extractall extracts it into a data folder inside your current working directory (or the current runtime if you are working on Google Colab).

I highly recommend working on Google Colab for this project in case you do not locally have a GPU available.

Next, create a check variable to check the availability of a GPU.

#Checking the availability of a GPU
use_cuda = torch.cuda.is_available()

This function returns True if a GPU is available and False if not.

Next, we need to apply pre-processing techniques to the data. Since our data is pre-augmented, we do not need to apply many pre-processing techniques to it. We only resize all the images to a single size of (224,224). We do this because the images in our dataset are all of different dimensions and we need a consistent dimension for the model.

We'll also convert the images to tensors to be processed by PyTorch and then we normalize all the images. This normalize function normalizes all the images with a mean and standard deviation of 0.5.

After that, we create the locations for the train, test and validation sets which will be given as input to the ‘datasets’ module. We do this so that the PyTorch model knows exactly where the data is located and also so that that data can be loaded to the GPU. We keep a batch size of 32.

#declaring batch size
batch_size = 32

#applying required transformations on the dataset
img_transforms = {
    'train':
    T.Compose([
        T.Resize(size=(224,224)), 
        T.ToTensor(),
        T.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]), 
        ]),

    'valid':
    T.Compose([
        T.Resize(size=(224,224)),
        T.ToTensor(),
        T.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
        ]),

    'test':
    T.Compose([
        T.Resize(size=(224,224)),
        T.ToTensor(),
        T.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
        ]),
     }

# creating Location of data: train, validation, test
data='./data/'

train_path=os.path.join(data,'train')
valid_path=os.path.join(data,'valid')
test_path=os.path.join(data,'test')


# creating a dataset for each of the folders defined above
train_file=datasets.ImageFolder(train_path,transform=img_transforms['train'])
valid_file=datasets.ImageFolder(valid_path,transform=img_transforms['valid'])
test_file=datasets.ImageFolder(test_path,transform=img_transforms['test'])


#Creating loaders for the dataset
loaders_transfer={
    'train':torch.utils.data.DataLoader(train_file,batch_size,shuffle=True),
    'valid':torch.utils.data.DataLoader(valid_file,batch_size,shuffle=True),
    'test': torch.utils.data.DataLoader(test_file,batch_size,shuffle=True)
}

After pre-processing, we move on to building the model.

#importing the pretrained EfficientNet model

model_transfer = EfficientNet.from_pretrained('efficientnet-b0')

# Freeze weights
for param in model_transfer.parameters():
    param.requires_grad = False
in_features = model_transfer._fc.in_features


# Defining Dense top layers after the convolutional layers
model_transfer._fc = nn.Sequential(
    nn.BatchNorm1d(num_features=in_features),    
    nn.Linear(in_features, 512),
    nn.ReLU(),
    nn.BatchNorm1d(512),
    nn.Linear(512, 128),
    nn.ReLU(),
    nn.BatchNorm1d(num_features=128),
    nn.Dropout(0.4),
    nn.Linear(128, 2),
    )
if use_cuda:
    model_transfer = model_transfer.cuda()

First, we import the EfficientNet-B0 model with its pre-trained weights. Next, we disable the training of the parameters of the model because we are going to use the pre-trained parameters to extract features from our data.

Then we replace the top fully connected layers of the model with our own classifier.

BatchNorm1d normalizes each feature across the batch, and its argument is the number of features (neurons) coming from the previous layer. This stabilizes and speeds up training and has a mild regularizing effect. Dropout is a regularization technique that helps prevent overfitting: during training it zeroes out each activation with the probability given as an argument.

The Linear layer is a simple fully-connected neural network layer.

Finally, we transfer our model to the GPU, if available.
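One way to confirm that the freeze worked is to count trainable and frozen parameters. The sketch below uses a small hypothetical stand-in for the pre-trained backbone so that it runs without downloading any weights. The same two sums can be applied to model_transfer.

```python
import torch.nn as nn

# hypothetical stand-in for a pre-trained backbone
backbone = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
for param in backbone.parameters():
    param.requires_grad = False  # freeze the backbone

# new classifier head, trainable by default
head = nn.Linear(8, 2)
model = nn.Sequential(backbone, head)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(trainable, frozen)  # 18 136
```

Only the head's 18 parameters will be updated by the optimizer, because frozen parameters never receive gradients.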

# selecting loss function
criterion_transfer = nn.CrossEntropyLoss()

#using the Adam optimizer
optimizer_transfer = optim.Adam(model_transfer.parameters(), lr=0.0005)

Here, we select the loss function and the optimizer for our training phase. We also define the value of the learning rate for the optimizer. You can change this value to see how different learning rates influence the model in different ways.

Next, we move on to the training of the model.

ImageFile.LOAD_TRUNCATED_IMAGES = True

# Creating the function for training
def train(n_epochs, loaders, model, optimizer, criterion, use_cuda, save_path):
    """returns trained model"""
    # initialize tracker for minimum validation loss
    valid_loss_min = np.inf  # np.Inf was removed in NumPy 2.0
    trainingloss = []
    validationloss = []

    for epoch in range(1, n_epochs+1):
        # initialize the variables to monitor training and validation loss
        train_loss = 0.0
        valid_loss = 0.0

        ###################
        # training the model #
        ###################
        model.train()
        for batch_idx, (data, target) in enumerate(loaders['train']):
            # move to GPU
            if use_cuda:
                data, target = data.cuda(), target.cuda()

            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            # running mean of the batch losses (item() returns a plain Python float)
            train_loss = train_loss + ((1 / (batch_idx + 1)) * (loss.item() - train_loss))

        ######################    
        # validating the model #
        ######################
        model.eval()
        # gradients are not needed for validation, so disable tracking
        with torch.no_grad():
            for batch_idx, (data, target) in enumerate(loaders['valid']):
                if use_cuda:
                    data, target = data.cuda(), target.cuda()

                output = model(data)
                loss = criterion(output, target)
                valid_loss = valid_loss + ((1 / (batch_idx + 1)) * (loss.item() - valid_loss))

        # train_loss and valid_loss are already running means over the batches,
        # so they are not divided by the dataset size again

        trainingloss.append(train_loss)
        validationloss.append(valid_loss)

        # printing training/validation statistics 
        print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(
            epoch, 
            train_loss,
            valid_loss
            ))

        ## saving the model if validation loss has decreased
        if valid_loss < valid_loss_min:
            torch.save(model.state_dict(), save_path)

            valid_loss_min = valid_loss

    # return trained model
    return model, trainingloss, validationloss

We create a function for the training and validation phases of the model. Setting ImageFile.LOAD_TRUNCATED_IMAGES to True lets PIL load image files that are truncated instead of raising an error. We initialize the values of the train and validation losses and start the training loop. We load the data batch by batch from the data loaders and perform the training operations.

After the training loop, we start the validation loop where we only compute the loss and the output predictions and do not update the parameters as we did in the training loop. We save the model which has the minimum loss for the validation set.

# training the model

n_epochs=10

model_transfer, train_loss, valid_loss = train(n_epochs, loaders_transfer, model_transfer, optimizer_transfer, criterion_transfer, use_cuda, 'model.pt')

We run the model for 10 epochs, that is, 10 passes over the training data. You can change the number of epochs and see how the loss values change. The best checkpoint is saved under the name model.pt. Now we load the model and move on to the testing phase.
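Loading a saved checkpoint generally means rebuilding the same architecture and calling load_state_dict. The sketch below shows the pattern on a tiny hypothetical model (make_model is our own helper). For the classifier above, the equivalent call would be model_transfer.load_state_dict(torch.load('model.pt')).

```python
import os
import tempfile

import torch
import torch.nn as nn

def make_model():
    # hypothetical stand-in for the real architecture
    return nn.Sequential(nn.Linear(4, 2))

trained = make_model()
path = os.path.join(tempfile.mkdtemp(), 'model.pt')
torch.save(trained.state_dict(), path)  # save only the parameters

restored = make_model()  # same architecture, fresh weights
restored.load_state_dict(torch.load(path, map_location='cpu'))
restored.eval()  # switch to evaluation mode before testing

same = all(
    torch.equal(a, b)
    for a, b in zip(trained.state_dict().values(), restored.state_dict().values())
)
print(same)  # True
```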

# Defining the test function

def test(loaders, model, criterion, use_cuda):

    # monitoring test loss and accuracy
    test_loss = 0.
    correct = 0.
    total = 0.
    preds = []
    targets = []

    model.eval()
    for batch_idx, (data, target) in enumerate(loaders['test']):
        # moving to GPU
        if use_cuda:
            data, target = data.cuda(), target.cuda()
        # forward pass
        output = model(data)
        # calculate the loss
        loss = criterion(output, target)
        # updating average test loss 
        test_loss = test_loss + ((1 / (batch_idx + 1)) * (loss.data - test_loss))
        # converting the output probabilities to predicted class
        pred = output.data.max(1, keepdim=True)[1]
        preds.append(pred)
        targets.append(target)
        # compare predictions
        correct += np.sum(np.squeeze(pred.eq(target.data.view_as(pred))).cpu().numpy())
        total += data.size(0)

    return preds, targets

# calling test function
preds, targets = test(loaders_transfer, model_transfer, criterion_transfer, use_cuda)

We now create a test function to apply our model to our test dataset and evaluate its performance.

We pass the dataset batch by batch as we did in the training and validation phases, but we only do it once here instead of for 10 epochs. This is because we just have to test the model and not update the parameters.

The function returns the predictions it computed for the input test set and also the original target values of the test set.

Now we compute the accuracy of the model. First, we need to convert the tensors, that is predictions and targets, into NumPy arrays. We do this by first moving them from the GPU to the CPU and then converting them to NumPy arrays. The following code does this:

#converting the lists of batch tensors to flat NumPy arrays for metric functions

# join the batches, move them to the CPU, and flatten to 1-D
preds2 = torch.cat(preds).cpu().numpy().ravel()
targets2 = torch.cat(targets).cpu().numpy().ravel()

Now we compute the accuracy using the accuracy metric of the sklearn library.

#Computing the accuracy
acc = accuracy_score(targets2, preds2)
print("Accuracy: ", acc)

Our model had an accuracy of 95.45%.

The next image is the confusion matrix for the test run of the classifier. In it, you can see the visual of the model’s performance. The actual labels indicate whether the person had COVID or not, while the predicted labels indicate how our model classified the images.

As we can see, our model predicted most of the labels correctly. The small portion of wrongly predicted labels include 7 people who did not have COVID, but our model predicted they did. This is not too alarming.

On the other hand, there were 14 examples where our model predicted that they did not have COVID, but they did. In machine learning, these are called false negatives. This is a very alarming situation because we would've sent home people suffering from COVID-19. This would increase their risk that the disease would get worse.
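Counts like these can be read off a confusion matrix computed with scikit-learn. The sketch below uses a handful of made-up labels, not the results above, and assumes that class 0 is covid and class 1 is normal (ImageFolder assigns indices in alphabetical order of the folder names).

```python
from sklearn.metrics import confusion_matrix

# hypothetical labels: 0 = covid, 1 = normal
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

false_negatives = cm[0, 1]  # had COVID, predicted normal
false_positives = cm[1, 0]  # did not have COVID, predicted COVID
print(false_negatives, false_positives)  # 1 1
```

With the real results you would pass targets2 and preds2 instead of the toy lists.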

Conclusion

Convolutional neural networks have proved extremely useful in computer vision techniques, and we can also use them efficiently in medical imaging and diagnosis.

Transfer learning is an effective method for using pre-trained architectures to perform efficiently in other applications.

But as we saw above, using these models depends upon what kind of problem we have and what our objectives are. Just like in the detection of COVID-19, we would prefer to have a model that gives us 0 false negatives. But there's still great potential for deep learning to be useful in COVID diagnosis as well as other medical diagnosis techniques.

Thanks for reading! If you enjoyed the article and would like to read more interesting articles around computer science, Python and JavaScript, please follow me on Twitter.

PyTorch Tensor Methods – How to Create Tensors in Python

freeCodeCamp — Thu, 03 Dec 2020 17:00:26 +0000

By Srijan

PyTorch is an open-source Python-based library. It provides high flexibility and speed while building, training, and deploying deep learning models.

At its core, PyTorch involves operations involving tensors. A tensor is a number, vector, matrix, or any n-dimensional array.

In this article, we will see different ways of creating tensors using PyTorch tensor methods (functions).

Topics we'll cover

tensor
zeros
ones
full
arange
linspace
rand
randint
eye
complex

The tensor() method

This method returns a tensor when data is passed to it. data can be a scalar, tuple, a list or a NumPy array.
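A minimal example, passing a NumPy array created with np.arange() (the values are arbitrary):

```python
import numpy as np
import torch

data = np.arange(1, 6)  # array([1, 2, 3, 4, 5])
t = torch.tensor(data)  # copies the data into a new tensor

print(t.tolist())  # [1, 2, 3, 4, 5]
print(t.dim())     # 1
```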

In the above example, a NumPy array that was created using np.arange() was passed to the tensor() method, resulting in a 1-D tensor.

We can create a multi-dimensional tensor by passing a tuple of tuples, a list of lists, or a multi-dimensional NumPy array.

When an empty tuple or list is passed into tensor(), it creates an empty tensor.
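Both cases can be sketched as follows, using arbitrary values:

```python
import torch

# a list of lists gives a 2-D tensor
m = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(m.shape)  # torch.Size([2, 3])

# an empty list gives an empty tensor
e = torch.tensor([])
print(e.shape)  # torch.Size([0])
```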

The zeros() method

This method returns a tensor where all elements are zeros, of specified size (shape). The size can be given as a tuple or a list or neither.

We could have passed 3, 2 inside a tuple or a list as well. As you might expect, passing a negative number or a float as a dimension results in an error.

Passing an empty tuple or an empty list gives a tensor of size (dimension) 0, having 0 as its only element, whose data type is float.
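The different ways of giving the size can be sketched like this:

```python
import torch

a = torch.zeros(3, 2)    # size given as separate arguments
b = torch.zeros((3, 2))  # size given as a tuple
c = torch.zeros([3, 2])  # size given as a list
print(a.shape)  # torch.Size([3, 2])

# an empty tuple gives a zero-dimensional (scalar) tensor
s = torch.zeros(())
print(s.dim(), s.item())  # 0 0.0
```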

The ones() method

Similar to zeros(), ones() returns a tensor where all elements are 1, of specified size (shape). The size can be given as a tuple or a list or neither.

Like zeros(), passing an empty tuple or list gives a tensor of 0 dimension, having 1 as the sole element, whose data type is float.
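For example:

```python
import torch

o = torch.ones(2, 3)
print(o.shape)  # torch.Size([2, 3])

# an empty list gives a scalar tensor holding 1
s = torch.ones([])
print(s.dim(), s.item())  # 0 1.0
```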

The full() method

What if you want all the elements of a tensor to be equal to some value but not only 0 and 1? Maybe 2.9?

full() returns a tensor of a shape given by the size argument, with all its elements equal to the fill_value.

Here, we have created a tensor of shape 3, 2 with the fill_value as 3. Here again, passing an empty tuple or list creates a scalar tensor of zero dimension.

While using full, it is necessary to give size as a tuple or a list.
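A short sketch of full() with a few arbitrary fill values:

```python
import torch

f = torch.full((3, 2), 3)    # shape (3, 2), every element equal to 3
g = torch.full((2, 2), 2.9)  # the fill value can also be a float
s = torch.full((), 7.0)      # an empty tuple gives a scalar tensor

print(f.shape)  # torch.Size([3, 2])
print(s.dim())  # 0
```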

The arange() method

This method returns a 1-D tensor, with elements from start (inclusive) to end (exclusive) with a common difference step. The default value for start is 0 while that for step is 1.

The elements of the tensor can be said to be in Arithmetic Progression, with step as common difference.

Here, we created a tensor which starts from 2 and goes until 20 with a step (common difference) of 2.

All three parameters, start, end and step, can be positive or negative, and they can be integers or floats.

While choosing start, end, and step, we need to ensure that start and end are consistent with the step sign.

Since step is set as -2, there is no way -42 can reach -22 (exclusive). Hence, it gives an error.
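The cases discussed above can be sketched as follows. The last call is expected to fail, so it is wrapped in a try block.

```python
import torch

a = torch.arange(2, 20, 2)  # 2, 4, ..., 18 (20 is excluded)
print(a.tolist())  # [2, 4, 6, 8, 10, 12, 14, 16, 18]

d = torch.arange(5)         # start defaults to 0, step to 1
print(d.tolist())  # [0, 1, 2, 3, 4]

# a negative step works when start is greater than end
neg = torch.arange(-22, -42, -2)
print(neg.tolist()[:3])  # [-22, -24, -26]

# start and end inconsistent with the sign of step
try:
    torch.arange(-42, -22, -2)
    raised = False
except RuntimeError:
    raised = True
print(raised)  # True
```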

The linspace() method

This method returns a 1-D tensor, with elements from start (inclusive) to end (inclusive). However, unlike arange(), here, steps isn't the common difference but the number of elements to be in the tensor.

PyTorch automatically decides the common difference based on the steps given.

When this article was written, not providing a value for steps was deprecated: for backwards compatibility it created a tensor with 100 elements, and the official documentation warned that a future release would throw a runtime error. In recent PyTorch releases steps is a required argument, so always pass it explicitly.

Unlike arange(), linspace can have a start greater than end since the common difference is automatically calculated.

Since steps here is not a common difference, but the number of elements, it can only be a non-negative integer.
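For example, with arbitrary endpoints and four elements:

```python
import torch

# 4 evenly spaced elements from 1 to 10, both ends included
l = torch.linspace(1, 10, steps=4)
print(l)  # tensor([ 1.,  4.,  7., 10.])

# start can be greater than end
r = torch.linspace(10, 1, steps=4)
print(r)  # tensor([10.,  7.,  4.,  1.])
```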

The rand() method

This method returns a tensor filled with random numbers from a uniform distribution on the interval 0 (inclusive) to 1 (exclusive). The shape is given by the size argument. The size argument can be given as a tuple or list or neither.

Passing an empty tuple or list creates a scalar tensor of zero dimension.
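A small sketch follows. The values are random, so only the shapes and the value range are shown.

```python
import torch

torch.manual_seed(0)  # optional: makes the random values reproducible

r = torch.rand(2, 3)  # size given as separate arguments
print(r.shape)        # torch.Size([2, 3])

# every value lies in the interval [0, 1)
in_range = bool(((r >= 0) & (r < 1)).all())
print(in_range)       # True

s = torch.rand(())    # an empty tuple gives a scalar tensor
print(s.dim())        # 0
```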

The randint() method

This method returns a tensor filled with random integers generated uniformly between low (inclusive) and high (exclusive). The shape is given by the size argument. The default value for low is 0.

When only one int argument is passed, low gets the value 0, by default, and high gets the passed value.

The size argument only takes a tuple or a list. An empty tuple or list creates a tensor with zero dimension.
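For example (the bounds are arbitrary and the values are random, so only shapes and ranges are checked):

```python
import torch

a = torch.randint(5, (2, 3))      # low defaults to 0, high is 5
b = torch.randint(3, 10, (4,))    # integers from 3 (inclusive) to 10 (exclusive)
s = torch.randint(10, size=())    # an empty tuple gives a scalar tensor

print(a.shape)  # torch.Size([2, 3])
print(b.shape)  # torch.Size([4])
print(s.dim())  # 0
```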

The eye() method

This method returns a 2-D tensor with ones on the diagonal and zeros elsewhere. The number of rows is given by n and columns is given by m.

The default value for m is the value of n. When only n is passed, it creates a tensor in the form of an identity matrix. An identity matrix has its diagonal elements as 1 and all others as 0.
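For example:

```python
import torch

i3 = torch.eye(3)       # 3x3 identity matrix
print(i3.tolist())
# [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

rect = torch.eye(2, 3)  # 2 rows, 3 columns
print(rect.tolist())
# [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```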

The complex() method

This method returns a complex tensor with its real part equal to real and its imaginary part equal to imag. Both real and imag are tensors.

The data type of both the real and imag tensors should be either float or double.

Also, the size of both tensors, real and imag, should be the same, since the corresponding elements of the two tensors form a complex number.
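A minimal sketch with two float tensors of the same size (the values are arbitrary):

```python
import torch

real = torch.tensor([1.0, 2.0])
imag = torch.tensor([3.0, 4.0])

z = torch.complex(real, imag)
print(z)        # tensor([1.+3.j, 2.+4.j])
print(z.dtype)  # torch.complex64
```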

Conclusion

We've covered ten different ways to create tensors using PyTorch methods. You can go through the official documentation to know more about other PyTorch methods.

You can click here to go to the Jupyter notebook where you can play around with these methods.

If you want to learn more about PyTorch, check out this amazing course on freeCodeCamp's YouTube channel.

Stay safe!

Free Live Course: Deep Learning with PyTorch

Beau Carnes — Fri, 20 Nov 2020 17:16:00 +0000

Are you interested in learning about Deep Learning? We are hosting a free 6-week live course on our YouTube channel, starting Saturday, November 20th at 9:30 AM PST.

Passively watching a video is often not enough to learn a software concept. You need to be able to ask questions and build real projects. That is exactly what you will be able to do in the course “Deep Learning with PyTorch: Zero to GANs”.

This is an online course intended to provide a coding-first introduction to deep learning using the PyTorch framework. The course takes a hands-on coding-focused approach and will be taught using live interactive Jupyter notebooks, allowing students to follow along and experiment.

This course is taught by Aakash N S. He is the co-founder and CEO of Jovian.ml, a project management and collaboration platform for machine learning.

Theoretical concepts will be explained in simple terms using code. Students will receive weekly assignments, work on a project with real-world datasets and participate in a private data science competition to test their skills. Upon successful completion of the course, students will receive a certificate of completion.

This is a beginner-friendly course, and no prior knowledge of data science, machine learning or deep learning is assumed. It is preferable to have some background in the following areas:

Programming knowledge, preferably in Python
Basics of linear algebra (vectors, matrices, dot products)
Basics of calculus (differentiation, geometric interpretation of derivative)

Syllabus

The course is divided into 6 modules, and will be taught over 6 weeks via video lectures and interactive Jupyter notebooks. Each lecture will be around 2 hours long.

Module 1: PyTorch Basics - Tensors & Gradients

Introduction to Jupyter notebooks & Data Science in Python
Creating vectors, matrices & Tensors in PyTorch
Tensor operations and gradient computations
Interoperability of PyTorch with Numpy

Module 2: Linear Regression & Gradient Descent

Linear Regression from scratch using Tensor operations
Weights, biases and the mean squared error loss function
Gradient descent and model training with PyTorch Autograd
Linear Regression using PyTorch built-ins (nn.Linear, nn.functional etc.)

Module 3: Logistic Regression for Image Classification

Working with images from the MNIST dataset
Training and validation dataset creation
Softmax function and categorical cross entropy loss
Model training, evaluation and sample predictions

Module 4: Feedforward Neural Networks & GPUs

Working with cloud GPU platforms like Kaggle & Colab
Creating a multilayer neural network using nn.Module
Activation function, non-linearity and universal approximation theorem
Moving datasets and models to the GPU for faster training

Module 5a: Image Classification using Convolutional Neural Networks

Working with the 3-channel RGB images from the CIFAR10 dataset
Introduction to convolutions, kernels & feature maps
Underfitting, overfitting and techniques to improve model performance

Module 5b: Data Augmentation, Regularization and Residual Networks

Improving the dataset using data normalization and data augmentation
Improving the model using residual connections and batch normalization
Improving the training loop using learning rate annealing, weight decay and gradient clipping
Training a state of the art image classifier from scratch in 10 minutes

Module 6: Image Generation using Generative Adversarial Networks (GANs)

Introduction to generative modeling and application of GANs
Creating generator and discriminator neural networks
Generating and evaluating fake images of handwritten digits
Training the generator and discriminator in tandem and visualizing results

Exercises & Assignments

Weekly Assignments

Week 1: Linear Regression
Week 2: Image Classification
Week 3: Feedforward neural networks

Course Project

For the course project, students will create an image classification model using Convolutional neural networks, on a real-world dataset of their choice. The project will allow students to experiment with different types of models and regularization techniques. Students will also present their work at the end of the course and publish a blog post describing their approach and results.

Certificate of Completion

Students who attend at least 5 out of 6 video lectures and make valid submissions for all assignments will be eligible to receive a Certificate of Completion by Jovian.ml. Selected projects will also receive a Best Project Award based on evaluation criteria determined by the instructors.

You can sign up for the course here: http://zerotogans.com/

Whether or not you sign up, you can watch the course on the freeCodeCamp.org YouTube channel.

Deep Learning Frameworks Compared: MxNet vs TensorFlow vs DL4j vs PyTorch

Manish Shivanandhan — Tue, 29 Sep 2020 15:22:13 +0000

It's a great time to be a deep learning engineer. In this article, we will go through some of the popular deep learning frameworks like Tensorflow and CNTK so you can choose which one is best for your project.

Deep Learning is a branch of Machine Learning. Though machine learning has various algorithms, the most powerful are neural networks.

Deep learning is the technique of building complex multi-layered neural networks. This helps us solve tough problems like image recognition, language translation, self-driving car technology, and more.

There are tons of real-world applications of deep learning from self-driving Tesla cars to AI assistants like Siri. To build these neural networks, we use different frameworks like Tensorflow, CNTK, and MxNet.

If you are new to deep learning, start here for a good overview.

Frameworks

Without the right framework, constructing quality neural networks can be hard. With the right framework, you only have to worry about getting your hands on the right data.

That doesn’t imply that knowledge of the deep learning frameworks alone is enough to make you a successful data scientist.

You need a strong foundation of the fundamental concepts to be a successful deep learning engineer. But the right framework will make your life easier.

Also, not all programming languages have their own machine learning / deep learning frameworks. This is because not all programming languages have the capacity to handle machine learning problems.

Languages like Python stand out among others due to their complex data processing capability.

Let's go through some of the popular deep learning frameworks in use today. Each one comes with its own set of advantages and limitations. It is important to have at least a basic understanding of these frameworks so you can choose the right one for your organization or project.

TensorFlow

TensorFlow is the most famous deep learning library around. If you are a data scientist, you probably started with Tensorflow. It is one of the most efficient open-source libraries to work with.

Google built TensorFlow to use as an internal deep learning tool before open-sourcing it. TensorFlow powers a lot of useful applications including Uber, Dropbox, and Airbnb.

Advantages of Tensorflow

User Friendly. Easy to learn if you are familiar with Python.
Tensorboard for monitoring and visualization. It is a great tool if you want to see your deep learning models in action.
Community support. Expert engineers from Google and other companies improve TensorFlow almost on a daily basis.
You can use TensorFlow Lite to run TensorFlow models on mobile devices.
Tensorflow.js lets you run real-time deep learning models in the browser using JavaScript.

Limitations of Tensorflow

TensorFlow is a bit slow compared to frameworks like MxNet and CNTK.
Debugging can be challenging.
No support for OpenCL.

Apache MXNet

MXNet is another popular Deep Learning framework. An Apache Software Foundation project, MXNet supports a wide range of languages like JavaScript, Python, and C++. MXNet is also supported by Amazon Web Services to build deep learning models.

MXNet is a computationally efficient framework used in business as well as in academia.

Advantages of Apache MXNet

Efficient, scalable, and fast.
Supported by all major platforms.
Provides GPU support, along with multi-GPU mode.
Support for programming languages like Scala, R, Python, C++, and JavaScript.
Easy model serving and high-performance API.

Disadvantages of Apache MXNet

Compared to TensorFlow, MXNet has a smaller open source community.
Improvements, bug fixes, and other features take longer due to a lack of major community support.
Despite being widely used by many organizations in the tech industry, MxNet is not as popular as Tensorflow.

Microsoft CNTK

Large companies usually use Microsoft Cognitive Toolkit (CNTK) to build deep learning models.

Though created by Microsoft, CNTK is an open-source framework. It illustrates neural networks in the form of directed graphs by using a sequence of computational steps.

CNTK is written using C++, but it supports various languages like C#, Python, C++, and Java.

Microsoft’s backing is an advantage for CNTK since Windows is the preferred operating system for enterprises. CNTK is also heavily used in the Microsoft ecosystem.

Popular products that use CNTK are Xbox, Cortana, and Skype.

Advantages of Microsoft CNTK

Offers reliable and excellent performance.
The scalability of CNTK has made it a popular choice in many enterprises.
Has numerous optimized components.
Easy to integrate with Apache Spark, an analytics engine for data processing.
Works well with Azure Cloud, both being backed by Microsoft.
Resource usage and management are efficient.

Disadvantages of Microsoft CNTK

Minimal community support compared to Tensorflow, but has a dedicated team of Microsoft engineers working full time on it.
Significant learning curve.

PyTorch

PyTorch is another popular deep learning framework. Facebook developed Pytorch in its AI research lab (FAIR). Pytorch has been giving tough competition to Google’s Tensorflow.

Pytorch supports both Python and C++ to build deep learning models. Released three years ago, it's already being used by companies like Salesforce, Facebook, and Twitter.

Image Recognition, Natural Language Processing, and Reinforcement Learning are some of the many areas in which PyTorch shines. It is also used in research by universities like Oxford and organizations like IBM.

PyTorch is also a great choice for creating computational graphs. It also supports cloud software development and offers useful features, tools, and libraries. And it works well with cloud platforms like AWS and Azure.

Advantages of PyTorch

User-friendly design and structure that makes constructing deep learning models transparent.
Has useful debugging tools like PyCharm debugger.
Contains many pre-trained models and supports distributed training.

Disadvantages of PyTorch

Does not have interfaces for monitoring and visualization like TensorFlow.
Comparatively, PyTorch is a new deep learning framework and currently has less community support.

DeepLearning4j

DeepLearning4j is an excellent framework if your main programming language is Java. It is a commercial-grade, open-source, distributed deep-learning library.

Deeplearning4j supports all major types of neural network architectures like RNNs and CNNs.

Deeplearning4j is written for Java and Scala. It also integrates well with Hadoop and Apache Spark. Deeplearning4j also has support for GPUs, making it a great choice for Java-based deep learning solutions.

Advantages of DeepLearning4j

Scalable and can easily process large amounts of data.
Easy integration with Apache Spark.
Excellent community support and documentation.

Disadvantages of DeepLearning4j

Limited to the Java programming language.
Relatively less popular compared to Tensorflow and PyTorch.

Conclusion

Each framework comes with its list of pros and cons. But choosing the right framework is crucial to the success of a project.

You have to consider various factors like security, scalability, and performance. For enterprise-grade solutions, reliability becomes another primary contributing factor.

If you are just getting started, begin with Tensorflow. If you are building a Windows-based enterprise product, choose CNTK. If you prefer Java, choose DL4J.

I hope this article helps you choose the right deep learning framework for your next project. If you have any questions, reach out to me.

Loved this article? Join my Newsletter and get a summary of my articles and videos every Monday.

How to Build a Neural Network from Scratch with PyTorch

freeCodeCamp — Tue, 15 Sep 2020 19:31:05 +0000

By Bipin Krishnan P

In this article, we'll be going under the hood of neural networks to learn how to build one from the ground up.

The one thing that excites me the most in deep learning is tinkering with code to build something from scratch. It's not an easy task, though, and teaching someone else how to do so is even more difficult.

I've been working my way through the Fast.ai course and this blog is greatly inspired by my experience.

Without any further delay let's start our wonderful journey of demystifying neural networks.

How does a neural network work?

Let's start by understanding the high level workings of neural networks.

A neural network takes in a data set and outputs a prediction. It's as simple as that.

How a neural network works

Let me give you an example.

Let's say that one of your friends (who is not a great football fan) points at an old picture of a famous footballer – say Lionel Messi – and asks you about him.

You will be able to identify the footballer in a second. The reason is that you have seen his pictures a thousand times before. So you can identify him even if the picture is old or was taken in dim light.

But what happens if I show you a picture of a famous baseball player (and you have never seen a single baseball game before)? You will not be able to recognize that player. In that case, even if the picture is clear and bright, you won't know who it is.

This is the same principle used for neural networks. If our goal is to build a neural network to recognize cats and dogs, we just show the neural network a bunch of pictures of dogs and cats.

More specifically, we show the neural network pictures of dogs and then tell it that these are dogs. And then show it pictures of cats, and identify those as cats.

Once we train our neural network with images of cats and dogs, it can easily classify whether an image contains a cat or a dog. In short, it can recognize a cat from a dog.

But if you show our neural network a picture of a horse or an eagle, it will never identify it as a horse or an eagle. This is because we have never shown it pictures of those animals.

If you wish to improve the capability of the neural network, then all you have to do is show it pictures of all the animals that you want the neural network to classify. As of now, all it knows is cats and dogs and nothing else.

The data set we use for our training heavily depends on the problem on our hands. If you wish to classify whether a tweet has a positive or negative sentiment, then probably, you will want a data set containing a lot of tweets with their corresponding label as either positive or negative.

Now that you have a high-level overview of data sets and how a neural network learns from that data, let's dive deeper into how neural networks work.

Understanding neural networks

We will be building a neural network to classify the digits three and seven from an image.

But before we build our neural network, we need to go deeper to understand how they work.

Every image that we pass to our neural network is just a bunch of numbers. That is, each of our images has a size of 28×28 which means it has 28 rows and 28 columns, just like a matrix.

We see each of the digits as a complete image, but to a neural network, it is just a bunch of numbers ranging from 0 to 255.

Here is a pixel representation of the digit five:

Pixel values along with shades

As you can see above, we have 28 rows and 28 columns (the index starts from 0 and ends at 27) just like a matrix. Neural networks only see these 28×28 matrices.

To show some more details, I've just shown the shade along with the pixel values. If you look closer into the image, you can see that the pixel values close to 255 are darker whereas the values closer to 0 are lighter in shade.

In PyTorch we don't use the term matrix. Instead, we use the term tensor. Every number in PyTorch is represented as a tensor. So, from now on, we will use the term tensor instead of matrix.
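To make the terminology concrete, here is a minimal sketch that builds a few tensors by hand. The values are made up for illustration; only the shapes matter.

```python
import torch

# A hand-written 2x3 "matrix" stored as a PyTorch tensor (arbitrary values).
small = torch.tensor([[0, 128, 255],
                      [64, 32, 16]])
print(small.shape)

# A blank 28x28 image: the same layout as one MNIST digit, filled with zeros.
blank_image = torch.zeros(28, 28)
print(blank_image.shape)

# A single number is also a tensor, with zero dimensions.
scalar = torch.tensor(7)
print(scalar.ndim)
```

The first two prints show the row and column counts, and the last one shows that a plain number has no dimensions at all.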

Visualizing a neural network

A neural network can have any number of neurons and layers.

This is how a neural network looks:

Artificial neural network

Don't get confused by the Greek letters in the picture. I will break it down for you:

Take the case of predicting whether a patient will survive or not based on a data set containing the name of the patient, temperature, blood pressure, heart condition, monthly salary, and age.

In our data set, only the temperature, blood pressure, heart condition, and age have significant importance for predicting whether the patient will survive or not. So we will assign a higher weight value to these values in order to show higher importance.

But features like the name of the patient and monthly salary have little or no influence on the patient's survival rate. So we assign smaller weight values to these features to show less importance.

In the above figure, x1, x2, x3...xn are the features in our data set which may be pixel values in the case of image data or features like blood pressure or heart condition as in the above example.

The feature values are multiplied by the corresponding weight values referred to as w1j, w2j, w3j...wnj. The multiplied values are summed together and passed to the next layer.

The optimum weight values are learned during the training of the neural network. The weight values are updated continuously in such a way as to maximize the number of correct predictions.
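The weighted sum described above can be sketched in a few lines of plain Python. The feature values and weights below are hypothetical, hand-picked numbers, not taken from any real data set. In a real network the weights would be learned during training.

```python
# Hypothetical, already-scaled feature values for one patient.
features = {"temperature": 0.8, "blood_pressure": 0.6, "monthly_salary": 0.9}

# Hand-picked weights: large for the important features, tiny for salary.
weights = {"temperature": 0.7, "blood_pressure": 0.5, "monthly_salary": 0.01}

# Multiply each feature by its weight and add the results together.
weighted_sum = sum(features[name] * weights[name] for name in features)

for name in features:
    print(name, round(features[name] * weights[name], 3))
print("weighted sum:", round(weighted_sum, 3))
```

Even though the salary value is the largest input, its small weight means it contributes almost nothing to the sum.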

The activation function is nothing but the sigmoid function in our case. Any value we pass to the sigmoid gets converted to a value between 0 and 1. We just put the sigmoid function on top of our neural network prediction to get a value between 0 and 1.

You will understand the importance of the sigmoid layer once we start building our neural network model.

There are a lot of other activation functions that are even simpler to learn than sigmoid.

This is the equation for a sigmoid function:

Sigmoid function
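As a quick illustration, the sketch below evaluates the standard logistic form of the sigmoid, 1 / (1 + e^(-x)), using only Python's math module. The input values are arbitrary.

```python
import math

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x))
    return 1 / (1 + math.exp(-x))

# Large negative inputs land near 0, large positive inputs near 1.
for value in (-5, 0, 5):
    print(value, round(sigmoid(value), 4))
```

Whatever number goes in, the output always stays strictly between 0 and 1, and an input of 0 maps to exactly 0.5.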

The circular-shaped nodes in the diagram are called neurons. At each layer of the neural network, the weights are multiplied with the input data.

We can increase the depth of the neural network by increasing the number of layers. We can improve the capacity of a layer by increasing the number of neurons in that layer.

Understanding our data set

The first thing we need in order to train our neural network is the data set.

Since the goal of our neural network is to classify whether an image contains the number three or seven, we need to train our neural network with images of threes and sevens. So, let's build our data set.

Luckily, we don't have to create the data set from scratch. Our data set is already present in PyTorch. All we have to do is just download it and do some basic operations on it.

We need to download a data set called MNIST (Modified National Institute of Standards and Technology) from the torchvision library of PyTorch.

Now let's dig deeper into our data set.

What is the MNIST data set?

The MNIST data set contains handwritten digits from zero to nine with their corresponding labels as shown below:

MNIST data set

So, what we do is simply feed the neural network the images of the digits and their corresponding labels which tell the neural network that this is a three or seven.

How to prepare our data set

The downloaded MNIST data set has images and their corresponding labels.

We just write the code to index out only the images with a label of three or seven. Thus, we get a data set of threes and sevens.

First, let's import all the necessary libraries.

import torch
from torchvision import datasets
import matplotlib.pyplot as plt

We import the PyTorch library for building our neural network and the torchvision library for downloading the MNIST data set, as discussed before. The Matplotlib library is used for displaying images from our data set.

Now, let's prepare our data set.

mnist = datasets.MNIST('./data', download=True)

threes = mnist.data[(mnist.targets == 3)]/255.0
sevens = mnist.data[(mnist.targets == 7)]/255.0

len(threes), len(sevens)

As we learned above, everything in PyTorch is represented as tensors. So our data set is also in the form of tensors.

We download the data set in the first line. We index out only the images whose target value is equal to 3 or 7, normalize them by dividing by 255, and store them separately.

We can check whether our indexing was done properly by running the code in the last line which gives the number of images in the threes and sevens tensor.
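The indexing step relies on boolean masks. Here is a minimal sketch of the same idea on a tiny made-up data set, so it runs without downloading MNIST. The names `images`, `labels`, and `threes_demo` are placeholders for this example only.

```python
import torch

# Stand-in for mnist.data: five tiny 2x2 "images" whose pixels equal their index.
images = torch.arange(5).view(5, 1, 1).repeat(1, 2, 2)
# Stand-in for mnist.targets: one label per image.
labels = torch.tensor([3, 7, 1, 3, 7])

# Comparing with == gives one True/False entry per image.
mask = labels == 3
# Indexing with the mask keeps only the images labelled 3; dividing scales the pixels.
threes_demo = images[mask] / 255.0

print(mask)
print(threes_demo.shape)
```

Only the images at positions 0 and 3 survive the mask, so the result holds two 2×2 images.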

Now let's check whether we've prepared our data set correctly.

def show_image(img):
  plt.imshow(img)
  plt.xticks([])
  plt.yticks([])
  plt.show()

show_image(threes[3])
show_image(sevens[8])

Using the Matplotlib library, we create a function to display the images.

Let's do a quick sanity check by printing the shape of our tensors.

print(threes.shape, sevens.shape)

If everything went right, you will get the size of threes and sevens as ([6131, 28, 28]) and ([6265, 28, 28]) respectively. This means that we have 6131 28×28 sized images for threes and 6265 28×28 sized images for sevens.

We've created two tensors with images of threes and sevens. Now we need to combine them into a single data set to feed into our neural network.

combined_data = torch.cat([threes, sevens])
combined_data.shape

We will concatenate the two tensors using PyTorch and check the shape of the combined data set.

Now we will flatten the images in the data set.

flat_imgs = combined_data.view((-1, 28*28))
flat_imgs.shape

We will flatten the images in such a way that each of the 28×28 sized images becomes a single row with 784 columns (28×28=784). Thus the shape gets converted to ([12396, 784]).

We need to create labels corresponding to the images in the combined data set.

target = torch.tensor([1]*len(threes)+[0]*len(sevens))
target.shape

We assign the label 1 for images containing a three, and the label 0 for images containing a seven.

How to train your Neural Network

To train your neural network, follow these steps.

Step 1: Building the model

Below you can see the simplest equation that shows how neural networks work:

y = Wx + b

Here, the term 'y' refers to our prediction, that is, three or seven. 'W' refers to our weight values, 'x' refers to our input image, and 'b' is the bias (which, along with the weights, helps in making predictions).

In short, we multiply each pixel value with the weight values and add them to the bias value.

The weights and bias value decide the importance of each pixel value while making predictions.

We are classifying three and seven, so we have only two classes to predict.

So, we can predict 1 if the image is three and 0 if the image is seven. The prediction we get from that step may be any real number, but we need to make our model (neural network) predict a value between 0 and 1.

This allows us to create a threshold of 0.5. That is, if the predicted value is less than 0.5 then it is a seven. Otherwise it is a three.

We use a sigmoid function to get a value between 0 and 1.

We will create a function for sigmoid using the same equation shown earlier. Then we pass in the values from the neural network into the sigmoid.

We will create a single layer neural network.

We cannot create a lot of loops to multiply each weight value with each pixel in the image, as it is very expensive. So we can use a magic trick to do the whole multiplication in one go by using matrix multiplication.

def sigmoid(x): return 1/(1+torch.exp(-x))

def simple_nn(data, weights, bias): return sigmoid((data@weights) + bias)
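To see that the matrix multiplication really replaces the loops, the sketch below computes the same weighted sums both ways on a small random example. The sizes (4 images, 6 pixels) are arbitrary and chosen only to keep the loop fast.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

data = torch.rand(4, 6)      # 4 made-up "images" with 6 pixels each
weights = torch.randn(6, 1)  # one weight per pixel, one output neuron

# Slow version: loop over every image and every pixel.
looped = torch.zeros(4, 1)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        looped[i, 0] += data[i, j] * weights[j, 0]

# Fast version: a single matrix multiplication.
vectorized = data @ weights

# Both approaches agree up to floating point rounding.
print(torch.allclose(looped, vectorized, atol=1e-5))
```

Both versions produce one value per image, but the `@` operator does the work in a single call instead of 24 separate multiplications.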

Step 2: Defining the loss

Now, we need a loss function to calculate by how much our predicted value is different from that of the ground truth.

For example, if the predicted value is 0.3 but the ground truth is 1, then our loss is very high. So our model will try to reduce this loss by updating the weights and bias so that our predictions become close to the ground truth.

We will be using mean squared error to check the loss value. Mean squared error finds the mean of the square of the difference between the predicted value and the ground truth.

def error(pred, target): return ((pred-target)**2).mean()
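Here is a small worked example of the mean squared error, reusing the same `error` function. The predictions and targets are made-up values, with the first pair matching the 0.3 versus 1 case mentioned above.

```python
import torch

def error(pred, target):
    return ((pred - target) ** 2).mean()

pred = torch.tensor([0.3, 0.9, 0.2])
target = torch.tensor([1.0, 1.0, 0.0])

# Squared differences: 0.49, 0.01, 0.04, so the mean is 0.54 / 3 = 0.18
loss = error(pred, target)
print(round(loss.item(), 4))
```

The poor prediction of 0.3 for a true value of 1 contributes 0.49 on its own, which is most of the total loss.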

Step 3: Initialize the weight values

We just randomly initialize the weights and bias. Later, we will see how these values are updated to get the best predictions.

w = torch.randn((flat_imgs.shape[1], 1), requires_grad=True)
b = torch.randn((1, 1), requires_grad=True)

The shape of the weight values should be in the following form:

(Number of neurons in the previous layer, number of neurons in the next layer)

We use a method called gradient descent to update our weights and bias to make the maximum number of correct predictions.

Our goal is to optimize or decrease our loss, so the best method is to calculate gradients.

We need to take the derivative of the loss function with respect to each and every weight and bias. Then we subtract a small multiple of this value from our weights and bias.

In this way, our weights and bias values are updated in such a way that our model makes a good prediction.

Updating a parameter for optimizing a function is not a new thing – you can optimize any arbitrary function using gradients.

We've set a special parameter (called requires_grad) to true to calculate the gradient of weights and bias.
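As a minimal sketch of what requires_grad enables, the example below lets PyTorch compute the gradient of a toy loss and then takes one update step. The loss function and the step size of 0.1 are arbitrary choices for illustration, not the ones used for our digit classifier.

```python
import torch

# A single parameter that PyTorch should track gradients for.
w = torch.tensor(0.0, requires_grad=True)

loss = (w - 3) ** 2   # toy loss with its minimum at w = 3
loss.backward()       # fills w.grad with d(loss)/dw = 2 * (w - 3)

print(w.grad)         # the gradient at w = 0 is -6

# One gradient descent step: subtract a small multiple of the gradient.
with torch.no_grad():
    w -= 0.1 * w.grad

print(w.item())       # w has moved from 0 towards the minimum at 3
```

Because the gradient is negative, subtracting it increases w, which is the direction that lowers the loss.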

Step 4: Update the weights

If our prediction does not come close to the ground truth, that means that we've made an incorrect prediction. This means that our weights are not correct. So we need to update our weights until we get good predictions.

For this purpose, we put all of the above steps inside a for loop and allow it to iterate any number of times we wish.

At each iteration, the loss is calculated and the weights and biases are updated to get a better prediction on the next iteration.

Thus our model becomes better after each iteration by finding the optimal weight values for the task at hand.

Each task requires a different set of weight values, so we can't expect our neural network trained for classifying animals to perform well on musical instrument classification.

This is what our model training looks like:

for i in range(2000):
  pred = simple_nn(flat_imgs, w, b)
  loss = error(pred, target.unsqueeze(1))
  loss.backward()

  w.data -= 0.001*w.grad.data
  b.data -= 0.001*b.grad.data

  w.grad.zero_()
  b.grad.zero_()

print("Loss: ", loss.item())

We calculate the predictions and store them in the 'pred' variable by calling the function that we created earlier. Then we calculate the mean squared error loss.

Then, we will calculate all the gradients for our weights and bias and update the value using those gradients.

We've multiplied the gradients by 0.001, which is called the learning rate. This value decides the rate at which our model will learn. If it is too low, the model will learn slowly, or in other words, the loss will be reduced slowly.

If the learning rate is too high, our model will not be stable, jumping between a wide range of loss values. This means it will fail to converge.
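The effect of the learning rate can be sketched on a much simpler problem than our neural network. The example below minimizes the toy function f(w) = w**2 in plain Python. The function, the starting point, and the three learning rates are all assumptions chosen for illustration.

```python
def descend(learning_rate, steps=20, start=1.0):
    # Minimize f(w) = w**2, whose gradient is 2 * w.
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

print(descend(0.001))  # barely moves away from 1.0: learning is very slow
print(descend(0.1))    # ends up close to the minimum at 0
print(descend(1.1))    # overshoots on every step and ends far from 0
```

With the largest learning rate, every step jumps past the minimum and lands further away than before, which is what failing to converge looks like.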

We repeat the above steps 2000 times, and each time our model tries to reduce the loss by updating the weights and bias values.

We should zero out the gradients at the end of each loop or epoch so that there is no accumulation of unwanted gradients in the memory which will affect our model's learning.
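The accumulation is easy to see on a single parameter. In this small sketch, the expression `w * 3` is just a stand-in for a loss whose gradient with respect to `w` is always 3.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w * 3).backward()
first = w.grad.item()        # 3.0

# Without zeroing, a second backward pass adds to the stored gradient.
(w * 3).backward()
accumulated = w.grad.item()  # 3.0 + 3.0 = 6.0

# After zeroing, the next backward pass starts from a clean slate.
w.grad.zero_()
(w * 3).backward()
after_reset = w.grad.item()  # 3.0 again

print(first, accumulated, after_reset)
```

If the training loop skipped the `zero_()` calls, every update would use the sum of all previous gradients rather than the current one.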

Since our model is very small, it doesn't take much time to train for 2000 epochs or iterations. After 2000 epochs, our neural network has given a loss value of 0.6805, which is not bad for such a small model.
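The loss alone doesn't tell us how many images are classified correctly. One way to check, sketched below, is to apply the 0.5 threshold to the sigmoid outputs and compare the result with the labels (1 for a three, 0 for a seven). The prediction values here are hypothetical stand-ins, not outputs of the trained model.

```python
import torch

# Hypothetical sigmoid outputs, shaped (N, 1) like the output of simple_nn.
pred = torch.tensor([[0.91], [0.12], [0.55], [0.40]])
# True labels: 1 = three, 0 = seven.
target = torch.tensor([1, 0, 0, 1])

# Values of 0.5 or more count as a three, everything else as a seven.
predicted_labels = (pred >= 0.5).long().squeeze(1)

# Fraction of predictions that match the labels.
accuracy = (predicted_labels == target).float().mean().item()
print(accuracy)
```

To try this on the real model, you could replace the hypothetical `pred` with `simple_nn(flat_imgs, w, b)` and use the `target` tensor created earlier.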

Final result

Conclusion

There is a huge space for improvement in the model that we've just created.

This is just a simple model, and you can experiment on it by increasing the number of layers, number of neurons in each layer, or increasing the number of epochs.

In short, machine learning is a whole lot of magic using math. Always learn the foundational concepts – they may be boring, but eventually you will understand that those boring math concepts created these cutting edge technologies like deepfakes.

You can get the complete code on GitHub or play with the code in Google colab.

Learn How to Use PyTorch for Deep Learning

Beau Carnes — Thu, 30 Apr 2020 20:04:33 +0000

PyTorch is an open source machine learning library for Python that facilitates building deep learning projects. We've published a 10-hour course that will take you from being a complete beginner in PyTorch to using it to code your own GANs (generative adversarial networks). You don't even have to know what a GAN is to start!

This coding-first course is approachable to people starting out with deep learning and neural networks. The course was developed by Aakash from Jovian.ml. This is a comprehensive course and it covers the following topics:

PyTorch Basics & Linear Regression
Image Classification with Logistic Regression
Training Deep Neural Networks on a GPU with PyTorch
Image Classification using Convolutional Neural Networks
Residual Networks, Data Augmentation and Regularization
Training Generative Adversarial Networks (GANs)

There is code and detailed notes to go along with each section of this course. You can access the code in Jupyter Notebooks that are provided. This allows you to try the code yourself at each step of the way.

If you have been wanting to learn more about deep learning but haven't known where to start, this is a great place to begin your journey of learning about deep learning. It will be helpful to have a basic understanding of Python before you start.

You can watch the course below or on the freeCodeCamp.org YouTube channel.

Learn to apply deep learning with PyTorch in this full course

Beau Carnes — Thu, 31 Jan 2019 18:30:59 +0000

In this complete course from Fawaz Sammani you will learn the key concepts behind deep learning and how to apply the concepts to a real-life project using PyTorch.

First, you will learn the theoretical concepts you need to know for building a chatbot, which include RNNs, LSTMs and Sequence Models with Attention.

Then you will learn about PyTorch, a very powerful and advanced deep learning Library. You will learn how to install PyTorch and how to use it.

Finally, the biggest part of the course shows how to apply the concepts to build a chatbot in PyTorch.

You can watch the tutorial on the freeCodeCamp.org YouTube channel (6 hour watch).

Building NMT from Scratch – PyTorch Replications of 7 Landmark Papers

How to Use Transformers for Real-Time Gesture Recognition

Table of Contents

Why Transformers for Gestures?

What You’ll Learn

Prerequisites

Project Setup

Generate a Gesture Dataset

Option 1: Generate a Synthetic Dataset

Training Script: train.py

What Training Looks Like

Export the Model to ONNX

Evaluate Accuracy + Latency

1. Quick Accuracy Check

2. Confusion Matrix

3. Latency Benchmark

Option 2: Use Small Samples from Public Gesture Datasets

Recommended sources

Setting up your dataset folder

Why choose this option?

Accessibility Notes & Ethical Limits

Next Steps

Conclusion

Build Your Own ViT Model from Scratch

What You'll Accomplish

Training and Optimization

Course Structure

Why This Matters Now

Learn PyTorch in Five Projects

What You'll Learn in This Course

Why Learn PyTorch?

Build a Stable Diffusion VAE From Scratch using Pytorch

Why Learn Variational Autoencoders?

What You'll Learn in This Course

Hands-On Implementation

Conclusion

PyTorch vs TensorFlow – Which is Better for Deep Learning Projects?

Understanding PyTorch and TensorFlow

PyTorch vs TensorFlow – Which One's Right for You?

Ease of Learning and Use

Performance and Scalability

Community and Support

Flexibility and Innovation

Industry Adoption

Products Using Tensorflow

Products Using PyTorch

Conclusion

Learn PyTorch for Deep Learning – Free 26-Hour Course

You can learn more about the course below the embedded video.

What is PyTorch?

Why Learn PyTorch?

What are the prerequisites?

How's the Course Taught?

What Does This Course Cover?

00 – PyTorch Fundamentals

01 – PyTorch Workflow

02 – PyTorch Neural Network Classification

03 – PyTorch Computer Vision

04 – PyTorch Custom Datasets

Can't I Learn All of this Myself?

Got another question?

Real-World Machine Learning—PyTorch and Monai for Healthcare Imaging

Deep Learning Tutorial – How to Use PyTorch and Transfer Learning to Diagnose COVID-19 Patients

What is a Convolutional Neural Network?

What is 2D Convolution?

What is Transfer Learning?

What is EfficientNet?

An Introduction to PyTorch

How to Implement a COVID-19 Classifier using EfficientNet with PyTorch

The data and the architecture

Let's head over to the code

Conclusion

PyTorch Tensor Methods – How to Create Tensors in Python

Topics we'll cover

The tensor() method

The zeros() method

The ones() method

The full() method

The arange() method

Training Script: `train.py`