<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ gradio - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ gradio - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 31 May 2026 05:05:38 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/gradio/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Use Transformers for Real-Time Gesture Recognition ]]>
                </title>
                <description>
                    <![CDATA[ Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are no... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/using-transformers-for-real-time-gesture-recognition/</link>
                <guid isPermaLink="false">68e3c692aa82abf4b593114c</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ transformers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pytorch ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ONNX ]]>
                    </category>
                
                    <category>
                        <![CDATA[ gradio ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Gesture Recognition ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Accessibility ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Tutorial ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Mon, 06 Oct 2025 13:39:30 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759757931295/5f19fd4e-93c0-4bd7-a75c-a7858e061ecd.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are not static images. Rather, they unfold over time. To build more robust, real-time systems, we need models that can capture both spatial details and temporal context.</p>
<p>This is where Transformers come in. Originally built for language, they’ve become state-of-the-art in vision tasks thanks to models like the Vision Transformer (ViT) and video-focused variants such as TimeSformer.</p>
<p>In this tutorial, we’ll use a Transformer backbone to create a lightweight real-time gesture recognition tool, optimized for small datasets and deployable on a regular laptop webcam.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-transformers-for-gestures">Why Transformers for Gestures?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-youll-learn">What You’ll Learn</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-generate-a-gesture-dataset">Generate a Gesture Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-1-generate-a-synthetic-dataset">Option 1: Generate a Synthetic Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-training-script-trainpy">Training Script: <code>train.py</code></a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-export-the-model-to-onnx">Export the Model to ONNX</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-evaluate-accuracy-latency">Evaluate Accuracy + Latency</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-2-use-small-samples-from-public-gesture-datasets">Option 2: Use Small Samples from Public Gesture Datasets</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-accessibility-notes-amp-ethical-limits">Accessibility Notes &amp; Ethical Limits</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-next-steps">Next Steps</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-why-transformers-for-gestures">Why Transformers for Gestures?</h2>
<p>Transformers are powerful because they use self-attention to model relationships across a sequence. For gestures, this means the model doesn’t just see isolated frames, but also learns how movements evolve over time. A wave, for example, looks different from a raised hand only when viewed as a sequence.</p>
<p>Vision Transformers process images as patches, while video Transformers extend this to multiple frames with temporal attention. Even a simple approach, like applying a pre-trained ViT to each frame and pooling the frame embeddings across time, can give a strong baseline on small datasets, where training a CNN from scratch tends to overfit.</p>
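<p>To make the patch and pooling ideas concrete, here is a minimal sketch in plain Python. The helper names <code>vit_token_count</code> and <code>mean_pool_over_time</code> are hypothetical and used only for illustration. The count covers patch tokens only and ignores the extra class token that ViT models usually add.</p>
<pre><code class="lang-python">def vit_token_count(image_size=224, patch_size=16, frames=1):
    """Return (patches_per_frame, patch_tokens_per_clip) for square inputs."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    patches_per_frame = per_side * per_side
    return patches_per_frame, patches_per_frame * frames


def mean_pool_over_time(frame_embeddings):
    """Average equal-length per-frame vectors into a single clip vector."""
    n_frames = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(vec[d] for vec in frame_embeddings) / n_frames for d in range(dim)]


if __name__ == "__main__":
    # A 224x224 frame split into 16x16 patches, for a clip of 8 frames
    print(vit_token_count(224, 16, frames=8))             # (196, 1568)
    # Two toy 2-dimensional frame embeddings averaged into one clip embedding
    print(mean_pool_over_time([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
</code></pre>
<p>Mean pooling ignores frame order, so it summarizes what appears in a clip rather than the order in which it appears. That is the trade-off of this simple approach compared with full temporal attention.</p>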
<p>Combined with Hugging Face’s pre-trained models and ONNX Runtime for optimization, Transformers make it possible to train on a modest dataset and still achieve smooth real-time recognition.</p>
<h2 id="heading-what-youll-learn">What You’ll Learn</h2>
<p>In this tutorial, you’ll build a gesture recognition system using Transformers. By the end, you’ll know how to:</p>
<ul>
<li><p>Create (or record) a tiny gesture dataset</p>
</li>
<li><p>Train a Vision Transformer (ViT) with temporal pooling</p>
</li>
<li><p>Export the model to ONNX for faster inference</p>
</li>
<li><p>Build a real-time Gradio app that classifies gestures from your webcam</p>
</li>
<li><p>Evaluate your model’s accuracy and latency with simple scripts</p>
</li>
<li><p>Understand the accessibility potential and ethical limits of gesture recognition</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along, you should have:</p>
<ul>
<li><p>Basic Python knowledge (functions, scripts, virtual environments)</p>
</li>
<li><p>Familiarity with PyTorch (tensors, datasets, training loops) – helpful but not required</p>
</li>
<li><p>Python 3.8+ installed on your system</p>
</li>
<li><p>A webcam (for the live demo in Gradio)</p>
</li>
<li><p>Optionally: GPU access (training on CPU works, but is slower)</p>
</li>
</ul>
<h2 id="heading-project-setup">Project Setup</h2>
<p>Create a new project folder and install the required libraries.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create a new project directory and navigate into it</span>
mkdir transformer-gesture &amp;&amp; <span class="hljs-built_in">cd</span> transformer-gesture

<span class="hljs-comment"># Set up a Python virtual environment</span>
python -m venv .venv

<span class="hljs-comment"># Activate the virtual environment</span>
<span class="hljs-comment"># Windows PowerShell</span>
.venv\Scripts\Activate.ps1

<span class="hljs-comment"># macOS/Linux</span>
<span class="hljs-built_in">source</span> .venv/bin/activate
</code></pre>
<p>These commands set up a new Python project with a virtual environment. Here’s what each part does:</p>
<ol>
<li><p><code>mkdir transformer-gesture &amp;&amp; cd transformer-gesture</code>: This command creates a new directory named "transformer-gesture" and then navigates into it.</p>
</li>
<li><p><code>python -m venv .venv</code>: This command creates a new virtual environment in the current directory. The virtual environment is stored in a folder named ".venv".</p>
</li>
<li><p>Activating the virtual environment:</p>
<ul>
<li><p>For Windows PowerShell, you can use <code>.venv\Scripts\Activate.ps1</code> to activate the virtual environment.</p>
</li>
<li><p>For macOS/Linux, use <code>source .venv/bin/activate</code> to activate the virtual environment.</p>
</li>
</ul>
</li>
</ol>
<p>Activating a virtual environment ensures that the Python interpreter and any packages you install are isolated to this specific project, preventing conflicts with other projects or system-wide packages.</p>
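<p>If you are unsure whether the environment is active, a quick check from Python can help. This is a minimal sketch that relies only on the standard library. The function name <code>in_virtualenv</code> is hypothetical, and the check assumes an environment created with <code>venv</code>, as above.</p>
<pre><code class="lang-python">import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
</code></pre>
<p>When the environment is active, the interpreter path printed on the second line should sit inside the <code>.venv</code> folder.</p>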
<p>Create a <code>requirements.txt</code> file:</p>
<pre><code class="lang-plaintext">torch&gt;=2.0
torchvision
torchaudio
timm
huggingface_hub

onnx
onnxruntime

gradio

numpy
opencv-python
pillow

matplotlib
seaborn
scikit-learn
</code></pre>
<p>Here’s a brief explanation of each package in <code>requirements.txt</code>:</p>
<ol>
<li><p><strong>torch&gt;=2.0</strong>: PyTorch is a popular open-source deep learning framework that provides a flexible and efficient platform for building and training neural networks. Version 2.0 and above includes improvements in performance and new features.</p>
</li>
<li><p><strong>torchvision</strong>: This library is part of the PyTorch ecosystem and provides tools for computer vision tasks, including datasets, model architectures, and image transformations.</p>
</li>
<li><p><strong>torchaudio</strong>: Also part of the PyTorch ecosystem, Torchaudio provides audio processing tools and datasets, making it easier to work with audio data in deep learning projects.</p>
</li>
<li><p><strong>timm</strong>: The PyTorch Image Models (timm) library offers a collection of pre-trained models and utilities for computer vision tasks, facilitating quick experimentation and deployment.</p>
</li>
<li><p><strong>huggingface_hub</strong>: This library allows easy access to models and datasets hosted on the Hugging Face Hub, a platform for sharing and collaborating on machine learning models and datasets.</p>
</li>
<li><p><strong>onnx</strong>: The Open Neural Network Exchange (ONNX) format is used to represent machine learning models, enabling interoperability between different frameworks.</p>
</li>
<li><p><strong>onnxruntime</strong>: This is a high-performance runtime for executing ONNX models, allowing for efficient deployment across various platforms.</p>
</li>
<li><p><strong>gradio</strong>: Gradio is a library for creating user interfaces for machine learning models, making them accessible through a web interface for easy interaction and testing.</p>
</li>
<li><p><strong>numpy</strong>: A fundamental package for numerical computing in Python, providing support for arrays and a wide range of mathematical functions.</p>
</li>
<li><p><strong>opencv-python</strong>: OpenCV is a library for computer vision and image processing tasks, widely used for real-time applications.</p>
</li>
<li><p><strong>pillow</strong>: A Python Imaging Library (PIL) fork, Pillow provides tools for opening, manipulating, and saving many different image file formats.</p>
</li>
<li><p><strong>matplotlib</strong>: A plotting library for creating static, interactive, and animated visualizations in Python.</p>
</li>
<li><p><strong>seaborn</strong>: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.</p>
</li>
<li><p><strong>scikit-learn</strong>: A machine learning library in Python that provides simple and efficient tools for data analysis and modeling, including classification, regression, clustering, and dimensionality reduction.</p>
</li>
</ol>
<p>Install dependencies:</p>
<pre><code class="lang-bash">pip install -r requirements.txt
</code></pre>
<p>The command <code>pip install -r requirements.txt</code> tells <code>pip</code>, the Python package installer, to read <code>requirements.txt</code> and install every package listed there, along with each package’s own dependencies. Keeping dependencies in one file makes the environment easy to recreate and share.</p>
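<p>After installation, you may want to confirm that everything can be found before moving on. The sketch below uses the standard library’s <code>importlib.util.find_spec</code> to look up each module without importing it. The helper name <code>missing_modules</code> is hypothetical, and the module list simply mirrors the <code>requirements.txt</code> above.</p>
<pre><code class="lang-python">import importlib.util

# The pip name and the import name differ for three packages:
# opencv-python installs cv2, pillow installs PIL, scikit-learn installs sklearn.
MODULES = [
    "torch", "torchvision", "torchaudio", "timm", "huggingface_hub",
    "onnx", "onnxruntime", "gradio", "numpy", "cv2", "PIL",
    "matplotlib", "seaborn", "sklearn",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(MODULES)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All dependencies found.")
</code></pre>
<p>An empty result only means the modules are discoverable. It does not verify versions or GPU support.</p>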
<h2 id="heading-generate-a-gesture-dataset">Generate a Gesture Dataset</h2>
<p>To train our Transformer-based gesture recognizer, we need some data. Instead of downloading a huge dataset, we’ll start with a tiny synthetic dataset you can generate in seconds. This makes the tutorial lightweight and ensures that everyone can follow along without dealing with multi-gigabyte downloads.</p>
<h2 id="heading-option-1-generate-a-synthetic-dataset">Option 1: Generate a Synthetic Dataset</h2>
<p>We’ll use a small Python script that creates short <code>.mp4</code> clips of a moving (or still) coloured box. Each class represents a gesture:</p>
<ul>
<li><p><strong>swipe_left</strong> – box moves from right to left</p>
</li>
<li><p><strong>swipe_right</strong> – box moves from left to right</p>
</li>
<li><p><strong>stop</strong> – box stays still in the center</p>
</li>
</ul>
<p>Save this script as <code>generate_synthetic_gestures.py</code> in your project root:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os, cv2, numpy <span class="hljs-keyword">as</span> np, random, argparse

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ensure_dir</span>(<span class="hljs-params">p</span>):</span> os.makedirs(p, exist_ok=<span class="hljs-literal">True</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">make_clip</span>(<span class="hljs-params">mode, out_path, seconds=<span class="hljs-number">1.5</span>, fps=<span class="hljs-number">16</span>, size=<span class="hljs-number">224</span>, box_size=<span class="hljs-number">60</span>, seed=<span class="hljs-number">0</span>, codec=<span class="hljs-string">"mp4v"</span></span>):</span>
    rng = random.Random(seed)
    frames = int(seconds * fps)
    H = W = size

    <span class="hljs-comment"># background + box color</span>
    bg_val = rng.randint(<span class="hljs-number">160</span>, <span class="hljs-number">220</span>)
    bg = np.full((H, W, <span class="hljs-number">3</span>), bg_val, dtype=np.uint8)
    color = (rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>), rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>), rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>))

    <span class="hljs-comment"># path of motion</span>
    y = rng.randint(<span class="hljs-number">40</span>, H - <span class="hljs-number">40</span> - box_size)
    <span class="hljs-keyword">if</span> mode == <span class="hljs-string">"swipe_left"</span>:
        x_start, x_end = W - <span class="hljs-number">20</span> - box_size, <span class="hljs-number">20</span>
    <span class="hljs-keyword">elif</span> mode == <span class="hljs-string">"swipe_right"</span>:
        x_start, x_end = <span class="hljs-number">20</span>, W - <span class="hljs-number">20</span> - box_size
    <span class="hljs-keyword">elif</span> mode == <span class="hljs-string">"stop"</span>:
        x_start = x_end = (W - box_size) // <span class="hljs-number">2</span>
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"Unknown mode: <span class="hljs-subst">{mode}</span>"</span>)

    fourcc = cv2.VideoWriter_fourcc(*codec)
    vw = cv2.VideoWriter(out_path, fourcc, fps, (W, H))
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> vw.isOpened():
        <span class="hljs-keyword">raise</span> RuntimeError(
            <span class="hljs-string">f"Could not open VideoWriter with codec '<span class="hljs-subst">{codec}</span>'. "</span>
            <span class="hljs-string">"Try --codec XVID and use .avi extension, e.g. out.avi"</span>
        )

    <span class="hljs-keyword">for</span> t <span class="hljs-keyword">in</span> range(frames):
        alpha = t / max(<span class="hljs-number">1</span>, frames - <span class="hljs-number">1</span>)
        x = int((<span class="hljs-number">1</span> - alpha) * x_start + alpha * x_end)
        <span class="hljs-comment"># small jitter to avoid being too synthetic</span>
        jitter_x, jitter_y = rng.randint(<span class="hljs-number">-2</span>, <span class="hljs-number">2</span>), rng.randint(<span class="hljs-number">-2</span>, <span class="hljs-number">2</span>)
        frame = bg.copy()
        cv2.rectangle(frame, (x + jitter_x, y + jitter_y),
                      (x + jitter_x + box_size, y + jitter_y + box_size),
                      color, thickness=<span class="hljs-number">-1</span>)
        <span class="hljs-comment"># overlay text</span>
        cv2.putText(frame, mode, (<span class="hljs-number">8</span>, <span class="hljs-number">24</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">0.7</span>, (<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>), <span class="hljs-number">2</span>, cv2.LINE_AA)
        cv2.putText(frame, mode, (<span class="hljs-number">8</span>, <span class="hljs-number">24</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">0.7</span>, (<span class="hljs-number">255</span>, <span class="hljs-number">255</span>, <span class="hljs-number">255</span>), <span class="hljs-number">1</span>, cv2.LINE_AA)
        vw.write(frame)

    vw.release()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_labels</span>(<span class="hljs-params">labels, out_dir</span>):</span>
    <span class="hljs-keyword">with</span> open(os.path.join(out_dir, <span class="hljs-string">"labels.txt"</span>), <span class="hljs-string">"w"</span>, encoding=<span class="hljs-string">"utf-8"</span>) <span class="hljs-keyword">as</span> f:
        <span class="hljs-keyword">for</span> c <span class="hljs-keyword">in</span> labels:
            f.write(c + <span class="hljs-string">"\n"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    ap = argparse.ArgumentParser(description=<span class="hljs-string">"Generate a tiny synthetic gesture dataset."</span>)
    ap.add_argument(<span class="hljs-string">"--out"</span>, default=<span class="hljs-string">"data"</span>, help=<span class="hljs-string">"Output directory (default: data)"</span>)
    ap.add_argument(<span class="hljs-string">"--classes"</span>, nargs=<span class="hljs-string">"+"</span>,
                    default=[<span class="hljs-string">"swipe_left"</span>, <span class="hljs-string">"swipe_right"</span>, <span class="hljs-string">"stop"</span>],
                    help=<span class="hljs-string">"Class names (default: swipe_left swipe_right stop)"</span>)
    ap.add_argument(<span class="hljs-string">"--clips"</span>, type=int, default=<span class="hljs-number">16</span>, help=<span class="hljs-string">"Clips per class (default: 16)"</span>)
    ap.add_argument(<span class="hljs-string">"--seconds"</span>, type=float, default=<span class="hljs-number">1.5</span>, help=<span class="hljs-string">"Seconds per clip (default: 1.5)"</span>)
    ap.add_argument(<span class="hljs-string">"--fps"</span>, type=int, default=<span class="hljs-number">16</span>, help=<span class="hljs-string">"Frames per second (default: 16)"</span>)
    ap.add_argument(<span class="hljs-string">"--size"</span>, type=int, default=<span class="hljs-number">224</span>, help=<span class="hljs-string">"Frame size WxH (default: 224)"</span>)
    ap.add_argument(<span class="hljs-string">"--box"</span>, type=int, default=<span class="hljs-number">60</span>, help=<span class="hljs-string">"Box size (default: 60)"</span>)
    ap.add_argument(<span class="hljs-string">"--codec"</span>, default=<span class="hljs-string">"mp4v"</span>, help=<span class="hljs-string">"Codec fourcc (mp4v or XVID)"</span>)
    ap.add_argument(<span class="hljs-string">"--ext"</span>, default=<span class="hljs-string">".mp4"</span>, help=<span class="hljs-string">"File extension (.mp4 or .avi)"</span>)
    args = ap.parse_args()

    ensure_dir(args.out)
    write_labels(args.classes, <span class="hljs-string">"."</span>)  <span class="hljs-comment"># writes labels.txt to project root</span>

    print(<span class="hljs-string">f"Generating synthetic dataset -&gt; <span class="hljs-subst">{args.out}</span>"</span>)
    <span class="hljs-keyword">for</span> cls <span class="hljs-keyword">in</span> args.classes:
        cls_dir = os.path.join(args.out, cls)
        ensure_dir(cls_dir)
        mode = <span class="hljs-string">"stop"</span> <span class="hljs-keyword">if</span> cls == <span class="hljs-string">"stop"</span> <span class="hljs-keyword">else</span> (<span class="hljs-string">"swipe_left"</span> <span class="hljs-keyword">if</span> <span class="hljs-string">"left"</span> <span class="hljs-keyword">in</span> cls <span class="hljs-keyword">else</span> (<span class="hljs-string">"swipe_right"</span> <span class="hljs-keyword">if</span> <span class="hljs-string">"right"</span> <span class="hljs-keyword">in</span> cls <span class="hljs-keyword">else</span> <span class="hljs-string">"stop"</span>))
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(args.clips):
            filename = os.path.join(cls_dir, <span class="hljs-string">f"<span class="hljs-subst">{cls}</span>_<span class="hljs-subst">{i+<span class="hljs-number">1</span>:<span class="hljs-number">03</span>d}</span><span class="hljs-subst">{args.ext}</span>"</span>)
            make_clip(
                mode=mode,
                out_path=filename,
                seconds=args.seconds,
                fps=args.fps,
                size=args.size,
                box_size=args.box,
                seed=i + <span class="hljs-number">1</span>,
                codec=args.codec
            )
        print(<span class="hljs-string">f"  <span class="hljs-subst">{cls}</span>: <span class="hljs-subst">{args.clips}</span> clips"</span>)

    print(<span class="hljs-string">"Done. You can now run: python train.py, python export_onnx.py, python app.py"</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<p>The script generates a synthetic gesture dataset by creating video clips of a moving or stationary coloured box, simulating gestures like "swipe left," "swipe right," and "stop," and saves them in a specified output directory.</p>
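<p>The motion itself is a linear interpolation between a start and an end position. The sketch below isolates that logic from <code>make_clip</code> so you can inspect the positions without writing any video. The function name <code>box_positions</code> is hypothetical, and the random jitter of up to 2 pixels used in the script is left out for clarity.</p>
<pre><code class="lang-python">def box_positions(x_start, x_end, seconds=1.5, fps=16):
    """Return the box x-coordinate for every frame of a clip."""
    frames = int(seconds * fps)
    xs = []
    for t in range(frames):
        # alpha runs from 0.0 on the first frame to 1.0 on the last frame
        alpha = t / max(1, frames - 1)
        xs.append(int((1 - alpha) * x_start + alpha * x_end))
    return xs


if __name__ == "__main__":
    W, box = 224, 60
    left = box_positions(W - 20 - box, 20)    # swipe_left: right to left
    right = box_positions(20, W - 20 - box)   # swipe_right: left to right
    print(len(left), left[0], left[-1])       # 24 144 20
    print(len(right), right[0], right[-1])    # 24 20 144
</code></pre>
<p>With the defaults, each clip has 24 frames, and the box travels between x = 20 and x = 144 in a 224-pixel-wide frame.</p>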
<p>Now run it inside your virtual environment:</p>
<pre><code class="lang-bash">python generate_synthetic_gestures.py --out data --clips 16 --seconds 1.5
</code></pre>
<p>This command generates 16 clips per gesture class, each 1.5 seconds long (24 frames at the default 16 fps), and saves them in the <code>data</code> directory.</p>
<p>This creates a dataset like:</p>
<pre><code class="lang-plaintext">data/
  swipe_left/*.mp4
  swipe_right/*.mp4
  stop/*.mp4
labels.txt
</code></pre>
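<p>Each folder contains short clips of a moving (or still) box that simulate gestures, which is enough to test the pipeline end to end. Note that the script also draws the class name in the top-left corner of every frame, so a model could learn to read that text instead of the motion. Remove the two <code>cv2.putText</code> calls if you want motion to be the only cue.</p>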
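<p>Before training, it can be worth confirming that the expected number of clips was written. The following sketch counts video files per class folder with <code>pathlib</code>. The function name <code>count_clips</code> is hypothetical, and the folder layout is assumed to match the tree shown above.</p>
<pre><code class="lang-python">from pathlib import Path


def count_clips(data_dir="data", extensions=(".mp4", ".avi")):
    """Map each class folder name to the number of video files it holds."""
    root = Path(data_dir)
    counts = {}
    for cls_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[cls_dir.name] = sum(
            1 for f in cls_dir.iterdir() if f.suffix.lower() in extensions
        )
    return counts


if __name__ == "__main__":
    if Path("data").is_dir():
        for name, n in count_clips("data").items():
            print(name, n)
    else:
        print("No data/ folder found; run generate_synthetic_gestures.py first.")
</code></pre>
<p>With the command used earlier, each of the three classes should report 16 clips.</p>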
<h3 id="heading-training-script-trainpy">Training Script: <code>train.py</code></h3>
<p>Now that we have our dataset, let’s fine-tune a Vision Transformer with temporal pooling. This model applies ViT frame-by-frame, averages embeddings across time, and trains a classification head on your gestures.</p>
<p>Here’s the full training script:</p>
<pre><code class="lang-python"><span class="hljs-comment"># train.py</span>
<span class="hljs-keyword">import</span> torch, torch.nn <span class="hljs-keyword">as</span> nn, torch.optim <span class="hljs-keyword">as</span> optim
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader
<span class="hljs-keyword">import</span> timm
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> GestureClips, read_labels

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ViTTemporal</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-string">"""Frame-wise ViT encoder -&gt; mean pool over time -&gt; linear head."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, num_classes, vit_name=<span class="hljs-string">"vit_tiny_patch16_224"</span></span>):</span>
        super().__init__()
        self.vit = timm.create_model(vit_name, pretrained=<span class="hljs-literal">True</span>, num_classes=<span class="hljs-number">0</span>, global_pool=<span class="hljs-string">"avg"</span>)
        feat_dim = self.vit.num_features
        self.head = nn.Linear(feat_dim, num_classes)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>  <span class="hljs-comment"># x: (B,T,C,H,W)</span>
        B, T, C, H, W = x.shape
        x = x.view(B * T, C, H, W)
        feats = self.vit(x)                  <span class="hljs-comment"># (B*T, D)</span>
        feats = feats.view(B, T, <span class="hljs-number">-1</span>).mean(dim=<span class="hljs-number">1</span>)  <span class="hljs-comment"># (B, D)</span>
        <span class="hljs-keyword">return</span> self.head(feats)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span>():</span>
    device = <span class="hljs-string">"cuda"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span>
    labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
    n_classes = len(labels)

    train_ds = GestureClips(train=<span class="hljs-literal">True</span>)
    val_ds   = GestureClips(train=<span class="hljs-literal">False</span>)
    print(<span class="hljs-string">f"Train clips: <span class="hljs-subst">{len(train_ds)}</span> | Val clips: <span class="hljs-subst">{len(val_ds)}</span>"</span>)

    <span class="hljs-comment"># Windows/CPU friendly</span>
    train_dl = DataLoader(train_ds, batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">True</span>,  num_workers=<span class="hljs-number">0</span>, pin_memory=<span class="hljs-literal">False</span>)
    val_dl   = DataLoader(val_ds,   batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">False</span>, num_workers=<span class="hljs-number">0</span>, pin_memory=<span class="hljs-literal">False</span>)

    model = ViTTemporal(num_classes=n_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=<span class="hljs-number">3e-4</span>, weight_decay=<span class="hljs-number">0.05</span>)

    best_acc = <span class="hljs-number">0.0</span>
    epochs = <span class="hljs-number">5</span>
    <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, epochs + <span class="hljs-number">1</span>):
        <span class="hljs-comment"># ---- Train ----</span>
        model.train()
        total, correct, loss_sum = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0.0</span>
        <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> train_dl:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()

            loss_sum += loss.item() * x.size(<span class="hljs-number">0</span>)
            correct += (logits.argmax(<span class="hljs-number">1</span>) == y).sum().item()
            total += x.size(<span class="hljs-number">0</span>)

        train_acc = correct / total <span class="hljs-keyword">if</span> total <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>
        train_loss = loss_sum / total <span class="hljs-keyword">if</span> total <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>

        <span class="hljs-comment"># ---- Validate ----</span>
        model.eval()
        vtotal, vcorrect = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>
        <span class="hljs-keyword">with</span> torch.no_grad():
            <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> val_dl:
                x, y = x.to(device), y.to(device)
                vcorrect += (model(x).argmax(<span class="hljs-number">1</span>) == y).sum().item()
                vtotal += x.size(<span class="hljs-number">0</span>)
        val_acc = vcorrect / vtotal <span class="hljs-keyword">if</span> vtotal <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>

        print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch:<span class="hljs-number">02</span>d}</span> | train_loss <span class="hljs-subst">{train_loss:<span class="hljs-number">.4</span>f}</span> "</span>
              <span class="hljs-string">f"| train_acc <span class="hljs-subst">{train_acc:<span class="hljs-number">.3</span>f}</span> | val_acc <span class="hljs-subst">{val_acc:<span class="hljs-number">.3</span>f}</span>"</span>)

        <span class="hljs-keyword">if</span> val_acc &gt; best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), <span class="hljs-string">"vit_temporal_best.pt"</span>)

    print(<span class="hljs-string">"Best val acc:"</span>, best_acc)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    train()
</code></pre>
<p>Running the command <code>python train.py</code> initiates the training process for your gesture recognition model. Here's a breakdown of what happens:</p>
<ol>
<li><p><strong>Load your dataset from data/</strong>: The script will access and load the gesture dataset stored in the "data" directory. This dataset is used to train the model.</p>
</li>
<li><p><strong>Fine-tune a pre-trained Vision Transformer</strong>: The training script will take a Vision Transformer model that has been pre-trained on a larger dataset and fine-tune it using your specific gesture dataset. Fine-tuning helps the model adapt to the nuances of your data, improving its performance on the specific task of gesture recognition.</p>
</li>
<li><p><strong>Save the best checkpoint as vit_temporal_best.pt</strong>: After each epoch, the script evaluates the model on the validation set. Whenever validation accuracy improves, the weights are saved to "vit_temporal_best.pt", so the file always holds the best-performing version seen so far. You can use it later for inference or further training.</p>
</li>
</ol>
<h4 id="heading-what-training-looks-like">What Training Looks Like</h4>
<p>You should see logs similar to this:</p>
<pre><code class="lang-plaintext">Train clips: 38 | Val clips: 10
Epoch 01 | train_loss 1.4508 | train_acc 0.395 | val_acc 0.200
Epoch 02 | train_loss 1.2466 | train_acc 0.263 | val_acc 0.200
Epoch 03 | train_loss 1.1361 | train_acc 0.368 | val_acc 0.200
Best val acc: 0.200
</code></pre>
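<p>Don’t worry if your accuracy is low at first. With the tiny synthetic dataset, that’s normal: with only 10 validation clips, a <code>val_acc</code> of 0.200 means just 2 correct predictions. The key is proving that the Transformer pipeline works end to end. You can boost results later by:</p>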
<ul>
<li><p>Adding more clips per class</p>
</li>
<li><p>Training for more epochs</p>
</li>
<li><p>Switching to real recorded gestures</p>
</li>
</ul>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/training-logs.png?raw=true" alt="Training logs" width="600" height="400" loading="lazy"></p>
<p>Figure 1. Example training logs from <code>train.py</code>, where the Vision Transformer with temporal pooling is fine-tuned on a tiny synthetic dataset.</p>
<h3 id="heading-export-the-model-to-onnx">Export the Model to ONNX</h3>
<p>To make our model easier to run in real time (and lighter on CPU), we’ll export it to the ONNX format.</p>
<p><strong>Note:</strong> ONNX, which stands for Open Neural Network Exchange, is an open-source format for exchanging deep learning models between frameworks and runtimes. It lets you train a model in one framework, such as PyTorch or TensorFlow, and then run it somewhere else, such as ONNX Runtime, without rewriting the model. This interoperability comes from a standardized representation of the model's computation graph and parameters.</p>
<p>ONNX supports a wide range of operators and is continually updated to include new features, making it a versatile choice for deploying machine learning models across various platforms and devices.</p>
<p>Create a file called <code>export_onnx.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> train <span class="hljs-keyword">import</span> ViTTemporal
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
n_classes = len(labels)

<span class="hljs-comment"># Load trained model</span>
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load(<span class="hljs-string">"vit_temporal_best.pt"</span>, map_location=<span class="hljs-string">"cpu"</span>))
model.eval()

<span class="hljs-comment"># Dummy input: batch=1, 16 frames, 3x224x224</span>
dummy = torch.randn(<span class="hljs-number">1</span>, <span class="hljs-number">16</span>, <span class="hljs-number">3</span>, <span class="hljs-number">224</span>, <span class="hljs-number">224</span>)

<span class="hljs-comment"># Export</span>
torch.onnx.export(
    model, dummy, <span class="hljs-string">"vit_temporal.onnx"</span>,
    input_names=[<span class="hljs-string">"video"</span>], output_names=[<span class="hljs-string">"logits"</span>],
    dynamic_axes={<span class="hljs-string">"video"</span>: {<span class="hljs-number">0</span>: <span class="hljs-string">"batch"</span>}},
    opset_version=<span class="hljs-number">13</span>
)

print(<span class="hljs-string">"Exported vit_temporal.onnx"</span>)
</code></pre>
<p>Run it with <code>python export_onnx.py</code>.</p>
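<p>This generates a file <code>vit_temporal.onnx</code> in your project folder. With the model in ONNX format, we can run it with <code>onnxruntime</code>, which is often faster than eager PyTorch for CPU inference.</p>
<p>Next, build the Gradio demo that loads the exported model. Create a file called <code>app.py</code>:</p>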
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os, tempfile, cv2, torch, onnxruntime, numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> gradio <span class="hljs-keyword">as</span> gr
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

T = <span class="hljs-number">16</span>
SIZE = <span class="hljs-number">224</span>
MODEL_PATH = <span class="hljs-string">"vit_temporal.onnx"</span>

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)

<span class="hljs-comment"># --- ONNX session + auto-detect names ---</span>
ort_session = onnxruntime.InferenceSession(MODEL_PATH, providers=[<span class="hljs-string">"CPUExecutionProvider"</span>])
<span class="hljs-comment"># detect first input and first output names to avoid mismatches</span>
INPUT_NAME = ort_session.get_inputs()[<span class="hljs-number">0</span>].name   <span class="hljs-comment"># e.g. "input" or "video"</span>
OUTPUT_NAME = ort_session.get_outputs()[<span class="hljs-number">0</span>].name <span class="hljs-comment"># e.g. "logits" or something else</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess_clip</span>(<span class="hljs-params">frames_rgb</span>):</span>
    <span class="hljs-keyword">if</span> len(frames_rgb) == <span class="hljs-number">0</span>:
        frames_rgb = [np.zeros((SIZE, SIZE, <span class="hljs-number">3</span>), dtype=np.uint8)]
    <span class="hljs-keyword">if</span> len(frames_rgb) &lt; T:
        frames_rgb = frames_rgb + [frames_rgb[<span class="hljs-number">-1</span>]] * (T - len(frames_rgb))
    frames_rgb = frames_rgb[:T]
    clip = [cv2.resize(f, (SIZE, SIZE), interpolation=cv2.INTER_AREA) <span class="hljs-keyword">for</span> f <span class="hljs-keyword">in</span> frames_rgb]
    clip = np.stack(clip, axis=<span class="hljs-number">0</span>)                                    <span class="hljs-comment"># (T,H,W,3)</span>
    clip = np.transpose(clip, (<span class="hljs-number">0</span>, <span class="hljs-number">3</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>)).astype(np.float32) / <span class="hljs-number">255</span> <span class="hljs-comment"># (T,3,H,W)</span>
    clip = (clip - <span class="hljs-number">0.5</span>) / <span class="hljs-number">0.5</span>
    clip = np.expand_dims(clip, <span class="hljs-number">0</span>)                                   <span class="hljs-comment"># (1,T,3,H,W)</span>
    <span class="hljs-keyword">return</span> clip

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_extract_path_from_gradio_video</span>(<span class="hljs-params">inp</span>):</span>
    <span class="hljs-keyword">if</span> isinstance(inp, str) <span class="hljs-keyword">and</span> os.path.exists(inp):
        <span class="hljs-keyword">return</span> inp
    <span class="hljs-keyword">if</span> isinstance(inp, dict):
        <span class="hljs-keyword">for</span> key <span class="hljs-keyword">in</span> (<span class="hljs-string">"video"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"path"</span>, <span class="hljs-string">"filepath"</span>):
            v = inp.get(key)
            <span class="hljs-keyword">if</span> isinstance(v, str) <span class="hljs-keyword">and</span> os.path.exists(v):
                <span class="hljs-keyword">return</span> v
        <span class="hljs-keyword">for</span> key <span class="hljs-keyword">in</span> (<span class="hljs-string">"data"</span>, <span class="hljs-string">"video"</span>):
            v = inp.get(key)
            <span class="hljs-keyword">if</span> isinstance(v, (bytes, bytearray)):
                tmp = tempfile.NamedTemporaryFile(delete=<span class="hljs-literal">False</span>, suffix=<span class="hljs-string">".mp4"</span>)
                tmp.write(v); tmp.flush(); tmp.close()
                <span class="hljs-keyword">return</span> tmp.name
    <span class="hljs-keyword">if</span> isinstance(inp, (list, tuple)) <span class="hljs-keyword">and</span> inp <span class="hljs-keyword">and</span> isinstance(inp[<span class="hljs-number">0</span>], str) <span class="hljs-keyword">and</span> os.path.exists(inp[<span class="hljs-number">0</span>]):
        <span class="hljs-keyword">return</span> inp[<span class="hljs-number">0</span>]
    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_read_uniform_frames</span>(<span class="hljs-params">video_path</span>):</span>
    cap = cv2.VideoCapture(video_path)
    frames = []
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) <span class="hljs-keyword">or</span> <span class="hljs-number">1</span>
    idxs = np.linspace(<span class="hljs-number">0</span>, total - <span class="hljs-number">1</span>, max(T, <span class="hljs-number">1</span>)).astype(int)
    want = set(int(i) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> idxs.tolist())
    j = <span class="hljs-number">0</span>
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        ok, bgr = cap.read()
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok: <span class="hljs-keyword">break</span>
        <span class="hljs-keyword">if</span> j <span class="hljs-keyword">in</span> want:
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            frames.append(rgb)
        j += <span class="hljs-number">1</span>
    cap.release()
    <span class="hljs-keyword">return</span> frames

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_from_video</span>(<span class="hljs-params">gradio_video</span>):</span>
    video_path = _extract_path_from_gradio_video(gradio_video)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> video_path <span class="hljs-keyword">or</span> <span class="hljs-keyword">not</span> os.path.exists(video_path):
        <span class="hljs-keyword">return</span> {}
    frames = _read_uniform_frames(video_path)

    <span class="hljs-comment"># If OpenCV choked on the codec (common with recorded webm), re-encode once:</span>
    <span class="hljs-keyword">if</span> len(frames) == <span class="hljs-number">0</span>:
        tmp = tempfile.NamedTemporaryFile(delete=<span class="hljs-literal">False</span>, suffix=<span class="hljs-string">".mp4"</span>); tmp_name = tmp.name; tmp.close()
        cap = cv2.VideoCapture(video_path)
        fourcc = cv2.VideoWriter_fourcc(*<span class="hljs-string">"mp4v"</span>)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) <span class="hljs-keyword">or</span> <span class="hljs-number">640</span>
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) <span class="hljs-keyword">or</span> <span class="hljs-number">480</span>
        out = cv2.VideoWriter(tmp_name, fourcc, <span class="hljs-number">20.0</span>, (w, h))
        <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
            ok, frame = cap.read()
            <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok: <span class="hljs-keyword">break</span>
            out.write(frame)
        cap.release(); out.release()
        frames = _read_uniform_frames(tmp_name)

    clip = preprocess_clip(frames)
    <span class="hljs-comment"># &gt;&gt;&gt; use the detected ONNX input/output names &lt;&lt;&lt;</span>
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[<span class="hljs-number">0</span>]  <span class="hljs-comment"># (1, C)</span>
    probs = torch.softmax(torch.from_numpy(logits), dim=<span class="hljs-number">1</span>)[<span class="hljs-number">0</span>].numpy().tolist()
    <span class="hljs-keyword">return</span> {labels[i]: float(probs[i]) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(labels))}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_from_image</span>(<span class="hljs-params">image</span>):</span>
    <span class="hljs-keyword">if</span> image <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">return</span> {}
    clip = preprocess_clip([image] * T)
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[<span class="hljs-number">0</span>]
    probs = torch.softmax(torch.from_numpy(logits), dim=<span class="hljs-number">1</span>)[<span class="hljs-number">0</span>].numpy().tolist()
    <span class="hljs-keyword">return</span> {labels[i]: float(probs[i]) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(labels))}

<span class="hljs-keyword">with</span> gr.Blocks() <span class="hljs-keyword">as</span> demo:
    gr.Markdown(<span class="hljs-string">"# Gesture Classifier (ONNX)\nRecord or upload a short video, then click **Classify Video**."</span>)
    <span class="hljs-keyword">with</span> gr.Tab(<span class="hljs-string">"Video (record or upload)"</span>):
        vid_in = gr.Video(label=<span class="hljs-string">"Record from webcam or upload a short clip"</span>)
        vid_out = gr.Label(num_top_classes=<span class="hljs-number">3</span>, label=<span class="hljs-string">"Prediction"</span>)
        gr.Button(<span class="hljs-string">"Classify Video"</span>).click(fn=predict_from_video, inputs=vid_in, outputs=vid_out)
    <span class="hljs-keyword">with</span> gr.Tab(<span class="hljs-string">"Single Image (fallback)"</span>):
        img_in = gr.Image(label=<span class="hljs-string">"Upload an image frame"</span>, type=<span class="hljs-string">"numpy"</span>)
        img_out = gr.Label(num_top_classes=<span class="hljs-number">3</span>, label=<span class="hljs-string">"Prediction"</span>)
        gr.Button(<span class="hljs-string">"Classify Image"</span>).click(fn=predict_from_image, inputs=img_in, outputs=img_out)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    demo.launch()
</code></pre>
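<p>One note on dependencies: the app above imports PyTorch only to turn logits into probabilities with <code>torch.softmax</code>. If you'd rather keep the inference environment lighter, a NumPy softmax is one possible substitute. The sketch below is illustrative: the function name <code>softmax</code> and the example logits are assumptions, not part of the original project.</p>

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax over the given axis."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtracting the row maximum leaves the result unchanged
    # but keeps exp() from overflowing on large logits.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


# Hypothetical logits with shape (1, C), the shape returned by the ONNX session.
example_logits = np.array([[2.0, 0.5, -1.0]], dtype=np.float32)
probs = softmax(example_logits, axis=1)[0]
print(probs, probs.sum())
```

<p>In <code>app.py</code>, this would stand in for the two <code>torch.softmax(...)</code> lines, for example <code>probs = softmax(logits, axis=1)[0].tolist()</code>.</p>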
<p>Running the command <code>python app.py</code> starts a local Gradio server and prints a URL (by default <code>http://127.0.0.1:7860</code>) that you can open in your web browser. Note that the app classifies a recorded or uploaded clip when you ask it to, rather than streaming predictions from a live webcam feed. Here's what happens:</p>
<ol>
<li><p><strong>You record or upload a clip</strong>: The video input lets you capture a short clip from your webcam or upload an existing file.</p>
</li>
<li><p><strong>The clip is classified when you click the button</strong>: The app samples 16 frames evenly across the clip, resizes them to 224×224, normalizes them, and runs them through the ONNX model.</p>
</li>
<li><p><strong>Top 3 gesture classes displayed with probabilities</strong>: The application displays the top three predicted gesture classes along with their probabilities, giving you an idea of the model's confidence in its predictions.</p>
</li>
</ol>
<p>When you open the app in your browser, you'll find two tabs. In the <strong>Video tab</strong>, record a short clip of your gesture from your webcam (2–4 seconds is typically enough) or upload one, then click <strong>Classify Video</strong>. The Transformer model processes the sampled frames, and the app displays the predicted gesture probabilities. The <strong>Single Image tab</strong> is a fallback: it repeats one uploaded frame 16 times to form a clip before running the same model.</p>
<p>Here’s an example where I raised my hand for a <strong>stop</strong> gesture, and the model predicts “stop” as the top class:</p>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/realtime-demo.png?raw=true" alt="Gradio demo output" width="600" height="400" loading="lazy"></p>
<p>Figure 2. The Gradio app running locally. After recording a short clip, the Transformer model predicts the gesture with class probabilities.</p>
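<p>To see why clips of different lengths all end up as 16 frames, it helps to look at the sampling step in isolation. The sketch below mirrors the <code>np.linspace</code> indexing and last-frame padding used in <code>app.py</code>, but works on plain frame indices so you can run it without a video. The helper names <code>sample_indices</code> and <code>pad_or_trim</code> are illustrative, not part of the project.</p>

```python
import numpy as np

T = 16  # frames per clip, matching app.py


def sample_indices(total_frames, num_frames=T):
    """Evenly spaced frame indices, computed with np.linspace as in app.py."""
    total = max(int(total_frames), 1)
    return np.linspace(0, total - 1, num_frames).astype(int)


def pad_or_trim(frames, num_frames=T):
    """Repeat the last frame until there are num_frames entries, then trim.

    Simplification: app.py substitutes a black frame for an empty clip,
    whereas this sketch raises an error instead.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one frame")
    missing = max(num_frames - len(frames), 0)
    return (frames + [frames[-1]] * missing)[:num_frames]


# A 48-frame video yields 16 distinct indices spread from first to last frame.
long_clip = sample_indices(48)

# A 5-frame video has only 5 distinct indices, so the clip is padded to 16.
short_clip = sorted(set(sample_indices(5).tolist()))
padded = pad_or_trim(short_clip)

print(long_clip)
print(short_clip)
print(padded)
```

<p>One consequence worth knowing: very short recordings are mostly padding, so the model sees the final frame many times. Recording for a couple of seconds gives it more real motion to work with.</p>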
<h3 id="heading-evaluate-accuracy-latency">Evaluate Accuracy + Latency</h3>
<p>Now that the model runs in a demo app, let’s check how well it performs. There are two sides to this:</p>
<ul>
<li><p><strong>Accuracy</strong>: does the model predict the right gesture class?</p>
</li>
<li><p><strong>Latency</strong>: how fast does it respond, especially on CPU vs GPU?</p>
</li>
</ul>
<h4 id="heading-1-quick-accuracy-check">1. Quick Accuracy Check</h4>
<p>Save this as <code>eval.py</code> in the same folder as your other scripts:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> GestureClips, read_labels
<span class="hljs-keyword">from</span> train <span class="hljs-keyword">import</span> ViTTemporal

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
n_classes = len(labels)

<span class="hljs-comment"># Load validation data</span>
val_ds = GestureClips(train=<span class="hljs-literal">False</span>)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># Load trained model</span>
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load(<span class="hljs-string">"vit_temporal_best.pt"</span>, map_location=<span class="hljs-string">"cpu"</span>))
model.eval()

correct, total = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>
all_preds, all_labels = [], []

<span class="hljs-keyword">with</span> torch.no_grad():
    <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> val_dl:
        logits = model(x)
        preds = logits.argmax(dim=<span class="hljs-number">1</span>)
        correct += (preds == y).sum().item()
        total += y.size(<span class="hljs-number">0</span>)
        all_preds.extend(preds.tolist())
        all_labels.extend(y.tolist())

print(<span class="hljs-string">f"Validation accuracy: <span class="hljs-subst">{correct/total:<span class="hljs-number">.2</span>%}</span>"</span>)
</code></pre>
<h4 id="heading-2-confusion-matrix">2. Confusion Matrix</h4>
<p>Let’s also visualize which gestures are confused. Add this snippet at the bottom of <code>eval.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> confusion_matrix

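<span class="hljs-comment"># Pass every class index so the matrix matches the tick labels,</span>
<span class="hljs-comment"># even if a class never appears in the validation set</span>
cm = confusion_matrix(all_labels, all_preds, labels=list(range(n_classes)))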

plt.figure(figsize=(<span class="hljs-number">6</span>,<span class="hljs-number">6</span>))
sns.heatmap(cm, annot=<span class="hljs-literal">True</span>, fmt=<span class="hljs-string">"d"</span>, xticklabels=labels, yticklabels=labels, cmap=<span class="hljs-string">"Blues"</span>)
plt.xlabel(<span class="hljs-string">"Predicted"</span>)
plt.ylabel(<span class="hljs-string">"True"</span>)
plt.title(<span class="hljs-string">"Confusion Matrix"</span>)
plt.tight_layout()
plt.show()
</code></pre>
<p>When you run <code>python eval.py</code>, a heatmap like this will pop up:</p>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/confusion-matrix.png?raw=true" alt="Confusion matrix" width="600" height="400" loading="lazy"></p>
<p>Figure 3. Confusion matrix on the validation set. Correct predictions appear along the diagonal. Off-diagonal counts show gesture confusions.</p>
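<p>A heatmap is easy to scan, but it can also help to reduce the matrix to one number per class. As a rough sketch, per-class recall is the diagonal entry divided by the row total. The matrix values and class names below are made up for illustration and are not results from this project.</p>

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class).
class_names = ["swipe_left", "swipe_right", "stop"]
cm = np.array([
    [3, 1, 0],
    [1, 2, 0],
    [0, 0, 3],
])

support = cm.sum(axis=1)  # number of true clips per class
correct = np.diag(cm)     # correctly classified clips per class

# Classes with no validation clips get a recall of 0 instead of a division error.
recall = np.divide(correct, support, out=np.zeros(len(cm)), where=support > 0)

for name, r, n in zip(class_names, recall, support):
    print(f"{name}: recall {r:.2f} over {n} clips")
print(f"overall accuracy: {correct.sum() / cm.sum():.2f}")
```

<p>With the real <code>cm</code> from <code>eval.py</code>, you would pass your own <code>labels</code> list in place of <code>class_names</code>.</p>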
<h4 id="heading-3-latency-benchmark">3. Latency Benchmark</h4>
<p>Finally, let’s see how fast inference runs. Save the following as <code>benchmark.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time, numpy <span class="hljs-keyword">as</span> np, onnxruntime
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)

ort = onnxruntime.InferenceSession(<span class="hljs-string">"vit_temporal.onnx"</span>, providers=[<span class="hljs-string">"CPUExecutionProvider"</span>])
INPUT_NAME = ort.get_inputs()[<span class="hljs-number">0</span>].name
OUTPUT_NAME = ort.get_outputs()[<span class="hljs-number">0</span>].name

dummy = np.random.randn(<span class="hljs-number">1</span>, <span class="hljs-number">16</span>, <span class="hljs-number">3</span>, <span class="hljs-number">224</span>, <span class="hljs-number">224</span>).astype(np.float32)

<span class="hljs-comment"># Warmup</span>
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(<span class="hljs-number">3</span>):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})

<span class="hljs-comment"># Benchmark (perf_counter is a monotonic, high-resolution clock meant for timing)</span>
t0 = time.perf_counter()
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(<span class="hljs-number">50</span>):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})
t1 = time.perf_counter()

print(<span class="hljs-string">f"Average latency: <span class="hljs-subst">{(t1 - t0)/<span class="hljs-number">50</span>:<span class="hljs-number">.3</span>f}</span> seconds per clip"</span>)
</code></pre>
<p>Run: <code>python benchmark.py</code></p>
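<p>On CPU, you might see roughly 0.05–0.15s per clip, though the exact figure depends heavily on your processor and the size of the ViT backbone. This script only measures the CPU provider. To benchmark on a GPU, install the <code>onnxruntime-gpu</code> package and add <code>"CUDAExecutionProvider"</code> to the <code>providers</code> list, which is typically much faster.</p>
<p><strong>Note</strong>: If latency is high, you can apply <strong>quantization</strong> (for example, ONNX Runtime's dynamic INT8 quantization) to shrink the model and speed up CPU inference, usually at some cost in accuracy.</p>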
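<p>As a starting point, dynamic quantization with ONNX Runtime might look like the sketch below. It uses <code>quantize_dynamic</code> from <code>onnxruntime.quantization</code>. The output file name <code>vit_temporal.int8.onnx</code> and the helper name <code>quantize_model</code> are assumptions for this example. Whether it helps depends on your model and hardware, so measure latency and accuracy again afterwards.</p>

```python
import os


def quantize_model(src="vit_temporal.onnx", dst="vit_temporal.int8.onnx"):
    """Write an INT8 dynamically quantized copy of src to dst and return dst."""
    # Imported lazily so this file can be imported without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    # Weights are stored as 8-bit integers; activations are quantized at run time.
    quantize_dynamic(src, dst, weight_type=QuantType.QInt8)
    return dst


if __name__ == "__main__":
    if os.path.exists("vit_temporal.onnx"):
        out = quantize_model()
        before = os.path.getsize("vit_temporal.onnx") / 1e6
        after = os.path.getsize(out) / 1e6
        print(f"Original: {before:.1f} MB | Quantized: {after:.1f} MB")
    else:
        print("vit_temporal.onnx not found; run export_onnx.py first.")
```

<p>To compare speeds, point <code>benchmark.py</code> at the quantized file and run it again.</p>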
<h2 id="heading-option-2-use-small-samples-from-public-gesture-datasets">Option 2: Use Small Samples from Public Gesture Datasets</h2>
<p>If you’d prefer to see your model trained on <em>real</em> gesture clips instead of synthetic moving boxes, you can grab a handful of videos from open datasets. You don’t need to download an entire dataset, which can be several GB. A few <code>.mp4</code> samples are enough to follow along.</p>
<h3 id="heading-recommended-sources">Recommended sources</h3>
<ul>
<li><p><strong>20BN Jester Dataset</strong>: Contains short clips of hand gestures such as swiping, thumbs up, and a stop sign.</p>
</li>
<li><p><strong>WLASL</strong>: A large-scale dataset of isolated American Sign Language (ASL) words.</p>
</li>
</ul>
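<p>Both provide short clips you can use as realistic training examples. If the copy you download stores clips as image frames or in another video format, convert them to <code>.mp4</code> first. Links to both are in the Next Steps section below.</p>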
<h3 id="heading-setting-up-your-dataset-folder">Setting up your dataset folder</h3>
<p>Once you download a few clips, place them in the <code>data/</code> folder under subfolders named after each gesture class. For example:</p>
<pre><code class="lang-plaintext">data/
├── swipe_left/
│   ├── clip1.mp4
│   └── clip2.mp4
├── swipe_right/
│   ├── clip1.mp4
│   └── clip2.mp4
└── stop/
    ├── clip1.mp4
    └── clip2.mp4
</code></pre>
<p>And update <code>labels.txt</code> to match the folder names:</p>
<pre><code class="lang-plaintext">swipe_left
swipe_right
stop
</code></pre>
<p>Now your dataset is ready, and the same scripts from earlier (<code>train.py</code>, <code>eval.py</code>) will work without modification. After retraining, rerun <code>export_onnx.py</code> so that the app and the benchmark use the new weights.</p>
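<p>A mismatch between folder names and <code>labels.txt</code> is an easy mistake to make at this stage, so a quick sanity check before training can save time. The sketch below assumes <code>labels.txt</code> holds one class name per line, as in the example above. The helper name <code>check_dataset</code> is illustrative and not part of the project.</p>

```python
from pathlib import Path


def check_dataset(data_dir="data", labels_file="labels.txt"):
    """Compare class folders against labels.txt.

    Returns (missing, extra, counts):
      missing - labels that have no folder under data_dir
      extra   - folders under data_dir that are not listed in labels_file
      counts  - number of .mp4 clips found for each label that has a folder
    """
    data_dir = Path(data_dir)
    lines = Path(labels_file).read_text().splitlines()
    labels = [ln.strip() for ln in lines if ln.strip()]
    folders = {p.name for p in data_dir.iterdir() if p.is_dir()}

    missing = [name for name in labels if name not in folders]
    extra = sorted(folders - set(labels))
    counts = {
        name: len(list((data_dir / name).glob("*.mp4")))
        for name in labels
        if name in folders
    }
    return missing, extra, counts


if __name__ == "__main__":
    if Path("data").is_dir() and Path("labels.txt").is_file():
        missing, extra, counts = check_dataset()
        print("Labels without a folder:", missing)
        print("Folders without a label:", extra)
        print("Clips per class:", counts)
```

<p>If either of the first two lists is non-empty, rename the folders or edit <code>labels.txt</code> until they agree.</p>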
<h3 id="heading-why-choose-this-option">Why choose this option?</h3>
<ul>
<li><p>Gives more realistic results than synthetic coloured boxes</p>
</li>
<li><p>Lets you see how the model handles <em>actual human hand movements</em></p>
</li>
<li><p>The trade-off is a bit more effort (downloading clips, trimming them if needed)</p>
</li>
</ul>
<p><strong>Tip:</strong> If downloading from these datasets feels too heavy, you can also record your own short gestures using your laptop webcam. Just save them as <code>.mp4</code> files and organize them in the same folder structure.</p>
<h2 id="heading-accessibility-notes-amp-ethical-limits">Accessibility Notes &amp; Ethical Limits</h2>
<p>While this project shows the technical workflow for gesture recognition with Transformers, it’s important to step back and consider the <strong>human context</strong>:</p>
<ul>
<li><p><strong>Accessibility first</strong>: Tools like this can help students with speech or motor difficulties, but they should always be co-designed with the people who will use them. Don’t assume one-size-fits-all.</p>
</li>
<li><p><strong>Dataset sensitivity</strong>: Using publicly available sign or gesture datasets is fine for prototyping, but deploying such a system requires careful consideration of consent and representation.</p>
</li>
<li><p><strong>Error tolerance</strong>: Even small misclassifications can have big consequences in accessibility contexts (for example, confusing <em>stop</em> with <em>go</em>). Always plan for fallback options (like manual input or confirmation).</p>
</li>
<li><p><strong>Bias and inclusivity</strong>: Models trained on narrow datasets may fail for different skin tones, lighting conditions, or cultural gesture variations. Broad and diverse training data is essential for fairness.</p>
</li>
</ul>
<p>In other words: this demo is a <strong>teaching scaffold</strong>, not a production-ready accessibility tool. Responsible deployment requires collaboration with educators, therapists, and end users.</p>
<h2 id="heading-next-steps">Next Steps</h2>
<p>If you’d like to push this project further, here are some directions to explore:</p>
<ul>
<li><p><strong>Better models</strong>: Try video-focused Transformers like <a target="_blank" href="https://arxiv.org/abs/2102.05095">TimeSformer</a> or <a target="_blank" href="https://arxiv.org/abs/2203.12602">VideoMAE</a> for stronger temporal reasoning.</p>
</li>
<li><p><strong>Larger vocabularies</strong>: Add more gesture classes, build your own dataset, or use portions of public datasets like <a target="_blank" href="https://www.kaggle.com/datasets/toxicmender/20bn-jester">20BN Jester</a> or <a target="_blank" href="https://www.kaggle.com/datasets/risangbaskoro/wlasl-processed">WLASL</a>.</p>
</li>
<li><p><strong>Pose fusion</strong>: Combine gesture video with human pose keypoints from <a target="_blank" href="https://mediapipe.readthedocs.io/en/latest/solutions/hands.html">MediaPipe</a> or <a target="_blank" href="https://github.com/CMU-Perceptual-Computing-Lab/openpose">OpenPose</a> for more robust predictions.</p>
</li>
<li><p><strong>Real-time smoothing</strong>: Implement temporal smoothing or debounce logic in the app so predictions are more stable during live use.</p>
</li>
<li><p><strong>Quantization + edge devices</strong>: Convert your ONNX model to an INT8 quantized version and deploy it on a Raspberry Pi or Jetson Nano for classroom-ready prototypes.</p>
</li>
</ul>
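<p>To make the smoothing idea concrete, here is one possible sketch: an exponential moving average over the class probabilities, plus a simple debounce that only switches the reported gesture after the same class wins a few times in a row. The class name <code>PredictionSmoother</code> and the values for <code>alpha</code>, <code>min_conf</code>, and <code>hold</code> are assumptions chosen for illustration. You would need to tune them for your own app.</p>

```python
import numpy as np


class PredictionSmoother:
    """Exponential moving average over class probabilities with a debounce."""

    def __init__(self, num_classes, alpha=0.5, min_conf=0.6, hold=2):
        self.alpha = alpha        # weight given to the newest prediction
        self.min_conf = min_conf  # smoothed probability needed to accept a class
        self.hold = hold          # consecutive wins needed before switching
        self.ema = np.full(num_classes, 1.0 / num_classes)
        self.current = None       # class index currently reported
        self._candidate = None
        self._streak = 0

    def update(self, probs):
        """Blend in one probability vector and return the stable class index."""
        probs = np.asarray(probs, dtype=np.float64)
        self.ema = self.alpha * probs + (1.0 - self.alpha) * self.ema
        top = int(self.ema.argmax())
        if self.ema[top] >= self.min_conf:
            # Count how many times in a row this class has been the confident winner.
            self._streak = self._streak + 1 if top == self._candidate else 1
            self._candidate = top
            if self._streak >= self.hold:
                self.current = top
        else:
            # Not confident enough: keep the previous output and restart the count.
            self._candidate, self._streak = None, 0
        return self.current


# Hypothetical stream of predictions for 3 classes, with glitches at steps 2 and 6.
stream = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.90, 0.05, 0.05],
    [0.90, 0.05, 0.05],
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
]
smoother = PredictionSmoother(num_classes=3)
outputs = [smoother.update(p) for p in stream]
print(outputs)
```

<p>The smoother reports nothing until one class has been confidently ahead for <code>hold</code> updates, and a single outlier prediction does not flip the output. The cost is a short delay before a genuinely new gesture is reported.</p>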
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you built a gesture recognition system around a Transformer model. You prepared a small dataset, fine-tuned a Vision Transformer with temporal pooling, exported the model to ONNX for efficient inference, and wrapped it in a Gradio app that classifies short clips. You also measured validation accuracy and inference latency, which gives you a baseline to improve on.</p>
<p>This project illustrates how you can leverage advanced ML methods to enhance accessibility and communication, paving the way for more inclusive learning environments.</p>
<p>Remember: while this demo works with small datasets, real-world applications need larger, more diverse data and careful consideration of accessibility, inclusivity, and ethics.</p>
<p>Here’s the GitHub repo for full source code: <a target="_blank" href="https://github.com/tayo4christ/transformer-gesture">transformer-gesture</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
