<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ OMOTAYO OMOYEMI - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ OMOTAYO OMOYEMI - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Fri, 22 May 2026 10:06:50 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/tayo4christ/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How I Built a Makaton AI Companion Using Gemini Nano and the Gemini API ]]>
                </title>
                <description>
                    <![CDATA[ When I started my research on AI systems that could translate Makaton (a sign and symbol language designed to support speech and communication), I wanted to bridge a gap in accessibility for learners with speech or language difficulties. Over time, t... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-i-built-a-makaton-ai-companion-using-gemini-nano-and-the-gemini-api/</link>
                <guid isPermaLink="false">690e1f43cb50ea9684f6d9aa</guid>
                
                    <category>
                        <![CDATA[ geminiAPI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ gemini-nano ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Fri, 07 Nov 2025 16:33:07 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762533154134/e2209ade-6971-464b-aeef-f05abd0a30d7.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When I started my research on AI systems that could translate Makaton (a sign and symbol language designed to support speech and communication), I wanted to bridge a gap in accessibility for learners with speech or language difficulties.</p>
<p>Over time, this academic interest evolved into a working prototype that combines on-device AI and cloud AI to describe images and translate them into English meanings. The idea was simple: build a lightweight web app that recognizes Makaton gestures or symbols and instantly provides an English interpretation.</p>
<p>In this article, I’ll walk you through how I built my Makaton AI Companion, a single-page web app powered by Gemini Nano (on-device) and the Gemini API (cloud). You’ll see how it works, how I solved common issues like CORS and API model errors, and how this small project became part of my journey toward AI for accessibility.</p>
<p>By the end of this article, you will be able to:</p>
<ul>
<li><p>Understand the core concept behind Makaton and why it’s important in accessibility and inclusive education.</p>
</li>
<li><p>Combine on-device AI (Gemini Nano) and cloud-based AI (Gemini API) in a single web project.</p>
</li>
<li><p>Build a functional AI-powered web app that can describe images and map them to predefined English meanings.</p>
</li>
<li><p>Handle common errors such as model endpoint issues, missing API keys, and CORS restrictions when working with generative AI APIs.</p>
</li>
<li><p>Store an API key locally in the browser using <code>localStorage</code>.</p>
</li>
<li><p>Use browser speech synthesis to convert the AI-generated English meanings into spoken output.</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-tools-and-tech-stack">Tools and Tech Stack</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-building-the-app-step-by-step">Building the App Step by Step</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-fix-the-common-issues">How to Fix the Common Issues</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-demo-the-makaton-ai-companion-in-action">Demo: The Makaton AI Companion in Action</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-broader-reflections">Broader Reflections</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-tools-and-tech-stack">Tools and Tech Stack</h2>
<p>To build the Makaton AI Companion, I wanted something lightweight, fast to prototype, and easy for anyone to run without complicated dependencies. I chose a plain web stack with a focus on accessibility and transparency.</p>
<p>Here’s what I used:</p>
<h3 id="heading-frontend">Frontend</h3>
<ul>
<li><p><strong>HTML + CSS + JavaScript (Vanilla):</strong> No frameworks, just clean and understandable code that any beginner can follow.</p>
</li>
<li><p>A single <code>index.html</code> page handles the upload interface, output display, and AI logic.</p>
</li>
</ul>
<h3 id="heading-ai-components">AI Components</h3>
<ul>
<li><p><strong>Gemini Nano</strong> runs locally in Chrome Canary. This on-device model lets users generate short text without calling the cloud API.</p>
</li>
<li><p><strong>Gemini API (Cloud)</strong> is used as a fallback when on-device AI isn’t available or when image analysis is required.</p>
<ul>
<li><p>Models tested: <code>gemini-1.5-flash</code> and <code>gemini-pro-vision</code>.</p>
</li>
<li><p>Fallback logic tries the next available model endpoint if one returns a 404 error.</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-local-storage">Local Storage</h3>
<ul>
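<li>The Gemini API key is stored in the browser’s <code>localStorage</code> rather than on a backend server. It leaves the user’s computer only as part of requests to the Gemini API. Keep in mind that <code>localStorage</code> is readable by any script running on the page, so this approach suits a local prototype better than a public deployment.</li>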
</ul>
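<p>As a rough sketch of what that storage step could look like, you can wrap the storage calls in two small functions. The helper names <code>saveApiKey</code> and <code>loadApiKey</code> and the storage key <code>"gemini_api_key"</code> are illustrative assumptions, not necessarily what the app itself uses. Passing the storage object in as a parameter keeps the helpers easy to try outside a browser:</p>

```javascript
// Sketch only: STORAGE_KEY, saveApiKey and loadApiKey are hypothetical names.
const STORAGE_KEY = "gemini_api_key";

// Save the trimmed key, or remove the entry when the field is left empty.
function saveApiKey(storage, rawKey) {
  const key = String(rawKey ?? "").trim();
  if (key) {
    storage.setItem(STORAGE_KEY, key);
  } else {
    storage.removeItem(STORAGE_KEY);
  }
  return key;
}

// Read the key back; returns "" when nothing has been saved yet.
function loadApiKey(storage) {
  return storage.getItem(STORAGE_KEY) ?? "";
}
```

<p>In the browser you would pass <code>localStorage</code> itself, for example <code>saveApiKey(localStorage, document.getElementById("apiKey").value)</code>.</p>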
<h3 id="heading-browser-speechsynthesis-api">Browser SpeechSynthesis API</h3>
<ul>
<li>Converts the translated English meaning into spoken audio with one click.</li>
</ul>
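<p>Here is a minimal sketch of that step, assuming the standard Web Speech API (<code>speechSynthesis</code> and <code>SpeechSynthesisUtterance</code>). The function name <code>speakMeaning</code>, the <code>en-GB</code> language tag, and the optional second parameter (which lets you inject the two browser objects) are illustrative assumptions:</p>

```javascript
// Sketch only: speakMeaning is a hypothetical helper, not the app's actual code.
function speakMeaning(text, env = globalThis) {
  const synth = env.speechSynthesis;
  const Utterance = env.SpeechSynthesisUtterance;
  // Bail out quietly when there is nothing to say or the API is missing.
  if (!text || !synth || typeof Utterance !== "function") return false;

  synth.cancel();                 // stop anything still being spoken
  const utterance = new Utterance(text);
  utterance.lang = "en-GB";       // assumption: a British English voice
  synth.speak(utterance);
  return true;
}
```

<p>Wired to the page, this could be called from the Speak button’s click handler with the text content of the <code>#meaning</code> element. The boolean return value makes it easy to show a message when speech isn’t supported.</p>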
<h3 id="heading-mapping-logic">Mapping Logic</h3>
<ul>
<li>A small custom dictionary (<code>mapping.js</code>) links AI-generated descriptions to likely Makaton meanings. For example: <code>{ keywords: ["open hand", "raised hand", "wave"], meaning: "Hello / Stop" }</code></li>
</ul>
<h3 id="heading-local-server">Local Server</h3>
<ul>
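<li><p>The app is served locally using Python’s built-in HTTP server. Opening <code>index.html</code> straight from disk (a <code>file://</code> URL) makes the browser block the ES module scripts with a CORS error, and serving over <code>http://localhost</code> avoids that:</p>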
<p>  <code>python -m http.server 8080</code></p>
</li>
</ul>
<p>Then open <code>http://localhost:8080</code> in Chrome Canary.</p>
<h2 id="heading-building-the-app-step-by-step">Building the App Step by Step</h2>
<p>Now let’s dive into how the Makaton AI Companion works under the hood. The project follows a simple flow: upload an image → describe it (AI) → map the description to a meaning → speak or copy the result.</p>
<p>We’ll go through each part step by step.</p>
<h3 id="heading-1-setting-up-the-project-folder">1. Setting Up the Project Folder</h3>
<p>You don’t need any complex setup. Just create a new folder and add these files:</p>
<pre><code class="lang-plaintext">makaton-ai-companion/
│
├── index.html
├── styles.css
├── app.js
└── lib/
    ├── mapping.js
    └── ai.js
</code></pre>
<p>If you prefer a ready-to-run version, you can grab everything as a single zip instead (I’ll share a GitHub link at the end).</p>
<h3 id="heading-2-creating-the-basic-html-structure">2. Creating the Basic HTML Structure</h3>
<p>Your <code>index.html</code> file defines the interface where users upload an image, click <em>Describe</em>, and view the results.</p>
<pre><code class="lang-html"><span class="hljs-meta">&lt;!DOCTYPE <span class="hljs-meta-keyword">html</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">html</span> <span class="hljs-attr">lang</span>=<span class="hljs-string">"en"</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">head</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">meta</span> <span class="hljs-attr">charset</span>=<span class="hljs-string">"UTF-8"</span> /&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">meta</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"viewport"</span> <span class="hljs-attr">content</span>=<span class="hljs-string">"width=device-width, initial-scale=1.0"</span>/&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">title</span>&gt;</span>Makaton AI Companion<span class="hljs-tag">&lt;/<span class="hljs-name">title</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">link</span> <span class="hljs-attr">rel</span>=<span class="hljs-string">"stylesheet"</span> <span class="hljs-attr">href</span>=<span class="hljs-string">"styles.css"</span>/&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">head</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">body</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">header</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"app-header"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>🧩 Makaton AI Companion<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnSettings"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn secondary"</span>&gt;</span>Settings<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
  <span class="hljs-tag">&lt;/<span class="hljs-name">header</span>&gt;</span>

  <span class="hljs-tag">&lt;<span class="hljs-name">main</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"container"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">section</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"card"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">h2</span>&gt;</span>1) Upload an image (Makaton sign/symbol)<span class="hljs-tag">&lt;/<span class="hljs-name">h2</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">label</span> <span class="hljs-attr">for</span>=<span class="hljs-string">"file"</span>&gt;</span>
        Choose an image file
        <span class="hljs-tag">&lt;<span class="hljs-name">input</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"file"</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"file"</span> <span class="hljs-attr">accept</span>=<span class="hljs-string">"image/*"</span> <span class="hljs-attr">title</span>=<span class="hljs-string">"Select an image file"</span>/&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">label</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"preview"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"preview hidden"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">p</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"status"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"status"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"actions"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnDescribe"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn"</span>&gt;</span>Describe (Cloud or Nano)<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnType"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn ghost"</span>&gt;</span>Type a description instead<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"typedBox"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"typed hidden"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">textarea</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"typed"</span> <span class="hljs-attr">rows</span>=<span class="hljs-string">"3"</span> <span class="hljs-attr">placeholder</span>=<span class="hljs-string">"Describe what you see..."</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">textarea</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnUseTyped"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn"</span>&gt;</span>Use this description<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">section</span>&gt;</span>

    <span class="hljs-tag">&lt;<span class="hljs-name">section</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"card"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">h2</span>&gt;</span>2) AI Output<span class="hljs-tag">&lt;/<span class="hljs-name">h2</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"grid"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span>Image Description<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"output"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"output"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span>English Meaning (Mapped)<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"meaning"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"meaning"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"actions"</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnSpeak"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn ghost"</span> <span class="hljs-attr">disabled</span>&gt;</span>🔊 Speak<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnCopy"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn ghost"</span> <span class="hljs-attr">disabled</span>&gt;</span>📋 Copy<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
          <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">section</span>&gt;</span>
  <span class="hljs-tag">&lt;/<span class="hljs-name">main</span>&gt;</span>

  <span class="hljs-tag">&lt;<span class="hljs-name">dialog</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"settings"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">form</span> <span class="hljs-attr">method</span>=<span class="hljs-string">"dialog"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"settings-form"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">h2</span>&gt;</span>Settings<span class="hljs-tag">&lt;/<span class="hljs-name">h2</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">label</span>&gt;</span>Gemini API key (optional)<span class="hljs-tag">&lt;<span class="hljs-name">input</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"apiKey"</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"password"</span> <span class="hljs-attr">placeholder</span>=<span class="hljs-string">"AIza..."</span>/&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">label</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"settings-actions"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnSaveKey"</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"submit"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn"</span>&gt;</span>Save<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnCloseSettings"</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"button"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn secondary"</span>&gt;</span>Close<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"apiStatus"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"api-status"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">form</span>&gt;</span>
  <span class="hljs-tag">&lt;/<span class="hljs-name">dialog</span>&gt;</span>

  <span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"module"</span> <span class="hljs-attr">src</span>=<span class="hljs-string">"lib/mapping.js"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"module"</span> <span class="hljs-attr">src</span>=<span class="hljs-string">"lib/ai.js"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"module"</span> <span class="hljs-attr">src</span>=<span class="hljs-string">"app.js"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">body</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">html</span>&gt;</span>
</code></pre>
<p>This interface is intentionally minimal: no frameworks, no build tools, just clear HTML.</p>
<h3 id="heading-3-mapping-descriptions-to-makaton-meanings">3. Mapping Descriptions to Makaton Meanings</h3>
<p>The <code>mapping.js</code> file holds a simple keyword-based dictionary. When the AI describes an image (like <em>“a raised open hand”</em>), the app searches for keywords that match known Makaton signs.</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// lib/mapping.js</span>

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> MAKATON_GLOSSES = [
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"open hand"</span>, <span class="hljs-string">"raised hand"</span>, <span class="hljs-string">"wave"</span>, <span class="hljs-string">"hand up"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Hello / Stop"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"eat"</span>, <span class="hljs-string">"food"</span>, <span class="hljs-string">"spoon"</span>, <span class="hljs-string">"hand to mouth"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Eat"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"drink"</span>, <span class="hljs-string">"cup"</span>, <span class="hljs-string">"glass"</span>, <span class="hljs-string">"bottle"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Drink"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"home"</span>, <span class="hljs-string">"house"</span>, <span class="hljs-string">"roof"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Home"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"sleep"</span>, <span class="hljs-string">"bed"</span>, <span class="hljs-string">"eyes closed"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Sleep"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"book"</span>, <span class="hljs-string">"reading"</span>, <span class="hljs-string">"pages"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Book / Read"</span> },
  <span class="hljs-comment">// Added so the demo screenshot maps correctly:</span>
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"help"</span>, <span class="hljs-string">"assist"</span>, <span class="hljs-string">"thumb on palm"</span>, <span class="hljs-string">"hand over hand"</span>, <span class="hljs-string">"assisting"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Help"</span> },
];

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">mapDescriptionToMeaning</span>(<span class="hljs-params">desc</span>) </span>{
  <span class="hljs-keyword">if</span> (!desc) <span class="hljs-keyword">return</span> <span class="hljs-string">""</span>;
  <span class="hljs-keyword">const</span> d = desc.toLowerCase();
  <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> entry <span class="hljs-keyword">of</span> MAKATON_GLOSSES) {
    <span class="hljs-keyword">if</span> (entry.keywords.some(<span class="hljs-function"><span class="hljs-params">k</span> =&gt;</span> d.includes(k))) <span class="hljs-keyword">return</span> entry.meaning;
  }
  <span class="hljs-keyword">if</span> (d.includes(<span class="hljs-string">"hand"</span>)) <span class="hljs-keyword">return</span> <span class="hljs-string">"Gesture / Hand sign (clarify)"</span>;
  <span class="hljs-keyword">return</span> <span class="hljs-string">"No direct mapping found."</span>;
}
</code></pre>
<p>It’s simple but effective enough to simulate real symbol-to-language translation for demo purposes.</p>
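<p>One caveat worth knowing: <code>includes</code> matches substrings, so a description containing “microwave” would hit the <code>wave</code> keyword and “described” would hit <code>bed</code>. If that becomes a problem, a stricter variant can match whole words only. The sketch below is an optional alternative, not the code the app uses. <code>normalize</code>, <code>matchesKeyword</code>, and <code>mapWholeWords</code> are hypothetical names, and the glossary is trimmed to two entries to keep it short:</p>

```javascript
// Sketch only: a whole-word variant of the keyword lookup (hypothetical names).
const GLOSSES = [
  { keywords: ["open hand", "raised hand", "wave"], meaning: "Hello / Stop" },
  { keywords: ["sleep", "bed", "eyes closed"], meaning: "Sleep" },
];

// Lowercase the text and reduce it to words separated by single spaces.
// Assumption: descriptions are plain English, so only a-z and 0-9 are kept.
function normalize(text) {
  return String(text).toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// True when the keyword appears as a whole word or phrase in the description.
function matchesKeyword(description, keyword) {
  return ` ${normalize(description)} `.includes(` ${normalize(keyword)} `);
}

// Return the first glossary meaning with a whole-word match, or "".
function mapWholeWords(description, glosses = GLOSSES) {
  const entry = glosses.find(g => g.keywords.some(k => matchesKeyword(description, k)));
  return entry ? entry.meaning : "";
}
```

<p>With this version, <code>mapWholeWords("A person using a microwave")</code> returns an empty string, while <code>mapWholeWords("A raised hand, palm open.")</code> returns <code>"Hello / Stop"</code>. The trade-off is that inflected forms such as “waves” no longer match unless you add them as keywords.</p>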
<h3 id="heading-4-adding-gemini-ai-logic">4. Adding Gemini AI Logic</h3>
<p>The <code>ai.js</code> file connects to Gemini Nano (on-device) or the Gemini API (cloud). If Nano isn’t available, the app falls back to the cloud model. And if that fails, it lets users type a description manually.</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// lib/ai.js — dynamic model discovery (try-all version)</span>

<span class="hljs-comment">// --- On-device availability (Gemini Nano) ---</span>
<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">checkAvailability</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">const</span> res = { <span class="hljs-attr">nanoTextPossible</span>: <span class="hljs-literal">false</span> };
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> canCreate = self.ai?.canCreateTextSession || self.ai?.languageModel?.canCreate;
    <span class="hljs-keyword">if</span> (<span class="hljs-keyword">typeof</span> canCreate === <span class="hljs-string">"function"</span>) {
      <span class="hljs-keyword">const</span> ok = <span class="hljs-keyword">await</span> (self.ai.canCreateTextSession?.() || self.ai.languageModel.canCreate?.());
      res.nanoTextPossible = ok === <span class="hljs-string">"readily"</span> || ok === <span class="hljs-string">"after-download"</span> || ok === <span class="hljs-literal">true</span>;
    }
  } <span class="hljs-keyword">catch</span> { <span class="hljs-comment">/* Nano API missing or it threw: report it as unavailable */</span> }
  <span class="hljs-keyword">return</span> res;
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">createNanoTextSession</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">if</span> (self.ai?.createTextSession) <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> self.ai.createTextSession();
  <span class="hljs-keyword">if</span> (self.ai?.languageModel?.create) <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> self.ai.languageModel.create();
  <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"Gemini Nano text session not available"</span>);
}

<span class="hljs-comment">// --- Cloud: dynamically discover models for this key ---</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">listModels</span>(<span class="hljs-params">key</span>) </span>{
  <span class="hljs-keyword">const</span> url = <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/models?key="</span> + <span class="hljs-built_in">encodeURIComponent</span>(key);
  <span class="hljs-keyword">const</span> r = <span class="hljs-keyword">await</span> fetch(url);
  <span class="hljs-keyword">if</span> (!r.ok) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"ListModels failed: "</span> + (<span class="hljs-keyword">await</span> r.text()));
  <span class="hljs-keyword">const</span> j = <span class="hljs-keyword">await</span> r.json();
  <span class="hljs-keyword">return</span> (j.models || []).map(<span class="hljs-function"><span class="hljs-params">m</span> =&gt;</span> m.name).filter(<span class="hljs-built_in">Boolean</span>);
}

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">rankModels</span>(<span class="hljs-params">names</span>) </span>{
  <span class="hljs-comment">// Prefer Gemini 1.5 (multimodal), then flash variants, then anything with vision/pro.</span>
  <span class="hljs-keyword">return</span> names
    .filter(<span class="hljs-function"><span class="hljs-params">n</span> =&gt;</span> n.startsWith(<span class="hljs-string">"models/"</span>))              <span class="hljs-comment">// ignore tunedModels, etc.</span>
    .filter(<span class="hljs-function"><span class="hljs-params">n</span> =&gt;</span> !n.includes(<span class="hljs-string">"experimental"</span>))          <span class="hljs-comment">// skip experimental</span>
    .sort(<span class="hljs-function">(<span class="hljs-params">a, b</span>) =&gt;</span> score(b) - score(a));

  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">score</span>(<span class="hljs-params">n</span>) </span>{
    <span class="hljs-keyword">let</span> s = <span class="hljs-number">0</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"1.5"</span>)) s += <span class="hljs-number">10</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"flash"</span>)) s += <span class="hljs-number">8</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"pro-vision"</span>)) s += <span class="hljs-number">7</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"pro"</span>)) s += <span class="hljs-number">6</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"vision"</span>)) s += <span class="hljs-number">5</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"latest"</span>)) s += <span class="hljs-number">2</span>;
    <span class="hljs-keyword">return</span> s;
  }
}

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">tryGenerateForModels</span>(<span class="hljs-params">imageDataUrl, key, models, mimeType</span>) </span>{
  <span class="hljs-keyword">const</span> base64 = imageDataUrl.split(<span class="hljs-string">","</span>)[<span class="hljs-number">1</span>];
  <span class="hljs-keyword">const</span> body = {
    <span class="hljs-attr">contents</span>: [{
      <span class="hljs-attr">parts</span>: [
        { <span class="hljs-attr">text</span>: <span class="hljs-string">"Describe this image briefly in one sentence focusing on the main gesture or symbol."</span> },
        { <span class="hljs-attr">inline_data</span>: { <span class="hljs-attr">mime_type</span>: mimeType || <span class="hljs-string">"image/png"</span>, <span class="hljs-attr">data</span>: base64 } }
      ]
    }]
  };
  <span class="hljs-keyword">let</span> lastErr = <span class="hljs-string">""</span>;
  <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> model <span class="hljs-keyword">of</span> models) {
    <span class="hljs-keyword">const</span> endpoint = <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/"</span> + model + <span class="hljs-string">":generateContent?key="</span> + <span class="hljs-built_in">encodeURIComponent</span>(key);
    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">const</span> r = <span class="hljs-keyword">await</span> fetch(endpoint, { <span class="hljs-attr">method</span>: <span class="hljs-string">"POST"</span>, <span class="hljs-attr">headers</span>: { <span class="hljs-string">"Content-Type"</span>: <span class="hljs-string">"application/json"</span> }, <span class="hljs-attr">body</span>: <span class="hljs-built_in">JSON</span>.stringify(body)});
      <span class="hljs-keyword">if</span> (!r.ok) { lastErr = <span class="hljs-keyword">await</span> r.text().catch(<span class="hljs-function">()=&gt;</span><span class="hljs-built_in">String</span>(r.status)); <span class="hljs-keyword">continue</span>; }
      <span class="hljs-keyword">const</span> j = <span class="hljs-keyword">await</span> r.json();
      <span class="hljs-keyword">const</span> text = j?.candidates?.[<span class="hljs-number">0</span>]?.content?.parts?.map(<span class="hljs-function"><span class="hljs-params">p</span>=&gt;</span>p.text).filter(<span class="hljs-built_in">Boolean</span>).join(<span class="hljs-string">" "</span>).trim();
      <span class="hljs-keyword">if</span> (text) <span class="hljs-keyword">return</span> text;
      lastErr = <span class="hljs-string">"Empty response from "</span> + model;
    } <span class="hljs-keyword">catch</span> (e) {
      lastErr = <span class="hljs-built_in">String</span>(e?.message || e);
    }
  }
  <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"All discovered models failed. Last error: "</span> + lastErr);
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">describeImageWithGemini</span>(<span class="hljs-params">imageDataUrl, apiKey, mimeType = <span class="hljs-string">"image/png"</span></span>) </span>{
  <span class="hljs-keyword">if</span> (!apiKey) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"No API key provided"</span>);

  <span class="hljs-keyword">const</span> models = <span class="hljs-keyword">await</span> listModels(apiKey);
  <span class="hljs-keyword">if</span> (!models.length) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"No models returned for this key. Ensure Generative Language API is enabled and T&amp;Cs accepted in AI Studio."</span>);

  <span class="hljs-keyword">const</span> ranked = rankModels(models);
  <span class="hljs-keyword">if</span> (!ranked.length) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"No usable model names returned (models/*)."</span>);

  <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> tryGenerateForModels(imageDataUrl, apiKey, ranked, mimeType);
}

<span class="hljs-comment">// --- Key storage (local only) ---</span>
<span class="hljs-keyword">const</span> KEY = <span class="hljs-string">"makaton_demo_gemini_key"</span>;
<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">saveApiKey</span>(<span class="hljs-params">k</span>) </span>{ <span class="hljs-built_in">localStorage</span>.setItem(KEY, k || <span class="hljs-string">""</span>); }
<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">loadApiKey</span>(<span class="hljs-params"></span>) </span>{ <span class="hljs-keyword">return</span> <span class="hljs-built_in">localStorage</span>.getItem(KEY) || <span class="hljs-string">""</span>; }
</code></pre>
<p>Note: This retry system is essential because many users run into 404 model errors when a particular Gemini version is not available to their account.</p>
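<p>One detail worth making explicit: <code>tryGenerateForModels</code> takes everything after the first comma of the data URL as the base64 payload and relies on the caller for the MIME type. As a minimal sketch (the <code>parseDataUrl</code> helper is hypothetical and not part of the project’s <code>ai.js</code>), both values could be read from the data URL itself:</p>
<pre><code class="lang-javascript">// Hypothetical helper (not in the article's ai.js): split a base64 data URL
// into its MIME type and payload. fallbackMime is used when the URL omits a type.
function parseDataUrl(dataUrl, fallbackMime) {
  const text = String(dataUrl || "");
  const comma = text.indexOf(",");
  if (!text.startsWith("data:") || comma === -1) {
    throw new Error("Not a data URL");
  }
  // The header looks like "image/jpeg;base64" once "data:" is removed.
  const fields = text.slice(5, comma).split(";");
  if (!fields.includes("base64")) {
    throw new Error("Expected a base64-encoded data URL");
  }
  const mimeType = fields[0] || fallbackMime || "image/png";
  return { mimeType: mimeType, base64: text.slice(comma + 1) };
}

const parsed = parseDataUrl("data:image/jpeg;base64,AAAA");
console.log(parsed.mimeType, parsed.base64.length); // image/jpeg 4
</code></pre>
<p>With a helper like this, the request body could use <code>parsed.mimeType</code> and <code>parsed.base64</code> directly instead of depending on a separately tracked MIME type.</p>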
<h3 id="heading-5-the-main-logic-appjs">5. The Main Logic (app.js)</h3>
<p>This script ties everything together: file upload, AI call, meaning mapping, and output display.</p>
<pre><code class="lang-javascript">
<span class="hljs-keyword">import</span> { mapDescriptionToMeaning } <span class="hljs-keyword">from</span> <span class="hljs-string">'./lib/mapping.js'</span>;
<span class="hljs-keyword">import</span> { checkAvailability, createNanoTextSession, describeImageWithGemini, saveApiKey, loadApiKey } <span class="hljs-keyword">from</span> <span class="hljs-string">'./lib/ai.js'</span>;

<span class="hljs-built_in">document</span>.addEventListener(<span class="hljs-string">'DOMContentLoaded'</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] DOM ready'</span>);

  <span class="hljs-keyword">const</span> $ = <span class="hljs-function">(<span class="hljs-params">s</span>) =&gt;</span> <span class="hljs-built_in">document</span>.querySelector(s);

  <span class="hljs-comment">// Elements</span>
  <span class="hljs-keyword">const</span> fileInput   = $(<span class="hljs-string">'#file'</span>);
  <span class="hljs-keyword">const</span> preview     = $(<span class="hljs-string">'#preview'</span>);
  <span class="hljs-keyword">const</span> meaningEl   = $(<span class="hljs-string">'#meaning'</span>);
  <span class="hljs-keyword">const</span> outputEl    = $(<span class="hljs-string">'#output'</span>);
  <span class="hljs-keyword">const</span> btnDescribe = $(<span class="hljs-string">'#btnDescribe'</span>);
  <span class="hljs-keyword">const</span> btnType     = $(<span class="hljs-string">'#btnType'</span>);
  <span class="hljs-keyword">const</span> typedBox    = $(<span class="hljs-string">'#typedBox'</span>);
  <span class="hljs-keyword">const</span> typed       = $(<span class="hljs-string">'#typed'</span>);
  <span class="hljs-keyword">const</span> btnUseTyped = $(<span class="hljs-string">'#btnUseTyped'</span>);
  <span class="hljs-keyword">const</span> btnSpeak    = $(<span class="hljs-string">'#btnSpeak'</span>);
  <span class="hljs-keyword">const</span> btnCopy     = $(<span class="hljs-string">'#btnCopy'</span>);
  <span class="hljs-keyword">const</span> statusEl    = $(<span class="hljs-string">'#status'</span>);

  <span class="hljs-keyword">const</span> settings        = $(<span class="hljs-string">'#settings'</span>);
  <span class="hljs-keyword">const</span> btnSettings     = $(<span class="hljs-string">'#btnSettings'</span>);
  <span class="hljs-keyword">const</span> btnCloseSettings= $(<span class="hljs-string">'#btnCloseSettings'</span>);
  <span class="hljs-keyword">const</span> btnSaveKey      = $(<span class="hljs-string">'#btnSaveKey'</span>);
  <span class="hljs-keyword">const</span> apiKeyInput     = $(<span class="hljs-string">'#apiKey'</span>);
  <span class="hljs-keyword">const</span> apiStatus       = $(<span class="hljs-string">'#apiStatus'</span>);

  <span class="hljs-keyword">let</span> currentImageDataUrl = <span class="hljs-literal">null</span>;
  <span class="hljs-keyword">let</span> currentImageMime    = <span class="hljs-string">"image/png"</span>;

  <span class="hljs-comment">// Sanity logs</span>
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] Elements:'</span>, {
    <span class="hljs-attr">fileInput</span>: !!fileInput, <span class="hljs-attr">preview</span>: !!preview, <span class="hljs-attr">outputEl</span>: !!outputEl,
    <span class="hljs-attr">meaningEl</span>: !!meaningEl, <span class="hljs-attr">btnDescribe</span>: !!btnDescribe, <span class="hljs-attr">statusEl</span>: !!statusEl
  });

  <span class="hljs-comment">// Init API key</span>
  <span class="hljs-keyword">if</span> (apiKeyInput) apiKeyInput.value = loadApiKey() || <span class="hljs-string">""</span>;

  <span class="hljs-comment">// --- Helpers ---</span>
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">setStatus</span>(<span class="hljs-params">text</span>) </span>{
    <span class="hljs-keyword">if</span> (statusEl) statusEl.textContent = text || <span class="hljs-string">''</span>;
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton][Status]'</span>, text);
  }
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">clearOutputs</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">if</span> (outputEl) outputEl.textContent = <span class="hljs-string">''</span>;
    <span class="hljs-keyword">if</span> (meaningEl) meaningEl.textContent = <span class="hljs-string">''</span>;
    <span class="hljs-keyword">if</span> (btnSpeak) btnSpeak.disabled = <span class="hljs-literal">true</span>;
    <span class="hljs-keyword">if</span> (btnCopy)  btnCopy.disabled  = <span class="hljs-literal">true</span>;
  }
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">setOutput</span>(<span class="hljs-params">desc</span>) </span>{
    <span class="hljs-keyword">if</span> (outputEl) outputEl.textContent = desc || <span class="hljs-string">''</span>;
    <span class="hljs-keyword">const</span> meaning = mapDescriptionToMeaning(desc || <span class="hljs-string">''</span>);
    <span class="hljs-keyword">if</span> (meaningEl) meaningEl.textContent = meaning;
    <span class="hljs-keyword">if</span> (btnSpeak) btnSpeak.disabled = !meaning || meaning.includes(<span class="hljs-string">'No direct mapping'</span>);
    <span class="hljs-keyword">if</span> (btnCopy)  btnCopy.disabled  = !meaning;
    setStatus(<span class="hljs-string">'Done.'</span>);
  }
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">fileToDataURL</span>(<span class="hljs-params">file</span>) </span>{
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Promise</span>(<span class="hljs-function">(<span class="hljs-params">resolve, reject</span>) =&gt;</span> {
      <span class="hljs-keyword">const</span> reader = <span class="hljs-keyword">new</span> FileReader();
      reader.onload  = <span class="hljs-function">() =&gt;</span> resolve(reader.result);
      reader.onerror = <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> reject(e);
      reader.readAsDataURL(file);
    });
  }
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">handleFiles</span>(<span class="hljs-params">files</span>) </span>{
    <span class="hljs-keyword">const</span> file = files?.[<span class="hljs-number">0</span>];
    <span class="hljs-keyword">if</span> (!file) { setStatus(<span class="hljs-string">'No file selected.'</span>); <span class="hljs-keyword">return</span>; }
    currentImageMime = file.type || <span class="hljs-string">"image/png"</span>;
    fileToDataURL(file)
      .then(<span class="hljs-function">(<span class="hljs-params">dataUrl</span>) =&gt;</span> {
        currentImageDataUrl = dataUrl;
        <span class="hljs-keyword">if</span> (preview) {
          preview.innerHTML = <span class="hljs-string">`&lt;img alt="preview" src="<span class="hljs-subst">${dataUrl}</span>" /&gt;`</span>;
          preview.classList.remove(<span class="hljs-string">'hidden'</span>);
        }
        setStatus(<span class="hljs-string">'Image loaded. Click "Describe" to continue.'</span>);
      })
      .catch(<span class="hljs-function">(<span class="hljs-params">err</span>) =&gt;</span> {
        <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'[Makaton] fileToDataURL error'</span>, err);
        setStatus(<span class="hljs-string">'Could not read the image.'</span>);
      });
  }

  <span class="hljs-comment">// --- File input change ---</span>
  <span class="hljs-keyword">if</span> (fileInput) {
    fileInput.addEventListener(<span class="hljs-string">'change'</span>, <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> {
      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] file input change'</span>);
      handleFiles(e.target.files);
    });
  } <span class="hljs-keyword">else</span> {
    <span class="hljs-built_in">console</span>.warn(<span class="hljs-string">'[Makaton] #file input not found in DOM.'</span>);
  }

  <span class="hljs-comment">// --- Drag &amp; drop support on preview area ---</span>
  <span class="hljs-keyword">if</span> (preview) {
    preview.addEventListener(<span class="hljs-string">'dragover'</span>, <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> { e.preventDefault(); preview.classList.add(<span class="hljs-string">'drag'</span>); });
    preview.addEventListener(<span class="hljs-string">'dragleave'</span>, <span class="hljs-function">() =&gt;</span> preview.classList.remove(<span class="hljs-string">'drag'</span>));
    preview.addEventListener(<span class="hljs-string">'drop'</span>, <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> {
      e.preventDefault();
      preview.classList.remove(<span class="hljs-string">'drag'</span>);
      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] drop'</span>);
      handleFiles(e.dataTransfer?.files);
    });
  }

  <span class="hljs-comment">// --- Describe click ---</span>
  <span class="hljs-keyword">if</span> (btnDescribe) {
    btnDescribe.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-keyword">async</span> () =&gt; {
      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] Describe clicked'</span>);
      <span class="hljs-keyword">if</span> (!currentImageDataUrl) { setStatus(<span class="hljs-string">'Please upload an image first.'</span>); <span class="hljs-keyword">return</span>; }
      clearOutputs();
      setStatus(<span class="hljs-string">'Checking on-device AI availability…'</span>);

      <span class="hljs-keyword">const</span> avail = <span class="hljs-keyword">await</span> checkAvailability().catch(<span class="hljs-function">() =&gt;</span> ({ <span class="hljs-attr">nanoTextPossible</span>: <span class="hljs-literal">false</span> }));
      <span class="hljs-keyword">try</span> {
        <span class="hljs-keyword">const</span> apiKey = loadApiKey();
        <span class="hljs-keyword">if</span> (apiKey) {
          setStatus(<span class="hljs-string">'Using Gemini cloud for image description…'</span>);
          <span class="hljs-keyword">const</span> desc = <span class="hljs-keyword">await</span> describeImageWithGemini(currentImageDataUrl, apiKey, currentImageMime);
          setOutput(desc);
          <span class="hljs-keyword">return</span>;
        }
        <span class="hljs-keyword">if</span> (avail.nanoTextPossible) {
          setStatus(<span class="hljs-string">'No API key found. Using on-device AI (text) for best guess…'</span>);
          <span class="hljs-keyword">const</span> session = <span class="hljs-keyword">await</span> createNanoTextSession();
          <span class="hljs-keyword">const</span> desc = <span class="hljs-keyword">await</span> session.prompt(<span class="hljs-string">'Given an image is uploaded by the user (not directly visible to you), infer a likely one-sentence description of a common Makaton sign or symbol a teacher might upload. Keep it generic and safe.'</span>);
          setOutput(desc);
          <span class="hljs-keyword">return</span>;
        }
        setStatus(<span class="hljs-string">'No AI available. Please type a brief description.'</span>);
        <span class="hljs-keyword">if</span> (typedBox) typedBox.classList.remove(<span class="hljs-string">'hidden'</span>);
      } <span class="hljs-keyword">catch</span> (err) {
        <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'[Makaton] Describe error'</span>, err);
        setStatus(<span class="hljs-string">'Description failed: '</span> + (err?.message || err));
        <span class="hljs-keyword">if</span> (typedBox) typedBox.classList.remove(<span class="hljs-string">'hidden'</span>);
      }
    });
  } <span class="hljs-keyword">else</span> {
    <span class="hljs-built_in">console</span>.warn(<span class="hljs-string">'[Makaton] Describe button not found.'</span>);
  }

  <span class="hljs-comment">// --- Manual typing flow ---</span>
  <span class="hljs-keyword">if</span> (btnType) {
    btnType.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> {
      <span class="hljs-keyword">if</span> (typedBox) typedBox.classList.remove(<span class="hljs-string">'hidden'</span>);
      <span class="hljs-keyword">if</span> (typed) typed.focus();
    });
  }
  <span class="hljs-keyword">if</span> (btnUseTyped) {
    btnUseTyped.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> {
      <span class="hljs-keyword">const</span> text = (typed?.value || <span class="hljs-string">''</span>).trim();
      <span class="hljs-keyword">if</span> (!text) { setStatus(<span class="hljs-string">'Type a description first.'</span>); <span class="hljs-keyword">return</span>; }
      setOutput(text);
    });
  }

  <span class="hljs-comment">// --- Utilities ---</span>
  <span class="hljs-keyword">if</span> (btnSpeak) {
    btnSpeak.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> {
      <span class="hljs-keyword">const</span> text = meaningEl?.textContent?.trim();
      <span class="hljs-keyword">if</span> (!text) <span class="hljs-keyword">return</span>;
      <span class="hljs-keyword">const</span> u = <span class="hljs-keyword">new</span> SpeechSynthesisUtterance(text);
      speechSynthesis.cancel();
      speechSynthesis.speak(u);
    });
  }
  <span class="hljs-keyword">if</span> (btnCopy) {
    btnCopy.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-keyword">async</span> () =&gt; {
      <span class="hljs-keyword">const</span> text = meaningEl?.textContent?.trim();
      <span class="hljs-keyword">if</span> (!text) <span class="hljs-keyword">return</span>;
      <span class="hljs-keyword">try</span> {
        <span class="hljs-keyword">await</span> navigator.clipboard.writeText(text);
        setStatus(<span class="hljs-string">'Copied meaning to clipboard.'</span>);
      } <span class="hljs-keyword">catch</span> {
        setStatus(<span class="hljs-string">'Copy failed.'</span>);
      }
    });
  }

  <span class="hljs-comment">// --- Settings modal ---</span>
  <span class="hljs-keyword">if</span> (btnSettings &amp;&amp; settings) btnSettings.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> settings.showModal());
  <span class="hljs-keyword">if</span> (btnCloseSettings &amp;&amp; settings) btnCloseSettings.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> settings.close());
  <span class="hljs-keyword">if</span> (btnSaveKey) {
    btnSaveKey.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> {
      e.preventDefault();
      <span class="hljs-keyword">const</span> k = apiKeyInput?.value?.trim() || <span class="hljs-string">""</span>;
      saveApiKey(k);
      <span class="hljs-keyword">if</span> (apiStatus) apiStatus.textContent = k ? <span class="hljs-string">"API key saved locally. Try Describe again."</span> : <span class="hljs-string">"Cleared API key. You can still use on-device or typed mode."</span>;
    });
  }

  <span class="hljs-comment">// First status</span>
  setStatus(<span class="hljs-string">'Ready. Upload an image to begin.'</span>);
});
</code></pre>
<p>Let's break down the main sections of the <code>app.js</code> script for the Makaton AI Companion, as there’s a lot going on here:</p>
<ol>
<li><p><strong>Imports and Initial Setup:</strong></p>
<ul>
<li><p>The script imports functions from <code>mapping.js</code> and <code>ai.js</code> to handle mapping descriptions to meanings and AI interactions.</p>
</li>
<li><p>It sets up event listeners for when the DOM content is fully loaded, ensuring all elements are ready for interaction.</p>
</li>
</ul>
</li>
<li><p><strong>Element Selection:</strong></p>
<ul>
<li>It uses a helper function <code>$</code> to select DOM elements by their CSS selectors. This includes file inputs, buttons, and display areas for image previews and outputs.</li>
</ul>
</li>
<li><p><strong>Sanity Logs:</strong></p>
<ul>
<li>It logs the presence of key elements to the console for debugging purposes, ensuring that all necessary elements are found in the DOM.</li>
</ul>
</li>
<li><p><strong>API Key Initialization:</strong></p>
<ul>
<li>It loads any saved API key from local storage and sets it in the input field for user convenience.</li>
</ul>
</li>
<li><p><strong>Helper Functions:</strong></p>
<ul>
<li><p><code>setStatus</code>: Updates the status message displayed to the user.</p>
</li>
<li><p><code>clearOutputs</code>: Clears the output and meaning display areas and disables buttons for speaking and copying.</p>
</li>
<li><p><code>setOutput</code>: Displays the AI-generated description and maps it to a Makaton meaning, enabling buttons if a valid meaning is found.</p>
</li>
<li><p><code>fileToDataURL</code>: Converts an uploaded file to a data URL for image preview and processing.</p>
</li>
<li><p><code>handleFiles</code>: Handles file selection, updating the preview and setting the current image data URL.</p>
</li>
</ul>
</li>
<li><p><strong>File Input Change Handling:</strong></p>
<ul>
<li>It listens for changes in the file input, processes the selected file, and updates the preview area.</li>
</ul>
</li>
<li><p><strong>Drag &amp; Drop Support:</strong></p>
<ul>
<li>It adds drag-and-drop functionality to the preview area, allowing users to drag files directly onto the app for processing.</li>
</ul>
</li>
<li><p><strong>Describe Button Click:</strong></p>
<ul>
<li><p>It handles the "Describe" button click event, checking for an uploaded image and then describing it with the Gemini API if a key is saved, or falling back to the on-device text model if no key is set.</p>
</li>
<li><p>If no AI is available, it prompts the user to type a description manually.</p>
</li>
</ul>
</li>
<li><p><strong>Manual Typing Flow:</strong></p>
<ul>
<li>It allows users to manually type a description if AI processing is unavailable or fails, updating the output with the typed text.</li>
</ul>
</li>
<li><p><strong>Utilities:</strong></p>
<ul>
<li><p><code>btnSpeak</code>: Uses the browser's SpeechSynthesis API to read aloud the mapped meaning.</p>
</li>
<li><p><code>btnCopy</code>: Copies the mapped meaning to the clipboard for easy sharing.</p>
</li>
</ul>
</li>
<li><p><strong>Settings Modal:</strong></p>
<ul>
<li>It manages the settings modal for entering and saving the API key, providing feedback on the key's status.</li>
</ul>
</li>
<li><p><strong>Initial Status:</strong></p>
<ul>
<li>It sets the initial status message to guide the user to upload an image to begin the process.</li>
</ul>
</li>
</ol>
<p>This script effectively ties together the user interface, file handling, AI processing, and output display, providing a seamless experience for translating Makaton signs into English meanings.</p>
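<p>The fallback order inside the Describe handler can be summarised as a small pure function. This is only a sketch: <code>chooseDescribeStrategy</code> and its return labels are hypothetical names, not part of the project’s <code>app.js</code>.</p>
<pre><code class="lang-javascript">// Hypothetical helper: mirrors the order of checks in the Describe click handler.
// apiKey: the string returned by loadApiKey()
// availability: the object returned by checkAvailability()
function chooseDescribeStrategy(apiKey, availability) {
  if (apiKey) return "cloud";                         // saved key: use the Gemini API
  if (availability?.nanoTextPossible) return "nano";  // no key: on-device text model
  return "typed";                                     // neither: ask the user to type
}

console.log(chooseDescribeStrategy("my-key", { nanoTextPossible: true })); // cloud
console.log(chooseDescribeStrategy("", { nanoTextPossible: true }));       // nano
console.log(chooseDescribeStrategy("", null));                             // typed
</code></pre>
<p>Keeping the decision separate from the DOM code makes the priority (cloud first, then on-device, then typed input) easy to test on its own.</p>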
<h4 id="heading-how-vision-and-language-work-together-here">How Vision and Language Work Together Here</h4>
<p>While working on this project, I started appreciating how computer vision and language understanding complement each other in multimodal systems like this one.</p>
<ul>
<li><p>The vision model (the cloud Gemini model in this demo, since the Nano text session never receives the image) interprets <em>what it sees</em>, such as hand shapes, gestures, or layout, and turns that visual context into descriptive language.</p>
</li>
<li><p>The language mapping logic then interprets those words, infers intent, and finds the closest semantic match (e.g., “help,” “friend,” “eat”).</p>
</li>
<li><p>It’s a collaboration between two forms of understanding (<em>perceptual</em> and <em>semantic</em>) that together allow the AI to bridge the gap between gesture and meaning.</p>
</li>
</ul>
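<p>To make the second step concrete, here is a deliberately crude sketch of keyword-based matching. The keyword table and the <code>closestMeaning</code> name are invented for illustration and are not the project’s <code>mapping.js</code>. Only the <code>No direct mapping</code> fallback text is borrowed from the check in <code>setOutput</code>.</p>
<pre><code class="lang-javascript">// Illustrative only: this keyword table is made up for the example.
const KEYWORD_MEANINGS = [
  { meaning: "help", keywords: ["help", "assist", "support"] },
  { meaning: "eat", keywords: ["eat", "food", "mouth"] },
  { meaning: "friend", keywords: ["friend", "hands clasped", "together"] },
];

// Count how many keywords of each entry appear in the description
// and return the meaning with the most hits.
function closestMeaning(description) {
  const text = String(description || "").toLowerCase();
  const scored = KEYWORD_MEANINGS.map(function (entry) {
    const hits = entry.keywords.filter(function (k) { return text.includes(k); }).length;
    return { meaning: entry.meaning, hits: hits };
  });
  // Highest hit count first; ties keep the table order.
  scored.sort(function (a, b) { return b.hits - a.hits; });
  return scored[0].hits === 0 ? "No direct mapping" : scored[0].meaning;
}

console.log(closestMeaning("A person bringing food to their mouth")); // eat
console.log(closestMeaning("A red car"));                             // No direct mapping
</code></pre>
<p>Plain substring matching is fragile (for example, <code>eat</code> also matches inside <code>great</code>), so a real mapping would need word boundaries or a richer similarity measure.</p>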
<p>This realization reshaped how I think about accessibility: the best assistive technologies often emerge not from smarter models alone, but from the interaction between modalities like seeing, describing, and reasoning in context.</p>
<h3 id="heading-6-optional-speak-and-copy">6. Optional — Speak and Copy</h3>
<p>To make the app more accessible, I added speech output and a quick copy button:</p>
<pre><code class="lang-javascript">btnSpeak.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-keyword">const</span> text = meaningEl.textContent.trim();
  <span class="hljs-keyword">if</span> (text) speechSynthesis.speak(<span class="hljs-keyword">new</span> SpeechSynthesisUtterance(text));
});

btnCopy.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-keyword">async</span> () =&gt; {
  <span class="hljs-keyword">const</span> text = meaningEl.textContent.trim();
  <span class="hljs-keyword">if</span> (text) <span class="hljs-keyword">await</span> navigator.clipboard.writeText(text);
});
</code></pre>
<p>This gives users both visual and auditory feedback, especially helpful for learners or educators.</p>
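<p>Both features depend on browser support: speech synthesis is not available everywhere, and <code>navigator.clipboard</code> is only exposed in secure contexts (HTTPS or <code>localhost</code>). A small feature check, sketched below, could decide whether to enable each button. The <code>detectOutputSupport</code> helper is a hypothetical name and is not part of the project.</p>
<pre><code class="lang-javascript">// Hypothetical helper: report which output features the page can use.
// scope is the global object (window in a browser); it is passed in so the
// function can also be exercised with a plain object.
function detectOutputSupport(scope) {
  const g = scope || {};
  const canSpeak = g.speechSynthesis
    ? typeof g.SpeechSynthesisUtterance === "function"
    : false;
  const canCopy = typeof g.navigator?.clipboard?.writeText === "function";
  return { canSpeak: canSpeak, canCopy: canCopy };
}

// In the browser this reports the real capabilities of the current page.
console.log(detectOutputSupport(globalThis));
</code></pre>
<p>The result could be used to keep <code>btnSpeak</code> or <code>btnCopy</code> disabled when the matching feature is missing, rather than failing on click.</p>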
<h2 id="heading-how-to-fix-the-common-issues">How to Fix the Common Issues</h2>
<p>No AI or web integration project runs smoothly the first time – and that’s okay. Here’s a breakdown of the main issues I faced while building the Makaton AI Companion, how I diagnosed them, and how I fixed each one.</p>
<p>These lessons will help anyone trying to integrate Gemini APIs, on-device AI, or local web apps without a full backend.</p>
<h3 id="heading-1-the-cors-error-when-running-with-file">1. The “CORS” Error When Running With <code>file://</code></h3>
<p>When I first opened my <code>index.html</code> directly from my file explorer, Chrome threw several CORS policy errors:</p>
<pre><code class="lang-plaintext">Access to script at 'file:///lib/ai.js' from origin 'null' has been blocked by CORS policy.
</code></pre>
<p>At first this looked confusing, but the reason is simple: browsers fetch JavaScript modules (<code>import/export</code>) using CORS rules, and a page opened from a <code>file://</code> path has the opaque origin <code>null</code>, so those requests are blocked for security reasons.</p>
<p>✅ <strong>Fix:</strong> I realized I needed to serve the files over <strong>HTTP</strong>, not from the file system. So I ran a quick local web server using Python:</p>
<pre><code class="lang-bash">python -m http.server 8080
</code></pre>
<p>Then opened:</p>
<pre><code class="lang-python">http://localhost:<span class="hljs-number">8080</span>/index.html
</code></pre>
<p>That single step fixed all the CORS errors and allowed my modules to load correctly.</p>
<h3 id="heading-2-model-not-found-404-from-the-gemini-api">2. “Model Not Found” (404) From the Gemini API</h3>
<p>The next big challenge came from the Gemini API. Even though I had a valid API key, my console showed this error:</p>
<pre><code class="lang-plaintext">"models/gemini-1.5-flash" is not found for API version v1beta, or is not supported for generateContent.
</code></pre>
<p>It turns out that the models you can call differ by API version and can vary with your project setup and key permissions.</p>
<p>✅ <strong>Fix:</strong> I rewrote my <code>lib/ai.js</code> script to automatically <strong>try multiple Gemini model endpoints</strong> until it found one that worked. Something like this:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> GEMINI_IMAGE_ENDPOINTS = [
  <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/models/gemini-1.5-flash:generateContent"</span>,
  <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/models/gemini-1.5-pro:generateContent"</span>,
  <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/models/gemini-1.5-flash-latest:generateContent"</span>,
];
</code></pre>
<p>And I wrapped it in a loop that stopped once one endpoint succeeded.</p>
<p>Later, I improved it further by listing the available models dynamically using <code>https://generativelanguage.googleapis.com/v1/models?key=YOUR_KEY</code> and automatically trying whichever ones could handle <code>generateContent</code> requests with an image.</p>
<p>That dynamic discovery approach resolved the 404 errors, because the app no longer depends on a hard-coded model name.</p>
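<p>As a rough sketch of the discovery step, the list response can be narrowed to models that advertise <code>generateContent</code>. This assumes each entry carries a <code>name</code> and a <code>supportedGenerationMethods</code> array, which is worth checking against the current API reference. The <code>pickGenerateContentModels</code> helper and the sample model names are hypothetical, and this is not necessarily how the project’s <code>listModels</code> works.</p>
<pre><code class="lang-javascript">// Sketch: keep only models that list "generateContent" among their methods.
// Assumption: the list response looks like { models: [{ name, supportedGenerationMethods }] }.
function pickGenerateContentModels(listResponse) {
  const models = Array.isArray(listResponse?.models) ? listResponse.models : [];
  return models
    .filter(function (m) {
      const methods = Array.isArray(m.supportedGenerationMethods)
        ? m.supportedGenerationMethods
        : [];
      return methods.includes("generateContent");
    })
    .map(function (m) { return m.name; })
    .filter(function (name) {
      // Keep only names in the "models/..." form used to build the endpoint URL.
      return typeof name === "string" ? name.startsWith("models/") : false;
    });
}

// Made-up sample data, shaped like the assumed response.
const sample = {
  models: [
    { name: "models/example-vision", supportedGenerationMethods: ["generateContent", "countTokens"] },
    { name: "models/example-embedding", supportedGenerationMethods: ["embedContent"] },
  ],
};
console.log(pickGenerateContentModels(sample)); // [ 'models/example-vision' ]
</code></pre>
<p>Supporting <code>generateContent</code> does not guarantee that a model accepts image input, which is why trying the candidates one by one remains useful.</p>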
<h3 id="heading-3-packaging-a-local-single-file-version">3. Packaging a Local Single-File Version</h3>
<p>Once I got everything working, I wanted a version that others could test easily without installing Node.js or running build tools.</p>
<p>✅ <strong>Fix:</strong> I bundled the project into a simple zip file containing:</p>
<pre><code class="lang-python">index.html
app.js
lib/ai.js
lib/mapping.js
styles.css
</code></pre>
<p>That way, anyone can just unzip and run:</p>
<pre><code class="lang-bash">python -m http.server 8080
</code></pre>
<p>and open <code>localhost:8080</code>.</p>
<p>Everything runs locally in the browser, no server-side code required. This also makes it perfect for demos, classrooms, and so on.</p>
<h3 id="heading-4-debugging-script-import-errors-in-the-console">4. Debugging Script Import Errors in the Console</h3>
<p>Another subtle issue appeared when I noticed this red message:</p>
<pre><code class="lang-plaintext">The requested module './lib/mapping.js' does not provide an export named 'mapDescriptionToMeaning'
</code></pre>
<p>That line told me exactly what was wrong: my import and export function names didn’t match. The fix was straightforward:</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// app.js</span>
<span class="hljs-keyword">import</span> { mapDescriptionToMeaning } <span class="hljs-keyword">from</span> <span class="hljs-string">'./lib/mapping.js'</span>;
</code></pre>
<p>And then ensuring the mapping file exported it:</p>
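<pre><code class="lang-javascript"><span class="hljs-comment">// mapping.js</span>
<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">mapDescriptionToMeaning</span>(<span class="hljs-params">desc</span>) </span>{ ... }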
</code></pre>
<p>After that, all the pieces connected smoothly.</p>
<p>Using the browser console <strong>as my debugging dashboard</strong> turned out to be the most powerful tool of all. Every fix started by reading and reasoning about those red error lines.</p>
<h2 id="heading-demo-the-makaton-ai-companion-in-action">Demo: The Makaton AI Companion in Action</h2>
<p>Let’s see the Makaton AI Companion in action and understand what’s happening under the hood.</p>
<h3 id="heading-step-1-run-the-app-locally">Step 1: Run the app locally</h3>
<p>Once you’ve downloaded or cloned the project folder, open your terminal in that directory and start a local development server: <code>python -m http.server 8080</code>. Then open your browser and visit: <code>http://localhost:8080/index.html</code></p>
<p>You should see the Makaton AI Companion interface:</p>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/app-interface.jpg?raw=true" alt="Main interface of the Makaton AI Companion app" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-2-get-your-gemini-api-key">Step 2: Get Your Gemini API Key</h3>
<p>To enable cloud-based image description, you’ll need a <a target="_blank" href="https://aistudio.google.com/welcome"><strong>Gemini API key</strong></a> from Google AI Studio.</p>
<p><strong>Here’s how to generate one:</strong></p>
<ol>
<li><p>Visit: <code>https://aistudio.google.com/welcome</code></p>
</li>
<li><p>Click <strong>“Create API key”</strong> and link it to your Google Cloud project (or create a new one).</p>
</li>
<li><p>Copy the key. It will look like this: <code>AIzaSyA...XXXXXXXXXXXX</code></p>
</li>
<li><p>Open the Makaton AI Companion in your browser and click the <strong>Settings</strong> button (top left).</p>
</li>
<li><p>Paste your key in the input box and click <strong>Save</strong>.</p>
</li>
</ol>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/api-key-setting.jpg?raw=true" alt="Setting up the Gemini API key in the app interface" width="600" height="400" loading="lazy"></p>
<p>You’ll see a confirmation message like this:</p>
<blockquote>
<p><em>“API key saved locally. Try Describe again.”</em></p>
</blockquote>
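<p>This means your key is stored in your browser’s <code>localStorage</code> on your own machine rather than on a server. Keep in mind that <code>localStorage</code> is not encrypted and can be read by any script running on the same origin, so avoid saving the key on a shared computer.</p>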
<h3 id="heading-step-3-enable-gemini-nano-for-on-device-ai">Step 3: Enable Gemini Nano for On-Device AI</h3>
<p>If you’re using <a target="_blank" href="https://www.google.com/intl/en_uk/chrome/canary/"><strong>Chrome Canary</strong></a>, you can run Gemini Nano locally without internet access. This allows the Makaton AI Companion to generate text even when the API key isn’t set.</p>
<h4 id="heading-download-and-install-chrome-canary">Download and Install Chrome Canary:</h4>
<p>Visit the official Chrome Canary download page and install it on your Windows or macOS system. Chrome Canary is a special version of Chrome designed for developers and early adopters, offering the latest features and updates.</p>
<h4 id="heading-enable-gemini-nano">Enable Gemini Nano:</h4>
<p>Open Chrome Canary and type <code>chrome://flags/#prompt-api-for-gemini-nano</code> in the address bar.</p>
<p>Locate the "Prompt API for Gemini Nano" flag in the list. Set this flag to <strong>Enabled</strong>. This action allows Chrome Canary to support the Gemini Nano model for on-device AI processing.</p>
<p>After enabling the flag, relaunch Chrome Canary to apply the changes.</p>
<h4 id="heading-download-the-gemini-nano-model">Download the Gemini Nano Model:</h4>
<p>Open a new tab in Chrome Canary and enter <code>chrome://components</code> in the address bar.</p>
<p>Scroll down to find the <strong>“Optimization Guide”</strong> component (recent builds list it as “Optimization Guide On Device Model”). Click on <strong>Check for update</strong>. This starts the download of the Gemini Nano model, which is needed to run AI tasks locally without an internet connection.</p>
<h4 id="heading-verify-installation">Verify Installation:</h4>
<p>Once the Gemini Nano model is installed, the Makaton AI Companion app will automatically detect it. You should see a message indicating that the app is using on-device AI: <em>“No API key found. Using on-device AI (text) for best guess…”</em></p>
<p>This confirmation means that the app can now generate text descriptions using the Gemini Nano model without needing an API key or internet access.</p>
<p>By following these detailed steps, you ensure that the Gemini Nano model is correctly set up and ready to use for on-device AI processing in the Makaton AI Companion.</p>
<h3 id="heading-step-4-upload-a-makaton-sign-or-symbol">Step 4: Upload a Makaton sign or symbol</h3>
<p>Click <strong>Choose File</strong> to upload any Makaton image (for example, the “help” sign), then press <strong>Describe (Cloud or Nano)</strong>. You’ll immediately see console logs confirming that the app is running correctly and connecting to the Gemini API:</p>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/console.jpg?raw=true" alt="Console output showing real-time translation logs" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-5-ai-description-and-mapping">Step 5: AI Description and Mapping</h3>
<p>Here’s what happens next:</p>
<ol>
<li><p>The image is read and encoded as Base64.</p>
</li>
<li><p>Gemini (the cloud API, or Gemini Nano on-device) generates a short visual description.</p>
</li>
<li><p>The description is passed to the <code>mapDescriptionToMeaning()</code> function.</p>
</li>
<li><p>If keywords match an entry in the <code>MAKATON_GLOSSES</code> dictionary, the app displays the corresponding English meaning.</p>
</li>
<li><p>Finally, users can click <strong>Speak</strong> or <strong>Copy</strong> to hear or reuse the translation.</p>
</li>
</ol>
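<p>The matching in step 4 can be sketched in a few lines. The app itself does this in JavaScript inside <code>lib/mapping.js</code>. The Python version below is only an illustration of the idea, and the dictionary entries and keyword lists are assumptions rather than the project’s real data.</p>
<pre><code class="lang-python"># Hypothetical glossary: each meaning lists keywords that may appear in a description.
MAKATON_GLOSSES = {
    "help": ["help", "assist", "hand over hand"],
    "eat": ["eat", "food", "hand to mouth"],
}

def map_description_to_meaning(description):
    """Return the first meaning whose keyword occurs in the description, else None."""
    text = description.lower()
    for meaning, keywords in MAKATON_GLOSSES.items():
        if any(keyword in text for keyword in keywords):
            return meaning
    return None

print(map_description_to_meaning("One hand placed flat, hand over hand, lifting up"))
print(map_description_to_meaning("A circle drawn in the air"))
</code></pre>
<p>The first call prints <code>help</code> and the second prints <code>None</code>, which corresponds to the “no mapping found” case. Plain substring matching is deliberately simple: a short keyword such as <code>eat</code> would also match inside “great”, so a real mapping would probably need word-boundary checks.</p>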
<p>Example outputs:</p>
<p><strong>When no mapping is found:</strong><br>The AI description is accurate but doesn’t yet match a known Makaton keyword.</p>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/Incorrect-demonstration.jpg?raw=true" alt="Incorrect demonstration showing the model misinterpreting a sign" width="600" height="400" loading="lazy"></p>
<p><strong>After updating the mapping list:</strong><br>Adding new keywords like <code>"help"</code>, <code>"assist"</code>, or <code>"hand over hand"</code> enables correct translation.</p>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/correct-demonstration.jpg?raw=true" alt="Correct demonstration where the AI accurately recognizes the Makaton sign" width="600" height="400" loading="lazy"></p>
<h3 id="heading-why-this-matters">Why this matters</h3>
<p>This demonstrates how accessible, AI-assisted tools can support communication for people who rely on Makaton. Even when a gesture isn’t recognized, the system provides a structured output and allows users or educators to expand the mapping list, making the tool smarter over time.</p>
<h2 id="heading-broader-reflections">Broader Reflections</h2>
<p>Building this project turned out to be much more than a coding exercise for me.<br>It was a meaningful experiment in combining accessibility, natural language processing, and computer vision. These three fields, when brought together, can create real social impact.</p>
<p>While working on it, I began to understand how computer vision and language understanding complement each other in practice. The vision model perceives the world by identifying shapes, gestures, and spatial patterns, while the language model interprets what those visuals mean in human terms.<br>In this project, the artificial intelligence system first sees the Makaton sign, then describes it, and finally maps it to an English word that carries intent and meaning.</p>
<p>This interaction between perception and semantics is what makes multimodal artificial intelligence so powerful. It is not only about recognizing an image or generating text; it is about building systems that connect understanding across different forms of information to make technology more inclusive and human centered.</p>
<p>This realization changed how I think about accessibility technology. True innovation happens not only through smarter models but through the harmony between seeing and understanding, between what an artificial intelligence system observes and how it communicates that observation to help people.</p>
<h3 id="heading-accessibility-meets-ai">Accessibility Meets AI</h3>
<p>Working on this project reminded me that accessibility isn’t just about compliance or assistive devices. It’s also about inclusion. A simple AI system that can describe a hand gesture or symbol in real time can empower teachers, parents, and students who communicate using Makaton or similar systems.</p>
<p>By mapping AI-generated descriptions to meaningful phrases, the app demonstrates how AI can support inclusive education, even at small scales. It bridges the communication gap between verbal and nonverbal learners, which is something that traditional translation systems often overlook.</p>
<h3 id="heading-integrating-nlp-and-computer-vision">Integrating NLP and Computer Vision</h3>
<p>On the technical side, this project showed me how naturally computer vision and language understanding complement each other. The Gemini API’s multimodal models were able to analyze an image and produce coherent natural-language sentences, something that older APIs couldn’t do without chaining multiple tools.</p>
<p>By feeding that output into a lightweight NLP mapping function, I was able to simulate a very early-stage symbol-to-language translator, which is the core of my broader research interest in automatic Makaton-to-English translation.</p>
<h3 id="heading-why-local-ai-gemini-nano-matters">Why Local AI (Gemini Nano) Matters</h3>
<p>While the cloud models are powerful, experimenting with Gemini Nano revealed something exciting:<br>on-device AI can make accessibility tools faster, safer, and more private.</p>
<p>In classrooms or therapy sessions, you often can’t rely on stable internet connections or share sensitive student data. Running inference locally means learners’ gestures or symbol images never leave the device, a crucial step toward privacy-preserving accessibility AI.</p>
<p>And since Nano runs directly inside Chrome Canary, it shows how AI is becoming embedded at the browser level, lowering barriers for teachers and developers to build inclusive solutions without needing large infrastructure.</p>
<h3 id="heading-looking-forward">Looking Forward</h3>
<p>This prototype is just a starting point. Future iterations could integrate gesture recognition directly from camera input, support multiple symbol sets, or even learn from user feedback to expand the dictionary automatically.</p>
<p>Most importantly, it reinforces a central belief in my research and teaching journey:</p>
<p><strong>Accessibility innovation doesn’t require massive systems. It starts with curiosity, empathy, and a few lines of purposeful code.</strong></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building the Makaton AI Companion has been one of the most rewarding projects in my AI journey – not just because it worked, but because it proved how accessible innovation can be.</p>
<p>With just a browser, a few lines of JavaScript, and the right API, I was able to combine computer vision, language understanding, and accessibility design into a working system that translates symbols into meaning. It’s a small step toward a future where anyone, regardless of speech or language ability, can be understood through technology.</p>
<p>The project also reinforced something deeply personal to me as a researcher and educator: that AI for accessibility doesn’t need to be complex, expensive, or centralized. It can be lightweight, open, and built with empathy by anyone who’s willing to learn and experiment.</p>
<h3 id="heading-join-the-conversation">Join the Conversation</h3>
<p>If this project inspires you, I’d love to see your own experiments and improvements. Can you make it support live webcam gestures? Could you adapt it for other symbol systems, like PECS or BSL?</p>
<p>Share your ideas in the comments or tag me if you publish your own version. Together, we can grow a small prototype into a community-driven accessibility tool and continue exploring how AI can give more people a voice.</p>
<p>Full source code on GitHub: <a target="_blank" href="https://github.com/tayo4christ/makaton-ai-companion">Makaton-ai-companion</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Transformers for Real-Time Gesture Recognition ]]>
                </title>
                <description>
                    <![CDATA[ Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are no... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/using-transformers-for-real-time-gesture-recognition/</link>
                <guid isPermaLink="false">68e3c692aa82abf4b593114c</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ transformers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pytorch ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ONNX ]]>
                    </category>
                
                    <category>
                        <![CDATA[ gradio ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Gesture Recognition ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Accessibility ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Tutorial ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Mon, 06 Oct 2025 13:39:30 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759757931295/5f19fd4e-93c0-4bd7-a75c-a7858e061ecd.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are not static images. Rather, they unfold over time. To build more robust, real-time systems, we need models that can capture both spatial details and temporal context.</p>
<p>This is where Transformers come in. Originally built for language, they’ve become state-of-the-art in vision tasks thanks to models like the Vision Transformer (ViT) and video-focused variants such as TimeSformer.</p>
<p>In this tutorial, we’ll use a Transformer backbone to create a lightweight real-time gesture recognition tool, optimized for small datasets and deployable on a regular laptop webcam.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-transformers-for-gestures">Why Transformers for Gestures?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-youll-learn">What You’ll Learn</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-generate-a-gesture-dataset">Generate a Gesture Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-1-generate-a-synthetic-dataset">Option 1: Generate a Synthetic Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-training-script-trainpy">Training Script: train.py</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-export-the-model-to-onnx">Export the Model to ONNX</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-evaluate-accuracy-latency">Evaluate Accuracy + Latency</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-2-use-small-samples-from-public-gesture-datasets">Option 2: Use Small Samples from Public Gesture Datasets</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-accessibility-notes-amp-ethical-limits">Accessibility Notes &amp; Ethical Limits</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-next-steps">Next Steps</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-why-transformers-for-gestures">Why Transformers for Gestures?</h2>
<p>Transformers are powerful because they use self-attention to model relationships across a sequence. For gestures, this means the model doesn’t just see isolated frames, but also learns how movements evolve over time. A wave, for example, looks different from a raised hand only when viewed as a sequence.</p>
<p>Vision Transformers process images as patches, while video Transformers extend this to multiple frames with temporal attention. Even a simple approach, like applying ViT to each frame and pooling across time, can outperform traditional CNN-based methods for small datasets.</p>
<p>Combined with Hugging Face’s pre-trained models and ONNX Runtime for optimization, Transformers make it possible to train on a modest dataset and still achieve smooth real-time recognition.</p>
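<p>To make “pooling across time” concrete, here is a minimal sketch of mean pooling over per-frame features. In the tutorial those features would come from a ViT backbone. Here, plain Python lists stand in for tensors, and the function name <code>temporal_mean_pool</code> is a hypothetical helper used only for illustration.</p>
<pre><code class="lang-python">def temporal_mean_pool(frame_features):
    """Average per-frame feature vectors into one clip-level vector."""
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[i] for frame in frame_features) / num_frames for i in range(dim)]

# Three frames, each with a toy 2-dimensional feature vector.
clip = [[1.0, 0.0], [2.0, 0.0], [3.0, 3.0]]
print(temporal_mean_pool(clip))  # [2.0, 1.0]
</code></pre>
<p>With a tensor of shape (batch, frames, features), the same operation is a mean over the frame axis. Note that plain averaging ignores the order of the frames, which is one reason video Transformers add temporal attention on top.</p>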
<h2 id="heading-what-youll-learn">What You’ll Learn</h2>
<p>In this tutorial, you’ll build a gesture recognition system using Transformers. By the end, you’ll know how to:</p>
<ul>
<li><p>Create (or record) a tiny gesture dataset</p>
</li>
<li><p>Train a Vision Transformer (ViT) with temporal pooling</p>
</li>
<li><p>Export the model to ONNX for faster inference</p>
</li>
<li><p>Build a real-time Gradio app that classifies gestures from your webcam</p>
</li>
<li><p>Evaluate your model’s accuracy and latency with simple scripts</p>
</li>
<li><p>Understand the accessibility potential and ethical limits of gesture recognition</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along, you should have:</p>
<ul>
<li><p>Basic Python knowledge (functions, scripts, virtual environments)</p>
</li>
<li><p>Familiarity with PyTorch (tensors, datasets, training loops) – helpful but not required</p>
</li>
<li><p>Python 3.8+ installed on your system</p>
</li>
<li><p>A webcam (for the live demo in Gradio)</p>
</li>
<li><p>Optionally: GPU access (training on CPU works, but is slower)</p>
</li>
</ul>
<h2 id="heading-project-setup">Project Setup</h2>
<p>Create a new project folder and install the required libraries.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create a new project directory and navigate into it</span>
mkdir transformer-gesture &amp;&amp; <span class="hljs-built_in">cd</span> transformer-gesture

<span class="hljs-comment"># Set up a Python virtual environment</span>
python -m venv .venv

<span class="hljs-comment"># Activate the virtual environment</span>
<span class="hljs-comment"># Windows PowerShell</span>
.venv\Scripts\Activate.ps1

<span class="hljs-comment"># macOS/Linux</span>
<span class="hljs-built_in">source</span> .venv/bin/activate
</code></pre>
<p>These commands set up a new Python project with a virtual environment. Here's a breakdown of each part:</p>
<ol>
<li><p><code>mkdir transformer-gesture &amp;&amp; cd transformer-gesture</code>: This command creates a new directory named "transformer-gesture" and then navigates into it.</p>
</li>
<li><p><code>python -m venv .venv</code>: This command creates a new virtual environment in the current directory. The virtual environment is stored in a folder named ".venv".</p>
</li>
<li><p>Activating the virtual environment:</p>
<ul>
<li><p>For Windows PowerShell, you can use <code>.venv\Scripts\Activate.ps1</code> to activate the virtual environment.</p>
</li>
<li><p>For macOS/Linux, use <code>source .venv/bin/activate</code> to activate the virtual environment.</p>
</li>
</ul>
</li>
</ol>
<p>Activating a virtual environment ensures that the Python interpreter and any packages you install are isolated to this specific project, preventing conflicts with other projects or system-wide packages.</p>
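<p>If you are unsure whether the environment is active, a quick check from Python can help. This is a small sketch that relies on the standard library only. The helper name <code>in_virtualenv</code> is an assumption for illustration, and the check covers environments created with <code>venv</code> as shown above.</p>
<pre><code class="lang-python">import sys

def in_virtualenv():
    """True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at .venv while sys.base_prefix
    # still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("virtual environment active:", in_virtualenv())
</code></pre>
<p>Run it with the same <code>python</code> command you plan to use for the rest of the tutorial.</p>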
<p>Create a <code>requirements.txt</code> file:</p>
<pre><code class="lang-plaintext">torch&gt;=2.0
torchvision
torchaudio
timm
huggingface_hub

onnx
onnxruntime

gradio

numpy
opencv-python
pillow

matplotlib
seaborn
scikit-learn
</code></pre>
<p>Each line in <code>requirements.txt</code> names a package the project depends on. Here's a brief explanation of each package:</p>
<ol>
<li><p><strong>torch&gt;=2.0</strong>: PyTorch is a popular open-source deep learning framework that provides a flexible and efficient platform for building and training neural networks. Version 2.0 and above includes improvements in performance and new features.</p>
</li>
<li><p><strong>torchvision</strong>: This library is part of the PyTorch ecosystem and provides tools for computer vision tasks, including datasets, model architectures, and image transformations.</p>
</li>
<li><p><strong>torchaudio</strong>: Also part of the PyTorch ecosystem, Torchaudio provides audio processing tools and datasets, making it easier to work with audio data in deep learning projects.</p>
</li>
<li><p><strong>timm</strong>: The PyTorch Image Models (timm) library offers a collection of pre-trained models and utilities for computer vision tasks, facilitating quick experimentation and deployment.</p>
</li>
<li><p><strong>huggingface_hub</strong>: This library allows easy access to models and datasets hosted on the Hugging Face Hub, a platform for sharing and collaborating on machine learning models and datasets.</p>
</li>
<li><p><strong>onnx</strong>: The Open Neural Network Exchange (ONNX) format is used to represent machine learning models, enabling interoperability between different frameworks.</p>
</li>
<li><p><strong>onnxruntime</strong>: This is a high-performance runtime for executing ONNX models, allowing for efficient deployment across various platforms.</p>
</li>
<li><p><strong>gradio</strong>: Gradio is a library for creating user interfaces for machine learning models, making them accessible through a web interface for easy interaction and testing.</p>
</li>
<li><p><strong>numpy</strong>: A fundamental package for numerical computing in Python, providing support for arrays and a wide range of mathematical functions.</p>
</li>
<li><p><strong>opencv-python</strong>: OpenCV is a library for computer vision and image processing tasks, widely used for real-time applications.</p>
</li>
<li><p><strong>pillow</strong>: A Python Imaging Library (PIL) fork, Pillow provides tools for opening, manipulating, and saving many different image file formats.</p>
</li>
<li><p><strong>matplotlib</strong>: A plotting library for Python, Matplotlib is used for creating static, interactive, and animated visualizations in Python.</p>
</li>
<li><p><strong>seaborn</strong>: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.</p>
</li>
<li><p><strong>scikit-learn</strong>: A machine learning library in Python that provides simple and efficient tools for data analysis and modeling, including classification, regression, clustering, and dimensionality reduction.</p>
</li>
</ol>
<p>Install dependencies:</p>
<pre><code class="lang-bash">pip install -r requirements.txt
</code></pre>
<p>The command <code>pip install -r requirements.txt</code> is used to install all the Python packages listed in a file named <code>requirements.txt</code>. This file typically contains a list of package dependencies required for a Python project, each specified with a package name and optionally a version number.</p>
<p>By running this command, <code>pip</code>, which is the Python package installer, reads the file and installs each package listed, ensuring that the project has all the necessary dependencies to run properly. This is a common practice in Python projects to manage and share dependencies easily.</p>
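<p>After the install finishes, you may want to confirm that everything is importable before moving on. The sketch below uses only the standard library. The <code>MODULES</code> list and the <code>missing_modules</code> helper are assumptions written for this check, not part of the tutorial’s scripts.</p>
<pre><code class="lang-python">import importlib.util

# Import names differ from some pip names: opencv-python is cv2,
# pillow is PIL, and scikit-learn is sklearn.
MODULES = ["torch", "torchvision", "torchaudio", "timm", "huggingface_hub",
           "onnx", "onnxruntime", "gradio", "numpy", "cv2", "PIL",
           "matplotlib", "seaborn", "sklearn"]

def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("missing:", missing_modules(MODULES) or "none")
</code></pre>
<p>If the script lists any names, re-run the install command inside the activated virtual environment and check the output for errors.</p>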
<h2 id="heading-generate-a-gesture-dataset">Generate a Gesture Dataset</h2>
<p>To train our Transformer-based gesture recognizer, we need some data. Instead of downloading a huge dataset, we’ll start with a tiny synthetic dataset you can generate in seconds. This makes the tutorial lightweight and ensures that everyone can follow along without dealing with multi-gigabyte downloads.</p>
<h2 id="heading-option-1-generate-a-synthetic-dataset">Option 1: Generate a Synthetic Dataset</h2>
<p>We’ll use a small Python script that creates short <code>.mp4</code> clips of a moving (or still) coloured box. Each class represents a gesture:</p>
<ul>
<li><p><strong>swipe_left</strong> – box moves from right to left</p>
</li>
<li><p><strong>swipe_right</strong> – box moves from left to right</p>
</li>
<li><p><strong>stop</strong> – box stays still in the center</p>
</li>
</ul>
<p>Save this script as <code>generate_synthetic_gestures.py</code> in your project root:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os, cv2, numpy <span class="hljs-keyword">as</span> np, random, argparse

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ensure_dir</span>(<span class="hljs-params">p</span>):</span> os.makedirs(p, exist_ok=<span class="hljs-literal">True</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">make_clip</span>(<span class="hljs-params">mode, out_path, seconds=<span class="hljs-number">1.5</span>, fps=<span class="hljs-number">16</span>, size=<span class="hljs-number">224</span>, box_size=<span class="hljs-number">60</span>, seed=<span class="hljs-number">0</span>, codec=<span class="hljs-string">"mp4v"</span></span>):</span>
    rng = random.Random(seed)
    frames = int(seconds * fps)
    H = W = size

    <span class="hljs-comment"># background + box color</span>
    bg_val = rng.randint(<span class="hljs-number">160</span>, <span class="hljs-number">220</span>)
    bg = np.full((H, W, <span class="hljs-number">3</span>), bg_val, dtype=np.uint8)
    color = (rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>), rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>), rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>))

    <span class="hljs-comment"># path of motion</span>
    y = rng.randint(<span class="hljs-number">40</span>, H - <span class="hljs-number">40</span> - box_size)
    <span class="hljs-keyword">if</span> mode == <span class="hljs-string">"swipe_left"</span>:
        x_start, x_end = W - <span class="hljs-number">20</span> - box_size, <span class="hljs-number">20</span>
    <span class="hljs-keyword">elif</span> mode == <span class="hljs-string">"swipe_right"</span>:
        x_start, x_end = <span class="hljs-number">20</span>, W - <span class="hljs-number">20</span> - box_size
    <span class="hljs-keyword">elif</span> mode == <span class="hljs-string">"stop"</span>:
        x_start = x_end = (W - box_size) // <span class="hljs-number">2</span>
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"Unknown mode: <span class="hljs-subst">{mode}</span>"</span>)

    fourcc = cv2.VideoWriter_fourcc(*codec)
    vw = cv2.VideoWriter(out_path, fourcc, fps, (W, H))
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> vw.isOpened():
        <span class="hljs-keyword">raise</span> RuntimeError(
            <span class="hljs-string">f"Could not open VideoWriter with codec '<span class="hljs-subst">{codec}</span>'. "</span>
            <span class="hljs-string">"Try --codec XVID and use .avi extension, e.g. out.avi"</span>
        )

    <span class="hljs-keyword">for</span> t <span class="hljs-keyword">in</span> range(frames):
        alpha = t / max(<span class="hljs-number">1</span>, frames - <span class="hljs-number">1</span>)
        x = int((<span class="hljs-number">1</span> - alpha) * x_start + alpha * x_end)
        <span class="hljs-comment"># small jitter to avoid being too synthetic</span>
        jitter_x, jitter_y = rng.randint(<span class="hljs-number">-2</span>, <span class="hljs-number">2</span>), rng.randint(<span class="hljs-number">-2</span>, <span class="hljs-number">2</span>)
        frame = bg.copy()
        cv2.rectangle(frame, (x + jitter_x, y + jitter_y),
                      (x + jitter_x + box_size, y + jitter_y + box_size),
                      color, thickness=<span class="hljs-number">-1</span>)
        <span class="hljs-comment"># overlay text</span>
        cv2.putText(frame, mode, (<span class="hljs-number">8</span>, <span class="hljs-number">24</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">0.7</span>, (<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>), <span class="hljs-number">2</span>, cv2.LINE_AA)
        cv2.putText(frame, mode, (<span class="hljs-number">8</span>, <span class="hljs-number">24</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">0.7</span>, (<span class="hljs-number">255</span>, <span class="hljs-number">255</span>, <span class="hljs-number">255</span>), <span class="hljs-number">1</span>, cv2.LINE_AA)
        vw.write(frame)

    vw.release()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_labels</span>(<span class="hljs-params">labels, out_dir</span>):</span>
    <span class="hljs-keyword">with</span> open(os.path.join(out_dir, <span class="hljs-string">"labels.txt"</span>), <span class="hljs-string">"w"</span>, encoding=<span class="hljs-string">"utf-8"</span>) <span class="hljs-keyword">as</span> f:
        <span class="hljs-keyword">for</span> c <span class="hljs-keyword">in</span> labels:
            f.write(c + <span class="hljs-string">"\n"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    ap = argparse.ArgumentParser(description=<span class="hljs-string">"Generate a tiny synthetic gesture dataset."</span>)
    ap.add_argument(<span class="hljs-string">"--out"</span>, default=<span class="hljs-string">"data"</span>, help=<span class="hljs-string">"Output directory (default: data)"</span>)
    ap.add_argument(<span class="hljs-string">"--classes"</span>, nargs=<span class="hljs-string">"+"</span>,
                    default=[<span class="hljs-string">"swipe_left"</span>, <span class="hljs-string">"swipe_right"</span>, <span class="hljs-string">"stop"</span>],
                    help=<span class="hljs-string">"Class names (default: swipe_left swipe_right stop)"</span>)
    ap.add_argument(<span class="hljs-string">"--clips"</span>, type=int, default=<span class="hljs-number">16</span>, help=<span class="hljs-string">"Clips per class (default: 16)"</span>)
    ap.add_argument(<span class="hljs-string">"--seconds"</span>, type=float, default=<span class="hljs-number">1.5</span>, help=<span class="hljs-string">"Seconds per clip (default: 1.5)"</span>)
    ap.add_argument(<span class="hljs-string">"--fps"</span>, type=int, default=<span class="hljs-number">16</span>, help=<span class="hljs-string">"Frames per second (default: 16)"</span>)
    ap.add_argument(<span class="hljs-string">"--size"</span>, type=int, default=<span class="hljs-number">224</span>, help=<span class="hljs-string">"Frame size WxH (default: 224)"</span>)
    ap.add_argument(<span class="hljs-string">"--box"</span>, type=int, default=<span class="hljs-number">60</span>, help=<span class="hljs-string">"Box size (default: 60)"</span>)
    ap.add_argument(<span class="hljs-string">"--codec"</span>, default=<span class="hljs-string">"mp4v"</span>, help=<span class="hljs-string">"Codec fourcc (mp4v or XVID)"</span>)
    ap.add_argument(<span class="hljs-string">"--ext"</span>, default=<span class="hljs-string">".mp4"</span>, help=<span class="hljs-string">"File extension (.mp4 or .avi)"</span>)
    args = ap.parse_args()

    ensure_dir(args.out)
    write_labels(args.classes, <span class="hljs-string">"."</span>)  <span class="hljs-comment"># writes labels.txt to project root</span>

    print(<span class="hljs-string">f"Generating synthetic dataset -&gt; <span class="hljs-subst">{args.out}</span>"</span>)
    <span class="hljs-keyword">for</span> cls <span class="hljs-keyword">in</span> args.classes:
        cls_dir = os.path.join(args.out, cls)
        ensure_dir(cls_dir)
        mode = <span class="hljs-string">"stop"</span> <span class="hljs-keyword">if</span> cls == <span class="hljs-string">"stop"</span> <span class="hljs-keyword">else</span> (<span class="hljs-string">"swipe_left"</span> <span class="hljs-keyword">if</span> <span class="hljs-string">"left"</span> <span class="hljs-keyword">in</span> cls <span class="hljs-keyword">else</span> (<span class="hljs-string">"swipe_right"</span> <span class="hljs-keyword">if</span> <span class="hljs-string">"right"</span> <span class="hljs-keyword">in</span> cls <span class="hljs-keyword">else</span> <span class="hljs-string">"stop"</span>))
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(args.clips):
            filename = os.path.join(cls_dir, <span class="hljs-string">f"<span class="hljs-subst">{cls}</span>_<span class="hljs-subst">{i+<span class="hljs-number">1</span>:<span class="hljs-number">03</span>d}</span><span class="hljs-subst">{args.ext}</span>"</span>)
            make_clip(
                mode=mode,
                out_path=filename,
                seconds=args.seconds,
                fps=args.fps,
                size=args.size,
                box_size=args.box,
                seed=i + <span class="hljs-number">1</span>,
                codec=args.codec
            )
        print(<span class="hljs-string">f"  <span class="hljs-subst">{cls}</span>: <span class="hljs-subst">{args.clips}</span> clips"</span>)

    print(<span class="hljs-string">"Done. You can now run: python train.py, python export_onnx.py, python app.py"</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<p>The script generates a synthetic gesture dataset: short video clips of a moving or stationary coloured box that simulate "swipe left," "swipe right," and "stop." Clips are saved under the output directory, one subfolder per class, and the class names are written to <code>labels.txt</code> in the directory you run the script from.</p>
<p>Now run it inside your virtual environment:</p>
<pre><code class="lang-bash">python generate_synthetic_gestures.py --out data --clips 16 --seconds 1.5
</code></pre>
<p>The command above runs a Python script named <code>generate_synthetic_gestures.py</code>, which generates a synthetic gesture dataset with 16 clips per gesture, each lasting 1.5 seconds, and saves the output in a directory named "data".</p>
<p>This creates a dataset like:</p>
<pre><code class="lang-plaintext">data/
  swipe_left/*.mp4
  swipe_right/*.mp4
  stop/*.mp4
labels.txt
</code></pre>
<p>Each folder contains short clips of a moving (or still) box that simulate gestures. That is enough to test the pipeline end to end before moving to real data.</p>
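<p>Before training, it can help to confirm that every class folder actually contains clips. The helper below is a minimal sketch, not part of the original project: the name <code>count_clips</code> and the list of accepted extensions are assumptions made for illustration.</p>

```python
import os


def count_clips(root, exts=(".mp4", ".avi")):
    """Return {class_name: number_of_clips} for each subfolder of root."""
    counts = {}
    for cls in sorted(os.listdir(root)):
        cls_dir = os.path.join(root, cls)
        if not os.path.isdir(cls_dir):
            continue  # skip stray files that are not class folders
        counts[cls] = sum(
            1 for name in os.listdir(cls_dir) if name.lower().endswith(exts)
        )
    return counts


if __name__ == "__main__":
    # Only report if the default output directory from the generator exists.
    if os.path.isdir("data"):
        for cls, n in count_clips("data").items():
            print(f"{cls}: {n} clips")
```

<p>With the default generator settings, each class should report 16 clips. A class with zero clips usually points to a codec or extension mismatch.</p>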
<h3 id="heading-training-script-trainpy">Training Script: <code>train.py</code></h3>
<p>Now that we have our dataset, let’s fine-tune a Vision Transformer with temporal pooling. This model applies ViT frame-by-frame, averages embeddings across time, and trains a classification head on your gestures.</p>
<p>Here’s the full training script:</p>
<pre><code class="lang-python"><span class="hljs-comment"># train.py</span>
<span class="hljs-keyword">import</span> torch, torch.nn <span class="hljs-keyword">as</span> nn, torch.optim <span class="hljs-keyword">as</span> optim
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader
<span class="hljs-keyword">import</span> timm
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> GestureClips, read_labels

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ViTTemporal</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-string">"""Frame-wise ViT encoder -&gt; mean pool over time -&gt; linear head."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, num_classes, vit_name=<span class="hljs-string">"vit_tiny_patch16_224"</span></span>):</span>
        super().__init__()
        self.vit = timm.create_model(vit_name, pretrained=<span class="hljs-literal">True</span>, num_classes=<span class="hljs-number">0</span>, global_pool=<span class="hljs-string">"avg"</span>)
        feat_dim = self.vit.num_features
        self.head = nn.Linear(feat_dim, num_classes)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>  <span class="hljs-comment"># x: (B,T,C,H,W)</span>
        B, T, C, H, W = x.shape
        x = x.view(B * T, C, H, W)
        feats = self.vit(x)                  <span class="hljs-comment"># (B*T, D)</span>
        feats = feats.view(B, T, <span class="hljs-number">-1</span>).mean(dim=<span class="hljs-number">1</span>)  <span class="hljs-comment"># (B, D)</span>
        <span class="hljs-keyword">return</span> self.head(feats)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span>():</span>
    device = <span class="hljs-string">"cuda"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span>
    labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
    n_classes = len(labels)

    train_ds = GestureClips(train=<span class="hljs-literal">True</span>)
    val_ds   = GestureClips(train=<span class="hljs-literal">False</span>)
    print(<span class="hljs-string">f"Train clips: <span class="hljs-subst">{len(train_ds)}</span> | Val clips: <span class="hljs-subst">{len(val_ds)}</span>"</span>)

    <span class="hljs-comment"># Windows/CPU friendly</span>
    train_dl = DataLoader(train_ds, batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">True</span>,  num_workers=<span class="hljs-number">0</span>, pin_memory=<span class="hljs-literal">False</span>)
    val_dl   = DataLoader(val_ds,   batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">False</span>, num_workers=<span class="hljs-number">0</span>, pin_memory=<span class="hljs-literal">False</span>)

    model = ViTTemporal(num_classes=n_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=<span class="hljs-number">3e-4</span>, weight_decay=<span class="hljs-number">0.05</span>)

    best_acc = <span class="hljs-number">0.0</span>
    epochs = <span class="hljs-number">5</span>
    <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, epochs + <span class="hljs-number">1</span>):
        <span class="hljs-comment"># ---- Train ----</span>
        model.train()
        total, correct, loss_sum = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0.0</span>
        <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> train_dl:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()

            loss_sum += loss.item() * x.size(<span class="hljs-number">0</span>)
            correct += (logits.argmax(<span class="hljs-number">1</span>) == y).sum().item()
            total += x.size(<span class="hljs-number">0</span>)

        train_acc = correct / total <span class="hljs-keyword">if</span> total <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>
        train_loss = loss_sum / total <span class="hljs-keyword">if</span> total <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>

        <span class="hljs-comment"># ---- Validate ----</span>
        model.eval()
        vtotal, vcorrect = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>
        <span class="hljs-keyword">with</span> torch.no_grad():
            <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> val_dl:
                x, y = x.to(device), y.to(device)
                vcorrect += (model(x).argmax(<span class="hljs-number">1</span>) == y).sum().item()
                vtotal += x.size(<span class="hljs-number">0</span>)
        val_acc = vcorrect / vtotal <span class="hljs-keyword">if</span> vtotal <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>

        print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch:<span class="hljs-number">02</span>d}</span> | train_loss <span class="hljs-subst">{train_loss:<span class="hljs-number">.4</span>f}</span> "</span>
              <span class="hljs-string">f"| train_acc <span class="hljs-subst">{train_acc:<span class="hljs-number">.3</span>f}</span> | val_acc <span class="hljs-subst">{val_acc:<span class="hljs-number">.3</span>f}</span>"</span>)

        <span class="hljs-keyword">if</span> val_acc &gt; best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), <span class="hljs-string">"vit_temporal_best.pt"</span>)

    print(<span class="hljs-string">"Best val acc:"</span>, best_acc)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    train()
</code></pre>
<p>Running the command <code>python train.py</code> initiates the training process for your gesture recognition model. Here's a breakdown of what happens:</p>
<ol>
<li><p><strong>Load your dataset from data/</strong>: The script will access and load the gesture dataset stored in the "data" directory. This dataset is used to train the model.</p>
</li>
<li><p><strong>Fine-tune a pre-trained Vision Transformer</strong>: The script loads the pretrained <code>vit_tiny_patch16_224</code> weights through <code>timm</code>, attaches a new linear classification head, and fine-tunes the whole model on your gesture clips. Starting from pretrained weights lets the model adapt to your data with far fewer examples than training from scratch would need.</p>
</li>
<li><p><strong>Save the best checkpoint as vit_temporal_best.pt</strong>: After each epoch, the script measures accuracy on the validation split. Whenever validation accuracy improves on the previous best, the model weights are saved to "vit_temporal_best.pt". This file is what the export and evaluation scripts load later.</p>
</li>
</ol>
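<p>The temporal pooling step in <code>ViTTemporal.forward</code> is just an average over the per-frame embeddings. The snippet below is a framework-free sketch of that idea using plain Python lists. The function name <code>mean_pool_over_time</code> and the toy numbers are made up for illustration.</p>

```python
def mean_pool_over_time(frame_embeddings):
    """Average T per-frame embedding vectors (each of length D) into one vector."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    num_frames = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [
        sum(frame[d] for frame in frame_embeddings) / num_frames
        for d in range(dim)
    ]


# Toy clip: T=3 frames, D=2 features per frame (values are made up).
clip = [[1.0, 0.0], [2.0, 4.0], [3.0, 2.0]]
print(mean_pool_over_time(clip))        # [2.0, 2.0]
print(mean_pool_over_time(clip[::-1]))  # [2.0, 2.0]
```

<p>Note that reversing the clip gives the same result: a plain average ignores frame order. That keeps the model simple, but it also means the pooled feature alone cannot tell a left swipe from the same frames played backwards.</p>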
<h4 id="heading-what-training-looks-like">What Training Looks Like</h4>
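<p>You should see logs similar to this, with one line per epoch (the number of lines depends on the <code>epochs</code> value in <code>train.py</code>, which is 5 in the script above):</p>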
<pre><code class="lang-plaintext">Train clips: 38 | Val clips: 10
Epoch 01 | train_loss 1.4508 | train_acc 0.395 | val_acc 0.200
Epoch 02 | train_loss 1.2466 | train_acc 0.263 | val_acc 0.200
Epoch 03 | train_loss 1.1361 | train_acc 0.368 | val_acc 0.200
Best val acc: 0.200
</code></pre>
<p>Don’t worry if your accuracy is low at first. With a tiny synthetic dataset and only a few epochs, that’s expected. The key is proving that the Transformer pipeline works. You can boost results later by:</p>
<ul>
<li><p>Adding more clips per class</p>
</li>
<li><p>Training for more epochs</p>
</li>
<li><p>Switching to real recorded gestures</p>
</li>
</ul>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/training-logs.png?raw=true" alt="Training logs" width="600" height="400" loading="lazy"></p>
<p>Figure 1. Example training logs from <code>train.py</code>, where the Vision Transformer with temporal pooling is fine-tuned on a tiny synthetic dataset.</p>
<h3 id="heading-export-the-model-to-onnx">Export the Model to ONNX</h3>
<p>To make our model easier to run in real time (and lighter on CPU), we’ll export it to the ONNX format.</p>
<p><strong>Note:</strong> ONNX, which stands for Open Neural Network Exchange, is an open-source format for exchanging deep learning models between frameworks and runtimes. It lets you train a model in one framework, such as PyTorch or TensorFlow, and then run it elsewhere, for example with ONNX Runtime, without rewriting the model. This interoperability comes from a standardized representation of the model's computation graph and parameters.</p>
<p>ONNX supports a wide range of operators and is continually updated to include new features, making it a versatile choice for deploying machine learning models across various platforms and devices.</p>
<p>Create a file called <code>export_onnx.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> train <span class="hljs-keyword">import</span> ViTTemporal
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
n_classes = len(labels)

<span class="hljs-comment"># Load trained model</span>
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load(<span class="hljs-string">"vit_temporal_best.pt"</span>, map_location=<span class="hljs-string">"cpu"</span>))
model.eval()

<span class="hljs-comment"># Dummy input: batch=1, 16 frames, 3x224x224</span>
dummy = torch.randn(<span class="hljs-number">1</span>, <span class="hljs-number">16</span>, <span class="hljs-number">3</span>, <span class="hljs-number">224</span>, <span class="hljs-number">224</span>)

<span class="hljs-comment"># Export</span>
torch.onnx.export(
    model, dummy, <span class="hljs-string">"vit_temporal.onnx"</span>,
    input_names=[<span class="hljs-string">"video"</span>], output_names=[<span class="hljs-string">"logits"</span>],
    dynamic_axes={<span class="hljs-string">"video"</span>: {<span class="hljs-number">0</span>: <span class="hljs-string">"batch"</span>}},
    opset_version=<span class="hljs-number">13</span>
)

print(<span class="hljs-string">"Exported vit_temporal.onnx"</span>)
</code></pre>
<p>Run it with <code>python export_onnx.py</code>.</p>
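<p>This generates a file <code>vit_temporal.onnx</code> in your project folder. The exported graph expects clips of 16 frames at 224×224, and only the batch dimension is dynamic. Having an ONNX file lets us run the model with <code>onnxruntime</code>, which is often faster than PyTorch for CPU inference.</p>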
<p>Create a file called <code>app.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os, tempfile, cv2, torch, onnxruntime, numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> gradio <span class="hljs-keyword">as</span> gr
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

T = <span class="hljs-number">16</span>
SIZE = <span class="hljs-number">224</span>
MODEL_PATH = <span class="hljs-string">"vit_temporal.onnx"</span>

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)

<span class="hljs-comment"># --- ONNX session + auto-detect names ---</span>
ort_session = onnxruntime.InferenceSession(MODEL_PATH, providers=[<span class="hljs-string">"CPUExecutionProvider"</span>])
<span class="hljs-comment"># detect first input and first output names to avoid mismatches</span>
INPUT_NAME = ort_session.get_inputs()[<span class="hljs-number">0</span>].name   <span class="hljs-comment"># e.g. "input" or "video"</span>
OUTPUT_NAME = ort_session.get_outputs()[<span class="hljs-number">0</span>].name <span class="hljs-comment"># e.g. "logits" or something else</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess_clip</span>(<span class="hljs-params">frames_rgb</span>):</span>
    <span class="hljs-keyword">if</span> len(frames_rgb) == <span class="hljs-number">0</span>:
        frames_rgb = [np.zeros((SIZE, SIZE, <span class="hljs-number">3</span>), dtype=np.uint8)]
    <span class="hljs-keyword">if</span> len(frames_rgb) &lt; T:
        frames_rgb = frames_rgb + [frames_rgb[<span class="hljs-number">-1</span>]] * (T - len(frames_rgb))
    frames_rgb = frames_rgb[:T]
    clip = [cv2.resize(f, (SIZE, SIZE), interpolation=cv2.INTER_AREA) <span class="hljs-keyword">for</span> f <span class="hljs-keyword">in</span> frames_rgb]
    clip = np.stack(clip, axis=<span class="hljs-number">0</span>)                                    <span class="hljs-comment"># (T,H,W,3)</span>
    clip = np.transpose(clip, (<span class="hljs-number">0</span>, <span class="hljs-number">3</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>)).astype(np.float32) / <span class="hljs-number">255</span> <span class="hljs-comment"># (T,3,H,W)</span>
    clip = (clip - <span class="hljs-number">0.5</span>) / <span class="hljs-number">0.5</span>
    clip = np.expand_dims(clip, <span class="hljs-number">0</span>)                                   <span class="hljs-comment"># (1,T,3,H,W)</span>
    <span class="hljs-keyword">return</span> clip

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_extract_path_from_gradio_video</span>(<span class="hljs-params">inp</span>):</span>
    <span class="hljs-keyword">if</span> isinstance(inp, str) <span class="hljs-keyword">and</span> os.path.exists(inp):
        <span class="hljs-keyword">return</span> inp
    <span class="hljs-keyword">if</span> isinstance(inp, dict):
        <span class="hljs-keyword">for</span> key <span class="hljs-keyword">in</span> (<span class="hljs-string">"video"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"path"</span>, <span class="hljs-string">"filepath"</span>):
            v = inp.get(key)
            <span class="hljs-keyword">if</span> isinstance(v, str) <span class="hljs-keyword">and</span> os.path.exists(v):
                <span class="hljs-keyword">return</span> v
        <span class="hljs-keyword">for</span> key <span class="hljs-keyword">in</span> (<span class="hljs-string">"data"</span>, <span class="hljs-string">"video"</span>):
            v = inp.get(key)
            <span class="hljs-keyword">if</span> isinstance(v, (bytes, bytearray)):
                tmp = tempfile.NamedTemporaryFile(delete=<span class="hljs-literal">False</span>, suffix=<span class="hljs-string">".mp4"</span>)
                tmp.write(v); tmp.flush(); tmp.close()
                <span class="hljs-keyword">return</span> tmp.name
    <span class="hljs-keyword">if</span> isinstance(inp, (list, tuple)) <span class="hljs-keyword">and</span> inp <span class="hljs-keyword">and</span> isinstance(inp[<span class="hljs-number">0</span>], str) <span class="hljs-keyword">and</span> os.path.exists(inp[<span class="hljs-number">0</span>]):
        <span class="hljs-keyword">return</span> inp[<span class="hljs-number">0</span>]
    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_read_uniform_frames</span>(<span class="hljs-params">video_path</span>):</span>
    cap = cv2.VideoCapture(video_path)
    frames = []
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) <span class="hljs-keyword">or</span> <span class="hljs-number">1</span>
    idxs = np.linspace(<span class="hljs-number">0</span>, total - <span class="hljs-number">1</span>, max(T, <span class="hljs-number">1</span>)).astype(int)
    want = set(int(i) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> idxs.tolist())
    j = <span class="hljs-number">0</span>
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        ok, bgr = cap.read()
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok: <span class="hljs-keyword">break</span>
        <span class="hljs-keyword">if</span> j <span class="hljs-keyword">in</span> want:
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            frames.append(rgb)
        j += <span class="hljs-number">1</span>
    cap.release()
    <span class="hljs-keyword">return</span> frames

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_from_video</span>(<span class="hljs-params">gradio_video</span>):</span>
    video_path = _extract_path_from_gradio_video(gradio_video)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> video_path <span class="hljs-keyword">or</span> <span class="hljs-keyword">not</span> os.path.exists(video_path):
        <span class="hljs-keyword">return</span> {}
    frames = _read_uniform_frames(video_path)

    <span class="hljs-comment"># If OpenCV choked on the codec (common with recorded webm), re-encode once:</span>
    <span class="hljs-keyword">if</span> len(frames) == <span class="hljs-number">0</span>:
        tmp = tempfile.NamedTemporaryFile(delete=<span class="hljs-literal">False</span>, suffix=<span class="hljs-string">".mp4"</span>); tmp_name = tmp.name; tmp.close()
        cap = cv2.VideoCapture(video_path)
        fourcc = cv2.VideoWriter_fourcc(*<span class="hljs-string">"mp4v"</span>)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) <span class="hljs-keyword">or</span> <span class="hljs-number">640</span>
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) <span class="hljs-keyword">or</span> <span class="hljs-number">480</span>
        out = cv2.VideoWriter(tmp_name, fourcc, <span class="hljs-number">20.0</span>, (w, h))
        <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
            ok, frame = cap.read()
            <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok: <span class="hljs-keyword">break</span>
            out.write(frame)
        cap.release(); out.release()
        frames = _read_uniform_frames(tmp_name)

    clip = preprocess_clip(frames)
    <span class="hljs-comment"># &gt;&gt;&gt; use the detected ONNX input/output names &lt;&lt;&lt;</span>
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[<span class="hljs-number">0</span>]  <span class="hljs-comment"># (1, C)</span>
    probs = torch.softmax(torch.from_numpy(logits), dim=<span class="hljs-number">1</span>)[<span class="hljs-number">0</span>].numpy().tolist()
    <span class="hljs-keyword">return</span> {labels[i]: float(probs[i]) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(labels))}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_from_image</span>(<span class="hljs-params">image</span>):</span>
    <span class="hljs-keyword">if</span> image <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">return</span> {}
    clip = preprocess_clip([image] * T)
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[<span class="hljs-number">0</span>]
    probs = torch.softmax(torch.from_numpy(logits), dim=<span class="hljs-number">1</span>)[<span class="hljs-number">0</span>].numpy().tolist()
    <span class="hljs-keyword">return</span> {labels[i]: float(probs[i]) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(labels))}

<span class="hljs-keyword">with</span> gr.Blocks() <span class="hljs-keyword">as</span> demo:
    gr.Markdown(<span class="hljs-string">"# Gesture Classifier (ONNX)\nRecord or upload a short video, then click **Classify Video**."</span>)
    <span class="hljs-keyword">with</span> gr.Tab(<span class="hljs-string">"Video (record or upload)"</span>):
        vid_in = gr.Video(label=<span class="hljs-string">"Record from webcam or upload a short clip"</span>)
        vid_out = gr.Label(num_top_classes=<span class="hljs-number">3</span>, label=<span class="hljs-string">"Prediction"</span>)
        gr.Button(<span class="hljs-string">"Classify Video"</span>).click(fn=predict_from_video, inputs=vid_in, outputs=vid_out)
    <span class="hljs-keyword">with</span> gr.Tab(<span class="hljs-string">"Single Image (fallback)"</span>):
        img_in = gr.Image(label=<span class="hljs-string">"Upload an image frame"</span>, type=<span class="hljs-string">"numpy"</span>)
        img_out = gr.Label(num_top_classes=<span class="hljs-number">3</span>, label=<span class="hljs-string">"Prediction"</span>)
        gr.Button(<span class="hljs-string">"Classify Image"</span>).click(fn=predict_from_image, inputs=img_in, outputs=img_out)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    demo.launch()
</code></pre>
<p>Running the command <code>python app.py</code> starts a local Gradio server and prints a URL that you can open in your web browser. Here's what the app does:</p>
<ol>
<li><p><strong>Record or upload a clip</strong>: The Video tab lets you record a short clip from your webcam or upload an existing video file. Nothing is classified until you click the button.</p>
</li>
<li><p><strong>Classify on demand</strong>: When you click <strong>Classify Video</strong>, the app samples 16 evenly spaced frames from the clip, resizes them to 224×224, and runs them through the ONNX model in a single pass.</p>
</li>
<li><p><strong>Top 3 gesture classes displayed with probabilities</strong>: The application displays the top three predicted gesture classes along with their probabilities, giving you an idea of the model's confidence in its predictions.</p>
</li>
</ol>
<p>When you open the app in your browser, you'll find two tabs. In the <strong>Video</strong> tab, record a short clip of your gesture from the webcam (2–4 seconds is usually enough) or upload an existing file, then click <strong>Classify Video</strong>. The app runs the sampled frames through the exported Transformer model and displays the predicted gesture probabilities. The <strong>Single Image</strong> tab is a fallback: it repeats one uploaded image 16 times to build a clip, which is handy for a quick check but carries no motion information.</p>
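<p>To make the frame sampling concrete, here is a small sketch of how evenly spaced frame indices can be chosen. It mirrors the idea behind <code>_read_uniform_frames</code> but uses plain integer arithmetic instead of NumPy, so individual indices may differ slightly from what <code>np.linspace</code> produces. The function name <code>uniform_frame_indices</code> is hypothetical.</p>

```python
def uniform_frame_indices(total_frames, num_samples=16):
    """Pick num_samples indices spread evenly from 0 to total_frames - 1."""
    if total_frames < 1 or num_samples < 1:
        return []
    if num_samples == 1:
        return [0]
    last = total_frames - 1
    # Integer arithmetic keeps the first index at 0 and the final index at `last`.
    return [i * last // (num_samples - 1) for i in range(num_samples)]


indices = uniform_frame_indices(total_frames=60, num_samples=16)
print(indices[0], indices[-1], len(indices))  # 0 59 16

# A clip shorter than num_samples yields repeated indices.
short = uniform_frame_indices(total_frames=5, num_samples=16)
print(sorted(set(short)))  # [0, 1, 2, 3, 4]
```

<p>In <code>app.py</code> the indices are stored in a set, so a clip with fewer than 16 frames produces fewer than 16 unique frames, and <code>preprocess_clip</code> then pads the clip by repeating the last frame.</p>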
<p>Here’s an example where I raised my hand for a <strong>stop</strong> gesture, and the model predicts “stop” as the top class:</p>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/realtime-demo.png?raw=true" alt="Gradio demo output" width="600" height="400" loading="lazy"></p>
<p>Figure 2. The Gradio app running locally. After recording a short clip, the Transformer model predicts the gesture with class probabilities.</p>
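<p>The probabilities shown in the demo come from applying softmax to the model's raw logits and keeping the highest-scoring classes. The sketch below reproduces that step in plain Python. The helper names and the logit values are assumptions for illustration, not output from the trained model.</p>

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(labels, logits, k=3):
    """Return the k most likely (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


labels = ["swipe_left", "swipe_right", "stop"]
logits = [0.2, -1.0, 2.3]  # made-up scores for illustration
for name, p in top_k(labels, logits):
    print(f"{name}: {p:.3f}")
```

<p>Subtracting the largest logit before exponentiating does not change the result, but it avoids overflow when the scores are large.</p>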
<h3 id="heading-evaluate-accuracy-latency">Evaluate Accuracy + Latency</h3>
<p>Now that the model runs in a demo app, let’s check how well it performs. There are two sides to this:</p>
<ul>
<li><p><strong>Accuracy</strong>: does the model predict the right gesture class?</p>
</li>
<li><p><strong>Latency</strong>: how fast does it respond, especially on CPU vs GPU?</p>
</li>
</ul>
<h4 id="heading-1-quick-accuracy-check">1. Quick Accuracy Check</h4>
<p>Save this as <code>eval.py</code> in the same folder as your other scripts:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> GestureClips, read_labels
<span class="hljs-keyword">from</span> train <span class="hljs-keyword">import</span> ViTTemporal

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
n_classes = len(labels)

<span class="hljs-comment"># Load validation data</span>
val_ds = GestureClips(train=<span class="hljs-literal">False</span>)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># Load trained model</span>
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load(<span class="hljs-string">"vit_temporal_best.pt"</span>, map_location=<span class="hljs-string">"cpu"</span>))
model.eval()

correct, total = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>
all_preds, all_labels = [], []

<span class="hljs-keyword">with</span> torch.no_grad():
    <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> val_dl:
        logits = model(x)
        preds = logits.argmax(dim=<span class="hljs-number">1</span>)
        correct += (preds == y).sum().item()
        total += y.size(<span class="hljs-number">0</span>)
        all_preds.extend(preds.tolist())
        all_labels.extend(y.tolist())

print(<span class="hljs-string">f"Validation accuracy: <span class="hljs-subst">{correct/total:<span class="hljs-number">.2</span>%}</span>"</span>)
</code></pre>
<h4 id="heading-2-confusion-matrix">2. Confusion Matrix</h4>
<p>Let’s also visualize which gestures are confused. Add this snippet at the bottom of <code>eval.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> confusion_matrix

cm = confusion_matrix(all_labels, all_preds)

plt.figure(figsize=(<span class="hljs-number">6</span>,<span class="hljs-number">6</span>))
sns.heatmap(cm, annot=<span class="hljs-literal">True</span>, fmt=<span class="hljs-string">"d"</span>, xticklabels=labels, yticklabels=labels, cmap=<span class="hljs-string">"Blues"</span>)
plt.xlabel(<span class="hljs-string">"Predicted"</span>)
plt.ylabel(<span class="hljs-string">"True"</span>)
plt.title(<span class="hljs-string">"Confusion Matrix"</span>)
plt.tight_layout()
plt.show()
</code></pre>
<p>When you run <code>python eval.py</code>, a heatmap like this will pop up:</p>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/confusion-matrix.png?raw=true" alt="Confusion matrix" width="600" height="400" loading="lazy"></p>
<p>Figure 3. Confusion matrix on the validation set. Correct predictions appear along the diagonal. Off-diagonal counts show gesture confusions.</p>
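<p>If you want to see how the numbers in the heatmap are produced, the sketch below builds a confusion matrix by hand and derives per-class recall from it. It follows the same convention as scikit-learn (rows are true classes, columns are predictions). The label lists are made up for illustration.</p>

```python
def confusion_matrix(true_ids, pred_ids, num_classes):
    """cm[t][p] counts samples whose true class is t and predicted class is p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_ids, pred_ids):
        cm[t][p] += 1
    return cm


def per_class_recall(cm):
    """Diagonal count divided by row total; None for classes with no samples."""
    recalls = []
    for i, row in enumerate(cm):
        row_total = sum(row)
        recalls.append(row[i] / row_total if row_total else None)
    return recalls


# Made-up labels for illustration: 0=swipe_left, 1=swipe_right, 2=stop
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)                    # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(per_class_recall(cm))  # [0.5, 1.0, 0.5]
```

<p>Per-class recall is often more informative than overall accuracy on a small validation set, because it shows which gesture the model is missing.</p>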
<h4 id="heading-3-latency-benchmark">3. Latency Benchmark</h4>
<p>Finally, let’s see how fast inference runs. Save the following as <code>benchmark.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time, numpy <span class="hljs-keyword">as</span> np, onnxruntime
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)

ort = onnxruntime.InferenceSession(<span class="hljs-string">"vit_temporal.onnx"</span>, providers=[<span class="hljs-string">"CPUExecutionProvider"</span>])
INPUT_NAME = ort.get_inputs()[<span class="hljs-number">0</span>].name
OUTPUT_NAME = ort.get_outputs()[<span class="hljs-number">0</span>].name

dummy = np.random.randn(<span class="hljs-number">1</span>, <span class="hljs-number">16</span>, <span class="hljs-number">3</span>, <span class="hljs-number">224</span>, <span class="hljs-number">224</span>).astype(np.float32)

<span class="hljs-comment"># Warmup</span>
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(<span class="hljs-number">3</span>):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})

<span class="hljs-comment"># Benchmark</span>
t0 = time.perf_counter()
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(<span class="hljs-number">50</span>):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})
t1 = time.perf_counter()

print(<span class="hljs-string">f"Average latency: <span class="hljs-subst">{(t1 - t0)/<span class="hljs-number">50</span>:<span class="hljs-number">.3</span>f}</span> seconds per clip"</span>)
</code></pre>
<p>Run: <code>python benchmark.py</code></p>
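<p>On CPU, you might see ~0.05–0.15s per clip. On a GPU it is usually much faster, but this script pins <code>CPUExecutionProvider</code>, so you would need a GPU build of ONNX Runtime and a GPU provider such as <code>CUDAExecutionProvider</code> to measure that.</p>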
<p><strong>Note</strong>: If latency is high, you can enable <strong>quantization</strong> in ONNX to shrink the model and speed up inference.</p>
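<p>One way to try this is dynamic quantization from <code>onnxruntime.quantization</code>, which stores weights as INT8. The snippet below is a minimal sketch, not part of the original repo: the helper name and the output filename <code>vit_temporal.int8.onnx</code> are assumptions, and the quantization tools also need the <code>onnx</code> package installed. Speed-ups are not guaranteed for every model, so re-run <code>benchmark.py</code> and <code>eval.py</code> against the new file before relying on it.</p>

```python
from pathlib import Path


def quantize_if_present(src="vit_temporal.onnx", dst="vit_temporal.int8.onnx"):
    """Write an INT8 dynamically quantized copy of src; return True on success."""
    if not Path(src).exists():
        print(f"{src} not found; export the model first")
        return False
    # Imported lazily so the existence check works even without onnxruntime installed
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(src, dst, weight_type=QuantType.QInt8)
    print(f"Wrote {dst}")
    return True


if __name__ == "__main__":
    quantize_if_present()
```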
<h2 id="heading-option-2-use-small-samples-from-public-gesture-datasets">Option 2: Use Small Samples from Public Gesture Datasets</h2>
<p>If you’d prefer to see your model trained on <em>real</em> gesture clips instead of synthetic moving boxes, you can grab a handful of videos from open datasets. You don’t need to download the entire dataset (which can be several GB). A few <code>.mp4</code> samples are enough to follow along.</p>
<h3 id="heading-recommended-sources">Recommended sources</h3>
<ul>
<li><p><strong>20BN Jester Dataset</strong>: Contains short clips of hand gestures like swiping, clapping, and pointing.</p>
</li>
<li><p><strong>WLASL</strong>: A large-scale dataset of isolated sign language words.</p>
</li>
</ul>
<p>Both projects provide small <code>.mp4</code> videos you can use as realistic training examples. I’ve linked them below.</p>
<h3 id="heading-setting-up-your-dataset-folder">Setting up your dataset folder</h3>
<p>Once you download a few clips, place them in the <code>data/</code> folder under subfolders named after each gesture class. For example:</p>
<pre><code class="lang-plaintext">data/
├── swipe_left/
│   ├── clip1.mp4
│   └── clip2.mp4
├── swipe_right/
│   ├── clip1.mp4
│   └── clip2.mp4
└── stop/
    ├── clip1.mp4
    └── clip2.mp4
</code></pre>
<p>And update <code>labels.txt</code> to match the folder names:</p>
<pre><code class="lang-plaintext">swipe_left
swipe_right
stop
</code></pre>
<p>Now your dataset is ready, and the same training scripts from earlier (<code>train.py</code>, <code>eval.py</code>) will work without modification.</p>
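<p>Before training, it can help to confirm that the folder names and <code>labels.txt</code> really agree. The following is a small sketch using only the standard library. <code>check_dataset</code> is a hypothetical helper (it is not in the repo) and it assumes one label per line, as in the example above.</p>

```python
from pathlib import Path


def check_dataset(data_dir="data", labels_file="labels.txt"):
    """Return {label: number of .mp4 clips}; raise if folders and labels disagree."""
    lines = Path(labels_file).read_text().splitlines()
    labels = [line.strip() for line in lines if line.strip()]
    folders = sorted(p.name for p in Path(data_dir).iterdir() if p.is_dir())
    if sorted(labels) != folders:
        raise ValueError(
            f"{labels_file} lists {sorted(labels)} but {data_dir}/ contains {folders}"
        )
    return {label: len(list((Path(data_dir) / label).glob("*.mp4"))) for label in labels}


# Only run the check when the expected files are present
if Path("data").is_dir() and Path("labels.txt").is_file():
    print(check_dataset())
```

<p>A class with zero or very few clips shows up immediately in the printed counts, which is easier to spot here than halfway through training.</p>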
<h3 id="heading-why-choose-this-option">Why choose this option?</h3>
<ul>
<li><p>Gives more realistic results than synthetic coloured boxes</p>
</li>
<li><p>Lets you see how the model handles <em>actual human hand movements</em></p>
</li>
<li><p>It just requires a bit more effort (downloading clips, trimming them if needed)</p>
</li>
</ul>
<p><strong>Tip:</strong> If downloading from these datasets feels too heavy, you can also record your own short gestures using your laptop webcam. Just save them as <code>.mp4</code> files and organize them in the same folder structure.</p>
<h2 id="heading-accessibility-notes-amp-ethical-limits">Accessibility Notes &amp; Ethical Limits</h2>
<p>While this project shows the technical workflow for gesture recognition with Transformers, it’s important to step back and consider the <strong>human context</strong>:</p>
<ul>
<li><p><strong>Accessibility first</strong>: Tools like this can help students with speech or motor difficulties, but they should always be co-designed with the people who will use them. Don’t assume one-size-fits-all.</p>
</li>
<li><p><strong>Dataset sensitivity</strong>: Using publicly available sign or gesture datasets is fine for prototyping, but deploying such a system requires careful consideration of consent and representation.</p>
</li>
<li><p><strong>Error tolerance</strong>: Even small misclassifications can have big consequences in accessibility contexts (for example, confusing <em>stop</em> with <em>go</em>). Always plan for fallback options (like manual input or confirmation).</p>
</li>
<li><p><strong>Bias and inclusivity</strong>: Models trained on narrow datasets may fail for different skin tones, lighting conditions, or cultural gesture variations. Broad and diverse training data is essential for fairness.</p>
</li>
</ul>
<p>In other words: this demo is a <strong>teaching scaffold</strong>, not a production-ready accessibility tool. Responsible deployment requires collaboration with educators, therapists, and end users.</p>
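<p>The error tolerance point above can be made concrete with a confidence threshold: if the model is not sure enough, the app asks the user to confirm instead of acting. This is a rough sketch under stated assumptions. The function names and the 0.8 threshold are illustrative, and a real threshold should be tuned with the people who will use the tool.</p>

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, labels, threshold=0.8):
    """Return (label, needs_confirmation) for one prediction."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # Ask for confirmation whenever the top probability is below the threshold
    needs_confirmation = threshold > probs[best]
    return labels[best], needs_confirmation


labels = ["stop", "go", "help"]
print(decide([5.0, 0.0, 0.0], labels))  # clear winner
print(decide([1.0, 0.9, 0.0], labels))  # close call between two classes
```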
<h2 id="heading-next-steps">Next Steps</h2>
<p>If you’d like to push this project further, here are some directions to explore:</p>
<ul>
<li><p><strong>Better models</strong>: Try video-focused Transformers like <a target="_blank" href="https://arxiv.org/abs/2102.05095">TimeSformer</a> or <a target="_blank" href="https://arxiv.org/abs/2203.12602">VideoMAE</a> for stronger temporal reasoning.</p>
</li>
<li><p><strong>Larger vocabularies</strong>: Add more gesture classes, build your own dataset, or use portions of public datasets like <a target="_blank" href="https://www.kaggle.com/datasets/toxicmender/20bn-jester">20BN Jester</a> or <a target="_blank" href="https://www.kaggle.com/datasets/risangbaskoro/wlasl-processed">WLASL</a>.</p>
</li>
<li><p><strong>Pose fusion</strong>: Combine gesture video with human pose keypoints from <a target="_blank" href="https://mediapipe.readthedocs.io/en/latest/solutions/hands.html">MediaPipe</a> or <a target="_blank" href="https://github.com/CMU-Perceptual-Computing-Lab/openpose">OpenPose</a> for more robust predictions.</p>
</li>
<li><p><strong>Real-time smoothing</strong>: Implement temporal smoothing or debounce logic in the app so predictions are more stable during live use.</p>
</li>
<li><p><strong>Quantization + edge devices</strong>: Convert your ONNX model to an INT8 quantized version and deploy it on a Raspberry Pi or Jetson Nano for classroom-ready prototypes.</p>
</li>
</ul>
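<p>As one possible starting point for the real-time smoothing idea, here is a majority-vote sketch over the most recent predictions. <code>PredictionSmoother</code> is a hypothetical class, not something from the repo, and the window size is an assumption you would tune against your frame rate.</p>

```python
from collections import Counter, deque


class PredictionSmoother:
    """Report a label only once it wins a strict majority of the last `window` predictions."""

    def __init__(self, window=5):
        self.recent = deque(maxlen=window)
        self.stable = None  # last label that won a full window

    def update(self, label):
        self.recent.append(label)
        top, count = Counter(self.recent).most_common(1)[0]
        # Switch only when the window is full and one label holds a strict majority
        if len(self.recent) == self.recent.maxlen and count * 2 > self.recent.maxlen:
            self.stable = top
        return self.stable


smoother = PredictionSmoother(window=3)
for raw in ["stop", "stop", "stop", "go", "stop"]:
    print(raw, "shown as", smoother.update(raw))
```

<p>In this run the single stray <code>go</code> never reaches the user, because it never holds a majority of the window. The trade-off is a short delay before a genuine change of gesture is displayed.</p>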
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you built a gesture recognition system with Transformer models. You prepared a small dataset, trained a Vision Transformer with temporal pooling, exported the model to ONNX for efficient inference, and deployed a real-time Gradio app. You then measured accuracy and latency to check how well and how quickly the system responds.</p>
<p>This project shows how ML methods like these can support accessibility and communication, and help make learning environments more inclusive.</p>
<p>Remember: while this demo works with small datasets, real-world applications need larger, more diverse data and careful consideration of accessibility, inclusivity, and ethics.</p>
<p>Here’s the GitHub repo for full source code: <a target="_blank" href="https://github.com/tayo4christ/transformer-gesture">transformer-gesture</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Multimodal Makaton-to-English Translator for Accessible Education ]]>
                </title>
                <description>
                    <![CDATA[ A year nine student walks into class full of ideas, but when it is time to contribute, the tools around them do not listen. Their speech is difficult for standard voice systems to recognise, typing feels slow and exhausting, and the lesson moves on w... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-multimodal-translator-for-accessible-education/</link>
                <guid isPermaLink="false">68cb5e6df1766dffdd20f610</guid>
                
                    <category>
                        <![CDATA[ Accessibility ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Thu, 18 Sep 2025 01:20:45 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1758158024064/bf3d7dac-0231-450a-9b40-6abf43085e49.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>A year nine student walks into class full of ideas, but when it is time to contribute, the tools around them do not listen. Their speech is difficult for standard voice systems to recognise, typing feels slow and exhausting, and the lesson moves on without their voice being heard. The challenge is not a lack of ability but a lack of access.</p>
<p>Across the world, millions of learners face communication barriers. Some live with apraxia of speech or dysarthria, others with limited mobility, hearing differences, or neurodiverse needs. When speaking, writing, or pointing is unreliable or tiring, participation becomes limited, feedback is lost, and confidence slowly erodes. This is not a rare exception but an everyday reality in classrooms.</p>
<p>These barriers appear in very practical ways. Students are skipped or misunderstood when they cannot respond quickly. Their ability is under-measured because their means of expression are constrained. Teachers struggle to maintain the pace of lessons while making individual accommodations. Peers interact less often, reducing opportunities for social belonging.</p>
<p>Assistive technologies have helped over the years, with tools like text-to-speech, symbol boards, and simple gesture inputs. Yet most of these tools are designed for a single mode of interaction. They assume the learner will either speak, or type, or tap. Real communication, however, is fluid. Learners naturally combine gestures, partial speech, symbols, and context to share meaning, especially when fatigue, anxiety, or motor challenges come into play.</p>
<p>This is where modern AI changes the picture. We are beginning to move beyond single-solution tools into multimodal systems that can understand speech, even when it is disordered, interpret gestures and visual symbols, combine signals to infer intent, and adapt in real time as the learner’s abilities develop or change.</p>
<p>AI is reshaping accessibility in education by shifting from isolated tools to multimodal and adaptive systems. These systems combine gesture, speech, and intelligent feedback to meet learners where they are, while also supporting their growth over time.</p>
<p>In this article, we will explore what this shift looks like in practice, how it can unlock participation, and how adaptive feedback personalises support. We will also build a hands-on multimodal demo that turns these ideas into a classroom-ready tool.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p><strong>An Operating System:</strong> Windows, macOS, or Linux</p>
</li>
<li><p><strong>Python installed (3.9 or later)</strong> – Along with <code>pip</code> for installing packages.</p>
</li>
<li><p><strong>Editor:</strong> Visual Studio Code or any other integrated development environment (IDE)</p>
</li>
<li><p><strong>Basics:</strong> Comfortable running commands in a terminal</p>
</li>
<li><p><strong>Optional hardware:</strong> microphone (speech input), webcam (single-frame tab), speakers (TTS playback)</p>
</li>
<li><p><strong>Internet:</strong> Required for the default SpeechRecognition (Google Web Speech API) and gTTS</p>
</li>
<li><p><strong>No dataset/model needed:</strong> A stub gesture classifier is provided so the demo runs end-to-end</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-weve-achieved-so-far">What We’ve Achieved So Far</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-case-study-1-translating-makaton-to-english">Case Study 1: Translating Makaton to English</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-case-study-2-aura-prototype-adaptive-speech-assistant">Case Study 2: AURA Prototype (Adaptive Speech Assistant)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-bigger-picture-multimodal-accessibility-tools">The Bigger Picture: Multimodal Accessibility Tools</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-a-multimodal-makaton-to-english-translator-gesture-speech">How to Build a Multimodal Makaton to English Translator (Gesture + Speech)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-overview">Project Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-challenges-and-ethical-considerations">Challenges and Ethical Considerations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-where-were-heading-next">Where We’re Heading Next</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion-building-an-inclusive-future-with-ai">Conclusion: Building an Inclusive Future with AI</a></p>
</li>
</ul>
<h2 id="heading-what-weve-achieved-so-far">What We’ve Achieved So Far</h2>
<p>The past few years have shown how AI can make classrooms more inclusive when we focus on accessibility. Developers, educators, and researchers are already experimenting with tools that bridge communication gaps.</p>
<p>In <a target="_blank" href="https://www.freecodecamp.org/news/create-a-real-time-gesture-to-text-translator/">my first freeCodeCamp tutorial</a>, I built a gesture-to-text translator using MediaPipe. This project demonstrated how computer vision can track hand movements and convert them into text in real time. For learners who rely on gestures, this kind of system can provide a bridge to participation.</p>
<p>Here is a simplified example of how MediaPipe detects hand landmarks:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> mediapipe <span class="hljs-keyword">as</span> mp
<span class="hljs-keyword">import</span> cv2

<span class="hljs-comment"># Initialize MediaPipe Hands</span>
mp_hands = mp.solutions.hands
hands = mp_hands.Hands()

<span class="hljs-comment"># Start capturing video from the webcam</span>
cap = cv2.VideoCapture(<span class="hljs-number">0</span>)

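<span class="hljs-comment"># Capture a single frame, then release the webcam</span>
ret, frame = cap.read()
cap.release()
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ret:
    <span class="hljs-keyword">raise</span> RuntimeError(<span class="hljs-string">"Could not read a frame from the webcam"</span>)

<span class="hljs-comment"># Process the frame to detect hand landmarks (MediaPipe expects RGB, OpenCV gives BGR)</span>
results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

<span class="hljs-comment"># Print the detected hand landmarks (None if no hand is visible)</span>
print(<span class="hljs-string">"Hand landmarks:"</span>, results.multi_hand_landmarks)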
</code></pre>
<p>This small piece of code shows how MediaPipe processes a video frame and extracts hand landmarks. From there, you can classify gestures and map them to text.</p>
<p>👉 You can explore the full project on <a target="_blank" href="https://github.com/tayo4christ/Gesture_Article">GitHub</a> or read the complete tutorial on <a target="_blank" href="https://www.freecodecamp.org/news/create-a-real-time-gesture-to-text-translator/">freeCodeCamp</a>.</p>
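<p>As a rough illustration of the step between landmark detection and classification, the sketch below turns one hand’s 21 landmarks into a flat feature list that a classifier could consume. <code>landmarks_to_features</code> is a hypothetical helper, not the code from the linked project. It assumes each landmark is an <code>(x, y, z)</code> tuple and that landmark 0 is the wrist, as in MediaPipe Hands.</p>

```python
def landmarks_to_features(points):
    """Flatten 21 (x, y, z) hand landmarks into a wrist-relative, scale-normalised list."""
    if len(points) != 21:
        raise ValueError("expected 21 hand landmarks")
    wx, wy, wz = points[0]  # landmark 0 is the wrist in MediaPipe Hands
    shifted = [(x - wx, y - wy, z - wz) for x, y, z in points]
    # Divide by the largest distance from the wrist so hand size and
    # distance from the camera matter less
    scale = max((dx * dx + dy * dy + dz * dz) ** 0.5 for dx, dy, dz in shifted) or 1.0
    return [value / scale for point in shifted for value in point]


# Synthetic landmarks standing in for a real detection
fake_hand = [(0.5, 0.5, 0.0)] + [(0.5 + 0.01 * i, 0.5, 0.0) for i in range(1, 21)]
features = landmarks_to_features(fake_hand)
print(len(features))  # 21 landmarks x 3 coordinates
```

<p>With a real detection, the input could be built as <code>[(lm.x, lm.y, lm.z) for lm in results.multi_hand_landmarks[0].landmark]</code> after checking that <code>results.multi_hand_landmarks</code> is not <code>None</code>.</p>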
<p>In another <a target="_blank" href="https://www.freecodecamp.org/news/build-ai-accessibility-tools-with-python/">freeCodeCamp article</a>, I demonstrated how to build AI accessibility tools with Python, such as speech recognition and text-to-speech. These projects provided readers with a foundation for building their own inclusive tools, and you can find the full source code in the <a target="_blank" href="https://github.com/tayo4christ/inclusive-ai-toolkit">repository</a>.</p>
<p>Beyond these individual projects, the wider field has also made significant progress. Advances in sign language recognition have improved accuracy in capturing complex hand shapes and movements. Text-to-speech systems have become more natural and adaptive, giving users voices that sound closer to human speech. Mobile and desktop accessibility apps have brought these capabilities into everyday classrooms.</p>
<p>These achievements are encouraging, but they remain limited. Most of today’s tools are still designed for a single mode of communication. A system may work for gestures, or for speech, or for text, but not all of them together.</p>
<p>The next step is clear: we need multimodal, adaptive AI tools that can blend gestures, speech, and feedback into unified systems. This is where the most exciting opportunities in accessibility lie, and it is where we will turn next.</p>
<p><img src="https://github.com/tayo4christ/ai-accessibility-articles-assets/blob/main/single-vs-multimodal.png?raw=true" alt="Single vs Multimodal Systems" width="600" height="400" loading="lazy"></p>
<p><em>Figure 1: Comparison of isolated single-modality systems with unified multimodal AI systems.</em></p>
<h2 id="heading-case-study-1-translating-makaton-to-english">Case Study 1: Translating Makaton to English</h2>
<p>One of my first projects in this area focused on translating Makaton into English.</p>
<p>Makaton is a language programme that uses signs and symbols to support people with speech and language difficulties. It is widely used in classrooms where learners may not rely fully on speech. The challenge is that while a learner communicates in Makaton, their teachers and peers often work in English, which creates a communication gap.</p>
<h3 id="heading-the-ai-workflow">The AI Workflow</h3>
<p>The system followed a clear pipeline:</p>
<p><em>Camera Input → Hand Landmark Detection → Gesture Classification → English Translation Output</em></p>
<p><img src="https://github.com/tayo4christ/ai-accessibility-articles-assets/blob/main/makaton-workflow.png?raw=true" alt="Makaton Workflow" width="600" height="400" loading="lazy"></p>
<p><em>Figure 2: AI workflow for translating Makaton gestures into English.</em></p>
<ul>
<li><p><strong>Camera Input</strong>: captures the learner’s Makaton sign.</p>
</li>
<li><p><strong>Hand Landmark Detection</strong>: a vision library such as MediaPipe or OpenCV identifies the position of the fingers and hands.</p>
</li>
<li><p><strong>Gesture Classification</strong>: a trained machine learning model classifies which Makaton sign was made.</p>
</li>
<li><p><strong>English Translation Output</strong>: the system maps that gesture to its English word or phrase and displays it.</p>
</li>
</ul>
<h3 id="heading-example-in-python">Example in Python</h3>
<p>Here is a simplified version of how this workflow might look in code:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Step 1: Capture input</span>
frame = camera.read()

<span class="hljs-comment"># Step 2: Detect hand landmarks</span>
landmarks = mediapipe.process(frame)

<span class="hljs-comment"># Step 3: Classify gesture</span>
gesture = gesture_model.predict(landmarks)

<span class="hljs-comment"># Step 4: Translate to English</span>
translation_map = {
    <span class="hljs-string">"hello_sign"</span>: <span class="hljs-string">"Hello"</span>,
    <span class="hljs-string">"thank_you_sign"</span>: <span class="hljs-string">"Thank you"</span>
}
text = translation_map.get(gesture, <span class="hljs-string">"Unknown sign"</span>)

print(<span class="hljs-string">"Makaton sign:"</span>, gesture, <span class="hljs-string">" -&gt; English:"</span>, text)
</code></pre>
<p>This is a simplified example, but it shows the core idea: map gestures to meaning and then bridge that meaning into English.</p>
<h3 id="heading-why-this-matters">Why This Matters</h3>
<p>Imagine a student signing <em>thank you</em> in Makaton and the system instantly displaying the words on screen. Teachers can check understanding, peers can respond naturally, and the learner’s contribution becomes visible to everyone.</p>
<p>The key takeaway is that AI can bridge symbol- and gesture-based languages with mainstream spoken and written communication. Instead of forcing learners to adapt to rigid systems, we can design systems that adapt to the way they already communicate.</p>
<h2 id="heading-case-study-2-aura-prototype-adaptive-speech-assistant">Case Study 2: AURA Prototype (Adaptive Speech Assistant)</h2>
<p>Another project I worked on is called <a target="_blank" href="https://aura-apraxia-aac-a8qejouwasaqequrhetbfw.streamlit.app/"><strong>AURA</strong></a>, the <em>Apraxia of Speech Adaptive Understanding and Relearning Assistant</em>. The idea was to design a system that not only recognises speech but also supports learners with speech disorders by detecting errors, adapting feedback, and offering multimodal alternatives.</p>
<h3 id="heading-the-challenge">The Challenge</h3>
<p>Most commercial speech recognition systems fail when a person’s speech does not follow typical patterns. This is especially true for people with apraxia of speech, where motor planning difficulties make pronunciation inconsistent. The result is frequent misrecognition, frustration, and exclusion from tools that rely on voice input.</p>
<h3 id="heading-the-ai-workflow-1">The AI Workflow</h3>
<p>The AURA prototype used a layered architecture:</p>
<p><em>Speech Input → Wav2Vec2 (fine-tuned for disordered speech) → CNN + BiLSTM Error Detection → Reinforcement Learning Feedback → Multimodal Output (Speech + Gesture)</em></p>
<p><img src="https://github.com/tayo4christ/ai-accessibility-articles-assets/blob/main/aura-workflow.png?raw=true" alt="AURA Workflow" width="600" height="400" loading="lazy"></p>
<p><em>Figure 3: Workflow of the AURA prototype, combining speech, error detection, adaptive feedback, and multimodal outputs.</em></p>
<ul>
<li><p><strong>Wav2Vec2 Speech Recognition</strong>: fine-tuned on disordered speech to improve transcription accuracy.</p>
</li>
<li><p><strong>CNN + BiLSTM Model</strong>: classifies articulation or phonological errors in real time.</p>
</li>
<li><p><strong>Reinforcement Learning Engine</strong>: adapts feedback loops so therapy suggestions improve as the learner progresses.</p>
</li>
<li><p><strong>Gesture-to-Speech Multimodal Input</strong>: when speech is too difficult, MediaPipe gestures can be used to trigger spoken outputs.</p>
</li>
<li><p><strong>Streamlit Interface</strong>: integrates everything into a single accessible app for testing.</p>
</li>
</ul>
<p>Here’s a simplified view of how an error detection module could be structured:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Example: Error classification using CNN + BiLSTM</span>
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn

<span class="hljs-comment"># Define the ErrorClassifier model</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ErrorClassifier</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        super(ErrorClassifier, self).__init__()
        self.cnn = nn.Conv1d(in_channels=<span class="hljs-number">40</span>, out_channels=<span class="hljs-number">64</span>, kernel_size=<span class="hljs-number">3</span>)
        self.lstm = nn.LSTM(<span class="hljs-number">64</span>, <span class="hljs-number">128</span>, batch_first=<span class="hljs-literal">True</span>, bidirectional=<span class="hljs-literal">True</span>)
        self.fc = nn.Linear(<span class="hljs-number">256</span>, <span class="hljs-number">3</span>)  <span class="hljs-comment"># Output classes: e.g. correct, substitution, omission</span>

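    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-comment"># x: (batch, 40 features, time)</span>
        x = self.cnn(x)             <span class="hljs-comment"># (batch, 64, time - 2)</span>
        x = x.permute(<span class="hljs-number">0</span>, <span class="hljs-number">2</span>, <span class="hljs-number">1</span>)      <span class="hljs-comment"># (batch, time - 2, 64), as the batch-first LSTM expects</span>
        x, _ = self.lstm(x)         <span class="hljs-comment"># (batch, time - 2, 256)</span>
        <span class="hljs-keyword">return</span> self.fc(x[:, <span class="hljs-number">-1</span>, :])  <span class="hljs-comment"># classify from the last time step</span>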

<span class="hljs-comment"># Instantiate the model</span>
model = ErrorClassifier()
</code></pre>
<p>This snippet shows the heart of the error detection pipeline: combining CNN layers for feature extraction with BiLSTMs for sequence modeling. The model can flag articulation errors, which then guide the feedback loop.</p>
<h3 id="heading-why-this-matters-1">Why This Matters</h3>
<p>With AURA, the goal was not just to recognise what someone said, but to help them communicate more effectively. The prototype adapted in real time, offering corrective feedback, suggesting gestures, or switching modes when speech became difficult.</p>
<p>The takeaway is that AI can evolve from being a passive recognition tool into an active partner in learning and communication.</p>
<h2 id="heading-the-bigger-picture-multimodal-accessibility-tools">The Bigger Picture: Multimodal Accessibility Tools</h2>
<p>The two projects we explored, translating Makaton into English and building the AURA prototype, highlight a much larger transformation underway. Accessibility technology is moving away from isolated, single-purpose applications toward multimodal platforms that bring together speech, gestures, text, and adaptive AI into one seamless system.</p>
<h3 id="heading-why-this-shift-matters">Why This Shift Matters</h3>
<p>The benefits of this shift are profound:</p>
<ul>
<li><p><strong>Greater inclusivity in classrooms</strong>: learners who rely on different modes of communication can participate equally.</p>
</li>
<li><p><strong>Real-time support</strong>: systems that detect errors or adapt to gestures give learners immediate feedback rather than delayed corrections.</p>
</li>
<li><p><strong>Lower frustration</strong>: multimodal options mean if one channel breaks down (for example, speech), others like gesture or text can take over smoothly.</p>
</li>
<li><p><strong>Confidence and independence</strong>: learners express themselves more fully, without depending heavily on support staff or interpreters.</p>
</li>
</ul>
<h3 id="heading-beyond-the-classroom">Beyond the Classroom</h3>
<p>The impact of multimodal accessibility extends across many sectors:</p>
<ul>
<li><p>In <strong>healthcare</strong>, patients with communication difficulties can use multimodal AI assistants to express needs clearly, reducing misdiagnosis and stress.</p>
</li>
<li><p>In the <strong>workplace</strong>, employees with speech or motor impairments can collaborate effectively using adaptive AI tools.</p>
</li>
<li><p>In <strong>community settings</strong>, individuals can participate more freely in conversations, services, and digital platforms, strengthening social inclusion.</p>
</li>
</ul>
<h3 id="heading-visualising-the-shift">Visualising the Shift</h3>
<p><img src="https://github.com/tayo4christ/ai-accessibility-articles-assets/blob/main/multimodal-applications.png?raw=true" alt="Multimodal Applications" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-build-a-multimodal-makaton-to-english-translator-gesture-speech">How to Build a Multimodal Makaton to English Translator (Gesture + Speech)</h2>
<p>This demo combines both use cases: a Makaton to English classroom tool and the AURA assistive speech path. It prioritizes gesture when a sign is detected, falls back to speech when it isn’t, and produces a unified English output (with optional text-to-speech). We’ll focus on the translation layer, multimodal fusion, and a simple Streamlit UI.</p>
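<p>The priority rule described above (gesture first, speech as the fallback) can be sketched in a few lines. This is only an illustration of the idea: <code>fuse_inputs</code> is a hypothetical name and signature, and the app’s actual fusion function may differ.</p>

```python
from typing import Optional


def fuse_inputs(gesture_text: Optional[str], speech_text: Optional[str]) -> str:
    """Prefer the translated gesture; fall back to the speech transcript."""
    if gesture_text:
        return gesture_text
    if speech_text:
        return speech_text.strip()
    return ""  # nothing recognised on either channel


print(fuse_inputs("Hello", "hi there"))  # a detected sign wins
print(fuse_inputs(None, " help "))       # no sign, so speech is used
```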
<h3 id="heading-project-structure">Project structure</h3>
<pre><code class="lang-plaintext">makaton_multimodal_demo/
├─ .streamlit/
│   └─ config.toml 
├─ assets/
│   └─ README.txt 
├─ tests/
│   └─ test_fuse.py 
└─ streamlit_app.py
</code></pre>
<p>This is the directory layout for the multimodal Makaton to English translator demo. Here's a brief explanation of each component:</p>
<ul>
<li><p><code>makaton_multimodal_demo/</code>: This is the root directory of the project.</p>
</li>
<li><p><code>.streamlit/</code>: This directory contains configuration files for Streamlit, which is a framework used to build web apps in Python. The <code>config.toml</code> file is optional and can be used to customize the Streamlit app's settings.</p>
</li>
<li><p><code>assets/</code>: This directory is intended to store models or other necessary files for the project. The <code>README.txt</code> serves as a placeholder to indicate where these files should be placed.</p>
</li>
<li><p><code>tests/</code>: This directory is for test scripts. The <code>test_fuse.py</code> file is where tests for the fusion function go, which is a part of the multimodal translation process.</p>
</li>
<li><p><code>streamlit_app.py</code>: This is the main application file where the Streamlit app is implemented. It contains the code that runs the app, handling the user interface and the logic for translating Makaton gestures and speech into English.</p>
</li>
</ul>
<h3 id="heading-install-amp-run">Install &amp; run</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># (optional) create and activate a virtualenv</span>
python -m venv .venv

<span class="hljs-comment"># Windows</span>
.\.venv\Scripts\activate

<span class="hljs-comment"># macOS/Linux</span>
<span class="hljs-built_in">source</span> .venv/bin/activate
</code></pre>
<p>The commands above create and activate a Python virtual environment: a self-contained directory with its own Python interpreter and its own set of installed packages, kept separate from your system-wide installation. Run the first command, then only the activation command that matches your operating system.</p>
<ol>
<li><p><code>python -m venv .venv</code>: This command creates a new virtual environment in a directory named <code>.venv</code>. The <code>venv</code> module is used to create lightweight virtual environments.</p>
</li>
<li><p><code>.\.venv\Scripts\activate</code> (Windows): This command activates the virtual environment on Windows. Once activated, the environment's Python interpreter and installed packages will be used.</p>
</li>
<li><p><code>source .venv/bin/activate</code> (macOS/Linux): This command activates the virtual environment on macOS or Linux. Similar to Windows, activating the environment ensures that the specific Python interpreter and packages within the environment are used.</p>
</li>
</ol>
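<p>If you are unsure whether activation worked, a quick check from Python can help. This is a small sketch using only the standard library, and the function name is just for illustration. It relies on the fact that inside an environment created by <code>venv</code>, <code>sys.prefix</code> differs from <code>sys.base_prefix</code>.</p>

```python
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a venv-created environment."""
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
print("Interpreter:", sys.executable)
```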
<h3 id="heading-install-dependencies">Install dependencies</h3>
<pre><code class="lang-bash">pip install streamlit opencv-python mediapipe SpeechRecognition gTTS pydub numpy
</code></pre>
<p>The command above is used to install multiple Python packages at once. Here's what each package does:</p>
<ul>
<li><p><strong>streamlit</strong>: A framework for building interactive web applications in Python, often used for data science and machine learning projects.</p>
</li>
<li><p><strong>opencv-python</strong>: Provides OpenCV, a library for computer vision tasks such as image processing and video analysis.</p>
</li>
<li><p><strong>mediapipe</strong>: A library developed by Google for building cross-platform, customizable machine learning solutions for live and streaming media, including hand and face detection.</p>
</li>
<li><p><strong>SpeechRecognition</strong>: A library for performing speech recognition, allowing Python to recognize and process human speech.</p>
</li>
<li><p><strong>gTTS</strong>: Google Text-to-Speech, a library and CLI tool to interface with Google Translate's text-to-speech API, enabling text-to-speech conversion.</p>
</li>
<li><p><strong>pydub</strong>: A library for audio processing, allowing manipulation of audio files, such as converting between different audio formats.</p>
</li>
<li><p><strong>numpy</strong>: A fundamental package for scientific computing in Python, providing support for arrays and matrices, along with a collection of mathematical functions.</p>
</li>
</ul>
<h3 id="heading-create-streamlitapppy">Create <code>streamlit_app.py</code></h3>
<pre><code class="lang-python"><span class="hljs-comment"># streamlit_app.py</span>
<span class="hljs-keyword">from</span> io <span class="hljs-keyword">import</span> BytesIO
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Optional
<span class="hljs-keyword">import</span> streamlit <span class="hljs-keyword">as</span> st

<span class="hljs-comment"># Optional deps (kept optional so readers can still run the core demo)</span>
<span class="hljs-keyword">try</span>:
    <span class="hljs-keyword">import</span> cv2
    <span class="hljs-keyword">import</span> mediapipe <span class="hljs-keyword">as</span> mp
    MP_OK = <span class="hljs-literal">True</span>
<span class="hljs-keyword">except</span> Exception:
    MP_OK = <span class="hljs-literal">False</span>

<span class="hljs-keyword">try</span>:
    <span class="hljs-keyword">import</span> speech_recognition <span class="hljs-keyword">as</span> sr
    SR_OK = <span class="hljs-literal">True</span>
<span class="hljs-keyword">except</span> Exception:
    SR_OK = <span class="hljs-literal">False</span>

<span class="hljs-keyword">try</span>:
    <span class="hljs-keyword">from</span> gtts <span class="hljs-keyword">import</span> gTTS
    GTTS_OK = <span class="hljs-literal">True</span>
<span class="hljs-keyword">except</span> Exception:
    GTTS_OK = <span class="hljs-literal">False</span>

<span class="hljs-comment"># --- 1) Minimal Makaton dictionary (extend as needed)</span>
MAKATON_DICT = {
    <span class="hljs-string">"hello_sign"</span>: <span class="hljs-string">"Hello"</span>,
    <span class="hljs-string">"thank_you_sign"</span>: <span class="hljs-string">"Thank you"</span>,
    <span class="hljs-string">"help_sign"</span>: <span class="hljs-string">"Help"</span>,
    <span class="hljs-string">"toilet_sign"</span>: <span class="hljs-string">"Toilet"</span>,
    <span class="hljs-string">"stop_sign"</span>: <span class="hljs-string">"Stop"</span>,
}

<span class="hljs-comment"># --- 2) Gesture classifier (stub for the demo)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">classify_gesture</span>(<span class="hljs-params">landmarks</span>) -&gt; Optional[str]:</span>
    <span class="hljs-string">"""
    Return a canonical label like 'hello_sign' or None if unknown.
    Replace this stub with your trained model + confidence threshold.
    """</span>
    <span class="hljs-keyword">return</span> <span class="hljs-string">"hello_sign"</span> <span class="hljs-keyword">if</span> landmarks <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>

<span class="hljs-comment"># --- 3) Speech recognizer (fallback path)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transcribe_speech</span>(<span class="hljs-params">seconds: int = <span class="hljs-number">3</span></span>) -&gt; Optional[str]:</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> SR_OK:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>
    r = sr.Recognizer()
    <span class="hljs-keyword">try</span>:
        <span class="hljs-keyword">with</span> sr.Microphone() <span class="hljs-keyword">as</span> source:
            st.info(<span class="hljs-string">"Listening..."</span>)
            audio = r.listen(source, phrase_time_limit=seconds)
        <span class="hljs-keyword">return</span> r.recognize_google(audio)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        st.warning(<span class="hljs-string">f"Speech recognition error: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-comment"># --- 4) Fusion logic (gesture first, speech fallback)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fuse</span>(<span class="hljs-params">gesture_label: Optional[str], speech_text: Optional[str]</span>) -&gt; str:</span>
    <span class="hljs-keyword">if</span> gesture_label <span class="hljs-keyword">and</span> gesture_label <span class="hljs-keyword">in</span> MAKATON_DICT:
        <span class="hljs-keyword">return</span> MAKATON_DICT[gesture_label]
    <span class="hljs-keyword">if</span> speech_text:
        <span class="hljs-keyword">return</span> speech_text
    <span class="hljs-keyword">return</span> <span class="hljs-string">"No input detected"</span>

<span class="hljs-comment"># --- 5) Optional: extract single-frame hand landmarks using MediaPipe</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">extract_hand_landmarks_from_image</span>(<span class="hljs-params">image_bytes: bytes</span>):</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> MP_OK:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>
    <span class="hljs-keyword">try</span>:
        <span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
        np_arr = np.frombuffer(image_bytes, dtype=np.uint8)
        img = cv2.imdecode(np_arr, cv2.IMREAD_COLOR)
        <span class="hljs-keyword">if</span> img <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
            <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

        mp_hands = mp.solutions.hands
        <span class="hljs-keyword">with</span> mp_hands.Hands(static_image_mode=<span class="hljs-literal">True</span>, max_num_hands=<span class="hljs-number">1</span>, min_detection_confidence=<span class="hljs-number">0.5</span>) <span class="hljs-keyword">as</span> hands:
            img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            result = hands.process(img_rgb)

        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> result.multi_hand_landmarks:
            <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

        hand_landmarks = result.multi_hand_landmarks[<span class="hljs-number">0</span>]
        <span class="hljs-keyword">return</span> [(lm.x, lm.y, lm.z) <span class="hljs-keyword">for</span> lm <span class="hljs-keyword">in</span> hand_landmarks.landmark]
    <span class="hljs-keyword">except</span> Exception:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-comment"># --- 6) Streamlit UI</span>
st.set_page_config(page_title=<span class="hljs-string">"Makaton → English (Multimodal Demo)"</span>)
st.title(<span class="hljs-string">"Makaton → English (Multimodal Demo)"</span>)
st.caption(<span class="hljs-string">"Combines a classroom Makaton translator with an assistive speech path (AURA-style)."</span>)

<span class="hljs-keyword">with</span> st.expander(<span class="hljs-string">"What this demo shows"</span>):
    st.write(
        <span class="hljs-string">"- **Translation layer:** small Makaton dictionary you can extend.\n"</span>
        <span class="hljs-string">"- **Multimodal fusion:** gesture prioritized, speech as fallback.\n"</span>
        <span class="hljs-string">"- **UI:** one page, clear output, optional text-to-speech."</span>
    )

tabs = st.tabs([<span class="hljs-string">"Simulated Sign"</span>, <span class="hljs-string">"Single-Frame Webcam (Optional)"</span>, <span class="hljs-string">"About"</span>])

<span class="hljs-comment"># Tab 1: Simulated (no CV model required)</span>
<span class="hljs-keyword">with</span> tabs[<span class="hljs-number">0</span>]:
    st.subheader(<span class="hljs-string">"Simulated Gesture + Speech"</span>)
    col1, col2 = st.columns(<span class="hljs-number">2</span>)

    <span class="hljs-keyword">with</span> col1:
        simulate = st.selectbox(
            <span class="hljs-string">"Pick a sign"</span>,
            [<span class="hljs-string">""</span>, <span class="hljs-string">"hello_sign"</span>, <span class="hljs-string">"thank_you_sign"</span>, <span class="hljs-string">"help_sign"</span>, <span class="hljs-string">"toilet_sign"</span>, <span class="hljs-string">"stop_sign"</span>],
            index=<span class="hljs-number">0</span>
        )
        gesture_label = simulate <span class="hljs-keyword">or</span> <span class="hljs-literal">None</span>

    <span class="hljs-keyword">with</span> col2:
        speech_text = st.session_state.get(<span class="hljs-string">"speech_text"</span>)
        st.write(<span class="hljs-string">"Current speech:"</span>, speech_text <span class="hljs-keyword">or</span> <span class="hljs-string">"None"</span>)
        <span class="hljs-keyword">if</span> st.button(<span class="hljs-string">"Transcribe 3s"</span>):
            <span class="hljs-keyword">if</span> SR_OK:
                speech_text = transcribe_speech(<span class="hljs-number">3</span>)
                st.session_state[<span class="hljs-string">"speech_text"</span>] = speech_text
            <span class="hljs-keyword">else</span>:
                st.warning(<span class="hljs-string">"SpeechRecognition not installed."</span>)

    output = fuse(gesture_label, st.session_state.get(<span class="hljs-string">"speech_text"</span>))
    st.markdown(<span class="hljs-string">f"### Output: **<span class="hljs-subst">{output}</span>**"</span>)

    <span class="hljs-keyword">if</span> output <span class="hljs-keyword">and</span> output != <span class="hljs-string">"No input detected"</span>:
        <span class="hljs-keyword">if</span> st.button(<span class="hljs-string">"Speak output"</span>):
            <span class="hljs-keyword">if</span> GTTS_OK:
                mp3 = BytesIO()
                <span class="hljs-keyword">try</span>:
                    gTTS(output).write_to_fp(mp3)
                    st.audio(mp3.getvalue(), format=<span class="hljs-string">"audio/mp3"</span>)
                <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
                    st.warning(<span class="hljs-string">f"TTS failed: <span class="hljs-subst">{e}</span>"</span>)
            <span class="hljs-keyword">else</span>:
                st.warning(<span class="hljs-string">"gTTS not installed."</span>)

<span class="hljs-comment"># Tab 2: Optional single-frame webcam capture</span>
<span class="hljs-keyword">with</span> tabs[<span class="hljs-number">1</span>]:
    st.subheader(<span class="hljs-string">"Single-Frame Hand Detection (Webcam)"</span>)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> MP_OK:
        st.warning(<span class="hljs-string">"Install MediaPipe + OpenCV to enable this tab."</span>)
    <span class="hljs-keyword">else</span>:
        img = st.camera_input(<span class="hljs-string">"Capture a frame"</span>)
        captured_label = <span class="hljs-literal">None</span>
        <span class="hljs-keyword">if</span> img <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
            landmarks = extract_hand_landmarks_from_image(img.getvalue())
            <span class="hljs-keyword">if</span> landmarks:
                captured_label = classify_gesture(landmarks)
                st.success(<span class="hljs-string">"Hand detected."</span>)
            <span class="hljs-keyword">else</span>:
                st.info(<span class="hljs-string">"No hand landmarks found. Try better lighting/framing."</span>)

        <span class="hljs-keyword">if</span> st.button(<span class="hljs-string">"Transcribe 3s (webcam tab)"</span>):
            st.session_state[<span class="hljs-string">"speech_text2"</span>] = transcribe_speech(<span class="hljs-number">3</span>) <span class="hljs-keyword">if</span> SR_OK <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>

        speech_text2 = st.session_state.get(<span class="hljs-string">"speech_text2"</span>)
        st.write(<span class="hljs-string">"Current speech:"</span>, speech_text2 <span class="hljs-keyword">or</span> <span class="hljs-string">"None"</span>)

        output2 = fuse(captured_label, speech_text2)
        st.markdown(<span class="hljs-string">f"### Output: **<span class="hljs-subst">{output2}</span>**"</span>)

        <span class="hljs-keyword">if</span> output2 <span class="hljs-keyword">and</span> output2 != <span class="hljs-string">"No input detected"</span>:
            <span class="hljs-keyword">if</span> st.button(<span class="hljs-string">"Speak output (webcam tab)"</span>):
                <span class="hljs-keyword">if</span> GTTS_OK:
                    mp3 = BytesIO()
                    <span class="hljs-keyword">try</span>:
                        gTTS(output2).write_to_fp(mp3)
                        st.audio(mp3.getvalue(), format=<span class="hljs-string">"audio/mp3"</span>)
                    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
                        st.warning(<span class="hljs-string">f"TTS failed: <span class="hljs-subst">{e}</span>"</span>)
                <span class="hljs-keyword">else</span>:
                    st.warning(<span class="hljs-string">"gTTS not installed."</span>)
</code></pre>
<p>The code above creates a Streamlit application that combines gesture recognition and speech recognition to translate Makaton signs into English. Here's a brief explanation of how it works:</p>
<ol>
<li><p><strong>Dependencies and Setup</strong>: The code attempts to import optional dependencies like OpenCV, MediaPipe, SpeechRecognition, and gTTS. These are used for gesture detection, speech recognition, and text-to-speech functionalities.</p>
</li>
<li><p><strong>Makaton Dictionary</strong>: A minimal dictionary that maps Makaton signs to English words. This can be extended to include more signs.</p>
</li>
<li><p><strong>Gesture Classifier</strong>: A placeholder function (<code>classify_gesture</code>) is used to classify hand gestures. In a real application, this would be replaced with a trained model.</p>
</li>
<li><p><strong>Speech Recognizer</strong>: The <code>transcribe_speech</code> function uses the SpeechRecognition library to convert spoken words into text, serving as a fallback when gestures are not detected.</p>
</li>
<li><p><strong>Fusion Logic</strong>: The <code>fuse</code> function prioritizes gesture recognition over speech. If a gesture is recognized, it translates it using the dictionary; otherwise, it uses the transcribed speech.</p>
</li>
<li><p><strong>Hand Landmark Extraction</strong>: The code includes a function to extract hand landmarks from an image using MediaPipe, which is used for gesture classification.</p>
</li>
<li><p><strong>Streamlit UI</strong>: The user interface is built with Streamlit, featuring tabs for simulated gestures, webcam-based gesture detection, and additional information. Users can simulate gestures, capture gestures via webcam, and use speech input. The output is displayed and can be converted to speech using gTTS.</p>
</li>
</ol>
<p>This application demonstrates a multimodal approach by integrating both gesture and speech recognition to facilitate communication for users who rely on Makaton.</p>
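<p>To see the fusion rule on its own, here is a minimal sketch that copies a trimmed version of the dictionary and the <code>fuse</code> function from the script above (type hints dropped for brevity) and calls it with three input combinations. It runs without Streamlit or any of the optional packages.</p>
<pre><code class="lang-python"># Standalone copy of the article's dictionary and fusion rule, for experimentation.
MAKATON_DICT = {
    "hello_sign": "Hello",
    "thank_you_sign": "Thank you",
    "help_sign": "Help",
}


def fuse(gesture_label, speech_text):
    """Gesture wins when it maps to a known sign; otherwise fall back to speech."""
    if gesture_label and gesture_label in MAKATON_DICT:
        return MAKATON_DICT[gesture_label]
    if speech_text:
        return speech_text
    return "No input detected"


print(fuse("hello_sign", "good morning"))  # Hello (gesture takes priority)
print(fuse(None, "good morning"))          # good morning (speech fallback)
print(fuse("unknown_sign", None))          # No input detected
</code></pre>
<p>An unknown label such as <code>unknown_sign</code> is treated the same as no gesture, which is why the speech path still gets a chance when the classifier returns something outside the dictionary.</p>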
<h3 id="heading-run">Run</h3>
<pre><code class="lang-bash">streamlit run streamlit_app.py
</code></pre>
<p>The command above launches the Streamlit application. It starts a local web server and opens the app in your default web browser, where you can interact with its user interface. Run it from a terminal or command prompt with the virtual environment activated.</p>
<p><img src="https://github.com/tayo4christ/ai-accessibility-articles-assets/blob/8117234b9dc032aa0f4ff32abad92e7ad3344b81/ui-home-simulated-tab.jpg?raw=1" alt="Streamlit app ‘Makaton to English (Multimodal Demo)’ showing the Simulated Sign tab with ‘Pick a sign’, ‘Transcribe 3s’, and ‘Output: No input detected’." width="600" height="400" loading="lazy"></p>
<p><em>Figure — App interface: the Simulated Sign tab before any input.</em></p>
<p><img src="https://github.com/tayo4christ/ai-accessibility-articles-assets/blob/8117234b9dc032aa0f4ff32abad92e7ad3344b81/ui-simulated-hello-output.jpg?raw=1" alt="Simulated sign ‘hello_sign’ selected in the Streamlit app; Output shows “Hello”." width="600" height="400" loading="lazy"></p>
<p><em>Figure — Selecting</em> <code>hello_sign</code> <em>produces “Output: Hello”.</em></p>
<h2 id="heading-project-overview">Project Overview</h2>
<p>You have developed a multimodal translator that integrates both gesture recognition (specifically Makaton signs) and speech recognition to produce a unified English output. The system is designed to prioritize gesture input, using speech as a fallback when gestures are not detected.</p>
<p><strong>User Interface</strong></p>
<p>The application is built using Streamlit, featuring two main tabs:</p>
<ul>
<li><p><strong>Simulated Sign Tab</strong>: Allows users to simulate gestures without requiring computer vision (CV) capabilities.</p>
</li>
<li><p><strong>Webcam Single Frame Tab</strong>: Optionally uses a webcam to capture and process a single frame for gesture detection.</p>
</li>
</ul>
<p><strong>Use Case Integration</strong></p>
<ul>
<li><p><strong>Makaton to English Translation</strong>: In a classroom setting, detected Makaton signs are translated into short English phrases, facilitating communication.</p>
</li>
<li><p><strong>AURA-style Assistive Path</strong>: If no gesture is detected, the system relies on speech input to generate an output, ensuring continuous communication support.</p>
</li>
</ul>
<p><strong>Design Limitations</strong></p>
<ul>
<li><p>The gesture classifier is currently a placeholder and should be replaced with a trained model that includes a confidence threshold for better accuracy.</p>
</li>
<li><p>The Makaton dictionary is minimal and can be expanded to include more phrases and templates.</p>
</li>
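<li><p>The speech recognition component uses SpeechRecognition's <code>recognize_google</code> method, which sends audio to Google's Web Speech API and therefore needs an internet connection. For improved robustness, consider using advanced models like Wav2Vec2 or offline automatic speech recognition (ASR) systems.</p>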
</li>
</ul>
<p><strong>Suggested Extensions</strong></p>
<ul>
<li><p>Implement a confidence threshold to display both gesture and speech inputs when the system is uncertain.</p>
</li>
<li><p>Expand the dictionary to support slot templates, such as "I want [item]".</p>
</li>
<li><p>Introduce a toggle to switch between speech-first and gesture-first input priorities.</p>
</li>
<li><p>Enable logging of outputs for teachers and provide an option to export these logs as CSV files.</p>
</li>
<li><p>Consider replacing gTTS with an offline text-to-speech solution for better reliability.</p>
</li>
</ul>
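<p>A few of these extensions can be sketched in plain Python. The example below is one possible shape, not the project's implementation: <code>fuse_with_priority</code>, <code>fill_template</code>, <code>export_log_csv</code>, the <code>TEMPLATES</code> table, and the <code>want_sign</code> label are all hypothetical names introduced for illustration.</p>
<pre><code class="lang-python">import csv
from io import StringIO

MAKATON_DICT = {"hello_sign": "Hello", "help_sign": "Help"}

# Hypothetical slot templates: a sign label maps to a phrase with one gap.
TEMPLATES = {"want_sign": "I want {item}"}


def fuse_with_priority(gesture_label, speech_text, priority="gesture"):
    """Return (output, source); priority is 'gesture' or 'speech'."""
    gesture_text = MAKATON_DICT.get(gesture_label) if gesture_label else None
    candidates = {"gesture": gesture_text, "speech": speech_text or None}
    order = ["gesture", "speech"] if priority == "gesture" else ["speech", "gesture"]
    for source in order:
        if candidates[source]:
            return candidates[source], source
    return "No input detected", "none"


def fill_template(sign_label, item):
    """Fill a slot template, e.g. want_sign plus 'water' gives 'I want water'."""
    template = TEMPLATES.get(sign_label)
    return template.format(item=item) if template else None


def export_log_csv(rows):
    """Serialise (gesture, speech, output, source) rows to CSV text."""
    buffer = StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["gesture", "speech", "output", "source"])
    writer.writerows(rows)
    return buffer.getvalue()


output, source = fuse_with_priority("hello_sign", "good morning", priority="speech")
print(output, source)                        # good morning speech
print(fill_template("want_sign", "water"))   # I want water
print(export_log_csv([("hello_sign", "good morning", output, source)]))
</code></pre>
<p>In the Streamlit app, the <code>priority</code> value could come from a widget such as <code>st.radio</code>, and the CSV text could be offered to teachers through <code>st.download_button</code>.</p>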
<p><strong>Troubleshooting Tips</strong></p>
<ul>
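<li><p>If you encounter microphone errors, ensure that PyAudio is installed, since <code>sr.Microphone()</code> depends on it. Try <code>pip install pyaudio</code> first; on Windows, if that fails, use <code>pip install pipwin</code> followed by <code>pipwin install pyaudio</code>.</p>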
</li>
<li><p>If the webcam is not detected, check your browser permissions. The Simulated Sign tab can still be used without a webcam.</p>
</li>
<li><p>If there are issues with package imports, verify that they are installed in your active virtual environment.</p>
</li>
</ul>
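<p>For the last tip, a short standard-library script can report which interpreter is running and which imports it cannot find. This is a sketch: the module list assumes the import names of the packages installed earlier, plus <code>pyaudio</code> for microphone input, and the helper names are illustrative.</p>
<pre><code class="lang-python">import importlib.util
import sys

# Import names (not pip names) used by streamlit_app.py, plus pyaudio for the microphone.
MODULES = [
    "streamlit", "cv2", "mediapipe", "speech_recognition",
    "gtts", "pydub", "numpy", "pyaudio",
]


def in_virtualenv():
    """True when the interpreter runs inside a venv-style environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtualenv())
print("Missing modules:", missing_modules(MODULES) or "none")
</code></pre>
<p>Run it from the same terminal session you use for <code>streamlit run</code>. If the interpreter path does not point inside <code>.venv</code>, activate the environment and try again.</p>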
<p>The link to the full code: <a target="_blank" href="https://github.com/tayo4christ/makaton-multimodal-demo/tree/main/makaton_multimodal_demo">Multimodal_Makaton</a></p>
<h2 id="heading-challenges-and-ethical-considerations">Challenges and Ethical Considerations</h2>
<p>While the promise of multimodal accessibility tools is exciting, building them responsibly requires us to confront several challenges. These are not only technical problems but also ethical ones that affect how learners, teachers, and communities experience AI.</p>
<h3 id="heading-data-scarcity">Data Scarcity</h3>
<p>Training AI systems requires large, diverse datasets. But when it comes to disordered speech or symbol systems like Makaton, the data is limited. Without enough examples, models risk being inaccurate or biased toward a narrow group of users. Collecting more data is essential, but it must be done ethically, with consent and respect for the communities involved.</p>
<h3 id="heading-fairness-and-inclusion">Fairness and Inclusion</h3>
<p>AI systems often work better for some groups than others. A model trained mostly on fluent English speakers may fail to recognise learners with strong accents or speech difficulties. Similarly, gesture recognition may not account for differences in motor ability. Fairness means designing models that work across abilities, accents, and cultures, so that no group is excluded by design.</p>
<h3 id="heading-privacy-and-security">Privacy and Security</h3>
<p>Speech and video data are highly sensitive, especially when collected in schools. Protecting this data is not optional; it is a requirement. Systems must anonymize or encrypt recordings and store them securely. Transparency is also key: learners, parents, and teachers should know exactly how data is being used and who has access to it.</p>
<h3 id="heading-accessibility-of-the-tools-themselves">Accessibility of the Tools Themselves</h3>
<p>Ironically, many “accessibility tools” remain inaccessible because they are expensive, require powerful hardware, or are too complex to use. For AI to truly reduce barriers, solutions must be affordable, lightweight, and easy for teachers to set up in real classrooms, not just in research labs.</p>
<h3 id="heading-takeaway">Takeaway</h3>
<p>These challenges remind us that accessibility in AI is not only a technical question but also an ethical and social responsibility. To build tools that genuinely help learners, we need collaboration between developers, educators, policymakers, and the communities who will use the systems.</p>
<h2 id="heading-where-were-heading-next">Where We’re Heading Next</h2>
<p>The future of AI accessibility tools is speculative, but the possibilities are both exciting and necessary. What we have now are prototypes and early systems. What lies ahead are tools that could reshape how classrooms and society more broadly approach communication and inclusion.</p>
<h3 id="heading-multilingual-makaton-translation">Multilingual Makaton Translation</h3>
<p>One promising direction is the ability to translate Makaton across multiple languages. A learner in the UK could sign in Makaton and see their contribution appear not just in English but in French, Spanish, or Yoruba. This would open up international classrooms and give learners access to global opportunities that are often closed off by language barriers.</p>
<h3 id="heading-ai-tutors-with-dynamic-adaptation">AI Tutors with Dynamic Adaptation</h3>
<p>Imagine a classroom assistant powered by AI that adapts in real time. If a learner struggles with speech, it could switch to gesture recognition. If gestures become tiring, it could prompt the learner with symbol-based options. These AI tutors would not only support communication but also guide learning, adapting to each student’s strengths and challenges over time.</p>
<h3 id="heading-wearable-multimodal-devices">Wearable Multimodal Devices</h3>
<p>The rise of lightweight hardware makes it possible to imagine wearable AI assistants that provide instant translation and support. Glasses could capture gestures and overlay text, while earbuds could translate disordered speech into clear audio for peers and teachers. Instead of bulky setups, accessibility would become portable, personal, and ever-present.</p>
<h3 id="heading-a-broader-impact">A Broader Impact</h3>
<p>These innovations go beyond technology alone. They align with the United Nations Sustainable Development Goals (SDGs), especially:</p>
<ul>
<li><p><strong>Quality Education (Goal 4):</strong> ensuring that every learner, regardless of ability, has equal access to education.</p>
</li>
<li><p><strong>Reduced Inequalities (Goal 10):</strong> breaking down barriers so that disability or difference is not a cause of exclusion.</p>
</li>
</ul>
<p>The journey from single-modality tools to multimodal, adaptive systems is still in its early stages. But if we continue to push forward with creativity, ethics, and inclusivity at the center, AI accessibility tools will not only change classrooms; they will change lives.</p>
<h2 id="heading-conclusion-building-an-inclusive-future-with-ai">Conclusion: Building an Inclusive Future with AI</h2>
<p>AI accessibility tools are no longer just optional add-ons for a few learners. They are becoming core enablers of inclusion in education, healthcare, workplaces, and daily life.</p>
<p>The journey from early gesture recognition systems to multimodal, adaptive prototypes like Makaton translation and AURA shows what is possible when technology is designed around people rather than forcing people to adapt to technology. These innovations break down communication barriers and open up new opportunities for learners who have too often been left on the margins.</p>
<p>But the future of accessibility is not automatic. It depends on choices we make now as developers, educators, researchers, and policymakers. Building tools that are open, ethical, and affordable requires collaboration and commitment.</p>
<p>The vision is clear: a world where every learner, regardless of ability, can express themselves fully, be understood by others, and participate with confidence.</p>
<p><strong>The future of education is inclusive, and with thoughtful design, AI can help us get there.</strong></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build AI Speech-to-Text and Text-to-Speech Accessibility Tools with Python ]]>
                </title>
                <description>
                    <![CDATA[ Classrooms today are more diverse than ever before. Among the students are neurodiverse learners with different learning needs. While these learners bring unique strengths, traditional teaching methods don’t always meet their needs. This is where AI-... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-ai-accessibility-tools-with-python/</link>
                <guid isPermaLink="false">68b5f910f596271023ce3698</guid>
                
                    <category>
                        <![CDATA[ Accessibility ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Mon, 01 Sep 2025 19:50:40 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756755907758/3568b7ab-f659-45c9-8c1a-e877d1a0a166.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Classrooms today are more diverse than ever before. Among the students are neurodiverse learners with different learning needs. While these learners bring unique strengths, traditional teaching methods don’t always meet their needs.</p>
<p>This is where AI-driven accessibility tools can make a difference. From real-time captioning to adaptive reading support, artificial intelligence is transforming classrooms into more inclusive spaces.</p>
<p>In this article, you’ll:</p>
<ul>
<li><p>Understand what inclusive education means in practice.</p>
</li>
<li><p>See how AI can support neurodiverse learners.</p>
</li>
<li><p>Try two hands-on Python demos:</p>
<ul>
<li><p><strong>Speech-to-Text</strong> using local Whisper (free, no API key).</p>
</li>
<li><p><strong>Text-to-Speech</strong> using Hugging Face SpeechT5.</p>
</li>
</ul>
</li>
<li><p>Get a ready-to-use project structure, requirements, and troubleshooting tips for Windows and macOS/Linux users.</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-a-note-on-missing-files">A Note on Missing Files</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-inclusive-education-really-means">What Inclusive Education Really Means</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-toolbox-five-ai-accessibility-tools-teachers-can-try-today">Toolbox: Five AI Accessibility Tools Teachers Can Try Today</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-platform-notes-windows-vs-macoslinux">Platform Notes (Windows vs macOS/Linux)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-hands-on-build-a-simple-accessibility-toolkit-python">Hands-On: Build a Simple Accessibility Toolkit (Python)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-quick-setup-cheatsheet">Quick Setup Cheatsheet</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-from-code-to-classroom-impact">From Code to Classroom Impact</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-developer-challenge-build-for-inclusion">Developer Challenge: Build for Inclusion</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-challenges-and-considerations">Challenges and Considerations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-looking-ahead">Looking Ahead</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have the following:</p>
<ul>
<li><p><strong>Python 3.8</strong> or later installed. Windows users who don’t have it can download the latest version from <a target="_blank" href="http://python.org">python.org</a>; macOS users usually already have <code>python3</code>.</p>
</li>
<li><p><strong>Virtual environment</strong> set up (<code>venv</code>) — recommended to keep things clean.</p>
</li>
<li><p><a target="_blank" href="https://www.gyan.dev/ffmpeg/builds/#"><strong>FFmpeg</strong></a> installed (required for Whisper to read audio files).</p>
</li>
<li><p><strong>PowerShell</strong> (Windows) or <strong>Terminal</strong> (macOS/Linux).</p>
</li>
<li><p><strong>Basic familiarity</strong> with running Python scripts.</p>
</li>
</ul>
<p><strong>Tip</strong>: If you’re new to Python environments, don’t worry: the setup commands are included with each step below.</p>
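<p>Before moving on, you may want to confirm that FFmpeg is visible to Python. The sketch below uses only the standard library; <code>find_ffmpeg</code> is an illustrative helper name, not part of the project.</p>
<pre><code class="lang-python">import shutil
import sys


def find_ffmpeg():
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


version = ".".join(str(part) for part in sys.version_info[:3])
print("Python version:", version, "(this article assumes 3.8 or later)")

path = find_ffmpeg()
print("FFmpeg:", path if path else "not found on PATH")
</code></pre>
<p>If FFmpeg is reported as not found, install it using the platform instructions in the next section and open a new terminal so the updated PATH is picked up.</p>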
<h2 id="heading-a-note-on-missing-files">A Note on Missing Files</h2>
<p>Some files are not included in the <a target="_blank" href="https://github.com/tayo4christ/inclusive-ai-toolkit">GitHub repository</a>. This is intentional, because they are either generated automatically or should be created or installed locally:</p>
<ul>
<li><p><code>.venv/</code> → Your virtual environment folder. Each reader should create their own locally with:</p>
<pre><code class="lang-bash">python -m venv .venv
</code></pre>
<ol>
<li><p><strong>FFmpeg Installation</strong>:</p>
<ul>
<li><p><strong>Windows</strong>: FFmpeg is not included in the project files because it is large (approximately 90 MB). Download an FFmpeg build separately.</p>
</li>
<li><p><strong>macOS</strong>: Install FFmpeg with the Homebrew package manager: <code>brew install ffmpeg</code>.</p>
</li>
<li><p><strong>Linux</strong>: On Debian or Ubuntu, install FFmpeg with <code>sudo apt install ffmpeg</code>; other distributions provide it through their own package managers.</p>
</li>
</ul>
</li>
<li><p><strong>Output File</strong>:</p>
<ul>
<li><code>output.wav</code> is generated when you run the Text-to-Speech script. It is not included in the GitHub repository because it is created locally on your machine when you execute the script.</li>
</ul>
</li>
</ol>
</li>
</ul>
<p>To keep the repo clean, these are excluded using <code>.gitignore</code>:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Ignore virtual environments</span>
.venv/
env/
venv/

<span class="hljs-comment"># Ignore binary files</span>
ffmpeg.exe
*.dll
*.lib

<span class="hljs-comment"># Ignore generated audio (but keep sample input)</span>
*.wav
*.mp3
!lesson_recording.mp3
</code></pre>
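<p>To see why <code>lesson_recording.mp3</code> stays in the repository while other audio files are ignored, here is a simplified sketch of the "last matching pattern wins" behaviour. It is an illustration only: it handles the file-name patterns from the list above, leaves out the directory rules, and does not reproduce everything Git does. The function name <code>is_ignored</code> is hypothetical.</p>

```python
from fnmatch import fnmatchcase

# File-name patterns copied from the .gitignore above (directory rules omitted)
PATTERNS = ["ffmpeg.exe", "*.dll", "*.lib", "*.wav", "*.mp3", "!lesson_recording.mp3"]


def is_ignored(filename, patterns=PATTERNS):
    """Simplified model: the last matching pattern decides; '!' re-includes."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(filename, glob):
            ignored = not negated
    return ignored


for name in ("output.wav", "lesson_recording.mp3", "tts.py"):
    print(name, "ignored" if is_ignored(name) else "tracked")
```

<p>The order matters: the <code>!lesson_recording.mp3</code> line comes after <code>*.mp3</code>, so it overrides the broader rule for that one file.</p>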
<p>The repository does include all essential files needed to follow along:</p>
<ul>
<li><p><code>requirements.txt</code> (see below)</p>
</li>
<li><p><code>transcribe.py</code> and <code>tts.py</code> (covered step-by-step in the Hands-On section).</p>
</li>
</ul>
<p><code>requirements.txt</code></p>
<pre><code class="lang-python">openai-whisper
transformers
torch
soundfile
sentencepiece
numpy
</code></pre>
<p>This way, you’ll have everything you need to reproduce the project.</p>
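<p>After installing the requirements, you may want a quick way to see whether anything is still missing. The sketch below assumes the usual import names for each package (for example, <code>openai-whisper</code> is imported as <code>whisper</code>). The helper name <code>missing_packages</code> is hypothetical.</p>

```python
from importlib.util import find_spec

# pip package name -> module name used in "import" statements
REQUIRED = {
    "openai-whisper": "whisper",
    "transformers": "transformers",
    "torch": "torch",
    "soundfile": "soundfile",
    "sentencepiece": "sentencepiece",
    "numpy": "numpy",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [name for name, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Still to install:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

<p>Run it inside your activated virtual environment so that it checks the same packages your scripts will use.</p>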
<h2 id="heading-what-inclusive-education-really-means">What Inclusive Education Really Means</h2>
<p>Inclusive education goes beyond placing students with diverse needs in the same classroom. It’s about designing learning environments where every student can thrive.</p>
<p>Common barriers include:</p>
<ul>
<li><p>Reading difficulties (for example, dyslexia).</p>
</li>
<li><p>Communication challenges (speech/hearing impairments).</p>
</li>
<li><p>Sensory overload or attention struggles (autism, ADHD).</p>
</li>
<li><p>Note-taking and comprehension difficulties.</p>
</li>
</ul>
<p>AI can help reduce these barriers with captioning, reading aloud, adaptive pacing, and alternative communication tools.</p>
<h2 id="heading-toolbox-five-ai-accessibility-tools-teachers-can-try-today">Toolbox: Five AI Accessibility Tools Teachers Can Try Today</h2>
<ol>
<li><p><a target="_blank" href="https://support.microsoft.com/en-gb/office/use-immersive-reader-in-word-a857949f-c91e-4c97-977c-a4efcaf9b3c1"><strong>Microsoft Immersive Reader</strong></a> – Text-to-speech, reading guides, and translation.</p>
</li>
<li><p><a target="_blank" href="https://cloud.google.com/speech-to-text"><strong>Google Live Transcribe</strong></a> – Real-time captions for speech/hearing support.</p>
</li>
<li><p><a target="_blank" href="http://Otter.ai"><strong>Otter.ai</strong></a> – Automatic note-taking and summarization.</p>
</li>
<li><p><a target="_blank" href="https://www.grammarly.com/"><strong>Grammarly</strong></a> <strong>/</strong> <a target="_blank" href="https://quillbot.com/login?returnUrl=%2F&amp;triggerOneTap=true"><strong>Quillbot</strong></a> – Writing assistance for readability and clarity.</p>
</li>
<li><p><a target="_blank" href="https://blogs.microsoft.com/accessibility/seeing-ai/"><strong>Seeing AI (Microsoft)</strong></a> – Describes text and scenes for visually impaired learners.</p>
</li>
</ol>
<h3 id="heading-real-world-examples">Real-World Examples</h3>
<p>A student with dyslexia can use Immersive Reader to listen to a textbook while following along visually. Another student with hearing loss can use Live Transcribe to follow class discussions. These are small technology shifts that create big inclusion wins.</p>
<h2 id="heading-platform-notes-windows-vs-macoslinux">Platform Notes (Windows vs macOS/Linux)</h2>
<p>Most code works the same across systems, but setup commands differ slightly:</p>
<p><strong>Creating a virtual environment</strong></p>
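<p>To create and activate a virtual environment in PowerShell using Python 3.8 or higher, follow these steps. The example uses <code>py -3.12</code> to select Python 3.12, so replace the version number with the one you have installed:</p>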
<ol>
<li><p><strong>Create a virtual environment</strong>:</p>
<pre><code class="lang-powershell"> py <span class="hljs-literal">-3</span>.<span class="hljs-number">12</span> <span class="hljs-literal">-m</span> venv .venv
</code></pre>
</li>
<li><p><strong>Activate the virtual environment</strong>:</p>
<pre><code class="lang-powershell"> .\.venv\Scripts\Activate
</code></pre>
</li>
</ol>
<p>Once activated, your PowerShell prompt should change to indicate that you are now working within the virtual environment. This setup helps manage dependencies and keep your project environment isolated.</p>
<p>On macOS, to create and activate a virtual environment in Terminal using Python 3, follow these steps:</p>
<ol>
<li><p><strong>Create a virtual environment</strong>:</p>
<pre><code class="lang-bash"> python3 -m venv .venv
</code></pre>
</li>
<li><p><strong>Activate the virtual environment</strong>:</p>
<pre><code class="lang-bash"> <span class="hljs-built_in">source</span> .venv/bin/activate
</code></pre>
</li>
</ol>
<p>Once activated, your shell prompt should change to show that you are working within the virtual environment.</p>
<p><strong>To install FFmpeg on Windows, follow these steps:</strong></p>
<ol>
<li><p><strong>Download FFmpeg Build</strong>: Visit the <a target="_blank" href="https://www.gyan.dev/ffmpeg/builds/#">gyan.dev builds page</a>, one of the Windows build providers linked from the official FFmpeg download page, and download the latest FFmpeg build for Windows.</p>
</li>
<li><p><strong>Unzip the Downloaded File</strong>: Once downloaded, unzip the file to extract its contents. You will find several files, including <code>ffmpeg.exe</code> (inside the <code>bin</code> folder).</p>
</li>
<li><p><strong>Copy</strong> <code>ffmpeg.exe</code>: You have two options for using <code>ffmpeg.exe</code>:</p>
<ul>
<li><p><strong>Project Folder</strong>: Copy <code>ffmpeg.exe</code> directly into your project folder. This way, your project can access FFmpeg without modifying system settings.</p>
</li>
<li><p><strong>Add to PATH</strong>: Alternatively, you can add the directory containing <code>ffmpeg.exe</code> to your system's PATH environment variable. This allows you to use FFmpeg from any command prompt window without specifying its location.</p>
</li>
</ul>
</li>
</ol>
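<p>The two options above (a copy in the project folder, or a folder on PATH) can also be handled from inside a script. The following is a sketch under stated assumptions: the helper name <code>ensure_ffmpeg</code> is hypothetical, and it assumes that tools launched from your Python process look up <code>ffmpeg</code> through the PATH environment variable.</p>

```python
import os
import shutil
from pathlib import Path


def ensure_ffmpeg(project_dir="."):
    """Return the path to ffmpeg, checking PATH first, then project_dir."""
    found = shutil.which("ffmpeg")
    if found:
        return found
    folder = Path(project_dir).resolve()
    for name in ("ffmpeg.exe", "ffmpeg"):
        candidate = folder / name
        if candidate.is_file():
            # Prepend the folder so programs started from this process can find it
            os.environ["PATH"] = str(folder) + os.pathsep + os.environ.get("PATH", "")
            return str(candidate)
    return None


if __name__ == "__main__":
    location = ensure_ffmpeg()
    print("FFmpeg found at:" if location else "FFmpeg not found.", location or "")
```

<p>Calling this at the top of a script, before loading Whisper, gives a clear message early instead of an error in the middle of transcription.</p>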
<p>Additionally, the full project folder, including all necessary files and instructions, is available for <a target="_blank" href="https://github.com/tayo4christ/inclusive-ai-toolkit">download on GitHub</a>. You can also find the link to the GitHub repository at the end of the article.</p>
<p>For macOS users:</p>
<p>To install FFmpeg on macOS, you can use Homebrew, a popular package manager for macOS. Here’s how:</p>
<ol>
<li><p><strong>Open Terminal</strong>: You can find Terminal in the Utilities folder within Applications.</p>
</li>
<li><p><strong>Install Homebrew</strong> (if not already installed): Paste the following command in Terminal, press Enter, and follow the on-screen instructions: <code>/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"</code></p>
</li>
<li><p><strong>Install FFmpeg</strong>: Once Homebrew is installed, run the following command in Terminal:</p>
<pre><code class="lang-bash"> brew install ffmpeg
</code></pre>
<p> This command will download and install FFmpeg, making it available for use on your system.</p>
</li>
</ol>
<p>For Linux users (Debian/Ubuntu):</p>
<p>To install FFmpeg on Debian-based systems like Ubuntu, you can use the APT package manager. Here’s how:</p>
<ol>
<li><p><strong>Open Terminal</strong>: You can usually find Terminal in your system’s applications menu.</p>
</li>
<li><p><strong>Update Package List</strong>: Before installing new software, it’s a good idea to update your package list. Run:</p>
<pre><code class="lang-bash"> sudo apt update
</code></pre>
</li>
<li><p><strong>Install FFmpeg</strong>: After updating, install FFmpeg by running:</p>
<pre><code class="lang-bash"> sudo apt install ffmpeg
</code></pre>
<p> This command will download and install FFmpeg, allowing you to use it from the command line.</p>
</li>
</ol>
<p>These steps will ensure that FFmpeg is installed and ready to use on your macOS or Linux system.</p>
<p><strong>Running Python scripts</strong></p>
<ul>
<li><p>Windows: <code>python script.py</code> or <code>py script.py</code></p>
</li>
<li><p>macOS/Linux: <code>python3 script.py</code></p>
</li>
</ul>
<p>I will mark these differences with a <strong>macOS/Linux note</strong> in the relevant steps so you can follow along smoothly on your system.</p>
<h2 id="heading-hands-on-build-a-simple-accessibility-toolkit-python">Hands-On: Build a Simple Accessibility Toolkit (Python)</h2>
<p>You’ll build two small demos:</p>
<ul>
<li><p><strong>Speech-to-Text</strong> with Whisper (local, free).</p>
</li>
<li><p><strong>Text-to-Speech</strong> with Hugging Face SpeechT5.</p>
</li>
</ul>
<h3 id="heading-1-speech-to-text-with-whisper-local-and-free">1) Speech-to-Text with Whisper (Local and free)</h3>
<p><strong>What you’ll build:</strong><br>A Python script that takes a short MP3 recording and prints the transcript to your terminal.</p>
<p><strong>Why Whisper?</strong><br>It’s a robust open-source STT model. The local version is perfect for beginners because it avoids API keys and quotas, and it works offline once the model has been downloaded on first use.</p>
<p><strong>How to Install Whisper (PowerShell):</strong></p>
<pre><code class="lang-powershell"><span class="hljs-comment"># Activate your virtual environment</span>
<span class="hljs-comment"># Example: .\.venv\Scripts\Activate</span>

<span class="hljs-comment"># Install the openai-whisper package</span>
pip install openai<span class="hljs-literal">-whisper</span>

<span class="hljs-comment"># Check if FFmpeg is available</span>
ffmpeg <span class="hljs-literal">-version</span>

<span class="hljs-comment"># If FFmpeg is not available, download and install it, then add it to PATH or place ffmpeg.exe next to your script</span>
<span class="hljs-comment"># Example: Move ffmpeg.exe to the script directory or update PATH environment variable</span>
</code></pre>
<p><img src="https://github.com/tayo4christ/inclusive-ai-toolkit/blob/a285ef9fd724d5221e1d7090c0d88713d1e5accb/Images/ffmpeg-version.jpg?raw=true" alt="PowerShell confirming FFmpeg is installed" width="600" height="400" loading="lazy"></p>
<p>You should see a version string here before running Whisper.</p>
<p><strong>Note:</strong> macOS and Linux users can run the same commands in their terminal.</p>
<p>If FFmpeg is not installed, you can install it using the following commands:</p>
<p>For macOS:</p>
<pre><code class="lang-bash">brew install ffmpeg
</code></pre>
<p>For Ubuntu/Debian Linux:</p>
<pre><code class="lang-bash">sudo apt install ffmpeg
</code></pre>
<h3 id="heading-create-transcribepy">Create <code>transcribe.py</code>:</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> whisper

<span class="hljs-comment"># Load the Whisper model</span>
model = whisper.load_model(<span class="hljs-string">"base"</span>)  <span class="hljs-comment"># Use "tiny" for more speed, or "small" for better accuracy</span>

<span class="hljs-comment"># Transcribe the audio file</span>
result = model.transcribe(<span class="hljs-string">"lesson_recording.mp3"</span>, fp16=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># Print the transcript</span>
print(<span class="hljs-string">"Transcript:"</span>, result[<span class="hljs-string">"text"</span>])
</code></pre>
<p><strong>How the code works:</strong></p>
<ul>
<li><p><code>whisper.load_model("base")</code>: downloads the model on first use, then loads it from the local cache on later runs.</p>
</li>
<li><p><code>model.transcribe(...)</code>: handles audio decoding, language detection, and text inference.</p>
</li>
<li><p><code>fp16=False</code>: uses 32-bit floats instead of half precision, which Whisper doesn’t support on CPU, so the script runs on CPU without a warning.</p>
</li>
<li><p><code>result["text"]</code>: the final transcript string.</p>
</li>
</ul>
<p>Run it:</p>
<pre><code class="lang-bash">python transcribe.py
</code></pre>
<p>Expected output:</p>
<p><img src="https://github.com/tayo4christ/inclusive-ai-toolkit/blob/a285ef9fd724d5221e1d7090c0d88713d1e5accb/Images/whisper-transcript.jpg?raw=true" alt="Whisper successfully transcribed audio to text" width="600" height="400" loading="lazy"></p>
<p>Successful Speech-to-Text: Whisper prints the recognized sentence from <code>lesson_recording.mp3</code></p>
<p>To run the <code>transcribe.py</code> script on macOS or Linux, use the following command in your Terminal:</p>
<pre><code class="lang-bash">python3 transcribe.py
</code></pre>
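<p>Besides <code>result["text"]</code>, the dictionary returned by <code>model.transcribe(...)</code> also contains a <code>"segments"</code> list with start and end times, which is handy for captions. The sketch below turns Whisper-style segments into SRT caption text. The helper names are hypothetical, and the sample segments are made-up stand-ins so the code runs without Whisper installed.</p>

```python
def format_timestamp(seconds):
    """Convert seconds (float) to an SRT timestamp such as 00:01:02,500."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, secs, millis)


def segments_to_srt(segments):
    """Build SRT caption text from Whisper-style segment dictionaries."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        blocks.append("%d\n%s --> %s\n%s\n" % (index, start, end, segment["text"].strip()))
    return "\n".join(blocks)


# Stand-in for result["segments"]; these values are invented for illustration
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to inclusive education"},
    {"start": 2.5, "end": 4.0, "text": " with AI."},
]
print(segments_to_srt(sample_segments))
```

<p>In <code>transcribe.py</code>, you could pass <code>result["segments"]</code> to <code>segments_to_srt</code> and write the returned text to a <code>.srt</code> file that video players can load as captions.</p>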
<p><strong>Common hiccups (and fixes):</strong></p>
<ul>
<li><p><code>FileNotFoundError</code> during transcribe → <strong>FFmpeg</strong> isn’t found. Install it and confirm with <code>ffmpeg -version</code>.</p>
</li>
<li><p>Super slow on CPU → switch to a smaller model such as <code>tiny</code>: <code>whisper.load_model("tiny")</code>.</p>
</li>
</ul>
<h3 id="heading-2-text-to-speech-with-speecht5">2) Text-to-Speech with SpeechT5</h3>
<p><strong>What you’ll build:</strong><br>A Python script that converts a short string into a spoken WAV file called <code>output.wav</code>.</p>
<p><strong>Why SpeechT5?</strong><br>It’s a widely used open model that runs on your CPU. Easy to demo and no API key needed.</p>
<p><strong>Install the required packages (PowerShell on Windows):</strong></p>
<pre><code class="lang-powershell"><span class="hljs-comment"># Activate your virtual environment</span>
<span class="hljs-comment"># Example: .\.venv\Scripts\Activate</span>

<span class="hljs-comment"># Install the required packages</span>
pip install transformers torch soundfile sentencepiece
</code></pre>
<p><strong>Note:</strong> macOS and Linux users can run the same commands in their terminal.</p>
<p>Create <code>tts.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
<span class="hljs-keyword">import</span> soundfile <span class="hljs-keyword">as</span> sf
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Load models</span>
processor = SpeechT5Processor.from_pretrained(<span class="hljs-string">"microsoft/speecht5_tts"</span>)
model = SpeechT5ForTextToSpeech.from_pretrained(<span class="hljs-string">"microsoft/speecht5_tts"</span>)
vocoder = SpeechT5HifiGan.from_pretrained(<span class="hljs-string">"microsoft/speecht5_hifigan"</span>)

<span class="hljs-comment"># Speaker embedding (fixed random seed for a consistent synthetic voice)</span>
g = torch.Generator().manual_seed(<span class="hljs-number">42</span>)
speaker_embeddings = torch.randn((<span class="hljs-number">1</span>, <span class="hljs-number">512</span>), generator=g)

<span class="hljs-comment"># Text to synthesize</span>
text = <span class="hljs-string">"Welcome to inclusive education with AI."</span>
inputs = processor(text=text, return_tensors=<span class="hljs-string">"pt"</span>)

<span class="hljs-comment"># Generate speech</span>
<span class="hljs-keyword">with</span> torch.no_grad():
    speech = model.generate_speech(inputs[<span class="hljs-string">"input_ids"</span>], speaker_embeddings, vocoder=vocoder)

<span class="hljs-comment"># Save to WAV</span>
sf.write(<span class="hljs-string">"output.wav"</span>, speech.numpy(), samplerate=<span class="hljs-number">16000</span>)
print(<span class="hljs-string">"✅ Audio saved as output.wav"</span>)
</code></pre>
<p>Expected Output:</p>
<p><img src="https://github.com/tayo4christ/inclusive-ai-toolkit/blob/a285ef9fd724d5221e1d7090c0d88713d1e5accb/Images/tts-saved-ok.jpg?raw=true" alt="Text-to-Speech complete — Audio saved as output.wav" width="600" height="400" loading="lazy"></p>
<p>Text-to-Speech complete. SpeechT5 generated the audio and saved it as <code>output.wav</code></p>
<p><strong>How the code works:</strong></p>
<ul>
<li><p><code>SpeechT5Processor</code>: prepares your text for the model.</p>
</li>
<li><p><code>SpeechT5ForTextToSpeech</code>: generates a <em>mel-spectrogram</em> (the speech content).</p>
</li>
<li><p><code>SpeechT5HifiGan</code>: a vocoder that turns the spectrogram into a waveform you can play.</p>
</li>
<li><p><code>speaker_embeddings</code>: a 512-dim vector representing a “voice.” We seed it for a consistent (synthetic) voice across runs.</p>
</li>
</ul>
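<p>The demo synthesizes one short sentence. For longer passages, such as a textbook paragraph, one possible approach is to split the text into sentence-sized chunks and synthesize them one at a time. The sketch below shows only the splitting step. The helper name <code>split_into_chunks</code> is hypothetical, and the 200-character default is an arbitrary illustration value, not a documented model limit.</p>

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group sentences into chunks of at most max_chars (long sentences stay whole)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so start a new chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


passage = "Welcome to class. Today we learn about plants. Please open your books."
for chunk in split_into_chunks(passage, max_chars=50):
    print(chunk)
```

<p>Each chunk could then go through the same <code>processor</code> and <code>model.generate_speech</code> calls used in <code>tts.py</code>, with the resulting audio arrays joined before saving.</p>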
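<p>Note: The fixed seed gives the same voice on each run, but random number generation is not guaranteed to be identical across PyTorch releases or platforms. To keep exactly the same voice every time you reopen the project, save the embedding once using the snippet below:</p>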
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> torch

<span class="hljs-comment"># Save the speaker embeddings</span>
np.save(<span class="hljs-string">"speaker_emb.npy"</span>, speaker_embeddings.numpy())

<span class="hljs-comment"># Later, load the speaker embeddings</span>
speaker_embeddings = torch.tensor(np.load(<span class="hljs-string">"speaker_emb.npy"</span>))
</code></pre>
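<p>The save and load steps can be folded into a single helper that creates the embedding on first use and reuses it afterwards. This is a sketch with assumptions: the name <code>load_or_create_embedding</code> is hypothetical, and it generates the vector with NumPy rather than <code>torch.randn</code>, so the voice will differ from the one produced by the seeded generator in <code>tts.py</code>.</p>

```python
import os

import numpy as np


def load_or_create_embedding(path="speaker_emb.npy", seed=42, dim=512):
    """Load a saved speaker embedding, or create and save one on first use."""
    if os.path.exists(path):
        return np.load(path)
    rng = np.random.default_rng(seed)
    # Shape (1, dim) and float32 match what the script passes to the model
    embedding = rng.standard_normal((1, dim)).astype(np.float32)
    np.save(path, embedding)
    return embedding
```

<p>To use it, you could replace the seeded generator lines with <code>speaker_embeddings = torch.from_numpy(load_or_create_embedding())</code> in <code>tts.py</code>.</p>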
<p>Run it:</p>
<pre><code class="lang-bash">python tts.py
</code></pre>
<p><strong>Note:</strong> On macOS/Linux, run the same script with <code>python3 tts.py</code>.</p>
<p><strong>Expected result:</strong></p>
<ul>
<li><p>Terminal prints: <code>✅ Audio saved as output.wav</code></p>
</li>
<li><p>A new file appears in your folder: <code>output.wav</code></p>
</li>
</ul>
<p><img src="https://github.com/tayo4christ/inclusive-ai-toolkit/blob/a285ef9fd724d5221e1d7090c0d88713d1e5accb/Images/output-wav-explorer.png.jpg?raw=true" alt="Explorer showing the generated output.wav file" width="600" height="400" loading="lazy"></p>
<p><strong>Common hiccups (and fixes):</strong></p>
<ul>
<li><p><code>ImportError: sentencepiece not found</code> → <code>pip install sentencepiece</code></p>
</li>
<li><p>Torch install issues on Windows →</p>
</li>
</ul>
<pre><code class="lang-powershell"><span class="hljs-comment"># Activate your virtual environment</span>
<span class="hljs-comment"># Example: .\.venv\Scripts\Activate</span>

<span class="hljs-comment"># Install the torch package using the specified index URL for CPU</span>
pip install torch -<span class="hljs-literal">-index</span><span class="hljs-literal">-url</span> https://download.pytorch.org/whl/cpu
</code></pre>
<p>Note: The first run is usually slow because the models are downloaded before they can be used. That’s normal.</p>
<h3 id="heading-3-optional-whisper-via-openai-api">3) Optional: Whisper via OpenAI API</h3>
<p><strong>What this does:</strong><br>Instead of running Whisper locally, you can call the <strong>OpenAI Whisper API</strong> (<code>whisper-1</code>). Your audio file is uploaded to OpenAI’s servers, transcribed there, and the text is returned.</p>
<p><strong>Why use the API?</strong></p>
<ul>
<li><p>No need to install or run Whisper models locally (saves disk space &amp; setup time).</p>
</li>
<li><p>Runs on OpenAI’s infrastructure (faster if your computer is slow).</p>
</li>
<li><p>Great if you’re already using OpenAI services in your classroom or app.</p>
</li>
</ul>
<p><strong>What to watch out for:</strong></p>
<ul>
<li><p>Requires an <strong>API key</strong>.</p>
</li>
<li><p>Requires <strong>billing enabled</strong> (the free trial quota is usually small).</p>
</li>
<li><p>Needs internet access (unlike the local Whisper demo).</p>
</li>
</ul>
<p><strong>How to get an API key:</strong></p>
<ol>
<li><p>Go to <a target="_blank" href="https://auth.openai.com/log-in">OpenAI’s API Keys page.</a></p>
</li>
<li><p>Log in with your OpenAI account (or create one).</p>
</li>
<li><p>Click <strong>“Create new secret key”</strong>.</p>
</li>
<li><p>Copy the key — it looks like <code>sk-xxxxxxxx...</code>. Treat this like a password: don’t share it publicly or push it to GitHub.</p>
</li>
</ol>
<h4 id="heading-step-1-set-your-api-key">Step 1: Set your API key</h4>
<p>In PowerShell (session only):</p>
<pre><code class="lang-powershell"><span class="hljs-comment"># Set the OpenAI API key in the environment variable</span>
<span class="hljs-variable">$env:OPENAI_API_KEY</span>=<span class="hljs-string">"your_api_key_here"</span>
</code></pre>
<p>To set the variable permanently, use the <code>setx</code> command:</p>
<pre><code class="lang-powershell">setx OPENAI_API_KEY <span class="hljs-string">"your_api_key_here"</span>
</code></pre>
<p>This command sets the <code>OPENAI_API_KEY</code> environment variable for your user account. Replace <code>"your_api_key_here"</code> with your actual API key. The change applies only to new sessions, so open a new PowerShell window before running the script.</p>
<p>Verify it’s set:</p>
<p>To check the value in PowerShell, use the <code>echo</code> command:</p>
<pre><code class="lang-powershell"><span class="hljs-built_in">echo</span> <span class="hljs-variable">$env:OPENAI_API_KEY</span>
</code></pre>
<p>This command will display the current value of the <code>OPENAI_API_KEY</code> environment variable in your PowerShell session. If the variable is set, it will print the value. Otherwise, it will return nothing or an empty line.</p>
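<p>Printing the full key to the screen is not always desirable, for example when sharing your screen in class. As an alternative, the sketch below checks the variable from Python and shows only the last four characters. The function name <code>describe_api_key</code> is hypothetical.</p>

```python
import os


def describe_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Report whether the key is set without printing the secret itself."""
    key = env.get(name, "").strip()
    if not key:
        return name + " is not set"
    # Show only the last four characters so the key never appears in full
    return "%s is set (ends with ...%s)" % (name, key[-4:])


print(describe_api_key())
```

<p>This works the same way on Windows, macOS, and Linux, because it reads the environment that Python itself sees.</p>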
<p><strong>Step 2: Install the OpenAI Python client</strong></p>
<p>Install the OpenAI Python client with <code>pip</code> in PowerShell:</p>
<pre><code class="lang-python">pip install openai
</code></pre>
<p>This command will download and install the OpenAI package, allowing you to interact with OpenAI's API in your Python projects. Make sure you have Python and pip installed on your system before running this command.</p>
<p><strong>Step 3: Create</strong> <code>transcribe_api.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> OpenAI

<span class="hljs-comment"># Initialize the OpenAI client (reads API key from environment)</span>
client = OpenAI()

<span class="hljs-comment"># Open the audio file and create a transcription</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">"lesson_recording.mp3"</span>, <span class="hljs-string">"rb"</span>) <span class="hljs-keyword">as</span> f:
    transcript = client.audio.transcriptions.create(
        model=<span class="hljs-string">"whisper-1"</span>,
        file=f
    )

<span class="hljs-comment"># Print the transcript</span>
print(<span class="hljs-string">"Transcript:"</span>, transcript.text)
</code></pre>
<h4 id="heading-step-4-run-it">Step 4: Run it</h4>
<pre><code class="lang-bash">python transcribe_api.py
</code></pre>
<p>Expected output:</p>
<p><code>Transcript: Welcome to inclusive education with AI.</code></p>
<h4 id="heading-common-hiccups-and-fixes">Common hiccups (and fixes):</h4>
<ul>
<li><p><strong>Error: insufficient_quota</strong> → You’ve run out of free credits. Add billing to continue.</p>
</li>
<li><p><strong>Slow upload</strong> → If your audio is large, compress it first (for example, MP3 instead of WAV).</p>
</li>
<li><p><strong>Key not found</strong> → Double-check if <code>$env:OPENAI_API_KEY</code> is set in your terminal session.</p>
</li>
</ul>
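<p>Some of these problems can be caught before uploading anything. The sketch below checks that the audio file exists and is not too large. The helper name <code>upload_size_problem</code> is hypothetical, and the 25 MB figure is an assumption based on the upload limit OpenAI has documented for this endpoint, so confirm the current value in the official API documentation.</p>

```python
import os

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumed limit; confirm in the current API docs


def upload_size_problem(path, limit=MAX_UPLOAD_BYTES):
    """Return a message if the file is missing or too large, else None."""
    if not os.path.isfile(path):
        return "File not found: " + path
    size = os.path.getsize(path)
    if size > limit:
        return "File is %.1f MB; the limit is %.1f MB" % (size / 1e6, limit / 1e6)
    return None


problem = upload_size_problem("lesson_recording.mp3")
print(problem or "File looks fine to upload.")
```

<p>If the file is too large, compressing it or splitting the recording into shorter parts are two options worth trying.</p>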
<p><strong>Local Whisper vs API Whisper — Which Should You Use?</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>Local Whisper (on your machine)</td><td>OpenAI Whisper API (cloud)</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Setup</strong></td><td>Needs Python packages + FFmpeg</td><td>Just install <code>openai</code> client + set API key</td></tr>
<tr>
<td><strong>Hardware</strong></td><td>Runs on your CPU (slower) or GPU (faster)</td><td>Runs on OpenAI’s servers (no local compute needed)</td></tr>
<tr>
<td><strong>Cost</strong></td><td>✅ Free after initial download</td><td>💳 Pay per minute of audio (after free trial quota)</td></tr>
<tr>
<td><strong>Internet required</strong></td><td>❌ No (fully offline once installed)</td><td>✅ Yes (uploads audio to OpenAI servers)</td></tr>
<tr>
<td><strong>Accuracy</strong></td><td>Very good - depends on model size (tiny → large)</td><td>Consistently strong - optimized by OpenAI</td></tr>
<tr>
<td><strong>Speed</strong></td><td>Slower on CPU, faster with GPU</td><td>Fast (uses OpenAI’s infrastructure)</td></tr>
<tr>
<td><strong>Privacy</strong></td><td>Audio never leaves your machine</td><td>Audio is sent to OpenAI (data handling per policy)</td></tr>
</tbody>
</table>
</div><p><strong>Rule of thumb:</strong></p>
<ul>
<li><p>Use <strong>Local Whisper</strong> if you want free, offline transcription or you’re working with sensitive data.</p>
</li>
<li><p>Use the <strong>API Whisper</strong> if you prefer convenience, don’t mind usage billing, and want speed without local setup.</p>
</li>
</ul>
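<p>If your app needs to pick between the two at runtime, the rule of thumb can be written as a small function. This is only one way to encode it, and the function name and parameters are hypothetical.</p>

```python
def choose_whisper_backend(sensitive_audio, needs_offline, has_api_key):
    """Return "local" or "api" following the rule of thumb above."""
    if sensitive_audio or needs_offline:
        return "local"  # audio stays on the machine and no internet is needed
    if has_api_key:
        return "api"
    return "local"  # without a key, local Whisper is the only option


print(choose_whisper_backend(sensitive_audio=True, needs_offline=False, has_api_key=True))
```

<p>Putting privacy first in the checks means a recording marked as sensitive is never uploaded, even when an API key is available.</p>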
<h2 id="heading-quick-setup-cheatsheet"><strong>Quick Setup Cheatsheet</strong></h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Task</td><td>Windows (PowerShell)</td><td>macOS / Linux (Terminal)</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Create venv</strong></td><td><code>py -3.12 -m venv .venv</code></td><td><code>python3 -m venv .venv</code></td></tr>
<tr>
<td><strong>Activate venv</strong></td><td><code>.\.venv\Scripts\Activate</code></td><td><code>source .venv/bin/activate</code></td></tr>
<tr>
<td><strong>Install Whisper</strong></td><td><code>pip install openai-whisper</code></td><td><code>pip install openai-whisper</code></td></tr>
<tr>
<td><strong>Install FFmpeg</strong></td><td><a target="_blank" href="https://www.gyan.dev/ffmpeg/builds/">Download build</a> → unzip → add to PATH or copy <code>ffmpeg.exe</code></td><td><code>brew install ffmpeg</code> (macOS)<br><code>sudo apt install ffmpeg</code> (Linux)</td></tr>
<tr>
<td><strong>Run STT script</strong></td><td><code>python transcribe.py</code></td><td><code>python3 transcribe.py</code></td></tr>
<tr>
<td><strong>Install TTS deps</strong></td><td><code>pip install transformers torch soundfile sentencepiece</code></td><td><code>pip install transformers torch soundfile sentencepiece</code></td></tr>
<tr>
<td><strong>Run TTS script</strong></td><td><code>python tts.py</code></td><td><code>python3 tts.py</code></td></tr>
<tr>
<td><strong>Install OpenAI client (API)</strong></td><td><code>pip install openai</code></td><td><code>pip install openai</code></td></tr>
<tr>
<td><strong>Run API script</strong></td><td><code>python transcribe_api.py</code></td><td><code>python3 transcribe_api.py</code></td></tr>
</tbody>
</table>
</div><p><strong>Pro tip for macOS users on Apple silicon (M1/M2):</strong> Metal GPU acceleration depends on your PyTorch build. Check the <a target="_blank" href="https://pytorch.org/get-started/locally/">PyTorch install guide</a> for the right wheel.</p>
<h2 id="heading-from-code-to-classroom-impact">From Code to Classroom Impact</h2>
<p>Whether you chose the <strong>local Whisper</strong>, the <strong>cloud API</strong>, or SpeechT5 for <strong>text-to-speech</strong>, you should now have a working prototype that can:</p>
<ul>
<li><p>Convert spoken lessons into text.</p>
</li>
<li><p>Read text aloud for students who prefer auditory input.</p>
</li>
</ul>
<p>That’s the technical foundation. But the real question is: how can these building blocks empower teachers and learners in real classrooms?</p>
<h2 id="heading-developer-challenge-build-for-inclusion">Developer Challenge: Build for Inclusion</h2>
<p>Try combining the two snippets into a simple <strong>classroom companion app</strong> that:</p>
<ul>
<li><p><strong>Captions</strong> what the teacher says in real time.</p>
</li>
<li><p><strong>Reads aloud</strong> transcripts or textbook passages on demand.</p>
</li>
</ul>
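<p>As a starting point, here is a minimal sketch of how the two pieces could be wired together. It works on a recorded file rather than live audio, so it is not real-time captioning. The function name <code>classroom_companion</code> and the two stand-in functions are hypothetical. In a real app, you would pass in functions that wrap the Whisper and SpeechT5 code from this article.</p>

```python
def classroom_companion(audio_path, transcribe, speak):
    """Transcribe a recording, then read the transcript aloud.

    transcribe: callable taking an audio path and returning text
    speak: callable taking text and returning the path of the audio it wrote
    """
    transcript = transcribe(audio_path).strip()
    if not transcript:
        # Nothing was recognized, so there is nothing to read aloud
        return {"transcript": "", "audio": None}
    return {"transcript": transcript, "audio": speak(transcript)}


# Stand-in functions so the sketch runs without any models installed
def fake_transcribe(path):
    return " Welcome to inclusive education with AI. "


def fake_speak(text):
    return "output.wav"


print(classroom_companion("lesson_recording.mp3", fake_transcribe, fake_speak))
```

<p>Passing the two steps in as functions keeps the pipeline independent of any one model, so you can swap local Whisper for the API version without changing the rest of the app.</p>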
<p>Then think about how to expand it further:</p>
<ul>
<li><p>Add <strong>symbol recognition</strong> for non-verbal communication.</p>
</li>
<li><p>Add <strong>multi-language translation</strong> for diverse classrooms.</p>
</li>
<li><p>Add <strong>offline support</strong> for schools with poor connectivity.</p>
</li>
</ul>
<p>These are not futuristic ideas. They are achievable with today’s open-source AI tools.</p>
<h2 id="heading-challenges-and-considerations">Challenges and Considerations</h2>
<p>Of course, building for inclusion isn’t just about code. There are important challenges to address:</p>
<ul>
<li><p><strong>Privacy</strong>: Student data must be safeguarded, especially when recordings are involved.</p>
</li>
<li><p><strong>Cost</strong>: Solutions must be affordable and scalable for schools of all sizes.</p>
</li>
<li><p><strong>Teacher Training</strong>: Educators need support to confidently use these tools.</p>
</li>
<li><p><strong>Balance</strong>: AI should assist teachers, not replace the vital human element in learning.</p>
</li>
</ul>
<h2 id="heading-looking-ahead">Looking Ahead</h2>
<p>The future of inclusive education will likely involve multimodal AI: systems that combine speech, gestures, symbols, and even emotion recognition. We may even see brain–computer interfaces and wearable devices that enable seamless communication for learners who are currently excluded.</p>
<p>But one principle is clear: inclusion works best when teachers, developers, and neurodiverse learners co-design solutions together.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>AI isn’t here to replace teachers. It’s here to help them reach every student. By embracing AI-driven accessibility, classrooms can transform into spaces where neurodiverse learners aren’t left behind, but instead empowered to thrive.</p>
<p>📢 <strong>Your turn:</strong></p>
<ul>
<li><p><strong>Teachers</strong>: You can try one of the tools in your next lesson.</p>
</li>
<li><p><strong>Developers</strong>: You can use the code snippets above to prototype your own inclusive classroom tool.</p>
</li>
<li><p><strong>Policymakers</strong>: You can support initiatives that make accessibility central to education.</p>
</li>
</ul>
<p>Inclusive education isn’t just a dream. It’s becoming a reality, and with thoughtful use of AI, it can become the new norm.</p>
<p><strong>Full source code on GitHub:</strong> <a target="_blank" href="https://github.com/tayo4christ/inclusive-ai-toolkit">Inclusive AI Toolkit</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create a Real-Time Gesture-to-Text Translator Using Python and Mediapipe ]]>
                </title>
                <description>
                    <![CDATA[ Sign and symbol languages, like Makaton and American Sign Language (ASL), are powerful communication tools. However, they can create challenges when communicating with people who don't understand them. As a researcher working on AI for accessibility,... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/create-a-real-time-gesture-to-text-translator/</link>
                <guid isPermaLink="false">68a331edf6c19271552e2ac7</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Accessibility ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Mon, 18 Aug 2025 14:00:13 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755525484024/9f4c42e0-dbfd-4f04-9223-0a2169abd1fb.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Sign and symbol languages, like Makaton and American Sign Language (ASL), are powerful communication tools. However, they can create challenges when communicating with people who don't understand them.</p>
<p>As a researcher working on AI for accessibility, I wanted to explore how machine learning and computer vision could bridge that gap. The result was a real-time gesture-to-text translator built with Python and Mediapipe, capable of detecting hand gestures and instantly converting them to text.</p>
<p>In this tutorial, you’ll learn how to build your own version from scratch, even if you’ve never used Mediapipe before.</p>
<p>By the end, you’ll know how to:</p>
<ul>
<li><p>Detect and track hand movements in real time.</p>
</li>
<li><p>Classify gestures using a simple machine learning model.</p>
</li>
<li><p>Convert recognized gestures into text output.</p>
</li>
<li><p>Extend the system for accessibility-focused applications.</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before following along with this tutorial, you should have:</p>
<ul>
<li><p><strong>Basic Python knowledge</strong> – You should be comfortable writing and running Python scripts.</p>
</li>
<li><p><strong>Familiarity with the command line</strong> – You’ll use it to run scripts and install dependencies.</p>
</li>
<li><p><strong>A working webcam</strong> – This is required for capturing and recognizing gestures in real time.</p>
</li>
<li><p><strong>Python installed (3.8 or later)</strong> – Along with <code>pip</code> for installing packages.</p>
</li>
<li><p><strong>Some understanding of machine learning basics</strong> – Knowing what training data and models are will help, but I’ll explain the key parts along the way.</p>
</li>
<li><p><strong>An internet connection</strong> – To install libraries such as Mediapipe and OpenCV.</p>
</li>
</ul>
<p>If you’re completely new to Mediapipe or OpenCV, don’t worry. I’ll walk through the core parts you need to know to get this project working.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-this-matters">Why This Matters</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tools-and-technologies">Tools and Technologies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-how-to-install-the-required-libraries">Step 1: How to Install the Required Libraries</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-how-mediapipe-tracks-hands">Step 2: How Mediapipe Tracks Hands</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-project-pipeline">Step 3: Project Pipeline</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-how-to-collect-gesture-data">Step 4: How to Collect Gesture Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-how-to-train-a-gesture-classifier">Step 5: How to Train a Gesture Classifier</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-real-time-gesture-to-text-translation">Step 6: Real-Time Gesture-to-Text Translation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-7-extending-the-project">Step 7: Extending the Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-ethical-and-accessibility-considerations">Ethical and Accessibility Considerations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-why-this-matters">Why This Matters</h2>
<p>Accessible communication is a right, not a privilege. Gesture-to-text translators can:</p>
<ul>
<li><p>Help non-signers communicate with sign/symbol language users.</p>
</li>
<li><p>Assist in educational contexts for children with communication challenges.</p>
</li>
<li><p>Support people with speech impairments.</p>
</li>
</ul>
<p><strong>Note:</strong> This project is a proof-of-concept and should be tested with diverse datasets before real-world deployment.</p>
<h2 id="heading-tools-and-technologies">Tools and Technologies</h2>
<p>We’ll be using:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Tool</td><td>Purpose</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Python</strong></td><td>Primary programming language</td></tr>
<tr>
<td><strong>Mediapipe</strong></td><td>Real-time hand tracking and gesture detection</td></tr>
<tr>
<td><strong>OpenCV</strong></td><td>Webcam input and video display</td></tr>
<tr>
<td><strong>NumPy</strong></td><td>Data processing</td></tr>
<tr>
<td><strong>Scikit-learn</strong></td><td>Gesture classification</td></tr>
</tbody>
</table>
</div><h2 id="heading-step-1-how-to-install-the-required-libraries">Step 1: How to Install the Required Libraries</h2>
<p>Before installing the dependencies, ensure you have Python version 3.8 or higher installed (for example, Python 3.8, 3.9, 3.10, or newer). You can check your current Python version by opening a terminal (Command Prompt on Windows, or Terminal on macOS/Linux) and typing:</p>
<pre><code class="lang-bash">python --version
</code></pre>
<p>or</p>
<pre><code class="lang-bash">python3 --version
</code></pre>
<p>Confirm that the version printed is 3.8 or higher, because Mediapipe and some of its dependencies rely on modern language features and prebuilt binary wheels. If either command prints a version older than 3.8, install a newer Python version before you continue.</p>
<p><strong>Windows:</strong></p>
<ol>
<li><p>Press <strong>Windows Key + R</strong></p>
</li>
<li><p>Type <code>cmd</code> and press Enter to open Command Prompt</p>
</li>
<li><p>Type one of the above commands and press Enter</p>
</li>
</ol>
<p><strong>macOS/Linux:</strong></p>
<ol>
<li><p>Open your <strong>Terminal</strong> application</p>
</li>
<li><p>Type one of the above commands and press Enter</p>
</li>
</ol>
<p>If your Python version is older than 3.8, you’ll need to <a target="_blank" href="https://www.python.org/downloads/">download and install a newer version from the official Python website</a>.</p>
<p>Once Python is ready, you can install the required libraries using <code>pip</code>:</p>
<pre><code class="lang-bash">pip install mediapipe opencv-python numpy scikit-learn pandas
</code></pre>
<p>This command installs all the libraries you’ll need for the project:</p>
<ul>
<li><p><strong>Mediapipe</strong> – real-time hand tracking and landmark detection.</p>
</li>
<li><p><strong>OpenCV</strong> – reading frames from your webcam and drawing overlays.</p>
</li>
<li><p><strong>NumPy</strong> – data processing on the landmark arrays.</p>
</li>
<li><p><strong>Pandas</strong> – storing our collected landmark data in a CSV for training.</p>
</li>
<li><p><strong>Scikit-learn</strong> – training and evaluating the gesture classification model.</p>
</li>
</ul>
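<p>To confirm the installation before moving on, a small check like the one below can help. It is a minimal sketch rather than part of the project repository: the <code>check_packages</code> helper is a hypothetical name, and it relies only on the standard library’s <code>importlib.util.find_spec</code>, so it reports missing modules without importing them.</p>

```python
import importlib.util
import sys

# Import names (not pip names) for the libraries installed above.
# The pip package opencv-python is imported as cv2,
# and scikit-learn is imported as sklearn.
REQUIRED_MODULES = ["mediapipe", "cv2", "numpy", "sklearn", "pandas"]


def check_packages(module_names):
    """Return the subset of module_names that cannot be found."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    missing = check_packages(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

<p>If any module is reported as missing, re-run the <code>pip install</code> command above in the same environment you use to run Python.</p>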
<h2 id="heading-step-2-how-mediapipe-tracks-hands">Step 2: How Mediapipe Tracks Hands</h2>
<p>Mediapipe’s Hand Tracking solution detects 21 key landmarks for each hand, including the fingertips, joints, and wrist, and can run at <strong>30 FPS or more</strong> even on modest hardware.</p>
<p>Here’s a conceptual diagram of the landmarks:</p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/images/landmarks_concept.png?raw=true" alt="Diagram showing Mediapipe hand landmark numbering and connections between joints" width="600" height="400" loading="lazy"></p>
<p>And here’s what real‑time tracking looks like:</p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/images/hand_tracking_3d_android_gpu.gif?raw=true" alt="Animated GIF showing Mediapipe 3D hand tracking detecting finger joints and bones in real-time" width="600" height="400" loading="lazy"></p>
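<p>Each landmark has <code>(x, y, z)</code> coordinates. The <code>x</code> and <code>y</code> values are normalized by the image width and height (so they fall roughly between 0 and 1), and <code>z</code> is an approximate depth relative to the wrist. Because the values don’t depend on the frame resolution, they’re easy to use for measuring angles and positions for gesture classification.</p>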
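<p>Since Mediapipe reports <code>x</code> and <code>y</code> as fractions of the image width and height, two small helpers are often useful: one to map a landmark back to pixel coordinates, and one to measure the angle at a joint. The sketch below uses only the standard library. The function names <code>to_pixel</code> and <code>joint_angle</code> and the sample values are illustrative assumptions, not part of the project code.</p>

```python
import math


def to_pixel(x_norm, y_norm, image_width, image_height):
    """Convert normalized landmark coordinates to integer pixel coordinates."""
    return int(x_norm * image_width), int(y_norm * image_height)


def joint_angle(a, b, c):
    """Return the angle in degrees at point b, formed by the segments b->a and b->c.

    Each point is an (x, y, z) tuple.
    """
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(p * q for p, q in zip(ba, bc))
    norm = math.sqrt(sum(p * p for p in ba)) * math.sqrt(sum(q * q for q in bc))
    if norm == 0:
        return 0.0  # Degenerate case: two of the points coincide
    # Clamp to guard against floating-point drift outside [-1, 1]
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


# Made-up coordinates, for illustration only
print(to_pixel(0.5, 0.25, 640, 480))  # (320, 120)
print(joint_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
```

<p>An angle near 180 degrees at a finger joint suggests an extended finger, while a much smaller angle suggests a bent one. Features like this can complement the raw coordinates.</p>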
<h2 id="heading-step-3-project-pipeline">Step 3: Project Pipeline</h2>
<p>Here’s how the system works, from webcam to text output:</p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/diagrams/pipeline_flowchart.png?raw=true" alt="Pipeline Flowchart showing how gesture input flows through hand tracking, feature extraction, gesture classification, and final text output" width="600" height="400" loading="lazy"></p>
<ul>
<li><p><strong>Capture</strong>: Webcam frames are captured using OpenCV.</p>
</li>
<li><p><strong>Detection</strong>: Mediapipe locates hand landmarks.</p>
</li>
<li><p><strong>Vectorization</strong>: Landmarks are flattened into a numeric vector.</p>
</li>
<li><p><strong>Classification</strong>: A machine learning model predicts the gesture.</p>
</li>
<li><p><strong>Output</strong>: The recognized gesture is displayed as text.</p>
</li>
</ul>
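<p>The vectorization step can be sketched as follows. To keep the example runnable without a webcam, a hypothetical <code>Landmark</code> named tuple stands in for Mediapipe’s landmark objects, which expose the same <code>x</code>, <code>y</code>, and <code>z</code> attributes. The helper name <code>landmarks_to_vector</code> is also an assumption.</p>

```python
from collections import namedtuple

# Hypothetical stand-in for a Mediapipe landmark, which exposes .x, .y and .z
Landmark = namedtuple("Landmark", ["x", "y", "z"])

NUM_LANDMARKS = 21


def landmarks_to_vector(landmarks):
    """Flatten 21 landmarks into [x0, y0, z0, ..., x20, y20, z20]."""
    if len(landmarks) != NUM_LANDMARKS:
        raise ValueError(f"Expected {NUM_LANDMARKS} landmarks, got {len(landmarks)}")
    vector = []
    for lm in landmarks:
        vector.extend([lm.x, lm.y, lm.z])
    return vector


# Dummy hand with made-up values, for illustration only
dummy_hand = [Landmark(i * 0.01, i * 0.02, 0.0) for i in range(NUM_LANDMARKS)]
vector = landmarks_to_vector(dummy_hand)
print(len(vector))  # 63 values: 21 landmarks x 3 coordinates
```

<p>In the real pipeline, you would pass <code>hand_landmarks.landmark</code> from the Mediapipe results instead of the dummy list.</p>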
<p>Basic hand detection example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> cv2
<span class="hljs-keyword">import</span> mediapipe <span class="hljs-keyword">as</span> mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(<span class="hljs-number">0</span>)

<span class="hljs-keyword">with</span> mp_hands.Hands(max_num_hands=<span class="hljs-number">1</span>) <span class="hljs-keyword">as</span> hands:
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        ret, frame = cap.read()
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ret:
            <span class="hljs-keyword">break</span>

        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        <span class="hljs-keyword">if</span> results.multi_hand_landmarks:
            <span class="hljs-keyword">for</span> hand_landmarks <span class="hljs-keyword">in</span> results.multi_hand_landmarks:
                mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)

        cv2.imshow(<span class="hljs-string">"Hand Tracking"</span>, frame)
        <span class="hljs-keyword">if</span> cv2.waitKey(<span class="hljs-number">1</span>) &amp; <span class="hljs-number">0xFF</span> == ord(<span class="hljs-string">"q"</span>):
            <span class="hljs-keyword">break</span>

cap.release()
cv2.destroyAllWindows()
</code></pre>
<p>The code above opens the webcam and processes each frame with Mediapipe’s Hands solution. It converts each frame from BGR to RGB (as Mediapipe expects), runs detection, and, if a hand is found, draws the 21 landmarks and their connections on top of the frame. Press <code>q</code> to close the window. Running this first verifies your setup and confirms that landmark tracking works before you move on.</p>
<h2 id="heading-step-4-how-to-collect-gesture-data">Step 4: How to Collect Gesture Data</h2>
<p>Before we can train our model, we need a dataset of <strong>labelled gestures</strong>. Each gesture will be stored in a CSV file (<code>gesture_data.csv</code>) containing the 3D landmark coordinates for all detected hand points.</p>
<p>For example, we’ll collect data for three gestures:</p>
<ul>
<li><p><strong>thumbs_up</strong> – the classic thumbs-up pose.</p>
</li>
<li><p><strong>open_palm</strong> – a flat hand, fingers extended (like a “high five”).</p>
</li>
<li><p><strong>ok</strong> – the “OK” sign, made by touching the thumb and index finger.</p>
</li>
</ul>
<p>You can collect samples for each gesture by running:</p>
<pre><code class="lang-bash">python src/collect_data.py --label thumbs_up --samples 200
</code></pre>
<pre><code class="lang-bash">python src/collect_data.py --label open_palm --samples 200
</code></pre>
<pre><code class="lang-bash">python src/collect_data.py --label ok --samples 200
</code></pre>
<p><strong>Explanation of the command:</strong></p>
<ul>
<li><p><code>--label</code> → the name of the gesture you’re recording. This label will be stored alongside each row of coordinates in the CSV.</p>
</li>
<li><p><code>--samples</code> → the number of frames to capture for that gesture. More samples generally lead to better accuracy.</p>
</li>
</ul>
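<p>The full <code>collect_data.py</code> script lives in the repository, so the snippet below is only a sketch of how two flags like these could be parsed with the standard library’s <code>argparse</code>. The <code>parse_args</code> helper, the help strings, and the default of 200 samples are assumptions made for illustration.</p>

```python
import argparse


def parse_args(argv=None):
    """Parse the --label and --samples flags described above."""
    parser = argparse.ArgumentParser(description="Collect gesture landmark samples.")
    parser.add_argument("--label", required=True,
                        help="Name of the gesture being recorded, e.g. thumbs_up")
    # The default of 200 is an assumption that mirrors the example commands
    parser.add_argument("--samples", type=int, default=200,
                        help="Number of frames to capture for this gesture")
    return parser.parse_args(argv)


# Passing a list makes the parser easy to try without a real command line
args = parse_args(["--label", "thumbs_up", "--samples", "200"])
print(args.label, args.samples)
```

<p>Using <code>type=int</code> means a non-numeric value for <code>--samples</code> is rejected with a clear error before the webcam ever opens.</p>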
<p><strong>How the process works:</strong></p>
<ol>
<li><p>When you run a command, your webcam will open.</p>
</li>
<li><p>Make the specified gesture in front of the camera.</p>
</li>
<li><p>The script will use MediaPipe Hands to detect 21 hand landmarks (each with <code>x</code>, <code>y</code>, <code>z</code> coordinates).</p>
</li>
<li><p>These 63 numbers (21 × 3) are stored in a row of the CSV file, along with the gesture label.</p>
</li>
<li><p>The counter at the top will track how many samples have been collected.</p>
</li>
<li><p>When the sample count reaches your target (<code>--samples</code>), the script will close automatically.</p>
</li>
</ol>
<p><strong>Example of what the CSV looks like:</strong></p>
<p><img src="https://raw.githubusercontent.com/tayo4christ/Gesture_Article/26db13366407e5b5d230a6c7dd7923e34a9f2a19/screenshots/gesture_data.webp" alt="Sample of gesture_data.csv" width="600" height="400" loading="lazy"></p>
<p>Each row contains:</p>
<ul>
<li><p><strong>x0, y0, z0 … x20, y20, z20</strong> → coordinates of each hand landmark.</p>
</li>
<li><p><strong>label</strong> → the gesture name.</p>
</li>
</ul>
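<p>The row layout above can be sketched with a few lines of Python. The tutorial’s script uses Pandas for storage, but this example swaps in the standard library’s <code>csv</code> module so it runs with no extra dependencies. The helper names <code>build_header</code> and <code>write_samples</code> and the sample values are hypothetical.</p>

```python
import csv
import io

NUM_LANDMARKS = 21


def build_header():
    """Return column names x0, y0, z0, ..., x20, y20, z20, label."""
    header = []
    for i in range(NUM_LANDMARKS):
        header.extend([f"x{i}", f"y{i}", f"z{i}"])
    header.append("label")
    return header


def write_samples(file_obj, samples):
    """Write (coords, label) pairs as CSV rows, one row per captured frame."""
    writer = csv.writer(file_obj)
    writer.writerow(build_header())
    for coords, label in samples:
        if len(coords) != NUM_LANDMARKS * 3:
            raise ValueError("Each sample needs 63 coordinate values")
        writer.writerow(list(coords) + [label])


# In-memory buffer with made-up values; a real script would open a file instead
buffer = io.StringIO()
write_samples(buffer, [([0.5] * 63, "thumbs_up")])
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(len(rows[0]), rows[1][-1])
```

<p>Each row ends up with 64 columns: 63 coordinates followed by the label.</p>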
<p><strong>Example of data collection in progress:</strong></p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/screenshots/detection_example.jpg?raw=true" alt="Screenshot of data collection interface capturing hand gesture landmarks from webcam" width="600" height="400" loading="lazy"></p>
<p>In the above screenshot, the script is capturing <strong>10 out of 10</strong> <code>thumbs_up</code> samples.</p>
<p>📌 <strong>Tip:</strong> Make sure your hand is clearly visible and well-lit. Repeat the process for all gestures you want to train.</p>
<h2 id="heading-step-5-how-to-train-a-gesture-classifier">Step 5: How to Train a Gesture Classifier</h2>
<p>Once you have enough samples for each gesture, train a model:</p>
<pre><code class="lang-bash">python src/train_model.py --data data/gesture_data.csv --label palm_open
</code></pre>
<p>This script:</p>
<ul>
<li><p>Loads the CSV dataset.</p>
</li>
<li><p>Splits into training and testing sets.</p>
</li>
<li><p>Trains a Random Forest Classifier.</p>
</li>
<li><p>Prints accuracy and a classification report.</p>
</li>
<li><p>Saves the trained model.</p>
</li>
</ul>
<p>Core training logic:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> RandomForestClassifier
<span class="hljs-keyword">import</span> pickle

<span class="hljs-comment"># Load the dataset</span>
df = pd.read_csv(<span class="hljs-string">"data/gesture_data.csv"</span>)

<span class="hljs-comment"># Separate features and labels</span>
X = df.drop(<span class="hljs-string">"label"</span>, axis=<span class="hljs-number">1</span>)
y = df[<span class="hljs-string">"label"</span>]

<span class="hljs-comment"># Initialize and train the Random Forest Classifier</span>
model = RandomForestClassifier()
model.fit(X, y)

<span class="hljs-comment"># Save the trained model to a file</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">"data/gesture_model.pkl"</span>, <span class="hljs-string">"wb"</span>) <span class="hljs-keyword">as</span> f:
    pickle.dump(model, f)
</code></pre>
<p>This block loads the gesture dataset from <code>data/gesture_data.csv</code> and splits it into:</p>
<ul>
<li><p><code>X</code> – the input features (the 3D landmark coordinates for each gesture sample).</p>
</li>
<li><p><code>y</code> – the labels (gesture names like <code>thumbs_up</code>, <code>open_palm</code>, <code>ok</code>).</p>
</li>
</ul>
<p>We then create a <strong>Random Forest Classifier</strong>, which is well-suited to numerical data and works reliably without much tuning. The model learns patterns in the landmark positions that correspond to each gesture.<br>Finally, we save the trained model as <code>data/gesture_model.pkl</code> so it can be loaded later for real-time gesture recognition without retraining.</p>
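<p>The core logic above trains on every row, so it can’t tell you how well the model generalizes. The split and report steps listed for the training script could look something like the sketch below. It uses synthetic data in place of <code>gesture_data.csv</code> so that it runs on its own, and the <code>test_size</code>, <code>random_state</code>, and <code>n_estimators</code> values are assumptions rather than the repository’s settings.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for gesture_data.csv: 3 gestures x 50 samples x 63 features.
# Each class is centred on a different value so the toy problem is learnable.
rng = np.random.default_rng(42)
labels = ["thumbs_up", "open_palm", "ok"]
X = np.vstack([rng.normal(loc=i, scale=0.1, size=(50, 63)) for i in range(len(labels))])
y = np.repeat(labels, 50)

# Hold out 20% of the rows; stratify keeps the gesture proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on rows the model has not seen during training
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_test, y_pred))
```

<p>With your own dataset, replace the synthetic <code>X</code> and <code>y</code> with <code>df.drop("label", axis=1)</code> and <code>df["label"]</code> from the training code above. Real gesture data will usually score lower than this toy example.</p>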
<h2 id="heading-step-6-real-time-gesture-to-text-translation">Step 6: Real-Time Gesture-to-Text Translation</h2>
<p>Load the model and run the translator:</p>
<pre><code class="lang-bash">python src/gesture_to_text.py --model data/gesture_model.pkl
</code></pre>
<p>This command runs the real-time gesture recognition script.</p>
<ul>
<li><p>The <code>--model</code> argument tells the script which trained model file to load, in this case the <code>gesture_model.pkl</code> file that we saved earlier.</p>
</li>
<li><p>Once running, the script opens your webcam, detects your hand landmarks, and uses the model to predict the gesture.</p>
</li>
<li><p>The predicted gesture name appears as text on the video feed.</p>
</li>
<li><p>Press <code>q</code> to exit the window when you’re done.</p>
</li>
</ul>
<p>Core prediction logic:</p>
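<pre><code class="lang-python"><span class="hljs-keyword">import</span> pickle
<span class="hljs-keyword">import</span> cv2

<span class="hljs-comment"># Load the trained model once, before the capture loop starts</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">"data/gesture_model.pkl"</span>, <span class="hljs-string">"rb"</span>) <span class="hljs-keyword">as</span> f:
    model = pickle.load(f)

<span class="hljs-comment"># The rest runs inside the capture loop from Step 3, where `frame` is the</span>
<span class="hljs-comment"># current webcam frame and `results` is the output of hands.process(...)</span>
<span class="hljs-keyword">if</span> results.multi_hand_landmarks:
    <span class="hljs-keyword">for</span> hand_landmarks <span class="hljs-keyword">in</span> results.multi_hand_landmarks:
        <span class="hljs-comment"># Flatten the 21 landmarks into a 63-value feature vector</span>
        coords = []
        <span class="hljs-keyword">for</span> lm <span class="hljs-keyword">in</span> hand_landmarks.landmark:
            coords.extend([lm.x, lm.y, lm.z])
        <span class="hljs-comment"># Predict the gesture label and draw it on the frame</span>
        gesture = model.predict([coords])[<span class="hljs-number">0</span>]
        cv2.putText(frame, gesture, (<span class="hljs-number">10</span>, <span class="hljs-number">50</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">1</span>, (<span class="hljs-number">0</span>, <span class="hljs-number">255</span>, <span class="hljs-number">0</span>), <span class="hljs-number">2</span>)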
</code></pre>
<p>This code loads the trained gesture recognition model from <code>gesture_model.pkl</code>.<br>If any hands are detected (<code>results.multi_hand_landmarks</code>), it loops through each detected hand and:</p>
<ol>
<li><p><strong>Extracts the coordinates</strong> – for each of the 21 landmarks, it appends the <code>x</code>, <code>y</code>, and <code>z</code> values to the <code>coords</code> list.</p>
</li>
<li><p><strong>Makes a prediction</strong> – passes <code>coords</code> to the model’s <code>predict</code> method to get the most likely gesture label.</p>
</li>
<li><p><strong>Displays the result</strong> – uses <code>cv2.putText</code> to draw the predicted gesture name on the video feed.</p>
</li>
</ol>
<p>This is the real-time decision-making step that turns raw Mediapipe landmark data into a readable gesture label.</p>
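<p>Because a prediction is made on every frame, the displayed label can flicker while your hand moves between poses. One optional refinement, which is not part of the original script, is to take a majority vote over the last few predictions. The <code>PredictionSmoother</code> class and its window size of 5 are assumptions made for this sketch.</p>

```python
from collections import Counter, deque


class PredictionSmoother:
    """Return the most common label among the last `window` predictions."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, label):
        """Record the newest per-frame label and return the smoothed label."""
        self.history.append(label)
        # most_common(1) returns [(label, count)] for the most frequent label
        return Counter(self.history).most_common(1)[0][0]


smoother = PredictionSmoother(window=5)
# Made-up stream of per-frame predictions with one noisy frame
stream = ["ok", "ok", "open_palm", "ok", "ok"]
smoothed = [smoother.update(label) for label in stream]
print(smoothed)  # ['ok', 'ok', 'ok', 'ok', 'ok']
```

<p>In the capture loop, you could call <code>smoother.update(...)</code> on each prediction and draw the returned label instead. A larger window gives steadier output at the cost of a slightly slower response to a new gesture.</p>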
<p>You should see the recognized gesture at the top of the video feed:</p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/screenshots/text_output.jpg?raw=true" alt="Screenshot of the real-time gesture recognition output overlaying the 'palm_open' label on the video feed" width="600" height="400" loading="lazy"></p>
<h2 id="heading-step-7-extending-the-project">Step 7: Extending the Project</h2>
<p>You can take this project further by:</p>
<ul>
<li><p><strong>Adding Text-to-Speech</strong>: Use <code>pyttsx3</code> to speak recognized words.</p>
</li>
<li><p><strong>Supporting More Gestures</strong>: Expand your dataset.</p>
</li>
<li><p><strong>Deploying in the Browser</strong>: Use TensorFlow.js for web-based recognition.</p>
</li>
<li><p><strong>Testing with Real Users</strong>: Especially in accessibility contexts.</p>
</li>
</ul>
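<p>For the text-to-speech idea, one practical detail is that the translator predicts a gesture on every frame, so speaking each prediction would repeat the same word many times per second. The sketch below shows one possible way to speak only when the gesture changes. The <code>GestureAnnouncer</code> class is hypothetical, and the speech engine is passed in as a plain callable so the example runs without audio hardware.</p>

```python
class GestureAnnouncer:
    """Call `speak` only when the recognized gesture changes."""

    def __init__(self, speak):
        self.speak = speak  # Any callable that takes a string
        self.last_spoken = None

    def announce(self, gesture):
        """Speak the gesture if it differs from the previous one; return True if spoken."""
        if gesture == self.last_spoken:
            return False
        # Replace underscores so a label like thumbs_up is read as two words
        self.speak(gesture.replace("_", " "))
        self.last_spoken = gesture
        return True


# Demo with a list standing in for a speech engine.
# With pyttsx3 installed, `speak` could instead be a function that calls
# engine.say(text) followed by engine.runAndWait() on an engine from pyttsx3.init().
spoken = []
announcer = GestureAnnouncer(spoken.append)
for gesture in ["thumbs_up", "thumbs_up", "ok", "ok", "thumbs_up"]:
    announcer.announce(gesture)
print(spoken)  # ['thumbs up', 'ok', 'thumbs up']
```

<p>Keep in mind that a blocking speech call inside the capture loop would pause the video while it speaks, so a real implementation may need to run speech on a separate thread.</p>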
<h2 id="heading-ethical-and-accessibility-considerations">Ethical and Accessibility Considerations</h2>
<p>Before deploying:</p>
<ul>
<li><p><strong>Dataset Diversity</strong>: Train with gestures from different skin tones, hand sizes, and lighting conditions.</p>
</li>
<li><p><strong>Privacy</strong>: Store only landmark coordinates unless you have consent for video storage.</p>
</li>
<li><p><strong>Cultural Context</strong>: Some gestures have different meanings in different cultures.</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, we explored how to use Python, Mediapipe, and machine learning to build a real-time gesture-to-text translator. This technology has exciting potential for accessibility and inclusive communication, and with further development, could become a powerful tool for breaking down language barriers.</p>
<p>You can find the full code and resources here:</p>
<p><strong>GitHub Repo</strong> – <a target="_blank" href="https://github.com/tayo4christ/Gesture_Article">Gesture_Article</a></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
