OMOTAYO OMOYEMI - freeCodeCamp.org

How I Built a Makaton AI Companion Using Gemini Nano and the Gemini API

OMOTAYO OMOYEMI — Fri, 07 Nov 2025 16:33:07 +0000

When I started my research on AI systems that could translate Makaton (a sign and symbol language designed to support speech and communication), I wanted to bridge a gap in accessibility for learners with speech or language difficulties.

Over time, this academic interest evolved into a working prototype that combines on-device AI and cloud AI to describe images and translate them into English meanings. The idea was simple: I wanted to build a lightweight web app that recognized Makaton gestures or symbols and instantly provided an English interpretation.

In this article, I’ll walk you through how I built my Makaton AI Companion, a single-page web app powered by Gemini Nano (on-device) and the Gemini API (cloud). You’ll see how it works, how I solved common issues like CORS and API model errors, and how this small project became part of my journey toward AI for accessibility.

By the end of this article, you will be able to:

Understand the core concept behind Makaton and why it’s important in accessibility and inclusive education.
Learn how to combine on-device AI (Gemini Nano) and cloud-based AI (Gemini API) in a single web project.
Build a functional AI-powered web app that can describe images and map them to predefined English meanings.
Discover how to handle common errors such as model endpoint issues, missing API keys, and CORS restrictions when working with generative AI APIs.
Learn how to store API keys locally for user privacy using localStorage.
Use browser speech synthesis to convert the AI-generated English meanings into spoken output.

Tools and Tech Stack
Building the App Step by Step
How to Fix the Common Issues
Demo: The Makaton AI Companion in Action
Broader Reflections
Conclusion

Tools and Tech Stack

To build the Makaton AI Companion, I wanted something lightweight, fast to prototype, and easy for anyone to run without complicated dependencies. I chose a plain web stack with a focus on accessibility and transparency.

Here’s what I used:

Frontend

HTML + CSS + JavaScript (Vanilla): No frameworks, just clean and understandable code that any beginner can follow.
A single index.html page handles the upload interface, output display, and AI logic.

AI Components

Gemini Nano runs locally in Chrome Canary. This on-device model lets users generate short text without calling the cloud API.
Gemini API (Cloud) is used whenever an API key is saved and image analysis is required, since the on-device path in this app is text-only.
- Models tested: gemini-1.5-flash and gemini-pro-vision.
- Fallback logic tries multiple model endpoints if one returns a 404 error.

Local Storage

The Gemini API key is stored in the browser’s localStorage. There is no backend server in between: the key is sent only to Google’s API endpoint when the app makes a request.

Browser SpeechSynthesis API

Converts the translated English meaning into spoken audio with one click.

Mapping Logic

A small custom dictionary (mapping.js) links AI-generated descriptions to likely Makaton meanings. For example: { keywords: ["open hand", "raised hand", "wave"], meaning: "Hello / Stop" }

Local Server

The app is served locally using Python’s built-in HTTP server to avoid CORS issues:

python -m http.server 8080

Then open http://localhost:8080 in Chrome Canary.

Building the App Step by Step

Now let’s dive into how the Makaton AI Companion works under the hood. This project follows a simple but effective flow: Upload an image → Describe (AI) → Map to Meaning → Speak or Copy the result

We’ll go through each part step by step.

1. Setting Up the Project Folder

You don’t need any complex setup. Just create a new folder and add these files:

makaton-ai-companion/
│
├── index.html
├── styles.css
├── app.js
└── lib/
    ├── mapping.js
    └── ai.js

If you prefer a ready-to-run version, you can serve everything from one zip (I’ll share a GitHub link at the end).

2. Creating the Basic HTML Structure

Your index.html file defines the interface where users upload an image, click Describe, and view the results.

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
  <title>Makaton AI Companion</title>
  <link rel="stylesheet" href="styles.css"/>
</head>
<body>
  <header class="app-header">
    <h1>🧩 Makaton AI Companion</h1>
    <button id="btnSettings" class="btn secondary">Settings</button>
  </header>

  <main class="container">
    <section class="card">
      <h2>1) Upload an image (Makaton sign/symbol)</h2>
      <label for="file">
        Choose an image file
        <input id="file" type="file" accept="image/*" title="Select an image file"/>
      </label>
      <div id="preview" class="preview hidden"></div>
      <p id="status" class="status"></p>
      <div class="actions">
        <button id="btnDescribe" class="btn">Describe (Cloud or Nano)</button>
        <button id="btnType" class="btn ghost">Type a description instead</button>
      </div>
      <div id="typedBox" class="typed hidden">
        <textarea id="typed" rows="3" placeholder="Describe what you see..."></textarea>
        <button id="btnUseTyped" class="btn">Use this description</button>
      </div>
    </section>

    <section class="card">
      <h2>2) AI Output</h2>
      <div class="grid">
        <div>
          <h3>Image Description</h3>
          <div id="output" class="output"></div>
        </div>
        <div>
          <h3>English Meaning (Mapped)</h3>
          <div id="meaning" class="meaning"></div>
          <div class="actions">
            <button id="btnSpeak" class="btn ghost" disabled>🔊 Speak</button>
            <button id="btnCopy" class="btn ghost" disabled>📋 Copy</button>
          </div>
        </div>
      </div>
    </section>
  </main>

  <dialog id="settings">
    <form method="dialog" class="settings-form">
      <h2>Settings</h2>
      <label>Gemini API key (optional)<input id="apiKey" type="password" placeholder="AIza..."/></label>
      <div class="settings-actions">
        <button id="btnSaveKey" type="submit" class="btn">Save</button>
        <button id="btnCloseSettings" type="button" class="btn secondary">Close</button>
      </div>
      <div id="apiStatus" class="api-status"></div>
    </form>
  </dialog>

  <script type="module" src="lib/mapping.js"></script>
  <script type="module" src="lib/ai.js"></script>
  <script type="module" src="app.js"></script>
</body>
</html>

This interface is intentionally minimal: no frameworks, no build tools, just clear HTML.

3. Mapping Descriptions to Makaton Meanings

The mapping.js file holds a simple keyword-based dictionary. When the AI describes an image (like “a raised open hand”), the app searches for keywords that match known Makaton signs.

// lib/mapping.js

export const MAKATON_GLOSSES = [
  { keywords: ["open hand", "raised hand", "wave", "hand up"], meaning: "Hello / Stop" },
  { keywords: ["eat", "food", "spoon", "hand to mouth"], meaning: "Eat" },
  { keywords: ["drink", "cup", "glass", "bottle"], meaning: "Drink" },
  { keywords: ["home", "house", "roof"], meaning: "Home" },
  { keywords: ["sleep", "bed", "eyes closed"], meaning: "Sleep" },
  { keywords: ["book", "reading", "pages"], meaning: "Book / Read" },
  // Added so the "help" sign used in the demo maps correctly:
  { keywords: ["help", "assist", "thumb on palm", "hand over hand", "assisting"], meaning: "Help" },
];

export function mapDescriptionToMeaning(desc) {
  if (!desc) return "";
  const d = desc.toLowerCase();
  for (const entry of MAKATON_GLOSSES) {
    if (entry.keywords.some(k => d.includes(k))) return entry.meaning;
  }
  if (d.includes("hand")) return "Gesture / Hand sign (clarify)";
  return "No direct mapping found.";
}

It’s simple but effective enough to simulate real symbol-to-language translation for demo purposes.
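One caveat with plain substring matching: includes() also fires inside longer words, so "eat" matches "seat" and "cup" matches "cupboard". Here is a minimal sketch of a stricter check, assuming you want keywords matched only on word boundaries. The helper names escapeRegExp and includesWholePhrase are my own for illustration and are not part of the original mapping.js.

```javascript
// Hypothetical helpers, not part of the original mapping.js.

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True only when `phrase` appears in `text` as whole words (case-insensitive).
function includesWholePhrase(text, phrase) {
  const pattern = new RegExp("\\b" + escapeRegExp(phrase) + "\\b", "i");
  return pattern.test(text);
}

console.log(includesWholePhrase("A person sitting on a seat", "eat")); // false
console.log(includesWholePhrase("A child about to eat", "eat"));       // true
```

Swapping d.includes(k) for includesWholePhrase(d, k) inside mapDescriptionToMeaning would apply it. The trade-off is that inflected forms such as "eating" no longer match unless you list them as keywords.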

4. Adding Gemini AI Logic

The ai.js file connects to Gemini Nano (on-device) or the Gemini API (cloud). When an API key is saved, the app sends the image to the cloud model. Without a key, it falls back to Nano for a text-only best guess. And if neither is available or the call fails, it lets users type a description manually.

// lib/ai.js — dynamic model discovery (try-all version)

// --- On-device availability (Gemini Nano) ---
export async function checkAvailability() {
  const res = { nanoTextPossible: false };
  try {
    const canCreate = self.ai?.canCreateTextSession || self.ai?.languageModel?.canCreate;
    if (typeof canCreate === "function") {
      const ok = await (self.ai.canCreateTextSession?.() || self.ai.languageModel.canCreate?.());
      res.nanoTextPossible = ok === "readily" || ok === "after-download" || ok === true;
    }
  } catch {}
  return res;
}

export async function createNanoTextSession() {
  if (self.ai?.createTextSession) return await self.ai.createTextSession();
  if (self.ai?.languageModel?.create) return await self.ai.languageModel.create();
  throw new Error("Gemini Nano text session not available");
}

// --- Cloud: dynamically discover models for this key ---
async function listModels(key) {
  const url = "https://generativelanguage.googleapis.com/v1/models?key=" + encodeURIComponent(key);
  const r = await fetch(url);
  if (!r.ok) throw new Error("ListModels failed: " + (await r.text()));
  const j = await r.json();
  return (j.models || []).map(m => m.name).filter(Boolean);
}

function rankModels(names) {
  // Prefer Gemini 1.5 (multimodal), then flash variants, then anything with vision/pro.
  return names
    .filter(n => n.startsWith("models/"))              // ignore tunedModels, etc.
    .filter(n => !n.includes("experimental"))          // skip experimental
    .sort((a, b) => score(b) - score(a));

  function score(n) {
    let s = 0;
    if (n.includes("1.5")) s += 10;
    if (n.includes("flash")) s += 8;
    if (n.includes("pro-vision")) s += 7;
    if (n.includes("pro")) s += 6;
    if (n.includes("vision")) s += 5;
    if (n.includes("latest")) s += 2;
    return s;
  }
}

async function tryGenerateForModels(imageDataUrl, key, models, mimeType) {
  const base64 = imageDataUrl.split(",")[1];
  const body = {
    contents: [{
      parts: [
        { text: "Describe this image briefly in one sentence focusing on the main gesture or symbol." },
        { inline_data: { mime_type: mimeType || "image/png", data: base64 } }
      ]
    }]
  };
  let lastErr = "";
  for (const model of models) {
    const endpoint = "https://generativelanguage.googleapis.com/v1/" + model + ":generateContent?key=" + encodeURIComponent(key);
    try {
      const r = await fetch(endpoint, { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify(body)});
      if (!r.ok) { lastErr = await r.text().catch(()=>String(r.status)); continue; }
      const j = await r.json();
      const text = j?.candidates?.[0]?.content?.parts?.map(p=>p.text).join(" ").trim();
      if (text) return text;
      lastErr = "Empty response from " + model;
    } catch (e) {
      lastErr = String(e?.message || e);
    }
  }
  throw new Error("All discovered models failed. Last error: " + lastErr);
}

export async function describeImageWithGemini(imageDataUrl, apiKey, mimeType = "image/png") {
  if (!apiKey) throw new Error("No API key provided");

  const models = await listModels(apiKey);
  if (!models.length) throw new Error("No models returned for this key. Ensure Generative Language API is enabled and T&Cs accepted in AI Studio.");

  const ranked = rankModels(models);
  if (!ranked.length) throw new Error("No usable model names returned (models/*).");

  return await tryGenerateForModels(imageDataUrl, apiKey, ranked, mimeType);
}

// --- Key storage (local only) ---
const KEY = "makaton_demo_gemini_key";
export function saveApiKey(k) { localStorage.setItem(KEY, k || ""); }
export function loadApiKey() { return localStorage.getItem(KEY) || ""; }

Note: This retry system is essential because not every Gemini model version is available to every account, so many users hit 404 model errors when a single model name is hard-coded.
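To see what order the ranking actually produces, it helps to run the same score function over a few model names. The names below are sample inputs for illustration only, not a list of what your key will return.

```javascript
// Same scoring rules as rankModels in lib/ai.js, pulled out for inspection.
function score(n) {
  let s = 0;
  if (n.includes("1.5")) s += 10;
  if (n.includes("flash")) s += 8;
  if (n.includes("pro-vision")) s += 7;
  if (n.includes("pro")) s += 6;
  if (n.includes("vision")) s += 5;
  if (n.includes("latest")) s += 2;
  return s;
}

// Sample names for illustration only; your key may return a different set.
const sample = [
  "models/gemini-pro",
  "models/gemini-pro-vision",
  "models/gemini-1.5-pro",
  "models/gemini-1.5-flash",
  "models/gemini-1.5-flash-latest",
];

// Copy before sorting so the sample array keeps its original order.
const ranked = [...sample].sort((a, b) => score(b) - score(a));
for (const name of ranked) console.log(score(name), name);
```

Because the substring checks overlap, a name containing "pro-vision" also collects the points for "pro" and "vision" (7 + 6 + 5 = 18), which ties it with gemini-1.5-flash (10 + 8 = 18). If you want the 1.5 models tried strictly first, the weights would need adjusting.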

5. The Main Logic (app.js)

This script ties everything together: file upload, AI call, meaning mapping, and output display.


import { mapDescriptionToMeaning } from './lib/mapping.js';
import { checkAvailability, createNanoTextSession, describeImageWithGemini, saveApiKey, loadApiKey } from './lib/ai.js';

document.addEventListener('DOMContentLoaded', () => {
  console.log('[Makaton] DOM ready');

  const $ = (s) => document.querySelector(s);

  // Elements
  const fileInput   = $('#file');
  const preview     = $('#preview');
  const meaningEl   = $('#meaning');
  const outputEl    = $('#output');
  const btnDescribe = $('#btnDescribe');
  const btnType     = $('#btnType');
  const typedBox    = $('#typedBox');
  const typed       = $('#typed');
  const btnUseTyped = $('#btnUseTyped');
  const btnSpeak    = $('#btnSpeak');
  const btnCopy     = $('#btnCopy');
  const statusEl    = $('#status');

  const settings        = $('#settings');
  const btnSettings     = $('#btnSettings');
  const btnCloseSettings= $('#btnCloseSettings');
  const btnSaveKey      = $('#btnSaveKey');
  const apiKeyInput     = $('#apiKey');
  const apiStatus       = $('#apiStatus');

  let currentImageDataUrl = null;
  let currentImageMime    = "image/png";

  // Sanity logs
  console.log('[Makaton] Elements:', {
    fileInput: !!fileInput, preview: !!preview, outputEl: !!outputEl,
    meaningEl: !!meaningEl, btnDescribe: !!btnDescribe, statusEl: !!statusEl
  });

  // Init API key
  if (apiKeyInput) apiKeyInput.value = loadApiKey() || "";

  // --- Helpers ---
  function setStatus(text) {
    if (statusEl) statusEl.textContent = text || '';
    console.log('[Makaton][Status]', text);
  }
  function clearOutputs() {
    if (outputEl) outputEl.textContent = '';
    if (meaningEl) meaningEl.textContent = '';
    if (btnSpeak) btnSpeak.disabled = true;
    if (btnCopy)  btnCopy.disabled  = true;
  }
  function setOutput(desc) {
    if (outputEl) outputEl.textContent = desc || '';
    const meaning = mapDescriptionToMeaning(desc || '');
    if (meaningEl) meaningEl.textContent = meaning;
    if (btnSpeak) btnSpeak.disabled = !meaning || meaning.includes('No direct mapping');
    if (btnCopy)  btnCopy.disabled  = !meaning;
    setStatus('Done.');
  }
  function fileToDataURL(file) {
    return new Promise((resolve, reject) => {
      const reader = new FileReader();
      reader.onload  = () => resolve(reader.result);
      reader.onerror = (e) => reject(e);
      reader.readAsDataURL(file);
    });
  }
  function handleFiles(files) {
    const file = files?.[0];
    if (!file) { setStatus('No file selected.'); return; }
    currentImageMime = file.type || "image/png";
    fileToDataURL(file)
      .then((dataUrl) => {
        currentImageDataUrl = dataUrl;
        if (preview) {
          preview.innerHTML = `<img alt="Selected image preview" src="${dataUrl}" />`;
          preview.classList.remove('hidden');
        }
        setStatus('Image loaded. Click "Describe" to continue.');
      })
      .catch((err) => {
        console.error('[Makaton] fileToDataURL error', err);
        setStatus('Could not read the image.');
      });
  }

  // --- File input change ---
  if (fileInput) {
    fileInput.addEventListener('change', (e) => {
      console.log('[Makaton] file input change');
      handleFiles(e.target.files);
    });
  } else {
    console.warn('[Makaton] #file input not found in DOM.');
  }

  // --- Drag & drop support on preview area ---
  if (preview) {
    preview.addEventListener('dragover', (e) => { e.preventDefault(); preview.classList.add('drag'); });
    preview.addEventListener('dragleave', () => preview.classList.remove('drag'));
    preview.addEventListener('drop', (e) => {
      e.preventDefault();
      preview.classList.remove('drag');
      console.log('[Makaton] drop');
      handleFiles(e.dataTransfer?.files);
    });
  }

  // --- Describe click ---
  if (btnDescribe) {
    btnDescribe.addEventListener('click', async () => {
      console.log('[Makaton] Describe clicked');
      if (!currentImageDataUrl) { setStatus('Please upload an image first.'); return; }
      clearOutputs();
      setStatus('Checking on-device AI availability…');

      const avail = await checkAvailability().catch(() => ({ nanoTextPossible: false }));
      try {
        const apiKey = loadApiKey();
        if (apiKey) {
          setStatus('Using Gemini cloud for image description…');
          const desc = await describeImageWithGemini(currentImageDataUrl, apiKey, currentImageMime);
          setOutput(desc);
          return;
        }
        if (avail.nanoTextPossible) {
          setStatus('No API key found. Using on-device AI (text) for best guess…');
          const session = await createNanoTextSession();
          const desc = await session.prompt('Given an image is uploaded by the user (not directly visible to you), infer a likely one-sentence description of a common Makaton sign or symbol a teacher might upload. Keep it generic and safe.');
          setOutput(desc);
          return;
        }
        setStatus('No AI available. Please type a brief description.');
        if (typedBox) typedBox.classList.remove('hidden');
      } catch (err) {
        console.error('[Makaton] Describe error', err);
        setStatus('Description failed: ' + (err?.message || err));
        if (typedBox) typedBox.classList.remove('hidden');
      }
    });
  } else {
    console.warn('[Makaton] Describe button not found.');
  }

  // --- Manual typing flow ---
  if (btnType) {
    btnType.addEventListener('click', () => {
      if (typedBox) typedBox.classList.remove('hidden');
      if (typed) typed.focus();
    });
  }
  if (btnUseTyped) {
    btnUseTyped.addEventListener('click', () => {
      const text = (typed?.value || '').trim();
      if (!text) { setStatus('Type a description first.'); return; }
      setOutput(text);
    });
  }

  // --- Utilities ---
  if (btnSpeak) {
    btnSpeak.addEventListener('click', () => {
      const text = meaningEl?.textContent?.trim();
      if (!text) return;
      const u = new SpeechSynthesisUtterance(text);
      speechSynthesis.cancel();
      speechSynthesis.speak(u);
    });
  }
  if (btnCopy) {
    btnCopy.addEventListener('click', async () => {
      const text = meaningEl?.textContent?.trim();
      if (!text) return;
      try {
        await navigator.clipboard.writeText(text);
        setStatus('Copied meaning to clipboard.');
      } catch {
        setStatus('Copy failed.');
      }
    });
  }

  // --- Settings modal ---
  if (btnSettings && settings) btnSettings.addEventListener('click', () => settings.showModal());
  if (btnCloseSettings && settings) btnCloseSettings.addEventListener('click', () => settings.close());
  if (btnSaveKey) {
    btnSaveKey.addEventListener('click', (e) => {
      e.preventDefault();
      const k = apiKeyInput?.value?.trim() || "";
      saveApiKey(k);
      if (apiStatus) apiStatus.textContent = k ? "API key saved locally. Try Describe again." : "Cleared API key. You can still use on-device or typed mode.";
    });
  }

  // First status
  setStatus('Ready. Upload an image to begin.');
});

Let's break down the main sections of the app.js script for the Makaton AI Companion, as there’s a lot going on here:

Imports and Initial Setup:
- The script imports functions from mapping.js and ai.js to handle mapping descriptions to meanings and AI interactions.
- It sets up event listeners for when the DOM content is fully loaded, ensuring all elements are ready for interaction.
Element Selection:
- It uses a helper function $ to select DOM elements by their CSS selectors. This includes file inputs, buttons, and display areas for image previews and outputs.
Sanity Logs:
- It logs the presence of key elements to the console for debugging purposes, ensuring that all necessary elements are found in the DOM.
API Key Initialization:
- It loads any saved API key from local storage and sets it in the input field for user convenience.
Helper Functions:
- setStatus: Updates the status message displayed to the user.
- clearOutputs: Clears the output and meaning display areas and disables buttons for speaking and copying.
- setOutput: Displays the AI-generated description and maps it to a Makaton meaning, enabling buttons if a valid meaning is found.
- fileToDataURL: Converts an uploaded file to a data URL for image preview and processing.
- handleFiles: Handles file selection, updating the preview and setting the current image data URL.
File Input Change Handling:
- It listens for changes in the file input, processes the selected file, and updates the preview area.
Drag & Drop Support:
- It adds drag-and-drop functionality to the preview area, allowing users to drag files directly onto the app for processing.
Describe Button Click:
- It handles the "Describe" button click event, checking for an uploaded image and attempting to describe it using either the Gemini API or on-device AI.
- If no AI is available, it prompts the user to type a description manually.
Manual Typing Flow:
- It allows users to manually type a description if AI processing is unavailable or fails, updating the output with the typed text.
Utilities:
- btnSpeak: Uses the browser's SpeechSynthesis API to read aloud the mapped meaning.
- btnCopy: Copies the mapped meaning to the clipboard for easy sharing.
Settings Modal:
- It manages the settings modal for entering and saving the API key, providing feedback on the key's status.
Initial Status:
- It sets the initial status message to guide the user to upload an image to begin the process.

This script effectively ties together the user interface, file handling, AI processing, and output display, providing a seamless experience for translating Makaton signs into English meanings.
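The order in which the Describe handler picks a source is easy to miss inside the event listener, so here is that decision pulled out as a small pure function. This is a sketch only: the name chooseDescriptionSource and the returned labels are hypothetical, while the two inputs mirror loadApiKey() and checkAvailability() from the script above.

```javascript
// Sketch of the decision order used in the Describe click handler.
// `hasApiKey` mirrors loadApiKey(); `nanoTextPossible` mirrors checkAvailability().
function chooseDescriptionSource({ hasApiKey, nanoTextPossible }) {
  if (hasApiKey) return "cloud";        // the image is sent to the Gemini API
  if (nanoTextPossible) return "nano";  // text-only best guess, image is not seen
  return "typed";                       // the user types a description
}

console.log(chooseDescriptionSource({ hasApiKey: true,  nanoTextPossible: true }));  // "cloud"
console.log(chooseDescriptionSource({ hasApiKey: false, nanoTextPossible: true }));  // "nano"
console.log(chooseDescriptionSource({ hasApiKey: false, nanoTextPossible: false })); // "typed"
```

Note that a saved API key always wins, even when Nano is available, because only the cloud path can look at the uploaded image.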

How Vision and Language Work Together Here

While working on this project, I started appreciating how computer vision and language understanding complement each other in multimodal systems like this one.

The vision model (Gemini, through the cloud API) interprets what it sees, such as hand shapes, gestures, or layout, and turns that visual context into descriptive language.
The language mapping logic then scans those words for known keywords and returns the closest match in the dictionary (e.g., “help,” “eat,” “drink”).
It’s a collaboration between two forms of understanding (perceptual and semantic) that together allow the AI to bridge the gap between gesture and meaning.

This realization reshaped how I think about accessibility: the best assistive technologies often emerge not from smarter models alone, but from the interaction between modalities like seeing, describing, and reasoning in context.

6. Optional — Speak and Copy

To make the app more accessible, I added speech output and a quick copy button:

btnSpeak.addEventListener('click', () => {
  const text = meaningEl.textContent.trim();
  if (text) speechSynthesis.speak(new SpeechSynthesisUtterance(text));
});

btnCopy.addEventListener('click', async () => {
  const text = meaningEl.textContent.trim();
  if (text) await navigator.clipboard.writeText(text);
});

This gives users both visual and auditory feedback, especially helpful for learners or educators.
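If you want a little more control over the spoken output, the utterance object exposes properties such as lang and rate. The sketch below wraps that in a helper that also checks whether speech synthesis exists before using it. The function name speakMeaning, the injectable parameters, and the "en-GB" and 0.9 values are my assumptions for illustration, not settings from the original app.

```javascript
// Hypothetical helper: speak text if the environment supports speech synthesis.
// `synth` and `Utterance` default to the browser globals and can be swapped in tests.
function speakMeaning(
  text,
  synth = globalThis.speechSynthesis,
  Utterance = globalThis.SpeechSynthesisUtterance
) {
  const cleaned = (text || "").trim();
  if (!cleaned || !synth || typeof Utterance !== "function") return false;

  const u = new Utterance(cleaned);
  u.lang = "en-GB"; // assumption: British English voice; change as needed
  u.rate = 0.9;     // slightly slower than the default rate of 1
  synth.cancel();   // stop anything already being spoken
  synth.speak(u);
  return true;
}
```

In the click handler you could then call speakMeaning(meaningEl.textContent) and show a status message when it returns false.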

How to Fix the Common Issues

No AI or web integration project runs smoothly the first time – and that’s okay. Here’s a breakdown of the main issues I faced while building the Makaton AI Companion, how I diagnosed them, and how I fixed each one.

These lessons will help anyone trying to integrate Gemini APIs, on-device AI, or local web apps without a full backend.

1. The “CORS” Error When Running With `file://`

When I first opened my index.html directly from my file explorer, Chrome threw several CORS policy errors:

Access to script at 'file:///lib/ai.js' from origin 'null' has been blocked by CORS policy.

At first this looked confusing, but the reason is simple: modern browsers block JavaScript modules (import/export) when running from file:// paths for security reasons.

✅ Fix: I realized I needed to serve the files over HTTP, not from the file system. So I ran a quick local web server using Python:

python -m http.server 8080

Then opened:

http://localhost:8080/index.html

That single step fixed all the CORS errors and allowed my modules to load correctly.
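To make this failure easier to spot for other people, you could show a hint when the page is opened from disk. The sketch below is an assumption on my part rather than something in the original app: the helper name fileProtocolWarning is hypothetical, and the browser part would need to live in a classic (non-module) inline script placed after the #status element, since module scripts are exactly what gets blocked on file://.

```javascript
// Hypothetical helper: returns a warning when the page was opened from disk.
function fileProtocolWarning(protocol) {
  if (protocol === "file:") {
    return "This page was opened from file://. Start a local server " +
           "(python -m http.server 8080) and open http://localhost:8080 instead.";
  }
  return "";
}

// Browser-only part: skipped automatically where location/document do not exist.
if (typeof location !== "undefined" && typeof document !== "undefined") {
  const warning = fileProtocolWarning(location.protocol);
  const statusEl = document.querySelector("#status");
  if (warning && statusEl) statusEl.textContent = warning;
}
```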

2. “Model Not Found” (404) From the Gemini API

The next big challenge came from the Gemini API. Even though I had a valid API key, my console showed this error:

"models/gemini-1.5-flash" is not found for API version v1beta, or is not supported for generateContent.

It turns out that model availability varies by API version (v1 versus v1beta) and by what your key has access to, so a model name that works in one setup can return a 404 in another.

✅ Fix: I rewrote my lib/ai.js script to automatically try multiple Gemini model endpoints until it found one that worked. Something like this:

const GEMINI_IMAGE_ENDPOINTS = [
  "https://generativelanguage.googleapis.com/v1/models/gemini-1.5-flash:generateContent",
  "https://generativelanguage.googleapis.com/v1/models/gemini-1.5-pro:generateContent",
  "https://generativelanguage.googleapis.com/v1/models/gemini-1.5-flash-latest:generateContent",
];

And I wrapped it in a loop that stopped once one endpoint succeeded.

Later, I improved it further by listing available models dynamically using
https://generativelanguage.googleapis.com/v1/models?key=YOUR_KEY and automatically trying each one until a model accepted the image request.

That dynamic discovery approach stopped the 404 errors in my testing.
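One possible refinement, sketched here as an assumption rather than what lib/ai.js does: each entry in the ListModels response carries a supportedGenerationMethods array, so you can drop models that do not list generateContent before trying them. The function name pickGenerateContentModels and the sample response are made up for illustration.

```javascript
// Sketch: keep only models that advertise the generateContent method.
function pickGenerateContentModels(listModelsResponse) {
  return (listModelsResponse?.models || [])
    .filter(m => Array.isArray(m.supportedGenerationMethods)
              && m.supportedGenerationMethods.includes("generateContent"))
    .map(m => m.name)
    .filter(Boolean);
}

// Made-up response for illustration; real responses contain more fields.
const sampleResponse = {
  models: [
    { name: "models/example-vision-model", supportedGenerationMethods: ["generateContent", "countTokens"] },
    { name: "models/example-embedding-model", supportedGenerationMethods: ["embedContent"] },
  ],
};

console.log(pickGenerateContentModels(sampleResponse)); // ["models/example-vision-model"]
```

This only confirms that a model supports generateContent, not that it accepts image input, so the try-each-model loop is still needed afterwards.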

3. Packaging a Local Single-File Version

Once I got everything working, I wanted a version that others could test easily without installing Node.js or running build tools.

✅ Fix: I bundled the project into a simple zip file containing:

index.html
app.js
lib/ai.js
lib/mapping.js
styles.css

That way, anyone can just unzip and run:

python -m http.server 8080

and open localhost:8080.

Everything runs locally in the browser, no server-side code required. This also makes it perfect for demos, classrooms, and so on.

4. Debugging Script Import Errors in the Console

Another subtle issue appeared when I noticed this red message:

The requested module './lib/mapping.js' does not provide an export named 'mapDescriptionToMeaning'

That line told me exactly what was wrong: my import and export function names didn’t match. The fix was straightforward:

// app.js
import { mapDescriptionToMeaning } from './lib/mapping.js';

And then ensuring the mapping file exported it:

// mapping.js
export function mapDescriptionToMeaning(desc) { ... }

After that, all the pieces connected smoothly.

Using the browser console as my debugging dashboard turned out to be the most powerful tool of all. Every fix started by reading and reasoning about those red error lines.

Demo: The Makaton AI Companion in Action

Let’s see the Makaton AI Companion in action and understand what’s happening under the hood.

Step 1: Run the app locally

Once you’ve downloaded or cloned the project folder, open your terminal in that directory and start a local development server: python -m http.server 8080. Then open your browser and visit: http://localhost:8080/index.html

You should see the Makaton AI Companion interface.

Step 2: Get Your Gemini API Key

To enable cloud-based image description, you’ll need a Gemini API key from Google AI Studio.

Here’s how to generate one:

Visit: https://aistudio.google.com/welcome
Click “Create API key” and link it to your Google Cloud project (or create a new one).
Copy the key. It will look like this: AIzaSyA...XXXXXXXXXXXX
Open the Makaton AI Companion in your browser and click the Settings button (top left).
Paste your key in the input box and click Save.

You’ll see a confirmation message like this:

“API key saved locally. Try Describe again.”

This means your key is stored in your browser’s localStorage, where only pages served from the same origin (here, http://localhost:8080) can read it.
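Calls such as setItem can throw, for example when the storage quota is exceeded, so it can be worth wrapping key storage in a small guard. The sketch below is a hypothetical variation on saveApiKey and loadApiKey: the name createKeyStore and the clear method are my additions, while the storage key name is the one used in lib/ai.js.

```javascript
// Hypothetical wrapper around a Storage-like object (localStorage in the browser).
function createKeyStore(storage, name = "makaton_demo_gemini_key") {
  return {
    // Returns true when the key was written, false when storage rejected it.
    save(key) {
      try { storage.setItem(name, key || ""); return true; }
      catch { return false; }
    },
    // Returns the saved key, or an empty string when nothing is readable.
    load() {
      try { return storage.getItem(name) || ""; }
      catch { return ""; }
    },
    // Removes the key entirely instead of saving an empty string.
    clear() {
      try { storage.removeItem(name); return true; }
      catch { return false; }
    },
  };
}
```

In the browser you would create it once with createKeyStore(window.localStorage) and use the returned save, load, and clear methods in the settings dialog.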

Step 3: Enable Gemini Nano for On-Device AI

If you’re using Chrome Canary, you can run Gemini Nano locally without internet access. This allows the Makaton AI Companion to generate text even when the API key isn’t set.

Download and Install Chrome Canary:

Visit the official Chrome Canary download page and install it on your Windows or macOS system. Chrome Canary is a special version of Chrome designed for developers and early adopters, offering the latest features and updates.

Enable Gemini Nano:

Open Chrome Canary and type chrome://flags/#prompt-api-for-gemini-nano in the address bar.

Locate the "Prompt API for Gemini Nano" flag in the list. Set this flag to Enabled. This action allows Chrome Canary to support the Gemini Nano model for on-device AI processing.

After enabling the flag, relaunch Chrome Canary to apply the changes.

Download the Gemini Nano Model:

Open a new tab in Chrome Canary and enter chrome://components in the address bar.

Scroll down to find the “Optimization Guide On Device Model” component. Click on Check for update. This action will initiate the download of the Gemini Nano model, which is necessary for running AI tasks locally without an internet connection.

Verify Installation:

Once the Gemini Nano model is installed, the Makaton AI Companion app will automatically detect it. You should see a message indicating that the app is using on-device AI: “No API key found. Using on-device AI (text) for best guess…”

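This confirmation means that the app can now generate text using the Gemini Nano model without needing an API key or internet access. Keep in mind that this text-only path never sees the uploaded image, so the result is a generic best guess rather than a description of your picture.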

By following these detailed steps, you ensure that the Gemini Nano model is correctly set up and ready to use for on-device AI processing in the Makaton AI Companion.

Step 4: Upload a Makaton sign or symbol

Click Choose File to upload any Makaton image (for example, the “help” sign), then press Describe (Cloud or Nano). You’ll immediately see console logs confirming that the app is running correctly and connecting to the Gemini API.

Step 5: AI Description and Mapping

Here’s what happens next:

The image is read and encoded as Base64.
Gemini generates a short description: the cloud API works from the image itself, while on-device Nano produces a text-only best guess.
The description is passed to the mapDescriptionToMeaning() function.
If keywords match an entry in the MAKATON_GLOSSES dictionary, the app displays the corresponding English meaning.
Finally, users can click Speak or Copy to hear or reuse the translation.
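The first two steps above can be sketched as a pair of small functions: one splits the data URL produced by FileReader into its MIME type and Base64 payload, and the other builds the request body in the same shape lib/ai.js uses. The names parseDataUrl and buildRequestBody are hypothetical and not part of the original code.

```javascript
// Hypothetical helper: split a data URL into its MIME type and Base64 payload.
function parseDataUrl(dataUrl) {
  if (typeof dataUrl !== "string" || !dataUrl.startsWith("data:")) return null;
  const comma = dataUrl.indexOf(",");
  if (comma === -1) return null;
  const header = dataUrl.slice(5, comma);      // e.g. "image/png;base64"
  const parts = header.split(";");
  if (!parts.includes("base64")) return null;  // only Base64 payloads are handled here
  return { mimeType: parts[0] || "image/png", base64: dataUrl.slice(comma + 1) };
}

// Build the same request body shape used in lib/ai.js.
function buildRequestBody(dataUrl, prompt) {
  const img = parseDataUrl(dataUrl);
  if (!img) throw new Error("Not a Base64 data URL");
  return {
    contents: [{
      parts: [
        { text: prompt },
        { inline_data: { mime_type: img.mimeType, data: img.base64 } },
      ],
    }],
  };
}

console.log(parseDataUrl("data:image/jpeg;base64,AAAA"));
```

Reading the MIME type from the data URL keeps it in sync with the payload, instead of passing it around separately as currentImageMime.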

Example outputs:

When no mapping is found:
The AI description is accurate but doesn’t yet match a known Makaton keyword.

After updating the mapping list:
Adding new keywords like "help", "assist", or "hand over hand" enables correct translation.

Why this matters

This demonstrates how accessible, AI-assisted tools can support communication for people who rely on Makaton. Even when a gesture isn’t recognized, the system provides a structured output and allows users or educators to expand the mapping list, making the tool more useful over time.

Broader Reflections

Building this project turned out to be much more than a coding exercise for me.
It was a meaningful experiment in combining accessibility, natural language processing, and computer vision. These three fields, when brought together, can create real social impact.

While working on it, I began to understand how computer vision and language understanding complement each other in practice. The vision model perceives the world by identifying shapes, gestures, and spatial patterns, while the language model interprets what those visuals mean in human terms.
In this project, the artificial intelligence system first sees the Makaton sign, then describes it, and finally maps it to an English word that carries intent and meaning.

This interaction between perception and semantics is what makes multimodal artificial intelligence so powerful. It is not only about recognizing an image or generating text; it is about building systems that connect understanding across different forms of information to make technology more inclusive and human centered.

This realization changed how I think about accessibility technology. True innovation happens not only through smarter models but through the harmony between seeing and understanding, between what an artificial intelligence system observes and how it communicates that observation to help people.

Accessibility Meets AI

Working on this project reminded me that accessibility isn’t just about compliance or assistive devices. It’s also about inclusion. A simple AI system that can describe a hand gesture or symbol in real time can empower teachers, parents, and students who communicate using Makaton or similar systems.

By mapping AI-generated descriptions to meaningful phrases, the app demonstrates how AI can support inclusive education, even at small scales. It bridges the communication gap between verbal and nonverbal learners, which is something that traditional translation systems often overlook.

Integrating NLP and Computer Vision

On the technical side, this project showed me how naturally computer vision and language understanding complement each other. The Gemini API’s multimodal models were able to analyze an image and produce coherent natural-language sentences, something that older APIs couldn’t do without chaining multiple tools.

By feeding that output into a lightweight NLP mapping function, I was able to simulate a very early-stage symbol-to-language translator, which is the core of my broader research interest in automatic Makaton-to-English translation.

Why Local AI (Gemini Nano) Matters

While the cloud models are powerful, experimenting with Gemini Nano revealed something exciting:
on-device AI can make accessibility tools faster, safer, and more private.

In classrooms or therapy sessions, you often can’t rely on stable internet connections or share sensitive student data. Running inference locally means learners’ gestures or symbol images never leave the device, a crucial step toward privacy-preserving accessibility AI.

And since Nano runs directly inside Chrome Canary, it shows how AI is becoming embedded at the browser level, lowering barriers for teachers and developers to build inclusive solutions without needing large infrastructure.

Looking Forward

This prototype is just a starting point. Future iterations could integrate gesture recognition directly from camera input, support multiple symbol sets, or even learn from user feedback to expand the dictionary automatically.

Most importantly, it reinforces a central belief in my research and teaching journey:

Accessibility innovation doesn’t require massive systems. It starts with curiosity, empathy, and a few lines of purposeful code.

Conclusion

Building the Makaton AI Companion has been one of the most rewarding projects in my AI journey – not just because it worked, but because it proved how accessible innovation can be.

With just a browser, a few lines of JavaScript, and the right API, I was able to combine computer vision, language understanding, and accessibility design into a working system that translates symbols into meaning. It’s a small step toward a future where anyone, regardless of speech or language ability, can be understood through technology.

The project also reinforced something deeply personal to me as a researcher and educator: that AI for accessibility doesn’t need to be complex, expensive, or centralized. It can be lightweight, open, and built with empathy by anyone who’s willing to learn and experiment.

Join the Conversation

If this project inspires you, I’d love to see your own experiments and improvements. Can you make it support live webcam gestures? Could you adapt it for other symbol systems, like PECS or BSL?

Share your ideas in the comments or tag me if you publish your own version. Together, we can grow a small prototype into a community-driven accessibility tool and continue exploring how AI can give more people a voice.

Full source code on GitHub: Makaton-ai-companion

How to Use Transformers for Real-Time Gesture Recognition

OMOTAYO OMOYEMI — Mon, 06 Oct 2025 13:39:30 +0000

Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are not static images. Rather, they unfold over time. To build more robust, real-time systems, we need models that can capture both spatial details and temporal context.

This is where Transformers come in. Originally built for language, they’ve become state-of-the-art in vision tasks thanks to models like the Vision Transformer (ViT) and video-focused variants such as TimeSformer.

In this tutorial, we’ll use a Transformer backbone to create a lightweight real-time gesture recognition tool, optimized for small datasets and deployable on a regular laptop webcam.

Why Transformers for Gestures?
What You’ll Learn
Prerequisites
Project Setup
Generate a Gesture Dataset
Option 1: Generate a Synthetic Dataset
Training Script: train.py
Export the Model to ONNX
Evaluate Accuracy + Latency
Option 2: Use Small Samples from Public Gesture Datasets
Accessibility Notes & Ethical Limits
Next Steps
Conclusion

Why Transformers for Gestures?

Transformers are powerful because they use self-attention to model relationships across a sequence. For gestures, this means the model doesn’t just see isolated frames, but also learns how movements evolve over time. A wave, for example, looks different from a raised hand only when viewed as a sequence.

Vision Transformers process images as patches, while video Transformers extend this to multiple frames with temporal attention. Even a simple approach, like applying ViT to each frame and pooling across time, can be competitive with traditional CNN-based methods on small datasets.
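To make "pooling across time" concrete, here's a minimal numpy sketch. Random numbers stand in for the per-frame ViT embeddings, and the feature size of 192 is only illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-frame ViT embeddings: 1 clip, 16 frames, 192-dim features.
# (192 is an illustrative size; a real encoder defines its own.)
frame_feats = rng.normal(size=(1, 16, 192))

def temporal_mean_pool(feats):
    """Average frame embeddings over the time axis: (B, T, D) -> (B, D)."""
    return feats.mean(axis=1)

clip_embedding = temporal_mean_pool(frame_feats)
reversed_embedding = temporal_mean_pool(frame_feats[:, ::-1, :])

print(clip_embedding.shape)
print(np.allclose(clip_embedding, reversed_embedding))
```

The second print shows a limitation worth keeping in mind: a plain mean ignores frame order, so a clip and its time-reversed copy produce the same embedding. Distinguishing motions that differ only in direction generally needs temporal attention or some other order-aware component.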

Combined with Hugging Face’s pre-trained models and ONNX Runtime for optimization, Transformers make it possible to train on a modest dataset and still achieve smooth real-time recognition.

What You’ll Learn

In this tutorial, you’ll build a gesture recognition system using Transformers. By the end, you’ll know how to:

Create (or record) a tiny gesture dataset
Train a Vision Transformer (ViT) with temporal pooling
Export the model to ONNX for faster inference
Build a real-time Gradio app that classifies gestures from your webcam
Evaluate your model’s accuracy and latency with simple scripts
Understand the accessibility potential and ethical limits of gesture recognition

Prerequisites

To follow along, you should have:

Basic Python knowledge (functions, scripts, virtual environments)
Familiarity with PyTorch (tensors, datasets, training loops) – helpful but not required
Python 3.8+ installed on your system
A webcam (for the live demo in Gradio)
Optionally: GPU access (training on CPU works, but is slower)

Project Setup

Create a new project folder and install the required libraries.

# Create a new project directory and navigate into it
mkdir transformer-gesture && cd transformer-gesture

# Set up a Python virtual environment
python -m venv .venv

# Activate the virtual environment
# Windows PowerShell
.venv\Scripts\Activate.ps1

# macOS/Linux
source .venv/bin/activate

These commands set up a new Python project with a virtual environment. Here's a breakdown of each part:

mkdir transformer-gesture && cd transformer-gesture: This command creates a new directory named "transformer-gesture" and then navigates into it.
python -m venv .venv: This command creates a new virtual environment in the current directory. The virtual environment is stored in a folder named ".venv".
Activating the virtual environment:
- For Windows PowerShell, you can use .venv\Scripts\Activate.ps1 to activate the virtual environment.
- For macOS/Linux, use source .venv/bin/activate to activate the virtual environment.

Activating a virtual environment ensures that the Python interpreter and any packages you install are isolated to this specific project, preventing conflicts with other projects or system-wide packages.

Create a requirements.txt file:

torch>=2.0
torchvision
torchaudio
timm
huggingface_hub

onnx
onnxruntime

gradio

numpy
opencv-python
pillow

matplotlib
seaborn
scikit-learn

Here's a brief explanation of what each package in requirements.txt is for:

torch>=2.0: PyTorch is a popular open-source deep learning framework that provides a flexible and efficient platform for building and training neural networks. Version 2.0 and above includes improvements in performance and new features.
torchvision: This library is part of the PyTorch ecosystem and provides tools for computer vision tasks, including datasets, model architectures, and image transformations.
torchaudio: Also part of the PyTorch ecosystem, Torchaudio provides audio processing tools and datasets, making it easier to work with audio data in deep learning projects.
timm: The PyTorch Image Models (timm) library offers a collection of pre-trained models and utilities for computer vision tasks, facilitating quick experimentation and deployment.
huggingface_hub: This library allows easy access to models and datasets hosted on the Hugging Face Hub, a platform for sharing and collaborating on machine learning models and datasets.
onnx: The Open Neural Network Exchange (ONNX) format is used to represent machine learning models, enabling interoperability between different frameworks.
onnxruntime: This is a high-performance runtime for executing ONNX models, allowing for efficient deployment across various platforms.
gradio: Gradio is a library for creating user interfaces for machine learning models, making them accessible through a web interface for easy interaction and testing.
numpy: A fundamental package for numerical computing in Python, providing support for arrays and a wide range of mathematical functions.
opencv-python: OpenCV is a library for computer vision and image processing tasks, widely used for real-time applications.
pillow: A Python Imaging Library (PIL) fork, Pillow provides tools for opening, manipulating, and saving many different image file formats.
matplotlib: A plotting library for Python, Matplotlib is used for creating static, interactive, and animated visualizations in Python.
seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.
scikit-learn: A machine learning library in Python that provides simple and efficient tools for data analysis and modeling, including classification, regression, clustering, and dimensionality reduction.

Install dependencies:

pip install -r requirements.txt

The command pip install -r requirements.txt is used to install all the Python packages listed in a file named requirements.txt. This file typically contains a list of package dependencies required for a Python project, each specified with a package name and optionally a version number.

By running this command, pip, which is the Python package installer, reads the file and installs each package listed, ensuring that the project has all the necessary dependencies to run properly. This is a common practice in Python projects to manage and share dependencies easily.

Generate a Gesture Dataset

To train our Transformer-based gesture recognizer, we need some data. Instead of downloading a huge dataset, we’ll start with a tiny synthetic dataset you can generate in seconds. This makes the tutorial lightweight and ensures that everyone can follow along without dealing with multi-gigabyte downloads.

Option 1: Generate a Synthetic Dataset

We’ll use a small Python script that creates short .mp4 clips of a moving (or still) coloured box. Each class represents a gesture:

swipe_left – box moves from right to left
swipe_right – box moves from left to right
stop – box stays still in the center

Save this script as generate_synthetic_gestures.py in your project root:

import os, cv2, numpy as np, random, argparse

def ensure_dir(p): os.makedirs(p, exist_ok=True)

def make_clip(mode, out_path, seconds=1.5, fps=16, size=224, box_size=60, seed=0, codec="mp4v"):
    rng = random.Random(seed)
    frames = int(seconds * fps)
    H = W = size

    # background + box color
    bg_val = rng.randint(160, 220)
    bg = np.full((H, W, 3), bg_val, dtype=np.uint8)
    color = (rng.randint(20, 80), rng.randint(20, 80), rng.randint(20, 80))

    # path of motion
    y = rng.randint(40, H - 40 - box_size)
    if mode == "swipe_left":
        x_start, x_end = W - 20 - box_size, 20
    elif mode == "swipe_right":
        x_start, x_end = 20, W - 20 - box_size
    elif mode == "stop":
        x_start = x_end = (W - box_size) // 2
    else:
        raise ValueError(f"Unknown mode: {mode}")

    fourcc = cv2.VideoWriter_fourcc(*codec)
    vw = cv2.VideoWriter(out_path, fourcc, fps, (W, H))
    if not vw.isOpened():
        raise RuntimeError(
            f"Could not open VideoWriter with codec '{codec}'. "
            "Try --codec XVID and use .avi extension, e.g. out.avi"
        )

    for t in range(frames):
        alpha = t / max(1, frames - 1)
        x = int((1 - alpha) * x_start + alpha * x_end)
        # small jitter to avoid being too synthetic
        jitter_x, jitter_y = rng.randint(-2, 2), rng.randint(-2, 2)
        frame = bg.copy()
        cv2.rectangle(frame, (x + jitter_x, y + jitter_y),
                      (x + jitter_x + box_size, y + jitter_y + box_size),
                      color, thickness=-1)
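        # overlay text (note: this burns the class name into every frame, so a
        # model can learn to read the label instead of the motion; remove the
        # two putText calls below for a more honest test)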
        cv2.putText(frame, mode, (8, 24), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 0), 2, cv2.LINE_AA)
        cv2.putText(frame, mode, (8, 24), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 1, cv2.LINE_AA)
        vw.write(frame)

    vw.release()

def write_labels(labels, out_dir):
    with open(os.path.join(out_dir, "labels.txt"), "w", encoding="utf-8") as f:
        for c in labels:
            f.write(c + "\n")

def main():
    ap = argparse.ArgumentParser(description="Generate a tiny synthetic gesture dataset.")
    ap.add_argument("--out", default="data", help="Output directory (default: data)")
    ap.add_argument("--classes", nargs="+",
                    default=["swipe_left", "swipe_right", "stop"],
                    help="Class names (default: swipe_left swipe_right stop)")
    ap.add_argument("--clips", type=int, default=16, help="Clips per class (default: 16)")
    ap.add_argument("--seconds", type=float, default=1.5, help="Seconds per clip (default: 1.5)")
    ap.add_argument("--fps", type=int, default=16, help="Frames per second (default: 16)")
    ap.add_argument("--size", type=int, default=224, help="Frame size WxH (default: 224)")
    ap.add_argument("--box", type=int, default=60, help="Box size (default: 60)")
    ap.add_argument("--codec", default="mp4v", help="Codec fourcc (mp4v or XVID)")
    ap.add_argument("--ext", default=".mp4", help="File extension (.mp4 or .avi)")
    args = ap.parse_args()

    ensure_dir(args.out)
    write_labels(args.classes, ".")  # writes labels.txt to project root

    print(f"Generating synthetic dataset -> {args.out}")
    for cls in args.classes:
        cls_dir = os.path.join(args.out, cls)
        ensure_dir(cls_dir)
        mode = "stop" if cls == "stop" else ("swipe_left" if "left" in cls else ("swipe_right" if "right" in cls else "stop"))
        for i in range(args.clips):
            filename = os.path.join(cls_dir, f"{cls}_{i+1:03d}{args.ext}")
            make_clip(
                mode=mode,
                out_path=filename,
                seconds=args.seconds,
                fps=args.fps,
                size=args.size,
                box_size=args.box,
                seed=i + 1,
                codec=args.codec
            )
        print(f"  {cls}: {args.clips} clips")

    print("Done. You can now run: python train.py, python export_onnx.py, python app.py")

if __name__ == "__main__":
    main()

The script generates a synthetic gesture dataset by creating video clips of a moving or stationary coloured box, simulating gestures like "swipe left," "swipe right," and "stop," and saves them in a specified output directory.

Now run it inside your virtual environment:

python generate_synthetic_gestures.py --out data --clips 16 --seconds 1.5

The command above runs a Python script named generate_synthetic_gestures.py, which generates a synthetic gesture dataset with 16 clips per gesture, each lasting 1.5 seconds, and saves the output in a directory named "data".

This creates a dataset like:

data/
  swipe_left/*.mp4
  swipe_right/*.mp4
  stop/*.mp4
labels.txt

Each folder contains short clips of a moving (or still) box that simulate gestures. This is perfect for testing the pipeline.
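The training script in the next section imports GestureClips and read_labels from a dataset module that isn't listed in this article. As a starting point, here's a minimal sketch of the label-reading and clip-listing helpers such a module might contain. The only thing the later scripts tell us is that read_labels returns a pair; the second return value, list_clips, split_clips, and the 80/20 split are my assumptions. A full GestureClips class would also need to decode frames into (T, C, H, W) tensors, which is not shown here:

```python
import os
import random

def read_labels(path="labels.txt"):
    """Read one class name per line; return (labels, label_to_index)."""
    with open(path, encoding="utf-8") as f:
        labels = [line.strip() for line in f if line.strip()]
    return labels, {name: i for i, name in enumerate(labels)}

def list_clips(root, labels, exts=(".mp4", ".avi")):
    """Return sorted (path, class_index) pairs from root/<class_name>/ folders."""
    items = []
    for idx, name in enumerate(labels):
        cls_dir = os.path.join(root, name)
        if not os.path.isdir(cls_dir):
            continue
        for fn in sorted(os.listdir(cls_dir)):
            if fn.lower().endswith(exts):
                items.append((os.path.join(cls_dir, fn), idx))
    return items

def split_clips(items, val_fraction=0.2, seed=0):
    """Shuffle deterministically and split into (train, val) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = max(1, round(len(items) * val_fraction)) if items else 0
    return items[n_val:], items[:n_val]
```

A fixed seed keeps the train/validation split identical between runs, so accuracy numbers from different experiments stay comparable.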

Training Script: `train.py`

Now that we have our dataset, let’s fine-tune a Vision Transformer with temporal pooling. This model applies ViT frame-by-frame, averages embeddings across time, and trains a classification head on your gestures.

Here’s the full training script:

# train.py
import torch, torch.nn as nn, torch.optim as optim
from torch.utils.data import DataLoader
import timm
from dataset import GestureClips, read_labels

class ViTTemporal(nn.Module):
    """Frame-wise ViT encoder -> mean pool over time -> linear head."""
    def __init__(self, num_classes, vit_name="vit_tiny_patch16_224"):
        super().__init__()
        self.vit = timm.create_model(vit_name, pretrained=True, num_classes=0, global_pool="avg")
        feat_dim = self.vit.num_features
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):  # x: (B,T,C,H,W)
        B, T, C, H, W = x.shape
        x = x.view(B * T, C, H, W)
        feats = self.vit(x)                  # (B*T, D)
        feats = feats.view(B, T, -1).mean(dim=1)  # (B, D)
        return self.head(feats)

def train():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    labels, _ = read_labels("labels.txt")
    n_classes = len(labels)

    train_ds = GestureClips(train=True)
    val_ds   = GestureClips(train=False)
    print(f"Train clips: {len(train_ds)} | Val clips: {len(val_ds)}")

    # Windows/CPU friendly
    train_dl = DataLoader(train_ds, batch_size=2, shuffle=True,  num_workers=0, pin_memory=False)
    val_dl   = DataLoader(val_ds,   batch_size=2, shuffle=False, num_workers=0, pin_memory=False)

    model = ViTTemporal(num_classes=n_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)

    best_acc = 0.0
    epochs = 5
    for epoch in range(1, epochs + 1):
        # ---- Train ----
        model.train()
        total, correct, loss_sum = 0, 0, 0.0
        for x, y in train_dl:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()

            loss_sum += loss.item() * x.size(0)
            correct += (logits.argmax(1) == y).sum().item()
            total += x.size(0)

        train_acc = correct / total if total else 0.0
        train_loss = loss_sum / total if total else 0.0

        # ---- Validate ----
        model.eval()
        vtotal, vcorrect = 0, 0
        with torch.no_grad():
            for x, y in val_dl:
                x, y = x.to(device), y.to(device)
                vcorrect += (model(x).argmax(1) == y).sum().item()
                vtotal += x.size(0)
        val_acc = vcorrect / vtotal if vtotal else 0.0

        print(f"Epoch {epoch:02d} | train_loss {train_loss:.4f} "
              f"| train_acc {train_acc:.3f} | val_acc {val_acc:.3f}")

        if val_acc > best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), "vit_temporal_best.pt")

    print("Best val acc:", best_acc)

if __name__ == "__main__":
    train()

Running the command python train.py initiates the training process for your gesture recognition model. Here's a breakdown of what happens:

Load your dataset from data/: The script will access and load the gesture dataset stored in the "data" directory. This dataset is used to train the model.
Fine-tune a pre-trained Vision Transformer: The training script will take a Vision Transformer model that has been pre-trained on a larger dataset and fine-tune it using your specific gesture dataset. Fine-tuning helps the model adapt to the nuances of your data, improving its performance on the specific task of gesture recognition.
Save the best checkpoint as vit_temporal_best.pt: After each epoch, the script evaluates the model on the validation set. Whenever validation accuracy improves, the model weights are saved to "vit_temporal_best.pt". This file can later be used for inference or further training.

What Training Looks Like

You should see logs similar to this:

Train clips: 38 | Val clips: 10
Epoch 01 | train_loss 1.4508 | train_acc 0.395 | val_acc 0.200
Epoch 02 | train_loss 1.2466 | train_acc 0.263 | val_acc 0.200
Epoch 03 | train_loss 1.1361 | train_acc 0.368 | val_acc 0.200
Best val acc: 0.200

Don’t worry if your accuracy is low at first. With a tiny synthetic dataset and only a few epochs, that’s normal. The key is proving that the Transformer pipeline works. You can boost results later by:

Adding more clips per class
Training for more epochs
Switching to real recorded gestures

Figure 1. Example training logs from train.py, where the Vision Transformer with temporal pooling is fine-tuned on a tiny synthetic dataset.

Export the Model to ONNX

To make our model easier to run in real time (and lighter on CPU), we’ll export it to the ONNX format.

Note: ONNX, which stands for Open Neural Network Exchange, is an open-source format designed to facilitate the interchange of deep learning models between different frameworks and runtimes. It lets you train a model in one framework, such as PyTorch or TensorFlow, and then run it with a different runtime, such as ONNX Runtime, without needing to rewrite the model. This interoperability is achieved by providing a standardized representation of the model's architecture and parameters.

ONNX supports a wide range of operators and is continually updated to include new features, making it a versatile choice for deploying machine learning models across various platforms and devices.

Create a file called export_onnx.py:

import torch
from train import ViTTemporal
from dataset import read_labels

labels, _ = read_labels("labels.txt")
n_classes = len(labels)

# Load trained model
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load("vit_temporal_best.pt", map_location="cpu"))
model.eval()

# Dummy input: batch=1, 16 frames, 3x224x224
dummy = torch.randn(1, 16, 3, 224, 224)

# Export
torch.onnx.export(
    model, dummy, "vit_temporal.onnx",
    input_names=["video"], output_names=["logits"],
    dynamic_axes={"video": {0: "batch"}},
    opset_version=13
)

print("Exported vit_temporal.onnx")

Run it with python export_onnx.py.

This generates a file vit_temporal.onnx in your project folder. The ONNX file can be run with onnxruntime, which is often faster than eager PyTorch for CPU inference.

Next, build the demo app that loads this model. Create a file called app.py:

import os, tempfile, cv2, torch, onnxruntime, numpy as np
import gradio as gr
from dataset import read_labels

T = 16
SIZE = 224
MODEL_PATH = "vit_temporal.onnx"

labels, _ = read_labels("labels.txt")

# --- ONNX session + auto-detect names ---
ort_session = onnxruntime.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
# detect first input and first output names to avoid mismatches
INPUT_NAME = ort_session.get_inputs()[0].name   # e.g. "input" or "video"
OUTPUT_NAME = ort_session.get_outputs()[0].name # e.g. "logits" or something else

def preprocess_clip(frames_rgb):
    if len(frames_rgb) == 0:
        frames_rgb = [np.zeros((SIZE, SIZE, 3), dtype=np.uint8)]
    if len(frames_rgb) < T:
        frames_rgb = frames_rgb + [frames_rgb[-1]] * (T - len(frames_rgb))
    frames_rgb = frames_rgb[:T]
    clip = [cv2.resize(f, (SIZE, SIZE), interpolation=cv2.INTER_AREA) for f in frames_rgb]
    clip = np.stack(clip, axis=0)                                    # (T,H,W,3)
    clip = np.transpose(clip, (0, 3, 1, 2)).astype(np.float32) / 255 # (T,3,H,W)
    # scale to [-1, 1]; this should match the preprocessing used in training
    clip = (clip - 0.5) / 0.5
    clip = np.expand_dims(clip, 0)                                   # (1,T,3,H,W)
    return clip

def _extract_path_from_gradio_video(inp):
    if isinstance(inp, str) and os.path.exists(inp):
        return inp
    if isinstance(inp, dict):
        for key in ("video", "name", "path", "filepath"):
            v = inp.get(key)
            if isinstance(v, str) and os.path.exists(v):
                return v
        for key in ("data", "video"):
            v = inp.get(key)
            if isinstance(v, (bytes, bytearray)):
                tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4")
                tmp.write(v); tmp.flush(); tmp.close()
                return tmp.name
    if isinstance(inp, (list, tuple)) and inp and isinstance(inp[0], str) and os.path.exists(inp[0]):
        return inp[0]
    return None

def _read_uniform_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) or 1
    idxs = np.linspace(0, total - 1, max(T, 1)).astype(int)
    want = set(int(i) for i in idxs.tolist())
    j = 0
    while True:
        ok, bgr = cap.read()
        if not ok: break
        if j in want:
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            frames.append(rgb)
        j += 1
    cap.release()
    return frames

def predict_from_video(gradio_video):
    video_path = _extract_path_from_gradio_video(gradio_video)
    if not video_path or not os.path.exists(video_path):
        return {}
    frames = _read_uniform_frames(video_path)

    # If OpenCV choked on the codec (common with recorded webm), re-encode once:
    if len(frames) == 0:
        tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4"); tmp_name = tmp.name; tmp.close()
        cap = cv2.VideoCapture(video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) or 640
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) or 480
        out = cv2.VideoWriter(tmp_name, fourcc, 20.0, (w, h))
        while True:
            ok, frame = cap.read()
            if not ok: break
            out.write(frame)
        cap.release(); out.release()
        frames = _read_uniform_frames(tmp_name)

    clip = preprocess_clip(frames)
    # >>> use the detected ONNX input/output names <<<
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[0]  # (1, C)
    probs = torch.softmax(torch.from_numpy(logits), dim=1)[0].numpy().tolist()
    return {labels[i]: float(probs[i]) for i in range(len(labels))}

def predict_from_image(image):
    if image is None:
        return {}
    clip = preprocess_clip([image] * T)
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[0]
    probs = torch.softmax(torch.from_numpy(logits), dim=1)[0].numpy().tolist()
    return {labels[i]: float(probs[i]) for i in range(len(labels))}

with gr.Blocks() as demo:
    gr.Markdown("# Gesture Classifier (ONNX)\nRecord or upload a short video, then click **Classify Video**.")
    with gr.Tab("Video (record or upload)"):
        vid_in = gr.Video(label="Record from webcam or upload a short clip")
        vid_out = gr.Label(num_top_classes=3, label="Prediction")
        gr.Button("Classify Video").click(fn=predict_from_video, inputs=vid_in, outputs=vid_out)
    with gr.Tab("Single Image (fallback)"):
        img_in = gr.Image(label="Upload an image frame", type="numpy")
        img_out = gr.Label(num_top_classes=3, label="Prediction")
        gr.Button("Classify Image").click(fn=predict_from_image, inputs=img_in, outputs=img_out)

if __name__ == "__main__":
    demo.launch()

Running the command python app.py starts a local Gradio server and prints a URL that you can open in your web browser. Here's what the app does:

Records or accepts a short clip: The Video tab lets you record from your webcam or upload a file. The app doesn't stream the webcam continuously. It classifies one clip each time you click Classify Video.
Samples 16 frames per clip: The clip is sampled uniformly down to 16 frames, each frame is resized to 224×224, and the result is passed to the ONNX model.
Top 3 gesture classes displayed with probabilities: The application displays the top three predicted gesture classes along with their probabilities, giving you an idea of the model's confidence in its predictions.

When you open the app in your browser, you'll find two tabs. In the Video tab, you can click Record from webcam to capture a short clip of your gesture, typically lasting 2–4 seconds. After recording, click Classify Video. The model will then process the captured frames using the Transformer model and display the predicted gesture probabilities. This setup allows for interactive testing and demonstration of the gesture recognition system.

Here’s an example where I raised my hand for a stop gesture, and the model predicts “stop” as the top class:

Figure 2. The Gradio app running locally. After recording a short clip, the Transformer model predicts the gesture with class probabilities.
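One small observation about app.py: it imports torch only to turn the ONNX logits into probabilities. If you'd like the demo to run without PyTorch installed, a numerically stable softmax in numpy does the same job. This is a sketch with made-up logits, not output from the trained model:

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Hypothetical logits for one clip and three classes, shaped (1, C)
# like the ONNX model output.
logits = np.array([[2.0, 0.5, -1.0]], dtype=np.float32)
labels = ["swipe_left", "swipe_right", "stop"]

probs = softmax(logits, axis=1)[0]
prediction = {labels[i]: float(probs[i]) for i in range(len(labels))}
print(max(prediction, key=prediction.get))
```

The returned dictionary has the same label-to-probability shape that the gr.Label outputs in app.py receive.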

Evaluate Accuracy + Latency

Now that the model runs in a demo app, let’s check how well it performs. There are two sides to this:

Accuracy: does the model predict the right gesture class?
Latency: how fast does it respond, especially on CPU vs GPU?

1. Quick Accuracy Check

Save this as eval.py in the same folder as your other scripts:

import torch
from dataset import GestureClips, read_labels
from train import ViTTemporal

labels, _ = read_labels("labels.txt")
n_classes = len(labels)

# Load validation data
val_ds = GestureClips(train=False)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=2, shuffle=False)

# Load trained model
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load("vit_temporal_best.pt", map_location="cpu"))
model.eval()

correct, total = 0, 0
all_preds, all_labels = [], []

with torch.no_grad():
    for x, y in val_dl:
        logits = model(x)
        preds = logits.argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.size(0)
        all_preds.extend(preds.tolist())
        all_labels.extend(y.tolist())

print(f"Validation accuracy: {correct/total:.2%}")

2. Confusion Matrix

Let’s also visualize which gestures are confused. Add this snippet at the bottom of eval.py:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

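# pass labels explicitly so the matrix keeps one row/column per class,
# even if a class is missing from the validation set
cm = confusion_matrix(all_labels, all_preds, labels=list(range(n_classes)))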

plt.figure(figsize=(6,6))
sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.tight_layout()
plt.show()

When you run python eval.py, a heatmap like this will pop up:

Figure 3. Confusion matrix on the validation set. Correct predictions appear along the diagonal. Off-diagonal counts show gesture confusions.
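If you want numbers as well as a picture, per-class recall falls straight out of the confusion matrix: divide each diagonal entry by its row total. The matrix below is hypothetical and only there to show the arithmetic:

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
labels = ["swipe_left", "swipe_right", "stop"]
cm = np.array([
    [3, 1, 0],
    [1, 2, 0],
    [0, 0, 3],
])

row_totals = cm.sum(axis=1)
# Recall per class = correct predictions / all clips of that class.
# np.divide with `where` avoids dividing by zero for classes with no clips.
recall = np.divide(np.diag(cm), row_totals,
                   out=np.zeros(len(labels)), where=row_totals > 0)

for name, r in zip(labels, recall):
    print(f"{name}: {r:.2f}")
```

Low recall for one class tells you which gesture needs more or better training clips, which overall accuracy alone can hide.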

3. Latency Benchmark

Finally, let’s see how fast inference runs. Save the following as benchmark.py:

import time, numpy as np, onnxruntime
from dataset import read_labels

labels, _ = read_labels("labels.txt")

ort = onnxruntime.InferenceSession("vit_temporal.onnx", providers=["CPUExecutionProvider"])
INPUT_NAME = ort.get_inputs()[0].name
OUTPUT_NAME = ort.get_outputs()[0].name

dummy = np.random.randn(1, 16, 3, 224, 224).astype(np.float32)

# Warmup
for _ in range(3):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})

# Benchmark
t0 = time.perf_counter()  # monotonic, high-resolution timer
for _ in range(50):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})
t1 = time.perf_counter()

print(f"Average latency: {(t1 - t0)/50:.3f} seconds per clip")

Run: python benchmark.py

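On CPU, you might see ~0.05–0.15s per clip, depending on your hardware. The script above only uses CPUExecutionProvider. Running on a GPU requires the onnxruntime-gpu package and CUDAExecutionProvider, and is typically much faster.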
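An average can hide occasional slow runs, which matter for a tool that should feel responsive. Here's a small, dependency-free sketch that records each call separately and reports the median and 95th percentile as well. The time_calls and summarize helpers are my own names, and the lambda is a stand-in workload:

```python
import time
import statistics

def time_calls(fn, warmup=3, runs=50):
    """Call fn repeatedly and return per-call latencies in seconds."""
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies

def summarize(latencies):
    """Return mean, median and 95th-percentile latency."""
    ordered = sorted(latencies)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "p50": statistics.median(ordered),
        "p95": ordered[p95_index],
    }

# Stand-in workload; in benchmark.py this would be the ort.run(...) call.
stats = summarize(time_calls(lambda: sum(range(10_000))))
print({k: f"{v * 1000:.3f} ms" for k, v in stats.items()})
```

To use it with the model, pass a function that wraps the ort.run call from benchmark.py instead of the stand-in lambda.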

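Note: If latency is high, you can apply quantization (for example, ONNX Runtime's dynamic quantization) to shrink the model and potentially speed up CPU inference. Re-check accuracy afterwards, since quantization can change predictions.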
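As a rough sketch of what that could look like, ONNX Runtime ships a quantize_dynamic helper in its onnxruntime.quantization module. The wrapper function and the output file name below are my own choices, and I haven't measured the effect on this particular model, so treat it as something to try and then benchmark:

```python
import os

def quantize_if_present(src="vit_temporal.onnx", dst="vit_temporal.int8.onnx"):
    """Write an int8 dynamically quantized copy of src; return False if src is missing."""
    if not os.path.exists(src):
        return False
    # Imported lazily so the function can be defined without onnxruntime installed.
    from onnxruntime.quantization import quantize_dynamic, QuantType
    quantize_dynamic(src, dst, weight_type=QuantType.QInt8)
    return True

if quantize_if_present():
    print("Wrote vit_temporal.int8.onnx")
else:
    print("vit_temporal.onnx not found; run export_onnx.py first")
```

Quantization doesn't guarantee a speedup on every machine, so it's worth re-running benchmark.py against the new file and comparing both latency and predictions with the original model.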

Option 2: Use Small Samples from Public Gesture Datasets

If you’d prefer to see your model trained on real gesture clips instead of synthetic moving boxes, you can grab a handful of videos from open datasets. You don’t need to download the entire dataset (which can be several GB). Just a few .mp4 samples are enough to follow along.

Recommended sources

20BN Jester Dataset: Contains short clips of hand gestures like swiping, clapping, and pointing.
WLASL: A large-scale dataset of isolated sign language words.

Both projects provide small .mp4 videos you can use as realistic training examples. I’ve linked them below.

Setting up your dataset folder

Once you download a few clips, place them in the data/ folder under subfolders named after each gesture class. For example:

data/
├── swipe_left/
│   ├── clip1.mp4
│   └── clip2.mp4
├── swipe_right/
│   ├── clip1.mp4
│   └── clip2.mp4
└── stop/
    ├── clip1.mp4
    └── clip2.mp4

And update labels.txt to match the folder names:

swipe_left
swipe_right
stop

Now your dataset is ready, and the same training scripts from earlier (train.py, eval.py) will work without modification.
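Before training, it can help to confirm that the folders and labels.txt actually agree. This is a minimal sketch, not part of the original scripts: the function name check_dataset and the report keys are my own, and it assumes the data/ layout and labels.txt format shown above.

```python
from pathlib import Path

def check_dataset(data_dir="data", labels_file="labels.txt"):
    """Compare class folders under data_dir with the names in labels_file."""
    labels = [
        line.strip()
        for line in Path(labels_file).read_text().splitlines()
        if line.strip()
    ]
    folders = sorted(p.name for p in Path(data_dir).iterdir() if p.is_dir())
    report = {
        # Labels with no matching folder
        "missing_folders": [name for name in labels if name not in folders],
        # Folders that labels.txt does not mention
        "unlisted_folders": [name for name in folders if name not in labels],
        "clip_counts": {},
    }
    for name in labels:
        class_dir = Path(data_dir) / name
        if class_dir.is_dir():
            report["clip_counts"][name] = len(list(class_dir.glob("*.mp4")))
    return report

if __name__ == "__main__":
    # Only runs the check when the expected files are present
    if Path("data").is_dir() and Path("labels.txt").is_file():
        print(check_dataset())
```

An empty missing_folders list and a non-zero clip count for every class are a good sign that train.py will find what it expects.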

Why choose this option?

Gives more realistic results than synthetic coloured boxes
Lets you see how the model handles actual human hand movements
Requires only a bit more effort (downloading clips and trimming them if needed)

Tip: If downloading from these datasets feels too heavy, you can also record your own short gestures using your laptop webcam. Just save them as .mp4 files and organize them in the same folder structure.

Accessibility Notes & Ethical Limits

While this project shows the technical workflow for gesture recognition with Transformers, it’s important to step back and consider the human context:

Accessibility first: Tools like this can help students with speech or motor difficulties, but they should always be co-designed with the people who will use them. Don’t assume one-size-fits-all.
Dataset sensitivity: Using publicly available sign or gesture datasets is fine for prototyping, but deploying such a system requires careful consideration of consent and representation.
Error tolerance: Even small misclassifications can have big consequences in accessibility contexts (for example, confusing stop with go). Always plan for fallback options (like manual input or confirmation).
Bias and inclusivity: Models trained on narrow datasets may fail for different skin tones, lighting conditions, or cultural gesture variations. Broad and diverse training data is essential for fairness.

In other words: this demo is a teaching scaffold, not a production-ready accessibility tool. Responsible deployment requires collaboration with educators, therapists, and end users.

Next Steps

If you’d like to push this project further, here are some directions to explore:

Better models: Try video-focused Transformers like TimeSformer or VideoMAE for stronger temporal reasoning.
Larger vocabularies: Add more gesture classes, build your own dataset, or use portions of public datasets like 20BN Jester or WLASL.
Pose fusion: Combine gesture video with human pose keypoints from MediaPipe or OpenPose for more robust predictions.
Real-time smoothing: Implement temporal smoothing or debounce logic in the app so predictions are more stable during live use.
Quantization + edge devices: Convert your ONNX model to an INT8 quantized version and deploy it on a Raspberry Pi or Jetson Nano for classroom-ready prototypes.
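To make the real-time smoothing idea concrete, here is a minimal sketch using a majority vote over the most recent predictions. The class name PredictionSmoother and the window and min_votes values are assumptions for illustration; you would tune them to your frame rate and call update() with each new prediction from the app.

```python
from collections import Counter, deque

class PredictionSmoother:
    """Majority vote over the last `window` predictions, with a minimum vote count."""

    def __init__(self, window=5, min_votes=3):
        self.history = deque(maxlen=window)
        self.min_votes = min_votes
        self.stable = None  # last label that won a clear majority

    def update(self, label):
        self.history.append(label)
        winner, votes = Counter(self.history).most_common(1)[0]
        # Only switch the displayed label once a class has enough votes
        if votes >= self.min_votes:
            self.stable = winner
        return self.stable

smoother = PredictionSmoother(window=5, min_votes=3)
raw = ["stop", "stop", "swipe_left", "stop", "stop", "swipe_left"]
print([smoother.update(label) for label in raw])
# -> [None, None, None, 'stop', 'stop', 'stop']
```

The single-frame swipe_left flickers never reach the output, at the cost of a short delay before a new gesture is shown.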

Conclusion

In this tutorial, you built a gesture recognition system using Transformer models. You prepared a small dataset, trained a Vision Transformer with temporal pooling, exported the model to ONNX for efficient inference, and deployed a real-time Gradio app. You then measured accuracy and latency to see how well and how quickly the system responds.

This project illustrates how you can leverage advanced ML methods to enhance accessibility and communication, paving the way for more inclusive learning environments.

Remember: while this demo works with small datasets, real-world applications need larger, more diverse data and careful consideration of accessibility, inclusivity, and ethics.

Here’s the GitHub repo for full source code: transformer-gesture.

How to Build a Multimodal Makaton-to-English Translator for Accessible Education

OMOTAYO OMOYEMI — Thu, 18 Sep 2025 01:20:45 +0000

A year nine student walks into class full of ideas, but when it is time to contribute, the tools around them do not listen. Their speech is difficult for standard voice systems to recognise, typing feels slow and exhausting, and the lesson moves on without their voice being heard. The challenge is not a lack of ability but a lack of access.

Across the world, millions of learners face communication barriers. Some live with apraxia of speech or dysarthria, others with limited mobility, hearing differences, or neurodiverse needs. When speaking, writing, or pointing is unreliable or tiring, participation becomes limited, feedback is lost, and confidence slowly erodes. This is not a rare exception but an everyday reality in classrooms.

These barriers appear in very practical ways. Students are skipped or misunderstood when they cannot respond quickly. Their ability is under-measured because their means of expression are constrained. Teachers struggle to maintain the pace of lessons while making individual accommodations. Peers interact less often, reducing opportunities for social belonging.

Assistive technologies have helped over the years, with tools like text-to-speech, symbol boards, and simple gesture inputs. Yet most of these tools are designed for a single mode of interaction. They assume the learner will either speak, or type, or tap. Real communication, however, is fluid. Learners naturally combine gestures, partial speech, symbols, and context to share meaning, especially when fatigue, anxiety, or motor challenges come into play.

This is where modern AI changes the picture. We are beginning to move beyond single-solution tools into multimodal systems that can understand speech, even when it is disordered, interpret gestures and visual symbols, combine signals to infer intent, and adapt in real time as the learner’s abilities develop or change.

AI is reshaping accessibility in education by shifting from isolated tools to multimodal and adaptive systems. These systems combine gesture, speech, and intelligent feedback to meet learners where they are, while also supporting their growth over time.

In this article, we will explore what this shift looks like in practice, how it can unlock participation, and how adaptive feedback personalises support. We will also build a hands-on multimodal demo that turns these ideas into a classroom-ready tool.

Prerequisites

An Operating System: Windows, macOS, or Linux
Python: Version 3.9 or later, along with pip for installing packages
Editor: Visual Studio Code or any integrated development environment (IDE)
Basics: Comfortable running commands in a terminal
Optional hardware: Microphone (speech input), Webcam (single-frame tab), speakers (TTS playback)
Internet: Required for the default SpeechRecognition (Google Web Speech API) and gTTS
No dataset/model needed: A stub gesture classifier is provided so the demo runs end-to-end

Prerequisites
What We’ve Achieved So Far
Case Study 1: Translating Makaton to English
Case Study 2: AURA Prototype (Adaptive Speech Assistant)
The Bigger Picture: Multimodal Accessibility Tools
How to Build a Multimodal Makaton to English Translator (Gesture + Speech)
Project Overview
Challenges and Ethical Considerations
Where We’re Heading Next
Conclusion: Building an Inclusive Future with AI

What We’ve Achieved So Far

The past few years have shown how AI can make classrooms more inclusive when we focus on accessibility. Developers, educators, and researchers are already experimenting with tools that bridge communication gaps.

In my first freeCodeCamp tutorial, I built a gesture-to-text translator using MediaPipe. This project demonstrated how computer vision can track hand movements and convert them into text in real time. For learners who rely on gestures, this kind of system can provide a bridge to participation.

Here is a simplified example of how MediaPipe detects hand landmarks:

import mediapipe as mp
import cv2

# Initialize MediaPipe Hands
mp_hands = mp.solutions.hands
hands = mp_hands.Hands()

# Start capturing video from the webcam
cap = cv2.VideoCapture(0)

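# Capture a frame from the video, then release the camera
ret, frame = cap.read()
cap.release()
if not ret:
    raise RuntimeError("Could not read a frame from the webcam")

# Process the frame to detect hand landmarks (MediaPipe expects RGB)
results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
hands.close()

# Print the detected hand landmarks (None if no hand was found)
print("Hand landmarks:", results.multi_hand_landmarks)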

This small piece of code shows how MediaPipe processes a video frame and extracts hand landmarks. From there, you can classify gestures and map them to text.
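One common step between landmark detection and classification is normalising the landmarks so that the classifier sees hand shape rather than hand position or distance from the camera. The sketch below is one way to do that, not the method from the earlier tutorial: the function names are my own, and the input is a plain list of (x, y, z) tuples, which you can build from a MediaPipe result with [(lm.x, lm.y, lm.z) for lm in hand.landmark].

```python
import math

def normalize_landmarks(points):
    """Translate so the wrist (index 0) is the origin, then scale to unit size."""
    wrist_x, wrist_y, wrist_z = points[0]
    shifted = [(x - wrist_x, y - wrist_y, z - wrist_z) for x, y, z in points]
    # Distance of the farthest landmark from the wrist
    scale = max(math.sqrt(x * x + y * y + z * z) for x, y, z in shifted)
    if scale == 0:
        return shifted  # degenerate case: all points coincide
    return [(x / scale, y / scale, z / scale) for x, y, z in shifted]

def to_feature_vector(points):
    """Flatten normalized (x, y, z) triples into one list of floats."""
    return [value for point in normalize_landmarks(points) for value in point]

# Made-up coordinates for illustration; MediaPipe returns 21 landmarks per hand
fake_hand = [(0.5, 0.5, 0.0), (0.6, 0.5, 0.0), (0.5, 0.7, 0.0)]
print(to_feature_vector(fake_hand))
```

With a full hand, the resulting vector has 63 values (21 landmarks × 3 coordinates) and can be fed to any small classifier.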

👉 You can explore the full project on GitHub or read the complete tutorial on freeCodeCamp.

In another freeCodeCamp article, I demonstrated how to build AI accessibility tools with Python, such as speech recognition and text-to-speech. These projects provided readers with a foundation for building their own inclusive tools, and you can find the full source code in the repository.

Beyond these individual projects, the wider field has also made significant progress. Advances in sign language recognition have improved accuracy in capturing complex hand shapes and movements. Text-to-speech systems have become more natural and adaptive, giving users voices that sound closer to human speech. Mobile and desktop accessibility apps have brought these capabilities into everyday classrooms.

These achievements are encouraging, but they remain limited. Most of today’s tools are still designed for a single mode of communication. A system may work for gestures, or for speech, or for text, but not all of them together.

The next step is clear: we need multimodal, adaptive AI tools that can blend gestures, speech, and feedback into unified systems. This is where the most exciting opportunities in accessibility lie, and it is where we will turn next.

Figure 1: Comparison of isolated single-modality systems with unified multimodal AI systems.

Case Study 1: Translating Makaton to English

One of my first projects in this area focused on translating Makaton into English.

Makaton is a language programme that uses signs and symbols to support people with speech and language difficulties. It is widely used in classrooms where learners may not rely fully on speech. The challenge is that while a learner communicates in Makaton, their teachers and peers often work in English, which creates a communication gap.

The AI Workflow

The system followed a clear pipeline:

Camera Input → Hand Landmark Detection → Gesture Classification → English Translation Output

Figure 2: AI workflow for translating Makaton gestures into English.

Camera Input: captures the learner’s Makaton sign.
Hand Landmark Detection: a vision library such as MediaPipe or OpenCV identifies the position of the fingers and hands.
Gesture Classification: a trained machine learning model classifies which Makaton sign was made.
English Translation Output: the system maps that gesture to its English word or phrase and displays it.

Example in Python

Here is a simplified version of how this workflow might look in code:

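# Placeholder stages so the sketch runs end to end. Replace them with a real
# camera read, a MediaPipe hand detector, and a trained gesture classifier.
def read_frame():
    return "frame"

def detect_landmarks(frame):
    return [(0.5, 0.5, 0.0)]

def classify_gesture(landmarks):
    return "hello_sign" if landmarks else None

# Step 1: Capture input
frame = read_frame()

# Step 2: Detect hand landmarks
landmarks = detect_landmarks(frame)

# Step 3: Classify gesture
gesture = classify_gesture(landmarks)

# Step 4: Translate to English
translation_map = {
    "hello_sign": "Hello",
    "thank_you_sign": "Thank you"
}
text = translation_map.get(gesture, "Unknown sign")

print("Makaton sign:", gesture, " -> English:", text)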

This is a simplified example, but it shows the core idea: map gestures to meaning and then bridge that meaning into English.

Why This Matters

Imagine a student signing thank you in Makaton and the system instantly displaying the words on screen. Teachers can check understanding, peers can respond naturally, and the learner’s contribution becomes visible to everyone.

The key takeaway is that AI can bridge symbol- and gesture-based languages with mainstream spoken and written communication. Instead of forcing learners to adapt to rigid systems, we can design systems that adapt to the way they already communicate.

Case Study 2: AURA Prototype (Adaptive Speech Assistant)

Another project I worked on is called AURA, the Apraxia of Speech Adaptive Understanding and Relearning Assistant. The idea was to design a system that not only recognises speech but also supports learners with speech disorders by detecting errors, adapting feedback, and offering multimodal alternatives.

The Challenge

Most commercial speech recognition systems fail when a person’s speech does not follow typical patterns. This is especially true for people with apraxia of speech, where motor planning difficulties make pronunciation inconsistent. The result is frequent misrecognition, frustration, and exclusion from tools that rely on voice input.

The AI Workflow

The AURA prototype used a layered architecture:

Speech Input → Wav2Vec2 (fine-tuned for disordered speech) → CNN + BiLSTM Error Detection → Reinforcement Learning Feedback → Multimodal Output (Speech + Gesture)

Figure 3: Workflow of the AURA prototype, combining speech, error detection, adaptive feedback, and multimodal outputs.

Wav2Vec2 Speech Recognition: fine-tuned on disordered speech to improve transcription accuracy.
CNN + BiLSTM Model: classifies articulation or phonological errors in real time.
Reinforcement Learning Engine: adapts feedback loops so therapy suggestions improve as the learner progresses.
Gesture-to-Speech Multimodal Input: when speech is too difficult, MediaPipe gestures can be used to trigger spoken outputs.
Streamlit Interface: integrates everything into a single accessible app for testing.

Here’s a simplified view of how an error detection module could be structured:

# Example: Error classification using CNN + BiLSTM
import torch
import torch.nn as nn

# Define the ErrorClassifier model
class ErrorClassifier(nn.Module):
    def __init__(self):
        super(ErrorClassifier, self).__init__()
        self.cnn = nn.Conv1d(in_channels=40, out_channels=64, kernel_size=3)
        self.lstm = nn.LSTM(64, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(256, 3)  # Output classes: e.g. correct, substitution, omission

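    def forward(self, x):
        # x: (batch, 40, time), i.e. 40 features per audio frame
        x = self.cnn(x)                # -> (batch, 64, time - 2)
        x = x.transpose(1, 2)          # -> (batch, time - 2, 64), as the LSTM expects
        x, _ = self.lstm(x)            # -> (batch, time - 2, 256)
        return self.fc(x[:, -1, :])    # classify from the last time step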

# Instantiate the model
model = ErrorClassifier()

This snippet shows the heart of the error detection pipeline: combining CNN layers for feature extraction with BiLSTMs for sequence modeling. The model can flag articulation errors, which then guide the feedback loop.
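The feedback loop can be illustrated in a similar spirit. The sketch below is not the AURA implementation. It uses an epsilon-greedy bandit, a much simpler technique than a full reinforcement learning engine, to show how a system might learn which kind of feedback helps a learner most. The class name, the strategy names, and the reward convention are all assumptions for illustration.

```python
import random

class FeedbackBandit:
    """Epsilon-greedy choice among feedback strategies, using a running mean reward."""

    def __init__(self, strategies, epsilon=0.1, seed=None):
        self.strategies = list(strategies)
        self.epsilon = epsilon
        self.counts = {s: 0 for s in self.strategies}
        self.values = {s: 0.0 for s in self.strategies}
        self.rng = random.Random(seed)

    def choose(self):
        # Explore occasionally; otherwise pick the best-scoring strategy so far
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.strategies)
        return max(self.strategies, key=lambda s: self.values[s])

    def update(self, strategy, reward):
        # Incremental mean: new = old + (reward - old) / n
        self.counts[strategy] += 1
        n = self.counts[strategy]
        self.values[strategy] += (reward - self.values[strategy]) / n

# Hypothetical strategy names; reward 1.0 means the next attempt improved
bandit = FeedbackBandit(["repeat_slowly", "show_gesture", "visual_cue"], epsilon=0.0)
bandit.update("show_gesture", 1.0)
bandit.update("repeat_slowly", 0.0)
print(bandit.choose())  # -> show_gesture
```

In a setup like this, the reward could come from the error classifier: if the learner's next attempt is classified as correct, the strategy that was used gets a reward of 1.0.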

Why This Matters

With AURA, the goal was not just to recognise what someone said, but to help them communicate more effectively. The prototype adapted in real time, offering corrective feedback, suggesting gestures, or switching modes when speech became difficult.

The takeaway is that AI can evolve from being a passive recognition tool into an active partner in learning and communication.

The Bigger Picture: Multimodal Accessibility Tools

The two projects we explored, translating Makaton into English and building the AURA prototype, highlight a much larger transformation underway. Accessibility technology is moving away from isolated, single-purpose applications toward multimodal platforms that bring together speech, gestures, text, and adaptive AI into one seamless system.

Why This Shift Matters

The benefits of this shift are profound:

Greater inclusivity in classrooms: learners who rely on different modes of communication can participate equally.
Real-time support: systems that detect errors or adapt to gestures give learners immediate feedback rather than delayed corrections.
Lower frustration: multimodal options mean if one channel breaks down (for example, speech), others like gesture or text can take over smoothly.
Confidence and independence: learners express themselves more fully, without depending heavily on support staff or interpreters.

Beyond the Classroom

The impact of multimodal accessibility extends across many sectors:

In healthcare, patients with communication difficulties can use multimodal AI assistants to express needs clearly, reducing misdiagnosis and stress.
In the workplace, employees with speech or motor impairments can collaborate effectively using adaptive AI tools.
In community settings, individuals can participate more freely in conversations, services, and digital platforms, strengthening social inclusion.

Visualising the Shift

How to Build a Multimodal Makaton to English Translator (Gesture + Speech)

This demo combines both use cases: a Makaton to English classroom tool and the AURA assistive speech path. It prioritizes gesture when a sign is detected, falls back to speech when it isn’t, and produces a unified English output (with optional text-to-speech). We’ll focus on the translation layer, multimodal fusion, and a simple Streamlit UI.

Project structure

makaton_multimodal_demo/
├─ .streamlit/
│   └─ config.toml 
├─ assets/
│   └─ README.txt 
├─ tests/
│   └─ test_fuse.py 
└─ streamlit_app.py

The structure provided above outlines the organization of a project directory for a multimodal Makaton to English translator demo using Streamlit. Here's a brief explanation of each component:

makaton_multimodal_demo/: This is the root directory of the project.
.streamlit/: This directory contains configuration files for Streamlit, which is a framework used to build web apps in Python. The config.toml file is optional and can be used to customize the Streamlit app's settings.
assets/: This directory is intended to store models or other necessary files for the project. The README.txt serves as a placeholder to indicate where these files should be placed.
tests/: This directory is for test scripts. The test_fuse.py file contains tests for the fuse function, which decides whether the gesture or the speech input becomes the final output.
streamlit_app.py: This is the main application file where the Streamlit app is implemented. It contains the code that runs the app, handling the user interface and the logic for translating Makaton gestures and speech into English.

Install & run

# (optional) create and activate a virtualenv
python -m venv .venv

# Windows
.\.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate

The snippet above creates and activates a Python virtual environment: a self-contained directory with its own Python interpreter and its own set of installed packages, kept separate from the system-wide installation.

python -m venv .venv: This command creates a new virtual environment in a directory named .venv. The venv module is used to create lightweight virtual environments.
.\.venv\Scripts\activate (Windows): This command activates the virtual environment on Windows. Once activated, the environment's Python interpreter and installed packages will be used.
source .venv/bin/activate (macOS/Linux): This command activates the virtual environment on macOS or Linux. Similar to Windows, activating the environment ensures that the specific Python interpreter and packages within the environment are used.

Install dependencies

pip install streamlit opencv-python mediapipe SpeechRecognition gTTS pydub numpy

The command above is used to install multiple Python packages at once. Here's what each package does:

streamlit: A framework for building interactive web applications in Python, often used for data science and machine learning projects.
opencv-python: Provides OpenCV, a library for computer vision tasks such as image processing and video analysis.
mediapipe: A library developed by Google for building cross-platform, customizable machine learning solutions for live and streaming media, including hand and face detection.
SpeechRecognition: A library for performing speech recognition, allowing Python to recognize and process human speech.
gTTS: Google Text-to-Speech, a library and CLI tool to interface with Google Translate's text-to-speech API, enabling text-to-speech conversion.
pydub: A library for audio processing, allowing manipulation of audio files, such as converting between different audio formats.
numpy: A fundamental package for scientific computing in Python, providing support for arrays and matrices, along with a collection of mathematical functions.

Create `streamlit_app.py`

# streamlit_app.py
from io import BytesIO
from typing import Optional
import streamlit as st

# Optional deps (kept optional so readers can still run the core demo)
try:
    import cv2
    import mediapipe as mp
    MP_OK = True
except Exception:
    MP_OK = False

try:
    import speech_recognition as sr
    SR_OK = True
except Exception:
    SR_OK = False

try:
    from gtts import gTTS
    GTTS_OK = True
except Exception:
    GTTS_OK = False

# --- 1) Minimal Makaton dictionary (extend as needed)
MAKATON_DICT = {
    "hello_sign": "Hello",
    "thank_you_sign": "Thank you",
    "help_sign": "Help",
    "toilet_sign": "Toilet",
    "stop_sign": "Stop",
}

# --- 2) Gesture classifier (stub for the demo)
def classify_gesture(landmarks) -> Optional[str]:
    """
    Return a canonical label like 'hello_sign' or None if unknown.
    Replace this stub with your trained model + confidence threshold.
    """
    return "hello_sign" if landmarks else None

# --- 3) Speech recognizer (fallback path)
def transcribe_speech(seconds: int = 3) -> Optional[str]:
    if not SR_OK:
        return None
    r = sr.Recognizer()
    try:
        with sr.Microphone() as source:
            st.info("Listening...")
            audio = r.listen(source, phrase_time_limit=seconds)
        return r.recognize_google(audio)
    except Exception as e:
        st.warning(f"Speech recognition error: {e}")
        return None

# --- 4) Fusion logic (gesture first, speech fallback)
def fuse(gesture_label: Optional[str], speech_text: Optional[str]) -> str:
    if gesture_label and gesture_label in MAKATON_DICT:
        return MAKATON_DICT[gesture_label]
    if speech_text:
        return speech_text
    return "No input detected"

# --- 5) Optional: extract single-frame hand landmarks using MediaPipe
def extract_hand_landmarks_from_image(image_bytes: bytes):
    if not MP_OK:
        return None
    try:
        import numpy as np
        np_arr = np.frombuffer(image_bytes, dtype=np.uint8)
        img = cv2.imdecode(np_arr, cv2.IMREAD_COLOR)
        if img is None:
            return None

        mp_hands = mp.solutions.hands
        with mp_hands.Hands(static_image_mode=True, max_num_hands=1, min_detection_confidence=0.5) as hands:
            img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            result = hands.process(img_rgb)

        if not result.multi_hand_landmarks:
            return None

        hand_landmarks = result.multi_hand_landmarks[0]
        return [(lm.x, lm.y, lm.z) for lm in hand_landmarks.landmark]
    except Exception:
        return None

# --- 6) Streamlit UI
st.set_page_config(page_title="Makaton → English (Multimodal Demo)")
st.title("Makaton → English (Multimodal Demo)")
st.caption("Combines a classroom Makaton translator with an assistive speech path (AURA-style).")

with st.expander("What this demo shows"):
    st.write(
        "- **Translation layer:** small Makaton dictionary you can extend.\n"
        "- **Multimodal fusion:** gesture prioritized, speech as fallback.\n"
        "- **UI:** one page, clear output, optional text-to-speech."
    )

tabs = st.tabs(["Simulated Sign", "Single-Frame Webcam (Optional)", "About"])

# Tab 1: Simulated (no CV model required)
with tabs[0]:
    st.subheader("Simulated Gesture + Speech")
    col1, col2 = st.columns(2)

    with col1:
        simulate = st.selectbox(
            "Pick a sign",
            ["", "hello_sign", "thank_you_sign", "help_sign", "toilet_sign", "stop_sign"],
            index=0
        )
        gesture_label = simulate or None

    with col2:
        if st.button("Transcribe 3s"):
            if SR_OK:
                st.session_state["speech_text"] = transcribe_speech(3)
            else:
                st.warning("SpeechRecognition not installed.")
        # Read after the button so a fresh transcript shows on the same run
        speech_text = st.session_state.get("speech_text")
        st.write("Current speech:", speech_text or "None")

    output = fuse(gesture_label, st.session_state.get("speech_text"))
    st.markdown(f"### Output: **{output}**")

    if output and output != "No input detected":
        if st.button("Speak output"):
            if GTTS_OK:
                mp3 = BytesIO()
                try:
                    gTTS(output).write_to_fp(mp3)
                    st.audio(mp3.getvalue(), format="audio/mp3")
                except Exception as e:
                    st.warning(f"TTS failed: {e}")
            else:
                st.warning("gTTS not installed.")

# Tab 2: Optional single-frame webcam capture
with tabs[1]:
    st.subheader("Single-Frame Hand Detection (Webcam)")
    if not MP_OK:
        st.warning("Install MediaPipe + OpenCV to enable this tab.")
    else:
        img = st.camera_input("Capture a frame")
        captured_label = None
        if img is not None:
            landmarks = extract_hand_landmarks_from_image(img.getvalue())
            if landmarks:
                captured_label = classify_gesture(landmarks)
                st.success("Hand detected.")
            else:
                st.info("No hand landmarks found. Try better lighting/framing.")

        if st.button("Transcribe 3s (webcam tab)"):
            st.session_state["speech_text2"] = transcribe_speech(3) if SR_OK else None

        speech_text2 = st.session_state.get("speech_text2")
        st.write("Current speech:", speech_text2 or "None")

        output2 = fuse(captured_label, speech_text2)
        st.markdown(f"### Output: **{output2}**")

        if output2 and output2 != "No input detected":
            if st.button("Speak output (webcam tab)"):
                if GTTS_OK:
                    mp3 = BytesIO()
                    try:
                        gTTS(output2).write_to_fp(mp3)
                        st.audio(mp3.getvalue(), format="audio/mp3")
                    except Exception as e:
                        st.warning(f"TTS failed: {e}")
                else:
                    st.warning("gTTS not installed.")

The code above creates a Streamlit application that combines gesture recognition and speech recognition to translate Makaton signs into English. Here's a brief explanation of how it works:

Dependencies and Setup: The code attempts to import optional dependencies like OpenCV, MediaPipe, SpeechRecognition, and gTTS. These are used for gesture detection, speech recognition, and text-to-speech functionalities.
Makaton Dictionary: A minimal dictionary that maps Makaton signs to English words. This can be extended to include more signs.
Gesture Classifier: A placeholder function (classify_gesture) is used to classify hand gestures. In a real application, this would be replaced with a trained model.
Speech Recognizer: The transcribe_speech function uses the SpeechRecognition library to convert spoken words into text, serving as a fallback when gestures are not detected.
Fusion Logic: The fuse function prioritizes gesture recognition over speech. If a gesture is recognized, it translates it using the dictionary; otherwise, it uses the transcribed speech.
Hand Landmark Extraction: The code includes a function to extract hand landmarks from an image using MediaPipe, which is used for gesture classification.
Streamlit UI: The user interface is built with Streamlit, featuring tabs for simulated gestures, webcam-based gesture detection, and additional information. Users can simulate gestures, capture gestures via webcam, and use speech input. The output is displayed and can be converted to speech using gTTS.

This application demonstrates a multimodal approach by integrating both gesture and speech recognition to facilitate communication for users who rely on Makaton.
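The fusion logic is the easiest part of the app to test, because it is a pure function with no camera or microphone involved. The contents of tests/test_fuse.py are not shown in this article, so treat the following as a sketch of what such tests might cover. To keep it self-contained, it repeats a shortened MAKATON_DICT and the fuse function from streamlit_app.py rather than importing them.

```python
from typing import Optional

# Shortened copy of the dictionary from streamlit_app.py
MAKATON_DICT = {"hello_sign": "Hello", "thank_you_sign": "Thank you"}

def fuse(gesture_label: Optional[str], speech_text: Optional[str]) -> str:
    """Same logic as in streamlit_app.py: gesture first, speech as fallback."""
    if gesture_label and gesture_label in MAKATON_DICT:
        return MAKATON_DICT[gesture_label]
    if speech_text:
        return speech_text
    return "No input detected"

def test_gesture_wins_over_speech():
    assert fuse("hello_sign", "good morning") == "Hello"

def test_speech_used_when_gesture_missing_or_unknown():
    assert fuse(None, "good morning") == "good morning"
    assert fuse("unknown_sign", "good morning") == "good morning"

def test_no_input():
    assert fuse(None, None) == "No input detected"
    assert fuse("", "") == "No input detected"
```

If you have pytest installed, running pytest on the file executes all three tests.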

Run

streamlit run streamlit_app.py

The command above launches the Streamlit application. It starts a local web server and opens the app in your browser, where you can interact with its user interface. Run it in a terminal or command prompt from the project folder, with your virtual environment active.

Figure — App interface: the Simulated Sign tab before any input.

Figure — Selecting hello_sign produces “Output: Hello”.

Project Overview

You have developed a multimodal translator that integrates both gesture recognition (specifically Makaton signs) and speech recognition to produce a unified English output. The system is designed to prioritize gesture input, using speech as a fallback when gestures are not detected.

User Interface

The application is built using Streamlit, featuring two main tabs:

Simulated Sign Tab: Allows users to simulate gestures without requiring computer vision (CV) capabilities.
Webcam Single Frame Tab: Optionally uses a webcam to capture and process a single frame for gesture detection.

Use Case Integration

Makaton to English Translation: In a classroom setting, detected Makaton signs are translated into short English phrases, facilitating communication.
AURA-style Assistive Path: If no gesture is detected, the system relies on speech input to generate an output, ensuring continuous communication support.

Design Limitations

The gesture classifier is currently a placeholder and should be replaced with a trained model that includes a confidence threshold for better accuracy.
The Makaton dictionary is minimal and can be expanded to include more phrases and templates.
The speech recognition component uses a basic recognizer. For improved robustness, consider using advanced models like Wav2Vec2 or offline automatic speech recognition (ASR) systems.

Suggested Extensions

Implement a confidence threshold to display both gesture and speech inputs when the system is uncertain.
Expand the dictionary to support slot templates, such as "I want [item]".
Introduce a toggle to switch between speech-first and gesture-first input priorities.
Enable logging of outputs for teachers and provide an option to export these logs as CSV files.
Consider replacing gTTS with an offline text-to-speech solution for better reliability.
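As a starting point for the first and third extensions, here is a minimal sketch of a fusion function that takes a confidence score and a priority toggle. It is one possible design, not the code from the repository: the function name fuse_with_confidence, the 0.7 threshold, and the "(sign?)" and "(speech?)" markers are assumptions, and it presumes your classifier returns a confidence between 0 and 1.

```python
from typing import Optional

# Shortened copy of the dictionary from streamlit_app.py
MAKATON_DICT = {"hello_sign": "Hello", "stop_sign": "Stop"}

def fuse_with_confidence(
    gesture_label: Optional[str],
    gesture_conf: float,
    speech_text: Optional[str],
    threshold: float = 0.7,
    speech_first: bool = False,
) -> str:
    """Pick one modality when confident; show both candidates when unsure."""
    gesture_text = MAKATON_DICT.get(gesture_label) if gesture_label else None
    confident = gesture_text is not None and gesture_conf >= threshold

    if speech_first and speech_text:
        return speech_text
    if confident:
        return gesture_text
    if gesture_text and speech_text:
        # Uncertain gesture plus speech: surface both for confirmation
        return f"{gesture_text} (sign?) / {speech_text} (speech?)"
    if speech_text:
        return speech_text
    if gesture_text:
        return f"{gesture_text} (sign?)"
    return "No input detected"

print(fuse_with_confidence("hello_sign", 0.9, "hi there"))  # -> Hello
print(fuse_with_confidence("hello_sign", 0.4, "hi there"))  # -> Hello (sign?) / hi there (speech?)
```

Showing both candidates when the classifier is unsure lets the learner or teacher confirm the intended meaning instead of the system guessing silently.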

Troubleshooting Tips

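If you encounter microphone errors, ensure that PyAudio is installed, since sr.Microphone depends on it. Try pip install pyaudio first. On Windows, if that fails, an older workaround is pip install pipwin followed by pipwin install pyaudio.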
If the webcam is not detected, check your browser permissions. The Simulated Sign tab can still be used without a webcam.
If there are issues with package imports, verify that they are installed in your active virtual environment.

The link to the full code: Multimodal_Makaton

Challenges and Ethical Considerations

While the promise of multimodal accessibility tools is exciting, building them responsibly requires us to confront several challenges. These are not only technical problems but also ethical ones that affect how learners, teachers, and communities experience AI.

Data Scarcity

Training AI systems requires large, diverse datasets. But when it comes to disordered speech or symbol systems like Makaton, the data is limited. Without enough examples, models risk being inaccurate or biased toward a narrow group of users. Collecting more data is essential, but it must be done ethically, with consent and respect for the communities involved.

Fairness and Inclusion

AI systems often work better for some groups than others. A model trained mostly on fluent English speakers may fail to recognise learners with strong accents or speech difficulties. Similarly, gesture recognition may not account for differences in motor ability. Fairness means designing models that work across abilities, accents, and cultures, so that no group is excluded by design.

Privacy and Security

Speech and video data are highly sensitive, especially when collected in schools. Protecting this data is not optional; it is a requirement. Systems must anonymize or encrypt recordings and store them securely. Transparency is also key: learners, parents, and teachers should know exactly how data is being used and who has access to it.

Accessibility of the Tools Themselves

Ironically, many “accessibility tools” remain inaccessible because they are expensive, require powerful hardware, or are too complex to use. For AI to truly reduce barriers, solutions must be affordable, lightweight, and easy for teachers to set up in real classrooms, not just in research labs.

Takeaway

These challenges remind us that accessibility in AI is not only a technical question but also an ethical and social responsibility. To build tools that genuinely help learners, we need collaboration between developers, educators, policymakers, and the communities who will use the systems.

Where We’re Heading Next

The future of AI accessibility tools is speculative, but the possibilities are both exciting and necessary. What we have now are prototypes and early systems. What lies ahead are tools that could reshape how classrooms and society more broadly approach communication and inclusion.

Multilingual Makaton Translation

One promising direction is the ability to translate Makaton across multiple languages. A learner in the UK could sign in Makaton and see their contribution appear not just in English but in French, Spanish, or Yoruba. This would open up international classrooms and give learners access to global opportunities that are often closed off by language barriers.

AI Tutors with Dynamic Adaptation

Imagine a classroom assistant powered by AI that adapts in real time. If a learner struggles with speech, it could switch to gesture recognition. If gestures become tiring, it could prompt the learner with symbol-based options. These AI tutors would not only support communication but also guide learning, adapting to each student’s strengths and challenges over time.

Wearable Multimodal Devices

The rise of lightweight hardware makes it possible to imagine wearable AI assistants that provide instant translation and support. Glasses could capture gestures and overlay text, while earbuds could translate disordered speech into clear audio for peers and teachers. Instead of bulky setups, accessibility would become portable, personal, and ever-present.

A Broader Impact

These innovations go beyond technology alone. They align with the United Nations Sustainable Development Goals (SDGs), especially:

Quality Education (Goal 4): ensuring that every learner, regardless of ability, has equal access to education.
Reduced Inequalities (Goal 10): breaking down barriers so that disability or difference is not a cause of exclusion.

The journey from single-modality tools to multimodal, adaptive systems is still in its early stages. But if we continue to push forward with creativity, ethics, and inclusivity at the center, AI accessibility tools will not only change classrooms; they will change lives.

Conclusion: Building an Inclusive Future with AI

AI accessibility tools are no longer just optional add-ons for a few learners. They are becoming core enablers of inclusion in education, healthcare, workplaces, and daily life.

The journey from early gesture recognition systems to multimodal, adaptive prototypes like Makaton translation and AURA shows what is possible when technology is designed around people rather than forcing people to adapt to technology. These innovations break down communication barriers and open up new opportunities for learners who have too often been left on the margins.

But the future of accessibility is not automatic. It depends on choices we make now as developers, educators, researchers, and policymakers. Building tools that are open, ethical, and affordable requires collaboration and commitment.

The vision is clear: a world where every learner, regardless of ability, can express themselves fully, be understood by others, and participate with confidence.

The future of education is inclusive, and with thoughtful design, AI can help us get there.

How to Build AI Speech-to-Text and Text-to-Speech Accessibility Tools with Python

OMOTAYO OMOYEMI — Mon, 01 Sep 2025 19:50:40 +0000

Classrooms today are more diverse than ever before. Among the students are neurodiverse learners with different learning needs. While these learners bring unique strengths, traditional teaching methods don’t always meet their needs.

This is where AI-driven accessibility tools can make a difference. From real-time captioning to adaptive reading support, artificial intelligence is transforming classrooms into more inclusive spaces.

In this article, you’ll:

Understand what inclusive education means in practice.
See how AI can support neurodiverse learners.
Try two hands-on Python demos:
- Speech-to-Text using local Whisper (free, no API key).
- Text-to-Speech using Hugging Face SpeechT5.
Get a ready-to-use project structure, requirements, and troubleshooting tips for Windows and macOS/Linux users.

Prerequisites
A Note on Missing Files
What Inclusive Education Really Means
Toolbox: Five AI Accessibility Tools Teachers Can Try Today
Platform Notes (Windows vs macOS/Linux)
Hands-On: Build a Simple Accessibility Toolkit (Python)
Quick Setup Cheatsheet
From Code to Classroom Impact
Developer Challenge: Build for Inclusion
Challenges and Considerations
Looking Ahead
Conclusion

Prerequisites

Before you start, make sure you have the following:

Python 3.8 or later installed. Windows users who don’t have it can download the latest version from python.org; macOS users usually already have python3.
Virtual environment set up (venv) — recommended to keep things clean.
FFmpeg installed (required for Whisper to read audio files).
PowerShell (Windows) or Terminal (macOS/Linux).
Basic familiarity with running Python scripts.

Tip: If you’re new to Python environments, don’t worry: the setup commands are included with each step below.

A Note on Missing Files

Some files are not included in the GitHub repository. This is intentional: they are either generated automatically or should be created or installed locally.

.venv/ → Your virtual environment folder. Each reader should create their own locally with:
```
  python -m venv .venv
```
1. FFmpeg Installation:
  - Windows: FFmpeg is not included in the project files because it is large (approximately 90 MB). Download the FFmpeg build separately.
  - macOS: Install FFmpeg with the Homebrew package manager: brew install ffmpeg.
  - Linux (Debian/Ubuntu): Install FFmpeg with the APT package manager: sudo apt install ffmpeg.
2. Output File:
  - output.wav is generated when you run the Text-to-Speech script. It is not included in the GitHub repository; it is created locally on your machine when you execute the script.

To keep the repo clean, these are excluded using .gitignore:

# Ignore virtual environments
.venv/
env/
venv/

# Ignore binary files
ffmpeg.exe
*.dll
*.lib

# Ignore generated audio (but keep sample input)
*.wav
*.mp3
!lesson_recording.mp3

The repository does include all essential files needed to follow along:

requirements.txt (see below)
transcribe.py and tts.py (covered step-by-step in the Hands-On section).

requirements.txt

openai-whisper
transformers
torch
soundfile
sentencepiece
numpy

This way, you’ll have everything you need to reproduce the project.

What Inclusive Education Really Means

Inclusive education goes beyond placing students with diverse needs in the same classroom. It’s about designing learning environments where every student can thrive.

Common barriers include:

Reading difficulties (for example, dyslexia).
Communication challenges (speech/hearing impairments).
Sensory overload or attention struggles (autism, ADHD).
Note-taking and comprehension difficulties.

AI can help reduce these barriers with captioning, reading aloud, adaptive pacing, and alternative communication tools.

Toolbox: Five AI Accessibility Tools Teachers Can Try Today

Microsoft Immersive Reader – Text-to-speech, reading guides, and translation.
Google Live Transcribe – Real-time captions for speech/hearing support.
Otter.ai – Automatic note-taking and summarization.
Grammarly / Quillbot – Writing assistance for readability and clarity.
Seeing AI (Microsoft) – Describes text and scenes for visually impaired learners.

Real-World Examples

A student with dyslexia can use Immersive Reader to listen to a textbook while following along visually. Another student with hearing loss can use Live Transcribe to follow class discussions. These are small technology shifts that create big inclusion wins.

Platform Notes (Windows vs macOS/Linux)

Most code works the same across systems, but setup commands differ slightly:

Creating a virtual environment

To create and activate a virtual environment in PowerShell, follow these steps. The example uses the py launcher with Python 3.12; substitute whichever installed version (3.8 or higher) you prefer:

Create a virtual environment:
```
 py -3.12 -m venv .venv
```
Activate the virtual environment:
```
 .\.venv\Scripts\Activate
```

Once activated, your PowerShell prompt should change to indicate that you are now working within the virtual environment. This setup helps manage dependencies and keep your project environment isolated.

macOS and Linux users can create and activate a virtual environment in their shell using Python 3 with these steps:

Create a virtual environment:
```
 python3 -m venv .venv
```
Activate the virtual environment:
```
 source .venv/bin/activate
```

Once activated, your bash prompt should change to indicate that you are now working within the virtual environment. This setup helps manage dependencies and keep your project environment isolated.

To install FFmpeg on Windows, follow these steps:

Download FFmpeg Build: Visit the official FFmpeg website to download the latest FFmpeg build for Windows.
Unzip the Downloaded File: Once downloaded, unzip the file to extract its contents. You will find several files, including ffmpeg.exe.
Copy ffmpeg.exe: You have two options for using ffmpeg.exe:
- Project Folder: Copy ffmpeg.exe directly into your project folder. This way, your project can access FFmpeg without modifying system settings.
- Add to PATH: Alternatively, you can add the directory containing ffmpeg.exe to your system's PATH environment variable. This allows you to use FFmpeg from any command prompt window without specifying its location.

Additionally, the full project folder, including all necessary files and instructions, is available for download on GitHub. You can also find the link to the GitHub repository at the end of the article.

For macOS users:

To install FFmpeg on macOS, you can use Homebrew, a popular package manager for macOS. Here’s how:

Open Terminal: You can find Terminal in the Utilities folder within Applications.
Install Homebrew (if not already installed): Paste the following command in Terminal and press Enter. Follow the on-screen instructions. /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Install FFmpeg: Once Homebrew is installed, run the following command in Terminal:
```
 brew install ffmpeg
```
This command will download and install FFmpeg, making it available for use on your system.

For Linux users (Debian/Ubuntu):

To install FFmpeg on Debian-based systems like Ubuntu, you can use the APT package manager. Here’s how:

Open Terminal: You can usually find Terminal in your system’s applications menu.
Update Package List: Before installing new software, it’s a good idea to update your package list. Run:
```
 sudo apt update
```
Install FFmpeg: After updating, install FFmpeg by running:
```
 sudo apt install ffmpeg
```
This command will download and install FFmpeg, allowing you to use it from the command line.

These steps will ensure that FFmpeg is installed and ready to use on your macOS or Linux system.
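
If you would like to confirm the setup from Python rather than by eye, a small pre-flight check along these lines may help. It is only a sketch: the function name preflight and the wording of the messages are my own, and it assumes FFmpeg should be discoverable on your PATH.

```python
import shutil
import sys


def preflight(min_version=(3, 8)):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < min_version:
        problems.append("Python %d.%d or later is required" % min_version)
    # Inside a venv, sys.prefix points at the environment, not the base install.
    if sys.prefix == sys.base_prefix:
        problems.append("No virtual environment is active")
    # shutil.which returns None when the executable cannot be found.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment looks ready.")
```

Running this once before the demos can save time, since a missing FFmpeg otherwise only shows up when Whisper tries to decode audio.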

Running Python scripts

Windows: python script.py or py script.py
macOS/Linux: python3 script.py

I will mark these differences with a macOS/Linux note in the relevant steps so you can follow along smoothly on your system.

Hands-On: Build a Simple Accessibility Toolkit (Python)

You’ll build two small demos:

Speech-to-Text with Whisper (local, free).
Text-to-Speech with Hugging Face SpeechT5.

1) Speech-to-Text with Whisper (Local and free)

What you’ll build:
A Python script that takes a short MP3 recording and prints the transcript to your terminal.

Why Whisper?
It’s a robust open-source STT model. The local version is perfect for beginners because it avoids API keys/quotas and works offline after the first install.

How to Install Whisper (PowerShell):

# Activate your virtual environment
# Example: .\.venv\Scripts\Activate

# Install the openai-whisper package
pip install openai-whisper

# Check if FFmpeg is available
ffmpeg -version

# If FFmpeg is not available, download and install it, then add it to PATH or place ffmpeg.exe next to your script
# Example: Move ffmpeg.exe to the script directory or update PATH environment variable

You should see a version string here before running Whisper.

Note: macOS and Linux users can run the same pip and ffmpeg commands in their terminal, activating the environment with source .venv/bin/activate instead.

If FFmpeg is not installed, you can install it using the following commands:

For macOS:

brew install ffmpeg

For Ubuntu/Debian Linux:

sudo apt install ffmpeg

Create `transcribe.py`:

import whisper

# Load the Whisper model
model = whisper.load_model("base")  # "tiny" is faster; "small" and larger are slower but more accurate

# Transcribe the audio file
result = model.transcribe("lesson_recording.mp3", fp16=False)

# Print the transcript
print("Transcript:", result["text"])

How the code works:

whisper.load_model("base") — downloads the model on first use, then loads it from the local cache afterward.
model.transcribe(...) — handles audio decoding, language detection, and text inference.
fp16=False — avoids half-precision GPU math so it runs fine on CPU.
result["text"] — the final transcript string.
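
Besides result["text"], the dictionary returned by model.transcribe also carries a "segments" list with start and end times in seconds, which is handy for captions. The sketch below formats those segments into simple caption lines. The helper names and the sample dictionary are illustrative stand-ins, not part of the project.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to MM:SS for a simple caption line."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def to_caption_lines(result: dict) -> list:
    """Build '[start - end] text' lines from a Whisper result dictionary."""
    lines = []
    for seg in result.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {seg['text'].strip()}")
    return lines


if __name__ == "__main__":
    # Hand-written stand-in for model.transcribe(...) output, for illustration only.
    sample = {
        "text": " Welcome to class. Today we cover fractions.",
        "segments": [
            {"start": 0.0, "end": 2.4, "text": " Welcome to class."},
            {"start": 2.4, "end": 65.0, "text": " Today we cover fractions."},
        ],
    }
    for line in to_caption_lines(sample):
        print(line)
```

In transcribe.py you could pass the real result straight in: to_caption_lines(result).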

Run it:

python transcribe.py

Expected output:

Successful Speech-to-Text: Whisper prints the recognized sentence from lesson_recording.mp3

To run the transcribe.py script on macOS or Linux, use the following command in your Terminal:

python3 transcribe.py

Common hiccups (and fixes):

FileNotFoundError during transcribe → FFmpeg isn’t found. Install it and confirm with ffmpeg -version.
Super slow on CPU → switch to the smaller tiny model: whisper.load_model("tiny"). Note that small is larger (and slower) than base.

2) Text-to-Speech with SpeechT5

What you’ll build:
A Python script that converts a short string into a spoken WAV file called output.wav.

Why SpeechT5?
It’s a widely used open model that runs on your CPU. Easy to demo and no API key needed.

Install the required packages on (PowerShell) Windows:

# Activate your virtual environment
# Example: .\.venv\Scripts\Activate

# Install the required packages
pip install transformers torch soundfile sentencepiece

Note: macOS and Linux users can run the same pip command in their terminal, activating the environment with source .venv/bin/activate instead.

Create tts.py

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import soundfile as sf
import torch
import numpy as np

# Load models
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Speaker embedding (fixed random seed for a consistent synthetic voice)
g = torch.Generator().manual_seed(42)
speaker_embeddings = torch.randn((1, 512), generator=g)

# Text to synthesize
text = "Welcome to inclusive education with AI."
inputs = processor(text=text, return_tensors="pt")

# Generate speech
with torch.no_grad():
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# Save to WAV
sf.write("output.wav", speech.numpy(), samplerate=16000)
print("✅ Audio saved as output.wav")

Expected Output:

Text-to-Speech complete. SpeechT5 generated the audio and saved it as output.wav

How the code works:

SpeechT5Processor — prepares your text for the model.
SpeechT5ForTextToSpeech — generates a mel-spectrogram (the speech content).
SpeechT5HifiGan — a vocoder that turns the spectrogram into a waveform you can play.
speaker_embeddings — a 512-dim vector representing a “voice.” We seed it for a consistent (synthetic) voice across runs.

Note: If you want the same voice every time you reopen the project, you need to save the embedding once using the snippet below:

import numpy as np
import torch

# Save the speaker embeddings
np.save("speaker_emb.npy", speaker_embeddings.numpy())

# Later, load the speaker embeddings
speaker_embeddings = torch.tensor(np.load("speaker_emb.npy"))

Run it:

python tts.py

Note: On macOS/Linux, run python3 tts.py instead.

Expected result:

Terminal prints: ✅ Audio saved as output.wav
A new file appears in your folder: output.wav

Common hiccups (and fixes):

ImportError: sentencepiece not found → pip install sentencepiece
Torch install issues on Windows →

# Activate your virtual environment
# Example: .\.venv\Scripts\Activate

# Install the torch package using the specified index URL for CPU
pip install torch --index-url https://download.pytorch.org/whl/cpu

Note: The first run is usually slow because the models are still downloading. That’s normal.

3) Optional: Whisper via OpenAI API

What this does:
Instead of running Whisper locally, you can call the OpenAI Whisper API (whisper-1). Your audio file is uploaded to OpenAI’s servers, transcribed there, and the text is returned.

Why use the API?

No need to install or run Whisper models locally (saves disk space & setup time).
Runs on OpenAI’s infrastructure (faster if your computer is slow).
Great if you’re already using OpenAI services in your classroom or app.

What to watch out for:

Requires an API key.
Requires billing enabled (the free trial quota is usually small).
Needs internet access (unlike the local Whisper demo).

How to get an API key:

Go to OpenAI’s API Keys page.
Log in with your OpenAI account (or create one).
Click “Create new secret key”.
Copy the key — it looks like sk-xxxxxxxx.... Treat this like a password: don’t share it publicly or push it to GitHub.

Step 1: Set your API key

In PowerShell (session only):

# Set the OpenAI API key in the environment variable
$env:OPENAI_API_KEY="your_api_key_here"

To set the environment variable permanently in PowerShell, use the setx command:

setx OPENAI_API_KEY "your_api_key_here"

This command sets the OPENAI_API_KEY environment variable to the specified value. Note that you should replace "your_api_key_here" with your actual API key. This change will apply to future PowerShell sessions, but you may need to restart your current session or open a new one to see the changes take effect.

Verify it’s set:

To check the value of an environment variable in PowerShell, you can use the echo command. Here's how you can do it:

echo $env:OPENAI_API_KEY

This command will display the current value of the OPENAI_API_KEY environment variable in your PowerShell session. If the variable is set, it will print the value. Otherwise, it will return nothing or an empty line.
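
The same check can be made from Python before calling the API, which gives a clearer message than a failed request. This is a minimal sketch; the helper names require_api_key and mask are hypothetical and not part of the OpenAI client.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the key from the environment or raise a readable error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Set it in this terminal session and try again."
        )
    return key


def mask(key, visible=4):
    """Hide all but the last few characters so the key is safer to print."""
    return "*" * max(len(key) - visible, 0) + key[-visible:]


if __name__ == "__main__":
    try:
        print("Key found:", mask(require_api_key()))
    except RuntimeError as err:
        print(err)
```

Printing a masked key confirms that the variable is set without exposing the secret in your terminal history or in screenshots.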

Step 2: Install the OpenAI Python client

To install the OpenAI Python client using pip, you can use the following command in your PowerShell:

pip install openai

This command will download and install the OpenAI package, allowing you to interact with OpenAI's API in your Python projects. Make sure you have Python and pip installed on your system before running this command.

Step 3: Create transcribe_api.py

from openai import OpenAI

# Initialize the OpenAI client (reads API key from environment)
client = OpenAI()

# Open the audio file and create a transcription
with open("lesson_recording.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f
    )

# Print the transcript
print("Transcript:", transcript.text)

Step 4: Run it

python transcribe_api.py

Expected output:

Transcript: Welcome to inclusive education with AI.

Common hiccups (and fixes):

Error: insufficient_quota → You’ve run out of free credits. Add billing to continue.
Slow upload → If your audio is large, compress it first (for example, MP3 instead of WAV).
Key not found → Double-check if $env:OPENAI_API_KEY is set in your terminal session.

Local Whisper vs API Whisper — Which Should You Use?

Feature	Local Whisper (on your machine)	OpenAI Whisper API (cloud)
Setup	Needs Python packages + FFmpeg	Just install `openai` client + set API key
Hardware	Runs on your CPU (slower) or GPU (faster)	Runs on OpenAI’s servers (no local compute needed)
Cost	✅ Free after initial download	💳 Pay per minute of audio (after free trial quota)
Internet required	❌ No (fully offline once installed)	✅ Yes (uploads audio to OpenAI servers)
Accuracy	Very good - depends on model size (tiny → large)	Consistently strong - optimized by OpenAI
Speed	Slower on CPU, faster with GPU	Fast (uses OpenAI’s infrastructure)
Privacy	Audio never leaves your machine	Audio is sent to OpenAI (data handling per policy)

Rule of thumb:

Use Local Whisper if you want free, offline transcription or you’re working with sensitive data.
Use the API Whisper if you prefer convenience, don’t mind usage billing, and want speed without local setup.
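
If you want one script that supports both options, the rule of thumb could be encoded roughly as below. The choose_backend and transcribe helpers are hypothetical names for this sketch; the Whisper and OpenAI calls inside them are the same ones used in transcribe.py and transcribe_api.py.

```python
def choose_backend(sensitive_data: bool, has_internet: bool, has_api_key: bool) -> str:
    """Return 'local' or 'api' following the rule of thumb above."""
    if sensitive_data or not has_internet or not has_api_key:
        return "local"
    return "api"


def transcribe(path: str, backend: str) -> str:
    """Transcribe an audio file with the chosen backend.

    Imports are deferred so that only the selected backend needs to be installed.
    """
    if backend == "local":
        import whisper
        model = whisper.load_model("base")
        return model.transcribe(path, fp16=False)["text"]
    if backend == "api":
        from openai import OpenAI
        with open(path, "rb") as f:
            return OpenAI().audio.transcriptions.create(model="whisper-1", file=f).text
    raise ValueError(f"unknown backend: {backend}")
```

Defaulting to the local model whenever the data is sensitive keeps recordings on the machine unless you explicitly decide otherwise.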

Quick Setup Cheatsheet

Task	Windows (PowerShell)	macOS / Linux (Terminal)
Create venv	`py -3.12 -m venv .venv`	`python3 -m venv .venv`
Activate venv	`.\.venv\Scripts\Activate`	`source .venv/bin/activate`
Install Whisper	`pip install openai-whisper`	`pip install openai-whisper`
Install FFmpeg	Download build → unzip → add to PATH or copy `ffmpeg.exe`	`brew install ffmpeg` (macOS), `sudo apt install ffmpeg` (Linux)
Run STT script	`python transcribe.py`	`python3 transcribe.py`
Install TTS deps	`pip install transformers torch soundfile sentencepiece`	`pip install transformers torch soundfile sentencepiece`
Run TTS script	`python tts.py`	`python3 tts.py`
Install OpenAI client (API)	`pip install openai`	`pip install openai`
Run API script	`python transcribe_api.py`	`python3 transcribe_api.py`

Pro tip for macOS users on Apple silicon (M1/M2): You may need a specific PyTorch build for Metal GPU acceleration. Check the PyTorch install guide for the right wheel.

From Code to Classroom Impact

Whether you chose the local Whisper, the cloud API, or SpeechT5 for text-to-speech, you should now have a working prototype that can:

Convert spoken lessons into text.
Read text aloud for students who prefer auditory input.

That’s the technical foundation. But the real question is: how can these building blocks empower teachers and learners in real classrooms?

Developer Challenge: Build for Inclusion

Try combining the two snippets into a simple classroom companion app that:

Captions what the teacher says in real time.
Reads aloud transcripts or textbook passages on demand.
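
One possible starting point is a small function that chains the two demos: transcribe a recording, then synthesize the caption. The sketch below works on a saved file rather than in real time, and it takes the two steps as plain functions so you can plug in Whisper and SpeechT5 yourself. The name caption_and_read and its parameters are assumptions made for this example.

```python
from typing import Callable


def caption_and_read(audio_path: str,
                     transcribe: Callable[[str], str],
                     speak: Callable[[str, str], None],
                     out_path: str = "output.wav") -> str:
    """Caption a recording, then hand the caption to a text-to-speech step."""
    caption = transcribe(audio_path).strip()
    if caption:  # skip synthesis when nothing was recognized
        speak(caption, out_path)
    return caption


if __name__ == "__main__":
    # Stand-ins so the sketch runs without downloading any models.
    spoken = []
    text = caption_and_read(
        "lesson_recording.mp3",
        transcribe=lambda path: " Welcome to inclusive education with AI. ",
        speak=lambda caption, out: spoken.append((caption, out)),
    )
    print(text)
    print(spoken)
```

To use the real models, wrap the body of transcribe.py in a function that returns result["text"], and the body of tts.py in a function that takes the text and an output path.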

Then think about how to expand it further:

Add symbol recognition for non-verbal communication.
Add multi-language translation for diverse classrooms.
Add offline support for schools with poor connectivity.

These are not futuristic ideas; they are achievable with today’s open-source AI tools.

Challenges and Considerations

Of course, building for inclusion isn’t just about code. There are important challenges to address:

Privacy: Student data must be safeguarded, especially when recordings are involved.
Cost: Solutions must be affordable and scalable for schools of all sizes.
Teacher Training: Educators need support to confidently use these tools.
Balance: AI should assist teachers, not replace the vital human element in learning.

Looking Ahead

The future of inclusive education will likely involve multimodal AI: systems that combine speech, gestures, symbols, and even emotion recognition. We may even see brain–computer interfaces and wearable devices that enable seamless communication for learners who are currently excluded.

But one principle is clear: inclusion works best when teachers, developers, and neurodiverse learners co-design solutions together.

Conclusion

AI isn’t here to replace teachers; it’s here to help them reach every student. By embracing AI-driven accessibility, classrooms can transform into spaces where neurodiverse learners aren’t left behind, but instead empowered to thrive.

📢 Your turn:

Teachers: You can try one of the tools in your next lesson.
Developers: You can use the code snippets above to prototype your own inclusive classroom tool.
Policymakers: You can support initiatives that make accessibility central to education.

Inclusive education isn’t just a dream, it’s becoming a reality. With thoughtful use of AI, it can become the new norm.

Full source code on GitHub: Inclusive AI Toolkit

How to Create a Real-Time Gesture-to-Text Translator Using Python and Mediapipe

OMOTAYO OMOYEMI — Mon, 18 Aug 2025 14:00:13 +0000

Sign and symbol languages, like Makaton and American Sign Language (ASL), are powerful communication tools. However, they can create challenges when communicating with people who don't understand them.

As a researcher working on AI for accessibility, I wanted to explore how machine learning and computer vision could bridge that gap. The result was a real-time gesture-to-text translator built with Python and Mediapipe, capable of detecting hand gestures and instantly converting them to text.

In this tutorial, you’ll learn how to build your own version from scratch, even if you’ve never used Mediapipe before.

By the end, you’ll know how to:

Detect and track hand movements in real time.
Classify gestures using a simple machine learning model.
Convert recognized gestures into text output.
Extend the system for accessibility-focused applications.

Prerequisites

Before following along with this tutorial, you should have:

Basic Python knowledge – You should be comfortable writing and running Python scripts.
Familiarity with the command line – You’ll use it to run scripts and install dependencies.
A working webcam – This is required for capturing and recognizing gestures in real time.
Python installed (3.8 or later) – Along with pip for installing packages.
Some understanding of machine learning basics – Knowing what training data and models are will help, but I’ll explain the key parts along the way.
An internet connection – To install libraries such as Mediapipe and OpenCV.

If you’re completely new to Mediapipe or OpenCV, don’t worry, I will walk through the core parts you need to know to get this project working.

Prerequisites
Why This Matters
Tools and Technologies
Step 1: How to Install the Required Libraries
Step 2: How Mediapipe Tracks Hands
Step 3: Project Pipeline
Step 4: How to Collect Gesture Data
Step 5: How to Train a Gesture Classifier
Step 6: Real-Time Gesture-to-Text Translation
Step 7: Extending the Project
Ethical and Accessibility Considerations
Conclusion

Why This Matters

Accessible communication is a right, not a privilege. Gesture-to-text translators can:

Help non-signers communicate with sign/symbol language users.
Assist in educational contexts for children with communication challenges.
Support people with speech impairments.

Note: This project is a proof-of-concept and should be tested with diverse datasets before real-world deployment.

Tools and Technologies

We’ll be using:

Tool	Purpose
Python	Primary programming language
Mediapipe	Real-time hand tracking and gesture detection
OpenCV	Webcam input and video display
NumPy	Data processing
Scikit-learn	Gesture classification

Step 1: How to Install the Required Libraries

Before installing the dependencies, ensure you have Python version 3.8 or higher installed (for example, Python 3.8, 3.9, 3.10, or newer). You can check your current Python version by opening a terminal (Command Prompt on Windows, or Terminal on macOS/Linux) and typing:

python --version

python3 --version

Confirm that your Python version is 3.8 or higher, because Mediapipe and some of its dependencies require modern language features and binary wheels. If the commands above print a version older than 3.8, install a newer Python version before you continue.

Windows:

Press Windows Key + R
Type cmd and press Enter to open Command Prompt
Type one of the above commands and press Enter

macOS/Linux:

Open your Terminal application
Type one of the above commands and press Enter

If your Python version is older than 3.8, you’ll need to download and install a newer version from the official Python website.

Once Python is ready, you can install the required libraries using pip:

pip install mediapipe opencv-python numpy scikit-learn pandas

This command installs all the libraries you’ll need for the project:

Mediapipe – real-time hand tracking and landmark detection.
OpenCV – reading frames from your webcam and drawing overlays.
NumPy – numerical processing of the landmark data.
Pandas – storing our collected landmark data in a CSV for training.
Scikit-learn – training and evaluating the gesture classification model.

Step 2: How Mediapipe Tracks Hands

Mediapipe’s Hand Tracking solution detects 21 key landmarks for each hand, including fingertips, joints, and the wrist, often at 30 FPS or more even on modest hardware.

Here’s a conceptual diagram of the landmarks:

And here’s what real‑time tracking looks like:

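Each landmark has (x, y, z) coordinates: x and y are normalized to the range 0 to 1 by the image width and height, and z is a relative depth measured from the wrist. This makes it easy to measure angles and positions for gesture classification.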
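
To make the angle idea concrete, here is a small sketch that computes the angle at a joint from three landmark positions. It uses only the standard library and takes plain (x, y, z) tuples; the function name joint_angle is my own and is not part of Mediapipe.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.sqrt(sum(x * x for x in ba)) * math.sqrt(sum(x * x for x in bc))
    if norm == 0:
        return 0.0  # degenerate case: two of the points coincide
    # Clamp to guard against tiny floating-point overshoot before acos.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


if __name__ == "__main__":
    # Made-up coordinates: a right angle and a straight line
    print(joint_angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
    print(joint_angle((-1, 0, 0), (0, 0, 0), (1, 0, 0)))
```

With Mediapipe's landmark numbering, passing landmarks 5, 6, and 7 would give the bend at the middle joint of the index finger. Because x and y are normalized separately, angles on non-square frames are slightly distorted unless you scale the coordinates by the frame width and height first.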

Step 3: Project Pipeline

Here’s how the system works, from webcam to text output:

Capture: Webcam frames are captured using OpenCV.
Detection: Mediapipe locates hand landmarks.
Vectorization: Landmarks are flattened into a numeric vector.
Classification: A machine learning model predicts the gesture.
Output: The recognized gesture is displayed as text.
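
The vectorization step can be sketched on its own, without a webcam. The function below flattens any sequence of objects with x, y, and z attributes into a single list. In the real pipeline that sequence would be results.multi_hand_landmarks[0].landmark; here fake landmarks stand in for it, and the name flatten_landmarks is an assumption for this example.

```python
from types import SimpleNamespace


def flatten_landmarks(landmarks):
    """Turn landmarks with .x/.y/.z attributes into a flat list of floats."""
    vector = []
    for lm in landmarks:
        vector.extend([lm.x, lm.y, lm.z])
    return vector


if __name__ == "__main__":
    # Fake landmarks standing in for Mediapipe output, for illustration only.
    fake = [SimpleNamespace(x=i * 0.01, y=i * 0.02, z=0.0) for i in range(21)]
    vector = flatten_landmarks(fake)
    print(len(vector))  # 63 values: 21 landmarks x 3 coordinates
```

Keeping the order fixed (x, y, z for landmark 0, then landmark 1, and so on) matters, because the classifier must see features in the same order at training and prediction time.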

Basic hand detection example:

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)

with mp_hands.Hands(max_num_hands=1) as hands:
    while True:
        ret, frame = cap.read()
        if not ret:
            break

        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)

        cv2.imshow("Hand Tracking", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

cap.release()
cv2.destroyAllWindows()

The code above opens the webcam and processes each frame with Mediapipe’s Hands solution. Each frame is converted from BGR to RGB (as Mediapipe expects) before detection runs, and if a hand is found, the 21 landmarks and their connections are drawn on top of the frame. Press q to close the window. This step verifies your setup and confirms that landmark tracking works before you move on.

Step 4: How to Collect Gesture Data

Before we can train our model, we need a dataset of labelled gestures. Each gesture will be stored in a CSV file (gesture_data.csv) containing the 3D landmark coordinates for all detected hand points.

For example, we’ll collect data for three gestures:

thumbs_up – the classic thumbs-up pose.
open_palm – a flat hand, fingers extended (like a “high five”).
ok – the “OK” sign, made by touching the thumb and index finger.

You can collect samples for each gesture by running:

python src/collect_data.py --label thumbs_up --samples 200

python src/collect_data.py --label open_palm --samples 200

python src/collect_data.py --label ok --samples 200

Explanation of the command:

--label → the name of the gesture you’re recording. This label will be stored alongside each row of coordinates in the CSV.
--samples → the number of frames to capture for that gesture. More samples generally lead to better accuracy.
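
If you are curious how a script might read these two flags, here is a minimal sketch using Python's built-in argparse module. It is not the actual collect_data.py from the repo: the parse_args helper, the default of 200 samples, and the positive-number check are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags used by the data collection command (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Collect hand landmark samples for one gesture."
    )
    # Name of the gesture, stored alongside each row of coordinates
    parser.add_argument("--label", required=True, help="gesture name, e.g. thumbs_up")
    # Number of frames to capture (the default of 200 is an assumption)
    parser.add_argument("--samples", type=int, default=200, help="frames to capture")
    args = parser.parse_args(argv)
    if args.samples <= 0:
        parser.error("--samples must be a positive integer")
    return args


# Passing a list simulates: --label thumbs_up --samples 200
args = parse_args(["--label", "thumbs_up", "--samples", "200"])
print(args.label, args.samples)
```

Making --label required means the script stops with a clear message when the flag is missing, so no unlabelled rows are written.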

How the process works:

When you run a command, your webcam will open.
Make the specified gesture in front of the camera.
The script will use MediaPipe Hands to detect 21 hand landmarks (each with x, y, z coordinates).
These 63 numbers (21 × 3) are stored in a row of the CSV file, along with the gesture label.
The counter at the top will track how many samples have been collected.
When the sample count reaches your target (--samples), the script will close automatically.
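
The flattening step described above can be sketched roughly as follows. This is an illustration rather than the repo's code: the Landmark namedtuple stands in for a MediaPipe landmark (which exposes x, y, and z attributes), and build_header, landmarks_to_row, and append_sample are hypothetical helper names.

```python
import csv
import os
from collections import namedtuple

# Stand-in for a MediaPipe landmark, which exposes .x, .y and .z attributes
Landmark = namedtuple("Landmark", ["x", "y", "z"])


def build_header(num_landmarks=21):
    """Return the column names: x0, y0, z0 ... x20, y20, z20, label."""
    header = []
    for i in range(num_landmarks):
        header.extend([f"x{i}", f"y{i}", f"z{i}"])
    header.append("label")
    return header


def landmarks_to_row(landmarks, label):
    """Flatten the landmarks into [x0, y0, z0, ..., label]."""
    row = []
    for lm in landmarks:
        row.extend([lm.x, lm.y, lm.z])
    row.append(label)
    return row


def append_sample(csv_path, landmarks, label):
    """Append one sample, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(build_header(len(landmarks)))
        writer.writerow(landmarks_to_row(landmarks, label))


# A fake hand with 21 identical points, just to show the row layout
fake_hand = [Landmark(0.1, 0.2, 0.3)] * 21
row = landmarks_to_row(fake_hand, "thumbs_up")
print(len(row))  # 63 coordinates plus 1 label
```

In a real capture loop, you would pass hand_landmarks.landmark in place of fake_hand.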

Example of what the CSV looks like:

Each row contains:

x0, y0, z0 … x20, y20, z20 → coordinates of each hand landmark.
label → the gesture name.

Example of data collection in progress:

In the above screenshot, the script is capturing 10 out of 10 thumbs_up samples.

📌 Tip: Make sure your hand is clearly visible and well-lit. Repeat the process for all gestures you want to train.

Step 5: How to Train a Gesture Classifier

Once you have enough samples for each gesture, train a model:

python src/train_model.py --data data/gesture_data.csv --label palm_open

This script:

Loads the CSV dataset.
Splits into training and testing sets.
Trains a Random Forest Classifier.
Prints accuracy and a classification report.
Saves the trained model.

Core training logic:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import pickle

# Load the dataset
df = pd.read_csv("data/gesture_data.csv")

# Separate features and labels
X = df.drop("label", axis=1)
y = df["label"]

# Initialize and train the Random Forest Classifier
model = RandomForestClassifier()
model.fit(X, y)

# Save the trained model to a file
with open("data/gesture_model.pkl", "wb") as f:
    pickle.dump(model, f)

This block loads the gesture dataset from data/gesture_data.csv and splits it into:

X – the input features (the 3D landmark coordinates for each gesture sample).
y – the labels (gesture names like thumbs_up, open_palm, ok).

We then create a Random Forest Classifier, which handles numerical features well and works reliably without much tuning. The model learns patterns in the landmark positions that correspond to each gesture. For brevity, this snippet fits the model on the full dataset and leaves out the train/test split and accuracy report listed above.
Finally, we save the trained model as data/gesture_model.pkl so it can be loaded later for real-time gesture recognition without retraining.
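
The split and evaluation steps from the list above could look something like the sketch below. To keep it self-contained, it uses synthetic data in place of gesture_data.csv, so the numbers it prints say nothing about real gestures. The 80/20 split, the stratify option, and the fixed random_state values are assumptions, not settings taken from the repo.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CSV: 3 gestures, 100 samples each, 63 features per sample
rng = np.random.default_rng(42)
labels = ["thumbs_up", "open_palm", "ok"]
X = np.vstack([rng.normal(loc=i, scale=0.1, size=(100, 63)) for i in range(3)])
y = np.repeat(labels, 100)

# Hold out 20% of the samples, keeping each gesture equally represented in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train on the training set only
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on samples the model has never seen
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_test, y_pred))
```

With your own data, replace the synthetic X and y with the columns loaded from the CSV. Evaluating on held-out samples gives a more honest picture than scoring the model on the data it was trained on.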

Step 6: Real-Time Gesture-to-Text Translation

Load the model and run the translator:

python src/gesture_to_text.py --model data/gesture_model.pkl

This command runs the real-time gesture recognition script.

The --model argument tells the script which trained model file to load — in this case, gesture_model.pkl that we saved earlier.
Once running, the script opens your webcam, detects your hand landmarks, and uses the model to predict the gesture.
The predicted gesture name appears as text on the video feed.
Press q to exit the window when you’re done.

Core prediction logic:

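import pickle
import cv2

# Load the trained classifier once, before the capture loop starts
with open("data/gesture_model.pkl", "rb") as f:
    model = pickle.load(f)

# Inside the capture loop: `frame` is the current webcam frame and `results`
# is the output of hands.process(...) from the hand detection example
if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        # Flatten the 21 landmarks into 63 values (x, y, z per point)
        coords = []
        for lm in hand_landmarks.landmark:
            coords.extend([lm.x, lm.y, lm.z])
        # Predict the gesture label and draw it on the frame
        gesture = model.predict([coords])[0]
        cv2.putText(frame, gesture, (10, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)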

This code loads the trained gesture recognition model from gesture_model.pkl.
If any hands are detected (results.multi_hand_landmarks), it loops through each detected hand and:

Extracts the coordinates – for each of the 21 landmarks, it appends the x, y, and z values to the coords list.
Makes a prediction – passes coords to the model’s predict method to get the most likely gesture label.
Displays the result – uses cv2.putText to draw the predicted gesture name on the video feed.

This is the real-time decision-making step that turns raw MediaPipe landmark data into a readable gesture label.

You should see the recognized gesture at the top of the video feed:

Step 7: Extending the Project

You can take this project further by:

Adding Text-to-Speech: Use pyttsx3 to speak recognized words.
Supporting More Gestures: Expand your dataset.
Deploying in the Browser: Use TensorFlow.js for web-based recognition.
Testing with Real Users: Especially in accessibility contexts.
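
As a starting point for the text-to-speech idea, here is a minimal sketch. It assumes pyttsx3 is installed and an audio device is available. The GestureAnnouncer class and the "speak only when the gesture changes" rule are my own illustration, not part of the original project. The demo at the bottom uses print instead of real speech so it runs anywhere.

```python
def speak_with_pyttsx3(text):
    """Speak text aloud (assumes pyttsx3 is installed and audio output works)."""
    import pyttsx3  # Imported lazily so the rest of the sketch runs without it

    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()  # Blocks until the speech has finished


class GestureAnnouncer:
    """Announce a gesture only when it differs from the previous one (hypothetical helper)."""

    def __init__(self, speak=speak_with_pyttsx3):
        self.speak = speak
        self.last_gesture = None

    def update(self, gesture):
        """Return True if the gesture was announced, False if it was a repeat."""
        if gesture == self.last_gesture:
            return False
        self.last_gesture = gesture
        # Read "thumbs_up" as "thumbs up"
        self.speak(gesture.replace("_", " "))
        return True


# Demo with print standing in for speech: one prediction per video frame
announcer = GestureAnnouncer(speak=print)
for predicted in ["thumbs_up", "thumbs_up", "open_palm", "open_palm", "ok"]:
    announcer.update(predicted)
```

Without the repeat check, the app would try to speak the same word on every frame. Because runAndWait() blocks, the video feed may pause briefly while a word is spoken, so in a real app you might move speech to a background thread.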

Ethical and Accessibility Considerations

Before deploying:

Dataset Diversity: Train with gestures from different skin tones, hand sizes, and lighting conditions.
Privacy: Store only landmark coordinates unless you have consent for video storage.
Cultural Context: Some gestures have different meanings in different cultures.

Conclusion

In this tutorial, we explored how to use Python, MediaPipe, and machine learning to build a real-time gesture-to-text translator. This technology has real potential for accessibility and inclusive communication. With further development, such as more gestures, speech output, and testing with real users, it could become a practical tool for breaking down language barriers.

You can find the full code and resources here:

GitHub Repo – Gesture_Article
