<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Computer Vision - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Computer Vision - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sat, 16 May 2026 22:22:53 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/computer-vision/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How I Built a Makaton AI Companion Using Gemini Nano and the Gemini API ]]>
                </title>
                <description>
                    <![CDATA[ When I started my research on AI systems that could translate Makaton (a sign and symbol language designed to support speech and communication), I wanted to bridge a gap in accessibility for learners with speech or language difficulties. Over time, t... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-i-built-a-makaton-ai-companion-using-gemini-nano-and-the-gemini-api/</link>
                <guid isPermaLink="false">690e1f43cb50ea9684f6d9aa</guid>
                
                    <category>
                        <![CDATA[ geminiAPI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ gemini-nano ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Fri, 07 Nov 2025 16:33:07 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762533154134/e2209ade-6971-464b-aeef-f05abd0a30d7.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When I started my research on AI systems that could translate Makaton (a sign and symbol language designed to support speech and communication), I wanted to bridge a gap in accessibility for learners with speech or language difficulties.</p>
<p>Over time, this academic interest evolved into a working prototype that combines on-device AI and cloud AI to describe images and translate them into English meanings. The idea was simple: I wanted to build a lightweight web app that recognized Makaton gestures or symbols and instantly provided an English interpretation.</p>
<p>In this article, I’ll walk you through how I built my Makaton AI Companion, a single-page web app powered by Gemini Nano (on-device) and the Gemini API (cloud). You’ll see how it works, how I solved common issues like CORS and API model errors, and how this small project became part of my journey toward AI for accessibility.</p>
<p>By the end of this article, you will be able to:</p>
<ul>
<li><p>Understand the core concept behind Makaton and why it’s important in accessibility and inclusive education.</p>
</li>
<li><p>Learn how to combine on-device AI (Gemini Nano) and cloud-based AI (Gemini API) in a single web project.</p>
</li>
<li><p>Build a functional AI-powered web app that can describe images and map them to predefined English meanings.</p>
</li>
<li><p>Discover how to handle common errors such as model endpoint issues, missing API keys, and CORS restrictions when working with generative AI APIs.</p>
</li>
<li><p>Learn how to store API keys locally for user privacy using <code>localStorage</code>.</p>
</li>
<li><p>Use browser speech synthesis to convert the AI-generated English meanings into spoken output.</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-tools-and-tech-stack">Tools and Tech Stack</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-building-the-app-step-by-step">Building the App Step by Step</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-fix-the-common-issues">How to Fix the Common Issues</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-demo-the-makaton-ai-companion-in-action">Demo: The Makaton AI Companion in Action</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-broader-reflections">Broader Reflections</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-tools-and-tech-stack">Tools and Tech Stack</h2>
<p>To build the Makaton AI Companion, I wanted something lightweight, fast to prototype, and easy for anyone to run without complicated dependencies. I chose a plain web stack with a focus on accessibility and transparency.</p>
<p>Here’s what I used:</p>
<h3 id="heading-frontend">Frontend</h3>
<ul>
<li><p><strong>HTML + CSS + JavaScript (Vanilla):</strong> No frameworks, just clean and understandable code that any beginner can follow.</p>
</li>
<li><p>A single <code>index.html</code> page handles the upload interface, output display, and AI logic.</p>
</li>
</ul>
<h3 id="heading-ai-components">AI Components</h3>
<ul>
<li><p><strong>Gemini Nano</strong> runs locally in Chrome Canary. This on-device model lets users generate short text without calling the cloud API.</p>
</li>
<li><p><strong>Gemini API (Cloud)</strong> is used as a fallback when on-device AI isn’t available or when image analysis is required.</p>
<ul>
<li><p>Models tested: <code>gemini-1.5-flash</code> and <code>gemini-pro-vision</code>.</p>
</li>
<li><p>Fallback logic ensures the app checks multiple model endpoints if one returns a 404 error.</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-local-storage">Local Storage</h3>
<ul>
<li>The Gemini API key is stored in the browser’s <code>localStorage</code>, so it stays on the user’s machine and is only ever sent with the user’s own requests to Google’s API.</li>
</ul>
<h3 id="heading-browser-speechsynthesis-api">Browser SpeechSynthesis API</h3>
<ul>
<li>Converts the translated English meaning into spoken audio with one click.</li>
</ul>
<h3 id="heading-mapping-logic">Mapping Logic</h3>
<ul>
<li>A small custom dictionary (<code>mapping.js</code>) links AI-generated descriptions to likely Makaton meanings. For example: <code>{ keywords: ["open hand", "raised hand", "wave"], meaning: "Hello / Stop" }</code></li>
</ul>
<h3 id="heading-local-server">Local Server</h3>
<ul>
<li><p>The app is served locally using Python’s built-in HTTP server to avoid CORS issues:</p>
<p>  <code>python -m http.server 8080</code></p>
</li>
</ul>
<p>Then open <code>http://localhost:8080</code> in Chrome Canary.</p>
<h2 id="heading-building-the-app-step-by-step">Building the App Step by Step</h2>
<p>Now let’s dive into how the Makaton AI Companion works under the hood. This project follows a simple but effective flow: Upload an image → Describe (AI) → Map to Meaning → Speak or Copy the result</p>
<p>We’ll go through each part step by step.</p>
<h3 id="heading-1-setting-up-the-project-folder">1. Setting Up the Project Folder</h3>
<p>You don’t need any complex setup. Just create a new folder and add these files:</p>
<pre><code class="lang-plaintext">makaton-ai-companion/
│
├── index.html
├── styles.css
├── app.js
└── lib/
    ├── mapping.js
    └── ai.js
</code></pre>
<p>If you prefer a ready-to-run version, everything is available as a single zip (I’ll share a GitHub link at the end).</p>
<h3 id="heading-2-creating-the-basic-html-structure">2. Creating the Basic HTML Structure</h3>
<p>Your <code>index.html</code> file defines the interface where users upload an image, click <em>Describe</em>, and view the results.</p>
<pre><code class="lang-html"><span class="hljs-meta">&lt;!DOCTYPE <span class="hljs-meta-keyword">html</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">html</span> <span class="hljs-attr">lang</span>=<span class="hljs-string">"en"</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">head</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">meta</span> <span class="hljs-attr">charset</span>=<span class="hljs-string">"UTF-8"</span> /&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">meta</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"viewport"</span> <span class="hljs-attr">content</span>=<span class="hljs-string">"width=device-width, initial-scale=1.0"</span>/&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">title</span>&gt;</span>Makaton AI Companion<span class="hljs-tag">&lt;/<span class="hljs-name">title</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">link</span> <span class="hljs-attr">rel</span>=<span class="hljs-string">"stylesheet"</span> <span class="hljs-attr">href</span>=<span class="hljs-string">"styles.css"</span>/&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">head</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">body</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">header</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"app-header"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>🧩 Makaton AI Companion<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnSettings"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn secondary"</span>&gt;</span>Settings<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
  <span class="hljs-tag">&lt;/<span class="hljs-name">header</span>&gt;</span>

  <span class="hljs-tag">&lt;<span class="hljs-name">main</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"container"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">section</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"card"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">h2</span>&gt;</span>1) Upload an image (Makaton sign/symbol)<span class="hljs-tag">&lt;/<span class="hljs-name">h2</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">label</span> <span class="hljs-attr">for</span>=<span class="hljs-string">"file"</span>&gt;</span>
        Choose an image file
        <span class="hljs-tag">&lt;<span class="hljs-name">input</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"file"</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"file"</span> <span class="hljs-attr">accept</span>=<span class="hljs-string">"image/*"</span> <span class="hljs-attr">title</span>=<span class="hljs-string">"Select an image file"</span>/&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">label</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"preview"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"preview hidden"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">p</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"status"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"status"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"actions"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnDescribe"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn"</span>&gt;</span>Describe (Cloud or Nano)<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnType"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn ghost"</span>&gt;</span>Type a description instead<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"typedBox"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"typed hidden"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">textarea</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"typed"</span> <span class="hljs-attr">rows</span>=<span class="hljs-string">"3"</span> <span class="hljs-attr">placeholder</span>=<span class="hljs-string">"Describe what you see..."</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">textarea</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnUseTyped"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn"</span>&gt;</span>Use this description<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">section</span>&gt;</span>

    <span class="hljs-tag">&lt;<span class="hljs-name">section</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"card"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">h2</span>&gt;</span>2) AI Output<span class="hljs-tag">&lt;/<span class="hljs-name">h2</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"grid"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span>Image Description<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"output"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"output"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span>English Meaning (Mapped)<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"meaning"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"meaning"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"actions"</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnSpeak"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn ghost"</span> <span class="hljs-attr">disabled</span>&gt;</span>🔊 Speak<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnCopy"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn ghost"</span> <span class="hljs-attr">disabled</span>&gt;</span>📋 Copy<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
          <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">section</span>&gt;</span>
  <span class="hljs-tag">&lt;/<span class="hljs-name">main</span>&gt;</span>

  <span class="hljs-tag">&lt;<span class="hljs-name">dialog</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"settings"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">form</span> <span class="hljs-attr">method</span>=<span class="hljs-string">"dialog"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"settings-form"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">h2</span>&gt;</span>Settings<span class="hljs-tag">&lt;/<span class="hljs-name">h2</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">label</span>&gt;</span>Gemini API key (optional)<span class="hljs-tag">&lt;<span class="hljs-name">input</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"apiKey"</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"password"</span> <span class="hljs-attr">placeholder</span>=<span class="hljs-string">"AIza..."</span>/&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">label</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"settings-actions"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnSaveKey"</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"submit"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn"</span>&gt;</span>Save<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnCloseSettings"</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"button"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn secondary"</span>&gt;</span>Close<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"apiStatus"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"api-status"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">form</span>&gt;</span>
  <span class="hljs-tag">&lt;/<span class="hljs-name">dialog</span>&gt;</span>

  <span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"module"</span> <span class="hljs-attr">src</span>=<span class="hljs-string">"lib/mapping.js"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"module"</span> <span class="hljs-attr">src</span>=<span class="hljs-string">"lib/ai.js"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"module"</span> <span class="hljs-attr">src</span>=<span class="hljs-string">"app.js"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">body</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">html</span>&gt;</span>
</code></pre>
<p>This interface is intentionally minimal: no frameworks, no build tools, just clear HTML.</p>
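<p>One styling note before moving on: the markup and scripts toggle a couple of class names, so whatever you put in <code>styles.css</code>, it needs at least a <code>.hidden</code> utility class. Here’s a minimal sketch (the drag highlight style is my suggestion, not anything the app requires):</p>
<pre><code class="lang-css">/* Required by the scripts: elements start hidden and are revealed via JS */
.hidden { display: none; }

/* Optional: visual feedback while dragging a file over the preview area */
.preview.drag { outline: 2px dashed #888; }
</code></pre>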
<h3 id="heading-3-mapping-descriptions-to-makaton-meanings">3. Mapping Descriptions to Makaton Meanings</h3>
<p>The <code>mapping.js</code> file holds a simple keyword-based dictionary. When the AI describes an image (like <em>“a raised open hand”</em>), the app searches for keywords that match known Makaton signs.</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// lib/mapping.js</span>

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> MAKATON_GLOSSES = [
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"open hand"</span>, <span class="hljs-string">"raised hand"</span>, <span class="hljs-string">"wave"</span>, <span class="hljs-string">"hand up"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Hello / Stop"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"eat"</span>, <span class="hljs-string">"food"</span>, <span class="hljs-string">"spoon"</span>, <span class="hljs-string">"hand to mouth"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Eat"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"drink"</span>, <span class="hljs-string">"cup"</span>, <span class="hljs-string">"glass"</span>, <span class="hljs-string">"bottle"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Drink"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"home"</span>, <span class="hljs-string">"house"</span>, <span class="hljs-string">"roof"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Home"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"sleep"</span>, <span class="hljs-string">"bed"</span>, <span class="hljs-string">"eyes closed"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Sleep"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"book"</span>, <span class="hljs-string">"reading"</span>, <span class="hljs-string">"pages"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Book / Read"</span> },
  <span class="hljs-comment">// Added so your current screenshot maps correctly:</span>
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"help"</span>, <span class="hljs-string">"assist"</span>, <span class="hljs-string">"thumb on palm"</span>, <span class="hljs-string">"hand over hand"</span>, <span class="hljs-string">"assisting"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Help"</span> },
];

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">mapDescriptionToMeaning</span>(<span class="hljs-params">desc</span>) </span>{
  <span class="hljs-keyword">if</span> (!desc) <span class="hljs-keyword">return</span> <span class="hljs-string">""</span>;
  <span class="hljs-keyword">const</span> d = desc.toLowerCase();
  <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> entry <span class="hljs-keyword">of</span> MAKATON_GLOSSES) {
    <span class="hljs-keyword">if</span> (entry.keywords.some(<span class="hljs-function"><span class="hljs-params">k</span> =&gt;</span> d.includes(k))) <span class="hljs-keyword">return</span> entry.meaning;
  }
  <span class="hljs-keyword">if</span> (d.includes(<span class="hljs-string">"hand"</span>)) <span class="hljs-keyword">return</span> <span class="hljs-string">"Gesture / Hand sign (clarify)"</span>;
  <span class="hljs-keyword">return</span> <span class="hljs-string">"No direct mapping found."</span>;
}
</code></pre>
<p>It’s simple but effective enough to simulate real symbol-to-language translation for demo purposes.</p>
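<p>To see the mapping in action, here’s what it returns for a few hypothetical descriptions. Matching is a case-insensitive substring search, so a description just needs to contain one of the keywords:</p>
<pre><code class="lang-javascript">import { mapDescriptionToMeaning } from './lib/mapping.js';

mapDescriptionToMeaning("A person raising an open hand, palm forward."); // "Hello / Stop"
mapDescriptionToMeaning("A child holding a cup to drink from.");          // "Drink"
mapDescriptionToMeaning("A cat sitting on a chair.");                     // "No direct mapping found."
</code></pre>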
<h3 id="heading-4-adding-gemini-ai-logic">4. Adding Gemini AI Logic</h3>
<p>The <code>ai.js</code> file connects to Gemini Nano (on-device) or the Gemini API (cloud). If Nano isn’t available, the app falls back to the cloud model. And if that fails, it lets users type a description manually.</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// lib/ai.js — dynamic model discovery (try-all version)</span>

<span class="hljs-comment">// --- On-device availability (Gemini Nano) ---</span>
<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">checkAvailability</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">const</span> res = { <span class="hljs-attr">nanoTextPossible</span>: <span class="hljs-literal">false</span> };
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> canCreate = self.ai?.canCreateTextSession || self.ai?.languageModel?.canCreate;
    <span class="hljs-keyword">if</span> (<span class="hljs-keyword">typeof</span> canCreate === <span class="hljs-string">"function"</span>) {
      <span class="hljs-keyword">const</span> ok = <span class="hljs-keyword">await</span> (self.ai.canCreateTextSession?.() || self.ai.languageModel.canCreate?.());
      res.nanoTextPossible = ok === <span class="hljs-string">"readily"</span> || ok === <span class="hljs-string">"after-download"</span> || ok === <span class="hljs-literal">true</span>;
    }
  } <span class="hljs-keyword">catch</span> {}
  <span class="hljs-keyword">return</span> res;
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">createNanoTextSession</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">if</span> (self.ai?.createTextSession) <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> self.ai.createTextSession();
  <span class="hljs-keyword">if</span> (self.ai?.languageModel?.create) <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> self.ai.languageModel.create();
  <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"Gemini Nano text session not available"</span>);
}

<span class="hljs-comment">// --- Cloud: dynamically discover models for this key ---</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">listModels</span>(<span class="hljs-params">key</span>) </span>{
  <span class="hljs-keyword">const</span> url = <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/models?key="</span> + <span class="hljs-built_in">encodeURIComponent</span>(key);
  <span class="hljs-keyword">const</span> r = <span class="hljs-keyword">await</span> fetch(url);
  <span class="hljs-keyword">if</span> (!r.ok) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"ListModels failed: "</span> + (<span class="hljs-keyword">await</span> r.text()));
  <span class="hljs-keyword">const</span> j = <span class="hljs-keyword">await</span> r.json();
  <span class="hljs-keyword">return</span> (j.models || []).map(<span class="hljs-function"><span class="hljs-params">m</span> =&gt;</span> m.name).filter(<span class="hljs-built_in">Boolean</span>);
}

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">rankModels</span>(<span class="hljs-params">names</span>) </span>{
  <span class="hljs-comment">// Prefer Gemini 1.5 (multimodal), then flash variants, then anything with vision/pro.</span>
  <span class="hljs-keyword">return</span> names
    .filter(<span class="hljs-function"><span class="hljs-params">n</span> =&gt;</span> n.startsWith(<span class="hljs-string">"models/"</span>))              <span class="hljs-comment">// ignore tunedModels, etc.</span>
    .filter(<span class="hljs-function"><span class="hljs-params">n</span> =&gt;</span> !n.includes(<span class="hljs-string">"experimental"</span>))          <span class="hljs-comment">// skip experimental</span>
    .sort(<span class="hljs-function">(<span class="hljs-params">a, b</span>) =&gt;</span> score(b) - score(a));

  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">score</span>(<span class="hljs-params">n</span>) </span>{
    <span class="hljs-keyword">let</span> s = <span class="hljs-number">0</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"1.5"</span>)) s += <span class="hljs-number">10</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"flash"</span>)) s += <span class="hljs-number">8</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"pro-vision"</span>)) s += <span class="hljs-number">7</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"pro"</span>)) s += <span class="hljs-number">6</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"vision"</span>)) s += <span class="hljs-number">5</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"latest"</span>)) s += <span class="hljs-number">2</span>;
    <span class="hljs-keyword">return</span> s;
  }
}

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">tryGenerateForModels</span>(<span class="hljs-params">imageDataUrl, key, models, mimeType</span>) </span>{
  <span class="hljs-keyword">const</span> base64 = imageDataUrl.split(<span class="hljs-string">","</span>)[<span class="hljs-number">1</span>];
  <span class="hljs-keyword">const</span> body = {
    <span class="hljs-attr">contents</span>: [{
      <span class="hljs-attr">parts</span>: [
        { <span class="hljs-attr">text</span>: <span class="hljs-string">"Describe this image briefly in one sentence focusing on the main gesture or symbol."</span> },
        { <span class="hljs-attr">inline_data</span>: { <span class="hljs-attr">mime_type</span>: mimeType || <span class="hljs-string">"image/png"</span>, <span class="hljs-attr">data</span>: base64 } }
      ]
    }]
  };
  <span class="hljs-keyword">let</span> lastErr = <span class="hljs-string">""</span>;
  <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> model <span class="hljs-keyword">of</span> models) {
    <span class="hljs-keyword">const</span> endpoint = <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/"</span> + model + <span class="hljs-string">":generateContent?key="</span> + <span class="hljs-built_in">encodeURIComponent</span>(key);
    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">const</span> r = <span class="hljs-keyword">await</span> fetch(endpoint, { <span class="hljs-attr">method</span>: <span class="hljs-string">"POST"</span>, <span class="hljs-attr">headers</span>: { <span class="hljs-string">"Content-Type"</span>: <span class="hljs-string">"application/json"</span> }, <span class="hljs-attr">body</span>: <span class="hljs-built_in">JSON</span>.stringify(body)});
      <span class="hljs-keyword">if</span> (!r.ok) { lastErr = <span class="hljs-keyword">await</span> r.text().catch(<span class="hljs-function">()=&gt;</span><span class="hljs-built_in">String</span>(r.status)); <span class="hljs-keyword">continue</span>; }
      <span class="hljs-keyword">const</span> j = <span class="hljs-keyword">await</span> r.json();
      <span class="hljs-keyword">const</span> text = j?.candidates?.[<span class="hljs-number">0</span>]?.content?.parts?.map(<span class="hljs-function"><span class="hljs-params">p</span>=&gt;</span>p.text).join(<span class="hljs-string">" "</span>).trim();
      <span class="hljs-keyword">if</span> (text) <span class="hljs-keyword">return</span> text;
      lastErr = <span class="hljs-string">"Empty response from "</span> + model;
    } <span class="hljs-keyword">catch</span> (e) {
      lastErr = <span class="hljs-built_in">String</span>(e?.message || e);
    }
  }
  <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"All discovered models failed. Last error: "</span> + lastErr);
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">describeImageWithGemini</span>(<span class="hljs-params">imageDataUrl, apiKey, mimeType = <span class="hljs-string">"image/png"</span></span>) </span>{
  <span class="hljs-keyword">if</span> (!apiKey) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"No API key provided"</span>);

  <span class="hljs-keyword">const</span> models = <span class="hljs-keyword">await</span> listModels(apiKey);
  <span class="hljs-keyword">if</span> (!models.length) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"No models returned for this key. Ensure Generative Language API is enabled and T&amp;Cs accepted in AI Studio."</span>);

  <span class="hljs-keyword">const</span> ranked = rankModels(models);
  <span class="hljs-keyword">if</span> (!ranked.length) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"No usable model names returned (models/*)."</span>);

  <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> tryGenerateForModels(imageDataUrl, apiKey, ranked, mimeType);
}

<span class="hljs-comment">// --- Key storage (local only) ---</span>
<span class="hljs-keyword">const</span> KEY = <span class="hljs-string">"makaton_demo_gemini_key"</span>;
<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">saveApiKey</span>(<span class="hljs-params">k</span>) </span>{ <span class="hljs-built_in">localStorage</span>.setItem(KEY, k || <span class="hljs-string">""</span>); }
<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">loadApiKey</span>(<span class="hljs-params"></span>) </span>{ <span class="hljs-keyword">return</span> <span class="hljs-built_in">localStorage</span>.getItem(KEY) || <span class="hljs-string">""</span>; }
</code></pre>
<p>Note: This retry system is essential because model availability varies between accounts, so many users encounter 404 errors for Gemini versions their key can’t access.</p>
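<p>If you want to see which models your key can actually access, you can call the same ListModels endpoint that <code>listModels()</code> uses, straight from the browser console (a quick diagnostic sketch; replace <code>key</code> with your own API key):</p>
<pre><code class="lang-javascript">const r = await fetch(
  "https://generativelanguage.googleapis.com/v1/models?key=" + encodeURIComponent(key)
);
const j = await r.json();
console.log((j.models || []).map(m =&gt; m.name)); // e.g. ["models/gemini-1.5-flash", ...]
</code></pre>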
<h3 id="heading-5-the-main-logic-appjs">5. The Main Logic (app.js)</h3>
<p>This script ties everything together: file upload, AI call, meaning mapping, and output display.</p>
<pre><code class="lang-javascript">
<span class="hljs-keyword">import</span> { mapDescriptionToMeaning } <span class="hljs-keyword">from</span> <span class="hljs-string">'./lib/mapping.js'</span>;
<span class="hljs-keyword">import</span> { checkAvailability, createNanoTextSession, describeImageWithGemini, saveApiKey, loadApiKey } <span class="hljs-keyword">from</span> <span class="hljs-string">'./lib/ai.js'</span>;

<span class="hljs-built_in">document</span>.addEventListener(<span class="hljs-string">'DOMContentLoaded'</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] DOM ready'</span>);

  <span class="hljs-keyword">const</span> $ = <span class="hljs-function">(<span class="hljs-params">s</span>) =&gt;</span> <span class="hljs-built_in">document</span>.querySelector(s);

  <span class="hljs-comment">// Elements</span>
  <span class="hljs-keyword">const</span> fileInput   = $(<span class="hljs-string">'#file'</span>);
  <span class="hljs-keyword">const</span> preview     = $(<span class="hljs-string">'#preview'</span>);
  <span class="hljs-keyword">const</span> meaningEl   = $(<span class="hljs-string">'#meaning'</span>);
  <span class="hljs-keyword">const</span> outputEl    = $(<span class="hljs-string">'#output'</span>);
  <span class="hljs-keyword">const</span> btnDescribe = $(<span class="hljs-string">'#btnDescribe'</span>);
  <span class="hljs-keyword">const</span> btnType     = $(<span class="hljs-string">'#btnType'</span>);
  <span class="hljs-keyword">const</span> typedBox    = $(<span class="hljs-string">'#typedBox'</span>);
  <span class="hljs-keyword">const</span> typed       = $(<span class="hljs-string">'#typed'</span>);
  <span class="hljs-keyword">const</span> btnUseTyped = $(<span class="hljs-string">'#btnUseTyped'</span>);
  <span class="hljs-keyword">const</span> btnSpeak    = $(<span class="hljs-string">'#btnSpeak'</span>);
  <span class="hljs-keyword">const</span> btnCopy     = $(<span class="hljs-string">'#btnCopy'</span>);
  <span class="hljs-keyword">const</span> statusEl    = $(<span class="hljs-string">'#status'</span>);

  <span class="hljs-keyword">const</span> settings        = $(<span class="hljs-string">'#settings'</span>);
  <span class="hljs-keyword">const</span> btnSettings     = $(<span class="hljs-string">'#btnSettings'</span>);
  <span class="hljs-keyword">const</span> btnCloseSettings= $(<span class="hljs-string">'#btnCloseSettings'</span>);
  <span class="hljs-keyword">const</span> btnSaveKey      = $(<span class="hljs-string">'#btnSaveKey'</span>);
  <span class="hljs-keyword">const</span> apiKeyInput     = $(<span class="hljs-string">'#apiKey'</span>);
  <span class="hljs-keyword">const</span> apiStatus       = $(<span class="hljs-string">'#apiStatus'</span>);

  <span class="hljs-keyword">let</span> currentImageDataUrl = <span class="hljs-literal">null</span>;
  <span class="hljs-keyword">let</span> currentImageMime    = <span class="hljs-string">"image/png"</span>;

  <span class="hljs-comment">// Sanity logs</span>
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] Elements:'</span>, {
    <span class="hljs-attr">fileInput</span>: !!fileInput, <span class="hljs-attr">preview</span>: !!preview, <span class="hljs-attr">outputEl</span>: !!outputEl,
    <span class="hljs-attr">meaningEl</span>: !!meaningEl, <span class="hljs-attr">btnDescribe</span>: !!btnDescribe, <span class="hljs-attr">statusEl</span>: !!statusEl
  });

  <span class="hljs-comment">// Init API key</span>
  <span class="hljs-keyword">if</span> (apiKeyInput) apiKeyInput.value = loadApiKey() || <span class="hljs-string">""</span>;

  <span class="hljs-comment">// --- Helpers ---</span>
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">setStatus</span>(<span class="hljs-params">text</span>) </span>{
    <span class="hljs-keyword">if</span> (statusEl) statusEl.textContent = text || <span class="hljs-string">''</span>;
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton][Status]'</span>, text);
  }
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">clearOutputs</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">if</span> (outputEl) outputEl.textContent = <span class="hljs-string">''</span>;
    <span class="hljs-keyword">if</span> (meaningEl) meaningEl.textContent = <span class="hljs-string">''</span>;
    <span class="hljs-keyword">if</span> (btnSpeak) btnSpeak.disabled = <span class="hljs-literal">true</span>;
    <span class="hljs-keyword">if</span> (btnCopy)  btnCopy.disabled  = <span class="hljs-literal">true</span>;
  }
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">setOutput</span>(<span class="hljs-params">desc</span>) </span>{
    <span class="hljs-keyword">if</span> (outputEl) outputEl.textContent = desc || <span class="hljs-string">''</span>;
    <span class="hljs-keyword">const</span> meaning = mapDescriptionToMeaning(desc || <span class="hljs-string">''</span>);
    <span class="hljs-keyword">if</span> (meaningEl) meaningEl.textContent = meaning;
    <span class="hljs-keyword">if</span> (btnSpeak) btnSpeak.disabled = !meaning || meaning.includes(<span class="hljs-string">'No direct mapping'</span>);
    <span class="hljs-keyword">if</span> (btnCopy)  btnCopy.disabled  = !meaning;
    setStatus(<span class="hljs-string">'Done.'</span>);
  }
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">fileToDataURL</span>(<span class="hljs-params">file</span>) </span>{
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Promise</span>(<span class="hljs-function">(<span class="hljs-params">resolve, reject</span>) =&gt;</span> {
      <span class="hljs-keyword">const</span> reader = <span class="hljs-keyword">new</span> FileReader();
      reader.onload  = <span class="hljs-function">() =&gt;</span> resolve(reader.result);
      reader.onerror = <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> reject(e);
      reader.readAsDataURL(file);
    });
  }
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">handleFiles</span>(<span class="hljs-params">files</span>) </span>{
    <span class="hljs-keyword">const</span> file = files?.[<span class="hljs-number">0</span>];
    <span class="hljs-keyword">if</span> (!file) { setStatus(<span class="hljs-string">'No file selected.'</span>); <span class="hljs-keyword">return</span>; }
    currentImageMime = file.type || <span class="hljs-string">"image/png"</span>;
    fileToDataURL(file)
      .then(<span class="hljs-function">(<span class="hljs-params">dataUrl</span>) =&gt;</span> {
        currentImageDataUrl = dataUrl;
        <span class="hljs-keyword">if</span> (preview) {
          preview.innerHTML = <span class="hljs-string">`&lt;img alt="preview" src="<span class="hljs-subst">${dataUrl}</span>" /&gt;`</span>;
          preview.classList.remove(<span class="hljs-string">'hidden'</span>);
        }
        setStatus(<span class="hljs-string">'Image loaded. Click "Describe" to continue.'</span>);
      })
      .catch(<span class="hljs-function">(<span class="hljs-params">err</span>) =&gt;</span> {
        <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'[Makaton] fileToDataURL error'</span>, err);
        setStatus(<span class="hljs-string">'Could not read the image.'</span>);
      });
  }

  <span class="hljs-comment">// --- File input change ---</span>
  <span class="hljs-keyword">if</span> (fileInput) {
    fileInput.addEventListener(<span class="hljs-string">'change'</span>, <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> {
      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] file input change'</span>);
      handleFiles(e.target.files);
    });
  } <span class="hljs-keyword">else</span> {
    <span class="hljs-built_in">console</span>.warn(<span class="hljs-string">'[Makaton] #file input not found in DOM.'</span>);
  }

  <span class="hljs-comment">// --- Drag &amp; drop support on preview area ---</span>
  <span class="hljs-keyword">if</span> (preview) {
    preview.addEventListener(<span class="hljs-string">'dragover'</span>, <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> { e.preventDefault(); preview.classList.add(<span class="hljs-string">'drag'</span>); });
    preview.addEventListener(<span class="hljs-string">'dragleave'</span>, <span class="hljs-function">() =&gt;</span> preview.classList.remove(<span class="hljs-string">'drag'</span>));
    preview.addEventListener(<span class="hljs-string">'drop'</span>, <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> {
      e.preventDefault();
      preview.classList.remove(<span class="hljs-string">'drag'</span>);
      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] drop'</span>);
      handleFiles(e.dataTransfer?.files);
    });
  }

  <span class="hljs-comment">// --- Describe click ---</span>
  <span class="hljs-keyword">if</span> (btnDescribe) {
    btnDescribe.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-keyword">async</span> () =&gt; {
      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] Describe clicked'</span>);
      <span class="hljs-keyword">if</span> (!currentImageDataUrl) { setStatus(<span class="hljs-string">'Please upload an image first.'</span>); <span class="hljs-keyword">return</span>; }
      clearOutputs();
      setStatus(<span class="hljs-string">'Checking on-device AI availability…'</span>);

      <span class="hljs-keyword">const</span> avail = <span class="hljs-keyword">await</span> checkAvailability().catch(<span class="hljs-function">() =&gt;</span> ({ <span class="hljs-attr">nanoTextPossible</span>: <span class="hljs-literal">false</span> }));
      <span class="hljs-keyword">try</span> {
        <span class="hljs-keyword">const</span> apiKey = loadApiKey();
        <span class="hljs-keyword">if</span> (apiKey) {
          setStatus(<span class="hljs-string">'Using Gemini cloud for image description…'</span>);
          <span class="hljs-keyword">const</span> desc = <span class="hljs-keyword">await</span> describeImageWithGemini(currentImageDataUrl, apiKey, currentImageMime);
          setOutput(desc);
          <span class="hljs-keyword">return</span>;
        }
        <span class="hljs-keyword">if</span> (avail.nanoTextPossible) {
          setStatus(<span class="hljs-string">'No API key found. Using on-device AI (text) for best guess…'</span>);
          <span class="hljs-keyword">const</span> session = <span class="hljs-keyword">await</span> createNanoTextSession();
          <span class="hljs-keyword">const</span> desc = <span class="hljs-keyword">await</span> session.prompt(<span class="hljs-string">'Given an image is uploaded by the user (not directly visible to you), infer a likely one-sentence description of a common Makaton sign or symbol a teacher might upload. Keep it generic and safe.'</span>);
          setOutput(desc);
          <span class="hljs-keyword">return</span>;
        }
        setStatus(<span class="hljs-string">'No AI available. Please type a brief description.'</span>);
        <span class="hljs-keyword">if</span> (typedBox) typedBox.classList.remove(<span class="hljs-string">'hidden'</span>);
      } <span class="hljs-keyword">catch</span> (err) {
        <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'[Makaton] Describe error'</span>, err);
        setStatus(<span class="hljs-string">'Description failed: '</span> + (err?.message || err));
        <span class="hljs-keyword">if</span> (typedBox) typedBox.classList.remove(<span class="hljs-string">'hidden'</span>);
      }
    });
  } <span class="hljs-keyword">else</span> {
    <span class="hljs-built_in">console</span>.warn(<span class="hljs-string">'[Makaton] Describe button not found.'</span>);
  }

  <span class="hljs-comment">// --- Manual typing flow ---</span>
  <span class="hljs-keyword">if</span> (btnType) {
    btnType.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> {
      <span class="hljs-keyword">if</span> (typedBox) typedBox.classList.remove(<span class="hljs-string">'hidden'</span>);
      <span class="hljs-keyword">if</span> (typed) typed.focus();
    });
  }
  <span class="hljs-keyword">if</span> (btnUseTyped) {
    btnUseTyped.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> {
      <span class="hljs-keyword">const</span> text = (typed?.value || <span class="hljs-string">''</span>).trim();
      <span class="hljs-keyword">if</span> (!text) { setStatus(<span class="hljs-string">'Type a description first.'</span>); <span class="hljs-keyword">return</span>; }
      setOutput(text);
    });
  }

  <span class="hljs-comment">// --- Utilities ---</span>
  <span class="hljs-keyword">if</span> (btnSpeak) {
    btnSpeak.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> {
      <span class="hljs-keyword">const</span> text = meaningEl?.textContent?.trim();
      <span class="hljs-keyword">if</span> (!text) <span class="hljs-keyword">return</span>;
      <span class="hljs-keyword">const</span> u = <span class="hljs-keyword">new</span> SpeechSynthesisUtterance(text);
      speechSynthesis.cancel();
      speechSynthesis.speak(u);
    });
  }
  <span class="hljs-keyword">if</span> (btnCopy) {
    btnCopy.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-keyword">async</span> () =&gt; {
      <span class="hljs-keyword">const</span> text = meaningEl?.textContent?.trim();
      <span class="hljs-keyword">if</span> (!text) <span class="hljs-keyword">return</span>;
      <span class="hljs-keyword">try</span> {
        <span class="hljs-keyword">await</span> navigator.clipboard.writeText(text);
        setStatus(<span class="hljs-string">'Copied meaning to clipboard.'</span>);
      } <span class="hljs-keyword">catch</span> {
        setStatus(<span class="hljs-string">'Copy failed.'</span>);
      }
    });
  }

  <span class="hljs-comment">// --- Settings modal ---</span>
  <span class="hljs-keyword">if</span> (btnSettings &amp;&amp; settings) btnSettings.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> settings.showModal());
  <span class="hljs-keyword">if</span> (btnCloseSettings &amp;&amp; settings) btnCloseSettings.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> settings.close());
  <span class="hljs-keyword">if</span> (btnSaveKey) {
    btnSaveKey.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> {
      e.preventDefault();
      <span class="hljs-keyword">const</span> k = apiKeyInput?.value?.trim() || <span class="hljs-string">""</span>;
      saveApiKey(k);
      <span class="hljs-keyword">if</span> (apiStatus) apiStatus.textContent = k ? <span class="hljs-string">"API key saved locally. Try Describe again."</span> : <span class="hljs-string">"Cleared API key. You can still use on-device or typed mode."</span>;
    });
  }

  <span class="hljs-comment">// First status</span>
  setStatus(<span class="hljs-string">'Ready. Upload an image to begin.'</span>);
});
</code></pre>
<p>Let's break down the main sections of the <code>app.js</code> script for the Makaton AI Companion, as there’s a lot going on here:</p>
<ol>
<li><p><strong>Imports and Initial Setup:</strong></p>
<ul>
<li><p>The script imports functions from <code>mapping.js</code> and <code>ai.js</code> to handle mapping descriptions to meanings and AI interactions.</p>
</li>
<li><p>It sets up event listeners for when the DOM content is fully loaded, ensuring all elements are ready for interaction.</p>
</li>
</ul>
</li>
<li><p><strong>Element Selection:</strong></p>
<ul>
<li>It uses a helper function <code>$</code> to select DOM elements by their CSS selectors. This includes file inputs, buttons, and display areas for image previews and outputs.</li>
</ul>
</li>
<li><p><strong>Sanity Logs:</strong></p>
<ul>
<li>It logs the presence of key elements to the console for debugging purposes, ensuring that all necessary elements are found in the DOM.</li>
</ul>
</li>
<li><p><strong>API Key Initialization:</strong></p>
<ul>
<li>It loads any saved API key from local storage and sets it in the input field for user convenience.</li>
</ul>
</li>
<li><p><strong>Helper Functions:</strong></p>
<ul>
<li><p><code>setStatus</code>: Updates the status message displayed to the user.</p>
</li>
<li><p><code>clearOutputs</code>: Clears the output and meaning display areas and disables buttons for speaking and copying.</p>
</li>
<li><p><code>setOutput</code>: Displays the AI-generated description and maps it to a Makaton meaning, enabling buttons if a valid meaning is found.</p>
</li>
<li><p><code>fileToDataURL</code>: Converts an uploaded file to a data URL for image preview and processing.</p>
</li>
<li><p><code>handleFiles</code>: Handles file selection, updating the preview and setting the current image data URL.</p>
</li>
</ul>
</li>
<li><p><strong>File Input Change Handling:</strong></p>
<ul>
<li>It listens for changes in the file input, processes the selected file, and updates the preview area.</li>
</ul>
</li>
<li><p><strong>Drag &amp; Drop Support:</strong></p>
<ul>
<li>It adds drag-and-drop functionality to the preview area, allowing users to drag files directly onto the app for processing.</li>
</ul>
</li>
<li><p><strong>Describe Button Click:</strong></p>
<ul>
<li><p>It handles the "Describe" button click event, checking for an uploaded image and attempting to describe it using either the Gemini API or on-device AI.</p>
</li>
<li><p>If no AI is available, it prompts the user to type a description manually.</p>
</li>
</ul>
</li>
<li><p><strong>Manual Typing Flow:</strong></p>
<ul>
<li>It allows users to manually type a description if AI processing is unavailable or fails, updating the output with the typed text.</li>
</ul>
</li>
<li><p><strong>Utilities:</strong></p>
<ul>
<li><p><code>btnSpeak</code>: Uses the browser's SpeechSynthesis API to read aloud the mapped meaning.</p>
</li>
<li><p><code>btnCopy</code>: Copies the mapped meaning to the clipboard for easy sharing.</p>
</li>
</ul>
</li>
<li><p><strong>Settings Modal:</strong></p>
<ul>
<li>It manages the settings modal for entering and saving the API key, providing feedback on the key's status.</li>
</ul>
</li>
<li><p><strong>Initial Status:</strong></p>
<ul>
<li>It sets the initial status message to guide the user to upload an image to begin the process.</li>
</ul>
</li>
</ol>
<p>This script effectively ties together the user interface, file handling, AI processing, and output display, providing a seamless experience for translating Makaton signs into English meanings.</p>
<h4 id="heading-how-vision-and-language-work-together-here">How Vision and Language Work Together Here</h4>
<p>While working on this project, I started appreciating how computer vision and language understanding complement each other in multimodal systems like this one.</p>
<ul>
<li><p>The vision model (Gemini or Nano) interprets <em>what it sees</em> like hand shapes, gestures, or layout and turns that visual context into descriptive language.</p>
</li>
<li><p>The language mapping logic then interprets those words, infers intent, and finds the closest semantic match (e.g., “help,” “friend,” “eat”).</p>
</li>
<li><p>It’s a collaboration between two forms of understanding (<em>perceptual</em> and <em>semantic</em>) that together allow the AI to bridge the gap between gesture and meaning.</p>
</li>
</ul>
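<p>In code, that hand-off between the two stages is just two calls, as in this condensed sketch reusing the functions defined earlier:</p>
<pre><code class="lang-javascript">// Perceptual step: the vision model turns pixels into language.
const desc = await describeImageWithGemini(imageDataUrl, apiKey, "image/png");
// e.g. "A hand raised with the palm facing forward."

// Semantic step: the mapping logic turns language into meaning.
const meaning = mapDescriptionToMeaning(desc); // e.g. "Hello / Stop"
</code></pre>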
<p>This realization reshaped how I think about accessibility: the best assistive technologies often emerge not from smarter models alone, but from the interaction between modalities like seeing, describing, and reasoning in context.</p>
<h3 id="heading-6-optional-speak-and-copy">6. Optional — Speak and Copy</h3>
<p>To make the app more accessible, I added speech output and a quick copy button:</p>
<pre><code class="lang-javascript">btnSpeak.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-keyword">const</span> text = meaningEl.textContent.trim();
  <span class="hljs-keyword">if</span> (text) speechSynthesis.speak(<span class="hljs-keyword">new</span> SpeechSynthesisUtterance(text));
});

btnCopy.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-keyword">async</span> () =&gt; {
  <span class="hljs-keyword">const</span> text = meaningEl.textContent.trim();
  <span class="hljs-keyword">if</span> (text) <span class="hljs-keyword">await</span> navigator.clipboard.writeText(text);
});
</code></pre>
<p>This gives users both visual and auditory feedback, especially helpful for learners or educators.</p>
<h2 id="heading-how-to-fix-the-common-issues">How to Fix the Common Issues</h2>
<p>No AI or web integration project runs smoothly the first time – and that’s okay. Here’s a breakdown of the main issues I faced while building the Makaton AI Companion, how I diagnosed them, and how I fixed each one.</p>
<p>These lessons will help anyone trying to integrate Gemini APIs, on-device AI, or local web apps without a full backend.</p>
<h3 id="heading-1-the-cors-error-when-running-with-file">1. The “CORS” Error When Running With <code>file://</code></h3>
<p>When I first opened my <code>index.html</code> directly from my file explorer, Chrome threw several CORS policy errors:</p>
<pre><code class="lang-python">Access to script at <span class="hljs-string">'file:///lib/ai.js'</span> <span class="hljs-keyword">from</span> origin <span class="hljs-string">'null'</span> has been blocked by CORS policy.
</code></pre>
<p>At first this looked confusing, but the reason is simple: modern browsers block JavaScript modules (<code>import/export</code>) when running from <code>file://</code> paths for security reasons.</p>
<p>✅ <strong>Fix:</strong> I realized I needed to serve the files over <strong>HTTP</strong>, not from the file system. So I ran a quick local web server using Python:</p>
<pre><code class="lang-python">python -m http.server <span class="hljs-number">8080</span>
</code></pre>
<p>Then opened:</p>
<pre><code class="lang-python">http://localhost:<span class="hljs-number">8080</span>/index.html
</code></pre>
<p>That single step fixed all the CORS errors and allowed my modules to load correctly.</p>
<h3 id="heading-2-model-not-found-404-from-the-gemini-api">2. “Model Not Found” (404) From the Gemini API</h3>
<p>The next big challenge came from the Gemini API. Even though I had a valid API key, my console showed this error:</p>
<pre><code class="lang-python"><span class="hljs-string">"models/gemini-1.5-flash"</span> <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> found <span class="hljs-keyword">for</span> API version v1beta, <span class="hljs-keyword">or</span> <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> supported <span class="hljs-keyword">for</span> generateContent.
</code></pre>
<p>It turns out Google’s API endpoints can vary slightly depending on your project setup and key permissions.</p>
<p>✅ <strong>Fix:</strong> I rewrote my <code>lib/ai.js</code> script to automatically <strong>try multiple Gemini model endpoints</strong> until it found one that worked. Something like this:</p>
<pre><code class="lang-python">const GEMINI_IMAGE_ENDPOINTS = [
  <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/models/gemini-1.5-flash:generateContent"</span>,
  <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/models/gemini-1.5-pro:generateContent"</span>,
  <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/models/gemini-1.5-flash-latest:generateContent"</span>,
];
</code></pre>
<p>And I wrapped it in a loop that stopped once one endpoint succeeded.</p>
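<p>Here’s a minimal sketch of that loop, assuming the standard <code>generateContent</code> request shape. The prompt text and error handling are simplified for illustration:</p>
<pre><code class="lang-javascript">// Sketch: try each endpoint in turn until one accepts the request.
async function describeWithGemini(apiKey, base64Image) {
  const body = {
    contents: [{
      parts: [
        { text: 'Describe this Makaton sign briefly.' }, // illustrative prompt
        { inline_data: { mime_type: 'image/png', data: base64Image } },
      ],
    }],
  };
  for (const endpoint of GEMINI_IMAGE_ENDPOINTS) {
    const res = await fetch(`${endpoint}?key=${apiKey}`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    });
    if (!res.ok) continue; // e.g. a 404, so try the next endpoint
    const data = await res.json();
    return data.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
  }
  throw new Error('No Gemini endpoint accepted the request.');
}
</code></pre>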
<p>Later, I improved it further by listing the available models dynamically via<br><code>https://generativelanguage.googleapis.com/v1/models?key=YOUR_KEY</code> and automatically trying whichever ones supported image input with <code>generateContent</code>.</p>
<p>That dynamic discovery approach fixed the 404 errors permanently.</p>
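<p>For reference, that discovery step can be sketched like this, assuming the v1 <code>ListModels</code> response shape, where each model entry exposes a <code>supportedGenerationMethods</code> list:</p>
<pre><code class="lang-javascript">// Sketch: list the models this key can use and keep the generateContent-capable ones.
async function listUsableModels(apiKey) {
  const res = await fetch(
    `https://generativelanguage.googleapis.com/v1/models?key=${apiKey}`
  );
  const { models = [] } = await res.json();
  return models
    .filter((m) =&gt; (m.supportedGenerationMethods || []).includes('generateContent'))
    .map((m) =&gt; m.name); // e.g. "models/gemini-1.5-flash"
}
</code></pre>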
<h3 id="heading-3-packaging-a-local-single-file-version"><strong>3. Packaging a Local Single-File Version</strong></h3>
<p>Once I got everything working, I wanted a version that others could test easily without installing Node.js or running build tools.</p>
<p>✅ <strong>Fix:</strong> I bundled the project into a simple zip file containing:</p>
<pre><code class="lang-python">index.html
app.js
lib/ai.js
lib/mapping.js
styles.css
</code></pre>
<p>That way, anyone can just unzip and run:</p>
<pre><code class="lang-python">python -m http.server <span class="hljs-number">8080</span>
</code></pre>
<p>and open <code>localhost:8080</code>.</p>
<p>Everything runs locally in the browser, no server-side code required. This also makes it perfect for demos, classrooms, and so on.</p>
<h3 id="heading-4-debugging-script-import-errors-in-the-console">4. Debugging Script Import Errors in the Console</h3>
<p>Another subtle issue appeared when I noticed this red message:</p>
<pre><code class="lang-python">The requested module <span class="hljs-string">'./lib/mapping.js'</span> does <span class="hljs-keyword">not</span> provide an export named <span class="hljs-string">'mapDescriptionToMeaning'</span>
</code></pre>
<p>That line told me exactly what was wrong: my import and export function names didn’t match. The fix was straightforward:</p>
<pre><code class="lang-python">// app.js
<span class="hljs-keyword">import</span> { mapDescriptionToMeaning } <span class="hljs-keyword">from</span> <span class="hljs-string">'./lib/mapping.js'</span>;
</code></pre>
<p>And then ensuring the mapping file exported it:</p>
<pre><code class="lang-python">// mapping.js
export function mapDescriptionToMeaning(desc) { ... }
</code></pre>
<p>After that, all the pieces connected smoothly.</p>
<p>Using the browser console <strong>as my debugging dashboard</strong> turned out to be the most powerful tool of all. Every fix started by reading and reasoning about those red error lines.</p>
<h2 id="heading-demo-the-makaton-ai-companion-in-action">Demo: The Makaton AI Companion in Action</h2>
<p>Let’s see the Makaton AI Companion in action and understand what’s happening under the hood.</p>
<h3 id="heading-step-1-run-the-app-locally">Step 1: Run the app locally</h3>
<p>Once you’ve downloaded or cloned the project folder, open your terminal in that directory and start a local development server: <code>python -m http.server 8080</code>. Then open your browser and visit: <code>http://localhost:8080/index.html</code></p>
<p>You should see the Makaton AI Companion interface:</p>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/app-interface.jpg?raw=true" alt="Main interface of the Makaton AI Companion app" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-2-get-your-gemini-api-key">Step 2: Get Your Gemini API Key</h3>
<p>To enable cloud-based image description, you’ll need a <a target="_blank" href="https://aistudio.google.com/welcome"><strong>Gemini API key</strong></a> from Google AI Studio.</p>
<p><strong>Here’s how to generate one:</strong></p>
<ol>
<li><p>Visit: <code>https://aistudio.google.com/welcome</code></p>
</li>
<li><p>Click <strong>“Create API key”</strong> and link it to your Google Cloud project (or create a new one).</p>
</li>
<li><p>Copy the key; it will look something like this: <code>AIzaSyA...XXXXXXXXXXXX</code></p>
</li>
<li><p>Open the Makaton AI Companion in your browser and click the <strong>Settings</strong> button (top left).</p>
</li>
<li><p>Paste your key in the input box and click <strong>Save</strong>.</p>
</li>
</ol>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/api-key-setting.jpg?raw=true" alt="Setting up the OpenAI API key in the app interface" width="600" height="400" loading="lazy"></p>
<p>You’ll see a confirmation message like this:</p>
<blockquote>
<p><em>“API key saved locally. Try Describe again.”</em></p>
</blockquote>
<p>This means your key is stored safely in localStorage and is only accessible from your browser.</p>
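<p>Under the hood, saving the key is just a <code>localStorage</code> write. Here’s a two-line sketch; the storage key name is an assumption, not necessarily what the app uses:</p>
<pre><code class="lang-javascript">// Persist the key locally (assumed storage key name).
localStorage.setItem('gemini_api_key', apiKeyInput.value.trim());
// On page load, restore it; this returns null if nothing was saved yet.
const savedKey = localStorage.getItem('gemini_api_key');
</code></pre>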
<h3 id="heading-step-3-enable-gemini-nano-for-on-device-ai">Step 3: Enable Gemini Nano for On-Device AI</h3>
<p>If you’re using <a target="_blank" href="https://www.google.com/intl/en_uk/chrome/canary/"><strong>Chrome Canary</strong></a>, you can run Gemini Nano locally without internet access. This allows the Makaton AI Companion to generate text even when the API key isn’t set.</p>
<h4 id="heading-download-and-install-chrome-canary">Download and Install Chrome Canary:</h4>
<p>Visit the official Chrome Canary download page and install it on your Windows or macOS system. Chrome Canary is a special version of Chrome designed for developers and early adopters, offering the latest features and updates.</p>
<h4 id="heading-enable-gemini-nano">Enable Gemini Nano:</h4>
<p>Open Chrome Canary and type <code>chrome://flags/#prompt-api-for-gemini-nano</code> in the address bar.</p>
<p>Locate the "Prompt API for Gemini Nano" flag in the list. Set this flag to <strong>Enabled</strong>. This action allows Chrome Canary to support the Gemini Nano model for on-device AI processing.</p>
<p>After enabling the flag, relaunch Chrome Canary to apply the changes.</p>
<h4 id="heading-download-the-gemini-nano-model">Download the Gemini Nano Model:</h4>
<p>Open a new tab in Chrome Canary and enter <code>chrome://components</code> in the address bar.</p>
<p>Scroll down to find the <strong>“Optimization Guide”</strong> component. Click on <strong>Check for update</strong>. This action will initiate the download of the Gemini Nano model, which is necessary for running AI tasks locally without an internet connection.</p>
<h4 id="heading-verify-installation">Verify Installation:</h4>
<p>Once the Gemini Nano model is installed, the Makaton AI Companion app will automatically detect it. You should see a message indicating that the app is using on-device AI: <em>“No API key found. Using on-device AI (text) for best guess…”</em></p>
<p>This confirmation means that the app can now generate text descriptions using the Gemini Nano model without needing an API key or internet access.</p>
<p>By following these detailed steps, you ensure that the Gemini Nano model is correctly set up and ready to use for on-device AI processing in the Makaton AI Companion.</p>
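<p>For context, detecting Nano from JavaScript looked roughly like this at the time of writing. The Prompt API is experimental and its surface has changed between Canary releases, so treat this as a sketch rather than a stable API:</p>
<pre><code class="lang-javascript">// Sketch only: the experimental Prompt API shape varies across Canary builds.
async function tryOnDeviceAI(promptText) {
  if (!('ai' in window) || !window.ai?.languageModel) return null; // Nano unavailable
  const session = await window.ai.languageModel.create();
  return session.prompt(promptText); // returns a plain-text best guess
}
</code></pre>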
<h3 id="heading-step-4-upload-a-makaton-sign-or-symbol">Step 4: Upload a Makaton sign or symbol</h3>
<p>Click <strong>Choose File</strong> to upload any Makaton image (for example, the “help” sign), then press <strong>Describe (Cloud or Nano)</strong>. You’ll immediately see console logs confirming that the app is running correctly and connecting to the Gemini API:</p>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/console.jpg?raw=true" alt="Console output showing real-time translation logs" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-5-ai-description-and-mapping">Step 5: AI Description and Mapping</h3>
<p>Here’s what happens next:</p>
<ol>
<li><p>The image is read and encoded as Base64.</p>
</li>
<li><p>The Gemini API (cloud or on-device) generates a short visual description.</p>
</li>
<li><p>The description is passed to the <code>mapDescriptionToMeaning()</code> function (a short glue sketch follows this list).</p>
</li>
<li><p>If keywords match an entry in the <code>MAKATON_GLOSSES</code> dictionary, the app displays the corresponding English meaning.</p>
</li>
<li><p>Finally, users can click <strong>Speak</strong> or <strong>Copy</strong> to hear or reuse the translation.</p>
</li>
</ol>
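<p>Putting those steps together, the glue code amounts to something like this sketch. It reuses the names introduced earlier; the exact identifiers in the app may differ:</p>
<pre><code class="lang-javascript">// Sketch of the describe flow: data URL -&gt; base64 -&gt; description -&gt; meaning.
const base64 = currentImageDataURL.split(',')[1]; // strip the "data:image/...;base64," prefix
const description = await describeWithGemini(apiKey, base64);
const meaning = mapDescriptionToMeaning(description);
setOutput(description, meaning); // enables Speak/Copy when a meaning is found
</code></pre>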
<p>Example outputs:</p>
<p><strong>When no mapping is found:</strong><br>The AI description is accurate but doesn’t yet match a known Makaton keyword.</p>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/Incorrect-demonstration.jpg?raw=true" alt="Incorrect demonstration showing the model misinterpreting a sign" width="600" height="400" loading="lazy"></p>
<p><strong>After updating the mapping list:</strong><br>Adding new keywords like <code>"help"</code>, <code>"assist"</code>, or <code>"hand over hand"</code> enables correct translation.</p>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/correct-demonstration.jpg?raw=true" alt="Correct demonstration where the AI accurately recognizes the Makaton sign" width="600" height="400" loading="lazy"></p>
<h3 id="heading-why-this-matters">Why this matters</h3>
<p>This demonstrates how accessible, AI-assisted tools can support communication for people who rely on Makaton. Even when a gesture isn’t recognized, the system provides a structured output and allows users or educators to expand the mapping list, making the tool smarter over time.</p>
<h2 id="heading-broader-reflections">Broader Reflections</h2>
<p>Building this project turned out to be much more than a coding exercise for me.<br>It was a meaningful experiment in combining accessibility, natural language processing, and computer vision. These three fields, when brought together, can create real social impact.</p>
<p>While working on it, I began to understand how computer vision and language understanding complement each other in practice. The vision model perceives the world by identifying shapes, gestures, and spatial patterns, while the language model interprets what those visuals mean in human terms.<br>In this project, the artificial intelligence system first sees the Makaton sign, then describes it, and finally maps it to an English word that carries intent and meaning.</p>
<p>This interaction between perception and semantics is what makes multimodal artificial intelligence so powerful. It is not only about recognizing an image or generating text; it is about building systems that connect understanding across different forms of information to make technology more inclusive and human centered.</p>
<p>This realization changed how I think about accessibility technology. True innovation happens not only through smarter models but through the harmony between seeing and understanding, between what an artificial intelligence system observes and how it communicates that observation to help people.</p>
<h3 id="heading-accessibility-meets-ai">Accessibility Meets AI</h3>
<p>Working on this project reminded me that accessibility isn’t just about compliance or assistive devices. It’s also about inclusion. A simple AI system that can describe a hand gesture or symbol in real time can empower teachers, parents, and students who communicate using Makaton or similar systems.</p>
<p>By mapping AI-generated descriptions to meaningful phrases, the app demonstrates how AI can support inclusive education, even at small scales. It bridges the communication gap between verbal and nonverbal learners, which is something that traditional translation systems often overlook.</p>
<h3 id="heading-integrating-nlp-and-computer-vision">Integrating NLP and Computer Vision</h3>
<p>On the technical side, this project showed me how naturally computer vision and language understanding complement each other. The Gemini API’s multimodal models were able to analyze an image and produce coherent natural-language sentences, something that older APIs couldn’t do without chaining multiple tools.</p>
<p>By feeding that output into a lightweight NLP mapping function, I was able to simulate a very early-stage symbol-to-language translator, the core of my broader research interest in automatic Makaton-to-English translation.</p>
<h3 id="heading-why-local-ai-gemini-nano-matters">Why Local AI (Gemini Nano) Matters</h3>
<p>While the cloud models are powerful, experimenting with Gemini Nano revealed something exciting:<br>on-device AI can make accessibility tools faster, safer, and more private.</p>
<p>In classrooms or therapy sessions, you often can’t rely on stable internet connections or share sensitive student data. Running inference locally means learners’ gestures or symbol images never leave the device, a crucial step toward privacy-preserving accessibility AI.</p>
<p>And since Nano runs directly inside Chrome Canary, it shows how AI is becoming embedded at the browser level, lowering barriers for teachers and developers to build inclusive solutions without needing large infrastructure.</p>
<h3 id="heading-looking-forward">Looking Forward</h3>
<p>This prototype is just a starting point. Future iterations could integrate gesture recognition directly from camera input, support multiple symbol sets, or even learn from user feedback to expand the dictionary automatically.</p>
<p>Most importantly, it reinforces a central belief in my research and teaching journey:</p>
<p><strong>Accessibility innovation doesn’t require massive systems. It starts with curiosity, empathy, and a few lines of purposeful code.</strong></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building the Makaton AI Companion has been one of the most rewarding projects in my AI journey – not just because it worked, but because it proved how accessible innovation can be.</p>
<p>With just a browser, a few lines of JavaScript, and the right API, I was able to combine computer vision, language understanding, and accessibility design into a working system that translates symbols into meaning. It’s a small step toward a future where anyone, regardless of speech or language ability, can be understood through technology.</p>
<p>The project also reinforced something deeply personal to me as a researcher and educator: that AI for accessibility doesn’t need to be complex, expensive, or centralized. It can be lightweight, open, and built with empathy by anyone who’s willing to learn and experiment.</p>
<h3 id="heading-join-the-conversation">Join the Conversation</h3>
<p>If this project inspires you, I’d love to see your own experiments and improvements. Can you make it support live webcam gestures? Could you adapt it for other symbol systems, like PECS or BSL?</p>
<p>Share your ideas in the comments or tag me if you publish your own version. Together, we can grow a small prototype into a community-driven accessibility tool and continue exploring how AI can give more people a voice.</p>
<p>Full source code on GitHub: <a target="_blank" href="https://github.com/tayo4christ/makaton-ai-companion">Makaton-ai-companion</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Transformers for Real-Time Gesture Recognition ]]>
                </title>
                <description>
                    <![CDATA[ Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are no... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/using-transformers-for-real-time-gesture-recognition/</link>
                <guid isPermaLink="false">68e3c692aa82abf4b593114c</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ transformers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pytorch ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ONNX ]]>
                    </category>
                
                    <category>
                        <![CDATA[ gradio ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Gesture Recognition ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Accessibility ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Tutorial ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Mon, 06 Oct 2025 13:39:30 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759757931295/5f19fd4e-93c0-4bd7-a75c-a7858e061ecd.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are not static images. Rather, they unfold over time. To build more robust, real-time systems, we need models that can capture both spatial details and temporal context.</p>
<p>This is where Transformers come in. Originally built for language, they’ve become state-of-the-art in vision tasks thanks to models like the Vision Transformer (ViT) and video-focused variants such as TimeSformer.</p>
<p>In this tutorial, we’ll use a Transformer backbone to create a lightweight real-time gesture recognition tool, optimized for small datasets and deployable on a regular laptop webcam.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-transformers-for-gestures">Why Transformers for Gestures?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-youll-learn">What You’ll Learn</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-generate-a-gesture-dataset">Generate a Gesture Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-1-generate-a-synthetic-dataset">Option 1: Generate a Synthetic Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-training-script-trainpy">Training Script:</a> <a target="_blank" href="http://train.py">train.py</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-export-the-model-to-onnx">Export the Model to ONNX</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-evaluate-accuracy-latency">Evaluate Accuracy + Latency</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-2-use-small-samples-from-public-gesture-datasets">Option 2: Use Small Samples from Public Gesture Datasets</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-accessibility-notes-amp-ethical-limits">Accessibility Notes &amp; Ethical Limits</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-next-steps">Next Steps</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-why-transformers-for-gestures">Why Transformers for Gestures?</h2>
<p>Transformers are powerful because they use self-attention to model relationships across a sequence. For gestures, this means the model doesn’t just see isolated frames, but also learns how movements evolve over time. A wave, for example, looks different from a raised hand only when viewed as a sequence.</p>
<p>Vision Transformers process images as patches, while video Transformers extend this to multiple frames with temporal attention. Even a simple approach, like applying ViT to each frame and pooling across time, can outperform traditional CNN-based methods for small datasets.</p>
<p>Combined with Hugging Face’s pre-trained models and ONNX Runtime for optimization, Transformers make it possible to train on a modest dataset and still achieve smooth real-time recognition.</p>
<h2 id="heading-what-youll-learn">What You’ll Learn</h2>
<p>In this tutorial, you’ll build a gesture recognition system using Transformers. By the end, you’ll know how to:</p>
<ul>
<li><p>Create (or record) a tiny gesture dataset</p>
</li>
<li><p>Train a Vision Transformer (ViT) with temporal pooling</p>
</li>
<li><p>Export the model to ONNX for faster inference</p>
</li>
<li><p>Build a real-time Gradio app that classifies gestures from your webcam</p>
</li>
<li><p>Evaluate your model’s accuracy and latency with simple scripts</p>
</li>
<li><p>Understand the accessibility potential and ethical limits of gesture recognition</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along, you should have:</p>
<ul>
<li><p>Basic Python knowledge (functions, scripts, virtual environments)</p>
</li>
<li><p>Familiarity with PyTorch (tensors, datasets, training loops) – helpful but not required</p>
</li>
<li><p>Python 3.8+ installed on your system</p>
</li>
<li><p>A webcam (for the live demo in Gradio)</p>
</li>
<li><p>Optionally: GPU access (training on CPU works, but is slower)</p>
</li>
</ul>
<h2 id="heading-project-setup">Project Setup</h2>
<p>Create a new project folder and install the required libraries.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create a new project directory and navigate into it</span>
mkdir transformer-gesture &amp;&amp; <span class="hljs-built_in">cd</span> transformer-gesture

<span class="hljs-comment"># Set up a Python virtual environment</span>
python -m venv .venv

<span class="hljs-comment"># Activate the virtual environment</span>
<span class="hljs-comment"># Windows PowerShell</span>
.venv\Scripts\Activate.ps1

<span class="hljs-comment"># macOS/Linux</span>
<span class="hljs-built_in">source</span> .venv/bin/activate
</code></pre>
<p>The provided code snippet is a set of commands for setting up a new Python project with a virtual environment. Here's a breakdown of each part:</p>
<ol>
<li><p><code>mkdir transformer-gesture &amp;&amp; cd transformer-gesture</code>: This command creates a new directory named "transformer-gesture" and then navigates into it.</p>
</li>
<li><p><code>python -m venv .venv</code>: This command creates a new virtual environment in the current directory. The virtual environment is stored in a folder named ".venv".</p>
</li>
<li><p>Activating the virtual environment:</p>
<ul>
<li><p>For Windows PowerShell, you can use <code>.venv\Scripts\Activate.ps1</code> to activate the virtual environment.</p>
</li>
<li><p>For macOS/Linux, use <code>source .venv/bin/activate</code> to activate the virtual environment.</p>
</li>
</ul>
</li>
</ol>
<p>Activating a virtual environment ensures that the Python interpreter and any packages you install are isolated to this specific project, preventing conflicts with other projects or system-wide packages.</p>
<p>Create a <code>requirements.txt</code> file:</p>
<pre><code class="lang-plaintext">torch&gt;=2.0
torchvision
torchaudio
timm
huggingface_hub

onnx
onnxruntime

gradio

numpy
opencv-python
pillow

matplotlib
seaborn
scikit-learn
</code></pre>
<p>The list provided is a set of package dependencies typically found in a <code>requirements.txt</code> file for a Python project. Here's a brief explanation of each package:</p>
<ol>
<li><p><strong>torch&gt;=2.0</strong>: PyTorch is a popular open-source deep learning framework that provides a flexible and efficient platform for building and training neural networks. Version 2.0 and above includes improvements in performance and new features.</p>
</li>
<li><p><strong>torchvision</strong>: This library is part of the PyTorch ecosystem and provides tools for computer vision tasks, including datasets, model architectures, and image transformations.</p>
</li>
<li><p><strong>torchaudio</strong>: Also part of the PyTorch ecosystem, Torchaudio provides audio processing tools and datasets, making it easier to work with audio data in deep learning projects.</p>
</li>
<li><p><strong>timm</strong>: The PyTorch Image Models (timm) library offers a collection of pre-trained models and utilities for computer vision tasks, facilitating quick experimentation and deployment.</p>
</li>
<li><p><strong>huggingface_hub</strong>: This library allows easy access to models and datasets hosted on the Hugging Face Hub, a platform for sharing and collaborating on machine learning models and datasets.</p>
</li>
<li><p><strong>onnx</strong>: The Open Neural Network Exchange (ONNX) format is used to represent machine learning models, enabling interoperability between different frameworks.</p>
</li>
<li><p><strong>onnxruntime</strong>: This is a high-performance runtime for executing ONNX models, allowing for efficient deployment across various platforms.</p>
</li>
<li><p><strong>gradio</strong>: Gradio is a library for creating user interfaces for machine learning models, making them accessible through a web interface for easy interaction and testing.</p>
</li>
<li><p><strong>numpy</strong>: A fundamental package for numerical computing in Python, providing support for arrays and a wide range of mathematical functions.</p>
</li>
<li><p><strong>opencv-python</strong>: OpenCV is a library for computer vision and image processing tasks, widely used for real-time applications.</p>
</li>
<li><p><strong>pillow</strong>: A Python Imaging Library (PIL) fork, Pillow provides tools for opening, manipulating, and saving many different image file formats.</p>
</li>
<li><p><strong>matplotlib</strong>: A plotting library for Python, Matplotlib is used for creating static, interactive, and animated visualizations in Python.</p>
</li>
<li><p><strong>seaborn</strong>: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.</p>
</li>
<li><p><strong>scikit-learn</strong>: A machine learning library in Python that provides simple and efficient tools for data analysis and modeling, including classification, regression, clustering, and dimensionality reduction.</p>
</li>
</ol>
<p>Install dependencies:</p>
<pre><code class="lang-bash">pip install -r requirements.txt
</code></pre>
<p>The command <code>pip install -r requirements.txt</code> is used to install all the Python packages listed in a file named <code>requirements.txt</code>. This file typically contains a list of package dependencies required for a Python project, each specified with a package name and optionally a version number.</p>
<p>By running this command, <code>pip</code>, which is the Python package installer, reads the file and installs each package listed, ensuring that the project has all the necessary dependencies to run properly. This is a common practice in Python projects to manage and share dependencies easily.</p>
<h2 id="heading-generate-a-gesture-dataset">Generate a Gesture Dataset</h2>
<p>To train our Transformer-based gesture recognizer, we need some data. Instead of downloading a huge dataset, we’ll start with a tiny synthetic dataset you can generate in seconds. This makes the tutorial lightweight and ensures that everyone can follow along without dealing with multi-gigabyte downloads.</p>
<h2 id="heading-option-1-generate-a-synthetic-dataset">Option 1: Generate a Synthetic Dataset</h2>
<p>We’ll use a small Python script that creates short <code>.mp4</code> clips of a moving (or still) coloured box. Each class represents a gesture:</p>
<ul>
<li><p><strong>swipe_left</strong> – box moves from right to left</p>
</li>
<li><p><strong>swipe_right</strong> – box moves from left to right</p>
</li>
<li><p><strong>stop</strong> – box stays still in the center</p>
</li>
</ul>
<p>Save this script as <code>generate_synthetic_gestures.py</code> in your project root:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os, cv2, numpy <span class="hljs-keyword">as</span> np, random, argparse

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ensure_dir</span>(<span class="hljs-params">p</span>):</span> os.makedirs(p, exist_ok=<span class="hljs-literal">True</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">make_clip</span>(<span class="hljs-params">mode, out_path, seconds=<span class="hljs-number">1.5</span>, fps=<span class="hljs-number">16</span>, size=<span class="hljs-number">224</span>, box_size=<span class="hljs-number">60</span>, seed=<span class="hljs-number">0</span>, codec=<span class="hljs-string">"mp4v"</span></span>):</span>
    rng = random.Random(seed)
    frames = int(seconds * fps)
    H = W = size

    <span class="hljs-comment"># background + box color</span>
    bg_val = rng.randint(<span class="hljs-number">160</span>, <span class="hljs-number">220</span>)
    bg = np.full((H, W, <span class="hljs-number">3</span>), bg_val, dtype=np.uint8)
    color = (rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>), rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>), rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>))

    <span class="hljs-comment"># path of motion</span>
    y = rng.randint(<span class="hljs-number">40</span>, H - <span class="hljs-number">40</span> - box_size)
    <span class="hljs-keyword">if</span> mode == <span class="hljs-string">"swipe_left"</span>:
        x_start, x_end = W - <span class="hljs-number">20</span> - box_size, <span class="hljs-number">20</span>
    <span class="hljs-keyword">elif</span> mode == <span class="hljs-string">"swipe_right"</span>:
        x_start, x_end = <span class="hljs-number">20</span>, W - <span class="hljs-number">20</span> - box_size
    <span class="hljs-keyword">elif</span> mode == <span class="hljs-string">"stop"</span>:
        x_start = x_end = (W - box_size) // <span class="hljs-number">2</span>
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"Unknown mode: <span class="hljs-subst">{mode}</span>"</span>)

    fourcc = cv2.VideoWriter_fourcc(*codec)
    vw = cv2.VideoWriter(out_path, fourcc, fps, (W, H))
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> vw.isOpened():
        <span class="hljs-keyword">raise</span> RuntimeError(
            <span class="hljs-string">f"Could not open VideoWriter with codec '<span class="hljs-subst">{codec}</span>'. "</span>
            <span class="hljs-string">"Try --codec XVID and use .avi extension, e.g. out.avi"</span>
        )

    <span class="hljs-keyword">for</span> t <span class="hljs-keyword">in</span> range(frames):
        alpha = t / max(<span class="hljs-number">1</span>, frames - <span class="hljs-number">1</span>)
        x = int((<span class="hljs-number">1</span> - alpha) * x_start + alpha * x_end)
        <span class="hljs-comment"># small jitter to avoid being too synthetic</span>
        jitter_x, jitter_y = rng.randint(<span class="hljs-number">-2</span>, <span class="hljs-number">2</span>), rng.randint(<span class="hljs-number">-2</span>, <span class="hljs-number">2</span>)
        frame = bg.copy()
        cv2.rectangle(frame, (x + jitter_x, y + jitter_y),
                      (x + jitter_x + box_size, y + jitter_y + box_size),
                      color, thickness=<span class="hljs-number">-1</span>)
        <span class="hljs-comment"># overlay text</span>
        cv2.putText(frame, mode, (<span class="hljs-number">8</span>, <span class="hljs-number">24</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">0.7</span>, (<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>), <span class="hljs-number">2</span>, cv2.LINE_AA)
        cv2.putText(frame, mode, (<span class="hljs-number">8</span>, <span class="hljs-number">24</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">0.7</span>, (<span class="hljs-number">255</span>, <span class="hljs-number">255</span>, <span class="hljs-number">255</span>), <span class="hljs-number">1</span>, cv2.LINE_AA)
        vw.write(frame)

    vw.release()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_labels</span>(<span class="hljs-params">labels, out_dir</span>):</span>
    <span class="hljs-keyword">with</span> open(os.path.join(out_dir, <span class="hljs-string">"labels.txt"</span>), <span class="hljs-string">"w"</span>, encoding=<span class="hljs-string">"utf-8"</span>) <span class="hljs-keyword">as</span> f:
        <span class="hljs-keyword">for</span> c <span class="hljs-keyword">in</span> labels:
            f.write(c + <span class="hljs-string">"\n"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    ap = argparse.ArgumentParser(description=<span class="hljs-string">"Generate a tiny synthetic gesture dataset."</span>)
    ap.add_argument(<span class="hljs-string">"--out"</span>, default=<span class="hljs-string">"data"</span>, help=<span class="hljs-string">"Output directory (default: data)"</span>)
    ap.add_argument(<span class="hljs-string">"--classes"</span>, nargs=<span class="hljs-string">"+"</span>,
                    default=[<span class="hljs-string">"swipe_left"</span>, <span class="hljs-string">"swipe_right"</span>, <span class="hljs-string">"stop"</span>],
                    help=<span class="hljs-string">"Class names (default: swipe_left swipe_right stop)"</span>)
    ap.add_argument(<span class="hljs-string">"--clips"</span>, type=int, default=<span class="hljs-number">16</span>, help=<span class="hljs-string">"Clips per class (default: 16)"</span>)
    ap.add_argument(<span class="hljs-string">"--seconds"</span>, type=float, default=<span class="hljs-number">1.5</span>, help=<span class="hljs-string">"Seconds per clip (default: 1.5)"</span>)
    ap.add_argument(<span class="hljs-string">"--fps"</span>, type=int, default=<span class="hljs-number">16</span>, help=<span class="hljs-string">"Frames per second (default: 16)"</span>)
    ap.add_argument(<span class="hljs-string">"--size"</span>, type=int, default=<span class="hljs-number">224</span>, help=<span class="hljs-string">"Frame size WxH (default: 224)"</span>)
    ap.add_argument(<span class="hljs-string">"--box"</span>, type=int, default=<span class="hljs-number">60</span>, help=<span class="hljs-string">"Box size (default: 60)"</span>)
    ap.add_argument(<span class="hljs-string">"--codec"</span>, default=<span class="hljs-string">"mp4v"</span>, help=<span class="hljs-string">"Codec fourcc (mp4v or XVID)"</span>)
    ap.add_argument(<span class="hljs-string">"--ext"</span>, default=<span class="hljs-string">".mp4"</span>, help=<span class="hljs-string">"File extension (.mp4 or .avi)"</span>)
    args = ap.parse_args()

    ensure_dir(args.out)
    write_labels(args.classes, <span class="hljs-string">"."</span>)  <span class="hljs-comment"># writes labels.txt to project root</span>

    print(<span class="hljs-string">f"Generating synthetic dataset -&gt; <span class="hljs-subst">{args.out}</span>"</span>)
    <span class="hljs-keyword">for</span> cls <span class="hljs-keyword">in</span> args.classes:
        cls_dir = os.path.join(args.out, cls)
        ensure_dir(cls_dir)
        mode = <span class="hljs-string">"stop"</span> <span class="hljs-keyword">if</span> cls == <span class="hljs-string">"stop"</span> <span class="hljs-keyword">else</span> (<span class="hljs-string">"swipe_left"</span> <span class="hljs-keyword">if</span> <span class="hljs-string">"left"</span> <span class="hljs-keyword">in</span> cls <span class="hljs-keyword">else</span> (<span class="hljs-string">"swipe_right"</span> <span class="hljs-keyword">if</span> <span class="hljs-string">"right"</span> <span class="hljs-keyword">in</span> cls <span class="hljs-keyword">else</span> <span class="hljs-string">"stop"</span>))
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(args.clips):
            filename = os.path.join(cls_dir, <span class="hljs-string">f"<span class="hljs-subst">{cls}</span>_<span class="hljs-subst">{i+<span class="hljs-number">1</span>:<span class="hljs-number">03</span>d}</span><span class="hljs-subst">{args.ext}</span>"</span>)
            make_clip(
                mode=mode,
                out_path=filename,
                seconds=args.seconds,
                fps=args.fps,
                size=args.size,
                box_size=args.box,
                seed=i + <span class="hljs-number">1</span>,
                codec=args.codec
            )
        print(<span class="hljs-string">f"  <span class="hljs-subst">{cls}</span>: <span class="hljs-subst">{args.clips}</span> clips"</span>)

    print(<span class="hljs-string">"Done. You can now run: python train.py, python export_onnx.py, python app.py"</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<p>The script generates a synthetic gesture dataset by creating video clips of a moving or stationary coloured box, simulating gestures like "swipe left," "swipe right," and "stop," and saves them in a specified output directory.</p>
<p>Now run it inside your virtual environment:</p>
<pre><code class="lang-bash">python generate_synthetic_gestures.py --out data --clips 16 --seconds 1.5
</code></pre>
<p>The command above runs a Python script named <code>generate_synthetic_gestures.py</code>, which generates a synthetic gesture dataset with 16 clips per gesture, each lasting 1.5 seconds, and saves the output in a directory named "data".</p>
<p>This creates a dataset like:</p>
<pre><code class="lang-plaintext">data/
  swipe_left/*.mp4
  swipe_right/*.mp4
  stop/*.mp4
labels.txt
</code></pre>
<p>Each folder contains short clips of a moving (or still) box that simulate gestures. This is perfect for testing the pipeline.</p>
<h3 id="heading-training-script-trainpy">Training Script: <code>train.py</code></h3>
<p>Now that we have our dataset, let’s fine-tune a Vision Transformer with temporal pooling. This model applies ViT frame-by-frame, averages embeddings across time, and trains a classification head on your gestures.</p>
<p>Here’s the full training script:</p>
<pre><code class="lang-python"><span class="hljs-comment"># train.py</span>
<span class="hljs-keyword">import</span> torch, torch.nn <span class="hljs-keyword">as</span> nn, torch.optim <span class="hljs-keyword">as</span> optim
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader
<span class="hljs-keyword">import</span> timm
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> GestureClips, read_labels

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ViTTemporal</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-string">"""Frame-wise ViT encoder -&gt; mean pool over time -&gt; linear head."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, num_classes, vit_name=<span class="hljs-string">"vit_tiny_patch16_224"</span></span>):</span>
        super().__init__()
        self.vit = timm.create_model(vit_name, pretrained=<span class="hljs-literal">True</span>, num_classes=<span class="hljs-number">0</span>, global_pool=<span class="hljs-string">"avg"</span>)
        feat_dim = self.vit.num_features
        self.head = nn.Linear(feat_dim, num_classes)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>  <span class="hljs-comment"># x: (B,T,C,H,W)</span>
        B, T, C, H, W = x.shape
        x = x.view(B * T, C, H, W)
        feats = self.vit(x)                  <span class="hljs-comment"># (B*T, D)</span>
        feats = feats.view(B, T, <span class="hljs-number">-1</span>).mean(dim=<span class="hljs-number">1</span>)  <span class="hljs-comment"># (B, D)</span>
        <span class="hljs-keyword">return</span> self.head(feats)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span>():</span>
    device = <span class="hljs-string">"cuda"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span>
    labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
    n_classes = len(labels)

    train_ds = GestureClips(train=<span class="hljs-literal">True</span>)
    val_ds   = GestureClips(train=<span class="hljs-literal">False</span>)
    print(<span class="hljs-string">f"Train clips: <span class="hljs-subst">{len(train_ds)}</span> | Val clips: <span class="hljs-subst">{len(val_ds)}</span>"</span>)

    <span class="hljs-comment"># Windows/CPU friendly</span>
    train_dl = DataLoader(train_ds, batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">True</span>,  num_workers=<span class="hljs-number">0</span>, pin_memory=<span class="hljs-literal">False</span>)
    val_dl   = DataLoader(val_ds,   batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">False</span>, num_workers=<span class="hljs-number">0</span>, pin_memory=<span class="hljs-literal">False</span>)

    model = ViTTemporal(num_classes=n_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=<span class="hljs-number">3e-4</span>, weight_decay=<span class="hljs-number">0.05</span>)

    best_acc = <span class="hljs-number">0.0</span>
    epochs = <span class="hljs-number">5</span>
    <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, epochs + <span class="hljs-number">1</span>):
        <span class="hljs-comment"># ---- Train ----</span>
        model.train()
        total, correct, loss_sum = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0.0</span>
        <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> train_dl:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()

            loss_sum += loss.item() * x.size(<span class="hljs-number">0</span>)
            correct += (logits.argmax(<span class="hljs-number">1</span>) == y).sum().item()
            total += x.size(<span class="hljs-number">0</span>)

        train_acc = correct / total <span class="hljs-keyword">if</span> total <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>
        train_loss = loss_sum / total <span class="hljs-keyword">if</span> total <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>

        <span class="hljs-comment"># ---- Validate ----</span>
        model.eval()
        vtotal, vcorrect = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>
        <span class="hljs-keyword">with</span> torch.no_grad():
            <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> val_dl:
                x, y = x.to(device), y.to(device)
                vcorrect += (model(x).argmax(<span class="hljs-number">1</span>) == y).sum().item()
                vtotal += x.size(<span class="hljs-number">0</span>)
        val_acc = vcorrect / vtotal <span class="hljs-keyword">if</span> vtotal <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>

        print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch:<span class="hljs-number">02</span>d}</span> | train_loss <span class="hljs-subst">{train_loss:<span class="hljs-number">.4</span>f}</span> "</span>
              <span class="hljs-string">f"| train_acc <span class="hljs-subst">{train_acc:<span class="hljs-number">.3</span>f}</span> | val_acc <span class="hljs-subst">{val_acc:<span class="hljs-number">.3</span>f}</span>"</span>)

        <span class="hljs-keyword">if</span> val_acc &gt; best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), <span class="hljs-string">"vit_temporal_best.pt"</span>)

    print(<span class="hljs-string">"Best val acc:"</span>, best_acc)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    train()
</code></pre>
<p>Running the command <code>python train.py</code> initiates the training process for your gesture recognition model. Here's a breakdown of what happens:</p>
<ol>
<li><p><strong>Load your dataset from data/</strong>: The script will access and load the gesture dataset stored in the "data" directory. This dataset is used to train the model.</p>
</li>
<li><p><strong>Fine-tune a pre-trained Vision Transformer</strong>: The training script will take a Vision Transformer model that has been pre-trained on a larger dataset and fine-tune it using your specific gesture dataset. Fine-tuning helps the model adapt to the nuances of your data, improving its performance on the specific task of gesture recognition.</p>
</li>
<li><p><strong>Save the best checkpoint as vit_temporal_best.pt</strong>: During training, the script will evaluate the model's performance on a validation set. The best-performing version of the model (based on some metric like accuracy) will be saved as a checkpoint file named "vit_temporal_best.pt". This file can later be used for inference or further training.</p>
</li>
</ol>
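<p>One thing to note: <code>train.py</code> imports <code>GestureClips</code> and <code>read_labels</code> from a <code>dataset</code> module. Here’s a minimal sketch of what that module could look like for the synthetic <code>data/</code> layout above; the project’s actual <code>dataset.py</code> may differ in details such as the train/val split:</p>
<pre><code class="lang-python"># dataset.py (sketch): loads clips from data/&lt;class&gt;/* and labels.txt
import os, glob, cv2, numpy as np, torch
from torch.utils.data import Dataset

def read_labels(path):
    with open(path, encoding="utf-8") as f:
        labels = [line.strip() for line in f if line.strip()]
    return labels, {c: i for i, c in enumerate(labels)}

class GestureClips(Dataset):
    def __init__(self, root="data", T=16, size=224, train=True, split=0.8):
        labels, self.label_to_idx = read_labels("labels.txt")
        self.items = []
        for cls in labels:
            clips = sorted(glob.glob(os.path.join(root, cls, "*")))
            cut = int(len(clips) * split)
            for p in (clips[:cut] if train else clips[cut:]):
                self.items.append((p, self.label_to_idx[cls]))
        self.T, self.size = T, size

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        path, y = self.items[i]
        cap = cv2.VideoCapture(path)
        frames = []
        while True:
            ok, bgr = cap.read()
            if not ok:
                break
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            frames.append(cv2.resize(rgb, (self.size, self.size)))
        cap.release()
        # Sample T frames uniformly, normalize to [-1, 1], reorder to (T, C, H, W)
        idxs = np.linspace(0, len(frames) - 1, self.T).astype(int)
        clip = np.stack([frames[j] for j in idxs]).astype(np.float32) / 255.0
        clip = (clip - 0.5) / 0.5
        clip = np.transpose(clip, (0, 3, 1, 2))
        return torch.from_numpy(clip), y
</code></pre>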
<h4 id="heading-what-training-looks-like">What Training Looks Like</h4>
<p>You should see logs similar to this:</p>
<pre><code class="lang-plaintext">Train clips: 38 | Val clips: 10
Epoch 01 | train_loss 1.4508 | train_acc 0.395 | val_acc 0.200
Epoch 02 | train_loss 1.2466 | train_acc 0.263 | val_acc 0.200
Epoch 03 | train_loss 1.1361 | train_acc 0.368 | val_acc 0.200
Best val acc: 0.200
</code></pre>
<p>Don’t worry if your accuracy is low at first; with the synthetic dataset, that’s normal. The key is proving that the Transformer pipeline works. You can boost results later by:</p>
<ul>
<li><p>Adding more clips per class</p>
</li>
<li><p>Training for more epochs</p>
</li>
<li><p>Switching to real recorded gestures</p>
</li>
</ul>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/training-logs.png?raw=true" alt="Training logs" width="600" height="400" loading="lazy"></p>
<p>Figure 1. Example training logs from <code>train.py</code>, where the Vision Transformer with temporal pooling is fine-tuned on a tiny synthetic dataset.</p>
<h3 id="heading-export-the-model-to-onnx">Export the Model to ONNX</h3>
<p>To make our model easier to run in real time (and lighter on CPU), we’ll export it to the ONNX format.</p>
<p><strong>Note:</strong> ONNX, which stands for Open Neural Network Exchange, is an open-source format designed to facilitate the interchange of deep learning models between different frameworks. It lets you train a model in one framework, such as PyTorch or TensorFlow, and then deploy it in another, like Caffe2 or MXNet, without needing to completely rewrite the model. This interoperability is achieved by providing a standardized representation of the model's architecture and parameters.</p>
<p>ONNX supports a wide range of operators and is continually updated to include new features, making it a versatile choice for deploying machine learning models across various platforms and devices.</p>
<p>Create a file called <code>export_onnx.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> train <span class="hljs-keyword">import</span> ViTTemporal
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
n_classes = len(labels)

<span class="hljs-comment"># Load trained model</span>
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load(<span class="hljs-string">"vit_temporal_best.pt"</span>, map_location=<span class="hljs-string">"cpu"</span>))
model.eval()

<span class="hljs-comment"># Dummy input: batch=1, 16 frames, 3x224x224</span>
dummy = torch.randn(<span class="hljs-number">1</span>, <span class="hljs-number">16</span>, <span class="hljs-number">3</span>, <span class="hljs-number">224</span>, <span class="hljs-number">224</span>)

<span class="hljs-comment"># Export</span>
torch.onnx.export(
    model, dummy, <span class="hljs-string">"vit_temporal.onnx"</span>,
    input_names=[<span class="hljs-string">"video"</span>], output_names=[<span class="hljs-string">"logits"</span>],
    dynamic_axes={<span class="hljs-string">"video"</span>: {<span class="hljs-number">0</span>: <span class="hljs-string">"batch"</span>}},
    opset_version=<span class="hljs-number">13</span>
)

print(<span class="hljs-string">"Exported vit_temporal.onnx"</span>)
</code></pre>
<p>Run it with <code>python export_onnx.py</code>.</p>
<p>This generates a file <code>vit_temporal.onnx</code> in your project folder. ONNX lets us use onnxruntime, which is much faster for inference.</p>
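<p>A quick sanity check that the export worked: run one dummy clip through <code>onnxruntime</code>. The input and output names match the ones used in the export script:</p>
<pre><code class="lang-python"># Sanity check: run the exported model once with onnxruntime.
import numpy as np
import onnxruntime

sess = onnxruntime.InferenceSession("vit_temporal.onnx", providers=["CPUExecutionProvider"])
dummy = np.random.randn(1, 16, 3, 224, 224).astype(np.float32)
logits = sess.run(["logits"], {"video": dummy})[0]
print(logits.shape)  # (1, num_classes)
</code></pre>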
<p>Create a file called <code>app.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os, tempfile, cv2, torch, onnxruntime, numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> gradio <span class="hljs-keyword">as</span> gr
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

T = <span class="hljs-number">16</span>
SIZE = <span class="hljs-number">224</span>
MODEL_PATH = <span class="hljs-string">"vit_temporal.onnx"</span>

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)

<span class="hljs-comment"># --- ONNX session + auto-detect names ---</span>
ort_session = onnxruntime.InferenceSession(MODEL_PATH, providers=[<span class="hljs-string">"CPUExecutionProvider"</span>])
<span class="hljs-comment"># detect first input and first output names to avoid mismatches</span>
INPUT_NAME = ort_session.get_inputs()[<span class="hljs-number">0</span>].name   <span class="hljs-comment"># e.g. "input" or "video"</span>
OUTPUT_NAME = ort_session.get_outputs()[<span class="hljs-number">0</span>].name <span class="hljs-comment"># e.g. "logits" or something else</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess_clip</span>(<span class="hljs-params">frames_rgb</span>):</span>
    <span class="hljs-keyword">if</span> len(frames_rgb) == <span class="hljs-number">0</span>:
        frames_rgb = [np.zeros((SIZE, SIZE, <span class="hljs-number">3</span>), dtype=np.uint8)]
    <span class="hljs-keyword">if</span> len(frames_rgb) &lt; T:
        frames_rgb = frames_rgb + [frames_rgb[<span class="hljs-number">-1</span>]] * (T - len(frames_rgb))
    frames_rgb = frames_rgb[:T]
    clip = [cv2.resize(f, (SIZE, SIZE), interpolation=cv2.INTER_AREA) <span class="hljs-keyword">for</span> f <span class="hljs-keyword">in</span> frames_rgb]
    clip = np.stack(clip, axis=<span class="hljs-number">0</span>)                                    <span class="hljs-comment"># (T,H,W,3)</span>
    clip = np.transpose(clip, (<span class="hljs-number">0</span>, <span class="hljs-number">3</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>)).astype(np.float32) / <span class="hljs-number">255</span> <span class="hljs-comment"># (T,3,H,W)</span>
    clip = (clip - <span class="hljs-number">0.5</span>) / <span class="hljs-number">0.5</span>
    clip = np.expand_dims(clip, <span class="hljs-number">0</span>)                                   <span class="hljs-comment"># (1,T,3,H,W)</span>
    <span class="hljs-keyword">return</span> clip

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_extract_path_from_gradio_video</span>(<span class="hljs-params">inp</span>):</span>
    <span class="hljs-keyword">if</span> isinstance(inp, str) <span class="hljs-keyword">and</span> os.path.exists(inp):
        <span class="hljs-keyword">return</span> inp
    <span class="hljs-keyword">if</span> isinstance(inp, dict):
        <span class="hljs-keyword">for</span> key <span class="hljs-keyword">in</span> (<span class="hljs-string">"video"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"path"</span>, <span class="hljs-string">"filepath"</span>):
            v = inp.get(key)
            <span class="hljs-keyword">if</span> isinstance(v, str) <span class="hljs-keyword">and</span> os.path.exists(v):
                <span class="hljs-keyword">return</span> v
        <span class="hljs-keyword">for</span> key <span class="hljs-keyword">in</span> (<span class="hljs-string">"data"</span>, <span class="hljs-string">"video"</span>):
            v = inp.get(key)
            <span class="hljs-keyword">if</span> isinstance(v, (bytes, bytearray)):
                tmp = tempfile.NamedTemporaryFile(delete=<span class="hljs-literal">False</span>, suffix=<span class="hljs-string">".mp4"</span>)
                tmp.write(v); tmp.flush(); tmp.close()
                <span class="hljs-keyword">return</span> tmp.name
    <span class="hljs-keyword">if</span> isinstance(inp, (list, tuple)) <span class="hljs-keyword">and</span> inp <span class="hljs-keyword">and</span> isinstance(inp[<span class="hljs-number">0</span>], str) <span class="hljs-keyword">and</span> os.path.exists(inp[<span class="hljs-number">0</span>]):
        <span class="hljs-keyword">return</span> inp[<span class="hljs-number">0</span>]
    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_read_uniform_frames</span>(<span class="hljs-params">video_path</span>):</span>
    cap = cv2.VideoCapture(video_path)
    frames = []
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) <span class="hljs-keyword">or</span> <span class="hljs-number">1</span>
    idxs = np.linspace(<span class="hljs-number">0</span>, total - <span class="hljs-number">1</span>, max(T, <span class="hljs-number">1</span>)).astype(int)
    want = set(int(i) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> idxs.tolist())
    j = <span class="hljs-number">0</span>
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        ok, bgr = cap.read()
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok: <span class="hljs-keyword">break</span>
        <span class="hljs-keyword">if</span> j <span class="hljs-keyword">in</span> want:
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            frames.append(rgb)
        j += <span class="hljs-number">1</span>
    cap.release()
    <span class="hljs-keyword">return</span> frames

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_from_video</span>(<span class="hljs-params">gradio_video</span>):</span>
    video_path = _extract_path_from_gradio_video(gradio_video)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> video_path <span class="hljs-keyword">or</span> <span class="hljs-keyword">not</span> os.path.exists(video_path):
        <span class="hljs-keyword">return</span> {}
    frames = _read_uniform_frames(video_path)

    <span class="hljs-comment"># If OpenCV choked on the codec (common with recorded webm), re-encode once:</span>
    <span class="hljs-keyword">if</span> len(frames) == <span class="hljs-number">0</span>:
        tmp = tempfile.NamedTemporaryFile(delete=<span class="hljs-literal">False</span>, suffix=<span class="hljs-string">".mp4"</span>); tmp_name = tmp.name; tmp.close()
        cap = cv2.VideoCapture(video_path)
        fourcc = cv2.VideoWriter_fourcc(*<span class="hljs-string">"mp4v"</span>)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) <span class="hljs-keyword">or</span> <span class="hljs-number">640</span>
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) <span class="hljs-keyword">or</span> <span class="hljs-number">480</span>
        out = cv2.VideoWriter(tmp_name, fourcc, <span class="hljs-number">20.0</span>, (w, h))
        <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
            ok, frame = cap.read()
            <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok: <span class="hljs-keyword">break</span>
            out.write(frame)
        cap.release(); out.release()
        frames = _read_uniform_frames(tmp_name)

    clip = preprocess_clip(frames)
    <span class="hljs-comment"># &gt;&gt;&gt; use the detected ONNX input/output names &lt;&lt;&lt;</span>
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[<span class="hljs-number">0</span>]  <span class="hljs-comment"># (1, C)</span>
    probs = torch.softmax(torch.from_numpy(logits), dim=<span class="hljs-number">1</span>)[<span class="hljs-number">0</span>].numpy().tolist()
    <span class="hljs-keyword">return</span> {labels[i]: float(probs[i]) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(labels))}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_from_image</span>(<span class="hljs-params">image</span>):</span>
    <span class="hljs-keyword">if</span> image <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">return</span> {}
    clip = preprocess_clip([image] * T)
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[<span class="hljs-number">0</span>]
    probs = torch.softmax(torch.from_numpy(logits), dim=<span class="hljs-number">1</span>)[<span class="hljs-number">0</span>].numpy().tolist()
    <span class="hljs-keyword">return</span> {labels[i]: float(probs[i]) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(labels))}

<span class="hljs-keyword">with</span> gr.Blocks() <span class="hljs-keyword">as</span> demo:
    gr.Markdown(<span class="hljs-string">"# Gesture Classifier (ONNX)\nRecord or upload a short video, then click **Classify Video**."</span>)
    <span class="hljs-keyword">with</span> gr.Tab(<span class="hljs-string">"Video (record or upload)"</span>):
        vid_in = gr.Video(label=<span class="hljs-string">"Record from webcam or upload a short clip"</span>)
        vid_out = gr.Label(num_top_classes=<span class="hljs-number">3</span>, label=<span class="hljs-string">"Prediction"</span>)
        gr.Button(<span class="hljs-string">"Classify Video"</span>).click(fn=predict_from_video, inputs=vid_in, outputs=vid_out)
    <span class="hljs-keyword">with</span> gr.Tab(<span class="hljs-string">"Single Image (fallback)"</span>):
        img_in = gr.Image(label=<span class="hljs-string">"Upload an image frame"</span>, type=<span class="hljs-string">"numpy"</span>)
        img_out = gr.Label(num_top_classes=<span class="hljs-number">3</span>, label=<span class="hljs-string">"Prediction"</span>)
        gr.Button(<span class="hljs-string">"Classify Image"</span>).click(fn=predict_from_image, inputs=img_in, outputs=img_out)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    demo.launch()
</code></pre>
<p>Running the command <code>python app.py</code> launches a Gradio application in your web browser. Here's what happens:</p>
<ol>
<li><p><strong>Record or upload a clip</strong>: The app lets you record a short clip from your webcam or upload an existing video file, so you can perform gestures in front of the camera.</p>
</li>
<li><p><strong>Classification on demand</strong>: When you click <strong>Classify Video</strong>, the app samples 16 frames uniformly from the clip and runs them through the ONNX model.</p>
</li>
<li><p><strong>Top 3 gesture classes displayed with probabilities</strong>: The application displays the three most likely gesture classes along with their probabilities, giving you an idea of the model's confidence in its predictions.</p>
</li>
</ol>
<p>When you open the app in your browser, you'll find two tabs. In the <strong>Video tab</strong>, you can click <em>Record from webcam</em> to capture a short clip of your gesture, typically lasting 2–4 seconds. After recording, click <strong>Classify Video</strong>. The app samples frames from the clip, runs them through the exported Transformer model, and displays the predicted gesture probabilities. This setup makes it easy to test and demonstrate the gesture recognition system interactively.</p>
<p>Here’s an example where I raised my hand for a <strong>stop</strong> gesture, and the model predicts “stop” as the top class:</p>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/realtime-demo.png?raw=true" alt="Gradio demo output" width="600" height="400" loading="lazy"></p>
<p>Figure 2. The Gradio app running locally. After recording a short clip, the Transformer model predicts the gesture with class probabilities.</p>
<h3 id="heading-evaluate-accuracy-latency">Evaluate Accuracy + Latency</h3>
<p>Now that the model runs in a demo app, let’s check how well it performs. There are two sides to this:</p>
<ul>
<li><p><strong>Accuracy</strong>: does the model predict the right gesture class?</p>
</li>
<li><p><strong>Latency</strong>: how fast does it respond, especially on CPU vs GPU?</p>
</li>
</ul>
<h4 id="heading-1-quick-accuracy-check">1. Quick Accuracy Check</h4>
<p>Save this as <code>eval.py</code> in the same folder as your other scripts:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> GestureClips, read_labels
<span class="hljs-keyword">from</span> train <span class="hljs-keyword">import</span> ViTTemporal

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
n_classes = len(labels)

<span class="hljs-comment"># Load validation data</span>
val_ds = GestureClips(train=<span class="hljs-literal">False</span>)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># Load trained model</span>
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load(<span class="hljs-string">"vit_temporal_best.pt"</span>, map_location=<span class="hljs-string">"cpu"</span>))
model.eval()

correct, total = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>
all_preds, all_labels = [], []

<span class="hljs-keyword">with</span> torch.no_grad():
    <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> val_dl:
        logits = model(x)
        preds = logits.argmax(dim=<span class="hljs-number">1</span>)
        correct += (preds == y).sum().item()
        total += y.size(<span class="hljs-number">0</span>)
        all_preds.extend(preds.tolist())
        all_labels.extend(y.tolist())

print(<span class="hljs-string">f"Validation accuracy: <span class="hljs-subst">{correct/total:<span class="hljs-number">.2</span>%}</span>"</span>)
</code></pre>
<h4 id="heading-2-confusion-matrix">2. Confusion Matrix</h4>
<p>Let’s also visualize which gestures get confused with each other. Add this snippet at the bottom of <code>eval.py</code> (it needs <code>matplotlib</code>, <code>seaborn</code>, and <code>scikit-learn</code> installed):</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> confusion_matrix

cm = confusion_matrix(all_labels, all_preds)

plt.figure(figsize=(<span class="hljs-number">6</span>,<span class="hljs-number">6</span>))
sns.heatmap(cm, annot=<span class="hljs-literal">True</span>, fmt=<span class="hljs-string">"d"</span>, xticklabels=labels, yticklabels=labels, cmap=<span class="hljs-string">"Blues"</span>)
plt.xlabel(<span class="hljs-string">"Predicted"</span>)
plt.ylabel(<span class="hljs-string">"True"</span>)
plt.title(<span class="hljs-string">"Confusion Matrix"</span>)
plt.tight_layout()
plt.show()
</code></pre>
<p>When you run <code>python eval.py</code>, a heatmap like this will pop up:</p>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/confusion-matrix.png?raw=true" alt="Confusion matrix" width="600" height="400" loading="lazy"></p>
<p>Figure 3. Confusion matrix on the validation set. Correct predictions appear along the diagonal. Off-diagonal counts show gesture confusions.</p>
<h4 id="heading-3-latency-benchmark">3. Latency Benchmark</h4>
<p>Finally, let’s see how fast inference runs. Save the following as <code>benchmark.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time, numpy <span class="hljs-keyword">as</span> np, onnxruntime
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)

ort = onnxruntime.InferenceSession(<span class="hljs-string">"vit_temporal.onnx"</span>, providers=[<span class="hljs-string">"CPUExecutionProvider"</span>])
INPUT_NAME = ort.get_inputs()[<span class="hljs-number">0</span>].name
OUTPUT_NAME = ort.get_outputs()[<span class="hljs-number">0</span>].name

dummy = np.random.randn(<span class="hljs-number">1</span>, <span class="hljs-number">16</span>, <span class="hljs-number">3</span>, <span class="hljs-number">224</span>, <span class="hljs-number">224</span>).astype(np.float32)

<span class="hljs-comment"># Warmup</span>
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(<span class="hljs-number">3</span>):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})

<span class="hljs-comment"># Benchmark</span>
t0 = time.time()
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(<span class="hljs-number">50</span>):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})
t1 = time.time()

print(<span class="hljs-string">f"Average latency: <span class="hljs-subst">{(t1 - t0)/<span class="hljs-number">50</span>:<span class="hljs-number">.3</span>f}</span> seconds per clip"</span>)
</code></pre>
<p>Run: <code>python benchmark.py</code></p>
<p>On CPU, you might see roughly 0.05–0.15 s per clip; with a GPU build of onnxruntime (the <code>CUDAExecutionProvider</code>) it’s much faster.</p>
<p><strong>Note</strong>: If latency is high, you can enable <strong>quantization</strong> in ONNX to shrink the model and speed up inference.</p>
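<p>If you want to try that, onnxruntime ships a dynamic quantization utility that converts the weights to INT8. A minimal sketch (the output filename <code>vit_temporal_int8.onnx</code> is just my choice):</p>
<pre><code class="lang-python"># quantize_model.py -- a minimal sketch: dynamic INT8 weight quantization
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    "vit_temporal.onnx",       # model exported earlier
    "vit_temporal_int8.onnx",  # quantized output (name is arbitrary)
    weight_type=QuantType.QInt8,
)
print("Wrote vit_temporal_int8.onnx")
</code></pre>
<p>Point <code>MODEL_PATH</code> in <code>app.py</code> at the quantized file and re-run <code>benchmark.py</code> to compare; expect a small accuracy drop in exchange for the speedup.</p>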
<h2 id="heading-option-2-use-small-samples-from-public-gesture-datasets">Option 2: Use Small Samples from Public Gesture Datasets</h2>
<p>If you’d prefer to see your model trained on <em>real</em> gesture clips instead of synthetic moving boxes, you can grab a handful of videos from open datasets. You don’t need to download the entire dataset (which can be several GB); just a few <code>.mp4</code> samples are enough to follow along.</p>
<h3 id="heading-recommended-sources">Recommended sources</h3>
<ul>
<li><p><strong>20BN Jester Dataset</strong>: Contains short clips of hand gestures like swiping, clapping, and pointing.</p>
</li>
<li><p><strong>WLASL</strong>: A large-scale dataset of isolated sign language words.</p>
</li>
</ul>
<p>Both projects provide small <code>.mp4</code> videos you can use as realistic training examples. I’ve linked them below.</p>
<h3 id="heading-setting-up-your-dataset-folder">Setting up your dataset folder</h3>
<p>Once you download a few clips, place them in the <code>data/</code> folder under subfolders named after each gesture class. For example:</p>
<pre><code class="lang-plaintext">data/
├── swipe_left/
│   ├── clip1.mp4
│   └── clip2.mp4
├── swipe_right/
│   ├── clip1.mp4
│   └── clip2.mp4
└── stop/
    ├── clip1.mp4
    └── clip2.mp4
</code></pre>
<p>And update <code>labels.txt</code> to match the folder names:</p>
<pre><code class="lang-plaintext">swipe_left
swipe_right
stop
</code></pre>
<p>Now your dataset is ready, and the same training scripts from earlier (<code>train.py</code>, <code>eval.py</code>) will work without modification.</p>
<h3 id="heading-why-choose-this-option">Why choose this option?</h3>
<ul>
<li><p>Gives more realistic results than synthetic coloured boxes</p>
</li>
<li><p>Lets you see how the model handles <em>actual human hand movements</em></p>
</li>
<li><p>The only trade-off is a bit more effort (downloading clips and trimming them if needed)</p>
</li>
</ul>
<p><strong>Tip:</strong> If downloading from these datasets feels too heavy, you can also record your own short gestures using your laptop webcam. Just save them as <code>.mp4</code> files and organize them in the same folder structure.</p>
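<p>Here’s a minimal OpenCV sketch for recording your own clips (the script name <code>record_clip.py</code> and its flags are my own choice, not part of the repo):</p>
<pre><code class="lang-python"># record_clip.py -- a minimal sketch: record a short webcam clip
# into a data/ subfolder named after the gesture label
import argparse
import os
import time

import cv2

parser = argparse.ArgumentParser()
parser.add_argument("--label", required=True, help="gesture class, e.g. stop")
parser.add_argument("--seconds", type=float, default=3.0)
args = parser.parse_args()

out_dir = os.path.join("data", args.label)
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, f"clip{len(os.listdir(out_dir)) + 1}.mp4")

cap = cv2.VideoCapture(0)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) or 640
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) or 480
writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), 20.0, (w, h))

t0 = time.time()
while time.time() - t0 &lt; args.seconds:
    ok, frame = cap.read()
    if not ok:
        break
    writer.write(frame)
    cv2.imshow("Recording (press q to stop early)", frame)
    if cv2.waitKey(1) &amp; 0xFF == ord("q"):
        break

cap.release()
writer.release()
cv2.destroyAllWindows()
print(f"Saved {out_path}")
</code></pre>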
<h2 id="heading-accessibility-notes-amp-ethical-limits">Accessibility Notes &amp; Ethical Limits</h2>
<p>While this project shows the technical workflow for gesture recognition with Transformers, it’s important to step back and consider the <strong>human context</strong>:</p>
<ul>
<li><p><strong>Accessibility first</strong>: Tools like this can help students with speech or motor difficulties, but they should always be co-designed with the people who will use them. Don’t assume one-size-fits-all.</p>
</li>
<li><p><strong>Dataset sensitivity</strong>: Using publicly available sign or gesture datasets is fine for prototyping, but deploying such a system requires careful consideration of consent and representation.</p>
</li>
<li><p><strong>Error tolerance</strong>: Even small misclassifications can have big consequences in accessibility contexts (for example, confusing <em>stop</em> with <em>go</em>). Always plan for fallback options (like manual input or confirmation).</p>
</li>
<li><p><strong>Bias and inclusivity</strong>: Models trained on narrow datasets may fail for different skin tones, lighting conditions, or cultural gesture variations. Broad and diverse training data is essential for fairness.</p>
</li>
</ul>
<p>In other words: this demo is a <strong>teaching scaffold</strong>, not a production-ready accessibility tool. Responsible deployment requires collaboration with educators, therapists, and end users.</p>
<h2 id="heading-next-steps">Next Steps</h2>
<p>If you’d like to push this project further, here are some directions to explore:</p>
<ul>
<li><p><strong>Better models</strong>: Try video-focused Transformers like <a target="_blank" href="https://arxiv.org/abs/2102.05095">TimeSformer</a> or <a target="_blank" href="https://arxiv.org/abs/2203.12602">VideoMAE</a> for stronger temporal reasoning.</p>
</li>
<li><p><strong>Larger vocabularies</strong>: Add more gesture classes, build your own dataset, or use portions of public datasets like <a target="_blank" href="https://www.kaggle.com/datasets/toxicmender/20bn-jester">20BN Jester</a> or <a target="_blank" href="https://www.kaggle.com/datasets/risangbaskoro/wlasl-processed">WLASL.</a></p>
</li>
<li><p><strong>Pose fusion</strong>: Combine gesture video with human pose keypoints from <a target="_blank" href="https://mediapipe.readthedocs.io/en/latest/solutions/hands.html">MediaPipe</a> or <a target="_blank" href="https://github.com/CMU-Perceptual-Computing-Lab/openpose">OpenPose</a> for more robust predictions.</p>
</li>
<li><p><strong>Real-time smoothing</strong>: Implement temporal smoothing or debounce logic in the app so predictions are more stable during live use (see the sketch after this list).</p>
</li>
<li><p><strong>Quantization + edge devices</strong>: Convert your ONNX model to an INT8 quantized version and deploy it on a Raspberry Pi or Jetson Nano for classroom-ready prototypes.</p>
</li>
</ul>
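<p>For the smoothing idea, here’s a minimal sketch (not from the repo): keep a short history of predictions and only report a label once it wins a majority vote.</p>
<pre><code class="lang-python"># A minimal prediction smoother: majority vote over the last few predictions
from collections import Counter, deque

class PredictionSmoother:
    def __init__(self, window=5, min_votes=3):
        self.history = deque(maxlen=window)
        self.min_votes = min_votes

    def update(self, label):
        """Add the newest prediction; return the stable label, or None."""
        self.history.append(label)
        top_label, votes = Counter(self.history).most_common(1)[0]
        return top_label if votes &gt;= self.min_votes else None

# Usage: smoother = PredictionSmoother()
# stable = smoother.update(top1_label)  # only act when stable is not None
</code></pre>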
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you built a gesture recognition system with a Transformer model end to end: you prepared a small dataset, trained a Vision Transformer with temporal pooling, exported the model to ONNX for efficient inference, and deployed it in an interactive Gradio app. You also measured accuracy and latency, the two numbers that matter most for a responsive system.</p>
<p>This project illustrates how you can leverage advanced ML methods to enhance accessibility and communication, paving the way for more inclusive learning environments.</p>
<p>Remember: while this demo works with small datasets, real-world applications need larger, more diverse data and careful consideration of accessibility, inclusivity, and ethics.</p>
<p>Here’s the GitHub repo for full source code: <a target="_blank" href="https://github.com/tayo4christ/transformer-gesture">transformer-gesture</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create a Real-Time Gesture-to-Text Translator Using Python and Mediapipe ]]>
                </title>
                <description>
                    <![CDATA[ Sign and symbol languages, like Makaton and American Sign Language (ASL), are powerful communication tools. However, they can create challenges when communicating with people who don't understand them. As a researcher working on AI for accessibility,... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/create-a-real-time-gesture-to-text-translator/</link>
                <guid isPermaLink="false">68a331edf6c19271552e2ac7</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Accessibility ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Mon, 18 Aug 2025 14:00:13 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755525484024/9f4c42e0-dbfd-4f04-9223-0a2169abd1fb.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Sign and symbol languages, like Makaton and American Sign Language (ASL), are powerful communication tools. However, they can create challenges when communicating with people who don't understand them.</p>
<p>As a researcher working on AI for accessibility, I wanted to explore how machine learning and computer vision could bridge that gap. The result was a real-time gesture-to-text translator built with Python and Mediapipe, capable of detecting hand gestures and instantly converting them to text.</p>
<p>In this tutorial, you’ll learn how to build your own version from scratch, even if you’ve never used Mediapipe before.</p>
<p>By the end, you’ll know how to:</p>
<ul>
<li><p>Detect and track hand movements in real time.</p>
</li>
<li><p>Classify gestures using a simple machine learning model.</p>
</li>
<li><p>Convert recognized gestures into text output.</p>
</li>
<li><p>Extend the system for accessibility-focused applications.</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before following along with this tutorial, you should have:</p>
<ul>
<li><p><strong>Basic Python knowledge</strong> – You should be comfortable writing and running Python scripts.</p>
</li>
<li><p><strong>Familiarity with the command line</strong> – You’ll use it to run scripts and install dependencies.</p>
</li>
<li><p><strong>A working webcam</strong> – This is required for capturing and recognizing gestures in real time.</p>
</li>
<li><p><strong>Python installed (3.8 or later)</strong> – Along with <code>pip</code> for installing packages.</p>
</li>
<li><p><strong>Some understanding of machine learning basics</strong> – Knowing what training data and models are will help, but I’ll explain the key parts along the way.</p>
</li>
<li><p><strong>An internet connection</strong> – To install libraries such as Mediapipe and OpenCV.</p>
</li>
</ul>
<p>If you’re completely new to Mediapipe or OpenCV, don’t worry, I will walk through the core parts you need to know to get this project working.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-this-matters">Why This Matters</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tools-and-technologies">Tools and Technologies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-how-to-install-the-required-libraries">Step 1: How to Install the Required Libraries</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-how-mediapipe-tracks-hands">Step 2: How Mediapipe Tracks Hands</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-project-pipeline">Step 3: Project Pipeline</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-how-to-collect-gesture-data">Step 4: How to Collect Gesture Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-how-to-train-a-gesture-classifier">Step 5: How to Train a Gesture Classifier</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-real-time-gesture-to-text-translation">Step 6: Real-Time Gesture-to-Text Translation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-7-extending-the-project">Step 7: Extending the Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-ethical-and-accessibility-considerations">Ethical and Accessibility Considerations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-why-this-matters">Why This Matters</h2>
<p>Accessible communication is a right, not a privilege. Gesture-to-text translators can:</p>
<ul>
<li><p>Help non-signers communicate with sign/symbol language users.</p>
</li>
<li><p>Assist in educational contexts for children with communication challenges.</p>
</li>
<li><p>Support people with speech impairments.</p>
</li>
</ul>
<p><strong>Note:</strong> This project is a proof-of-concept and should be tested with diverse datasets before real-world deployment.</p>
<h2 id="heading-tools-and-technologies">Tools and Technologies</h2>
<p>We’ll be using:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Tool</td><td>Purpose</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Python</strong></td><td>Primary programming language</td></tr>
<tr>
<td><strong>Mediapipe</strong></td><td>Real-time hand tracking and gesture detection</td></tr>
<tr>
<td><strong>OpenCV</strong></td><td>Webcam input and video display</td></tr>
<tr>
<td><strong>NumPy</strong></td><td>Data processing</td></tr>
<tr>
<td><strong>Scikit-learn</strong></td><td>Gesture classification</td></tr>
</tbody>
</table>
</div><h2 id="heading-step-1-how-to-install-the-required-libraries">Step 1: How to Install the Required Libraries</h2>
<p>Before installing the dependencies, ensure you have Python 3.8 or higher installed. You can check your current Python version by opening a terminal (Command Prompt on Windows, or Terminal on macOS/Linux) and typing:</p>
<pre><code class="lang-bash">python --version
</code></pre>
<p>or</p>
<pre><code class="lang-bash">python3 --version
</code></pre>
<p>Mediapipe and some of its dependencies rely on modern language features and prebuilt binary wheels, which is why Python 3.8 or higher is required. If the commands above print an older version, install a newer Python before you continue.</p>
<p><strong>Windows:</strong></p>
<ol>
<li><p>Press <strong>Windows Key + R</strong></p>
</li>
<li><p>Type <code>cmd</code> and press Enter to open Command Prompt</p>
</li>
<li><p>Type one of the above commands and press Enter</p>
</li>
</ol>
<p><strong>macOS/Linux:</strong></p>
<ol>
<li><p>Open your <strong>Terminal</strong> application</p>
</li>
<li><p>Type one of the above commands and press Enter</p>
</li>
</ol>
<p>If your Python version is older than 3.8, you’ll need to <a target="_blank" href="https://www.python.org/downloads/">download and install a newer version from the official Python website</a>.</p>
<p>Once Python is ready, you can install the required libraries using <code>pip</code>:</p>
<pre><code class="lang-bash">pip install mediapipe opencv-python numpy scikit-learn pandas
</code></pre>
<p>This command installs all the libraries you’ll need for the project:</p>
<ul>
<li><p><strong>Mediapipe</strong> – real-time hand tracking and landmark detection.</p>
</li>
<li><p><strong>OpenCV</strong> – reading frames from your webcam and drawing overlays.</p>
</li>
<li><p><strong>NumPy</strong> – numerical arrays for the landmark coordinate vectors.</p>
</li>
<li><p><strong>Pandas</strong> – storing our collected landmark data in a CSV for training.</p>
</li>
<li><p><strong>Scikit-learn</strong> – training and evaluating the gesture classification model.</p>
</li>
</ul>
<h2 id="heading-step-2-how-mediapipe-tracks-hands">Step 2: How Mediapipe Tracks Hands</h2>
<p>Mediapipe’s Hand Tracking solution detects 21 key landmarks for each hand (fingertips, joints, and the wrist) at <strong>30+ FPS</strong>, even on modest hardware.</p>
<p>Here’s a conceptual diagram of the landmarks:</p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/images/landmarks_concept.png?raw=true" alt="Diagram showing Mediapipe hand landmark numbering and connections between joints" width="600" height="400" loading="lazy"></p>
<p>And here’s what real‑time tracking looks like:</p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/images/hand_tracking_3d_android_gpu.gif?raw=true" alt="Animated GIF showing Mediapipe 3D hand tracking detecting finger joints and bones in real-time" width="600" height="400" loading="lazy"></p>
<p>Each landmark has <code>(x, y, z)</code> coordinates relative to the image size, making it easy to measure angles and positions for gesture classification.</p>
<h2 id="heading-step-3-project-pipeline">Step 3: Project Pipeline</h2>
<p>Here’s how the system works, from webcam to text output:</p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/diagrams/pipeline_flowchart.png?raw=true" alt="Pipeline Flowchart showing how gesture input flows through hand tracking, feature extraction, gesture classification, and final text output" width="600" height="400" loading="lazy"></p>
<ul>
<li><p><strong>Capture</strong>: Webcam frames are captured using OpenCV.</p>
</li>
<li><p><strong>Detection</strong>: Mediapipe locates hand landmarks.</p>
</li>
<li><p><strong>Vectorization</strong>: Landmarks are flattened into a numeric vector.</p>
</li>
<li><p><strong>Classification</strong>: A machine learning model predicts the gesture.</p>
</li>
<li><p><strong>Output</strong>: The recognized gesture is displayed as text.</p>
</li>
</ul>
<p>Basic hand detection example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> cv2
<span class="hljs-keyword">import</span> mediapipe <span class="hljs-keyword">as</span> mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(<span class="hljs-number">0</span>)

<span class="hljs-keyword">with</span> mp_hands.Hands(max_num_hands=<span class="hljs-number">1</span>) <span class="hljs-keyword">as</span> hands:
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        ret, frame = cap.read()
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ret:
            <span class="hljs-keyword">break</span>

        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        <span class="hljs-keyword">if</span> results.multi_hand_landmarks:
            <span class="hljs-keyword">for</span> hand_landmarks <span class="hljs-keyword">in</span> results.multi_hand_landmarks:
                mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)

        cv2.imshow(<span class="hljs-string">"Hand Tracking"</span>, frame)
        <span class="hljs-keyword">if</span> cv2.waitKey(<span class="hljs-number">1</span>) &amp; <span class="hljs-number">0xFF</span> == ord(<span class="hljs-string">"q"</span>):
            <span class="hljs-keyword">break</span>

cap.release()
cv2.destroyAllWindows()
</code></pre>
<p>The code above opens the webcam and processes each frame with Mediapipe’s Hands solution. Each frame is converted to RGB (as Mediapipe expects) before detection runs; if a hand is found, the 21 landmarks and their connections are drawn on top of the frame. Press <code>q</code> to close the window. This snippet verifies your setup, letting you confirm that landmark tracking works before moving on.</p>
<h2 id="heading-step-4-how-to-collect-gesture-data">Step 4: How to Collect Gesture Data</h2>
<p>Before we can train our model, we need a dataset of <strong>labelled gestures</strong>. Each gesture will be stored in a CSV file (<code>gesture_data.csv</code>) containing the 3D landmark coordinates for all detected hand points.</p>
<p>For example, we’ll collect data for three gestures:</p>
<ul>
<li><p><strong>thumbs_up</strong> – the classic thumbs-up pose.</p>
</li>
<li><p><strong>open_palm</strong> – a flat hand, fingers extended (like a “high five”).</p>
</li>
<li><p><strong>ok</strong> – the “OK” sign, made by touching the thumb and index finger.</p>
</li>
</ul>
<p>You can collect samples for each gesture by running:</p>
<pre><code class="lang-bash">python src/collect_data.py --label thumbs_up --samples 200
</code></pre>
<pre><code class="lang-bash">python src/collect_data.py --label open_palm --samples 200
</code></pre>
<pre><code class="lang-bash">python src/collect_data.py --label ok --samples 200
</code></pre>
<p><strong>Explanation of the command:</strong></p>
<ul>
<li><p><code>--label</code> → the name of the gesture you’re recording. This label will be stored alongside each row of coordinates in the CSV.</p>
</li>
<li><p><code>--samples</code> → the number of frames to capture for that gesture. More samples generally lead to better accuracy.</p>
</li>
</ul>
<p><strong>How the process works</strong> (a sketch of the core loop follows this list):</p>
<ol>
<li><p>When you run a command, your webcam will open.</p>
</li>
<li><p>Make the specified gesture in front of the camera.</p>
</li>
<li><p>The script will use MediaPipe Hands to detect 21 hand landmarks (each with <code>x</code>, <code>y</code>, <code>z</code> coordinates).</p>
</li>
<li><p>These 63 numbers (21 × 3) are stored in a row of the CSV file, along with the gesture label.</p>
</li>
<li><p>The counter at the top will track how many samples have been collected.</p>
</li>
<li><p>When the sample count reaches your target (<code>--samples</code>), the script will close automatically.</p>
</li>
</ol>
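<p>The repo’s <code>src/collect_data.py</code> implements this loop. Here’s a minimal sketch of the core idea (flags and details may differ from the actual script):</p>
<pre><code class="lang-python"># A sketch of the collection loop: write one CSV row per detected hand
import argparse
import csv
import os

import cv2
import mediapipe as mp

parser = argparse.ArgumentParser()
parser.add_argument("--label", required=True)
parser.add_argument("--samples", type=int, default=200)
args = parser.parse_args()

os.makedirs("data", exist_ok=True)
mp_hands = mp.solutions.hands
cap = cv2.VideoCapture(0)
count = 0

with mp_hands.Hands(max_num_hands=1) as hands, \
     open("data/gesture_data.csv", "a", newline="") as f:
    writer = csv.writer(f)  # header row (x0..z20,label) omitted for brevity
    while count &lt; args.samples:
        ret, frame = cap.read()
        if not ret:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark
            row = [v for p in lm for v in (p.x, p.y, p.z)]  # 63 values
            writer.writerow(row + [args.label])
            count += 1
        cv2.putText(frame, f"{count}/{args.samples}", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow("Collecting", frame)
        if cv2.waitKey(1) &amp; 0xFF == ord("q"):
            break

cap.release()
cv2.destroyAllWindows()
</code></pre>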
<p><strong>Example of what the CSV looks like:</strong></p>
<p><img src="https://raw.githubusercontent.com/tayo4christ/Gesture_Article/26db13366407e5b5d230a6c7dd7923e34a9f2a19/screenshots/gesture_data.webp" alt="Sample of gesture_data.csv" width="600" height="400" loading="lazy"></p>
<p>Each row contains:</p>
<ul>
<li><p><strong>x0, y0, z0 … x20, y20, z20</strong> → coordinates of each hand landmark.</p>
</li>
<li><p><strong>label</strong> → the gesture name.</p>
</li>
</ul>
<p><strong>Example of data collection in progress:</strong></p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/screenshots/detection_example.jpg?raw=true" alt="Screenshot of data collection interface capturing hand gesture landmarks from webcam" width="600" height="400" loading="lazy"></p>
<p>In the above screenshot, the script is capturing <strong>10 out of 10</strong> <code>thumbs_up</code> samples.</p>
<p>📌 <strong>Tip:</strong> Make sure your hand is clearly visible and well-lit. Repeat the process for all gestures you want to train.</p>
<h2 id="heading-step-5-how-to-train-a-gesture-classifier">Step 5: How to Train a Gesture Classifier</h2>
<p>Once you have enough samples for each gesture, train a model:</p>
<pre><code class="lang-bash">python src/train_model.py --data data/gesture_data.csv --label palm_open
</code></pre>
<p>This script:</p>
<ul>
<li><p>Loads the CSV dataset.</p>
</li>
<li><p>Splits into training and testing sets.</p>
</li>
<li><p>Trains a Random Forest Classifier.</p>
</li>
<li><p>Prints accuracy and a classification report.</p>
</li>
<li><p>Saves the trained model.</p>
</li>
</ul>
<p>Core training logic:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> RandomForestClassifier
<span class="hljs-keyword">import</span> pickle

<span class="hljs-comment"># Load the dataset</span>
df = pd.read_csv(<span class="hljs-string">"data/gesture_data.csv"</span>)

<span class="hljs-comment"># Separate features and labels</span>
X = df.drop(<span class="hljs-string">"label"</span>, axis=<span class="hljs-number">1</span>)
y = df[<span class="hljs-string">"label"</span>]

<span class="hljs-comment"># Initialize and train the Random Forest Classifier</span>
model = RandomForestClassifier()
model.fit(X, y)

<span class="hljs-comment"># Save the trained model to a file</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">"data/gesture_model.pkl"</span>, <span class="hljs-string">"wb"</span>) <span class="hljs-keyword">as</span> f:
    pickle.dump(model, f)
</code></pre>
<p>This block loads the gesture dataset from <code>data/gesture_data.csv</code> and splits it into:</p>
<ul>
<li><p><code>X</code> – the input features (the 3D landmark coordinates for each gesture sample).</p>
</li>
<li><p><code>y</code> – the labels (gesture names like <code>thumbs_up</code>, <code>open_palm</code>, <code>ok</code>).</p>
</li>
</ul>
<p>We then create a Random Forest Classifier, which is well-suited for numerical data and works reliably without much tuning. The model learns the patterns in landmark positions that correspond to each gesture.<br>Finally, we save the trained model as <code>data/gesture_model.pkl</code> so it can be loaded later for real-time gesture recognition without retraining.</p>
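<p>The full <code>train_model.py</code> also splits the data and reports accuracy, as listed above. Here’s a minimal sketch of what that evaluation step typically looks like, continuing from the variables defined in the previous snippet:</p>
<pre><code class="lang-python"># A sketch of the evaluation step: hold out 20% of the samples for testing
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier()
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.2%}")
print(classification_report(y_test, preds))
</code></pre>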
<h2 id="heading-step-6-real-time-gesture-to-text-translation">Step 6: Real-Time Gesture-to-Text Translation</h2>
<p>Load the model and run the translator:</p>
<pre><code class="lang-bash">python src/gesture_to_text.py --model data/gesture_model.pkl
</code></pre>
<p>This command runs the real-time gesture recognition script.</p>
<ul>
<li><p>The <code>--model</code> argument tells the script which trained model file to load — in this case, <code>gesture_model.pkl</code> that we saved earlier.</p>
</li>
<li><p>Once running, the script opens your webcam, detects your hand landmarks, and uses the model to predict the gesture.</p>
</li>
<li><p>The predicted gesture name appears as text on the video feed.</p>
</li>
<li><p>Press <code>q</code> to exit the window when you’re done.</p>
</li>
</ul>
<p>Core prediction logic:</p>
<pre><code class="lang-python"><span class="hljs-keyword">with</span> open(<span class="hljs-string">"data/gesture_model.pkl"</span>, <span class="hljs-string">"rb"</span>) <span class="hljs-keyword">as</span> f:
    model = pickle.load(f)

<span class="hljs-keyword">if</span> results.multi_hand_landmarks:
    <span class="hljs-keyword">for</span> hand_landmarks <span class="hljs-keyword">in</span> results.multi_hand_landmarks:
        coords = []
        <span class="hljs-keyword">for</span> lm <span class="hljs-keyword">in</span> hand_landmarks.landmark:
            coords.extend([lm.x, lm.y, lm.z])
        gesture = model.predict([coords])[<span class="hljs-number">0</span>]
        cv2.putText(frame, gesture, (<span class="hljs-number">10</span>, <span class="hljs-number">50</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">1</span>, (<span class="hljs-number">0</span>, <span class="hljs-number">255</span>, <span class="hljs-number">0</span>), <span class="hljs-number">2</span>)
</code></pre>
<p>This code loads the trained gesture recognition model from <code>gesture_model.pkl</code>.<br>If any hands are detected (<code>results.multi_hand_landmarks</code>), it loops through each detected hand and:</p>
<ol>
<li><p><strong>Extracts the coordinates</strong> – for each of the 21 landmarks, it appends the <code>x</code>, <code>y</code>, and <code>z</code> values to the <code>coords</code> list.</p>
</li>
<li><p><strong>Makes a prediction</strong> – passes <code>coords</code> to the model’s <code>predict</code> method to get the most likely gesture label.</p>
</li>
<li><p><strong>Displays the result</strong> – uses <code>cv2.putText</code> to draw the predicted gesture name on the video feed.</p>
</li>
</ol>
<p>This is the real-time decision-making step that turns raw Mediapipe landmark data into a readable gesture label.</p>
<p>You should see the recognized gesture at the top of the video feed:</p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/screenshots/text_output.jpg?raw=true" alt="Screenshot of the real-time gesture recognition output overlaying the 'palm_open' label on the video feed" width="600" height="400" loading="lazy"></p>
<h2 id="heading-step-7-extending-the-project">Step 7: Extending the Project</h2>
<p>You can take this project further by:</p>
<ul>
<li><p><strong>Adding Text-to-Speech</strong>: Use <code>pyttsx3</code> to speak recognized words (see the sketch after this list).</p>
</li>
<li><p><strong>Supporting More Gestures</strong>: Expand your dataset.</p>
</li>
<li><p><strong>Deploying in the Browser</strong>: Use TensorFlow.js for web-based recognition.</p>
</li>
<li><p><strong>Testing with Real Users</strong>: Especially in accessibility contexts.</p>
</li>
</ul>
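<p>For the text-to-speech idea, here’s a minimal <code>pyttsx3</code> sketch (install it first with <code>pip install pyttsx3</code>):</p>
<pre><code class="lang-python">import pyttsx3

engine = pyttsx3.init()

def speak(text):
    """Speak the recognized gesture label aloud."""
    engine.say(text)
    engine.runAndWait()

# For example, call speak(gesture) in the prediction loop
# whenever the predicted label changes.
</code></pre>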
<h2 id="heading-ethical-and-accessibility-considerations">Ethical and Accessibility Considerations</h2>
<p>Before deploying:</p>
<ul>
<li><p><strong>Dataset Diversity</strong>: Train with gestures from different skin tones, hand sizes, and lighting conditions.</p>
</li>
<li><p><strong>Privacy</strong>: Store only landmark coordinates unless you have consent for video storage.</p>
</li>
<li><p><strong>Cultural Context</strong>: Some gestures have different meanings in different cultures.</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, we explored how to use Python, Mediapipe, and machine learning to build a real-time gesture-to-text translator. This technology has exciting potential for accessibility and inclusive communication, and with further development, could become a powerful tool for breaking down language barriers.</p>
<p>You can find the full code and resources here:</p>
<p><strong>GitHub Repo</strong> – <a target="_blank" href="https://github.com/tayo4christ/Gesture_Article">Gesture_Article</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use the Segment Anything Model (SAM) to Create Masks ]]>
                </title>
                <description>
                    <![CDATA[ By Jess Wilk Hey there! So, you know that buzz about Tesla's autopilot being all futuristic and driverless? Ever thought about how it actually does its magic? Well, let me tell you – it's all about image segmentation and object detection.  What is Im... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/use-segment-anything-model-to-create-masks/</link>
                <guid isPermaLink="false">66d45f6f706b9fb1c166b995</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 08 Nov 2023 20:26:18 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/11/cover-image-SAM.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Jess Wilk</p>
<p>Hey there! So, you know that buzz about Tesla's autopilot being all futuristic and driverless? Ever thought about how it actually does its magic? Well, let me tell you – it's all about image segmentation and object detection. </p>
<h2 id="heading-what-is-image-segmentation">What is Image Segmentation?</h2>
<p>Image segmentation, basically chopping up an image into different parts, helps the system recognize stuff. It identifies where humans, other cars, and obstacles are on the road. That's the tech making sure those self-driving cars can cruise around safely. Cool, right? 🚗</p>
<p>During the past decade, Computer Vision has made massive strides, especially in crafting super-sophisticated segmentation and object detection methods. </p>
<p>These breakthroughs have found diverse uses, like spotting tumors and diseases in medical images, keeping an eye on crops in farming, and even guiding robots in navigation. The tech's really branching out and making a significant impact across different fields. </p>
<p>The main challenge lies in getting and prepping the data. Building an image segmentation dataset demands annotating heaps of images to define the labels, which is a massive task. This requires a ton of resources. </p>
<p>So, the game changed when the <strong>Segment Anything Model (SAM)</strong> came into the scene. SAM revolutionized this field by enabling anyone to create segmentation masks for their data without relying on labeled data.</p>
<p>In this article, I’ll guide you through understanding SAM, its workings, and how you can utilize it to make masks. So, get ready with your cup of coffee because we're diving in! ☕</p>
<h3 id="heading-prerequisites">Prerequisites:</h3>
<p>The prerequisites for this article include a basic understanding of <strong>Python programming</strong> and a fundamental knowledge of <strong>machine learning</strong>. </p>
<p>Additionally, familiarity with image segmentation concepts, computer vision, and data annotation challenges would also be beneficial.</p>
<h2 id="heading-what-is-the-segment-anything-model">What is the Segment Anything Model?</h2>
<p>SAM is a large, promptable segmentation model developed by the Facebook research team (Meta AI). The model was trained on a massive dataset of <strong>1.1 billion segmentation masks</strong>, the SA-1B dataset. Because the training data is so large and diverse, the model generalizes well to unseen images.</p>
<p>SAM can be used to segment any image and create masks without any labeled data. That’s the breakthrough: it performs zero-shot segmentation on images and object types it was never explicitly trained on.</p>
<p>What makes SAM unique? It is a first-of-its-kind, <strong>promptable segmentation</strong> model. Prompts allow you to instruct the model on your desired output through text and interactive actions. You can provide prompts to SAM in multiple ways: Points, Bounding Boxes, texts, and even base masks.</p>
<h2 id="heading-how-does-sam-work">How Does SAM Work?</h2>
<p>SAM uses a transformer-based architecture, much like modern large language models. Let’s look at the flow of data through the different components of SAM. </p>
<p><strong>Image Encoder:</strong> When you provide an image to SAM, it is first sent to the Image Encoder. True to its name, this component encodes the image into vectors. These vectors represent the low-level (edges, outlines) and high-level features like object shapes and textures extracted from the image. The encoder here is a <strong>Vision Transformer (ViT),</strong> which has many advantages over traditional CNNs.</p>
<p><strong>Prompt Encoder:</strong> The prompt input the user gives is converted to embeddings by the prompt encoder. SAM uses positional embeddings for points, bounding box prompts, and text encoders for text prompts.</p>
<p><strong>Mask Decoder:</strong> Next, SAM maps the extracted image features and prompt encodings to generate the mask, which is our output. SAM will generate 3 segmented masks for every input prompt, providing the users with choices. </p>
<h2 id="heading-why-use-sam">Why use SAM?</h2>
<p>With SAM, you can skip the expensive setup usually needed for AI, and still get fast results. It works well with all sorts of data, like medical or satellite images, and fits right into the software you already use for quick detection tasks. </p>
<p>You also get tools tailored for specific jobs like image segmentation, and it’s straightforward to interact with, whether you're training it or asking it to analyze data. Plus, it saves you the time and cost of building and training a custom CNN-based pipeline from scratch.</p>
<p><img src="https://lh7-us.googleusercontent.com/tcDOfehN4GLt4bZkN_0uhOPYsZ9B8cBeQaCxf9F6OS6iUN1WESAAWNUb9_vCpTj66TvzeVocZi3i6xKkrMB2cSbj0-UBrjlR3jjBXJfRo1WAYyipmVbSiYQPj0f3X8HMc1AA1y1dQ7Zq197kxXETWDY" alt="Image" width="600" height="400" loading="lazy">
<em>Why use SAM?</em></p>
<h2 id="heading-how-to-install-and-set-up-sam">How to Install and Set up SAM</h2>
<p>Now that you know how SAM works, let me show you how to install and set it up. The first step is to install the package in your Jupyter notebook or Google Colab with the following command:</p>
<pre><code class="lang-python">pip install <span class="hljs-string">'git+https://github.com/facebookresearch/segment-anything.git'</span>
</code></pre>
<pre><code class="lang-python">/content Collecting git+https://github.com/facebookresearch/segment-anything.git Cloning https://github.com/facebookresearch/segment-anything.git to /tmp/pip-req-build-xzlt_n7r Running command git clone --filter=blob:none --quiet https://github.com/facebookresearch/segment-anything.git /tmp/pip-req-build-xzlt_n7r Resolved https://github.com/facebookresearch/segment-anything.git to commit <span class="hljs-number">6</span>fdee8f2727f4506cfbbe553e23b895e27956588 Preparing metadata (setup.py) ... done
</code></pre>
<p>The next step is to download the pre-trained weights of the SAM model you want to use. </p>
<p>You can choose from three options of checkpoint weights: ViT-B (91M), ViT-L (308M), and ViT-H (636M parameters). </p>
<p>How do you choose the right one? The larger the number of parameters, the longer inference (mask generation) takes. If you have limited GPU resources and need fast inference, go for ViT-B. Otherwise, choose ViT-H for the best mask quality. </p>
<p>Follow the below commands to set up the model checkpoint path:</p>
<pre><code class="lang-python">!wget -q https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
CHECKPOINT_PATH=<span class="hljs-string">'/content/weights/sam_vit_h_4b8939.pth'</span>


<span class="hljs-keyword">import</span> torch
DEVICE = torch.device(<span class="hljs-string">'cuda:0'</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">'cpu'</span>)
MODEL_TYPE = <span class="hljs-string">"vit_h"</span>
</code></pre>
<p>The model weights are ready! Now, I’ll show you different methods through which you can provide prompts and generate masks in the upcoming sections. 🚀</p>
<h2 id="heading-how-sam-can-generate-masks-automatically">How SAM Can Generate Masks Automatically</h2>
<p>SAM can automatically segment the entire input image into distinct segments without a specific prompt. For this, you can use the <code>SamAutomaticMaskGenerator</code> utility. </p>
<p>Follow the below commands to import and initialize it with the model type and checkpoint path. </p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> segment_anything <span class="hljs-keyword">import</span> sam_model_registry, SamAutomaticMaskGenerator, SamPredictor


sam = sam_model_registry[MODEL_TYPE](checkpoint=CHECKPOINT_PATH).to(device=DEVICE)


mask_generator = SamAutomaticMaskGenerator(sam)
</code></pre>
<p>For example, I have uploaded an image of dogs to my notebook. It will be our input image, which has to be converted into RGB (Red-Green-Blue) pixel format to be input to the model. </p>
<p>You can do this using the OpenCV Python package and then use the <code>generate()</code> function to create a mask, as shown below:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Import opencv package</span>
<span class="hljs-keyword">import</span> cv2


<span class="hljs-comment"># Give the path of your image</span>
IMAGE_PATH= <span class="hljs-string">'/content/dog.png'</span>
<span class="hljs-comment"># Read the image from the path</span>
image= cv2.imread(IMAGE_PATH)
<span class="hljs-comment"># Convert to RGB format</span>
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)


<span class="hljs-comment"># Generate segmentation mask</span>
output_mask = mask_generator.generate(image_rgb)
print(output_mask)
</code></pre>
<p>The generated output is a list of dictionaries, one per mask, each with the following main values:</p>
<ul>
<li><code>segmentation:</code> A boolean array with the same height and width as the image, marking the mask</li>
<li><code>area:</code> An integer storing the area of the mask in pixels</li>
<li><code>bbox:</code> The coordinates of the boundary box in [x, y, w, h] format</li>
<li><code>predicted_iou:</code> The model's own estimate of the mask's quality (predicted intersection-over-union)</li>
</ul>
<p><img src="https://lh7-us.googleusercontent.com/zvUNSrvPrv8-Z1idbMLHXKv8iXzWlInik9R2fdJ24HQc5EBxdAgqaiEFTeE4UalWdUvA0R0L9dQuqDDZVucoBWwTMBld9aCJ8NKRTp2vxE-fYnvsbIEL8Z1kRfnQFsCVGb4HGf0pkkuNT6Wss1iMX6c" alt="Image" width="600" height="400" loading="lazy">
<em>The generated output is a dictionary with the main values</em></p>
<p>So how do we visualize our output mask?</p>
<p>Well, it's a simple Python function that takes the list of mask dictionaries generated by SAM and plots the segmentation masks using their shape values and coordinates.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Function that inputs the output and plots image and mask</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">show_output</span>(<span class="hljs-params">result_dict,axes=None</span>):</span>
     <span class="hljs-keyword">if</span> axes:
        ax = axes
     <span class="hljs-keyword">else</span>:
        ax = plt.gca()
        ax.set_autoscale_on(<span class="hljs-literal">False</span>)
     sorted_result = sorted(result_dict, key=(<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-string">'area'</span>]),      reverse=<span class="hljs-literal">True</span>)
     <span class="hljs-comment"># Plot for each segment area</span>
     <span class="hljs-keyword">for</span> val <span class="hljs-keyword">in</span> sorted_result:
        mask = val[<span class="hljs-string">'segmentation'</span>]
        img = np.ones((mask.shape[<span class="hljs-number">0</span>], mask.shape[<span class="hljs-number">1</span>], <span class="hljs-number">3</span>))
        color_mask = np.random.random((<span class="hljs-number">1</span>, <span class="hljs-number">3</span>)).tolist()[<span class="hljs-number">0</span>]
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">3</span>):
            img[:,:,i] = color_mask[i]
            ax.imshow(np.dstack((img, mask*<span class="hljs-number">0.5</span>)))
</code></pre>
<p>Let’s use this function to plot our raw input image and segmented mask:</p>
<pre><code class="lang-python">_,axes = plt.subplots(<span class="hljs-number">1</span>,<span class="hljs-number">2</span>, figsize=(<span class="hljs-number">16</span>,<span class="hljs-number">16</span>))
axes[<span class="hljs-number">0</span>].imshow(image_rgb)
show_output(output_mask, axes[<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://lh7-us.googleusercontent.com/m7RxR_KOL-nSBtptL-dEbsV_EN7w21sqQMiCnfvrr83hwxAhe7jgXWLUhMgjoGzpO4QHgSbnoCOtN5SB__kokKC_OykSCxEo7ntXYd1LihwL3BBlAgUNqn70-E35yQS-Xvb2JrnpYOYTjShEmCg9w9w" alt="Image" width="600" height="400" loading="lazy">
<em>Model has segmented every object</em></p>
<p>As you can see, the model has segmented every object in the image using a zero-shot method in one single go! 🌟</p>
<h2 id="heading-how-to-use-sam-with-bounding-box-prompts">How to Use SAM with Bounding Box Prompts</h2>
<p>Sometimes, we may want to segment only a specific portion of an image. To achieve this, you can input a rough bounding box identifying the area of interest, and SAM will segment the object inside it.</p>
<p>To implement this, import and initialize the <code>SamPredictor</code> and use the <code>set_image()</code> function to pass the input image. Next, call the <code>predict</code> function, providing the bounding box coordinates as input for the parameter <code>box</code>, as shown in the snippet below. The bounding box prompt should be in the [X-min, Y-min, X-max, Y-max] format.</p>
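<p>Note that the snippet below assumes a <code>bbox_prompt</code> array and two small plotting helpers, <code>show_mask</code> and <code>show_box</code>, which aren't defined in this article. Here is a minimal sketch of them; the box coordinates are hypothetical, so adjust them to your own image:</p>
<pre><code class="lang-python">import numpy as np
import matplotlib.pyplot as plt

# Hypothetical bounding box prompt in [X-min, Y-min, X-max, Y-max] format
bbox_prompt = np.array([70, 80, 570, 420])

def show_mask(mask, ax):
    # Overlay the mask on the axes as a semi-transparent blue layer
    color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
    h, w = mask.shape[-2:]
    ax.imshow(mask.reshape(h, w, 1) * color.reshape(1, 1, -1))

def show_box(box, ax):
    # Draw the bounding box as a green rectangle outline
    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor='none', lw=2))
</code></pre>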
<pre><code class="lang-python"><span class="hljs-comment"># Set up the SAM model with the encoded image</span>
mask_predictor = SamPredictor(sam)
mask_predictor.set_image(image_rgb)


<span class="hljs-comment"># Predict mask with bounding box prompt</span>
masks, scores, logits = mask_predictor.predict(
    box=bbox_prompt,
    multimask_output=<span class="hljs-literal">False</span>
)


<span class="hljs-comment"># Plot the bounding box prompt and predicted mask</span>
plt.imshow(image_rgb)
show_mask(masks[<span class="hljs-number">0</span>], plt.gca())
show_box(bbox_prompt, plt.gca())
plt.show()
</code></pre>
<p><img src="https://lh7-us.googleusercontent.com/DoiDVGgozu4ZDeBMyJWbSlCt3CGFnxd7SFlfWFuvuUu_ByZuHc2pA75C2dbaygBwIQqmHcPCBoEsVFaqs_dxpAskPVZxXOoejgu2j0JIrkwDmjPr3aa7xgsgdpmcG2vVETURBkZ32EOKNFZrDzvmQLA" alt="Image" width="600" height="400" loading="lazy">
<em>The green bounding box was our input prompt in this output, and the blue represents our predicted mask.</em></p>
<h2 id="heading-how-to-use-sam-with-points-as-prompts">How to Use SAM with Points as Prompts</h2>
<p>What if you need the object's mask for a certain point in the image? You can provide the point’s coordinates as an input prompt to SAM. The model will then generate the three most relevant segmentation masks, which helps when there is ambiguity about the main object of interest. </p>
<p>The first steps are similar to what we did in previous sections. Initialize the predictor module with the input image. Next, provide the input prompt as [X,Y] coordinates to the parameter <code>point_coords</code>.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Initialize the model with the input image</span>
<span class="hljs-keyword">from</span> segment_anything <span class="hljs-keyword">import</span> sam_model_registry, SamPredictor
sam = sam_model_registry[MODEL_TYPE](checkpoint=CHECKPOINT_PATH).to(device=DEVICE)
mask_predictor = SamPredictor(sam)
mask_predictor.set_image(image_rgb)
<span class="hljs-comment"># Provide points as input prompt [X,Y]-coordinates</span>
input_point = np.array([[<span class="hljs-number">250</span>, <span class="hljs-number">200</span>]])
input_label = np.array([<span class="hljs-number">1</span>])


<span class="hljs-comment"># Predict the segmentation mask at that point</span>
masks, scores, logits = mask_predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=<span class="hljs-literal">True</span>,
)
</code></pre>
<p>As we have set the <code>multimask_output</code> parameter to True, the model returns three output masks. Let’s visualize them by plotting the masks along with their input prompt.</p>
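<p>The plotting code for this step isn't shown above, so here is a minimal sketch that reuses the <code>show_mask</code> helper from the previous section and also prints each mask's self-evaluated IOU score:</p>
<pre><code class="lang-python"># Plot each of the three predicted masks with its IOU score
for i, (mask, score) in enumerate(zip(masks, scores)):
    plt.figure(figsize=(8, 8))
    plt.imshow(image_rgb)
    show_mask(mask, plt.gca())
    # Mark the input prompt point with a green star
    plt.scatter(input_point[:, 0], input_point[:, 1], color='green', marker='*', s=200)
    plt.title(f"Mask {i + 1}, predicted IOU: {score:.3f}")
    plt.axis('off')
    plt.show()
</code></pre>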
<p><img src="https://lh7-us.googleusercontent.com/etMcljU5T2wlLBfbJdV46L4n1I2KUZe2nswYJVFs0Hh-xRFFs-nArO9i5rEr1xU3Er77T7TTn7uenU9Tu1_H4SuSwjGyAtOYe-Jt7_-UQpO05Rv3dOIs5Y3Q-1I41VepltOi_tyBiKSf0RMfWhwVUaQ" alt="Image" width="600" height="400" loading="lazy">
<em>In the above figure, the green star denotes the prompt point, and the blue represents the predicted mask. While Mask 1 has poor coverage, Masks 2 and 3 have good accuracy for my needs.</em></p>
<p>I have also printed the self-evaluated IOU scores for each mask. IOU stands for Intersection Over Union and measures the overlap between the predicted mask and the object's true outline.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You can build a tailored segmentation dataset for your field by gathering raw images and utilizing the SAM tool for annotation. The model has shown consistent performance, even in tricky conditions like noise or occlusion. </p>
<p>In an upcoming version, the authors plan to add support for text prompts, aiming to make the model even more user-friendly. </p>
<p>I hope this information proves helpful for you!</p>
<p>Thank you for reading! I'm Jess, and I'm an expert at Hyperskill. You can check out our <a target="_blank" href="https://hyperskill.org/tracks/42?utm_source=fc_hs&amp;utm_medium=social&amp;utm_campaign=jess"><strong>ML courses</strong></a> on the platform. </p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Implement Computer Vision with Deep Learning and TensorFlow ]]>
                </title>
                <description>
                    <![CDATA[ Computer vision is being used in more and more places. From enhancing security systems to improving healthcare diagnostics, computer vision techniques are revolutionizing multiple industries.  We just published a 37-hour course on the freeCodeCamp.or... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-implement-computer-vision-with-deep-learning-and-tensorflow/</link>
                <guid isPermaLink="false">66b2032525ef0bb2c5a51748</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ TensorFlow ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Tue, 06 Jun 2023 15:08:54 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/06/compvision.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Computer vision is being used in more and more places. From enhancing security systems to improving healthcare diagnostics, computer vision techniques are revolutionizing multiple industries.</p>
<p> We just published a 37-hour course on the freeCodeCamp.org YouTube channel that will teach you about deep learning for computer vision using TensorFlow. The course was expertly created by Folefac Martins from Neuralearn.ai.</p>
<h2 id="heading-a-sneak-peek-into-the-course">A Sneak Peek into the Course</h2>
<p>This course is meticulously designed to cover a broad range of topics, starting from the basics of tensors and variables to the implementation of advanced deep learning models for complex tasks such as human emotion detection and image generation.</p>
<p>After introducing the prerequisites and discussing what learners can expect from the course, the first segment focuses on the foundational aspects of tensors and variables. You'll understand the basics, initialization and casting, indexing, and common TensorFlow functions. The topics extend to cover the intriguing concepts of ragged, sparse, and string tensors, laying the groundwork for building neural networks.</p>
<p>As you venture into the world of neural networks, you'll start by predicting car prices. This practical project involves steps from data preparation to measuring model performance, and it'll provide an understanding of linear regression models, error sanctioning, and training and optimization techniques.</p>
<p>The course then delves into convolutional neural networks (ConvNets), which are particularly useful for image data. You will use ConvNets to diagnose malaria, a task that includes data preparation, visualization, and processing, and learn how to build ConvNets with TensorFlow. Along the way, you'll explore binary cross-entropy loss, model training and evaluation, and saving and loading models on Google Drive.</p>
<p>Advanced topics in TensorFlow, such as custom loss and metrics, eager and graph modes, and custom training loops, are also thoroughly discussed. A significant portion of the course is devoted to improving model performance, evaluating classification models, and using data augmentation techniques to enhance the quality and diversity of data.</p>
<p>The course proceeds to explore modern Convolutional Neural Networks like AlexNet, VGGNet, ResNet, MobileNet, and EfficientNet, applied to a human emotions detection project. Additionally, the course illustrates the black box of these models by visualizing intermediate layers and using the Gradcam method.</p>
<p>There's a great section dedicated to Transformers in Vision, understanding and building Vision Transformers (ViTs) from scratch, and fine-tuning a Huggingface ViT. This section includes practical training with the Weights and Biases tool for experiment tracking, hyperparameter tuning, and dataset and model versioning, practices collectively known as MLOps.</p>
<p>The course then moves on to important topics in model deployment, including converting TensorFlow models to Onnx format, understanding and implementing quantization, building and deploying an API with FastAPI, and load testing with Locust.</p>
<p>Finally, the course concludes with a module on object detection using the YOLO algorithm and image generation using Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).</p>
<h2 id="heading-the-learning-experience">The Learning Experience</h2>
<p>What sets this course apart is the combination of theoretical understanding and practical applications. It is a guided journey through the intricacies of TensorFlow, deep learning, and computer vision, using real-world projects such as car price prediction, malaria diagnosis, human emotion detection, and image generation.</p>
<p>The course is perfect for anyone passionate about machine learning and AI, regardless of their current expertise level. So whether you're a complete beginner, a data scientist looking to update your skills, or an AI enthusiast, this course promises a thorough and practical understanding of computer vision and deep learning with TensorFlow.</p>
<p>Watch the full course <a target="_blank" href="https://www.youtube.com/watch?v=IA3WxTTPXqQ">on the freeCodeCamp.org YouTube channel</a> (37-hour course, with subtitles).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/IA3WxTTPXqQ" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ TensorFlow for Computer Vision – Full Course on Python for Machine Learning ]]>
                </title>
                <description>
                    <![CDATA[ TensorFlow can do some amazing things when it comes to computer vision. We just published a full course on the freeCodeCamp.org YouTube channel that will teach you how to use TensorFlow 2 for computer vision applications. Nour Islam Mokhtari created ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-tensorflow-for-computer-vision/</link>
                <guid isPermaLink="false">66b2037127569435a9255acb</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ TensorFlow ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Tue, 05 Oct 2021 15:07:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/10/tf-vision.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>TensorFlow can do some amazing things when it comes to computer vision.</p>
<p>We just published a full course on the freeCodeCamp.org YouTube channel that will teach you how to use TensorFlow 2 for computer vision applications.</p>
<p>Nour Islam Mokhtari created this course. Nour is a Machine Learning Engineer and experienced teacher.</p>
<p>The course shows you how to create two computer vision projects. The first involves an image classification model with a prepared dataset. The second is a more real-world problem where you will have to clean and prepare a dataset before using it.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-26.png" alt="Image" width="600" height="400" loading="lazy">
<em>MNIST Dataset with labels</em></p>
<p>Here are the topics covered in this course:</p>
<ul>
<li>Why learn Tensorflow</li>
<li>We will be using an IDE and not notebooks</li>
<li>Visual Studio Code (how to download and install it)</li>
<li>Miniconda - how to install it</li>
<li>Miniconda - why we need it</li>
<li>How are we going to use conda virtual environments in VS Code?</li>
<li>Installing Tensorflow 2 (CPU version)</li>
<li>Installing Tensorflow 2 (GPU version)</li>
<li>What do we want to achieve?</li>
<li>Exploring MNIST dataset</li>
<li>Tensorflow layers</li>
<li>Building a neural network the sequential way</li>
<li>Compiling the model and fitting the data</li>
<li>Building a neural network the functional way</li>
<li>Building a neural network the Model Class way</li>
<li>Things we should add</li>
<li>Restructuring our code for better readability</li>
<li>First part summary</li>
<li>What we want to achieve</li>
<li>Downloading and exploring the dataset</li>
<li>Preparing train and validation sets</li>
<li>Preparing the test set</li>
<li>Building a neural network the functional way</li>
<li>Creating data generators</li>
<li>Instantiating the generators</li>
<li>Compiling the model and fitting the data</li>
<li>Adding callbacks</li>
<li>Evaluating the model</li>
<li>Potential improvements</li>
<li>Running prediction on single images</li>
</ul>
<p>Watch the full course below or <a target="_blank" href="https://youtu.be/cPmjQ9V6Hbk">on the freeCodeCamp.org YouTube channel</a> (4.5-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/cPmjQ9V6Hbk" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Advanced Computer Vision with Python ]]>
                </title>
                <description>
                    <![CDATA[ More and more applications are using computer vision these days. We just published a full course on the freeCodeCamp.org YouTube channel that will help you learn advanced computer vision using Python. You will learn state of the art computer vision t... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/advanced-computer-vision-with-python/</link>
                <guid isPermaLink="false">66b2005808bc664c3c097e48</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ freeCodeCamp.org ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Fri, 28 May 2021 12:26:38 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/05/computervision.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>More and more applications are using computer vision these days.</p>
<p>We just published a full course on the freeCodeCamp.org YouTube channel that will help you learn advanced computer vision using Python. You will learn state of the art computer vision techniques by building five projects with libraries such as OpenCV and Mediapipe.</p>
<p>If you are a beginner, don't be afraid of the term "advanced". Even though the concepts are advanced, the course teaches them in a way that is easy to follow. </p>
<p>This course is taught by Murtaza Hassan. Murtaza has a popular YouTube channel about Robotics and AI, and now he is sharing his expertise with the freeCodeCamp audience.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/05/hand.gif" alt="Image" width="600" height="400" loading="lazy"></p>
<p>In the first half of the course you will learn core techniques by implementing hand tracking, pose estimation, face detection, and face mesh. </p>
<p>In the second half of the course you will create five projects with real-world applications. Here is what you will create:</p>
<ul>
<li>Gesture Volume Control</li>
<li>Finger Counter</li>
<li>AI Personal Trainer</li>
<li>AI Virtual Painter</li>
<li>AI Virtual Mouse</li>
</ul>
<p>Watch the full course below or on <a target="_blank" href="https://youtu.be/01sAkU_NvOY">the freeCodeCamp.org YouTube channel</a> (6-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/01sAkU_NvOY" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Manage Computer Vision Datasets in Python with Remo ]]>
                </title>
                <description>
                    <![CDATA[ By Pier Paolo Ippolito Computer Vision is one of the most important applications of Machine Learning. Some common commercial applications of Computer Vision are: Predictive maintenance for industrial infrastructure, oil and gas pipelines, and commer... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/manage-computer-vision-datasets-in-python-with-remo/</link>
                <guid isPermaLink="false">66d46096c7632f8bfbf1e483</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 10 Dec 2020 19:28:08 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2020/12/d148d60c3269c7e0a3070eec97a5e497-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Pier Paolo Ippolito</p>
<p>Computer Vision is one of the most important applications of Machine Learning. Some common commercial applications of Computer Vision are:</p>
<ul>
<li>Predictive maintenance for industrial infrastructure, oil and gas pipelines, and commercial real estate</li>
<li>Quality assurance automation</li>
<li>Landscape inventory and parcel management based on satellite imagery and drone footage</li>
</ul>
<p>Some of the most common techniques used to accomplish these kinds of tasks are:</p>
<ul>
<li>Image Classification</li>
<li>Object Detection</li>
<li>Instance Segmentation</li>
</ul>
<p>During the past decade, many frameworks such as TensorFlow, Keras, and PyTorch have been developed to make it easier to build Computer Vision-based models. </p>
<p>But it is still relatively difficult to work with image data due to the necessary image pre-processing, labelling, and annotation visualization.</p>
<p>As part of this article, I am going to introduce you to <a target="_blank" href="https://remo.ai/">Remo</a>, a free Python library designed to help developers work on Computer Vision tasks. Remo can help you:</p>
<ul>
<li>Organize and visualize images and annotations</li>
<li>Efficiently annotate</li>
<li>Work and collaborate as a team on the data</li>
</ul>
<p>Remo can be used either in a Jupyter Notebook or in the Google Colab environment. In this article, all the code is going to be based on the Google Colab set-up and the full notebook is freely available at <a target="_blank" href="https://colab.research.google.com/drive/1G0X6ieL9_O5jbdpPPG72nulNhxKELwzd?usp=sharing">this link.</a></p>
<h2 id="heading-how-remo-improves-image-management">How Remo Improves Image Management</h2>
<p>There are a number of legacy open-source annotation tools for images available out there. <a target="_blank" href="https://github.com/tzutalin/labelImg">LabelImg</a> is one of the most popular ones. </p>
<p>Compared to these tools, Remo offers smart tools to annotate more efficiently (for example, shortcuts and xclick draw) and functionalities that help you collaborate and organize your work. You can mark images as Done or To Do, sort them and search them, and so on – which is very useful when you're working with thousands of images.</p>
<p>But dataset management is where Remo really innovates. At present, images in Computer Vision projects are usually stored as flat files on a local hard disk or in some cloud storage, and annotations are saved as raw XML/JSON/CSV files. </p>
<p>To visualize them, you would usually either open each file individually and try to imagine where annotations are, or plot them one by one in Python. </p>
<p>Instead, Remo gives you full control and visibility of all the data.</p>
<h2 id="heading-demonstration-of-how-remo-works">Demonstration of How Remo Works</h2>
<p>First of all, we need to install all the necessary dependencies. This can be easily done in Google Colab by running the following two lines of code:</p>
<pre><code class="lang-python">!pip install remo
!python -m remo_app init --colab
</code></pre>
<p>Once we've installed Remo, we can then create a dataset using some example images freely available on Amazon Web Services.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> remo
<span class="hljs-keyword">import</span> pandas

link = [<span class="hljs-string">'https://remo-scripts.s3-eu-west-1.amazonaws.com/open_images_sample_dataset.zip'</span>]

df = remo.create_dataset(name = <span class="hljs-string">'Example Images Dataset'</span>,
                    urls = link, 
                    annotation_task = <span class="hljs-string">"Object Detection"</span>)

<span class="hljs-comment"># Output</span>
<span class="hljs-comment"># Acquiring data - completed                          </span>
<span class="hljs-comment"># Processing annotation files: 1 of 1 files                                  </span>
<span class="hljs-comment"># Processing data - completed       </span>
<span class="hljs-comment"># Data upload completed</span>
</code></pre>
<p>By running the Remo <strong>list_datasets()</strong> command we can then easily check what datasets we currently have available.</p>
<pre><code class="lang-python">remo.list_datasets()

<span class="hljs-comment"># Output</span>
<span class="hljs-comment"># [Dataset  1 - 'Example Images Dataset' - 10 images]</span>
</code></pre>
<p>We are now ready to use Remo's graphical interface in order to inspect our dataset and see the different options available. </p>
<p>In Figure 1, you'll see a simple example of how easy it can be to visualize and annotate our data using Remo.</p>
<pre><code class="lang-python">df.view()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/11/remo.gif" alt="Image" width="600" height="400" loading="lazy">
<em>Figure 1: Remo's GUI Data Pre-processing</em></p>
<p>Another important advantage of using Remo is that it lets you quickly get key dataset statistics either through Python code or the user interface. </p>
<p>This can be particularly useful when you're trying to understand how annotations are distributed between different images and if the overall classes distribution is balanced or not.</p>
<pre><code class="lang-python">df.get_annotation_statistics()

<span class="hljs-comment"># Output</span>
<span class="hljs-comment"># [{'AnnotationSet ID': 1,</span>
<span class="hljs-comment">#  'AnnotationSet name': 'Object detection',</span>
<span class="hljs-comment">#  'creation_date': None,</span>
<span class="hljs-comment">#  'last_modified_date': '2020-11-28T22:04:48.263767Z',</span>
<span class="hljs-comment">#  'n_classes': 18,</span>
<span class="hljs-comment">#  'n_images': 10,</span>
<span class="hljs-comment">#  'n_objects': 98,</span>
<span class="hljs-comment">#  'top_3_classes': [{'count': 27, 'name': 'Fruit'},</span>
<span class="hljs-comment">#   {'count': 12, 'name': 'Sports equipment'},</span>
<span class="hljs-comment">#   {'count': 10, 'name': 'Human arm'}]}]</span>
</code></pre>
<p>You can see similar results by using Remo's Graphical Interface (Figure 2).</p>
<pre><code class="lang-python">df.view_annotation_stats()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/11/remo2.gif" alt="Image" width="600" height="400" loading="lazy">
<em>Figure 2: Remo's Statistics Functionalities</em></p>
<p>Finally, if you used the Remo interface to add annotations to the images in your dataset, you can export them automatically in CSV format using Remo's <strong>export_annotations_to_file()</strong> function, which lets you reuse them later. </p>
<pre><code class="lang-python">df.export_annotations_to_file(<span class="hljs-string">'images_annotations.zip'</span>, annotation_format=<span class="hljs-string">'csv'</span>, export_tags = <span class="hljs-literal">False</span>)
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>To summarize, some of the key functionalities provided by Remo are:</p>
<ul>
<li>Dataset management capabilities</li>
<li>Support for multiple file formats and computer vision tasks</li>
<li>User friendly interface and enhanced annotation tools</li>
<li>Easy collaboration on a project</li>
<li>Support for Virtual Machine use</li>
</ul>
<p>If you are interested in either finding out more about Remo (like how to integrate Remo with other frameworks such as PyTorch) or how to set up this workflow in a Jupyter Notebook environment, the <a target="_blank" href="https://remo.ai/docs/">official Remo documentation</a> is a great place to start. </p>
<p><em>I hope you enjoyed this article, thank you for reading!</em></p>
<h2 id="heading-contact-me">Contact me</h2>
<p>If you want to keep updated with my latest articles and projects, <a target="_blank" href="https://medium.com/@pierpaoloippolito28?source=post_page---------------------------">follow me on Medium</a> and subscribe to my <a target="_blank" href="http://eepurl.com/gwO-Dr?source=post_page---------------------------">mailing list</a>. Here are some of my contact details:</p>
<ul>
<li><a target="_blank" href="https://uk.linkedin.com/in/pier-paolo-ippolito-202917146?source=post_page---------------------------">Linkedin</a></li>
<li><a target="_blank" href="https://pierpaolo28.github.io/blog/?source=post_page---------------------------">Personal Blog</a></li>
<li><a target="_blank" href="https://pierpaolo28.github.io/?source=post_page---------------------------">Personal Website</a></li>
<li><a target="_blank" href="https://www.patreon.com/user?u=32155890">Patreon</a></li>
<li><a target="_blank" href="https://towardsdatascience.com/@pierpaoloippolito28?source=post_page---------------------------">Medium Profile</a></li>
<li><a target="_blank" href="https://github.com/pierpaolo28?source=post_page---------------------------">GitHub</a></li>
<li><a target="_blank" href="https://www.kaggle.com/pierpaolo28?source=post_page---------------------------">Kaggle</a></li>
</ul>
<p>Cover photo from <a target="_blank" href="https://remo.ai/">Remo documentation.</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create an Optical Character Reader Using Blazor and Azure Computer Vision ]]>
                </title>
                <description>
                    <![CDATA[ By Ankit Sharma Introduction In this article, we will create an optical character recognition (OCR) application using Blazor and the Azure Computer Vision Cognitive Service.  Computer Vision is an AI service that analyzes content in images. We will u... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-an-optical-character-reader-using-blazor-and-azure-computer-vision/</link>
                <guid isPermaLink="false">66d45dae33b83c4378a517ba</guid>
                
                    <category>
                        <![CDATA[ Aspnetcore ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Azure ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Blazor ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 03 Mar 2020 18:48:00 +0000</pubDate>
                <media:content url="https://cdn-media-2.freecodecamp.org/w1280/5f9c9c4c740569d1a4ca3143.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Ankit Sharma</p>
<h2 id="heading-introduction">Introduction</h2>
<p>In this article, we will create an optical character recognition (OCR) application using Blazor and the Azure Computer Vision Cognitive Service. </p>
<p>Computer Vision is an AI service that analyzes content in images. We will use the OCR feature of Computer Vision to detect the printed text in an image. </p>
<p>The application will extract the text from the image and detect the language of the text. Currently, the OCR API supports 25 languages.</p>
<p>A demo of the application is shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/BlazorComputerVision-1.gif" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li>Install the latest .NET Core 3.1 SDK from <a target="_blank" href="https://dotnet.microsoft.com/download/dotnet-core/3.1">https://dotnet.microsoft.com/download/dotnet-core/3.1</a></li>
<li>Install the latest version of Visual Studio 2019 from <a target="_blank" href="https://visualstudio.microsoft.com/downloads/">https://visualstudio.microsoft.com/downloads/</a></li>
<li>An Azure subscription account. You can create a free Azure account  at <a target="_blank" href="https://azure.microsoft.com/en-in/free/">https://azure.microsoft.com/en-in/free/</a></li>
</ul>
<h2 id="heading-image-requirements">Image requirements</h2>
<p>The OCR API will work on images that meet the requirements mentioned below:</p>
<ul>
<li>The format of the image must be JPEG, PNG, GIF, or BMP.</li>
<li>The size of the image must be between 50 x 50 and 4200 x 4200 pixels.</li>
<li>The image file size should be less than 4 MB.</li>
<li>The text in the image can be rotated by any multiple of 90 degrees plus a small angle of up to 40 degrees.</li>
</ul>
<h2 id="heading-source-code">Source Code</h2>
<p>You can get the source code from <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services">GitHub</a>.</p>
<h2 id="heading-create-the-azure-computer-vision-cognitive-service-resource">Create the Azure Computer Vision Cognitive Service resource</h2>
<p>Log in to the Azure portal, search for Cognitive Services in the search bar, and click on the result. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/CreateTextCogServ.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>On the next screen, click on the Add button. It will open the cognitive services marketplace page. </p>
<p>Search for Computer Vision in the search bar and click on the search result. It will open the Computer Vision API page. Click on the Create button to create a new Computer Vision resource. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/SelectComputerVisionCogServ-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>On the Create page, fill in the details as indicated below.</p>
<ul>
<li><strong>Name</strong>: Give a unique name for your resource.</li>
<li><strong>Subscription</strong>: Select the subscription type from the dropdown.</li>
<li><strong>Pricing tier</strong>: Select the pricing tier as per your choice.</li>
<li><strong>Resource group</strong>: Select an existing resource group or create a new one.</li>
</ul>
<p>Click on the Create button. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/BlazorConfigureComputerVisionCogServ.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>After your resource is successfully deployed, click on the “Go to resource” button. You can see the Key and the endpoint for the newly created Computer Vision resource. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/ComputerVisionCogServKey.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Make a note of the key and the endpoint. We will be using these in the latter part of this article to invoke the Computer Vision OCR API from the .NET Code. The values are masked here for privacy.</p>
<h2 id="heading-create-a-server-side-blazor-application">Create a Server-Side Blazor Application</h2>
<p>Open Visual Studio 2019, click on “Create a new project”. Select “Blazor App” and click on the “Next” button. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/CreateProject.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>On the next window, put the project name as <code>BlazorComputerVision</code> and click on the “Create” button. </p>
<p>The next window will ask you to select the type of Blazor app. Select <code>Blazor Server App</code> and click on the Create button to create a new server-side Blazor application. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/BlazorTemplate.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-installing-computer-vision-api-library">Installing Computer Vision API library</h2>
<p>We will install the Azure Computer Vision API library, which provides us with out-of-the-box models to handle the Computer Vision REST API response. </p>
<p>To install the package, navigate to Tools &gt;&gt; NuGet Package Manager &gt;&gt; Package Manager Console. It will open the Package Manager Console. Run the command as shown below.</p>
<pre><code>Install-Package Microsoft.Azure.CognitiveServices.Vision.ComputerVision -Version <span class="hljs-number">5.0</span><span class="hljs-number">.0</span>
</code></pre><p>You can learn more about this package at the <a target="_blank" href="https://www.nuget.org/packages/Microsoft.Azure.CognitiveServices.Vision.ComputerVision/">NuGet gallery</a>.</p>
<h2 id="heading-create-the-models">Create the Models</h2>
<p>Right-click on the <code>BlazorComputerVision</code> project and select Add &gt;&gt; New Folder. Name the folder as <code>Models</code>. Again, right-click on the <code>Models</code> folder and select Add &gt;&gt; Class to add a new class file. Put the name of your class as <code>LanguageDetails.cs</code> and click Add.</p>
<p>Open <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Models/LanguageDetails.cs"><code>LanguageDetails.cs</code></a> and put the following code inside it.</p>
<pre><code class="lang-c#"><span class="hljs-keyword">namespace</span> <span class="hljs-title">BlazorComputerVision.Models</span>
{
    <span class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> <span class="hljs-title">LanguageDetails</span>
    {
        <span class="hljs-keyword">public</span> <span class="hljs-keyword">string</span> Name { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }
        <span class="hljs-keyword">public</span> <span class="hljs-keyword">string</span> NativeName { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }
        <span class="hljs-keyword">public</span> <span class="hljs-keyword">string</span> Dir { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }
    }
}
</code></pre>
<p>Similarly, add a new class file <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Models/AvailableLanguage.cs"><code>AvailableLanguage.cs</code></a> and put the following code inside it.</p>
<pre><code class="lang-c#"><span class="hljs-keyword">using</span> System.Collections.Generic;

<span class="hljs-keyword">namespace</span> <span class="hljs-title">BlazorComputerVision.Models</span>
{
    <span class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> <span class="hljs-title">AvailableLanguage</span>
    {
        <span class="hljs-keyword">public</span> Dictionary&lt;<span class="hljs-keyword">string</span>, LanguageDetails&gt; Translation { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }
    }
}
</code></pre>
<p>Finally, we will add a class as a DTO (Data Transfer Object) for sending data back to the client. Add a new class file <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Models/OcrResultDTO.cs"><code>OcrResultDTO.cs</code></a> and put the following code inside it.</p>
<pre><code class="lang-c#"><span class="hljs-keyword">namespace</span> <span class="hljs-title">BlazorComputerVision.Models</span>
{
    <span class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> <span class="hljs-title">OcrResultDTO</span>
    {
        <span class="hljs-keyword">public</span> <span class="hljs-keyword">string</span> Language { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }

        <span class="hljs-keyword">public</span> <span class="hljs-keyword">string</span> DetectedText { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }
    }
}
</code></pre>
<h2 id="heading-create-the-computer-vision-service">Create the Computer Vision Service</h2>
<p>Right-click on the <code>BlazorComputerVision/Data</code> folder and select Add &gt;&gt; Class to add a new class file. Put the name of the file as <code>ComputerVisionService.cs</code> and click on Add.</p>
<p>Open the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Data/ComputerVisionService.cs"><code>ComputerVisionService.cs</code></a> file and put the following code inside it.</p>
<pre><code class="lang-c#"><span class="hljs-keyword">using</span> BlazorComputerVision.Models;
<span class="hljs-keyword">using</span> Microsoft.Azure.CognitiveServices.Vision.ComputerVision.Models;
<span class="hljs-keyword">using</span> Newtonsoft.Json;
<span class="hljs-keyword">using</span> Newtonsoft.Json.Linq;
<span class="hljs-keyword">using</span> System;
<span class="hljs-keyword">using</span> System.Net.Http;
<span class="hljs-keyword">using</span> System.Net.Http.Headers;
<span class="hljs-keyword">using</span> System.Text;
<span class="hljs-keyword">using</span> System.Threading.Tasks;

<span class="hljs-keyword">namespace</span> <span class="hljs-title">BlazorComputerVision.Data</span>
{
    <span class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> <span class="hljs-title">ComputerVisionService</span>
    {
        <span class="hljs-keyword">static</span> <span class="hljs-keyword">string</span> subscriptionKey;
        <span class="hljs-keyword">static</span> <span class="hljs-keyword">string</span> endpoint;
        <span class="hljs-keyword">static</span> <span class="hljs-keyword">string</span> uriBase;

        <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">ComputerVisionService</span>(<span class="hljs-params"></span>)</span>
        {
            subscriptionKey = <span class="hljs-string">"b993f3afb4e04119bd8ed37171d4ec71"</span>;
            endpoint = <span class="hljs-string">"https://ankitocrdemo.cognitiveservices.azure.com/"</span>;
            uriBase = endpoint + <span class="hljs-string">"vision/v2.1/ocr"</span>;
        }

        <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">async</span> Task&lt;OcrResultDTO&gt; <span class="hljs-title">GetTextFromImage</span>(<span class="hljs-params"><span class="hljs-keyword">byte</span>[] imageFileBytes</span>)</span>
        {
            StringBuilder sb = <span class="hljs-keyword">new</span> StringBuilder();
            OcrResultDTO ocrResultDTO = <span class="hljs-keyword">new</span> OcrResultDTO();
            <span class="hljs-keyword">try</span>
            {
                <span class="hljs-keyword">string</span> JSONResult = <span class="hljs-keyword">await</span> ReadTextFromStream(imageFileBytes);

                OcrResult ocrResult = JsonConvert.DeserializeObject&lt;OcrResult&gt;(JSONResult);

                <span class="hljs-keyword">if</span> (!ocrResult.Language.Equals(<span class="hljs-string">"unk"</span>))
                {
                    <span class="hljs-keyword">foreach</span> (OcrLine ocrLine <span class="hljs-keyword">in</span> ocrResult.Regions[<span class="hljs-number">0</span>].Lines)
                    {
                        <span class="hljs-keyword">foreach</span> (OcrWord ocrWord <span class="hljs-keyword">in</span> ocrLine.Words)
                        {
                            sb.Append(ocrWord.Text);
                            sb.Append(<span class="hljs-string">' '</span>);
                        }
                        sb.AppendLine();
                    }
                }
                <span class="hljs-keyword">else</span>
                {
                    sb.Append(<span class="hljs-string">"This language is not supported."</span>);
                }
                ocrResultDTO.DetectedText = sb.ToString();
                ocrResultDTO.Language = ocrResult.Language;
                <span class="hljs-keyword">return</span> ocrResultDTO;
            }
            <span class="hljs-keyword">catch</span>
            {
                ocrResultDTO.DetectedText = <span class="hljs-string">"Error occurred. Try again"</span>;
                ocrResultDTO.Language = <span class="hljs-string">"unk"</span>;
                <span class="hljs-keyword">return</span> ocrResultDTO;
            }
        }

        <span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">async</span> Task&lt;<span class="hljs-keyword">string</span>&gt; <span class="hljs-title">ReadTextFromStream</span>(<span class="hljs-params"><span class="hljs-keyword">byte</span>[] byteData</span>)</span>
        {
            <span class="hljs-keyword">try</span>
            {
                HttpClient client = <span class="hljs-keyword">new</span> HttpClient();
                client.DefaultRequestHeaders.Add(<span class="hljs-string">"Ocp-Apim-Subscription-Key"</span>, subscriptionKey);
                <span class="hljs-keyword">string</span> requestParameters = <span class="hljs-string">"language=unk&amp;detectOrientation=true"</span>;
                <span class="hljs-keyword">string</span> uri = uriBase + <span class="hljs-string">"?"</span> + requestParameters;
                HttpResponseMessage response;

                <span class="hljs-keyword">using</span> (ByteArrayContent content = <span class="hljs-keyword">new</span> ByteArrayContent(byteData))
                {
                    content.Headers.ContentType = <span class="hljs-keyword">new</span> MediaTypeHeaderValue(<span class="hljs-string">"application/octet-stream"</span>);
                    response = <span class="hljs-keyword">await</span> client.PostAsync(uri, content);
                }

                <span class="hljs-keyword">string</span> contentString = <span class="hljs-keyword">await</span> response.Content.ReadAsStringAsync();
                <span class="hljs-keyword">string</span> result = JToken.Parse(contentString).ToString();
                <span class="hljs-keyword">return</span> result;
            }
            <span class="hljs-keyword">catch</span> (Exception e)
            {
                <span class="hljs-keyword">return</span> e.Message;
            }
        }

        <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">async</span> Task&lt;AvailableLanguage&gt; <span class="hljs-title">GetAvailableLanguages</span>(<span class="hljs-params"></span>)</span>
        {
            <span class="hljs-keyword">string</span> endpoint = <span class="hljs-string">"https://api.cognitive.microsofttranslator.com/languages?api-version=3.0&amp;scope=translation"</span>;
            <span class="hljs-keyword">var</span> client = <span class="hljs-keyword">new</span> HttpClient();
            <span class="hljs-keyword">using</span> (<span class="hljs-keyword">var</span> request = <span class="hljs-keyword">new</span> HttpRequestMessage())
            {
                request.Method = HttpMethod.Get;
                request.RequestUri = <span class="hljs-keyword">new</span> Uri(endpoint);
                <span class="hljs-keyword">var</span> response = <span class="hljs-keyword">await</span> client.SendAsync(request).ConfigureAwait(<span class="hljs-literal">false</span>);
                <span class="hljs-keyword">string</span> result = <span class="hljs-keyword">await</span> response.Content.ReadAsStringAsync();

                AvailableLanguage deserializedOutput = JsonConvert.DeserializeObject&lt;AvailableLanguage&gt;(result);

                <span class="hljs-keyword">return</span> deserializedOutput;
            }
        }
    }
}
</code></pre>
<p>In the constructor of the class, we have initialized the key and the endpoint URL for the OCR API.</p>
<p>Inside the <code>ReadTextFromStream</code> method, we create an <code>HttpClient</code> and send a POST request with the image bytes as the request content, passing the subscription key in the request header. The OCR API will return a JSON object containing each word from the image as well as the detected language of the text.</p>
<p>The <code>GetTextFromImage</code> method accepts the image data as a byte array and returns an object of type <code>OcrResultDTO</code>. We invoke the <code>ReadTextFromStream</code> method and deserialize the response into an object of type <code>OcrResult</code>. We then form the sentence by iterating over the <code>OcrWord</code> objects.</p>
<p>The <code>GetAvailableLanguages</code> method will return the list of all the languages supported by the Translate Text API. We set the request URI and create an <code>HttpRequestMessage</code>, which will be a GET request. This request returns a JSON object, which is deserialized into an object of type <code>AvailableLanguage</code>.</p>
<h2 id="heading-why-do-we-need-to-fetch-the-list-of-supported-languages">Why do we need to fetch the list of supported languages?</h2>
<p>The OCR API returns the language code (e.g. en for English, de for German, etc.) of the detected language. But we cannot display the language code on the UI as it is not user-friendly. Therefore, we need a dictionary to look up the language name corresponding to the language code.</p>
<p>The Azure Computer Vision OCR API supports 25 languages. To know all the languages supported by OCR API see the list of <a target="_blank" href="https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/language-support">supported languages</a>. </p>
<p>These languages are a subset of the languages supported by the Azure Translate Text API. Since there is no dedicated API endpoint to fetch the list of languages supported by the OCR API, we are using the Translate Text API endpoint to fetch the list of languages. </p>
<p>We will create the language lookup dictionary using the JSON response from this API call and filter the result based on the language code returned by the OCR API.</p>
<h2 id="heading-install-blazorinputfile-nuget-package">Install BlazorInputFile NuGet package</h2>
<p><a target="_blank" href="https://www.nuget.org/packages/BlazorInputFile/">BlazorInputFile</a> is a file input component for Blazor applications. It provides the ability to upload single or multiple files to a Blazor app.</p>
<p>Open the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/BlazorComputerVision.csproj#L8"><code>BlazorComputerVision.csproj</code></a> file and add a dependency for the <code>BlazorInputFile</code> package as shown below:</p>
<pre><code class="lang-xhtml"><span class="hljs-tag">&lt;<span class="hljs-name">ItemGroup</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">PackageReference</span> <span class="hljs-attr">Include</span>=<span class="hljs-string">"BlazorInputFile"</span> <span class="hljs-attr">Version</span>=<span class="hljs-string">"0.1.0-preview-00002"</span> /&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">ItemGroup</span>&gt;</span>
</code></pre>
<p>Open the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Pages/_Host.cshtml#L17"><code>BlazorComputerVision\Pages\_Host.cshtml</code></a> file and add the reference for the package’s JavaScript file by adding the following line in the <code>&lt;head&gt;</code> section.</p>
<pre><code class="lang-js">&lt;script src=<span class="hljs-string">"_content/BlazorInputFile/inputfile.js"</span>&gt;&lt;/script&gt;
</code></pre>
<p>Add the following line in the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/_Imports.razor#L10"><code>_Imports.razor</code></a> file.</p>
<pre><code>@using BlazorInputFile
</code></pre><h2 id="heading-configuring-the-service"><strong>Configuring the Service</strong></h2>
<p>To make the service available to the components, we need to configure it in the server-side app. Open the Startup.cs file and add the following line inside the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Startup.cs#L31"><code>ConfigureServices</code></a> method of the Startup class.</p>
<pre><code class="lang-c#"> services.AddSingleton&lt;ComputerVisionService&gt;();
</code></pre>
<h2 id="heading-creating-the-blazor-ui-component">Creating the Blazor UI Component</h2>
<p>We will add the Razor page in the <code>BlazorComputerVision/Pages</code> folder. By default, we have “Counter” and “Fetch Data” pages provided in our application. These default pages will not affect our application, but for the sake of this tutorial, we will delete the fetchdata and counter pages from the <code>BlazorComputerVision/Pages</code> folder.</p>
<p>Right-click on the <code>BlazorComputerVision/Pages</code> folder and then select Add &gt;&gt; New Item. An “Add New Item” dialog box will open. Select “Visual C#” from the left panel, then select “Razor Component” from the templates panel and name it <code>OCR.razor</code>. Click Add. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/BlazorOCRComponent.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We will add a code-behind file for this razor page to keep the code and presentation separate. This will allow easy maintenance for the application.  </p>
<p>Right-click on the <code>BlazorComputerVision/Pages</code> folder and then select Add &gt;&gt; Class. Name the class <code>OCR.razor.cs</code>. The Blazor framework is smart enough to associate this class file with the razor file. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/BlazorCodeBehind.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-blazor-ui-component-code-behind">Blazor UI component code behind</h2>
<p>Open the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Pages/OCR.razor.cs"><code>OCR.razor.cs</code></a> file and put the following code inside it.</p>
<pre><code class="lang-c#"><span class="hljs-keyword">using</span> Microsoft.AspNetCore.Components;
<span class="hljs-keyword">using</span> System;
<span class="hljs-keyword">using</span> System.Collections.Generic;
<span class="hljs-keyword">using</span> System.Linq;
<span class="hljs-keyword">using</span> System.Threading.Tasks;
<span class="hljs-keyword">using</span> System.IO;
<span class="hljs-keyword">using</span> BlazorComputerVision.Models;
<span class="hljs-keyword">using</span> BlazorInputFile;
<span class="hljs-keyword">using</span> BlazorComputerVision.Data;

<span class="hljs-keyword">namespace</span> <span class="hljs-title">BlazorComputerVision.Pages</span>
{
    <span class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> <span class="hljs-title">OCRModel</span> : <span class="hljs-title">ComponentBase</span>
    {
        [<span class="hljs-meta">Inject</span>]
        <span class="hljs-keyword">protected</span> ComputerVisionService computerVisionService { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }

        <span class="hljs-keyword">protected</span> <span class="hljs-keyword">string</span> DetectedTextLanguage;
        <span class="hljs-keyword">protected</span> <span class="hljs-keyword">string</span> imagePreview;
        <span class="hljs-keyword">protected</span> <span class="hljs-keyword">bool</span> loading = <span class="hljs-literal">false</span>;
        <span class="hljs-keyword">byte</span>[] imageFileBytes;

        <span class="hljs-keyword">const</span> <span class="hljs-keyword">string</span> DefaultStatus = <span class="hljs-string">"Maximum size allowed for the image is 4 MB"</span>;
        <span class="hljs-keyword">protected</span> <span class="hljs-keyword">string</span> status = DefaultStatus;

        <span class="hljs-keyword">protected</span> OcrResultDTO Result = <span class="hljs-keyword">new</span> OcrResultDTO();
        <span class="hljs-keyword">private</span> AvailableLanguage availableLanguages;
        <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-title">Dictionary</span>&lt;<span class="hljs-title">string</span>, <span class="hljs-title">LanguageDetails</span>&gt; LanguageList</span> = <span class="hljs-keyword">new</span> Dictionary&lt;<span class="hljs-keyword">string</span>, LanguageDetails&gt;();
        <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> MaxFileSize = <span class="hljs-number">4</span> * <span class="hljs-number">1024</span> * <span class="hljs-number">1024</span>; <span class="hljs-comment">// 4MB</span>

        <span class="hljs-function"><span class="hljs-keyword">protected</span> <span class="hljs-keyword">override</span> <span class="hljs-keyword">async</span> Task <span class="hljs-title">OnInitializedAsync</span>(<span class="hljs-params"></span>)</span>
        {
            availableLanguages = <span class="hljs-keyword">await</span> computerVisionService.GetAvailableLanguages();
            LanguageList = availableLanguages.Translation;
        }

        <span class="hljs-function"><span class="hljs-keyword">protected</span> <span class="hljs-keyword">async</span> Task <span class="hljs-title">ViewImage</span>(<span class="hljs-params">IFileListEntry[] files</span>)</span>
        {
            <span class="hljs-keyword">var</span> file = files.FirstOrDefault();
            <span class="hljs-keyword">if</span> (file == <span class="hljs-literal">null</span>)
            {
                <span class="hljs-keyword">return</span>;
            }
            <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (file.Size &gt; MaxFileSize)
            {
                status = <span class="hljs-string">$"The file size is <span class="hljs-subst">{file.Size}</span> bytes, this is more than the allowed limit of <span class="hljs-subst">{MaxFileSize}</span> bytes."</span>;
                <span class="hljs-keyword">return</span>;
            }
            <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (!file.Type.Contains(<span class="hljs-string">"image"</span>))
            {
                status = <span class="hljs-string">"Please uplaod a valid image file"</span>;
                <span class="hljs-keyword">return</span>;
            }
            <span class="hljs-keyword">else</span>
            {
                <span class="hljs-keyword">var</span> memoryStream = <span class="hljs-keyword">new</span> MemoryStream();
                <span class="hljs-keyword">await</span> file.Data.CopyToAsync(memoryStream);
                imageFileBytes = memoryStream.ToArray();
                <span class="hljs-keyword">string</span> base64String = Convert.ToBase64String(imageFileBytes, <span class="hljs-number">0</span>, imageFileBytes.Length);

                imagePreview = <span class="hljs-keyword">string</span>.Concat(<span class="hljs-string">"data:image/png;base64,"</span>, base64String);
                memoryStream.Flush();
                status = DefaultStatus;
            }
        }

        <span class="hljs-function"><span class="hljs-keyword">protected</span> <span class="hljs-keyword">private</span> <span class="hljs-keyword">async</span> Task <span class="hljs-title">GetText</span>(<span class="hljs-params"></span>)</span>
        {
            <span class="hljs-keyword">if</span> (imageFileBytes != <span class="hljs-literal">null</span>)
            {
                loading = <span class="hljs-literal">true</span>;
                Result = <span class="hljs-keyword">await</span> computerVisionService.GetTextFromImage(imageFileBytes);
                <span class="hljs-keyword">if</span> (LanguageList.ContainsKey(Result.Language))
                {
                    DetectedTextLanguage = LanguageList[Result.Language].Name;
                }
                <span class="hljs-keyword">else</span>
                {
                    DetectedTextLanguage = <span class="hljs-string">"Unknown"</span>;
                }
                loading = <span class="hljs-literal">false</span>;
            }
        }
    }
}
</code></pre>
<p>We inject the <code>ComputerVisionService</code> into this class.</p>
<p><code>OnInitializedAsync</code> is a Blazor lifecycle method that is invoked when the component is initialized. Inside it, we invoke the <code>GetAvailableLanguages</code> method of our service and use the result to initialize <code>LanguageList</code>, a dictionary that holds the details of the available languages.</p>
<p>Inside the <code>ViewImage</code> method, we check that the uploaded file is an image and that its size is less than 4 MB. We copy the uploaded image to a memory stream and then convert that memory stream to a byte array.</p>
<p>To set the image preview, we convert the image from a byte array to a base64-encoded string. The <code>GetText</code> method invokes the <code>GetTextFromImage</code> method of the service, passing the image byte array as an argument. We look up the language name in the dictionary based on the language code returned from the service. If the language code is not found, we set the language to Unknown.</p>
<h2 id="heading-blazor-ui-component-template">Blazor UI component template</h2>
<p>Open the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Pages/OCR.razor">OCR.razor</a> file and put the following code inside it.</p>
<pre><code class="lang-html">@page "/computer-vision-ocr"
@inherits OCRModel

<span class="hljs-tag">&lt;<span class="hljs-name">h2</span>&gt;</span>Optical Character Recognition (OCR) Using Blazor and Azure Computer Vision Cognitive Service<span class="hljs-tag">&lt;/<span class="hljs-name">h2</span>&gt;</span>

<span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"row"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"col-md-5"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">textarea</span> <span class="hljs-attr">disabled</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"form-control"</span> <span class="hljs-attr">rows</span>=<span class="hljs-string">"10"</span> <span class="hljs-attr">cols</span>=<span class="hljs-string">"15"</span>&gt;</span>@Result.DetectedText<span class="hljs-tag">&lt;/<span class="hljs-name">textarea</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">hr</span> /&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"row"</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"col-sm-5"</span>&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">label</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">strong</span>&gt;</span> Detected Language :<span class="hljs-tag">&lt;/<span class="hljs-name">strong</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">label</span>&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"col-sm-6"</span>&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">input</span> <span class="hljs-attr">disabled</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"text"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"form-control"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@DetectedTextLanguage"</span> /&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"col-md-5"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"image-container"</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">img</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"preview-image"</span> <span class="hljs-attr">src</span>=<span class="hljs-string">@imagePreview</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">InputFile</span> <span class="hljs-attr">OnChange</span>=<span class="hljs-string">"ViewImage"</span> /&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">p</span>&gt;</span>@status<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">hr</span> /&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">disabled</span>=<span class="hljs-string">"@loading"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn btn-primary btn-lg"</span> @<span class="hljs-attr">onclick</span>=<span class="hljs-string">"GetText"</span>&gt;</span>
            @if (loading)
            {
                <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"spinner-border spinner-border-sm mr-1"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
            }
            Extract Text
        <span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
</code></pre>
<p>We have defined the route for this component and inherited the <code>OCRModel</code> class, which allows us to access the properties and methods of that class from the template. Bootstrap is used for designing the UI. We have a text area to display the detected text and a text box to display the detected language. The image tag shows the image preview after an image is uploaded. The <code>&lt;InputFile&gt;</code> component allows us to upload an image file and invokes the <code>ViewImage</code> method as soon as we upload an image.</p>
<h2 id="heading-add-styling-for-the-component">Add styling for the component</h2>
<p>Navigate to the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/wwwroot/css/site.css#L185-L197">BlazorComputerVision\wwwroot\css\site.css</a> file and add the following style definitions inside it.</p>
<pre><code class="lang-css"><span class="hljs-selector-class">.preview-image</span> {
    <span class="hljs-attribute">max-height</span>: <span class="hljs-number">300px</span>;
    <span class="hljs-attribute">max-width</span>: <span class="hljs-number">300px</span>;
}
<span class="hljs-selector-class">.image-container</span> {
    <span class="hljs-attribute">display</span>: flex;
    <span class="hljs-attribute">padding</span>: <span class="hljs-number">15px</span>;
    <span class="hljs-attribute">align-content</span>: center;
    <span class="hljs-attribute">align-items</span>: center;
    <span class="hljs-attribute">justify-content</span>: center;
    <span class="hljs-attribute">border</span>: <span class="hljs-number">2px</span> dashed skyblue;
}
</code></pre>
<h2 id="heading-adding-link-to-navigation-menu"><strong>Adding Link to Navigation menu</strong></h2>
<p>The last step is to add a link to our OCR component in the navigation menu. Open the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Shared/NavMenu.razor#L15-L19">BlazorComputerVision/Shared/NavMenu.razor</a> file and add the following code into it.</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">li</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"nav-item px-3"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">NavLink</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"nav-link"</span> <span class="hljs-attr">href</span>=<span class="hljs-string">"computer-vision-ocr"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"oi oi-list-rich"</span> <span class="hljs-attr">aria-hidden</span>=<span class="hljs-string">"true"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span> Computer Vision
    <span class="hljs-tag">&lt;/<span class="hljs-name">NavLink</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">li</span>&gt;</span>
</code></pre>
<p>Remove the navigation links for Counter and Fetch-data components as they are not required for this application.</p>
<h2 id="heading-execution-demo">Execution Demo</h2>
<p>Press F5 to launch the application. Click on the Computer Vision button on the nav menu on the left. On the next page, upload an image with some text in it and click on the “Extract Text” button. You will see the extracted text in the text area on the left along with the detected language for the text. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Execution_English.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Now we will upload an image with some French text on it. You can see the extracted text, and the detected language is French. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Execution_French.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>If we try to upload an image with text in an unsupported language, we will get an error. Refer to the image shown below, where an image with text written in Hindi is uploaded.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Execution_Hindi.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-summary"><strong>Summary</strong></h2>
<p>We have created an optical character recognition (OCR) application using Blazor and the Computer Vision Azure Cognitive Service. We added the ability to upload an image file using the <code>BlazorInputFile</code> component. The application extracts the printed text from the uploaded image and recognizes the language of the text. We used the OCR API of Computer Vision, which can recognize printed text in 25 languages.</p>
<p>Get the Source code from <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services">GitHub</a> and play around to get a better understanding.</p>
<h2 id="heading-see-also">See Also</h2>
<ul>
<li><a target="_blank" href="https://ankitsharmablogs.com/multi-language-translator-using-blazor-and-azure-cognitive-services/">Multi-Language Translator Using Blazor And Azure Cognitive Services</a></li>
<li><a target="_blank" href="https://ankitsharmablogs.com/facebook-authentication-and-authorization-in-server-side-blazor-app/">Facebook Authentication And Authorization In Server-Side Blazor App</a></li>
<li><a target="_blank" href="https://ankitsharmablogs.com/google-authentication-and-authorization-in-server-side-blazor-app/">Google Authentication And Authorization In Server-Side Blazor App</a></li>
<li><a target="_blank" href="https://ankitsharmablogs.com/policy-based-authorization-in-angular-using-jwt/">Policy-Based Authorization In Angular Using JWT</a></li>
<li><a target="_blank" href="https://ankitsharmablogs.com/continuous-deployment-for-angular-app-using-heroku-and-github/">Continuous Deployment For Angular App Using Heroku And GitHub</a></li>
<li><a target="_blank" href="https://ankitsharmablogs.com/hosting-a-blazor-application-on-firebase/">Hosting A Blazor Application on Firebase</a></li>
<li><a target="_blank" href="https://ankitsharmablogs.com/deploying-a-blazor-application-on-azure/">Deploying A Blazor Application On Azure</a></li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Train an Image Classifier and Teach Your Computer Japanese ]]>
                </title>
                <description>
                    <![CDATA[ By Ajay Uppili Arasanipalai Introduction Hi. Hello. こんにちは Those squiggly characters you just saw are from a language called Japanese. You’ve probably heard of it if you’ve ever watched Dragon Ball Z. _Source_ Here’s the problem though: you know thos... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-teach-your-computer-japanese/</link>
                <guid isPermaLink="false">66d45d5ed7a4e35e38434932</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ algorithms ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ fastai,  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Sun, 21 Jul 2019 16:30:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2019/07/kmnist.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Ajay Uppili Arasanipalai</p>
<h2 id="heading-introduction">Introduction</h2>
<p>Hi. Hello. こんにちは</p>
<p>Those squiggly characters you just saw are from a language called Japanese. You’ve probably heard of it if you’ve ever watched Dragon Ball Z.</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951d9e7d17d10d2b65c0b_100881-dragon-ball-z-dragon-ball-fight.gif" alt="Image" width="600" height="400" loading="lazy">
_<a href="http://fanaru.com/dragon-ball-z/image/100881-dragon-ball-z-dragon-ball-fight.gif" target="_blank">Source</a>_</p>
<p>Here’s the problem though: you know those ancient Japanese scrolls that make you look like you’re going to unleash an ultimate samurai ninja overlord super combo move?</p>
<p>Yeah, those. I can’t exactly read them, and it turns out that very few people can.</p>
<p>Luckily, a bunch of smart people understand how important it is that I master the Bijudama-Rasenshuriken, so they invented this thing called deep learning.</p>
<p>So pack your ramen and get ready. In this article, I’ll show you how to train a neural network that can accurately predict Japanese characters from their images.</p>
<p>To ensure that we get good results, I’m going to use an incredible deep learning library called fastai, a wrapper around PyTorch that makes it easy to implement best practices from modern research. You can read more about it on their <a target="_blank" href="https://docs.fast.ai">docs</a>.</p>
<p>With that said, let’s get started.</p>
<h2 id="heading-kmnist">KMNIST</h2>
<p>OK, so before we can create anime subtitles, we’re going to need a dataset. Today we’re going to focus on KMNIST.</p>
<p>This dataset takes examples of characters from the Japanese Kuzushiji script and organizes them into 10 labeled classes. The images measure 28x28 pixels, and there are 70,000 images in total, mirroring the structure of MNIST.</p>
<p>But why KMNIST? Well firstly, it has “MNIST” in its name, and we all know how much people in machine learning love MNIST.</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951dabfaf722c2f74abc4_kmnist.png" alt="Image" width="600" height="400" loading="lazy">
_<a href="http://codh.rois.ac.jp/img/kmnist.png" target="_blank">Source</a>_</p>
<p>So  in theory, you could just change a few lines of that Keras code that  you copy-pasted from Stack Overflow and BOOM! You now have computer code  that can <a target="_blank" href="https://www.wandb.com/articles/collaborative-deep-learning-for-reading-japanese">revive an ancient Japanese script</a>.</p>
<p>Of  course, in practice, it isn’t that simple. For starters, the cute  little model that you trained on MNIST probably won’t do that well.  Because, you know, figuring out whether a number is a 2 or a 5 is just a  tad easier than deciphering a forgotten cursive script that only a  handful of people on earth know how to read.</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951d9e7d17dd802b65c0c_giphy.gif" alt="Animated GIF showing " width="600" height="400" loading="lazy"></p>
<p>Apart from that, I guess I should point out that Kuzushiji, which is what the “K” in KMNIST stands for, is not just 10 characters long. Unfortunately, I’m <strong>NOT</strong> one of the handful of experts who can read the language, so I can’t describe in intricate detail how it works.</p>
<p>But  here’s what I do know: There are actually three variants of these  Kuzushiji character datasets — KMNIST, Kuzushiji-49, and  Kuzushiji-Kanji.</p>
<p>Kuzushiji-49 is a variant with 49 classes instead of 10. Kuzushiji-Kanji is even more insane, with a whopping 3,832 classes.</p>
<p>Yep, you read that right. That’s nearly four times as many classes as ImageNet.</p>
<h2 id="heading-how-to-not-mess-up-your-dataset">How to Not Mess Up Your Dataset</h2>
<p>To  keep things as MNIST-y as possible, it looks like the researchers who  put out the KMNIST dataset kept it in the original format (man, they  really took that whole MNIST thing to heart, didn’t they).</p>
<p>If you take a look at <a target="_blank" href="https://github.com/rois-codh/kmnist">the KMNIST GitHub repo</a>, you’ll see that the dataset is served in two formats: the original MNIST thing, and as a bunch of Numpy arrays.</p>
<p>Of course, I know you were probably too lazy to click that link. So here you go. You can thank me later.</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951da80b61183001c8a76_s_F44ECE835EEC7420BCA13A4D3C5E89A345A0759A53AF1A997CB55909898D5236_1561571516749_Screenshot%2B2019-06-26%2Bat%2B11.20.09%2BPM.png" alt="GitHub screenshot showing the various download formats for the KMNIST dataset" width="600" height="400" loading="lazy">
_<a href="https://github.com/rois-codh/kmnist" target="_blank">Source</a>_</p>
<p>Personally,  I found the NumPy array format easier to work with when using fastai,  but the choice is yours. If you’re using PyTorch, KMNIST comes for free  as a part of <a target="_blank" href="https://pytorch.org/docs/stable/torchvision/datasets.html?highlight=kmnist#kmnist"><code>torchvision.datasets</code></a>.</p>
<p>The  next challenge is actually getting those 10,000-year-old brush strokes  onto your notebook (or IDE, who am I to judge). Luckily, the GitHub repo  mentions that there’s this handy script called <code>download_data.py</code> that’ll  do all the work for us. Yay!</p>
<p><img src="https://paper.dropboxstatic.com/static/img/ace/emoji/1f389.png?version=3.1.2" alt="party popper" width="600" height="400" loading="lazy"></p>
<p>From here, it’ll probably start getting awkward if I continue talking  about how to pre-process your data without actual code. So check out <a target="_blank" href="https://colab.research.google.com/gist/iyaja/fe102ae34312e48e637edd804a450207/kmnist.ipynb">the notebook</a> if you want to dive deeper.</p>
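<p>For the curious, here’s roughly what the NumPy route looks like. This is a minimal sketch, assuming the <code>.npz</code> filenames that <code>download_data.py</code> fetches from the KMNIST repo, and that (as in the official archives) each file stores its single array under the <code>'arr_0'</code> key:</p>
<pre><code class="lang-python">import numpy as np

# Load the KMNIST arrays fetched by download_data.py
train_images = np.load('kmnist-train-imgs.npz')['arr_0']    # (60000, 28, 28)
train_labels = np.load('kmnist-train-labels.npz')['arr_0']  # (60000,)
test_images = np.load('kmnist-test-imgs.npz')['arr_0']      # (10000, 28, 28)
test_labels = np.load('kmnist-test-labels.npz')['arr_0']    # (10000,)

print(train_images.shape, train_labels.shape)
</code></pre>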
<p>Moving on…</p>
<h2 id="heading-should-i-use-a-hyper-ultra-inception-resnet-xxxl">Should I use a hyper ultra Inception ResNet XXXL?‍</h2>
<h3 id="heading-short-answer">Short Answer</h3>
<p>Probably not. A regular ResNet should be fine.</p>
<h3 id="heading-a-little-less-short-answer">A Little Less Short Answer</h3>
<p>Ok, look. By now, you’re probably thinking, “KMNIST big. KMNIST hard. Me need to use very new, very fancy model.”</p>
<p>Did I overdo the Bizzaro voice?</p>
<p>The point is, you <strong>DON’T</strong> need a shiny new model to do well on these image classification tasks.  At best, you’ll probably get a marginal accuracy improvement at the cost  of a whole lot of time and money.</p>
<p>Most of the time, you’ll just waste a whole lot of time and money.</p>
<p>So heed my advice: just stick to good ol’ fashioned ResNets. They work really well, they’re relatively fast and lightweight (compared to some of the other memory hogs like Inception and DenseNet), and best of all, people have been using them for a while, so it shouldn’t be too hard to fine-tune them.</p>
<p>If the  dataset you’re working with is simple like MNIST, use ResNet18. If it’s  medium-difficulty, like CIFAR10, use ResNet34. If it’s really hard,  like ImageNet, use ResNet50. If it’s harder than that, you can probably  afford to use something better than a ResNet.</p>
<p>Don’t believe me? Check out my leading entry for the Stanford DAWNBench competition from April 2019:</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951dae06d1bb7f8a731ba_s_F44ECE835EEC7420BCA13A4D3C5E89A345A0759A53AF1A997CB55909898D5236_1561651063683_D4s0U9_UwAA3vk2.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>What do you see? ResNets everywhere! Now come on, there’s got to be a reason for that.</p>
<h2 id="heading-hyperparameters-galore">Hyperparameters Galore</h2>
<p>A few months ago, I wrote an article on <a target="_blank" href="https://blog.nanonets.com/hyperparameter-optimization/">how to pick the right hyperparameters</a>.  If you’re interested in a more general solution to this herculean task,  go check that out. Here, I’m going to walk you through my process of  picking good-enough hyperparameters to get good-enough results on  KMNIST.</p>
<p>To start off, let’s go over what hyperparameters we need to tune.</p>
<p>We’ve  already decided to use a ResNet34, so that’s that. We don’t need to  figure out the number of layers, filter size, number of filters, etc.  since that comes baked into our model.</p>
<p>See, I told you it would save time.</p>
<p>So  what’s remaining is the big three: learning rate, batch size, and the  number of epochs (plus stuff like dropout probability for which we can  just use the default values).</p>
<p>Let’s go over them one by one.</p>
<h3 id="heading-number-of-epochs">Number of Epochs</h3>
<p>Let’s  start with the number of epochs. As you’ll come to see when you play  around with the model in the notebook, our training is pretty efficient.  We can easily cross 90% accuracy within a few minutes.</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2955a222d8c93af99209f9_Screenshot%202019-07-13%20at%209.23.20%20AM.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>So  given that our training is so fast in the first place, it seems  extremely unlikely that we would use too many epochs and overfit. I’ve  seen other KMNIST models train for over 50 epochs without any issues, so  staying in the 0-30 range should be absolutely fine.</p>
<p>That  means within the scope of the restrictions we’ve put on the model when  it comes to epochs, the more, the merrier. In my experiments, I found  that 10 epochs strike a good balance between model accuracy and training  time.</p>
<h3 id="heading-learning-rate">Learning Rate</h3>
<p>What  I’m about to say is going to piss a lot of people off. But I’ll say it  anyway — We don’t need to pay too much attention to the learning rate.</p>
<p>Yep, you heard me right. But give me a chance to explain.</p>
<p>Instead  of going “Hmm… that doesn’t seem to work, let’s try again with lr=3e-3  ,” we’re going to use a much more systematic and disciplined approach to  finding a good learning rate.</p>
<p>We’re going to use the learning rate finder, a revolutionary idea proposed by Leslie Smith in his <a target="_blank" href="https://arxiv.org/pdf/1506.01186.pdf">paper on cyclical learning rates</a>.</p>
<p>Here’s how it works:</p>
<ul>
<li>First,  we set up our model and prepare to train it for one epoch. As the model  is training, we’ll gradually increase the learning rate.</li>
<li>Along the way, we’ll keep track of the loss at every iteration.</li>
<li>Finally, we select the learning rate that corresponds to the lowest loss.</li>
</ul>
<p>When all is said and done, and you plot the loss against the learning rate, you should see something like this:</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951dabfaf72811274abb2_s_F44ECE835EEC7420BCA13A4D3C5E89A345A0759A53AF1A997CB55909898D5236_1561652469267_Unknown.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Now, before you get all giddy and pick 1e-01 as the learning rate, I’ll have you know that it’s <strong>NOT</strong> the best choice.</p>
<p>That’s because fastai implements a smoothing technique called exponentially weighted averaging, which is the deep learning researcher version of an Instagram filter. It prevents our plots from looking like the result of giving your neighbors’ kid too much time with a blue crayon.</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951dae06d1b5fb4a731bb_art2_loss_vs_lr.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Since  we’re using a form of averaging to make the plot look smooth, the  “minimum” point that you’re looking at on the learning rate finder isn’t  actually a minimum. It’s an average.</p>
<p>Instead, to <em>actually</em> find the learning rate, a good rule of thumb is to pick the learning  rate that’s an order of magnitude lower than the minimum point on the  smoothened plot. That tends to work really well in practice.</p>
<p>I  understand that all this plotting and averaging might seem weird if all  you’ve been brute-forcing learning rate values all your life. So I’d  advise you to check out <a target="_blank" href="https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html">Sylvain Gugger’s explanation of the learning rate finder</a> to learn more.</p>
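<p>To make all of this concrete, here’s a minimal sketch of the whole routine in fastai v1 (the version current at the time of writing). I’m using fastai’s small MNIST sample as a stand-in for the KMNIST DataBunch built in the notebook, and the 1e-2 value is just an example of the order-of-magnitude rule:</p>
<pre><code class="lang-python">from fastai.vision import *

# Stand-in data: swap in the KMNIST DataBunch from the notebook
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path, bs=128)

learn = cnn_learner(data, models.resnet34, metrics=accuracy)
learn.lr_find()        # train briefly while gradually increasing the LR
learn.recorder.plot()  # plot the smoothed loss against the learning rate

# If the smoothed minimum sits around 1e-1, go an order of magnitude
# lower and train with the one-cycle policy:
learn.fit_one_cycle(10, max_lr=1e-2)
</code></pre>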
<h3 id="heading-batch-size">Batch Size</h3>
<p>OK, you caught me red-handed here. My initial experiments used a batch size of 128 since that’s what the top submission used.</p>
<p>I  know, I know. Not very creative. But it’s what I did. Afterward, I  experimented with a few other batch sizes, and I couldn’t get better  results. So 128 it is!</p>
<p>In  general, batch sizes can be a weird thing to optimize, since it  partially depends on the computer you’re using. If you have a GPU with  more VRAM, you can train on larger batch sizes.</p>
<p>So  if I tell you to use a batch size of 2048, for example, instead of  getting that coveted top spot on Kaggle and eternal fame and glory for  life, you might just end up with a CUDA: out of memory error.</p>
<p>So  it’s hard to recommend a perfect batch size because, in practice, there  are clearly computational limits. The best way to pick it is to try out  values that work for you.</p>
<p>But how would you pick a random number from the vast sea of positive integers?</p>
<p>Well, you actually don’t. Since GPU memory is organized in powers of two, it’s a good idea to choose a batch size that’s a power of 2 so that your mini-batches fit snugly in memory.</p>
<p>Here’s what I would do: start off with a moderately large batch size like 512. Then, if you find that your model starts acting weird and the loss is not on a clear downward trend, halve it. Next, repeat the training process with a batch size of 256, and see if it behaves this time.</p>
<p>If it doesn’t, wash, rinse, and repeat.</p>
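<p>If you’d rather not babysit that loop yourself, here’s a hypothetical little helper (plain Python, not any library’s API) that halves the batch size whenever training hits the CUDA out-of-memory error mentioned earlier:</p>
<pre><code class="lang-python">def find_workable_batch_size(train_fn, start=512, minimum=16):
    """Halve the batch size until train_fn(bs) fits in GPU memory."""
    bs = start
    while bs &gt;= minimum:
        try:
            train_fn(bs)   # attempt a training run at this batch size
            return bs
        except RuntimeError as e:
            if 'out of memory' not in str(e):
                raise      # a different error: don't silently swallow it
            bs //= 2       # wash, rinse, repeat with half the batch size
    raise RuntimeError('no workable batch size found')
</code></pre>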
<h2 id="heading-a-few-pretty-pictures">A Few Pretty Pictures</h2>
<p>With  the optimizations going on here, it’s going to be pretty challenging to  keep track of this giant mess of models, metrics, and hyperparameters  that we’ve created.</p>
<p>To ensure that we all remain sane human beings while climbing the accuracy mountain, we’re going to use <a target="_blank" href="https://docs.wandb.com/docs/frameworks/fastai.html">the wandb + fastai integration</a>.</p>
<p>So what does wandb actually do?</p>
<p>It automatically keeps track of a whole lot of statistics about your model and how it’s performing. But what’s really cool is that it also provides instant charts and visualizations to keep track of critical metrics like accuracy and loss, all in real time!</p>
<p>If  that wasn’t enough, it also stores all of those charts, visualizations,  and statistics in the cloud, so you can access them anytime anywhere.</p>
<p>Your days of staring at a black terminal screen and fiddling around with matplotlib are over.</p>
<p><a target="_blank" href="https://colab.research.google.com/gist/iyaja/fe102ae34312e48e637edd804a450207/kmnist.ipynb">The notebook tutorial</a> for this article has a straightforward introduction to how it works seamlessly with fastai. You can also check out <a target="_blank" href="https://app.wandb.ai/ajayuppili/kmnist/runs/41gbr2yx">the wandb workspace</a>, where you can take a look at all the stuff I mentioned without writing any code.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>これで終わりです</p>
<p>That means “this is the end.”</p>
<p>But  you didn't need me to tell you that, did you? Not after you went  through the trouble of getting a Japanese character dataset, using the  learning rate finder, training a ResNet using modern best practices, and  watching your model rise to glory using real-time monitoring in the  cloud.</p>
<p>Yep, in about 20 minutes, you actually did all of that! Give yourself a pat on the back.</p>
<p>And please, go watch some Dragon Ball.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to create a simple Image Classifier ]]>
                </title>
                <description>
                    <![CDATA[ By Aditya Image classification is an amazing application of deep learning. We can train a powerful algorithm to model a large image dataset. This model can then be used to classify a similar but unknown set of images.  There is no limit to the applic... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/creating-your-first-image-classifier/</link>
                <guid isPermaLink="false">66d45ddb7df3a1f32ee7f7e7</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ image classification ]]>
                    </category>
                
                    <category>
                        <![CDATA[ neural networks ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 18 Jul 2019 18:06:36 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2019/07/mnist-fashion3.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Aditya</p>
<p>Image classification is an amazing application of deep learning. We can train a powerful algorithm to model a large image dataset. This model can then be used to classify a similar but unknown set of images. </p>
<p>There is no limit to the applications of image classification. You can use it in your next app or you can use it to solve some real world problem. That's all up to you. But to someone who is fairly new to this realm, it might seem very challenging at first. How should I get my data? How should I build my model? What tools should I use? </p>
<p>In this article we will discuss all of that - from finding a dataset to training your model. I will try to make things as simple as possible by avoiding some technical details (<em>PS: Please note that this doesn't mean those details are not important. I will mention some great resources which you can refer to if you want to learn more about those topics</em>). The purpose of this article is to explain the basic process of building an image classifier, and that's what we will focus on here.</p>
<p>We will build an image classifier for the <a target="_blank" href="https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/">Fashion-MNIST Dataset</a>. The Fashion-MNIST dataset is a collection of <a target="_blank" href="https://research.zalando.com/">Zalando's</a> article images. It contains 60,000 images for the training set and 10,000 images for the test set (<em>we will discuss the test and training datasets, along with the validation dataset, later</em>). These images are labeled with 10 different classes.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/07/mnist-fashion.png" alt="Image" width="600" height="400" loading="lazy">
<em><a target="_blank" href="https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/">Source</a></em></p>
<h2 id="heading-importing-libraries">Importing Libraries</h2>
<p>Our goal is to train a deep learning model that can classify a given set of images into one of these 10 classes. Now that we have our dataset, we should move on to the tools we need. There are many libraries and tools out there that you can choose based on your own project requirements. For this one I will stick to the following:</p>
<ol>
<li><a target="_blank" href="https://www.numpy.org/"><strong>Numpy</strong></a> - Python library for numerical computation</li>
<li><a target="_blank" href="https://pandas.pydata.org/"><strong>Pandas</strong></a> - Python library data manipulation</li>
<li><a target="_blank" href="https://matplotlib.org/"><strong>Matplotlib</strong></a> - Python library data visualisation</li>
<li><a target="_blank" href="https://keras.io/"><strong>Keras</strong></a> - Python library based on tensorflow for creating deep learning models</li>
<li><a target="_blank" href="https://jupyter.org/"><strong>Jupyter</strong></a> - I will run all my code on Jupyter Notebooks. You can install it via the link. You can use <a target="_blank" href="https://colab.research.google.com/">Google Colabs</a> also if you need better computational power.</li>
</ol>
<p>Along with these, we will also use <a target="_blank" href="https://scikit-learn.org/">scikit-learn</a>. The purpose of these libraries will become clearer once we dive into the code.</p>
<p>Okay! We have our tools and libraries ready. Now we should start setting up our code.</p>
<p>Start with importing all the above mentioned libraries. Along with importing libraries I have also imported some specific modules from these libraries. Let me go through them one by one.</p>
<pre><code class="lang-python3">import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import keras 

from sklearn.model_selection import train_test_split 
from keras.utils import to_categorical 

from keras.models import Sequential 
from keras.layers import Conv2D, MaxPooling2D 
from keras.layers import Dense, Dropout 
from keras.layers import Flatten, BatchNormalization
</code></pre>
<p><strong><a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">train_test_split</a>:</strong> This module splits the training dataset into training and validation data. The reason behind this split is to check if our model is <a target="_blank" href="https://en.wikipedia.org/wiki/Overfitting">overfitting</a> or not. We use a training dataset to train our model and then we will compare the resulting accuracy to validation accuracy. If the difference between both quantities is significantly large, then our model is probably overfitting. We will reiterate through our model building process and making required changes along the way. Once we are satisfied with our training and validation accuracies, we will make final predictions on our test data. </p>
<p><strong>to_categorical:</strong> to_categorical is a keras utility. It is used to convert the categorical labels into <a target="_blank" href="https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/">one-hot encodings</a>. Let's say we have three labels ("apples", "oranges", "bananas"), then one hot encodings for each of these would be [1, 0, 0] -&gt; "apples", [0, 1, 0] -&gt; "oranges",   [0, 0, 1] -&gt; "bananas".</p>
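<p>As a quick illustration (a tiny sketch of my own, not part of the original walkthrough), here is what <code>to_categorical</code> does to three such labels:</p>
<pre><code class="lang-python3">from keras.utils import to_categorical

labels = [0, 1, 2]  # e.g. "apples", "oranges", "bananas"
print(to_categorical(labels))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
</code></pre>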
<p>The rest of the Keras modules we have imported are convolutional layers. We will discuss convolutional layers when we start building our model. We will also give a quick glance to what each of these layers do.</p>
<h2 id="heading-data-pre-processing">Data Pre-processing</h2>
<p>For now we will shift our attention to getting our data and analysing it. You should always remember the importance of pre-processing and analysing the data. It not only gives you insights about the data but also helps to locate inconsistencies. </p>
<p>A very slight variation in data can sometimes lead to a devastating result for your model. This makes it important to preprocess your data before using it for training. So with that in mind let's start data preprocessing.</p>
<pre><code class="lang-python3">train_df = pd.read_csv('./fashion-mnist_train.csv')
test_df = pd.read_csv('./fashion-mnist_test.csv')
</code></pre>
<p>First of all let's import our dataset (<em><a target="_blank" href="https://www.kaggle.com/zalando-research/fashionmnist">Here</a> is the link to download this dataset on your system</em>). Once you have imported the dataset, run the following command.</p>
<pre><code class="lang-python3">train_df.head()
</code></pre>
<p>This command will show you what your data looks like. The following screenshot shows the output of this command.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/07/head_output.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can see how our image data is stored in the form of pixel values. But we cannot feed data to our model in this format. So, we will have to convert it into numpy arrays. </p>
<pre><code class="lang-python3">train_data = np.array(train_df.iloc[:, 1:])
test_data = np.array(test_df.iloc[:, 1:])
</code></pre>
<p>Now, it's time to get our labels. </p>
<pre><code class="lang-python3">train_labels = to_categorical(train_df.iloc[:, 0])
test_labels = to_categorical(test_df.iloc[:, 0])
</code></pre>
<p>Here, you can see that we have used <code>to_categorical</code> to convert our categorical data into one-hot encodings.</p>
<p>We will now reshape the data and cast it into <em>float32</em> type so that we can use it conveniently. </p>
<pre><code>rows, cols = <span class="hljs-number">28</span>, <span class="hljs-number">28</span> 

train_data = train_data.reshape(train_data.shape[<span class="hljs-number">0</span>], rows, cols, <span class="hljs-number">1</span>)
test_data = test_data.reshape(test_data.shape[<span class="hljs-number">0</span>], rows, cols, <span class="hljs-number">1</span>)

train_data = train_data.astype(<span class="hljs-string">'float32'</span>)
test_data = test_data.astype(<span class="hljs-string">'float32'</span>)
</code></pre><p>We are almost done. Let's finish preprocessing our data by normalizing it. Normalizing the image data maps all the pixel values in each image to values between 0 and 1. This helps us reduce inconsistencies in the data. Before normalizing, the image data can have large variations in pixel values, which can lead to some unusual behaviour during the training process.</p>
<pre><code>train_data /= <span class="hljs-number">255.0</span>
test_data /= <span class="hljs-number">255.0</span>
</code></pre><h2 id="heading-convolutional-neural-networks">Convolutional Neural Networks</h2>
<p>So, data preprocessing is done. Now we can start building our model. We will build a <a target="_blank" href="http://cs231n.github.io/convolutional-networks/">Convolutional Neural Network</a> for modeling the image data. CNNs are modified versions of regular <a target="_blank" href="https://en.wikipedia.org/wiki/Neural_network">neural networks</a>. These are modified specifically for image data. Feeding images to regular neural networks would require our network to have a large number of input neurons. For example just for a 28x28 image we would require 784 input neurons. This would create a huge mess of training parameters.</p>
<p>CNNs fix this problem by already assuming that the input is going to be an image. The main purpose of convolutional neural networks is to take advantage of the spatial structure of the image and to extract high level features from that and then train on those features. It does so by performing a <a target="_blank" href="https://en.wikipedia.org/wiki/Convolution">convolution</a> operation on the matrix of pixel values.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/07/convSobel.gif" alt="Image" width="600" height="400" loading="lazy">
<em><a target="_blank" href="https://mlnotebook.github.io/post/CNN1/">Source</a></em></p>
<p>The visualization above shows how the convolution operation works. And the Conv2D layer we imported earlier does the same thing. The first matrix (<em>from the left</em>) in the demonstration is the input to the convolutional layer. Then another matrix, called a "filter" or "kernel", is applied to each window of the input matrix: the overlapping values are multiplied element-wise and summed. The output of this operation is the input to the next layer.</p>
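<p>If it helps to see that spelled out, here is a minimal NumPy sketch (my own illustration, ignoring padding and strides) of the windowed multiply-and-sum that a convolutional layer performs for a single filter:</p>
<pre><code class="lang-python3">import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over every window of the image; at each position,
    # multiply element-wise and sum to produce one output value.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d(image, kernel))  # a 4x4 feature map
</code></pre>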
<p>Other than convolutional layers, a typical CNN also has two other types of layers: 1) a <a target="_blank" href="https://machinelearningmastery.com/pooling-layers-for-convolutional-neural-networks/">pooling layer</a>, and 2) a <a target="_blank" href="https://stats.stackexchange.com/questions/182102/what-do-the-fully-connected-layers-do-in-cnns">fully connected layer</a>.</p>
<p>Pooling layers are used to generalize the output of the convolutional layers. Along with generalizing, they also reduce the number of parameters in the model by down-sampling the output of the convolutional layer.</p>
<p>As we just learned, convolutional layers extract high-level features from image data. Fully connected layers use these high-level features to train the parameters and learn to classify the images.</p>
<p>We will also use the <a target="_blank" href="https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/">Dropout</a>, <a target="_blank" href="https://en.wikipedia.org/wiki/Batch_normalization">Batch-normalization</a> and <a target="_blank" href="https://stackoverflow.com/questions/43237124/role-of-flatten-in-keras">Flatten</a> layers in addition to the layers mentioned above. Flatten layer converts the output of convolutional layers into a one dimensional feature vector. It is important to flatten the outputs because Dense (Fully connected) layers only accept a feature vector as input. Dropout and Batch-normalization layers are for preventing the model from <a target="_blank" href="https://en.wikipedia.org/wiki/Overfitting">overfitting</a>.</p>
<pre><code class="lang-python">train_x, val_x, train_y, val_y = train_test_split(train_data, train_labels, test_size=<span class="hljs-number">0.2</span>)

batch_size = <span class="hljs-number">256</span>
epochs = <span class="hljs-number">5</span>
input_shape = (rows, cols, <span class="hljs-number">1</span>)
</code></pre>
<pre><code>def baseline_model():
    model = Sequential()
    model.add(BatchNormalization(input_shape=input_shape))
    model.add(Conv2D(<span class="hljs-number">32</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), padding=<span class="hljs-string">'same'</span>, activation=<span class="hljs-string">'relu'</span>))
    model.add(MaxPooling2D(pool_size=(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>), strides=(<span class="hljs-number">2</span>,<span class="hljs-number">2</span>)))
    model.add(Dropout(<span class="hljs-number">0.25</span>))

    model.add(BatchNormalization())
    model.add(Conv2D(<span class="hljs-number">32</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), padding=<span class="hljs-string">'same'</span>, activation=<span class="hljs-string">'relu'</span>))
    model.add(MaxPooling2D(pool_size=(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)))
    model.add(Dropout(<span class="hljs-number">0.25</span>))

    model.add(Flatten())
    model.add(Dense(<span class="hljs-number">128</span>, activation=<span class="hljs-string">'relu'</span>))
    model.add(Dropout(<span class="hljs-number">0.5</span>))
    model.add(Dense(<span class="hljs-number">10</span>, activation=<span class="hljs-string">'softmax'</span>))
    <span class="hljs-keyword">return</span> model
</code></pre><p>The code that you see above is the code for our CNN model. You can structure these layers in many different ways to get good results. There are many popular CNN architectures which give state of the art results. Here, I have just created my own simple architecture for the purpose of this problem. Feel free to try your own and let me know what results you get :)</p>
<h2 id="heading-training-the-model">Training the model</h2>
<p>Once you have created the model you can import it and then compile it by using the code below.</p>
<pre><code>model = baseline_model()
model.compile(loss=<span class="hljs-string">'categorical_crossentropy'</span>, optimizer=<span class="hljs-string">'sgd'</span>, metrics=[<span class="hljs-string">'accuracy'</span>])
</code></pre><p><strong>model.compile</strong> configures the learning process for our model. We have passed it three arguments. These arguments define the <a target="_blank" href="https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/">loss function</a> for our model, <a target="_blank" href="https://blog.algorithmia.com/introduction-to-optimizers/">optimizer</a> and <a target="_blank" href="https://keras.io/metrics/">metrics</a>.</p>
<pre><code>history = model.fit(train_x, train_y,
          batch_size=batch_size,
          epochs=epochs,
          verbose=<span class="hljs-number">1</span>,
          validation_data=(val_x, val_y))
</code></pre><p>And finally, by running the code above, you can train your model. I am training this model for just five epochs, but you can increase the number of epochs. After your training process is completed, you can make predictions on the test set by using the following code.</p>
<pre><code>predictions = model.predict(test_data)
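
# A hypothetical follow-up (not in the original article): turn the predicted
# probabilities into class labels and measure accuracy on the test set.
predicted_classes = np.argmax(predictions, axis=1)
true_classes = np.argmax(test_labels, axis=1)
print('Test accuracy:', np.mean(predicted_classes == true_classes))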
</code></pre><h2 id="heading-conclusion">Conclusion</h2>
<p>Congrats! You did it. You have taken your first step into the amazing world of computer vision.</p>
<p>You have created your own image classifier. Even though this is a great achievement, we have just scratched the surface.</p>
<p>There is a lot you can do with CNNs. The applications are limitless. I hope that this article helped you to get an understanding of how the process of training these models works. </p>
<p>Working on other datasets on your own will help you understand this even better. I have also created a GitHub <a target="_blank" href="https://github.com/aditya2000/MNIST-Fashion-">repository</a> for the code I used in this article. So, if this article was useful for you, please let me know.</p>
<p>If you have any questions or you want to share your own results or if you just want to say "hi", feel free to hit me up on <a target="_blank" href="https://twitter.com/aditya_dehal">twitter</a>, and I'll try to do my best to help you. And finally <strong>Thanks a lot for reading this article!!</strong> :)</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Everything you need to know to master Convolutional Neural Networks ]]>
                </title>
                <description>
                    <![CDATA[ By Tirmidzi Faizal Aflahi Look at the photo below: _Courtesy of [Pix2PixHD](https://github.com/NVIDIA/pix2pixHD" rel="noopener" target="blank" title=") That is not a real photo. You can open the image in a new tab and zoom into the image. Do you see... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/everything-you-need-to-know-to-master-convolutional-neural-networks-ef98ca3c7655/</link>
                <guid isPermaLink="false">66c349fe4f1fc448a3678ff7</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 26 Apr 2019 20:16:26 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*-Rip5afVhP2NlhVYfPp_mA.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Tirmidzi Faizal Aflahi</p>
<p>Look at the photo below:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Qo6M5bDKw4FummLtjX8NrAwl6hbz8uDtw9ut" alt="Image" width="600" height="400" loading="lazy">
<em>Courtesy of <a target="_blank" href="https://github.com/NVIDIA/pix2pixHD">Pix2PixHD</a></em></p>
<p><strong>That is not a real photo</strong>. You can open the image in a new tab and zoom into the image. Do you see the mosaics?</p>
<p>The picture was actually generated by an artificial intelligence program. Doesn’t it feel realistic? It’s great, isn’t it?</p>
<p>It has been only 7 years since the technology was brought to the public by Alex Krizhevsky and friends via the ImageNet competition. This competition is an annual Computer Vision competition to categorize pictures into 1000 different classes. From Alaskan Malamutes to toilet paper. Alex and friends built something called AlexNet, and it won the competition with a large margin between it and second place.</p>
<p>This technology is called a <strong>Convolutional Neural Network</strong>. It’s a sub-branch of Deep Neural Networks which performs exceptionally well in processing images.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Fao8o7A2hPuF04JmdAL4fdh-P1CFodRdi985" alt="Image" width="600" height="400" loading="lazy">
<em>Courtesy of ImageNet</em></p>
<p>The image above shows the error rate produced by the software that won the competition in each of several years. In 2016, <strong>it was actually better than human performance</strong>, which is around 5%.</p>
<p>The introduction of Deep Learning into this field is actually <em>game breaking</em> more than game-changing.</p>
<h3 id="heading-convolutional-neural-network-architecture">Convolutional Neural Network Architecture</h3>
<p>So, how does this technology work?</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/iZ-JSs6hw3oDgPEn8Lw3wHdQDr4xNoU1tvjV" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Convolutional Neural Networks perform better than other Deep Neural Network architectures because of their unique process. Instead of looking at the image one pixel at a time, <strong>CNNs group several pixels together</strong> (for example, a 3×3 patch of pixels like in the image above) so they can understand a spatial pattern.</p>
<p>In other words, CNNs can “see” a group of pixels forming a line or a curve. Because of the deep nature of Deep Neural Networks, at the next level they see not groups of pixels, but groups of lines and curves forming shapes. And so on, until they form a complete picture.</p>
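<p>To make this concrete, here is a toy sketch in Python (my illustration, not the article’s code) of what a single convolution does: a small kernel slides across the image, and each output value summarizes one 3×3 group of pixels.</p>
<pre><code># Toy illustration of a convolution: each output pixel summarizes a
# small neighborhood of input pixels. Not optimized, just explicit.
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

image = np.random.rand(32, 32)            # a 32x32 grayscale "image"
edge_kernel = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]])      # responds strongly to vertical edges
features = convolve2d(image, edge_kernel) # a 30x30 feature map
</code></pre>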
<p><img src="https://cdn-media-1.freecodecamp.org/images/ogzEWfxkTLd-tGiwOIUvieopX-1rAioqoFSC" alt="Image" width="600" height="400" loading="lazy">
<em>Deep Convolutional Neural Network by <a target="_blank" href="https://www.researchgate.net/figure/Learning-hierarchy-of-visual-features-in-CNN-architecture_fig1_281607765">Mynepalli</a></em></p>
<p>There are many things you need to learn if you want to understand CNNs in depth, from the very basics, like kernels and pooling layers, onward. But nowadays, <strong>you can just dive in and use the many open source projects built on this technology.</strong></p>
<p>This is possible largely because of a technique called <strong>Transfer Learning</strong>.</p>
<h3 id="heading-transfer-learning">Transfer Learning</h3>
<p>Transfer Learning is a technique that reuses an already-trained Deep Learning model for another, more specific task.</p>
<p>As an example, say you work at a train management company and want to assess whether your trains are on time or not. And you don’t want to hire more people just for this task.</p>
<p><strong>You can just reuse an ImageNet Convolutional Neural Network model, maybe ResNet (the 2015 winner), re-train the network with images of your train fleet, and you will do just fine.</strong></p>
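<p>As a rough sketch of what that looks like in code (my illustration using torchvision, not a prescription; the two-class train example is carried over from above):</p>
<pre><code>import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet and freeze its weights.
model = models.resnet34(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with one sized for our own task,
# e.g. two classes: on-time vs. delayed trains.
model.fc = nn.Linear(model.fc.in_features, 2)
# Only model.fc's parameters now need training on our own images.
</code></pre>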
<p>There are two main competitive edges when you use Transfer Learning.</p>
<ol>
<li><strong>It needs fewer images to perform well than training from scratch</strong>. The ImageNet competition provides around 1 million images to train on. Using transfer learning, you can use only 1,000 or even 100 images and still perform well, because the network was already trained on those 1 million images.</li>
<li><strong>It needs less time to achieve good performance</strong>. To be as good as an ImageNet winner, you would need to train a network for days, and that doesn’t count the time needed to alter the network if it doesn’t work well. Using transfer learning, you may only need several hours or even minutes of training for some tasks. A lot of time saved.</li>
</ol>
<h3 id="heading-image-classification-to-image-generation">Image Classification to Image Generation</h3>
<p>Enabled by transfer learning, many initiatives appeared. If a network can process some images and tell us what the images are about, how about constructing the images themselves?</p>
<p><em>Challenge accepted!</em></p>
<p>The <strong>Generative Adversarial Network</strong> comes onto the scene.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/zkTwwVivOhYrDKvsNo5bAaPiO8g-6NI8jXnM" alt="Image" width="600" height="400" loading="lazy">
<em>CycleGAN by <a target="_blank" href="https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix">Jun-Yan Zhu</a></em></p>
<p>This technology can generate pictures from some input.</p>
<p>It can generate a realistic photo from a painting, using a variant called CycleGAN, which you can see in the photo above. In another use case, it can generate a picture of a bag from a sketch. It can even generate a higher-resolution photo from a low-res one.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/JxxgUM7CBe1EGYEN2QHAfzkN6Lw6QSC4JCpo" alt="Image" width="600" height="400" loading="lazy">
<em><a target="_blank" href="https://github.com/tensorlayer/srgan">Super Resolution Generative Adversarial Network</a></em></p>
<p>Amazing, aren’t they?</p>
<p>Of course. And you can start learning to build them now. But how?</p>
<h3 id="heading-convolutional-neural-network-tutorial">Convolutional Neural Network Tutorial</h3>
<p>So, let’s begin. You will learn that getting started on this topic is easy, so freaking easy. But mastering it is on another level.</p>
<p>Let’s put aside mastering it for now.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/7RlbToHPUJgIejBUne2NDMQWhDss3fTZ4S78" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://unsplash.com/photos/5A06OWU6Wuc?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Thomas Verbruggen</a> on <a target="_blank" href="https://unsplash.com/search/photos/columnar-cactus?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></em></p>
<p>After browsing for several days, I found this project which is really suitable for you to start with.</p>
<h3 id="heading-aerial-cactus-identificationhttpswwwkagglecomcaerial-cactus-identification"><a target="_blank" href="https://www.kaggle.com/c/aerial-cactus-identification">Aerial Cactus Identification</a></h3>
<p>This is a tutorial project from <a target="_blank" href="https://www.kaggle.com/">Kaggle</a>. Your task is to identify whether there is a columnar cactus in an aerial image.</p>
<p>Pretty simple, eh?</p>
<p>You will be given 17,500 images to work with and need to label 4,000 images that have not been labeled. Your score is 1 or 100% if all the 4,000 images are correctly labeled by your program.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/y6aBaIA1GC7twlFe4uFq6RY5dykaTDMG3xVX" alt="Image" width="600" height="400" loading="lazy">
<em>Cactus</em></p>
<p>The images are pretty much like what you see above: a photo of a region that may or may not contain a group of columnar cacti. The photos are 32×32 pixels, and the cacti appear in different orientations, since these are aerial photos.</p>
<p>So what do you need?</p>
<h3 id="heading-convolutional-neural-network-with-python">Convolutional Neural Network with Python</h3>
<p>Yes, Python, the popular language for Deep Learning. With many libraries available, you can practically trial each one and compare. The choices are:</p>
<ol>
<li><strong>Tensorflow</strong>, the most popular Deep Learning library. Built by engineers at Google, it has the biggest contributor base and the most fans. Because the community is so big, you can easily find solutions to your problems. It has <strong>Keras</strong> as a high-level abstraction wrapper, which is very friendly for newbies.</li>
<li><strong>Pytorch</strong>. My favorite Deep Learning library. Built purely on Python, it inherits Python’s pros and cons, so Python developers will feel extremely at home with it. It has a companion library called <strong>FastAI</strong>, which gives it the kind of abstraction Keras gives Tensorflow.</li>
<li><strong>MXNet</strong>. The Deep Learning library by Apache.</li>
<li><strong>Theano</strong>. The predecessor of Tensorflow.</li>
<li><strong>CNTK</strong>. Microsoft’s own Deep Learning library.</li>
</ol>
<p>For this tutorial, let’s use my favorite one, Pytorch, complemented by its abstraction, FastAI.</p>
<p>Before starting, you need to install Python. Go to the <a target="_blank" href="https://www.python.org/downloads/">Python website</a> and download what you need. You need to make sure that you install <strong>version 3.6+</strong>, or it may not be supported by the libraries you will use.</p>
<p>Now, open your command line or terminal and install these things</p>
<pre><code>pip install numpy
pip install pandas
pip install jupyter
</code></pre><p><strong>Numpy</strong> will be used to store the input images, and <strong>pandas</strong> to work with CSV files. Jupyter Notebook is what you need to code interactively with Python.</p>
<p>Then, go to the <a target="_blank" href="https://pytorch.org/">Pytorch website</a> and download what you need. You might want the CUDA version to speed up training. Make sure that you get version 1.0+ of Pytorch.</p>
<p>After that, install torchvision and FastAI:</p>
<pre><code>pip install torchvision
pip install fastai
</code></pre><p>Run Jupyter with the command <strong>jupyter notebook</strong> and it will open a browser window.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/DZEbPNLsV51dIziniiLp0Z6KtlqUaUzjk2rL" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Now, you are ready to go.</p>
<h3 id="heading-prepare-the-data">Prepare the Data</h3>
<p>Import the necessary code:</p>
<pre><code><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> npimport pandas <span class="hljs-keyword">as</span> pd <span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path <span class="hljs-keyword">from</span> fastai <span class="hljs-keyword">import</span> * <span class="hljs-keyword">from</span> fastai.vision <span class="hljs-keyword">import</span> * <span class="hljs-keyword">import</span> torch %matplotlib inline
</code></pre><p>Numpy and Pandas are needed for almost everything you want to do. FastAI and Torch are your Deep Learning libraries. The <strong>%matplotlib inline</strong> magic is used to show charts inside the notebook.</p>
<p>Now, download data files from the <a target="_blank" href="https://www.kaggle.com/c/aerial-cactus-identification/data">competition website.</a></p>
<p>Extract the zip data file and put them inside the jupyter notebook folder.</p>
<p>Let’s say you named your notebook Cacti. Your folder structure would be like this:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/R5fQ0dLngqrAVLPRl8J8lFxgqr4yhXQmm3yg" alt="Image" width="600" height="400" loading="lazy"></p>
<p><strong>Train folder</strong> contains all the images for your training step.</p>
<p><strong>Test folder</strong> contains all the images for submission.</p>
<p><strong>Train CSV</strong> file contains the training data, mapping each image name to the has_cactus column, which is 1 if the image has a cactus and 0 otherwise.</p>
<p><strong>Sample Submission CSV</strong> file contains the formatting for the submission you need to make. The file names listed there correspond to the files inside the Test folder.</p>
<pre><code>train_df = pd.read_csv(<span class="hljs-string">"train.csv"</span>)
</code></pre><p>Load the Train CSV file to a data frame.</p>
<pre><code>data_folder = Path(".")
train_img = ImageList.from_df(train_df, path=data_folder, folder='train')
</code></pre><p>Create a data loader using the <strong>ImageList.from_df</strong> method to map the train_df data frame to the images inside the train folder.</p>
<h3 id="heading-data-augmentation">Data Augmentation</h3>
<p>This is a technique to <strong>create more data from your existing data.</strong> An image of a cat flipped vertically is still a cat. By doing this you can basically multiply your data set by a factor of two, four, or even 16.</p>
<p>You will need this technique a lot if you happen to have very little data to work with.</p>
<pre><code>transformations = get_transforms(do_flip=True, flip_vert=True, max_rotate=<span class="hljs-number">10.0</span>, max_zoom=<span class="hljs-number">1.1</span>, max_lighting=<span class="hljs-number">0.2</span>, max_warp=<span class="hljs-number">0.2</span>, p_affine=<span class="hljs-number">0.75</span>, p_lighting=<span class="hljs-number">0.75</span>)
</code></pre><p>FastAI gives you a nice method to do all of this, called <strong>get_transforms</strong>. You can flip the image vertically or horizontally, rotate, zoom, add lighting/brightness, and warp it.</p>
<p>You can play with the parameters I used above to see how the results look. Or you can <a target="_blank" href="https://docs.fast.ai/vision.transform.html">open the documentation</a> and read about them in detail.</p>
<p>Of course, apply the transformation to your image list:</p>
<pre><code>train_img = train_img.transform(transformations, size=<span class="hljs-number">128</span>)
</code></pre><p>The size parameter scales the input up or down to match the neural network you will use. The network I will use is called <strong>DenseNet</strong>, which won the Best Paper Award at CVPR 2017, and I will feed it images resized to 128×128 pixels.</p>
<h3 id="heading-training-preparation">Training Preparation</h3>
<p>After loading your data, you need to prepare yourself and your data for the most important phase in Deep Learning: Training. <strong>Basically, this is the Learning in Deep Learning</strong>. The network learns from your data and updates itself accordingly so that it performs well on it.</p>
<pre><code>test_df = pd.read_csv("sample_submission.csv")
test_img = ImageList.from_df(test_df, path=data_folder, folder='test')
</code></pre><pre><code>train_img = (train_img
             .split_by_rand_pct(0.01)
             .label_from_df()
             .add_test(test_img)
             .databunch(path='.', bs=64, device=torch.device('cuda:0'))
             .normalize(imagenet_stats))
</code></pre><p>For the training step, you need to split off a small portion of your training data, called <strong>validation data</strong>. You can’t train on this data, because it is your validation tool. <strong>When your Convolutional Neural Network performs well on the validation data, it will likely perform well on the test data</strong> that will be submitted.</p>
<p>FastAI has the convenient method called <strong>split_by_rand_pct</strong> to split a portion of your data into validation data.</p>
<p>It also has the <strong>databunch</strong> method to perform batch processing. I used a batch size of 64 because that is what my GPU memory allows. If you don’t have a GPU, omit the <strong>device</strong> parameter.</p>
<p>Then, the <strong>normalize</strong> method is called to normalize your images, because you will use a pre-trained network. <strong>imagenet_stats</strong> normalizes the images the same way the pre-trained network’s inputs were normalized for the ImageNet competition.</p>
<p>Adding the test data to the training image list makes it easy to predict later on without more pre-processing. Remember, these images will not be trained on, and will not go into your validation set. You just want to pre-process them in the same way as the training images.</p>
<pre><code>learn = cnn_learner(train_img, models.densenet161, metrics=[error_rate, accuracy])
</code></pre><p>You are done preparing your training data. Now, create a learner with <strong>cnn_learner</strong>. As I said before, I will use DenseNet as the pre-trained network. You can use any other network offered in <a target="_blank" href="https://pytorch.org/docs/stable/torchvision/models.html">TorchVision</a>.</p>
<h3 id="heading-the-one-cycle-technique">The One-Cycle Technique</h3>
<p>You could start training right now. But there is always one point of confusion when training any Deep Neural Network, Convolutional Neural Networks included: <strong>choosing the right learning rate</strong>. Training uses an algorithm called Gradient Descent, which tries to decrease the error in steps whose size is controlled by a parameter called the learning rate.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/c1KIT8wg1bopl2bJ8tEXgTE-a37PwRBJFNrF" alt="Image" width="600" height="400" loading="lazy"></p>
<p><strong>A bigger learning rate makes the training steps faster</strong>, but it is prone to overshooting. This can make the error go out of control, like in the picture above. <strong>A smaller learning rate makes the training steps slower</strong>, but the error stays under control.</p>
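<p>To see why, here is a toy illustration (mine, not the article’s) of gradient descent on the one-dimensional function f(x) = x², whose minimum is at x = 0:</p>
<pre><code># Toy gradient descent: a small learning rate crawls toward the minimum,
# while a too-large one overshoots and diverges.
def gradient_descent(grad, x0, lr, steps):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

grad = lambda x: 2 * x  # gradient of f(x) = x^2

print(gradient_descent(grad, x0=5.0, lr=0.01, steps=100))  # ~0.66: slow but stable
print(gradient_descent(grad, x0=5.0, lr=1.10, steps=100))  # huge: out of control
</code></pre>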
<p>So, choosing the right learning rate is really important: make it as big as possible without letting the error go out of control.</p>
<p>It is easier said than done.</p>
<p>So a researcher named Leslie Smith created a technique called the <a target="_blank" href="https://sgugger.github.io/the-1cycle-policy.html">1-cycle policy</a>.</p>
<p>Intuition-wise, you brute-force several learning rates and <strong>find one with nearly minimal error but with some room left to improve</strong>. Let’s try it out in our code.</p>
<pre><code>learn.lr_find()
learn.recorder.plot()
</code></pre><p>It will print something like this:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/zw79NSmNj8dcbd0t3Ua5pBQFNsy6hzsiumVD" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The minimum is around 10⁻¹. So, I think we can use something smaller than that, but not too small. Maybe <strong>3×10⁻²</strong> is a good choice. Let’s try it!</p>
<pre><code>lr = <span class="hljs-number">3e-02</span> learn.fit_one_cycle(<span class="hljs-number">5</span>, slice(lr))
</code></pre><p>Train for several epochs (I chose 5: not too big, not too small), and let’s see the result.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/a2BKjtyc5jOPIkMlofT2QCrXdgYYtGGAJ4VX" alt="Image" width="600" height="400" loading="lazy"></p>
<p><strong>Wait, what!?</strong></p>
<p>Our simple solution gives us 100% accuracy on the validation split! It is remarkably effective, and it only took six minutes to train. What a stroke of luck! <strong>In real life, you will usually do several iterations just to find out which algorithms do better than the others.</strong></p>
<p>I am eager to submit! Haha. <strong>Let’s predict the test folder and submit the result.</strong></p>
<pre><code>preds, _ = learn.get_preds(ds_type=DatasetType.Test)
test_df.has_cactus = preds.numpy()[:, 0]
</code></pre><p>Because you have already put the test images in the training image list, you will not need to pre-process and load your test images.</p>
<pre><code>test_df.to_csv(<span class="hljs-string">'submission.csv'</span>, index=False)
</code></pre><p>This line creates a CSV file containing the image names and the has_cactus column for all 4,000 test images.</p>
<p>When I tried to submit, I realized that you need to submit the CSV via a Kaggle kernel. I had missed that.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/DFBYT7cE69bKw439uShdmeHBwnNN9QOq8GUS" alt="Image" width="600" height="400" loading="lazy">
<em>Courtesy of <a target="_blank" href="https://www.kaggle.com/">Kaggle</a></em></p>
<p>But, luckily, <strong>a kernel is essentially the same as your Jupyter notebook</strong>. You can just copy-paste everything you built in your notebook and submit from there.</p>
<p>And <strong>BAM</strong>!</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/R34ss1jMs-UJUVNLeQE8-qOtHRrovBptiryf" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Good Lord! I got 0.9999 for the public score. That’s really good. But, of course, after a first attempt like that, I wanted a perfect score.</p>
<p>So, I made several tweaks to the network, and once more, BAM!</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/d8BptL2rdf6599N1MEmGnLfzxOgYkDsPwqKm" alt="Image" width="600" height="400" loading="lazy"></p>
<p>I did it! So can you. It’s actually not that hard.</p>
<p>(BTW, this rank was taken on April 13th, so my rank may have dropped by now…)</p>
<h3 id="heading-what-i-learned">What I Learned</h3>
<p>This problem is easy, so you will not face any weird challenges while solving it. That makes it one of the most suitable projects to start with.</p>
<p>Alas, because so many people got a perfect score, I think the admins need to create another, harder test set for submission.</p>
<p>Whatever the reason, there is no barrier for you to try this. <strong>You can try this right now and get good results</strong>.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/eFxKKzNLkMLki84lBbzy9zVUcyTI2lgjTDbA" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://unsplash.com/photos/rGG-BCtNiuo?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Mario Mrad</a> on <a target="_blank" href="https://unsplash.com/search/photos/vision?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></em></p>
<h3 id="heading-final-thoughts">Final Thoughts</h3>
<p>Convolutional Neural Networks are helpful for many tasks, from recognizing images to generating them. Analyzing images nowadays is not as hard as it used to be. Of course, you can do it too if you try.</p>
<p>Just get started, pick a good Convolutional Neural Network project, and get good data.</p>
<p>Good luck!</p>
<p><em>This article is originally published on my blog at <a target="_blank" href="https://thedatamage.com/convolutional-neural-network-explained/">thedatamage</a>.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Computer Vision .js frameworks you need to know ]]>
                </title>
                <description>
                    <![CDATA[ By Shen Huang Computer vision has been a hot topic in recent years, enabling countless great applications. With the effort from some dedicated developers in the world, creating an application utilizing computer vision is no longer rocket science. In ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/computer-vision-js-frameworks-you-need-to-know-b233996103ce/</link>
                <guid isPermaLink="false">66d460f23a8352b6c5a2ab01</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Front-end Development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ JavaScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 18 Mar 2019 16:05:59 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/0*uXRPu25xSI_86KUw" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Shen Huang</p>
<p>Computer vision has been a hot topic in recent years, enabling countless great applications. With the effort of some dedicated developers around the world, creating an application utilizing computer vision is no longer rocket science. In fact, you can build many of these applications in a few lines of JavaScript code. In this article, I will introduce you to some of them.</p>
<h3 id="heading-1-tensorflowjs">1. TensorFlow.js</h3>
<p>Being one of the largest machine learning frameworks, TensorFlow also allows the creation of Node.js and front-end JavaScript applications with <a target="_blank" href="https://www.tensorflow.org/js"><strong>Tensorflow.js</strong></a>. Below is one of their demos matching poses with a collection of images. TensorFlow also has a <a target="_blank" href="https://playground.tensorflow.org/#activation=tanh&amp;batchSize=10&amp;dataset=circle&amp;regDataset=reg-plane&amp;learningRate=0.03&amp;regularizationRate=0&amp;noise=0&amp;networkShape=4,2&amp;seed=0.27185&amp;showTestData=false&amp;discretize=false&amp;percTrainData=50&amp;x=true&amp;y=true&amp;xTimesY=false&amp;xSquared=false&amp;ySquared=false&amp;cosX=false&amp;sinX=false&amp;cosY=false&amp;sinY=false&amp;collectStats=false&amp;problem=classification&amp;initZero=false&amp;hideText=false"><strong>playground</strong></a> that lets us visualize artificial neural networks, which can be great for educational purposes.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/fzXRDjBio2OIxVNIHI2Njxb9sg6x9rVQRAph" alt="Image" width="600" height="400" loading="lazy">
<em>A Move Mirror Demo from <a target="_blank" href="https://experiments.withgoogle.com/move-mirror">Tensorflow.js</a></em></p>
<h3 id="heading-2-amazon-rekognition">2. Amazon Rekognition</h3>
<p><a target="_blank" href="https://aws.amazon.com/rekognition/?sc_channel=PS&amp;sc_campaign=acquisition_US&amp;sc_publisher=google&amp;sc_medium=ACQ-P%7CPS-GO%7CBrand%7CDesktop%7CSU%7CMachine%20Learning%7CRekognition%7CUS%7CEN%7CText&amp;sc_content=aws_recognition_software_e&amp;sc_detail=amazon%20rekognition&amp;sc_category=Machine%20Learning&amp;sc_segment=293645376368&amp;sc_matchtype=e&amp;sc_country=US&amp;s_kwcid=AL!4422!3!293645376368!e!!g!!amazon%20rekognition&amp;ef_id=EAIaIQobChMIwLzV1obx4AIVEK6WCh3MZAPREAAYASAAEgJlv_D_BwE:G:s"><strong>Amazon Rekognition</strong></a> is a powerful cloud-based tool. But they also provide SDKs for JavaScript in browsers which can be found <a target="_blank" href="https://aws.amazon.com/sdk-for-browser/"><strong>here</strong></a>. Below is an image illustrating how detailed their face detection can be.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0pIcn86SNFaM5cbA5CXboENRyfMtX0ayQ3rb" alt="Image" width="800" height="941" loading="lazy">
<em>Facial Feature Detection with the <a target="_blank" href="https://docs.aws.amazon.com/rekognition/latest/dg/faces-detect-images.html">Amazon Rekognition API</a></em></p>
<h3 id="heading-3-opencvjs">3. OpenCV.js</h3>
<p>Being one of the oldest computer vision frameworks out there, <a target="_blank" href="https://opencv.org/"><strong>OpenCV</strong></a> has served developers in computer vision for a very long time. They also have a <a target="_blank" href="https://docs.opencv.org/3.4/d5/d10/tutorial_js_root.html"><strong>JavaScript version</strong></a> allowing developers to implement those features onto a website.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/axCPVu-3ItA12kmt4OLraf0WgqxzZ-BmmUfr" alt="Image" width="600" height="400" loading="lazy">
<em>Example Face Detection with OpenCV, image from <a target="_blank" href="https://dzone.com/articles/face-detection-using-html5">DZone</a></em></p>
<h3 id="heading-4-trackingjs">4. tracking.js</h3>
<p>If you are only looking to build a quick face detection app, such as a web version of the Snapchat filters, you should take a look at <a target="_blank" href="https://trackingjs.com/"><strong>tracking.js</strong></a>. This framework allows integration of face detection into JavaScript with a fairly simple setup. I have also written a <a target="_blank" href="https://medium.freecodecamp.org/how-to-drop-leprechaun-hats-into-your-website-with-computer-vision-b0d115a0f1ad"><strong>guide</strong></a> on using this framework to drop a leprechaun hat onto faces for St. Patrick’s Day.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/ntEODKKA39CkXs9Q8Fcb8uMBEaVC0OZZZpot" alt="Image" width="600" height="400" loading="lazy">
<em>tracking.js face detection <a target="_blank" href="https://trackingjs.com/examples/face_hello_world.html">example</a></em></p>
<h3 id="heading-5-webgazerjs">5. WebGazer.js</h3>
<p>Whether you are trying to perform user experience studies or create new interactive systems for your games or websites, <a target="_blank" href="https://webgazer.cs.brown.edu/"><strong>WebGazer.js</strong></a> can be a great place to start. This powerful framework lets our apps know where a person is looking, using only camera input.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/ofDdoti6XYIdLDUNfnDA4X54j8AHvNXeOpJY" alt="Image" width="600" height="400" loading="lazy">
<em>WebGazer.js gaze tracking <a target="_blank" href="https://webgazer.cs.brown.edu/#examples">example</a></em></p>
<h3 id="heading-6-threearjs">6. three.ar.js</h3>
<p>Another framework from Google, <a target="_blank" href="https://github.com/google-ar/three.ar.js?files=1"><strong>three.ar.js</strong></a> extends the functionalities of <a target="_blank" href="https://developers.google.com/ar/"><strong>ARCore</strong></a> onto front-end JavaScript. It enables us to integrate surface and object detection into browsers, which is the perfect tool for an AR game.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/2jPTttH19OZg9eeSQ7YXiFsQ-Xq8E2bqmk96" alt="Image" width="800" height="302" loading="lazy">
<em><a target="_blank" href="https://github.com/google-ar/three.ar.js?files=1">three.ar.js</a> demo</em></p>
<h3 id="heading-in-the-end">In the End…</h3>
<p>I am passionate about learning new technology and sharing it with the community. If there is anything you wish to read in particular, please let me know. Below are my previous articles related to this subject. Stay tuned and have fun engineering!</p>
<ul>
<li><a target="_blank" href="https://medium.com/swlh/how-computer-vision-is-revolutionizing-ecommerce-d05e0ca11765"><strong>How Computer Vision is Revolutionizing eCommerce</strong></a></li>
<li><a target="_blank" href="https://medium.freecodecamp.org/how-to-drop-leprechaun-hats-into-your-website-with-computer-vision-b0d115a0f1ad"><strong>How to drop LEPRECHAUN-HATS into your website with COMPUTER VISION</strong></a></li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ how to drop LEPRECHAUN-HATS into your website with COMPUTER VISION ]]>
                </title>
                <description>
                    <![CDATA[ By Shen Huang Automatically leprechaun-hat people on your website for St. Patrick’s Day. !!! — WARNING — !!! Giving a person a green hat can be considered OFFENSIVE to some Chinese people, as it has the same meaning as cheating in a relationship. So... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-drop-leprechaun-hats-into-your-website-with-computer-vision-b0d115a0f1ad/</link>
                <guid isPermaLink="false">66d460f63a8352b6c5a2ab09</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ CSS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ JavaScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ UX ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 07 Mar 2019 20:56:14 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*kfqTx__agnemI2s0kRd3rw.gif" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Shen Huang</p>
<h4 id="heading-automatically-leprechaun-hat-people-on-your-website-for-st-patricks-day">Automatically leprechaun-hat people on your website for St. Patrick’s Day.</h4>
<blockquote>
<p><strong>!!! — WARNING — !!!</strong></p>
<p>Giving a person a green hat can be considered <a target="_blank" href="https://mspoweruser.com/microsoft-removes-green-hat-from-vs-2019-installer-after-offending-users-in-china/"><strong>OFFENSIVE</strong></a> to some Chinese people, as it has the same meaning as cheating in a relationship. So use this <strong>CAREFULLY</strong> when you are serving a Chinese user base.</p>
<p><strong>!!! — WARNING — !!!</strong></p>
</blockquote>
<p>In this tutorial, we will go over how to drop a leprechaun hat onto your website images that contain people. The process will be done with the aid of some <strong>Computer Vision</strong> frameworks, so it will be the same amount of work even if you have millions of portraits to go through. A demo can be found <a target="_blank" href="https://shenhuang.github.io/demo_projects/tracking.js-master/TEAM%20MEMBERS%20_%20Teamwebsite.html"><strong>here</strong></a>, thanks to the permission of my teammates.</p>
<p>This tutorial is for more advanced audiences. I am assuming you can figure out a lot of the fundamentals on your own. I have also made some tutorials for total beginners, which I have attached in the end as links.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/oKTeBIcRIikaGpEVv0zWVOjoUNVQU43ms4XW" alt="Image" width="600" height="400" loading="lazy">
<em>Leprechaun Hats Fall on top of Heads in Portraits</em></p>
<h3 id="heading-1-initial-setup">1. Initial Setup</h3>
<p>Before we start this tutorial, we need to first perform some setup.</p>
<p>First of all, we are using <strong>tracking.js</strong> to help us in this project, and therefore, we need to download and extract the necessary files for <strong>tracking.js</strong> from <a target="_blank" href="https://github.com/eduardolundgren/tracking.js/archive/master.zip"><strong>here</strong></a>.</p>
<p>For this tutorial, we start with a template website I snatched from our team’s WiX site. <strong>WiX</strong> is a <strong>Content Management System (CMS)</strong> that lets you build websites with much less effort. The template can be downloaded from <a target="_blank" href="https://github.com/shenhuang/shenhuang.github.io/raw/master/tracking.js-master/site_template.zip"><strong>here</strong></a>. Extract the files into the “tracking.js-master” folder from the previous step.</p>
<p>In order to make everything work, we also need a server. We will be using a simple Python server for this tutorial. In case you do not have Python or Homebrew (which helps to install Python), you can use the following bash commands to install them.</p>
<p>Installing Homebrew:</p>
<pre><code>/usr/bin/ruby -e <span class="hljs-string">"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"</span>
</code></pre><p>Installing Python:</p>
<pre><code>brew install python
</code></pre><p>Now that everything is ready, we will run the command below under our “tracking.js-master” folder to start the Python server. (Note that <code>SimpleHTTPServer</code> is the Python 2 module; if you installed Python 3, use <code>python3 -m http.server</code> instead.)</p>
<pre><code>python -m SimpleHTTPServer
</code></pre><p>To test, go to this <a target="_blank" href="http://localhost:8000/examples/face_hello_world.html"><strong>link</strong></a> on your localhost to see an example page. You should also be able to view the extracted example page from <a target="_blank" href="http://localhost:8000/TEAM%20MEMBERS%20_%20Teamwebsite.html"><strong>here</strong></a>. And that is all you have to do for the setup.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/E3njCktFKMne4zqeC-1t6ljhser9k4Ay8Xhx" alt="Image" width="600" height="400" loading="lazy">
<em>Setting up a simple Python server.</em></p>
<h3 id="heading-2-creating-the-hat">2. Creating the hat</h3>
<p>Unlike my other tutorials, we will use an online image for this one rather than trying to recreate everything with <strong>CSS</strong>.</p>
<p>I found a leprechaun hat from <strong>kisspng</strong>, which can be found <a target="_blank" href="https://github.com/shenhuang/shenhuang.github.io/raw/master/tracking.js-master/leprechaunhat_kisspng.png"><strong>here</strong></a>. Save the image to the root folder of our website. By appending the following code just above the closing <code>&lt;/html&gt;</code> tag, we should be able to view the image on our example website after saving and reloading.</p>
<pre><code>&lt;body&gt;
  <span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">img</span> <span class="hljs-attr">id</span> = <span class="hljs-string">"hat"</span> <span class="hljs-attr">class</span> = <span class="hljs-string">"leprechaunhat"</span> <span class="hljs-attr">src</span> = <span class="hljs-string">"./leprechaunhat_kisspng.png"</span> &gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">body</span>&gt;</span></span>
</code></pre><p><img src="https://cdn-media-1.freecodecamp.org/images/FDncTXccdZYyRY8TG3fF1jaCjtMHsI9WyQEa" alt="Image" width="600" height="400" loading="lazy">
<em>Hat Image Appended to the Bottom of the Website</em></p>
<p>Now we have to design a drop animation with CSS and put the code above the hat declaration. The code basically lets the hat drop down and then shake a little bit.</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">style</span>&gt;</span><span class="css">
 <span class="hljs-keyword">@keyframes</span> shake {
  0% {
   <span class="hljs-attribute">transform </span>: <span class="hljs-built_in">translateY</span>(-<span class="hljs-number">30px</span>);
  }
  40% {
   <span class="hljs-attribute">transform </span>: <span class="hljs-built_in">rotate</span>(<span class="hljs-number">10deg</span>);
  }
  60% {
   <span class="hljs-attribute">transform </span>: <span class="hljs-built_in">rotate</span>(-<span class="hljs-number">10deg</span>);
  }
  80% {
   <span class="hljs-attribute">transform </span>: <span class="hljs-built_in">rotate</span>(<span class="hljs-number">10deg</span>);
  }
  100% {
   <span class="hljs-attribute">transform </span>: <span class="hljs-built_in">rotate</span>(<span class="hljs-number">0deg</span>);
  }
 }
 <span class="hljs-selector-class">.leprechaunhat</span> {
  <span class="hljs-attribute">animation </span>: shake <span class="hljs-number">1s</span> ease-in;
 }
</span><span class="hljs-tag">&lt;/<span class="hljs-name">style</span>&gt;</span>
</code></pre>
<p><img src="https://cdn-media-1.freecodecamp.org/images/niLdZDtnM566OnXKvebFQ-kC96UrllOgVuQv" alt="Image" width="600" height="400" loading="lazy">
<em>Hat drop animation.</em></p>
<h3 id="heading-3-drop-hats-onto-portraits">3. Drop hats onto portraits</h3>
<p>Now we will go over dropping hats precisely onto portraits. First we have to reference the JavaScript files from “tracking.js” with the following code.</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">src</span> = <span class="hljs-string">"build/tracking-min.js"</span> <span class="hljs-attr">type</span> = <span class="hljs-string">"text/javascript"</span> &gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">src</span> = <span class="hljs-string">"build/data/face-min.js"</span> <span class="hljs-attr">type</span> = <span class="hljs-string">"text/javascript"</span> &gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
</code></pre>
<p>The code provides us with a <code>Tracker</code> class which we can feed images into. We can then listen for a response containing rectangles that outline the faces inside each image.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/19eUYAEHlwvb6ycxU58Xv1ZnIWQoV--GvHDZ" alt="Image" width="600" height="400" loading="lazy">
<em>Tracker Explained</em></p>
<p>We start by defining a function that executes when the page is loaded. This function can be attached anywhere else if necessary. The <code>yOffsetValue</code> is an offset that nudges the hat into a more appropriate position.</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> yOffsetValue = <span class="hljs-number">10</span>;
<span class="hljs-built_in">window</span>.onload = <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params"></span>) </span>{
};
</code></pre>
<p>Inside, we define our hat creation function, allowing it to create hats with arbitrary sizes and positions.</p>
<pre><code class="lang-js"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">placeHat</span>(<span class="hljs-params">x, y, w, h, image, count</span>) </span>{
 hats[count] = hat.cloneNode(<span class="hljs-literal">true</span>);
 hats[count].style.display = <span class="hljs-string">"inline"</span>;
 hats[count].style.position = <span class="hljs-string">"absolute"</span>;
 hats[count].style.left = x + <span class="hljs-string">"px"</span>;
 hats[count].style.top = y + <span class="hljs-string">"px"</span>;
 hats[count].style.width = w + <span class="hljs-string">"px"</span>;
 hats[count].style.height = h + <span class="hljs-string">"px"</span>;
 image.parentNode.parentNode.appendChild(hats[count]);
}
</code></pre>
<p>We should also tweak our image declaration a little bit to hide the image, as we are now showing it with JavaScript.</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">img</span> <span class="hljs-attr">id</span> = <span class="hljs-string">"hat"</span> <span class="hljs-attr">class</span> = <span class="hljs-string">"leprechaunhat"</span> <span class="hljs-attr">src</span> = <span class="hljs-string">"./leprechaunhat_kisspng.png"</span> <span class="hljs-attr">style</span> = <span class="hljs-string">"display : none"</span> &gt;</span>
</code></pre>
<p>Then we add the following code to create the hats on top of faces, with the size matching the face.</p>
<pre><code class="lang-js"><span class="hljs-keyword">var</span> hat = <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">"hat"</span>);
<span class="hljs-keyword">var</span> images = <span class="hljs-built_in">document</span>.getElementsByTagName(<span class="hljs-string">'img'</span>);
<span class="hljs-keyword">var</span> trackers = [];
<span class="hljs-keyword">var</span> hats = [];
<span class="hljs-keyword">for</span>(i = <span class="hljs-number">0</span>; i &lt; images.length; i++)
{
 (<span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">img</span>)
 </span>{
  trackers[i] = <span class="hljs-keyword">new</span> tracking.ObjectTracker(<span class="hljs-string">'face'</span>);
  tracking.track(img, trackers[i]);
  trackers[i].on(<span class="hljs-string">'track'</span>, <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">event</span>) </span>{
   event.data.forEach(<span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">rect</span>) </span>{
    <span class="hljs-keyword">var</span> bcr = img.getBoundingClientRect();
    placeHat(rect.x, rect.y + yOffsetValue - rect.height, rect.width, rect.height, img, i);
   });
  });
 })(images[i]);
}
</code></pre>
<p>Now, while our Python server is still running, calling the following address should show us leprechaun hats dropping onto portraits.</p>
<pre><code>http:<span class="hljs-comment">//localhost:8000/TEAM%20MEMBERS%20_%20Teamwebsite.html</span>
</code></pre><p><img src="https://cdn-media-1.freecodecamp.org/images/3lHrFCf6hT-qFaANYfSA7kyK9KzSS9BYG-N8" alt="Image" width="600" height="400" loading="lazy">
<em>Leprechaun hat drop demo</em></p>
<p>Congratulations! You just learned how to drop leprechaun hats onto all the portraits on a website with computer vision. Wishing you, your friends, and your audience a great St. Patrick’s Day!!!</p>
<h3 id="heading-in-the-end">In the end</h3>
<p>I have linked some of my previous guides on similar projects below. I believe there are certain trends in front-end design. Beyond the newly emerging .js frameworks and ES updates, computer animations and artificial intelligence can do wonders for the front end, improving user experience with elegance and efficiency.</p>
<p><strong>Beginner:</strong></p>
<ul>
<li><a target="_blank" href="https://medium.com/front-end-weekly/how-to-fill-your-website-with-lovely-valentines-hearts-d30fe66d58eb">how to fill your website with lovely VALENTINES HEARTS</a></li>
<li><a target="_blank" href="https://medium.com/front-end-weekly/how-to-add-some-fireworks-to-your-website-18b594b06cca">how to add some FIREWORKS to your website</a></li>
<li><a target="_blank" href="https://medium.com/front-end-weekly/how-to-add-some-bubbles-to-your-website-8c51b8b72944">how to add some BUBBLES to your website</a></li>
</ul>
<p><strong>Advanced:</strong></p>
<ul>
<li><a target="_blank" href="https://medium.freecodecamp.org/how-to-create-beautiful-lanterns-that-arrange-themselves-into-words-da01ae98238">how to create beautiful LANTERNS that ARRANGE THEMSELVES into words</a></li>
</ul>
<p>I am passionate about coding and love learning new things. I believe knowledge can make the world a better place, and therefore I am self-motivated to share. Let me know if you are interested in reading about anything in particular.</p>
<p>If you are looking for the source code of this project, it can be found <a target="_blank" href="https://github.com/shenhuang/shenhuang.github.io/tree/master/tracking.js-master"><strong>here</strong></a>. Thanks again to my teammates who allowed me to use their portraits for this project, and <strong>be wary before using this on a website with a Chinese user base</strong>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to classify photos in 600 classes using nine million Open Images ]]>
                </title>
                <description>
                    <![CDATA[ By Aleksey Bilogur If you’re looking to build an image classifier but need training data, look no further than Google Open Images. This massive image dataset contains over 30 million images and 15 million bounding boxes. That’s 18 terabytes of image dat... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-classify-photos-in-600-classes-using-nine-million-open-images-65847da1a319/</link>
                <guid isPermaLink="false">66c35095d73001a6c0054bdc</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 20 Feb 2019 18:16:51 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*EI4YPmaav7hc79e0GH__BQ.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Aleksey Bilogur</p>
<p>If you’re looking to build an image classifier but need training data, look no further than <a target="_blank" href="https://storage.googleapis.com/openimages/web/index.html">Google Open Images</a>.</p>
<p>This massive image dataset contains over 30 million images and 15 million bounding boxes. That’s 18 terabytes of image data!</p>
<p>Plus, Open Images is much more open and accessible than certain other image datasets at this scale. For example, ImageNet has restrictive licensing.</p>
<p>However, it’s not easy for developers on single machines to sift through that much data. You need to download and process multiple metadata files, and roll your own storage space (or apply for access to a Google Cloud bucket).</p>
<p>On the other hand, there aren’t many custom image training sets in the wild, because frankly they’re a pain to create and share.</p>
<p>In this article, we’ll build and distribute a simple end-to-end machine learning pipeline using Open Images.</p>
<p>We’ll see how to create your own dataset around any of the 600 labels included in the Open Images bounding box data.</p>
<p>We’ll show off our handiwork by building “open sandwiches”. These are simple, reproducible image classifiers built to answer an age-old question: <a target="_blank" href="https://english.stackexchange.com/questions/246580/is-a-hamburger-considered-a-sandwich">is a hamburger a sandwich</a>?</p>
<p>Want to see the code? You can follow along in <a target="_blank" href="https://github.com/quiltdata/open-images">the repository on GitHub</a>.</p>
<h3 id="heading-downloading-the-data">Downloading the data</h3>
<p>We need to download the relevant data before we can do anything with it.</p>
<p>This is the core challenge when working with Google Open Images (or any external dataset really). There is no easy way to download a subset of the data. We need to write a script that does so for us.</p>
<p>I’ve written a Python script that searches the metadata in the <a target="_blank" href="https://github.com/openimages/dataset">Open Images data set</a> for keywords that you specify. It finds the original URLs of the corresponding images (on <a target="_blank" href="https://www.flickr.com/">Flickr</a>), then downloads them to disk.</p>
<p>It’s a testament to the power of Python that you can do all of this in just 50 lines of code.</p>
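<p>The real script in the repository also handles the metadata joins and various error cases; a heavily simplified sketch of the core download loop (the urls.csv file and its column names here are my illustrative assumptions, not the actual script) might look like this:</p>
<pre><code>import os
import pandas as pd
import requests
from concurrent.futures import ThreadPoolExecutor

def download(row):
    image_id, url = row
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        with open("images/{}.jpg".format(image_id), "wb") as f:
            f.write(resp.content)
    except requests.RequestException:
        pass  # some source images have been deleted from Flickr since

os.makedirs("images", exist_ok=True)
urls = pd.read_csv("urls.csv")  # assumed columns: ImageID, OriginalURL
with ThreadPoolExecutor(max_workers=16) as pool:
    pool.map(download, zip(urls["ImageID"], urls["OriginalURL"]))
</code></pre>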
<p>The full script enables you to download the subset of raw images that have bounding box information for any subset of categories you choose:</p>
<pre><code>$ git clone https://github.com/quiltdata/open-images.git
$ cd open-images/
$ conda env create -f environment.yml
$ source activate quilt-open-images-dev
$ cd src/openimager/
$ python openimager.py "Sandwiches" "Hamburgers"
</code></pre><p>Categories are organized in a hierarchical way.</p>
<p>For example, <code>sandwich</code> and <code>hamburger</code> are both sub-labels of <code>food</code> (but <code>hamburger</code> is not a sub-label of <code>sandwich</code> — hmm).</p>
<p>We can visualize the ontology as a radial tree using Vega:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*Wp0-dUSPLuwC6KN7hELNLw.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can view an interactive annotated version of this chart (and download the code behind it) <a target="_blank" href="https://alpha.quiltdata.com/b/quilt-example/tree/quilt/open_images/">here</a>.</p>
<p>Not all categories in Open Images have bounding box data associated with them.</p>
<p>But this script will allow you to download any subset of the 600 labels that do. Here’s a taste of what’s possible:</p>
<p><code>football</code>, <code>toy</code>, <code>bird</code>, <code>cat</code>, <code>vase</code>, <code>hair dryer</code>, <code>kangaroo</code>, <code>knife</code>, <code>briefcase</code>, <code>pencil case</code>, <code>tennis ball</code>, <code>nail</code>, <code>high heels</code>, <code>sushi</code>, <code>skyscraper</code>, <code>tree</code>, <code>truck</code>, <code>violin</code>, <code>wine</code>, <code>wheel</code>, <code>whale</code>, <code>pizza cutter</code>, <code>bread</code>, <code>helicopter</code>, <code>lemon</code>, <code>dog</code>, <code>elephant</code>, <code>shark</code>, <code>flower</code>, <code>furniture</code>, <code>airplane</code>, <code>spoon</code>, <code>bench</code>, <code>swan</code>, <code>peanut</code>, <code>camera</code>, <code>flute</code>, <code>helmet</code>, <code>pomegranate</code>, <code>crown</code>…</p>
<p>For the purposes of this article, we’ll limit ourselves to just two: <code>hamburger</code> and <code>sandwich</code>.</p>
<h3 id="heading-clean-it-crop-it">Clean it, crop it</h3>
<p>Once we’ve run the script and localized the images, we can inspect them with <code>matplotlib</code> to see what we’ve got:</p>
<pre><code><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> pltfrom matplotlib.image <span class="hljs-keyword">import</span> imread%matplotlib inlineimport os
</code></pre><pre><code>fig, axarr = plt.subplots(<span class="hljs-number">1</span>, <span class="hljs-number">5</span>, figsize=(<span class="hljs-number">24</span>, <span class="hljs-number">4</span>))<span class="hljs-keyword">for</span> i, img <span class="hljs-keyword">in</span> enumerate(os.listdir(<span class="hljs-string">'../data/images/'</span>)[:<span class="hljs-number">5</span>]):    axarr[i].imshow(imread(<span class="hljs-string">'../data/images/'</span> + img))
</code></pre><p><img src="https://cdn-media-1.freecodecamp.org/images/1*8ytDl01L-ZoUKYxYREMkuw.png" alt="Image" width="600" height="400" loading="lazy">
<em>Five example {hamburger, sandwich} images from Google Open Images V4.</em></p>
<p>These images are not easy ones to train on. They have all of the issues associated with building a dataset using an external source from the public Internet.</p>
<p>Just this small sample here demonstrates the different sizes, orientations, and occlusions possible in our target classes.</p>
<p>In one case, we didn’t even succeed in downloading the actual image. Instead, we got a placeholder telling us that the image we wanted has since been deleted!</p>
<p>Downloading this data nets us a few thousand sample images like these. The next step is to take advantage of the bounding box information to clip our images down to just the sandwich-y, hamburger-y parts.</p>
<p>Here’s another image array, this time with bounding boxes included, to demonstrate what this entails:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*ii670RcCgXExc2CQgc-QEg.png" alt="Image" width="600" height="400" loading="lazy">
<em>Bounding boxes. Notice (1) the dataset includes “depictions” and (2) raw images can contain many object instances.</em></p>
<p><a target="_blank" href="https://github.com/quiltdata/open-images/blob/master/notebooks/build-dataset.ipynb">This annotated Jupyter notebook</a> in the <a target="_blank" href="https://github.com/quiltdata/open-images">demo GitHub repository</a> does this work.</p>
<p>I will omit showing that code here because it is slightly complicated, especially since we also need to (1) refactor our image metadata to match the clipped image outputs and (2) filter out the images that have since been deleted. Definitely check out the notebook if you wish to see the code.</p>
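<p>The core operation, though, is simple. Here is a minimal illustration (mine, not the notebook’s code) of clipping an image to a bounding box; Open Images stores box coordinates normalized to the [0, 1] range, and the file paths and coordinates below are made up for the example:</p>
<pre><code>from PIL import Image

def crop_to_box(path, x_min, x_max, y_min, y_max):
    # Open Images bounding boxes are fractions of the image size,
    # so scale them up to pixel coordinates before cropping.
    img = Image.open(path)
    w, h = img.size
    return img.crop((int(x_min * w), int(y_min * h),
                     int(x_max * w), int(y_max * h)))

cropped = crop_to_box("images/some_image.jpg", 0.12, 0.58, 0.25, 0.80)
cropped.save("images_cropped/sandwich/some_image.jpg")
</code></pre>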
<p>After running the notebook code, we will have an <code>images_cropped</code> folder on disk containing all of the cropped images.</p>
<h3 id="heading-building-the-model">Building the model</h3>
<p>Once we have downloaded the data, and cropped and cleaned it, we’re ready to train the model.</p>
<p>We will train a <a target="_blank" href="https://medium.freecodecamp.org/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050">convolutional neural network</a> (or ‘CNN’) on the data.</p>
<p>CNNs are a special type of neural network which build progressively higher level features out of groups of pixels commonly found in the images.</p>
<p>How an image scores on these various features is then weighted to generate a final classification result.</p>
<p>This architecture works extremely well because it takes advantage of locality. This is because any one pixel is likely to have far more in common with pixels nearby than those far away.</p>
<p>CNNs also have other attractive properties, like noise tolerance and scale invariance (to an extent). These further improve their classification properties.</p>
<p>If you’re unfamiliar with CNNs, I recommend skimming Brandon Rohrer’s excellent “<a target="_blank" href="https://brohrer.github.io/how_convolutional_neural_networks_work.html">How convolutional neural networks work</a>” to learn more about them.</p>
<p>We will train a very simple convolutional neural network and see how even that gets decent results on our problem. I use <a target="_blank" href="https://keras.io/">Keras</a> to define and train the model.</p>
<p>We start by laying out the images in a certain directory structure:</p>
<pre><code>images_cropped/
    sandwich/
        some_image.jpg
        some_other_image.jpg
        ...
    hamburger/
        yet_another_image.jpg
        ...
</code></pre><p>We then point Keras at this folder using the following code:</p>
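<p>(The sketch below is an approximation: the augmentation parameters and target image size are assumptions, not necessarily the notebook’s exact values.)</p>
<pre><code>from keras.preprocessing.image import ImageDataGenerator

# Augment the training images; leave the validation images undistorted.
# Both generators share the same validation_split so the subsets align.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0.2,
)
val_datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

# Class indices are assigned alphabetically: hamburger=0, sandwich=1
train_generator = train_datagen.flow_from_directory(
    'images_cropped/',
    target_size=(128, 128),
    batch_size=16,
    class_mode='binary',
    subset='training',
)
validation_generator = val_datagen.flow_from_directory(
    'images_cropped/',
    target_size=(128, 128),
    batch_size=16,
    class_mode='binary',
    subset='validation',
)
</code></pre>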
<p>Keras will inspect the input folders, and determine there are two classes in our classification problem. It will assign class names based on the subfolder names, and create “image generators” that serve out of those folders.</p>
<p>But the generators don’t just serve the images as-is. Instead, they serve randomly subsampled, skewed, and zoomed selections from the images (via <code>train_datagen.flow_from_directory</code>).</p>
<p>This is an example of data augmentation in action.</p>
<p>Data augmentation is the practice of feeding an image classifier randomly cropped and distorted versions of an input dataset. It helps us overcome the small size of our dataset: we can train the model on a single image multiple times, each time using a slightly different segment of the image, preprocessed in a slightly different way.</p>
<p>With our data input defined, the next step is defining the model itself:</p>
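<p>(Again a sketch: the layer widths and input size below are assumptions rather than the notebook’s exact values.)</p>
<pre><code>from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    # Three convolutional blocks build progressively higher-level features
    Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    # A single densely connected post-processing layer...
    Flatten(),
    Dense(64, activation='relu'),
    # ...with dropout as strong regularization against overfitting
    Dropout(0.5),
    Dense(1, activation='sigmoid'),  # binary output: hamburger vs. sandwich
])
model.compile(loss='binary_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
</code></pre>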
<p>This is a simple convolutional neural network model. It contains just three convolutional layers, a single densely connected post-processing layer just before the output layer, and strong regularization in the form of a dropout layer and <code>relu</code> activations.</p>
<p>These things all work together to make it more difficult for this model to <a target="_blank" href="https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/">overfit</a>. This is important, given the small size of our input dataset.</p>
<p>Finally, the last step is actually fitting the model.</p>
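<p>A sketch of that fitting code (the details here may differ from the notebook’s exact values):</p>
<pre><code>from keras.callbacks import EarlyStopping

batch_size = 16
model.fit_generator(
    train_generator,
    # epoch step size determined by sample size and batch size
    steps_per_epoch=train_generator.samples // batch_size,
    epochs=50,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // batch_size,
    callbacks=[
        # Stop if validation loss hasn't improved in the last four
        # epochs, keeping the best-performing weights seen so far
        EarlyStopping(monitor='val_loss', patience=4,
                      restore_best_weights=True),
    ],
)
</code></pre>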
<p>This code selects an epoch step size determined by our image sample size and chosen batch size (16). Then it trains on that data for 50 epochs.</p>
<p>Training is likely to be halted early by the <code>EarlyStopping</code> callback, which returns the best-performing model ahead of the 50-epoch limit if validation loss has not improved in the previous four epochs.</p>
<p>We selected such a large patience value because there is a significant amount of variability in model validation loss.</p>
<p>This simple training regimen results in a model with about 75% accuracy:</p>
<pre><code>              precision    recall  f1-score   support

           0       0.90      0.59      0.71      1399
           1       0.64      0.92      0.75      1109

   micro avg       0.73      0.73      0.73      2508
   macro avg       0.77      0.75      0.73      2508
weighted avg       0.78      0.73      0.73      2508
</code></pre><p>It’s interesting to note that our model is under-confident when classifying hamburgers (class 0), but over-confident when classifying sandwiches (class 1).</p>
<p>90% of images classified as hamburgers are actually hamburgers. But only 59% of all actual hamburgers are classified correctly.</p>
<p>On the other hand, just 64% of images classified as sandwiches are actually sandwiches. But 92% of sandwiches are classified correctly.</p>
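<p>For reference, a report like the one above can be generated with scikit-learn. Here <code>X_val</code> and <code>y_val</code> stand in for a hypothetical held-out set of images and labels; the notebook may assemble its evaluation data differently:</p>
<pre><code>from sklearn.metrics import classification_report

# Threshold the sigmoid outputs at 0.5 to get hard class predictions
y_pred = (model.predict(X_val) > 0.5).astype(int)
print(classification_report(y_val, y_pred))
</code></pre>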
<p>These results are in line with the 80% accuracy Francois Chollet got by applying a very similar model to a similarly-sized subset of the classic <a target="_blank" href="https://www.kaggle.com/c/dogs-vs-cats">Cats versus Dogs</a> dataset.</p>
<p>The difference is probably mainly due to the increased level of occlusion and noise in the Google Open Images V4 dataset.</p>
<p>The dataset also includes illustrations as well as photographic images. These sometimes take large artistic liberties, making classification more difficult. You may choose to remove these when building a model yourself.</p>
<p>This performance can be further improved using <a target="_blank" href="https://towardsdatascience.com/keras-transfer-learning-for-beginners-6c9b8b7143e">transfer learning</a> techniques. To learn more, check out Keras author Francois Chollet’s blog post “<a target="_blank" href="https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html">Building powerful image classification models using very little data</a>”.</p>
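<p>As a taste of what transfer learning looks like, one common pattern is to reuse a pretrained ImageNet backbone as a fixed feature extractor and train only a small new head on top. The backbone choice and layer sizes below are illustrative, not a prescription:</p>
<pre><code>from keras.applications import VGG16
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

# Load ImageNet-trained convolutional layers, without the classifier head
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(128, 128, 3))
base.trainable = False  # freeze the pretrained features

model = Sequential([
    base,
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
</code></pre>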
<h3 id="heading-distributing-the-model">Distributing the model</h3>
<p>Now that we’ve built a custom dataset and trained a model, it’d be a shame if we didn’t share it.</p>
<p>Machine Learning projects should be reproducible. I outline the following strategy in a previous article, “<a target="_blank" href="https://blog.quiltdata.com/reproduce-a-machine-learning-model-build-in-four-lines-of-code-b4f0a5c5f8c8">Reproduce a machine learning model build in four lines of code</a>”.</p>
<ul>
<li>Separate dependencies into data, code, and environment components.</li>
<li>Data dependencies: version control (1) the model definition and (2) the training data. Save these to versioned blob storage, e.g. <a target="_blank" href="https://aws.amazon.com/s3/">Amazon S3</a> with <a target="_blank" href="https://github.com/quiltdata/t4">Quilt T4</a>.</li>
<li>Code dependencies: version control the code used to train the model (use git).</li>
<li>Environment dependencies: version control the environment used to train the model. In a production environment this should probably be a Dockerfile, but you can use <code>pip</code> or <code>conda</code> locally.</li>
<li>To provide someone with a retrainable copy of the model, give them the corresponding <code>{data, code, environment}</code> tuple.</li>
</ul>
<p>Following these principles means that everything you need to train your own copy of this model fits into a handful of lines of code:</p>
<pre><code>git clone https://github.com/quiltdata/open-images.git
conda env create -f open-images/environment.yml
source activate quilt-open-images-dev
python -c "import t4; t4.Package.install('quilt/open_images', dest='open-images/', registry='s3://quilt-example')"
</code></pre><p>To learn more about <code>{data, code, environment}</code> see <a target="_blank" href="https://github.com/quiltdata/open-images">the GitHub repository</a> and/or <a target="_blank" href="https://blog.quiltdata.com/reproduce-a-machine-learning-model-build-in-four-lines-of-code-b4f0a5c5f8c8">the corresponding article</a>.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>In this article we demonstrated an end-to-end image classification machine learning pipeline. We covered everything from downloading/transforming a dataset to training a model. We then distributed it in a way that lets anyone else rebuild it themselves later.</p>
<p>Because custom datasets are difficult to generate and distribute, over time there has emerged a cabal of example datasets which get used everywhere. This is not because they’re actually that good (they’re not). Instead, it’s because they’re easy.</p>
<p>For example, Google’s recently released Machine Learning Crash Course makes heavy use of the <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html">California Housing Dataset</a>. That data is now almost two decades old!</p>
<p>Consider instead exploring new horizons, using real images from the living Internet, with interesting categorical breakdowns. It’s easier than you think!</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
