<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Computer Vision - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Computer Vision - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sat, 16 May 2026 22:22:53 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/computer-vision/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How I Built a Makaton AI Companion Using Gemini Nano and the Gemini API ]]>
                </title>
                <description>
                    <![CDATA[ When I started my research on AI systems that could translate Makaton (a sign and symbol language designed to support speech and communication), I wanted to bridge a gap in accessibility for learners with speech or language difficulties. Over time, t... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-i-built-a-makaton-ai-companion-using-gemini-nano-and-the-gemini-api/</link>
                <guid isPermaLink="false">690e1f43cb50ea9684f6d9aa</guid>
                
                    <category>
                        <![CDATA[ geminiAPI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ gemini-nano ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Fri, 07 Nov 2025 16:33:07 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762533154134/e2209ade-6971-464b-aeef-f05abd0a30d7.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When I started my research on AI systems that could translate Makaton (a sign and symbol language designed to support speech and communication), I wanted to bridge a gap in accessibility for learners with speech or language difficulties.</p>
<p>Over time, this academic interest evolved into a working prototype that combines on-device AI and cloud AI to describe images and translate them into English meanings. The idea was simple: I wanted to build a lightweight web app that recognized Makaton gestures or symbols and instantly provided an English interpretation.</p>
<p>In this article, I’ll walk you through how I built my Makaton AI Companion, a single-page web app powered by Gemini Nano (on-device) and the Gemini API (cloud). You’ll see how it works, how I solved common issues like CORS and API model errors, and how this small project became part of my journey toward AI for accessibility.</p>
<p>By the end of this article, you will be able to:</p>
<ul>
<li><p>Understand the core concept behind Makaton and why it’s important in accessibility and inclusive education.</p>
</li>
<li><p>Learn how to combine on-device AI (Gemini Nano) and cloud-based AI (Gemini API) in a single web project.</p>
</li>
<li><p>Build a functional AI-powered web app that can describe images and map them to predefined English meanings.</p>
</li>
<li><p>Discover how to handle common errors such as model endpoint issues, missing API keys, and CORS restrictions when working with generative AI APIs.</p>
</li>
<li><p>Learn how to store API keys locally for user privacy using <code>localStorage</code>.</p>
</li>
<li><p>Use browser speech synthesis to convert the AI-generated English meanings into spoken output.</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-tools-and-tech-stack">Tools and Tech Stack</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-building-the-app-step-by-step">Building the App Step by Step</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-fix-the-common-issues">How to Fix the Common Issues</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-demo-the-makaton-ai-companion-in-action">Demo: The Makaton AI Companion in Action</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-broader-reflections">Broader Reflections</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-tools-and-tech-stack">Tools and Tech Stack</h2>
<p>To build the Makaton AI Companion, I wanted something lightweight, fast to prototype, and easy for anyone to run without complicated dependencies. I chose a plain web stack with a focus on accessibility and transparency.</p>
<p>Here’s what I used:</p>
<h3 id="heading-frontend">Frontend</h3>
<ul>
<li><p><strong>HTML + CSS + JavaScript (Vanilla):</strong> No frameworks, just clean and understandable code that any beginner can follow.</p>
</li>
<li><p>A single <code>index.html</code> page handles the upload interface, output display, and AI logic.</p>
</li>
</ul>
<h3 id="heading-ai-components">AI Components</h3>
<ul>
<li><p><strong>Gemini Nano</strong> runs locally in Chrome Canary. This on-device model lets users generate short text without calling the cloud API.</p>
</li>
<li><p><strong>Gemini API (Cloud)</strong> is used as a fallback when on-device AI isn’t available or when image analysis is required.</p>
<ul>
<li><p>Models tested: <code>gemini-1.5-flash</code> and <code>gemini-pro-vision</code>.</p>
</li>
<li><p>Fallback logic ensures the app checks multiple model endpoints if one returns a 404 error.</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-local-storage">Local Storage</h3>
<ul>
<li>The Gemini API key is stored in the browser’s <code>localStorage</code>, so it stays on the user’s machine and is only ever sent with the user’s own requests to Google’s API.</li>
</ul>
<h3 id="heading-browser-speechsynthesis-api">Browser SpeechSynthesis API</h3>
<ul>
<li>Converts the translated English meaning into spoken audio with one click.</li>
</ul>
<h3 id="heading-mapping-logic">Mapping Logic</h3>
<ul>
<li>A small custom dictionary (<code>mapping.js</code>) links AI-generated descriptions to likely Makaton meanings. For example: <code>{ keywords: ["open hand", "raised hand", "wave"], meaning: "Hello / Stop" }</code></li>
</ul>
<h3 id="heading-local-server">Local Server</h3>
<ul>
<li><p>The app is served locally using Python’s built-in HTTP server to avoid CORS issues:</p>
<p>  <code>python -m http.server 8080</code></p>
</li>
</ul>
<p>Then open <code>http://localhost:8080</code> in Chrome Canary.</p>
<h2 id="heading-building-the-app-step-by-step">Building the App Step by Step</h2>
<p>Now let’s dive into how the Makaton AI Companion works under the hood. This project follows a simple but effective flow: Upload an image → Describe (AI) → Map to Meaning → Speak or Copy the result</p>
<p>We’ll go through each part step by step.</p>
<h3 id="heading-1-setting-up-the-project-folder">1. Setting Up the Project Folder</h3>
<p>You don’t need any complex setup. Just create a new folder and add these files:</p>
<pre><code class="lang-plaintext">makaton-ai-companion/
│
├── index.html
├── styles.css
├── app.js
└── lib/
    ├── mapping.js
    └── ai.js
</code></pre>
<p>If you prefer a ready-to-run version, everything is available as a single zip (I’ll share a GitHub link at the end).</p>
<h3 id="heading-2-creating-the-basic-html-structure">2. Creating the Basic HTML Structure</h3>
<p>Your <code>index.html</code> file defines the interface where users upload an image, click <em>Describe</em>, and view the results.</p>
<pre><code class="lang-html"><span class="hljs-meta">&lt;!DOCTYPE <span class="hljs-meta-keyword">html</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">html</span> <span class="hljs-attr">lang</span>=<span class="hljs-string">"en"</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">head</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">meta</span> <span class="hljs-attr">charset</span>=<span class="hljs-string">"UTF-8"</span> /&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">meta</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"viewport"</span> <span class="hljs-attr">content</span>=<span class="hljs-string">"width=device-width, initial-scale=1.0"</span>/&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">title</span>&gt;</span>Makaton AI Companion<span class="hljs-tag">&lt;/<span class="hljs-name">title</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">link</span> <span class="hljs-attr">rel</span>=<span class="hljs-string">"stylesheet"</span> <span class="hljs-attr">href</span>=<span class="hljs-string">"styles.css"</span>/&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">head</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">body</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">header</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"app-header"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>🧩 Makaton AI Companion<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnSettings"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn secondary"</span>&gt;</span>Settings<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
  <span class="hljs-tag">&lt;/<span class="hljs-name">header</span>&gt;</span>

  <span class="hljs-tag">&lt;<span class="hljs-name">main</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"container"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">section</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"card"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">h2</span>&gt;</span>1) Upload an image (Makaton sign/symbol)<span class="hljs-tag">&lt;/<span class="hljs-name">h2</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">label</span> <span class="hljs-attr">for</span>=<span class="hljs-string">"file"</span>&gt;</span>
        Choose an image file
        <span class="hljs-tag">&lt;<span class="hljs-name">input</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"file"</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"file"</span> <span class="hljs-attr">accept</span>=<span class="hljs-string">"image/*"</span> <span class="hljs-attr">title</span>=<span class="hljs-string">"Select an image file"</span>/&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">label</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"preview"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"preview hidden"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">p</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"status"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"status"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"actions"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnDescribe"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn"</span>&gt;</span>Describe (Cloud or Nano)<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnType"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn ghost"</span>&gt;</span>Type a description instead<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"typedBox"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"typed hidden"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">textarea</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"typed"</span> <span class="hljs-attr">rows</span>=<span class="hljs-string">"3"</span> <span class="hljs-attr">placeholder</span>=<span class="hljs-string">"Describe what you see..."</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">textarea</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnUseTyped"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn"</span>&gt;</span>Use this description<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">section</span>&gt;</span>

    <span class="hljs-tag">&lt;<span class="hljs-name">section</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"card"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">h2</span>&gt;</span>2) AI Output<span class="hljs-tag">&lt;/<span class="hljs-name">h2</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"grid"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span>Image Description<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"output"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"output"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span>English Meaning (Mapped)<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"meaning"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"meaning"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"actions"</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnSpeak"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn ghost"</span> <span class="hljs-attr">disabled</span>&gt;</span>🔊 Speak<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnCopy"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn ghost"</span> <span class="hljs-attr">disabled</span>&gt;</span>📋 Copy<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
          <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">section</span>&gt;</span>
  <span class="hljs-tag">&lt;/<span class="hljs-name">main</span>&gt;</span>

  <span class="hljs-tag">&lt;<span class="hljs-name">dialog</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"settings"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">form</span> <span class="hljs-attr">method</span>=<span class="hljs-string">"dialog"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"settings-form"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">h2</span>&gt;</span>Settings<span class="hljs-tag">&lt;/<span class="hljs-name">h2</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">label</span>&gt;</span>Gemini API key (optional)<span class="hljs-tag">&lt;<span class="hljs-name">input</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"apiKey"</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"password"</span> <span class="hljs-attr">placeholder</span>=<span class="hljs-string">"AIza..."</span>/&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">label</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"settings-actions"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnSaveKey"</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"submit"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn"</span>&gt;</span>Save<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"btnCloseSettings"</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"button"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn secondary"</span>&gt;</span>Close<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"apiStatus"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"api-status"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">form</span>&gt;</span>
  <span class="hljs-tag">&lt;/<span class="hljs-name">dialog</span>&gt;</span>

  <span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"module"</span> <span class="hljs-attr">src</span>=<span class="hljs-string">"lib/mapping.js"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"module"</span> <span class="hljs-attr">src</span>=<span class="hljs-string">"lib/ai.js"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"module"</span> <span class="hljs-attr">src</span>=<span class="hljs-string">"app.js"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">body</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">html</span>&gt;</span>
</code></pre>
<p>This interface is intentionally minimal: no frameworks, no build tools, just clear HTML.</p>
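<p>One styling note before moving on: the markup and scripts toggle a couple of class names, so whatever you put in <code>styles.css</code>, it needs at least a <code>.hidden</code> utility class. Here’s a minimal sketch (the drag highlight style is my suggestion, not anything the app requires):</p>
<pre><code class="lang-css">/* Required by the scripts: elements start hidden and are revealed via JS */
.hidden { display: none; }

/* Optional: visual feedback while dragging a file over the preview area */
.preview.drag { outline: 2px dashed #888; }
</code></pre>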
<h3 id="heading-3-mapping-descriptions-to-makaton-meanings">3. Mapping Descriptions to Makaton Meanings</h3>
<p>The <code>mapping.js</code> file holds a simple keyword-based dictionary. When the AI describes an image (like <em>“a raised open hand”</em>), the app searches for keywords that match known Makaton signs.</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// lib/mapping.js</span>

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> MAKATON_GLOSSES = [
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"open hand"</span>, <span class="hljs-string">"raised hand"</span>, <span class="hljs-string">"wave"</span>, <span class="hljs-string">"hand up"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Hello / Stop"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"eat"</span>, <span class="hljs-string">"food"</span>, <span class="hljs-string">"spoon"</span>, <span class="hljs-string">"hand to mouth"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Eat"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"drink"</span>, <span class="hljs-string">"cup"</span>, <span class="hljs-string">"glass"</span>, <span class="hljs-string">"bottle"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Drink"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"home"</span>, <span class="hljs-string">"house"</span>, <span class="hljs-string">"roof"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Home"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"sleep"</span>, <span class="hljs-string">"bed"</span>, <span class="hljs-string">"eyes closed"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Sleep"</span> },
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"book"</span>, <span class="hljs-string">"reading"</span>, <span class="hljs-string">"pages"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Book / Read"</span> },
  <span class="hljs-comment">// Added so your current screenshot maps correctly:</span>
  { <span class="hljs-attr">keywords</span>: [<span class="hljs-string">"help"</span>, <span class="hljs-string">"assist"</span>, <span class="hljs-string">"thumb on palm"</span>, <span class="hljs-string">"hand over hand"</span>, <span class="hljs-string">"assisting"</span>], <span class="hljs-attr">meaning</span>: <span class="hljs-string">"Help"</span> },
];

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">mapDescriptionToMeaning</span>(<span class="hljs-params">desc</span>) </span>{
  <span class="hljs-keyword">if</span> (!desc) <span class="hljs-keyword">return</span> <span class="hljs-string">""</span>;
  <span class="hljs-keyword">const</span> d = desc.toLowerCase();
  <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> entry <span class="hljs-keyword">of</span> MAKATON_GLOSSES) {
    <span class="hljs-keyword">if</span> (entry.keywords.some(<span class="hljs-function"><span class="hljs-params">k</span> =&gt;</span> d.includes(k))) <span class="hljs-keyword">return</span> entry.meaning;
  }
  <span class="hljs-keyword">if</span> (d.includes(<span class="hljs-string">"hand"</span>)) <span class="hljs-keyword">return</span> <span class="hljs-string">"Gesture / Hand sign (clarify)"</span>;
  <span class="hljs-keyword">return</span> <span class="hljs-string">"No direct mapping found."</span>;
}
</code></pre>
<p>It’s simple but effective enough to simulate real symbol-to-language translation for demo purposes.</p>
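<p>To see the mapping in action, here’s what it returns for a few hypothetical descriptions. Matching is a case-insensitive substring search, so a description just needs to contain one of the keywords:</p>
<pre><code class="lang-javascript">import { mapDescriptionToMeaning } from './lib/mapping.js';

mapDescriptionToMeaning("A person raising an open hand, palm forward."); // "Hello / Stop"
mapDescriptionToMeaning("A child holding a cup to drink from.");          // "Drink"
mapDescriptionToMeaning("A cat sitting on a chair.");                     // "No direct mapping found."
</code></pre>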
<h3 id="heading-4-adding-gemini-ai-logic">4. Adding Gemini AI Logic</h3>
<p>The <code>ai.js</code> file connects to Gemini Nano (on-device) or the Gemini API (cloud). If Nano isn’t available, the app falls back to the cloud model. And if that fails, it lets users type a description manually.</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// lib/ai.js — dynamic model discovery (try-all version)</span>

<span class="hljs-comment">// --- On-device availability (Gemini Nano) ---</span>
<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">checkAvailability</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">const</span> res = { <span class="hljs-attr">nanoTextPossible</span>: <span class="hljs-literal">false</span> };
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> canCreate = self.ai?.canCreateTextSession || self.ai?.languageModel?.canCreate;
    <span class="hljs-keyword">if</span> (<span class="hljs-keyword">typeof</span> canCreate === <span class="hljs-string">"function"</span>) {
      <span class="hljs-keyword">const</span> ok = <span class="hljs-keyword">await</span> (self.ai.canCreateTextSession?.() || self.ai.languageModel.canCreate?.());
      res.nanoTextPossible = ok === <span class="hljs-string">"readily"</span> || ok === <span class="hljs-string">"after-download"</span> || ok === <span class="hljs-literal">true</span>;
    }
  } <span class="hljs-keyword">catch</span> {}
  <span class="hljs-keyword">return</span> res;
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">createNanoTextSession</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">if</span> (self.ai?.createTextSession) <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> self.ai.createTextSession();
  <span class="hljs-keyword">if</span> (self.ai?.languageModel?.create) <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> self.ai.languageModel.create();
  <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"Gemini Nano text session not available"</span>);
}

<span class="hljs-comment">// --- Cloud: dynamically discover models for this key ---</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">listModels</span>(<span class="hljs-params">key</span>) </span>{
  <span class="hljs-keyword">const</span> url = <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/models?key="</span> + <span class="hljs-built_in">encodeURIComponent</span>(key);
  <span class="hljs-keyword">const</span> r = <span class="hljs-keyword">await</span> fetch(url);
  <span class="hljs-keyword">if</span> (!r.ok) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"ListModels failed: "</span> + (<span class="hljs-keyword">await</span> r.text()));
  <span class="hljs-keyword">const</span> j = <span class="hljs-keyword">await</span> r.json();
  <span class="hljs-keyword">return</span> (j.models || []).map(<span class="hljs-function"><span class="hljs-params">m</span> =&gt;</span> m.name).filter(<span class="hljs-built_in">Boolean</span>);
}

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">rankModels</span>(<span class="hljs-params">names</span>) </span>{
  <span class="hljs-comment">// Prefer Gemini 1.5 (multimodal), then flash variants, then anything with vision/pro.</span>
  <span class="hljs-keyword">return</span> names
    .filter(<span class="hljs-function"><span class="hljs-params">n</span> =&gt;</span> n.startsWith(<span class="hljs-string">"models/"</span>))              <span class="hljs-comment">// ignore tunedModels, etc.</span>
    .filter(<span class="hljs-function"><span class="hljs-params">n</span> =&gt;</span> !n.includes(<span class="hljs-string">"experimental"</span>))          <span class="hljs-comment">// skip experimental</span>
    .sort(<span class="hljs-function">(<span class="hljs-params">a, b</span>) =&gt;</span> score(b) - score(a));

  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">score</span>(<span class="hljs-params">n</span>) </span>{
    <span class="hljs-keyword">let</span> s = <span class="hljs-number">0</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"1.5"</span>)) s += <span class="hljs-number">10</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"flash"</span>)) s += <span class="hljs-number">8</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"pro-vision"</span>)) s += <span class="hljs-number">7</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"pro"</span>)) s += <span class="hljs-number">6</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"vision"</span>)) s += <span class="hljs-number">5</span>;
    <span class="hljs-keyword">if</span> (n.includes(<span class="hljs-string">"latest"</span>)) s += <span class="hljs-number">2</span>;
    <span class="hljs-keyword">return</span> s;
  }
}

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">tryGenerateForModels</span>(<span class="hljs-params">imageDataUrl, key, models, mimeType</span>) </span>{
  <span class="hljs-keyword">const</span> base64 = imageDataUrl.split(<span class="hljs-string">","</span>)[<span class="hljs-number">1</span>];
  <span class="hljs-keyword">const</span> body = {
    <span class="hljs-attr">contents</span>: [{
      <span class="hljs-attr">parts</span>: [
        { <span class="hljs-attr">text</span>: <span class="hljs-string">"Describe this image briefly in one sentence focusing on the main gesture or symbol."</span> },
        { <span class="hljs-attr">inline_data</span>: { <span class="hljs-attr">mime_type</span>: mimeType || <span class="hljs-string">"image/png"</span>, <span class="hljs-attr">data</span>: base64 } }
      ]
    }]
  };
  <span class="hljs-keyword">let</span> lastErr = <span class="hljs-string">""</span>;
  <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> model <span class="hljs-keyword">of</span> models) {
    <span class="hljs-keyword">const</span> endpoint = <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/"</span> + model + <span class="hljs-string">":generateContent?key="</span> + <span class="hljs-built_in">encodeURIComponent</span>(key);
    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">const</span> r = <span class="hljs-keyword">await</span> fetch(endpoint, { <span class="hljs-attr">method</span>: <span class="hljs-string">"POST"</span>, <span class="hljs-attr">headers</span>: { <span class="hljs-string">"Content-Type"</span>: <span class="hljs-string">"application/json"</span> }, <span class="hljs-attr">body</span>: <span class="hljs-built_in">JSON</span>.stringify(body)});
      <span class="hljs-keyword">if</span> (!r.ok) { lastErr = <span class="hljs-keyword">await</span> r.text().catch(<span class="hljs-function">()=&gt;</span><span class="hljs-built_in">String</span>(r.status)); <span class="hljs-keyword">continue</span>; }
      <span class="hljs-keyword">const</span> j = <span class="hljs-keyword">await</span> r.json();
      <span class="hljs-keyword">const</span> text = j?.candidates?.[<span class="hljs-number">0</span>]?.content?.parts?.map(<span class="hljs-function"><span class="hljs-params">p</span>=&gt;</span>p.text).join(<span class="hljs-string">" "</span>).trim();
      <span class="hljs-keyword">if</span> (text) <span class="hljs-keyword">return</span> text;
      lastErr = <span class="hljs-string">"Empty response from "</span> + model;
    } <span class="hljs-keyword">catch</span> (e) {
      lastErr = <span class="hljs-built_in">String</span>(e?.message || e);
    }
  }
  <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"All discovered models failed. Last error: "</span> + lastErr);
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">describeImageWithGemini</span>(<span class="hljs-params">imageDataUrl, apiKey, mimeType = <span class="hljs-string">"image/png"</span></span>) </span>{
  <span class="hljs-keyword">if</span> (!apiKey) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"No API key provided"</span>);

  <span class="hljs-keyword">const</span> models = <span class="hljs-keyword">await</span> listModels(apiKey);
  <span class="hljs-keyword">if</span> (!models.length) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"No models returned for this key. Ensure Generative Language API is enabled and T&amp;Cs accepted in AI Studio."</span>);

  <span class="hljs-keyword">const</span> ranked = rankModels(models);
  <span class="hljs-keyword">if</span> (!ranked.length) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"No usable model names returned (models/*)."</span>);

  <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> tryGenerateForModels(imageDataUrl, apiKey, ranked, mimeType);
}

<span class="hljs-comment">// --- Key storage (local only) ---</span>
<span class="hljs-keyword">const</span> KEY = <span class="hljs-string">"makaton_demo_gemini_key"</span>;
<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">saveApiKey</span>(<span class="hljs-params">k</span>) </span>{ <span class="hljs-built_in">localStorage</span>.setItem(KEY, k || <span class="hljs-string">""</span>); }
<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">loadApiKey</span>(<span class="hljs-params"></span>) </span>{ <span class="hljs-keyword">return</span> <span class="hljs-built_in">localStorage</span>.getItem(KEY) || <span class="hljs-string">""</span>; }
</code></pre>
<p>Note: This retry system is essential because model availability varies between accounts, so many users encounter 404 errors for Gemini versions their key can’t access.</p>
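<p>If you want to see which models your key can actually access, you can call the same ListModels endpoint that <code>listModels()</code> uses, straight from the browser console (a quick diagnostic sketch; replace <code>key</code> with your own API key):</p>
<pre><code class="lang-javascript">const r = await fetch(
  "https://generativelanguage.googleapis.com/v1/models?key=" + encodeURIComponent(key)
);
const j = await r.json();
console.log((j.models || []).map(m =&gt; m.name)); // e.g. ["models/gemini-1.5-flash", ...]
</code></pre>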
<h3 id="heading-5-the-main-logic-appjs">5. The Main Logic (app.js)</h3>
<p>This script ties everything together: file upload, AI call, meaning mapping, and output display.</p>
<pre><code class="lang-javascript">
<span class="hljs-keyword">import</span> { mapDescriptionToMeaning } <span class="hljs-keyword">from</span> <span class="hljs-string">'./lib/mapping.js'</span>;
<span class="hljs-keyword">import</span> { checkAvailability, createNanoTextSession, describeImageWithGemini, saveApiKey, loadApiKey } <span class="hljs-keyword">from</span> <span class="hljs-string">'./lib/ai.js'</span>;

<span class="hljs-built_in">document</span>.addEventListener(<span class="hljs-string">'DOMContentLoaded'</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] DOM ready'</span>);

  <span class="hljs-keyword">const</span> $ = <span class="hljs-function">(<span class="hljs-params">s</span>) =&gt;</span> <span class="hljs-built_in">document</span>.querySelector(s);

  <span class="hljs-comment">// Elements</span>
  <span class="hljs-keyword">const</span> fileInput   = $(<span class="hljs-string">'#file'</span>);
  <span class="hljs-keyword">const</span> preview     = $(<span class="hljs-string">'#preview'</span>);
  <span class="hljs-keyword">const</span> meaningEl   = $(<span class="hljs-string">'#meaning'</span>);
  <span class="hljs-keyword">const</span> outputEl    = $(<span class="hljs-string">'#output'</span>);
  <span class="hljs-keyword">const</span> btnDescribe = $(<span class="hljs-string">'#btnDescribe'</span>);
  <span class="hljs-keyword">const</span> btnType     = $(<span class="hljs-string">'#btnType'</span>);
  <span class="hljs-keyword">const</span> typedBox    = $(<span class="hljs-string">'#typedBox'</span>);
  <span class="hljs-keyword">const</span> typed       = $(<span class="hljs-string">'#typed'</span>);
  <span class="hljs-keyword">const</span> btnUseTyped = $(<span class="hljs-string">'#btnUseTyped'</span>);
  <span class="hljs-keyword">const</span> btnSpeak    = $(<span class="hljs-string">'#btnSpeak'</span>);
  <span class="hljs-keyword">const</span> btnCopy     = $(<span class="hljs-string">'#btnCopy'</span>);
  <span class="hljs-keyword">const</span> statusEl    = $(<span class="hljs-string">'#status'</span>);

  <span class="hljs-keyword">const</span> settings        = $(<span class="hljs-string">'#settings'</span>);
  <span class="hljs-keyword">const</span> btnSettings     = $(<span class="hljs-string">'#btnSettings'</span>);
  <span class="hljs-keyword">const</span> btnCloseSettings= $(<span class="hljs-string">'#btnCloseSettings'</span>);
  <span class="hljs-keyword">const</span> btnSaveKey      = $(<span class="hljs-string">'#btnSaveKey'</span>);
  <span class="hljs-keyword">const</span> apiKeyInput     = $(<span class="hljs-string">'#apiKey'</span>);
  <span class="hljs-keyword">const</span> apiStatus       = $(<span class="hljs-string">'#apiStatus'</span>);

  <span class="hljs-keyword">let</span> currentImageDataUrl = <span class="hljs-literal">null</span>;
  <span class="hljs-keyword">let</span> currentImageMime    = <span class="hljs-string">"image/png"</span>;

  <span class="hljs-comment">// Sanity logs</span>
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] Elements:'</span>, {
    <span class="hljs-attr">fileInput</span>: !!fileInput, <span class="hljs-attr">preview</span>: !!preview, <span class="hljs-attr">outputEl</span>: !!outputEl,
    <span class="hljs-attr">meaningEl</span>: !!meaningEl, <span class="hljs-attr">btnDescribe</span>: !!btnDescribe, <span class="hljs-attr">statusEl</span>: !!statusEl
  });

  <span class="hljs-comment">// Init API key</span>
  <span class="hljs-keyword">if</span> (apiKeyInput) apiKeyInput.value = loadApiKey() || <span class="hljs-string">""</span>;

  <span class="hljs-comment">// --- Helpers ---</span>
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">setStatus</span>(<span class="hljs-params">text</span>) </span>{
    <span class="hljs-keyword">if</span> (statusEl) statusEl.textContent = text || <span class="hljs-string">''</span>;
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton][Status]'</span>, text);
  }
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">clearOutputs</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">if</span> (outputEl) outputEl.textContent = <span class="hljs-string">''</span>;
    <span class="hljs-keyword">if</span> (meaningEl) meaningEl.textContent = <span class="hljs-string">''</span>;
    <span class="hljs-keyword">if</span> (btnSpeak) btnSpeak.disabled = <span class="hljs-literal">true</span>;
    <span class="hljs-keyword">if</span> (btnCopy)  btnCopy.disabled  = <span class="hljs-literal">true</span>;
  }
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">setOutput</span>(<span class="hljs-params">desc</span>) </span>{
    <span class="hljs-keyword">if</span> (outputEl) outputEl.textContent = desc || <span class="hljs-string">''</span>;
    <span class="hljs-keyword">const</span> meaning = mapDescriptionToMeaning(desc || <span class="hljs-string">''</span>);
    <span class="hljs-keyword">if</span> (meaningEl) meaningEl.textContent = meaning;
    <span class="hljs-keyword">if</span> (btnSpeak) btnSpeak.disabled = !meaning || meaning.includes(<span class="hljs-string">'No direct mapping'</span>);
    <span class="hljs-keyword">if</span> (btnCopy)  btnCopy.disabled  = !meaning;
    setStatus(<span class="hljs-string">'Done.'</span>);
  }
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">fileToDataURL</span>(<span class="hljs-params">file</span>) </span>{
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Promise</span>(<span class="hljs-function">(<span class="hljs-params">resolve, reject</span>) =&gt;</span> {
      <span class="hljs-keyword">const</span> reader = <span class="hljs-keyword">new</span> FileReader();
      reader.onload  = <span class="hljs-function">() =&gt;</span> resolve(reader.result);
      reader.onerror = <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> reject(e);
      reader.readAsDataURL(file);
    });
  }
  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">handleFiles</span>(<span class="hljs-params">files</span>) </span>{
    <span class="hljs-keyword">const</span> file = files?.[<span class="hljs-number">0</span>];
    <span class="hljs-keyword">if</span> (!file) { setStatus(<span class="hljs-string">'No file selected.'</span>); <span class="hljs-keyword">return</span>; }
    currentImageMime = file.type || <span class="hljs-string">"image/png"</span>;
    fileToDataURL(file)
      .then(<span class="hljs-function">(<span class="hljs-params">dataUrl</span>) =&gt;</span> {
        currentImageDataUrl = dataUrl;
        <span class="hljs-keyword">if</span> (preview) {
          preview.innerHTML = <span class="hljs-string">`&lt;img alt="preview" src="<span class="hljs-subst">${dataUrl}</span>" /&gt;`</span>;
          preview.classList.remove(<span class="hljs-string">'hidden'</span>);
        }
        setStatus(<span class="hljs-string">'Image loaded. Click "Describe" to continue.'</span>);
      })
      .catch(<span class="hljs-function">(<span class="hljs-params">err</span>) =&gt;</span> {
        <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'[Makaton] fileToDataURL error'</span>, err);
        setStatus(<span class="hljs-string">'Could not read the image.'</span>);
      });
  }

  <span class="hljs-comment">// --- File input change ---</span>
  <span class="hljs-keyword">if</span> (fileInput) {
    fileInput.addEventListener(<span class="hljs-string">'change'</span>, <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> {
      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] file input change'</span>);
      handleFiles(e.target.files);
    });
  } <span class="hljs-keyword">else</span> {
    <span class="hljs-built_in">console</span>.warn(<span class="hljs-string">'[Makaton] #file input not found in DOM.'</span>);
  }

  <span class="hljs-comment">// --- Drag &amp; drop support on preview area ---</span>
  <span class="hljs-keyword">if</span> (preview) {
    preview.addEventListener(<span class="hljs-string">'dragover'</span>, <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> { e.preventDefault(); preview.classList.add(<span class="hljs-string">'drag'</span>); });
    preview.addEventListener(<span class="hljs-string">'dragleave'</span>, <span class="hljs-function">() =&gt;</span> preview.classList.remove(<span class="hljs-string">'drag'</span>));
    preview.addEventListener(<span class="hljs-string">'drop'</span>, <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> {
      e.preventDefault();
      preview.classList.remove(<span class="hljs-string">'drag'</span>);
      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] drop'</span>);
      handleFiles(e.dataTransfer?.files);
    });
  }

  <span class="hljs-comment">// --- Describe click ---</span>
  <span class="hljs-keyword">if</span> (btnDescribe) {
    btnDescribe.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-keyword">async</span> () =&gt; {
      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'[Makaton] Describe clicked'</span>);
      <span class="hljs-keyword">if</span> (!currentImageDataUrl) { setStatus(<span class="hljs-string">'Please upload an image first.'</span>); <span class="hljs-keyword">return</span>; }
      clearOutputs();
      setStatus(<span class="hljs-string">'Checking on-device AI availability…'</span>);

      <span class="hljs-keyword">const</span> avail = <span class="hljs-keyword">await</span> checkAvailability().catch(<span class="hljs-function">() =&gt;</span> ({ <span class="hljs-attr">nanoTextPossible</span>: <span class="hljs-literal">false</span> }));
      <span class="hljs-keyword">try</span> {
        <span class="hljs-keyword">const</span> apiKey = loadApiKey();
        <span class="hljs-keyword">if</span> (apiKey) {
          setStatus(<span class="hljs-string">'Using Gemini cloud for image description…'</span>);
          <span class="hljs-keyword">const</span> desc = <span class="hljs-keyword">await</span> describeImageWithGemini(currentImageDataUrl, apiKey, currentImageMime);
          setOutput(desc);
          <span class="hljs-keyword">return</span>;
        }
        <span class="hljs-keyword">if</span> (avail.nanoTextPossible) {
          setStatus(<span class="hljs-string">'No API key found. Using on-device AI (text) for best guess…'</span>);
          <span class="hljs-keyword">const</span> session = <span class="hljs-keyword">await</span> createNanoTextSession();
          <span class="hljs-keyword">const</span> desc = <span class="hljs-keyword">await</span> session.prompt(<span class="hljs-string">'Given an image is uploaded by the user (not directly visible to you), infer a likely one-sentence description of a common Makaton sign or symbol a teacher might upload. Keep it generic and safe.'</span>);
          setOutput(desc);
          <span class="hljs-keyword">return</span>;
        }
        setStatus(<span class="hljs-string">'No AI available. Please type a brief description.'</span>);
        <span class="hljs-keyword">if</span> (typedBox) typedBox.classList.remove(<span class="hljs-string">'hidden'</span>);
      } <span class="hljs-keyword">catch</span> (err) {
        <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'[Makaton] Describe error'</span>, err);
        setStatus(<span class="hljs-string">'Description failed: '</span> + (err?.message || err));
        <span class="hljs-keyword">if</span> (typedBox) typedBox.classList.remove(<span class="hljs-string">'hidden'</span>);
      }
    });
  } <span class="hljs-keyword">else</span> {
    <span class="hljs-built_in">console</span>.warn(<span class="hljs-string">'[Makaton] Describe button not found.'</span>);
  }

  <span class="hljs-comment">// --- Manual typing flow ---</span>
  <span class="hljs-keyword">if</span> (btnType) {
    btnType.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> {
      <span class="hljs-keyword">if</span> (typedBox) typedBox.classList.remove(<span class="hljs-string">'hidden'</span>);
      <span class="hljs-keyword">if</span> (typed) typed.focus();
    });
  }
  <span class="hljs-keyword">if</span> (btnUseTyped) {
    btnUseTyped.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> {
      <span class="hljs-keyword">const</span> text = (typed?.value || <span class="hljs-string">''</span>).trim();
      <span class="hljs-keyword">if</span> (!text) { setStatus(<span class="hljs-string">'Type a description first.'</span>); <span class="hljs-keyword">return</span>; }
      setOutput(text);
    });
  }

  <span class="hljs-comment">// --- Utilities ---</span>
  <span class="hljs-keyword">if</span> (btnSpeak) {
    btnSpeak.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> {
      <span class="hljs-keyword">const</span> text = meaningEl?.textContent?.trim();
      <span class="hljs-keyword">if</span> (!text) <span class="hljs-keyword">return</span>;
      <span class="hljs-keyword">const</span> u = <span class="hljs-keyword">new</span> SpeechSynthesisUtterance(text);
      speechSynthesis.cancel();
      speechSynthesis.speak(u);
    });
  }
  <span class="hljs-keyword">if</span> (btnCopy) {
    btnCopy.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-keyword">async</span> () =&gt; {
      <span class="hljs-keyword">const</span> text = meaningEl?.textContent?.trim();
      <span class="hljs-keyword">if</span> (!text) <span class="hljs-keyword">return</span>;
      <span class="hljs-keyword">try</span> {
        <span class="hljs-keyword">await</span> navigator.clipboard.writeText(text);
        setStatus(<span class="hljs-string">'Copied meaning to clipboard.'</span>);
      } <span class="hljs-keyword">catch</span> {
        setStatus(<span class="hljs-string">'Copy failed.'</span>);
      }
    });
  }

  <span class="hljs-comment">// --- Settings modal ---</span>
  <span class="hljs-keyword">if</span> (btnSettings &amp;&amp; settings) btnSettings.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> settings.showModal());
  <span class="hljs-keyword">if</span> (btnCloseSettings &amp;&amp; settings) btnCloseSettings.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> settings.close());
  <span class="hljs-keyword">if</span> (btnSaveKey) {
    btnSaveKey.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> {
      e.preventDefault();
      <span class="hljs-keyword">const</span> k = apiKeyInput?.value?.trim() || <span class="hljs-string">""</span>;
      saveApiKey(k);
      <span class="hljs-keyword">if</span> (apiStatus) apiStatus.textContent = k ? <span class="hljs-string">"API key saved locally. Try Describe again."</span> : <span class="hljs-string">"Cleared API key. You can still use on-device or typed mode."</span>;
    });
  }

  <span class="hljs-comment">// First status</span>
  setStatus(<span class="hljs-string">'Ready. Upload an image to begin.'</span>);
});
</code></pre>
<p>Let's break down the main sections of the <code>app.js</code> script for the Makaton AI Companion, as there’s a lot going on here:</p>
<ol>
<li><p><strong>Imports and Initial Setup:</strong></p>
<ul>
<li><p>The script imports functions from <code>mapping.js</code> and <code>ai.js</code> to handle mapping descriptions to meanings and AI interactions.</p>
</li>
<li><p>It sets up event listeners for when the DOM content is fully loaded, ensuring all elements are ready for interaction.</p>
</li>
</ul>
</li>
<li><p><strong>Element Selection:</strong></p>
<ul>
<li>It uses a helper function <code>$</code> to select DOM elements by their CSS selectors. This includes file inputs, buttons, and display areas for image previews and outputs.</li>
</ul>
</li>
<li><p><strong>Sanity Logs:</strong></p>
<ul>
<li>It logs the presence of key elements to the console for debugging purposes, ensuring that all necessary elements are found in the DOM.</li>
</ul>
</li>
<li><p><strong>API Key Initialization:</strong></p>
<ul>
<li>It loads any saved API key from local storage and sets it in the input field for user convenience.</li>
</ul>
</li>
<li><p><strong>Helper Functions:</strong></p>
<ul>
<li><p><code>setStatus</code>: Updates the status message displayed to the user.</p>
</li>
<li><p><code>clearOutputs</code>: Clears the output and meaning display areas and disables buttons for speaking and copying.</p>
</li>
<li><p><code>setOutput</code>: Displays the AI-generated description and maps it to a Makaton meaning, enabling buttons if a valid meaning is found.</p>
</li>
<li><p><code>fileToDataURL</code>: Converts an uploaded file to a data URL for image preview and processing.</p>
</li>
<li><p><code>handleFiles</code>: Handles file selection, updating the preview and setting the current image data URL.</p>
</li>
</ul>
</li>
<li><p><strong>File Input Change Handling:</strong></p>
<ul>
<li>It listens for changes in the file input, processes the selected file, and updates the preview area.</li>
</ul>
</li>
<li><p><strong>Drag &amp; Drop Support:</strong></p>
<ul>
<li>It adds drag-and-drop functionality to the preview area, allowing users to drag files directly onto the app for processing.</li>
</ul>
</li>
<li><p><strong>Describe Button Click:</strong></p>
<ul>
<li><p>It handles the "Describe" button click event, checking for an uploaded image and attempting to describe it using either the Gemini API or on-device AI.</p>
</li>
<li><p>If no AI is available, it prompts the user to type a description manually.</p>
</li>
</ul>
</li>
<li><p><strong>Manual Typing Flow:</strong></p>
<ul>
<li>It allows users to manually type a description if AI processing is unavailable or fails, updating the output with the typed text.</li>
</ul>
</li>
<li><p><strong>Utilities:</strong></p>
<ul>
<li><p><code>btnSpeak</code>: Uses the browser's SpeechSynthesis API to read aloud the mapped meaning.</p>
</li>
<li><p><code>btnCopy</code>: Copies the mapped meaning to the clipboard for easy sharing.</p>
</li>
</ul>
</li>
<li><p><strong>Settings Modal:</strong></p>
<ul>
<li>It manages the settings modal for entering and saving the API key, providing feedback on the key's status.</li>
</ul>
</li>
<li><p><strong>Initial Status:</strong></p>
<ul>
<li>It sets the initial status message to guide the user to upload an image to begin the process.</li>
</ul>
</li>
</ol>
<p>This script effectively ties together the user interface, file handling, AI processing, and output display, providing a seamless experience for translating Makaton signs into English meanings.</p>
<h4 id="heading-how-vision-and-language-work-together-here">How Vision and Language Work Together Here</h4>
<p>While working on this project, I started appreciating how computer vision and language understanding complement each other in multimodal systems like this one.</p>
<ul>
<li><p>The vision model (Gemini or Nano) interprets <em>what it sees</em> like hand shapes, gestures, or layout and turns that visual context into descriptive language.</p>
</li>
<li><p>The language mapping logic then interprets those words, infers intent, and finds the closest semantic match (e.g., “help,” “friend,” “eat”).</p>
</li>
<li><p>It’s a collaboration between two forms of understanding (<em>perceptual</em> and <em>semantic</em>) that together allow the AI to bridge the gap between gesture and meaning.</p>
</li>
</ul>
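<p>In code, that hand-off between the two stages is just two calls, as in this condensed sketch reusing the functions defined earlier:</p>
<pre><code class="lang-javascript">// Perceptual step: the vision model turns pixels into language.
const desc = await describeImageWithGemini(imageDataUrl, apiKey, "image/png");
// e.g. "A hand raised with the palm facing forward."

// Semantic step: the mapping logic turns language into meaning.
const meaning = mapDescriptionToMeaning(desc); // e.g. "Hello / Stop"
</code></pre>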
<p>This realization reshaped how I think about accessibility: the best assistive technologies often emerge not from smarter models alone, but from the interaction between modalities like seeing, describing, and reasoning in context.</p>
<h3 id="heading-6-optional-speak-and-copy">6. Optional — Speak and Copy</h3>
<p>To make the app more accessible, I added speech output and a quick copy button:</p>
<pre><code class="lang-javascript">btnSpeak.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-keyword">const</span> text = meaningEl.textContent.trim();
  <span class="hljs-keyword">if</span> (text) speechSynthesis.speak(<span class="hljs-keyword">new</span> SpeechSynthesisUtterance(text));
});

btnCopy.addEventListener(<span class="hljs-string">'click'</span>, <span class="hljs-keyword">async</span> () =&gt; {
  <span class="hljs-keyword">const</span> text = meaningEl.textContent.trim();
  <span class="hljs-keyword">if</span> (text) <span class="hljs-keyword">await</span> navigator.clipboard.writeText(text);
});
</code></pre>
<p>This gives users both visual and auditory feedback, especially helpful for learners or educators.</p>
<h2 id="heading-how-to-fix-the-common-issues">How to Fix the Common Issues</h2>
<p>No AI or web integration project runs smoothly the first time – and that’s okay. Here’s a breakdown of the main issues I faced while building the Makaton AI Companion, how I diagnosed them, and how I fixed each one.</p>
<p>These lessons will help anyone trying to integrate Gemini APIs, on-device AI, or local web apps without a full backend.</p>
<h3 id="heading-1-the-cors-error-when-running-with-file">1. The “CORS” Error When Running With <code>file://</code></h3>
<p>When I first opened my <code>index.html</code> directly from my file explorer, Chrome threw several CORS policy errors:</p>
<pre><code class="lang-python">Access to script at <span class="hljs-string">'file:///lib/ai.js'</span> <span class="hljs-keyword">from</span> origin <span class="hljs-string">'null'</span> has been blocked by CORS policy.
</code></pre>
<p>At first this looked confusing, but the reason is simple: modern browsers block JavaScript modules (<code>import/export</code>) when running from <code>file://</code> paths for security reasons.</p>
<p>✅ <strong>Fix:</strong> I realized I needed to serve the files over <strong>HTTP</strong>, not from the file system. So I ran a quick local web server using Python:</p>
<pre><code class="lang-python">python -m http.server <span class="hljs-number">8080</span>
</code></pre>
<p>Then opened:</p>
<pre><code class="lang-python">http://localhost:<span class="hljs-number">8080</span>/index.html
</code></pre>
<p>That single step fixed all the CORS errors and allowed my modules to load correctly.</p>
<h3 id="heading-2-model-not-found-404-from-the-gemini-api">2. “Model Not Found” (404) From the Gemini API</h3>
<p>The next big challenge came from the Gemini API. Even though I had a valid API key, my console showed this error:</p>
<pre><code class="lang-python"><span class="hljs-string">"models/gemini-1.5-flash"</span> <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> found <span class="hljs-keyword">for</span> API version v1beta, <span class="hljs-keyword">or</span> <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> supported <span class="hljs-keyword">for</span> generateContent.
</code></pre>
<p>It turns out Google’s API endpoints can vary slightly depending on your project setup and key permissions.</p>
<p>✅ <strong>Fix:</strong> I rewrote my <code>lib/ai.js</code> script to automatically <strong>try multiple Gemini model endpoints</strong> until it found one that worked. Something like this:</p>
<pre><code class="lang-python">const GEMINI_IMAGE_ENDPOINTS = [
  <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/models/gemini-1.5-flash:generateContent"</span>,
  <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/models/gemini-1.5-pro:generateContent"</span>,
  <span class="hljs-string">"https://generativelanguage.googleapis.com/v1/models/gemini-1.5-flash-latest:generateContent"</span>,
];
</code></pre>
<p>And I wrapped it in a loop that stopped once one endpoint succeeded.</p>
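<p>Here’s a minimal sketch of that loop, assuming the standard <code>generateContent</code> request shape. The prompt text and error handling are simplified for illustration:</p>
<pre><code class="lang-javascript">// Sketch: try each endpoint in turn until one accepts the request.
async function describeWithGemini(apiKey, base64Image) {
  const body = {
    contents: [{
      parts: [
        { text: 'Describe this Makaton sign briefly.' }, // illustrative prompt
        { inline_data: { mime_type: 'image/png', data: base64Image } },
      ],
    }],
  };
  for (const endpoint of GEMINI_IMAGE_ENDPOINTS) {
    const res = await fetch(`${endpoint}?key=${apiKey}`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    });
    if (!res.ok) continue; // e.g. a 404, so try the next endpoint
    const data = await res.json();
    return data.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
  }
  throw new Error('No Gemini endpoint accepted the request.');
}
</code></pre>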
<p>Later, I improved it further by listing the available models dynamically via<br><code>https://generativelanguage.googleapis.com/v1/models?key=YOUR_KEY</code> and automatically trying whichever ones supported image input with <code>generateContent</code>.</p>
<p>That dynamic discovery approach fixed the 404 errors permanently.</p>
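<p>For reference, that discovery step can be sketched like this, assuming the v1 <code>ListModels</code> response shape, where each model entry exposes a <code>supportedGenerationMethods</code> list:</p>
<pre><code class="lang-javascript">// Sketch: list the models this key can use and keep the generateContent-capable ones.
async function listUsableModels(apiKey) {
  const res = await fetch(
    `https://generativelanguage.googleapis.com/v1/models?key=${apiKey}`
  );
  const { models = [] } = await res.json();
  return models
    .filter((m) =&gt; (m.supportedGenerationMethods || []).includes('generateContent'))
    .map((m) =&gt; m.name); // e.g. "models/gemini-1.5-flash"
}
</code></pre>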
<h3 id="heading-3-packaging-a-local-single-file-version"><strong>3. Packaging a Local Single-File Version</strong></h3>
<p>Once I got everything working, I wanted a version that others could test easily without installing Node.js or running build tools.</p>
<p>✅ <strong>Fix:</strong> I bundled the project into a simple zip file containing:</p>
<pre><code class="lang-python">index.html
app.js
lib/ai.js
lib/mapping.js
styles.css
</code></pre>
<p>That way, anyone can just unzip and run:</p>
<pre><code class="lang-python">python -m http.server <span class="hljs-number">8080</span>
</code></pre>
<p>and open <code>localhost:8080</code>.</p>
<p>Everything runs locally in the browser, no server-side code required. This also makes it perfect for demos, classrooms, and so on.</p>
<h3 id="heading-4-debugging-script-import-errors-in-the-console">4. Debugging Script Import Errors in the Console</h3>
<p>Another subtle issue appeared when I noticed this red message:</p>
<pre><code class="lang-python">The requested module <span class="hljs-string">'./lib/mapping.js'</span> does <span class="hljs-keyword">not</span> provide an export named <span class="hljs-string">'mapDescriptionToMeaning'</span>
</code></pre>
<p>That line told me exactly what was wrong: my import and export function names didn’t match. The fix was straightforward:</p>
<pre><code class="lang-python">// app.js
<span class="hljs-keyword">import</span> { mapDescriptionToMeaning } <span class="hljs-keyword">from</span> <span class="hljs-string">'./lib/mapping.js'</span>;
</code></pre>
<p>And then ensuring the mapping file exported it:</p>
<pre><code class="lang-python">// mapping.js
export function mapDescriptionToMeaning(desc) { ... }
</code></pre>
<p>After that, all the pieces connected smoothly.</p>
<p>Using the browser console <strong>as my debugging dashboard</strong> turned out to be the most powerful tool of all. Every fix started by reading and reasoning about those red error lines.</p>
<h2 id="heading-demo-the-makaton-ai-companion-in-action">Demo: The Makaton AI Companion in Action</h2>
<p>Let’s see the Makaton AI Companion in action and understand what’s happening under the hood.</p>
<h3 id="heading-step-1-run-the-app-locally">Step 1: Run the app locally</h3>
<p>Once you’ve downloaded or cloned the project folder, open your terminal in that directory and start a local development server: <code>python -m http.server 8080</code>. Then open your browser and visit: <code>http://localhost:8080/index.html</code></p>
<p>You should see the Makaton AI Companion interface:</p>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/app-interface.jpg?raw=true" alt="Main interface of the Makaton AI Companion app" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-2-get-your-gemini-api-key">Step 2: Get Your Gemini API Key</h3>
<p>To enable cloud-based image description, you’ll need a <a target="_blank" href="https://aistudio.google.com/welcome"><strong>Gemini API key</strong></a> from Google AI Studio.</p>
<p><strong>Here’s how to generate one:</strong></p>
<ol>
<li><p>Visit: <code>https://aistudio.google.com/welcome</code></p>
</li>
<li><p>Click <strong>“Create API key”</strong> and link it to your Google Cloud project (or create a new one).</p>
</li>
<li><p>Copy the key; it will look something like this: <code>AIzaSyA...XXXXXXXXXXXX</code></p>
</li>
<li><p>Open the Makaton AI Companion in your browser and click the <strong>Settings</strong> button (top left).</p>
</li>
<li><p>Paste your key in the input box and click <strong>Save</strong>.</p>
</li>
</ol>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/api-key-setting.jpg?raw=true" alt="Setting up the OpenAI API key in the app interface" width="600" height="400" loading="lazy"></p>
<p>You’ll see a confirmation message like this:</p>
<blockquote>
<p><em>“API key saved locally. Try Describe again.”</em></p>
</blockquote>
<p>This means your key is stored safely in localStorage and is only accessible from your browser.</p>
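<p>Under the hood, saving the key is just a <code>localStorage</code> write. Here’s a two-line sketch; the storage key name is an assumption, not necessarily what the app uses:</p>
<pre><code class="lang-javascript">// Persist the key locally (assumed storage key name).
localStorage.setItem('gemini_api_key', apiKeyInput.value.trim());
// On page load, restore it; this returns null if nothing was saved yet.
const savedKey = localStorage.getItem('gemini_api_key');
</code></pre>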
<h3 id="heading-step-3-enable-gemini-nano-for-on-device-ai">Step 3: Enable Gemini Nano for On-Device AI</h3>
<p>If you’re using <a target="_blank" href="https://www.google.com/intl/en_uk/chrome/canary/"><strong>Chrome Canary</strong></a>, you can run Gemini Nano locally without internet access. This allows the Makaton AI Companion to generate text even when the API key isn’t set.</p>
<h4 id="heading-download-and-install-chrome-canary">Download and Install Chrome Canary:</h4>
<p>Visit the official Chrome Canary download page and install it on your Windows or macOS system. Chrome Canary is a special version of Chrome designed for developers and early adopters, offering the latest features and updates.</p>
<h4 id="heading-enable-gemini-nano">Enable Gemini Nano:</h4>
<p>Open Chrome Canary and type <code>chrome://flags/#prompt-api-for-gemini-nano</code> in the address bar.</p>
<p>Locate the "Prompt API for Gemini Nano" flag in the list. Set this flag to <strong>Enabled</strong>. This action allows Chrome Canary to support the Gemini Nano model for on-device AI processing.</p>
<p>After enabling the flag, relaunch Chrome Canary to apply the changes.</p>
<h4 id="heading-download-the-gemini-nano-model">Download the Gemini Nano Model:</h4>
<p>Open a new tab in Chrome Canary and enter <code>chrome://components</code> in the address bar.</p>
<p>Scroll down to find the <strong>“Optimization Guide”</strong> component. Click on <strong>Check for update</strong>. This action will initiate the download of the Gemini Nano model, which is necessary for running AI tasks locally without an internet connection.</p>
<h4 id="heading-verify-installation">Verify Installation:</h4>
<p>Once the Gemini Nano model is installed, the Makaton AI Companion app will automatically detect it. You should see a message indicating that the app is using on-device AI: <em>“No API key found. Using on-device AI (text) for best guess…”</em></p>
<p>This confirmation means that the app can now generate text descriptions using the Gemini Nano model without needing an API key or internet access.</p>
<p>By following these detailed steps, you ensure that the Gemini Nano model is correctly set up and ready to use for on-device AI processing in the Makaton AI Companion.</p>
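<p>For context, detecting Nano from JavaScript looked roughly like this at the time of writing. The Prompt API is experimental and its surface has changed between Canary releases, so treat this as a sketch rather than a stable API:</p>
<pre><code class="lang-javascript">// Sketch only: the experimental Prompt API shape varies across Canary builds.
async function tryOnDeviceAI(promptText) {
  if (!('ai' in window) || !window.ai?.languageModel) return null; // Nano unavailable
  const session = await window.ai.languageModel.create();
  return session.prompt(promptText); // returns a plain-text best guess
}
</code></pre>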
<h3 id="heading-step-4-upload-a-makaton-sign-or-symbol">Step 4: Upload a Makaton sign or symbol</h3>
<p>Click <strong>Choose File</strong> to upload any Makaton image (for example, the “help” sign), then press <strong>Describe (Cloud or Nano)</strong>. You’ll immediately see console logs confirming that the app is running correctly and connecting to the Gemini API:</p>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/console.jpg?raw=true" alt="Console output showing real-time translation logs" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-5-ai-description-and-mapping">Step 5: AI Description and Mapping</h3>
<p>Here’s what happens next:</p>
<ol>
<li><p>The image is read and encoded as Base64.</p>
</li>
<li><p>The Gemini API (cloud or on-device) generates a short visual description.</p>
</li>
<li><p>The description is passed to the <code>mapDescriptionToMeaning()</code> function (a short glue sketch follows this list).</p>
</li>
<li><p>If keywords match an entry in the <code>MAKATON_GLOSSES</code> dictionary, the app displays the corresponding English meaning.</p>
</li>
<li><p>Finally, users can click <strong>Speak</strong> or <strong>Copy</strong> to hear or reuse the translation.</p>
</li>
</ol>
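<p>Putting those steps together, the glue code amounts to something like this sketch. It reuses the names introduced earlier; the exact identifiers in the app may differ:</p>
<pre><code class="lang-javascript">// Sketch of the describe flow: data URL -&gt; base64 -&gt; description -&gt; meaning.
const base64 = currentImageDataURL.split(',')[1]; // strip the "data:image/...;base64," prefix
const description = await describeWithGemini(apiKey, base64);
const meaning = mapDescriptionToMeaning(description);
setOutput(description, meaning); // enables Speak/Copy when a meaning is found
</code></pre>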
<p>Example outputs:</p>
<p><strong>When no mapping is found:</strong><br>The AI description is accurate but doesn’t yet match a known Makaton keyword.</p>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/Incorrect-demonstration.jpg?raw=true" alt="Incorrect demonstration showing the model misinterpreting a sign" width="600" height="400" loading="lazy"></p>
<p><strong>After updating the mapping list:</strong><br>Adding new keywords like <code>"help"</code>, <code>"assist"</code>, or <code>"hand over hand"</code> enables correct translation.</p>
<p><img src="https://github.com/tayo4christ/makaton-ai-companion/blob/9cc834fa75f6dcd39866c538ed42255f9006bb51/assets/correct-demonstration.jpg?raw=true" alt="Correct demonstration where the AI accurately recognizes the Makaton sign" width="600" height="400" loading="lazy"></p>
<h3 id="heading-why-this-matters">Why this matters</h3>
<p>This demonstrates how accessible, AI-assisted tools can support communication for people who rely on Makaton. Even when a gesture isn’t recognized, the system provides a structured output and allows users or educators to expand the mapping list, making the tool smarter over time.</p>
<h2 id="heading-broader-reflections">Broader Reflections</h2>
<p>Building this project turned out to be much more than a coding exercise for me.<br>It was a meaningful experiment in combining accessibility, natural language processing, and computer vision. These three fields, when brought together, can create real social impact.</p>
<p>While working on it, I began to understand how computer vision and language understanding complement each other in practice. The vision model perceives the world by identifying shapes, gestures, and spatial patterns, while the language model interprets what those visuals mean in human terms.<br>In this project, the artificial intelligence system first sees the Makaton sign, then describes it, and finally maps it to an English word that carries intent and meaning.</p>
<p>This interaction between perception and semantics is what makes multimodal artificial intelligence so powerful. It is not only about recognizing an image or generating text; it is about building systems that connect understanding across different forms of information to make technology more inclusive and human centered.</p>
<p>This realization changed how I think about accessibility technology. True innovation happens not only through smarter models but through the harmony between seeing and understanding, between what an artificial intelligence system observes and how it communicates that observation to help people.</p>
<h3 id="heading-accessibility-meets-ai">Accessibility Meets AI</h3>
<p>Working on this project reminded me that accessibility isn’t just about compliance or assistive devices. It’s also about inclusion. A simple AI system that can describe a hand gesture or symbol in real time can empower teachers, parents, and students who communicate using Makaton or similar systems.</p>
<p>By mapping AI-generated descriptions to meaningful phrases, the app demonstrates how AI can support inclusive education, even at small scales. It bridges the communication gap between verbal and nonverbal learners, which is something that traditional translation systems often overlook.</p>
<h3 id="heading-integrating-nlp-and-computer-vision">Integrating NLP and Computer Vision</h3>
<p>On the technical side, this project showed me how naturally computer vision and language understanding complement each other. The Gemini API’s multimodal models were able to analyze an image and produce coherent natural-language sentences, something that older APIs couldn’t do without chaining multiple tools.</p>
<p>By feeding that output into a lightweight NLP mapping function, I was able to simulate a very early-stage symbol-to-language translator, the core of my broader research interest in automatic Makaton-to-English translation.</p>
<h3 id="heading-why-local-ai-gemini-nano-matters">Why Local AI (Gemini Nano) Matters</h3>
<p>While the cloud models are powerful, experimenting with Gemini Nano revealed something exciting:<br>on-device AI can make accessibility tools faster, safer, and more private.</p>
<p>In classrooms or therapy sessions, you often can’t rely on stable internet connections or share sensitive student data. Running inference locally means learners’ gestures or symbol images never leave the device, a crucial step toward privacy-preserving accessibility AI.</p>
<p>And since Nano runs directly inside Chrome Canary, it shows how AI is becoming embedded at the browser level, lowering barriers for teachers and developers to build inclusive solutions without needing large infrastructure.</p>
<h3 id="heading-looking-forward">Looking Forward</h3>
<p>This prototype is just a starting point. Future iterations could integrate gesture recognition directly from camera input, support multiple symbol sets, or even learn from user feedback to expand the dictionary automatically.</p>
<p>Most importantly, it reinforces a central belief in my research and teaching journey:</p>
<p><strong>Accessibility innovation doesn’t require massive systems. It starts with curiosity, empathy, and a few lines of purposeful code.</strong></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building the Makaton AI Companion has been one of the most rewarding projects in my AI journey – not just because it worked, but because it proved how accessible innovation can be.</p>
<p>With just a browser, a few lines of JavaScript, and the right API, I was able to combine computer vision, language understanding, and accessibility design into a working system that translates symbols into meaning. It’s a small step toward a future where anyone, regardless of speech or language ability, can be understood through technology.</p>
<p>The project also reinforced something deeply personal to me as a researcher and educator: that AI for accessibility doesn’t need to be complex, expensive, or centralized. It can be lightweight, open, and built with empathy by anyone who’s willing to learn and experiment.</p>
<h3 id="heading-join-the-conversation">Join the Conversation</h3>
<p>If this project inspires you, I’d love to see your own experiments and improvements. Can you make it support live webcam gestures? Could you adapt it for other symbol systems, like PECS or BSL?</p>
<p>Share your ideas in the comments or tag me if you publish your own version. Together, we can grow a small prototype into a community-driven accessibility tool and continue exploring how AI can give more people a voice.</p>
<p>Full source code on GitHub: <a target="_blank" href="https://github.com/tayo4christ/makaton-ai-companion">Makaton-ai-companion</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Transformers for Real-Time Gesture Recognition ]]>
                </title>
                <description>
                    <![CDATA[ Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are no... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/using-transformers-for-real-time-gesture-recognition/</link>
                <guid isPermaLink="false">68e3c692aa82abf4b593114c</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ transformers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pytorch ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ONNX ]]>
                    </category>
                
                    <category>
                        <![CDATA[ gradio ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Gesture Recognition ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Accessibility ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Tutorial ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Mon, 06 Oct 2025 13:39:30 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759757931295/5f19fd4e-93c0-4bd7-a75c-a7858e061ecd.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are not static images. Rather, they unfold over time. To build more robust, real-time systems, we need models that can capture both spatial details and temporal context.</p>
<p>This is where Transformers come in. Originally built for language, they’ve become state-of-the-art in vision tasks thanks to models like the Vision Transformer (ViT) and video-focused variants such as TimeSformer.</p>
<p>In this tutorial, we’ll use a Transformer backbone to create a lightweight real-time gesture recognition tool, optimized for small datasets and deployable on a regular laptop webcam.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-transformers-for-gestures">Why Transformers for Gestures?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-youll-learn">What You’ll Learn</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-generate-a-gesture-dataset">Generate a Gesture Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-1-generate-a-synthetic-dataset">Option 1: Generate a Synthetic Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-training-script-trainpy">Training Script:</a> <a target="_blank" href="http://train.py">train.py</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-export-the-model-to-onnx">Export the Model to ONNX</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-evaluate-accuracy-latency">Evaluate Accuracy + Latency</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-2-use-small-samples-from-public-gesture-datasets">Option 2: Use Small Samples from Public Gesture Datasets</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-accessibility-notes-amp-ethical-limits">Accessibility Notes &amp; Ethical Limits</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-next-steps">Next Steps</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-why-transformers-for-gestures">Why Transformers for Gestures?</h2>
<p>Transformers are powerful because they use self-attention to model relationships across a sequence. For gestures, this means the model doesn’t just see isolated frames, but also learns how movements evolve over time. A wave, for example, looks different from a raised hand only when viewed as a sequence.</p>
<p>Vision Transformers process images as patches, while video Transformers extend this to multiple frames with temporal attention. Even a simple approach, like applying ViT to each frame and pooling across time, can outperform traditional CNN-based methods for small datasets.</p>
<p>Combined with Hugging Face’s pre-trained models and ONNX Runtime for optimization, Transformers make it possible to train on a modest dataset and still achieve smooth real-time recognition.</p>
<h2 id="heading-what-youll-learn">What You’ll Learn</h2>
<p>In this tutorial, you’ll build a gesture recognition system using Transformers. By the end, you’ll know how to:</p>
<ul>
<li><p>Create (or record) a tiny gesture dataset</p>
</li>
<li><p>Train a Vision Transformer (ViT) with temporal pooling</p>
</li>
<li><p>Export the model to ONNX for faster inference</p>
</li>
<li><p>Build a real-time Gradio app that classifies gestures from your webcam</p>
</li>
<li><p>Evaluate your model’s accuracy and latency with simple scripts</p>
</li>
<li><p>Understand the accessibility potential and ethical limits of gesture recognition</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along, you should have:</p>
<ul>
<li><p>Basic Python knowledge (functions, scripts, virtual environments)</p>
</li>
<li><p>Familiarity with PyTorch (tensors, datasets, training loops) – helpful but not required</p>
</li>
<li><p>Python 3.8+ installed on your system</p>
</li>
<li><p>A webcam (for the live demo in Gradio)</p>
</li>
<li><p>Optionally: GPU access (training on CPU works, but is slower)</p>
</li>
</ul>
<h2 id="heading-project-setup">Project Setup</h2>
<p>Create a new project folder and install the required libraries.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create a new project directory and navigate into it</span>
mkdir transformer-gesture &amp;&amp; <span class="hljs-built_in">cd</span> transformer-gesture

<span class="hljs-comment"># Set up a Python virtual environment</span>
python -m venv .venv

<span class="hljs-comment"># Activate the virtual environment</span>
<span class="hljs-comment"># Windows PowerShell</span>
.venv\Scripts\Activate.ps1

<span class="hljs-comment"># macOS/Linux</span>
<span class="hljs-built_in">source</span> .venv/bin/activate
</code></pre>
<p>The provided code snippet is a set of commands for setting up a new Python project with a virtual environment. Here's a breakdown of each part:</p>
<ol>
<li><p><code>mkdir transformer-gesture &amp;&amp; cd transformer-gesture</code>: This command creates a new directory named "transformer-gesture" and then navigates into it.</p>
</li>
<li><p><code>python -m venv .venv</code>: This command creates a new virtual environment in the current directory. The virtual environment is stored in a folder named ".venv".</p>
</li>
<li><p>Activating the virtual environment:</p>
<ul>
<li><p>For Windows PowerShell, you can use <code>.venv\Scripts\Activate.ps1</code> to activate the virtual environment.</p>
</li>
<li><p>For macOS/Linux, use <code>source .venv/bin/activate</code> to activate the virtual environment.</p>
</li>
</ul>
</li>
</ol>
<p>Activating a virtual environment ensures that the Python interpreter and any packages you install are isolated to this specific project, preventing conflicts with other projects or system-wide packages.</p>
<p>Create a <code>requirements.txt</code> file:</p>
<pre><code class="lang-plaintext">torch&gt;=2.0
torchvision
torchaudio
timm
huggingface_hub

onnx
onnxruntime

gradio

numpy
opencv-python
pillow

matplotlib
seaborn
scikit-learn
</code></pre>
<p>The list provided is a set of package dependencies typically found in a <code>requirements.txt</code> file for a Python project. Here's a brief explanation of each package:</p>
<ol>
<li><p><strong>torch&gt;=2.0</strong>: PyTorch is a popular open-source deep learning framework that provides a flexible and efficient platform for building and training neural networks. Version 2.0 and above includes improvements in performance and new features.</p>
</li>
<li><p><strong>torchvision</strong>: This library is part of the PyTorch ecosystem and provides tools for computer vision tasks, including datasets, model architectures, and image transformations.</p>
</li>
<li><p><strong>torchaudio</strong>: Also part of the PyTorch ecosystem, Torchaudio provides audio processing tools and datasets, making it easier to work with audio data in deep learning projects.</p>
</li>
<li><p><strong>timm</strong>: The PyTorch Image Models (timm) library offers a collection of pre-trained models and utilities for computer vision tasks, facilitating quick experimentation and deployment.</p>
</li>
<li><p><strong>huggingface_hub</strong>: This library allows easy access to models and datasets hosted on the Hugging Face Hub, a platform for sharing and collaborating on machine learning models and datasets.</p>
</li>
<li><p><strong>onnx</strong>: The Open Neural Network Exchange (ONNX) format is used to represent machine learning models, enabling interoperability between different frameworks.</p>
</li>
<li><p><strong>onnxruntime</strong>: This is a high-performance runtime for executing ONNX models, allowing for efficient deployment across various platforms.</p>
</li>
<li><p><strong>gradio</strong>: Gradio is a library for creating user interfaces for machine learning models, making them accessible through a web interface for easy interaction and testing.</p>
</li>
<li><p><strong>numpy</strong>: A fundamental package for numerical computing in Python, providing support for arrays and a wide range of mathematical functions.</p>
</li>
<li><p><strong>opencv-python</strong>: OpenCV is a library for computer vision and image processing tasks, widely used for real-time applications.</p>
</li>
<li><p><strong>pillow</strong>: A Python Imaging Library (PIL) fork, Pillow provides tools for opening, manipulating, and saving many different image file formats.</p>
</li>
<li><p><strong>matplotlib</strong>: A plotting library for Python, Matplotlib is used for creating static, interactive, and animated visualizations in Python.</p>
</li>
<li><p><strong>seaborn</strong>: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.</p>
</li>
<li><p><strong>scikit-learn</strong>: A machine learning library in Python that provides simple and efficient tools for data analysis and modeling, including classification, regression, clustering, and dimensionality reduction.</p>
</li>
</ol>
<p>Install dependencies:</p>
<pre><code class="lang-bash">pip install -r requirements.txt
</code></pre>
<p>The command <code>pip install -r requirements.txt</code> is used to install all the Python packages listed in a file named <code>requirements.txt</code>. This file typically contains a list of package dependencies required for a Python project, each specified with a package name and optionally a version number.</p>
<p>By running this command, <code>pip</code>, which is the Python package installer, reads the file and installs each package listed, ensuring that the project has all the necessary dependencies to run properly. This is a common practice in Python projects to manage and share dependencies easily.</p>
<h2 id="heading-generate-a-gesture-dataset">Generate a Gesture Dataset</h2>
<p>To train our Transformer-based gesture recognizer, we need some data. Instead of downloading a huge dataset, we’ll start with a tiny synthetic dataset you can generate in seconds. This makes the tutorial lightweight and ensures that everyone can follow along without dealing with multi-gigabyte downloads.</p>
<h2 id="heading-option-1-generate-a-synthetic-dataset">Option 1: Generate a Synthetic Dataset</h2>
<p>We’ll use a small Python script that creates short <code>.mp4</code> clips of a moving (or still) coloured box. Each class represents a gesture:</p>
<ul>
<li><p><strong>swipe_left</strong> – box moves from right to left</p>
</li>
<li><p><strong>swipe_right</strong> – box moves from left to right</p>
</li>
<li><p><strong>stop</strong> – box stays still in the center</p>
</li>
</ul>
<p>Save this script as <code>generate_synthetic_gestures.py</code> in your project root:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os, cv2, numpy <span class="hljs-keyword">as</span> np, random, argparse

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ensure_dir</span>(<span class="hljs-params">p</span>):</span> os.makedirs(p, exist_ok=<span class="hljs-literal">True</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">make_clip</span>(<span class="hljs-params">mode, out_path, seconds=<span class="hljs-number">1.5</span>, fps=<span class="hljs-number">16</span>, size=<span class="hljs-number">224</span>, box_size=<span class="hljs-number">60</span>, seed=<span class="hljs-number">0</span>, codec=<span class="hljs-string">"mp4v"</span></span>):</span>
    rng = random.Random(seed)
    frames = int(seconds * fps)
    H = W = size

    <span class="hljs-comment"># background + box color</span>
    bg_val = rng.randint(<span class="hljs-number">160</span>, <span class="hljs-number">220</span>)
    bg = np.full((H, W, <span class="hljs-number">3</span>), bg_val, dtype=np.uint8)
    color = (rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>), rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>), rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>))

    <span class="hljs-comment"># path of motion</span>
    y = rng.randint(<span class="hljs-number">40</span>, H - <span class="hljs-number">40</span> - box_size)
    <span class="hljs-keyword">if</span> mode == <span class="hljs-string">"swipe_left"</span>:
        x_start, x_end = W - <span class="hljs-number">20</span> - box_size, <span class="hljs-number">20</span>
    <span class="hljs-keyword">elif</span> mode == <span class="hljs-string">"swipe_right"</span>:
        x_start, x_end = <span class="hljs-number">20</span>, W - <span class="hljs-number">20</span> - box_size
    <span class="hljs-keyword">elif</span> mode == <span class="hljs-string">"stop"</span>:
        x_start = x_end = (W - box_size) // <span class="hljs-number">2</span>
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"Unknown mode: <span class="hljs-subst">{mode}</span>"</span>)

    fourcc = cv2.VideoWriter_fourcc(*codec)
    vw = cv2.VideoWriter(out_path, fourcc, fps, (W, H))
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> vw.isOpened():
        <span class="hljs-keyword">raise</span> RuntimeError(
            <span class="hljs-string">f"Could not open VideoWriter with codec '<span class="hljs-subst">{codec}</span>'. "</span>
            <span class="hljs-string">"Try --codec XVID and use .avi extension, e.g. out.avi"</span>
        )

    <span class="hljs-keyword">for</span> t <span class="hljs-keyword">in</span> range(frames):
        alpha = t / max(<span class="hljs-number">1</span>, frames - <span class="hljs-number">1</span>)
        x = int((<span class="hljs-number">1</span> - alpha) * x_start + alpha * x_end)
        <span class="hljs-comment"># small jitter to avoid being too synthetic</span>
        jitter_x, jitter_y = rng.randint(<span class="hljs-number">-2</span>, <span class="hljs-number">2</span>), rng.randint(<span class="hljs-number">-2</span>, <span class="hljs-number">2</span>)
        frame = bg.copy()
        cv2.rectangle(frame, (x + jitter_x, y + jitter_y),
                      (x + jitter_x + box_size, y + jitter_y + box_size),
                      color, thickness=<span class="hljs-number">-1</span>)
        <span class="hljs-comment"># overlay text</span>
        cv2.putText(frame, mode, (<span class="hljs-number">8</span>, <span class="hljs-number">24</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">0.7</span>, (<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>), <span class="hljs-number">2</span>, cv2.LINE_AA)
        cv2.putText(frame, mode, (<span class="hljs-number">8</span>, <span class="hljs-number">24</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">0.7</span>, (<span class="hljs-number">255</span>, <span class="hljs-number">255</span>, <span class="hljs-number">255</span>), <span class="hljs-number">1</span>, cv2.LINE_AA)
        vw.write(frame)

    vw.release()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_labels</span>(<span class="hljs-params">labels, out_dir</span>):</span>
    <span class="hljs-keyword">with</span> open(os.path.join(out_dir, <span class="hljs-string">"labels.txt"</span>), <span class="hljs-string">"w"</span>, encoding=<span class="hljs-string">"utf-8"</span>) <span class="hljs-keyword">as</span> f:
        <span class="hljs-keyword">for</span> c <span class="hljs-keyword">in</span> labels:
            f.write(c + <span class="hljs-string">"\n"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    ap = argparse.ArgumentParser(description=<span class="hljs-string">"Generate a tiny synthetic gesture dataset."</span>)
    ap.add_argument(<span class="hljs-string">"--out"</span>, default=<span class="hljs-string">"data"</span>, help=<span class="hljs-string">"Output directory (default: data)"</span>)
    ap.add_argument(<span class="hljs-string">"--classes"</span>, nargs=<span class="hljs-string">"+"</span>,
                    default=[<span class="hljs-string">"swipe_left"</span>, <span class="hljs-string">"swipe_right"</span>, <span class="hljs-string">"stop"</span>],
                    help=<span class="hljs-string">"Class names (default: swipe_left swipe_right stop)"</span>)
    ap.add_argument(<span class="hljs-string">"--clips"</span>, type=int, default=<span class="hljs-number">16</span>, help=<span class="hljs-string">"Clips per class (default: 16)"</span>)
    ap.add_argument(<span class="hljs-string">"--seconds"</span>, type=float, default=<span class="hljs-number">1.5</span>, help=<span class="hljs-string">"Seconds per clip (default: 1.5)"</span>)
    ap.add_argument(<span class="hljs-string">"--fps"</span>, type=int, default=<span class="hljs-number">16</span>, help=<span class="hljs-string">"Frames per second (default: 16)"</span>)
    ap.add_argument(<span class="hljs-string">"--size"</span>, type=int, default=<span class="hljs-number">224</span>, help=<span class="hljs-string">"Frame size WxH (default: 224)"</span>)
    ap.add_argument(<span class="hljs-string">"--box"</span>, type=int, default=<span class="hljs-number">60</span>, help=<span class="hljs-string">"Box size (default: 60)"</span>)
    ap.add_argument(<span class="hljs-string">"--codec"</span>, default=<span class="hljs-string">"mp4v"</span>, help=<span class="hljs-string">"Codec fourcc (mp4v or XVID)"</span>)
    ap.add_argument(<span class="hljs-string">"--ext"</span>, default=<span class="hljs-string">".mp4"</span>, help=<span class="hljs-string">"File extension (.mp4 or .avi)"</span>)
    args = ap.parse_args()

    ensure_dir(args.out)
    write_labels(args.classes, <span class="hljs-string">"."</span>)  <span class="hljs-comment"># writes labels.txt to project root</span>

    print(<span class="hljs-string">f"Generating synthetic dataset -&gt; <span class="hljs-subst">{args.out}</span>"</span>)
    <span class="hljs-keyword">for</span> cls <span class="hljs-keyword">in</span> args.classes:
        cls_dir = os.path.join(args.out, cls)
        ensure_dir(cls_dir)
        mode = <span class="hljs-string">"stop"</span> <span class="hljs-keyword">if</span> cls == <span class="hljs-string">"stop"</span> <span class="hljs-keyword">else</span> (<span class="hljs-string">"swipe_left"</span> <span class="hljs-keyword">if</span> <span class="hljs-string">"left"</span> <span class="hljs-keyword">in</span> cls <span class="hljs-keyword">else</span> (<span class="hljs-string">"swipe_right"</span> <span class="hljs-keyword">if</span> <span class="hljs-string">"right"</span> <span class="hljs-keyword">in</span> cls <span class="hljs-keyword">else</span> <span class="hljs-string">"stop"</span>))
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(args.clips):
            filename = os.path.join(cls_dir, <span class="hljs-string">f"<span class="hljs-subst">{cls}</span>_<span class="hljs-subst">{i+<span class="hljs-number">1</span>:<span class="hljs-number">03</span>d}</span><span class="hljs-subst">{args.ext}</span>"</span>)
            make_clip(
                mode=mode,
                out_path=filename,
                seconds=args.seconds,
                fps=args.fps,
                size=args.size,
                box_size=args.box,
                seed=i + <span class="hljs-number">1</span>,
                codec=args.codec
            )
        print(<span class="hljs-string">f"  <span class="hljs-subst">{cls}</span>: <span class="hljs-subst">{args.clips}</span> clips"</span>)

    print(<span class="hljs-string">"Done. You can now run: python train.py, python export_onnx.py, python app.py"</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<p>The script generates a synthetic gesture dataset by creating video clips of a moving or stationary coloured box, simulating gestures like "swipe left," "swipe right," and "stop," and saves them in a specified output directory.</p>
<p>Now run it inside your virtual environment:</p>
<pre><code class="lang-bash">python generate_synthetic_gestures.py --out data --clips 16 --seconds 1.5
</code></pre>
<p>The command above runs a Python script named <code>generate_synthetic_gestures.py</code>, which generates a synthetic gesture dataset with 16 clips per gesture, each lasting 1.5 seconds, and saves the output in a directory named "data".</p>
<p>This creates a dataset like:</p>
<pre><code class="lang-plaintext">data/
  swipe_left/*.mp4
  swipe_right/*.mp4
  stop/*.mp4
labels.txt
</code></pre>
<p>Each folder contains short clips of a moving (or still) box that simulate gestures. This is perfect for testing the pipeline.</p>
<h3 id="heading-training-script-trainpy">Training Script: <code>train.py</code></h3>
<p>Now that we have our dataset, let’s fine-tune a Vision Transformer with temporal pooling. This model applies ViT frame-by-frame, averages embeddings across time, and trains a classification head on your gestures.</p>
<p>Here’s the full training script:</p>
<pre><code class="lang-python"><span class="hljs-comment"># train.py</span>
<span class="hljs-keyword">import</span> torch, torch.nn <span class="hljs-keyword">as</span> nn, torch.optim <span class="hljs-keyword">as</span> optim
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader
<span class="hljs-keyword">import</span> timm
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> GestureClips, read_labels

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ViTTemporal</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-string">"""Frame-wise ViT encoder -&gt; mean pool over time -&gt; linear head."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, num_classes, vit_name=<span class="hljs-string">"vit_tiny_patch16_224"</span></span>):</span>
        super().__init__()
        self.vit = timm.create_model(vit_name, pretrained=<span class="hljs-literal">True</span>, num_classes=<span class="hljs-number">0</span>, global_pool=<span class="hljs-string">"avg"</span>)
        feat_dim = self.vit.num_features
        self.head = nn.Linear(feat_dim, num_classes)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>  <span class="hljs-comment"># x: (B,T,C,H,W)</span>
        B, T, C, H, W = x.shape
        x = x.view(B * T, C, H, W)
        feats = self.vit(x)                  <span class="hljs-comment"># (B*T, D)</span>
        feats = feats.view(B, T, <span class="hljs-number">-1</span>).mean(dim=<span class="hljs-number">1</span>)  <span class="hljs-comment"># (B, D)</span>
        <span class="hljs-keyword">return</span> self.head(feats)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span>():</span>
    device = <span class="hljs-string">"cuda"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span>
    labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
    n_classes = len(labels)

    train_ds = GestureClips(train=<span class="hljs-literal">True</span>)
    val_ds   = GestureClips(train=<span class="hljs-literal">False</span>)
    print(<span class="hljs-string">f"Train clips: <span class="hljs-subst">{len(train_ds)}</span> | Val clips: <span class="hljs-subst">{len(val_ds)}</span>"</span>)

    <span class="hljs-comment"># Windows/CPU friendly</span>
    train_dl = DataLoader(train_ds, batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">True</span>,  num_workers=<span class="hljs-number">0</span>, pin_memory=<span class="hljs-literal">False</span>)
    val_dl   = DataLoader(val_ds,   batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">False</span>, num_workers=<span class="hljs-number">0</span>, pin_memory=<span class="hljs-literal">False</span>)

    model = ViTTemporal(num_classes=n_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=<span class="hljs-number">3e-4</span>, weight_decay=<span class="hljs-number">0.05</span>)

    best_acc = <span class="hljs-number">0.0</span>
    epochs = <span class="hljs-number">5</span>
    <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, epochs + <span class="hljs-number">1</span>):
        <span class="hljs-comment"># ---- Train ----</span>
        model.train()
        total, correct, loss_sum = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0.0</span>
        <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> train_dl:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()

            loss_sum += loss.item() * x.size(<span class="hljs-number">0</span>)
            correct += (logits.argmax(<span class="hljs-number">1</span>) == y).sum().item()
            total += x.size(<span class="hljs-number">0</span>)

        train_acc = correct / total <span class="hljs-keyword">if</span> total <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>
        train_loss = loss_sum / total <span class="hljs-keyword">if</span> total <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>

        <span class="hljs-comment"># ---- Validate ----</span>
        model.eval()
        vtotal, vcorrect = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>
        <span class="hljs-keyword">with</span> torch.no_grad():
            <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> val_dl:
                x, y = x.to(device), y.to(device)
                vcorrect += (model(x).argmax(<span class="hljs-number">1</span>) == y).sum().item()
                vtotal += x.size(<span class="hljs-number">0</span>)
        val_acc = vcorrect / vtotal <span class="hljs-keyword">if</span> vtotal <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>

        print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch:<span class="hljs-number">02</span>d}</span> | train_loss <span class="hljs-subst">{train_loss:<span class="hljs-number">.4</span>f}</span> "</span>
              <span class="hljs-string">f"| train_acc <span class="hljs-subst">{train_acc:<span class="hljs-number">.3</span>f}</span> | val_acc <span class="hljs-subst">{val_acc:<span class="hljs-number">.3</span>f}</span>"</span>)

        <span class="hljs-keyword">if</span> val_acc &gt; best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), <span class="hljs-string">"vit_temporal_best.pt"</span>)

    print(<span class="hljs-string">"Best val acc:"</span>, best_acc)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    train()
</code></pre>
<p>Running the command <code>python train.py</code> initiates the training process for your gesture recognition model. Here's a breakdown of what happens:</p>
<ol>
<li><p><strong>Load your dataset from data/</strong>: The script will access and load the gesture dataset stored in the "data" directory. This dataset is used to train the model.</p>
</li>
<li><p><strong>Fine-tune a pre-trained Vision Transformer</strong>: The training script will take a Vision Transformer model that has been pre-trained on a larger dataset and fine-tune it using your specific gesture dataset. Fine-tuning helps the model adapt to the nuances of your data, improving its performance on the specific task of gesture recognition.</p>
</li>
<li><p><strong>Save the best checkpoint as vit_temporal_best.pt</strong>: During training, the script will evaluate the model's performance on a validation set. The best-performing version of the model (based on some metric like accuracy) will be saved as a checkpoint file named "vit_temporal_best.pt". This file can later be used for inference or further training.</p>
</li>
</ol>
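<p>One thing to note: <code>train.py</code> imports <code>GestureClips</code> and <code>read_labels</code> from a <code>dataset</code> module. Here’s a minimal sketch of what that module could look like for the synthetic <code>data/</code> layout above; the project’s actual <code>dataset.py</code> may differ in details such as the train/val split:</p>
<pre><code class="lang-python"># dataset.py (sketch): loads clips from data/&lt;class&gt;/* and labels.txt
import os, glob, cv2, numpy as np, torch
from torch.utils.data import Dataset

def read_labels(path):
    with open(path, encoding="utf-8") as f:
        labels = [line.strip() for line in f if line.strip()]
    return labels, {c: i for i, c in enumerate(labels)}

class GestureClips(Dataset):
    def __init__(self, root="data", T=16, size=224, train=True, split=0.8):
        labels, self.label_to_idx = read_labels("labels.txt")
        self.items = []
        for cls in labels:
            clips = sorted(glob.glob(os.path.join(root, cls, "*")))
            cut = int(len(clips) * split)
            for p in (clips[:cut] if train else clips[cut:]):
                self.items.append((p, self.label_to_idx[cls]))
        self.T, self.size = T, size

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        path, y = self.items[i]
        cap = cv2.VideoCapture(path)
        frames = []
        while True:
            ok, bgr = cap.read()
            if not ok:
                break
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            frames.append(cv2.resize(rgb, (self.size, self.size)))
        cap.release()
        # Sample T frames uniformly, normalize to [-1, 1], reorder to (T, C, H, W)
        idxs = np.linspace(0, len(frames) - 1, self.T).astype(int)
        clip = np.stack([frames[j] for j in idxs]).astype(np.float32) / 255.0
        clip = (clip - 0.5) / 0.5
        clip = np.transpose(clip, (0, 3, 1, 2))
        return torch.from_numpy(clip), y
</code></pre>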
<h4 id="heading-what-training-looks-like">What Training Looks Like</h4>
<p>You should see logs similar to this:</p>
<pre><code class="lang-plaintext">Train clips: 38 | Val clips: 10
Epoch 01 | train_loss 1.4508 | train_acc 0.395 | val_acc 0.200
Epoch 02 | train_loss 1.2466 | train_acc 0.263 | val_acc 0.200
Epoch 03 | train_loss 1.1361 | train_acc 0.368 | val_acc 0.200
Best val acc: 0.200
</code></pre>
<p>Don’t worry if your accuracy is low at first; with the synthetic dataset, that’s normal. The key is proving that the Transformer pipeline works. You can boost results later by:</p>
<ul>
<li><p>Adding more clips per class</p>
</li>
<li><p>Training for more epochs</p>
</li>
<li><p>Switching to real recorded gestures</p>
</li>
</ul>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/training-logs.png?raw=true" alt="Training logs" width="600" height="400" loading="lazy"></p>
<p>Figure 1. Example training logs from <code>train.py</code>, where the Vision Transformer with temporal pooling is fine-tuned on a tiny synthetic dataset.</p>
<h3 id="heading-export-the-model-to-onnx">Export the Model to ONNX</h3>
<p>To make our model easier to run in real time (and lighter on CPU), we’ll export it to the ONNX format.</p>
<p><strong>Note:</strong> ONNX, which stands for Open Neural Network Exchange, is an open-source format designed to facilitate the interchange of deep learning models between different frameworks. It lets you train a model in one framework, such as PyTorch or TensorFlow, and then deploy it in another, like Caffe2 or MXNet, without needing to completely rewrite the model. This interoperability is achieved by providing a standardized representation of the model's architecture and parameters.</p>
<p>ONNX supports a wide range of operators and is continually updated to include new features, making it a versatile choice for deploying machine learning models across various platforms and devices.</p>
<p>Create a file called <code>export_onnx.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> train <span class="hljs-keyword">import</span> ViTTemporal
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
n_classes = len(labels)

<span class="hljs-comment"># Load trained model</span>
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load(<span class="hljs-string">"vit_temporal_best.pt"</span>, map_location=<span class="hljs-string">"cpu"</span>))
model.eval()

<span class="hljs-comment"># Dummy input: batch=1, 16 frames, 3x224x224</span>
dummy = torch.randn(<span class="hljs-number">1</span>, <span class="hljs-number">16</span>, <span class="hljs-number">3</span>, <span class="hljs-number">224</span>, <span class="hljs-number">224</span>)

<span class="hljs-comment"># Export</span>
torch.onnx.export(
    model, dummy, <span class="hljs-string">"vit_temporal.onnx"</span>,
    input_names=[<span class="hljs-string">"video"</span>], output_names=[<span class="hljs-string">"logits"</span>],
    dynamic_axes={<span class="hljs-string">"video"</span>: {<span class="hljs-number">0</span>: <span class="hljs-string">"batch"</span>}},
    opset_version=<span class="hljs-number">13</span>
)

print(<span class="hljs-string">"Exported vit_temporal.onnx"</span>)
</code></pre>
<p>Run it with <code>python export_onnx.py</code>.</p>
<p>This generates a file <code>vit_temporal.onnx</code> in your project folder. ONNX lets us use onnxruntime, which is much faster for inference.</p>
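<p>A quick sanity check that the export worked: run one dummy clip through <code>onnxruntime</code>. The input and output names match the ones used in the export script:</p>
<pre><code class="lang-python"># Sanity check: run the exported model once with onnxruntime.
import numpy as np
import onnxruntime

sess = onnxruntime.InferenceSession("vit_temporal.onnx", providers=["CPUExecutionProvider"])
dummy = np.random.randn(1, 16, 3, 224, 224).astype(np.float32)
logits = sess.run(["logits"], {"video": dummy})[0]
print(logits.shape)  # (1, num_classes)
</code></pre>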
<p>Create a file called <code>app.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os, tempfile, cv2, torch, onnxruntime, numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> gradio <span class="hljs-keyword">as</span> gr
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

T = <span class="hljs-number">16</span>
SIZE = <span class="hljs-number">224</span>
MODEL_PATH = <span class="hljs-string">"vit_temporal.onnx"</span>

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)

<span class="hljs-comment"># --- ONNX session + auto-detect names ---</span>
ort_session = onnxruntime.InferenceSession(MODEL_PATH, providers=[<span class="hljs-string">"CPUExecutionProvider"</span>])
<span class="hljs-comment"># detect first input and first output names to avoid mismatches</span>
INPUT_NAME = ort_session.get_inputs()[<span class="hljs-number">0</span>].name   <span class="hljs-comment"># e.g. "input" or "video"</span>
OUTPUT_NAME = ort_session.get_outputs()[<span class="hljs-number">0</span>].name <span class="hljs-comment"># e.g. "logits" or something else</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess_clip</span>(<span class="hljs-params">frames_rgb</span>):</span>
    <span class="hljs-keyword">if</span> len(frames_rgb) == <span class="hljs-number">0</span>:
        frames_rgb = [np.zeros((SIZE, SIZE, <span class="hljs-number">3</span>), dtype=np.uint8)]
    <span class="hljs-keyword">if</span> len(frames_rgb) &lt; T:
        frames_rgb = frames_rgb + [frames_rgb[<span class="hljs-number">-1</span>]] * (T - len(frames_rgb))
    frames_rgb = frames_rgb[:T]
    clip = [cv2.resize(f, (SIZE, SIZE), interpolation=cv2.INTER_AREA) <span class="hljs-keyword">for</span> f <span class="hljs-keyword">in</span> frames_rgb]
    clip = np.stack(clip, axis=<span class="hljs-number">0</span>)                                    <span class="hljs-comment"># (T,H,W,3)</span>
    clip = np.transpose(clip, (<span class="hljs-number">0</span>, <span class="hljs-number">3</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>)).astype(np.float32) / <span class="hljs-number">255</span> <span class="hljs-comment"># (T,3,H,W)</span>
    clip = (clip - <span class="hljs-number">0.5</span>) / <span class="hljs-number">0.5</span>
    clip = np.expand_dims(clip, <span class="hljs-number">0</span>)                                   <span class="hljs-comment"># (1,T,3,H,W)</span>
    <span class="hljs-keyword">return</span> clip

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_extract_path_from_gradio_video</span>(<span class="hljs-params">inp</span>):</span>
    <span class="hljs-keyword">if</span> isinstance(inp, str) <span class="hljs-keyword">and</span> os.path.exists(inp):
        <span class="hljs-keyword">return</span> inp
    <span class="hljs-keyword">if</span> isinstance(inp, dict):
        <span class="hljs-keyword">for</span> key <span class="hljs-keyword">in</span> (<span class="hljs-string">"video"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"path"</span>, <span class="hljs-string">"filepath"</span>):
            v = inp.get(key)
            <span class="hljs-keyword">if</span> isinstance(v, str) <span class="hljs-keyword">and</span> os.path.exists(v):
                <span class="hljs-keyword">return</span> v
        <span class="hljs-keyword">for</span> key <span class="hljs-keyword">in</span> (<span class="hljs-string">"data"</span>, <span class="hljs-string">"video"</span>):
            v = inp.get(key)
            <span class="hljs-keyword">if</span> isinstance(v, (bytes, bytearray)):
                tmp = tempfile.NamedTemporaryFile(delete=<span class="hljs-literal">False</span>, suffix=<span class="hljs-string">".mp4"</span>)
                tmp.write(v); tmp.flush(); tmp.close()
                <span class="hljs-keyword">return</span> tmp.name
    <span class="hljs-keyword">if</span> isinstance(inp, (list, tuple)) <span class="hljs-keyword">and</span> inp <span class="hljs-keyword">and</span> isinstance(inp[<span class="hljs-number">0</span>], str) <span class="hljs-keyword">and</span> os.path.exists(inp[<span class="hljs-number">0</span>]):
        <span class="hljs-keyword">return</span> inp[<span class="hljs-number">0</span>]
    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_read_uniform_frames</span>(<span class="hljs-params">video_path</span>):</span>
    cap = cv2.VideoCapture(video_path)
    frames = []
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) <span class="hljs-keyword">or</span> <span class="hljs-number">1</span>
    idxs = np.linspace(<span class="hljs-number">0</span>, total - <span class="hljs-number">1</span>, max(T, <span class="hljs-number">1</span>)).astype(int)
    want = set(int(i) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> idxs.tolist())
    j = <span class="hljs-number">0</span>
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        ok, bgr = cap.read()
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok: <span class="hljs-keyword">break</span>
        <span class="hljs-keyword">if</span> j <span class="hljs-keyword">in</span> want:
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            frames.append(rgb)
        j += <span class="hljs-number">1</span>
    cap.release()
    <span class="hljs-keyword">return</span> frames

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_from_video</span>(<span class="hljs-params">gradio_video</span>):</span>
    video_path = _extract_path_from_gradio_video(gradio_video)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> video_path <span class="hljs-keyword">or</span> <span class="hljs-keyword">not</span> os.path.exists(video_path):
        <span class="hljs-keyword">return</span> {}
    frames = _read_uniform_frames(video_path)

    <span class="hljs-comment"># If OpenCV choked on the codec (common with recorded webm), re-encode once:</span>
    <span class="hljs-keyword">if</span> len(frames) == <span class="hljs-number">0</span>:
        tmp = tempfile.NamedTemporaryFile(delete=<span class="hljs-literal">False</span>, suffix=<span class="hljs-string">".mp4"</span>); tmp_name = tmp.name; tmp.close()
        cap = cv2.VideoCapture(video_path)
        fourcc = cv2.VideoWriter_fourcc(*<span class="hljs-string">"mp4v"</span>)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) <span class="hljs-keyword">or</span> <span class="hljs-number">640</span>
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) <span class="hljs-keyword">or</span> <span class="hljs-number">480</span>
        out = cv2.VideoWriter(tmp_name, fourcc, <span class="hljs-number">20.0</span>, (w, h))
        <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
            ok, frame = cap.read()
            <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok: <span class="hljs-keyword">break</span>
            out.write(frame)
        cap.release(); out.release()
        frames = _read_uniform_frames(tmp_name)

    clip = preprocess_clip(frames)
    <span class="hljs-comment"># &gt;&gt;&gt; use the detected ONNX input/output names &lt;&lt;&lt;</span>
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[<span class="hljs-number">0</span>]  <span class="hljs-comment"># (1, C)</span>
    probs = torch.softmax(torch.from_numpy(logits), dim=<span class="hljs-number">1</span>)[<span class="hljs-number">0</span>].numpy().tolist()
    <span class="hljs-keyword">return</span> {labels[i]: float(probs[i]) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(labels))}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_from_image</span>(<span class="hljs-params">image</span>):</span>
    <span class="hljs-keyword">if</span> image <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">return</span> {}
    clip = preprocess_clip([image] * T)
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[<span class="hljs-number">0</span>]
    probs = torch.softmax(torch.from_numpy(logits), dim=<span class="hljs-number">1</span>)[<span class="hljs-number">0</span>].numpy().tolist()
    <span class="hljs-keyword">return</span> {labels[i]: float(probs[i]) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(labels))}

<span class="hljs-keyword">with</span> gr.Blocks() <span class="hljs-keyword">as</span> demo:
    gr.Markdown(<span class="hljs-string">"# Gesture Classifier (ONNX)\nRecord or upload a short video, then click **Classify Video**."</span>)
    <span class="hljs-keyword">with</span> gr.Tab(<span class="hljs-string">"Video (record or upload)"</span>):
        vid_in = gr.Video(label=<span class="hljs-string">"Record from webcam or upload a short clip"</span>)
        vid_out = gr.Label(num_top_classes=<span class="hljs-number">3</span>, label=<span class="hljs-string">"Prediction"</span>)
        gr.Button(<span class="hljs-string">"Classify Video"</span>).click(fn=predict_from_video, inputs=vid_in, outputs=vid_out)
    <span class="hljs-keyword">with</span> gr.Tab(<span class="hljs-string">"Single Image (fallback)"</span>):
        img_in = gr.Image(label=<span class="hljs-string">"Upload an image frame"</span>, type=<span class="hljs-string">"numpy"</span>)
        img_out = gr.Label(num_top_classes=<span class="hljs-number">3</span>, label=<span class="hljs-string">"Prediction"</span>)
        gr.Button(<span class="hljs-string">"Classify Image"</span>).click(fn=predict_from_image, inputs=img_in, outputs=img_out)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    demo.launch()
</code></pre>
<p>Running the command <code>python app.py</code> launches a Gradio application in your web browser. Here's what happens:</p>
<ol>
<li><p><strong>Record or upload a clip</strong>: The app lets you record a short clip from your webcam or upload an existing video file, so you can perform gestures in front of the camera.</p>
</li>
<li><p><strong>Classification on demand</strong>: When you click <strong>Classify Video</strong>, the app samples 16 frames uniformly from the clip and runs them through the ONNX model.</p>
</li>
<li><p><strong>Top 3 gesture classes displayed with probabilities</strong>: The application displays the three most likely gesture classes along with their probabilities, giving you an idea of the model's confidence in its predictions.</p>
</li>
</ol>
<p>When you open the app in your browser, you'll find two tabs. In the <strong>Video tab</strong>, you can click <em>Record from webcam</em> to capture a short clip of your gesture, typically lasting 2–4 seconds. After recording, click <strong>Classify Video</strong>. The app samples frames from the clip, runs them through the exported Transformer model, and displays the predicted gesture probabilities. This setup makes it easy to test and demonstrate the gesture recognition system interactively.</p>
<p>Here’s an example where I raised my hand for a <strong>stop</strong> gesture, and the model predicts “stop” as the top class:</p>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/realtime-demo.png?raw=true" alt="Gradio demo output" width="600" height="400" loading="lazy"></p>
<p>Figure 2. The Gradio app running locally. After recording a short clip, the Transformer model predicts the gesture with class probabilities.</p>
<h3 id="heading-evaluate-accuracy-latency">Evaluate Accuracy + Latency</h3>
<p>Now that the model runs in a demo app, let’s check how well it performs. There are two sides to this:</p>
<ul>
<li><p><strong>Accuracy</strong>: does the model predict the right gesture class?</p>
</li>
<li><p><strong>Latency</strong>: how fast does it respond, especially on CPU vs GPU?</p>
</li>
</ul>
<h4 id="heading-1-quick-accuracy-check">1. Quick Accuracy Check</h4>
<p>Save this as <code>eval.py</code> in the same folder as your other scripts:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> GestureClips, read_labels
<span class="hljs-keyword">from</span> train <span class="hljs-keyword">import</span> ViTTemporal

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
n_classes = len(labels)

<span class="hljs-comment"># Load validation data</span>
val_ds = GestureClips(train=<span class="hljs-literal">False</span>)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># Load trained model</span>
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load(<span class="hljs-string">"vit_temporal_best.pt"</span>, map_location=<span class="hljs-string">"cpu"</span>))
model.eval()

correct, total = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>
all_preds, all_labels = [], []

<span class="hljs-keyword">with</span> torch.no_grad():
    <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> val_dl:
        logits = model(x)
        preds = logits.argmax(dim=<span class="hljs-number">1</span>)
        correct += (preds == y).sum().item()
        total += y.size(<span class="hljs-number">0</span>)
        all_preds.extend(preds.tolist())
        all_labels.extend(y.tolist())

print(<span class="hljs-string">f"Validation accuracy: <span class="hljs-subst">{correct/total:<span class="hljs-number">.2</span>%}</span>"</span>)
</code></pre>
<h4 id="heading-2-confusion-matrix">2. Confusion Matrix</h4>
<p>Let’s also visualize which gestures get confused with each other. Add this snippet at the bottom of <code>eval.py</code> (it needs <code>matplotlib</code>, <code>seaborn</code>, and <code>scikit-learn</code> installed):</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> confusion_matrix

cm = confusion_matrix(all_labels, all_preds)

plt.figure(figsize=(<span class="hljs-number">6</span>,<span class="hljs-number">6</span>))
sns.heatmap(cm, annot=<span class="hljs-literal">True</span>, fmt=<span class="hljs-string">"d"</span>, xticklabels=labels, yticklabels=labels, cmap=<span class="hljs-string">"Blues"</span>)
plt.xlabel(<span class="hljs-string">"Predicted"</span>)
plt.ylabel(<span class="hljs-string">"True"</span>)
plt.title(<span class="hljs-string">"Confusion Matrix"</span>)
plt.tight_layout()
plt.show()
</code></pre>
<p>When you run <code>python eval.py</code>, a heatmap like this will pop up:</p>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/confusion-matrix.png?raw=true" alt="Confusion matrix" width="600" height="400" loading="lazy"></p>
<p>Figure 3. Confusion matrix on the validation set. Correct predictions appear along the diagonal. Off-diagonal counts show gesture confusions.</p>
<h4 id="heading-3-latency-benchmark">3. Latency Benchmark</h4>
<p>Finally, let’s see how fast inference runs. Save the following as <code>benchmark.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time, numpy <span class="hljs-keyword">as</span> np, onnxruntime
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)

ort = onnxruntime.InferenceSession(<span class="hljs-string">"vit_temporal.onnx"</span>, providers=[<span class="hljs-string">"CPUExecutionProvider"</span>])
INPUT_NAME = ort.get_inputs()[<span class="hljs-number">0</span>].name
OUTPUT_NAME = ort.get_outputs()[<span class="hljs-number">0</span>].name

dummy = np.random.randn(<span class="hljs-number">1</span>, <span class="hljs-number">16</span>, <span class="hljs-number">3</span>, <span class="hljs-number">224</span>, <span class="hljs-number">224</span>).astype(np.float32)

<span class="hljs-comment"># Warmup</span>
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(<span class="hljs-number">3</span>):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})

<span class="hljs-comment"># Benchmark</span>
t0 = time.time()
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(<span class="hljs-number">50</span>):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})
t1 = time.time()

print(<span class="hljs-string">f"Average latency: <span class="hljs-subst">{(t1 - t0)/<span class="hljs-number">50</span>:<span class="hljs-number">.3</span>f}</span> seconds per clip"</span>)
</code></pre>
<p>Run: <code>python benchmark.py</code></p>
<p>On CPU, you might see roughly 0.05–0.15 s per clip; with a GPU build of onnxruntime (the <code>CUDAExecutionProvider</code>) it’s much faster.</p>
<p><strong>Note</strong>: If latency is high, you can enable <strong>quantization</strong> in ONNX to shrink the model and speed up inference.</p>
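<p>If you want to try that, onnxruntime ships a dynamic quantization utility that converts the weights to INT8. A minimal sketch (the output filename <code>vit_temporal_int8.onnx</code> is just my choice):</p>
<pre><code class="lang-python"># quantize_model.py -- a minimal sketch: dynamic INT8 weight quantization
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    "vit_temporal.onnx",       # model exported earlier
    "vit_temporal_int8.onnx",  # quantized output (name is arbitrary)
    weight_type=QuantType.QInt8,
)
print("Wrote vit_temporal_int8.onnx")
</code></pre>
<p>Point <code>MODEL_PATH</code> in <code>app.py</code> at the quantized file and re-run <code>benchmark.py</code> to compare; expect a small accuracy drop in exchange for the speedup.</p>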
<h2 id="heading-option-2-use-small-samples-from-public-gesture-datasets">Option 2: Use Small Samples from Public Gesture Datasets</h2>
<p>If you’d prefer to see your model trained on <em>real</em> gesture clips instead of synthetic moving boxes, you can grab a handful of videos from open datasets. You don’t need to download the entire dataset (which can be several GB); just a few <code>.mp4</code> samples are enough to follow along.</p>
<h3 id="heading-recommended-sources">Recommended sources</h3>
<ul>
<li><p><strong>20BN Jester Dataset</strong>: Contains short clips of hand gestures like swiping, clapping, and pointing.</p>
</li>
<li><p><strong>WLASL</strong>: A large-scale dataset of isolated sign language words.</p>
</li>
</ul>
<p>Both projects provide small <code>.mp4</code> videos you can use as realistic training examples. I’ve linked them below.</p>
<h3 id="heading-setting-up-your-dataset-folder">Setting up your dataset folder</h3>
<p>Once you download a few clips, place them in the <code>data/</code> folder under subfolders named after each gesture class. For example:</p>
<pre><code class="lang-plaintext">data/
├── swipe_left/
│   ├── clip1.mp4
│   └── clip2.mp4
├── swipe_right/
│   ├── clip1.mp4
│   └── clip2.mp4
└── stop/
    ├── clip1.mp4
    └── clip2.mp4
</code></pre>
<p>And update <code>labels.txt</code> to match the folder names:</p>
<pre><code class="lang-plaintext">swipe_left
swipe_right
stop
</code></pre>
<p>Now your dataset is ready, and the same training scripts from earlier (<code>train.py</code>, <code>eval.py</code>) will work without modification.</p>
<h3 id="heading-why-choose-this-option">Why choose this option?</h3>
<ul>
<li><p>Gives more realistic results than synthetic coloured boxes</p>
</li>
<li><p>Lets you see how the model handles <em>actual human hand movements</em></p>
</li>
<li><p>The only trade-off is a bit more effort (downloading clips and trimming them if needed)</p>
</li>
</ul>
<p><strong>Tip:</strong> If downloading from these datasets feels too heavy, you can also record your own short gestures using your laptop webcam. Just save them as <code>.mp4</code> files and organize them in the same folder structure.</p>
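<p>Here’s a minimal OpenCV sketch for recording your own clips (the script name <code>record_clip.py</code> and its flags are my own choice, not part of the repo):</p>
<pre><code class="lang-python"># record_clip.py -- a minimal sketch: record a short webcam clip
# into a data/ subfolder named after the gesture label
import argparse
import os
import time

import cv2

parser = argparse.ArgumentParser()
parser.add_argument("--label", required=True, help="gesture class, e.g. stop")
parser.add_argument("--seconds", type=float, default=3.0)
args = parser.parse_args()

out_dir = os.path.join("data", args.label)
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, f"clip{len(os.listdir(out_dir)) + 1}.mp4")

cap = cv2.VideoCapture(0)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) or 640
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) or 480
writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), 20.0, (w, h))

t0 = time.time()
while time.time() - t0 &lt; args.seconds:
    ok, frame = cap.read()
    if not ok:
        break
    writer.write(frame)
    cv2.imshow("Recording (press q to stop early)", frame)
    if cv2.waitKey(1) &amp; 0xFF == ord("q"):
        break

cap.release()
writer.release()
cv2.destroyAllWindows()
print(f"Saved {out_path}")
</code></pre>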
<h2 id="heading-accessibility-notes-amp-ethical-limits">Accessibility Notes &amp; Ethical Limits</h2>
<p>While this project shows the technical workflow for gesture recognition with Transformers, it’s important to step back and consider the <strong>human context</strong>:</p>
<ul>
<li><p><strong>Accessibility first</strong>: Tools like this can help students with speech or motor difficulties, but they should always be co-designed with the people who will use them. Don’t assume one-size-fits-all.</p>
</li>
<li><p><strong>Dataset sensitivity</strong>: Using publicly available sign or gesture datasets is fine for prototyping, but deploying such a system requires careful consideration of consent and representation.</p>
</li>
<li><p><strong>Error tolerance</strong>: Even small misclassifications can have big consequences in accessibility contexts (for example, confusing <em>stop</em> with <em>go</em>). Always plan for fallback options (like manual input or confirmation).</p>
</li>
<li><p><strong>Bias and inclusivity</strong>: Models trained on narrow datasets may fail for different skin tones, lighting conditions, or cultural gesture variations. Broad and diverse training data is essential for fairness.</p>
</li>
</ul>
<p>In other words: this demo is a <strong>teaching scaffold</strong>, not a production-ready accessibility tool. Responsible deployment requires collaboration with educators, therapists, and end users.</p>
<h2 id="heading-next-steps">Next Steps</h2>
<p>If you’d like to push this project further, here are some directions to explore:</p>
<ul>
<li><p><strong>Better models</strong>: Try video-focused Transformers like <a target="_blank" href="https://arxiv.org/abs/2102.05095">TimeSformer</a> or <a target="_blank" href="https://arxiv.org/abs/2203.12602">VideoMAE</a> for stronger temporal reasoning.</p>
</li>
<li><p><strong>Larger vocabularies</strong>: Add more gesture classes, build your own dataset, or use portions of public datasets like <a target="_blank" href="https://www.kaggle.com/datasets/toxicmender/20bn-jester">20BN Jester</a> or <a target="_blank" href="https://www.kaggle.com/datasets/risangbaskoro/wlasl-processed">WLASL.</a></p>
</li>
<li><p><strong>Pose fusion</strong>: Combine gesture video with human pose keypoints from <a target="_blank" href="https://mediapipe.readthedocs.io/en/latest/solutions/hands.html">MediaPipe</a> or <a target="_blank" href="https://github.com/CMU-Perceptual-Computing-Lab/openpose">OpenPose</a> for more robust predictions.</p>
</li>
<li><p><strong>Real-time smoothing</strong>: Implement temporal smoothing or debounce logic in the app so predictions are more stable during live use (see the sketch after this list).</p>
</li>
<li><p><strong>Quantization + edge devices</strong>: Convert your ONNX model to an INT8 quantized version and deploy it on a Raspberry Pi or Jetson Nano for classroom-ready prototypes.</p>
</li>
</ul>
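<p>For the smoothing idea, here’s a minimal sketch (not from the repo): keep a short history of predictions and only report a label once it wins a majority vote.</p>
<pre><code class="lang-python"># A minimal prediction smoother: majority vote over the last few predictions
from collections import Counter, deque

class PredictionSmoother:
    def __init__(self, window=5, min_votes=3):
        self.history = deque(maxlen=window)
        self.min_votes = min_votes

    def update(self, label):
        """Add the newest prediction; return the stable label, or None."""
        self.history.append(label)
        top_label, votes = Counter(self.history).most_common(1)[0]
        return top_label if votes &gt;= self.min_votes else None

# Usage: smoother = PredictionSmoother()
# stable = smoother.update(top1_label)  # only act when stable is not None
</code></pre>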
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you built a gesture recognition system with a Transformer model end to end: you prepared a small dataset, trained a Vision Transformer with temporal pooling, exported the model to ONNX for efficient inference, and deployed it in an interactive Gradio app. You also measured accuracy and latency, the two numbers that matter most for a responsive system.</p>
<p>This project illustrates how you can leverage advanced ML methods to enhance accessibility and communication, paving the way for more inclusive learning environments.</p>
<p>Remember: while this demo works with small datasets, real-world applications need larger, more diverse data and careful consideration of accessibility, inclusivity, and ethics.</p>
<p>Here’s the GitHub repo for full source code: <a target="_blank" href="https://github.com/tayo4christ/transformer-gesture">transformer-gesture</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create a Real-Time Gesture-to-Text Translator Using Python and Mediapipe ]]>
                </title>
                <description>
                    <![CDATA[ Sign and symbol languages, like Makaton and American Sign Language (ASL), are powerful communication tools. However, they can create challenges when communicating with people who don't understand them. As a researcher working on AI for accessibility,... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/create-a-real-time-gesture-to-text-translator/</link>
                <guid isPermaLink="false">68a331edf6c19271552e2ac7</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Accessibility ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Mon, 18 Aug 2025 14:00:13 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755525484024/9f4c42e0-dbfd-4f04-9223-0a2169abd1fb.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Sign and symbol languages, like Makaton and American Sign Language (ASL), are powerful communication tools. However, they can create challenges when communicating with people who don't understand them.</p>
<p>As a researcher working on AI for accessibility, I wanted to explore how machine learning and computer vision could bridge that gap. The result was a real-time gesture-to-text translator built with Python and Mediapipe, capable of detecting hand gestures and instantly converting them to text.</p>
<p>In this tutorial, you’ll learn how to build your own version from scratch, even if you’ve never used Mediapipe before.</p>
<p>By the end, you’ll know how to:</p>
<ul>
<li><p>Detect and track hand movements in real time.</p>
</li>
<li><p>Classify gestures using a simple machine learning model.</p>
</li>
<li><p>Convert recognized gestures into text output.</p>
</li>
<li><p>Extend the system for accessibility-focused applications.</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before following along with this tutorial, you should have:</p>
<ul>
<li><p><strong>Basic Python knowledge</strong> – You should be comfortable writing and running Python scripts.</p>
</li>
<li><p><strong>Familiarity with the command line</strong> – You’ll use it to run scripts and install dependencies.</p>
</li>
<li><p><strong>A working webcam</strong> – This is required for capturing and recognizing gestures in real time.</p>
</li>
<li><p><strong>Python installed (3.8 or later)</strong> – Along with <code>pip</code> for installing packages.</p>
</li>
<li><p><strong>Some understanding of machine learning basics</strong> – Knowing what training data and models are will help, but I’ll explain the key parts along the way.</p>
</li>
<li><p><strong>An internet connection</strong> – To install libraries such as Mediapipe and OpenCV.</p>
</li>
</ul>
<p>If you’re completely new to Mediapipe or OpenCV, don’t worry, I will walk through the core parts you need to know to get this project working.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-this-matters">Why This Matters</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tools-and-technologies">Tools and Technologies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-how-to-install-the-required-libraries">Step 1: How to Install the Required Libraries</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-how-mediapipe-tracks-hands">Step 2: How Mediapipe Tracks Hands</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-project-pipeline">Step 3: Project Pipeline</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-how-to-collect-gesture-data">Step 4: How to Collect Gesture Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-how-to-train-a-gesture-classifier">Step 5: How to Train a Gesture Classifier</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-real-time-gesture-to-text-translation">Step 6: Real-Time Gesture-to-Text Translation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-7-extending-the-project">Step 7: Extending the Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-ethical-and-accessibility-considerations">Ethical and Accessibility Considerations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-why-this-matters">Why This Matters</h2>
<p>Accessible communication is a right, not a privilege. Gesture-to-text translators can:</p>
<ul>
<li><p>Help non-signers communicate with sign/symbol language users.</p>
</li>
<li><p>Assist in educational contexts for children with communication challenges.</p>
</li>
<li><p>Support people with speech impairments.</p>
</li>
</ul>
<p><strong>Note:</strong> This project is a proof-of-concept and should be tested with diverse datasets before real-world deployment.</p>
<h2 id="heading-tools-and-technologies">Tools and Technologies</h2>
<p>We’ll be using:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Tool</td><td>Purpose</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Python</strong></td><td>Primary programming language</td></tr>
<tr>
<td><strong>Mediapipe</strong></td><td>Real-time hand tracking and gesture detection</td></tr>
<tr>
<td><strong>OpenCV</strong></td><td>Webcam input and video display</td></tr>
<tr>
<td><strong>NumPy</strong></td><td>Data processing</td></tr>
<tr>
<td><strong>Scikit-learn</strong></td><td>Gesture classification</td></tr>
</tbody>
</table>
</div><h2 id="heading-step-1-how-to-install-the-required-libraries">Step 1: How to Install the Required Libraries</h2>
<p>Before installing the dependencies, ensure you have Python 3.8 or higher installed. You can check your current Python version by opening a terminal (Command Prompt on Windows, or Terminal on macOS/Linux) and typing:</p>
<pre><code class="lang-bash">python --version
</code></pre>
<p>or</p>
<pre><code class="lang-bash">python3 --version
</code></pre>
<p>Mediapipe and some of its dependencies rely on modern language features and prebuilt binary wheels, which is why Python 3.8 or higher is required. If the commands above print an older version, install a newer Python before you continue.</p>
<p><strong>Windows:</strong></p>
<ol>
<li><p>Press <strong>Windows Key + R</strong></p>
</li>
<li><p>Type <code>cmd</code> and press Enter to open Command Prompt</p>
</li>
<li><p>Type one of the above commands and press Enter</p>
</li>
</ol>
<p><strong>macOS/Linux:</strong></p>
<ol>
<li><p>Open your <strong>Terminal</strong> application</p>
</li>
<li><p>Type one of the above commands and press Enter</p>
</li>
</ol>
<p>If your Python version is older than 3.8, you’ll need to <a target="_blank" href="https://www.python.org/downloads/">download and install a newer version from the official Python website</a>.</p>
<p>Once Python is ready, you can install the required libraries using <code>pip</code>:</p>
<pre><code class="lang-bash">pip install mediapipe opencv-python numpy scikit-learn pandas
</code></pre>
<p>This command installs all the libraries you’ll need for the project:</p>
<ul>
<li><p><strong>Mediapipe</strong> – real-time hand tracking and landmark detection.</p>
</li>
<li><p><strong>OpenCV</strong> – reading frames from your webcam and drawing overlays.</p>
</li>
<li><p><strong>NumPy</strong> – numerical arrays for the landmark coordinate vectors.</p>
</li>
<li><p><strong>Pandas</strong> – storing our collected landmark data in a CSV for training.</p>
</li>
<li><p><strong>Scikit-learn</strong> – training and evaluating the gesture classification model.</p>
</li>
</ul>
<h2 id="heading-step-2-how-mediapipe-tracks-hands">Step 2: How Mediapipe Tracks Hands</h2>
<p>Mediapipe’s Hand Tracking solution detects 21 key landmarks for each hand (fingertips, joints, and the wrist) at <strong>30+ FPS</strong>, even on modest hardware.</p>
<p>Here’s a conceptual diagram of the landmarks:</p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/images/landmarks_concept.png?raw=true" alt="Diagram showing Mediapipe hand landmark numbering and connections between joints" width="600" height="400" loading="lazy"></p>
<p>And here’s what real‑time tracking looks like:</p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/images/hand_tracking_3d_android_gpu.gif?raw=true" alt="Animated GIF showing Mediapipe 3D hand tracking detecting finger joints and bones in real-time" width="600" height="400" loading="lazy"></p>
<p>Each landmark has <code>(x, y, z)</code> coordinates relative to the image size, making it easy to measure angles and positions for gesture classification.</p>
<h2 id="heading-step-3-project-pipeline">Step 3: Project Pipeline</h2>
<p>Here’s how the system works, from webcam to text output:</p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/diagrams/pipeline_flowchart.png?raw=true" alt="Pipeline Flowchart showing how gesture input flows through hand tracking, feature extraction, gesture classification, and final text output" width="600" height="400" loading="lazy"></p>
<ul>
<li><p><strong>Capture</strong>: Webcam frames are captured using OpenCV.</p>
</li>
<li><p><strong>Detection</strong>: Mediapipe locates hand landmarks.</p>
</li>
<li><p><strong>Vectorization</strong>: Landmarks are flattened into a numeric vector.</p>
</li>
<li><p><strong>Classification</strong>: A machine learning model predicts the gesture.</p>
</li>
<li><p><strong>Output</strong>: The recognized gesture is displayed as text.</p>
</li>
</ul>
<p>Basic hand detection example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> cv2
<span class="hljs-keyword">import</span> mediapipe <span class="hljs-keyword">as</span> mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(<span class="hljs-number">0</span>)

<span class="hljs-keyword">with</span> mp_hands.Hands(max_num_hands=<span class="hljs-number">1</span>) <span class="hljs-keyword">as</span> hands:
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        ret, frame = cap.read()
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ret:
            <span class="hljs-keyword">break</span>

        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        <span class="hljs-keyword">if</span> results.multi_hand_landmarks:
            <span class="hljs-keyword">for</span> hand_landmarks <span class="hljs-keyword">in</span> results.multi_hand_landmarks:
                mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)

        cv2.imshow(<span class="hljs-string">"Hand Tracking"</span>, frame)
        <span class="hljs-keyword">if</span> cv2.waitKey(<span class="hljs-number">1</span>) &amp; <span class="hljs-number">0xFF</span> == ord(<span class="hljs-string">"q"</span>):
            <span class="hljs-keyword">break</span>

cap.release()
cv2.destroyAllWindows()
</code></pre>
<p>The code above opens the webcam and processes each frame with Mediapipe’s Hands solution. Each frame is converted to RGB (as Mediapipe expects) before detection runs; if a hand is found, the 21 landmarks and their connections are drawn on top of the frame. Press <code>q</code> to close the window. This snippet verifies your setup, letting you confirm that landmark tracking works before moving on.</p>
<h2 id="heading-step-4-how-to-collect-gesture-data">Step 4: How to Collect Gesture Data</h2>
<p>Before we can train our model, we need a dataset of <strong>labelled gestures</strong>. Each gesture will be stored in a CSV file (<code>gesture_data.csv</code>) containing the 3D landmark coordinates for all detected hand points.</p>
<p>For example, we’ll collect data for three gestures:</p>
<ul>
<li><p><strong>thumbs_up</strong> – the classic thumbs-up pose.</p>
</li>
<li><p><strong>open_palm</strong> – a flat hand, fingers extended (like a “high five”).</p>
</li>
<li><p><strong>ok</strong> – the “OK” sign, made by touching the thumb and index finger.</p>
</li>
</ul>
<p>You can collect samples for each gesture by running:</p>
<pre><code class="lang-bash">python src/collect_data.py --label thumbs_up --samples 200
</code></pre>
<pre><code class="lang-bash">python src/collect_data.py --label open_palm --samples 200
</code></pre>
<pre><code class="lang-bash">python src/collect_data.py --label ok --samples 200
</code></pre>
<p><strong>Explanation of the command:</strong></p>
<ul>
<li><p><code>--label</code> → the name of the gesture you’re recording. This label will be stored alongside each row of coordinates in the CSV.</p>
</li>
<li><p><code>--samples</code> → the number of frames to capture for that gesture. More samples generally lead to better accuracy.</p>
</li>
</ul>
<p><strong>How the process works</strong> (a sketch of the core loop follows this list):</p>
<ol>
<li><p>When you run a command, your webcam will open.</p>
</li>
<li><p>Make the specified gesture in front of the camera.</p>
</li>
<li><p>The script will use MediaPipe Hands to detect 21 hand landmarks (each with <code>x</code>, <code>y</code>, <code>z</code> coordinates).</p>
</li>
<li><p>These 63 numbers (21 × 3) are stored in a row of the CSV file, along with the gesture label.</p>
</li>
<li><p>The counter at the top will track how many samples have been collected.</p>
</li>
<li><p>When the sample count reaches your target (<code>--samples</code>), the script will close automatically.</p>
</li>
</ol>
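<p>The repo’s <code>src/collect_data.py</code> implements this loop. Here’s a minimal sketch of the core idea (flags and details may differ from the actual script):</p>
<pre><code class="lang-python"># A sketch of the collection loop: write one CSV row per detected hand
import argparse
import csv
import os

import cv2
import mediapipe as mp

parser = argparse.ArgumentParser()
parser.add_argument("--label", required=True)
parser.add_argument("--samples", type=int, default=200)
args = parser.parse_args()

os.makedirs("data", exist_ok=True)
mp_hands = mp.solutions.hands
cap = cv2.VideoCapture(0)
count = 0

with mp_hands.Hands(max_num_hands=1) as hands, \
     open("data/gesture_data.csv", "a", newline="") as f:
    writer = csv.writer(f)  # header row (x0..z20,label) omitted for brevity
    while count &lt; args.samples:
        ret, frame = cap.read()
        if not ret:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark
            row = [v for p in lm for v in (p.x, p.y, p.z)]  # 63 values
            writer.writerow(row + [args.label])
            count += 1
        cv2.putText(frame, f"{count}/{args.samples}", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow("Collecting", frame)
        if cv2.waitKey(1) &amp; 0xFF == ord("q"):
            break

cap.release()
cv2.destroyAllWindows()
</code></pre>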
<p><strong>Example of what the CSV looks like:</strong></p>
<p><img src="https://raw.githubusercontent.com/tayo4christ/Gesture_Article/26db13366407e5b5d230a6c7dd7923e34a9f2a19/screenshots/gesture_data.webp" alt="Sample of gesture_data.csv" width="600" height="400" loading="lazy"></p>
<p>Each row contains:</p>
<ul>
<li><p><strong>x0, y0, z0 … x20, y20, z20</strong> → coordinates of each hand landmark.</p>
</li>
<li><p><strong>label</strong> → the gesture name.</p>
</li>
</ul>
<p><strong>Example of data collection in progress:</strong></p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/screenshots/detection_example.jpg?raw=true" alt="Screenshot of data collection interface capturing hand gesture landmarks from webcam" width="600" height="400" loading="lazy"></p>
<p>In the above screenshot, the script is capturing <strong>10 out of 10</strong> <code>thumbs_up</code> samples.</p>
<p>📌 <strong>Tip:</strong> Make sure your hand is clearly visible and well-lit. Repeat the process for all gestures you want to train.</p>
<h2 id="heading-step-5-how-to-train-a-gesture-classifier">Step 5: How to Train a Gesture Classifier</h2>
<p>Once you have enough samples for each gesture, train a model:</p>
<pre><code class="lang-bash">python src/train_model.py --data data/gesture_data.csv --label palm_open
</code></pre>
<p>This script:</p>
<ul>
<li><p>Loads the CSV dataset.</p>
</li>
<li><p>Splits into training and testing sets.</p>
</li>
<li><p>Trains a Random Forest Classifier.</p>
</li>
<li><p>Prints accuracy and a classification report.</p>
</li>
<li><p>Saves the trained model.</p>
</li>
</ul>
<p>Core training logic:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> RandomForestClassifier
<span class="hljs-keyword">import</span> pickle

<span class="hljs-comment"># Load the dataset</span>
df = pd.read_csv(<span class="hljs-string">"data/gesture_data.csv"</span>)

<span class="hljs-comment"># Separate features and labels</span>
X = df.drop(<span class="hljs-string">"label"</span>, axis=<span class="hljs-number">1</span>)
y = df[<span class="hljs-string">"label"</span>]

<span class="hljs-comment"># Initialize and train the Random Forest Classifier</span>
model = RandomForestClassifier()
model.fit(X, y)

<span class="hljs-comment"># Save the trained model to a file</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">"data/gesture_model.pkl"</span>, <span class="hljs-string">"wb"</span>) <span class="hljs-keyword">as</span> f:
    pickle.dump(model, f)
</code></pre>
<p>This block loads the gesture dataset from <code>data/gesture_data.csv</code> and splits it into:</p>
<ul>
<li><p><code>X</code> – the input features (the 3D landmark coordinates for each gesture sample).</p>
</li>
<li><p><code>y</code> – the labels (gesture names like <code>thumbs_up</code>, <code>open_palm</code>, <code>ok</code>).</p>
</li>
</ul>
<p>We then create a Random Forest Classifier, which is well-suited for numerical data and works reliably without much tuning. The model learns the patterns in landmark positions that correspond to each gesture.<br>Finally, we save the trained model as <code>data/gesture_model.pkl</code> so it can be loaded later for real-time gesture recognition without retraining.</p>
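<p>The full <code>train_model.py</code> also splits the data and reports accuracy, as listed above. Here’s a minimal sketch of what that evaluation step typically looks like, continuing from the variables defined in the previous snippet:</p>
<pre><code class="lang-python"># A sketch of the evaluation step: hold out 20% of the samples for testing
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier()
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.2%}")
print(classification_report(y_test, preds))
</code></pre>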
<h2 id="heading-step-6-real-time-gesture-to-text-translation">Step 6: Real-Time Gesture-to-Text Translation</h2>
<p>Load the model and run the translator:</p>
<pre><code class="lang-bash">python src/gesture_to_text.py --model data/gesture_model.pkl
</code></pre>
<p>This command runs the real-time gesture recognition script.</p>
<ul>
<li><p>The <code>--model</code> argument tells the script which trained model file to load — in this case, <code>gesture_model.pkl</code> that we saved earlier.</p>
</li>
<li><p>Once running, the script opens your webcam, detects your hand landmarks, and uses the model to predict the gesture.</p>
</li>
<li><p>The predicted gesture name appears as text on the video feed.</p>
</li>
<li><p>Press <code>q</code> to exit the window when you’re done.</p>
</li>
</ul>
<p>Core prediction logic:</p>
<pre><code class="lang-python"><span class="hljs-keyword">with</span> open(<span class="hljs-string">"data/gesture_model.pkl"</span>, <span class="hljs-string">"rb"</span>) <span class="hljs-keyword">as</span> f:
    model = pickle.load(f)

<span class="hljs-keyword">if</span> results.multi_hand_landmarks:
    <span class="hljs-keyword">for</span> hand_landmarks <span class="hljs-keyword">in</span> results.multi_hand_landmarks:
        coords = []
        <span class="hljs-keyword">for</span> lm <span class="hljs-keyword">in</span> hand_landmarks.landmark:
            coords.extend([lm.x, lm.y, lm.z])
        gesture = model.predict([coords])[<span class="hljs-number">0</span>]
        cv2.putText(frame, gesture, (<span class="hljs-number">10</span>, <span class="hljs-number">50</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">1</span>, (<span class="hljs-number">0</span>, <span class="hljs-number">255</span>, <span class="hljs-number">0</span>), <span class="hljs-number">2</span>)
</code></pre>
<p>This code loads the trained gesture recognition model from <code>gesture_model.pkl</code>.<br>If any hands are detected (<code>results.multi_hand_landmarks</code>), it loops through each detected hand and:</p>
<ol>
<li><p><strong>Extracts the coordinates</strong> – for each of the 21 landmarks, it appends the <code>x</code>, <code>y</code>, and <code>z</code> values to the <code>coords</code> list.</p>
</li>
<li><p><strong>Makes a prediction</strong> – passes <code>coords</code> to the model’s <code>predict</code> method to get the most likely gesture label.</p>
</li>
<li><p><strong>Displays the result</strong> – uses <code>cv2.putText</code> to draw the predicted gesture name on the video feed.</p>
</li>
</ol>
<p>This is the real-time decision-making step that turns raw Mediapipe landmark data into a readable gesture label.</p>
<p>You should see the recognized gesture at the top of the video feed:</p>
<p><img src="https://github.com/tayo4christ/Gesture_Article/blob/7598826bb530d5bd1cd40251d6f56f35653b6b51/screenshots/text_output.jpg?raw=true" alt="Screenshot of the real-time gesture recognition output overlaying the 'palm_open' label on the video feed" width="600" height="400" loading="lazy"></p>
<h2 id="heading-step-7-extending-the-project">Step 7: Extending the Project</h2>
<p>You can take this project further by:</p>
<ul>
<li><p><strong>Adding Text-to-Speech</strong>: Use <code>pyttsx3</code> to speak recognized words (see the sketch after this list).</p>
</li>
<li><p><strong>Supporting More Gestures</strong>: Expand your dataset.</p>
</li>
<li><p><strong>Deploying in the Browser</strong>: Use TensorFlow.js for web-based recognition.</p>
</li>
<li><p><strong>Testing with Real Users</strong>: Especially in accessibility contexts.</p>
</li>
</ul>
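<p>For the text-to-speech idea, here’s a minimal <code>pyttsx3</code> sketch (install it first with <code>pip install pyttsx3</code>):</p>
<pre><code class="lang-python">import pyttsx3

engine = pyttsx3.init()

def speak(text):
    """Speak the recognized gesture label aloud."""
    engine.say(text)
    engine.runAndWait()

# For example, call speak(gesture) in the prediction loop
# whenever the predicted label changes.
</code></pre>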
<h2 id="heading-ethical-and-accessibility-considerations">Ethical and Accessibility Considerations</h2>
<p>Before deploying:</p>
<ul>
<li><p><strong>Dataset Diversity</strong>: Train with gestures from different skin tones, hand sizes, and lighting conditions.</p>
</li>
<li><p><strong>Privacy</strong>: Store only landmark coordinates unless you have consent for video storage.</p>
</li>
<li><p><strong>Cultural Context</strong>: Some gestures have different meanings in different cultures.</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, we explored how to use Python, Mediapipe, and machine learning to build a real-time gesture-to-text translator. This technology has exciting potential for accessibility and inclusive communication, and with further development, could become a powerful tool for breaking down language barriers.</p>
<p>You can find the full code and resources here:</p>
<p><strong>GitHub Repo</strong> – <a target="_blank" href="https://github.com/tayo4christ/Gesture_Article">Gesture_Article</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use the Segment Anything Model (SAM) to Create Masks ]]>
                </title>
                <description>
                    <![CDATA[ By Jess Wilk Hey there! So, you know that buzz about Tesla's autopilot being all futuristic and driverless? Ever thought about how it actually does its magic? Well, let me tell you – it's all about image segmentation and object detection.  What is Im... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/use-segment-anything-model-to-create-masks/</link>
                <guid isPermaLink="false">66d45f6f706b9fb1c166b995</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 08 Nov 2023 20:26:18 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/11/cover-image-SAM.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Jess Wilk</p>
<p>Hey there! So, you know that buzz about Tesla's autopilot being all futuristic and driverless? Ever thought about how it actually does its magic? Well, let me tell you – it's all about image segmentation and object detection. </p>
<h2 id="heading-what-is-image-segmentation">What is Image Segmentation?</h2>
<p>Image segmentation, basically chopping up an image into different parts, helps the system recognize stuff. It identifies where humans, other cars, and obstacles are on the road. That's the tech making sure those self-driving cars can cruise around safely. Cool, right? 🚗</p>
<p>During the past decade, Computer Vision has made massive strides, especially in crafting super-sophisticated segmentation and object detection methods. </p>
<p>These breakthroughs have found diverse uses, like spotting tumors and diseases in medical images, keeping an eye on crops in farming, and even guiding robots in navigation. The tech's really branching out and making a significant impact across different fields. </p>
<p>The main challenge lies in getting and prepping the data. Building an image segmentation dataset demands annotating heaps of images to define the labels, which is a massive task. This requires a ton of resources. </p>
<p>So, the game changed when the <strong>Segment Anything Model (SAM)</strong> came into the scene. SAM revolutionized this field by enabling anyone to create segmentation masks for their data without relying on labeled data.</p>
<p>In this article, I’ll guide you through understanding SAM, its workings, and how you can utilize it to make masks. So, get ready with your cup of coffee because we're diving in! ☕</p>
<h3 id="heading-prerequisites">Prerequisites:</h3>
<p>The prerequisites for this article include a basic understanding of <strong>Python programming</strong> and a fundamental knowledge of <strong>machine learning</strong>. </p>
<p>Additionally, familiarity with image segmentation concepts, computer vision, and data annotation challenges would also be beneficial.</p>
<h2 id="heading-what-is-the-segment-anything-model">What is the Segment Anything Model?</h2>
<p>SAM is a large, promptable segmentation model developed by the Facebook research team (Meta AI). The model was trained on a massive dataset of <strong>1.1 billion segmentation masks</strong>, the SA-1B dataset. Because the training data is so large and diverse, the model generalizes well to unseen images.</p>
<p>SAM can be used to segment any image and create masks without any labeled data. That’s the breakthrough: it performs zero-shot segmentation on images and object types it was never explicitly trained on.</p>
<p>What makes SAM unique? It is a first-of-its-kind, <strong>promptable segmentation</strong> model. Prompts allow you to instruct the model on your desired output through text and interactive actions. You can provide prompts to SAM in multiple ways: Points, Bounding Boxes, texts, and even base masks.</p>
<h2 id="heading-how-does-sam-work">How Does SAM Work?</h2>
<p>SAM uses a transformer-based architecture, much like modern large language models. Let’s look at the flow of data through the different components of SAM. </p>
<p><strong>Image Encoder:</strong> When you provide an image to SAM, it is first sent to the Image Encoder. True to its name, this component encodes the image into vectors. These vectors represent the low-level (edges, outlines) and high-level features like object shapes and textures extracted from the image. The encoder here is a <strong>Vision Transformer (ViT),</strong> which has many advantages over traditional CNNs.</p>
<p><strong>Prompt Encoder:</strong> The prompt input the user gives is converted to embeddings by the prompt encoder. SAM uses positional embeddings for points, bounding box prompts, and text encoders for text prompts.</p>
<p><strong>Mask Decoder:</strong> Next, SAM maps the extracted image features and prompt encodings to generate the mask, which is our output. SAM will generate 3 segmented masks for every input prompt, providing the users with choices. </p>
<h2 id="heading-why-use-sam">Why use SAM?</h2>
<p>With SAM, you can skip the expensive setup usually needed for AI, and still get fast results. It works well with all sorts of data, like medical or satellite images, and fits right into the software you already use for quick detection tasks. </p>
<p>You also get tools tailored for specific jobs like image segmentation, and it’s straightforward to interact with, whether you're training it or asking it to analyze data. Plus, it saves you the time and cost of building and training a custom CNN-based pipeline from scratch.</p>
<p><img src="https://lh7-us.googleusercontent.com/tcDOfehN4GLt4bZkN_0uhOPYsZ9B8cBeQaCxf9F6OS6iUN1WESAAWNUb9_vCpTj66TvzeVocZi3i6xKkrMB2cSbj0-UBrjlR3jjBXJfRo1WAYyipmVbSiYQPj0f3X8HMc1AA1y1dQ7Zq197kxXETWDY" alt="Image" width="600" height="400" loading="lazy">
<em>Why use SAM?</em></p>
<h2 id="heading-how-to-install-and-set-up-sam">How to Install and Set up SAM</h2>
<p>Now that you know how SAM works, let me show you how to install and set it up. The first step is to install the package in your Jupyter notebook or Google Colab with the following command:</p>
<pre><code class="lang-python">pip install <span class="hljs-string">'git+https://github.com/facebookresearch/segment-anything.git'</span>
</code></pre>
<pre><code class="lang-python">/content Collecting git+https://github.com/facebookresearch/segment-anything.git Cloning https://github.com/facebookresearch/segment-anything.git to /tmp/pip-req-build-xzlt_n7r Running command git clone --filter=blob:none --quiet https://github.com/facebookresearch/segment-anything.git /tmp/pip-req-build-xzlt_n7r Resolved https://github.com/facebookresearch/segment-anything.git to commit <span class="hljs-number">6</span>fdee8f2727f4506cfbbe553e23b895e27956588 Preparing metadata (setup.py) ... done
</code></pre>
<p>The next step is to download the pre-trained weights of the SAM model you want to use. </p>
<p>You can choose from three options of checkpoint weights: ViT-B (91M), ViT-L (308M), and ViT-H (636M parameters). </p>
<p>How do you choose the right one? The larger the number of parameters, the longer inference (mask generation) takes. If you have limited GPU resources and need fast inference, go for ViT-B. Otherwise, choose ViT-H for the best mask quality. </p>
<p>Follow the below commands to set up the model checkpoint path:</p>
<pre><code class="lang-python">!wget -q https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
CHECKPOINT_PATH=<span class="hljs-string">'/content/weights/sam_vit_h_4b8939.pth'</span>


<span class="hljs-keyword">import</span> torch
DEVICE = torch.device(<span class="hljs-string">'cuda:0'</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">'cpu'</span>)
MODEL_TYPE = <span class="hljs-string">"vit_h"</span>
</code></pre>
<p>The model weights are ready! Now, I’ll show you different methods through which you can provide prompts and generate masks in the upcoming sections. 🚀</p>
<h2 id="heading-how-sam-can-generate-masks-automatically">How SAM Can Generate Masks Automatically</h2>
<p>SAM can automatically segment the entire input image into distinct segments without a specific prompt. For this, you can use the <code>SamAutomaticMaskGenerator</code> utility. </p>
<p>Follow the below commands to import and initialize it with the model type and checkpoint path. </p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> segment_anything <span class="hljs-keyword">import</span> sam_model_registry, SamAutomaticMaskGenerator, SamPredictor


sam = sam_model_registry[MODEL_TYPE](checkpoint=CHECKPOINT_PATH).to(device=DEVICE)


mask_generator = SamAutomaticMaskGenerator(sam)
</code></pre>
<p>For example, I have uploaded an image of dogs to my notebook. It will be our input image, which has to be converted into RGB (Red-Green-Blue) pixel format to be input to the model. </p>
<p>You can do this using the OpenCV Python package and then use the <code>generate()</code> function to create a mask, as shown below:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Import opencv package</span>
<span class="hljs-keyword">import</span> cv2


<span class="hljs-comment"># Give the path of your image</span>
IMAGE_PATH= <span class="hljs-string">'/content/dog.png'</span>
<span class="hljs-comment"># Read the image from the path</span>
image= cv2.imread(IMAGE_PATH)
<span class="hljs-comment"># Convert to RGB format</span>
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)


<span class="hljs-comment"># Generate segmentation mask</span>
output_mask = mask_generator.generate(image_rgb)
print(output_mask)
</code></pre>
<p>The generated output is a list of dictionaries, one per mask, each with the following main values:</p>
<ul>
<li><code>segmentation:</code> A boolean array with the same height and width as the image, marking the mask</li>
<li><code>area:</code> An integer storing the area of the mask in pixels</li>
<li><code>bbox:</code> The coordinates of the boundary box in [x, y, w, h] format</li>
<li><code>predicted_iou:</code> The model's own estimate of the mask's quality (predicted intersection-over-union)</li>
</ul>
<p><img src="https://lh7-us.googleusercontent.com/zvUNSrvPrv8-Z1idbMLHXKv8iXzWlInik9R2fdJ24HQc5EBxdAgqaiEFTeE4UalWdUvA0R0L9dQuqDDZVucoBWwTMBld9aCJ8NKRTp2vxE-fYnvsbIEL8Z1kRfnQFsCVGb4HGf0pkkuNT6Wss1iMX6c" alt="Image" width="600" height="400" loading="lazy">
<em>The generated output is a dictionary with the main values</em></p>
<p>So how do we visualize our output mask?</p>
<p>Well, it's a simple Python function that takes the list of mask dictionaries generated by SAM and plots the segmentation masks using their shape values and coordinates.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Function that inputs the output and plots image and mask</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">show_output</span>(<span class="hljs-params">result_dict,axes=None</span>):</span>
     <span class="hljs-keyword">if</span> axes:
        ax = axes
     <span class="hljs-keyword">else</span>:
        ax = plt.gca()
        ax.set_autoscale_on(<span class="hljs-literal">False</span>)
     sorted_result = sorted(result_dict, key=(<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-string">'area'</span>]),      reverse=<span class="hljs-literal">True</span>)
     <span class="hljs-comment"># Plot for each segment area</span>
     <span class="hljs-keyword">for</span> val <span class="hljs-keyword">in</span> sorted_result:
        mask = val[<span class="hljs-string">'segmentation'</span>]
        img = np.ones((mask.shape[<span class="hljs-number">0</span>], mask.shape[<span class="hljs-number">1</span>], <span class="hljs-number">3</span>))
        color_mask = np.random.random((<span class="hljs-number">1</span>, <span class="hljs-number">3</span>)).tolist()[<span class="hljs-number">0</span>]
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">3</span>):
            img[:,:,i] = color_mask[i]
            ax.imshow(np.dstack((img, mask*<span class="hljs-number">0.5</span>)))
</code></pre>
<p>Let’s use this function to plot our raw input image and segmented mask:</p>
<pre><code class="lang-python">_,axes = plt.subplots(<span class="hljs-number">1</span>,<span class="hljs-number">2</span>, figsize=(<span class="hljs-number">16</span>,<span class="hljs-number">16</span>))
axes[<span class="hljs-number">0</span>].imshow(image_rgb)
show_output(output_mask, axes[<span class="hljs-number">1</span>])
</code></pre>
<p><img src="https://lh7-us.googleusercontent.com/m7RxR_KOL-nSBtptL-dEbsV_EN7w21sqQMiCnfvrr83hwxAhe7jgXWLUhMgjoGzpO4QHgSbnoCOtN5SB__kokKC_OykSCxEo7ntXYd1LihwL3BBlAgUNqn70-E35yQS-Xvb2JrnpYOYTjShEmCg9w9w" alt="Image" width="600" height="400" loading="lazy">
<em>Model has segmented every object</em></p>
<p>As you can see, the model has segmented every object in the image using a zero-shot method in one single go! 🌟</p>
<h2 id="heading-how-to-use-sam-with-bounding-box-prompts">How to Use SAM with Bounding Box Prompts</h2>
<p>Sometimes, we may want to segment only a specific portion of an image. To achieve this, you can input a rough bounding box identifying the area of interest, and SAM will segment the object inside it.</p>
<p>To implement this, import and initialize the <code>SamPredictor</code> and use the <code>set_image()</code> function to pass the input image. Next, call the <code>predict</code> function, providing the bounding box coordinates as input for the parameter <code>box</code>, as shown in the snippet below. The bounding box prompt should be in the [X-min, Y-min, X-max, Y-max] format.</p>
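<p>Note that the snippet below assumes a <code>bbox_prompt</code> array and two small plotting helpers, <code>show_mask</code> and <code>show_box</code>, which aren't defined in this article. Here is a minimal sketch of them; the box coordinates are hypothetical, so adjust them to your own image:</p>
<pre><code class="lang-python">import numpy as np
import matplotlib.pyplot as plt

# Hypothetical bounding box prompt in [X-min, Y-min, X-max, Y-max] format
bbox_prompt = np.array([70, 80, 570, 420])

def show_mask(mask, ax):
    # Overlay the mask on the axes as a semi-transparent blue layer
    color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
    h, w = mask.shape[-2:]
    ax.imshow(mask.reshape(h, w, 1) * color.reshape(1, 1, -1))

def show_box(box, ax):
    # Draw the bounding box as a green rectangle outline
    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor='none', lw=2))
</code></pre>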
<pre><code class="lang-python"><span class="hljs-comment"># Set up the SAM model with the encoded image</span>
mask_predictor = SamPredictor(sam)
mask_predictor.set_image(image_rgb)


<span class="hljs-comment"># Predict mask with bounding box prompt</span>
masks, scores, logits = mask_predictor.predict(
    box=bbox_prompt,
    multimask_output=<span class="hljs-literal">False</span>
)


<span class="hljs-comment"># Plot the bounding box prompt and predicted mask</span>
plt.imshow(image_rgb)
show_mask(masks[<span class="hljs-number">0</span>], plt.gca())
show_box(bbox_prompt, plt.gca())
plt.show()
</code></pre>
<p><img src="https://lh7-us.googleusercontent.com/DoiDVGgozu4ZDeBMyJWbSlCt3CGFnxd7SFlfWFuvuUu_ByZuHc2pA75C2dbaygBwIQqmHcPCBoEsVFaqs_dxpAskPVZxXOoejgu2j0JIrkwDmjPr3aa7xgsgdpmcG2vVETURBkZ32EOKNFZrDzvmQLA" alt="Image" width="600" height="400" loading="lazy">
<em>The green bounding box was our input prompt in this output, and the blue represents our predicted mask.</em></p>
<h2 id="heading-how-to-use-sam-with-points-as-prompts">How to Use SAM with Points as Prompts</h2>
<p>What if you need the object's mask for a certain point in the image? You can provide the point’s coordinates as an input prompt to SAM. The model will then generate the three most relevant segmentation masks, which helps when there is ambiguity about the main object of interest. </p>
<p>The first steps are similar to what we did in previous sections. Initialize the predictor module with the input image. Next, provide the input prompt as [X,Y] coordinates to the parameter <code>point_coords</code>.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Initialize the model with the input image</span>
<span class="hljs-keyword">from</span> segment_anything <span class="hljs-keyword">import</span> sam_model_registry, SamPredictor
sam = sam_model_registry[MODEL_TYPE](checkpoint=CHECKPOINT_PATH).to(device=DEVICE)
mask_predictor = SamPredictor(sam)
mask_predictor.set_image(image_rgb)
<span class="hljs-comment"># Provide points as input prompt [X,Y]-coordinates</span>
input_point = np.array([[<span class="hljs-number">250</span>, <span class="hljs-number">200</span>]])
input_label = np.array([<span class="hljs-number">1</span>])


<span class="hljs-comment"># Predict the segmentation mask at that point</span>
masks, scores, logits = mask_predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=<span class="hljs-literal">True</span>,
)
</code></pre>
<p>As we have set the <code>multimask_output</code> parameter to True, the model returns three output masks. Let’s visualize them by plotting the masks along with their input prompt.</p>
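<p>The plotting code for this step isn't shown above, so here is a minimal sketch that reuses the <code>show_mask</code> helper from the previous section and also prints each mask's self-evaluated IOU score:</p>
<pre><code class="lang-python"># Plot each of the three predicted masks with its IOU score
for i, (mask, score) in enumerate(zip(masks, scores)):
    plt.figure(figsize=(8, 8))
    plt.imshow(image_rgb)
    show_mask(mask, plt.gca())
    # Mark the input prompt point with a green star
    plt.scatter(input_point[:, 0], input_point[:, 1], color='green', marker='*', s=200)
    plt.title(f"Mask {i + 1}, predicted IOU: {score:.3f}")
    plt.axis('off')
    plt.show()
</code></pre>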
<p><img src="https://lh7-us.googleusercontent.com/etMcljU5T2wlLBfbJdV46L4n1I2KUZe2nswYJVFs0Hh-xRFFs-nArO9i5rEr1xU3Er77T7TTn7uenU9Tu1_H4SuSwjGyAtOYe-Jt7_-UQpO05Rv3dOIs5Y3Q-1I41VepltOi_tyBiKSf0RMfWhwVUaQ" alt="Image" width="600" height="400" loading="lazy">
<em>In the above figure, the green star denotes the prompt point, and the blue represents the predicted mask. While Mask 1 has poor coverage, Masks 2 and 3 have good accuracy for my needs.</em></p>
<p>I have also printed the self-evaluated IOU scores for each mask. IOU stands for Intersection Over Union and measures the overlap between the predicted mask and the object's true outline.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You can build a tailored segmentation dataset for your field by gathering raw images and utilizing the SAM tool for annotation. The model has shown consistent performance, even in tricky conditions like noise or occlusion. </p>
<p>In an upcoming version, the authors plan to add support for text prompts, aiming to make the model even more user-friendly. </p>
<p>I hope this information proves helpful for you!</p>
<p>Thank you for reading! I'm Jess, and I'm an expert at Hyperskill. You can check out our <a target="_blank" href="https://hyperskill.org/tracks/42?utm_source=fc_hs&amp;utm_medium=social&amp;utm_campaign=jess"><strong>ML courses</strong></a> on the platform. </p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Implement Computer Vision with Deep Learning and TensorFlow ]]>
                </title>
                <description>
                    <![CDATA[ Computer vision is being used in more and more places. From enhancing security systems to improving healthcare diagnostics, computer vision techniques are revolutionizing multiple industries.  We just published a 37-hour course on the freeCodeCamp.or... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-implement-computer-vision-with-deep-learning-and-tensorflow/</link>
                <guid isPermaLink="false">66b2032525ef0bb2c5a51748</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ TensorFlow ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Tue, 06 Jun 2023 15:08:54 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/06/compvision.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Computer vision is being used in more and more places. From enhancing security systems to improving healthcare diagnostics, computer vision techniques are revolutionizing multiple industries.</p>
<p> We just published a 37-hour course on the freeCodeCamp.org YouTube channel that will teach you about deep learning for computer vision using TensorFlow. The course was expertly created by Folefac Martins from Neuralearn.ai.</p>
<h2 id="heading-a-sneak-peek-into-the-course">A Sneak Peek into the Course</h2>
<p>This course is meticulously designed to cover a broad range of topics, starting from the basics of tensors and variables to the implementation of advanced deep learning models for complex tasks such as human emotion detection and image generation.</p>
<p>After introducing the prerequisites and discussing what learners can expect from the course, the first segment focuses on the foundational aspects of tensors and variables. You'll understand the basics, initialization and casting, indexing, and common TensorFlow functions. The topics extend to cover the intriguing concepts of ragged, sparse, and string tensors, laying the groundwork for building neural networks.</p>
<p>As you venture into the world of neural networks, you'll start by predicting car prices. This practical project involves steps from data preparation to measuring model performance, and it'll provide an understanding of linear regression models, error sanctioning, and training and optimization techniques.</p>
<p>The course then delves into convolutional neural networks (ConvNets), which are particularly useful for image data. You will use ConvNets to diagnose malaria, a task that includes data preparation, visualization, and processing, and learn how to build ConvNets with TensorFlow. Along the way, you'll explore binary cross-entropy loss, model training and evaluation, and saving and loading models on Google Drive.</p>
<p>Advanced topics in TensorFlow, such as custom loss and metrics, eager and graph modes, and custom training loops, are also thoroughly discussed. A significant portion of the course is devoted to improving model performance, evaluating classification models, and using data augmentation techniques to enhance the quality and diversity of data.</p>
<p>The course proceeds to explore modern Convolutional Neural Networks like AlexNet, VGGNet, ResNet, MobileNet, and EfficientNet, applied to a human emotions detection project. Additionally, the course illustrates the black box of these models by visualizing intermediate layers and using the Gradcam method.</p>
<p>There's a great section dedicated to Transformers in Vision, understanding and building Vision Transformers (ViTs) from scratch, and fine-tuning a Huggingface ViT. This section includes practical training with the Weights and Biases tool for experiment tracking, hyperparameter tuning, and dataset and model versioning, practices collectively known as MLOps.</p>
<p>The course then moves on to important topics in model deployment, including converting TensorFlow models to Onnx format, understanding and implementing quantization, building and deploying an API with FastAPI, and load testing with Locust.</p>
<p>Finally, the course concludes with a module on object detection using the YOLO algorithm and image generation using Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).</p>
<h2 id="heading-the-learning-experience">The Learning Experience</h2>
<p>What sets this course apart is the combination of theoretical understanding and practical applications. It is a guided journey through the intricacies of TensorFlow, deep learning, and computer vision, using real-world projects such as car price prediction, malaria diagnosis, human emotion detection, and image generation.</p>
<p>The course is perfect for anyone passionate about machine learning and AI, regardless of their current expertise level. So whether you're a complete beginner, a data scientist looking to update your skills, or an AI enthusiast, this course promises a thorough and practical understanding of computer vision and deep learning with TensorFlow.</p>
<p>Watch the full course <a target="_blank" href="https://www.youtube.com/watch?v=IA3WxTTPXqQ">on the freeCodeCamp.org YouTube channel</a> (37-hour course, with subtitles).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/IA3WxTTPXqQ" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ TensorFlow for Computer Vision – Full Course on Python for Machine Learning ]]>
                </title>
                <description>
                    <![CDATA[ TensorFlow can do some amazing things when it comes to computer vision. We just published a full course on the freeCodeCamp.org YouTube channel that will teach you how to use TensorFlow 2 for computer vision applications. Nour Islam Mokhtari created ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-tensorflow-for-computer-vision/</link>
                <guid isPermaLink="false">66b2037127569435a9255acb</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ TensorFlow ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Tue, 05 Oct 2021 15:07:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/10/tf-vision.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>TensorFlow can do some amazing things when it comes to computer vision.</p>
<p>We just published a full course on the freeCodeCamp.org YouTube channel that will teach you how to use TensorFlow 2 for computer vision applications.</p>
<p>Nour Islam Mokhtari created this course. Nour is a Machine Learning Engineer and experienced teacher.</p>
<p>The course shows you how to create two computer vision projects. The first involves an image classification model with a prepared dataset. The second is a more real-world problem where you will have to clean and prepare a dataset before using it.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-26.png" alt="Image" width="600" height="400" loading="lazy">
<em>MNIST Dataset with labels</em></p>
<p>Here are the topics covered in this course:</p>
<ul>
<li>Why learn Tensorflow</li>
<li>We will be using an IDE and not notebooks</li>
<li>Visual Studio Code (how to download and install it)</li>
<li>Miniconda - how to install it</li>
<li>Miniconda - why we need it</li>
<li>How are we going to use conda virtual environments in VS Code?</li>
<li>Installing Tensorflow 2 (CPU version)</li>
<li>Installing Tensorflow 2 (GPU version)</li>
<li>What do we want to achieve?</li>
<li>Exploring MNIST dataset</li>
<li>Tensorflow layers</li>
<li>Building a neural network the sequential way</li>
<li>Compiling the model and fitting the data</li>
<li>Building a neural network the functional way</li>
<li>Building a neural network the Model Class way</li>
<li>Things we should add</li>
<li>Restructuring our code for better readability</li>
<li>First part summary</li>
<li>What we want to achieve</li>
<li>Downloading and exploring the dataset</li>
<li>Preparing train and validation sets</li>
<li>Preparing the test set</li>
<li>Building a neural network the functional way</li>
<li>Creating data generators</li>
<li>Instantiating the generators</li>
<li>Compiling the model and fitting the data</li>
<li>Adding callbacks</li>
<li>Evaluating the model</li>
<li>Potential improvements</li>
<li>Running prediction on single images</li>
</ul>
<p>Watch the full course below or <a target="_blank" href="https://youtu.be/cPmjQ9V6Hbk">on the freeCodeCamp.org YouTube channel</a> (4.5-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/cPmjQ9V6Hbk" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Advanced Computer Vision with Python ]]>
                </title>
                <description>
                    <![CDATA[ More and more applications are using computer vision these days. We just published a full course on the freeCodeCamp.org YouTube channel that will help you learn advanced computer vision using Python. You will learn state of the art computer vision t... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/advanced-computer-vision-with-python/</link>
                <guid isPermaLink="false">66b2005808bc664c3c097e48</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ freeCodeCamp.org ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Fri, 28 May 2021 12:26:38 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/05/computervision.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>More and more applications are using computer vision these days.</p>
<p>We just published a full course on the freeCodeCamp.org YouTube channel that will help you learn advanced computer vision using Python. You will learn state of the art computer vision techniques by building five projects with libraries such as OpenCV and Mediapipe.</p>
<p>If you are a beginner, don't be afraid of the term "advanced". Even though the concepts are advanced, the course teaches them in a way that is easy to follow. </p>
<p>This course is taught by Murtaza Hassan. Murtaza has a popular YouTube channel about Robotics and AI, and now he is sharing his expertise with the freeCodeCamp audience.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/05/hand.gif" alt="Image" width="600" height="400" loading="lazy"></p>
<p>In the first half of the course you will learn core techniques by implementing hand tracking, pose estimation, face detection, and face mesh. </p>
<p>In the second half of the course you will create five projects with real-world applications. Here is what you will create:</p>
<ul>
<li>Gesture Volume Control</li>
<li>Finger Counter</li>
<li>AI Personal Trainer</li>
<li>AI Virtual Painter</li>
<li>AI Virtual Mouse</li>
</ul>
<p>Watch the full course below or on <a target="_blank" href="https://youtu.be/01sAkU_NvOY">the freeCodeCamp.org YouTube channel</a> (6-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/01sAkU_NvOY" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Manage Computer Vision Datasets in Python with Remo ]]>
                </title>
                <description>
                    <![CDATA[ By Pier Paolo Ippolito Computer Vision is one of the most important applications of Machine Learning. Some common commercial applications of Computer Vision are: Predictive maintenance for industrial infrastructure, oil and gas pipelines, and commer... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/manage-computer-vision-datasets-in-python-with-remo/</link>
                <guid isPermaLink="false">66d46096c7632f8bfbf1e483</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 10 Dec 2020 19:28:08 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2020/12/d148d60c3269c7e0a3070eec97a5e497-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Pier Paolo Ippolito</p>
<p>Computer Vision is one of the most important applications of Machine Learning. Some common commercial applications of Computer Vision are:</p>
<ul>
<li>Predictive maintenance for industrial infrastructure, oil and gas pipelines, and commercial real estate</li>
<li>Quality assurance automation</li>
<li>Landscape inventory and parcel management based on satellite imagery and drone footage</li>
</ul>
<p>Some of the most common techniques used to accomplish these kinds of tasks are:</p>
<ul>
<li>Image Classification</li>
<li>Object Detection</li>
<li>Instance Segmentation</li>
</ul>
<p>During the past decade, many frameworks such as TensorFlow, Keras, and PyTorch have been developed to make it easier to build Computer Vision-based models. </p>
<p>But it is still relatively difficult to work with image data due to the necessary image pre-processing, labelling, and annotation visualization.</p>
<p>As part of this article, I am going to introduce you to <a target="_blank" href="https://remo.ai/">Remo</a>, a free Python library designed to help developers work on Computer Vision tasks. Remo can help you:</p>
<ul>
<li>Organize and visualize images and annotations</li>
<li>Efficiently annotate</li>
<li>Work and collaborate as a team on the data</li>
</ul>
<p>Remo can be used either in a Jupyter Notebook or in the Google Colab environment. In this article, all the code is going to be based on the Google Colab set-up and the full notebook is freely available at <a target="_blank" href="https://colab.research.google.com/drive/1G0X6ieL9_O5jbdpPPG72nulNhxKELwzd?usp=sharing">this link.</a></p>
<h2 id="heading-how-remo-improves-image-management">How Remo Improves Image Management</h2>
<p>There are a number of legacy open-source annotation tools for images available out there. <a target="_blank" href="https://github.com/tzutalin/labelImg">LabelImg</a> is one of the most popular ones. </p>
<p>Compared to these tools, Remo offers smart tools to annotate more efficiently (for example, shortcuts and xclick draw) and functionalities that help you collaborate and organize your work. You can mark images as Done or To Do, sort them and search them, and so on – which is very useful when you're working with thousands of images.</p>
<p>But dataset management is where Remo really innovates. At present, images in Computer Vision projects are usually stored as flat files on a local hard disk or in some cloud storage, and annotations are saved as raw XML/JSON/CSV files. </p>
<p>To visualize them, you would usually either open each file individually and try to imagine where annotations are, or plot them one by one in Python. </p>
<p>Instead, Remo gives you full control and visibility of all the data.</p>
<h2 id="heading-demonstration-of-how-remo-works">Demonstration of How Remo Works</h2>
<p>First of all, we need to install all the necessary dependencies. This can be easily done in Google Colab by running the following two lines of code:</p>
<pre><code class="lang-python">!pip install remo
!python -m remo_app init --colab
</code></pre>
<p>Once we've installed Remo, we can then create a dataset using some example images freely available on Amazon Web Services.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> remo
<span class="hljs-keyword">import</span> pandas

link = [<span class="hljs-string">'https://remo-scripts.s3-eu-west-1.amazonaws.com/open_images_sample_dataset.zip'</span>]

df = remo.create_dataset(name = <span class="hljs-string">'Example Images Dataset'</span>,
                    urls = link, 
                    annotation_task = <span class="hljs-string">"Object Detection"</span>)

<span class="hljs-comment"># Output</span>
<span class="hljs-comment"># Acquiring data - completed                          </span>
<span class="hljs-comment"># Processing annotation files: 1 of 1 files                                  </span>
<span class="hljs-comment"># Processing data - completed       </span>
<span class="hljs-comment"># Data upload completed</span>
</code></pre>
<p>By running the Remo <strong>list_datasets()</strong> command we can then easily check what datasets we currently have available.</p>
<pre><code class="lang-python">remo.list_datasets()

<span class="hljs-comment"># Output</span>
<span class="hljs-comment"># [Dataset  1 - 'Example Images Dataset' - 10 images]</span>
</code></pre>
<p>We are now ready to use Remo's graphical interface in order to inspect our dataset and see the different options available. </p>
<p>In Figure 1, you'll see a simple example of how easy it can be to visualize and annotate our data using Remo.</p>
<pre><code class="lang-python">df.view()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/11/remo.gif" alt="Image" width="600" height="400" loading="lazy">
<em>Figure 1: Remo's GUI Data Pre-processing</em></p>
<p>Another important advantage of using Remo is that it lets you quickly get key dataset statistics either through Python code or the user interface. </p>
<p>This can be particularly useful when you're trying to understand how annotations are distributed between different images and if the overall classes distribution is balanced or not.</p>
<pre><code class="lang-python">df.get_annotation_statistics()

<span class="hljs-comment"># Output</span>
<span class="hljs-comment"># [{'AnnotationSet ID': 1,</span>
<span class="hljs-comment">#  'AnnotationSet name': 'Object detection',</span>
<span class="hljs-comment">#  'creation_date': None,</span>
<span class="hljs-comment">#  'last_modified_date': '2020-11-28T22:04:48.263767Z',</span>
<span class="hljs-comment">#  'n_classes': 18,</span>
<span class="hljs-comment">#  'n_images': 10,</span>
<span class="hljs-comment">#  'n_objects': 98,</span>
<span class="hljs-comment">#  'top_3_classes': [{'count': 27, 'name': 'Fruit'},</span>
<span class="hljs-comment">#   {'count': 12, 'name': 'Sports equipment'},</span>
<span class="hljs-comment">#   {'count': 10, 'name': 'Human arm'}]}]</span>
</code></pre>
<p>You can see similar results by using Remo's Graphical Interface (Figure 2).</p>
<pre><code class="lang-python">df.view_annotation_stats()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/11/remo2.gif" alt="Image" width="600" height="400" loading="lazy">
<em>Figure 2: Remo's Statistics Functionalities</em></p>
<p>Finally, if you used the Remo interface to add annotations to the images in your dataset, you can export them automatically in CSV format using Remo's <strong>export_annotations_to_file()</strong> function, which lets you reuse them later. </p>
<pre><code class="lang-python">df.export_annotations_to_file(<span class="hljs-string">'images_annotations.zip'</span>, annotation_format=<span class="hljs-string">'csv'</span>, export_tags = <span class="hljs-literal">False</span>)
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>To summarize, some of the key functionalities provided by Remo are:</p>
<ul>
<li>Dataset management capabilities</li>
<li>Support for multiple file formats and computer vision tasks</li>
<li>User friendly interface and enhanced annotation tools</li>
<li>Easy collaboration on a project</li>
<li>Support for Virtual Machine use</li>
</ul>
<p>If you are interested in either finding out more about Remo (like how to integrate Remo with other frameworks such as PyTorch) or how to set up this workflow in a Jupyter Notebook environment, the <a target="_blank" href="https://remo.ai/docs/">official Remo documentation</a> is a great place to start. </p>
<p><em>I hope you enjoyed this article, thank you for reading!</em></p>
<h2 id="heading-contact-me">Contact me</h2>
<p>If you want to keep updated with my latest articles and projects, <a target="_blank" href="https://medium.com/@pierpaoloippolito28?source=post_page---------------------------">follow me on Medium</a> and subscribe to my <a target="_blank" href="http://eepurl.com/gwO-Dr?source=post_page---------------------------">mailing list</a>. Here are some of my contact details:</p>
<ul>
<li><a target="_blank" href="https://uk.linkedin.com/in/pier-paolo-ippolito-202917146?source=post_page---------------------------">Linkedin</a></li>
<li><a target="_blank" href="https://pierpaolo28.github.io/blog/?source=post_page---------------------------">Personal Blog</a></li>
<li><a target="_blank" href="https://pierpaolo28.github.io/?source=post_page---------------------------">Personal Website</a></li>
<li><a target="_blank" href="https://www.patreon.com/user?u=32155890">Patreon</a></li>
<li><a target="_blank" href="https://towardsdatascience.com/@pierpaoloippolito28?source=post_page---------------------------">Medium Profile</a></li>
<li><a target="_blank" href="https://github.com/pierpaolo28?source=post_page---------------------------">GitHub</a></li>
<li><a target="_blank" href="https://www.kaggle.com/pierpaolo28?source=post_page---------------------------">Kaggle</a></li>
</ul>
<p>Cover photo from <a target="_blank" href="https://remo.ai/">Remo documentation.</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create an Optical Character Reader Using Blazor and Azure Computer Vision ]]>
                </title>
                <description>
                    <![CDATA[ By Ankit Sharma Introduction In this article, we will create an optical character recognition (OCR) application using Blazor and the Azure Computer Vision Cognitive Service.  Computer Vision is an AI service that analyzes content in images. We will u... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-an-optical-character-reader-using-blazor-and-azure-computer-vision/</link>
                <guid isPermaLink="false">66d45dae33b83c4378a517ba</guid>
                
                    <category>
                        <![CDATA[ Aspnetcore ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Azure ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Blazor ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 03 Mar 2020 18:48:00 +0000</pubDate>
                <media:content url="https://cdn-media-2.freecodecamp.org/w1280/5f9c9c4c740569d1a4ca3143.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Ankit Sharma</p>
<h2 id="heading-introduction">Introduction</h2>
<p>In this article, we will create an optical character recognition (OCR) application using Blazor and the Azure Computer Vision Cognitive Service. </p>
<p>Computer Vision is an AI service that analyzes content in images. We will use the OCR feature of Computer Vision to detect the printed text in an image. </p>
<p>The application will extract the text from the image and detect the language of the text. Currently, the OCR API supports 25 languages.</p>
<p>A demo of the application is shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/BlazorComputerVision-1.gif" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li>Install the latest .NET Core 3.1 SDK from <a target="_blank" href="https://dotnet.microsoft.com/download/dotnet-core/3.1">https://dotnet.microsoft.com/download/dotnet-core/3.1</a></li>
<li>Install the latest version of Visual Studio 2019 from <a target="_blank" href="https://visualstudio.microsoft.com/downloads/">https://visualstudio.microsoft.com/downloads/</a></li>
<li>An Azure subscription account. You can create a free Azure account  at <a target="_blank" href="https://azure.microsoft.com/en-in/free/">https://azure.microsoft.com/en-in/free/</a></li>
</ul>
<h2 id="heading-image-requirements">Image requirements</h2>
<p>The OCR API will work on images that meet the requirements mentioned below:</p>
<ul>
<li>The format of the image must be JPEG, PNG, GIF, or BMP.</li>
<li>The size of the image must be between 50 x 50 and 4200 x 4200 pixels.</li>
<li>The image file size should be less than 4 MB.</li>
<li>The text in the image can be rotated by any multiple of 90 degrees plus a small angle of up to 40 degrees.</li>
</ul>
<h2 id="heading-source-code">Source Code</h2>
<p>You can get the source code from <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services">GitHub</a>.</p>
<h2 id="heading-create-the-azure-computer-vision-cognitive-service-resource">Create the Azure Computer Vision Cognitive Service resource</h2>
<p>Log in to the Azure portal, search for Cognitive Services in the search bar, and click on the result. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/CreateTextCogServ.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>On the next screen, click on the Add button. It will open the cognitive services marketplace page. </p>
<p>Search for Computer Vision in the search bar and click on the search result. It will open the Computer Vision API page. Click on the Create button to create a new Computer Vision resource. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/SelectComputerVisionCogServ-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>On the Create page, fill in the details as indicated below.</p>
<ul>
<li><strong>Name</strong>: Give a unique name for your resource.</li>
<li><strong>Subscription</strong>: Select the subscription type from the dropdown.</li>
<li><strong>Pricing tier</strong>: Select the pricing tier as per your choice.</li>
<li><strong>Resource group</strong>: Select an existing resource group or create a new one.</li>
</ul>
<p>Click on the Create button. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/BlazorConfigureComputerVisionCogServ.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>After your resource is successfully deployed, click on the “Go to resource” button. You can see the Key and the endpoint for the newly created Computer Vision resource. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/ComputerVisionCogServKey.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Make a note of the key and the endpoint. We will be using these in the latter part of this article to invoke the Computer Vision OCR API from the .NET Code. The values are masked here for privacy.</p>
<h2 id="heading-create-a-server-side-blazor-application">Create a Server-Side Blazor Application</h2>
<p>Open Visual Studio 2019, click on “Create a new project”. Select “Blazor App” and click on the “Next” button. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/CreateProject.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>On the next window, put the project name as <code>BlazorComputerVision</code> and click on the “Create” button. </p>
<p>The next window will ask you to select the type of Blazor app. Select <code>Blazor Server App</code> and click on the Create button to create a new server-side Blazor application. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/BlazorTemplate.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-installing-computer-vision-api-library">Installing Computer Vision API library</h2>
<p>We will install the Azure Computer Vision API library, which provides us with out-of-the-box models to handle the Computer Vision REST API response. </p>
<p>To install the package, navigate to Tools &gt;&gt; NuGet Package Manager &gt;&gt; Package Manager Console. It will open the Package Manager Console. Run the command as shown below.</p>
<pre><code>Install-Package Microsoft.Azure.CognitiveServices.Vision.ComputerVision -Version <span class="hljs-number">5.0</span><span class="hljs-number">.0</span>
</code></pre><p>You can learn more about this package at the <a target="_blank" href="https://www.nuget.org/packages/Microsoft.Azure.CognitiveServices.Vision.ComputerVision/">NuGet gallery</a>.</p>
<h2 id="heading-create-the-models">Create the Models</h2>
<p>Right-click on the <code>BlazorComputerVision</code> project and select Add &gt;&gt; New Folder. Name the folder as <code>Models</code>. Again, right-click on the <code>Models</code> folder and select Add &gt;&gt; Class to add a new class file. Put the name of your class as <code>LanguageDetails.cs</code> and click Add.</p>
<p>Open <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Models/LanguageDetails.cs"><code>LanguageDetails.cs</code></a> and put the following code inside it.</p>
<pre><code class="lang-c#"><span class="hljs-keyword">namespace</span> <span class="hljs-title">BlazorComputerVision.Models</span>
{
    <span class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> <span class="hljs-title">LanguageDetails</span>
    {
        <span class="hljs-keyword">public</span> <span class="hljs-keyword">string</span> Name { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }
        <span class="hljs-keyword">public</span> <span class="hljs-keyword">string</span> NativeName { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }
        <span class="hljs-keyword">public</span> <span class="hljs-keyword">string</span> Dir { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }
    }
}
</code></pre>
<p>Similarly, add a new class file <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Models/AvailableLanguage.cs"><code>AvailableLanguage.cs</code></a> and put the following code inside it.</p>
<pre><code class="lang-c#"><span class="hljs-keyword">using</span> System.Collections.Generic;

<span class="hljs-keyword">namespace</span> <span class="hljs-title">BlazorComputerVision.Models</span>
{
    <span class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> <span class="hljs-title">AvailableLanguage</span>
    {
        <span class="hljs-keyword">public</span> Dictionary&lt;<span class="hljs-keyword">string</span>, LanguageDetails&gt; Translation { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }
    }
}
</code></pre>
<p>Finally, we will add a class as a DTO (Data Transfer Object) for sending data back to the client. Add a new class file <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Models/OcrResultDTO.cs"><code>OcrResultDTO.cs</code></a> and put the following code inside it.</p>
<pre><code class="lang-c#"><span class="hljs-keyword">namespace</span> <span class="hljs-title">BlazorComputerVision.Models</span>
{
    <span class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> <span class="hljs-title">OcrResultDTO</span>
    {
        <span class="hljs-keyword">public</span> <span class="hljs-keyword">string</span> Language { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }

        <span class="hljs-keyword">public</span> <span class="hljs-keyword">string</span> DetectedText { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }
    }
}
</code></pre>
<h2 id="heading-create-the-computer-vision-service">Create the Computer Vision Service</h2>
<p>Right-click on the <code>BlazorComputerVision/Data</code> folder and select Add &gt;&gt; Class to add a new class file. Put the name of the file as <code>ComputerVisionService.cs</code> and click on Add.</p>
<p>Open the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Data/ComputerVisionService.cs"><code>ComputerVisionService.cs</code></a> file and put the following code inside it.</p>
<pre><code class="lang-c#"><span class="hljs-keyword">using</span> BlazorComputerVision.Models;
<span class="hljs-keyword">using</span> Microsoft.Azure.CognitiveServices.Vision.ComputerVision.Models;
<span class="hljs-keyword">using</span> Newtonsoft.Json;
<span class="hljs-keyword">using</span> Newtonsoft.Json.Linq;
<span class="hljs-keyword">using</span> System;
<span class="hljs-keyword">using</span> System.Net.Http;
<span class="hljs-keyword">using</span> System.Net.Http.Headers;
<span class="hljs-keyword">using</span> System.Text;
<span class="hljs-keyword">using</span> System.Threading.Tasks;

<span class="hljs-keyword">namespace</span> <span class="hljs-title">BlazorComputerVision.Data</span>
{
    <span class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> <span class="hljs-title">ComputerVisionService</span>
    {
        <span class="hljs-keyword">static</span> <span class="hljs-keyword">string</span> subscriptionKey;
        <span class="hljs-keyword">static</span> <span class="hljs-keyword">string</span> endpoint;
        <span class="hljs-keyword">static</span> <span class="hljs-keyword">string</span> uriBase;

        <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">ComputerVisionService</span>(<span class="hljs-params"></span>)</span>
        {
            subscriptionKey = <span class="hljs-string">"b993f3afb4e04119bd8ed37171d4ec71"</span>;
            endpoint = <span class="hljs-string">"https://ankitocrdemo.cognitiveservices.azure.com/"</span>;
            uriBase = endpoint + <span class="hljs-string">"vision/v2.1/ocr"</span>;
        }

        <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">async</span> Task&lt;OcrResultDTO&gt; <span class="hljs-title">GetTextFromImage</span>(<span class="hljs-params"><span class="hljs-keyword">byte</span>[] imageFileBytes</span>)</span>
        {
            StringBuilder sb = <span class="hljs-keyword">new</span> StringBuilder();
            OcrResultDTO ocrResultDTO = <span class="hljs-keyword">new</span> OcrResultDTO();
            <span class="hljs-keyword">try</span>
            {
                <span class="hljs-keyword">string</span> JSONResult = <span class="hljs-keyword">await</span> ReadTextFromStream(imageFileBytes);

                OcrResult ocrResult = JsonConvert.DeserializeObject&lt;OcrResult&gt;(JSONResult);

                <span class="hljs-keyword">if</span> (!ocrResult.Language.Equals(<span class="hljs-string">"unk"</span>))
                {
                    <span class="hljs-keyword">foreach</span> (OcrLine ocrLine <span class="hljs-keyword">in</span> ocrResult.Regions[<span class="hljs-number">0</span>].Lines)
                    {
                        <span class="hljs-keyword">foreach</span> (OcrWord ocrWord <span class="hljs-keyword">in</span> ocrLine.Words)
                        {
                            sb.Append(ocrWord.Text);
                            sb.Append(<span class="hljs-string">' '</span>);
                        }
                        sb.AppendLine();
                    }
                }
                <span class="hljs-keyword">else</span>
                {
                    sb.Append(<span class="hljs-string">"This language is not supported."</span>);
                }
                ocrResultDTO.DetectedText = sb.ToString();
                ocrResultDTO.Language = ocrResult.Language;
                <span class="hljs-keyword">return</span> ocrResultDTO;
            }
            <span class="hljs-keyword">catch</span>
            {
                ocrResultDTO.DetectedText = <span class="hljs-string">"Error occurred. Try again"</span>;
                ocrResultDTO.Language = <span class="hljs-string">"unk"</span>;
                <span class="hljs-keyword">return</span> ocrResultDTO;
            }
        }

        <span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">async</span> Task&lt;<span class="hljs-keyword">string</span>&gt; <span class="hljs-title">ReadTextFromStream</span>(<span class="hljs-params"><span class="hljs-keyword">byte</span>[] byteData</span>)</span>
        {
            <span class="hljs-keyword">try</span>
            {
                HttpClient client = <span class="hljs-keyword">new</span> HttpClient();
                client.DefaultRequestHeaders.Add(<span class="hljs-string">"Ocp-Apim-Subscription-Key"</span>, subscriptionKey);
                <span class="hljs-keyword">string</span> requestParameters = <span class="hljs-string">"language=unk&amp;detectOrientation=true"</span>;
                <span class="hljs-keyword">string</span> uri = uriBase + <span class="hljs-string">"?"</span> + requestParameters;
                HttpResponseMessage response;

                <span class="hljs-keyword">using</span> (ByteArrayContent content = <span class="hljs-keyword">new</span> ByteArrayContent(byteData))
                {
                    content.Headers.ContentType = <span class="hljs-keyword">new</span> MediaTypeHeaderValue(<span class="hljs-string">"application/octet-stream"</span>);
                    response = <span class="hljs-keyword">await</span> client.PostAsync(uri, content);
                }

                <span class="hljs-keyword">string</span> contentString = <span class="hljs-keyword">await</span> response.Content.ReadAsStringAsync();
                <span class="hljs-keyword">string</span> result = JToken.Parse(contentString).ToString();
                <span class="hljs-keyword">return</span> result;
            }
            <span class="hljs-keyword">catch</span> (Exception e)
            {
                <span class="hljs-keyword">return</span> e.Message;
            }
        }

        <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">async</span> Task&lt;AvailableLanguage&gt; <span class="hljs-title">GetAvailableLanguages</span>(<span class="hljs-params"></span>)</span>
        {
            <span class="hljs-keyword">string</span> endpoint = <span class="hljs-string">"https://api.cognitive.microsofttranslator.com/languages?api-version=3.0&amp;scope=translation"</span>;
            <span class="hljs-keyword">var</span> client = <span class="hljs-keyword">new</span> HttpClient();
            <span class="hljs-keyword">using</span> (<span class="hljs-keyword">var</span> request = <span class="hljs-keyword">new</span> HttpRequestMessage())
            {
                request.Method = HttpMethod.Get;
                request.RequestUri = <span class="hljs-keyword">new</span> Uri(endpoint);
                <span class="hljs-keyword">var</span> response = <span class="hljs-keyword">await</span> client.SendAsync(request).ConfigureAwait(<span class="hljs-literal">false</span>);
                <span class="hljs-keyword">string</span> result = <span class="hljs-keyword">await</span> response.Content.ReadAsStringAsync();

                AvailableLanguage deserializedOutput = JsonConvert.DeserializeObject&lt;AvailableLanguage&gt;(result);

                <span class="hljs-keyword">return</span> deserializedOutput;
            }
        }
    }
}
</code></pre>
<p>In the constructor of the class, we have initialized the key and the endpoint URL for the OCR API.</p>
<p>Inside the <code>ReadTextFromStream</code> method, we create an <code>HttpClient</code> and send a POST request with the image bytes as the request content, passing the subscription key in the request header. The OCR API will return a JSON object containing each word from the image as well as the detected language of the text.</p>
<p>The <code>GetTextFromImage</code> method accepts the image data as a byte array and returns an object of type <code>OcrResultDTO</code>. We invoke the <code>ReadTextFromStream</code> method and deserialize the response into an object of type <code>OcrResult</code>. We then form the sentence by iterating over the <code>OcrWord</code> objects.</p>
<p>The <code>GetAvailableLanguages</code> method will return the list of all the languages supported by the Translate Text API. We set the request URI and create an <code>HttpRequestMessage</code>, which will be a GET request. This request returns a JSON object, which is deserialized into an object of type <code>AvailableLanguage</code>.</p>
<h2 id="heading-why-do-we-need-to-fetch-the-list-of-supported-languages">Why do we need to fetch the list of supported languages?</h2>
<p>The OCR API returns the language code (e.g. en for English, de for German, etc.) of the detected language. But we cannot display the language code on the UI as it is not user-friendly. Therefore, we need a dictionary to look up the language name corresponding to the language code.</p>
<p>The Azure Computer Vision OCR API supports 25 languages. To know all the languages supported by OCR API see the list of <a target="_blank" href="https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/language-support">supported languages</a>. </p>
<p>These languages are a subset of the languages supported by the Azure Translate Text API. Since there is no dedicated API endpoint to fetch the list of languages supported by the OCR API, we are using the Translate Text API endpoint to fetch the list of languages. </p>
<p>We will create the language lookup dictionary using the JSON response from this API call and filter the result based on the language code returned by the OCR API.</p>
<h2 id="heading-install-blazorinputfile-nuget-package">Install BlazorInputFile NuGet package</h2>
<p><a target="_blank" href="https://www.nuget.org/packages/BlazorInputFile/">BlazorInputFile</a> is a file input component for Blazor applications. It provides the ability to upload single or multiple files to a Blazor app.</p>
<p>Open the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/BlazorComputerVision.csproj#L8"><code>BlazorComputerVision.csproj</code></a> file and add a dependency for the <code>BlazorInputFile</code> package as shown below:</p>
<pre><code class="lang-xhtml"><span class="hljs-tag">&lt;<span class="hljs-name">ItemGroup</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">PackageReference</span> <span class="hljs-attr">Include</span>=<span class="hljs-string">"BlazorInputFile"</span> <span class="hljs-attr">Version</span>=<span class="hljs-string">"0.1.0-preview-00002"</span> /&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">ItemGroup</span>&gt;</span>
</code></pre>
<p>Open the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Pages/_Host.cshtml#L17"><code>BlazorComputerVision\Pages\_Host.cshtml</code></a> file and add the reference for the package’s JavaScript file by adding the following line in the <code>&lt;head&gt;</code> section.</p>
<pre><code class="lang-js">&lt;script src=<span class="hljs-string">"_content/BlazorInputFile/inputfile.js"</span>&gt;&lt;/script&gt;
</code></pre>
<p>Add the following line in the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/_Imports.razor#L10"><code>_Imports.razor</code></a> file.</p>
<pre><code>@using BlazorInputFile
</code></pre><h2 id="heading-configuring-the-service"><strong>Configuring the Service</strong></h2>
<p>To make the service available to the components, we need to configure it in the server-side app. Open the Startup.cs file and add the following line inside the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Startup.cs#L31"><code>ConfigureServices</code></a> method of the Startup class.</p>
<pre><code class="lang-c#"> services.AddSingleton&lt;ComputerVisionService&gt;();
</code></pre>
<h2 id="heading-creating-the-blazor-ui-component">Creating the Blazor UI Component</h2>
<p>We will add the Razor page in the <code>BlazorComputerVision/Pages</code> folder. By default, we have “Counter” and “Fetch Data” pages provided in our application. These default pages will not affect our application, but for the sake of this tutorial, we will delete the fetchdata and counter pages from the <code>BlazorComputerVision/Pages</code> folder.</p>
<p>Right-click on the <code>BlazorComputerVision/Pages</code> folder and then select Add &gt;&gt; New Item. An “Add New Item” dialog box will open. Select “Visual C#” from the left panel, then select “Razor Component” from the templates panel and name it <code>OCR.razor</code>. Click Add. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/BlazorOCRComponent.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We will add a code-behind file for this razor page to keep the code and presentation separate. This will allow easy maintenance for the application.  </p>
<p>Right-click on the <code>BlazorComputerVision/Pages</code> folder and then select Add &gt;&gt; Class. Name the class <code>OCR.razor.cs</code>. The Blazor framework is smart enough to associate this class file with the razor file. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/BlazorCodeBehind.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-blazor-ui-component-code-behind">Blazor UI component code behind</h2>
<p>Open the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Pages/OCR.razor.cs"><code>OCR.razor.cs</code></a> file and put the following code inside it.</p>
<pre><code class="lang-c#"><span class="hljs-keyword">using</span> Microsoft.AspNetCore.Components;
<span class="hljs-keyword">using</span> System;
<span class="hljs-keyword">using</span> System.Collections.Generic;
<span class="hljs-keyword">using</span> System.Linq;
<span class="hljs-keyword">using</span> System.Threading.Tasks;
<span class="hljs-keyword">using</span> System.IO;
<span class="hljs-keyword">using</span> BlazorComputerVision.Models;
<span class="hljs-keyword">using</span> BlazorInputFile;
<span class="hljs-keyword">using</span> BlazorComputerVision.Data;

<span class="hljs-keyword">namespace</span> <span class="hljs-title">BlazorComputerVision.Pages</span>
{
    <span class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> <span class="hljs-title">OCRModel</span> : <span class="hljs-title">ComponentBase</span>
    {
        [<span class="hljs-meta">Inject</span>]
        <span class="hljs-keyword">protected</span> ComputerVisionService computerVisionService { <span class="hljs-keyword">get</span>; <span class="hljs-keyword">set</span>; }

        <span class="hljs-keyword">protected</span> <span class="hljs-keyword">string</span> DetectedTextLanguage;
        <span class="hljs-keyword">protected</span> <span class="hljs-keyword">string</span> imagePreview;
        <span class="hljs-keyword">protected</span> <span class="hljs-keyword">bool</span> loading = <span class="hljs-literal">false</span>;
        <span class="hljs-keyword">byte</span>[] imageFileBytes;

        <span class="hljs-keyword">const</span> <span class="hljs-keyword">string</span> DefaultStatus = <span class="hljs-string">"Maximum size allowed for the image is 4 MB"</span>;
        <span class="hljs-keyword">protected</span> <span class="hljs-keyword">string</span> status = DefaultStatus;

        <span class="hljs-keyword">protected</span> OcrResultDTO Result = <span class="hljs-keyword">new</span> OcrResultDTO();
        <span class="hljs-keyword">private</span> AvailableLanguage availableLanguages;
        <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-title">Dictionary</span>&lt;<span class="hljs-title">string</span>, <span class="hljs-title">LanguageDetails</span>&gt; LanguageList</span> = <span class="hljs-keyword">new</span> Dictionary&lt;<span class="hljs-keyword">string</span>, LanguageDetails&gt;();
        <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> MaxFileSize = <span class="hljs-number">4</span> * <span class="hljs-number">1024</span> * <span class="hljs-number">1024</span>; <span class="hljs-comment">// 4MB</span>

        <span class="hljs-function"><span class="hljs-keyword">protected</span> <span class="hljs-keyword">override</span> <span class="hljs-keyword">async</span> Task <span class="hljs-title">OnInitializedAsync</span>(<span class="hljs-params"></span>)</span>
        {
            availableLanguages = <span class="hljs-keyword">await</span> computerVisionService.GetAvailableLanguages();
            LanguageList = availableLanguages.Translation;
        }

        <span class="hljs-function"><span class="hljs-keyword">protected</span> <span class="hljs-keyword">async</span> Task <span class="hljs-title">ViewImage</span>(<span class="hljs-params">IFileListEntry[] files</span>)</span>
        {
            <span class="hljs-keyword">var</span> file = files.FirstOrDefault();
            <span class="hljs-keyword">if</span> (file == <span class="hljs-literal">null</span>)
            {
                <span class="hljs-keyword">return</span>;
            }
            <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (file.Size &gt; MaxFileSize)
            {
                status = <span class="hljs-string">$"The file size is <span class="hljs-subst">{file.Size}</span> bytes, this is more than the allowed limit of <span class="hljs-subst">{MaxFileSize}</span> bytes."</span>;
                <span class="hljs-keyword">return</span>;
            }
            <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (!file.Type.Contains(<span class="hljs-string">"image"</span>))
            {
                status = <span class="hljs-string">"Please uplaod a valid image file"</span>;
                <span class="hljs-keyword">return</span>;
            }
            <span class="hljs-keyword">else</span>
            {
                <span class="hljs-keyword">var</span> memoryStream = <span class="hljs-keyword">new</span> MemoryStream();
                <span class="hljs-keyword">await</span> file.Data.CopyToAsync(memoryStream);
                imageFileBytes = memoryStream.ToArray();
                <span class="hljs-keyword">string</span> base64String = Convert.ToBase64String(imageFileBytes, <span class="hljs-number">0</span>, imageFileBytes.Length);

                imagePreview = <span class="hljs-keyword">string</span>.Concat(<span class="hljs-string">"data:image/png;base64,"</span>, base64String);
                memoryStream.Flush();
                status = DefaultStatus;
            }
        }

        <span class="hljs-function"><span class="hljs-keyword">protected</span> <span class="hljs-keyword">private</span> <span class="hljs-keyword">async</span> Task <span class="hljs-title">GetText</span>(<span class="hljs-params"></span>)</span>
        {
            <span class="hljs-keyword">if</span> (imageFileBytes != <span class="hljs-literal">null</span>)
            {
                loading = <span class="hljs-literal">true</span>;
                Result = <span class="hljs-keyword">await</span> computerVisionService.GetTextFromImage(imageFileBytes);
                <span class="hljs-keyword">if</span> (LanguageList.ContainsKey(Result.Language))
                {
                    DetectedTextLanguage = LanguageList[Result.Language].Name;
                }
                <span class="hljs-keyword">else</span>
                {
                    DetectedTextLanguage = <span class="hljs-string">"Unknown"</span>;
                }
                loading = <span class="hljs-literal">false</span>;
            }
        }
    }
}
</code></pre>
<p>We inject the <code>ComputerVisionService</code> into this class.</p>
<p><code>OnInitializedAsync</code> is a Blazor lifecycle method that is invoked when the component is initialized. Inside it, we invoke the <code>GetAvailableLanguages</code> method of our service and use the result to initialize <code>LanguageList</code>, a dictionary that holds the details of the available languages.</p>
<p>Inside the <code>ViewImage</code> method, we check that the uploaded file is an image and that its size is less than 4 MB. We copy the uploaded image to a memory stream and then convert that memory stream to a byte array.</p>
<p>To set the image preview, we convert the image from a byte array to a base64-encoded string. The <code>GetText</code> method invokes the <code>GetTextFromImage</code> method of the service, passing the image byte array as an argument. We look up the language name in the dictionary based on the language code returned from the service. If the language code is not found, we set the language to Unknown.</p>
<h2 id="heading-blazor-ui-component-template">Blazor UI component template</h2>
<p>Open the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Pages/OCR.razor">OCR.razor</a> file and put the following code inside it.</p>
<pre><code class="lang-html">@page "/computer-vision-ocr"
@inherits OCRModel

<span class="hljs-tag">&lt;<span class="hljs-name">h2</span>&gt;</span>Optical Character Recognition (OCR) Using Blazor and Azure Computer Vision Cognitive Service<span class="hljs-tag">&lt;/<span class="hljs-name">h2</span>&gt;</span>

<span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"row"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"col-md-5"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">textarea</span> <span class="hljs-attr">disabled</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"form-control"</span> <span class="hljs-attr">rows</span>=<span class="hljs-string">"10"</span> <span class="hljs-attr">cols</span>=<span class="hljs-string">"15"</span>&gt;</span>@Result.DetectedText<span class="hljs-tag">&lt;/<span class="hljs-name">textarea</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">hr</span> /&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"row"</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"col-sm-5"</span>&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">label</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">strong</span>&gt;</span> Detected Language :<span class="hljs-tag">&lt;/<span class="hljs-name">strong</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">label</span>&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"col-sm-6"</span>&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">input</span> <span class="hljs-attr">disabled</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"text"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"form-control"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@DetectedTextLanguage"</span> /&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"col-md-5"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"image-container"</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">img</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"preview-image"</span> <span class="hljs-attr">src</span>=<span class="hljs-string">@imagePreview</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">InputFile</span> <span class="hljs-attr">OnChange</span>=<span class="hljs-string">"ViewImage"</span> /&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">p</span>&gt;</span>@status<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">hr</span> /&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">disabled</span>=<span class="hljs-string">"@loading"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"btn btn-primary btn-lg"</span> @<span class="hljs-attr">onclick</span>=<span class="hljs-string">"GetText"</span>&gt;</span>
            @if (loading)
            {
                <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"spinner-border spinner-border-sm mr-1"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
            }
            Extract Text
        <span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
</code></pre>
<p>We have defined the route for this component and inherited the <code>OCRModel</code> class, which allows us to access the properties and methods of that class from the template. Bootstrap is used for designing the UI. We have a text area to display the detected text and a text box to display the detected language. The image tag shows the image preview after an image is uploaded. The <code>&lt;InputFile&gt;</code> component allows us to upload an image file and invokes the <code>ViewImage</code> method as soon as we upload an image.</p>
<h2 id="heading-add-styling-for-the-component">Add styling for the component</h2>
<p>Navigate to the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/wwwroot/css/site.css#L185-L197">BlazorComputerVision\wwwroot\css\site.css</a> file and add the following style definitions inside it.</p>
<pre><code class="lang-css"><span class="hljs-selector-class">.preview-image</span> {
    <span class="hljs-attribute">max-height</span>: <span class="hljs-number">300px</span>;
    <span class="hljs-attribute">max-width</span>: <span class="hljs-number">300px</span>;
}
<span class="hljs-selector-class">.image-container</span> {
    <span class="hljs-attribute">display</span>: flex;
    <span class="hljs-attribute">padding</span>: <span class="hljs-number">15px</span>;
    <span class="hljs-attribute">align-content</span>: center;
    <span class="hljs-attribute">align-items</span>: center;
    <span class="hljs-attribute">justify-content</span>: center;
    <span class="hljs-attribute">border</span>: <span class="hljs-number">2px</span> dashed skyblue;
}
</code></pre>
<h2 id="heading-adding-link-to-navigation-menu"><strong>Adding Link to Navigation menu</strong></h2>
<p>The last step is to add a link to our OCR component in the navigation menu. Open the <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services/blob/master/BlazorComputerVision/Shared/NavMenu.razor#L15-L19">BlazorComputerVision/Shared/NavMenu.razor</a> file and add the following code into it.</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">li</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"nav-item px-3"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">NavLink</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"nav-link"</span> <span class="hljs-attr">href</span>=<span class="hljs-string">"computer-vision-ocr"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"oi oi-list-rich"</span> <span class="hljs-attr">aria-hidden</span>=<span class="hljs-string">"true"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span> Computer Vision
    <span class="hljs-tag">&lt;/<span class="hljs-name">NavLink</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">li</span>&gt;</span>
</code></pre>
<p>Remove the navigation links for Counter and Fetch-data components as they are not required for this application.</p>
<h2 id="heading-execution-demo">Execution Demo</h2>
<p>Press F5 to launch the application. Click on the Computer Vision button on the nav menu on the left. On the next page, upload an image with some text in it and click on the “Extract Text” button. You will see the extracted text in the text area on the left along with the detected language for the text. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Execution_English.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Now we will upload an image with some French text on it. You can see the extracted text, and the detected language is French. Refer to the image shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Execution_French.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>If we try to upload an image with text in an unsupported language, we will get an error. Refer to the image shown below, where an image with text written in Hindi is uploaded.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Execution_Hindi.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-summary"><strong>Summary</strong></h2>
<p>We have created an optical character recognition (OCR) application using Blazor and the Computer Vision Azure Cognitive Service. We added the ability to upload an image file using the <code>BlazorInputFile</code> component. The application extracts the printed text from the uploaded image and recognizes the language of the text. We used the OCR API of Computer Vision, which can recognize printed text in 25 languages.</p>
<p>Get the Source code from <a target="_blank" href="https://github.com/AnkitSharma-007/Blazor-Computer-Vision-Azure-Cognitive-Services">GitHub</a> and play around to get a better understanding.</p>
<h2 id="heading-see-also">See Also</h2>
<ul>
<li><a target="_blank" href="https://ankitsharmablogs.com/multi-language-translator-using-blazor-and-azure-cognitive-services/">Multi-Language Translator Using Blazor And Azure Cognitive Services</a></li>
<li><a target="_blank" href="https://ankitsharmablogs.com/facebook-authentication-and-authorization-in-server-side-blazor-app/">Facebook Authentication And Authorization In Server-Side Blazor App</a></li>
<li><a target="_blank" href="https://ankitsharmablogs.com/google-authentication-and-authorization-in-server-side-blazor-app/">Google Authentication And Authorization In Server-Side Blazor App</a></li>
<li><a target="_blank" href="https://ankitsharmablogs.com/policy-based-authorization-in-angular-using-jwt/">Policy-Based Authorization In Angular Using JWT</a></li>
<li><a target="_blank" href="https://ankitsharmablogs.com/continuous-deployment-for-angular-app-using-heroku-and-github/">Continuous Deployment For Angular App Using Heroku And GitHub</a></li>
<li><a target="_blank" href="https://ankitsharmablogs.com/hosting-a-blazor-application-on-firebase/">Hosting A Blazor Application on Firebase</a></li>
<li><a target="_blank" href="https://ankitsharmablogs.com/deploying-a-blazor-application-on-azure/">Deploying A Blazor Application On Azure</a></li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Train an Image Classifier and Teach Your Computer Japanese ]]>
                </title>
                <description>
                    <![CDATA[ By Ajay Uppili Arasanipalai Introduction Hi. Hello. こんにちは Those squiggly characters you just saw are from a language called Japanese. You’ve probably heard of it if you’ve ever watched Dragon Ball Z. _Source_ Here’s the problem though: you know thos... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-teach-your-computer-japanese/</link>
                <guid isPermaLink="false">66d45d5ed7a4e35e38434932</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ algorithms ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ fastai,  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Sun, 21 Jul 2019 16:30:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2019/07/kmnist.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Ajay Uppili Arasanipalai</p>
<h2 id="heading-introduction">Introduction</h2>
<p>Hi. Hello. こんにちは</p>
<p>Those squiggly characters you just saw are from a language called Japanese. You’ve probably heard of it if you’ve ever watched Dragon Ball Z.</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951d9e7d17d10d2b65c0b_100881-dragon-ball-z-dragon-ball-fight.gif" alt="Image" width="600" height="400" loading="lazy">
_<a href="http://fanaru.com/dragon-ball-z/image/100881-dragon-ball-z-dragon-ball-fight.gif" target="_blank">Source</a>_</p>
<p>Here’s the problem though: you know those ancient Japanese scrolls that make you look like you’re going to unleash an ultimate samurai ninja overlord super combo move?</p>
<p>Yeah, those. I can’t exactly read them, and it turns out that very few people can.</p>
<p>Luckily, a bunch of smart people understand how important it is that I master the Bijudama-Rasenshuriken, so they invented this thing called deep learning.</p>
<p>So pack your ramen and get ready. In this article, I’ll show you how to train a neural network that can accurately predict Japanese characters from their images.</p>
<p>To ensure that we get good results, I’m going to use an incredible deep learning library called fastai, a wrapper around PyTorch that makes it easy to implement best practices from modern research. You can read more about it on their <a target="_blank" href="https://docs.fast.ai">docs</a>.</p>
<p>With that said, let’s get started.</p>
<h2 id="heading-kmnist">KMNIST</h2>
<p>OK, so before we can create anime subtitles, we’re going to need a dataset. Today we’re going to focus on KMNIST.</p>
<p>This dataset takes examples of characters from the Japanese Kuzushiji script and organizes them into 10 labeled classes. The images measure 28x28 pixels, and there are 70,000 images in total, mirroring the structure of MNIST.</p>
<p>But why KMNIST? Well firstly, it has “MNIST” in its name, and we all know how much people in machine learning love MNIST.</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951dabfaf722c2f74abc4_kmnist.png" alt="Image" width="600" height="400" loading="lazy">
_<a href="http://codh.rois.ac.jp/img/kmnist.png" target="_blank">Source</a>_</p>
<p>So  in theory, you could just change a few lines of that Keras code that  you copy-pasted from Stack Overflow and BOOM! You now have computer code  that can <a target="_blank" href="https://www.wandb.com/articles/collaborative-deep-learning-for-reading-japanese">revive an ancient Japanese script</a>.</p>
<p>Of  course, in practice, it isn’t that simple. For starters, the cute  little model that you trained on MNIST probably won’t do that well.  Because, you know, figuring out whether a number is a 2 or a 5 is just a  tad easier than deciphering a forgotten cursive script that only a  handful of people on earth know how to read.</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951d9e7d17dd802b65c0c_giphy.gif" alt="Animated GIF showing " width="600" height="400" loading="lazy"></p>
<p>Apart from that, I guess I should point out that Kuzushiji, which is what the “K” in KMNIST stands for, is not just 10 characters long. Unfortunately, I’m <strong>NOT</strong> one of the handful of experts who can read the language, so I can’t describe in intricate detail how it works.</p>
<p>But  here’s what I do know: There are actually three variants of these  Kuzushiji character datasets — KMNIST, Kuzushiji-49, and  Kuzushiji-Kanji.</p>
<p>Kuzushiji-49 is a variant with 49 classes instead of 10. Kuzushiji-Kanji is even more insane, with a whopping 3,832 classes.</p>
<p>Yep, you read that right. That’s nearly four times as many classes as ImageNet.</p>
<h2 id="heading-how-to-not-mess-up-your-dataset">How to Not Mess Up Your Dataset</h2>
<p>To  keep things as MNIST-y as possible, it looks like the researchers who  put out the KMNIST dataset kept it in the original format (man, they  really took that whole MNIST thing to heart, didn’t they).</p>
<p>If you take a look at <a target="_blank" href="https://github.com/rois-codh/kmnist">the KMNIST GitHub repo</a>, you’ll see that the dataset is served in two formats: the original MNIST thing, and as a bunch of Numpy arrays.</p>
<p>Of course, I know you were probably too lazy to click that link. So here you go. You can thank me later.</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951da80b61183001c8a76_s_F44ECE835EEC7420BCA13A4D3C5E89A345A0759A53AF1A997CB55909898D5236_1561571516749_Screenshot%2B2019-06-26%2Bat%2B11.20.09%2BPM.png" alt="GitHub screenshot showing the various download formats for the KMNIST dataset" width="600" height="400" loading="lazy">
_<a href="https://github.com/rois-codh/kmnist" target="_blank">Source</a>_</p>
<p>Personally,  I found the NumPy array format easier to work with when using fastai,  but the choice is yours. If you’re using PyTorch, KMNIST comes for free  as a part of <a target="_blank" href="https://pytorch.org/docs/stable/torchvision/datasets.html?highlight=kmnist#kmnist"><code>torchvision.datasets</code></a>.</p>
<p>The  next challenge is actually getting those 10,000-year-old brush strokes  onto your notebook (or IDE, who am I to judge). Luckily, the GitHub repo  mentions that there’s this handy script called <code>download_data.py</code> that’ll  do all the work for us. Yay!</p>
<p><img src="https://paper.dropboxstatic.com/static/img/ace/emoji/1f389.png?version=3.1.2" alt="party popper" width="600" height="400" loading="lazy"></p>
<p>From here, it’ll probably start getting awkward if I continue talking  about how to pre-process your data without actual code. So check out <a target="_blank" href="https://colab.research.google.com/gist/iyaja/fe102ae34312e48e637edd804a450207/kmnist.ipynb">the notebook</a> if you want to dive deeper.</p>
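<p>For the curious, here’s roughly what the NumPy route looks like. This is a minimal sketch, assuming the <code>.npz</code> filenames that <code>download_data.py</code> fetches from the KMNIST repo, and that (as in the official archives) each file stores its single array under the <code>'arr_0'</code> key:</p>
<pre><code class="lang-python">import numpy as np

# Load the KMNIST arrays fetched by download_data.py
train_images = np.load('kmnist-train-imgs.npz')['arr_0']    # (60000, 28, 28)
train_labels = np.load('kmnist-train-labels.npz')['arr_0']  # (60000,)
test_images = np.load('kmnist-test-imgs.npz')['arr_0']      # (10000, 28, 28)
test_labels = np.load('kmnist-test-labels.npz')['arr_0']    # (10000,)

print(train_images.shape, train_labels.shape)
</code></pre>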
<p>Moving on…</p>
<h2 id="heading-should-i-use-a-hyper-ultra-inception-resnet-xxxl">Should I use a hyper ultra Inception ResNet XXXL?‍</h2>
<h3 id="heading-short-answer">Short Answer</h3>
<p>Probably not. A regular ResNet should be fine.</p>
<h3 id="heading-a-little-less-short-answer">A Little Less Short Answer</h3>
<p>Ok, look. By now, you’re probably thinking, “KMNIST big. KMNIST hard. Me need to use very new, very fancy model.”</p>
<p>Did I overdo the Bizzaro voice?</p>
<p>The point is, you <strong>DON’T</strong> need a shiny new model to do well on these image classification tasks.  At best, you’ll probably get a marginal accuracy improvement at the cost  of a whole lot of time and money.</p>
<p>Most of the time, you’ll just waste a whole lot of time and money.</p>
<p>So heed my advice: just stick to good ol’ fashioned ResNets. They work really well, they’re relatively fast and lightweight (compared to some of the other memory hogs like Inception and DenseNet), and best of all, people have been using them for a while, so it shouldn’t be too hard to fine-tune them.</p>
<p>If the  dataset you’re working with is simple like MNIST, use ResNet18. If it’s  medium-difficulty, like CIFAR10, use ResNet34. If it’s really hard,  like ImageNet, use ResNet50. If it’s harder than that, you can probably  afford to use something better than a ResNet.</p>
<p>Don’t believe me? Check out my leading entry for the Stanford DAWNBench competition from April 2019:</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951dae06d1bb7f8a731ba_s_F44ECE835EEC7420BCA13A4D3C5E89A345A0759A53AF1A997CB55909898D5236_1561651063683_D4s0U9_UwAA3vk2.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>What do you see? ResNets everywhere! Now come on, there’s got to be a reason for that.</p>
<h2 id="heading-hyperparameters-galore">Hyperparameters Galore</h2>
<p>A few months ago, I wrote an article on <a target="_blank" href="https://blog.nanonets.com/hyperparameter-optimization/">how to pick the right hyperparameters</a>.  If you’re interested in a more general solution to this herculean task,  go check that out. Here, I’m going to walk you through my process of  picking good-enough hyperparameters to get good-enough results on  KMNIST.</p>
<p>To start off, let’s go over what hyperparameters we need to tune.</p>
<p>We’ve  already decided to use a ResNet34, so that’s that. We don’t need to  figure out the number of layers, filter size, number of filters, etc.  since that comes baked into our model.</p>
<p>See, I told you it would save time.</p>
<p>So  what’s remaining is the big three: learning rate, batch size, and the  number of epochs (plus stuff like dropout probability for which we can  just use the default values).</p>
<p>Let’s go over them one by one.</p>
<h3 id="heading-number-of-epochs">Number of Epochs</h3>
<p>Let’s  start with the number of epochs. As you’ll come to see when you play  around with the model in the notebook, our training is pretty efficient.  We can easily cross 90% accuracy within a few minutes.</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2955a222d8c93af99209f9_Screenshot%202019-07-13%20at%209.23.20%20AM.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>So  given that our training is so fast in the first place, it seems  extremely unlikely that we would use too many epochs and overfit. I’ve  seen other KMNIST models train for over 50 epochs without any issues, so  staying in the 0-30 range should be absolutely fine.</p>
<p>That  means within the scope of the restrictions we’ve put on the model when  it comes to epochs, the more, the merrier. In my experiments, I found  that 10 epochs strike a good balance between model accuracy and training  time.</p>
<h3 id="heading-learning-rate">Learning Rate</h3>
<p>What  I’m about to say is going to piss a lot of people off. But I’ll say it  anyway — We don’t need to pay too much attention to the learning rate.</p>
<p>Yep, you heard me right. But give me a chance to explain.</p>
<p>Instead  of going “Hmm… that doesn’t seem to work, let’s try again with lr=3e-3  ,” we’re going to use a much more systematic and disciplined approach to  finding a good learning rate.</p>
<p>We’re going to use the learning rate finder, a revolutionary idea proposed by Leslie Smith in his <a target="_blank" href="https://arxiv.org/pdf/1506.01186.pdf">paper on cyclical learning rates</a>.</p>
<p>Here’s how it works:</p>
<ul>
<li>First,  we set up our model and prepare to train it for one epoch. As the model  is training, we’ll gradually increase the learning rate.</li>
<li>Along the way, we’ll keep track of the loss at every iteration.</li>
<li>Finally, we select the learning rate that corresponds to the lowest loss.</li>
</ul>
<p>When all is said and done, and you plot the loss against the learning rate, you should see something like this:</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951dabfaf72811274abb2_s_F44ECE835EEC7420BCA13A4D3C5E89A345A0759A53AF1A997CB55909898D5236_1561652469267_Unknown.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Now, before you get all giddy and pick 1e-01 as the learning rate, I’ll have you know that it’s <strong>NOT</strong> the best choice.</p>
<p>That’s because fastai implements a smoothing technique called exponentially weighted averaging, which is the deep learning researcher version of an Instagram filter. It prevents our plots from looking like the result of giving your neighbors’ kid too much time with a blue crayon.</p>
<p><img src="https://assets.website-files.com/5ac6b7f2924c652fd013a891/5d2951dae06d1b5fb4a731bb_art2_loss_vs_lr.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Since  we’re using a form of averaging to make the plot look smooth, the  “minimum” point that you’re looking at on the learning rate finder isn’t  actually a minimum. It’s an average.</p>
<p>Instead, to <em>actually</em> find the learning rate, a good rule of thumb is to pick the learning  rate that’s an order of magnitude lower than the minimum point on the  smoothened plot. That tends to work really well in practice.</p>
<p>I  understand that all this plotting and averaging might seem weird if all  you’ve been brute-forcing learning rate values all your life. So I’d  advise you to check out <a target="_blank" href="https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html">Sylvain Gugger’s explanation of the learning rate finder</a> to learn more.</p>
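<p>To make all of this concrete, here’s a minimal sketch of the whole routine in fastai v1 (the version current at the time of writing). I’m using fastai’s small MNIST sample as a stand-in for the KMNIST DataBunch built in the notebook, and the 1e-2 value is just an example of the order-of-magnitude rule:</p>
<pre><code class="lang-python">from fastai.vision import *

# Stand-in data: swap in the KMNIST DataBunch from the notebook
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path, bs=128)

learn = cnn_learner(data, models.resnet34, metrics=accuracy)
learn.lr_find()        # train briefly while gradually increasing the LR
learn.recorder.plot()  # plot the smoothed loss against the learning rate

# If the smoothed minimum sits around 1e-1, go an order of magnitude
# lower and train with the one-cycle policy:
learn.fit_one_cycle(10, max_lr=1e-2)
</code></pre>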
<h3 id="heading-batch-size">Batch Size</h3>
<p>OK, you caught me red-handed here. My initial experiments used a batch size of 128 since that’s what the top submission used.</p>
<p>I  know, I know. Not very creative. But it’s what I did. Afterward, I  experimented with a few other batch sizes, and I couldn’t get better  results. So 128 it is!</p>
<p>In  general, batch sizes can be a weird thing to optimize, since it  partially depends on the computer you’re using. If you have a GPU with  more VRAM, you can train on larger batch sizes.</p>
<p>So  if I tell you to use a batch size of 2048, for example, instead of  getting that coveted top spot on Kaggle and eternal fame and glory for  life, you might just end up with a CUDA: out of memory error.</p>
<p>So  it’s hard to recommend a perfect batch size because, in practice, there  are clearly computational limits. The best way to pick it is to try out  values that work for you.</p>
<p>But how would you pick a random number from the vast sea of positive integers?</p>
<p>Well, you actually don’t. Since GPU memory is organized in powers of two, it’s a good idea to choose a batch size that’s a power of 2 so that your mini-batches fit snugly in memory.</p>
<p>Here’s what I would do: start off with a moderately large batch size like 512. Then, if you find that your model starts acting weird and the loss is not on a clear downward trend, halve it. Next, repeat the training process with a batch size of 256, and see if it behaves this time.</p>
<p>If it doesn’t, wash, rinse, and repeat.</p>
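<p>If you’d rather not babysit that loop yourself, here’s a hypothetical little helper (plain Python, not any library’s API) that halves the batch size whenever training hits the CUDA out-of-memory error mentioned earlier:</p>
<pre><code class="lang-python">def find_workable_batch_size(train_fn, start=512, minimum=16):
    """Halve the batch size until train_fn(bs) fits in GPU memory."""
    bs = start
    while bs &gt;= minimum:
        try:
            train_fn(bs)   # attempt a training run at this batch size
            return bs
        except RuntimeError as e:
            if 'out of memory' not in str(e):
                raise      # a different error: don't silently swallow it
            bs //= 2       # wash, rinse, repeat with half the batch size
    raise RuntimeError('no workable batch size found')
</code></pre>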
<h2 id="heading-a-few-pretty-pictures">A Few Pretty Pictures</h2>
<p>With  the optimizations going on here, it’s going to be pretty challenging to  keep track of this giant mess of models, metrics, and hyperparameters  that we’ve created.</p>
<p>To ensure that we all remain sane human beings while climbing the accuracy mountain, we’re going to use <a target="_blank" href="https://docs.wandb.com/docs/frameworks/fastai.html">the wandb + fastai integration</a>.</p>
<p>So what does wandb actually do?</p>
<p>It automatically keeps track of a whole lot of statistics about your model and how it’s performing. But what’s really cool is that it also provides instant charts and visualizations to keep track of critical metrics like accuracy and loss, all in real time!</p>
<p>If  that wasn’t enough, it also stores all of those charts, visualizations,  and statistics in the cloud, so you can access them anytime anywhere.</p>
<p>Your days of staring at a black terminal screen and fiddling around with matplotlib are over.</p>
<p><a target="_blank" href="https://colab.research.google.com/gist/iyaja/fe102ae34312e48e637edd804a450207/kmnist.ipynb">The notebook tutorial</a> for this article has a straightforward introduction to how it works seamlessly with fastai. You can also check out <a target="_blank" href="https://app.wandb.ai/ajayuppili/kmnist/runs/41gbr2yx">the wandb workspace</a>, where you can take a look at all the stuff I mentioned without writing any code.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>これで終わりです</p>
<p>That means “this is the end.”</p>
<p>But  you didn't need me to tell you that, did you? Not after you went  through the trouble of getting a Japanese character dataset, using the  learning rate finder, training a ResNet using modern best practices, and  watching your model rise to glory using real-time monitoring in the  cloud.</p>
<p>Yep, in about 20 minutes, you actually did all of that! Give yourself a pat on the back.</p>
<p>And please, go watch some Dragon Ball.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to create a simple Image Classifier ]]>
                </title>
                <description>
                    <![CDATA[ By Aditya Image classification is an amazing application of deep learning. We can train a powerful algorithm to model a large image dataset. This model can then be used to classify a similar but unknown set of images.  There is no limit to the applic... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/creating-your-first-image-classifier/</link>
                <guid isPermaLink="false">66d45ddb7df3a1f32ee7f7e7</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ image classification ]]>
                    </category>
                
                    <category>
                        <![CDATA[ neural networks ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 18 Jul 2019 18:06:36 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2019/07/mnist-fashion3.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Aditya</p>
<p>Image classification is an amazing application of deep learning. We can train a powerful algorithm to model a large image dataset. This model can then be used to classify a similar but unknown set of images. </p>
<p>There is no limit to the applications of image classification. You can use it in your next app or you can use it to solve some real world problem. That's all up to you. But to someone who is fairly new to this realm, it might seem very challenging at first. How should I get my data? How should I build my model? What tools should I use? </p>
<p>In this article we will discuss all of that - from finding a dataset to training your model. I will try to make things as simple as possible by avoiding some technical details (<em>PS: Please note that this doesn't mean those details are not important. I will mention some great resources which you can refer to if you want to learn more about those topics</em>). The purpose of this article is to explain the basic process of building an image classifier, and that's what we will focus on here.</p>
<p>We will build an image classifier for the <a target="_blank" href="https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/">Fashion-MNIST Dataset</a>. The Fashion-MNIST dataset is a collection of <a target="_blank" href="https://research.zalando.com/">Zalando's</a> article images. It contains 60,000 images for the training set and 10,000 images for the test set (<em>we will discuss the test and training datasets, along with the validation dataset, later</em>). These images are labeled with 10 different classes.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/07/mnist-fashion.png" alt="Image" width="600" height="400" loading="lazy">
<em><a target="_blank" href="https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/">Source</a></em></p>
<h2 id="heading-importing-libraries">Importing Libraries</h2>
<p>Our goal is to train a deep learning model that can classify a given set of images into one of these 10 classes. Now that we have our dataset, we should move on to the tools we need. There are many libraries and tools out there that you can choose based on your own project requirements. For this one I will stick to the following:</p>
<ol>
<li><a target="_blank" href="https://www.numpy.org/"><strong>Numpy</strong></a> - Python library for numerical computation</li>
<li><a target="_blank" href="https://pandas.pydata.org/"><strong>Pandas</strong></a> - Python library data manipulation</li>
<li><a target="_blank" href="https://matplotlib.org/"><strong>Matplotlib</strong></a> - Python library data visualisation</li>
<li><a target="_blank" href="https://keras.io/"><strong>Keras</strong></a> - Python library based on tensorflow for creating deep learning models</li>
<li><a target="_blank" href="https://jupyter.org/"><strong>Jupyter</strong></a> - I will run all my code on Jupyter Notebooks. You can install it via the link. You can use <a target="_blank" href="https://colab.research.google.com/">Google Colabs</a> also if you need better computational power.</li>
</ol>
<p>Along with these, we will also use <a target="_blank" href="https://scikit-learn.org/">scikit-learn</a>. The purpose of these libraries will become clearer once we dive into the code.</p>
<p>Okay! We have our tools and libraries ready. Now we should start setting up our code.</p>
<p>Start with importing all the above mentioned libraries. Along with importing libraries I have also imported some specific modules from these libraries. Let me go through them one by one.</p>
<pre><code class="lang-python3">import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import keras 

from sklearn.model_selection import train_test_split 
from keras.utils import to_categorical 

from keras.models import Sequential 
from keras.layers import Conv2D, MaxPooling2D 
from keras.layers import Dense, Dropout 
from keras.layers import Flatten, BatchNormalization
</code></pre>
<p><strong><a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">train_test_split</a>:</strong> This module splits the training dataset into training and validation data. The reason behind this split is to check if our model is <a target="_blank" href="https://en.wikipedia.org/wiki/Overfitting">overfitting</a> or not. We use a training dataset to train our model and then we will compare the resulting accuracy to validation accuracy. If the difference between both quantities is significantly large, then our model is probably overfitting. We will reiterate through our model building process and making required changes along the way. Once we are satisfied with our training and validation accuracies, we will make final predictions on our test data. </p>
<p><strong>to_categorical:</strong> to_categorical is a keras utility. It is used to convert the categorical labels into <a target="_blank" href="https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/">one-hot encodings</a>. Let's say we have three labels ("apples", "oranges", "bananas"), then one hot encodings for each of these would be [1, 0, 0] -&gt; "apples", [0, 1, 0] -&gt; "oranges",   [0, 0, 1] -&gt; "bananas".</p>
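<p>As a quick illustration (a tiny sketch of my own, not part of the original walkthrough), here is what <code>to_categorical</code> does to three such labels:</p>
<pre><code class="lang-python3">from keras.utils import to_categorical

labels = [0, 1, 2]  # e.g. "apples", "oranges", "bananas"
print(to_categorical(labels))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
</code></pre>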
<p>The rest of the Keras modules we have imported are convolutional layers. We will discuss convolutional layers when we start building our model. We will also give a quick glance to what each of these layers do.</p>
<h2 id="heading-data-pre-processing">Data Pre-processing</h2>
<p>For now we will shift our attention to getting our data and analysing it. You should always remember the importance of pre-processing and analysing the data. It not only gives you insights about the data but also helps to locate inconsistencies. </p>
<p>A very slight variation in data can sometimes lead to a devastating result for your model. This makes it important to preprocess your data before using it for training. So with that in mind let's start data preprocessing.</p>
<pre><code class="lang-python3">train_df = pd.read_csv('./fashion-mnist_train.csv')
test_df = pd.read_csv('./fashion-mnist_test.csv')
</code></pre>
<p>First of all let's import our dataset (<em><a target="_blank" href="https://www.kaggle.com/zalando-research/fashionmnist">Here</a> is the link to download this dataset on your system</em>). Once you have imported the dataset, run the following command.</p>
<pre><code class="lang-python3">train_df.head()
</code></pre>
<p>This command will show you what your data looks like. The following screenshot shows the output of this command.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/07/head_output.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can see how our image data is stored in the form of pixel values. But we cannot feed data to our model in this format. So, we will have to convert it into numpy arrays. </p>
<pre><code class="lang-python3">train_data = np.array(train_df.iloc[:, 1:])
test_data = np.array(test_df.iloc[:, 1:])
</code></pre>
<p>Now, it's time to get our labels. </p>
<pre><code class="lang-python3">train_labels = to_categorical(train_df.iloc[:, 0])
test_labels = to_categorical(test_df.iloc[:, 0])
</code></pre>
<p>Here, you can see that we have used <code>to_categorical</code> to convert our categorical data into one-hot encodings.</p>
<p>We will now reshape the data and cast it into <em>float32</em> type so that we can use it conveniently. </p>
<pre><code>rows, cols = <span class="hljs-number">28</span>, <span class="hljs-number">28</span> 

train_data = train_data.reshape(train_data.shape[<span class="hljs-number">0</span>], rows, cols, <span class="hljs-number">1</span>)
test_data = test_data.reshape(test_data.shape[<span class="hljs-number">0</span>], rows, cols, <span class="hljs-number">1</span>)

train_data = train_data.astype(<span class="hljs-string">'float32'</span>)
test_data = test_data.astype(<span class="hljs-string">'float32'</span>)
</code></pre><p>We are almost done. Let's finish preprocessing our data by normalizing it. Normalizing the image data maps all the pixel values in each image to values between 0 and 1. This helps us reduce inconsistencies in the data. Before normalizing, the image data can have large variations in pixel values, which can lead to some unusual behaviour during the training process.</p>
<pre><code>train_data /= <span class="hljs-number">255.0</span>
test_data /= <span class="hljs-number">255.0</span>
</code></pre><h2 id="heading-convolutional-neural-networks">Convolutional Neural Networks</h2>
<p>So, data preprocessing is done. Now we can start building our model. We will build a <a target="_blank" href="http://cs231n.github.io/convolutional-networks/">Convolutional Neural Network</a> for modeling the image data. CNNs are modified versions of regular <a target="_blank" href="https://en.wikipedia.org/wiki/Neural_network">neural networks</a>. These are modified specifically for image data. Feeding images to regular neural networks would require our network to have a large number of input neurons. For example just for a 28x28 image we would require 784 input neurons. This would create a huge mess of training parameters.</p>
<p>CNNs fix this problem by already assuming that the input is going to be an image. The main purpose of convolutional neural networks is to take advantage of the spatial structure of the image and to extract high level features from that and then train on those features. It does so by performing a <a target="_blank" href="https://en.wikipedia.org/wiki/Convolution">convolution</a> operation on the matrix of pixel values.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/07/convSobel.gif" alt="Image" width="600" height="400" loading="lazy">
<em><a target="_blank" href="https://mlnotebook.github.io/post/CNN1/">Source</a></em></p>
<p>The visualization above shows how the convolution operation works. And the Conv2D layer we imported earlier does the same thing. The first matrix (<em>from the left</em>) in the demonstration is the input to the convolutional layer. Then another matrix, called a "filter" or "kernel", is applied to each window of the input matrix: the overlapping values are multiplied element-wise and summed. The output of this operation is the input to the next layer.</p>
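<p>If it helps to see that spelled out, here is a minimal NumPy sketch (my own illustration, ignoring padding and strides) of the windowed multiply-and-sum that a convolutional layer performs for a single filter:</p>
<pre><code class="lang-python3">import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over every window of the image; at each position,
    # multiply element-wise and sum to produce one output value.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d(image, kernel))  # a 4x4 feature map
</code></pre>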
<p>Other than convolutional layers, a typical CNN also has two other types of layers: 1) a <a target="_blank" href="https://machinelearningmastery.com/pooling-layers-for-convolutional-neural-networks/">pooling layer</a>, and 2) a <a target="_blank" href="https://stats.stackexchange.com/questions/182102/what-do-the-fully-connected-layers-do-in-cnns">fully connected layer</a>.</p>
<p>Pooling layers are used to generalize the output of the convolutional layers. Along with generalizing, they also reduce the number of parameters in the model by down-sampling the output of the convolutional layer.</p>
<p>As we just learned, convolutional layers extract high-level features from image data. Fully connected layers use these high-level features to train the parameters and learn to classify the images.</p>
<p>We will also use the <a target="_blank" href="https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/">Dropout</a>, <a target="_blank" href="https://en.wikipedia.org/wiki/Batch_normalization">Batch-normalization</a> and <a target="_blank" href="https://stackoverflow.com/questions/43237124/role-of-flatten-in-keras">Flatten</a> layers in addition to the layers mentioned above. Flatten layer converts the output of convolutional layers into a one dimensional feature vector. It is important to flatten the outputs because Dense (Fully connected) layers only accept a feature vector as input. Dropout and Batch-normalization layers are for preventing the model from <a target="_blank" href="https://en.wikipedia.org/wiki/Overfitting">overfitting</a>.</p>
<pre><code class="lang-python">train_x, val_x, train_y, val_y = train_test_split(train_data, train_labels, test_size=<span class="hljs-number">0.2</span>)

batch_size = <span class="hljs-number">256</span>
epochs = <span class="hljs-number">5</span>
input_shape = (rows, cols, <span class="hljs-number">1</span>)
</code></pre>
<pre><code>def baseline_model():
    model = Sequential()
    model.add(BatchNormalization(input_shape=input_shape))
    model.add(Conv2D(<span class="hljs-number">32</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), padding=<span class="hljs-string">'same'</span>, activation=<span class="hljs-string">'relu'</span>))
    model.add(MaxPooling2D(pool_size=(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>), strides=(<span class="hljs-number">2</span>,<span class="hljs-number">2</span>)))
    model.add(Dropout(<span class="hljs-number">0.25</span>))

    model.add(BatchNormalization())
    model.add(Conv2D(<span class="hljs-number">32</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), padding=<span class="hljs-string">'same'</span>, activation=<span class="hljs-string">'relu'</span>))
    model.add(MaxPooling2D(pool_size=(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)))
    model.add(Dropout(<span class="hljs-number">0.25</span>))

    model.add(Flatten())
    model.add(Dense(<span class="hljs-number">128</span>, activation=<span class="hljs-string">'relu'</span>))
    model.add(Dropout(<span class="hljs-number">0.5</span>))
    model.add(Dense(<span class="hljs-number">10</span>, activation=<span class="hljs-string">'softmax'</span>))
    <span class="hljs-keyword">return</span> model
</code></pre><p>The code that you see above is the code for our CNN model. You can structure these layers in many different ways to get good results. There are many popular CNN architectures which give state of the art results. Here, I have just created my own simple architecture for the purpose of this problem. Feel free to try your own and let me know what results you get :)</p>
<h2 id="heading-training-the-model">Training the model</h2>
<p>Once you have created the model you can import it and then compile it by using the code below.</p>
<pre><code>model = baseline_model()
model.compile(loss=<span class="hljs-string">'categorical_crossentropy'</span>, optimizer=<span class="hljs-string">'sgd'</span>, metrics=[<span class="hljs-string">'accuracy'</span>])
</code></pre><p><strong>model.compile</strong> configures the learning process for our model. We have passed it three arguments. These arguments define the <a target="_blank" href="https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/">loss function</a> for our model, <a target="_blank" href="https://blog.algorithmia.com/introduction-to-optimizers/">optimizer</a> and <a target="_blank" href="https://keras.io/metrics/">metrics</a>.</p>
<pre><code>history = model.fit(train_x, train_y,
          batch_size=batch_size,
          epochs=epochs,
          verbose=<span class="hljs-number">1</span>,
          validation_data=(val_x, val_y))
</code></pre><p>And finally, by running the code above, you can train your model. I am training this model for just five epochs, but you can increase the number of epochs. After your training process is completed, you can make predictions on the test set by using the following code.</p>
<pre><code>predictions = model.predict(test_data)
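
# A hypothetical follow-up (not in the original article): turn the predicted
# probabilities into class labels and measure accuracy on the test set.
predicted_classes = np.argmax(predictions, axis=1)
true_classes = np.argmax(test_labels, axis=1)
print('Test accuracy:', np.mean(predicted_classes == true_classes))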
</code></pre><h2 id="heading-conclusion">Conclusion</h2>
<p>Congrats! You did it. You have taken your first step into the amazing world of computer vision.</p>
<p>You have created your own image classifier. Even though this is a great achievement, we have just scratched the surface.</p>
<p>There is a lot you can do with CNNs. The applications are limitless. I hope that this article helped you to get an understanding of how the process of training these models works. </p>
<p>Working on other datasets on your own will help you understand this even better. I have also created a GitHub <a target="_blank" href="https://github.com/aditya2000/MNIST-Fashion-">repository</a> for the code I used in this article. So, if this article was useful for you, please let me know.</p>
<p>If you have any questions or you want to share your own results or if you just want to say "hi", feel free to hit me up on <a target="_blank" href="https://twitter.com/aditya_dehal">twitter</a>, and I'll try to do my best to help you. And finally <strong>Thanks a lot for reading this article!!</strong> :)</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Everything you need to know to master Convolutional Neural Networks ]]>
                </title>
                <description>
                    <![CDATA[ By Tirmidzi Faizal Aflahi Look at the photo below: _Courtesy of [Pix2PixHD](https://github.com/NVIDIA/pix2pixHD" rel="noopener" target="blank" title=") That is not a real photo. You can open the image in a new tab and zoom into the image. Do you see... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/everything-you-need-to-know-to-master-convolutional-neural-networks-ef98ca3c7655/</link>
                <guid isPermaLink="false">66c349fe4f1fc448a3678ff7</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 26 Apr 2019 20:16:26 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*-Rip5afVhP2NlhVYfPp_mA.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Tirmidzi Faizal Aflahi</p>
<p>Look at the photo below:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Qo6M5bDKw4FummLtjX8NrAwl6hbz8uDtw9ut" alt="Image" width="600" height="400" loading="lazy">
<em>Courtesy of <a target="_blank" href="https://github.com/NVIDIA/pix2pixHD">Pix2PixHD</a></em></p>
<p><strong>That is not a real photo</strong>. You can open the image in a new tab and zoom into the image. Do you see the mosaics?</p>
<p>The picture was actually generated by an artificial intelligence program. Doesn’t it feel realistic? It’s great, isn’t it?</p>
<p>It has been only 7 years since the technology was brought to the public by Alex Krizhevsky and friends via the ImageNet competition. This competition is an annual Computer Vision competition to categorize pictures into 1000 different classes. From Alaskan Malamutes to toilet paper. Alex and friends built something called AlexNet, and it won the competition with a large margin between it and second place.</p>
<p>This technology is called a <strong>Convolutional Neural Network</strong>. It’s a sub-branch of Deep Neural Networks which performs exceptionally well in processing images.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Fao8o7A2hPuF04JmdAL4fdh-P1CFodRdi985" alt="Image" width="600" height="400" loading="lazy">
<em>Courtesy of ImageNet</em></p>
<p>The image above shows the error rate produced by the software that won the competition in each of several years. In 2016, <strong>it was actually better than human performance</strong>, which is around 5%.</p>
<p>The introduction of Deep Learning into this field is actually <em>game breaking</em> more than game-changing.</p>
<h3 id="heading-convolutional-neural-network-architecture">Convolutional Neural Network Architecture</h3>
<p>So, how does this technology work?</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/iZ-JSs6hw3oDgPEn8Lw3wHdQDr4xNoU1tvjV" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Convolutional Neural Networks perform better than other Deep Neural Network architectures because of their unique process. Instead of looking at the image one pixel at a time, <strong>CNNs group several pixels together</strong> (for example, a 3×3 patch of pixels like in the image above) so they can understand a spatial pattern.</p>
<p>In other words, CNNs can “see” a group of pixels forming a line or a curve. Because of the deep nature of Deep Neural Networks, at the next level they see not groups of pixels, but groups of lines and curves forming shapes. And so on, until they form a complete picture.</p>
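<p>To make this concrete, here is a toy sketch in Python (my illustration, not the article’s code) of what a single convolution does: a small kernel slides across the image, and each output value summarizes one 3×3 group of pixels.</p>
<pre><code># Toy illustration of a convolution: each output pixel summarizes a
# small neighborhood of input pixels. Not optimized, just explicit.
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

image = np.random.rand(32, 32)            # a 32x32 grayscale "image"
edge_kernel = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]])      # responds strongly to vertical edges
features = convolve2d(image, edge_kernel) # a 30x30 feature map
</code></pre>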
<p><img src="https://cdn-media-1.freecodecamp.org/images/ogzEWfxkTLd-tGiwOIUvieopX-1rAioqoFSC" alt="Image" width="600" height="400" loading="lazy">
<em>Deep Convolutional Neural Network by <a target="_blank" href="https://www.researchgate.net/figure/Learning-hierarchy-of-visual-features-in-CNN-architecture_fig1_281607765">Mynepalli</a></em></p>
<p>There are many things you need to learn if you want to understand CNNs in depth, from the very basics, like kernels and pooling layers, onward. But nowadays, <strong>you can just dive in and use the many open source projects built on this technology.</strong></p>
<p>This is possible largely because of a technique called <strong>Transfer Learning</strong>.</p>
<h3 id="heading-transfer-learning">Transfer Learning</h3>
<p>Transfer Learning is a technique that reuses an already-trained Deep Learning model for another, more specific task.</p>
<p>As an example, say you work at a train management company and want to assess whether your trains are on time or not. And you don’t want to hire more people just for this task.</p>
<p><strong>You can just reuse an ImageNet Convolutional Neural Network model, maybe ResNet (the 2015 winner), re-train the network with images of your train fleet, and you will do just fine.</strong></p>
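<p>As a rough sketch of what that looks like in code (my illustration using torchvision, not a prescription; the two-class train example is carried over from above):</p>
<pre><code>import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet and freeze its weights.
model = models.resnet34(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with one sized for our own task,
# e.g. two classes: on-time vs. delayed trains.
model.fc = nn.Linear(model.fc.in_features, 2)
# Only model.fc's parameters now need training on our own images.
</code></pre>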
<p>There are two main competitive edges when you use Transfer Learning.</p>
<ol>
<li><strong>It needs fewer images to perform well than training from scratch</strong>. The ImageNet competition provides around 1 million images to train on. Using transfer learning, you can use only 1,000 or even 100 images and still perform well, because the network was already trained on those 1 million images.</li>
<li><strong>It needs less time to achieve good performance</strong>. To be as good as an ImageNet winner, you would need to train a network for days, and that doesn’t count the time needed to alter the network if it doesn’t work well. Using transfer learning, you may only need several hours or even minutes of training for some tasks. A lot of time saved.</li>
</ol>
<h3 id="heading-image-classification-to-image-generation">Image Classification to Image Generation</h3>
<p>Enabled by transfer learning, many initiatives appeared. If a network can process some images and tell us what the images are about, how about constructing the images themselves?</p>
<p><em>Challenge accepted!</em></p>
<p>The <strong>Generative Adversarial Network</strong> comes onto the scene.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/zkTwwVivOhYrDKvsNo5bAaPiO8g-6NI8jXnM" alt="Image" width="600" height="400" loading="lazy">
<em>CycleGAN by <a target="_blank" href="https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix">Jun-Yan Zhu</a></em></p>
<p>This technology can generate pictures from some input.</p>
<p>It can generate a realistic photo from a painting, using a variant called CycleGAN, which you can see in the photo above. In another use case, it can generate a picture of a bag from a sketch. It can even generate a higher-resolution photo from a low-res one.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/JxxgUM7CBe1EGYEN2QHAfzkN6Lw6QSC4JCpo" alt="Image" width="600" height="400" loading="lazy">
<em><a target="_blank" href="https://github.com/tensorlayer/srgan">Super Resolution Generative Adversarial Network</a></em></p>
<p>Amazing, aren’t they?</p>
<p>Of course. And you can start learning to build them now. But how?</p>
<h3 id="heading-convolutional-neural-network-tutorial">Convolutional Neural Network Tutorial</h3>
<p>So, let’s begin. You will learn that getting started on this topic is easy, so freaking easy. But mastering it is on another level.</p>
<p>Let’s put aside mastering it for now.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/7RlbToHPUJgIejBUne2NDMQWhDss3fTZ4S78" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://unsplash.com/photos/5A06OWU6Wuc?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Thomas Verbruggen</a> on <a target="_blank" href="https://unsplash.com/search/photos/columnar-cactus?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></em></p>
<p>After browsing for several days, I found this project which is really suitable for you to start with.</p>
<h3 id="heading-aerial-cactus-identificationhttpswwwkagglecomcaerial-cactus-identification"><a target="_blank" href="https://www.kaggle.com/c/aerial-cactus-identification">Aerial Cactus Identification</a></h3>
<p>This is a tutorial project from <a target="_blank" href="https://www.kaggle.com/">Kaggle</a>. Your task is to identify whether there is a columnar cactus in an aerial image.</p>
<p>Pretty simple, eh?</p>
<p>You will be given 17,500 images to work with and need to label 4,000 images that have not been labeled. Your score is 1 or 100% if all the 4,000 images are correctly labeled by your program.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/y6aBaIA1GC7twlFe4uFq6RY5dykaTDMG3xVX" alt="Image" width="600" height="400" loading="lazy">
<em>Cactus</em></p>
<p>The images are pretty much like what you see above: a photo of a region that may or may not contain a group of columnar cacti. The photos are 32×32 pixels, and the cacti appear in different orientations, since these are aerial photos.</p>
<p>So what do you need?</p>
<h3 id="heading-convolutional-neural-network-with-python">Convolutional Neural Network with Python</h3>
<p>Yes, Python, the popular language for Deep Learning. With many libraries available, you can practically trial each one and compare. The choices are:</p>
<ol>
<li><strong>Tensorflow</strong>, the most popular Deep Learning library. Built by engineers at Google, it has the biggest contributor base and the most fans. Because the community is so big, you can easily find solutions to your problems. It has <strong>Keras</strong> as a high-level abstraction wrapper, which is very friendly for newbies.</li>
<li><strong>Pytorch</strong>. My favorite Deep Learning library. Built purely on Python, it inherits Python’s pros and cons, so Python developers will feel extremely at home with it. It has a companion library called <strong>FastAI</strong>, which gives it the kind of abstraction Keras gives Tensorflow.</li>
<li><strong>MXNet</strong>. The Deep Learning library by Apache.</li>
<li><strong>Theano</strong>. The predecessor of Tensorflow.</li>
<li><strong>CNTK</strong>. Microsoft’s own Deep Learning library.</li>
</ol>
<p>For this tutorial, let’s use my favorite one, Pytorch, complemented by its abstraction, FastAI.</p>
<p>Before starting, you need to install Python. Go to the <a target="_blank" href="https://www.python.org/downloads/">Python website</a> and download what you need. You need to make sure that you install <strong>version 3.6+</strong>, or it may not be supported by the libraries you will use.</p>
<p>Now, open your command line or terminal and install these things</p>
<pre><code>pip install numpy
pip install pandas
pip install jupyter
</code></pre><p><strong>Numpy</strong> will be used to store the input images, and <strong>pandas</strong> to work with CSV files. Jupyter Notebook is what you need to code interactively with Python.</p>
<p>Then, go to the <a target="_blank" href="https://pytorch.org/">Pytorch website</a> and download what you need. You might want the CUDA version to speed up training. Make sure that you get version 1.0+ of Pytorch.</p>
<p>After that, install torchvision and FastAI:</p>
<pre><code>pip install torchvision
pip install fastai
</code></pre><p>Run Jupyter with the command <strong>jupyter notebook</strong> and it will open a browser window.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/DZEbPNLsV51dIziniiLp0Z6KtlqUaUzjk2rL" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Now, you are ready to go.</p>
<h3 id="heading-prepare-the-data">Prepare the Data</h3>
<p>Import the necessary code:</p>
<pre><code><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> npimport pandas <span class="hljs-keyword">as</span> pd <span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path <span class="hljs-keyword">from</span> fastai <span class="hljs-keyword">import</span> * <span class="hljs-keyword">from</span> fastai.vision <span class="hljs-keyword">import</span> * <span class="hljs-keyword">import</span> torch %matplotlib inline
</code></pre><p>Numpy and Pandas are needed for almost everything you want to do. FastAI and Torch are your Deep Learning libraries. The <strong>%matplotlib inline</strong> magic is used to show charts inside the notebook.</p>
<p>Now, download data files from the <a target="_blank" href="https://www.kaggle.com/c/aerial-cactus-identification/data">competition website.</a></p>
<p>Extract the zip data file and put them inside the jupyter notebook folder.</p>
<p>Let’s say you named your notebook Cacti. Your folder structure would be like this:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/R5fQ0dLngqrAVLPRl8J8lFxgqr4yhXQmm3yg" alt="Image" width="600" height="400" loading="lazy"></p>
<p><strong>Train folder</strong> contains all the images for your training step.</p>
<p><strong>Test folder</strong> contains all the images for submission.</p>
<p><strong>Train CSV</strong> file contains the training data, mapping each image name to the has_cactus column, which is 1 if the image has a cactus and 0 otherwise.</p>
<p><strong>Sample Submission CSV</strong> file contains the formatting for the submission you need to make. The file names listed there correspond to the files inside the Test folder.</p>
<pre><code>train_df = pd.read_csv(<span class="hljs-string">"train.csv"</span>)
</code></pre><p>Load the Train CSV file to a data frame.</p>
<pre><code>data_folder = Path(".")
train_img = ImageList.from_df(train_df, path=data_folder, folder='train')
</code></pre><p>Create a data loader using the <strong>ImageList.from_df</strong> method to map the train_df data frame to the images inside the train folder.</p>
<h3 id="heading-data-augmentation">Data Augmentation</h3>
<p>This is a technique to <strong>create more data from your existing data.</strong> An image of a cat flipped vertically is still a cat. By doing this you can basically multiply your data set by a factor of two, four, or even 16.</p>
<p>You will need this technique a lot if you happen to have very little data to work with.</p>
<pre><code>transformations = get_transforms(do_flip=True, flip_vert=True, max_rotate=<span class="hljs-number">10.0</span>, max_zoom=<span class="hljs-number">1.1</span>, max_lighting=<span class="hljs-number">0.2</span>, max_warp=<span class="hljs-number">0.2</span>, p_affine=<span class="hljs-number">0.75</span>, p_lighting=<span class="hljs-number">0.75</span>)
</code></pre><p>FastAI gives you a nice method to do all of this, called <strong>get_transforms</strong>. You can flip the image vertically or horizontally, rotate, zoom, add lighting/brightness, and warp it.</p>
<p>You can play with the parameters I used above to see how the results look. Or you can <a target="_blank" href="https://docs.fast.ai/vision.transform.html">open the documentation</a> and read about them in detail.</p>
<p>Of course, apply the transformation to your image list:</p>
<pre><code>train_img = train_img.transform(transformations, size=<span class="hljs-number">128</span>)
</code></pre><p>The size parameter scales the input up or down to match the neural network you will use. The network I will use is called <strong>DenseNet</strong>, which won the Best Paper Award at CVPR 2017, and I will feed it images resized to 128×128 pixels.</p>
<h3 id="heading-training-preparation">Training Preparation</h3>
<p>After loading your data, you need to prepare yourself and your data for the most important phase in Deep Learning: Training. <strong>Basically, this is the Learning in Deep Learning</strong>. The network learns from your data and updates itself accordingly so that it performs well on it.</p>
<pre><code>test_df = pd.read_csv("sample_submission.csv")
test_img = ImageList.from_df(test_df, path=data_folder, folder='test')
</code></pre><pre><code>train_img = (train_img
             .split_by_rand_pct(0.01)
             .label_from_df()
             .add_test(test_img)
             .databunch(path='.', bs=64, device=torch.device('cuda:0'))
             .normalize(imagenet_stats))
</code></pre><p>For the training step, you need to split off a small portion of your training data, called <strong>validation data</strong>. You can’t train on this data, because it is your validation tool. <strong>When your Convolutional Neural Network performs well on the validation data, it will likely perform well on the test data</strong> that will be submitted.</p>
<p>FastAI has the convenient method called <strong>split_by_rand_pct</strong> to split a portion of your data into validation data.</p>
<p>It also has the <strong>databunch</strong> method to perform batch processing. I used a batch size of 64 because that is what my GPU memory allows. If you don’t have a GPU, omit the <strong>device</strong> parameter.</p>
<p>Then, the <strong>normalize</strong> method is called to normalize your images, because you will use a pre-trained network. <strong>imagenet_stats</strong> normalizes the images the same way the pre-trained network’s inputs were normalized for the ImageNet competition.</p>
<p>Adding the test data to the training image list makes it easy to predict later on without more pre-processing. Remember, these images will not be trained on, and will not go into your validation set. You just want to pre-process them in the same way as the training images.</p>
<pre><code>learn = cnn_learner(train_img, models.densenet161, metrics=[error_rate, accuracy])
</code></pre><p>You are done preparing your training data. Now, create a learner with <strong>cnn_learner</strong>. As I said before, I will use DenseNet as the pre-trained network. You can use any other network offered in <a target="_blank" href="https://pytorch.org/docs/stable/torchvision/models.html">TorchVision</a>.</p>
<h3 id="heading-the-one-cycle-technique">The One-Cycle Technique</h3>
<p>You could start training right now. But there is always one point of confusion when training any Deep Neural Network, Convolutional Neural Networks included: <strong>choosing the right learning rate</strong>. Training uses an algorithm called Gradient Descent, which tries to decrease the error in steps whose size is controlled by a parameter called the learning rate.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/c1KIT8wg1bopl2bJ8tEXgTE-a37PwRBJFNrF" alt="Image" width="600" height="400" loading="lazy"></p>
<p><strong>A bigger learning rate makes the training steps faster</strong>, but it is prone to overshooting. This can make the error go out of control, like in the picture above. <strong>A smaller learning rate makes the training steps slower</strong>, but the error stays under control.</p>
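<p>To see why, here is a toy illustration (mine, not the article’s) of gradient descent on the one-dimensional function f(x) = x², whose minimum is at x = 0:</p>
<pre><code># Toy gradient descent: a small learning rate crawls toward the minimum,
# while a too-large one overshoots and diverges.
def gradient_descent(grad, x0, lr, steps):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

grad = lambda x: 2 * x  # gradient of f(x) = x^2

print(gradient_descent(grad, x0=5.0, lr=0.01, steps=100))  # ~0.66: slow but stable
print(gradient_descent(grad, x0=5.0, lr=1.10, steps=100))  # huge: out of control
</code></pre>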
<p>So, choosing the right learning rate is really important: make it as big as possible without letting the error go out of control.</p>
<p>It is easier said than done.</p>
<p>So a researcher named Leslie Smith created a technique called the <a target="_blank" href="https://sgugger.github.io/the-1cycle-policy.html">1-cycle policy</a>.</p>
<p>Intuition-wise, you brute-force several learning rates and <strong>find one with nearly minimal error but with some room left to improve</strong>. Let’s try it out in our code.</p>
<pre><code>learn.lr_find()
learn.recorder.plot()
</code></pre><p>It will print something like this:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/zw79NSmNj8dcbd0t3Ua5pBQFNsy6hzsiumVD" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The minimum is around 10⁻¹. So, I think we can use something smaller than that, but not too small. Maybe <strong>3×10⁻²</strong> is a good choice. Let’s try it!</p>
<pre><code>lr = <span class="hljs-number">3e-02</span> learn.fit_one_cycle(<span class="hljs-number">5</span>, slice(lr))
</code></pre><p>Train for several epochs (I chose 5: not too big, not too small), and let’s see the result.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/a2BKjtyc5jOPIkMlofT2QCrXdgYYtGGAJ4VX" alt="Image" width="600" height="400" loading="lazy"></p>
<p><strong>Wait, what!?</strong></p>
<p>Our simple solution gives us 100% accuracy on the validation split! It is remarkably effective, and it only took six minutes to train. What a stroke of luck! <strong>In real life, you will usually do several iterations just to find out which algorithms do better than the others.</strong></p>
<p>I am eager to submit! Haha. <strong>Let’s predict the test folder and submit the result.</strong></p>
<pre><code>preds, _ = learn.get_preds(ds_type=DatasetType.Test)
test_df.has_cactus = preds.numpy()[:, 0]
</code></pre><p>Because you have already put the test images in the training image list, you will not need to pre-process and load your test images.</p>
<pre><code>test_df.to_csv(<span class="hljs-string">'submission.csv'</span>, index=False)
</code></pre><p>This line creates a CSV file containing the image names and the has_cactus column for all 4,000 test images.</p>
<p>When I tried to submit, I realized that you need to submit the CSV via a Kaggle kernel. I had missed that.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/DFBYT7cE69bKw439uShdmeHBwnNN9QOq8GUS" alt="Image" width="600" height="400" loading="lazy">
<em>Courtesy of <a target="_blank" href="https://www.kaggle.com/">Kaggle</a></em></p>
<p>But, luckily, <strong>a kernel is essentially the same as your Jupyter notebook</strong>. You can just copy-paste everything you built in your notebook and submit from there.</p>
<p>And <strong>BAM</strong>!</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/R34ss1jMs-UJUVNLeQE8-qOtHRrovBptiryf" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Good Lord! I got 0.9999 for the public score. That’s really good. But, of course, after a first attempt like that, I wanted a perfect score.</p>
<p>So, I made several tweaks to the network, and once more, BAM!</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/d8BptL2rdf6599N1MEmGnLfzxOgYkDsPwqKm" alt="Image" width="600" height="400" loading="lazy"></p>
<p>I did it! So can you. It’s actually not that hard.</p>
<p>(BTW, this rank was taken on April 13th, so my rank may have dropped by now…)</p>
<h3 id="heading-what-i-learned">What I Learned</h3>
<p>This problem is easy, so you will not face any weird challenges while solving it. That makes it one of the most suitable projects to start with.</p>
<p>Alas, because so many people got a perfect score, I think the admins need to create another, harder test set for submission.</p>
<p>Whatever the reason, there is no barrier for you to try this. <strong>You can try this right now and get good results</strong>.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/eFxKKzNLkMLki84lBbzy9zVUcyTI2lgjTDbA" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://unsplash.com/photos/rGG-BCtNiuo?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Mario Mrad</a> on <a target="_blank" href="https://unsplash.com/search/photos/vision?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></em></p>
<h3 id="heading-final-thoughts">Final Thoughts</h3>
<p>Convolutional Neural Networks are helpful for many tasks, from recognizing images to generating them. Analyzing images nowadays is not as hard as it used to be. Of course, you can do it too if you try.</p>
<p>Just get started, pick a good Convolutional Neural Network project, and get good data.</p>
<p>Good luck!</p>
<p><em>This article is originally published on my blog at <a target="_blank" href="https://thedatamage.com/convolutional-neural-network-explained/">thedatamage</a>.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Computer Vision .js frameworks you need to know ]]>
                </title>
                <description>
                    <![CDATA[ By Shen Huang Computer vision has been a hot topic in recent years, enabling countless great applications. With the effort from some dedicated developers in the world, creating an application utilizing computer vision is no longer rocket science. In ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/computer-vision-js-frameworks-you-need-to-know-b233996103ce/</link>
                <guid isPermaLink="false">66d460f23a8352b6c5a2ab01</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Front-end Development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ JavaScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 18 Mar 2019 16:05:59 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/0*uXRPu25xSI_86KUw" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Shen Huang</p>
<p>Computer vision has been a hot topic in recent years, enabling countless great applications. With the effort of some dedicated developers around the world, creating an application utilizing computer vision is no longer rocket science. In fact, you can build many of these applications in a few lines of JavaScript code. In this article, I will introduce you to some of them.</p>
<h3 id="heading-1-tensorflowjs">1. TensorFlow.js</h3>
<p>Being one of the largest machine learning frameworks, TensorFlow also allows the creation of Node.js and front-end JavaScript applications with <a target="_blank" href="https://www.tensorflow.org/js"><strong>Tensorflow.js</strong></a>. Below is one of their demos matching poses with a collection of images. TensorFlow also has a <a target="_blank" href="https://playground.tensorflow.org/#activation=tanh&amp;batchSize=10&amp;dataset=circle&amp;regDataset=reg-plane&amp;learningRate=0.03&amp;regularizationRate=0&amp;noise=0&amp;networkShape=4,2&amp;seed=0.27185&amp;showTestData=false&amp;discretize=false&amp;percTrainData=50&amp;x=true&amp;y=true&amp;xTimesY=false&amp;xSquared=false&amp;ySquared=false&amp;cosX=false&amp;sinX=false&amp;cosY=false&amp;sinY=false&amp;collectStats=false&amp;problem=classification&amp;initZero=false&amp;hideText=false"><strong>playground</strong></a> that lets us visualize artificial neural networks, which can be great for educational purposes.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/fzXRDjBio2OIxVNIHI2Njxb9sg6x9rVQRAph" alt="Image" width="600" height="400" loading="lazy">
<em>A Move Mirror Demo from <a target="_blank" href="https://experiments.withgoogle.com/move-mirror">Tensorflow.js</a></em></p>
<h3 id="heading-2-amazon-rekognition">2. Amazon Rekognition</h3>
<p><a target="_blank" href="https://aws.amazon.com/rekognition/?sc_channel=PS&amp;sc_campaign=acquisition_US&amp;sc_publisher=google&amp;sc_medium=ACQ-P%7CPS-GO%7CBrand%7CDesktop%7CSU%7CMachine%20Learning%7CRekognition%7CUS%7CEN%7CText&amp;sc_content=aws_recognition_software_e&amp;sc_detail=amazon%20rekognition&amp;sc_category=Machine%20Learning&amp;sc_segment=293645376368&amp;sc_matchtype=e&amp;sc_country=US&amp;s_kwcid=AL!4422!3!293645376368!e!!g!!amazon%20rekognition&amp;ef_id=EAIaIQobChMIwLzV1obx4AIVEK6WCh3MZAPREAAYASAAEgJlv_D_BwE:G:s"><strong>Amazon Rekognition</strong></a> is a powerful cloud-based tool. But they also provide SDKs for JavaScript in browsers which can be found <a target="_blank" href="https://aws.amazon.com/sdk-for-browser/"><strong>here</strong></a>. Below is an image illustrating how detailed their face detection can be.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0pIcn86SNFaM5cbA5CXboENRyfMtX0ayQ3rb" alt="Image" width="800" height="941" loading="lazy">
<em>Facial Feature Detection with the <a target="_blank" href="https://docs.aws.amazon.com/rekognition/latest/dg/faces-detect-images.html">Amazon Rekognition API</a></em></p>
<h3 id="heading-3-opencvjs">3. OpenCV.js</h3>
<p>Being one of the oldest computer vision frameworks out there, <a target="_blank" href="https://opencv.org/"><strong>OpenCV</strong></a> has served developers in computer vision for a very long time. They also have a <a target="_blank" href="https://docs.opencv.org/3.4/d5/d10/tutorial_js_root.html"><strong>JavaScript version</strong></a> allowing developers to implement those features onto a website.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/axCPVu-3ItA12kmt4OLraf0WgqxzZ-BmmUfr" alt="Image" width="600" height="400" loading="lazy">
<em>Example Face Detection with OpenCV, image from <a target="_blank" href="https://dzone.com/articles/face-detection-using-html5">DZone</a></em></p>
<h3 id="heading-4-trackingjs">4. tracking.js</h3>
<p>If you are only looking to build a quick face detection app, such as a web version of the Snapchat filters, you should take a look at <a target="_blank" href="https://trackingjs.com/"><strong>tracking.js</strong></a>. This framework allows integration of face detection into JavaScript with a fairly simple setup. I have also written a <a target="_blank" href="https://medium.freecodecamp.org/how-to-drop-leprechaun-hats-into-your-website-with-computer-vision-b0d115a0f1ad"><strong>guide</strong></a> on using this framework to drop a leprechaun hat onto faces for St. Patrick’s Day.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/ntEODKKA39CkXs9Q8Fcb8uMBEaVC0OZZZpot" alt="Image" width="600" height="400" loading="lazy">
<em>tracking.js face detection <a target="_blank" href="https://trackingjs.com/examples/face_hello_world.html">example</a></em></p>
<h3 id="heading-5-webgazerjs">5. WebGazer.js</h3>
<p>Whether you are trying to perform user experience studies or create new interactive systems for your games or websites, <a target="_blank" href="https://webgazer.cs.brown.edu/"><strong>WebGazer.js</strong></a> can be a great place to start. This powerful framework lets our apps know where a person is looking, using only camera input.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/ofDdoti6XYIdLDUNfnDA4X54j8AHvNXeOpJY" alt="Image" width="600" height="400" loading="lazy">
<em>WebGazer.js gaze tracking <a target="_blank" href="https://webgazer.cs.brown.edu/#examples">example</a></em></p>
<h3 id="heading-6-threearjs">6. three.ar.js</h3>
<p>Another framework from Google, <a target="_blank" href="https://github.com/google-ar/three.ar.js?files=1"><strong>three.ar.js</strong></a> extends the functionalities of <a target="_blank" href="https://developers.google.com/ar/"><strong>ARCore</strong></a> onto front-end JavaScript. It enables us to integrate surface and object detection into browsers, which is the perfect tool for an AR game.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/2jPTttH19OZg9eeSQ7YXiFsQ-Xq8E2bqmk96" alt="Image" width="800" height="302" loading="lazy">
<em><a target="_blank" href="https://github.com/google-ar/three.ar.js?files=1">three.ar.js</a> demo</em></p>
<h3 id="heading-in-the-end">In the End…</h3>
<p>I am passionate about learning new technology and sharing it with the community. If there is anything you wish to read in particular, please let me know. Below are my previous articles related to this subject. Stay tuned and have fun engineering!</p>
<ul>
<li><a target="_blank" href="https://medium.com/swlh/how-computer-vision-is-revolutionizing-ecommerce-d05e0ca11765"><strong>How Computer Vision is Revolutionizing eCommerce</strong></a></li>
<li><a target="_blank" href="https://medium.freecodecamp.org/how-to-drop-leprechaun-hats-into-your-website-with-computer-vision-b0d115a0f1ad"><strong>How to drop LEPRECHAUN-HATS into your website with COMPUTER VISION</strong></a></li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ how to drop LEPRECHAUN-HATS into your website with COMPUTER VISION ]]>
                </title>
                <description>
                    <![CDATA[ By Shen Huang Automatically leprechaun-hat people on your website for St. Patrick’s Day. !!! — WARNING — !!! Giving a person a green hat can be considered OFFENSIVE to some Chinese people, as it has the same meaning as cheating in a relationship. So... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-drop-leprechaun-hats-into-your-website-with-computer-vision-b0d115a0f1ad/</link>
                <guid isPermaLink="false">66d460f63a8352b6c5a2ab09</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ CSS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ JavaScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ UX ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 07 Mar 2019 20:56:14 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*kfqTx__agnemI2s0kRd3rw.gif" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Shen Huang</p>
<h4 id="heading-automatically-leprechaun-hat-people-on-your-website-for-st-patricks-day">Automatically leprechaun-hat people on your website for St. Patrick’s Day.</h4>
<blockquote>
<p><strong>!!! — WARNING — !!!</strong></p>
<p>Giving a person a green hat can be considered <a target="_blank" href="https://mspoweruser.com/microsoft-removes-green-hat-from-vs-2019-installer-after-offending-users-in-china/"><strong>OFFENSIVE</strong></a> to some Chinese people, as it has the same meaning as cheating in a relationship. So use this <strong>CAREFULLY</strong> when you are serving a Chinese user base.</p>
<p><strong>!!! — WARNING — !!!</strong></p>
</blockquote>
<p>In this tutorial, we will go over how to drop a leprechaun hat onto your website images that contain people. The process will be done with the aid of some <strong>Computer Vision</strong> frameworks, so it will be the same amount of work even if you have millions of portraits to go through. A demo can be found <a target="_blank" href="https://shenhuang.github.io/demo_projects/tracking.js-master/TEAM%20MEMBERS%20_%20Teamwebsite.html"><strong>here</strong></a>, thanks to the permission of my teammates.</p>
<p>This tutorial is for more advanced audiences. I am assuming you can figure out a lot of the fundamentals on your own. I have also made some tutorials for total beginners, which I have attached in the end as links.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/oKTeBIcRIikaGpEVv0zWVOjoUNVQU43ms4XW" alt="Image" width="600" height="400" loading="lazy">
<em>Leprechaun Hats Fall on top of Heads in Portraits</em></p>
<h3 id="heading-1-initial-setup">1. Initial Setup</h3>
<p>Before we start this tutorial, we need to first perform some setup.</p>
<p>First of all, we are using <strong>tracking.js</strong> to help us in this project, and therefore, we need to download and extract the necessary files for <strong>tracking.js</strong> from <a target="_blank" href="https://github.com/eduardolundgren/tracking.js/archive/master.zip"><strong>here</strong></a>.</p>
<p>For this tutorial, we start with a template website I snatched from our team’s WiX site. <strong>WiX</strong> is a <strong>Content Management System (CMS)</strong> that lets you build websites with much less effort. The template can be downloaded from <a target="_blank" href="https://github.com/shenhuang/shenhuang.github.io/raw/master/tracking.js-master/site_template.zip"><strong>here</strong></a>. Extract the files into the “tracking.js-master” folder from the previous step.</p>
<p>In order to make everything work, we also need a server. We will be using a simple Python server for this tutorial. In case you do not have Python or Homebrew (which helps to install Python), you can use the following bash commands to install them.</p>
<p>Installing Homebrew:</p>
<pre><code>/usr/bin/ruby -e <span class="hljs-string">"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"</span>
</code></pre><p>Installing Python:</p>
<pre><code>brew install python
</code></pre><p>Now that everything is ready, we will run the command below under our “tracking.js-master” folder to start the Python server. (Note that <code>SimpleHTTPServer</code> is the Python 2 module; if you installed Python 3, use <code>python3 -m http.server</code> instead.)</p>
<pre><code>python -m SimpleHTTPServer
</code></pre><p>To test, go to this <a target="_blank" href="http://localhost:8000/examples/face_hello_world.html"><strong>link</strong></a> on your localhost to see an example page. You should also be able to view the extracted example page from <a target="_blank" href="http://localhost:8000/TEAM%20MEMBERS%20_%20Teamwebsite.html"><strong>here</strong></a>. And that is all you have to do for the setup.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/E3njCktFKMne4zqeC-1t6ljhser9k4Ay8Xhx" alt="Image" width="600" height="400" loading="lazy">
<em>Setting up a simple Python server.</em></p>
<h3 id="heading-2-creating-the-hat">2. Creating the hat</h3>
<p>Unlike my other tutorials, we will use an online image for this one rather than trying to recreate everything with <strong>CSS</strong>.</p>
<p>I found a leprechaun hat from <strong>kisspng</strong>, which can be found <a target="_blank" href="https://github.com/shenhuang/shenhuang.github.io/raw/master/tracking.js-master/leprechaunhat_kisspng.png"><strong>here</strong></a>. Save the image to the root folder of our website. By appending the following code just above the closing <code>&lt;/html&gt;</code> tag, we should be able to view the image on our example website after saving and reloading.</p>
<pre><code>&lt;body&gt;
  <span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">img</span> <span class="hljs-attr">id</span> = <span class="hljs-string">"hat"</span> <span class="hljs-attr">class</span> = <span class="hljs-string">"leprechaunhat"</span> <span class="hljs-attr">src</span> = <span class="hljs-string">"./leprechaunhat_kisspng.png"</span> &gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">body</span>&gt;</span></span>
</code></pre><p><img src="https://cdn-media-1.freecodecamp.org/images/FDncTXccdZYyRY8TG3fF1jaCjtMHsI9WyQEa" alt="Image" width="600" height="400" loading="lazy">
<em>Hat Image Appended to the Bottom of the Website</em></p>
<p>Now we have to design a drop animation with CSS and put the code above the hat declaration. The code basically lets the hat drop down and then shake a little bit.</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">style</span>&gt;</span><span class="css">
 <span class="hljs-keyword">@keyframes</span> shake {
  0% {
   <span class="hljs-attribute">transform </span>: <span class="hljs-built_in">translateY</span>(-<span class="hljs-number">30px</span>);
  }
  40% {
   <span class="hljs-attribute">transform </span>: <span class="hljs-built_in">rotate</span>(<span class="hljs-number">10deg</span>);
  }
  60% {
   <span class="hljs-attribute">transform </span>: <span class="hljs-built_in">rotate</span>(-<span class="hljs-number">10deg</span>);
  }
  80% {
   <span class="hljs-attribute">transform </span>: <span class="hljs-built_in">rotate</span>(<span class="hljs-number">10deg</span>);
  }
  100% {
   <span class="hljs-attribute">transform </span>: <span class="hljs-built_in">rotate</span>(<span class="hljs-number">0deg</span>);
  }
 }
 <span class="hljs-selector-class">.leprechaunhat</span> {
  <span class="hljs-attribute">animation </span>: shake <span class="hljs-number">1s</span> ease-in;
 }
</span><span class="hljs-tag">&lt;/<span class="hljs-name">style</span>&gt;</span>
</code></pre>
<p><img src="https://cdn-media-1.freecodecamp.org/images/niLdZDtnM566OnXKvebFQ-kC96UrllOgVuQv" alt="Image" width="600" height="400" loading="lazy">
<em>Hat drop animation.</em></p>
<h3 id="heading-3-drop-hats-onto-portraits">3. Drop hats onto portraits</h3>
<p>Now we will go over dropping hats precisely onto portraits. First we have to reference the JavaScript files from “tracking.js” with the following code.</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">src</span> = <span class="hljs-string">"build/tracking-min.js"</span> <span class="hljs-attr">type</span> = <span class="hljs-string">"text/javascript"</span> &gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">src</span> = <span class="hljs-string">"build/data/face-min.js"</span> <span class="hljs-attr">type</span> = <span class="hljs-string">"text/javascript"</span> &gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
</code></pre>
<p>The code provides us with a <code>Tracker</code> class which we can feed images into. We can then listen for a response containing rectangles that outline the faces inside each image.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/19eUYAEHlwvb6ycxU58Xv1ZnIWQoV--GvHDZ" alt="Image" width="600" height="400" loading="lazy">
<em>Tracker Explained</em></p>
<p>We start by defining a function that executes when the page is loaded. This function can be attached anywhere else if necessary. The <code>yOffsetValue</code> is an offset that nudges the hat into a more appropriate position.</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> yOffsetValue = <span class="hljs-number">10</span>;
<span class="hljs-built_in">window</span>.onload = <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params"></span>) </span>{
};
</code></pre>
<p>Inside, we define our hat creation function, allowing it to create hats with arbitrary sizes and positions.</p>
<pre><code class="lang-js"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">placeHat</span>(<span class="hljs-params">x, y, w, h, image, count</span>) </span>{
 hats[count] = hat.cloneNode(<span class="hljs-literal">true</span>);
 hats[count].style.display = <span class="hljs-string">"inline"</span>;
 hats[count].style.position = <span class="hljs-string">"absolute"</span>;
 hats[count].style.left = x + <span class="hljs-string">"px"</span>;
 hats[count].style.top = y + <span class="hljs-string">"px"</span>;
 hats[count].style.width = w + <span class="hljs-string">"px"</span>;
 hats[count].style.height = h + <span class="hljs-string">"px"</span>;
 image.parentNode.parentNode.appendChild(hats[count]);
}
</code></pre>
<p>We should also tweak our image declaration a little bit to hide the image, as we are now showing it with JavaScript.</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">img</span> <span class="hljs-attr">id</span> = <span class="hljs-string">"hat"</span> <span class="hljs-attr">class</span> = <span class="hljs-string">"leprechaunhat"</span> <span class="hljs-attr">src</span> = <span class="hljs-string">"./leprechaunhat_kisspng.png"</span> <span class="hljs-attr">style</span> = <span class="hljs-string">"display : none"</span> &gt;</span>
</code></pre>
<p>Then we add the following code to create the hats on top of faces, with the size matching the face.</p>
<pre><code class="lang-js"><span class="hljs-keyword">var</span> hat = <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">"hat"</span>);
<span class="hljs-keyword">var</span> images = <span class="hljs-built_in">document</span>.getElementsByTagName(<span class="hljs-string">'img'</span>);
<span class="hljs-keyword">var</span> trackers = [];
<span class="hljs-keyword">var</span> hats = [];
<span class="hljs-keyword">for</span>(i = <span class="hljs-number">0</span>; i &lt; images.length; i++)
{
 (<span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">img</span>)
 </span>{
  trackers[i] = <span class="hljs-keyword">new</span> tracking.ObjectTracker(<span class="hljs-string">'face'</span>);
  tracking.track(img, trackers[i]);
  trackers[i].on(<span class="hljs-string">'track'</span>, <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">event</span>) </span>{
   event.data.forEach(<span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">rect</span>) </span>{
    <span class="hljs-keyword">var</span> bcr = img.getBoundingClientRect();
    placeHat(rect.x, rect.y + yOffsetValue - rect.height, rect.width, rect.height, img, i);
   });
  });
 })(images[i]);
}
</code></pre>
<p>Now, while our Python server is still running, calling the following address should show us leprechaun hats dropping onto portraits.</p>
<pre><code>http:<span class="hljs-comment">//localhost:8000/TEAM%20MEMBERS%20_%20Teamwebsite.html</span>
</code></pre><p><img src="https://cdn-media-1.freecodecamp.org/images/3lHrFCf6hT-qFaANYfSA7kyK9KzSS9BYG-N8" alt="Image" width="600" height="400" loading="lazy">
<em>Leprechaun hat drop demo</em></p>
<p>Congratulations! You just learned how to drop leprechaun hats onto all the portraits on a website with computer vision. Wishing you, your friends, and your audience a great St. Patrick’s Day!!!</p>
<h3 id="heading-in-the-end">In the end</h3>
<p>I have linked some of my previous guides on similar projects below. I believe there are certain trends in front-end design. Beyond the newly emerging .js frameworks and ES updates, computer animations and artificial intelligence can do wonders for the front end, improving user experience with elegance and efficiency.</p>
<p><strong>Beginner:</strong></p>
<ul>
<li><a target="_blank" href="https://medium.com/front-end-weekly/how-to-fill-your-website-with-lovely-valentines-hearts-d30fe66d58eb">how to fill your website with lovely VALENTINES HEARTS</a></li>
<li><a target="_blank" href="https://medium.com/front-end-weekly/how-to-add-some-fireworks-to-your-website-18b594b06cca">how to add some FIREWORKS to your website</a></li>
<li><a target="_blank" href="https://medium.com/front-end-weekly/how-to-add-some-bubbles-to-your-website-8c51b8b72944">how to add some BUBBLES to your website</a></li>
</ul>
<p><strong>Advanced:</strong></p>
<ul>
<li><a target="_blank" href="https://medium.freecodecamp.org/how-to-create-beautiful-lanterns-that-arrange-themselves-into-words-da01ae98238">how to create beautiful LANTERNS that ARRANGE THEMSELVES into words</a></li>
</ul>
<p>I am passionate about coding and love learning new things. I believe knowledge can make the world a better place, and therefore I am self-motivated to share. Let me know if you are interested in reading about anything in particular.</p>
<p>If you are looking for the source code of this project, it can be found <a target="_blank" href="https://github.com/shenhuang/shenhuang.github.io/tree/master/tracking.js-master"><strong>here</strong></a>. Thanks again to my teammates who allowed me to use their portraits for this project, and <strong>be wary before using this on a website with a Chinese user base</strong>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to classify photos in 600 classes using nine million Open Images ]]>
                </title>
                <description>
                    <![CDATA[ By Aleksey Bilogur If you’re looking to build an image classifier but need training data, look no further than Google Open Images. This massive image dataset contains over 30 million images and 15 million bounding boxes. That’s 18 terabytes of image dat... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-classify-photos-in-600-classes-using-nine-million-open-images-65847da1a319/</link>
                <guid isPermaLink="false">66c35095d73001a6c0054bdc</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 20 Feb 2019 18:16:51 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*EI4YPmaav7hc79e0GH__BQ.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Aleksey Bilogur</p>
<p>If you’re looking to build an image classifier but need training data, look no further than <a target="_blank" href="https://storage.googleapis.com/openimages/web/index.html">Google Open Images</a>.</p>
<p>This massive image dataset contains over 30 million images and 15 million bounding boxes. That’s 18 terabytes of image data!</p>
<p>Plus, Open Images is much more open and accessible than certain other image datasets at this scale. For example, ImageNet has restrictive licensing.</p>
<p>However, it’s not easy for developers on single machines to sift through that much data. You need to download and process multiple metadata files, and roll your own storage space (or apply for access to a Google Cloud bucket).</p>
<p>On the other hand, there aren’t many custom image training sets in the wild, because frankly they’re a pain to create and share.</p>
<p>In this article, we’ll build and distribute a simple end-to-end machine learning pipeline using Open Images.</p>
<p>We’ll see how to create your own dataset around any of the 600 labels included in the Open Images bounding box data.</p>
<p>We’ll show off our handiwork by building “open sandwiches”. These are simple, reproducible image classifiers built to answer an age-old question: <a target="_blank" href="https://english.stackexchange.com/questions/246580/is-a-hamburger-considered-a-sandwich">is a hamburger a sandwich</a>?</p>
<p>Want to see the code? You can follow along in <a target="_blank" href="https://github.com/quiltdata/open-images">the repository on GitHub</a>.</p>
<h3 id="heading-downloading-the-data">Downloading the data</h3>
<p>We need to download the relevant data before we can do anything with it.</p>
<p>This is the core challenge when working with Google Open Images (or any external dataset really). There is no easy way to download a subset of the data. We need to write a script that does so for us.</p>
<p>I’ve written a Python script that searches the metadata in the <a target="_blank" href="https://github.com/openimages/dataset">Open Images data set</a> for keywords that you specify. It finds the original URLs of the corresponding images (on <a target="_blank" href="https://www.flickr.com/">Flickr</a>), then downloads them to disk.</p>
<p>It’s a testament to the power of Python that you can do all of this in just 50 lines of code.</p>
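<p>The real script in the repository also handles the metadata joins and various error cases; a heavily simplified sketch of the core download loop (the urls.csv file and its column names here are my illustrative assumptions, not the actual script) might look like this:</p>
<pre><code>import os
import pandas as pd
import requests
from concurrent.futures import ThreadPoolExecutor

def download(row):
    image_id, url = row
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        with open("images/{}.jpg".format(image_id), "wb") as f:
            f.write(resp.content)
    except requests.RequestException:
        pass  # some source images have been deleted from Flickr since

os.makedirs("images", exist_ok=True)
urls = pd.read_csv("urls.csv")  # assumed columns: ImageID, OriginalURL
with ThreadPoolExecutor(max_workers=16) as pool:
    pool.map(download, zip(urls["ImageID"], urls["OriginalURL"]))
</code></pre>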
<p>The full script enables you to download the subset of raw images that have bounding box information for any subset of categories you choose:</p>
<pre><code>$ git clone https://github.com/quiltdata/open-images.git
$ cd open-images/
$ conda env create -f environment.yml
$ source activate quilt-open-images-dev
$ cd src/openimager/
$ python openimager.py "Sandwiches" "Hamburgers"
</code></pre><p>Categories are organized in a hierarchical way.</p>
<p>For example, <code>sandwich</code> and <code>hamburger</code> are both sub-labels of <code>food</code> (but <code>hamburger</code> is not a sub-label of <code>sandwich</code> — hmm).</p>
<p>We can visualize the ontology as a radial tree using Vega:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*Wp0-dUSPLuwC6KN7hELNLw.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can view an interactive annotated version of this chart (and download the code behind it) <a target="_blank" href="https://alpha.quiltdata.com/b/quilt-example/tree/quilt/open_images/">here</a>.</p>
<p>Not all categories in Open Images have bounding box data associated with them.</p>
<p>But this script will allow you to download any subset of the 600 labels that do. Here’s a taste of what’s possible:</p>
<p><code>football</code>, <code>toy</code>, <code>bird</code>, <code>cat</code>, <code>vase</code>, <code>hair dryer</code>, <code>kangaroo</code>, <code>knife</code>, <code>briefcase</code>, <code>pencil case</code>, <code>tennis ball</code>, <code>nail</code>, <code>high heels</code>, <code>sushi</code>, <code>skyscraper</code>, <code>tree</code>, <code>truck</code>, <code>violin</code>, <code>wine</code>, <code>wheel</code>, <code>whale</code>, <code>pizza cutter</code>, <code>bread</code>, <code>helicopter</code>, <code>lemon</code>, <code>dog</code>, <code>elephant</code>, <code>shark</code>, <code>flower</code>, <code>furniture</code>, <code>airplane</code>, <code>spoon</code>, <code>bench</code>, <code>swan</code>, <code>peanut</code>, <code>camera</code>, <code>flute</code>, <code>helmet</code>, <code>pomegranate</code>, <code>crown</code>…</p>
<p>For the purposes of this article, we’ll limit ourselves to just two: <code>hamburger</code> and <code>sandwich</code>.</p>
<h3 id="heading-clean-it-crop-it">Clean it, crop it</h3>
<p>Once we’ve run the script and localized the images, we can inspect them with <code>matplotlib</code> to see what we’ve got:</p>
<pre><code><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> pltfrom matplotlib.image <span class="hljs-keyword">import</span> imread%matplotlib inlineimport os
</code></pre><pre><code>fig, axarr = plt.subplots(<span class="hljs-number">1</span>, <span class="hljs-number">5</span>, figsize=(<span class="hljs-number">24</span>, <span class="hljs-number">4</span>))<span class="hljs-keyword">for</span> i, img <span class="hljs-keyword">in</span> enumerate(os.listdir(<span class="hljs-string">'../data/images/'</span>)[:<span class="hljs-number">5</span>]):    axarr[i].imshow(imread(<span class="hljs-string">'../data/images/'</span> + img))
</code></pre><p><img src="https://cdn-media-1.freecodecamp.org/images/1*8ytDl01L-ZoUKYxYREMkuw.png" alt="Image" width="600" height="400" loading="lazy">
<em>Five example {hamburger, sandwich} images from Google Open Images V4.</em></p>
<p>These images are not easy ones to train on. They have all of the issues associated with building a dataset using an external source from the public Internet.</p>
<p>Just this small sample here demonstrates the different sizes, orientations, and occlusions possible in our target classes.</p>
<p>In one case, we didn’t even succeed in downloading the actual image. Instead, we got a placeholder telling us that the image we wanted has since been deleted!</p>
<p>Downloading this data nets us a few thousand sample images like these. The next step is to take advantage of the bounding box information to clip our images down to just the sandwich-y, hamburger-y parts.</p>
<p>Here’s another image array, this time with bounding boxes included, to demonstrate what this entails:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*ii670RcCgXExc2CQgc-QEg.png" alt="Image" width="600" height="400" loading="lazy">
<em>Bounding boxes. Notice (1) the dataset includes “depictions” and (2) raw images can contain many object instances.</em></p>
<p><a target="_blank" href="https://github.com/quiltdata/open-images/blob/master/notebooks/build-dataset.ipynb">This annotated Jupyter notebook</a> in the <a target="_blank" href="https://github.com/quiltdata/open-images">demo GitHub repository</a> does this work.</p>
<p>I will omit showing that code here because it is slightly complicated, especially since we also need to (1) refactor our image metadata to match the clipped image outputs and (2) filter out the images that have since been deleted. Definitely check out the notebook if you wish to see the code.</p>
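<p>The core operation, though, is simple. Here is a minimal illustration (mine, not the notebook’s code) of clipping an image to a bounding box; Open Images stores box coordinates normalized to the [0, 1] range, and the file paths and coordinates below are made up for the example:</p>
<pre><code>from PIL import Image

def crop_to_box(path, x_min, x_max, y_min, y_max):
    # Open Images bounding boxes are fractions of the image size,
    # so scale them up to pixel coordinates before cropping.
    img = Image.open(path)
    w, h = img.size
    return img.crop((int(x_min * w), int(y_min * h),
                     int(x_max * w), int(y_max * h)))

cropped = crop_to_box("images/some_image.jpg", 0.12, 0.58, 0.25, 0.80)
cropped.save("images_cropped/sandwich/some_image.jpg")
</code></pre>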
<p>After running the notebook code, we will have an <code>images_cropped</code> folder on disk containing all of the cropped images.</p>
<h3 id="heading-building-the-model">Building the model</h3>
<p>Once we have downloaded the data, and cropped and cleaned it, we’re ready to train the model.</p>
<p>We will train a <a target="_blank" href="https://medium.freecodecamp.org/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050">convolutional neural network</a> (or ‘CNN’) on the data.</p>
<p>CNNs are a special type of neural network which build progressively higher level features out of groups of pixels commonly found in the images.</p>
<p>How an image scores on these various features is then weighted to generate a final classification result.</p>
<p>This architecture works extremely well because it takes advantage of locality. This is because any one pixel is likely to have far more in common with pixels nearby than those far away.</p>
<p>CNNs also have other attractive properties, like noise tolerance and scale invariance (to an extent). These further improve their classification properties.</p>
<p>If you’re unfamiliar with CNNs, I recommend skimming Brandon Rohrer’s excellent “<a target="_blank" href="https://brohrer.github.io/how_convolutional_neural_networks_work.html">How convolutional neural networks work</a>” to learn more about them.</p>
<p>We will train a very simple convolutional neural network and see how even that gets decent results on our problem. I use <a target="_blank" href="https://keras.io/">Keras</a> to define and train the model.</p>
<p>We start by laying out the images in a certain directory structure:</p>
<pre><code>images_cropped/
    sandwich/
        some_image.jpg
        some_other_image.jpg
        ...
    hamburger/
        yet_another_image.jpg
        ...
</code></pre><p>We then point Keras at this folder using the following code:</p>
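<p>(The sketch below is an approximation: the augmentation parameters and target image size are assumptions, not necessarily the notebook’s exact values.)</p>
<pre><code>from keras.preprocessing.image import ImageDataGenerator

# Augment the training images; leave the validation images undistorted.
# Both generators share the same validation_split so the subsets align.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0.2,
)
val_datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

# Class indices are assigned alphabetically: hamburger=0, sandwich=1
train_generator = train_datagen.flow_from_directory(
    'images_cropped/',
    target_size=(128, 128),
    batch_size=16,
    class_mode='binary',
    subset='training',
)
validation_generator = val_datagen.flow_from_directory(
    'images_cropped/',
    target_size=(128, 128),
    batch_size=16,
    class_mode='binary',
    subset='validation',
)
</code></pre>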
<p>Keras will inspect the input folders, and determine there are two classes in our classification problem. It will assign class names based on the subfolder names, and create “image generators” that serve out of those folders.</p>
<p>But the generators don’t just serve the images as-is. Instead, they serve randomly subsampled, skewed, and zoomed selections from the images (via <code>train_datagen.flow_from_directory</code>).</p>
<p>This is an example of data augmentation in action.</p>
<p>Data augmentation is the practice of feeding an image classifier randomly cropped and distorted versions of an input dataset. It helps us overcome the small size of our dataset: we can train the model on a single image multiple times, each time using a slightly different segment of the image, preprocessed in a slightly different way.</p>
<p>With our data input defined, the next step is defining the model itself:</p>
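<p>(Again a sketch: the layer widths and input size below are assumptions rather than the notebook’s exact values.)</p>
<pre><code>from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    # Three convolutional blocks build progressively higher-level features
    Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    # A single densely connected post-processing layer...
    Flatten(),
    Dense(64, activation='relu'),
    # ...with dropout as strong regularization against overfitting
    Dropout(0.5),
    Dense(1, activation='sigmoid'),  # binary output: hamburger vs. sandwich
])
model.compile(loss='binary_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
</code></pre>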
<p>This is a simple convolutional neural network model. It contains just three convolutional layers, a single densely connected post-processing layer just before the output layer, and strong regularization in the form of a dropout layer and <code>relu</code> activations.</p>
<p>These things all work together to make it more difficult for this model to <a target="_blank" href="https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/">overfit</a>. This is important, given the small size of our input dataset.</p>
<p>Finally, the last step is actually fitting the model.</p>
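<p>A sketch of that fitting code (the details here may differ from the notebook’s exact values):</p>
<pre><code>from keras.callbacks import EarlyStopping

batch_size = 16
model.fit_generator(
    train_generator,
    # epoch step size determined by sample size and batch size
    steps_per_epoch=train_generator.samples // batch_size,
    epochs=50,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // batch_size,
    callbacks=[
        # Stop if validation loss hasn't improved in the last four
        # epochs, keeping the best-performing weights seen so far
        EarlyStopping(monitor='val_loss', patience=4,
                      restore_best_weights=True),
    ],
)
</code></pre>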
<p>This code selects an epoch step size determined by our image sample size and chosen batch size (16). Then it trains on that data for 50 epochs.</p>
<p>Training is likely to be halted early by the <code>EarlyStopping</code> callback, which returns the best-performing model ahead of the 50-epoch limit if validation loss has not improved in the previous four epochs.</p>
<p>We selected such a large patience value because there is a significant amount of variability in model validation loss.</p>
<p>This simple training regimen results in a model with about 75% accuracy:</p>
<pre><code>              precision    recall  f1-score   support

           0       0.90      0.59      0.71      1399
           1       0.64      0.92      0.75      1109

   micro avg       0.73      0.73      0.73      2508
   macro avg       0.77      0.75      0.73      2508
weighted avg       0.78      0.73      0.73      2508
</code></pre><p>It’s interesting to note that our model is under-confident when classifying hamburgers (class 0), but over-confident when classifying sandwiches (class 1).</p>
<p>90% of images classified as hamburgers are actually hamburgers. But only 59% of all actual hamburgers are classified correctly.</p>
<p>On the other hand, just 64% of images classified as sandwiches are actually sandwiches. But 92% of sandwiches are classified correctly.</p>
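<p>For reference, a report like the one above can be generated with scikit-learn. Here <code>X_val</code> and <code>y_val</code> stand in for a hypothetical held-out set of images and labels; the notebook may assemble its evaluation data differently:</p>
<pre><code>from sklearn.metrics import classification_report

# Threshold the sigmoid outputs at 0.5 to get hard class predictions
y_pred = (model.predict(X_val) > 0.5).astype(int)
print(classification_report(y_val, y_pred))
</code></pre>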
<p>These results are in line with the 80% accuracy Francois Chollet got by applying a very similar model to a similarly-sized subset of the classic <a target="_blank" href="https://www.kaggle.com/c/dogs-vs-cats">Cats versus Dogs</a> dataset.</p>
<p>The difference is probably mainly due to the increased level of occlusion and noise in the Google Open Images V4 dataset.</p>
<p>The dataset also includes illustrations as well as photographic images. These sometimes take large artistic liberties, making classification more difficult. You may choose to remove these when building a model yourself.</p>
<p>This performance can be further improved using <a target="_blank" href="https://towardsdatascience.com/keras-transfer-learning-for-beginners-6c9b8b7143e">transfer learning</a> techniques. To learn more, check out Keras author Francois Chollet’s blog post “<a target="_blank" href="https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html">Building powerful image classification models using very little data</a>”.</p>
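<p>As a taste of what transfer learning looks like, one common pattern is to reuse a pretrained ImageNet backbone as a fixed feature extractor and train only a small new head on top. The backbone choice and layer sizes below are illustrative, not a prescription:</p>
<pre><code>from keras.applications import VGG16
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

# Load ImageNet-trained convolutional layers, without the classifier head
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(128, 128, 3))
base.trainable = False  # freeze the pretrained features

model = Sequential([
    base,
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
</code></pre>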
<h3 id="heading-distributing-the-model">Distributing the model</h3>
<p>Now that we’ve built a custom dataset and trained a model, it’d be a shame if we didn’t share it.</p>
<p>Machine Learning projects should be reproducible. I outline the following strategy in a previous article, “<a target="_blank" href="https://blog.quiltdata.com/reproduce-a-machine-learning-model-build-in-four-lines-of-code-b4f0a5c5f8c8">Reproduce a machine learning model build in four lines of code</a>”.</p>
<ul>
<li>Separate dependencies into data, code, and environment components.</li>
<li>Data dependencies: version control (1) the model definition and (2) the training data. Save these to versioned blob storage, e.g. <a target="_blank" href="https://aws.amazon.com/s3/">Amazon S3</a> with <a target="_blank" href="https://github.com/quiltdata/t4">Quilt T4</a>.</li>
<li>Code dependencies: version control the code used to train the model (use git).</li>
<li>Environment dependencies: version control the environment used to train the model. In a production environment this should probably be a Dockerfile, but you can use <code>pip</code> or <code>conda</code> locally.</li>
<li>To provide someone with a retrainable copy of the model, give them the corresponding <code>{data, code, environment}</code> tuple.</li>
</ul>
<p>Following these principles means that everything you need to train your own copy of this model fits into a handful of lines of code:</p>
<pre><code>git clone https://github.com/quiltdata/open-images.git
conda env create -f open-images/environment.yml
source activate quilt-open-images-dev
python -c "import t4; t4.Package.install('quilt/open_images', dest='open-images/', registry='s3://quilt-example')"
</code></pre><p>To learn more about <code>{data, code, environment}</code> see <a target="_blank" href="https://github.com/quiltdata/open-images">the GitHub repository</a> and/or <a target="_blank" href="https://blog.quiltdata.com/reproduce-a-machine-learning-model-build-in-four-lines-of-code-b4f0a5c5f8c8">the corresponding article</a>.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>In this article we demonstrated an end-to-end image classification machine learning pipeline. We covered everything from downloading/transforming a dataset to training a model. We then distributed it in a way that lets anyone else rebuild it themselves later.</p>
<p>Because custom datasets are difficult to generate and distribute, over time there has emerged a cabal of example datasets which get used everywhere. This is not because they’re actually that good (they’re not). Instead, it’s because they’re easy.</p>
<p>For example, Google’s recently released Machine Learning Crash Course makes heavy use of the <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html">California Housing Dataset</a>. That data is now almost two decades old!</p>
<p>Consider instead exploring new horizons, using real images from the living Internet, with interesting categorical breakdowns. It’s easier than you think!</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
