Accessibility - freeCodeCamp.org

How to Build Responsive and Accessible UI Designs with React and Semantic HTML

Gopinath Karunanithi — Tue, 07 Apr 2026 17:06:31 +0000

Building modern React applications requires more than just functionality. It also demands responsive layouts and accessible user experiences.

By combining semantic HTML, responsive design techniques, and accessibility best practices (like ARIA roles and keyboard navigation), developers can create interfaces that work across devices and for all users, including those with disabilities.

This article shows how to design scalable, inclusive React UIs using real-world patterns and code examples.

Prerequisites
Overview
Why Accessibility and Responsiveness Matter
Core Principles of Accessible and Responsive Design
Using Semantic HTML in React
Structuring a Page with Semantic Elements
Building Responsive Layouts
Accessibility with ARIA
Keyboard Navigation
Focus Management
Forms and Accessibility
Responsive Typography and Images
Building a Fully Accessible Responsive Component (End-to-End Example)
Testing Accessibility
Best Practices
When NOT to Overuse Accessibility Features
Future Enhancements
Conclusion

Prerequisites

Before following along, you should be familiar with:

React fundamentals (components, hooks, JSX)
Basic HTML and CSS
JavaScript ES6 features
Basic understanding of accessibility concepts (helpful but not required)

Overview

Modern web applications must serve a diverse audience across a wide range of devices, screen sizes, and accessibility needs. Users today expect seamless experiences whether they are browsing on a desktop, tablet, or mobile device – and they also expect interfaces that are usable regardless of physical or cognitive limitations.

Two essential principles help achieve this:

Responsive design, which ensures layouts adapt to different screen sizes
Accessibility, which ensures applications are usable by people with disabilities

In React applications, these principles are often implemented incorrectly or treated as afterthoughts. Developers may rely heavily on div-based layouts, ignore semantic HTML, or overlook accessibility features such as keyboard navigation and screen reader support.

This article will show you how to build responsive and accessible UI designs in React using semantic HTML. You'll learn how to:

Structure components using semantic HTML elements
Build responsive layouts using modern CSS techniques
Improve accessibility with ARIA attributes and proper roles
Ensure keyboard navigation and screen reader compatibility
Apply best practices for scalable and inclusive UI design

By the end of this guide, you'll be able to create React interfaces that are not only visually responsive but also accessible to all users.

Why Accessibility and Responsiveness Matter

Responsive and accessible design isn't just about compliance. It directly impacts usability, performance, and reach.

Accessibility benefits:

Supports users with visual, motor, or cognitive impairments
Improves SEO and content discoverability
Enhances usability for all users

Responsiveness benefits:

Ensures consistent UX across devices
Reduces bounce rates on mobile
Improves performance and scalability

Ignoring these principles can result in broken layouts on smaller screens, poor screen reader compatibility, and limited reach and usability.

Core Principles of Accessible and Responsive Design

Before diving into the code, it’s important to understand the foundational principles.

1. Semantic HTML First

Semantic HTML refers to using HTML elements that clearly describe their meaning and role in the interface, rather than relying on generic containers like <div> or <span>. These elements provide built-in accessibility, improve SEO, and make code more readable.

For example:

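Non-semantic:

<div onClick={handleClick}>Submit</div>

Semantic:

<button onClick={handleClick}>Submit</button>

Another example:

Non-semantic:

<div className="header">My App</div>

Semantic:

<header>My App</header>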

Using semantic elements such as

, and

Then, React can enhance it with validation, dynamic feedback, or animations.

By prioritizing functionality first and enhancements later, you ensure your application remains usable in a wide range of real-world scenarios.

4. Keyboard Accessibility

Keyboard accessibility ensures that users can navigate and interact with your application using only a keyboard. This is critical for users with motor disabilities and also improves usability for power users.

Key aspects of keyboard accessibility include:

Ensuring all interactive elements (buttons, links, inputs) are focusable
Maintaining a logical tab order across the page
Providing visible focus indicators (for example, outline styles)
Supporting keyboard events such as Enter and Space

Bad Example (Not Accessible)

<div onClick={handleClick}>Submit</div>

This element:

Cannot be focused with a keyboard
Does not respond to Enter/Space
Is not announced as a button by screen readers

Good Example

<button onClick={handleClick}>Submit</button>

This automatically supports:

Keyboard interaction
Focus management
Screen reader announcements

Custom Component Example (if needed)

<div
  role="button"
  tabIndex={0}
  onClick={handleClick}
  onKeyDown={(e) => {
    if (e.key === 'Enter' || e.key === ' ') {
      e.preventDefault();
      handleClick();
    }
  }}
>
  Submit
</div>

But only use this when native elements aren't sufficient.

These principles form the foundation of accessible and responsive design:

Use semantic HTML to communicate intent
Design for mobile first, then scale up
Enhance progressively for better compatibility
Ensure full keyboard accessibility

Applying these early prevents major usability and accessibility issues later in development.

Using Semantic HTML in React

As we briefly discussed above, semantic HTML plays a critical role in both accessibility (a11y) and code readability. Semantic elements clearly describe their purpose to both developers and browsers, which allows assistive technologies like screen readers to interpret and navigate the UI correctly.

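For example, when you use a <button> instead of a clickable <div>, the browser supplies the expected behavior for you.

Why this is better:

The <button> element communicates its purpose without extra attributes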
It is automatically focusable and keyboard accessible
It supports Enter and Space key activation by default
Screen readers correctly announce it as a button

This reduces complexity while improving accessibility and usability.

Why all this matters:

There are many reasons to use semantic HTML.

First, semantic elements like


  );
}

How this works:

role="dialog" identifies the element as a modal dialog
aria-modal="true" indicates that background content is inactive
aria-labelledby connects the dialog to its visible title for screen readers
tabIndex={-1} allows the dialog container to receive focus programmatically
Focus is moved to the dialog when it opens
Pressing Escape closes the modal, which is a standard accessibility expectation

This ensures that users can understand, navigate, and exit the modal using both keyboard and assistive technologies.

Key ARIA Attributes

1. role

Defines the type of element and its purpose. For example, role="dialog" tells assistive technologies that the element behaves like a modal dialog.

2. aria-label

Provides an accessible name for an element when visible text is not sufficient. Screen readers use this label to describe the element to users.

3. aria-hidden

Indicates whether an element should be ignored by assistive technologies. For example, aria-hidden="true" hides decorative elements from screen readers.

4. aria-live

Used for dynamic content updates. It tells screen readers to announce changes automatically without requiring user interaction (for example, form validation messages or notifications).

Example: Accessible Dropdown (Custom Component)

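function Dropdown({ isOpen, toggle }) {
  return (
    <div>
      {/* The id, label, and link targets are illustrative placeholders */}
      <button aria-expanded={isOpen} aria-controls="dropdown-list" onClick={toggle}>
        Menu
      </button>

      {isOpen && (
        <ul id="dropdown-list">
          <li>
            <a href="/profile">Profile</a>
          </li>
          <li>
            <a href="/settings">Settings</a>
          </li>
        </ul>
      )}
    </div>
  );
}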

How this works:

aria-expanded indicates whether the dropdown is open or closed
aria-controls links the button to the dropdown content via its id
The <button> is natively focusable and toggles the list with Enter or Space
The <ul> and <li> elements provide a natural list structure
Using <a> elements ensures proper navigation behavior and accessibility

Why this approach is correct:

It follows standard web patterns instead of application-style menus
It avoids misusing ARIA roles like role="menu", which require complex keyboard handling
Screen readers can correctly interpret the structure without additional roles
It keeps the implementation simple, accessible, and maintainable

If you need advanced menu behavior (like arrow key navigation), then ARIA menu roles may be appropriate – but only when fully implemented according to the ARIA Authoring Practices.

Note: Most dropdowns in web applications are not true "menus" in the ARIA sense. Avoid using role="menu" unless you are implementing full keyboard navigation (arrow keys, focus management, and so on).
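
As a rough sketch of the pattern above, the toggle's ARIA state can be derived from a single isOpen flag. The helper name getDisclosureProps and the id account-links are hypothetical and chosen only for illustration. The function returns plain objects, so it runs outside React as well.

```javascript
// Hypothetical helper, not part of React or any ARIA library.
// Builds the attributes for a disclosure-style toggle button and its list.
function getDisclosureProps(isOpen, panelId) {
  return {
    button: {
      type: 'button',
      'aria-expanded': isOpen, // reflects the open/closed state
      'aria-controls': panelId, // points at the list's id
    },
    // The list only needs a matching id; no role="menu" is added.
    panel: { id: panelId },
  };
}

const collapsedProps = getDisclosureProps(false, 'account-links');
const expandedProps = getDisclosureProps(true, 'account-links');

console.log(collapsedProps.button['aria-expanded']); // false
console.log(expandedProps.button['aria-expanded']); // true
console.log(expandedProps.button['aria-controls'] === expandedProps.panel.id); // true
```

In a component, these objects could be spread onto the elements, for example onto the button and the ul, which keeps the two ids from drifting apart.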

Keyboard Navigation

Keyboard navigation ensures that users can fully interact with your application using only a keyboard, without relying on a mouse. This is essential for users with motor disabilities, but it also benefits power users and developers who prefer keyboard-based workflows.

In a well-designed interface, users should be able to:
- Navigate through interactive elements using the Tab key
- Activate buttons and links using Enter or Space
- Clearly see which element is currently focused
In the example below, we’ll look at common mistakes in keyboard handling and why relying on native HTML elements is usually the better approach.

Example:

Avoid adding custom keyboard handlers to native elements like <button>. A plain button with an onClick handler is enough:

<button onClick={handleClick}>Submit</button>

Why this works:
- Supports both Enter and Space key activation by default
- Is focusable and participates in natural tab order
- Provides built-in accessibility roles and screen reader announcements
- Reduces the need for additional logic or ARIA attributes

Adding custom keyboard handlers (like onKeyDown) to native elements is unnecessary and can introduce bugs or inconsistent behavior. Always prefer native HTML elements for interactivity whenever possible.

Avoiding Common Keyboard Traps

One of the most common keyboard accessibility issues is trapping users inside interactive components such as modals or custom dropdowns. This happens when focus moves into a component and the user can't move it back out with Tab, Shift+Tab, or other keyboard controls. Keyboard users can then get stuck, unable to reach the rest of the page.

In the example below, you'll see a simple modal that tries to set focus, but doesn’t manage Tab behavior properly.
```
function Modal({ isOpen }) {
  const ref = React.useRef();

  React.useEffect(() => {
    if (isOpen) ref.current?.focus();
  }, [isOpen]);

  return (
    <div role="dialog">
      <button ref={ref}>Close</button>
    </div>
  );
}
```
What this code shows:
- When the modal opens, focus is moved to the Close button using ref.current.focus()
- The modal uses role="dialog" to communicate its purpose
There are some issues with this code that you should be aware of. First, if the page contains other focusable elements, pressing Tab can move focus out of the modal while it is still open.

Second, nothing returns focus to the element that opened the modal, so users can lose their place when it closes.

Third, there's no handling of Shift+Tab and no focus cycling.

This demonstrates partial focus management, but it isn't fully accessible yet.

To improve focus management, you can trap focus within the modal by ensuring that Tab and Shift+Tab cycle only through elements inside the modal.

You can also return focus to the trigger: when the modal closes, return focus to the element that opened it.

Example improvement (conceptual):
```
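function Modal({ isOpen, onClose, triggerRef }) {
  const modalRef = React.useRef();

  React.useEffect(() => {
    if (isOpen) {
      // Move focus into the dialog when it opens
      modalRef.current?.focus();
      // Add focus trap logic here
    } else {
      // Return focus to the element that opened the dialog
      triggerRef.current?.focus();
    }
  }, [isOpen, triggerRef]);

  return (
    <div role="dialog" tabIndex={-1} ref={modalRef}>
      <button onClick={onClose}>Close</button>
    </div>
  );
}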
```
Remember that this modal is not fully accessible without focus trapping. In production, use a library like focus-trap-react, react-aria, or Radix UI.

Key points:
- tabIndex={-1} allows the div to receive programmatic focus
- Focus trap ensures users cannot tab out unintentionally
- Returning focus preserves context, so users can continue where they left off

Best practices:
- Always move focus into modals
- Return focus to the trigger element when closed
- Ensure Tab cycles correctly

As a general rule, always prefer native HTML elements for interactivity. Only implement custom keyboard handling when building advanced components that cannot be achieved with standard elements.
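
To make the Tab cycling described above concrete, here is a minimal sketch of the wrap-around logic behind a focus trap. The names nextFocusIndex and handleTrapKeyDown are hypothetical, and the focusable-element selector is a simplified assumption. The libraries mentioned earlier handle many more edge cases.

```javascript
// Hypothetical helper: picks which focusable element should receive focus next.
// currentIndex is -1 when focus sits on the dialog container itself.
function nextFocusIndex(currentIndex, count, shiftKey) {
  if (count === 0) return -1; // nothing focusable inside the dialog
  if (currentIndex < 0) return shiftKey ? count - 1 : 0;
  const step = shiftKey ? -1 : 1;
  // Adding count before the modulo keeps the result non-negative.
  return (currentIndex + step + count) % count;
}

// Simplified list of focusable elements (an assumption, not exhaustive).
const FOCUSABLE =
  'a[href], button:not([disabled]), input:not([disabled]), ' +
  'select:not([disabled]), textarea:not([disabled]), ' +
  '[tabindex]:not([tabindex="-1"])';

// Browser-only wiring: call this from the dialog's onKeyDown handler.
function handleTrapKeyDown(event, container) {
  if (event.key !== 'Tab') return;
  const items = [...container.querySelectorAll(FOCUSABLE)];
  const current = items.indexOf(container.ownerDocument.activeElement);
  const next = nextFocusIndex(current, items.length, event.shiftKey);
  event.preventDefault(); // keep focus from leaving the dialog
  if (next >= 0) items[next].focus();
}

// The index logic can be checked without a browser:
console.log(nextFocusIndex(2, 3, false)); // 0 (Tab on the last item wraps to the first)
console.log(nextFocusIndex(0, 3, true)); // 2 (Shift+Tab on the first item wraps to the last)
```

Keeping the index arithmetic in a pure function makes the wrap-around behavior easy to unit test separately from the DOM wiring.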

Focus Management

Focus management is the practice of controlling where keyboard focus goes when users interact with components such as modals, forms, or interactive widgets. Proper focus management ensures that:
- Users relying on keyboards or assistive technologies can navigate seamlessly
- Focus does not get lost or trapped in unexpected places
- Users maintain context when content updates dynamically
The example below shows a common approach that only partially handles focus:

Bad Example:
```
// Bad Example: Automatically focusing input without context
const ref = React.useRef();
React.useEffect(() => {
  ref.current?.focus();
}, []);
```
In the above code, the input receives focus as soon as the component mounts, but there’s no handling for returning focus when the user navigates away.

If this input is inside a modal or dynamic content, users may get lost or trapped. There aren't any focus indicators or context for assistive technologies.

This is a minimal solution that can cause confusion in real applications.

Improved Example:
```
// Improved Example: Managing focus in a modal context
function Modal({ isOpen, onClose, triggerRef }) {
  const dialogRef = React.useRef();

  React.useEffect(() => {
    if (isOpen) {
      dialogRef.current?.focus();
    } else if (triggerRef?.current) {
      triggerRef.current.focus();
    }
  }, [isOpen, triggerRef]);

  React.useEffect(() => {
    function handleKeyDown(e) {
      if (e.key === 'Escape') {
        onClose();
      }
    }

    if (isOpen) {
      document.addEventListener('keydown', handleKeyDown);
    }

    return () => {
      document.removeEventListener('keydown', handleKeyDown);
    };
  }, [isOpen, onClose]);

  if (!isOpen) return null;

  return (
    <div
      ref={dialogRef}
      role="dialog"
      aria-modal="true"
      aria-labelledby="modal-title"
      tabIndex={-1}
    >
      <h2 id="modal-title">Modal Title</h2>
      <button onClick={onClose}>Close</button>
    </div>
  );
}
```
Explanation:
- tabIndex={-1} enables the dialog container to receive focus
- Focus is moved to the modal when it opens, ensuring keyboard users start in the correct context
- Focus is returned to the trigger element when the modal closes, preserving user flow
- aria-labelledby provides an accessible name for the dialog
- Escape key handling allows users to close the modal without a mouse
Note: For full accessibility, you should also implement focus trapping so users cannot tab outside the modal while it is open.

Tip: In production applications, use libraries like react-aria, focus-trap-react, or Radix UI to handle focus trapping and accessibility edge cases reliably.

Also keep in mind that a keydown listener on document is global: it fires for key presses anywhere on the page and can conflict with other components.
```
document.addEventListener('keydown', handleKeyDown);
```
A safer alternative is to scope it to the modal:
```
<div
  role="dialog"
  onKeyDown={(e) => {
    if (e.key === 'Escape') onClose();
  }}
>
  {/* dialog content */}
</div>
```
For simple cases, attach onKeyDown to the dialog instead of the document.

Best Practice:

For complex components, use libraries like focus-trap-react or react-aria to manage focus reliably, especially for modals, dropdowns, and popovers.

Forms and Accessibility

Forms are critical points of interaction in web applications, and proper accessibility ensures that all users – including those using screen readers or other assistive technologies – can understand and interact with them effectively.

Proper labeling means that every input field, checkbox, radio button, or select element has an associated label that clearly describes its purpose. This allows screen readers to announce the input meaningfully and helps keyboard-only users understand what information is expected.

In addition to labeling, form accessibility includes:
- Providing clear error messages when input is invalid
- Ensuring error messages are announced to assistive technologies
- Maintaining logical focus order so users can navigate inputs easily
Bad Example:

<input type="text" placeholder="Enter your name" />

Why this isn't good:
- This input relies only on a placeholder for context
- Screen readers may not announce the purpose of the field clearly
- Once a user starts typing, the placeholder disappears, leaving no guidance
- Keyboard-only users may not have enough context to know what to enter

Good Example:
```
<label htmlFor="name">Name</label>
<input id="name" type="text" />
```
Why this is better:
- The <label> is explicitly associated with the input via htmlFor / id
- Screen readers announce "Name" before the input, providing clear context
- Users navigating with Tab understand the field’s purpose
- The label persists even when the user types, unlike a placeholder
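
Error Handling:
```
<label htmlFor="name">Name</label>

<input id="name" type="text" aria-invalid="true" aria-describedby="name-error" />

<p id="name-error" role="alert">
  Name is required
</p>
```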
Explanation
- aria-describedby links the input to the error message using the element’s id
- Screen readers announce the error message when the input is focused
- aria-invalid="true" indicates that the field currently contains an error
- role="alert" ensures the error message is announced immediately when it appears
This creates a clear relationship between the input and its validation message, improving usability for screen reader users.

Tip: Only apply aria-invalid and error messages when validation fails. Avoid marking fields as invalid before user interaction.
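
The tip above can be sketched as a small function that adds the error-related attributes only after the user has interacted with the field and validation has failed. The helper name getFieldA11yProps, the touched flag, and the "-error" id suffix are assumptions made for this illustration.

```javascript
// Hypothetical helper: returns the attributes an input should carry.
// `touched` means the user has already interacted with the field.
function getFieldA11yProps({ id, error, touched }) {
  const showError = Boolean(touched && error);
  const inputProps = { id };

  if (showError) {
    // Only mark the field invalid once validation has actually failed.
    inputProps['aria-invalid'] = 'true';
    inputProps['aria-describedby'] = `${id}-error`;
  }

  return {
    inputProps,
    // id to put on the element that renders the message, or null if hidden
    errorId: showError ? `${id}-error` : null,
  };
}

const pristine = getFieldA11yProps({ id: 'name', error: 'Name is required', touched: false });
const failed = getFieldA11yProps({ id: 'name', error: 'Name is required', touched: true });

console.log(pristine.inputProps); // { id: 'name' }
console.log(failed.inputProps['aria-describedby']); // name-error
```

In JSX, inputProps could be spread onto the input, and the error element rendered with errorId and role="alert" only when errorId is not null.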

Responsive Typography and Images

Responsive typography and images ensure that your content remains readable and visually appealing across a wide range of devices, from small smartphones to large desktop monitors.

This matters because text should scale so it stays legible on every screen, and images should adjust to their container to avoid overflow and layout issues. Both contribute to a better user experience and accessibility.

In this section, we’ll cover practical ways to implement responsive typography and images in React and CSS.
```
h1 {
  font-size: clamp(1.5rem, 2vw, 3rem);
}
```
In this code:
- The clamp() function lets text scale fluidly between two bounds:
  - The first value (1.5rem) is the minimum font size
  - The second value (2vw) is the preferred size, based on viewport width
  - The third value (3rem) is the maximum font size
- This keeps headings readable on small screens without becoming too large on desktops

Alternative methods include using media queries to adjust font sizes at different breakpoints.
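
To see how clamp(1.5rem, 2vw, 3rem) resolves at different widths, the arithmetic can be sketched in plain JavaScript. This assumes a root font size of 16px (the common browser default). resolveClamp is a hypothetical helper, not a browser API.

```javascript
// Hypothetical helper that mirrors CSS clamp(MIN, VAL, MAX) for a vw-based value.
// Assumes 1rem = rootPx pixels (16px unless the user or page changes it).
function resolveClamp(
  viewportWidthPx,
  { minRem = 1.5, preferredVw = 2, maxRem = 3, rootPx = 16 } = {}
) {
  const min = minRem * rootPx;
  const max = maxRem * rootPx;
  const preferred = (preferredVw * viewportWidthPx) / 100; // 1vw = 1% of the width
  // CSS defines clamp(MIN, VAL, MAX) as max(MIN, min(VAL, MAX)).
  return Math.max(min, Math.min(preferred, max));
}

for (const width of [375, 1200, 1600, 2400, 3000]) {
  console.log(`${width}px viewport -> ${resolveClamp(width)}px heading`);
}
// 375px viewport -> 24px heading
// 1200px viewport -> 24px heading
// 1600px viewport -> 32px heading
// 2400px viewport -> 48px heading
// 3000px viewport -> 48px heading
```

Under that assumption the 1.5rem floor stays in effect until the viewport reaches 1200px, so this heading only starts growing on wide screens. A larger preferred value would make it scale earlier.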

Responsive Images:

Responsive images adapt to different screen sizes and resolutions to prevent layout issues and slow loading times. Key techniques include:

1. Fluid images using CSS:
```
img {
     max-width: 100%;
     height: auto;
   }
```
This makes sure that images never overflow their container and maintains aspect ratio automatically.

2. Using srcset for multiple resolutions:

The srcset attribute (srcSet in JSX) lists several versions of an image so the browser can pick one that fits the screen size or pixel density. Smaller devices download smaller files, which reduces loading times and improves performance.

3. Always include descriptive alt text

This is critical for screen readers and accessibility. It also helps users understand the image if it cannot be loaded.

Tip: Combine responsive typography, images, and flexible layout containers (like CSS Grid or Flexbox) to create interfaces that scale gracefully across all devices and maintain accessibility.

4. Ensure Sufficient Color Contrast

Low contrast text can make content unreadable for many users.
```
.bad-text {
  color: #aaa;
}

.good-text {
  color: #222;
}
```
Use tools like WebAIM Contrast Checker and Chrome DevTools Accessibility panel to check your color contrasts. Also note that WCAG AA requires 4.5:1 contrast ratio for normal text.
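
The 4.5:1 threshold can also be checked in code. The sketch below follows the WCAG 2.1 definitions of relative luminance and contrast ratio. The function names are hypothetical, and the white background (#fff) is an assumption, since the CSS above does not set one.

```javascript
// Convert "#abc" or "#aabbcc" to [r, g, b] with 0-255 channels.
function hexToRgb(hex) {
  let h = hex.replace('#', '');
  if (h.length === 3) h = [...h].map((c) => c + c).join('');
  const n = parseInt(h, 16);
  return [(n >> 16) & 255, (n >> 8) & 255, n & 255];
}

// Relative luminance as defined in WCAG 2.1.
function relativeLuminance(hex) {
  const [r, g, b] = hexToRgb(hex).map((value) => {
    const s = value / 255;
    return s <= 0.03928 ? s / 12.92 : ((s + 0.055) / 1.055) ** 2.4;
  });
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio: (lighter + 0.05) / (darker + 0.05), from 1 to 21.
function contrastRatio(foreground, background) {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

// Assumed background: white.
for (const color of ['#aaa', '#222']) {
  const ratio = contrastRatio(color, '#fff');
  console.log(color, ratio.toFixed(2), ratio >= 4.5 ? 'passes AA' : 'fails AA');
}
```

On a white background, #aaa falls well short of 4.5:1 while #222 clears it comfortably, which matches the bad and good examples above.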

Building a Fully Accessible Responsive Component (End-to-End Example)

To understand how responsiveness and accessibility work together in practice, let’s build a reusable accessible card component that adapts to screen size and supports keyboard and screen reader users.

Step 1: Component Structure (Semantic HTML)
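```
function ProductCard({ title, description, onAction }) {
  return (
    <article className="card">
      <h2>{title}</h2>
      <p>{description}</p>
      {/* Button label is an illustrative placeholder */}
      <button onClick={onAction}>View details</button>
    </article>
  );
}
```
Why This Works
- <article> provides semantic meaning for standalone content
- <h2> establishes a proper heading hierarchy

Step 2: Responsive Styling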
```
.card {
  padding: 16px;
  border: 1px solid #ddd;
  border-radius: 8px;
}

@media (min-width: 768px) {
  .card {
    padding: 24px;
  }
}
```
This ensures comfortable spacing on mobile and improved readability on larger screens.

Step 3: Accessibility Enhancements
The visible button text provides a clear and accessible label, so no additional ARIA attributes are needed.

Step 4: Keyboard Focus Styling
```
button:focus {
  outline: 2px solid blue;
  outline-offset: 2px;
}
```
Focus indicators are essential for keyboard users.

Step 5: Using the Component
```
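function App() {
  return (
    <main>
      {/* Title and description are illustrative placeholder values */}
      <ProductCard
        title="Wireless Headphones"
        description="Noise-cancelling over-ear headphones"
        onAction={() => alert('Clicked')}
      />
    </main>
  );
}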
```
Key Takeaways

This simple component demonstrates:
- Semantic HTML structure
- Responsive design
- Built-in accessibility via native elements
- Minimal ARIA usage
In real-world applications, this pattern scales into entire design systems.

Testing Accessibility

Accessibility should be validated continuously, not just at the end of development. There are various automated tools you can use to help you with this process:
- Lighthouse (built into Chrome DevTools)
- axe DevTools for detailed audits
- ESLint plugins for accessibility rules

Manual Testing

Automated tools cannot catch everything, though. Manual testing is essential: make sure users can navigate using only the keyboard, and try the interface with a screen reader (NVDA or VoiceOver). You should also test zoom levels (up to 200%) and check color contrast manually.

Example: ESLint Accessibility Plugin
```
npm install eslint-plugin-jsx-a11y --save-dev
```
This helps catch accessibility issues during development.

Best Practices

- Use semantic HTML first
- Avoid unnecessary ARIA
- Test keyboard navigation
- Design mobile-first
- Ensure color contrast
- Use consistent spacing

When NOT to Overuse Accessibility Features

- Avoid adding ARIA when native HTML works
- Do not override browser defaults unnecessarily
- Avoid complex custom components without accessibility support

Future Enhancements

- Design systems with accessibility built-in
- Automated accessibility testing in CI/CD
- Advanced focus management libraries
- Accessibility-first component libraries

Conclusion

Building responsive and accessible React applications is not a one-time effort—it is a continuous design and engineering practice. Instead of treating accessibility as a checklist, developers should integrate it into the core of their component design process.

If you are starting out, focus on using semantic HTML and mobile-first layouts. These two practices alone solve a large percentage of accessibility and responsiveness issues. As your application grows, introduce ARIA enhancements, keyboard navigation, and automated accessibility testing.

The key is to build interfaces that work for everyone by default. When responsiveness and accessibility are treated as first-class concerns, your React applications become more usable, scalable, and future-proof.

How to Create a Table of Contents for Your Article

Jakub T. Jankiewicz — Thu, 12 Mar 2026 08:39:33 +0000

When you create an article, such as a blog post for freeCodeCamp, Hashnode, Medium, or DEV.to, you can help guide the reader by creating a Table of Contents (ToC). In this article, I'll explain how to create one with the help of JavaScript and browser DevTools. The article will explain how to use Google Chrome Dev Tools. But the same can be applied to any modern browser.

The process in this article needs to be done once per platform. Once you have the code, you can apply it every time to create a ToC. Note that if the platform changes something, you may need to adjust the script.

Browser Dev Tools
JavaScript Console
Understanding the DOM Structure
Creating TOC in Markdown
How to create an HTML TOC?
Copy the HTML code for the editor
What to do if I don’t have headers?
- Create Table of Contents for DEV.to
Conclusion

Browser Dev Tools

Dev Tools is a set of tools built into the browser that lets you inspect and manipulate the DOM (Document Object Model), the tree representation of the HTML that the browser keeps in memory. It also gives access to the JavaScript console, where you can write short code snippets to test something. It has a lot more features, but we'll only use those two.

To open Dev Tools (in Google Chrome), you can press F12 or right-click on the page with your mouse and click Inspect.

⚠

In Safari, the browser Dev Tools are disabled initially. To enable it, read: Use the developer tools in the Develop menu in Safari on Mac.

Above is the screenshot of DevTools with a preview of this article. On the right, you can see a selected h1 HTML tag (the title) and CSS applied to that tag. The tree structure you see is the DOM.

💡

When creating a ToC for freeCodeCamp, you should open the preview in a new tab.

JavaScript Console

We will need access to the JavaScript console, which is part of Dev Tools. To open Dev Tools in Google Chrome, you can press F12, right-click on the page and select Inspect from the context menu, or use the shortcut CTRL+SHIFT+C (Windows, Linux) or CMD+OPTION+C (Mac). CTRL+SHIFT+J (Windows, Linux) or CMD+OPTION+J (Mac) opens the Console panel directly.

In Chrome DevTools, you can pick the Console tab at the top of the DevTools. But this will hide the DOM tree. It’s better to open the bottom drawer. You need to click the 3 dots in the top right corner and pick “show console drawer”.

The Dev Tools will look like this:

💡

You can ignore any errors or warnings in the console. You can click this icon 🚫 on the left side of the drawer, and it will clear the console.

The console is a so-called Read-Eval-Print Loop (REPL): a classic interface where you type commands (here, JavaScript code) and, when you press Enter, the code is executed in the context of the page that DevTools is attached to.

Above, you can see a page alert executed from the console.

Understanding the DOM Structure

The first step to create a ToC is to inspect the DOM and find the headers. They are usually H1…H6 tags. H1 is often the title of the page. In an ideal world, it would always be.

In my case, the header looks like this:

<h2 id="heading-dev-tools">Dev Tools</h2>

The article only has H2 tags, but later in the article, I will also explain how to create a nested ToC.

💡

Your headers need to have an “id” attribute. It can look different, for example, be on a different element, but it has to be in the DOM. Later in the article, I will explain a few different structures and how to handle them.

Now with DevTools, we can write code that will find every header:

document.querySelectorAll('h2[id], h3[id], h4[id]');

In the case of my article on freeCodeCamp, it returned this output:

NodeList(5) [h2#heading-dev-tools, h2#heading-javascript-console, h2#heading-understanding-the-dom-structure, h2#trending-guides.col-header, h2#mobile-app.col-header]

Two things stand out. First, the result is a NodeList, which we need to convert to an Array. Second, besides the article's headers, it includes two headers that belong to the website rather than the main content. So we need to find the single element that is the parent of the headers we want.

You can right-click on the white page that contains the article and pick Inspect Element. In our case, it found a <main> element. So we can rewrite our selector as:

document.querySelectorAll('main h2[id], main h3[id], main h4[id]');

And now it returns our headers and nothing more.

💡

The [id] attribute selector is not needed here, actually. At least not on freeCodeCamp.

How to Create the ToC in Markdown

A lot of blogging platforms support Markdown, so it'll be the first thing we'll create.

First, we'll convert the NodeList to an array. We can use the spread operator:

[...document.querySelectorAll('main h2[id], main h3[id], main h4[id]')];

Then we can map over the array and create the Markdown links that point to the given header.

const headers = [...document.querySelectorAll('main h2[id], main h3[id], main h4[id]')];

headers.map(function(node) {
    // H2 header should have 0 indent
    const level = parseInt(node.nodeName.replace('H', '')) - 2;
    const hash = node.getAttribute('id');
    const indent = ' '.repeat(level * 2);
    return `${indent}* [${node.innerText}](#${hash})`;
});

The output looks like this:

(4) ['* [Dev Tools](#heading-dev-tools)', '* [JavaScript Console](#heading-javascript-console)', '* [Understanding the DOM Structure](#heading-understanding-the-dom-structure)', '* [What to do if I don’t have headers?](#heading-what-to-do-if-i-dont-have-headers)']

To get the text, we can join the array with a newline character and use console.log to display the output. If we don’t use console.log, it will show a string with \n characters.

const headers = [...document.querySelectorAll('main h2[id], main h3[id], main h4[id]')];

console.log(headers.map(function(node) {
    // H2 header should have 0 indent
    const level = parseInt(node.nodeName.replace('H', '')) - 2;
    const hash = node.getAttribute('id');
    const indent = ' '.repeat(level * 2);
    return `${indent}* [${node.innerText}](#${hash})`;
}).join('\n'));

The output for this article will look like this:

* [Dev Tools](#heading-dev-tools)
* [JavaScript Console](#heading-javascript-console)
* [Understanding the DOM Structure](#heading-understanding-the-dom-structure)
* [Creating TOC in Markdown](#heading-creating-toc-in-markdown)
  * [This is fake header](#heading-this-is-fake-header)

I created one fake subheader. Even platforms whose editors don't support writing in Markdown often convert Markdown when you paste it in. The ToC at the top of the article was created by copying and pasting the Markdown generated with the last JavaScript snippet.

How to Create an HTML ToC

If your platform doesn’t support Markdown (like Medium), you can create HTML, preview that HTML, and copy the output to the clipboard. Pasting that into the editor of the platform you're using should keep the formatting.

💡

On Medium, the content is inside a different element, so the selector must be updated.

To convert Markdown to HTML, you can use any online tool, but you'll see how to create it yourself in the snippet. It will be faster after you create the code.

const headers = [...document.querySelectorAll('main h2[id], main h3[id], main h4[id]')];

// Indentation for the current nesting level (two spaces per level)
function indent(state) {
    return ' '.repeat((state.level - 1) * 2);
}

// Close nested lists until we are back at the target level
function closeUlTags(state, targetLevel) {
    while (state.level > targetLevel) {
        state.level--;
        state.lines.push(`${indent(state)}</ul>`);
    }
}

// Open nested lists until we reach the target level
function openUlTags(state, targetLevel) {
    while (state.level < targetLevel) {
        state.lines.push(`${indent(state)}<ul>`);
        state.level++;
    }
}

const result = headers.reduce((state, node) => {
    const level = parseInt(node.nodeName.replace('H', ''));

    closeUlTags(state, level);
    openUlTags(state, level);

    const hash = node.getAttribute('id');
    state.lines.push(`${indent(state)}<li><a href="#${hash}">${node.innerText}</a></li>`);
    return state;
}, { lines: [], level: 1 });

closeUlTags(result, 1);

console.log(result.lines.join('\n'));

This is the output of the code in this article:


  Table of Contents
  Dev Tools
  JavaScript Console
  Understanding the DOM Structure
  Creating TOC in Markdown
  How to create HTML TOC
  
    Level 3
    
      Level 4
    
  
  What to do if I don’t have headers?

I added a few headers at the end, so you can see that it will work for any level of nested headers. Note that we also have the ToC as the first element on the list.

💡

Note that the above HTML code includes a link to the Table of Contents. This happens if you run the script again after adding the TOC. You can remove it by hand. If you want to improve the code, you can add a filter.
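
One way to add that filter, as a minimal sketch: drop any header whose text matches the ToC title before building the list. The helper name `withoutTocHeader` and the default title are assumptions for illustration and are not part of the original script.

```javascript
// Hypothetical helper (not in the original script): remove the ToC's own
// header from the list before generating the table of contents.
function withoutTocHeader(headers, tocTitle = 'Table of Contents') {
    const wanted = tocTitle.trim().toLowerCase();
    return headers.filter((node) => node.innerText.trim().toLowerCase() !== wanted);
}

// Usage in the browser console (assumes the same selector as the snippet above):
// const headers = withoutTocHeader(
//     [...document.querySelectorAll('main h2[id], main h3[id], main h4[id]')]
// );
```

The comparison ignores case and surrounding whitespace, so small differences in how the ToC header is written should not matter.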

Copy the HTML code for the editor

Most so-called WYSIWYG editors use HTML under the hood, so you should be able to copy rendered HTML with its formatting and paste it into the editor. The easiest way is to save the HTML into a file, open that file in a browser, and select the text:

What to Do If I Don’t Have Headers?

You need to find anything that can be targeted with CSS. If they are p tags with a specific class (like header), you can use p.header instead of h2.

How to Create a Table of Contents for DEV.to

If you have a different DOM structure, you can use different DOM methods to extract the element you need. For example, on DEV.to, the headers look like this:


  
  
<h2>
  <a name="overview" href="#overview">
  </a>
  Overview
</h2>

So the selector needs to be just main h2. But when you execute this code:

[...document.querySelectorAll('main h2, main h3, main h4')];

You will see far more headers than the document actually contains. Luckily, you can use the newer CSS :has() selector. The final selector for one header can look like this: main h2:has(a[name]).

Here is the full code:

const selector = 'main h2:has(a[name]), main h3:has(a[name]), main h4:has(a[name])';
const headers = [...document.querySelectorAll(selector)];

console.log(headers.map(function(node) {
    // H2 header should have 0 indent
    const level = parseInt(node.nodeName.replace('H', '')) - 2;
    // this is how you get the hash
    // you can also access href attribute and remove # from the output string
    const hash = node.querySelector('a').getAttribute('name');
    const indent = ' '.repeat(level);
    return `${indent}* [${node.innerText}](#${hash})`;
}).join('\n'));

Conclusion

Creating a table of contents can help your readers digest your article. Most people don’t read the whole article; they only scan for what they need. You can also find a lot of articles about its impact on SEO. So it’s always worth adding one if the article is longer.

And as you can see, creating a ToC is not that hard with a bit of web development knowledge.

If you like this article, you may want to follow me on Social Media: (Twitter/X, GitHub, and/or LinkedIn). You can also check my personal website and my new blog.

How to Build a Production-Ready Voice Agent Architecture with WebRTC

Nataraj Sundar — Fri, 06 Mar 2026 19:46:46 +0000

In this tutorial, you'll build a production-ready voice agent architecture: a browser client that streams audio over WebRTC (Web Real-Time Communication), a backend that mints short-lived session tokens, an agent runtime that orchestrates speech and tools safely, and a post-call step that generates artifacts for downstream workflows.

This article is intentionally vendor-neutral. You can implement these patterns using any AI voice platform that supports WebRTC (directly or via an SFU, selective forwarding unit) and server-side token minting. The goal is to help you ship a voice agent architecture that is secure, observable, and operable in production.

Disclosure: This article reflects my personal views and experience. It does not represent the views of my employer or any vendor mentioned.

What You'll Build
How to Avoid Common Production Failures in Voice Agents
How to Design a Latency Budget for a Real-Time Voice Agent
Production Voice Agent Architecture (Vendor-Neutral)
Production readiness checklist
Closing

What You'll Build

By the end, you'll have:

A web client that streams microphone audio and plays agent audio.
A backend token endpoint that keeps credentials server-side.
A safe coordination channel between the agent and the application.
Structured messages between the application and the agent.
A production checklist for security, reliability, observability, and cost control.

Prerequisites

You should be comfortable with:

JavaScript or TypeScript
Node.js 18+ (so fetch works server-side) and an HTTP framework (Express in examples)
Browser microphone permissions
Basic WebRTC concepts (high level is fine)

TL;DR

A production-ready voice agent needs:

A server-side token service (no secrets in the browser)
A real-time media plane (WebRTC) for low-latency audio
A data channel for structured messages between your app and the agent
Tool guardrails (allowlists, confirmations, timeouts, audit logs)
Post-call processing (summary, actions, CRM (Customer Relationship Management), tickets)
Observability-first implementation (state transitions + metrics)

How to Avoid Common Production Failures in Voice Agents

If you've operated distributed systems, you've seen most failures happen at boundaries:

timeouts and partial connectivity
retries that amplify load
unclear ownership between components
missing observability
“helpful automation” that becomes unsafe

Voice agents amplify those risks because:

Latency is User Experience: A slow agent feels broken. Conversational UX is less forgiving than web UX.

Audio + UI + Tools is a Distributed System: You coordinate browser audio capture, WebRTC transport, STT (speech-to-text), model reasoning, tool calls, TTS (text-to-speech), and playback buffering. Each stage has different clocks and failure modes.

Security Boundaries are Non-negotiable: A leaked API key is catastrophic. A tool misfire can trigger real-world side effects.

Debuggability determines whether you can ship: If you don't log state transitions and capture post-call artifacts, you can't operate or improve the system safely.

How to Design a Latency Budget for a Real-Time Voice Agent

Conversations have a “feel.” That feel is mostly latency.

A practical guideline:

Under ~200ms feels instant
300–500ms feels responsive
Over ~700ms feels broken

Your end-to-end latency is the sum of mic capture, network RTT (round-trip time), STT, reasoning, tool execution, TTS, and playback buffering. Budget for it explicitly or you’ll ship a technically correct system that users perceive as unintelligent.
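
To make the budget explicit, you can write it down as data and check the total against a target. The sketch below does that. The per-stage numbers are illustrative assumptions, not measurements, and the names `budgetMs`, `totalLatency`, and `overBudget` are hypothetical.

```javascript
// Illustrative per-stage budgets in milliseconds (assumed values for the sketch).
const budgetMs = {
  micCapture: 20,
  networkRtt: 60,
  stt: 120,
  reasoning: 150,
  toolExecution: 0, // no tool call on this turn
  tts: 100,
  playbackBuffer: 40,
};

// Sum every stage to get the end-to-end latency for one turn.
function totalLatency(budget) {
  return Object.values(budget).reduce((sum, ms) => sum + ms, 0);
}

// True when the summed budget exceeds the target.
function overBudget(budget, targetMs) {
  return totalLatency(budget) > targetMs;
}

console.log(totalLatency(budgetMs)); // 490
console.log(overBudget(budgetMs, 500)); // false
```

Keeping the budget in one place makes it easier to see which stage to shrink when a tool call pushes a turn past the target.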

How to Design a Production Voice Agent Architecture (Vendor-Neutral)

A scalable voice agent architecture typically has these layers:

Web client: mic capture, audio playback, UI state
Token service: short-lived session tokens (secrets stay server-side)
Real-time plane: WebRTC media + a data channel
Agent runtime: STT → reasoning → TTS, plus tool orchestration
Tool layer: external actions behind safety controls
Post-call processor: summary + structured outputs after the session ends

This separation makes failure domains and trust boundaries explicit.

Step 0: Set Up the Project

Create a new project directory:

mkdir voice-agent-app
cd voice-agent-app
npm init -y
npm pkg set type=module
npm pkg set scripts.start="node server.js"

Install dependencies:

npm install express dotenv

Create this folder structure:

voice-agent-app/
├── server.js
├── .env
└── public/
    ├── index.html
    └── client.js

Add a .env file:

VOICE_PLATFORM_URL=https://your-provider.example
VOICE_PLATFORM_API_KEY=your_api_key_here

Now you’re ready to implement each part of the system.

Step 1: Keep Credentials Server-side

Treat every API key like production credentials:

store it in environment variables or a secrets manager
rotate it if exposed
never embed it in browser or mobile apps
avoid logging secrets (log only a short suffix if necessary)

Even if a vendor supports CORS, the browser is not a safe place for long-lived credentials.
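
As a small illustration of the "log only a short suffix" rule, here is a minimal sketch of a masking helper. The name `maskSecret` and the example key are made up for this example.

```javascript
// Hypothetical helper: show only the last few characters of a secret in logs.
function maskSecret(secret, visible = 4) {
  // Very short or missing values are fully masked so nothing useful leaks.
  if (typeof secret !== "string" || secret.length <= visible) return "****";
  return `****${secret.slice(-visible)}`;
}

console.log(`Using API key ${maskSecret("sk_live_abcdef123456")}`);
// Using API key ****3456
```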

Step 2: Build a Backend Token Endpoint

Your backend should:

authenticate the user
mint a short-lived session token using your platform API
return only what the client needs (URL + token + expiry)

Create server.js (Node.js + Express)

import express from "express";
import dotenv from "dotenv";
import path from "path";
import { fileURLToPath } from "url";

dotenv.config();

const app = express();
app.use(express.json());

// Serve the web client from /public
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);
app.use(express.static(path.join(__dirname, "public")));

const VOICE_PLATFORM_URL = process.env.VOICE_PLATFORM_URL;
const VOICE_PLATFORM_API_KEY = process.env.VOICE_PLATFORM_API_KEY;

app.post("/api/voice-token", async (req, res) => {
  res.setHeader("Cache-Control", "no-store");

  try {
    if (!VOICE_PLATFORM_URL || !VOICE_PLATFORM_API_KEY) {
      return res.status(500).json({
        error: "Missing VOICE_PLATFORM_URL or VOICE_PLATFORM_API_KEY in .env",
      });
    }

    // TODO: Authenticate the caller before minting tokens.

    const r = await fetch(`${VOICE_PLATFORM_URL}/api/v1/token`, {
      method: "POST",
      headers: {
        "X-API-Key": VOICE_PLATFORM_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ participant_name: "Web User" }),
    });

    if (!r.ok) {
      const detail = await r.text().catch(() => "");
      return res.status(r.status).json({ error: "Token request failed", detail });
    }

    const data = await r.json();

    res.json({
      rtc_url: data.rtc_url || data.livekit_url,
      token: data.token,
      expires_in: data.expires_in,
    });
  } catch (err) {
    res.status(500).json({ error: "Failed to mint token" });
  }
});

app.listen(3000, () => console.log("Open http://localhost:3000"));

Run the server

npm start

Then open: http://localhost:3000

How this code works

You load credentials from environment variables so secrets never enter the browser.
The /api/voice-token endpoint calls the voice platform’s token API.
You return only the rtc_url, token, and expiration time.
The browser never sees the API key.
If the provider returns an error, you forward a structured error response.

Production Notes

rate-limit /api/voice-token (cost + abuse control)
instrument token mint latency and error rate
keep TTL short and handle refresh/reconnect
return minimal fields
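
For the rate-limiting note, a minimal sketch of a fixed-window limiter is shown below. It assumes a single server process and keeps counters in memory. In production you would more likely use a shared store or an existing middleware. `createRateLimiter` and its options are hypothetical names.

```javascript
// Minimal fixed-window limiter kept in memory (assumption: single process).
// `now` is injectable so the window logic can be tested without real time.
function createRateLimiter({ limit, windowMs, now = Date.now }) {
  const windows = new Map(); // key -> { start, count }

  return function allow(key) {
    const t = now();
    const w = windows.get(key);

    // First request for this key, or the previous window has expired.
    if (!w || t - w.start >= windowMs) {
      windows.set(key, { start: t, count: 1 });
      return true;
    }

    if (w.count < limit) {
      w.count++;
      return true;
    }
    return false; // over the limit for this window
  };
}

// Hypothetical wiring inside the token route:
// const allowMint = createRateLimiter({ limit: 5, windowMs: 60_000 });
// if (!allowMint(req.ip)) return res.status(429).json({ error: "Too many requests" });
```

This sketch never evicts old keys, so a long-running server would also need a cleanup step.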

Step 3: Connect from the Web Client (WebRTC + SFU)

In this step, you'll build a minimal web UI that:

Requests a short-lived token from your backend
Connects to a real-time WebRTC room (often via an SFU)
Plays the agent's audio track
Captures and publishes microphone audio

Create `public/index.html`



  
    
    
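<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <title>Voice Agent Demo</title>
  </head>
  <body>
    <h1>Voice Agent Demo</h1>

    <!-- IDs below must match the ones client.js looks up -->
    <button id="startBtn">Start</button>
    <button id="endBtn" disabled>End</button>

    <p id="status">Idle</p>

    <script type="module" src="/client.js"></script>
  </body>
</html>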

Create `public/client.js`

Note: This uses a LiveKit-style client SDK to demonstrate the pattern. If you're using a different provider, swap this import and the connect/publish calls for your provider's WebRTC client.

import { Room, RoomEvent, Track } from "https://unpkg.com/livekit-client@2.10.1/dist/livekit-client.esm.mjs";

const startBtn = document.getElementById("startBtn");
const endBtn = document.getElementById("endBtn");
const statusEl = document.getElementById("status");

let room = null;
let intentionallyDisconnected = false;
let audioEls = [];

function setStatus(text) {
  statusEl.textContent = text;
}

function detachAllAudio() {
  for (const el of audioEls) {
    try { el.pause?.(); } catch {}
    el.remove();
  }
  audioEls = [];
}

async function mintToken() {
  const res = await fetch("/api/voice-token", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ participant_name: "Web User" }),
    cache: "no-store",
  });

  if (!res.ok) {
    const detail = await res.text().catch(() => "");
    throw new Error(`Token request failed: ${detail || res.status}`);
  }

  const { rtc_url, token } = await res.json();
  if (!rtc_url || !token) throw new Error("Token response missing rtc_url or token");
  return { rtc_url, token };
}

function wireRoomEvents(r) {
  // 1) Play the agent audio track when subscribed
  r.on(RoomEvent.TrackSubscribed, (track) => {
    if (track.kind !== Track.Kind.Audio) return;

    const el = track.attach();
    audioEls.push(el);
    document.body.appendChild(el);

    // Autoplay restrictions vary by browser/device.
    el.play?.().catch(() => {
      setStatus("Connected (audio may be blocked — click the page to enable)");
    });
  });

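  // 2) Reconnect on disconnect (token expiry often shows up this way)
  r.on(RoomEvent.Disconnected, async () => {
    // Ignore intentional hang-ups and rooms that are not the active one,
    // e.g. a room torn down after a mic-permission failure.
    if (intentionallyDisconnected || r !== room) return;
    setStatus("Disconnected (reconnecting...)");
    await attemptReconnect();
  });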
}

async function connectOnce() {
  const { rtc_url, token } = await mintToken();

  const r = new Room();
  wireRoomEvents(r);

  await r.connect(rtc_url, token);

  // Mic permission + publish mic
  try {
    await r.localParticipant.setMicrophoneEnabled(true);
  } catch {
    try { r.disconnect(); } catch {}
    throw new Error("Microphone access denied. Allow mic permission and try again.");
  }

  return r;
}

async function startCall() {
  if (room) return;

  intentionallyDisconnected = false;
  setStatus("Connecting...");

  room = await connectOnce();

  setStatus("Connected");
  startBtn.disabled = true;
  endBtn.disabled = false;
}

async function stopCall() {
  intentionallyDisconnected = true;

  try {
    await room?.localParticipant?.setMicrophoneEnabled(false);
  } catch {}

  try {
    room?.disconnect();
  } catch {}

  room = null;
  detachAllAudio();

  setStatus("Disconnected");
  startBtn.disabled = false;
  endBtn.disabled = true;
}

async function attemptReconnect() {
  // Simplified exponential backoff reconnect.
  // In production, add jitter, max attempts, and better error classification.
  const delaysMs = [250, 500, 1000, 2000];

  for (const delay of delaysMs) {
    if (intentionallyDisconnected) return;

    try {
      // Tear down current state before reconnecting
      try { room?.disconnect(); } catch {}
      room = null;
      detachAllAudio();

      await new Promise((r) => setTimeout(r, delay));

      room = await connectOnce();
      setStatus("Reconnected");
      startBtn.disabled = true;
      endBtn.disabled = false;
      return;
    } catch {
      // keep retrying
    }
  }

  setStatus("Disconnected (reconnect failed)");
  startBtn.disabled = false;
  endBtn.disabled = true;
}

startBtn.addEventListener("click", async () => {
  try {
    await startCall();
  } catch (err) {
    setStatus(err?.message || "Connection failed");
    startBtn.disabled = false;
    endBtn.disabled = true;
    room = null;
    detachAllAudio();
  }
});

endBtn.addEventListener("click", async () => {
  await stopCall();
});

How this Step works (and why these details matter)

The Start button gives you a user gesture so browsers are more likely to allow audio playback.
Mic permission is handled explicitly: if the user denies access, you show a clear error and avoid a half-connected session.
Disconnect cleanup removes audio elements so you don't leak resources across retries.
The reconnect loop demonstrates the production pattern: if a disconnect happens (often due to token expiry or network churn), the client re-mints a token and reconnects.
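
The reconnect loop uses a fixed list of delays. As a sketch of the "add jitter, max attempts" improvement mentioned in its comments, the helper below computes capped exponential delays with full jitter. `backoffDelays` and its default values are assumptions for illustration.

```javascript
// Sketch: exponential backoff with "full jitter" and a cap.
// baseMs, capMs, and maxAttempts are illustrative defaults.
// `random` is injectable so the output can be tested deterministically.
function backoffDelays({
  baseMs = 250,
  capMs = 4000,
  maxAttempts = 5,
  random = Math.random,
} = {}) {
  const delays = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // The ceiling doubles each attempt until it reaches the cap.
    const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
    // Pick a random point in [0, ceiling) so clients don't retry in lockstep.
    delays.push(Math.floor(random() * ceiling));
  }
  return delays;
}
```

In `attemptReconnect`, the fixed `delaysMs` array could be replaced with `backoffDelays()`, which also bounds the number of attempts.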

In the next step, you'll add a structured data-channel handler to safely process agent-suggested “client actions”.

Handle These Explicitly

Autoplay Restriction Example

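Make sure index.html contains a Start button, an End button, and a status element with the IDs startBtn, endBtn, and status.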

In client.js:

const startBtn = document.getElementById("startBtn");
const endBtn = document.getElementById("endBtn");
const statusEl = document.getElementById("status");

let room;

startBtn.addEventListener("click", async () => {
  try {
    room = await connectOnce(); // defined in Step 3
    statusEl.textContent = "Connected";
    startBtn.disabled = true;
    endBtn.disabled = false;
  } catch (err) {
    statusEl.textContent = "Connection failed";
  }
});

Microphone denial

try {
  await navigator.mediaDevices.getUserMedia({ audio: true });
} catch (err) {
  statusEl.textContent = "Microphone access denied";
  throw err;
}

Disconnect cleanup

endBtn.addEventListener("click", () => {
  if (room) {
    room.disconnect();
    statusEl.textContent = "Disconnected";
    startBtn.disabled = false;
    endBtn.disabled = true;
  }
});

Token refresh (simplified)

room.on(RoomEvent.Disconnected, async () => {
  // The token endpoint only accepts POST
  const res = await fetch("/api/voice-token", { method: "POST" });
  const { rtc_url, token } = await res.json();
  await room.connect(rtc_url, token);
});

Step 4: Add Client Actions (Agent Suggests, App Executes)

A production voice agent often needs to:

open a runbook/dashboard URL
show a checklist in the UI
request confirmation for an irreversible action
receive structured context (account, region, incident ID)

The key safety rule:

The agent suggests actions. The application validates and executes them.

Use structured messages over the data channel:

{
  "type": "client_action",
  "action": "open_url",
  "payload": { "url": "https://internal.example.com/runbook" },
  "id": "action_123"
}

Add guardrails:

allowlist permitted actions
validate payload shape
confirmation gates for irreversible actions
idempotency via id
audit logs for every request and outcome

This boundary limits damage from hallucinations or prompt injection.

// Guardrails: allowlist + validation + idempotency + confirmation

const ALLOWED_ACTIONS = new Set(["open_url", "request_confirm"]);
const EXECUTED_ACTION_IDS = new Set();
const ALLOWED_HOSTS = new Set(["internal.example.com"]);

function parseClientAction(text) {
  let msg;
  try {
    msg = JSON.parse(text);
  } catch {
    return null;
  }

  if (msg?.type !== "client_action") return null;
  if (typeof msg.id !== "string") return null;
  if (!ALLOWED_ACTIONS.has(msg.action)) return null;

  return msg;
}

async function handleClientAction(msg, room) {
  if (EXECUTED_ACTION_IDS.has(msg.id)) return; // idempotency
  EXECUTED_ACTION_IDS.add(msg.id);

  console.log("[client_action]", msg); // audit log (demo)

  if (msg.action === "open_url") {
    const url = msg.payload?.url;
    if (typeof url !== "string") return;

    let u;
    try {
      u = new URL(url); // throws on malformed input
    } catch {
      console.warn("Blocked malformed URL");
      return;
    }
    if (!ALLOWED_HOSTS.has(u.host)) {
      console.warn("Blocked navigation to:", u.host);
      return;
    }

    window.open(url, "_blank", "noopener,noreferrer");
    return;
  }

  if (msg.action === "request_confirm") {
    const prompt = msg.payload?.prompt || "Confirm this action?";
    const ok = window.confirm(prompt);

    // Send confirmation back to agent/app
    room.localParticipant.publishData(
      new TextEncoder().encode(
        JSON.stringify({ type: "user_confirmed", id: msg.id, ok })
      ),
      { topic: "client_events", reliable: true }
    );
  }
}

room.on(RoomEvent.DataReceived, (payload, participant, kind, topic) => {
  if (topic !== "client_actions") return;

  const text = new TextDecoder().decode(payload);
  const msg = parseClientAction(text);
  if (!msg) return;

  handleClientAction(msg, room);
});

Step 5: Add Tool Integrations Safely

Tools turn a voice agent into automation. Regardless of vendor, enforce these rules:

timeouts on every tool call
circuit breakers for flaky dependencies
audit logs (inputs, outputs, duration, trace IDs)
explicit confirmation for destructive actions
credentials stored server-side (never in prompts or clients)

If tools fail, degrade gracefully (“I can’t access that system right now, here’s the manual fallback.”). Silence reads as failure.

Create a server-side tool runner (example)

Paste this into server.js:

const TOOL_ALLOWLIST = {
  get_status: { destructive: false },
  create_ticket: { destructive: true },
};

let failures = 0;
let circuitOpenUntil = 0;

function circuitOpen() {
  return Date.now() < circuitOpenUntil;
}

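function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), ms);
  });
  // Clear the timer once either side settles so it doesn't linger
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}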

async function runToolSafely(tool, args) {
  if (circuitOpen()) throw new Error("circuit_open");

  try {
    // Placeholder: swap Promise.resolve(...) for the real tool call
    const result = await withTimeout(Promise.resolve({ ok: true, tool, args }), 2000);
    failures = 0;
    return result;
  } catch (err) {
    failures++;
    if (failures >= 3) circuitOpenUntil = Date.now() + 10_000;
    throw err;
  }
}

app.post("/api/tools/run", async (req, res) => {
  const { tool, args, user_confirmed } = req.body || {};

  if (!TOOL_ALLOWLIST[tool]) return res.status(400).json({ error: "Tool not allowed" });

  if (TOOL_ALLOWLIST[tool].destructive && user_confirmed !== true) {
    return res.status(400).json({ error: "Confirmation required" });
  }

  try {
    const started = Date.now();
    const result = await runToolSafely(tool, args);
    console.log("[tool_call]", { tool, ms: Date.now() - started }); // audit log
    res.json({ ok: true, result });
  } catch (err) {
    console.log("[tool_error]", { tool, err: String(err) });
    res.status(500).json({ ok: false, error: "Tool call failed" });
  }
});
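
The runner above shares one failure counter across all tools, so one flaky tool can open the circuit for every tool. A possible refinement, sketched below, keeps a breaker per tool. `createBreakers` is a hypothetical helper, and the defaults mirror the values used above (3 failures, 10 seconds).

```javascript
// Sketch: one breaker per tool, so a flaky tool doesn't block the others.
// `now` is injectable so the cooldown can be tested without waiting.
function createBreakers({ threshold = 3, cooldownMs = 10_000, now = Date.now } = {}) {
  const state = new Map(); // tool -> { failures, openUntil }

  const get = (tool) => {
    if (!state.has(tool)) state.set(tool, { failures: 0, openUntil: 0 });
    return state.get(tool);
  };

  return {
    // True while the tool is in its cooldown period.
    isOpen: (tool) => now() < get(tool).openUntil,
    recordSuccess: (tool) => {
      get(tool).failures = 0;
    },
    recordFailure: (tool) => {
      const s = get(tool);
      s.failures++;
      if (s.failures >= threshold) {
        s.openUntil = now() + cooldownMs;
        s.failures = 0; // start counting fresh after the cooldown
      }
    },
  };
}
```

Unlike the shared counter, this version resets the failure count when the circuit opens, which is a design choice of the sketch rather than a requirement.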

Step 6: Add Post-Call Processing (Where Durable Value Appears)

After a call ends, generate structured artifacts:

summary
action items
follow-up email draft
CRM entry or ticket creation

A production pattern:

store transcript + metadata
enqueue a background job (queue/worker)
produce outputs as JSON + a human-readable report
apply integrations with retries + idempotency
store a “call report” for audits and incident reviews
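
Webhooks are often delivered more than once, so the idempotency point deserves a concrete shape. The sketch below skips duplicate deliveries by remembering processed call IDs. The in-memory `Set` and the name `processOnce` are assumptions for the demo. A durable store would be needed to survive restarts or to coordinate multiple workers.

```javascript
// Sketch: process each call ID at most once.
const processedCallIds = new Set();

function processOnce(callId, handler) {
  if (processedCallIds.has(callId)) return { skipped: true };

  processedCallIds.add(callId);
  try {
    return { skipped: false, result: handler() };
  } catch (err) {
    // Allow a later retry if processing failed.
    processedCallIds.delete(callId);
    throw err;
  }
}

// Hypothetical usage with the webhook payload:
// processOnce(payload.call_id, () => processPostCall(payload));
```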

Create a post-call webhook endpoint (example)

Paste into server.js:

app.post("/webhooks/call-ended", async (req, res) => {
  const payload = req.body;

  console.log("[call_ended]", {
    call_id: payload.call_id,
    ended_at: payload.ended_at,
  });

  setImmediate(() => processPostCall(payload));
  res.json({ ok: true });
});

function processPostCall(payload) {
  const transcript = payload.transcript || [];
  const summary = transcript.slice(0, 3).map(t => `- ${t.speaker}: ${t.text}`).join("\n");

  const report = {
    call_id: payload.call_id,
    summary,
    action_items: payload.action_items || [],
    created_at: new Date().toISOString(),
  };

  console.log("[call_report]", report);
}

Test it locally

curl -X POST http://localhost:3000/webhooks/call-ended \
  -H "Content-Type: application/json" \
  -d '{
    "call_id": "call_123",
    "ended_at": "2026-02-26T00:10:00Z",
    "transcript": [
      {"speaker": "user", "text": "I need help resetting my password."},
      {"speaker": "agent", "text": "Sure — I can help with that."}
    ],
    "action_items": ["Send password reset link", "Verify account email"]
  }'

Production readiness checklist

Security

no API keys in the browser
strict allowlist for client actions
confirmation gates for destructive actions
schema validation on all inbound messages
audit logging for actions and tool calls

Reliability

reconnect strategy for expired tokens
timeouts + circuit breakers for tools
graceful degradation when dependencies fail
idempotent side effects

Observability

Log state transitions (for example):
listening → thinking → speaking → ended

Track:

connect failure rate
end-to-end latency (STT + reasoning + TTS)
tool error rate
reconnect frequency
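
The state transitions mentioned above can be logged with a small state machine. This is a minimal sketch: the state names come from the example, while the allowed edges (such as speaking back to listening) and the name `createCallState` are assumptions.

```javascript
// Allowed transitions between call states (edges are assumed for this sketch).
const TRANSITIONS = {
  listening: ["thinking", "ended"],
  thinking: ["speaking", "ended"],
  speaking: ["listening", "ended"],
  ended: [],
};

// `log` and `now` are injectable so transitions can be tested deterministically.
function createCallState(log = console.log, now = Date.now) {
  let state = "listening";
  let since = now();

  return {
    get current() {
      return state;
    },
    transition(next) {
      if (!TRANSITIONS[state].includes(next)) {
        log("[state_rejected]", { from: state, to: next });
        return false;
      }
      const t = now();
      // Time spent in the previous state feeds the latency metrics.
      log("[state]", { from: state, to: next, ms_in_state: t - since });
      state = next;
      since = t;
      return true;
    },
  };
}
```

Logging the time spent in each state gives you a per-stage view of where a slow turn lost its time.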

Cost control

rate-limit token minting and sessions
cap max call duration
bound context growth (summarize or truncate)
track per-call usage drivers (STT/TTS minutes, tool calls)
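
For bounding context growth, the simplest option is truncation. The sketch below keeps only the most recent turns. `boundContext` and the default of 20 turns are illustrative assumptions, and summarizing older turns is an alternative when earlier context still matters.

```javascript
// Sketch: keep only the most recent turns when the transcript grows.
function boundContext(turns, maxTurns = 20) {
  if (turns.length <= maxTurns) return turns;
  return turns.slice(-maxTurns); // drop the oldest turns
}
```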

Optional resources

How to Try a Managed Voice Platform Quickly

If you want a managed provider to test quickly, you can sign up for a Vocal Bridge account and implement these steps using their token minting + real-time session APIs.

But the core production voice agent architecture in this article is vendor-agnostic. You can replace any component (SFU, STT/TTS, agent runtime, tool layer) as long as you preserve the boundaries: secure token service, real-time media, safe tool execution, and strong observability.

Watch a full demo and explore a complete reference repo

If you'd like to see these patterns working together in a realistic scenario (incident triage), here are two optional resources:

- Demo video: Voice-First Incident Triage (end-to-end run)
This is a hackathon run-through showing client actions, decision boundaries for irreversible actions, and a structured post-call summary.

- GitHub repo (architecture + design + working code): https://github.com/natarajsundar/voice-first-incident-triage

These links are optional; you can follow the tutorial end-to-end without them.

Closing

Production-ready voice agents work when you treat them like real-time distributed systems.

Start with the baseline:

token service + web client + real-time audio

Then layer in:

controlled client actions
safe tools
post-call automation
observability and cost controls

That’s how you ship a voice agent architecture you can operate. You now have a vendor-neutral reference architecture you can adapt to your stack, with clear trust boundaries, safe tool execution, and operational visibility.

If you’re shipping real-time AI systems, what’s been your biggest production bottleneck so far: latency, reliability, or tool safety? I’d love to hear what you’re seeing in the wild. Connect with me on LinkedIn.

How to Add Multi-Language Support in Flutter: Manual and AI-Automated Translations for Flutter Apps

Atuoha Anthony — Sat, 31 Jan 2026 01:27:17 +0000

As Flutter applications scale beyond a single market, language support becomes a critical requirement. A well-designed app should feel natural to users regardless of their locale, automatically adapting to their language preferences while still giving them control.

This article provides a comprehensive, production-focused guide to supporting multiple languages in a Flutter application using Flutter’s localization system, the intl package, and Bloc for state management. We’ll support English, French, and Spanish, implement automatic language detection, and allow users to manually switch languages from settings, while also exploring the use of AI to automate text translations.

Prerequisites
Why Localization Matters in Flutter Applications
Flutter Localization Architecture Overview
How to Set Up Dependencies
How to Define Supported Languages
How to Add Localized Text with ARB Files
How to Generate Localization Code
How to Configure MaterialApp for Localization
Auto-Detecting the User’s Device Language
How to Manage Localization with Bloc
How to Display Localized Text in Widgets
Language Switching from Settings
How to Add Parameters to Localized Strings
Pluralization and Quantities
How to Format Dates, Numbers, and Currency
Localization Data Flow
Common Pitfalls and How to Avoid Them
How to Automate Translations with AI
Best Practices and Considerations
Conclusion
References

Prerequisites

Before proceeding, you should be comfortable with the following concepts:

Dart programming language: variables, classes, functions, and null safety
Flutter fundamentals: widgets, BuildContext, and widget trees
State management basics: familiarity with Bloc or similar patterns
Terminal usage: running Flutter CLI commands

If you have prior experience working with Flutter widgets and basic app architecture, you are well prepared to follow along.

Why Localization Matters in Flutter Applications

Localization (often abbreviated as l10n) is the process of adapting an application for different languages and regions, going beyond simple text translation to influence accessibility, user trust, and overall usability. From a technical perspective, localization introduces several challenges: text must be dynamically resolved at runtime, the UI must update instantly when the language changes, language preferences must persist across sessions, and device locale detection must gracefully fall back when a language is unsupported.

Flutter’s localization framework, when combined with intl and Bloc, solves these challenges cleanly and predictably.

Flutter Localization Architecture Overview

Flutter localization is built around three key ideas:

ARB files as the source of truth for translated strings
Code generation to provide type-safe access to translations
Locale-driven rebuilds of the widget tree

At runtime, the active Locale determines which translation file is used. When the locale changes, Flutter automatically rebuilds dependent widgets.

How to Set Up Dependencies

Add the required dependencies to your pubspec.yaml:

dependencies:
  flutter:
    sdk: flutter

  flutter_localizations:
    sdk: flutter

  intl: ^0.20.2
  flutter_bloc: ^8.1.3
  arb_translate: ^1.1.0

Enable localization code generation:

flutter:
  generate: true

This instructs Flutter to generate localization classes from ARB files.

How to Define Supported Languages

For this guide, the application will support:

English (en)
French (fr)
Spanish (es)

These locales will be declared centrally and used throughout the app.

How to Add Localized Text with ARB Files

Flutter uses Application Resource Bundle (ARB) files to store localized strings. Each supported language has its own ARB file.

English – `app_en.arb`

{
  "@@locale": "en",
  "enter_email_address_to_reset": "Enter your email address to reset"
}

French – `app_fr.arb`

{
  "@@locale": "fr",
  "enter_email_address_to_reset": "Entrez votre adresse e-mail pour réinitialiser"
}

Spanish – `app_es.arb`

{
  "@@locale": "es",
  "enter_email_address_to_reset": "Ingrese su dirección de correo electrónico para restablecer"
}

Each key must be identical across files. Only the values change per language.

How to Generate Localization Code

Run the following command in your terminal:

flutter gen-l10n

Flutter generates a strongly typed localization class, typically located at:

.dart_tool/flutter_gen/gen_l10n/app_localizations.dart

This file exposes getters such as:

AppLocalizations.of(context)!.enter_email_address_to_reset

How to Configure `MaterialApp` for Localization

The MaterialApp widget must be configured with localization delegates and supported locales:

MaterialApp(
  localizationsDelegates: const [
    AppLocalizations.delegate,
    GlobalMaterialLocalizations.delegate,
    GlobalWidgetsLocalizations.delegate,
    GlobalCupertinoLocalizations.delegate,
  ],
  supportedLocales: const [
    Locale('en'),
    Locale('fr'),
    Locale('es'),
  ],
  locale: state.locale,
  home: const MyHomePage(),
)

The locale property is controlled by Bloc, allowing dynamic updates at runtime.

Auto-Detecting the User’s Device Language

Flutter exposes the device locale via PlatformDispatcher. We can use this to automatically select the most appropriate supported language.

void detectLanguageAndSet() {
  Locale deviceLocale = PlatformDispatcher.instance.locale;

  Locale selectedLocale = AppLocalizations.supportedLocales.firstWhere(
    (supported) => supported.languageCode == deviceLocale.languageCode,
    orElse: () => const Locale('en'),
  );

  print('Using Locale: ${selectedLocale.languageCode}');

  GlobalConfig.storageService.setStringValue(
    AppStrings.DETECTED_LANGUAGE,
    selectedLocale.languageCode,
  );

  context.read<AppLocalizationBloc>().add(
    SetLocale(locale: selectedLocale),
  );
}

This approach reads the device language, matches it against supported locales, falls back to English when the language is unsupported, persists the detected language, and updates the UI instantly.

How to Manage Localization with Bloc

Bloc provides a predictable and testable way to manage application-wide locale changes.

Localization State

class AppLocalizationState {
  final Locale locale;
  const AppLocalizationState(this.locale);
}

Localization Event

abstract class AppLocalizationEvent {}

class SetLocale extends AppLocalizationEvent {
  final Locale locale;
  SetLocale({required this.locale});
}

Localization Bloc

class AppLocalizationBloc
    extends Bloc<AppLocalizationEvent, AppLocalizationState> {
  AppLocalizationBloc()
      : super(const AppLocalizationState(Locale('en'))) {
    on<SetLocale>((event, emit) {
      emit(AppLocalizationState(event.locale));
    });
  }
}

The AppLocalizationBloc manages the app’s language state. It starts with English (Locale('en')) as the default. When it receives a SetLocale event, it emits a new state carrying the locale from that event, and every widget that depends on the locale rebuilds in the new language.

How to Display Localized Text in Widgets

Once localization is configured, using translated text is straightforward:

Text(
  AppLocalizations.of(context)!.enter_email_address_to_reset,
  style: getRegularStyle(
    color: Colors.white,
    fontSize: FontSize.s16,
  ),
)

AppLocalizations.of(context)!.enter_email_address_to_reset retrieves the localized string enter_email_address_to_reset for the current app locale from the generated localization resources. The correct translation is resolved automatically based on the active locale.

Language Switching from Settings

Users should always be able to override automatic language detection.

ListTile(
  title: const Text('French'),
  onTap: () {
    context.read<AppLocalizationBloc>().add(
      SetLocale(locale: const Locale('fr')),
    );
  },
)

This ListTile displays the text "French". When tapped, it dispatches a SetLocale event to the AppLocalizationBloc, which changes the app’s locale to French ('fr'). The snippet itself does not persist the choice. To restore it on the next app launch, also save the language code with the same storage service used in detectLanguageAndSet.

How to Add Parameters to Localized Strings

Real-world applications rarely display static text. Messages often include dynamic values such as user names, counts, dates, or prices. Flutter’s localization system, powered by intl, supports parameterized (interpolated) strings in a type-safe way.

Where Parameters Are Defined

Parameters are defined inside ARB files alongside the localized string itself, with each parameterized message consisting of the message string containing placeholders and a corresponding metadata entry that describes those placeholders.

Example: Parameterized Text

Suppose we want to display a greeting message that includes a user’s name.

English – `app_en.arb`

{
  "@@locale": "en",
  "greetingMessage": "Hello {username}!",
  "@greetingMessage": {
    "description": "Greeting message shown on the home screen",
    "placeholders": {
      "username": {
        "type": "String"
      }
    }
  }
}

This defines a parameterized localized message for English, indicated by "@@locale": "en". The "greetingMessage" key contains the string "Hello {username}!", where {username} is a placeholder that will be dynamically replaced with the user’s name at runtime. The "@greetingMessage" entry provides metadata for the message, including a description that explains the string is shown on the home screen, and a "placeholders" section that specifies "username" is of type String. When the app runs, this structure allows the message to display dynamically—for example, if the username is "Alice", the message would appear as "Hello Alice!".

French – `app_fr.arb`

{
  "@@locale": "fr",
  "greetingMessage": "Bonjour {username} !"
}

Spanish – `app_es.arb`

{
  "@@locale": "es",
  "greetingMessage": "¡Hola {username}!"
}

The placeholder name ({username}) must be identical across all ARB files.

Generated Dart API

After running:

flutter gen-l10n

Flutter generates a strongly typed method instead of a simple getter:

String greetingMessage(String username)

Because the placeholder becomes a typed parameter, a missing or wrongly typed argument is caught at compile time instead of surfacing as a runtime error.

How to Use Parameterized Strings in Widgets

Text(
  AppLocalizations.of(context)!.greetingMessage('Tony'),
)

If the locale is set to French, the output becomes:

Bonjour Tony !

Pluralization and Quantities

Another common localization requirement is pluralization. Languages differ significantly in how they express quantities, and hardcoding plural logic in Dart quickly becomes error-prone.

Defining Plural Messages in ARB

{
  "itemsCount": "{count, plural, =0{No items} =1{1 item} other{{count} items}}",
  "@itemsCount": {
    "description": "Displays the number of items",
    "placeholders": {
      "count": {
        "type": "int"
      }
    }
  }
}

This defines a pluralized message for itemsCount. The string {count, plural, =0{No items} =1{1 item} other{{count} items}} dynamically changes based on the value of count: it shows "No items" when count is 0, "1 item" when count is 1, and "{count} items" for all other values. The metadata entry "@itemsCount" provides a description and specifies that the placeholder count is of type int.

Each language can define its own plural rules while sharing the same key.

Using Pluralized Messages

Text(
  AppLocalizations.of(context)!.itemsCount(3),
)

Flutter automatically applies the correct plural form based on the active locale.

How to Format Dates, Numbers, and Currency

The intl package also provides locale-aware formatting utilities. These should be used in combination with localized strings, not as replacements.

Date Formatting Example

final formattedDate = DateFormat.yMMMMd(
  Localizations.localeOf(context).toString(),
).format(DateTime.now());

Text(
  AppLocalizations.of(context)!.lastLoginDate(formattedDate),
)

This ensures that both language and formatting rules align with the user’s locale.

Localization Data Flow

Localization is handled as an explicit data flow, with locale resolution modeled as application state rather than a static configuration passed into MaterialApp.

The process starts with the device locale, obtained from the platform layer at startup. This value represents the system’s preferred language and region but is not applied directly to the UI.

Instead, it flows through a detectLanguageAndSet step responsible for applying application-specific rules. This layer typically handles locale normalization and fallback logic, such as mapping unsupported locales to supported ones, restoring a user-selected language from persistent storage, or enforcing product constraints around available translations.

The resolved locale is then emitted into a Localization Bloc, which acts as the single source of truth for localization state. By centralizing locale management, the application can support runtime language changes, ensure predictable rebuilds, and keep localization logic decoupled from both the widget tree and platform APIs.

The Bloc feeds into the locale property of MaterialApp, which is the integration point with Flutter’s localization system. Updating this value triggers a rebuild of the Localizations scope and causes all dependent widgets to resolve strings for the active locale.

At the edge of the system, localized widgets consume the generated localization classes produced by flutter gen-l10n. These widgets remain agnostic to how the locale was selected or updated. They simply react to the localization context provided by the framework.

This architecture cleanly separates:

Locale detection
Business logic and state management
Framework-level localization
UI rendering

As a result, localization behavior remains explicit, maintainable, and compatible with automated translation workflows and CI-driven localization updates.

Common Pitfalls and How to Avoid Them

Avoid manual string concatenation. For example, do not use 'Hello ' + name. You should rely on localized templates instead.
Never hardcode plural logic in Dart. Always use intl’s pluralization features to handle different languages correctly.
Avoid locale-specific formatting outside intl utilities. Dates, numbers, and currencies should be formatted using the proper localization tools.
Always regenerate localization files after updating ARB files. This ensures the app reflects all the latest translations.

How to Automate Translations with AI

In Flutter applications that rely on ARB files for localization, translation maintenance becomes increasingly costly as the application grows. Each new message must be manually propagated across locale files, often resulting in missing keys, inconsistent phrasing, or delayed updates. This problem is amplified in projects that do not use a Translation Management System (TMS) and instead keep ARB files directly in the repository.

While many TMS platforms have begun adding AI-assisted translation features, not all projects use a TMS at all, particularly small teams, internal tools, or personal projects. In these cases, developers frequently resort to copying strings into AI chat tools and pasting results back into ARB files, which is inefficient and difficult to scale.

To address this workflow gap, LeanCode published the arb_translate package, a Dart-based CLI tool that automates missing ARB translations using large language models.

Design Approach

The model behind arb_translate aligns with Flutter’s existing localization pipeline rather than replacing it:

English ARB files remain the source of truth
Only missing keys are translated
Output is written back as standard ARB files
flutter gen-l10n is still responsible for code generation

This design makes the tool suitable for both local development and CI usage, without introducing new runtime dependencies or localization abstractions.

At a high level, the flow is:

Parse the base (typically English) ARB file
Identify missing keys in target locale ARB files
Send key–value pairs to an LLM via API
Receive translated strings
Update or generate locale-specific ARB files
Run flutter gen-l10n to regenerate localized resources

Gemini-Based Setup

To use Gemini for ARB translation:

Generate a Gemini API key
https://ai.google.dev/tutorials/setup
Install the CLI:

dart pub global activate arb_translate

Export the API key:

export ARB_TRANSLATE_API_KEY=your-api-key

Run the tool from the Flutter project root:

arb_translate

The tool scans existing ARB files, generates missing translations, and writes them back to disk.

OpenAI/ChatGPT Support

As of version 1.0.0, arb_translate also supports OpenAI ChatGPT models. This allows teams to standardize on OpenAI infrastructure or switch providers without changing their localization workflow.

Generate an OpenAI API key
https://platform.openai.com/api-keys
Install the tool:

dart pub global activate arb_translate

Export the API key:

export ARB_TRANSLATE_API_KEY=your-api-key

Select OpenAI as the provider:

Via l10n.yaml:

arb-translate-model-provider: open-ai

Or via CLI:

arb_translate --model-provider open-ai

Execute:

arb_translate

Practical Use Cases

This approach is not intended to replace professional translation or review workflows. LLM output can vary between runs, so treat the generated strings as drafts. Instead, the tool serves as an automation layer that:

Eliminates manual copy-paste workflows
Keeps ARB files structurally consistent
Enables translation generation in CI
Allows downstream review in a TMS if required

For content-heavy Flutter applications or teams without a dedicated localization platform, this provides a pragmatic and maintainable solution.

Best Practices and Considerations

Always define a fallback locale to ensure the app remains usable.
Avoid hardcoding user-facing strings; rely on localized resources.
Use semantic and stable ARB keys for maintainability.
Persist user language preferences to provide a consistent experience.
Test your app with long translations and multiple locales to catch layout or UI issues.

Conclusion

Localization is a foundational requirement for modern Flutter applications. By combining Flutter’s built-in localization framework, the intl package, and Bloc for state management, you gain a robust and scalable solution.

With automatic device language detection, runtime switching, and clean architecture, your application becomes globally accessible without sacrificing maintainability.

References

Here are official links you can use as references for Flutter localization:

Flutter Internationalization Guide – Official Flutter guide on how to internationalize your app:
https://docs.flutter.dev/ui/accessibility-and-internationalization/internationalization
Dart intl Package Documentation – API reference for the intl library used for formatting and localization utilities:
https://api.flutter.dev/flutter/package-intl_intl/index.html
Flutter flutter_localizations API – API docs for the flutter_localizations library that provides localized strings and resources for Flutter widgets:
https://api.flutter.dev/flutter/flutter_localizations/
Flutter App Localization with AI (LeanCode) – A guide on speeding up Flutter localization using AI and tools like Gemini or ChatGPT, including details on the arb_translate package.
https://leancode.co/blog/flutter-app-localization-with-ai
arb_translate package (pub.dev) – A tool for automating ARB file translations in Flutter:
https://pub.dev/packages/arb_translate

How to Turn Your Favorite Tech Blogs into a Personal Podcast

Spruce Emmanuel — Wed, 21 Jan 2026 21:46:25 +0000

These days it feels almost impossible to keep up with tech news. I step away for three days, and suddenly there is a new AI model, a new framework, and a new tool everyone says I must learn. Reading everything no longer scales, but I still want to stay informed.

So I decided to change the format instead of giving up. I took a few tech blogs I already enjoy reading, picked the best articles, converted them to audio using my own voice, and turned the result into a private podcast. Now I can stay up to date while walking, running, or driving.

In this tutorial, you’ll learn how to build a simplified version of that pipeline step by step.

What You Are Going to Build
Prerequisites
Project Overview
Getting Started
How to Get the Content
How to Filter the Content
How to Clean Up the Content
How to Convert Content to Audio
How to Upload the Audio to Cloudflare R2
How to Make the Podcast
How to Automate the Pipeline
Conclusion

What You Are Going to Build

You will build a Node.js script that does the following:

Fetches articles from RSS feeds.
Extracts clean, readable text from each article.
Filters out content you do not want to listen to.
Cleans the text so it sounds good when spoken.
Converts the text to natural-sounding audio using your own voice.
Uploads the audio to Cloudflare R2.
Generates a podcast RSS feed.
Runs automatically on a schedule.

At the end, you will have a real podcast feed you can subscribe to on your phone.

If you want to skip the tutorial and jump straight into using the finished tool, you can find the complete version and instructions on GitHub.

Prerequisites

To follow along, you need basic JavaScript knowledge.

You also need:

Node.js 22 or newer.
A place to store audio files (Cloudflare R2 in this tutorial).
A text-to-speech API (OrangeClone in this tutorial).

Project Overview

Before writing code, it helps to understand the idea clearly.

This project is a pipeline:

Fetch content -> Filter content -> Clean up content -> Convert to audio -> Repeat

Each step takes the output of the previous one. Keeping the flow linear makes the project easier to reason about, debug, and automate.

All code in this tutorial lives in a single file called index.js.

Getting Started

Create a new project folder and your main file.

mkdir podcast-pipeline
cd podcast-pipeline
touch index.js

Initialize the project and install dependencies.

npm init -y
npm install rss-parser @mozilla/readability jsdom node-fetch uuid xmlbuilder @aws-sdk/client-s3

Enable ESM so import syntax works in Node 22.

npm pkg set type=module

Here is what each dependency is used for:

rss-parser reads RSS feeds.
@mozilla/readability extracts readable article text.
jsdom provides a DOM for Readability.
node-fetch fetches remote content.
uuid generates unique filenames.
xmlbuilder creates the podcast RSS feed.
@aws-sdk/client-s3 uploads audio to Cloudflare R2.

How to Get the Content

The first decision is where your content comes from.

Avoid scraping websites directly. Scraped HTML is noisy and inconsistent. RSS feeds are structured and reliable. Most serious blogs provide one.

Open index.js and define your sources.

import Parser from "rss-parser";
import fetch from "node-fetch";
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

const parser = new Parser();

const NUMBER_OF_ARTICLES_TO_FETCH = 15;

const SOURCES = [
  "https://www.freecodecamp.org/news/rss/",
  "https://hnrss.org/frontpage",
];

Now fetch articles and extract readable content.

async function fetchArticles() {
  const articles = [];

  for (const source of SOURCES) {
    const feed = await parser.parseURL(source);

    for (const item of feed.items.slice(0, NUMBER_OF_ARTICLES_TO_FETCH)) {
      if (!item.link) continue;

      const response = await fetch(item.link);
      const html = await response.text();

      const dom = new JSDOM(html, { url: item.link });
      const reader = new Readability(dom.window.document);
      const content = reader.parse();

      if (!content) continue;

      articles.push({
        title: item.title,
        link: item.link,
        content: content.content,
        text: content.textContent,
      });
    }
  }

  return articles.slice(0, NUMBER_OF_ARTICLES_TO_FETCH);
}

This function:

Reads RSS feeds.
Downloads each article.
Extracts clean text using Readability.
Returns a list of articles ready for processing.
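One thing to watch: as written, fetchArticles collects up to 15 articles per feed and then keeps only the first 15 overall, so a busy first feed can crowd out the later ones. A minimal sketch of one way around that, assuming you first collect each feed's articles into its own array (interleaveSources is a hypothetical helper, not part of the original script):

```javascript
// Hypothetical helper: merge per-feed article lists round-robin,
// stopping once `limit` articles have been collected.
function interleaveSources(lists, limit) {
  const merged = [];
  const longest = Math.max(0, ...lists.map((list) => list.length));

  for (let i = 0; i < longest && merged.length < limit; i++) {
    for (const list of lists) {
      if (i < list.length && merged.length < limit) {
        merged.push(list[i]);
      }
    }
  }

  return merged;
}

const picked = interleaveSources(
  [["fcc-1", "fcc-2", "fcc-3"], ["hn-1", "hn-2"]],
  4
);
console.log(picked); // [ 'fcc-1', 'hn-1', 'fcc-2', 'hn-2' ]
```

With this shape, each feed contributes its newest items first, and the overall limit still applies.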

How to Filter the Content

Not every article deserves your attention. Start by filtering out topics you do not want to hear about.

const BLOCKED_KEYWORDS = ["crypto", "nft", "giveaway"];

function filterByKeywords(articles) {
  return articles.filter(
    (article) =>
      !BLOCKED_KEYWORDS.some((keyword) =>
        article.text.toLowerCase().includes(keyword)
      )
  );
}

Next, remove promotional content.

function removePromotionalContent(articles) {
  return articles.filter(
    (article) => !article.text.toLowerCase().includes("sponsored")
  );
}

Finally, remove articles that are too short.

function filterByWordCount(articles, minWords = 700) {
  return articles.filter(
    (article) => article.text.split(/\s+/).length >= minWords
  );
}

After these steps, you are left with articles you actually want to listen to.
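The keyword filter above uses substring matching, so a blocked keyword like "crypto" also removes an article about cryptography. If that is too aggressive for your feeds, a whole-word check is one option. This is a sketch only; containsBlockedKeyword and escapeRegExp are hypothetical names, and whole-word matching has its own trade-off (for example, "nft" will no longer match "NFTs"):

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(value) {
  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical variant of the keyword check that matches whole words only.
function containsBlockedKeyword(text, keywords) {
  return keywords.some((keyword) =>
    new RegExp(`\\b${escapeRegExp(keyword)}\\b`, "i").test(text)
  );
}

console.log(containsBlockedKeyword("A primer on cryptography", ["crypto"])); // false
console.log(containsBlockedKeyword("Why crypto crashed", ["crypto"])); // true
```

filterByKeywords could call this helper with article.text and BLOCKED_KEYWORDS in place of the includes check.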

How to Clean Up the Content

Raw article text still needs to be cleaned up to sound good when spoken. First, replace images with spoken placeholders.

function replaceImages(html) {
  return html.replace(/<img[^>]*alt="([^"]*)"[^>]*>/gi, (_, alt) => {
    return alt ? `[Image: ${alt}]` : `[Image omitted]`;
  });
}

Next, remove code blocks.

function replaceCodeBlocks(html) {
  return html.replace(
    /<pre[^>]*><code[^>]*>[\s\S]*?<\/code><\/pre>/gi,
    "[Code example omitted]"
  );
}

Strip URLs and replace them with spoken text.

function replaceUrls(text) {
  return text.replace(/https?:\/\/\S+/gi, "link removed");
}

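Normalize common symbols.

function normalizeSymbols(text) {
  return text
    // Pad with spaces so "Q&A" becomes "Q and A", not "QandA".
    .replace(/&(?:amp;)?/g, " and ")
    .replace(/%/g, " percent")
    // Read "$5" as "5 dollars"; any other "$" becomes the word "dollar".
    .replace(/\$(\d+(?:[.,]\d+)*)/g, "$1 dollars")
    .replace(/\$/g, " dollar ");
}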

Convert HTML to text so TTS does not read tags.

function stripHtml(html) {
  return html.replace(/<[^>]+>/g, " ");
}

Combine everything into one cleanup step.

function cleanArticle(article) {
  let cleaned = replaceImages(article.content);
  cleaned = replaceCodeBlocks(cleaned);
  cleaned = stripHtml(cleaned);
  cleaned = replaceUrls(cleaned);
  cleaned = normalizeSymbols(cleaned);

  return {
    ...article,
    cleanedText: cleaned,
  };
}
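One gap in the cleanup above: stripHtml removes tags but leaves HTML entities such as &amp;nbsp; in place, along with ragged whitespace where the tags used to be. A minimal sketch of an extra step, assuming a small hand-picked entity table is enough for your feeds (tidyText and ENTITIES are hypothetical names, not part of the original script):

```javascript
// A small, deliberately incomplete entity table (assumption: these cover
// the entities that show up most often in article HTML).
const ENTITIES = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
  "&nbsp;": " ",
};

// Hypothetical helper: decode common entities, then collapse runs of whitespace.
function tidyText(text) {
  return text
    .replace(/&(?:amp|lt|gt|quot|#39|nbsp);/g, (entity) => ENTITIES[entity])
    .replace(/\s+/g, " ")
    .trim();
}

console.log(tidyText("Tips  &amp;   tricks\n for&nbsp;Node "));
// Tips & tricks for Node
```

It could run right after stripHtml in cleanArticle, before replaceUrls and normalizeSymbols.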

At this point, the text is ready for audio generation.

How to Convert Content to Audio

Browser speech APIs sound robotic. I wanted something that sounded human and familiar. After trying several tools, I settled on OrangeClone. It was the only option that actually sounded like me.

Create a free account and copy your API key from the dashboard.

Record 10 to 15 seconds of clean audio and save it as SAMPLE_VOICE.wav in the project root. Then create a voice character (one-time setup).

import fs from "node:fs/promises";

const ORANGECLONE_API_KEY = process.env.ORANGECLONE_API_KEY;
const ORANGECLONE_BASE_URL =
  process.env.ORANGECLONE_BASE_URL || "https://orangeclone.com/api";

async function createVoiceCharacter({ name, avatarStyle, voiceSamplePath }) {
  const audioBuffer = await fs.readFile(voiceSamplePath);
  const audioBase64 = audioBuffer.toString("base64");

  const response = await fetch(
    `${ORANGECLONE_BASE_URL}/characters/create`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${ORANGECLONE_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        name,
        avatarStyle,
        voiceSample: {
          format: "wav",
          data: audioBase64,
        },
      }),
    }
  );

  if (!response.ok) {
    const errorText = await response.text();
    throw new Error(`Failed to create character: ${errorText}`);
  }

  const data = await response.json();

  return (
    data.data?.id ||
    data.data?.characterId ||
    data.id ||
    data.characterId
  );
}

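Generate audio from text.

async function generateAudio(characterId, text) {
  const response = await fetch(`${ORANGECLONE_BASE_URL}/voices_clone`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${ORANGECLONE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      characterId,
      text,
    }),
  });

  // Fail early instead of passing an error payload on to the polling step.
  if (!response.ok) {
    throw new Error(`Failed to start audio job: ${await response.text()}`);
  }

  return response.json();
}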

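Wait for the job to complete.

async function waitForAudio(jobId, maxAttempts = 120) {
  // Poll every 5 seconds and give up after maxAttempts (about 10 minutes
  // by default) so a stuck job cannot hang the script forever.
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetch(`${ORANGECLONE_BASE_URL}/voices/${jobId}`);
    const data = await response.json();

    if (data.status === "completed") {
      return data.audioUrl;
    }

    await new Promise((r) => setTimeout(r, 5000));
  }

  throw new Error(`Timed out waiting for audio job ${jobId}`);
}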

How to Upload the Audio to Cloudflare R2

OrangeClone returns an audio URL, but podcast apps need a stable, public file that will not expire. That is where Cloudflare R2 comes in. R2 is S3-compatible storage, which means we can upload files using the AWS SDK and serve them publicly for podcast apps.

How to Set Up Credentials

Create an R2 bucket in your Cloudflare dashboard and set the following environment variables:

R2_ACCOUNT_ID

R2_ACCESS_KEY_ID

R2_SECRET_ACCESS_KEY

R2_BUCKET_NAME

R2_PUBLIC_URL


These values allow the script to upload files and generate public URLs for them.

How to Initialize the R2 Client

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const r2 = new S3Client({
  region: "auto",
  endpoint: `https://${process.env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY,
  },
});

This creates an S3-compatible client that connects directly to your Cloudflare R2 account instead of AWS.

How to Download the Audio

async function downloadAudio(audioUrl) {
  const response = await fetch(audioUrl);
  // Avoid uploading an error page as if it were audio.
  if (!response.ok) {
    throw new Error(`Failed to download audio: ${response.status}`);
  }
  const buffer = await response.arrayBuffer();
  return Buffer.from(buffer);
}

OrangeClone gives us a URL, not a file. This function downloads the audio and converts it into a Node.js buffer so it can be uploaded to R2.

How to Upload to R2

import { v4 as uuid } from "uuid";

async function uploadToR2(audioBuffer) {
  const fileName = `${uuid()}.mp3`;

  const command = new PutObjectCommand({
    Bucket: process.env.R2_BUCKET_NAME,
    Key: fileName,
    Body: audioBuffer,
    ContentType: "audio/mpeg",
  });

  await r2.send(command);

  return `${process.env.R2_PUBLIC_URL}/${fileName}`;
}

This function uploads the audio buffer to R2 using a unique filename and returns a public URL that podcast apps can access.

Putting It Together

const audioUrl = await waitForAudio(jobId);
const audioBuffer = await downloadAudio(audioUrl);
const publicAudioUrl = await uploadToR2(audioBuffer);

Here jobId is the ID of the job returned by generateAudio. This snippet only illustrates the order of the calls. The same three lines appear inside the pipeline loop later, so you do not need to add them to index.js on their own. At the end of this step, publicAudioUrl is the final audio file used in the podcast RSS feed.

How to Make the Podcast

With public audio URLs, you can now generate an RSS feed.

import xmlbuilder from "xmlbuilder";

function generatePodcastFeed(episodes) {
  const feed = xmlbuilder
    .create("rss", { version: "1.0" })
    .att("version", "2.0")
    .ele("channel");

  feed.ele("title", "My Tech Podcast");
  feed.ele("description", "Tech articles converted to audio");
  feed.ele("link", "https://your-site.com");

  episodes.forEach((ep) => {
    const item = feed.ele("item");
    item.ele("title", ep.title);
    item.ele("enclosure", {
      url: ep.audioUrl,
      type: "audio/mpeg",
    });
  });

  return feed.end({ pretty: true });
}
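The feed above is intentionally minimal. The RSS 2.0 specification lists url, length, and type as required attributes of enclosure, and podcast apps generally also use guid and pubDate to tell episodes apart and order them. A sketch of how those fields might be derived, assuming you record the file size and a publish time when you upload the audio (toFeedItem, sizeInBytes, and publishedAt are hypothetical names):

```javascript
// Hypothetical helper: derive the extra per-episode fields a podcast client
// typically expects. `sizeInBytes` and `publishedAt` are assumed to be
// recorded when the audio is uploaded.
function toFeedItem({ title, audioUrl, sizeInBytes, publishedAt }) {
  return {
    title,
    guid: audioUrl, // the public URL is unique per episode
    pubDate: new Date(publishedAt).toUTCString(), // e.g. "Wed, 21 Jan 2026 06:00:00 GMT"
    enclosure: {
      url: audioUrl,
      length: String(sizeInBytes), // size of the audio file in bytes
      type: "audio/mpeg",
    },
  };
}

const item = toFeedItem({
  title: "Episode one",
  audioUrl: "https://example.com/a1b2.mp3",
  sizeInBytes: 1048576,
  publishedAt: "2026-01-21T06:00:00Z",
});
console.log(item.pubDate); // Wed, 21 Jan 2026 06:00:00 GMT
```

Inside generatePodcastFeed, the guid and pubDate values could be written as child elements of each item, and the enclosure object passed as the attribute object for the enclosure element.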

How to Automate the Pipeline

Automation in this project happens in two stages. First, the code itself must be able to process multiple articles in one run. Second, the script must run automatically on a schedule. We’ll start with the code-level automation.

Automating Inside the Code

Earlier, we fetched up to fifteen articles. Now we need to make sure every article that passes our filters goes through the full pipeline.

Add the following function near the bottom of index.js.

async function runPipeline() {
  const rawArticles = await fetchArticles();

  const filteredArticles = filterByWordCount(
    removePromotionalContent(filterByKeywords(rawArticles))
  );

  if (filteredArticles.length === 0) {
    console.log("No articles passed the filters");
    return [];
  }

  const characterId = await createVoiceCharacter({
    name: "My Voice",
    avatarStyle: "realistic",
    voiceSamplePath: "./SAMPLE_VOICE.wav",
  });

  const episodes = [];

  for (const article of filteredArticles) {
    console.log(`Processing: ${article.title}`);

    const cleaned = cleanArticle(article);

    const job = await generateAudio(characterId, cleaned.cleanedText);

    const audioUrl = await waitForAudio(job.id);
    const audioBuffer = await downloadAudio(audioUrl);
    const publicAudioUrl = await uploadToR2(audioBuffer);

    episodes.push({
      title: article.title,
      audioUrl: publicAudioUrl,
    });
  }

  return episodes;
}

This function does all the heavy lifting:

Fetches articles

Applies all filters

Creates the voice character once

Loops through every valid article

Converts each article into audio

Uploads the audio to Cloudflare R2

Collects podcast episode data
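In the loop as written, a single failed request (a TTS error, a download failure) rejects the whole runPipeline call and the episodes already produced in that run are lost. One possible refinement, sketched here with a hypothetical processEach helper that is not part of the original script, is to isolate failures per article:

```javascript
// Hypothetical wrapper: run `worker` for each item in order and keep going
// when one item fails, collecting successes and failures separately.
async function processEach(items, worker) {
  const succeeded = [];
  const failed = [];

  for (const item of items) {
    try {
      succeeded.push(await worker(item));
    } catch (error) {
      failed.push({ item, message: error.message });
    }
  }

  return { succeeded, failed };
}

// Usage with a stand-in worker that rejects one of three inputs.
processEach(["a", "b", "c"], async (name) => {
  if (name === "b") throw new Error("TTS request failed");
  return `${name}.mp3`;
}).then(({ succeeded, failed }) => {
  console.log(succeeded); // [ 'a.mp3', 'c.mp3' ]
  console.log(failed.length); // 1
});
```

In the pipeline, the worker would be the body of the for loop (clean, generate, wait, download, upload), and the failed list could simply be logged.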


At this point, one script run can generate multiple podcast episodes.

Running the Pipeline and Generating the Feed

Now we need a single entry point that runs the pipeline and writes the podcast feed. Add this below the pipeline function. The fs module was already imported for createVoiceCharacter, so it is not imported again here; declaring the same import twice in one ES module is a syntax error.

async function main() {
  const episodes = await runPipeline();

  if (episodes.length === 0) {
    console.log("No episodes generated");
    return;
  }

  const rss = generatePodcastFeed(episodes);

  await fs.mkdir("./public", { recursive: true });
  await fs.writeFile("./public/feed.xml", rss);

  console.log("Podcast feed generated at public/feed.xml");
}

main().catch(console.error);

When you run node index.js, this now:

Processes all selected articles

Creates multiple audio files

Generates a valid podcast RSS feed


This is the core automation.

Scheduling the Pipeline with GitHub Actions

The final step is to make this script run automatically. Create a GitHub Actions workflow file at .github/workflows/podcast.yml.

name: Podcast Pipeline

on:
  schedule:
    - cron: "0 6 * * *"

jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm install
      - run: node index.js
        env:
          ORANGECLONE_API_KEY: ${{ secrets.ORANGECLONE_API_KEY }}
          R2_ACCOUNT_ID: ${{ secrets.R2_ACCOUNT_ID }}
          R2_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
          R2_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
          R2_BUCKET_NAME: ${{ secrets.R2_BUCKET_NAME }}
          R2_PUBLIC_URL: ${{ secrets.R2_PUBLIC_URL }}

This workflow runs the pipeline every day at 06:00 UTC (GitHub Actions evaluates cron schedules in UTC).

Each run:

Fetches new articles

Generates fresh audio

Updates the podcast feed


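Once this is set up, the pipeline runs without manual work. Note that the workflow as shown writes public/feed.xml on the Actions runner only. For podcast apps to see updates, the feed also has to be published at a public URL, for example by uploading it to the R2 bucket next to the audio files.

Conclusion

This is a basic version of my full production pipeline, PostCast, but the core idea is the same.

You now know how to turn blogs into a personal podcast. Be mindful of copyright and only convert content you are allowed to reuse.

If you have questions, reach me on X at @sprucekhalifa. I write practical tech articles like this regularly.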



How to Build and Deploy a Blog-to-Audio Service Using OpenAI

Manish Shivanandhan — Wed, 14 Jan 2026 04:34:50 +0000

Turning written blog posts into audio is a simple way to reach more people. Many users prefer listening during travel or workouts. Others enjoy having both reading and listening options.

With OpenAI’s text-to-speech models, you can build a clean service that takes a blog URL or pasted text and produces a natural-sounding audio file.

In this article, you’ll learn how to build this system end-to-end. You will learn how to fetch blog content, send it to OpenAI’s audio API, save the output as an MP3 file, and serve everything through a small FastAPI app.

At the end, you’ll also build a minimal user interface and deploy it to Sevalla so that anyone can upload text and download audio without touching code.

Table of Contents

Understanding the Core Idea

How to Set Up Your Project

How to Fetch and Clean Blog Content

How to Send Text to OpenAI for Audio

How to Build a FastAPI Backend

How to Add a Simple User Interface

How to Deploy Your Service to Sevalla

Conclusion


Understanding the Core Idea

A blog-to-audio service has only three important parts. The first part takes a blog link or text and cleans it. The second part sends the clean text to OpenAI’s text-to-speech model. The third part gives the final MP3 file back to the user.

OpenAI’s speech generation is simple to use. You send text, choose a voice, and get audio back. The quality is high and works well even for long posts. This means you do not need to worry about training models or tuning voices.

The only job left is to make the system easy to use. That is where FastAPI and a small HTML form help. They wrap your code into a web service so anyone can try it.

How to Set Up Your Project

Create a folder for your project. Inside it, create a file called main.py. You will also need a basic HTML file later.

Install the libraries you need with pip:

pip install fastapi uvicorn requests beautifulsoup4 python-multipart

FastAPI gives you a simple backend. The Requests library downloads blog pages. BeautifulSoup removes HTML tags and extracts readable text. python-multipart lets FastAPI parse submitted form data.

You must also install the OpenAI client:

pip install openai

Make sure you have your OpenAI API key ready. Set it in your terminal before running the app:

export OPENAI_API_KEY="your-key"

On Windows, you can use setx. It only affects terminals opened afterwards, so start a new terminal before running the app:

setx OPENAI_API_KEY "your-key"

How to Fetch and Clean Blog Content

To convert a blog into audio, you must first extract the main article text. You can fetch the page with requests and parse it with BeautifulSoup.

Below is a simple function that does this.

import requests
from bs4 import BeautifulSoup

def extract_text_from_url(url: str) -> str:
    response = requests.get(url, timeout=10)
    # Raise on 4xx/5xx responses so an error page is never converted to audio.
    response.raise_for_status()
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all("p")
    text = " ".join(p.get_text(strip=True) for p in paragraphs)
    return text

Here is what happens step by step. 

The function downloads the page. 

BeautifulSoup reads the HTML and finds all paragraph tags. 

It pulls out the text in each paragraph and joins them into one long string. 

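This gives you a reasonably clean version of the blog post without layout code. Because it collects every paragraph tag on the page, text from navigation, footers, or comments can still slip in.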


If the user pastes text instead of a URL, you can skip this part and use the text as it is.

How to Send Text to OpenAI for Audio

OpenAI’s text-to-speech API makes this part of the work very easy. You send a message with text and select a voice such as Alloy or Verse. The API returns raw audio bytes. You can save these bytes as an MP3 file.

Here is a helper function to convert text into audio:

from openai import OpenAI
client = OpenAI()

def text_to_audio(text: str, output_path: str):
    audio = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input=text
    )
    with open(output_path, "wb") as f:
        f.write(audio.read())

This function calls the OpenAI client and passes the text, model name, and voice choice. The .read() method extracts the binary audio stream. Writing this to an MP3 file completes the process.
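If the blog post is long, you will need to limit the text length or split the text into chunks and join the audio files afterwards. The speech endpoint caps the size of a single input (the API reference listed 4096 characters at the time of writing), and many blog posts exceed that, so check the current limit before sending a full article in one request.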
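If a post is too long for one request, the chunking mentioned above could be sketched as follows. The function name chunk_text and the 4000-character default are assumptions chosen for illustration, so check the current API reference for the actual per-request limit.

```python
import re
from typing import List

def chunk_text(text: str, max_chars: int = 4000) -> List[str]:
    """Split text into pieces of at most max_chars, breaking at sentence ends where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate      # sentence still fits in the current chunk
        else:
            chunks.append(current)   # close the current chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("One. Two. Three.", max_chars=10))
```

Each chunk could then be passed to text_to_audio separately, and the resulting audio files joined in order.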
How to Build a FastAPI Backend
Now you can wrap both steps into a simple FastAPI server. This server will accept either a URL or pasted text. It will convert the content into audio and return the MP3 file as a response.
Here is the full backend code:
from fastapi import FastAPI, Form
from fastapi.responses import FileResponse
import uuid

# extract_text_from_url and text_to_audio are the helper functions defined earlier

app = FastAPI()

@app.post("/convert")
def convert(url: str = Form(None), text: str = Form(None)):
    # Require at least one of the two form fields
    if not url and not text:
        return {"error": "Please provide a URL or text"}
    if url:
        try:
            text_content = extract_text_from_url(url)
        except Exception:
            return {"error": "Could not fetch the URL"}
    else:
        text_content = text
    # Use a unique file name so concurrent requests do not overwrite each other
    file_id = uuid.uuid4().hex
    output_path = f"audio_{file_id}.mp3"
    text_to_audio(text_content, output_path)
    return FileResponse(output_path, media_type="audio/mpeg")

Here is how it works. The user sends form data with either url or text. The server checks which one exists. 
If there is a URL, it extracts text with the earlier function. If there is no URL, it uses the provided text directly. A unique file name is created for every request. Then the audio file is generated and returned as an MP3 download.
You can run the server like this:
uvicorn main:app --reload

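The server is now running at http://localhost:8000. There is no UI yet, so the root URL returns a 404 response, but the /convert endpoint is working. You can test it with a tool like Postman, with FastAPI's interactive docs at http://localhost:8000/docs, or by building the front end next.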
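If you would rather test the endpoint from Python than from Postman, a minimal client could look like the sketch below. It uses only the standard library. The helper names build_convert_request and save_audio are hypothetical, and the base URL assumes the server is running locally on port 8000.

```python
from urllib import parse, request

def build_convert_request(base_url: str, *, url: str = "", text: str = "") -> request.Request:
    """Build a form-encoded POST request for the /convert endpoint."""
    fields = {}
    if url:
        fields["url"] = url
    if text:
        fields["text"] = text
    body = parse.urlencode(fields).encode("utf-8")
    return request.Request(
        f"{base_url.rstrip('/')}/convert",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

def save_audio(req: request.Request, output_path: str) -> None:
    """Send the request and write the response body to disk."""
    with request.urlopen(req, timeout=120) as resp, open(output_path, "wb") as f:
        f.write(resp.read())

req = build_convert_request("http://localhost:8000", text="Hello world")
print(req.full_url, req.data)
```

With the server running, calling save_audio(req, "hello.mp3") would write the response to a file. Because the backend reports errors as JSON rather than audio, it is worth checking the saved file if playback fails.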
How to Add a Simple User Interface
A service is much easier to use when it has a clean UI. Below is a simple HTML page that sends either a URL or text to your FastAPI backend. Save this file as index.html in the same folder:
<!DOCTYPE html>
<html>
<head>
    <title>Blog to Audio</title>
    <style>
        body { font-family: Arial; padding: 40px; max-width: 600px; margin: auto; }
        input, textarea { width: 100%; padding: 10px; margin-top: 10px; }
        button { padding: 12px 20px; margin-top: 20px; cursor: pointer; }
    </style>
</head>
<body>
    <h2>Convert Blog to Audio</h2>
    <form action="/convert" method="post">
        <label>Blog URL</label>
        <input type="text" name="url" placeholder="Enter a blog link">
        <p>or paste text below</p>
        <textarea name="text" rows="10" placeholder="Paste blog text here"></textarea>
        <button type="submit">Convert to Audio</button>
    </form>
</body>
</html>

This page gives the user two options. They can type a URL or paste text. The form sends the data to /convert using a POST request. The response is the MP3 file, so the browser will either play it or offer it as a download.
To serve the HTML file, add this route to your main.py:
from fastapi.responses import HTMLResponse

@app.get("/")
def home():
    with open("index.html", "r") as f:
        html = f.read()
    return HTMLResponse(html)

Now, when you visit the main URL, you will see a clean form.

When you submit a URL, the server will process your request and give you an audio file.

Great. Our text-to-audio service is working. Now let’s get it into production.
How to Deploy Your Service to Sevalla
You can choose any cloud provider, like AWS, DigitalOcean, or others, to host your service. I will be using Sevalla for this example.
Sevalla is a developer-friendly PaaS provider. It offers application hosting, database, object storage, and static site hosting for your projects.
Every platform will charge you for creating a cloud resource. Sevalla comes with a $50 credit for us to use, so we won’t incur any costs for this example.
Let’s push this project to GitHub so that we can connect our repository to Sevalla. We can also enable auto-deployments so that any new change to the repository is automatically deployed.
You can also fork my repository from here.
Log in to Sevalla and click on Applications -> Create new application. You can see the option to link your GitHub repository to create a new application.

Use the default settings. Click “Create application”. Now we have to add our OpenAI API key to the environment variables. Click on the “Environment variables” section once the application is created, and save the OPENAI_API_KEY value as an environment variable.

Now we are ready to deploy our application. Click on “Deployments” and click “Deploy now”. It will take 2–3 minutes for the deployment to complete.

Once done, click on “Visit app”. You will see the application served via a URL ending with sevalla.app. This is your new root URL. You can replace localhost:8000 with this URL and start using it.

Congrats! Your blog-to-audio service is now live. You can extend this by adding other capabilities and pushing your code to GitHub. Sevalla will automatically deploy your application to production.
Conclusion
You now know how to build a full blog-to-audio service using OpenAI. You learned how to fetch blog text, convert it into speech, and serve it with FastAPI. You also learned how to create a simple user interface, allowing people to try it with no setup. 
With this foundation, you can turn any written content into smooth, natural audio. This can help creators reach a wider audience, enhance accessibility, and provide users with more ways to enjoy content.
Hope you enjoyed this article. Sign up for my free newsletter TuringTalks.ai for more hands-on tutorials on AI. You can also visit my website.
 


 A Game Developer’s Guide to Understanding Screen Resolution 
Manish Shivanandhan — Wed, 19 Nov 2025 15:59:38 +0000
 Every game developer obsesses over performance, textures, and frame rates, but resolution is the quiet foundation that makes or breaks visual quality. 
Whether you are building a pixel-art indie game or a high-fidelity 3D world, understanding how resolution works is essential. 
It affects how your art assets scale, how your UI appears, and how your game feels on different screens. Yet, many developers still treat resolution as a simple number instead of a design decision.
Let’s learn what resolution is and why it matters for game developers. 
What We Will Cover

What Resolution Really Means

The Evolution of Resolution in Gaming

DPI, Scaling, and Texture Clarity

Resolution vs. Performance

Aspect Ratio and Display Diversity

The Art of Testing in 4K and HDR

Preparing for Next-Gen Displays

Conclusion


What Resolution Really Means
Resolution defines how many pixels a screen can display horizontally and vertically.

A monitor labelled 1920x1080 has 1920 pixels across and 1080 down, which equals over two million pixels in total. More pixels mean more visual detail but also more rendering work for the GPU.
In game development, that tradeoff is constant. Rendering at higher resolutions improves clarity but reduces frame rates unless your code and assets are optimized. 
Many developers solve this by offering resolution scaling options in their games, letting players balance visual quality and performance.
It’s also important to distinguish between screen size and resolution. A 27-inch monitor and a 15-inch laptop can both run at 1080p, but the larger display will have bigger, less dense pixels. 
This is where pixel density comes in. High-density displays pack more pixels per inch, creating smoother edges and sharper textures even at the same resolution.
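To make the difference concrete, the short sketch below computes the total pixel count of a 1080p screen and the pixel density of the two displays mentioned above. The 27-inch and 15-inch diagonals come from the text. The helper name pixels_per_inch is only for illustration.

```python
import math

def pixels_per_inch(width_px: int, height_px: int, diagonal_in: float) -> float:
    """Pixel density: diagonal length in pixels divided by diagonal length in inches."""
    return math.hypot(width_px, height_px) / diagonal_in

total_pixels = 1920 * 1080  # 2,073,600 pixels
monitor_ppi = pixels_per_inch(1920, 1080, 27.0)
laptop_ppi = pixels_per_inch(1920, 1080, 15.0)

print(f"{total_pixels:,} pixels")
print(f"27-inch monitor: {monitor_ppi:.1f} PPI")
print(f"15-inch laptop:  {laptop_ppi:.1f} PPI")
```

Both screens draw the same number of pixels, but the laptop packs them into a much smaller area, which is why it looks sharper at the same viewing distance.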
The Evolution of Resolution in Gaming
Games have evolved alongside display technology. 

Early consoles ran at 240p, then 480p during the SD era. The jump to HD with 720p and 1080p transformed game visuals. Suddenly, developers had to think about anti-aliasing, texture resolution, and UI scaling in new ways.
Today, 4K and HDR are common targets for modern consoles and PCs. Developers now design with higher fidelity in mind, building lighting systems, shaders, and art pipelines that scale up to Ultra HD. 
That’s why testing on different display resolutions isn’t just good practice; it’s critical for a consistent player experience.
If you want to see how your game performs on large high-resolution displays, try testing it on a modern TV for PS5. These screens are optimized for 4K and 120Hz refresh rates, giving you a realistic look at how your game will appear in a living-room setup. 
They also help you spot UI scaling issues, frame pacing problems, and HDR color mismatches that might go unnoticed on a typical monitor.
DPI, Scaling, and Texture Clarity
For web developers, DPI mostly affects how images scale. But for game developers, DPI connects directly to texture resolution and how art assets are perceived at different screen sizes. 

A sprite that looks crisp on a 1080p monitor might appear tiny or blurry on a 4K display if not properly scaled. Engines like Unity and Unreal handle this with dynamic scaling options, but understanding the underlying math helps. 
When your display density doubles, each asset needs four times as many pixels to appear at the same size and sharpness. If you do not plan for this, your carefully crafted textures might look soft or misaligned on higher-resolution displays.
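The "four times as many pixels" figure follows from scaling both dimensions. Here is a small sketch of that arithmetic, using a hypothetical 64x64 sprite as the example asset.

```python
def scaled_asset_size(width_px: int, height_px: int, density_scale: float) -> tuple:
    """Pixel dimensions an asset needs to keep its on-screen size when density changes."""
    return round(width_px * density_scale), round(height_px * density_scale)

base = (64, 64)                          # hypothetical sprite authored for a 1x display
doubled = scaled_asset_size(*base, 2.0)  # same sprite for a display with 2x density
pixel_ratio = (doubled[0] * doubled[1]) / (base[0] * base[1])

print(doubled, pixel_ratio)
```

Doubling the density doubles the width and the height, so the pixel count grows by 2 x 2 = 4.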
This is why UI systems in modern engines rely on resolution-independent units. In Unity, Canvas Scaler helps ensure your interface looks the same on every device. In Unreal, DPI scaling rules allow developers to maintain consistent HUD layouts. Getting this right means your game remains legible on everything from handhelds to 8K TVs.
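One common way to keep a UI resolution-independent is to author the layout against a reference resolution and multiply sizes by a scale factor at runtime. The sketch below illustrates that idea only. The 1920x1080 reference and the rule of taking the smaller of the two ratios are assumptions, and this is not the exact algorithm used by Unity's Canvas Scaler or Unreal's DPI scaling.

```python
def ui_scale(screen_w: int, screen_h: int, ref_w: int = 1920, ref_h: int = 1080) -> float:
    """Scale factor that fits the reference layout inside the actual screen."""
    return min(screen_w / ref_w, screen_h / ref_h)

def scaled_px(design_px: float, scale: float) -> int:
    """Convert a size authored at the reference resolution into screen pixels."""
    return round(design_px * scale)

scale_4k = ui_scale(3840, 2160)
print(scale_4k, scaled_px(48, scale_4k))   # a 48 px button authored at 1080p
```

Taking the minimum keeps the whole layout on screen for wider or taller displays, at the cost of leaving unused space along one axis.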
Resolution vs. Performance
The biggest cost of higher resolution is GPU load. Rendering in 4K means pushing four times as many pixels as 1080p. Without proper optimization, frame rates can drop sharply. 
That’s why many AAA games use resolution scaling techniques like temporal upsampling or DLSS. These methods render frames at a lower resolution and then use AI or interpolation to upscale them with little visible loss of clarity.
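The pixel savings from rendering below the output resolution are easy to estimate. The sketch below assumes a render scale that is applied to each axis, which is how resolution scale settings are commonly expressed. The function names are illustrative only.

```python
def internal_resolution(output_w: int, output_h: int, render_scale: float) -> tuple:
    """Resolution actually rendered before upscaling (scale applies per axis)."""
    return round(output_w * render_scale), round(output_h * render_scale)

def pixel_fraction(render_scale: float) -> float:
    """Share of output pixels that are shaded at the given per-axis scale."""
    return render_scale ** 2

ratio_4k_vs_1080p = (3840 * 2160) / (1920 * 1080)
print(ratio_4k_vs_1080p)
print(internal_resolution(3840, 2160, 0.5), pixel_fraction(0.5))
```

A per-axis scale of 0.5 on a 4K output renders a 1080p image, which is a quarter of the pixels, before the upscaler reconstructs the full frame.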
As a developer, you should test your game across multiple resolutions and aspect ratios. This helps ensure your render pipeline, shaders, and assets adapt smoothly. Tools like NVIDIA Nsight or Unreal’s built-in profiler show how resolution affects frame time and GPU usage.
If your game includes video content or cinematic sequences, also remember that video compression behaves differently at higher resolutions. Encoding 4K video requires significantly more bandwidth and storage, which can affect your build size and performance during playback.
Aspect Ratio and Display Diversity
Aspect ratio determines the shape of the display.

Most modern games target 16:9, but 21:9 ultrawide and 32:9 super-ultrawide displays are becoming more popular. Developers must ensure their camera framing and UI layouts adapt accordingly.
When a game is locked to one ratio, black bars or stretching can occur. To fix this, adjust your camera’s field of view dynamically or provide safe viewport settings.
Engines like Unreal let you script these adjustments easily, while Unity’s Cinemachine system handles FOV scaling automatically.
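The math behind a dynamic field-of-view adjustment is short. The sketch below assumes the common approach of keeping the vertical FOV fixed and letting the horizontal FOV widen with the aspect ratio (often called Hor+). The function names are illustrative and not tied to any engine API.

```python
import math

def horizontal_fov(vertical_fov_deg: float, aspect: float) -> float:
    """Horizontal FOV in degrees for a given vertical FOV and width/height aspect ratio."""
    half_v = math.radians(vertical_fov_deg) / 2
    return math.degrees(2 * math.atan(math.tan(half_v) * aspect))

def vertical_fov(horizontal_fov_deg: float, aspect: float) -> float:
    """Inverse conversion: vertical FOV in degrees from a horizontal FOV."""
    half_h = math.radians(horizontal_fov_deg) / 2
    return math.degrees(2 * math.atan(math.tan(half_h) / aspect))

for name, aspect in (("16:9", 16 / 9), ("21:9", 21 / 9), ("32:9", 32 / 9)):
    print(f"{name}: {horizontal_fov(60.0, aspect):.1f} degrees horizontal")
```

With the vertical FOV held at 60 degrees, wider displays show more of the scene at the sides instead of stretching or cropping the image.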
Even TVs now vary in aspect ratio capabilities, especially with new mini LED and OLED technologies. Testing across multiple ratios ensures your game looks balanced and cinematic on every screen.
The Art of Testing in 4K and HDR
4K and HDR introduce new layers of visual complexity. HDR displays show a wider range of brightness and color depth, which means lighting and textures can look completely different compared to SDR monitors. To handle this, calibrate your color grading pipeline and use tone mapping tools within your engine.
When working with HDR assets, always test your output on real hardware. Emulators and monitors often fail to reproduce true HDR contrast. A proper HDR-certified TV helps you identify overexposure, color clipping, and banding issues before release.
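To see what a tone mapping curve does to bright values, here is a minimal sketch of the classic Reinhard operator. It is only one of the simplest curves and is shown for intuition. The tone mappers shipped with game engines are more elaborate, and the input values below are made-up scene luminances.

```python
def reinhard(luminance: float) -> float:
    """Map a scene luminance in [0, inf) into the displayable range [0, 1)."""
    return luminance / (1.0 + luminance)

def clipped(luminance: float) -> float:
    """Naive alternative: clamp everything above 1.0, losing highlight detail."""
    return min(luminance, 1.0)

for value in (0.25, 1.0, 4.0, 16.0):   # hypothetical scene luminances
    print(f"in={value:5.2f}  clipped={clipped(value):.2f}  reinhard={reinhard(value):.3f}")
```

Clamping maps every bright value to the same output, which is the kind of highlight clipping to look for on real hardware, while a tone mapping curve keeps bright values distinguishable.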
Preparing for Next-Gen Displays
The display industry continues to evolve fast. 8K and high refresh rate panels are already entering mainstream markets. 
For developers, this means thinking ahead. Designing scalable rendering systems, supporting dynamic resolution, and maintaining flexible UI layouts are now essential parts of modern game design.
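Dynamic resolution can be thought of as a small feedback loop: measure the frame time, then nudge the render scale toward the frame budget. The controller below is a simplified sketch of that loop. The thresholds, step size, and scale limits are assumptions chosen for illustration, and engines implement this with their own heuristics.

```python
def next_render_scale(current_scale: float, frame_ms: float, target_ms: float = 16.7,
                      step: float = 0.05, min_scale: float = 0.5,
                      max_scale: float = 1.0) -> float:
    """Lower the scale when frames run over budget, raise it when there is headroom."""
    if frame_ms > target_ms * 1.05:      # more than 5% over budget
        current_scale -= step
    elif frame_ms < target_ms * 0.85:    # at least 15% under budget
        current_scale += step
    return max(min_scale, min(max_scale, current_scale))

scale = 1.0
for frame_ms in (22.0, 21.0, 19.0, 16.5, 13.0):   # hypothetical frame times
    scale = next_render_scale(scale, frame_ms)
    print(f"frame {frame_ms:4.1f} ms -> render scale {scale:.2f}")
```

The gap between the two thresholds keeps the scale from oscillating when the frame time sits close to the target.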
As displays get sharper, player expectations rise too. Textures, shaders, and post-processing all need to support higher levels of detail without compromising performance. By understanding how resolution interacts with your pipeline, you can future-proof your games for years to come.
Conclusion
Resolution is more than a number on a settings menu. It is a design constraint, a performance factor, and a creative opportunity. As a game developer, mastering resolution helps you build experiences that look sharp, play smoothly, and scale across every device.
The next time you polish your textures or fine-tune your rendering settings, remember that every pixel counts. Understanding how resolution, scaling, and density interact will not only make your games more beautiful but also more accessible to every player, whether they’re gaming on a laptop, a monitor, or the living-room TV that brings your visuals to life in stunning detail.
Hope you enjoyed this article. Find me on LinkedIn or visit my website.
 


 How to Use Transformers for Real-Time Gesture Recognition 
OMOTAYO OMOYEMI — Mon, 06 Oct 2025 13:39:30 +0000
 Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are not static images. Rather, they unfold over time. To build more robust, real-time systems, we need models that can capture both spatial details and temporal context.
This is where Transformers come in. Originally built for language, they’ve become state-of-the-art in vision tasks thanks to models like the Vision Transformer (ViT) and video-focused variants such as TimeSformer.
In this tutorial, we’ll use a Transformer backbone to create a lightweight real-time gesture recognition tool, optimized for small datasets and deployable on a regular laptop webcam.
Table of Contents

Why Transformers for Gestures?

What You’ll Learn

Prerequisites

Project Setup

Generate a Gesture Dataset

Option 1: Generate a Synthetic Dataset

Training Script: train.py

Export the Model to ONNX

Evaluate Accuracy + Latency

Option 2: Use Small Samples from Public Gesture Datasets

Accessibility Notes & Ethical Limits

Next Steps

Conclusion


Why Transformers for Gestures?
Transformers are powerful because they use self-attention to model relationships across a sequence. For gestures, this means the model doesn’t just see isolated frames, but also learns how movements evolve over time. A wave, for example, looks different from a raised hand only when viewed as a sequence.
Vision Transformers process images as patches, while video Transformers extend this to multiple frames with temporal attention. Even a simple approach, like applying ViT to each frame and pooling across time, can be competitive with, and sometimes outperform, traditional CNN-based methods on small datasets.
Combined with Hugging Face’s pre-trained models and ONNX Runtime for optimization, Transformers make it possible to train on a modest dataset and still achieve smooth real-time recognition.
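The two ideas above, splitting a frame into patches and pooling frame features across time, can be illustrated without any deep learning library. The sketch below uses plain Python lists. The 224-pixel frame and 16-pixel patch sizes match the vit_tiny_patch16_224 model used later in this tutorial, and the tiny two-dimensional embeddings are made-up values.

```python
from typing import List

def num_patches(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of square patches a ViT cuts a square image into."""
    return (image_size // patch_size) ** 2

def temporal_mean_pool(frame_embeddings: List[List[float]]) -> List[float]:
    """Average T per-frame embedding vectors (each of length D) into one length-D vector."""
    t = len(frame_embeddings)
    return [sum(column) / t for column in zip(*frame_embeddings)]

print(num_patches())                                   # 14 x 14 patches per frame
print(temporal_mean_pool([[1.0, 2.0], [3.0, 4.0]]))    # two frames, D = 2
```

Mean pooling discards the order of the frames, which keeps the model simple. Video Transformers with temporal attention keep that ordering information.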
What You’ll Learn
In this tutorial, you’ll build a gesture recognition system using Transformers. By the end, you’ll know how to:

Create (or record) a tiny gesture dataset

Train a Vision Transformer (ViT) with temporal pooling

Export the model to ONNX for faster inference

Build a real-time Gradio app that classifies gestures from your webcam

Evaluate your model’s accuracy and latency with simple scripts

Understand the accessibility potential and ethical limits of gesture recognition


Prerequisites
To follow along, you should have:

Basic Python knowledge (functions, scripts, virtual environments)

Familiarity with PyTorch (tensors, datasets, training loops) – helpful but not required

Python 3.8+ installed on your system

A webcam (for the live demo in Gradio)

Optionally: GPU access (training on CPU works, but is slower)


Project Setup
Create a new project folder and install the required libraries.
# Create a new project directory and navigate into it
mkdir transformer-gesture && cd transformer-gesture

# Set up a Python virtual environment
python -m venv .venv

# Activate the virtual environment
# Windows PowerShell
.venv\Scripts\Activate.ps1

# macOS/Linux
source .venv/bin/activate

These commands set up a new Python project with a virtual environment. Here's a breakdown of each part:

mkdir transformer-gesture && cd transformer-gesture: This command creates a new directory named "transformer-gesture" and then navigates into it.

python -m venv .venv: This command creates a new virtual environment in the current directory. The virtual environment is stored in a folder named ".venv".

Activating the virtual environment:

For Windows PowerShell, you can use .venv\Scripts\Activate.ps1 to activate the virtual environment.

For macOS/Linux, use source .venv/bin/activate to activate the virtual environment.




Activating a virtual environment ensures that the Python interpreter and any packages you install are isolated to this specific project, preventing conflicts with other projects or system-wide packages.
Create a requirements.txt file:
torch>=2.0
torchvision
torchaudio
timm
huggingface_hub

onnx
onnxruntime

gradio

numpy
opencv-python
pillow

matplotlib
seaborn
scikit-learn

Here's a brief explanation of each package in the list:

torch>=2.0: PyTorch is a popular open-source deep learning framework that provides a flexible and efficient platform for building and training neural networks. Version 2.0 and above includes improvements in performance and new features.

torchvision: This library is part of the PyTorch ecosystem and provides tools for computer vision tasks, including datasets, model architectures, and image transformations.

torchaudio: Also part of the PyTorch ecosystem, Torchaudio provides audio processing tools and datasets, making it easier to work with audio data in deep learning projects.

timm: The PyTorch Image Models (timm) library offers a collection of pre-trained models and utilities for computer vision tasks, facilitating quick experimentation and deployment.

huggingface_hub: This library allows easy access to models and datasets hosted on the Hugging Face Hub, a platform for sharing and collaborating on machine learning models and datasets.

onnx: The Open Neural Network Exchange (ONNX) format is used to represent machine learning models, enabling interoperability between different frameworks.

onnxruntime: This is a high-performance runtime for executing ONNX models, allowing for efficient deployment across various platforms.

gradio: Gradio is a library for creating user interfaces for machine learning models, making them accessible through a web interface for easy interaction and testing.

numpy: A fundamental package for numerical computing in Python, providing support for arrays and a wide range of mathematical functions.

opencv-python: OpenCV is a library for computer vision and image processing tasks, widely used for real-time applications.

pillow: A Python Imaging Library (PIL) fork, Pillow provides tools for opening, manipulating, and saving many different image file formats.

matplotlib: A plotting library for Python, Matplotlib is used for creating static, interactive, and animated visualizations in Python.

seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.

scikit-learn: A machine learning library in Python that provides simple and efficient tools for data analysis and modeling, including classification, regression, clustering, and dimensionality reduction.


Install dependencies:
pip install -r requirements.txt

This command tells pip, the Python package installer, to read requirements.txt and install every package listed in it, so the project has all the dependencies it needs to run. Keeping dependencies in this file is a common way to manage and share them in Python projects.
Generate a Gesture Dataset
To train our Transformer-based gesture recognizer, we need some data. Instead of downloading a huge dataset, we’ll start with a tiny synthetic dataset you can generate in seconds. This makes the tutorial lightweight and ensures that everyone can follow along without dealing with multi-gigabyte downloads.
Option 1: Generate a Synthetic Dataset
We’ll use a small Python script that creates short .mp4 clips of a moving (or still) coloured box. Each class represents a gesture:

swipe_left – box moves from right to left

swipe_right – box moves from left to right

stop – box stays still in the center


Save this script as generate_synthetic_gestures.py in your project root:
import os, cv2, numpy as np, random, argparse

def ensure_dir(p): os.makedirs(p, exist_ok=True)

def make_clip(mode, out_path, seconds=1.5, fps=16, size=224, box_size=60, seed=0, codec="mp4v"):
    rng = random.Random(seed)
    frames = int(seconds * fps)
    H = W = size

    # background + box color
    bg_val = rng.randint(160, 220)
    bg = np.full((H, W, 3), bg_val, dtype=np.uint8)
    color = (rng.randint(20, 80), rng.randint(20, 80), rng.randint(20, 80))

    # path of motion
    y = rng.randint(40, H - 40 - box_size)
    if mode == "swipe_left":
        x_start, x_end = W - 20 - box_size, 20
    elif mode == "swipe_right":
        x_start, x_end = 20, W - 20 - box_size
    elif mode == "stop":
        x_start = x_end = (W - box_size) // 2
    else:
        raise ValueError(f"Unknown mode: {mode}")

    fourcc = cv2.VideoWriter_fourcc(*codec)
    vw = cv2.VideoWriter(out_path, fourcc, fps, (W, H))
    if not vw.isOpened():
        raise RuntimeError(
            f"Could not open VideoWriter with codec '{codec}'. "
            "Try --codec XVID and use .avi extension, e.g. out.avi"
        )

    for t in range(frames):
        alpha = t / max(1, frames - 1)
        x = int((1 - alpha) * x_start + alpha * x_end)
        # small jitter to avoid being too synthetic
        jitter_x, jitter_y = rng.randint(-2, 2), rng.randint(-2, 2)
        frame = bg.copy()
        cv2.rectangle(frame, (x + jitter_x, y + jitter_y),
                      (x + jitter_x + box_size, y + jitter_y + box_size),
                      color, thickness=-1)
        # overlay text
        cv2.putText(frame, mode, (8, 24), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 0), 2, cv2.LINE_AA)
        cv2.putText(frame, mode, (8, 24), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 1, cv2.LINE_AA)
        vw.write(frame)

    vw.release()

def write_labels(labels, out_dir):
    with open(os.path.join(out_dir, "labels.txt"), "w", encoding="utf-8") as f:
        for c in labels:
            f.write(c + "\n")

def main():
    ap = argparse.ArgumentParser(description="Generate a tiny synthetic gesture dataset.")
    ap.add_argument("--out", default="data", help="Output directory (default: data)")
    ap.add_argument("--classes", nargs="+",
                    default=["swipe_left", "swipe_right", "stop"],
                    help="Class names (default: swipe_left swipe_right stop)")
    ap.add_argument("--clips", type=int, default=16, help="Clips per class (default: 16)")
    ap.add_argument("--seconds", type=float, default=1.5, help="Seconds per clip (default: 1.5)")
    ap.add_argument("--fps", type=int, default=16, help="Frames per second (default: 16)")
    ap.add_argument("--size", type=int, default=224, help="Frame size WxH (default: 224)")
    ap.add_argument("--box", type=int, default=60, help="Box size (default: 60)")
    ap.add_argument("--codec", default="mp4v", help="Codec fourcc (mp4v or XVID)")
    ap.add_argument("--ext", default=".mp4", help="File extension (.mp4 or .avi)")
    args = ap.parse_args()

    ensure_dir(args.out)
    write_labels(args.classes, ".")  # writes labels.txt to project root

    print(f"Generating synthetic dataset -> {args.out}")
    for cls in args.classes:
        cls_dir = os.path.join(args.out, cls)
        ensure_dir(cls_dir)
        mode = "stop" if cls == "stop" else ("swipe_left" if "left" in cls else ("swipe_right" if "right" in cls else "stop"))
        for i in range(args.clips):
            filename = os.path.join(cls_dir, f"{cls}_{i+1:03d}{args.ext}")
            make_clip(
                mode=mode,
                out_path=filename,
                seconds=args.seconds,
                fps=args.fps,
                size=args.size,
                box_size=args.box,
                seed=i + 1,
                codec=args.codec
            )
        print(f"  {cls}: {args.clips} clips")

    print("Done. You can now run: python train.py, python export_onnx.py, python app.py")

if __name__ == "__main__":
    main()

The script generates a synthetic gesture dataset by creating video clips of a moving or stationary coloured box, simulating gestures like "swipe left," "swipe right," and "stop," and saves them in a specified output directory.
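The motion in each clip comes from linear interpolation between a start and an end position. The sketch below isolates that step from make_clip, without the jitter or the drawing code, using the script's default values (224-pixel frames, a 60-pixel box, 1.5 seconds at 16 fps). The helper name box_positions is only for illustration.

```python
def box_positions(x_start: int, x_end: int, frames: int) -> list:
    """Box x-coordinate for every frame, moving linearly from x_start to x_end."""
    positions = []
    for t in range(frames):
        alpha = t / max(1, frames - 1)   # 0.0 on the first frame, 1.0 on the last
        positions.append(int((1 - alpha) * x_start + alpha * x_end))
    return positions

frames = int(1.5 * 16)                                   # 24 frames per clip
swipe_left = box_positions(224 - 20 - 60, 20, frames)    # starts at x=144, ends at x=20
print(swipe_left)
```

A swipe_right clip uses the same positions in the opposite direction, so frame order is what separates the two classes.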
Now run it inside your virtual environment:
python generate_synthetic_gestures.py --out data --clips 16 --seconds 1.5

The command above runs a Python script named generate_synthetic_gestures.py, which generates a synthetic gesture dataset with 16 clips per gesture, each lasting 1.5 seconds, and saves the output in a directory named "data".
This creates a dataset like:
data/
  swipe_left/*.mp4
  swipe_right/*.mp4
  stop/*.mp4
labels.txt

Each folder contains short clips of a moving (or still) box that simulate gestures. This is perfect for testing the pipeline.
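Before training, it can be worth confirming that every class folder holds the number of clips you expect. The sketch below is one way to do that with the standard library. The helper name count_clips is hypothetical, and it assumes the folder layout shown above.

```python
from pathlib import Path

def count_clips(data_dir: str, extensions=(".mp4", ".avi")) -> dict:
    """Return {class_name: number_of_video_files} for each sub-folder of data_dir."""
    root = Path(data_dir)
    if not root.is_dir():
        return {}
    counts = {}
    for class_dir in sorted(root.iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir() if f.suffix.lower() in extensions
            )
    return counts

print(count_clips("data"))
```

With the command above, each of the three classes should report 16 clips.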
Training Script: train.py
Now that we have our dataset, let’s fine-tune a Vision Transformer with temporal pooling. This model applies ViT frame-by-frame, averages embeddings across time, and trains a classification head on your gestures.
Here’s the full training script:
# train.py
import torch, torch.nn as nn, torch.optim as optim
from torch.utils.data import DataLoader
import timm
from dataset import GestureClips, read_labels

class ViTTemporal(nn.Module):
    """Frame-wise ViT encoder -> mean pool over time -> linear head."""
    def __init__(self, num_classes, vit_name="vit_tiny_patch16_224"):
        super().__init__()
        self.vit = timm.create_model(vit_name, pretrained=True, num_classes=0, global_pool="avg")
        feat_dim = self.vit.num_features
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):  # x: (B,T,C,H,W)
        B, T, C, H, W = x.shape
        x = x.view(B * T, C, H, W)
        feats = self.vit(x)                  # (B*T, D)
        feats = feats.view(B, T, -1).mean(dim=1)  # (B, D)
        return self.head(feats)

def train():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    labels, _ = read_labels("labels.txt")
    n_classes = len(labels)

    train_ds = GestureClips(train=True)
    val_ds   = GestureClips(train=False)
    print(f"Train clips: {len(train_ds)} | Val clips: {len(val_ds)}")

    # Windows/CPU friendly
    train_dl = DataLoader(train_ds, batch_size=2, shuffle=True,  num_workers=0, pin_memory=False)
    val_dl   = DataLoader(val_ds,   batch_size=2, shuffle=False, num_workers=0, pin_memory=False)

    model = ViTTemporal(num_classes=n_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)

    best_acc = 0.0
    epochs = 5
    for epoch in range(1, epochs + 1):
        # ---- Train ----
        model.train()
        total, correct, loss_sum = 0, 0, 0.0
        for x, y in train_dl:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()

            loss_sum += loss.item() * x.size(0)
            correct += (logits.argmax(1) == y).sum().item()
            total += x.size(0)

        train_acc = correct / total if total else 0.0
        train_loss = loss_sum / total if total else 0.0

        # ---- Validate ----
        model.eval()
        vtotal, vcorrect = 0, 0
        with torch.no_grad():
            for x, y in val_dl:
                x, y = x.to(device), y.to(device)
                vcorrect += (model(x).argmax(1) == y).sum().item()
                vtotal += x.size(0)
        val_acc = vcorrect / vtotal if vtotal else 0.0

        print(f"Epoch {epoch:02d} | train_loss {train_loss:.4f} "
              f"| train_acc {train_acc:.3f} | val_acc {val_acc:.3f}")

        if val_acc > best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), "vit_temporal_best.pt")

    print("Best val acc:", best_acc)

if __name__ == "__main__":
    train()

Running the command python train.py initiates the training process for your gesture recognition model. Here's a breakdown of what happens:

Load your dataset from data/: The script will access and load the gesture dataset stored in the "data" directory. This dataset is used to train the model.

Fine-tune a pre-trained Vision Transformer: The training script will take a Vision Transformer model that has been pre-trained on a larger dataset and fine-tune it using your specific gesture dataset. Fine-tuning helps the model adapt to the nuances of your data, improving its performance on the specific task of gesture recognition.

Save the best checkpoint as vit_temporal_best.pt: After each epoch, the script evaluates the model on the validation set. Whenever the validation accuracy improves, the model weights are saved to a checkpoint file named "vit_temporal_best.pt". This file can later be used for inference or further training.


What Training Looks Like
You should see logs similar to this:
Train clips: 38 | Val clips: 10
Epoch 01 | train_loss 1.4508 | train_acc 0.395 | val_acc 0.200
Epoch 02 | train_loss 1.2466 | train_acc 0.263 | val_acc 0.200
Epoch 03 | train_loss 1.1361 | train_acc 0.368 | val_acc 0.200
Best val acc: 0.200

Don’t worry if your accuracy is low at first. With a tiny synthetic dataset and only a few epochs, that’s normal. The key is proving that the Transformer pipeline works. You can boost results later by:

Adding more clips per class

Training for more epochs

Switching to real recorded gestures



Figure 1. Example training logs from train.py, where the Vision Transformer with temporal pooling is fine-tuned on a tiny synthetic dataset.
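When reading logs like these, it helps to know what a model that guesses uniformly at random would score. The sketch below computes that baseline for the three gesture classes. The helper name chance_baseline is only for illustration.

```python
import math

def chance_baseline(num_classes: int) -> tuple:
    """Accuracy and cross-entropy loss of a uniform random guess over num_classes."""
    return 1.0 / num_classes, math.log(num_classes)

accuracy, loss = chance_baseline(3)
print(f"chance accuracy: {accuracy:.3f} | chance loss: {loss:.3f}")
```

Accuracy near one third and a loss near ln(3) mean the model has not yet learned anything useful, so those are the numbers to beat as you add data or epochs.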
Export the Model to ONNX
To make our model easier to run in real time (and lighter on CPU), we’ll export it to the ONNX format.
Note: ONNX, which stands for Open Neural Network Exchange, is an open-source format designed to facilitate the interchange of deep learning models between different frameworks. It lets you train a model in one framework, such as PyTorch or TensorFlow, and then deploy it in another, like Caffe2 or MXNet, without needing to completely rewrite the model. This interoperability is achieved by providing a standardized representation of the model's architecture and parameters.
ONNX supports a wide range of operators and is continually updated to include new features, making it a versatile choice for deploying machine learning models across various platforms and devices.
Create a file called export_onnx.py:
import torch
from train import ViTTemporal
from dataset import read_labels

labels, _ = read_labels("labels.txt")
n_classes = len(labels)

# Load trained model
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load("vit_temporal_best.pt", map_location="cpu"))
model.eval()

# Dummy input: batch=1, 16 frames, 3x224x224
dummy = torch.randn(1, 16, 3, 224, 224)

# Export
torch.onnx.export(
    model, dummy, "vit_temporal.onnx",
    input_names=["video"], output_names=["logits"],
    dynamic_axes={"video": {0: "batch"}},
    opset_version=13
)

print("Exported vit_temporal.onnx")

Run it with python export_onnx.py.
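This generates a file vit_temporal.onnx in your project folder. With the model in ONNX format, we can run it with onnxruntime, which is typically faster than running the PyTorch model directly for CPU inference.
Next, create a file called app.py for the demo app: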
import os, tempfile, cv2, torch, onnxruntime, numpy as np
import gradio as gr
from dataset import read_labels

T = 16
SIZE = 224
MODEL_PATH = "vit_temporal.onnx"

labels, _ = read_labels("labels.txt")

# --- ONNX session + auto-detect names ---
ort_session = onnxruntime.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
# detect first input and first output names to avoid mismatches
INPUT_NAME = ort_session.get_inputs()[0].name   # e.g. "input" or "video"
OUTPUT_NAME = ort_session.get_outputs()[0].name # e.g. "logits" or something else

def preprocess_clip(frames_rgb):
    if len(frames_rgb) == 0:
        frames_rgb = [np.zeros((SIZE, SIZE, 3), dtype=np.uint8)]
    if len(frames_rgb) < T:
        frames_rgb = frames_rgb + [frames_rgb[-1]] * (T - len(frames_rgb))
    frames_rgb = frames_rgb[:T]
    clip = [cv2.resize(f, (SIZE, SIZE), interpolation=cv2.INTER_AREA) for f in frames_rgb]
    clip = np.stack(clip, axis=0)                                    # (T,H,W,3)
    clip = np.transpose(clip, (0, 3, 1, 2)).astype(np.float32) / 255 # (T,3,H,W)
    clip = (clip - 0.5) / 0.5
    clip = np.expand_dims(clip, 0)                                   # (1,T,3,H,W)
    return clip

def _extract_path_from_gradio_video(inp):
    if isinstance(inp, str) and os.path.exists(inp):
        return inp
    if isinstance(inp, dict):
        for key in ("video", "name", "path", "filepath"):
            v = inp.get(key)
            if isinstance(v, str) and os.path.exists(v):
                return v
        for key in ("data", "video"):
            v = inp.get(key)
            if isinstance(v, (bytes, bytearray)):
                tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4")
                tmp.write(v); tmp.flush(); tmp.close()
                return tmp.name
    if isinstance(inp, (list, tuple)) and inp and isinstance(inp[0], str) and os.path.exists(inp[0]):
        return inp[0]
    return None

def _read_uniform_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) or 1
    idxs = np.linspace(0, total - 1, max(T, 1)).astype(int)
    want = set(int(i) for i in idxs.tolist())
    j = 0
    while True:
        ok, bgr = cap.read()
        if not ok: break
        if j in want:
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            frames.append(rgb)
        j += 1
    cap.release()
    return frames

def predict_from_video(gradio_video):
    video_path = _extract_path_from_gradio_video(gradio_video)
    if not video_path or not os.path.exists(video_path):
        return {}
    frames = _read_uniform_frames(video_path)

    # If OpenCV choked on the codec (common with recorded webm), re-encode once:
    if len(frames) == 0:
        tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4"); tmp_name = tmp.name; tmp.close()
        cap = cv2.VideoCapture(video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) or 640
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) or 480
        out = cv2.VideoWriter(tmp_name, fourcc, 20.0, (w, h))
        while True:
            ok, frame = cap.read()
            if not ok: break
            out.write(frame)
        cap.release(); out.release()
        frames = _read_uniform_frames(tmp_name)

    clip = preprocess_clip(frames)
    # >>> use the detected ONNX input/output names <<<
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[0]  # (1, C)
    probs = torch.softmax(torch.from_numpy(logits), dim=1)[0].numpy().tolist()
    return {labels[i]: float(probs[i]) for i in range(len(labels))}

def predict_from_image(image):
    if image is None:
        return {}
    clip = preprocess_clip([image] * T)
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[0]
    probs = torch.softmax(torch.from_numpy(logits), dim=1)[0].numpy().tolist()
    return {labels[i]: float(probs[i]) for i in range(len(labels))}

with gr.Blocks() as demo:
    gr.Markdown("# Gesture Classifier (ONNX)\nRecord or upload a short video, then click **Classify Video**.")
    with gr.Tab("Video (record or upload)"):
        vid_in = gr.Video(label="Record from webcam or upload a short clip")
        vid_out = gr.Label(num_top_classes=3, label="Prediction")
        gr.Button("Classify Video").click(fn=predict_from_video, inputs=vid_in, outputs=vid_out)
    with gr.Tab("Single Image (fallback)"):
        img_in = gr.Image(label="Upload an image frame", type="numpy")
        img_out = gr.Label(num_top_classes=3, label="Prediction")
        gr.Button("Classify Image").click(fn=predict_from_image, inputs=img_in, outputs=img_out)

if __name__ == "__main__":
    demo.launch()

Running the command python app.py launches a Gradio application in your web browser. Here's what happens:

Record or upload a clip: The Video tab lets you record a short clip from your webcam or upload an existing video file.

Predictions run on demand: When you click Classify Video, the app samples 16 frames evenly from the clip and runs them through the ONNX model. Predictions are not streamed continuously from the webcam.

Top 3 gesture classes displayed with probabilities: The application displays the top three predicted gesture classes along with their probabilities, giving you an idea of the model's confidence in its predictions.


When you open the app in your browser, you'll find two tabs. In the Video tab, you can record a short clip of your gesture from the webcam, typically lasting 2–4 seconds. After recording, click Classify Video. The app then runs the sampled frames through the Transformer model and displays the predicted gesture probabilities. The Single Image tab is a fallback that repeats one uploaded frame 16 times to form a clip. This setup allows for interactive testing and demonstration of the gesture recognition system.
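To see which frames the app actually looks at, here is a small standalone sketch of the sampling logic used by _read_uniform_frames and the padding step in preprocess_clip. The helper names uniform_frame_indices and pad_to_length are hypothetical and exist only for this illustration.

```python
from typing import List

import numpy as np

T = 16  # frames per clip, matching app.py


def uniform_frame_indices(total_frames: int, num_samples: int = T) -> List[int]:
    """Sorted, de-duplicated frame indices spread evenly across the video."""
    total = max(int(total_frames), 1)
    idxs = np.linspace(0, total - 1, max(num_samples, 1)).astype(int)
    return sorted(set(int(i) for i in idxs))


def pad_to_length(items: list, length: int = T) -> list:
    """Repeat the last item until the list has `length` entries, then truncate.

    An empty list is returned unchanged here; app.py substitutes a black frame.
    """
    if not items:
        return items
    return (items + [items[-1]] * max(length - len(items), 0))[:length]


long_clip = uniform_frame_indices(48)   # 16 distinct indices between 0 and 47
short_clip = uniform_frame_indices(10)  # duplicates collapse, so fewer than 16
print(len(long_clip), long_clip[0], long_clip[-1])
print(len(short_clip), len(pad_to_length(short_clip)))
```

For clips with fewer than 16 frames, the duplicate indices collapse and the last frame is repeated to fill the clip, so very short recordings end up with a frozen tail.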
Here’s an example where I raised my hand for a stop gesture, and the model predicts “stop” as the top class:

Figure 2. The Gradio app running locally. After recording a short clip, the Transformer model predicts the gesture with class probabilities.
Evaluate Accuracy + Latency
Now that the model runs in a demo app, let’s check how well it performs. There are two sides to this:

Accuracy: does the model predict the right gesture class?

Latency: how fast does it respond, especially on CPU vs GPU?


1. Quick Accuracy Check
Save this as eval.py in the same folder as your other scripts:
import torch
from dataset import GestureClips, read_labels
from train import ViTTemporal

labels, _ = read_labels("labels.txt")
n_classes = len(labels)

# Load validation data
val_ds = GestureClips(train=False)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=2, shuffle=False)

# Load trained model
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load("vit_temporal_best.pt", map_location="cpu"))
model.eval()

correct, total = 0, 0
all_preds, all_labels = [], []

with torch.no_grad():
    for x, y in val_dl:
        logits = model(x)
        preds = logits.argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.size(0)
        all_preds.extend(preds.tolist())
        all_labels.extend(y.tolist())

print(f"Validation accuracy: {correct/total:.2%}")

2. Confusion Matrix
Let’s also visualize which gestures are confused. Add this snippet at the bottom of eval.py:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

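# Pass every class index so the matrix matches the tick labels
# even if a class is missing from the validation set
cm = confusion_matrix(all_labels, all_preds, labels=list(range(n_classes)))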

plt.figure(figsize=(6,6))
sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.tight_layout()
plt.show()

When you run python eval.py, a heatmap like this will pop up:

Figure 3. Confusion matrix on the validation set. Correct predictions appear along the diagonal. Off-diagonal counts show gesture confusions.
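Because correct predictions sit on the diagonal, you can also turn the matrix into a per-class score. The sketch below computes recall per class (diagonal divided by row total, with rows as true classes, which is the scikit-learn convention). The per_class_recall helper is a hypothetical name and the matrix values are made up for illustration, not results from this tutorial.

```python
import numpy as np


def per_class_recall(cm: np.ndarray) -> np.ndarray:
    """Diagonal divided by row sums; rows are true classes, columns are predictions."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1)
    # Classes with no validation samples get NaN instead of a division error
    return np.divide(
        np.diag(cm),
        row_totals,
        out=np.full(len(cm), np.nan),
        where=row_totals > 0,
    )


# Made-up 3-class matrix for illustration
example_cm = np.array([[3, 1, 0],
                       [0, 2, 2],
                       [0, 0, 0]])
example_labels = ["swipe_left", "swipe_right", "stop"]
for name, recall in zip(example_labels, per_class_recall(example_cm)):
    print(f"{name:12s} recall: {recall:.2f}")
```

In eval.py you could call per_class_recall(cm) right after computing the confusion matrix to see which gestures need more training clips.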
3. Latency Benchmark
Finally, let’s see how fast inference runs. Save the following as benchmark.py:
import time, numpy as np, onnxruntime
from dataset import read_labels

labels, _ = read_labels("labels.txt")

ort = onnxruntime.InferenceSession("vit_temporal.onnx", providers=["CPUExecutionProvider"])
INPUT_NAME = ort.get_inputs()[0].name
OUTPUT_NAME = ort.get_outputs()[0].name

dummy = np.random.randn(1, 16, 3, 224, 224).astype(np.float32)

# Warmup
for _ in range(3):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})

# Benchmark
# perf_counter is a monotonic, high-resolution clock intended for timing
t0 = time.perf_counter()
for _ in range(50):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})
t1 = time.perf_counter()

print(f"Average latency: {(t1 - t0)/50:.3f} seconds per clip")

Run: python benchmark.py
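On CPU, you might see ~0.05–0.15s per clip, depending on your hardware. A GPU is usually much faster, but this script only uses the CPU: to benchmark on a GPU you would need the onnxruntime-gpu package and CUDAExecutionProvider in the providers list.
Note: If latency is too high, you can quantize the ONNX model (for example, to INT8 with ONNX Runtime's quantization tools) to shrink it and speed up inference.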
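A single average can hide occasional slow runs. As a rough sketch, the helper below reports mean, median, and 95th-percentile latency for any callable. The measure_latency name is hypothetical, and the workload here is a stand-in so the snippet runs without the model file.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(fn: Callable[[], object], warmup: int = 3, runs: int = 50) -> Dict[str, float]:
    """Time repeated calls to fn and summarise per-call latency in seconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    p95_index = min(len(samples) - 1, int(round(0.95 * (len(samples) - 1))))
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p95": samples[p95_index],
    }


# Stand-in workload; in benchmark.py this would be
# lambda: ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})
stats = measure_latency(lambda: sum(range(10_000)), warmup=2, runs=20)
print({k: f"{v * 1000:.3f} ms" for k, v in stats.items()})
```

For an interactive demo, the p95 value is often the more useful number, since it reflects the delay users notice on a bad run.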
Option 2: Use Small Samples from Public Gesture Datasets
If you’d prefer to see your model trained on real gesture clips instead of synthetic moving boxes, you can grab a handful of videos from open datasets. You don’t need to download the entire dataset (which can be several GB). Just a few .mp4 samples are enough to follow along.
Recommended sources

20BN Jester Dataset: Contains short clips of hand gestures like swiping, clapping, and pointing.

WLASL: A large-scale dataset of isolated sign language words.


Both projects provide small .mp4 videos you can use as realistic training examples. I’ve linked them below.
Setting up your dataset folder
Once you download a few clips, place them in the data/ folder under subfolders named after each gesture class. For example:
data/
├── swipe_left/
│   ├── clip1.mp4
│   └── clip2.mp4
├── swipe_right/
│   ├── clip1.mp4
│   └── clip2.mp4
└── stop/
    ├── clip1.mp4
    └── clip2.mp4

And update labels.txt to match the folder names:
swipe_left
swipe_right
stop

Now your dataset is ready, and the same training scripts from earlier (train.py, eval.py) will work without modification.
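Before training, it can help to confirm that the folders and labels.txt really agree. The following is a minimal sketch, assuming labels.txt holds one class name per line as shown above. The check_dataset helper is hypothetical, and the tutorial's own read_labels function may parse the file differently.

```python
from pathlib import Path
from typing import Dict, List


def check_dataset(data_dir: str, labels_path: str) -> Dict[str, int]:
    """Return {class_name: number_of_mp4_clips}; raise if folders and labels disagree."""
    labels: List[str] = [
        line.strip()
        for line in Path(labels_path).read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    folders = sorted(p.name for p in Path(data_dir).iterdir() if p.is_dir())
    if sorted(labels) != folders:
        raise ValueError(
            f"labels.txt lists {sorted(labels)} but {data_dir} contains {folders}"
        )
    # Count clips per class in the order given by labels.txt
    return {name: len(list((Path(data_dir) / name).glob("*.mp4"))) for name in labels}


# Example usage from the project folder:
# print(check_dataset("data", "labels.txt"))
```

A class with zero clips shows up as a count of 0, which is an easy way to spot an empty or misnamed folder.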
Why choose this option?

Gives more realistic results than synthetic coloured boxes

Lets you see how the model handles actual human hand movements

Requires a bit more effort (downloading clips and trimming them if needed)


Tip: If downloading from these datasets feels too heavy, you can also record your own short gestures using your laptop webcam. Just save them as .mp4 files and organize them in the same folder structure.
Accessibility Notes & Ethical Limits
While this project shows the technical workflow for gesture recognition with Transformers, it’s important to step back and consider the human context:

Accessibility first: Tools like this can help students with speech or motor difficulties, but they should always be co-designed with the people who will use them. Don’t assume one-size-fits-all.

Dataset sensitivity: Using publicly available sign or gesture datasets is fine for prototyping, but deploying such a system requires careful consideration of consent and representation.

Error tolerance: Even small misclassifications can have big consequences in accessibility contexts (for example, confusing stop with go). Always plan for fallback options (like manual input or confirmation).

Bias and inclusivity: Models trained on narrow datasets may fail for different skin tones, lighting conditions, or cultural gesture variations. Broad and diverse training data is essential for fairness.


In other words: this demo is a teaching scaffold, not a production-ready accessibility tool. Responsible deployment requires collaboration with educators, therapists, and end users.
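One way to build in the fallback options mentioned above is to act on a prediction only when the model is confident, ask for confirmation when it is unsure, and hand over to manual input otherwise. This is a sketch under stated assumptions: the decide_action helper is hypothetical, and the thresholds are illustrative values that would need tuning with real users. The input format matches the {label: probability} dictionaries returned by the app's predict functions.

```python
from typing import Dict, Tuple


def decide_action(
    probs: Dict[str, float],
    accept_at: float = 0.85,
    confirm_at: float = 0.50,
) -> Tuple[str, str]:
    """Map class probabilities to ('accept' | 'confirm' | 'manual', label)."""
    if not probs:
        return "manual", ""
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    if confidence >= accept_at:
        return "accept", label
    if confidence >= confirm_at:
        return "confirm", label  # ask the user "Did you mean <label>?"
    return "manual", ""          # fall back to manual input


print(decide_action({"stop": 0.92, "go": 0.08}))
print(decide_action({"stop": 0.55, "go": 0.45}))
print(decide_action({"stop": 0.40, "go": 0.35, "help": 0.25}))
```

Keeping the decision separate from the model makes it easy to adjust the thresholds for high-stakes gestures without retraining.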
Next Steps
If you’d like to push this project further, here are some directions to explore:

Better models: Try video-focused Transformers like TimeSformer or VideoMAE for stronger temporal reasoning.

Larger vocabularies: Add more gesture classes, build your own dataset, or use portions of public datasets like 20BN Jester or WLASL.

Pose fusion: Combine gesture video with human pose keypoints from MediaPipe or OpenPose for more robust predictions.

Real-time smoothing: Implement temporal smoothing or debounce logic in the app so predictions are more stable during live use.

Quantization + edge devices: Convert your ONNX model to an INT8 quantized version and deploy it on a Raspberry Pi or Jetson Nano for classroom-ready prototypes.
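As a starting point for the real-time smoothing idea, here is a minimal majority-vote sketch. It assumes you already have a stream of per-clip predictions. The MajoritySmoother class, window size, and vote threshold are all hypothetical choices for illustration.

```python
from collections import Counter, deque
from typing import Deque, Optional


class MajoritySmoother:
    """Report a label only once it wins enough of the most recent predictions."""

    def __init__(self, window: int = 5, min_votes: int = 3) -> None:
        self.history: Deque[str] = deque(maxlen=window)
        self.min_votes = min_votes
        self.current: Optional[str] = None

    def update(self, label: str) -> Optional[str]:
        """Add a raw prediction and return the current stable label (or None)."""
        self.history.append(label)
        winner, votes = Counter(self.history).most_common(1)[0]
        if votes >= self.min_votes:
            self.current = winner
        return self.current


smoother = MajoritySmoother(window=5, min_votes=3)
raw = ["stop", "stop", "go", "stop", "go", "go", "go"]
stable = [smoother.update(label) for label in raw]
print(stable)
```

The output label only changes after the new gesture has appeared several times in the window, which filters out single-frame flickers at the cost of a short delay.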


Conclusion
In this tutorial, you built a gesture recognition system around a Transformer model. You prepared a small dataset, trained a Vision Transformer with temporal pooling, exported the model to ONNX for efficient inference, and deployed it in a Gradio app that classifies recorded clips. You then measured accuracy and latency to see how well and how quickly the system responds.
This project illustrates how ML methods can support accessibility and communication, paving the way for more inclusive learning environments.
Remember: while this demo works with small datasets, real-world applications need larger, more diverse data and careful consideration of accessibility, inclusivity, and ethics.
Here’s the GitHub repo for full source code: transformer-gesture.
 


 How to Build a Multimodal Makaton-to-English Translator for Accessible Education 
OMOTAYO OMOYEMI — Thu, 18 Sep 2025 01:20:45 +0000
 A year nine student walks into class full of ideas, but when it is time to contribute, the tools around them do not listen. Their speech is difficult for standard voice systems to recognise, typing feels slow and exhausting, and the lesson moves on without their voice being heard. The challenge is not a lack of ability but a lack of access.
Across the world, millions of learners face communication barriers. Some live with apraxia of speech or dysarthria, others with limited mobility, hearing differences, or neurodiverse needs. When speaking, writing, or pointing is unreliable or tiring, participation becomes limited, feedback is lost, and confidence slowly erodes. This is not a rare exception but an everyday reality in classrooms.
These barriers appear in very practical ways. Students are skipped or misunderstood when they cannot respond quickly. Their ability is under-measured because their means of expression are constrained. Teachers struggle to maintain the pace of lessons while making individual accommodations. Peers interact less often, reducing opportunities for social belonging.
Assistive technologies have helped over the years, with tools like text-to-speech, symbol boards, and simple gesture inputs. Yet most of these tools are designed for a single mode of interaction. They assume the learner will either speak, or type, or tap. Real communication, however, is fluid. Learners naturally combine gestures, partial speech, symbols, and context to share meaning, especially when fatigue, anxiety, or motor challenges come into play.
This is where modern AI changes the picture. We are beginning to move beyond single-solution tools into multimodal systems that can understand speech, even when it is disordered, interpret gestures and visual symbols, combine signals to infer intent, and adapt in real time as the learner’s abilities develop or change.
AI is reshaping accessibility in education by shifting from isolated tools to multimodal and adaptive systems. These systems combine gesture, speech, and intelligent feedback to meet learners where they are, while also supporting their growth over time.
In this article, we will explore what this shift looks like in practice, how it can unlock participation, and how adaptive feedback personalises support. We will also build a hands-on multimodal demo that turns these ideas into a working classroom prototype.
Prerequisites

An Operating System: Windows, macOS, or Linux

Python installed (3.9 or later) – Along with pip for installing packages.

Editor: Visual Studio Code or any other integrated development environment (IDE)

Basics: Comfortable running commands in a terminal

Optional hardware: Microphone (speech input), Webcam (single-frame tab), speakers (TTS playback)

Internet: Required for the default SpeechRecognition (Google Web Speech API) and gTTS

No dataset/model needed: A stub gesture classifier is provided so the demo runs end-to-end


Table of Contents

Prerequisites

What We’ve Achieved So Far

Case Study 1: Translating Makaton to English

Case Study 2: AURA Prototype (Adaptive Speech Assistant)

The Bigger Picture: Multimodal Accessibility Tools

How to Build a Multimodal Makaton to English Translator (Gesture + Speech)

Project Overview

Challenges and Ethical Considerations

Where We’re Heading Next

Conclusion: Building an Inclusive Future with AI


What We’ve Achieved So Far
The past few years have shown how AI can make classrooms more inclusive when we focus on accessibility. Developers, educators, and researchers are already experimenting with tools that bridge communication gaps.
In my first freeCodeCamp tutorial, I built a gesture-to-text translator using MediaPipe. This project demonstrated how computer vision can track hand movements and convert them into text in real time. For learners who rely on gestures, this kind of system can provide a bridge to participation.
Here is a simplified example of how MediaPipe detects hand landmarks:
import mediapipe as mp
import cv2

# Initialize MediaPipe Hands
mp_hands = mp.solutions.hands
hands = mp_hands.Hands()

# Open the default webcam
cap = cv2.VideoCapture(0)

# Capture a single frame, then release the camera
ret, frame = cap.read()
cap.release()
if not ret:
    raise RuntimeError("Could not read a frame from the webcam")

# MediaPipe expects RGB input, while OpenCV returns BGR
results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
hands.close()

# Print the detected hand landmarks (None if no hand was found)
print("Hand landmarks:", results.multi_hand_landmarks)

This small piece of code shows how MediaPipe processes a video frame and extracts hand landmarks. From there, you can classify gestures and map them to text.
👉 You can explore the full project on GitHub or read the complete tutorial on freeCodeCamp.
In another freeCodeCamp article, I demonstrated how to build AI accessibility tools with Python, such as speech recognition and text-to-speech. These projects provided readers with a foundation for building their own inclusive tools, and you can find the full source code in the repository.
Beyond these individual projects, the wider field has also made significant progress. Advances in sign language recognition have improved accuracy in capturing complex hand shapes and movements. Text-to-speech systems have become more natural and adaptive, giving users voices that sound closer to human speech. Mobile and desktop accessibility apps have brought these capabilities into everyday classrooms.
These achievements are encouraging, but they remain limited. Most of today’s tools are still designed for a single mode of communication. A system may work for gestures, or for speech, or for text, but not all of them together.
The next step is clear: we need multimodal, adaptive AI tools that can blend gestures, speech, and feedback into unified systems. This is where the most exciting opportunities in accessibility lie, and it is where we will turn next.

Figure 1: Comparison of isolated single-modality systems with unified multimodal AI systems.
Case Study 1: Translating Makaton to English
One of my first projects in this area focused on translating Makaton into English.
Makaton is a language programme that uses signs and symbols to support people with speech and language difficulties. It is widely used in classrooms where learners may not rely fully on speech. The challenge is that while a learner communicates in Makaton, their teachers and peers often work in English, which creates a communication gap.
The AI Workflow
The system followed a clear pipeline:
Camera Input → Hand Landmark Detection → Gesture Classification → English Translation Output

Figure 2: AI workflow for translating Makaton gestures into English.

Camera Input: captures the learner’s Makaton sign.

Hand Landmark Detection: a vision library such as MediaPipe or OpenCV identifies the position of the fingers and hands.

Gesture Classification: a trained machine learning model classifies which Makaton sign was made.

English Translation Output: the system maps that gesture to its English word or phrase and displays it.


Example in Python
Here is a simplified version of how this workflow might look in code:
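import cv2

# Placeholders (not defined here): `camera` is an opened cv2.VideoCapture,
# `hands` is a MediaPipe Hands instance, and `gesture_model` is your trained classifier.

# Step 1: Capture input
ok, frame = camera.read()

# Step 2: Detect hand landmarks (MediaPipe expects RGB, OpenCV returns BGR)
results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
landmarks = results.multi_hand_landmarks

# Step 3: Classify gesture
gesture = gesture_model.predict(landmarks)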

# Step 4: Translate to English
translation_map = {
    "hello_sign": "Hello",
    "thank_you_sign": "Thank you"
}
text = translation_map.get(gesture, "Unknown sign")

print("Makaton sign:", gesture, " -> English:", text)

This is a simplified example, but it shows the core idea: map gestures to meaning and then bridge that meaning into English.
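The classification step is the part that usually needs a trained model. As a rough stand-in, the sketch below compares a hand pose against stored templates after normalising for position and scale. It is not the method used in the original project: landmarks_to_features, classify, and the random "templates" are hypothetical, and a real system would build templates from recorded examples. It relies on MediaPipe Hands returning 21 landmarks with the wrist at index 0.

```python
from typing import Dict, Optional

import numpy as np


def landmarks_to_features(landmarks) -> np.ndarray:
    """Make the wrist (landmark 0) the origin, scale-normalise, and flatten."""
    pts = np.asarray(landmarks, dtype=np.float32)
    pts = pts - pts[0]
    scale = np.linalg.norm(pts, axis=1).max()
    return (pts / scale if scale > 0 else pts).ravel()


def classify(landmarks, templates: Dict[str, np.ndarray], max_distance: float = 0.5) -> Optional[str]:
    """Return the nearest template label, or None if nothing is close enough."""
    feats = landmarks_to_features(landmarks)
    best_label, best_dist = None, float("inf")
    for label, template in templates.items():
        dist = float(np.linalg.norm(feats - template))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= max_distance else None


# Random poses stand in for recorded examples of each sign (21 landmarks x 3 coords)
rng = np.random.default_rng(0)
hello_example = rng.random((21, 3))
thanks_example = rng.random((21, 3))
templates = {
    "hello_sign": landmarks_to_features(hello_example),
    "thank_you_sign": landmarks_to_features(thanks_example),
}

# The same pose, shifted and scaled, should still match its template
print(classify(hello_example * 2.0 + 0.3, templates))
```

Returning None for poses that are far from every template gives the translation map a natural "Unknown sign" path instead of forcing a guess.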
Why This Matters
Imagine a student signing thank you in Makaton and the system instantly displaying the words on screen. Teachers can check understanding, peers can respond naturally, and the learner’s contribution becomes visible to everyone.
The key takeaway is that AI can bridge symbol- and gesture-based languages with mainstream spoken and written communication. Instead of forcing learners to adapt to rigid systems, we can design systems that adapt to the way they already communicate.
Case Study 2: AURA Prototype (Adaptive Speech Assistant)
Another project I worked on is called AURA, the Apraxia of Speech Adaptive Understanding and Relearning Assistant. The idea was to design a system that not only recognises speech but also supports learners with speech disorders by detecting errors, adapting feedback, and offering multimodal alternatives.
The Challenge
Most commercial speech recognition systems fail when a person’s speech does not follow typical patterns. This is especially true for people with apraxia of speech, where motor planning difficulties make pronunciation inconsistent. The result is frequent misrecognition, frustration, and exclusion from tools that rely on voice input.
The AI Workflow
The AURA prototype used a layered architecture:
Speech Input → Wav2Vec2 (fine-tuned for disordered speech) → CNN + BiLSTM Error Detection → Reinforcement Learning Feedback → Multimodal Output (Speech + Gesture)

Figure 3: Workflow of the AURA prototype, combining speech, error detection, adaptive feedback, and multimodal outputs.

Wav2Vec2 Speech Recognition: fine-tuned on disordered speech to improve transcription accuracy.

CNN + BiLSTM Model: classifies articulation or phonological errors in real time.

Reinforcement Learning Engine: adapts feedback loops so therapy suggestions improve as the learner progresses.

Gesture-to-Speech Multimodal Input: when speech is too difficult, MediaPipe gestures can be used to trigger spoken outputs.

Streamlit Interface: integrates everything into a single accessible app for testing.


Here’s a simplified view of how an error detection module could be structured:
# Example: Error classification using CNN + BiLSTM
import torch
import torch.nn as nn

# Define the ErrorClassifier model
class ErrorClassifier(nn.Module):
    def __init__(self):
        super(ErrorClassifier, self).__init__()
        self.cnn = nn.Conv1d(in_channels=40, out_channels=64, kernel_size=3)
        self.lstm = nn.LSTM(64, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(256, 3)  # Output classes: e.g. correct, substitution, omission

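    def forward(self, x):
        # x: (batch, 40, time), i.e. 40 features per frame (for example MFCCs; an assumption)
        x = self.cnn(x)          # (batch, 64, time - 2)
        x = x.transpose(1, 2)    # (batch, time - 2, 64), as the batch_first LSTM expects
        x, _ = self.lstm(x)      # (batch, time - 2, 256)
        return self.fc(x[:, -1, :])  # logits from the last time step: (batch, 3)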

# Instantiate the model
model = ErrorClassifier()

This snippet shows the heart of the error detection pipeline: combining CNN layers for feature extraction with BiLSTMs for sequence modeling. The model can flag articulation errors, which then guide the feedback loop.
Why This Matters
With AURA, the goal was not just to recognise what someone said, but to help them communicate more effectively. The prototype adapted in real time, offering corrective feedback, suggesting gestures, or switching modes when speech became difficult.
The takeaway is that AI can evolve from being a passive recognition tool into an active partner in learning and communication.
The Bigger Picture: Multimodal Accessibility Tools
The two projects we explored, translating Makaton into English and building the AURA prototype, highlight a much larger transformation underway. Accessibility technology is moving away from isolated, single-purpose applications toward multimodal platforms that bring together speech, gestures, text, and adaptive AI into one seamless system.
Why This Shift Matters
The benefits of this shift are profound:

Greater inclusivity in classrooms: learners who rely on different modes of communication can participate equally.

Real-time support: systems that detect errors or adapt to gestures give learners immediate feedback rather than delayed corrections.

Lower frustration: multimodal options mean if one channel breaks down (for example, speech), others like gesture or text can take over smoothly.

Confidence and independence: learners express themselves more fully, without depending heavily on support staff or interpreters.


Beyond the Classroom
The impact of multimodal accessibility extends across many sectors:

In healthcare, patients with communication difficulties can use multimodal AI assistants to express needs clearly, reducing misdiagnosis and stress.

In the workplace, employees with speech or motor impairments can collaborate effectively using adaptive AI tools.

In community settings, individuals can participate more freely in conversations, services, and digital platforms, strengthening social inclusion.


Visualising the Shift

How to Build a Multimodal Makaton to English Translator (Gesture + Speech)
This demo combines both use cases: a Makaton to English classroom tool and the AURA assistive speech path. It prioritizes gesture when a sign is detected, falls back to speech when it isn’t, and produces a unified English output (with optional text-to-speech). We’ll focus on the translation layer, multimodal fusion, and a simple Streamlit UI.
Project structure
makaton_multimodal_demo/
├─ .streamlit/
│   └─ config.toml 
├─ assets/
│   └─ README.txt 
├─ tests/
│   └─ test_fuse.py 
└─ streamlit_app.py

The structure above shows how the project directory for the multimodal Makaton to English translator demo is organized. Here's a brief explanation of each component:

makaton_multimodal_demo/: This is the root directory of the project.

.streamlit/: This directory contains configuration files for Streamlit, which is a framework used to build web apps in Python. The config.toml file is optional and can be used to customize the Streamlit app's settings.

assets/: This directory is intended to store models or other necessary files for the project. The README.txt serves as a placeholder to indicate where these files should be placed.

tests/: This directory is for test scripts. The test_fuse.py file holds tests for the fusion function (fuse), which decides whether the gesture or the speech input becomes the final English output.

streamlit_app.py: This is the main application file where the Streamlit app is implemented. It contains the code that runs the app, handling the user interface and the logic for translating Makaton gestures and speech into English.
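The contents of test_fuse.py are not shown in this article, so the following is only a sketch of what such tests could cover. It copies the fuse function and MAKATON_DICT from streamlit_app.py (defined later in this tutorial) rather than importing them, because importing the app module would also run the Streamlit UI code. The test names are hypothetical.

```python
from typing import Optional

# Copied from streamlit_app.py so the tests do not import the Streamlit UI
MAKATON_DICT = {
    "hello_sign": "Hello",
    "thank_you_sign": "Thank you",
    "help_sign": "Help",
    "toilet_sign": "Toilet",
    "stop_sign": "Stop",
}


def fuse(gesture_label: Optional[str], speech_text: Optional[str]) -> str:
    if gesture_label and gesture_label in MAKATON_DICT:
        return MAKATON_DICT[gesture_label]
    if speech_text:
        return speech_text
    return "No input detected"


def test_gesture_takes_priority():
    # A known sign wins even when speech is also present
    assert fuse("hello_sign", "good morning") == "Hello"


def test_unknown_gesture_falls_back_to_speech():
    assert fuse("wave_sign", "good morning") == "good morning"


def test_no_input():
    assert fuse(None, None) == "No input detected"
    assert fuse("", "") == "No input detected"
```

With pytest installed, running pytest tests/ from the project root would pick up these functions automatically.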


Install & run
# (optional) create and activate a virtualenv
python -m venv .venv

# Windows
.\.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate

The commands above create and activate a Python virtual environment: an isolated directory with its own Python interpreter and installed packages, so this project's dependencies don't interfere with other projects on your machine.

python -m venv .venv: This command creates a new virtual environment in a directory named .venv. The venv module is used to create lightweight virtual environments.

.\.venv\Scripts\activate (Windows): This command activates the virtual environment on Windows. Once activated, the environment's Python interpreter and installed packages will be used.

source .venv/bin/activate (macOS/Linux): This command activates the virtual environment on macOS or Linux. Similar to Windows, activating the environment ensures that the specific Python interpreter and packages within the environment are used.


Install dependencies
pip install streamlit opencv-python mediapipe SpeechRecognition gTTS pydub numpy

The command above is used to install multiple Python packages at once. Here's what each package does:

streamlit: A framework for building interactive web applications in Python, often used for data science and machine learning projects.

opencv-python: Provides OpenCV, a library for computer vision tasks such as image processing and video analysis.

mediapipe: A library developed by Google for building cross-platform, customizable machine learning solutions for live and streaming media, including hand and face detection.

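SpeechRecognition: A library for performing speech recognition, allowing Python to recognize and process human speech. Note that microphone input (sr.Microphone, which this demo uses) also requires PyAudio, which you can install with pip install pyaudio.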

gTTS: Google Text-to-Speech, a library and CLI tool to interface with Google Translate's text-to-speech API, enabling text-to-speech conversion.

pydub: A library for audio processing, allowing manipulation of audio files, such as converting between different audio formats.

numpy: A fundamental package for scientific computing in Python, providing support for arrays and matrices, along with a collection of mathematical functions.


Create streamlit_app.py
# streamlit_app.py
from io import BytesIO
from typing import Optional
import streamlit as st

# Optional deps (kept optional so readers can still run the core demo)
try:
    import cv2
    import mediapipe as mp
    MP_OK = True
except Exception:
    MP_OK = False

try:
    import speech_recognition as sr
    SR_OK = True
except Exception:
    SR_OK = False

try:
    from gtts import gTTS
    GTTS_OK = True
except Exception:
    GTTS_OK = False

# --- 1) Minimal Makaton dictionary (extend as needed)
MAKATON_DICT = {
    "hello_sign": "Hello",
    "thank_you_sign": "Thank you",
    "help_sign": "Help",
    "toilet_sign": "Toilet",
    "stop_sign": "Stop",
}

# --- 2) Gesture classifier (stub for the demo)
def classify_gesture(landmarks) -> Optional[str]:
    """
    Return a canonical label like 'hello_sign' or None if unknown.
    Replace this stub with your trained model + confidence threshold.
    """
    return "hello_sign" if landmarks else None

# --- 3) Speech recognizer (fallback path)
def transcribe_speech(seconds: int = 3) -> Optional[str]:
    if not SR_OK:
        return None
    r = sr.Recognizer()
    try:
        with sr.Microphone() as source:
            st.info("Listening...")
            audio = r.listen(source, phrase_time_limit=seconds)
        return r.recognize_google(audio)
    except Exception as e:
        st.warning(f"Speech recognition error: {e}")
        return None

# --- 4) Fusion logic (gesture first, speech fallback)
def fuse(gesture_label: Optional[str], speech_text: Optional[str]) -> str:
    if gesture_label and gesture_label in MAKATON_DICT:
        return MAKATON_DICT[gesture_label]
    if speech_text:
        return speech_text
    return "No input detected"

# --- 5) Optional: extract single-frame hand landmarks using MediaPipe
def extract_hand_landmarks_from_image(image_bytes: bytes):
    if not MP_OK:
        return None
    try:
        import numpy as np
        np_arr = np.frombuffer(image_bytes, dtype=np.uint8)
        img = cv2.imdecode(np_arr, cv2.IMREAD_COLOR)
        if img is None:
            return None

        mp_hands = mp.solutions.hands
        with mp_hands.Hands(static_image_mode=True, max_num_hands=1, min_detection_confidence=0.5) as hands:
            img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            result = hands.process(img_rgb)

        if not result.multi_hand_landmarks:
            return None

        hand_landmarks = result.multi_hand_landmarks[0]
        return [(lm.x, lm.y, lm.z) for lm in hand_landmarks.landmark]
    except Exception:
        return None

# --- 6) Streamlit UI
st.set_page_config(page_title="Makaton → English (Multimodal Demo)")
st.title("Makaton → English (Multimodal Demo)")
st.caption("Combines a classroom Makaton translator with an assistive speech path (AURA-style).")

with st.expander("What this demo shows"):
    st.write(
        "- **Translation layer:** small Makaton dictionary you can extend.\n"
        "- **Multimodal fusion:** gesture prioritized, speech as fallback.\n"
        "- **UI:** one page, clear output, optional text-to-speech."
    )

tabs = st.tabs(["Simulated Sign", "Single-Frame Webcam (Optional)", "About"])

# Tab 1: Simulated (no CV model required)
with tabs[0]:
    st.subheader("Simulated Gesture + Speech")
    col1, col2 = st.columns(2)

    with col1:
        simulate = st.selectbox(
            "Pick a sign",
            ["", "hello_sign", "thank_you_sign", "help_sign", "toilet_sign", "stop_sign"],
            index=0
        )
        gesture_label = simulate or None

    with col2:
        speech_text = st.session_state.get("speech_text")
        st.write("Current speech:", speech_text or "None")
        if st.button("Transcribe 3s"):
            if SR_OK:
                speech_text = transcribe_speech(3)
                st.session_state["speech_text"] = speech_text
            else:
                st.warning("SpeechRecognition not installed.")

    output = fuse(gesture_label, st.session_state.get("speech_text"))
    st.markdown(f"### Output: **{output}**")

    if output and output != "No input detected":
        if st.button("Speak output"):
            if GTTS_OK:
                mp3 = BytesIO()
                try:
                    gTTS(output).write_to_fp(mp3)
                    st.audio(mp3.getvalue(), format="audio/mp3")
                except Exception as e:
                    st.warning(f"TTS failed: {e}")
            else:
                st.warning("gTTS not installed.")

# Tab 2: Optional single-frame webcam capture
with tabs[1]:
    st.subheader("Single-Frame Hand Detection (Webcam)")
    if not MP_OK:
        st.warning("Install MediaPipe + OpenCV to enable this tab.")
    else:
        img = st.camera_input("Capture a frame")
        captured_label = None
        if img is not None:
            landmarks = extract_hand_landmarks_from_image(img.getvalue())
            if landmarks:
                captured_label = classify_gesture(landmarks)
                st.success("Hand detected.")
            else:
                st.info("No hand landmarks found. Try better lighting/framing.")

        if st.button("Transcribe 3s (webcam tab)"):
            st.session_state["speech_text2"] = transcribe_speech(3) if SR_OK else None

        speech_text2 = st.session_state.get("speech_text2")
        st.write("Current speech:", speech_text2 or "None")

        output2 = fuse(captured_label, speech_text2)
        st.markdown(f"### Output: **{output2}**")

        if output2 and output2 != "No input detected":
            if st.button("Speak output (webcam tab)"):
                if GTTS_OK:
                    mp3 = BytesIO()
                    try:
                        gTTS(output2).write_to_fp(mp3)
                        st.audio(mp3.getvalue(), format="audio/mp3")
                    except Exception as e:
                        st.warning(f"TTS failed: {e}")
                else:
                    st.warning("gTTS not installed.")

The code above creates a Streamlit application that combines gesture recognition and speech recognition to translate Makaton signs into English. Here's a brief explanation of how it works:

Dependencies and Setup: The code attempts to import optional dependencies like OpenCV, MediaPipe, SpeechRecognition, and gTTS. These are used for gesture detection, speech recognition, and text-to-speech functionalities.

Makaton Dictionary: A minimal dictionary that maps Makaton signs to English words. This can be extended to include more signs.

Gesture Classifier: A placeholder function (classify_gesture) is used to classify hand gestures. In a real application, this would be replaced with a trained model.

Speech Recognizer: The transcribe_speech function uses the SpeechRecognition library to convert spoken words into text, serving as a fallback when gestures are not detected.

Fusion Logic: The fuse function prioritizes gesture recognition over speech. If a gesture is recognized, it translates it using the dictionary; otherwise, it uses the transcribed speech.

Hand Landmark Extraction: The code includes a function to extract hand landmarks from an image using MediaPipe, which is used for gesture classification.

Streamlit UI: The user interface is built with Streamlit, featuring tabs for simulated gestures, webcam-based gesture detection, and additional information. Users can simulate gestures, capture gestures via webcam, and use speech input. The output is displayed and can be converted to speech using gTTS.


This application demonstrates a multimodal approach by integrating both gesture and speech recognition to facilitate communication for users who rely on Makaton.
Run
streamlit run .\streamlit_app.py

The command above is used to launch a Streamlit application. When executed, it starts a local web server and opens the specified Python script in a web browser, allowing you to interact with the app's user interface. This command is typically run in a terminal or command prompt.

Figure — App interface: the Simulated Sign tab before any input.

Figure — Selecting hello_sign produces “Output: Hello”.
Project Overview
You have developed a multimodal translator that integrates both gesture recognition (specifically Makaton signs) and speech recognition to produce a unified English output. The system is designed to prioritize gesture input, using speech as a fallback when gestures are not detected.
User Interface
The application is built using Streamlit, featuring two main tabs (the third "About" tab declared in the code is left as a placeholder):

Simulated Sign Tab: Allows users to simulate gestures without requiring computer vision (CV) capabilities.

Webcam Single Frame Tab: Optionally uses a webcam to capture and process a single frame for gesture detection.


Use Case Integration

Makaton to English Translation: In a classroom setting, detected Makaton signs are translated into short English phrases, facilitating communication.

AURA-style Assistive Path: If no gesture is detected, the system relies on speech input to generate an output, ensuring continuous communication support.


Design Limitations

The gesture classifier is currently a placeholder and should be replaced with a trained model that includes a confidence threshold for better accuracy.

The Makaton dictionary is minimal and can be expanded to include more phrases and templates.

The speech recognition component uses a basic recognizer. For improved robustness, consider using advanced models like Wav2Vec2 or offline automatic speech recognition (ASR) systems.
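To make the first limitation concrete, here is a minimal sketch of how a confidence threshold could sit on top of a trained classifier. The `pick_label` helper, the 0.6 cutoff, and the example scores are all hypothetical; they are not part of the original app, and a real model would supply the per-label scores.

```python
from typing import Dict, Optional


def pick_label(scores: Dict[str, float], threshold: float = 0.6) -> Optional[str]:
    """Return the highest-scoring label, or None when the model is unsure."""
    if not scores:
        return None
    label, confidence = max(scores.items(), key=lambda item: item[1])
    return label if confidence >= threshold else None


# Made-up scores of the kind a trained gesture model might produce
confident = {"hello_sign": 0.91, "stop_sign": 0.06, "help_sign": 0.03}
unsure = {"hello_sign": 0.41, "stop_sign": 0.38, "help_sign": 0.21}

print(pick_label(confident))  # hello_sign
print(pick_label(unsure))     # None
```

Returning None for low-confidence predictions lets the existing fuse function fall through to the speech path instead of showing a wrong sign.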


Suggested Extensions

Implement a confidence threshold to display both gesture and speech inputs when the system is uncertain.

Expand the dictionary to support slot templates, such as "I want [item]".

Introduce a toggle to switch between speech-first and gesture-first input priorities.

Enable logging of outputs for teachers and provide an option to export these logs as CSV files.

Consider replacing gTTS with an offline text-to-speech solution for better reliability.
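Two of these extensions, the priority toggle and the CSV export, can be sketched in a few lines. This is one possible shape under stated assumptions: `fuse_with_priority`, `log_to_csv`, and the column names are hypothetical and do not appear in the original code, and the dictionary below is a subset of the app's MAKATON_DICT.

```python
import csv
import io
from typing import List, Optional, Tuple

MAKATON_DICT = {"hello_sign": "Hello", "help_sign": "Help"}  # subset for the demo


def fuse_with_priority(
    gesture_label: Optional[str],
    speech_text: Optional[str],
    priority: str = "gesture",
) -> str:
    """Return the preferred input's text, falling back to the other input."""
    gesture_text = MAKATON_DICT.get(gesture_label) if gesture_label else None
    if priority == "gesture":
        ordered = [gesture_text, speech_text]
    else:
        ordered = [speech_text, gesture_text]
    for candidate in ordered:
        if candidate:
            return candidate
    return "No input detected"


def log_to_csv(rows: List[Tuple[str, str, str]]) -> str:
    """Serialize (timestamp, priority, output) rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["timestamp", "priority", "output"])
    writer.writerows(rows)
    return buffer.getvalue()


print(fuse_with_priority("hello_sign", "good morning", priority="gesture"))  # Hello
print(fuse_with_priority("hello_sign", "good morning", priority="speech"))   # good morning
```

In the Streamlit app, the priority value could come from a widget such as st.radio, and the CSV text could be passed to st.download_button so teachers can save the session log.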


Troubleshooting Tips

If you encounter microphone errors, ensure that pyaudio is installed. On Windows, use pip install pipwin followed by pipwin install pyaudio.

If the webcam is not detected, check your browser permissions. The Simulated Sign tab can still be used without a webcam.

If there are issues with package imports, verify that they are installed in your active virtual environment.
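For the last tip, a short script can report which interpreter is running and which of the app's imports it cannot find. This is a minimal sketch; the `missing_modules` helper is a hypothetical name, and the list uses the import names from streamlit_app.py (which differ from some pip package names, for example cv2 versus opencv-python).

```python
import importlib.util
import sys

# Import names used by streamlit_app.py
MODULES = ["streamlit", "cv2", "mediapipe", "speech_recognition", "gtts"]


def missing_modules(names):
    """Return the module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The path printed here should point inside your virtual environment
print("Interpreter:", sys.executable)
print("Missing:", missing_modules(MODULES) or "none")
```

If the interpreter path points outside your virtual environment, activate the environment and reinstall the packages there.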


The link to the full code: Multimodal_Makaton
Challenges and Ethical Considerations
While the promise of multimodal accessibility tools is exciting, building them responsibly requires us to confront several challenges. These are not only technical problems but also ethical ones that affect how learners, teachers, and communities experience AI.
Data Scarcity
Training AI systems requires large, diverse datasets. But when it comes to disordered speech or symbol systems like Makaton, the data is limited. Without enough examples, models risk being inaccurate or biased toward a narrow group of users. Collecting more data is essential, but it must be done ethically, with consent and respect for the communities involved.
Fairness and Inclusion
AI systems often work better for some groups than others. A model trained mostly on fluent English speakers may fail to recognise learners with strong accents or speech difficulties. Similarly, gesture recognition may not account for differences in motor ability. Fairness means designing models that work across abilities, accents, and cultures, so that no group is excluded by design.
Privacy and Security
Speech and video data are highly sensitive, especially when collected in schools. Protecting this data is not optional; it is a requirement. Systems must anonymize or encrypt recordings and store them securely. Transparency is also key: learners, parents, and teachers should know exactly how data is being used and who has access to it.
Accessibility of the Tools Themselves
Ironically, many “accessibility tools” remain inaccessible because they are expensive, require powerful hardware, or are too complex to use. For AI to truly reduce barriers, solutions must be affordable, lightweight, and easy for teachers to set up in real classrooms, not just in research labs.
Takeaway
These challenges remind us that accessibility in AI is not only a technical question but also an ethical and social responsibility. To build tools that genuinely help learners, we need collaboration between developers, educators, policymakers, and the communities who will use the systems.
Where We’re Heading Next
The future of AI accessibility tools is speculative, but the possibilities are both exciting and necessary. What we have now are prototypes and early systems. What lies ahead are tools that could reshape how classrooms and society more broadly approach communication and inclusion.
Multilingual Makaton Translation
One promising direction is the ability to translate Makaton across multiple languages. A learner in the UK could sign in Makaton and see their contribution appear not just in English but in French, Spanish, or Yoruba. This would open up international classrooms and give learners access to global opportunities that are often closed off by language barriers.
AI Tutors with Dynamic Adaptation
Imagine a classroom assistant powered by AI that adapts in real time. If a learner struggles with speech, it could switch to gesture recognition. If gestures become tiring, it could prompt the learner with symbol-based options. These AI tutors would not only support communication but also guide learning, adapting to each student’s strengths and challenges over time.
Wearable Multimodal Devices
The rise of lightweight hardware makes it possible to imagine wearable AI assistants that provide instant translation and support. Glasses could capture gestures and overlay text, while earbuds could translate disordered speech into clear audio for peers and teachers. Instead of bulky setups, accessibility would become portable, personal, and ever-present.
A Broader Impact
These innovations go beyond technology alone. They align with the United Nations Sustainable Development Goals (SDGs), especially:

Quality Education (Goal 4): ensuring that every learner, regardless of ability, has equal access to education.

Reduced Inequalities (Goal 10): breaking down barriers so that disability or difference is not a cause of exclusion.


The journey from single-modality tools to multimodal, adaptive systems is still in its early stages. But if we continue to push forward with creativity, ethics, and inclusivity at the center, AI accessibility tools will not only change classrooms; they will change lives.
Conclusion: Building an Inclusive Future with AI
AI accessibility tools are no longer just optional add-ons for a few learners. They are becoming core enablers of inclusion in education, healthcare, workplaces, and daily life.
The journey from early gesture recognition systems to multimodal, adaptive prototypes like Makaton translation and AURA shows what is possible when technology is designed around people rather than forcing people to adapt to technology. These innovations break down communication barriers and open up new opportunities for learners who have too often been left on the margins.
But the future of accessibility is not automatic. It depends on choices we make now as developers, educators, researchers, and policymakers. Building tools that are open, ethical, and affordable requires collaboration and commitment.
The vision is clear: a world where every learner, regardless of ability, can express themselves fully, be understood by others, and participate with confidence.
The future of education is inclusive, and with thoughtful design, AI can help us get there.
 


 How to Design Accessible Browser Extensions 
Ophy Boamah — Wed, 10 Sep 2025 12:07:03 +0000
 Building a browser extension is easy, but ensuring that it’s accessible to everyone takes deliberate care and skill.
Your extension might fetch data flawlessly and have a beautiful interface, but if screen reader users or keyboard navigators can’t use it, you’ve unintentionally excluded many potential users.
In this article, we will audit a Chrome browser extension for accessibility issues and transform it into an inclusive experience that works for everyone.
Table of Contents

Why Accessibility Matters in Browser Extensions

How to Perform Manual Browser Extension Accessibility Tests

How to Implement Browser Extension Accessibility Improvements

How to Perform Automated Browser Extension Accessibility Tests

Best Practices for Accessible Browser Extensions

Conclusion


Why Accessibility Matters in Browser Extensions
Every click in your browser extension is an opportunity to empower users or exclude them if accessibility isn’t part of your design.
Browser extensions face unique accessibility challenges: they must inject functionality into existing web pages while keeping their own interfaces accessible, a dual responsibility that can introduce barriers. For example, a popup that traps keyboard users or fails to communicate with screen readers can render an extension unusable.
With over one billion people living with disabilities, according to the World Health Organization, accessible design unlocks a vast user base and creates better experiences for everyone.

For browser extensions, accessibility barriers commonly emerge as:

Keyboard navigation dead-ends: Popups and interfaces that trap or exclude keyboard users.

Silent interactions: Missing labels and descriptions, like a button with only an icon announced as “unlabelled button” by screen readers, leaving users guessing about its purpose.

Unannounced dynamic content updates: Content changes that assistive technology is never told about, such as a quote updating without notifying screen readers, or loading states and errors that give no feedback.

Context integration conflicts: Extensions that modify existing web pages can inadvertently break the page's accessibility features or introduce elements that clash with established navigation patterns.


By understanding these barriers, developers can take targeted steps to test and improve their extensions’ accessibility.
How to Perform Manual Browser Extension Accessibility Tests
While automated tools catch obvious issues, manual testing reveals the real user experience. Here's how to systematically evaluate your extension's accessibility.

💡
You can use any unpublished browser extension to follow along. For this test, we’ll be using the browser extension built in this article, which uses this Advice generator app design.


Keyboard Navigation Test
Disconnect your mouse and try to use your extension completely with the keyboard only. Navigate using Tab to move between elements, Enter or Space to activate buttons, and arrow keys within components. 

Is it always clear which element has focus?

Can you activate buttons with Enter or Space as expected?

Can users exit modal dialogs or dropdown menus?


If you encounter any dead-ends or confusion points, keyboard users will face the same barriers.

Screen Reader Evaluation
Use your operating system's built-in screen reader to navigate your extension and listen to what is announced. On macOS, enable VoiceOver; on Windows, use Narrator; on Linux, try Orca. 

Does each element’s purpose come through clearly, such as a button announced as “Generate new advice” rather than just “button”?

Are headings, lists, and other structures properly conveyed?

Do users understand when content is loading, selected, or has changed?


This testing phase often reveals the gap between what you intended to communicate and what actually reaches users.
Visual Accessibility Review
Examine your extension in different visual contexts. Use a tool such as WebAIM’s Contrast Checker to verify that normal-size text meets WCAG’s minimum 4.5:1 contrast ratio (Level AA). Test how your extension appears in system high-contrast settings. Ensure:

Functionality remains usable at 200% zoom.

Information isn’t conveyed through colour alone, such as using text labels alongside colour-coded indicators.
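The contrast check above can also be scripted. The sketch below implements the WCAG 2.x relative luminance and contrast ratio formulas in Python; the function names and the hex-string input format are choices made for this example, not part of any tool mentioned in the article.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear-light value."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a colour written as '#rrggbb'."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colours, from 1.0 up to 21.0."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, minimum: float = 4.5) -> bool:
    """True when the pair meets the given minimum ratio (4.5 for normal text)."""
    return contrast_ratio(foreground, background) >= minimum


print(round(contrast_ratio("#000000", "#ffffff"), 2))  # 21.0
print(passes_aa("#767676", "#ffffff"))  # True
print(passes_aa("#777777", "#ffffff"))  # False
```

A script like this is handy for checking a whole colour palette at once, but it does not replace viewing the extension in high-contrast mode and at 200% zoom.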


These manual tests will uncover critical accessibility issues, paving the way for targeted improvements to make your extension inclusive.
How to Implement Browser Extension Accessibility Improvements
Imagine a page refreshing without your knowing it happened, or clicking a button with no clear purpose. The manual tests above revealed that this is the experience for screen reader users of our extension, because of three key accessibility issues:

Missing button label: The dice button only has an image with alt text “Dice icon,” which lacks the context screen readers need.

Silent dynamic updates: When new advice loads, screen readers don't know the content has changed.

No loading states: When fetching advice, users receive no feedback that something is happening.


Let's address the issues before conducting automated tests.
How to Address Missing Button Label and Alt text
We’ll add aria-label to clearly explain the button's purpose and provide descriptive alt text for the icon. The role="presentation" attribute ensures the image is treated as decorative by screen readers.

<!-- Before: the icon's alt text is the button's only accessible name -->
<button class="dice-button" id="generate-advice-btn">
    <img src="/icons/icon-dice.png" alt="Dice icon">
</button>


<!-- After: aria-label states the button's purpose -->
<button class="dice-button" id="generate-advice-btn" aria-label="Generate new advice">
    <img src="/icons/icon-dice.png" alt="A dice icon with green background" role="presentation">
</button>

How to Address Silent Dynamic Updates
We’ll add aria-live="polite" for screen readers to announce new advice and aria-atomic="true" to ensure that the entire quote is read. That is:

<!-- Before: updates to the quote are not announced -->
<p class="advice-quote" id="advice-quote">
    "It is easy to sit up and take notice, what's difficult is getting up and taking action."
</p>


<!-- After: the whole quote is announced politely when it changes -->
<p class="advice-quote" id="advice-quote" aria-live="polite" aria-atomic="true">
    "It is easy to sit up and take notice, what's difficult is getting up and taking action."
</p>

How to Address No Loading States
We’ll add a setLoadingState function to provide loading indicators, ensuring screen reader users are notified when content is being fetched:
// Before: No Loading Feedback
function requestNewAdvice() {
  chrome.runtime.sendMessage({ action: "fetchAdvice" }, (response) => {
    // No loading indicators...
  });
}

// After: Accessible Loading States
function requestNewAdvice() {
  setLoadingState(true); 
  chrome.runtime.sendMessage({ action: "fetchAdvice" }, (response) => {
    setLoadingState(false);
    // Handle response with proper announcements...
  });
}
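
// generateAdviceBtn and adviceQuoteElement are assumed to be element
// references looked up earlier in the popup script.
function setLoadingState(isLoading) {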
  if (isLoading) {
    // Disable button and show loading text
    generateAdviceBtn.disabled = true;
    generateAdviceBtn.setAttribute('aria-label', 'Loading new advice...');
    // Show loading text in the advice quote element
    adviceQuoteElement.textContent = "Loading new advice...";
  } else {
    // Re-enable button
    generateAdviceBtn.disabled = false;
    generateAdviceBtn.setAttribute('aria-label', 'Generate new advice');
  }
}

With the manual testing issues addressed, we can now move on to performing an automated test of the same extension.
How to Perform Automated Browser Extension Accessibility Tests
Manual testing provides crucial insights, but automated tools can efficiently catch common issues and provide ongoing monitoring. 
This Extension Accessibility Checker simplifies testing by analyzing browser extension interfaces, such as popups and content scripts, for WCAG compliance, addressing unique challenges like popup constraints and content injection conflicts.

To use the Extension Accessibility Checker:

Compress your browser extension folder into a .zip file

Upload the .zip file on https://extensiona11ychecker.vercel.app/

Review the generated report for specific accessibility violations and implement suggested fixes 
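If you prefer to script the first step, the folder can be zipped with Python's standard library. This is a minimal sketch: `zip_extension` is a hypothetical helper, and the manifest.json check reflects the assumption that the checker expects the manifest at the root of the archive, as a loadable Chrome extension folder would have it.

```python
import shutil
from pathlib import Path


def zip_extension(folder: str, output_name: str = "extension") -> Path:
    """Zip the contents of `folder` into `<output_name>.zip` and return its path."""
    source = Path(folder)
    # A Chrome extension folder keeps manifest.json at its root
    if not (source / "manifest.json").is_file():
        raise FileNotFoundError(f"No manifest.json found in {source}")
    archive = shutil.make_archive(output_name, "zip", root_dir=source)
    return Path(archive)


# Example call (the folder name is a placeholder):
# zip_extension("my-extension", "my-extension-upload")
```

Zipping the folder's contents rather than the folder itself keeps manifest.json at the top level of the archive.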


As shown in the GIF above, this workflow helps establish accessibility as a routine part of your development process rather than an afterthought.
With automated testing in place, let’s explore best practices to ensure that your extension remains accessible throughout development.
Best Practices for Accessible Browser Extensions
We've transformed our sample advice-generating browser extension from a functional but inaccessible tool into an inclusive one that works for everyone. 
Based on our improvements, here are four key principles for designing accessible browser extensions:

Semantic HTML and Clear, Descriptive Labels


Always start with proper HTML structure, using appropriate elements (for example, a button element for a “Generate Advice” action and a proper heading hierarchy) before adding ARIA attributes.
Ensure that every interactive element has a clear purpose via aria-label, aria-labelledby, or visible text that explains its action.

Clear Communication at Every Step


Every interactive element must convey its purpose effectively. Users need to understand:


What’s happening (for example, “Loading new advice…” for loading states)

What went wrong (for example, “Failed to load advice” for errors)

What changed (for example, aria-live regions for updated content)







Complete Keyboard Accessibility


All functionality must be available through keyboard navigation. This requires testing with Tab, Enter, Space, and arrow keys as appropriate.
Provide clear and thoughtful focus indicators that move predictably through your interface with obvious ways to exit modals or complex interactions.

User Preferences and Content Script Considerations


Respect user choices by supporting system font size settings and not overriding user-defined colour schemes unnecessarily.
When your extension modifies existing web pages, make sure you don't break the page's established accessibility features, focus management and navigation patterns. Ensure any new elements you inject follow accessibility standards.
Conclusion
As we’ve seen with our advice-generating extension, addressing accessibility issues transforms a functional tool into an inclusive one.
However, while fixing issues in existing extensions is helpful, the most effective approach is letting accessibility guide your design and development decisions from the first line of code.
When starting your next browser extension project, ask:

How would someone navigate this using only a keyboard?

Is the purpose of every interactive element immediately clear to screen readers?

How will users understand what's happening during loading states?


Here are some helpful resources:

Chrome Extension Accessibility Documentation

Extension Accessibility Checker

Web Content Accessibility Guidelines (WCAG) 2.1


 


 How to Build AI Speech-to-Text and Text-to-Speech Accessibility Tools with Python 
OMOTAYO OMOYEMI — Mon, 01 Sep 2025 19:50:40 +0000
 Classrooms today are more diverse than ever before. Among the students are neurodiverse learners with different learning needs. While these learners bring unique strengths, traditional teaching methods don’t always meet their needs.
This is where AI-driven accessibility tools can make a difference. From real-time captioning to adaptive reading support, artificial intelligence is transforming classrooms into more inclusive spaces.
In this article, you’ll:

Understand what inclusive education means in practice.

See how AI can support neurodiverse learners.

Try two hands-on Python demos:

Speech-to-Text using local Whisper (free, no API key).

Text-to-Speech using Hugging Face SpeechT5.



Get a ready-to-use project structure, requirements, and troubleshooting tips for Windows and macOS/Linux users.


Table of Contents

Prerequisites

A Note on Missing Files

What Inclusive Education Really Means

Toolbox: Five AI Accessibility Tools Teachers Can Try Today

Platform Notes (Windows vs macOS/Linux)

Hands-On: Build a Simple Accessibility Toolkit (Python)

Quick Setup Cheatsheet

From Code to Classroom Impact

Developer Challenge: Build for Inclusion

Challenges and Considerations

Looking Ahead

Conclusion


Prerequisites
Before you start, make sure you have the following:

Python 3.8 or later installed. Windows users who don’t have it can download the latest version from python.org; macOS users usually already have python3.

Virtual environment set up (venv) — recommended to keep things clean.

FFmpeg installed (required for Whisper to read audio files).

PowerShell (Windows) or Terminal (macOS/Linux).

Basic familiarity with running Python scripts.


Tip: If you’re new to Python environments, don’t worry: the setup commands are included with each step below.
A Note on Missing Files
Some files are not included in the GitHub repository. This is intentional: they are either generated automatically or should be created/installed locally:

.venv/ → Your virtual environment folder. Each reader should create their own locally with:
  python -m venv .venv


FFmpeg Installation:

Windows: FFmpeg is not included in the project files because it is large (approximately 90 MB). Download the FFmpeg build separately.

macOS: Users can install FFmpeg using the Homebrew package manager with the command brew install ffmpeg.

Linux: Users can install FFmpeg using the package manager with the command sudo apt install ffmpeg.



Output File:

output.wav is generated when you run the Text-to-Speech script. It is not included in the GitHub repository; it is created locally on your machine when you execute the script.





To keep the repo clean, these are excluded using .gitignore:
# Ignore virtual environments
.venv/
env/
venv/

# Ignore binary files
ffmpeg.exe
*.dll
*.lib

# Ignore generated audio (but keep sample input)
*.wav
*.mp3
!lesson_recording.mp3

The repository does include all essential files needed to follow along:

requirements.txt (see below)

transcribe.py and tts.py (covered step-by-step in the Hands-On section).


requirements.txt
openai-whisper
transformers
torch
soundfile
sentencepiece
numpy

This way, you’ll have everything you need to reproduce the project.
What Inclusive Education Really Means
Inclusive education goes beyond placing students with diverse needs in the same classroom. It’s about designing learning environments where every student can thrive.
Common barriers include:

Reading difficulties (for example, dyslexia).

Communication challenges (speech/hearing impairments).

Sensory overload or attention struggles (autism, ADHD).

Note-taking and comprehension difficulties.


AI can help reduce these barriers with captioning, reading aloud, adaptive pacing, and alternative communication tools.
Toolbox: Five AI Accessibility Tools Teachers Can Try Today

Microsoft Immersive Reader – Text-to-speech, reading guides, and translation.

Google Live Transcribe – Real-time captions for speech/hearing support.

Otter.ai – Automatic note-taking and summarization.

Grammarly / Quillbot – Writing assistance for readability and clarity.

Seeing AI (Microsoft) – Describes text and scenes for visually impaired learners.


Real-World Examples
A student with dyslexia can use Immersive Reader to listen to a textbook while following along visually. Another student with hearing loss can use Live Transcribe to follow class discussions. These are small technology shifts that create big inclusion wins.
Platform Notes (Windows vs macOS/Linux)
Most code works the same across systems, but setup commands differ slightly:
Creating a virtual environment
To create and activate a virtual environment in PowerShell using Python 3.8 or higher, follow these steps (the command below targets Python 3.12; change the version flag to match the version you have installed):

Create a virtual environment:
 py -3.12 -m venv .venv


Activate the virtual environment:
 .\.venv\Scripts\Activate



Once activated, your PowerShell prompt should change to indicate that you are now working within the virtual environment. This setup helps manage dependencies and keep your project environment isolated.
On macOS, you can create and activate a virtual environment in a bash shell using Python 3 by following these steps:

Create a virtual environment:
 python3 -m venv .venv


Activate the virtual environment:
 source .venv/bin/activate



Once activated, your bash prompt should change to indicate that you are now working within the virtual environment. This setup helps manage dependencies and keep your project environment isolated.
To install FFmpeg on Windows, follow these steps:

Download FFmpeg Build: Visit the official FFmpeg website to download the latest FFmpeg build for Windows.

Unzip the Downloaded File: Once downloaded, unzip the file to extract its contents. You will find several files, including ffmpeg.exe.

Copy ffmpeg.exe: You have two options for using ffmpeg.exe:

Project Folder: Copy ffmpeg.exe directly into your project folder. This way, your project can access FFmpeg without modifying system settings.

Add to PATH: Alternatively, you can add the directory containing ffmpeg.exe to your system's PATH environment variable. This allows you to use FFmpeg from any command prompt window without specifying its location.




Additionally, the full project folder, including all necessary files and instructions, is available for download on GitHub. You can also find the link to the GitHub repository at the end of the article.
For macOS users:
To install FFmpeg on macOS, you can use Homebrew, a popular package manager for macOS. Here’s how:

Open Terminal: You can find Terminal in the Utilities folder within Applications.

Install Homebrew (if not already installed): Paste the following command in Terminal and press Enter, then follow the on-screen instructions:
 /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install FFmpeg: Once Homebrew is installed, run the following command in Terminal:
 brew install ffmpeg

 This command will download and install FFmpeg, making it available for use on your system.


For Linux users (Debian/Ubuntu):
To install FFmpeg on Debian-based systems like Ubuntu, you can use the APT package manager. Here’s how:

Open Terminal: You can usually find Terminal in your system’s applications menu.

Update Package List: Before installing new software, it’s a good idea to update your package list. Run:
 sudo apt update


Install FFmpeg: After updating, install FFmpeg by running:
 sudo apt install ffmpeg

 This command will download and install FFmpeg, allowing you to use it from the command line.


These steps will ensure that FFmpeg is installed and ready to use on your macOS or Linux system.
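Before moving on, it can help to confirm that Python can actually locate FFmpeg. Here is a minimal sketch using only the standard library. It assumes the executable is named ffmpeg and is reachable through your PATH, and the helper name find_ffmpeg is hypothetical.

```python
import shutil
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the full path to the ffmpeg executable, or None if it is not found."""
    # shutil.which searches the directories listed in the PATH variable
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    path = find_ffmpeg()
    if path:
        print("FFmpeg found at:", path)
    else:
        print("FFmpeg not found. Install it or add it to PATH before running Whisper.")
```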
Running Python scripts

Windows: python script.py or py script.py

macOS/Linux: python3 script.py


I will mark these differences with a macOS/Linux note in the relevant steps so you can follow along smoothly on your system.
Hands-On: Build a Simple Accessibility Toolkit (Python)
You’ll build two small demos:

Speech-to-Text with Whisper (local, free).

Text-to-Speech with Hugging Face SpeechT5.


1) Speech-to-Text with Whisper (Local and free)
What you’ll build:
A Python script that takes a short MP3 recording and prints the transcript to your terminal.
Why Whisper?
It’s a robust open-source STT model. The local version is perfect for beginners because it avoids API keys/quotas and works offline after the first install.
How to Install Whisper (PowerShell):
# Activate your virtual environment
# Example: .\.venv\Scripts\Activate

# Install the openai-whisper package
pip install openai-whisper

# Check if FFmpeg is available
ffmpeg -version

# If FFmpeg is not available, download and install it, then add it to PATH or place ffmpeg.exe next to your script
# Example: Move ffmpeg.exe to the script directory or update PATH environment variable


You should see a version string here before running Whisper.
Note: macOS users can run the same pip install and ffmpeg -version commands in their Terminal.
If FFmpeg is not installed, you can install it using the following commands:
For macOS:
brew install ffmpeg

For Ubuntu/Debian Linux:
sudo apt install ffmpeg

Create transcribe.py:
import whisper

# Load the Whisper model
model = whisper.load_model("base")  # Use "tiny" for faster speed, or "small" for better accuracy

# Transcribe the audio file
result = model.transcribe("lesson_recording.mp3", fp16=False)

# Print the transcript
print("Transcript:", result["text"])

How the code works:

whisper.load_model("base") — downloads/loads the model once, then cached afterward.

model.transcribe(...) — handles audio decoding, language detection, and text inference.

fp16=False — avoids half-precision GPU math so it runs fine on CPU.

result["text"] — the final transcript string.
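Besides result["text"], the dictionary returned by model.transcribe also contains a "segments" list, where each segment has "start", "end", and "text" keys. As a rough sketch, you could turn those segments into timestamped caption lines like this. The helper names and the sample data are made up for illustration, so the code runs without loading a model.

```python
def format_timestamp(seconds: float) -> str:
    """Convert a time in seconds to an MM:SS string."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(segments) -> list:
    """Turn Whisper-style segments into caption lines."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {seg['text'].strip()}")
    return lines


if __name__ == "__main__":
    # Stand-in data shaped like result["segments"]; with a real model you
    # would pass result["segments"] from model.transcribe(...) instead.
    sample = [
        {"start": 0.0, "end": 3.2, "text": " Welcome to inclusive education with AI."},
        {"start": 3.2, "end": 65.0, "text": " Today we cover captions."},
    ]
    for line in format_segments(sample):
        print(line)
```

Timestamped lines like these are a reasonable starting point for classroom captions, since students can find their place in a recording.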


Run it:
python transcribe.py

Expected output:

Successful Speech-to-Text: Whisper prints the recognized sentence from lesson_recording.mp3
To run the transcribe.py script on macOS or Linux, use the following command in your Terminal:
python3 transcribe.py

Common hiccups (and fixes):

FileNotFoundError during transcribe → FFmpeg isn’t found. Install it and confirm with ffmpeg -version.

Super slow on CPU → switch to the tiny model: whisper.load_model("tiny").


2) Text-to-Speech with SpeechT5
What you’ll build:
A Python script that converts a short string into a spoken WAV file called output.wav.
Why SpeechT5?
It’s a widely used open model that runs on your CPU. Easy to demo and no API key needed.
Install the required packages on Windows (PowerShell):
# Activate your virtual environment
# Example: .\.venv\Scripts\Activate

# Install the required packages
pip install transformers torch soundfile sentencepiece

Note: macOS users can run the same pip install command in their Terminal.
Create tts.py:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import soundfile as sf
import torch
import numpy as np

# Load models
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Speaker embedding (fixed random seed for a consistent synthetic voice)
g = torch.Generator().manual_seed(42)
speaker_embeddings = torch.randn((1, 512), generator=g)

# Text to synthesize
text = "Welcome to inclusive education with AI."
inputs = processor(text=text, return_tensors="pt")

# Generate speech
with torch.no_grad():
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# Save to WAV
sf.write("output.wav", speech.numpy(), samplerate=16000)
print("✅ Audio saved as output.wav")

Expected Output:

Text-to-Speech complete. SpeechT5 generated the audio and saved it as output.wav
How the code works:

SpeechT5Processor — prepares your text for the model.

SpeechT5ForTextToSpeech — generates a mel-spectrogram (the speech content).

SpeechT5HifiGan — a vocoder that turns the spectrogram into a waveform you can play.

speaker_embedding — a 512-dim vector representing a “voice.” We seed it for a consistent (synthetic) voice across runs.


Note: If you want the same voice every time you reopen the project, you need to save the embedding once using the snippet below:
import numpy as np
import torch

# Save the speaker embeddings
np.save("speaker_emb.npy", speaker_embeddings.numpy())

# Later, load the speaker embeddings
speaker_embeddings = torch.tensor(np.load("speaker_emb.npy"))

Run it:
python tts.py

Note: On macOS/Linux, run python3 tts.py instead.
Expected result:

Terminal prints: ✅ Audio saved as output.wav

A new file appears in your folder: output.wav



Common hiccups (and fixes):

ImportError: sentencepiece not found → pip install sentencepiece

Torch install issues on Windows →


# Activate your virtual environment
# Example: .\.venv\Scripts\Activate

# Install the torch package using the specified index URL for CPU
pip install torch --index-url https://download.pytorch.org/whl/cpu

Note: The first run is usually slow because the models are still downloading. That's normal.
3) Optional: Whisper via OpenAI API
What this does:
Instead of running Whisper locally, you can call the OpenAI Whisper API (whisper-1). Your audio file is uploaded to OpenAI’s servers, transcribed there, and the text is returned.
Why use the API?

No need to install or run Whisper models locally (saves disk space & setup time).

Runs on OpenAI’s infrastructure (faster if your computer is slow).

Great if you’re already using OpenAI services in your classroom or app.


What to watch out for:

Requires an API key.

Requires billing enabled (the free trial quota is usually small).

Needs internet access (unlike the local Whisper demo).


How to get an API key:

Go to OpenAI’s API Keys page.

Log in with your OpenAI account (or create one).

Click “Create new secret key”.

Copy the key — it looks like sk-xxxxxxxx.... Treat this like a password: don’t share it publicly or push it to GitHub.


Step 1: Set your API key
In PowerShell (session only):
# Set the OpenAI API key in the environment variable
$env:OPENAI_API_KEY="your_api_key_here"

To set the environment variable permanently in PowerShell, use the setx command:
setx OPENAI_API_KEY "your_api_key_here"

This command sets the OPENAI_API_KEY environment variable to the specified value. Note that you should replace "your_api_key_here" with your actual API key. This change will apply to future PowerShell sessions, but you may need to restart your current session or open a new one to see the changes take effect.
Verify it’s set:
To check the value of an environment variable in PowerShell, you can use the echo command. Here's how you can do it:
echo $env:OPENAI_API_KEY

This command will display the current value of the OPENAI_API_KEY environment variable in your PowerShell session. If the variable is set, it will print the value. Otherwise, it will return nothing or an empty line.
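You can run the same check from Python, which is where the OpenAI client will read the key. The sketch below only uses the standard library and masks the key so it is safe to show on a shared screen. The helper name mask_key is hypothetical.

```python
import os


def mask_key(key: str) -> str:
    """Show only the first 3 and last 4 characters of a secret."""
    if len(key) <= 8:
        return "*" * len(key)
    return key[:3] + "*" * (len(key) - 7) + key[-4:]


if __name__ == "__main__":
    key = os.environ.get("OPENAI_API_KEY")
    if key:
        print("OPENAI_API_KEY is set:", mask_key(key))
    else:
        print("OPENAI_API_KEY is not set in this session.")
```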
Step 2: Install the OpenAI Python client
To install the OpenAI Python client using pip, you can use the following command in your PowerShell:
pip install openai

This command will download and install the OpenAI package, allowing you to interact with OpenAI's API in your Python projects. Make sure you have Python and pip installed on your system before running this command.
Step 3: Create transcribe_api.py
from openai import OpenAI

# Initialize the OpenAI client (reads API key from environment)
client = OpenAI()

# Open the audio file and create a transcription
with open("lesson_recording.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f
    )

# Print the transcript
print("Transcript:", transcript.text)

Step 4: Run it
python transcribe_api.py

Expected output:
Transcript: Welcome to inclusive education with AI.
Common hiccups (and fixes):

Error: insufficient_quota → You’ve run out of free credits. Add billing to continue.

Slow upload → If your audio is large, compress it first (for example, MP3 instead of WAV).

Key not found → Double-check if $env:OPENAI_API_KEY is set in your terminal session.


Local Whisper vs API Whisper — Which Should You Use?




Feature | Local Whisper (on your machine) | OpenAI Whisper API (cloud)
Setup | Needs Python packages + FFmpeg | Just install openai client + set API key
Hardware | Runs on your CPU (slower) or GPU (faster) | Runs on OpenAI’s servers (no local compute needed)
Cost | ✅ Free after initial download | 💳 Pay per minute of audio (after free trial quota)
Internet required | ❌ No (fully offline once installed) | ✅ Yes (uploads audio to OpenAI servers)
Accuracy | Very good, depends on model size (tiny → large) | Consistently strong, optimized by OpenAI
Speed | Slower on CPU, faster with GPU | Fast (uses OpenAI’s infrastructure)
Privacy | Audio never leaves your machine | Audio is sent to OpenAI (data handling per policy)


Rule of thumb:

Use Local Whisper if you want free, offline transcription or you’re working with sensitive data.

Use the API Whisper if you prefer convenience, don’t mind usage billing, and want speed without local setup.


Quick Setup Cheatsheet




Task | Windows (PowerShell) | macOS / Linux (Terminal)
Create venv | py -3.12 -m venv .venv | python3 -m venv .venv
Activate venv | .\.venv\Scripts\Activate | source .venv/bin/activate
Install Whisper | pip install openai-whisper | pip install openai-whisper
Install FFmpeg | Download build → unzip → add to PATH or copy ffmpeg.exe | brew install ffmpeg (macOS); sudo apt install ffmpeg (Linux)
Run STT script | python transcribe.py | python3 transcribe.py
Install TTS deps | pip install transformers torch soundfile sentencepiece | pip install transformers torch soundfile sentencepiece
Run TTS script | python tts.py | python3 tts.py
Install OpenAI client (API) | pip install openai | pip install openai
Run API script | python transcribe_api.py | python3 transcribe_api.py


Pro tip for macOS M1/M2 users: You may need a special PyTorch build for Metal GPU acceleration. Check the PyTorch install guide for the right wheel.
From Code to Classroom Impact
Whether you chose the local Whisper, the cloud API, or SpeechT5 for text-to-speech, you should now have a working prototype that can:

Convert spoken lessons into text.

Read text aloud for students who prefer auditory input.


That’s the technical foundation. But the real question is: how can these building blocks empower teachers and learners in real classrooms?
Developer Challenge: Build for Inclusion
Try combining the two snippets into a simple classroom companion app that:

Captions what the teacher says in real time.

Reads aloud transcripts or textbook passages on demand.


Then think about how to expand it further:

Add symbol recognition for non-verbal communication.

Add multi-language translation for diverse classrooms.

Add offline support for schools with poor connectivity.


These are not futuristic ideas. They are achievable with today’s open-source AI tools.
Challenges and Considerations
Of course, building for inclusion isn’t just about code. There are important challenges to address:

Privacy: Student data must be safeguarded, especially when recordings are involved.

Cost: Solutions must be affordable and scalable for schools of all sizes.

Teacher Training: Educators need support to confidently use these tools.

Balance: AI should assist teachers, not replace the vital human element in learning.


Looking Ahead
The future of inclusive education will likely involve multimodal AI: systems that combine speech, gestures, symbols, and even emotion recognition. We may even see brain–computer interfaces and wearable devices that enable seamless communication for learners who are currently excluded.
But one principle is clear: inclusion works best when teachers, developers, and neurodiverse learners co-design solutions together.
Conclusion
AI isn’t here to replace teachers. It’s here to help them reach every student. By embracing AI-driven accessibility, classrooms can become spaces where neurodiverse learners aren’t left behind but are empowered to thrive.
📢 Your turn:

Teachers: You can try one of the tools in your next lesson.

Developers: You can use the code snippets above to prototype your own inclusive classroom tool.

Policymakers: You can support initiatives that make accessibility central to education.


Inclusive education isn’t just a dream. It’s becoming a reality, and with thoughtful use of AI, it can become the new norm.
Full source code on GitHub: Inclusive AI Toolkit
 


 How to Create a Real-Time Gesture-to-Text Translator Using Python and Mediapipe 
OMOTAYO OMOYEMI — Mon, 18 Aug 2025 14:00:13 +0000
 Sign and symbol languages, like Makaton and American Sign Language (ASL), are powerful communication tools. However, they can create challenges when communicating with people who don't understand them.
As a researcher working on AI for accessibility, I wanted to explore how machine learning and computer vision could bridge that gap. The result was a real-time gesture-to-text translator built with Python and Mediapipe, capable of detecting hand gestures and instantly converting them to text.
In this tutorial, you’ll learn how to build your own version from scratch, even if you’ve never used Mediapipe before.
By the end, you’ll know how to:

Detect and track hand movements in real time.

Classify gestures using a simple machine learning model.

Convert recognized gestures into text output.

Extend the system for accessibility-focused applications.


Prerequisites
Before following along with this tutorial, you should have:

Basic Python knowledge – You should be comfortable writing and running Python scripts.

Familiarity with the command line – You’ll use it to run scripts and install dependencies.

A working webcam – This is required for capturing and recognizing gestures in real time.

Python installed (3.8 or later) – Along with pip for installing packages.

Some understanding of machine learning basics – Knowing what training data and models are will help, but I’ll explain the key parts along the way.

An internet connection – To install libraries such as Mediapipe and OpenCV.


If you’re completely new to Mediapipe or OpenCV, don’t worry, I will walk through the core parts you need to know to get this project working.
Table of Contents

Prerequisites

Why This Matters

Tools and Technologies

Step 1: How to Install the Required Libraries

Step 2: How Mediapipe Tracks Hands

Step 3: Project Pipeline

Step 4: How to Collect Gesture Data

Step 5: How to Train a Gesture Classifier

Step 6: Real-Time Gesture-to-Text Translation

Step 7: Extending the Project

Ethical and Accessibility Considerations

Conclusion


Why This Matters
Accessible communication is a right, not a privilege. Gesture-to-text translators can:

Help non-signers communicate with sign/symbol language users.

Assist in educational contexts for children with communication challenges.

Support people with speech impairments.


Note: This project is a proof-of-concept and should be tested with diverse datasets before real-world deployment.
Tools and Technologies
We’ll be using:




Tool | Purpose
Python | Primary programming language
Mediapipe | Real-time hand tracking and gesture detection
OpenCV | Webcam input and video display
NumPy | Data processing
Scikit-learn | Gesture classification


Step 1: How to Install the Required Libraries
Before installing the dependencies, ensure you have Python version 3.8 or higher installed (for example, Python 3.8, 3.9, 3.10, or newer). You can check your current Python version by opening a terminal (Command Prompt on Windows, or Terminal on macOS/Linux) and typing:
python --version

or
python3 --version

Confirm that your Python version is 3.8 or higher, because Mediapipe and some of its dependencies require modern language features and prebuilt binary wheels. If the commands above print a version older than 3.8, install a newer Python version before you continue.
Windows:

Press Windows Key + R

Type cmd and press Enter to open Command Prompt

Type one of the above commands and press Enter


macOS/Linux:

Open your Terminal application

Type one of the above commands and press Enter


If your Python version is older than 3.8, you’ll need to download and install a newer version from the official Python website.
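If you prefer to check the version from inside Python (for example, at the top of your own scripts), a small guard like the one below works. It is only a sketch: the 3.8 minimum comes from this tutorial, and the function name python_is_supported is hypothetical.

```python
import sys

MIN_VERSION = (3, 8)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_is_supported():
        print(f"Python {current} is new enough.")
    else:
        print(f"Python {current} is too old; install 3.8 or later.")
```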
Once Python is ready, you can install the required libraries using pip:
pip install mediapipe opencv-python numpy scikit-learn pandas

This command installs all the libraries you’ll need for the project:

Mediapipe – real-time hand tracking and landmark detection.

OpenCV – reading frames from your webcam and drawing overlays.

Pandas – storing our collected landmark data in a CSV for training.

Scikit-learn – training and evaluating the gesture classification model.


Step 2: How Mediapipe Tracks Hands
Mediapipe’s Hand Tracking solution detects 21 key landmarks for each hand, including fingertips, joints, and the wrist, and can run at 30+ FPS even on modest hardware.
Here’s a conceptual diagram of the landmarks:

And here’s what real‑time tracking looks like:

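Each landmark has (x, y, z) coordinates. The x and y values are normalized to the range 0 to 1 relative to the image width and height, and z is a relative depth measured from the wrist. This makes it easy to measure angles and positions for gesture classification.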
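To make the coordinate system concrete, here is a small sketch that scales a landmark to pixel coordinates and measures the distance between two landmarks. Mediapipe reports x and y normalized to the frame width and height, so you multiply by the frame size to get pixels. The Landmark class is a stand-in for Mediapipe's landmark objects, and the coordinate values are made up for illustration.

```python
import math
from typing import NamedTuple, Tuple


class Landmark(NamedTuple):
    """Stand-in for a Mediapipe landmark (normalized x, y and relative z)."""
    x: float
    y: float
    z: float


def to_pixels(lm: Landmark, width: int, height: int) -> Tuple[int, int]:
    """Scale normalized x/y to pixel coordinates for a frame of the given size."""
    return int(lm.x * width), int(lm.y * height)


def distance_2d(a: Landmark, b: Landmark) -> float:
    """Euclidean distance between two landmarks in normalized image space."""
    return math.hypot(a.x - b.x, a.y - b.y)


if __name__ == "__main__":
    thumb_tip = Landmark(0.50, 0.40, -0.02)  # made-up values for illustration
    index_tip = Landmark(0.53, 0.44, -0.03)
    print(to_pixels(thumb_tip, 640, 480))
    print(round(distance_2d(thumb_tip, index_tip), 3))
```

A small thumb-to-index distance like this is one simple signal that could help distinguish an "OK" sign from an open palm.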
Step 3: Project Pipeline
Here’s how the system works, from webcam to text output:


Capture: Webcam frames are captured using OpenCV.

Detection: Mediapipe locates hand landmarks.

Vectorization: Landmarks are flattened into a numeric vector.

Classification: A machine learning model predicts the gesture.

Output: The recognized gesture is displayed as text.


Basic hand detection example:
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)

with mp_hands.Hands(max_num_hands=1) as hands:
    while True:
        ret, frame = cap.read()
        if not ret:
            break

        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)

        cv2.imshow("Hand Tracking", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

cap.release()
cv2.destroyAllWindows()

The code above opens the webcam and processes each frame with Mediapipe’s Hands solution. It converts each frame to RGB (as Mediapipe expects), runs detection, and, if a hand is found, draws the 21 landmarks and their connections on top of the frame. You can press q to close the window. This piece verifies your setup and helps you see that landmark tracking works before moving on.
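The vectorization step of the pipeline can be sketched separately from the webcam loop. The function below flattens any sequence of objects with x, y, and z attributes into a single list, which is the shape a classifier expects. The function name is hypothetical, and the fake landmarks stand in for hand_landmarks.landmark so the example runs without a camera.

```python
from types import SimpleNamespace


def landmarks_to_vector(landmarks) -> list:
    """Flatten an iterable of landmarks into [x0, y0, z0, x1, y1, z1, ...]."""
    vector = []
    for lm in landmarks:
        vector.extend([lm.x, lm.y, lm.z])
    return vector


if __name__ == "__main__":
    # 21 fake landmarks standing in for hand_landmarks.landmark
    fake_hand = [SimpleNamespace(x=i * 0.01, y=i * 0.02, z=0.0) for i in range(21)]
    vector = landmarks_to_vector(fake_hand)
    print(len(vector))
```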
Step 4: How to Collect Gesture Data
Before we can train our model, we need a dataset of labelled gestures. Each gesture will be stored in a CSV file (gesture_data.csv) containing the 3D landmark coordinates for all detected hand points.
For example, we’ll collect data for three gestures:

thumbs_up – the classic thumbs-up pose.

open_palm – a flat hand, fingers extended (like a “high five”).

ok – the “OK” sign, made by touching the thumb and index finger.


You can collect samples for each gesture by running:
python src/collect_data.py --label thumbs_up --samples 200

python src/collect_data.py --label open_palm --samples 200

python src/collect_data.py --label ok --samples 200

Explanation of the command:

--label → the name of the gesture you’re recording. This label will be stored alongside each row of coordinates in the CSV.

--samples → the number of frames to capture for that gesture. More samples generally lead to better accuracy.


How the process works:

When you run a command, your webcam will open.

Make the specified gesture in front of the camera.

The script will use MediaPipe Hands to detect 21 hand landmarks (each with x, y, z coordinates).

These 63 numbers (21 × 3) are stored in a row of the CSV file, along with the gesture label.

The counter at the top will track how many samples have been collected.

When the sample count reaches your target (--samples), the script will close automatically.


Example of what the CSV looks like:

Each row contains:

x0, y0, z0 … x20, y20, z20 → coordinates of each hand landmark.

label → the gesture name.
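The row format described above can be written with Python's csv module. This is a minimal sketch of that format, not the repository's collect_data.py script. The function names are hypothetical, and the header is written only when the file is new or empty.

```python
import csv
import os

NUM_LANDMARKS = 21


def csv_header() -> list:
    """Column names: x0, y0, z0, ..., x20, y20, z20, label."""
    columns = []
    for i in range(NUM_LANDMARKS):
        columns.extend([f"x{i}", f"y{i}", f"z{i}"])
    return columns + ["label"]


def append_sample(path: str, vector: list, label: str) -> None:
    """Append one sample to the CSV, writing the header if the file is new."""
    if len(vector) != NUM_LANDMARKS * 3:
        raise ValueError("expected 63 values (21 landmarks x 3 coordinates)")
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(csv_header())
        writer.writerow(list(vector) + [label])
```

Calling append_sample once per captured frame builds up a file with 63 coordinate columns plus the label, which pandas can load directly for training.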


Example of data collection in progress:

In the above screenshot, the script is capturing 10 out of 10 thumbs_up samples.
📌 Tip: Make sure your hand is clearly visible and well-lit. Repeat the process for all gestures you want to train.
Step 5: How to Train a Gesture Classifier
Once you have enough samples for each gesture, train a model:
python src/train_model.py --data data/gesture_data.csv --label palm_open

This script:

Loads the CSV dataset.

Splits into training and testing sets.

Trains a Random Forest Classifier.

Prints accuracy and a classification report.

Saves the trained model.


Core training logic:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import pickle

# Load the dataset
df = pd.read_csv("data/gesture_data.csv")

# Separate features and labels
X = df.drop("label", axis=1)
y = df["label"]

# Initialize and train the Random Forest Classifier
model = RandomForestClassifier()
model.fit(X, y)

# Save the trained model to a file
with open("data/gesture_model.pkl", "wb") as f:
    pickle.dump(model, f)

This block loads the gesture dataset from data/gesture_data.csv and splits it into:

X – the input features (the 3D landmark coordinates for each gesture sample).

y – the labels (gesture names like thumbs_up, open_palm, ok).


We then create a Random Forest Classifier, which is well-suited for numerical data and works reliably without much tuning. The model learns patterns in the landmark positions that correspond to each gesture.
Finally, we save the trained model as data/gesture_model.pkl so it can be loaded later for real-time gesture recognition without retraining.
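The core logic above trains on the full dataset, so it cannot report accuracy on unseen samples. A hold-out evaluation, like the split and report steps listed for the script, might look like the sketch below. The function name, the 80/20 split, and the synthetic data are assumptions for illustration. With real data you would pass the X and y loaded from the CSV.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split


def train_and_evaluate(X, y, test_size=0.2, seed=42):
    """Train on a stratified split and return (model, held-out accuracy, report)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return model, accuracy_score(y_test, y_pred), classification_report(y_test, y_pred)


if __name__ == "__main__":
    # Synthetic stand-in for the CSV: three well-separated clusters of 63 features
    rng = np.random.default_rng(0)
    labels = ["thumbs_up", "open_palm", "ok"]
    X = np.vstack([rng.normal(loc=i, scale=0.05, size=(50, 63)) for i in range(3)])
    y = np.repeat(labels, 50)
    _, accuracy, report = train_and_evaluate(X, y)
    print("Accuracy:", accuracy)
    print(report)
```

Stratifying the split keeps the proportion of each gesture the same in the training and test sets, which matters when some gestures have fewer samples than others.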
Step 6: Real-Time Gesture-to-Text Translation
Load the model and run the translator:
python src/gesture_to_text.py --model data/gesture_model.pkl

This command runs the real-time gesture recognition script.

The --model argument tells the script which trained model file to load — in this case, gesture_model.pkl that we saved earlier.

Once running, the script opens your webcam, detects your hand landmarks, and uses the model to predict the gesture.

The predicted gesture name appears as text on the video feed.

Press q to exit the window when you’re done.


Core prediction logic:
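import pickle

import cv2

# Load the trained model once, before the capture loop starts
with open("data/gesture_model.pkl", "rb") as f:
    model = pickle.load(f)

# Inside the capture loop: `frame` is the current webcam image and
# `results` is the output of hands.process(...) for that frame
if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        # Flatten the 21 landmarks into a 63-value feature vector
        coords = []
        for lm in hand_landmarks.landmark:
            coords.extend([lm.x, lm.y, lm.z])
        gesture = model.predict([coords])[0]
        # Draw the predicted label in the top-left corner of the frame
        cv2.putText(frame, gesture, (10, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)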

This code loads the trained gesture recognition model from gesture_model.pkl.
If any hands are detected (results.multi_hand_landmarks), it loops through each detected hand and:

Extracts the coordinates – for each of the 21 landmarks, it appends the x, y, and z values to the coords list.

Makes a prediction – passes coords to the model’s predict method to get the most likely gesture label.

Displays the result – uses cv2.putText to draw the predicted gesture name on the video feed.


This is the real-time decision-making step that turns raw Mediapipe landmark data into a readable gesture label.
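Per-frame predictions can flicker when a hand is moving between poses. One common remedy, which is not part of the original script, is to take a majority vote over the last few frames. The sketch below shows the idea with a hypothetical PredictionSmoother class. The window size of 10 is an arbitrary default you would tune.

```python
from collections import Counter, deque


class PredictionSmoother:
    """Majority vote over the last `window` per-frame predictions."""

    def __init__(self, window: int = 10):
        # deque with maxlen drops the oldest prediction automatically
        self.history = deque(maxlen=window)

    def update(self, label: str) -> str:
        """Record the newest prediction and return the current majority label."""
        self.history.append(label)
        return Counter(self.history).most_common(1)[0][0]


if __name__ == "__main__":
    smoother = PredictionSmoother(window=5)
    for raw in ["ok", "ok", "thumbs_up", "ok", "ok"]:
        print(raw, "->", smoother.update(raw))
```

In the capture loop, you would pass each predicted gesture through smoother.update(...) and draw the returned label instead of the raw prediction.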
You should see the recognized gesture at the top of the video feed:

Step 7: Extending the Project
You can take this project further by:

Adding Text-to-Speech: Use pyttsx3 to speak recognized words.

Supporting More Gestures: Expand your dataset.

Deploying in the Browser: Use TensorFlow.js for web-based recognition.

Testing with Real Users: Especially in accessibility contexts.


Ethical and Accessibility Considerations
Before deploying:

Dataset Diversity: Train with gestures from different skin tones, hand sizes, and lighting conditions.

Privacy: Store only landmark coordinates unless you have consent for video storage.

Cultural Context: Some gestures have different meanings in different cultures.


Conclusion
In this tutorial, we explored how to use Python, Mediapipe, and machine learning to build a real-time gesture-to-text translator. This technology has exciting potential for accessibility and inclusive communication, and with further development, could become a powerful tool for breaking down language barriers.
You can find the full code and resources here:
GitHub Repo – Gesture_Article
 


 How to Improve Web Accessibility with Landmarks - Explained with Examples 
Ilknur Eren — Tue, 05 Aug 2025 20:51:05 +0000
 If you’re reading this article on the freeCodeCamp publication, you should see some visual clues in different sections of the page. The header is at the top of the page. If you scroll all the way to the bottom of the page, you can see the footer section in grey background, which is clearly separated from the body with a white background.
freeCodeCamp, like other websites, visually separates the sections of the page to give the user clues so they can easily navigate between sections.
While sighted users have visual clues about the sections, people who use assistive technology, such as a screen reader, rely on landmarks to navigate through the page.
Simply put, landmarks are semantic regions in a web page that define the purpose of its sections. Landmarks allow assistive technologies to jump between major parts of the page, just like sighted users visually scan headings or menus.
Common HTML landmarks include:

<header> – Represents introductory content or a page header.

<nav> – Identifies navigation links.

<main> – Marks the main content area of the page.

<aside> – Contains complementary or related information.

<footer> – Represents page or section footer.


Table of contents

How to Navigate Landmarks in Any Browser

How to Navigate Through Landmarks on a Mac Voice Over

Why Landmarks Matter for Accessibility

How to Use Landmarks

Concrete Examples of Each Landmark

Final Thoughts


How to Navigate Landmarks in Any Browser
General Browser Support
Most screen readers support landmark navigation with shortcut keys. Here's a basic overview:




Screen Reader | OS | Shortcut
VoiceOver | macOS | Control + Option + U (to open Rotor), then arrow keys to navigate
NVDA | Windows | D to move to the next landmark
JAWS | Windows | R to cycle through regions
Narrator | Windows | Caps Lock + Right Arrow to move by landmark
ChromeVox | Chrome OS | Search + Left/Right Arrow to move between landmarks


These shortcuts let users jump straight from one region of the page to another, without tabbing through every interactive element.
How to Navigate Through Landmarks on a Mac Voice Over

Turn on VoiceOver: You can turn on VoiceOver by opening Finder, typing VoiceOver, and toggling VoiceOver on.
 

Open the rotor: Once VoiceOver is on, press Control+Option+U on your keyboard to open the VoiceOver rotor. Press the right and left arrow keys to switch between rotor lists, which include headings, links, and landmarks. The screenshot below shows the rotor’s landmark list for a freeCodeCamp article. The article is divided into navigation, search, main, article, and footer elements.




Press the down and up arrows to navigate through landmarks: Once you are on the rotor’s landmark list, press the down and up arrow keys to move between sections of the page. To go to the footer, press the down arrow until you reach footer, then press Enter.

Why Landmarks Matter for Accessibility
1. Easier Navigation for Screen Reader Users
Screen readers provide shortcuts to navigate through landmarks. Without landmarks, users must tab through every single link or element, which is frustrating and time-consuming. In the freeCodeCamp article example, a user might want to skip to the footer to find the donation link. Without landmarks, they would need to tab through the entire article to reach it. Landmarks give users who rely on screen readers a much faster way to navigate.
2. Consistent Structure Across Pages
When every page uses the same landmark structure, users can predict where navigation menus, main content, and sidebars are located. This predictability reduces cognitive load. By organizing the page into sections, you make it easy for users to figure out where to go next.
3. Clear Context and Orientation
Landmarks communicate the role of content. For instance:

The main landmark signals: “This is the core content of the page.”

The aside landmark signals: “This is supplementary or related content.”


This helps users decide which areas to skip or focus on.
How to Use Landmarks
✅ Basic Landmark Structure
Here’s an example of a page using HTML5 landmarks:



  
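<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Accessible Landmark Example</title>
</head>
<body>
  <header>
    <h1>Website Logo</h1>
    <nav>
      <ul>
        <li><a href="#">Home</a></li>
      </ul>
    </nav>
  </header>

  <main>
    <h2>Main Content Area</h2>
    <p>This is the primary content of the page.</p>
  </main>

  <aside>
    <h3>Related Links</h3>
    <ul>
      <li><a href="#">Resource 1</a></li>
    </ul>
  </aside>

  <footer>
    <p>2025 Example Company</p>
  </footer>
</body>
</html>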
  




The HTML is divided into five landmark sections: header, navigation, main, aside, and footer. If a screen reader user wants to skip the header and go directly to the main content, they can do so by opening the accessibility rotor and selecting the main landmark. Landmarks allow screen reader users to easily navigate through the page.
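If you want a quick way to see which landmark elements a page contains, a few lines of Python with the standard-library HTML parser can list them. This is a rough sketch, not a full accessibility audit: it only matches the five tag names covered in this article, and it ignores ARIA role attributes and nesting rules. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

LANDMARK_TAGS = {"header", "nav", "main", "aside", "footer"}


class LandmarkCollector(HTMLParser):
    """Collect landmark elements in document order."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method
        if tag in LANDMARK_TAGS:
            self.found.append(tag)


def find_landmarks(html: str) -> list:
    """Return the landmark tag names that appear in the markup."""
    collector = LandmarkCollector()
    collector.feed(html)
    return collector.found


if __name__ == "__main__":
    page = "<header><nav></nav></header><main></main><aside></aside><footer></footer>"
    landmarks = find_landmarks(page)
    print(landmarks)
    if landmarks.count("main") != 1:
        print("Warning: expected exactly one <main> element.")
```

A script like this can run in a build step to flag pages that are missing a main landmark, though testing with a real screen reader remains the more reliable check.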
Here’s a breakdown of what each landmark is and how it's typically used:
<nav> – Navigation Section
Used for menus, site-wide links, or breadcrumbs.
<nav>
  <ul>
    <li><a href="#">About</a></li>
    <li><a href="#">Courses</a></li>
  </ul>
</nav>
  


Real-world use: Jump straight to the navigation to find the “Contact” page without browsing through all the content.
<main> – Primary Page Content
Used once per page to wrap the most important content.
<main>
  <h1>Learn Accessibility</h1>
  <p>This article explains how to use landmarks...</p>
</main>


Real-world use: Skip past the header and sidebar to dive into the tutorial or article.
<aside> – Complementary Information
Used for sidebars, ads, related links, or pull quotes.
<aside>
  <h2>Related Tutorials</h2>
  <ul>
    <li><a href="#">Accessible Forms</a></li>
  </ul>
</aside>
  


Real-world use: Users can skip the aside if they don’t want extra content, or jump to it for helpful resources.
<footer> – Page Footer
Used for closing content like copyright.
<footer>
  <p>© 2025 FreeCodeCamp. All rights reserved.</p>
</footer>


Real-world use: Quickly navigate to support links, licensing info, or a newsletter sign-up.
<header> – Top-of-Page or Section Header
Used for introductory content, such as logos or search bars.

  
  
    
  


Real-world use: Quickly access the search input or return to the homepage.
Final Thoughts
Landmarks aren’t just an accessibility bonus—they’re a fundamental part of good UX. By implementing landmarks properly, you make your site easier to navigate for users with disabilities, comply with WCAG, and create a more predictable structure for everyone.
 


 How to Audit Android Accessibility with the Accessibility Scanner App 
Ilknur Eren — Mon, 30 Jun 2025 18:02:46 +0000
 The Web Content Accessibility Guidelines (WCAG 2.1 Level AA) is an internationally recognized standard for digital accessibility. Meeting these guidelines helps you make sure that your website is usable by people with visual, motor, hearing, and cognitive impairments.
Google’s Accessibility Scanner on Google Play is a free app that offers developers, designers, and product leaders the ability to audit their app to find accessibility issues. The app is designed to highlight accessibility issues that might not meet the WCAG 2.1 Level AA standards. 
Once installed, the Accessibility Scanner allows you to take screenshots or video recordings of your app, then highlights areas that may not meet accessibility requirements, like small touch targets, low color contrast, or missing content labels.
Here’s what we’ll cover:

How to Download and Enable the Accessibility Scanner

How to Use the Accessibility Scanner

How to Use the Snapshot Feature

How to Use the Record Feature



Why Use the Accessibility Scanner?


How to Download and Enable the Accessibility Scanner
In five quick steps, you can download the Accessibility Scanner app and enable it on your Android device.

Search “Accessibility Scanner” on Google Play Store and download it.

Find the downloaded app on your device and open it.

Turn on the Accessibility scanner by clicking on the “Turn on” button on the bottom right side of the page. This will take you to your Accessibility Settings.

On the Accessibility Settings page, click on the Accessibility Scanner button. This will take you to the Accessibility Scanner Settings.

Find the Accessibility Scanner toggle and turn it on. This will open a modal that asks whether you allow “Accessibility Scanner” to have full control of your device. Click Allow.


After step five, a blue checkmark icon will appear on the right side of your screen (see image below). This floating icon gives you quick access to start scanning any screen for accessibility issues.

How to Use the Accessibility Scanner
To scan or record your app to find accessibility issues, tap the blue checkmark icon. You’ll see a few options after clicking on the blue checkmark:

Record: Captures a short video of user interaction and generates a report of potential accessibility issues.

Snapshot: Takes a static screenshot and flags issues found on that screen.

Turn off: Turns the Accessibility Scanner off.

Collapse: Collapses the options to show the initial blue checkmark.



You can choose between taking a single Snapshot or recording user flow using Record to evaluate multiple screens.
How to Use the Snapshot Feature
The Snapshot button takes a snapshot of the page you are currently on and lists the accessibility issues that may be on that page. The issues are highlighted in red boxes.
The image below is the result of taking a snapshot of the Facebook log in page. The accessibility scanner states that there are 10 accessibility suggestions on this page alone.

You can click on the highlighted area in order to get more details of the potential accessibility issue. For example, you can click on the red box that is highlighting the “Mobile number or email” form that’s in the image above. Once you click on the highlighted area, you will get additional information.
The image below is the result of clicking on the “Mobile number or email” form element. Accessibility Scanner is highlighting errors it found on this email form.
The first suggestion it gives is to fix the item label, because the item may not have a label readable by screen readers. The second issue it highlights is the Touch Target and suggests that the target should be larger. The final suggestion is the Unexposed Text, possible text detected: Mobile number or email.
Snapshots allow us to take screenshots of our pages and highlight accessibility issues.

How to Use the Record Feature
If you select to record, the Accessibility Scanner will take snapshots at intervals as you go through your app’s pages. To end the recording, tap the blue pause button (which replaces the original checkmark during recording).
Once you stop recording, Accessibility Scanner will give you several snapshots with the errors highlighted. The image below is the result of recording the Facebook log in page for less than a minute.
While recording, I navigated to other pages within the app. The recording gave 5 snapshots of the pages I went through. You can see the snapshots at the top of the page. In the image below, I am on screen one of five. I can click on the other snapshots underneath the words “Screen 1 of 5” and see the issues for the different snapshots taken during my recording. Similar to the snapshot accessibility audit, you can click on the red boxes and get more information on the errors.

Why Use the Accessibility Scanner?
The Accessibility Scanner is a valuable tool for teams throughout the app development lifecycle. Engineers can use it early in the process to scan the app locally, identify accessibility issues, and resolve them before release. During the QA phase, designers and product managers can use the scanner to audit user interfaces and flag potential accessibility concerns. Even after an app is in production, all teams can continue to use the scanner to monitor and improve accessibility.
But it’s important to note that the Accessibility Scanner is just one part of an accessibility strategy – it’s not a complete replacement for manual testing or audits. And it won’t catch all types of accessibility barriers – especially those that require keyboard navigation, screen reader testing, or cognitive usability reviews. But it is a simple and effective starting point for improving accessibility in Android apps.
You should use it alongside other tools, such as Android’s TalkBack for screen reader testing. Most importantly, real-world feedback from people who use assistive technologies is essential to identifying usability barriers that automated tools may miss.
With just a few taps, Accessibility Scanner helps surface issues that might otherwise be missed. It’s a free, lightweight, and essential tool for anyone building inclusive mobile experiences.
Thanks for Reading!
You should now know how to get started using the Accessibility Scanner to check your apps’ accessibility and make sure they’re usable by everyone.
 


 How to Create Accessible and User-Friendly Forms in React 
Grant Riordan — Tue, 29 Apr 2025 15:51:58 +0000
When designing web applications, you’ll often be asked age-old questions like “How accessible is your website?” and “Does it offer the best user experience?” These are both very valid questions, but they are often overlooked in favour of rich or fancy-looking features, which reduces the site’s audience.
In this article, I’ll teach you about the React Hook Form library, HTML attributes, and development considerations to make sure your site’s available for all, focusing on:

blind or visually impaired users, who may use a screen reader

better user feedback

visual cues for all

design considerations for all


Whilst following along with this tutorial, you can either pull down the code from the GitHub repo (by visiting this page), or you can use the inline code snippets within the article.
Pre-requisites for this article:

Knowledge of React

Knowledge of writing TypeScript and HTML / JSX.

Familiarity with Tailwind CSS (not required in order to follow this tutorial)


Table of Contents

The Initial Basic Form

Error Handling With React-Hook-Form

Hooking Up The useForm Methods To Our Form

Showing Error Messages

Adding aria-required

Adding fieldset and legend

Adding Labels and Using htmlFor

Do Not Rely on Placeholders Only!

Give Additional Information With aria-describedby

Avoid Tooltips for Critical Information

Tell Me Something Important

Focus States and Colouring

Make Buttons Descriptive

Final Thoughts


The Initial Basic Form
So if we take a look at the form in its current state, you may think it looks fine. But it’s actually not very accessible, nor does it offer a great user experience.
import { TvIcon } from "@heroicons/react/24/outline";

type FormData = {
    fullName: string;
    email: string;
    password: string;
    confirmPassword: string;
    agreeToTerms: boolean;
};

export const RegistrationForm = () => {
    const onSubmit = () => {
        alert(`Form submitted`);
    };

    return (
        <div className="flex justify-center items-center w-screen h-screen bg-gray-900">
            <div className="w-full max-w-md p-8 bg-black bg-opacity-75 rounded-lg">
                <div className="flex flex-row justify-center items-center gap-x-4">
                    <TvIcon className="h-12 w-12 text-white" />
                    <h1 className="text-7xl font-bold text-center text-red-600 mb-4">Getflix</h1>
                </div>
                <h2 className="text-3xl font-bold text-white mb-6 text-center">
                    Sign Up
                </h2>

                <form className="space-y-6">

                    {/* Full Name */}
                    <div>
                        <input
                            type="text"
                            placeholder="Full Name"
                            className="w-full p-3 rounded bg-gray-700 text-white placeholder-gray-400"
                        />
                    </div>

                    {/* Email */}
                    <div>
                        <input
                            type="email"
                            placeholder="Email Address"
                            className="w-full p-3 rounded bg-gray-700 text-white placeholder-gray-400"
                        />
                    </div>

                    {/* Password */}
                    <div>
                        <input
                            type="password"
                            placeholder="Password"
                            className="w-full p-3 rounded bg-gray-700 text-white placeholder-gray-400"
                        />
                    </div>

                    {/* Confirm Password */}
                    <div>
                        <input
                            type="password"
                            placeholder="Confirm Password"
                            className="w-full p-3 rounded bg-gray-700 text-white placeholder-gray-400"
                        />
                    </div>

                    {/* Agree to Terms */}
                    <div className="flex items-center text-gray-400 text-sm">
                        <input
                            type="checkbox"
                            id="agreeToTerms"
                            className="mr-2"
                        />
                        <label htmlFor="agreeToTerms" className="select-none">
                            I agree to the Terms and Conditions
                        </label>
                    </div>

                    {/* Submit */}
                    <button
                        type="submit"
                        className="w-full py-3 bg-red-600 hover:bg-red-700 text-white rounded font-semibold transition"
                    >
                        Sign Up
                    </button>
                </form>
            </div>
        </div>
    );
};

What’s Wrong With The Form?

Lack of action feedback – no user feedback means that users can become confused as to whether an action has happened or not. No error messages or feedback offers the user no insight into what they need to do to correct the form.

No labels for form inputs – No labels for form inputs prevent screen readers from understanding their purpose. Some screen readers may miss placeholders, and once a user types within the input, the placeholder is replaced, losing context and making it hard to return to erroneous inputs.

Lack of accessibility markup to make the form optimised for screen readers and accessibility tools.


So how do we make this better? Let’s jump right in.
Error Handling With React-Hook-Form
Error handling on forms is a critical aspect of any form submission flow. Without it, the process becomes both chaotic and frustrating for the user. We can alleviate this frustration by adding some useful error messages which explain the issues.
A popular library for working with forms in React is the react-hook-form library. It’s used by over 1.4 million people according to their GitHub statistics.
Go ahead and install it if you don’t have it already:
npm install react-hook-form

We will then implement the basic required functions from the react-hook-form package, using the useForm() hook like so:
// define our type structure to use within the form
type FormData = {
    fullName: string;
    email: string;
    password: string;
    confirmPassword: string;
    agreeToTerms: boolean;
};

// basic usage of `useForm()`
const {
    register,
    handleSubmit,
    watch,
    formState: { errors },
  } = useForm()

Quick Explanation:

register: One of the key concepts in React Hook Form is to “register” your component / HTML element. This means you can access the value of the element for both form validation and when submitting the form.

handleSubmit: This is the key function needed to submit the form, run validation, and any other configured checks. It can take up to two arguments:

handleSubmit(onSuccess) – called when the submission of the form is valid and can submit ok.

handleSubmit(onSuccess, onFail) – here you can pass the handleSubmit() method two functions: the first will be run when React Hook Form deems the form to be valid, and allows you to continue. The second will be called when the form sees an error. This could be from validation, or another stipulation.



watch: Watch is a function that monitors a specified element for changes and returns its value. For instance, if you’re watching an input element, you can output the user’s typing in real-time or have another element validate it against a predefined value. A good example is a confirm password matching the previous password field.

formState: this is an object which holds information about your form. The formState object keeps track of the state of the form, like:

isDirty – true if the user has changed any input.

isValid – true if the form passes all validations.

errors – an object holding any validation errors per field.

isSubmitting – true while the form is being submitted (useful for showing loading spinners)

isSubmitted – true after the form has been submitted.

touchedFields – which fields the user has interacted with.

dirtyFields – which fields the user has modified.




We can use any of these properties by destructuring them from the formState object. We are destructuring the errors property so we can use it later in our form, either to show error messages or to check that there are no errors on the page.
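To make the two-argument form of handleSubmit more concrete, here is a minimal sketch in plain TypeScript of the contract described above: validate first, then call either the success callback or the failure callback. This is a toy model and not React Hook Form's actual implementation. The names createSubmitHandler, ValidationErrors, and SignUpValues are hypothetical and exist only for illustration.

```typescript
// Toy model of the handleSubmit(onSuccess, onFail) contract.
// NOT react-hook-form's implementation; all names here are hypothetical.
type ValidationErrors<T> = Partial<Record<keyof T, string>>;

function createSubmitHandler<T>(
  validateAll: (values: T) => ValidationErrors<T>,
  onSuccess: (values: T) => void,
  onFail?: (errors: ValidationErrors<T>) => void
): (values: T) => void {
  return (values) => {
    const errors = validateAll(values);
    if (Object.keys(errors).length === 0) {
      onSuccess(values); // the form is valid: continue with the submission
    } else if (onFail) {
      onFail(errors); // the form is invalid: report what went wrong
    }
  };
}

type SignUpValues = { fullName: string; email: string };

const submitLog: string[] = [];
const submitSignUp = createSubmitHandler<SignUpValues>(
  (values) => (values.fullName ? {} : { fullName: "Full Name is required" }),
  (values) => submitLog.push(`submitted ${values.email}`),
  (errors) => submitLog.push(`failed: ${errors.fullName}`)
);

submitSignUp({ fullName: "", email: "grant@example.com" });
submitSignUp({ fullName: "Grant", email: "grant@example.com" });
console.log(submitLog);
```

The first call ends up in the failure callback because fullName is empty, and the second reaches the success callback. In the real library, the values come from the registered inputs rather than from an argument.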
Hooking Up the useForm Methods to Our Form
Now that we know more about the useForm() method and react-hook-form, we need to integrate this with our existing <form> element. Doing so will allow us to use all the react-hook-form features we’ve discussed so far in our form.
import { TvIcon } from "@heroicons/react/24/outline";
import { useState } from "react";
import { useForm } from "react-hook-form";

type FormData = {
    fullName: string;
    email: string;
    password: string;
    confirmPassword: string;
    agreeToTerms: boolean;
};

export const RegistrationForm = () => {
    const {
        register,
        handleSubmit,
        formState: { errors },
        watch,
    } = useForm<FormData>();

    const onSubmit = () => {
        alert(`Form submitted`);
    };

    return (
        <div className="flex justify-center items-center w-screen h-screen bg-gray-900">
            <div className="w-full max-w-md p-8 bg-black bg-opacity-75 rounded-lg">
                <div className="flex flex-row justify-center items-center gap-x-4">
                    <TvIcon className="h-12 w-12 text-red-500" />
                    <h1 className="text-7xl font-bold text-center text-white mb-4">Getflix</h1>
                </div>
                <h2 className="text-3xl font-bold text-white mb-6 text-center">
                    Sign Up
                </h2>


                <form onSubmit={handleSubmit(onSubmit)} className="space-y-6">

                    {/* Full Name */}
                    <div>
                        <input
                            {...register("fullName", {
                                required: "Full Name is required"
                            })}
                            aria-required
                            type="text"
                            placeholder="Full name"
                            className="w-full p-3 rounded bg-gray-700 text-white placeholder-gray-400 focus:outline-none focus:ring-2 focus:ring-red-500"
                        />
                        {errors.fullName && (
                            <p className="text-red-500 text-sm mt-1">{errors.fullName.message}</p>
                        )}
                    </div>

                    {/* Email */}
                    <div>
                        <input
                            {...register("email", {
                                required: "Email is required",
                                pattern: {
                                    value: /^\S+@\S+$/i,
                                    message: "Invalid email address",
                                },
                            })}
                            type="email"
                            placeholder="Email Address"
                            className="w-full p-3 rounded bg-gray-700 text-white placeholder-gray-400 focus:outline-none focus:ring-2 focus:ring-red-500"
                        />
                        {errors.email && (
                            <p className="text-red-500 text-sm mt-1">{errors.email.message}</p>
                        )}

                    </div>

                    {/* Password */}
                    <div>
                        <input
                            {...register("password", {
                                required: "Please enter your password",
                            })}
                            type="password"
                            placeholder="Password"
                            className="w-full p-3 rounded bg-gray-700 text-white placeholder-gray-400 focus:outline-none focus:ring-2 focus:ring-red-500"
                        />
                        {errors.password && (
                            <p className="text-red-500 text-sm mt-1">{errors.password.message}</p>
                        )}
                    </div>

                    {/* Confirm Password */}
                    <div>
                        <input
                            {...register("confirmPassword", {
                                required: "Please enter your password",
                                validate: (value) =>
                                    value === watch("password") || "Passwords do not match",
                            })}
                            type="password"
                            placeholder="Confirm Password"
                            className="w-full p-3 rounded bg-gray-700 text-white placeholder-gray-400 focus:outline-none focus:ring-2 focus:ring-red-500"
                        />
                        {errors.confirmPassword && (
                            <p className="text-red-500 text-sm mt-1">{errors.confirmPassword.message}</p>
                        )}
                    </div>

                    {/* Agree to Terms */}
                    <div className="flex items-center text-gray-400 text-sm">
                        <input
                            {...register("agreeToTerms", {
                                required: "You must agree to the terms and conditions"
                            })}
                            type="checkbox"
                            id="agreeToTerms"
                            className="mr-2"
                        />
                        <label className="select-none">
                            I agree to the Terms and Conditions
                        </label>

                    </div>
                    {errors.agreeToTerms && (
                        <p className="text-red-500 text-sm mt-1">{errors.agreeToTerms.message}</p>
                    )}


                    {/* Submit */}
                    <button
                        type="submit"
                        className="w-full py-3 bg-red-600 hover:bg-red-700 text-white rounded font-semibold transition"
                    >
                        Sign Up
                    </button>

                    {/* Already have account */}
                    <p className="text-center text-gray-400 text-sm mt-4">
                        Already have an account?{" "}
                        <a href="#" className="text-red-500 hover:underline">
                            Sign In
                        </a>
                    </p>
                </form>
            </div>
        </div>
    );
};

So in the updated form code, we’ve made a few adjustments:
Registered Each of Our Elements
For each of our elements, we’ve spread the result of register() onto the input and configured some validation rules.
We added the required rule to all input fields, which checks if the element has a value. If not, it marks the field as erroneous, updating the errors object with an entry under the field’s name that holds the provided required message.
 {...register("fullName", {
    required: "Full Name is required"
  })}

We’ve added a pattern property on the email’s register object. This allows us to specify a regular expression that the value of the input must match – perfect for passwords, email fields, and other inputs which may have value restrictions or requirements.
// valid email pattern
pattern: {
    value: /^\S+@\S+$/i,
    message: "Invalid email address",
},
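To see what the required and pattern rules accept and reject, the sketch below applies the same checks in plain TypeScript, outside of React Hook Form. The checkEmail helper and the sample values are hypothetical. Only the regular expression and the two messages come from the form above.

```typescript
// Plain-TypeScript sketch of the `required` + `pattern` rules on the email field.
// `checkEmail` is a hypothetical helper, not part of react-hook-form.
const EMAIL_PATTERN = /^\S+@\S+$/i;

function checkEmail(value: string): string | undefined {
  if (value.length === 0) return "Email is required";
  if (!EMAIL_PATTERN.test(value)) return "Invalid email address";
  return undefined; // no error
}

const emailSamples = [
  "",
  "grant",
  "grant@",
  "grant smith@example.com",
  "grant@example.com",
  "grant@localhost",
];

emailSamples.forEach((sample) => {
  console.log(JSON.stringify(sample), "->", checkEmail(sample) ?? "ok");
});
```

Note that the pattern is deliberately loose: it only requires non-space characters on both sides of an @, so a value like grant@localhost passes. A stricter pattern may be needed if your application requires a domain with a dot in it.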

We have also added the validate property to the confirm password element. This is a custom function that React Hook Form calls with the field’s current value whenever it validates the field.
validate: (value) => value === watch("password") || "Passwords do not match"
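The return convention of a validate function is easier to see in isolation: return true when the value is acceptable, or a string to use as the error message. The sketch below pulls the same comparison out into a standalone function. passwordsMatch is a hypothetical name, and in the real form the second argument would come from watch("password").

```typescript
// Standalone version of the inline validate callback.
// Returns true when the values match, or the error message otherwise.
// `passwordsMatch` is a hypothetical helper name used for illustration.
function passwordsMatch(confirmValue: string, passwordValue: string): true | string {
  return confirmValue === passwordValue || "Passwords do not match";
}

console.log(passwordsMatch("hunter2!", "hunter2!"));
console.log(passwordsMatch("hunter2!", "hunter3!"));
```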

When the validate function inside register runs depends on the form’s validation mode, which is set through the mode option of useForm().
By default (if you do not specify mode), React Hook Form validates when the form is submitted. After that first submit, it re-validates on change events (this is the reValidateMode option, which defaults to onChange). This means that:

Before the first submit → typing into the input does not trigger validate.

After a submit attempt → each change to the input triggers validate again.


If you want to set the validation mode explicitly, you can use the mode setting within useForm() like so:
 const { register, handleSubmit, formState, trigger } = useForm({
    mode: "onSubmit",
  });

If you then want to go an extra step and update the mode per element, overriding the mode setting you just globally set for your form, you can use the trigger() method from useForm like so:
<input
  {...register("email", { required: "Email is required" })}
  onBlur={() => trigger("email")} // validate this field onBlur manually
/>

This allows you to have onSubmit validation set via mode, and then email is triggered via onBlur() too.
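The interaction between these settings can be sketched as a small decision function. This is a simplified model based on the documented behaviour of the mode and reValidateMode options of useForm() (defaults "onSubmit" and "onChange"), not the library's own code. The type and function names are hypothetical, and the "onTouched" mode is left out to keep the sketch short.

```typescript
// Simplified model of when a field's validation runs on a change or blur event.
// Hypothetical names; "onTouched" is intentionally omitted.
type ValidationMode = "onSubmit" | "onBlur" | "onChange" | "all";
type RevalidationMode = "onSubmit" | "onBlur" | "onChange";
type FieldEvent = "change" | "blur";

function runsValidation(
  fieldEvent: FieldEvent,
  wasSubmitted: boolean,
  mode: ValidationMode = "onSubmit",
  reValidateMode: RevalidationMode = "onChange"
): boolean {
  if (mode === "all") return true; // validate on both blur and change
  // `mode` applies before the first submit, `reValidateMode` after it.
  const strategy = wasSubmitted ? reValidateMode : mode;
  if (strategy === "onChange") return fieldEvent === "change";
  if (strategy === "onBlur") return fieldEvent === "blur";
  return false; // "onSubmit": wait for the next submit
}

// With the defaults, typing only triggers validation after a submit attempt.
console.log(runsValidation("change", false));
console.log(runsValidation("change", true));
```

Calling trigger("email") by hand, as in the onBlur example above, sidesteps this decision entirely and validates the field immediately.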
Just adding these simple settings within the react-hook-form library already gives us a much better user experience than before – but it isn’t everything. Let’s explore more settings, HTML, and attributes we can add to increase accessibility and user experience.
Showing Error Messages
Form errors are stored within the formState object we mentioned earlier, but they’re no good there – we need to display them to our users. We can achieve this simply by accessing the destructured errors object, like below:
{errors.password && (
    <p className="text-red-500 text-sm mt-1">{errors.password.message}</p>
)}

The code uses conditional rendering to show the <p> tag only if errors.password has a value, indicating that the useForm() checks found an error on the password field. We can then display the error message from errors.password.message, combined with a commonly used error colour like red, to highlight the form’s problems. This can then be applied to all other input fields as per the code above.
Adding aria-required
So we’ve informed the form that certain elements are required and these should be checked when submitting the form. But this alone doesn’t inform visually impaired users that the element is required.
To aid with screen-readers, we can add an aria attribute to our element which will be read by the screen-reader. This property is the aria-required property. This means that when the screen-reader reads out information about the element it will inform the user that this value is required for successful submission.
 <input
    {...register("fullName", {
        required: "Full Name is required"
    })}
    aria-required
    type="text"
    placeholder="Full name"
    className="w-full p-3 rounded bg-gray-700 text-white placeholder-gray-400 focus:outline-none focus:ring-2 focus:ring-red-500"
/>

Adding fieldset and legend
Fieldset elements group related form controls together, while legend elements provide a description for the grouped controls.
Imagine you have one big form, but it spans two "sections" – for example, a "User Details" section for username, email, and passwords, and an "Address Details" section asking for your shipping and billing information.
In this tutorial, we’re using TailwindCSS, which provides a utility class called sr-only. You can apply sr-only to your legends so they are only visible to screen readers, and not actually visible on the page.
This way, the legend will be read aloud when users navigate into a section of the form, making it clear which part of the form they are interacting with.
Important Note: Legends must be placed inside fieldsets. You need to wrap your legends within a <fieldset> element for your HTML to be valid and accessible.
Here's an unrelated example (to keep it brief and simple):

  <fieldset>
    <legend>Payment Method</legend>
    <label>
      <input type="radio" name="payment" value="card" />
      Credit Card
    </label>

    <label>
      <input type="radio" name="payment" value="paypal" />
      PayPal
    </label>
  </fieldset>

You can see that the payment option inputs have been grouped within a fieldset, and then described by the legend element, informing the user that these elements relate to “Payment Method”. You as the developer can then decide if you would like this shown to everyone, or if it’s only for visually impaired users.
Screen reader users would hear something like:

"Group: Payment Method. Credit Card radio button. PayPal radio button."

Do Not Rely on Placeholders Only!
Placeholders are a great addition to make it clear to the user what the input elements are used for, and show helpful information. But they aren’t that user friendly, especially in regards to screen-readers.
The main reasons for this are:

Placeholders disappear when typing. If a user types “Grant”, tabs away from the input, and then comes back, a screen reader with no label to rely on will simply read the value of the input, not what it relates to.

Often developers utilise a grey-like colour for their placeholders, with a low opacity. This can mean it’s difficult for users to sometimes see the placeholder, especially those who are colour blind or visually impaired.


So what can we do instead? Well, this leads me onto our next point – we can use a common HTML element, the <label>.
Adding Labels and Using htmlFor
Another feature we can add to boost accessibility and user experience for all is the htmlFor attribute combined with the <label> element.
Labels are highly important for both sighted and visually impaired users. They offer clarity as to what the input is associated with, as well as a navigational tool for those using screen-readers.
The htmlFor attribute is used to link <label> elements with their input.
Note: htmlFor attributes can only be used on labels and are not valid on any other element.
<label htmlFor="fullname" className="text-white">Full Name</label>
<input
    {...register("fullName", {
        required: "Full Name is required"
    })}
    id="fullname"
    aria-required
    type="text"
    placeholder="Full name"
    className="w-full p-3 rounded bg-gray-700 text-white placeholder-gray-400 focus:outline-none focus:ring-2 focus:ring-red-500"
/>

Why this is important for accessibility:
1. Screen readers:
When a screen reader lands on the <input>, it automatically reads the associated label ("Full Name"). Even if the label is not visually right next to the input, the screen reader still knows which text describes the input, giving you some freedom when designing your forms.
2. Click behaviour:
When you click the <label>, it automatically focuses the <input> it is linked to through htmlFor.
Users don’t have to click exactly on the tiny input field – and this can certainly be useful when dealing with checkboxes or radio buttons, for example.
In short, big click targets = better usability and faster form filling.
This is also very helpful for mobile users where precision tapping is hard, especially on smaller screens.
Give Additional Information With aria-describedby
Now that we’ve added clear labels to our form fields, we can take accessibility a step further by providing additional guidance for users when errors occur. By using aria-describedby and aria-invalid, we can link helpful error messages to the input fields and ensure screen readers communicate validation issues clearly. Let’s look at how to implement this:
<div>
  <label htmlFor="email" className="text-white">Email</label>
  <input
    {...register("email", {
      required: "You must enter an email address",
      pattern: {
        value: /^\S+@\S+$/i,
        message: "Invalid email address",
      },
    })}
    id="email"
    type="email"
    aria-invalid={errors.email ? "true" : "false"}
    aria-describedby={errors.email ? "email-error" : undefined}
    placeholder="Enter your email address"
    className="w-full p-3 rounded bg-gray-700 text-white placeholder-gray-400 focus:outline-none focus:ring-2 focus:ring-red-500"
  />
  {errors.email && (
    <p id="email-error" className="text-red-500 text-sm mt-1">
      {errors.email.message}
    </p>
  )}
</div>

Notice the two new attributes we’ve added:

aria-describedby – this attribute links our error message with our input. Screen readers will therefore read out the error message along with the other information when the input is focused.

aria-invalid – this attribute again aids screen readers, informing the user that the input’s value is invalid and they need to correct it. Combined with aria-describedby, it gives visually impaired users all the information they need in order to correct their mistake.
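Since every field needs the same pair of attributes, one option is to compute them in a small helper and spread the result onto the input. The sketch below assumes the "<field id>-error" naming used for email-error above. errorAriaProps and ErrorAriaProps are hypothetical names, not part of React or React Hook Form.

```typescript
// Builds the aria-invalid / aria-describedby pair for one field.
// Assumes the error element's id is "<fieldId>-error", as with "email-error".
// `errorAriaProps` is a hypothetical helper used for illustration.
type ErrorAriaProps = {
  "aria-invalid": "true" | "false";
  "aria-describedby": string | undefined;
};

function errorAriaProps(fieldId: string, hasError: boolean): ErrorAriaProps {
  return {
    "aria-invalid": hasError ? "true" : "false",
    // Only point at the error element while it is actually rendered.
    "aria-describedby": hasError ? `${fieldId}-error` : undefined,
  };
}

console.log(errorAriaProps("email", true));
console.log(errorAriaProps("email", false));
```

In JSX this could be used as {...errorAriaProps("email", Boolean(errors.email))}, alongside the register() spread.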


Avoid Tooltips for Critical Information
When developing your form, try to avoid tooltips (those little elements that show when you hover over another element for a period of time like below).

The problems with using tooltips are:

They often require mouse hover, which doesn't work on touch devices (for example mobile phones, or tablets).

They aren’t announced reliably by screen readers if proper aria labels aren’t added.

They disappear too quickly


Instead, we can use inline helper text or descriptions combined with aria-describedby like below:
<p id="passwordHint" className="text-xs text-gray-500">
  Must be at least 8 characters and include a number.
</p>

We can then reference this within our input using the aria-describedby attribute. But wait, we already have an aria-describedby pointing at the error message – well, that’s ok! We can link multiple elements, like the brief example below:
// now references both passwordHint and the password error (we separate the ids with a space)
<input 
  id="password"
  aria-describedby="passwordHint passwordError"
/>

<div id="passwordHint">
  Must be at least 8 characters long.
</div>

<div id="passwordError">
  Passwords do not match!
</div>
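When the error element is only rendered some of the time, the list of ids has to change with it. A minimal sketch of one way to build that space-separated value is shown below. describedByIds is a hypothetical helper, and the ids match the example above.

```typescript
// Joins the ids that are currently present into one aria-describedby value.
// Falsy entries are skipped, so conditional ids can be passed inline.
// `describedByIds` is a hypothetical helper used for illustration.
function describedByIds(...ids: Array<string | false | undefined>): string | undefined {
  const present = ids.filter(
    (id): id is string => typeof id === "string" && id.length > 0
  );
  return present.length > 0 ? present.join(" ") : undefined;
}

const hasPasswordError = true;
console.log(describedByIds("passwordHint", hasPasswordError && "passwordError"));
console.log(describedByIds("passwordHint", false));
```

Returning undefined when nothing is present means React omits the attribute entirely instead of rendering an empty aria-describedby.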

Tell Me Something Important
aria-live is an aria attribute you can add to an element to tell screen readers:

"Hey, if the content inside me changes, announce it automatically."

It makes dynamic content updates audible without needing the user to re-focus anything.
A basic example could look something like below, where a message which is updated upon submission is updated, it could contain something like:

“Loading” → “Hurray, registration complete”
or
““Pending” → “Registration failed due to many errors”

<p aria-live="polite">
  {formSubmissionResultMessage}
</p>

When formSubmissionResultMessage changes, screen readers will automatically announce the updated message.
The timing of when it is read out depends on the value of the aria-live attribute – with polite, the announcement waits for a natural pause. With assertive, it interrupts immediately.
Real-World Examples
Polite update: good for passive notifications
<p aria-live="polite" className="mt-2 text-green-500">
  Form saved successfully.
</p>

The screen reader waits for a good moment to say it.
Assertive update: good for urgent errors
<p aria-live="assertive" className="mt-2 text-red-500">
  Passwords do not match!
</p>

The screen reader immediately interrupts and announces it.
Good things to know:

The element needs to already exist in the DOM when the update happens. So it’s smart to always render the aria-live element and just update its content.

Don’t overuse assertive, or you’ll annoy users and make apps feel super noisy and overwhelming.
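
One way to follow both points is to keep a polite and an assertive region permanently rendered and decide in code which one receives the message. The sketch below assumes hypothetical status names (`loading`, `success`, `error`) and a hypothetical `routeAnnouncement` helper. The messages are the ones from the example above.

```javascript
// Hypothetical mapping from submission status to the message shown to the user.
const MESSAGES = new Map([
  ["loading", "Loading"],
  ["success", "Hurray, registration complete"],
  ["error", "Registration failed due to many errors"],
]);

// Decides which always-rendered live region should carry the message.
// Only errors go to the assertive region; everything else stays polite.
// Unknown statuses produce two empty strings, so nothing is announced.
function routeAnnouncement(status) {
  const text = MESSAGES.get(status) ?? "";
  const urgent = status === "error";
  return {
    polite: urgent ? "" : text,
    assertive: urgent ? text : "",
  };
}

// Possible usage inside a component (both elements stay in the DOM at all times):
//   const { polite, assertive } = routeAnnouncement(status);
//   <p aria-live="polite">{polite}</p>
//   <p aria-live="assertive">{assertive}</p>
```

Routing the text between two fixed regions, rather than switching the aria-live value on one element, keeps each region’s behaviour constant and limits assertive interruptions to genuine errors.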


Focus States and Colouring
You may have noticed that I have added some custom colouring to the input elements using TailwindCSS classes with the focus: prefix. But what is this doing?
Well, this allows us to control the focus colour of the inputs. Without this, the browser will apply its own default styling which may not be as accessible to our users, especially those with colour-blindness.
For example, within our form, without the styling the input with focus looks like this:

Here you can see it has applied a subtle white and blue outline – but it’s not that clear that the input is focused. You could argue it is different enough from the other input elements, but for some users this may not be enough.
To combat this and improve usability, we can override this with our own custom colouring. When using TailwindCSS, we can apply the following class names:
focus:outline-none focus:ring-2 focus:ring-red-500

What Does This Do?
This now applies a much thicker red line (encompassing brand colours) and makes the focused input stand out more clearly against the darker background.




Class name	Meaning (CSS equivalent)
`focus:outline-none`	Remove the outline when the element is focused
`focus:ring-2`	On focus, apply a 2px wide ring (like a border/shadow)
`focus:ring-red-500`	Set the ring colour to Tailwind’s `red-500` colour



If you’re not using TailwindCSS, you can accomplish the same with plain CSS like so:
input:focus {
  outline: none; /* no default browser outline */
  box-shadow: 0 0 0 2px #ef4444; /* 2px red ring around input */
}
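
Whichever approach you use, it can help to check that the ring colour really does stand out against the background. The sketch below computes the WCAG contrast ratio between two hex colours. `#ef4444` is the ring colour from the CSS above, while the dark background `#111827` is an assumed value for illustration, since the form’s actual background colour isn’t given here. WCAG 2.1’s non-text contrast criterion asks for a ratio of at least 3:1 for user interface components.

```javascript
// Converts an 8-bit sRGB channel (0-255) to its linear value (WCAG 2.x definition).
function linearize(channel) {
  const c = channel / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a "#rrggbb" colour, from 0 (black) to 1 (white).
function luminance(hex) {
  const value = parseInt(hex.slice(1), 16);
  const r = (value >> 16) & 0xff;
  const g = (value >> 8) & 0xff;
  const b = value & 0xff;
  return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b);
}

// Contrast ratio between two colours; the argument order does not matter.
function contrastRatio(hexA, hexB) {
  const a = luminance(hexA);
  const b = luminance(hexB);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

const RING_COLOUR = "#ef4444";        // from the CSS example above
const ASSUMED_BACKGROUND = "#111827"; // assumption: a dark page background

// true when the ring meets the 3:1 minimum against the assumed background
const ringIsVisibleEnough = contrastRatio(RING_COLOUR, ASSUMED_BACKGROUND) >= 3;
```

Swapping in your own ring and background colours gives a quick pass/fail check before relying on a custom focus style.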

Make Buttons Descriptive
A super simple way to level up your form’s user experience is to make sure your buttons use clear, descriptive text.
Let’s take a look at a few examples of buttons that don’t quite achieve this:

The above buttons are examples of poor input buttons because:

“Click Here” doesn’t give any context. Screen reader users, and even sighted users, have no idea what "click here" does without reading nearby text.

Icon Only: Sighted users might guess what the icon means, but screen readers see nothing unless you add aria-label. The point is, it is ambiguous and unclear as to what the button does. You may see websites that just use an icon, not surrounded by a button, which can be even more confusing.

“Submit”: If you have several "Submit" buttons (for example, one for payment, one for contact form), users don't know which "submit" is doing what.
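
The problems above can be caught early with a small lint-style check. This is only a sketch: `auditButton` and its list of vague labels are hypothetical and deliberately tiny, and the function only looks at the visible text and `aria-label` values you pass in, not at a rendered DOM.

```javascript
// Assumption: an illustrative, non-exhaustive list taken from the examples above.
// "submit" is flagged unconditionally for simplicity, although it is mainly a
// problem when a page has several Submit buttons.
const VAGUE_LABELS = new Set(["click here", "submit"]);

// Returns a list of issues for one button description.
// aria-label takes precedence over the visible text as the accessible name.
function auditButton({ text = "", ariaLabel = "" } = {}) {
  const name = (ariaLabel || text).trim();
  if (name === "") {
    return ["no accessible name: add text or an aria-label"];
  }
  if (VAGUE_LABELS.has(name.toLowerCase())) {
    return [`vague label "${name}": describe the action instead`];
  }
  return [];
}
```

For example, `auditButton({ text: "Click Here" })` reports a vague label, an icon-only button passed as `auditButton({})` reports a missing name, and `auditButton({ text: "Pay Now" })` returns an empty list.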


Improvements
Instead, we can improve those buttons to be more accessible and user-friendly by doing the following:

Use descriptive button text – for example: "Pay Now", "Sign Up", or "Save Changes".

Use both an icon and text – combining an icon with text can be the perfect blend for both accessibility and design.

Use aria-label – if you really must use an icon-only button (like a basket or home icon in a navigation bar), make sure to add an aria-label attribute to clearly describe the button’s purpose, like so:


<button 
    type="submit"
    className="w-full py-3 px-6 rounded-lg bg-red-600 hover:bg-red-700 focus:outline-none focus:ring-2 focus:ring-red-500 text-white text-lg font-semibold transition"
>
    Pay Now
</button>

<button
    type="submit"
    className="w-full py-3 px-6 rounded-lg bg-blue-600 hover:bg-blue-700 focus:outline-none focus:ring-2 focus:ring-blue-500 text-white text-lg font-semibold flex justify-center items-center gap-2 transition">
        <HomeIcon className="h-6 w-6" />
        Home
</button>

<button
    type="submit"
    aria-label="Go to homepage"
    className="w-full py-3 px-6 rounded-lg bg-blue-600 hover:bg-blue-700 focus:outline-none focus:ring-2 focus:ring-blue-500 text-white text-lg font-semibold flex justify-center items-center transition">
        <HomeIcon className="h-6 w-6" />
</button>

That code would generate the following:

Final Thoughts
In this tutorial, we’ve covered various ways to make your forms more accessible and user-friendly. From simple things like making button text clearer and using more user-friendly colours, to more complex HTML attributes like aria-describedby and aria-live, you should be covered.
I hope you found this tutorial helpful, and now you’re ready to take your development skills to the next level. Making these simple changes can have a big impact on your users’ experience, and they’ll definitely stick around longer and be less frustrated.
As always, if you’d like to share feedback on the article, discuss it further, or just hear about future articles or content, you can drop me a follow on X (Twitter) via my handle @grantdotdev.

Task	Windows (PowerShell)	macOS / Linux (Terminal)
Create venv	`py -3.12 -m venv .venv`	`python3 -m venv .venv`
Activate venv	`.\.venv\Scripts\Activate`	`source .venv/bin/activate`
Install Whisper	`pip install openai-whisper`	`pip install openai-whisper`
Install FFmpeg	Download build → unzip → add to PATH or copy `ffmpeg.exe`	`brew install ffmpeg` (macOS), `sudo apt install ffmpeg` (Linux)
Run STT script	`python transcribe.py`	`python3 transcribe.py`
Install TTS deps	`pip install transformers torch soundfile sentencepiece`	`pip install transformers torch soundfile sentencepiece`
Run TTS script	`python tts.py`	`python3 tts.py`
Install OpenAI client (API)	`pip install openai`	`pip install openai`
Run API script	`python transcribe_api.py`	`python3 transcribe_api.py`

Feature	Local Whisper (on your machine)	OpenAI Whisper API (cloud)
Setup	Needs Python packages + FFmpeg	Just install `openai` client + set API key
Hardware	Runs on your CPU (slower) or GPU (faster)	Runs on OpenAI’s servers (no local compute needed)
Cost	✅ Free after initial download	💳 Pay per minute of audio (after free trial quota)
Internet required	❌ No (fully offline once installed)	✅ Yes (uploads audio to OpenAI servers)
Accuracy	Very good - depends on model size (tiny → large)	Consistently strong - optimized by OpenAI
Speed	Slower on CPU, faster with GPU	Fast (uses OpenAI’s infrastructure)
Privacy	Audio never leaves your machine	Audio is sent to OpenAI (data handling per policy)

Tool	Purpose
Python	Primary programming language
Mediapipe	Real-time hand tracking and gesture detection
OpenCV	Webcam input and video display
NumPy	Data processing
Scikit-learn	Gesture classification

Screen Reader	OS	Shortcut
VoiceOver	macOS	`Control + Option + U` (to open Rotor), then arrow keys to navigate
NVDA	Windows	`D` to move to the next landmark
JAWS	Windows	`R` to cycle through regions
Narrator	Windows	`Caps Lock + Right Arrow` to move by landmark
ChromeVox	Chrome OS	`Search + Left/Right Arrow` to move between landmarks
