<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Voice - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Voice - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Tue, 12 May 2026 15:09:56 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/voice/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Production-Ready Voice Agent Architecture with WebRTC ]]>
                </title>
                <description>
                    <![CDATA[ In this tutorial, you'll build a production-ready voice agent architecture: a browser client that streams audio over WebRTC (Web Real-Time Communication), a backend that mints short-lived session toke ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-production-ready-voice-agents/</link>
                <guid isPermaLink="false">69ab2f260bca1a3976458b2a</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Accessibility ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Voice ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nataraj Sundar ]]>
                </dc:creator>
                <pubDate>Fri, 06 Mar 2026 19:46:46 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/c61b4358-66d9-434d-8555-d8921313e573.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this tutorial, you'll build a production-ready voice agent architecture: a browser client that streams audio over WebRTC (Web Real-Time Communication), a backend that mints short-lived session tokens, an agent runtime that orchestrates speech and tools safely, and generates post-call artifacts for downstream workflows.</p>
<p>This article is intentionally vendor-neutral. You can implement these patterns using any AI voice platform that supports WebRTC (directly or via an SFU, selective forwarding unit) and server-side token minting. The goal is to help you ship a voice agent architecture that is secure, observable, and operable in production.</p>
<blockquote>
<p><em>Disclosure: This article reflects my personal views and experience. It does not represent the views of my employer or any vendor mentioned.</em></p>
</blockquote>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#what-youll-build">What You'll Build</a></p>
</li>
<li><p><a href="#how-to-avoid-common-production-failures-in-voice-agents">How to Avoid Common Production Failures in Voice Agents</a></p>
</li>
<li><p><a href="#how-to-design-a-latency-budget-for-a-real-time-voice-agent">How to Design a Latency Budget for a Real-Time Voice Agent</a></p>
</li>
<li><p><a href="#production-voice-agent-architecture-vendor-neutral">Production Voice Agent Architecture (Vendor-Neutral)</a></p>
<ul>
<li><p><a href="#step-0-set-up-the-project">Step 0: Set Up the Project</a></p>
</li>
<li><p><a href="#step-1-keep-credentials-server-side">Step 1: Keep Credentials Server-side</a></p>
</li>
<li><p><a href="#step-2-build-a-backend-token-endpoint">Step 2: Build a Backend Token Endpoint</a></p>
</li>
<li><p><a href="#step-3-connect-from-the-web-client-webrtc--sfu">Step 3: Connect from the Web Client (WebRTC + SFU)</a></p>
</li>
<li><p><a href="#step-4-add-client-actions-agent-suggests-app-executes">Step 4: Add Client Actions (Agent Suggests, App Executes)</a></p>
</li>
<li><p><a href="#step-5-add-tool-integrations-safely">Step 5: Add Tool Integrations Safely</a></p>
</li>
<li><p><a href="#step-6-add-post-call-processing-where-durable-value-appears">Step 6: Add post-call processing (where durable value appears)</a></p>
</li>
</ul>
</li>
<li><p><a href="#production-readiness-checklist">Production readiness checklist</a></p>
</li>
<li><p><a href="#closing">Closing</a></p>
</li>
</ul>
<h2 id="heading-what-youll-build">What You'll Build</h2>
<p>By the end, you'll have:</p>
<ul>
<li><p>A web client that streams microphone audio and plays agent audio.</p>
</li>
<li><p>A backend token endpoint that keeps credentials server-side.</p>
</li>
<li><p>A safe coordination channel between the agent and the application.</p>
</li>
<li><p>Structured messages between the application and the agent.</p>
</li>
<li><p>A production checklist for security, reliability, observability, and cost control.</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You should be comfortable with:</p>
<ul>
<li><p>JavaScript or TypeScript</p>
</li>
<li><p>Node.js 18+ (so <code>fetch</code> works server-side) and an HTTP framework (Express in examples)</p>
</li>
<li><p>Browser microphone permissions</p>
</li>
<li><p>Basic WebRTC concepts (high level is fine)</p>
</li>
</ul>
<h2 id="heading-tldr">TL;DR</h2>
<p>A <strong>production-ready voice agent</strong> needs:</p>
<ul>
<li><p>A <strong>server-side token service</strong> (no secrets in the browser)</p>
</li>
<li><p>A <strong>real-time media plane</strong> (WebRTC) for low-latency audio</p>
</li>
<li><p>A <strong>data channel</strong> for structured messages between your app and the agent</p>
</li>
<li><p><strong>Tool guardrails</strong> (allowlists, confirmations, timeouts, audit logs)</p>
</li>
<li><p><strong>Post-call processing</strong> (summary, actions, CRM (Customer Relationship Management), tickets)</p>
</li>
<li><p><strong>Observability-first</strong> implementation (state transitions + metrics)</p>
</li>
</ul>
<h2 id="heading-how-to-avoid-common-production-failures-in-voice-agents">How to Avoid Common Production Failures in Voice Agents</h2>
<p>If you've operated distributed systems, you've seen most failures happen at boundaries:</p>
<ul>
<li><p>timeouts and partial connectivity</p>
</li>
<li><p>retries that amplify load</p>
</li>
<li><p>unclear ownership between components</p>
</li>
<li><p>missing observability</p>
</li>
<li><p>“helpful automation” that becomes unsafe</p>
</li>
</ul>
<p>Voice agents amplify those risks because:</p>
<p><strong>Latency is User Experience</strong>: A slow agent feels broken. Conversational UX is less forgiving than web UX.</p>
<p><strong>Audio + UI + Tools is a Distributed System</strong>: You coordinate browser audio capture, WebRTC transport, STT (speech-to-text), model reasoning, tool calls, TTS (text-to-speech), and playback buffering. Each stage has different clocks and failure modes.</p>
<p><strong>Security Boundaries are Non-negotiable</strong>: A leaked API key is catastrophic. A tool misfire can trigger real-world side effects.</p>
<p><strong>Debuggability determines whether you can ship</strong>: If you don't log state transitions and capture post-call artifacts, you can't operate or improve the system safely.</p>
<h2 id="heading-how-to-design-a-latency-budget-for-a-real-time-voice-agent">How to Design a Latency Budget for a Real-Time Voice Agent</h2>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/694ca88d5ac09a5d68c63854/8bb5c6d5-4250-457b-94a2-fcb748050731.png" alt="Latency budget for a real-time voice agent showing mic capture, network RTT, STT, reasoning, tools, TTS, and playback buffering." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Conversations have a “feel.” That feel is mostly latency.</p>
<p>A practical guideline:</p>
<ul>
<li><p>Under <strong>~200ms</strong> feels instant</p>
</li>
<li><p><strong>300–500ms</strong> feels responsive</p>
</li>
<li><p>Over <strong>~700ms</strong> feels broken</p>
</li>
</ul>
<p>Your end-to-end latency is the sum of mic capture, network RTT (round-trip time), STT, reasoning, tool execution, TTS, and playback buffering. Budget for it explicitly or you’ll ship a technically correct system that users perceive as unintelligent.</p>
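<p>If it helps to make that budget concrete, here is a minimal sketch that writes the stages down as numbers and checks the sum against the thresholds above. Every figure is an illustrative assumption, not a measurement from any particular stack:</p>
<pre><code class="language-javascript">// Illustrative latency budget: every number here is an assumption; replace with your own measurements.
const latencyBudgetMs = {
  micCapture: 30,
  networkRtt: 80,
  stt: 150,
  reasoning: 200,
  toolCalls: 0,      // budget separately for turns that invoke tools
  tts: 120,
  playbackBuffer: 60,
};

const total = Object.values(latencyBudgetMs).reduce((sum, ms) =&gt; sum + ms, 0);
console.log(`End-to-end budget: ${total}ms`); // 640ms with the numbers above

if (total &gt; 700) {
  console.warn("Over the ~700ms 'feels broken' threshold; trim a stage or parallelize.");
}
</code></pre>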
<h2 id="heading-how-to-design-a-production-voice-agent-architecture-vendor-neutral">How to Design a Production Voice Agent Architecture (Vendor-Neutral)</h2>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/694ca88d5ac09a5d68c63854/f0411ddc-d3fb-48e4-be72-37d9765bf0a7.png" alt="Production-ready voice agent architecture showing web client, token service, WebRTC real-time plane, agent runtime, tool layer, and post-call processing." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>A scalable <strong>voice agent architecture</strong> typically has these layers:</p>
<ol>
<li><p><strong>Web client</strong>: mic capture, audio playback, UI state</p>
</li>
<li><p><strong>Token service</strong>: short-lived session tokens (secrets stay server-side)</p>
</li>
<li><p><strong>Real-time plane</strong>: WebRTC media + a data channel</p>
</li>
<li><p><strong>Agent runtime</strong>: STT → reasoning → TTS, plus tool orchestration</p>
</li>
<li><p><strong>Tool layer</strong>: external actions behind safety controls</p>
</li>
<li><p><strong>Post-call processor</strong>: summary + structured outputs after the session ends</p>
</li>
</ol>
<p>This separation makes failure domains and trust boundaries explicit.</p>
<h2 id="heading-step-0-set-up-the-project">Step 0: Set Up the Project</h2>
<p>Create a new project directory:</p>
<pre><code class="language-shell">mkdir voice-agent-app
cd voice-agent-app
npm init -y
npm pkg set type=module
npm pkg set scripts.start="node server.js"
</code></pre>
<p>Install dependencies:</p>
<pre><code class="language-shell">npm install express dotenv
</code></pre>
<p>Create this folder structure:</p>
<pre><code class="language-plaintext">voice-agent-app/
├── server.js
├── .env
└── public/
    ├── index.html
    └── client.js
</code></pre>
<p>Add a <code>.env</code> file:</p>
<pre><code class="language-shell">VOICE_PLATFORM_URL=https://your-provider.example
VOICE_PLATFORM_API_KEY=your_api_key_here
</code></pre>
<p>Now you’re ready to implement each part of the system.</p>
<h2 id="heading-step-1-keep-credentials-server-side">Step 1: Keep Credentials Server-side</h2>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/694ca88d5ac09a5d68c63854/d522fdf2-bb96-4531-b4ff-3a364336178c.png" alt="Security trust boundary diagram showing browser as untrusted zone and backend/tooling as trusted zone with secrets server-side." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Treat every API key like production credentials:</p>
<ul>
<li><p>store it in environment variables or a secrets manager</p>
</li>
<li><p>rotate it if exposed</p>
</li>
<li><p>never embed it in browser or mobile apps</p>
</li>
<li><p>avoid logging secrets (log only a short suffix if necessary; see the masking helper below)</p>
</li>
</ul>
<p>Even if a vendor supports CORS, the browser is not a safe place for long-lived credentials.</p>
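<p>To follow the "log only a short suffix" rule without ever printing the full credential, a small masking helper is enough. This is a minimal sketch, not part of any vendor SDK:</p>
<pre><code class="language-javascript">// Mask a secret so logs show only a short, non-identifying suffix.
function maskSecret(secret, visibleChars = 4) {
  if (typeof secret !== "string" || secret.length &lt;= visibleChars) return "****";
  return `****${secret.slice(-visibleChars)}`;
}

// Example: confirm which key is loaded without exposing it.
console.log("Voice platform key:", maskSecret(process.env.VOICE_PLATFORM_API_KEY || ""));
</code></pre>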
<h2 id="heading-step-2-build-a-backend-token-endpoint">Step 2: Build a Backend Token Endpoint</h2>
<p>Your backend should:</p>
<ul>
<li><p>authenticate the user</p>
</li>
<li><p>mint a short-lived session token using your platform API</p>
</li>
<li><p>return only what the client needs (URL + token + expiry)</p>
</li>
</ul>
<h3 id="heading-create-serverjs-nodejs-express">Create server.js (Node.js + Express)</h3>
<pre><code class="language-javascript">import express from "express";
import dotenv from "dotenv";
import path from "path";
import { fileURLToPath } from "url";

dotenv.config();

const app = express();
app.use(express.json());

// Serve the web client from /public
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);
app.use(express.static(path.join(__dirname, "public")));

const VOICE_PLATFORM_URL = process.env.VOICE_PLATFORM_URL;
const VOICE_PLATFORM_API_KEY = process.env.VOICE_PLATFORM_API_KEY;

app.post("/api/voice-token", async (req, res) =&gt; {
  res.setHeader("Cache-Control", "no-store");

  try {
    if (!VOICE_PLATFORM_URL || !VOICE_PLATFORM_API_KEY) {
      return res.status(500).json({
        error: "Missing VOICE_PLATFORM_URL or VOICE_PLATFORM_API_KEY in .env",
      });
    }

    // TODO: Authenticate the caller before minting tokens.

    const r = await fetch(`${VOICE_PLATFORM_URL}/api/v1/token`, {
      method: "POST",
      headers: {
        "X-API-Key": VOICE_PLATFORM_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ participant_name: "Web User" }),
    });

    if (!r.ok) {
      const detail = await r.text().catch(() =&gt; "");
      return res.status(r.status).json({ error: "Token request failed", detail });
    }

    const data = await r.json();

    res.json({
      rtc_url: data.rtc_url || data.livekit_url,
      token: data.token,
      expires_in: data.expires_in,
    });
  } catch (err) {
    res.status(500).json({ error: "Failed to mint token" });
  }
});

app.listen(3000, () =&gt; console.log("Open http://localhost:3000"));
</code></pre>
<h3 id="heading-run-the-server">Run the server</h3>
<pre><code class="language-shell">npm start
</code></pre>
<p>Then open: <a href="http://localhost:3000">http://localhost:3000</a></p>
<h3 id="heading-how-this-code-works">How this code works</h3>
<ul>
<li><p>You load credentials from environment variables so secrets never enter the browser.</p>
</li>
<li><p>The <code>/api/voice-token</code> endpoint calls the voice platform’s token API.</p>
</li>
<li><p>You return only the <code>rtc_url</code>, <code>token</code>, and expiration time.</p>
</li>
<li><p>The browser never sees the API key.</p>
</li>
<li><p>If the provider returns an error, you forward a structured error response.</p>
</li>
</ul>
<h3 id="heading-production-notes"><strong>Production Notes</strong></h3>
<ul>
<li><p>rate-limit /api/voice-token (cost + abuse control; see the sketch below)</p>
</li>
<li><p>instrument token mint latency and error rate</p>
</li>
<li><p>keep TTL short and handle refresh/reconnect</p>
</li>
<li><p>return minimal fields</p>
</li>
</ul>
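<p>As one example of the rate-limiting note above, here is a sketch that protects the endpoint with the <code>express-rate-limit</code> package (an assumption: it isn't installed in Step 0, so run <code>npm install express-rate-limit</code> first and tune the numbers to your traffic):</p>
<pre><code class="language-javascript">import rateLimit from "express-rate-limit";

// Allow a small number of token mints per IP per minute; values are illustrative.
const tokenLimiter = rateLimit({
  windowMs: 60_000,
  max: 10,
  standardHeaders: true,
  legacyHeaders: false,
});

// Register this before the /api/voice-token route in server.js so it runs first.
app.use("/api/voice-token", tokenLimiter);
</code></pre>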
<h2 id="heading-step-3-connect-from-the-web-client-webrtc-sfu">Step 3: Connect from the Web Client (WebRTC + SFU)</h2>
<p>In this step, you'll build a minimal web UI that:</p>
<ul>
<li><p>Requests a short-lived token from your backend</p>
</li>
<li><p>Connects to a real-time WebRTC room (often via an SFU)</p>
</li>
<li><p>Plays the agent's audio track</p>
</li>
<li><p>Captures and publishes microphone audio</p>
</li>
</ul>
<h3 id="heading-create-publicindexhtml">Create <code>public/index.html</code></h3>
<pre><code class="language-html">&lt;!doctype html&gt;
&lt;html&gt;
  &lt;head&gt;
    &lt;meta charset="UTF-8" /&gt;
    &lt;meta name="viewport" content="width=device-width,initial-scale=1" /&gt;
    &lt;title&gt;Voice Agent Demo&lt;/title&gt;
  &lt;/head&gt;
  &lt;body&gt;
    &lt;h1&gt;Voice Agent Demo&lt;/h1&gt;

    &lt;button id="startBtn"&gt;Start Call&lt;/button&gt;
    &lt;button id="endBtn" disabled&gt;End Call&lt;/button&gt;

    &lt;p id="status"&gt;Idle&lt;/p&gt;

    &lt;script type="module" src="/client.js"&gt;&lt;/script&gt;
  &lt;/body&gt;
&lt;/html&gt;
</code></pre>
<h3 id="heading-create-publicclientjs">Create <code>public/client.js</code></h3>
<p>Note: This uses a LiveKit-style client SDK to demonstrate the pattern. If you're using a different provider, swap this import and the connect/publish calls for your provider's WebRTC client.</p>
<pre><code class="language-javascript">import { Room, RoomEvent, Track } from "https://unpkg.com/livekit-client@2.10.1/dist/livekit-client.esm.mjs";

const startBtn = document.getElementById("startBtn");
const endBtn = document.getElementById("endBtn");
const statusEl = document.getElementById("status");

let room = null;
let intentionallyDisconnected = false;
let audioEls = [];

function setStatus(text) {
  statusEl.textContent = text;
}

function detachAllAudio() {
  for (const el of audioEls) {
    try { el.pause?.(); } catch {}
    el.remove();
  }
  audioEls = [];
}

async function mintToken() {
  const res = await fetch("/api/voice-token", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ participant_name: "Web User" }),
    cache: "no-store",
  });

  if (!res.ok) {
    const detail = await res.text().catch(() =&gt; "");
    throw new Error(`Token request failed: ${detail || res.status}`);
  }

  const { rtc_url, token } = await res.json();
  if (!rtc_url || !token) throw new Error("Token response missing rtc_url or token");
  return { rtc_url, token };
}

function wireRoomEvents(r) {
  // 1) Play the agent audio track when subscribed
  r.on(RoomEvent.TrackSubscribed, (track) =&gt; {
    if (track.kind !== Track.Kind.Audio) return;

    const el = track.attach();
    audioEls.push(el);
    document.body.appendChild(el);

    // Autoplay restrictions vary by browser/device.
    el.play?.().catch(() =&gt; {
      setStatus("Connected (audio may be blocked — click the page to enable)");
    });
  });

  // 2) Reconnect on disconnect (token expiry often shows up this way)
  r.on(RoomEvent.Disconnected, async () =&gt; {
    if (intentionallyDisconnected) return;
    setStatus("Disconnected (reconnecting...)");
    await attemptReconnect();
  });
}

async function connectOnce() {
  const { rtc_url, token } = await mintToken();

  const r = new Room();
  wireRoomEvents(r);

  await r.connect(rtc_url, token);

  // Mic permission + publish mic
  try {
    await r.localParticipant.setMicrophoneEnabled(true);
  } catch {
    try { r.disconnect(); } catch {}
    throw new Error("Microphone access denied. Allow mic permission and try again.");
  }

  return r;
}

async function startCall() {
  if (room) return;

  intentionallyDisconnected = false;
  setStatus("Connecting...");

  room = await connectOnce();

  setStatus("Connected");
  startBtn.disabled = true;
  endBtn.disabled = false;
}

async function stopCall() {
  intentionallyDisconnected = true;

  try {
    await room?.localParticipant?.setMicrophoneEnabled(false);
  } catch {}

  try {
    room?.disconnect();
  } catch {}

  room = null;
  detachAllAudio();

  setStatus("Disconnected");
  startBtn.disabled = false;
  endBtn.disabled = true;
}

async function attemptReconnect() {
  // Simplified exponential backoff reconnect.
  // In production, add jitter, max attempts, and better error classification.
  const delaysMs = [250, 500, 1000, 2000];

  for (const delay of delaysMs) {
    if (intentionallyDisconnected) return;

    try {
      // Tear down current state before reconnecting
      try { room?.disconnect(); } catch {}
      room = null;
      detachAllAudio();

      await new Promise((r) =&gt; setTimeout(r, delay));

      room = await connectOnce();
      setStatus("Reconnected");
      startBtn.disabled = true;
      endBtn.disabled = false;
      return;
    } catch {
      // keep retrying
    }
  }

  setStatus("Disconnected (reconnect failed)");
  startBtn.disabled = false;
  endBtn.disabled = true;
}

startBtn.addEventListener("click", async () =&gt; {
  try {
    await startCall();
  } catch (err) {
    setStatus(err?.message || "Connection failed");
    startBtn.disabled = false;
    endBtn.disabled = true;
    room = null;
    detachAllAudio();
  }
});

endBtn.addEventListener("click", async () =&gt; {
  await stopCall();
});
</code></pre>
<h3 id="heading-how-this-step-works-and-why-these-details-matter">How this Step works (and why these details matter)</h3>
<ul>
<li><p>The Start button gives you a user gesture so browsers are more likely to allow audio playback.</p>
</li>
<li><p>Mic permission is handled explicitly: if the user denies access, you show a clear error and avoid a half-connected session.</p>
</li>
<li><p>Disconnect cleanup removes audio elements so you don't leak resources across retries.</p>
</li>
<li><p>The reconnect loop demonstrates the production pattern: if a disconnect happens (often due to token expiry or network churn), the client re-mints a token and reconnects.</p>
</li>
</ul>
<p>In the next step, you'll add a structured data-channel handler to safely process agent-suggested “client actions”.</p>
<h3 id="heading-handle-these-explicitly"><strong>Handle These Explicitly</strong></h3>
<h3 id="heading-autoplay-restriction-example">Autoplay Restriction Example</h3>
<p>Add this to <code>index.html</code>:</p>
<pre><code class="language-html">&lt;button id="startBtn"&gt;Start Call&lt;/button&gt;
&lt;button id="endBtn" disabled&gt;End Call&lt;/button&gt;
&lt;div id="status"&gt;&lt;/div&gt;
</code></pre>
<p>In <code>client.js</code>:</p>
<pre><code class="language-javascript">const startBtn = document.getElementById("startBtn");
const endBtn = document.getElementById("endBtn");
const statusEl = document.getElementById("status");

let room;

startBtn.addEventListener("click", async () =&gt; {
  try {
    // connectVoice() is a placeholder for your connect helper (connectOnce() in the client above)
    room = await connectVoice();
    statusEl.textContent = "Connected";
    startBtn.disabled = true;
    endBtn.disabled = false;
  } catch (err) {
    statusEl.textContent = "Connection failed";
  }
});
</code></pre>
<h3 id="heading-microphone-denial">Microphone denial</h3>
<pre><code class="language-javascript">try {
  await navigator.mediaDevices.getUserMedia({ audio: true });
} catch (err) {
  statusEl.textContent = "Microphone access denied";
  throw err;
}
</code></pre>
<h3 id="heading-disconnect-cleanup">Disconnect cleanup</h3>
<pre><code class="language-javascript">endBtn.addEventListener("click", () =&gt; {
  if (room) {
    room.disconnect();
    statusEl.textContent = "Disconnected";
    startBtn.disabled = false;
    endBtn.disabled = true;
  }
});
</code></pre>
<h3 id="heading-token-refresh-simplified">Token refresh (simplified)</h3>
<pre><code class="language-javascript">room.on(RoomEvent.Disconnected, async () =&gt; {
  const res = await fetch("/api/voice-token");
  const { rtc_url, token } = await res.json();
  await room.connect(rtc_url, token);
});
</code></pre>
<h2 id="heading-step-4-add-client-actions-agent-suggests-app-executes">Step 4: Add Client Actions (Agent Suggests, App Executes)</h2>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/694ca88d5ac09a5d68c63854/2304be1c-3451-45f8-ae44-2519fa92c82a.png" alt="Sequence diagram showing agent requesting a client action, app validating allowlist, user confirming, and app executing the side effect." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>A production voice agent often needs to:</p>
<ul>
<li><p>open a runbook/dashboard URL</p>
</li>
<li><p>show a checklist in the UI</p>
</li>
<li><p>request confirmation for an irreversible action</p>
</li>
<li><p>receive structured context (account, region, incident ID)</p>
</li>
</ul>
<p>The key safety rule:</p>
<p><strong>The agent suggests actions. The application validates and executes them.</strong></p>
<p>Use structured messages over the data channel:</p>
<pre><code class="language-json">{
&nbsp;&nbsp;"type": "client_action",
&nbsp;&nbsp;"action": "open_url",
&nbsp;&nbsp;"payload": { "url": "https://internal.example.com/runbook" },
&nbsp;&nbsp;"id": "action_123"
}
</code></pre>
<p><strong>Add guardrails</strong>:</p>
<ul>
<li><p>allowlist permitted actions</p>
</li>
<li><p>validate payload shape</p>
</li>
<li><p>confirmation gates for irreversible actions</p>
</li>
<li><p>idempotency via id</p>
</li>
<li><p>audit logs for every request and outcome</p>
</li>
</ul>
<p>This boundary limits damage from hallucinations or prompt injection.</p>
<pre><code class="language-javascript">// Guardrails: allowlist + validation + idempotency + confirmation

const ALLOWED_ACTIONS = new Set(["open_url", "request_confirm"]);
const EXECUTED_ACTION_IDS = new Set();
const ALLOWED_HOSTS = new Set(["internal.example.com"]);

function parseClientAction(text) {
  let msg;
  try {
    msg = JSON.parse(text);
  } catch {
    return null;
  }

  if (msg?.type !== "client_action") return null;
  if (typeof msg.id !== "string") return null;
  if (!ALLOWED_ACTIONS.has(msg.action)) return null;

  return msg;
}

async function handleClientAction(msg, room) {
  if (EXECUTED_ACTION_IDS.has(msg.id)) return; // idempotency
  EXECUTED_ACTION_IDS.add(msg.id);

  console.log("[client_action]", msg); // audit log (demo)

  if (msg.action === "open_url") {
    const url = msg.payload?.url;
    if (typeof url !== "string") return;

    const u = new URL(url);
    if (!ALLOWED_HOSTS.has(u.host)) {
      console.warn("Blocked navigation to:", u.host);
      return;
    }

    window.open(url, "_blank", "noopener,noreferrer");
    return;
  }

  if (msg.action === "request_confirm") {
    const prompt = msg.payload?.prompt || "Confirm this action?";
    const ok = window.confirm(prompt);

    // Send confirmation back to agent/app
    room.localParticipant.publishData(
      new TextEncoder().encode(
        JSON.stringify({ type: "user_confirmed", id: msg.id, ok })
      ),
      { topic: "client_events", reliable: true }
    );
  }
}
</code></pre>
<pre><code class="language-javascript">room.on(RoomEvent.DataReceived, (payload, participant, kind, topic) =&gt; {
  if (topic !== "client_actions") return;

  const text = new TextDecoder().decode(payload);
  const msg = parseClientAction(text);
  if (!msg) return;

  handleClientAction(msg, room);
});
</code></pre>
<h2 id="heading-step-5-add-tool-integrations-safely">Step 5: Add Tool Integrations Safely</h2>
<p>Tools turn a voice agent into automation. Regardless of vendor, enforce these rules:</p>
<ul>
<li><p>timeouts on every tool call</p>
</li>
<li><p>circuit breakers for flaky dependencies</p>
</li>
<li><p>audit logs (inputs, outputs, duration, trace IDs)</p>
</li>
<li><p>explicit confirmation for destructive actions</p>
</li>
<li><p>credentials stored server-side (never in prompts or clients)</p>
</li>
</ul>
<p>If tools fail, degrade gracefully (“I can’t access that system right now, here’s the manual fallback.”). Silence reads as failure.</p>
<p><strong>Create a server-side tool runner (example)</strong></p>
<p>Paste this into <code>server.js</code>:</p>
<pre><code class="language-javascript">const TOOL_ALLOWLIST = {
  get_status: { destructive: false },
  create_ticket: { destructive: true },
};

let failures = 0;
let circuitOpenUntil = 0;

function circuitOpen() {
  return Date.now() &lt; circuitOpenUntil;
}

async function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise((_, reject) =&gt; setTimeout(() =&gt; reject(new Error("timeout")), ms)),
  ]);
}

async function runToolSafely(tool, args) {
  if (circuitOpen()) throw new Error("circuit_open");

  try {
    const result = await withTimeout(Promise.resolve({ ok: true, tool, args }), 2000);
    failures = 0;
    return result;
  } catch (err) {
    failures++;
    if (failures &gt;= 3) circuitOpenUntil = Date.now() + 10_000;
    throw err;
  }
}

app.post("/api/tools/run", async (req, res) =&gt; {
  const { tool, args, user_confirmed } = req.body || {};

  if (!TOOL_ALLOWLIST[tool]) return res.status(400).json({ error: "Tool not allowed" });

  if (TOOL_ALLOWLIST[tool].destructive &amp;&amp; user_confirmed !== true) {
    return res.status(400).json({ error: "Confirmation required" });
  }

  try {
    const started = Date.now();
    const result = await runToolSafely(tool, args);
    console.log("[tool_call]", { tool, ms: Date.now() - started }); // audit log
    res.json({ ok: true, result });
  } catch (err) {
    console.log("[tool_error]", { tool, err: String(err) });
    res.status(500).json({ ok: false, error: "Tool call failed" });
  }
});
</code></pre>
<h2 id="heading-step-6-add-post-call-processing-where-durable-value-appears">Step 6: Add post-call processing (where durable value appears)</h2>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/694ca88d5ac09a5d68c63854/65d350ff-8f20-489f-b5de-9cd59dda5b8c.png" alt="Post-call processing workflow showing transcript storage, queue/worker, summaries/action items, and integration updates." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>After a call ends, generate structured artifacts:</p>
<ul>
<li><p>summary</p>
</li>
<li><p>action items</p>
</li>
<li><p>follow-up email draft</p>
</li>
<li><p>CRM entry or ticket creation</p>
</li>
</ul>
<p>A production pattern:</p>
<ul>
<li><p>store transcript + metadata</p>
</li>
<li><p>enqueue a background job (queue/worker)</p>
</li>
<li><p>produce outputs as JSON + a human-readable report</p>
</li>
<li><p>apply integrations with retries + idempotency (see the sketch at the end of this step)</p>
</li>
<li><p>store a “call report” for audits and incident reviews</p>
</li>
</ul>
<p><strong>Create a post-call webhook endpoint (example)</strong></p>
<p>Paste into <code>server.js</code>:</p>
<pre><code class="language-javascript">app.post("/webhooks/call-ended", async (req, res) =&gt; {
  const payload = req.body;

  console.log("[call_ended]", {
    call_id: payload.call_id,
    ended_at: payload.ended_at,
  });

  setImmediate(() =&gt; processPostCall(payload));
  res.json({ ok: true });
});

function processPostCall(payload) {
  const transcript = payload.transcript || [];
  const summary = transcript.slice(0, 3).map(t =&gt; `- ${t.speaker}: ${t.text}`).join("\n");

  const report = {
    call_id: payload.call_id,
    summary,
    action_items: payload.action_items || [],
    created_at: new Date().toISOString(),
  };

  console.log("[call_report]", report);
}
</code></pre>
<h3 id="heading-test-it-locally">Test it locally</h3>
<pre><code class="language-shell">curl -X POST http://localhost:3000/webhooks/call-ended \
  -H "Content-Type: application/json" \
  -d '{
    "call_id": "call_123",
    "ended_at": "2026-02-26T00:10:00Z",
    "transcript": [
      {"speaker": "user", "text": "I need help resetting my password."},
      {"speaker": "agent", "text": "Sure — I can help with that."}
    ],
    "action_items": ["Send password reset link", "Verify account email"]
  }'
</code></pre>
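<p>The pattern list above also calls for applying integrations with retries and idempotency. Here is a minimal sketch of that piece; the CRM endpoint and header name are placeholders, and most providers document their own idempotency-key convention:</p>
<pre><code class="language-javascript">// Apply a call report to a downstream system with retries and an idempotency key.
async function applyIntegration(report, { attempts = 3 } = {}) {
  for (let attempt = 1; attempt &lt;= attempts; attempt++) {
    try {
      const res = await fetch("https://crm.example.com/api/notes", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          // Same key on every retry so the provider can de-duplicate the write.
          "Idempotency-Key": report.call_id,
        },
        body: JSON.stringify(report),
      });

      if (res.ok) return await res.json();

      // Client errors will not succeed on retry; mark them permanent.
      if (res.status &gt;= 400 &amp;&amp; res.status &lt; 500) {
        throw Object.assign(new Error(`Rejected: ${res.status}`), { permanent: true });
      }
      throw new Error(`Server error: ${res.status}`);
    } catch (err) {
      if (err.permanent || attempt === attempts) throw err;
      await new Promise((r) =&gt; setTimeout(r, 500 * attempt)); // simple backoff; add jitter in production
    }
  }
}
</code></pre>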
<h2 id="heading-production-readiness-checklist">Production readiness checklist</h2>
<h3 id="heading-security"><strong>Security</strong></h3>
<ul>
<li><p>no API keys in the browser</p>
</li>
<li><p>strict allowlist for client actions</p>
</li>
<li><p>confirmation gates for destructive actions</p>
</li>
<li><p>schema validation on all inbound messages</p>
</li>
<li><p>audit logging for actions and tool calls</p>
</li>
</ul>
<h3 id="heading-reliability"><strong>Reliability</strong></h3>
<ul>
<li><p>reconnect strategy for expired tokens</p>
</li>
<li><p>timeouts + circuit breakers for tools</p>
</li>
<li><p>graceful degradation when dependencies fail</p>
</li>
<li><p>idempotent side effects</p>
</li>
</ul>
<h3 id="heading-observability"><strong>Observability</strong></h3>
<p>Log state transitions (for example):<br><strong>listening → thinking → speaking → ended</strong></p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/694ca88d5ac09a5d68c63854/a1302294-4338-4a3a-ab0d-c50fd34c117f.png" alt="Voice agent state machine showing listening, thinking, speaking, and ended states for observability." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><strong>Track:</strong></p>
<ul>
<li><p>connect failure rate</p>
</li>
<li><p>end-to-end latency (STT + reasoning + TTS)</p>
</li>
<li><p>tool error rate</p>
</li>
<li><p>reconnect frequency</p>
</li>
</ul>
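<p>A minimal sketch of that state-transition logging, with one derived metric. The state names follow the diagram above; where you emit these events (client, agent runtime, or both) depends on your stack:</p>
<pre><code class="language-javascript">// Record state transitions with timestamps so latency metrics can be derived later.
const transitions = [];

function logTransition(callId, from, to) {
  const event = { callId, from, to, at: Date.now() };
  transitions.push(event);
  console.log("[state]", JSON.stringify(event)); // ship to your log pipeline in production
}

// Example metric: time spent in "thinking" approximates STT + reasoning latency per turn.
function thinkingDurationsMs(callId) {
  const events = transitions.filter((t) =&gt; t.callId === callId);
  const durations = [];
  for (let i = 0; i &lt; events.length - 1; i++) {
    if (events[i].to === "thinking") durations.push(events[i + 1].at - events[i].at);
  }
  return durations;
}
</code></pre>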
<h3 id="heading-cost-control"><strong>Cost control</strong></h3>
<ul>
<li><p>rate-limit token minting and sessions</p>
</li>
<li><p>cap max call duration (see the sketch below)</p>
</li>
<li><p>bound context growth (summarize or truncate)</p>
</li>
<li><p>track per-call usage drivers (STT/TTS minutes, tool calls)</p>
</li>
</ul>
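<p>For the duration cap, one simple approach is a client-side timer that reuses <code>stopCall()</code> from Step 3. The ten-minute figure is illustrative, and a server-side cap on session or token lifetime is the stronger control:</p>
<pre><code class="language-javascript">// Cap maximum call duration on the client; assumes setStatus() and stopCall() from Step 3.
const MAX_CALL_MS = 10 * 60 * 1000;
let callCapTimer = null;

function armCallCap() {
  callCapTimer = setTimeout(async () =&gt; {
    setStatus("Call ended (maximum duration reached)");
    await stopCall();
  }, MAX_CALL_MS);
}

function disarmCallCap() {
  if (callCapTimer) clearTimeout(callCapTimer);
  callCapTimer = null;
}

// Call armCallCap() after startCall() succeeds, and disarmCallCap() inside stopCall().
</code></pre>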
<h2 id="heading-optional-resources">Optional resources</h2>
<h3 id="heading-how-to-try-a-managed-voice-platform-quickly">How to Try a Managed Voice Platform Quickly</h3>
<p>If you want a managed provider to test quickly, you can sign up for a <a href="https://vocalbridgeai.com/">Vocal Bridge account</a> and implement these steps using their token minting + real-time session APIs.</p>
<p>But the core production voice agent architecture in this article is vendor-agnostic. You can replace any component (SFU, STT/TTS, agent runtime, tool layer) as long as you preserve the boundaries: secure token service, real-time media, safe tool execution, and strong observability.</p>
<h3 id="heading-watch-a-full-demo-and-explore-a-complete-reference-repo">Watch a full demo and explore a complete reference repo</h3>
<p>If you'd like to see these patterns working together in a realistic scenario (incident triage), here are two optional resources:</p>
<ul>
<li><p><strong>Demo video:</strong> <a href="https://youtu.be/TqrtOKd8Zug">Voice-First Incident Triage (end-to-end run)</a><br>This is a hackathon run-through showing client actions, decision boundaries for irreversible actions, and a structured post-call summary.</p>
</li>
<li><p><strong>GitHub repo (architecture + design + working code):</strong> <code>https://github.com/natarajsundar/voice-first-incident-triage</code></p>
</li>
</ul>
<p>These links are optional; you can follow the tutorial end-to-end without them.</p>
<h2 id="heading-closing">Closing</h2>
<p>Production-ready voice agents work when you treat them like real-time distributed systems.</p>
<p>Start with the baseline:</p>
<ul>
<li>token service + web client + real-time audio</li>
</ul>
<p>Then layer in:</p>
<ul>
<li><p>controlled client actions</p>
</li>
<li><p>safe tools</p>
</li>
<li><p>post-call automation</p>
</li>
<li><p>observability and cost controls</p>
</li>
</ul>
<p>That’s how you ship a voice agent architecture you can operate. You now have a vendor-neutral reference architecture you can adapt to your stack, with clear trust boundaries, safe tool execution, and operational visibility.</p>
<p>If you’re shipping real-time AI systems, what’s been your biggest production bottleneck so far: <strong>latency, reliability, or tool safety</strong>? I’d love to hear what you’re seeing in the wild. Connect with me on <a href="https://www.linkedin.com/in/natarajsundar/">LinkedIn</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build Your Own Private Voice Assistant: A Step-by-Step Guide Using Open-Source Tools ]]>
                </title>
                <description>
                    <![CDATA[ Most commercial voice assistants send your voice data to cloud servers before responding. By using open‑source tools, you can run everything directly on your phone for better privacy, faster responses, and full control over how the assistant behaves.... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/private-voice-assistant-using-open-source-tools/</link>
                <guid isPermaLink="false">690bcbbc8abe1e0a5b05e0be</guid>
                
                    <category>
                        <![CDATA[ Voice ]]>
                    </category>
                
                    <category>
                        <![CDATA[ voice assistants ]]>
                    </category>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Personalization  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tool calling ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ on-device ai ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Surya Teja Appini ]]>
                </dc:creator>
                <pubDate>Wed, 05 Nov 2025 22:12:12 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762380694991/10687751-7aec-4d78-8af8-1f76edc28afd.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most commercial voice assistants send your voice data to cloud servers before responding. By using open‑source tools, you can run everything directly on your phone for better privacy, faster responses, and full control over how the assistant behaves.</p>
<p>In this tutorial, I’ll walk you through the process step-by-step. You don’t need prior experience with machine learning models, as we’ll build up the system gradually and test each part as we go. By the end, you will have a fully local mobile voice assistant powered by:</p>
<ul>
<li><p>Whisper for Automatic Speech Recognition (ASR)</p>
</li>
<li><p>Machine Learning Compiler (MLC) LLM for on-device reasoning</p>
</li>
<li><p>System Text-to-Speech (TTS) using built-in Android TTS</p>
</li>
</ul>
<p>Your assistant will be able to:</p>
<ul>
<li><p>Understand your voice commands offline</p>
</li>
<li><p>Respond to you with synthesized speech</p>
</li>
<li><p>Perform tool calling actions (such as controlling smart devices)</p>
</li>
<li><p>Store personal memories and preferences</p>
</li>
<li><p>Use Retrieval-Augmented Generation (RAG) to answer questions from your own notes</p>
</li>
<li><p>Perform multi-step agentic workflows such as generating a morning briefing and optionally sending the summary to a contact</p>
</li>
</ul>
<p>This tutorial focuses on Android using Termux (the terminal environment for Android) for a fully local workflow.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-system-overview">System Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-requirements">Requirements</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-test-microphone-and-audio-playback-on-android">Step 1: Test Microphone and Audio Playback on Android</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-install-and-run-whisper-for-asr">Step 2: Install and Run Whisper for ASR</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-install-a-local-llm-with-mlc">Step 3: Install a Local LLM with MLC</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-local-text-to-speech-tts">Step 4: Local Text-to-Speech (TTS)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-the-core-voice-loop">Step 5: The Core Voice Loop</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-tool-calling-make-it-act">Step 6: Tool Calling (Make It Act)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-7-memory-and-personalization">Step 7: Memory and Personalization</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-8-retrieval-augmented-generation-rag">Step 8: Retrieval-Augmented Generation (RAG)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-9-multi-step-agentic-workflow">Step 9: Multi-Step Agentic Workflow</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion-and-next-steps">Conclusion and Next Steps</a></p>
</li>
</ul>
<h2 id="heading-system-overview"><strong>System Overview</strong></h2>
<p>This diagram shows how your voice moves through the assistant: speech in → transcription → reasoning → action → spoken reply.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762319872832/7b52b715-79c0-4c92-b431-b84c49ba7299.png" alt="7b52b715-79c0-4c92-b431-b84c49ba7299" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This pipeline describes the core flow:</p>
<ul>
<li><p>You speak into the microphone.</p>
</li>
<li><p>Whisper converts audio into text.</p>
</li>
<li><p>The local LLM interprets your request.</p>
</li>
<li><p>The assistant may call tools (for example, send notifications or create events).</p>
</li>
<li><p>The response is spoken aloud using the device’s Text-to-Speech system.</p>
</li>
</ul>
<h3 id="heading-key-concepts-used-in-this-tutorial">Key Concepts Used in This Tutorial</h3>
<ul>
<li><p><strong>Automatic Speech Recognition (ASR):</strong> Converts your speech into text. We use Whisper or Faster‑Whisper.</p>
</li>
<li><p><strong>Local Large Language Model (LLM):</strong> A reasoning model running on your phone using the MLC engine.</p>
</li>
<li><p><strong>Text‑to‑Speech (TTS):</strong> Converts text back to speech. We use Android’s built‑in system TTS.</p>
</li>
<li><p><strong>Tool Calling:</strong> Allows the assistant to perform actions (for example, sending a notification or creating an event).</p>
</li>
<li><p><strong>Memory:</strong> Stores personalized facts the assistant learns during conversation.</p>
</li>
<li><p><strong>Retrieval‑Augmented Generation (RAG):</strong> Lets the assistant reference your documents or notes.</p>
</li>
<li><p><strong>Agent Workflow:</strong> A multi‑step chain where the assistant uses multiple abilities together.</p>
</li>
</ul>
<h2 id="heading-requirements">Requirements</h2>
<p>What you should already be familiar with:</p>
<ul>
<li><p>Basic command line usage (running commands, navigating directories)</p>
</li>
<li><p>Very basic Python (calling a function, editing a <code>.py</code> script)</p>
</li>
</ul>
<p>You do <strong>not</strong> need to have:</p>
<ul>
<li><p>Machine learning experience</p>
</li>
<li><p>A deep understanding of neural networks</p>
</li>
<li><p>Prior experience with speech or audio models</p>
</li>
</ul>
<p>Here are the tools and technologies you’ll need to follow along:</p>
<ul>
<li><p>An Android phone with Snapdragon 8+ Gen 1 or newer recommended (older devices will still work, but responses may be slower)</p>
</li>
<li><p>Termux</p>
</li>
<li><p>Python 3.9+ inside Termux</p>
</li>
<li><p>Enough free storage (at least 4–6 GB) to store the model and audio files</p>
</li>
</ul>
<p><strong>Why these requirements matter:</strong></p>
<p>Whisper and Llama models run on-device, so the phone must handle real‑time compute. MLC optimizes models for your device's GPU / NPU, so newer processors will run faster and cooler. And system TTS and Termux APIs let the assistant speak and interact with the phone locally.</p>
<p>If your phone is older or mid‑range, switch the model in Step 3 to <code>Phi-3.5-Mini</code> which is smaller and faster.</p>
<p>We’ll start by setting up your Android environment with Termux, Python, media access, and storage permissions so later steps can record audio, run models, and speak.</p>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># In Termux</span>
pkg update &amp;&amp; pkg upgrade -y
pkg install -y python git ffmpeg termux-api
termux-setup-storage  <span class="hljs-comment"># grant storage permission</span>
</code></pre>
<h2 id="heading-step-1-test-microphone-and-audio-playback-on-android">Step 1: Test Microphone and Audio Playback on Android</h2>
<p><strong>What this step does:</strong> Verifies that your device microphone and speakers work correctly through Termux before connecting them to the voice assistant.</p>
<p>On-device assistants need reliable access to the microphone and speakers. On Android, Termux provides utilities to record audio and play media. This avoids complex audio dependencies and works on more devices.</p>
<p>These commands let you quickly test your microphone and audio playback without writing any code. This is useful to verify that your device permissions and audio paths are working before introducing Whisper or TTS.</p>
<ul>
<li><p><code>termux-microphone-record</code> records from the device microphone to a <code>.wav</code> file</p>
</li>
<li><p><code>termux-media-player</code> plays audio files</p>
</li>
<li><p><code>termux-tts-speak</code> speaks text using the system TTS voice (fast fallback)</p>
</li>
</ul>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># Start a 4 second recording</span>
termux-microphone-record -f <span class="hljs-keyword">in</span>.wav -l <span class="hljs-number">4</span> &amp;&amp; termux-microphone-record -q

<span class="hljs-comment"># Play back the captured audio</span>
termux-media-player play <span class="hljs-keyword">in</span>.wav

<span class="hljs-comment"># Speak text via system TTS (fallback if you do not install a Python TTS)</span>
termux-tts-speak <span class="hljs-string">"Hello, this is your on-device assistant running locally."</span>
</code></pre>
<h2 id="heading-step-2-install-and-run-whisper-for-asr">Step 2: Install and Run Whisper for ASR</h2>
<p><strong>What this step does:</strong> Converts recorded speech into text so the language model can understand what you said.</p>
<p>Whisper listens to your audio recording and converts it into text. Smaller versions like <code>tiny</code> or <code>base</code> run faster on most phones and are good enough for everyday commands.</p>
<p>Install Whisper:</p>
<pre><code class="lang-python">pip install openai-whisper
</code></pre>
<p>If you run into installation issues, you can use Faster‑Whisper instead:</p>
<pre><code class="lang-python">pip install faster-whisper
</code></pre>
<p>Below is a small Python script that takes the recorded audio file and turns it into text. It tries Whisper first, and if that isn’t available, it will automatically fall back to Faster‑Whisper.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Convert recorded speech to text (asr_transcribe.py)</span>
<span class="hljs-keyword">import</span> sys

<span class="hljs-comment"># Try Whisper, fallback to Faster-Whisper if needed</span>
<span class="hljs-keyword">try</span>:
    <span class="hljs-keyword">import</span> whisper
    use_faster = <span class="hljs-literal">False</span>
<span class="hljs-keyword">except</span> Exception:
    use_faster = <span class="hljs-literal">True</span>

<span class="hljs-keyword">if</span> use_faster:
    <span class="hljs-keyword">from</span> faster_whisper <span class="hljs-keyword">import</span> WhisperModel
    model = WhisperModel(<span class="hljs-string">"tiny.en"</span>)
    segments, info = model.transcribe(sys.argv[<span class="hljs-number">1</span>])
    text = <span class="hljs-string">" "</span>.join(s.text <span class="hljs-keyword">for</span> s <span class="hljs-keyword">in</span> segments)
    print(text.strip())
<span class="hljs-keyword">else</span>:
    model = whisper.load_model(<span class="hljs-string">"tiny.en"</span>)
    result = model.transcribe(sys.argv[<span class="hljs-number">1</span>], fp16=<span class="hljs-literal">False</span>)
    print(result[<span class="hljs-string">"text"</span>].strip())
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># Record 4 seconds and transcribe</span>
termux-microphone-record -f <span class="hljs-keyword">in</span>.wav -l <span class="hljs-number">4</span> &amp;&amp; termux-microphone-record -q
python asr_transcribe.py <span class="hljs-keyword">in</span>.wav
</code></pre>
<h2 id="heading-step-3-install-a-local-llm-with-mlc">Step 3: Install a Local LLM with MLC</h2>
<p><strong>What this step does:</strong> Installs and tests the on-device reasoning model that will generate responses to transcribed speech.</p>
<p>MLC compiles transformer models to mobile GPUs and Neural Processing Units, enabling on-device inference. You will run an instruction-tuned model with 4-bit or 8-bit weights for speed.</p>
<p>Install the command-line interface like this:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Clone and install Python bindings (for scripting) and CLI</span>
git clone https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm
pip install -r requirements.txt
pip install -e python
</code></pre>
<p>We will use <strong>Llama 3 8B Instruct q4</strong> because it offers strong reasoning while still running on many recent Android devices. If your phone has less memory or you want faster responses, you can swap in <strong>Phi-3.5 Mini</strong> (about 3.8B) without changing any code.</p>
<p>Download a mobile-optimized model:</p>
<pre><code class="lang-python">mlc_llm download Llama<span class="hljs-number">-3</span><span class="hljs-number">-8</span>B-Instruct-q4f16_1
</code></pre>
<p>We will use a short Python script to send text to the model and print the response. This lets us verify that the model is installed correctly before we connect it to audio.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Local LLM text generation (local_llm.py)</span>
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
<span class="hljs-keyword">import</span> sys

engine = MLCEngine(model=<span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>)
prompt = sys.argv[<span class="hljs-number">1</span>] <span class="hljs-keyword">if</span> len(sys.argv) &gt; <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"Hello"</span>
resp = engine.chat([{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: prompt}])
<span class="hljs-comment"># The engine may return different structures across versions</span>
reply_text = resp.get(<span class="hljs-string">"message"</span>, resp) <span class="hljs-keyword">if</span> isinstance(resp, dict) <span class="hljs-keyword">else</span> str(resp)
print(reply_text)
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python">python local_llm.py <span class="hljs-string">"Summarize this in one sentence: building a local voice assistant on Android"</span>
</code></pre>
<h2 id="heading-step-4-local-text-to-speech-tts">Step 4: Local Text-to-Speech (TTS)</h2>
<p><strong>What this step does:</strong> Turns the model’s text responses into spoken audio so the assistant can talk back.</p>
<p>This step converts the text returned by the model into spoken audio so the assistant can talk back. It uses the built-in Android Text-to-Speech voice and requires no additional Python packages.</p>
<pre><code class="lang-python">termux-tts-speak <span class="hljs-string">"Hello, I am running entirely on your device."</span>
</code></pre>
<p>This is the voice output method we will use throughout the tutorial.</p>
<h2 id="heading-step-5-the-core-voice-loop">Step 5: The Core Voice Loop</h2>
<p><strong>What this step does:</strong> Connects speech recognition, language model reasoning, and speech synthesis into a single interactive conversation loop.</p>
<p>This loop ties together recording, transcription, response generation, and playback.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Core voice loop tying ASR + LLM + TTS (voice_loop.py)</span>
<span class="hljs-keyword">import</span> subprocess, os

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run</span>(<span class="hljs-params">cmd</span>):</span> <span class="hljs-keyword">return</span> subprocess.check_output(cmd).decode().strip()

print(<span class="hljs-string">"Listening..."</span>)
subprocess.run([<span class="hljs-string">"termux-microphone-record"</span>, <span class="hljs-string">"-f"</span>, <span class="hljs-string">"in.wav"</span>, <span class="hljs-string">"-l"</span>, <span class="hljs-string">"4"</span>]) ; subprocess.run([<span class="hljs-string">"termux-microphone-record"</span>, <span class="hljs-string">"-q"</span>])
text = run([<span class="hljs-string">"python"</span>, <span class="hljs-string">"asr_transcribe.py"</span>, <span class="hljs-string">"in.wav"</span>])
reply = run([<span class="hljs-string">"python"</span>, <span class="hljs-string">"local_llm.py"</span>, text])
<span class="hljs-keyword">try</span>:
    subprocess.run([<span class="hljs-string">"python"</span>, <span class="hljs-string">"speak_xtts.py"</span>, reply]); subprocess.run([<span class="hljs-string">"termux-media-player"</span>, <span class="hljs-string">"play"</span>, <span class="hljs-string">"out.wav"</span>])
<span class="hljs-keyword">except</span>:
    subprocess.run([<span class="hljs-string">"termux-tts-speak"</span>, reply])
</code></pre>
<p>Run:</p>
<pre><code class="lang-python">python voice_loop.py
</code></pre>
<h2 id="heading-step-6-tool-calling-make-it-act">Step 6: Tool Calling (Make It Act)</h2>
<p><strong>What this step does:</strong> Enables the assistant to perform actions – not just reply – by calling real functions on your device.</p>
<p>Tool calling lets the assistant perform actions, not just answer. When the model recognizes an action request, it outputs a small JSON instruction, and your code runs the corresponding function. You show the model which tools exist and how to call them. The program intercepts calls and runs the corresponding code.</p>
<p><strong>Example use case:</strong></p>
<p>You say: <em>"Schedule a meeting tomorrow at 3 PM with John."</em></p>
<p>The assistant:</p>
<ol>
<li><p>Transcribes what you said.</p>
</li>
<li><p>Detects that this is not a question, but an action request.</p>
</li>
<li><p>Calls the <code>add_event()</code> function with the correct parameters.</p>
</li>
<li><p>Confirms: <em>"Okay, I scheduled that."</em></p>
</li>
</ol>
<p>Here’s the structure of how tool calls will work:</p>
<ul>
<li><p>Define Python functions such as <code>add_event</code>, <code>control_light</code></p>
</li>
<li><p>Provide a schema for the model to output when it wants to call a tool</p>
</li>
<li><p>Detect that schema in the LLM output and execute the function</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># Tool calling functions (tools.py)</span>
<span class="hljs-keyword">import</span> json

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">add_event</span>(<span class="hljs-params">title: str, date: str</span>) -&gt; dict:</span>
    <span class="hljs-comment"># Replace with actual calendar integration</span>
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"ok"</span>, <span class="hljs-string">"title"</span>: title, <span class="hljs-string">"date"</span>: date}

TOOLS = {
    <span class="hljs-string">"add_event"</span>: add_event,
}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_tool</span>(<span class="hljs-params">call_json: str</span>) -&gt; str:</span>
    <span class="hljs-string">"""call_json: '{"tool":"add_event","args":{"title":"Dentist","date":"2025-11-10 10:00"}}'"""</span>
    data = json.loads(call_json)
    name = data[<span class="hljs-string">"tool"</span>]
    args = data.get(<span class="hljs-string">"args"</span>, {})
    <span class="hljs-keyword">if</span> name <span class="hljs-keyword">in</span> TOOLS:
        result = TOOLS[name](**args)
        <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"tool_result"</span>: result})
    <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"error"</span>: <span class="hljs-string">"unknown tool"</span>})
</code></pre>
<p>Prompt the model to use tools:</p>
<pre><code class="lang-python"><span class="hljs-comment"># LLM wrapper enabling tool use (llm_with_tools.py)</span>
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
<span class="hljs-keyword">import</span> json, sys

SYSTEM = (
    <span class="hljs-string">"You can call tools by emitting a single JSON object with keys 'tool' and 'args'. "</span>
    <span class="hljs-string">"Available tools: add_event(title:str, date:str). "</span>
    <span class="hljs-string">"If no tool is needed, answer directly."</span>
)

engine = MLCEngine(model=<span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>)
user = sys.argv[<span class="hljs-number">1</span>]
resp = engine.chat([
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: SYSTEM},
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: user},
])
print(resp.get(<span class="hljs-string">"message"</span>, resp) <span class="hljs-keyword">if</span> isinstance(resp, dict) <span class="hljs-keyword">else</span> str(resp))
</code></pre>
<p>And then glue it together:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Run LLM with tool call detection (run_with_tools.py)</span>
<span class="hljs-keyword">import</span> subprocess, json
<span class="hljs-keyword">from</span> tools <span class="hljs-keyword">import</span> run_tool

user = <span class="hljs-string">"Add a dentist appointment next Thursday at 10"</span>
raw = subprocess.check_output([<span class="hljs-string">"python"</span>, <span class="hljs-string">"llm_with_tools.py"</span>, user]).decode().strip()

<span class="hljs-comment"># If the model returned a JSON tool call, run it</span>
<span class="hljs-keyword">try</span>:
    data = json.loads(raw)
    <span class="hljs-keyword">if</span> isinstance(data, dict) <span class="hljs-keyword">and</span> <span class="hljs-string">"tool"</span> <span class="hljs-keyword">in</span> data:
        print(<span class="hljs-string">"Tool call:"</span>, data)
        print(run_tool(raw))
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">"Assistant:"</span>, raw)
<span class="hljs-keyword">except</span> Exception:
    print(<span class="hljs-string">"Assistant:"</span>, raw)
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python">python run_with_tools.py
</code></pre>
<h2 id="heading-step-7-memory-and-personalization">Step 7: Memory and Personalization</h2>
<p><strong>What this step does:</strong> Allows the assistant to remember personal information you share so conversations feel continuous and adaptive.</p>
<p>A helpful assistant should feel like it learns alongside you. Memory allows the system to keep track of small details you mention naturally in conversation.</p>
<p>Without memory, every conversation starts from scratch. With memory, your assistant can remember personal facts (for example, birthdays, favorite music), your routines, device settings, or notes you mention in conversation. This unlocks more natural interactions and enables personalization over time.</p>
<p>You can start with a simple key-value store and expand over time. Your program reads memory before inference and writes back new facts after.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Simple key-value memory store (memory.py)</span>
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path

MEM_PATH = Path(<span class="hljs-string">"memory.json"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">mem_load</span>():</span>
    <span class="hljs-keyword">return</span> json.loads(MEM_PATH.read_text()) <span class="hljs-keyword">if</span> MEM_PATH.exists() <span class="hljs-keyword">else</span> {}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">mem_save</span>(<span class="hljs-params">mem</span>):</span>
    MEM_PATH.write_text(json.dumps(mem, indent=<span class="hljs-number">2</span>))

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">remember</span>(<span class="hljs-params">key: str, value: str</span>):</span>
    mem = mem_load()
    mem[key] = value
    mem_save(mem)
</code></pre>
<p>Use memory in the loop:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Voice loop with memory loading and updating (voice_loop_with_memory.py)</span>
<span class="hljs-keyword">import</span> subprocess, json
<span class="hljs-keyword">from</span> memory <span class="hljs-keyword">import</span> mem_load, remember

<span class="hljs-comment"># 1) Record and transcribe</span>
subprocess.run([<span class="hljs-string">"termux-microphone-record"</span>, <span class="hljs-string">"-f"</span>, <span class="hljs-string">"in.wav"</span>, <span class="hljs-string">"-l"</span>, <span class="hljs-string">"4"</span>]) 
subprocess.run([<span class="hljs-string">"termux-microphone-record"</span>, <span class="hljs-string">"-q"</span>]) 
user_text = subprocess.check_output([<span class="hljs-string">"python"</span>, <span class="hljs-string">"asr_transcribe.py"</span>, <span class="hljs-string">"in.wav"</span>]).decode().strip()

<span class="hljs-comment"># 2) Load memory and add as system context</span>
mem = mem_load()
SYSTEM = <span class="hljs-string">"Known facts: "</span> + json.dumps(mem)

<span class="hljs-comment"># 3) Ask the model</span>
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
engine = MLCEngine(model=<span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>)
resp = engine.chat([
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: SYSTEM},
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: user_text},
])
reply = resp.get(<span class="hljs-string">"message"</span>, resp) <span class="hljs-keyword">if</span> isinstance(resp, dict) <span class="hljs-keyword">else</span> str(resp)
print(<span class="hljs-string">"Assistant:"</span>, reply)

<span class="hljs-comment"># 4) Very simple pattern: if the user said "remember X is Y", store it</span>
<span class="hljs-keyword">if</span> user_text.lower().startswith(<span class="hljs-string">"remember "</span>) <span class="hljs-keyword">and</span> <span class="hljs-string">" is "</span> <span class="hljs-keyword">in</span> user_text:
    k, v = user_text[<span class="hljs-number">9</span>:].split(<span class="hljs-string">" is "</span>, <span class="hljs-number">1</span>)
    remember(k.strip(), v.strip())
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python">python voice_loop_with_memory.py
</code></pre>
<h2 id="heading-step-8-retrieval-augmented-generation-rag">Step 8: Retrieval-Augmented Generation (RAG)</h2>
<p><strong>What this step does:</strong> Lets the assistant search your offline notes or documents at answer time, improving accuracy for personal tasks.</p>
<p>A language model cannot magically know details about your life, your work, or your files unless you give it a way to look things up.</p>
<p><a target="_blank" href="https://www.freecodecamp.org/news/learn-rag-fundamentals-and-advanced-techniques/">Retrieval-Augmented Generation (RAG)</a> bridges that gap: at answer time, the assistant searches your own stored data and references your actual notes instead of relying only on the model's internal training. This means it can answer questions about your projects, home details, travel plans, studies, or any personal documents you store completely offline.</p>
<p>The workflow is simple: install a lightweight vector database, add documents to it, then query it whenever the assistant answers a question.</p>
<p>Install the vector store:</p>
<pre><code class="lang-python">pip install chromadb
</code></pre>
<p>Add and search your notes:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Local vector DB indexing and querying (rag.py)</span>
<span class="hljs-keyword">from</span> chromadb <span class="hljs-keyword">import</span> Client

client = Client()
notes = client.create_collection(<span class="hljs-string">"notes"</span>)

<span class="hljs-comment"># Add your documents (repeat as needed)</span>
notes.add(documents=[<span class="hljs-string">"Contractor quote was 42000 United States Dollars for the extension."</span>], ids=[<span class="hljs-string">"q1"</span>]) 

<span class="hljs-comment"># Query the local vector database</span>
results = notes.query(query_texts=[<span class="hljs-string">"extension quote"</span>], n_results=<span class="hljs-number">1</span>)
context = results[<span class="hljs-string">"documents"</span>][<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]
print(context)
</code></pre>
<p>Use retrieved context in responses:</p>
<pre><code class="lang-python"><span class="hljs-comment"># LLM answering using retrieved context (llm_with_rag.py)</span>
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
<span class="hljs-keyword">from</span> chromadb <span class="hljs-keyword">import</span> Client

engine = MLCEngine(model=<span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>)
client = chromadb.PersistentClient(path="chroma_db")  # reuse the on-disk index built by rag.py
notes = client.get_or_create_collection(<span class="hljs-string">"notes"</span>)

question = <span class="hljs-string">"What was the quoted amount for the home extension?"</span>
res = notes.query(query_texts=[question], n_results=<span class="hljs-number">2</span>)
ctx = <span class="hljs-string">"\n"</span>.join([d[<span class="hljs-number">0</span>] <span class="hljs-keyword">for</span> d <span class="hljs-keyword">in</span> res[<span class="hljs-string">"documents"</span>]])

SYSTEM = <span class="hljs-string">"Use the provided context to answer accurately. If missing, say you do not know.\nContext:\n"</span> + ctx
ans = engine.chat([
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: SYSTEM},
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: question},
])
print(ans.get(<span class="hljs-string">"message"</span>, ans) <span class="hljs-keyword">if</span> isinstance(ans, dict) <span class="hljs-keyword">else</span> str(ans))
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python">python rag.py
python llm_with_rag.py
</code></pre>
<h2 id="heading-step-9-multi-step-agentic-workflow">Step 9: Multi-Step Agentic Workflow</h2>
<p><strong>What this step does:</strong> Combines listening, reasoning, memory, and tool usage into a multi-step routine that runs automatically.</p>
<p>Now that the assistant can listen, respond, remember facts, and call tools, we can combine those abilities into a small routine that performs several steps automatically.</p>
<p><strong>Practical example: "Morning Briefing" on your phone</strong></p>
<p>Goal: when you say <em>"Give me my morning briefing and text it to my partner"</em>, the assistant will:</p>
<ol>
<li><p>Read today's agenda from a local file,</p>
</li>
<li><p>summarize it,</p>
</li>
<li><p>speak it aloud, and</p>
</li>
<li><p>send the summary via SMS using Termux.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762319593253/99e670d4-4934-47ce-a164-f0f7880ea80f.png" alt="Multi-step morning briefing workflow with retrieval, summary, speech output, and SMS action." class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><em>Diagram: Multi-step morning briefing workflow with retrieval, summary, speech output, and SMS action.</em></p>
<h3 id="heading-prepare-your-agenda-file">Prepare your agenda file</h3>
<p>This file stores your events for the day. You can edit it manually, generate it, or sync it later if you want.</p>
<p>Create <code>agenda.json</code> in the same folder:</p>
<pre><code class="lang-python">{
  <span class="hljs-string">"2025-11-03"</span>: [
    {<span class="hljs-string">"time"</span>: <span class="hljs-string">"09:30"</span>, <span class="hljs-string">"title"</span>: <span class="hljs-string">"Standup meeting"</span>},
    {<span class="hljs-string">"time"</span>: <span class="hljs-string">"13:00"</span>, <span class="hljs-string">"title"</span>: <span class="hljs-string">"Lunch with Priya"</span>},
    {<span class="hljs-string">"time"</span>: <span class="hljs-string">"16:30"</span>, <span class="hljs-string">"title"</span>: <span class="hljs-string">"Gym"</span>}
  ]
}
</code></pre>
<p>Phone-integrated tools for this workflow:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Phone-integrated agent tools (tools_phone.py)</span>
<span class="hljs-keyword">import</span> json, subprocess, datetime
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path

AGENDA_PATH = Path(<span class="hljs-string">"agenda.json"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_today_agenda</span>():</span>
    today = datetime.date.today().isoformat()
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> AGENDA_PATH.exists():
        <span class="hljs-keyword">return</span> []
    data = json.loads(AGENDA_PATH.read_text())
    <span class="hljs-keyword">return</span> data.get(today, [])

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">send_sms</span>(<span class="hljs-params">number: str, text: str</span>) -&gt; dict:</span>
    <span class="hljs-comment"># Requires Termux:API and SMS permission</span>
    subprocess.run([<span class="hljs-string">"termux-sms-send"</span>, <span class="hljs-string">"-n"</span>, number, text])
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"sent"</span>, <span class="hljs-string">"to"</span>: number}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">notify</span>(<span class="hljs-params">title: str, content: str</span>) -&gt; dict:</span>
    subprocess.run([<span class="hljs-string">"termux-notification"</span>, <span class="hljs-string">"--title"</span>, title, <span class="hljs-string">"--content"</span>, content])
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"notified"</span>}
</code></pre>
<p>Create the agent routine:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Multi-step morning briefing agent (agent_morning.py)</span>
<span class="hljs-keyword">import</span> json, subprocess, os
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
<span class="hljs-keyword">from</span> tools_phone <span class="hljs-keyword">import</span> load_today_agenda, send_sms, notify

PARTNER_PHONE = os.environ.get(<span class="hljs-string">"PARTNER_PHONE"</span>, <span class="hljs-string">"+15551234567"</span>)

TOOLS = {
    <span class="hljs-string">"send_sms"</span>: send_sms,
    <span class="hljs-string">"notify"</span>: notify,
}

SYSTEM = (
  <span class="hljs-string">"You assist on a phone. You may emit a single-line JSON when an action is needed "</span>
  <span class="hljs-string">"with keys 'tool' and 'args'. Available tools: send_sms(number:str, text:str), "</span>
  <span class="hljs-string">"notify(title:str, content:str). Keep messages concise. If no tool is needed, answer in plain text."</span>
)

engine = MLCEngine(model=<span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>)

agenda = load_today_agenda()
agenda_text = <span class="hljs-string">"
"</span>.join(<span class="hljs-string">f"<span class="hljs-subst">{e[<span class="hljs-string">'time'</span>]}</span> - <span class="hljs-subst">{e[<span class="hljs-string">'title'</span>]}</span>"</span> <span class="hljs-keyword">for</span> e <span class="hljs-keyword">in</span> agenda) <span class="hljs-keyword">or</span> <span class="hljs-string">"No events for today."</span>

user_request = <span class="hljs-string">"Give me my morning briefing and text it to my partner."</span> <span class="hljs-string">"Give me my morning briefing and text it to my partner."</span>

<span class="hljs-comment"># 1) Ask LLM for a 2-3 sentence summary to speak</span>
summary = engine.chat([
  {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Summarize this agenda in 2-3 sentences for a morning briefing:"</span>},
  {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: agenda_text},
])
summary_text = summary.get(<span class="hljs-string">"message"</span>, summary) <span class="hljs-keyword">if</span> isinstance(summary, dict) <span class="hljs-keyword">else</span> str(summary)
print(<span class="hljs-string">"Briefing:
"</span>, summary_text)

<span class="hljs-comment"># 2) Speak locally (prefer XTTS, fallback to system TTS)</span>
<span class="hljs-keyword">try</span>:
    subprocess.run([<span class="hljs-string">"python"</span>, <span class="hljs-string">"speak_xtts.py"</span>, summary_text], check=<span class="hljs-literal">True</span>)
    subprocess.run([<span class="hljs-string">"termux-media-player"</span>, <span class="hljs-string">"play"</span>, <span class="hljs-string">"out.wav"</span>]) 
<span class="hljs-keyword">except</span> Exception:
    subprocess.run([<span class="hljs-string">"termux-tts-speak"</span>, summary_text])

<span class="hljs-comment"># 3) Ask LLM whether to send SMS and with what text, using tool schema</span>
resp = engine.chat([
  {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: SYSTEM},
  {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">f"User said: '<span class="hljs-subst">{user_request}</span>'. Partner phone is <span class="hljs-subst">{PARTNER_PHONE}</span>. Summary: <span class="hljs-subst">{summary_text}</span>"</span>},
])
msg = resp.get(<span class="hljs-string">"message"</span>, resp) <span class="hljs-keyword">if</span> isinstance(resp, dict) <span class="hljs-keyword">else</span> str(resp)

<span class="hljs-comment"># 4) If the model requested a tool, execute it</span>
<span class="hljs-keyword">try</span>:
    data = json.loads(msg)
    <span class="hljs-keyword">if</span> isinstance(data, dict) <span class="hljs-keyword">and</span> data.get(<span class="hljs-string">"tool"</span>) <span class="hljs-keyword">in</span> TOOLS:
        <span class="hljs-comment"># Auto-fill phone number if missing</span>
        <span class="hljs-keyword">if</span> data[<span class="hljs-string">"tool"</span>] == <span class="hljs-string">"send_sms"</span> <span class="hljs-keyword">and</span> <span class="hljs-string">"number"</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> data.get(<span class="hljs-string">"args"</span>, {}):
            data.setdefault(<span class="hljs-string">"args"</span>, {})[<span class="hljs-string">"number"</span>] = PARTNER_PHONE
        result = TOOLS[data[<span class="hljs-string">"tool"</span>]](**data.get(<span class="hljs-string">"args"</span>, {}))
        print(<span class="hljs-string">"Tool result:"</span>, result)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">"Assistant:"</span>, msg)
<span class="hljs-keyword">except</span> Exception:
    print(<span class="hljs-string">"Assistant:"</span>, msg)
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python">export PARTNER_PHONE=+<span class="hljs-number">15551234567</span>
python agent_morning.py
</code></pre>
<p>This example is realistic on Android because it uses Termux utilities you already installed: local TTS for speech output, <code>termux-sms-send</code> for messaging, and <code>termux-notification</code> for a quick on-device confirmation. You can extend it with a Home Assistant tool later if you have a local server (for example, to toggle lights or set thermostat scenes).</p>
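<p>If you do run Home Assistant locally, adding it as a tool can be a single REST call. The following is a minimal sketch rather than part of the steps above: it assumes the <code>requests</code> package, a reachable Home Assistant server, and a long-lived access token exported as <code>HASS_TOKEN</code>; the URL and entity IDs are placeholders for your own setup.</p>
<pre><code class="lang-python"># Hypothetical Home Assistant tool (tools_home.py) - a sketch, not covered above.
import os
import requests

HASS_URL = os.environ.get("HASS_URL", "http://homeassistant.local:8123")  # placeholder
HASS_TOKEN = os.environ["HASS_TOKEN"]  # long-lived access token from your HA profile

def control_light(entity_id: str, on: bool = True) -&gt; dict:
    """Turn a light on or off via the Home Assistant REST API."""
    service = "turn_on" if on else "turn_off"
    resp = requests.post(
        f"{HASS_URL}/api/services/light/{service}",
        headers={"Authorization": f"Bearer {HASS_TOKEN}"},
        json={"entity_id": entity_id},
        timeout=10,
    )
    resp.raise_for_status()
    return {"status": "ok", "entity": entity_id, "service": service}

# Register it alongside send_sms and notify, and mention it in the SYSTEM prompt:
# TOOLS["control_light"] = control_light
</code></pre>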
<h2 id="heading-conclusion-and-next-steps">Conclusion and Next Steps</h2>
<p>Building a fully local voice assistant is an incremental process. Each step you added – speech recognition, text generation, memory, retrieval, and tool execution – unlocked new capabilities and moved the system closer to behaving like a real assistant.</p>
<p>You built a fully local voice assistant on your phone with:</p>
<ul>
<li><p>On-device Automatic Speech Recognition with Whisper (with Faster-Whisper fallback)</p>
</li>
<li><p>On-device reasoning with a local LLM served by MLC LLM</p>
</li>
<li><p>Local Text-to-Speech using the built-in system TTS</p>
</li>
<li><p>Tool calling for real actions</p>
</li>
<li><p>Memory and personalization</p>
</li>
<li><p>Retrieval-Augmented Generation for document-based knowledge</p>
</li>
<li><p>A simple agent loop for multi-step work</p>
</li>
</ul>
<p>From here you can add:</p>
<ul>
<li><p>Wake word detection (for example, Porcupine or open wake word models; see the sketch after this list)</p>
</li>
<li><p>Device-specific integrations (for example, Home Assistant, smart lighting)</p>
</li>
<li><p>Better memory schemas, plus calendar and contacts adapters</p>
</li>
</ul>
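<p>As a sketch of the first idea, here is a hypothetical wake-word front end built on Picovoice's Porcupine. It assumes <code>pip install pvporcupine sounddevice</code>, a free Picovoice AccessKey, and working microphone capture from Python, which may need extra setup inside Termux.</p>
<pre><code class="lang-python"># Hypothetical wake-word listener (wake_word.py) - a sketch, not covered above.
import struct
import subprocess
import pvporcupine
import sounddevice as sd

porcupine = pvporcupine.create(
    access_key="YOUR_PICOVOICE_ACCESS_KEY",  # placeholder
    keywords=["porcupine"],                  # built-in keyword; swap in a custom one later
)

with sd.RawInputStream(samplerate=porcupine.sample_rate,
                       blocksize=porcupine.frame_length,
                       channels=1, dtype="int16") as stream:
    while True:
        frame, _ = stream.read(porcupine.frame_length)
        pcm = struct.unpack_from("h" * porcupine.frame_length, bytes(frame))
        if porcupine.process(pcm) &gt;= 0:      # wake word detected
            subprocess.run(["python", "voice_loop.py"])  # hand off to the voice loop
</code></pre>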
<p>Your data never leaves your device, and you control every part of the stack. This is a private, customizable assistant you can expand however you like.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Voice AI Agent Using Open-Source Tools ]]>
                </title>
                <description>
                    <![CDATA[ Voice is the next frontier of conversational AI. It is the most natural modality for people to chat and interact with another intelligent being. In the past year, frontier AI labs such as OpenAI, xAI, Anthropic, Meta, and Google have all released rea... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-voice-ai-agent-using-open-source-tools/</link>
                <guid isPermaLink="false">68f7d890413573e1d65bb331</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ education ]]>
                    </category>
                
                    <category>
                        <![CDATA[ stem ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Voice ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Rust ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Michael Yuan ]]>
                </dc:creator>
                <pubDate>Tue, 21 Oct 2025 19:01:36 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761073279608/a73ce2cd-c95e-4f8b-b529-8774ce39a43f.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Voice is the next frontier of conversational AI. It is the most natural modality for people to chat and interact with another intelligent being.</p>
<p>In the past year, frontier AI labs such as OpenAI, xAI, Anthropic, Meta, and Google have all released real-time voice services. Yet voice apps also have the highest requirements for latency, privacy, and customization. It’s difficult to have a one-size-fits-all voice AI solution.</p>
<p>In this article, we’ll explore how to use open-source technologies to create <a target="_blank" href="https://echokit.dev/">voice AI agents</a> that utilize your custom knowledge base, voice style, actions, fine-tuned AI models, and run on your own computer.</p>
<h2 id="heading-what-well-cover">What We’ll Cover:</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-it-looks-like">What it Looks Like</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-two-voice-ai-approaches">Two Voice AI Approaches</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-voice-ai-orchestrator">The Voice AI Orchestrator</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-configure-an-asr">Configure an ASR</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-run-and-configure-a-vad">Run and configure a VAD</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-configure-an-llm">Configure an LLM</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-configure-a-tts">Configure a TTS</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-configure-mcp-and-actions">Configure MCP and actions</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-local-ai-with-llamaedge">Local AI With LlamaEdge</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You’ll need to have and know a few things to most effectively follow along with this tutorial:</p>
<ul>
<li><p>Access to a Linux-like system. Mac or Windows WSL suffice too.</p>
</li>
<li><p>Be comfortable with command line (CLI) tools.</p>
</li>
<li><p>Be able to run server applications on the Linux system.</p>
</li>
<li><p>Have/get free API keys from <a target="_blank" href="https://console.groq.com/keys">Groq</a> and <a target="_blank" href="https://elevenlabs.io/app/sign-in?redirect=%2Fapp%2Fdevelopers%2Fapi-keys">ElevenLabs</a>.</p>
</li>
<li><p>Optional: be able to compile and build Rust source code.</p>
</li>
<li><p>Optional: have/get an <a target="_blank" href="https://echokit.dev/echokit_diy.html">EchoKit device</a> or assemble your own.</p>
</li>
</ul>
<h2 id="heading-what-it-looks-like">What it Looks Like</h2>
<p>The key software component we will cover is the <a target="_blank" href="https://github.com/second-state/echokit_server">echokit_server</a> project. It is an open-source agent orchestrator for voice AI applications. That means it coordinates services such as LLMs, ASR, TTS, VAD, MCP, search, knowledge/vector databases, and others to generate intelligent voice responses from user prompts.</p>
<p>The EchoKit server provides a WebSocket interface that allows compatible clients to send and receive voice data to and from it. The <a target="_blank" href="https://github.com/second-state/echokit_box">echokit_box</a> project provides an ESP32-based firmware that can act as a client to collect audio from the user and play TTS-generated voice from the EchoKit server. You can see a couple of demos in the videos embedded below. You can assemble your own EchoKit device or <a target="_blank" href="https://echokit.dev/echokit_diy.html">purchase one</a>.</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/XroT7a0DLkw" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p> </p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/Zy-rLT4EgZQ" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p> </p>
<p>Of course, you can also use a pure software client that conforms to the <a target="_blank" href="https://github.com/second-state/echokit_server">echokit_server</a> WebSocket interface. The project publishes a <a target="_blank" href="https://echokit.dev/chat/">JavaScript web page</a> that you can run locally to connect to your own EchoKit server as a reference.</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/Eyd9ToflccY" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p> </p>
<p>In the rest of the article, I will discuss how it’s implemented and how to deploy the system for your own voice AI applications.</p>
<h2 id="heading-two-voice-ai-approaches">Two Voice AI Approaches</h2>
<p>When OpenAI released its “realtime voice” services in October 2024, the consensus was that voice AI required “end-to-end” AI models. Traditional LLMs take text as input and then respond in text. The voice end-to-end models take voice audio data as input and respond in voice audio data as well. The end-to-end models could reduce latency since the voice processing, understanding, and generation are done in a single step.</p>
<p>But an end-to-end model is very difficult to customize. For example, it’s impossible to add your own prompt and knowledge to the context for each LLM request, or to act on the LLM's thinking or tool-call responses, or to clone your own voice for the response.</p>
<p>The second approach is to use an “agent orchestration” service to tie together multiple AI models, using one model’s output as the input for the next model. This allows us to customize or select each model and manipulate or supplement the model input at every step.</p>
<ul>
<li><p>The VAD model is used to detect conversation turns in the user's speech. It determines when the user is finished speaking and is now expecting a response.</p>
</li>
<li><p>The ASR/STT model turns user speech into text.</p>
</li>
<li><p>The LLM model generates a text response, including MCP tool calls.</p>
</li>
<li><p>The TTS model turns the response text into voice.</p>
</li>
</ul>
<p>The issue with multi-model and multi-step orchestration is that it can be slow. A lot of optimizations are needed for this approach to work well. For example, a useful technique is to utilize streaming input and output wherever possible. This way, each model doesn’t have to wait for the complete response from the upstream model.</p>
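<p>To make the pipeline concrete, here is a minimal, non-streaming sketch in Python. It is an illustration only: <code>detect_turn</code>, <code>transcribe</code>, <code>chat</code>, and <code>synthesize</code> are stand-in stubs for the VAD, ASR, LLM, and TTS services, not a real implementation, and a production orchestrator streams between these stages instead of waiting for each one to finish.</p>
<pre><code class="lang-python"># Conceptual sketch of the orchestration pipeline (illustration only).
# The four helpers are stubs standing in for the real VAD, ASR, LLM, and TTS services.

def detect_turn(chunk: bytes) -&gt; bool:      # VAD: has the speaker finished?
    return chunk == b""                      # stub: an empty chunk marks the end of a turn

def transcribe(audio: bytes) -&gt; str:        # ASR/STT: speech to text
    return "tell me a joke"                  # stub

def chat(text: str) -&gt; str:                 # LLM: text to response text (may include tool calls)
    return f"You said: {text}"               # stub

def synthesize(text: str) -&gt; bytes:         # TTS: response text to audio
    return text.encode()                     # stub

def handle_utterance(audio_chunks):
    speech = bytearray()
    for chunk in audio_chunks:               # audio arrives in small (~100 ms) chunks
        speech.extend(chunk)
        if detect_turn(chunk):               # conversation turn detected
            break
    reply = chat(transcribe(bytes(speech)))  # speech to text, then text to response
    return synthesize(reply)                 # audio to stream back to the client

print(handle_utterance([b"\x00" * 3200, b""]))
</code></pre>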
<p>The <a target="_blank" href="https://github.com/second-state/echokit_server">EchoKit server</a> is a stream-everything, highly efficient AI model orchestrator. It is entirely written in Rust for stability, safety, and speed.</p>
<h2 id="heading-the-voice-ai-orchestrator">The Voice AI Orchestrator</h2>
<p>The EchoKit server project is an open-source AI service orchestrator focused on real-time voice use cases. It starts up a WebSocket server that listens for streaming audio input and returns streaming audio responses.</p>
<p>You can build the <a target="_blank" href="https://github.com/second-state/echokit_server">echokit_server</a> project yourself using the Rust toolchain. Or, you can simply download the pre-built binary for your computer.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># for x86 / AMD64 CPUs</span>
curl -LO https://github.com/second-state/echokit_server/releases/download/v0.1.0/echokit_server-v0.1.0-x86_64-unknown-linux-gnu.tar.gz
tar -xzf echokit_server-v0.1.0-x86_64-unknown-linux-gnu.tar.gz

<span class="hljs-comment"># for arm64 CPUs</span>
curl -LO https://github.com/second-state/echokit_server/releases/download/v0.1.0/echokit_server-v0.1.0-aarch64-unknown-linux-gnu.tar.gz
tar -xzf echokit_server-v0.1.0-aarch64-unknown-linux-gnu.tar.gz
</code></pre>
<p>Then, run it as follows:</p>
<pre><code class="lang-bash">nohup ./echokit_server &amp;
</code></pre>
<p>It reads the <code>config.toml</code> file from the current directory. At the top of the file, you can configure the port on which the WebSocket server listens. You can also specify a WAV file that is downloaded to the connected <a target="_blank" href="https://echokit.dev/echokit_diy.html">EchoKit client device</a> as a welcome message.</p>
<pre><code class="lang-ini"><span class="hljs-attr">addr</span> = <span class="hljs-string">"0.0.0.0:8000"</span>
<span class="hljs-attr">hello_wav</span> = <span class="hljs-string">"hello.wav"</span>
</code></pre>
<h3 id="heading-configure-an-asr">Configure an ASR</h3>
<p>When the EchoKit server receives the user's voice data, it first sends the data to an ASR service to convert it into text.</p>
<p>There are many compelling ASR models available today. The EchoKit server can work with any OpenAI-compatible API providers, such as OpenAI itself, x.ai, OpenRouter, and Groq.</p>
<p>In our example, we use Groq’s Whisper ASR service. Whisper is a state-of-the-art ASR model released by OpenAI. Groq provides specialized hardware chips to run it very fast. You will first get <a target="_blank" href="https://console.groq.com/keys">a free API key from Groq</a>. Then, configure the ASR service as follows. Notice the “prompt” for the Whisper model. It is a tried-and-true prompt to reduce hallucination of the Whisper model.</p>
<pre><code class="lang-ini"><span class="hljs-section">[asr]</span>
<span class="hljs-attr">url</span> = <span class="hljs-string">"https://api.groq.com/openai/v1/audio/transcriptions"</span>
<span class="hljs-attr">api_key</span> = <span class="hljs-string">"gsk_XYZ"</span>
<span class="hljs-attr">model</span> = <span class="hljs-string">"whisper-large-v3"</span>
<span class="hljs-attr">lang</span> = <span class="hljs-string">"en"</span>
<span class="hljs-attr">prompt</span> = <span class="hljs-string">"Hello\n你好\n(noise)\n(bgm)\n(silence)\n"</span>
</code></pre>
<h3 id="heading-run-and-configure-a-vad">Run and configure a VAD</h3>
<p>In order to carry out a voice conversation, participants must detect each other's intentions and speak only when a turn arises. VAD (Voice Activity Detection) is a specialized AI model used to detect speech activity and, in particular, when the speaker has finished and expects an answer.</p>
<p>In EchoKit, we have VAD detection on both the device and the server.</p>
<ul>
<li><p>Device-side VAD: It detects human speech. The device ignores background noise, music, keyboard sounds, and barking dogs. It only sends human voice to the server.</p>
</li>
<li><p>Server-side VAD: It processes the audio stream in 100ms (0.1s) chunks. Once it detects that the speaker has finished, it sends all transcribed text to the LLM and starts waiting for the LLM’s response stream.</p>
</li>
</ul>
<p>The server-side VAD is optional, since the device-side VAD can also generate “conversation turn” signals. But due to the limited computing resources on the device, adding the server-side VAD can dramatically improve the overall VAD performance.</p>
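<p>To get an intuition for what a VAD computes, here is a tiny sketch using the upstream Silero VAD Python package (an assumption for illustration: <code>pip install silero-vad</code> and a 16 kHz mono WAV file). The production setup below uses a Rust port of the same model.</p>
<pre><code class="lang-python"># Illustration of VAD output using the upstream Silero VAD Python package.
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
wav = read_audio("speech.wav")  # assumed local file; loaded and resampled to 16 kHz mono
segments = get_speech_timestamps(wav, model, return_seconds=True)
print(segments)  # e.g. [{'start': 0.3, 'end': 2.1}]; silence after a segment marks a turn
</code></pre>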
<p>We’re porting the popular <a target="_blank" href="https://github.com/snakers4/silero-vad">Silero VAD</a> project from Python to Rust, and creating the <a target="_blank" href="https://github.com/second-state/silero_vad_server">silero_vad_server</a> project. Build the project <a target="_blank" href="https://github.com/second-state/silero_vad_server?tab=readme-ov-file#build-the-api-server">as instructed</a>. You can start the VAD server on your EchoKit server’s port 9094 as follows:</p>
<pre><code class="lang-bash">VAD_LISTEN=0.0.0.0:9094 nohup target/release/silero_vad_server &amp;
</code></pre>
<p>You might be wondering: why port to Rust? While many AI projects are written in Python for ease of development, Rust applications are often much lighter, faster, and safer at deployment. So, we’ll leverage AI tools like <a target="_blank" href="https://github.com/cardea-mcp/RustCoder">RustCoder</a> to port as much Python code as possible to Rust. The EchoKit software stack is largely written in Rust.</p>
<p>The VAD server is a WebSocket service that listens on port 9094. As we discussed, the EchoKit server will stream audio to this WebSocket and stop the ASR when a conversation turn is detected. Therefore, we’ll add the VAD service to the EchoKit server’s ASR config section in <code>config.toml</code>.</p>
<pre><code class="lang-ini"><span class="hljs-section">[asr]</span>
<span class="hljs-attr">url</span> = <span class="hljs-string">"https://api.groq.com/openai/v1/audio/transcriptions"</span>
<span class="hljs-attr">api_key</span> = <span class="hljs-string">"gsk_XYZ"</span>
<span class="hljs-attr">model</span> = <span class="hljs-string">"whisper-large-v3"</span>
<span class="hljs-attr">lang</span> = <span class="hljs-string">"en"</span>
<span class="hljs-attr">prompt</span> = <span class="hljs-string">"Hello\n你好\n(noise)\n(bgm)\n(silence)\n"</span>
<span class="hljs-attr">vad_realtime_url</span> = <span class="hljs-string">"ws://localhost:9094/v1/audio/realtime_vad"</span>
</code></pre>
<h3 id="heading-configure-an-llm">Configure an LLM</h3>
<p>Once the ASR service transcribes the user's voice into text, the next step in the pipeline is the LLM (Large Language Model). It’s the AI service that actually “thinks” and generates an answer in text.</p>
<p>Again, the EchoKit server can work with any OpenAI-compatible API providers for LLMs, such as OpenAI itself, x.ai, OpenRouter, and Groq. Since the voice service is highly sensitive to speed, we’ll choose Groq again. Groq supports a number of open-source LLMs. We’ll choose the <code>gpt-oss-20b</code> model released by OpenAI.</p>
<pre><code class="lang-ini"><span class="hljs-section">[llm]</span>
<span class="hljs-attr">llm_chat_url</span> = <span class="hljs-string">"https://api.groq.com/openai/v1/chat/completions"</span>
<span class="hljs-attr">api_key</span> = <span class="hljs-string">"gsk_XYZ"</span>
<span class="hljs-attr">model</span> = <span class="hljs-string">"openai/gpt-oss-20b"</span>
<span class="hljs-attr">history</span> = <span class="hljs-number">20</span>
</code></pre>
<p>The “history” field indicates how many messages should be kept in the context. Another crucial feature of an LLM application is the “system prompt,” where you instruct the LLM how it should “behave.” You can specify the system prompt in the EchoKit server config as well.</p>
<pre><code class="lang-ini"><span class="hljs-section">[[llm.sys_prompts]]</span>
<span class="hljs-attr">role</span> = <span class="hljs-string">"system"</span>
<span class="hljs-attr">content</span> = <span class="hljs-string">"""
You are a comedian. Engage in lighthearted and humorous conversation with the user. Tell jokes when appropriate.

"""</span>
</code></pre>
<p>Since Groq is very fast, it can process very large system prompts in under one second. You can add a lot more context and instructions to the system prompt. For example, you can give the application “knowledge” about a specific field by putting entire books into the system prompt.</p>
<h3 id="heading-configure-a-tts">Configure a TTS</h3>
<p>Finally, once the LLM generates a text response, the EchoKit server will call a TTS (text to speech) service to convert the text into voice and stream it back to the client device.</p>
<p>While Groq has a TTS service, it’s not particularly compelling. ElevenLabs is a leading TTS provider that offers hundreds of voice characters. It can express emotions and supports easy voice cloning. In the config below, you’ll put in your <a target="_blank" href="https://elevenlabs.io/app/sign-in?redirect=%2Fapp%2Fdevelopers%2Fapi-keys">ElevenLabs API key</a> and select a voice.</p>
<pre><code class="lang-ini"><span class="hljs-section">[tts]</span>
<span class="hljs-attr">platform</span> = <span class="hljs-string">"Elevenlabs"</span>
<span class="hljs-attr">token</span> = <span class="hljs-string">"sk_xyz"</span>
<span class="hljs-attr">voice</span> = <span class="hljs-string">"VOICE-ID-ABCD"</span>
</code></pre>
<p>The ElevenLabs TTS models and API services are all great, but they are not open-source. A very compelling open-source TTS, known as GPT-SoVITS, is also available.</p>
<p>The GPT-SoVITS engine has also been ported from Python to Rust as an open-source API server project called <a target="_blank" href="https://github.com/second-state/gsv_tts">gsv_tts</a>. It allows easy cloning of any voice. You can run a <a target="_blank" href="https://github.com/second-state/gsv_tts">gsv_tts</a> API server by following its instructions, then configure the EchoKit server to stream text to it and receive streaming audio back.</p>
<pre><code class="lang-ini"><span class="hljs-section">[tts]</span>
<span class="hljs-attr">platform</span> = <span class="hljs-string">"StreamGSV"</span>
<span class="hljs-attr">url</span> = <span class="hljs-string">"http://gsv_tts.server:port/v1/audio/stream_speech"</span>
<span class="hljs-attr">speaker</span> = <span class="hljs-string">"michael"</span>
</code></pre>
<h3 id="heading-configure-mcp-and-actions">Configure MCP and actions</h3>
<p>Of course, an “AI agent” is not just about chatting. It is about performing actions on specific tasks. For example, the <a target="_blank" href="https://www.youtube.com/watch?v=Zy-rLT4EgZQ">“US civics test prep”</a> use case, which I shared as an example video at the beginning of this article, requires the agent to get exam questions from a database, and then generate responses that guide the user toward the official answer. This is accomplished using LLM tools and actions.</p>
<ul>
<li><p>The LLM detects that the user is requesting a new question.</p>
</li>
<li><p>Instead of responding in natural language, it responds with a JSON structure that instructs the agent to "get a new question and answer."</p>
</li>
<li><p>The EchoKit server intercepts this JSON response and retrieves the question and answer from a database.</p>
</li>
<li><p>The EchoKit server sends the question and answer back to the LLM.</p>
</li>
<li><p>The LLM formulates a natural language response based on the question and answer.</p>
</li>
<li><p>The EchoKit server generates a voice response using its TTS service.</p>
</li>
</ul>
<p>As you can see, the EchoKit server needs to perform a few extra steps behind the scenes before it responds in voice. The EchoKit server leverages the MCP protocol for this. The function to look up questions and answers is provided by an open-source MCP server called <a target="_blank" href="https://github.com/cardea-mcp/ExamPrepAgent">ExamPrepAgent</a>.</p>
<p>The MCP protocol standardizes the tools and functions for LLMs to call. There are many MCP servers available for all kinds of different tasks. ExamPrepAgent is just one of them.</p>
<p>We are running this MCP server on port 8003. With the MCP server up and running, you only need to add the following configuration to EchoKit server’s <code>config.toml</code>.</p>
<pre><code class="lang-ini"><span class="hljs-section">[[llm.mcp_server]]</span>
<span class="hljs-attr">server</span> = <span class="hljs-string">"http://localhost:8003/mcp"</span>
<span class="hljs-attr">type</span> = <span class="hljs-string">"http_streamable"</span>
</code></pre>
<p>With MCP integration, the EchoKit AI agent can now perform actions. It can call APIs to send messages, make payments, or even turn electronic devices on or off.</p>
<h2 id="heading-local-ai-with-llamaedge">Local AI With LlamaEdge</h2>
<p>You’ve now seen the open-source EchoKit device working with the open-source EchoKit server to understand and respond to users in voice. But the AI models we use, while also open-source, run on commercial cloud providers. Can we run AI models using open-source technologies at home?</p>
<p><a target="_blank" href="https://github.com/LlamaEdge/LlamaEdge">LlamaEdge</a> is an open-source, cross-platform API server for AI models. It <a target="_blank" href="https://llamaedge.com/docs/ai-models/">supports many mainstream LLM, ASR, and TTS models</a> across Linux, Mac, Windows, and many CPU/GPU architectures. It’s perfect for running AI models on home or office computers. It also provides OpenAI-compatible API endpoints, which makes them very easy to integrate into the EchoKit server.</p>
<p>To install LlamaEdge and its dependencies, run the following shell command. It will detect your hardware and install the appropriate software that can fully take advantage of your GPUs (if any).</p>
<pre><code class="lang-bash">curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s
</code></pre>
<p>Then, download an open-source LLM model. I am using Google's Gemma model as an example.</p>
<pre><code class="lang-bash">curl -LO https://huggingface.co/second-state/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q5_K_M.gguf
</code></pre>
<p>Download the cross-platform LlamaEdge API server.</p>
<pre><code class="lang-bash">curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-api-server.wasm
</code></pre>
<p>Start a LlamaEdge API server with the Google Gemma LLM model. By default, it listens on localhost port 8080.</p>
<pre><code class="lang-bash">wasmedge --dir .:. --nn-preload default:GGML:AUTO:gemma-3-4b-it-Q5_K_M.gguf llama-api-server.wasm -p gemma-3
</code></pre>
<p>Test the OpenAI-compatible API on that server.</p>
<pre><code class="lang-bash">curl -X POST http://localhost:8080/v1/chat/completions \
  -H <span class="hljs-string">'accept: application/json'</span> \
  -H <span class="hljs-string">'Content-Type: application/json'</span> \
  -d <span class="hljs-string">'{"messages":[{"role":"system", "content": "You are a helpful assistant. Try to be as brief as possible."}, {"role":"user", "content": "Where is the capital of Texas?"}]}'</span>
</code></pre>
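<p>Because the endpoint is OpenAI-compatible, standard client libraries work against it as well. Here is a short sketch using the <code>openai</code> Python package (an assumption; it is not required by EchoKit itself). The model name <code>default</code> mirrors the <code>[llm]</code> configuration shown next.</p>
<pre><code class="lang-python"># Same request as the curl test, sent through the openai Python client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="NONE")
resp = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Try to be as brief as possible."},
        {"role": "user", "content": "Where is the capital of Texas?"},
    ],
)
print(resp.choices[0].message.content)
</code></pre>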
<p>Now, you can add this local LLM service to your EchoKit server configuration.</p>
<pre><code class="lang-ini"><span class="hljs-section">[llm]</span>
<span class="hljs-attr">llm_chat_url</span> = <span class="hljs-string">"http://localhost:8080/v1/chat/completions"</span>
<span class="hljs-attr">api_key</span> = <span class="hljs-string">"NONE"</span>
<span class="hljs-attr">model</span> = <span class="hljs-string">"default"</span>
<span class="hljs-attr">history</span> = <span class="hljs-number">20</span>
</code></pre>
<p>The LlamaEdge project supports more than LLMs. It runs the <a target="_blank" href="https://github.com/LlamaEdge/whisper-api-server">Whisper ASR model</a> and the <a target="_blank" href="https://github.com/LlamaEdge/tts-api-server">Piper TTS model</a> as OpenAI-compatible API servers as well.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The voice AI agent software stack is complex and deep. EchoKit is an open-source platform that ties together and coordinates all those components. It provides a good vantage point for us to learn about the entire stack.</p>
<p>I can’t wait to see what you build!</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
