<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Djibril-M🍀 - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Djibril-M🍀 - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Thu, 18 Jun 2026 20:59:02 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/djibrilm/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Build an Offline AI Image Generator in Node.js with QVAC and Socket.io ]]>
                </title>
                <description>
                    <![CDATA[ A few years ago, the first day I finally got access to an AI image generator, I was so excited that I immediately sat down and wrote an article about it (using Node.js and OpenAI's DALL-E). The magic  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-an-offline-ai-image-generator-in-node-js-with-qvac-and-socket-io/</link>
                <guid isPermaLink="false">6a3417433a2cf3cf64bb2225</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ image generation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Node.js ]]>
                    </category>
                
                    <category>
                        <![CDATA[ stable diffusion ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Express ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Djibril-M🍀 ]]>
                </dc:creator>
                <pubDate>Thu, 18 Jun 2026 16:05:23 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/db3fe63d-6df2-4250-b8f2-c166d9eafc3b.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>A few years ago, the first day I finally got access to an AI image generator, I was so excited that I immediately sat down and wrote an article about it <a href="https://dev.to/djibrilm/dall-e-with-nodejs-5chb">(using Node.js and OpenAI's DALL-E)</a>. The magic of turning thoughts directly into digital pixels felt like holding a real-life magic wand.</p>
<p>But back then, accessing these models wasn't a walk in the park. Our primary option was Midjourney, which meant you had to struggle on Discord, and sometimes you couldn't do anything due to rate limits and servers being very busy.</p>
<p>Accessing image generation back then felt like trying to order a coffee during a flash mob.</p>
<p>Thankfully, the landscape has completely shifted. Today, not only can we run state-of-the-art models like Stable Diffusion on consumer hardware, but we can do it locally, offline, and completely free of charge. We don't need any API keys, there aren't any subscription rate limits, and there are no Discord channels to deal with.</p>
<p>In this tutorial, we'll build a local web application using Node.js, Express, Socket.io, and the QVAC SDK to run a quantized Stable Diffusion 2.1 model.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-is-qvac">What is QVAC?</a></p>
</li>
<li><p><a href="#heading-how-stable-diffusion-works-under-the-hood">How Stable Diffusion Works Under the Hood</a></p>
</li>
<li><p><a href="#heading-gpu-limitations-metal-amd-and-the-intel-mac-trap">GPU Limitations: Metal, AMD, and the Intel Mac Trap</a></p>
</li>
<li><p><a href="#heading-the-image-generation-pipeline">The Image Generation Pipeline</a></p>
</li>
<li><p><a href="#heading-complete-implementation">Complete Implementation</a></p>
</li>
<li><p><a href="#heading-codebase-breakdown">Codebase Breakdown</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-resources-and-further-reading">Resources and Further Reading</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this tutorial, you should have a solid foundation in web backend and frontend basics:</p>
<ul>
<li><p><strong>Node.js and ES Modules</strong>: Basic familiarity with modern JavaScript modules (<code>import</code>/<code>export</code>), async loops, and event listeners.</p>
</li>
<li><p><strong>Express and WebSockets</strong>: Familiarity with routing static files and sending real-time messages over WebSockets with <code>socket.io</code>.</p>
</li>
<li><p><strong>HTML and Vanilla CSS</strong>: Understanding of basic DOM manipulation and style bindings.</p>
</li>
<li><p><strong>Development environment</strong>: A local machine with Node.js installed.</p>
</li>
</ul>
<h2 id="heading-what-is-qvac">What is QVAC?</h2>
<p>Developed by Tether, QVAC is a family of local inference tools designed to execute machine learning models directly on client hardware.</p>
<p>Instead of routing inference requests to expensive cloud-hosted APIs (such as DALL-E or Midjourney), QVAC bundles pre-compiled machine learning runtimes (like <code>llama.cpp</code> for text, <code>whisper.cpp</code> for transcription, and custom diffusion backends) directly into Node.js, mobile, and desktop runtimes.</p>
<p>Running local AI models with QVAC offers several practical advantages:</p>
<ul>
<li><p><strong>Zero API costs</strong>: Generate as many images as your hardware can handle without recurring costs.</p>
</li>
<li><p><strong>Privacy-first</strong>: Prompts and generated images are kept entirely in memory on your local machine.</p>
</li>
<li><p><strong>Offline independence</strong>: Run your application in isolated networks, on flights, or in regions without internet access.</p>
</li>
</ul>
<h2 id="heading-how-stable-diffusion-works-under-the-hood">How Stable Diffusion Works Under the Hood</h2>
<p>To execute image generation locally without running out of RAM, QVAC leverages a quantized <strong>Stable Diffusion 2.1 GGUF</strong> model (<code>SD_V2_1_1B_Q8_0</code>).</p>
<p>But how does the image generation process actually work? Let's make one thing clear: <strong>this is not a scientific paper</strong>. We aren't going to dive into the underlying multivariable calculus, probability distributions, or stochastic differential equations because I'm not a low-level machine learning researcher (and let's be honest, neither of us wants to stare at Greek symbols and linear algebra formulas on a screen when we could be writing clean JavaScript).</p>
<p>Instead, let's understand how these models work conceptually, using some intuitive developer analogies.</p>
<h3 id="heading-the-world-class-sculptor-analogy">The World-Class Sculptor Analogy</h3>
<p>At its core, modern AI image generation turns randomness into reality. Instead of "painting" an image from scratch, pixel-by-pixel, like a human illustrator with a brush, the AI essentially acts as a world-class sculptor, carving an image out of a block of digital static.</p>
<p>The most dominant technology behind this today is <strong>Diffusion</strong>, which powers models like Stable Diffusion, Midjourney, and Google's Imagen series.</p>
<p>Here is the conceptual step-by-step breakdown of how this block of static turns into art:</p>
<h4 id="heading-1-the-training-phase-learning-the-patterns">1. The Training Phase (Learning the Patterns)</h4>
<p>Before a model can generate anything, it has to look at billions of images and their corresponding text descriptions. During this phase, developers do something counterintuitive: <strong>they intentionally ruin the images</strong>.</p>
<ul>
<li><p><strong>Adding noise:</strong> The system takes a clear picture (for example, of a cat) and gradually adds random digital static (noise) over many small steps until the original image is completely unrecognizable.</p>
</li>
<li><p><strong>Learning to reverse it:</strong> The AI's job is to look at a noisy image and predict exactly how much noise was added at that specific step. By doing this billions of times, it becomes an expert at denoising – that is, turning chaos back into order.</p>
</li>
</ul>
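<p>To make the "adding noise" idea concrete, here is a toy sketch in plain JavaScript. It is not how QVAC or Stable Diffusion is implemented: the <code>addNoise</code> helper, the three-value "image", and the fixed noise values are all invented for illustration. It blends a clean signal with noise so that a noise level of 0 returns the original and a level of 1 returns only the noise.</p>
<pre><code class="language-javascript">// Toy forward-noising step (illustrative only, not the QVAC implementation).
// noiseLevel runs from 0 (clean image) to 1 (pure noise).
function addNoise(pixels, noise, noiseLevel) {
  const keep = Math.sqrt(1 - noiseLevel); // how much of the image survives
  const mix = Math.sqrt(noiseLevel);      // how much noise is blended in
  return pixels.map(function (value, i) {
    return keep * value + mix * noise[i];
  });
}

const pixels = [1, 0.5, 0];     // a made-up three-value "image"
const noise = [0.3, -0.8, 0.1]; // fixed values standing in for random static

console.log(addNoise(pixels, noise, 0));   // the original image
console.log(addNoise(pixels, noise, 0.5)); // part signal, part noise
console.log(addNoise(pixels, noise, 1));   // only the noise remains
</code></pre>
<p>During training, the model sees many such noisy versions at different noise levels and learns to predict the noise that was mixed in.</p>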
<h4 id="heading-2-connecting-words-to-visuals-clip">2. Connecting Words to Visuals (CLIP)</h4>
<p>To make sure the AI knows what a <em>"cat wearing a top hat"</em> looks like, it uses a text-to-image bridge, often powered by a system called <strong>CLIP</strong> (Contrastive Language-Image Pre-training).</p>
<ul>
<li><p>CLIP translates human language into a mathematical map (called an <strong>embedding</strong>).</p>
</li>
<li><p>In this map, the word "cat" and images of a cat sit very close together. This ensures that when you type a prompt, the AI knows exactly which visual concepts to pull from its memory.</p>
</li>
</ul>
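<p>As a rough illustration of what "close together" means, the sketch below compares toy embeddings with cosine similarity, a common way to measure how closely two vectors point in the same direction. The three-dimensional vectors are invented for this example. Real CLIP embeddings have hundreds of dimensions and come from a trained model.</p>
<pre><code class="language-javascript">// Toy embeddings, invented for illustration only.
const textCat = [0.9, 0.1, 0.0];  // the text "cat"
const imageCat = [0.8, 0.2, 0.1]; // a photo of a cat
const imageCar = [0.1, 0.1, 0.9]; // a photo of a car

function dot(a, b) {
  return a.reduce(function (sum, value, i) {
    return sum + value * b[i];
  }, 0);
}

// Cosine similarity: close to 1 means "same direction", close to 0 means "unrelated".
function cosineSimilarity(a, b) {
  return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
}

console.log(cosineSimilarity(textCat, imageCat).toFixed(3)); // high score
console.log(cosineSimilarity(textCat, imageCar).toFixed(3)); // low score
</code></pre>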
<h4 id="heading-3-the-generation-phase-the-reverse-diffusion-loop">3. The Generation Phase (The Reverse Diffusion Loop)</h4>
<p>When you type a prompt and hit "Generate," the magic happens in reverse:</p>
<ul>
<li><p><strong>The blank canvas:</strong> The AI starts with a canvas of pure, 100% random digital noise (it looks like old television static).</p>
</li>
<li><p><strong>The prompt guidance:</strong> The AI looks at your prompt and uses its text embedding to guide its eye. It looks at the random static and asks, <em>"Where in this mess can I start to see a cat?"</em></p>
</li>
<li><p><strong>Step-by-step denoising:</strong> The AI subtracts a little bit of noise, sharpening the image slightly. It repeats this loop 20 to 50 times. With every step, fuzzy shapes turn into rough outlines, textures appear, and eventually, a crisp, clean, brand-new image emerges.</p>
</li>
</ul>
<p><strong>Fun fact about seeds:</strong> Because the process starts from fresh random static every time, typing the exact same prompt twice will give you a different image (unless you lock down the starting randomness using a specific number called a <strong>seed</strong>).</p>
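<p>The effect of a seed can be sketched with a tiny seeded random number generator. This is a toy example, not the generator used by any real diffusion backend: <code>createSeededRandom</code> and <code>makeNoise</code> are hypothetical helpers. The point is only that the same seed reproduces the same starting noise, and a different seed does not.</p>
<pre><code class="language-javascript">// Toy seeded generator (a linear congruential generator), for illustration only.
function createSeededRandom(seed) {
  let state = seed % 4294967296;
  return function next() {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // a value between 0 and 1
  };
}

// Builds a small array of "static" from a seed.
function makeNoise(seed, length) {
  const random = createSeededRandom(seed);
  return Array.from({ length: length }, function () {
    return random();
  });
}

const first = makeNoise(42, 4);
const second = makeNoise(42, 4);
const other = makeNoise(43, 4);

console.log(JSON.stringify(first) === JSON.stringify(second)); // true: same seed, same noise
console.log(JSON.stringify(first) === JSON.stringify(other));  // false: different seed
</code></pre>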
<p>Here's an illustration of denoising with diffusion models:</p>
<img src="https://cdn.hashnode.com/uploads/covers/68e4f3e9867c1707d1b057a9/19131e8b-0f4f-4b76-a85b-2c7be2460a10.png" alt="Denoising with diffusion models diagram " style="display:block;margin:0 auto" width="3584" height="1234" loading="lazy">

<h3 id="heading-latent-diffusion-keeping-it-fast-the-vae">Latent Diffusion: Keeping it Fast (The VAE)</h3>
<p>Generating high-resolution images pixel-by-pixel requires massive computing power. If we tried to do this directly in pixel space on consumer hardware, our computers would melt, and a single generation would take hours.</p>
<p>To fix this, modern models use <strong>Latent Diffusion</strong>.</p>
<p>Instead of working with the full-sized image, a component called an <strong>encoder</strong> compresses the image into a smaller, abstract mathematical space (the <strong>"latent space"</strong>). Think of it as a shrunken playground where all the noising/denoising math happens. Because this playground is so small, the computations are far cheaper.</p>
<p>Once the denoising loop finishes in the latent space, another component called the <strong>decoder</strong> (the decoding half of a Variational Autoencoder, or VAE) blows it back up into a sharp, full-resolution image for you to see.</p>
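<p>A quick back-of-the-envelope calculation shows why the latent space is so much cheaper. The sketch below assumes the layout commonly described for Stable Diffusion 1.x/2.x models (an 8x downscale per side and 4 latent channels) and a 512x512 output. Treat those numbers as assumptions for illustration rather than values read from the QVAC model file.</p>
<pre><code class="language-javascript">// Assumed defaults: 8x spatial downscale and 4 latent channels.
function latentShape(width, height, downscale = 8, latentChannels = 4) {
  return {
    width: width / downscale,
    height: height / downscale,
    channels: latentChannels
  };
}

function countValues(shape) {
  return shape.width * shape.height * shape.channels;
}

const pixelValues = countValues({ width: 512, height: 512, channels: 3 });
const latentValues = countValues(latentShape(512, 512));

console.log(pixelValues);                // 786432
console.log(latentValues);               // 16384
console.log(pixelValues / latentValues); // 48
</code></pre>
<p>Under these assumptions, each denoising step touches about 48 times fewer values than it would in pixel space.</p>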
<h3 id="heading-architectures-supported-by-qvac">Architectures Supported by QVAC</h3>
<p>When you run local inference with QVAC, the SDK hooks into optimized, community-maintained C++ backends. QVAC manages the hardware bindings and model lifecycles for different AI modalities:</p>
<ol>
<li><p><strong>Text generation (</strong><code>llama.cpp</code><strong>):</strong> Used for large language models (LLMs) like Llama 3 or Mistral, executing auto-regressive token prediction.</p>
</li>
<li><p><strong>Audio transcription (</strong><code>whisper.cpp</code><strong>):</strong> Used for highly optimized speech-to-text transcription.</p>
</li>
<li><p><strong>Image Generation (</strong><code>stable-diffusion.cpp</code> <strong>/</strong> <code>sdcpp-generation</code><strong>):</strong> Our focus in this tutorial. QVAC supports two distinct approaches for image generation depending on the model architecture you choose:</p>
<ul>
<li><p><strong>The Bundled Model Approach (Stable Diffusion 1.5/2.1/XL):</strong> The traditional approach where the entire pipeline (Text Encoders, VAE, and the main Diffusion UNet) is baked into a single, unified GGUF file (for example, <code>SD_V2_1_1B_Q8_0</code>).  </p>
<p>This is incredibly convenient for local deployments because you only need to manage and load one file to start generating images.</p>
</li>
<li><p><strong>The Modular Multi-Model Approach (Flux):</strong> Modern architectures like <strong>FLUX.1</strong> use a much more complex setup. Instead of a single file, Flux splits its computational brain into separate components. You load a core Diffusion Transformer (DiT) model, but you must also separately load large text encoders (like T5-v1.1-xxl and CLIP-L) and an independent VAE model.  </p>
<p>While this requires more complex orchestration to load multiple GGUF files simultaneously, it provides vastly superior prompt adherence and photorealism by utilizing dedicated, massive text-understanding models.</p>
</li>
</ul>
</li>
<li><p><strong>Speech synthesis (TTS):</strong> Specialized architectures like <em>Chatterbox</em> (transformer-based zero-shot voice cloning) and <em>Supertonic</em> (diffusion-based speech denoising).</p>
</li>
</ol>
<h2 id="heading-gpu-limitations-metal-amd-and-the-intel-mac-trap">GPU Limitations: Metal, AMD, and the Intel Mac Trap</h2>
<p>When running machine learning models locally on Apple Mac hardware, QVAC will try to automatically accelerate execution by compiling compute pipelines for the <strong>Metal</strong> API to utilize the system's GPU.</p>
<p>If you're on an Apple Silicon Mac (M1, M2, M3, M4, or M5 chip), this works seamlessly: the Metal pipelines compile in seconds and run on the integrated GPU, which shares unified memory with the CPU.</p>
<p>But if you're running on an older Intel-based Mac with a discrete <strong>AMD Radeon GPU</strong> (such as the AMD Radeon Pro 5500M commonly found in 16-inch MacBook Pros), you'll run into a major driver-level limitation:</p>
<ul>
<li><p>The macOS Metal driver for older AMD discrete GPUs doesn't support the modern machine learning compute shaders and matrix reduction operators used by <code>stable-diffusion.cpp</code>.</p>
</li>
<li><p>When the inference worker attempts to run these unsupported operations, the driver fails to compile the pipeline and triggers a hard C++ crash (<code>SIGABRT</code>) inside the <code>ggml-metal-ops.cpp</code> shader encoder, abruptly exiting the background worker process.</p>
</li>
</ul>
<p>If you hit this hardware roadblock, the default GPU configuration will crash the application every time you trigger an image generation.</p>
<p>To resolve this, you should configure the model to run on the CPU instead by setting the model configuration parameter <code>device</code> to <code>"cpu"</code> and specifying the threads (for example, <code>threads: 4</code>). While generating images on the CPU takes longer than on a GPU, it runs successfully on any machine, regardless of how old or limited its GPU is.</p>
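<p>As a small sketch of that configuration, the helper below builds the <code>modelConfig</code> object for a given device. The option names (<code>prediction</code>, <code>device</code>, <code>threads</code>) are the ones this tutorial's server code passes to <code>loadModel()</code>. The <code>buildLoadConfig</code> helper itself is hypothetical and only illustrates the shape of the object.</p>
<pre><code class="language-javascript">// Hypothetical helper: builds the modelConfig object passed to loadModel().
function buildLoadConfig(preferredDevice) {
  const config = { prediction: 'v' };
  if (preferredDevice) {
    config.device = preferredDevice;
    if (preferredDevice === 'cpu') {
      config.threads = 4; // tune this to your CPU core count
    }
  }
  return config;
}

console.log(buildLoadConfig(null));  // { prediction: 'v' }
console.log(buildLoadConfig('cpu')); // { prediction: 'v', device: 'cpu', threads: 4 }
</code></pre>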
<h2 id="heading-the-image-generation-pipeline">The Image Generation Pipeline</h2>
<p>To coordinate the local execution lifecycle, our app sets up a real-time event pipeline:</p>
<pre><code class="language-plaintext">[Browser Client]                                  [Node.js Server]
       |                                                 |
       | ------ 1. Connects &amp; Checks Model ---------&gt;    |
       | &lt;----- 2. Downloads &amp; Loads Model ----------     | (Model Cached locally)
       |                                                 |
       | ------ 3. Submits prompt ("Cozy cabin...") -&gt;  |
       |                                                 |
       |                                                 | === [ QVAC Inference Engine ] ===
       |                                                 | 
       | &lt;----- 4. Denoising Step Updates (e.g. 5/20) -- | (Streams steps in real time)
       |                                                 |
       | &lt;----- 5. Sends final image (Base64 DataURL) -- | (Direct in-memory payload)
       |                                                 |
</code></pre>
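<p>Step 4 of this pipeline relies on consuming an async stream of step updates. The sketch below shows that pattern in isolation: <code>fakeProgressStream</code> is a hypothetical stand-in for the SDK's progress stream, and <code>collectProgress</code> maps each step to the kind of payload the server emits.</p>
<pre><code class="language-javascript">// Hypothetical stand-in for a progress stream: an async generator that
// yields one { step, totalSteps } object per denoising step.
async function* fakeProgressStream(totalSteps) {
  for (const index of Array(totalSteps).keys()) {
    yield { step: index + 1, totalSteps: totalSteps };
  }
}

// Maps each step to a payload with a rounded percentage and a status label.
async function collectProgress(stream) {
  const events = [];
  for await (const { step, totalSteps } of stream) {
    events.push({
      percent: Math.round((step / totalSteps) * 100),
      status: 'Denoising step ' + step + '/' + totalSteps + '...'
    });
  }
  return events;
}

collectProgress(fakeProgressStream(4)).then(function (events) {
  console.log(events.map(function (event) {
    return event.percent;
  })); // [ 25, 50, 75, 100 ]
});
</code></pre>
<p>In the real server, each payload is sent to the browser with <code>socket.emit('progress', ...)</code> instead of being collected in an array.</p>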
<h2 id="heading-complete-implementation">Complete Implementation</h2>
<p>Let's look at the implementation. You can <a href="https://github.com/DjibrilM/qvac-local-image-generation-Case-study-">clone the full project repository</a> to follow along, or build it from scratch by creating a project folder, running <code>npm init -y</code>, installing the dependencies (<code>@qvac/sdk</code>, <code>express</code>, <code>socket.io</code>, <code>concurrently</code>), and configuring <code>"type": "module"</code> in your <code>package.json</code>.</p>
<h3 id="heading-1-server-configuration-serverjs">1. Server Configuration (<code>server.js</code>)</h3>
<p>Create a file named <code>server.js</code> and paste the following implementation:</p>
<pre><code class="language-javascript">import express from 'express';
import path from 'path';
import http from 'http';
import { Server } from 'socket.io';
import fs from 'fs';
import { fileURLToPath } from 'url';
import { loadModel, unloadModel, getLoadedModelInfo, diffusion, SD_V2_1_1B_Q8_0 } from "@qvac/sdk";

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

const app = express();
const server = http.createServer(app);
const io = new Server(server);

const PORT = process.env.PORT || 3000;

app.use(express.json());
app.use(express.static(path.join(__dirname, 'public')));

const CONFIG_PATH = path.join(__dirname, '.device-preference.json');

function getPreferredDevice() {
  try {
    if (fs.existsSync(CONFIG_PATH)) {
      const data = JSON.parse(fs.readFileSync(CONFIG_PATH, 'utf8'));
      return data.device || null;
    }
  } catch (err) {
    console.error('Failed to read device preference:', err.message);
  }
  return null;
}

function setPreferredDevice(device) {
  try {
    fs.writeFileSync(CONFIG_PATH, JSON.stringify({ device }), 'utf8');
  } catch (err) {
    console.error('Failed to write device preference:', err.message);
  }
}

// Global model state
let loadedModelId = process.modelId || null;
let modelLoadPercent = 0;
let modelLoadStatus = 'Awaiting trigger...';
let isModelLoading = false;

const modelSize = (SD_V2_1_1B_Q8_0.expectedSize / (1024 * 1024 * 1024)).toFixed(2) + ' GB';

function broadcastModelProgress(percent, status) {
  io.emit('model-download-progress', { percent, status, size: modelSize });
}

io.on('connection', (socket) =&gt; {
  console.log('Client connected:', socket.id);

  socket.on('disconnect', () =&gt; {
    console.log('Client disconnected:', socket.id);
  });

  // Trigger model download
  socket.on('trigger-model-download', async () =&gt; {
    // If already loaded, verify it's still alive in the worker
    if (loadedModelId) {
      try {
        await getLoadedModelInfo({ modelId: loadedModelId });
        socket.emit('model-download-progress', {
          percent: 100,
          status: 'Model fully loaded locally.',
          size: modelSize
        });
        return;
      } catch (err) {
        console.log('Model ID was stale/not found, resetting state and reloading...', err.message);
        loadedModelId = null;
        process.modelId = null;
      }
    }

    // If currently loading, report current progress
    if (isModelLoading) {
      socket.emit('model-download-progress', {
        percent: Math.round(modelLoadPercent),
        status: modelLoadStatus,
        size: modelSize
      });
      return;
    }

    isModelLoading = true;
    modelLoadPercent = 0;
    modelLoadStatus = 'Initiating model download...';
    broadcastModelProgress(modelLoadPercent, modelLoadStatus);

    try {
      console.log('Starting model download...');
      const preferredDevice = getPreferredDevice();
      const loadConfig = { prediction: "v" };
      if (preferredDevice) {
        loadConfig.device = preferredDevice;
        if (preferredDevice === 'cpu') {
          loadConfig.threads = 4;
        }
        console.log(`Using cached device preference: ${preferredDevice}`);
      }

      loadedModelId = await loadModel({
        modelSrc: SD_V2_1_1B_Q8_0,
        modelType: "sdcpp-generation",
        modelConfig: loadConfig,
        onProgress: (p) =&gt; {
          modelLoadPercent = p.percentage;
          modelLoadStatus = p.percentage &gt;= 100 ? 'Model fully loaded locally.' : `Downloading model weights... (${p.percentage.toFixed(1)}%)`;
          broadcastModelProgress(Math.round(modelLoadPercent), modelLoadStatus);
        }
      });
      process.modelId = loadedModelId;

      isModelLoading = false;
      console.log('Model loaded successfully. ID:', loadedModelId);
    } catch (err) {
      isModelLoading = false;
      modelLoadPercent = 0;
      modelLoadStatus = 'Failed to load model: ' + err.message;
      console.error('Failed to load model:', err);
      broadcastModelProgress(0, modelLoadStatus);
      socket.emit('error_event', { message: 'Failed to load model: ' + err.message });
    }
  });

  socket.on('generate', async (data) =&gt; {
    const { prompt, ratio } = data;
    if (!prompt || prompt.trim() === '') {
      socket.emit('error_event', { message: 'Prompt is required' });
      return;
    }

    if (!loadedModelId) {
      socket.emit('error_event', { message: 'Model is not loaded yet' });
      return;
    }

    const runDiffusion = async (modelIdToUse) =&gt; {
      socket.emit('progress', {
        percent: 0,
        status: 'Starting diffusion process...',
        sub: 'DIFFUSION INITIALIZING'
      });

      console.log(`Generating image for prompt: "${prompt}" with ratio: ${ratio} using model ID: ${modelIdToUse}`);

      const { progressStream, outputs, stats } = diffusion({
        modelId: modelIdToUse,
        prompt,
      });

      // Stream progress steps
      for await (const { step, totalSteps } of progressStream) {
        const percent = Math.round((step / totalSteps) * 100);
        socket.emit('progress', {
          percent,
          status: `Denoising step ${step}/${totalSteps}...`,
          sub: 'RUNNING DIFFUSION'
        });
      }

      // Resolve output buffers
      const buffers = await outputs;
      if (!buffers || buffers.length === 0) {
        throw new Error('No image buffer returned from diffusion model.');
      }

      // Convert image buffer to a base64 Data URL instead of saving to disk
      const base64Data = Buffer.from(buffers[0]).toString('base64');
      const dataUrl = `data:image/png;base64,${base64Data}`;

      // Emit success
      socket.emit('success', {
        url: dataUrl,
        prompt,
        seed: (await stats).seed ?? -1 // keep a seed of 0; fall back to -1 only when it is missing
      });

      console.log(`Image generated and emitted successfully as base64 Data URL.`);
    };

    try {
      await runDiffusion(loadedModelId);
    } catch (err) {
      console.error('Image generation failed:', err);

      const isCrash = err.code === 50205 || (err.message &amp;&amp; err.message.includes('WORKER_CRASHED'));
      if (isCrash) {
        console.log('Worker crashed during GPU execution. Attempting CPU fallback...');

        // Save device preference so we load CPU directly next time and prevent double loading
        setPreferredDevice('cpu');

        // Reset the stale model state
        loadedModelId = null;
        process.modelId = null;

        socket.emit('progress', {
          percent: 0,
          status: 'GPU driver crashed. Automatically falling back to CPU mode...',
          sub: 'CPU FALLBACK LOADING'
        });

        try {
          console.log('Loading model on CPU...');
          isModelLoading = true;
          modelLoadPercent = 0;
          modelLoadStatus = 'Loading CPU model weights...';
          broadcastModelProgress(modelLoadPercent, modelLoadStatus);

          loadedModelId = await loadModel({
            modelSrc: SD_V2_1_1B_Q8_0,
            modelType: "sdcpp-generation",
            modelConfig: { prediction: "v", device: 'cpu', threads: 4 },
            onProgress: (p) =&gt; {
              modelLoadPercent = p.percentage;
              modelLoadStatus = `Loading CPU model weights... (${p.percentage.toFixed(1)}%)`;
              broadcastModelProgress(Math.round(modelLoadPercent), modelLoadStatus);
            }
          });
          process.modelId = loadedModelId;
          isModelLoading = false;
          console.log('Model loaded successfully on CPU. ID:', loadedModelId);

          // Retry diffusion on CPU
          await runDiffusion(loadedModelId);
        } catch (cpuErr) {
          console.error('CPU fallback execution failed:', cpuErr);
          isModelLoading = false;
          socket.emit('error_event', { message: 'Image generation failed on CPU: ' + cpuErr.message });
        }
      } else {
        if (err.message &amp;&amp; (err.message.includes('MODEL_NOT_FOUND') || err.message.includes('not found'))) {
          loadedModelId = null;
          process.modelId = null;
          broadcastModelProgress(0, 'Model state lost. Please re-trigger download.');
        }
        socket.emit('error_event', { message: 'Image generation failed: ' + err.message });
      }
    }
  });
});

app.get('*', (req, res) =&gt; {
  res.sendFile(path.join(__dirname, 'public', 'index.html'));
});

server.listen(PORT, () =&gt; {
  console.log(`Server is running at http://localhost:${PORT}`);
});

// Clean exit handler
async function handleCleanup() {
  const modelId = process.modelId || loadedModelId;
  if (modelId &amp;&amp; modelId !== 'mock-model-id') {
    try {
      await unloadModel({ modelId, clearStorage: false });
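    } catch (err) {
      // Best effort: the worker may already be gone, so log and exit anyway.
      console.error('Failed to unload model during shutdown:', err.message);
    }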
  }
  process.exit(0);
}

process.on('SIGINT', handleCleanup);
process.on('SIGTERM', handleCleanup);
</code></pre>
<h3 id="heading-2-frontend-architecture-summary">2. Frontend Architecture Summary</h3>
<p>Since our application runs completely locally, the frontend is a single-page web app built with vanilla HTML, CSS, and client-side JavaScript that communicates with our Express server over <strong>Socket.io</strong> WebSockets.</p>
<p>Rather than cluttering this tutorial with hundreds of lines of UI templates and style sheets, we'll keep the focus entirely on the backend orchestration. You can grab the complete HTML layout, Tailwind styles, and client script from the <a href="#heading-resources-and-further-reading">GitHub Repository</a>.</p>
<p>Here is a summary of how the client communicates with the server under the hood:</p>
<ol>
<li><p><strong>Preflight sync (</strong><code>trigger-model-download</code><strong>):</strong> As soon as the page loads, the client establishes a WebSocket connection and emits <code>trigger-model-download</code>. The server intercepts this to check if the model is cached/loading, and begins broadcasting progress.</p>
</li>
<li><p><strong>Denoising stream (</strong><code>progress</code><strong>):</strong> During image generation, the server streams a <code>progress</code> event for every denoising step (for example, <code>Denoising step 12/20...</code>). The client updates the visual progress bar and status labels accordingly.</p>
</li>
<li><p><strong>Data URL delivery (</strong><code>success</code><strong>):</strong> When the diffusion steps are completed, the server converts the binary image buffer into a Base64 string and emits a <code>success</code> event. The client binds this Base64 Data URL directly to the source of the <code>&lt;img&gt;</code> element for direct local display and instant download.</p>
</li>
</ol>
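<p>The three steps above can be sketched as a small wiring function. This is not the repository's client script. <code>wireClient</code> and the <code>ui</code> callbacks are hypothetical names, and <code>socket</code> is any object with <code>on()</code> and <code>emit()</code>, such as the one returned by <code>io()</code> in the Socket.io browser client. The event names match the server code above.</p>
<pre><code class="language-javascript">// Sketch of the client-side wiring (hypothetical helper, not the repo's script).
// `ui` is an object of callbacks that update the page.
function wireClient(socket, ui) {
  socket.on('model-download-progress', function (data) {
    ui.showModelStatus(data.percent, data.status);
  });
  socket.on('progress', function (data) {
    ui.showProgress(data.percent, data.status);
  });
  socket.on('success', function (data) {
    ui.showImage(data.url); // a Base64 Data URL, usable as an image source
  });
  socket.on('error_event', function (data) {
    ui.showError(data.message);
  });

  // Preflight: ask the server to check or load the model.
  socket.emit('trigger-model-download');

  // Returns a function the page can call when the user submits a prompt.
  return function generate(prompt) {
    socket.emit('generate', { prompt: prompt });
  };
}
</code></pre>
<p>In the browser, you would call something like <code>const generate = wireClient(io(), ui)</code> once on page load, then <code>generate('Cozy cabin...')</code> from the form's submit handler.</p>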
<h2 id="heading-codebase-breakdown">Codebase Breakdown</h2>
<p>Let’s lift the hood on the key mechanisms that make our local offline image generator work smoothly.</p>
<h3 id="heading-1-multi-client-model-id-binding-processmodelid">1. Multi-Client Model ID Binding (<code>process.modelId</code>)</h3>
<p>Quantized weights take a significant amount of memory. Every time we call <code>loadModel()</code>, QVAC boots a separate C++ background process (a <code>Bare</code> worker) to host the GGML runtime.</p>
<p>To prevent spawning multiple processes or loading the 2.3 GB GGUF model multiple times when a client refreshes a page or opens another browser tab, we store the loaded model ID globally on Node’s <code>process</code> object:</p>
<pre><code class="language-javascript">let loadedModelId = process.modelId || null;
// ...
process.modelId = loadedModelId;
</code></pre>
<p>This acts as a process-wide singleton registry. But using a global variable introduces a challenge: <strong>stale worker processes</strong>. If a client triggers a model load, gets an ID, and the background worker process later crashes or is killed, <code>process.modelId</code> remains populated with a dead reference.</p>
<p>To resolve this, every time a new client connects and requests a model download trigger, we preflight the model ID using <code>getLoadedModelInfo</code>:</p>
<pre><code class="language-javascript">if (loadedModelId) {
  try {
    await getLoadedModelInfo({ modelId: loadedModelId });
    socket.emit('model-download-progress', { percent: 100, status: 'Model fully loaded locally.' });
    return;
  } catch (err) {
    console.log('Model ID was stale, resetting state...', err.message);
    loadedModelId = null;
    process.modelId = null;
  }
}
</code></pre>
<p>If the background worker is dead, <code>getLoadedModelInfo</code> throws an error. The catch block intercepts this, wipes the stale references, and safely restarts the loading routine.</p>
<p><strong>Important (process singleton integrity):</strong> Always verify that the model is still loaded before starting inference. Without this check, calling <code>diffusion()</code> on a stale model ID will trigger immediate client-side connection timeouts and silent backend worker failures.</p>
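<p>The same preflight idea can be factored into a small reusable function. The sketch below is an assumption-laden illustration, not part of the QVAC SDK: <code>resolveLiveModelId</code> is a hypothetical helper, and the liveness check is injected as a callback so the example runs without the SDK. In the server, that callback would wrap <code>getLoadedModelInfo()</code>.</p>
<pre><code class="language-javascript">// Hypothetical helper: returns the model ID if the check passes,
// or null if there is no ID or the check throws.
async function resolveLiveModelId(modelId, checkAlive) {
  if (!modelId) {
    return null;
  }
  try {
    await checkAlive(modelId);
    return modelId;
  } catch (err) {
    console.log('Model ID was stale, resetting state...', err.message);
    return null;
  }
}

// Fake checkers standing in for a healthy worker and a dead one.
async function healthyWorker() {}
async function deadWorker() {
  throw new Error('worker not found');
}

resolveLiveModelId('model-1', healthyWorker).then(console.log); // model-1
resolveLiveModelId('model-1', deadWorker).then(console.log);    // null
</code></pre>
<p>Calling a helper like this before both the download trigger and each generation request keeps the stale-reference handling in one place.</p>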
<h3 id="heading-2-in-memory-image-serialization-zero-disk-writes">2. In-Memory Image Serialization (Zero Disk Writes)</h3>
<p>Writing generated images to the server's hard drive creates significant I/O overhead. It forces you to write custom cron cleanup scripts to delete old image files, and runs the risk of running out of disk space on systems with high user traffic.</p>
<p>Since QVAC’s <code>diffusion()</code> function outputs generated PNG files directly as in-memory binary buffers (<code>Uint8Array</code>), we bypass the local file system entirely. We serialize the binary array into a Base64 string directly in memory:</p>
<pre><code class="language-javascript">const base64Data = Buffer.from(buffers[0]).toString('base64');
const dataUrl = `data:image/png;base64,${base64Data}`;
</code></pre>
<p>This Data URL is transmitted over WebSockets to the client, which immediately binds it to the image element:</p>
<ul>
<li><p><strong>Zero disk overhead:</strong> The server doesn't write a single byte to the hard drive, preserving SSD life and preventing storage bloat.</p>
</li>
<li><p><strong>Instant delivery:</strong> Transmission is handled entirely within network memory buffers, bypassing disk serialization latency.</p>
</li>
<li><p><strong>Effortless client integration:</strong> The client doesn't need to request a static image URL path. It directly renders the Base64 Data URL, allowing users to save or download the image instantly.</p>
</li>
</ul>
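<p>To make the serialization step concrete, here is a minimal round-trip sketch. The helper names (<code>toPngDataUrl</code>, <code>fromDataUrl</code>, <code>base64Length</code>) are hypothetical, and the 8-byte PNG signature stands in for a real generated image.</p>

```typescript
import { Buffer } from "node:buffer";

// Hypothetical helper: wraps raw PNG bytes in a Base64 Data URL.
function toPngDataUrl(bytes: Uint8Array): string {
  return `data:image/png;base64,${Buffer.from(bytes).toString("base64")}`;
}

// Hypothetical helper: recovers the raw bytes from a Data URL (handy in tests).
function fromDataUrl(dataUrl: string): Uint8Array {
  const base64 = dataUrl.slice(dataUrl.indexOf(",") + 1);
  return new Uint8Array(Buffer.from(base64, "base64"));
}

// Base64 emits 4 output characters for every 3 input bytes (padding included).
function base64Length(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4;
}

// The 8-byte PNG file signature, used here as a stand-in payload.
const pngSignature = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

console.log(toPngDataUrl(pngSignature)); // data:image/png;base64,iVBORw0KGgo=
console.log(base64Length(3_000_000)); // 4000000
```

<p>One trade-off to keep in mind: Base64 makes the payload about one third larger than the raw bytes, so a 3 MB image travels as roughly 4 MB of text. For a single-user local tool that cost is usually negligible compared with the generation time.</p>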
<h3 id="heading-3-gpu-to-cpu-fallback-amp-preference-cache-strategy">3. GPU-to-CPU Fallback &amp; Preference Cache Strategy</h3>
<p>One of the biggest challenges with local-first AI is client hardware heterogeneity. For example, older Intel Macs with discrete AMD Radeon GPUs support Apple's Metal framework, but lack the modern tensor reduction operators used by the Stable Diffusion engine, causing a hard C++ crash (<code>SIGABRT</code>) inside <code>ggml-metal-ops.cpp</code>.</p>
<p>To keep the application running and ensure we don't trigger the model loading twice (once on the incompatible GPU on startup, and once on the CPU fallback after the first prompt crash), we use a persistent <strong>device preference cache</strong> file (<code>.device-preference.json</code>) alongside our C++ worker crash interceptor:</p>
<pre><code class="language-javascript">try {
  await runDiffusion(loadedModelId);
} catch (err) {
  const isCrash = err.code === 50205 || err.message?.includes('WORKER_CRASHED');
  if (!isCrash) {
    // Unrelated failure: surface it instead of silently swallowing it
    throw err;
  }

  // 1. Cache the CPU preference on disk
  setPreferredDevice('cpu');

  // 2. Reset stale references
  loadedModelId = null;
  process.modelId = null;

  // 3. Automatically load the model on CPU with multi-threading
  loadedModelId = await loadModel({
    modelSrc: SD_V2_1_1B_Q8_0,
    modelType: "sdcpp-generation",
    modelConfig: { prediction: "v", device: "cpu", threads: 4 }
  });
  process.modelId = loadedModelId;

  // 4. Transparently retry generation
  await runDiffusion(loadedModelId);
}
</code></pre>
<p>This approach utilizes a two-layered defense:</p>
<ol>
<li><p><strong>Dynamic recovery:</strong> If a GPU driver error triggers a crash, the app intercepts it, saves <code>"device": "cpu"</code> to the <code>.device-preference.json</code> file, dynamically reloads the weights into CPU threads, and retries the generation. The client simply sees a status update indicating that CPU fallback is in progress, and the request survives what would otherwise be a fatal crash.</p>
</li>
<li><p><strong>Preference persistence:</strong> The next time the server starts or a page is loaded, the preflight loading routine reads the cached preference from the disk and loads the CPU model immediately:</p>
</li>
</ol>
<pre><code class="language-javascript">const preferredDevice = getPreferredDevice(); // Reads .device-preference.json
const loadConfig = { prediction: "v" };
if (preferredDevice) {
  loadConfig.device = preferredDevice;
  if (preferredDevice === 'cpu') {
    loadConfig.threads = 4;
  }
}
loadedModelId = await loadModel({
  modelSrc: SD_V2_1_1B_Q8_0,
  modelType: "sdcpp-generation",
  modelConfig: loadConfig,
  // ...
});
</code></pre>
<p>This prevents the server from making redundant GPU load attempts on subsequent sessions, ensuring that the model is loaded only once and directly onto the correct hardware execution target.</p>
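<p>The article calls <code>getPreferredDevice()</code> and <code>setPreferredDevice()</code> without showing their bodies. The sketch below is one possible implementation, not the project's actual code. It assumes the file holds a JSON object shaped like <code>{ "device": "cpu" }</code>, it treats <code>"gpu"</code> as a valid second value (an assumption), and it takes the file path as a parameter so it can be exercised against a temporary directory.</p>

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

type Device = "cpu" | "gpu"; // "gpu" is an assumed value; the article only shows "cpu"

// Returns the cached device, or null when the cache is missing or unreadable.
function getPreferredDevice(file: string): Device | null {
  try {
    const parsed = JSON.parse(fs.readFileSync(file, "utf8"));
    return parsed.device === "cpu" || parsed.device === "gpu" ? parsed.device : null;
  } catch {
    // Missing or corrupt cache: let the caller fall back to the default device.
    return null;
  }
}

// Writes to a temp file first, then renames it, so a crash mid-write
// cannot leave a half-written preference file behind.
function setPreferredDevice(file: string, device: Device): void {
  const tmp = `${file}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify({ device }));
  fs.renameSync(tmp, file);
}

// Usage against a throwaway directory:
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "device-pref-"));
const prefFile = path.join(dir, ".device-preference.json");
console.log(getPreferredDevice(prefFile)); // null
setPreferredDevice(prefFile, "cpu");
console.log(getPreferredDevice(prefFile)); // cpu
```

<p>Returning <code>null</code> for a missing or corrupt file keeps the first-run path identical to the "no preference yet" path, so the loader only needs the single <code>if (preferredDevice)</code> check shown above.</p>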
<p><strong>Warning (CPU fallback latency):</strong> While CPU mode keeps the app working on older hardware, it runs the computation on a handful of CPU threads instead of the GPU's parallel cores. Consequently, generation times will be significantly longer (typically 1 to 2 minutes on CPU compared to 10 to 15 seconds on a compatible GPU). Make sure to design responsive progress loaders in the UI to manage user expectations during fallback.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Running local-first Stable Diffusion with QVAC gives you absolute control over your inference costs and data privacy. By coupling on-device GGML models with a simple Node.js WebSocket backend, you can build responsive web tools that run completely offline without ever spending money on cloud APIs.</p>
<p>As mobile and desktop system-on-chip architectures continue to pack more neural engines, local-first AI architectures will become an increasingly powerful option for modern developers.</p>
<h2 id="heading-resources-and-further-reading">Resources and Further Reading</h2>
<ul>
<li><p><a href="https://docs.qvac.tether.io/ai-capabilities/image-generation/#image-to-image"><strong>QVAC Image-to-Image Generation Documentation</strong></a></p>
</li>
<li><p><a href="https://github.com/tetherto/qvac-sdk"><strong>QVAC SDK GitHub Repository</strong></a></p>
</li>
<li><p><a href="https://huggingface.co/models?search=gguf"><strong>Stable Diffusion GGUF models on Hugging Face</strong></a></p>
</li>
<li><p><a href="https://socket.io/docs/v4/"><strong>Socket.io WebSockets Guide</strong></a></p>
</li>
<li><p><a href="https://github.com/DjibrilM/qvac-local-image-generation-Case-study-"><strong>Full codebase</strong></a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Run Private Text-to-Speech on Your Own Hardware Using QVAC ]]>
                </title>
                <description>
                    <![CDATA[ When I was putting the final touches on QuizRope, an educational mobile app I built that uses LLMs for real-time tutoring and homework assistance, I knew the next logical step was voice. Reading text  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-run-private-text-to-speech-on-your-own-hardware-using-qvac/</link>
                <guid isPermaLink="false">6a2e0cb22e4a72670f854140</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ React Native ]]>
                    </category>
                
                    <category>
                        <![CDATA[ TextToSpeech ]]>
                    </category>
                
                    <category>
                        <![CDATA[ privacy ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Djibril-M🍀 ]]>
                </dc:creator>
                <pubDate>Sun, 14 Jun 2026 02:06:42 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/3ac11484-05eb-4e59-9d35-f2bad4d1d730.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When I was putting the final touches on <a href="https://github.com/DjibrilM/Quiz-rope-">QuizRope</a>, an educational mobile app I built that uses LLMs for real-time tutoring and homework assistance, I knew the next logical step was voice. Reading text on a screen is great, but having an AI tutor physically <em>speak</em> to you transforms the entire learning experience.</p>
<p>Naturally, my first instinct was to look at cloud providers. While services like ElevenLabs offer incredible voice quality, I quickly ran the numbers. Between the API pricing, token consumption for lengthy tutoring sessions, and the sheer volume of users I anticipated, the math got ugly very quickly. Relying on a paid API for every single sentence spoken within the app simply wasn't sustainable for an independent developer.</p>
<p>If you’re about to ask, "How far did you get with QuizRope?", well honestly, I straight-up gave up on the project back then because I couldn't find a sane, affordable solution for the TTS feature.</p>
<p>Beyond the prohibitive cost, there was the latency. Waiting for a server to process a prompt, generate the audio, and stream it back down to a mobile device completely breaks the conversational illusion. And worst of all, it meant every question a student asked would be beamed to a third-party server.</p>
<p>That frustration became the catalyst for my search to find a reliable, offline, and completely zero-cost solution.</p>
<p>In this article, we’re going to build a React Native application that performs high-fidelity Text-to-Speech (TTS) completely offline using your device's own hardware.</p>
<p>If you haven't set up your environment or need a refresher on local inference fundamentals, I highly recommend reading my previous article, <a href="https://www.freecodecamp.org/news/how-to-run-an-llm-locally-on-your-mobile-phone-with-qvac-and-expo/">How to Run a Local LLM Offline in React Native with QVAC</a>, where I cover project initialization, prebuilding, and native hardware dependencies.</p>
<p>This guide assumes you already have a project with the QVAC SDK configured and ready to run on a physical device.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-is-qvac">What is QVAC?</a></p>
</li>
<li><p><a href="#heading-the-architecture-supported-by-qvac">The Architecture Supported by QVAC</a></p>
</li>
<li><p><a href="#heading-the-inference-pipeline">The Inference Pipeline</a></p>
</li>
<li><p><a href="#heading-environment-and-dependency-config">Environment and Dependency Config</a></p>
</li>
<li><p><a href="#heading-the-audio-utility-packaging">The Audio Utility Packaging</a></p>
</li>
<li><p><a href="#heading-complete-implementation">Complete Implementation</a></p>
</li>
<li><p><a href="#heading-codebase-breakdown">Codebase Breakdown</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-resources-and-further-reading">Resources and Further Reading</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this article, you should have a solid foundation in modern web and mobile development:</p>
<ul>
<li><p><strong>JavaScript/TypeScript &amp; React</strong>: Familiarity with React concepts and hooks, especially <code>useState</code>, <code>useEffect</code>, and <code>useRef</code>.</p>
</li>
<li><p><strong>React Native &amp; Expo</strong>: Basic understanding of layout structures (such as <code>View</code>, <code>ScrollView</code>, <code>TextInput</code>) and styling conventions.</p>
</li>
<li><p><strong>Asynchronous JavaScript &amp; Binary Buffers</strong>: Experience with <code>async/await</code>, Promises, and basic manipulation of arrays like <code>Int16Array</code> or <code>Buffer</code>.</p>
</li>
<li><p><strong>Development Build Environment</strong>: Familiarity with running local development compilation commands, specifically <code>npx expo prebuild</code> to build native iOS and Android modules.</p>
</li>
<li><p><strong>Physical Mobile Device</strong>: Because local machine learning models leverage device-specific hardware acceleration and native optimizations, the QVAC SDK doesn't support simulator environments. You must have a physical iOS or Android testing device with Developer Mode enabled.</p>
</li>
</ul>
<h2 id="heading-what-is-qvac">What is QVAC?</h2>
<p>To help you follow along more effectively, let’s establish what QVAC is and why it exists.</p>
<p>Developed by Tether, QVAC is a local-first AI SDK designed for building cross-platform, peer-to-peer (P2P) applications and systems.</p>
<p>Many mobile applications that utilize Large Language Models (LLMs) or Text-to-Speech (TTS) engines rely on network requests to cloud-hosted APIs (such as OpenAI or ElevenLabs). While convenient, this model introduces dependencies on network connectivity, recurring API usage fees, and transmission of user data to third-party servers.</p>
<p>QVAC provides an alternative by executing AI models directly on the client device. This local-first architecture offers several practical advantages:</p>
<ul>
<li><p><strong>Local-first execution</strong>: Runs inference directly on the client hardware, eliminating the need for external APIs or active internet connections.</p>
</li>
<li><p><strong>Peer-to-peer (P2P) support</strong>: Allows distributing inference tasks across local networks, helping coordinate workloads without centralized servers.</p>
</li>
<li><p><strong>Cross-platform compatibility</strong>: Provides a single JavaScript/TypeScript interface that works consistently across different hardware and runtime environments.</p>
</li>
<li><p><strong>Unified capabilities</strong>: Exposes text generation, transcription, image generation, and speech synthesis within a single package.</p>
</li>
</ul>
<h3 id="heading-key-concepts-for-on-device-inference">Key Concepts for On-Device Inference</h3>
<p>To understand how QVAC runs on a mobile device, we must keep a few key concepts in mind:</p>
<ul>
<li><p><strong>On-Device Inference</strong>: Running model calculations locally. Rather than relying on a single engine, QVAC supports multiple specialized local inference backends depending on the task (such as <code>llama.cpp</code> for text, <code>whisper.cpp</code> for transcription, or custom diffusion backends for image generation). Under the hood, these engines memory-map quantized model weights directly into the device's RAM and run calculations using native GPU hardware acceleration.</p>
</li>
<li><p><strong>Quantization (GGUF format)</strong>: A mathematical optimization technique that compresses the model's weights (for example, from a standard 16-bit floating-point precision down to 4-bit or 8-bit integers). This makes it possible for models to fit into the memory constraints of consumer mobile hardware while keeping output quality high.</p>
</li>
<li><p><strong>KV (Key-Value) Cache</strong>: A memory area that stores calculated states of previous tokens so the model doesn't have to re-evaluate the entire context window with every word or token it generates.</p>
</li>
</ul>
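<p>A quick back-of-the-envelope calculation shows why quantization matters on a phone. The sketch below uses a hypothetical 1-billion-parameter model and ignores file metadata. The 8.5 bits-per-weight figure reflects ggml's <code>Q8_0</code> layout as I understand it (32 8-bit values plus one 16-bit scale per block), so treat it as an approximation.</p>

```typescript
// Approximate size of the weights in decimal gigabytes.
function weightSizeGB(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8 / 1e9;
}

const params = 1_000_000_000; // hypothetical 1B-parameter model

console.log(weightSizeGB(params, 16)); // 2      (16-bit floats)
console.log(weightSizeGB(params, 8.5)); // 1.0625 (Q8_0-style 8-bit blocks)
console.log(weightSizeGB(params, 4)); // 0.5    (plain 4-bit integers)
```

<p>Halving the bits per weight roughly halves the memory footprint, which is often the difference between a model that loads on a mid-range phone and one that gets killed by the operating system.</p>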
<h2 id="heading-the-architecture-supported-by-qvac">The Architecture Supported by QVAC</h2>
<p>Before writing code, it's crucial to understand what's actually happening under the hood. To handle local execution without melting your device, the QVAC SDK manages the hardware binding and model lifecycle while hooking into optimized, community-maintained <a href="https://huggingface.co/blog/introduction-to-ggml"><strong>GGML</strong></a> inference backends.</p>
<p>Instead of a one-size-fits-all approach, the QVAC SDK supports two distinctly different neural architectures for speech synthesis. Depending on your application's needs — whether you want instant voice cloning or ultra-high-fidelity pre-trained voices — you'll choose between <strong>Chatterbox</strong> and <strong>Supertonic</strong>.</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Chatterbox</th>
<th>Supertonic</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Architecture</strong></td>
<td>Transformer-based language model</td>
<td>Diffusion-based latent denoising</td>
</tr>
<tr>
<td><strong>Model Structure</strong></td>
<td>Split (T3 GGUF + S3Gen companion)</td>
<td>Single file (GGUF)</td>
</tr>
<tr>
<td><strong>Voice Method</strong></td>
<td>Zero-shot voice cloning (Reference WAV)</td>
<td>Pre-trained voice styles</td>
</tr>
<tr>
<td><strong>Sample Rate</strong></td>
<td>24,000 Hz</td>
<td>44,100 Hz</td>
</tr>
</tbody></table>
<h3 id="heading-1-the-chatterbox-engine">1. The Chatterbox Engine</h3>
<p>Chatterbox is built on a <strong>transformer-based language model</strong> architecture. It treats audio generation similarly to how an LLM predicts the next word in a sentence, but instead, it predicts discrete acoustic tokens.</p>
<p>Because of this architecture, Chatterbox excels at <strong>zero-shot voice cloning</strong>. Instead of relying purely on pre-baked voices, you can pass an optional <code>referenceAudioSrc</code> (a short WAV file of someone speaking) alongside your text. The transformer analyzes the reference audio's acoustic properties and generates a cloned voice based on those features.</p>
<h3 id="heading-2-the-supertonic-engine">2. The Supertonic Engine</h3>
<p>Supertonic takes a completely different approach, utilizing <a href="https://www.emergentmind.com/topics/latent-denoising-diffusion-models"><strong>diffusion-based latent denoising</strong></a> — the same fundamental architecture used by AI image generators like Stable Diffusion, but applied to audio.</p>
<p>It starts with pure digital noise and iteratively refines it into a 44.1 kHz high-fidelity speech waveform based on the text prompt. Supertonic uses a single, unified GGUF file rather than a split model. Instead of dynamic voice cloning, it relies on highly optimized, pre-trained voice styles (for example, <code>voice: "F1"</code> or <code>voice: "M1"</code>) baked directly into the model. This makes it incredibly efficient for generating crystal-clear, studio-quality speech when you don't need dynamic cloning capabilities.</p>
<p>For this tutorial, we'll use Supertonic. It yields fantastic results out of the box and avoids the complexity of loading multiple companion files.</p>
<h2 id="heading-the-inference-pipeline">The Inference Pipeline</h2>
<p>To visualize how we interact with these engines in our codebase, think of local TTS (Text to Speech) as running a virtual recording studio right in your phone's memory:</p>
<ol>
<li><p><strong>Hiring the actor (loading the model):</strong> We map the compressed GGUF file directly into the device's RAM or GPU VRAM.</p>
</li>
<li><p><strong>Handing over the script (text input):</strong> We pass plain text to the loaded engine.</p>
</li>
<li><p><strong>The performance (inference):</strong> The engine reads the text and mathematically predicts the sound waves. Crucially, the AI doesn't emit a finished audio file. Instead, it outputs raw digital sound waves known as PCM samples.</p>
</li>
<li><p><strong>Packaging the audio:</strong> Because a raw list of numbers can't be played by standard media players, we must manually wrap the PCM data in a standard WAV header.</p>
</li>
<li><p><strong>Closing the studio (unloading):</strong> Because speech synthesis is memory-intensive and maintains a persistent state, the model is cleared from RAM to free up resources and flush its context.</p>
</li>
</ol>
<h2 id="heading-environment-and-dependency-config">Environment and Dependency Config</h2>
<p>Before we jump into the codebase, there's a crucial dependency setup to keep in mind if your project uses the pnpm package manager.</p>
<p>Because QVAC plugins rely on transitive native peer dependencies, strict package managers like pnpm will lock these dependencies down inside hidden <code>.pnpm</code> subfolders.</p>
<p>To ensure the QVAC native bundler (<code>bare-pack</code>) can resolve your worker plugins correctly at build time, create a <code>.npmrc</code> file in the root of your project:</p>
<pre><code class="language-ini">shamefully-hoist=true
</code></pre>
<p>IMPORTANT: After creating this file, you must run a clean dependency install (<code>pnpm install</code>). This ensures a flat layout in your root <code>node_modules</code> so that all QVAC-specific helper packages are resolved properly during your local <code>npx expo prebuild</code> compilation step.</p>
<h2 id="heading-the-audio-utility-packaging">The Audio Utility Packaging</h2>
<p>Because QVAC outputs raw PCM arrays, we need to construct a valid WAV file in memory and write it to the device's storage before the native audio player can play it.</p>
<p>To achieve this, let's create a utility module inside <code>src/lib/utils.ts</code> to build the required WAV header, convert raw audio samples into a binary buffer, and write it to local storage.</p>
<pre><code class="language-typescript">import { Buffer } from "buffer";
import * as FileSystem from "expo-file-system/legacy";

/**
 * Creates a WAV header for 16-bit PCM audio
 */
export function createWavHeader(
  dataLength: number,
  sampleRate: number,
): Buffer {
  const buffer = Buffer.alloc(44);
  const channels = 1; // Mono
  const byteRate = sampleRate * channels * 2; // 16-bit audio
  const blockAlign = channels * 2;

  buffer.write("RIFF", 0);
  buffer.writeUInt32LE(36 + dataLength, 4);
  buffer.write("WAVE", 8);
  buffer.write("fmt ", 12);
  buffer.writeUInt32LE(16, 16); // Subchunk1Size
  buffer.writeUInt16LE(1, 20); // AudioFormat (PCM)
  buffer.writeUInt16LE(channels, 22);
  buffer.writeUInt32LE(sampleRate, 24);
  buffer.writeUInt32LE(byteRate, 28);
  buffer.writeUInt16LE(blockAlign, 32);
  buffer.writeUInt16LE(16, 34); // BitsPerSample
  buffer.write("data", 36);
  buffer.writeUInt32LE(dataLength, 40);

  return buffer;
}

/**
 * Converts the raw Int16Array samples from QVAC to a binary Buffer
 */
export function int16ArrayToBuffer(int16Array: Int16Array): Buffer {
  const buffer = Buffer.alloc(int16Array.length * 2);
  for (let i = 0; i &lt; int16Array.length; i++) {
    buffer.writeInt16LE(int16Array[i] ?? 0, i * 2);
  }
  return buffer;
}

/**
 * Main function to package and save the file to local mobile storage
 */
export async function saveAudioToDevice(
  audioBuffer: Int16Array,
  sampleRate: number,
): Promise&lt;string&gt; {
  try {
    const audioData = int16ArrayToBuffer(audioBuffer);
    const wavHeader = createWavHeader(audioData.length, sampleRate);
    const finalWavBuffer = Buffer.concat([wavHeader, audioData]);
    const base64Data = finalWavBuffer.toString("base64");

    const filename = `tts-speech-${Date.now()}.wav`;
    const fileUri = `${FileSystem.documentDirectory}${filename}`;

    await FileSystem.writeAsStringAsync(fileUri, base64Data, {
      encoding: FileSystem.EncodingType.Base64,
    });

    console.log(`✅ File saved locally at: ${fileUri}`);
    return fileUri;
  } catch (error) {
    console.error("❌ Failed to save audio file locally:", error);
    throw error;
  }
}
</code></pre>
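<p>When a generated file refuses to play, it helps to read the header back and confirm the fields are what you expect. The following is a minimal sketch of such a check. <code>parseWavHeader</code> and <code>WavInfo</code> are hypothetical names, and the parser only understands the canonical 44-byte layout produced by <code>createWavHeader</code> above, not arbitrary WAV files with extra chunks.</p>

```typescript
interface WavInfo {
  channels: number;
  sampleRate: number;
  bitsPerSample: number;
  dataLength: number;
  durationSeconds: number;
}

// Reads a four-character chunk tag such as "RIFF" at the given byte offset.
function tag(view: DataView, offset: number): string {
  return String.fromCharCode(
    view.getUint8(offset),
    view.getUint8(offset + 1),
    view.getUint8(offset + 2),
    view.getUint8(offset + 3),
  );
}

// Parses the canonical 44-byte PCM header and derives the clip duration.
function parseWavHeader(bytes: Uint8Array): WavInfo {
  const hasFullHeader = bytes.byteLength >= 44;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  if (
    !hasFullHeader ||
    tag(view, 0) !== "RIFF" ||
    tag(view, 8) !== "WAVE" ||
    tag(view, 36) !== "data"
  ) {
    throw new Error("Not a canonical 44-byte PCM WAV header");
  }
  const channels = view.getUint16(22, true); // all fields are little-endian
  const sampleRate = view.getUint32(24, true);
  const byteRate = view.getUint32(28, true);
  const bitsPerSample = view.getUint16(34, true);
  const dataLength = view.getUint32(40, true);
  // Duration follows from payload size divided by bytes consumed per second.
  return { channels, sampleRate, bitsPerSample, dataLength, durationSeconds: dataLength / byteRate };
}
```

<p>Because <code>Buffer</code> is a <code>Uint8Array</code> subclass, you could pass <code>finalWavBuffer</code> straight into this function during development. For mono 16-bit audio at 44,100 Hz the byte rate is 88,200, so an 88,200-byte payload should report a duration of exactly one second.</p>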
<h2 id="heading-complete-implementation">Complete Implementation</h2>
<p>Let's bring it all together. We'll implement an interface that takes user input, manages download and loading states for the Supertonic engine, packages generated raw waves into a playable local file, and renders an interactive visual waveform player.</p>
<p>Replace your entry app file <code>src/app/index.tsx</code> with the following implementation:</p>
<pre><code class="language-tsx">import { useState, useEffect } from "react";
import {
  TextInput,
  KeyboardAvoidingView,
  Platform,
  ScrollView,
} from "react-native";
import {
  loadModel,
  unloadModel,
  textToSpeech,
  downloadAsset,
  TTS_EN_SUPERTONIC_Q8_0,
  getModelInfo,
  type ModelProgressUpdate,
} from "@qvac/sdk";
import { saveAudioToDevice } from "@/lib/utils";
import { TtsModelLoader } from "@/components/tts-model-loader";
import { AudioPlayer } from "@/components/audio-player";
import {
  Card,
  CardContent,
  CardDescription,
  CardHeader,
  CardTitle,
} from "@/components/ui/card";
import { Button } from "@/components/ui/button";
import { Text } from "@/components/ui/text";

const SUPERTONIC_SAMPLE_RATE = 44100;

// Global reference for our model ID
let globalModelId: string | null = null;

type TtsStatus =
  | { phase: "idle" }
  | { phase: "synthesizing" }
  | { phase: "done"; audioUri: string }
  | { phase: "error"; message: string };

export default function TextToVoiceScreen() {
  const [text, setText] = useState("");
  const [status, setStatus] = useState&lt;TtsStatus&gt;({ phase: "idle" });

  const [isModelLoaded, setIsModelLoaded] = useState(!!globalModelId);
  const [isDownloading, setIsDownloading] = useState(false);
  const [downloadProgress, setDownloadProgress] = useState(0);

  const isBusy = status.phase === "synthesizing";

  useEffect(() =&gt; {
    async function checkAndAutoLoad() {
      if (globalModelId) return;
      try {
        const info = await getModelInfo({ name: TTS_EN_SUPERTONIC_Q8_0.name });
        if (info.isCached) {
          setIsDownloading(true);
          setDownloadProgress(1);

          globalModelId = await loadModel({
            modelSrc: TTS_EN_SUPERTONIC_Q8_0,
            modelConfig: {
              ttsEngine: "supertonic",
              language: "en",
              voice: "F1",
              ttsSpeed: 1.05,
              ttsNumInferenceSteps: 5,
            },
          });

          setIsModelLoaded(true);
          setIsDownloading(false);
        }
      } catch (err: unknown) {
        console.warn("Failed to auto-load cached model on mount:", err);
        setIsDownloading(false);
      }
    }
    checkAndAutoLoad();
  }, []);

  const handleDownloadModel = async () =&gt; {
    if (isDownloading || isModelLoaded) return;

    try {
      setIsDownloading(true);
      setDownloadProgress(0);

      await downloadAsset({
        assetSrc: TTS_EN_SUPERTONIC_Q8_0,
        onProgress: (p: ModelProgressUpdate) =&gt; {
          setDownloadProgress(p.percentage / 100);
        },
      });

      setDownloadProgress(1);

      globalModelId = await loadModel({
        modelSrc: TTS_EN_SUPERTONIC_Q8_0,
        modelConfig: {
          ttsEngine: "supertonic",
          language: "en",
          voice: "F1",
          ttsSpeed: 1.05,
          ttsNumInferenceSteps: 5,
        },
      });

      setIsModelLoaded(true);
      setIsDownloading(false);
    } catch (err: unknown) {
      console.error("Failed to download or load model:", err);
      setIsDownloading(false);
      setStatus({
        phase: "error",
        message: err instanceof Error ? err.message : String(err),
      });
      setIsModelLoaded(false);
    }
  };

  const handleSubmit = async () =&gt; {
    if (!text.trim() || isBusy || !globalModelId) return;

    try {
      setStatus({ phase: "synthesizing" });

      // 1. Unload and reload the model to reset its state and clear the KV cache.
      if (globalModelId) {
        await unloadModel({ modelId: globalModelId });
      }
      globalModelId = await loadModel({
        modelSrc: TTS_EN_SUPERTONIC_Q8_0,
        modelConfig: {
          ttsEngine: "supertonic",
          language: "en",
          voice: "F1",
          ttsSpeed: 1.05,
          ttsNumInferenceSteps: 5,
        },
      });

      // 2. Synthesize text to raw PCM samples
      const result = textToSpeech({
        modelId: globalModelId,
        text: text.trim(),
        inputType: "text",
        stream: false,
      });

      const audioBuffer = await result.buffer;

      // 3. Package and save WAV file using our local util
      const samplesInt16 = new Int16Array(audioBuffer);
      const wavUri = await saveAudioToDevice(
        samplesInt16,
        SUPERTONIC_SAMPLE_RATE,
      );

      // 4. Show player
      setStatus({ phase: "done", audioUri: wavUri });
    } catch (err: unknown) {
      console.error("TTS error:", err);
      const msg = err instanceof Error ? err.message : String(err);
      setStatus({ phase: "error", message: msg });
    }
  };

  const buttonLabel =
    status.phase === "synthesizing" ? "Synthesizing…" : "Synthesize Speech";

  if (!isModelLoaded) {
    return (
      &lt;TtsModelLoader
        onDownload={handleDownloadModel}
        isDownloading={isDownloading}
        progress={downloadProgress}
      /&gt;
    );
  }

  return (
    &lt;KeyboardAvoidingView
      behavior={Platform.OS === "ios" ? "padding" : "height"}
      className="flex-1 bg-black"
    &gt;
      &lt;ScrollView contentContainerClassName="flex-grow p-6  justify-center"&gt;
        &lt;Card className="border border-border bg-card max-w-md w-full mx-auto"&gt;
          &lt;CardHeader&gt;
            &lt;CardTitle variant="h3" className="text-white text-center"&gt;
              Text to Voice
            &lt;/CardTitle&gt;
            &lt;CardDescription className="text-center mt-1"&gt;
              Type or paste your content to synthesize speech
            &lt;/CardDescription&gt;
          &lt;/CardHeader&gt;

          &lt;CardContent className="gap-6"&gt;
            &lt;TextInput
              className="bg-muted text-white border border-border rounded-lg p-4 h-48 text-base leading-6"
              multiline
              numberOfLines={8}
              placeholder="Type your message here..."
              placeholderTextColor="#666"
              value={text}
              onChangeText={setText}
              style={{ textAlignVertical: "top" }}
              editable={!isBusy}
            /&gt;

            {status.phase === "error" &amp;&amp; (
              &lt;Text className="text-destructive text-sm text-center"&gt;
                {status.message}
              &lt;/Text&gt;
            )}

            {status.phase === "done" &amp;&amp; &lt;AudioPlayer uri={status.audioUri} /&gt;}

            &lt;Button
              onPress={handleSubmit}
              className="w-full h-12 rounded-xl"
              disabled={!text.trim() || isBusy}
            &gt;
              &lt;Text className="font-semibold text-lg"&gt;{buttonLabel}&lt;/Text&gt;
            &lt;/Button&gt;
          &lt;/CardContent&gt;
        &lt;/Card&gt;
      &lt;/ScrollView&gt;
    &lt;/KeyboardAvoidingView&gt;
  );
}
</code></pre>
<h3 id="heading-codebase-breakdown">Codebase Breakdown</h3>
<p>Let’s lift the hood on how this local Text-to-Speech implementation manages native model lifecycles and processes raw audio arrays.</p>
<h4 id="heading-1-managing-the-native-lifecycle">1. Managing the Native Lifecycle</h4>
<p>Loading neural network weights for speech synthesis is computationally expensive. When the QVAC runtime initializes a model, it must read parameters from the local disk and copy the active weights into device RAM.</p>
<p>To handle this efficiently, we declared the reference variable outside the component scope:</p>
<pre><code class="language-typescript">let globalModelId: string | null = null;
</code></pre>
<p>If <code>globalModelId</code> were tracked in component state, navigating away from the text-to-speech screen would unmount the component and discard that state, so the app would lose its reference to a model that is still loaded. Storing the ID at module scope ensures we hold onto it across screen transitions.</p>
<h4 id="heading-2-flushing-the-kv-cache-unload-and-reload">2. Flushing the KV Cache: Unload and Reload</h4>
<p>One of the most important aspects of offline generation using GGML engines is state management:</p>
<pre><code class="language-typescript">// 1. Unload and reload the model to reset its state and clear the KV cache.
if (globalModelId) {
  await unloadModel({ modelId: globalModelId });
}

globalModelId = await loadModel({ ... });
</code></pre>
<p>WARNING about <strong>acoustic hallucinations:</strong> If you continuously synthesize sentences on a single TTS model instance without resetting it, the model's Key-Value (KV) cache fills up. It begins treating your new sentence as a continuation of the previous one, leading to heavy robotic distortion, echoing, and repeated voices.</p>
<p>By explicitly destroying the model via <code>unloadModel</code> and immediately booting a fresh instance with <code>loadModel</code>, we force a pristine, empty context window. Since the model is already downloaded, reloading it from local flash storage is fast, typically completing in a fraction of a second on modern mobile hardware. The user experience stays seamless, and each synthesis starts from a clean state.</p>
<h4 id="heading-3-demystifying-the-wav-header-structure">3. Demystifying the WAV Header Structure</h4>
<p>Operating systems and built-in mobile media decoders can't play raw PCM (Pulse Code Modulation) data directly. A raw PCM buffer is simply a stream of numbers representing the audio wave's amplitude at each sample, with nothing to say how fast to play them or how many channels they contain.</p>
<p>We resolve this by prepending a standard 44-byte RIFF/WAVE header to our PCM buffer.</p>
<p>This header acts as a passport, defining:</p>
<ul>
<li><p><strong>AudioFormat (</strong><code>1</code><strong>)</strong>: Signals uncompressed linear PCM.</p>
</li>
<li><p><strong>NumChannels (</strong><code>1</code><strong>)</strong>: Mono audio.</p>
</li>
<li><p><strong>SampleRate (</strong><code>44100</code><strong>)</strong>: The sampling rate in Hz (samples per second) used for Supertonic playback.</p>
</li>
<li><p><strong>BitsPerSample (</strong><code>16</code><strong>)</strong>: 16-bit word length (2 bytes per sample).</p>
</li>
</ul>
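<p>To make the layout concrete, here is a minimal sketch of how such a header could be assembled. The helper name <code>buildWavHeader</code> and its parameters are assumptions made for illustration, not part of QVAC or Expo. The field offsets follow the standard 44-byte PCM WAV layout, and all multi-byte numbers are little-endian:</p>
<pre><code class="language-typescript">// Sketch only: `buildWavHeader` is a hypothetical helper, not a QVAC or Expo API.
function buildWavHeader(
  pcmByteLength: number,
  sampleRate = 44100,
  numChannels = 1,
  bitsPerSample = 16
): Uint8Array {
  const header = new Uint8Array(44);
  const view = new DataView(header.buffer);
  const blockAlign = (numChannels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign; // bytes per second of audio

  // Writes a four-character ASCII tag such as "RIFF" at the given offset.
  function writeTag(offset: number, tag: string): void {
    Array.from(tag).forEach(function (char, index) {
      view.setUint8(offset + index, char.charCodeAt(0));
    });
  }

  writeTag(0, "RIFF");
  view.setUint32(4, 36 + pcmByteLength, true); // file size minus the first 8 bytes
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk for PCM
  view.setUint16(20, 1, true); // AudioFormat: 1 = uncompressed linear PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeTag(36, "data");
  view.setUint32(40, pcmByteLength, true); // number of PCM bytes that follow

  return header;
}
</code></pre>
<p>The finished WAV file is this header followed by the PCM bytes. With the values above, one second of mono 16-bit audio at 44,100 Hz takes 88,200 bytes.</p>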
<p>Additionally, writing the file is handled via Base64 encoding to safely cross React Native's JavaScript-to-Native bridge without dropping binary data:</p>
<pre><code class="language-typescript">const base64Data = finalWavBuffer.toString("base64");
await FileSystem.writeAsStringAsync(fileUri, base64Data, {
  encoding: FileSystem.EncodingType.Base64,
});
</code></pre>
<h4 id="heading-4-visual-waveform-player">4. Visual Waveform Player</h4>
<p>Rather than using a basic headless native audio player that fires immediately in the background, we pass the local WAV file path to a custom <code>&lt;AudioPlayer&gt;</code> component powered by <code>@simform_solutions/react-native-audio-waveform</code>.</p>
<p>This module analyzes our newly written WAV file and draws a sleek, WhatsApp-inspired interactive visual waveform, giving the user full control over playback, dynamic speed adjustments (<code>1x</code>, <code>1.5x</code>, <code>2x</code>), and seeking. It's a vast UX improvement that makes the final result feel premium and polished.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Transitioning Text-to-Speech from the cloud to on-device hardware offers a practical approach for mobile application developers. Running model inference locally eliminates reliance on remote internet connectivity, removes recurring API usage costs, and ensures that user text inputs never leave the physical device.</p>
<p>Integrating local speech synthesis can be highly beneficial for interactive, educational, or conversational apps. For example, in voice-guided systems, on-device TTS allows applications to function in private or offline environments. As edge processors gain dedicated hardware acceleration cores and open-source models decrease in memory size through quantization research, local-first architectures present a compelling alternative for developers prioritizing privacy, offline resilience, and predictable cost structures.</p>
<h2 id="heading-resources-and-further-reading">Resources and Further Reading</h2>
<p>To dive deeper into local Text-to-Speech inference, inspect the source code, or explore advanced configurations for your mobile applications, check out the following resources:</p>
<ul>
<li><p><a href="https://docs.qvac.tether.io/tutorials/expo/"><strong>QVAC Expo Integration Docs</strong></a>: Learn more about configuring custom local models in Expo.</p>
</li>
<li><p><a href="https://github.com/SimformSolutionsPvtLtd/react-native-audio-waveform"><strong>react-native-audio-waveform</strong></a>: Learn more about interactive React Native audio visualizations.</p>
</li>
<li><p><a href="https://huggingface.co/models?search=gguf"><strong>GGUF Model Hub on Hugging Face</strong></a>: Browse compatible quantized open-source models.</p>
</li>
<li><p><a href="https://www.emergentmind.com/topics/latent-denoising-diffusion-models"><strong>Latent Denoising Deep Dive</strong></a>: Technical deep dive into Diffusion-based acoustic generation.</p>
</li>
<li><p><a href="https://github.com/DjibrilM/QVAC-TTS-Expo-Implementation"><strong>https://github.com/DjibrilM/QVAC-TTS-Expo-Implementation</strong></a>: Full implementation code.</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Run an LLM Locally on Your Mobile Phone with QVAC and Expo ]]>
                </title>
                <description>
                    <![CDATA[ When I was younger, I remember my mother’s Android phone, a Samsung Galaxy Note 3 that she bought right after losing her BlackBerry. During that time, a phone with 16 GB of storage was considered cutt ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-run-an-llm-locally-on-your-mobile-phone-with-qvac-and-expo/</link>
                <guid isPermaLink="false">6a2061ad78a43e3153aede0d</guid>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Mobile Development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ local development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Djibril-M🍀 ]]>
                </dc:creator>
                <pubDate>Wed, 03 Jun 2026 17:17:33 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/a5fb9baf-a10d-4e53-9c66-3980919a35b8.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When I was younger, I remember my mother’s Android phone, a Samsung Galaxy Note 3 that she bought right after losing her BlackBerry. During that time, a phone with 16 GB of storage was considered cutting-edge technology. The ability to store five 720p torrented movies on a single phone honestly felt unreal.</p>
<p>Most flagship devices back then shipped with somewhere between 1 and 3 GB of RAM, and GPUs were nowhere near what we carry around today. My mom’s Galaxy Note 3 featured the Qualcomm Adreno 330 GPU with 32 unified shader cores running at up to 578 MHz, a complete powerhouse for its time.</p>
<p>Fast forward to today, and the phones in our pockets are ridiculously more powerful, more efficient, and, honestly, capable of things people would’ve considered science fiction back then.</p>
<p>But enough about my mom’s phone. What I’m really trying to say is this: instead of spending hundreds of dollars every month on AI subscriptions and tokens, we can take advantage of the insanely capable devices we already carry around every day.</p>
<p>Modern smartphones now have dedicated AI acceleration, impressive thermal efficiency, and enough compute power to run lightweight language models locally, completely offline. That means better privacy, full control over your chat history, lower latency, and the ability to use AI without depending entirely on cloud services.</p>
<p>In this article, we’re going to build a React Native application that interacts with an LLM running directly on the device itself. The implementation will revolve around QVAC, a family of inference tools designed specifically for running AI models locally.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-is-qvac">What is QVAC?</a></p>
</li>
<li><p><a href="#heading-environment-setup">Environment Setup</a></p>
</li>
<li><p><a href="#heading-model-management">Model Management</a></p>
</li>
<li><p><a href="#heading-custom-models">Custom Models</a></p>
</li>
<li><p><a href="#heading-complete-implementation">Complete Implementation</a></p>
</li>
<li><p><a href="#heading-codebase-breakdown">Codebase Breakdown</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-resources-amp-further-reading">Resources &amp; Further Reading</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this article, you should have a basic understanding of front-end development and React in general. You don't have to be a mobile developer, but understanding React will help a lot.</p>
<h2 id="heading-what-is-qvac">What is QVAC?</h2>
<p>QVAC (QuantumVerse Automatic Computer) is a local-first AI inference platform developed by Tether. It's designed to move artificial intelligence away from centralized cloud systems and bring computation back to the user’s own device.</p>
<p>Most modern AI tools rely heavily on remote servers, API keys, and cloud infrastructure controlled by a handful of companies. While this makes AI accessible, it also creates major concerns around privacy, censorship, vendor lock-in, internet dependency, and ownership of user data. Every prompt, conversation, or uploaded file often passes through third-party servers that users have little control over.</p>
<p>QVAC was designed to solve that problem by allowing AI models and agents to run directly on devices like smartphones, laptops, and embedded systems, even while completely offline. Instead of sending personal conversations and sensitive data to the cloud, users can process everything locally on their own hardware.</p>
<p>The platform also embraces decentralization through peer-to-peer communication, reducing reliance on centralized infrastructure and eliminating single points of failure. This approach makes AI systems more private, resilient, autonomous, and accessible, especially in environments with limited internet access or strict data privacy requirements.</p>
<p>In simple terms, QVAC exists to make AI truly owned by its users — local-first, private by default, and independent from centralized control.</p>
<h2 id="heading-environment-setup">Environment Setup</h2>
<p>To speed up the process, I prepared a React Native starter project with all the dependencies installed. But we will install and set up QVAC in this article, since that's our main topic. Here's a link to the <a href="https://github.com/DjibrilM/QVAC-offline-Chatbot-Article-Project-">repository</a>.</p>
<p>Or you can run the command below to clone the starter project:</p>
<pre><code class="language-shell">git clone --branch ft-ui-implementation --single-branch https://github.com/DjibrilM/QVAC-offline-Chatbot-Article-Project-
</code></pre>
<h3 id="heading-qvac-installation">QVAC Installation</h3>
<p>Run the following command to install the SDK: <code>npm i @qvac/sdk</code>. Feel free to use any package manager of your choice. As for me, I will keep things simple with <code>npm</code>.</p>
<p>Then add the peer dependencies <code>bare-rpc</code> and <code>react-native-bare-kit</code> to the <code>dependencies</code> section of your <code>package.json</code>:</p>
<pre><code class="language-json">{
  "dependencies": {
    "@qvac/sdk": "^0.7.0",
    "bare-rpc": "^1.0.0",
    "expo": "~54.0.33",
    "expo-status-bar": "~3.0.9",
    "react": "19.1.0",
    "react-native": "0.81.5",
    "react-native-bare-kit": "^0.11.5"
  },
  "devDependencies": {
    "@types/react": "~19.1.0",
    "bare-pack": "^1.5.1", 
    "typescript": "~5.9.2"
  }
}
</code></pre>
<p>Install the following additional dependencies:</p>
<pre><code class="language-shell">npx expo install expo-file-system expo-build-properties expo-device
</code></pre>
<p>Then configure <code>expo-build-properties</code> and add <code>@qvac/sdk/expo-plugin</code> to the <code>plugins</code> array in your <code>app.json</code>:</p>
<pre><code class="language-json">{
  "expo": {
    "plugins": [
      "expo-router",
      "@qvac/sdk/expo-plugin",
      [
        "expo-splash-screen",
        {
          "backgroundColor": "#208AEF",
          "android": {
            "image": "./assets/images/splash-icon.png",
            "imageWidth": 76
          }
        }
      ]
    ]
  }
}
</code></pre>
<p>Run the following command to generate the native Android and iOS projects:</p>
<pre><code class="language-shell">npx expo prebuild
</code></pre>
<p><strong>Note:</strong> QVAC uses llama.cpp under the hood. Due to optimization requirements and native hardware dependencies, the QVAC SDK doesn't run on emulators. You'll have to test this with a real physical device with Developer Mode enabled.</p>
<p>To run the app on your physical device, execute:</p>
<pre><code class="language-shell"># For Android:
npx expo run:android --device

# For iOS:
npx expo run:ios --device
</code></pre>
<h2 id="heading-model-management">Model Management</h2>
<p>The QVAC model management system is completely local-first and decentralized. It handles the entire model lifecycle, from downloading files to loading them into memory and unloading them again, and abstracts everything behind clean utility APIs.</p>
<h3 id="heading-resumable-amp-deduplicated-downloading-downloadasset">Resumable &amp; Deduplicated Downloading (<code>downloadAsset</code>)</h3>
<p><code>downloadAsset</code> writes temporary chunks to local disk. If the network drops, the partial file is preserved and the download resumes automatically on the next call. Also, if multiple components request the same asset at the same time, QVAC serves them all from a single network stream.</p>
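<p>QVAC's internals aren't shown here, but the general idea behind deduplicating concurrent requests can be sketched with a map of in-flight promises. Everything in this example (<code>inFlight</code>, <code>fakeTransfer</code>, <code>downloadOnce</code>) is a hypothetical name used for illustration, and the "transfer" is a stub rather than a real network call:</p>
<pre><code class="language-typescript">// Hypothetical registry of downloads that are still running, keyed by asset name.
const inFlight = new Map();

// Counts how many real transfers start (only here to make the behavior visible).
let transfersStarted = 0;

// Stand-in for a real network transfer; resolves to a fake local path.
async function fakeTransfer(assetKey: string) {
  transfersStarted += 1;
  return "/models/" + assetKey;
}

// Runs the transfer and removes the registry entry once it settles,
// so a later call can start a fresh download if needed.
async function runAndRelease(assetKey: string) {
  try {
    return await fakeTransfer(assetKey);
  } finally {
    inFlight.delete(assetKey);
  }
}

// Returns the pending promise when one exists, so concurrent callers share one transfer.
function downloadOnce(assetKey: string) {
  const pending = inFlight.get(assetKey);
  if (pending) return pending;

  const started = runAndRelease(assetKey);
  inFlight.set(assetKey, started);
  return started;
}
</code></pre>
<p>Two components calling <code>downloadOnce("model.gguf")</code> at the same moment receive the same promise, and <code>transfersStarted</code> only increases by one.</p>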
<h3 id="heading-memory-lifecycle-loadmodel-amp-unloadmodel">Memory Lifecycle (<code>loadModel</code> &amp; <code>unloadModel</code>)</h3>
<p><code>loadModel</code> maps the asset file directly into memory, binds it to your hardware target (such as the device GPU), and returns an ephemeral <code>modelId</code>. Because local inference is highly memory-intensive on mobile devices, calling <code>unloadModel</code> frees system RAM immediately while preserving the downloaded file on disk.</p>
<h3 id="heading-custom-models">Custom Models</h3>
<p>Because QVAC relies on an optimized branch of llama.cpp, it remains highly compatible with the open-source AI ecosystem. If you plan to load custom models, ensure they adhere to these criteria:</p>
<ul>
<li><p><strong>Format:</strong> Must be in the GGUF (<code>.gguf</code>) format.</p>
</li>
<li><p><strong>Quantization:</strong> For mobile and edge deployments, prefer <code>Q4_0</code>, <code>Q4_K_M</code>, or <code>Q8_0</code> configurations so the model is more likely to fit within mobile hardware RAM constraints.</p>
</li>
</ul>
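<p>To get a feel for why quantization matters, you can estimate the size of the quantized weights from the parameter count. The sketch below is a back-of-the-envelope calculation, not a QVAC feature. The helper name is hypothetical, the parameter count for Llama 3.2 1B is approximate, and the bits-per-weight figures assume the GGML block layout of 32 weights plus one 16-bit scale per block:</p>
<pre><code class="language-typescript">// Approximate bits stored per weight, including the per-block scale factor.
const BITS_PER_WEIGHT = {
  Q4_0: 4.5, // (32 * 4 bits + 16 bits) / 32 weights
  Q8_0: 8.5, // (32 * 8 bits + 16 bits) / 32 weights
};

// Hypothetical helper: rough size of the quantized weights in gigabytes (10^9 bytes).
function estimateWeightsGB(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8 / 1e9;
}

const llama1BParams = 1.24e9; // approximate parameter count, an assumption of this sketch

console.log(estimateWeightsGB(llama1BParams, BITS_PER_WEIGHT.Q4_0).toFixed(2)); // 0.70
console.log(estimateWeightsGB(llama1BParams, BITS_PER_WEIGHT.Q8_0).toFixed(2)); // 1.32
</code></pre>
<p>Treat these numbers as a lower bound. Real model files are usually somewhat larger, and the KV cache and runtime buffers need RAM on top of the weights.</p>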
<h2 id="heading-complete-implementation">Complete Implementation</h2>
<p>Now let's put everything together. The full implementation combines the UI container layout, user interaction state, model lifecycle setup, and real-time inference handling in a single component.</p>
<p>Replace the contents of your entry file with the following code:</p>
<pre><code class="language-typescript">import { ChatInput } from "@/components/chat-input";
import { ChatMessage, Message } from "@/components/chat-message";
import { ModelLoader } from "@/components/model-loader";
import { Text } from "@/components/ui/text";

import {
  completion,
  downloadAsset,
  LLAMA_3_2_1B_INST_Q4_0,
  loadModel,
  type ModelProgressUpdate,
  VERBOSITY,
} from "@qvac/sdk";
import { useEffect, useRef, useState } from "react";

import {
  Clipboard,
  KeyboardAvoidingView,
  Platform,
  SafeAreaView,
  ScrollView,
  View,
} from "react-native";

const makeId = () =&gt; Math.random().toString(36).substring(2, 9);

export default function Index() {
  const [messages, setMessages] = useState&lt;Message[]&gt;([]);
  const [input, setInput] = useState("");
  const [isGenerating, setIsGenerating] = useState(false);

  // Model loading state
  const [modelId, setModelId] = useState&lt;string | null&gt;(null);
  const [isModelLoaded, setIsModelLoaded] = useState(false);
  const [isDownloading, setIsDownloading] = useState(false);
  const [downloadProgress, setDownloadProgress] = useState(0);

  const scrollViewRef = useRef&lt;ScrollView&gt;(null);
  const messagesRef = useRef&lt;Message[]&gt;([]);

  useEffect(() =&gt; {
    messagesRef.current = messages;
  }, [messages]);

  const startDownload = () =&gt; {
    setIsDownloading(true);
    setupModel();
  };

  // Automatically scroll to bottom when messages list updates
  useEffect(() =&gt; {
    if (scrollViewRef.current) {
      setTimeout(() =&gt; {
        scrollViewRef.current?.scrollToEnd({ animated: true });
      }, 100);
    }
  }, [messages, isGenerating]);

  const copyToClipboard = (text: string) =&gt; {
    if (Platform.OS === "web") {
      navigator.clipboard.writeText(text);
    } else {
      Clipboard.setString(text);
    }
  };

  const setupModel = async () =&gt; {
    try {
      setIsDownloading(true);
      setDownloadProgress(0);
      
      // 1. Local download path execution
      await downloadAsset({
        assetSrc: LLAMA_3_2_1B_INST_Q4_0,
        onProgress: (progress: ModelProgressUpdate) =&gt; {
          setDownloadProgress(progress.percentage / 100);
        },
      });

      setDownloadProgress(1);

      // 2. Load model into runtime memory
      const loadedModel = await loadModel({
        modelSrc: LLAMA_3_2_1B_INST_Q4_0,
        modelType: "llm",
        modelConfig: {
          device: "gpu",
          ctx_size: 2048,
          verbosity: VERBOSITY.ERROR,
        },
      });

      setModelId(loadedModel);
      setIsModelLoaded(true);
      setIsDownloading(false);
    } catch (e: any) {
      console.error("Error setting up model:", e);
      setIsDownloading(false);
    }
  };

  async function handleSend() {
    // Guard against sending before the model is ready or while generating.
    if (!modelId || isGenerating) return;

    const trimmed = input.trim();
    if (!trimmed) return;

    setInput("");
    setIsGenerating(true);

    // Append user message and a placeholder assistant message for streaming.
    const userMsg: Message = {
      id: makeId(),
      role: "user",
      content: trimmed,
    };

    const assistantId = makeId();

    const assistantMsg: Message = {
      id: assistantId,
      role: "assistant",
      content: "",
    };

    setMessages((prev) =&gt; [...prev, userMsg, assistantMsg]);

    try {
      // Build chat history for the completion request.
      const history = [...messagesRef.current, userMsg].map((m) =&gt; ({
        role: m.role,
        content: m.content,
      }));

      // Run a streaming completion and update the last assistant bubble.
      const result = completion({
        modelId,
        history,
        stream: true,
      });

      let acc = "";

      for await (const token of result.tokenStream) {
        acc += token;

        // Update only the last assistant message content
        setMessages((prev) =&gt;
          prev.map((m) =&gt;
            m.id === assistantId ? { ...m, content: acc } : m
          )
        );
      }

      // Optional: Log completion performance stats
      try {
        const stats = await result.stats;
        console.log("📊 Completion stats:", stats);
      } catch {}

    } catch (e: any) {
      // Show any error in the assistant bubble.
      setMessages((prev) =&gt;
        prev.map((m) =&gt;
          m.id === assistantId
            ? { ...m, content: `❌ Error: ${e?.message ?? String(e)}` }
            : m
        )
      );
    } finally {
      setIsGenerating(false);
    }
  }

  if (!isModelLoaded) {
    return (
      &lt;ModelLoader
        onDownload={startDownload}
        isDownloading={isDownloading}
        progress={downloadProgress}
      /&gt;
    );
  }

  return (
    &lt;SafeAreaView className="flex-1 bg-background"&gt;
      &lt;KeyboardAvoidingView
        behavior={Platform.OS === "ios" ? "padding" : "height"}
        className="flex-1"
      &gt;
        &lt;View className="flex-row items-center justify-between p-4 border-b border-border"&gt;
          &lt;View className="flex-row items-center gap-2"&gt;
            &lt;View className="w-2 h-2 rounded-full bg-emerald-500" /&gt;
            &lt;Text className="font-semibold text-lg"&gt;Local Llama 3.2&lt;/Text&gt;
          &lt;/View&gt;
          &lt;Text className="text-xs text-muted-foreground"&gt;Offline Engine&lt;/Text&gt;
        &lt;/View&gt;

        &lt;ScrollView
          ref={scrollViewRef}
          className="flex-1 px-4"
          contentContainerStyle={{ paddingVertical: 16, gap: 16 }}
        &gt;
          {messages.filter(m =&gt; m.content !== "" || m.role === "assistant").map((msg) =&gt; (
            &lt;ChatMessage
              key={msg.id}
              message={msg}
              onCopy={() =&gt; copyToClipboard(msg.content)}
            /&gt;
          ))}
        &lt;/ScrollView&gt;

        &lt;ChatInput
          value={input}
          onChangeText={setInput}
          onSend={handleSend}
          disabled={isGenerating}
          placeholder={isGenerating ? "Thinking..." : "Type a message..."}
        /&gt;
      &lt;/KeyboardAvoidingView&gt;
    &lt;/SafeAreaView&gt;
  );
}
</code></pre>
<h3 id="heading-codebase-breakdown">Codebase Breakdown</h3>
<p>Let’s lift the hood on how this unified component manages local model workflows and real-time UI streaming.</p>
<h4 id="heading-1-tracking-model-state-amp-asynchronous-synchronization">1. Tracking Model State &amp; Asynchronous Synchronization</h4>
<p>At the root of the component, we track both user-facing interface state and underlying QVAC runtime handles:</p>
<pre><code class="language-typescript">const [messages, setMessages] = useState&lt;Message[]&gt;([]);
const [modelId, setModelId] = useState&lt;string | null&gt;(null);
const [isModelLoaded, setIsModelLoaded] = useState(false);
const [isDownloading, setIsDownloading] = useState(false);
const [downloadProgress, setDownloadProgress] = useState(0);
</code></pre>
<p>React applies state updates asynchronously, and a function only sees the state values from the render in which it was created. A long-running streaming loop can therefore end up reading a stale copy of the chat log.</p>
<p>To work around this, a mutable <code>messagesRef</code> mirrors the latest <code>messages</code> array, so asynchronous code can always read the current list:</p>
<pre><code class="language-typescript">const messagesRef = useRef&lt;Message[]&gt;([]);

useEffect(() =&gt; {
  messagesRef.current = messages;
}, [messages]);
</code></pre>
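<p>The difference between a captured value and a ref is easier to see outside React. In this plain TypeScript sketch, a simple object stands in for <code>useRef</code> and <code>createHandler</code> is a hypothetical function that simulates a handler created during one render:</p>
<pre><code class="language-typescript">// A stand-in for React's useRef: one mutable box shared by every reader.
const messagesRef = { current: ["hello"] };

// Simulates a handler created during one render: it closes over that render's array.
function createHandler(messagesAtRender: string[]) {
  return function readBoth() {
    return {
      fromClosure: messagesAtRender.length, // frozen at creation time
      fromRef: messagesRef.current.length, // always the latest value
    };
  };
}

const handler = createHandler(messagesRef.current);

// A later "render" produces a new array, as setMessages would.
messagesRef.current = messagesRef.current.concat("how are you?");

console.log(handler()); // { fromClosure: 1, fromRef: 2 }
</code></pre>
<p>The handler still sees one message through its closure, while the ref reports two. That is the gap <code>messagesRef</code> closes in the component.</p>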
<h4 id="heading-2-orchestrating-download-amp-memory-instantiation">2. Orchestrating Download &amp; Memory Instantiation</h4>
<p>When the user taps the download button, the application calls <code>setupModel()</code>. This function has two clear stages: downloading the model to local storage, then loading it into memory on the target hardware:</p>
<pre><code class="language-typescript">await downloadAsset({
  assetSrc: LLAMA_3_2_1B_INST_Q4_0,
  onProgress: (progress: ModelProgressUpdate) =&gt; {
    setDownloadProgress(progress.percentage / 100);
  },
});
</code></pre>
<ul>
<li><p><strong>Storage Sync:</strong> <code>downloadAsset</code> downloads the selected model (here the SDK's <code>LLAMA_3_2_1B_INST_Q4_0</code> constant) to the device's local storage and reports progress through <code>onProgress</code>.</p>
</li>
<li><p><strong>Hardware Binding:</strong> Once the file is safely on disk, <code>loadModel</code> loads it into the inference runtime:</p>
</li>
</ul>
<pre><code class="language-typescript">const loadedModel = await loadModel({
  modelSrc: LLAMA_3_2_1B_INST_Q4_0,
  modelType: "llm",
  modelConfig: {
    device: "gpu",
    ctx_size: 2048,
    verbosity: VERBOSITY.ERROR,
  },
});
</code></pre>
<p>Passing <code>device: "gpu"</code> tells QVAC to run inference on the smartphone's GPU, which is generally much faster than running the same work on the CPU.</p>
<h4 id="heading-3-pipeline-ingest-amp-streaming-generation-loop">3. Pipeline Ingest &amp; Streaming Generation Loop</h4>
<p>After checking that the model is loaded, that no generation is in progress, and that the input isn't empty, <code>handleSend()</code> appends the user's message along with an empty assistant placeholder that will receive the streamed tokens.</p>
<p>It then maps <code>messagesRef.current</code> plus the new user message into a list of <code>{ role, content }</code> objects and passes that history to <code>completion</code>:</p>
<pre><code class="language-typescript">const result = completion({
  modelId,
  history,
  stream: true,
});
</code></pre>
<p>With <code>stream: true</code> enabled, your application doesn't wait for the full response to be generated. Instead, <code>result.tokenStream</code> is an asynchronous iterable that yields each token as soon as it's produced:</p>
<pre><code class="language-typescript">let acc = "";

for await (const token of result.tokenStream) {
  acc += token;

  setMessages((prev) =&gt;
    prev.map((m) =&gt;
      m.id === assistantId ? { ...m, content: acc } : m
    )
  );
}
</code></pre>
<p>The loop appends each token to the accumulator (<code>acc</code>) and updates only the message whose <code>id</code> matches our placeholder (<code>assistantId</code>), leaving every other message untouched. The result is a typing effect that renders as fast as tokens arrive, running fully offline on the user's device.</p>
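<p>If you want to experiment with the accumulation pattern without loading a model, a small async generator can stand in for the token stream. This is only a sketch: <code>fakeTokenStream</code> and <code>collectSnapshots</code> are hypothetical names, and the tokens are hard-coded rather than produced by QVAC:</p>
<pre><code class="language-typescript">// Stand-in for result.tokenStream: an async generator that yields tokens one by one.
async function* fakeTokenStream() {
  yield "Run";
  yield "ning ";
  yield "offline";
}

// Accumulates tokens and records every intermediate string, which is what
// each setMessages call would render in the assistant bubble.
async function collectSnapshots() {
  const snapshots: string[] = [];
  let acc = "";

  for await (const token of fakeTokenStream()) {
    acc += token;
    snapshots.push(acc);
  }

  return snapshots;
}

collectSnapshots().then(function (snapshots) {
  console.log(snapshots); // [ 'Run', 'Running ', 'Running offline' ]
});
</code></pre>
<p>Each snapshot is the full text so far, not just the newest token, which is why the component can simply overwrite the placeholder's <code>content</code> on every iteration.</p>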
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building a local-first AI application is no longer a concept confined to high-end desktops or specialized research labs. As we’ve seen, the smartphones we carry in our pockets every day possess more than enough computational muscle and dedicated hardware acceleration to run highly capable language models completely offline.</p>
<p>By leveraging React Native and the QVAC SDK, we successfully bypassed the traditional cloud-dependent architecture. We eliminated the need for complex server infrastructure, API key management, and recurring token subscription fees, all while providing an ultra-private, low-latency, streaming chat experience directly on-device.</p>
<p>As open-source models continue to shrink in size and grow in capabilities, edge inference will become an essential architecture for developers prioritizing privacy, offline resilience, and cost efficiency. The power to compute is back where it belongs: in the hands of the user.</p>
<h3 id="heading-resources-amp-further-reading">Resources &amp; Further Reading</h3>
<p>To dive deeper into local inference, inspect the source code, or explore advanced configurations for your mobile applications, check out the following resources:</p>
<ul>
<li><p><a href="https://docs.qvac.tether.io/tutorials/expo/"><strong>QVAC Expo Integration Tutorial</strong></a> – The official step-by-step documentation for configuring QVAC within the Expo and React Native ecosystems.</p>
</li>
<li><p><a href="https://github.com/DjibrilM/QVAC-offline-Chatbot-Article-Project-"><strong>Project GitHub Repository</strong></a> – Access the complete source code, including the UI layout components, starter themes, and full configuration files used in this guide.</p>
</li>
<li><p><a href="https://github.com/ggml-org/llama.cpp"><strong>Llama.cpp Official Repository</strong></a> – Learn more about the underlying inference engine that powers QVAC's hardware-accelerated local execution.</p>
</li>
<li><p><a href="https://huggingface.co/models?search=gguf"><strong>Hugging Face GGUF Models</strong></a> – Explore thousands of open-source, quantized models that you can download and experiment with inside your local application.</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an AI Agent with LangChain and LangGraph: Build an Autonomous Starbucks Agent ]]>
                </title>
                <description>
                    <![CDATA[ Back in 2023, when I started using ChatGPT, it was just another chatbot that I could ask complex questions to and it would identify errors in my code snippets. Everything was fine. The application had no memory of previous states or what was said the... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-starbucks-ai-agent-with-langchain/</link>
                <guid isPermaLink="false">69449a6dcd2a4eec1f27eb1b</guid>
                
                    <category>
                        <![CDATA[ ai agents ]]>
                    </category>
                
                    <category>
                        <![CDATA[ langchain ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nestjs ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Djibril-M🍀 ]]>
                </dc:creator>
                <pubDate>Fri, 19 Dec 2025 00:21:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765630477745/8dffec85-c3c4-4d83-9aa4-f332439d4663.png" medium="image" />
                <content:encoded>
<![CDATA[ <p>Back in 2023, when I started using ChatGPT, it was just another chatbot: I could ask it complex questions, and it would identify errors in my code snippets. Everything was fine. The application had no memory of previous states or what was said the day before.</p>
<p>Then in 2024, everything started to change. We went from a stateless chatbot to an AI agent that could call tools, search the internet, and generate download links.</p>
<p>At this point, I started to get curious. How can an LLM search the internet? An infinite number of questions were flowing through my head. Can it create its own tools, programs, or execute its own code? It felt like we were heading toward the Skynet (Terminator) revolution.</p>
<p>I was just ignorant 😅. But that's when I started my research and discovered LangChain, a tool that promises all those miracles without a billion-dollar budget.</p>
<p>In this article, you’ll build a fully functional AI agent using LangChain and LangGraph. You’ll start by defining structured data using Zod schemas, then parsing them for AI understanding. Next, you’ll learn about summarizing data into text, creating tools the agent can call, and setting up LangGraph nodes to orchestrate workflows.</p>
<p>You’ll see how to compile the workflow graph, manage state, and persist conversation history using MongoDB. By the end, you’ll have a working Starbucks barista AI that demonstrates how to combine reasoning, tool execution, and memory in a single agent.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-an-llm-agent">What is an LLM Agent?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-data-schematization-with-zod">Data Schematization with Zod</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-parse-the-schema">How to Parse the Schema</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-data-to-text-summarization">Data-to-Text Summarization</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-persist-orders-with-mongodb-in-nestjs">How to Persist Orders with MongoDB in NestJS</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-langgraph-stateannotation-terms">LangGraph State/Annotation Terms</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-tools-for-the-agent">How to Create Tools for the Agent</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-langgraph-nodes-workflow-components">LangGraph Nodes (Workflow Components)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-graph-declaration">Graph Declaration</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-workflow-compilation-and-state-persistence-final-part">Workflow Compilation and State Persistence (Final Part)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To take full advantage of this article, you should have a basic understanding of TypeScript and Node.js. A bit of NestJS experience will also help, as it’s the backend framework we’ll be using.</p>
<h2 id="heading-what-is-an-llm-agent"><strong>What is an LLM Agent?</strong></h2>
<p>By definition, an LLM agent is a software program that’s capable of perceiving its environment, making decisions, and taking autonomous actions to achieve specific goals. It often does this by interacting with tools and systems.</p>
<p>Many frameworks and conventions were created to achieve this, and one of the most famous and widely used is the ReAct (Reason &amp; Act) framework.</p>
<p>With this framework, the LLM receives a prompt, thinks, decides the next action (this can be calling a specific tool), and receives the tool data. Once the tool’s response has been received, the AI model observes the response, generates its own response, and plans its next actions based on the tool’s response.</p>
<p>You can read more about this concept on the official <a target="_blank" href="https://arxiv.org/abs/2210.03629">white paper</a>. And here’s a diagram that summarizes the entire process:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765064426716/b1e6d7b2-4e4b-43c4-af5c-9cd49b27a864.png" alt="Diagram illustrating an LLM agent workflow: the agent receives a prompt, reasons, decides an action (such as calling a tool), observes the tool’s response, generates its own response, and iteratively plans its next actions using the ReAct framework" class="image--center mx-auto" width="3015" height="1827" loading="lazy"></p>
<p>Note that the workflow is not limited to a single tool invocation – it can proceed through several rounds before returning to the user.</p>
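<p>To make the loop concrete, here is a minimal, dependency-free sketch of the reason, act, and observe cycle. Everything in it is hypothetical: <code>fakeModel</code> stands in for a real LLM call, and <code>lookup_price</code> is an invented tool. It is not how LangGraph implements the loop, only an illustration of its shape.</p>

```typescript
// Hypothetical types for this sketch; they are not LangChain or LangGraph APIs.
type ModelStep =
  | { kind: "tool_call"; tool: string; input: string }
  | { kind: "final"; answer: string };

type Tool = (input: string) => string;

// A stand-in "model": it asks for a tool until it has seen an observation.
const fakeModel = (transcript: string[]): ModelStep => {
  const observed = transcript.some((line) => line.startsWith("Observation:"));
  return observed
    ? { kind: "final", answer: "A tall latte costs 4 dollars." }
    : { kind: "tool_call", tool: "lookup_price", input: "tall latte" };
};

// Invented tool registry, keyed by tool name.
const tools: { [name: string]: Tool } = {
  lookup_price: (input) => `${input} = 4 dollars`,
};

// Reason -> act -> observe, repeated until the model returns a final answer.
const runAgent = (prompt: string, maxRounds = 5): string => {
  const transcript = [`User: ${prompt}`];
  for (let round = 1; maxRounds >= round; round++) {
    const step = fakeModel(transcript);
    if (step.kind === "final") return step.answer;
    const tool = tools[step.tool];
    const observation = tool ? tool(step.input) : `Unknown tool: ${step.tool}`;
    transcript.push(`Action: ${step.tool}(${step.input})`);
    transcript.push(`Observation: ${observation}`);
  }
  return "Stopped: round limit reached.";
};
```

<p>The round limit is a design choice in this sketch: it keeps a confused model from calling tools forever.</p>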
<p>But for an LLM agent to be truly human-like and act with knowledge of the past, it requires a memory. This enables it to recall previous prompts and responses, maintaining consistency within the given thread.</p>
<p>There’s no single source of truth for how to approach this. Most agents implement a short-term memory. This means that the agent appends each new message to the conversation history, and when a new prompt is submitted, the agent sends the previous messages along with the new prompt.</p>
<p>This method is very efficient and gives the LLM a strong knowledge of previous states. But it can also introduce problems, because the more the conversation grows, the more the LLM will have to go through all previous messages in order to understand what action to take next.</p>
<p>And this can introduce some context drift, just like humans experience. You can’t watch a two-hour podcast and remember all the spoken words, right? In this scenario, the LLM will focus on the most relevant information, eventually losing some context.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765064542431/18b8d0a7-b9f1-4f7d-993d-76b3c4058ccf.png" alt="Illustration showing an LLM agent workflow with memory: the agent processes multiple rounds of prompts and tool interactions, maintains a short-term memory of previous conversations, and uses this context to decide actions, while older context may fade over time causing potential context drift." class="image--center mx-auto" width="3015" height="1827" loading="lazy"></p>
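<p>One common way to limit this growth is to send only the most recent messages. The sketch below assumes a hypothetical <code>ChatMessage</code> shape and an arbitrary window size. It is not what the MongoDB checkpointer does internally, only an illustration of short-term memory with a cap.</p>

```typescript
// Hypothetical message shape for this sketch.
type ChatMessage = { role: "user" | "assistant"; content: string };

// Build what the model sees: the most recent messages plus the new prompt.
// maxMessages is an assumed cap; the article does not prescribe one.
const buildContext = (
  history: ChatMessage[],
  newPrompt: string,
  maxMessages: number,
): ChatMessage[] => {
  const recent = maxMessages > 0 ? history.slice(-maxMessages) : [];
  return [...recent, { role: "user", content: newPrompt }];
};

const history: ChatMessage[] = [
  { role: "user", content: "Hi, I'd like a latte." },
  { role: "assistant", content: "Sure. What size?" },
  { role: "user", content: "Tall, please." },
  { role: "assistant", content: "Any syrup?" },
];

// Keeps the last two stored messages and appends the new prompt.
const context = buildContext(history, "Vanilla.", 2);
```

<p>The trade-off is the one described above: a smaller window is cheaper to process, but the agent forgets anything that falls outside it.</p>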
<p>You don’t have to implement this from scratch. Many tools and frameworks have been developed to make the implementation as easy as possible. You can build it from scratch if you want, of course, but we won’t be doing that here.</p>
<p>In this article, we’ll build a Starbucks barista that collects order information and calls a <code>create_order</code> tool once the order meets the full criteria. This is a tool that we’ll create and expose to the AI.</p>
<h2 id="heading-project-setup">Project Setup</h2>
<p>Let’s start by initializing our project. We’ll use NestJS for its efficiency and native TypeScript support. Note that nothing here is tied to NestJS – this is just a framework preference, and everything we’ll do here can be done with Node.js and Express.js.</p>
<p>Here is a list of all the tools that we’ll use:</p>
<ol>
<li><p><code>@langchain/core</code> - <strong>Always required</strong></p>
<p> This is the main LangChain engine that defines all core tools and fundamental functions, containing:</p>
<ul>
<li><p>prompt templates</p>
</li>
<li><p>message types</p>
</li>
<li><p>runnables</p>
</li>
<li><p>tool interfaces</p>
</li>
<li><p>chain composition utilities, and more.</p>
</li>
</ul>
</li>
</ol>
<p>    Most LangChain projects need this.</p>
<ol start="2">
<li><p><code>@langchain/google-genai</code> - This package is used to interact with Google’s generative AI models, vector embedding models, and other related tools.</p>
</li>
<li><p><code>@langchain/langgraph</code> - <strong>Important for building an AI agent with total control</strong></p>
<p> LangGraph is a low-level orchestration framework for building controllable agents. It can be used to build:</p>
<ul>
<li><p>Conversational agents.</p>
</li>
<li><p>Complex task automation.</p>
</li>
<li><p>Agent context management.</p>
</li>
</ul>
</li>
<li><p><code>@langchain/langgraph-checkpoint-mongodb</code> - This package provides a MongoDB-based checkpointer for LangGraph, enabling persistence of agent state and short-term memory using MongoDB.</p>
</li>
<li><p><code>@langchain/mongodb</code> - This package provides MongoDB integrations for LangChain, allowing you to:</p>
<ul>
<li><p>Store and retrieve vector embeddings.</p>
</li>
<li><p>Persist LangChain documents, agents, or memory states.</p>
</li>
<li><p>Easily integrate MongoDB as a database backend for your AI workflows.</p>
</li>
</ul>
</li>
<li><p><code>@nestjs/mongoose</code> - A NestJS wrapper around Mongoose for MongoDB. Provides:</p>
<ul>
<li><p>Dependency injection support for Mongoose models.</p>
</li>
<li><p>Simplified schema definition and model management.</p>
</li>
<li><p>Seamless integration of MongoDB into NestJS applications, enabling structured data persistence for AI apps or any backend.</p>
</li>
</ul>
</li>
<li><p><code>langchain</code> - This is the main npm package that aggregates LangChain functionality. It provides:</p>
<ul>
<li><p>Access to connectors, utilities, and core modules.</p>
</li>
<li><p>Easy import of different LangChain components in one place.</p>
</li>
<li><p>Commonly used alongside <code>@langchain/core</code> for building applications with minimal setup.</p>
</li>
</ul>
</li>
<li><p><code>mongodb</code> - The official MongoDB driver for Node.js. It provides:</p>
<ul>
<li><p>Low-level, flexible access to MongoDB databases.</p>
</li>
<li><p>Support for CRUD operations, transactions, and indexing.</p>
</li>
<li><p>A required dependency if you plan to connect LangChain components or your backend directly to MongoDB.</p>
</li>
</ul>
</li>
<li><p><code>mongoose</code> - An ODM (Object Data Modeling) library for MongoDB. Offers:</p>
<ul>
<li><p>Schema-based data modeling for MongoDB documents.</p>
</li>
<li><p>Middleware, validation, and hooks for MongoDB operations.</p>
</li>
<li><p>Ideal for structured data management in NestJS or other Node.js applications.</p>
</li>
</ul>
</li>
<li><p><code>zod</code> - A TypeScript-first schema validation library. Used for:</p>
<ul>
<li><p>Defining strict data schemas and validating inputs/outputs.</p>
</li>
<li><p>Ensuring type safety at runtime.</p>
</li>
<li><p>Useful in AI applications to validate responses from models or enforce data consistency.</p>
</li>
</ul>
</li>
</ol>
<p>Start by initializing your NestJS project and installing all the required dependencies:</p>
<pre><code class="lang-dart">$ npm i -g <span class="hljs-meta">@nestjs</span>/cli <span class="hljs-comment"># If you don't have the Nest CLI installed on your machine</span>
$ nest <span class="hljs-keyword">new</span> project-name

<span class="hljs-string">"dependencies"</span> : {
    <span class="hljs-string">"@langchain/core"</span>: <span class="hljs-string">"^0.3.75"</span>,
    <span class="hljs-string">"@langchain/google-genai"</span>: <span class="hljs-string">"^0.2.16"</span>,
    <span class="hljs-string">"@langchain/langgraph"</span>: <span class="hljs-string">"^0.4.8"</span>,
    <span class="hljs-string">"@langchain/langgraph-checkpoint-mongodb"</span>: <span class="hljs-string">"^0.1.1"</span>,
    <span class="hljs-string">"@langchain/mongodb"</span>: <span class="hljs-string">"^0.1.0"</span>,
    <span class="hljs-string">"@nestjs/mongoose"</span>: <span class="hljs-string">"^11.0.3"</span>,
    <span class="hljs-string">"langchain"</span>: <span class="hljs-string">"^0.3.33"</span>,
    <span class="hljs-string">"mongodb"</span>: <span class="hljs-string">"^6.19.0"</span>,
    <span class="hljs-string">"mongoose"</span>: <span class="hljs-string">"^8.18.1"</span>,
    <span class="hljs-string">"zod"</span>: <span class="hljs-string">"^4.1.8"</span>
}

<span class="hljs-comment">// The versions may differ by the time you read this, so I recommend checking</span>
<span class="hljs-comment">// the official documentation for each package.</span>
</code></pre>
<p>Now that we have our project created and all the packages installed, let’s see what we need to do to turn our vision into a project. Think of what you’ll need in order to create a Starbucks barista:</p>
<ul>
<li><p>First, we need to define the structure of our data (creating schemas)</p>
</li>
<li><p>Then we need to create a menu list that our agent will be referring to.</p>
</li>
<li><p>After that, we’ll add LLM interaction</p>
</li>
<li><p>And last but not least, we’ll add the ability to save previous conversations for conversational context.</p>
</li>
</ul>
<h3 id="heading-folder-structure">Folder Structure</h3>
<p>You can modify this folder structure and adapt it based on your framework of choice. But the core implementation is the same across all frameworks.</p>
<pre><code class="lang-plaintext">├── .env
├── .eslintrc.js
├── .gitignore
├── .prettierrc
├── nest-cli.json
├── package.json
├── README.md
├── tsconfig.build.json
├── tsconfig.json
├── src/
│   ├── app.controller.ts
│   ├── app.module.ts
│   ├── app.service.ts
│   ├── main.ts
│   ├── chat/
│   │   ├── chat.controller.ts
│   │   ├── chat.module.ts
│   │   ├── chat.service.ts
│   │   └── dtos/
│   │       └── chat.dto.ts
│   ├── data/
│   │   └── schema/
│   │       └── order.schema.ts
│   └── util/
│       ├── constants/
│       │   └── drinks_data.ts
│       ├── schemas/
│       │   ├── drinks/
│       │   │   └── Drink.schema.ts
│       │   └── orders/
│       │       └── Order.schema.ts
│       ├── summeries/
│       │   └── drink.ts
│       └── types/
</code></pre>
<h2 id="heading-data-schematization-with-zod">Data Schematization with Zod</h2>
<p>The drinks schema file contains all our schema definitions for drinks and the modifications they can receive. It defines the structure of the data that the AI agent will use.</p>
<h3 id="heading-importing-zod"><strong>Importing Zod</strong></h3>
<p>In the <code>lib/util/schemas/drinks.ts</code> file, before defining any schemas, import the Zod library, which provides tools for building TypeScript-first schemas.</p>
<pre><code class="lang-typescript"><span class="hljs-comment">// Imports the 'z' object from the 'zod' library.</span>
<span class="hljs-comment">// Zod is a TypeScript-first schema declaration and validation library.</span>
<span class="hljs-comment">// 'z' is the primary object used to define schemas (e.g., z.object, z.string, z.boolean, z.array).</span>
<span class="hljs-keyword">import</span> z <span class="hljs-keyword">from</span> <span class="hljs-string">"zod"</span>;
</code></pre>
<p>Zod gives you a simple and expressive way to define and validate the structure of the data our agent will interact with.</p>
<h3 id="heading-drink-schema"><strong>Drink Schema</strong></h3>
<p>This schema represents the structure of a drink in the Starbucks-style menu. Each field is commented to show what the property controls.</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> DrinkSchema = z.object({
  name: z.string(),            <span class="hljs-comment">// Required name of the drink</span>
  description: z.string(),     <span class="hljs-comment">// Required explanation of what the drink is</span>
  supportMilk: z.boolean(),    <span class="hljs-comment">// Whether milk options are available</span>
  supportSweeteners: z.boolean(), <span class="hljs-comment">// Whether sweeteners can be added</span>
  supportSyrup: z.boolean(),   <span class="hljs-comment">// Whether flavor syrups are allowed</span>
  supportTopping: z.boolean(), <span class="hljs-comment">// Whether toppings are supported</span>
  supportSize: z.boolean(),    <span class="hljs-comment">// Whether the drink can be ordered in sizes</span>
  image: z.string().url().optional(), <span class="hljs-comment">// Optional image URL</span>
});
</code></pre>
<h3 id="heading-what-this-schema-represents"><strong>What this schema represents</strong></h3>
<ul>
<li><p>It ensures every drink has a proper name and a description.</p>
</li>
<li><p>It defines which customizations apply to the drink.</p>
</li>
<li><p>It prepares the agent to reason about drink options in a structured, validated format.</p>
</li>
</ul>
<h3 id="heading-sweetener-schema"><strong>Sweetener Schema</strong></h3>
<p>Each sweetener option in the menu is represented with its own schema.</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> SweetenerSchema = z.object({
  name: z.string(),                <span class="hljs-comment">// Sweetener name</span>
  description: z.string(),         <span class="hljs-comment">// What it is / taste description</span>
  image: z.string().url().optional(), <span class="hljs-comment">// Optional image URL</span>
});
</code></pre>
<p>This ensures consistency across all sweetener entries and avoids malformed data.</p>
<h3 id="heading-syrup-schema"><strong>Syrup Schema</strong></h3>
<p>Similar to sweeteners, but for syrup flavors:</p>
<pre><code class="lang-typescript">
<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> SyrupSchema = z.object({
  name: z.string(),
  description: z.string(),
  image: z.string().url().optional(),
});
</code></pre>
<p>This can represent flavors like Vanilla, Caramel, or Hazelnut.</p>
<h3 id="heading-topping-schema"><strong>Topping Schema</strong></h3>
<p>Toppings such as whipped cream or cinnamon are defined here.</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> ToppingSchema = z.object({
  name: z.string(),
  description: z.string(),
  image: z.string().url().optional(),
});
</code></pre>
<h3 id="heading-size-schema"><strong>Size Schema</strong></h3>
<p>Drink sizes are modeled as objects as well:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> SizeSchema = z.object({
  name: z.string(),               <span class="hljs-comment">// e.g. Small, Medium</span>
  description: z.string(),        <span class="hljs-comment">// A short explanation</span>
  image: z.string().url().optional(),
});
</code></pre>
<h3 id="heading-milk-schema"><strong>Milk Schema</strong></h3>
<p>Represents milk types such as Whole, Skim, Almond, or Oat.</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> MilkSchema = z.object({
  name: z.string(),
  description: z.string(),
  image: z.string().url().optional(),
});
</code></pre>
<h3 id="heading-collections-of-items"><strong>Collections of Items</strong></h3>
<p>Now that the individual item schemas exist, we can create <strong>collections</strong> of them. These represent all available toppings, sizes, milk types, syrups, sweeteners, and the entire menu of drinks.</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> ToppingsSchema = z.array(ToppingSchema);
<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> SizesSchema = z.array(SizeSchema);
<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> MilksSchema = z.array(MilkSchema);
<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> SyrupsSchema = z.array(SyrupSchema);
<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> SweetenersSchema = z.array(SweetenerSchema);
<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> DrinksSchema = z.array(DrinkSchema);
</code></pre>
<p>Why arrays? Because in the real world, your agent will receive <strong>lists</strong> from a database or API—not single items.</p>
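<p>As a rough illustration of what these array schemas give you at runtime, the sketch below validates a list with Zod’s <code>safeParse</code>. The schema is a trimmed-down copy of <code>SizeSchema</code> so the block stands alone, and the <code>rows</code> data is made up for the example.</p>

```typescript
import { z } from "zod";

// A trimmed-down copy of the article's size schemas (image field omitted).
const SizeSchema = z.object({
  name: z.string(),
  description: z.string(),
});
const SizesSchema = z.array(SizeSchema);

// Made-up data standing in for rows returned by a database or API.
const rows: unknown = [
  { name: "Tall", description: "Small cup" },
  { name: "Grande" }, // description is missing on purpose
];

// safeParse reports problems in its result instead of throwing.
const result = SizesSchema.safeParse(rows);
const isValid = result.success;

// The path of the first issue points at the offending item and field.
const firstIssuePath = result.success ? [] : result.error.issues[0].path;
```

<p>Here the second item fails validation, so the whole list is rejected and the issue path identifies index <code>1</code> and the <code>description</code> field.</p>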
<h3 id="heading-inferred-types"><strong>Inferred Types</strong></h3>
<p>Zod also allows TypeScript to infer types from schemas automatically.</p>
<p>This ensures:</p>
<ul>
<li><p>TypeScript types always match the schemas.</p>
</li>
<li><p>You avoid duplicated definitions.</p>
</li>
<li><p>The agent code stays consistent and safe.</p>
</li>
</ul>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> Drink = z.infer&lt;<span class="hljs-keyword">typeof</span> DrinkSchema&gt;;
<span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> SupportSweetener = z.infer&lt;<span class="hljs-keyword">typeof</span> SweetenerSchema&gt;;
<span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> Syrup = z.infer&lt;<span class="hljs-keyword">typeof</span> SyrupSchema&gt;;
<span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> Topping = z.infer&lt;<span class="hljs-keyword">typeof</span> ToppingSchema&gt;;
<span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> Size = z.infer&lt;<span class="hljs-keyword">typeof</span> SizeSchema&gt;;
<span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> Milk = z.infer&lt;<span class="hljs-keyword">typeof</span> MilkSchema&gt;;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> Toppings = z.infer&lt;<span class="hljs-keyword">typeof</span> ToppingsSchema&gt;;
<span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> Sizes = z.infer&lt;<span class="hljs-keyword">typeof</span> SizesSchema&gt;;
<span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> Milks = z.infer&lt;<span class="hljs-keyword">typeof</span> MilksSchema&gt;;
<span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> Syrups = z.infer&lt;<span class="hljs-keyword">typeof</span> SyrupsSchema&gt;;
<span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> Sweeteners = z.infer&lt;<span class="hljs-keyword">typeof</span> SweetenersSchema&gt;;
<span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> Drinks = z.infer&lt;<span class="hljs-keyword">typeof</span> DrinksSchema&gt;;
</code></pre>
<p>These provide the rest of your LangChain/LangGraph code with strong typing based on your schema definitions.</p>
<p>This entire file:</p>
<ul>
<li><p>Encodes all drink-related data structures.</p>
</li>
<li><p>Provides validation to ensure clean, predictable data.</p>
</li>
<li><p>Automatically generates TypeScript types.</p>
</li>
<li><p>Helps the AI agent reason reliably about drinks and customization options.</p>
</li>
</ul>
<p>You’ll use these schemas later and convert them into string representations for LLM prompts.</p>
<p><em>You can find the file containing all the code</em> <a target="_blank" href="https://github.com/DjibrilM/langgraph-starbucks-agent/blob/main/src/lib/schemas/drinks.ts"><em>here</em></a><em>.</em></p>
<h2 id="heading-how-to-parse-the-schema">How to Parse the Schema</h2>
<p>As mentioned earlier, LLMs are <strong>text input–output machines</strong>. They don’t understand TypeScript types or Zod schemas directly. If you include a schema inside a prompt, the model will simply see it as plain text without understanding its structure or constraints.</p>
<p>Because of this, we need a way to convert schemas into a readable string format that can be embedded inside a prompt, such as:</p>
<blockquote>
<p>“The output must be a JSON object with the following fields…”</p>
</blockquote>
<p>This is exactly the problem solved by <code>StructuredOutputParser</code> from <code>langchain/output_parsers</code>. It takes a Zod schema and turns it into:</p>
<ul>
<li><p>A human-readable description that can be sent to an LLM.</p>
</li>
<li><p>A validator that checks whether the model’s output matches the schema.</p>
</li>
</ul>
<p>In short, it acts as a bridge between typed application logic and text-based AI output.</p>
<h3 id="heading-defining-the-order-schema">Defining the Order Schema</h3>
<p>We’ll start with a simple Zod schema that represents a customer’s drink order. This schema defines the exact shape and constraints of the data we expect the model to produce.</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> OrderSchema = z.object({
  drink: z.string(),
  size: z.string(),
  mil: z.string(),
  syrup: z.string(),
  sweeteners: z.string(),
  toppings: z.string(),
  quantity: z.number().min(<span class="hljs-number">1</span>).max(<span class="hljs-number">10</span>),
});

<span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> OrderType = z.infer&lt;<span class="hljs-keyword">typeof</span> OrderSchema&gt;;
</code></pre>
<p>At this point, the schema is useful only inside our TypeScript application. The LLM still has no idea what this structure means.</p>
<h3 id="heading-parsing-the-schema-into-human-readable-text">Parsing the Schema into Human-Readable Text</h3>
<p>This is where schema parsing comes in. Using <code>StructuredOutputParser.fromZodSchema</code>, we can transform the Zod schema into:</p>
<ul>
<li><p>Instructions the LLM can understand.</p>
</li>
<li><p>A runtime validator that ensures the response is correct.</p>
</li>
</ul>
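<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { StructuredOutputParser } <span class="hljs-keyword">from</span> <span class="hljs-string">"langchain/output_parsers"</span>;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> OrderParser =
  StructuredOutputParser.fromZodSchema(OrderSchema <span class="hljs-keyword">as</span> <span class="hljs-built_in">any</span>);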
</code></pre>
<p>The parser enables two critical workflows:</p>
<h4 id="heading-generating-prompt-instructions">Generating prompt instructions</h4>
<p>Calling <code>OrderParser.getFormatInstructions()</code> returns a text description of the schema that conveys roughly: “Return a JSON object with the fields <code>drink</code>, <code>size</code>, <code>mil</code>, <code>syrup</code>, <code>sweeteners</code>, and <code>toppings</code> as strings, and <code>quantity</code> as a number between 1 and 10.” This string can be injected directly into your prompt so the LLM knows exactly how to format its response.</p>
<h4 id="heading-validating-the-models-output">Validating the model’s output</h4>
<p>After the LLM responds, its output is still just text. The parser:</p>
<ul>
<li><p>Converts that text into a JavaScript object.</p>
</li>
<li><p>Validates it against the original Zod schema.</p>
</li>
<li><p>Throws an error if anything is missing, malformed, or out of bounds.</p>
</li>
</ul>
<p>This prevents invalid AI-generated data (for example, <code>quantity: 0</code>) from entering your system.</p>
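<p>The three steps above can be sketched without any library. The function below is a hand-written stand-in, not the <code>StructuredOutputParser</code> implementation, and it checks only two fields of a hypothetical <code>SimpleOrder</code> type. In the real project, the parser and the Zod schema do this work for you.</p>

```typescript
// Hypothetical order shape for this sketch, reduced to two fields.
type SimpleOrder = { drink: string; quantity: number };

const parseOrder = (modelText: string): SimpleOrder => {
  // Models often wrap JSON in extra prose, so take the outermost braces.
  const start = modelText.indexOf("{");
  const end = modelText.lastIndexOf("}");
  if (start === -1 || start > end) throw new Error("No JSON object found");

  // Step 1: convert the text into a JavaScript object.
  const data = JSON.parse(modelText.slice(start, end + 1));

  // Step 2: validate each field; step 3: throw when something is wrong.
  if (typeof data.drink !== "string") throw new Error("drink must be a string");
  if (typeof data.quantity !== "number" || 1 > data.quantity || data.quantity > 10) {
    throw new Error("quantity must be a number between 1 and 10");
  }
  return { drink: data.drink, quantity: data.quantity };
};

// A valid response, wrapped in a little prose as a model might produce it.
const order = parseOrder('Here is your order: {"drink": "Latte", "quantity": 2}');
```

<p>Passing a response with <code>"quantity": 0</code> to <code>parseOrder</code> throws, which is the behavior you want before anything reaches the database.</p>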
<h3 id="heading-reusing-the-same-approach-for-other-schemas">Reusing the Same Approach for Other Schemas</h3>
<p>Once you understand this pattern, applying it to other schemas is straightforward.</p>
<p>For example, you can do the same thing for a <code>DrinkSchema</code>:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> DrinkParser =
  StructuredOutputParser.fromZodSchema(DrinkSchema <span class="hljs-keyword">as</span> <span class="hljs-built_in">any</span>);
</code></pre>
<p>Now you can confidently say something like: “Hey Gemini, this is what a drink object looks like—please respond using this structure.”</p>
<h3 id="heading-why-this-matters">Why This Matters</h3>
<p>Schema parsing allows you to:</p>
<ul>
<li><p>Keep strong typing in your application.</p>
</li>
<li><p>Give clear formatting instructions to the LLM.</p>
</li>
<li><p>Safely convert unstructured AI output into validated, production-ready data.</p>
</li>
</ul>
<p>Without this step, working with LLMs at scale becomes unreliable and error-prone.</p>
<h2 id="heading-data-to-text-summarization">Data-to-Text Summarization</h2>
<p>In the context of LLM agents, <strong>data-to-text summarization</strong> means converting structured data—such as objects returned from a database or backend API—into <strong>clear, human-readable strings</strong> that can be embedded directly into prompts.</p>
<p>Even the most advanced LLMs operate purely on text. They don’t reason over JavaScript objects, database rows, or JSON structures in the same way humans or programs do. The clearer and more descriptive your text input is, the more accurate and reliable the model’s output will be.</p>
<p>Because of this, a common and recommended pattern when building LLM-powered systems is:</p>
<p><strong>Fetch structured data → summarize it into natural language → pass the summary into the prompt</strong></p>
<p>To keep this article focused, we’ll store our data in constants instead of querying a real database. The technique is exactly the same whether the data comes from MongoDB, PostgreSQL, or an API.</p>
<h3 id="heading-the-core-idea">The Core Idea</h3>
<p>The goal of data-to-text summarization is simple:</p>
<ul>
<li><p>Take an object with fields and boolean flags</p>
</li>
<li><p>Convert it into a short paragraph that explains what the object represents</p>
</li>
<li><p>Remove ambiguity and guesswork for the LLM</p>
</li>
</ul>
<p>Instead of forcing the model to infer meaning from raw data, we <em>spell it out explicitly</em>.</p>
<h3 id="heading-summarizing-a-drink-object">Summarizing a Drink Object</h3>
<p>Consider the following drink object:</p>
<pre><code class="lang-typescript">{
  name: <span class="hljs-string">'Espresso'</span>,
  description: <span class="hljs-string">'Strong concentrated coffee shot.'</span>,
  supportMilk: <span class="hljs-literal">false</span>,
  supportSweeteners: <span class="hljs-literal">true</span>,
  supportSyrup: <span class="hljs-literal">true</span>,
  supportTopping: <span class="hljs-literal">false</span>,
  supportSize: <span class="hljs-literal">false</span>,
}
</code></pre>
<p>While this structure is easy for developers to understand, it’s not ideal for an LLM prompt. Boolean flags like <code>supportMilk: false</code> require interpretation, which increases the chance of incorrect assumptions.</p>
<p>Instead, we convert this object into a descriptive paragraph:</p>
<p>“A drink named Espresso. It is described as a strong, concentrated coffee shot. It cannot be made with milk. It can be made with sweeteners. It can be made with syrup. It cannot be made with toppings. It cannot be made in different sizes.”</p>
<p>This transformation is exactly what data-to-text summarization provides.</p>
<h3 id="heading-a-standard-summarization-pattern">A Standard Summarization Pattern</h3>
<p>Below is a simplified example of how we convert a <code>Drink</code> object into a readable description.</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> createDrinkItemSummary = (drink: Drink): <span class="hljs-function"><span class="hljs-params">string</span> =&gt;</span> {
  <span class="hljs-keyword">const</span> name = <span class="hljs-string">`A drink named <span class="hljs-subst">${drink.name}</span>.`</span>;
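  <span class="hljs-comment">// Strip a trailing period so the sentence does not end with two periods.</span>
  <span class="hljs-keyword">const</span> description = <span class="hljs-string">`It is described as <span class="hljs-subst">${drink.description.replace(/\.$/, '')}</span>.`</span>;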

  <span class="hljs-keyword">const</span> milk = drink.supportMilk
    ? <span class="hljs-string">'It can be made with milk.'</span>
    : <span class="hljs-string">'It cannot be made with milk.'</span>;

  <span class="hljs-keyword">const</span> sweeteners = drink.supportSweeteners
    ? <span class="hljs-string">'It can be made with sweeteners.'</span>
    : <span class="hljs-string">'It cannot contain sweeteners.'</span>;

  <span class="hljs-keyword">const</span> syrup = drink.supportSyrup
    ? <span class="hljs-string">'It can be made with syrup.'</span>
    : <span class="hljs-string">'It cannot be made with syrup.'</span>;

  <span class="hljs-keyword">const</span> toppings = drink.supportTopping
    ? <span class="hljs-string">'It can be made with toppings.'</span>
    : <span class="hljs-string">'It cannot be made with toppings.'</span>;

  <span class="hljs-keyword">const</span> size = drink.supportSize
    ? <span class="hljs-string">'It can be made in different sizes.'</span>
    : <span class="hljs-string">'It cannot be made in different sizes.'</span>;

  <span class="hljs-keyword">return</span> <span class="hljs-string">`<span class="hljs-subst">${name}</span> <span class="hljs-subst">${description}</span> <span class="hljs-subst">${milk}</span> <span class="hljs-subst">${sweeteners}</span> <span class="hljs-subst">${syrup}</span> <span class="hljs-subst">${toppings}</span> <span class="hljs-subst">${size}</span>`</span>;
};
</code></pre>
<h3 id="heading-why-this-works-well-for-llms">Why this works well for LLMs</h3>
<ul>
<li><p>Boolean logic is converted into <strong>explicit sentences</strong></p>
</li>
<li><p>Every capability and limitation is clearly stated</p>
</li>
<li><p>The output can be embedded directly into a system or user prompt</p>
</li>
</ul>
<h3 id="heading-summarizing-collections-of-data">Summarizing Collections of Data</h3>
<p>This same approach applies to lists of data such as milks, syrups, toppings, or sizes. Instead of passing an array of objects to the model, we convert them into bullet-style text summaries:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> createSweetenersSummary = (): <span class="hljs-function"><span class="hljs-params">string</span> =&gt;</span> {
  <span class="hljs-keyword">return</span> <span class="hljs-string">`Available sweeteners are:
<span class="hljs-subst">${SWEETENERS.map(
  (s) =&gt; <span class="hljs-string">`- <span class="hljs-subst">${s.name}</span>: <span class="hljs-subst">${s.description}</span>`</span>
).join(<span class="hljs-string">'\n'</span>)}</span>`</span>;
};
</code></pre>
<p>This gives the model a <strong>complete, readable overview</strong> of available options without requiring it to interpret raw arrays.</p>
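<p>To show where such a summary ends up, here is a small sketch that places one inside a system prompt string. The <code>SIZES</code> data, the <code>createSizesSummary</code> helper, and the prompt wording are all assumptions made for this example.</p>

```typescript
// Hypothetical sample data; the article keeps the real lists in constants.
const SIZES = [
  { name: "Tall", description: "Small cup" },
  { name: "Grande", description: "Medium cup" },
];

// Same pattern as createSweetenersSummary: a heading line plus one bullet per item.
const createSizesSummary = (): string =>
  ["Available sizes are:", ...SIZES.map((s) => `- ${s.name}: ${s.description}`)].join("\n");

// Embed the summary in the text that will be sent as the system prompt.
const buildSystemPrompt = (): string =>
  `You are a barista taking drink orders.\n${createSizesSummary()}\nOnly offer options listed above.`;
```

<p>Because the prompt is built from the data each time, updating the menu constants updates what the model is told without touching the prompt template.</p>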
<h3 id="heading-applying-the-same-idea-to-other-domains">Applying the Same Idea to Other Domains</h3>
<p>This pattern is not limited to drinks or menus. It works for <em>any</em> domain. For example, here’s the same summarization technique applied to an object representing a shoe in an online ordering assistant:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> createShoeItemSummary = (shoe: {
  name: <span class="hljs-built_in">string</span>;
  description: <span class="hljs-built_in">string</span>;
  genderCategory: <span class="hljs-built_in">string</span>;
  styleType: <span class="hljs-built_in">string</span>;
  material: <span class="hljs-built_in">string</span>;
  availableInMultipleColors: <span class="hljs-built_in">boolean</span>;
  limitedEdition: <span class="hljs-built_in">boolean</span>;
  supportsCustomization: <span class="hljs-built_in">boolean</span>;
}): <span class="hljs-function"><span class="hljs-params">string</span> =&gt;</span> {
  <span class="hljs-keyword">return</span> <span class="hljs-string">`
A shoe named <span class="hljs-subst">${shoe.name}</span>.
It is described as <span class="hljs-subst">${shoe.description}</span>.
It is categorized as a <span class="hljs-subst">${shoe.genderCategory.toLowerCase()}</span> shoe.
It belongs to the <span class="hljs-subst">${shoe.styleType.toLowerCase()}</span> fashion style.
It is made of <span class="hljs-subst">${shoe.material.toLowerCase()}</span> material.
<span class="hljs-subst">${shoe.availableInMultipleColors ? <span class="hljs-string">'It is available in multiple colors.'</span> : <span class="hljs-string">'It is available in a single color.'</span>}</span>
<span class="hljs-subst">${shoe.limitedEdition ? <span class="hljs-string">'It is a limited-edition release.'</span> : <span class="hljs-string">'It is not a limited-edition release.'</span>}</span>
<span class="hljs-subst">${shoe.supportsCustomization ? <span class="hljs-string">'It supports customization options.'</span> : <span class="hljs-string">'It does not support customization options.'</span>}</span>
`</span>.trim();
};
</code></pre>
<p>Which produces an output like:</p>
<p>“A shoe named Veloria Canvas Sneaker. It is described as a minimalist everyday sneaker designed for casual wear. It is categorized as a unisex shoe. It belongs to the casual fashion style. It is made of breathable canvas material. It is available in multiple colors. It is not a limited-edition release. It supports customization options.”</p>
<h2 id="heading-how-to-persist-orders-with-mongodb-in-nestjs">How to Persist Orders with MongoDB in NestJS</h2>
<p>Now that we’ve established the core foundations of our application—schemas, parsers, and data-to-text summaries—it’s time to <strong>persist data</strong>. In a real-world assistant, orders and conversations shouldn’t disappear when the server restarts. They need to be stored reliably so they can be retrieved, analyzed, or continued later.</p>
<p>To achieve this, we’ll use MongoDB as our database and the NestJS Mongoose integration to manage data models and collections.</p>
<h3 id="heading-connecting-mongodb-to-a-nestjs-application">Connecting MongoDB to a NestJS Application</h3>
<p>In NestJS, the <code>AppModule</code> is the root module of the application. This is where global dependencies—such as database connections—are configured.</p>
<pre><code class="lang-typescript"><span class="hljs-meta">@Module</span>({
  imports: [
    MongooseModule.forRoot(process.env.MONGO_URI),
    ChatsModule,
  ],
  controllers: [AppController],
  providers: [AppService],
})
<span class="hljs-keyword">export</span> <span class="hljs-keyword">class</span> AppModule {}
</code></pre>
<p>What’s happening here?</p>
<ul>
<li><p><code>MongooseModule.forRoot(...)</code> establishes a global MongoDB connection.</p>
</li>
<li><p>The connection string is read from an environment variable (<code>MONGO_URI</code>) rather than hard-coded, which keeps credentials out of source control.</p>
</li>
<li><p>Once configured, this connection becomes available throughout the entire application.</p>
</li>
<li><p><code>ChatsModule</code> is imported so it can access the database connection and register its own schemas.</p>
</li>
</ul>
<p>This setup ensures that every feature module can safely interact with MongoDB without creating multiple connections.</p>
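<p>One caveat: <code>process.env.MONGO_URI</code> is <code>undefined</code> when the variable is not set, and the problem then only shows up once the application tries to connect. A small guard can fail fast at startup instead. The sketch below is one possible approach; the helper name <code>requireEnv</code> is hypothetical and is not part of NestJS or the original project.</p>
<pre><code class="lang-typescript">// Hypothetical helper: read a required environment variable or fail fast.
// The env object is passed in as a parameter so the function is easy to test.
type Env = { [key: string]: string | undefined };

const requireEnv = (env: Env, name: string): string => {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
};

// Usage sketch: MongooseModule.forRoot(requireEnv(process.env, 'MONGO_URI'))
const sampleEnv: Env = { MONGO_URI: 'mongodb://localhost:27017/drinks_db' };
console.log(requireEnv(sampleEnv, 'MONGO_URI'));
</code></pre>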
<h3 id="heading-defining-an-order-schema-with-mongoose">Defining an Order Schema with Mongoose</h3>
<p>The <code>@nestjs/mongoose</code> package uses decorators to define Mongoose schemas in a clean, class-based way. Each class describes the shape of a MongoDB document, and each decorated property becomes a field in that document.</p>
<pre><code class="lang-typescript"><span class="hljs-meta">@Schema</span>()
<span class="hljs-keyword">export</span> <span class="hljs-keyword">class</span> Order {
  <span class="hljs-meta">@Prop</span>({ required: <span class="hljs-literal">true</span> })
  drink: <span class="hljs-built_in">string</span>;

  <span class="hljs-meta">@Prop</span>({ <span class="hljs-keyword">default</span>: <span class="hljs-literal">null</span> })
  size: <span class="hljs-built_in">string</span>;

  <span class="hljs-meta">@Prop</span>({ <span class="hljs-keyword">default</span>: <span class="hljs-literal">null</span> })
  milk: <span class="hljs-built_in">string</span>;

  <span class="hljs-meta">@Prop</span>({ <span class="hljs-keyword">default</span>: <span class="hljs-literal">null</span> })
  syrup: <span class="hljs-built_in">string</span>;

  <span class="hljs-meta">@Prop</span>({ <span class="hljs-keyword">default</span>: <span class="hljs-literal">null</span> })
  sweeter: <span class="hljs-built_in">string</span>;

  <span class="hljs-meta">@Prop</span>({ <span class="hljs-keyword">default</span>: <span class="hljs-literal">null</span> })
  toppings: <span class="hljs-built_in">string</span>;

  <span class="hljs-meta">@Prop</span>({ <span class="hljs-keyword">default</span>: <span class="hljs-number">1</span> })
  quantity: <span class="hljs-built_in">number</span>;
}
</code></pre>
<p>Why this approach?</p>
<ul>
<li><p>Each <code>@Prop()</code> decorator maps directly to a MongoDB field.</p>
</li>
<li><p>Default values allow partial orders to be saved incrementally.</p>
</li>
<li><p>Required fields (like <code>drink</code>) enforce basic data integrity.</p>
</li>
<li><p>The schema closely mirrors the structured output produced by the LLM.</p>
</li>
</ul>
<p>Once the class is defined, it’s converted into a MongoDB schema:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> OrderSchema = SchemaFactory.createForClass(Order);
</code></pre>
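<p>This single line produces a Mongoose schema that carries:</p>
<ul>
<li><p>The field definitions declared with <code>@Prop()</code></p>
</li>
<li><p>The defaults and validation rules (such as <code>required</code>) that are applied when documents are saved</p>
</li>
<li><p>The structure Mongoose needs to build a model for creating, reading, and updating orders</p>
</li>
</ul>
<p>The collection itself is not created by this line. It comes into existence once a model built from this schema is registered with Mongoose and used against the database.</p>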
<h3 id="heading-how-this-fits-into-the-llm-agent-architecture">How This Fits into the LLM Agent Architecture</h3>
<p>At this point, we have:</p>
<ul>
<li><p><strong>Zod schemas</strong> → for validating AI output</p>
</li>
<li><p><strong>Summarization functions</strong> → for converting data into readable prompts</p>
</li>
<li><p><strong>MongoDB schemas</strong> → for persisting finalized orders</p>
</li>
</ul>
<p>This separation is intentional:</p>
<ul>
<li><p>Zod handles <em>AI-facing validation</em></p>
</li>
<li><p>Mongoose handles <em>database persistence</em></p>
</li>
<li><p>NestJS acts as the glue that ties everything together</p>
</li>
</ul>
<h3 id="heading-preparing-for-the-agent-logic">Preparing for the Agent Logic</h3>
<p>With the database in place, we’re now ready to implement the agent itself.</p>
<p>The agent’s responsibilities will include:</p>
<ul>
<li><p>Interpreting user messages</p>
</li>
<li><p>Calling tools</p>
</li>
<li><p>Generating structured orders</p>
</li>
<li><p>Validating them</p>
</li>
<li><p>Persisting them to MongoDB</p>
</li>
<li><p>Maintaining conversational state</p>
</li>
</ul>
<p>All of this logic will live inside the <code>src/chats/chats.service.ts</code> file. The next section introduces the <strong>agent’s core logic</strong>, and we’ll walk through it step by step so every part is easy to follow.</p>
<p>Start by importing the required dependencies:</p>
<pre><code class="lang-tsx">
import { Injectable } from '@nestjs/common';
import { InjectModel } from '@nestjs/mongoose';
import { MongoClient } from 'mongodb';
import { Model } from 'mongoose';

import { tool } from '@langchain/core/tools';
import {
  ChatPromptTemplate,
  MessagesPlaceholder,
} from '@langchain/core/prompts';
import { AIMessage, BaseMessage, HumanMessage } from '@langchain/core/messages';

import { ChatGoogleGenerativeAI } from '@langchain/google-genai';
import { Annotation, END, START, StateGraph } from '@langchain/langgraph';
import { ToolNode } from '@langchain/langgraph/prebuilt';

import { MongoDBSaver } from '@langchain/langgraph-checkpoint-mongodb';

import z from 'zod';

import { Order } from './schemas/order.schema';
import { OrderParser, OrderSchema, OrderType } from 'src/lib/schemas/orders';
import { DrinkParser } from 'src/lib/schemas/drinks';
import { DRINKS } from 'src/lib/utils/constants/menu_data';

import {
  createSweetenersSummary,
  availableToppingsSummary,
  createAvailableMilksSummary,
  createSyrupsSummary,
  createSizesSummary,
  createDrinkItemSummary,
} from 'src/lib/summaries';

const GOOGLE_API_KEY = process.env.GOOGLE_API_KEY || '';
const client: MongoClient = new MongoClient(process.env.MONGO_URI || '');
const database_name = 'drinks_db';
</code></pre>
<h2 id="heading-langgraph-stateannotation-terms">LangGraph State/Annotation Terms</h2>
<p>In LangGraph, <strong>state</strong> can be thought of as a temporary workspace that exists while the agent is running. It stores the information that nodes (we’ll cover nodes in detail later) might need, such as the last message, the history of the conversation, or any intermediate data generated during execution.</p>
<p>This state allows nodes to <strong>read from it, update it, and pass information along</strong> as the agent processes a workflow, making it the agent’s short-term memory for the duration of the run.</p>
<pre><code class="lang-tsx">@Injectable()
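export class ChatService {
  // Inject the Mongoose model for orders; the order tool uses it as this.orderModel.
  constructor(
    @InjectModel(Order.name) private readonly orderModel: Model&lt;Order&gt;,
  ) {}
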
  chatWithAgent = async ({
    thread_id,
    query,
  }: {
    thread_id: string;
    query: string;
  }) =&gt; {

    const graphState = Annotation.Root({
      messages: Annotation&lt;BaseMessage[]&gt;({
        reducer: (x, y) =&gt; [...x, ...y],
      }),
    });

  }

}
</code></pre>
<p>This code defines the <strong>LangGraph state</strong> for the chat agent. The <code>graphState</code> object acts as a central memory that every node in the workflow can read from and update.</p>
<p>The <code>messages</code> field specifically stores all messages in the conversation, including user messages, AI responses, and tool outputs. The reducer function <code>[...x, ...y]</code> appends new messages to the existing array, preserving the conversation history across multiple steps.</p>
<p>LangGraph’s reducer mechanism lets developers control how new state merges with old state. In this chat system, the approach is similar to updating React state with <code>setMessages(prev =&gt; [...prev, ...newMessages])</code>: it keeps the old messages while adding the new ones.</p>
<p>Together, this state enables the agent, tools, and checkpointing system to maintain a coherent conversation, allowing each node in the LangGraph workflow to access the full context and contribute incrementally.</p>
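<p>The append behavior is easy to see in isolation. The sketch below models the reducer in plain TypeScript without importing LangGraph; the <code>Message</code> type is a simplified stand-in for <code>BaseMessage</code>, and the function name <code>appendMessages</code> is chosen only for illustration.</p>
<pre><code class="lang-typescript">// Plain-TypeScript model of an append-only reducer; no LangGraph imports needed.
type Message = { role: 'user' | 'ai' | 'tool'; content: string };

// Same shape as the reducer passed to Annotation: (current, update) => merged
const appendMessages = (current: Message[], update: Message[]): Message[] => [
  ...current,
  ...update,
];

let messages: Message[] = [];
messages = appendMessages(messages, [{ role: 'user', content: 'A latte, please' }]);
messages = appendMessages(messages, [{ role: 'ai', content: 'Which size?' }]);

console.log(messages.length); // 2
console.log(messages.map((m) => m.role).join(' -> ')); // user -> ai
</code></pre>
<p>Each node returns only the messages it produced, and the reducer decides how they are merged into the history. A reducer that returned only the update would drop the earlier messages instead.</p>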
<h2 id="heading-how-to-create-tools-for-the-agent">How to Create Tools for the Agent</h2>
<p>Modern chatbots can do more than just generate text: they can also search the internet, read files, or perform computations. While LLMs are powerful, they cannot execute code or compile programs on their own.</p>
<p>In the context of LLM agents, a tool is a piece of code written by the agent developer that an LLM can invoke on the host machine. The host machine executes the code, and the LLM only receives the final output of the computation.</p>
<p>Here's how to create a tool that stores orders in the database. Add it inside the <code>chatWithAgent</code> function of the <code>ChatService</code> class, below the state definition. The tool writes through <code>this.orderModel</code>, the Mongoose model for orders injected into the service:</p>
<pre><code class="lang-tsx">const orderTool = tool(
  async ({ order }: { order: OrderType }) =&gt; {
    try {
      await this.orderModel.create(order);
      return 'Order created successfully';
    } catch (error) {
      console.log(error);
      return 'Failed to create the order';
    }
  },
  {
    schema: z.object({
      order: OrderSchema.describe('The order that will be stored in the DB'),
    }),
    name: 'create_order',
    description: 'This tool creates a new order in the database',
  }
);

const tools = [orderTool];
</code></pre>
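<p>Conceptually, the host side of a tool call is a lookup by name followed by a function call. The sketch below illustrates that idea in plain TypeScript. It is a simplification, not how LangChain implements it: the handlers are synchronous for brevity, there is no schema validation, and <code>toolHandlers</code>, <code>runToolCall</code>, and the in-memory <code>savedOrders</code> array are hypothetical names.</p>
<pre><code class="lang-typescript">type OrderArgs = { drink: string; quantity: number };
type ToolCall = { name: string; args: OrderArgs };

const savedOrders: OrderArgs[] = []; // in-memory stand-in for the orders collection

const toolHandlers: { [name: string]: ((args: OrderArgs) => string) | undefined } = {
  create_order: (args) => {
    savedOrders.push(args);
    return 'Order created successfully';
  },
};

// The host looks up the handler by name, runs it, and hands back only its output.
const runToolCall = (call: ToolCall): string => {
  const handler = toolHandlers[call.name];
  return handler ? handler(call.args) : `Unknown tool: ${call.name}`;
};

console.log(runToolCall({ name: 'create_order', args: { drink: 'Latte', quantity: 1 } }));
console.log(runToolCall({ name: 'refund_order', args: { drink: 'Latte', quantity: 1 } }));
</code></pre>
<p>The model never touches the database itself. It only names a tool and supplies arguments, and it sees the returned string afterwards.</p>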
<h2 id="heading-langgraph-nodes-workflow-components">LangGraph Nodes (Workflow Components)</h2>
<p>From a definition standpoint, a LangGraph node is a fundamental component of a LangGraph workflow, representing a single unit of computation or an individual step in an AI agent's process.</p>
<p>Each node can perform a specific task, such as generating a message, invoking a tool, or transforming data, and it interacts with the state to read inputs and write outputs. Together, nodes are connected to form the agent’s workflow or execution graph, allowing complex reasoning and multi-step operations.</p>
<p>In our project, the graph is built from four pieces: two nodes, an entry point, and a conditional edge.</p>
<ol>
<li><p><strong>Agent node:</strong> This node is in charge of interacting with the LLM. It builds the agent’s main prompt template and appends the previous messages to the new prompt to give the model context.</p>
</li>
<li><p><strong>Tools node:</strong> This node introduces external capabilities. It runs the tools the LLM asks for, which lets the workflow interact with databases and external APIs.</p>
</li>
<li><p><code>START</code> <strong>node:</strong> This marks the entry point of the workflow, that is, which node runs first when a user sends a message to the agent.</p>
</li>
<li><p><strong>Conditional edge:</strong> <code>.addConditionalEdges('agent', shouldContinue)</code> lets the workflow branch after the <code>'agent'</code> node runs, based on the value returned by <code>shouldContinue</code>. Unlike a fixed edge, which always goes from one node to the next, a conditional edge inspects the agent’s output and routes the workflow to different nodes depending on the result.</p>
</li>
</ol>
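<p>The routing decision behind a conditional edge can be sketched without any LangGraph imports. In the example below, <code>AgentMessage</code> is a simplified stand-in for an AI message, the local <code>END</code> constant stands in for the one exported by LangGraph, and the function name <code>route</code> is hypothetical.</p>
<pre><code class="lang-typescript">type ToolCallRequest = { name: string };
type AgentMessage = { content: string; tool_calls?: ToolCallRequest[] };

const END = '__end__'; // stand-in for the END constant exported by LangGraph

// Return the name of the next node based on the last message.
const route = (messages: AgentMessage[]): string => {
  const last = messages[messages.length - 1];
  const calls = last?.tool_calls ?? [];
  return calls.length > 0 ? 'tools' : END;
};

console.log(route([{ content: 'Which size would you like?' }])); // __end__
console.log(route([{ content: '', tool_calls: [{ name: 'create_order' }] }])); // tools
</code></pre>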
<h2 id="heading-graph-declaration">Graph Declaration</h2>
<p>In LangGraph, a graph is the central structure that models an AI agent’s workflow as interconnected nodes, where each node represents a computation step, tool, or decision. It orchestrates the flow of data and control between nodes, manages conditional branching, and maintains the recursive loop of execution.</p>
<p>Essentially, the graph is the backbone that ensures complex, stateful interactions happen in a coordinated and modular way, connecting nodes like <code>agent</code>, <code>tools</code>, and conditional edges into a coherent workflow.</p>
<p>With that knowledge in place, we can now create the agent graph with all its nodes.</p>
<pre><code class="lang-tsx">  const callModal = async (states: typeof graphState.State) =&gt; {
    const prompt = ChatPromptTemplate.fromMessages([
      {
        role: 'system',
        content: `
            You are a helpful assistant that helps users order drinks from Starbucks.
            Your job is to take the user's request and fill in any missing details based on how a complete order should look.
            A complete order follows this structure: ${OrderParser}.

            **TOOLS**
            You have access to a "create_order" tool.
            Use this tool when the user confirms the final order.
            After calling the tool, you should inform the user whether the order was successfully created or if it failed.

            **DRINK DETAILS**
            Each drink has its own set of properties such as size, milk, syrup, sweetener, and toppings.
            Here is the drink schema: ${DrinkParser}.

            You must ask for any missing details before creating the order.

            If the user requests a modification that is not supported for the selected drink, tell them that it is not possible.

            If the user asks for something unrelated to drink orders, politely tell them that you can only assist with drink orders.

            **AVAILABLE OPTIONS**
            List of available drinks and their allowed modifications:
            ${DRINKS.map((drink) =&gt; `- ${createDrinkItemSummary(drink)}`).join('\n')}

            Sweeteners: ${createSweetenersSummary()}
            Toppings: ${availableToppingsSummary()}
            Milks: ${createAvailableMilksSummary()}
            Syrups: ${createSyrupsSummary()}
            Sizes: ${createSizesSummary()}

            Order schema: ${OrderParser}

            If the user's query is unclear, tell them that the request is not clear.

            **ORDER CONFIRMATION**
            Once the order is ready, you must ask the user to confirm it.
            If they confirm, immediately call the "create_order" tool.
            Only respond after the tool completes, indicating success or failure.

            **FRONTEND RESPONSE FORMAT**
            Every response must include:

            "message": "Your message to the user",
            "current_order": "The order currently being constructed",
            "suggestions": "Options the user can choose from",
            "progress": "Order status ('completed' after creation)"

            **IMPORTANT RULES**
            - Be friendly, use emojis, and add humor.
            - Use null for unfilled fields.
            - Never omit the JSON tracking object.
        `,
      },
      new MessagesPlaceholder('messages'),
    ]);

  const formattedPrompt = await prompt.formatMessages({
    time: new Date().toISOString(),
    messages: states.messages,
  });

  const chat = new ChatGoogleGenerativeAI({
    model: 'gemini-2.0-flash',
    temperature: 0,
    apiKey: GOOGLE_API_KEY,
  }).bindTools(tools);

  const result = await chat.invoke(formattedPrompt);
  return { messages: [result] };
  };     
    const shouldContinue = (state: typeof graphState.State) =&gt; {
      const lastMessage = state.messages[
        state.messages.length - 1
      ] as AIMessage;
      return lastMessage.tool_calls?.length ? 'tools' : END;
    };

    const toolsNode = new ToolNode&lt;typeof graphState.State&gt;(tools);

    /**
     * Build the conversation graph.
     */
    const graph = new StateGraph(graphState)
      .addNode('agent', callModal)
      .addNode('tools', toolsNode)
      .addEdge(START, 'agent')
      .addConditionalEdges('agent', shouldContinue)
      .addEdge('tools', 'agent');
</code></pre>
<h3 id="heading-explanation">Explanation</h3>
<ul>
<li><p><strong>Graph State (</strong><code>graphState</code>)<br>  The <code>graphState</code> object is the shared memory across all nodes. It stores <code>messages</code>, which track the conversation history including user inputs, AI responses, and tool interactions. The reducer <code>[...x, ...y]</code> appends new messages, preserving past context. This is similar to React state updates: old messages remain while new ones are added.</p>
</li>
<li><p><strong>Agent Node (</strong><code>callModal</code>)<br>  This node handles the <strong>LLM call</strong>. It formats a prompt containing system instructions, drink schemas, available tools, and frontend response rules. By including <code>states.messages</code>, the AI sees the full conversation history, enabling multi-turn dialogue.</p>
</li>
<li><p><strong>LLM Execution</strong><br>  <code>ChatGoogleGenerativeAI</code> generates the AI response. <code>.bindTools(tools)</code> allows the AI to call tools like <code>create_order</code> directly if needed.</p>
</li>
<li><p><strong>Conditional Flow (</strong><code>shouldContinue</code>)<br>  After the AI responds, the <code>shouldContinue</code> function checks if the message includes tool calls. If so, execution moves to the <code>tools</code> node; otherwise, the workflow ends. This allows dynamic branching depending on the AI’s output.</p>
</li>
<li><p><strong>Tool Node (</strong><code>ToolNode</code>)<br>  The <code>tools</code> node executes the requested tool, such as saving the order to the database. Once completed, control returns to the agent node, enabling the AI to respond to the user with results.</p>
</li>
<li><p><strong>Graph Construction (</strong><code>StateGraph</code>)<br>  Nodes are connected in a coherent workflow:</p>
<ul>
<li><p><code>START → agent</code> begins the conversation</p>
</li>
<li><p>Conditional edges handle tool execution</p>
</li>
<li><p><code>tools → agent</code> ensures the agent can respond after tools run</p>
</li>
</ul>
</li>
<li><p><strong>Overall Flow</strong><br>  Together, the graph and shared state ensure a <strong>stateful, multi-turn conversation</strong>. The AI can ask for missing details, call tools when needed, and maintain context across interactions. Every node reads and writes to the same state.</p>
</li>
</ul>
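<p>To see the whole loop in motion, here is a toy simulation of the same control flow with the LLM and the tool replaced by fakes. Everything in it is an assumption made for illustration: <code>fakeAgent</code>, <code>fakeTool</code>, and <code>runGraph</code> are hypothetical names, and the <code>maxSteps</code> guard is only loosely analogous to a recursion limit.</p>
<pre><code class="lang-typescript">type Msg = { from: 'user' | 'agent' | 'tool'; text: string; wantsTool?: boolean };

// Fake agent: requests the tool once, then reports the tool's result.
const fakeAgent = (history: Msg[]): Msg => {
  const last = history[history.length - 1];
  if (last?.from === 'tool') return { from: 'agent', text: `Done: ${last.text}` };
  return { from: 'agent', text: 'Saving your order', wantsTool: true };
};

const fakeTool = (): Msg => ({ from: 'tool', text: 'Order created successfully' });

const runGraph = (input: Msg, maxSteps: number): Msg[] => {
  const history: Msg[] = [input];
  let node = 'agent'; // START -> agent
  let stepsLeft = maxSteps;
  while (stepsLeft > 0) {
    stepsLeft -= 1;
    if (node === 'agent') {
      const reply = fakeAgent(history);
      history.push(reply);
      if (!reply.wantsTool) return history; // conditional edge -> END
      node = 'tools'; // conditional edge -> tools
    } else {
      history.push(fakeTool());
      node = 'agent'; // fixed edge: tools -> agent
    }
  }
  throw new Error('Step limit reached before the graph finished');
};

const trace = runGraph({ from: 'user', text: 'Yes, confirm my latte' }, 15);
console.log(trace.map((m) => m.from).join(' -> ')); // user -> agent -> tool -> agent
</code></pre>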
<h2 id="heading-workflow-compilation-and-state-persistence-final-part"><strong>Workflow Compilation and State Persistence (Final Part)</strong></h2>
<p>So far, our state is temporary, meaning it only exists for the duration of a single request. However, we want our agent to <strong>remember and recall conversation context</strong> when a new request arrives with the same <code>thread_id</code> (conversation ID).</p>
<p>To achieve this, we’ll use MongoDB in combination with the <code>@langchain/langgraph-checkpoint-mongodb</code> library. This library simplifies state persistence by associating each conversation with a unique, manually assigned ID. All operations, from retrieving previous messages to saving new ones, are handled internally. You only need to provide the conversation ID you want to work with.</p>
<pre><code class="lang-tsx">const graph = new StateGraph(graphState)
  .addNode('agent', callModal)
  .addNode('tools', toolsNode)
  .addEdge(START, 'agent')
  .addConditionalEdges('agent', shouldContinue)
  .addEdge('tools', 'agent');

  const checkpointer = new MongoDBSaver({ client, dbName: database_name });

  const app = graph.compile({ checkpointer });

  /**
     * Run the graph using the user's message.
     */
    const finalState = await app.invoke(
      { messages: [new HumanMessage(query)] },
      { recursionLimit: 15, configurable: { thread_id } },
    );

  /**
   * Extract JSON payload from AI response.
   */
  function extractJsonResponse(response: any) {
    // Check the type before calling string methods on the response.
    if (typeof response === 'string') {
      const match = response.match(/```json\s*([\s\S]*?)\s*```/i);
      if (match &amp;&amp; match[1]) {
        return JSON.parse(match[1].trim());
      }
    }
    throw response;
  }

  const lastMessage = finalState.messages.at(-1) as AIMessage; // Extract the last message of the conversation
  return extractJsonResponse(lastMessage.content); //Response
</code></pre>
<p>The above code demonstrates how to initialize a checkpointer, compile a graph, and invoke the agent with an incoming prompt.</p>
<p>The <code>extractJsonResponse</code> function grabs the formatted response that we instructed the LLM to generate whenever it sends something back to the user.</p>
<p>As instructed in the system prompt, every response must include four fields: <code>message</code> (the message to the user), <code>current_order</code> (the order currently being constructed), <code>suggestions</code> (options the user can choose from), and <code>progress</code> (the order status, <code>'completed'</code> after creation).</p>
<p>Every response from the LLM should look like this:</p>
<pre><code class="lang-tsx">'```json\n' +
  '{\n' +
  '"message": "Got it! To make sure I get your order just right, can you clarify which coffee drink you\'d like? We have Latte, Cappuccino, Cold Brew, and Frappuccino. 😊",\n' +
  '"current_order": {\n' +
  '"drink": null,\n' +
  '"size": null,\n' +
  '"milk": null,\n' +
  '"syrup": null,\n' +
  '"sweeteners": null,\n' +
  '"toppings": null,\n' +
  '"quantity": null\n' +
  '},\n' +
  '"suggestions": [\n' +
  '"Latte",\n' +
  '"Cappuccino",\n' +
  '"Cold Brew",\n' +
  '"Frappuccino"\n' +
  '],\n' +
  '"progress": "incomplete"\n' +
  '}\n' +
  '```';
</code></pre>
<p>This structure allows the frontend to easily render the LLM response and track the state of the current order. This is more of a design choice and less of a convention.</p>
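<p>Because the model is only instructed to follow this format, it can be worth checking the parsed payload before handing it to the frontend. The sketch below is one possible way to do that. The <code>AgentPayload</code> type and <code>parseAgentPayload</code> function are hypothetical names, the field list is taken from the system prompt, and the sample string is invented for the example.</p>
<pre><code class="lang-typescript">type AgentPayload = {
  message: string;
  current_order: unknown;
  suggestions: unknown;
  progress: string;
};

// Build the three-backtick marker programmatically to keep the example easy to read.
const fence = '`'.repeat(3);

const parseAgentPayload = (raw: string): AgentPayload => {
  const pattern = new RegExp(fence + 'json\\s*([\\s\\S]*?)\\s*' + fence, 'i');
  const match = raw.match(pattern);
  const body = match ? match[1] : undefined;
  if (body === undefined) throw new Error('No JSON block found in the response');
  const data = JSON.parse(body);
  // Minimal shape check: the two fields the frontend relies on must be strings.
  if (typeof data.message !== 'string' || typeof data.progress !== 'string') {
    throw new Error('Payload is missing "message" or "progress"');
  }
  return data as AgentPayload;
};

const sample = `${fence}json\n{"message":"Which drink?","current_order":null,"suggestions":["Latte"],"progress":"incomplete"}\n${fence}`;
const payload = parseAgentPayload(sample);
console.log(payload.progress); // incomplete
</code></pre>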
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Building an autonomous AI agent with LangChain and LangGraph allows you to combine the reasoning power of LLMs with practical tool execution and persistent memory. By defining schemas, parsing data into human-readable formats, and orchestrating workflows through nodes, you can create intelligent agents capable of handling real-world tasks—like our Starbucks barista.</p>
<p>With MongoDB integration for state persistence, your agent can maintain context across conversations, making interactions feel more natural and human-like. This approach opens the door to building more sophisticated, domain-specific AI assistants without starting from scratch.</p>
<p>In short: <strong>define your data, teach your agent how to reason, and let LangGraph orchestrate the magic.</strong> ☕🤖</p>
<p>Source code here: <a target="_blank" href="https://github.com/DjibrilM/langgraph-starbucks-agent">https://github.com/DjibrilM/langgraph-starbucks-agent</a></p>
<h3 id="heading-resources"><strong>Resources</strong></h3>
<ul>
<li><p>LangGraph documentation: <a target="_blank" href="https://docs.langchain.com/oss/javascript/langgraph/quickstart">https://docs.langchain.com/oss/javascript/langgraph/quickstart</a></p>
</li>
<li><p>Synergizing Reasoning and Acting in Language Models: <a target="_blank" href="https://arxiv.org/abs/2210.03629">https://arxiv.org/abs/2210.03629</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
