Jibril-M🍀 - freeCodeCamp.org

How Neural Machine Translation Works: Build Your Own Translation App with React Native and QVAC

Jibril-M🍀 — Fri, 17 Jul 2026 16:51:58 +0000

Over the past 10 years, translation technology has improved massively. We went from robotic translations to systems that understand not only the meaning of each word in a sentence, but also how that word fits into the context of the full sentence.

For instance, current translation systems can tell apart the two meanings of "bank" in sentences like:

"I can't make the bank deposit today," and "We shall meet near the river bank."

Both sentences have "bank" in them, but with different meanings.

So how did we get here? This revolution started back in June of 2017, when a team of 8 Google researchers, popularly known as the "8 Samurai," released a research paper titled "Attention Is All You Need". That paper marked a turning point in modern AI systems and architecture.

For context, the architecture introduced in that paper is the bedrock of today's large language models (LLMs), including the models behind ChatGPT.

The 8 Google researchers who created the Transformer architecture

So, what is NMT, and how were Google engineers able to develop a framework that enables machines to understand the semantic meaning of each word in a sentence?

Demystifying NMT: The Brain Behind the Screen
How the Transformer Sees the World
Why This Matters
The Democratization of AI
What is QVAC?
The Architecture Supported by QVAC
The Inference Pipeline
Setting Up the Project
Complete Implementation
Conclusion
Resources and Further Reading

Demystifying NMT: The Brain Behind the Screen

To understand this breakthrough, we first have to pull back the curtain on what NMT (Neural Machine Translation) actually means.

For decades, computer translation was "rule-based." The computer was essentially given a massive bilingual dictionary and a set of grammar rules. It would translate a sentence word-by-word, swap a few positions around, and hope for the best.

This is why early translations felt so incredibly stiff and robotic: the computer was trying to solve language like a math problem.

NMT changed the game by introducing Neural Networks, computer systems inspired by the human brain. Instead of memorizing strict rules, an NMT system learns by looking at millions of existing human translations. It looks at how humans translate phrases, captures patterns, and learns how words actually interact in the real world.

But even early NMT systems had a massive flaw: they read sentences sequentially, from left to right. If a sentence was too long, the system would "forget" how it started by the time it reached the end.

This is where the Google researchers made their historic leap.

How the Transformer Sees the World

The "Attention Is All You Need" paper solved the memory problem by introducing a brand-new architecture called the Transformer. Instead of reading a sentence word-by-word, the Transformer reads the entire sentence all at once.

To do this, it splits the job into two main parts: the Encoder and the Decoder.

The Encoder (The Reader)

Think of the Encoder as a highly analytical reader. When you feed a sentence into the system, the Encoder’s job is to read it and build a "mental map" of what the sentence actually means.

It does this using a mechanism called Self-Attention. You can think of Self-Attention as a series of spotlights. When the computer looks at a specific word, it shines spotlights on all the other words in the sentence to see how they relate.

Going back to our earlier example:

"We shall meet near the river bank."

When the Encoder processes the word "bank," its Self-Attention spotlight instantly flags the word "river." Because those two words are highly connected on the AI's mental map, the system immediately knows we're talking about land next to water, not a financial institution. It locks in this "semantic meaning" before moving to the next step.
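To make the spotlight idea concrete, here is a minimal sketch of scaled dot-product attention for a single word. It is a toy, not the Transformer itself: the 2-D vectors are hand-picked rather than learned, and the embeddings are used directly as queries and keys, whereas a real model first passes them through learned projection matrices.

```typescript
// Toy scaled dot-product attention for one query word.
// The 2-D vectors are hand-picked for illustration, not learned embeddings.
type Vec = number[];

const dot = (a: Vec, b: Vec): number =>
  a.reduce((sum, x, i) => sum + x * b[i], 0);

// Softmax turns raw scores into weights that sum to 1.
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Attention weights of `query` over every key: softmax(q . k / sqrt(d)).
function attentionWeights(query: Vec, keys: Vec[]): number[] {
  const scale = Math.sqrt(query.length);
  return softmax(keys.map((k) => dot(query, k) / scale));
}

const words = ["meet", "near", "the", "river", "bank"];
// Hypothetical embeddings: dimension 0 is roughly "water/nature", dimension 1 is "everything else".
const embeddings: Vec[] = [
  [0.1, 0.9],
  [0.2, 0.5],
  [0.0, 0.1],
  [2.0, 0.1],
  [1.5, 0.3],
];

const bankIndex = words.indexOf("bank");
const weights = attentionWeights(embeddings[bankIndex], embeddings);
// Ignore the word's attention to itself and find its strongest neighbor.
const strongest = words[weights.indexOf(Math.max(...weights))];
console.log(strongest); // "river"
```

With these made-up vectors, "bank" puts most of its attention on "river", which is the behavior the spotlight analogy describes.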

The Decoder (The Writer)

Once the Encoder has mapped out the true meaning of the sentence, it hands this blueprint over to the Decoder.

The Decoder is the writer. Its only job is to translate that blueprint into the target language. But it doesn't just output a pre-written template. It builds the new sentence word-by-word, constantly looking back at the Encoder's blueprint (using a trick called Cross-Attention) to make sure it maintains the correct context, tone, and grammar.

If it's translating our river bank sentence into French, it knows to write "la rive" (the bank of the river) instead of "la banque" (the financial bank), because the Encoder's blueprint warned it ahead of time.

Why This Matters

By teaching machines to look at the whole picture rather than individual words, Google’s engineers didn't just build a better translator. They built a system that finally understands the nuances, idioms, and context of human language.

And as it turns out, if an AI can understand the context of a sentence well enough to translate it, it can also use that same context to write essays, answer complex questions, and code. The 2017 translation engine accidentally became the foundation of the entire AI era.

The Democratization of AI

In the first few years after the Transformer's invention, building with it was strictly a toy for the rich. If you wanted to implement even a simple translation feature, you had to pay Big Tech giants like Google a fortune once you went beyond their tiny free tier.

Trying to bypass their dominance was almost impossible because there were practically no resources for independent developers. Back then, just understanding the basic math of a Transformer required an academic PhD. Without a massive research department at your back, trying to build your own solution from scratch was an incredibly expensive nightmare.

Thankfully, the open-source developer community has worked tirelessly to democratize access to AI. Today, we have incredibly powerful models that anyone can download and use freely.

On top of that, the processors in our personal devices have become exceptionally capable. This hardware evolution means that sophisticated AI models can now run locally directly on your smartphone, ensuring maximum data privacy and removing the dependency on external servers.

As the saying goes, "Today it needs a full building to function, tomorrow it will fit in your pocket." Of course, I totally made that quote up 😅, but you get my point!

To put this in action, we'll build a mobile application with Expo and QVAC that translates English to French.

What is QVAC?

QVAC (QuantumVerse Automatic Computer) is a decentralized, local-first AI development platform and SDK created by Tether.

Unlike traditional AI tools that require cloud connectivity, QVAC allows users to run AI models entirely on their own devices. By keeping the computation local and offline, it ensures your data remains private, secure, and entirely under your control.

Key Concepts for On-Device Translation

To understand how QVAC runs on a mobile device, we must keep a few key concepts in mind:

1. On-Device Inference:

Running model calculations locally. Rather than relying on a single engine or cloud API, QVAC supports specialized local inference backends depending on the task.

For translation, it uses the Bergamot engine under the hood. These engines memory-map quantized model weights directly into the device's RAM and run calculations using native hardware acceleration.

2. Quantization

A mathematical optimization technique that compresses the model's weights. This makes it possible for models to fit into the memory constraints of consumer mobile hardware while keeping output quality high.
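As a rough illustration of the idea, here is a sketch of symmetric 8-bit quantization: each weight is stored as a one-byte integer plus a single shared scale factor. This is a generic textbook scheme chosen for clarity. I'm not claiming it is the exact scheme the Bergamot models use, and the function names are my own.

```typescript
// Symmetric 8-bit quantization: store int8 values plus one float scale.
function quantize(weights: number[]): { q: Int8Array; scale: number } {
  const maxAbs = Math.max(...weights.map((w) => Math.abs(w)));
  // Map the largest magnitude to 127; guard against an all-zero tensor.
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;
  const q = Int8Array.from(weights, (w) => Math.round(w / scale));
  return { q, scale };
}

// Recover approximate float weights at inference time.
function dequantize(q: Int8Array, scale: number): number[] {
  return Array.from(q, (v) => v * scale);
}

const original = [0.12, -0.5, 0.33, 0.9, -0.07];
const { q, scale } = quantize(original);
const restored = dequantize(q, scale);

// Rounding error per weight is at most half a quantization step.
const maxError = Math.max(...original.map((w, i) => Math.abs(w - restored[i])));
console.log(q.byteLength, "bytes instead of", original.length * 4);
console.log("max error:", maxError, "step:", scale);
```

The five weights shrink from 20 bytes (as 32-bit floats) to 5 bytes, and each restored value is off by no more than half a step.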

The Architecture Supported by QVAC

Before writing code, it's crucial to understand what's actually happening under the hood. To handle local execution without melting your device, the QVAC SDK manages the hardware binding and model lifecycle while hooking into optimized inference backends.

For translation, QVAC utilizes the Bergamot engine. Originally developed as part of the Bergamot project (which powers Firefox's offline translation), this engine is highly optimized for fast, accurate Neural Machine Translation (NMT) on consumer hardware.

At its core, the Bergamot engine takes a source sentence, processes it through its Encoder-Decoder transformer architecture, and predicts the target language tokens in a highly efficient manner.

Understanding Language Pairs

It's important to understand the mechanics of how these models are trained. Translation models like the ones used by Bergamot are strictly unidirectional language pairs. This means the BERGAMOT_EN_FR model is designed exclusively to translate from English to French. It can't reverse the process.

If you want to translate French back to English, you would need to download and load a completely separate model trained specifically for that direction.

If a model is trained to be bidirectional (English ↔ French) or multilingual (translating dozens of languages like large language models do), it has to store mathematical representations, vocabulary, and grammar rules for multiple linguistic directions inside a single neural network. This balloons the parameter count, making the file size massive and requiring heavy RAM and compute power to process.

By isolating the task to a single direction (for example BERGAMOT_EN_FR), the model only needs the neural network to "understand" English inputs and "generate" French outputs. It doesn't need the capacity to generate English text.

This extreme specialization is exactly how Bergamot shrinks the model weights down to those incredibly tiny 15–35MB files that can run quickly on a local CPU without freezing your browser or app.
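In app code, this one-model-per-direction rule can be sketched as a lookup keyed by direction. The `installedModels` registry, the `"en->fr"` key format, and the `modelFor` helper are all hypothetical names for illustration and are not part of the QVAC SDK.

```typescript
type Lang = "en" | "fr";

// Hypothetical registry: one entry per direction, each backed by a separate model file.
const installedModels: Record<string, string> = {
  "en->fr": "BERGAMOT_EN_FR",
};

// Look up the model for a direction; the reverse direction is NOT implied.
function modelFor(from: Lang, to: Lang): string {
  const id = installedModels[`${from}->${to}`];
  if (!id) {
    throw new Error(`No model installed for ${from} -> ${to}`);
  }
  return id;
}

console.log(modelFor("en", "fr")); // "BERGAMOT_EN_FR"
```

Calling `modelFor("fr", "en")` throws, because the English-to-French model says nothing about the reverse direction.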

The Inference Pipeline

To visualize how we interact with the translation engine in our codebase, think of local translation as running a dedicated interpreter right in your phone's memory:

Hiring the interpreter (loading the model): We map the compressed model file (in this case, the BERGAMOT_EN_FR English-to-French model) directly into the device's RAM.
Handing over the script (text input): We pass the source text to the loaded engine.
The performance (inference): The engine reads the text and mathematically predicts the translated tokens, providing the translated result once the process is complete.
Closing the show (unloading): Because neural network models are memory-intensive, the model can be cleared from RAM to free up resources once the translation is complete or when the user leaves the screen.
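The four steps above can be sketched as a single helper that always unloads the model, even when translation fails. The `Engine` interface and `translateOnce` function are hypothetical stand-ins so the sketch runs without the SDK. The real QVAC calls appear in the full implementation later in this article.

```typescript
// Hypothetical engine interface standing in for the real SDK calls.
interface Engine {
  load(): Promise<string>;
  translate(modelId: string, text: string): Promise<string>;
  unload(modelId: string): Promise<void>;
}

// Load, translate, and always unload, even if translation throws.
async function translateOnce(engine: Engine, text: string): Promise<string> {
  const modelId = await engine.load();
  try {
    return await engine.translate(modelId, text);
  } finally {
    await engine.unload(modelId);
  }
}

// A fake in-memory engine that records the order of calls.
const calls: string[] = [];
const fakeEngine: Engine = {
  load: async () => {
    calls.push("load");
    return "model-1";
  },
  translate: async (_id, text) => {
    calls.push("translate");
    return `[fr] ${text}`;
  },
  unload: async () => {
    calls.push("unload");
  },
};

translateOnce(fakeEngine, "hello").then((out) => {
  console.log(out, calls.join(" > "));
});
```

Putting the unload step in a `finally` block is a design choice: it keeps a failed translation from leaving the model sitting in RAM.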

Setting Up the Project

To ensure this guide is completely self-contained, let's start by quickly generating our new Expo application and installing the QVAC SDK. Open your terminal and run the following commands:

npx create-expo-app translator-app --template blank-typescript
cd translator-app
npm install @qvac/sdk jiti

Next, you need to add the following peer dependencies to your package.json for QVAC to work correctly. Add these lines to their respective sections:

  "dependencies": {
    "bare-rpc": "^1.0.0",
    "react-native-bare-kit": "^0.11.5"
  },
  "devDependencies": {
    "bare-pack": "^1.5.1"
  }

Once added, install the dependencies by running:

npm install
npx expo install expo-file-system expo-build-properties expo-device

Configuring the Expo Plugin with JITI

Next, we need to add the QVAC SDK plugin to our Expo project. Because the QVAC SDK's Expo plugin is distributed as a modern ECMAScript Module (ESM), but Expo's configuration file (app.config.js) runs in a standard Node.js CommonJS environment, we can't use a standard require().

This is why we installed jiti. It acts as a bridge, allowing us to synchronously load ESM modules inside CommonJS files without breaking the build process.

Create or update your app.config.js file at the root of your project and configure it like this:

const createJiti = require("jiti");
const jiti = createJiti(__filename);

// Synchronously require the ESM module using jiti
const qvacModule = jiti("@qvac/sdk/expo-plugin");
const withQvacSDK = qvacModule.withQvacSDK || qvacModule.default;

// (Include your withEscapeBundleShellScript helper if needed)

module.exports = ({ config }) => {
  config.plugins = [
    [
      "expo-build-properties",
      {
        android: { minSdkVersion: 29 },
      },
    ],
    withQvacSDK,
    "expo-router",
    [
      "expo-splash-screen",
      {
        backgroundColor: "#208AEF",
      },
    ],
    // withEscapeBundleShellScript, // Uncomment only if you define this custom helper above
  ];

  return config;
};

This configuration applies the QVAC native setup scripts and ensures Android requires at least SDK version 29 (which is necessary for the native libraries).

With our base configuration ready to go, let's jump straight into the translation code.

Complete Implementation

Let's bring it all together. We'll implement an interface that takes English text, manages the downloading and loading states for the Bergamot engine, translates the text to French, and renders the output to the screen.

Replace your entry app file src/app/index.tsx with the following implementation:

import { View, ScrollView, TextInput, Text, TouchableOpacity, StyleSheet } from "react-native";
import { useState, useEffect } from "react";
import {
  loadModel,
  translate,
  unloadModel,
  BERGAMOT_EN_FR,
  getModelInfo,
} from "@qvac/sdk";
import { Stack } from "expo-router";

type TranslationStatus =
  | "Idle"
  | "Checking model..."
  | "Downloading model..."
  | "Model downloaded successfully."
  | "Loading model..."
  | "Translating..."
  | "Streaming translation..."
  | "Translation finished."
  | `Error: ${string}`;

export default function HomeScreen() {
  const [status, setStatus] = useState<TranslationStatus>("Checking model...");
  const [translatedText, setTranslatedText] = useState("");
  const [inputText, setInputText] = useState("");
  const [isTranslating, setIsTranslating] = useState(false);
  const [isDownloaded, setIsDownloaded] = useState<boolean | null>(null);
  const [downloadProgressStr, setDownloadProgressStr] = useState("");

  useEffect(() => {
    const checkModelStatus = async () => {
      try {
        const model = await getModelInfo({ name: BERGAMOT_EN_FR.name });
        setIsDownloaded(model.isCached);
        console.log("Model", model);
        setStatus("Idle");
      } catch (error) {
        console.error("Error checking model:", error);
        setStatus("Error: Failed to check model status");
      }
    };
    checkModelStatus();
  }, []);

  const handleDownload = async () => {
    try {
      setIsTranslating(true);
      setStatus("Downloading model...");
      setDownloadProgressStr("");

      const modelId = await loadModel({
        modelSrc: BERGAMOT_EN_FR,
        modelType: "nmt",
        onProgress: (progress: any) => {
          let pct = progress.percentage;
          let dl = progress.downloaded;
          let tot = progress.total;
          if (progress.shardInfo) {
            pct = progress.shardInfo.overallPercentage;
            dl = progress.shardInfo.overallDownloaded;
            tot = progress.shardInfo.overallTotal;
          }
          const formatBytes = (bytes: number) => {
            if (bytes === 0) return "0 B";
            const k = 1024;
            const sizes = ["B", "KB", "MB", "GB"];
            const i = Math.floor(Math.log(bytes) / Math.log(k));
            return (
              parseFloat((bytes / Math.pow(k, i)).toFixed(2)) + " " + sizes[i]
            );
          };
          setDownloadProgressStr(
            `${pct.toFixed(2)}% (${formatBytes(dl)} / ${formatBytes(tot)})`,
          );
        },
        modelConfig: {
          engine: "Bergamot",
          from: "en",
          to: "fr",
          beamsize: 1,
          normalize: 1,
          temperature: 0.2,
          norepeatngramsize: 3,
          lengthpenalty: 1.2,
        },
      });

      await unloadModel({ modelId, clearStorage: false });

      setIsDownloaded(true);
      setStatus("Model downloaded successfully.");
    } catch (error: any) {
      console.error(error);
      setStatus(`Error: ${error.message}`);
    } finally {
      setIsTranslating(false);
      setDownloadProgressStr("");
    }
  };

  const handleTranslate = async () => {
    if (!inputText.trim()) {
      setStatus("Error: Please enter text to translate");
      return;
    }

    try {
      setIsTranslating(true);
      setTranslatedText("");
      setStatus("Loading model...");

      const modelId = await loadModel({
        modelSrc: BERGAMOT_EN_FR,
        modelType: "nmt",

        modelConfig: {
          engine: "Bergamot",
          from: "en",
          to: "fr",
          beamsize: 1,
          normalize: 1,
          temperature: 0.2,
          norepeatngramsize: 3,
          lengthpenalty: 1.2,
        },
      });

      setStatus(`Translating...`);

      const result = translate({
        modelId,
        text: inputText,
        modelType: "nmt",
        stream: false,
      });

      const text = await result.text;
      setTranslatedText(text);

      const stats = await result.stats;
      if (stats) {
        console.log(`▸ Processing stats:`, stats);
      }

      setStatus("Translation finished.");

      await unloadModel({ modelId, clearStorage: false });
    } catch (error: any) {
      console.error(error);
      setStatus(`Error: ${error.message}`);
    } finally {
      setIsTranslating(false);
    }
  };

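  return (
    <>
      {/* Markup reconstructed from the imports, handlers, and style names; props such as the screen title are assumptions. */}
      <Stack.Screen options={{ title: "Translator" }} />
      <ScrollView contentContainerStyle={styles.scrollContainer}>
        <View style={styles.card}>
          <View style={styles.header}>
            <Text style={styles.title}>
              English to French Translator
            </Text>
            <Text style={styles.subtitle}>
              Enter text to translate:
            </Text>
          </View>

          <View style={styles.content}>
            <TextInput
              style={[styles.input, isTranslating && styles.disabledText]}
              value={inputText}
              onChangeText={setInputText}
              editable={!isTranslating}
              multiline
            />

            <Text style={styles.statusText}>
              Status: {status}
              {downloadProgressStr ? `\n${downloadProgressStr}` : ""}
            </Text>

            {isDownloaded === null ? (
              <TouchableOpacity style={[styles.button, styles.buttonDisabled]} disabled>
                <Text style={styles.buttonText}>
                  Loading...
                </Text>
              </TouchableOpacity>
            ) : isDownloaded ? (
              <TouchableOpacity
                style={[styles.button, isTranslating && styles.buttonDisabled]}
                onPress={handleTranslate}
                disabled={isTranslating}
              >
                <Text style={styles.buttonText}>
                  {isTranslating ? "Translating..." : "Translate"}
                </Text>
              </TouchableOpacity>
            ) : (
              <TouchableOpacity
                style={[styles.button, isTranslating && styles.buttonDisabled]}
                onPress={handleDownload}
                disabled={isTranslating}
              >
                <Text style={styles.buttonText}>
                  {isTranslating ? "Downloading..." : "Download Model"}
                </Text>
              </TouchableOpacity>
            )}

            <View style={styles.outputContainer}>
              <Text style={styles.outputText}>
                {translatedText || "Translation will appear here..."}
              </Text>
            </View>
          </View>
        </View>
      </ScrollView>
    </>
  );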
}

const styles = StyleSheet.create({
  scrollContainer: {
    flexGrow: 1,
    paddingHorizontal: 16,
    paddingTop: 16,
    paddingBottom: 24,
    backgroundColor: "#f9fafb",
  },
  card: {
    backgroundColor: "#ffffff",
    maxWidth: 450,
    width: "100%",
    alignSelf: "center",
    borderRadius: 12,
    padding: 16,
  },
  header: {
    marginBottom: 16,
  },
  title: {
    textAlign: "center",
    fontSize: 24,
    fontWeight: "bold",
    color: "#111827",
  },
  subtitle: {
    textAlign: "center",
    marginTop: 4,
    fontSize: 16,
    color: "#6b7280",
  },
  content: {
    gap: 24,
  },
  input: {
    borderWidth: 1,
    borderColor: "#e5e7eb",
    backgroundColor: "#ffffff",
    color: "#111827",
    padding: 12,
    borderRadius: 8,
    minHeight: 100,
    textAlignVertical: "top",
  },
  disabledText: {
    opacity: 0.5,
  },
  statusText: {
    fontSize: 14,
    color: "#3b82f6",
    fontWeight: "bold",
    textAlign: "center",
    marginTop: 12,
    marginBottom: 12,
  },
  button: {
    width: "100%",
    height: 48,
    borderRadius: 12,
    backgroundColor: "#3b82f6",
    alignItems: "center",
    justifyContent: "center",
  },
  buttonDisabled: {
    opacity: 0.5,
  },
  buttonText: {
    fontWeight: "600",
    fontSize: 18,
    color: "#ffffff",
  },
  outputContainer: {
    marginTop: 16,
    padding: 16,
    backgroundColor: "#f3f4f6",
    borderRadius: 8,
    minHeight: 100,
  },
  outputText: {
    fontSize: 16,
    color: "#111827",
  },
});

Here is a translation example from the application.

Input (English)

The location I told you was near the river bank

Output (French)

L'endroit où je vous ai dit était près de la rive de la rivière

Codebase Breakdown

Let’s lift the hood on how this local translation implementation manages the native model lifecycle and returns the translated text.

1. Managing the Native Lifecycle

Loading neural network weights for translation is computationally expensive. When the QVAC runtime initializes a model, it must read parameters from the local disk and copy the active weights into device RAM.

To handle this efficiently, we first check whether the model is cached, meaning it has already been downloaded to the user's disk:

const model = await getModelInfo({ name: BERGAMOT_EN_FR.name });
setIsDownloaded(model.isCached);

The loadModel function will automatically handle downloading the model from the Hugging Face hub if it hasn't been cached locally yet. Once the file is available locally, it directly memory-maps the weights.
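While the download runs, the `onProgress` callback in `handleDownload` builds a status string. That logic can be pulled out into pure helpers, which makes it easy to test without a device. The sketch below mirrors the fields the callback reads (`percentage`, `downloaded`, `total`, and the optional `shardInfo`). The `Progress` and `ShardInfo` type names and the `formatProgress` helper are my own, not SDK exports.

```typescript
// Shapes mirror the fields read in the onProgress callback;
// the type names themselves are hypothetical.
interface ShardInfo {
  overallPercentage: number;
  overallDownloaded: number;
  overallTotal: number;
}
interface Progress {
  percentage: number;
  downloaded: number;
  total: number;
  shardInfo?: ShardInfo;
}

// Human-readable byte count, e.g. 1536 -> "1.5 KB".
function formatBytes(bytes: number): string {
  if (bytes <= 0) return "0 B";
  const k = 1024;
  const sizes = ["B", "KB", "MB", "GB"];
  // Clamp the index so very large values still use the last unit.
  const i = Math.min(Math.floor(Math.log(bytes) / Math.log(k)), sizes.length - 1);
  return `${parseFloat((bytes / Math.pow(k, i)).toFixed(2))} ${sizes[i]}`;
}

// Prefer the overall shard totals when the model is split into shards.
function formatProgress(p: Progress): string {
  const pct = p.shardInfo?.overallPercentage ?? p.percentage;
  const dl = p.shardInfo?.overallDownloaded ?? p.downloaded;
  const tot = p.shardInfo?.overallTotal ?? p.total;
  return `${pct.toFixed(2)}% (${formatBytes(dl)} / ${formatBytes(tot)})`;
}

console.log(formatProgress({ percentage: 25, downloaded: 1536, total: 6144 }));
// "25.00% (1.5 KB / 6 KB)"
```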

2. Translating the Text

Once the model is loaded, we can pass our text to the translation engine:

const result = translate({
  modelId,
  text: inputText,
  modelType: "nmt",
  stream: false,
});

const text = await result.text;
setTranslatedText(text);

This waits for the full translation to complete before displaying the final result to the user.

3. Unloading the Model

After the translation is complete, we explicitly destroy the model via unloadModel:

await unloadModel({ modelId, clearStorage: false });

By unloading the model, we ensure that the device's RAM is freed up for other processes. Because the model is already downloaded and cached on the disk (and we explicitly set clearStorage: false), the next translation skips the download entirely and only has to load the weights from local storage.

Conclusion

Transitioning translation from the cloud to on-device hardware offers a practical approach for mobile application developers. Running model inference locally eliminates reliance on remote internet connectivity, removes recurring API usage costs, and ensures that user text inputs never leave the physical device.

Integrating local translation can be highly beneficial for travel apps, secure communication tools, or educational platforms. As edge processors gain dedicated hardware acceleration cores and open-source models become even more efficient through quantization research, local-first architectures present a compelling alternative for developers prioritizing privacy, offline resilience, and predictable cost structures.

Resources and Further Reading

To dive deeper into local Neural Machine Translation, inspect the source code, or explore advanced configurations for your mobile applications, check out the following resources:

QVAC Translation Docs: Official documentation for integrating local translation capabilities with QVAC.
QVAC Expo Integration Docs: Learn more about configuring custom local models in Expo.
Bergamot Project: Learn more about the underlying Neural Machine Translation engine.
Attention Is All You Need: The original 2017 Google research paper that introduced the Transformer architecture.
Full Code Example: Full code example's repository.

How to Build an Offline AI Image Generator in Node.js with QVAC and Socket.io

Jibril-M🍀 — Thu, 18 Jun 2026 16:05:23 +0000

A few years ago, the first day I finally got access to an AI image generator, I was so excited that I immediately sat down and wrote an article about it (using Node.js and OpenAI's DALL-E). The magic of turning thoughts directly into digital pixels felt like holding a real-life magic wand.

But back then, accessing these models wasn't a walk in the park. Our primary option was Midjourney, which meant you had to struggle on Discord, and sometimes you couldn't do anything due to rate limits and servers being very busy.

Accessing image generation back then felt like trying to order a coffee during a flash mob.

Thankfully, the landscape has completely shifted. Today, not only can we run state-of-the-art models like Stable Diffusion on consumer hardware, but we can do it locally, offline, and completely free of charge. We don't need any API keys, there aren't any subscription rate limits, and there are no Discord channels to deal with.

In this tutorial, we'll build a local web application using Node.js, Express, Socket.io, and the QVAC SDK to run a quantized Stable Diffusion 2.1 model.

Prerequisites
What is QVAC?
How Stable Diffusion Works Under the Hood
GPU Limitations: Metal, AMD, and the Intel Mac Trap
The Image Generation Pipeline
Complete Implementation
Codebase Breakdown
Conclusion
Resources and Further Reading

Prerequisites

To get the most out of this tutorial, you should have a solid foundation in web backend and frontend basics:

Node.js and ES Modules: Basic familiarity with modern JavaScript modules (import/export), async loops, and event listeners.
Express and WebSockets: Familiarity with routing static files and sending real-time messages over WebSockets with socket.io.
HTML and Vanilla CSS: Understanding of basic DOM manipulation and style bindings.
Development environment: A local machine with Node.js installed.

What is QVAC?

Developed by Tether, QVAC is a family of local inference tools designed to execute machine learning models directly on client hardware.

Instead of routing inference requests to expensive cloud-hosted APIs (such as DALL-E or Midjourney), QVAC bundles pre-compiled machine learning runtimes (like llama.cpp for text, whisper.cpp for transcription, and custom diffusion backends) directly into Node.js, mobile, and desktop runtimes.

Running local AI models with QVAC offers several practical advantages:

Zero API costs: Generate as many images as your hardware can handle without recurring costs.
Privacy-first: Prompts and generated images are kept entirely in memory on your local machine.
Offline independence: Run your application in isolated networks, on flights, or in regions without internet access.

How Stable Diffusion Works Under the Hood

To execute image generation locally without running out of RAM, QVAC leverages a quantized Stable Diffusion 2.1 GGUF model (SD_V2_1_1B_Q8_0).

But how does this actual image generation process work conceptually? Let's make one thing clear: this is not a scientific paper. We aren't going to dive into the underlying multivariable calculus, probability distributions, or stochastic differential equations because I'm not a low-level machine learning researcher (and let's be honest, neither of us wants to stare at Greek symbols and linear algebra formulas on a screen when we could be writing clean JavaScript).

Instead, let's understand how these models work conceptually, using some intuitive developer analogies.

The World-Class Sculptor Analogy

At its core, modern AI image generation turns randomness into reality. Instead of "painting" an image from scratch, pixel-by-pixel, like a human illustrator with a brush, the AI essentially acts as a world-class sculptor, carving an image out of a block of digital static.

The most dominant technology behind this today is Diffusion, which powers models like Stable Diffusion, Midjourney, and Google's Imagen series.

Here is the conceptual step-by-step breakdown of how this block of static turns into art:

1. The Training Phase (Learning the Patterns)

Before a model can generate anything, it has to look at billions of images and their corresponding text descriptions. During this phase, developers do something counterintuitive: they intentionally ruin the images.

Adding noise: The system takes a clear picture (for example, of a cat) and gradually adds random digital static (noise) pixel-by-pixel until the original image is completely unrecognizable.
Learning to reverse it: The AI's job is to look at a noisy image and predict exactly how much noise was added at that specific step. By doing this billions of times, it becomes an expert at denoising – that is, turning chaos back into order.
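The noising step can be sketched in a few lines. This follows the standard forward-diffusion blend, where a `keep` value between 0 and 1 controls how much of the clean image survives. The 4-value "image", the noise sample, and the schedule are all made up for illustration.

```typescript
// Forward diffusion for one training example: blend the clean image with noise.
// `keep` is the fraction of signal variance retained (1 = clean, 0 = pure noise).
function addNoise(pixels: number[], noise: number[], keep: number): number[] {
  const signal = Math.sqrt(keep);
  const sigma = Math.sqrt(1 - keep);
  return pixels.map((p, i) => signal * p + sigma * noise[i]);
}

// Toy 4-value "image" and a fixed noise sample (both made up for illustration).
const image = [0.8, 0.2, -0.4, 0.6];
const noise = [0.5, -1.2, 0.3, 0.9];

// A hypothetical schedule: less of the image survives at each later step.
const schedule = [1.0, 0.7, 0.3, 0.0];
const noisySteps = schedule.map((keep) => addNoise(image, noise, keep));

// noisySteps[0] is the clean image; the last entry is pure noise.
// During training, the network sees one noisy step and is asked to predict `noise`.
console.log(noisySteps);
```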

2. Connecting Words to Visuals (CLIP)

To make sure the AI knows what a "cat wearing a top hat" looks like, it uses a text-to-image bridge, often powered by a system called CLIP (Contrastive Language-Image Pre-training).

CLIP translates human language into a mathematical map (called an embedding).
In this map, the word "cat" and an actual image of a cat sit very close together. This ensures that when you type a prompt, the AI knows which visual concepts to pull from its memory.
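"Sitting close together" usually means a high cosine similarity between two embedding vectors. Here is a minimal sketch with hand-picked 3-D vectors. Real CLIP embeddings have hundreds of dimensions and come from trained encoders, so treat the numbers as placeholders.

```typescript
// Cosine similarity: 1 means same direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Hypothetical embeddings (not produced by a real model).
const textCat = [0.9, 0.1, 0.2]; // the prompt "cat"
const imageCat = [0.8, 0.2, 0.1]; // a photo of a cat
const imageCar = [0.1, 0.9, 0.3]; // a photo of a car

const catScore = cosineSimilarity(textCat, imageCat);
const carScore = cosineSimilarity(textCat, imageCar);
console.log(catScore > carScore); // true
```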

3. The Generation Phase (The Reverse Diffusion Loop)

When you type a prompt and hit "Generate," the magic happens in reverse:

The blank canvas: The AI starts with a canvas of pure, 100% random digital noise (it looks like old television static).
The prompt guidance: The AI looks at your prompt and uses its text embedding to guide its eye. It looks at the random static and asks, "Where in this mess can I start to see a cat?"
Step-by-step denoising: The AI subtracts a little bit of noise, sharpening the image slightly. It repeats this loop 20 to 50 times. With every step, fuzzy shapes turn into rough outlines, textures appear, and eventually, a crisp, clean, brand-new image emerges.

Fun fact about seeds: Because the process starts from fresh random static every single time, typing the exact same prompt twice will give you a different image each time (unless you lock down the starting randomness using a specific number called a Seed).
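Here is a small sketch of why a seed makes generation repeatable: a seeded pseudo-random generator always produces the same "static" for the same seed. The generator below is a compact mulberry32-style function, and the noise is uniform in [-1, 1) for simplicity. Real diffusion pipelines draw Gaussian noise, and `startingNoise` is a hypothetical helper, not an SDK call.

```typescript
// Small seeded PRNG (mulberry32 style): same seed, same sequence.
function seededRandom(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// The "blank canvas" of static, reproducible from the seed alone.
function startingNoise(seed: number, size: number): number[] {
  const rand = seededRandom(seed);
  return Array.from({ length: size }, () => rand() * 2 - 1);
}

const first = startingNoise(42, 4);
const second = startingNoise(42, 4); // same seed: identical static
const other = startingNoise(7, 4); // different seed: different static

console.log(JSON.stringify(first) === JSON.stringify(second)); // true
console.log(JSON.stringify(first) === JSON.stringify(other)); // false
```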

Here's an illustration of denoising with diffusion models:

Latent Diffusion: Keeping it Fast (The VAE)

Generating high-resolution images pixel-by-pixel requires massive computing power. If we tried to do this directly in pixel space on consumer hardware, our computers would melt, and a single generation would take hours.

To fix this, modern models use Latent Diffusion.

Instead of working with the full-sized image, the model works in a smaller, abstract mathematical space (the "latent space"). During training, a component called an encoder compresses each image into this space. Think of it as a shrunken playground where all the noising and denoising math happens. Because this playground is so small, the computations are much faster.

Once the denoising loop finishes in the latent space, another component called the decoder (the decoding half of a Variational Autoencoder, or VAE) blows the result back up into a sharp, full-resolution image for you to see.
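A quick back-of-the-envelope calculation shows how much smaller the playground is. It assumes a 512×512 RGB output and the layout used by the Stable Diffusion 1.x/2.x VAE (8× downsampling in each direction, 4 latent channels). Check your specific model's documentation, since other architectures differ.

```typescript
// Number of values the denoiser has to process for a given tensor shape.
const valueCount = (width: number, height: number, channels: number): number =>
  width * height * channels;

// Assumed shapes: 512x512 RGB image, 8x downsampled latent with 4 channels.
const pixelValues = valueCount(512, 512, 3);
const latentValues = valueCount(512 / 8, 512 / 8, 4);
const reduction = pixelValues / latentValues;

console.log(pixelValues, latentValues, reduction); // 786432 16384 48
```

Under those assumptions, each denoising step touches 48 times fewer values than it would in pixel space.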

Architectures Supported by QVAC

When you run local inference with QVAC, the SDK hooks into optimized, community-maintained C++ backends. QVAC manages the hardware bindings and model lifecycles for different AI modalities:

Text generation (llama.cpp): Used for large language models (LLMs) like Llama 3 or Mistral, executing auto-regressive token prediction.
Audio transcription (whisper.cpp): Used for highly optimized speech-to-text transcription.
Image Generation (stable-diffusion.cpp / sdcpp-generation): Our focus in this tutorial. QVAC supports two distinct approaches for image generation depending on the model architecture you choose:
- The Bundled Model Approach (Stable Diffusion 1.5/2.1/XL): The traditional approach where the entire pipeline (Text Encoders, VAE, and the main Diffusion UNet) is baked into a single, unified GGUF file (for example, SD_V2_1_1B_Q8_0).
  
  This is incredibly convenient for local deployments because you only need to manage and load one file to start generating images.
- The Modular Multi-Model Approach (Flux): Modern architectures like FLUX.1 use a much more complex setup. Instead of a single file, Flux splits its computational brain into separate components. You load a core Diffusion Transformer (DiT) model, but you must also separately load large text encoders (like T5-v1.1-xxl and CLIP-L) and an independent VAE model.
  
  While this requires more complex orchestration to load multiple GGUF files simultaneously, it provides vastly superior prompt adherence and photorealism by utilizing dedicated, massive text-understanding models.
Speech synthesis (TTS): Specialized architectures like Chatterbox (transformer-based zero-shot voice cloning) and Supertonic (diffusion-based speech denoising).

GPU Limitations: Metal, AMD, and the Intel Mac Trap

When you run machine learning models locally on a Mac, QVAC tries to accelerate execution automatically by compiling compute pipelines for Apple's Metal API so that the work runs on the system's GPU.

If you're on an Apple Silicon Mac (M1, M2, M3, M4, or M5 chip), this works seamlessly: the Metal pipelines compile in seconds and run on the integrated GPU, which shares unified memory with the CPU.

But if you're running on an older Intel-based Mac with a discrete AMD Radeon GPU (such as the AMD Radeon Pro 5500M commonly found in 16-inch MacBook Pros), you'll run into a major driver-level limitation:

The macOS Metal driver for older AMD discrete GPUs doesn't support the modern machine learning compute shaders and matrix reduction operators used by stable-diffusion.cpp.
When the inference worker attempts to run these unsupported operations, the driver fails to compile the pipeline and triggers a hard C++ crash (SIGABRT) inside the ggml-metal-ops.cpp shader encoder, abruptly exiting the background worker process.

If you hit this hardware roadblock, the default GPU configuration will crash the application every time you trigger an image generation.

To resolve this, you should configure the model to run on the CPU instead by setting the model configuration parameter device to "cpu" and specifying the threads (for example, threads: 4). While generating images on the CPU takes longer than on a GPU, it runs successfully on any machine, regardless of how old or limited its GPU is.
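As a minimal sketch, the configuration described above can be assembled like this. The device and threads fields and the prediction: "v" setting come from the server code later in this tutorial. The helper name buildModelConfig is hypothetical and is not part of the QVAC SDK.

```javascript
// Hypothetical helper (not part of @qvac/sdk): builds the modelConfig object
// that is later passed to loadModel().
function buildModelConfig(preferredDevice, cpuThreads = 4) {
  const config = { prediction: "v" }; // same setting used in server.js in this tutorial
  if (preferredDevice) {
    config.device = preferredDevice;
    if (preferredDevice === "cpu") {
      config.threads = cpuThreads; // thread count is only set for CPU execution
    }
  }
  return config;
}

console.log(buildModelConfig("cpu")); // { prediction: 'v', device: 'cpu', threads: 4 }
console.log(buildModelConfig(null)); // { prediction: 'v' }
```

Leaving device unset keeps the SDK's default behavior, which is what the server does until a crash forces the CPU preference.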

The Image Generation Pipeline

To coordinate the local execution lifecycle, our app sets up a real-time event pipeline:

[Browser Client]                                  [Node.js Server]
       |                                                 |
       | ------ 1. Connects & Checks Model --------->    |
       | <----- 2. Downloads & Loads Model ----------    | (Model cached locally)
       |                                                 |
       | ------ 3. Submits prompt ("Cozy cabin...") ->   |
       |                                                 |
       |                                                 | === [ QVAC Inference Engine ] ===
       |                                                 | 
       | <----- 4. Denoising Step Updates (e.g. 5/20) -- | (Streams steps in real time)
       |                                                 |
       | <----- 5. Sends final image (Base64 DataURL) -- | (Direct in-memory payload)
       |                                                 |

Complete Implementation

Let's look at the implementation. You can clone the full project repository to follow along, or build it from scratch by creating a project folder, running npm init -y, installing the dependencies (@qvac/sdk, express, socket.io, concurrently), and configuring "type": "module" in your package.json.

1. Server Configuration (`server.js`)

Create a file named server.js and paste the following implementation:

import express from 'express';
import path from 'path';
import http from 'http';
import { Server } from 'socket.io';
import fs from 'fs';
import { fileURLToPath } from 'url';
import { loadModel, unloadModel, getLoadedModelInfo, diffusion, SD_V2_1_1B_Q8_0 } from "@qvac/sdk";

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

const app = express();
const server = http.createServer(app);
const io = new Server(server);

const PORT = process.env.PORT || 3000;

app.use(express.json());
app.use(express.static(path.join(__dirname, 'public')));

const CONFIG_PATH = path.join(__dirname, '.device-preference.json');

function getPreferredDevice() {
  try {
    if (fs.existsSync(CONFIG_PATH)) {
      const data = JSON.parse(fs.readFileSync(CONFIG_PATH, 'utf8'));
      return data.device || null;
    }
  } catch (err) {
    console.error('Failed to read device preference:', err.message);
  }
  return null;
}

function setPreferredDevice(device) {
  try {
    fs.writeFileSync(CONFIG_PATH, JSON.stringify({ device }), 'utf8');
  } catch (err) {
    console.error('Failed to write device preference:', err.message);
  }
}

// Global model state
let loadedModelId = process.modelId || null;
let modelLoadPercent = 0;
let modelLoadStatus = 'Awaiting trigger...';
let isModelLoading = false;

const modelSize = (SD_V2_1_1B_Q8_0.expectedSize / (1024 * 1024 * 1024)).toFixed(2) + ' GB';

function broadcastModelProgress(percent, status) {
  io.emit('model-download-progress', { percent, status, size: modelSize });
}

io.on('connection', (socket) => {
  console.log('Client connected:', socket.id);

  socket.on('disconnect', () => {
    console.log('Client disconnected:', socket.id);
  });

  // Trigger model download
  socket.on('trigger-model-download', async () => {
    // If already loaded, verify it's still alive in the worker
    if (loadedModelId) {
      try {
        await getLoadedModelInfo({ modelId: loadedModelId });
        socket.emit('model-download-progress', {
          percent: 100,
          status: 'Model fully loaded locally.',
          size: modelSize
        });
        return;
      } catch (err) {
        console.log('Model ID was stale/not found, resetting state and reloading...', err.message);
        loadedModelId = null;
        process.modelId = null;
      }
    }

    // If currently loading, report current progress
    if (isModelLoading) {
      socket.emit('model-download-progress', {
        percent: Math.round(modelLoadPercent),
        status: modelLoadStatus,
        size: modelSize
      });
      return;
    }

    isModelLoading = true;
    modelLoadPercent = 0;
    modelLoadStatus = 'Initiating model download...';
    broadcastModelProgress(modelLoadPercent, modelLoadStatus);

    try {
      console.log('Starting model download...');
      const preferredDevice = getPreferredDevice();
      const loadConfig = { prediction: "v" };
      if (preferredDevice) {
        loadConfig.device = preferredDevice;
        if (preferredDevice === 'cpu') {
          loadConfig.threads = 4;
        }
        console.log(`Using cached device preference: ${preferredDevice}`);
      }

      loadedModelId = await loadModel({
        modelSrc: SD_V2_1_1B_Q8_0,
        modelType: "sdcpp-generation",
        modelConfig: loadConfig,
        onProgress: (p) => {
          modelLoadPercent = p.percentage;
          modelLoadStatus = p.percentage >= 100 ? 'Model fully loaded locally.' : `Downloading model weights... (${p.percentage.toFixed(1)}%)`;
          broadcastModelProgress(Math.round(modelLoadPercent), modelLoadStatus);
        }
      });
      process.modelId = loadedModelId;

      isModelLoading = false;
      console.log('Model loaded successfully. ID:', loadedModelId);
    } catch (err) {
      isModelLoading = false;
      modelLoadPercent = 0;
      modelLoadStatus = 'Failed to load model: ' + err.message;
      console.error('Failed to load model:', err);
      broadcastModelProgress(0, modelLoadStatus);
      socket.emit('error_event', { message: 'Failed to load model: ' + err.message });
    }
  });

  socket.on('generate', async (data) => {
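    // Default to an empty object so a missing or malformed payload cannot throw here
    const { prompt, ratio } = data ?? {};
    if (typeof prompt !== 'string' || prompt.trim() === '') {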
      socket.emit('error_event', { message: 'Prompt is required' });
      return;
    }

    if (!loadedModelId) {
      socket.emit('error_event', { message: 'Model is not loaded yet' });
      return;
    }

    const runDiffusion = async (modelIdToUse) => {
      socket.emit('progress', {
        percent: 0,
        status: 'Starting diffusion process...',
        sub: 'DIFFUSION INITIALIZING'
      });

      console.log(`Generating image for prompt: "${prompt}" with ratio: ${ratio} using model ID: ${modelIdToUse}`);

      const { progressStream, outputs, stats } = diffusion({
        modelId: modelIdToUse,
        prompt,
      });

      // Stream progress steps
      for await (const { step, totalSteps } of progressStream) {
        const percent = Math.round((step / totalSteps) * 100);
        socket.emit('progress', {
          percent,
          status: `Denoising step ${step}/${totalSteps}...`,
          sub: 'RUNNING DIFFUSION'
        });
      }

      // Resolve output buffers
      const buffers = await outputs;
      if (!buffers || buffers.length === 0) {
        throw new Error('No image buffer returned from diffusion model.');
      }

      // Convert image buffer to a base64 Data URL instead of saving to disk
      const base64Data = Buffer.from(buffers[0]).toString('base64');
      const dataUrl = `data:image/png;base64,${base64Data}`;

      // Emit success
      socket.emit('success', {
        url: dataUrl,
        prompt,
        seed: (await stats)?.seed ?? -1 // ?? keeps a legitimate seed of 0
      });

      console.log(`Image generated and emitted successfully as base64 Data URL.`);
    };

    try {
      await runDiffusion(loadedModelId);
    } catch (err) {
      console.error('Image generation failed:', err);

      const isCrash = err.code === 50205 || (err.message && err.message.includes('WORKER_CRASHED'));
      if (isCrash) {
        console.log('Worker crashed during GPU execution. Attempting CPU fallback...');

        // Save device preference so we load CPU directly next time and prevent double loading
        setPreferredDevice('cpu');

        // Reset the stale model state
        loadedModelId = null;
        process.modelId = null;

        socket.emit('progress', {
          percent: 0,
          status: 'GPU driver crashed. Automatically falling back to CPU mode...',
          sub: 'CPU FALLBACK LOADING'
        });

        try {
          console.log('Loading model on CPU...');
          isModelLoading = true;
          modelLoadPercent = 0;
          modelLoadStatus = 'Loading CPU model weights...';
          broadcastModelProgress(modelLoadPercent, modelLoadStatus);

          loadedModelId = await loadModel({
            modelSrc: SD_V2_1_1B_Q8_0,
            modelType: "sdcpp-generation",
            modelConfig: { prediction: "v", device: 'cpu', threads: 4 },
            onProgress: (p) => {
              modelLoadPercent = p.percentage;
              modelLoadStatus = `Loading CPU model weights... (${p.percentage.toFixed(1)}%)`;
              broadcastModelProgress(Math.round(modelLoadPercent), modelLoadStatus);
            }
          });
          process.modelId = loadedModelId;
          isModelLoading = false;
          console.log('Model loaded successfully on CPU. ID:', loadedModelId);

          // Retry diffusion on CPU
          await runDiffusion(loadedModelId);
        } catch (cpuErr) {
          console.error('CPU fallback execution failed:', cpuErr);
          isModelLoading = false;
          socket.emit('error_event', { message: 'Image generation failed on CPU: ' + cpuErr.message });
        }
      } else {
        if (err.message && (err.message.includes('MODEL_NOT_FOUND') || err.message.includes('not found'))) {
          loadedModelId = null;
          process.modelId = null;
          broadcastModelProgress(0, 'Model state lost. Please re-trigger download.');
        }
        socket.emit('error_event', { message: 'Image generation failed: ' + err.message });
      }
    }
  });
});

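// Fallback: serve the single-page app for any route not handled above.
// A path-less app.use() is used instead of app.get('*') because Express 5 rejects the bare '*' pattern.
app.use((req, res) => {
  res.sendFile(path.join(__dirname, 'public', 'index.html'));
});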

server.listen(PORT, () => {
  console.log(`Server is running at http://localhost:${PORT}`);
});

// Clean exit handler
async function handleCleanup() {
  const modelId = process.modelId || loadedModelId;
  if (modelId && modelId !== 'mock-model-id') {
    try {
      await unloadModel({ modelId, clearStorage: false });
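    } catch (err) {
      // Log instead of silently swallowing the error; we still exit below
      console.error('Failed to unload model during shutdown:', err.message);
    }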
  }
  process.exit(0);
}

process.on('SIGINT', handleCleanup);
process.on('SIGTERM', handleCleanup);

2. Frontend Architecture Summary

Since our application runs completely locally, the frontend is a single-page web app built with vanilla HTML, CSS, and client-side JavaScript that communicates with our Express server over Socket.io WebSockets.

Rather than cluttering this tutorial with hundreds of lines of UI templates and style sheets, we'll keep the focus entirely on the backend orchestration. You can grab the complete HTML layout, Tailwind styles, and client script from the GitHub Repository.

Here is a summary of how the client communicates with the server under the hood:

Preflight sync (trigger-model-download): As soon as the page loads, the client establishes a WebSocket connection and emits trigger-model-download. The server handles this event by checking whether the model is already loaded or still loading, and then begins broadcasting progress.
Denoising stream (progress): During image generation, the server streams a progress event for every denoising step (for example, Denoising step 12/20...). The client updates the visual progress bar and status labels accordingly.
Data URL delivery (success): When the diffusion steps are complete, the server converts the binary image buffer into a Base64 string and emits a success event. The client assigns this Base64 Data URL directly to the src attribute of the <img> element, so the image is displayed immediately and can be downloaded without another request.
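The client side of this protocol can be sketched as a pure state reducer. The event names and payload fields match what server.js emits, but the state shape and the reducer itself are illustrative assumptions rather than the code in the repository. In a real page, you would call the reducer from your socket.on(...) handlers and re-render from the returned state.

```javascript
// Initial UI state for a hypothetical client.
const initialState = {
  modelReady: false,
  percent: 0,
  status: "Connecting...",
  imageUrl: null,
  error: null,
};

// Pure reducer: maps one server event to the next UI state.
function reduceClientState(state, eventName, payload) {
  switch (eventName) {
    case "model-download-progress": // payload also carries `size` on the server
      return { ...state, percent: payload.percent, status: payload.status, modelReady: payload.percent >= 100 };
    case "progress":
      return { ...state, percent: payload.percent, status: payload.status, error: null };
    case "success":
      return { ...state, percent: 100, status: "Done", imageUrl: payload.url, error: null };
    case "error_event":
      return { ...state, status: "Error", error: payload.message };
    default:
      return state; // ignore unknown events
  }
}

// Replay a typical session: model ready, one progress update, final image.
const sampleEvents = [
  ["model-download-progress", { percent: 100, status: "Model fully loaded locally." }],
  ["progress", { percent: 60, status: "Denoising step 12/20...", sub: "RUNNING DIFFUSION" }],
  ["success", { url: "data:image/png;base64,AAAA", prompt: "Cozy cabin", seed: 42 }],
];
const finalState = sampleEvents.reduce(
  (state, [name, payload]) => reduceClientState(state, name, payload),
  initialState,
);
console.log(finalState.modelReady, finalState.imageUrl); // true data:image/png;base64,AAAA
```

Keeping the event handling in one pure function makes the protocol easy to test without a browser or a running server.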

Codebase Breakdown

Let’s lift the hood on the key mechanisms that make our local offline image generator work smoothly.

1. Multi-Client Model ID Binding (`process.modelId`)

Quantized weights take a significant amount of memory. Every time we call loadModel(), QVAC boots a separate C++ background process (a Bare worker) to host the GGML runtime.

To prevent spawning multiple processes or loading the 2.3 GB GGUF model multiple times when a client refreshes a page or opens another browser tab, we store the loaded model ID globally on Node’s process object:

let loadedModelId = process.modelId || null;
// ...
process.modelId = loadedModelId;

This acts as a process-wide singleton registry. But using a global variable introduces a challenge: stale worker processes. If a client triggers a model load, gets an ID, and the background worker process later crashes or is killed, process.modelId remains populated with a dead reference.

To resolve this, every time a new client connects and requests a model download trigger, we preflight the model ID using getLoadedModelInfo:

if (loadedModelId) {
  try {
    await getLoadedModelInfo({ modelId: loadedModelId });
    socket.emit('model-download-progress', { percent: 100, status: 'Model fully loaded locally.' });
    return;
  } catch (err) {
    console.log('Model ID was stale, resetting state...', err.message);
    loadedModelId = null;
    process.modelId = null;
  }
}

If the background worker is dead, getLoadedModelInfo throws an error. The catch block intercepts this, wipes the stale references, and safely restarts the loading routine.

IMPORTANT: Process singleton integrity: Always verify that the cached model ID is still alive before starting inference. Without this preflight check, a diffusion() call on a stale model ID only fails after the user has submitted a prompt, and the error then has to be untangled from genuine generation failures.

2. In-Memory Image Serialization (Zero Disk Writes)

Writing generated images to the server's hard drive creates significant I/O overhead. It forces you to write custom cron cleanup scripts to delete old image files, and runs the risk of running out of disk space on systems with high user traffic.

Since QVAC’s diffusion() function outputs generated PNG files directly as in-memory binary buffers (Uint8Array), we bypass the local file system entirely. We serialize the binary array into a Base64 string directly in memory:

const base64Data = Buffer.from(buffers[0]).toString('base64');
const dataUrl = `data:image/png;base64,${base64Data}`;

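This Data URL is transmitted over WebSockets to the client, which immediately binds it to the image element. This approach has a few benefits:

Zero image files on disk: The server never writes generated images to the hard drive, so there is nothing to clean up and no storage bloat. (It still writes the small .device-preference.json file and keeps the cached model weights.)
Direct delivery: The image travels from the inference output to the socket without a round trip through the file system. The trade-off is that Base64 encoding makes the payload about one third larger than the raw PNG.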
Effortless client integration: The client doesn't need to request a static image URL path. It directly renders the Base64 Data URL, allowing users to save or download the image instantly.
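To make the serialization step concrete, here is a small sketch that wraps raw bytes in a Data URL and computes how long the Base64 form will be. The helper names are illustrative, and the 8 sample bytes are just the standard PNG file signature rather than a real image.

```javascript
// Wraps raw PNG bytes (Uint8Array or Buffer) in a Base64 Data URL.
function toPngDataUrl(bytes) {
  return `data:image/png;base64,${Buffer.from(bytes).toString("base64")}`;
}

// Base64 encodes every 3 input bytes as 4 output characters (with padding).
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// The 8-byte signature that starts every PNG file.
const pngSignature = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

console.log(toPngDataUrl(pngSignature)); // data:image/png;base64,iVBORw0KGgo=
console.log(base64Length(3_000_000)); // 4000000
```

A 3 MB image therefore becomes a 4 MB string on the wire, which is worth keeping in mind if you generate larger images.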

3. GPU-to-CPU Fallback & Preference Cache Strategy

One of the biggest challenges with local-first AI is client hardware heterogeneity. For example, older Intel Macs with discrete AMD Radeon GPUs support Apple's Metal framework, but lack the modern tensor reduction operators used by the Stable Diffusion engine, causing a hard C++ crash (SIGABRT) inside ggml-metal-ops.cpp.

To keep the application running and ensure we don't trigger the model loading twice (once on the incompatible GPU on startup, and once on the CPU fallback after the first prompt crash), we use a persistent device preference cache file (.device-preference.json) alongside our C++ worker crash interceptor:

try {
  await runDiffusion(loadedModelId);
} catch (err) {
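  // Guard err.message so an error without a message cannot throw inside the handler
  const isCrash = err.code === 50205 || (err.message && err.message.includes('WORKER_CRASHED'));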
  if (isCrash) {
    // 1. Cache the CPU preference on disk
    setPreferredDevice('cpu');

    // 2. Reset stale references
    loadedModelId = null;
    process.modelId = null;

    // 3. Automatically load the model on CPU with multi-threading
    loadedModelId = await loadModel({
      modelSrc: SD_V2_1_1B_Q8_0,
      modelType: "sdcpp-generation",
      modelConfig: { prediction: "v", device: "cpu", threads: 4 }
    });
    process.modelId = loadedModelId;

    // 4. Transparently retry generation
    await runDiffusion(loadedModelId);
  }
}

This approach utilizes a two-layered defense:

Dynamic recovery: If a GPU driver error triggers a crash, the app intercepts it, saves "device": "cpu" to the .device-preference.json file, dynamically reloads the weights into CPU threads, and retries the generation. The client simply sees a status update indicating CPU fallback is occurring, surviving what would otherwise be a fatal crash.
Preference persistence: The next time the server starts or a page is loaded, the preflight loading routine reads the cached preference from the disk and loads the CPU model immediately:

const preferredDevice = getPreferredDevice(); // Reads .device-preference.json
const loadConfig = { prediction: "v" };
if (preferredDevice) {
  loadConfig.device = preferredDevice;
  if (preferredDevice === 'cpu') {
    loadConfig.threads = 4;
  }
}
loadedModelId = await loadModel({
  modelSrc: SD_V2_1_1B_Q8_0,
  modelType: "sdcpp-generation",
  modelConfig: loadConfig,
  // ...
});

This prevents the server from making redundant GPU load attempts on subsequent sessions, ensuring that the model is loaded only once and directly onto the correct hardware execution target.
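The branching in the generate handler can be condensed into one small classifier. This is a sketch: the error code 50205 and the message markers are the ones checked in server.js, while the function name and the returned labels are hypothetical.

```javascript
// Error code taken from the crash check in server.js in this tutorial.
const WORKER_CRASH_CODE = 50205;

// Decides which recovery path a failed generation should take.
function classifyGenerationError(err) {
  const message = typeof err?.message === "string" ? err.message : "";
  if (err?.code === WORKER_CRASH_CODE || message.includes("WORKER_CRASHED")) {
    return "cpu-fallback"; // save the CPU preference, reload, and retry
  }
  if (message.includes("MODEL_NOT_FOUND") || message.includes("not found")) {
    return "model-lost"; // reset the cached ID and ask the client to reload
  }
  return "fatal"; // report the error to the client as-is
}

console.log(classifyGenerationError({ code: 50205 })); // cpu-fallback
console.log(classifyGenerationError(new Error("MODEL_NOT_FOUND"))); // model-lost
console.log(classifyGenerationError(new Error("Out of memory"))); // fatal
```

Pulling the decision out of the catch block makes each recovery path testable without having to crash a real worker.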

WARNING: CPU fallback latency: While CPU mode keeps the app working on older hardware, it spreads the work across a handful of CPU threads instead of the massively parallel execution units of a GPU. Consequently, generation times will be significantly longer (typically 1 to 2 minutes on CPU compared to 10 to 15 seconds on a compatible GPU). Make sure to design responsive progress loaders in the UI to manage user expectations during fallback.
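One simple way to manage those expectations is to show an estimated time remaining next to the progress bar. The sketch below derives it from the step and totalSteps values that the progress stream already provides. The helper names and the sample timing are illustrative assumptions, and the estimate assumes each step takes about the same time.

```javascript
// Estimates remaining time from the average duration of completed steps.
function estimateRemainingMs(step, totalSteps, elapsedMs) {
  if (step <= 0 || totalSteps <= 0) return null; // nothing measured yet
  const msPerStep = elapsedMs / step;
  return Math.max(0, Math.round(msPerStep * (totalSteps - step)));
}

// Formats milliseconds as m:ss for a status label.
function formatEta(ms) {
  if (ms === null) return "estimating...";
  const totalSeconds = Math.round(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = String(totalSeconds % 60).padStart(2, "0");
  return `${minutes}:${seconds}`;
}

// Example: 5 of 20 steps finished after 30 seconds.
console.log(formatEta(estimateRemainingMs(5, 20, 30_000))); // 1:30
```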

Conclusion

Running local-first Stable Diffusion with QVAC gives you absolute control over your inference costs and data privacy. By coupling on-device GGML models with a simple Node.js WebSocket backend, you can build responsive web tools that run completely offline without ever spending money on cloud APIs.

As mobile and desktop system-on-chip designs continue to ship with more capable neural accelerators, local-first AI architectures will become an increasingly powerful option for modern developers.

Resources and Further Reading

How to Run Private Text-to-Speech on Your Own Hardware Using QVAC

Jibril-M🍀 — Sun, 14 Jun 2026 02:06:42 +0000

When I was putting the final touches on QuizRope, an educational mobile app I built that uses LLMs for real-time tutoring and homework assistance, I knew the next logical step was voice. Reading text on a screen is great, but having an AI tutor physically speak to you transforms the entire learning experience.

Naturally, my first instinct was to look at cloud providers. While services like ElevenLabs offer incredible voice quality, I quickly ran the numbers. Between the API pricing, token consumption for lengthy tutoring sessions, and the sheer volume of users I anticipated, the math got ugly very quickly. Relying on a paid API for every single sentence spoken within the app simply wasn't sustainable for an independent developer.

If you’re about to ask, "How far did you get with QuizRope?", well honestly, I straight-up gave up on the project back then because I couldn't find a sane, affordable solution for the TTS feature.

Beyond the prohibitive cost, there was the latency. Waiting for a server to process a prompt, generate the audio, and stream it back down to a mobile device completely breaks the conversational illusion. And worst of all, it meant every question a student asked would be beamed to a third-party server.

That frustration became the catalyst for my search to find a reliable, offline, and completely zero-cost solution.

In this article, we’re going to build a React Native application that performs high-fidelity Text-to-Speech (TTS) completely offline using your device's own hardware.

If you haven't set up your environment or need a refresher on local inference fundamentals, I highly recommend reading my previous article, How to Run a Local LLM Offline in React Native with QVAC, where I cover project initialization, prebuilding, and native hardware dependencies.

This guide assumes you already have a project with the QVAC SDK configured and ready to run on a physical device.

Prerequisites
What is QVAC?
The Architecture Supported by QVAC
The Inference Pipeline
Environment and Dependency Config
The Audio Utility Packaging
Complete Implementation
Codebase Breakdown
Conclusion
Resources and Further Reading

Prerequisites

To get the most out of this article, you should have a solid foundation in modern web and mobile development:

JavaScript/TypeScript & React: Familiarity with React concepts and hooks, especially useState, useEffect, and useRef.
React Native & Expo: Basic understanding of layout structures (such as View, ScrollView, TextInput) and styling conventions.
Asynchronous JavaScript & Binary Buffers: Experience with async/await, Promises, and basic manipulation of arrays like Int16Array or Buffer.
Development Build Environment: Familiarity with running local development compilation commands, specifically npx expo prebuild to build native iOS and Android modules.
Physical Mobile Device: Because local machine learning models leverage device-specific hardware acceleration and native optimizations, the QVAC SDK doesn't support simulator environments. You must have a physical iOS or Android testing device with Developer Mode enabled.

What is QVAC?

To help you follow along more effectively, let’s establish what QVAC is and why it exists.

Developed by Tether, QVAC is a local-first AI SDK designed for building cross-platform, peer-to-peer (P2P) applications and systems.

Many mobile applications that utilize Large Language Models (LLMs) or Text-to-Speech (TTS) engines rely on network requests to cloud-hosted APIs (such as OpenAI or ElevenLabs). While convenient, this model introduces dependencies on network connectivity, recurring API usage fees, and transmission of user data to third-party servers.

QVAC provides an alternative by executing AI models directly on the client device. This local-first architecture offers several practical advantages:

Local-first execution: Runs inference directly on the client hardware, eliminating the need for external APIs or active internet connections.
Peer-to-peer (P2P) support: Allows distributing inference tasks across local networks, helping coordinate workloads without centralized servers.
Cross-platform compatibility: Provides a single JavaScript/TypeScript interface that works consistently across different hardware and runtime environments.
Unified capabilities: Exposes text generation, transcription, image generation, and speech synthesis within a single package.

Key Concepts for On-Device Inference

To understand how QVAC runs on a mobile device, we must keep a few key concepts in mind:

On-Device Inference: Running model calculations locally. Rather than relying on a single engine, QVAC supports multiple specialized local inference backends depending on the task (such as llama.cpp for text, whisper.cpp for transcription, or custom diffusion backends for image generation). Under the hood, these engines memory-map quantized model weights directly into the device's RAM and run calculations using native GPU hardware acceleration where it is available.
Quantization (GGUF format): A mathematical optimization technique that compresses the model's weights (for example, from standard 16-bit floating-point precision down to 4-bit or 8-bit integers). GGUF is the file format that stores these quantized weights. Quantization makes it possible for models to fit into the memory constraints of consumer mobile hardware while keeping output quality close to the original.
KV (Key-Value) Cache: A memory area that stores calculated states of previous tokens so the model doesn't have to re-evaluate the entire context window with every word or token it generates.
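A quick back-of-the-envelope calculation shows why quantization matters on a phone. The sketch below only multiplies parameter count by bits per weight. Real GGUF files also store per-block scale factors and metadata, so actual files are somewhat larger. The 1-billion-parameter figure is an arbitrary example, not a measurement of any QVAC model.

```javascript
// Approximate size of the weights alone, in bytes.
function estimateWeightBytes(parameterCount, bitsPerWeight) {
  return (parameterCount * bitsPerWeight) / 8;
}

// Formats a byte count in decimal gigabytes (1 GB = 1e9 bytes).
function toGigabytes(bytes) {
  return (bytes / 1e9).toFixed(2) + " GB";
}

const parameters = 1_000_000_000; // arbitrary example size

console.log(toGigabytes(estimateWeightBytes(parameters, 16))); // 2.00 GB
console.log(toGigabytes(estimateWeightBytes(parameters, 8))); // 1.00 GB
console.log(toGigabytes(estimateWeightBytes(parameters, 4))); // 0.50 GB
```

Dropping from 16-bit to 4-bit weights cuts the memory footprint to a quarter, which is often the difference between a model that loads on a phone and one that doesn't.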

The Architecture Supported by QVAC

Instead of a one-size-fits-all approach, the QVAC SDK supports two distinctly different neural architectures for speech synthesis. Depending on your application's needs — whether you want instant voice cloning or ultra-high-fidelity pre-trained voices — you'll choose between Chatterbox and Supertonic.

Feature	Chatterbox	Supertonic
Architecture	Transformer-based language model	Diffusion-based latent denoising
Model Structure	Split (T3 GGUF + S3Gen companion)	Single file (GGUF)
Voice Method	Zero-shot voice cloning (Reference WAV)	Pre-trained voice styles
Sample Rate	24,000 Hz	44,100 Hz

1. The Chatterbox Engine

Chatterbox is built on a transformer-based language model architecture. It treats audio generation similarly to how an LLM predicts the next word in a sentence, but instead of words, it predicts discrete acoustic tokens.

Because of this architecture, Chatterbox excels at zero-shot voice cloning. Instead of relying purely on pre-baked voices, you can pass an optional referenceAudioSrc (a short WAV file of someone speaking) alongside your text. The transformer analyzes the reference audio's acoustic properties and generates a cloned voice based on those features.

2. The Supertonic Engine

Supertonic takes a completely different approach, utilizing diffusion-based latent denoising — the same fundamental architecture used by AI image generators like Stable Diffusion, but applied to audio.

It starts with pure digital noise and iteratively refines it into a 44.1 kHz high-fidelity speech waveform based on the text prompt. Supertonic uses a single, unified GGUF file rather than a split model. Instead of dynamic voice cloning, it relies on highly optimized, pre-trained voice styles (for example, voice: "F1" or voice: "M1") baked directly into the model. This makes it incredibly efficient for generating crystal-clear, studio-quality speech when you don't need dynamic cloning capabilities.

For this tutorial, we'll use Supertonic. It yields fantastic results out of the box and avoids the complexity of loading multiple companion files.

The Inference Pipeline

To visualize how we interact with these engines in our codebase, think of local TTS (Text to Speech) as running a virtual recording studio right in your phone's memory:

Hiring the actor (loading the model): We map the compressed GGUF file directly into the device's RAM or GPU VRAM.
Handing over the script (text input): We pass plain text to the loaded engine.
The performance (inference): The engine reads the text and mathematically predicts the sound waves. Crucially, the AI doesn't emit a finished audio file. Instead, it outputs raw digital sound waves known as PCM samples.
Packaging the audio: Because a raw list of numbers can't be played by standard media players, we must manually wrap the PCM data in a standard WAV header.
Closing the studio (unloading): Because speech synthesis is memory-intensive and maintains a persistent state, the model is cleared from RAM to free up resources and flush its context.
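Some quick arithmetic helps when budgeting memory for the raw PCM output described above. This sketch assumes mono 16-bit samples, which matches the WAV header built later in this article. The helper names are illustrative.

```javascript
const BYTES_PER_SAMPLE = 2; // 16-bit PCM
const CHANNELS = 1; // mono

// Duration of a clip given its total sample count.
function pcmDurationSeconds(sampleCount, sampleRate) {
  return sampleCount / (sampleRate * CHANNELS);
}

// Raw PCM size in bytes for a clip of the given length.
function pcmByteLength(seconds, sampleRate) {
  return Math.round(seconds * sampleRate * CHANNELS * BYTES_PER_SAMPLE);
}

console.log(pcmByteLength(10, 44100)); // 882000 (Supertonic's sample rate)
console.log(pcmByteLength(10, 24000)); // 480000 (Chatterbox's sample rate)
console.log(pcmDurationSeconds(88200, 44100)); // 2
```

Ten seconds of Supertonic speech is a little under 1 MB of raw samples, so holding a clip in memory before writing it to storage is cheap compared to the model itself.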

Environment and Dependency Config

Before we jump into the codebase, there's a crucial dependency setup to keep in mind if your project uses the pnpm package manager.

Because QVAC plugins rely on transitive native peer dependencies, strict package managers like pnpm will lock these dependencies down inside hidden .pnpm subfolders.

To ensure the QVAC native bundler (bare-pack) can resolve your worker plugins correctly at build time, create a .npmrc file in the root of your project:

shamefully-hoist=true

IMPORTANT: After creating this file, you must run a clean dependency install (pnpm install). This ensures a flat layout in your root node_modules so that all QVAC-specific helper packages are resolved properly during your local npx expo prebuild compilation step.

The Audio Utility Packaging

Because QVAC outputs raw PCM arrays, we need to construct a valid WAV file in memory and write it to the device's storage before the native audio player can play it.

To achieve this, let's create a utility module inside src/lib/utils.ts to build the required WAV header, convert raw audio samples into a binary buffer, and write it to local storage.

import { Buffer } from "buffer";
import * as FileSystem from "expo-file-system/legacy";

/**
 * Creates a WAV header for 16-bit PCM audio
 */
export function createWavHeader(
  dataLength: number,
  sampleRate: number,
): Buffer {
  const buffer = Buffer.alloc(44);
  const channels = 1; // Mono
  const byteRate = sampleRate * channels * 2; // 16-bit audio
  const blockAlign = channels * 2;

  buffer.write("RIFF", 0);
  buffer.writeUInt32LE(36 + dataLength, 4);
  buffer.write("WAVE", 8);
  buffer.write("fmt ", 12);
  buffer.writeUInt32LE(16, 16); // Subchunk1Size
  buffer.writeUInt16LE(1, 20); // AudioFormat (PCM)
  buffer.writeUInt16LE(channels, 22);
  buffer.writeUInt32LE(sampleRate, 24);
  buffer.writeUInt32LE(byteRate, 28);
  buffer.writeUInt16LE(blockAlign, 32);
  buffer.writeUInt16LE(16, 34); // BitsPerSample
  buffer.write("data", 36);
  buffer.writeUInt32LE(dataLength, 40);

  return buffer;
}

/**
 * Converts the raw Int16Array samples from QVAC to a binary Buffer
 */
export function int16ArrayToBuffer(int16Array: Int16Array): Buffer {
  const buffer = Buffer.alloc(int16Array.length * 2);
  for (let i = 0; i < int16Array.length; i++) {
    buffer.writeInt16LE(int16Array[i] ?? 0, i * 2);
  }
  return buffer;
}

/**
 * Main function to package and save the file to local mobile storage
 */
export async function saveAudioToDevice(
  audioBuffer: Int16Array,
  sampleRate: number,
): Promise<string> {
  try {
    const audioData = int16ArrayToBuffer(audioBuffer);
    const wavHeader = createWavHeader(audioData.length, sampleRate);
    const finalWavBuffer = Buffer.concat([wavHeader, audioData]);
    const base64Data = finalWavBuffer.toString("base64");

    const filename = `tts-speech-${Date.now()}.wav`;
    const fileUri = `${FileSystem.documentDirectory}${filename}`;

    await FileSystem.writeAsStringAsync(fileUri, base64Data, {
      encoding: FileSystem.EncodingType.Base64,
    });

    console.log(`✅ File saved locally at: ${fileUri}`);
    return fileUri;
  } catch (error) {
    console.error("❌ Failed to save audio file locally:", error);
    throw error;
  }
}
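When playback fails, it helps to confirm that the header fields are what the player expects. Here is a minimal sketch of a reader for the same 44-byte layout that createWavHeader writes. The function names are hypothetical, and the code uses Node's global Buffer so it can run outside the app; inside React Native you would import Buffer from the "buffer" package, as the utility module does.

```javascript
// Reads the fields of a canonical 44-byte PCM WAV header.
function parseWavHeader(buffer) {
  if (buffer.length < 44) throw new Error("Buffer is too short for a WAV header");
  return {
    riff: buffer.toString("ascii", 0, 4),
    wave: buffer.toString("ascii", 8, 12),
    audioFormat: buffer.readUInt16LE(20), // 1 = PCM
    channels: buffer.readUInt16LE(22),
    sampleRate: buffer.readUInt32LE(24),
    byteRate: buffer.readUInt32LE(28),
    bitsPerSample: buffer.readUInt16LE(34),
    dataLength: buffer.readUInt32LE(40),
  };
}

// Checks the invariants that a mono 16-bit PCM file must satisfy.
function isValidMonoPcm16(header) {
  return (
    header.riff === "RIFF" &&
    header.wave === "WAVE" &&
    header.audioFormat === 1 &&
    header.channels === 1 &&
    header.bitsPerSample === 16 &&
    header.byteRate === header.sampleRate * 2
  );
}

// Hand-built sample header (same layout as createWavHeader above) for
// one second of 44.1 kHz mono audio: 44100 samples * 2 bytes = 88200 bytes.
const sample = Buffer.alloc(44);
sample.write("RIFF", 0);
sample.writeUInt32LE(36 + 88200, 4);
sample.write("WAVE", 8);
sample.write("fmt ", 12);
sample.writeUInt32LE(16, 16);
sample.writeUInt16LE(1, 20);
sample.writeUInt16LE(1, 22);
sample.writeUInt32LE(44100, 24);
sample.writeUInt32LE(88200, 28);
sample.writeUInt16LE(2, 32);
sample.writeUInt16LE(16, 34);
sample.write("data", 36);
sample.writeUInt32LE(88200, 40);

const header = parseWavHeader(sample);
console.log(isValidMonoPcm16(header)); // true
console.log(header.dataLength / header.byteRate); // 1 (second of audio)
```

Dividing dataLength by byteRate gives the clip duration in seconds, which is a handy sanity check: a wrong sample rate in the header shows up as audio that plays too fast or too slow.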

Complete Implementation

Let's bring it all together. We'll implement an interface that takes user input, manages download and loading states for the Supertonic engine, packages generated raw waves into a playable local file, and renders an interactive visual waveform player.

Replace your entry app file src/app/index.tsx with the following implementation:

import { useState, useEffect } from "react";
import {
  TextInput,
  KeyboardAvoidingView,
  Platform,
  ScrollView,
} from "react-native";
import {
  loadModel,
  unloadModel,
  textToSpeech,
  downloadAsset,
  TTS_EN_SUPERTONIC_Q8_0,
  getModelInfo,
  type ModelProgressUpdate,
} from "@qvac/sdk";
import { saveAudioToDevice } from "@/lib/utils";
import { TtsModelLoader } from "@/components/tts-model-loader";
import { AudioPlayer } from "@/components/audio-player";
import {
  Card,
  CardContent,
  CardDescription,
  CardHeader,
  CardTitle,
} from "@/components/ui/card";
import { Button } from "@/components/ui/button";
import { Text } from "@/components/ui/text";

const SUPERTONIC_SAMPLE_RATE = 44100;

// Global reference for our model ID
let globalModelId: string | null = null;

type TtsStatus =
  | { phase: "idle" }
  | { phase: "synthesizing" }
  | { phase: "done"; audioUri: string }
  | { phase: "error"; message: string };

export default function TextToVoiceScreen() {
  const [text, setText] = useState("");
  const [status, setStatus] = useState<TtsStatus>({ phase: "idle" });

  const [isModelLoaded, setIsModelLoaded] = useState(!!globalModelId);
  const [isDownloading, setIsDownloading] = useState(false);
  const [downloadProgress, setDownloadProgress] = useState(0);

  const isBusy = status.phase === "synthesizing";

  useEffect(() => {
    async function checkAndAutoLoad() {
      if (globalModelId) return;
      try {
        const info = await getModelInfo({ name: TTS_EN_SUPERTONIC_Q8_0.name });
        if (info.isCached) {
          setIsDownloading(true);
          setDownloadProgress(1);

          globalModelId = await loadModel({
            modelSrc: TTS_EN_SUPERTONIC_Q8_0,
            modelConfig: {
              ttsEngine: "supertonic",
              language: "en",
              voice: "F1",
              ttsSpeed: 1.05,
              ttsNumInferenceSteps: 5,
            },
          });

          setIsModelLoaded(true);
          setIsDownloading(false);
        }
      } catch (err: unknown) {
        console.warn("Failed to auto-load cached model on mount:", err);
        setIsDownloading(false);
      }
    }
    checkAndAutoLoad();
  }, []);

  const handleDownloadModel = async () => {
    if (isDownloading || isModelLoaded) return;

    try {
      setIsDownloading(true);
      setDownloadProgress(0);

      await downloadAsset({
        assetSrc: TTS_EN_SUPERTONIC_Q8_0,
        onProgress: (p: ModelProgressUpdate) => {
          setDownloadProgress(p.percentage / 100);
        },
      });

      setDownloadProgress(1);

      globalModelId = await loadModel({
        modelSrc: TTS_EN_SUPERTONIC_Q8_0,
        modelConfig: {
          ttsEngine: "supertonic",
          language: "en",
          voice: "F1",
          ttsSpeed: 1.05,
          ttsNumInferenceSteps: 5,
        },
      });

      setIsModelLoaded(true);
      setIsDownloading(false);
    } catch (err: unknown) {
      console.error("Failed to download or load model:", err);
      setIsDownloading(false);
      setStatus({
        phase: "error",
        message: err instanceof Error ? err.message : String(err),
      });
      setIsModelLoaded(false);
    }
  };

  const handleSubmit = async () => {
    if (!text.trim() || isBusy || !globalModelId) return;

    try {
      setStatus({ phase: "synthesizing" });

      // 1. Unload and reload the model to reset its state and clear the KV cache.
      if (globalModelId) {
        await unloadModel({ modelId: globalModelId });
      }
      globalModelId = await loadModel({
        modelSrc: TTS_EN_SUPERTONIC_Q8_0,
        modelConfig: {
          ttsEngine: "supertonic",
          language: "en",
          voice: "F1",
          ttsSpeed: 1.05,
          ttsNumInferenceSteps: 5,
        },
      });

      // 2. Synthesize text to raw PCM samples
      const result = textToSpeech({
        modelId: globalModelId,
        text: text.trim(),
        inputType: "text",
        stream: false,
      });

      const audioBuffer = await result.buffer;

      // 3. Package and save WAV file using our local util
      const samplesInt16 = new Int16Array(audioBuffer);
      const wavUri = await saveAudioToDevice(
        samplesInt16,
        SUPERTONIC_SAMPLE_RATE,
      );

      // 4. Show player
      setStatus({ phase: "done", audioUri: wavUri });
    } catch (err: unknown) {
      console.error("TTS error:", err);
      const msg = err instanceof Error ? err.message : String(err);
      setStatus({ phase: "error", message: msg });
    }
  };

  const buttonLabel =
    status.phase === "synthesizing" ? "Synthesizing…" : "Synthesize Speech";

  if (!isModelLoaded) {
    return (
      
    );
  }

  return (
    
      
        
          
            
              Text to Voice
            
            
              Type or paste your content to synthesize speech
            
          

          
            

            {status.phase === "error" && (
              
                {status.message}
              
            )}

            {status.phase === "done" && }

            
          
        
      
    
  );
}

Codebase Breakdown

Let’s lift the hood on how this local Text-to-Speech implementation manages native model lifecycles and processes raw audio arrays.

1. Managing the Native Lifecycle

Loading neural network weights for speech synthesis is computationally expensive. When the QVAC runtime initializes a model, it must read parameters from the local disk and copy the active weights into device RAM.

To handle this efficiently, we declared the reference variable outside the component scope:

let globalModelId: string | null = null;

If globalModelId were kept in component state, navigating away from the text-to-speech screen would unmount the component and discard that state, so the app would lose its reference to a model that is still loaded. Storing the ID at module scope keeps it available across screen transitions.
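
To make the module-scope idea concrete, here is a minimal sketch that runs without QVAC. The names ensureModel and ModelLoader are hypothetical, and the loader is a stand-in for a call such as loadModel; the point is only that a variable declared outside the component survives repeated "mounts".

```typescript
// Hypothetical stand-in for a loader such as QVAC's loadModel (assumption for illustration).
type ModelLoader = () => Promise<string>;

// Declared at module scope, so it outlives any single screen or component instance.
let cachedModelId: string | null = null;

// Loads the model the first time it is needed and reuses the handle afterwards.
async function ensureModel(load: ModelLoader): Promise<string> {
  if (cachedModelId === null) {
    cachedModelId = await load();
  }
  return cachedModelId;
}
```

Every screen that calls ensureModel after the first successful load gets the same ID back without touching the loader again. Note that two calls made before the first load finishes would each start a load, so a production version would also need to track the in-flight promise.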

2. Flushing the KV Cache: Unload and Reload

One of the most important aspects of offline generation using GGML engines is state management:

// 1. Unload and reload the model to reset its state and clear the KV cache.
if (globalModelId) {
  await unloadModel({ modelId: globalModelId });
}

globalModelId = await loadModel({ ... });

WARNING about acoustic hallucinations: If you continuously synthesize sentences on a single TTS model instance without resetting it, the model's Key-Value (KV) cache fills up. It begins treating your new sentence as a continuation of the previous one, leading to heavy robotic distortion, echoing, and repeated voices.

By explicitly destroying the model via unloadModel and immediately booting a fresh instance with loadModel, we force a pristine, empty context window. Because the model is already downloaded, reloading it from local flash storage is fast, typically completing in a fraction of a second on modern mobile hardware. The result is a seamless user experience with artifact-free audio.
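
If you want to check the reload cost on your own devices, the unload-then-load step can be wrapped in a small helper that also measures elapsed time. This is a sketch under assumptions: ModelRuntime and reloadFresh are hypothetical names, and load and unload stand in for loadModel and unloadModel with the model config already bound.

```typescript
// Hypothetical runtime interface (assumption): `load` and `unload` stand in for
// calls such as QVAC's loadModel and unloadModel.
interface ModelRuntime {
  load: () => Promise<string>;
  unload: (modelId: string) => Promise<void>;
}

interface ReloadResult {
  modelId: string;
  elapsedMs: number;
}

// Unloads the current instance (if any), loads a fresh one, and reports how long it took.
async function reloadFresh(
  runtime: ModelRuntime,
  currentId: string | null,
): Promise<ReloadResult> {
  const startedAt = Date.now();
  if (currentId !== null) {
    await runtime.unload(currentId);
  }
  const modelId = await runtime.load();
  return { modelId, elapsedMs: Date.now() - startedAt };
}
```

Logging elapsedMs during development is a cheap way to confirm that the reset stays unnoticeable on the slowest phone you plan to support.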

3. Demystifying the WAV Header Structure

Operating systems and built-in mobile media decoders can't play raw PCM (Pulse Code Modulation) data directly. A raw PCM buffer is just a sequence of amplitude values; nothing in it says how many samples make up one second, how many channels there are, or how many bits each sample uses.

We resolve this by prepending a standard 44-byte RIFF/WAVE header to our PCM buffer.

This header acts as a passport, defining:

AudioFormat (1): Signals uncompressed linear PCM.
NumChannels (1): Mono audio.
SampleRate (44100): The number of samples per second, which must match the rate at which Supertonic generated the audio.
BitsPerSample (16): 16-bit word length (2 bytes per sample).
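
To see how these fields sit inside the 44 bytes, here is a small self-contained sketch that builds the same header with DataView instead of Buffer and then reads two derived fields back. The function name buildWavHeader is an assumption for illustration; the offsets mirror the ones used in createWavHeader.

```typescript
// Builds a 44-byte PCM WAV header with DataView (no Node Buffer required).
function buildWavHeader(
  dataLength: number,
  sampleRate: number,
  channels = 1,
  bitsPerSample = 16,
): Uint8Array {
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign; // bytes of audio per second
  const bytes = new Uint8Array(44);
  const view = new DataView(bytes.buffer);
  const writeTag = (offset: number, tag: string): void => {
    for (let i = 0; i < tag.length; i++) {
      view.setUint8(offset + i, tag.charCodeAt(i));
    }
  };

  writeTag(0, "RIFF");
  view.setUint32(4, 36 + dataLength, true); // file size minus the first 8 bytes
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk for PCM
  view.setUint16(20, 1, true); // AudioFormat: 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeTag(36, "data");
  view.setUint32(40, dataLength, true);
  return bytes;
}

// One second of mono 16-bit audio at 44.1 kHz is 44100 * 2 = 88200 bytes.
const oneSecondHeader = buildWavHeader(88200, 44100);
const headerView = new DataView(oneSecondHeader.buffer);
const parsedByteRate = headerView.getUint32(28, true); // 88200
const parsedBlockAlign = headerView.getUint16(32, true); // 2
```

All multi-byte values are little-endian, which is why every setter passes true as its last argument.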

Additionally, writing the file is handled via Base64 encoding to safely cross React Native's JavaScript-to-Native bridge without dropping binary data:

const base64Data = finalWavBuffer.toString("base64");
await FileSystem.writeAsStringAsync(fileUri, base64Data, {
  encoding: FileSystem.EncodingType.Base64,
});
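
The trade-off with Base64 is size: every 3 bytes of audio become 4 characters, so the string that crosses the bridge is about a third larger than the WAV file itself. The arithmetic can be sketched as follows; the helper names are assumptions for illustration.

```typescript
// Base64 (with padding) encodes every group of up to 3 bytes as 4 characters.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// Playback duration in seconds for uncompressed PCM data.
function pcmDurationSeconds(
  dataBytes: number,
  sampleRate: number,
  channels = 1,
  bytesPerSample = 2,
): number {
  return dataBytes / (sampleRate * channels * bytesPerSample);
}

// Ten seconds of mono 16-bit audio at 44.1 kHz, plus the 44-byte header.
const tenSecondWavBytes = 44 + 10 * 44100 * 2; // 882044 bytes
const tenSecondBase64Chars = base64Length(tenSecondWavBytes); // 1176060 characters
const tenSecondDuration = pcmDurationSeconds(tenSecondWavBytes - 44, 44100); // 10
```

For short utterances this overhead is negligible. For long passages, it may be worth synthesizing and saving in smaller segments.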

4. Visual Waveform Player

Rather than using a basic headless native audio player that fires immediately in the background, we pass the local WAV file path to a custom component powered by @simform_solutions/react-native-audio-waveform.

This module analyzes our newly written WAV file and draws a sleek, WhatsApp-inspired interactive visual waveform, giving the user full control over playback, dynamic speed adjustments (1x, 1.5x, 2x), and seeking. It's a vast UX improvement that makes the final result feel premium and polished.

Conclusion

Transitioning Text-to-Speech from the cloud to on-device hardware offers a practical approach for mobile application developers. Running model inference locally eliminates reliance on remote internet connectivity, removes recurring API usage costs, and ensures that user text inputs never leave the physical device.

Integrating local speech synthesis can be highly beneficial for interactive, educational, or conversational apps. For example, in voice-guided systems, on-device TTS allows applications to function in private or offline environments. As edge processors gain dedicated hardware acceleration cores and open-source models decrease in memory size through quantization research, local-first architectures present a compelling alternative for developers prioritizing privacy, offline resilience, and predictable cost structures.

Resources and Further Reading

To dive deeper into local Text-to-Speech inference, inspect the source code, or explore advanced configurations for your mobile applications, check out the following resources:

QVAC Expo Integration Docs: Learn more about configuring custom local models in Expo.
react-native-audio-waveform: Learn more about interactive React Native audio visualizations.
GGUF Model Hub on Hugging Face: Browse compatible quantized open-source models.
Latent Denoising Deep Dive: Technical deep dive into Diffusion-based acoustic generation.
https://github.com/DjibrilM/QVAC-TTS-Expo-Implementation: Full implementation code.

How to Run an LLM Locally on Your Mobile Phone with QVAC and Expo

Jibril-M🍀 — Wed, 03 Jun 2026 17:17:33 +0000

When I was younger, I remember my mother’s Android phone, a Samsung Galaxy Note 3 that she bought right after losing her BlackBerry. During that time, a phone with 16 GB of storage was considered cutting-edge technology. The ability to store five 720p torrented movies on a single phone honestly felt unreal.

Most flagship devices back then shipped with somewhere between 1 and 3 GB of RAM, and GPUs were nowhere near what we carry around today. My mom’s Galaxy Note 3 featured the Qualcomm Adreno 330 GPU with 32 unified shader cores running at up to 578 MHz, a complete powerhouse for its time.

Fast forward to today, and the phones in our pockets are ridiculously more powerful, more efficient, and, honestly, capable of things people would’ve considered science fiction back then.

But enough about my mom’s phone. What I’m really trying to say is this: instead of spending hundreds of dollars every month on AI subscriptions and tokens, we can take advantage of the insanely capable devices we already carry around every day.

Modern smartphones now have dedicated AI acceleration, impressive thermal efficiency, and enough compute power to run lightweight language models locally, completely offline. That means better privacy, full control over your chat history, lower latency, and the ability to use AI without depending entirely on cloud services.

In this article, we’re going to build a React Native application that interacts with an LLM running directly on the device itself. The implementation will revolve around QVAC, a family of inference tools designed specifically for running AI models locally.

Prerequisites
What is QVAC?
Environment Setup
Model Management
Custom Models
Complete Implementation
Codebase Breakdown
Conclusion
Resources & Further Reading

Prerequisites

To get the most out of this article, you should have a basic understanding of front-end development and React in general. You don't have to be a mobile developer, but understanding React will help a lot.

What is QVAC?

QVAC (QuantumVerse Automatic Computer) is a local-first AI inference platform developed by Tether. It's designed to move artificial intelligence away from centralized cloud systems and bring computation back to the user’s own device.

Most modern AI tools rely heavily on remote servers, API keys, and cloud infrastructure controlled by a handful of companies. While this makes AI accessible, it also creates major concerns around privacy, censorship, vendor lock-in, internet dependency, and ownership of user data. Every prompt, conversation, or uploaded file often passes through third-party servers that users have little control over.

QVAC was designed to solve that problem by allowing AI models and agents to run directly on devices like smartphones, laptops, and embedded systems, even while completely offline. Instead of sending personal conversations and sensitive data to the cloud, users can process everything locally on their own hardware.

The platform also embraces decentralization through peer-to-peer communication, reducing reliance on centralized infrastructure and eliminating single points of failure. This approach makes AI systems more private, resilient, autonomous, and accessible, especially in environments with limited internet access or strict data privacy requirements.

In simple terms, QVAC exists to make AI truly owned by its users — local-first, private by default, and independent from centralized control.

Environment Setup

To speed up the process, I prepared a React Native starter project with all the dependencies installed. But we will install and set up QVAC in this article, since that's our main topic. Here's a link to the repository.

Or you can run the command below to clone the starter project.

git clone --branch ft-ui-implementation --single-branch https://github.com/DjibrilM/QVAC-offline-Chatbot-Article-Project-

QVAC Installation

Run the following command to install the SDK: npm i @qvac/sdk. Feel free to use any package manager of your choice. As for me, I will keep things simple with npm.

Then add the following peer dependencies to your package.json:

{
  "dependencies": {
    "@qvac/sdk": "^0.7.0",
+   "bare-rpc": "^1.0.0", 
    "expo": "~54.0.33",
    "expo-status-bar": "~3.0.9",
    "react": "19.1.0",
    "react-native": "0.81.5",
+   "react-native-bare-kit": "^0.11.5"  
  },
  "devDependencies": {
    "@types/react": "~19.1.0",
    "bare-pack": "^1.5.1", 
    "typescript": "~5.9.2"
  }
}

Install the following additional dependencies:

npx expo install expo-file-system expo-build-properties expo-device

Then configure expo-build-properties and add @qvac/sdk/expo-plugin to the plugins array in your app.json:

{
  "expo": {
    "plugins": [
      "expo-router",
      "@qvac/sdk/expo-plugin",
      [
        "expo-splash-screen",
        {
          "backgroundColor": "#208AEF",
          "android": {
            "image": "./assets/images/splash-icon.png",
            "imageWidth": 76
          }
        }
      ]
    ]
  }
}

Run the following command to build the native modules:

npx expo prebuild

Note: QVAC uses llama.cpp under the hood. Due to optimization requirements and native hardware dependencies, the QVAC SDK doesn't run on emulators. You'll have to test this with a real physical device with Developer Mode enabled.

To run the app on your physical device, execute:

# For Android:
npx expo run:android --device

# For iOS:
npx expo run:ios --device

Model Management

The QVAC model management system is completely local-first and decentralized. It handles the entire model lifecycle, from downloading files to loading them into memory and releasing them again, and abstracts everything behind clean utility APIs.

Resumable & Deduplicated Downloading (`downloadAsset`)

downloadAsset writes temporary chunks to local disk. If the network drops, the partial file is preserved and the download resumes automatically on the next call. Also, if multiple components request the same asset at the same time, QVAC serves them all from a single network stream.
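
The deduplication idea is easy to picture with a small sketch. This is not QVAC's actual implementation; downloadOnce and Downloader are hypothetical names, and the sketch only shows the general pattern of sharing one in-flight promise per asset.

```typescript
// Hypothetical downloader signature (assumption): resolves to the local file path.
type Downloader = (assetName: string) => Promise<string>;

// One shared promise per asset while its download is in flight.
const inFlightDownloads = new Map<string, Promise<string>>();

function downloadOnce(assetName: string, download: Downloader): Promise<string> {
  const existing = inFlightDownloads.get(assetName);
  if (existing !== undefined) {
    return existing; // join the download that is already running
  }
  const started = download(assetName).then(
    (path) => {
      inFlightDownloads.delete(assetName);
      return path;
    },
    (err: unknown) => {
      inFlightDownloads.delete(assetName); // allow a retry after a failure
      throw err;
    },
  );
  inFlightDownloads.set(assetName, started);
  return started;
}
```

Two components that call downloadOnce for the same asset at the same moment both receive the same promise, so only one network transfer starts.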

Memory Lifecycle (`loadModel` & `unloadModel`)

loadModel maps the asset file directly into memory, maps it to your hardware target (such as the device GPU), and exposes an ephemeral modelId. Because local inference is highly memory-intensive on mobile devices, calling unloadModel frees system RAM immediately while preserving the downloaded file on disk.

Custom Models

Because QVAC relies on an optimized branch of llama.cpp, it remains highly compatible with the open-source AI ecosystem. If you plan to load custom models, ensure they adhere to these criteria:

Format: Must be in the GGUF (.gguf) format.
Quantization: For mobile and edge deployments, prioritize Q4_0, Q4_K_M, or Q8_0 configurations so the model is more likely to fit within mobile hardware RAM constraints.
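
To get a feel for why quantization matters, here is a rough back-of-the-envelope estimate for two of these formats. It assumes the GGML block layouts for Q4_0 and Q8_0 (32 weights per block plus a 2-byte scale). The helper name is hypothetical, and real GGUF files come out larger because some tensors are stored at higher precision and the runtime needs extra memory for its context.

```typescript
// Bits per weight, derived from the GGML block layout (32 weights per block):
// Q4_0 packs 16 bytes of 4-bit values plus a 2-byte scale (18 bytes per block).
// Q8_0 packs 32 bytes of 8-bit values plus a 2-byte scale (34 bytes per block).
const BITS_PER_WEIGHT = {
  Q4_0: (18 * 8) / 32, // 4.5
  Q8_0: (34 * 8) / 32, // 8.5
} as const;

type QuantFormat = keyof typeof BITS_PER_WEIGHT;

// Lower-bound estimate of the bytes needed for the weights alone.
function estimateWeightBytes(parameterCount: number, format: QuantFormat): number {
  return (parameterCount * BITS_PER_WEIGHT[format]) / 8;
}

const oneBillionParams = 1e9;
const q4WeightBytes = estimateWeightBytes(oneBillionParams, "Q4_0"); // 562500000
const q8WeightBytes = estimateWeightBytes(oneBillionParams, "Q8_0"); // 1062500000
```

So a one-billion-parameter model needs roughly 0.56 GB of weights in Q4_0 and roughly 1.06 GB in Q8_0, before counting anything else the runtime allocates.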

Complete Implementation

Now let's replace the contents of your main file with the full implementation, which combines the UI layout, user interaction state, model lifecycle setup, and real-time inference handling in a single component.

Replace your entry file with the following code:

import { ChatInput } from "@/components/chat-input";
import { ChatMessage, Message } from "@/components/chat-message";
import { ModelLoader } from "@/components/model-loader";
import { Button } from "@/components/ui/button";
import { Text } from "@/components/ui/text";

import {
  completion,
  deleteCache,
  downloadAsset,
  LLAMA_3_2_1B_INST_Q4_0,
  loadModel,
  type ModelProgressUpdate,
  VERBOSITY,
} from "@qvac/sdk";
import { SymbolView } from "expo-symbols";
import { useEffect, useRef, useState } from "react";

import {
  Clipboard,
  KeyboardAvoidingView,
  Platform,
  SafeAreaView,
  ScrollView,
  View,
} from "react-native";

const makeId = () => Math.random().toString(36).substring(2, 9);

export default function Index() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [input, setInput] = useState("");
  const [isGenerating, setIsGenerating] = useState(false);

  // Model loading state
  const [modelId, setModelId] = useState<string | null>(null);
  const [isModelLoaded, setIsModelLoaded] = useState(false);
  const [isDownloading, setIsDownloading] = useState(false);
  const [downloadProgress, setDownloadProgress] = useState(0);

  const scrollViewRef = useRef<ScrollView>(null);
  const messagesRef = useRef<Message[]>([]);

  useEffect(() => {
    messagesRef.current = messages;
  }, [messages]);

  const startDownload = () => {
    setIsDownloading(true);
    setupModel();
  };

  // Automatically scroll to bottom when messages list updates
  useEffect(() => {
    if (scrollViewRef.current) {
      setTimeout(() => {
        scrollViewRef.current?.scrollToEnd({ animated: true });
      }, 100);
    }
  }, [messages, isGenerating]);

  const copyToClipboard = (text: string) => {
    if (Platform.OS === "web") {
      navigator.clipboard.writeText(text);
    } else {
      Clipboard.setString(text);
    }
  };

  const setupModel = async () => {
    try {
      setIsDownloading(true);
      setDownloadProgress(0);
      
      // 1. Local download path execution
      await downloadAsset({
        assetSrc: LLAMA_3_2_1B_INST_Q4_0,
        onProgress: (progress: ModelProgressUpdate) => {
          setDownloadProgress(progress.percentage / 100);
        },
      });

      setDownloadProgress(1);

      // 2. Load model into runtime memory
      const loadedModel = await loadModel({
        modelSrc: LLAMA_3_2_1B_INST_Q4_0,
        modelType: "llm",
        modelConfig: {
          device: "gpu",
          ctx_size: 2048,
          verbosity: VERBOSITY.ERROR,
        },
      });

      setModelId(loadedModel);
      setIsModelLoaded(true);
      setIsDownloading(false);
    } catch (e: any) {
      console.error("Error setting up model:", e);
      setIsDownloading(false);
    }
  };

  async function handleSend() {
    // Guard against sending before the model is ready or while generating.
    if (!modelId || isGenerating) return;

    const trimmed = input.trim();
    if (!trimmed) return;

    setInput("");
    setIsGenerating(true);

    // Append user message and a placeholder assistant message for streaming.
    const userMsg: Message = {
      id: makeId(),
      role: "user",
      content: trimmed,
    };

    const assistantId = makeId();

    const assistantMsg: Message = {
      id: assistantId,
      role: "assistant",
      content: "",
    };

    setMessages((prev) => [...prev, userMsg, assistantMsg]);

    try {
      // Build chat history for the completion request.
      const history = [...messagesRef.current, userMsg].map((m) => ({
        role: m.role,
        content: m.content,
      }));

      // Run a streaming completion and update the last assistant bubble.
      const result = completion({
        modelId,
        history,
        stream: true,
      });

      let acc = "";

      for await (const token of result.tokenStream) {
        acc += token;

        // Update only the last assistant message content
        setMessages((prev) =>
          prev.map((m) =>
            m.id === assistantId ? { ...m, content: acc } : m
          )
        );
      }

      // Optional: Log completion performance stats
      try {
        const stats = await result.stats;
        console.log("📊 Completion stats:", stats);
      } catch {}

    } catch (e: any) {
      // Show any error in the assistant bubble.
      setMessages((prev) =>
        prev.map((m) =>
          m.id === assistantId
            ? { ...m, content: `❌ Error: ${e?.message ?? String(e)}` }
            : m
        )
      );
    } finally {
      setIsGenerating(false);
    }
  }

  if (!isModelLoaded) {
    return (
      
    );
  }

  return (
    
      
        
          
            
            Local Llama 3.2
          
          Offline Engine
        

        
          {messages.filter(m => m.content !== "" || m.role === "assistant").map((msg) => (
             copyToClipboard(msg.content)}
            />
          ))}
        

        
      
    
  );
}

Codebase Breakdown

Let’s lift the hood on how this unified component manages local model workflows and real-time UI streaming.

1. Tracking Model State & Asynchronous Synchronization

At the root of the component, we track both user-facing interface state and underlying QVAC runtime handles:

const [messages, setMessages] = useState<Message[]>([]);
const [modelId, setModelId] = useState<string | null>(null);
const [isModelLoaded, setIsModelLoaded] = useState(false);
const [isDownloading, setIsDownloading] = useState(false);
const [downloadProgress, setDownloadProgress] = useState(0);

State updates in React are asynchronous, and every render creates new closures. A long-running streaming loop keeps the messages value from the render in which it started, so it can end up reading a stale copy of the chat log.

To circumvent this, a mutable messagesRef always points at the latest messages array:

const messagesRef = useRef<Message[]>([]);

useEffect(() => {
  messagesRef.current = messages;
}, [messages]);
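
The difference between reading from a closure and reading from a ref can be shown without React at all. In this sketch, Ref and makeHandlers are hypothetical names that mimic what happens across two renders.

```typescript
// Mimics the shape of a React ref object.
type Ref<T> = { current: T };

// Each "render" creates handlers that close over that render's snapshot of messages.
function makeHandlers(messagesSnapshot: string[], latestRef: Ref<string[]>) {
  return {
    readFromClosure: () => messagesSnapshot.length,
    readFromRef: () => latestRef.current.length,
  };
}

const sharedMessagesRef: Ref<string[]> = { current: [] };

// First render: no messages yet. A long-running handler keeps this closure alive.
const firstRenderHandlers = makeHandlers([], sharedMessagesRef);

// A later render commits a new array, and the effect copies it into the ref.
sharedMessagesRef.current = ["hello"];

const staleCount = firstRenderHandlers.readFromClosure(); // 0: still sees the old array
const freshCount = firstRenderHandlers.readFromRef(); // 1: sees the latest array
```

The handler created during the first render never learns about the new array through its closure, but it always finds it through the ref.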

2. Orchestrating Download & Memory Instantiation

When the user taps the download button, the application calls setupModel(). This function separates the work into two steps: caching the model file in local storage, then loading it onto the hardware:

await downloadAsset({
  assetSrc: LLAMA_3_2_1B_INST_Q4_0,
  onProgress: (progress: ModelProgressUpdate) => {
    setDownloadProgress(progress.percentage / 100);
  },
});

Storage Sync: downloadAsset downloads the model file to the device's local storage and reports progress through onProgress.
Hardware Binding: Once the file is safely on disk, loadModel starts the inference runtime:

const loadedModel = await loadModel({
  modelSrc: LLAMA_3_2_1B_INST_Q4_0,
  modelType: "llm",
  modelConfig: {
    device: "gpu",
    ctx_size: 2048,
    verbosity: VERBOSITY.ERROR,
  },
});

Passing device: "gpu" tells QVAC to run hardware-accelerated kernels on the smartphone's GPU rather than restricting inference to the slower CPU path.

3. Pipeline Ingest & Streaming Generation Loop

Once the input passes validation, handleSend() appends the user's message and an empty assistant placeholder message that will receive the streamed tokens.

The function then maps the messages in messagesRef.current, plus the new user message, into the role/content history format that completion expects:

const result = completion({
  modelId,
  history,
  stream: true,
});

With stream: true enabled, QVAC doesn't block your application while the full response is generated. Instead, it returns an asynchronous iterable that yields each token as soon as it's produced:

let acc = "";

for await (const token of result.tokenStream) {
  acc += token;

  setMessages((prev) =>
    prev.map((m) =>
      m.id === assistantId ? { ...m, content: acc } : m
    )
  );
}

The loop appends each token to the accumulator (acc) and updates only the message whose id matches our placeholder (assistantId). This creates a live typing effect while everything runs fully offline on the user's device.
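
If you'd like to experiment with this pattern outside the app, the accumulate-and-update loop can be isolated into a small function and driven by a fake stream. Both fakeTokenStream and streamInto are hypothetical helpers, not part of the QVAC SDK.

```typescript
// A fake token source that stands in for result.tokenStream (assumption for illustration).
async function* fakeTokenStream(tokens: string[]): AsyncGenerator<string> {
  for (const token of tokens) {
    yield token;
  }
}

// Accumulates tokens and reports the full text so far after each one.
async function streamInto(
  tokenStream: AsyncIterable<string>,
  onUpdate: (textSoFar: string) => void,
): Promise<string> {
  let acc = "";
  for await (const token of tokenStream) {
    acc += token;
    onUpdate(acc);
  }
  return acc;
}
```

In the component, onUpdate would be the setMessages call. If re-rendering on every token ever becomes costly, this callback is also the natural place to batch or throttle updates.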

Conclusion

Building a local-first AI application is no longer a concept confined to high-end desktops or specialized research labs. As we’ve seen, the smartphones we carry in our pockets every day possess more than enough computational muscle and dedicated hardware acceleration to run highly capable language models completely offline.

By leveraging React Native and the QVAC SDK, we successfully bypassed the traditional cloud-dependent architecture. We eliminated the need for complex server infrastructure, API key management, and recurring token subscription fees, all while providing an ultra-private, low-latency, streaming chat experience directly on-device.

As open-source models continue to shrink in size and grow in capabilities, edge inference will become an essential architecture for developers prioritizing privacy, offline resilience, and cost efficiency. The power to compute is back where it belongs: in the hands of the user.

Resources & Further Reading

To dive deeper into local inference, inspect the source code, or explore advanced configurations for your mobile applications, check out the following resources:

QVAC Expo Integration Tutorial – The official step-by-step documentation for configuring QVAC within the Expo and React Native ecosystems.
Project GitHub Repository – Access the complete source code, including the UI layout components, starter themes, and full configuration files used in this guide.
Llama.cpp Official Repository – Learn more about the underlying inference engine that powers QVAC's hardware-accelerated local execution.
Hugging Face GGUF Models – Explore thousands of open-source, quantized models that you can download and experiment with inside your local application.

How to Build an AI Agent with LangChain and LangGraph: Build an Autonomous Starbucks Agent

Jibril-M🍀 — Fri, 19 Dec 2025 00:21:01 +0000

Back in 2023, when I started using ChatGPT, it was just another chatbot that I could ask complex questions to and it would identify errors in my code snippets. Everything was fine. The application had no memory of previous states or what was said the day before.

Then in 2024, everything started to change. We went from a stateless chatbot to an AI agent that could call tools, search the internet, and generate download links.

At this point, I started to get curious. How can an LLM search the internet? An infinite number of questions were flowing through my head. Can it create its own tools, programs, or execute its own code? It felt like we were heading toward the Skynet (Terminator) revolution.

I was just ignorant 😅. But that's when I started my research and discovered LangChain, a tool that promises all those miracles without a billion-dollar budget.

In this article, you’ll build a fully functional AI agent using LangChain and LangGraph. You’ll start by defining structured data using Zod schemas, then parsing them for AI understanding. Next, you’ll learn about summarizing data into text, creating tools the agent can call, and setting up LangGraph nodes to orchestrate workflows.

You’ll see how to compile the workflow graph, manage state, and persist conversation history using MongoDB. By the end, you’ll have a working Starbucks barista AI that demonstrates how to combine reasoning, tool execution, and memory in a single agent.

Prerequisites
What is an LLM Agent?
Project Setup
Data Schematization with Zod
How to Parse the Schema
Data-to-Text Summarization
How to Persist Orders with MongoDB in NestJS
LangGraph State/Annotation Terms
How to Create Tools for the Agent
LangGraph Nodes (Workflow Components)
Graph Declaration
Workflow Compilation and State Persistence (Final Part)
Conclusion

Prerequisites

To take full advantage of this article, you should have a basic understanding of TypeScript, Node.js, and a bit of NestJS will help, as it’s the backend framework we’ll be using.

What is an LLM Agent?

By definition, an LLM agent is a software program that’s capable of perceiving its environment, making decisions, and taking autonomous actions to achieve specific goals. It often does this by interacting with tools and systems.

Many frameworks and conventions were created to achieve this, and one of the most famous and widely used is the ReAct (Reason & Act) framework.

With this framework, the LLM receives a prompt, thinks, decides the next action (this can be calling a specific tool), and receives the tool data. Once the tool’s response has been received, the AI model observes the response, generates its own response, and plans its next actions based on the tool’s response.

You can read more about this concept in the official white paper. And here’s a diagram that summarizes the entire process:

Note that the workflow is not limited to a single tool invocation – it can proceed through several rounds before returning to the user.
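
Before we bring in any framework, here is a toy sketch of the reason, act, and observe cycle in plain TypeScript. Everything in it is hypothetical: the Model type is a scripted function rather than a real LLM, and the tool registry is a plain object. LangGraph handles this loop for us later in the article.

```typescript
// What the model can decide at each step: call a tool or give the final answer.
type AgentStep =
  | { kind: "tool_call"; tool: string; input: string }
  | { kind: "final"; answer: string };

// Hypothetical stand-ins for an LLM and its tools.
type Model = (transcript: string[]) => AgentStep;
type Tool = (input: string) => string;

function runAgent(
  model: Model,
  tools: Record<string, Tool>,
  prompt: string,
  maxRounds = 5,
): string {
  const transcript = [`user: ${prompt}`];
  for (let round = 0; round < maxRounds; round++) {
    const step = model(transcript); // reason: decide the next action
    if (step.kind === "final") {
      return step.answer;
    }
    const tool = tools[step.tool];
    const observation = tool ? tool(step.input) : `unknown tool: ${step.tool}`; // act
    transcript.push(`tool ${step.tool}: ${observation}`); // observe
  }
  return "Stopped: round limit reached";
}

// A scripted "model": look up the price once, then answer with it.
const scriptedModel: Model = (transcript) => {
  const prefix = "tool lookup_price: ";
  const observation = transcript.find((line) => line.startsWith(prefix));
  if (observation === undefined) {
    return { kind: "tool_call", tool: "lookup_price", input: "latte" };
  }
  return { kind: "final", answer: `A latte costs ${observation.slice(prefix.length)}` };
};

const agentAnswer = runAgent(scriptedModel, { lookup_price: () => "4.50" }, "How much is a latte?");
```

The round limit is there so that a model that keeps requesting tools can't loop forever.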

But for an LLM agent to be truly human-like and act with knowledge of the past, it requires a memory. This enables it to recall previous prompts and responses, maintaining consistency within the given thread.

There’s no single source of truth for how to approach this. Most agents implement a short-term memory. This means that the agent will append each new chat to the conversation history, and when a new prompt is submitted, the agent will send the previous messages along with the new prompt.

This method is very efficient and gives the LLM a strong knowledge of previous states. But it can also introduce problems, because the more the conversation grows, the more the LLM will have to go through all previous messages in order to understand what action to take next.

And this can introduce some context drift, just like humans experience. You can’t watch a two-hour podcast and remember all the spoken words, right? In this scenario, the LLM will focus on the most relevant information, eventually losing some context.
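
One common way to keep the history from growing without bound is to send only the most recent messages. Here is a minimal sketch of that idea; the Turn type and buildPromptMessages are hypothetical names, and this is not how LangGraph's checkpointer works internally.

```typescript
type Turn = { role: "system" | "user" | "assistant"; content: string };

// Keeps only the last `maxPast` turns and adds the new prompt at the end.
function buildPromptMessages(past: Turn[], newPrompt: string, maxPast: number): Turn[] {
  // slice(-0) would return the whole array, so a limit of zero is handled explicitly.
  const recent = maxPast > 0 ? past.slice(-maxPast) : [];
  return [...recent, { role: "user", content: newPrompt }];
}
```

A fixed window is the simplest policy. It trades older context for a bounded prompt size, so anything the agent must never forget (such as its system instructions) would need to be added back separately.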

You don’t have to implement this from scratch. Many tools and frameworks have been developed to make the implementation as easy as possible. You can build it from scratch if you want, of course, but we won’t be doing that here.

In this article, we’ll build a Starbucks barista that collects order information and calls a create_order tool once the order meets the full criteria. This is a tool that we’ll create and expose to the AI.

Project Setup

Let’s start by initializing our project. We’ll use NestJS for its efficiency and native TypeScript support. Note that nothing here is tied to NestJS – this is just a framework preference, and everything we’ll do here can be done with Node.js and Express.js.

Here is a list of all the tools that we’ll use:

@langchain/core - Always required

This is the main LangChain engine that defines all core tools and fundamental functions, containing:
- prompt templates
- message types
- runnables
- tool interfaces
- chain composition utilities, and more.

Most LangChain projects need this.

@langchain/google-genai - This package is used to interact with Google’s generative AI models, vector embedding models, and other related tools.
@langchain/langgraph - Important for building an AI agent with total control

LangGraph is a low-level orchestration framework for building controllable agents. It can be used to build:
- Conversational agents.
- Complex task automation.
- Agent context management.
@langchain/langgraph-checkpoint-mongodb - This package provides a MongoDB-based checkpointer for LangGraph, enabling persistence of agent state and short-term memory using MongoDB.
@langchain/mongodb - This package provides MongoDB integrations for LangChain, allowing you to:
- Store and retrieve vector embeddings.
- Persist LangChain documents, agents, or memory states.
- Easily integrate MongoDB as a database backend for your AI workflows.
@nestjs/mongoose - A NestJS wrapper around Mongoose for MongoDB. Provides:
- Dependency injection support for Mongoose models.
- Simplified schema definition and model management.
- Seamless integration of MongoDB into NestJS applications, enabling structured data persistence for AI apps or any backend.
langchain - This is the main npm package that aggregates LangChain functionality. It provides:
- Access to connectors, utilities, and core modules.
- Easy import of different LangChain components in one place.
- Commonly used alongside @langchain/core for building applications with minimal setup.
mongodb - The official MongoDB driver for Node.js. It provides:
- Low-level, flexible access to MongoDB databases.
- Support for CRUD operations, transactions, and indexing.
- A required dependency if you plan to connect LangChain components or your backend directly to MongoDB.
mongoose - An ODM (Object Data Modeling) library for MongoDB. Offers:
- Schema-based data modeling for MongoDB documents.
- Middleware, validation, and hooks for MongoDB operations.
- Ideal for structured data management in NestJS or other Node.js applications.
zod - A TypeScript-first schema validation library. Used for:
- Defining strict data schemas and validating inputs/outputs.
- Ensuring type safety at runtime.
- Useful in AI applications to validate responses from models or enforce data consistency.

Start by initializing your Nest.js project, and installing all the required dependencies:

$ npm i -g @nestjs/cli # Only needed if you don't have the Nest CLI installed on your machine
$ nest new project-name

"dependencies" : {
    "@langchain/core": "^0.3.75",
    "@langchain/google-genai": "^0.2.16",
    "@langchain/langgraph": "^0.4.8",
    "@langchain/langgraph-checkpoint-mongodb": "^0.1.1",
    "@langchain/mongodb": "^0.1.0",
    "@nestjs/mongoose": "^11.0.3",
    "langchain": "^0.3.33",
    "mongodb": "^6.19.0",
    "mongoose": "^8.18.1",
    "zod": "^4.1.8"
}

// The versions may not be the same by the time you read this, so I recommend checking
// the official documentation for each package.

Now that we have our project created and all the packages installed, let’s see what we need to do to turn our vision into a project. Think of what you’ll need in order to create a Starbucks barista:

First, we need to define the structure of our data (creating schemas)
Then we need to create a menu list that our agent will be referring to.
After that, we’ll add LLM interaction
And last but not least, we’ll add the ability to save previous conversations for conversational context.

Folder Structure

You can modify this folder structure and adapt it based on your framework of choice. But the core implementation is the same across all frameworks.

├── .env
├── .eslintrc.js
├── .gitignore
├── .prettierrc
├── nest-cli.json
├── package.json
├── README.md
├── tsconfig.build.json
├── tsconfig.json
├── src/
│   ├── app.controller.ts
│   ├── app.module.ts
│   ├── app.service.ts
│   ├── main.ts
│   ├── chat/
│   │   ├── chat.controller.ts
│   │   ├── chat.module.ts
│   │   ├── chat.service.ts
│   │   └── dtos/
│   │       └── chat.dto.ts
│   ├── data/
│   │   └── schema/
│   │       └── order.schema.ts
│   └── util/
│       ├── constants/
│       │   └── drinks_data.ts
│       ├── schemas/
│       │   ├── drinks/
│       │   │   └── Drink.schema.ts
│       │   └── orders/
│       │       └── Order.schema.ts
│       ├── summaries/
│       │   └── drink.ts
│       └── types/

Data Schematization with Zod

This file contains all our schema definitions regarding drinks and all modifications they can receive. This part is useful for defining the structure of the data that will be used by the AI agent.

Importing Zod

In the lib/util/schemas/drinks.ts file, before defining any schemas, import the Zod library, which provides tools for building TypeScript-first schemas.

// Imports the 'z' object from the 'zod' library.
// Zod is a TypeScript-first schema declaration and validation library.
// 'z' is the primary object used to define schemas (e.g., z.object, z.string, z.boolean, z.array).
import z from "zod";

Zod gives you a simple and expressive way to define and validate the structure of the data our agent will interact with.

Drink Schema

This schema represents the structure of a drink in the Starbucks-style menu. Each field is commented so you can see what each property controls.

export const DrinkSchema = z.object({
  name: z.string(),            // Required name of the drink
  description: z.string(),     // Required explanation of what the drink is
  supportMilk: z.boolean(),    // Whether milk options are available
  supportSweeteners: z.boolean(), // Whether sweeteners can be added
  supportSyrup: z.boolean(),   // Whether flavor syrups are allowed
  supportTopping: z.boolean(), // Whether toppings are supported
  supportSize: z.boolean(),    // Whether the drink can be ordered in sizes
  image: z.string().url().optional(), // Optional image URL
});

What this schema represents

It ensures every drink has a proper name and a description.
It defines which customizations apply to the drink.
It prepares the agent to reason about drink options in a structured, validated format.

Sweetener Schema

Each sweetener option in the menu is represented with its own schema.

export const SweetenerSchema = z.object({
  name: z.string(),                // Sweetener name
  description: z.string(),         // What it is / taste description
  image: z.string().url().optional(), // Optional image URL
});

This ensures consistency across all sweetener entries and avoids malformed data.

Syrup Schema

Similar to sweeteners, but for syrup flavors:


export const SyrupSchema = z.object({
  name: z.string(),
  description: z.string(),
  image: z.string().url().optional(),
});

This can represent flavors like Vanilla, Caramel, or Hazelnut.

Topping Schema

Toppings such as whipped cream or cinnamon are defined here.

export const ToppingSchema = z.object({
  name: z.string(),
  description: z.string(),
  image: z.string().url().optional(),
});

Size Schema

Drink sizes are modeled as objects as well:

export const SizeSchema = z.object({
  name: z.string(),               // e.g. Small, Medium
  description: z.string(),        // A short explanation
  image: z.string().url().optional(),
});

Milk Schema

Represents milk types such as Whole, Skim, Almond, or Oat.

export const MilkSchema = z.object({
  name: z.string(),
  description: z.string(),
  image: z.string().url().optional(),
});

Collections of Items

Now that the individual item schemas exist, we can create collections of them. These represent all available toppings, sizes, milk types, syrups, sweeteners, and the entire menu of drinks.

export const ToppingsSchema = z.array(ToppingSchema);
export const SizesSchema = z.array(SizeSchema);
export const MilksSchema = z.array(MilkSchema);
export const SyrupsSchema = z.array(SyrupSchema);
export const SweetenersSchema = z.array(SweetenerSchema);
export const DrinksSchema = z.array(DrinkSchema);

Why arrays? Because in the real world, your agent will receive lists from a database or API—not single items.

Inferred Types

Zod also allows TypeScript to infer types from schemas automatically.

This ensures:

TypeScript types always match the schemas.
You avoid duplicated definitions.
The agent code stays consistent and safe.

export type Drink = z.infer<typeof DrinkSchema>;
export type SupportSweetener = z.infer<typeof SweetenerSchema>;
export type Syrup = z.infer<typeof SyrupSchema>;
export type Topping = z.infer<typeof ToppingSchema>;
export type Size = z.infer<typeof SizeSchema>;
export type Milk = z.infer<typeof MilkSchema>;

export type Toppings = z.infer<typeof ToppingsSchema>;
export type Sizes = z.infer<typeof SizesSchema>;
export type Milks = z.infer<typeof MilksSchema>;
export type Syrups = z.infer<typeof SyrupsSchema>;
export type Sweeteners = z.infer<typeof SweetenersSchema>;
export type Drinks = z.infer<typeof DrinksSchema>;

These provide the rest of your LangChain/LangGraph code with strong typing based on your schema definitions.

This entire file:

Encodes all drink-related data structures.
Provides validation to ensure clean, predictable data.
Automatically generates TypeScript types.
Helps the AI agent reason reliably about drinks and customization options.

You’ll use these schemas later and convert them into string representations for LLM prompts.

You can find the file containing all the code here.

How to Parse the Schema

As mentioned earlier, LLMs are text input–output machines. They don’t understand TypeScript types or Zod schemas directly. If you include a schema inside a prompt, the model will simply see it as plain text without understanding its structure or constraints.

Because of this, we need a way to convert schemas into a readable string format that can be embedded inside a prompt, such as:

“The output must be a JSON object with the following fields…”

This is exactly the problem solved by StructuredOutputParser from langchain/output_parsers. It takes a Zod schema and turns it into:

A human-readable description that can be sent to an LLM.
A validator that checks whether the model’s output matches the schema.

In short, it acts as a bridge between typed application logic and text-based AI output.

Defining the Order Schema

We’ll start with a simple Zod schema that represents a customer’s drink order. This schema defines the exact shape and constraints of the data we expect the model to produce.

export const OrderSchema = z.object({
  drink: z.string(),
  size: z.string(),
  milk: z.string(),
  syrup: z.string(),
  sweeteners: z.string(),
  toppings: z.string(),
  quantity: z.number().min(1).max(10),
});

export type OrderType = z.infer<typeof OrderSchema>;

At this point, the schema is useful only inside our TypeScript application. The LLM still has no idea what this structure means.

Parsing the Schema into Human-Readable Text

This is where schema parsing comes in. Using StructuredOutputParser.fromZodSchema, we can transform the Zod schema into:

Instructions the LLM can understand.
A runtime validator that ensures the response is correct.

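import { StructuredOutputParser } from 'langchain/output_parsers';

// Builds a parser that can describe OrderSchema to the LLM and validate its replies.
export const OrderParser =
  StructuredOutputParser.fromZodSchema(OrderSchema as any);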

The parser enables two critical workflows:

Generating prompt instructions

The parser can generate a text description of the schema that looks roughly like: “Return a JSON object with the fields drink, size, milk, syrup, sweeteners, and toppings as strings, and quantity as a number between 1 and 10.” This string can be injected directly into your prompt so the LLM knows exactly how to format its response.

Validating the model’s output

After the LLM responds, its output is still just text. The parser:

Converts that text into a JavaScript object.
Validates it against the original Zod schema.
Throws an error if anything is missing, malformed, or out of bounds.

This prevents invalid AI-generated data (for example, quantity: 0) from entering your system.
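To make these two workflows concrete, here is a small dependency-free sketch. It is a hand-rolled stand-in written for this explanation, not how StructuredOutputParser is implemented: the `OrderDraft` type and the `formatInstructions` and `parseOrder` functions are hypothetical, and the real parser derives its instructions and validation from the Zod schema instead of hard-coding them.

```typescript
// Hand-rolled stand-in for illustration; NOT LangChain's implementation.
interface OrderDraft {
  drink: string;
  size: string;
  quantity: number;
}

// 1. Describe the expected shape in words the LLM can follow.
const formatInstructions = (): string =>
  'Return a JSON object with the fields "drink" (string), "size" (string), ' +
  'and "quantity" (number between 1 and 10).';

// 2. Convert the model's text into an object and validate it.
const parseOrder = (raw: string): OrderDraft => {
  const data: unknown = JSON.parse(raw); // throws on malformed JSON
  if (typeof data !== 'object' || data === null) {
    throw new Error('Expected a JSON object');
  }
  const { drink, size, quantity } = data as Record<string, unknown>;
  if (typeof drink !== 'string' || typeof size !== 'string') {
    throw new Error('"drink" and "size" must be strings');
  }
  if (typeof quantity !== 'number' || quantity < 1 || quantity > 10) {
    throw new Error('"quantity" must be a number between 1 and 10');
  }
  return { drink, size, quantity };
};

const accepted = parseOrder('{"drink":"Latte","size":"Large","quantity":2}');
console.log(accepted.quantity); // 2

let rejected = false;
try {
  parseOrder('{"drink":"Latte","size":"Large","quantity":0}');
} catch {
  rejected = true; // quantity: 0 is out of bounds
}
console.log(rejected); // true
```

The same text-in, validated-object-out contract is what the LangChain parser provides, with the schema as the single source of truth.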

Reusing the Same Approach for Other Schemas

Once you understand this pattern, applying it to other schemas is straightforward.

For example, you can do the same thing for a DrinkSchema:

export const DrinkParser =
  StructuredOutputParser.fromZodSchema(DrinkSchema as any);

Now you can confidently say something like: “Hey Gemini, this is what a drink object looks like—please respond using this structure.”

Why This Matters

Schema parsing allows you to:

Keep strong typing in your application.
Give clear formatting instructions to the LLM.
Safely convert unstructured AI output into validated, production-ready data.

Without this step, working with LLMs at scale becomes unreliable and error-prone.

Data-to-Text Summarization

In the context of LLM agents, data-to-text summarization means converting structured data—such as objects returned from a database or backend API—into clear, human-readable strings that can be embedded directly into prompts.

Even the most advanced LLMs operate purely on text. They don’t reason over JavaScript objects, database rows, or JSON structures in the same way humans or programs do. The clearer and more descriptive your text input is, the more accurate and reliable the model’s output will be.

Because of this, a common and recommended pattern when building LLM-powered systems is:

Fetch structured data → summarize it into natural language → pass the summary into the prompt

To keep this article focused, we’ll store our data in constants instead of querying a real database. The technique is exactly the same whether the data comes from MongoDB, PostgreSQL, or an API.

The Core Idea

The goal of data-to-text summarization is simple:

Take an object with fields and boolean flags
Convert it into a short paragraph that explains what the object represents
Remove ambiguity and guesswork for the LLM

Instead of forcing the model to infer meaning from raw data, we spell it out explicitly.

Summarizing a Drink Object

Consider the following drink object:

{
  name: 'Espresso',
  description: 'Strong concentrated coffee shot.',
  supportMilk: false,
  supportSweeteners: true,
  supportSyrup: true,
  supportTopping: false,
  supportSize: false,
}

While this structure is easy for developers to understand, it’s not ideal for an LLM prompt. Boolean flags like supportMilk: false require interpretation, which increases the chance of incorrect assumptions.

Instead, we convert this object into a descriptive paragraph:

“A drink named Espresso. It is described as a strong, concentrated coffee shot. It cannot be made with milk. It can be made with sweeteners. It can be made with syrup. It cannot be made with toppings. It cannot be made in different sizes.”

This transformation is exactly what data-to-text summarization provides.

A Standard Summarization Pattern

Below is a simplified example of how we convert a Drink object into a readable description.

export const createDrinkItemSummary = (drink: Drink): string => {
  const name = `A drink named ${drink.name}.`;
  const description = `It is described as ${drink.description}.`;

  const milk = drink.supportMilk
    ? 'It can be made with milk.'
    : 'It cannot be made with milk.';

  const sweeteners = drink.supportSweeteners
    ? 'It can be made with sweeteners.'
    : 'It cannot contain sweeteners.';

  const syrup = drink.supportSyrup
    ? 'It can be made with syrup.'
    : 'It cannot be made with syrup.';

  const toppings = drink.supportTopping
    ? 'It can be made with toppings.'
    : 'It cannot be made with toppings.';

  const size = drink.supportSize
    ? 'It can be made in different sizes.'
    : 'It cannot be made in different sizes.';

  return `${name} ${description} ${milk} ${sweeteners} ${syrup} ${toppings} ${size}`;
};

Why this works well for LLMs

Boolean logic is converted into explicit sentences
Every capability and limitation is clearly stated
The output can be embedded directly into a system or user prompt
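As a quick check of the pattern, the sketch below runs a more compact, table-driven variant against the Espresso object from earlier. It is an illustration, not the article's implementation: `DrinkLike`, `capability`, and `summarizeDrink` are hypothetical names, and stripping the trailing period from the description is an extra step added here so the sentence does not end in two periods.

```typescript
// Local copy of the drink shape so the sketch runs on its own.
interface DrinkLike {
  name: string;
  description: string;
  supportMilk: boolean;
  supportSweeteners: boolean;
  supportSyrup: boolean;
  supportTopping: boolean;
  supportSize: boolean;
}

// Hypothetical helper: turns one boolean flag into an explicit sentence.
const capability = (supported: boolean, what: string): string =>
  supported ? `It can be made ${what}.` : `It cannot be made ${what}.`;

const summarizeDrink = (drink: DrinkLike): string => {
  // Drop a trailing period so the sentence does not end in "..".
  const description = drink.description.replace(/\.$/, '');
  return [
    `A drink named ${drink.name}.`,
    `It is described as ${description}.`,
    capability(drink.supportMilk, 'with milk'),
    capability(drink.supportSweeteners, 'with sweeteners'),
    capability(drink.supportSyrup, 'with syrup'),
    capability(drink.supportTopping, 'with toppings'),
    capability(drink.supportSize, 'in different sizes'),
  ].join(' ');
};

const espresso: DrinkLike = {
  name: 'Espresso',
  description: 'Strong concentrated coffee shot.',
  supportMilk: false,
  supportSweeteners: true,
  supportSyrup: true,
  supportTopping: false,
  supportSize: false,
};

const espressoSummary = summarizeDrink(espresso);
console.log(espressoSummary);
```

Keeping the flag-to-sentence logic in one helper makes it harder for the positive and negative wording to drift apart between fields.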

Summarizing Collections of Data

This same approach applies to lists of data such as milks, syrups, toppings, or sizes. Instead of passing an array of objects to the model, we convert them into bullet-style text summaries:

export const createSweetenersSummary = (): string => {
  return `Available sweeteners are:
${SWEETENERS.map(
  (s) => `- ${s.name}: ${s.description}`
).join('\n')}`;
};

This gives the model a complete, readable overview of available options without requiring it to interpret raw arrays.
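Since milks, syrups, toppings, and sizes all share the same name and description shape, one generic helper could cover every list. The following is a sketch under assumptions: `MenuOption`, `summarizeOptions`, and the sample data are hypothetical and stand in for the project's real constants, which are not shown here.

```typescript
// Hypothetical shape shared by sweeteners, milks, syrups, and so on.
interface MenuOption {
  name: string;
  description: string;
}

// Hypothetical sample data; the real lists live in the project's constants.
const SAMPLE_SWEETENERS: MenuOption[] = [
  { name: 'Sugar', description: 'Classic white sugar' },
  { name: 'Honey', description: 'Natural floral sweetness' },
];

// A heading line followed by one bullet per option.
const summarizeOptions = (label: string, options: MenuOption[]): string =>
  `Available ${label} are:\n${options
    .map((option) => `- ${option.name}: ${option.description}`)
    .join('\n')}`;

const sweetenerSummary = summarizeOptions('sweeteners', SAMPLE_SWEETENERS);
console.log(sweetenerSummary);
// Available sweeteners are:
// - Sugar: Classic white sugar
// - Honey: Natural floral sweetness
```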

Applying the Same Idea to Other Domains

This pattern is not limited to drinks or menus. It works for any domain. For example, here’s the same summarization technique applied to an object representing a shoe in an online ordering assistant:

export const createShoeItemSummary = (shoe: {
  name: string;
  description: string;
  genderCategory: string;
  styleType: string;
  material: string;
  availableInMultipleColors: boolean;
  limitedEdition: boolean;
  supportsCustomization: boolean;
}): string => {
  return `
A shoe named ${shoe.name}.
It is described as ${shoe.description}.
It is categorized as a ${shoe.genderCategory.toLowerCase()} shoe.
It belongs to the ${shoe.styleType.toLowerCase()} fashion style.
It is made of ${shoe.material.toLowerCase()} material.
${shoe.availableInMultipleColors ? 'It is available in multiple colors.' : 'It is available in a single color.'}
${shoe.limitedEdition ? 'It is a limited-edition release.' : 'It is not a limited-edition release.'}
${shoe.supportsCustomization ? 'It supports customization options.' : 'It does not support customization options.'}
`.trim();
};

Which produces an output like:

“A shoe named Veloria Canvas Sneaker. It is described as a minimalist everyday sneaker designed for casual wear. It is categorized as a unisex shoe. It belongs to the casual fashion style. It is made of breathable canvas material. It is available in multiple colors. It is not a limited-edition release. It supports customization options.”

How to Persist Orders with MongoDB in NestJS

Now that we’ve established the core foundations of our application—schemas, parsers, and data-to-text summaries—it’s time to persist data. In a real-world assistant, orders and conversations shouldn’t disappear when the server restarts. They need to be stored reliably so they can be retrieved, analyzed, or continued later.

To achieve this, we’ll use MongoDB as our database and the NestJS Mongoose integration to manage data models and collections.

Connecting MongoDB to a NestJS Application

In NestJS, the AppModule is the root module of the application. This is where global dependencies—such as database connections—are configured.

@Module({
  imports: [
    MongooseModule.forRoot(process.env.MONGO_URI),
    ChatsModule,
  ],
  controllers: [AppController],
  providers: [AppService],
})
export class AppModule {}

What’s happening here?

MongooseModule.forRoot(...) establishes a global MongoDB connection.
The connection string is read from an environment variable (MONGO_URI), which is the recommended practice for security.
Once configured, this connection becomes available throughout the entire application.
ChatsModule is imported so it can access the database connection and register its own schemas.

This setup ensures that every feature module can safely interact with MongoDB without creating multiple connections.
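Because process.env.MONGO_URI is undefined when the variable is not set, it can help to fail fast with a clear message before the connection is attempted. The helper below is a minimal sketch of that idea: `requireEnv` and the `Env` type are hypothetical names, and the URI is a placeholder rather than a value from this project.

```typescript
// Hypothetical helper: fail fast when a required setting is missing.
type Env = Record<string, string | undefined>;

const requireEnv = (env: Env, key: string): string => {
  const value = env[key];
  if (!value) {
    throw new Error(`Missing required environment variable: ${key}`);
  }
  return value;
};

// In the app you would pass process.env. A plain object is used here
// so the sketch runs anywhere; the URI below is a placeholder.
const sampleEnv: Env = { MONGO_URI: 'mongodb://localhost:27017/drinks_db' };
const mongoUri = requireEnv(sampleEnv, 'MONGO_URI');
console.log(mongoUri); // mongodb://localhost:27017/drinks_db
```

With a guard like this, a missing variable surfaces as a readable startup error instead of a confusing connection failure later on.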

Defining an Order Schema with Mongoose

NestJS uses decorators to define MongoDB schemas in a clean, class-based way. Each class represents a MongoDB document, and each property becomes a field in the collection.

@Schema()
export class Order {
  @Prop({ required: true })
  drink: string;

  @Prop({ default: null })
  size: string;

  @Prop({ default: null })
  milk: string;

  @Prop({ default: null })
  syrup: string;

  @Prop({ default: null })
  sweeteners: string;

  @Prop({ default: null })
  toppings: string;

  @Prop({ default: 1 })
  quantity: number;
}

Why this approach?

Each @Prop() decorator maps directly to a MongoDB field.
Default values allow partial orders to be saved incrementally.
Required fields (like drink) enforce basic data integrity.
The schema closely mirrors the structured output produced by the LLM.

Once the class is defined, it’s converted into a MongoDB schema:

export const OrderSchema = SchemaFactory.createForClass(Order);

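This single line turns the class into a Mongoose schema, which provides:

The document shape for the orders collection
A validation layer (required fields and default values)
A schema that Mongoose can use to create, read, and update orders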

How This Fits into the LLM Agent Architecture

At this point, we have:

Zod schemas → for validating AI output
Summarization functions → for converting data into readable prompts
MongoDB schemas → for persisting finalized orders

This separation is intentional:

Zod handles AI-facing validation
Mongoose handles database persistence
NestJS acts as the glue that ties everything together

Preparing for the Agent Logic

With the database in place, we’re now ready to implement the agent itself.

The agent’s responsibilities will include:

Interpreting user messages
Calling tools
Generating structured orders
Validating them
Persisting them to MongoDB
Maintaining conversational state

All of this logic will live inside the src/chats/chats.service.ts file. The next section introduces the agent’s core logic, and we’ll walk through it step by step so every part is easy to follow.

Start by importing the required dependencies:


import { Injectable } from '@nestjs/common';
import { InjectModel } from '@nestjs/mongoose';
import { MongoClient } from 'mongodb';
import { Model } from 'mongoose';

import { tool } from '@langchain/core/tools';
import {
  ChatPromptTemplate,
  MessagesPlaceholder,
} from '@langchain/core/prompts';
import { AIMessage, BaseMessage, HumanMessage } from '@langchain/core/messages';

import { ChatGoogleGenerativeAI } from '@langchain/google-genai';
import { Annotation, END, START, StateGraph } from '@langchain/langgraph';
import { ToolNode } from '@langchain/langgraph/prebuilt';

import { MongoDBSaver } from '@langchain/langgraph-checkpoint-mongodb';

import z from 'zod';

import { Order } from './schemas/order.schema';
import { OrderParser, OrderSchema, OrderType } from 'src/lib/schemas/orders';
import { DrinkParser } from 'src/lib/schemas/drinks';
import { DRINKS } from 'src/lib/utils/constants/menu_data';

import {
  createSweetenersSummary,
  availableToppingsSummary,
  createAvailableMilksSummary,
  createSyrupsSummary,
  createSizesSummary,
  createDrinkItemSummary,
} from 'src/lib/summaries';

const GOOGLE_API_KEY = process.env.GOOGLE_API_KEY || '';
const client: MongoClient = new MongoClient(process.env.MONGO_URI || '');
const database_name = 'drinks_db';

LangGraph State/Annotation Terms

In LangGraph, state can be thought of as a temporary workspace that exists while the agent is running. It stores all the information that nodes (we’ll cover nodes in detail later) might need to access, such as the last message, the history of the conversation, or any intermediate data generated during execution.

This state allows nodes to read from it, update it, and pass information along as the agent processes a workflow, making it the agent’s short-term memory for the duration of the run.

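@Injectable()
export class ChatService {
  // Inject the Mongoose model so the agent's tools can persist orders.
  // This assumes Order is registered in the module via MongooseModule.forFeature.
  constructor(
    @InjectModel(Order.name) private readonly orderModel: Model<Order>,
  ) {}

  chatWithAgent = async ({
    thread_id,
    query,
  }: {
    thread_id: string;
    query: string;
  }) => {
    // Shared graph state: a single "messages" channel holding the conversation.
    const graphState = Annotation.Root({
      messages: Annotation<BaseMessage[]>({
        // Append new messages to the existing history.
        reducer: (x, y) => [...x, ...y],
      }),
    });
  };
}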

This code defines the LangGraph state for the chat agent. The graphState object acts as a central memory that every node in the workflow can read from and update.

The messages field specifically stores all messages in the conversation, including user messages, AI responses, and tool outputs. The reducer function [...x, ...y] appends new messages to the existing array, preserving the conversation history across multiple steps.

LangGraph’s reducer mechanism lets developers control how new state merges with old state. In this chat system, the approach is similar to updating React state with setMessages(prev => [...prev, ...newMessages]): it keeps the old messages while adding the new ones.

Together, this state enables the agent, tools, and checkpointing system to maintain a coherent conversation, allowing each node in the LangGraph workflow to access the full context and contribute incrementally.
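The merge rule can be tried out in isolation. The sketch below is a plain TypeScript stand-in for the idea, not LangGraph's internals: `Msg`, `GraphStateLike`, `messagesReducer`, and `applyUpdate` are hypothetical names used only to show how a partial update returned by a node is folded into the existing state.

```typescript
// Simplified message shape for illustration.
interface Msg {
  role: 'user' | 'ai' | 'tool';
  content: string;
}

interface GraphStateLike {
  messages: Msg[];
}

// Same merge rule as the reducer above: keep old messages, append new ones.
const messagesReducer = (existing: Msg[], update: Msg[]): Msg[] => [
  ...existing,
  ...update,
];

// Hypothetical stand-in for what happens when a node returns a partial update.
const applyUpdate = (
  current: GraphStateLike,
  update: Partial<GraphStateLike>,
): GraphStateLike => ({
  messages: update.messages
    ? messagesReducer(current.messages, update.messages)
    : current.messages,
});

let snapshot: GraphStateLike = { messages: [{ role: 'user', content: 'Hi' }] };
snapshot = applyUpdate(snapshot, {
  messages: [{ role: 'ai', content: 'Hello! What can I get you?' }],
});
console.log(snapshot.messages.length); // 2
```

A node therefore only returns the messages it produced, and the reducer takes care of preserving everything that came before.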

How to Create Tools for the Agent

Modern chatbots can do more than just generate text - they can also search the internet, read files, or perform computations. While LLMs are powerful, they cannot execute code or compile programs on their own.

In the context of LLM agents, a tool is a piece of code written by the agent developer that an LLM can invoke on the host machine. The host machine executes the code, and the LLM only receives the final output of the computation.

Here's how to create a tool that stores orders in the database. Still inside the chatWithAgent function of the ChatService class, add this below the state definition:

const orderTool = tool(
  async ({ order }: { order: OrderType }) => {
    try {
      await this.orderModel.create(order);
      return 'Order created successfully';
    } catch (error) {
      console.log(error);
      return 'Failed to create the order';
    }
  },
  {
    schema: z.object({
      order: OrderSchema.describe('The order that will be stored in the DB'),
    }),
    name: 'create_order',
    description: 'This tool creates a new order in the database',
  }
);

const tools = [orderTool];
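One property of this tool is worth noticing: it always resolves to a status string the LLM can read, even when the database call fails. The sketch below isolates that contract so it can be exercised without NestJS, Mongoose, or LangChain. Everything in it is hypothetical scaffolding for illustration: `OrderInput`, `OrderStore`, `makeCreateOrderHandler`, and the two in-memory stores.

```typescript
// Hypothetical minimal shapes so the sketch runs without a database.
interface OrderInput {
  drink: string;
  quantity: number;
}

interface OrderStore {
  create(order: OrderInput): Promise<void>;
}

// Same contract as the tool above: always resolve to a status string
// the LLM can read, and never let the error escape.
const makeCreateOrderHandler =
  (orderStore: OrderStore) =>
  async ({ order }: { order: OrderInput }): Promise<string> => {
    try {
      await orderStore.create(order);
      return 'Order created successfully';
    } catch {
      return 'Failed to create the order';
    }
  };

// In-memory stand-ins for the database model.
const saved: OrderInput[] = [];
const workingStore: OrderStore = {
  create: async (order) => {
    saved.push(order);
  },
};
const brokenStore: OrderStore = {
  create: async () => {
    throw new Error('DB unavailable');
  },
};

const demo = async (): Promise<string[]> => [
  await makeCreateOrderHandler(workingStore)({ order: { drink: 'Latte', quantity: 1 } }),
  await makeCreateOrderHandler(brokenStore)({ order: { drink: 'Latte', quantity: 1 } }),
];

const outcome = demo();
outcome.then((statuses) => console.log(statuses));
// [ 'Order created successfully', 'Failed to create the order' ]
```

Returning a message instead of throwing lets the agent tell the user what happened rather than aborting the whole run.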

LangGraph Nodes (Workflow Components)

From a definition standpoint, a LangGraph node is a fundamental component of a LangGraph workflow, representing a single unit of computation or an individual step in an AI agent's process.

Each node can perform a specific task, such as generating a message, invoking a tool, or transforming data, and it interacts with the state to read inputs and write outputs. Together, nodes are connected to form the agent’s workflow or execution graph, allowing complex reasoning and multi-step operations.

In our project, the graph is built from four pieces: three nodes and one conditional edge.

Agent node: This node is in charge of interacting with the LLM. It constructs the agent’s main message template and combines old messages with the new prompt to create context.
Tools node: The tools node introduces external capabilities, which allow the workflow to interact with external APIs.
START node: This node marks the entry point of our workflow, that is, which node to call when a user initiates a conversation with the agent. It’s quite simple to define.
Conditional edge - addConditionalEdges('agent', shouldContinue): This lets the workflow branch dynamically after the 'agent' node runs, based on a condition defined in shouldContinue. Unlike a fixed edge, which always goes from one node to the next, a conditional edge evaluates the agent’s output and directs the workflow to different nodes depending on the result, allowing the AI agent to make decisions and adapt its next steps.

Graph Declaration

In LangGraph, a graph is the central structure that models an AI agent’s workflow as interconnected nodes, where each node represents a computation step, tool, or decision. It orchestrates the flow of data and control between nodes, manages conditional branching, and maintains the recursive loop of execution.

Essentially, the graph is the backbone that ensures complex, stateful interactions happen in a coordinated and modular way, connecting nodes like agent, tools, and conditional edges into a coherent workflow.

With that knowledge in place, we can now create the agent graph with all its nodes.

  const callModal = async (states: typeof graphState.State) => {
    // {order_format} and {drink_format} are template variables that are filled in
    // by formatMessages below. Interpolating the parser objects directly with ${...}
    // would only insert "[object Object]" into the prompt.
    const prompt = ChatPromptTemplate.fromMessages([
      [
        'system',
        `
            You are a helpful assistant that helps users order drinks from Starbucks.
            Your job is to take the user's request and fill in any missing details based on how a complete order should look.
            A complete order follows this structure: {order_format}

            **TOOLS**
            You have access to a "create_order" tool.
            Use this tool when the user confirms the final order.
            After calling the tool, you should inform the user whether the order was successfully created or if it failed.

            **DRINK DETAILS**
            Each drink has its own set of properties such as size, milk, syrup, sweetener, and toppings.
            Here is the drink schema: {drink_format}

            You must ask for any missing details before creating the order.

            If the user requests a modification that is not supported for the selected drink, tell them that it is not possible.

            If the user asks for something unrelated to drink orders, politely tell them that you can only assist with drink orders.

            **AVAILABLE OPTIONS**
            List of available drinks and their allowed modifications:
            ${DRINKS.map((drink) => `- ${createDrinkItemSummary(drink)}`).join('\n')}

            Sweeteners: ${createSweetenersSummary()}
            Toppings: ${availableToppingsSummary()}
            Milks: ${createAvailableMilksSummary()}
            Syrups: ${createSyrupsSummary()}
            Sizes: ${createSizesSummary()}

            Order schema: {order_format}

            If the user's query is unclear, tell them that the request is not clear.

            **ORDER CONFIRMATION**
            Once the order is ready, you must ask the user to confirm it.
            If they confirm, immediately call the "create_order" tool.
            Only respond after the tool completes, indicating success or failure.

            **FRONTEND RESPONSE FORMAT**
            Every response must include:

            "message": "Your message to the user",
            "current_order": "The order currently being constructed",
            "suggestions": "Options the user can choose from",
            "progress": "Order status ('completed' after creation)"

            **IMPORTANT RULES**
            - Be friendly, use emojis, and add humor.
            - Use null for unfilled fields.
            - Never omit the JSON tracking object.
        `,
      ],
      new MessagesPlaceholder('messages'),
    ]);

    // Fill the template variables with the parsers' format instructions
    // and attach the conversation history.
    const formattedPrompt = await prompt.formatMessages({
      order_format: OrderParser.getFormatInstructions(),
      drink_format: DrinkParser.getFormatInstructions(),
      messages: states.messages,
    });

    // Bind the tools so the model can request a create_order call.
    const chat = new ChatGoogleGenerativeAI({
      model: 'gemini-2.0-flash',
      temperature: 0,
      apiKey: GOOGLE_API_KEY,
    }).bindTools(tools);

    const result = await chat.invoke(formattedPrompt);
    return { messages: [result] };
  };
    const shouldContinue = (state: typeof graphState.State) => {
      const lastMessage = state.messages[
        state.messages.length - 1
      ] as AIMessage;
      return lastMessage.tool_calls?.length ? 'tools' : END;
    };

    const toolsNode = new ToolNode(tools);

    /**
     * Build the conversation graph.
     */
    const graph = new StateGraph(graphState)
      .addNode('agent', callModal)
      .addNode('tools', toolsNode)
      .addEdge(START, 'agent')
      .addConditionalEdges('agent', shouldContinue)
      .addEdge('tools', 'agent');

Explanation

Graph State (graphState)
The graphState object is the shared memory across all nodes. It stores messages, which track the conversation history including user inputs, AI responses, and tool interactions. The reducer [...x, ...y] appends new messages, preserving past context. This is similar to React state updates: old messages remain while new ones are added.
Agent Node (callModal)

This node handles the LLM call. It formats a prompt containing system instructions, drink schemas, available tools, and frontend response rules. By including states.messages, the AI sees the full conversation history, enabling multi-turn dialogue.

LLM Execution

ChatGoogleGenerativeAI generates the AI response. .bindTools(tools) allows the AI to call tools like create_order directly if needed.

Conditional Flow (shouldContinue)

After the AI responds, the shouldContinue function checks if the message includes tool calls. If so, execution moves to the tools node; otherwise, the workflow ends. This allows dynamic branching depending on the AI’s output.

Tool Node (ToolNode)

The tools node executes the requested tool, such as saving the order to the database. Once completed, control returns to the agent node, enabling the AI to respond to the user with results.

Graph Construction (StateGraph)

Nodes are connected in a coherent workflow:
- START → agent begins the conversation
- Conditional edges handle tool execution
- tools → agent ensures the agent can respond after tools run

Overall Flow

Together, the graph and shared state ensure a stateful, multi-turn conversation. The AI can ask for missing details, call tools when needed, and maintain context across interactions. Every node reads and writes to the same state.
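
The control flow described above can be sketched without LangGraph at all. The following is a simplified simulation, not the library's implementation: fakeAgent, fakeTools, route, runGraph, and the message types are all hypothetical stand-ins, and the step cap only loosely mirrors the idea of a recursion limit.

```typescript
// Simplified stand-ins for the real message and state types (illustrative only).
type ToolCall = { name: string; args: Record<string, unknown> };
type SimMessage = { role: 'human' | 'ai' | 'tool'; content: string; toolCalls?: ToolCall[] };
type SimState = { messages: SimMessage[] };

const END_NODE = '__end__';

// Fake agent: requests a tool once, then answers after it sees a tool result.
function fakeAgent(state: SimState): SimMessage {
  const sawToolResult = state.messages.some((m) => m.role === 'tool');
  return sawToolResult
    ? { role: 'ai', content: 'Your order has been created!' }
    : { role: 'ai', content: '', toolCalls: [{ name: 'create_order', args: { drink: 'Latte' } }] };
}

// Fake tools node: pretends to run every tool the last AI message asked for.
function fakeTools(state: SimState): SimMessage[] {
  const last = state.messages[state.messages.length - 1];
  return (last.toolCalls ?? []).map((call) => ({ role: 'tool' as const, content: `${call.name}: ok` }));
}

// Same decision as shouldContinue: tool calls present means 'tools', otherwise end.
function route(state: SimState): string {
  const last = state.messages[state.messages.length - 1];
  return last.toolCalls?.length ? 'tools' : END_NODE;
}

function runGraph(input: SimMessage, maxSteps = 15): { state: SimState; visited: string[] } {
  const state: SimState = { messages: [input] };
  const visited: string[] = [];
  let node = 'agent'; // START -> agent
  for (let step = 0; step < maxSteps && node !== END_NODE; step++) {
    visited.push(node);
    if (node === 'agent') {
      state.messages = [...state.messages, fakeAgent(state)];
      node = route(state); // conditional edge
    } else {
      state.messages = [...state.messages, ...fakeTools(state)];
      node = 'agent'; // tools -> agent
    }
  }
  return { state, visited };
}

const run = runGraph({ role: 'human', content: 'One latte, please' });
console.log(run.visited.join(' -> ')); // agent -> tools -> agent
```

Every node reads the same state object and appends to it, which is why the second agent step can see the tool result and produce a final answer.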

Workflow Compilation and State Persistence (Final Part)

So far, all of our state is temporary: it only exists for the duration of a single request. However, we want our agent to remember the conversation context when a new request arrives with the same thread_id (conversation ID).

To achieve this, we’ll use MongoDB together with the @langchain/langgraph-checkpoint-mongodb library. This library simplifies state persistence by associating each conversation with a unique ID that you assign. All operations, from retrieving previous messages to saving new ones, are handled internally. You only need to provide the ID of the conversation you want to work with.

const graph = new StateGraph(graphState)
  .addNode('agent', callModal)
  .addNode('tools', toolsNode)
  .addEdge(START, 'agent')
  .addConditionalEdges('agent', shouldContinue)
  .addEdge('tools', 'agent');

// Persist graph state in MongoDB so each thread_id keeps its conversation history.
const checkpointer = new MongoDBSaver({ client, dbName: database_name });

const app = graph.compile({ checkpointer });

/**
 * Run the graph using the user's message.
 */
const finalState = await app.invoke(
  { messages: [new HumanMessage(query)] },
  { recursionLimit: 15, configurable: { thread_id } },
);

/**
 * Extract the JSON payload from the AI response.
 * The payload is expected inside a ```json fenced block.
 */
function extractJsonResponse(response: any) {
  // Check the type first: calling .match on a non-string would throw a TypeError.
  if (typeof response === 'string') {
    const match = response.match(/```json\s*([\s\S]*?)\s*```/i);
    if (match && match[1]) {
      return JSON.parse(match[1].trim());
    }
  }
  // No fenced JSON block found: throw the raw response, as before.
  throw response;
}

// The last message in the final state is the agent's reply to the user.
const lastMessage = finalState.messages.at(-1) as AIMessage;
return extractJsonResponse(lastMessage.content);

The above code demonstrates how to initialize a checkpointer, compile the graph with it, and invoke the agent with an incoming prompt.
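
To illustrate what keying state by thread_id means conceptually, here is an in-memory sketch. It is not how MongoDBSaver works internally: InMemoryCheckpointer, invokeWithMemory, and StoredMessage are hypothetical names, and a Map stands in for the database.

```typescript
type StoredMessage = { role: 'human' | 'ai'; content: string };

// Hypothetical in-memory stand-in for a checkpointer: one saved history per thread ID.
class InMemoryCheckpointer {
  private threads = new Map<string, StoredMessage[]>();

  load(threadId: string): StoredMessage[] {
    return this.threads.get(threadId) ?? [];
  }

  save(threadId: string, messages: StoredMessage[]): void {
    this.threads.set(threadId, [...messages]);
  }
}

// Toy "invoke": restore the thread's history, append the new turn, then persist it.
function invokeWithMemory(
  store: InMemoryCheckpointer,
  threadId: string,
  query: string,
): StoredMessage[] {
  const previous = store.load(threadId);
  const reply: StoredMessage = {
    role: 'ai',
    content: `This thread already held ${previous.length} message(s)`,
  };
  const next = [...previous, { role: 'human' as const, content: query }, reply];
  store.save(threadId, next);
  return next;
}

const memory = new InMemoryCheckpointer();
invokeWithMemory(memory, 'thread-a', 'I want a latte');
const secondTurn = invokeWithMemory(memory, 'thread-a', 'Make it large');
const otherThread = invokeWithMemory(memory, 'thread-b', 'Hello');

console.log(secondTurn.length); // 4
console.log(otherThread.length); // 2
```

The second call on thread-a sees the first turn, while thread-b starts from an empty history. That isolation per conversation ID is the behavior the checkpointer provides for the real graph.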

The extractJsonResponse function pulls out the structured payload that we instructed the LLM to include in every reply to the user.

As the main prompt template specifies, every response must include these four fields:

"message": "Your message to the user",
"current_order": "The order currently being constructed",
"suggestions": "Options the user can choose from",
"progress": "Order status ('completed' after creation)"

Every response from the LLM should look like this:

'```json\n' +
  '{\n' +
  '"message": "Got it! To make sure I get your order just right, can you clarify which coffee drink you\'d like? We have Latte, Cappuccino, Cold Brew, and Frappuccino. 😊",\n' +
  '"current_order": {\n' +
  '"drink": null,\n' +
  '"size": null,\n' +
  '"mil": null,\n' +
  '"syrup": null,\n' +
  '"sweeteners": null,\n' +
  '"toppings": null,\n' +
  '"quantity": null\n' +
  '},\n' +
  '"suggestions": [\n' +
  '"Latte",\n' +
  '"Cappuccino",\n' +
  '"Cold Brew",\n' +
  '"Frappuccino"\n' +
  '],\n' +
  '"progress": "incomplete"\n' +
  '}\n' +
  '```';

This structure allows the frontend to easily render the LLM response and track the state of the current order. This is more of a design choice and less of a convention.
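
Since the frontend depends on this shape, it can help to validate the parsed payload before rendering it. The sketch below is one possible approach: the AgentResponse interface and isAgentResponse guard are assumptions based on the four fields listed in the prompt, and the inner order fields are deliberately left loose.

```typescript
// Assumed response shape, derived from the four fields the prompt asks for.
interface AgentResponse {
  message: string;
  current_order: Record<string, unknown> | null;
  suggestions: string[] | null;
  progress: string | null;
}

// Type guard: returns true only when all four fields are present with a usable type.
function isAgentResponse(value: unknown): value is AgentResponse {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;

  const orderOk =
    v.current_order === null ||
    (typeof v.current_order === 'object' && !Array.isArray(v.current_order));
  const suggestionsOk =
    v.suggestions === null ||
    (Array.isArray(v.suggestions) && v.suggestions.every((s) => typeof s === 'string'));
  const progressOk = v.progress === null || typeof v.progress === 'string';

  return typeof v.message === 'string' && orderOk && suggestionsOk && progressOk;
}

const sample: unknown = JSON.parse(
  '{"message":"Which drink would you like?","current_order":{"drink":null},' +
    '"suggestions":["Latte","Cappuccino"],"progress":"incomplete"}',
);

if (isAgentResponse(sample)) {
  // Inside this branch, TypeScript knows the shape of `sample`.
  console.log(sample.suggestions?.join(', ')); // Latte, Cappuccino
}

console.log(isAgentResponse({ message: 'hi' })); // false
```

A guard like this lets the UI fall back to a plain text message when the model omits a field, instead of failing while rendering.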

Conclusion

Building an autonomous AI agent with LangChain and LangGraph allows you to combine the reasoning power of LLMs with practical tool execution and persistent memory. By defining schemas, parsing data into human-readable formats, and orchestrating workflows through nodes, you can create intelligent agents capable of handling real-world tasks—like our Starbucks barista.

With MongoDB integration for state persistence, your agent can maintain context across conversations, making interactions feel more natural and human-like. This approach opens the door to building more sophisticated, domain-specific AI assistants without starting from scratch.

In short: define your data, teach your agent how to reason, and let LangGraph orchestrate the magic. ☕🤖

Source code here: https://github.com/DjibrilM/langgraph-starbucks-agent

Resources

LangGraph documentation: https://docs.langchain.com/oss/javascript/langgraph/quickstart
Synergizing Reasoning and Acting in Language Models: https://arxiv.org/abs/2210.03629
