<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ RAG  - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ RAG  - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Mon, 25 May 2026 05:05:48 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/rag/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Self-Learning RAG System with Knowledge Reflection ]]>
                </title>
                <description>
                    <![CDATA[ Every RAG system I've seen — including the one I wrote a handbook about on this site — has the same fundamental problem. It doesn't learn. You ingest 500 documents. You ask a question. The system retr ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-self-learning-rag-system-with-knowledge-reflection/</link>
                <guid isPermaLink="false">69ebd821b463d4844c4f97e5</guid>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cloudflare ]]>
                    </category>
                
                    <category>
                        <![CDATA[ TypeScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ webdev ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Daniel Nwaneri ]]>
                </dc:creator>
                <pubDate>Fri, 24 Apr 2026 20:52:49 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/d4567606-0d92-434c-8fd1-6137549350cf.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every RAG system I've seen — including the one I wrote a handbook about on this site — has the same fundamental problem.</p>
<p>It doesn't learn.</p>
<p>You ingest 500 documents. You ask a question. The system retrieves the three most similar chunks and hands them to the LLM. Repeat for the next query.</p>
<p>The system knows exactly as much as it did on day one. It's a library that never builds a card catalog, never cross-references its own shelves, never notices that three of its books are saying contradictory things.</p>
<p>That's what I set out to fix with a knowledge reflection layer. After every ingest, the system finds semantically related documents already in the index and asks an LLM to synthesise what's new, how it connects, and what gap remains. That synthesis gets embedded, stored, and boosted in search results.</p>
<p>The knowledge base gets smarter as you add more documents — not just bigger.</p>
<p>This tutorial shows you exactly how to build it.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-what-you-will-build">What You Will Build</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-the-base-system">How to Set Up the Base System</a></p>
</li>
<li><p><a href="#heading-why-standard-rag-has-a-memory-problem">Why Standard RAG Has a Memory Problem</a></p>
</li>
<li><p><a href="#heading-step-1-schema-update">Step 1: Schema Update</a></p>
</li>
<li><p><a href="#heading-step-2-the-reflection-engine">Step 2: The Reflection Engine</a></p>
</li>
<li><p><a href="#heading-step-3-consolidation">Step 3: Consolidation</a></p>
</li>
<li><p><a href="#heading-step-4-wire-it-into-your-ingest-handler">Step 4: Wire It Into Your Ingest Handler</a></p>
</li>
<li><p><a href="#heading-step-5-boost-reflections-in-search">Step 5: Boost Reflections in Search</a></p>
</li>
<li><p><a href="#heading-step-6-filtering-by-doctype">Step 6: Filtering by doc_type</a></p>
</li>
<li><p><a href="#heading-what-changes-after-you-build-this">What Changes After You Build This</a></p>
</li>
<li><p><a href="#heading-deploying">Deploying</a></p>
</li>
<li><p><a href="#heading-what-to-build-next">What to Build Next</a></p>
</li>
</ol>
<h2 id="heading-what-you-will-build">What You Will Build</h2>
<p>In this tutorial, you'll build a post-ingest reflection pipeline that:</p>
<ol>
<li><p>Fires automatically after every document ingest</p>
</li>
<li><p>Finds the most semantically related documents already in the index</p>
</li>
<li><p>Asks Kimi K2.5 to synthesise a three-sentence insight linking the new document to existing knowledge</p>
</li>
<li><p>Stores that reflection with <code>doc_type=reflection</code> and a 1.5× ranking boost in search results</p>
</li>
<li><p>Consolidates reflections into summaries once three or more recent reflections have accumulated</p>
</li>
</ol>
<p>By the end, searching your knowledge base will surface both raw document chunks and reflection artifacts the system wrote on ingest.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You will need:</p>
<ul>
<li><p>A Cloudflare account — free tier works</p>
</li>
<li><p>Node.js v18+ and Wrangler CLI installed (<code>npm install -g wrangler</code>)</p>
</li>
<li><p>Basic TypeScript familiarity</p>
</li>
</ul>
<p>No external API keys. Everything runs on Cloudflare's infrastructure.</p>
<h2 id="heading-how-to-set-up-the-base-system">How to Set Up the Base System</h2>
<p>If you have already built the RAG system from my <a href="https://www.freecodecamp.org/news/build-a-production-rag-system-with-cloudflare-workers-handbook">freeCodeCamp handbook</a>, skip this section — your system is ready for the reflection layer.</p>
<p>If you're starting fresh, this section gets you to a working base in about 15 minutes.</p>
<h3 id="heading-scaffold-the-project">Scaffold the Project</h3>
<pre><code class="language-bash">npm create cloudflare@latest rag-reflection-system
cd rag-reflection-system
</code></pre>
<p>Choose: Hello World example → TypeScript → No deploy yet.</p>
<h3 id="heading-create-the-vectorize-index-and-d1-database">Create the Vectorize Index and D1 Database</h3>
<pre><code class="language-bash">npx wrangler vectorize create rag-index --dimensions=384 --metric=cosine
npx wrangler d1 create rag-db
</code></pre>
<h3 id="heading-configure-wranglertoml">Configure wrangler.toml</h3>
<pre><code class="language-toml">name = "rag-reflection-system"
main = "src/index.ts"
compatibility_date = "2026-01-01"

[[vectorize]]
binding = "VECTORIZE"
index_name = "rag-index"

[[d1_databases]]
binding = "DB"
database_name = "rag-db"
database_id = "YOUR_DB_ID"

[ai]
binding = "AI"
</code></pre>
<h3 id="heading-create-the-documents-table">Create the <code>documents</code> Table</h3>
<pre><code class="language-sql">-- migrations/001_init.sql
CREATE TABLE IF NOT EXISTS documents (
  id TEXT PRIMARY KEY,
  content TEXT NOT NULL,
  source TEXT,
  date_created TEXT DEFAULT (datetime('now'))
);
</code></pre>
<pre><code class="language-bash">npx wrangler d1 execute rag-db --remote --file=./migrations/001_init.sql
</code></pre>
<h3 id="heading-add-the-ingest-and-search-endpoints">Add the <code>ingest</code> and <code>search</code> Endpoints</h3>
<p>Replace <code>src/index.ts</code> with this minimal working system:</p>
<pre><code class="language-typescript">export interface Env {
  VECTORIZE: VectorizeIndex;
  DB: D1Database;
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise&lt;Response&gt; {
    const url = new URL(request.url);

    if (url.pathname === '/ingest' &amp;&amp; request.method === 'POST') {
      const { id, content, source } = await request.json() as any;

      const embResult = await env.AI.run('@cf/baai/bge-small-en-v1.5', {
        text: [content.slice(0, 512)],
      }) as any;
      const vector = embResult.data[0];

      await env.VECTORIZE.upsert([{
        id,
        values: vector,
        metadata: { content: content.slice(0, 1000), source, doc_type: 'raw' },
      }]);

      await env.DB.prepare(
        'INSERT OR REPLACE INTO documents (id, content, source) VALUES (?, ?, ?)'
      ).bind(id, content, source ?? '').run();

      return Response.json({ success: true, id });
    }

    if (url.pathname === '/search' &amp;&amp; request.method === 'POST') {
      const { query } = await request.json() as any;

      const embResult = await env.AI.run('@cf/baai/bge-small-en-v1.5', {
        text: [query],
      }) as any;
      const vector = embResult.data[0];

      const results = await env.VECTORIZE.query(vector, {
        topK: 5,
        returnMetadata: 'all',
      });

      const context = results.matches
        .map(m =&gt; m.metadata?.content as string)
        .filter(Boolean)
        .join('\n\n');

      const answer = await env.AI.run('@cf/moonshotai/kimi-k2.5', {
        messages: [
          { role: 'system', content: 'Answer using only the context provided.' },
          { role: 'user', content: `Context:\n${context}\n\nQuestion: ${query}` },
        ],
        max_tokens: 256,
      }) as any;

      return Response.json({ answer: answer.response, sources: results.matches.map(m =&gt; m.id) });
    }

    return new Response('RAG system running', { status: 200 });
  },
};
</code></pre>
<h3 id="heading-deploy-and-verify">Deploy and Verify</h3>
<pre><code class="language-bash">npx wrangler deploy
</code></pre>
<p>Test it:</p>
<pre><code class="language-bash"># Ingest a document
curl -X POST https://your-worker.workers.dev/ingest \
  -H "Content-Type: application/json" \
  -d '{"id": "doc-001", "content": "Cursor pagination beats offset pagination for live-updating datasets because offset becomes unreliable when rows are inserted or deleted during pagination."}'

# Search
curl -X POST https://your-worker.workers.dev/search \
  -H "Content-Type: application/json" \
  -d '{"query": "what pagination approach should I use?"}'
</code></pre>
<p>If you get a grounded answer back, the base system is working. The next sections add the reflection layer on top of this foundation.</p>
<h2 id="heading-why-standard-rag-has-a-memory-problem">Why Standard RAG Has a Memory Problem</h2>
<p>Standard RAG retrieval is stateless. Every query goes in cold. The system has no memory of what it found before, no synthesis of what it learned across documents, and no growing understanding of what questions remain unanswered.</p>
<p>Imagine you've ingested 200 documents about your product. Twelve of them touch on a pricing decision made last year. No single one has the full picture — it's distributed across quarterly reports, meeting notes, an internal Slack export, a few Notion pages.</p>
<p>A user asks: "Why did we change our pricing structure?"</p>
<p>Standard RAG retrieves the three most similar chunks. If those three chunks collectively have the answer, great. If they don't — if the real answer requires synthesising across those twelve documents — the system has no mechanism for that. It returns fragments. The LLM makes its best guess.</p>
<p>The reflection layer addresses this directly. When the twelfth pricing document gets ingested, the system finds the eleven related documents, synthesises what connects them, and stores that synthesis as a retrievable artifact. The answer to "why did we change our pricing structure" exists in the index before anyone asks the question.</p>
<p>Not smarter retrieval — smarter indexing.</p>
<h2 id="heading-step-1-schema-update">Step 1: Schema Update</h2>
<p>The reflection layer needs three new columns in your D1 <code>documents</code> table. Run this migration:</p>
<pre><code class="language-sql">-- migrations/003_add_reflection_fields.sql
ALTER TABLE documents ADD COLUMN doc_type TEXT DEFAULT 'raw';
ALTER TABLE documents ADD COLUMN reflection_score REAL DEFAULT 0;
-- parent_id links a reflection back to the document that triggered it
ALTER TABLE documents ADD COLUMN parent_id TEXT;
</code></pre>
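<p>Apply it, replacing <code>mcp-knowledge-db</code> with your own database name (<code>rag-db</code> if you followed the setup above):</p>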
<pre><code class="language-bash">wrangler d1 execute mcp-knowledge-db --remote --file=./migrations/003_add_reflection_fields.sql
</code></pre>
<p><code>doc_type</code> distinguishes raw documents (<code>raw</code>), single-document reflections (<code>reflection</code>), and consolidated multi-reflection summaries (<code>summary</code>). You'll use this field to filter — exposing only reflections to users who want the distilled view, or excluding them for users who want raw source chunks.</p>
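<p>If you want the compiler to keep those three values straight, you can model them as a union type. This is a minimal sketch, not code from the repo: <code>DOC_TYPES</code>, <code>isDocType</code>, and <code>isDerived</code> are hypothetical names introduced here for illustration.</p>
<pre><code class="language-typescript">// Hypothetical helpers (not from the original codebase).
// The three values mirror the doc_type column added in the migration above.
const DOC_TYPES = ['raw', 'reflection', 'summary'] as const;
type DocType = (typeof DOC_TYPES)[number];

// Narrow an untrusted value (request body, metadata field) to a DocType.
function isDocType(value: unknown): value is DocType {
  return (DOC_TYPES as readonly unknown[]).includes(value);
}

// Derived artifacts are anything the system wrote itself.
function isDerived(docType: DocType): boolean {
  return docType !== 'raw';
}
</code></pre>
<p>Validating with <code>isDocType</code> at the edge of the system means a typo such as <code>'reflections'</code> gets rejected instead of silently matching nothing.</p>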
<h2 id="heading-step-2-the-reflection-engine">Step 2: The Reflection Engine</h2>
<p>Create <code>src/engines/reflection.ts</code>. This is the core of the layer.</p>
<pre><code class="language-typescript">import { Env } from '../types/env';
import { resolveEmbeddingModel, resolveReflectionModel } from '../config/models';

const REFLECTION_BOOST = 1.5;
const CONSOLIDATION_THRESHOLD = 3; // consolidate every N new reflections

export async function reflect(
  newDocId: string,
  newDocContent: string,
  env: Env
): Promise&lt;void&gt; {
  // 1. Find semantically related documents already in the index
  const embModel = resolveEmbeddingModel(env.EMBEDDING_MODEL);
  const embResult = await env.AI.run(embModel.id as any, {
    text: [newDocContent.slice(0, 512)],
  });
  const queryVector = (embResult as any).data?.[0];
  if (!queryVector) return;

  const related = await env.VECTORIZE.query(queryVector, {
    topK: 5,
    filter: { doc_type: { $eq: 'raw' } },
    returnMetadata: 'all',
  });

  const relatedDocs = (related.matches ?? []).filter(
    m =&gt; m.id !== newDocId &amp;&amp; (m.score ?? 0) &gt; 0.65
  );

  if (relatedDocs.length === 0) return; // nothing related yet — skip

  // 2. Build synthesis prompt
  const relatedSummaries = relatedDocs
    .slice(0, 3)
    .map((m, i) =&gt; `Document ${i + 1}: ${String(m.metadata?.content ?? '').slice(0, 300)}`)
    .join('\n\n');

  const prompt = `You are synthesising knowledge across documents in a knowledge base.

New document:
${newDocContent.slice(0, 600)}

Related existing documents:
${relatedSummaries}

Write exactly three sentences:
1. What the new document adds that the existing documents don't already cover
2. How the new document connects to or extends the existing documents
3. What gap or question remains unanswered across all these documents

Be specific. Reference actual content. Do not summarise — synthesise.`;

  // 3. Call the reflection model
  const reflModel = resolveReflectionModel(env.REFLECTION_MODEL);
  const llmResp = await env.AI.run(reflModel.id as any, {
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 180,
  });

  const reflectionText = (llmResp as any)?.response?.trim();
  if (!reflectionText || reflectionText.length &lt; 40) return;

  // 4. Embed and store the reflection
  const reflEmbResult = await env.AI.run(embModel.id as any, {
    text: [reflectionText],
  });
  const reflVector = (reflEmbResult as any).data?.[0];
  if (!reflVector) return;

  const reflectionId = `refl_${newDocId}_${Date.now()}`;

  await env.VECTORIZE.upsert([
    {
      id: reflectionId,
      values: reflVector,
      metadata: {
        content: reflectionText,
        doc_type: 'reflection',
        parent_id: newDocId,
        reflection_score: REFLECTION_BOOST,
        source_doc_ids: relatedDocs.map(m =&gt; m.id).join(','),
        date_created: new Date().toISOString(),
      },
    },
  ]);

  await env.DB.prepare(
    `INSERT INTO documents
     (id, content, doc_type, reflection_score, parent_id, date_created)
     VALUES (?, ?, 'reflection', ?, ?, ?)`
  )
    .bind(reflectionId, reflectionText, REFLECTION_BOOST, newDocId, new Date().toISOString())
    .run();

  // 5. Check if consolidation is due
  const recentCount = await env.DB
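    // datetime() normalises the stored ISO timestamp so it compares correctly with datetime('now')
    .prepare(`SELECT COUNT(*) as cnt FROM documents WHERE doc_type = 'reflection' AND datetime(date_created) &gt; datetime('now', '-1 hour')`)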
    .first&lt;{ cnt: number }&gt;();

  if ((recentCount?.cnt ?? 0) &gt;= CONSOLIDATION_THRESHOLD) {
    await consolidate(env);
  }
}
</code></pre>
<p>Two things worth noting here.</p>
<p>First, the semantic threshold (<code>score &gt; 0.65</code>) matters. Too low and you're synthesising unrelated documents. Too high and you're rarely finding connections. 0.65 works well with <code>bge-small</code>. You can bump it to 0.72 with <code>qwen3-0.6b</code> (1024d) where scores cluster higher.</p>
<p>Second, the prompt structure is deliberate. Three sentences, each doing a specific job: what's new, how it connects, what remains. This keeps reflections useful for retrieval. A freeform synthesis prompt produces beautiful prose that doesn't retrieve well. This structure produces retrievable artifacts.</p>
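<p>The engine imports <code>resolveEmbeddingModel</code> and <code>resolveReflectionModel</code> from <code>../config/models</code>, which the minimal base system above doesn't define. If you're starting fresh, something like the following would satisfy those imports. Treat it as a sketch under assumptions: the <code>ModelSpec</code> shape and the fallback behaviour are my guesses at a reasonable design, and your <code>Env</code> type would also need optional <code>EMBEDDING_MODEL</code> and <code>REFLECTION_MODEL</code> string fields. The two model IDs are the ones the base system already uses.</p>
<pre><code class="language-typescript">// src/config/models.ts (hypothetical sketch; names and shape are assumptions)
export interface ModelSpec {
  id: string;          // Workers AI model identifier
  dimensions?: number; // embedding width; must match the Vectorize index
}

// Defaults taken from the base system: 384-dimension embeddings and Kimi K2.5.
const DEFAULT_EMBEDDING: ModelSpec = { id: '@cf/baai/bge-small-en-v1.5', dimensions: 384 };
const DEFAULT_REFLECTION: ModelSpec = { id: '@cf/moonshotai/kimi-k2.5' };

// Use the configured model id when one is set, otherwise fall back to the default.
export function resolveEmbeddingModel(configured?: string): ModelSpec {
  return configured ? { id: configured } : DEFAULT_EMBEDDING;
}

export function resolveReflectionModel(configured?: string): ModelSpec {
  return configured ? { id: configured } : DEFAULT_REFLECTION;
}
</code></pre>
<p>If you override the embedding model, remember that the index was created with <code>--dimensions=384</code>. A model with a different output size needs a new index.</p>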
<h2 id="heading-step-3-consolidation">Step 3: Consolidation</h2>
<p>As reflections accumulate, they need their own synthesis layer — otherwise you're adding noise at a higher abstraction level.</p>
<p>Add this to <code>src/engines/reflection.ts</code>:</p>
<pre><code class="language-typescript">export async function consolidate(env: Env): Promise&lt;void&gt; {
  // Fetch recent reflections not yet consolidated
  const recent = await env.DB
    .prepare(
      `SELECT id, content FROM documents
       WHERE doc_type = 'reflection'
       AND id NOT IN (
         SELECT DISTINCT parent_id FROM documents
         WHERE doc_type = 'summary' AND parent_id IS NOT NULL
       )
       ORDER BY date_created DESC
       LIMIT 6`
    )
    .all&lt;{ id: string; content: string }&gt;();

  if (!recent.results || recent.results.length &lt; CONSOLIDATION_THRESHOLD) return;

  const reflectionTexts = recent.results.map((r, i) =&gt; `Reflection ${i + 1}: ${r.content}`).join('\n\n');

  const prompt = `You are consolidating multiple knowledge reflections into a single compressed insight.

${reflectionTexts}

Write two to three sentences that capture the most important cross-cutting pattern or tension across these reflections. What does the knowledge base now understand that it didn't before these documents were added? What's the most important open question?

Be precise. No preamble.`;

  const reflModel = resolveReflectionModel(env.REFLECTION_MODEL);
  const llmResp = await env.AI.run(reflModel.id as any, {
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 320,
  });

  const summaryText = (llmResp as any)?.response?.trim();
  if (!summaryText || summaryText.length &lt; 40) return;

  const embModel = resolveEmbeddingModel(env.EMBEDDING_MODEL);
  const embResult = await env.AI.run(embModel.id as any, { text: [summaryText] });
  const summaryVector = (embResult as any).data?.[0];
  if (!summaryVector) return;

  const summaryId = `summary_${Date.now()}`;

  await env.VECTORIZE.upsert([
    {
      id: summaryId,
      values: summaryVector,
      metadata: {
        content: summaryText,
        doc_type: 'summary',
        reflection_score: REFLECTION_BOOST * 1.2,
        source_reflection_ids: recent.results.map(r =&gt; r.id).join(','),
        date_created: new Date().toISOString(),
      },
    },
  ]);

  await env.DB.prepare(
    `INSERT INTO documents (id, content, doc_type, reflection_score, date_created)
     VALUES (?, ?, 'summary', ?, ?)`
  )
    .bind(summaryId, summaryText, REFLECTION_BOOST * 1.2, new Date().toISOString())
    .run();
}
</code></pre>
<p>Summaries get a 1.2× multiplier on top of the base reflection boost, for an effective boost of 1.8× (1.5 × 1.2). In search results, a summary synthesising twelve related documents should rank above any single document chunk on broad conceptual queries. On specific factual queries, the raw chunks will score higher. The ranking sorts itself.</p>
<h2 id="heading-step-4-wire-it-into-your-ingest-handler">Step 4: Wire It Into Your Ingest Handler</h2>
<p>The reflection runs as a background job. It doesn't block the ingest response — that would add 2–3 seconds to every ingest call.</p>
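<p>In your <code>src/handlers/ingest.ts</code> (or the <code>/ingest</code> branch of <code>src/index.ts</code> if you built the minimal base system above, where <code>documentId</code> is the <code>id</code> from the request body), after you've stored the document:</p>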
<pre><code class="language-typescript">import { reflect } from '../engines/reflection';

// ... existing ingest logic ...

// After VECTORIZE.upsert() and DB insert succeed:
ctx.waitUntil(
  reflect(documentId, content, env).catch(err =&gt; {
    console.warn('[reflection] failed for', documentId, err.message);
  })
);

return new Response(JSON.stringify({
  success: true,
  documentId,
  chunks: chunkCount,
  // ... rest of response
}), { headers: { 'Content-Type': 'application/json' } });
</code></pre>
<p><code>ctx.waitUntil()</code> is the Cloudflare Workers primitive for background work. The response returns immediately. The reflection runs after. The ingest API stays fast.</p>
<p>The <code>.catch()</code> is important. A failed reflection should never fail an ingest. Raw documents are the source of truth. Reflections are derived value — useful, but not critical path.</p>
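<p>If you end up with more than one background job, the same pattern can be pulled into a small helper. This is a sketch rather than code from the repo: <code>runInBackground</code> and the <code>BackgroundContext</code> interface are hypothetical names, and the interface only describes the one method of <code>ExecutionContext</code> the helper needs.</p>
<pre><code class="language-typescript">// Minimal shape of the part of ExecutionContext this sketch relies on.
interface BackgroundContext {
  waitUntil(promise: Promise&lt;unknown&gt;): void;
}

// Hypothetical helper: run a task after the response is sent and swallow
// its errors, so a failed reflection never fails the ingest.
function runInBackground(
  ctx: BackgroundContext,
  label: string,
  task: () =&gt; Promise&lt;void&gt;
): void {
  ctx.waitUntil(
    // Promise.resolve().then(task) also catches errors thrown synchronously by task.
    Promise.resolve()
      .then(task)
      .catch((err: unknown) =&gt; {
        const message = err instanceof Error ? err.message : String(err);
        console.warn(`[${label}] failed:`, message);
      })
  );
}
</code></pre>
<p>In the ingest handler, the call would then read <code>runInBackground(ctx, 'reflection', () =&gt; reflect(documentId, content, env))</code>.</p>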
<h2 id="heading-step-5-boost-reflections-in-search">Step 5: Boost Reflections in Search</h2>
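<p>Add the reflection boost to your ranking logic in <code>src/engines/hybrid.ts</code>, after reciprocal rank fusion (RRF) and before returning results. If you built the minimal base system above, which has no fusion step, apply the same logic to the Vectorize matches in the <code>/search</code> branch, reading <code>doc_type</code> and <code>reflection_score</code> from each match's <code>metadata</code>:</p>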
<pre><code class="language-typescript">// Apply reflection boost
const boosted = results.map(r =&gt; ({
  ...r,
  score: r.doc_type === 'reflection' || r.doc_type === 'summary'
    ? r.score * (r.reflection_score ?? 1.5)
    : r.score,
}));

return boosted.sort((a, b) =&gt; b.score - a.score);
</code></pre>
<p>This is a post-fusion boost, not a pre-fusion rerank. The reasoning: apply RRF across all results first, so reflections earn their place on raw relevance before getting boosted. A reflection that would not rank in the top 20 on raw similarity shouldn't appear just because it has a boost multiplier.</p>
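<p>Here is the same boost as a standalone function, which may be easier to test in isolation. It's a sketch under assumptions: <code>ScoredResult</code>, <code>MatchLike</code>, <code>toScored</code>, and <code>applyReflectionBoost</code> are illustrative names, and <code>toScored</code> assumes matches carry the <code>doc_type</code> and <code>reflection_score</code> metadata written in Step 2.</p>
<pre><code class="language-typescript">interface ScoredResult {
  id: string;
  score: number;
  doc_type?: string;         // 'raw' | 'reflection' | 'summary'
  reflection_score?: number; // multiplier stored when the artifact was written
}

// Minimal shape of a vector match; only the fields this sketch reads.
interface MatchLike {
  id: string;
  score: number;
  metadata?: Record&lt;string, unknown&gt;;
}

const DEFAULT_BOOST = 1.5; // mirrors REFLECTION_BOOST in the engine

// Flatten a match so the boost logic can read doc_type directly.
function toScored(m: MatchLike): ScoredResult {
  const docType = m.metadata?.doc_type;
  const boost = m.metadata?.reflection_score;
  return {
    id: m.id,
    score: m.score,
    doc_type: typeof docType === 'string' ? docType : undefined,
    reflection_score: typeof boost === 'number' ? boost : undefined,
  };
}

// Multiply derived artifacts by their stored boost, then re-sort by score.
function applyReflectionBoost(results: ScoredResult[]): ScoredResult[] {
  return results
    .map(r =&gt; {
      const derived = r.doc_type === 'reflection' || r.doc_type === 'summary';
      return derived
        ? { ...r, score: r.score * (r.reflection_score ?? DEFAULT_BOOST) }
        : r;
    })
    .sort((a, b) =&gt; b.score - a.score);
}
</code></pre>
<p>With these numbers, a reflection at raw similarity 0.60 overtakes a raw chunk at 0.80, while one at 0.50 does not. That is the behaviour you want: the boost tips close calls, it doesn't rescue weak matches.</p>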
<h2 id="heading-step-6-filtering-by-doctype">Step 6: Filtering by <code>doc_type</code></h2>
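<p>Your search endpoint should accept a <code>doc_type</code> filter so callers can control what they see. Vectorize filters only on properties that have a metadata index, so create one for <code>doc_type</code> before ingesting: <code>npx wrangler vectorize create-metadata-index rag-index --property-name=doc_type --type=string</code>. Then read the filter from the request and pass it to the query:</p>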
<pre><code class="language-typescript">// In your search request handler:
const docTypeFilter = body.filters?.doc_type;

// Pass to Vectorize query:
const vectorFilter: Record&lt;string, unknown&gt; = {};
if (docTypeFilter) {
  vectorFilter.doc_type = docTypeFilter;
}
</code></pre>
<p>This gives callers three modes:</p>
<pre><code class="language-bash"># Only reflections and summaries
POST /search
{ "query": "pricing decisions", "filters": { "doc_type": { "$in": ["reflection", "summary"] } } }

# Only source documents
POST /search
{ "query": "pricing decisions", "filters": { "doc_type": { "$eq": "raw" } } }

# Default: all types, reflections boosted
POST /search
{ "query": "pricing decisions" }
</code></pre>
<p>The default (no filter) is the most useful. Let the boost do its job. Restrict to raw when you need citations. Restrict to reflections when you want the synthesised view.</p>
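<p>Since the filter comes straight from the request body, it's worth validating before it reaches Vectorize. The following is one way to do that, assuming you only want to allow the <code>$eq</code> and <code>$in</code> operators shown above. <code>buildVectorFilter</code> and <code>DocTypeFilter</code> are hypothetical names, not part of the repo.</p>
<pre><code class="language-typescript">const ALLOWED_DOC_TYPES = new Set(['raw', 'reflection', 'summary']);

type DocTypeFilter = { $eq: string } | { $in: string[] };

// Hypothetical helper: turn the request's filters into a vector query filter,
// rejecting unknown operators and unknown doc_type values.
function buildVectorFilter(
  body: { filters?: { doc_type?: unknown } }
): Record&lt;string, DocTypeFilter&gt; {
  const raw = body.filters?.doc_type;
  if (raw === undefined || raw === null) return {}; // default: all types
  if (typeof raw !== 'object') throw new Error('doc_type filter must be an object');

  const f = raw as { $eq?: unknown; $in?: unknown };
  if (typeof f.$eq === 'string' &amp;&amp; ALLOWED_DOC_TYPES.has(f.$eq)) {
    return { doc_type: { $eq: f.$eq } };
  }
  if (
    Array.isArray(f.$in) &amp;&amp;
    f.$in.length &gt; 0 &amp;&amp;
    f.$in.every(v =&gt; typeof v === 'string' &amp;&amp; ALLOWED_DOC_TYPES.has(v))
  ) {
    return { doc_type: { $in: f.$in as string[] } };
  }
  throw new Error('Unsupported doc_type filter');
}
</code></pre>
<p>An empty object means "no filter", so the default mode falls out naturally. Anything else either passes through unchanged or fails loudly with an error you can map to a 400 response.</p>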
<h2 id="heading-what-changes-after-you-build-this">What Changes After You Build This</h2>
<p>At 200 documents, the difference becomes noticeable. Queries that previously returned five fragmented chunks now surface a reflection that already synthesised those chunks. Broad conceptual queries — "what do we know about X?" — start returning genuinely useful summaries instead of just the most-similar individual paragraph.</p>
<p>At 2,000 documents, the reflection layer is the most valuable part of the system. The raw chunks answer specific factual questions. The reflections and summaries answer conceptual questions that could not be answered from any single document. The system has learned something no individual document contains.</p>
<p>One failure mode worth knowing: if your embedding model has poor semantic clustering — old <code>bge-small</code> at 384d with mixed-domain documents — the related-documents retrieval step will surface weak connections and produce shallow reflections. The 0.65 threshold filters most of this out, but if you're seeing reflections that seem off-topic, your embeddings are the first thing to check.</p>
<h2 id="heading-deploying">Deploying</h2>
<pre><code class="language-bash"># Use your own database name here (rag-db if you followed the setup section above)
wrangler d1 execute mcp-knowledge-db --remote --file=./migrations/003_add_reflection_fields.sql
wrangler deploy
</code></pre>
<p>Then ingest a few documents and watch what happens:</p>
<pre><code class="language-bash"># Ingest document 1
curl -X POST https://your-worker.workers.dev/ingest \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"id": "doc-001", "content": "Your document text here..."}'

# After a few seconds, check if a reflection was created
curl "https://your-worker.workers.dev/search" \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "your topic", "filters": {"doc_type": {"$eq": "reflection"}}}'
</code></pre>
<p>Reflections won't appear until there are related documents to synthesise. Ingest at least three documents on similar topics before expecting to see them.</p>
<h2 id="heading-what-to-build-next">What to Build Next</h2>
<p>The reflection layer as described here fires after every ingest. That's expensive at high ingest volume: if you're batch-importing 10,000 documents, you don't want 10,000 individual reflection calls.</p>
<p>For bulk ingestion, gate it: call <code>reflect()</code> only when a document's similarity search returns a match above 0.8, or batch-run reflection after the bulk import completes. The <code>POST /ingest/batch</code> endpoint in the <a href="https://github.com/dannwaneri/vectorize-mcp-worker">full repo</a> does this.</p>
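<p>The gating check itself is small. This is a minimal sketch of the idea, not the repo's batch endpoint: <code>shouldReflect</code> and <code>SimilarityMatch</code> are hypothetical names, and it assumes you already have the similarity matches for the new document in hand.</p>
<pre><code class="language-typescript">interface SimilarityMatch {
  id: string;
  score: number;
}

const BULK_REFLECTION_GATE = 0.8; // stricter than the 0.65 used inside reflect()

// Hypothetical gate: only reflect when some other document is a strong match.
function shouldReflect(
  newDocId: string,
  matches: SimilarityMatch[],
  gate: number = BULK_REFLECTION_GATE
): boolean {
  // Ignore the document matching itself; it will always score near 1.0.
  return matches.some(m =&gt; m.id !== newDocId &amp;&amp; m.score &gt; gate);
}
</code></pre>
<p>During a bulk import you would call <code>reflect()</code> only when <code>shouldReflect()</code> returns <code>true</code>, which skips the LLM call for documents that have no strong neighbours yet.</p>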
<p>The second thing worth building: surfacing reflections in your UI with a visual distinction. A search result that's a reflection should look different from a raw chunk. In the dashboard included in the repo, reflections render with a <code>💡</code> badge and a "synthesised from N documents" note.</p>
<p>Full source at <a href="https://github.com/dannwaneri/vectorize-mcp-worker">github.com/dannwaneri/vectorize-mcp-worker</a> — reflection engine, consolidation, batch ingest, dashboard, OpenAPI spec.</p>
<p>The codebase is TypeScript, deploys with a single <code>wrangler deploy</code>, runs for roughly $1–5/month at 10,000 queries/day.</p>
<p>Standard RAG retrieves. This learns.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Production RAG System with Cloudflare Workers – a Handbook for Devs ]]>
                </title>
                <description>
                    <![CDATA[ Most RAG tutorials show you a working demo and call it done. You copy the code, it runs locally, and then you try to put it in production and everything falls apart. This tutorial is different. I run  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-production-rag-system-with-cloudflare-workers-handbook/</link>
                <guid isPermaLink="false">69bb2fa98c55d6eefb6ce907</guid>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ webdev ]]>
                    </category>
                
                    <category>
                        <![CDATA[ TypeScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cloudflare ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Daniel Nwaneri ]]>
                </dc:creator>
                <pubDate>Wed, 18 Mar 2026 23:05:13 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/cc3556fb-abe6-4aea-b9bd-83404319c1b9.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most RAG tutorials show you a working demo and call it done. You copy the code, it runs locally, and then you try to put it in production and everything falls apart.</p>
<p>This tutorial is different. I run a production RAG system (<a href="https://github.com/dannwaneri/vectorize-mcp-worker">vectorize-mcp-worker</a>) that handles real traffic at a total cost of $5/month. The alternatives I evaluated ranged from $100–$200/month. The difference isn't magic. It's architecture.</p>
<p>Here, you'll build <code>rag-tutorial-simple</code>: a clean, minimal RAG chatbot deployed on Cloudflare Workers. No external API keys. No paid vector database subscriptions. No servers to manage. Just Cloudflare's free tier – Workers, Vectorize, and Workers AI – doing the heavy lifting at the edge.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-what-you-will-build">What You Will Build</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-rag-works">How RAG Works</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-your-project">How to Set Up Your Project</a></p>
</li>
<li><p><a href="#heading-how-to-build-the-data-pipeline">How to Build the Data Pipeline</a></p>
</li>
<li><p><a href="#heading-how-to-build-the-query-pipeline">How to Build the Query Pipeline</a></p>
</li>
<li><p><a href="#heading-how-to-add-error-handling-and-security">How to Add Error Handling and Security</a></p>
</li>
<li><p><a href="#heading-performance-and-cost-analysis">Performance and Cost Analysis</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-what-you-will-build">What You Will Build</h2>
<p>By the end of this tutorial, you'll have a globally deployed RAG API that:</p>
<ul>
<li><p>Accepts a natural language question via HTTP</p>
</li>
<li><p>Converts it to a vector embedding using Workers AI</p>
</li>
<li><p>Searches a knowledge base stored in Cloudflare Vectorize</p>
</li>
<li><p>Passes the retrieved context to an LLM (also on Workers AI) to generate an answer</p>
</li>
<li><p>Returns a grounded, accurate response (not a hallucination)</p>
</li>
</ul>
<p>The complete source code is available at <a href="https://github.com/dannwaneri/rag-tutorial-simple">github.com/dannwaneri/rag-tutorial-simple</a>.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>This is an intermediate-level tutorial. You should be comfortable with:</p>
<ul>
<li><p><strong>JavaScript/TypeScript</strong>: async/await, promises, basic types</p>
</li>
<li><p><strong>HTTP APIs</strong>: REST, request/response, JSON</p>
</li>
<li><p><strong>Command line basics</strong>: running npm commands, navigating directories</p>
</li>
</ul>
<p>You will need:</p>
<ul>
<li><p><strong>Node.js 18 or higher</strong>: check with <code>node --version</code></p>
</li>
<li><p><strong>A Cloudflare account</strong>: free tier is fine, sign up at <a href="https://dash.cloudflare.com/sign-up">cloudflare.com</a></p>
</li>
<li><p><strong>A code editor</strong>: VS Code recommended for TypeScript support</p>
</li>
</ul>
<p>That's it. No OpenAI key. No credit card for embeddings. Let's build.</p>
<h2 id="heading-how-rag-works">How RAG Works</h2>
<p>Before you write any code, you'll need a clear mental model of what you're building. This section explains the three core components of a RAG system, how data flows between them, and why this architecture works at scale.</p>
<h3 id="heading-the-mental-model">The Mental Model</h3>
<p>Think of a traditional LLM like a doctor who studied medicine for years but has been in a remote cabin with no internet since their graduation day. They are brilliant, but they only know what they knew when they left. Ask them about a drug approved last year and they'll either say they don't know or – worse – confidently give you wrong information.</p>
<p>RAG gives that doctor access to an up-to-date medical library. Before answering your question, they can look up the relevant pages, read them, and use that information to give you an accurate answer. Their training still matters (that is, they know how to read and interpret the information), but they're no longer limited to what they memorized years ago.</p>
<p>In technical terms, RAG works in three steps on every request:</p>
<ol>
<li><p><strong>Retrieve</strong>: find the most relevant documents from your knowledge base</p>
</li>
<li><p><strong>Augment</strong>: add those documents to the LLM prompt as context</p>
</li>
<li><p><strong>Generate</strong>: let the LLM produce an answer using both its training and the retrieved context</p>
</li>
</ol>
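<p>The three steps above can be sketched as one small function. This is a minimal illustration, not the Worker you will build later: <code>answerWithRag</code>, <code>Retriever</code>, and <code>Generator</code> are hypothetical names, and the retriever and generator are passed in so you can swap in simple stand-ins.</p>
<pre><code class="language-typescript">// Hypothetical types for illustration only.
type Retriever = (question: string) =&gt; Promise&lt;string[]&gt;;
type Generator = (prompt: string) =&gt; Promise&lt;string&gt;;

async function answerWithRag(
  question: string,
  retrieve: Retriever,
  generate: Generator,
): Promise&lt;string&gt; {
  // 1. Retrieve: find the documents most relevant to the question.
  const documents = await retrieve(question);

  // 2. Augment: place the retrieved text in the prompt as context.
  const prompt = `Context:\n${documents.join("\n\n")}\n\nQuestion: ${question}`;

  // 3. Generate: let the model answer from the augmented prompt.
  return generate(prompt);
}
</code></pre>
<p>In the finished Worker, the retriever role is played by Workers AI embeddings plus a Vectorize query, and the generator role by the LLM call.</p>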
<h3 id="heading-the-three-components">The Three Components</h3>
<p>Every RAG system has three moving parts. Understanding each one will help you debug problems and make better architectural decisions as you build.</p>
<h4 id="heading-the-embedding-model">The Embedding Model</h4>
<p>An embedding model converts text into a vector – an array of numbers that represents the meaning of that text. The model you will use in this tutorial, <code>@cf/baai/bge-base-en-v1.5</code>, outputs 768 numbers for any piece of text you give it.</p>
<p>The critical property of embeddings is that semantically similar text produces numerically similar vectors. "How do I install Node.js?" and "What's the process for setting up Node?" will produce vectors that are close together. "How do I install Node.js?" and "What is the capital of France?" will produce vectors that are far apart.</p>
<p>This is what makes semantic search possible. You aren't matching keywords, you're matching meaning.</p>
<p>One rule you must never break: your documents and your queries must be embedded with the same model. If you embed your documents with <code>bge-base-en-v1.5</code> and your queries with a different model, the vectors won't be comparable and your searches will return garbage.</p>
<h4 id="heading-the-vector-database">The Vector Database</h4>
<p>The vector database stores your embeddings and lets you search them by similarity. In this tutorial, you'll use Cloudflare Vectorize.</p>
<p>When you run a similarity search, you pass in a query vector and Vectorize returns the K most similar vectors it has stored, along with their metadata and similarity scores. This is called approximate nearest neighbor search, and Vectorize is optimized to do it fast even across millions of vectors.</p>
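<p>To make "return the K most similar vectors" concrete, here is a brute-force sketch over an in-memory array. It is not how Vectorize works internally: this loop is exact and linear in the number of vectors, while Vectorize uses an approximate index. <code>StoredVector</code>, <code>dotProduct</code>, and <code>topK</code> are hypothetical names, and the example assumes every vector has unit length.</p>
<pre><code class="language-typescript">// Hypothetical in-memory stand-in for a vector index.
interface StoredVector {
  id: string;
  values: number[];
}

// For unit-length vectors, the dot product equals the cosine similarity.
function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) =&gt; sum + x * b[i], 0);
}

// Score every stored vector against the query and keep the k best.
function topK(query: number[], stored: StoredVector[], k: number) {
  return stored
    .map((v) =&gt; ({ id: v.id, score: dotProduct(query, v.values) }))
    .sort((x, y) =&gt; y.score - x.score)
    .slice(0, k);
}

const matches = topK(
  [1, 0],
  [
    { id: "a", values: [0.8, 0.6] },
    { id: "b", values: [0, 1] },
    { id: "c", values: [0.6, 0.8] },
  ],
  2,
);
// matches holds "a" (score 0.8) then "c" (score 0.6); "b" scores 0 and is dropped.
</code></pre>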
<p>The key advantage of using Vectorize over an external vector database like Pinecone is co-location. Vectorize runs in the same Cloudflare network as your Worker. There's no external API call, no authentication roundtrip, and no trip across the public internet between your application and your database.</p>
<h4 id="heading-the-language-model">The Language Model</h4>
<p>The LLM is responsible for one thing: reading the retrieved context and generating a natural language answer. It doesn't search anything. It doesn't decide what's relevant. It just reads what you give it and writes a response.</p>
<p>This separation of concerns is intentional. The LLM is good at language: understanding questions, synthesizing information, writing clearly. The vector database is good at retrieval: finding relevant documents fast. RAG combines their strengths without asking either component to do something it is not designed for.</p>
<p>In this tutorial you'll use <code>@cf/meta/llama-3.3-70b-instruct-fp8-fast</code> through Workers AI. No API key required.</p>
<h3 id="heading-a-note-on-visual-embeddings">A Note on Visual Embeddings</h3>
<p>If you plan to extend this system to search images, you may be tempted to use a vision-language model like CLIP to generate visual embeddings (vectors that represent the image itself rather than a text description of it). This sounds clever but works worse for RAG in practice.</p>
<p>Visual embeddings match pixel similarity. They are good for "find images that look like this one." They are poor for "find the login screen" or "find dashboards showing error rates" because those queries are about meaning, not pixels.</p>
<p>The better approach – used in production – is to pass the image through a multimodal model like Llama 4 Scout, which generates a detailed text description and extracts visible text via OCR. You then embed that description using the same BGE model as your other documents.</p>
<p>The result lives in one unified index, works with your existing query pipeline, and produces better search results than visual embeddings for RAG use cases.</p>
<p>Cloudflare Workers AI does not support CLIP anyway. But even if it did, descriptions would outperform it for semantic search.</p>
<h3 id="heading-how-a-query-flows-through-the-system">How a Query Flows Through the System</h3>
<p>Here is exactly what happens when a user sends the question "What is RAG?" to your finished Worker:</p>
<ol>
<li><p><strong>Step 1 – Embed the question (20-30ms)</strong>: Your Worker calls Workers AI with the question text. The embedding model returns a 768-dimensional vector representing the meaning of the question.</p>
</li>
<li><p><strong>Step 2 – Search Vectorize (30-50ms)</strong>: Your Worker passes that vector to Vectorize, which searches your knowledge base and returns the 3 most similar documents with their similarity scores.</p>
</li>
<li><p><strong>Step 3 – Filter and build context (&lt; 1ms)</strong>: Documents with a similarity score of 0.5 or lower are discarded. The remaining document texts are joined into a context string.</p>
</li>
<li><p><strong>Step 4 – Generate the answer (500-1500ms)</strong>: Your Worker sends the context and the question to the LLM. The LLM reads the context and generates a grounded answer.</p>
</li>
<li><p><strong>Step 5 – Return to the user</strong>: The answer and source metadata are returned as JSON.</p>
</li>
</ol>
<p>Total time: typically 600-1600ms end to end. The LLM generation step dominates. Everything else is fast.</p>
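<p>The total follows from adding up the per-step estimates. Here is that arithmetic as a small sketch, using the ranges listed above. The step names are only labels, and Step 5 is left out because it has no estimate.</p>
<pre><code class="language-typescript">// Per-step latency estimates in milliseconds, taken from the list above.
const steps = [
  { name: "embed question", min: 20, max: 30 },
  { name: "search Vectorize", min: 30, max: 50 },
  { name: "filter and build context", min: 0, max: 1 },
  { name: "generate answer", min: 500, max: 1500 },
];

const totalMin = steps.reduce((sum, s) =&gt; sum + s.min, 0); // 550
const totalMax = steps.reduce((sum, s) =&gt; sum + s.max, 0); // 1581

// Share of the worst-case budget spent in LLM generation.
const llmShare = steps[3].max / totalMax; // about 0.95
</code></pre>
<p>The sums land at roughly 550 to 1580ms, which is consistent with the 600-1600ms figure, and they show how little of the budget retrieval consumes.</p>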
<h3 id="heading-why-this-works-at-scale">Why This Works at Scale</h3>
<p>A common objection to Cloudflare RAG is that it cannot meet sub-200ms retrieval requirements. That objection comes from a specific architectural mistake: trying to run the entire RAG pipeline, including heavy embedding generation and reranking, inside a single synchronous request. That's the wrong architecture.</p>
<p>The architecture you're building in this tutorial separates the loading step (which is slow and runs once) from the query step (which is fast and runs on every request). By the time a user asks a question, your documents are already embedded and stored. The query pipeline only needs to embed the question, run one vector search, and call the LLM. Those three steps are fast.</p>
<p>My production system (<a href="https://github.com/dannwaneri/vectorize-mcp-worker">vectorize-mcp-worker</a>) runs this architecture and handles real traffic at $5/month. The <a href="https://dev.to/dannwaneri/i-built-a-production-rag-system-for-5month-most-alternatives-cost-100-200-21hj">full performance breakdown is here</a>. Cloudflare RAG works. You just have to build it correctly.</p>
<h2 id="heading-how-to-set-up-your-project">How to Set Up Your Project</h2>
<p>In this section, you'll scaffold a Cloudflare Worker, create a Vectorize index to store your embeddings, and configure the bindings that connect them together.</p>
<h3 id="heading-how-to-create-the-project">How to Create the Project</h3>
<p>Open your terminal and create a new directory for the project.</p>
<p>On Mac/Linux:</p>
<pre><code class="language-bash">mkdir rag-tutorial-simple &amp;&amp; cd rag-tutorial-simple
</code></pre>
<p>On Windows PowerShell:</p>
<pre><code class="language-powershell">mkdir rag-tutorial-simple
cd rag-tutorial-simple
</code></pre>
<p>Then run the Cloudflare scaffolding tool:</p>
<pre><code class="language-bash">npm create cloudflare@latest
</code></pre>
<p>Answer the prompts like this:</p>
<ul>
<li><p><strong>Directory/app name</strong>: <code>rag-tutorial-simple</code></p>
</li>
<li><p><strong>What would you like to start with?</strong> Hello World example</p>
</li>
<li><p><strong>TypeScript?</strong> Yes</p>
</li>
<li><p><strong>Deploy?</strong> No</p>
</li>
</ul>
<p>When it finishes, you'll have a working TypeScript Worker with Wrangler already configured.</p>
<h3 id="heading-how-to-create-the-vectorize-index">How to Create the Vectorize Index</h3>
<p>Vectorize is Cloudflare's vector database. It lives in the same network as your Worker, which means no external API call and no added latency when you search it.</p>
<pre><code class="language-bash">npx wrangler vectorize create rag-tutorial-index --dimensions=768 --metric=cosine
</code></pre>
<p>Two things to note here.</p>
<p><code>--dimensions=768</code> tells Vectorize how many numbers make up each embedding. This must match the output of the embedding model you use. The model you will use (<code>@cf/baai/bge-base-en-v1.5</code>) outputs 768 dimensions. If the two numbers don't match, your inserts and queries will fail.</p>
<p><code>--metric=cosine</code> is how Vectorize measures similarity between vectors. Cosine similarity compares the angle between two vectors and ignores their length. For text embeddings, direction is what encodes meaning, which is why cosine is a common choice for semantic search.</p>
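<p>A small sketch of why the angle matters: scaling a vector changes its Euclidean distance to another vector but leaves the cosine similarity untouched. The helper functions below are written out for illustration only. Vectorize computes the metric for you.</p>
<pre><code class="language-typescript">function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) =&gt; sum + x * b[i], 0);
}

function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
}

function euclideanDistance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, x, i) =&gt; sum + (x - b[i]) ** 2, 0));
}

const short = [1, 2];
const long = [10, 20]; // same direction as `short`, ten times the magnitude

const cos = cosineSimilarity(short, long); // 1: identical direction
const dist = euclideanDistance(short, long); // about 20.12: far apart by distance
</code></pre>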
<h3 id="heading-how-to-configure-wranglertoml">How to Configure wrangler.toml</h3>
<p>Open <code>wrangler.toml</code> and replace its contents with the following:</p>
<pre><code class="language-toml">name = "rag-tutorial-simple"
main = "src/index.ts"
compatibility_date = "2026-02-25"

[[vectorize]]
binding = "VECTORIZE"
index_name = "rag-tutorial-index"

[ai]
binding = "AI"
</code></pre>
<p>The <code>[[vectorize]]</code> block connects your Worker to the index you just created. The <code>[ai]</code> block gives your Worker access to Workers AI – both for generating embeddings and for running the language model that produces answers.</p>
<p>Notice that there are no API keys anywhere. Cloudflare handles authentication internally because everything – your Worker, Vectorize, and Workers AI – runs under the same account.</p>
<h3 id="heading-how-to-update-srcindexts">How to Update src/index.ts</h3>
<p>Open <code>src/index.ts</code> and replace the generated code with this:</p>
<pre><code class="language-typescript">export interface Env {
  VECTORIZE: VectorizeIndex;
  AI: Ai;
  LOAD_SECRET: string;
}

export default {
  async fetch(request: Request, env: Env): Promise&lt;Response&gt; {
    return new Response("RAG tutorial worker is running", { status: 200 });
  },
};
</code></pre>
<p>The <code>Env</code> interface tells TypeScript what bindings are available inside your Worker. <code>VectorizeIndex</code> and <code>Ai</code> are types provided by Cloudflare's type definitions.</p>
<h3 id="heading-how-to-verify-your-setup">How to Verify Your Setup</h3>
<p>Start the local development server:</p>
<pre><code class="language-bash">npx wrangler dev
</code></pre>
<p>Open your browser and visit <code>http://localhost:8787</code>. You should see:</p>
<pre><code class="language-plaintext">RAG tutorial worker is running
</code></pre>
<p>You will see two warnings in your terminal. Both are expected.</p>
<p>The first warning says that Vectorize doesn't support local mode. This means Vectorize queries won't work during local development unless you run with the <code>--remote</code> flag. You'll do this later when testing the full pipeline.</p>
<p>The second warning says the AI binding always accesses remote resources. This means that embedding generation and LLM calls always hit Cloudflare's servers, even in local development. This is fine: usage within the free tier limits costs nothing.</p>
<p>Your project structure will look like this once you add the knowledge base file in the next section:</p>
<pre><code class="language-plaintext">rag-tutorial-simple/
├── scripts/
│   └── knowledge-base.ts
├── src/
│   └── index.ts
├── wrangler.toml
├── package.json
└── tsconfig.json
</code></pre>
<h2 id="heading-how-to-build-the-data-pipeline">How to Build the Data Pipeline</h2>
<p>The data pipeline is responsible for two things: generating embeddings for each document in your knowledge base, and storing those embeddings in Vectorize. You'll handle both steps inside the Worker itself using a <code>/load</code> endpoint.</p>
<p>This approach has a key advantage: you don't need an API token, an Account ID, or any external tooling. Everything uses the bindings you already configured in <code>wrangler.toml</code>.</p>
<h3 id="heading-how-to-create-the-knowledge-base">How to Create the Knowledge Base</h3>
<p>Create a <code>scripts/</code> folder in your project and add a file called <code>knowledge-base.ts</code>:</p>
<pre><code class="language-bash">mkdir scripts
</code></pre>
<p>Add your documents to <code>scripts/knowledge-base.ts</code>:</p>
<pre><code class="language-typescript">export const documents = [
  {
    id: "1",
    text: "Cloudflare Workers run JavaScript at the edge, in over 300 data centers worldwide. Requests are handled close to the user, reducing latency significantly compared to a single-region server.",
    metadata: { source: "cloudflare-docs", category: "workers" },
  },
  {
    id: "2",
    text: "Vectorize is Cloudflare's vector database. It stores embeddings and lets you search them by semantic similarity. It runs in the same network as your Worker, so there is no external API call needed.",
    metadata: { source: "cloudflare-docs", category: "vectorize" },
  },
  {
    id: "3",
    text: "Workers AI lets you run machine learning models directly on Cloudflare's infrastructure. You can generate embeddings and run LLM inference without leaving the Cloudflare network.",
    metadata: { source: "cloudflare-docs", category: "workers-ai" },
  },
  {
    id: "4",
    text: "RAG stands for Retrieval Augmented Generation. Instead of relying only on what the LLM was trained on, RAG retrieves relevant context from a knowledge base and adds it to the prompt before generating an answer.",
    metadata: { source: "ai-concepts", category: "rag" },
  },
  {
    id: "5",
    text: "An embedding is a numerical representation of text. Similar pieces of text produce similar embeddings. This is what makes semantic search possible — you search by meaning, not exact keywords.",
    metadata: { source: "ai-concepts", category: "embeddings" },
  },
  {
    id: "6",
    text: "The BGE model (bge-base-en-v1.5) is available through Workers AI. It generates 768-dimensional embeddings and works well for English semantic search tasks.",
    metadata: { source: "cloudflare-docs", category: "workers-ai" },
  },
  {
    id: "7",
    text: "Cosine similarity measures the angle between two vectors. For text embeddings, it captures semantic similarity regardless of text length, which makes it more reliable than Euclidean distance.",
    metadata: { source: "ai-concepts", category: "embeddings" },
  },
  {
    id: "8",
    text: "Cloudflare Workers have a free tier that includes 100,000 requests per day. Vectorize is available on both the Workers Free and Paid plans. The free tier lets you prototype and experiment. The Workers Paid plan starts at $5/month and includes higher usage allocations for production workloads.",
    metadata: { source: "cloudflare-docs", category: "pricing" },
  },
];
</code></pre>
<p>Each document has three fields. The <code>id</code> is a unique string that Vectorize uses to identify the vector. The <code>text</code> is what gets converted into an embedding. The <code>metadata</code> is stored alongside the vector and returned in search results. You'll use it later to display the source of each answer.</p>
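<p>Because Vectorize identifies each vector by its <code>id</code>, two documents that share an <code>id</code> would overwrite each other on upsert. A minimal sketch of a guard you could run before loading is shown below. <code>KnowledgeDocument</code> and <code>findDuplicateIds</code> are hypothetical names and are not part of the tutorial's repository.</p>
<pre><code class="language-typescript">// Hypothetical shape matching the objects in knowledge-base.ts.
interface KnowledgeDocument {
  id: string;
  text: string;
  metadata: { source: string; category: string };
}

// Return every id that appears more than once, in first-seen order.
function findDuplicateIds(docs: KnowledgeDocument[]): string[] {
  const seen = new Set&lt;string&gt;();
  const duplicates = new Set&lt;string&gt;();
  for (const doc of docs) {
    if (seen.has(doc.id)) duplicates.add(doc.id);
    seen.add(doc.id);
  }
  return [...duplicates];
}
</code></pre>
<p>An empty result means every <code>id</code> is unique and the list is safe to load.</p>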
<h3 id="heading-understanding-embeddings">Understanding Embeddings</h3>
<p>Before writing the loading code, it helps to understand what you're actually generating.</p>
<p>An embedding is an array of 768 numbers that represents the meaning of a piece of text. The model reads a sentence and outputs those 768 numbers in a way where similar sentences produce similar arrays of numbers.</p>
<p>When a user asks a question, you convert that question into an embedding using the same model, then ask Vectorize to find the stored embeddings that are closest to it. The documents those embeddings came from are your most relevant context.</p>
<p>This is why the model choice matters: your documents and your queries must be embedded with the same model, or the similarity scores will be meaningless.</p>
<h3 id="heading-how-to-build-the-load-endpoint">How to Build the Load Endpoint</h3>
<p>Open <code>src/index.ts</code> and update it with a <code>/load</code> route. Here is the complete file at this stage:</p>
<pre><code class="language-typescript">import { documents } from "../scripts/knowledge-base";

export interface Env {
  VECTORIZE: VectorizeIndex;
  AI: Ai;
  LOAD_SECRET: string;
}

export default {
  async fetch(request: Request, env: Env): Promise&lt;Response&gt; {
    const url = new URL(request.url);

    if (url.pathname === "/load" &amp;&amp; request.method === "POST") {
      return handleLoad(env, request);
    }

    return new Response("RAG tutorial worker is running", { status: 200 });
  },
};

async function handleLoad(env: Env, request: Request): Promise&lt;Response&gt; {
  const authHeader = request.headers.get("X-Load-Secret");
  if (authHeader !== env.LOAD_SECRET) {
    return Response.json({ error: "Unauthorized" }, { status: 401 });
  }

  const results: { id: string; status: string }[] = [];

  for (const doc of documents) {
    const response = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: [doc.text],
    }) as { data: number[][] };

    await env.VECTORIZE.upsert([
      {
        id: doc.id,
        values: response.data[0],
        metadata: {
          ...doc.metadata,
          text: doc.text,
        },
      },
    ]);

    results.push({ id: doc.id, status: "loaded" });
  }

  return Response.json({ success: true, loaded: results });
}
</code></pre>
<p>Notice that <code>env.AI.run()</code> and <code>env.VECTORIZE.upsert()</code> require no credentials. The bindings handle authentication because the Worker runs inside your Cloudflare account. There are no secrets to manage for internal service communication.</p>
<p>The <code>text: doc.text</code> field inside <code>metadata</code> is important. Vectorize stores the vector values and whatever metadata you provide, but it doesn't store the original text separately. By including the text in metadata, you can retrieve and display it in search results later.</p>
<p>The <code>as { data: number[][] }</code> cast is necessary because the TypeScript type definitions for Workers AI do not yet reflect the exact return shape of every model. The actual response always contains a <code>data</code> array, and the cast tells TypeScript to trust that.</p>
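<p>If you would rather verify the shape at runtime than rely on a cast, a type guard is one option. This is a sketch: <code>isEmbeddingResponse</code> is a hypothetical helper, and it checks only the <code>data</code> field described above.</p>
<pre><code class="language-typescript">// Narrow an unknown value to the embedding shape described above.
function isEmbeddingResponse(value: unknown): value is { data: number[][] } {
  if (typeof value !== "object" || value === null) return false;
  const data = (value as { data?: unknown }).data;
  return (
    Array.isArray(data) &amp;&amp;
    data.length &gt; 0 &amp;&amp;
    data.every(
      (row) =&gt; Array.isArray(row) &amp;&amp; row.every((n) =&gt; typeof n === "number"),
    )
  );
}
</code></pre>
<p>You would call it on the result of <code>env.AI.run()</code> and return an error response when it returns <code>false</code>, instead of reading <code>response.data[0]</code> unconditionally.</p>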
<h3 id="heading-how-to-deploy-and-load-your-knowledge-base">How to Deploy and Load Your Knowledge Base</h3>
<p>First, set the secret that will protect your load endpoint:</p>
<pre><code class="language-bash">npx wrangler secret put LOAD_SECRET
</code></pre>
<p>Type a strong value when prompted. Then deploy:</p>
<pre><code class="language-bash">npx wrangler deploy
</code></pre>
<p>Trigger the load endpoint. You only need to do this once, or any time you update your knowledge base:</p>
<pre><code class="language-bash">curl -X POST https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/load \
  -H "X-Load-Secret: your-secret-value"
</code></pre>
<p>On Windows PowerShell:</p>
<p><strong>Note:</strong> PowerShell uses backtick (<code>`</code>) for line continuation, not backslash.</p>
<pre><code class="language-powershell">Invoke-WebRequest `
  -Uri "https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/load" `
  -Method POST `
  -Headers @{"X-Load-Secret"="your-secret-value"} `
  -UseBasicParsing
</code></pre>
<p>You should see:</p>
<pre><code class="language-json">{
  "success": true,
  "loaded": [
    { "id": "1", "status": "loaded" },
    { "id": "2", "status": "loaded" },
    { "id": "3", "status": "loaded" },
    { "id": "4", "status": "loaded" },
    { "id": "5", "status": "loaded" },
    { "id": "6", "status": "loaded" },
    { "id": "7", "status": "loaded" },
    { "id": "8", "status": "loaded" }
  ]
}
</code></pre>
<p>Your knowledge base is now stored in Vectorize as vectors. In the next section, you'll build the query pipeline that searches those vectors and generates answers.</p>
<h2 id="heading-how-to-build-the-query-pipeline">How to Build the Query Pipeline</h2>
<p>The query pipeline is the core of your RAG system. When a user sends a question, the pipeline runs four steps in sequence: embed the question, search Vectorize, build context from the results, and generate an answer with the LLM.</p>
<p>Add a <code>/query</code> route to your fetch handler and the complete <code>handleQuery</code> function. Here is the full updated <code>src/index.ts</code>:</p>
<pre><code class="language-typescript">import { documents } from "../scripts/knowledge-base";

export interface Env {
  VECTORIZE: VectorizeIndex;
  AI: Ai;
  LOAD_SECRET: string;
}

export default {
  async fetch(request: Request, env: Env): Promise&lt;Response&gt; {
    const url = new URL(request.url);

    if (url.pathname === "/load" &amp;&amp; request.method === "POST") {
      return handleLoad(env, request);
    }

    if (url.pathname === "/query" &amp;&amp; request.method === "POST") {
      return handleQuery(request, env);
    }

    return new Response("RAG tutorial worker is running", { status: 200 });
  },
};

async function handleLoad(env: Env, request: Request): Promise&lt;Response&gt; {
  const authHeader = request.headers.get("X-Load-Secret");
  if (authHeader !== env.LOAD_SECRET) {
    return Response.json({ error: "Unauthorized" }, { status: 401 });
  }

  const results: { id: string; status: string }[] = [];

  for (const doc of documents) {
    const response = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: [doc.text],
    }) as { data: number[][] };

    await env.VECTORIZE.upsert([
      {
        id: doc.id,
        values: response.data[0],
        metadata: {
          ...doc.metadata,
          text: doc.text,
        },
      },
    ]);

    results.push({ id: doc.id, status: "loaded" });
  }

  return Response.json({ success: true, loaded: results });
}

async function handleQuery(request: Request, env: Env): Promise&lt;Response&gt; {
  const body = await request.json() as { question: string };

  if (!body.question) {
    return Response.json({ error: "question is required" }, { status: 400 });
  }

  // Step 1: Embed the question using the same model as your documents
  const embeddingResponse = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [body.question],
  }) as { data: number[][] };

  // Step 2: Search Vectorize for the 3 most similar documents
  const searchResults = await env.VECTORIZE.query(
    embeddingResponse.data[0],
    {
      topK: 3,
      returnMetadata: "all",
    }
  );

  // Step 3: Build context from results above the similarity threshold
  const context = searchResults.matches
    .filter((match) =&gt; match.score &gt; 0.5)
    .map((match) =&gt; match.metadata?.text as string)
    .filter(Boolean)
    .join("\n\n");

  if (!context) {
    return Response.json({
      answer: "I could not find relevant information to answer that question.",
      sources: [],
    });
  }

  // Step 4: Generate an answer using the retrieved context
  const aiResponse = await env.AI.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", {
    messages: [
      {
        role: "system",
        content: "You are a helpful assistant. Answer the question using only the context provided. If the context does not contain enough information, say so.",
      },
      {
        role: "user",
        content: `Context:\n${context}\n\nQuestion: ${body.question}`,
      },
    ],
    max_tokens: 256,
  }) as { response: string };

  // Step 5: Return the answer with its sources
  const sources = searchResults.matches
    .filter((match) =&gt; match.score &gt; 0.5)
    .map((match) =&gt; match.metadata?.source as string)
    .filter(Boolean);

  return Response.json({
    answer: aiResponse.response,
    sources: [...new Set(sources)],
  });
}
</code></pre>
<p>What each step does:</p>
<ol>
<li><p><strong>Step 1 – Embed the question</strong>: You convert the user's question into a 768-dimensional vector using the same model you used when loading your documents. This is critical: the question and the documents must be embedded with the same model or the similarity scores will be meaningless.</p>
</li>
<li><p><strong>Step 2 – Search Vectorize</strong>: You pass the question embedding to Vectorize, which returns the three most similar documents. <code>returnMetadata: "all"</code> tells Vectorize to include the metadata you stored alongside each vector — including the original text.</p>
</li>
<li><p><strong>Step 3 – Build context</strong>: You keep only results with a similarity score above 0.5 and join their document texts into a single context string. The 0.5 threshold prevents the LLM from receiving irrelevant documents just because nothing better matched.</p>
</li>
<li><p><strong>Step 4 – Generate the answer</strong>: You pass the context and the question to the LLM using the chat format with <code>messages</code>. The system prompt explicitly instructs the model to answer using only the provided context. This is what keeps the LLM grounded. Without this instruction, the model is free to mix in, or fall back on, whatever it learned during training.</p>
</li>
<li><p><strong>Step 5 – Return sources</strong>: You include the source metadata in the response so callers know which documents the answer came from. The <code>Set</code> deduplicates the list in case several matching documents share the same <code>source</code> value.</p>
</li>
</ol>
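<p>Steps 3 and 5 apply the same threshold filter twice. One possible refactor, sketched below, filters once and derives both the context and the sources from the same list. <code>Match</code> and <code>buildContext</code> are hypothetical names, and the <code>Match</code> shape covers only the fields this tutorial reads from a Vectorize match.</p>
<pre><code class="language-typescript">// Only the fields of a Vectorize match that this tutorial uses.
interface Match {
  score: number;
  metadata?: Record&lt;string, unknown&gt;;
}

function buildContext(matches: Match[], threshold = 0.5) {
  // Apply the similarity threshold once.
  const relevant = matches.filter((m) =&gt; m.score &gt; threshold);

  // Keep only non-empty string values, mirroring .filter(Boolean).
  const texts = relevant
    .map((m) =&gt; m.metadata?.text)
    .filter((t): t is string =&gt; typeof t === "string" &amp;&amp; t.length &gt; 0);

  const sources = relevant
    .map((m) =&gt; m.metadata?.source)
    .filter((s): s is string =&gt; typeof s === "string" &amp;&amp; s.length &gt; 0);

  return { context: texts.join("\n\n"), sources: [...new Set(sources)] };
}
</code></pre>
<p>Keeping the threshold in one place means the sources you report always correspond to the documents the LLM actually saw.</p>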
<h3 id="heading-how-to-test-the-query-pipeline">How to Test the Query Pipeline</h3>
<p>Deploy your Worker:</p>
<pre><code class="language-bash">npx wrangler deploy
</code></pre>
<p>Send a question:</p>
<pre><code class="language-bash">curl -X POST https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is RAG?"}'
</code></pre>
<p>On Windows PowerShell:</p>
<pre><code class="language-powershell">Invoke-WebRequest `
  -Uri "https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/query" `
  -Method POST `
  -ContentType "application/json" `
  -Body '{"question": "What is RAG?"}' `
  -UseBasicParsing
</code></pre>
<p>You should receive a response like this:</p>
<pre><code class="language-json">{
  "answer": "RAG stands for Retrieval Augmented Generation. It's a method that enhances generation by retrieving relevant context from a knowledge base and adding it to the prompt before generating an answer.",
  "sources": ["ai-concepts"]
}
</code></pre>
<p>The answer came from your knowledge base, not from the LLM's training data. That's the entire point of RAG: grounded, verifiable answers with traceable sources.</p>
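<p>If you want to call the endpoint from application code rather than curl, a small typed client might look like the sketch below. <code>askQuestion</code> and <code>RagAnswer</code> are hypothetical names, the base URL is whatever your own deployment uses, and the fetch implementation is passed in so the function can be exercised without a network.</p>
<pre><code class="language-typescript">// Response shape returned by the /query endpoint in this tutorial.
interface RagAnswer {
  answer: string;
  sources: string[];
}

async function askQuestion(
  baseUrl: string,
  question: string,
  fetchImpl: typeof fetch = fetch,
): Promise&lt;RagAnswer&gt; {
  const response = await fetchImpl(new URL("/query", baseUrl).toString(), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question }),
  });

  // Surface 4xx and 5xx responses as errors instead of parsing them as answers.
  if (!response.ok) {
    throw new Error(`Query failed with status ${response.status}`);
  }
  return (await response.json()) as RagAnswer;
}
</code></pre>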
<h2 id="heading-how-to-add-error-handling-and-security">How to Add Error Handling and Security</h2>
<p>A tutorial that only shows the happy path is not production-ready. In this section, you'll add error handling to every step of the query pipeline and protect the <code>/load</code> endpoint from unauthorized access.</p>
<h3 id="heading-how-to-secure-the-load-endpoint">How to Secure the Load Endpoint</h3>
<p>The <code>/load</code> endpoint generates embeddings and writes to your Vectorize index. Without protection, anyone who discovers your Worker URL can trigger it repeatedly, consuming your Workers AI quota and overwriting your data.</p>
<p>The <code>LOAD_SECRET</code> binding you added to <code>Env</code> and the <code>wrangler secret put</code> command you ran earlier handle this. The check at the top of <code>handleLoad</code> rejects any request that doesn't include the correct secret header:</p>
<pre><code class="language-typescript">const authHeader = request.headers.get("X-Load-Secret");
if (authHeader !== env.LOAD_SECRET) {
  return Response.json({ error: "Unauthorized" }, { status: 401 });
}
</code></pre>
<p>A request without the header returns <code>{"error":"Unauthorized"}</code> with a 401 status. The secret itself is stored as an encrypted environment variable in your Worker. It never appears in your code or <code>wrangler.toml</code>.</p>
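<p>A plain <code>!==</code> comparison can stop at the first differing character, which in principle leaks timing information about the secret. If that matters for your threat model, a comparison without an early exit is one option. The sketch below is a generic technique rather than something the tutorial's code uses, <code>timingSafeEqualStrings</code> is a hypothetical helper name, and JavaScript engines make no hard timing guarantees.</p>
<pre><code class="language-typescript">// Compare two strings without exiting early on the first mismatch.
function timingSafeEqualStrings(a: string, b: string): boolean {
  const encoder = new TextEncoder();
  const left = encoder.encode(a);
  const right = encoder.encode(b);

  // Fold the length difference into the result instead of returning early.
  let diff = left.length ^ right.length;
  const length = Math.max(left.length, right.length);
  for (let i = 0; i &lt; length; i++) {
    // Out-of-range reads yield undefined; `?? 0` substitutes a zero byte.
    diff |= (left[i] ?? 0) ^ (right[i] ?? 0);
  }
  return diff === 0;
}
</code></pre>
<p>In <code>handleLoad</code>, you would reject the request when the header is missing or when this function returns <code>false</code> for the header and <code>env.LOAD_SECRET</code>.</p>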
<p>To trigger the load endpoint, you must include the secret in the request header:</p>
<pre><code class="language-bash">curl -X POST https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/load \
  -H "X-Load-Secret: your-secret-value"
</code></pre>
<h3 id="heading-how-to-handle-query-errors">How to Handle Query Errors</h3>
<p>Replace your <code>handleQuery</code> function with this hardened version:</p>
<pre><code class="language-typescript">async function handleQuery(request: Request, env: Env): Promise&lt;Response&gt; {
  // Guard against malformed request body
  let body: { question: string };
  try {
    body = await request.json() as { question: string };
  } catch {
    return Response.json({ error: "Invalid JSON in request body" }, { status: 400 });
  }

  if (!body.question || typeof body.question !== "string" || body.question.trim() === "") {
    return Response.json({ error: "question must be a non-empty string" }, { status: 400 });
  }

  // Sanitize: trim whitespace and cap length
  const question = body.question.trim().slice(0, 500);

  // Step 1: Embed the question
  let embeddingResponse: { data: number[][] };
  try {
    embeddingResponse = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: [question],
    }) as { data: number[][] };
  } catch (err) {
    console.error("Embedding generation failed:", err);
    return Response.json({ error: "Failed to process your question" }, { status: 503 });
  }

  // Step 2: Search Vectorize
  let searchResults: Awaited&lt;ReturnType&lt;typeof env.VECTORIZE.query&gt;&gt;;
  try {
    searchResults = await env.VECTORIZE.query(
      embeddingResponse.data[0],
      { topK: 3, returnMetadata: "all" }
    );
  } catch (err) {
    console.error("Vectorize query failed:", err);
    return Response.json({ error: "Failed to search knowledge base" }, { status: 503 });
  }

  // Step 3: Build context
  const context = searchResults.matches
    .filter((match) =&gt; match.score &gt; 0.5)
    .map((match) =&gt; match.metadata?.text as string)
    .filter(Boolean)
    .join("\n\n");

  if (!context) {
    return Response.json({
      answer: "I could not find relevant information to answer that question. Try rephrasing or asking something else.",
      sources: [],
    });
  }

  // Step 4: Generate answer
  let aiResponse: { response: string };
  try {
    aiResponse = await env.AI.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", {
      messages: [
        {
          role: "system",
          content: "You are a helpful assistant. Answer the question using only the context provided. If the context does not contain enough information, say so.",
        },
        {
          role: "user",
          content: `Context:\n${context}\n\nQuestion: ${question}`,
        },
      ],
      max_tokens: 256,
    }) as { response: string };
  } catch (err) {
    console.error("LLM generation failed:", err);
    return Response.json({ error: "Failed to generate an answer" }, { status: 503 });
  }

  // Step 5: Return answer with sources
  const sources = searchResults.matches
    .filter((match) =&gt; match.score &gt; 0.5)
    .map((match) =&gt; match.metadata?.source as string)
    .filter(Boolean);

  return Response.json({
    answer: aiResponse.response,
    sources: [...new Set(sources)],
  });
}
</code></pre>
<p>What each error handling decision means:</p>
<ul>
<li><p><code>try/catch</code> <strong>around</strong> <code>request.json()</code>: <code>request.json()</code> throws if the body is not valid JSON. Without this catch, a malformed request crashes your Worker with an unhandled 500 error. With it, the caller gets a clear 400 explaining what went wrong.</p>
</li>
<li><p><strong>Input validation before processing</strong>: You check that <code>question</code> exists, is a string, and is not empty before calling any external service. This prevents wasted AI calls on invalid input.</p>
</li>
<li><p><code>.slice(0, 500)</code> <strong>on the question</strong>: This caps the input length before it reaches the embedding model. Without it, a malicious caller could send a very long string designed to inflate your AI usage or hit Workers CPU limits.</p>
</li>
<li><p><strong>503 for AI and Vectorize failures</strong>: HTTP 503 means "service temporarily unavailable." It signals to callers that the error is on the server side and the request can be retried.</p>
</li>
<li><p><code>.filter(Boolean)</code> <strong>on context</strong>: After mapping <code>match.metadata?.text</code>, some results may be <code>undefined</code> if metadata was stored without a <code>text</code> field. This filters them out before joining, preventing <code>"undefined"</code> from appearing in the context string you send to the LLM.</p>
</li>
</ul>
<h3 id="heading-how-to-test-error-handling">How to Test Error Handling</h3>
<p>Deploy your updated Worker:</p>
<pre><code class="language-bash">npx wrangler deploy
</code></pre>
<p>Test each error case:</p>
<pre><code class="language-bash"># Missing secret on load endpoint — should return 401
curl -X POST https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/load

# Invalid JSON — should return 400
curl -X POST https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/query \
  -H "Content-Type: application/json" \
  -d 'not json'

# Empty question — should return 400
curl -X POST https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/query \
  -H "Content-Type: application/json" \
  -d '{"question": ""}'
</code></pre>
<h2 id="heading-performance-and-cost-analysis">Performance and Cost Analysis</h2>
<p>This section uses real production data from my <a href="https://github.com/dannwaneri/vectorize-mcp-worker">vectorize-mcp-worker</a> deployment. It uses the same architecture you just built, measured from Port Harcourt, Nigeria to Cloudflare's edge.</p>
<h3 id="heading-real-performance-numbers">Real Performance Numbers</h3>
<p>Here is what the pipeline actually costs in time on every request:</p>
<table>
<thead>
<tr>
<th>Operation</th>
<th>Time</th>
</tr>
</thead>
<tbody><tr>
<td>Embedding generation</td>
<td>142ms</td>
</tr>
<tr>
<td>Vector search</td>
<td>223ms</td>
</tr>
<tr>
<td>Response formatting</td>
<td>&lt;5ms</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>~365ms</strong></td>
</tr>
</tbody></table>
<p>This covers embedding generation and vector search only – the retrieval layer. LLM generation adds 500-1500ms on top, which puts end-to-end response time at roughly 900-1900ms.</p>
<p>The embedding step and vector search dominate. Everything else is negligible. For context, a comparable setup using OpenAI embeddings and Pinecone would add two external API roundtrips on top of this, easily pushing total latency past 1 second.</p>
<p>These numbers come from a single-region measurement. Your actual latency will vary based on your location and Cloudflare's load at the time of the request. The architectural point holds regardless: co-locating everything on the edge eliminates inter-service network hops, which is where most latency in traditional RAG stacks comes from.</p>
<h3 id="heading-real-cost-breakdown">Real Cost Breakdown</h3>
<p>For 10,000 searches per day (300,000 per month) with 10,000 stored vectors:</p>
<p><strong>This stack:</strong></p>
<table>
<thead>
<tr>
<th>Service</th>
<th>Monthly Cost</th>
</tr>
</thead>
<tbody><tr>
<td>Workers</td>
<td>~$3</td>
</tr>
<tr>
<td>Workers AI</td>
<td>~$3-5</td>
</tr>
<tr>
<td>Vectorize</td>
<td>~$2</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>$8-10</strong></td>
</tr>
</tbody></table>
<p><strong>Traditional alternatives for the same volume:</strong></p>
<table>
<thead>
<tr>
<th>Solution</th>
<th>Monthly Cost</th>
</tr>
</thead>
<tbody><tr>
<td>Pinecone Standard</td>
<td>$50-70</td>
</tr>
<tr>
<td>Weaviate Serverless</td>
<td>$25-40</td>
</tr>
<tr>
<td>Self-hosted pgvector</td>
<td>$40-60</td>
</tr>
</tbody></table>
<p>Based on the two tables above, that is roughly a 60-89% cost reduction depending on which alternative you compare against. For a bootstrapped startup adding semantic search, that difference is about $180-740 per year.</p>
<h3 id="heading-why-the-cost-difference-is-so-large">Why the Cost Difference Is So Large</h3>
<p>Traditional RAG stacks have three cost problems that compound each other.</p>
<p>The first is idle compute. A dedicated server or container running your embedding service costs money even when no searches are happening. Cloudflare Workers charge only for actual execution time.</p>
<p>The second is inter-service data transfer. Every time your application calls an external service for an embedding, then calls a separate service for a search, you're paying for two external API calls with metered pricing. In this stack, both operations happen inside Cloudflare's network at no additional transfer cost.</p>
<p>The third is minimum plan pricing. Pinecone's Standard plan costs $50/month as a floor, regardless of how little you use it. Cloudflare's pricing scales from the $5/month Workers Paid plan base.</p>
<h3 id="heading-when-the-included-allocation-is-enough">When the Included Allocation Is Enough</h3>
<p>For smaller usage levels, you may not pay beyond the $5/month Workers Paid base price:</p>
<ul>
<li><p>Workers: 10 million requests per month included</p>
</li>
<li><p>Workers AI: generous daily neuron allocation included</p>
</li>
<li><p>Vectorize: available on both Free and Paid plans, with a free allocation included</p>
</li>
</ul>
<p>A side project, internal tool, or small business with under 3,000 searches per day will likely stay within the included allocations entirely.</p>
<h3 id="heading-the-trade-off-to-know-about">The Trade-off to Know About</h3>
<p>This cost advantage comes with one operational constraint worth understanding before you build: Vectorize does not work in local development mode.</p>
<p>When you run <code>wrangler dev</code>, your Worker runs locally but Vectorize calls fail. You have to deploy to Cloudflare to test your vector search. For most development workflows this means testing your query logic locally with mocked responses, then deploying to a staging environment for full integration tests.</p>
<p>This is a real friction point. It's the honest trade-off for having a managed vector database with no infrastructure to operate.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you have built and deployed a production-ready RAG system on Cloudflare's edge network. Let's look at what you actually built and what it costs to run.</p>
<h3 id="heading-what-you-built">What You Built</h3>
<p>Your completed system has three endpoints:</p>
<ul>
<li><p><code>GET /</code>: health check confirming the Worker is running</p>
</li>
<li><p><code>POST /load</code>: loads your knowledge base into Vectorize, protected by a secret header</p>
</li>
<li><p><code>POST /query</code>: accepts a question, retrieves relevant context, and returns a grounded answer with sources</p>
</li>
</ul>
<p>The full query pipeline runs in four steps on every request:</p>
<ol>
<li><p>The question is converted to a 768-dimensional embedding using <code>@cf/baai/bge-base-en-v1.5</code></p>
</li>
<li><p>Vectorize finds the three most semantically similar documents</p>
</li>
<li><p>Documents above the 0.5 similarity threshold are assembled into context</p>
</li>
<li><p>Llama 3.3 generates an answer using only that context</p>
</li>
</ol>
<p>Everything runs on Cloudflare's infrastructure. No external API keys. No separate vector database subscription. No servers to manage.</p>
<h3 id="heading-what-to-build-next">What to Build Next</h3>
<p>This tutorial covered the core RAG pattern. Here are five directions to take it further.</p>
<h4 id="heading-add-more-documents">Add more documents</h4>
<p>The knowledge base in this tutorial has 8 documents. A real system might have thousands. The loading pattern is identical: add documents to <code>knowledge-base.ts</code>, hit <code>/load</code> with your secret, and Vectorize handles the rest.</p>
<p>For very large knowledge bases, update <code>handleLoad</code> to batch documents in groups of 20-50 rather than upserting one at a time.</p>
<h4 id="heading-improve-chunking">Improve chunking</h4>
<p>Each document in this tutorial is a single short paragraph. Real-world documents such as PDFs, articles, and documentation pages need to be split into chunks before embedding. Chunk at natural boundaries like paragraphs and sentences, aim for 200-400 tokens per chunk, and include 50-token overlaps between chunks to preserve context across boundaries.</p>
<h4 id="heading-add-conversation-history">Add conversation history</h4>
<p>The current system treats every query as independent. To support follow-up questions, store previous messages in a Cloudflare KV namespace and include the last 2-3 exchanges in the LLM <code>messages</code> array alongside the retrieved context.</p>
<h4 id="heading-stream-the-response">Stream the response</h4>
<p>For long answers, users stare at a blank screen until generation completes. Cloudflare Workers support streaming responses via <code>TransformStream</code>. Switching to streaming means the first tokens appear in under 100ms while the rest generates.</p>
<h4 id="heading-consider-dimensions-vs-reranking-trade-offs">Consider dimensions vs reranking trade-offs</h4>
<p>This tutorial uses <code>bge-base-en-v1.5</code> at 768 dimensions. My production system uses <code>bge-small-en-v1.5</code> at 384 dimensions. Testing showed upgrading from 384 to 768 dims only improved accuracy by about 2%, but doubled cost and latency.</p>
<p>Adding a reranker (<code>@cf/baai/bge-reranker-base</code>) gave a larger accuracy improvement than the dimension upgrade for a fraction of the cost. The exact improvement will vary by domain and query distribution — test both on your actual data before deciding. If you're optimizing for production, add a reranker before you increase dimensions.</p>
<h3 id="heading-the-complete-project">The Complete Project</h3>
<p>Clone and deploy in six commands:</p>
<pre><code class="language-bash">git clone https://github.com/dannwaneri/rag-tutorial-simple
cd rag-tutorial-simple
npm install
npx wrangler vectorize create rag-tutorial-index --dimensions=768 --metric=cosine
npx wrangler secret put LOAD_SECRET
npx wrangler deploy
</code></pre>
<p>Then load your knowledge base:</p>
<pre><code class="language-bash">curl -X POST https://&lt;your-worker&gt;.workers.dev/load \
  -H "X-Load-Secret: your-secret"
</code></pre>
<p>If you found this useful, the production system this tutorial is based on is open source at <a href="https://github.com/dannwaneri/vectorize-mcp-worker">github.com/dannwaneri/vectorize-mcp-worker</a>. It extends this foundation with hybrid search combining vector and BM25, multimodal support for searching images with AI vision, a reranker for more accurate results, and a live dashboard. It runs on the same Cloudflare stack you just built – Workers, Vectorize, Workers AI – plus D1 for document storage.</p>
<p>One difference you'll notice: the production system uses <code>bge-small-en-v1.5</code> at 384 dimensions rather than the 768 dimensions in this tutorial. That is an intentional trade-off: the reranker adds more accuracy than the extra dimensions at lower cost. The jump from what you built today to that system is smaller than it looks.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Ship a Production-Ready RAG App with FAISS (Guardrails, Evals, and Fallbacks) ]]>
                </title>
                <description>
                    <![CDATA[ Most LLM applications look great in a high-fidelity demo. Then they hit the hands of real users and start failing in very predictable yet damaging ways. They answer questions they should not, they bre ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-rag-app-faiss-fastapi/</link>
                <guid isPermaLink="false">69b841572ad6ae5184d54317</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ FastAPI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ vector database ]]>
                    </category>
                
                    <category>
                        <![CDATA[ faiss ]]>
                    </category>
                
                    <category>
                        <![CDATA[ api ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Chidozie Managwu ]]>
                </dc:creator>
                <pubDate>Mon, 16 Mar 2026 17:43:51 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/f9da3ad9-e285-4ce1-acb7-ad119579971c.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most LLM applications look great in a high-fidelity demo. Then they hit the hands of real users and start failing in very predictable yet damaging ways.</p>
<p>They answer questions they should not, they break when document retrieval is weak, they time out due to network latency, and nobody can tell exactly what happened because there are no logs and no tests.</p>
<p>In this tutorial, you’ll build a beginner-friendly Retrieval Augmented Generation (RAG) application designed to survive production realities. This isn’t just a script that calls an API. It’s a system featuring a FastAPI backend, a persisted FAISS vector store, and essential safety guardrails (including a retrieval gate and fallbacks).</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ol>
<li><p><a href="#heading-why-rag-alone-does-not-equal-production-ready">Why RAG Alone Does Not Equal Production-Ready</a></p>
</li>
<li><p><a href="#heading-the-architecture-you-are-building">The Architecture You Are Building</a></p>
</li>
<li><p><a href="#heading-project-setup-and-structure">Project Setup and Structure</a></p>
</li>
<li><p><a href="#heading-how-to-build-the-rag-layer-with-faiss">How to Build the RAG Layer with FAISS</a></p>
</li>
<li><p><a href="#heading-how-to-add-the-llm-call-with-structured-output">How to Add the LLM Call with Structured Output</a></p>
</li>
<li><p><a href="#heading-how-to-add-guardrails-retrieval-gate-and-fallbacks">How to Add Guardrails: Retrieval Gate and Fallbacks</a></p>
</li>
<li><p><a href="#heading-fastapi-app-creating-the-answer-endpoint">FastAPI App: Creating the /answer Endpoint</a></p>
</li>
<li><p><a href="#heading-how-to-add-beginner-friendly-evals">How to Add Beginner-Friendly Evals</a></p>
</li>
<li><p><a href="#heading-what-to-improve-next-realistic-upgrades">What to Improve Next: Realistic Upgrades</a></p>
</li>
</ol>
<h2 id="heading-why-rag-alone-does-not-equal-production-ready">Why RAG Alone Does Not Equal Production-Ready</h2>
<p>Retrieval Augmented Generation (RAG) is often hailed as the hallucination killer. By grounding the model in retrieved text, we provide it with the facts it needs to be accurate. But simply connecting a vector database to an LLM isn’t enough for a production environment.</p>
<p>Production issues usually arise from the silent failures in the system surrounding the model:</p>
<ul>
<li><p><strong>Weak retrieval:</strong> If the app retrieves irrelevant chunks of text, the model tries to bridge the gap by inventing an answer anyway. Without a designated “I do not know” path, the model is essentially forced to hallucinate.</p>
</li>
<li><p><strong>Lack of visibility:</strong> Without structured outputs and basic logging, you can’t tell if bad retrieval, a confusing prompt, or a model update caused a wrong answer.</p>
</li>
<li><p><strong>Fragility:</strong> A simple API timeout or malformed provider response becomes a user-facing outage if you don’t implement fallbacks.</p>
</li>
<li><p><strong>No regression testing:</strong> In traditional software, we have unit tests. In AI, we need evals. Without them, a small tweak to your prompt might fix one issue but break ten others without you realising it.</p>
</li>
</ul>
<p>We’ll solve each of these issues systematically in this guide.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>This tutorial is beginner-friendly, but it assumes you have a few basics in place so you can focus on building a robust RAG system instead of getting stuck on setup issues.</p>
<h3 id="heading-knowledge">Knowledge</h3>
<p>You should be comfortable with:</p>
<ul>
<li><p><strong>Python fundamentals</strong> (functions, modules, virtual environments)</p>
</li>
<li><p><strong>Basic HTTP + JSON</strong> (requests, response payloads)</p>
</li>
<li><p><strong>APIs with FastAPI</strong> (what an endpoint is and how to run a server)</p>
</li>
<li><p><strong>High-level LLM concepts</strong> (prompting, temperature, structured outputs)</p>
</li>
</ul>
<h3 id="heading-tools-accounts">Tools + Accounts</h3>
<p>You’ll need:</p>
<ul>
<li><p><strong>Python 3.10+</strong></p>
</li>
<li><p>A working <strong>OpenAI-compatible API key</strong> (OpenAI or any provider that supports the same request/response shape)</p>
</li>
<li><p>A local environment where you can run a FastAPI app (Mac/Linux/Windows)</p>
</li>
</ul>
<h3 id="heading-what-this-tutorial-covers-and-what-it-doesnt">What This Tutorial Covers (and What It Doesn’t)</h3>
<p>We’ll build a production-minded baseline:</p>
<ul>
<li><p>A <strong>FAISS-backed retriever</strong> with a persisted index + metadata</p>
</li>
<li><p>A <strong>retrieval gate</strong> to prevent “forced hallucination”</p>
</li>
<li><p><strong>Structured JSON outputs</strong> so your backend is stable</p>
</li>
<li><p><strong>Fallback behavior</strong> for timeouts and provider errors</p>
</li>
<li><p>A small <strong>eval harness</strong> to prevent regressions</p>
</li>
</ul>
<p>We won’t implement advanced upgrades such as rerankers, semantic chunking, auth, or background jobs. Those appear only in a roadmap at the end.</p>
<h2 id="heading-the-architecture-you-are-building">The Architecture You Are Building</h2>
<p>The flow of our application follows a disciplined path so every answer is grounded in evidence:</p>
<ol>
<li><p><strong>User query:</strong> The user submits a question via a FastAPI endpoint.</p>
</li>
<li><p><strong>Retrieval:</strong> The system embeds the question and retrieves the top-k most similar document chunks.</p>
</li>
<li><p><strong>The retrieval gate:</strong> We evaluate the similarity score. If the context is not relevant enough, we stop immediately and refuse the query.</p>
</li>
<li><p><strong>Augmentation and generation:</strong> If the gate passes, we send a context-augmented prompt to the LLM.</p>
</li>
<li><p><strong>Structured response:</strong> The model returns a JSON object containing the answer, sources used, and a confidence level.</p>
</li>
</ol>
<h2 id="heading-project-setup-and-structure">Project Setup and Structure</h2>
<p>To keep things organized and maintainable, we’ll use a modular structure. This allows you to swap out your LLM provider or your vector database without rewriting your entire core application.</p>
<h3 id="heading-project-structure">Project Structure</h3>
<pre><code class="language-plaintext">.
├── app.py              # FastAPI entry point and API logic
├── rag.py              # FAISS index, persistence, and document retrieval
├── llm.py              # LLM API interface and JSON parsing
├── prompts.py          # Centralized prompt templates
├── data/               # Source .txt documents
├── index/              # Persisted FAISS index and metadata
└── evals/              # Evaluation dataset and runner script
    ├── eval_set.json
    └── run_evals.py
</code></pre>
<h3 id="heading-install-dependencies">Install Dependencies</h3>
<p>First, create a virtual environment to isolate your project:</p>
<pre><code class="language-bash">python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install fastapi uvicorn faiss-cpu numpy pydantic requests python-dotenv
</code></pre>
<h3 id="heading-configure-the-environment">Configure the Environment</h3>
<p>Create a <code>.env</code> file in the root directory. We are targeting OpenAI-compatible providers:</p>
<pre><code class="language-plaintext">OPENAI_API_KEY=your_actual_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o-mini
</code></pre>
<p>Important note on compatibility: The code below assumes an OpenAI-style API. If you use a provider that is not compatible, you must change the URL, headers (for example <code>X-API-Key</code>), and the way you extract embeddings and final message content in <code>embed_texts()</code> and <code>call_llm()</code>.</p>
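<p>Both <code>rag.py</code> and <code>llm.py</code> read these values with <code>os.getenv</code>, so a missing variable silently becomes <code>None</code> and produces a request URL such as <code>None/embeddings</code>. A minimal sketch of a startup check, assuming a hypothetical helper named <code>missing_settings</code> (it is not part of the tutorial's files), might look like this:</p>
<pre><code class="language-python">import os

# The three variables defined in the .env file above.
REQUIRED_SETTINGS = ("OPENAI_API_KEY", "OPENAI_BASE_URL", "OPENAI_MODEL")


def missing_settings(env=None):
    """Return the names of required settings that are unset or empty.

    `env` is any mapping of names to values; it defaults to os.environ.
    This helper is hypothetical and only illustrates a fail-fast check.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"OPENAI_API_KEY": "sk-example", "OPENAI_BASE_URL": ""}
print(missing_settings(example_env))
</code></pre>
<p>Calling it once after <code>load_dotenv()</code> and raising an error when the list is non-empty turns a confusing network failure into a clear configuration message.</p>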
<h2 id="heading-how-to-build-the-rag-layer-with-faiss">How to Build the RAG Layer with FAISS</h2>
<p>In <code>rag.py</code>, we handle the “Retriever” part of RAG. This involves turning raw text into mathematical vectors that the computer can compare.</p>
<h3 id="heading-what-is-faiss-and-what-does-it-do">What is FAISS (and What Does It Do)?</h3>
<p><strong>FAISS</strong> (Facebook AI Similarity Search) is a fast library for vector similarity search. In a RAG system, each chunk of text becomes an embedding vector (a list of floats). FAISS stores those vectors in an index so you can quickly ask:</p>
<blockquote>
<p>“Given this question embedding, which document chunks are closest to it?”</p>
</blockquote>
<p>In this tutorial, we use <code>IndexFlatIP</code>, an exact (brute-force) inner-product index, and normalise vectors with <code>faiss.normalize_L2(...)</code>. With normalised vectors, the inner product is equivalent to <strong>cosine similarity</strong>, giving us a stable score we can use for a retrieval gate.</p>
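<p>To see why normalisation matters, here is a small NumPy-only sketch. It assumes a hypothetical <code>l2_normalize</code> helper that returns unit-length rows (the same scaling <code>faiss.normalize_L2</code> applies in place), and the toy vectors are made up for illustration:</p>
<pre><code class="language-python">import numpy as np


def l2_normalize(vectors):
    """Return a copy of `vectors` with every row scaled to unit length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)  # guard against zero vectors


def cosine_similarity(a, b):
    """Textbook cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy "document" vectors and one "query" vector (illustrative values only).
docs = np.array([[3.0, 4.0], [1.0, 0.0], [-3.0, -4.0]], dtype="float32")
query = np.array([[6.0, 8.0]], dtype="float32")

# Inner product of unit vectors: the score an inner-product index would
# return once both sides have been normalised.
ip_scores = (l2_normalize(query) @ l2_normalize(docs).T)[0]

# Cosine similarity computed directly on the raw, unnormalised vectors.
cos_scores = np.array([cosine_similarity(query[0], d) for d in docs])

print(np.allclose(ip_scores, cos_scores, atol=1e-6))
</code></pre>
<p>Because cosine similarity is bounded between -1 and 1 and ignores vector magnitude, a fixed cut-off is comparable from one query to the next, which is what a score-based gate relies on.</p>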
<h3 id="heading-chunking-strategy-with-overlap">Chunking Strategy With Overlap</h3>
<p>We’ll use chunking with overlap. If we split a document at exactly 1,000 characters, we might cut a sentence in half, losing its meaning. By using an overlap, for example, 200 characters, we ensure that the end of one chunk and the beginning of the next share context.</p>
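<p>The overlap is easiest to see on a toy string. The sketch below uses a hypothetical <code>chunk_with_overlap</code> helper with deliberately tiny sizes (10 characters with a 4-character overlap rather than 1,000 and 200) so the shared region is visible:</p>
<pre><code class="language-python">def chunk_with_overlap(text, size=10, overlap=4):
    """Split `text` into windows of `size` characters that share `overlap` characters."""
    step = max(1, size - overlap)  # how far the window advances each time
    return [text[i:i + size] for i in range(0, len(text), step)]


chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz")
for chunk in chunks:
    print(chunk)

# The last 4 characters of one chunk are the first 4 of the next.
print(chunks[0][-4:] == chunks[1][:4])
</code></pre>
<p>Note that the final window can be a short tail that is already fully contained in the previous chunk (here, <code>yz</code>). For real documents you may want to drop or merge such tails, depending on how much duplication you can tolerate in the index.</p>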
<h3 id="heading-implementation-of-ragpy">Implementation of <code>rag.py</code></h3>
<pre><code class="language-python">import os
import faiss
import numpy as np
import requests
import json
from typing import List, Dict
from dotenv import load_dotenv

load_dotenv()

INDEX_PATH = "index/faiss.index"
META_PATH = "index/meta.json"

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -&gt; List[str]:
    chunks = []
    step = max(1, size - overlap)
    for i in range(0, len(text), step):
        chunk = text[i : i + size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

def embed_texts(texts: List[str]) -&gt; np.ndarray:
    # Note: If your provider is not OpenAI-compatible, change this URL and headers
    url = f"{os.getenv('OPENAI_BASE_URL')}/embeddings"
    headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"}
    payload = {"input": texts, "model": "text-embedding-3-small"}

    resp = requests.post(url, headers=headers, json=payload, timeout=30)
    resp.raise_for_status()
    # If your provider uses a different response format, change the line below
    vectors = np.array([item["embedding"] for item in resp.json()["data"]], dtype="float32")
    return vectors

def build_index() -&gt; None:
    all_chunks: List[str] = []
    metadata: List[Dict] = []

    if not os.path.exists("data"):
        os.makedirs("data")
        return

    for file in os.listdir("data"):
        if not file.endswith(".txt"):
            continue

        with open(f"data/{file}", "r", encoding="utf-8") as f:
            text = f.read()

        chunks = chunk_text(text)
        all_chunks.extend(chunks)
        for c in chunks:
            metadata.append({"source": file, "text": c})

    if not all_chunks:
        return

    embeddings = embed_texts(all_chunks)
    faiss.normalize_L2(embeddings)

    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)

    os.makedirs("index", exist_ok=True)
    faiss.write_index(index, INDEX_PATH)

    with open(META_PATH, "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False)

def load_index():
    if not (os.path.exists(INDEX_PATH) and os.path.exists(META_PATH)):
        raise FileNotFoundError(
            "FAISS index not found. Add .txt files to data/ and run build_index()."
        )

    index = faiss.read_index(INDEX_PATH)
    with open(META_PATH, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    return index, metadata

def retrieve(query: str, k: int = 5) -&gt; List[Dict]:
    index, metadata = load_index()

    q_emb = embed_texts([query])
    faiss.normalize_L2(q_emb)

    scores, ids = index.search(q_emb, k)
    results = []
    for score, idx in zip(scores[0], ids[0]):
        if idx == -1:
            continue
        m = metadata[idx]
        results.append(
            {"score": float(score), "source": m["source"], "text": m["text"], "id": int(idx)}
        )
    return results
</code></pre>
<h2 id="heading-how-to-add-the-llm-call-with-structured-output">How to Add the LLM Call with Structured Output</h2>
<p>A major failure point in AI apps is the “chatty” nature of LLMs. If your backend expects a list of sources but the LLM returns conversational filler, your code will crash.</p>
<p>We solve this with <strong>structured output</strong>: instruct the model to return a strict JSON object, then parse it safely.</p>
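<p>Here is what “parse it safely” can look like in practice. The sketch below is one possible approach, not the tutorial's implementation: the helper name <code>parse_model_json</code> and the <code>invalid_json</code> error type are assumptions made for illustration. It fills in missing keys with defaults and converts unparseable output into a refusal with the same shape:</p>
<pre><code class="language-python">import json


def default_response():
    """Fresh defaults on every call so the `sources` list is never shared."""
    return {"answer": "", "refusal": False, "confidence": "medium", "sources": []}


def parse_model_json(content):
    """Parse the model's raw text into a dict that always has the expected keys."""
    try:
        parsed = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        parsed = None

    # Valid JSON that is not an object (a list, a bare string) is also unusable.
    if not isinstance(parsed, dict):
        return {
            **default_response(),
            "answer": "The model returned an unreadable response.",
            "refusal": True,
            "confidence": "low",
            "error_type": "invalid_json",  # hypothetical label
        }

    # Keys supplied by the model win; anything missing falls back to a default.
    return {**default_response(), **parsed}


print(parse_model_json('{"answer": "Paris", "sources": ["geo.txt"]}'))
print(parse_model_json("Sure! Here is your answer: Paris"))
</code></pre>
<p>Whichever branch runs, the caller receives the same four keys, so downstream code never has to check whether a field exists.</p>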
<h3 id="heading-implementation-of-llmpy">Implementation of <code>llm.py</code></h3>
<pre><code class="language-python">import json
import requests
import os
from typing import Dict, Any

def call_llm(system_prompt: str, user_prompt: str) -&gt; Dict[str, Any]:
    # Note: Change URL/Headers if using a non-OpenAI compatible provider
    url = f"{os.getenv('OPENAI_BASE_URL')}/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        "Content-Type": "application/json",
    }

    payload = {
        "model": os.getenv("OPENAI_MODEL"),
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "response_format": {"type": "json_object"},
        "temperature": 0,
    }

    try:
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        resp.raise_for_status()
        content = resp.json()["choices"][0]["message"]["content"]

        parsed = json.loads(content)
        parsed.setdefault("answer", "")
        parsed.setdefault("refusal", False)
        parsed.setdefault("confidence", "medium")
        parsed.setdefault("sources", [])
        return parsed

    except (requests.Timeout, requests.ConnectionError):
        return {
            "answer": "The system is temporarily unavailable (network issue). Please try again.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "network_error",
        }
    except Exception:
        return {
            "answer": "A system error occurred while generating the answer.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "unknown_error",
        }
</code></pre>
<h2 id="heading-how-to-add-guardrails-retrieval-gate-and-fallbacks">How to Add Guardrails: Retrieval Gate and Fallbacks</h2>
<p>Guardrails are interceptors. They sit between the user and the model to prevent predictable failures.</p>
<h3 id="heading-the-retrieval-gate-how-it-works-and-how-to-add-it">The Retrieval Gate: How It Works and How to Add It</h3>
<p>In a standard RAG pipeline, the system always calls the LLM. If the user asks an irrelevant question, the retriever will still return the “closest” (but wrong) chunks.</p>
<p>The solution is the retrieval gate:</p>
<ol>
<li><p>Retrieve top-k chunks and get the <strong>top similarity score</strong></p>
</li>
<li><p>If the score is below a threshold (for example <code>0.30</code>), refuse immediately</p>
</li>
<li><p>Only call the LLM when retrieval is strong enough to ground the answer</p>
</li>
</ol>
<p>A threshold of <code>0.30</code> is a reasonable starting point when using normalised cosine similarity, but you should tune it using evals (covered later in this tutorial).</p>
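<p>The gate itself is only a few lines. Here is a standalone sketch, assuming a hypothetical <code>retrieval_gate</code> helper and made-up scores, that operates on the same list-of-dicts shape <code>retrieve()</code> returns:</p>
<pre><code class="language-python">def retrieval_gate(results, threshold=0.30):
    """Return (passed, top_score) for a list of retrieval results."""
    # An empty result list is treated as a score of 0.0, which fails the gate.
    top_score = max((r["score"] for r in results), default=0.0)
    return top_score &gt;= threshold, top_score


# Made-up scores for illustration.
on_topic = [{"score": 0.62, "source": "handbook.txt"}, {"score": 0.41, "source": "faq.txt"}]
off_topic = [{"score": 0.12, "source": "handbook.txt"}]

print(retrieval_gate(on_topic))
print(retrieval_gate(off_topic))
print(retrieval_gate([]))
</code></pre>
<p>Returning the top score alongside the decision makes it easy to log both, which is the data you need later when tuning the threshold.</p>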
<h3 id="heading-fallbacks-and-why-they-matter">Fallbacks and Why They Matter</h3>
<p>Fallbacks ensure that if an API fails or times out, the user gets a helpful message instead of a crash. They also keep your API response shape consistent, which prevents frontend errors and makes logging meaningful.</p>
<p>In this tutorial, fallbacks are implemented inside <code>call_llm()</code> so your FastAPI layer stays simple.</p>
<h2 id="heading-fastapi-app-creating-the-answer-endpoint">FastAPI App: Creating the /answer Endpoint</h2>
<p>The <code>app.py</code> file is the conductor. It ties retrieval, guardrails, prompting, and generation together.</p>
<h3 id="heading-implementation-of-apppy">Implementation of <code>app.py</code></h3>
<pre><code class="language-python">from fastapi import FastAPI
from pydantic import BaseModel
from rag import retrieve
from llm import call_llm
import prompts
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_app")

app = FastAPI(title="Production-Ready RAG")

class QueryRequest(BaseModel):
    question: str

@app.post("/answer")
async def get_answer(req: QueryRequest):
    start_time = time.time()
    question = (req.question or "").strip()

    if not question:
        return {
            "answer": "Please provide a non-empty question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
        }

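    # 1) Retrieval (the embedding call can fail too, so keep the response shape stable)
    try:
        results = retrieve(question, k=5)
    except Exception:
        logger.exception("retrieval failed for query=%r", question)
        return {
            "answer": "The system is temporarily unavailable. Please try again.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
            "error_type": "retrieval_error",
        }
    top_score = results[0]["score"] if results else 0.0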

    logger.info("query=%r top_score=%.3f num_results=%d", question, top_score, len(results))

    # 2) Retrieval Gate (Guardrail)
    if top_score &lt; 0.30:
        return {
            "answer": "I do not have documents to answer that question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
            "retrieval": {"top_score": top_score, "k": 5},
        }

    # 3) Augment
    context_text = "\n\n".join([f"Source {r['source']}: {r['text']}" for r in results])
    user_prompt = f"Context:\n{context_text}\n\nQuestion: {question}"

    # 4) Generation with Fallback
    response = call_llm(prompts.SYSTEM_PROMPT, user_prompt)

    # 5) Attach debug metadata
    response["latency_sec"] = round(time.time() - start_time, 2)
    response["retrieval"] = {"top_score": top_score, "k": 5}
    return response
</code></pre>
<h2 id="heading-centralized-prompt-template-promptspy">Centralized Prompt Template: <code>prompts.py</code></h2>
<p>A small but important habit: keep prompts centralised so they’re versionable and easy to evaluate.</p>
<h3 id="heading-example-promptspy">Example <code>prompts.py</code></h3>
<pre><code class="language-python">SYSTEM_PROMPT = """You are a RAG assistant. Use ONLY the provided Context to answer.
If the context does not contain the answer, respond with refusal=true.

Return a valid JSON object with exactly these keys:
- answer: string
- refusal: boolean
- confidence: "low" | "medium" | "high"
- sources: array of strings (source filenames you used)

Do not include any extra keys. Do not include markdown. Do not include commentary."""
</code></pre>
<h2 id="heading-how-to-add-beginner-friendly-evals">How to Add Beginner-Friendly Evals</h2>
<p>In AI systems, outputs are probabilistic. This makes testing harder than traditional software. Evals (evaluations) are a set of “golden questions” and “expected behaviours” you run repeatedly to detect regressions.</p>
<p>Instead of “does it output exactly this string,” you test:</p>
<ul>
<li><p>Should the app <strong>refuse</strong> when the retrieval is weak?</p>
</li>
<li><p>When it answers, does it include <strong>sources</strong>?</p>
</li>
<li><p>Is the behaviour stable across prompt tweaks and model changes?</p>
</li>
</ul>
<h3 id="heading-step-1-create-evalsevalsetjson">Step 1: Create <code>evals/eval_set.json</code></h3>
<p>This should contain both positive and negative cases.</p>
<pre><code class="language-json">[
  {
    "id": "in_scope_01",
    "question": "What is a retrieval gate and why is it important?",
    "expect_refusal": false,
    "notes": "Should explain gating and relate it to hallucination prevention."
  },
  {
    "id": "out_of_scope_01",
    "question": "What is the capital of France?",
    "expect_refusal": true,
    "notes": "If the knowledge base only includes our docs, the app should refuse."
  },
  {
    "id": "edge_01",
    "question": "",
    "expect_refusal": true,
    "notes": "Empty input should not call the LLM."
  }
]
</code></pre>
<h3 id="heading-step-2-create-evalsrunevalspy">Step 2: Create <code>evals/run_evals.py</code></h3>
<p>This runner calls your API endpoint (end-to-end) and checks expected behaviours.</p>
<pre><code class="language-python">import json
import requests

API_URL = "http://127.0.0.1:8000/answer"

def run():
    with open("evals/eval_set.json", "r", encoding="utf-8") as f:
        cases = json.load(f)

    passed = 0
    failed = 0

    for case in cases:
        resp = requests.post(API_URL, json={"question": case["question"]}, timeout=60)
        resp.raise_for_status()
        out = resp.json()

        got_refusal = bool(out.get("refusal", False))
        expect_refusal = bool(case["expect_refusal"])

        ok = (got_refusal == expect_refusal)

        # Beginner-friendly: if it answers, sources should exist and be a list
        if not got_refusal:
            ok = ok and isinstance(out.get("sources"), list)

        if ok:
            passed += 1
            print(f"PASS {case['id']}")
        else:
            failed += 1
            print(f"FAIL {case['id']} expected_refusal={expect_refusal} got_refusal={got_refusal}")
            print("Output:", json.dumps(out, indent=2))

    print(f"\nDone. Passed={passed} Failed={failed}")
    if failed:
        raise SystemExit(1)

if __name__ == "__main__":
    run()
</code></pre>
<h3 id="heading-how-to-use-evals-in-practice">How to Use Evals in Practice</h3>
<p>Run your server:</p>
<pre><code class="language-bash">uvicorn app:app --reload
</code></pre>
<p>In another terminal, run evals:</p>
<pre><code class="language-bash">python evals/run_evals.py
</code></pre>
<p>If an eval fails, you have a concrete signal that something changed in retrieval, gating, prompting, or provider behaviour.</p>
<h2 id="heading-what-to-improve-next-realistic-upgrades">What to Improve Next: Realistic Upgrades</h2>
<p>Building a reliable RAG app is iterative. Here are realistic next steps:</p>
<ul>
<li><p><strong>Semantic chunking:</strong> Break text based on meaning instead of character count.</p>
</li>
<li><p><strong>Reranking:</strong> Use a cross-encoder reranker to reorder the top-k chunks for higher precision.</p>
</li>
<li><p><strong>Metadata filtering:</strong> Filter results by category, date, or department to reduce false positives.</p>
</li>
<li><p><strong>Better citations:</strong> Store chunk IDs and show exactly which chunk(s) the answer came from.</p>
</li>
<li><p><strong>Observability:</strong> Add request IDs, structured logs, and traces so “what happened?” is answerable.</p>
</li>
<li><p><strong>Async + background indexing:</strong> Move index building to a background job and keep the API responsive.</p>
</li>
</ul>
<h2 id="heading-final-thoughts-production-ready-is-a-set-of-habits">Final Thoughts: Production-Ready Is a Set of Habits</h2>
<p>Building an AI application that survives in the real world is about building a system that is predictable, measurable, and safe.</p>
<ul>
<li><p><strong>Retrieval quality is measurable:</strong> Use similarity scores to gate your LLM.</p>
</li>
<li><p><strong>Refusal is a feature:</strong> It is better to say “I do not know” than to lie.</p>
</li>
<li><p><strong>Fallbacks are mandatory:</strong> Design for the moment the API goes down.</p>
</li>
<li><p><strong>Evals prevent regressions:</strong> Never deploy a change without running your tests.</p>
</li>
</ul>
<h2 id="heading-about-me">About Me</h2>
<p>I am Chidozie Managwu, an award-winning AI Product Architect and founder focused on helping global tech talent build real, production-ready skills. I contribute to global AI initiatives as a GAFAI Delegate and lead AI Titans Network, a community for developers learning how to ship AI products.</p>
<p>My work has been recognized with the Global Tech Hero award and featured on platforms like HackerNoon.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Serverless RAG Pipeline on AWS That Scales to Zero ]]>
                </title>
                <description>
                    <![CDATA[ Most RAG tutorials end the same way: you've got a working prototype and a bill for a vector database that runs whether anyone's querying it or not. Add an always-on embedding service, a hosted LLM end ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-serverless-rag-pipeline-on-aws-that-scales-to-zero/</link>
                <guid isPermaLink="false">69b1b23c6c896b0519b4eda8</guid>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ serverless ]]>
                    </category>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Christopher Galliart ]]>
                </dc:creator>
                <pubDate>Wed, 11 Mar 2026 18:19:40 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/c0416d9e-9661-47a3-ba9c-8001f5f91b8c.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most RAG tutorials end the same way: you've got a working prototype and a bill for a vector database that runs whether anyone's querying it or not. Add an always-on embedding service, a hosted LLM endpoint, and the usual AWS infrastructure, and you're looking at real money before a single user shows up.</p>
<p>But it doesn't have to work that way. In this tutorial, you'll deploy a fully serverless RAG pipeline that processes documents, images, video, and audio, then scales to zero when nobody's using it.</p>
<p>Everything runs in your AWS account, your data never leaves your infrastructure, and your ongoing monthly cost for a modest knowledge base will be closer to <code>2-3 USD</code> than <code>300 USD</code>.</p>
<p>We'll use <a href="https://github.com/HatmanStack/RAGStack-Lambda">RAGStack-Lambda</a>, an open-source project I built on AWS. By the end, you'll have a deployed pipeline with a dashboard, an AI chat interface with source citations, a drop-in web component you can embed in any app, and an MCP server you can use to feed your assistant context.</p>
<h3 id="heading-heres-what-well-cover">Here's what we'll cover:</h3>
<ul>
<li><p><a href="#heading-what-this-actually-costs">What This Actually Costs</a></p>
</li>
<li><p><a href="#heading-what-youre-building">What You're Building</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-deploying-from-aws-marketplace">Deploying from AWS Marketplace</a></p>
</li>
<li><p><a href="#heading-deploying-from-source">Deploying from Source</a></p>
</li>
<li><p><a href="#heading-uploading-your-first-documents">Uploading Your First Documents</a></p>
</li>
<li><p><a href="#heading-chatting-with-your-knowledge-base">Chatting With Your Knowledge Base</a></p>
</li>
<li><p><a href="#heading-embedding-the-web-component-in-your-app">Embedding the Web Component in Your App</a></p>
</li>
<li><p><a href="#heading-using-the-mcp-server">Using the MCP Server</a></p>
</li>
<li><p><a href="#heading-what-you-can-build-from-here">What You Can Build From Here</a></p>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-what-this-actually-costs">What This Actually Costs</h2>
<p>Before we build anything, let's talk money, because the cost story is the whole point.</p>
<p>RAG pipelines have two cost phases: ingestion (processing your documents once) and operation (querying them over time).</p>
<p>Most platforms charge you a flat monthly rate regardless of which phase you're in. A serverless architecture flips that: ingestion costs something, and then everything scales to zero.</p>
<h3 id="heading-ingestion-the-one-time-hit">Ingestion: The One-Time Hit</h3>
<p>When you upload documents, several things happen: text extraction (OCR for PDFs and images), embedding generation, metadata extraction, and storage. Here's what that actually costs per service:</p>
<p><strong>Textract (OCR):</strong> This is the most expensive part of ingestion, and it only applies to scanned PDFs and images that need text extraction. Plain text, HTML, CSV, and other text-based formats skip this entirely.</p>
<p>Textract charges about <code>1.50 USD</code> per 1,000 pages for standard text detection. If you're uploading 500 pages of scanned PDFs, that's about <code>0.75 USD</code>. A heavy initial load of several thousand scanned pages might run <code>5-10 USD</code>. But once your documents are processed, you never pay this again unless you add new ones.</p>
<p><strong>Bedrock Embeddings (Nova Multimodal):</strong> This is where your content gets converted into vectors for semantic search. The pricing is almost comically cheap:</p>
<ul>
<li><p>Text: <code>0.00002 USD</code> per 1,000 input tokens</p>
</li>
<li><p>Images: <code>0.00115 USD</code> per image</p>
</li>
<li><p>Video/Audio: <code>0.00200 USD</code> per minute</p>
</li>
</ul>
<p>To put that in perspective: if you have 1,500 text documents averaging 2,500 tokens each after chunking, your total embedding cost is about <code>0.08 USD</code>. A knowledge base with 500 images runs <code>0.58 USD</code>. Even a mixed corpus of text, images, and a few hours of video stays well under <code>2 USD</code> for the entire embedding pass. This is a one-time cost – you only re-embed if you add or update documents.</p>
<p><strong>Bedrock LLM (Metadata Extraction):</strong> RAGStack uses an LLM to analyze each document and extract structured metadata automatically. This is a few inference calls per document using Nova Lite or a similar model. At <code>0.06 USD</code>/<code>0.24 USD</code> per million input/output tokens, processing 1,500 documents costs well under <code>1 USD</code>.</p>
<p><strong>S3 Vectors (Storage):</strong> Storing your embeddings. At <code>0.06 USD</code> per GB/month, a knowledge base of 1,500 documents with 1,024-dimension vectors takes up a trivially small amount of space. We're talking pennies per month.</p>
<p><strong>S3 (Document Storage):</strong> Your source documents in standard S3. Even cheaper, <code>0.023 USD</code> per GB/month.</p>
<p><strong>DynamoDB:</strong> Stores document metadata and processing state. The on-demand pricing model means you pay per request during ingestion, then essentially nothing at rest. A few cents for the initial load.</p>
<p>To put real numbers on it: if you upload 200 text documents (PDFs, HTML, markdown), your total ingestion cost is likely under <code>1 USD</code>. If you upload 1,000 scanned PDFs that need OCR, you might see <code>5-8 USD</code> as a one-time hit. That <code>7-10 USD</code> figure you might see referenced? That's the upper end for a heavy initial load with lots of OCR work.</p>
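<p>If you want to plug in your own numbers, the arithmetic above can be sketched in a few lines of Python. This is a rough estimator, not an official calculator: it uses only the Textract and embedding rates quoted in this section, leaves out metadata extraction, storage, and DynamoDB, and the function and constant names are made up for illustration.</p>
<pre><code class="language-python"># Rates quoted in this section (USD); check current AWS pricing before relying on them.
TEXTRACT_PER_PAGE = 1.50 / 1000          # standard text detection
EMBED_TEXT_PER_TOKEN = 0.00002 / 1000    # Nova Multimodal, text input
EMBED_PER_IMAGE = 0.00115
EMBED_PER_MEDIA_MINUTE = 0.00200


def ingestion_cost(ocr_pages=0, text_tokens=0, images=0, media_minutes=0):
    """Rough one-time ingestion estimate covering OCR and embeddings only."""
    return (
        ocr_pages * TEXTRACT_PER_PAGE
        + text_tokens * EMBED_TEXT_PER_TOKEN
        + images * EMBED_PER_IMAGE
        + media_minutes * EMBED_PER_MEDIA_MINUTE
    )


# The worked figures from the text above:
print(round(ingestion_cost(ocr_pages=500), 2))            # 500 scanned pages
print(round(ingestion_cost(text_tokens=1500 * 2500), 3))  # 1,500 docs x 2,500 tokens
print(round(ingestion_cost(images=500), 3))               # 500 images
</code></pre>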
<h3 id="heading-operation-where-scale-to-zero-shines">Operation: Where Scale-to-Zero Shines</h3>
<p>Once your documents are ingested, the pipeline is waiting. Not running. Waiting. Here's what each query costs:</p>
<p><strong>Lambda:</strong> Invocations are billed per request and duration. The free tier covers 1 million requests/month. For a personal or small-team knowledge base, you may never leave the free tier.</p>
<p><strong>S3 Vectors (Queries):</strong> <code>2.50 USD</code> per million query API calls, plus a per-TB data processing charge. For a small index queried a few hundred times a month, this rounds to effectively zero.</p>
<p><strong>Bedrock (Chat Inference):</strong> This is your main operating cost. Each chat response requires an LLM call. Using Nova Lite at <code>0.06 USD</code> per million input tokens and <code>0.24 USD</code> per million output tokens, a typical RAG query (retrieval context + user question + response) might cost <code>0.001-0.003 USD</code> per query. A hundred queries a month is <code>0.10-0.30 USD</code>.</p>
<p><strong>Step Functions:</strong> Orchestrates the document processing pipeline. Standard workflows charge <code>0.025 USD</code> per 1,000 state transitions. Minimal during operation since it's only active during ingestion.</p>
<p><strong>Cognito:</strong> User authentication. Free for the first 10,000 monthly active users.</p>
<p><strong>CloudFront:</strong> Serves the dashboard UI. Free tier covers 1 TB of data transfer per month.</p>
<p><strong>API Gateway:</strong> Handles GraphQL API requests. Free tier covers 1 million API calls per month.</p>
<p>Add it all up for a knowledge base with 500 documents getting a few hundred queries per month, and your monthly operating cost is somewhere between <code>0.50 USD</code> and <code>3.00 USD</code>. Most of that is the LLM inference for chat responses.</p>
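<p>Since chat inference dominates the operating cost, it can help to estimate it from token counts. The sketch below uses the Nova Lite rates quoted above. The token counts are assumptions chosen for illustration (your context size and answer length will differ), and the function names are hypothetical.</p>
<pre><code class="language-python"># Nova Lite on-demand rates quoted above (USD per token).
INPUT_RATE = 0.06 / 1_000_000
OUTPUT_RATE = 0.24 / 1_000_000


def chat_query_cost(input_tokens, output_tokens):
    """Inference cost of one chat response."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE


def monthly_chat_cost(queries, input_tokens, output_tokens):
    """Inference cost of a month of similarly sized queries."""
    return queries * chat_query_cost(input_tokens, output_tokens)


# Assumed sizes: 15,000 input tokens (retrieved context plus question)
# and 1,000 output tokens per response.
print(f"per query: {chat_query_cost(15_000, 1_000):.5f} USD")
print(f"100 queries: {monthly_chat_cost(100, 15_000, 1_000):.2f} USD")
</code></pre>
<p>Everything else in the list either sits inside a free tier at this volume or rounds to zero, which is why the estimate only models inference.</p>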
<h3 id="heading-the-comparison-that-matters">The Comparison That Matters</h3>
<p>Here's the same pipeline on a traditional always-on stack:</p>
<table>
<thead>
<tr>
<th>Service</th>
<th>RAGStack-Lambda</th>
<th>Traditional Stack</th>
</tr>
</thead>
<tbody><tr>
<td>Vector Database</td>
<td>S3 Vectors: pennies/mo</td>
<td>Pinecone Starter: <code>70 USD</code>/mo</td>
</tr>
<tr>
<td>Vector Database (alt)</td>
<td>S3 Vectors: pennies/mo</td>
<td>OpenSearch Serverless: about <code>350 USD</code>/mo min</td>
</tr>
<tr>
<td>Compute</td>
<td>Lambda: free tier</td>
<td>EC2 or ECS: <code>50-150 USD</code>/mo</td>
</tr>
<tr>
<td>LLM Inference</td>
<td>Same per-query cost</td>
<td>Same per-query cost</td>
</tr>
<tr>
<td>Total (idle)</td>
<td>about <code>0.50-3.00 USD</code>/mo</td>
<td><code>120-500 USD</code>/mo</td>
</tr>
</tbody></table>
<p>The LLM inference cost per query is roughly the same everywhere – that's Bedrock's on-demand pricing regardless of your architecture. The difference is everything else. Traditional stacks pay a floor cost whether anyone's using them or not. A serverless stack pays for what it uses, and idle costs essentially nothing.</p>
<h3 id="heading-what-about-transcribe">What About Transcribe?</h3>
<p>If you're uploading video or audio, AWS Transcribe adds cost for speech-to-text conversion. Standard transcription runs about <code>0.024 USD</code> per minute of audio. A 10-minute video costs <code>0.24 USD</code> to transcribe. This is a one-time ingestion cost: once transcribed and embedded, the resulting text chunks are queried like any other document.</p>
<h2 id="heading-what-youre-building">What You're Building</h2>
<p>By the end of this tutorial, you'll have a deployed pipeline that does the following:</p>
<ol>
<li><p>You upload a document through a web dashboard. Supported types include PDF, image, video, audio, HTML, and CSV; <a href="https://github.com/HatmanStack/RAGStack-Lambda/blob/main/docs/ARCHITECTURE.md">the full list</a> is extensive.</p>
</li>
<li><p>The pipeline detects the file type and routes it to the right processor. Scanned PDFs go through OCR via Textract. Video and audio go through Transcribe for speech-to-text, split into 30-second searchable chunks with speaker identification. Images get visual embeddings and any caption text you provide.</p>
</li>
<li><p>An LLM analyzes each document and extracts structured metadata: topic, document type, date range, people mentioned, and whatever else is relevant. This happens automatically.</p>
</li>
<li><p>Everything gets embedded using Amazon Nova Multimodal Embeddings and stored in a Bedrock Knowledge Base backed by S3 Vectors.</p>
</li>
<li><p>You (or your users) ask questions through an AI chat interface. The pipeline retrieves relevant documents, passes them as context to a Bedrock LLM, and returns an answer with collapsible source citations, including timestamp links for video and audio that jump to the exact position.</p>
</li>
</ol>
<p>All of this runs in your AWS account. No external control plane, no third-party services beyond AWS itself.</p>
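<p>To make the routing step concrete, here is a simplified sketch of dispatching on file extension. It is not RAGStack's actual code: the processor names and the extension table are hypothetical, and it assumes every PDF needs OCR, which the real pipeline does not.</p>
<pre><code class="language-python">from pathlib import Path

# Hypothetical processor names mapped to the extensions they handle.
PROCESSORS_BY_EXTENSION = {
    "transcribe": {".mp4", ".mov", ".mp3", ".wav"},
    "image_embedding": {".jpg", ".jpeg", ".png", ".gif", ".webp"},
    "ocr": {".pdf"},  # simplification: treats every PDF as scanned
    "text_extraction": {".html", ".csv", ".md", ".txt", ".json", ".xml"},
}


def route(filename):
    """Return the name of the processor that should handle this file."""
    suffix = Path(filename).suffix.lower()
    for processor, extensions in PROCESSORS_BY_EXTENSION.items():
        if suffix in extensions:
            return processor
    raise ValueError(f"Unsupported file type: {suffix or filename}")


for name in ["report.pdf", "standup.mp4", "diagram.PNG", "notes.md"]:
    print(name, route(name))
</code></pre>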
<h3 id="heading-the-architecture">The Architecture</h3>
<img src="https://cdn.hashnode.com/uploads/covers/698f5932352111d3f67030a2/45eca6a5-91b4-4f55-8b1a-ba9f59a3e25d.png" alt="The diagram illustrates a flowchart of a buyer's AWS account, detailing the application plane with processes like S3 to Lambda OCR, supported by services like Cognito Auth. It emphasizes Amazon Bedrock's integration for knowledge and chat." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>A few things to note about this architecture:</p>
<p><strong>Step Functions orchestrate everything.</strong> When a document is uploaded, a state machine manages the entire processing flow, detecting the file type, routing to the right processor, waiting for async operations like Transcribe jobs, then triggering embedding and metadata extraction.</p>
<p>This is what makes the pipeline reliable without a running server. If a step fails, it retries. You can see exactly where every document is in the processing pipeline.</p>
<p><strong>Lambda does the compute.</strong> Every processing step is a Lambda function. They spin up when needed, run for a few seconds to a few minutes, and shut down. There's no EC2 instance idling at 3 AM.</p>
<p><strong>S3 Vectors is the vector store.</strong> Your embeddings live in S3's purpose-built vector storage rather than in a dedicated vector database like Pinecone or OpenSearch.</p>
<p>This is what makes the "scale to zero" cost possible: you're paying object storage rates for vector data instead of keeping a database cluster warm. It also means your vectors are sitting in your own S3 bucket, not in a third-party managed service that holds your data on their terms.</p>
<p><strong>Cognito handles auth.</strong> The dashboard and API are protected with Cognito user pools. When you deploy, you get a temporary password via email. The web component uses IAM-based authentication, and server-side integrations use API key auth.</p>
<p><strong>CloudFront serves the UI.</strong> The dashboard is a static React app served through CloudFront, so there's no web server to maintain.</p>
<h3 id="heading-two-ways-to-deploy">Two Ways to Deploy</h3>
<p>You have two deployment paths depending on what you want:</p>
<p><strong>AWS Marketplace (the fast path):</strong> Click deploy, fill in two fields (stack name and email), and wait about 10 minutes. No local tooling required. This is the path we'll walk through first.</p>
<p><strong>From Source (the developer path):</strong> Clone the repo, run <code>publish.py</code>, and deploy via SAM CLI. This is the path for when you want to customize the processing pipeline, modify the UI, or contribute to the project. We'll cover this after the Marketplace walkthrough.</p>
<p>Both paths produce the same stack. The Marketplace version just wraps the CloudFormation template in a one-click deployment.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you deploy, you'll need:</p>
<ul>
<li><p><strong>An AWS account</strong> with permissions to create CloudFormation stacks, Lambda functions, S3 buckets, DynamoDB tables, and Cognito user pools. If you're using an admin account, you're covered.</p>
</li>
<li><p><strong>Bedrock model access:</strong> RAGStack defaults to <code>us-east-1</code> because that's where Nova Multimodal Embeddings is available. Amazon's own models (including Nova) are available by default in Bedrock, so no manual enablement is required. Just make sure your IAM role has the necessary <code>bedrock:InvokeModel</code> permissions.</p>
</li>
<li><p><strong>For the Marketplace path:</strong> just a web browser.</p>
</li>
<li><p><strong>For the source path:</strong> Python 3.13+, Node.js 24+, AWS CLI and SAM CLI configured, and Docker (for building Lambda layers).</p>
</li>
</ul>
<h2 id="heading-deploying-from-aws-marketplace">Deploying from AWS Marketplace</h2>
<p>This is the fastest path – no local tools, no CLI, no Docker. You'll launch a CloudFormation stack and have a working pipeline in about 10 minutes.</p>
<h3 id="heading-step-1-launch-the-stack">Step 1: Launch the Stack</h3>
<p>Click the <a href="https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review?templateURL=https://ragstack-quicklaunch-public.s3.us-east-1.amazonaws.com/ragstack-template.yaml&amp;stackName=my-docs">direct deploy link</a> to open CloudFormation's "Quick create stack" page with the template pre-loaded.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698f5932352111d3f67030a2/d354f6bc-dee8-4f44-9b3b-523ea27564c7.png" alt="Screenshot of AWS CloudFormation Quick Create Stack page in dark mode. Sections for template URL, stack name, parameters, and build options are visible." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-step-2-fill-in-two-fields">Step 2: Fill In Two Fields</h3>
<p>The page has a lot of options, but you only need two:</p>
<ul>
<li><p><strong>Stack name:</strong> Must be lowercase. This becomes the prefix for all your AWS resources (for example, <code>my-docs</code>, <code>team-kb</code>, <code>project-notes</code>). Keep it short.</p>
</li>
<li><p><strong>Admin Email:</strong> Under Required Settings. Cognito will send your temporary login credentials here. Use an email you can access right now.</p>
</li>
</ul>
<p>Everything else – Build Options, Advanced Settings, OCR Backend, model selections – can stay at the defaults. They're there for customization later, but the defaults work out of the box.</p>
<h3 id="heading-step-3-deploy">Step 3: Deploy</h3>
<p>Scroll to the bottom, check the three acknowledgment boxes under "Capabilities and transforms," and click <strong>Create stack</strong>.</p>
<p>Deployment takes roughly 10 minutes. You can watch the progress in the CloudFormation Events tab if you're curious, but there's nothing to do until the stack status flips to <code>CREATE_COMPLETE</code>.</p>
<h3 id="heading-step-4-log-in">Step 4: Log In</h3>
<p>Once the stack finishes, check your email. Cognito sends you the dashboard URL and a temporary password. Log in, set a new password, and you're looking at an empty dashboard ready for documents.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698f5932352111d3f67030a2/5ac31b6c-2782-4b66-82a9-0cb962c5dac4.png" alt="A software dashboard interface titled 'Document Pipeline (Demo)' displaying options for uploading, scraping, and searching documents. The screen shows no current documents or scrape jobs, with menu options on the left and a search and filter bar at the center. The overall tone is functional and minimalist." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-deploying-from-source">Deploying from Source</h2>
<p>If you want to customize the pipeline, modify the UI, or contribute to the project, deploy from source instead.</p>
<h3 id="heading-step-1-clone-and-set-up">Step 1: Clone and Set Up</h3>
<pre><code class="language-bash">git clone https://github.com/HatmanStack/RAGStack-Lambda.git
cd RAGStack-Lambda

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
</code></pre>
<h3 id="heading-step-2-deploy">Step 2: Deploy</h3>
<p>The <code>publish.py</code> script handles everything: building the frontend, packaging Lambda functions, and deploying via SAM CLI.</p>
<pre><code class="language-bash">python publish.py \
  --project-name my-docs \
  --admin-email admin@example.com
</code></pre>
<p>This defaults to <code>us-east-1</code> for Nova Multimodal Embeddings. The script will build the React dashboard, build the web component, package all Lambda layers with Docker, and deploy the CloudFormation stack through SAM.</p>
<p>First deploy takes longer (15-20 minutes) because it's building everything from scratch. Subsequent deploys are faster since SAM caches unchanged resources.</p>
<p>If you only want to iterate on the backend and skip UI builds:</p>
<pre><code class="language-bash"># Skip dashboard build (still builds web component)
python publish.py --project-name my-docs --admin-email admin@example.com --skip-ui

# Skip ALL UI builds
python publish.py --project-name my-docs --admin-email admin@example.com --skip-ui-all
</code></pre>
<p>Once it finishes, you'll get the same Cognito email and dashboard URL as the Marketplace path.</p>
<h2 id="heading-uploading-your-first-documents">Uploading Your First Documents</h2>
<p>The dashboard has tabs for different content types. We'll start with the Documents tab since that's the most common use case.</p>
<h3 id="heading-documents">Documents</h3>
<p>Click the <strong>Documents</strong> tab and upload a file. RAGStack accepts a wide range of formats: PDF, DOCX, XLSX, HTML, CSV, JSON, XML, EML, EPUB, TXT, and Markdown. Drag and drop or use the file picker.</p>
<p>Once uploaded, the document enters the processing pipeline. You'll see the status update in real time:</p>
<ol>
<li><p><strong>UPLOADED:</strong> File received and stored in S3.</p>
</li>
<li><p><strong>PROCESSING:</strong> Step Functions has picked it up and routed it to the right processor. Text-based files (HTML, CSV, Markdown) go through direct extraction. Scanned PDFs and images go through Textract OCR. The LLM then analyzes the content and extracts structured metadata: topic, document type, people mentioned, date ranges, and whatever else is relevant to the content.</p>
</li>
<li><p><strong>INDEXED:</strong> Embeddings generated, vectors stored, document is searchable.</p>
</li>
</ol>
<p>Text documents typically process in 1-5 minutes. OCR-heavy documents (scanned PDFs, images with text) can take 2-15 minutes depending on page count.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698f5932352111d3f67030a2/3df05041-2632-41a9-a71c-6d764c503f2a.png" alt="Screenshot of a document upload interface labeled &quot;Document Pipeline (Demo).&quot; Central panel shows a box for drag-and-drop file upload. Sleek, modern design." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-images">Images</h3>
<p>The <strong>Images</strong> tab works differently. Upload a JPG, PNG, GIF, or WebP and you can add a caption. Both the visual content and caption text get embedded using Nova Multimodal Embeddings, so you can search by what's in the image or by your description of it.</p>
<p>This is where multimodal embeddings earn their keep. A traditional text-only RAG pipeline would need you to describe every image manually. Here, the image itself becomes searchable, and since everything stays in your AWS account, you're not sending personal photos or sensitive visual content to an external service to get there.</p>
<h3 id="heading-what-about-video-and-audio">What About Video and Audio?</h3>
<p>Upload video or audio files and RAGStack routes them through AWS Transcribe for speech-to-text conversion. The transcript gets split into 30-second chunks with speaker identification, then embedded like any other document. When chat results reference a video source, you get timestamp links that jump to the exact position in the recording.</p>
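<p>The chunking idea can be sketched as grouping transcript segments into fixed 30-second windows. This is only an illustration. The segment shape (<code>start</code>, <code>speaker</code>, <code>text</code>) is an assumed, simplified format rather than Transcribe's actual output, and <code>chunk_transcript</code> is a hypothetical helper.</p>
<pre><code class="language-python">from collections import defaultdict

CHUNK_SECONDS = 30


def chunk_transcript(segments, chunk_seconds=CHUNK_SECONDS):
    """Group transcript segments into fixed windows keyed by start time."""
    windows = defaultdict(list)
    for seg in segments:
        windows[int(seg["start"] // chunk_seconds)].append(seg)

    chunks = []
    for index in sorted(windows):
        segs = windows[index]
        chunks.append({
            # Window start in seconds; a citation link can seek to this position.
            "start_sec": index * chunk_seconds,
            "speakers": sorted({s["speaker"] for s in segs}),
            "text": " ".join(f"{s['speaker']}: {s['text']}" for s in segs),
        })
    return chunks


# Hypothetical segments with speaker labels and start times in seconds.
segments = [
    {"start": 2.0, "speaker": "spk_0", "text": "Welcome to the demo."},
    {"start": 18.5, "speaker": "spk_1", "text": "Thanks for having me."},
    {"start": 31.0, "speaker": "spk_0", "text": "Let's look at costs."},
]

for chunk in chunk_transcript(segments):
    print(chunk["start_sec"], chunk["speakers"], chunk["text"])
</code></pre>
<p>Each chunk keeps its start offset, which is the piece of information a timestamp link needs.</p>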
<h3 id="heading-web-scraping">Web Scraping</h3>
<p>The <strong>Scrape</strong> tab lets you pull websites directly into your knowledge base. Enter a URL and RAGStack crawls the page, extracts the content, and processes it through the same pipeline as uploaded documents: metadata extraction, embedding, and indexing.</p>
<p>This is useful for building a knowledge base from existing web content without manually saving and uploading pages. Documentation sites, blog archives, reference material, anything publicly accessible.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698f5932352111d3f67030a2/ac2c6239-a323-4770-80f7-31aa7ff3bdfb.png" alt="Web scraping interface with fields for URL, max pages, and depth. A dropdown for scope selection and a 'Start Scrape' button are visible." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-chatting-with-your-knowledge-base">Chatting With Your Knowledge Base</h2>
<p>This is the payoff. Go to the <strong>Chat</strong> tab, type a question, and RAGStack retrieves relevant documents from your knowledge base, passes them as context to a Bedrock LLM, and returns an answer with source citations.</p>
<p>The citations are collapsible, so click to expand and see which documents informed the answer, with the option to download the source file. For video and audio sources, you get clickable timestamps that jump to the relevant moment.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698f5932352111d3f67030a2/760b3cd0-8bb8-493d-97ce-5eb3d0138592.png" alt="Screenshot of a web interface titled &quot;Knowledge Base Chat&quot; with menu options on the left. The central section prompts users to ask document-related questions." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-metadata-filtering">Metadata Filtering</h3>
<p>If you've uploaded enough documents to have meaningful metadata categories, the chat interface lets you filter search results by metadata before querying. RAGStack auto-discovers the metadata structure from your documents, so you don't configure this manually. It just appears as your knowledge base grows.</p>
<p>This is useful when you have a large mixed corpus. Instead of hoping the vector search picks the right context from thousands of documents, you can narrow it down: "only search documents about project X" or "only search content from Q4 2024."</p>
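<p>Conceptually, a metadata filter shrinks the candidate set before any vector ranking happens. The sketch below shows that idea in plain TypeScript. The <code>Doc</code> type and the <code>filterByMetadata</code> function are assumptions for illustration, not RAGStack's API.</p>

```typescript
// Hypothetical document shape with flat key/value metadata.
type Metadata = { [key: string]: string };

interface Doc {
  id: string;
  metadata: Metadata;
}

// Keep only documents whose metadata matches every key in the filter.
function filterByMetadata(docs: Doc[], filter: Metadata): Doc[] {
  return docs.filter((doc) =>
    Object.keys(filter).every((key) => doc.metadata[key] === filter[key])
  );
}
```

<p>Only the documents that survive the filter would then be ranked by vector similarity.</p>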
<h2 id="heading-embedding-the-web-component-in-your-app">Embedding the Web Component in Your App</h2>
<p>The dashboard is useful for managing your knowledge base, but the real power is embedding RAGStack's chat in your own application. The web component works with any framework: React, Vue, Angular, Svelte, or plain HTML.</p>
<p>Load the script once from your CloudFront distribution:</p>
<pre><code class="language-html">&lt;script src="https://your-cloudfront-url/ragstack-chat.js"&gt;&lt;/script&gt;
</code></pre>
<p>Then drop the component wherever you want a chat interface:</p>
<pre><code class="language-html">&lt;ragstack-chat
  conversation-id="my-app"
  header-text="Ask About Documents"
&gt;&lt;/ragstack-chat&gt;
</code></pre>
<p>That's it. The component handles authentication (via IAM), manages conversation state, and renders source citations, all self-contained. Your CloudFront URL is in the stack outputs.</p>
<p>For server-side integrations that don't need a UI, the GraphQL API is available with API key authentication. You can find your endpoint and API key in the dashboard under Settings.</p>
<h2 id="heading-using-the-mcp-server">Using the MCP Server</h2>
<p>RAGStack includes an MCP server that connects your knowledge base to AI assistants like Claude Desktop, Cursor, VS Code, and Amazon Q CLI. Instead of switching to the dashboard to search your documents, you ask your assistant directly.</p>
<p>Install it:</p>
<pre><code class="language-bash">pip install ragstack-mcp
</code></pre>
<p>Then add it to your AI assistant's MCP configuration:</p>
<pre><code class="language-json">{
  "ragstack": {
    "command": "uvx",
    "args": ["ragstack-mcp"],
    "env": {
      "RAGSTACK_GRAPHQL_ENDPOINT": "YOUR_ENDPOINT",
      "RAGSTACK_API_KEY": "YOUR_API_KEY"
    }
  }
}
</code></pre>
<p>Your endpoint and API key are in the dashboard under Settings. Once configured, type <code>@ragstack</code> in your assistant's chat to invoke the MCP server, then ask things like "search my knowledge base for authentication docs" and it queries RAGStack directly.</p>
<p>See the <a href="https://github.com/HatmanStack/RAGStack-Lambda/blob/main/src/ragstack-mcp/README.md">MCP Server docs</a> for the full list of available tools and setup details.</p>
<h2 id="heading-what-you-can-build-from-here">What You Can Build From Here</h2>
<p>You've got a deployed RAG pipeline that costs almost nothing to run and handles text, images, video, and audio. A few directions you might take it:</p>
<p><strong>A searchable personal archive.</strong> Every conference talk you've saved, every PDF textbook, every tutorial video that's sitting in a folder somewhere. Upload it all, and now you have one search interface across years of accumulated material. The multimodal embeddings mean your screenshots and diagrams are searchable too, not just the text.</p>
<p>I built <a href="https://github.com/HatmanStack/family-archive-document-ai">a family archive app</a> this way: scanned letters, old photos, and home videos, with RAGStack deployed as a nested CloudFormation stack so the whole family can search across decades of memories using the chat widget.</p>
<p><strong>A second brain for a client project.</strong> Scrape the client's existing docs, upload the SOW and meeting notes, drop in the codebase documentation. Now you've got a searchable knowledge base scoped to that engagement. Spin it up at the start, tear it down when the contract ends. At these costs, it's disposable infrastructure.</p>
<p><strong>AI chat over a niche dataset.</strong> Recipe collections, legal filings, research papers, local government meeting minutes, any corpus that's too specialized for general-purpose LLMs to know well. The web component means you can ship it as a standalone tool without building a frontend from scratch.</p>
<p><strong>RAG for your MCP workflow.</strong> If you're already using Claude Desktop or Cursor, the MCP server turns your knowledge base into another tool your assistant can reach for. Upload your team's runbooks and architecture docs, and now <code>@ragstack</code> in your editor gives you instant context without tab-switching.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>The serverless RAG pipeline you just deployed handles document processing, multimodal embeddings, metadata extraction, and AI chat with source citations, all scaling to zero when idle, all running in your AWS account. Your documents, your vectors, your infrastructure. The traditional approach to this stack costs 120 to 500 USD per month in baseline infrastructure. This one costs pocket change.</p>
<p>The full source is at <a href="https://github.com/HatmanStack/RAGStack-Lambda">github.com/HatmanStack/RAGStack-Lambda</a>. File issues, open PRs, or just poke around the architecture. If you want to go deeper on the technical tradeoffs, particularly how filtered vector search behaves on cost-optimized backends like S3 Vectors, that's a story for the next post.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an AI-Powered RAG Search Application with Next.js, Supabase, and OpenAI ]]>
                </title>
                <description>
                    <![CDATA[ In this tutorial, you'll learn how to build a complete RAG (Retrieval-Augmented Generation) search application from scratch. Your application will allow users to upload documents, store them securely, and search through them using AI-powered semantic... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-an-ai-powered-rag-search-application-with-nextjs-supabase-and-openai/</link>
                <guid isPermaLink="false">6978f421ead51482f82901bf</guid>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ supabase ]]>
                    </category>
                
                    <category>
                        <![CDATA[ openai ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Next.js ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mayur Vekariya ]]>
                </dc:creator>
                <pubDate>Tue, 27 Jan 2026 17:21:37 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769534479648/a3f19714-a00b-4444-9289-753902282ac6.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this tutorial, you'll learn how to build a complete RAG (Retrieval-Augmented Generation) search application from scratch. Your application will allow users to upload documents, store them securely, and search through them using AI-powered semantic search.</p>
<p>By the end of this guide, you'll have a fully functional application that can:</p>
<ul>
<li><p>Upload and process PDF, DOCX, and TXT files</p>
</li>
<li><p>Store documents in Supabase Storage</p>
</li>
<li><p>Generate embeddings using OpenAI</p>
</li>
<li><p>Perform semantic search across document chunks</p>
</li>
<li><p>Provide AI-generated answers based on document content</p>
</li>
<li><p>View and manage uploaded documents</p>
</li>
</ul>
<p>This is a production-ready solution that you can deploy and use immediately.</p>
<h2 id="heading-table-of-contents"><strong>Table of Contents</strong></h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-youll-learn">What You'll Learn</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-the-technologies">Understanding the Technologies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-overview">Project Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-create-your-nextjs-project">Step 1: Create Your Next.js Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-install-required-dependencies">Step 2: Install Required Dependencies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-set-up-your-supabase-project">Step 3: Set Up Your Supabase Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-configure-environment-variables">Step 4: Configure Environment Variables</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-create-the-upload-api-route">Step 5: Create the Upload API Route</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-create-the-rag-search-api-route">Step 6: Create the RAG Search API Route</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-7-create-the-documents-api-route">Step 7: Create the Documents API Route</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-8-create-the-upload-modal-component">Step 8: Create the Upload Modal Component</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-9-create-the-pdf-viewer-modal-component">Step 9: Create the PDF Viewer Modal Component</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-10-create-the-navigation-component">Step 10: Create the Navigation Component</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-11-create-the-home-page-search-interface">Step 11: Create the Home Page (Search Interface)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-12-create-the-documents-page">Step 12: Create the Documents Page</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-13-test-your-application">Step 13: Test Your Application</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-14-deploy-your-application">Step 14: Deploy Your Application</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-rag-search-works">How RAG Search Works</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-troubleshooting-common-issues">Troubleshooting Common Issues</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-next-steps">Next Steps</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-youll-learn"><strong>What You'll Learn</strong></h2>
<p>In this handbook, you'll learn how to:</p>
<ul>
<li><p>Set up a Next.js application with TypeScript</p>
</li>
<li><p>Configure Supabase for database and file storage</p>
</li>
<li><p>Integrate OpenAI embeddings and chat completions</p>
</li>
<li><p>Implement document text extraction and chunking</p>
</li>
<li><p>Build a vector search system using PostgreSQL</p>
</li>
<li><p>Create a modern UI with React components</p>
</li>
<li><p>Handle file uploads and storage</p>
</li>
<li><p>Implement RAG (Retrieval-Augmented Generation) search</p>
</li>
</ul>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>Before you begin, make sure you have:</p>
<ul>
<li><p>Node.js 18 or higher installed on your computer</p>
</li>
<li><p>A Supabase account (free tier works fine)</p>
</li>
<li><p>An OpenAI API key</p>
</li>
<li><p>Basic knowledge of React and TypeScript</p>
</li>
<li><p>Familiarity with Next.js (helpful but not required)</p>
</li>
</ul>
<h2 id="heading-understanding-the-technologies"><strong>Understanding the Technologies</strong></h2>
<p>Before we dive into building the application, you should understand the key technologies and concepts you'll be working with:</p>
<h3 id="heading-what-is-rag-retrieval-augmented-generation"><strong>What is RAG (Retrieval-Augmented Generation)?</strong></h3>
<p>RAG is an AI pattern that combines information retrieval with text generation. Instead of relying solely on an AI model's training data, RAG retrieves relevant information from your own documents. It then uses that information as context to generate accurate, up-to-date answers. This approach gives you:</p>
<ul>
<li><p><strong>Accuracy</strong>: Answers are based on your actual documents, not just the AI's training data</p>
</li>
<li><p><strong>Transparency</strong>: You can see which document sections were used to generate the answer</p>
</li>
<li><p><strong>Efficiency</strong>: Only relevant document chunks are used, reducing token costs</p>
</li>
</ul>
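<p>To make the pattern concrete, here's a minimal sketch of the "augment" step: it takes chunks that a retrieval step has already returned and folds them into a prompt for the model. The <code>RetrievedChunk</code> type and <code>buildRagPrompt</code> function are hypothetical names used for illustration, not the code you'll write later in this tutorial.</p>

```typescript
// Hypothetical shape of a chunk returned by the retrieval step.
interface RetrievedChunk {
  source: string;
  content: string;
}

// Combine the retrieved chunks and the user's question into one prompt.
function buildRagPrompt(question: string, chunks: RetrievedChunk[]): string {
  const context = chunks
    .map((chunk, i) => `[${i + 1}] (${chunk.source}) ${chunk.content}`)
    .join('\n');
  return [
    'Answer the question using only the context below.',
    'Context:',
    context,
    `Question: ${question}`,
  ].join('\n\n');
}
```

<p>Numbering each chunk and naming its source is one simple way to support the transparency benefit: the model can refer back to the numbered sections it used.</p>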
<h3 id="heading-what-are-embeddings-and-vector-database"><strong>What are Embeddings and Vector Database?</strong></h3>
<p>Embeddings are numerical representations of text that capture semantic meaning. When you convert text to an embedding, similar meanings are represented by similar numbers. For example, "dog" and "puppy" would have similar embeddings. Meanwhile, "dog" and "airplane" would have very different ones.</p>
<p>OpenAI's embedding models convert text into vectors. These are arrays of numbers that can be compared mathematically. This allows you to find documents that are semantically similar to a search query. You can find matches even if they don't contain the exact same words.</p>
<p>A vector database is a specialized database designed to store and search through embeddings efficiently. Instead of searching for exact text matches, it uses mathematical operations like <a target="_blank" href="https://www.freecodecamp.org/news/how-does-cosine-similarity-work/">cosine similarity</a> to find the most semantically similar content.</p>
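<p>If you'd like to see what that comparison looks like in code, here's a small, self-contained sketch of cosine similarity. The function name is just for illustration. In the app itself, this calculation happens inside the database rather than in your TypeScript code.</p>

```typescript
// Cosine similarity: the dot product of two vectors divided by the product of their lengths.
// Values close to 1 mean the vectors point in nearly the same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Vectors must be non-empty and the same length');
  }
  const dot = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  // A zero vector has no direction, so this sketch treats it as "not similar".
  if (normA === 0 || normB === 0) return 0;
  return dot / (normA * normB);
}
```

<p>With made-up three-dimensional vectors such as <code>[0.9, 0.1, 0]</code> for "dog", <code>[0.8, 0.2, 0.1]</code> for "puppy", and <code>[0, 0.1, 0.95]</code> for "airplane", the dog and puppy pair scores much higher than the dog and airplane pair. Real embeddings work the same way, just with far more dimensions.</p>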
<p>In this tutorial, you'll use Supabase's PostgreSQL database with the <code>pgvector</code> extension. This extension adds vector storage and similarity search capabilities to PostgreSQL, so you can store embeddings alongside your regular database data and run fast similarity searches over them.</p>
<h3 id="heading-what-is-text-chunking"><strong>What is Text Chunking?</strong></h3>
<p>Text chunking is the process of breaking large documents into smaller, manageable pieces. This is necessary for several reasons.</p>
<p>First, AI models have token limits, which cap how much text they can accept in a single request. Second, smaller chunks allow for more precise retrieval. Third, overlapping chunks reduce the risk of losing context at chunk boundaries.</p>
<p>You'll use LangChain's <code>RecursiveCharacterTextSplitter</code>. This tool intelligently splits text while trying to preserve sentence and paragraph boundaries.</p>
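<p>To see why overlap helps, here's a deliberately simplified sketch of a fixed-size sliding-window chunker. This is not how <code>RecursiveCharacterTextSplitter</code> works internally (it also looks for paragraph and sentence separators). The <code>chunkWithOverlap</code> function is a hypothetical helper that only demonstrates the overlap idea.</p>

```typescript
// Split text into chunks of chunkSize characters, where each chunk
// repeats the last `overlap` characters of the previous one.
function chunkWithOverlap(text: string, chunkSize: number, overlap: number): string[] {
  if (!(chunkSize > 0) || !(overlap >= 0) || overlap >= chunkSize) {
    throw new Error('chunkSize must be positive and overlap must be smaller than chunkSize');
  }
  const step = chunkSize - overlap;
  const chunks: string[] = [];
  let start = 0;
  while (text.length > start) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current chunk reaches the end of the text.
    if (start + chunkSize >= text.length) break;
    start += step;
  }
  return chunks;
}
```

<p>Because consecutive chunks share some characters, a sentence that straddles a boundary still appears intact in at least one chunk, as long as it fits within the overlap.</p>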
<h3 id="heading-what-is-supabase"><strong>What is Supabase?</strong></h3>
<p>Supabase is an open-source Firebase alternative. It provides several key features.</p>
<p>You get a PostgreSQL database, which is a powerful, open-source relational database. You also get storage, which is file storage similar to AWS S3. There are real-time features that provide real-time subscriptions to database changes. Finally, there's built-in user authentication.</p>
<p>For this project, you'll use Supabase's database to store document chunks and embeddings. You'll also use Supabase Storage to store the original uploaded files.</p>
<h3 id="heading-what-is-tailwind-css"><strong>What is Tailwind CSS?</strong></h3>
<p>Tailwind CSS is a utility-first CSS framework that lets you style your application by applying pre-built utility classes directly in your HTML/JSX. Instead of writing custom CSS, you use classes like <code>bg-blue-600</code>, <code>text-white</code>, and <code>rounded-lg</code> to style elements.</p>
<p>You'll use Tailwind CSS in this project because it speeds up development by providing ready-made styling utilities. It also ensures consistent design across the application. Plus, it makes it easy to create responsive, modern UIs. Finally, it works seamlessly with Next.js.</p>
<p>Now that you understand the core concepts and tools we’ll be using, let's start building the application.</p>
<h2 id="heading-project-overview"><strong>Project Overview</strong></h2>
<p>Your RAG search application will consist of:</p>
<ol>
<li><p><strong>Frontend</strong>: Next.js application with React components for uploading documents and searching</p>
</li>
<li><p><strong>Backend API Routes</strong>: Next.js API routes for handling uploads, searches, and document management</p>
</li>
<li><p><strong>Database</strong>: Supabase PostgreSQL with vector extension for storing embeddings</p>
</li>
<li><p><strong>Storage</strong>: Supabase Storage for storing original files</p>
</li>
<li><p><strong>AI Integration</strong>: OpenAI for generating embeddings and chat completions</p>
</li>
</ol>
<p>The application will have two main pages:</p>
<ul>
<li><p><strong>Search Page</strong>: Where users can ask questions about their uploaded documents and get AI-generated answers</p>
</li>
<li><p><strong>Documents Page</strong>: Where users can view all uploaded documents, upload new ones, preview files, and manage their document library</p>
</li>
</ul>
<p>Let's start building!</p>
<p>If you ever get stuck on the source code, you can view it on GitHub here:</p>
<p><a target="_blank" href="https://github.com/mayur9210/rag-search-app">https://github.com/mayur9210/rag-search-app</a></p>
<h2 id="heading-step-1-create-your-nextjs-project">Step 1: Create Your Next.js Project</h2>
<p>Start by creating a new Next.js project with TypeScript. Open your terminal and run:</p>
<pre><code class="lang-bash">npx create-next-app@latest rag-search-app --typescript --tailwind --app
</code></pre>
<p>When prompted, choose the following options:</p>
<ul>
<li><p>TypeScript: Yes</p>
</li>
<li><p>ESLint: Yes</p>
</li>
<li><p>Tailwind CSS: Yes</p>
</li>
<li><p>App Router: Yes (default)</p>
</li>
<li><p>Customize import alias: No</p>
</li>
</ul>
<p>Navigate into your project directory:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> rag-search-app
</code></pre>
<p>Now that your project is set up, you'll need to install the additional packages required for document processing, AI integration, and database operations.</p>
<h2 id="heading-step-2-install-required-dependencies">Step 2: Install Required Dependencies</h2>
<p>You'll need several packages for this project. You can install them using npm:</p>
<pre><code class="lang-bash">npm install @supabase/supabase-js @langchain/openai @langchain/textsplitters langchain openai mammoth pdf2json
</code></pre>
<p>Here's what each package does:</p>
<ul>
<li><p><code>@supabase/supabase-js</code>: Client library for interacting with Supabase (database and storage)</p>
</li>
<li><p><code>@langchain/openai</code>: LangChain integration for OpenAI (helps with text processing)</p>
</li>
<li><p><code>@langchain/textsplitters</code>: Text splitting utilities for chunking documents into smaller pieces</p>
</li>
<li><p><code>langchain</code>: Core LangChain library (provides AI workflow tools)</p>
</li>
<li><p><code>openai</code>: Official OpenAI SDK (for generating embeddings and chat completions)</p>
</li>
<li><p><code>mammoth</code>: Converts DOCX files to plain text</p>
</li>
<li><p><code>pdf2json</code>: Extracts text from PDF files</p>
</li>
</ul>
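<p>The original setup also installs the <code>@types/pdf-parse</code> type definitions as a dev dependency. Note that these describe the <code>pdf-parse</code> package rather than <code>pdf2json</code>, and the upload route in Step 5 casts the parser to <code>any</code>, so treat this step as optional:</p>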
<pre><code class="lang-bash">npm install --save-dev @types/pdf-parse
</code></pre>
<p>With all dependencies installed, you're ready to set up your Supabase project, which will handle your database and file storage needs.</p>
<h2 id="heading-step-3-set-up-your-supabase-project">Step 3: Set Up Your Supabase Project</h2>
<h3 id="heading-create-a-supabase-project">Create a Supabase Project</h3>
<p>First, you’ll need to create a new Supabase project, which you can do by following these steps:</p>
<ol>
<li><p>Go to <a target="_blank" href="https://supabase.com/"><strong>supabase.com</strong></a> and sign in or create an account</p>
</li>
<li><p>Click "New Project"</p>
</li>
<li><p>Fill in your project details:</p>
<ul>
<li><p>Name: <code>rag-search-app</code> (or any name you prefer)</p>
</li>
<li><p>Database Password: Choose a strong password (save this – you'll need it)</p>
</li>
<li><p>Region: Select the region closest to you</p>
</li>
</ul>
</li>
<li><p>Click "Create new project" and wait for it to be ready (this takes a few minutes)</p>
</li>
</ol>
<h3 id="heading-get-your-supabase-credentials">Get Your Supabase Credentials</h3>
<p>Once your project is ready, go to <strong>Settings</strong> and then <strong>API</strong>.</p>
<p>Copy the following values:</p>
<ul>
<li><p><strong>Project URL</strong> (this is your <code>NEXT_PUBLIC_SUPABASE_URL</code>)</p>
</li>
<li><p><strong>anon public key</strong> (this is your <code>NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY</code>)</p>
</li>
<li><p><strong>service_role key</strong> (this is your <code>SUPABASE_SERVICE_ROLE_KEY</code>)</p>
</li>
</ul>
<p><strong>Important</strong>: Keep your service role key secret. Never expose it in client-side code. It bypasses Row-Level Security (RLS) policies, which is necessary for server-side file uploads but should never be used in browser code.</p>
<h3 id="heading-set-up-the-database-schema"><strong>Set Up the Database Schema</strong></h3>
<p>Now you'll set up the database structure to store your documents and embeddings. Go to <strong>SQL Editor</strong> in your Supabase dashboard and run the following SQL:</p>
<pre><code class="lang-pgsql"><span class="hljs-comment">-- Enable the vector extension for embeddings</span>
<span class="hljs-comment">-- This extension allows PostgreSQL to store and search vector data efficiently</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">EXTENSION</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> vector;

<span class="hljs-comment">-- Create the documents table</span>
<span class="hljs-comment">-- This table stores document chunks, their metadata, and embeddings</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> documents (
  id <span class="hljs-type">BIGSERIAL</span> <span class="hljs-keyword">PRIMARY KEY</span>,
  content <span class="hljs-type">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">NULL</span>,
  metadata <span class="hljs-type">JSONB</span>,
  embedding vector(<span class="hljs-number">1536</span>),  <span class="hljs-comment">-- OpenAI's text-embedding-3-small produces 1536-dimensional vectors</span>
  file_path <span class="hljs-type">text</span> <span class="hljs-keyword">null</span>,
  file_url <span class="hljs-type">text</span> <span class="hljs-keyword">null</span>
);

<span class="hljs-comment">-- Create an index on the embedding column for faster similarity search</span>
<span class="hljs-comment">-- The ivfflat index speeds up vector similarity queries</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">ON</span> documents <span class="hljs-keyword">USING</span> ivfflat (embedding vector_cosine_ops);

<span class="hljs-comment">-- Create a function for matching documents based on similarity</span>
<span class="hljs-comment">-- This function finds the most similar document chunks to a query embedding</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">OR REPLACE</span> <span class="hljs-keyword">FUNCTION</span> match_documents(
  query_embedding vector(<span class="hljs-number">1536</span>),
  match_threshold <span class="hljs-type">float</span>,
  match_count <span class="hljs-type">int</span>
)
<span class="hljs-keyword">RETURNS</span> <span class="hljs-keyword">TABLE</span> (
  id <span class="hljs-type">bigint</span>,
  content <span class="hljs-type">text</span>,
  metadata <span class="hljs-type">jsonb</span>,
  similarity <span class="hljs-type">float</span>
)
<span class="hljs-keyword">LANGUAGE</span> plpgsql
<span class="hljs-keyword">AS</span> $$<span class="pgsql">
<span class="hljs-keyword">BEGIN</span>
  <span class="hljs-keyword">RETURN QUERY</span>
  <span class="hljs-keyword">SELECT</span>
    documents.id,
    documents.content,
    documents.metadata,
    <span class="hljs-number">1</span> - (documents.embedding &lt;=&gt; query_embedding) <span class="hljs-keyword">AS</span> similarity
  <span class="hljs-keyword">FROM</span> documents
  <span class="hljs-keyword">WHERE</span> <span class="hljs-number">1</span> - (documents.embedding &lt;=&gt; query_embedding) &gt; match_threshold
  <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> documents.embedding &lt;=&gt; query_embedding
  <span class="hljs-keyword">LIMIT</span> match_count;
<span class="hljs-keyword">END</span>;
$$</span>;
</code></pre>
<p>This SQL does the following:</p>
<ul>
<li><p><strong>Enables the vector extension</strong>: This adds vector storage and similarity search capabilities to PostgreSQL</p>
</li>
<li><p><strong>Creates the documents table</strong>: Stores document chunks, metadata (file name, type, and so on), and their embeddings</p>
</li>
<li><p><strong>Creates an index</strong>: Speeds up similarity searches on the embedding column</p>
</li>
<li><p><strong>Creates a match function</strong>: Finds the most similar document chunks to a query embedding using cosine similarity</p>
</li>
</ul>
<p>The <code>&lt;=&gt;</code> operator calculates cosine distance between vectors. A smaller distance means more similar content.</p>
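<p>If the SQL function feels abstract, the sketch below mirrors its logic in plain TypeScript over an in-memory array: compute the cosine distance, convert it to a similarity, drop anything at or below the threshold, sort the closest matches first, and cap the result count. It's for intuition only. The types and function names are assumptions, and in the real app the database does this work.</p>

```typescript
interface StoredChunk {
  id: number;
  content: string;
  embedding: number[];
}

interface Match {
  id: number;
  content: string;
  similarity: number;
}

// Cosine distance = 1 - cosine similarity (smaller means more similar).
function cosineDistance(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  const norms =
    Math.sqrt(a.reduce((sum, x) => sum + x * x, 0)) *
    Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  // Simplification for this sketch: treat zero vectors as maximally distant.
  if (norms === 0) return 1;
  return 1 - dot / norms;
}

// In-memory analogue of the match_documents SQL function.
function matchDocuments(
  chunks: StoredChunk[],
  queryEmbedding: number[],
  matchThreshold: number,
  matchCount: number
): Match[] {
  return chunks
    .map((chunk) => ({
      id: chunk.id,
      content: chunk.content,
      similarity: 1 - cosineDistance(chunk.embedding, queryEmbedding),
    }))
    .filter((match) => match.similarity > matchThreshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, matchCount);
}
```

<p>Sorting by similarity in descending order is equivalent to the SQL's <code>ORDER BY</code> on ascending distance, since similarity here is defined as one minus the distance.</p>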
<h3 id="heading-set-up-supabase-storage"><strong>Set Up Supabase Storage</strong></h3>
<p>You’ll need a storage bucket to store uploaded files. This is separate from the database and holds the original PDF, DOCX, and TXT files.</p>
<p>To set up your storage bucket:</p>
<ol>
<li><p>Go to <strong>Storage</strong> in your Supabase dashboard</p>
</li>
<li><p>Click <strong>New bucket</strong></p>
</li>
<li><p>Name it <code>documents</code></p>
</li>
<li><p>Set it to <strong>Public</strong> (this allows file downloads)</p>
</li>
<li><p>Click <strong>Create bucket</strong></p>
</li>
</ol>
<p>If you prefer a private bucket, you can use the service role key for server-side operations, which bypasses Row-Level Security policies. For this tutorial, a public bucket is simpler and works well.</p>
<p>Now that your Supabase project is configured, you'll set up your environment variables to connect your Next.js application to Supabase and OpenAI.</p>
<h2 id="heading-step-4-configure-environment-variables">Step 4: Configure Environment Variables</h2>
<p>Create a <code>.env.local</code> file in your project root:</p>
<pre><code class="lang-bash">NEXT_PUBLIC_SUPABASE_URL=your_supabase_project_url
NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY=your_supabase_anon_key
SUPABASE_SERVICE_ROLE_KEY=your_supabase_service_role_key
OPENAI_API_KEY=your_openai_api_key
</code></pre>
<p>Replace the placeholder values with your actual credentials:</p>
<ul>
<li><p>Get Supabase values from <strong>Settings</strong> → <strong>API</strong> in your Supabase dashboard</p>
</li>
<li><p>Get your OpenAI API key from <a target="_blank" href="https://platform.openai.com/api-keys"><strong>platform.openai.com/api-keys</strong></a></p>
</li>
</ul>
<p><strong>Security Note</strong>: Never commit <code>.env.local</code> to version control. It's already in <code>.gitignore</code> by default, but double-check to ensure your secrets stay secure.</p>
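<p>A missing or misspelled variable is an easy mistake to make at this step. As an optional safeguard, you could check for missing values before the app uses them. The helper below is a hypothetical sketch (the <code>findMissingEnvVars</code> name and the <code>Env</code> type are assumptions, not part of the tutorial's codebase). It takes the environment as a parameter so you can pass in <code>process.env</code> on the server.</p>

```typescript
// A plain map of variable names to values, such as process.env.
type Env = { [name: string]: string | undefined };

// The four variables this tutorial's .env.local file defines.
const REQUIRED_ENV_VARS = [
  'NEXT_PUBLIC_SUPABASE_URL',
  'NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY',
  'SUPABASE_SERVICE_ROLE_KEY',
  'OPENAI_API_KEY',
];

// Return the names of required variables that are unset or blank.
function findMissingEnvVars(env: Env, required: string[] = REQUIRED_ENV_VARS): string[] {
  return required.filter((name) => (env[name] ?? '').trim() === '');
}
```

<p>Calling this once at server startup and logging the returned names gives you a clear error message instead of a confusing failure deep inside an API route.</p>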
<p>With your environment configured, you're ready to start building the API routes that will handle file uploads, searches, and document management.</p>
<h2 id="heading-step-5-create-the-upload-api-route">Step 5: Create the Upload API Route</h2>
<p>Now you'll create the API route that handles file uploads. This route will process uploaded files, extract their text, split them into chunks, generate embeddings, and store everything in your database and storage.</p>
<p>Create <code>src/app/api/upload/route.ts</code>:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { createClient } <span class="hljs-keyword">from</span> <span class="hljs-string">'@supabase/supabase-js'</span>;
<span class="hljs-keyword">import</span> OpenAI <span class="hljs-keyword">from</span> <span class="hljs-string">'openai'</span>;
<span class="hljs-keyword">import</span> { NextResponse } <span class="hljs-keyword">from</span> <span class="hljs-string">'next/server'</span>;
<span class="hljs-keyword">import</span> { RecursiveCharacterTextSplitter } <span class="hljs-keyword">from</span> <span class="hljs-string">'@langchain/textsplitters'</span>;
<span class="hljs-keyword">import</span> mammoth <span class="hljs-keyword">from</span> <span class="hljs-string">'mammoth'</span>;

<span class="hljs-keyword">const</span> url = process.env.NEXT_PUBLIC_SUPABASE_URL!;
<span class="hljs-keyword">const</span> anonKey = process.env.NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY!;
<span class="hljs-keyword">const</span> serviceKey = process.env.SUPABASE_SERVICE_ROLE_KEY;
<span class="hljs-keyword">const</span> supabaseStorage = createClient(url, serviceKey || anonKey);
<span class="hljs-keyword">const</span> supabase = createClient(url, anonKey);
<span class="hljs-keyword">const</span> openai = <span class="hljs-keyword">new</span> OpenAI();

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">safeDecodeURIComponent</span>(<span class="hljs-params">str: <span class="hljs-built_in">string</span></span>): <span class="hljs-title">string</span> </span>{
  <span class="hljs-keyword">try</span> { 
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">decodeURIComponent</span>(str); 
  } <span class="hljs-keyword">catch</span> { 
    <span class="hljs-keyword">try</span> { 
      <span class="hljs-keyword">return</span> <span class="hljs-built_in">decodeURIComponent</span>(str.replace(<span class="hljs-regexp">/%/g</span>, <span class="hljs-string">'%25'</span>)); 
    } <span class="hljs-keyword">catch</span> { 
      <span class="hljs-keyword">return</span> str; 
    } 
  }
}

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">extractTextFromFile</span>(<span class="hljs-params">file: File</span>): <span class="hljs-title">Promise</span>&lt;<span class="hljs-title">string</span>&gt; </span>{
  <span class="hljs-keyword">const</span> buffer = Buffer.from(<span class="hljs-keyword">await</span> file.arrayBuffer());
  <span class="hljs-keyword">const</span> fileName = file.name.toLowerCase();

  <span class="hljs-keyword">if</span> (fileName.endsWith(<span class="hljs-string">'.pdf'</span>)) {
    <span class="hljs-keyword">const</span> PDFParser = (<span class="hljs-keyword">await</span> <span class="hljs-keyword">import</span>(<span class="hljs-string">'pdf2json'</span>)).default;
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Promise</span>(<span class="hljs-function">(<span class="hljs-params">resolve, reject</span>) =&gt;</span> {
      <span class="hljs-keyword">const</span> pdfParser = <span class="hljs-keyword">new</span> (PDFParser <span class="hljs-keyword">as</span> <span class="hljs-built_in">any</span>)(<span class="hljs-literal">null</span>, <span class="hljs-literal">true</span>);
      pdfParser.on(<span class="hljs-string">'pdfParser_dataError'</span>, <span class="hljs-function">(<span class="hljs-params">err: <span class="hljs-built_in">any</span></span>) =&gt;</span> 
        reject(<span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">`PDF parsing error: <span class="hljs-subst">${err.parserError}</span>`</span>))
      );
      pdfParser.on(<span class="hljs-string">'pdfParser_dataReady'</span>, <span class="hljs-function">(<span class="hljs-params">pdfData: <span class="hljs-built_in">any</span></span>) =&gt;</span> {
        <span class="hljs-keyword">try</span> {
          <span class="hljs-keyword">let</span> fullText = <span class="hljs-string">''</span>;
          pdfData.Pages?.forEach(<span class="hljs-function">(<span class="hljs-params">page: <span class="hljs-built_in">any</span></span>) =&gt;</span> 
            page.Texts?.forEach(<span class="hljs-function">(<span class="hljs-params">text: <span class="hljs-built_in">any</span></span>) =&gt;</span> 
              text.R?.forEach(<span class="hljs-function">(<span class="hljs-params">r: <span class="hljs-built_in">any</span></span>) =&gt;</span> 
                r.T &amp;&amp; (fullText += safeDecodeURIComponent(r.T) + <span class="hljs-string">' '</span>)
              )
            )
          );
          resolve(fullText.trim());
        } <span class="hljs-keyword">catch</span> (error: <span class="hljs-built_in">any</span>) {
          reject(<span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">`Error extracting text: <span class="hljs-subst">${error.message}</span>`</span>));
        }
      });
      pdfParser.parseBuffer(buffer);
    });
  } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (fileName.endsWith(<span class="hljs-string">'.docx'</span>)) {
    <span class="hljs-keyword">const</span> result = <span class="hljs-keyword">await</span> mammoth.extractRawText({ buffer });
    <span class="hljs-keyword">return</span> result.value;
  } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (fileName.endsWith(<span class="hljs-string">'.txt'</span>)) {
    <span class="hljs-keyword">return</span> buffer.toString(<span class="hljs-string">'utf-8'</span>);
  } <span class="hljs-keyword">else</span> {
    <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">'Unsupported file type. Please upload PDF, DOCX, or TXT files.'</span>);
  }
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">POST</span>(<span class="hljs-params">req: Request</span>) </span>{
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> file = (<span class="hljs-keyword">await</span> req.formData()).get(<span class="hljs-string">'file'</span>) <span class="hljs-keyword">as</span> File;
    <span class="hljs-keyword">if</span> (!file) {
      <span class="hljs-keyword">return</span> NextResponse.json({ error: <span class="hljs-string">'No file provided'</span> }, { status: <span class="hljs-number">400</span> });
    }

    <span class="hljs-keyword">const</span> documentId = crypto.randomUUID();
    <span class="hljs-keyword">const</span> uploadDate = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Date</span>().toISOString();
    <span class="hljs-keyword">const</span> filePath = <span class="hljs-string">`<span class="hljs-subst">${documentId}</span>.<span class="hljs-subst">${file.name.split(<span class="hljs-string">'.'</span>).pop() || <span class="hljs-string">'bin'</span>}</span>`</span>;

    <span class="hljs-comment">// Upload file to Supabase Storage</span>
    <span class="hljs-keyword">const</span> fileBuffer = Buffer.from(<span class="hljs-keyword">await</span> file.arrayBuffer());
    <span class="hljs-keyword">const</span> { error: storageError } = <span class="hljs-keyword">await</span> supabaseStorage.storage
      .from(<span class="hljs-string">'documents'</span>)
      .upload(filePath, fileBuffer, {
        contentType: file.type || <span class="hljs-string">'application/octet-stream'</span>,
        upsert: <span class="hljs-literal">false</span>,
      });

    <span class="hljs-keyword">if</span> (storageError) {
      <span class="hljs-keyword">const</span> msg = storageError.message || <span class="hljs-string">'Unknown storage error'</span>;
      <span class="hljs-keyword">if</span> (msg.includes(<span class="hljs-string">'row-level security'</span>) || msg.includes(<span class="hljs-string">'RLS'</span>)) {
        <span class="hljs-keyword">return</span> NextResponse.json({ 
          success: <span class="hljs-literal">false</span>, 
          error: <span class="hljs-string">`Storage RLS error: <span class="hljs-subst">${msg}</span>. Ensure SUPABASE_SERVICE_ROLE_KEY is set.`</span> 
        }, { status: <span class="hljs-number">500</span> });
      }
      <span class="hljs-keyword">return</span> NextResponse.json({ 
        success: <span class="hljs-literal">false</span>, 
        error: <span class="hljs-string">`Failed to store file: <span class="hljs-subst">${msg}</span>`</span> 
      }, { status: <span class="hljs-number">500</span> });
    }

    <span class="hljs-comment">// Get public URL for the file</span>
    <span class="hljs-keyword">const</span> { data: urlData } = supabaseStorage.storage
      .from(<span class="hljs-string">'documents'</span>)
      .getPublicUrl(filePath);

    <span class="hljs-comment">// Extract text from file</span>
    <span class="hljs-keyword">const</span> text = <span class="hljs-keyword">await</span> extractTextFromFile(file);
    <span class="hljs-keyword">if</span> (!text || text.trim().length === <span class="hljs-number">0</span>) {
      <span class="hljs-keyword">return</span> NextResponse.json({ 
        error: <span class="hljs-string">'Could not extract text from file'</span> 
      }, { status: <span class="hljs-number">400</span> });
    }

    <span class="hljs-comment">// Split text into chunks</span>
    <span class="hljs-comment">// Chunk size of 800 characters with 100-character overlap ensures</span>
    <span class="hljs-comment">// we don't lose context at chunk boundaries</span>
    <span class="hljs-keyword">const</span> textSplitter = <span class="hljs-keyword">new</span> RecursiveCharacterTextSplitter({
      chunkSize: <span class="hljs-number">800</span>,
      chunkOverlap: <span class="hljs-number">100</span>,
    });
    <span class="hljs-keyword">const</span> chunks = <span class="hljs-keyword">await</span> textSplitter.splitText(text);

    <span class="hljs-comment">// Process each chunk: generate embedding and store in database</span>
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> i = <span class="hljs-number">0</span>; i &lt; chunks.length; i++) {
      <span class="hljs-keyword">const</span> chunk = chunks[i];

      <span class="hljs-comment">// Generate embedding using OpenAI</span>
      <span class="hljs-comment">// This converts the text chunk into a 1536-dimensional vector</span>
      <span class="hljs-keyword">const</span> emb = <span class="hljs-keyword">await</span> openai.embeddings.create({
        model: <span class="hljs-string">'text-embedding-3-small'</span>,
        input: chunk,
      });

      <span class="hljs-comment">// Store chunk with embedding in database</span>
      <span class="hljs-keyword">const</span> { error } = <span class="hljs-keyword">await</span> supabase.from(<span class="hljs-string">'documents'</span>).insert({
        content: chunk,
        metadata: { 
          source: file.name,
          document_id: documentId,
          file_name: file.name,
          file_type: file.type || file.name.split(<span class="hljs-string">'.'</span>).pop(),
          file_size: file.size,
          upload_date: uploadDate,
          chunk_index: i,
          total_chunks: chunks.length,
          file_path: filePath,
          file_url: urlData.publicUrl,
        },
        embedding: <span class="hljs-built_in">JSON</span>.stringify(emb.data[<span class="hljs-number">0</span>].embedding),
      });

      <span class="hljs-keyword">if</span> (error) {
        <span class="hljs-keyword">return</span> NextResponse.json({ 
          success: <span class="hljs-literal">false</span>, 
          error: error.message 
        }, { status: <span class="hljs-number">500</span> });
      }
    }

    <span class="hljs-keyword">return</span> NextResponse.json({ 
      success: <span class="hljs-literal">true</span>, 
      documentId, 
      fileName: file.name, 
      chunks: chunks.length, 
      textLength: text.length, 
      fileUrl: urlData.publicUrl 
    });
  } <span class="hljs-keyword">catch</span> (error: <span class="hljs-built_in">any</span>) {
    <span class="hljs-keyword">return</span> NextResponse.json({ 
      success: <span class="hljs-literal">false</span>, 
      error: error.message || <span class="hljs-string">'Failed to process file'</span> 
    }, { status: <span class="hljs-number">500</span> });
  }
}
</code></pre>
<p>This route handles the complete upload workflow:</p>
<ol>
<li><p>Receives the file from the client via FormData</p>
</li>
<li><p>Generates a unique document ID using <code>crypto.randomUUID()</code></p>
</li>
<li><p>Uploads the file to Supabase Storage for safekeeping</p>
</li>
<li><p>Extracts text based on file type (PDF, DOCX, or TXT)</p>
</li>
<li><p>Splits the text into chunks of 800 characters with 100-character overlap</p>
</li>
<li><p>Generates embeddings for each chunk using OpenAI's embedding model</p>
</li>
<li><p>Stores each chunk with its embedding and metadata in the database</p>
</li>
</ol>
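<p>To make step 5 concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification, not what the route runs: <code>RecursiveCharacterTextSplitter</code> also tries to break on separators such as paragraph breaks, line breaks, and spaces, while the hypothetical <code>chunkWithOverlap</code> helper below just slides a window over the characters.</p>
<pre><code class="lang-typescript">// Simplified sketch: slide a fixed-size window over the text.
// chunkWithOverlap is a hypothetical helper, not part of LangChain.
function chunkWithOverlap(text: string, chunkSize: number, chunkOverlap: number): string[] {
  // Each new chunk starts this many characters after the previous one.
  const step = chunkSize - chunkOverlap;
  if (Math.sign(step) !== 1) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  if (text.length === 0) return [];

  // Number of windows needed so the last one reaches the end of the text.
  const count = Math.max(1, Math.ceil((text.length - chunkOverlap) / step));
  return Array.from({ length: count }, function (_, i) {
    return text.slice(i * step, i * step + chunkSize);
  });
}

// With chunkSize 4 and chunkOverlap 1, neighboring chunks share one character.
console.log(chunkWithOverlap('abcdefghij', 4, 1)); // [ 'abcd', 'defg', 'ghij' ]
</code></pre>
<p>With the route's settings (800 and 100), each window would start 700 characters after the previous one, so the last 100 characters of one chunk are repeated at the start of the next.</p>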
<p>The 100-character overlap means text near a chunk boundary appears in both neighboring chunks, so a sentence that straddles the boundary is less likely to be cut off mid-thought. Now that you can upload and process documents, let's create the search functionality.</p>
<h2 id="heading-step-6-create-the-rag-search-api-route">Step 6: Create the RAG Search API Route</h2>
<p>This route implements the core RAG functionality: it takes a user's query, finds the most relevant document chunks, and uses them to generate an accurate answer.</p>
<p>Create <code>src/app/api/search/route.ts</code>:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { createClient } <span class="hljs-keyword">from</span> <span class="hljs-string">'@supabase/supabase-js'</span>;
<span class="hljs-keyword">import</span> OpenAI <span class="hljs-keyword">from</span> <span class="hljs-string">'openai'</span>;
<span class="hljs-keyword">import</span> { NextResponse } <span class="hljs-keyword">from</span> <span class="hljs-string">'next/server'</span>;

<span class="hljs-keyword">const</span> supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY!
);
<span class="hljs-keyword">const</span> openai = <span class="hljs-keyword">new</span> OpenAI();

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">POST</span>(<span class="hljs-params">req: Request</span>) </span>{
  <span class="hljs-keyword">try</span> {
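    <span class="hljs-keyword">const</span> { query } = <span class="hljs-keyword">await</span> req.json();
    <span class="hljs-comment">// Reject missing or empty queries before calling the embeddings API</span>
    <span class="hljs-keyword">if</span> (<span class="hljs-keyword">typeof</span> query !== <span class="hljs-string">'string'</span> || query.trim().length === <span class="hljs-number">0</span>) {
      <span class="hljs-keyword">return</span> NextResponse.json({ error: <span class="hljs-string">'Query is required'</span> }, { status: <span class="hljs-number">400</span> });
    }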

    <span class="hljs-comment">// Generate embedding for the user's query</span>
    <span class="hljs-comment">// This converts the search query into the same vector space as document chunks</span>
    <span class="hljs-keyword">const</span> emb = <span class="hljs-keyword">await</span> openai.embeddings.create({ 
      model: <span class="hljs-string">'text-embedding-3-small'</span>, 
      input: query 
    });

    <span class="hljs-comment">// Find similar documents using vector similarity search</span>
    <span class="hljs-comment">// The match_documents function finds the 5 most similar chunks</span>
    <span class="hljs-keyword">const</span> { data: results, error } = <span class="hljs-keyword">await</span> supabase.rpc(<span class="hljs-string">'match_documents'</span>, {
      query_embedding: <span class="hljs-built_in">JSON</span>.stringify(emb.data[<span class="hljs-number">0</span>].embedding),
      match_threshold: <span class="hljs-number">0.0</span>,  <span class="hljs-comment">// Accept any similarity (you can increase this for stricter matching)</span>
      match_count: <span class="hljs-number">5</span>,        <span class="hljs-comment">// Return top 5 most similar chunks</span>
    });

    <span class="hljs-keyword">if</span> (error) {
      <span class="hljs-keyword">return</span> NextResponse.json({ error: error.message }, { status: <span class="hljs-number">500</span> });
    }

    <span class="hljs-comment">// Combine retrieved chunks into context</span>
    <span class="hljs-comment">// These chunks will be used as context for the AI to generate an answer</span>
    <span class="hljs-keyword">const</span> context = results?.map(<span class="hljs-function">(<span class="hljs-params">r: <span class="hljs-built_in">any</span></span>) =&gt;</span> r.content).join(<span class="hljs-string">'\n---\n'</span>) || <span class="hljs-string">''</span>;

    <span class="hljs-comment">// Generate answer using OpenAI with retrieved context</span>
    <span class="hljs-comment">// This is the "Generation" part of RAG</span>
    <span class="hljs-keyword">const</span> completion = <span class="hljs-keyword">await</span> openai.chat.completions.create({
      model: <span class="hljs-string">'gpt-4o-mini'</span>,
      messages: [
        { 
          role: <span class="hljs-string">'system'</span>, 
          content: <span class="hljs-string">'You are a helpful assistant. Use the provided context to answer questions. If the answer is not in the context, say you do not know.'</span> 
        },
        { 
          role: <span class="hljs-string">'user'</span>, 
          content: <span class="hljs-string">`Context: <span class="hljs-subst">${context}</span>\n\nQuestion: <span class="hljs-subst">${query}</span>`</span> 
        }
      ],
    });

    <span class="hljs-keyword">return</span> NextResponse.json({ 
      answer: completion.choices[<span class="hljs-number">0</span>].message.content, 
      sources: results 
    });
  } <span class="hljs-keyword">catch</span> (error: <span class="hljs-built_in">any</span>) {
    <span class="hljs-keyword">return</span> NextResponse.json({ error: error.message }, { status: <span class="hljs-number">500</span> });
  }
}
</code></pre>
<p>This route implements the RAG pattern. For each request, it:</p>
<ol>
<li><p><strong>Converts the query to an embedding</strong>: The user's question is embedded with the same model (<code>text-embedding-3-small</code>) that processed the documents, so the query and the document chunks live in the same vector space and can be compared directly.</p>
</li>
<li><p><strong>Searches for similar chunks</strong>: Uses the <code>match_documents</code> function to find the 5 most semantically similar document chunks. The function ranks chunks by cosine similarity, which is the cosine of the angle between two vectors: the smaller the angle, the more similar the content, even if the exact words differ.</p>
</li>
<li><p><strong>Uses chunks as context</strong>: The retrieved chunks are passed to GPT-4o-mini as context. These chunks contain the most relevant information from your documents.</p>
</li>
<li><p><strong>Generates an answer</strong>: The AI model generates an answer based on the provided context. The system prompt tells the model to answer from that context and to say it does not know when the answer isn't there, which reduces hallucinations but does not rule them out.</p>
</li>
<li><p><strong>Returns results</strong>: Both the answer and source chunks are returned so users can verify the information.</p>
</li>
</ol>
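<p>Step 2 can be sketched in plain TypeScript. In this tutorial the comparison happens inside the database through <code>match_documents</code>; the hypothetical <code>cosineSimilarity</code> and <code>rankChunks</code> helpers below only illustrate the math on small in-memory vectors, and they skip the <code>match_threshold</code> filter.</p>
<pre><code class="lang-typescript">type EmbeddedChunk = { content: string; embedding: number[] };

// Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach(function (value, i) {
    dot += value * b[i];
    normA += value * value;
    normB += b[i] * b[i];
  });
  // A zero vector has no direction, so treat it as not similar to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query and keep the matchCount best ones.
function rankChunks(query: number[], chunks: EmbeddedChunk[], matchCount: number) {
  return chunks
    .map(function (chunk) {
      return { content: chunk.content, similarity: cosineSimilarity(query, chunk.embedding) };
    })
    .sort(function (x, y) {
      return y.similarity - x.similarity;
    })
    .slice(0, matchCount);
}

// Toy 2-dimensional "embeddings"; real ones have 1536 dimensions.
const toyChunks: EmbeddedChunk[] = [
  { content: 'unrelated', embedding: [0, 1] },
  { content: 'same direction', embedding: [2, 0] },
  { content: 'partly related', embedding: [1, 1] },
];
console.log(rankChunks([1, 0], toyChunks, 2).map(function (m) { return m.content; }));
// [ 'same direction', 'partly related' ]
</code></pre>
<p>Note that <code>[2, 0]</code> scores a perfect 1 against <code>[1, 0]</code>: cosine similarity compares direction only, not vector length.</p>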
<p>This RAG approach gives you several benefits. First, you get accuracy because answers are based on your actual documents, not just the AI's training data. Second, you get transparency because you can see which document chunks were used to generate each answer. Third, you get efficiency because only relevant chunks are used, which reduces token usage and costs. Finally, you get up-to-date information because you can update your knowledge base by uploading new documents without retraining the AI.</p>
<p>Now let's create the API route for managing documents.</p>
<h2 id="heading-step-7-create-the-documents-api-route">Step 7: Create the Documents API Route</h2>
<p>This route handles listing, viewing, downloading, and deleting documents. It serves multiple purposes depending on the query parameters.</p>
<p>Create <code>src/app/api/documents/route.ts</code>:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { createClient } <span class="hljs-keyword">from</span> <span class="hljs-string">'@supabase/supabase-js'</span>;
<span class="hljs-keyword">import</span> { NextResponse } <span class="hljs-keyword">from</span> <span class="hljs-string">'next/server'</span>;

<span class="hljs-keyword">const</span> url = process.env.NEXT_PUBLIC_SUPABASE_URL!;
<span class="hljs-keyword">const</span> anonKey = process.env.NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY!;
<span class="hljs-keyword">const</span> serviceKey = process.env.SUPABASE_SERVICE_ROLE_KEY || anonKey;
<span class="hljs-keyword">const</span> supabase = createClient(url, anonKey);
<span class="hljs-keyword">const</span> supabaseStorage = createClient(url, serviceKey);

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">GET</span>(<span class="hljs-params">req: Request</span>) </span>{
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> reqUrl = <span class="hljs-keyword">new</span> URL(req.url);
    <span class="hljs-keyword">const</span> id = reqUrl.searchParams.get(<span class="hljs-string">'id'</span>);
    <span class="hljs-keyword">const</span> file = reqUrl.searchParams.get(<span class="hljs-string">'file'</span>) === <span class="hljs-string">'true'</span>;
    <span class="hljs-keyword">const</span> view = reqUrl.searchParams.get(<span class="hljs-string">'view'</span>) === <span class="hljs-string">'true'</span>;

    <span class="hljs-comment">// Handle file download/view</span>
    <span class="hljs-keyword">if</span> (id &amp;&amp; file) {
      <span class="hljs-keyword">const</span> { data: documents } = <span class="hljs-keyword">await</span> supabase
        .from(<span class="hljs-string">'documents'</span>)
        .select(<span class="hljs-string">'metadata'</span>)
        .eq(<span class="hljs-string">'metadata-&gt;&gt;document_id'</span>, id)
        .limit(<span class="hljs-number">1</span>);

      <span class="hljs-keyword">if</span> (!documents || documents.length === <span class="hljs-number">0</span>) {
        <span class="hljs-keyword">return</span> NextResponse.json({ error: <span class="hljs-string">'Document not found'</span> }, { status: <span class="hljs-number">404</span> });
      }

      <span class="hljs-keyword">const</span> meta = documents[<span class="hljs-number">0</span>].metadata;
      <span class="hljs-keyword">const</span> fileName = meta?.file_name || <span class="hljs-string">'document'</span>;
      <span class="hljs-keyword">const</span> fileType = meta?.file_type || <span class="hljs-string">'application/octet-stream'</span>;
      <span class="hljs-keyword">const</span> filePath = meta?.file_path || <span class="hljs-string">`<span class="hljs-subst">${id}</span>.<span class="hljs-subst">${fileName.split(<span class="hljs-string">'.'</span>).pop() || <span class="hljs-string">'pdf'</span>}</span>`</span>;

      <span class="hljs-keyword">const</span> { data: fileData, error: downloadError } = <span class="hljs-keyword">await</span> supabaseStorage.storage
        .from(<span class="hljs-string">'documents'</span>)
        .download(filePath);

      <span class="hljs-keyword">if</span> (downloadError || !fileData) {
        <span class="hljs-keyword">return</span> NextResponse.json({ 
          error: downloadError?.message || <span class="hljs-string">'File not stored'</span> 
        }, { status: <span class="hljs-number">404</span> });
      }

      <span class="hljs-keyword">const</span> buffer = Buffer.from(<span class="hljs-keyword">await</span> fileData.arrayBuffer());
      <span class="hljs-keyword">if</span> (buffer.length === <span class="hljs-number">0</span>) {
        <span class="hljs-keyword">return</span> NextResponse.json({ error: <span class="hljs-string">'File is empty'</span> }, { status: <span class="hljs-number">500</span> });
      }

      <span class="hljs-keyword">const</span> isPDF = fileType === <span class="hljs-string">'application/pdf'</span> || fileName.toLowerCase().endsWith(<span class="hljs-string">'.pdf'</span>);
      <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> NextResponse(<span class="hljs-keyword">new</span> <span class="hljs-built_in">Uint8Array</span>(buffer), {
        headers: {
          <span class="hljs-string">'Content-Type'</span>: fileType,
          <span class="hljs-string">'Content-Disposition'</span>: (view &amp;&amp; isPDF) 
            ? <span class="hljs-string">`inline; filename="<span class="hljs-subst">${fileName}</span>"`</span> 
            : <span class="hljs-string">`attachment; filename="<span class="hljs-subst">${fileName}</span>"`</span>,
          <span class="hljs-string">'Content-Length'</span>: buffer.length.toString(),
          ...(view &amp;&amp; isPDF ? { <span class="hljs-string">'X-Content-Type-Options'</span>: <span class="hljs-string">'nosniff'</span> } : {}),
        },
      });
    }

    <span class="hljs-comment">// Get single document with text content</span>
    <span class="hljs-keyword">if</span> (id) {
      <span class="hljs-keyword">const</span> { data: chunks, error } = <span class="hljs-keyword">await</span> supabase
        .from(<span class="hljs-string">'documents'</span>)
        .select(<span class="hljs-string">'content, metadata'</span>)
        .eq(<span class="hljs-string">'metadata-&gt;&gt;document_id'</span>, id)
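        <span class="hljs-comment">// Use -&gt; (jsonb) rather than -&gt;&gt; (text) so chunk 10 sorts after chunk 2</span>
        .order(<span class="hljs-string">'metadata-&gt;chunk_index'</span>, { ascending: <span class="hljs-literal">true</span> });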

      <span class="hljs-keyword">if</span> (error || !chunks || chunks.length === <span class="hljs-number">0</span>) {
        <span class="hljs-keyword">return</span> NextResponse.json({ error: <span class="hljs-string">'Document not found'</span> }, { status: <span class="hljs-number">404</span> });
      }

      <span class="hljs-keyword">const</span> m = chunks[<span class="hljs-number">0</span>].metadata || {};
      <span class="hljs-keyword">return</span> NextResponse.json({
        id,
        file_name: m.file_name || <span class="hljs-string">'Unknown'</span>,
        file_type: m.file_type || <span class="hljs-string">'unknown'</span>,
        file_size: m.file_size || <span class="hljs-number">0</span>,
        upload_date: m.upload_date || <span class="hljs-keyword">new</span> <span class="hljs-built_in">Date</span>().toISOString(),
        total_chunks: chunks.length,
        fullText: chunks.map(<span class="hljs-function">(<span class="hljs-params">c: <span class="hljs-built_in">any</span></span>) =&gt;</span> c.content).join(<span class="hljs-string">'\n\n'</span>),
        file_url: m.file_url,
        file_path: m.file_path
      });
    }

    <span class="hljs-comment">// List all documents</span>
    <span class="hljs-keyword">const</span> { data: documents, error } = <span class="hljs-keyword">await</span> supabase
      .from(<span class="hljs-string">'documents'</span>)
      .select(<span class="hljs-string">'metadata'</span>);

    <span class="hljs-keyword">if</span> (error) {
      <span class="hljs-keyword">return</span> NextResponse.json({ error: error.message }, { status: <span class="hljs-number">500</span> });
    }

    <span class="hljs-comment">// Deduplicate documents by document_id</span>
    <span class="hljs-comment">// Since each document is split into multiple chunks, we need to group them</span>
    <span class="hljs-keyword">const</span> map = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Map</span>();
    documents?.forEach(<span class="hljs-function">(<span class="hljs-params">doc: <span class="hljs-built_in">any</span></span>) =&gt;</span> {
      <span class="hljs-keyword">const</span> m = doc.metadata;
      <span class="hljs-keyword">if</span> (m?.document_id &amp;&amp; !map.has(m.document_id)) {
        map.set(m.document_id, {
          id: m.document_id,
          file_name: m.file_name || <span class="hljs-string">'Unknown'</span>,
          file_type: m.file_type || <span class="hljs-string">'unknown'</span>,
          file_size: m.file_size || <span class="hljs-number">0</span>,
          upload_date: m.upload_date || <span class="hljs-keyword">new</span> <span class="hljs-built_in">Date</span>().toISOString(),
          total_chunks: m.total_chunks || <span class="hljs-number">0</span>,
          file_url: m.file_url,
          file_path: m.file_path,
        });
      }
    });

    <span class="hljs-keyword">return</span> NextResponse.json({ documents: <span class="hljs-built_in">Array</span>.from(map.values()) });
  } <span class="hljs-keyword">catch</span> (error: <span class="hljs-built_in">any</span>) {
    <span class="hljs-keyword">return</span> NextResponse.json({ error: error.message }, { status: <span class="hljs-number">500</span> });
  }
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">DELETE</span>(<span class="hljs-params">req: Request</span>) </span>{
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> id = <span class="hljs-keyword">new</span> URL(req.url).searchParams.get(<span class="hljs-string">'id'</span>);
    <span class="hljs-keyword">if</span> (!id) {
      <span class="hljs-keyword">return</span> NextResponse.json({ error: <span class="hljs-string">'Document ID required'</span> }, { status: <span class="hljs-number">400</span> });
    }

    <span class="hljs-comment">// Get file path from metadata</span>
    <span class="hljs-keyword">const</span> { data: docs } = <span class="hljs-keyword">await</span> supabase
      .from(<span class="hljs-string">'documents'</span>)
      .select(<span class="hljs-string">'metadata'</span>)
      .eq(<span class="hljs-string">'metadata-&gt;&gt;document_id'</span>, id)
      .limit(<span class="hljs-number">1</span>);

    <span class="hljs-keyword">const</span> filePath = docs?.[<span class="hljs-number">0</span>]?.metadata?.file_path;

    <span class="hljs-comment">// Delete file from storage and record whether the removal succeeded</span>
    <span class="hljs-keyword">let</span> fileDeleted = <span class="hljs-literal">false</span>;
    <span class="hljs-keyword">if</span> (filePath) {
      <span class="hljs-keyword">const</span> { error: removeError } = <span class="hljs-keyword">await</span> supabaseStorage.storage
        .from(<span class="hljs-string">'documents'</span>)
        .remove([filePath]);
      fileDeleted = !removeError;
    }

    <span class="hljs-comment">// Delete all chunks from database</span>
    <span class="hljs-keyword">const</span> { error } = <span class="hljs-keyword">await</span> supabase
      .from(<span class="hljs-string">'documents'</span>)
      .delete()
      .eq(<span class="hljs-string">'metadata-&gt;&gt;document_id'</span>, id);

    <span class="hljs-keyword">if</span> (error) {
      <span class="hljs-keyword">return</span> NextResponse.json({ error: error.message }, { status: <span class="hljs-number">500</span> });
    }

    <span class="hljs-keyword">return</span> NextResponse.json({ success: <span class="hljs-literal">true</span>, fileDeleted });
  } <span class="hljs-keyword">catch</span> (error: <span class="hljs-built_in">any</span>) {
    <span class="hljs-keyword">return</span> NextResponse.json({ error: error.message }, { status: <span class="hljs-number">500</span> });
  }
}
</code></pre>
<p>This route handles:</p>
<ul>
<li><p><strong>GET without ID</strong>: Lists all documents (deduplicated since each document has multiple chunks)</p>
</li>
<li><p><strong>GET with ID</strong>: Returns document details and full text (all chunks combined)</p>
</li>
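<li><p><strong>GET with ID and file=true</strong>: Downloads the original file from storage. Adding <code>view=true</code> serves PDFs inline so the browser can display them instead of downloading.</p>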
</li>
<li><p><strong>DELETE with ID</strong>: Removes the original file from storage and deletes all of the document's chunks from the database</p>
</li>
</ul>
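<p>On the client, these behaviors come down to the HTTP method plus the query parameters. A small helper can build the URLs; <code>documentUrl</code> is a hypothetical name used only for this sketch, and it sets <code>view</code> only alongside <code>file</code> because the route reads <code>view</code> only when a file is requested.</p>
<pre><code class="lang-typescript">// Hypothetical client helper: builds URLs for the /api/documents route.
function documentUrl(id?: string, options: { file?: boolean; view?: boolean } = {}): string {
  const params = new URLSearchParams();
  if (id) {
    params.set('id', id);
    if (options.file) {
      params.set('file', 'true');
      // The route only checks view when file=true is also present.
      if (options.view) params.set('view', 'true');
    }
  }
  const query = params.toString();
  return query ? `/api/documents?${query}` : '/api/documents';
}

// List all documents:          fetch(documentUrl())
// Details and full text:       fetch(documentUrl(id))
// Download the original file:  fetch(documentUrl(id, { file: true }))
// Show a PDF inline:           open documentUrl(id, { file: true, view: true })
// Delete a document:           fetch(documentUrl(id), { method: 'DELETE' })
</code></pre>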
<p>Now that your API routes are complete, let's build the user interface components, starting with the upload modal.</p>
<h2 id="heading-step-8-create-the-upload-modal-component">Step 8: Create the Upload Modal Component</h2>
<p>The upload modal provides a user-friendly interface for selecting and uploading documents. It handles file selection and upload progress, and it displays success or error messages.</p>
<p>Create <code>src/app/components/UploadModal.tsx</code>:</p>
<pre><code class="lang-typescript"><span class="hljs-string">'use client'</span>;
<span class="hljs-keyword">import</span> { useState, useEffect } <span class="hljs-keyword">from</span> <span class="hljs-string">'react'</span>;

<span class="hljs-keyword">interface</span> UploadModalProps {
  isOpen: <span class="hljs-built_in">boolean</span>;
  onClose: <span class="hljs-function">() =&gt;</span> <span class="hljs-built_in">void</span>;
  onUploadSuccess?: <span class="hljs-function">() =&gt;</span> <span class="hljs-built_in">void</span>;
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">UploadModal</span>(<span class="hljs-params">{ isOpen, onClose, onUploadSuccess }: UploadModalProps</span>) </span>{
  <span class="hljs-keyword">const</span> [file, setFile] = useState&lt;File | <span class="hljs-literal">null</span>&gt;(<span class="hljs-literal">null</span>);
  <span class="hljs-keyword">const</span> [uploading, setUploading] = useState(<span class="hljs-literal">false</span>);
  <span class="hljs-keyword">const</span> [message, setMessage] = useState&lt;{ <span class="hljs-keyword">type</span>: <span class="hljs-string">'success'</span> | <span class="hljs-string">'error'</span>; text: <span class="hljs-built_in">string</span> } | <span class="hljs-literal">null</span>&gt;(<span class="hljs-literal">null</span>);

  useEffect(<span class="hljs-function">() =&gt;</span> {
    <span class="hljs-built_in">document</span>.body.style.overflow = isOpen ? <span class="hljs-string">'hidden'</span> : <span class="hljs-string">'unset'</span>;
    <span class="hljs-keyword">if</span> (!isOpen) { 
      setFile(<span class="hljs-literal">null</span>); 
      setMessage(<span class="hljs-literal">null</span>); 
    }
    <span class="hljs-keyword">return</span> <span class="hljs-function">() =&gt;</span> { 
      <span class="hljs-built_in">document</span>.body.style.overflow = <span class="hljs-string">'unset'</span>; 
    };
  }, [isOpen]);

  <span class="hljs-keyword">const</span> handleFileChange = <span class="hljs-function">(<span class="hljs-params">e: React.ChangeEvent&lt;HTMLInputElement&gt;</span>) =&gt;</span> {
    <span class="hljs-keyword">if</span> (e.target.files &amp;&amp; e.target.files[<span class="hljs-number">0</span>]) {
      setFile(e.target.files[<span class="hljs-number">0</span>]);
      setMessage(<span class="hljs-literal">null</span>);
    }
  };

  <span class="hljs-keyword">const</span> handleUpload = <span class="hljs-keyword">async</span> () =&gt; {
    <span class="hljs-keyword">if</span> (!file) {
      setMessage({ <span class="hljs-keyword">type</span>: <span class="hljs-string">'error'</span>, text: <span class="hljs-string">'Please select a file'</span> });
      <span class="hljs-keyword">return</span>;
    }

    setUploading(<span class="hljs-literal">true</span>);
    setMessage(<span class="hljs-literal">null</span>);

    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">const</span> formData = <span class="hljs-keyword">new</span> FormData();
      formData.append(<span class="hljs-string">'file'</span>, file);

      <span class="hljs-keyword">const</span> res = <span class="hljs-keyword">await</span> fetch(<span class="hljs-string">'/api/upload'</span>, {
        method: <span class="hljs-string">'POST'</span>,
        body: formData,
      });

      <span class="hljs-keyword">const</span> data = <span class="hljs-keyword">await</span> res.json();

      <span class="hljs-keyword">if</span> (data.success) {
        setMessage({
          <span class="hljs-keyword">type</span>: <span class="hljs-string">'success'</span>,
          text: <span class="hljs-string">`File "<span class="hljs-subst">${data.fileName}</span>" uploaded successfully! Processed <span class="hljs-subst">${data.chunks}</span> chunks.`</span>,
        });
        setFile(<span class="hljs-literal">null</span>);
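        <span class="hljs-comment">// Reset the native file input. Assigning '' to the value property clears the selection; setAttribute('value', '') does not.</span>
        <span class="hljs-keyword">const</span> fileInput = <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'upload-file-input'</span>) <span class="hljs-keyword">as</span> HTMLInputElement | <span class="hljs-literal">null</span>;
        <span class="hljs-keyword">if</span> (fileInput) fileInput.value = <span class="hljs-string">''</span>;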
        <span class="hljs-built_in">setTimeout</span>(<span class="hljs-function">() =&gt;</span> { 
          onUploadSuccess?.(); 
          onClose(); 
        }, <span class="hljs-number">1500</span>);
      } <span class="hljs-keyword">else</span> {
        setMessage({ <span class="hljs-keyword">type</span>: <span class="hljs-string">'error'</span>, text: data.error || <span class="hljs-string">'Upload failed'</span> });
      }
    } <span class="hljs-keyword">catch</span> (error: <span class="hljs-built_in">any</span>) {
      setMessage({ <span class="hljs-keyword">type</span>: <span class="hljs-string">'error'</span>, text: error.message || <span class="hljs-string">'Upload failed'</span> });
    } <span class="hljs-keyword">finally</span> {
      setUploading(<span class="hljs-literal">false</span>);
    }
  };

  <span class="hljs-keyword">if</span> (!isOpen) <span class="hljs-keyword">return</span> <span class="hljs-literal">null</span>;

  <span class="hljs-keyword">return</span> (
    &lt;div
      className=<span class="hljs-string">"fixed inset-0 z-50 flex items-center justify-center bg-black bg-opacity-75 p-4"</span>
      onClick={onClose}
    &gt;
      &lt;div
        className=<span class="hljs-string">"relative bg-white dark:bg-gray-900 rounded-lg shadow-xl w-full max-w-2xl max-h-[90vh] overflow-y-auto"</span>
        onClick={<span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> e.stopPropagation()}
      &gt;
        &lt;div className=<span class="hljs-string">"flex items-center justify-between p-6 border-b border-gray-200 dark:border-gray-800"</span>&gt;
          &lt;h2 className=<span class="hljs-string">"text-2xl font-semibold text-gray-900 dark:text-gray-100"</span>&gt;
            Upload Document
          &lt;/h2&gt;
          &lt;button
            onClick={onClose}
            className=<span class="hljs-string">"p-2 text-gray-500 hover:text-gray-700 dark:text-gray-400 dark:hover:text-gray-200 rounded-lg hover:bg-gray-100 dark:hover:bg-gray-800"</span>
            aria-label=<span class="hljs-string">"Close"</span>
          &gt;
            &lt;svg className=<span class="hljs-string">"w-6 h-6"</span> fill=<span class="hljs-string">"none"</span> stroke=<span class="hljs-string">"currentColor"</span> viewBox=<span class="hljs-string">"0 0 24 24"</span>&gt;
              &lt;path strokeLinecap=<span class="hljs-string">"round"</span> strokeLinejoin=<span class="hljs-string">"round"</span> strokeWidth={<span class="hljs-number">2</span>} d=<span class="hljs-string">"M6 18L18 6M6 6l12 12"</span> /&gt;
            &lt;/svg&gt;
          &lt;/button&gt;
        &lt;/div&gt;

        &lt;div className=<span class="hljs-string">"p-6"</span>&gt;
          &lt;div className=<span class="hljs-string">"mb-6"</span>&gt;
            &lt;label htmlFor=<span class="hljs-string">"upload-file-input"</span> className=<span class="hljs-string">"block text-sm font-medium text-gray-700 dark:text-gray-300 mb-2"</span>&gt;
              Select a file (PDF, DOCX, or TXT)
            &lt;/label&gt;
            &lt;input
              id=<span class="hljs-string">"upload-file-input"</span>
              <span class="hljs-keyword">type</span>=<span class="hljs-string">"file"</span>
              accept=<span class="hljs-string">".pdf,.docx,.txt"</span>
              onChange={handleFileChange}
              className=<span class="hljs-string">"block w-full text-sm text-gray-500
                file:mr-4 file:py-2 file:px-4
                file:rounded-lg file:border-0
                file:text-sm file:font-semibold
                file:bg-blue-50 file:text-blue-700
                hover:file:bg-blue-100
                dark:file:bg-blue-900 dark:file:text-blue-300
                dark:hover:file:bg-blue-800"</span>
            /&gt;
          &lt;/div&gt;

          {file &amp;&amp; (
            &lt;div className=<span class="hljs-string">"mb-6 p-4 bg-gray-50 dark:bg-gray-800 rounded-lg text-sm text-gray-600 dark:text-gray-400 space-y-1"</span>&gt;
              &lt;p&gt;&lt;span className=<span class="hljs-string">"font-medium"</span>&gt;Selected:&lt;/span&gt; {file.name}&lt;/p&gt;
              &lt;p&gt;&lt;span className=<span class="hljs-string">"font-medium"</span>&gt;Size:&lt;/span&gt; {(file.size / <span class="hljs-number">1024</span>).toFixed(<span class="hljs-number">2</span>)} KB&lt;/p&gt;
              &lt;p&gt;&lt;span className=<span class="hljs-string">"font-medium"</span>&gt;Type:&lt;/span&gt; {file.type || file.name.split(<span class="hljs-string">'.'</span>).pop()}&lt;/p&gt;
            &lt;/div&gt;
          )}

          &lt;button
            onClick={handleUpload}
            disabled={!file || uploading}
            className=<span class="hljs-string">"w-full bg-blue-600 text-white px-6 py-3 rounded-lg hover:bg-blue-700 disabled:bg-gray-400 disabled:cursor-not-allowed font-medium"</span>
          &gt;
            {uploading ? <span class="hljs-string">'Uploading and Processing...'</span> : <span class="hljs-string">'Upload Document'</span>}
          &lt;/button&gt;

          {message &amp;&amp; (
            &lt;div
              className={<span class="hljs-string">`mt-6 p-4 rounded-lg <span class="hljs-subst">${
                message.<span class="hljs-keyword">type</span> === <span class="hljs-string">'success'</span>
                  ? <span class="hljs-string">'bg-green-50 text-green-800 dark:bg-green-900 dark:text-green-200'</span>
                  : <span class="hljs-string">'bg-red-50 text-red-800 dark:bg-red-900 dark:text-red-200'</span>
              }</span>`</span>}
            &gt;
              {message.text}
            &lt;/div&gt;
          )}

          &lt;div className=<span class="hljs-string">"mt-8 p-4 bg-blue-50 dark:bg-blue-900/20 rounded-lg text-sm"</span>&gt;
            &lt;p className=<span class="hljs-string">"font-medium text-blue-900 dark:text-blue-200 mb-2"</span>&gt;Supported: PDF, DOCX, TXT&lt;/p&gt;
            &lt;p className=<span class="hljs-string">"text-blue-700 dark:text-blue-400"</span>&gt;Files will be processed and embedded <span class="hljs-keyword">for</span> RAG search.&lt;/p&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  );
}
</code></pre>
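<p>This component lets users pick a PDF, DOCX, or TXT file, sends it to <code>/api/upload</code> as form data, and reports the result inline. On success it shows the processed chunk count, then calls <code>onUploadSuccess</code> and closes after 1.5 seconds; on failure it shows the error message. Next, let's create the PDF viewer component for previewing documents.</p>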
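<p>The <code>accept</code> attribute on the file input only filters what the file picker shows. It does not stop a user from choosing another file type, for example by switching the picker to "All files". As a minimal sketch, a small check like the one below could run in <code>handleFileChange</code> before the file is stored. The names <code>validateUploadFile</code> and <code>FileLike</code>, and the 10 MB limit, are assumptions for illustration and are not part of the tutorial's code. The server should still validate uploads itself.</p>
<pre><code class="lang-typescript">// Hypothetical helper: the names and the size limit are illustrative assumptions.
const ALLOWED_EXTENSIONS = ['pdf', 'docx', 'txt']; // mirrors the input's accept list
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024; // assumed 10 MB cap; choose what suits your server

// Minimal structural type so the helper works with a browser File or a plain object.
interface FileLike {
  name: string;
  size: number;
}

// Returns an error message for the user, or null when the file looks acceptable.
function validateUploadFile(file: FileLike): string | null {
  const dot = file.name.lastIndexOf('.');
  const extension = dot === -1 ? '' : file.name.slice(dot + 1).toLowerCase();

  if (!ALLOWED_EXTENSIONS.includes(extension)) {
    return 'Unsupported file type. Please choose a PDF, DOCX, or TXT file.';
  }
  if (file.size === 0) {
    return 'The selected file is empty.';
  }
  if (file.size &gt; MAX_UPLOAD_BYTES) {
    return 'File is too large (limit: 10 MB).';
  }
  return null;
}
</code></pre>
<p>Because a browser <code>File</code> has <code>name</code> and <code>size</code> properties, you could pass <code>e.target.files[0]</code> straight in and call <code>setMessage({ type: 'error', text })</code> whenever the result is not <code>null</code>.</p>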
<h2 id="heading-step-9-create-the-pdf-viewer-modal-component">Step 9: Create the PDF Viewer Modal Component</h2>
<p>The PDF viewer modal allows users to preview PDFs and view extracted text from any document. It's particularly useful for verifying that documents were processed correctly.</p>
<p>Create <code>src/app/components/PDFViewerModal.tsx</code>:</p>
<pre><code class="lang-typescript"><span class="hljs-string">'use client'</span>;
<span class="hljs-keyword">import</span> { useEffect, useState } <span class="hljs-keyword">from</span> <span class="hljs-string">'react'</span>;

<span class="hljs-keyword">interface</span> PDFViewerModalProps {
  isOpen: <span class="hljs-built_in">boolean</span>;
  onClose: <span class="hljs-function">() =&gt;</span> <span class="hljs-built_in">void</span>;
  fileUrl: <span class="hljs-built_in">string</span>;
  fileName: <span class="hljs-built_in">string</span>;
  documentId?: <span class="hljs-built_in">string</span>;
  isPDF?: <span class="hljs-built_in">boolean</span>;
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">PDFViewerModal</span>(<span class="hljs-params">{ 
  isOpen, 
  onClose, 
  fileUrl, 
  fileName, 
  documentId, 
  isPDF = <span class="hljs-literal">true</span> 
}: PDFViewerModalProps</span>) </span>{
  <span class="hljs-keyword">const</span> [error, setError] = useState&lt;<span class="hljs-built_in">string</span> | <span class="hljs-literal">null</span>&gt;(<span class="hljs-literal">null</span>);
  <span class="hljs-keyword">const</span> [loading, setLoading] = useState(<span class="hljs-literal">true</span>);
  <span class="hljs-keyword">const</span> [activeTab, setActiveTab] = useState&lt;<span class="hljs-string">'preview'</span> | <span class="hljs-string">'content'</span>&gt;(<span class="hljs-string">'preview'</span>);
  <span class="hljs-keyword">const</span> [text, setText] = useState&lt;<span class="hljs-built_in">string</span>&gt;(<span class="hljs-string">''</span>);
  <span class="hljs-keyword">const</span> [textLoading, setTextLoading] = useState(<span class="hljs-literal">false</span>);
  <span class="hljs-keyword">const</span> [textError, setTextError] = useState&lt;<span class="hljs-built_in">string</span> | <span class="hljs-literal">null</span>&gt;(<span class="hljs-literal">null</span>);

  useEffect(<span class="hljs-function">() =&gt;</span> {
    <span class="hljs-built_in">document</span>.body.style.overflow = isOpen ? <span class="hljs-string">'hidden'</span> : <span class="hljs-string">'unset'</span>;
    <span class="hljs-keyword">if</span> (isOpen) { 
      setError(<span class="hljs-literal">null</span>); 
      setLoading(<span class="hljs-literal">true</span>); 
      setActiveTab(isPDF ? <span class="hljs-string">'preview'</span> : <span class="hljs-string">'content'</span>); 
      setText(<span class="hljs-string">''</span>); 
      setTextError(<span class="hljs-literal">null</span>); 
    }
    <span class="hljs-keyword">return</span> <span class="hljs-function">() =&gt;</span> { 
      <span class="hljs-built_in">document</span>.body.style.overflow = <span class="hljs-string">'unset'</span>; 
    };
  }, [isOpen, isPDF]);

  useEffect(<span class="hljs-function">() =&gt;</span> {
    <span class="hljs-keyword">if</span> (isOpen &amp;&amp; documentId &amp;&amp; activeTab === <span class="hljs-string">'content'</span> &amp;&amp; !text &amp;&amp; !textLoading &amp;&amp; !textError) {
      fetchDocumentText();
    }
  }, [isOpen, documentId, activeTab, text, textLoading, textError]);

  useEffect(<span class="hljs-function">() =&gt;</span> {
    <span class="hljs-keyword">if</span> (isOpen &amp;&amp; fileUrl &amp;&amp; isPDF) {
      fetch(fileUrl, { method: <span class="hljs-string">'GET'</span>, headers: { <span class="hljs-string">'Accept'</span>: <span class="hljs-string">'application/json'</span> } })
        .then(<span class="hljs-keyword">async</span> res =&gt; {
          <span class="hljs-keyword">if</span> (res.headers.get(<span class="hljs-string">'content-type'</span>)?.includes(<span class="hljs-string">'application/json'</span>)) {
            <span class="hljs-keyword">const</span> data = <span class="hljs-keyword">await</span> res.json();
            <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(data.error || <span class="hljs-string">'File not available'</span>);
          }
          <span class="hljs-keyword">if</span> (!res.ok) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">`Failed to load: <span class="hljs-subst">${res.status}</span>`</span>);
          setLoading(<span class="hljs-literal">false</span>);
        })
        .catch(<span class="hljs-function"><span class="hljs-params">err</span> =&gt;</span> {
          setError(err.message || <span class="hljs-string">'Failed to load PDF'</span>);
          setLoading(<span class="hljs-literal">false</span>);
        });
    } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (isOpen &amp;&amp; !isPDF) {
      setLoading(<span class="hljs-literal">false</span>);
    }
  }, [isOpen, fileUrl, isPDF]);

  <span class="hljs-keyword">const</span> fetchDocumentText = <span class="hljs-keyword">async</span> () =&gt; {
    <span class="hljs-keyword">if</span> (!documentId) <span class="hljs-keyword">return</span>;
    setTextLoading(<span class="hljs-literal">true</span>); 
    setTextError(<span class="hljs-literal">null</span>);
    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">const</span> res = <span class="hljs-keyword">await</span> fetch(<span class="hljs-string">`/api/documents?id=<span class="hljs-subst">${<span class="hljs-built_in">encodeURIComponent</span>(documentId)}</span>`</span>);
      <span class="hljs-keyword">const</span> data = <span class="hljs-keyword">await</span> res.json();
      <span class="hljs-keyword">if</span> (data.error) {
        setTextError(data.error);
      } <span class="hljs-keyword">else</span> {
        setText(data.fullText || <span class="hljs-string">'No text content available'</span>);
      }
    } <span class="hljs-keyword">catch</span> (err) {
      setTextError(err <span class="hljs-keyword">instanceof</span> <span class="hljs-built_in">Error</span> ? err.message : <span class="hljs-string">'Failed to fetch document text'</span>);
    } <span class="hljs-keyword">finally</span> {
      setTextLoading(<span class="hljs-literal">false</span>);
    }
  };

  <span class="hljs-keyword">if</span> (!isOpen) <span class="hljs-keyword">return</span> <span class="hljs-literal">null</span>;

  <span class="hljs-keyword">return</span> (
    &lt;div
      className=<span class="hljs-string">"fixed inset-0 z-50 flex items-center justify-center bg-black bg-opacity-75 p-4"</span>
      onClick={onClose}
    &gt;
      &lt;div
        className=<span class="hljs-string">"relative bg-white dark:bg-gray-900 rounded-lg shadow-xl w-full max-w-6xl h-[90vh] flex flex-col"</span>
        onClick={<span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> e.stopPropagation()}
      &gt;
        &lt;div className=<span class="hljs-string">"flex flex-col border-b border-gray-200 dark:border-gray-800"</span>&gt;
          &lt;div className=<span class="hljs-string">"flex items-center justify-between p-4"</span>&gt;
            &lt;h2 className=<span class="hljs-string">"text-xl font-semibold text-gray-900 dark:text-gray-100 truncate flex-1 mr-4"</span>&gt;
              {fileName}
            &lt;/h2&gt;
            &lt;div className=<span class="hljs-string">"flex items-center gap-2"</span>&gt;
              &lt;button
                onClick={onClose}
                className=<span class="hljs-string">"p-2 text-gray-500 hover:text-gray-700 dark:text-gray-400 dark:hover:text-gray-200 rounded-lg hover:bg-gray-100 dark:hover:bg-gray-800"</span>
                aria-label=<span class="hljs-string">"Close"</span>
              &gt;
                &lt;svg className=<span class="hljs-string">"w-6 h-6"</span> fill=<span class="hljs-string">"none"</span> stroke=<span class="hljs-string">"currentColor"</span> viewBox=<span class="hljs-string">"0 0 24 24"</span>&gt;
                  &lt;path strokeLinecap=<span class="hljs-string">"round"</span> strokeLinejoin=<span class="hljs-string">"round"</span> strokeWidth={<span class="hljs-number">2</span>} d=<span class="hljs-string">"M6 18L18 6M6 6l12 12"</span> /&gt;
                &lt;/svg&gt;
              &lt;/button&gt;
            &lt;/div&gt;
          &lt;/div&gt;

          {isPDF &amp;&amp; (
            &lt;div className=<span class="hljs-string">"flex border-t border-gray-200 dark:border-gray-800"</span>&gt;
              {([<span class="hljs-string">'preview'</span>, <span class="hljs-string">'content'</span>] <span class="hljs-keyword">as</span> <span class="hljs-keyword">const</span>).map(<span class="hljs-function"><span class="hljs-params">tab</span> =&gt;</span> (
                &lt;button 
                  key={tab} 
                  onClick={<span class="hljs-function">() =&gt;</span> setActiveTab(tab)} 
                  className={<span class="hljs-string">`flex-1 px-4 py-3 text-sm font-medium transition-colors <span class="hljs-subst">${
                    activeTab === tab 
                      ? <span class="hljs-string">'text-blue-600 dark:text-blue-400 border-b-2 border-blue-600 dark:border-blue-400 bg-blue-50 dark:bg-blue-900/20'</span> 
                      : <span class="hljs-string">'text-gray-500 dark:text-gray-400 hover:text-gray-700 dark:hover:text-gray-300 hover:bg-gray-50 dark:hover:bg-gray-800'</span>
                  }</span>`</span>}
                &gt;
                  {tab.charAt(<span class="hljs-number">0</span>).toUpperCase() + tab.slice(<span class="hljs-number">1</span>)}
                &lt;/button&gt;
              ))}
            &lt;/div&gt;
          )}
        &lt;/div&gt;

        &lt;div className=<span class="hljs-string">"flex-1 overflow-hidden"</span>&gt;
          {isPDF &amp;&amp; activeTab === <span class="hljs-string">'preview'</span> &amp;&amp; (
            &lt;div className=<span class="hljs-string">"h-full overflow-hidden"</span>&gt;
              {error ? (
                &lt;div className=<span class="hljs-string">"flex flex-col items-center justify-center h-full p-8"</span>&gt;
                  &lt;div className=<span class="hljs-string">"bg-yellow-50 dark:bg-yellow-900/20 border border-yellow-200 dark:border-yellow-800 rounded-lg p-6 max-w-md"</span>&gt;
                    &lt;h3 className=<span class="hljs-string">"text-lg font-semibold text-yellow-800 dark:text-yellow-200 mb-2"</span>&gt;
                      PDF File Not Available
                    &lt;/h3&gt;
                    &lt;p className=<span class="hljs-string">"text-yellow-700 dark:text-yellow-300 mb-4"</span>&gt;{error}&lt;/p&gt;
                    {documentId &amp;&amp; (
                      &lt;button 
                        onClick={<span class="hljs-function">() =&gt;</span> setActiveTab(<span class="hljs-string">'content'</span>)} 
                        className=<span class="hljs-string">"px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700 font-medium"</span>
                      &gt;
                        View Extracted Text Instead
                      &lt;/button&gt;
                    )}
                  &lt;/div&gt;
                &lt;/div&gt;
              ) : loading ? (
                &lt;div className=<span class="hljs-string">"flex items-center justify-center h-full"</span>&gt;
                  &lt;p className=<span class="hljs-string">"text-gray-500 dark:text-gray-400"</span>&gt;Loading PDF...&lt;/p&gt;
                &lt;/div&gt;
              ) : (
                &lt;iframe
                  src={<span class="hljs-string">`<span class="hljs-subst">${fileUrl}</span><span class="hljs-subst">${fileUrl.includes(<span class="hljs-string">'?'</span>) ? <span class="hljs-string">'&amp;'</span> : <span class="hljs-string">'?'</span>}</span>view=true#toolbar=1&amp;navpanes=0&amp;scrollbar=1`</span>}
                  className=<span class="hljs-string">"w-full h-full border-0"</span>
                  title={fileName}
                  allow=<span class="hljs-string">"fullscreen"</span>
                  onError={<span class="hljs-function">() =&gt;</span> setError(<span class="hljs-string">'Failed to load PDF'</span>)}
                /&gt;
              )}
            &lt;/div&gt;
          )}

          {(!isPDF || activeTab === <span class="hljs-string">'content'</span>) &amp;&amp; (
            &lt;div className=<span class="hljs-string">"h-full overflow-auto p-6"</span>&gt;
              {textLoading ? (
                &lt;div className=<span class="hljs-string">"flex items-center justify-center h-full"</span>&gt;
                  &lt;p className=<span class="hljs-string">"text-gray-500 dark:text-gray-400"</span>&gt;Loading...&lt;/p&gt;
                &lt;/div&gt;
              ) : textError ? (
                &lt;div className=<span class="hljs-string">"bg-red-50 dark:bg-red-900/20 border border-red-200 dark:border-red-800 rounded-lg p-4"</span>&gt;
                  &lt;p className=<span class="hljs-string">"text-red-800 dark:text-red-200"</span>&gt;<span class="hljs-built_in">Error</span>: {textError}&lt;/p&gt;
                &lt;/div&gt;
              ) : (
                &lt;div className=<span class="hljs-string">"space-y-4"</span>&gt;
                  &lt;p className=<span class="hljs-string">"text-sm text-gray-500 dark:text-gray-400"</span>&gt;
                    Formatting may be inconsistent from source.
                  &lt;/p&gt;
                  &lt;pre className=<span class="hljs-string">"whitespace-pre-wrap text-sm text-gray-800 dark:text-gray-200 font-mono bg-gray-50 dark:bg-gray-800 p-4 rounded-lg"</span>&gt;
                    {text || <span class="hljs-string">'No text content available'</span>}
                  &lt;/pre&gt;
                &lt;/div&gt;
              )}
            &lt;/div&gt;
          )}
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  );
}
</code></pre>
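<p>This component renders a large modal (90% of the viewport height) for viewing a document. For PDFs it shows two tabs, Preview and Content: Preview embeds the file in an <code>iframe</code>, and Content loads the extracted text from <code>/api/documents</code> the first time you open it, provided a <code>documentId</code> was passed in. Non-PDF documents skip the tabs and show the extracted text directly. Now let's create a simple navigation component to tie everything together.</p>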
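<p>The <code>src</code> expression on the <code>iframe</code> does two things: it appends <code>view=true</code> to the query string, and it adds viewer hints after the <code>#</code>. A sketch of the same logic as a standalone function (the name <code>buildPreviewUrl</code> is hypothetical) may make the two parts easier to see and to test in isolation.</p>
<pre><code class="lang-typescript">// Hypothetical helper that mirrors the iframe's src expression above.
function buildPreviewUrl(fileUrl: string): string {
  // Use '&amp;' when the URL already has a query string, '?' otherwise.
  const separator = fileUrl.includes('?') ? '&amp;' : '?';
  // Everything after '#' stays in the browser and is never sent to the server.
  const viewerHints = 'toolbar=1&amp;navpanes=0&amp;scrollbar=1';
  return `${fileUrl}${separator}view=true#${viewerHints}`;
}
</code></pre>
<p>Since the fragment is not part of the HTTP request, only <code>view=true</code> reaches the file route. The <code>toolbar</code>, <code>navpanes</code>, and <code>scrollbar</code> hints are PDF open parameters that a browser's built-in viewer is free to ignore, so treat them as best-effort.</p>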
<h2 id="heading-step-10-create-the-navigation-component">Step 10: Create the Navigation Component</h2>
<p>The navigation component links to the Search and Documents pages and highlights whichever one matches the current path, so navigation looks the same on every page.</p>
<p>Create <code>src/app/components/Navigation.tsx</code>:</p>
<pre><code class="lang-typescript"><span class="hljs-string">'use client'</span>;
<span class="hljs-keyword">import</span> Link <span class="hljs-keyword">from</span> <span class="hljs-string">'next/link'</span>;
<span class="hljs-keyword">import</span> { usePathname } <span class="hljs-keyword">from</span> <span class="hljs-string">'next/navigation'</span>;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">Navigation</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">const</span> pathname = usePathname();

  <span class="hljs-keyword">const</span> navItems = [
    { href: <span class="hljs-string">'/'</span>, label: <span class="hljs-string">'Search'</span> },
    { href: <span class="hljs-string">'/documents'</span>, label: <span class="hljs-string">'Documents'</span> },
  ];

  <span class="hljs-keyword">return</span> (
    &lt;nav className=<span class="hljs-string">"border-b border-gray-200 dark:border-gray-800 mb-8"</span>&gt;
      &lt;div className=<span class="hljs-string">"max-w-7xl mx-auto px-4 sm:px-6 lg:px-8"</span>&gt;
        &lt;div className=<span class="hljs-string">"flex space-x-8"</span>&gt;
          {navItems.map(<span class="hljs-function">(<span class="hljs-params">item</span>) =&gt;</span> (
            &lt;Link
              key={item.href}
              href={item.href}
              className={<span class="hljs-string">`py-4 px-1 border-b-2 font-medium text-sm <span class="hljs-subst">${
                pathname === item.href
                  ? <span class="hljs-string">'border-blue-500 text-blue-600 dark:text-blue-400'</span>
                  : <span class="hljs-string">'border-transparent text-gray-500 hover:text-gray-700 hover:border-gray-300 dark:text-gray-400 dark:hover:text-gray-300'</span>
              }</span>`</span>}
            &gt;
              {item.label}
            &lt;/Link&gt;
          ))}
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/nav&gt;
  );
}
</code></pre>
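<p>The strict comparison <code>pathname === item.href</code> is enough for the two top-level routes in this app. If you later add nested routes (say, a hypothetical <code>/documents/123</code> detail page), the Documents link would no longer be highlighted. One possible approach, shown here as a sketch with an assumed helper name <code>isNavItemActive</code>, is prefix matching with a special case for the root route:</p>
<pre><code class="lang-typescript">// Hypothetical helper; not part of the tutorial's Navigation component.
function isNavItemActive(pathname: string, href: string): boolean {
  // The root route is a prefix of every path, so it has to match exactly.
  if (href === '/') {
    return pathname === '/';
  }
  // Match the route itself or anything nested beneath it, but not
  // look-alike paths such as '/documents-archive'.
  return pathname === href || pathname.startsWith(`${href}/`);
}
</code></pre>
<p>To try it, you would replace the condition in the <code>className</code> template with <code>isNavItemActive(pathname, item.href)</code>.</p>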
<p>With navigation in place, let's create the main search page where users can query their documents.</p>
<h2 id="heading-step-11-create-the-home-page-search-interface">Step 11: Create the Home Page (Search Interface)</h2>
<p>The search page is the main interface where users ask questions about their uploaded documents. It displays the AI-generated answers along with source citations, allowing users to verify the information.</p>
<p>Update <code>src/app/page.tsx</code>:</p>
<pre><code class="lang-typescript"><span class="hljs-string">'use client'</span>;
<span class="hljs-keyword">import</span> { useState } <span class="hljs-keyword">from</span> <span class="hljs-string">'react'</span>;
<span class="hljs-keyword">import</span> Navigation <span class="hljs-keyword">from</span> <span class="hljs-string">'./components/Navigation'</span>;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">Home</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">const</span> [query, setQuery] = useState(<span class="hljs-string">''</span>);
  <span class="hljs-keyword">const</span> [answer, setAnswer] = useState(<span class="hljs-string">''</span>);
  <span class="hljs-keyword">const</span> [loading, setLoading] = useState(<span class="hljs-literal">false</span>);
  <span class="hljs-keyword">const</span> [sources, setSources] = useState&lt;<span class="hljs-built_in">any</span>[]&gt;([]);

  <span class="hljs-keyword">const</span> handleSearch = <span class="hljs-keyword">async</span> () =&gt; {
    <span class="hljs-keyword">if</span> (!query.trim()) <span class="hljs-keyword">return</span>;
    setLoading(<span class="hljs-literal">true</span>); 
    setAnswer(<span class="hljs-string">''</span>); 
    setSources([]);
    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">const</span> res = <span class="hljs-keyword">await</span> fetch(<span class="hljs-string">'/api/search'</span>, { 
        method: <span class="hljs-string">'POST'</span>, 
        headers: { <span class="hljs-string">'Content-Type'</span>: <span class="hljs-string">'application/json'</span> }, 
        body: <span class="hljs-built_in">JSON</span>.stringify({ query }) 
      });
      <span class="hljs-keyword">const</span> data = <span class="hljs-keyword">await</span> res.json();
      <span class="hljs-keyword">if</span> (data.error) {
        setAnswer(<span class="hljs-string">`Error: <span class="hljs-subst">${data.error}</span>`</span>);
      } <span class="hljs-keyword">else</span> { 
        setAnswer(data.answer || <span class="hljs-string">'No answer generated'</span>); 
        setSources(data.sources || []); 
      }
    } <span class="hljs-keyword">catch</span> (error: <span class="hljs-built_in">any</span>) {
      setAnswer(<span class="hljs-string">`Error: <span class="hljs-subst">${error.message}</span>`</span>);
    } <span class="hljs-keyword">finally</span> {
      setLoading(<span class="hljs-literal">false</span>);
    }
  };

  <span class="hljs-keyword">const</span> handleKeyPress = <span class="hljs-function">(<span class="hljs-params">e: React.KeyboardEvent</span>) =&gt;</span> {
    <span class="hljs-keyword">if</span> (e.key === <span class="hljs-string">'Enter'</span> &amp;&amp; (e.metaKey || e.ctrlKey)) {
      handleSearch();
    }
  };

  <span class="hljs-keyword">return</span> (
    &lt;div className=<span class="hljs-string">"min-h-screen"</span>&gt;
      &lt;Navigation /&gt;
      &lt;main className=<span class="hljs-string">"max-w-4xl mx-auto p-8"</span>&gt;
        &lt;h1 className=<span class="hljs-string">"text-3xl font-bold mb-6"</span>&gt;RAG Search&lt;/h1&gt;

        &lt;div className=<span class="hljs-string">"bg-white dark:bg-gray-900 border border-gray-200 dark:border-gray-800 rounded-lg p-6 shadow-sm mb-6"</span>&gt;
          &lt;textarea 
            className=<span class="hljs-string">"w-full p-4 border border-gray-300 dark:border-gray-700 rounded-lg shadow-sm bg-white dark:bg-gray-800 text-gray-900 dark:text-gray-100 resize-none focus:ring-2 focus:ring-blue-500 focus:border-transparent"</span>
            placeholder=<span class="hljs-string">"Ask a question about your uploaded documents..."</span>
            value={query}
            onChange={<span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> setQuery(e.target.value)}
            onKeyDown={handleKeyPress}
            rows={<span class="hljs-number">4</span>}
          /&gt;
          &lt;button 
            onClick={handleSearch}
            className=<span class="hljs-string">"mt-4 bg-blue-600 text-white px-8 py-3 rounded-lg hover:bg-blue-700 disabled:bg-gray-400 disabled:cursor-not-allowed font-medium"</span>
            disabled={loading || !query.trim()}
          &gt;
            {loading ? <span class="hljs-string">'Searching...'</span> : <span class="hljs-string">'Search'</span>}
          &lt;/button&gt;
          &lt;p className=<span class="hljs-string">"mt-2 text-sm text-gray-500 dark:text-gray-400"</span>&gt;
            Press Cmd/Ctrl + Enter to search
          &lt;/p&gt;
        &lt;/div&gt;

        {answer &amp;&amp; (
          &lt;div className=<span class="hljs-string">"bg-white dark:bg-gray-900 border border-gray-200 dark:border-gray-800 rounded-lg p-6 shadow-sm mb-6"</span>&gt;
            &lt;h2 className=<span class="hljs-string">"text-xl font-semibold mb-3"</span>&gt;Answer:&lt;/h2&gt;
            &lt;p className=<span class="hljs-string">"text-gray-800 dark:text-gray-200 leading-relaxed whitespace-pre-wrap"</span>&gt;
              {answer}
            &lt;/p&gt;
          &lt;/div&gt;
        )}

        {sources &amp;&amp; sources.length &gt; <span class="hljs-number">0</span> &amp;&amp; (
          &lt;div className=<span class="hljs-string">"bg-white dark:bg-gray-900 border border-gray-200 dark:border-gray-800 rounded-lg p-6 shadow-sm"</span>&gt;
            &lt;h2 className=<span class="hljs-string">"text-xl font-semibold mb-3"</span>&gt;Sources ({sources.length}):&lt;/h2&gt;
            &lt;div className=<span class="hljs-string">"space-y-3"</span>&gt;
              {sources.map(<span class="hljs-function">(<span class="hljs-params">source, index</span>) =&gt;</span> (
                &lt;div
                  key={index}
                  className=<span class="hljs-string">"p-4 bg-gray-50 dark:bg-gray-800 rounded-lg border border-gray-200 dark:border-gray-700"</span>
                &gt;
                  &lt;p className=<span class="hljs-string">"text-sm text-gray-600 dark:text-gray-400 mb-1"</span>&gt;
                    &lt;span className=<span class="hljs-string">"font-medium"</span>&gt;Source:&lt;/span&gt;{<span class="hljs-string">' '</span>}
                    {source.metadata?.source || source.metadata?.file_name || <span class="hljs-string">'Unknown'</span>}
                  &lt;/p&gt;
                  &lt;p className=<span class="hljs-string">"text-sm text-gray-800 dark:text-gray-200 line-clamp-3"</span>&gt;
                    {source.content}
                  &lt;/p&gt;
                &lt;/div&gt;
              ))}
            &lt;/div&gt;
          &lt;/div&gt;
        )}
      &lt;/main&gt;
    &lt;/div&gt;
  );
}
</code></pre>
<p>This page provides a clean search interface with a textarea for queries, a search button, and sections to display answers and source citations. The sources section helps users verify where the information came from, which is crucial for trust and accuracy. Now let's create the documents management page.</p>
<h2 id="heading-step-12-create-the-documents-page">Step 12: Create the Documents Page</h2>
<p>The documents page serves as your document library. It displays all uploaded documents in a table format, shows metadata like file size and chunk count, and provides actions to preview, download, or delete documents. This page is essential for managing your document collection and verifying uploads.</p>
<p>Create <code>src/app/documents/page.tsx</code>:</p>
<pre><code class="lang-typescript"><span class="hljs-string">'use client'</span>;
<span class="hljs-keyword">import</span> { useState, useEffect } <span class="hljs-keyword">from</span> <span class="hljs-string">'react'</span>;
<span class="hljs-keyword">import</span> Navigation <span class="hljs-keyword">from</span> <span class="hljs-string">'../components/Navigation'</span>;
<span class="hljs-keyword">import</span> PDFViewerModal <span class="hljs-keyword">from</span> <span class="hljs-string">'../components/PDFViewerModal'</span>;
<span class="hljs-keyword">import</span> UploadModal <span class="hljs-keyword">from</span> <span class="hljs-string">'../components/UploadModal'</span>;

<span class="hljs-keyword">interface</span> Document {
  id: <span class="hljs-built_in">string</span>;
  file_name: <span class="hljs-built_in">string</span>;
  file_type: <span class="hljs-built_in">string</span>;
  file_size: <span class="hljs-built_in">number</span>;
  upload_date: <span class="hljs-built_in">string</span>;
  total_chunks: <span class="hljs-built_in">number</span>;
  file_url?: <span class="hljs-built_in">string</span>;
  file_path?: <span class="hljs-built_in">string</span>;
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">DocumentsPage</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">const</span> [documents, setDocuments] = useState&lt;Document[]&gt;([]);
  <span class="hljs-keyword">const</span> [loading, setLoading] = useState(<span class="hljs-literal">true</span>);
  <span class="hljs-keyword">const</span> [error, setError] = useState&lt;<span class="hljs-built_in">string</span> | <span class="hljs-literal">null</span>&gt;(<span class="hljs-literal">null</span>);
  <span class="hljs-keyword">const</span> [showPDFModal, setShowPDFModal] = useState(<span class="hljs-literal">false</span>);
  <span class="hljs-keyword">const</span> [selectedPDF, setSelectedPDF] = useState&lt;{ url: <span class="hljs-built_in">string</span>; name: <span class="hljs-built_in">string</span>; id?: <span class="hljs-built_in">string</span>; isPDF?: <span class="hljs-built_in">boolean</span> } | <span class="hljs-literal">null</span>&gt;(<span class="hljs-literal">null</span>);
  <span class="hljs-keyword">const</span> [deletingId, setDeletingId] = useState&lt;<span class="hljs-built_in">string</span> | <span class="hljs-literal">null</span>&gt;(<span class="hljs-literal">null</span>);
  <span class="hljs-keyword">const</span> [showUploadModal, setShowUploadModal] = useState(<span class="hljs-literal">false</span>);

  useEffect(<span class="hljs-function">() =&gt;</span> {
    fetchDocuments();
  }, []);

  <span class="hljs-keyword">const</span> fetchDocuments = <span class="hljs-keyword">async</span> () =&gt; {
    <span class="hljs-keyword">try</span> {
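      setLoading(<span class="hljs-literal">true</span>);
      <span class="hljs-comment">// Clear any previous error so a successful refetch can render the table again</span>
      setError(<span class="hljs-literal">null</span>);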
      <span class="hljs-keyword">const</span> res = <span class="hljs-keyword">await</span> fetch(<span class="hljs-string">'/api/documents'</span>);
      <span class="hljs-keyword">const</span> data = <span class="hljs-keyword">await</span> res.json();
      <span class="hljs-keyword">if</span> (data.error) {
        setError(data.error);
      } <span class="hljs-keyword">else</span> {
        setDocuments(data.documents || []);
      }
    } <span class="hljs-keyword">catch</span> (err) {
      setError(err <span class="hljs-keyword">instanceof</span> <span class="hljs-built_in">Error</span> ? err.message : <span class="hljs-string">'Failed to fetch documents'</span>);
    } <span class="hljs-keyword">finally</span> {
      setLoading(<span class="hljs-literal">false</span>);
    }
  };

  <span class="hljs-keyword">const</span> formatDate = <span class="hljs-function">(<span class="hljs-params">s: <span class="hljs-built_in">string</span></span>) =&gt;</span> {
    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">const</span> d = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Date</span>(s);
      <span class="hljs-keyword">return</span> <span class="hljs-built_in">isNaN</span>(d.getTime()) 
        ? s 
        : d.toLocaleString(<span class="hljs-string">'en-US'</span>, { 
            year: <span class="hljs-string">'numeric'</span>, 
            month: <span class="hljs-string">'short'</span>, 
            day: <span class="hljs-string">'numeric'</span>, 
            hour: <span class="hljs-string">'2-digit'</span>, 
            minute: <span class="hljs-string">'2-digit'</span>, 
            hour12: <span class="hljs-literal">true</span> 
          });
    } <span class="hljs-keyword">catch</span> { 
      <span class="hljs-keyword">return</span> s; 
    }
  };

  <span class="hljs-keyword">const</span> formatFileSize = <span class="hljs-function">(<span class="hljs-params">b: <span class="hljs-built_in">number</span></span>) =&gt;</span> 
    b &lt; <span class="hljs-number">1024</span> 
      ? <span class="hljs-string">`<span class="hljs-subst">${b}</span> B`</span> 
      : b &lt; <span class="hljs-number">1024</span> * <span class="hljs-number">1024</span> 
        ? <span class="hljs-string">`<span class="hljs-subst">${(b / <span class="hljs-number">1024</span>).toFixed(<span class="hljs-number">2</span>)}</span> KB`</span> 
        : <span class="hljs-string">`<span class="hljs-subst">${(b / (<span class="hljs-number">1024</span> * <span class="hljs-number">1024</span>)).toFixed(<span class="hljs-number">2</span>)}</span> MB`</span>;

  <span class="hljs-keyword">const</span> handleDelete = <span class="hljs-keyword">async</span> (id: <span class="hljs-built_in">string</span>, name: <span class="hljs-built_in">string</span>) =&gt; {
    <span class="hljs-keyword">if</span> (!confirm(<span class="hljs-string">`Delete "<span class="hljs-subst">${name}</span>"? This will permanently delete the document, embeddings, and file.`</span>)) {
      <span class="hljs-keyword">return</span>;
    }
    setDeletingId(id);
    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">const</span> res = <span class="hljs-keyword">await</span> fetch(<span class="hljs-string">`/api/documents?id=<span class="hljs-subst">${id}</span>`</span>, { method: <span class="hljs-string">'DELETE'</span> });
      <span class="hljs-keyword">const</span> data = <span class="hljs-keyword">await</span> res.json();
      <span class="hljs-keyword">if</span> (data.error) {
        alert(<span class="hljs-string">`Error: <span class="hljs-subst">${data.error}</span>`</span>);
      } <span class="hljs-keyword">else</span> {
        setDocuments(<span class="hljs-function">(<span class="hljs-params">prev</span>) =&gt;</span> prev.filter(<span class="hljs-function">(<span class="hljs-params">doc</span>) =&gt;</span> doc.id !== id));
      }
    } <span class="hljs-keyword">catch</span> (err) {
      alert(err <span class="hljs-keyword">instanceof</span> <span class="hljs-built_in">Error</span> ? err.message : <span class="hljs-string">'Failed to delete'</span>);
    } <span class="hljs-keyword">finally</span> {
      setDeletingId(<span class="hljs-literal">null</span>);
    }
  };

  <span class="hljs-keyword">return</span> (
    &lt;div className=<span class="hljs-string">"min-h-screen"</span>&gt;
      &lt;Navigation /&gt;
      &lt;main className=<span class="hljs-string">"max-w-7xl mx-auto p-8"</span>&gt;
        &lt;div className=<span class="hljs-string">"flex items-center justify-between mb-6"</span>&gt;
          &lt;h1 className=<span class="hljs-string">"text-3xl font-bold"</span>&gt;Documents&lt;/h1&gt;
          &lt;button
            onClick={<span class="hljs-function">() =&gt;</span> setShowUploadModal(<span class="hljs-literal">true</span>)}
            className=<span class="hljs-string">"px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700 font-medium"</span>
          &gt;
            Upload Document
          &lt;/button&gt;
        &lt;/div&gt;

        {loading ? (
          &lt;div className=<span class="hljs-string">"text-center py-12"</span>&gt;
            &lt;p className=<span class="hljs-string">"text-gray-500 dark:text-gray-400"</span>&gt;Loading documents...&lt;/p&gt;
          &lt;/div&gt;
        ) : error ? (
          &lt;div className=<span class="hljs-string">"bg-red-50 dark:bg-red-900/20 border border-red-200 dark:border-red-800 rounded-lg p-4"</span>&gt;
            &lt;p className=<span class="hljs-string">"text-red-800 dark:text-red-200"</span>&gt;<span class="hljs-built_in">Error</span>: {error}&lt;/p&gt;
          &lt;/div&gt;
        ) : documents.length === <span class="hljs-number">0</span> ? (
          &lt;div className=<span class="hljs-string">"bg-gray-50 dark:bg-gray-800 border border-gray-200 dark:border-gray-700 rounded-lg p-12 text-center"</span>&gt;
            &lt;p className=<span class="hljs-string">"text-gray-500 dark:text-gray-400 mb-4"</span>&gt;No documents uploaded yet.&lt;/p&gt;
            &lt;button
              onClick={<span class="hljs-function">() =&gt;</span> setShowUploadModal(<span class="hljs-literal">true</span>)}
              className=<span class="hljs-string">"text-blue-600 dark:text-blue-400 hover:underline font-medium"</span>
            &gt;
              Upload your first <span class="hljs-built_in">document</span>
            &lt;/button&gt;
          &lt;/div&gt;
        ) : (
          &lt;div className=<span class="hljs-string">"bg-white dark:bg-gray-900 border border-gray-200 dark:border-gray-800 rounded-lg shadow-sm overflow-hidden"</span>&gt;
            &lt;div className=<span class="hljs-string">"overflow-x-auto"</span>&gt;
              &lt;table className=<span class="hljs-string">"min-w-full divide-y divide-gray-200 dark:divide-gray-800"</span>&gt;
                &lt;thead className=<span class="hljs-string">"bg-gray-50 dark:bg-gray-800"</span>&gt;
                  &lt;tr&gt;
                    &lt;th className=<span class="hljs-string">"px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase tracking-wider"</span>&gt;
                      File Name
                    &lt;/th&gt;
                    &lt;th className=<span class="hljs-string">"px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase tracking-wider"</span>&gt;
                      Type
                    &lt;/th&gt;
                    &lt;th className=<span class="hljs-string">"px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase tracking-wider"</span>&gt;
                      Size
                    &lt;/th&gt;
                    &lt;th className=<span class="hljs-string">"px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase tracking-wider"</span>&gt;
                      Chunks
                    &lt;/th&gt;
                    &lt;th className=<span class="hljs-string">"px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase tracking-wider"</span>&gt;
                      Upload <span class="hljs-built_in">Date</span>
                    &lt;/th&gt;
                    &lt;th className=<span class="hljs-string">"px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase tracking-wider"</span>&gt;
                      Actions
                    &lt;/th&gt;
                  &lt;/tr&gt;
                &lt;/thead&gt;
                &lt;tbody className=<span class="hljs-string">"bg-white dark:bg-gray-900 divide-y divide-gray-200 dark:divide-gray-800"</span>&gt;
                  {documents.map(<span class="hljs-function">(<span class="hljs-params">doc</span>) =&gt;</span> (
                    &lt;tr key={doc.id} className=<span class="hljs-string">"hover:bg-gray-50 dark:hover:bg-gray-800"</span>&gt;
                      &lt;td className=<span class="hljs-string">"px-6 py-4 whitespace-nowrap"</span>&gt;
                        &lt;div className=<span class="hljs-string">"text-sm font-medium text-gray-900 dark:text-gray-100"</span>&gt;
                          {doc.file_name}
                        &lt;/div&gt;
                      &lt;/td&gt;
                      &lt;td className=<span class="hljs-string">"px-6 py-4 whitespace-nowrap"</span>&gt;
                        &lt;span className=<span class="hljs-string">"px-2 inline-flex text-xs leading-5 font-semibold rounded-full bg-blue-100 text-blue-800 dark:bg-blue-900 dark:text-blue-200"</span>&gt;
                          {doc.file_type || <span class="hljs-string">'unknown'</span>}
                        &lt;/span&gt;
                      &lt;/td&gt;
                      &lt;td className=<span class="hljs-string">"px-6 py-4 whitespace-nowrap text-sm text-gray-500 dark:text-gray-400"</span>&gt;
                        {formatFileSize(doc.file_size)}
                      &lt;/td&gt;
                      &lt;td className=<span class="hljs-string">"px-6 py-4 whitespace-nowrap text-sm text-gray-500 dark:text-gray-400"</span>&gt;
                        {doc.total_chunks}
                      &lt;/td&gt;
                      &lt;td className=<span class="hljs-string">"px-6 py-4 whitespace-nowrap text-sm text-gray-500 dark:text-gray-400"</span>&gt;
                        {formatDate(doc.upload_date)}
                      &lt;/td&gt;
                      &lt;td className=<span class="hljs-string">"px-6 py-4 whitespace-nowrap text-sm font-medium"</span>&gt;
                        &lt;div className=<span class="hljs-string">"flex gap-3 items-center"</span>&gt;
                          {doc.file_name.toLowerCase().endsWith(<span class="hljs-string">'.pdf'</span>) ? (
                            &lt;button 
                              onClick={<span class="hljs-function">() =&gt;</span> {
                                <span class="hljs-keyword">const</span> pdfUrl = doc.file_url 
                                  ? <span class="hljs-string">`<span class="hljs-subst">${doc.file_url}</span>?view=true`</span> 
                                  : <span class="hljs-string">`/api/documents?id=<span class="hljs-subst">${doc.id}</span>&amp;file=true&amp;view=true`</span>;
                                setSelectedPDF({ url: pdfUrl, name: doc.file_name, id: doc.id });
                                setShowPDFModal(<span class="hljs-literal">true</span>);
                              }} 
                              className=<span class="hljs-string">"text-blue-600 hover:text-blue-900 dark:text-blue-400 dark:hover:text-blue-300"</span>
                            &gt;
                              Preview
                            &lt;/button&gt;
                          ) : (
                            &lt;&gt;
                              &lt;button 
                                onClick={<span class="hljs-function">() =&gt;</span> {
                                  setSelectedPDF({ 
                                    url: doc.file_url || <span class="hljs-string">`/api/documents?id=<span class="hljs-subst">${doc.id}</span>&amp;file=true`</span>, 
                                    name: doc.file_name, 
                                    id: doc.id, 
                                    isPDF: <span class="hljs-literal">false</span> 
                                  });
                                  setShowPDFModal(<span class="hljs-literal">true</span>);
                                }} 
                                className=<span class="hljs-string">"text-blue-600 hover:text-blue-900 dark:text-blue-400 dark:hover:text-blue-300"</span>
                              &gt;
                                View
                              &lt;/button&gt;
                              {(doc.file_url || doc.file_path) &amp;&amp; (
                                &lt;a 
                                  href={doc.file_url || <span class="hljs-string">`/api/documents?id=<span class="hljs-subst">${doc.id}</span>&amp;file=true`</span>} 
                                  download={doc.file_name}
                                  className=<span class="hljs-string">"text-green-600 hover:text-green-900 dark:text-green-400 dark:hover:text-green-300"</span> 
                                  target=<span class="hljs-string">"_blank"</span> 
                                  rel=<span class="hljs-string">"noopener noreferrer"</span>
                                &gt;
                                  Download
                                &lt;/a&gt;
                              )}
                            &lt;/&gt;
                          )}
                          &lt;button 
                            onClick={<span class="hljs-function">() =&gt;</span> handleDelete(doc.id, doc.file_name)} 
                            disabled={deletingId === doc.id}
                            className=<span class="hljs-string">"text-red-600 hover:text-red-900 dark:text-red-400 dark:hover:text-red-300 disabled:opacity-50 disabled:cursor-not-allowed"</span>
                          &gt;
                            {deletingId === doc.id ? <span class="hljs-string">'Deleting...'</span> : <span class="hljs-string">'Delete'</span>}
                          &lt;/button&gt;
                        &lt;/div&gt;
                      &lt;/td&gt;
                    &lt;/tr&gt;
                  ))}
                &lt;/tbody&gt;
              &lt;/table&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        )}

        {selectedPDF &amp;&amp; (
          &lt;PDFViewerModal 
            isOpen={showPDFModal} 
            onClose={<span class="hljs-function">() =&gt;</span> { 
              setShowPDFModal(<span class="hljs-literal">false</span>); 
              setSelectedPDF(<span class="hljs-literal">null</span>); 
            }}
            fileUrl={selectedPDF.url} 
            fileName={selectedPDF.name} 
            documentId={selectedPDF.id} 
            isPDF={selectedPDF.isPDF !== <span class="hljs-literal">false</span>} 
          /&gt;
        )}
        &lt;UploadModal 
          isOpen={showUploadModal} 
          onClose={<span class="hljs-function">() =&gt;</span> setShowUploadModal(<span class="hljs-literal">false</span>)} 
          onUploadSuccess={fetchDocuments} 
        /&gt;
      &lt;/main&gt;
    &lt;/div&gt;
  );
}
</code></pre>
<p>This page provides a comprehensive document management interface with a table showing all documents, their metadata, and action buttons for preview, download, and deletion. The page automatically refreshes after uploads and handles loading and error states gracefully.</p>
<p>Now that all your components and pages are built, let's test the complete application.</p>
<h2 id="heading-step-13-test-your-application">Step 13: Test Your Application</h2>
<p>Start your development server:</p>
<pre><code class="lang-bash">npm run dev
</code></pre>
<p>Open <a target="_blank" href="http://localhost:3000/"><strong>http://localhost:3000</strong></a> in your browser.</p>
<h3 id="heading-test-the-upload-flow">Test the Upload Flow</h3>
<ol>
<li><p>Navigate to the Documents page</p>
</li>
<li><p>Click "Upload Document"</p>
</li>
<li><p>Select a PDF, DOCX, or TXT file</p>
</li>
<li><p>Wait for the upload and processing to complete (this may take a moment as embeddings are generated)</p>
</li>
<li><p>You should see your document in the list with its metadata:</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769376932518/cf1bcd3c-3ab2-4602-8df0-bca909c0edb0.png" alt="RAG search documents management page showing a table with uploaded documents." class="image--center mx-auto" width="2026" height="1296" loading="lazy"></p>
<h3 id="heading-test-the-search-flow">Test the Search Flow</h3>
<ol>
<li><p>Navigate to the Search page (or click "Search" in the navigation)</p>
</li>
<li><p>Make sure you've uploaded at least one document first</p>
</li>
<li><p>Type a question about your uploaded document (for example, "What is this document about?" or ask about specific content)</p>
</li>
<li><p>Click "Search" or press Cmd/Ctrl + Enter</p>
</li>
<li><p>You should see an AI-generated answer with source citations showing which document chunks were used</p>
</li>
</ol>
<p>Once the embeddings have been generated, go to the Search page and ask a question based on the documents you uploaded. The Sources section under the answer shows which document chunks the results were pulled from.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769377080953/c15678d6-59d0-4e97-8a1b-fe049e5fa6a9.png" alt="RAG Search application search interface showing a query input to search from the RAG database." class="image--center mx-auto" width="2390" height="1900" loading="lazy"></p>
<h3 id="heading-test-document-management">Test Document Management</h3>
<ol>
<li><p>On the Documents page, click "Preview" or "View" on a document</p>
</li>
<li><p>Try downloading a document</p>
</li>
<li><p>Test deleting a document (be careful - this is permanent)</p>
</li>
</ol>
<p>If everything works correctly, you're ready to deploy your application!</p>
<h2 id="heading-step-14-deploy-your-application">Step 14: Deploy Your Application</h2>
<h3 id="heading-deploy-to-vercel">Deploy to Vercel</h3>
<p>Vercel is built by the creators of Next.js and is one of the simplest ways to deploy a Next.js application.</p>
<p>To get started, create a GitHub repository and push your code to it.</p>
<p>Then go to <a target="_blank" href="https://vercel.com/"><strong>vercel.com</strong></a> and sign in with your GitHub account. Click "New Project" and import your GitHub repository.</p>
<p>Add your environment variables in the project settings:</p>
<ul>
<li><p><code>NEXT_PUBLIC_SUPABASE_URL</code></p>
</li>
<li><p><code>NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY</code></p>
</li>
<li><p><code>SUPABASE_SERVICE_ROLE_KEY</code></p>
</li>
<li><p><code>OPENAI_API_KEY</code></p>
</li>
</ul>
<p>Then click "Deploy", and your application will be live in minutes! Vercel automatically builds and deploys your Next.js application, and you'll get a URL like <code>your-app.vercel.app</code>.</p>
<h3 id="heading-important-deployment-notes">Important Deployment Notes</h3>
<ul>
<li><p>Make sure all environment variables are set in your Vercel project settings</p>
</li>
<li><p>The service role key is required for file uploads to work</p>
</li>
<li><p>The Supabase Storage bucket must be accessible (either public or protected with appropriate RLS policies)</p>
</li>
<li><p>Your OpenAI API key should have sufficient credits</p>
</li>
</ul>
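<p>Missing environment variables are easy to overlook after a deploy. The following is a minimal sketch of a startup check, not part of the application code: <code>findMissingEnvVars</code> is a hypothetical helper, and the variable names are the four listed above.</p>

```typescript
// Variable names taken from the environment variable list above.
const REQUIRED_ENV_VARS: string[] = [
  'NEXT_PUBLIC_SUPABASE_URL',
  'NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY',
  'SUPABASE_SERVICE_ROLE_KEY',
  'OPENAI_API_KEY',
];

// Hypothetical helper: returns the names of required variables
// that are unset or contain only whitespace.
function findMissingEnvVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV_VARS.filter((name) => (env[name] ?? '').trim() === '');
}
```

<p>On the server you could call <code>findMissingEnvVars(process.env)</code> and log or throw if the returned list is not empty, so a misconfigured deployment fails with a clear message instead of a storage or API error later on.</p>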
<h2 id="heading-how-rag-search-works">How RAG Search Works</h2>
<p>Your application uses the RAG (Retrieval-Augmented Generation) pattern. This combines information retrieval with AI text generation. Here's how it works step by step:</p>
<ol>
<li><p><strong>Document processing</strong>: When you upload a document, it's split into chunks, typically 800 characters each with a 100-character overlap. Each chunk gets an embedding: a 1536-dimensional vector that represents its semantic meaning.</p>
</li>
<li><p><strong>Storage</strong>: Embeddings are stored alongside the original text chunks in a vector database, which here is PostgreSQL with the pgvector extension. The original files are stored in Supabase Storage.</p>
</li>
<li><p><strong>Query processing</strong>: When you search, your query is converted into an embedding using the same model that processed the documents. This ensures the query and documents are in the same "vector space."</p>
</li>
<li><p><strong>Similarity search</strong>: The system finds the most similar document chunks using cosine similarity on the embeddings. Cosine similarity measures the angle between vectors: smaller angles mean more similar content, even if the exact words differ.</p>
</li>
<li><p><strong>Answer generation</strong>: The retrieved chunks are passed as context to an AI model (GPT-4o-mini), which generates the answer. The system prompt instructs the model to answer only from the provided context, which keeps answers grounded in your documents.</p>
</li>
</ol>
<p>This approach gives you several benefits.</p>
<p>First, you get accuracy. Answers are based on your actual documents, not just the AI's training data. Second, you get transparency. You can see which document chunks were used to generate each answer. Third, you get efficiency. Only relevant chunks are used, which reduces token usage and costs. Finally, you get up-to-date information. You can update your knowledge base by uploading new documents without retraining the AI.</p>
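<p>To make the chunking and similarity steps more concrete, here is a small in-memory sketch. It is illustrative only: in the application the similarity search runs inside the database, and the function names <code>chunkText</code>, <code>cosineSimilarity</code>, and <code>topMatches</code> are assumptions made for this example. The defaults of 800 and 100 characters come from the description above.</p>

```typescript
// Split text into fixed-size character chunks, where each chunk
// repeats the last `overlap` characters of the previous one.
function chunkText(text: string, chunkSize = 800, overlap = 100): string[] {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Cosine similarity: dot product divided by the product of the vector lengths.
// Returns a value between -1 and 1; higher means more similar.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

interface EmbeddedChunk {
  content: string;
  embedding: number[];
}

// Rank stored chunks against a query embedding and keep the best k.
function topMatches(queryEmbedding: number[], chunks: EmbeddedChunk[], k = 3) {
  return chunks
    .map((chunk) => ({
      content: chunk.content,
      similarity: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}
```

<p>With these defaults, a 2,000-character document yields three chunks, and the last 100 characters of one chunk are the first 100 characters of the next. The overlap helps keep a sentence that straddles a chunk boundary retrievable from at least one chunk.</p>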
<h2 id="heading-troubleshooting-common-issues">Troubleshooting Common Issues</h2>
<h3 id="heading-storage-rls-error-when-uploading">"Storage RLS error" when uploading</h3>
<p>This means your <code>SUPABASE_SERVICE_ROLE_KEY</code> is not set or incorrect. Make sure the key is in your <code>.env.local</code> file for local development. Also make sure you're using the service role key, not the anon key. Finally, make sure the key is correctly set in your deployment environment, such as Vercel.</p>
<h3 id="heading-failed-to-extract-text-from-file">"Failed to extract text from file"</h3>
<p>Make sure your file is a valid PDF, DOCX, or TXT file. Check that the file isn't corrupted. For PDFs, ensure they contain extractable text. Scanned PDFs with only images won't work without <a target="_blank" href="https://en.wikipedia.org/wiki/Optical_character_recognition">OCR</a>.</p>
<h3 id="heading-no-answer-generated">"No answer generated"</h3>
<p>Make sure you've uploaded at least one document. Try a different query that's more likely to match your documents. Check that embeddings were successfully created. You can verify this in your Supabase database.</p>
<h3 id="heading-vector-similarity-search-not-working">Vector similarity search not working</h3>
<p>Ensure the <code>vector</code> extension is enabled in Supabase. You can do this by running <code>CREATE EXTENSION IF NOT EXISTS vector;</code>. Verify the <code>match_documents</code> function exists in your database. You can check this in the SQL Editor. Check that embeddings are being stored correctly. They should be JSON strings in the embedding column.</p>
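<p>If you want to sanity-check a stored value, a small validator can help. This is a sketch under the assumptions stated in this article (embeddings stored as JSON strings, 1536 dimensions); <code>parseEmbedding</code> is a hypothetical helper, not part of the application code.</p>

```typescript
// Dimension count used by the embeddings described in this article.
const EXPECTED_DIMENSIONS = 1536;

// Hypothetical helper: accepts either a JSON string or an already-parsed
// value and returns a validated number array, or throws with a clear reason.
function parseEmbedding(raw: unknown, expected = EXPECTED_DIMENSIONS): number[] {
  const value = typeof raw === 'string' ? JSON.parse(raw) : raw;
  if (!Array.isArray(value)) {
    throw new Error('embedding is not an array');
  }
  if (value.length !== expected) {
    throw new Error(`expected ${expected} dimensions, got ${value.length}`);
  }
  if (!value.every((n) => typeof n === 'number' && isFinite(n))) {
    throw new Error('embedding contains non-numeric values');
  }
  return value as number[];
}
```

<p>Running this against a row copied from the embedding column tells you quickly whether the value is malformed JSON, has the wrong length, or contains non-numeric entries.</p>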
<h3 id="heading-slow-search-or-upload-times">Slow search or upload times</h3>
<p>Large documents take longer to process because more chunks mean more embedding API calls. Consider using a larger chunk size, which produces fewer chunks, or processing documents in batches. Also check your OpenAI API rate limits.</p>
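<p>To see how chunk size and batching affect the amount of work, here is a rough sketch. Both helpers are hypothetical and assume fixed-size character chunking with overlap, as described earlier. Whether batching reduces the number of requests depends on your embedding call accepting several inputs at once.</p>

```typescript
// Estimate how many chunks a text of the given length produces
// with fixed-size chunking and a fixed overlap.
function estimateChunkCount(textLength: number, chunkSize: number, overlap: number): number {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  if (textLength <= 0) return 0;
  return Math.max(1, Math.ceil((textLength - overlap) / (chunkSize - overlap)));
}

// Group items into arrays of at most batchSize elements.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1 || Math.floor(batchSize) !== batchSize) {
    throw new Error('batchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

<p>For a 10,000-character document with a 100-character overlap, this estimate gives 15 chunks at a chunk size of 800 and 7 chunks at 1,600. Larger chunks cost fewer calls but make each retrieved passage less focused, so it is worth testing search quality after changing the size.</p>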
<h2 id="heading-next-steps">Next Steps</h2>
<p>Now that you have a working RAG search application, you can extend it with additional features. Here are some examples of useful features you could add:</p>
<ul>
<li><p>You can add more file types by extending the text extraction to support Markdown, HTML, or other formats.</p>
</li>
<li><p>You can improve chunking by experimenting with different chunk sizes, overlap strategies, or semantic chunking.</p>
</li>
<li><p>You can protect your documents by adding user authentication with Supabase Auth.</p>
</li>
<li><p>You can enhance the UI by adding features like search history, document tags, or advanced filters.</p>
</li>
<li><p>You can optimize performance by adding caching, pagination, or streaming responses.</p>
</li>
<li><p>You can add filters to allow users to search within specific documents or date ranges.</p>
</li>
<li><p>Finally, you can improve search by adding reranking or hybrid search, which combines keyword and semantic search.</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You've built a complete RAG search application from scratch. This application demonstrates modern web development with Next.js and TypeScript. It shows vector database operations with Supabase and pgvector. It demonstrates AI integration with OpenAI embeddings and chat completions. It includes file handling and storage with Supabase Storage. Finally, it features a production-ready user interface with Tailwind CSS.</p>
<p>The RAG pattern you've implemented is used by many production applications. These include <a target="_blank" href="https://www.freecodecamp.org/news/how-to-build-an-embeddable-ai-chatbot-widget-with-cloudflare-workers/">chatbots</a>, knowledge bases, document search systems, and AI assistants. You now have the foundation to build more advanced features on top of this.</p>
<p>The skills you've learned are highly valuable in today's AI-driven development landscape. You've learned to work with embeddings, vector databases, and the RAG pattern. You can apply these concepts to build intelligent search systems, document Q&amp;A applications, or AI-powered knowledge bases.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Chat with Your PDF Using Retrieval Augmented Generation ]]>
                </title>
                <description>
                    <![CDATA[ Large language models are good at answering questions, but they have one big limitation: they don’t know what is inside your private documents.  If you upload a PDF like a company policy, research paper, or contract, the model cannot magically read i... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-chat-with-your-pdf-using-retrieval-augmented-generation/</link>
                <guid isPermaLink="false">697822ad5a8ba7b3a0cfc888</guid>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Tue, 27 Jan 2026 02:27:57 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769480850138/b28bc1fd-d035-4825-a6ea-11ccd084db89.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Large language models are good at answering questions, but they have one big limitation: they don’t know what is inside your private documents. </p>
<p>If you upload a PDF like a company policy, research paper, or contract, the model cannot magically read it unless you give it that content.</p>
<p>This is where <a target="_blank" href="https://www.freecodecamp.org/news/mastering-rag-from-scratch/">Retrieval Augmented Generation</a>, or RAG, becomes useful. </p>
<p>RAG lets you combine a language model with your own data. Instead of asking the model to guess, you first retrieve the right parts of the document and then ask the model to answer using that information.</p>
<p>In this article, you will learn how to chat with your own PDF using RAG. You will build the backend using LangChain and create a simple React user interface to ask questions and see answers.</p>
<p>You should be comfortable with basic Python and JavaScript, and have a working knowledge of React and REST APIs. Familiarity with language models and a basic <a target="_blank" href="https://www.freecodecamp.org/news/how-ai-agents-remember-things-vector-stores-in-llm-memory/">understanding of embeddings</a> or vector search will be helpful but not mandatory.</p>
<h2 id="heading-what-well-cover">What We’ll Cover</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-problem-are-we-solving">What Problem Are We Solving</a>?</p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-retrieval-augmented-generation">What Is Retrieval Augmented Generation</a>?</p>
</li>
<li><p><a class="post-section-overview" href="#heading-setting-up-the-backend-with-langchain">Setting Up the Backend with LangChain</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-installing-dependencies">Installing Dependencies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-loading-and-splitting-the-pdf">Loading and Splitting the PDF</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-embeddings-and-vector-store">Creating Embeddings and Vector Store</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-the-retrieval-chain">Creating the Retrieval Chain</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-exposing-an-api-with-fastapi">Exposing an API with FastAPI</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-building-a-simple-react-chat-ui">Building a Simple React Chat UI</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-how-the-full-flow-works">How the Full Flow Works</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-this-approach-works-well">Why This Approach Works Well</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-common-improvements-you-can-add">Common Improvements You Can Add</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ol>
<h2 id="heading-what-problem-are-we-solving">What Problem Are We Solving?</h2>
<p>Imagine you have a long PDF with hundreds of pages. Searching it manually is slow, and pasting the whole document into ChatGPT is impractical.</p>
<p>You want to ask simple questions like “What is the leave policy?” or “What does this contract say about termination?”</p>
<p>A normal language model cannot answer these questions correctly because it has never seen your PDF. RAG solves this by adding a retrieval step before generation. </p>
<p>The system first finds relevant parts of the PDF and then uses those parts as context for the answer.</p>
<h2 id="heading-what-is-retrieval-augmented-generation">What Is Retrieval Augmented Generation?</h2>
<p><a target="_blank" href="https://www.turingtalks.ai/p/fine-tuning-or-rag-choosing-the-right-approach-to-train-llms-on-your-data">Retrieval Augmented Generation</a> is a pattern with three main steps.</p>
<p>First, your document is split into small chunks. Each chunk is converted into a vector embedding. These embeddings are stored in a vector database.</p>
<p>Second, when a user asks a question, that question is also converted into an embedding. The system searches the vector database to find the most similar chunks.</p>
<p>Third, those chunks are sent to the language model along with the question. The model is instructed to answer from that context rather than from its general training data.</p>
<p>This approach keeps answers grounded in your document and reduces hallucinations.</p>
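<p>To make the three steps concrete, here is a minimal, dependency-free sketch. It is only an illustration: the <code>embed</code> function is a toy word-count stand-in for a real embedding model, and the sample chunks, <code>retrieve</code>, and <code>build_prompt</code> are hypothetical names that are not part of the backend built later in this article.</p>
<pre><code class="lang-python">import math
from collections import Counter

def embed(text):
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Step 1: split the document into chunks and embed each one.
chunks = [
    "Employees receive 20 days of paid leave per year.",
    "Either party may terminate the contract with 30 days notice.",
    "The office is closed on public holidays.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question, k=1):
    # Step 2: embed the question and rank the chunks by similarity.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, context_chunks):
    # Step 3: hand the retrieved chunks to the model along with the question.
    context = "\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

question = "How many days of paid leave do employees get?"
print(build_prompt(question, retrieve(question)))
</code></pre>
<p>A real system swaps the word counts for model-generated embeddings and the sorted list for a vector index, but the retrieve-then-prompt shape stays the same.</p>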
<p>The system has four main parts:</p>
<ul>
<li><p>A PDF loader reads the document. </p>
</li>
<li><p>A text splitter breaks it into chunks. </p>
</li>
<li><p>An embedding model converts each chunk into a vector, and a vector store indexes those vectors for similarity search. </p>
</li>
<li><p>A language model answers questions using retrieved chunks.</p>
</li>
</ul>
<p>The frontend is a simple chat interface built in React. It sends the user’s question to a backend API and displays the response. </p>
<p>This type of custom <a target="_blank" href="https://www.leanware.co/insights/rag-development-services">RAG development</a> helps companies build internal tools that work with their own private data without fine-tuning a model on it. </p>
<h2 id="heading-setting-up-the-backend-with-langchain">Setting Up the Backend with LangChain</h2>
<p>We’ll use Python and LangChain for the backend. The backend will load the PDF, build the vector store, and expose an API to answer questions.</p>
<h3 id="heading-installing-dependencies">Installing Dependencies</h3>
<p>Start by installing the required libraries.</p>
<pre><code class="lang-bash">pip install langchain langchain-community langchain-openai faiss-cpu pypdf fastapi uvicorn
</code></pre>
<p>This setup uses FAISS as a local vector store and OpenAI for embeddings and chat, so the OpenAI client expects an API key in the <code>OPENAI_API_KEY</code> environment variable. You can swap these later for other models.</p>
<h3 id="heading-loading-and-splitting-the-pdf">Loading and Splitting the PDF</h3>
<p>The first step is to load the PDF and split it into chunks that are small enough for embeddings.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain_community.document_loaders <span class="hljs-keyword">import</span> PyPDFLoader
<span class="hljs-keyword">from</span> langchain.text_splitter <span class="hljs-keyword">import</span> RecursiveCharacterTextSplitter

loader = PyPDFLoader(<span class="hljs-string">"document.pdf"</span>)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=<span class="hljs-number">1000</span>,
    chunk_overlap=<span class="hljs-number">200</span>
)
chunks = text_splitter.split_documents(documents)
</code></pre>
<p>Chunking is important. If chunks are too large, embeddings become less accurate. If they are too small, context is lost.</p>
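<p>As a rough illustration of what <code>chunk_size</code> and <code>chunk_overlap</code> control, the sketch below slices text into fixed-size character windows that share a few characters with their neighbours. It is a simplification: <code>RecursiveCharacterTextSplitter</code> also tries to break on separators such as paragraphs and sentences, and <code>split_with_overlap</code> is a hypothetical helper, not a LangChain function.</p>
<pre><code class="lang-python">def split_with_overlap(text, chunk_size, chunk_overlap):
    # Each window starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    if step not in range(1, chunk_size + 1):
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    # Stop once a window has reached the end of the text.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]

for chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2):
    print(chunk)
</code></pre>
<p>In this simplified scheme, any passage shorter than the overlap appears whole in at least one chunk, which is why a non-zero overlap helps when a sentence straddles a chunk boundary.</p>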
<h3 id="heading-creating-embeddings-and-vector-store">Creating Embeddings and Vector Store</h3>
<p>Next, convert the chunks into embeddings and store them in FAISS.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain_openai <span class="hljs-keyword">import</span> OpenAIEmbeddings
<span class="hljs-keyword">from</span> langchain_community.vectorstores <span class="hljs-keyword">import</span> FAISS

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)
</code></pre>
<p>This step is usually done once. In a real app, you would persist the vector store to disk.</p>
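<p>One way to do that, sketched under a few assumptions: LangChain's FAISS wrapper provides <code>save_local</code> and <code>load_local</code>, so the index can be built on the first run and reloaded afterwards. The <code>load_or_build_vectorstore</code> helper and the <code>faiss_index</code> folder name are hypothetical choices for illustration.</p>
<pre><code class="lang-python">from pathlib import Path

from langchain_community.vectorstores import FAISS

def load_or_build_vectorstore(chunks, embeddings, index_dir="faiss_index"):
    # Reuse a previously saved index if one exists on disk.
    if (Path(index_dir) / "index.faiss").exists():
        # The saved docstore is a pickle file, so only enable this flag
        # for index folders that your own code created.
        return FAISS.load_local(
            index_dir, embeddings, allow_dangerous_deserialization=True
        )
    # Otherwise embed the chunks once and save the result for next time.
    vectorstore = FAISS.from_documents(chunks, embeddings)
    vectorstore.save_local(index_dir)
    return vectorstore
</code></pre>
<p>You could then call <code>load_or_build_vectorstore(chunks, embeddings)</code> in place of the <code>FAISS.from_documents</code> line, and delete the folder whenever the PDF changes so the index is rebuilt.</p>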
<h3 id="heading-creating-the-retrieval-chain">Creating the Retrieval Chain</h3>
<p>Now create a retrieval-based question answering chain.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain_openai <span class="hljs-keyword">import</span> ChatOpenAI
<span class="hljs-keyword">from</span> langchain.chains <span class="hljs-keyword">import</span> RetrievalQA

llm = ChatOpenAI(
    temperature=<span class="hljs-number">0</span>,
    model=<span class="hljs-string">"gpt-4o-mini"</span>
)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={<span class="hljs-string">"k"</span>: <span class="hljs-number">4</span>}),
    return_source_documents=<span class="hljs-literal">False</span>
)
</code></pre>
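<p>The retriever returns the four chunks most similar to the question (<code>k=4</code>). The chain inserts those chunks into the prompt, so the language model answers from them rather than from memory.</p>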
<h3 id="heading-exposing-an-api-with-fastapi">Exposing an API with FastAPI</h3>
<p>Now wrap this logic in an API so the React app can use it.</p>
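<pre><code class="lang-python"><span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> FastAPI
<span class="hljs-keyword">from</span> fastapi.middleware.cors <span class="hljs-keyword">import</span> CORSMiddleware
<span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseModel

app = FastAPI()
<span class="hljs-comment"># Allow the React dev server (Vite uses port 5173, Create React App uses 3000) to call this API</span>
app.add_middleware(
    CORSMiddleware,
    allow_origins=[<span class="hljs-string">"http://localhost:5173"</span>, <span class="hljs-string">"http://localhost:3000"</span>],
    allow_methods=[<span class="hljs-string">"*"</span>],
    allow_headers=[<span class="hljs-string">"*"</span>],
)
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">QuestionRequest</span>(<span class="hljs-params">BaseModel</span>):</span>
    question: str
<span class="hljs-meta">@app.post("/ask")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ask_question</span>(<span class="hljs-params">req: QuestionRequest</span>):</span>
    <span class="hljs-comment"># RetrievalQA takes the question under "query" and returns the answer under "result"</span>
    result = qa_chain.invoke({<span class="hljs-string">"query"</span>: req.question})
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"answer"</span>: result[<span class="hljs-string">"result"</span>]}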
</code></pre>
<p>Save the backend code in a file named <code>main.py</code>, then run the server using this command:</p>
<pre><code class="lang-bash">uvicorn main:app --reload
</code></pre>
<p>Your backend is now ready.</p>
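<p>Before building the UI, you may want to confirm that the endpoint responds. The sketch below uses only the Python standard library and assumes the server is running locally on uvicorn's default port 8000. The names <code>build_ask_request</code> and <code>ask</code> are hypothetical helpers.</p>
<pre><code class="lang-python">import json
import urllib.request

API_URL = "http://localhost:8000/ask"  # assumed local address of the FastAPI server

def build_ask_request(question, url=API_URL):
    # The endpoint expects a JSON body shaped like {"question": "..."}.
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(question, url=API_URL):
    # Send the request and return the "answer" field of the JSON response.
    with urllib.request.urlopen(build_ask_request(question, url)) as response:
        return json.loads(response.read().decode("utf-8"))["answer"]
</code></pre>
<p>With the server running, <code>print(ask("What is the leave policy?"))</code> should print the model's answer.</p>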
<h3 id="heading-building-a-simple-react-chat-ui">Building a Simple React Chat UI</h3>
<p>Next, build a simple React interface that sends questions to the backend and shows answers. </p>
<p>You can use any React setup. A simple Vite or Create React App project works fine.</p>
<p>Inside your main component, manage the question input and answer state.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> { useState } <span class="hljs-keyword">from</span> <span class="hljs-string">"react"</span>;

function App() {
  const [question, setQuestion] = useState(<span class="hljs-string">""</span>);
  const [answer, setAnswer] = useState(<span class="hljs-string">""</span>);
  const [loading, setLoading] = useState(false);
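  const askQuestion = <span class="hljs-keyword">async</span> () =&gt; {
    setLoading(true);
    <span class="hljs-keyword">try</span> {
      const res = <span class="hljs-keyword">await</span> fetch(<span class="hljs-string">"http://localhost:8000/ask"</span>, {
        method: <span class="hljs-string">"POST"</span>,
        headers: { <span class="hljs-string">"Content-Type"</span>: <span class="hljs-string">"application/json"</span> },
        body: JSON.stringify({ question })
      });
      <span class="hljs-keyword">if</span> (!res.ok) {
        <span class="hljs-keyword">throw</span> new Error(<span class="hljs-string">"Request failed with status "</span> + res.status);
      }
      const data = <span class="hljs-keyword">await</span> res.json();
      setAnswer(data.answer);
    } <span class="hljs-keyword">catch</span> (err) {
      setAnswer(<span class="hljs-string">"Something went wrong. Please try again."</span>);
    } <span class="hljs-keyword">finally</span> {
      <span class="hljs-comment">// Always re-enable the button, even if the request failed</span>
      setLoading(false);
    }
  };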
  <span class="hljs-keyword">return</span> (
    &lt;div style={{ padding: <span class="hljs-string">"2rem"</span>, maxWidth: <span class="hljs-string">"600px"</span>, margin: <span class="hljs-string">"auto"</span> }}&gt;
      &lt;h2&gt;Chat with your PDF&lt;/h2&gt;
      &lt;textarea
        value={question}
        onChange={(e) =&gt; setQuestion(e.target.value)}
        rows={<span class="hljs-number">4</span>}
        style={{ width: <span class="hljs-string">"100%"</span> }}
        placeholder=<span class="hljs-string">"Ask a question about the PDF"</span>
      /&gt;
      &lt;button onClick={askQuestion} disabled={loading}&gt;
        {loading ? <span class="hljs-string">"Thinking..."</span> : <span class="hljs-string">"Ask"</span>}
      &lt;/button&gt;
      &lt;div style={{ marginTop: <span class="hljs-string">"1rem"</span> }}&gt;
        &lt;strong&gt;Answer&lt;/strong&gt;
        &lt;p&gt;{answer}&lt;/p&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  );
}
export default App;
</code></pre>
<p>This UI is simple but effective. It lets users type a question, sends it to the backend, and shows the answer. Keep React up to date so that you receive security patches, such as the fix for the <a target="_blank" href="https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-components">critical vulnerability in React Server Components</a>.</p>
<h2 id="heading-how-the-full-flow-works">How the Full Flow Works</h2>
<p>When the backend starts, it loads the PDF, splits it into chunks, and builds the vector store. When a user types a question, the React app sends it to the API.</p>
<p>The backend converts the question into an embedding. It searches the vector store for similar chunks. Those chunks are passed to the language model as context, and the model generates an answer from that context.</p>
<p>The answer is sent back to the frontend and displayed to the user.</p>
<h2 id="heading-why-this-approach-works-well">Why This Approach Works Well</h2>
<p>RAG works well because it keeps answers grounded in real data. The model is not guessing – it’s reading from your document.</p>
<p>This approach also scales well. You can add more PDFs, reindex them, and reuse the same chat interface. You can also swap FAISS for a hosted vector database if needed.</p>
<p>Another benefit is control. You decide what data the model can see. This is important for private or sensitive documents.</p>
<h2 id="heading-common-improvements-you-can-add">Common Improvements You Can Add</h2>
<p>You can improve this setup in many ways. You can persist the vector store so it doesn’t rebuild on every restart. You can also add document citations to the answer. And you can stream responses for a better chat experience.</p>
<p>You can also add authentication, upload new PDFs from the UI, or support multiple documents per user.</p>
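<p>As one example, citations can be built from the metadata on the retrieved chunks. The sketch below assumes the chain was created with <code>return_source_documents=True</code>, so the result includes a <code>source_documents</code> list, and that each chunk carries the <code>source</code> and zero-based <code>page</code> metadata that <code>PyPDFLoader</code> attaches. The <code>Chunk</code> class is a stand-in for LangChain's document type so the example runs on its own, and <code>format_citations</code> is a hypothetical helper.</p>
<pre><code class="lang-python">from dataclasses import dataclass, field

@dataclass
class Chunk:
    # Minimal stand-in for a retrieved LangChain document.
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_citations(source_documents):
    # Collect the distinct pages that each source file contributed.
    pages_by_source = {}
    for doc in source_documents:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page")
        pages = pages_by_source.setdefault(source, set())
        if page is not None:
            pages.add(page + 1)  # convert zero-based pages to human-friendly numbers
    parts = []
    for source, pages in pages_by_source.items():
        page_list = ", ".join(str(p) for p in sorted(pages))
        parts.append(f"{source} (pages {page_list})" if page_list else source)
    return "; ".join(parts)

retrieved = [
    Chunk("Employees receive 20 days of leave.", {"source": "document.pdf", "page": 11}),
    Chunk("Unused leave carries over.", {"source": "document.pdf", "page": 12}),
    Chunk("Leave requests need approval.", {"source": "document.pdf", "page": 11}),
]
print(format_citations(retrieved))
</code></pre>
<p>The API could return this string next to the answer so the UI can show where each response came from.</p>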
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>Chatting with PDFs using Retrieval Augmented Generation is one of the most practical uses of language models today. It turns static documents into interactive knowledge sources.</p>
<p>With LangChain handling retrieval and a simple React UI for interaction, you can build a useful system with very little code. The same pattern can be used for HR policies, legal documents, technical manuals, or research papers.</p>
<p>Once you understand this flow, you can adapt it to many real-world problems where answers must come from trusted documents rather than from the model’s memory alone.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn RAG & MCP Fundamentals ]]>
                </title>
                <description>
                    <![CDATA[ Building AI today is about more than just a clever prompt. If you really want to move from playing with standalone tools to creating integrated systems that actually work with your data, our new crash course on the freeCodeCamp.org YouTube channel is... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/learn-rag-and-mcp-fundamentals/</link>
                <guid isPermaLink="false">6972357968889fc0fe8adf6b</guid>
                
                    <category>
                        <![CDATA[ mcp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 22 Jan 2026 14:34:33 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769092417621/b2dddb48-37e0-4303-b111-57f643b39bee.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Building AI today is about more than just a clever prompt. If you really want to move from playing with standalone tools to creating integrated systems that actually work with your data, our new crash course on the <a target="_blank" href="http://freeCodeCamp.org">freeCodeCamp.org</a> YouTube channel is exactly where you need to start.</p>
<h3 id="heading-mastering-rag-retrieval-augmented-generation">Mastering RAG (Retrieval Augmented Generation)</h3>
<p>Everyone is talking about RAG, but many people struggle to understand how it works under the hood. This course starts by breaking down how to connect a model to your own private information. You will learn how to turn documents into embeddings (mathematical representations of meaning) and store them in vector databases like Chroma.</p>
<p>The course also covers the "precision problem." You will learn why just uploading a massive PDF doesn't work and how to use chunking strategies to ensure the AI finds exactly the right paragraph to answer a user's question.</p>
<h3 id="heading-coordination-with-mcp">Coordination with MCP</h3>
<p>While RAG gives an AI knowledge, the Model Context Protocol (MCP) gives it the ability to coordinate actions. MCP allows AI agents to interact with third-party software, databases, and local files. Instead of writing custom code for every single API, MCP provides a standardized way for agents to discover what a server can do and then execute tasks.</p>
<p>You will learn how to build your own MCP server and client using the Python SDK, giving your AI the "hands" it needs to perform real-world tasks.</p>
<p>Watch the full course on <a target="_blank" href="https://youtu.be/I7_WXKhyGms">the freeCodeCamp.org YouTube channel</a> (2-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/I7_WXKhyGms" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Solve 5 Common RAG Failures with Knowledge Graphs ]]>
                </title>
                <description>
                    <![CDATA[ You may have built a Retrieval-Augmented Generation (RAG) pipeline to connect a vector store to a powerful LLM. And RAG pipelines are incredibly effective at grounding models in factual, up-to-date knowledge. But if you've worked with them long enoug... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-solve-5-common-rag-failures-with-knowledge-graphs/</link>
                <guid isPermaLink="false">6915f73887b014aa0a104567</guid>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ knowledge graph ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kamal Kishore ]]>
                </dc:creator>
                <pubDate>Thu, 13 Nov 2025 15:20:24 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762904270014/5ebeec2b-0823-4f59-bdd7-bf37cb68a978.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You may have built a Retrieval-Augmented Generation (RAG) pipeline to connect a vector store to a powerful LLM. And RAG pipelines are incredibly effective at grounding models in factual, up-to-date knowledge. But if you've worked with them long enough, you've likely hit a wall.</p>
<p>The system is great at answering "What is X?" but falls apart when you ask, "How does X relate to Y, and what happened after Z?"</p>
<p>The problem is that standard RAG, by its very nature, breaks context. It chops documents into isolated chunks, finds them based on semantic similarity, and hopes the LLM can piece the puzzle back together. This approach is blind to the relational context—the web of timelines, causes, and connections—that gives facts their meaning.</p>
<p>When queries require synthesizing information across multiple documents or complex, multi-step reasoning, standard RAG fails.</p>
<p>In this article, I’ll give you a practical, code-first guide to solving this problem. We'll move beyond simple vector search by implementing a robust, graph-based pattern to build more reliable, knowledge-aware systems.</p>
<h2 id="heading-table-of-contents">Table of Contents:</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-brittle-baseline-our-standard-rag-setup">The Brittle Baseline: Our Standard RAG Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-a-more-robust-implementation-the-knowledgegraph">A More Robust Implementation: The KnowledgeGraph</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-a-knowledge-graph">What is a Knowledge Graph?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-is-this-more-effective">Why is this More Effective?</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-5-rag-failures-and-their-graph-based-solutions">5 RAG Failures and Their Graph-Based Solutions</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-pattern-1-the-multi-hop-failure">Pattern 1: The Multi-Hop Failure</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-pattern-2-the-causal-synthesis-failure">Pattern 2: The Causal Synthesis Failure</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-pattern-3-the-entity-ambiguity-trap">Pattern 3: The Entity Ambiguity Trap</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-pattern-4-the-contradictory-information-failure">Pattern 4: The Contradictory Information Failure</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-pattern-5-the-implicit-relationship-hallucination">Pattern 5: The Implicit Relationship Hallucination</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>This is a practical, code-first guide intended for developers and engineers who have some experience with RAG. To follow along, you should have the following:</p>
<h4 id="heading-conceptual-knowledge">Conceptual Knowledge</h4>
<ul>
<li><p>A solid understanding of what Retrieval-Augmented Generation (RAG) is and its basic components (like vector stores and LLMs).</p>
</li>
<li><p>Familiarity with basic graph concepts (nodes, edges, and relationships) is also helpful.</p>
</li>
</ul>
<h4 id="heading-technical-setup">Technical Setup</h4>
<ul>
<li><p>A Python environment.</p>
</li>
<li><p>An active Google API Key to use the Gemini API.</p>
</li>
<li><p>The Python libraries <code>langchain</code>, <code>langchain-community</code>, <code>langchain_google_genai</code>, <code>faiss-cpu</code>, and <code>networkx</code> installed.</p>
</li>
</ul>
<h2 id="heading-the-brittle-baseline-our-standard-rag-setup">The Brittle Baseline: Our Standard RAG Setup</h2>
<p>First, let's establish our baseline. This is a standard, "naïve" RAG pipeline using LangChain and the Gemini API. It ingests a list of <code>Document</code> objects, embeds them, and uses a FAISS vector store to retrieve the top-k chunks to answer a question.</p>
<p>This <code>create_rag_chain</code> function will serve as our point of comparison.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Install necessary libraries</span>
<span class="hljs-comment"># !pip install -q -U langchain langchain-community langchain_google_genai faiss-cpu networkx</span>

<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> networkx <span class="hljs-keyword">as</span> nx
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> defaultdict
<span class="hljs-keyword">from</span> langchain_google_genai <span class="hljs-keyword">import</span> GoogleGenerativeAI, GoogleGenerativeAIEmbeddings
<span class="hljs-comment"># Current package locations; the older langchain.* import paths are deprecated</span>
<span class="hljs-keyword">from</span> langchain_community.vectorstores <span class="hljs-keyword">import</span> FAISS
<span class="hljs-keyword">from</span> langchain_core.documents <span class="hljs-keyword">import</span> Document
<span class="hljs-keyword">from</span> langchain_core.prompts <span class="hljs-keyword">import</span> PromptTemplate
<span class="hljs-keyword">from</span> langchain_core.runnables <span class="hljs-keyword">import</span> RunnablePassthrough
<span class="hljs-keyword">from</span> langchain_core.output_parsers <span class="hljs-keyword">import</span> StrOutputParser

<span class="hljs-comment"># --- Configure API Key (example) ---</span>
<span class="hljs-comment"># from google.colab import userdata</span>
<span class="hljs-comment"># GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY') </span>
<span class="hljs-comment"># os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY </span>

<span class="hljs-comment"># --- Initialize Models ---</span>
<span class="hljs-comment"># Make sure your API key is set in your environment</span>
llm = GoogleGenerativeAI(model=<span class="hljs-string">"gemini-1.5-pro-latest"</span>)
embeddings = GoogleGenerativeAIEmbeddings(model=<span class="hljs-string">"models/embedding-001"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_rag_chain</span>(<span class="hljs-params">docs</span>):</span>
    <span class="hljs-string">"""Creates a simple RAG chain using FAISS as the vector store."""</span> 

    <span class="hljs-comment"># Create vector store from documents</span>
    vectorstore = FAISS.from_documents(docs, embeddings)
    <span class="hljs-comment"># K=3 means it will retrieve the top 3 most relevant chunks</span>
    retriever = vectorstore.as_retriever(search_kwargs={<span class="hljs-string">"k"</span>: <span class="hljs-number">3</span>})

    template = <span class="hljs-string">"""
    Answer the following question based ONLY on the context provided.
    If the context doesn't contain the answer, say "I don't have enough information from the context."

    CONTEXT:
    {context}

    QUESTION:
    {question}
    """</span>

    prompt = PromptTemplate.from_template(template)

    <span class="hljs-comment"># Build the chain</span>
    rag_chain = (
        {<span class="hljs-string">"context"</span>: retriever, <span class="hljs-string">"question"</span>: RunnablePassthrough()} 
        | prompt
        | llm 
        | StrOutputParser() 
    )

    <span class="hljs-keyword">return</span> rag_chain
</code></pre>
<h2 id="heading-a-more-robust-implementation-the-knowledgegraph">A More Robust Implementation: The KnowledgeGraph</h2>
<h3 id="heading-what-is-a-knowledge-graph">What is a Knowledge Graph?</h3>
<p>At its core, a knowledge graph (KG) is a way of storing data as a network of nodes and edges.</p>
<ul>
<li><p><strong>Nodes</strong> represent entities: <code>people</code>, <code>companies</code>, <code>concepts</code>, or <code>events</code>.</p>
</li>
<li><p><strong>Edges</strong> represent the explicit, labeled relationships between them: <code>ceo_of</code>, <code>attended</code>, or <code>partners_with</code>.</p>
</li>
</ul>
<p>Instead of storing a document like "Jim Farley is the CEO of Ford," you store two nodes (<code>Jim Farley</code>, <code>Ford</code>) connected by a directed edge (<code>ceo_of</code>).</p>
<h3 id="heading-why-is-this-more-effective">Why is this More Effective?</h3>
<p>This structure is more effective because it preserves relationships and makes them first-class citizens.</p>
<p>Standard RAG relies on "semantic similarity". It's good at finding text chunks that <em>sound like</em> your query. But it’s "blind to the relational context" – the very thing you need for complex questions.</p>
<p>The graph-based approach solves this. When a query requires multi-step reasoning, you don't just search for similar text. You traverse a structured, explicit path in the graph. This allows the system to:</p>
<ol>
<li><p><strong>Follow chains of logic:</strong> It can answer multi-hop questions by finding a literal path from one node to another (for example, <code>F-150</code> → <code>made_by</code> → <code>Ford</code> → <code>ceo</code> → <code>Jim Farley</code>).</p>
</li>
<li><p><strong>Disambiguate entities:</strong> It can use node attributes (like <code>type: "company"</code>) to distinguish between two entities with the same name.</p>
</li>
<li><p><strong>Resolve contradictions:</strong> It can store metadata (like dates) directly <em>on the edge</em> to programmatically determine the most current fact.</p>
</li>
</ol>
<p>You move from "guessing from a cloud of semantically similar text" to querying a "global memory" of how facts are explicitly connected.</p>
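<p>To see what traversing a path means in practice, here is a small standard-library sketch that stores facts as (subject, relation, object) triples and walks them with a breadth-first search. The triples and the <code>find_path</code> helper are illustrative assumptions, and the edge directions are chosen so the example path can be followed from start to end.</p>
<pre><code class="lang-python">from collections import defaultdict, deque

# Each fact is a directed, labeled edge: (subject, relation, object).
triples = [
    ("F-150", "made_by", "Ford"),
    ("Ford", "ceo", "Jim Farley"),
    ("Ford", "headquartered_in", "Dearborn"),
]

outgoing = defaultdict(list)
for subject, relation, obj in triples:
    outgoing[subject].append((relation, obj))

def find_path(source, target):
    # Breadth-first search returns the shortest chain of labeled hops, or None.
    queue = deque([(source, [source])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for relation, neighbor in outgoing.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [relation, neighbor]))
    return None

print(find_path("F-150", "Jim Farley"))
</code></pre>
<p>Because every hop is an explicit edge, the returned path doubles as an explanation of how the answer was reached. When no chain of edges connects the two entities, the function returns <code>None</code> instead of guessing.</p>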
<p>Here is the practical implementation of our <code>KnowledgeGraph</code>. This class uses <code>networkx</code> to store the nodes and edges we just discussed, and includes specific methods to run the structured query patterns needed to solve our RAG failures.</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">KnowledgeGraph</span>:</span>
    <span class="hljs-string">"""
    A wrapper around networkx.DiGraph to store and query
    explicit entities and their relationships.
    """</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        self.graph = nx.DiGraph() 

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">add_data</span>(<span class="hljs-params">self, nodes=None, edges=None</span>):</span>
        <span class="hljs-string">"""Populates the graph with nodes and edges."""</span>
        <span class="hljs-keyword">if</span> nodes:
            <span class="hljs-keyword">for</span> node, attrs <span class="hljs-keyword">in</span> nodes:
                self.graph.add_node(node, **attrs) 
        <span class="hljs-keyword">if</span> edges:
            <span class="hljs-keyword">for</span> u, v, attrs <span class="hljs-keyword">in</span> edges:
                self.graph.add_edge(u, v, **attrs) 

    <span class="hljs-comment"># --- Query Patterns ---</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">query_multi_hop_path</span>(<span class="hljs-params">self, source, target</span>):</span>
        <span class="hljs-string">"""
        Pattern 1: Solves multi-hop queries by finding a path.
        """</span>
        <span class="hljs-keyword">try</span>:
            path = nx.shortest_path(self.graph, source=source, target=target) 
            <span class="hljs-comment"># Format the answer based on the discovered path</span>
            <span class="hljs-keyword">return</span> <span class="hljs-string">f"<span class="hljs-subst">{path[<span class="hljs-number">-2</span>]}</span> attended <span class="hljs-subst">{path[<span class="hljs-number">-1</span>]}</span>."</span> 
        <span class="hljs-keyword">except</span> (nx.NetworkXNoPath, nx.NodeNotFound):
            <span class="hljs-keyword">return</span> <span class="hljs-string">"Could not find a connection."</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">query_with_conflict_resolution</span>(<span class="hljs-params">self, entity, relation, time_attr=<span class="hljs-string">"year"</span></span>):</span>
        <span class="hljs-string">"""
        Pattern 4: Resolves contradictions using metadata (like timestamps)
        stored on the edges.
        """</span>
        candidates = []
        <span class="hljs-keyword">for</span> neighbor <span class="hljs-keyword">in</span> self.graph.neighbors(entity):
            edge_data = self.graph.get_edge_data(entity, neighbor) 
            <span class="hljs-keyword">if</span> edge_data.get(<span class="hljs-string">"label"</span>) == relation: 
                candidates.append((neighbor, edge_data.get(time_attr, <span class="hljs-number">0</span>))) 

        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> candidates: 
            <span class="hljs-keyword">return</span> <span class="hljs-string">"No information found."</span> 

        <span class="hljs-comment"># Sort by the time attribute, descending, and take the latest</span>
        latest = sorted(candidates, key=<span class="hljs-keyword">lambda</span> item: item[<span class="hljs-number">1</span>], reverse=<span class="hljs-literal">True</span>)[<span class="hljs-number">0</span>] 
        <span class="hljs-keyword">return</span> <span class="hljs-string">f"<span class="hljs-subst">{latest[<span class="hljs-number">0</span>]}</span> (as of <span class="hljs-subst">{latest[<span class="hljs-number">1</span>]}</span>)"</span> 

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">query_disambiguated</span>(<span class="hljs-params">self, entity_name, entity_type, attribute_key</span>):</span>
        <span class="hljs-string">"""
        Pattern 3: Uses node 'type' attributes to disambiguate
        entities with the same name.
        """</span>
        <span class="hljs-keyword">for</span> node, attrs <span class="hljs-keyword">in</span> self.graph.nodes(data=<span class="hljs-literal">True</span>): 
            <span class="hljs-comment"># Find the node that matches both name and type</span>
            <span class="hljs-keyword">if</span> entity_name <span class="hljs-keyword">in</span> node <span class="hljs-keyword">and</span> attrs.get(<span class="hljs-string">"type"</span>) == entity_type: 
                <span class="hljs-comment"># Return the requested attribute</span>
                year = attrs[<span class="hljs-string">'year'</span>]
                product = attrs[attribute_key]
                <span class="hljs-keyword">return</span> <span class="hljs-string">f"<span class="hljs-subst">{node}</span>'s first product was the <span class="hljs-subst">{product}</span> in <span class="hljs-subst">{year}</span>."</span> 
        <span class="hljs-keyword">return</span> <span class="hljs-string">"Cannot disambiguate entity."</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">query_explicit_relation</span>(<span class="hljs-params">self, source_node, relation_label</span>):</span>
        <span class="hljs-string">"""
        Pattern 5: Finds partners based on an explicit edge label,
        preventing semantic 'bleed-over' from unrelated entities.
        """</span>
        partners = [
            v <span class="hljs-keyword">for</span> u, v, data <span class="hljs-keyword">in</span> self.graph.edges(data=<span class="hljs-literal">True</span>) 
            <span class="hljs-keyword">if</span> u == source_node <span class="hljs-keyword">and</span> data.get(<span class="hljs-string">'label'</span>) == relation_label
        ] 

        <span class="hljs-keyword">if</span> partners:
            <span class="hljs-keyword">return</span> <span class="hljs-string">f"<span class="hljs-subst">{source_node}</span> partnered with <span class="hljs-subst">{<span class="hljs-string">', '</span>.join(partners)}</span>."</span> 
        <span class="hljs-keyword">return</span> <span class="hljs-string">f"No partners found for <span class="hljs-subst">{source_node}</span>."</span>

<span class="hljs-comment"># A helper function for Pattern 2 (Causal Rules)</span>
<span class="hljs-comment"># This logic is more rule-based but can be backed by a graph</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">query_causal_chain</span>(<span class="hljs-params">facts</span>):</span>
    <span class="hljs-string">"""
    Pattern 2: Synthesizes a direct conclusion by following a
    chain of causal rules.
    """</span>
    <span class="hljs-keyword">try</span>:
        <span class="hljs-keyword">if</span> facts[<span class="hljs-string">"John"</span>][<span class="hljs-string">"takes"</span>] == <span class="hljs-string">"aspirin"</span>: 
            <span class="hljs-keyword">if</span> facts[<span class="hljs-string">"aspirin"</span>][<span class="hljs-string">"is_a"</span>] == <span class="hljs-string">"blood thinner"</span>: 
                <span class="hljs-keyword">if</span> facts[<span class="hljs-string">"blood thinner"</span>][<span class="hljs-string">"risk_for"</span>] == <span class="hljs-string">"surgery"</span>:
                    <span class="hljs-keyword">return</span> <span class="hljs-string">"John is NOT safe due to increased bleeding risk from aspirin, a blood thinner."</span>
    <span class="hljs-keyword">except</span> KeyError:
        <span class="hljs-keyword">pass</span> <span class="hljs-comment"># Fall through to default</span>
    <span class="hljs-keyword">return</span> <span class="hljs-string">"Insufficient information to determine risk."</span>
</code></pre>
<h2 id="heading-5-rag-failures-and-their-graph-based-solutions">5 RAG Failures and Their Graph-Based Solutions</h2>
<p>Let's run five scenarios to see how our standard RAG chain performs against our new <code>KnowledgeGraph</code>.</p>
<h3 id="heading-pattern-1-the-multi-hop-failure">Pattern 1: The Multi-Hop Failure</h3>
<p>The multi-hop failure occurs when an answer requires connecting multiple, separate facts – a chain of reasoning that RAG often breaks.</p>
<ul>
<li><p><strong>Query:</strong> "Which university did the CEO of the company that makes the F-150 attend?"</p>
</li>
<li><p><strong>Problem:</strong> A standard retriever might get chunks for <code>F-150 -&gt; Ford</code> and <code>Jim Farley -&gt; CEO</code>, but miss the <code>Jim Farley -&gt; Georgetown</code> chunk. The chain is broken.</p>
</li>
</ul>
<h4 id="heading-why-the-naive-rag-fails">Why the Naïve RAG Fails</h4>
<p>The retriever's job is to find the <code>top-k=3</code> chunks that are <strong>semantically similar</strong> to the entire query. When the user asks, "Which university did the CEO of the company that makes the F-150 attend?", the retriever will search our 6-document list and will likely retrieve:</p>
<ol>
<li><p>The chunk about the <strong>University of Michigan</strong> (because of the words "university" and "car companies").</p>
</li>
<li><p>The chunk about <strong>Mary Barra</strong> (because of "CEO," "Ford," and "F-150 line").</p>
</li>
<li><p>The chunk about the <strong>F-150 engine options</strong> (because of "F-150").</p>
</li>
</ol>
<p>The <code>top-k=3</code> context handed to the LLM is now full of irrelevant facts. The one chunk that contains the <em>actual</em> answer ("Mr. Farley received his undergraduate degree from Georgetown University") is semantically too far from the main query and is <strong>never retrieved</strong>. The LLM fails not because it's unintelligent, but because it was never given the correct piece of the puzzle.</p>
<h4 id="heading-why-the-graphrag-succeeds">Why the GraphRAG Succeeds</h4>
<p>The knowledge graph doesn't care about semantic similarity. It performs a deterministic traversal of explicit, verified relationships.</p>
<p>We ask for the <em>path</em> from the <code>F-150</code> node to the <code>Georgetown University</code> node. The graph follows the chain we defined: <code>F-150</code> → <code>made_by</code> → <code>Ford Motor Company</code> → <code>ceo</code> → <code>Jim Farley</code> → <code>attended</code> → <code>Georgetown University</code>. As long as those edges exist, the "noise" documents can't distract it, because it isn't ranking text by similarity – it's <strong>navigating</strong> a pre-built map.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Naive RAG</span>
docs_s1 = [
    <span class="hljs-comment"># --- The 3 "Answer" Chunks ---</span>
    Document(page_content=<span class="hljs-string">"The Ford F-150 is a full-size pickup truck made by Ford Motor Company."</span>),
    Document(page_content=<span class="hljs-string">"Jim Farley is the current CEO of Ford Motor Company."</span>),
    Document(page_content=<span class="hljs-string">"Mr. Farley received his undergraduate degree from Georgetown University."</span>),

    <span class="hljs-comment"># --- The 3 "Noise" Chunks (to distract the retriever) ---</span>
    Document(page_content=<span class="hljs-string">"The University of Michigan is renowned for its automotive engineering program, which partners with many car companies."</span>),
    Document(page_content=<span class="hljs-string">"The F-150 comes with several engine options, including a powerful 3.5L EcoBoost V6."</span>),
    Document(page_content=<span class="hljs-string">"Mary Barra, the CEO of General Motors, is a major competitor to Ford and its F-150 line."</span>)
]
query_s1 = <span class="hljs-string">"Which university did the CEO of the company that makes the F-150 attend?"</span>
rag_chain_s1 = create_rag_chain(docs_s1) <span class="hljs-comment"># This uses top_k=3</span>
print(<span class="hljs-string">f"Naive RAG Answer: <span class="hljs-subst">{rag_chain_s1.invoke(query_s1)}</span>"</span>)
<span class="hljs-comment">#</span>
<span class="hljs-comment"># GraphRAG Pattern</span>
graph_s1 = KnowledgeGraph()
edges_s1 = [
    (<span class="hljs-string">"F-150"</span>, <span class="hljs-string">"Ford Motor Company"</span>, {<span class="hljs-string">"label"</span>: <span class="hljs-string">"made_by"</span>}),
    (<span class="hljs-string">"Ford Motor Company"</span>, <span class="hljs-string">"Jim Farley"</span>, {<span class="hljs-string">"label"</span>: <span class="hljs-string">"ceo"</span>}),
    (<span class="hljs-string">"Jim Farley"</span>, <span class="hljs-string">"Georgetown University"</span>, {<span class="hljs-string">"label"</span>: <span class="hljs-string">"attended"</span>}),
]
graph_s1.add_data(edges=edges_s1)
print(<span class="hljs-string">f"GraphRAG Answer: <span class="hljs-subst">{graph_s1.query_multi_hop_path(<span class="hljs-string">'F-150'</span>, <span class="hljs-string">'Georgetown University'</span>)}</span>"</span>)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">Naive RAG Answer: I don't have enough information from the context.
GraphRAG Answer: Jim Farley attended Georgetown University.
</code></pre>
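<p>Note that <code>query_multi_hop_path</code> takes the target node as an input, which in a real question is the unknown. A minimal sketch of one alternative, assuming the question has already been parsed into an ordered list of edge labels: walk one labelled edge per hop from the starting entity. The <code>follow_relations</code> helper and the plain-dict edge store are illustrative assumptions, not part of the <code>KnowledgeGraph</code> class, and the dict stands in for the networkx graph so the snippet runs without dependencies.</p>

```python
# Hypothetical edge store: {(source, label): target}.
# A plain dict stands in for the networkx graph used in the article.
EDGES = {
    ("F-150", "made_by"): "Ford Motor Company",
    ("Ford Motor Company", "ceo"): "Jim Farley",
    ("Jim Farley", "attended"): "Georgetown University",
}


def follow_relations(edges, start, labels):
    """Walk one labelled edge per hop.

    Returns the list of visited nodes, or None if any hop is missing.
    """
    path = [start]
    for label in labels:
        next_node = edges.get((path[-1], label))
        if next_node is None:
            return None  # the chain is broken: no such edge exists
        path.append(next_node)
    return path


path = follow_relations(EDGES, "F-150", ["made_by", "ceo", "attended"])
print(path[-1])
```

<p>Here the answer is the last node on the path, and a missing edge yields <code>None</code> instead of a guess.</p>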
<h3 id="heading-pattern-2-the-causal-synthesis-failure">Pattern 2: The Causal Synthesis Failure</h3>
<p>This is the failure to move from retrieval to synthesis. RAG lists facts but can't combine them to form a new conclusion.</p>
<ul>
<li><p><strong>Query:</strong> "Is John safe to undergo surgery while on aspirin?"</p>
</li>
<li><p><strong>Problem:</strong> Even when RAG retrieves "John takes aspirin," "Aspirin is a blood thinner," and "Blood thinners increase surgery risk," it often fails to synthesize them into a direct "No, it's not safe" answer.</p>
</li>
</ul>
<h4 id="heading-why-the-naive-rag-fails-1">Why the Naïve RAG Fails</h4>
<p>The retriever searches for chunks that are semantically similar to the query: "John," "safe," "surgery," and "aspirin." In a real document base, it's highly likely to retrieve distracting, topically related "noise" chunks.</p>
<p>In our example, the <code>top-k=3</code> chunks it retrieves might be:</p>
<ol>
<li><p>"John is currently taking daily low-dose aspirin." (Relevant: "John," "aspirin")</p>
</li>
<li><p>"Pre-surgery safety checks are standard procedure..." (Relevant: "surgery safety")</p>
</li>
<li><p>"John is otherwise in good health and is cleared for the procedure..." (Relevant: "John," "safe," "procedure")</p>
</li>
</ol>
<p>The key causal link ("Aspirin... is considered a blood thinner") is semantically less similar to the <em>full query</em> and gets pushed out of the <code>top-k=3</code> context. The LLM is then given incomplete information. It sees "John takes aspirin" and "John is cleared," so it provides a weak, hedged answer and cannot make the correct logical leap.</p>
<h4 id="heading-why-the-graphrag-succeeds-1">Why the GraphRAG Succeeds</h4>
<p>This approach doesn't use semantic search. It uses explicit logical rules (which could be backed by a causal graph). The <code>query_causal_chain</code> function is not searching for text – it's executing a pre-defined chain of logic:</p>
<ol>
<li><p><em>Fact:</em> Does John take aspirin? Yes.</p>
</li>
<li><p><em>Fact:</em> Is aspirin a blood thinner? Yes.</p>
</li>
<li><p><em>Fact:</em> Is a blood thinner a risk for surgery? Yes.</p>
</li>
<li><p><em>Conclusion:</em> Therefore, John is not safe.</p>
</li>
</ol>
<p>This deterministic, rule-based reasoning is immune to the "semantic noise" that distracts the naive RAG.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Naive RAG</span>
docs_s2 = [
    <span class="hljs-comment"># --- The 3 "Answer" Chunks ---</span>
    Document(page_content=<span class="hljs-string">"Aspirin reduces blood clotting and is considered a blood thinner."</span>),
    Document(page_content=<span class="hljs-string">"Patients on blood thinners have increased bleeding risk during surgery."</span>),
    Document(page_content=<span class="hljs-string">"John is currently taking daily low-dose aspirin."</span>),

    <span class="hljs-comment"># --- The 3 "Noise" Chunks (to distract the retriever) ---</span>
    Document(page_content=<span class="hljs-string">"John is otherwise in good health and is cleared for the procedure by his cardiologist."</span>),
    Document(page_content=<span class="hljs-string">"Pre-surgery safety checks are standard procedure and usually focus on anesthesia allergies."</span>),
    Document(page_content=<span class="hljs-string">"Aspirin is also commonly used to relieve minor aches and pains, but this is not why John takes it."</span>)
]
query_s2 = <span class="hljs-string">"Is John safe to undergo surgery while on aspirin?"</span>
rag_chain_s2 = create_rag_chain(docs_s2)
print(<span class="hljs-string">f"Naive RAG Answer: <span class="hljs-subst">{rag_chain_s2.invoke(query_s2)}</span>"</span>)

<span class="hljs-comment"># GraphRAG Pattern</span>
facts_s2 = {
    <span class="hljs-string">"aspirin"</span>: {<span class="hljs-string">"is_a"</span>: <span class="hljs-string">"blood thinner"</span>},
    <span class="hljs-string">"blood thinner"</span>: {<span class="hljs-string">"risk_for"</span>: <span class="hljs-string">"surgery"</span>},
    <span class="hljs-string">"John"</span>: {<span class="hljs-string">"takes"</span>: <span class="hljs-string">"aspirin"</span>},
}
print(<span class="hljs-string">f"GraphRAG Answer: <span class="hljs-subst">{query_causal_chain(facts_s2)}</span>"</span>)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">Naive RAG Answer: Based on the context, John is currently taking daily low-dose aspirin...
GraphRAG Answer: John is NOT safe due to increased bleeding risk from aspirin, a blood thinner.
</code></pre>
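<p>The <code>query_causal_chain</code> helper hard-codes John, aspirin, and surgery. A slightly more general sketch, assuming the same nested-dict <code>facts</code> layout, follows the <code>takes</code> and <code>is_a</code> links until it reaches a matching <code>risk_for</code> fact. The <code>is_risky_for</code> name and the extra "Mary" entry are hypothetical and added only for illustration.</p>

```python
def is_risky_for(facts, patient, procedure):
    """Return the chain of entities linking patient to a risk, or None."""
    chain = [patient]
    seen = set()
    current = facts.get(patient, {}).get("takes")
    while current is not None and current not in seen:
        seen.add(current)  # guard against cyclic is_a links
        chain.append(current)
        if facts.get(current, {}).get("risk_for") == procedure:
            return chain
        current = facts.get(current, {}).get("is_a")
    return None


facts = {
    "aspirin": {"is_a": "blood thinner"},
    "blood thinner": {"risk_for": "surgery"},
    "John": {"takes": "aspirin"},
    "Mary": {"takes": "vitamin C"},  # illustrative entry with no known risk
}
print(is_risky_for(facts, "John", "surgery"))
print(is_risky_for(facts, "Mary", "surgery"))
```

<p>Returning the chain itself, not just a yes or no, lets the final answer cite each fact it relied on.</p>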
<h3 id="heading-pattern-3-the-entity-ambiguity-trap">Pattern 3: The Entity Ambiguity Trap</h3>
<p>Vector search struggles with polysemy (words with multiple meanings). It relies on local semantic context, which can easily be confused.</p>
<ul>
<li><p><strong>Query:</strong> "When did Apple release its first product?"</p>
</li>
<li><p><strong>Problem:</strong> The query "Apple" might retrieve documents for both Apple (company) and Apple (fruit), confusing the LLM.</p>
</li>
</ul>
<h4 id="heading-why-the-naive-rag-fails-2">Why the Naïve RAG Fails</h4>
<p>The query "When did Apple release its first product?" is semantically ambiguous. The vector retriever, which looks for <em>semantic closeness</em>, will be strongly attracted to the "noise" chunks we added about the fruit.</p>
<p>The <code>top-k=3</code> chunks it retrieves will likely be:</p>
<ol>
<li><p>"The 'Cosmic Crisp' is a new <strong>apple product</strong>... <strong>first released</strong>..." (Extremely high semantic similarity to "Apple releases its first product").</p>
</li>
<li><p>"The Granny Smith <strong>apple</strong>... is a popular <strong>product</strong>..."</p>
</li>
<li><p>"Many <strong>apple</strong> orchards <strong>release</strong> their new harvest..."</p>
</li>
</ol>
<p>The <em>correct</em> chunk ("The Apple I was introduced by Apple Inc...") is about a "company" and a specific "product" name. It might be semantically <em>less</em> similar to the general query than the "Cosmic Crisp" chunk. The LLM is then handed a context exclusively about fruits and confidently (but incorrectly) answers about the "Cosmic Crisp" apple.</p>
<h4 id="heading-why-the-graphrag-succeeds-2">Why the GraphRAG Succeeds</h4>
<p>The graph approach sidesteps this ambiguity. The <code>query_disambiguated</code> function is <em>not</em> just searching for "Apple." It explicitly looks for a node that meets two criteria: its name contains <code>'Apple'</code> AND its <code>type</code> is <code>'company'</code>.</p>
<p>This query structurally guarantees that it finds the <code>Apple Inc.</code> node and ignores the <code>apple (fruit)</code> node, regardless of semantic similarity. It then reliably retrieves the <code>first_product</code> attribute from the correct node.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Naive RAG</span>
docs_s3 = [
    <span class="hljs-comment"># --- The "Answer" Chunks ---</span>
    Document(page_content=<span class="hljs-string">"The Apple I was introduced by Apple Inc. in 1976."</span>),
    Document(page_content=<span class="hljs-string">"Apple Inc. is a technology company based in Cupertino."</span>),

    <span class="hljs-comment"># --- "Noise" Chunks (to create ambiguity) ---</span>
    Document(page_content=<span class="hljs-string">"The 'Cosmic Crisp' is a new apple product developed by Washington State University, first released to consumers in 2019."</span>),
    Document(page_content=<span class="hljs-string">"Apples (the fruit) were first cultivated in Central Asia thousands of years ago."</span>),
    Document(page_content=<span class="hljs-string">"The Granny Smith apple, first discovered in Australia, is a popular product for baking."</span>),
    Document(page_content=<span class="hljs-string">"Many apple orchards release their new harvest in the fall."</span>)
]
query_s3 = <span class="hljs-string">"When did Apple release its first product?"</span>
rag_chain_s3 = create_rag_chain(docs_s3)
print(<span class="hljs-string">f"Naive RAG Answer: <span class="hljs-subst">{rag_chain_s3.invoke(query_s3)}</span>"</span>)

<span class="hljs-comment"># GraphRAG Pattern</span>
graph_s3 = KnowledgeGraph()
nodes_s3 = [
    (<span class="hljs-string">"Apple Inc."</span>, {<span class="hljs-string">"type"</span>: <span class="hljs-string">"company"</span>, <span class="hljs-string">"first_product"</span>: <span class="hljs-string">"Apple I"</span>, <span class="hljs-string">"year"</span>: <span class="hljs-number">1976</span>}),
    (<span class="hljs-string">"apple"</span>, {<span class="hljs-string">"type"</span>: <span class="hljs-string">"fruit"</span>, <span class="hljs-string">"origin"</span>: <span class="hljs-string">"Central Asia"</span>}),
]
graph_s3.add_data(nodes=nodes_s3)
print(<span class="hljs-string">f"GraphRAG Answer: <span class="hljs-subst">{graph_s3.query_disambiguated(<span class="hljs-string">'Apple'</span>, <span class="hljs-string">'company'</span>, <span class="hljs-string">'first_product'</span>)}</span>"</span>)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">Naive RAG Answer: The 'Cosmic Crisp', a new apple product, was first released to consumers in 2019.
GraphRAG Answer: Apple Inc.'s first product was the Apple I in 1976.
</code></pre>
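<p>To see why the type filter matters, here is a small sketch that lists every node matching a name, with and without the type constraint. It assumes a plain dict of node attributes in place of the networkx graph, and <code>find_entities</code> is a hypothetical helper rather than a method of <code>KnowledgeGraph</code>.</p>

```python
NODES = {
    "Apple Inc.": {"type": "company", "first_product": "Apple I", "year": 1976},
    "apple": {"type": "fruit", "origin": "Central Asia"},
}


def find_entities(nodes, name, entity_type=None):
    """Case-insensitive name match, optionally narrowed by node type."""
    needle = name.lower()
    return [
        node
        for node, attrs in nodes.items()
        if needle in node.lower()
        and (entity_type is None or attrs.get("type") == entity_type)
    ]


print(find_entities(NODES, "Apple"))
print(find_entities(NODES, "Apple", "company"))
```

<p>The name alone matches both nodes. Adding the type narrows the result to a single node, and a caller could refuse to answer whenever more than one candidate remains.</p>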
<h3 id="heading-pattern-4-the-contradictory-information-failure">Pattern 4: The Contradictory Information Failure</h3>
<p>RAG is blind to knowledge conflicts. If it retrieves two or more contradictory facts, it can't resolve them using metadata like dates or source credibility. It will hedge, merge them into a false statement, or present all of them.</p>
<ul>
<li><p><strong>Query:</strong> "Who is the CEO of Twitter?"</p>
</li>
<li><p><strong>Problem:</strong> The retriever finds one chunk saying "Parag Agrawal (2022)" and another saying "Elon Musk (2023)". It may also find other related, confusing information. The LLM has no way to know which fact is the most current and authoritative.</p>
</li>
</ul>
<h4 id="heading-why-the-naive-rag-fails-3">Why the Naïve RAG Fails</h4>
<p>The query "Who is the CEO of Twitter?" is semantically similar to <em>all</em> documents containing the words "CEO" and "Twitter." In a real-world, evolving knowledge base, this is a recipe for disaster.</p>
<p>The <code>top-k=3</code> chunks our retriever finds will be a mess of contradictions:</p>
<ol>
<li><p>"In 2023, Elon Musk became the CEO of Twitter." (Correct, but old)</p>
</li>
<li><p>"In 2022, Parag Agrawal was the CEO of Twitter." (Old)</p>
</li>
<li><p>"Linda Yaccarino is the current CEO of X (formerly Twitter)..." (Also correct, but a different person/role).</p>
</li>
</ol>
<p>The LLM is handed three different, conflicting names for "CEO of Twitter" from different time periods. Because it is instructed to answer <em>only</em> from the context and has no mechanism to identify which fact is the most recent, it cannot give a single, confident answer. It’s forced to list the conflicts it found.</p>
<h4 id="heading-why-the-graphrag-succeeds-3">Why the GraphRAG Succeeds</h4>
<p>The knowledge graph is built for this. We've stored the "CEO" relationship as an <strong>edge with metadata</strong>, specifically a <code>year</code> attribute.</p>
<p>Our <code>query_with_conflict_resolution</code> function doesn't just find all CEO-related edges. It programmatically:</p>
<ol>
<li><p>Finds all nodes connected to "Twitter" by a <code>ceo</code> label.</p>
</li>
<li><p>Extracts the <code>year</code> from each of those edges.</p>
</li>
<li><p><strong>Sorts the candidates by year</strong> in descending order.</p>
</li>
<li><p>Returns only the top result.</p>
</li>
</ol>
<p>This provides a deterministic, programmatic way to resolve conflicts and always provide the most current fact based on the explicit timestamps in our graph.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Naive RAG</span>
docs_s4 = [
    <span class="hljs-comment"># --- The "Answer" Chunks (conflicting) ---</span>
    Document(page_content=<span class="hljs-string">"In 2022, Parag Agrawal was the CEO of Twitter."</span>),
    Document(page_content=<span class="hljs-string">"In 2023, Elon Musk became the CEO of Twitter."</span>),

    <span class="hljs-comment"># --- "Noise" Chunks (to add more conflict/confusion) ---</span>
    Document(page_content=<span class="hljs-string">"Linda Yaccarino is the current CEO of X (formerly Twitter), overseeing business operations."</span>),
    Document(page_content=<span class="hljs-string">"Jack Dorsey, a co-founder and former CEO of Twitter, is now focused on his company Block."</span>),
    Document(page_content=<span class="hljs-string">"CEOs of major tech companies, including Twitter's, have recently testified before Congress."</span>)
]
query_s4 = <span class="hljs-string">"Who is the CEO of Twitter?"</span>
rag_chain_s4 = create_rag_chain(docs_s4)
print(<span class="hljs-string">f"Naive RAG Answer: <span class="hljs-subst">{rag_chain_s4.invoke(query_s4)}</span>"</span>)

<span class="hljs-comment"># GraphRAG Pattern</span>
graph_s4 = KnowledgeGraph()
edges_s4 = [
    (<span class="hljs-string">"Twitter"</span>, <span class="hljs-string">"Parag Agrawal"</span>, {<span class="hljs-string">"label"</span>: <span class="hljs-string">"ceo"</span>, <span class="hljs-string">"year"</span>: <span class="hljs-number">2022</span>}),
    (<span class="hljs-string">"Twitter"</span>, <span class="hljs-string">"Elon Musk"</span>, {<span class="hljs-string">"label"</span>: <span class="hljs-string">"ceo"</span>, <span class="hljs-string">"year"</span>: <span class="hljs-number">2023</span>}),
]
graph_s4.add_data(edges=edges_s4)
print(<span class="hljs-string">f"GraphRAG Answer: <span class="hljs-subst">{graph_s4.query_with_conflict_resolution(<span class="hljs-string">'Twitter'</span>, <span class="hljs-string">'ceo'</span>, <span class="hljs-string">'year'</span>)}</span>"</span>)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">Naive RAG Answer: According to the context, in 2022, Parag Agrawal was the CEO of Twitter. In 2023, Elon Musk became the CEO... Linda Yaccarino is the current CEO of X (formerly Twitter)...
GraphRAG Answer: Elon Musk (as of 2023)
</code></pre>
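<p>One detail to watch: <code>query_with_conflict_resolution</code> treats an edge with no timestamp as year <code>0</code>, so undated facts silently lose every comparison. A minimal sketch of a variant that skips undated edges explicitly is shown here. It assumes edges are given as plain <code>(source, target, attrs)</code> tuples, and the <code>latest_by</code> helper and the undated Jack Dorsey edge are illustrative additions.</p>

```python
def latest_by(edges, source, label, time_attr="year"):
    """Return (target, timestamp) for the most recent dated edge, or None."""
    dated = [
        (target, attrs[time_attr])
        for src, target, attrs in edges
        if src == source and attrs.get("label") == label and time_attr in attrs
    ]
    if not dated:
        return None
    return max(dated, key=lambda item: item[1])


edges = [
    ("Twitter", "Parag Agrawal", {"label": "ceo", "year": 2022}),
    ("Twitter", "Elon Musk", {"label": "ceo", "year": 2023}),
    ("Twitter", "Jack Dorsey", {"label": "ceo"}),  # undated, for illustration
]
print(latest_by(edges, "Twitter", "ceo"))
```

<p>The result is only as current as the newest dated edge in the graph, so keeping timestamps complete matters as much as the query logic.</p>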
<h3 id="heading-pattern-5-the-implicit-relationship-hallucination">Pattern 5: The Implicit Relationship Hallucination</h3>
<p>RAG relies on implicit semantic closeness, which can be dangerous. If "Tesla," "Toyota," and "Panasonic" all appear near the word "battery" in the vector space, the LLM might hallucinate a relationship that doesn't exist.</p>
<ul>
<li><p><strong>Query:</strong> "Who did Tesla partner with on batteries?"</p>
</li>
<li><p><strong>Problem:</strong> The query is semantically "close" to any document mentioning "Tesla," "partner," and "batteries." The retriever will fetch chunks based on this closeness, even if they don't explicitly state a partnership, leading the LLM to infer one.</p>
</li>
</ul>
<h4 id="heading-why-the-naive-rag-fails-4">Why the Naïve RAG Fails</h4>
<p>The vector retriever will look for chunks that "sound" like the query. In our expanded document list, it's highly likely to retrieve a confusing context for the LLM.</p>
<p>The <code>top-k=3</code> chunks it finds will likely be:</p>
<ol>
<li><p>"Panasonic has a long-standing partnership to manufacture batteries..." (Relevant: "Panasonic," "partnership," "batteries")</p>
</li>
<li><p>"Tesla develops electric vehicles and relies on advanced battery tech..." (Relevant: "Tesla," "battery")</p>
</li>
<li><p>"Toyota also manufactures batteries and hybrid powertrains..." (Relevant: "Toyota," "manufactures batteries")</p>
</li>
</ol>
<p>When the LLM receives this context, it has "Panasonic," "Tesla," and "Toyota" all in a "battery" context. The chunk for Panasonic doesn't explicitly link it to Tesla. The chunk for Toyota also mentions batteries. The LLM, forced to synthesize an answer, may <em>incorrectly</em> infer a partnership that doesn't exist (like with Toyota) or state the facts without confirming the relationship.</p>
<h4 id="heading-why-the-graphrag-succeeds-4">Why the GraphRAG Succeeds</h4>
<p>The knowledge graph isn’t vulnerable to this kind of "semantic bleed-over." It doesn’t care if nodes are "semantically near" each other.</p>
<p>Our <code>query_explicit_relation</code> function asks a very specific, structural question: "Start at the node <strong>'Tesla'</strong> and return <em>only</em> the nodes connected to it by an edge with the <em>exact label</em> <strong>'partners_with'</strong>".</p>
<p>The graph then traverses its edges and finds only one: <code>("Tesla", "Panasonic", {"label": "partners_with"})</code>. It is structurally impossible for it to hallucinate a partnership with "Toyota" because no such <code>partners_with</code> edge exists for Tesla in the graph.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Naive RAG</span>
docs_s5 = [
    <span class="hljs-comment"># --- The "Answer" Chunks (ambiguous) ---</span>
    Document(page_content=<span class="hljs-string">"Tesla develops electric vehicles and relies on advanced battery tech."</span>),
    Document(page_content=<span class="hljs-string">"Panasonic has a long-standing partnership to manufacture batteries for electric vehicles."</span>),

    <span class="hljs-comment"># --- "Noise" Chunks (to create a false signal) ---</span>
    Document(page_content=<span class="hljs-string">"Toyota also manufactures batteries and hybrid powertrains for its own vehicle lineup."</span>),
    Document(page_content=<span class="hljs-string">"Tesla, Panasonic, and Toyota are all major players in the EV and battery supply chain."</span>),
    Document(page_content=<span class="hljs-string">"A new partnership for solid-state batteries was announced, but it did not involve Tesla."</span>)
]
query_s5 = <span class="hljs-string">"Who did Tesla partner with on batteries?"</span>
rag_chain_s5 = create_rag_chain(docs_s5)
print(<span class="hljs-string">f"Naive RAG Answer: <span class="hljs-subst">{rag_chain_s5.invoke(query_s5)}</span>"</span>)
<span class="hljs-comment">#</span>
<span class="hljs-comment"># GraphRAG Pattern</span>
graph_s5 = KnowledgeGraph()
edges_s5 = [
    (<span class="hljs-string">"Tesla"</span>, <span class="hljs-string">"Panasonic"</span>, {<span class="hljs-string">"label"</span>: <span class="hljs-string">"partners_with"</span>}),
    (<span class="hljs-string">"Toyota"</span>, <span class="hljs-string">"Panasonic"</span>, {<span class="hljs-string">"label"</span>: <span class="hljs-string">"partners_with"</span>}),  <span class="hljs-comment"># distractor: does not start at Tesla</span>
]
graph_s5.add_data(edges=edges_s5)
print(<span class="hljs-string">f"GraphRAG Answer: <span class="hljs-subst">{graph_s5.query_explicit_relation(<span class="hljs-string">'Tesla'</span>, <span class="hljs-string">'partners_with'</span>)}</span>"</span>)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">Naive RAG Answer: Based on the context, Panasonic has a partnership to manufacture batteries, and Tesla relies on advanced battery tech. Toyota also manufactures batteries.
GraphRAG Answer: Tesla partnered with Panasonic.
</code></pre>
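<p>The same explicit-edge idea can also act as a guard: before a generated answer asserts a relationship, check that the exact edge exists. The sketch below assumes edges are stored as a set of <code>(source, label, target)</code> triples. Both helper names are hypothetical.</p>

```python
EDGES = {
    ("Tesla", "partners_with", "Panasonic"),
}


def related(edges, source, label):
    """All targets reachable from source over an edge with this exact label."""
    return sorted(
        target for src, rel, target in edges if src == source and rel == label
    )


def has_relation(edges, source, label, target):
    """True only if this exact triple was stored explicitly."""
    return (source, label, target) in edges


print(related(EDGES, "Tesla", "partners_with"))
print(has_relation(EDGES, "Tesla", "partners_with", "Toyota"))
```

<p>Because membership is an exact match, co-occurrence of "Tesla" and "Toyota" in nearby text can never make the second check return <code>True</code>.</p>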
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>Standard RAG is an essential tool, but its strength is <strong>retrieval, not reasoning</strong>. It falters when true synthesis is required.</p>
<p>You may find that a powerful LLM like Gemini can still correctly answer some of the simple scenarios in this article. The five patterns shown here are meant to build intuition. They demonstrate what <em>can</em> and <em>does</em> go wrong as your knowledge base grows larger and more complex.</p>
<p>The real failure of naive RAG emerges as you feed it more and more conflicting, ambiguous, or incomplete information. This "noisy" context forces the LLM to either hallucinate connections or fail to reason altogether.</p>
<p>By moving from a "bag of chunks" to a structured Knowledge Graph, you build a more reliable and intelligent system. You give your system a "global memory" of how facts explicitly connect, allowing it to answer complex questions by traversing a verified path rather than just guessing from a cloud of semantically similar text.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build Your Own Private Voice Assistant: A Step-by-Step Guide Using Open-Source Tools ]]>
                </title>
                <description>
                    <![CDATA[ Most commercial voice assistants send your voice data to cloud servers before responding. By using open‑source tools, you can run everything directly on your phone for better privacy, faster responses, and full control over how the assistant behaves.... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/private-voice-assistant-using-open-source-tools/</link>
                <guid isPermaLink="false">690bcbbc8abe1e0a5b05e0be</guid>
                
                    <category>
                        <![CDATA[ Voice ]]>
                    </category>
                
                    <category>
                        <![CDATA[ voice assistants ]]>
                    </category>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Personalization  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tool calling ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ on-device ai ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Surya Teja Appini ]]>
                </dc:creator>
                <pubDate>Wed, 05 Nov 2025 22:12:12 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762380694991/10687751-7aec-4d78-8af8-1f76edc28afd.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most commercial voice assistants send your voice data to cloud servers before responding. By using open‑source tools, you can run everything directly on your phone for better privacy, faster responses, and full control over how the assistant behaves.</p>
<p>In this tutorial, I’ll walk you through the process step-by-step. You don’t need prior experience with machine learning models, as we’ll build up the system gradually and test each part as we go. By the end, you will have a fully local mobile voice assistant powered by:</p>
<ul>
<li><p>Whisper for Automatic Speech Recognition (ASR)</p>
</li>
<li><p>Machine Learning Compiler (MLC) LLM for on-device reasoning</p>
</li>
<li><p>System Text-to-Speech (TTS) using built-in Android TTS</p>
</li>
</ul>
<p>Your assistant will be able to:</p>
<ul>
<li><p>Understand your voice commands offline</p>
</li>
<li><p>Respond to you with synthesized speech</p>
</li>
<li><p>Perform tool calling actions (such as controlling smart devices)</p>
</li>
<li><p>Store personal memories and preferences</p>
</li>
<li><p>Use Retrieval-Augmented Generation (RAG) to answer questions from your own notes</p>
</li>
<li><p>Perform multi-step agentic workflows such as generating a morning briefing and optionally sending the summary to a contact</p>
</li>
</ul>
<p>This tutorial focuses on Android using Termux (the terminal environment for Android) for a fully local workflow.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-system-overview">System Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-requirements">Requirements</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-test-microphone-and-audio-playback-on-android">Step 1: Test Microphone and Audio Playback on Android</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-install-and-run-whisper-for-asr">Step 2: Install and Run Whisper for ASR</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-install-a-local-llm-with-mlc">Step 3: Install a Local LLM with MLC</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-local-text-to-speech-tts">Step 4: Local Text-to-Speech (TTS)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-the-core-voice-loop">Step 5: The Core Voice Loop</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-tool-calling-make-it-act">Step 6: Tool Calling (Make It Act)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-7-memory-and-personalization">Step 7: Memory and Personalization</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-8-retrieval-augmented-generation-rag">Step 8: Retrieval-Augmented Generation (RAG)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-9-multi-step-agentic-workflow">Step 9: Multi-Step Agentic Workflow</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion-and-next-steps">Conclusion and Next Steps</a></p>
</li>
</ul>
<h2 id="heading-system-overview">System Overview</h2>
<p>This diagram shows how your voice moves through the assistant: speech in → transcription → reasoning → action → spoken reply.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762319872832/7b52b715-79c0-4c92-b431-b84c49ba7299.png" alt="Voice assistant pipeline: speech in, transcription, reasoning, action, spoken reply." class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This pipeline describes the core flow:</p>
<ul>
<li><p>You speak into the microphone.</p>
</li>
<li><p>Whisper converts audio into text.</p>
</li>
<li><p>The local LLM interprets your request.</p>
</li>
<li><p>The assistant may call tools (for example, send notifications or create events).</p>
</li>
<li><p>The response is spoken aloud using the device’s Text-to-Speech system.</p>
</li>
</ul>
<h3 id="heading-key-concepts-used-in-this-tutorial">Key Concepts Used in This Tutorial</h3>
<ul>
<li><p><strong>Automatic Speech Recognition (ASR):</strong> Converts your speech into text. We use Whisper or Faster‑Whisper.</p>
</li>
<li><p><strong>Local Large Language Model (LLM):</strong> A reasoning model running on your phone using the MLC engine.</p>
</li>
<li><p><strong>Text‑to‑Speech (TTS):</strong> Converts text back to speech. We use Android’s built‑in system TTS.</p>
</li>
<li><p><strong>Tool Calling:</strong> Allows the assistant to perform actions (for example, sending a notification or creating an event).</p>
</li>
<li><p><strong>Memory:</strong> Stores personalized facts the assistant learns during conversation.</p>
</li>
<li><p><strong>Retrieval‑Augmented Generation (RAG):</strong> Lets the assistant reference your documents or notes.</p>
</li>
<li><p><strong>Agent Workflow:</strong> A multi‑step chain where the assistant uses multiple abilities together.</p>
</li>
</ul>
<h2 id="heading-requirements">Requirements</h2>
<p>What you should already be familiar with:</p>
<ul>
<li><p>Basic command line usage (running commands, navigating directories)</p>
</li>
<li><p>Very basic Python (calling a function, editing a <code>.py</code> script)</p>
</li>
</ul>
<p>You do <strong>not</strong> need to have:</p>
<ul>
<li><p>Machine learning experience</p>
</li>
<li><p>A deep understanding of neural networks</p>
</li>
<li><p>Prior experience with speech or audio models</p>
</li>
</ul>
<p>Here are the tools and technologies you’ll need to follow along:</p>
<ul>
<li><p>An Android phone, ideally with a Snapdragon 8+ Gen 1 or newer (older devices will still work, but responses may be slower)</p>
</li>
<li><p>Termux</p>
</li>
<li><p>Python 3.9+ inside Termux</p>
</li>
<li><p>Enough free storage (at least 4–6 GB) to store the model and audio files</p>
</li>
</ul>
<p><strong>Why these requirements matter:</strong></p>
<p>Whisper and the Llama model both run on-device, so the phone has to handle the compute in real time. MLC optimizes models for your device's GPU or NPU, so newer processors run them faster and cooler. System TTS and the Termux API commands let the assistant speak and interact with the phone locally.</p>
<p>If your phone is older or mid‑range, switch the model in Step 3 to <code>Phi-3.5-Mini</code>, which is smaller and faster.</p>
<p>We’ll start by setting up your Android environment with Termux, Python, media access, and storage permissions so later steps can record audio, run models, and speak. Note that the <code>termux-api</code> package only provides the command-line side: the companion Termux:API app must also be installed on the phone for commands such as <code>termux-microphone-record</code> to work.</p>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># In Termux</span>
pkg update &amp;&amp; pkg upgrade -y
pkg install -y python git ffmpeg termux-api
termux-setup-storage  <span class="hljs-comment"># grant storage permission</span>
</code></pre>
<h2 id="heading-step-1-test-microphone-and-audio-playback-on-android">Step 1: Test Microphone and Audio Playback on Android</h2>
<p><strong>What this step does:</strong> Verifies that your device microphone and speakers work correctly through Termux before connecting them to the voice assistant.</p>
<p>On-device assistants need reliable access to the microphone and speakers. On Android, Termux provides utilities to record audio and play media. This avoids complex audio dependencies and works on more devices.</p>
<p>These commands let you quickly test your microphone and audio playback without writing any code. This is useful to verify that your device permissions and audio paths are working before introducing Whisper or TTS.</p>
<ul>
<li><p><code>termux-microphone-record</code> records from the device microphone to an audio file (named <code>in.wav</code> in this tutorial; the actual encoding depends on the recorder's encoder setting)</p>
</li>
<li><p><code>termux-media-player</code> plays audio files</p>
</li>
<li><p><code>termux-tts-speak</code> speaks text using the system TTS voice (fast fallback)</p>
</li>
</ul>
<p><strong>Run it now:</strong></p>
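<pre><code class="lang-python"><span class="hljs-comment"># Start a 4 second recording (the command returns while recording continues in the background)</span>
termux-microphone-record -f <span class="hljs-keyword">in</span>.wav -l <span class="hljs-number">4</span>
<span class="hljs-comment"># Wait for the recording to finish, then stop the recorder in case it is still running</span>
sleep <span class="hljs-number">5</span> &amp;&amp; termux-microphone-record -q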

<span class="hljs-comment"># Play back the captured audio</span>
termux-media-player play <span class="hljs-keyword">in</span>.wav

<span class="hljs-comment"># Speak text via system TTS (fallback if you do not install a Python TTS)</span>
termux-tts-speak <span class="hljs-string">"Hello, this is your on-device assistant running locally."</span>
</code></pre>
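<p>If you would rather drive the recording from Python, the same steps can be wrapped in a small helper. This is a minimal sketch, not part of the Termux API: <code>record_audio</code> and its <code>runner</code> and <code>sleeper</code> parameters are names made up for this example, and it assumes that <code>termux-microphone-record</code> returns as soon as recording starts, so the script has to wait before stopping the recorder.</p>

```python
import subprocess
import time


def record_audio(path, seconds, runner=subprocess.run, sleeper=time.sleep):
    """Record `seconds` of audio to `path` with the Termux commands from this step.

    `runner` and `sleeper` are hypothetical injection points (an assumption of
    this sketch) so the helper can be exercised without a phone.
    """
    # Start the recording with a time limit
    runner(["termux-microphone-record", "-f", path, "-l", str(seconds)])
    # Give the recorder time to capture the audio before stopping it
    sleeper(seconds + 1)
    # Stop the recorder in case it is still running
    runner(["termux-microphone-record", "-q"])
    return path
```

<p>On the phone you would call <code>record_audio("in.wav", 4)</code>. The injectable parameters exist only so the helper can be tested on a machine without Termux.</p>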
<h2 id="heading-step-2-install-and-run-whisper-for-asr">Step 2: Install and Run Whisper for ASR</h2>
<p><strong>What this step does:</strong> Converts recorded speech into text so the language model can understand what you said.</p>
<p>Whisper listens to your audio recording and converts it into text. Smaller versions like <code>tiny</code> or <code>base</code> run faster on most phones and are good enough for everyday commands.</p>
<p>Install Whisper:</p>
<pre><code class="lang-python">pip install openai-whisper
</code></pre>
<p>If you run into installation issues, you can use Faster‑Whisper instead:</p>
<pre><code class="lang-python">pip install faster-whisper
</code></pre>
<p>Below is a small Python script that takes the recorded audio file and turns it into text. It tries Whisper first, and if that isn’t available, it will automatically fall back to Faster‑Whisper.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Convert recorded speech to text (asr_transcribe.py)</span>
<span class="hljs-keyword">import</span> sys

<span class="hljs-comment"># Try Whisper, fallback to Faster-Whisper if needed</span>
<span class="hljs-keyword">try</span>:
    <span class="hljs-keyword">import</span> whisper
    use_faster = <span class="hljs-literal">False</span>
<span class="hljs-keyword">except</span> Exception:
    use_faster = <span class="hljs-literal">True</span>

<span class="hljs-keyword">if</span> use_faster:
    <span class="hljs-keyword">from</span> faster_whisper <span class="hljs-keyword">import</span> WhisperModel
    model = WhisperModel(<span class="hljs-string">"tiny.en"</span>)
    segments, info = model.transcribe(sys.argv[<span class="hljs-number">1</span>])
    text = <span class="hljs-string">" "</span>.join(s.text <span class="hljs-keyword">for</span> s <span class="hljs-keyword">in</span> segments)
    print(text.strip())
<span class="hljs-keyword">else</span>:
    model = whisper.load_model(<span class="hljs-string">"tiny.en"</span>)
    result = model.transcribe(sys.argv[<span class="hljs-number">1</span>], fp16=<span class="hljs-literal">False</span>)
    print(result[<span class="hljs-string">"text"</span>].strip())
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># Record 4 seconds, wait for the recording to finish, then transcribe</span>
termux-microphone-record -f <span class="hljs-keyword">in</span>.wav -l <span class="hljs-number">4</span>
sleep <span class="hljs-number">5</span> &amp;&amp; termux-microphone-record -q
python asr_transcribe.py <span class="hljs-keyword">in</span>.wav
</code></pre>
<h2 id="heading-step-3-install-a-local-llm-with-mlc">Step 3: Install a Local LLM with MLC</h2>
<p><strong>What this step does:</strong> Installs and tests the on-device reasoning model that will generate responses to transcribed speech.</p>
<p>MLC compiles transformer models to mobile GPUs and Neural Processing Units, enabling on-device inference. You will run an instruction-tuned model with 4-bit or 8-bit weights for speed.</p>
<p>Install the command-line interface like this:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Clone and install Python bindings (for scripting) and CLI</span>
git clone https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm
pip install -r requirements.txt
pip install -e python
</code></pre>
<p>We will use <strong>Llama 3 8B Instruct q4</strong> because it offers strong reasoning while still running on many recent Android devices. If your phone has less memory or you want faster responses, you can swap in <strong>Phi-3.5 Mini</strong> (about 3.8B parameters) by changing only the model name in the scripts.</p>
<p>Download a mobile-optimized model:</p>
<pre><code class="lang-python">mlc_llm download Llama<span class="hljs-number">-3</span><span class="hljs-number">-8</span>B-Instruct-q4f16_1
</code></pre>
<p>We will use a short Python script to send text to the model and print the response. This lets us verify that the model is installed correctly before we connect it to audio.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Local LLM text generation (local_llm.py)</span>
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
<span class="hljs-keyword">import</span> sys

MODEL = <span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>
engine = MLCEngine(model=MODEL)
prompt = sys.argv[<span class="hljs-number">1</span>] <span class="hljs-keyword">if</span> len(sys.argv) &gt; <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"Hello"</span>
<span class="hljs-comment"># MLCEngine exposes an OpenAI-style chat completions API</span>
resp = engine.chat.completions.create(
    messages=[{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: prompt}],
    model=MODEL,
)
<span class="hljs-comment"># Response fields follow the OpenAI format; adjust if your MLC version differs</span>
reply_text = resp.choices[<span class="hljs-number">0</span>].message.content
print(reply_text)
engine.terminate()  <span class="hljs-comment"># shut down the engine's background threads</span>
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python">python local_llm.py <span class="hljs-string">"Summarize this in one sentence: building a local voice assistant on Android"</span>
</code></pre>
<h2 id="heading-step-4-local-text-to-speech-tts">Step 4: Local Text-to-Speech (TTS)</h2>
<p><strong>What this step does:</strong> Turns the model’s text responses into spoken audio so the assistant can talk back.</p>
<p>It uses the built-in Android Text-to-Speech voice through Termux and requires no additional Python packages.</p>
<pre><code class="lang-python">termux-tts-speak <span class="hljs-string">"Hello, I am running entirely on your device."</span>
</code></pre>
<p>This is the voice output method we will use throughout the tutorial.</p>
<h2 id="heading-step-5-the-core-voice-loop">Step 5: The Core Voice Loop</h2>
<p><strong>What this step does:</strong> Connects speech recognition, language model reasoning, and speech synthesis into a single interactive conversation loop.</p>
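<p>The script below runs one pass of the pipeline: it records audio, transcribes it, generates a reply, and speaks the reply. To keep the conversation going, wrap these steps in a <code>while</code> loop.</p>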
<pre><code class="lang-python"><span class="hljs-comment"># Core voice loop tying ASR + LLM + TTS (voice_loop.py)</span>
<span class="hljs-keyword">import</span> subprocess, time

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run</span>(<span class="hljs-params">cmd</span>):</span>
    <span class="hljs-comment"># Run a command and return its trimmed standard output</span>
    <span class="hljs-keyword">return</span> subprocess.check_output(cmd).decode().strip()

print(<span class="hljs-string">"Listening..."</span>)
<span class="hljs-comment"># Start a 4 second recording, wait for it to finish, then stop the recorder</span>
subprocess.run([<span class="hljs-string">"termux-microphone-record"</span>, <span class="hljs-string">"-f"</span>, <span class="hljs-string">"in.wav"</span>, <span class="hljs-string">"-l"</span>, <span class="hljs-string">"4"</span>])
time.sleep(<span class="hljs-number">5</span>)
subprocess.run([<span class="hljs-string">"termux-microphone-record"</span>, <span class="hljs-string">"-q"</span>])
text = run([<span class="hljs-string">"python"</span>, <span class="hljs-string">"asr_transcribe.py"</span>, <span class="hljs-string">"in.wav"</span>])
reply = run([<span class="hljs-string">"python"</span>, <span class="hljs-string">"local_llm.py"</span>, text])
<span class="hljs-comment"># Speak the reply with the system TTS voice from Step 4</span>
subprocess.run([<span class="hljs-string">"termux-tts-speak"</span>, reply])
<p>Run:</p>
<pre><code class="lang-python">python voice_loop.py
</code></pre>
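<p>To turn that single pass into a real conversation, the stages can be run repeatedly until you say a stop word. The sketch below is one possible shape, not the tutorial's required design: <code>voice_loop</code>, its four callable parameters, the stop words, and <code>max_turns</code> are all hypothetical names chosen for illustration. Passing the stages in as functions keeps the loop independent of Termux, Whisper, and MLC.</p>

```python
def voice_loop(record, transcribe, respond, speak,
               stop_words=("stop", "goodbye"), max_turns=10):
    """Run record -> transcribe -> respond -> speak until a stop word or max_turns.

    All four stages are plain callables supplied by the caller (an assumption
    of this sketch), so each can wrap the scripts from the earlier steps.
    """
    history = []
    for _ in range(max_turns):
        audio_path = record()
        text = transcribe(audio_path).strip()
        if not text:
            continue  # nothing was heard; listen again
        # Compare without trailing punctuation, since ASR output often adds it
        if text.lower().strip(".!? ") in stop_words:
            break
        reply = respond(text)
        speak(reply)
        history.append((text, reply))
    return history
```

<p>On the phone, <code>record</code> would wrap the Termux recording commands, <code>transcribe</code> would call <code>asr_transcribe.py</code>, <code>respond</code> would call <code>local_llm.py</code>, and <code>speak</code> would call <code>termux-tts-speak</code>.</p>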
<h2 id="heading-step-6-tool-calling-make-it-act">Step 6: Tool Calling (Make It Act)</h2>
<p><strong>What this step does:</strong> Enables the assistant to perform actions – not just reply – by calling real functions on your device.</p>
<p>When the model recognizes an action request, it outputs a small JSON instruction instead of a normal answer. You tell the model which tools exist and how to call them, and your program intercepts that JSON and runs the corresponding function.</p>
<p><strong>Example use case:</strong></p>
<p>You say: <em>"Schedule a meeting tomorrow at 3 PM with John."</em></p>
<p>The assistant:</p>
<ol>
<li><p>Transcribes what you said.</p>
</li>
<li><p>Detects that this is not a question, but an action request.</p>
</li>
<li><p>Calls the <code>add_event()</code> function with the correct parameters.</p>
</li>
<li><p>Confirms: <em>"Okay, I scheduled that."</em></p>
</li>
</ol>
<p>Here’s the structure of how tool calls will work:</p>
<ul>
<li><p>Define Python functions such as <code>add_event</code>, <code>control_light</code></p>
</li>
<li><p>Provide a schema for the model to output when it wants to call a tool</p>
</li>
<li><p>Detect that schema in the LLM output and execute the function</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># Tool calling functions (tools.py)</span>
<span class="hljs-keyword">import</span> json

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">add_event</span>(<span class="hljs-params">title: str, date: str</span>) -&gt; dict:</span>
    <span class="hljs-comment"># Replace with actual calendar integration</span>
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"ok"</span>, <span class="hljs-string">"title"</span>: title, <span class="hljs-string">"date"</span>: date}

TOOLS = {
    <span class="hljs-string">"add_event"</span>: add_event,
}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_tool</span>(<span class="hljs-params">call_json: str</span>) -&gt; str:</span>
    <span class="hljs-string">"""call_json: '{"tool":"add_event","args":{"title":"Dentist","date":"2025-11-10 10:00"}}'"""</span>
    data = json.loads(call_json)
    name = data[<span class="hljs-string">"tool"</span>]
    args = data.get(<span class="hljs-string">"args"</span>, {})
    <span class="hljs-keyword">if</span> name <span class="hljs-keyword">in</span> TOOLS:
        result = TOOLS[name](**args)
        <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"tool_result"</span>: result})
    <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"error"</span>: <span class="hljs-string">"unknown tool"</span>})
</code></pre>
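<p>A small model will sometimes emit malformed JSON or leave out an argument, and <code>run_tool</code> as written would raise in those cases. One way to harden it, shown here as a sketch under assumptions, is to validate the call before running it. <code>run_tool_safely</code> is a hypothetical name, and <code>add_event</code> is repeated only so the example is self-contained.</p>

```python
import inspect
import json


def add_event(title: str, date: str) -> dict:
    # Stand-in for the add_event function defined in tools.py
    return {"status": "ok", "title": title, "date": date}


TOOLS = {"add_event": add_event}


def run_tool_safely(call_json: str) -> str:
    """Like run_tool, but returns an error object instead of raising."""
    try:
        data = json.loads(call_json)
    except json.JSONDecodeError:
        return json.dumps({"error": "invalid JSON"})
    name = data.get("tool") if isinstance(data, dict) else None
    if not isinstance(name, str) or name not in TOOLS:
        return json.dumps({"error": "unknown tool"})
    args = data.get("args", {})
    if not isinstance(args, dict):
        return json.dumps({"error": "args must be an object"})
    func = TOOLS[name]
    try:
        # bind() raises TypeError when arguments are missing or unexpected
        inspect.signature(func).bind(**args)
    except TypeError as exc:
        return json.dumps({"error": "bad arguments: " + str(exc)})
    return json.dumps({"tool_result": func(**args)})
```

<p>Returning the error as JSON means the caller can pass it back to the model or speak a short apology instead of crashing mid-conversation.</p>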
<p>Prompt the model to use tools:</p>
<pre><code class="lang-python"><span class="hljs-comment"># LLM wrapper enabling tool use (llm_with_tools.py)</span>
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
<span class="hljs-keyword">import</span> json, sys

SYSTEM = (
    <span class="hljs-string">"You can call tools by emitting a single JSON object with keys 'tool' and 'args'. "</span>
    <span class="hljs-string">"Available tools: add_event(title:str, date:str). "</span>
    <span class="hljs-string">"If no tool is needed, answer directly."</span>
)

MODEL = <span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>
engine = MLCEngine(model=MODEL)
user = sys.argv[<span class="hljs-number">1</span>]
<span class="hljs-comment"># MLCEngine exposes an OpenAI-style chat completions API</span>
resp = engine.chat.completions.create(
    messages=[
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: SYSTEM},
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: user},
    ],
    model=MODEL,
)
print(resp.choices[<span class="hljs-number">0</span>].message.content)
engine.terminate()
</code></pre>
<p>And then glue it together:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Run LLM with tool call detection (run_with_tools.py)</span>
<span class="hljs-keyword">import</span> subprocess, json
<span class="hljs-keyword">from</span> tools <span class="hljs-keyword">import</span> run_tool

user = <span class="hljs-string">"Add a dentist appointment next Thursday at 10"</span>
raw = subprocess.check_output([<span class="hljs-string">"python"</span>, <span class="hljs-string">"llm_with_tools.py"</span>, user]).decode().strip()

<span class="hljs-comment"># If the model returned a JSON tool call, run it</span>
<span class="hljs-keyword">try</span>:
    data = json.loads(raw)
    <span class="hljs-keyword">if</span> isinstance(data, dict) <span class="hljs-keyword">and</span> <span class="hljs-string">"tool"</span> <span class="hljs-keyword">in</span> data:
        print(<span class="hljs-string">"Tool call:"</span>, data)
        print(run_tool(raw))
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">"Assistant:"</span>, raw)
<span class="hljs-keyword">except</span> Exception:
    print(<span class="hljs-string">"Assistant:"</span>, raw)
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python">python run_with_tools.py
</code></pre>
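<p>The glue script only detects a tool call when the entire model output is valid JSON. In practice a model may wrap the JSON in a sentence or two. A more forgiving approach, sketched here with a hypothetical helper named <code>extract_tool_call</code>, scans the text for the first JSON object that contains a <code>"tool"</code> key, using the standard library's <code>json.JSONDecoder.raw_decode</code>.</p>

```python
import json


def extract_tool_call(text):
    """Return the first JSON object in `text` that has a "tool" key, else None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at index `start`
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "tool" in obj:
            return obj
        # Not a tool call here; try the next opening brace
        start = text.find("{", start + 1)
    return None
```

<p>You could then pass <code>json.dumps(call)</code> to <code>run_tool</code> when a call is found, and treat the raw text as a normal answer otherwise.</p>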
<h2 id="heading-step-7-memory-and-personalization">Step 7: Memory and Personalization</h2>
<p><strong>What this step does:</strong> Allows the assistant to remember personal information you share so conversations feel continuous and adaptive.</p>
<p>A helpful assistant should feel like it learns alongside you. Memory allows the system to keep track of small details you mention naturally in conversation.</p>
<p>Without memory, every conversation starts from scratch. With memory, your assistant can remember personal facts (for example, birthdays, favorite music), your routines, device settings, or notes you mention in conversation. This unlocks more natural interactions and enables personalization over time.</p>
<p>You can start with a simple key-value store and expand over time. Your program reads memory before inference and writes back new facts after.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Simple key-value memory store (memory.py)</span>
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path

MEM_PATH = Path(<span class="hljs-string">"memory.json"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">mem_load</span>():</span>
    <span class="hljs-keyword">return</span> json.loads(MEM_PATH.read_text()) <span class="hljs-keyword">if</span> MEM_PATH.exists() <span class="hljs-keyword">else</span> {}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">mem_save</span>(<span class="hljs-params">mem</span>):</span>
    MEM_PATH.write_text(json.dumps(mem, indent=<span class="hljs-number">2</span>))

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">remember</span>(<span class="hljs-params">key: str, value: str</span>):</span>
    mem = mem_load()
    mem[key] = value
    mem_save(mem)
</code></pre>
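<p>Two small refinements are worth considering once the basic store works: writing the file atomically, so an interrupted write cannot corrupt it, and formatting the facts as readable text for the system prompt. The sketch below illustrates both. It is a variation on <code>memory.py</code>, not a replacement: the explicit <code>path</code> parameter and the <code>memory_prompt</code> helper are assumptions added for this example.</p>

```python
import json
import os
import tempfile
from pathlib import Path


def mem_load(path):
    return json.loads(path.read_text()) if path.exists() else {}


def remember(path, key, value):
    mem = mem_load(path)
    mem[key] = value
    # Write to a temporary file first, then swap it in, so an interrupted
    # write cannot leave a half-written memory file behind
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(mem, indent=2))
    os.replace(tmp, path)


def memory_prompt(path):
    """Format stored facts as text for a system message."""
    mem = mem_load(path)
    if not mem:
        return "No known facts yet."
    return "Known facts:\n" + "\n".join("- " + k + ": " + v for k, v in mem.items())


# Demo in a throwaway folder so no real memory file is touched
with tempfile.TemporaryDirectory() as folder:
    demo_path = Path(folder) / "memory.json"
    remember(demo_path, "favorite music", "jazz")
    print(memory_prompt(demo_path))
```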
<p>Use memory in the loop:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Voice loop with memory loading and updating (voice_loop_with_memory.py)</span>
<span class="hljs-keyword">import</span> subprocess, json, time
<span class="hljs-keyword">from</span> memory <span class="hljs-keyword">import</span> mem_load, remember

<span class="hljs-comment"># 1) Record, wait for the 4 second recording to finish, and transcribe</span>
subprocess.run([<span class="hljs-string">"termux-microphone-record"</span>, <span class="hljs-string">"-f"</span>, <span class="hljs-string">"in.wav"</span>, <span class="hljs-string">"-l"</span>, <span class="hljs-string">"4"</span>])
time.sleep(<span class="hljs-number">5</span>)
subprocess.run([<span class="hljs-string">"termux-microphone-record"</span>, <span class="hljs-string">"-q"</span>])
user_text = subprocess.check_output([<span class="hljs-string">"python"</span>, <span class="hljs-string">"asr_transcribe.py"</span>, <span class="hljs-string">"in.wav"</span>]).decode().strip()

<span class="hljs-comment"># 2) Load memory and add as system context</span>
mem = mem_load()
SYSTEM = <span class="hljs-string">"Known facts: "</span> + json.dumps(mem)

<span class="hljs-comment"># 3) Ask the model</span>
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
MODEL = <span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>
engine = MLCEngine(model=MODEL)
<span class="hljs-comment"># MLCEngine exposes an OpenAI-style chat completions API</span>
resp = engine.chat.completions.create(
    messages=[
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: SYSTEM},
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: user_text},
    ],
    model=MODEL,
)
reply = resp.choices[<span class="hljs-number">0</span>].message.content
print(<span class="hljs-string">"Assistant:"</span>, reply)

<span class="hljs-comment"># 4) Very simple pattern: if the user said "remember X is Y", store it</span>
<span class="hljs-keyword">if</span> user_text.lower().startswith(<span class="hljs-string">"remember "</span>) <span class="hljs-keyword">and</span> <span class="hljs-string">" is "</span> <span class="hljs-keyword">in</span> user_text:
    k, v = user_text[<span class="hljs-number">9</span>:].split(<span class="hljs-string">" is "</span>, <span class="hljs-number">1</span>)
    remember(k.strip(), v.strip())
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python">python voice_loop_with_memory.py
</code></pre>
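<p>The <code>startswith("remember ")</code> check is deliberately simple, and transcribed speech tends to arrive with extra words and trailing punctuation. A regular expression can absorb some of that variation. The following is a rough sketch: <code>parse_remember</code> and the pattern are illustrative assumptions, and lowercasing the key is a design choice made here so that "My Birthday" and "my birthday" map to the same entry.</p>

```python
import re

# Matches "remember X is Y", with an optional "please" or "that"
# and optional trailing punctuation added by the speech recognizer
REMEMBER_PATTERN = re.compile(
    r"^\s*(?:please\s+)?remember\s+(?:that\s+)?(.+?)\s+is\s+(.+?)[\s.!?]*$",
    re.IGNORECASE,
)


def parse_remember(utterance):
    """Return (key, value) for 'remember X is Y' style utterances, else None."""
    match = REMEMBER_PATTERN.match(utterance)
    if not match:
        return None
    key = match.group(1).strip().lower()
    value = match.group(2).strip()
    return key, value
```

<p>In the loop, you would call <code>parse_remember(user_text)</code> and pass the result to <code>remember()</code> whenever it is not <code>None</code>.</p>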
<h2 id="heading-step-8-retrieval-augmented-generation-rag">Step 8: Retrieval-Augmented Generation (RAG)</h2>
<p><strong>What this step does:</strong> Lets the assistant search your offline notes or documents at answer time, improving accuracy for personal tasks.</p>
<p>A language model cannot magically know details about your life, your work, or your files unless you give it a way to look things up.</p>
<p><a target="_blank" href="https://www.freecodecamp.org/news/learn-rag-fundamentals-and-advanced-techniques/">Retrieval-Augmented Generation (RAG)</a> bridges that gap. RAG lets the assistant search your own stored data at query time and reference your actual notes when answering, instead of relying only on what the model learned during training. This means it can answer questions about your projects, home details, travel plans, studies, or any personal documents you store, completely offline.</p>
<p>To use RAG, we first install a lightweight vector database, then add documents to it, and later query it when answering questions.</p>
<p>Install the vector store:</p>
<pre><code class="lang-python">pip install chromadb
</code></pre>
<p>Add and search your notes:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Local vector DB indexing and querying (rag.py)</span>
<span class="hljs-keyword">from</span> chromadb <span class="hljs-keyword">import</span> PersistentClient

<span class="hljs-comment"># Persist to disk so other scripts can read the same notes later</span>
client = PersistentClient(path=<span class="hljs-string">"chroma_db"</span>)
notes = client.get_or_create_collection(<span class="hljs-string">"notes"</span>)

<span class="hljs-comment"># Add your documents (repeat as needed); upsert makes the script safe to re-run</span>
notes.upsert(documents=[<span class="hljs-string">"Contractor quote was 42000 United States Dollars for the extension."</span>], ids=[<span class="hljs-string">"q1"</span>])

<span class="hljs-comment"># Query the local vector database</span>
results = notes.query(query_texts=[<span class="hljs-string">"extension quote"</span>], n_results=<span class="hljs-number">1</span>)
context = results[<span class="hljs-string">"documents"</span>][<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]
print(context)
</code></pre>
<p>Use retrieved context in responses:</p>
<pre><code class="lang-python"><span class="hljs-comment"># LLM answering using retrieved context (llm_with_rag.py)</span>
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
<span class="hljs-keyword">from</span> chromadb <span class="hljs-keyword">import</span> PersistentClient

MODEL = <span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>
engine = MLCEngine(model=MODEL)
<span class="hljs-comment"># Open the same on-disk store that rag.py wrote to</span>
client = PersistentClient(path=<span class="hljs-string">"chroma_db"</span>)
notes = client.get_or_create_collection(<span class="hljs-string">"notes"</span>)

question = <span class="hljs-string">"What was the quoted amount for the home extension?"</span>
res = notes.query(query_texts=[question], n_results=<span class="hljs-number">2</span>)
<span class="hljs-comment"># res["documents"] holds one list of matches per query; we sent a single query</span>
ctx = <span class="hljs-string">"\n"</span>.join(res[<span class="hljs-string">"documents"</span>][<span class="hljs-number">0</span>])

SYSTEM = <span class="hljs-string">"Use the provided context to answer accurately. If missing, say you do not know.\nContext:\n"</span> + ctx
<span class="hljs-comment"># MLCEngine exposes an OpenAI-style chat completions API</span>
ans = engine.chat.completions.create(
    messages=[
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: SYSTEM},
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: question},
    ],
    model=MODEL,
)
print(ans.choices[<span class="hljs-number">0</span>].message.content)
engine.terminate()
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-bash">python rag.py
python llm_with_rag.py
</code></pre>
<h2 id="heading-step-9-multi-step-agentic-workflow">Step 9: Multi-Step Agentic Workflow</h2>
<p><strong>What this step does:</strong> Combines listening, reasoning, memory, and tool usage into a multi-step routine that runs automatically.</p>
<p>Now that the assistant can listen, respond, remember facts, and call tools, we can combine those abilities into a small routine that performs several steps automatically.</p>
<p><strong>Practical example: "Morning Briefing" on your phone</strong></p>
<p>Goal: when you say <em>"Give me my morning briefing and text it to my partner"</em>, the assistant will:</p>
<ol>
<li><p>Read today's agenda from a local file,</p>
</li>
<li><p>summarize it,</p>
</li>
<li><p>speak it aloud, and</p>
</li>
<li><p>send the summary via SMS using Termux.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762319593253/99e670d4-4934-47ce-a164-f0f7880ea80f.png" alt="Multi-step morning briefing workflow with retrieval, summary, speech output, and SMS action." class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><em>Diagram: Multi-step morning briefing workflow with retrieval, summary, speech output, and SMS action.</em></p>
<h3 id="heading-prepare-your-agenda-file">Prepare your agenda file</h3>
<p>This file stores your events for the day. You can edit it manually, generate it, or sync it later if you want.</p>
<p>Create <code>agenda.json</code> in the same folder:</p>
<pre><code class="lang-json">{
  <span class="hljs-string">"2025-11-03"</span>: [
    {<span class="hljs-string">"time"</span>: <span class="hljs-string">"09:30"</span>, <span class="hljs-string">"title"</span>: <span class="hljs-string">"Standup meeting"</span>},
    {<span class="hljs-string">"time"</span>: <span class="hljs-string">"13:00"</span>, <span class="hljs-string">"title"</span>: <span class="hljs-string">"Lunch with Priya"</span>},
    {<span class="hljs-string">"time"</span>: <span class="hljs-string">"16:30"</span>, <span class="hljs-string">"title"</span>: <span class="hljs-string">"Gym"</span>}
  ]
}
</code></pre>
<p>Phone-integrated tools for this workflow:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Phone-integrated agent tools (tools_phone.py)</span>
<span class="hljs-keyword">import</span> json, subprocess, datetime
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path

AGENDA_PATH = Path(<span class="hljs-string">"agenda.json"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_today_agenda</span>():</span>
    today = datetime.date.today().isoformat()
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> AGENDA_PATH.exists():
        <span class="hljs-keyword">return</span> []
    data = json.loads(AGENDA_PATH.read_text())
    <span class="hljs-keyword">return</span> data.get(today, [])

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">send_sms</span>(<span class="hljs-params">number: str, text: str</span>) -&gt; dict:</span>
    <span class="hljs-comment"># Requires Termux:API and SMS permission</span>
    subprocess.run([<span class="hljs-string">"termux-sms-send"</span>, <span class="hljs-string">"-n"</span>, number, text], check=<span class="hljs-literal">True</span>)
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"sent"</span>, <span class="hljs-string">"to"</span>: number}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">notify</span>(<span class="hljs-params">title: str, content: str</span>) -&gt; dict:</span>
    subprocess.run([<span class="hljs-string">"termux-notification"</span>, <span class="hljs-string">"--title"</span>, title, <span class="hljs-string">"--content"</span>, content])
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"notified"</span>}
</code></pre>
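<p>Because <code>load_today_agenda</code> looks up today's ISO date, the sample file above only returns events on 2025-11-03. A minimal sketch, assuming you want a test entry for the current day (the helper name <code>write_sample_agenda</code> is hypothetical and not part of the original scripts), could write the file like this:</p>

```python
import datetime
import json
import tempfile
from pathlib import Path


def write_sample_agenda(path, events):
    """Write events under today's ISO date so a date-keyed lookup finds them."""
    today = datetime.date.today().isoformat()
    path = Path(path)
    # Keep entries for other dates if the file already exists.
    data = json.loads(path.read_text()) if path.exists() else {}
    data[today] = events
    path.write_text(json.dumps(data, indent=2))
    return today


# Demo against a temporary folder so an existing agenda.json is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "agenda.json"
    key = write_sample_agenda(demo_path, [{"time": "09:30", "title": "Standup meeting"}])
    stored = json.loads(demo_path.read_text())
    print(key, stored[key])
```

<p>To use it for real, you would pass <code>"agenda.json"</code> as the path instead of the temporary file.</p>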
<p>Create the agent routine:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Multi-step morning briefing agent (agent_morning.py)</span>
<span class="hljs-keyword">import</span> json, subprocess, os
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
<span class="hljs-keyword">from</span> tools_phone <span class="hljs-keyword">import</span> load_today_agenda, send_sms, notify

PARTNER_PHONE = os.environ.get(<span class="hljs-string">"PARTNER_PHONE"</span>, <span class="hljs-string">"+15551234567"</span>)

TOOLS = {
    <span class="hljs-string">"send_sms"</span>: send_sms,
    <span class="hljs-string">"notify"</span>: notify,
}

SYSTEM = (
  <span class="hljs-string">"You assist on a phone. You may emit a single-line JSON when an action is needed "</span>
  <span class="hljs-string">"with keys 'tool' and 'args'. Available tools: send_sms(number:str, text:str), "</span>
  <span class="hljs-string">"notify(title:str, content:str). Keep messages concise. If no tool is needed, answer in plain text."</span>
)

engine = MLCEngine(model=<span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>)

agenda = load_today_agenda()
agenda_text = <span class="hljs-string">"\n"</span>.join(<span class="hljs-string">f"<span class="hljs-subst">{e[<span class="hljs-string">'time'</span>]}</span> - <span class="hljs-subst">{e[<span class="hljs-string">'title'</span>]}</span>"</span> <span class="hljs-keyword">for</span> e <span class="hljs-keyword">in</span> agenda) <span class="hljs-keyword">or</span> <span class="hljs-string">"No events for today."</span>

user_request = <span class="hljs-string">"Give me my morning briefing and text it to my partner."</span>

<span class="hljs-comment"># 1) Ask LLM for a 2-3 sentence summary to speak</span>
summary = engine.chat([
  {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Summarize this agenda in 2-3 sentences for a morning briefing:"</span>},
  {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: agenda_text},
])
summary_text = summary.get(<span class="hljs-string">"message"</span>, summary) <span class="hljs-keyword">if</span> isinstance(summary, dict) <span class="hljs-keyword">else</span> str(summary)
print(<span class="hljs-string">"Briefing:\n"</span>, summary_text)

<span class="hljs-comment"># 2) Speak locally (prefer XTTS, fallback to system TTS)</span>
<span class="hljs-keyword">try</span>:
    subprocess.run([<span class="hljs-string">"python"</span>, <span class="hljs-string">"speak_xtts.py"</span>, summary_text], check=<span class="hljs-literal">True</span>)
    subprocess.run([<span class="hljs-string">"termux-media-player"</span>, <span class="hljs-string">"play"</span>, <span class="hljs-string">"out.wav"</span>]) 
<span class="hljs-keyword">except</span> Exception:
    subprocess.run([<span class="hljs-string">"termux-tts-speak"</span>, summary_text])

<span class="hljs-comment"># 3) Ask LLM whether to send SMS and with what text, using tool schema</span>
resp = engine.chat([
  {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: SYSTEM},
  {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">f"User said: '<span class="hljs-subst">{user_request}</span>'. Partner phone is <span class="hljs-subst">{PARTNER_PHONE}</span>. Summary: <span class="hljs-subst">{summary_text}</span>"</span>},
])
msg = resp.get(<span class="hljs-string">"message"</span>, resp) <span class="hljs-keyword">if</span> isinstance(resp, dict) <span class="hljs-keyword">else</span> str(resp)

<span class="hljs-comment"># 4) If the model requested a tool, execute it</span>
<span class="hljs-keyword">try</span>:
    data = json.loads(msg)
    <span class="hljs-keyword">if</span> isinstance(data, dict) <span class="hljs-keyword">and</span> data.get(<span class="hljs-string">"tool"</span>) <span class="hljs-keyword">in</span> TOOLS:
        <span class="hljs-comment"># Auto-fill phone number if missing</span>
        <span class="hljs-keyword">if</span> data[<span class="hljs-string">"tool"</span>] == <span class="hljs-string">"send_sms"</span> <span class="hljs-keyword">and</span> <span class="hljs-string">"number"</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> data.get(<span class="hljs-string">"args"</span>, {}):
            data.setdefault(<span class="hljs-string">"args"</span>, {})[<span class="hljs-string">"number"</span>] = PARTNER_PHONE
        result = TOOLS[data[<span class="hljs-string">"tool"</span>]](**data.get(<span class="hljs-string">"args"</span>, {}))
        print(<span class="hljs-string">"Tool result:"</span>, result)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">"Assistant:"</span>, msg)
<span class="hljs-keyword">except</span> Exception:
    print(<span class="hljs-string">"Assistant:"</span>, msg)
</code></pre>
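<p>Small local models often wrap the JSON in extra prose or invent argument names, so it can help to validate a tool request before executing it. The sketch below is one possible approach, not part of the original script: <code>parse_tool_call</code> and <code>ALLOWED_ARGS</code> are hypothetical names, and the two sample replies stand in for model output.</p>

```python
import json

# Hypothetical allow-list: tool name mapped to the argument names it accepts.
ALLOWED_ARGS = {
    "send_sms": {"number", "text"},
    "notify": {"title", "content"},
}


def parse_tool_call(reply):
    """Return (tool, args) if reply contains a valid tool request, else None."""
    # Take the outermost {...} span so surrounding prose is ignored.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    tool, args = data.get("tool"), data.get("args", {})
    if not isinstance(tool, str) or tool not in ALLOWED_ARGS:
        return None
    if not isinstance(args, dict):
        return None
    # Reject calls that carry argument names the tool does not define.
    if not set(args).issubset(ALLOWED_ARGS[tool]):
        return None
    return tool, args


good = 'Sure: {"tool": "send_sms", "args": {"number": "+15551234567", "text": "Standup at 09:30"}}'
bad = '{"tool": "delete_files", "args": {}}'
print(parse_tool_call(good))
print(parse_tool_call(bad))
```

<p>In the agent script, a <code>None</code> result would fall through to printing the plain-text reply, while a valid pair could be dispatched through the <code>TOOLS</code> dictionary.</p>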
<p><strong>Run it now:</strong></p>
<pre><code class="lang-bash">export PARTNER_PHONE=+<span class="hljs-number">15551234567</span>
python agent_morning.py
</code></pre>
<p>This example is realistic on Android because it uses Termux utilities you already installed: local TTS for speech output, <code>termux-sms-send</code> for messaging, and <code>termux-notification</code> for a quick on-device confirmation. You can extend it with a Home Assistant tool later if you have a local server (for example, to toggle lights or set thermostat scenes).</p>
<h2 id="heading-conclusion-and-next-steps">Conclusion and Next Steps</h2>
<p>Building a fully local voice assistant is an incremental process. Each step you added – speech recognition, text generation, memory, retrieval, and tool execution – unlocked new capabilities and moved the system closer to behaving like a real assistant.</p>
<p>You built a fully local voice assistant on your phone with:</p>
<ul>
<li><p>On-device Automatic Speech Recognition with Whisper (with Faster-Whisper fallback)</p>
</li>
<li><p>On-device reasoning with MLC Large Language Model</p>
</li>
<li><p>Local Text-to-Speech using the built-in system TTS</p>
</li>
<li><p>Tool calling for real actions</p>
</li>
<li><p>Memory and personalization</p>
</li>
<li><p>Retrieval-Augmented Generation for document-based knowledge</p>
</li>
<li><p>A simple agent loop for multi-step work</p>
</li>
</ul>
<p>From here you can add:</p>
<ul>
<li><p>Wake word detection (for example, Porcupine or open wake word models)</p>
</li>
<li><p>Device-specific integrations (for example, Home Assistant, smart lighting)</p>
</li>
<li><p>Better memory schemas and calendars or contacts adapters</p>
</li>
</ul>
<p>Your data never leaves your device, and you control every part of the stack. This is a private, customizable assistant you can expand however you like.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build RAG AI Agents with TypeScript ]]>
                </title>
                <description>
                    <![CDATA[ The most powerful AI systems don’t just generate – they also retrieve, reason, and respond with context. Retrieval-Augmented Generation (RAG) is how we get there. It combines the strengths of search and generation to build more accurate, reliable, an... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-rag-ai-agents-with-typescript/</link>
                <guid isPermaLink="false">67ffc2a75667d9e59ef9bc61</guid>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ai agents ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ TypeScript ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Maham Codes ]]>
                </dc:creator>
                <pubDate>Wed, 16 Apr 2025 14:45:59 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1744814746615/72626297-def9-466a-8c1a-2cdb1b411300.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The most powerful AI systems don’t just generate – they also retrieve, reason, and respond with context. Retrieval-Augmented Generation (RAG) is how we get there. It combines the strengths of search and generation to build more accurate, reliable, and context-aware AI systems.</p>
<p>In this guide, you'll build a RAG-based AI agent in TypeScript using Langbase SDK. You'll plug in your own data as memory, use any embedding model, retrieve relevant context, and call an LLM to generate a precise response.</p>
<p>By the end of this tutorial, you'll have a working RAG system that:</p>
<ul>
<li><p>Stores and retrieves documents with semantic memory</p>
</li>
<li><p>Uses custom embeddings for vector search</p>
</li>
<li><p>Handles user queries with relevant context</p>
</li>
<li><p>Generates responses via OpenAI, Anthropic, or any LLM</p>
</li>
</ul>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-agentic-rag">What is agentic RAG?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-langbase-sdk">Langbase SDK</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-setup-your-project">Step 1: Set up your project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-get-langbase-api-key">Step 2: Get Langbase API Key</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-add-llm-api-keys">Step 3: Add LLM API keys</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-create-an-agentic-ai-memory">Step 4: Create an agentic AI memory</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-add-documents-to-ai-memory">Step 5: Add documents to AI memory</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-perform-rag-retrieval">Step 6: Perform RAG retrieval</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-7-create-support-pipe-agent">Step 7: Create support pipe agent</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-8-generate-rag-responses">Step 8: Generate RAG responses</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-result">The result</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before we begin creating a RAG-based AI agent, you’ll need to have some tools ready to go.</p>
<p>In this tutorial, I’ll be using the following tech stack:</p>
<ul>
<li><p><a target="_blank" href="http://langbase.com/">Langbase</a> – the platform to build and deploy your serverless AI agents.</p>
</li>
<li><p><a target="_blank" href="https://langbase.com/docs/sdk">Langbase SDK</a> – a TypeScript AI SDK, designed to work with JavaScript, TypeScript, Node.js, Next.js, React, and the like.</p>
</li>
<li><p><a target="_blank" href="https://openai.com/">OpenAI</a> – to get the LLM key for the preferred model.</p>
</li>
</ul>
<p>You’ll also need to:</p>
<ul>
<li><p>Sign up on Langbase to get access to the API key.</p>
</li>
<li><p>Sign up on OpenAI to generate the LLM key for the model you want to use (for this demo, I’ll be using the <code>openai:text-embedding-3-large</code> model). You can generate the key <a target="_blank" href="https://platform.openai.com/api-keys">here</a>.</p>
</li>
</ul>
<h2 id="heading-what-is-agentic-rag">What is Agentic RAG?</h2>
<p>Retrieval-augmented generation (RAG) is an architecture for optimizing the performance of an artificial intelligence (AI) model by connecting it with external knowledge bases. RAG helps large language models (LLMs) deliver more relevant, higher-quality responses.</p>
<p>When we use AI agents to facilitate RAG, it becomes <strong>Agentic RAG.</strong> Agentic RAG systems add AI agents to the RAG pipeline to increase adaptability and accuracy. Compared to traditional RAG systems, agentic RAG allows LLMs to conduct information retrieval from multiple sources and handle more complex workflows.</p>
<p>Here’s a side-by-side comparison of RAG and agentic RAG:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Feature</strong></td><td><strong>RAG</strong></td><td><strong>Agentic RAG</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Task Complexity</td><td>Simple query tasks – no complex decision-making</td><td>Handles complex, multi-step tasks using multiple tools and agents</td></tr>
<tr>
<td>Decision-Making</td><td>Limited – no autonomy</td><td>Agents decide what to retrieve, how to grade, reason, reflect, and generate</td></tr>
<tr>
<td>Multi-Step Reasoning</td><td>Single-step queries and responses only</td><td>Supports multi-step reasoning with retrieval, grading, filtering, and evaluation</td></tr>
<tr>
<td>Key Role</td><td>LLM + external data for answers</td><td>Adds intelligent agents for retrieval, generation, critique, and orchestration</td></tr>
<tr>
<td>Real-Time Data Retrieval</td><td>Not supported</td><td>Built for real-time retrieval and dynamic integration</td></tr>
<tr>
<td>Retrieval Integration</td><td>Static, pre-defined vector databases</td><td>Agents dynamically retrieve from diverse and flexible sources</td></tr>
<tr>
<td>Context Awareness</td><td>Static context – no runtime adaptability</td><td>High – agents adapt to queries, pull relevant context, and fetch live data if needed</td></tr>
</tbody>
</table>
</div><h2 id="heading-langbase-sdk">Langbase SDK</h2>
<p>The Langbase SDK makes it easy to build powerful AI tools using TypeScript. It gives you everything you need to work with any LLM, connect your own embedding models, manage document memory, and build AI agents that can reason and respond.</p>
<p>The SDK is designed to work with Node.js, Next.js, React, or any modern JavaScript stack. You can use it to upload documents, create semantic memory, and run AI workflows (called Pipe agents) with just a few lines of code.</p>
<p>Langbase is an API-first AI platform. Its TypeScript SDK smooths out the experience, making it easy to get started without dealing with infrastructure. Just drop in your API key, write your logic, and you're good to go.</p>
<p>Now that you know about Langbase SDK, let’s start building the RAG agent.</p>
<h2 id="heading-step-1-setup-your-project">Step 1: Set Up Your Project</h2>
<p>We’ll be building a basic Node.js app in TypeScript that uses the Langbase SDK to create an agentic RAG system. For that, create a new directory for your project and navigate to it.</p>
<pre><code class="lang-bash">mkdir agentic-rag &amp;&amp; <span class="hljs-built_in">cd</span> agentic-rag
</code></pre>
<p>Then initialize a Node.js project and create different TypeScript files by running this command in your terminal:</p>
<pre><code class="lang-bash">npm init -y &amp;&amp; touch index.ts agents.ts create-memory.ts upload-docs.ts create-pipe.ts
</code></pre>
<p>Here’s a breakdown of what each file will do in the project:</p>
<ul>
<li><p><strong>index.ts:</strong> This is typically the entry point of a TypeScript project. It orchestrates agent creation, memory setup, and document upload.</p>
</li>
<li><p><strong>agents.ts:</strong> This file handles AI agent creation and configuration.</p>
</li>
<li><p><strong>create-memory.ts:</strong> This sets up Langbase Memory (RAG) for storing and retrieving context.</p>
</li>
<li><p><strong>upload-docs.ts:</strong> This file will upload documents to Memory so agents can access and use them.</p>
</li>
<li><p><strong>create-pipe.ts:</strong> This file sets up a <a target="_blank" href="https://langbase.com/docs/pipe/quickstart">Langbase Pipe agent</a>, which is a serverless AI agent with unified APIs for every LLM.</p>
</li>
</ul>
<p>After this, we will be using the Langbase SDK to create RAG agents and <code>dotenv</code> to manage environment variables. So, let's install these dependencies.</p>
<pre><code class="lang-bash">npm i langbase dotenv
</code></pre>
<h2 id="heading-step-2-get-langbase-api-key">Step 2: Get Langbase API Key</h2>
<p>Every request you send to Langbase needs an API key. You can generate API keys from the <a target="_blank" href="https://studio.langbase.com/">Langbase studio</a> by following these steps:</p>
<ol>
<li><p>Switch to your user or org account.</p>
</li>
<li><p>From the sidebar, click on the <code>Settings</code> menu.</p>
</li>
<li><p>In the developer settings section, click on the <code>Langbase API keys</code> link.</p>
</li>
<li><p>From here you can create a new API key or manage existing ones.</p>
</li>
</ol>
<p>For more details, check out the Langbase API keys documentation.</p>
<p>After generating the API key, create a <code>.env</code> file in the root of your project and add your Langbase API key to it:</p>
<pre><code class="lang-bash">LANGBASE_API_KEY=xxxxxxxxx
</code></pre>
<p>Replace xxxxxxxxx with your Langbase API key.</p>
<h2 id="heading-step-3-add-llm-api-keys">Step 3: Add LLM API Keys</h2>
<p>Once you have the Langbase API key, you’ll need the LLM key as well to run the RAG agent. If you have set up LLM API keys in your profile, the AI memory and agent pipe will automatically use them. Otherwise, navigate to the LLM API keys page and add keys for different providers like OpenAI, Anthropic, and so on.</p>
<p>Follow these steps to add the LLM keys:</p>
<ol>
<li><p>Open Langbase studio.</p>
</li>
<li><p>Switch to your user or org account.</p>
</li>
<li><p>From the sidebar, click on the <code>Settings</code> menu.</p>
</li>
<li><p>In the developer settings section, click on the <code>LLM API keys</code> link.</p>
</li>
<li><p>From here you can add LLM API keys for different providers like OpenAI, TogetherAI, Anthropic, and so on.</p>
</li>
</ol>
<h2 id="heading-step-4-create-an-agentic-ai-memory">Step 4: Create an Agentic AI Memory</h2>
<p>Let’s now use the Langbase SDK to create an AI memory (Langbase memory agent) where your agent can store and retrieve context.</p>
<p>Langbase serverless memory agents (long-term memory solution) are designed to acquire, process, retain, and retrieve information seamlessly. They dynamically attach private data to any LLM, enabling context-aware responses in real time and reducing hallucinations.</p>
<p>These agents combine vector storage, RAG, and internet access to create a powerful managed context search API. You can use them to build smarter, more capable AI applications.</p>
<p>In a RAG setup, memory – when connected directly to a Langbase Pipe Agent – becomes a memory agent. This pairing gives the LLM the ability to fetch relevant data and deliver precise, contextually accurate answers – addressing the limitations of LLMs when it comes to handling private data.</p>
<p>To create it, add the following code to the <code>create-memory.ts</code> file you created in Step 1:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> <span class="hljs-string">'dotenv/config'</span>;
<span class="hljs-keyword">import</span> {Langbase} <span class="hljs-keyword">from</span> <span class="hljs-string">'langbase'</span>;

<span class="hljs-keyword">const</span> langbase = <span class="hljs-keyword">new</span> Langbase({
    apiKey: process.env.LANGBASE_API_KEY!,
});

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">main</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">const</span> memory = <span class="hljs-keyword">await</span> langbase.memories.create({
        name: <span class="hljs-string">'knowledge-base'</span>,
        description: <span class="hljs-string">'An AI memory for agentic memory workshop'</span>,
        embedding_model: <span class="hljs-string">'openai:text-embedding-3-large'</span>
    });

    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'AI Memory:'</span>, memory);
}

main();
</code></pre>
<p>Here’s what’s happening in the above code:</p>
<ul>
<li><p>Import the <code>dotenv</code> package to load environment variables.</p>
</li>
<li><p>Import the <code>Langbase</code> class from the langbase package.</p>
</li>
<li><p>Create a new instance of the Langbase class with your API key.</p>
</li>
<li><p>Use the <code>memories.create</code> method to create a new AI memory.</p>
</li>
<li><p>Set the name and description of the memory.</p>
</li>
<li><p>Use the <code>openai:text-embedding-3-large</code> model for embedding.</p>
</li>
<li><p>Log the created memory to the console.</p>
</li>
</ul>
<p>After this, let's create the agentic memory by running the <code>create-memory.ts</code> file.</p>
<pre><code class="lang-bash">npx tsx create-memory.ts
</code></pre>
<p>This will create an AI memory and log the memory details to the console.</p>
<h2 id="heading-step-5-add-documents-to-ai-memory">Step 5: Add Documents to AI Memory</h2>
<p>Now that you’ve created an AI memory agent, the next step is to add documents to it. These documents will serve as the context your agent can reference during interactions.</p>
<p>First, create a docs directory in your project root, and add two sample text files:</p>
<ul>
<li><p><a target="_blank" href="https://langbase.com/docs/examples/agent-architectures">agent-architectures.txt</a></p>
</li>
<li><p><a target="_blank" href="http://langbase.com/docs">langbase-faq.txt</a></p>
</li>
</ul>
<p>Next, open the <code>upload-docs.ts</code> file created in Step 1 and paste in the following code:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> <span class="hljs-string">'dotenv/config'</span>;
<span class="hljs-keyword">import</span> { Langbase } <span class="hljs-keyword">from</span> <span class="hljs-string">'langbase'</span>;
<span class="hljs-keyword">import</span> { readFile } <span class="hljs-keyword">from</span> <span class="hljs-string">'fs/promises'</span>;
<span class="hljs-keyword">import</span> path <span class="hljs-keyword">from</span> <span class="hljs-string">'path'</span>;

<span class="hljs-keyword">const</span> langbase = <span class="hljs-keyword">new</span> Langbase({
    apiKey: process.env.LANGBASE_API_KEY!,
});

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">main</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">const</span> cwd = process.cwd();
    <span class="hljs-keyword">const</span> memoryName = <span class="hljs-string">'knowledge-base'</span>;

    <span class="hljs-comment">// Upload agent architecture document</span>
    <span class="hljs-keyword">const</span> agentArchitecture = <span class="hljs-keyword">await</span> readFile(path.join(cwd, <span class="hljs-string">'docs'</span>, <span class="hljs-string">'agent-architectures.txt'</span>));
    <span class="hljs-keyword">const</span> agentResult = <span class="hljs-keyword">await</span> langbase.memories.documents.upload({
        memoryName,
        contentType: <span class="hljs-string">'text/plain'</span>,
        documentName: <span class="hljs-string">'agent-architectures.txt'</span>,
        <span class="hljs-built_in">document</span>: agentArchitecture,
        meta: { category: <span class="hljs-string">'Examples'</span>, topic: <span class="hljs-string">'Agent architecture'</span> },
    });

    <span class="hljs-built_in">console</span>.log(agentResult.ok ? <span class="hljs-string">'✓ Agent doc uploaded'</span> : <span class="hljs-string">'✗ Agent doc failed'</span>);

    <span class="hljs-comment">// Upload FAQ document</span>
    <span class="hljs-keyword">const</span> langbaseFaq = <span class="hljs-keyword">await</span> readFile(path.join(cwd, <span class="hljs-string">'docs'</span>, <span class="hljs-string">'langbase-faq.txt'</span>));
    <span class="hljs-keyword">const</span> faqResult = <span class="hljs-keyword">await</span> langbase.memories.documents.upload({
        memoryName,
        contentType: <span class="hljs-string">'text/plain'</span>,
        documentName: <span class="hljs-string">'langbase-faq.txt'</span>,
        <span class="hljs-built_in">document</span>: langbaseFaq,
        meta: { category: <span class="hljs-string">'Support'</span>, topic: <span class="hljs-string">'Langbase FAQs'</span> },
    });

    <span class="hljs-built_in">console</span>.log(faqResult.ok ? <span class="hljs-string">'✓ FAQ doc uploaded'</span> : <span class="hljs-string">'✗ FAQ doc failed'</span>);
}

main();
</code></pre>
<p>Let’s break down what’s happening in this code:</p>
<ul>
<li><p><code>dotenv/config</code> is used to load environment variables from your .env file.</p>
</li>
<li><p>Langbase is imported from the SDK to interact with the API.</p>
</li>
<li><p><code>readFile</code> from the <code>fs/promises</code> module reads each document file asynchronously.</p>
</li>
<li><p><code>path.join()</code> ensures file paths work across different operating systems.</p>
</li>
<li><p>A Langbase client instance is created using your API key.</p>
</li>
<li><p><code>memories.documents.upload</code> is used to upload each <code>.txt</code> file to the AI memory.</p>
</li>
<li><p>Each upload includes metadata like <code>category</code> and <code>topic</code> to help organize the content.</p>
</li>
<li><p>Upload success or failure is logged to the console.</p>
</li>
</ul>
<p>This step ensures your AI agent will have actual content to pull from – FAQs, architecture docs, or anything else you upload into memory.</p>
<p>Then run the <code>upload-docs.ts</code> file with this command in your terminal to upload the documents to the AI memory:</p>
<pre><code class="lang-bash">npx tsx upload-docs.ts
</code></pre>
<h2 id="heading-step-6-perform-rag-retrieval">Step 6: Perform RAG Retrieval</h2>
<p>In this step, we’ll perform RAG against a query using the documents we uploaded to AI memory.</p>
<p>Add the following code to your <code>agents.ts</code> file created in Step 1:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> <span class="hljs-string">'dotenv/config'</span>;
<span class="hljs-keyword">import</span> { Langbase } <span class="hljs-keyword">from</span> <span class="hljs-string">'langbase'</span>;

<span class="hljs-keyword">const</span> langbase = <span class="hljs-keyword">new</span> Langbase({
    apiKey: process.env.LANGBASE_API_KEY!,
});

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">runMemoryAgent</span>(<span class="hljs-params">query: <span class="hljs-built_in">string</span></span>) </span>{
    <span class="hljs-keyword">const</span> chunks = <span class="hljs-keyword">await</span> langbase.memories.retrieve({
        query,
        topK: <span class="hljs-number">4</span>,
        memory: [
            {
                name: <span class="hljs-string">'knowledge-base'</span>,
            },
        ],
    });

    <span class="hljs-keyword">return</span> chunks;
}
</code></pre>
<p>Let’s break down what this does:</p>
<ul>
<li><p>Import the <code>Langbase</code> class from the Langbase SDK.</p>
</li>
<li><p>Initialize the Langbase client using your API key from environment variables.</p>
</li>
<li><p>Define an async function <code>runMemoryAgent</code> that takes a query string as input.</p>
</li>
<li><p>Use <code>memories.retrieve</code> to query the memory for the most relevant chunks, retrieving the top 4 results (<code>topK: 4</code>) from the memory named "knowledge-base".</p>
</li>
<li><p>Return the retrieved memory chunks.</p>
</li>
</ul>
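<p>To build some intuition for what a top-K retrieval does, here is a minimal conceptual sketch. It is not Langbase's implementation: it assumes chunks are compared by cosine similarity, and the type <code>EmbeddedChunk</code>, the functions <code>cosineSimilarity</code> and <code>topKChunks</code>, and the toy two-dimensional embeddings are all invented for illustration.</p>

```typescript
// Conceptual sketch only: NOT how Langbase works internally.
// Rank stored chunks by cosine similarity to a query vector and keep the top K.

type EmbeddedChunk = { text: string; embedding: number[] };

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    a.forEach((value, i) => {
        dot += value * b[i];
        normA += value * value;
        normB += b[i] * b[i];
    });
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk, sort from most to least similar, and keep the first topK.
function topKChunks(queryEmbedding: number[], chunks: EmbeddedChunk[], topK: number) {
    return chunks
        .map((chunk) => ({
            text: chunk.text,
            similarity: cosineSimilarity(queryEmbedding, chunk.embedding),
        }))
        .sort((x, y) => y.similarity - x.similarity)
        .slice(0, topK);
}

// Toy 2-dimensional embeddings, invented for illustration.
const toyChunks: EmbeddedChunk[] = [
    { text: 'routing', embedding: [0, 1] },
    { text: 'parallelization', embedding: [1, 0] },
    { text: 'orchestration', embedding: [1, 1] },
];

const ranked = topKChunks([1, 0], toyChunks, 2);
console.log(ranked.map((chunk) => chunk.text));
```

<p>In the real agent, the embedding and ranking happen inside the memory service, and you only see the resulting chunks together with their <code>similarity</code> scores.</p>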
<p>Now let's add the following code to the <code>index.ts</code> file created in Step 1 to run the memory agent:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { runMemoryAgent } <span class="hljs-keyword">from</span> <span class="hljs-string">'./agents'</span>;

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">main</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">const</span> chunks = <span class="hljs-keyword">await</span> runMemoryAgent(<span class="hljs-string">'What is agent parallelization?'</span>);
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Memory chunk:'</span>, chunks);
}

main();
</code></pre>
<p>This code runs a Langbase memory query for “What is agent parallelization?” It uses <code>runMemoryAgent</code> to retrieve the top matching chunks from your AI memory and logs the results. It’s how you fetch relevant knowledge with RAG.</p>
<p>After this, run the <code>index.ts</code> file with the following command in your terminal to perform RAG retrieval against the query:</p>
<pre><code class="lang-bash">npx tsx index.ts
</code></pre>
<p>You will see the memory agent output as retrieved memory chunks in the console as follows:</p>
<pre><code class="lang-bash">[
  {
    text: <span class="hljs-string">'---\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'## Agent Parallelization\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'Parallelization runs multiple LLM tasks at the same time to improve speed or accuracy. It works by splitting a task into independent parts (sectioning) or generating multiple responses for comparison (voting).\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'Voting is a parallelization method where multiple LLM calls generate different responses for the same task. The best result is selected based on agreement, predefined rules, or quality evaluation, improving accuracy and reliability.\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">"`This code implements an email analysis system that processes incoming emails through multiple parallel AI agents to determine if and how they should be handled. Here's the breakdown:"</span>,
    similarity: 0.7146744132041931,
    meta: {
      docName: <span class="hljs-string">'agent-architectures.txt'</span>,
      documentName: <span class="hljs-string">'agent-architectures.txt'</span>,
      category: <span class="hljs-string">'Examples'</span>,
      topic: <span class="hljs-string">'Agent architecture'</span>
    }
  },
  {
    text: <span class="hljs-string">'async function main(inputText: string) {\n'</span> +
      <span class="hljs-string">'\ttry {\n'</span> +
      <span class="hljs-string">'\t\t// Create pipes first\n'</span> +
      <span class="hljs-string">'\t\tawait createPipes();\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'\t\t// Step A: Determine which agent to route to\n'</span> +
      <span class="hljs-string">'\t\tconst route = await routerAgent(inputText);\n'</span> +
      <span class="hljs-string">"\t\tconsole.log('Router decision:', route);\n"</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'\t\t// Step B: Call the appropriate agent\n'</span> +
      <span class="hljs-string">'\t\tconst agent = agentConfigs[route.agent];\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'\t\tconst response = await langbase.pipes.run({\n'</span> +
      <span class="hljs-string">'\t\t\tstream: false,\n'</span> +
      <span class="hljs-string">'\t\t\tname: agent.name,\n'</span> +
      <span class="hljs-string">'\t\t\tmessages: [\n'</span> +
      <span class="hljs-string">"\t\t\t\t{ role: 'user', content: `<span class="hljs-variable">${agent.prompt}</span> <span class="hljs-variable">${inputText}</span>` }\n"</span> +
      <span class="hljs-string">'\t\t\t]\n'</span> +
      <span class="hljs-string">'\t\t});\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'\t\t// Final output\n'</span> +
      <span class="hljs-string">'\t\tconsole.log(\n'</span> +
      <span class="hljs-string">'\t\t\t`Agent: ${agent.name} \\n\\n Response: ${response.completion}`\n'</span> +
      <span class="hljs-string">'\t\t);\n'</span> +
      <span class="hljs-string">'\t} catch (error) {\n'</span> +
      <span class="hljs-string">"\t\tconsole.error('Error in main workflow:', error);\n"</span> +
      <span class="hljs-string">'\t}\n'</span> +
      <span class="hljs-string">'}\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'// Example usage:\n'</span> +
      <span class="hljs-string">"const inputText = 'Why days are shorter in winter?';\n"</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'main(inputText);\n'</span> +
      <span class="hljs-string">'```\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'---\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'## Agent Parallelization\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'Parallelization runs multiple LLM tasks at the same time to improve speed or accuracy. It works by splitting a task into independent parts (sectioning) or generating multiple responses for comparison (voting).'</span>,
    similarity: 0.5911030173301697,
    meta: {
      docName: <span class="hljs-string">'agent-architectures.txt'</span>,
      documentName: <span class="hljs-string">'agent-architectures.txt'</span>,
      category: <span class="hljs-string">'Examples'</span>,
      topic: <span class="hljs-string">'Agent architecture'</span>
    }
  },
  {
    text: <span class="hljs-string">"`This code implements a sophisticated task orchestration system with dynamic subtask generation and parallel processing. Here's how it works:\n"</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'1. Orchestrator Agent (Planning Phase):\n'</span> +
      <span class="hljs-string">'   - Takes a complex task as input\n'</span> +
      <span class="hljs-string">'   - Analyzes the task and breaks it down into smaller, manageable subtasks\n'</span> +
      <span class="hljs-string">'   - Returns both an analysis and a list of subtasks in JSON format\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'2. Worker Agents (Execution Phase):\n'</span> +
      <span class="hljs-string">'   - Multiple workers run in parallel using Promise.all()\n'</span> +
      <span class="hljs-string">'   - Each worker gets:\n'</span> +
      <span class="hljs-string">'     - The original task for context\n'</span> +
      <span class="hljs-string">'     - Their specific subtask to complete\n'</span> +
      <span class="hljs-string">'   - All workers use Gemini 2.0 Flash model\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'3. Synthesizer Agent (Integration Phase):\n'</span> +
      <span class="hljs-string">'   - Takes all the worker outputs\n'</span> +
      <span class="hljs-string">'   - Combines them into a cohesive final result\n'</span> +
      <span class="hljs-string">'   - Ensures the pieces flow together naturally'</span>,
    similarity: 0.5393730401992798,
    meta: {
      docName: <span class="hljs-string">'agent-architectures.txt'</span>,
      documentName: <span class="hljs-string">'agent-architectures.txt'</span>,
      category: <span class="hljs-string">'Examples'</span>,
      topic: <span class="hljs-string">'Agent architecture'</span>
    }
  },
  {
    text: <span class="hljs-string">"`This code implements an email analysis system that processes incoming emails through multiple parallel AI agents to determine if and how they should be handled. Here's the breakdown:\n"</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'1. Three Specialized Agents running in parallel:\n'</span> +
      <span class="hljs-string">'   - Sentiment Analysis Agent: Determines if the email tone is positive, negative, or neutral\n'</span> +
      <span class="hljs-string">'   - Summary Agent: Creates a concise summary of the email content\n'</span> +
      <span class="hljs-string">'   - Decision Maker Agent: Takes the outputs from the other agents and decides:\n'</span> +
      <span class="hljs-string">'     - If the email needs a response\n'</span> +
      <span class="hljs-string">"     - Whether it's spam\n"</span> +
      <span class="hljs-string">'     - Priority level (low, medium, high, urgent)\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'2. The workflow:\n'</span> +
      <span class="hljs-string">'   - Takes an email input\n'</span> +
      <span class="hljs-string">'   - Runs sentiment analysis and summary generation in parallel using Promise.all()\n'</span> +
      <span class="hljs-string">'   - Feeds those results to the decision maker agent\n'</span> +
      <span class="hljs-string">'   - Outputs a final decision object with response requirements\n'</span> +
      <span class="hljs-string">'\n'</span> +
      <span class="hljs-string">'3. All agents use Gemini 2.0 Flash model and are structured to return parsed JSON responses'</span>,
    similarity: 0.49115753173828125,
    meta: {
      docName: <span class="hljs-string">'agent-architectures.txt'</span>,
      documentName: <span class="hljs-string">'agent-architectures.txt'</span>,
      category: <span class="hljs-string">'Examples'</span>,
      topic: <span class="hljs-string">'Agent architecture'</span>
    }
  }
]
</code></pre>
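<p>Each retrieved chunk carries a <code>similarity</code> score, and the last chunk above scores noticeably lower than the first. If you want to drop weak matches before passing chunks to an LLM, one option is a small filter like the sketch below. The helper <code>keepRelevantChunks</code> and the <code>RetrievedChunk</code> type are hypothetical (not part of the Langbase SDK), and the <code>0.55</code> cutoff is an arbitrary assumption you would need to tune for your own data.</p>

```typescript
// Hypothetical helper, not part of the Langbase SDK.
// The chunk shape mirrors the fields visible in the console output above.
type RetrievedChunk = {
    text: string;
    similarity: number;
    meta: { docName: string };
};

// Keep only the chunks whose similarity score meets the chosen cutoff.
function keepRelevantChunks(chunks: RetrievedChunk[], minSimilarity: number) {
    return chunks.filter((chunk) => chunk.similarity >= minSimilarity);
}

// Similarity scores rounded from the output above; the text is shortened to placeholders.
const sampleChunks: RetrievedChunk[] = [
    { text: 'chunk 1', similarity: 0.7147, meta: { docName: 'agent-architectures.txt' } },
    { text: 'chunk 2', similarity: 0.5911, meta: { docName: 'agent-architectures.txt' } },
    { text: 'chunk 3', similarity: 0.5394, meta: { docName: 'agent-architectures.txt' } },
    { text: 'chunk 4', similarity: 0.4912, meta: { docName: 'agent-architectures.txt' } },
];

// 0.55 is an assumed cutoff chosen only for this example.
const relevant = keepRelevantChunks(sampleChunks, 0.55);
console.log(relevant.map((chunk) => chunk.text));
```

<p>A stricter cutoff keeps the prompt shorter and more focused, at the risk of discarding a chunk that held part of the answer.</p>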
<h2 id="heading-step-7-create-support-pipe-agent">Step 7: Create Support Pipe Agent</h2>
<p>In this step, we will create a support agent using the Langbase SDK. Go ahead and add the following code to the <code>create-pipe.ts</code> file created in Step 1:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> <span class="hljs-string">'dotenv/config'</span>;
<span class="hljs-keyword">import</span> { Langbase } <span class="hljs-keyword">from</span> <span class="hljs-string">'langbase'</span>;

<span class="hljs-keyword">const</span> langbase = <span class="hljs-keyword">new</span> Langbase({
    apiKey: process.env.LANGBASE_API_KEY!,
});

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">main</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">const</span> supportAgent = <span class="hljs-keyword">await</span> langbase.pipes.create({
        name: <span class="hljs-string">`ai-support-agent`</span>,
        description: <span class="hljs-string">`An AI agent to support users with their queries.`</span>,
        messages: [
            {
                role: <span class="hljs-string">`system`</span>,
                content: <span class="hljs-string">`You're a helpful AI assistant.
                You will assist users with their queries.
                Always ensure that you provide accurate and to the point information.`</span>,
            },
        ],
    });

    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Support agent:'</span>, supportAgent);
}

main();
</code></pre>
<p>Let's go through the above code:</p>
<ul>
<li><p>Initialize the Langbase SDK with your API key.</p>
</li>
<li><p>Use the <code>pipes.create</code> method to create a new pipe agent.</p>
</li>
<li><p>Log the created pipe agent to the console.</p>
</li>
</ul>
<p>Now run the <code>create-pipe.ts</code> file with the following command in your terminal to create the pipe agent:</p>
<pre><code class="lang-bash">npx tsx create-pipe.ts
</code></pre>
<p>This will create a support agent and log the agent details to the console.</p>
<h2 id="heading-step-8-generate-rag-responses">Step 8: Generate RAG Responses</h2>
<p>Up until now, we’ve created a Langbase memory agent, added documents to it, performed RAG retrieval against a query, and created a support agent using a Langbase Pipe agent. The only thing left to complete this RAG agent is generating comprehensive responses using LLMs.</p>
<p>To do this, update the <code>agents.ts</code> file created in Step 1 so that it contains the following code (it includes the <code>runMemoryAgent</code> function from Step 6):</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> <span class="hljs-string">'dotenv/config'</span>;
<span class="hljs-keyword">import</span> { Langbase, MemoryRetrieveResponse } <span class="hljs-keyword">from</span> <span class="hljs-string">'langbase'</span>;

<span class="hljs-keyword">const</span> langbase = <span class="hljs-keyword">new</span> Langbase({
    apiKey: process.env.LANGBASE_API_KEY!,
});

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">runAiSupportAgent</span>(<span class="hljs-params">{
    chunks,
    query,
}: {
    chunks: MemoryRetrieveResponse[];
    query: <span class="hljs-built_in">string</span>;
}</span>) </span>{
    <span class="hljs-keyword">const</span> systemPrompt = <span class="hljs-keyword">await</span> getSystemPrompt(chunks);

    <span class="hljs-keyword">const</span> { completion } = <span class="hljs-keyword">await</span> langbase.pipes.run({
        stream: <span class="hljs-literal">false</span>,
        name: <span class="hljs-string">'ai-support-agent'</span>,
        messages: [
            {
                role: <span class="hljs-string">'system'</span>,
                content: systemPrompt,
            },
            {
                role: <span class="hljs-string">'user'</span>,
                content: query,
            },
        ],
    });

    <span class="hljs-keyword">return</span> completion;
}

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">getSystemPrompt</span>(<span class="hljs-params">chunks: MemoryRetrieveResponse[]</span>) </span>{
    <span class="hljs-keyword">let</span> chunksText = <span class="hljs-string">''</span>;
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> chunk <span class="hljs-keyword">of</span> chunks) {
        chunksText += chunk.text + <span class="hljs-string">'\n'</span>;
    }

    <span class="hljs-keyword">const</span> systemPrompt = <span class="hljs-string">`
    You're a helpful AI assistant.
    You will assist users with their queries.

    Always ensure that you provide accurate and to the point information.
    Below is some CONTEXT for you to answer the questions. ONLY answer from the CONTEXT. CONTEXT consists of multiple information chunks. Each chunk has a source mentioned at the end.

For each piece of response you provide, cite the source in brackets like so: [1].

At the end of the answer, always list each source with its corresponding number and provide the document name. like so [1] Filename.doc. If there is a URL, make it hyperlink on the name.

 If you don't know the answer, say so. Ask for more context if needed.
    <span class="hljs-subst">${chunksText}</span>`</span>;

    <span class="hljs-keyword">return</span> systemPrompt;
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">runMemoryAgent</span>(<span class="hljs-params">query: <span class="hljs-built_in">string</span></span>) </span>{
    <span class="hljs-keyword">const</span> chunks = <span class="hljs-keyword">await</span> langbase.memories.retrieve({
        query,
        topK: <span class="hljs-number">4</span>,
        memory: [
            {
                name: <span class="hljs-string">'knowledge-base'</span>,
            },
        ],
    });

    <span class="hljs-keyword">return</span> chunks;
}
</code></pre>
<p>The above code:</p>
<ul>
<li><p>Creates a function <code>runAiSupportAgent</code> that takes chunks and query as input.</p>
</li>
<li><p>Uses the <code>pipes.run</code> method to generate responses using the LLM.</p>
</li>
<li><p>Creates a function <code>getSystemPrompt</code> to generate a system prompt for the LLM.</p>
</li>
<li><p>Combines the retrieved chunks to create a system prompt.</p>
</li>
<li><p>Returns the generated completion.</p>
</li>
</ul>
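<p>Note that the system prompt tells the model that each chunk has a source mentioned at the end, while <code>getSystemPrompt</code> concatenates only <code>chunk.text</code>. If you want the model to be able to cite document names, one possible approach is to append the source to each chunk when building the context. The sketch below assumes the chunks expose <code>meta.docName</code>, as in the retrieval output from Step 6. The <code>buildContext</code> helper, the <code>SourcedChunk</code> type, and the <code>faq.txt</code> file name are made up for illustration.</p>

```typescript
// Hypothetical helper: formats retrieved chunks so each one ends with a numbered source.
// Assumes every chunk has meta.docName, as seen in the Step 6 output.
type SourcedChunk = {
    text: string;
    meta: { docName: string };
};

function buildContext(chunks: SourcedChunk[]): string {
    return chunks
        .map((chunk, index) => `${chunk.text}\nSource [${index + 1}]: ${chunk.meta.docName}`)
        .join('\n\n');
}

// Sample chunks invented for illustration.
const context = buildContext([
    {
        text: 'Parallelization runs multiple LLM tasks at the same time.',
        meta: { docName: 'agent-architectures.txt' },
    },
    { text: 'Example FAQ answer.', meta: { docName: 'faq.txt' } },
]);

console.log(context);
```

<p>You could then interpolate the returned string into the system prompt in place of <code>chunksText</code>.</p>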
<p>Now, let's run the support agent with AI memory chunks. Replace the contents of the <code>index.ts</code> file with the following code:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { runMemoryAgent, runAiSupportAgent } <span class="hljs-keyword">from</span> <span class="hljs-string">'./agents'</span>;

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">main</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">const</span> query = <span class="hljs-string">'What is agent parallelization?'</span>;
    <span class="hljs-keyword">const</span> chunks = <span class="hljs-keyword">await</span> runMemoryAgent(query);

    <span class="hljs-keyword">const</span> completion = <span class="hljs-keyword">await</span> runAiSupportAgent({
        chunks,
        query,
    });

    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Completion:'</span>, completion);
}

main();
</code></pre>
<p>This code runs two agents: one to retrieve memory chunks relevant to a query (<code>runMemoryAgent</code>), and another (<code>runAiSupportAgent</code>) to generate a final answer using those chunks.</p>
<p>Then, run the <code>index.ts</code> file to generate responses using the LLM:</p>
<pre><code class="lang-bash">npx tsx index.ts
</code></pre>
<h3 id="heading-the-result">The Result</h3>
<p>After running the support agent, you’ll see the following output generated in your console:</p>
<pre><code class="lang-bash">Completion: Agent parallelization is a process that runs multiple LLM (Language Model) tasks simultaneously to enhance speed or accuracy. This technique can be implemented <span class="hljs-keyword">in</span> two main ways:

1. **Sectioning**: A task is divided into independent parts that can be processed concurrently.
2. **Voting**: Multiple LLM calls generate different responses <span class="hljs-keyword">for</span> the same task, and the best result is selected based on agreement, predefined rules, or quality evaluation. This approach improves accuracy and reliability by comparing various outputs.

In practice, agent parallelization involves orchestrating multiple specialized agents to handle different aspects of a task, allowing <span class="hljs-keyword">for</span> efficient processing and improved outcomes.

If you need more detailed examples or further clarification, feel free to ask!
</code></pre>
<p>This is how you can build an agentic RAG system with TypeScript using the Langbase SDK.</p>
<p>Thank you for reading!</p>
<p>Connect with me by 🙌:</p>
<ul>
<li><p>Subscribing to my <a target="_blank" href="https://www.youtube.com/@AIwithMahamCodes">YouTube</a> Channel if you want to learn about AI and agents.</p>
</li>
<li><p>Subscribing to my free newsletter <a target="_blank" href="https://mahamcodes.substack.com/">The Agentic Engineer</a> where I share all the latest AI and agents news/trends/jobs and much more.</p>
</li>
<li><p>Following me on <a target="_blank" href="https://x.com/MahamDev">X (Twitter)</a>.</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Local RAG App with Ollama and ChromaDB in the R Programming Language ]]>
                </title>
                <description>
                    <![CDATA[ A Large Language Model (LLM) is a type of machine learning model that is trained to understand and generate human-like text. These models are trained on vast datasets to capture the nuances of human language, enabling them to generate coherent and co... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-local-rag-app-with-ollama-and-chromadb-in-r/</link>
                <guid isPermaLink="false">67fd5ac89a2c2895da61d799</guid>
                
                    <category>
                        <![CDATA[ ollama ]]>
                    </category>
                
                    <category>
                        <![CDATA[ chromadb ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Language ]]>
                    </category>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Elabonga Atuo ]]>
                </dc:creator>
                <pubDate>Mon, 14 Apr 2025 18:58:16 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1744638731389/83993a5e-7a4d-4615-a8c5-582008115fc4.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>A Large Language Model (LLM) is a type of machine learning model that is trained to understand and generate human-like text. These models are trained on vast datasets to capture the nuances of human language, enabling them to generate coherent and contextually relevant responses.</p>
<p>You can enhance the performance of an LLM by providing context — structured or unstructured data, such as documents, articles, or knowledge bases — tailored to the domain or information you want the model to specialize in. Using techniques like prompt engineering and context injection, you can build an intelligent chatbot capable of navigating extensive datasets, retrieving relevant information, and delivering responses.</p>
<p>Whether it's storing recipes, code documentation, research articles, or answering domain-specific queries, an LLM-based chatbot can adapt to your needs with customization and privacy. You can deploy it locally to create a highly specialized conversational assistant that respects your data.</p>
<p>In this article, you will learn how to build a local Retrieval-Augmented Generation (RAG) application using Ollama and ChromaDB in R. By the end, you'll have a custom conversational assistant with a Shiny interface that efficiently retrieves information while maintaining privacy and customization.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-rag">What is RAG?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-overview">Project Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-ollama-installation">Ollama Installation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-data-collection-and-cleaning">Data Collection and Cleaning</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-chunks">How to Create Chunks</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-generate-sentence-embeddings">How to Generate Sentence Embeddings</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-the-vector-database-for-embedding-storage">How to Set Up the Vector Database for Embedding Storage</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-write-the-user-input-query-embedding-function">How to Write the User Input Query Embedding Function</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tool-calling">Tool Calling</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-initialize-the-chat-system-design-prompts-and-integrate-tools">How to Initialize the Chat System, Design Prompts, and Integrate Tools</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-interact-with-your-chatbot-using-a-shiny-app">How to Interact with Your Chatbot Using a Shiny App</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-complete-code">Complete Code</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-is-rag">What is RAG?</h2>
<p>Retrieval-Augmented Generation (RAG) is a method that integrates retrieval systems with generative AI, enabling chatbots to access recent and specific information from external sources.</p>
<p>By using a retrieval pipeline, the chatbot can fetch up-to-date, relevant data and combine it with the generative model’s language capabilities, producing responses that are both accurate and contextually enriched. This makes RAG particularly useful for applications requiring fact-based, real-time knowledge delivery.</p>
<h2 id="heading-project-overview">Project Overview</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744367291671/3e7989f8-0cd9-4857-ba48-23a352d9ae8d.png" alt="Setting up a local RAG chatbot from data gathering, cleaning, chunking, embedding, vector database storage, system prompting and interactive chatbot using Shiny" class="image--center mx-auto" width="1318" height="1101" loading="lazy"></p>
<h2 id="heading-project-setup">Project Setup</h2>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>Before you begin, ensure you have installed the latest version of the items listed here:</p>
<ol>
<li><p><a target="_blank" href="https://posit.co/download/rstudio-desktop/"><strong>RStudio</strong></a><strong>: The IDE</strong> <em>–</em> RStudio is the primary workspace where you'll write and test your R code. Its user-friendly interface, debugging tools, and integrated environment make it ideal for data analysis and chatbot development.</p>
</li>
<li><p><a target="_blank" href="https://cran.rstudio.com/"><strong>R</strong></a><strong>: The Programming Language</strong> <em>–</em> R is the backbone of your project. You'll use it to handle data manipulation, apply statistical models, and integrate your recipe chatbot components seamlessly.</p>
</li>
<li><p><a target="_blank" href="https://www.python.org/downloads/"><strong>Python</strong></a> – Some libraries, like the embedding library you'll use for text vectorization, are built on Python. It’s vital to have Python installed to enable these functionalities alongside your R code.</p>
</li>
<li><p><a target="_blank" href="https://www.java.com/en/download/"><strong>Java</strong></a> – Java serves as a foundational element for certain embedding libraries. It ensures efficient processing and compatibility for text embedding tasks required to train your chatbot.</p>
</li>
<li><p><a target="_blank" href="https://www.docker.com/products/docker-desktop/"><strong>Docker Desktop</strong></a> – Docker Desktop allows you to run ChromaDB, the vector database, locally on your machine. This enables fast and reliable storage of embeddings, ensuring your chatbot retrieves relevant information quickly.</p>
</li>
<li><p><a target="_blank" href="https://ollama.com/"><strong>Ollama</strong></a> – Ollama brings powerful Large Language Models (LLMs) directly to your local computer, removing the need for cloud resources. It lets you access multiple models, customize outputs, and integrate them into your chatbot effortlessly.</p>
</li>
</ol>
<h2 id="heading-ollama-installation">Ollama Installation</h2>
<p>Ollama is an open-source tool you can use to run and manage LLMs on your computer. Once installed, you can access various LLMs as per your needs. You will be using the <code>llama3.2:3b-instruct-q4_K_M</code> model, which is a quantized model, to build this chatbot.</p>
<p>A quantized model is a version of a machine learning model that has been optimized to use less memory and computational power by reducing the precision of the numbers it uses. This makes it practical to run an LLM locally, especially when you don’t have access to a GPU (Graphics Processing Unit – a specialized processor that performs complex computations).</p>
<p>To start, you can download and install the Ollama software <a target="_blank" href="https://ollama.com/download">here</a>.</p>
<p>Then you can confirm installation by running this command:</p>
<pre><code class="lang-bash">ollama --version
</code></pre>
<p>Run the following command to start Ollama:</p>
<pre><code class="lang-bash">ollama serve
</code></pre>
<p>Next, run the following command to pull the Q4_K_M quantization of llama3.2:3b-instruct:</p>
<pre><code class="lang-bash">ollama pull llama3.2:3b-instruct-q4_K_M
</code></pre>
<p>Then confirm that the model was downloaded with this:</p>
<pre><code class="lang-bash">ollama list
</code></pre>
<p>If the model download was successful, a list containing the model’s name, ID, and size will be returned, like so:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744288047721/f6349ca4-fe86-4851-beaf-2f04fe2a4d80.png" alt="Confirm Ollama Installation" class="image--center mx-auto" width="1455" height="256" loading="lazy"></p>
<p>Now you can chat with the model:</p>
<pre><code class="lang-bash">ollama run llama3.2:3b-instruct-q4_K_M
</code></pre>
<p>If successful, you should receive a prompt that you can test by asking a question and getting an answer. For example:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744288433940/d831d256-0f6c-49c0-b647-bce1c1976584.png" alt="Ollama llama3.2:3b-instruct-q4_K_M chat console" class="image--center mx-auto" width="1612" height="559" loading="lazy"></p>
<p>Then you can exit the console by typing <code>/bye</code> or pressing Ctrl + D.</p>
<h2 id="heading-data-collection-and-cleaning">Data Collection and Cleaning</h2>
<p>The chatbot you are building will be a cooking assistant that suggests recipes given your available ingredients, what you want to eat, and how much food a recipe yields.</p>
<p>You first have to get the data that the chatbot will draw on when answering. You will be using a <a target="_blank" href="https://www.kaggle.com/datasets/paultimothymooney/recipenlg">dataset</a> that contains recipes from Kaggle.</p>
<p>To start, load the necessary libraries:</p>
<pre><code class="lang-r"><span class="hljs-comment"># loading required libraries</span>
<span class="hljs-keyword">library</span>(xml2) <span class="hljs-comment">#read, parse, and manipulate XML,HTML documents</span>
<span class="hljs-keyword">library</span>(jsonlite) <span class="hljs-comment">#manipulate JSON objects</span>

<span class="hljs-keyword">library</span>(RKaggle) <span class="hljs-comment"># download datasets from Kaggle </span>
<span class="hljs-keyword">library</span>(dplyr)   <span class="hljs-comment"># data manipulation</span>
</code></pre>
<p>Then download and save the recipe dataset:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Download and read the "recipe" dataset from Kaggle</span>
recipes_list &lt;- RKaggle::get_dataset(<span class="hljs-string">"thedevastator/better-recipes-for-a-better-life"</span>)
</code></pre>
<p>Inspect the downloaded object and extract its first element like this:</p>
<pre><code class="lang-r"><span class="hljs-comment"># inspect the dataset</span>
class(recipes_list)
str(recipes_list)
head(recipes_list)
<span class="hljs-comment"># extract the first tibble</span>
recipes_df &lt;- recipes_list[[<span class="hljs-number">1</span>]]
</code></pre>
<p>A quick inspection of the <code>recipes_list</code> object shows that it contains two objects of type tibble. You will be using only the first element for this project. A tibble is a type of data structure used for storing and manipulating data. It’s similar to a traditional dataframe, but it’s designed to enforce stricter rules and perform fewer automatic actions compared to traditional dataframes.</p>
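<p>As a rough illustration of those stricter rules, the sketch below builds the same toy data as a dataframe and as a tibble (the object and column names are made up for this example). Selecting a single column with <code>[</code> silently simplifies the dataframe to a plain vector, while the tibble stays a tibble:</p>
<pre><code class="lang-r">library(dplyr)  # dplyr re-exports tibble()

# hypothetical toy data, not part of the recipe dataset
toy_df &lt;- data.frame(name = c("salad", "stew"), servings = c(2, 4))
toy_tb &lt;- tibble(name = c("salad", "stew"), servings = c(2, 4))

# select the first column from each object
df_col &lt;- toy_df[, 1]
tb_col &lt;- toy_tb[, 1]

# the dataframe result is no longer a dataframe; the tibble result still is
is.data.frame(df_col)
is.data.frame(tb_col)
</code></pre>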
<p>We’ll use a regular dataframe in this project because more people are likely familiar with it. It can also efficiently handle row indexing, which is crucial for accessing and manipulating specific rows in our recipe dataset.</p>
<p>In the code block below, you’ll convert the tibble to a dataframe and then drop the first column, which is the index column. Then you’ll inspect the newly converted dataframe and drop unnecessary columns.</p>
<p>Unnecessary columns are best removed to streamline the dataset and focus on relevant features. In this project, we’ll drop certain columns that aren’t particularly useful as context for the chatbot, so that the text we embed later contains only meaningful data.</p>
<pre><code class="lang-r"><span class="hljs-comment"># convert to dataframe and drop the first column</span>
recipes_df &lt;- as.data.frame(recipes_df[, -<span class="hljs-number">1</span>])
<span class="hljs-comment"># inspect the converted dataframe</span>
head(recipes_df)
class(recipes_df)
colnames(recipes_df)
<span class="hljs-comment"># drop unnecessary columns</span>
cleaned_recipes_df &lt;- subset(recipes_df, select = -c(yield,rating,url,cuisine_path,nutrition,timing,img_src))
</code></pre>
<p>Now you need to identify rows with NA (missing) values, which you can do like this:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Identify rows and columns with NA values</span>
which(is.na(cleaned_recipes_df), arr.ind = <span class="hljs-literal">TRUE</span>)

<span class="hljs-comment"># a quick inspection reveals columns [2:4] have missing values</span>
subset_column_names &lt;- colnames(cleaned_recipes_df)[<span class="hljs-number">2</span>:<span class="hljs-number">4</span>]
subset_column_names
</code></pre>
<p>It is important to handle NA values to ensure that your data is complete, to prevent errors, and to preserve context.</p>
<p>Now, replace the NA values and confirm that there are no missing values:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Replace NA values dynamically based on conditions</span>
cols_to_modify &lt;- c(<span class="hljs-string">"prep_time"</span>, <span class="hljs-string">"cook_time"</span>, <span class="hljs-string">"total_time"</span>)
cleaned_recipes_df[cols_to_modify] &lt;- lapply(
  cleaned_recipes_df[cols_to_modify],
  <span class="hljs-keyword">function</span>(x, df) {
    <span class="hljs-comment"># Replace NA in prep_time and cook_time where both are NA</span>
    replace(x, is.na(df$prep_time) &amp; is.na(df$cook_time), <span class="hljs-string">"unknown"</span>)
  },
  df = cleaned_recipes_df  <span class="hljs-comment"># Pass the whole dataframe for conditions</span>
)
cleaned_recipes_df &lt;- cleaned_recipes_df %&gt;%
  mutate(
    prep_time = case_when(
      <span class="hljs-comment"># If cook_time is present but prep_time is NA, replace with "no preparation required"</span>
      !is.na(cook_time) &amp; is.na(prep_time) ~ <span class="hljs-string">"no preparation required"</span>,
      <span class="hljs-comment"># Otherwise, retain original value</span>
      <span class="hljs-literal">TRUE</span> ~ as.character(prep_time)
    ),
    cook_time = case_when(
      <span class="hljs-comment"># If prep_time is present but cook_time is NA, replace with "no cooking required"</span>
      !is.na(prep_time) &amp; is.na(cook_time) ~ <span class="hljs-string">"no cooking required"</span>,
      <span class="hljs-comment"># Otherwise, retain original value</span>
      <span class="hljs-literal">TRUE</span> ~ as.character(cook_time)
    )
  )
<span class="hljs-comment"># confirm there are no missing values</span>
any(is.na(cleaned_recipes_df))

<span class="hljs-comment"># confirm the replacing NA logic works by inspecting specific rows</span>
cleaned_recipes_df[<span class="hljs-number">1081</span>,]
cleaned_recipes_df[<span class="hljs-number">1</span>,]
cleaned_recipes_df[<span class="hljs-number">405</span>,]
</code></pre>
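<p>To see the replacement rules in isolation, here is a minimal sketch on a made-up three-row dataframe, with one row per missing-value pattern. It is a condensed single-pass variant rather than the two-step code above, and the names <code>toy</code>, <code>both_missing</code>, <code>prep_new</code>, and <code>cook_new</code> are hypothetical:</p>
<pre><code class="lang-r">library(dplyr)

# hypothetical toy data: each row has a different NA pattern
toy &lt;- data.frame(
  recipe_name = c("salad", "stew", "toast"),
  prep_time   = c("10 mins", NA, NA),
  cook_time   = c(NA, "2 hrs", NA),
  stringsAsFactors = FALSE
)

toy_clean &lt;- toy %&gt;%
  mutate(
    # flag rows where both times are missing before changing either column
    both_missing = is.na(prep_time) &amp; is.na(cook_time),
    prep_new = case_when(
      both_missing     ~ "unknown",
      is.na(prep_time) ~ "no preparation required",
      TRUE             ~ prep_time
    ),
    cook_new = case_when(
      both_missing     ~ "unknown",
      is.na(cook_time) ~ "no cooking required",
      TRUE             ~ cook_time
    )
  ) %&gt;%
  # keep only the cleaned columns under their original names
  transmute(recipe_name, prep_time = prep_new, cook_time = cook_new)

toy_clean
any(is.na(toy_clean))
</code></pre>
<p>The salad row keeps its prep time and gets "no cooking required", the stew row gets "no preparation required", and the toast row becomes "unknown" in both columns.</p>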
<p>For this tutorial, we’ll subset the dataframe to the first 250 rows for demo purposes. This saves time when generating embeddings.</p>
<pre><code class="lang-r"><span class="hljs-comment"># recommended for demo/learning purposes</span>
cleaned_recipes_df &lt;- head(cleaned_recipes_df,<span class="hljs-number">250</span>)
</code></pre>
<h2 id="heading-how-to-create-chunks">How to Create Chunks</h2>
<p>To understand why chunking is important before embedding, you need to understand what an embedding is.</p>
<p>An embedding is a vector representation of a word or a sentence. Machines don’t understand human text, they understand numbers. LLMs work by transforming human text to numerical representations in order to give answers. The process of generating embeddings requires a lot of computation, and breaking down the data to be embedded optimizes the embedding process.</p>
<p>So now we’re going to split the dataframe into smaller chunks of a specified size to enable efficient batch processing and iteration. With <code>chunk_size</code> set to 1, each chunk holds a single recipe.</p>
<pre><code class="lang-r"><span class="hljs-comment"># Define the size of each chunk (number of rows per chunk)</span>
chunk_size &lt;- <span class="hljs-number">1</span>

<span class="hljs-comment"># Get the total number of rows in the dataframe</span>
n &lt;- nrow(cleaned_recipes_df)

<span class="hljs-comment"># Create a vector of group numbers for chunking</span>
<span class="hljs-comment"># Each group number repeats for 'chunk_size' rows</span>
<span class="hljs-comment"># Ensure the vector matches the total number of rows</span>
r &lt;- rep(<span class="hljs-number">1</span>:ceiling(n/chunk_size), each = chunk_size)[<span class="hljs-number">1</span>:n]

<span class="hljs-comment"># Split the dataframe into smaller chunks (subsets) based on the group numbers</span>
chunks &lt;- split(cleaned_recipes_df, r)
</code></pre>
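<p>Because a chunk size of 1 hides what the grouping vector does, the sketch below repeats the same steps on a made-up five-row dataframe with a chunk size of 2. All of the <code>toy_</code> names are hypothetical and used only for illustration:</p>
<pre><code class="lang-r"># hypothetical toy data with five rows
toy_df &lt;- data.frame(id = 1:5, name = c("a", "b", "c", "d", "e"))
toy_chunk_size &lt;- 2
toy_n &lt;- nrow(toy_df)

# rep() produces 1 1 2 2 3 3; trimming to the first toy_n values
# drops the extra group number that has no matching row
toy_groups &lt;- rep(1:ceiling(toy_n / toy_chunk_size), each = toy_chunk_size)[1:toy_n]
toy_groups

# split the rows by group number
toy_chunks &lt;- split(toy_df, toy_groups)
length(toy_chunks)
sapply(toy_chunks, nrow)
</code></pre>
<p>Here the five rows end up in three chunks, and the last chunk holds the single leftover row.</p>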
<h2 id="heading-how-to-generate-sentence-embeddings">How to Generate Sentence Embeddings</h2>
<p>As previously mentioned, embeddings are vector representations of text, and they can be generated from both words and sentences. How you choose to generate embeddings depends on your intended application of the LLM.</p>
<p>Word embeddings are numerical representations of individual words in a continuous vector space. They capture semantic relationships between words, allowing similar words to have vectors close to each other.</p>
<p>Word embeddings can be used in search engines as they support word-level queries by matching embeddings to retrieve relevant documents. They can also be used in text classification to classify documents, emails, or tweets based on word-level features (for example, detecting spam emails or sentiment analysis).</p>
<p>Sentence embeddings are numerical representations of entire sentences in a vector space, designed to capture the overall meaning and context of the sentence. They are used in settings where sentences provide better context like question answering systems where user queries are matched to relevant sentences or documents for more precise retrieval.</p>
<p>For our recipe chatbot, sentence embedding is the best choice.</p>
<p>First, create an empty dataframe that has three columns.</p>
<pre><code class="lang-r"><span class="hljs-comment">#empty dataframe</span>
recipe_sentence_embeddings &lt;-  data.frame(
  recipe = character(),
  recipe_vec_embeddings = I(list()),
  recipe_id = character()
)
</code></pre>
<p>The first column will hold the actual recipe in text form, the <code>recipe_vec_embeddings</code> column will hold the generated sentence embeddings, and the <code>recipe_id</code> holds a unique id for each recipe. This will help in indexing and retrieval from the vector database.</p>
<p>Next, it’s helpful to define a progress bar, which you can do like this:</p>
<pre><code class="lang-r"><span class="hljs-comment"># create a progress bar</span>
pb &lt;- txtProgressBar(min = <span class="hljs-number">1</span>, max = length(chunks), style = <span class="hljs-number">3</span>)
</code></pre>
<p>Embedding can take a while, so it’s important to keep track of the progress of the process.</p>
<p>Now it’s time to generate embeddings and populate the dataframe.</p>
<p>Write a for loop that runs once for each chunk.</p>
<pre><code class="lang-r"><span class="hljs-keyword">for</span> (i <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:length(chunks)) {}
</code></pre>
<p>The recipe field is the text of the chunk that is currently being processed, and the unique recipe id is generated by pasting together the text “recipe” and the index of the chunk.</p>
<pre><code class="lang-r"><span class="hljs-keyword">for</span> (i <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:length(chunks)) {
    recipe &lt;- as.character(chunks[i])
    recipe_id &lt;- paste0(<span class="hljs-string">"recipe"</span>,i)
}
</code></pre>
<p>The <code>textEmbed()</code> function from the <code>text</code> library generates either sentence or word embeddings. It takes in a character variable or a dataframe and produces a tibble of embeddings. Read the installation instructions on the <a target="_blank" href="https://www.r-text.org/">text</a> website to make sure the library runs smoothly.</p>
<p>The <code>batch_size</code> defines how many rows are embedded at a time from the input. Setting <code>keep_token_embeddings</code> to <code>FALSE</code> discards the embeddings for individual tokens after processing, and setting <code>aggregation_from_layers_to_tokens</code> to <code>"concatenate"</code> combines embeddings from the specified layers to create a detailed embedding for each token. A token is the smallest unit of text that a model can process.</p>
<pre><code class="lang-r"><span class="hljs-keyword">for</span> (i <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:length(chunks)) {
    recipe &lt;- as.character(chunks[i])
    recipe_id &lt;- paste0(<span class="hljs-string">"recipe"</span>,i)
    recipe_embeddings &lt;- textEmbed(as.character(recipe),
                                layers = <span class="hljs-number">10</span>:<span class="hljs-number">11</span>,
                                aggregation_from_layers_to_tokens = <span class="hljs-string">"concatenate"</span>,
                                aggregation_from_tokens_to_texts = <span class="hljs-string">"mean"</span>,
                                keep_token_embeddings = <span class="hljs-literal">FALSE</span>,
                                batch_size = <span class="hljs-number">1</span>
  )
}
</code></pre>
<p>To get sentence embeddings, set the <code>aggregation_from_tokens_to_texts</code> parameter to <code>"mean"</code>.</p>
<pre><code class="lang-r">aggregation_from_tokens_to_texts = <span class="hljs-string">"mean"</span>
</code></pre>
<p>The "mean" operation averages the embeddings of all tokens in a sentence to generate a single vector that represents the entire sentence. This sentence-level embedding captures the overall meaning and semantics of the text, regardless of its token length.</p>
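<p>The averaging step can be sketched with a tiny made-up matrix in which each row stands in for one token’s embedding. The numbers and the names <code>token_embeddings</code> and <code>sentence_embedding</code> are hypothetical, and real embeddings have far more dimensions:</p>
<pre><code class="lang-r"># hypothetical embeddings for a three-token sentence,
# one row per token and one column per dimension
token_embeddings &lt;- matrix(
  c(1, 2, 3,
    3, 4, 5,
    5, 6, 7),
  nrow = 3, byrow = TRUE
)

# average across tokens (rows) to get one vector for the whole sentence
sentence_embedding &lt;- colMeans(token_embeddings)
sentence_embedding
length(sentence_embedding)
</code></pre>
<p>However many tokens the sentence has, the result has one value per dimension, which is why sentences of different lengths can be compared with each other.</p>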
<pre><code class="lang-r"><span class="hljs-comment"># convert tibble to vector</span>
  recipe_vec_embeddings &lt;- unlist(recipe_embeddings, use.names = <span class="hljs-literal">FALSE</span>)
  recipe_vec_embeddings &lt;- list(recipe_vec_embeddings)
</code></pre>
<p>The embedding function returns a tibble object. To obtain a vector embedding, you first unlist the tibble while dropping the names, which gives a plain numeric vector. You then wrap that vector in a list so that it can be stored as a single entry in the list column of the dataframe.</p>
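<p>As a small illustration of that conversion, the sketch below uses a made-up one-row dataframe in place of the real embedding output. The names <code>toy_embedding</code>, <code>flat</code>, and <code>wrapped</code> are hypothetical:</p>
<pre><code class="lang-r"># hypothetical stand-in for an embedding result with three dimensions
toy_embedding &lt;- data.frame(Dim1 = 0.1, Dim2 = 0.2, Dim3 = 0.3)

# flatten to an unnamed numeric vector
flat &lt;- unlist(toy_embedding, use.names = FALSE)
flat

# wrap the vector so it counts as one element, ready for a list column
wrapped &lt;- list(flat)
length(flat)
length(wrapped)
</code></pre>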
<pre><code class="lang-r">  <span class="hljs-comment"># Append the current chunk's data to the dataframe</span>
  recipe_sentence_embeddings &lt;- recipe_sentence_embeddings %&gt;%
    add_row(
      recipe = recipe,
      recipe_vec_embeddings = recipe_vec_embeddings,
      recipe_id = recipe_id
    )
</code></pre>
<p>This appends the newly generated data to the dataframe at the end of each iteration.</p>
<pre><code class="lang-r">  <span class="hljs-comment"># track embedding progress</span>
  setTxtProgressBar(pb, i)
</code></pre>
<p>In order to keep track of the embedding progress, you can use the earlier defined progress bar inside the loop. It will update at the end of every iteration.</p>
<p><strong>Complete Code Block:</strong></p>
<pre><code class="lang-r"><span class="hljs-comment"># load required library</span>
<span class="hljs-keyword">library</span>(text)
<span class="hljs-comment"># # ensure to read loading instructions here for smooth running of the 'text' library</span>
<span class="hljs-comment"># # https://www.r-text.org/</span>
<span class="hljs-comment"># embedding data</span>
<span class="hljs-keyword">for</span> (i <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:length(chunks)) {
  recipe &lt;- as.character(chunks[i])
  recipe_id &lt;- paste0(<span class="hljs-string">"recipe"</span>,i)
  recipe_embeddings &lt;- textEmbed(as.character(recipe),
                                layers = <span class="hljs-number">10</span>:<span class="hljs-number">11</span>,
                                aggregation_from_layers_to_tokens = <span class="hljs-string">"concatenate"</span>,
                                aggregation_from_tokens_to_texts = <span class="hljs-string">"mean"</span>,
                                keep_token_embeddings = <span class="hljs-literal">FALSE</span>,
                                batch_size = <span class="hljs-number">1</span>
  )

  <span class="hljs-comment"># convert tibble to vector</span>
  recipe_vec_embeddings &lt;- unlist(recipe_embeddings, use.names = <span class="hljs-literal">FALSE</span>)
  recipe_vec_embeddings &lt;- list(recipe_vec_embeddings)

  <span class="hljs-comment"># Append the current chunk's data to the dataframe</span>
  recipe_sentence_embeddings &lt;- recipe_sentence_embeddings %&gt;%
    add_row(
      recipe = recipe,
      recipe_vec_embeddings = recipe_vec_embeddings,
      recipe_id = recipe_id
    )

  <span class="hljs-comment"># track embedding progress</span>
  setTxtProgressBar(pb, i)

}
</code></pre>
<h2 id="heading-how-to-set-up-the-vector-database-for-embedding-storage">How to Set Up the Vector Database for Embedding Storage</h2>
<p>A vector database is a special type of database that stores embeddings and allows you to query and retrieve relevant information. There are numerous vector databases available, but for this project, you will use ChromaDB, an open-source option that integrates with the R environment through the <code>rchroma</code> library.</p>
<p>ChromaDB runs locally in a Docker container. Just make sure you have Docker installed and running on your device.</p>
<p>Then load the rchroma library and run your ChromaDB instance:</p>
<pre><code class="lang-r"><span class="hljs-comment"># load rchroma library</span>
<span class="hljs-keyword">library</span>(rchroma)
<span class="hljs-comment"># run ChromaDB instance.</span>
chroma_docker_run()
</code></pre>
<p>If it was successful, you should see this in the console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744383249217/bd8fb67c-0731-46f9-8a13-0747b4789714.png" alt="Confirm ChromaDB is running locally" class="image--center mx-auto" width="598" height="121" loading="lazy"></p>
<p>Next, connect to a local ChromaDB instance and check the connection:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Connect to a local ChromaDB instance</span>
client &lt;- chroma_connect()

<span class="hljs-comment"># Check the connection</span>
heartbeat(client)
version(client)
</code></pre>
<p>Now you’ll need to create a collection and confirm that it was created. Collections in ChromaDB function similarly to tables in conventional databases.</p>
<pre><code class="lang-r"><span class="hljs-comment"># Create a new collection</span>
create_collection(client, <span class="hljs-string">"recipes_collection"</span>)

<span class="hljs-comment"># List all collections</span>
list_collections(client)
</code></pre>
<p>Now add the embeddings to the <code>recipes_collection</code> using the <code>add_documents()</code> function.</p>
<pre><code class="lang-r"><span class="hljs-comment"># Add documents to the collection</span>
add_documents(
  client,
  <span class="hljs-string">"recipes_collection"</span>,
  documents = recipe_sentence_embeddings$recipe,
  ids = recipe_sentence_embeddings$recipe_id,
  embeddings = recipe_sentence_embeddings$recipe_vec_embeddings
)
</code></pre>
<p>The <code>add_documents()</code> function is used to add recipe data to the <code>recipes_collection</code>. Here's a breakdown of its arguments and how the corresponding data is accessed:</p>
<ol>
<li><p><code>documents</code>: This argument represents the recipe text. It is sourced from the <code>recipe</code> column of the <code>recipe_sentence_embeddings</code> dataframe.</p>
</li>
<li><p><code>ids</code>: This is the unique identifier for each recipe. It is extracted from the <code>recipe_id</code> column of the same dataframe.</p>
</li>
<li><p><code>embeddings</code>: This contains the sentence embeddings, which were previously generated for each recipe. These embeddings are accessed from the <code>recipe_vec_embeddings</code> column of the dataframe.</p>
</li>
</ol>
<p>All three arguments (<code>documents</code>, <code>ids</code>, and <code>embeddings</code>) are obtained by subsetting their respective columns from the <code>recipe_sentence_embeddings</code> dataframe.</p>
<h2 id="heading-how-to-write-the-user-input-query-embedding-function">How to Write the User Input Query Embedding Function</h2>
<p>In order to retrieve information from a vector database, you must first embed your query text. The database compares your query's embedding with its stored embeddings to find and retrieve the most relevant document.</p>
<p>It's important to ensure that the dimensionality (the length of the vector) of your query embedding matches that of the database embeddings. This alignment is achieved by using the same embedding model and the same settings to generate your query embedding.</p>
<p>Matching embeddings involves calculating the similarity (for example, cosine similarity) between the query and stored embeddings, identifying the closest match for effective retrieval.</p>
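<p>As a rough sketch of that matching step, the code below scores a made-up query vector against three made-up stored vectors using cosine similarity. The helper <code>cosine_similarity()</code> and all of the vectors are hypothetical, and the vector database performs its own comparison internally, possibly with a different distance measure:</p>
<pre><code class="lang-r"># hypothetical helper: cosine of the angle between two numeric vectors
cosine_similarity &lt;- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# made-up embeddings, all with the same number of dimensions
query_vec &lt;- c(1, 0, 1)
stored &lt;- list(
  recipe1 = c(1, 0, 1),
  recipe2 = c(0, 1, 0),
  recipe3 = c(1, 1, 1)
)

# score every stored vector against the query
scores &lt;- sapply(stored, cosine_similarity, a = query_vec)
scores

# the highest score marks the closest match
names(which.max(scores))
</code></pre>
<p>A score near 1 means the two vectors point in almost the same direction, while a score of 0 means they share nothing along any dimension.</p>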
<p>Let’s write a function that embeds a query and then uses the generated embedding to retrieve similar documents. Wrapping it in a function makes it reusable.</p>
<pre><code class="lang-r">  <span class="hljs-comment">#sentence embeddings function and query</span>
  question &lt;- <span class="hljs-keyword">function</span>(sentence){
    sentence_embeddings &lt;- textEmbed(sentence,
                                     layers = <span class="hljs-number">10</span>:<span class="hljs-number">11</span>,
                                     aggregation_from_layers_to_tokens = <span class="hljs-string">"concatenate"</span>,
                                     aggregation_from_tokens_to_texts = <span class="hljs-string">"mean"</span>,
                                     keep_token_embeddings = <span class="hljs-literal">FALSE</span>
    )

    <span class="hljs-comment"># convert tibble to vector</span>
    sentence_vec_embeddings &lt;- unlist(sentence_embeddings, use.names = <span class="hljs-literal">FALSE</span>)
    sentence_vec_embeddings &lt;- list(sentence_vec_embeddings)

    <span class="hljs-comment"># Query similar documents using embeddings</span>
    results &lt;- query(
      client,
      <span class="hljs-string">"recipes_collection"</span>,
      query_embeddings = sentence_vec_embeddings ,
      n_results = <span class="hljs-number">2</span>
    )
    results

  }
</code></pre>
<p>This chunk of code is similar to how we previously used the <code>textEmbed()</code> function. The <code>query()</code> function is added to query the vector database, specifically the <code>recipes_collection</code>, and return the top two documents that most closely match a user’s query.</p>
<p>Our function thus takes in a sentence as an argument and embeds it to generate a sentence embedding. It then queries the database and returns the two documents that best match the query.</p>
<h2 id="heading-tool-calling">Tool Calling</h2>
<p>To interact with Ollama in R, you will utilize the <code>ellmer</code> library. This library streamlines the use of large language models (LLMs) by offering an interface that enables seamless access to and interaction with a variety of LLM providers.</p>
<p>To make the LLM’s answers more useful, we need to provide it with context. You can do this through tool calling, which allows an LLM to access external resources in order to enhance its functionality.</p>
<p>For this project, we are implementing <a target="_blank" href="https://www.freecodecamp.org/news/learn-rag-fundamentals-and-advanced-techniques/">Retrieval-Augmented Generation (RAG)</a>, which combines retrieving relevant information from a vector database and generating responses using an LLM. This approach improves the chatbot's ability to provide accurate and contextually relevant answers.</p>
<p>Now, define a function that links to the LLM to provide context using the <code>tool()</code> function from the <code>ellmer</code> library.</p>
<pre><code class="lang-r"><span class="hljs-comment"># load ellmer library</span>
<span class="hljs-keyword">library</span>(ellmer)

<span class="hljs-comment"># function that links to llm to provide context</span>
  tool_context  &lt;- tool(
    question,
    <span class="hljs-string">"obtains the right context for a given question"</span>,
    sentence = type_string()

  )
</code></pre>
<p>The first argument to the <code>tool()</code> function is the <code>question()</code> function, which returns the relevant documents that we’ll use as context. We’ll use these documents to help the LLM answer questions accordingly.</p>
<p>The text, "obtains the right context for a given question", is a description of what the tool will be doing.</p>
<p>Finally, the <code>sentence = type_string()</code> argument defines what type of object the <code>question()</code> function expects.</p>
<h2 id="heading-how-to-initialize-the-chat-system-design-prompts-and-integrate-tools">How to Initialize the Chat System, Design Prompts, and Integrate Tools</h2>
<p>Next, you’ll set up a conversational AI system by defining its role and functionality. Using system prompt design, you will shape the assistant’s behavior, tone, and focus as a culinary assistant. You’ll also integrate external tools to extend the chatbot’s capabilities by registering tools. Let’s dive in.</p>
<p>First, you need to initialize a Chat Object:</p>
<pre><code class="lang-r"><span class="hljs-comment">#  Initialize the chat system with prompt instructions.</span>
  chat &lt;- chat_ollama(system_prompt = <span class="hljs-string">"You are a knowledgeable culinary assistant specializing in recipe recommendations. 
                      You provide tailored meal suggestions based on the user's available ingredients and the desired amount of food or servings.
                      Ensure the recipes align closely with the user's inputs and yield the expected quantity."</span>,
                      model = <span class="hljs-string">"llama3.2:3b-instruct-q4_K_M"</span>)
</code></pre>
<p>You can do that using the <code>chat_ollama()</code> function. This sets up a conversational agent with the specified system prompt and model.</p>
<p>The system prompt defines the conversational behavior, tone, and focus of the LLM while the model argument specifies the language model (<code>llama3.2:3b-instruct-q4_K_M</code>) that the chat system will use to generate responses.</p>
<p>Next, you need to register a tool.</p>
<pre><code class="lang-r"> <span class="hljs-comment">#register tool</span>
  chat$register_tool(tool_context)
</code></pre>
<p>We need to tell our chat object about our <code>tool_context</code> tool. Do this by registering it with the <code>register_tool()</code> method.</p>
<h2 id="heading-how-to-interact-with-your-chatbot-using-a-shiny-app">How to Interact with Your Chatbot Using a Shiny App</h2>
<p>To interact with the chatbot you’ve just created, we’ll use <strong>Shiny</strong>, a framework for building interactive web applications in R. Shiny provides a user-friendly graphical interface that allows seamless interaction with the chatbot.</p>
<p>For this purpose, we’ll use the <strong>shinychat</strong> library, which simplifies the process of building a chat interface within a Shiny app. This involves defining two key components:</p>
<ol>
<li><p><strong>User Interface (UI)</strong>:</p>
<ul>
<li><p>Responsible for the visual layout and what the user sees.</p>
</li>
<li><p>In this case, <code>chat_ui("chat")</code> is used to create the interactive chat interface.</p>
</li>
</ul>
</li>
<li><p><strong>Server Function</strong>:</p>
<ul>
<li><p>Handles the functionality and logic of the application.</p>
</li>
<li><p>It connects the chatbot to external tools and manages processes like embedding queries, retrieving relevant responses, and handling user inputs.</p>
</li>
</ul>
</li>
</ol>
<pre><code class="lang-r"><span class="hljs-comment"># load the required library</span>
<span class="hljs-keyword">library</span>(shinychat)

<span class="hljs-comment"># wrap the chat code in a Shiny App</span>
ui &lt;- bslib::page_fluid(
  chat_ui(<span class="hljs-string">"chat"</span>)
)

server &lt;- <span class="hljs-keyword">function</span>(input, output, session) {
  <span class="hljs-comment"># Connect to a local ChromaDB instance running on docker with embeddings loaded</span>
  client &lt;- chroma_connect()

  <span class="hljs-comment">#sentence embeddings function and query</span>
  question &lt;- <span class="hljs-keyword">function</span>(sentence){
    sentence_embeddings &lt;- textEmbed(sentence,
                                     layers = <span class="hljs-number">10</span>:<span class="hljs-number">11</span>,
                                     aggregation_from_layers_to_tokens = <span class="hljs-string">"concatenate"</span>,
                                     aggregation_from_tokens_to_texts = <span class="hljs-string">"mean"</span>,
                                     keep_token_embeddings = <span class="hljs-literal">FALSE</span>
    )

    <span class="hljs-comment"># convert tibble to vector</span>
    sentence_vec_embeddings &lt;- unlist(sentence_embeddings, use.names = <span class="hljs-literal">FALSE</span>)
    sentence_vec_embeddings &lt;- list(sentence_vec_embeddings)

    <span class="hljs-comment"># Query similar documents using embeddings</span>
    results &lt;- query(
      client,
      <span class="hljs-string">"recipes_collection"</span>,
      query_embeddings = sentence_vec_embeddings ,
      n_results = <span class="hljs-number">2</span>
    )
    results

  }


  <span class="hljs-comment"># function that provides context</span>
  tool_context  &lt;- tool(
    question,
    <span class="hljs-string">"obtains the right context for a given question"</span>,
    sentence = type_string()

  )

  <span class="hljs-comment">#  Initialize the chat system with prompt instructions</span>
  chat &lt;- chat_ollama(system_prompt = <span class="hljs-string">"You are a knowledgeable culinary assistant specializing in recipe recommendations. 
                      You provide tailored meal suggestions based on the user's available ingredients and the desired amount of food or servings.
                      Ensure the recipes align closely with the user's inputs and yield the expected quantity."</span>,
                      model = <span class="hljs-string">"llama3.2:3b-instruct-q4_K_M"</span>)
  <span class="hljs-comment">#register tool</span>
  chat$register_tool(tool_context)

  observeEvent(input$chat_user_input, {
    stream &lt;- chat$stream_async(input$chat_user_input)
    chat_append(<span class="hljs-string">"chat"</span>, stream)
  })
}

shinyApp(ui, server)
</code></pre>
<p>Alright, let’s understand how this is working:</p>
<ol>
<li><p><strong>User input monitoring with</strong> <code>observeEvent()</code>: The <code>observeEvent()</code> block monitors user inputs from the chat interface (<code>input$chat_user_input</code>). When a user sends a message, the chatbot processes it, retrieves relevant context using the embeddings, and streams the response dynamically to the chat interface.</p>
</li>
<li><p><strong>Tool calling for context</strong>: The chatbot employs tool calling to interact with external resources (like the vector database) and enhance its functionality. In this project, Retrieval-Augmented Generation (RAG) ensures the chatbot provides accurate and context-rich responses by integrating retrieval and generation seamlessly.</p>
</li>
</ol>
<p>This approach brings the chatbot to life, enabling users to interact with it dynamically through a responsive Shiny app.</p>
<h2 id="heading-complete-code">Complete Code</h2>
<p>The R scripts have been split in two, with <code>data.R</code> containing code that handles data gathering and cleaning, text chunking, sentence embeddings generation, creating a vector database, and loading documents to it.</p>
<p>The <code>chat.R</code> script contains code that handles user input querying, context retrieval, chat initialization, system prompt design, tool integration, and a chat Shiny app.</p>
<p><strong>data.R</strong></p>
<pre><code class="lang-r"><span class="hljs-comment"># install and load required packages</span>
<span class="hljs-comment"># install devtools from CRAN</span>
install.packages(<span class="hljs-string">'devtools'</span>)
devtools::install_github(<span class="hljs-string">"benyamindsmith/RKaggle"</span>)

<span class="hljs-keyword">library</span>(text)
<span class="hljs-keyword">library</span>(rchroma)
<span class="hljs-keyword">library</span>(RKaggle)
<span class="hljs-keyword">library</span>(dplyr)

<span class="hljs-comment"># run ChromaDB instance.</span>
chroma_docker_run()

<span class="hljs-comment"># Connect to a local ChromaDB instance</span>
client &lt;- chroma_connect()

<span class="hljs-comment"># Check the connection</span>
heartbeat(client)
version(client)


<span class="hljs-comment"># Create a new collection</span>
create_collection(client, <span class="hljs-string">"recipes_collection"</span>)

<span class="hljs-comment"># List all collections</span>
list_collections(client)

<span class="hljs-comment"># Download and read the "recipe" dataset from Kaggle</span>
recipes_list &lt;- RKaggle::get_dataset(<span class="hljs-string">"thedevastator/better-recipes-for-a-better-life"</span>)

<span class="hljs-comment"># extract the first tibble</span>
recipes_df &lt;- recipes_list[[<span class="hljs-number">1</span>]]

<span class="hljs-comment"># convert to dataframe and drop the first column</span>
recipes_df &lt;- as.data.frame(recipes_df[, -<span class="hljs-number">1</span>])

<span class="hljs-comment"># drop unnecessary columns</span>
cleaned_recipes_df &lt;- subset(recipes_df, select = -c(yield,rating,url,cuisine_path,nutrition,timing,img_src))

<span class="hljs-comment">## Replace NA values dynamically based on conditions</span>
<span class="hljs-comment"># Replace NA in the time columns when both prep_time and cook_time are NA</span>
cols_to_modify &lt;- c(<span class="hljs-string">"prep_time"</span>, <span class="hljs-string">"cook_time"</span>, <span class="hljs-string">"total_time"</span>)
cleaned_recipes_df[cols_to_modify] &lt;- lapply(
  cleaned_recipes_df[cols_to_modify],
  <span class="hljs-keyword">function</span>(x, df) {
    <span class="hljs-comment"># Replace NA in prep_time and cook_time where both are NA</span>
    replace(x, is.na(df$prep_time) &amp; is.na(df$cook_time), <span class="hljs-string">"unknown"</span>)
  },
  df = cleaned_recipes_df  
)

<span class="hljs-comment"># Replace NA when only one of prep_time or cook_time is NA</span>
cleaned_recipes_df &lt;- cleaned_recipes_df %&gt;%
  mutate(
    prep_time = case_when(
      <span class="hljs-comment"># If cook_time is present but prep_time is NA, replace with "no preparation required"</span>
      !is.na(cook_time) &amp; is.na(prep_time) ~ <span class="hljs-string">"no preparation required"</span>,
      <span class="hljs-comment"># Otherwise, retain original value</span>
      <span class="hljs-literal">TRUE</span> ~ as.character(prep_time)
    ),
    cook_time = case_when(
      <span class="hljs-comment"># If prep_time is present but cook_time is NA, replace with "no cooking required"</span>
      !is.na(prep_time) &amp; is.na(cook_time) ~ <span class="hljs-string">"no cooking required"</span>,
      <span class="hljs-comment"># Otherwise, retain original value</span>
      <span class="hljs-literal">TRUE</span> ~ as.character(cook_time)
    )
  )

<span class="hljs-comment"># chunk the dataset</span>
chunk_size &lt;- <span class="hljs-number">1</span>
n &lt;- nrow(cleaned_recipes_df)
r &lt;- rep(<span class="hljs-number">1</span>:ceiling(n/chunk_size),each = chunk_size)[<span class="hljs-number">1</span>:n]
chunks &lt;- split(cleaned_recipes_df,r)

<span class="hljs-comment">#empty dataframe</span>
recipe_sentence_embeddings &lt;-  data.frame(
  recipe = character(),
  recipe_vec_embeddings = I(list()),
  recipe_id = character()
)

<span class="hljs-comment"># create a progress bar</span>
pb &lt;- txtProgressBar(min = <span class="hljs-number">0</span>, max = length(chunks), style = <span class="hljs-number">3</span>)

<span class="hljs-comment"># embedding data</span>
<span class="hljs-keyword">for</span> (i <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:length(chunks)) {
  recipe &lt;- as.character(chunks[i])
  recipe_id &lt;- paste0(<span class="hljs-string">"recipe"</span>,i)
  recipe_embeddings &lt;- textEmbed(as.character(recipe),
                                layers = <span class="hljs-number">10</span>:<span class="hljs-number">11</span>,
                                aggregation_from_layers_to_tokens = <span class="hljs-string">"concatenate"</span>,
                                aggregation_from_tokens_to_texts = <span class="hljs-string">"mean"</span>,
                                keep_token_embeddings = <span class="hljs-literal">FALSE</span>,
                                batch_size = <span class="hljs-number">1</span>
  )

  <span class="hljs-comment"># convert tibble to vector</span>
  recipe_vec_embeddings &lt;- unlist(recipe_embeddings, use.names = <span class="hljs-literal">FALSE</span>)
  recipe_vec_embeddings &lt;- list(recipe_vec_embeddings)

  <span class="hljs-comment"># Append the current chunk's data to the dataframe</span>
  recipe_sentence_embeddings &lt;- recipe_sentence_embeddings %&gt;%
    add_row(
      recipe = recipe,
      recipe_vec_embeddings = recipe_vec_embeddings,
      recipe_id = recipe_id
    )

  <span class="hljs-comment"># track embedding progress</span>
  setTxtProgressBar(pb, i)

}

<span class="hljs-comment"># Add documents to the collection</span>
add_documents(
  client,
  <span class="hljs-string">"recipes_collection"</span>,
  documents = recipe_sentence_embeddings$recipe,
  ids = recipe_sentence_embeddings$recipe_id,
  embeddings = recipe_sentence_embeddings$recipe_vec_embeddings
)
</code></pre>
<p><strong>chat.R</strong></p>
<pre><code class="lang-r"><span class="hljs-comment"># Load required packages</span>
<span class="hljs-keyword">library</span>(ellmer)
<span class="hljs-keyword">library</span>(text)
<span class="hljs-keyword">library</span>(rchroma)
<span class="hljs-keyword">library</span>(shinychat)

ui &lt;- bslib::page_fluid(
  chat_ui(<span class="hljs-string">"chat"</span>)
)

server &lt;- <span class="hljs-keyword">function</span>(input, output, session) {
  <span class="hljs-comment"># Connect to a local ChromaDB instance running on docker with embeddings loaded </span>
  client &lt;- chroma_connect()

  <span class="hljs-comment"># sentence embeddings function and query</span>
  question &lt;- <span class="hljs-keyword">function</span>(sentence){
    sentence_embeddings &lt;- textEmbed(sentence,
                                     layers = <span class="hljs-number">10</span>:<span class="hljs-number">11</span>,
                                     aggregation_from_layers_to_tokens = <span class="hljs-string">"concatenate"</span>,
                                     aggregation_from_tokens_to_texts = <span class="hljs-string">"mean"</span>,
                                     keep_token_embeddings = <span class="hljs-literal">FALSE</span>
    )

    <span class="hljs-comment"># convert tibble to vector</span>
    sentence_vec_embeddings &lt;- unlist(sentence_embeddings, use.names = <span class="hljs-literal">FALSE</span>)
    sentence_vec_embeddings &lt;- list(sentence_vec_embeddings)

    <span class="hljs-comment"># Query similar documents</span>
    results &lt;- query(
      client,
      <span class="hljs-string">"recipes_collection"</span>,
      query_embeddings = sentence_vec_embeddings ,
      n_results = <span class="hljs-number">2</span>
    )
    results

  }


  <span class="hljs-comment"># function that provides context</span>
  tool_context  &lt;- tool(
    question,
    <span class="hljs-string">"obtains the right context for a given question"</span>,
    sentence = type_string()

  )

  <span class="hljs-comment">#  Initialize the chat system </span>
  chat &lt;- chat_ollama(system_prompt = <span class="hljs-string">"You are a knowledgeable culinary assistant specializing in recipe recommendations. 
                      You provide tailored meal suggestions based on the user's available ingredients and the desired amount of food or servings.
                      Ensure the recipes align closely with the user's inputs and yield the expected quantity."</span>,
                      model = <span class="hljs-string">"llama3.2:3b-instruct-q4_K_M"</span>)
  <span class="hljs-comment">#register tool</span>
  chat$register_tool(tool_context)

  observeEvent(input$chat_user_input, {
    stream &lt;- chat$stream_async(input$chat_user_input)
    chat_append(<span class="hljs-string">"chat"</span>, stream)
  })
}

shinyApp(ui, server)
</code></pre>
<p>You can find the complete code <a target="_blank" href="https://github.com/elabongaatuo/Recipe-Chatbot/">here</a>.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building a local Retrieval-Augmented Generation (RAG) application with Ollama and ChromaDB in R is a powerful way to create a specialized conversational assistant.</p>
<p>By leveraging the capabilities of large language models and vector databases, you can efficiently manage and retrieve relevant information from extensive datasets.</p>
<p>This approach not only enhances the performance of language models but also ensures customization and privacy by running the application locally.</p>
<p>Whether you're developing a cooking assistant or any other domain-specific chatbot, this method provides a robust framework for delivering intelligent and contextually aware responses.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744380659737/4e096d1c-87d6-4baa-bbf3-03657e05c182.gif" alt="Chatbot running on Shiny giving relevant recipe after user prompt" class="image--center mx-auto" width="800" height="903" loading="lazy"></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Build Your Own RAG Chatbot with JavaScript! ]]>
                </title>
                <description>
                    <![CDATA[ Generative AI is rapidly evolving, and with it comes the ability to create powerful applications like chatbots that go beyond static knowledge. Imagine building a chatbot that isn’t limited by outdated training data or predefined responses but can fe... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-your-own-rag-chatbot-with-javascript/</link>
                <guid isPermaLink="false">672cebbcd4b84c7df59edd79</guid>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ JavaScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 07 Nov 2024 16:33:00 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1730997146044/f942fb19-9cfd-4981-ba89-fe3ae9a156c1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Generative AI is rapidly evolving, and with it comes the ability to create powerful applications like chatbots that go beyond static knowledge. Imagine building a chatbot that isn’t limited by outdated training data or predefined responses but can fetch real-time information and provide tailored answers based on your own custom data. This is exactly what you’ll accomplish in this new course.</p>
<p>We just published a new course on the <a target="_blank" href="http://freeCodeCamp.org">freeCodeCamp.org</a> YouTube channel that will teach you how to build and deploy a RAG chatbot using your own data. This hands-on 90-minute tutorial, led by popular creator Ania Kubow, will teach you how to create a Retrieval-Augmented Generation (RAG) chatbot with JavaScript using tools like LangChain.js, Next.js, and OpenAI. You’ll also deploy it to Vercel and integrate a vector database with DataStax.</p>
<p>By the end of this course, you’ll have built a custom Formula 1 chatbot capable of answering real-time questions about the sport. And you’ll learn how to apply this concept to any dataset, whether it’s FAQs from your website, private business documents, or any other data source you choose.</p>
<h3 id="heading-what-youll-learn">What You’ll Learn</h3>
<p>This course introduces essential AI concepts while equipping you with practical skills to build, train, and deploy a chatbot. Here's what’s covered:</p>
<ol>
<li><p><strong>What is RAG?</strong><br> Learn the fundamentals of Retrieval-Augmented Generation (RAG), an approach that extends the capabilities of large language models (LLMs) by combining them with external data sources to generate more accurate and relevant answers.</p>
</li>
<li><p><strong>Prerequisites and Setup</strong></p>
<ul>
<li><p>Learn the tools you’ll need, including a DataStax API key for vector database management.</p>
</li>
<li><p>Explore the basics of integrating LangChain.js with your development environment.</p>
</li>
</ul>
</li>
<li><p><strong>Vector Embeddings and Databases</strong><br> Understand how vector embeddings convert textual data into numerical formats that AI models can process. Then, learn to manage these embeddings using a DataStax vector database.</p>
</li>
<li><p><strong>OpenAI Integration</strong><br> Discover how to enhance chatbot responses by leveraging OpenAI’s powerful language models to add conversational flair.</p>
</li>
<li><p><strong>Building the F1 RAG Chatbot</strong><br> Follow step-by-step instructions to scrape real-time Formula 1 data, store it as vector embeddings, and create a chatbot capable of answering the latest questions about F1.</p>
</li>
<li><p><strong>Deployment on Vercel</strong><br> Learn to deploy your chatbot with Next.js on Vercel, making it live and accessible to anyone.</p>
</li>
</ol>
<h3 id="heading-real-life-use-cases">Real-Life Use Cases</h3>
<p>This course doesn’t just stop at Formula 1. The skills you’ll learn can be applied to countless other scenarios:</p>
<ul>
<li><p><strong>Business Applications</strong>: Create a chatbot to answer customer FAQs or navigate complex company policies.</p>
</li>
<li><p><strong>Education</strong>: Build tools that provide up-to-date resources for students and educators.</p>
</li>
<li><p><strong>Personal Use</strong>: Develop chatbots that assist with scheduling, document retrieval, or even journaling.</p>
</li>
</ul>
<h3 id="heading-what-is-rag-and-why-does-it-matter">What is RAG and Why Does It Matter?</h3>
<p>Retrieval-Augmented Generation (RAG) combines the text generation capabilities of LLMs with custom or up-to-date data retrieval. This approach avoids the high costs and complexity of retraining large models while addressing limitations like outdated training data. Whether you’re topping up an LLM’s knowledge with real-time internet data or private documents, RAG ensures your chatbot delivers relevant, context-aware answers.</p>
<h3 id="heading-ready-to-build-your-first-rag-chatbot">Ready to Build Your First RAG Chatbot?</h3>
<p>This course is packed with everything you need to create and deploy a powerful, real-time chatbot. Watch the full tutorial on <a target="_blank" href="https://youtu.be/d-VKYF4Zow0">the freeCodeCamp.org YouTube channel</a> and unlock the potential of generative AI for JavaScript developers (2-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/d-VKYF4Zow0" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a RAG Pipeline with LlamaIndex ]]>
                </title>
                <description>
                    <![CDATA[ Large Language Models are everywhere these days – think ChatGPT – but they have their fair share of challenges. One of the biggest challenges faced by LLMs is hallucination. This occurs when the model generates text that is factually incorrect or mis... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-rag-pipeline-with-llamaindex/</link>
                <guid isPermaLink="false">66d1c98990f244bf8b6cb9d3</guid>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ LlamaIndex ]]>
                    </category>
                
                    <category>
                        <![CDATA[ generative ai ]]>
                    </category>
                
                    <category>
                        <![CDATA[ IBM WatsonX ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ large language models ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Bhavishya Pandit ]]>
                </dc:creator>
                <pubDate>Fri, 30 Aug 2024 13:30:49 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1725024307257/62401eea-25ab-4f00-93d7-76d7c49cf330.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Large Language Models are everywhere these days – think ChatGPT – but they have their fair share of challenges.</p>
<p>One of the biggest challenges faced by LLMs is hallucination. This occurs when the model generates text that is factually incorrect or misleading, often based on patterns it has learned from its training data. So how can Retrieval-Augmented Generation, or RAG, help mitigate this issue?</p>
<p>By retrieving relevant information from a larger, external knowledge base, RAG helps ground the LLM's responses in real-world facts. This significantly reduces the likelihood of hallucinations and improves the overall accuracy and reliability of the generated content.</p>
<h2 id="heading-table-of-contents">Table of Contents:</h2>
<ol>
<li><p><a target="_blank" href="#heading-what-is-retrieval-augmented-generation-rag">What is Retrieval Augmented Generation (RAG)?</a></p>
</li>
<li><p><a target="_blank" href="#heading-understanding-the-components-of-a-rag-pipeline">Understanding the Components of a RAG Pipeline</a></p>
</li>
<li><p><a target="_blank" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a target="_blank" href="#heading-lets-get-started">Let's Get Started!</a></p>
</li>
<li><p><a target="_blank" href="#heading-how-to-fine-tune-the-pipeline">How to Fine-Tune the Pipeline</a></p>
</li>
<li><p><a target="_blank" href="#heading-real-world-applications-of-rag">Real-World Applications of RAG</a></p>
</li>
<li><p><a target="_blank" href="#heading-rag-best-practices-and-considerations">RAG Best Practices and Considerations</a></p>
</li>
<li><p><a target="_blank" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-what-is-retrieval-augmented-generation-rag">What is Retrieval Augmented Generation (RAG)?</h2>
<p>RAG is a technique that combines information retrieval with language generation. Think of it as a two-step process:</p>
<ol>
<li><p><strong>Retrieval:</strong> The model first retrieves relevant information from a large corpus of documents based on the user's query.</p>
</li>
<li><p><strong>Generation:</strong> Using this retrieved information, the model then generates a comprehensive and informative response.</p>
</li>
</ol>
<h3 id="heading-why-use-llamaindex-for-rag">Why use LlamaIndex for RAG?</h3>
<p>LlamaIndex is a powerful framework that simplifies the process of building RAG pipelines. It provides a flexible and efficient way to connect retrieval components (like vector databases and embedding models) with generation components (like LLMs).</p>
<p><strong>Some of the key benefits of using LlamaIndex include:</strong></p>
<ul>
<li><p><strong>Modularity:</strong> It allows you to easily customize and experiment with different components.</p>
</li>
<li><p><strong>Scalability:</strong> It can handle large datasets and complex queries.</p>
</li>
<li><p><strong>Ease of use:</strong> It provides a high-level API that abstracts away much of the underlying complexity.</p>
</li>
</ul>
<h3 id="heading-what-youll-learn-here">What You'll Learn Here:</h3>
<p>In this article, we will delve deeper into the components of a RAG pipeline and explore how you can use LlamaIndex to build these systems.</p>
<p>We will cover topics such as vector databases, embedding models, language models, and the role of LlamaIndex in connecting these components.</p>
<h2 id="heading-understanding-the-components-of-a-rag-pipeline">Understanding the Components of a RAG Pipeline</h2>
<p>Here's a diagram that'll help familiarize you with the basics of RAG architecture:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724944925051/e525c6cb-6a99-4eec-8b47-3dc827ddff25.png" alt="RAG Architecture showing the flow from the user query through to the response" class="image--center mx-auto" width="1920" height="1080" loading="lazy"></p>
<p>This diagram is inspired by <a target="_blank" href="https://www.fivetran.com/blog/assembling-a-rag-architecture-using-fivetran">this article</a>. Let's go through the key pieces.</p>
<h3 id="heading-components-of-rag">Components of RAG</h3>
<p><strong>Retrieval Component:</strong></p>
<ul>
<li><p><strong>Vector Databases:</strong> These databases are optimized for storing and searching high-dimensional vectors. They are crucial for efficiently finding relevant information from a vast corpus of documents.</p>
</li>
<li><p><strong>Embedding Models:</strong> These models convert text into numerical representations or embeddings. These embeddings capture the semantic meaning of the text, allowing for efficient comparison and retrieval in vector databases.</p>
</li>
</ul>
<p>A vector is a mathematical object that represents a quantity with both magnitude (size) and direction. In the context of RAG, embeddings are high-dimensional vectors that capture the semantic meaning of text. The dimensions jointly encode aspects of that meaning (a single dimension is usually not interpretable on its own), so texts with similar meanings end up pointing in similar directions. This is what makes comparison and retrieval efficient.</p>
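<p>One common way to compare two embeddings is cosine similarity, which measures how closely two vectors point in the same direction. The sketch below is only an illustration: the three-dimensional vectors are made up (real embeddings have hundreds of dimensions), and <code>cosine_similarity</code> is a hypothetical helper written for this example, not a LlamaIndex function.</p>
<pre><code class="lang-python">import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for illustration only
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

print(cosine_similarity(cat, kitten))  # close to 1: similar direction
print(cosine_similarity(cat, car))     # close to 0: very different direction
</code></pre>
<p>A score near 1 means the two texts are likely related, while a score near 0 means they have little in common.</p>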
<p><strong>Generation Component:</strong></p>
<ul>
<li><strong>Language Models:</strong> These models are trained on massive amounts of text data, enabling them to generate human-quality text. They are capable of understanding and responding to prompts in a coherent and informative manner.</li>
</ul>
<h3 id="heading-the-rag-flow">The RAG Flow</h3>
<ol>
<li><p><strong>Query Submission:</strong> A user submits a query or question.</p>
</li>
<li><p><strong>Embedding Creation:</strong> The query is converted into an embedding using the same embedding model used for the corpus.</p>
</li>
<li><p><strong>Retrieval:</strong> The embedding is searched against the vector database to find the most relevant documents.</p>
</li>
<li><p><strong>Contextualization:</strong> The retrieved documents are combined with the original query to form a context.</p>
</li>
<li><p><strong>Generation:</strong> The language model generates a response based on the provided context.</p>
</li>
</ol>
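<p>The five steps above can be sketched in plain Python. This is a toy under stated assumptions: word counts stand in for a real embedding model, a short list stands in for the vector database, and the helper names (<code>embed</code>, <code>similarity</code>, <code>retrieve</code>, <code>build_prompt</code>) are hypothetical. The last step, calling the language model, is left out.</p>
<pre><code class="lang-python">import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, top_k=1):
    """Steps 2-3: embed the query and return the top_k most similar documents."""
    query_vec = embed(query)
    ranked = sorted(corpus, key=lambda doc: similarity(query_vec, embed(doc)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents):
    """Step 4: combine the retrieved documents with the original query."""
    context = "\n".join(documents)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "LlamaIndex connects retrieval components with language models.",
    "A vector database stores embeddings for similarity search.",
    "Python is a popular programming language.",
]
query = "What does a vector database store?"  # Step 1: the user's query
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)  # Step 5 would send this prompt to the language model
</code></pre>
<p>In a real pipeline, the embedding model, the vector database, and the LLM replace these toy pieces, but the order of the steps stays the same.</p>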
<h3 id="heading-lamaindex">LlamaIndex</h3>
<p>LlamaIndex plays a crucial role in connecting the retrieval and generation components. It acts as an index that maps queries to relevant documents. By efficiently managing the index, LlamaIndex ensures that the retrieval process is fast and accurate.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>We will be using Python and <a target="_blank" href="https://www.ibm.com/products/watsonx-ai">IBM watsonx</a> via LlamaIndex in this article. You should have the following on your system before getting started:</p>
<ul>
<li><p>Python 3.9+</p>
</li>
<li><p><a target="_blank" href="https://dataplatform.cloud.ibm.com/docs/content/wsj/admin/admin-apikeys.html?context=wx">IBM watsonx project and API key</a></p>
</li>
<li><p>Curiosity to learn</p>
</li>
</ul>
<h2 id="heading-lets-get-started">Let's Get Started!</h2>
<p>In this article, we will be using LlamaIndex to make a simple RAG Pipeline.</p>
<p>Let's create a virtual environment by running the following command in your terminal: <code>python -m venv venv</code>. This creates a virtual environment named <code>venv</code> for your project. On Windows, activate it with <code>.\venv\Scripts\activate</code>. On macOS or Linux, use <code>source venv/bin/activate</code>.</p>
<p>Now let's install the packages:</p>
<pre><code class="lang-python">pip install wikipedia llama-index-llms-ibm llama-index-embeddings-huggingface
</code></pre>
<p>Once these packages are installed, you will need watsonx.ai's API key as well. This in turn will help you use LLMs via LlamaIndex.</p>
<p>To learn how to get your watsonx.ai API key, click <a target="_blank" href="https://cloud.ibm.com/docs/account?topic=account-userapikey&amp;interface=ui">here</a>. You need both the project ID and the API key for the "Generation" part of RAG, since they let you make LLM calls through watsonx.ai.</p>
<p>First, though, we need some data to retrieve from. Let's fetch the Wikipedia page on Artificial Intelligence using the <code>wikipedia</code> package:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> wikipedia

<span class="hljs-comment"># Search for a specific page</span>
page = wikipedia.page(<span class="hljs-string">"Artificial Intelligence"</span>)

<span class="hljs-comment"># Access the content</span>
print(page.content)
</code></pre>
<p>Now let's save the page content to a text file so that we can load it later. You can do this using the code below:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os

<span class="hljs-comment"># Create the 'Document' directory if it doesn't exist</span>
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(<span class="hljs-string">'Document'</span>):
    os.mkdir(<span class="hljs-string">'Document'</span>)

<span class="hljs-comment"># Open the file 'AI.txt' in write mode with UTF-8 encoding</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">'Document/AI.txt'</span>, <span class="hljs-string">'w'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> f:
    <span class="hljs-comment"># Write the content of the 'page' object to the file</span>
    f.write(page.content)
</code></pre>
<p>Now we'll be using watsonx.ai via LlamaIndex. It will help us generate responses based on the user's query.</p>
<p>Note: Make sure to replace the parameters <code>WATSONX_APIKEY</code> and <code>project_id</code> with your values in the below code:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> llama_index.llms.ibm <span class="hljs-keyword">import</span> WatsonxLLM
<span class="hljs-keyword">from</span> llama_index.core <span class="hljs-keyword">import</span> SimpleDirectoryReader, Document


<span class="hljs-comment"># Define a function to generate responses using the WatsonxLLM instance</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_response</span>(<span class="hljs-params">prompt</span>):</span>
    <span class="hljs-string">"""
    Generates a response to the given prompt using the WatsonxLLM instance.

    Args:
        prompt (str): The prompt to provide to the large language model.

    Returns:
        CompletionResponse: The response object returned by WatsonxLLM (its text is in response.text).
    """</span>

    response = watsonx_llm.complete(prompt)
    <span class="hljs-keyword">return</span> response

<span class="hljs-comment"># Set the WATSONX_APIKEY environment variable (replace with your actual key)</span>
os.environ[<span class="hljs-string">"WATSONX_APIKEY"</span>] = <span class="hljs-string">'YOUR_WATSONX_APIKEY'</span>  <span class="hljs-comment"># Replace with your API key</span>

<span class="hljs-comment"># Define model parameters (adjust as needed)</span>
temperature = <span class="hljs-number">0</span>
max_new_tokens = <span class="hljs-number">1500</span>
additional_params = {
    <span class="hljs-string">"decoding_method"</span>: <span class="hljs-string">"sample"</span>,
    <span class="hljs-string">"min_new_tokens"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"top_k"</span>: <span class="hljs-number">50</span>,
    <span class="hljs-string">"top_p"</span>: <span class="hljs-number">1</span>,
}

<span class="hljs-comment"># Create a WatsonxLLM instance with the specified model, URL, project ID, and parameters</span>
watsonx_llm = WatsonxLLM(
    model_id=<span class="hljs-string">"meta-llama/llama-3-1-70b-instruct"</span>,
    url=<span class="hljs-string">"https://us-south.ml.cloud.ibm.com"</span>,
    project_id=<span class="hljs-string">"YOUR_PROJECT_ID"</span>,
    temperature=temperature,
    max_new_tokens=max_new_tokens,
    additional_params=additional_params,
)

<span class="hljs-comment"># Load documents from the specified directory</span>
documents = SimpleDirectoryReader(
    input_files=[<span class="hljs-string">"Document/AI.txt"</span>]
).load_data()

<span class="hljs-comment"># Combine the text content of all documents into a single Document object</span>
combined_documents = Document(text=<span class="hljs-string">"\n\n"</span>.join([doc.text <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> documents]))

<span class="hljs-comment"># Print the combined document</span>
print(combined_documents)
</code></pre>
<p>Here's a breakdown of the parameters:</p>
<ul>
<li><p><strong>temperature = 0:</strong> Temperature controls how much randomness is used when picking the next token. A value of 0 pushes the model toward the most likely token at each step, which makes the output more deterministic and predictable.</p>
</li>
<li><p><strong>max_new_tokens = 1500:</strong> This limits the generated text to a maximum of 1500 new tokens (words or parts of words).</p>
</li>
<li><p><strong>additional_params:</strong></p>
<ul>
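<li><p><strong>decoding_method = "sample":</strong> The model draws each next token at random from its probability distribution (as shaped by temperature, top_k, and top_p) instead of always picking the single most likely token, as greedy decoding does.</p>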
</li>
<li><p><strong>min_new_tokens = 1:</strong> Ensures that at least one new token is generated, so the model never returns an empty response.</p>
</li>
<li><p><strong>top_k = 50:</strong> This limits the model's choices to the 50 most likely tokens at each step, making the output more focused and less random.</p>
</li>
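<li><p><strong>top_p = 1:</strong> This is the nucleus sampling threshold. The model samples only from the smallest set of most likely tokens whose cumulative probability reaches top_p. A value of 1 effectively turns this filter off, so every token that survives the top_k cut stays in play.</p>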
</li>
</ul>
</li>
</ul>
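<p>To see how <code>top_k</code> and <code>top_p</code> narrow down the candidate tokens, here is a small sketch in plain Python. It assumes top-k is applied before top-p, which is one common arrangement and not necessarily how watsonx.ai does it internally. The function name <code>filter_top_k_top_p</code> and the probabilities are made up for illustration.</p>
<pre><code class="lang-python">def filter_top_k_top_p(probs, top_k, top_p):
    """Return the candidate tokens left after top-k and then top-p filtering.

    probs maps each token to its probability. The survivors are renormalized
    so that their probabilities sum to 1 again.
    """
    # Keep only the top_k most likely tokens
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative &gt;= top_p:
            break

    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical next-token distribution
probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}

print(filter_top_k_top_p(probs, top_k=50, top_p=1))     # all four tokens remain
print(filter_top_k_top_p(probs, top_k=2, top_p=1))      # only "the" and "a" remain
print(filter_top_k_top_p(probs, top_k=50, top_p=0.75))  # "the" and "a": 0.5 + 0.3 reaches 0.75
</code></pre>
<p>With the settings used in this article (<code>top_k = 50</code>, <code>top_p = 1</code>), only the top-k cut has any effect. The model then samples the next token from whatever candidates remain.</p>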
<p>You can tweak these parameters for experimentation and see how they affect your response. Now we'll be building and loading a vector store index from the given document. But first, let's understand what it is.</p>
<h3 id="heading-understanding-vector-store-indexes">Understanding Vector Store Indexes</h3>
<p>A vector store index is a specialized data structure designed to efficiently store and retrieve high-dimensional vectors. In LlamaIndex, these vectors represent the semantic embeddings of documents.</p>
<p><strong>Key characteristics of vector store indexes:</strong></p>
<ul>
<li><p><strong>High-dimensional vectors:</strong> Each document is represented as a high-dimensional vector, capturing its semantic meaning.</p>
</li>
<li><p><strong>Efficient retrieval:</strong> Vector store indexes are optimized for fast similarity search, allowing you to quickly find documents that are semantically similar to a given query.</p>
</li>
<li><p><strong>Scalability:</strong> They can handle large datasets and scale efficiently as the number of documents grows.</p>
</li>
</ul>
<p><strong>How LlamaIndex uses vector store indexes:</strong></p>
<ol>
<li><p><strong>Document Embedding:</strong> Documents are first split into chunks and converted into high-dimensional vectors using an embedding model (in this article, <code>BAAI/bge-small-en-v1.5</code>).</p>
</li>
<li><p><strong>Index Creation:</strong> The embeddings are stored in a vector store index.</p>
</li>
<li><p><strong>Query Processing:</strong> When a user submits a query, it is also converted into a vector. The vector store index is then used to find the most similar documents based on their embeddings.</p>
</li>
<li><p><strong>Response Generation:</strong> The retrieved documents are used to generate a relevant response.</p>
</li>
</ol>
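<p>The embed, index, and query steps above can be sketched without any library. In this toy version, a bag-of-words count vector stands in for a real embedding model, and the names <code>embed</code>, <code>cosine_similarity</code>, and <code>retrieve</code> are hypothetical helpers, not LlamaIndex APIs. A real vector store uses dense learned embeddings and faster search, but the flow is the same.</p>

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, index, top_k=2):
    """Rank indexed chunks by similarity to the query and return the best top_k."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    return sorted(scored, reverse=True)[:top_k]


# Steps 1 and 2: embed each chunk once and store the vectors (the "index")
chunks = [
    "deep learning uses neural networks with many layers",
    "a vector store keeps embeddings for similarity search",
    "gpus made training neural networks much faster",
]
index = [(text, embed(text)) for text in chunks]

# Step 3: embed the query and look up the closest chunks
for score, text in retrieve("what is deep learning", index):
    print(round(score, 3), text)
```

<p>Step 4, response generation, would then pass the retrieved text to the language model as context.</p>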
<p>In the code below, you'll come across the word "chunk". <strong>A chunk</strong> is a smaller, manageable unit of text extracted from a larger document. It's typically a paragraph or a few sentences long. Chunks make the retrieval and processing of information more efficient, especially when dealing with large documents.</p>
<p>By breaking down documents into chunks, RAG systems can focus on the most relevant parts and generate more accurate and concise responses.</p>
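<p>As a rough illustration of what a chunk size and chunk overlap mean, here is a simplified character-based splitter. It is only a sketch: the function <code>split_into_chunks</code> is hypothetical, and LlamaIndex's SentenceSplitter, used later in this tutorial, measures size in tokens and tries to respect sentence boundaries rather than cutting at fixed character offsets.</p>

```python
def split_into_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# A 2,500-character dummy document, for illustration only
sample = "0123456789" * 250
print([len(chunk) for chunk in split_into_chunks(sample)])  # [1000, 1000, 900]
```

<p>The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.</p>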
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os

<span class="hljs-keyword">from</span> llama_index.core.node_parser <span class="hljs-keyword">import</span> SentenceSplitter
<span class="hljs-keyword">from</span> llama_index.core <span class="hljs-keyword">import</span> VectorStoreIndex, load_index_from_storage
<span class="hljs-keyword">from</span> llama_index.core <span class="hljs-keyword">import</span> Settings
<span class="hljs-keyword">from</span> llama_index.core <span class="hljs-keyword">import</span> StorageContext

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_build_index</span>(<span class="hljs-params">documents, embed_model=<span class="hljs-string">"local:BAAI/bge-small-en-v1.5"</span>, save_dir=<span class="hljs-string">"./vector_store/index"</span></span>):</span>
    <span class="hljs-string">"""
    Builds or loads a vector store index from the given documents.

    Args:
        documents (list[Document]): A list of Document objects.
        embed_model (str, optional): The embedding model to use. Defaults to "local:BAAI/bge-small-en-v1.5".
        save_dir (str, optional): The directory to save or load the index from. Defaults to "./vector_store/index".

    Returns:
        VectorStoreIndex: The built or loaded index.
    """</span>

    <span class="hljs-comment"># Set index settings</span>
    Settings.llm = watsonx_llm
    Settings.embed_model = embed_model
    Settings.node_parser = SentenceSplitter(chunk_size=<span class="hljs-number">1000</span>, chunk_overlap=<span class="hljs-number">200</span>)
    Settings.num_output = <span class="hljs-number">512</span>
    Settings.context_window = <span class="hljs-number">3900</span>

    <span class="hljs-comment"># Check if the save directory exists</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(save_dir):
        <span class="hljs-comment"># Build the index; accept either a single Document or a list of Documents</span>
        docs = documents <span class="hljs-keyword">if</span> isinstance(documents, list) <span class="hljs-keyword">else</span> [documents]
        <span class="hljs-comment"># The global Settings configured above are picked up automatically</span>
        index = VectorStoreIndex.from_documents(docs)
        index.storage_context.persist(persist_dir=save_dir)
    <span class="hljs-keyword">else</span>:
        <span class="hljs-comment"># Load the existing index</span>
        index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=save_dir)
        )
    <span class="hljs-keyword">return</span> index

<span class="hljs-comment"># Get the Vector Index</span>
vector_index = get_build_index(documents=documents, embed_model=<span class="hljs-string">"local:BAAI/bge-small-en-v1.5"</span>, save_dir=<span class="hljs-string">"./vector_store/index"</span>)
</code></pre>
<p>This is the last part of the RAG pipeline: we create a query engine with metadata replacement and sentence transformer reranking. But what is a re-ranker?</p>
<p><strong>A re-ranker</strong> is a component that reorders the retrieved documents based on their relevance to the query. It uses additional information, such as semantic similarity or context-specific factors, to refine the initial ranking provided by the retrieval system. This helps ensure that the most relevant documents are presented to the user, leading to more accurate and informative responses.</p>
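<p>The two-stage idea can be sketched in a few lines of plain Python. Everything here is a toy assumption: <code>word_overlap</code> stands in for the fast vector search, <code>phrase_match</code> stands in for a heavier reranking model, and <code>retrieve_then_rerank</code> is a hypothetical helper rather than a LlamaIndex function.</p>

```python
def retrieve_then_rerank(query, candidates, first_stage_score, rerank_score,
                         similarity_top_k=6, rerank_top_n=2):
    """Shortlist with a cheap scorer, then reorder the shortlist with a costlier one."""
    # Stage 1: cheap similarity search keeps the similarity_top_k best candidates
    shortlist = sorted(
        candidates, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:similarity_top_k]

    # Stage 2: the re-ranker rescores only the shortlist and keeps rerank_top_n
    reranked = sorted(
        shortlist, key=lambda doc: rerank_score(query, doc), reverse=True
    )
    return reranked[:rerank_top_n]


def word_overlap(query, doc):
    """Toy first-stage scorer: number of words shared by query and document."""
    return len(set(query.lower().split()).intersection(doc.lower().split()))


def phrase_match(query, doc):
    """Toy re-ranker: word overlap plus a bonus when the exact query phrase appears."""
    return word_overlap(query, doc) + (2 if query.lower() in doc.lower() else 0)


docs = [
    "learning to swim in deep water",
    "deep learning stacks many neural network layers",
    "gardening tips for spring",
]
print(retrieve_then_rerank("deep learning", docs, word_overlap, phrase_match,
                           similarity_top_k=2, rerank_top_n=1))
# ['deep learning stacks many neural network layers']
```

<p>The first stage cannot tell the two shortlisted documents apart because both share two words with the query. The second stage looks more closely and promotes the one that is actually about deep learning.</p>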
<pre><code class="lang-python"><span class="hljs-keyword">from</span> llama_index.core.postprocessor <span class="hljs-keyword">import</span> MetadataReplacementPostProcessor, SentenceTransformerRerank

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_query_engine</span>(<span class="hljs-params">sentence_index, similarity_top_k=<span class="hljs-number">6</span>, rerank_top_n=<span class="hljs-number">2</span></span>):</span>
    <span class="hljs-string">"""
    Creates a query engine with metadata replacement and sentence transformer reranking.

    Args:
        sentence_index (VectorStoreIndex): The sentence index to use.
        similarity_top_k (int, optional): The number of similar nodes to retrieve. Defaults to 6.
        rerank_top_n (int, optional): The number of nodes to keep after reranking. Defaults to 2.

    Returns:
        QueryEngine: The query engine.
    """</span>

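    <span class="hljs-comment"># Swaps each node's text for its "window" metadata when that key is present.</span>
    <span class="hljs-comment"># Nodes built with SentenceSplitter have no such key, so their text is left unchanged.</span>
    postproc = MetadataReplacementPostProcessor(target_metadata_key=<span class="hljs-string">"window"</span>)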
    rerank = SentenceTransformerRerank(
        top_n=rerank_top_n, model=<span class="hljs-string">"BAAI/bge-reranker-base"</span>
    )
    engine = sentence_index.as_query_engine(
        similarity_top_k=similarity_top_k, node_postprocessors=[postproc, rerank]
    )
    <span class="hljs-keyword">return</span> engine

<span class="hljs-comment"># Create a query engine with the specified parameters</span>
query_engine = get_query_engine(sentence_index=vector_index, similarity_top_k=<span class="hljs-number">8</span>, rerank_top_n=<span class="hljs-number">5</span>)

<span class="hljs-comment"># Query the engine with a question</span>
query = <span class="hljs-string">'What is Deep learning?'</span>
response = query_engine.query(query)
prompt = <span class="hljs-string">f'''Generate a detailed response for the query asked based only on the context fetched:
            Query: <span class="hljs-subst">{query}</span>
            Context: <span class="hljs-subst">{response}</span>

            Instructions:
            1. Show query and your generated response based on context.
            2. Your response should be detailed and should cover every aspect of the context.
            3. Be crisp and concise.
            4. Don't include anything else in your response - no header/footer/code etc
            '''</span>
response = generate_response(prompt)
print(response.text)

<span class="hljs-string">'''
OUTPUT - 
Query: What is Deep learning? 

Deep learning is a subset of artificial intelligence that utilizes multiple layers of neurons between the network's inputs and outputs to progressively extract higher-level features from raw input data. 
This technique allows for improved performance in various subfields of AI, such as computer vision, speech recognition, natural language processing, and image classification. 
The multiple layers in deep learning networks are able to identify complex concepts and patterns, including edges, faces, digits, and letters.
The reason behind deep learning's success is not attributed to a recent theoretical breakthrough, but rather the significant increase in computer power, particularly the shift to using graphics processing units (GPUs), which provided a hundred-fold increase in speed. 
Additionally, the availability of vast amounts of training data, including large curated datasets, has also contributed to the success of deep learning.
Overall, deep learning's ability to analyze and extract insights from raw data has led to its widespread application in various fields, and its performance continues to improve with advancements in technology and data availability. '''</span>
</code></pre>
<h2 id="heading-how-to-fine-tune-the-pipeline">How to Fine-Tune the Pipeline</h2>
<p>Once you've built a basic RAG pipeline, the next step is to fine-tune it for optimal performance. This involves iteratively adjusting various components and parameters to improve the quality of the generated responses.</p>
<h3 id="heading-how-to-evaluate-the-pipelines-performance">How to Evaluate the Pipeline's Performance</h3>
<p>To assess the pipeline's effectiveness, you can use <strong>metrics</strong> like:</p>
<ul>
<li><p><strong>Accuracy:</strong> How often does the pipeline generate correct and relevant responses?</p>
</li>
<li><p><strong>Relevance:</strong> How well do the retrieved documents match the query?</p>
</li>
<li><p><strong>Coherence:</strong> Is the generated text well-structured and easy to understand?</p>
</li>
<li><p><strong>Factuality:</strong> Are the generated responses accurate and consistent with known facts?</p>
</li>
</ul>
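<p>Some of these metrics need human or model-based judgment, but retrieval relevance can be measured directly once you have a small set of labelled queries. The sketch below computes two common retrieval metrics, hit rate and mean reciprocal rank (MRR). The function name, chunk ids, and labels are hypothetical and only illustrate the calculation.</p>

```python
def hit_rate_and_mrr(retrieved_by_query, relevant_by_query):
    """Compute hit rate and mean reciprocal rank over a set of labelled queries."""
    hits, reciprocal_ranks = 0, []
    for query, retrieved in retrieved_by_query.items():
        relevant = relevant_by_query[query]
        # 1-based rank of the first relevant chunk, or None if none was retrieved
        first_rank = next(
            (rank for rank, chunk_id in enumerate(retrieved, start=1)
             if chunk_id in relevant),
            None,
        )
        if first_rank is not None:
            hits += 1
            reciprocal_ranks.append(1 / first_rank)
        else:
            reciprocal_ranks.append(0.0)
    n = len(retrieved_by_query)
    return hits / n, sum(reciprocal_ranks) / n


# Hypothetical chunk ids and relevance labels, for illustration only
retrieved = {"q1": ["c3", "c1", "c7"], "q2": ["c2", "c9"], "q3": ["c5", "c4"]}
relevant = {"q1": {"c1"}, "q2": {"c2"}, "q3": {"c8"}}

hit_rate, mrr = hit_rate_and_mrr(retrieved, relevant)
print(hit_rate, mrr)
```

<p>Hit rate tells you how often at least one relevant chunk was retrieved, while MRR also rewards placing it near the top of the list.</p>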
<h3 id="heading-iterate-on-the-index-structure-embedding-model-and-language-model">Iterate on the Index Structure, Embedding Model, and Language Model</h3>
<p>You can experiment with different <strong>index structures</strong> (for example, a flat index or a hierarchical index) to find the one that best suits your data and query patterns. Consider trying <strong>different embedding models</strong> to capture different semantic nuances. <strong>Fine-tuning the language model</strong> can also improve its ability to generate high-quality responses.</p>
<h3 id="heading-experiment-with-different-hyperparameters">Experiment with Different Hyperparameters</h3>
<p><strong>Hyperparameters</strong> are settings that control the behaviour of the pipeline components. By experimenting with different values, you can optimize the pipeline's performance. Some examples of hyperparameters include:</p>
<ul>
<li><p><strong>Embedding dimension:</strong> The size of the embedding vectors</p>
</li>
<li><p><strong>Index size:</strong> The maximum number of documents to store in the index</p>
</li>
<li><p><strong>Retrieval threshold:</strong> The minimum similarity score for a document to be considered relevant</p>
</li>
</ul>
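<p>As a small example of this kind of experiment, the sketch below sweeps a retrieval threshold together with a top-k limit and shows which chunks each setting would pass on to the language model. The scores, chunk names, and the <code>apply_threshold</code> helper are made up for illustration. In practice you would pair a sweep like this with an evaluation metric to pick the best setting.</p>

```python
def apply_threshold(scored_chunks, threshold, top_k):
    """Keep at most top_k chunks whose similarity score is at least threshold."""
    kept = [(score, chunk) for score, chunk in scored_chunks if score >= threshold]
    return sorted(kept, reverse=True)[:top_k]


# Hypothetical similarity scores for one query, for illustration only
scored = [(0.82, "chunk A"), (0.64, "chunk B"), (0.41, "chunk C"), (0.18, "chunk D")]

# Sweep a small grid and see how many chunks each setting passes to the LLM
for threshold in (0.2, 0.5, 0.8):
    for top_k in (1, 3):
        kept = apply_threshold(scored, threshold, top_k)
        print(f"threshold={threshold} top_k={top_k} kept={[c for _, c in kept]}")
```

<p>A higher threshold trades recall for precision: fewer chunks reach the model, but the ones that do are more likely to be relevant.</p>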
<h2 id="heading-real-world-applications-of-rag">Real-World Applications of RAG</h2>
<p>RAG pipelines have a wide range of applications, including:</p>
<ul>
<li><p><strong>Customer support chatbots:</strong> Providing informative and helpful responses to customer inquiries</p>
</li>
<li><p><strong>Knowledge base search:</strong> Efficiently retrieving relevant information from large document collections</p>
</li>
<li><p><strong>Summarization of large documents:</strong> Condensing lengthy documents into concise summaries</p>
</li>
<li><p><strong>Question answering systems:</strong> Answering complex questions based on a given corpus of knowledge</p>
</li>
</ul>
<h2 id="heading-rag-best-practices-and-considerations">RAG Best Practices and Considerations</h2>
<p>To build effective RAG pipelines, consider these best practices:</p>
<ul>
<li><p><strong>Data quality and preprocessing:</strong> Ensure your data is clean, consistent, and relevant to your use case. Preprocess the data to remove noise and improve its quality.</p>
</li>
<li><p><strong>Embedding model selection:</strong> Choose an embedding model that is appropriate for your specific domain and task. Consider factors like accuracy, computational efficiency, and interpretability.</p>
</li>
<li><p><strong>Index optimization:</strong> Optimize the index structure and parameters to improve retrieval efficiency and accuracy.</p>
</li>
<li><p><strong>Ethical considerations and biases:</strong> Be aware of potential biases in your data and models. Take steps to mitigate bias and ensure fairness in your RAG pipeline.</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>RAG pipelines offer a powerful approach to leveraging large language models for a variety of tasks. By carefully selecting and fine-tuning the components of a RAG pipeline, you can build systems that provide informative, accurate, and relevant responses.</p>
<p><strong>Key points to remember:</strong></p>
<ul>
<li><p>RAG combines information retrieval and language generation.</p>
</li>
<li><p>LlamaIndex simplifies the process of building RAG pipelines.</p>
</li>
<li><p>Fine-tuning is essential for optimizing pipeline performance.</p>
</li>
<li><p>RAG has a wide range of real-world applications.</p>
</li>
<li><p>Ethical considerations are crucial in building responsible RAG systems.</p>
</li>
</ul>
<p>As RAG technology continues to evolve, we can expect to see even more innovative and powerful applications in the future. Till then, let's wait for the future to unfold!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn RAG Fundamentals and Advanced Techniques ]]>
                </title>
                <description>
                    <![CDATA[ Understanding how to enhance the capabilities of AI and machine learning systems is a valuable skill. One method is Retrieval-Augmented Generation (RAG), a powerful technique that combines retrieval-based methods with generative models to create more... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/learn-rag-fundamentals-and-advanced-techniques/</link>
                <guid isPermaLink="false">66ab9da23f0973550ec28af9</guid>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 01 Aug 2024 14:37:22 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1722013188231/3fcbc925-f8bb-4e85-9396-f196b9856814.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Understanding how to enhance the capabilities of AI and machine learning systems is a valuable skill. One method is Retrieval-Augmented Generation (RAG), a powerful technique that combines retrieval-based methods with generative models to create more accurate and contextually relevant responses.</p>
<p>We just published a course on the <a target="_blank" href="http://freeCodeCamp.org">freeCodeCamp.org</a> YouTube channel that will teach you all about Retrieval-Augmented Generation (RAG). Created by Paulo Dichone, this course starts with the basics and progressively covers more advanced aspects, ensuring a comprehensive understanding of RAG and its practical applications. You'll learn the fundamental concepts and components of RAG, build a system for chatting with documents, explore advanced techniques, and understand the drawbacks of naive RAG approaches.</p>
<p>The course begins with an introduction, outlining the course content and objectives, setting the stage for your journey into RAG. You’ll learn the essential principles of Retrieval-Augmented Generation, including how it integrates retrieval mechanisms with generative models. This foundation will help you grasp the various components that make up a RAG system and understand their roles and interactions.</p>
<p>As the course progresses, you'll dive deeper into RAG, examining its inner workings and intricacies. This deep dive will prepare you for a hands-on project where you'll build a RAG system designed to interact with documents, enhancing your practical skills. This practical application is crucial for solidifying your understanding and ability to implement RAG in real-world scenarios.</p>
<p>Following this, the course introduces more sophisticated RAG techniques that can improve system performance and accuracy. You’ll gain insights into the limitations and common issues associated with simple, naive RAG implementations. This understanding is critical, as it highlights the importance of advanced techniques in overcoming these challenges.</p>
<p>One of the advanced techniques you'll learn about is query expansion, which involves generating more relevant answers through expanded queries. This section includes both theoretical explanations and hands-on projects, allowing you to apply query expansion techniques practically. The course also explores using multiple queries to further enhance the effectiveness of a RAG system, providing another hands-on project to deepen your understanding.</p>
<p>To test your skills and knowledge, the course presents a challenge that lets you implement what you've learned. Finally, the course concludes with a look at potential next steps and further learning opportunities in the field of RAG.</p>
<p>By the end of this course, you'll have a solid understanding of Retrieval-Augmented Generation, equipped with the knowledge and skills to build and enhance your own RAG systems. Watch the full course <a target="_blank" href="https://www.youtube.com/watch?v=ea2W8IogX80">on the freeCodeCamp.org YouTube channel</a> (2-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/ea2W8IogX80" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn RAG from Scratch – Python AI Tutorial from a LangChain Engineer ]]>
                </title>
                <description>
                    <![CDATA[ Retrieval-Augmented Generation (RAG) can be extremely helpful when developing projects with Large Language Models. It combines the power of retrieval systems with advanced natural language generation, providing a sophisticated approach to generating ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/mastering-rag-from-scratch/</link>
                <guid isPermaLink="false">66200806f5880a867f47d0ed</guid>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Wed, 17 Apr 2024 17:33:58 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1713375208533/25d36579-4e59-4a7f-b63e-a67ec6de69b8.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Retrieval-Augmented Generation (RAG) can be extremely helpful when developing projects with Large Language Models. It combines the power of retrieval systems with advanced natural language generation, providing a sophisticated approach to generating accurate and context-rich responses.</p>
<p>We just posted an in-depth course on the freeCodeCamp.org YouTube channel that will teach you how to implement RAG from scratch. Lance Martin created this course. He is a software engineer at LangChain with a PhD in applied machine learning from Stanford.</p>
<h2 id="heading-what-is-rag">What is RAG?</h2>
<p>Retrieval-Augmented Generation (RAG) is a powerful framework that integrates retrieval into the sequence generation process. Essentially, RAG operates by fetching relevant documents or data snippets based on a query and then using this retrieved information to generate a coherent and contextually appropriate response. This method is particularly valuable in fields like chatbot development, where the ability to provide precise answers derived from extensive databases of knowledge is crucial.</p>
<p>RAG fundamentally enhances the natural language understanding and generation capabilities of models by allowing them to access and leverage a vast amount of external knowledge. The approach is built upon the synergy between two main components: a retrieval system and a generative model. The retrieval system first identifies relevant information from a knowledge base, which the generative model then uses to craft responses that are not only accurate but also rich in detail and scope.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713375387925/9246942a-79e4-4d94-b032-a85f10480a99.png" alt="RAG diagram" class="image--center mx-auto" width="1586" height="1200" loading="lazy"></p>
<h2 id="heading-course-breakdown">Course Breakdown</h2>
<p>Lance Martin’s course meticulously covers all aspects of RAG, beginning with an overview that sets the stage for deeper exploration. The course is structured to walk students through the entire process of implementing a RAG system from the ground up:</p>
<ul>
<li><p><strong>Indexing</strong>: Learners will start by understanding how to create efficient indexing systems to store and retrieve data, which is fundamental for any retrieval-based model.</p>
</li>
<li><p><strong>Retrieval</strong>: This section dives into the mechanics of retrieving the most relevant documents in response to a query.</p>
</li>
<li><p><strong>Generation</strong>: After retrieval, the focus shifts to generating coherent text from the retrieved data, using advanced natural language processing techniques.</p>
</li>
<li><p><strong>Query Translation</strong>: Multiple strategies for translating and refining queries are discussed, including Multi-Query techniques, RAG Fusion, Decomposition, Step Back, and HyDE approaches, each offering unique benefits depending on the application.</p>
</li>
<li><p><strong>Routing, Query Construction, and Advanced Indexing Techniques</strong>: These segments explore more sophisticated elements of RAG systems, such as routing queries to appropriate models, constructing effective queries, and advanced indexing techniques like RAPTOR and ColBERT.</p>
</li>
<li><p><strong>CRAG and Adaptive RAG</strong>: The course also introduces CRAG (Corrective RAG) and Adaptive RAG, enhancements that provide even more flexibility and power to the standard RAG framework.</p>
</li>
<li><p><strong>Is RAG Really Dead?</strong>: Finally, the course discusses the current and future relevance of RAG in research and practical applications, encouraging critical thinking and exploration beyond the course.</p>
</li>
</ul>
<p>Each section is packed with practical exercises, real-life examples, and detailed explanations that ensure students not only learn the theory but also apply the concepts in practical settings.</p>
<p>This course is ideal for software engineers, data scientists, and researchers with a solid foundation in machine learning and natural language processing who are looking to expand their expertise in advanced AI techniques.</p>
<p>Watch the full course <a target="_blank" href="https://youtu.be/sVcwVQRHIc8">on the freeCodeCamp.org YouTube channel</a> (2.5-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/sVcwVQRHIc8" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
