vector database - freeCodeCamp.org

How to Ship a Production-Ready RAG App with FAISS (Guardrails, Evals, and Fallbacks)

Chidozie Managwu — Mon, 16 Mar 2026 17:43:51 +0000

Most LLM applications look great in a high-fidelity demo. Then they hit the hands of real users and start failing in very predictable yet damaging ways.

They answer questions they should not, they break when document retrieval is weak, they time out due to network latency, and nobody can tell exactly what happened because there are no logs and no tests.

In this tutorial, you’ll build a beginner-friendly Retrieval Augmented Generation (RAG) application designed to survive production realities. This isn’t just a script that calls an API. It’s a system featuring a FastAPI backend, a persisted FAISS vector store, and essential safety guardrails (including a retrieval gate and fallbacks).

Why RAG Alone Does Not Equal Production-Ready
The Architecture You Are Building
Project Setup and Structure
How to Build the RAG Layer with FAISS
How to Add the LLM Call with Structured Output
How to Add Guardrails: Retrieval Gate and Fallbacks
FastAPI App: Creating the /answer Endpoint
How to Add Beginner-Friendly Evals
What to Improve Next: Realistic Upgrades

Why RAG Alone Does Not Equal Production-Ready

Retrieval Augmented Generation (RAG) is often hailed as the hallucination killer. By grounding the model in retrieved text, we provide it with the facts it needs to be accurate. But simply connecting a vector database to an LLM isn’t enough for a production environment.

Production issues usually arise from the silent failures in the system surrounding the model:

Weak retrieval: If the app retrieves irrelevant chunks of text, the model tries to bridge the gap by inventing an answer anyway. Without a designated “I do not know” path, the model is essentially forced to hallucinate.
Lack of visibility: Without structured outputs and basic logging, you can’t tell if bad retrieval, a confusing prompt, or a model update caused a wrong answer.
Fragility: A simple API timeout or malformed provider response becomes a user-facing outage if you don’t implement fallbacks.
No regression testing: In traditional software, we have unit tests. In AI, we need evals. Without them, a small tweak to your prompt might fix one issue but break ten others without you realising it.

We’ll solve each of these issues systematically in this guide.

Prerequisites

This tutorial is beginner-friendly, but it assumes you have a few basics in place so you can focus on building a robust RAG system instead of getting stuck on setup issues.

Knowledge

You should be comfortable with:

Python fundamentals (functions, modules, virtual environments)
Basic HTTP + JSON (requests, response payloads)
APIs with FastAPI (what an endpoint is and how to run a server)
High-level LLM concepts (prompting, temperature, structured outputs)

Tools + Accounts

You’ll need:

Python 3.10+
A working OpenAI-compatible API key (OpenAI or any provider that supports the same request/response shape)
A local environment where you can run a FastAPI app (Mac/Linux/Windows)

What This Tutorial Covers (and What It Doesn’t)

We’ll build a production-minded baseline:

A FAISS-backed retriever with a persisted index + metadata
A retrieval gate to prevent “forced hallucination”
Structured JSON outputs so your backend is stable
Fallback behavior for timeouts and provider errors
A small eval harness to prevent regressions

We won’t implement advanced upgrades such as rerankers, semantic chunking, auth, background jobs beyond a roadmap at the end.

The Architecture You Are Building

The flow of our application follows a disciplined path so every answer is grounded in evidence:

User query: The user submits a question via a FastAPI endpoint.
Retrieval: The system embeds the question and retrieves the top-k most similar document chunks.
The retrieval gate: We evaluate the similarity score. If the context is not relevant enough, we stop immediately and refuse the query.
Augmentation and generation: If the gate passes, we send a context-augmented prompt to the LLM.
Structured response: The model returns a JSON object containing the answer, sources used, and a confidence level.

Project Setup and Structure

To keep things organized and maintainable, we’ll use a modular structure. This allows you to swap out your LLM provider or your vector database without rewriting your entire core application.

Project Structure

.
├── app.py              # FastAPI entry point and API logic
├── rag.py              # FAISS index, persistence, and document retrieval
├── llm.py              # LLM API interface and JSON parsing
├── prompts.py          # Centralized prompt templates
├── data/               # Source .txt documents
├── index/              # Persisted FAISS index and metadata
└── evals/              # Evaluation dataset and runner script
    ├── eval_set.json
    └── run_evals.py

Install Dependencies

First, create a virtual environment to isolate your project:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install fastapi uvicorn faiss-cpu numpy pydantic requests python-dotenv

Configure the Environment

Create a .env file in the root directory. We are targeting OpenAI-compatible providers:

OPENAI_API_KEY=your_actual_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o-mini

Important note on compatibility: The code below assumes an OpenAI-style API. If you use a provider that is not compatible, you must change the URL, headers (for example X-API-Key), and the way you extract embeddings and final message content in embed_texts() and call_llm().

How to Build the RAG Layer with FAISS

In rag.py, we handle the “Retriever” part of RAG. This involves turning raw text into mathematical vectors that the computer can compare.

What is FAISS (and What Does It Do)?

FAISS (Facebook AI Similarity Search) is a fast library for vector similarity search. In a RAG system, each chunk of text becomes an embedding vector (a list of floats). FAISS stores those vectors in an index so you can quickly ask:

“Given this question embedding, which document chunks are closest to it?”

In this tutorial, we use IndexFlatIP inner product and normalise vectors with faiss.normalize_L2(...). With normalised vectors, the inner product behaves like cosine similarity, giving us a stable score we can use for a retrieval gate.

Chunking Strategy With Overlap

We’ll use chunking with overlap. If we split a document at exactly 1,000 characters, we might cut a sentence in half, losing its meaning. By using an overlap, for example, 200 characters, we ensure that the end of one chunk and the beginning of the next share context.

Implementation of `rag.py`

import os
import faiss
import numpy as np
import requests
import json
from typing import List, Dict
from dotenv import load_dotenv

load_dotenv()

INDEX_PATH = "index/faiss.index"
META_PATH = "index/meta.json"

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
    chunks = []
    step = max(1, size - overlap)
    for i in range(0, len(text), step):
        chunk = text[i : i + size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

def embed_texts(texts: List[str]) -> np.ndarray:
    # Note: If your provider is not OpenAI-compatible, change this URL and headers
    url = f"{os.getenv('OPENAI_BASE_URL')}/embeddings"
    headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"}
    payload = {"input": texts, "model": "text-embedding-3-small"}

    resp = requests.post(url, headers=headers, json=payload, timeout=30)
    resp.raise_for_status()
    # If your provider uses a different response format, change the line below
    vectors = np.array([item["embedding"] for item in resp.json()["data"]], dtype="float32")
    return vectors

def build_index() -> None:
    all_chunks: List[str] = []
    metadata: List[Dict] = []

    if not os.path.exists("data"):
        os.makedirs("data")
        return

    for file in os.listdir("data"):
        if not file.endswith(".txt"):
            continue

        with open(f"data/{file}", "r", encoding="utf-8") as f:
            text = f.read()

        chunks = chunk_text(text)
        all_chunks.extend(chunks)
        for c in chunks:
            metadata.append({"source": file, "text": c})

    if not all_chunks:
        return

    embeddings = embed_texts(all_chunks)
    faiss.normalize_L2(embeddings)

    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)

    os.makedirs("index", exist_ok=True)
    faiss.write_index(index, INDEX_PATH)

    with open(META_PATH, "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False)

def load_index():
    if not (os.path.exists(INDEX_PATH) and os.path.exists(META_PATH)):
        raise FileNotFoundError(
            "FAISS index not found. Add .txt files to data/ and run build_index()."
        )

    index = faiss.read_index(INDEX_PATH)
    with open(META_PATH, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    return index, metadata

def retrieve(query: str, k: int = 5) -> List[Dict]:
    index, metadata = load_index()

    q_emb = embed_texts([query])
    faiss.normalize_L2(q_emb)

    scores, ids = index.search(q_emb, k)
    results = []
    for score, idx in zip(scores[0], ids[0]):
        if idx == -1:
            continue
        m = metadata[idx]
        results.append(
            {"score": float(score), "source": m["source"], "text": m["text"], "id": int(idx)}
        )
    return results

How to Add the LLM Call with Structured Output

A major failure point in AI apps is the “chatty” nature of LLMs. If your backend expects a list of sources but the LLM returns conversational filler, your code will crash.

We solve this with structured output: instruct the model to return a strict JSON object, then parse it safely.

Implementation of `llm.py`

import json
import requests
import os
from typing import Dict, Any

def call_llm(system_prompt: str, user_prompt: str) -> Dict[str, Any]:
    # Note: Change URL/Headers if using a non-OpenAI compatible provider
    url = f"{os.getenv('OPENAI_BASE_URL')}/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        "Content-Type": "application/json",
    }

    payload = {
        "model": os.getenv("OPENAI_MODEL"),
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "response_format": {"type": "json_object"},
        "temperature": 0,
    }

    try:
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        resp.raise_for_status()
        content = resp.json()["choices"][0]["message"]["content"]

        parsed = json.loads(content)
        parsed.setdefault("answer", "")
        parsed.setdefault("refusal", False)
        parsed.setdefault("confidence", "medium")
        parsed.setdefault("sources", [])
        return parsed

    except (requests.Timeout, requests.ConnectionError):
        return {
            "answer": "The system is temporarily unavailable (network issue). Please try again.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "network_error",
        }
    except Exception:
        return {
            "answer": "A system error occurred while generating the answer.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "unknown_error",
        }

How to Add Guardrails: Retrieval Gate and Fallbacks

Guardrails are interceptors. They sit between the user and the model to prevent predictable failures.

The Retrieval Gate: How It Works and How to Add It

In a standard RAG pipeline, the system always calls the LLM. If the user asks an irrelevant question, the retriever will still return the “closest” (but wrong) chunks.

The solution is the retrieval gate:

Retrieve top-k chunks and get the top similarity score
If the score is below a threshold (for example 0.30), refuse immediately
Only call the LLM when retrieval is strong enough to ground the answer

A threshold of 0.30 is a reasonable starting point when using normalised cosine similarity, but you should tune it using evals (next section).

Fallbacks and Why They Matter

Fallbacks ensure that if an API fails or times out, the user gets a helpful message instead of a crash. They also keep your API response shape consistent, which prevents frontend errors and makes logging meaningful.

In this tutorial, fallbacks are implemented inside call_llm() so your FastAPI layer stays simple.

FastAPI App: Creating the /answer Endpoint

The app.py file is the conductor. It ties retrieval, guardrails, prompting, and generation together.

Implementation of `app.py`

from fastapi import FastAPI
from pydantic import BaseModel
from rag import retrieve
from llm import call_llm
import prompts
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_app")

app = FastAPI(title="Production-Ready RAG")

class QueryRequest(BaseModel):
    question: str

@app.post("/answer")
async def get_answer(req: QueryRequest):
    start_time = time.time()
    question = (req.question or "").strip()

    if not question:
        return {
            "answer": "Please provide a non-empty question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
        }

    # 1) Retrieval
    results = retrieve(question, k=5)
    top_score = results[0]["score"] if results else 0.0

    logger.info("query=%r top_score=%.3f num_results=%d", question, top_score, len(results))

    # 2) Retrieval Gate (Guardrail)
    if top_score < 0.30:
        return {
            "answer": "I do not have documents to answer that question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
            "retrieval": {"top_score": top_score, "k": 5},
        }

    # 3) Augment
    context_text = "\n\n".join([f"Source {r['source']}: {r['text']}" for r in results])
    user_prompt = f"Context:\n{context_text}\n\nQuestion: {question}"

    # 4) Generation with Fallback
    response = call_llm(prompts.SYSTEM_PROMPT, user_prompt)

    # 5) Attach debug metadata
    response["latency_sec"] = round(time.time() - start_time, 2)
    response["retrieval"] = {"top_score": top_score, "k": 5}
    return response

Centralized Prompt – Template: prompts.py

A small but important habit: keep prompts centralised so they’re versionable and easy to evaluate.

Example `prompts.py`

SYSTEM_PROMPT = """You are a RAG assistant. Use ONLY the provided Context to answer.
If the context does not contain the answer, respond with refusal=true.

Return a valid JSON object with exactly these keys:
- answer: string
- refusal: boolean
- confidence: "low" | "medium" | "high"
- sources: array of strings (source filenames you used)

Do not include any extra keys. Do not include markdown. Do not include commentary."""

How to Add Beginner-Friendly Evals

In AI systems, outputs are probabilistic. This makes testing harder than traditional software. Evals (evaluations) are a set of “golden questions” and “expected behaviours” you run repeatedly to detect regressions.

Instead of “does it output exactly this string,” you test:

Should the app refuse when the retrieval is weak?
When it answers, does it include sources?
Is the behaviour stable across prompt tweaks and model changes?

Step 1: Create `evals/eval_set.json`

This should contain both positive and negative cases.

[
  {
    "id": "in_scope_01",
    "question": "What is a retrieval gate and why is it important?",
    "expect_refusal": false,
    "notes": "Should explain gating and relate it to hallucination prevention."
  },
  {
    "id": "out_of_scope_01",
    "question": "What is the capital of France?",
    "expect_refusal": true,
    "notes": "If the knowledge base only includes our docs, the app should refuse."
  },
  {
    "id": "edge_01",
    "question": "",
    "expect_refusal": true,
    "notes": "Empty input should not call the LLM."
  }
]

Step 2: Create `evals/run_evals.py`

This runner calls your API endpoint (end-to-end) and checks expected behaviours.

import json
import requests

API_URL = "http://127.0.0.1:8000/answer"

def run():
    with open("evals/eval_set.json", "r", encoding="utf-8") as f:
        cases = json.load(f)

    passed = 0
    failed = 0

    for case in cases:
        resp = requests.post(API_URL, json={"question": case["question"]}, timeout=60)
        resp.raise_for_status()
        out = resp.json()

        got_refusal = bool(out.get("refusal", False))
        expect_refusal = bool(case["expect_refusal"])

        ok = (got_refusal == expect_refusal)

        # Beginner-friendly: if it answers, sources should exist and be a list
        if not got_refusal:
            ok = ok and isinstance(out.get("sources"), list)

        if ok:
            passed += 1
            print(f"PASS {case['id']}")
        else:
            failed += 1
            print(f"FAIL {case['id']} expected_refusal={expect_refusal} got_refusal={got_refusal}")
            print("Output:", json.dumps(out, indent=2))

    print(f"\nDone. Passed={passed} Failed={failed}")
    if failed:
        raise SystemExit(1)

if __name__ == "__main__":
    run()

How to Use Evals in Practice

Run your server:

uvicorn app:app --reload

In another terminal, run evals:

python evals/run_evals.py

If an eval fails, you have a concrete signal that something changed in retrieval, gating, prompting, or provider behaviour.

What to Improve Next: Realistic Upgrades

Building a reliable RAG app is iterative. Here are realistic next steps:

Semantic chunking: Break text based on meaning instead of character count.
Reranking: Use a cross-encoder reranker to reorder the top-k chunks for higher precision.
Metadata filtering: Filter results by category, date, or department to reduce false positives.
Better citations: Store chunk IDs and show exactly which chunk(s) the answer came from.
Observability: Add request IDs, structured logs, and traces so “what happened?” is answerable.
Async + background indexing: Move index building to a background job and keep the API responsive.

Final Thoughts: Production-Ready Is a Set of Habits

Building an AI application that survives in the real world is about building a system that is predictable, measurable, and safe.

Retrieval quality is measurable: Use similarity scores to gate your LLM.
Refusal is a feature: It is better to say “I do not know” than to lie.
Fallbacks are mandatory: Design for the moment the API goes down.
Evals prevent regressions: Never deploy a change without running your tests.

About Me

I am Chidozie Managwu, an award-winning AI Product Architect and founder focused on helping global tech talent build real, production-ready skills. I contribute to global AI initiatives as a GAFAI Delegate and lead AI Titans Network, a community for developers learning how to ship AI products.

My work has been recognized with the Global Tech Hero award and featured on platforms like HackerNoon.

How to Integrate Vector Search in Columnar Storage

Chirag Agrawal — Wed, 12 Nov 2025 21:43:04 +0000

Integrating vector search into traditional data platforms is becoming a common task in the current AI-driven landscape. When Google announced general availability for vector search in BigQuery in early 2024, it joined a growing list of established databases that have added capabilities for similarity search on high-dimensional embeddings.

But if you examine BigQuery's implementation more closely, you’ll find an approach that goes beyond a simple feature addition. Instead of bolting on a vector library, Google has deeply integrated vector search into its existing distributed, columnar architecture.

In this article, we’ll take a technical deep dive into the engineering decisions behind BigQuery's vector search. We’ll explore how foundational Google technologies like Dremel, Borg, and Colossus, combined with a proprietary columnar format and a novel indexing algorithm, create a highly scalable and efficient platform for AI workloads.

This analysis will give you insights into the architectural trade-offs involved in building vector search at scale. It also demonstrates how you can adapt a system designed for large-scale analytics so that it excels at modern AI tasks.

Prerequisites
The Unique Challenge of Vector Search
BigQuery's Foundational Distributed Architecture
The Role of Columnar Storage in Vector Operations
- Accelerating Computations with SIMD
The TreeAH Indexing Algorithm
The End-to-End Vector Search Query Flow
Practical Implications for Engineering Teams
Conclusion
Further Reading

Prerequisites

This article assumes that you have a solid foundation in distributed systems and database internals, including familiarity with concepts like columnar storage, query execution plans, and distributed query processing.

You should understand the basics of vector embeddings and similarity search, though we'll briefly review the fundamentals. Experience with at least one vector database or search system (such as pgvector, Pinecone, or Elasticsearch) will help contextualize the architectural comparisons.

While deep knowledge of Google Cloud Platform isn't required, basic familiarity with cloud data warehouses and their typical architectures will be beneficial. The article includes discussions of SIMD operations and CPU-level optimizations, so comfort with low-level performance considerations is helpful, though not mandatory.

Code examples assume working knowledge of SQL, with some sections referencing implementation details in languages like Python or Java. Most importantly, you should have experience building or operating production data systems at scale, as many insights focus on practical engineering trade-offs rather than theoretical concepts.

The Unique Challenge of Vector Search

Vector search fundamentally differs from traditional database operations in ways that challenge our existing infrastructure assumptions. Where conventional queries leverage decades of optimization around exact matching and range scans, vector similarity search requires computing distances between high-dimensional points at massive scale.

Consider the numbers. Modern embedding models produce vectors with 768 or more dimensions. At 4 bytes per float32 value, a single embedding consumes roughly 3KB. A modest corpus of 100 million items translates to 300GB of vector data.

But the real challenge isn't storage. The killer is computation. Finding the nearest neighbors to a query vector means computing distance metrics across all those dimensions. For 100 million vectors, a brute-force search requires 76.8 billion floating-point operations per query just for the distance calculations. Even with modern SIMD instructions processing 16 floats at once, you're looking at billions of CPU cycles per search.

This computational reality forces a fundamental compromise: we abandon exact solutions for approximate ones. Approximate Nearest Neighbor (ANN) algorithms trade perfect accuracy for practical query times. They work by partitioning the vector space cleverly, building graphs of nearest neighbors, or using hashing schemes to avoid examining every vector. The engineering challenge becomes balancing query latency, recall accuracy, and resource consumption.

Most purpose-built vector databases address this through specialized in-memory indexes like HNSW or IVF. These work well for single queries but require keeping massive indexes in RAM. In case you are not familiar with these vector indexes, you can read this article.

BigQuery took a different path. Rather than optimizing for single-query latency, they asked what vector search would look like when built for analytical workloads at warehouse scale. The answer required rethinking basic assumptions about index design, storage layout, and query execution.

BigQuery's Foundational Distributed Architecture

BigQuery's vector search runs on the same infrastructure that's been processing SQL queries since 2011. No new cluster type. No specialized vector nodes. Just four core technologies that power most of Google's data processing, now handling a workload they weren't originally designed for.

This isn't the obvious choice. Most vector databases build specialized infrastructure optimized for similarity search. Graph-based indexes need fast random access. In-memory systems require careful memory management. BigQuery took its existing distributed SQL engine and asked: can we make this work for vectors, too?

The answer required leveraging four foundational systems in new ways:

Dremel, the query engine that normally handles SQL, now orchestrates vector similarity computations.
Borg, which allocates resources for everything from Search to YouTube, dynamically assigns thousands of workers to vector queries.
Colossus stores embeddings in the same distributed filesystem that holds petabytes of analytics data.
And Jupiter's datacenter network, built for bulk data processing, now shuttles vector data between computation nodes.

What's surprising isn't that it works, but how well it works. The same architecture that runs aggregate queries over trillion-row tables can search billion-scale vector collections. Understanding how requires examining each component and how they've been adapted for this new workload.

Dremel: The Distributed Query Engine

At its core, BigQuery is powered by Dremel, a distributed query execution engine developed at Google since 2006.

Dremel processes SQL queries using a hierarchical serving tree. A root server receives the query and orchestrates the execution, while mixer nodes break down the work and distribute it to hundreds or thousands of leaf nodes. These leaf nodes perform the actual computations in parallel on segments of the data.

This architecture allows BigQuery to dynamically allocate a massive number of execution threads, known as slots, to a single query, enabling it to process petabytes of data in seconds.

Borg: Cluster Management and Resource Orchestration

The serverless nature of BigQuery is made possible by Borg, Google's cluster management system that predates and inspired Kubernetes.

When a vector search query is submitted, Borg is responsible for finding available machines across Google's global data centers, allocating the precise amount of CPU and memory resources needed for the query's Dremel slots, and managing fault tolerance by automatically rescheduling work if a machine fails. This dynamic resource allocation means users do not need to provision or scale infrastructure, whether they are searching 1,000 vectors or 10 billion.

Colossus: The Distributed Storage Layer

Data in BigQuery is stored in Colossus, Google's next-generation distributed file system. Colossus is designed for exabyte-scale storage, provides high availability through automatic cross-datacenter replication, and is optimized for the high-throughput parallel reads required by Dremel's leaf nodes.

During a vector search, Colossus can deliver data to thousands of nodes simultaneously without creating a storage bottleneck.

Jupiter: The High-Speed Network Fabric

These compute and storage systems are interconnected by Jupiter, Google's internal datacenter network, which features a petabit-per-second bisection bandwidth. The network's design ensures that data can move between Colossus storage and Dremel compute nodes at extremely high speeds, making data shuffling and aggregation phases of a query efficient.

The Role of Columnar Storage in Vector Operations

Storing vectors in columns sounds wrong. Vectors are arrays. They belong together. Why split them across columnar storage?

BigQuery does it anyway, and it works brilliantly. Here's why.

When you search a million vectors, you need exactly one thing from each row: the embedding. Not the product name, price, or category. Just the vector. Row-oriented storage forces you to read entire records and throw away 90% of the data. Columnar storage reads only what you need.

The performance impact is dramatic. A table with 768-dimensional embeddings plus 20 other columns might total 3TB. Reading just the embedding column? 300GB. That's a 10x reduction in I/O before you've done any actual computation.

But the real magic happens at the CPU level. Columnar storage naturally aligns vector data for SIMD processing. Instead of jumping around memory gathering vector components, the CPU finds them laid out sequentially, ready for bulk operations. Modern processors can load 16 floating-point values into a single register and process them simultaneously.

Compression becomes almost trivial, too. BigQuery's Capacitor format applies techniques like Product Quantization directly to the column data, shrinking vectors from 3KB to under 300 bytes. Try doing that with row-oriented storage where vectors are scattered across pages.

The lesson? Sometimes the "wrong" abstraction at one level enables the right optimizations at another.

Accelerating Computations with SIMD

SIMD instructions are a form of hardware-level parallelism available in modern CPUs that provide significant speedups for vector arithmetic. This is achieved through special instruction sets built into the processor.

For example, AVX-512 (Advanced Vector Extensions 512-bit) is an instruction set found in modern high-performance CPUs, such as those from Intel, that allows a single instruction to operate on 512 bits of data at once.

Since a standard single-precision floating-point number is 32 bits, a CPU with AVX-512 can process 16 floating-point numbers in a single operation. This leads to dramatic performance gains.

The difference between scalar and SIMD processing for vector distance calculations is stark:

Scalar approach: Loop through each dimension, multiply corresponding components, accumulate results. For 768 dimensions, that's 768 multiplications, 768 additions, and terrible cache performance as you jump between two different memory locations for each iteration.
SIMD approach: Load 16 components from each vector into 512-bit registers. Execute a single multiply instruction that handles all 16 pairs. Execute a single horizontal add. Repeat 48 times. The CPU's pipeline stays full, the cache prefetcher knows exactly what data you need next, and you've turned 1,536 operations into 96.

The columnar storage pays off here, too. Vectors stored contiguously in memory align perfectly with SIMD register loads. No gather operations, no wasted cycles. Just pure throughput.

BigQuery's query engine is designed to leverage SIMD extensively. It automatically detects and uses the optimal instruction set available on the underlying hardware (for example, AVX-512 for Intel, NEON for ARM). The columnar storage format ensures that vector data is laid out in memory in a way that is friendly to SIMD registers, and the engine processes query vectors in large batches to maximize the utilization of these parallel instructions.

The TreeAH Indexing Algorithm

While brute-force search can be effective at smaller scales due to BigQuery's massive parallelism, efficient search over billions of vectors requires an index. BigQuery's primary vector index is TreeAH (Tree with Asymmetric Hashing), which is based on Google's open-sourced ScaNN (Scalable Nearest Neighbors) algorithm. TreeAH combines three techniques to achieve high performance and memory efficiency.

1. Hierarchical Tree Structure

The algorithm first partitions the entire vector space into thousands of smaller lists. You can think of this like organizing a massive library. Instead of having one giant room with a million books, a library has floors, sections, and shelves. This hierarchy allows you to find a book without scanning every single one.

Similarly, TreeAH groups semantically similar vectors together into partitions and arranges them in a tree. During a query, the search navigates this tree by comparing the query vector to "centroid" vectors that represent the center of each partition, effectively following a path to the most relevant partitions and pruning away large, irrelevant branches of the search space.

2. Product Quantization (PQ)

Within TreeAH, PQ serves a different purpose than just compression. The index doesn't just store smaller vectors – it fundamentally changes how distance calculations work.

TreeAH learns partition-specific codebooks that capture the local structure of vectors in each tree node. This means vectors that end up in the "shoes" partition get quantized differently than those in "electronics." The compression becomes semantic-aware.

When combined with the tree structure, this creates a powerful effect: not only are you searching fewer vectors (thanks to the tree), but you're computing distances faster on the vectors you do search (thanks to PQ).

3. Asymmetric Hashing

The "asymmetric" aspect refers to the fact that the query vector is kept in its full-precision form, while the database vectors are compared in their compressed, quantized form.

The vectors are not of different dimensions, but of different precision. The semantic matching works because the comparison is not direct. The compressed database vector is a code that points to a region in the original vector space. The distance calculation uses the full-precision query vector to look up a pre-computed distance to the center of that region. This way, the rich information in the query vector is used to accurately estimate the distance, avoiding the significant information loss that would occur if both vectors were compressed.

Architectural Comparison: TreeAH vs. HNSW

To better understand the design philosophy behind TreeAH, it’s useful to compare it with HNSW (Hierarchical Navigable Small World), a popular graph-based algorithm used in many dedicated vector databases.

HNSW constructs a multi-layered graph where vectors are nodes and edges connect them to their nearest neighbors. It’s known for excellent single-query latency.

But this performance comes with significant memory overhead, as the graph structure must be stored in addition to the full-precision vectors. HNSW index builds can also be time-consuming, and frequent data updates can lead to memory fragmentation and performance degradation.

TreeAH, in contrast, makes different architectural trade-offs that align with BigQuery's nature as a distributed analytics system.

The comparison reveals a fundamental design choice: TreeAH prioritizes batch throughput, memory efficiency, and scalability over absolute single-query latency. This makes it well-suited for analytical workloads where thousands of searches are performed simultaneously.

The End-to-End Vector Search Query Flow

The execution timeline of a BigQuery vector search demonstrates how parallel processing eliminates traditional bottlenecks. When a VECTOR_SEARCH query arrives, the system initiates multiple operations concurrently rather than executing them sequentially.

The root server begins query planning immediately upon receiving the request. In parallel, Borg starts allocating compute slots across the cluster, targeting 1,000 slots distributed across 50 or more nodes. Borg prioritizes slots that are physically close to the data in Colossus to minimize data movement costs. This allocation typically completes within 10 milliseconds.

Query planning and resource allocation overlap significantly. The mixer nodes receive partial execution plans and begin partitioning the search space before Borg completes all slot allocations. When TreeAH indexes are available, mixers use them to assign specific vector partitions to leaf nodes. This streaming approach ensures that leaf nodes receive work assignments as soon as they come online.

The parallel execution phase showcases the architecture's efficiency. Hundreds or thousands of leaf nodes simultaneously read their assigned vector partitions from Colossus. Jupiter's high-bandwidth network prevents I/O congestion even with thousands of concurrent reads. Each leaf node operates independently: loading compressed vectors, executing SIMD operations for distance calculations, and maintaining local top-k results.

Aggregation begins before all leaf nodes complete their local searches. Mixers implement a streaming merge algorithm that processes results as they arrive. This approach means that by the time the slowest leaf node reports its results, the mixers have already processed most of the data. The final global top-k emerges from this continuous merging process.

The measured 40-millisecond execution time represents the longest path through the parallel execution graph, not the sum of individual operations. Most operations complete much faster, but the overall latency is bounded by the slowest component. This design trades single-query latency for massive throughput, enabling BigQuery to process thousands of vector searches concurrently across billions of vectors.

Practical Implications for Engineering Teams

The architectural choices behind BigQuery's vector search create specific trade-offs that engineering teams need to understand before committing to this approach.

1. Query Latency vs. Throughput

BigQuery vector searches typically complete in 1-10 seconds, not the sub-100ms latency of specialized vector databases. But you can run thousands of searches concurrently without degradation. This makes BigQuery ideal for batch recommendation generation, similarity analysis across product catalogs, or embedding-based data enrichment pipelines. It's the wrong choice for autocomplete features or real-time personalization that requires immediate responses.

2. Cost Model Considerations

BigQuery charges for data scanned, not query execution time. A vector search that scans 1TB costs the same whether it completes in 2 seconds or 20 seconds. This model favors workloads where you search large datasets infrequently rather than small datasets continuously. Running vector search on a 10GB table thousands of times per day will be more expensive than a dedicated vector database with fixed infrastructure costs.

3. Index Management Trade-offs

TreeAH indexes update automatically in the background when new data arrives, typically within 5-15 minutes. You cannot force immediate index updates or control index parameters like you can with HNSW or IVF indexes. This simplicity reduces operational overhead but limits optimization options. If your use case requires fine-tuning recall/latency trade-offs or immediate consistency after updates, you'll need a different solution.

4. Integration Benefits That Actually Matter

The ability to JOIN vector search results with business data in a single query is more powerful than it initially appears. Consider this query pattern:

WITH semantic_matches AS (

  SELECT item_id, distance

  FROM VECTOR_SEARCH(

    TABLE products,

    'embedding',

    (SELECT embedding FROM queries WHERE query_id = @query_id)

  )

)

SELECT p.*, s.distance

FROM semantic_matches s

JOIN products p USING (item_id)

WHERE p.in_stock = TRUE

  AND p.price BETWEEN 50 AND 200

  AND p.category_restrictions IS NULL

ORDER BY s.distance

LIMIT 20

This combines semantic search with business logic, inventory status, and access controls in one atomic operation. Implementing this with a separate vector database requires complex synchronization between systems.

Conclusion

BigQuery's vector search implementation challenges our assumptions about what a data warehouse can do. Instead of building another specialized vector database, Google pushed their existing infrastructure to handle a fundamentally different workload.

The key insight is recognizing that vector search at scale is a data processing problem. And processing data at scale is what BigQuery was built for.

By leveraging its columnar architecture and hardware-aware algorithms like TreeAH, BigQuery makes a deliberate trade-off. It exchanges the sub-millisecond latency of in-memory systems for massive batch throughput and incredible resource efficiency. An index that uses 10x less memory than HNSW is a trade-off many teams building analytical AI systems would gladly make.

The real power emerges when vectors live alongside business data. Complex queries that would require multiple systems and synchronization nightmares become simple SQL. "Find similar products, but only from reliable suppliers, in stock locally, with no recent quality issues." One query, one system, no architectural gymnastics.

This approach validates a broader trend: vector capabilities are becoming table stakes for data platforms. The question isn't whether your data platform will support vectors, but how well it integrates them into existing workflows.

For teams building analytical AI applications, BigQuery offers a pragmatic path. It won't win latency benchmarks against dedicated vector databases. But for batch processing, integrated analytics, and operational simplicity at scale, it demonstrates that sometimes the best vector database isn't a vector database at all. It's your data warehouse, evolved.

vector database - freeCodeCamp.org

How to Ship a Production-Ready RAG App with FAISS (Guardrails, Evals, and Fallbacks)

Table of Contents

Why RAG Alone Does Not Equal Production-Ready

Prerequisites

Knowledge

Tools + Accounts

What This Tutorial Covers (and What It Doesn’t)

The Architecture You Are Building

Project Setup and Structure

Project Structure

Install Dependencies

Configure the Environment

How to Build the RAG Layer with FAISS

What is FAISS (and What Does It Do)?

Chunking Strategy With Overlap

Implementation of rag.py

How to Add the LLM Call with Structured Output

Implementation of llm.py

How to Add Guardrails: Retrieval Gate and Fallbacks

The Retrieval Gate: How It Works and How to Add It

Fallbacks and Why They Matter

FastAPI App: Creating the /answer Endpoint

Implementation of app.py

Centralized Prompt – Template: prompts.py

Example prompts.py

How to Add Beginner-Friendly Evals

Step 1: Create evals/eval_set.json

Step 2: Create evals/run_evals.py

How to Use Evals in Practice

What to Improve Next: Realistic Upgrades

Final Thoughts: Production-Ready Is a Set of Habits

About Me

How to Integrate Vector Search in Columnar Storage

Table of Contents

Prerequisites

The Unique Challenge of Vector Search

BigQuery's Foundational Distributed Architecture

Dremel: The Distributed Query Engine

Borg: Cluster Management and Resource Orchestration

Colossus: The Distributed Storage Layer

Jupiter: The High-Speed Network Fabric

The Role of Columnar Storage in Vector Operations

Accelerating Computations with SIMD

The TreeAH Indexing Algorithm

1. Hierarchical Tree Structure

2. Product Quantization (PQ)

3. Asymmetric Hashing

Architectural Comparison: TreeAH vs. HNSW

The End-to-End Vector Search Query Flow

Practical Implications for Engineering Teams

1. Query Latency vs. Throughput

2. Cost Model Considerations

3. Index Management Trade-offs

4. Integration Benefits That Actually Matter

Conclusion

Further Reading

Implementation of `rag.py`

Implementation of `llm.py`

Implementation of `app.py`

Example `prompts.py`

Step 1: Create `evals/eval_set.json`

Step 2: Create `evals/run_evals.py`