<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Chirag Agrawal - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Chirag Agrawal - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 24 May 2026 22:23:55 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/chiragagrawal/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Integrate Vector Search in Columnar Storage ]]>
                </title>
                <description>
                    <![CDATA[ Integrating vector search into traditional data platforms is becoming a common task in the current AI-driven landscape. When Google announced general availability for vector search in BigQuery in early 2024, it joined a growing list of established da... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-integrate-vector-search-in-columnar-storage/</link>
                <guid isPermaLink="false">6914ff68e1ffda5f6ea6d8ea</guid>
                
                    <category>
                        <![CDATA[ vector database ]]>
                    </category>
                
                    <category>
                        <![CDATA[ semantic search ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Columnar Database ]]>
                    </category>
                
                    <category>
                        <![CDATA[ google cloud ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Chirag Agrawal ]]>
                </dc:creator>
                <pubDate>Wed, 12 Nov 2025 21:43:04 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762983768101/928331bd-3f97-4d05-92fb-2d8ea9af5dab.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Integrating vector search into traditional data platforms is becoming a common task in the current AI-driven landscape. When Google announced general availability for vector search in BigQuery in early 2024, it joined a growing list of established databases that have added capabilities for similarity search on high-dimensional embeddings.</p>
<p>But if you examine BigQuery's implementation more closely, you’ll find an approach that goes beyond a simple feature addition. Instead of bolting on a vector library, Google has deeply integrated vector search into its existing distributed, columnar architecture.</p>
<p>In this article, we’ll take a technical deep dive into the engineering decisions behind BigQuery's vector search. We’ll explore how foundational Google technologies like Dremel, Borg, and Colossus, combined with a proprietary columnar format and a novel indexing algorithm, create a highly scalable and efficient platform for AI workloads.</p>
<p>This analysis will give you insights into the architectural trade-offs involved in building vector search at scale. It also demonstrates how you can adapt a system designed for large-scale analytics so that it excels at modern AI tasks.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-unique-challenge-of-vector-search">The Unique Challenge of Vector Search</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-bigquerys-foundational-distributed-architecture">BigQuery's Foundational Distributed Architecture</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-dremel-the-distributed-query-engine">Dremel: The Distributed Query Engine</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-borg-cluster-management-and-resource-orchestration">Borg: Cluster Management and Resource Orchestration</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-colossus-the-distributed-storage-layer">Colossus: The Distributed Storage Layer</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-jupiter-the-high-speed-network-fabric">Jupiter: The High-Speed Network Fabric</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-role-of-columnar-storage-in-vector-operations">The Role of Columnar Storage in Vector Operations</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-accelerating-computations-with-simd">Accelerating Computations with SIMD</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-treeah-indexing-algorithm">The TreeAH Indexing Algorithm</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-1-hierarchical-tree-structure">1. Hierarchical Tree Structure</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-2-product-quantization-pq">2. Product Quantization (PQ)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-3-asymmetric-hashing">3. Asymmetric Hashing</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-architectural-comparison-treeah-vs-hnsw">Architectural Comparison: TreeAH vs. HNSW</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-end-to-end-vector-search-query-flow">The End-to-End Vector Search Query Flow</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-practical-implications-for-engineering-teams">Practical Implications for Engineering Teams</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-1-query-latency-vs-throughput">1. Query Latency vs. Throughput</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-2-cost-model-considerations">2. Cost Model Considerations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-3-index-management-trade-offs">3. Index Management Trade-offs</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-4-integration-benefits-that-actually-matter">4. Integration Benefits That Actually Matter</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-further-reading">Further Reading</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>This article assumes that you have a solid foundation in distributed systems and database internals, including familiarity with concepts like columnar storage, query execution plans, and distributed query processing.</p>
<p>You should understand the basics of vector embeddings and similarity search, though we'll briefly review the fundamentals. Experience with at least one vector database or search system (such as pgvector, Pinecone, or Elasticsearch) will help contextualize the architectural comparisons.</p>
<p>While deep knowledge of Google Cloud Platform isn't required, basic familiarity with cloud data warehouses and their typical architectures will be beneficial. The article includes discussions of SIMD operations and CPU-level optimizations, so comfort with low-level performance considerations is helpful, though not mandatory.</p>
<p>Code examples assume working knowledge of SQL, with some sections referencing implementation details in languages like Python or Java. Most importantly, you should have experience building or operating production data systems at scale, as many insights focus on practical engineering trade-offs rather than theoretical concepts.</p>
<h2 id="heading-the-unique-challenge-of-vector-search">The Unique Challenge of Vector Search</h2>
<p>Vector search fundamentally differs from traditional database operations in ways that challenge our existing infrastructure assumptions. Where conventional queries leverage decades of optimization around exact matching and range scans, vector similarity search requires computing distances between high-dimensional points at massive scale.</p>
<p>Consider the numbers. Modern embedding models produce vectors with 768 or more dimensions. At 4 bytes per float32 value, a single embedding consumes roughly 3KB. A modest corpus of 100 million items translates to 300GB of vector data.</p>
<p>But the real challenge isn't storage. The killer is computation. Finding the nearest neighbors to a query vector means computing a distance across all of those dimensions for every candidate. For 100 million vectors of 768 dimensions each, a brute-force search requires 76.8 billion multiply-accumulate operations per query just for the distance calculations. Even with modern SIMD instructions processing 16 floats at once, you're looking at billions of CPU cycles per search.</p>
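<p>The arithmetic behind those figures can be reproduced in a few lines. This is only a back-of-envelope sketch in Python: the function and constant names are made up for illustration, and the lane count assumes 16 float32 values per 512-bit register.</p>
<pre><code class="lang-python"># Back-of-envelope sizing for a brute-force scan (illustrative numbers only).
DIMS = 768                 # dimensions per embedding
BYTES_PER_FLOAT32 = 4
NUM_VECTORS = 100_000_000  # corpus size
SIMD_LANES = 16            # float32 values per 512-bit register


def brute_force_cost(num_vectors, dims, lanes):
    """Return bytes per vector, total bytes, multiply-accumulates, and SIMD steps."""
    bytes_per_vector = dims * BYTES_PER_FLOAT32
    total_bytes = num_vectors * bytes_per_vector
    multiply_accumulates = num_vectors * dims
    simd_steps = multiply_accumulates // lanes
    return bytes_per_vector, total_bytes, multiply_accumulates, simd_steps


per_vec, total, macs, steps = brute_force_cost(NUM_VECTORS, DIMS, SIMD_LANES)
print(per_vec)        # 3072 bytes, roughly 3KB
print(total / 1e9)    # 307.2 GB, roughly 300GB
print(macs / 1e9)     # 76.8 billion multiply-accumulates
print(steps / 1e9)    # 4.8 billion 16-wide steps
</code></pre>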
<p>This computational reality forces a fundamental compromise: we abandon exact solutions for approximate ones. Approximate Nearest Neighbor (ANN) algorithms trade perfect accuracy for practical query times. They work by partitioning the vector space cleverly, building graphs of nearest neighbors, or using hashing schemes to avoid examining every vector. The engineering challenge becomes balancing query latency, recall accuracy, and resource consumption.</p>
<p>Most purpose-built vector databases address this through specialized in-memory indexes like HNSW or IVF. These work well for low-latency single queries but typically require keeping large indexes in RAM. If you aren't familiar with these index types, you can read <a target="_blank" href="https://medium.com/towards-artificial-intelligence/unlocking-the-power-of-efficient-vector-search-in-rag-applications-c2e3a0c551d5">this article</a>.</p>
<p>BigQuery took a different path. Rather than optimizing for single-query latency, they asked what vector search would look like when built for analytical workloads at warehouse scale. The answer required rethinking basic assumptions about index design, storage layout, and query execution.</p>
<h2 id="heading-bigquerys-foundational-distributed-architecture">BigQuery's Foundational Distributed Architecture</h2>
<p>BigQuery's vector search runs on the same infrastructure that's been processing SQL queries since 2011. No new cluster type. No specialized vector nodes. Just four core technologies that power most of Google's data processing, now handling a workload they weren't originally designed for.</p>
<p>This isn't the obvious choice. Most vector databases build specialized infrastructure optimized for similarity search. Graph-based indexes need fast random access. In-memory systems require careful memory management. BigQuery took its existing distributed SQL engine and asked: can we make this work for vectors, too?</p>
<p>The answer required leveraging four foundational systems in new ways:</p>
<ul>
<li><p>Dremel, the query engine that normally handles SQL, now orchestrates vector similarity computations.</p>
</li>
<li><p>Borg, which allocates resources for everything from Search to YouTube, dynamically assigns thousands of workers to vector queries.</p>
</li>
<li><p>Colossus stores embeddings in the same distributed filesystem that holds petabytes of analytics data.</p>
</li>
<li><p>And Jupiter's datacenter network, built for bulk data processing, now shuttles vector data between computation nodes.</p>
</li>
</ul>
<p>What's surprising isn't that it works, but how well it works. The same architecture that runs aggregate queries over trillion-row tables can search billion-scale vector collections. Understanding how requires examining each component and how they've been adapted for this new workload.</p>
<h3 id="heading-dremel-the-distributed-query-engine">Dremel: The Distributed Query Engine</h3>
<p>At its core, BigQuery is powered by Dremel, a distributed query execution engine developed at Google since 2006.</p>
<p>Dremel processes SQL queries using a hierarchical serving tree. A root server receives the query and orchestrates the execution, while mixer nodes break down the work and distribute it to hundreds or thousands of leaf nodes. These leaf nodes perform the actual computations in parallel on segments of the data.</p>
<p>This architecture allows BigQuery to dynamically allocate a massive number of execution threads, known as slots, to a single query, enabling it to process petabytes of data in seconds.</p>
<h3 id="heading-borg-cluster-management-and-resource-orchestration">Borg: Cluster Management and Resource Orchestration</h3>
<p>The serverless nature of BigQuery is made possible by Borg, Google's cluster management system that predates and inspired Kubernetes.</p>
<p>When a vector search query is submitted, Borg is responsible for finding available machines across Google's global data centers, allocating the precise amount of CPU and memory resources needed for the query's Dremel slots, and managing fault tolerance by automatically rescheduling work if a machine fails. This dynamic resource allocation means users do not need to provision or scale infrastructure, whether they are searching 1,000 vectors or 10 billion.</p>
<h3 id="heading-colossus-the-distributed-storage-layer">Colossus: The Distributed Storage Layer</h3>
<p>Data in BigQuery is stored in Colossus, Google's next-generation distributed file system. Colossus is designed for exabyte-scale storage, provides high availability through automatic cross-datacenter replication, and is optimized for the high-throughput parallel reads required by Dremel's leaf nodes.</p>
<p>During a vector search, Colossus can deliver data to thousands of nodes simultaneously without creating a storage bottleneck.</p>
<h3 id="heading-jupiter-the-high-speed-network-fabric">Jupiter: The High-Speed Network Fabric</h3>
<p>These compute and storage systems are interconnected by Jupiter, Google's internal datacenter network, which features a petabit-per-second bisection bandwidth. The network's design ensures that data can move between Colossus storage and Dremel compute nodes at extremely high speeds, making data shuffling and aggregation phases of a query efficient.  </p>
<p><img alt="BigQuery vector search architecture is powered by the Dremel query engine, the Borg orchestrator for resource allocation, Colossus for large-scale data storage, and the Jupiter network for ultra-high-bandwidth data transfer" width="600" height="400" loading="lazy"></p>
<h2 id="heading-the-role-of-columnar-storage-in-vector-operations">The Role of Columnar Storage in Vector Operations</h2>
<p>Storing vectors in columns sounds wrong. Vectors are arrays. They belong together. Why split them across columnar storage?</p>
<p>BigQuery does it anyway, and it works brilliantly. Here's why.</p>
<p>When you search a million vectors, you need exactly one thing from each row: the embedding. Not the product name, price, or category. Just the vector. Row-oriented storage forces you to read entire records and throw away 90% of the data. Columnar storage reads only what you need.</p>
<p>The performance impact is dramatic. A table with 768-dimensional embeddings plus 20 other columns might total 3TB. Reading just the embedding column? 300GB. That's a <strong>10x reduction in I/O</strong> before you've done any actual computation.</p>
<p>But the real magic happens at the CPU level. Columnar storage naturally aligns vector data for SIMD processing. Instead of jumping around memory gathering vector components, the CPU finds them laid out sequentially, ready for bulk operations. Modern processors can load 16 floating-point values into a single register and process them simultaneously.</p>
<p>Compression becomes almost trivial, too. BigQuery's Capacitor format applies techniques like Product Quantization directly to the column data, shrinking vectors from 3KB to under 300 bytes. Try doing that with row-oriented storage where vectors are scattered across pages.</p>
<p>The lesson? Sometimes the "wrong" abstraction at one level enables the right optimizations at another.</p>
<h3 id="heading-accelerating-computations-with-simd">Accelerating Computations with SIMD</h3>
<p>SIMD instructions are a form of hardware-level parallelism available in modern CPUs that provide significant speedups for vector arithmetic. This is achieved through special instruction sets built into the processor.</p>
<p>For example, AVX-512 (Advanced Vector Extensions 512-bit) is an instruction set found in modern high-performance CPUs, such as those from Intel, that allows a single instruction to operate on 512 bits of data at once.</p>
<p>Since a standard single-precision floating-point number is 32 bits, a CPU with AVX-512 can process 16 floating-point numbers in a single operation. This leads to dramatic performance gains.</p>
<p>The difference between scalar and SIMD processing for vector distance calculations is stark:</p>
<ul>
<li><p><strong>Scalar approach</strong>: Loop through each dimension, multiply corresponding components, accumulate results. For 768 dimensions, that's 768 multiplications, 768 additions, and terrible cache performance as you jump between two different memory locations for each iteration.</p>
</li>
<li><p><strong>SIMD approach</strong>: Load 16 components from each vector into 512-bit registers. Execute a single multiply instruction that handles all 16 pairs. Execute a single add that folds the 16 products into an accumulator register (or fuse both steps into one fused multiply-add). Repeat 48 times, then sum the accumulator's 16 lanes once at the end. The CPU's pipeline stays full, the cache prefetcher knows exactly what data you need next, and you've turned 1,536 operations into roughly 96.</p>
</li>
</ul>
<p>The columnar storage pays off here, too. Vectors stored contiguously in memory align perfectly with SIMD register loads. No gather operations, no wasted cycles. Just pure throughput.</p>
<p><img alt="TreeAH SIMD In-Register Operations Speed up distance calculations with the help of pre-computed distance table and parallel operations " width="600" height="400" loading="lazy"></p>
<p>BigQuery's query engine is designed to leverage SIMD extensively. It automatically detects and uses the optimal instruction set available on the underlying hardware (for example, AVX-512 for Intel, NEON for ARM). The columnar storage format ensures that vector data is laid out in memory in a way that is friendly to SIMD registers, and the engine processes query vectors in large batches to maximize the utilization of these parallel instructions.</p>
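<p>To make the scalar-versus-SIMD contrast concrete, here is a small Python emulation. It doesn't use real SIMD instructions and says nothing about BigQuery's internal code. It only models the idea that each loop iteration stands in for one register-wide multiply and add over 16 lanes. The function names are hypothetical.</p>
<pre><code class="lang-python">LANES = 16  # float32 values that fit in one 512-bit register


def dot_scalar(a, b):
    """One multiply and one add per dimension."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_lanewise(a, b, lanes=LANES):
    """Emulate SIMD: each outer iteration stands for one multiply and one add
    instruction applied to `lanes` components at once."""
    acc = [0.0] * lanes          # models the accumulator register
    steps = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        for lane, (x, y) in enumerate(zip(chunk_a, chunk_b)):
            acc[lane] += x * y   # in hardware, all lanes update together
        steps += 1
    return sum(acc), steps       # one final reduction across the lanes


a = [float(i % 7) for i in range(768)]
b = [float(i % 5) for i in range(768)]
result, steps = dot_lanewise(a, b)
print(steps)                        # 48 register-wide steps
print(result == dot_scalar(a, b))   # True: every partial sum is a small exact integer
</code></pre>
<p>For 768 dimensions, the loop runs 48 times, which matches the count in the list above. The single <code>sum(acc)</code> at the end plays the role of the final horizontal reduction.</p>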
<h2 id="heading-the-treeah-indexing-algorithm">The TreeAH Indexing Algorithm</h2>
<p>While brute-force search can be effective at smaller scales due to BigQuery's massive parallelism, efficient search over billions of vectors requires an index. BigQuery's primary vector index is TreeAH (Tree with Asymmetric Hashing), which is based on Google's open-sourced ScaNN (Scalable Nearest Neighbors) algorithm. TreeAH combines three techniques to achieve high performance and memory efficiency.</p>
<h3 id="heading-1-hierarchical-tree-structure">1. Hierarchical Tree Structure</h3>
<p>The algorithm first partitions the entire vector space into thousands of smaller lists. You can think of this like organizing a massive library. Instead of having one giant room with a million books, a library has floors, sections, and shelves. This hierarchy allows you to find a book without scanning every single one.</p>
<p>Similarly, TreeAH groups semantically similar vectors together into partitions and arranges them in a tree. During a query, the search navigates this tree by comparing the query vector to "centroid" vectors that represent the center of each partition, effectively following a path to the most relevant partitions and pruning away large, irrelevant branches of the search space.</p>
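<p>The pruning idea can be sketched with a single level of partitions. This is a simplified illustration rather than TreeAH itself: the toy data, the <code>num_probe</code> parameter, and the function names are all assumptions, and a real index would nest centroids into a deeper tree.</p>
<pre><code class="lang-python">import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy index: each entry is (centroid, [(item_id, vector), ...]).
PARTITIONS = [
    ((0.0, 0.0), [("a", (0.1, 0.2)), ("b", (-0.3, 0.1))]),
    ((10.0, 10.0), [("c", (9.5, 10.2)), ("d", (10.4, 9.9))]),
    ((-10.0, 5.0), [("e", (-9.8, 5.3)), ("f", (-10.5, 4.6))]),
]


def search(query, partitions, num_probe=1, top_k=2):
    """Rank partitions by centroid distance, then scan only the closest ones."""
    ranked = sorted(partitions, key=lambda p: l2(query, p[0]))
    candidates = []
    for _centroid, members in ranked[:num_probe]:
        for item_id, vec in members:
            candidates.append((l2(query, vec), item_id))
    candidates.sort()
    return [item_id for _dist, item_id in candidates[:top_k]]


print(search((9.0, 9.0), PARTITIONS))   # ['c', 'd']
</code></pre>
<p>With <code>num_probe=1</code>, four of the six vectors are never touched. Raising it trades extra work for better recall when a true neighbor sits just across a partition boundary.</p>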
<h3 id="heading-2-product-quantization-pq">2. Product Quantization (PQ)</h3>
<p>Within TreeAH, PQ serves a different purpose than just compression. The index doesn't just store smaller vectors – it fundamentally changes how distance calculations work.</p>
<p>TreeAH learns partition-specific codebooks that capture the local structure of vectors in each tree node. This means vectors that end up in the "shoes" partition get quantized differently than those in "electronics." The compression becomes semantic-aware.</p>
<p>When combined with the tree structure, this creates a powerful effect: not only are you searching fewer vectors (thanks to the tree), but you're computing distances faster on the vectors you do search (thanks to PQ).</p>
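<p>As a rough illustration of the encoding step, the sketch below splits a vector into sub-vectors and replaces each one with the index of its nearest codeword. The codebooks here are hand-written placeholders. A real index would learn them from the data in each partition (for example, with k-means), and the names are hypothetical.</p>
<pre><code class="lang-python">def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical codebooks: one per sub-space, each holding four 2-d codewords.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],   # sub-space 0
    [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)],   # sub-space 1
]
SUB_DIM = 2


def pq_encode(vector, codebooks=CODEBOOKS, sub_dim=SUB_DIM):
    """Replace each sub-vector with the index of its nearest codeword."""
    codes = []
    for m, codebook in enumerate(codebooks):
        sub = vector[m * sub_dim:(m + 1) * sub_dim]
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_l2(sub, codebook[k]))
        codes.append(nearest)
    return codes


print(pq_encode([0.9, 0.1, 0.2, 1.8]))   # [1, 2]
</code></pre>
<p>Four floats become two small integers. With 256 codewords per sub-space, each code would fit in a single byte, which is where the large size reduction comes from.</p>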
<h3 id="heading-3-asymmetric-hashing">3. Asymmetric Hashing</h3>
<p>The "asymmetric" aspect refers to the fact that the query vector is kept in its full-precision form, while the database vectors are compared in their compressed, quantized form.</p>
<p>The vectors are not of different dimensions, but of different precision. The semantic matching works because the comparison is not direct. The compressed database vector is a code that points to a region in the original vector space. The distance calculation uses the full-precision query vector to look up a pre-computed distance to the center of that region. This way, the rich information in the query vector is used to accurately estimate the distance, avoiding the significant information loss that would occur if both vectors were compressed.</p>
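<p>Here is a minimal sketch of that lookup-table idea, assuming a product-quantized index with two sub-spaces. The codebooks, codes, and function names are invented for illustration and aren't taken from BigQuery or ScaNN.</p>
<pre><code class="lang-python">def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical codebooks: one list of codewords per sub-space.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)],
]
SUB_DIM = 2


def build_lookup_table(query, codebooks=CODEBOOKS, sub_dim=SUB_DIM):
    """Per sub-space, distance from the full-precision query to every codeword."""
    table = []
    for m, codebook in enumerate(codebooks):
        sub_query = query[m * sub_dim:(m + 1) * sub_dim]
        table.append([squared_l2(sub_query, word) for word in codebook])
    return table


def asymmetric_distance(table, codes):
    """Estimate the squared distance with one table lookup per sub-space."""
    return sum(table[m][code] for m, code in enumerate(codes))


query = [1.0, 0.0, 0.0, 2.0]          # stays in full precision
table = build_lookup_table(query)     # computed once per query
database_codes = {"x": [1, 2], "y": [0, 3]}   # compressed database vectors
estimates = {k: asymmetric_distance(table, c) for k, c in database_codes.items()}
print(estimates)   # {'x': 0.0, 'y': 5.0}
</code></pre>
<p>The table is built once per query. After that, scoring each database vector costs a handful of lookups and additions instead of a full 768-dimensional distance computation.</p>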
<h3 id="heading-architectural-comparison-treeah-vs-hnsw">Architectural Comparison: TreeAH vs. HNSW</h3>
<p>To better understand the design philosophy behind TreeAH, it’s useful to compare it with HNSW (Hierarchical Navigable Small World), a popular graph-based algorithm used in many dedicated vector databases.</p>
<p>HNSW constructs a multi-layered graph where vectors are nodes and edges connect them to their nearest neighbors. It’s known for excellent single-query latency.</p>
<p>But this performance comes with significant memory overhead, as the graph structure must be stored in addition to the full-precision vectors. HNSW index builds can also be time-consuming, and frequent data updates can lead to memory fragmentation and performance degradation.</p>
<p>TreeAH, in contrast, makes different architectural trade-offs that align with BigQuery's nature as a distributed analytics system.</p>
<p>The comparison reveals a fundamental design choice: TreeAH prioritizes batch throughput, memory efficiency, and scalability over absolute single-query latency. This makes it well-suited for analytical workloads where thousands of searches are performed simultaneously.</p>
<p><img alt="TreeAH vs. HNSW Architectural Comparison" width="600" height="400" loading="lazy"></p>
<h2 id="heading-the-end-to-end-vector-search-query-flow">The End-to-End Vector Search Query Flow</h2>
<p>The execution timeline of a BigQuery vector search demonstrates how parallel processing eliminates traditional bottlenecks. When a VECTOR_SEARCH query arrives, the system initiates multiple operations concurrently rather than executing them sequentially.</p>
<p>The root server begins query planning immediately upon receiving the request. In parallel, Borg starts allocating compute slots across the cluster, targeting 1,000 slots distributed across 50 or more nodes. Borg prioritizes slots that are physically close to the data in Colossus to minimize data movement costs. This allocation typically completes within 10 milliseconds.</p>
<p>Query planning and resource allocation overlap significantly. The mixer nodes receive partial execution plans and begin partitioning the search space before Borg completes all slot allocations. When TreeAH indexes are available, mixers use them to assign specific vector partitions to leaf nodes. This streaming approach ensures that leaf nodes receive work assignments as soon as they come online.</p>
<p>The parallel execution phase showcases the architecture's efficiency. Hundreds or thousands of leaf nodes simultaneously read their assigned vector partitions from Colossus. Jupiter's high-bandwidth network prevents I/O congestion even with thousands of concurrent reads. Each leaf node operates independently: loading compressed vectors, executing SIMD operations for distance calculations, and maintaining local top-k results.</p>
<p>Aggregation begins before all leaf nodes complete their local searches. Mixers implement a streaming merge algorithm that processes results as they arrive. This approach means that by the time the slowest leaf node reports its results, the mixers have already processed most of the data. The final global top-k emerges from this continuous merging process.</p>
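<p>A streaming merge like this can be sketched with a bounded heap. The following Python is an illustrative stand-in rather than the mixers' actual algorithm, and the batch layout and names are assumptions.</p>
<pre><code class="lang-python">import heapq


def merge_top_k(partial_results, k):
    """Fold leaf results into a global top-k as each batch arrives.

    A heap of size k with negated distances (heapq is a min-heap) keeps
    the k smallest distances seen so far.
    """
    heap = []
    for batch in partial_results:            # one batch per leaf, in arrival order
        for distance, item_id in batch:
            entry = (-distance, item_id)
            if len(heap) == k:
                heapq.heappushpop(heap, entry)   # push, then drop the current worst
            else:
                heapq.heappush(heap, entry)
    return sorted((-neg, item_id) for neg, item_id in heap)


leaf_batches = [
    [(0.42, "p7"), (0.10, "p3")],    # leaf 1's local top-k
    [(0.05, "p9"), (0.77, "p1")],    # leaf 2
    [(0.30, "p4")],                  # leaf 3, slowest to report
]
print(merge_top_k(leaf_batches, k=3))   # [(0.05, 'p9'), (0.1, 'p3'), (0.3, 'p4')]
</code></pre>
<p>Because the heap never holds more than <code>k</code> entries, the merger's memory stays constant no matter how many leaves report, and the last batch to arrive costs only a few heap operations.</p>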
<p>The measured 40-millisecond execution time represents the longest path through the parallel execution graph, not the sum of individual operations. Most operations complete much faster, but the overall latency is bounded by the slowest component. This design trades single-query latency for massive throughput, enabling BigQuery to process thousands of vector searches concurrently across billions of vectors.</p>
<p><img alt="BigQuery Vector Search Timeline" width="600" height="400" loading="lazy"></p>
<h2 id="heading-practical-implications-for-engineering-teams">Practical Implications for Engineering Teams</h2>
<p>The architectural choices behind BigQuery's vector search create specific trade-offs that engineering teams need to understand before committing to this approach.</p>
<h3 id="heading-1-query-latency-vs-throughput">1. Query Latency vs. Throughput</h3>
<p>BigQuery vector searches typically complete in 1-10 seconds, not the sub-100ms latency of specialized vector databases. But you can run thousands of searches concurrently without degradation. This makes BigQuery ideal for batch recommendation generation, similarity analysis across product catalogs, or embedding-based data enrichment pipelines. It's the wrong choice for autocomplete features or real-time personalization that requires immediate responses.</p>
<h3 id="heading-2-cost-model-considerations">2. Cost Model Considerations</h3>
<p>Under on-demand pricing, BigQuery charges for data scanned, not query execution time. A vector search that scans 1TB costs the same whether it completes in 2 seconds or 20 seconds. This model favors workloads where you search large datasets infrequently rather than small datasets continuously. Running vector search on a 10GB table thousands of times per day will likely be more expensive than a dedicated vector database with fixed infrastructure costs.</p>
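<p>The shape of that trade-off is easy to see with some rough arithmetic. In the sketch below, the price is a placeholder and not a quoted BigQuery rate. It also assumes that every query is billed for the whole table, which is an upper bound, since an index can reduce the bytes actually read.</p>
<pre><code class="lang-python">def on_demand_cost_per_day(table_gb, queries_per_day, price_per_tb):
    """Scan-based billing: every query pays for the bytes it reads."""
    scanned_tb = table_gb * queries_per_day / 1000   # decimal units for simplicity
    return scanned_tb * price_per_tb


PRICE_PER_TB = 5.0   # placeholder value, NOT a quoted BigQuery price

# Small table queried constantly vs. large table queried rarely.
print(on_demand_cost_per_day(10, 5000, PRICE_PER_TB))    # 250.0
print(on_demand_cost_per_day(1000, 10, PRICE_PER_TB))    # 50.0
</code></pre>
<p>Under these assumptions, the 10GB table ends up costing five times more per day than the 1TB table, purely because of how often it is queried.</p>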
<h3 id="heading-3-index-management-trade-offs">3. Index Management Trade-offs</h3>
<p>TreeAH indexes update automatically in the background when new data arrives, typically within 5-15 minutes. You cannot force immediate index updates or control index parameters like you can with HNSW or IVF indexes. This simplicity reduces operational overhead but limits optimization options. If your use case requires fine-tuning recall/latency trade-offs or immediate consistency after updates, you'll need a different solution.</p>
<h3 id="heading-4-integration-benefits-that-actually-matter">4. Integration Benefits That Actually Matter</h3>
<p>The ability to JOIN vector search results with business data in a single query is more powerful than it initially appears. Consider this query pattern:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">WITH</span> semantic_matches <span class="hljs-keyword">AS</span> (
  <span class="hljs-comment">-- VECTOR_SEARCH returns each matched row as a struct named base</span>
  <span class="hljs-keyword">SELECT</span> base.item_id <span class="hljs-keyword">AS</span> item_id, distance
  <span class="hljs-keyword">FROM</span> VECTOR_SEARCH(
    <span class="hljs-keyword">TABLE</span> products,
    <span class="hljs-string">'embedding'</span>,
    (<span class="hljs-keyword">SELECT</span> embedding <span class="hljs-keyword">FROM</span> queries <span class="hljs-keyword">WHERE</span> query_id = @query_id),
    <span class="hljs-comment">-- fetch more than the default 10 so the filters below have candidates left</span>
    top_k =&gt; <span class="hljs-number">100</span>
  )
)
<span class="hljs-keyword">SELECT</span> p.*, s.distance
<span class="hljs-keyword">FROM</span> semantic_matches s
<span class="hljs-keyword">JOIN</span> products p <span class="hljs-keyword">USING</span> (item_id)
<span class="hljs-keyword">WHERE</span> p.in_stock = <span class="hljs-literal">TRUE</span>
  <span class="hljs-keyword">AND</span> p.price <span class="hljs-keyword">BETWEEN</span> <span class="hljs-number">50</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">200</span>
  <span class="hljs-keyword">AND</span> p.category_restrictions <span class="hljs-keyword">IS</span> <span class="hljs-literal">NULL</span>
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> s.distance
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">20</span>
</code></pre>
<p>This combines semantic search with business logic, inventory status, and access controls in one atomic operation. Implementing this with a separate vector database requires complex synchronization between systems.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>BigQuery's vector search implementation challenges our assumptions about what a data warehouse can do. Instead of building another specialized vector database, Google pushed its existing infrastructure to handle a fundamentally different workload.</p>
<p>The key insight is recognizing that vector search at scale is a data processing problem. And processing data at scale is what BigQuery was built for.</p>
<p>By leveraging its columnar architecture and hardware-aware algorithms like TreeAH, BigQuery makes a deliberate trade-off. It exchanges the sub-100ms latency of in-memory systems for massive batch throughput and resource efficiency. An index that uses <strong>10x less memory</strong> than HNSW is a trade-off many teams building analytical AI systems would gladly make.</p>
<p>The real power emerges when vectors live alongside business data. Complex queries that would require multiple systems and synchronization nightmares become simple SQL. "Find similar products, but only from reliable suppliers, in stock locally, with no recent quality issues." One query, one system, no architectural gymnastics.</p>
<p>This approach validates a broader trend: vector capabilities are becoming table stakes for data platforms. The question isn't whether your data platform will support vectors, but how well it integrates them into existing workflows.</p>
<p>For teams building analytical AI applications, BigQuery offers a pragmatic path. It won't win latency benchmarks against dedicated vector databases. But for batch processing, integrated analytics, and operational simplicity at scale, it demonstrates that sometimes the best vector database isn't a vector database at all. It's your data warehouse, evolved.</p>
<h3 id="heading-further-reading">Further Reading</h3>
<ul>
<li><p><a target="_blank" href="https://cloud.google.com/blog/products/bigquery/bigquery-under-the-hood">BigQuery Under the Hood</a>: Official architecture deep dive</p>
</li>
<li><p><a target="_blank" href="https://github.com/google-research/google-research/tree/master/scann/docs/algorithms.md">ScaNN Algorithm Details</a>: The mathematics behind TreeAH</p>
</li>
<li><p><a target="_blank" href="https://research.google/pubs/pub36632/">Dremel: Interactive Analysis of Web-Scale Datasets</a>: The foundational paper</p>
</li>
<li><p><a target="_blank" href="https://research.google/pubs/pub43438/">Large-scale cluster management at Google with Borg</a>: Understanding resource orchestration</p>
</li>
<li><p><a target="_blank" href="https://research.google/pubs/pub43837/">Jupiter Rising: A Decade of Clos Topologies</a>: Google's datacenter networking</p>
</li>
<li><p><a target="_blank" href="https://medium.com/google-cloud/bigquery-vector-search-a-practitioners-guide-0f85b0d988f0">BigQuery Vector Search: A Practitioner's Guide</a>: Optimization strategies</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Persist State in Time-Series Models with Docker and Redis ]]>
                </title>
                <description>
                    <![CDATA[ Have you ever built a brilliant time-series model, one that could forecast sales or predict stock prices, only to watch it fail in the real world? Well, this is a common frustration. Your model works perfectly on your machine, but the moment you depl... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-persist-state-in-time-series-models-with-docker-and-redis/</link>
                <guid isPermaLink="false">68e70d838fa4b92d9a027ebe</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Time Series Forecasting ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Redis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ PersistentVolumes ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Chirag Agrawal ]]>
                </dc:creator>
                <pubDate>Thu, 09 Oct 2025 01:18:59 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759972706788/66d45afa-f86b-4365-8a55-8b6873df718b.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Have you ever built a brilliant time-series model, one that could forecast sales or predict stock prices, only to watch it fail in the real world? Well, this is a common frustration. Your model works perfectly on your machine, but the moment you deploy it in a Docker container, it seems to develop amnesia. It forgets everything it knew yesterday, making its predictions for tomorrow useless.</p>
<p>Don’t worry. This probably isn't a flaw in your model. It's a clash between how time-series models and Docker containers are designed to work.</p>
<p>Time-series models are all about memory. They need to remember the past to predict the future. But Docker containers are built to be disposable: whatever a container holds in memory or writes to its own filesystem is gone once the container is replaced. This fundamental conflict can turn a powerful model into a worthless one in production.</p>
<p>In this article, we’ll solve that problem. We're going to give your time-series model a permanent memory. You'll learn how to build a production-ready prediction service that uses Redis as an external brain and Docker volumes to ensure that memory survives any restart. We'll walk through a hands-on example, step by step, so you can learn how to build a system that is both intelligent and reliable.</p>
<h3 id="heading-what-well-cover">What we’ll cover:</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-who-is-this-guide-for">Who is This Guide For?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-the-problem">Understanding the Problem</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-so-what-is-a-time-series-model">So, what is a time-series model?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-1-containers-are-ephemeral-by-design">1. Containers are ephemeral by design</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-2-lost-context-between-predictions">2. Lost context between predictions</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-3-model-amnesia-on-restart">3. Model amnesia on restart</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-solution-external-state-store">The Solution: External State Store</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-hands-on-implementation">Hands-On Implementation</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-start-with-the-broken-approach">Start with the broken approach</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-fix-it-with-volumes">How to fix it with volumes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-the-code-handles-state">How the code handles state</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-test-the-health-endpoint">Test the health endpoint</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-what-about-scaling">What About Scaling?</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-horizontal-scaling-with-redis-cluster">Horizontal scaling with Redis Cluster</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-high-availability-with-redis-sentinel">High availability with Redis Sentinel</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-use-managed-redis-services">Use managed Redis services</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-common-pitfalls-to-avoid">Common Pitfalls to Avoid</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-dont-assume-volumes-work">Don't assume volumes work</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-dont-ignore-redis-memory-limits">Don't ignore Redis memory limits</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-dont-skip-monitoring">Don't skip monitoring</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-who-is-this-guide-for">Who is This Guide For?</h2>
<p>To get the most out of this tutorial, it’ll be helpful to have a few things under your belt. We’ll be diving into some code and command-line work, so a little preparation will go a long way.</p>
<ul>
<li><p>The main tools for this project are <a target="_blank" href="https://docs.docker.com/get-started/get-docker/">Docker</a> and <a target="_blank" href="https://docs.docker.com/compose/">Docker Compose</a>. Make sure you have them installed and running on your computer.</p>
</li>
<li><p>You’ll also find it easier to follow along if you’re comfortable with the basics of Docker, Python, and the <a target="_blank" href="http://flask.palletsprojects.com/en/stable/quickstart/">Flask</a> web framework. A bit of command-line experience will also be handy for running the commands in the tutorial.</p>
</li>
<li><p>But don't worry if you've never used <a target="_blank" href="https://redis.io/docs/latest/">Redis</a> before. All you need to know is that it’s a fast, in-memory database. We’ll handle the rest along the way.</p>
</li>
</ul>
<p>Think of this as a guided tour. As long as you're curious and have the basic tools ready, you'll be in great shape.</p>
<h2 id="heading-understanding-the-problem">Understanding the Problem</h2>
<p>Before jumping into solutions, let's first clarify what a time-series model is and then explore why containerizing it is so tricky.</p>
<h3 id="heading-so-what-is-a-time-series-model">So, what is a time-series model?</h3>
<p>Simply put, a time-series model is a type of model that analyzes data points collected over time to predict future values. Think of it like predicting the weather. A meteorologist doesn't just look at the sky right now. They look at the temperature, pressure, and wind patterns from the last few hours and days to forecast what will happen tomorrow.</p>
<p>Time-series models do the same thing with data, whether it's website traffic, stock prices, or energy consumption. The key takeaway is that history matters. The sequence of past events provides the context needed to make an intelligent prediction about the future.</p>
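<p>To make "history matters" concrete, here is a minimal sketch of a naive forecaster. The function name <code>predict_next</code> and the average-step rule are assumptions chosen for illustration, not the model used in the demo repository.</p>

```python
def predict_next(values):
    """Forecast the next value by extending the average step between points."""
    if not values:
        raise ValueError("need at least one observation")
    if len(values) == 1:
        # No history: the best we can do is repeat the last value.
        return values[0]
    # Average change per step across the whole history.
    avg_step = (values[-1] - values[0]) / (len(values) - 1)
    return values[-1] + avg_step


print(predict_next([30]))          # a single point, no context
print(predict_next([10, 20, 30]))  # the upward trend is visible
```

<p>With a single point the function can only echo the value back. With three points it sees the trend and extrapolates to 40. Same function, very different answer, and the only thing that changed is how much history it was given.</p>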
<p>Now, here’s what breaks when you put these models in Docker.</p>
<h3 id="heading-1-containers-are-ephemeral-by-design">1. Containers are ephemeral by design</h3>
<p>Docker containers are meant to be stateless. This works great for most APIs. A user profile endpoint? Stateless. A sentiment analysis model? Stateless. They take an input, return an output, and forget everything in between.</p>
<p>Time-series models don't work this way. They need context from previous predictions. Without it, your model is essentially blind.</p>
<h3 id="heading-2-lost-context-between-predictions">2. Lost context between predictions</h3>
<p>Each prediction happens in isolation. Your model receives a single data point and makes a guess without knowing what came before. This defeats the entire purpose of time-series modeling.</p>
<p>You may think: "I'll just load all historical data on every request." But that approach fails for two reasons:</p>
<ul>
<li><p>It's slow, and it gets slower as the history grows into thousands of data points.</p>
</li>
<li><p>It doesn't scale. When you have multiple series or high request volume, you'll hit performance walls fast.</p>
</li>
</ul>
<h3 id="heading-3-model-amnesia-on-restart">3. Model amnesia on restart</h3>
<p>Every time you deploy a new version or the container crashes, all accumulated state disappears. Your model starts from scratch. In production, this is unacceptable.</p>
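<p>You can reproduce the amnesia without Docker at all. The sketch below uses a hypothetical <code>InProcessForecaster</code> class that keeps its history in a plain Python list, and simulates a restart by creating a fresh object, which is roughly what a new container process amounts to.</p>

```python
class InProcessForecaster:
    """Keeps history in a Python list, so it lives and dies with the process."""

    def __init__(self):
        self.history = []

    def observe(self, value):
        self.history.append(value)
        return len(self.history)


service = InProcessForecaster()
for v in (10, 20, 30):
    service.observe(v)
points_before = len(service.history)

# A redeploy or crash starts a fresh process: simulate that with a new object.
service = InProcessForecaster()
points_after = len(service.history)

print(points_before, points_after)
```

<p>Three points before the "restart", zero after. Nothing is wrong with the class itself. The state simply lived in the wrong place.</p>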
<h2 id="heading-the-solution-external-state-store">The Solution: External State Store</h2>
<p>Instead of keeping state inside the container, we’ll move it outside. Redis becomes the model's memory.</p>
<p>The pattern looks like this:</p>
<pre><code class="lang-plaintext">Client Request → Flask API → Redis → Prediction with Context
</code></pre>
<p>Your container stays stateless and replaceable. But the system as a whole maintains state through Redis.</p>
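<p>Here is a rough sketch of that pattern in code. To keep it runnable without a server, a hypothetical <code>DictStore</code> class stands in for Redis, and the averaging "model" is only a placeholder. The point is the shape of <code>handle_predict</code>: it owns no state of its own.</p>

```python
class DictStore:
    """In-memory stand-in for Redis so the sketch runs without a server."""

    def __init__(self):
        self._series = {}

    def append(self, series_id, value):
        self._series.setdefault(series_id, []).append(value)

    def recent(self, series_id, limit=100):
        return self._series.get(series_id, [])[-limit:]


def handle_predict(store, series_id, new_values):
    """Stateless handler: all context is read from and written to the store."""
    for value in new_values:
        store.append(series_id, value)
    history = store.recent(series_id)
    if not history:
        return {"prediction": None, "data_points_used": 0}
    # Placeholder model: average of the history seen so far.
    prediction = sum(history) / len(history)
    return {"prediction": prediction, "data_points_used": len(history)}


store = DictStore()  # plays the role of the Redis service
first = handle_predict(store, "demo", [10, 20])
second = handle_predict(store, "demo", [30])  # could be a different container
print(first["data_points_used"], second["data_points_used"])
```

<p>Because the handler only talks to the store, any API container can serve any request, and replacing a container loses nothing as long as the store survives.</p>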
<h2 id="heading-hands-on-implementation">Hands-On Implementation</h2>
<p>Let's build this. Clone the demo repository:</p>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> https://github.com/ag-chirag/docker-redis-time-series
<span class="hljs-built_in">cd</span> docker-redis-time-series
</code></pre>
<h3 id="heading-start-with-the-broken-approach">Start with the broken approach</h3>
<p>The <code>docker-compose.initial.yml</code> file shows what NOT to do:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">services:</span>
  <span class="hljs-attr">api:</span>
    <span class="hljs-attr">build:</span> <span class="hljs-string">./flask-api</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"5000:5000"</span>

  <span class="hljs-attr">redis:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">redis:alpine</span>
</code></pre>
<p>Notice what's missing? No volumes. Redis holds its data in memory, and anything it writes to disk lands in storage that belongs to this one container, which means that data is temporary.</p>
<p>Run it:</p>
<pre><code class="lang-bash">docker compose -f docker-compose.initial.yml up
</code></pre>
<p>Make a few predictions:</p>
<pre><code class="lang-bash">curl -X POST http://localhost:5000/predict \
  -H <span class="hljs-string">"Content-Type: application/json"</span> \
  -d <span class="hljs-string">'{
    "series_id": "demo",
    "historical_data": [
      {"timestamp": "2024-01-01T12:00:00", "value": 10},
      {"timestamp": "2024-01-01T12:01:00", "value": 20},
      {"timestamp": "2024-01-01T12:02:00", "value": 30}
    ]
  }'</span>
</code></pre>
<p>You'll get a response showing Redis is working:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"data_points_used"</span>: <span class="hljs-number">3</span>,
  <span class="hljs-attr">"prediction"</span>: <span class="hljs-number">40</span>,
  <span class="hljs-attr">"redis_connected"</span>: <span class="hljs-literal">true</span>
}
</code></pre>
<p>Now restart the services:</p>
<pre><code class="lang-bash">docker compose -f docker-compose.initial.yml down
docker compose -f docker-compose.initial.yml up
</code></pre>
<p>Make another prediction. Check the <code>data_points_used</code> field. It reset. All your historical data is gone. This is exactly what we're trying to avoid.</p>
<h3 id="heading-how-to-fix-it-with-volumes">How to fix it with volumes</h3>
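<p>The correct <code>docker-compose.yml</code> adds persistence in two places: <code>--appendonly yes</code> makes Redis log every write to an append-only file on disk, and a named volume mounted at <code>/data</code> keeps that file outside the container:</p>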
<pre><code class="lang-yaml"><span class="hljs-attr">services:</span>
  <span class="hljs-attr">api:</span>
    <span class="hljs-attr">build:</span> <span class="hljs-string">./flask-api</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"5000:5000"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">REDIS_HOST=redis</span>

  <span class="hljs-attr">redis:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">redis:alpine</span>
    <span class="hljs-attr">command:</span> <span class="hljs-string">redis-server</span> <span class="hljs-string">--appendonly</span> <span class="hljs-literal">yes</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">redis_data:/data</span>

<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">redis_data:</span>
</code></pre>
<h4 id="heading-so-what-is-a-volume-and-how-does-it-work">So, what is a volume and how does it work?</h4>
<p>Think of a Docker volume as a dedicated external hard drive for your container. By default, when a container writes data, it does so to a temporary layer that gets destroyed when the container is removed. A volume provides a way to save that data permanently.</p>
<p>Here’s how it works:</p>
<ol>
<li><p>Docker creates and manages a special storage area on the host machine, completely separate from any container's filesystem. In our <code>docker-compose.yml</code>, the <code>volumes: redis_data:</code> section at the bottom tells Docker to create a named volume called <code>redis_data</code>.</p>
</li>
<li><p>When the Redis container starts, the <code>volumes: - redis_data:/data</code> line tells Docker to "plug in" this external hard drive. It connects the <code>redis_data</code> volume to the <code>/data</code> directory inside the container.</p>
</li>
<li><p>Now, whenever the Redis process inside the container writes data to its <code>/data</code> directory (which we've configured it to do), it's actually writing to the <code>redis_data</code> volume on the host machine.</p>
</li>
<li><p>When you run <code>docker compose down</code>, the Redis container is destroyed, but the <code>redis_data</code> volume is untouched (unless you add the <code>-v</code> flag, which removes volumes too). It's like unplugging the external hard drive: the data is still safe. The next time you run <code>docker compose up</code>, a brand new Redis container is created, the volume is re-attached, and Redis finds all its old data right where it left it.</p>
</li>
</ol>
<p>This mechanism is the key to giving our stateful service a memory that survives restarts.</p>
<p>Run the corrected version:</p>
<pre><code class="lang-bash">docker compose up --build
</code></pre>
<p>Send several predictions to build up state:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> {1..5}; <span class="hljs-keyword">do</span>
  curl -X POST http://localhost:5000/predict \
    -H <span class="hljs-string">"Content-Type: application/json"</span> \
    -d <span class="hljs-string">"{
      \"series_id\": \"demo\",
      \"historical_data\": [{\"timestamp\": \"2024-01-01T12:0<span class="hljs-variable">$i</span>:00\", \"value\": <span class="hljs-subst">$((i*10)</span>)}]
    }"</span>
<span class="hljs-keyword">done</span>
</code></pre>
<p>Now comes the test. Restart everything:</p>
<pre><code class="lang-bash">docker compose down
docker compose up
</code></pre>
<p>Make another prediction. Look at <code>data_points_used</code>. It includes all previous points. The model picks up exactly where it left off.</p>
<p>This works because the volume exists independently of the container lifecycle.</p>
<h3 id="heading-how-the-code-handles-state">How the code handles state</h3>
<p>The Flask API in <code>flask-api/app.py</code> stores each data point in Redis using sorted sets:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">store_data_point</span>(<span class="hljs-params">series_id, timestamp, value</span>):</span>
    key = <span class="hljs-string">f"ts:<span class="hljs-subst">{series_id}</span>"</span>
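    <span class="hljs-comment"># The score must be numeric, so timestamp is expected to be a Unix epoch value here</span>
    redis_client.zadd(key, {json.dumps({<span class="hljs-string">"ts"</span>: timestamp, <span class="hljs-string">"val"</span>: value}): timestamp})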
</code></pre>
<p>When making predictions, it retrieves recent history:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_recent_data</span>(<span class="hljs-params">series_id, limit=<span class="hljs-number">100</span></span>):</span>
    key = <span class="hljs-string">f"ts:<span class="hljs-subst">{series_id}</span>"</span>
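    <span class="hljs-comment"># Negative indexes count from the end: the last `limit` members, oldest first</span>
    data = redis_client.zrange(key, -limit, <span class="hljs-number">-1</span>)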
    <span class="hljs-keyword">return</span> [json.loads(d) <span class="hljs-keyword">for</span> d <span class="hljs-keyword">in</span> data]
</code></pre>
<p>Redis sorted sets keep members ordered by score, so using the timestamp as the score gives you time ordering without any sorting in your own code. The volume ensures this data survives restarts.</p>
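<p>One detail worth spelling out: sorted-set scores have to be numbers, while the curl examples send timestamps as ISO 8601 strings. How the demo's <code>app.py</code> converts them isn't shown here, so treat the helpers below (<code>to_score</code> and <code>to_member</code>) as assumptions. The sketch also mimics the ordering and slicing in plain Python so you can run it without Redis.</p>

```python
import json
from datetime import datetime, timezone


def to_score(iso_timestamp):
    """Convert an ISO 8601 string to Unix seconds, treating naive values as UTC."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def to_member(iso_timestamp, value):
    """Serialize one data point the same way the snippet above does."""
    return json.dumps({"ts": iso_timestamp, "val": value})


points = [("2024-01-01T12:02:00", 30), ("2024-01-01T12:00:00", 10), ("2024-01-01T12:01:00", 20)]
mapping = {to_member(ts, val): to_score(ts) for ts, val in points}

# A sorted set orders members by score; sorting by score mimics that here.
ordered = sorted(mapping, key=mapping.get)
latest_two = ordered[-2:]  # same slice that zrange(key, -2, -1) would return
print([json.loads(m)["val"] for m in latest_two])
```

<p>Even though the points were supplied out of order, the two most recent values come back oldest first. A dictionary shaped like <code>mapping</code> is what <code>zadd</code> expects as its second argument.</p>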
<h3 id="heading-test-the-health-endpoint">Test the health endpoint</h3>
<p>Check that everything is connected properly:</p>
<pre><code class="lang-bash">curl http://localhost:5000/health
</code></pre>
<p>You should see:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"model_loaded"</span>: <span class="hljs-literal">true</span>,
  <span class="hljs-attr">"redis_connected"</span>: <span class="hljs-literal">true</span>,
  <span class="hljs-attr">"status"</span>: <span class="hljs-string">"healthy"</span>
}
</code></pre>
<p>If <code>redis_connected</code> is false, check your Docker logs. Common issues are network configuration or Redis not starting properly.</p>
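<p>If you're wondering what sits behind an endpoint like this, the sketch below is one plausible shape. It is not the code from the demo repository: the <code>health_status</code> function, the <code>DownClient</code> stand-in, and the <code>"unhealthy"</code> status value are all assumptions. The only thing it relies on from a real Redis client is a <code>ping()</code> method.</p>

```python
def health_status(redis_client, model_loaded):
    """Build a health payload; any client exposing ping() works here."""
    try:
        redis_connected = bool(redis_client.ping())
    except Exception:
        # Connection refused, timeout, DNS failure: all count as "not connected".
        redis_connected = False
    healthy = redis_connected and model_loaded
    return {
        "model_loaded": model_loaded,
        "redis_connected": redis_connected,
        "status": "healthy" if healthy else "unhealthy",
    }


class DownClient:
    """Stand-in for a Redis client whose server is unreachable."""

    def ping(self):
        raise ConnectionError("redis unreachable")


print(health_status(DownClient(), model_loaded=True))
```

<p>Catching the exception matters. A health endpoint that crashes when Redis is down tells you much less than one that calmly reports <code>redis_connected: false</code>.</p>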
<h2 id="heading-what-about-scaling">What About Scaling?</h2>
<p>This setup works well for single-instance deployments. When traffic increases, you have a few options.</p>
<h3 id="heading-horizontal-scaling-with-redis-cluster">Horizontal scaling with Redis Cluster</h3>
<p>For high throughput, distribute your data across multiple Redis nodes. Redis Cluster handles sharding automatically.</p>
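<p>It helps to know how that sharding decides where a key lives. Redis Cluster maps every key to one of 16384 hash slots using a CRC16 checksum, and if the key contains a non-empty <code>{tag}</code>, only the tag is hashed. The sketch below reimplements that rule in plain Python for illustration. The key names with <code>:raw</code> and <code>:hourly</code> suffixes are hypothetical and don't appear in the demo.</p>

```python
import binascii


def key_slot(key):
    """Return the Redis Cluster hash slot (0..16383) for a key."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # A non-empty {tag} means only the tag is hashed.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # CRC16/XMODEM, which binascii.crc_hqx computes with an initial value of 0.
    return binascii.crc_hqx(data, 0) % 16384


print(key_slot("ts:demo"))
print(key_slot("ts:{demo}:raw") == key_slot("ts:{demo}:hourly"))
```

<p>Since each series in this tutorial is a single key, its whole history always lands on one node. If you later split a series across several keys, a shared hash tag keeps them in the same slot.</p>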
<h3 id="heading-high-availability-with-redis-sentinel">High availability with Redis Sentinel</h3>
<p>Add failover capability so your state store doesn't become a single point of failure. Sentinel monitors Redis instances and promotes replicas when the primary fails.</p>
<h3 id="heading-use-managed-redis-services">Use managed Redis services</h3>
<p>AWS ElastiCache, Azure Cache for Redis, or Google Cloud Memorystore handle the operational burden. You focus on your model while they handle Redis reliability.</p>
<p>The key insight: your API containers remain stateless. You scale the state store independently.</p>
<h2 id="heading-common-pitfalls-to-avoid">Common Pitfalls to Avoid</h2>
<p>I can't emphasize this enough: test your persistence before deploying to production.</p>
<h3 id="heading-dont-assume-volumes-work">Don't assume volumes work</h3>
<p>Actually restart your containers and verify state persists. I've seen deployments fail because someone forgot to mount the volume in production.</p>
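<p>A check like this is easy to script. The sketch below uses only the standard library and the <code>/predict</code> endpoint and <code>data_points_used</code> field from this tutorial. The helper names <code>post_prediction</code> and <code>state_survived</code> are my own, and the restart itself is left to you.</p>

```python
import json
import urllib.request


def post_prediction(base_url, series_id, timestamp, value):
    """POST one data point to /predict and return the decoded JSON response."""
    payload = {
        "series_id": series_id,
        "historical_data": [{"timestamp": timestamp, "value": value}],
    }
    request = urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


def state_survived(before, after):
    """True if the post-restart response used more points than the one before."""
    return after["data_points_used"] > before["data_points_used"]
```

<p>Call <code>post_prediction</code> once, restart the stack with <code>docker compose down</code> and <code>docker compose up</code>, call it again with a new timestamp, and pass both responses to <code>state_survived</code>. Use a fresh series for this: with more than 100 stored points, the count may plateau at the retrieval limit and give a false alarm.</p>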
<h3 id="heading-dont-ignore-redis-memory-limits">Don't ignore Redis memory limits</h3>
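<p>Redis keeps everything in memory, so monitor your memory usage and set <code>maxmemory</code> and a <code>maxmemory-policy</code> appropriate for your workload. Once the limit is reached, Redis either evicts keys or rejects writes, depending on that policy. The default, <code>noeviction</code>, rejects writes.</p>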
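<p>One way to keep memory in check is to cap how much history each series holds. The sketch below is an assumption about how you might do that, not something the demo repository is known to include. <code>trim_series</code> relies on the client's <code>zremrangebyrank</code> method, and <code>FakeSortedSets</code> is a small stand-in so the example runs without a server.</p>

```python
def trim_series(redis_client, series_id, keep=1000):
    """Drop the oldest members so that at most `keep` points remain."""
    key = f"ts:{series_id}"
    # Ranks 0 .. -(keep + 1) are everything except the newest `keep` members.
    return redis_client.zremrangebyrank(key, 0, -(keep + 1))


class FakeSortedSets:
    """Tiny stand-in that mimics ZREMRANGEBYRANK on score-ordered lists."""

    def __init__(self, data):
        self.data = data

    def zremrangebyrank(self, key, start, stop):
        members = self.data.get(key, [])
        size = len(members)
        # Translate negative ranks the way Redis does (offsets from the end).
        first = start if start >= 0 else max(start + size, 0)
        last = min(stop, size - 1) if stop >= 0 else stop + size
        removed = members[first:last + 1] if last >= first else []
        self.data[key] = [m for m in members if m not in removed]
        return len(removed)


fake = FakeSortedSets({"ts:demo": [10, 20, 30, 40, 50]})
removed = trim_series(fake, "demo", keep=2)
print(removed, fake.data["ts:demo"])
```

<p>Calling something like this after each write keeps every series bounded. Since the prediction code only reads the most recent 100 points by default, older members are mostly dead weight anyway.</p>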
<h3 id="heading-dont-skip-monitoring">Don't skip monitoring</h3>
<p>Add health checks. Monitor Redis connection status. Track prediction latency. You want to know when things break, not learn about it from angry users.</p>
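<p>Tracking prediction latency doesn't need much machinery to get started. Here's a minimal sketch using a decorator. The <code>track_latency</code> name, the <code>latencies_ms</code> list, and the placeholder <code>predict</code> function are all hypothetical, and in a real deployment you'd ship these numbers to your metrics system instead of a list.</p>

```python
import functools
import time

latencies_ms = []  # hypothetical in-process buffer; a real setup would export these


def track_latency(func):
    """Record how long each call takes, in milliseconds."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs on success and on error, so failures are timed too.
            latencies_ms.append((time.perf_counter() - started) * 1000)
    return wrapper


@track_latency
def predict(values):
    # Placeholder for the real model call.
    return sum(values) / len(values)


predict([10, 20, 30])
print(len(latencies_ms))
```

<p>Recording the time in a <code>finally</code> block means slow failures show up in your numbers as well, which is often where the interesting problems hide.</p>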
<h2 id="heading-conclusion">Conclusion</h2>
<p>Time-series models need memory. Docker containers lose memory by default. The solution is simple: separate state from compute.</p>
<p>Use Redis as an external state store. Use Docker volumes to persist that state. Your model stays smart, your containers stay replaceable, and your deployments become reliable.</p>
<p>The full working code is available at <a target="_blank" href="https://github.com/ag-chirag/docker-redis-time-series">github.com/ag-chirag/docker-redis-time-series</a>. Clone it, run it, break it, learn from it.</p>
<p>And remember: the simplest solution that works is usually the right one. You don't always need Kubernetes and StatefulSets. Sometimes Docker Compose and a volume are enough.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
