<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Ramesh Sinha - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Ramesh Sinha - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Fri, 08 May 2026 14:34:24 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/justramesh2000/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Why You Should Stop Managing Kafka Manually – A Guide to Kafka UI and Cruise Control ]]>
                </title>
                <description>
                    <![CDATA[ Over 80% of Fortune 100 companies use Apache Kafka. That's not surprising, as Kafka has revolutionized how we build real-time data pipelines and streaming applications. If you're working in software engineering today, chances are you've encountered K... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/stop-managing-kafka-manually-a-guide-to-kafka-ui-and-cruise-control/</link>
                <guid isPermaLink="false">6967bd3e7b94dad713ce9fae</guid>
                
                    <category>
                        <![CDATA[ kafka ]]>
                    </category>
                
                    <category>
                        <![CDATA[ distributed system ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Operational Efficiency ]]>
                    </category>
                
                    <category>
                        <![CDATA[ automation ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ramesh Sinha ]]>
                </dc:creator>
                <pubDate>Wed, 14 Jan 2026 15:58:54 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768353949324/18ce4a43-fb21-4e9b-9285-7c4db7b7ae2e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Over 80% of Fortune 100 companies use Apache Kafka. That's not surprising, as Kafka has revolutionized how we build real-time data pipelines and streaming applications. If you're working in software engineering today, chances are you've encountered Kafka in some capacity.</p>
<p>But here's the thing: while Kafka itself is incredibly powerful, managing Kafka clusters is notoriously challenging. This isn't a flaw in Kafka – it's just the reality of distributed systems. The bigger your cluster grows, the more complex operations become.</p>
<p>The most painful aspect? Manual cluster management. It's tedious, error-prone, and doesn't scale. What starts as simple topic creation with a few brokers turns into hours of carefully orchestrating partition reassignments across dozens of machines. One typo in a JSON file at 3 AM can take down production.</p>
<p>Sound familiar? You're not alone.</p>
<p>In this guide, you'll learn how two tools can transform Kafka operations from a manual slog into a manageable process:</p>
<ul>
<li><p><strong>Kafka UI</strong> – A modern web interface that replaces cryptic CLI commands with visual cluster management</p>
</li>
<li><p><strong>Cruise Control</strong> – LinkedIn's automation engine that handles cluster balancing and self-healing</p>
</li>
</ul>
<p>We'll start by experiencing the pain of manual management firsthand, then see how these tools solve real-world operational challenges. You'll set up everything locally with <code>Docker</code> and by the end you’ll know exactly how to manage Kafka clusters without the headache.</p>
<h2 id="heading-what-well-cover"><strong>What We’ll Cover:</strong></h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-setting-up-our-unmanaged-cluster">Setting Up Our Unmanaged Cluster</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-starting-the-cluster-amp-verification">Starting the Cluster &amp; Verification</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-topics-the-manual-way">Creating Topics: The Manual Way</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-kafka-ui">Kafka UI</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-setting-up-kafka-ui">Setting up Kafka UI</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-drawbacks-of-kafka-ui">Drawbacks of Kafka UI</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-cruise-control">Cruise Control</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-how-cruise-control-works">How Cruise Control Works</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-setting-up-cruise-control">Setting Up Cruise Control</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cruise-control-configuration-file">Cruise Control Configuration File</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-the-imbalance">Creating the Imbalance</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-attempting-manual-rebalancing">Attempting Manual Rebalancing</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-rebalancing-using-cruise-control">Rebalancing Using Cruise Control</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-the-problem-manual-kafka-management">The Problem: Manual Kafka Management</h2>
<p>Let’s dive right in. First, I'm going to show you what managing a Kafka cluster looks like without any tools – just you, the command line, and dozens of manual operations.</p>
<p>You’ll spin up a small cluster locally, create some topics, and simulate the kind of growth you'd see in a real production environment. By the end of this section, you'll understand exactly why teams spend thousands of engineering hours just keeping Kafka clusters running smoothly.</p>
<p>Fair warning: this is going to feel tedious – <em>that’s the point</em>.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before we dive in, make sure you have:</p>
<ol>
<li><p><strong>Docker Desktop installed and running</strong></p>
<ul>
<li><p>Mac and Windows users: <a target="_blank" href="https://www.docker.com/products/docker-desktop/">https://www.docker.com/products/docker-desktop/</a></p>
</li>
<li><p>Linux users can install Docker Engine via their package manager </p>
</li>
</ul>
</li>
<li><p><strong>Basic Kafka knowledge.</strong> You should understand:</p>
<ul>
<li><p><strong>Topics</strong>: Categories for organizing messages</p>
</li>
<li><p><strong>Partitions</strong>: How topics are divided for parallelism</p>
</li>
<li><p><strong>Brokers</strong>: The Kafka servers that store data</p>
</li>
<li><p><strong>Producers and Consumers</strong>: Applications that write to and read from Kafka</p>
</li>
<li><p><strong>KRaft</strong>: Kafka’s Raft-based consensus protocol that replaces ZooKeeper for cluster metadata and controller election</p>
</li>
</ul>
</li>
</ol>
<p>    If these terms are new to you, <a target="_blank" href="https://www.freecodecamp.org/news/apache-kafka-handbook/">here’s a great handbook about them</a>. I’d also recommend reading <a target="_blank" href="https://kafka.apache.org/intro">Kafka's Introduction</a> first.</p>
<ol start="3">
<li><p><strong>System Requirements</strong></p>
<ul>
<li><p>At least 8 GB of RAM</p>
</li>
<li><p>10 GB of free disk space</p>
</li>
</ul>
</li>
<li><p>Some basic understanding of <strong>containers</strong> is good to have:</p>
<ul>
<li><p>Docker</p>
</li>
<li><p>Images</p>
</li>
<li><p>Volumes</p>
</li>
<li><p>Networks</p>
</li>
</ul>
</li>
</ol>
<h2 id="heading-setting-up-our-unmanaged-cluster">Setting Up Our Unmanaged Cluster</h2>
<p>Let’s go ahead and build the cluster so that we can see the problems firsthand. We’ll use Docker to spin up three Kafka brokers running in <code>KRaft</code> mode (the modern, ZooKeeper-free approach).</p>
<p>Start by creating a file called <code>docker-compose-basic.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3.8'</span>

<span class="hljs-attr">services:</span>
  <span class="hljs-attr">kafka-1:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-1</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9092:9092"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-2:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-2</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9093:9093"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-3:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-3</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9094:9094"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">3</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3-data:/var/lib/kafka/data</span>

<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">kafka-1-data:</span>
  <span class="hljs-attr">kafka-2-data:</span>
  <span class="hljs-attr">kafka-3-data:</span>
</code></pre>
<p>In the above configuration file, we’re creating three Kafka brokers (<code>kafka-1, kafka-2, kafka-3</code>). Each one uses the <code>confluentinc/cp-kafka:7.6.0</code> image and has its port opened (<code>9092, 9093, 9094</code>).</p>
<p>The environment variables are:</p>
<ul>
<li><p><strong>KAFKA_NODE_ID</strong> – A unique identifier for each broker (1,2,3). No two brokers can have the same ID.</p>
</li>
<li><p><strong>KAFKA_PROCESS_ROLES: broker, controller</strong> – This tells Kafka to run in <code>KRaft</code> mode (without ZooKeeper). Each broker acts as both a data broker and a controller for cluster coordination.</p>
</li>
<li><p><strong>KAFKA_CONTROLLER_QUORUM_VOTERS</strong> – The membership list that tells each broker how to find the others. All three brokers must have the identical list: <code>1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</code>. This is how they discover each other and elect a leader.</p>
</li>
<li><p><strong>CLUSTER_ID</strong> – A unique identifier for the entire cluster. All brokers must use the <strong>exact same value</strong> or they won't recognize each other as part of the same cluster. The actual value (<code>MkU3OEVBNTcwNTJENDM2Qk</code>) doesn't matter as long as it is consistent across brokers. One important thing to note is that CLUSTER_ID must be a valid <code>base64-encoded UUID</code> per Kafka’s requirement.</p>
</li>
<li><p><strong>KAFKA_LISTENERS</strong> - Defines which network interfaces and ports Kafka listens on. We have three listeners:</p>
<ul>
<li><p><strong>PLAINTEXT://0.0.0.0:29092</strong>: For inter-broker communication (brokers talking to each other)</p>
</li>
<li><p><strong>CONTROLLER://0.0.0.0:29093</strong>: For controller communication in <code>KRaft</code> mode</p>
</li>
<li><p><strong>PLAINTEXT_HOST://0.0.0.0:9092</strong> (varies per broker): For external connections from your machine</p>
</li>
</ul>
</li>
<li><p><strong>KAFKA_ADVERTISED_LISTENERS</strong> – Tells clients (producers/consumers) how to connect to this broker. This is what gets returned when a client asks "<code>where should I connect?</code>" The <code>PLAINTEXT_HOST://localhost:9092</code> part is what allows you to connect from your host machine.</p>
</li>
</ul>
<p>Note: <strong>Listener configuration is critical.</strong> Incorrect settings will prevent clients from connecting even when brokers are running. These settings work for local Docker environments where Docker's internal DNS resolves the broker names (<code>kafka-1, kafka-2, kafka-3</code>). For production, replace the hostnames with actual IP addresses or FQDNs (fully qualified domain names).</p>
<ul>
<li><p><strong>KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2</strong> – How many copies of consumer offset data to keep. We use 2 instead of 3 because with only three brokers, this prevents issues during rolling restarts. In production with more brokers, you'd use 3 or more.</p>
</li>
<li><p><strong>The Volumes</strong> – <code>kafka-x-data:/var/lib/kafka/data</code> creates persistent storage for each broker’s data. Without volumes, you’d lose your topics and messages whenever the containers are removed. Each broker gets its own volume so they don’t accidentally share data.</p>
</li>
</ul>
<p>Note: To restart from scratch, you need to delete the volumes using the following command. The <code>-v</code> flag removes the volumes; without it, old data persists even after <code>docker compose down</code>.</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml down -v
</code></pre>
<p>If you're using the legacy <code>docker-compose</code> tool (V1), replace <code>docker compose</code> with <code>docker-compose</code> in all commands throughout this tutorial.</p>
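<p>If you’re not sure which Compose variant your machine has, a small check like the one below can pick for you. This is just a convenience sketch – the <code>COMPOSE</code> variable name is arbitrary:</p>

```shell
# Detect whether the Compose v2 plugin ("docker compose") is available,
# falling back to the legacy v1 standalone binary ("docker-compose").
if docker compose version >/dev/null 2>&1; then
  COMPOSE="docker compose"
else
  COMPOSE="docker-compose"
fi
echo "Using: $COMPOSE"
# Then run, for example: $COMPOSE -f docker-compose-basic.yml up -d
```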
<h3 id="heading-ports"><strong>Ports</strong></h3>
<p>Three ports are used for any given broker. Their purposes are:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Port</td><td>Purpose</td></tr>
</thead>
<tbody>
<tr>
<td><strong>9092</strong></td><td>External connections (producers and consumers from your host machine)</td></tr>
<tr>
<td><strong>29092</strong></td><td>Internal broker-to-broker communication</td></tr>
<tr>
<td><strong>29093</strong></td><td>Cluster coordination via KRaft</td></tr>
</tbody>
</table>
</div><h2 id="heading-starting-the-cluster-amp-verification">Starting the Cluster &amp; Verification</h2>
<p>Now that we have the basic docker configuration for Kafka, let’s run it and verify the results.</p>
<p>Run the following command in the same directory where you saved <code>docker-compose-basic.yml</code>:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml up -d
</code></pre>
<p>The <code>-d</code> flag runs the containers in detached mode (in the background), so you get your terminal back.</p>
<p>You should see output like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767554351786/4500c108-e9b6-403f-98cf-15198b3a9831.png" alt="Docker compose command result" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Using the following command, check if the containers running Kafka brokers are up:</p>
<pre><code class="lang-bash">docker ps
</code></pre>
<p>You should see three Kafka containers (kafka-1, kafka-2, kafka-3) with status “<code>Up</code>” – something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767554452185/4e605547-8153-4a85-9903-e3b1132889a8.png" alt="Running Kafka Containers" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Run the following command to verify that all three brokers are registered in the cluster:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-broker-api-versions --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092
</code></pre>
<p>You should see API version information for all three brokers (IDs 1, 2, 3) without any connection errors.</p>
<p>Note that we’re using <code>kafka-1:29092,kafka-2:29092,kafka-3:29092</code> here (the internal Docker addresses) instead of localhost:9092 because this command runs inside the <code>kafka-1</code> container by virtue of <code>docker exec -it kafka-1</code>, where <code>localhost</code> only refers to that specific container.</p>
<p>If any of the above verification steps return errors or don’t show the expected results from the screenshots, run the following command to view the logs and debug:</p>
<pre><code class="lang-bash">docker logs kafka-1
</code></pre>
<h2 id="heading-creating-topics-the-manual-way">Creating Topics: The Manual Way</h2>
<p>Now that we have a cluster running, let’s simulate a real production use case where different teams need Kafka topics for their applications – payments, logs, events, metrics, notifications, you name it.</p>
<p>Let’s start by creating a topic for logs. The command to do this is:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 12 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy
</code></pre>
<p>You’ll need to specify some command parameters, which are:</p>
<ol>
<li><p>The exact broker address <code>kafka-1:29092,kafka-2:29092,kafka-3:29092</code> (or the IP address of your servers in production)</p>
</li>
<li><p>The number of partitions – I have used <code>12</code> in the above command. Creating too few partitions creates bottlenecks, while creating too many adds overhead.</p>
</li>
<li><p>Retention policy – I have used 7 days (that is, 604800000 milliseconds)</p>
</li>
<li><p>Compression type</p>
</li>
</ol>
<p>Manually managing these parameters and running the command a handful of times is okay – but what if you have to run this for every team in your enterprise? Each team will have different requirements. The grind of copy, paste, adjust becomes painful if you have 100+ topics and multiple clusters (dev, staging, prod).</p>
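<p>At that scale you’d normally script the grind away. Here’s a minimal sketch – the team names and partition counts below are hypothetical – that generates one creation command per topic spec, shown in dry-run form so it only prints the commands instead of executing them:</p>

```shell
# Hypothetical per-team topic specs in "name:partitions" form.
topics="team-payments:12 team-logs:20 team-metrics:6"

for spec in $topics; do
  name=${spec%%:*}        # everything before the ":"
  partitions=${spec##*:}  # everything after the ":"
  # Dry run: print the command instead of running it. Drop the echo
  # (and quoting) to actually execute against your cluster.
  echo "docker exec -it kafka-1 kafka-topics --create" \
       "--topic $name --partitions $partitions --replication-factor 2" \
       "--bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092"
done
```

<p>Even a tiny loop like this removes the copy-paste-adjust cycle; in practice you’d read the specs from a file kept in version control.</p>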
<p>Feel the pain yet? Bear with it for a minute – we’ll address this issue shortly. For now, if you run the above command you should see the “<strong>Created topic</strong>” message:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767554860337/81937532-b88b-4d9a-bfca-f2f9b0433e4d.png" alt="Create Kafka Topic result" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Note: We’re using <code>kafka-1:29092,kafka-2:29092,kafka-3:29092</code> to reach the Kafka brokers because the command runs inside the kafka-1 container via <code>docker exec</code>.</p>
<p>Let's keep going. We’ll create more topics using the same command, changing the topic name and partition count each time. Copy, paste, update, and run it a few times. I ran it three more times as shown below (feel free to use different values – the exact numbers aren’t important for this tutorial):</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-views \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 20 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy


docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-analytics \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 3 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy


docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-articles \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 5 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy
</code></pre>
<p>After creating the topics, let’s see all the ones you have now by running the following command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --list \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092
</code></pre>
<p>You should see a list of topics like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767555021681/451f69bc-5f91-432a-9c74-ca56b79aa179.png" alt="Kafka list of Topics" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Notice that you just get the list of topics but no meaningful information, like:</p>
<ul>
<li><p>How many partitions does each have?</p>
</li>
<li><p>Which brokers are hosting them?</p>
</li>
<li><p>Are they evenly distributed?</p>
</li>
<li><p>What are their configurations?</p>
</li>
</ul>
<h3 id="heading-partition-information">Partition Information</h3>
<p>Let’s try to get information about our partitions. For this tutorial, I have created 4 topics and a total of 40 partitions spread across three brokers. I want to see which broker has the most partitions.</p>
<p>In a well-managed cluster, you’d want them roughly evenly distributed. But how can we check that? </p>
<p>Maybe the describe command shown below can help. Let’s run it:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --describe \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092
</code></pre>
<p>It will return a wall of text, something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767555186583/e50948e9-48f8-4431-8008-38f24da98373.png" alt="Kafka describe Topics" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>So, we have partition information but:</p>
<ul>
<li><p>No summary or aggregation</p>
</li>
<li><p>No visual representation </p>
</li>
<li><p>It’s difficult to scan and compare</p>
</li>
<li><p>It gets exponentially worse with more topics</p>
</li>
</ul>
<h3 id="heading-counting-leaders">Counting Leaders</h3>
<p>The Leader field in the above screenshot tells you which broker is the leader for each partition. Leaders handle all read and write requests, so you want them evenly distributed or else some brokers will become overloaded. </p>
<p>Let’s try to count how many partitions each broker leads. To do that, run the following command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --describe \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 | grep <span class="hljs-string">"Leader: 1"</span> | wc -l
</code></pre>
<p>It will show something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767555304446/6a58be3a-e21f-4209-8165-1544c5bc6c20.png" alt="Kafka Leader Count result" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Given the topics I created, <code>14</code> is the number of partitions for which broker 1 (<code>Leader: 1</code>) is the leader. You might see a different number depending on how many topics and partitions you created.</p>
<p>You can repeat this command to count the partitions led by the other brokers. To do so, just change <code>Leader: 1</code> to <code>Leader: 2</code> or <code>Leader: 3</code>. I get <code>14, 12, 14</code>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767555418151/8c66afe7-b5db-4106-800f-f6e7e8926ba2.png" alt="Kafka Leader Count for all Topics" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>That’s somewhat balanced, but you had to run the command multiple times, parse using <code>grep</code> and <code>wc</code>, and this is just 3 brokers. What if you had 100+? Also, what if you have to get the replicas’ information?</p>
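<p>The repeated grep runs can at least be folded into one loop over the saved describe output. A sketch – the sample lines below are made up for illustration; normally you’d capture the real <code>kafka-topics --describe</code> output instead:</p>

```shell
# Made-up sample in the line format `kafka-topics --describe` prints.
describe_output='Topic: t1 Partition: 0 Leader: 1 Replicas: 1,2 Isr: 1,2
Topic: t1 Partition: 1 Leader: 2 Replicas: 2,3 Isr: 2,3
Topic: t1 Partition: 2 Leader: 1 Replicas: 1,3 Isr: 1,3'

for broker in 1 2 3; do
  # grep -c counts matching lines; "|| true" keeps the loop going
  # when a broker leads zero partitions (grep exits non-zero then).
  count=$(printf '%s\n' "$describe_output" | grep -c "Leader: $broker" || true)
  echo "broker $broker leads $count partition(s)"
done
```

<p>Still, this is string-parsing duct tape – it tells you nothing about replicas, ISR health, or load, which is exactly the gap the tools below fill.</p>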
<p>I could go on and on with the data we need and the commands to get that information. But the point I’m trying to make here is that sooner or later this becomes impossible to manage. Your team is going to need an army, and to be honest there isn’t much value in doing all of this manually. </p>
<p>So far, you’ve seen only simple operational commands, but the problems don’t stop there. In a real production environment there are more complex and challenging operations like:</p>
<ul>
<li><p><strong>Consumer Lag Monitoring</strong>: When consumers fall behind, you need to track which partitions are lagging, which consumer instances own them, and where the lag is growing or shrinking. With CLI tools, you get raw numbers but no trends or context.</p>
</li>
<li><p><strong>Broker Failures</strong>: When a broker fails, you need to identify under-replicated partitions, trigger leader elections, and create partition reassignment <code>JSON</code> files manually. One mistake in that JSON can cause data loss.</p>
</li>
<li><p><strong>Cluster rebalancing</strong>: You’ll see that when you add new brokers, they sit empty until you manually redistribute partitions. Similarly for removing brokers, you need to move all their partitions first. These operations require calculating optimal placement and creating complex reassignment plans.</p>
</li>
</ul>
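<p>To make the consumer-lag point concrete: lag for a partition is simply the log-end offset minus the consumer group’s committed offset. A tiny sketch with made-up numbers of the kind you might read off a <code>kafka-consumer-groups --describe</code> listing:</p>

```shell
# Hypothetical offsets for one partition (illustration only):
log_end_offset=15000   # latest offset written by producers
current_offset=14250   # last offset committed by the consumer group
lag=$((log_end_offset - current_offset))
echo "consumer lag: $lag messages"
```

<p>A single number like this is easy to compute, but with the CLI alone you get no history, so you can’t tell whether that lag is shrinking or spiralling.</p>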
<p>If you’re still with me, you’re probably thinking that there has to be a better way. Fortunately, there is – actually, there are a couple of complementary approaches, and we’re going to talk about them next.</p>
<h2 id="heading-kafka-ui">Kafka UI</h2>
<p>Kafka UI is a modern, open-source web interface for managing Kafka clusters. It replaces the <code>command line chaos</code> we just experienced with a clean, visual dashboard.</p>
<p>Kafka UI provides the following features:</p>
<ul>
<li><p><code>Visual cluster overview</code>: see all brokers, topics, and partitions at a glance</p>
</li>
<li><p><code>Topic management</code>: create, configure, and delete topics with a GUI</p>
</li>
<li><p><code>Consumer group monitoring</code>: track lags, offsets, and consumer health in real-time</p>
</li>
<li><p><code>Message browsing</code>: view actual messages in topics without command line tools </p>
</li>
</ul>
<p>Without further ado, let’s set up Kafka UI.</p>
<h3 id="heading-setting-up-kafka-ui">Setting Up Kafka UI</h3>
<p>To set up Kafka UI, let’s modify our existing <code>docker-compose-basic.yml</code> like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3.8'</span>

<span class="hljs-attr">services:</span>
  <span class="hljs-attr">kafka-1:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-1</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9092:9092"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-2:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-2</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9093:9093"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-3:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-3</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9094:9094"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">3</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3-data:/var/lib/kafka/data</span>
<span class="hljs-comment"># Adding kafka-UI service start</span>
  <span class="hljs-attr">kafka-ui:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">provectuslabs/kafka-ui:latest</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-ui</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"8080:8080"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">DYNAMIC_CONFIG_ENABLED:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CLUSTERS_0_NAME:</span> <span class="hljs-string">freecodecamp-cluster</span>
      <span class="hljs-attr">KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS:</span> <span class="hljs-string">kafka-1:29092,kafka-2:29092,kafka-3:29092</span>
    <span class="hljs-attr">depends_on:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3</span>
<span class="hljs-comment"># Adding kafka-UI service end</span>
<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">kafka-1-data:</span>
  <span class="hljs-attr">kafka-2-data:</span>
  <span class="hljs-attr">kafka-3-data:</span>
</code></pre>
<p>The YAML file is pretty much the same as before, except that we’ve added a new service called <code>kafka-ui</code> (for clarity, I’ve placed the changes between start and end comments).</p>
<p>The key configurations are:</p>
<ul>
<li><p><strong>Port 8080</strong> – You can access the UI at <a target="_blank" href="http://localhost:8080">http://localhost:8080</a> from your machine.</p>
</li>
<li><p><strong>KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS</strong> – This environment variable tells Kafka UI where to connect to your cluster (using internal Docker addresses).</p>
</li>
<li><p><strong>KAFKA_CLUSTERS_0_NAME</strong> – A friendly name for your cluster in the UI.</p>
</li>
</ul>
<p>Let’s first clean up the old cluster while keeping the topic data intact. Go ahead and run the following command to do so:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml down
</code></pre>
<p>Note that we’re not using <code>-v</code> here, so volumes (topic data) will remain intact.</p>
<p>Wait a couple of seconds, then run the following command to bring up our cluster with Kafka UI:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml up -d
</code></pre>
<p>Now open a browser and visit <a target="_blank" href="http://localhost:8080/">http://localhost:8080/</a>. You’ll see the UI like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767556151144/f0dd2a51-79c5-4906-a32d-d7732f9fd242.png" alt="Kafka UI" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You can click around and see all information about the cluster we have created, like:</p>
<ul>
<li><p>Your 3 brokers </p>
</li>
<li><p>The topics you created earlier</p>
</li>
<li><p>Partition counts</p>
</li>
</ul>
<p>For comparison with manual commands, let's look at the Brokers tab. You can see the partition leader count for each broker at a glance – remember that we had to run multiple commands to get this information earlier. Beyond this, the UI provides many other useful metrics that would require separate command-line queries.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767556236480/c99ba704-47b9-42da-b135-e1f9503ee1ab.png" alt="Kafka UI Brokers" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Remember the CLI commands we had to run to create topics? If you go to the <code>Topics</code> tab, you’ll notice that topic management (creation, deletion, data cleanup, and so on) is just a few button clicks away.</p>
<p>Similarly, managing Consumers only requires a few button clicks.</p>
<p>After exploring the Kafka UI, you'll see how much easier it is to monitor your cluster compared to running individual CLI commands.</p>
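<p>Under the hood, the lag figures the UI shows are simple arithmetic: per-partition lag is the log-end offset minus the consumer group’s committed offset. A minimal sketch (the topic name and offset values are made up for illustration):</p>

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = log-end offset minus committed offset.

    Partitions with no committed offset are treated as fully lagging.
    """
    return {
        tp: end - committed_offsets.get(tp, 0)
        for tp, end in end_offsets.items()
    }

# Hypothetical offsets for topic "orders", partitions 0-2
end = {("orders", 0): 1500, ("orders", 1): 980, ("orders", 2): 2100}
committed = {("orders", 0): 1500, ("orders", 1): 700}
print(consumer_lag(end, committed))
# ("orders", 0) is caught up; ("orders", 1) lags by 280; ("orders", 2) by 2100
```

Kafka UI does this continuously for every group and partition, which is why trends appear at a glance instead of requiring repeated CLI queries.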
<h3 id="heading-drawbacks-of-kafka-ui">Drawbacks of Kafka UI</h3>
<p>That said, Kafka UI does have some limitations:</p>
<ul>
<li><p><strong>Automatic rebalancing</strong>: If one or a few brokers hold more partitions than the others, you must manually reassign them.</p>
</li>
<li><p><strong>Self-healing</strong>: If a broker fails, you have to manually create reassignment plans.</p>
</li>
<li><p><strong>Performance optimization</strong>: The UI can’t recommend intelligent partition placement.</p>
</li>
<li><p><strong>Alerts</strong>: The UI doesn’t warn you before problems happen.</p>
</li>
</ul>
<p>For small clusters (3–10 brokers), Kafka UI plus some command execution might be enough. You’ll be able to see problems clearly and fix them when needed.</p>
<p>For large clusters, manual operations are still not scalable, so we need some kind of complementary tool… and that tool is <strong>Cruise Control</strong>.</p>
<h2 id="heading-cruise-control">Cruise Control</h2>
<p>Cruise Control is an automation engine for Kafka clusters. While Kafka UI gives you visibility and manual control, Cruise Control provides intelligent automation and self-healing. You can think of Kafka UI as a dashboard with manual controls and Cruise Control as an autopilot. In other words, they complement each other.</p>
<p>Let’s try to create some imbalance in our cluster and fix it manually. This will help you understand why you need Cruise Control.</p>
<p>To keep things simple, let’s start from scratch. We will first delete all the Docker resources we have created so far by running the following command:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml down -v
</code></pre>
<p>Running <code>docker compose down -v</code> will delete all the topics and messages we’ve created so far, but don’t worry – we’ll create them again.</p>
<h3 id="heading-how-cruise-control-works">How Cruise Control Works</h3>
<p>You can think of Cruise Control as a metric-monitoring and action-taking tool. Kafka brokers collect internal metrics (CPU, disk, network traffic, partition sizes), and a metric reporter running inside each broker sends these metrics to a Kafka topic.</p>
<p>Cruise Control then reads from that topic and analyzes the data. Based on that analysis, it proposes partition movements. We’ll see this in action shortly.</p>
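<p>To build intuition for what “proposes partition movements” means, here’s a deliberately naive sketch: move partitions one at a time from the most-loaded broker to the least-loaded until the counts are balanced. This is <em>not</em> Cruise Control’s actual algorithm (which weighs CPU, disk, and network goals, as we’ll see in its configuration), and the broker IDs and partition names are hypothetical.</p>

```python
def propose_moves(broker_partitions):
    """Greedy rebalance by partition count only (illustrative, not CC's logic)."""
    load = {b: list(ps) for b, ps in broker_partitions.items()}
    moves = []
    while True:
        most = max(load, key=lambda b: len(load[b]))
        least = min(load, key=lambda b: len(load[b]))
        # Stop once no broker holds 2+ more partitions than another
        if len(load[most]) - len(load[least]) <= 1:
            return moves
        partition = load[most].pop()
        load[least].append(partition)
        moves.append((partition, most, least))

# Hypothetical: broker 3 was just added and sits empty
moves = propose_moves({
    1: ["orders-0", "orders-1", "orders-2"],
    2: ["orders-3", "orders-4"],
    3: [],
})
print(moves)
```

The real tool solves the same shape of problem, but against a multi-dimensional resource model rather than raw partition counts.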
<h3 id="heading-setting-up-cruise-control">Setting Up Cruise Control</h3>
<p>As of this writing, I couldn’t find compatible Kafka and Cruise Control images that support <code>KRaft</code> (Kafka Raft), so I created public Kafka and Cruise Control images for this tutorial. I don’t recommend using these images in production. For production usage, you should either wait for the community to provide an image or create one of your own.</p>
<p>Change the <code>docker-compose-basic.yml</code> file to look like the following:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3.8'</span>

<span class="hljs-attr">services:</span>
  <span class="hljs-attr">kafka-1:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">justramesh2000/kafka-apache-cc:3.8.1</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-1</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9092:9092"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
      <span class="hljs-comment"># Cruise Control Metrics Reporter</span>
      <span class="hljs-attr">KAFKA_METRIC_REPORTERS:</span> <span class="hljs-string">'com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS:</span> <span class="hljs-string">'kafka-1:29092,kafka-2:29092,kafka-3:29092'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS:</span> <span class="hljs-string">'1'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-string">'2'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE:</span> <span class="hljs-string">'false'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS:</span> <span class="hljs-string">'60000'</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-2:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">justramesh2000/kafka-apache-cc:3.8.1</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-2</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9093:9093"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
      <span class="hljs-attr">KAFKA_METRIC_REPORTERS:</span> <span class="hljs-string">com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS:</span> <span class="hljs-string">kafka-1:29092,kafka-2:29092,kafka-3:29092</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE:</span> <span class="hljs-string">'false'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC:</span> <span class="hljs-string">__CruiseControlMetrics</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS:</span> <span class="hljs-string">'1'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-string">'2'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS:</span> <span class="hljs-string">'60000'</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-3:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">justramesh2000/kafka-apache-cc:3.8.1</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-3</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9094:9094"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">3</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
      <span class="hljs-attr">KAFKA_METRIC_REPORTERS:</span> <span class="hljs-string">com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS:</span> <span class="hljs-string">kafka-1:29092,kafka-2:29092,kafka-3:29092</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE:</span> <span class="hljs-string">'false'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC:</span> <span class="hljs-string">__CruiseControlMetrics</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS:</span> <span class="hljs-string">'1'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-string">'2'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS:</span> <span class="hljs-string">'60000'</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3-data:/var/lib/kafka/data</span>
  <span class="hljs-comment"># Adding kafka-UI service start</span>
  <span class="hljs-attr">kafka-ui:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">provectuslabs/kafka-ui:latest</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-ui</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"8080:8080"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">DYNAMIC_CONFIG_ENABLED:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CLUSTERS_0_NAME:</span> <span class="hljs-string">freecodecamp-cluster</span>
      <span class="hljs-attr">KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS:</span> <span class="hljs-string">kafka-1:29092,kafka-2:29092,kafka-3:29092</span>
    <span class="hljs-attr">depends_on:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./config:/opt/cruise-control/config</span>  
  <span class="hljs-comment"># Adding kafka-UI service end</span>
  <span class="hljs-comment"># Adding cruise-control start</span>
  <span class="hljs-attr">cruise-control:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">justramesh2000/cruise-control-kraft:2.5.142</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">cruise-control</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9090:9090"</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./config/cruisecontrol.properties:/opt/cruise-control/config/cruisecontrol.properties</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./config/capacityJBOD.json:/opt/cruise-control/config/capacityJBOD.json:ro</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./config/log4j.properties:/opt/cruise-control/config/log4j.properties:ro</span>
    <span class="hljs-attr">depends_on:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3</span>
   <span class="hljs-comment"># Adding cruise-control end    </span>
<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">kafka-1-data:</span>
  <span class="hljs-attr">kafka-2-data:</span>
  <span class="hljs-attr">kafka-3-data:</span>
</code></pre>
<p>You should have made the following changes to the file:</p>
<ul>
<li><p>Changed the Kafka image from <code>confluentinc/cp-kafka:7.6.0</code> to <code>justramesh2000/kafka-apache-cc:3.8.1</code>. The new image contains the Cruise Control metrics reporter, which exports metrics from the Kafka brokers for Cruise Control to consume.</p>
</li>
<li><p>Added the following environment variables:</p>
<ul>
<li><p><strong>KAFKA_METRIC_REPORTERS</strong> – This variable tells Kafka to load a plugin called the <code>Cruise Control Metrics Reporter</code>. It runs inside each Kafka broker process and hooks into Kafka’s internal metrics system, collecting the data Cruise Control needs.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS</strong> – This tells the <code>Cruise Control Metrics Reporter</code> where to send metrics to, meaning which Kafka brokers and which port.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE</strong> – This disables Kubernetes-specific behaviors (such as using the Pod name and ID instead of the host). We’re using Docker, so we don’t need them.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_TOPIC</strong> – Specifies the name of the topic where metrics will be published. The default is <code>__CruiseControlMetrics</code>, but you can customize it with this variable if you want to.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE</strong> – Automatically creates the <code>__CruiseControlMetrics</code> topic if it doesn’t exist. Without this setting, the reporter will fail until you manually create the topic.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS</strong> – Defines the number of partitions for the topic <code>__CruiseControlMetrics</code>.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR</strong> – Tells Kafka how many copies of metrics data to keep. In our case, we’re keeping 2 copies of the data.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS</strong> – Tells Kafka how often to send metrics. We’re sending every minute.</p>
</li>
</ul>
</li>
<li><p>Added the Cruise Control service using the image <code>justramesh2000/cruise-control-kraft:2.5.142</code>. For clarity, I’ve kept this change between the <code>start</code> and <code>end</code> comments.</p>
</li>
<li><p>Under the <code>cruise-control</code> service, we’ve mounted three Cruise Control configuration files. We’ll talk about those files next.</p>
</li>
</ul>
<h3 id="heading-cruise-control-configuration-file">Cruise Control Configuration File</h3>
<p>To run Cruise Control, we need to provide several configuration files. Among the key pieces of information are:</p>
<ul>
<li><p>Where the Kafka cluster is located</p>
</li>
<li><p>The capacity of each broker</p>
</li>
</ul>
<p>Create a config directory and add the following files:</p>
<pre><code class="lang-bash">mkdir config
</code></pre>
<h4 id="heading-cruisecontrolproperties">cruisecontrol.properties</h4>
<p>This is Cruise Control’s main configuration file.</p>
<p>Save the following content as <code>cruisecontrol.properties</code> in the config directory:</p>
<pre><code class="lang-properties"># Kafka cluster. Tells Cruise Control how to connect to the brokers
bootstrap.servers=kafka-1:29092,kafka-2:29092,kafka-3:29092

# Topic from which metrics are read
metric.reporter.topic=__CruiseControlMetrics

# Aggregated partition data
partition.metric.sample.store.topic=__KafkaCruiseControlPartitionMetricSamples

# Aggregated broker data
broker.metric.sample.store.topic=__KafkaCruiseControlModelTrainingSamples

# Enable broker failure detection for KRaft mode (no ZooKeeper)
kafka.broker.failure.detection.enable=true

# Capacity. Tells Cruise Control where the capacity file is
capacity.config.file=config/capacityJBOD.json

# Goals. What to optimize for during cluster balancing. These are the rules
# Cruise Control abides by during rebalancing
default.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal

# Hard goals. These must always be satisfied
hard.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal

# Webserver. For REST API access
webserver.http.port=9090
webserver.http.address=0.0.0.0

# Execution
num.broker.metrics.windows=1
num.partition.metrics.windows=1
</code></pre>
<p>I’ve added inline comments to explain most of the configuration above, but I think the <code>Goals</code> deserve special attention. These are the rules that we, as users, set for Cruise Control to abide by.</p>
<p>By defining goals, we tell Cruise Control to do the following:</p>
<ul>
<li><p><code>RackAwareGoal</code> – Spread replicas across racks (or in our case, brokers)</p>
</li>
<li><p><code>ReplicaCapacityGoal</code> – Don't overload brokers with too many replicas</p>
</li>
<li><p><code>DiskCapacityGoal</code> – Don't fill up disk</p>
</li>
<li><p><code>NetworkInboundCapacityGoal</code> – Balance incoming network traffic</p>
</li>
<li><p><code>NetworkOutboundCapacityGoal</code> – Balance outgoing network traffic</p>
</li>
<li><p><code>CpuCapacityGoal</code> – Balance CPU usage</p>
</li>
<li><p><code>ReplicaDistributionGoal</code> – Evenly distribute replicas</p>
</li>
<li><p><code>DiskUsageDistributionGoal</code> – Ensure even disk usage across brokers</p>
</li>
<li><p><code>LeaderReplicaDistributionGoal</code> – Evenly distribute leader replicas</p>
</li>
<li><p><code>LeaderBytesInDistributionGoal</code> – Balance data flowing to leaders</p>
</li>
</ul>
<p>Via Cruise Control configuration, you can define two types of goals: <code>Default goals</code> and <code>Hard goals</code>. Hard goals must be met. Default goals that aren’t part of the hard goals become soft goals. This means that Cruise Control will give its best effort to satisfy them but won’t reject a proposal if it can’t.</p>
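<p>The relationship between the three goal types boils down to a set difference: anything in the default goals that isn't also a hard goal is treated as soft. A quick sketch (class-name prefixes dropped for readability):</p>

```python
# Soft goals are simply the default goals that are not also hard goals.
default_goals = {
    "RackAwareGoal", "ReplicaCapacityGoal", "DiskCapacityGoal",
    "NetworkInboundCapacityGoal", "NetworkOutboundCapacityGoal",
    "CpuCapacityGoal", "ReplicaDistributionGoal",
    "DiskUsageDistributionGoal", "LeaderReplicaDistributionGoal",
    "LeaderBytesInDistributionGoal",
}
hard_goals = {
    "RackAwareGoal", "ReplicaCapacityGoal", "DiskCapacityGoal",
    "NetworkInboundCapacityGoal", "NetworkOutboundCapacityGoal",
    "CpuCapacityGoal",
}
soft_goals = default_goals - hard_goals
print(sorted(soft_goals))
```

<p>This is why the six capacity-style goals appear in both lists in our configuration, while the four distribution-style goals are best-effort.</p>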
<p>Here’s a little summary:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Type</td><td>Meaning</td><td>What CC Does</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Hard Goals</strong></td><td>Must-haves (capacity limits)</td><td><strong>Never violates</strong> – rejects proposal if can't satisfy</td></tr>
<tr>
<td><strong>Soft Goals</strong></td><td>Nice-to-haves (better balance)</td><td><strong>Tries to satisfy</strong> – skips if conflicts with hard goals</td></tr>
<tr>
<td><strong>Default Goals</strong></td><td>Hard + Soft together</td><td><strong>Optimizes for all</strong> – prioritizes hard over soft</td></tr>
</tbody>
</table>
</div><p>Cruise Control collects metrics over a defined period (default: 5 minutes) to create a monitoring window. The following settings control how many windows Cruise Control needs before it’s ready to generate proposals (we’ll see shortly what proposals are):</p>
<ul>
<li><p><code>num.broker.metrics.windows=1</code>: Wait for 1 monitoring window before generating proposals. Each window in Cruise Control is 5 minutes by default. This means that Cruise Control will be ready after 5 minutes. I’ve set this to 1 for quick testing. The recommendation is to use a large window in production to avoid false proposals from temporary spikes.</p>
</li>
<li><p><code>num.partition.metrics.windows=1</code>: Wait for 1 window of partition metrics. Same reasoning as above.</p>
</li>
</ul>
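<p>To put those two settings in perspective, here's a tiny sketch of how long Cruise Control has to collect metrics before it can generate proposals, assuming the default 5-minute window length:</p>

```python
# How long Cruise Control collects metrics before it can generate
# proposals, given the number of monitoring windows required.
WINDOW_MINUTES = 5  # Cruise Control's default window length

def minutes_until_ready(num_windows: int,
                        window_minutes: int = WINDOW_MINUTES) -> int:
    return num_windows * window_minutes

print(minutes_until_ready(1))   # our quick-test setting: ready in 5 minutes
print(minutes_until_ready(12))  # a more production-like setting: 1 hour
```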
<h4 id="heading-capacity">Capacity</h4>
<p>This file informs Cruise Control about the capacity (CPU, disk, network) of each broker, which helps it make decisions. Using the file below, we’re telling Cruise Control the following:</p>
<ul>
<li><p>What the broker IDs are</p>
</li>
<li><p>The disk path <code>/var/lib/kafka/data</code> and disk capacity (<code>100000000</code> MB = 100 TB). This is used by the <code>DiskCapacityGoal</code> that we set up in the <code>cruisecontrol.properties</code> file above.</p>
</li>
<li><p>The CPU capacity: <code>100</code> (100% = 1 core). Used by <code>CpuCapacityGoal</code>.</p>
</li>
<li><p>The <code>NW_IN</code> network inbound capacity (125,000 KB/s = 125 MB/s = 1 Gbps). Used by <code>NetworkInboundCapacityGoal</code>.</p>
</li>
<li><p>The <code>NW_OUT</code> network outbound capacity (125,000 KB/s). Used by <code>NetworkOutboundCapacityGoal</code>.</p>
</li>
</ul>
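<p>The network capacities are the easiest values to get wrong, so here's a quick conversion sketch showing why a 1 Gbps link becomes <code>125000</code> KB/s in the capacity file:</p>

```python
# Convert a link speed in Gbps (gigabits per second) to the KB/s
# (kilobytes per second) unit that Cruise Control's NW_IN / NW_OUT
# capacity values expect.
def gbps_to_kb_per_s(gbps: float) -> float:
    bits_per_second = gbps * 1_000_000_000
    bytes_per_second = bits_per_second / 8  # 8 bits per byte
    return bytes_per_second / 1000          # decimal kilobytes

print(gbps_to_kb_per_s(1))  # 125000.0 -- the value in capacityJBOD.json
```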
<p>Save the following content as <code>capacityJBOD.json</code> in the config directory:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"brokerCapacities"</span>:[
    {
      <span class="hljs-attr">"brokerId"</span>: <span class="hljs-string">"1"</span>,
      <span class="hljs-attr">"capacity"</span>: {
        <span class="hljs-attr">"DISK"</span>: {<span class="hljs-attr">"/var/lib/kafka/data"</span>: <span class="hljs-string">"100000000"</span>},
        <span class="hljs-attr">"CPU"</span>: <span class="hljs-string">"100"</span>,
        <span class="hljs-attr">"NW_IN"</span>: <span class="hljs-string">"125000"</span>,
        <span class="hljs-attr">"NW_OUT"</span>: <span class="hljs-string">"125000"</span>
      }
    },
    {
      <span class="hljs-attr">"brokerId"</span>: <span class="hljs-string">"2"</span>,
      <span class="hljs-attr">"capacity"</span>: {
        <span class="hljs-attr">"DISK"</span>: {<span class="hljs-attr">"/var/lib/kafka/data"</span>: <span class="hljs-string">"100000000"</span>},
        <span class="hljs-attr">"CPU"</span>: <span class="hljs-string">"100"</span>,
        <span class="hljs-attr">"NW_IN"</span>: <span class="hljs-string">"125000"</span>,
        <span class="hljs-attr">"NW_OUT"</span>: <span class="hljs-string">"125000"</span>
      }
    },
    {
      <span class="hljs-attr">"brokerId"</span>: <span class="hljs-string">"3"</span>,
      <span class="hljs-attr">"capacity"</span>: {
        <span class="hljs-attr">"DISK"</span>: {<span class="hljs-attr">"/var/lib/kafka/data"</span>: <span class="hljs-string">"100000000"</span>},
        <span class="hljs-attr">"CPU"</span>: <span class="hljs-string">"100"</span>,
        <span class="hljs-attr">"NW_IN"</span>: <span class="hljs-string">"125000"</span>,
        <span class="hljs-attr">"NW_OUT"</span>: <span class="hljs-string">"125000"</span>
      }
    }
  ]
}
</code></pre>
<h4 id="heading-logging">Logging</h4>
<p>This isn’t required for Cruise Control to work, but it’ll help you debug if there are issues. Save the following content as <code>log4j.properties</code> in the config directory. If you see unexpected behavior when starting Cruise Control, such as the container exiting, you can use the <code>docker logs</code> command to see what happened.</p>
<pre><code class="lang-properties"># Root logger - INFO level, output to console
rootLogger.level=INFO
appenders=console

# Console output (for docker logs)
appender.console.type=Console
appender.console.name=STDOUT
appender.console.layout.type=PatternLayout
appender.console.layout.pattern=[%d] %p %m (%c)%n

# Send root logger to console
rootLogger.appenderRef.console.ref=STDOUT
</code></pre>
<p>Now that we have all the configurations in place, let’s run the following command to start Kafka brokers with Kafka UI and Cruise Control:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml up -d
</code></pre>
<p>Using the following command, verify that the three Kafka brokers, Kafka UI, and Cruise Control containers are running:</p>
<pre><code class="lang-bash">docker ps
</code></pre>
<p>You should see something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768085885265/b196c5ce-77b3-4563-b72a-20c6de7123f0.png" alt="Docker running containers" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now that we have Cruise Control up and running, let’s create some imbalance and see how much better an experience we get by using Cruise Control versus mitigating the imbalance manually.</p>
<h3 id="heading-creating-the-imbalance">Creating the Imbalance</h3>
<p>An imbalance is a scenario where some brokers are handling more messages than others – and they may run into high disk usage or high IOPS.</p>
<p>To create the imbalance in our cluster, we’ll create a few topics and then produce messages unevenly. Now that you have Kafka UI running, you can create topics through the UI, or you can create them with commands. I’m going to use commands because they’ll be easier for you to reproduce (though I recommend the UI for production operations, since it prevents typos).</p>
<p>If you also decide to use commands, run the following command. Then using UI, verify that the topics have been created.</p>
<p>Note: You’ll find that these commands differ from the previous ones. This is because, previously, our <code>docker-compose-basic.yml</code> file used the <code>confluentinc/cp-kafka:7.6.0</code> image for Kafka, but now we’re using the <code>justramesh2000/kafka-apache-cc:3.8.1</code> image, which is based on the <code>apache/kafka:3.8.1</code> image. Different images place the tools in different locations, so the commands need to be adjusted to account for that.</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">'
/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092 \
  --partitions 12 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-views \
  --bootstrap-server kafka-1:29092 \
  --partitions 20 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-analytics \
  --bootstrap-server kafka-1:29092 \
  --partitions 3 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-articles \
  --bootstrap-server kafka-1:29092 \
  --partitions 5 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy
'</span>
</code></pre>
<p>Run the following commands to produce an uneven volume of messages across the topics we created above.</p>
<p>Heavy Load on <code>freecodecamp-logs</code>:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">"
for i in {1..100000}; do 
  echo '{\"log_id\":\"'\$i'\",\"level\":\"INFO\",\"message\":\"Log entry '\$i'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-logs --bootstrap-server kafka-1:29092"</span>
</code></pre>
<p>Heavy load on <code>freecodecamp-views</code>:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">"
for i in {1..80000}; do 
  echo '{\"view_id\":\"'\$i'\",\"page\":\"/article/'\$((i % 100))'\",\"user\":\"user_'\$((i % 1000))'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-views --bootstrap-server kafka-1:29092"</span>
</code></pre>
<p>Moderate load on <code>freecodecamp-analytics</code>:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">"
for i in {1..30000}; do 
  echo '{\"event\":\"page_view\",\"user\":\"user_'\$i'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-analytics --bootstrap-server kafka-1:29092"</span>
</code></pre>
<p>Now, produce messages with a <code>fixed key</code> to force all of the data into one partition. This is a fast way to create a strong disk imbalance. Run the following command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">"
for i in {1..300000}; do
  echo 'hotkey:{\"log_id\":'\$i',\"msg\":\"big payload\"}'
done | /opt/kafka/bin/kafka-console-producer.sh \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092 \
  --property parse.key=true \
  --property key.separator=:"</span>
</code></pre>
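<p>Why does a fixed key pin everything to one partition? Kafka's default partitioner hashes the record key (with murmur2) modulo the partition count, so the same key always maps to the same partition. The sketch below uses MD5 as a stand-in hash rather than Kafka's actual murmur2 implementation, but the pinning behavior is the same:</p>

```python
import hashlib

NUM_PARTITIONS = 12  # freecodecamp-logs has 12 partitions

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for Kafka's murmur2-based default partitioner: any
    # deterministic hash of the key shows the same pinning behavior.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every record produced with the fixed key lands on one partition...
fixed = {partition_for("hotkey") for _ in range(1000)}
# ...while varied keys spread across partitions.
spread = {partition_for(f"key-{i}") for i in range(1000)}
print(len(fixed), len(spread))  # the fixed key always hits exactly 1 partition
```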
<p>After running the above commands, come back to the UI, refresh, and you’ll see the message counts, like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768088817389/13c285b2-308a-4e6c-80f1-4de774f34662.png" alt="Kafka Topics with Message Count" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now, go to the Brokers tab and see the imbalance in disk usage:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768088880398/5e615012-a35f-4820-8daa-b46746b65b56.png" alt="Kafka Brokers Disk usage" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You should be able to see that <strong>Broker-2 has only about 47% of the data that Broker-1 has</strong>, and <strong>Broker-3 has about 11% more data than Broker-1</strong>. Broker-2 is significantly underutilized, while Broker-1 and Broker-3 hold most of the data.</p>
<h3 id="heading-attempting-manual-rebalancing">Attempting Manual Rebalancing</h3>
<p><strong>Step 1</strong>: First, we need to find out which topic is heavy – meaning which one handles more data. My setup shows the <code>freecodecamp-logs</code> topic with 8MB of data:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768089339438/e57c1b10-ebac-4824-ad0d-1c8d3ec6ff71.png" alt="Kafka Topics with Message Count" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><strong>Step 2</strong>: Let’s see where the heavy partitions are.</p>
<p>Click on <strong>freecodecamp-logs</strong> in Kafka UI and see the partition table. Look at the message count: partition 4 is bigger than the others. The table also gives information about replicas of partitions: partition 4 has replicas on Broker 1 and 3. Broker 2 doesn’t have partition 4 at all. This explains why Broker 2 was underutilized.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768089513001/8993d5ce-c70f-45e3-b643-b8a759dda138.png" alt="Kafka Topic Partitions" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><strong>Step 3:</strong> To balance the cluster, we need to move partition 4 around.</p>
<p>We can move partition 4 to Broker 2. But before that, let’s do some math to be able to rationalize our decision. Note that the calculation doesn’t have to be precise – we just want a relative sense of data between brokers.</p>
<p>Current state:</p>
<ul>
<li><p><strong>Broker 1</strong>: 4.55 MB</p>
</li>
<li><p><strong>Broker 2</strong>: 2.29 MB (underutilized)</p>
</li>
<li><p><strong>Broker 3</strong>: 5.11 MB (over-utilized)</p>
</li>
</ul>
<p>Note that the compressed data size of partition 4 is roughly 2.25 MB (the exact size is not critical).</p>
<p>If we move partition 4 from [1,3] to [2,3]:</p>
<ul>
<li><p><strong>Broker 1:</strong> Loses partition 4, so 4.55 - 2.25 = <strong>~2.3 MB</strong></p>
</li>
<li><p><strong>Broker 2:</strong> Gains partition 4, so 2.29 + 2.25 = ~<strong>4.54 MB</strong></p>
</li>
<li><p><strong>Broker 3:</strong> Already has partition 4, so = <strong>5.11 MB (no change)</strong></p>
</li>
</ul>
<p>The result is that Broker 1 becomes underutilized.</p>
<p>How about if we move partition 4 from [1,3] to [1,2]?</p>
<ul>
<li><p><strong>Broker 1:</strong> Already has partition 4 = <strong>4.55 MB (no change)</strong></p>
</li>
<li><p><strong>Broker 2:</strong> Gains partition 4, so 2.29 + 2.25 = ~<strong>4.54 MB</strong></p>
</li>
<li><p><strong>Broker 3:</strong> Loses partition 4, so 5.11 - 2.25 = ~<strong>2.86 MB</strong></p>
</li>
</ul>
<p>Hmm, this still creates an imbalance (broker 3 becomes too light).</p>
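<p>This trial-and-error is easy to script. The following sketch replays the same arithmetic for both candidate moves and reports the spread between the heaviest and lightest broker (sizes in MB, taken from the figures above):</p>

```python
# Simulate moving partition 4 (~2.25 MB) between replica sets and
# measure the resulting disk-usage spread across brokers.
PART4_MB = 2.25
current = {1: 4.55, 2: 2.29, 3: 5.11}  # MB per broker, from Kafka UI

def after_move(loads, old_replicas, new_replicas, size):
    loads = dict(loads)
    for b in set(old_replicas) - set(new_replicas):
        loads[b] -= size   # broker loses its copy of partition 4
    for b in set(new_replicas) - set(old_replicas):
        loads[b] += size   # broker gains a copy of partition 4
    return loads

def spread(loads):
    # Gap between the heaviest and lightest broker, in MB.
    return round(max(loads.values()) - min(loads.values()), 2)

move_a = after_move(current, [1, 3], [2, 3], PART4_MB)  # option 1
move_b = after_move(current, [1, 3], [1, 2], PART4_MB)  # option 2
print(spread(current), spread(move_a), spread(move_b))  # 2.82 2.81 1.69
```

<p>Neither single move brings the spread close to zero – which is exactly the problem: moving one partition mostly shuffles the imbalance around.</p>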
<p>So basically, manual rebalancing requires complex calculations. Moving a single partition impacts disk usage, leader distribution, and network traffic across multiple brokers. One poorly planned move can create a new imbalance elsewhere. </p>
<p>But, let’s say you somehow landed on a perfect mathematical calculation and you’re ready to make the move to balance. We’ll assume that the perfect plan is to move Partition 4 from [1, 3] to [2, 3]. I know it’s not the perfect move but the point is to see the pain afterwards.</p>
<p><strong>Step 4</strong>: It’s time to move the partition manually.</p>
<p>We need to tell Kafka to move partition 4's replicas from brokers [1,3] to brokers [2,3].</p>
<p>To do that, you need to create a file called <code>reassignment.json</code> on your machine:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"version"</span>: <span class="hljs-number">1</span>,
  <span class="hljs-attr">"partitions"</span>: [
    {
      <span class="hljs-attr">"topic"</span>: <span class="hljs-string">"freecodecamp-logs"</span>,
      <span class="hljs-attr">"partition"</span>: <span class="hljs-number">4</span>,
      <span class="hljs-attr">"replicas"</span>: [<span class="hljs-number">2</span>, <span class="hljs-number">3</span>],
      <span class="hljs-attr">"log_dirs"</span>: [<span class="hljs-string">"any"</span>, <span class="hljs-string">"any"</span>]
    }
  ]
}
</code></pre>
<p><strong>What this means:</strong></p>
<ul>
<li><p><code>"partition": 4</code> – the target partition</p>
</li>
<li><p><code>"replicas": [2, 3]</code> – the new placement: brokers 2 and 3</p>
</li>
<li><p><code>"log_dirs": ["any", "any"]</code> – let Kafka choose the disk directory</p>
</li>
</ul>
<p>Save this file somewhere accessible.</p>
<p>Then run the following command to copy the JSON to the Kafka cluster:</p>
<pre><code class="lang-bash">docker cp reassignment.json kafka-1:/tmp/reassignment.json
</code></pre>
<p>This copies your local file into the kafka-1 container’s <code>/tmp</code> directory.</p>
<p>Run the following command to verify the file is there:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 cat /tmp/reassignment.json
</code></pre>
<p>You should see your JSON file content.</p>
<p>Now run the actual reassignment command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 /opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --reassignment-json-file /tmp/reassignment.json \
  --execute
</code></pre>
<p>Kafka will print a message telling you whether it has accepted the reassignment and started moving the data.</p>
<p>You can monitor the reassignment using the following command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 /opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --reassignment-json-file /tmp/reassignment.json \
  --verify
</code></pre>
<p>I’m not going to run the manual reassignment because I want to keep the imbalance and show how Cruise Control can help reduce the manual steps. Next, let’s see how Cruise Control handles the same imbalance automatically.</p>
<h3 id="heading-rebalancing-using-cruise-control">Rebalancing Using Cruise Control</h3>
<p>After creating the topics and messages, I let Cruise Control run for a couple of minutes. During that time, it collected metrics and trained its linear regression model. You can run the following command to verify that Cruise Control is running fine and has data (the following is a REST API call using curl):</p>
<pre><code class="lang-bash">curl http://localhost:9090/kafkacruisecontrol/state
</code></pre>
<p>You will get multiple JSON object outputs as part of the response. Each JSON object holds some information about the state of Cruise Control and the Kafka cluster. Let’s see each of these one at a time:</p>
<pre><code class="lang-json">MonitorState: {
  state: RUNNING(<span class="hljs-number">20.000</span>% trained),
  NumValidWindows: (<span class="hljs-number">1</span>/<span class="hljs-number">1</span>) (<span class="hljs-number">100.000</span>%),
  NumValidPartitions: <span class="hljs-number">105</span>/<span class="hljs-number">105</span> (<span class="hljs-number">100.000</span>%),
  flawedPartitions: <span class="hljs-number">0</span>
}
</code></pre>
<p>This describes the state of monitoring, based on the data Cruise Control has collected:</p>
<ul>
<li><p><code>state: RUNNING(20.000% trained)</code> – Cruise Control is <strong>actively collecting metrics</strong> from your Kafka cluster. Right now it has <strong>trained its model on 20% of the expected monitoring data</strong>.</p>
</li>
<li><p><code>NumValidWindows: (1/1) (100%)</code> – Cruise Control has collected 1 complete monitoring window out of 1 required (100% ready). Remember, we had set <code>num.broker.metrics.windows=1</code> in the <code>cruisecontrol.properties</code> configuration file.</p>
</li>
<li><p><code>NumValidPartitions: 105/105 (100%)</code> – Cruise Control analyzed all 105 partitions and has metrics for <strong>all.</strong></p>
</li>
<li><p><code>flawedPartitions: 0</code> – None of the partitions have problematic or missing metrics.</p>
</li>
</ul>
<pre><code class="lang-json">ExecutorState: {state: NO_TASK_IN_PROGRESS}
</code></pre>
<p>The above response indicates the execution engine is idle – no partition moves or leadership changes are currently in progress. This makes sense since we haven't asked Cruise Control to do anything yet.</p>
<pre><code class="lang-json">AnalyzerState: {
  isProposalReady: <span class="hljs-literal">true</span>,
  readyGoals: [
    NetworkInboundCapacityGoal,
    LeaderBytesInDistributionGoal,
    DiskCapacityGoal,
    ReplicaDistributionGoal,
    RackAwareGoal,
    NetworkOutboundCapacityGoal,
    CpuCapacityGoal,
    DiskUsageDistributionGoal,
    LeaderReplicaDistributionGoal,
    ReplicaCapacityGoal
  ]
}
</code></pre>
<p>AnalyzerState tells whether Cruise Control is ready to show a proposal or not. In this case it’s ready.</p>
<ul>
<li><p><code>isProposalReady: true</code> – Cruise Control has <strong>calculated a potential rebalancing plan</strong> (a proposal) that satisfies the configured goals.</p>
</li>
<li><p><code>readyGoals</code> – These are the goals that are considered <strong>ready and valid</strong> for rebalancing. Examples:</p>
<ul>
<li><p><code>DiskCapacityGoal</code>: balance disk usage among brokers</p>
</li>
<li><p><code>ReplicaDistributionGoal</code>: balance number of replicas per broker</p>
</li>
<li><p><code>RackAwareGoal</code>: maintain replicas across racks for fault tolerance</p>
</li>
<li><p><code>LeaderBytesInDistributionGoal</code>: balance network traffic from leaders</p>
</li>
<li><p><code>DiskUsageDistributionGoal</code>: ensure partitions are spread to prevent skew</p>
</li>
</ul>
</li>
</ul>
<p>Note that these are the goals we had set earlier in the <code>cruisecontrol.properties</code> file.</p>
<pre><code class="lang-json">AnomalyDetectorState: {
  selfHealingEnabled:[],
  selfHealingDisabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY, TOPIC_ANOMALY, MAINTENANCE_EVENT],
  selfHealingEnabledRatio:{...},
  recentGoalViolations:[],
  recentBrokerFailures:[],
  recentMetricAnomalies:[],
  recentDiskFailures:[],
  recentTopicAnomalies:[],
  recentMaintenanceEvents:[],
  metrics:{...},
  ongoingSelfHealingAnomaly:None,
  balancednessScore:<span class="hljs-number">100.000</span>
}
</code></pre>
<p><code>AnomalyDetectorState</code> shows information about any detected anomalies and the self-healing configuration.</p>
<ul>
<li><p><code>selfHealingEnabled: []</code> – Automatic self-healing is <strong>currently off</strong>. Cruise Control will <strong>not move partitions automatically</strong> in response to anomalies.</p>
</li>
<li><p><code>selfHealingDisabled: [...]</code> – Lists the anomaly types that are <strong>disabled for automatic self-healing</strong>, including broker failures, disk failures, and goal violations.</p>
</li>
<li><p><code>recentGoalViolations: []</code> – No goals have been violated recently.</p>
</li>
<li><p><code>balancednessScore: 100.000</code> – This is <strong>how balanced the cluster is according to Cruise Control’s hard goals</strong>. 100% means the cluster is perfectly balanced according to the metrics and hard goals currently active. This metric only cares about Hard Goals (Disk Capacity, CPU capacity) being violated – that’s why it shows 100% even though we know there are some disk usage imbalances in our cluster.</p>
</li>
</ul>
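<p>If you call the state endpoint with <code>?json=true</code>, the response is machine-readable, which makes it easy to script a readiness check before triggering a rebalance. Here's a minimal sketch – note that the embedded sample payload and its field names are illustrative abridgements of the output discussed above, so verify them against what your Cruise Control version actually returns:</p>

```python
import json

# Abridged, illustrative sample of a /kafkacruisecontrol/state?json=true
# response; real field names may differ by Cruise Control version.
sample = json.loads("""{
  "MonitorState": {"numValidWindows": 1, "numTotalWindows": 1,
                   "numFlawedPartitions": 0},
  "AnalyzerState": {"isProposalReady": true},
  "ExecutorState": {"state": "NO_TASK_IN_PROGRESS"}
}""")

def ready_for_rebalance(state: dict) -> bool:
    # Ready when all windows are valid, no partition metrics are flawed,
    # a proposal has been computed, and no execution is in flight.
    mon = state["MonitorState"]
    return (mon["numValidWindows"] >= mon["numTotalWindows"]
            and mon["numFlawedPartitions"] == 0
            and state["AnalyzerState"]["isProposalReady"]
            and state["ExecutorState"]["state"] == "NO_TASK_IN_PROGRESS")

print(ready_for_rebalance(sample))  # True for the sample above
```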
<h4 id="heading-the-proposal">The Proposal</h4>
<p>Via the <code>AnalyzerState</code> information, Cruise Control told us that it has a proposal for the cluster. Let’s see what it is. We can fetch the proposal using the proposals endpoint:</p>
<pre><code class="lang-bash">curl -s <span class="hljs-string">"http://localhost:9090/kafkacruisecontrol/proposals?json=true"</span>
</code></pre>
<p>The JSON response is quite large. Let's focus on the key parts that show our cluster's imbalance and how Cruise Control plans to fix it:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"summary"</span>: {
    <span class="hljs-attr">"numReplicaMovements"</span>: <span class="hljs-number">13</span>,    <span class="hljs-comment">// CC wants to move 13 partition replicas</span>
    <span class="hljs-attr">"numLeaderMovements"</span>: <span class="hljs-number">6</span>,      <span class="hljs-comment">// And reassign 6 partition leaders</span>
    <span class="hljs-attr">"onDemandBalancednessScoreBefore"</span>: <span class="hljs-number">84.67</span>,   <span class="hljs-comment">// Current: 84.67% balanced</span>
    <span class="hljs-attr">"onDemandBalancednessScoreAfter"</span>: <span class="hljs-number">89.76</span>     <span class="hljs-comment">// After: 89.76% balanced</span>
  },
  <span class="hljs-attr">"goalSummary"</span>: [
    {
      <span class="hljs-attr">"goal"</span>: <span class="hljs-string">"DiskUsageDistributionGoal"</span>,
      <span class="hljs-attr">"status"</span>: <span class="hljs-string">"VIOLATED"</span>
    },
    {
      <span class="hljs-attr">"goal"</span>: <span class="hljs-string">"LeaderBytesInDistributionGoal"</span>,
      <span class="hljs-attr">"status"</span>: <span class="hljs-string">"VIOLATED"</span>
    }
  ]
}
</code></pre>
<p>Based on the calculations, Cruise Control thinks:</p>
<ol>
<li><p>Moving 13 partition replicas will help. Note that manually, we had decided to move just one partition (partition 4).</p>
</li>
<li><p>Reassigning 6 partition leaders will help. Manually we didn’t account for any leadership reassignment.</p>
</li>
<li><p><code>DiskUsageDistributionGoal</code> has been violated. We know that the disk usage is not distributed perfectly.</p>
</li>
<li><p><code>LeaderBytesInDistributionGoal</code> has also been violated. We couldn’t detect this manually. Technically, you could have found it, but it would have taken a decent amount of manual calculation and would still have been error-prone.</p>
</li>
</ol>
<p>Note: While we're focusing on disk usage imbalance, Cruise Control optimizes for 10 different goals (disk, CPU, network, leaders, and so on). This holistic approach gives it a better chance of achieving true cluster balance versus balancing manually.</p>
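<p>If you review proposals as part of an ops workflow, it helps to reduce the large JSON response to just the headline numbers. This sketch extracts the movement counts and violated goals from a response shaped like the excerpt above (the embedded payload is an abridged, illustrative sample):</p>

```python
import json

# Abridged /proposals?json=true response, shaped like the excerpt above.
proposal = json.loads("""{
  "summary": {"numReplicaMovements": 13, "numLeaderMovements": 6,
              "onDemandBalancednessScoreBefore": 84.67,
              "onDemandBalancednessScoreAfter": 89.76},
  "goalSummary": [
    {"goal": "DiskUsageDistributionGoal", "status": "VIOLATED"},
    {"goal": "LeaderBytesInDistributionGoal", "status": "VIOLATED"},
    {"goal": "RackAwareGoal", "status": "NO-ACTION"}
  ]
}""")

# Which goals does Cruise Control consider violated right now?
violated = [g["goal"] for g in proposal["goalSummary"]
            if g["status"] == "VIOLATED"]
# How much would applying the proposal improve the balancedness score?
improvement = round(proposal["summary"]["onDemandBalancednessScoreAfter"]
                    - proposal["summary"]["onDemandBalancednessScoreBefore"], 2)
print(violated, improvement)
```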
<h4 id="heading-executing-the-proposal">Executing the proposal</h4>
<p>Let’s run the actual rebalancing using Cruise Control. The command is:</p>
<pre><code class="lang-bash">curl -X POST <span class="hljs-string">'http://localhost:9090/kafkacruisecontrol/rebalance?dryrun=false&amp;json=true'</span>
</code></pre>
<p>Again, you’ll get a large JSON response, similar to the proposal.</p>
<p>You can track the status using the following API call:</p>
<pre><code class="lang-bash">curl <span class="hljs-string">"http://localhost:9090/kafkacruisecontrol/user_tasks"</span>
</code></pre>
<p>You will get something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768095347466/64131c47-5884-4894-95df-d46e9eb8cd97.png" alt="Cruise Control Tasks" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Note that the 4th item in the list is our rebalance API call, and it’s complete. This was quick for our small dev cluster, but in large clusters you may see the status as <code>InExecution</code>.</p>
<p>Let’s look at the UI to see the state of the imbalance now that Cruise Control has executed the proposal. The UI shows the following for me:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768095510743/16db64f7-d14b-4120-95c9-ac9b1d43f47e.png" alt="Kafka balanced Disk Usage" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-comparison">Comparison</h4>
<p>Before rebalancing:</p>
<ul>
<li><p>Broker 1: 4.52 MB, 69 partitions, 35 leaders</p>
</li>
<li><p>Broker 2: 2.22 MB, 69 partitions, 35 leaders (<strong>underutilized</strong>)</p>
</li>
<li><p>Broker 3: 5.05 MB, 72 partitions, 35 leaders (<strong>overutilized</strong>)</p>
</li>
<li><p><strong>Disk range:</strong> 2.83 MB (5.05 - 2.22)</p>
</li>
</ul>
<p>After rebalancing:</p>
<ul>
<li><p>Broker 1: 4.66 MB, 69 partitions, 38 leaders</p>
</li>
<li><p>Broker 2: 3.87 MB, 77 partitions, 31 leaders</p>
</li>
<li><p>Broker 3: 4.87 MB, 64 partitions, 36 leaders</p>
</li>
<li><p><strong>Disk range:</strong> 1.00 MB (4.87 - 3.87)</p>
</li>
</ul>
<p>Results:</p>
<ul>
<li><p><strong>Disk usage balanced</strong> – Range reduced from 2.83 MB to 1.00 MB (64% improvement!)</p>
</li>
<li><p><strong>Replicas redistributed</strong> – Broker 2 gained 8 replicas, Broker 3 lost 8 replicas</p>
</li>
<li><p><strong>Leaders balanced</strong> – Changed from 35-35-35 to 38-31-36. Cruise Control prioritized balancing actual network traffic over leader count.</p>
</li>
</ul>
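<p>The headline numbers above are easy to double-check with a few lines:</p>

```python
# Disk usage per broker (MB) before and after the Cruise Control rebalance.
before = {1: 4.52, 2: 2.22, 3: 5.05}
after = {1: 4.66, 2: 3.87, 3: 4.87}

def disk_range(loads):
    # Spread between the heaviest and lightest broker, in MB.
    return round(max(loads.values()) - min(loads.values()), 2)

r_before, r_after = disk_range(before), disk_range(after)
improvement_pct = int((r_before - r_after) / r_before * 100)
print(r_before, r_after, improvement_pct)  # 2.83 1.0 64
```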
<p>The cluster is now more balanced across all metrics. Congrats!</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We covered a lot in this tutorial, so let’s take a step back and look at what we did.</p>
<p>You started by experiencing the reality of manual Kafka management – the endless CLI commands, the tedious calculations, the JSON files, and the potential for costly mistakes. If you felt frustrated during that section, that’s to be expected. That frustration is exactly what thousands of engineering teams deal with every day.</p>
<p>Then you were presented with two complementary tools:</p>
<ol>
<li><p><strong>Kafka UI</strong> gave you visibility. No more grepping through command outputs or manually counting partition leaders. Everything you need – broker health, topic configurations, consumer lag – is right there in a clean web interface. For small teams and development environments, this alone is a game-changer.</p>
</li>
<li><p><strong>Cruise Control</strong> gave you intelligence. It didn't just automate what you'd do manually – it also did a fundamentally better job. While you were focused on moving one partition (partition 4), Cruise Control analyzed all 105 partitions across 10 different optimization goals and proposed a comprehensive rebalancing plan. That's the difference between human effort and automated intelligence.</p>
</li>
</ol>
<p>I want to call out that this tutorial used a simplified setup. In production, you can expect more complex configurations, such as:</p>
<ul>
<li><p>Kafka and Cruise Control running on separate machines</p>
</li>
<li><p>A larger monitoring window for Cruise Control</p>
</li>
<li><p>Self-healing capabilities enabled</p>
</li>
</ul>
<p>If there's one thing you take away from this article, let it be this: you should stop managing your Kafka cluster manually. You've seen there's a better way. Use it. Thanks for reading!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an LSM Tree Storage Engine from Scratch – Full Handbook ]]>
                </title>
                <description>
                    <![CDATA[ Databases are one of the most important parts of a software system. They allow us to store huge amounts of data in an organized way and retrieve it efficiently when we need it. In the early days, when the volume of data was relatively small, engineer... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-an-lsm-tree-storage-engine-from-scratch-handbook/</link>
                <guid isPermaLink="false">6944631e80f40a442d1799df</guid>
                
                    <category>
                        <![CDATA[ Databases ]]>
                    </category>
                
                    <category>
                        <![CDATA[ lsmtree ]]>
                    </category>
                
                    <category>
                        <![CDATA[ storage solutions ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Go Language ]]>
                    </category>
                
                    <category>
                        <![CDATA[ heap ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ramesh Sinha ]]>
                </dc:creator>
                <pubDate>Thu, 18 Dec 2025 20:25:02 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1766089431510/433ff03f-8aca-4a87-82d3-0b6d6c1f371c.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Databases are one of the most important parts of a software system. They allow us to store huge amounts of data in an organized way and retrieve it efficiently when we need it.</p>
<p>In the early days, when the volume of data was relatively small, engineers prioritized fast data retrieval and stored data in <a target="_blank" href="https://en.wikipedia.org/wiki/B-tree">B-tree structures</a> that made searching efficient.</p>
<p>But over time, we started building systems that needed to ingest massive amounts of data like logs, metrics, likes, chats and tweets. This made it necessary to design a storage system that would make writing faster.</p>
<p>One such storage system is the LSM-tree (Log-Structured Merge tree).</p>
<p>In this tutorial, rather than immediately diving into the theoretical concepts of an LSM-Tree Storage system, I’ll take a practical, problem-driven approach. I believe that learning through solving problems is far more effective and engaging than simple memorization of concepts.</p>
<p>By approaching these ideas progressively, my goal is to guide you step by step through real-world engineering challenges and solutions, giving you a front-row seat to the intricacies of building a robust storage system from scratch.</p>
<p>We’ll begin by identifying real-world challenges that arise in database design – like handling write-heavy workloads, ensuring data durability, or managing efficient storage. These challenges will set the stage for each feature and component of LSM-Trees.</p>
<p>Through this method, we’ll explore the foundations of LSM-Tree storage systems and dive deeper into their key components: MemTable, SSTable, Write-Ahead Log (WAL), and Manifest File.</p>
<p>We’ll also examine the Write and Read paths, explore Durability and Crash-Recovery mechanisms, and conclude with one of the most critical processes: Compaction.</p>
<p>By the end of this handbook, you’ll understand not just what these components are but also why they are designed the way they are and how they solve the unique challenges of building modern, high-performance databases.</p>
<h3 id="heading-what-well-cover"><strong>What We’ll Cover:</strong></h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-an-lsm-tree">What is an LSM Tree?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-preface-setting-up-to-build-an-lsm-tree-database">Preface: Setting up to Build an LSM-Tree Database</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-initial-feature-set-laying-the-foundation-of-the-database-system">Initial Feature Set: Laying the Foundation of the Database System</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-memtable-in-memory-data-storage">MemTable: In-Memory Data Storage</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-sstable-persisting-data-for-durability">SSTable: Persisting Data for Durability</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-wal-write-ahead-log-crash-recovery-made-simple">The WAL (Write Ahead Log ): Crash Recovery Made Simple</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-manifest-file-tracking-the-state-of-the-database">Manifest File: Tracking the State of the Database</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-update-and-delete-handling-mutability-in-an-immutable-system">Update and Delete: Handling Mutability in an Immutable System</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-compaction-cleaning-up-stale-and-deleted-data">Compaction: Cleaning Up Stale and Deleted Data</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-complete-code">Complete Code</a></li>
</ul>
</li>
</ol>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>While this tutorial is designed to be comprehensive and approachable, it’ll be helpful if you come in with some foundational knowledge in the following areas:</p>
<ul>
<li><p><strong>Programming in Golang</strong>: Familiarity with Go syntax, error handling, and standard libraries (for example, <code>os</code>, <code>encoding/gob</code>, <code>container/heap</code>) will make it easier to work through the implementation examples.</p>
</li>
<li><p><strong>Basic data structures and algorithms:</strong> Concepts such as maps, heaps, some sorting algorithms, and early termination are leveraged throughout the tutorial.</p>
</li>
<li><p><strong>Understanding persistent storage:</strong> Awareness of the differences between in-memory and disk-based storage, as well as sequential versus random read/write operations will be helpful in grasping performance-related trade-offs.</p>
</li>
<li><p><strong>General database knowledge:</strong> If you're familiar with key-value databases or CRUD operations (Create, Read, Update, Delete), you’ll have a head start.</p>
</li>
<li><p><strong>Concurrency</strong>: Basic understanding of threads and concurrency.</p>
</li>
</ul>
<p>While having experience in these areas will deepen your understanding of the concepts and reduce the learning curve, I will provide sufficient detail and practical explanations at every step, ensuring you gain the insights necessary to follow along and build your own LSM-tree-based storage engine.</p>
<h2 id="heading-what-is-an-lsm-tree">What is an LSM Tree?</h2>
<p>A log-structured merge-tree (or LSM tree) is a data structure that makes database writes super fast by recording new data in memory first, then periodically sorting and merging it into larger files on disk.</p>
<p>The “log” in its name refers to the fact that it saves data in a log-structured format (rather than simply storing it). We will come to what those logs are in a little bit.</p>
<p>LSM trees keep appending new data to the existing data, instead of looking for something that exists and updating it. In other words, you don't have to spend any CPU cycles thinking about where to store data – just append it at the end.</p>
<p>An LSM tree also has "tree" in its name, but does it actually store data in a tree? Not really. The “tree” here is mostly an abstract concept. It refers to the hierarchical organization of levels (L0, L1, L2, and so on), not a tree data structure with nodes and pointers. Again, we will come to those levels in a little bit, but for now, let’s just say it makes sense to call it a tree given that it stores data in a leveled fashion.</p>
<p>Just note that there isn't a tangible tree structure in play (like a binary tree or graph) – it’s not node-based storage.</p>
<p>Finally, there is the "merge" part of the name. For now, suffice it to say that you’ll soon see how this storage engine merges data to save storage by avoiding duplication.</p>
<p>Personally, I think that "Log-Structured Merge <strong>System</strong>" would be clearer than "tree," but "LSM tree" is the established term in the industry, so that's what we'll use.</p>
<h2 id="heading-preface-setting-up-to-build-an-lsm-tree-database">Preface: Setting up to Build an LSM-Tree Database</h2>
<p>Now that we have set the context, let’s put this theory into practice and start building our own LSM-tree-based database storage engine from scratch.</p>
<p>To follow along with this tutorial:</p>
<ul>
<li><p>Make sure you have Golang installed on your system. If not, you can download and install it from the <a target="_blank" href="https://go.dev/">official Go website</a>.</p>
</li>
<li><p>Set up your development environment and create a new Go module for this project by running: <code>go mod init lsm-db</code></p>
</li>
<li><p>Keep a code editor or IDE ready to try out the examples.</p>
</li>
</ul>
<h3 id="heading-initial-feature-set-laying-the-foundation-of-the-database-system">Initial Feature Set: Laying the Foundation of the Database System</h3>
<p>When I’m designing or building a system, I like to think that the system already exists, and I assume that I can just start calling functions that support the features of the system. I’ll follow that pattern here and assume that the following functions of the LSM tree exist and we can invoke them from <code>main.go</code>.</p>
<pre><code class="lang-go">db, err := NewDB[<span class="hljs-keyword">string</span>, <span class="hljs-keyword">string</span>](<span class="hljs-number">3</span>, <span class="hljs-number">3</span>) <span class="hljs-comment">// there is a feature to create new with some parameters we will get to</span>
db.Put(<span class="hljs-string">"a"</span>, <span class="hljs-string">"apple"</span>) <span class="hljs-comment">// a feature to add key value</span>
db.Delete(<span class="hljs-string">"a"</span>) <span class="hljs-comment">// a feature to delete a key</span>
val, _ := db.Get(<span class="hljs-string">"a"</span>) <span class="hljs-comment">// a feature to get value given a key</span>
</code></pre>
<p>As we progress through this journey, I’ll introduce essential features such as in-memory storage, flushing data to disk, and handling duplicate keys. We’ll also explore more advanced components, including a Write-Ahead Log (WAL) to ensure crash tolerance, a Manifest file to maintain the database state across application restarts, and a Compaction process to clean up redundant or stale data by merging older SSTables.</p>
<p>By the end of this tutorial, you will gain a clear understanding of how all these components work together to form a robust and efficient LSM-tree-based storage system.</p>
<h3 id="heading-memtable-in-memory-data-storage">MemTable: In-Memory Data Storage</h3>
<p>We’re building a database storage system, so of course you’ll need a way to store data. This means you need some kind of backing storage. This backing storage in an LSM tree is called a MemTable. The "Mem" refers to its in-memory storage. The benefit of in-memory storage is that it’s orders of magnitude faster than storing on disk.</p>
<p>For simplicity, at the core of the MemTable you can use a map (or dictionary depending on the programming language) as the underlying data structure to store key-value pairs. The map allows for fast lookups, insertions, and deletions, making it ideal for in-memory storage where performance is crucial. So the structure for MemTable will look like:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> MemTable[K comparable, V any] <span class="hljs-keyword">struct</span> {
    data <span class="hljs-keyword">map</span>[K]V <span class="hljs-comment">// this is primary storage map. It's generic so that you</span>
                  <span class="hljs-comment">// can store any kind of data</span>
}
</code></pre>
<p>The above code defines a <code>MemTable</code> struct, where <code>data</code> is a map that acts as the main storage for our key-value pairs. Since the <code>data</code> field is a map, you’ll be able to quickly add, retrieve, or delete values associated with a given key.</p>
<p>You may have noticed something new in the code: the use of <code>[K comparable, V any]</code>. This syntax is Go’s <strong>generic types</strong> feature, which allows us to write flexible code that can handle different data types.</p>
<p>Generics are a way to write code that is independent of any specific data type. They allow you to write functions and data structures that can work with a string, int, float, or any custom type you define, without sacrificing type safety.</p>
<p>In the above code, K and V are type parameters. They say: "This MemTable can work with any Key type K that is comparable, and any value type V."</p>
<p>Now that you have the MemTable, think of what functions it should provide to its clients. Well, the clients need to be able to save and retrieve values associated with a key, so the following functions would fit naturally:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(m *MemTable[K, V])</span> <span class="hljs-title">Put</span><span class="hljs-params">(key K, value V)</span></span> {
    m.data[key] = value
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(m *MemTable[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, <span class="hljs-keyword">bool</span>)</span></span> {
    value, ok := m.data[key]
    <span class="hljs-keyword">var</span> zero V
    <span class="hljs-keyword">if</span> !ok {
        <span class="hljs-keyword">return</span> zero, <span class="hljs-literal">false</span>
    }
    <span class="hljs-keyword">return</span> value, <span class="hljs-literal">true</span>
}
</code></pre>
<p>The above code defines the <code>Put</code> and <code>Get</code> functions – let’s break them down:</p>
<ul>
<li><p><strong>Put</strong>: This function allows the client to insert a key-value pair into the MemTable. If the key already exists in the map, its value will be updated with the new value provided as an argument. This is effectively the <code>write</code> operation of our key-value store.</p>
</li>
<li><p><strong>Get</strong>: This function retrieves the value associated with a given key from the MemTable. It returns two values: the value itself (of type <code>V</code>) and a boolean (<code>true</code> or <code>false</code>). The boolean indicates whether the key was found in the map. If the key does not exist, the function returns a <code>zero value</code> (more on that below) along with <code>false</code>.</p>
</li>
</ul>
<p>Did you notice <code>var zero V</code>?</p>
<p>It's pretty interesting. Think of a situation where we don't get a value from the map – say the key is not there, or something else is wrong. What should the function <code>Get</code> return in that case? Can it return an int (0), or a string "Not found", or some random object (foo)? You don't know anything about the type yet (Generics), so you can't tell it what to return.</p>
<p>In this case, the compiler comes to the rescue. Go has a zero value concept: every type has a zero value. An int has 0, a string has "", a bool has false, and pointers, slices, and maps have nil. By writing <code>var zero V</code> you are telling the compiler, "I don't know the type yet – you figure it out while compiling and return that type's zero value here." Neat!</p>
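<p>You can see the zero-value mechanism in isolation with a tiny generic helper (a standalone sketch, not part of our MemTable):</p>

```go
package main

import "fmt"

// zeroOf returns the zero value of whatever type it is instantiated with.
func zeroOf[T any]() T {
	var zero T // the compiler fills in T's zero value
	return zero
}

func main() {
	fmt.Println(zeroOf[int]())           // 0
	fmt.Printf("%q\n", zeroOf[string]()) // ""
	fmt.Println(zeroOf[*int]() == nil)   // true
}
```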
<p>I missed one thing though: how would a client invoke these functions? Right, we need a way to build the MemTable type.</p>
<p>To construct and initialize a MemTable, we can use a <strong>factory function</strong>: a common programming pattern for creating and returning new objects or instances without directly exposing the underlying implementation details.</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewMemTable</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">()</span> *<span class="hljs-title">MemTable</span>[<span class="hljs-title">K</span>, <span class="hljs-title">V</span>]</span> {
    <span class="hljs-keyword">return</span> &amp;MemTable[K, V]{
        data: <span class="hljs-built_in">make</span>(<span class="hljs-keyword">map</span>[K]V),
    }
}
</code></pre>
<p>Notice how we’ve initialized the data field using the built-in <code>make</code> function. Here’s why we do this:</p>
<p>Go has a built-in function called <code>make</code>, which is used to allocate and initialize slices, maps, and channels. This allocation ensures that they are ready for use without the risk of runtime panics.</p>
<p>You might wonder, why not use the <code>new</code> function to allocate the map? After all, developers coming from other programming backgrounds (like C++ or Java) might expect to use <code>new</code> for all types of memory allocation. But Go <strong>differentiates how it manages memory for composite types versus basic/numeric types</strong>, and that’s where <code>make</code> comes in.</p>
<p>This distinction matters because the <code>new</code> function only <strong>allocates memory</strong> for an object and returns a <em>pointer</em> to that memory. The object itself is not initialized, meaning that while the memory is allocated, the map isn’t ready to use. If we try to perform operations (like adding a key-value pair) on a <code>map</code> only allocated using <code>new</code>, it will cause a runtime panic because the map wasn’t correctly initialized.</p>
<p>For example:</p>
<pre><code class="lang-go">m := <span class="hljs-built_in">new</span>(<span class="hljs-keyword">map</span>[<span class="hljs-keyword">string</span>]<span class="hljs-keyword">int</span>) <span class="hljs-comment">// Allocates a pointer to an uninitialized map</span>
(*m)[<span class="hljs-string">"a"</span>] = <span class="hljs-number">1</span>            <span class="hljs-comment">// This will panic because the map is not initialized</span>
</code></pre>
<p>On the other hand, <code>make</code> both allocates and <strong>initializes the map</strong>, ensuring it’s fully functional right away. That’s why the correct way to create a map is:</p>
<pre><code class="lang-go">m1 := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">map</span>[<span class="hljs-keyword">string</span>]<span class="hljs-keyword">int</span>) <span class="hljs-comment">// Initializes the map properly</span>
m1[<span class="hljs-string">"a"</span>] = <span class="hljs-number">1</span>                <span class="hljs-comment">// This works as expected</span>
</code></pre>
<p>Now that you have the MemTable which can store data in memory, let's hook it up and use it.</p>
<p>But before that, do you remember that at the beginning I used function calls like <code>db.Put</code> and <code>db.Get</code>? Well, what is <code>db</code>? Because we are building a database storage system, it makes more sense to name the interface <code>db</code> instead of MemTable, right? And to be honest, it seems like MemTable is going to be part of the database system, not the whole system, doesn't it?</p>
<p>Even if it's not intuitive at the moment to define something like a DB type, let's just do it. Trust me, it will start to get clearer as we move along. This <code>db</code> type will wrap adding and retrieving data from MemTable.</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> DB[K comparable, V any] <span class="hljs-keyword">struct</span> {
    memtable *MemTable[K, V]
}

<span class="hljs-comment">// factory function for DB type</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewDB</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">()</span> <span class="hljs-params">(*DB[K, V], error)</span></span> {
    memtable := NewMemTable[K, V]()
    <span class="hljs-keyword">return</span> &amp;DB[K, V]{
        memtable: memtable,
    }, <span class="hljs-literal">nil</span>
}
</code></pre>
<p>Let's just define the Put and Get functions which will invoke corresponding functions in MemTable:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Put</span><span class="hljs-params">(key K, value V)</span> <span class="hljs-title">error</span></span> {
    db.memtable.Put(key, value)
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, error)</span></span> {
    <span class="hljs-keyword">if</span> val, ok := db.memtable.Get(key); ok {
        <span class="hljs-keyword">return</span> val, <span class="hljs-literal">nil</span>
    }
    <span class="hljs-keyword">var</span> zero V
    <span class="hljs-keyword">return</span> zero, errors.New(<span class="hljs-string">"key not found"</span>)
}
</code></pre>
<p>Let's integrate what we’ve built so far and run it. Add the code below to <code>main.go</code> and run it with <code>go run main.go</code>:</p>
<pre><code class="lang-go">db, err := NewDB[<span class="hljs-keyword">string</span>, <span class="hljs-keyword">string</span>]()
<span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
    log.Fatalf(<span class="hljs-string">"Failed to create DB: %v"</span>, err)
}
db.Put(<span class="hljs-string">"a"</span>, <span class="hljs-string">"apple"</span>)
val, _ := db.Get(<span class="hljs-string">"a"</span>)
log.Printf(<span class="hljs-string">"Get('a') = %s (should be 'apple')"</span>, val)
</code></pre>
<p>Look at that – you have built an in-memory database that your clients can store data in and fetch it from. It’s using generics, so you can store any kind of values (ints, strings, objects).</p>
<p>Now say you shipped this solution and then it crashes. Your clients will lose all the data. Why will this crash? For one thing, memory is limited and at some point you’re going to run out of it. So there are two major problems with just in-memory storage:</p>
<ol>
<li><p>It's not durable.</p>
</li>
<li><p>Unbounded memory usage is going to crash the system.</p>
</li>
</ol>
<p>How do we solve these problems?</p>
<p>Here's a thought: what if we flush the MemTable data to disk at some regular interval? That way we can ensure that MemTable doesn't grow out of bounds. Also if the db crashes, we won’t lose all the data. We’ll still lose the data that hasn’t been flushed yet, but that's way better than losing all of it.</p>
<h3 id="heading-sstable-persisting-data-for-durability">SSTable: Persisting Data for Durability</h3>
<p>An SSTable is a sorted string table. I wish they’d called it a "Secondary Storage" table, but historically keys and values were strings – hence sorted string table. An SSTable is a persistent, ordered, and immutable file that stores key-value pairs. It’s a file stored on disk, so it's pretty clear that it’s persistent (durable).</p>
<p>Let’s discuss a couple key features of the SSTable:</p>
<ul>
<li><p><strong>It’s ordered</strong>: Keys are stored in sorted order, which makes searching faster and more efficient. Without that, you'd have to scan the whole file to find a key. Later, I will point out some code that leverages sorted storage.</p>
</li>
<li><p><strong>It’s immutable</strong>: Once an SSTable file is written, it can’t be modified. To update or delete a key, you must write a new record in a newer SSTable. This simplifies the design and makes reads and writes very predictable.</p>
</li>
</ul>
<p>But wait, how does that simplify the design?</p>
<p>One of the most complex things in software engineering is dealing with concurrency. Let’s say you’re writing to a file and another thread updates it underneath. How do you know you have the correct data?</p>
<p>With immutable design, you don't have to worry about this at all. You are 100% confident that the data you are reading has not been altered by anybody else. I’ll take that as a massive simplification: you don't have to deal with locks, starvation, staleness, and so on.</p>
<h4 id="heading-how-does-it-make-the-write-path-predictable">How does it make the write path predictable?</h4>
<p>I will answer this partially here and come back to it when we have completed some more implementation. You’ll see that every write in our code follows the exact same steps. There is not a single different condition or edge case.</p>
<p>In a traditional database (using a B-Tree), a typical write involves:</p>
<ol>
<li><p>Finding the data on disk.</p>
</li>
<li><p>Reading the block of data from disk into memory.</p>
</li>
<li><p>Modifying the data in memory.</p>
</li>
<li><p>Writing the entire block back to disk.</p>
</li>
</ol>
<p>The more steps, the more unpredictable performance gets: the write can be fast if the data is already in the memory cache, or slow if multiple disk seeks are needed.</p>
<p>Granted, our code is an overly simplified version, but the extension of this concept still stands true in real LSM implementations.</p>
<h4 id="heading-how-does-it-make-the-read-path-predictable">How does it make the read path predictable?</h4>
<p>Read is predictable because any number of threads can read the same SSTable file at the same time without any problem, with full confidence that data has not been updated.</p>
<p>In contrast, when reading from a mutable data structure, you have to worry that another thread might be in the process of changing the data you are trying to read.</p>
<p>To prevent this, B-Tree-based databases use complex locking mechanisms, and that adds overhead and unpredictability.</p>
<p>I should raise a caution here: reads in LSM-tree storage are not always predictable. They can be fast when data is served from memory, and very slow when multiple SSTables need to be searched to find the key.</p>
<p>Having said that, you don't have to worry about other performance bottlenecks because of locks. Meaning, in B-Tree storage, your read query can be slower because another write query is holding a lock. In simple low-concurrency use cases, you will mostly get amazing read performance from a B-Tree structure, but this advantage wears off as concurrency increases.</p>
<p>The LSM tree was built for highly concurrent, write-heavy use cases, and occasionally slower reads are the trade-off.</p>
<p>The takeaway that gives you ammunition to design better is that B-trees are better for read-heavy workloads. Reads are generally faster and more consistent, but performance can have unpredictable outliers under high write concurrency due to locking.</p>
<p>An LSM tree is better for write-heavy workloads. Writes are much faster. Reads are generally slower and more variable, but their performance profile is more predictable under high write concurrency because there is no read-write locking.</p>
<p>Let's implement an SSTable to see how it works.</p>
<p><strong>The write path:</strong></p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">writeSSTable</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(memtable *MemTable[K, V], path <span class="hljs-keyword">string</span>)</span> <span class="hljs-params">(*SSTable[K, V], error)</span></span> {
    file, err := os.Create(path)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    pairs := <span class="hljs-built_in">make</span>([]Pair[K, V], <span class="hljs-number">0</span>, <span class="hljs-built_in">len</span>(memtable.data))
    <span class="hljs-keyword">for</span> k, v := <span class="hljs-keyword">range</span> memtable.data {
        pairs = <span class="hljs-built_in">append</span>(pairs, Pair[K, V]{Key: k, Value: v})
    }

    sort.Slice(pairs, <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">(i, j <span class="hljs-keyword">int</span>)</span> <span class="hljs-title">bool</span></span> {
        <span class="hljs-keyword">return</span> any(pairs[i].Key).(<span class="hljs-keyword">string</span>) &lt; any(pairs[j].Key).(<span class="hljs-keyword">string</span>)
    })

    encoder := gob.NewEncoder(file)
    <span class="hljs-keyword">for</span> _, pair := <span class="hljs-keyword">range</span> pairs {
        <span class="hljs-keyword">if</span> err := encoder.Encode(pair); err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
        }
    }

    <span class="hljs-keyword">return</span> &amp;SSTable[K, V]{path: path}, <span class="hljs-literal">nil</span>
}
</code></pre>
<p>The following things are important to note from the above code:</p>
<ol>
<li><p><code>sort.Slice</code>: Remember I spoke about order earlier? So we store data in the SSTable in a sorted fashion, and we will see how we leverage it in the read path.</p>
</li>
<li><p>I have used Go's gob encoding package. An encoder makes life simpler because it streams Go data structures to and from a binary format that can be stored on disk. It handles all the complexity of representing types, field names, and values in a standardized binary encoding, so you don't have to.</p>
</li>
</ol>
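<p>To make the gob streaming behavior concrete, here is a standalone round-trip sketch (the minimal <code>Pair</code> here is a stand-in for this demo, not the article's full type): one encoder writes several values to a stream, and one decoder reads them back in the same order.</p>
<pre><code class="lang-go">package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Pair is a minimal key-value record for this demo.
type Pair struct {
	Key   string
	Value string
}

func main() {
	var buf bytes.Buffer

	// One encoder can stream many values; type info is written only once.
	enc := gob.NewEncoder(&amp;buf)
	for _, p := range []Pair{{"a", "apple"}, {"b", "banana"}} {
		if err := enc.Encode(p); err != nil {
			panic(err)
		}
	}

	// The decoder reads the same stream back in write order.
	dec := gob.NewDecoder(&amp;buf)
	for {
		var p Pair
		if err := dec.Decode(&amp;p); err != nil {
			break // io.EOF once the stream is exhausted
		}
		fmt.Printf("%s=%s\n", p.Key, p.Value)
	}
}
</code></pre>
<p>This streaming property is what lets the SSTable code call <code>Encode</code> and <code>Decode</code> in a simple loop.</p>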
<p><strong>The read path:</strong></p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(s *SSTable[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, error)</span></span> {
    file, err := os.Open(s.path)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">var</span> zero V
        <span class="hljs-keyword">return</span> zero, err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    decoder := gob.NewDecoder(file)

    <span class="hljs-keyword">for</span> {
        <span class="hljs-keyword">var</span> pair Pair[K, V]
        <span class="hljs-keyword">if</span> err := decoder.Decode(&amp;pair); err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">if</span> err == io.EOF {
                <span class="hljs-keyword">break</span>
            }
            <span class="hljs-keyword">var</span> zero V
            <span class="hljs-keyword">return</span> zero, err
        }

        <span class="hljs-comment">// for simple comparison we are assuming key is just string</span>
        keyInDB := any(pair.Key).(<span class="hljs-keyword">string</span>)
        <span class="hljs-keyword">if</span> keyInDB == any(key).(<span class="hljs-keyword">string</span>) {
            <span class="hljs-keyword">if</span> any(pair.Value).(<span class="hljs-keyword">string</span>) == TOMBSTONE {
                <span class="hljs-keyword">var</span> zero V
                <span class="hljs-keyword">return</span> zero, ErrDeleted
            }
            <span class="hljs-keyword">return</span> pair.Value, <span class="hljs-literal">nil</span>
        }

        <span class="hljs-keyword">if</span> keyInDB &gt; any(key).(<span class="hljs-keyword">string</span>) {
            <span class="hljs-keyword">var</span> zero V
            <span class="hljs-keyword">return</span> zero, ErrNotFound
        }
    }

    <span class="hljs-keyword">var</span> zero V
    <span class="hljs-keyword">return</span> zero, ErrNotFound
}
</code></pre>
<p>On the read path, look at <code>keyInDB &gt; any(key).(string)</code>. This is one of the examples of how we took advantage of storing data in a sorted key order. The moment we find a key in the SSTable <strong>that is greater than the key</strong> we are looking for, we stop looking because it’s obvious all other keys will be greater than this, so we won't find our key anymore.</p>
<p>Now that you have implemented the SSTable, you just have to decide when to flush data from the MemTable to the SSTable. You can just define max size for MemTable and flush it to disk on the write path when the max size is reached.</p>
<p>I am skipping some variables, boilerplate code, and simplifying things for brevity. I will post a GitHub link with the complete implementation later.</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> DB[K comparable, V any] <span class="hljs-keyword">struct</span> {
    memtable        *MemTable[K, V]
    maxMemtableSize <span class="hljs-keyword">int</span>
    memtableSize    <span class="hljs-keyword">int</span>
    sstables        []*SSTable[K, V]
    sstableCounter  <span class="hljs-keyword">int</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewDB</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(maxMemtableSize <span class="hljs-keyword">int</span>)</span> <span class="hljs-params">(*DB[K, V], error)</span></span> {
    sstables := <span class="hljs-built_in">make</span>([]*SSTable[K, V], <span class="hljs-number">0</span>)
    memtable := NewMemTable[K, V]()

    <span class="hljs-keyword">return</span> &amp;DB[K, V]{
        memtable:        memtable,
        maxMemtableSize: maxMemtableSize,
        sstables:        sstables,
    }, <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Put</span><span class="hljs-params">(key K, value V)</span> <span class="hljs-title">error</span></span> {
    db.memtable.Put(key, value)
    db.memtableSize++

    <span class="hljs-keyword">if</span> db.memtableSize &gt;= db.maxMemtableSize {
        <span class="hljs-keyword">if</span> err := db.flushMemtable(); err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">return</span> err
        }
    }

    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">flushMemtable</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    sstablePath := fmt.Sprintf(<span class="hljs-string">"data-%d.sstable"</span>, db.sstableCounter)
    sstable, err := writeSSTable(db.memtable, sstablePath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    db.sstables = <span class="hljs-built_in">append</span>(db.sstables, sstable)
    db.sstableCounter++
    db.memtable = NewMemTable[K, V]()
    db.memtableSize = <span class="hljs-number">0</span>

    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}
</code></pre>
<p>You'll notice that every time we flush to disk, we write to a new SSTable versus using a single SSTable for the whole database. This is the immutability aspect we discussed earlier.</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, error)</span></span> {
    <span class="hljs-keyword">if</span> val, ok := db.memtable.Get(key); ok {
        <span class="hljs-keyword">return</span> val, <span class="hljs-literal">nil</span>
    }

    <span class="hljs-keyword">for</span> i := <span class="hljs-built_in">len</span>(db.sstables) - <span class="hljs-number">1</span>; i &gt;= <span class="hljs-number">0</span>; i-- {
        sstable := db.sstables[i]
        val, err := sstable.Get(key)

        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">var</span> zero V

            <span class="hljs-keyword">if</span> err == ErrDeleted {
                <span class="hljs-keyword">return</span> zero, ErrNotFound
            }
            <span class="hljs-keyword">if</span> err == ErrNotFound {
                <span class="hljs-keyword">continue</span>
            }
            <span class="hljs-keyword">return</span> zero, err
        }

        <span class="hljs-keyword">return</span> val, <span class="hljs-literal">nil</span>
    }

    <span class="hljs-keyword">var</span> zero V
    <span class="hljs-keyword">return</span> zero, ErrNotFound
}
</code></pre>
<p>One important aspect to note on the read path is that we are reading the newest SSTable first. This is because the newest SSTable has the most updated value for the key.</p>
<p>So, say you have a key "a" with value "apple", and along the way you update that value to "apricot". The update is flushed to a new SSTable (for immutability), so if you were to read an older SSTable first, you'd get the stale value. By reading the newest SSTable first, we get the correct value, and we never have to worry about updating older SSTables.</p>
<h3 id="heading-the-wal-write-ahead-log-crash-recovery-made-simple">The WAL (Write-Ahead Log): Crash Recovery Made Simple</h3>
<p>Now that we have an SSTable, our data is durable and we are safe from losing data upon crashes. Are we really safe, though? Think of a scenario where a crash happens before we flush to the SSTable. We know MemTable has a max threshold, and until then, data lives in memory. So we’re still prone to losing data if a crash happens before the flush.</p>
<p>This is where the WAL (Write Ahead Log) comes into the picture. It’s the single most important aspect of the LSM tree.</p>
<p>We’ll follow a simple rule: "Before we write a piece of data to the in-memory MemTable, we first write it to a log file on disk."</p>
<p>If a crash happens and the database starts again, the first thing it does is look for a WAL, read it if one is found, and replay all the data into MemTable. This process reconstructs the MemTable to the exact state it was in right before the crash.</p>
<p>It's natural to think that if all of your writes are first written to disk it will impact performance. You aren’t wrong, but at the same time there are nuances.</p>
<p>The writes to WAL are different in that they are append-only sequential writes, meaning random disk seeks are not required. On a traditional spinning hard drive (HDD), this is fast because the disk's read/write head does not have to move to a new location. On a modern solid-state drive (SSD), sequential writes are also much faster than random writes.</p>
<p>Whatever small performance impact we accept is a trade-off for durability.</p>
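<p>One nuance worth knowing: an ordinary file write only hands data to the OS, which may buffer it in the page cache, so a power failure can still lose the newest WAL entries. Durable systems call <code>fsync</code> (Go's <code>File.Sync</code>) after appending. Here's a minimal sketch of the idea (the file name and record format are illustrative only):</p>
<pre><code class="lang-go">package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	path := filepath.Join(os.TempDir(), "demo.wal")
	defer os.Remove(path)

	// Append-only open, with the same flags a WAL typically uses.
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if _, err := f.Write([]byte("put a apple\n")); err != nil {
		panic(err)
	}

	// Sync forces the page cache out to the device; without it the
	// write is only as durable as the OS buffers.
	if err := f.Sync(); err != nil {
		panic(err)
	}
	fmt.Println("entry durably appended")
}
</code></pre>
<p>Syncing on every write maximizes durability but costs throughput; many databases let you trade this off by syncing in batches or on a timer.</p>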
<p>Now that we know what WAL does, let's implement it. Two key functions of WAL are to write to a file on disk and replay MemTable upon start.</p>
<p>Note that in the factory function below (NewWAL), the file has been opened in append mode.</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewWAL</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(path <span class="hljs-keyword">string</span>)</span> <span class="hljs-params">(*WAL[K, V], error)</span></span> {
    file, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, <span class="hljs-number">0644</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }
    <span class="hljs-keyword">return</span> &amp;WAL[K, V]{
        file:    file,
        encoder: gob.NewEncoder(file),
    }, <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(wal *WAL[K, V])</span> <span class="hljs-title">Write</span><span class="hljs-params">(key K, value V)</span> <span class="hljs-title">error</span></span> {
    entry := WALEntry[K, V]{Key: key, Value: value}
    <span class="hljs-keyword">return</span> wal.encoder.Encode(&amp;entry)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">ReplayWAL</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(path <span class="hljs-keyword">string</span>)</span> <span class="hljs-params">(*MemTable[K, V], error)</span></span> {
    memtable := NewMemTable[K, V]()
    file, err := os.Open(path)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">if</span> os.IsNotExist(err) {
            <span class="hljs-comment">// If the file doesn't exist, that's fine. Return an empty memtable.</span>
            <span class="hljs-keyword">return</span> memtable, <span class="hljs-literal">nil</span>
        }
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    decoder := gob.NewDecoder(file)
    <span class="hljs-keyword">for</span> {
        <span class="hljs-keyword">var</span> entry WALEntry[K, V]
        <span class="hljs-keyword">if</span> err := decoder.Decode(&amp;entry); err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">if</span> err == io.EOF {
                <span class="hljs-keyword">break</span> <span class="hljs-comment">// We've reached the end of the file.</span>
            }
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
        }
        memtable.Put(entry.Key, entry.Value)
    }

    <span class="hljs-keyword">return</span> memtable, <span class="hljs-literal">nil</span>
}
</code></pre>
<p>A couple notes about the above code:</p>
<ul>
<li><p><strong>NewWAL</strong>: This function creates an instance of the WAL for our database. It takes in the file path where the WAL data should be stored and opens the file using Go’s <code>os.OpenFile</code> function. Also, a <code>gob.Encoder</code> is initialized to simplify the encoding of Go data structures into binary format for efficient storage in the WAL file.</p>
</li>
<li><p><strong>Write</strong>: The Write function appends a new key-value pair to the WAL file. Every write operation to the MemTable first calls this function to ensure the update is durably recorded.</p>
</li>
<li><p><strong>ReplayWAL:</strong> This is the most important function. In the event of a crash, it comes to our rescue by reconstructing the MemTable from the WAL file: it replays the entries stored in the WAL and writes them into the MemTable. Here's how it works:</p>
<ol>
<li><p>The function begins by creating a new empty MemTable instance that will be populated with key-value pairs.</p>
</li>
<li><p>It then attempts to open the WAL file. If the file does not exist (for example, on the very first startup), the function assumes there's nothing to recover and simply returns the empty MemTable.</p>
</li>
<li><p>A <code>gob.Decoder</code> is used to read the WAL file, which helps to deserialize the saved binary-encoded <code>WALEntry</code> data back into key-value pairs.</p>
</li>
<li><p>For each successfully decoded <code>WALEntry</code>, the key-value pair is added back into the MemTable using the <code>Put</code> function.</p>
</li>
</ol>
</li>
</ul>
<p>With this, the database can fully recover its state by replaying all the operations recorded in the WAL.</p>
<p>As far as integration is concerned, every time you create a new DB, you should replay any existing WAL and then open the WAL in append mode. Also, Put should write to the WAL first.</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewDB</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(maxMemtableSize <span class="hljs-keyword">int</span>)</span> <span class="hljs-params">(*DB[K, V], error)</span></span> {
    walPath := <span class="hljs-string">"db.wal"</span>
    memtable, err := ReplayWAL[K, V](walPath) <span class="hljs-comment">// this is the replay</span>
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }
    <span class="hljs-comment">//open WAL in append mode</span>
    wal, err := NewWAL[K, V](walPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    <span class="hljs-keyword">return</span> &amp;DB[K, V]{
        memtable:        memtable,
        maxMemtableSize: maxMemtableSize,
        memtableSize:    <span class="hljs-built_in">len</span>(memtable.data),
        wal:             wal,
        walPath:         walPath,
        sstables:        <span class="hljs-built_in">make</span>([]*SSTable[K, V], <span class="hljs-number">0</span>),
    }, <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Put</span><span class="hljs-params">(key K, value V)</span> <span class="hljs-title">error</span></span> {
<span class="hljs-comment">//first write to WAL</span>
    <span class="hljs-keyword">if</span> err := db.wal.Write(key, value); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    db.memtable.Put(key, value)
    db.memtableSize++

    <span class="hljs-keyword">if</span> db.memtableSize &gt;= db.maxMemtableSize {
        <span class="hljs-keyword">if</span> err := db.flushMemtable(); err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">return</span> err
        }
    }

    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}
</code></pre>
<h3 id="heading-manifest-file-tracking-the-state-of-the-database">Manifest File: Tracking the State of the Database</h3>
<p>By this point, the database is pretty robust and durable, but an important question lingers: upon restarts, how does our database know about SSTables? Knowing about all SSTables is important for fetching data.</p>
<p>So say our database crashed after writing several SSTables. Without knowing about these SSTables, the database will create a new slice of SSTables and all of our old data is gone – queries won't read those files.</p>
<p>To solve this problem, we introduce an inventory of SSTables called MANIFEST. Every time we successfully create a new SSTable in flushMemtable, we add its path to the MANIFEST and save the MANIFEST to disk.</p>
<p>The very first thing NewDB does on startup is read the MANIFEST. This gives it the list of all the file paths, and it uses this list to perfectly reconstruct its SSTables slice.</p>
<p>In short, MANIFEST determines the state of the DB.</p>
<p>The Manifest struct contains a slice of SSTable paths. The Read function reads the MANIFEST file to restore knowledge of the SSTables, and the Write function writes a new manifest file.</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> Manifest <span class="hljs-keyword">struct</span> {
    SSTablePaths []<span class="hljs-keyword">string</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">ReadManifest</span><span class="hljs-params">(path <span class="hljs-keyword">string</span>)</span> <span class="hljs-params">(*Manifest, error)</span></span> {
    file, err := os.Open(path)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">if</span> os.IsNotExist(err) {
            <span class="hljs-comment">// If manifest doesn't exist, return empty manifest</span>
            <span class="hljs-keyword">return</span> &amp;Manifest{SSTablePaths: []<span class="hljs-keyword">string</span>{}}, <span class="hljs-literal">nil</span>
        }
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    <span class="hljs-keyword">var</span> manifest Manifest
    decoder := gob.NewDecoder(file)
    err = decoder.Decode(&amp;manifest)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    <span class="hljs-keyword">return</span> &amp;manifest, <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">WriteManifest</span><span class="hljs-params">(path <span class="hljs-keyword">string</span>, manifest *Manifest)</span> <span class="hljs-title">error</span></span> {
    tmpPath := path + <span class="hljs-string">".tmp"</span>
    file, err := os.Create(tmpPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    encoder := gob.NewEncoder(file)
    <span class="hljs-keyword">if</span> err := encoder.Encode(manifest); err != <span class="hljs-literal">nil</span> {
        file.Close()
        os.Remove(tmpPath)
        <span class="hljs-keyword">return</span> err
    }

    <span class="hljs-keyword">if</span> err := file.Close(); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-comment">// Atomic Rename</span>
    <span class="hljs-keyword">return</span> os.Rename(tmpPath, path)
}
</code></pre>
<p>You'll notice that we aren’t modifying the existing MANIFEST file directly. Instead, we’re creating a temporary file, writing all the data to it, closing it, and then atomically renaming it to replace the old MANIFEST.</p>
<p>The <code>os.Rename()</code> operation is atomic on most filesystems, meaning it either completely succeeds or completely fails – there's no in-between state. This is crucial because if the system crashes while updating the MANIFEST, we need to ensure we don't end up with a corrupted file. We’ll discuss this again below when we’re talking about compaction.</p>
<p>With this approach, we either have the old valid MANIFEST or the new valid MANIFEST, never a partially written corrupted file.</p>
<p>From an integration standpoint, NewDB reads the manifest and builds its SSTable slice from it. The flush method, since it writes a new SSTable, also records that SSTable's path in the manifest so the DB stays aware of every SSTable.</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> DB[K comparable, V any] <span class="hljs-keyword">struct</span> {
    memtable        *MemTable[K, V]
    maxMemtableSize <span class="hljs-keyword">int</span>
    memtableSize    <span class="hljs-keyword">int</span>
    sstables        []*SSTable[K, V]
    sstableCounter  <span class="hljs-keyword">int</span>
    wal             *WAL[K, V]
    walPath         <span class="hljs-keyword">string</span>
    manifest        *Manifest
    manifestPath    <span class="hljs-keyword">string</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewDB</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(maxMemtableSize <span class="hljs-keyword">int</span>)</span> <span class="hljs-params">(*DB[K, V], error)</span></span> {
    walPath := <span class="hljs-string">"db.wal"</span>
    memtable, err := ReplayWAL[K, V](walPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    wal, err := NewWAL[K, V](walPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    manifestPath := <span class="hljs-string">"MANIFEST"</span>
    manifest, err := ReadManifest(manifestPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    sstables := <span class="hljs-built_in">make</span>([]*SSTable[K, V], <span class="hljs-built_in">len</span>(manifest.SSTablePaths))
    <span class="hljs-keyword">for</span> i, path := <span class="hljs-keyword">range</span> manifest.SSTablePaths {
        sstables[i] = &amp;SSTable[K, V]{path: path}
    }

    <span class="hljs-keyword">return</span> &amp;DB[K, V]{
        memtable:        memtable,
        maxMemtableSize: maxMemtableSize,
        memtableSize:    <span class="hljs-built_in">len</span>(memtable.data),
        wal:             wal,
        walPath:         walPath,
        manifest:        manifest,
        manifestPath:    manifestPath,
        sstables:        sstables,
        sstableCounter:  <span class="hljs-built_in">len</span>(sstables), <span class="hljs-comment">// resume numbering so existing files aren't overwritten</span>
    }, <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">flushMemtable</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    sstablePath := fmt.Sprintf(<span class="hljs-string">"data-%d.sstable"</span>, db.sstableCounter)
    sstable, err := writeSSTable(db.memtable, sstablePath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    db.sstables = <span class="hljs-built_in">append</span>(db.sstables, sstable)
    db.sstableCounter++

    db.manifest.SSTablePaths = <span class="hljs-built_in">append</span>(db.manifest.SSTablePaths, sstablePath)
    <span class="hljs-keyword">if</span> err := WriteManifest(db.manifestPath, db.manifest); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    db.memtable = NewMemTable[K, V]()
    db.memtableSize = <span class="hljs-number">0</span>

    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}
</code></pre>
<p>At this point, our DB has almost everything. It can write to memory (MemTable), persist to disk (SSTables), and recover from crashes (WAL and manifest). For completeness, it should also support updates and deletes – so let’s look at those next.</p>
<h3 id="heading-update-and-delete-handling-mutability-in-an-immutable-system">Update and Delete: Handling Mutability in an Immutable System</h3>
<p>By this time, you should know that in an LSM storage system, data is never updated – rather, new data is written. For example, if you have a data pair ("a": "apple") and over time this has to change to the pair ("a": "apricot"), a new pair will be written to a different SSTable without any change to the existing pair. And yes, this leads to duplicates.</p>
<p>Also, interestingly, data isn't even deleted during write operations. The reason is that, in the traditional sense, deleting ("a": "apple") would mean finding where it lives on disk and removing it, which makes writes slow. So a clever mechanism is used instead: rather than removing the data directly, you mark the key as deleted by writing a special <code>TOMBSTONE</code> value.</p>
<p>So, in the case of deleting (a : apple), you wouldn't remove the key from any SSTable. Instead, you’d write a new key-value pair such as ("a": "TOMBSTONE"). Here’s what this achieves:</p>
<ul>
<li><p>The <code>"TOMBSTONE"</code> serves as a marker within the SSTable, telling the system that the key <code>"a"</code> has been logically deleted, even though it still physically exists in older SSTables.</p>
</li>
<li><p>During future reads, any value associated with <code>"TOMBSTONE"</code> will be treated as deleted, ensuring that the entry no longer shows up in query results.</p>
</li>
<li><p>This mechanism avoids the need for immediate deletions or expensive in-place updates, making write operations faster and simpler.</p>
</li>
</ul>
<p>But this also raises the following questions:</p>
<ol>
<li><p>How do you read accurately when there are duplicates? That is, how do users get ("a": "apricot") instead of ("a": "apple"), given that the former is the latest value?</p>
</li>
<li><p>How do you handle deletes to ensure deleted keys are not returned (and instead, a proper error message is returned)?</p>
</li>
<li><p>This stale and deleted data is garbage. How do you get rid of it to save storage space?</p>
</li>
</ol>
<p>As long as data is in MemTable (in-memory map), the duplicates are easy to handle: new values will just replace the old values.</p>
<p>But it gets tricky when data is in multiple SSTables. There is a very simple solution to this problem, and that is to just read the newer SSTable before older ones. That way, you will always read the latest value for a given key and exit early.</p>
<p>The following code in the read path ensures reading from newer SSTables before moving to older ones (note the loop starts from <code>len(db.sstables) - 1</code>):</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, error)</span></span> {
    <span class="hljs-comment">// Check memtable first</span>
    <span class="hljs-keyword">if</span> val, ok := db.memtable.Get(key); ok {
        <span class="hljs-keyword">if</span> any(val).(<span class="hljs-keyword">string</span>) == TOMBSTONE {
            <span class="hljs-keyword">var</span> zero V
            <span class="hljs-keyword">return</span> zero, ErrNotFound
        }
        <span class="hljs-keyword">return</span> val, <span class="hljs-literal">nil</span>
    }

    <span class="hljs-comment">// Then check sstables from newest to oldest</span>
    <span class="hljs-keyword">for</span> i := <span class="hljs-built_in">len</span>(db.sstables) - <span class="hljs-number">1</span>; i &gt;= <span class="hljs-number">0</span>; i-- {
        sstable := db.sstables[i]
        val, err := sstable.Get(key)

        <span class="hljs-keyword">if</span> err == <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">return</span> val, <span class="hljs-literal">nil</span>
        }

        <span class="hljs-keyword">var</span> zero V
        <span class="hljs-keyword">if</span> err == ErrDeleted {
            <span class="hljs-keyword">return</span> zero, ErrNotFound
        }
        <span class="hljs-keyword">if</span> err == ErrNotFound {
            <span class="hljs-keyword">continue</span>
        }
        <span class="hljs-keyword">return</span> zero, err
    }

    <span class="hljs-keyword">var</span> zero V
    <span class="hljs-keyword">return</span> zero, ErrNotFound
}
</code></pre>
<p>And for delete, you could just add a new value "TOMBSTONE":</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Delete</span><span class="hljs-params">(key K)</span> <span class="hljs-title">error</span></span> {
    <span class="hljs-keyword">return</span> db.Put(key, any(TOMBSTONE).(V))
}
</code></pre>
<p>Note: This implementation assumes V is a string type. In a production system, you would need a more robust way to handle tombstones that works with any value type.</p>
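<p>As one sketch of such a robust approach (the <code>Entry</code> wrapper and names here are my own, not part of the article's code), you can wrap values in a record carrying an explicit deletion flag, so no sentinel value can ever collide with real data:</p>
<pre><code class="lang-go">package main

import "fmt"

// Entry wraps a value with an explicit tombstone flag, so deletion
// does not depend on V being a string (hypothetical alternative design).
type Entry[V any] struct {
	Value   V
	Deleted bool
}

// memtable maps keys to entries; Delete stores Entry{Deleted: true}.
type memtable[K comparable, V any] map[K]Entry[V]

func (m memtable[K, V]) Put(k K, v V) { m[k] = Entry[V]{Value: v} }
func (m memtable[K, V]) Delete(k K)   { m[k] = Entry[V]{Deleted: true} }

func (m memtable[K, V]) Get(k K) (V, bool) {
	e, ok := m[k]
	if !ok || e.Deleted {
		var zero V
		return zero, false
	}
	return e.Value, true
}

func main() {
	m := memtable[string, int]{}
	m.Put("a", 1)
	m.Delete("a")
	_, ok := m.Get("a")
	fmt.Println(ok) // false: the key is logically deleted
}
</code></pre>
<p>The same wrapper would be what gets gob-encoded into SSTables, so tombstones survive flushes without any string assumption.</p>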
<p>Handling deleted keys becomes simple now. You can check for the value (in MemTable and SSTable) and return an error if the value is "TOMBSTONE":</p>
<pre><code class="lang-go"><span class="hljs-comment">// db.go</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, error)</span></span> {
    <span class="hljs-keyword">if</span> val, ok := db.memtable.Get(key); ok {
        <span class="hljs-keyword">if</span> any(val).(<span class="hljs-keyword">string</span>) == TOMBSTONE { <span class="hljs-comment">//got TOMBSTONE, return zero</span>
            <span class="hljs-keyword">var</span> zero V
            <span class="hljs-keyword">return</span> zero, ErrNotFound
        }
        <span class="hljs-keyword">return</span> val, <span class="hljs-literal">nil</span>
    }
    <span class="hljs-comment">// ... rest of function</span>
}
</code></pre>
<pre><code class="lang-go"><span class="hljs-comment">// sstable.go</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(s *SSTable[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, error)</span></span> {
    <span class="hljs-comment">// ... earlier code</span>

    keyInDB := any(pair.Key).(<span class="hljs-keyword">string</span>)
    <span class="hljs-keyword">if</span> keyInDB == any(key).(<span class="hljs-keyword">string</span>) {
        <span class="hljs-keyword">if</span> any(pair.Value).(<span class="hljs-keyword">string</span>) == TOMBSTONE {
            <span class="hljs-keyword">var</span> zero V
            <span class="hljs-keyword">return</span> zero, ErrDeleted
        }
        <span class="hljs-keyword">return</span> pair.Value, <span class="hljs-literal">nil</span>
    }

    <span class="hljs-comment">// ... rest of function</span>
}
</code></pre>
<h3 id="heading-compaction-cleaning-up-stale-and-deleted-data">Compaction: Cleaning Up Stale and Deleted Data</h3>
<p>We have handled all the scenarios so far except for one. It doesn't concern serving read/write traffic, but it's important for the health of the storage system.</p>
<p>Over time, the system has developed a lot of garbage (stale, deleted data) and needs a garbage collection mechanism. Compaction is a background maintenance process that cleans up and reorganizes data in an LSM storage system.</p>
<p>As the system grows, multiple SSTables accumulate, so a single read may need several file operations to find a value. By compacting (or merging) multiple SSTables into one, you reduce that disk operation overhead. Along the way, you should also permanently delete data that has been TOMBSTONED.</p>
<p>Note: Compaction is the only time data is permanently deleted from an LSM storage system.</p>
<p>To grasp the concept of compaction, we're going to implement something called <code>Full Compaction</code>, where you merge all the existing SSTables into one larger SSTable. Real-world database implementations use more complex strategies that involve multiple levels of compaction.</p>
<h4 id="heading-compaction-algorithm">Compaction Algorithm</h4>
<p>We’re going to implement <code>K-way merge</code> to perform compaction. It’s a general algorithm that takes K sorted lists and merges them into a single, combined sorted list. In this case, the K sorted lists are the SSTables, and you are going to merge all of them into a single SSTable.</p>
<p>Our SSTables are already sorted, so the idea of merging them involves:</p>
<ol>
<li><p>Taking the smallest (first) keys from each SSTable</p>
</li>
<li><p>Finding the smallest among those keys</p>
</li>
<li><p>Storing the found smallest key into new SSTable file</p>
</li>
<li><p>Fetching next key from the SSTable the smallest key belongs to</p>
</li>
<li><p>Repeating this process for all SSTables</p>
</li>
</ol>
<p>Here’s a simple example with numbers:</p>
<pre><code class="lang-bash">Assume we have 3 sorted lists:
List A : [4, 8, 12]
List B : [3, 9]
List C : [7, 10, 11]

In the first iteration, we will take (4, 3, 7) because those are the smallest keys <span class="hljs-keyword">for</span> individual lists. 
We find the smallest among those, <span class="hljs-built_in">which</span> is 3, and store 3 <span class="hljs-keyword">in</span> the result list.

In the second iteration, we will take (4, 9, 7). Note that 3 has already been accounted <span class="hljs-keyword">for</span>. 
We pick 4 and store it to the result list.

Repeating this until all lists are empty, we get:
Result List : [3, 4, 7, 8, 9, 10, 11, 12]
</code></pre>
<p>The core part of this algorithm is finding the smallest key among the smallest keys from the individual SSTables. Fortunately, there's a data structure called a <code>Min-Heap</code> that does exactly this. You take the smallest key from each SSTable and push them all onto a Min-Heap, which then yields the smallest among them. We'll leverage Go's <code>container/heap</code> package, which provides the heap algorithms that keep the minimum value at the top of the heap.</p>
<p>The Min-Heap needs you to provide a function that determines which of two keys is smaller, since it uses that comparison to find the global minimum. The following function implements it:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(h MinHeap[K, V])</span> <span class="hljs-title">Less</span><span class="hljs-params">(i, j <span class="hljs-keyword">int</span>)</span> <span class="hljs-title">bool</span></span> {
    <span class="hljs-comment">// again for simple comparison assume string key</span>
    keyI := any(h[i].Pair.Key).(<span class="hljs-keyword">string</span>)
    keyJ := any(h[j].Pair.Key).(<span class="hljs-keyword">string</span>)
    <span class="hljs-keyword">if</span> keyI != keyJ {
        <span class="hljs-keyword">return</span> keyI &lt; keyJ
    }
    <span class="hljs-comment">// this is needed for the case when you have duplicate keys,</span>
    <span class="hljs-comment">// you will want to pick the one that is in newer sstable because that is latest</span>
    <span class="hljs-keyword">return</span> h[i].SSTableIndex &gt; h[j].SSTableIndex
}
</code></pre>
<p>One important aspect of the <code>Less</code> function shown above is how it handles ties. If we have two pairs with the same key, which one is lesser? Assume two pairs, <code>(a: apple)</code> and <code>(a: apricot)</code>, where (a: apple) is the older value (written to an older SSTable). Which pair should the Less function treat as the lesser value?</p>
<p>The answer is the one in the newer SSTable (see <code>h[i].SSTableIndex &gt; h[j].SSTableIndex</code>). This ensures that the pair from the SSTable with the higher index (that is, the latest) becomes the lesser value, so (a: apricot) wins. It is important to always get the newest value of a given key.</p>
<p>The code for compaction looks something like the following. Note that we’re discarding deleted values (TOMBSTONE) and the older values.</p>
<pre><code class="lang-go"><span class="hljs-comment">// put this in a new file compaction.go</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">MergeSSTables</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(sstables []*SSTable[K, V], newPath <span class="hljs-keyword">string</span>)</span> <span class="hljs-params">(*SSTable[K, V], error)</span></span> {
    newFile, err := os.Create(newPath) <span class="hljs-comment">// create a new sstable file</span>
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }
    <span class="hljs-keyword">defer</span> newFile.Close() <span class="hljs-comment">// ensure the file is closed, avoiding a file descriptor leak</span>

    newEncoder := gob.NewEncoder(newFile) <span class="hljs-comment">// initialize encoder for new SSTable file</span>


    files := <span class="hljs-built_in">make</span>([]*os.File, <span class="hljs-built_in">len</span>(sstables)) <span class="hljs-comment">// open all the sstables</span>
    decoders := <span class="hljs-built_in">make</span>([]*gob.Decoder, <span class="hljs-built_in">len</span>(sstables)) <span class="hljs-comment">// initialize one decoder per sstable file</span>
    <span class="hljs-keyword">for</span> i, sstable := <span class="hljs-keyword">range</span> sstables {
        files[i], err = os.Open(sstable.path)
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
        }
        <span class="hljs-keyword">defer</span> files[i].Close() <span class="hljs-comment">// ensure each file is closed, avoiding a file descriptor leak</span>
        decoders[i] = gob.NewDecoder(files[i])
    }

    <span class="hljs-comment">// read first pair from each sstable and store in a pair array</span>
    pairs := <span class="hljs-built_in">make</span>([]Pair[K, V], <span class="hljs-built_in">len</span>(decoders))
    emptySSTables := <span class="hljs-built_in">make</span>([]<span class="hljs-keyword">bool</span>, <span class="hljs-built_in">len</span>(decoders)) <span class="hljs-comment">// track empty sstables</span>
    <span class="hljs-keyword">for</span> i, decoder := <span class="hljs-keyword">range</span> decoders {
        <span class="hljs-keyword">if</span> err := decoder.Decode(&amp;pairs[i]); err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">if</span> err == io.EOF {
                emptySSTables[i] = <span class="hljs-literal">true</span>
                <span class="hljs-keyword">continue</span>
            }
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
        }
    }

    <span class="hljs-comment">// push those pairs onto heap</span>
    h := &amp;MinHeap[K, V]{}
    <span class="hljs-keyword">for</span> i, pair := <span class="hljs-keyword">range</span> pairs {
        <span class="hljs-keyword">if</span> !emptySSTables[i] {
            heap.Push(h, &amp;HeapItem[K, V]{Pair: pair, SSTableIndex: i})
        }
    }

    <span class="hljs-comment">// init the min-heap calculation algorithm from container/heap package</span>
    heap.Init(h)

    <span class="hljs-keyword">var</span> lastKey K
    firstKey := <span class="hljs-literal">true</span>

    <span class="hljs-comment">// pop the min item from heap and store it into new sstable</span>
    <span class="hljs-keyword">for</span> h.Len() &gt; <span class="hljs-number">0</span> {
        item := heap.Pop(h).(*HeapItem[K, V])

        <span class="hljs-comment">// If this key is a duplicate of the last one we saw, skip it</span>
        <span class="hljs-keyword">if</span> !firstKey &amp;&amp; item.Pair.Key == lastKey {
            <span class="hljs-comment">// We only care about the version from the newest SSTable,</span>
            <span class="hljs-comment">// which we have already processed</span>
        } <span class="hljs-keyword">else</span> {
            <span class="hljs-keyword">if</span> any(item.Pair.Value).(<span class="hljs-keyword">string</span>) != TOMBSTONE {
                <span class="hljs-comment">// discard deleted</span>
                <span class="hljs-keyword">if</span> err := newEncoder.Encode(item.Pair); err != <span class="hljs-literal">nil</span> {
                    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
                }
            }
        }

        lastKey = item.Pair.Key
        firstKey = <span class="hljs-literal">false</span>

        <span class="hljs-comment">// Push the next item from the same SSTable into the heap</span>
        <span class="hljs-keyword">var</span> nextPair Pair[K, V]
        <span class="hljs-keyword">if</span> err := decoders[item.SSTableIndex].Decode(&amp;nextPair); err == <span class="hljs-literal">nil</span> {
            heap.Push(h, &amp;HeapItem[K, V]{Pair: nextPair, SSTableIndex: item.SSTableIndex})
        } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> err != io.EOF {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
        }
    }

    <span class="hljs-keyword">return</span> &amp;SSTable[K, V]{path: newPath}, <span class="hljs-literal">nil</span>
}
</code></pre>
<p>All the compaction magic has been packed in one function, <code>MergeSSTables</code>. The function has the following logical steps (and you can check the inline comments in the code to follow along):</p>
<ol>
<li><p>We create a new destination SSTable file and initialize corresponding <code>gob.Encoder</code></p>
</li>
<li><p>We open all the existing SSTable files and store their handles in the <code>files</code> array. We also initialize one <code>gob.Decoder</code> per existing SSTable file. To prevent resource leaks, a <code>defer</code> statement ensures that each file is closed once the function completes its work.</p>
</li>
<li><p>Each <code>decoder</code> reads the first key-value pair from its corresponding SSTable and stores it in the <code>pairs</code> array.</p>
</li>
<li><p>SSTables that are already exhausted (for example, are empty or have hit the end of the file) are marked as such in the <code>emptySSTables</code> slice, and we skip pushing them onto the heap.</p>
</li>
<li><p>We push each pair from the pairs array to <code>Min-Heap</code> and then initialize the <code>Min-Heap</code> calculation algorithm. This algorithm is present in Go’s <code>container/heap</code> package.</p>
</li>
<li><p>Each time the smallest key-value pair is popped from the min-heap, it’s compared with the previously processed key (<code>lastKey</code>). Duplicate keys (those whose values are already written) are skipped.</p>
</li>
<li><p>Values marked with a <code>"TOMBSTONE"</code> (logically deleted entries) are ignored and not written to the new SSTable, effectively cleaning up deleted data.</p>
</li>
<li><p>To continue the merge, the next key-value pair from the same SSTable (as the one we just processed) is read and pushed onto the heap, unless the end of the SSTable (<code>io.EOF</code>) has been reached.</p>
</li>
</ol>
<p>To integrate this with the DB, you could use a compaction threshold and trigger compaction as part of the flush when this threshold is reached:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> DB[K comparable, V any] <span class="hljs-keyword">struct</span> {
    memtable            *MemTable[K, V]
    maxMemtableSize     <span class="hljs-keyword">int</span>
    memtableSize        <span class="hljs-keyword">int</span>
    sstables            []*SSTable[K, V]
    sstableCounter      <span class="hljs-keyword">int</span>
    wal                 *WAL[K, V]
    walPath             <span class="hljs-keyword">string</span>
    manifest            *Manifest
    manifestPath        <span class="hljs-keyword">string</span>
    compactionThreshold <span class="hljs-keyword">int</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewDB</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(maxMemtableSize <span class="hljs-keyword">int</span>, compactionThreshold <span class="hljs-keyword">int</span>)</span> <span class="hljs-params">(*DB[K, V], error)</span></span> {
    walPath := <span class="hljs-string">"db.wal"</span>
    memtable, err := ReplayWAL[K, V](walPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    wal, err := NewWAL[K, V](walPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    manifestPath := <span class="hljs-string">"MANIFEST"</span>
    manifest, err := ReadManifest(manifestPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    sstables := <span class="hljs-built_in">make</span>([]*SSTable[K, V], <span class="hljs-built_in">len</span>(manifest.SSTablePaths))
    <span class="hljs-keyword">for</span> i, path := <span class="hljs-keyword">range</span> manifest.SSTablePaths {
        sstables[i] = &amp;SSTable[K, V]{path: path}
    }

    <span class="hljs-keyword">return</span> &amp;DB[K, V]{
        wal:                 wal,
        walPath:             walPath,
        memtable:            memtable,
        memtableSize:        <span class="hljs-built_in">len</span>(memtable.data),
        maxMemtableSize:     maxMemtableSize,
        manifestPath:        manifestPath,
        manifest:            manifest,
        sstables:            sstables,
        compactionThreshold: compactionThreshold,
    }, <span class="hljs-literal">nil</span>
}

<span class="hljs-comment">// a new compact function</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Compact</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    compactedSSTablePath := fmt.Sprintf(<span class="hljs-string">"data-compacted-%d.sstable"</span>, db.sstableCounter)
    compactedSSTable, err := MergeSSTables(db.sstables, compactedSSTablePath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-comment">// write new SSTable to MANIFEST file</span>
    db.manifest.SSTablePaths = []<span class="hljs-keyword">string</span>{compactedSSTablePath}
    <span class="hljs-keyword">if</span> err := WriteManifest(db.manifestPath, db.manifest); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-comment">//note delete only after writing manifest</span>
    <span class="hljs-keyword">for</span> _, sstable := <span class="hljs-keyword">range</span> db.sstables {
        <span class="hljs-keyword">if</span> err := os.Remove(sstable.path); err != <span class="hljs-literal">nil</span> {
            log.Printf(<span class="hljs-string">"Failed to remove old sstable %s: %v"</span>, sstable.path, err)
        }
    }

    db.sstables = []*SSTable[K, V]{compactedSSTable}
    db.sstableCounter++

    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">flushMemtable</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    sstablePath := fmt.Sprintf(<span class="hljs-string">"data-%d.sstable"</span>, db.sstableCounter)
    sstable, err := writeSSTable(db.memtable, sstablePath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    db.sstables = <span class="hljs-built_in">append</span>(db.sstables, sstable)
    db.sstableCounter++

    db.manifest.SSTablePaths = <span class="hljs-built_in">append</span>(db.manifest.SSTablePaths, sstablePath)
    <span class="hljs-keyword">if</span> err := WriteManifest(db.manifestPath, db.manifest); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    db.memtable = NewMemTable[K, V]()
    db.memtableSize = <span class="hljs-number">0</span>

    <span class="hljs-comment">// trigger compaction</span>
    <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(db.sstables) &gt;= db.compactionThreshold {
        <span class="hljs-keyword">if</span> err := db.Compact(); err != <span class="hljs-literal">nil</span> {
            log.Printf(<span class="hljs-string">"Compaction failed: %v"</span>, err)
            <span class="hljs-keyword">return</span> err
        }
    }

    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}
</code></pre>
<p>Notice the <code>Compact()</code> function in the integrated DB code? This is where we invoke the previously defined <code>MergeSSTables</code> function to trigger the compaction process. After invoking <code>MergeSSTables</code>, we record the new SSTable in the MANIFEST file and only then delete the older SSTables.</p>
<p>Previously, in the <a class="post-section-overview" href="#heading-manifest-file-tracking-the-state-of-the-database">Manifest File: Tracking the State of the Database</a> section, I spoke about atomic renaming with <code>os.Rename(tmpPath, path)</code>. Let's talk about why the atomic renaming of the MANIFEST matters for compaction.</p>
<p>During compaction, we're making a major change to the database state: replacing multiple SSTables with a single compacted one. The MANIFEST update is critical here because it's the source of truth for which SSTables exist.</p>
<p>Let’s think about what could go wrong without atomic renaming:</p>
<ol>
<li><p>You start writing the new MANIFEST (which points to the compacted SSTable)</p>
</li>
<li><p>System crashes mid-write</p>
</li>
<li><p>MANIFEST is corrupted and unreadable</p>
</li>
<li><p>On restart, the database has no idea which SSTables exist</p>
</li>
<li><p>All data is effectively lost</p>
</li>
</ol>
<p>With atomic renaming:</p>
<ol>
<li><p>We write the new MANIFEST to MANIFEST.tmp</p>
</li>
<li><p>We fully close and sync it to disk</p>
</li>
<li><p>We atomically rename MANIFEST.tmp to MANIFEST using <code>os.Rename(tmpPath, path)</code></p>
</li>
<li><p>If crash happens before step 3: old MANIFEST is intact, we retry compaction</p>
</li>
<li><p>If crash happens during step 3: atomic operation either completes or doesn't – no corruption</p>
</li>
<li><p>If crash happens after step 3: new MANIFEST is in place, we're good</p>
</li>
</ol>
<p>This is also why we delete the old SSTables only after successfully updating the MANIFEST. If we deleted them before updating MANIFEST and then crashed, the MANIFEST would still point to files that no longer exist.</p>
<h4 id="heading-complete-picture">Complete Picture:</h4>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765740593067/c18083ad-bf8a-4cae-92d3-d690a61dac52.png" alt="c18083ad-bf8a-4cae-92d3-d690a61dac52" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Congratulations! You've built a working LSM tree storage engine from scratch. By following the problem-driven approach – discovering issues and implementing solutions as they arose – you've experienced how engineers think about building robust storage systems. I hope this is better than just memorizing the concepts.</p>
<p><strong>Key Takeaways</strong></p>
<ul>
<li><p><strong>Append-only writes</strong> make LSM-trees fast for write-heavy workloads</p>
</li>
<li><p><strong>Immutability</strong> eliminates complex concurrency issues</p>
</li>
<li><p><strong>Trade-off</strong> is that LSM-trees favor writes over reads (the opposite of B-trees)</p>
</li>
<li><p><strong>Durability</strong> requires multiple mechanisms working together (WAL, MANIFEST, atomic operations)</p>
</li>
<li><p><strong>Background maintenance</strong> (compaction) is essential for long-term health and cost efficiency.</p>
</li>
</ul>
<p>Important note: This is a learning implementation. This means that I intentionally simplified the code, so it’s <strong>not production-ready</strong>. Key limitations include:</p>
<ul>
<li><p>No concurrency control (missing mutexes/locks)</p>
</li>
<li><p>No bloom filters for efficient lookups</p>
</li>
<li><p>Simplified compaction strategy</p>
</li>
<li><p>Type safety issues with generic tombstones</p>
</li>
<li><p>Missing robust error recovery</p>
</li>
</ul>
<h3 id="heading-complete-code">Complete Code:</h3>
<p>Like I’ve mentioned before, I've omitted boilerplate code and helper functions for brevity. The complete, runnable implementation is available <a target="_blank" href="https://github.com/justramesh2000/lsm-db">at this GitHub repo</a>.</p>
<p>To learn more about production LSM implementations, study RocksDB, LevelDB, or read the original LSM tree paper by O'Neil et al: <a target="_blank" href="https://www.cs.umb.edu/~poneil/lsmtree.pdf">https://www.cs.umb.edu/~poneil/lsmtree.pdf</a></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
