<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Ramesh Sinha - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Ramesh Sinha - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Fri, 08 May 2026 14:34:24 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/justramesh2000/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Why You Should Stop Managing Kafka Manually – A Guide to Kafka UI and Cruise Control ]]>
                </title>
                <description>
                    <![CDATA[ Over 80% of Fortune 100 companies use Apache Kafka. That's not surprising, as Kafka has revolutionized how we build real-time data pipelines and streaming applications. If you're working in software engineering today, chances are you've encountered K... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/stop-managing-kafka-manually-a-guide-to-kafka-ui-and-cruise-control/</link>
                <guid isPermaLink="false">6967bd3e7b94dad713ce9fae</guid>
                
                    <category>
                        <![CDATA[ kafka ]]>
                    </category>
                
                    <category>
                        <![CDATA[ distributed system ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Operational Efficiency ]]>
                    </category>
                
                    <category>
                        <![CDATA[ automation ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ramesh Sinha ]]>
                </dc:creator>
                <pubDate>Wed, 14 Jan 2026 15:58:54 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768353949324/18ce4a43-fb21-4e9b-9285-7c4db7b7ae2e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Over 80% of Fortune 100 companies use Apache Kafka. That's not surprising, as Kafka has revolutionized how we build real-time data pipelines and streaming applications. If you're working in software engineering today, chances are you've encountered Kafka in some capacity.</p>
<p>But here's the thing: while Kafka itself is incredibly powerful, managing Kafka clusters is notoriously challenging. This isn't a flaw in Kafka – it's just the reality of distributed systems. The bigger your cluster grows, the more complex operations become.</p>
<p>The most painful aspect? Manual cluster management. It's tedious, error-prone, and doesn't scale. What starts as simple topic creation with a few brokers turns into hours of carefully orchestrating partition reassignments across dozens of machines. One typo in a JSON file at 3 AM can take down production.</p>
<p>Sound familiar? You're not alone.</p>
<p>In this guide, you'll learn how two tools can transform Kafka operations from a manual slog into a manageable process:</p>
<ul>
<li><p><strong>Kafka UI</strong> – A modern web interface that replaces cryptic CLI commands with visual cluster management</p>
</li>
<li><p><strong>Cruise Control</strong> – LinkedIn's automation engine that handles cluster balancing and self-healing</p>
</li>
</ul>
<p>We'll start by experiencing the pain of manual management firsthand, then see how these tools solve real-world operational challenges. You'll set up everything locally with <code>Docker</code> and by the end you’ll know exactly how to manage Kafka clusters without the headache.</p>
<h2 id="heading-what-well-cover"><strong>What We’ll Cover:</strong></h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-setting-up-our-unmanaged-cluster">Setting Up Our Unmanaged Cluster</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-starting-the-cluster-amp-verification">Starting the Cluster &amp; Verification</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-topics-the-manual-way">Creating Topics: The Manual Way</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-kafka-ui">Kafka UI</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-setting-up-kafka-ui">Setting up Kafka UI</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-drawbacks-of-kafka-ui">Drawbacks of Kafka UI</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-cruise-control">Cruise Control</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-how-cruise-control-works">How Cruise Control Works</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-setting-up-cruise-control">Setting Up Cruise Control</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cruise-control-configuration-file">Cruise Control Configuration File</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-the-imbalance">Creating the Imbalance</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-attempting-manual-rebalancing">Attempting Manual Rebalancing</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-rebalancing-using-cruise-control">Rebalancing Using Cruise Control</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-the-problem-manual-kafka-management">The Problem: Manual Kafka Management</h2>
<p>Let’s dive right in. First, I'm going to show you what managing a Kafka cluster looks like without any tools – just you, the command line, and dozens of manual operations.</p>
<p>You’ll spin up a small cluster locally, create some topics, and simulate the kind of growth you'd see in a real production environment. By the end of this section, you'll understand exactly why teams spend thousands of engineering hours just keeping Kafka clusters running smoothly.</p>
<p>Fair warning: this is going to feel tedious – <em>that’s the point</em>.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before we dive in, make sure you have:</p>
<ol>
<li><p><strong>Docker Desktop installed and running</strong></p>
<ul>
<li><p>Mac and Windows users: <a target="_blank" href="https://www.docker.com/products/docker-desktop/">https://www.docker.com/products/docker-desktop/</a></p>
</li>
<li><p>Linux users can install Docker Engine via their package manager </p>
</li>
</ul>
</li>
<li><p><strong>Basic Kafka knowledge.</strong> You should understand:</p>
<ul>
<li><p><strong>Topics</strong>: Categories for organizing messages</p>
</li>
<li><p><strong>Partitions</strong>: How topics are divided for parallelism</p>
</li>
<li><p><strong>Brokers</strong>: The Kafka servers that store data</p>
</li>
<li><p><strong>Producers and Consumers</strong>: Applications that write to and read from Kafka</p>
</li>
<li><p><strong>KRaft</strong>: Kafka’s Raft-based consensus protocol that replaces ZooKeeper for cluster metadata and controller election</p>
</li>
</ul>
</li>
</ol>
<p>    If these terms are new to you, <a target="_blank" href="https://www.freecodecamp.org/news/apache-kafka-handbook/">here’s a great handbook about them</a>. I’d also recommend reading <a target="_blank" href="https://kafka.apache.org/intro">Kafka's Introduction</a> first.</p>
<ol start="3">
<li><p><strong>System Requirements</strong></p>
<ul>
<li><p>At least 8 GB of RAM</p>
</li>
<li><p>10 GB of free disk space</p>
</li>
</ul>
</li>
<li><p>Some basic understanding of <strong>containers</strong> is good to have:</p>
<ul>
<li><p>Docker</p>
</li>
<li><p>Images</p>
</li>
<li><p>Volumes</p>
</li>
<li><p>Networks</p>
</li>
</ul>
</li>
</ol>
<h2 id="heading-setting-up-our-unmanaged-cluster">Setting Up Our Unmanaged Cluster</h2>
<p>Let’s go ahead and build the cluster so that we can see the problems firsthand. We’ll use Docker to spin up three Kafka brokers running in <code>KRaft</code> mode (the modern, ZooKeeper-free approach).</p>
<p>Start by creating a file called <code>docker-compose-basic.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3.8'</span>

<span class="hljs-attr">services:</span>
  <span class="hljs-attr">kafka-1:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-1</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9092:9092"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-2:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-2</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9093:9093"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-3:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-3</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9094:9094"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">3</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3-data:/var/lib/kafka/data</span>

<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">kafka-1-data:</span>
  <span class="hljs-attr">kafka-2-data:</span>
  <span class="hljs-attr">kafka-3-data:</span>
</code></pre>
<p>In the above configuration file, we’re creating three Kafka brokers (<code>kafka-1, kafka-2, kafka-3</code>). Each one uses the <code>confluentinc/cp-kafka:7.6.0</code> image and has its port opened (<code>9092, 9093, 9094</code>).</p>
<p>The environment variables are:</p>
<ul>
<li><p><strong>KAFKA_NODE_ID</strong> – A unique identifier for each broker (1,2,3). No two brokers can have the same ID.</p>
</li>
<li><p><strong>KAFKA_PROCESS_ROLES: broker, controller</strong> – This tells Kafka to run in <code>KRaft</code> mode (without ZooKeeper). Each broker acts as both a data broker and a controller for cluster coordination.</p>
</li>
<li><p><strong>KAFKA_CONTROLLER_QUORUM_VOTERS</strong> – The membership list that tells each broker how to find the others. All three brokers must have the identical list: <code>1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</code>. This is how they discover each other and elect a leader.</p>
</li>
<li><p><strong>CLUSTER_ID</strong> – A unique identifier for the entire cluster. All brokers must use the <strong>exact same value</strong> or they won't recognize each other as part of the same cluster. The actual value (<code>MkU3OEVBNTcwNTJENDM2Qk</code>) doesn't matter as long as it is consistent across brokers. One important thing to note is that CLUSTER_ID must be a valid <code>base64-encoded UUID</code> per Kafka’s requirement.</p>
</li>
<li><p><strong>KAFKA_LISTENERS</strong> - Defines which network interfaces and ports Kafka listens on. We have three listeners:</p>
<ul>
<li><p><strong>PLAINTEXT://0.0.0.0:29092</strong>: For inter-broker communication (brokers talking to each other)</p>
</li>
<li><p><strong>CONTROLLER://0.0.0.0:29093</strong>: For controller communication in <code>KRaft</code> mode</p>
</li>
<li><p><strong>PLAINTEXT_HOST://0.0.0.0:9092</strong> (varies per broker): For external connections from your machine</p>
</li>
</ul>
</li>
<li><p><strong>KAFKA_ADVERTISED_LISTENERS</strong> – Tells clients (producers/consumers) how to connect to this broker. This is what gets returned when a client asks "<code>where should I connect?</code>" The <code>PLAINTEXT_HOST://localhost:9092</code> part is what allows you to connect from your host machine.</p>
</li>
</ul>
<p>Note: <strong>Listener configuration is critical.</strong> Incorrect settings will prevent clients from connecting even when brokers are running. These settings work for local Docker environments where Docker's internal DNS resolves the broker names (<code>kafka-1, kafka-2, kafka-3</code>). For production, replace the hostnames with actual IP addresses or FQDNs (fully qualified domain names).</p>
<ul>
<li><p><strong>KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2</strong> – How many copies of consumer offset data to keep. We use 2 instead of 3 because with only three brokers, this prevents issues during rolling restarts. In production with more brokers, you'd use 3 or more.</p>
</li>
<li><p><strong>The Volumes</strong> – <code>kafka-x-data:/var/lib/kafka/data</code> creates persistent storage for each broker’s data. Without volumes, you’d lose your topics and messages whenever the containers are removed. Each broker gets its own volume so they don’t accidentally share data.</p>
</li>
</ul>
<p>Note: To restart from scratch, you need to delete the volumes using the following command. The <code>-v</code> flag removes the volumes; without it, old data persists even after <code>docker compose down</code>.</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml down -v
</code></pre>
<p>If you're using the legacy <code>docker-compose</code> tool (V1), replace <code>docker compose</code> with <code>docker-compose</code> in all commands throughout this tutorial.</p>
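<p>If you’re not sure which Compose variant your machine has, a small check like the one below can pick for you. This is just a convenience sketch – the <code>COMPOSE</code> variable name is arbitrary:</p>

```shell
# Detect whether the Compose v2 plugin ("docker compose") is available,
# falling back to the legacy v1 standalone binary ("docker-compose").
if docker compose version >/dev/null 2>&1; then
  COMPOSE="docker compose"
else
  COMPOSE="docker-compose"
fi
echo "Using: $COMPOSE"
# Then run, for example: $COMPOSE -f docker-compose-basic.yml up -d
```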
<h3 id="heading-ports"><strong>Ports</strong></h3>
<p>Three ports are used for any given broker. Their purposes are:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Port</td><td>Purpose</td></tr>
</thead>
<tbody>
<tr>
<td><strong>9092</strong></td><td>External connections (producers and consumers from your host machine)</td></tr>
<tr>
<td><strong>29092</strong></td><td>Internal broker-to-broker communication</td></tr>
<tr>
<td><strong>29093</strong></td><td>Cluster coordination via KRaft</td></tr>
</tbody>
</table>
</div><h2 id="heading-starting-the-cluster-amp-verification">Starting the Cluster &amp; Verification</h2>
<p>Now that we have the basic docker configuration for Kafka, let’s run it and verify the results.</p>
<p>Run the following command in the same directory where you saved <code>docker-compose-basic.yml</code>:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml up -d
</code></pre>
<p>The <code>-d</code> flag runs the containers in detached mode (in the background), so you get your terminal back.</p>
<p>You should see output like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767554351786/4500c108-e9b6-403f-98cf-15198b3a9831.png" alt="Docker compose command result" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Using the following command, check if the containers running Kafka brokers are up:</p>
<pre><code class="lang-bash">docker ps
</code></pre>
<p>You should see three Kafka containers (kafka-1, kafka-2, kafka-3) with status “<code>Up</code>” – something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767554452185/4e605547-8153-4a85-9903-e3b1132889a8.png" alt="Running Kafka Containers" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Run the following command to verify that all three brokers are registered in the cluster:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-broker-api-versions --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092
</code></pre>
<p>You should see API version information for all three brokers (IDs 1, 2, 3) without any connection errors.</p>
<p>Note that we’re using <code>kafka-1:29092,kafka-2:29092,kafka-3:29092</code> here (the internal Docker addresses) instead of localhost:9092 because this command runs inside the <code>kafka-1</code> container by virtue of <code>docker exec -it kafka-1</code>, where <code>localhost</code> only refers to that specific container.</p>
<p>If any of the above verification steps return errors or don’t show the expected results from the screenshots, run the following command to view the logs and debug:</p>
<pre><code class="lang-bash">docker logs kafka-1
</code></pre>
<h2 id="heading-creating-topics-the-manual-way">Creating Topics: The Manual Way</h2>
<p>Now that we have a cluster running, let’s simulate a real production use case where different teams need Kafka topics for their applications – payments, logs, events, metrics, notifications, you name it.</p>
<p>Let’s start by creating a topic for logs. The command to do this is:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 12 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy
</code></pre>
<p>You’ll need to specify some command parameters, which are:</p>
<ol>
<li><p>The exact broker address <code>kafka-1:29092,kafka-2:29092,kafka-3:29092</code> (or the IP address of your servers in production)</p>
</li>
<li><p>The number of partitions – I have used <code>12</code> in the above command. Creating too few partitions creates bottlenecks, while creating too many adds overhead.</p>
</li>
<li><p>Retention policy – I have used 7 days (that is, 604800000 milliseconds)</p>
</li>
<li><p>Compression type</p>
</li>
</ol>
<p>Manually managing these parameters and running the command a handful of times is okay – but what if you have to run this for every team in your enterprise? Each team will have different requirements. The grind of copy, paste, adjust becomes painful if you have 100+ topics and multiple clusters (dev, staging, prod).</p>
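<p>At that scale you’d normally script the grind away. Here’s a minimal sketch – the team names and partition counts below are hypothetical – that generates one creation command per topic spec, shown in dry-run form so it only prints the commands instead of executing them:</p>

```shell
# Hypothetical per-team topic specs in "name:partitions" form.
topics="team-payments:12 team-logs:20 team-metrics:6"

for spec in $topics; do
  name=${spec%%:*}        # everything before the ":"
  partitions=${spec##*:}  # everything after the ":"
  # Dry run: print the command instead of running it. Drop the echo
  # (and quoting) to actually execute against your cluster.
  echo "docker exec -it kafka-1 kafka-topics --create" \
       "--topic $name --partitions $partitions --replication-factor 2" \
       "--bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092"
done
```

<p>Even a tiny loop like this removes the copy-paste-adjust cycle; in practice you’d read the specs from a file kept in version control.</p>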
<p>Feel the pain yet? Bear with it for a minute – we’ll address this issue shortly. For now, if you run the above command you should see the “<strong>Created topic</strong>” message:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767554860337/81937532-b88b-4d9a-bfca-f2f9b0433e4d.png" alt="Create Kafka Topic result" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Note: We’re using <code>kafka-1:29092,kafka-2:29092,kafka-3:29092</code> to reach the Kafka brokers because the command runs inside the kafka-1 container via <code>docker exec</code>.</p>
<p>Let's keep going. We’ll create more topics using the same command, changing the topic name and partition count each time. Copy, paste, update, and run it a few times. I ran it three more times as shown below (feel free to use different values – the exact numbers aren’t important for this tutorial):</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-views \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 20 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy


docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-analytics \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 3 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy


docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-articles \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 5 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy
</code></pre>
<p>After creating the topics, let’s see all the ones you have now by running the following command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --list \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092
</code></pre>
<p>You should see a list of topics like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767555021681/451f69bc-5f91-432a-9c74-ca56b79aa179.png" alt="Kafka list of Topics" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Notice that you just get the list of topics but no meaningful information, like:</p>
<ul>
<li><p>How many partitions does each have?</p>
</li>
<li><p>Which brokers are hosting them?</p>
</li>
<li><p>Are they evenly distributed?</p>
</li>
<li><p>What are their configurations?</p>
</li>
</ul>
<h3 id="heading-partition-information">Partition Information</h3>
<p>Let’s try to get information about our partitions. For this tutorial, I have created 4 topics and a total of 40 partitions spread across three brokers. I want to see which broker has the most partitions.</p>
<p>In a well-managed cluster, you’d want them roughly evenly distributed. But how can we check that? </p>
<p>Maybe the describe command shown below can help. Let’s run it:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --describe \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092
</code></pre>
<p>It will return a wall of text, something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767555186583/e50948e9-48f8-4431-8008-38f24da98373.png" alt="Kafka describe Topics" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>So, we have partition information but:</p>
<ul>
<li><p>No summary or aggregation</p>
</li>
<li><p>No visual representation </p>
</li>
<li><p>It’s difficult to scan and compare</p>
</li>
<li><p>It gets exponentially worse with more topics</p>
</li>
</ul>
<h3 id="heading-counting-leaders">Counting Leaders</h3>
<p>The Leader field in the above screenshot tells you which broker is the leader for each partition. Leaders handle all read and write requests, so you want them evenly distributed or else some brokers will become overloaded. </p>
<p>Let’s try to count how many partitions each broker leads. To do that, run the following command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --describe \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 | grep <span class="hljs-string">"Leader: 1"</span> | wc -l
</code></pre>
<p>It will show something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767555304446/6a58be3a-e21f-4209-8165-1544c5bc6c20.png" alt="Kafka Leader Count result" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Given the topics I created, <code>14</code> is the number of partitions for which broker 1 (<code>Leader: 1</code>) is the leader. You might see a different number depending on how many topics and partitions you created.</p>
<p>You can repeat this command to count the partitions led by the other brokers. To do so, just change <code>Leader: 1</code> to <code>Leader: 2</code> or <code>Leader: 3</code>. I get <code>14, 12, 14</code>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767555418151/8c66afe7-b5db-4106-800f-f6e7e8926ba2.png" alt="Kafka Leader Count for all Topics" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>That’s somewhat balanced, but you had to run the command multiple times, parse using <code>grep</code> and <code>wc</code>, and this is just 3 brokers. What if you had 100+? Also, what if you have to get the replicas’ information?</p>
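<p>The repeated grep runs can at least be folded into one loop over the saved describe output. A sketch – the sample lines below are made up for illustration; normally you’d capture the real <code>kafka-topics --describe</code> output instead:</p>

```shell
# Made-up sample in the line format `kafka-topics --describe` prints.
describe_output='Topic: t1 Partition: 0 Leader: 1 Replicas: 1,2 Isr: 1,2
Topic: t1 Partition: 1 Leader: 2 Replicas: 2,3 Isr: 2,3
Topic: t1 Partition: 2 Leader: 1 Replicas: 1,3 Isr: 1,3'

for broker in 1 2 3; do
  # grep -c counts matching lines; "|| true" keeps the loop going
  # when a broker leads zero partitions (grep exits non-zero then).
  count=$(printf '%s\n' "$describe_output" | grep -c "Leader: $broker" || true)
  echo "broker $broker leads $count partition(s)"
done
```

<p>Still, this is string-parsing duct tape – it tells you nothing about replicas, ISR health, or load, which is exactly the gap the tools below fill.</p>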
<p>I could go on and on with the data we need and the commands to get that information. But the point I’m trying to make here is that sooner or later this becomes impossible to manage. Your team is going to need an army, and to be honest there isn’t much value in doing all of this manually. </p>
<p>So far, you’ve seen only simple operational commands, but the problems don’t stop there. In a real production environment there are more complex and challenging operations like:</p>
<ul>
<li><p><strong>Consumer Lag Monitoring</strong>: When consumers fall behind, you need to track which partitions are lagging, which consumer instances own them, and where the lag is growing or shrinking. With CLI tools, you get raw numbers but no trends or context.</p>
</li>
<li><p><strong>Broker Failures</strong>: When a broker fails, you need to identify under-replicated partitions, trigger leader elections, and create partition reassignment <code>JSON</code> files manually. One mistake in that JSON can cause data loss.</p>
</li>
<li><p><strong>Cluster rebalancing</strong>: You’ll see that when you add new brokers, they sit empty until you manually redistribute partitions. Similarly for removing brokers, you need to move all their partitions first. These operations require calculating optimal placement and creating complex reassignment plans.</p>
</li>
</ul>
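<p>To make the consumer-lag point concrete: lag for a partition is simply the log-end offset minus the consumer group’s committed offset. A tiny sketch with made-up numbers of the kind you might read off a <code>kafka-consumer-groups --describe</code> listing:</p>

```shell
# Hypothetical offsets for one partition (illustration only):
log_end_offset=15000   # latest offset written by producers
current_offset=14250   # last offset committed by the consumer group
lag=$((log_end_offset - current_offset))
echo "consumer lag: $lag messages"
```

<p>A single number like this is easy to compute, but with the CLI alone you get no history, so you can’t tell whether that lag is shrinking or spiralling.</p>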
<p>If you’re still with me, you’re probably thinking that there has to be a better way. Fortunately, there is – actually, there are a couple of complementary approaches, and we’re going to talk about them next.</p>
<h2 id="heading-kafka-ui">Kafka UI</h2>
<p>Kafka UI is a modern, open-source web interface for managing Kafka clusters. It replaces the <code>command line chaos</code> we just experienced with a clean, visual dashboard.</p>
<p>Kafka UI provides the following features:</p>
<ul>
<li><p><code>Visual cluster overview</code>: see all brokers, topics, and partitions at a glance</p>
</li>
<li><p><code>Topic management</code>: create, configure, and delete topics with a GUI</p>
</li>
<li><p><code>Consumer group monitoring</code>: track lags, offsets, and consumer health in real-time</p>
</li>
<li><p><code>Message browsing</code>: view actual messages in topics without command line tools </p>
</li>
</ul>
<p>Without further ado, let’s set up Kafka UI.</p>
<h3 id="heading-setting-up-kafka-ui">Setting Up Kafka UI</h3>
<p>To set up Kafka UI, let’s modify our existing <code>docker-compose-basic.yml</code> like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3.8'</span>

<span class="hljs-attr">services:</span>
  <span class="hljs-attr">kafka-1:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-1</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9092:9092"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-2:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-2</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9093:9093"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-3:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-3</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9094:9094"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">3</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3-data:/var/lib/kafka/data</span>
<span class="hljs-comment"># Adding kafka-UI service start</span>
  <span class="hljs-attr">kafka-ui:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">provectuslabs/kafka-ui:latest</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-ui</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"8080:8080"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">DYNAMIC_CONFIG_ENABLED:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CLUSTERS_0_NAME:</span> <span class="hljs-string">freecodecamp-cluster</span>
      <span class="hljs-attr">KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS:</span> <span class="hljs-string">kafka-1:29092,kafka-2:29092,kafka-3:29092</span>
    <span class="hljs-attr">depends_on:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3</span>
<span class="hljs-comment"># Adding kafka-UI service end</span>
<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">kafka-1-data:</span>
  <span class="hljs-attr">kafka-2-data:</span>
  <span class="hljs-attr">kafka-3-data:</span>
</code></pre>
<p>The YAML file is pretty much the same as before, except that we’ve added a new service called <code>kafka-ui</code> (for clarity, I’ve placed the changes between start and end comments).</p>
<p>The key configurations are:</p>
<ul>
<li><p><strong>Port 8080</strong> – You can access the UI at <a target="_blank" href="http://localhost:8080">http://localhost:8080</a> from your machine.</p>
</li>
<li><p><strong>KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS</strong> – This environment variable tells Kafka UI where to connect to your cluster (using internal Docker addresses).</p>
</li>
<li><p><strong>KAFKA_CLUSTERS_0_NAME</strong> – A friendly name for your cluster in the UI.</p>
</li>
</ul>
<p>Let’s first clean up the old cluster while keeping the topic data intact. Go ahead and run the following command to do so:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml down
</code></pre>
<p>Note that we’re not using <code>-v</code> here, so volumes (topic data) will remain intact.</p>
<p>Wait a couple of seconds, then run the following command to bring up our cluster with Kafka UI:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml up -d
</code></pre>
<p>Now open a browser and visit <a target="_blank" href="http://localhost:8080/">http://localhost:8080/</a>. You’ll see the UI like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767556151144/f0dd2a51-79c5-4906-a32d-d7732f9fd242.png" alt="Kafka UI" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You can click around and see all information about the cluster we have created, like:</p>
<ul>
<li><p>Your 3 brokers </p>
</li>
<li><p>The topics you created earlier</p>
</li>
<li><p>Partition counts</p>
</li>
</ul>
<p>For comparison with manual commands, let's look at the Brokers tab. You can see the partition leader count for each broker at a glance – remember that we had to run multiple commands to get this information earlier. Beyond this, the UI provides many other useful metrics that would require separate command-line queries.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767556236480/c99ba704-47b9-42da-b135-e1f9503ee1ab.png" alt="Kafka UI Brokers" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Remember the CLI commands we had to run to create topics? If you go to the <code>Topics</code> tab, you’ll notice that topic management (creation, deletion, data cleanup, and so on) is just a few button clicks away.</p>
<p>Similarly, managing Consumers only requires a few button clicks.</p>
<p>After exploring the Kafka UI, you'll see how much easier it is to monitor your cluster compared to running individual CLI commands.</p>
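<p>Under the hood, the lag figures the UI shows are simple arithmetic: per-partition lag is the log-end offset minus the consumer group’s committed offset. A minimal sketch (the topic name and offset values are made up for illustration):</p>

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = log-end offset minus committed offset.

    Partitions with no committed offset are treated as fully lagging.
    """
    return {
        tp: end - committed_offsets.get(tp, 0)
        for tp, end in end_offsets.items()
    }

# Hypothetical offsets for topic "orders", partitions 0-2
end = {("orders", 0): 1500, ("orders", 1): 980, ("orders", 2): 2100}
committed = {("orders", 0): 1500, ("orders", 1): 700}
print(consumer_lag(end, committed))
# ("orders", 0) is caught up; ("orders", 1) lags by 280; ("orders", 2) by 2100
```

Kafka UI does this continuously for every group and partition, which is why trends appear at a glance instead of requiring repeated CLI queries.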
<h3 id="heading-drawbacks-of-kafka-ui">Drawbacks of Kafka UI</h3>
<p>That said, Kafka UI does have some limitations:</p>
<ul>
<li><p><strong>Automatic rebalancing</strong>: If one or a few brokers hold more partitions than the others, you must manually reassign them.</p>
</li>
<li><p><strong>Self-healing</strong>: If a broker fails, you have to manually create reassignment plans.</p>
</li>
<li><p><strong>Performance optimization</strong>: The UI can’t recommend intelligent partition placement.</p>
</li>
<li><p><strong>Alerts</strong>: The UI doesn’t warn you before problems happen.</p>
</li>
</ul>
<p>For small clusters (3–10 brokers), Kafka UI plus some command execution might be enough. You’ll be able to see problems clearly and fix them when needed.</p>
<p>For large clusters, manual operations are still not scalable, so we need some kind of complementary tool… and that tool is <strong>Cruise Control</strong>.</p>
<h2 id="heading-cruise-control">Cruise Control</h2>
<p>Cruise Control is an automation engine for Kafka clusters. While Kafka UI gives you visibility and manual control, Cruise Control provides intelligent automation and self-healing. You can think of Kafka UI as a dashboard with manual controls and Cruise Control as an autopilot. In other words, they complement each other.</p>
<p>Let’s try to create some imbalance in our cluster and fix it manually. This will help you understand why you need Cruise Control.</p>
<p>To keep things simple, let’s start from scratch. We will first delete all the Docker resources we have created so far by running the following command:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml down -v
</code></pre>
<p>Running <code>docker compose down -v</code> will delete all the topics and messages we’ve created so far, but don’t worry – we’ll create them again.</p>
<h3 id="heading-how-cruise-control-works">How Cruise Control Works</h3>
<p>You can think of Cruise Control as a metric-monitoring and action-taking tool. Kafka brokers collect internal metrics (CPU, disk, network traffic, partition sizes), and a metric reporter running inside each broker sends these metrics to a Kafka topic.</p>
<p>Cruise Control then reads from that topic and analyzes the data. Based on that analysis, it proposes partition movements. We’ll see this in action shortly.</p>
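<p>To build intuition for what “proposes partition movements” means, here’s a deliberately naive sketch: move partitions one at a time from the most-loaded broker to the least-loaded until the counts are balanced. This is <em>not</em> Cruise Control’s actual algorithm (which weighs CPU, disk, and network goals, as we’ll see in its configuration), and the broker IDs and partition names are hypothetical.</p>

```python
def propose_moves(broker_partitions):
    """Greedy rebalance by partition count only (illustrative, not CC's logic)."""
    load = {b: list(ps) for b, ps in broker_partitions.items()}
    moves = []
    while True:
        most = max(load, key=lambda b: len(load[b]))
        least = min(load, key=lambda b: len(load[b]))
        # Stop once no broker holds 2+ more partitions than another
        if len(load[most]) - len(load[least]) <= 1:
            return moves
        partition = load[most].pop()
        load[least].append(partition)
        moves.append((partition, most, least))

# Hypothetical: broker 3 was just added and sits empty
moves = propose_moves({
    1: ["orders-0", "orders-1", "orders-2"],
    2: ["orders-3", "orders-4"],
    3: [],
})
print(moves)
```

The real tool solves the same shape of problem, but against a multi-dimensional resource model rather than raw partition counts.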
<h3 id="heading-setting-up-cruise-control">Setting Up Cruise Control</h3>
<p>As of this writing, I couldn’t find compatible Kafka and Cruise Control images that support <code>KRaft</code> (Kafka Raft), so I created public Kafka and Cruise Control images for this tutorial. I don’t recommend using these images in production. For production usage, you should either wait for the community to provide an image or create one of your own.</p>
<p>Change the <code>docker-compose-basic.yml</code> file to look like the following:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3.8'</span>

<span class="hljs-attr">services:</span>
  <span class="hljs-attr">kafka-1:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">justramesh2000/kafka-apache-cc:3.8.1</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-1</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9092:9092"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
      <span class="hljs-comment"># Cruise Control Metrics Reporter</span>
      <span class="hljs-attr">KAFKA_METRIC_REPORTERS:</span> <span class="hljs-string">'com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS:</span> <span class="hljs-string">'kafka-1:29092,kafka-2:29092,kafka-3:29092'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS:</span> <span class="hljs-string">'1'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-string">'2'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE:</span> <span class="hljs-string">'false'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS:</span> <span class="hljs-string">'60000'</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-2:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">justramesh2000/kafka-apache-cc:3.8.1</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-2</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9093:9093"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
      <span class="hljs-attr">KAFKA_METRIC_REPORTERS:</span> <span class="hljs-string">com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS:</span> <span class="hljs-string">kafka-1:29092,kafka-2:29092,kafka-3:29092</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE:</span> <span class="hljs-string">'false'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC:</span> <span class="hljs-string">__CruiseControlMetrics</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS:</span> <span class="hljs-string">'1'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-string">'2'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS:</span> <span class="hljs-string">'60000'</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-3:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">justramesh2000/kafka-apache-cc:3.8.1</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-3</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9094:9094"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">3</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
      <span class="hljs-attr">KAFKA_METRIC_REPORTERS:</span> <span class="hljs-string">com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS:</span> <span class="hljs-string">kafka-1:29092,kafka-2:29092,kafka-3:29092</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE:</span> <span class="hljs-string">'false'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC:</span> <span class="hljs-string">__CruiseControlMetrics</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS:</span> <span class="hljs-string">'1'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-string">'2'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS:</span> <span class="hljs-string">'60000'</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3-data:/var/lib/kafka/data</span>
  <span class="hljs-comment"># Adding kafka-UI service start</span>
  <span class="hljs-attr">kafka-ui:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">provectuslabs/kafka-ui:latest</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-ui</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"8080:8080"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">DYNAMIC_CONFIG_ENABLED:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CLUSTERS_0_NAME:</span> <span class="hljs-string">freecodecamp-cluster</span>
      <span class="hljs-attr">KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS:</span> <span class="hljs-string">kafka-1:29092,kafka-2:29092,kafka-3:29092</span>
    <span class="hljs-attr">depends_on:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./config:/opt/cruise-control/config</span>  
  <span class="hljs-comment"># Adding kafka-UI service end</span>
  <span class="hljs-comment"># Adding cruise-control start</span>
  <span class="hljs-attr">cruise-control:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">justramesh2000/cruise-control-kraft:2.5.142</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">cruise-control</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9090:9090"</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./config/cruisecontrol.properties:/opt/cruise-control/config/cruisecontrol.properties</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./config/capacityJBOD.json:/opt/cruise-control/config/capacityJBOD.json:ro</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./config/log4j.properties:/opt/cruise-control/config/log4j.properties:ro</span>
    <span class="hljs-attr">depends_on:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3</span>
   <span class="hljs-comment"># Adding cruise-control end    </span>
<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">kafka-1-data:</span>
  <span class="hljs-attr">kafka-2-data:</span>
  <span class="hljs-attr">kafka-3-data:</span>
</code></pre>
<p>You should have made the following changes to the file:</p>
<ul>
<li><p>Changed the Kafka image from <code>confluentinc/cp-kafka:7.6.0</code> to <code>justramesh2000/kafka-apache-cc:3.8.1</code>. The new image contains the Cruise Control metrics reporter, which exports metrics from the Kafka brokers for Cruise Control to consume.</p>
</li>
<li><p>Added the following environment variables:</p>
<ul>
<li><p><strong>KAFKA_METRIC_REPORTERS</strong> – This variable tells Kafka to load a plugin called the <code>Cruise Control Metrics Reporter</code>. It runs inside each Kafka broker process and hooks into Kafka’s internal metrics system, collecting the data Cruise Control needs.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS</strong> – This tells the <code>Cruise Control Metrics Reporter</code> where to send metrics to, meaning which Kafka brokers and which port.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE</strong> – This disables Kubernetes-specific behaviors (such as using the Pod name and ID instead of the host). We’re using Docker, so we don’t need them.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_TOPIC</strong> – Specifies the name of the topic where metrics will be published. The default is <code>__CruiseControlMetrics</code>, but you can customize it with this variable if you want to.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE</strong> – Automatically creates the <code>__CruiseControlMetrics</code> topic if it doesn’t exist. Without this setting, the reporter will fail until you manually create the topic.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS</strong> – Defines the number of partitions for the topic <code>__CruiseControlMetrics</code>.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR</strong> – Tells Kafka how many copies of metrics data to keep. In our case, we’re keeping 2 copies of the data.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS</strong> – Tells Kafka how often to send metrics. We’re sending every minute.</p>
</li>
</ul>
</li>
<li><p>Added the Cruise Control service using the image <code>justramesh2000/cruise-control-kraft:2.5.142</code>. For clarity, I’ve kept this change between the <code>start</code> and <code>end</code> comments.</p>
</li>
<li><p>Under the <code>cruise-control</code> service, we’ve mounted three Cruise Control configuration files. We’ll talk about those files next.</p>
</li>
</ul>
<h3 id="heading-cruise-control-configuration-file">Cruise Control Configuration File</h3>
<p>To run Cruise Control, we need to provide several configuration files. Among the key pieces of information are:</p>
<ul>
<li><p>Where the Kafka cluster is located</p>
</li>
<li><p>The capacity of each broker</p>
</li>
</ul>
<p>Create a config directory and add the following files:</p>
<pre><code class="lang-bash">mkdir config
</code></pre>
<h4 id="heading-cruisecontrolproperties">cruisecontrol.properties</h4>
<p>This is Cruise Control’s main configuration file.</p>
<p>Save the following content as <code>cruisecontrol.properties</code> in the config directory:</p>
<pre><code class="lang-properties"># Kafka cluster. Tells Cruise Control how to connect to the brokers
bootstrap.servers=kafka-1:29092,kafka-2:29092,kafka-3:29092

# Topic from which metrics are read
metric.reporter.topic=__CruiseControlMetrics

# Aggregated partition data
partition.metric.sample.store.topic=__KafkaCruiseControlPartitionMetricSamples

# Aggregated broker data
broker.metric.sample.store.topic=__KafkaCruiseControlModelTrainingSamples

# Enable broker failure detection for KRaft mode (no ZooKeeper)
kafka.broker.failure.detection.enable=true

# Capacity. Tells Cruise Control where the capacity file is
capacity.config.file=config/capacityJBOD.json

# Goals. What to optimize for during cluster balancing. These are the rules
# Cruise Control abides by during rebalancing
default.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal

# Hard goals. These must always be satisfied
hard.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal

# Webserver. For REST API access
webserver.http.port=9090
webserver.http.address=0.0.0.0

# Execution
num.broker.metrics.windows=1
num.partition.metrics.windows=1
</code></pre>
<p>I’ve added inline comments to explain most of the configuration above, but I think the <code>Goals</code> deserve special attention. These are the rules that we, as users, set for Cruise Control to abide by.</p>
<p>By defining goals, we tell Cruise Control to do the following:</p>
<ul>
<li><p><code>RackAwareGoal</code> – Spread replicas across racks (or in our case, brokers)</p>
</li>
<li><p><code>ReplicaCapacityGoal</code> – Don't overload brokers with too many replicas</p>
</li>
<li><p><code>DiskCapacityGoal</code> – Don't fill up disk</p>
</li>
<li><p><code>NetworkInboundCapacityGoal</code> – Balance incoming network traffic</p>
</li>
<li><p><code>NetworkOutboundCapacityGoal</code> – Balance outgoing network traffic</p>
</li>
<li><p><code>CpuCapacityGoal</code> – Balance CPU usage</p>
</li>
<li><p><code>ReplicaDistributionGoal</code> – Evenly distribute replicas</p>
</li>
<li><p><code>DiskUsageDistributionGoal</code> – Ensure even disk usage across brokers</p>
</li>
<li><p><code>LeaderReplicaDistributionGoal</code> – Evenly distribute leader replicas</p>
</li>
<li><p><code>LeaderBytesInDistributionGoal</code> – Balance data flowing to leaders</p>
</li>
</ul>
<p>Via Cruise Control configuration, you can define two types of goals: <code>Default goals</code> and <code>Hard goals</code>. Hard goals must be met. Default goals that aren’t part of the hard goals become soft goals. This means that Cruise Control will give its best effort to satisfy them but won’t reject a proposal if it can’t.</p>
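<p>The relationship between the three goal types boils down to a set difference: anything in the default goals that isn't also a hard goal is treated as soft. A quick sketch (class-name prefixes dropped for readability):</p>

```python
# Soft goals are simply the default goals that are not also hard goals.
default_goals = {
    "RackAwareGoal", "ReplicaCapacityGoal", "DiskCapacityGoal",
    "NetworkInboundCapacityGoal", "NetworkOutboundCapacityGoal",
    "CpuCapacityGoal", "ReplicaDistributionGoal",
    "DiskUsageDistributionGoal", "LeaderReplicaDistributionGoal",
    "LeaderBytesInDistributionGoal",
}
hard_goals = {
    "RackAwareGoal", "ReplicaCapacityGoal", "DiskCapacityGoal",
    "NetworkInboundCapacityGoal", "NetworkOutboundCapacityGoal",
    "CpuCapacityGoal",
}
soft_goals = default_goals - hard_goals
print(sorted(soft_goals))
```

<p>This is why the six capacity-style goals appear in both lists in our configuration, while the four distribution-style goals are best-effort.</p>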
<p>Here’s a little summary:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Type</td><td>Meaning</td><td>What CC Does</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Hard Goals</strong></td><td>Must-haves (capacity limits)</td><td><strong>Never violates</strong> – rejects proposal if can't satisfy</td></tr>
<tr>
<td><strong>Soft Goals</strong></td><td>Nice-to-haves (better balance)</td><td><strong>Tries to satisfy</strong> – skips if conflicts with hard goals</td></tr>
<tr>
<td><strong>Default Goals</strong></td><td>Hard + Soft together</td><td><strong>Optimizes for all</strong> – prioritizes hard over soft</td></tr>
</tbody>
</table>
</div><p>Cruise Control collects metrics over a defined period (default: 5 minutes) to create a monitoring window. The following settings control how many windows Cruise Control needs before it’s ready to generate proposals (we’ll see shortly what proposals are):</p>
<ul>
<li><p><code>num.broker.metrics.windows=1</code>: Wait for 1 monitoring window before generating proposals. Each window in Cruise Control is 5 minutes by default. This means that Cruise Control will be ready after 5 minutes. I’ve set this to 1 for quick testing. The recommendation is to use a large window in production to avoid false proposals from temporary spikes.</p>
</li>
<li><p><code>num.partition.metrics.windows=1</code>: Wait for 1 window of partition metrics. Same reasoning as above.</p>
</li>
</ul>
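<p>To put those two settings in perspective, here's a tiny sketch of how long Cruise Control has to collect metrics before it can generate proposals, assuming the default 5-minute window length:</p>

```python
# How long Cruise Control collects metrics before it can generate
# proposals, given the number of monitoring windows required.
WINDOW_MINUTES = 5  # Cruise Control's default window length

def minutes_until_ready(num_windows: int,
                        window_minutes: int = WINDOW_MINUTES) -> int:
    return num_windows * window_minutes

print(minutes_until_ready(1))   # our quick-test setting: ready in 5 minutes
print(minutes_until_ready(12))  # a more production-like setting: 1 hour
```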
<h4 id="heading-capacity">Capacity</h4>
<p>This file informs Cruise Control about the capacity (CPU, disk, network) of each broker, which helps it make decisions. Using the file below, we’re telling Cruise Control the following:</p>
<ul>
<li><p>What the broker IDs are</p>
</li>
<li><p>The disk path <code>/var/lib/kafka/data</code> and disk capacity (<code>100000000</code> MB = 100 TB). This is used by the <code>DiskCapacityGoal</code> that we set up in the <code>cruisecontrol.properties</code> file above.</p>
</li>
<li><p>The CPU capacity: <code>100</code> (100% = 1 core). Used by <code>CpuCapacityGoal</code>.</p>
</li>
<li><p>The <code>NW_IN</code> network inbound capacity (125,000 KB/s = 125 MB/s = 1 Gbps). Used by <code>NetworkInboundCapacityGoal</code>.</p>
</li>
<li><p>The <code>NW_OUT</code> network outbound capacity (125,000 KB/s). Used by <code>NetworkOutboundCapacityGoal</code>.</p>
</li>
</ul>
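<p>The network capacities are the easiest values to get wrong, so here's a quick conversion sketch showing why a 1 Gbps link becomes <code>125000</code> KB/s in the capacity file:</p>

```python
# Convert a link speed in Gbps (gigabits per second) to the KB/s
# (kilobytes per second) unit that Cruise Control's NW_IN / NW_OUT
# capacity values expect.
def gbps_to_kb_per_s(gbps: float) -> float:
    bits_per_second = gbps * 1_000_000_000
    bytes_per_second = bits_per_second / 8  # 8 bits per byte
    return bytes_per_second / 1000          # decimal kilobytes

print(gbps_to_kb_per_s(1))  # 125000.0 -- the value in capacityJBOD.json
```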
<p>Save the following content as <code>capacityJBOD.json</code> in the config directory:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"brokerCapacities"</span>:[
    {
      <span class="hljs-attr">"brokerId"</span>: <span class="hljs-string">"1"</span>,
      <span class="hljs-attr">"capacity"</span>: {
        <span class="hljs-attr">"DISK"</span>: {<span class="hljs-attr">"/var/lib/kafka/data"</span>: <span class="hljs-string">"100000000"</span>},
        <span class="hljs-attr">"CPU"</span>: <span class="hljs-string">"100"</span>,
        <span class="hljs-attr">"NW_IN"</span>: <span class="hljs-string">"125000"</span>,
        <span class="hljs-attr">"NW_OUT"</span>: <span class="hljs-string">"125000"</span>
      }
    },
    {
      <span class="hljs-attr">"brokerId"</span>: <span class="hljs-string">"2"</span>,
      <span class="hljs-attr">"capacity"</span>: {
        <span class="hljs-attr">"DISK"</span>: {<span class="hljs-attr">"/var/lib/kafka/data"</span>: <span class="hljs-string">"100000000"</span>},
        <span class="hljs-attr">"CPU"</span>: <span class="hljs-string">"100"</span>,
        <span class="hljs-attr">"NW_IN"</span>: <span class="hljs-string">"125000"</span>,
        <span class="hljs-attr">"NW_OUT"</span>: <span class="hljs-string">"125000"</span>
      }
    },
    {
      <span class="hljs-attr">"brokerId"</span>: <span class="hljs-string">"3"</span>,
      <span class="hljs-attr">"capacity"</span>: {
        <span class="hljs-attr">"DISK"</span>: {<span class="hljs-attr">"/var/lib/kafka/data"</span>: <span class="hljs-string">"100000000"</span>},
        <span class="hljs-attr">"CPU"</span>: <span class="hljs-string">"100"</span>,
        <span class="hljs-attr">"NW_IN"</span>: <span class="hljs-string">"125000"</span>,
        <span class="hljs-attr">"NW_OUT"</span>: <span class="hljs-string">"125000"</span>
      }
    }
  ]
}
</code></pre>
<h4 id="heading-logging">Logging</h4>
<p>This isn’t required for Cruise Control to work, but it’ll help you debug if there are issues. Save the following content as <code>log4j.properties</code> in the config directory. If you see unexpected behavior when starting Cruise Control, such as the container exiting, you can use the <code>docker logs</code> command to see what happened.</p>
<pre><code class="lang-properties"># Root logger - INFO level, output to console
rootLogger.level=INFO
appenders=console

# Console output (for docker logs)
appender.console.type=Console
appender.console.name=STDOUT
appender.console.layout.type=PatternLayout
appender.console.layout.pattern=[%d] %p %m (%c)%n

# Send root logger to console
rootLogger.appenderRef.console.ref=STDOUT
</code></pre>
<p>Now that we have all the configurations in place, let’s run the following command to start Kafka brokers with Kafka UI and Cruise Control:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml up -d
</code></pre>
<p>Using the following command, verify that the three Kafka brokers, Kafka UI, and Cruise Control containers are running:</p>
<pre><code class="lang-bash">docker ps
</code></pre>
<p>You should see something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768085885265/b196c5ce-77b3-4563-b72a-20c6de7123f0.png" alt="Docker running containers" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now that we have Cruise Control up and running, let’s create some imbalance and see how much better an experience we get by using Cruise Control versus mitigating the imbalance manually.</p>
<h3 id="heading-creating-the-imbalance">Creating the Imbalance</h3>
<p>An imbalance is a scenario where some brokers are handling more messages than others – and they may run into high disk usage or high IOPS.</p>
<p>To create the imbalance in our cluster, we’ll create a few topics and then produce messages unevenly. Now that you have Kafka UI running, you can create topics through the UI, or you can create them with commands. I’m going to use commands because they’ll be easier for you to reproduce (though I recommend the UI for production operations, since it prevents typos).</p>
<p>If you also decide to use commands, run the following command. Then using UI, verify that the topics have been created.</p>
<p>Note: You’ll find that these commands differ from the previous ones. This is because, previously, our <code>docker-compose-basic.yml</code> file used the <code>confluentinc/cp-kafka:7.6.0</code> image for Kafka, but now we’re using the <code>justramesh2000/kafka-apache-cc:3.8.1</code> image, which is based on the <code>apache/kafka:3.8.1</code> image. Different images place the tools in different locations, so the commands need to be adjusted to account for that.</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">'
/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092 \
  --partitions 12 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-views \
  --bootstrap-server kafka-1:29092 \
  --partitions 20 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-analytics \
  --bootstrap-server kafka-1:29092 \
  --partitions 3 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-articles \
  --bootstrap-server kafka-1:29092 \
  --partitions 5 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy
'</span>
</code></pre>
<p>Run the following commands to produce an uneven volume of messages across the topics we created above.</p>
<p>Heavy Load on <code>freecodecamp-logs</code>:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">"
for i in {1..100000}; do 
  echo '{\"log_id\":\"'\$i'\",\"level\":\"INFO\",\"message\":\"Log entry '\$i'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-logs --bootstrap-server kafka-1:29092"</span>
</code></pre>
<p>Heavy load on <code>freecodecamp-views</code>:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">"
for i in {1..80000}; do 
  echo '{\"view_id\":\"'\$i'\",\"page\":\"/article/'\$((i % 100))'\",\"user\":\"user_'\$((i % 1000))'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-views --bootstrap-server kafka-1:29092"</span>
</code></pre>
<p>Moderate load on <code>freecodecamp-analytics</code>:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">"
for i in {1..30000}; do 
  echo '{\"event\":\"page_view\",\"user\":\"user_'\$i'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-analytics --bootstrap-server kafka-1:29092"</span>
</code></pre>
<p>Now, produce messages with a <code>fixed key</code> to force all of the data into one partition. This is a fast way to create a strong disk imbalance. Run the following command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">"
for i in {1..300000}; do
  echo 'hotkey:{\"log_id\":'\$i',\"msg\":\"big payload\"}'
done | /opt/kafka/bin/kafka-console-producer.sh \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092 \
  --property parse.key=true \
  --property key.separator=:"</span>
</code></pre>
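<p>Why does a fixed key pin everything to one partition? Kafka's default partitioner hashes the record key (with murmur2) modulo the partition count, so the same key always maps to the same partition. The sketch below uses MD5 as a stand-in hash rather than Kafka's actual murmur2 implementation, but the pinning behavior is the same:</p>

```python
import hashlib

NUM_PARTITIONS = 12  # freecodecamp-logs has 12 partitions

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for Kafka's murmur2-based default partitioner: any
    # deterministic hash of the key shows the same pinning behavior.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every record produced with the fixed key lands on one partition...
fixed = {partition_for("hotkey") for _ in range(1000)}
# ...while varied keys spread across partitions.
spread = {partition_for(f"key-{i}") for i in range(1000)}
print(len(fixed), len(spread))  # the fixed key always hits exactly 1 partition
```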
<p>After running the above commands, come back to the UI, refresh, and you’ll see the message counts, like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768088817389/13c285b2-308a-4e6c-80f1-4de774f34662.png" alt="Kafka Topics with Message Count" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now, go to the Brokers tab and see the imbalance in disk usage:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768088880398/5e615012-a35f-4820-8daa-b46746b65b56.png" alt="Kafka Brokers Disk usage" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You should be able to see that <strong>Broker-2 has only about 47% of the data that Broker-1 has</strong>, and <strong>Broker-3 has about 11% more data than Broker-1</strong>. Broker-2 is significantly underutilized, while Broker-1 and Broker-3 hold most of the data.</p>
<h3 id="heading-attempting-manual-rebalancing">Attempting Manual Rebalancing</h3>
<p><strong>Step 1</strong>: First, we need to find out which topic is heavy – meaning which one handles more data. My setup shows the <code>freecodecamp-logs</code> topic with 8MB of data:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768089339438/e57c1b10-ebac-4824-ad0d-1c8d3ec6ff71.png" alt="Kafka Topics with Message Count" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><strong>Step 2</strong>: Let’s see where the heavy partitions are.</p>
<p>Click on <strong>freecodecamp-logs</strong> in Kafka UI and see the partition table. Look at the message count: partition 4 is bigger than the others. The table also gives information about replicas of partitions: partition 4 has replicas on Broker 1 and 3. Broker 2 doesn’t have partition 4 at all. This explains why Broker 2 was underutilized.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768089513001/8993d5ce-c70f-45e3-b643-b8a759dda138.png" alt="Kafka Topic Partitions" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><strong>Step 3:</strong> To balance the cluster, we need to move partition 4 around.</p>
<p>We can move partition 4 to Broker 2. But before that, let’s do some math to be able to rationalize our decision. Note that the calculation doesn’t have to be precise – we just want a relative sense of data between brokers.</p>
<p>Current state:</p>
<ul>
<li><p><strong>Broker 1</strong>: 4.55 MB</p>
</li>
<li><p><strong>Broker 2</strong>: 2.29 MB (underutilized)</p>
</li>
<li><p><strong>Broker 3</strong>: 5.11 MB (over-utilized)</p>
</li>
</ul>
<p>Note that the compressed data size of partition 4 is roughly 2.25 MB (the exact size is not critical).</p>
<p>If we move partition 4 from [1,3] to [2,3]:</p>
<ul>
<li><p><strong>Broker 1:</strong> Loses partition 4, so 4.55 - 2.25 = <strong>~2.3 MB</strong></p>
</li>
<li><p><strong>Broker 2:</strong> Gains partition 4, so 2.29 + 2.25 = ~<strong>4.54 MB</strong></p>
</li>
<li><p><strong>Broker 3:</strong> Already has partition 4, so = <strong>5.11 MB (no change)</strong></p>
</li>
</ul>
<p>The result is that Broker 1 becomes underutilized.</p>
<p>How about if we move partition 4 from [1,3] to [1,2]?</p>
<ul>
<li><p><strong>Broker 1:</strong> Already has partition 4 = <strong>4.55 MB (no change)</strong></p>
</li>
<li><p><strong>Broker 2:</strong> Gains partition 4, so 2.29 + 2.25 = ~<strong>4.54 MB</strong></p>
</li>
<li><p><strong>Broker 3:</strong> Loses partition 4, so 5.11 - 2.25 = ~<strong>2.86 MB</strong></p>
</li>
</ul>
<p>Hmm, this still creates an imbalance (broker 3 becomes too light).</p>
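<p>This trial-and-error is easy to script. The following sketch replays the same arithmetic for both candidate moves and reports the spread between the heaviest and lightest broker (sizes in MB, taken from the figures above):</p>

```python
# Simulate moving partition 4 (~2.25 MB) between replica sets and
# measure the resulting disk-usage spread across brokers.
PART4_MB = 2.25
current = {1: 4.55, 2: 2.29, 3: 5.11}  # MB per broker, from Kafka UI

def after_move(loads, old_replicas, new_replicas, size):
    loads = dict(loads)
    for b in set(old_replicas) - set(new_replicas):
        loads[b] -= size   # broker loses its copy of partition 4
    for b in set(new_replicas) - set(old_replicas):
        loads[b] += size   # broker gains a copy of partition 4
    return loads

def spread(loads):
    # Gap between the heaviest and lightest broker, in MB.
    return round(max(loads.values()) - min(loads.values()), 2)

move_a = after_move(current, [1, 3], [2, 3], PART4_MB)  # option 1
move_b = after_move(current, [1, 3], [1, 2], PART4_MB)  # option 2
print(spread(current), spread(move_a), spread(move_b))  # 2.82 2.81 1.69
```

<p>Neither single move brings the spread close to zero – which is exactly the problem: moving one partition mostly shuffles the imbalance around.</p>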
<p>So basically, manual rebalancing requires complex calculations. Moving a single partition impacts disk usage, leader distribution, and network traffic across multiple brokers. One poorly planned move can create a new imbalance elsewhere. </p>
<p>But, let’s say you somehow landed on a perfect mathematical calculation and you’re ready to make the move to balance. We’ll assume that the perfect plan is to move Partition 4 from [1, 3] to [2, 3]. I know it’s not the perfect move but the point is to see the pain afterwards.</p>
<p><strong>Step 4</strong>: It’s time to move the partition manually.</p>
<p>We need to tell Kafka to move partition 4's replicas from brokers [1,3] to brokers [2,3].</p>
<p>To do that, you need to create a file called <code>reassignment.json</code> on your machine:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"version"</span>: <span class="hljs-number">1</span>,
  <span class="hljs-attr">"partitions"</span>: [
    {
      <span class="hljs-attr">"topic"</span>: <span class="hljs-string">"freecodecamp-logs"</span>,
      <span class="hljs-attr">"partition"</span>: <span class="hljs-number">4</span>,
      <span class="hljs-attr">"replicas"</span>: [<span class="hljs-number">2</span>, <span class="hljs-number">3</span>],
      <span class="hljs-attr">"log_dirs"</span>: [<span class="hljs-string">"any"</span>, <span class="hljs-string">"any"</span>]
    }
  ]
}
</code></pre>
<p><strong>What this means:</strong></p>
<ul>
<li><p><code>"partition": 4</code> – the target partition</p>
</li>
<li><p><code>"replicas": [2, 3]</code> – the new placement: brokers 2 and 3</p>
</li>
<li><p><code>"log_dirs": ["any", "any"]</code> – let Kafka choose the disk directory</p>
</li>
</ul>
<p>Save this file somewhere accessible.</p>
<p>Then run the following command to copy the JSON to the Kafka cluster:</p>
<pre><code class="lang-bash">docker cp reassignment.json kafka-1:/tmp/reassignment.json
</code></pre>
<p>This copies your local file into the kafka-1 container’s <code>/tmp</code> directory.</p>
<p>Run the following command to verify the file is there:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 cat /tmp/reassignment.json
</code></pre>
<p>You should see your JSON file content.</p>
<p>Now run the actual reassignment command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 /opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --reassignment-json-file /tmp/reassignment.json \
  --execute
</code></pre>
<p>Kafka will print a message telling you whether it has accepted the reassignment and started moving the data.</p>
<p>You can monitor the reassignment using the following command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 /opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --reassignment-json-file /tmp/reassignment.json \
  --verify
</code></pre>
<p>I’m not going to run the manual reassignment because I want to keep the imbalance and show how Cruise Control can help reduce the manual steps. Next, let’s see how Cruise Control handles the same imbalance automatically.</p>
<h3 id="heading-rebalancing-using-cruise-control">Rebalancing Using Cruise Control</h3>
<p>After creating the topics and messages, I let Cruise Control run for a couple of minutes. During that time, it collected metrics and trained its linear regression model. You can run the following command to verify that Cruise Control is running fine and has data (the following is a REST API call using curl):</p>
<pre><code class="lang-bash">curl http://localhost:9090/kafkacruisecontrol/state
</code></pre>
<p>You will get multiple JSON object outputs as part of the response. Each JSON object holds some information about the state of Cruise Control and the Kafka cluster. Let’s see each of these one at a time:</p>
<pre><code class="lang-json">MonitorState: {
  state: RUNNING(<span class="hljs-number">20.000</span>% trained),
  NumValidWindows: (<span class="hljs-number">1</span>/<span class="hljs-number">1</span>) (<span class="hljs-number">100.000</span>%),
  NumValidPartitions: <span class="hljs-number">105</span>/<span class="hljs-number">105</span> (<span class="hljs-number">100.000</span>%),
  flawedPartitions: <span class="hljs-number">0</span>
}
</code></pre>
<p>This describes the state of monitoring, based on the data Cruise Control has collected:</p>
<ul>
<li><p><code>state: RUNNING(20.000% trained)</code> – Cruise Control is <strong>actively collecting metrics</strong> from your Kafka cluster. Right now it has <strong>trained its model on 20% of the expected monitoring data</strong>.</p>
</li>
<li><p><code>NumValidWindows: (1/1) (100%)</code> – Cruise Control has collected 1 complete monitoring window out of 1 required (100% ready). Remember, we had set <code>num.broker.metrics.windows=1</code> in the <code>cruisecontrol.properties</code> configuration file.</p>
</li>
<li><p><code>NumValidPartitions: 105/105 (100%)</code> – Cruise Control analyzed all 105 partitions and has metrics for <strong>all.</strong></p>
</li>
<li><p><code>flawedPartitions: 0</code> – None of the partitions have problematic or missing metrics.</p>
</li>
</ul>
<pre><code class="lang-json">ExecutorState: {state: NO_TASK_IN_PROGRESS}
</code></pre>
<p>The above response indicates the execution engine is idle – no partition moves or leadership changes are currently in progress. This makes sense since we haven't asked Cruise Control to do anything yet.</p>
<pre><code class="lang-json">AnalyzerState: {
  isProposalReady: <span class="hljs-literal">true</span>,
  readyGoals: [
    NetworkInboundCapacityGoal,
    LeaderBytesInDistributionGoal,
    DiskCapacityGoal,
    ReplicaDistributionGoal,
    RackAwareGoal,
    NetworkOutboundCapacityGoal,
    CpuCapacityGoal,
    DiskUsageDistributionGoal,
    LeaderReplicaDistributionGoal,
    ReplicaCapacityGoal
  ]
}
</code></pre>
<p>AnalyzerState tells whether Cruise Control is ready to show a proposal or not. In this case it’s ready.</p>
<ul>
<li><p><code>isProposalReady: true</code> – Cruise Control has <strong>calculated a potential rebalancing plan</strong> (a proposal) that satisfies the configured goals.</p>
</li>
<li><p><code>readyGoals</code> – These are the goals that are considered <strong>ready and valid</strong> for rebalancing. Examples:</p>
<ul>
<li><p><code>DiskCapacityGoal</code>: balance disk usage among brokers</p>
</li>
<li><p><code>ReplicaDistributionGoal</code>: balance number of replicas per broker</p>
</li>
<li><p><code>RackAwareGoal</code>: maintain replicas across racks for fault tolerance</p>
</li>
<li><p><code>LeaderBytesInDistributionGoal</code>: balance network traffic from leaders</p>
</li>
<li><p><code>DiskUsageDistributionGoal</code>: ensure partitions are spread to prevent skew</p>
</li>
</ul>
</li>
</ul>
<p>Note that these are the goals we had set earlier in the <code>cruisecontrol.properties</code> file.</p>
<pre><code class="lang-json">AnomalyDetectorState: {
  selfHealingEnabled:[],
  selfHealingDisabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY, TOPIC_ANOMALY, MAINTENANCE_EVENT],
  selfHealingEnabledRatio:{...},
  recentGoalViolations:[],
  recentBrokerFailures:[],
  recentMetricAnomalies:[],
  recentDiskFailures:[],
  recentTopicAnomalies:[],
  recentMaintenanceEvents:[],
  metrics:{...},
  ongoingSelfHealingAnomaly:None,
  balancednessScore:<span class="hljs-number">100.000</span>
}
</code></pre>
<p><code>AnomalyDetectorState</code> shows information about any detected anomalies and the self-healing configuration.</p>
<ul>
<li><p><code>selfHealingEnabled: []</code> – Automatic self-healing is <strong>currently off</strong>. Cruise Control will <strong>not move partitions automatically</strong> in response to anomalies.</p>
</li>
<li><p><code>selfHealingDisabled: [...]</code> – Lists the anomaly types that are <strong>disabled for automatic self-healing</strong>, including broker failures, disk failures, and goal violations.</p>
</li>
<li><p><code>recentGoalViolations: []</code> – No goals have been violated recently.</p>
</li>
<li><p><code>balancednessScore: 100.000</code> – This is <strong>how balanced the cluster is according to Cruise Control’s hard goals</strong>. 100% means the cluster is perfectly balanced according to the metrics and hard goals currently active. This metric only cares about Hard Goals (Disk Capacity, CPU capacity) being violated – that’s why it shows 100% even though we know there are some disk usage imbalances in our cluster.</p>
</li>
</ul>
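<p>If you call the state endpoint with <code>?json=true</code>, the response is machine-readable, which makes it easy to script a readiness check before triggering a rebalance. Here's a minimal sketch – note that the embedded sample payload and its field names are illustrative abridgements of the output discussed above, so verify them against what your Cruise Control version actually returns:</p>

```python
import json

# Abridged, illustrative sample of a /kafkacruisecontrol/state?json=true
# response; real field names may differ by Cruise Control version.
sample = json.loads("""{
  "MonitorState": {"numValidWindows": 1, "numTotalWindows": 1,
                   "numFlawedPartitions": 0},
  "AnalyzerState": {"isProposalReady": true},
  "ExecutorState": {"state": "NO_TASK_IN_PROGRESS"}
}""")

def ready_for_rebalance(state: dict) -> bool:
    # Ready when all windows are valid, no partition metrics are flawed,
    # a proposal has been computed, and no execution is in flight.
    mon = state["MonitorState"]
    return (mon["numValidWindows"] >= mon["numTotalWindows"]
            and mon["numFlawedPartitions"] == 0
            and state["AnalyzerState"]["isProposalReady"]
            and state["ExecutorState"]["state"] == "NO_TASK_IN_PROGRESS")

print(ready_for_rebalance(sample))  # True for the sample above
```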
<h4 id="heading-the-proposal">The Proposal</h4>
<p>Via the <code>AnalyzerState</code> information, Cruise Control told us that it has a proposal for the cluster. Let’s see what it is. We can fetch the proposal using the proposals endpoint:</p>
<pre><code class="lang-bash">curl -s <span class="hljs-string">"http://localhost:9090/kafkacruisecontrol/proposals?json=true"</span>
</code></pre>
<p>The JSON response is quite large. Let's focus on the key parts that show our cluster's imbalance and how Cruise Control plans to fix it:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"summary"</span>: {
    <span class="hljs-attr">"numReplicaMovements"</span>: <span class="hljs-number">13</span>,    <span class="hljs-comment">// CC wants to move 13 partition replicas</span>
    <span class="hljs-attr">"numLeaderMovements"</span>: <span class="hljs-number">6</span>,      <span class="hljs-comment">// And reassign 6 partition leaders</span>
    <span class="hljs-attr">"onDemandBalancednessScoreBefore"</span>: <span class="hljs-number">84.67</span>,   <span class="hljs-comment">// Current: 84.67% balanced</span>
    <span class="hljs-attr">"onDemandBalancednessScoreAfter"</span>: <span class="hljs-number">89.76</span>     <span class="hljs-comment">// After: 89.76% balanced</span>
  },
  <span class="hljs-attr">"goalSummary"</span>: [
    {
      <span class="hljs-attr">"goal"</span>: <span class="hljs-string">"DiskUsageDistributionGoal"</span>,
      <span class="hljs-attr">"status"</span>: <span class="hljs-string">"VIOLATED"</span>
    },
    {
      <span class="hljs-attr">"goal"</span>: <span class="hljs-string">"LeaderBytesInDistributionGoal"</span>,
      <span class="hljs-attr">"status"</span>: <span class="hljs-string">"VIOLATED"</span>
    }
  ]
}
</code></pre>
<p>Based on the calculations, Cruise Control thinks:</p>
<ol>
<li><p>Moving 13 partition replicas will help. Note that manually, we had decided to move just one partition (partition 4).</p>
</li>
<li><p>Reassigning 6 partition leaders will help. Manually we didn’t account for any leadership reassignment.</p>
</li>
<li><p><code>DiskUsageDistributionGoal</code> has been violated. We know that the disk usage is not distributed perfectly.</p>
</li>
<li><p><code>LeaderBytesInDistributionGoal</code> has also been violated. We couldn’t detect this manually. Technically, you could have found it, but it would have taken a decent amount of manual calculation and would still have been error-prone.</p>
</li>
</ol>
<p>Note: While we're focusing on disk usage imbalance, Cruise Control optimizes for 10 different goals (disk, CPU, network, leaders, and so on). This holistic approach gives it a better chance of achieving true cluster balance versus balancing manually.</p>
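<p>If you review proposals as part of an ops workflow, it helps to reduce the large JSON response to just the headline numbers. This sketch extracts the movement counts and violated goals from a response shaped like the excerpt above (the embedded payload is an abridged, illustrative sample):</p>

```python
import json

# Abridged /proposals?json=true response, shaped like the excerpt above.
proposal = json.loads("""{
  "summary": {"numReplicaMovements": 13, "numLeaderMovements": 6,
              "onDemandBalancednessScoreBefore": 84.67,
              "onDemandBalancednessScoreAfter": 89.76},
  "goalSummary": [
    {"goal": "DiskUsageDistributionGoal", "status": "VIOLATED"},
    {"goal": "LeaderBytesInDistributionGoal", "status": "VIOLATED"},
    {"goal": "RackAwareGoal", "status": "NO-ACTION"}
  ]
}""")

# Which goals does Cruise Control consider violated right now?
violated = [g["goal"] for g in proposal["goalSummary"]
            if g["status"] == "VIOLATED"]
# How much would applying the proposal improve the balancedness score?
improvement = round(proposal["summary"]["onDemandBalancednessScoreAfter"]
                    - proposal["summary"]["onDemandBalancednessScoreBefore"], 2)
print(violated, improvement)
```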
<h4 id="heading-executing-the-proposal">Executing the proposal</h4>
<p>Let’s run the actual rebalancing using Cruise Control. The command is:</p>
<pre><code class="lang-bash">curl -X POST <span class="hljs-string">'http://localhost:9090/kafkacruisecontrol/rebalance?dryrun=false&amp;json=true'</span>
</code></pre>
<p>Again, you’ll get a large JSON response, similar to the proposal.</p>
<p>You can track the status using the following API call:</p>
<pre><code class="lang-bash">curl <span class="hljs-string">"http://localhost:9090/kafkacruisecontrol/user_tasks"</span>
</code></pre>
<p>You will get something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768095347466/64131c47-5884-4894-95df-d46e9eb8cd97.png" alt="Cruise Control Tasks" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Note that the 4th item in the list is our rebalance API call, and it’s complete. This was quick for our small dev cluster, but in large clusters you may see the status as <code>InExecution</code>.</p>
<p>Let’s look at the UI to see the state of the imbalance now that Cruise Control has executed the proposal. The UI shows the following for me:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768095510743/16db64f7-d14b-4120-95c9-ac9b1d43f47e.png" alt="Kafka balanced Disk Usage" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-comparison">Comparison</h4>
<p>Before rebalancing:</p>
<ul>
<li><p>Broker 1: 4.52 MB, 69 partitions, 35 leaders</p>
</li>
<li><p>Broker 2: 2.22 MB, 69 partitions, 35 leaders (<strong>underutilized</strong>)</p>
</li>
<li><p>Broker 3: 5.05 MB, 72 partitions, 35 leaders (<strong>overutilized</strong>)</p>
</li>
<li><p><strong>Disk range:</strong> 2.83 MB (5.05 - 2.22)</p>
</li>
</ul>
<p>After rebalancing:</p>
<ul>
<li><p>Broker 1: 4.66 MB, 69 partitions, 38 leaders</p>
</li>
<li><p>Broker 2: 3.87 MB, 77 partitions, 31 leaders</p>
</li>
<li><p>Broker 3: 4.87 MB, 64 partitions, 36 leaders</p>
</li>
<li><p><strong>Disk range:</strong> 1.00 MB (4.87 - 3.87)</p>
</li>
</ul>
<p>Results:</p>
<ul>
<li><p><strong>Disk usage balanced</strong> – Range reduced from 2.83 MB to 1.00 MB (64% improvement!)</p>
</li>
<li><p><strong>Replicas redistributed</strong> – Broker 2 gained 8 replicas, Broker 3 lost 8 replicas</p>
</li>
<li><p><strong>Leaders balanced</strong> – Changed from 35-35-35 to 38-31-36. Cruise Control prioritized balancing actual network traffic over leader count.</p>
</li>
</ul>
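<p>The headline numbers above are easy to double-check with a few lines:</p>

```python
# Disk usage per broker (MB) before and after the Cruise Control rebalance.
before = {1: 4.52, 2: 2.22, 3: 5.05}
after = {1: 4.66, 2: 3.87, 3: 4.87}

def disk_range(loads):
    # Spread between the heaviest and lightest broker, in MB.
    return round(max(loads.values()) - min(loads.values()), 2)

r_before, r_after = disk_range(before), disk_range(after)
improvement_pct = int((r_before - r_after) / r_before * 100)
print(r_before, r_after, improvement_pct)  # 2.83 1.0 64
```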
<p>The cluster is now more balanced across all metrics. Congrats!</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We covered a lot in this tutorial, so let’s take a step back and look at what we did.</p>
<p>You started by experiencing the reality of manual Kafka management – the endless CLI commands, the tedious calculations, the JSON files, and the potential for costly mistakes. If you felt frustrated during that section, that’s to be expected. That frustration is exactly what thousands of engineering teams deal with every day.</p>
<p>Then you were presented with two complementary tools:</p>
<ol>
<li><p><strong>Kafka UI</strong> gave you visibility. No more grepping through command outputs or manually counting partition leaders. Everything you need – broker health, topic configurations, consumer lag – is right there in a clean web interface. For small teams and development environments, this alone is a game-changer.</p>
</li>
<li><p><strong>Cruise Control</strong> gave you intelligence. It didn't just automate what you'd do manually – it also did a fundamentally better job. While you were focused on moving one partition (partition 4), Cruise Control analyzed all 105 partitions across 10 different optimization goals and proposed a comprehensive rebalancing plan. That's the difference between human effort and automated intelligence.</p>
</li>
</ol>
<p>I want to call out that this tutorial used a simplified setup. In production, you can expect more complex configurations, such as:</p>
<ul>
<li><p>Kafka and Cruise Control running on separate machines</p>
</li>
<li><p>A larger monitoring window for Cruise Control</p>
</li>
<li><p>Self-healing capabilities enabled</p>
</li>
</ul>
<p>If there's one thing you take away from this article, let it be this: you should stop managing your Kafka cluster manually. You've seen there's a better way. Use it. Thanks for reading!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an LSM Tree Storage Engine from Scratch – Full Handbook ]]>
                </title>
                <description>
                    <![CDATA[ Databases are one of the most important parts of a software system. They allow us to store huge amounts of data in an organized way and retrieve it efficiently when we need it. In the early days, when the volume of data was relatively small, engineer... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-an-lsm-tree-storage-engine-from-scratch-handbook/</link>
                <guid isPermaLink="false">6944631e80f40a442d1799df</guid>
                
                    <category>
                        <![CDATA[ Databases ]]>
                    </category>
                
                    <category>
                        <![CDATA[ lsmtree ]]>
                    </category>
                
                    <category>
                        <![CDATA[ storage solutions ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Go Language ]]>
                    </category>
                
                    <category>
                        <![CDATA[ heap ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ramesh Sinha ]]>
                </dc:creator>
                <pubDate>Thu, 18 Dec 2025 20:25:02 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1766089431510/433ff03f-8aca-4a87-82d3-0b6d6c1f371c.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Databases are one of the most important parts of a software system. They allow us to store huge amounts of data in an organized way and retrieve it efficiently when we need it.</p>
<p>In the early days, when the volume of data was relatively small, engineers prioritized fast data retrieval and stored data in <a target="_blank" href="https://en.wikipedia.org/wiki/B-tree">B-tree structures</a> that made searching efficient.</p>
<p>But over time, we started building systems that needed to ingest massive amounts of data like logs, metrics, likes, chats and tweets. This made it necessary to design a storage system that would make writing faster.</p>
<p>One such storage system is the LSM-tree (Log-Structured Merge tree).</p>
<p>In this tutorial, rather than immediately diving into the theoretical concepts of an LSM-Tree Storage system, I’ll take a practical, problem-driven approach. I believe that learning through solving problems is far more effective and engaging than simple memorization of concepts.</p>
<p>By approaching these ideas progressively, my goal is to guide you step by step through real-world engineering challenges and solutions, giving you a front-row seat to the intricacies of building a robust storage system from scratch.</p>
<p>We’ll begin by identifying real-world challenges that arise in database design – like handling write-heavy workloads, ensuring data durability, or managing efficient storage. These challenges will set the stage for each feature and component of LSM-Trees.</p>
<p>Through this method, we’ll explore the foundations of LSM-Tree storage systems and dive deeper into their key components: MemTable, SSTable, Write-Ahead Log (WAL), and Manifest File.</p>
<p>We’ll also examine the Write and Read paths, explore Durability and Crash-Recovery mechanisms, and conclude with one of the most critical processes: Compaction.</p>
<p>By the end of this handbook, you’ll understand not just what these components are but also why they are designed the way they are and how they solve the unique challenges of building modern, high-performance databases.</p>
<h3 id="heading-what-well-cover"><strong>What We’ll Cover:</strong></h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-an-lsm-tree">What is an LSM Tree?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-preface-setting-up-to-build-an-lsm-tree-database">Preface: Setting up to Build an LSM-Tree Database</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-initial-feature-set-laying-the-foundation-of-the-database-system">Initial Feature Set: Laying the Foundation of the Database System</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-memtable-in-memory-data-storage">MemTable: In-Memory Data Storage</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-sstable-persisting-data-for-durability">SSTable: Persisting Data for Durability</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-wal-write-ahead-log-crash-recovery-made-simple">The WAL (Write Ahead Log ): Crash Recovery Made Simple</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-manifest-file-tracking-the-state-of-the-database">Manifest File: Tracking the State of the Database</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-update-and-delete-handling-mutability-in-an-immutable-system">Update and Delete: Handling Mutability in an Immutable System</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-compaction-cleaning-up-stale-and-deleted-data">Compaction: Cleaning Up Stale and Deleted Data</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-complete-code">Complete Code</a></li>
</ul>
</li>
</ol>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>While this tutorial is designed to be comprehensive and approachable, it’ll be helpful if you come in with some foundational knowledge in the following areas:</p>
<ul>
<li><p><strong>Programming in Golang</strong>: Familiarity with Go syntax, error handling, and standard libraries (for example, <code>os</code>, <code>encoding/gob</code>, <code>container/heap</code>) will make it easier to work through the implementation examples.</p>
</li>
<li><p><strong>Basic data structures and algorithms:</strong> Concepts such as maps, heaps, some sorting algorithms, and early termination are leveraged throughout the tutorial.</p>
</li>
<li><p><strong>Understanding persistent storage:</strong> Awareness of the differences between in-memory and disk-based storage, as well as sequential versus random read/write operations will be helpful in grasping performance-related trade-offs.</p>
</li>
<li><p><strong>General database knowledge:</strong> If you're familiar with key-value databases or CRUD operations (Create, Read, Update, Delete), you’ll have a head start.</p>
</li>
<li><p><strong>Concurrency</strong>: Basic understanding of threads and concurrency.</p>
</li>
</ul>
<p>While having experience in these areas will deepen your understanding of the concepts and reduce the learning curve, I will provide sufficient detail and practical explanations at every step, ensuring you gain the insights necessary to follow along and build your own LSM-tree-based storage engine.</p>
<h2 id="heading-what-is-an-lsm-tree">What is an LSM Tree?</h2>
<p>A log-structured merge-tree (or LSM tree) is a data structure that makes database writes super fast by recording new data in memory first, then periodically sorting and merging it into larger files on disk.</p>
<p>The “log” in its name refers to the fact that it saves data in a log-structured format (rather than simply storing it). We will come to what those logs are in a little bit.</p>
<p>LSM trees keep appending new data to the existing data, instead of looking for something that exists and updating it. In other words, you don't have to spend any CPU cycles thinking about where to store data – just append it at the end.</p>
<p>An LSM tree also has "tree" in its name, but does it actually store data in a tree? Not really. The “tree” here is mostly an abstract concept. It refers to the hierarchical organization of levels (L0, L1, L2, and so on), not a tree data structure with nodes and pointers. Again, we will come to those levels in a little bit, but for now, let’s just say it makes sense to call it a tree given that it stores data in a leveled fashion.</p>
<p>Just note that there isn't a tangible tree structure in play (like a binary tree or graph) – it’s not node-based storage.</p>
<p>Finally, there is the "merge" part of the name. For now, suffice it to say that you’ll soon see how this storage engine merges data to save storage by avoiding duplication.</p>
<p>Personally, I think that "Log-Structured Merge <strong>System</strong>" would be clearer than "tree," but "LSM tree" is the established term in the industry, so that's what we'll use.</p>
<h2 id="heading-preface-setting-up-to-build-an-lsm-tree-database">Preface: Setting up to Build an LSM-Tree Database</h2>
<p>Now that we have set the context, let’s put this theory into practice and start building our own LSM-tree-based database storage engine from scratch.</p>
<p>To follow along with this tutorial:</p>
<ul>
<li><p>Make sure you have Golang installed on your system. If not, you can download and install it from the <a target="_blank" href="https://go.dev/">official Go website</a>.</p>
</li>
<li><p>Set up your development environment and create a new Go module for this project by running: <code>go mod init lsm-db</code></p>
</li>
<li><p>Keep a code editor or IDE ready to try out the examples.</p>
</li>
</ul>
<h3 id="heading-initial-feature-set-laying-the-foundation-of-the-database-system">Initial Feature Set: Laying the Foundation of the Database System</h3>
<p>When I’m designing or building a system, I like to think that the system already exists, and I assume that I can just start calling functions that support the features of the system. I’ll follow that pattern here and assume that the following functions of the LSM tree exist and we can invoke them from <code>main.go</code>.</p>
<pre><code class="lang-go">db, err := NewDB[<span class="hljs-keyword">string</span>, <span class="hljs-keyword">string</span>](<span class="hljs-number">3</span>, <span class="hljs-number">3</span>) <span class="hljs-comment">// there is a feature to create new with some parameters we will get to</span>
db.Put(<span class="hljs-string">"a"</span>, <span class="hljs-string">"apple"</span>) <span class="hljs-comment">// a feature to add key value</span>
db.Delete(<span class="hljs-string">"a"</span>) <span class="hljs-comment">// a feature to delete a key</span>
val, _ := db.Get(<span class="hljs-string">"a"</span>) <span class="hljs-comment">// a feature to get value given a key</span>
</code></pre>
<p>As we progress through this journey, I’ll introduce essential features such as in-memory storage, flushing data to disk, and handling duplicate keys. We’ll also explore more advanced components, including a Write-Ahead Log (WAL) to ensure crash tolerance, a Manifest file to maintain the database state across application restarts, and a Compaction process to clean up redundant or stale data by merging older SSTables.</p>
<p>By the end of this tutorial, you will gain a clear understanding of how all these components work together to form a robust and efficient LSM-tree-based storage system.</p>
<h3 id="heading-memtable-in-memory-data-storage">MemTable: In-Memory Data Storage</h3>
<p>We’re building a database storage system, so of course you’ll need a way to store data. This means you need some kind of backing storage. This backing storage in an LSM tree is called a MemTable. The "Mem" refers to its in-memory storage. The benefit of in-memory storage is that it’s orders of magnitude faster than storing on disk.</p>
<p>For simplicity, at the core of the MemTable you can use a map (or dictionary depending on the programming language) as the underlying data structure to store key-value pairs. The map allows for fast lookups, insertions, and deletions, making it ideal for in-memory storage where performance is crucial. So the structure for MemTable will look like:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> MemTable[K comparable, V any] <span class="hljs-keyword">struct</span> {
    data <span class="hljs-keyword">map</span>[K]V <span class="hljs-comment">// this is primary storage map. It's generic so that you</span>
                  <span class="hljs-comment">// can store any kind of data</span>
}
</code></pre>
<p>The above code defines a <code>MemTable</code> struct, where <code>data</code> is a map that acts as the main storage for our key-value pairs. Since the <code>data</code> field is a map, you’ll be able to quickly add, retrieve, or delete values associated with a given key.</p>
<p>You may have noticed something new in the code: the use of <code>[K comparable, V any]</code>. This syntax is Go’s <strong>generic types</strong> feature, which allows us to write flexible code that can handle different data types.</p>
<p>Generics are a way to write code that is independent of any specific data type. They allow you to write functions and data structures that can work with a string, int, float, or any custom type you define, without sacrificing type safety.</p>
<p>In the above code, K and V are type parameters. They say: "This MemTable can work with any Key type K that is comparable, and any value type V."</p>
<p>Now that you have the MemTable, think of what functions it should provide to its clients. Well, the clients need to be able to save and retrieve values associated with a key, so the following functions would fit naturally:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(m *MemTable[K, V])</span> <span class="hljs-title">Put</span><span class="hljs-params">(key K, value V)</span></span> {
    m.data[key] = value
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(m *MemTable[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, <span class="hljs-keyword">bool</span>)</span></span> {
    value, ok := m.data[key]
    <span class="hljs-keyword">var</span> zero V
    <span class="hljs-keyword">if</span> !ok {
        <span class="hljs-keyword">return</span> zero, <span class="hljs-literal">false</span>
    }
    <span class="hljs-keyword">return</span> value, <span class="hljs-literal">true</span>
}
</code></pre>
<p>The above code defines the <code>Put</code> and <code>Get</code> functions – let’s break them down:</p>
<ul>
<li><p><strong>Put</strong>: This function allows the client to insert a key-value pair into the MemTable. If the key already exists in the map, its value will be updated with the new value provided as an argument. This is effectively the <code>write</code> operation of our key-value store.</p>
</li>
<li><p><strong>Get</strong>: This function retrieves the value associated with a given key from the MemTable. It returns two values: the value itself (of type <code>V</code>) and a boolean (<code>true</code> or <code>false</code>). The boolean indicates whether the key was found in the map. If the key does not exist, the function returns a <code>zero value</code> (more on that below) along with <code>false</code>.</p>
</li>
</ul>
<p>Did you notice <code>var zero V</code>?</p>
<p>It's pretty interesting. Think of a situation where we don't get a value from the map – say the key is not there, or something else is wrong. What should the function <code>Get</code> return in that case? Can it return an int (0), or a string "Not found", or some random object (foo)? You don't know anything about the type yet (Generics), so you can't tell it what to return.</p>
<p>In this case, the compiler comes to the rescue. Go has a zero value concept: every type has a zero value. An int has 0, a string has "", a bool has false, and pointers, slices, and maps have nil. By writing <code>var zero V</code> you are telling the compiler, "I don't know the type yet – you figure it out while compiling and return that type's zero value here." Neat!</p>
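<p>You can see the zero-value mechanism in isolation with a tiny generic helper (a standalone sketch, not part of our MemTable):</p>

```go
package main

import "fmt"

// zeroOf returns the zero value of whatever type it is instantiated with.
func zeroOf[T any]() T {
	var zero T // the compiler fills in T's zero value
	return zero
}

func main() {
	fmt.Println(zeroOf[int]())           // 0
	fmt.Printf("%q\n", zeroOf[string]()) // ""
	fmt.Println(zeroOf[*int]() == nil)   // true
}
```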
<p>I missed one thing though: how would a client invoke these functions? Right, we need a way to build the MemTable type.</p>
<p>To construct and initialize a MemTable, we can use a <strong>factory function</strong>: a common programming pattern for creating and returning new objects or instances without directly exposing the underlying implementation details.</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewMemTable</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">()</span> *<span class="hljs-title">MemTable</span>[<span class="hljs-title">K</span>, <span class="hljs-title">V</span>]</span> {
    <span class="hljs-keyword">return</span> &amp;MemTable[K, V]{
        data: <span class="hljs-built_in">make</span>(<span class="hljs-keyword">map</span>[K]V),
    }
}
</code></pre>
<p>Notice how we’ve initialized the data field using the built-in <code>make</code> function. Here’s why we do this:</p>
<p>Go has a built-in function called <code>make</code>, which is used to allocate and initialize slices, maps, and channels. This allocation ensures that they are ready for use without the risk of runtime panics.</p>
<p>You might wonder, why not use the <code>new</code> function to allocate the map? After all, developers coming from other programming backgrounds (like C++ or Java) might expect to use <code>new</code> for all types of memory allocation. But Go <strong>differentiates how it manages memory for composite types versus basic/numeric types</strong>, and that’s where <code>make</code> comes in.</p>
<p>This distinction matters because the <code>new</code> function only <strong>allocates memory</strong> for an object and returns a <em>pointer</em> to that memory. The object itself is not initialized, meaning that while the memory is allocated, the map isn’t ready to use. If we try to perform operations (like adding a key-value pair) on a <code>map</code> only allocated using <code>new</code>, it will cause a runtime panic because the map wasn’t correctly initialized.</p>
<p>For example:</p>
<pre><code class="lang-go">m := <span class="hljs-built_in">new</span>(<span class="hljs-keyword">map</span>[<span class="hljs-keyword">string</span>]<span class="hljs-keyword">int</span>) <span class="hljs-comment">// Allocates a pointer to an uninitialized map</span>
(*m)[<span class="hljs-string">"a"</span>] = <span class="hljs-number">1</span>            <span class="hljs-comment">// This will panic because the map is not initialized</span>
</code></pre>
<p>On the other hand, <code>make</code> both allocates and <strong>initializes the map</strong>, ensuring it’s fully functional right away. That’s why the correct way to create a map is:</p>
<pre><code class="lang-go">m1 := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">map</span>[<span class="hljs-keyword">string</span>]<span class="hljs-keyword">int</span>) <span class="hljs-comment">// Initializes the map properly</span>
m1[<span class="hljs-string">"a"</span>] = <span class="hljs-number">1</span>                <span class="hljs-comment">// This works as expected</span>
</code></pre>
<p>Now that you have the MemTable which can store data in memory, let's hook it up and use it.</p>
<p>But before that, do you remember that at the beginning I used function calls like <code>db.Put</code> and <code>db.Get</code>? Well, what is <code>db</code>? Because we are building a database storage system, it makes more sense to name the interface <code>db</code> instead of MemTable, right? And to be honest, it seems like MemTable is going to be part of the database system, not the whole system, doesn't it?</p>
<p>Even if it's not intuitive at the moment to define something like a DB type, let's just do it. Trust me, it will start to get clearer as we move along. This <code>db</code> type will wrap adding and retrieving data from MemTable.</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> DB[K comparable, V any] <span class="hljs-keyword">struct</span> {
    memtable *MemTable[K, V]
}

<span class="hljs-comment">// factory function for DB type</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewDB</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">()</span> <span class="hljs-params">(*DB[K, V], error)</span></span> {
    memtable := NewMemTable[K, V]()
    <span class="hljs-keyword">return</span> &amp;DB[K, V]{
        memtable: memtable,
    }, <span class="hljs-literal">nil</span>
}
</code></pre>
<p>Let's just define the Put and Get functions which will invoke corresponding functions in MemTable:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Put</span><span class="hljs-params">(key K, value V)</span> <span class="hljs-title">error</span></span> {
    db.memtable.Put(key, value)
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, error)</span></span> {
    <span class="hljs-keyword">if</span> val, ok := db.memtable.Get(key); ok {
        <span class="hljs-keyword">return</span> val, <span class="hljs-literal">nil</span>
    }
    <span class="hljs-keyword">var</span> zero V
    <span class="hljs-keyword">return</span> zero, errors.New(<span class="hljs-string">"key not found"</span>)
}
</code></pre>
<p>Let's integrate what we’ve built so far and run it. Add the code below to <code>main.go</code> and run it with <code>go run main.go</code>:</p>
<pre><code class="lang-go">db, err := NewDB[<span class="hljs-keyword">string</span>, <span class="hljs-keyword">string</span>]()
<span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
    log.Fatalf(<span class="hljs-string">"Failed to create DB: %v"</span>, err)
}
db.Put(<span class="hljs-string">"a"</span>, <span class="hljs-string">"apple"</span>)
val, _ := db.Get(<span class="hljs-string">"a"</span>)
log.Printf(<span class="hljs-string">"Get('a') = %s (should be 'apple')"</span>, val)
</code></pre>
<p>Look at that – you have built an in-memory database that your clients can store data in and fetch it from. It’s using generics, so you can store any kind of values (ints, strings, objects).</p>
<p>Now say you shipped this solution and then it crashes. Your clients will lose all the data. Why will this crash? For one thing, memory is limited and at some point you’re going to run out of it. So there are two major problems with just in-memory storage:</p>
<ol>
<li><p>It's not durable.</p>
</li>
<li><p>Unbounded memory usage is going to crash the system.</p>
</li>
</ol>
<p>How do we solve these problems?</p>
<p>Here's a thought: what if we flush the MemTable data to disk at some regular interval? That way we can ensure that MemTable doesn't grow out of bounds. Also if the db crashes, we won’t lose all the data. We’ll still lose the data that hasn’t been flushed yet, but that's way better than losing all of it.</p>
<h3 id="heading-sstable-persisting-data-for-durability">SSTable: Persisting Data for Durability</h3>
<p>An SSTable is a sorted string table. I wish they’d called it a "Secondary Storage" table, but historically keys and values were strings – hence sorted string table. An SSTable is a persistent, ordered, and immutable file that stores key-value pairs. It’s a file stored on disk, so it's pretty clear that it’s persistent (durable).</p>
<p>Let’s discuss a couple key features of the SSTable:</p>
<ul>
<li><p><strong>It’s ordered</strong>: Keys are stored in sorted order, which makes searching faster and more efficient. Without that, you'd have to scan the whole file to find a key. Later, I will point out some code that leverages sorted storage.</p>
</li>
<li><p><strong>It’s immutable</strong>: Once an SSTable file is written, it can’t be modified. To update or delete a key, you must write a new record in a newer SSTable. This simplifies the design and makes reads and writes very predictable.</p>
</li>
</ul>
<p>But wait, how does that simplify the design?</p>
<p>One of the most complex things in software engineering is dealing with concurrency. Let’s say you’re writing to a file and another thread updates it underneath. How do you know you have the correct data?</p>
<p>With immutable design, you don't have to worry about this at all. You are 100% confident that the data you are reading has not been altered by anybody else. I’ll take that as a massive simplification: you don't have to deal with locks, starvation, staleness, and so on.</p>
<h4 id="heading-how-does-it-make-the-write-path-predictable">How does it make the write path predictable?</h4>
<p>I will answer this partially here and come back to it when we have completed some more implementation. You’ll see that every write in our code follows the exact same steps. There is not a single different condition or edge case.</p>
<p>In a traditional database (using a B-Tree), a typical write involves:</p>
<ol>
<li><p>Finding the data on disk.</p>
</li>
<li><p>Reading the block of data from disk into memory.</p>
</li>
<li><p>Modifying the data in memory.</p>
</li>
<li><p>Writing the entire block back to disk.</p>
</li>
</ol>
<p>The more steps, the more unpredictable performance gets: the write can be fast if the data is already in the memory cache, or slow if multiple disk seeks are needed.</p>
<p>Granted, our code is an overly simplified version, but the extension of this concept still stands true in real LSM implementations.</p>
<h4 id="heading-how-does-it-make-the-read-path-predictable">How does it make the read path predictable?</h4>
<p>Read is predictable because any number of threads can read the same SSTable file at the same time without any problem, with full confidence that data has not been updated.</p>
<p>In contrast, when reading from a mutable data structure, you have to worry that another thread might be in the process of changing the data you are trying to read.</p>
<p>To prevent this, B-Tree-based databases use complex locking mechanisms, and that adds overhead and unpredictability.</p>
<p>I should raise a caution here: reads in LSM-tree storage are not always predictable. They can be fast when data is served from memory, and very slow when multiple SSTables need to be searched to find the key.</p>
<p>Having said that, you don't have to worry about other performance bottlenecks because of locks. Meaning, in B-Tree storage, your read query can be slower because another write query is holding a lock. In simple low-concurrency use cases, you will mostly get amazing read performance from a B-Tree structure, but this advantage wears off as concurrency increases.</p>
<p>The LSM tree was built for highly concurrent, write-heavy use cases, and occasionally slower reads are the trade-off.</p>
<p>The takeaway that gives you ammunition to design better is that B-trees are better for read-heavy workloads. Reads are generally faster and more consistent, but performance can have unpredictable outliers under high write concurrency due to locking.</p>
<p>An LSM tree is better for write-heavy workloads. Writes are much faster. Reads are generally slower and more variable, but their performance profile is more predictable under high write concurrency because there is no read-write locking.</p>
<p>Let's implement an SSTable to see how it works.</p>
<p><strong>The write path:</strong></p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">writeSSTable</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(memtable *MemTable[K, V], path <span class="hljs-keyword">string</span>)</span> <span class="hljs-params">(*SSTable[K, V], error)</span></span> {
    file, err := os.Create(path)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    pairs := <span class="hljs-built_in">make</span>([]Pair[K, V], <span class="hljs-number">0</span>, <span class="hljs-built_in">len</span>(memtable.data))
    <span class="hljs-keyword">for</span> k, v := <span class="hljs-keyword">range</span> memtable.data {
        pairs = <span class="hljs-built_in">append</span>(pairs, Pair[K, V]{Key: k, Value: v})
    }

    sort.Slice(pairs, <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">(i, j <span class="hljs-keyword">int</span>)</span> <span class="hljs-title">bool</span></span> {
        <span class="hljs-keyword">return</span> any(pairs[i].Key).(<span class="hljs-keyword">string</span>) &lt; any(pairs[j].Key).(<span class="hljs-keyword">string</span>)
    })

    encoder := gob.NewEncoder(file)
    <span class="hljs-keyword">for</span> _, pair := <span class="hljs-keyword">range</span> pairs {
        <span class="hljs-keyword">if</span> err := encoder.Encode(pair); err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
        }
    }

    <span class="hljs-keyword">return</span> &amp;SSTable[K, V]{path: path}, <span class="hljs-literal">nil</span>
}
</code></pre>
<p>The following things are important to note from the above code:</p>
<ol>
<li><p><code>sort.Slice</code>: Remember I spoke about order earlier? So we store data in the SSTable in a sorted fashion, and we will see how we leverage it in the read path.</p>
</li>
<li><p>I have used Go's gob encoding package. An encoder makes life simpler because it streams Go data structures to and from a binary format that can be stored on disk. It handles all the complexity of representing types, field names, and values in a standardized binary encoding, so you don't have to.</p>
</li>
</ol>
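<p>To make the gob streaming behavior concrete, here is a standalone round-trip sketch (the minimal <code>Pair</code> here is a stand-in for this demo, not the article's full type): one encoder writes several values to a stream, and one decoder reads them back in the same order.</p>
<pre><code class="lang-go">package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Pair is a minimal key-value record for this demo.
type Pair struct {
	Key   string
	Value string
}

func main() {
	var buf bytes.Buffer

	// One encoder can stream many values; type info is written only once.
	enc := gob.NewEncoder(&amp;buf)
	for _, p := range []Pair{{"a", "apple"}, {"b", "banana"}} {
		if err := enc.Encode(p); err != nil {
			panic(err)
		}
	}

	// The decoder reads the same stream back in write order.
	dec := gob.NewDecoder(&amp;buf)
	for {
		var p Pair
		if err := dec.Decode(&amp;p); err != nil {
			break // io.EOF once the stream is exhausted
		}
		fmt.Printf("%s=%s\n", p.Key, p.Value)
	}
}
</code></pre>
<p>This streaming property is what lets the SSTable code call <code>Encode</code> and <code>Decode</code> in a simple loop.</p>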
<p><strong>The read path:</strong></p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(s *SSTable[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, error)</span></span> {
    file, err := os.Open(s.path)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">var</span> zero V
        <span class="hljs-keyword">return</span> zero, err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    decoder := gob.NewDecoder(file)

    <span class="hljs-keyword">for</span> {
        <span class="hljs-keyword">var</span> pair Pair[K, V]
        <span class="hljs-keyword">if</span> err := decoder.Decode(&amp;pair); err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">if</span> err == io.EOF {
                <span class="hljs-keyword">break</span>
            }
            <span class="hljs-keyword">var</span> zero V
            <span class="hljs-keyword">return</span> zero, err
        }

        <span class="hljs-comment">// for simple comparison we are assuming key is just string</span>
        keyInDB := any(pair.Key).(<span class="hljs-keyword">string</span>)
        <span class="hljs-keyword">if</span> keyInDB == any(key).(<span class="hljs-keyword">string</span>) {
            <span class="hljs-keyword">if</span> any(pair.Value).(<span class="hljs-keyword">string</span>) == TOMBSTONE {
                <span class="hljs-keyword">var</span> zero V
                <span class="hljs-keyword">return</span> zero, ErrDeleted
            }
            <span class="hljs-keyword">return</span> pair.Value, <span class="hljs-literal">nil</span>
        }

        <span class="hljs-keyword">if</span> keyInDB &gt; any(key).(<span class="hljs-keyword">string</span>) {
            <span class="hljs-keyword">var</span> zero V
            <span class="hljs-keyword">return</span> zero, ErrNotFound
        }
    }

    <span class="hljs-keyword">var</span> zero V
    <span class="hljs-keyword">return</span> zero, ErrNotFound
}
</code></pre>
<p>On the read path, look at <code>keyInDB &gt; any(key).(string)</code>. This is one of the examples of how we took advantage of storing data in a sorted key order. The moment we find a key in the SSTable <strong>that is greater than the key</strong> we are looking for, we stop looking because it’s obvious all other keys will be greater than this, so we won't find our key anymore.</p>
<p>Now that you have implemented the SSTable, you just have to decide when to flush data from the MemTable to the SSTable. You can just define max size for MemTable and flush it to disk on the write path when the max size is reached.</p>
<p>I am skipping some variables, boilerplate code, and simplifying things for brevity. I will post a GitHub link with the complete implementation later.</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> DB[K comparable, V any] <span class="hljs-keyword">struct</span> {
    memtable        *MemTable[K, V]
    maxMemtableSize <span class="hljs-keyword">int</span>
    memtableSize    <span class="hljs-keyword">int</span>
    sstables        []*SSTable[K, V]
    sstableCounter  <span class="hljs-keyword">int</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewDB</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(maxMemtableSize <span class="hljs-keyword">int</span>)</span> <span class="hljs-params">(*DB[K, V], error)</span></span> {
    sstables := <span class="hljs-built_in">make</span>([]*SSTable[K, V], <span class="hljs-number">0</span>)
    memtable := NewMemTable[K, V]()

    <span class="hljs-keyword">return</span> &amp;DB[K, V]{
        memtable:        memtable,
        maxMemtableSize: maxMemtableSize,
        sstables:        sstables,
    }, <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Put</span><span class="hljs-params">(key K, value V)</span> <span class="hljs-title">error</span></span> {
    db.memtable.Put(key, value)
    db.memtableSize++

    <span class="hljs-keyword">if</span> db.memtableSize &gt;= db.maxMemtableSize {
        <span class="hljs-keyword">if</span> err := db.flushMemtable(); err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">return</span> err
        }
    }

    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">flushMemtable</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    sstablePath := fmt.Sprintf(<span class="hljs-string">"data-%d.sstable"</span>, db.sstableCounter)
    sstable, err := writeSSTable(db.memtable, sstablePath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    db.sstables = <span class="hljs-built_in">append</span>(db.sstables, sstable)
    db.sstableCounter++
    db.memtable = NewMemTable[K, V]()
    db.memtableSize = <span class="hljs-number">0</span>

    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}
</code></pre>
<p>You'll notice that every time we flush to disk, we write to a new SSTable versus using a single SSTable for the whole database. This is the immutability aspect we discussed earlier.</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, error)</span></span> {
    <span class="hljs-keyword">if</span> val, ok := db.memtable.Get(key); ok {
        <span class="hljs-keyword">return</span> val, <span class="hljs-literal">nil</span>
    }

    <span class="hljs-keyword">for</span> i := <span class="hljs-built_in">len</span>(db.sstables) - <span class="hljs-number">1</span>; i &gt;= <span class="hljs-number">0</span>; i-- {
        sstable := db.sstables[i]
        val, err := sstable.Get(key)

        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">var</span> zero V

            <span class="hljs-keyword">if</span> err == ErrDeleted {
                <span class="hljs-keyword">return</span> zero, ErrNotFound
            }
            <span class="hljs-keyword">if</span> err == ErrNotFound {
                <span class="hljs-keyword">continue</span>
            }
            <span class="hljs-keyword">return</span> zero, err
        }

        <span class="hljs-keyword">return</span> val, <span class="hljs-literal">nil</span>
    }

    <span class="hljs-keyword">var</span> zero V
    <span class="hljs-keyword">return</span> zero, ErrNotFound
}
</code></pre>
<p>One important aspect to note on the read path is that we are reading the newest SSTable first. This is because the newest SSTable has the most updated value for the key.</p>
<p>So, say you have a key "a" with value "apple", and along the way you update that value to "apricot". The update is flushed to a new SSTable (for immutability), so if you were to read an older SSTable first, you'd get the stale value. By reading the newest SSTable first, we get the correct value, and we never have to worry about updating older SSTables.</p>
<h3 id="heading-the-wal-write-ahead-log-crash-recovery-made-simple">The WAL (Write-Ahead Log): Crash Recovery Made Simple</h3>
<p>Now that we have an SSTable, our data is durable and we are safe from losing data upon crashes. Are we really safe, though? Think of a scenario where a crash happens before we flush to the SSTable. We know MemTable has a max threshold, and until then, data lives in memory. So we’re still prone to losing data if a crash happens before the flush.</p>
<p>This is where the WAL (Write Ahead Log) comes into the picture. It’s the single most important aspect of the LSM tree.</p>
<p>We’ll follow a simple rule: "Before we write a piece of data to the in-memory MemTable, we first write it to a log file on disk."</p>
<p>If a crash happens and the database starts again, the first thing it does is look for a WAL, read it if one is found, and replay all the data into MemTable. This process reconstructs the MemTable to the exact state it was in right before the crash.</p>
<p>It's natural to think that if all of your writes are first written to disk it will impact performance. You aren’t wrong, but at the same time there are nuances.</p>
<p>The writes to WAL are different in that they are append-only sequential writes, meaning random disk seeks are not required. On a traditional spinning hard drive (HDD), this is fast because the disk's read/write head does not have to move to a new location. On a modern solid-state drive (SSD), sequential writes are also much faster than random writes.</p>
<p>Whatever small performance impact we accept is a trade-off for durability.</p>
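<p>One nuance worth knowing: an ordinary file write only hands data to the OS, which may buffer it in the page cache, so a power failure can still lose the newest WAL entries. Durable systems call <code>fsync</code> (Go's <code>File.Sync</code>) after appending. Here's a minimal sketch of the idea (the file name and record format are illustrative only):</p>
<pre><code class="lang-go">package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	path := filepath.Join(os.TempDir(), "demo.wal")
	defer os.Remove(path)

	// Append-only open, with the same flags a WAL typically uses.
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if _, err := f.Write([]byte("put a apple\n")); err != nil {
		panic(err)
	}

	// Sync forces the page cache out to the device; without it the
	// write is only as durable as the OS buffers.
	if err := f.Sync(); err != nil {
		panic(err)
	}
	fmt.Println("entry durably appended")
}
</code></pre>
<p>Syncing on every write maximizes durability but costs throughput; many databases let you trade this off by syncing in batches or on a timer.</p>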
<p>Now that we know what WAL does, let's implement it. Two key functions of WAL are to write to a file on disk and replay MemTable upon start.</p>
<p>Note that in the factory function below (NewWAL), the file has been opened in append mode.</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewWAL</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(path <span class="hljs-keyword">string</span>)</span> <span class="hljs-params">(*WAL[K, V], error)</span></span> {
    file, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, <span class="hljs-number">0644</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }
    <span class="hljs-keyword">return</span> &amp;WAL[K, V]{
        file:    file,
        encoder: gob.NewEncoder(file),
    }, <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(wal *WAL[K, V])</span> <span class="hljs-title">Write</span><span class="hljs-params">(key K, value V)</span> <span class="hljs-title">error</span></span> {
    entry := WALEntry[K, V]{Key: key, Value: value}
    <span class="hljs-keyword">return</span> wal.encoder.Encode(&amp;entry)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">ReplayWAL</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(path <span class="hljs-keyword">string</span>)</span> <span class="hljs-params">(*MemTable[K, V], error)</span></span> {
    memtable := NewMemTable[K, V]()
    file, err := os.Open(path)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">if</span> os.IsNotExist(err) {
            <span class="hljs-comment">// If the file doesn't exist, that's fine. Return an empty memtable.</span>
            <span class="hljs-keyword">return</span> memtable, <span class="hljs-literal">nil</span>
        }
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    decoder := gob.NewDecoder(file)
    <span class="hljs-keyword">for</span> {
        <span class="hljs-keyword">var</span> entry WALEntry[K, V]
        <span class="hljs-keyword">if</span> err := decoder.Decode(&amp;entry); err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">if</span> err == io.EOF {
                <span class="hljs-keyword">break</span> <span class="hljs-comment">// We've reached the end of the file.</span>
            }
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
        }
        memtable.Put(entry.Key, entry.Value)
    }

    <span class="hljs-keyword">return</span> memtable, <span class="hljs-literal">nil</span>
}
</code></pre>
<p>A couple notes about the above code:</p>
<ul>
<li><p><strong>NewWAL</strong>: This function creates an instance of the WAL for our database. It takes in the file path where the WAL data should be stored and opens the file using Go’s <code>os.OpenFile</code> function. Also, a <code>gob.Encoder</code> is initialized to simplify the encoding of Go data structures into binary format for efficient storage in the WAL file.</p>
</li>
<li><p><strong>Write</strong>: The Write function appends a new key-value pair to the WAL file. Every write operation to the MemTable first calls this function to ensure the update is durably recorded.</p>
</li>
<li><p><strong>ReplayWAL:</strong> This is the most important function. In the event of a crash, it comes to our rescue by reconstructing the MemTable from the WAL file: it replays the entries stored in the WAL and writes them into the MemTable. Here's how it works:</p>
<ol>
<li><p>The function begins by creating a new empty MemTable instance that will be populated with key-value pairs.</p>
</li>
<li><p>It then attempts to open the WAL file. If the file does not exist (for example, on the very first startup), the function assumes there's nothing to recover and simply returns the empty MemTable.</p>
</li>
<li><p>A <code>gob.Decoder</code> is used to read the WAL file, which helps to deserialize the saved binary-encoded <code>WALEntry</code> data back into key-value pairs.</p>
</li>
<li><p>For each successfully decoded <code>WALEntry</code>, the key-value pair is added back into the MemTable using the <code>Put</code> function.</p>
</li>
</ol>
</li>
</ul>
<p>With this, the database can fully recover its state by replaying all the operations recorded in the WAL.</p>
<p>As far as integration is concerned, every time you create a new DB, you should replay any existing WAL and then open the WAL in append mode. Also, Put should write to the WAL first.</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewDB</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(maxMemtableSize <span class="hljs-keyword">int</span>)</span> <span class="hljs-params">(*DB[K, V], error)</span></span> {
    walPath := <span class="hljs-string">"db.wal"</span>
    memtable, err := ReplayWAL[K, V](walPath) <span class="hljs-comment">// this is the replay</span>
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }
    <span class="hljs-comment">//open WAL in append mode</span>
    wal, err := NewWAL[K, V](walPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    <span class="hljs-keyword">return</span> &amp;DB[K, V]{
        memtable:        memtable,
        maxMemtableSize: maxMemtableSize,
        memtableSize:    <span class="hljs-built_in">len</span>(memtable.data),
        wal:             wal,
        walPath:         walPath,
        sstables:        <span class="hljs-built_in">make</span>([]*SSTable[K, V], <span class="hljs-number">0</span>),
    }, <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Put</span><span class="hljs-params">(key K, value V)</span> <span class="hljs-title">error</span></span> {
<span class="hljs-comment">//first write to WAL</span>
    <span class="hljs-keyword">if</span> err := db.wal.Write(key, value); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    db.memtable.Put(key, value)
    db.memtableSize++

    <span class="hljs-keyword">if</span> db.memtableSize &gt;= db.maxMemtableSize {
        <span class="hljs-keyword">if</span> err := db.flushMemtable(); err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">return</span> err
        }
    }

    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}
</code></pre>
<h3 id="heading-manifest-file-tracking-the-state-of-the-database">Manifest File: Tracking the State of the Database</h3>
<p>By this point, the database is pretty robust and durable, but an important question lingers: upon restarts, how does our database know about SSTables? Knowing about all SSTables is important for fetching data.</p>
<p>So say our database crashed after writing several SSTables. Without knowing about these SSTables, the database will create a new slice of SSTables and all of our old data is gone – queries won't read those files.</p>
<p>To solve this problem, we introduce an inventory of SSTables called MANIFEST. Every time we successfully create a new SSTable in flushMemtable, we add its path to the MANIFEST and save the MANIFEST to disk.</p>
<p>The very first thing NewDB does on startup is read the MANIFEST. This gives it the list of all the file paths, and it uses this list to perfectly reconstruct its SSTables slice.</p>
<p>In short, MANIFEST determines the state of the DB.</p>
<p>The Manifest struct contains a slice of SSTable paths. The Read function reads the MANIFEST file to restore knowledge of the SSTables, and the Write function writes a new manifest file.</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> Manifest <span class="hljs-keyword">struct</span> {
    SSTablePaths []<span class="hljs-keyword">string</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">ReadManifest</span><span class="hljs-params">(path <span class="hljs-keyword">string</span>)</span> <span class="hljs-params">(*Manifest, error)</span></span> {
    file, err := os.Open(path)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">if</span> os.IsNotExist(err) {
            <span class="hljs-comment">// If manifest doesn't exist, return empty manifest</span>
            <span class="hljs-keyword">return</span> &amp;Manifest{SSTablePaths: []<span class="hljs-keyword">string</span>{}}, <span class="hljs-literal">nil</span>
        }
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    <span class="hljs-keyword">var</span> manifest Manifest
    decoder := gob.NewDecoder(file)
    err = decoder.Decode(&amp;manifest)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    <span class="hljs-keyword">return</span> &amp;manifest, <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">WriteManifest</span><span class="hljs-params">(path <span class="hljs-keyword">string</span>, manifest *Manifest)</span> <span class="hljs-title">error</span></span> {
    tmpPath := path + <span class="hljs-string">".tmp"</span>
    file, err := os.Create(tmpPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    encoder := gob.NewEncoder(file)
    <span class="hljs-keyword">if</span> err := encoder.Encode(manifest); err != <span class="hljs-literal">nil</span> {
        file.Close()
        os.Remove(tmpPath)
        <span class="hljs-keyword">return</span> err
    }

    <span class="hljs-keyword">if</span> err := file.Close(); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-comment">// Atomic Rename</span>
    <span class="hljs-keyword">return</span> os.Rename(tmpPath, path)
}
</code></pre>
<p>You'll notice that we aren’t modifying the existing MANIFEST file directly. Instead, we’re creating a temporary file, writing all the data to it, closing it, and then atomically renaming it to replace the old MANIFEST.</p>
<p>The <code>os.Rename()</code> operation is atomic on most filesystems, meaning it either completely succeeds or completely fails – there's no in-between state. This is crucial because if the system crashes while updating the MANIFEST, we need to ensure we don't end up with a corrupted file. We’ll discuss this again below when we’re talking about compaction.</p>
<p>With this approach, we either have the old valid MANIFEST or the new valid MANIFEST, never a partially written corrupted file.</p>
<p>From an integration standpoint, NewDB reads the manifest and builds its SSTable slice from it. The flush method, since it writes a new SSTable, also records that SSTable's path in the manifest so the DB stays aware of every SSTable.</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> DB[K comparable, V any] <span class="hljs-keyword">struct</span> {
    memtable        *MemTable[K, V]
    maxMemtableSize <span class="hljs-keyword">int</span>
    memtableSize    <span class="hljs-keyword">int</span>
    sstables        []*SSTable[K, V]
    sstableCounter  <span class="hljs-keyword">int</span>
    wal             *WAL[K, V]
    walPath         <span class="hljs-keyword">string</span>
    manifest        *Manifest
    manifestPath    <span class="hljs-keyword">string</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewDB</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(maxMemtableSize <span class="hljs-keyword">int</span>)</span> <span class="hljs-params">(*DB[K, V], error)</span></span> {
    walPath := <span class="hljs-string">"db.wal"</span>
    memtable, err := ReplayWAL[K, V](walPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    wal, err := NewWAL[K, V](walPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    manifestPath := <span class="hljs-string">"MANIFEST"</span>
    manifest, err := ReadManifest(manifestPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    sstables := <span class="hljs-built_in">make</span>([]*SSTable[K, V], <span class="hljs-built_in">len</span>(manifest.SSTablePaths))
    <span class="hljs-keyword">for</span> i, path := <span class="hljs-keyword">range</span> manifest.SSTablePaths {
        sstables[i] = &amp;SSTable[K, V]{path: path}
    }

    <span class="hljs-keyword">return</span> &amp;DB[K, V]{
        memtable:        memtable,
        maxMemtableSize: maxMemtableSize,
        memtableSize:    <span class="hljs-built_in">len</span>(memtable.data),
        wal:             wal,
        walPath:         walPath,
        manifest:        manifest,
        manifestPath:    manifestPath,
        sstables:        sstables,
        sstableCounter:  <span class="hljs-built_in">len</span>(sstables), <span class="hljs-comment">// resume numbering so existing files aren't overwritten</span>
    }, <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">flushMemtable</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    sstablePath := fmt.Sprintf(<span class="hljs-string">"data-%d.sstable"</span>, db.sstableCounter)
    sstable, err := writeSSTable(db.memtable, sstablePath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    db.sstables = <span class="hljs-built_in">append</span>(db.sstables, sstable)
    db.sstableCounter++

    db.manifest.SSTablePaths = <span class="hljs-built_in">append</span>(db.manifest.SSTablePaths, sstablePath)
    <span class="hljs-keyword">if</span> err := WriteManifest(db.manifestPath, db.manifest); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    db.memtable = NewMemTable[K, V]()
    db.memtableSize = <span class="hljs-number">0</span>

    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}
</code></pre>
<p>At this point, our DB has almost everything. It can write to memory (MemTable), persist to disk (SSTables), and recover from crashes (WAL and manifest). For completeness, it should also support updates and deletes – so let’s look at those next.</p>
<h3 id="heading-update-and-delete-handling-mutability-in-an-immutable-system">Update and Delete: Handling Mutability in an Immutable System</h3>
<p>By this time, you should know that in an LSM storage system, data is never updated – rather, new data is written. For example, if you have a data pair ("a": "apple") and over time this has to change to the pair ("a": "apricot"), a new pair will be written to a different SSTable without any change to the existing pair. And yes, this leads to duplicates.</p>
<p>Also, interestingly, data isn't even deleted during write operations. The reason is that, in the traditional sense, deleting ("a": "apple") would mean finding where it lives on disk and removing it, which makes writes slow. So a clever mechanism is used instead: rather than removing the data directly, you mark the key as deleted by writing a special <code>TOMBSTONE</code> value.</p>
<p>So, in the case of deleting (a : apple), you wouldn't remove the key from any SSTable. Instead, you’d write a new key-value pair such as ("a": "TOMBSTONE"). Here’s what this achieves:</p>
<ul>
<li><p>The <code>"TOMBSTONE"</code> serves as a marker within the SSTable, telling the system that the key <code>"a"</code> has been logically deleted, even though it still physically exists in older SSTables.</p>
</li>
<li><p>During future reads, any value associated with <code>"TOMBSTONE"</code> will be treated as deleted, ensuring that the entry no longer shows up in query results.</p>
</li>
<li><p>This mechanism avoids the need for immediate deletions or expensive in-place updates, making write operations faster and simpler.</p>
</li>
</ul>
<p>But this also raises the following questions:</p>
<ol>
<li><p>How do you read accurately when there are duplicates? That is, how do users get ("a": "apricot") instead of ("a": "apple"), given that the former is the latest value?</p>
</li>
<li><p>How do you handle deletes to ensure deleted keys are not returned (and instead, a proper error message is returned)?</p>
</li>
<li><p>This stale and deleted data is garbage. How do you get rid of it to save storage space?</p>
</li>
</ol>
<p>As long as data is in MemTable (in-memory map), the duplicates are easy to handle: new values will just replace the old values.</p>
<p>But it gets tricky when data is in multiple SSTables. There is a very simple solution to this problem, and that is to just read the newer SSTable before older ones. That way, you will always read the latest value for a given key and exit early.</p>
<p>The following code in the read path ensures reading from newer SSTables before moving to older ones (note the loop starts from <code>len(db.sstables) - 1</code>):</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, error)</span></span> {
    <span class="hljs-comment">// Check memtable first</span>
    <span class="hljs-keyword">if</span> val, ok := db.memtable.Get(key); ok {
        <span class="hljs-keyword">if</span> any(val).(<span class="hljs-keyword">string</span>) == TOMBSTONE {
            <span class="hljs-keyword">var</span> zero V
            <span class="hljs-keyword">return</span> zero, ErrNotFound
        }
        <span class="hljs-keyword">return</span> val, <span class="hljs-literal">nil</span>
    }

    <span class="hljs-comment">// Then check sstables from newest to oldest</span>
    <span class="hljs-keyword">for</span> i := <span class="hljs-built_in">len</span>(db.sstables) - <span class="hljs-number">1</span>; i &gt;= <span class="hljs-number">0</span>; i-- {
        sstable := db.sstables[i]
        val, err := sstable.Get(key)

        <span class="hljs-keyword">if</span> err == <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">return</span> val, <span class="hljs-literal">nil</span>
        }

        <span class="hljs-keyword">var</span> zero V
        <span class="hljs-keyword">if</span> err == ErrDeleted {
            <span class="hljs-keyword">return</span> zero, ErrNotFound
        }
        <span class="hljs-keyword">if</span> err == ErrNotFound {
            <span class="hljs-keyword">continue</span>
        }
        <span class="hljs-keyword">return</span> zero, err
    }

    <span class="hljs-keyword">var</span> zero V
    <span class="hljs-keyword">return</span> zero, ErrNotFound
}
</code></pre>
<p>And for delete, you could just add a new value "TOMBSTONE":</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Delete</span><span class="hljs-params">(key K)</span> <span class="hljs-title">error</span></span> {
    <span class="hljs-keyword">return</span> db.Put(key, any(TOMBSTONE).(V))
}
</code></pre>
<p>Note: This implementation assumes V is a string type. In a production system, you would need a more robust way to handle tombstones that works with any value type.</p>
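<p>As one sketch of such a robust approach (the <code>Entry</code> wrapper and names here are my own, not part of the article's code), you can wrap values in a record carrying an explicit deletion flag, so no sentinel value can ever collide with real data:</p>
<pre><code class="lang-go">package main

import "fmt"

// Entry wraps a value with an explicit tombstone flag, so deletion
// does not depend on V being a string (hypothetical alternative design).
type Entry[V any] struct {
	Value   V
	Deleted bool
}

// memtable maps keys to entries; Delete stores Entry{Deleted: true}.
type memtable[K comparable, V any] map[K]Entry[V]

func (m memtable[K, V]) Put(k K, v V) { m[k] = Entry[V]{Value: v} }
func (m memtable[K, V]) Delete(k K)   { m[k] = Entry[V]{Deleted: true} }

func (m memtable[K, V]) Get(k K) (V, bool) {
	e, ok := m[k]
	if !ok || e.Deleted {
		var zero V
		return zero, false
	}
	return e.Value, true
}

func main() {
	m := memtable[string, int]{}
	m.Put("a", 1)
	m.Delete("a")
	_, ok := m.Get("a")
	fmt.Println(ok) // false: the key is logically deleted
}
</code></pre>
<p>The same wrapper would be what gets gob-encoded into SSTables, so tombstones survive flushes without any string assumption.</p>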
<p>Handling deleted keys becomes simple now. You can check for the value (in MemTable and SSTable) and return an error if the value is "TOMBSTONE":</p>
<pre><code class="lang-go"><span class="hljs-comment">// db.go</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, error)</span></span> {
    <span class="hljs-keyword">if</span> val, ok := db.memtable.Get(key); ok {
        <span class="hljs-keyword">if</span> any(val).(<span class="hljs-keyword">string</span>) == TOMBSTONE { <span class="hljs-comment">//got TOMBSTONE, return zero</span>
            <span class="hljs-keyword">var</span> zero V
            <span class="hljs-keyword">return</span> zero, ErrNotFound
        }
        <span class="hljs-keyword">return</span> val, <span class="hljs-literal">nil</span>
    }
    <span class="hljs-comment">// ... rest of function</span>
}
</code></pre>
<pre><code class="lang-go"><span class="hljs-comment">// sstable.go</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(s *SSTable[K, V])</span> <span class="hljs-title">Get</span><span class="hljs-params">(key K)</span> <span class="hljs-params">(V, error)</span></span> {
    <span class="hljs-comment">// ... earlier code</span>

    keyInDB := any(pair.Key).(<span class="hljs-keyword">string</span>)
    <span class="hljs-keyword">if</span> keyInDB == any(key).(<span class="hljs-keyword">string</span>) {
        <span class="hljs-keyword">if</span> any(pair.Value).(<span class="hljs-keyword">string</span>) == TOMBSTONE {
            <span class="hljs-keyword">var</span> zero V
            <span class="hljs-keyword">return</span> zero, ErrDeleted
        }
        <span class="hljs-keyword">return</span> pair.Value, <span class="hljs-literal">nil</span>
    }

    <span class="hljs-comment">// ... rest of function</span>
}
</code></pre>
<h3 id="heading-compaction-cleaning-up-stale-and-deleted-data">Compaction: Cleaning Up Stale and Deleted Data</h3>
<p>We have handled all the scenarios so far except for one. It doesn't concern serving read/write traffic, but it's important for the health of the storage system.</p>
<p>Over time, the system has developed a lot of garbage (stale, deleted data) and needs a garbage collection mechanism. Compaction is a background maintenance process that cleans up and reorganizes data in an LSM storage system.</p>
<p>As the system grows, multiple SSTables accumulate, so a single read may need several file operations to find a value. By compacting (or merging) multiple SSTables into one, you reduce that disk operation overhead. Along the way, you should also permanently delete data that has been TOMBSTONED.</p>
<p>Note: Compaction is the only time data is permanently deleted from an LSM storage system.</p>
<p>To grasp the concept of compaction, we're going to implement something called <code>Full Compaction</code>, where you merge all the existing SSTables into one larger SSTable. Real-world database implementations use more complex strategies that involve multiple levels of compaction.</p>
<h4 id="heading-compaction-algorithm">Compaction Algorithm</h4>
<p>We’re going to implement <code>K-way merge</code> to perform compaction. It’s a general algorithm that takes K sorted lists and merges them into a single, combined sorted list. In this case, the K sorted lists are the SSTables, and you are going to merge all of them into a single SSTable.</p>
<p>Our SSTables are already sorted, so the idea of merging them involves:</p>
<ol>
<li><p>Taking the smallest (first) keys from each SSTable</p>
</li>
<li><p>Finding the smallest among those keys</p>
</li>
<li><p>Storing the found smallest key into new SSTable file</p>
</li>
<li><p>Fetching next key from the SSTable the smallest key belongs to</p>
</li>
<li><p>Repeating this process for all SSTables</p>
</li>
</ol>
<p>Here’s a simple example with numbers:</p>
<pre><code class="lang-bash">Assume we have 3 sorted lists:
List A : [4, 8, 12]
List B : [3, 9]
List C : [7, 10, 11]

In the first iteration, we will take (4, 3, 7) because those are the smallest keys <span class="hljs-keyword">for</span> individual lists. 
We find the smallest among those, <span class="hljs-built_in">which</span> is 3, and store 3 <span class="hljs-keyword">in</span> the result list.

In the second iteration, we will take (4, 9, 7). Note that 3 has already been accounted <span class="hljs-keyword">for</span>. 
We pick 4 and store it to the result list.

Repeating this until all lists are empty, we get:
Result List : [3, 4, 7, 8, 9, 10, 11, 12]
</code></pre>
<p>The core part of this algorithm is finding the smallest key among the smallest keys from the individual SSTables. Fortunately, there's a data structure called a <code>Min-Heap</code> that does exactly this. You take the smallest key from each SSTable and push them all onto a Min-Heap, which then yields the smallest among them. We'll leverage Go's <code>container/heap</code> package, which provides the heap algorithms that keep the minimum value at the top of the heap.</p>
<p>The Min-Heap needs you to provide a function that determines which of two keys is smaller, since it uses that comparison to find the global minimum. The following function implements it:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(h MinHeap[K, V])</span> <span class="hljs-title">Less</span><span class="hljs-params">(i, j <span class="hljs-keyword">int</span>)</span> <span class="hljs-title">bool</span></span> {
    <span class="hljs-comment">// again for simple comparison assume string key</span>
    keyI := any(h[i].Pair.Key).(<span class="hljs-keyword">string</span>)
    keyJ := any(h[j].Pair.Key).(<span class="hljs-keyword">string</span>)
    <span class="hljs-keyword">if</span> keyI != keyJ {
        <span class="hljs-keyword">return</span> keyI &lt; keyJ
    }
    <span class="hljs-comment">// this is needed for the case when you have duplicate keys,</span>
    <span class="hljs-comment">// you will want to pick the one that is in newer sstable because that is latest</span>
    <span class="hljs-keyword">return</span> h[i].SSTableIndex &gt; h[j].SSTableIndex
}
</code></pre>
<p>One important aspect of the <code>Less</code> function shown above is how it handles ties. If we have two pairs with the same key, which one is lesser? Assume two pairs, <code>(a: apple)</code> and <code>(a: apricot)</code>, where (a: apple) is the older value (written to an older SSTable). Which pair should the Less function treat as the lesser value?</p>
<p>The answer is the one in the newer SSTable (see <code>h[i].SSTableIndex &gt; h[j].SSTableIndex</code>). This ensures that the pair from the SSTable with the higher index (that is, the latest) becomes the lesser value, so (a: apricot) wins. It is important to always get the newest value of a given key.</p>
<p>The code for compaction looks something like the following. Note that we’re discarding deleted values (TOMBSTONE) and the older values.</p>
<pre><code class="lang-go"><span class="hljs-comment">// put this in a new file compaction.go</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">MergeSSTables</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(sstables []*SSTable[K, V], newPath <span class="hljs-keyword">string</span>)</span> <span class="hljs-params">(*SSTable[K, V], error)</span></span> {
    newFile, err := os.Create(newPath) <span class="hljs-comment">// create a new sstable file</span>
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }
    <span class="hljs-keyword">defer</span> newFile.Close() <span class="hljs-comment">// ensure the file is closed, avoiding a file descriptor leak</span>

    newEncoder := gob.NewEncoder(newFile) <span class="hljs-comment">// initialize encoder for new SSTable file</span>


    files := <span class="hljs-built_in">make</span>([]*os.File, <span class="hljs-built_in">len</span>(sstables)) <span class="hljs-comment">// open all the sstables</span>
    decoders := <span class="hljs-built_in">make</span>([]*gob.Decoder, <span class="hljs-built_in">len</span>(sstables)) <span class="hljs-comment">// initialize one decoder per sstable file</span>
    <span class="hljs-keyword">for</span> i, sstable := <span class="hljs-keyword">range</span> sstables {
        files[i], err = os.Open(sstable.path)
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
        }
        <span class="hljs-keyword">defer</span> files[i].Close() <span class="hljs-comment">// ensure each file is closed, avoiding a file descriptor leak</span>
        decoders[i] = gob.NewDecoder(files[i])
    }

    <span class="hljs-comment">// read first pair from each sstable and store in a pair array</span>
    pairs := <span class="hljs-built_in">make</span>([]Pair[K, V], <span class="hljs-built_in">len</span>(decoders))
    emptySSTables := <span class="hljs-built_in">make</span>([]<span class="hljs-keyword">bool</span>, <span class="hljs-built_in">len</span>(decoders)) <span class="hljs-comment">// track empty sstables</span>
    <span class="hljs-keyword">for</span> i, decoder := <span class="hljs-keyword">range</span> decoders {
        <span class="hljs-keyword">if</span> err := decoder.Decode(&amp;pairs[i]); err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">if</span> err == io.EOF {
                emptySSTables[i] = <span class="hljs-literal">true</span>
                <span class="hljs-keyword">continue</span>
            }
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
        }
    }

    <span class="hljs-comment">// push those pairs onto heap</span>
    h := &amp;MinHeap[K, V]{}
    <span class="hljs-keyword">for</span> i, pair := <span class="hljs-keyword">range</span> pairs {
        <span class="hljs-keyword">if</span> !emptySSTables[i] {
            heap.Push(h, &amp;HeapItem[K, V]{Pair: pair, SSTableIndex: i})
        }
    }

    <span class="hljs-comment">// init the min-heap calculation algorithm from container/heap package</span>
    heap.Init(h)

    <span class="hljs-keyword">var</span> lastKey K
    firstKey := <span class="hljs-literal">true</span>

    <span class="hljs-comment">// pop the min item from heap and store it into new sstable</span>
    <span class="hljs-keyword">for</span> h.Len() &gt; <span class="hljs-number">0</span> {
        item := heap.Pop(h).(*HeapItem[K, V])

        <span class="hljs-comment">// If this key is a duplicate of the last one we saw, skip it</span>
        <span class="hljs-keyword">if</span> !firstKey &amp;&amp; item.Pair.Key == lastKey {
            <span class="hljs-comment">// We only care about the version from the newest SSTable,</span>
            <span class="hljs-comment">// which we have already processed</span>
        } <span class="hljs-keyword">else</span> {
            <span class="hljs-keyword">if</span> any(item.Pair.Value).(<span class="hljs-keyword">string</span>) != TOMBSTONE {
                <span class="hljs-comment">// discard deleted</span>
                <span class="hljs-keyword">if</span> err := newEncoder.Encode(item.Pair); err != <span class="hljs-literal">nil</span> {
                    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
                }
            }
        }

        lastKey = item.Pair.Key
        firstKey = <span class="hljs-literal">false</span>

        <span class="hljs-comment">// Push the next item from the same SSTable into the heap</span>
        <span class="hljs-keyword">var</span> nextPair Pair[K, V]
        <span class="hljs-keyword">if</span> err := decoders[item.SSTableIndex].Decode(&amp;nextPair); err == <span class="hljs-literal">nil</span> {
            heap.Push(h, &amp;HeapItem[K, V]{Pair: nextPair, SSTableIndex: item.SSTableIndex})
        } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> err != io.EOF {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
        }
    }

    <span class="hljs-keyword">return</span> &amp;SSTable[K, V]{path: newPath}, <span class="hljs-literal">nil</span>
}
</code></pre>
<p>All the compaction magic has been packed in one function, <code>MergeSSTables</code>. The function has the following logical steps (and you can check the inline comments in the code to follow along):</p>
<ol>
<li><p>We create a new destination SSTable file and initialize corresponding <code>gob.Encoder</code></p>
</li>
<li><p>We open all the existing SSTable files and store their handles in the <code>files</code> array. We also initialize one <code>gob.Decoder</code> per existing SSTable file. To prevent resource leaks, a <code>defer</code> statement ensures that each file is closed once the function completes its work.</p>
</li>
<li><p>Each <code>decoder</code> reads the first key-value pair from its corresponding SSTable and stores it in the <code>pairs</code> array.</p>
</li>
<li><p>SSTables that are already exhausted (for example, are empty or have hit the end of the file) are marked as such in the <code>emptySSTables</code> slice, and we skip pushing them onto the heap.</p>
</li>
<li><p>We push each pair from the pairs array to <code>Min-Heap</code> and then initialize the <code>Min-Heap</code> calculation algorithm. This algorithm is present in Go’s <code>container/heap</code> package.</p>
</li>
<li><p>Each time the smallest key-value pair is popped from the min-heap, it’s compared with the previously processed key (<code>lastKey</code>). Duplicate keys (those whose values are already written) are skipped.</p>
</li>
<li><p>Values marked with a <code>"TOMBSTONE"</code> (logically deleted entries) are ignored and not written to the new SSTable, effectively cleaning up deleted data.</p>
</li>
<li><p>To continue the merge, the next key-value pair from the same SSTable (as the one we just processed) is read and pushed onto the heap, unless the end of the SSTable (<code>io.EOF</code>) has been reached.</p>
</li>
</ol>
<p>To integrate this with the DB, you could use a compaction threshold and trigger compaction as part of the flush when this threshold is reached:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> DB[K comparable, V any] <span class="hljs-keyword">struct</span> {
    memtable            *MemTable[K, V]
    maxMemtableSize     <span class="hljs-keyword">int</span>
    memtableSize        <span class="hljs-keyword">int</span>
    sstables            []*SSTable[K, V]
    sstableCounter      <span class="hljs-keyword">int</span>
    wal                 *WAL[K, V]
    walPath             <span class="hljs-keyword">string</span>
    manifest            *Manifest
    manifestPath        <span class="hljs-keyword">string</span>
    compactionThreshold <span class="hljs-keyword">int</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewDB</span>[<span class="hljs-title">K</span> <span class="hljs-title">comparable</span>, <span class="hljs-title">V</span> <span class="hljs-title">any</span>]<span class="hljs-params">(maxMemtableSize <span class="hljs-keyword">int</span>, compactionThreshold <span class="hljs-keyword">int</span>)</span> <span class="hljs-params">(*DB[K, V], error)</span></span> {
    walPath := <span class="hljs-string">"db.wal"</span>
    memtable, err := ReplayWAL[K, V](walPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    wal, err := NewWAL[K, V](walPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    manifestPath := <span class="hljs-string">"MANIFEST"</span>
    manifest, err := ReadManifest(manifestPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    sstables := <span class="hljs-built_in">make</span>([]*SSTable[K, V], <span class="hljs-built_in">len</span>(manifest.SSTablePaths))
    <span class="hljs-keyword">for</span> i, path := <span class="hljs-keyword">range</span> manifest.SSTablePaths {
        sstables[i] = &amp;SSTable[K, V]{path: path}
    }

    <span class="hljs-keyword">return</span> &amp;DB[K, V]{
        wal:                 wal,
        walPath:             walPath,
        memtable:            memtable,
        memtableSize:        <span class="hljs-built_in">len</span>(memtable.data),
        maxMemtableSize:     maxMemtableSize,
        manifestPath:        manifestPath,
        manifest:            manifest,
        sstables:            sstables,
        compactionThreshold: compactionThreshold,
    }, <span class="hljs-literal">nil</span>
}

<span class="hljs-comment">// a new compact function</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">Compact</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    compactedSSTablePath := fmt.Sprintf(<span class="hljs-string">"data-compacted-%d.sstable"</span>, db.sstableCounter)
    compactedSSTable, err := MergeSSTables(db.sstables, compactedSSTablePath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-comment">// write new SSTable to MANIFEST file</span>
    db.manifest.SSTablePaths = []<span class="hljs-keyword">string</span>{compactedSSTablePath}
    <span class="hljs-keyword">if</span> err := WriteManifest(db.manifestPath, db.manifest); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-comment">//note delete only after writing manifest</span>
    <span class="hljs-keyword">for</span> _, sstable := <span class="hljs-keyword">range</span> db.sstables {
        <span class="hljs-keyword">if</span> err := os.Remove(sstable.path); err != <span class="hljs-literal">nil</span> {
            log.Printf(<span class="hljs-string">"Failed to remove old sstable %s: %v"</span>, sstable.path, err)
        }
    }

    db.sstables = []*SSTable[K, V]{compactedSSTable}
    db.sstableCounter++

    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(db *DB[K, V])</span> <span class="hljs-title">flushMemtable</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    sstablePath := fmt.Sprintf(<span class="hljs-string">"data-%d.sstable"</span>, db.sstableCounter)
    sstable, err := writeSSTable(db.memtable, sstablePath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    db.sstables = <span class="hljs-built_in">append</span>(db.sstables, sstable)
    db.sstableCounter++

    db.manifest.SSTablePaths = <span class="hljs-built_in">append</span>(db.manifest.SSTablePaths, sstablePath)
    <span class="hljs-keyword">if</span> err := WriteManifest(db.manifestPath, db.manifest); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    db.memtable = NewMemTable[K, V]()
    db.memtableSize = <span class="hljs-number">0</span>

    <span class="hljs-comment">// trigger compaction</span>
    <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(db.sstables) &gt;= db.compactionThreshold {
        <span class="hljs-keyword">if</span> err := db.Compact(); err != <span class="hljs-literal">nil</span> {
            log.Printf(<span class="hljs-string">"Compaction failed: %v"</span>, err)
            <span class="hljs-keyword">return</span> err
        }
    }

    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}
</code></pre>
<p>Notice the <code>Compact()</code> function in the integrated DB code? This is where we invoke the previously defined <code>MergeSSTables</code> function to trigger the compaction process. After invoking <code>MergeSSTables</code>, we record the new SSTable in the MANIFEST file and only then delete the older SSTables.</p>
<p>Previously, in the <a class="post-section-overview" href="#heading-manifest-file-tracking-the-state-of-the-database">Manifest File: Tracking the State of the Database</a> section, I spoke about atomic renaming with <code>os.Rename(tmpPath, path)</code>. Let's talk about why the atomic renaming of the MANIFEST matters for compaction.</p>
<p>During compaction, we're making a major change to the database state: replacing multiple SSTables with a single compacted one. The MANIFEST update is critical here because it's the source of truth for which SSTables exist.</p>
<p>Let’s think about what could go wrong without atomic renaming:</p>
<ol>
<li><p>You start writing the new MANIFEST (which points to the compacted SSTable)</p>
</li>
<li><p>System crashes mid-write</p>
</li>
<li><p>MANIFEST is corrupted and unreadable</p>
</li>
<li><p>On restart, the database has no idea which SSTables exist</p>
</li>
<li><p>All data is effectively lost</p>
</li>
</ol>
<p>With atomic renaming:</p>
<ol>
<li><p>We write the new MANIFEST to MANIFEST.tmp</p>
</li>
<li><p>We fully close and sync it to disk</p>
</li>
<li><p>We atomically rename MANIFEST.tmp to MANIFEST using <code>os.Rename(tmpPath, path)</code></p>
</li>
<li><p>If crash happens before step 3: old MANIFEST is intact, we retry compaction</p>
</li>
<li><p>If crash happens during step 3: atomic operation either completes or doesn't – no corruption</p>
</li>
<li><p>If crash happens after step 3: new MANIFEST is in place, we're good</p>
</li>
</ol>
<p>This is also why we delete the old SSTables only after successfully updating the MANIFEST. If we deleted them before updating MANIFEST and then crashed, the MANIFEST would still point to files that no longer exist.</p>
<h4 id="heading-complete-picture">Complete Picture:</h4>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765740593067/c18083ad-bf8a-4cae-92d3-d690a61dac52.png" alt="c18083ad-bf8a-4cae-92d3-d690a61dac52" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Congratulations! You've built a working LSM tree storage engine from scratch. By following the problem-driven approach – discovering issues and implementing solutions as they arose – you've experienced how engineers think about building robust storage systems. I hope this is better than just memorizing the concepts.</p>
<p><strong>Key Takeaways</strong></p>
<ul>
<li><p><strong>Append-only writes</strong> make LSM-trees fast for write-heavy workloads</p>
</li>
<li><p><strong>Immutability</strong> eliminates complex concurrency issues</p>
</li>
<li><p><strong>Trade-off</strong> is that LSM-trees favor writes over reads (the opposite of B-trees)</p>
</li>
<li><p><strong>Durability</strong> requires multiple mechanisms working together (WAL, MANIFEST, atomic operations)</p>
</li>
<li><p><strong>Background maintenance</strong> (compaction) is essential for long-term health and cost efficiency.</p>
</li>
</ul>
<p>Important note: This is a learning implementation. This means that I intentionally simplified the code, so it’s <strong>not production-ready</strong>. Key limitations include:</p>
<ul>
<li><p>No concurrency control (missing mutexes/locks)</p>
</li>
<li><p>No bloom filters for efficient lookups</p>
</li>
<li><p>Simplified compaction strategy</p>
</li>
<li><p>Type safety issues with generic tombstones</p>
</li>
<li><p>Missing robust error recovery</p>
</li>
</ul>
<h3 id="heading-complete-code">Complete Code:</h3>
<p>Like I’ve mentioned before, I've omitted boilerplate code and helper functions for brevity. The complete, runnable implementation is available <a target="_blank" href="https://github.com/justramesh2000/lsm-db">at this GitHub repo</a>.</p>
<p>To learn more about production LSM implementations, study RocksDB, LevelDB, or read the original LSM tree paper by O'Neil et al: <a target="_blank" href="https://www.cs.umb.edu/~poneil/lsmtree.pdf">https://www.cs.umb.edu/~poneil/lsmtree.pdf</a></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
