<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ distributed system - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ distributed system - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Tue, 19 May 2026 10:28:37 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/distributed-system/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Service-to-Service Communication: When to Use REST, gRPC, and Event-Driven Messaging ]]>
                </title>
                <description>
                    <![CDATA[ The communication layer is one of the few architectural decisions that touches everything in your apps. It determines your latency floor, how independently teams can deploy, how failures propagate, an ]]>
                </description>
                <link>https://www.freecodecamp.org/news/service-to-service-communication-when-to-use-rest-grpc-and-event-driven-messaging/</link>
                <guid isPermaLink="false">69dea59891716f3cfb76ac50</guid>
                
                    <category>
                        <![CDATA[ distributed system ]]>
                    </category>
                
                    <category>
                        <![CDATA[ service-to-service communication ]]>
                    </category>
                
                    <category>
                        <![CDATA[ api architecture ]]>
                    </category>
                
                    <category>
                        <![CDATA[ REST API ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Abisoye Alli-Balogun ]]>
                </dc:creator>
                <pubDate>Tue, 14 Apr 2026 20:37:44 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/b56ca584-75e1-4d0f-a88a-c2419a0b5d6e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The communication layer is one of the few architectural decisions that touches everything in your apps. It determines your latency floor, how independently teams can deploy, how failures propagate, and how much pain you feel every time a contract needs to change.</p>
<p>There are three dominant patterns: REST over HTTP, gRPC with Protocol Buffers, and event-driven messaging through a broker. Most production systems use a mix of all three. The skill is knowing which pattern fits which interaction.</p>
<p>In this article, you'll learn the core mechanics of each communication style, the real trade-offs between them across five dimensions (latency, coupling, schema evolution, debugging, and operational complexity), and a decision framework for choosing the right pattern for each service interaction.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>To get the most out of this article, you should be familiar with:</p>
<ul>
<li><p>Basic HTTP concepts (request/response, status codes, headers)</p>
</li>
<li><p>Working with APIs in any backend language (the examples use TypeScript and Node.js)</p>
</li>
<li><p>General understanding of microservices architecture</p>
</li>
<li><p>Familiarity with JSON as a data interchange format</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-the-three-patterns-at-a-glance">The Three Patterns at a Glance</a></p>
</li>
<li><p><a href="#heading-rest-the-default-choice">REST: The Default Choice</a></p>
</li>
<li><p><a href="#heading-grpc-the-performance-choice">gRPC: The Performance Choice</a></p>
</li>
<li><p><a href="#heading-event-driven-messaging-the-decoupling-choice">Event-Driven Messaging: The Decoupling Choice</a></p>
</li>
<li><p><a href="#heading-the-five-trade-off-dimensions">The Five Trade-Off Dimensions</a></p>
</li>
<li><p><a href="#heading-the-decision-framework">The Decision Framework</a></p>
</li>
<li><p><a href="#heading-hybrid-architectures-using-all-three">Hybrid Architectures: Using All Three</a></p>
</li>
<li><p><a href="#heading-schema-governance-at-scale">Schema Governance at Scale</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-the-three-patterns-at-a-glance">The Three Patterns at a Glance</h2>
<p>Before diving deep, here's the landscape:</p>
<table>
<thead>
<tr>
<th></th>
<th>REST</th>
<th>gRPC</th>
<th>Event-Driven</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Communication</strong></td>
<td>Synchronous</td>
<td>Synchronous (+ streaming)</td>
<td>Asynchronous</td>
</tr>
<tr>
<td><strong>Protocol</strong></td>
<td>HTTP/1.1 or HTTP/2</td>
<td>HTTP/2</td>
<td>Broker-dependent (TCP)</td>
</tr>
<tr>
<td><strong>Serialization</strong></td>
<td>JSON (typically)</td>
<td>Protocol Buffers (binary)</td>
<td>JSON, Avro, Protobuf</td>
</tr>
<tr>
<td><strong>Coupling</strong></td>
<td>Request-time</td>
<td>Request-time + schema</td>
<td>Temporal decoupling</td>
</tr>
<tr>
<td><strong>Best for</strong></td>
<td>Public APIs, CRUD</td>
<td>Internal high-throughput</td>
<td>Workflows, event sourcing</td>
</tr>
</tbody></table>
<p>Each has strengths, and none is universally better. The rest of this article explores why.</p>
<h2 id="heading-rest-the-default-choice">REST: The Default Choice</h2>
<p>REST over HTTP is the most widely understood communication pattern. Services expose resources at URL endpoints, and clients interact through standard HTTP methods.</p>
<pre><code class="language-typescript">// Order service calls the inventory service
async function checkInventory(productId: string): Promise&lt;InventoryStatus&gt; {
  const response = await fetch(
    `https://inventory-service/api/v1/products/${productId}/stock`,
    {
      method: "GET",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${getServiceToken()}`,
      },
    }
  );

  if (!response.ok) {
    throw new HttpError(response.status, await response.text());
  }

  return response.json();
}
</code></pre>
<h3 id="heading-where-rest-excels">Where REST Excels</h3>
<p>Every language, framework, and platform speaks HTTP. Your frontend, your mobile app, your partner integrations, and your internal services can all use the same protocol.</p>
<p>The tooling is also mature: load balancers, API gateways, caching proxies, and debugging tools all understand HTTP natively.</p>
<p>It's also relatively simple. A new developer can read a REST call and understand what it does. The URL describes the resource. The HTTP method describes the action. The status code describes the outcome. There's no schema compilation step, no code generation, and no special client library required.</p>
<p>Beyond this, HTTP has built-in caching semantics. A <code>GET /products/123</code> response with a <code>Cache-Control: max-age=60</code> header can be cached by every proxy between the caller and the server. gRPC and event-driven patterns have no equivalent built-in mechanism.</p>
<pre><code class="language-typescript">// REST response with cache headers
app.get("/api/v1/products/:id", async (req, res) =&gt; {
  const product = await getProduct(req.params.id);

  res.set("Cache-Control", "public, max-age=60");
  res.set("ETag", computeETag(product));

  res.json(product);
});
</code></pre>
<h3 id="heading-where-rest-falls-short">Where REST Falls Short</h3>
<p>REST's resource-oriented model often requires multiple round-trips to assemble a response. Fetching an order with its items, customer details, and shipping status might mean three separate HTTP calls. Each call adds network latency, TCP handshake overhead, and serialization cost.</p>
<pre><code class="language-typescript">// Three sequential calls to build one view
async function getOrderDetails(orderId: string): Promise&lt;OrderDetails&gt; {
  const order = await fetch(`/api/orders/${orderId}`).then((r) =&gt; r.json());
  const customer = await fetch(`/api/customers/${order.customerId}`).then((r) =&gt; r.json());
  const shipment = await fetch(`/api/shipments/${order.shipmentId}`).then((r) =&gt; r.json());

  return { order, customer, shipment };
}
</code></pre>
<p>You can mitigate this with composite endpoints or GraphQL, but that adds complexity. gRPC handles this more naturally with message composition.</p>
<p>The serialization overhead is also an issue. JSON is human-readable but expensive to parse. For high-throughput internal communication where nobody reads the payloads, you are paying a tax in CPU and bandwidth for readability you do not need.</p>
<p>Finally, there's no streaming. Standard REST is request-response. If you need the server to push updates to the client (real-time order tracking, live metrics), REST requires workarounds like polling, Server-Sent Events, or WebSockets. None of these are part of the REST model itself.</p>
<h2 id="heading-grpc-the-performance-choice">gRPC: The Performance Choice</h2>
<p>gRPC is a Remote Procedure Call framework built on HTTP/2 and Protocol Buffers. Instead of URLs and JSON, you define services and messages in <code>.proto</code> files, and the framework generates strongly-typed client and server code.</p>
<h3 id="heading-defining-the-contract">Defining the Contract</h3>
<pre><code class="language-protobuf">// inventory.proto
syntax = "proto3";

package inventory;

service InventoryService {
  // Unary: single request, single response
  rpc CheckStock(StockRequest) returns (StockResponse);

  // Server streaming: single request, stream of responses
  rpc WatchStockLevels(WatchRequest) returns (stream StockUpdate);

  // Client streaming: stream of requests, single response
  rpc BulkUpdateStock(stream StockAdjustment) returns (BulkUpdateResult);
}

message StockRequest {
  string product_id = 1;
  string warehouse_id = 2;
}

message StockResponse {
  string product_id = 1;
  int32 available = 2;
  int32 reserved = 3;
  google.protobuf.Timestamp last_updated = 4;
}

message StockUpdate {
  string product_id = 1;
  int32 available = 2;
  string warehouse_id = 3;
}
</code></pre>
<p>After running <code>protoc</code> (the Protocol Buffer compiler), you get generated client and server stubs in your target language. The TypeScript client looks like this:</p>
<pre><code class="language-typescript">import { InventoryServiceClient } from "./generated/inventory";
import { credentials } from "@grpc/grpc-js";

const client = new InventoryServiceClient(
  "inventory-service:50051",
  credentials.createInsecure()
);

async function checkStock(productId: string): Promise&lt;StockResponse&gt; {
  return new Promise((resolve, reject) =&gt; {
    client.checkStock(
      { productId, warehouseId: "warehouse-eu-1" },
      (error, response) =&gt; {
        if (error) reject(error);
        else resolve(response);
      }
    );
  });
}
</code></pre>
<h3 id="heading-where-grpc-excels">Where gRPC Excels</h3>
<p>Protocol Buffers serialize to a compact binary format. A message that is 1 KB as JSON might be 300 bytes as Protobuf. Combined with HTTP/2 multiplexing (multiple requests over a single TCP connection), gRPC delivers significantly lower latency and higher throughput than REST for internal service calls. And we all know performance is important.</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>REST (JSON/HTTP 1.1)</th>
<th>gRPC (Protobuf/HTTP 2)</th>
</tr>
</thead>
<tbody><tr>
<td>Serialization size</td>
<td>Larger (text-based JSON)</td>
<td>Significantly smaller (binary Protobuf)</td>
</tr>
<tr>
<td>Serialization time</td>
<td>Slower (JSON parse/stringify)</td>
<td>Faster (binary encode/decode)</td>
</tr>
<tr>
<td>Requests per connection</td>
<td>1 (without pipelining)</td>
<td>Multiplexed</td>
</tr>
<tr>
<td>Connection overhead</td>
<td>New connection per request (HTTP/1.1)</td>
<td>Persistent connections with multiplexing</td>
</tr>
</tbody></table>
<p>The exact improvement depends on payload size, network topology, and server implementation. In benchmarks, the difference ranges from marginal (tiny payloads on fast networks) to an order of magnitude (large payloads, high concurrency).</p>
<p>The takeaway: gRPC's binary serialization and HTTP/2 multiplexing give it a structural advantage for internal traffic, but you should measure in your own environment before making latency claims.</p>
<p>Also, gRPC natively supports four communication patterns: unary (request-response), server streaming, client streaming, and bidirectional streaming. This makes it a natural fit for real-time use cases like live stock updates, log tailing, or progress reporting.</p>
<pre><code class="language-typescript">// Server streaming: watch inventory changes in real time
function watchStockLevels(warehouseId: string): void {
  const stream = client.watchStockLevels({ warehouseId });

  stream.on("data", (update: StockUpdate) =&gt; {
    console.log(`Product \({update.productId}: \){update.available} available`);
  });

  stream.on("error", (error) =&gt; {
    console.error("Stream error:", error.message);
    // Reconnect logic here
  });

  stream.on("end", () =&gt; {
    console.log("Stream ended");
  });
}
</code></pre>
<p>Finally, it has strong typing across services. The <code>.proto</code> file is the single source of truth. Both the client and server are generated from it. If the inventory service changes the <code>StockResponse</code> message, the order service's build fails until it regenerates its client. You catch breaking changes at compile time, not at 3 AM.</p>
<h3 id="heading-where-grpc-falls-short">Where gRPC Falls Short</h3>
<p>The first key issue is browser support. Browsers can't make native gRPC calls because the browser's <code>fetch</code> API doesn't expose the HTTP/2 framing that gRPC requires (for example, trailers for status codes and fine-grained control over bidirectional streams).</p>
<p>You need gRPC-Web, which uses a proxy like Envoy to translate between the browser-compatible subset of gRPC and the full protocol. Alternatively, you can place a REST or GraphQL gateway in front of your gRPC services.</p>
<p>Either way, gRPC isn't a viable choice for any endpoint that a browser calls directly — which is why the decision framework in this article defaults to REST for public-facing APIs.</p>
<p>It's also difficult to debug. You can't <code>curl</code> a gRPC endpoint. The binary payloads aren't human-readable. Tools like <code>grpcurl</code> and Postman's gRPC support help, but the debugging experience is worse than inspecting a JSON response in a browser's network tab.</p>
<pre><code class="language-bash"># Debugging a REST endpoint
curl -s https://inventory-service/api/v1/products/abc-123/stock | jq

# Debugging a gRPC endpoint (requires grpcurl)
grpcurl -plaintext -d '{"product_id": "abc-123"}' \
  inventory-service:50051 inventory.InventoryService/CheckStock
</code></pre>
<p>Finally, operational overhead is an issue. You need to manage <code>.proto</code> files, run code generation in your build pipeline, version your proto definitions, and ensure all consumers regenerate when schemas change.</p>
<p>For a team with two services, this is manageable. For twenty services, you need a proto registry and a governance process.</p>
<h2 id="heading-event-driven-messaging-the-decoupling-choice">Event-Driven Messaging: The Decoupling Choice</h2>
<p>Event-driven communication flips the model. Instead of service A calling service B directly, service A publishes an event to a broker (Kafka, RabbitMQ, Amazon SNS/SQS, or similar), and service B consumes it asynchronously.</p>
<pre><code class="language-typescript">// Order service publishes an event after confirming an order
import { Kafka } from "kafkajs";

const kafka = new Kafka({ brokers: ["kafka:9092"] });
const producer = kafka.producer();

async function publishOrderConfirmed(order: Order): Promise&lt;void&gt; {
  await producer.send({
    topic: "order.confirmed",
    messages: [
      {
        key: order.id,
        value: JSON.stringify({
          eventType: "order.confirmed",
          eventId: crypto.randomUUID(),
          timestamp: new Date().toISOString(),
          data: {
            orderId: order.id,
            customerId: order.customerId,
            items: order.items.map((item) =&gt; ({
              productId: item.productId,
              quantity: item.quantity,
            })),
            total: order.total,
          },
        }),
      },
    ],
  });
}
</code></pre>
<pre><code class="language-typescript">// Inventory service consumes the event independently
const consumer = kafka.consumer({ groupId: "inventory-service" });

async function startInventoryConsumer(): Promise&lt;void&gt; {
  await consumer.subscribe({ topic: "order.confirmed" });

  await consumer.run({
    eachMessage: async ({ message }) =&gt; {
      const event = JSON.parse(message.value.toString());

      for (const item of event.data.items) {
        await decrementStock(item.productId, item.quantity);
      }

      logger.info("Inventory updated for order", {
        orderId: event.data.orderId,
      });
    },
  });
}
</code></pre>
<h3 id="heading-where-event-driven-excels">Where Event-Driven Excels</h3>
<p>First, it employs temporal decoupling. The producer doesn't wait for the consumer. The order service publishes "order confirmed" and moves on. If the inventory service is down, the event sits in the broker until it recovers. No timeout, no retry logic in the producer, no cascading failure.</p>
<p>One event can also trigger multiple independent reactions. When an order is confirmed, the inventory service decrements stock, the notification service sends a confirmation email, the analytics service records a conversion, and the shipping service starts fulfillment. The order service doesn't know or care about any of these consumers.</p>
<pre><code class="language-text">order.confirmed event
  ├── inventory-service    → Decrement stock
  ├── notification-service → Send confirmation email
  ├── analytics-service    → Record conversion
  └── shipping-service     → Create shipment
</code></pre>
<p>Adding a new consumer requires zero changes to the producer. This is the lowest coupling you can achieve between services.</p>
<p>There's also a natural audit trail. If your broker retains events (Kafka does this by default), you have a complete history of everything that happened. You can replay events to rebuild state, debug issues by examining the exact sequence of events, or spin up a new service that processes historical events to backfill its data.</p>
<h3 id="heading-where-event-driven-falls-short">Where Event-Driven Falls Short</h3>
<p>After the order service publishes "order confirmed," there's a window where the inventory service hasn't yet processed the event. During that window, a concurrent request might read stale stock levels. If your use case requires "read your own writes" consistency, event-driven communication alone is not enough.</p>
<pre><code class="language-typescript">// The problem: order confirmed, but stock not yet decremented
async function handleCheckout(cart: Cart): Promise&lt;Order&gt; {
  const order = await createOrder(cart);
  await publishOrderConfirmed(order);

  // If another request checks stock RIGHT NOW,
  // it sees the old (pre-decrement) value.
  // The inventory consumer hasn't processed the event yet.
  return order;
}
</code></pre>
<p>Debugging also gets more complex. When something goes wrong in a synchronous call chain, you get a stack trace. When something goes wrong in an event-driven flow, you get a message in a dead-letter queue and a question: which producer sent this? When? What was the system state at that time? Distributed tracing helps, but correlating events across services is fundamentally harder than following a request through a call stack.</p>
<p>You can also have issues with ordering guarantees. Most brokers guarantee ordering within a partition (Kafka) or a queue, but not globally. If the order service publishes "order confirmed" and then "order cancelled," the inventory service might process the cancellation first if the events land on different partitions.</p>
<pre><code class="language-typescript">// Use a consistent partition key to guarantee ordering per entity
await producer.send({
  topic: "order.events",
  messages: [
    {
      // All events for the same order go to the same partition
      key: order.id,
      value: JSON.stringify(event),
    },
  ],
});
</code></pre>
<p>Keying messages by entity ID (order ID, customer ID) ensures events for the same entity are processed in order. Events for different entities can be processed in parallel.</p>
<p>Finally, your operations get more complex. Running a message broker isn't free. Kafka requires ZooKeeper (or KRaft), topic management, partition rebalancing, consumer group coordination, and monitoring for consumer lag. Managed services like Amazon MSK, Confluent Cloud, or Amazon SQS reduce this burden but add cost.</p>
<h3 id="heading-handling-broker-failures">Handling Broker Failures</h3>
<p>What happens when the broker is unavailable? If your service writes to the database and then publishes an event, a broker outage means the event is lost even though the database write succeeded.</p>
<p>These patterns help:</p>
<h4 id="heading-1-the-outbox-pattern">1. The Outbox Pattern</h4>
<p>Instead of publishing directly to the broker, write the event to an "outbox" table in the same database transaction as your business data. A separate process (a poller or a change-data-capture connector like Debezium) reads the outbox table and publishes to the broker.</p>
<pre><code class="language-typescript">// Outbox pattern: write event to the database, not the broker
// db injected via dependency injection
async function confirmOrder(order: Order, db: Database): Promise&lt;void&gt; {
  await db.transaction(async (tx) =&gt; {
    // Business write and event write in the same transaction
    await tx.update("orders", { id: order.id, status: "confirmed" });
    await tx.insert("outbox", {
      id: crypto.randomUUID(),
      topic: "order.confirmed",
      key: order.id,
      payload: JSON.stringify({
        orderId: order.id,
        customerId: order.customerId,
        items: order.items,
        total: order.total,
      }),
      created_at: new Date(),
    });
  });
  // A separate relay process picks up outbox rows and publishes to Kafka
}
</code></pre>
<p>Because the event and the business data are written atomically, you never lose an event due to a broker outage. The relay process retries until the broker is back.</p>
<h4 id="heading-2-at-least-once-delivery">2. At-least-once delivery</h4>
<p>Most brokers guarantee at-least-once delivery, meaning consumers may see the same event more than once (for example, after a rebalance or a retry). Your consumers must be idempotent: processing the same event twice should produce the same result as processing it once.</p>
<pre><code class="language-typescript">// Idempotent consumer: use the eventId to deduplicate
async function handleOrderConfirmed(event: EventEnvelope&lt;OrderData&gt;): Promise&lt;void&gt; {
  const alreadyProcessed = await db.query(
    "SELECT 1 FROM processed_events WHERE event_id = $1",
    [event.eventId]
  );

  if (alreadyProcessed.rows.length &gt; 0) {
    logger.info("Duplicate event, skipping", { eventId: event.eventId });
    return;
  }

  await db.transaction(async (tx) =&gt; {
    await decrementStock(tx, event.data.items);
    await tx.insert("processed_events", {
      event_id: event.eventId,
      processed_at: new Date(),
    });
  });
}
</code></pre>
<p>The combination of the outbox pattern (producer side) and idempotent consumers (consumer side) gives you reliable event-driven communication even when the broker has intermittent failures.</p>
<h2 id="heading-the-five-trade-off-dimensions">The Five Trade-Off Dimensions</h2>
<p>Choosing a communication pattern isn't about which is "best." It's about which trade-offs you can accept for each specific interaction. Here are the five dimensions that matter most.</p>
<h3 id="heading-1-latency">1. Latency</h3>
<table>
<thead>
<tr>
<th>Pattern</th>
<th>Relative Latency</th>
<th>Why</th>
</tr>
</thead>
<tbody><tr>
<td>gRPC</td>
<td>Lowest</td>
<td>Binary serialization, HTTP/2 multiplexing, persistent connections</td>
</tr>
<tr>
<td>REST</td>
<td>Low-moderate</td>
<td>JSON parsing overhead, typically HTTP/1.1 connection setup</td>
</tr>
<tr>
<td>Event-driven</td>
<td>Highest (by design)</td>
<td>Broker write, replication, consumer poll interval</td>
</tr>
</tbody></table>
<p>Exact numbers depend on payload size, network hops, and infrastructure. The structural ordering is consistent: gRPC is fastest for synchronous calls, REST is close behind, and event-driven messaging trades latency for decoupling.</p>
<p>If the caller needs an immediate response (user-facing checkout, real-time search), use gRPC or REST. If the caller doesn't need the result right now (send email, update analytics), use events.</p>
<h3 id="heading-2-coupling">2. Coupling</h3>
<p>Coupling has two dimensions: <strong>temporal</strong> (does the caller wait for the receiver?) and <strong>schema</strong> (do they share a contract?).</p>
<table>
<thead>
<tr>
<th>Pattern</th>
<th>Temporal Coupling</th>
<th>Schema Coupling</th>
</tr>
</thead>
<tbody><tr>
<td>REST</td>
<td>High (caller blocks)</td>
<td>Low (JSON is flexible)</td>
</tr>
<tr>
<td>gRPC</td>
<td>High (caller blocks)</td>
<td>High (shared <code>.proto</code> files)</td>
</tr>
<tr>
<td>Event-driven</td>
<td>None (fire and forget)</td>
<td>Medium (shared event schema)</td>
</tr>
</tbody></table>
<p>REST's loose typing is a double-edged sword. You can add fields to a JSON response without breaking consumers (additive changes are safe). But you can also accidentally remove a field, and the consumer fails at runtime instead of compile time.</p>
<p>gRPC's strict typing catches breaking changes at build time, but it means every schema change requires regenerating clients. For two services, this is trivial. For twenty services consuming the same proto, you need a coordination process.</p>
<p>Event-driven messaging decouples in time but still couples on the event schema. If the <code>order.confirmed</code> event changes its structure, every consumer must handle both the old and new format during the transition.</p>
<h3 id="heading-3-schema-evolution">3. Schema Evolution</h3>
<p>Schema evolution is how you change the contract between services without breaking existing consumers. This is where the three patterns diverge most sharply.</p>
<h4 id="heading-rest-json">REST (JSON):</h4>
<pre><code class="language-typescript">// Version 1: price as a number
{ "productId": "abc-123", "price": 49.99 }

// Version 2: price as an object (breaking change)
{ "productId": "abc-123", "price": { "amount": 49.99, "currency": "USD" } }
</code></pre>
<p>JSON has no built-in versioning. You manage it through one of three strategies:</p>
<table>
<thead>
<tr>
<th>Strategy</th>
<th>How It Works</th>
<th>Trade-offs</th>
</tr>
</thead>
<tbody><tr>
<td><strong>URL versioning</strong> (<code>/api/v1/</code> vs <code>/api/v2/</code>)</td>
<td>Each version is a separate endpoint. Consumers opt in to the new version explicitly.</td>
<td>Simplest to understand. Duplicates route handlers. Hard to sunset old versions when many consumers pin to <code>/v1/</code>.</td>
</tr>
<tr>
<td><strong>Header versioning</strong> (<code>Accept: application/vnd.myapi.v2+json</code>)</td>
<td>Single URL, version negotiated via headers.</td>
<td>Cleaner URLs, no route duplication. Harder to test (you can't just paste a URL into a browser). Proxy and cache behavior is trickier since the response varies by header.</td>
</tr>
<tr>
<td><strong>Defensive parsing</strong> (consumer-side tolerance)</td>
<td>No explicit versioning. Consumers ignore unknown fields and use defaults for missing ones.</td>
<td>Zero coordination cost for additive changes. Breaks down for structural changes (field renames, type changes) where the consumer can't infer intent.</td>
</tr>
</tbody></table>
<p>Additive changes (new fields) are safe with any strategy. Structural changes (renaming fields, changing types) require explicit versioning — URL or header — so consumers can migrate at their own pace.</p>
<h4 id="heading-grpc-protocol-buffers">gRPC (Protocol Buffers):</h4>
<pre><code class="language-protobuf">// Protocol Buffers have built-in evolution rules
message StockResponse {
  string product_id = 1;
  int32 available = 2;
  int32 reserved = 3;
  // Field 4 was removed (never reuse field numbers)
  string warehouse_id = 5;       // New field: old clients ignore it
  optional string region = 6;    // Optional: old clients don't send it
}
</code></pre>
<p>Protocol Buffers handle evolution well by design. You can add new fields (old clients ignore them), deprecate fields (stop writing them, keep the number reserved), and use <code>optional</code> for fields that may not be present.</p>
<p>You can't rename fields, change field types, or reuse field numbers. These rules are enforced by the tooling.</p>
<h4 id="heading-event-driven-avrojson-schema">Event-driven (Avro/JSON Schema):</h4>
<p>For events, schema registries like Confluent Schema Registry enforce compatibility rules:</p>
<pre><code class="language-typescript">// Register a schema with backward compatibility
// New consumers can read old events, old consumers can read new events
const schema = {
  type: "record",
  name: "OrderConfirmed",
  fields: [
    { name: "orderId", type: "string" },
    { name: "customerId", type: "string" },
    { name: "total", type: "double" },
    // New field with default: backward compatible
    { name: "currency", type: "string", default: "USD" },
  ],
};
</code></pre>
<p>With a schema registry, producers can't publish events that violate the compatibility contract. This is the strongest governance model: the registry rejects incompatible schemas before they reach consumers.</p>
<h3 id="heading-4-debugging-and-observability">4. Debugging and Observability</h3>
<table>
<thead>
<tr>
<th>Pattern</th>
<th>Debugging Experience</th>
</tr>
</thead>
<tbody><tr>
<td>REST</td>
<td>Best. Human-readable payloads, browser DevTools, <code>curl</code>, standard HTTP tracing.</td>
</tr>
<tr>
<td>gRPC</td>
<td>Moderate. Binary payloads need <code>grpcurl</code> or Postman. Metadata is inspectable. Distributed tracing works well.</td>
</tr>
<tr>
<td>Event-driven</td>
<td>Hardest. Asynchronous flows require correlation IDs, dead-letter queue inspection, and broker-specific tooling.</td>
</tr>
</tbody></table>
<p>For event-driven systems, correlation IDs are essential:</p>
<pre><code class="language-typescript">// Always include a correlation ID in events
interface EventEnvelope&lt;T&gt; {
  eventId: string;
  eventType: string;
  correlationId: string; // Links related events across services
  causationId: string;   // The event that caused this one
  timestamp: string;
  source: string;
  data: T;
}

async function publishEvent&lt;T extends { entityId: string }&gt;(
  topic: string,
  type: string,
  data: T,
  correlationId: string,
  causationId?: string
): Promise&lt;void&gt; {
  const event: EventEnvelope&lt;T&gt; = {
    eventId: crypto.randomUUID(),
    eventType: type,
    correlationId,
    causationId: causationId ?? correlationId,
    timestamp: new Date().toISOString(),
    source: SERVICE_NAME,
    data,
  };

  await producer.send({
    topic,
    messages: [{ key: data.entityId, value: JSON.stringify(event) }],
  });
}
</code></pre>
<p>When investigating an issue, you search for the correlation ID across all services and reconstruct the full event chain. Without it, you're searching for a needle in a haystack.</p>
<h3 id="heading-5-operational-complexity">5. Operational Complexity</h3>
<table>
<thead>
<tr>
<th>Pattern</th>
<th>What You Operate</th>
</tr>
</thead>
<tbody><tr>
<td>REST</td>
<td>HTTP server, load balancer, API gateway</td>
</tr>
<tr>
<td>gRPC</td>
<td>gRPC server, proto registry, code generation pipeline, gRPC-Web proxy (if browser clients exist)</td>
</tr>
<tr>
<td>Event-driven</td>
<td>Message broker (Kafka/RabbitMQ/SQS), schema registry, dead-letter queues, consumer lag monitoring</td>
</tr>
</tbody></table>
<p>REST has the lowest operational overhead. Every team knows how to run an HTTP server.</p>
<p>gRPC adds a build-time dependency (proto compilation) and requires teams to learn new tooling.</p>
<p>Event-driven adds a runtime dependency (the broker) that must be highly available because if the broker goes down, inter-service communication stops.</p>
<h2 id="heading-the-decision-framework">The Decision Framework</h2>
<p>Use this framework when deciding how a specific pair of services should communicate. The answer is rarely one pattern for your entire system.</p>
<pre><code class="language-text">Does the caller need an immediate response?
├── Yes → Is this a public-facing or browser-accessible API?
│         ├── Yes → REST
│         └── No  → Is throughput or latency critical?
│                   ├── Yes → gRPC
│                   └── No  → REST (simpler, good enough)
└── No  → Can the caller tolerate eventual consistency?
          ├── No  → Use synchronous call (REST or gRPC) with async follow-up
          └── Yes → Does the event need to trigger multiple consumers?
                    ├── Yes → Event-driven messaging
                    └── No  → Is ordering critical?
                              ├── Yes → Event-driven with partition key
                              └── No  → Event-driven (or simple queue like SQS)
</code></pre>
<p>Some concrete examples:</p>
<table>
<thead>
<tr>
<th>Interaction</th>
<th>Pattern</th>
<th>Why</th>
</tr>
</thead>
<tbody><tr>
<td>Browser fetches product details</td>
<td>REST</td>
<td>Browser can't call gRPC natively, plus REST offers cacheability</td>
</tr>
<tr>
<td>Checkout validates payment in real time</td>
<td>gRPC</td>
<td>Low latency, strong typing, internal-only (no browser in the path)</td>
</tr>
<tr>
<td>Order confirmed triggers fulfillment</td>
<td>Event-driven</td>
<td>Multiple consumers, temporal decoupling</td>
</tr>
<tr>
<td>Frontend fetches user profile</td>
<td>REST</td>
<td>Simple CRUD, cacheable, browser-native</td>
</tr>
<tr>
<td>ML service scores recommendations</td>
<td>gRPC</td>
<td>High throughput, binary payloads, streaming</td>
</tr>
<tr>
<td>User signup triggers welcome email</td>
<td>Event-driven</td>
<td>Async, no need for immediate response</td>
</tr>
<tr>
<td>Service health checks</td>
<td>REST</td>
<td>Simplicity, universal tooling</td>
</tr>
<tr>
<td>Real-time stock level monitoring</td>
<td>gRPC streaming</td>
<td>Continuous updates, bidirectional if needed</td>
</tr>
</tbody></table>
<h2 id="heading-hybrid-architectures-using-all-three">Hybrid Architectures: Using All Three</h2>
<p>Most production systems use a combination. Here's a pattern that works well:</p>
<pre><code class="language-text">┌──────────┐    REST     ┌──────────────┐    gRPC    ┌──────────────┐
│ Browser  │────────────▶│  API Gateway │───────────▶│ Order Service│
└──────────┘             └──────────────┘            └──────┬───────┘
                                                           │
                                                    publishes event
                                                           │
                                                           ▼
                                                    ┌─────────────┐
                                                    │    Kafka     │
                                                    └──────┬──────┘
                                          ┌────────────────┼────────────────┐
                                          ▼                ▼                ▼
                                   ┌────────────┐  ┌────────────┐  ┌────────────┐
                                   │ Inventory  │  │ Notification│  │ Analytics  │
                                   │  Service   │  │  Service    │  │  Service   │
                                   └────────────┘  └────────────┘  └────────────┘
</code></pre>
<ul>
<li><p><strong>REST</strong> at the edge: the browser talks to the API gateway using standard HTTP. Cacheable, debuggable, universally supported.</p>
</li>
<li><p><strong>gRPC</strong> between the gateway and internal services: low latency, strong typing, efficient serialization.</p>
</li>
<li><p><strong>Event-driven</strong> for downstream reactions: the order service publishes an event, and multiple consumers react independently.</p>
</li>
</ul>
<h3 id="heading-the-anti-synchronous-trap">The Anti-Synchronous Trap</h3>
<p>A common mistake is using synchronous calls (REST or gRPC) where events are a better fit. The symptom: a service that makes five synchronous calls during a single request, waiting for each to complete before responding to the caller.</p>
<pre><code class="language-typescript">// Anti-pattern: synchronous fan-out
async function confirmOrder(order: Order): Promise&lt;void&gt; {
  await inventoryService.decrementStock(order.items);    // 50ms
  await paymentService.capturePayment(order.paymentId);  // 200ms
  await notificationService.sendConfirmation(order);     // 100ms
  await analyticsService.recordConversion(order);        // 80ms
  await shippingService.createShipment(order);           // 150ms
  // Total: 580ms, and if any one fails, the order fails
}
</code></pre>
<p>Only the first two calls (inventory and payment) are critical to confirming the order. The rest are reactions that can happen asynchronously:</p>
<pre><code class="language-typescript">// Better: synchronous for critical path, events for reactions
async function confirmOrder(order: Order): Promise&lt;void&gt; {
  // Critical path: must succeed for the order to be valid
  await inventoryService.decrementStock(order.items);
  await paymentService.capturePayment(order.paymentId);

  // Non-critical: publish event, let consumers handle the rest
  await publishOrderConfirmed(order);
  // Total: 250ms, and notification/analytics/shipping failures
  // don't block the checkout
}
</code></pre>
<p>This is the same tiered approach from my <a href="https://abisoye.dev/blog/designing-resilient-apis">designing resilient APIs</a> article. Critical operations are synchronous. Non-critical reactions are event-driven. The caller responds faster, and downstream failures do not cascade.</p>
<h2 id="heading-schema-governance-at-scale">Schema Governance at Scale</h2>
<p>As your service count grows, schema management becomes a first-class concern. Here's a practical approach for each pattern.</p>
<h3 id="heading-rest-openapi-as-the-contract">REST: OpenAPI as the Contract</h3>
<pre><code class="language-yaml"># openapi/inventory-service.yaml
openapi: "3.1.0"
info:
  title: Inventory Service
  version: "1.2.0"
paths:
  /api/v1/products/{productId}/stock:
    get:
      operationId: getStock
      parameters:
        - name: productId
          in: path
          required: true
          schema:
            type: string
      responses:
        "200":
          description: Stock level for the product
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/StockResponse"
components:
  schemas:
    StockResponse:
      type: object
      required: [productId, available, reserved]
      properties:
        productId:
          type: string
        available:
          type: integer
        reserved:
          type: integer
</code></pre>
<p>Generate client SDKs from the OpenAPI spec using tools like <code>openapi-typescript</code> or <code>openapi-generator</code>. This gives you type safety without the build-time coupling of gRPC.</p>
<h3 id="heading-grpc-proto-registry">gRPC: Proto Registry</h3>
<p>Store <code>.proto</code> files in a shared repository or a dedicated proto registry (Buf Schema Registry is a good option). Use Buf's breaking change detection in CI:</p>
<pre><code class="language-bash"># Detects breaking changes before they merge
buf breaking --against ".git#branch=main"
</code></pre>
<p>This command requires a <code>buf.yaml</code> configuration file at the root of your proto directory. The file defines your module name and any lint or breaking change rules. See the <a href="https://buf.build/docs/configuration/v2/buf-yaml">Buf documentation</a> for setup details.</p>
<p>This fails your pull request if you rename a field, change a type, or reuse a field number. Non-breaking changes (adding fields, adding services) pass through.</p>
<h3 id="heading-events-schema-registry-with-compatibility-modes">Events: Schema Registry with Compatibility Modes</h3>
<p>For event-driven systems, a schema registry enforces compatibility at publish time. Confluent Schema Registry supports four modes:</p>
<table>
<thead>
<tr>
<th>Mode</th>
<th>Rule</th>
<th>Use Case</th>
</tr>
</thead>
<tbody><tr>
<td><strong>BACKWARD</strong></td>
<td>New schema can read old data</td>
<td>Consumer-first evolution</td>
</tr>
<tr>
<td><strong>FORWARD</strong></td>
<td>Old schema can read new data</td>
<td>Producer-first evolution</td>
</tr>
<tr>
<td><strong>FULL</strong></td>
<td>Both directions</td>
<td>Safest, most restrictive</td>
</tr>
<tr>
<td><strong>NONE</strong></td>
<td>No checks</td>
<td>Development only</td>
</tr>
</tbody></table>
<p>Use <code>FULL</code> compatibility for production topics. It ensures that any consumer, regardless of which schema version it was built against, can read any event on the topic.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this article, you learned the core mechanics of REST, gRPC, and event-driven messaging, the five trade-off dimensions that matter when choosing between them (latency, coupling, schema evolution, debugging, and operational complexity), and a decision framework for matching patterns to specific service interactions.</p>
<p>The key takeaways:</p>
<ol>
<li><p><strong>REST for the edge:</strong> Browser clients, public APIs, simple CRUD. Cacheable, debuggable, universally supported.</p>
</li>
<li><p><strong>gRPC for internal hot paths:</strong> High-throughput service-to-service calls where latency matters and both sides are under your control.</p>
</li>
<li><p><strong>Events for reactions:</strong> When the producer shouldn't wait, when multiple consumers need the same signal, or when temporal decoupling prevents cascading failures.</p>
</li>
<li><p><strong>Use all three:</strong> Most production systems combine patterns. REST at the boundary, gRPC internally, events for async workflows.</p>
</li>
<li><p><strong>Schema governance scales the system:</strong> OpenAPI for REST, proto registries for gRPC, schema registries for events. Without governance, schema changes become the primary source of production incidents.</p>
</li>
</ol>
<p>The right communication pattern isn't a global decision. It's a per-interaction decision, made deliberately, based on which trade-offs you can accept for that specific data flow.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Production-Grade Distributed Chatroom in Go [Full Handbook] ]]>
                </title>
                <description>
                    <![CDATA[ If you've ever wondered how chat applications like Slack, Discord, or WhatsApp work behind the scenes, this tutorial will show you. You'll build a real-time chat server from scratch using Go, learning the fundamental concepts that power modern commun... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-production-grade-distributed-chatroom-in-go-full-handbook/</link>
                <guid isPermaLink="false">698f4ea5c4c4900d2483651c</guid>
                
                    <category>
                        <![CDATA[ distributed system ]]>
                    </category>
                
                    <category>
                        <![CDATA[ golang ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Programming Blogs ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Fri, 13 Feb 2026 16:17:41 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770999686180/3bccee22-e7e9-477f-8a5f-50800896e972.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've ever wondered how chat applications like Slack, Discord, or WhatsApp work behind the scenes, this tutorial will show you. You'll build a real-time chat server from scratch using Go, learning the fundamental concepts that power modern communication systems.</p>
<p>By the end of this guide, you'll have built a working chatroom that supports unlimited concurrent users chatting in real-time, message persistence that survives server crashes, session management so users can reconnect after network interruptions, private messaging between users, and graceful handling of slow or disconnected clients.</p>
<p>More importantly, you'll understand the fundamental concepts behind distributed systems. You'll learn concurrent programming with goroutines and channels, TCP socket programming for network communication, write-ahead logging for data durability, state management with mutexes, and how to design systems that degrade gracefully under failure. These concepts power everything from databases to message queues to web servers.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-a-distributed-chatroom">What is a Distributed Chatroom</a>?</p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-youll-learn">What You'll Learn</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tutorial-overview">Tutorial Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-architecture-overview">Architecture Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-core-concepts-you-need-to-know">Core Concepts You Need to Know</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-the-project-structure">How to Set Up the Project Structure</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-define-core-data-types">How to Define Core Data Types</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-initialize-the-server">How to Initialize the Server</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-the-event-loop">How to Build the Event Loop</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-handle-client-connections">How to Handle Client Connections</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-implement-message-broadcasting">How to Implement Message Broadcasting</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-add-persistence-with-wal-and-snapshots">How to Add Persistence with WAL and Snapshots</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-implement-session-management">How to Implement Session Management</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-the-command-system">How to Build the Command System</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-the-client">How to Create the Client</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-test-your-chatroom">How to Test Your Chatroom</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-deploy-your-server">How to Deploy Your Server</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-enhancements-you-can-add">Enhancements You Can Add</a></p>
</li>
<li><p><a target="_blank" href="https://hashnode.com/draft/696aa5b42b74f2bf9668a223#heading-enhancement-you-could-add">Conclusion</a></p>
</li>
</ol>
<p>The complete source code for this project is available on <a target="_blank" href="https://github.com/Caesarsage/distributed-system/tree/main/chatroom-with-broadcast">GitHub</a> if you'd like to reference it while following along.</p>
<h2 id="heading-what-is-a-distributed-chatroom">What is a Distributed Chatroom?</h2>
<p>A chatroom is a server that lets multiple users connect simultaneously and exchange messages in real-time. When we say "production-grade," we mean it includes features you'd expect in a real application: it persists data so messages aren't lost when the server restarts, it handles network failures gracefully, and it can support many concurrent users without slowing down.</p>
<p>The "distributed" aspect refers to how the system manages multiple clients connecting from different locations, all trying to send and receive messages at the same time. This introduces interesting challenges: how do you ensure everyone sees messages in the same order? How do you handle clients with slow internet connections? What happens when someone disconnects unexpectedly?</p>
<p>These aren't just theoretical problems. Every networked application deals with concurrency, state management, and failure handling. Whether you're building a chat app, a multiplayer game, a collaborative editor, or a trading platform, you'll face similar challenges. The patterns you'll learn here apply broadly across distributed systems.</p>
<p>Chat applications are excellent learning projects because they combine several challenging problems in one place. You need to manage concurrent connections safely, broadcast messages to multiple clients without blocking, handle unreliable networks, persist data durably, and ensure the system recovers gracefully from crashes. Each of these topics could be its own tutorial, but here you'll see how they work together in a real application.</p>
<h2 id="heading-what-youll-learn">What You'll Learn</h2>
<p>This tutorial demonstrates several important concepts that are fundamental to building distributed systems. Here's what you'll learn:</p>
<h3 id="heading-1-tcp-socket-programming-in-go">1. TCP Socket Programming in Go</h3>
<p>You'll learn how to accept incoming TCP connections, read and write data over network sockets, and handle connection failures gracefully. These skills are essential for any networked application, from web servers to database clients.</p>
<h3 id="heading-2-concurrent-programming-with-goroutines-and-channels">2. Concurrent Programming with Goroutines and Channels</h3>
<p>Go's concurrency model is one of its strongest features. You'll see how to use goroutines to handle multiple clients simultaneously without blocking. You'll use channels to coordinate between goroutines safely, avoiding the common pitfalls of shared memory concurrency like race conditions and deadlocks.</p>
<h3 id="heading-3-state-management-in-distributed-systems">3. State Management in Distributed Systems</h3>
<p>Managing shared state across concurrent operations is tricky. You'll learn when to use mutexes versus channels, how to design lock granularity to avoid bottlenecks, and how to ensure data consistency when multiple goroutines access the same data.</p>
<h3 id="heading-4-write-ahead-logging-wal-for-durability">4. Write-Ahead Logging (WAL) for Durability</h3>
<p>Databases use WAL to ensure data isn't lost during crashes. You'll implement the same pattern, learning how to balance durability with performance. You'll see why fsync is critical, understand the trade-offs of different persistence strategies, and learn how to recover state after unexpected shutdowns.</p>
<h3 id="heading-5-session-management-and-reconnection">5. Session Management and Reconnection</h3>
<p>Networks are unreliable. Users disconnect, WiFi drops, mobile connections switch towers. You'll build a token-based session system that lets users reconnect seamlessly, preserving their chat history and identity without requiring passwords or complex authentication.</p>
<h3 id="heading-6-graceful-degradation-and-fault-tolerance">6. Graceful Degradation and Fault Tolerance</h3>
<p>Perfect reliability is impossible, so you need to design for partial failures. You'll learn how to prevent slow clients from affecting fast ones, how to continue operating when persistence fails, and how to clean up resources properly when things go wrong.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this tutorial, you should have some foundational knowledge. You don't need to be an expert, but you should be comfortable with the basics.</p>
<ul>
<li><p>Go basics (goroutines, channels, interfaces)</p>
</li>
<li><p>TCP/IP networking fundamentals</p>
</li>
<li><p>Basic concurrency concepts</p>
</li>
<li><p>File I/O operations</p>
</li>
</ul>
<h2 id="heading-tutorial-overview">Tutorial Overview</h2>
<p>This tutorial takes you through building a production-ready chatroom step by step.</p>
<p>You'll start by exploring the overall architecture to understand how components fit together. Then you'll learn about core concepts like concurrency models and persistence strategies.</p>
<p>Next, you'll set up your project structure and define the core data types that represent clients, messages, and the chatroom. Then you'll implement the server initialization and event loop, which is where all coordination happens.</p>
<p>After that, you'll build the networking layer to handle client connections, implement message broadcasting so messages reach all users, and add persistence using write-ahead logging and snapshots.</p>
<p>You'll then implement session management for reconnection, build a command system for user actions, and create a simple client application to test your server.</p>
<p>Finally, you’ll learn how to test and deploy your chatroom, and review key lessons from building a distributed system.</p>
<p>By the end, you'll have a complete, working chatroom and understand how distributed systems handle concurrency, persistence, and failure recovery.</p>
<h2 id="heading-architecture-overview">Architecture Overview</h2>
<p>The system follows a client-server architecture with internal components that work together to provide a robust chat experience.</p>
<h3 id="heading-high-level-architecture">High-Level Architecture</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770347113467/2ca609e9-c902-4311-a571-e9c3b1280786.jpeg" alt="Chatroom broadcast architecture diagram" class="image--center mx-auto" width="1994" height="2005" loading="lazy"></p>
<h3 id="heading-component-breakdown">Component Breakdown</h3>
<h4 id="heading-1-network-layer">1. <strong>Network Layer</strong></h4>
<ul>
<li><p><strong>TCP Listener</strong>: Accepts incoming connections on port 9000</p>
</li>
<li><p><strong>Connection Handler</strong>: Manages individual client connections with dedicated goroutines</p>
</li>
<li><p><strong>Protocol</strong>: Simple newline-delimited text protocol</p>
</li>
</ul>
<h4 id="heading-2-client-management">2. <strong>Client Management</strong></h4>
<p>Each client connection spawns two goroutines:</p>
<ul>
<li><p><strong>Read Goroutine</strong>: Receives messages from client</p>
</li>
<li><p><strong>Write Goroutine</strong>: Sends messages to client (non-blocking with buffered channels)</p>
</li>
</ul>
<h4 id="heading-3-chatroom-core">3. <strong>ChatRoom Core</strong></h4>
<p>This is the heart of the system – a single goroutine running an event loop:</p>
<pre><code class="lang-go"><span class="hljs-keyword">for</span> {
    <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client := &lt;-cr.join:
            <span class="hljs-comment">// Handle new client</span>
        <span class="hljs-keyword">case</span> client := &lt;-cr.leave:
            <span class="hljs-comment">// Handle disconnection</span>
        <span class="hljs-keyword">case</span> message := &lt;-cr.broadcast:
            <span class="hljs-comment">// Broadcast to all clients</span>
        <span class="hljs-keyword">case</span> client := &lt;-cr.listUsers:
            <span class="hljs-comment">// Send user list</span>
        <span class="hljs-keyword">case</span> dm := &lt;-cr.directMessage:
            <span class="hljs-comment">// Handle private message</span>
    }
}
</code></pre>
<h4 id="heading-4-state-management">4. <strong>State Management</strong></h4>
<p>We have three synchronized data structures:</p>
<ul>
<li><p><code>clients map[*Client]bool</code>: Active connections (mutex-protected)</p>
</li>
<li><p><code>sessions map[string]*SessionInfo</code>: User sessions for reconnection</p>
</li>
<li><p><code>messages []Message</code>: In-memory message history</p>
</li>
</ul>
<h4 id="heading-5-persistence-layer">5. <strong>Persistence Layer</strong></h4>
<p>Two-tier approach:</p>
<ul>
<li><p><strong>Write-Ahead Log (WAL)</strong>: Immediate append-only log for durability</p>
</li>
<li><p><strong>Snapshots</strong>: Periodic full state dumps for faster recovery</p>
</li>
</ul>
<h4 id="heading-6-session-management">6. <strong>Session Management</strong></h4>
<p>This enables reconnection with token-based authentication:</p>
<ul>
<li><p>Generates unique tokens per user</p>
</li>
<li><p>1-hour session timeout</p>
</li>
<li><p>Preserves chat history for returning users</p>
</li>
</ul>
<h3 id="heading-message-flow">Message Flow</h3>
<p>Here's how a message travels through the system:</p>
<pre><code class="lang-bash">User Input → Client Read → Server Receive → Broadcast Channel 
    → ChatRoom Loop → Persist to WAL → Fan-out to All Clients
    → Client Write Goroutines → TCP Send → User Display
</code></pre>
<p>The broadcast channel acts as a synchronization point, ensuring total message ordering.</p>
<h2 id="heading-core-concepts-you-need-to-know">Core Concepts You Need to Know</h2>
<h3 id="heading-understanding-the-concurrency-model">Understanding the Concurrency Model</h3>
<p>This chatroom uses Go's CSP (Communicating Sequential Processes) model. This is a fundamentally different approach to concurrency than you might be used to from other languages.</p>
<p>In traditional concurrent programming, you protect shared memory with locks (mutexes). Multiple threads access the same data structure, and you use locks to ensure only one thread modifies it at a time. This works, but it's error-prone. Forget a lock, and you have a race condition. Hold locks too long, and you have deadlocks.</p>
<p>Go encourages a different approach: instead of communicating by sharing memory, you share memory by communicating. You pass data between goroutines through channels. Only one goroutine owns the data at a time, eliminating many concurrency bugs by design.</p>
<p>Channels provide several advantages. They eliminate most race conditions by design, because if only one goroutine owns the data at a time, there's no race to access it. They provide natural flow control since channels can block when full (back pressure) or block when empty (waiting for data). They make it easier to reason about message flow because you can trace how data moves through your system by following the channels. And they offer better composability since you can combine channels with select statements to coordinate multiple operations.</p>
<p>That said, we’ll still use mutexes in this project. Channels aren't always the right tool. We’ll use mutexes when multiple goroutines need quick, frequent access to shared data structures like maps. And we’ll use channels when we want to coordinate behavior or transfer ownership of data.</p>
<p>Here's how the chatroom uses channels to coordinate everything:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> ChatRoom <span class="hljs-keyword">struct</span> {
    join          <span class="hljs-keyword">chan</span> *Client        <span class="hljs-comment">// New connections</span>
    leave         <span class="hljs-keyword">chan</span> *Client        <span class="hljs-comment">// Disconnections</span>
    broadcast     <span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>         <span class="hljs-comment">// Messages to all</span>
    listUsers     <span class="hljs-keyword">chan</span> *Client        <span class="hljs-comment">// User list requests</span>
    directMessage <span class="hljs-keyword">chan</span> DirectMessage  <span class="hljs-comment">// Private messages</span>

    <span class="hljs-comment">// Shared state (mutex-protected)</span>
    clients    <span class="hljs-keyword">map</span>[*Client]<span class="hljs-keyword">bool</span>
    mu         sync.Mutex

    <span class="hljs-comment">// Message history (separate mutex)</span>
    messages   []Message
    messageMu  sync.Mutex
}
</code></pre>
<p>Notice that we have five channels for different types of events. The main event loop receives from all these channels using a select statement. This means all state changes happen sequentially in one place, making the system much easier to reason about.</p>
<p>We could have used one channel that accepts different message types, but separate channels make the code clearer. When you send to <code>chatRoom.join</code>, it's obvious what you're doing. When you send to <code>chatRoom.broadcast</code>, same thing.</p>
<p>The mutexes protect data that many goroutines read frequently. The <code>clients</code> map needs to be accessed every time we broadcast a message. Using a mutex for quick read access is more efficient than passing the entire map through a channel.</p>
<h3 id="heading-understanding-the-persistence-strategy">Understanding the Persistence Strategy</h3>
<p>When your server crashes (and it will eventually), you need to recover the chat history. Users expect their messages to be there when the server restarts. But persistence is expensive: writing to disk is thousands of times slower than writing to memory. So you need a strategy that balances durability with performance.</p>
<p>We’ll use a two-tier approach that's similar to what real databases use: WAL (Write-ahead log) and snapshots.</p>
<p>The WAL is your primary durability mechanism. Here's how it works: every message is immediately appended to a file called <code>messages.wal</code>. This file is append-only, which means we only write to the end. Append-only writes are fast because the disk doesn't need to seek to different locations.</p>
<p>Each message is written as a single line of JSON. After writing each message, we call <code>fsync</code>. This tells the operating system to actually write the data to the physical disk right now, not just buffer it in memory. Without fsync, the OS might lose your data if the power fails before it gets around to writing.</p>
<p>The WAL is append-only and never modified. This makes it very reliable. If the server crashes mid-write, the worst case is one corrupted line at the end, which we can detect and skip during recovery.</p>
<p>The problem with a write-ahead log is that it grows forever. If you have a million messages, you need to replay a million log entries every time you restart the server. That's slow.</p>
<p>Snapshots solve this problem. Every 5 minutes, if there are more than 100 new messages, we write the entire message history to a separate file called <code>snapshot.json</code>. This is the complete state of the chat at that moment.</p>
<p>After creating a snapshot, we truncate (empty) the WAL. New messages continue to append to the WAL, but now we only need to replay messages since the last snapshot.</p>
<p>When the server starts, it first loads the snapshot file (if it exists). This gives us the state from the last snapshot, which might be 100,000 messages. Loading this takes about 100ms. Then it replays all entries from the WAL. This gives us messages written since the last snapshot, which might be only 50 messages. Replaying this takes milliseconds. Finally, it resumes normal operation.</p>
<p>Total recovery time is a few hundred milliseconds instead of several minutes.</p>
<p>This two-tier system gives us the best of both worlds: fast writes during normal operation with the append-only WAL, fast recovery after crashes with snapshot plus small WAL replay, guaranteed durability through fsync after every message, and bounded recovery time because the WAL never grows too large.</p>
<p>The trade-off is that snapshots use more disk space temporarily since you have both the snapshot and the WAL. But disk space is cheap, and correctness is expensive.</p>
<p>Now that you understand the key concepts behind the chatroom's design, it's time to start building. You'll begin by setting up your project structure and creating the necessary directories and files.</p>
<h2 id="heading-how-to-set-up-the-project-structure">How to Set Up the Project Structure</h2>
<p>First, create the directory structure for your project. You will create their files as we walk through the tutorial:</p>
<pre><code class="lang-bash">mkdir -p chatroom-with-broadcast/cmd/server
mkdir -p chatroom-with-broadcast/cmd/client
mkdir -p chatroom-with-broadcast/internal/chatroom
mkdir -p chatroom-with-broadcast/pkg/token
mkdir -p chatroom-with-broadcast/chatdata
<span class="hljs-built_in">cd</span> chatroom-with-broadcast
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770347289299/82f067a8-2cd0-49f0-8338-a002846e618b.png" alt="Chatroom project structure" class="image--center mx-auto" width="388" height="788" loading="lazy"></p>
<p>Then initialize the Go module.</p>
<p>Note that you’ll need Go 1.23.2 or later installed on your machine. Earlier versions might work, but the code examples assume features available in Go 1.23 and above. This version includes improvements to the standard library that make concurrent programming more efficient.</p>
<pre><code class="lang-bash">go mod init github.com/yourusername/chatroom
</code></pre>
<p>Your <code>go.mod</code> file should look like this:</p>
<pre><code class="lang-go">module github.com/yourusername/chatroom

<span class="hljs-keyword">go</span> <span class="hljs-number">1.23</span><span class="hljs-number">.2</span>
</code></pre>
<p>With your project structure in place, you're ready to start writing code. The first step is defining the data types that will represent the core components of your chatroom: messages, clients, and the chatroom itself.</p>
<h2 id="heading-how-to-define-core-data-types">How to Define Core Data Types</h2>
<p>Create a new file <code>internal/chatroom/types.go</code> to define your core data structures. These types form the foundation of your chatroom, so it's important to understand what each one represents and why it's designed the way it is.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"net"</span>
    <span class="hljs-string">"os"</span>
    <span class="hljs-string">"sync"</span>
    <span class="hljs-string">"time"</span>
)

<span class="hljs-comment">// Message represents a single chat message with metadata</span>
<span class="hljs-keyword">type</span> Message <span class="hljs-keyword">struct</span> {
    ID        <span class="hljs-keyword">int</span>       <span class="hljs-string">`json:"id"`</span>
    From      <span class="hljs-keyword">string</span>    <span class="hljs-string">`json:"from"`</span>
    Content   <span class="hljs-keyword">string</span>    <span class="hljs-string">`json:"content"`</span>
    Timestamp time.Time <span class="hljs-string">`json:"timestamp"`</span>
    Channel   <span class="hljs-keyword">string</span>    <span class="hljs-string">`json:"channel"`</span> <span class="hljs-comment">// "global" or "private:username"</span>
}

<span class="hljs-comment">// Client represents a connected user</span>
<span class="hljs-keyword">type</span> Client <span class="hljs-keyword">struct</span> {
    conn         net.Conn      <span class="hljs-comment">// TCP connection</span>
    username     <span class="hljs-keyword">string</span>        <span class="hljs-comment">// Display name</span>
    outgoing     <span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>   <span class="hljs-comment">// Buffered channel for writes</span>
    lastActive   time.Time     <span class="hljs-comment">// For idle detection</span>
    messagesSent <span class="hljs-keyword">int</span>           <span class="hljs-comment">// Statistics</span>
    messagesRecv <span class="hljs-keyword">int</span>
    isSlowClient <span class="hljs-keyword">bool</span>          <span class="hljs-comment">// Testing flag</span>

    reconnectToken <span class="hljs-keyword">string</span>
    mu             sync.Mutex   <span class="hljs-comment">// Protects stats fields</span>
}

<span class="hljs-comment">// ChatRoom is the central coordinator</span>
<span class="hljs-keyword">type</span> ChatRoom <span class="hljs-keyword">struct</span> {
    <span class="hljs-comment">// Communication channels</span>
    join          <span class="hljs-keyword">chan</span> *Client
    leave         <span class="hljs-keyword">chan</span> *Client
    broadcast     <span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>
    listUsers     <span class="hljs-keyword">chan</span> *Client
    directMessage <span class="hljs-keyword">chan</span> DirectMessage

    <span class="hljs-comment">// State</span>
    clients       <span class="hljs-keyword">map</span>[*Client]<span class="hljs-keyword">bool</span>
    mu            sync.Mutex
    totalMessages <span class="hljs-keyword">int</span>
    startTime     time.Time

    <span class="hljs-comment">// Message history</span>
    messages      []Message
    messageMu     sync.Mutex
    nextMessageID <span class="hljs-keyword">int</span>

    <span class="hljs-comment">// Persistence</span>
    walFile       *os.File
    walMu         sync.Mutex
    dataDir       <span class="hljs-keyword">string</span>

    <span class="hljs-comment">// Sessions</span>
    sessions      <span class="hljs-keyword">map</span>[<span class="hljs-keyword">string</span>]*SessionInfo
    sessionsMu    sync.Mutex
}

<span class="hljs-comment">// SessionInfo tracks reconnection data</span>
<span class="hljs-keyword">type</span> SessionInfo <span class="hljs-keyword">struct</span> {
    Username       <span class="hljs-keyword">string</span>
    ReconnectToken <span class="hljs-keyword">string</span>
    LastSeen       time.Time
    CreatedAt      time.Time
}

<span class="hljs-comment">// DirectMessage represents a private message</span>
<span class="hljs-keyword">type</span> DirectMessage <span class="hljs-keyword">struct</span> {
    toClient *Client
    message  <span class="hljs-keyword">string</span>
}
</code></pre>
<h4 id="heading-understanding-the-message-type">Understanding the Message Type</h4>
<p>The <code>Message</code> struct stores everything we need to know about a chat message. The <code>ID</code> field uniquely identifies each message and ensures messages stay in order. The <code>Timestamp</code> lets us show when messages were sent, which is important for chat history.</p>
<p>The <code>Channel</code> field is interesting. Right now, we only use "global" for public messages, but this design lets us add private channels or chat rooms later without changing the data structure. Good data structures anticipate future needs.</p>
<h4 id="heading-understanding-the-client-type">Understanding the Client Type</h4>
<p>Each connected user is represented by a <code>Client</code> struct. The <code>conn</code> field is their TCP connection – this is how we send and receive data.</p>
<p>The <code>outgoing</code> channel is crucial for performance. Notice it's a <code>chan string</code>, which means it's a channel of strings. We'll make this a buffered channel (size 10). This buffer means we can queue up 10 messages for this client without blocking. If a client is slow to read, we can keep sending to other clients.</p>
<p>Without this buffer, one slow client would block the entire broadcast. With the buffer, slow clients just miss messages if they can't keep up, which is much better than slowing everyone down.</p>
<p>The <code>lastActive</code> timestamp helps us detect idle users. If someone hasn't sent a message in 5 minutes, we can disconnect them to free up resources.</p>
<p>The <code>mu</code> mutex protects the statistics fields. Multiple goroutines will update <code>messagesSent</code> and <code>messagesRecv</code>, so we need a mutex to prevent race conditions.</p>
<h4 id="heading-understanding-the-chatroom-type">Understanding the ChatRoom Type</h4>
<p>This is the heart of the system. Notice that we have two kinds of fields: channels and protected state.</p>
<p>The five channels (<code>join</code>, <code>leave</code>, <code>broadcast</code>, <code>listUsers</code>, <code>directMessage</code>) are how different parts of the system communicate with the main event loop. When a new client connects, we send them to the <code>join</code> channel. When someone sends a message, it goes to the <code>broadcast</code> channel.</p>
<p>These channels are unbuffered (capacity 0) because we want synchronization. When you send to an unbuffered channel, you block until someone receives. This ensures the event loop processes events in order.</p>
<p>The protected state (maps and slices) needs mutexes because multiple goroutines access it. Notice that we use separate mutexes for different data. The <code>mu</code> mutex protects the <code>clients</code> map. The <code>messageMu</code> mutex protects the <code>messages</code> slice. The <code>sessionsMu</code> mutex protects the <code>sessions</code> map.</p>
<p>Why separate mutexes? Performance. If we used one mutex for everything, broadcasting a message would lock all the data, preventing new clients from joining. Separate mutexes mean different operations can happen concurrently.</p>
<p>The WAL file (<code>walFile</code>) also has its own mutex (<code>walMu</code>) because writing to disk is slow. We don't want to hold the main mutex while waiting for disk I/O.</p>
<p>With your data types defined, the next step is creating a function to initialize the server. This function will set up all your data structures, restore any persisted state from previous runs, and start background workers.</p>
<h2 id="heading-how-to-initialize-the-server">How to Initialize the Server</h2>
<p>Server initialization is critical because you need to set up all your data structures in the right order. If you restore state after opening the WAL, you might replay messages twice. If you start accepting connections before loading history, users won't see old messages.</p>
<p>Create a file <code>internal/chatroom/run.go</code> to bootstrap the server:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"net"</span>
    <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewChatRoom</span><span class="hljs-params">(dataDir <span class="hljs-keyword">string</span>)</span> <span class="hljs-params">(*ChatRoom, error)</span></span> {
    cr := &amp;ChatRoom{
        clients:       <span class="hljs-built_in">make</span>(<span class="hljs-keyword">map</span>[*Client]<span class="hljs-keyword">bool</span>),
        join:          <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> *Client),
        leave:         <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> *Client),
        broadcast:     <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>),
        listUsers:     <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> *Client),
        directMessage: <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> DirectMessage),
        sessions:      <span class="hljs-built_in">make</span>(<span class="hljs-keyword">map</span>[<span class="hljs-keyword">string</span>]*SessionInfo),
        messages:      <span class="hljs-built_in">make</span>([]Message, <span class="hljs-number">0</span>),
        startTime:     time.Now(),
        dataDir:       dataDir,
    }

    <span class="hljs-comment">// Restore from snapshot if available</span>
    <span class="hljs-keyword">if</span> err := cr.loadSnapshot(); err != <span class="hljs-literal">nil</span> {
        fmt.Printf(<span class="hljs-string">"Failed to load snapshot: %v\n"</span>, err)
    }

    <span class="hljs-comment">// Initialize WAL for new messages</span>
    <span class="hljs-keyword">if</span> err := cr.initializePersistence(); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    <span class="hljs-comment">// Start background snapshot worker</span>
    <span class="hljs-keyword">go</span> cr.periodicSnapshots()

    <span class="hljs-keyword">return</span> cr, <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">periodicSnapshots</span><span class="hljs-params">()</span></span> {
    ticker := time.NewTicker(<span class="hljs-number">5</span> * time.Minute)
    <span class="hljs-keyword">defer</span> ticker.Stop()

    <span class="hljs-keyword">for</span> <span class="hljs-keyword">range</span> ticker.C {
        cr.messageMu.Lock()
        messageCount := <span class="hljs-built_in">len</span>(cr.messages)
        cr.messageMu.Unlock()

        <span class="hljs-keyword">if</span> messageCount &gt; <span class="hljs-number">100</span> {
            <span class="hljs-keyword">if</span> err := cr.createSnapshot(); err != <span class="hljs-literal">nil</span> {
                fmt.Printf(<span class="hljs-string">"Snapshot failed: %v\n"</span>, err)
            }
        }
    }
}
</code></pre>
<p>Let's break down what happens during initialization:</p>
<h4 id="heading-1-creating-data-structures">1. Creating Data Structures</h4>
<p>We start by creating all the maps and channels. The <code>make</code> function initializes these properly. For maps, this creates an empty map ready to use. For channels, this creates an unbuffered channel (capacity 0).</p>
<p>Notice we create the <code>messages</code> slice with initial capacity 0 but room to grow: <code>make([]Message, 0)</code>. This is more efficient than starting with <code>nil</code> because the slice is ready to append immediately without allocation.</p>
<h4 id="heading-2-loading-the-snapshot">2. Loading the Snapshot</h4>
<p>Before we accept any connections, we try to load a snapshot from disk. This restores the chat history from the last time the server ran. If the snapshot doesn't exist (first run) or fails to load (corrupted file), we just continue with an empty history.</p>
<p>This step must happen before initializing the WAL. If we opened the WAL first, we might replay messages that are already in the snapshot, creating duplicates.</p>
<h4 id="heading-3-initializing-the-wal">3. Initializing the WAL</h4>
<p>The <code>initializePersistence()</code> function opens the WAL file in append mode. It also replays any entries in the WAL that happened after the last snapshot. This ensures we don't lose any messages that were written to the WAL but not yet included in a snapshot.</p>
<p>If this step fails, we return an error and refuse to start. Why? Because if we can't write to the WAL, we can't guarantee durability. It's better to refuse to start than to lie to users by accepting messages we can't persist.</p>
<h4 id="heading-4-starting-background-workers">4. Starting Background Workers</h4>
<p>The <code>periodicSnapshots()</code> function runs in a separate goroutine. It wakes up every 5 minutes and checks if we need to create a snapshot. Notice the <code>defer ticker.Stop()</code> – this is important. If we forget to stop the ticker, it leaks a goroutine and wastes resources.</p>
<p>The goroutine acquires the <code>messageMu</code> lock just to read the message count, then releases it immediately. We don't hold the lock during the snapshot creation because that's slow and would block message broadcasting.</p>
<h4 id="heading-why-5-minutes-and-100-messages">Why 5 Minutes and 100 Messages?</h4>
<p>These are tunable parameters. 5 minutes means recovery never needs to replay more than 5 minutes of messages. 100 messages means we don't create snapshots too frequently during quiet periods.</p>
<p>In a production system, you might make these configurable. A high-traffic chat might want shorter intervals. A low-traffic chat might want longer intervals to reduce disk I/O.</p>
<p>Now that your server is initialized with all the necessary data structures and background workers, you need to build the core coordination mechanism. The event loop is where all state changes happen in your chatroom. It's the heartbeat that keeps everything synchronized.</p>
<h2 id="heading-how-to-build-the-event-loop">How to Build the Event Loop</h2>
<p>The event loop is the heart of your chatroom. Every client connection, message, and disconnection flows through this single point. This might seem like it could be a bottleneck, but it's actually what makes the system simple and safe.</p>
<p>The <code>Run()</code> method is the server's heartbeat. This is where all the magic happens. Every event in the system flows through this loop. Add this to <code>run.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">Run</span><span class="hljs-params">()</span></span> {
    fmt.Println(<span class="hljs-string">"ChatRoom heartbeat started..."</span>)
    <span class="hljs-keyword">go</span> cr.cleanupInactiveClients()

    <span class="hljs-keyword">for</span> {
        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client := &lt;-cr.join:
            cr.handleJoin(client)

        <span class="hljs-keyword">case</span> client := &lt;-cr.leave:
            cr.handleLeave(client)

        <span class="hljs-keyword">case</span> message := &lt;-cr.broadcast:
            cr.handleBroadcast(message)

        <span class="hljs-keyword">case</span> client := &lt;-cr.listUsers:
            cr.sendUserList(client)

        <span class="hljs-keyword">case</span> dm := &lt;-cr.directMessage:
            cr.handleDirectMessage(dm)
        }
    }
}
</code></pre>
<h4 id="heading-understanding-the-select-statement">Understanding the Select Statement</h4>
<p>The <code>select</code> statement is one of Go's most powerful concurrency features. It's like a switch statement for channels. The select waits until one of its cases can proceed, then it executes that case.</p>
<p>Here's what happens: The loop blocks on the select statement, waiting for data on any of the five channels. When data arrives on any channel, that case executes. After the case completes, the loop goes back to waiting.</p>
<p>For example, when a new client connects, code elsewhere in your program sends that client to <code>cr.join</code>. The select receives it and executes <code>cr.handleJoin(client)</code>. Once that finishes, the loop goes back to waiting.</p>
<h4 id="heading-why-use-a-single-event-loop">Why Use a Single Event Loop?</h4>
<p>This might seem like a bottleneck. You have one goroutine processing all events sequentially. Why not process events in parallel?</p>
<p>The answer is consistency. Here's what you gain from sequential processing:</p>
<p><strong>1. No Race Conditions on State</strong></p>
<p>Only one goroutine modifies the <code>clients</code> map, the <code>messages</code> slice, and the <code>sessions</code> map. You never need to worry about two operations interfering with each other. When you add a client in <code>handleJoin</code>, you know for certain that no other code is simultaneously removing clients or broadcasting messages.</p>
<p>This is incredibly powerful. Most bugs in concurrent systems come from unexpected interleaving of operations. By processing events sequentially, you eliminate an entire class of bugs.</p>
<p><strong>2. Total Ordering of Events</strong></p>
<p>Messages are broadcast in the order they arrive. This seems obvious, but it's important. If Alice sends "Hello" and then Bob sends "Hi", you can guarantee everyone sees them in that order. With parallel processing, you'd need additional synchronization to maintain ordering.</p>
<p><strong>3. Simple State Transitions</strong></p>
<p>You can reason about your system state as a series of transitions. "After this join event, the client is in the map. After this leave event, the client is removed." You don't need to worry about concurrent state changes making your reasoning invalid.</p>
<p><strong>4. Easy to Debug</strong></p>
<p>When something goes wrong, you can add logging to the event loop and see exactly what sequence of events led to the problem. With parallel processing, the order of events depends on thread scheduling, making bugs hard to reproduce.</p>
<h4 id="heading-is-this-actually-a-bottleneck">Is This Actually a Bottleneck?</h4>
<p>You might worry that sequential processing limits performance. In practice, it's fine for this workload. Here's why:</p>
<p>The handlers are fast. They do simple things like adding to a map, removing from a map, or forwarding a message to channels. These operations take microseconds. The event loop can process thousands of events per second.</p>
<p>The slow operations (writing to disk, sending to client connections) happen in other goroutines. The event loop doesn't wait for them. It just sends data to a channel or adds work to a queue, then immediately moves to the next event.</p>
<p>If you needed higher throughput, you could shard your chat into multiple rooms, each with its own event loop. But for a single chatroom, sequential processing is both simpler and fast enough.</p>
<h4 id="heading-understanding-the-cleanup-worker">Understanding the Cleanup Worker</h4>
<p>Notice the line <code>go cr.cleanupInactiveClients()</code> before the loop. This starts a background goroutine that periodically checks for idle clients.</p>
<p>Why not include this in the event loop? Because it's time-based, not event-based. The cleanup worker wakes up every 30 seconds and sends disconnect events for idle clients. These events flow through the normal event loop, maintaining our single-threaded state mutation property.</p>
<p>Now add the <code>runServer()</code> function and shutdown handler:</p>
<pre><code class="lang-go"><span class="hljs-keyword">import</span> (
    <span class="hljs-string">"os"</span>
    <span class="hljs-string">"os/signal"</span>
    <span class="hljs-string">"syscall"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">runServer</span><span class="hljs-params">()</span></span> {
    chatRoom, err := NewChatRoom(<span class="hljs-string">"./chatdata"</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        fmt.Printf(<span class="hljs-string">"Failed to initialize: %v\n"</span>, err)
        <span class="hljs-keyword">return</span>
    }
    <span class="hljs-keyword">defer</span> chatRoom.shutdown()

    <span class="hljs-comment">// Set up signal handling for graceful shutdown</span>
    sigChan := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> os.Signal, <span class="hljs-number">1</span>)
    signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)

    <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
        &lt;-sigChan
        fmt.Println(<span class="hljs-string">"\nReceived shutdown signal"</span>)
        chatRoom.shutdown()
        os.Exit(<span class="hljs-number">0</span>)
    }()

    <span class="hljs-keyword">go</span> chatRoom.Run()

    listener, err := net.Listen(<span class="hljs-string">"tcp"</span>, <span class="hljs-string">":9000"</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        fmt.Println(<span class="hljs-string">"Error starting server:"</span>, err)
        <span class="hljs-keyword">return</span>
    }
    <span class="hljs-keyword">defer</span> listener.Close()

    fmt.Println(<span class="hljs-string">"Server started on :9000"</span>)

    <span class="hljs-keyword">for</span> {
        conn, err := listener.Accept()
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            fmt.Println(<span class="hljs-string">"Error accepting connection:"</span>, err)
            <span class="hljs-keyword">continue</span>
        }
        fmt.Println(<span class="hljs-string">"New connection from:"</span>, conn.RemoteAddr())
        <span class="hljs-keyword">go</span> handleClient(conn, chatRoom)
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">shutdown</span><span class="hljs-params">()</span></span> {
    fmt.Println(<span class="hljs-string">"\nShutting down..."</span>)
    <span class="hljs-keyword">if</span> err := cr.createSnapshot(); err != <span class="hljs-literal">nil</span> {
        fmt.Printf(<span class="hljs-string">"Final snapshot failed: %v\n"</span>, err)
    }
    <span class="hljs-keyword">if</span> cr.walFile != <span class="hljs-literal">nil</span> {
        cr.walFile.Close()
    }
    fmt.Println(<span class="hljs-string">"Shutdown complete"</span>)
}
</code></pre>
<p>The <code>runServer()</code> function ties everything together:</p>
<ol>
<li><p>Create the chatroom with <code>NewChatRoom()</code></p>
</li>
<li><p>Defer the shutdown function so it runs when the function exits</p>
</li>
<li><p>Start the event loop in a separate goroutine with <code>go chatRoom.Run()</code></p>
</li>
<li><p>Listen for TCP connections on port 9000</p>
</li>
<li><p>For each connection, spawn a goroutine with <code>go handleClient()</code></p>
</li>
</ol>
<p>The defer statement is important. No matter how the function exits (normal return, panic, error), the shutdown function runs. This ensures we create a final snapshot and close the WAL file cleanly.</p>
<p>The signal handling goroutine listens for SIGINT (Ctrl+C) or SIGTERM (system shutdown). When it receives one, it calls <code>shutdown()</code> and exits gracefully. This means when you press Ctrl+C, the server saves its state before stopping.</p>
<p>With your event loop running and listening for connections, the next step is handling what happens when a client actually connects. This involves reading their username, creating a session, and setting up the communication channels.</p>
<h2 id="heading-how-to-handle-client-connections">How to Handle Client Connections</h2>
<p>When a client connects to your server, several things need to happen: you need to establish the TCP connection, prompt for a username, create a Client object to represent them, start goroutines to read and write messages, and handle both normal disconnections and unexpected failures.</p>
<p>Create a file <code>internal/chatroom/io.go</code> for managing client connections. When a client connects, <code>handleClient()</code> manages the entire lifecycle:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"bufio"</span>
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"math/rand"</span>
    <span class="hljs-string">"net"</span>
    <span class="hljs-string">"strings"</span>
    <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">handleClient</span><span class="hljs-params">(conn net.Conn, chatRoom *ChatRoom)</span></span> {
    <span class="hljs-keyword">defer</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
        <span class="hljs-keyword">if</span> r := <span class="hljs-built_in">recover</span>(); r != <span class="hljs-literal">nil</span> {
            fmt.Printf(<span class="hljs-string">"Panic in handleClient: %v\n"</span>, r)
        }
        conn.Close()
    }()

    <span class="hljs-comment">// Set initial timeout for username entry</span>
    conn.SetReadDeadline(time.Now().Add(<span class="hljs-number">30</span> * time.Second))

    reader := bufio.NewReader(conn)

    <span class="hljs-comment">// Prompt for username or reconnection</span>
    conn.Write([]<span class="hljs-keyword">byte</span>(<span class="hljs-string">"Enter username (or 'reconnect:&lt;username&gt;:&lt;token&gt;'): \n"</span>))

    input, err := reader.ReadString(<span class="hljs-string">'\n'</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        fmt.Println(<span class="hljs-string">"Failed to read username:"</span>, err)
        <span class="hljs-keyword">return</span>
    }
    input = strings.TrimSpace(input)

    <span class="hljs-keyword">var</span> username <span class="hljs-keyword">string</span>
    <span class="hljs-keyword">var</span> reconnectToken <span class="hljs-keyword">string</span>
    <span class="hljs-keyword">var</span> isReconnecting <span class="hljs-keyword">bool</span>

    <span class="hljs-comment">// Parse reconnection attempt</span>
    <span class="hljs-keyword">if</span> strings.HasPrefix(input, <span class="hljs-string">"reconnect:"</span>) {
        parts := strings.Split(input, <span class="hljs-string">":"</span>)
        <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(parts) == <span class="hljs-number">3</span> {
            username = parts[<span class="hljs-number">1</span>]
            reconnectToken = parts[<span class="hljs-number">2</span>]
            isReconnecting = <span class="hljs-literal">true</span>
        } <span class="hljs-keyword">else</span> {
            conn.Write([]<span class="hljs-keyword">byte</span>(<span class="hljs-string">"Invalid format. Use: reconnect:&lt;username&gt;:&lt;token&gt;\n"</span>))
            <span class="hljs-keyword">return</span>
        }
    } <span class="hljs-keyword">else</span> {
        username = input
    }

    <span class="hljs-comment">// Generate guest name if empty</span>
    <span class="hljs-keyword">if</span> username == <span class="hljs-string">""</span> {
        username = fmt.Sprintf(<span class="hljs-string">"Guest%d"</span>, rand.Intn(<span class="hljs-number">1000</span>))
    }

    <span class="hljs-comment">// Validate reconnection or check for duplicate</span>
    <span class="hljs-keyword">if</span> isReconnecting {
        <span class="hljs-keyword">if</span> chatRoom.validateReconnectToken(username, reconnectToken) {
            fmt.Printf(<span class="hljs-string">"%s reconnected successfully\n"</span>, username)
            conn.Write([]<span class="hljs-keyword">byte</span>(fmt.Sprintf(<span class="hljs-string">"Welcome back, %s!\n"</span>, username)))
        } <span class="hljs-keyword">else</span> {
            conn.Write([]<span class="hljs-keyword">byte</span>(<span class="hljs-string">"Invalid token or session expired.\n"</span>))
            <span class="hljs-keyword">return</span>
        }
    } <span class="hljs-keyword">else</span> {
        <span class="hljs-comment">// Prevent duplicate logins</span>
        <span class="hljs-keyword">if</span> chatRoom.isUsernameConnected(username) {
            conn.Write([]<span class="hljs-keyword">byte</span>(<span class="hljs-string">"Username already connected. Use reconnect if you lost connection.\n"</span>))
            <span class="hljs-keyword">return</span>
        }

        <span class="hljs-comment">// Create or retrieve session</span>
        chatRoom.sessionsMu.Lock()
        existingSession := chatRoom.sessions[username]
        chatRoom.sessionsMu.Unlock()

        <span class="hljs-keyword">if</span> existingSession != <span class="hljs-literal">nil</span> {
            token := existingSession.ReconnectToken
            msg := fmt.Sprintf(<span class="hljs-string">"Tip: Save this token: %s\n"</span>, token)
            msg += fmt.Sprintf(<span class="hljs-string">"To reconnect: reconnect:%s:%s\n"</span>, username, token)
            conn.Write([]<span class="hljs-keyword">byte</span>(msg))
        } <span class="hljs-keyword">else</span> {
            session := chatRoom.createSession(username)
            token := session.ReconnectToken
            msg := fmt.Sprintf(<span class="hljs-string">"Your token: %s\n"</span>, token)
            msg += fmt.Sprintf(<span class="hljs-string">"To reconnect: reconnect:%s:%s\n"</span>, username, token)
            conn.Write([]<span class="hljs-keyword">byte</span>(msg))
        }
    }

    <span class="hljs-comment">// Create client object</span>
    client := &amp;Client{
        conn:           conn,
        username:       username,
        outgoing:       <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>, <span class="hljs-number">10</span>), <span class="hljs-comment">// Buffered</span>
        lastActive:     time.Now(),
        reconnectToken: reconnectToken,
        isSlowClient:   rand.Float64() &lt; <span class="hljs-number">0.1</span>, <span class="hljs-comment">// 10% chance for testing</span>
    }

    <span class="hljs-comment">// Clear timeout for normal operation</span>
    conn.SetReadDeadline(time.Time{})

    <span class="hljs-comment">// Notify chatroom</span>
    chatRoom.join &lt;- client

    <span class="hljs-comment">// Send welcome message</span>
    welcomeMsg := buildWelcomeMessage(username)
    conn.Write([]<span class="hljs-keyword">byte</span>(welcomeMsg))

    <span class="hljs-comment">// Start read/write loops</span>
    <span class="hljs-keyword">go</span> readMessages(client, chatRoom)
    writeMessages(client) <span class="hljs-comment">// Blocks until disconnect</span>

    <span class="hljs-comment">// Update session on disconnect</span>
    chatRoom.updateSessionActivity(username)
    chatRoom.leave &lt;- client
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">buildWelcomeMessage</span><span class="hljs-params">(username <span class="hljs-keyword">string</span>)</span> <span class="hljs-title">string</span></span> {
    msg := fmt.Sprintf(<span class="hljs-string">"Welcome, %s!\n"</span>, username)
    msg += <span class="hljs-string">"Commands:\n"</span>
    msg += <span class="hljs-string">"  /users - List all users\n"</span>
    msg += <span class="hljs-string">"  /history [N] - Show last N messages\n"</span>
    msg += <span class="hljs-string">"  /msg &lt;user&gt; &lt;msg&gt; - Private message\n"</span>
    msg += <span class="hljs-string">"  /token - Show your reconnect token\n"</span>
    msg += <span class="hljs-string">"  /stats - Show your stats\n"</span>
    msg += <span class="hljs-string">"  /quit - Leave\n"</span>
    <span class="hljs-keyword">return</span> msg
}
</code></pre>
<p>The initial 30-second timeout prevents connection exhaustion by disconnecting clients who don't enter a username quickly. The buffered <code>outgoing</code> channel prevents slow clients from blocking the broadcaster. Token-based reconnection lets users resume their session without complex authentication. The dual goroutine design means reading and writing happen independently, so a slow write doesn't block incoming messages.</p>
<h3 id="heading-how-to-read-messages-from-clients">How to Read Messages from Clients</h3>
<p>Add the <code>readMessages()</code> goroutine to handles all incoming data:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">readMessages</span><span class="hljs-params">(client *Client, chatRoom *ChatRoom)</span></span> {
    <span class="hljs-keyword">defer</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
        <span class="hljs-keyword">if</span> r := <span class="hljs-built_in">recover</span>(); r != <span class="hljs-literal">nil</span> {
            fmt.Printf(<span class="hljs-string">"Panic in readMessages for %s: %v\n"</span>, client.username, r)
        }
    }()

    reader := bufio.NewReader(client.conn)

    <span class="hljs-keyword">for</span> {
        <span class="hljs-comment">// Set 5-minute idle timeout</span>
        client.conn.SetReadDeadline(time.Now().Add(<span class="hljs-number">5</span> * time.Minute))

        message, err := reader.ReadString(<span class="hljs-string">'\n'</span>)
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">if</span> netErr, ok := err.(net.Error); ok &amp;&amp; netErr.Timeout() {
                fmt.Printf(<span class="hljs-string">"%s timed out\n"</span>, client.username)
            } <span class="hljs-keyword">else</span> {
                fmt.Printf(<span class="hljs-string">"%s disconnected: %v\n"</span>, client.username, err)
            }
            <span class="hljs-keyword">return</span>
        }

        client.markActive() <span class="hljs-comment">// Update activity timestamp</span>

        message = strings.TrimSpace(message)
        <span class="hljs-keyword">if</span> message == <span class="hljs-string">""</span> {
            <span class="hljs-keyword">continue</span>
        }

        client.mu.Lock()
        client.messagesRecv++
        client.mu.Unlock()

        <span class="hljs-comment">// Process commands vs. regular messages</span>
        <span class="hljs-keyword">if</span> strings.HasPrefix(message, <span class="hljs-string">"/"</span>) {
            handleCommand(client, chatRoom, message)
            <span class="hljs-keyword">continue</span>
        }

        <span class="hljs-comment">// Regular message - format and broadcast</span>
        formatted := fmt.Sprintf(<span class="hljs-string">"[%s]: %s\n"</span>, client.username, message)
        chatRoom.broadcast &lt;- formatted
    }
}
</code></pre>
<p>5 minutes of idle time triggers auto-disconnect. This prevents zombie connections from consuming resources.</p>
<h3 id="heading-how-to-write-messages-to-clients">How to Write Messages to Clients</h3>
<p>Add the <code>writeMessages()</code> function to drain the client's <code>outgoing</code> channel:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">writeMessages</span><span class="hljs-params">(client *Client)</span></span> {
    <span class="hljs-keyword">defer</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
        <span class="hljs-keyword">if</span> r := <span class="hljs-built_in">recover</span>(); r != <span class="hljs-literal">nil</span> {
            fmt.Printf(<span class="hljs-string">"Panic in writeMessages for %s: %v\n"</span>, client.username, r)
        }
    }()

    writer := bufio.NewWriter(client.conn)

    <span class="hljs-keyword">for</span> message := <span class="hljs-keyword">range</span> client.outgoing {
        <span class="hljs-comment">// Simulate slow client (testing mode)</span>
        <span class="hljs-keyword">if</span> client.isSlowClient {
            time.Sleep(time.Duration(rand.Intn(<span class="hljs-number">500</span>)) * time.Millisecond)
        }

        _, err := writer.WriteString(message)
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            fmt.Printf(<span class="hljs-string">"Write error for %s: %v\n"</span>, client.username, err)
            <span class="hljs-keyword">return</span>
        }

        err = writer.Flush()
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            fmt.Printf(<span class="hljs-string">"Flush error for %s: %v\n"</span>, client.username, err)
            <span class="hljs-keyword">return</span>
        }
    }
}
</code></pre>
<p>Real-world clients have varying network speeds. A client with a slow internet connection shouldn't block message delivery to other users. This is a fundamental challenge in any system that broadcasts to multiple recipients.</p>
<p>To handle this, we use two techniques. First, the <code>outgoing</code> channel is buffered with a size of 10. This means the system can queue up 10 messages for a client without blocking. If a client temporarily slows down (maybe they're loading a large webpage in another tab), the buffer absorbs the slowdown.</p>
<p>Second, when broadcasting messages (which you'll see in the next section), we use non-blocking sends. If a client's buffer is full because they're consistently too slow, we skip sending to them rather than blocking everyone else. The slow client misses some messages, but everyone else continues normally. This is called graceful degradation: the system continues working even when parts of it have problems.</p>
<p>With client connections handled, the next step is implementing the core feature of any chat system: broadcasting messages to all connected users. Broadcasting means taking one message and sending it to many recipients efficiently and safely.</p>
<h2 id="heading-how-to-implement-message-broadcasting">How to Implement Message Broadcasting</h2>
<p>Broadcasting is the heart of a chat application. When one user sends a message, it needs to reach everyone else instantly. But this is trickier than it sounds because you need to persist the message for durability, send it to clients at different speeds without blocking, and maintain message ordering across all clients.</p>
<p>Create <code>internal/chatroom/handlers.go</code> to handle events.</p>
<p>The <code>handleBroadcast()</code> method is where messages reach all users:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"strings"</span>
    <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">handleBroadcast</span><span class="hljs-params">(message <span class="hljs-keyword">string</span>)</span></span> {
    <span class="hljs-comment">// Parse message metadata</span>
    parts := strings.SplitN(message, <span class="hljs-string">": "</span>, <span class="hljs-number">2</span>)
    from := <span class="hljs-string">"system"</span>
    actualContent := message

    <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(parts) == <span class="hljs-number">2</span> {
        from = strings.Trim(parts[<span class="hljs-number">0</span>], <span class="hljs-string">"[]"</span>)
        actualContent = parts[<span class="hljs-number">1</span>]
    }

    <span class="hljs-comment">// Create persistent message record</span>
    cr.messageMu.Lock()
    msg := Message{
        ID:        cr.nextMessageID,
        From:      from,
        Content:   actualContent,
        Timestamp: time.Now(),
        Channel:   <span class="hljs-string">"global"</span>,
    }
    cr.nextMessageID++
    cr.messages = <span class="hljs-built_in">append</span>(cr.messages, msg)
    cr.messageMu.Unlock()

    <span class="hljs-comment">// Persist to WAL</span>
    <span class="hljs-keyword">if</span> err := cr.persistMessage(msg); err != <span class="hljs-literal">nil</span> {
        fmt.Printf(<span class="hljs-string">"Failed to persist: %v\n"</span>, err)
        <span class="hljs-comment">// Continue anyway - availability over consistency</span>
    }

    <span class="hljs-comment">// Collect current clients</span>
    cr.mu.Lock()
    clients := <span class="hljs-built_in">make</span>([]*Client, <span class="hljs-number">0</span>, <span class="hljs-built_in">len</span>(cr.clients))
    <span class="hljs-keyword">for</span> client := <span class="hljs-keyword">range</span> cr.clients {
        clients = <span class="hljs-built_in">append</span>(clients, client)
    }
    cr.totalMessages++
    cr.mu.Unlock()

    fmt.Printf(<span class="hljs-string">"Broadcasting to %d clients: %s"</span>, <span class="hljs-built_in">len</span>(clients), message)

    <span class="hljs-comment">// Fan-out to all clients</span>
    <span class="hljs-keyword">for</span> _, client := <span class="hljs-keyword">range</span> clients {
        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client.outgoing &lt;- message:
            client.mu.Lock()
            client.messagesSent++
            client.mu.Unlock()
        <span class="hljs-keyword">default</span>:
            fmt.Printf(<span class="hljs-string">"Skipped %s (channel full)\n"</span>, client.username)
        }
    }
}
</code></pre>
<h4 id="heading-consistency-trade-off">Consistency Trade-off:</h4>
<p>If a WAL write fails, you still broadcast the message. Why? Because availability is more important than perfect consistency for a chat application. Users get their messages immediately, and you can handle WAL repair manually if needed.</p>
<h3 id="heading-how-to-handle-join-and-leave-events">How to Handle Join and Leave Events</h3>
<p>Add these handlers to <code>handlers.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">handleJoin</span><span class="hljs-params">(client *Client)</span></span> {
    cr.mu.Lock()
    cr.clients[client] = <span class="hljs-literal">true</span>
    cr.mu.Unlock()

    client.markActive()

    fmt.Printf(<span class="hljs-string">"%s joined (total: %d)\n"</span>, client.username, <span class="hljs-built_in">len</span>(cr.clients))

    cr.sendHistory(client, <span class="hljs-number">10</span>)

    announcement := fmt.Sprintf(<span class="hljs-string">"*** %s joined the chat ***\n"</span>, client.username)
    cr.handleBroadcast(announcement)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">handleLeave</span><span class="hljs-params">(client *Client)</span></span> {
    cr.mu.Lock()
    <span class="hljs-keyword">if</span> !cr.clients[client] {
        cr.mu.Unlock()
        <span class="hljs-keyword">return</span>
    }
    <span class="hljs-built_in">delete</span>(cr.clients, client)
    cr.mu.Unlock()

    fmt.Printf(<span class="hljs-string">"%s left (total: %d)\n"</span>, client.username, <span class="hljs-built_in">len</span>(cr.clients))

    <span class="hljs-comment">// Close channel safely</span>
    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> &lt;-client.outgoing:
        <span class="hljs-comment">// Already closed</span>
    <span class="hljs-keyword">default</span>:
        <span class="hljs-built_in">close</span>(client.outgoing)
    }

    announcement := fmt.Sprintf(<span class="hljs-string">"*** %s left the chat ***\n"</span>, client.username)
    cr.handleBroadcast(announcement)
}
</code></pre>
<p>The <code>handleJoin</code> function adds the client to the active clients map, marks them as active for idle tracking, sends them the last 10 messages so they can see recent conversation, and broadcasts an announcement so everyone knows they joined.</p>
<p>The <code>handleLeave</code> function removes the client from the map, closes their outgoing channel safely (the select checks if it's already closed to avoid a panic), and broadcasts a departure announcement.</p>
<h3 id="heading-how-to-send-user-lists-and-history">How to Send User Lists and History</h3>
<p>Add these helper functions to <code>handlers.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">sendHistory</span><span class="hljs-params">(client *Client, count <span class="hljs-keyword">int</span>)</span></span> {
    cr.messageMu.Lock()
    <span class="hljs-keyword">defer</span> cr.messageMu.Unlock()

    start := <span class="hljs-built_in">len</span>(cr.messages) - count
    <span class="hljs-keyword">if</span> start &lt; <span class="hljs-number">0</span> {
        start = <span class="hljs-number">0</span>
    }

    historyMsg := <span class="hljs-string">"Recent messages:\n"</span>
    <span class="hljs-keyword">for</span> i := start; i &lt; <span class="hljs-built_in">len</span>(cr.messages); i++ {
        msg := cr.messages[i]
        historyMsg += fmt.Sprintf(<span class="hljs-string">" [%s]: %s\n"</span>, msg.From, msg.Content)
    }

    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> client.outgoing &lt;- historyMsg:
    <span class="hljs-keyword">default</span>:
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">sendUserList</span><span class="hljs-params">(client *Client)</span></span> {
    cr.mu.Lock()
    <span class="hljs-keyword">defer</span> cr.mu.Unlock()

    list := <span class="hljs-string">"Users online:\n"</span>
    <span class="hljs-keyword">for</span> c := <span class="hljs-keyword">range</span> cr.clients {
        status := <span class="hljs-string">""</span>
        <span class="hljs-keyword">if</span> c.isInactive(<span class="hljs-number">1</span> * time.Minute) {
            status = <span class="hljs-string">" (idle)"</span>
        }
        list += fmt.Sprintf(<span class="hljs-string">"  - %s%s\n"</span>, c.username, status)
    }

    list += fmt.Sprintf(<span class="hljs-string">"\nTotal messages: %d\n"</span>, cr.totalMessages)
    list += fmt.Sprintf(<span class="hljs-string">"Uptime: %s\n"</span>, time.Since(cr.startTime).Round(time.Second))

    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> client.outgoing &lt;- list:
    <span class="hljs-keyword">default</span>:
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">handleDirectMessage</span><span class="hljs-params">(dm DirectMessage)</span></span> {
    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> dm.toClient.outgoing &lt;- dm.message:
        dm.toClient.mu.Lock()
        dm.toClient.messagesSent++
        dm.toClient.mu.Unlock()
    <span class="hljs-keyword">default</span>:
        fmt.Printf(<span class="hljs-string">"Couldn't deliver DM to %s\n"</span>, dm.toClient.username)
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">findClientByUsername</span><span class="hljs-params">(username <span class="hljs-keyword">string</span>)</span> *<span class="hljs-title">Client</span></span> {
    cr.mu.Lock()
    <span class="hljs-keyword">defer</span> cr.mu.Unlock()

    <span class="hljs-keyword">for</span> client := <span class="hljs-keyword">range</span> cr.clients {
        <span class="hljs-keyword">if</span> client.username == username {
            <span class="hljs-keyword">return</span> client
        }
    }
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(c *Client)</span> <span class="hljs-title">markActive</span><span class="hljs-params">()</span></span> {
    c.mu.Lock()
    <span class="hljs-keyword">defer</span> c.mu.Unlock()
    c.lastActive = time.Now()
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(c *Client)</span> <span class="hljs-title">isInactive</span><span class="hljs-params">(timeout time.Duration)</span> <span class="hljs-title">bool</span></span> {
    c.mu.Lock()
    <span class="hljs-keyword">defer</span> c.mu.Unlock()
    <span class="hljs-keyword">return</span> time.Since(c.lastActive) &gt; timeout
}
</code></pre>
<p>You now have a working chat system where clients can connect and exchange messages.</p>
<p>But there's a critical problem: if the server crashes or restarts, all messages are lost. The next step is adding persistence so messages survive failures.</p>
<h2 id="heading-how-to-add-persistence-with-wal-and-snapshots">How to Add Persistence with WAL and Snapshots</h2>
<p>Persistence ensures your chat history survives server crashes and restarts. Without it, users would lose all their conversations every time the server goes down.</p>
<p>You'll implement this using two complementary mechanisms: a write-ahead log for immediate durability and snapshots for fast recovery.</p>
<p>Create <code>internal/chatroom/persistence.go</code> to handle data durability.</p>
<p>The WAL ensures messages survive crashes:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"bufio"</span>
    <span class="hljs-string">"encoding/json"</span>
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"io"</span>
    <span class="hljs-string">"os"</span>
    <span class="hljs-string">"path/filepath"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">initializePersistence</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    <span class="hljs-keyword">if</span> err := os.MkdirAll(cr.dataDir, <span class="hljs-number">0755</span>); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"create data dir: %w"</span>, err)
    }

    walPath := filepath.Join(cr.dataDir, <span class="hljs-string">"messages.wal"</span>)

    <span class="hljs-keyword">if</span> err := cr.recoverFromWAL(walPath); err != <span class="hljs-literal">nil</span> {
        fmt.Printf(<span class="hljs-string">"Recovery failed: %v\n"</span>, err)
    }

    file, err := os.OpenFile(walPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, <span class="hljs-number">0644</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"open wal: %w"</span>, err)
    }

    cr.walFile = file
    fmt.Printf(<span class="hljs-string">"WAL initialized: %s\n"</span>, walPath)
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">recoverFromWAL</span><span class="hljs-params">(walPath <span class="hljs-keyword">string</span>)</span> <span class="hljs-title">error</span></span> {
    file, err := os.Open(walPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">if</span> os.IsNotExist(err) {
            fmt.Println(<span class="hljs-string">"No WAL found (fresh start)"</span>)
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
        }
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    scanner := bufio.NewScanner(file)
    recovered := <span class="hljs-number">0</span>

    <span class="hljs-keyword">for</span> scanner.Scan() {
        line := scanner.Text()
        <span class="hljs-keyword">if</span> line == <span class="hljs-string">""</span> {
            <span class="hljs-keyword">continue</span>
        }

        <span class="hljs-keyword">var</span> msg Message
        <span class="hljs-keyword">if</span> err := json.Unmarshal([]<span class="hljs-keyword">byte</span>(line), &amp;msg); err != <span class="hljs-literal">nil</span> {
            fmt.Printf(<span class="hljs-string">"Skipping corrupt line: %s\n"</span>, line)
            <span class="hljs-keyword">continue</span>
        }

        cr.messages = <span class="hljs-built_in">append</span>(cr.messages, msg)

        <span class="hljs-keyword">if</span> msg.ID &gt;= cr.nextMessageID {
            cr.nextMessageID = msg.ID + <span class="hljs-number">1</span>
        }
        recovered++
    }

    fmt.Printf(<span class="hljs-string">"Recovered %d messages\n"</span>, recovered)
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">persistMessage</span><span class="hljs-params">(msg Message)</span> <span class="hljs-title">error</span></span> {
    cr.walMu.Lock()
    <span class="hljs-keyword">defer</span> cr.walMu.Unlock()

    data, err := json.Marshal(msg)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    _, err = cr.walFile.Write(<span class="hljs-built_in">append</span>(data, <span class="hljs-string">'\n'</span>))
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    <span class="hljs-keyword">return</span> cr.walFile.Sync()
}
</code></pre>
<p>Each line is a JSON-encoded message:</p>
<pre><code class="lang-json">{<span class="hljs-attr">"id"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"from"</span>:<span class="hljs-string">"Alice"</span>,<span class="hljs-attr">"content"</span>:<span class="hljs-string">"Hello world"</span>,<span class="hljs-attr">"timestamp"</span>:<span class="hljs-string">"2024-02-06T10:00:00Z"</span>,<span class="hljs-attr">"channel"</span>:<span class="hljs-string">"global"</span>}
{<span class="hljs-attr">"id"</span>:<span class="hljs-number">2</span>,<span class="hljs-attr">"from"</span>:<span class="hljs-string">"Bob"</span>,<span class="hljs-attr">"content"</span>:<span class="hljs-string">"Hi Alice!"</span>,<span class="hljs-attr">"timestamp"</span>:<span class="hljs-string">"2024-02-06T10:00:05Z"</span>,<span class="hljs-attr">"channel"</span>:<span class="hljs-string">"global"</span>}
</code></pre>
<p>The <code>Sync()</code> call is critical for durability. Without it, the OS might buffer writes in memory, losing them on a crash. The trade-off is that <code>Sync()</code> is expensive (about 1-10ms per call). Production systems might batch multiple messages to improve throughput.</p>
<h3 id="heading-how-to-create-and-load-snapshots">How to Create and Load Snapshots</h3>
<p>Add snapshot functionality to <code>persistence.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">createSnapshot</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    snapshotPath := filepath.Join(cr.dataDir, <span class="hljs-string">"snapshot.json"</span>)
    tempPath := snapshotPath + <span class="hljs-string">".tmp"</span>

    file, err := os.Create(tempPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    cr.messageMu.Lock()
    data, err := json.MarshalIndent(cr.messages, <span class="hljs-string">""</span>, <span class="hljs-string">"  "</span>)
    cr.messageMu.Unlock()

    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    <span class="hljs-keyword">if</span> _, err := file.Write(data); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    <span class="hljs-keyword">if</span> err := file.Sync(); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    file.Close()

    <span class="hljs-keyword">if</span> err := os.Rename(tempPath, snapshotPath); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    fmt.Printf(<span class="hljs-string">"Snapshot created (%d messages)\n"</span>, <span class="hljs-built_in">len</span>(cr.messages))
    <span class="hljs-keyword">return</span> cr.truncateWAL()
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">truncateWAL</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    cr.walMu.Lock()
    <span class="hljs-keyword">defer</span> cr.walMu.Unlock()

    <span class="hljs-keyword">if</span> cr.walFile != <span class="hljs-literal">nil</span> {
        cr.walFile.Close()
    }

    walPath := filepath.Join(cr.dataDir, <span class="hljs-string">"messages.wal"</span>)
    file, err := os.OpenFile(walPath, os.O_TRUNC|os.O_CREATE|os.O_WRONLY, <span class="hljs-number">0644</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    cr.walFile = file
    fmt.Println(<span class="hljs-string">"WAL truncated"</span>)
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">loadSnapshot</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    snapshotPath := filepath.Join(cr.dataDir, <span class="hljs-string">"snapshot.json"</span>)
    file, err := os.Open(snapshotPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">if</span> os.IsNotExist(err) {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
        }
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    data, err := io.ReadAll(file)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    cr.messageMu.Lock()
    err = json.Unmarshal(data, &amp;cr.messages)
    cr.messageMu.Unlock()

    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    <span class="hljs-keyword">for</span> _, msg := <span class="hljs-keyword">range</span> cr.messages {
        <span class="hljs-keyword">if</span> msg.ID &gt;= cr.nextMessageID {
            cr.nextMessageID = msg.ID + <span class="hljs-number">1</span>
        }
    }

    fmt.Printf(<span class="hljs-string">"Loaded %d messages from snapshot\n"</span>, <span class="hljs-built_in">len</span>(cr.messages))
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}
</code></pre>
<p>Writing to <code>.tmp</code> then renaming ensures you never have a half-written snapshot. Even if power fails mid-write, the old snapshot remains valid.</p>
<h4 id="heading-recovery-flow">Recovery Flow</h4>
<p>When the server starts, it first loads the snapshot if it exists, which might contain 100K messages and takes about 100ms. Then it replays WAL entries written since the snapshot, which might be only recent messages. Total recovery time is seconds instead of minutes.</p>
<p>With persistence in place, your messages are safe. But network connections are unreliable. Users get disconnected when their WiFi drops, their phone switches towers, or their laptop goes to sleep. The next step is implementing session management so users can reconnect without losing their identity or chat history.</p>
<h2 id="heading-how-to-implement-session-management">How to Implement Session Management</h2>
<p>Session management lets users reconnect to your server after network interruptions without needing to create a new account or re-enter credentials. You'll implement this using cryptographically secure tokens that persist across connections.</p>
<p>Create <code>internal/chatroom/session.go</code> for reconnection handling.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"time"</span>

    <span class="hljs-string">"github.com/yourusername/chatroom/pkg/token"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">createSession</span><span class="hljs-params">(username <span class="hljs-keyword">string</span>)</span> *<span class="hljs-title">SessionInfo</span></span> {
    cr.sessionsMu.Lock()
    <span class="hljs-keyword">defer</span> cr.sessionsMu.Unlock()

    tok := token.GenerateToken()

    session := &amp;SessionInfo{
        Username:       username,
        ReconnectToken: tok,
        LastSeen:       time.Now(),
        CreatedAt:      time.Now(),
    }

    cr.sessions[username] = session

    fmt.Printf(<span class="hljs-string">"Created session for %s (token: %s...)\n"</span>, username, tok[:<span class="hljs-number">8</span>])

    <span class="hljs-keyword">return</span> session
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">validateReconnectToken</span><span class="hljs-params">(username, token <span class="hljs-keyword">string</span>)</span> <span class="hljs-title">bool</span></span> {
    cr.sessionsMu.Lock()
    <span class="hljs-keyword">defer</span> cr.sessionsMu.Unlock()

    session, exists := cr.sessions[username]
    <span class="hljs-keyword">if</span> !exists {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>
    }

    <span class="hljs-keyword">if</span> session.ReconnectToken != token {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>
    }

    <span class="hljs-keyword">if</span> time.Since(session.LastSeen) &gt; <span class="hljs-number">1</span>*time.Hour {
        <span class="hljs-built_in">delete</span>(cr.sessions, username)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>
    }

    session.LastSeen = time.Now()

    <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">updateSessionActivity</span><span class="hljs-params">(username <span class="hljs-keyword">string</span>)</span></span> {
    cr.sessionsMu.Lock()
    <span class="hljs-keyword">defer</span> cr.sessionsMu.Unlock()

    <span class="hljs-keyword">if</span> session, exists := cr.sessions[username]; exists {
        session.LastSeen = time.Now()
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">isUsernameConnected</span><span class="hljs-params">(username <span class="hljs-keyword">string</span>)</span> <span class="hljs-title">bool</span></span> {
    cr.mu.Lock()
    <span class="hljs-keyword">defer</span> cr.mu.Unlock()

    <span class="hljs-keyword">for</span> client := <span class="hljs-keyword">range</span> cr.clients {
        <span class="hljs-keyword">if</span> client.username == username {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>
        }
    }

    <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">cleanupInactiveClients</span><span class="hljs-params">()</span></span> {
    ticker := time.NewTicker(<span class="hljs-number">30</span> * time.Second)
    <span class="hljs-keyword">defer</span> ticker.Stop()

    <span class="hljs-keyword">for</span> <span class="hljs-keyword">range</span> ticker.C {
        cr.mu.Lock()
        <span class="hljs-keyword">var</span> toRemove []*Client

        <span class="hljs-keyword">for</span> client := <span class="hljs-keyword">range</span> cr.clients {
            <span class="hljs-keyword">if</span> client.isInactive(<span class="hljs-number">5</span> * time.Minute) {
                fmt.Printf(<span class="hljs-string">"Removing inactive: %s\n"</span>, client.username)
                toRemove = <span class="hljs-built_in">append</span>(toRemove, client)
            }
        }
        cr.mu.Unlock()

        <span class="hljs-keyword">for</span> _, client := <span class="hljs-keyword">range</span> toRemove {
            cr.leave &lt;- client
        }
    }
}
</code></pre>
<h3 id="heading-how-to-generate-secure-tokens">How to Generate Secure Tokens</h3>
<p>Create <code>pkg/token/token.go</code> for token generation:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> token

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"crypto/rand"</span>
    <span class="hljs-string">"encoding/hex"</span>
)

<span class="hljs-comment">// GenerateToken returns a secure random 16-byte hex token</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">GenerateToken</span><span class="hljs-params">()</span> <span class="hljs-title">string</span></span> {
    b := <span class="hljs-built_in">make</span>([]<span class="hljs-keyword">byte</span>, <span class="hljs-number">16</span>)
    _, _ = rand.Read(b)
    <span class="hljs-keyword">return</span> hex.EncodeToString(b)
}
</code></pre>
<p>Tokens here are transmitted in plaintext over TCP. For production use, you should use TLS encryption to protect tokens in transit, hash tokens before storage so a database breach doesn't expose them, and implement rate limiting on reconnection attempts to prevent brute force attacks.</p>
<p>Your chatroom now supports basic messaging and reconnection. But users need ways to interact with the system beyond just sending messages. The command system provides features like listing users, viewing history, and sending private messages.</p>
<h2 id="heading-how-to-build-the-command-system">How to Build the Command System</h2>
<p>Commands are messages that start with a forward slash and perform special actions instead of being broadcast to everyone. This is a pattern used by many chat applications like Slack and Discord. You'll implement several useful commands that enhance the user experience.</p>
<p>Add command handling to <code>io.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">handleCommand</span><span class="hljs-params">(client *Client, chatRoom *ChatRoom, command <span class="hljs-keyword">string</span>)</span></span> {
    parts := strings.Fields(command)
    <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(parts) == <span class="hljs-number">0</span> {
        <span class="hljs-keyword">return</span>
    }

    <span class="hljs-keyword">switch</span> parts[<span class="hljs-number">0</span>] {
    <span class="hljs-keyword">case</span> <span class="hljs-string">"/users"</span>:
        chatRoom.listUsers &lt;- client

    <span class="hljs-keyword">case</span> <span class="hljs-string">"/stats"</span>:
        client.mu.Lock()
        stats := fmt.Sprintf(<span class="hljs-string">"Your Stats:\n"</span>)
        stats += fmt.Sprintf(<span class="hljs-string">"  Messages sent: %d\n"</span>, client.messagesSent)
        stats += fmt.Sprintf(<span class="hljs-string">"  Messages received: %d\n"</span>, client.messagesRecv)
        stats += fmt.Sprintf(<span class="hljs-string">"  Last active: %s ago\n"</span>, 
            time.Since(client.lastActive).Round(time.Second))
        client.mu.Unlock()

        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client.outgoing &lt;- stats:
        <span class="hljs-keyword">default</span>:
        }

    <span class="hljs-keyword">case</span> <span class="hljs-string">"/msg"</span>:
        <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(parts) &lt; <span class="hljs-number">3</span> {
            <span class="hljs-keyword">select</span> {
            <span class="hljs-keyword">case</span> client.outgoing &lt;- <span class="hljs-string">"Usage: /msg &lt;username&gt; &lt;message&gt;\n"</span>:
            <span class="hljs-keyword">default</span>:
            }
            <span class="hljs-keyword">return</span>
        }

        targetUsername := parts[<span class="hljs-number">1</span>]
        messageText := strings.Join(parts[<span class="hljs-number">2</span>:], <span class="hljs-string">" "</span>)

        targetClient := chatRoom.findClientByUsername(targetUsername)
        <span class="hljs-keyword">if</span> targetClient == <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">select</span> {
            <span class="hljs-keyword">case</span> client.outgoing &lt;- fmt.Sprintf(<span class="hljs-string">"User '%s' not found\n"</span>, targetUsername):
            <span class="hljs-keyword">default</span>:
            }
            <span class="hljs-keyword">return</span>
        }

        privateMsg := fmt.Sprintf(<span class="hljs-string">"[From %s]: %s\n"</span>, client.username, messageText)
        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> targetClient.outgoing &lt;- privateMsg:
        <span class="hljs-keyword">default</span>:
            <span class="hljs-keyword">select</span> {
            <span class="hljs-keyword">case</span> client.outgoing &lt;- fmt.Sprintf(<span class="hljs-string">"%s's inbox is full\n"</span>, targetUsername):
            <span class="hljs-keyword">default</span>:
            }
            <span class="hljs-keyword">return</span>
        }

        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client.outgoing &lt;- fmt.Sprintf(<span class="hljs-string">"Message sent to %s\n"</span>, targetUsername):
        <span class="hljs-keyword">default</span>:
        }

    <span class="hljs-keyword">case</span> <span class="hljs-string">"/history"</span>:
        count := <span class="hljs-number">20</span>
        <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(parts) &gt; <span class="hljs-number">1</span> {
            fmt.Sscanf(parts[<span class="hljs-number">1</span>], <span class="hljs-string">"%d"</span>, &amp;count)
        }
        <span class="hljs-keyword">if</span> count &gt; <span class="hljs-number">100</span> {
            count = <span class="hljs-number">100</span>
        }
        cr.sendHistory(client, count)

    <span class="hljs-keyword">case</span> <span class="hljs-string">"/token"</span>:
        chatRoom.sessionsMu.Lock()
        session := chatRoom.sessions[client.username]
        chatRoom.sessionsMu.Unlock()

        <span class="hljs-keyword">if</span> session != <span class="hljs-literal">nil</span> {
            msg := fmt.Sprintf(<span class="hljs-string">"Your reconnect token:\n"</span>)
            msg += fmt.Sprintf(<span class="hljs-string">"   reconnect:%s:%s\n"</span>, client.username, session.ReconnectToken)
            <span class="hljs-keyword">select</span> {
            <span class="hljs-keyword">case</span> client.outgoing &lt;- msg:
            <span class="hljs-keyword">default</span>:
            }
        }

    <span class="hljs-keyword">case</span> <span class="hljs-string">"/quit"</span>:
        announcement := fmt.Sprintf(<span class="hljs-string">"%s left the chat\n"</span>, client.username)
        chatRoom.broadcast &lt;- announcement

        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client.outgoing &lt;- <span class="hljs-string">"Goodbye!\n"</span>:
        <span class="hljs-keyword">default</span>:
        }

        time.Sleep(<span class="hljs-number">100</span> * time.Millisecond)
        client.conn.Close()

    <span class="hljs-keyword">default</span>:
        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client.outgoing &lt;- fmt.Sprintf(<span class="hljs-string">"Unknown: %s\n"</span>, parts[<span class="hljs-number">0</span>]):
        <span class="hljs-keyword">default</span>:
        }
    }
}
</code></pre>
<p>Your server is now complete with all the core features: connection handling, message broadcasting, persistence, session management, and commands. But to actually use your chatroom, you need a client application. The client is much simpler than the server because it just needs to connect and relay messages.</p>
<h2 id="heading-how-to-create-the-client">How to Create the Client</h2>
<p>The client application provides the user interface for your chatroom. It connects to the server, displays incoming messages, and sends outgoing messages typed by the user. While the server is complex with many concurrent components, the client is straightforward</p>
<p>Create <code>internal/chatroom/client.go</code> for the client implementation.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"bufio"</span>
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"net"</span>
    <span class="hljs-string">"os"</span>
    <span class="hljs-string">"strings"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">StartClient</span><span class="hljs-params">()</span></span> {
    conn, err := net.Dial(<span class="hljs-string">"tcp"</span>, <span class="hljs-string">":9000"</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        fmt.Println(<span class="hljs-string">"Error connecting:"</span>, err)
        <span class="hljs-keyword">return</span>
    }
    <span class="hljs-keyword">defer</span> conn.Close()

    fmt.Println(<span class="hljs-string">"Connected to chat server"</span>)

    <span class="hljs-comment">// Background goroutine: read from server</span>
    <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
        reader := bufio.NewReader(conn)
        <span class="hljs-keyword">for</span> {
            message, err := reader.ReadString(<span class="hljs-string">'\n'</span>)
            <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
                fmt.Println(<span class="hljs-string">"Disconnected from server."</span>)
                os.Exit(<span class="hljs-number">0</span>)
            }
            <span class="hljs-comment">// Clear current prompt line and print message</span>
            fmt.Print(<span class="hljs-string">"\r"</span> + message)
            fmt.Print(<span class="hljs-string">"&gt;&gt; "</span>)
        }
    }()

    <span class="hljs-comment">// Main goroutine: read from stdin</span>
    inputReader := bufio.NewReader(os.Stdin)
    fmt.Println(<span class="hljs-string">"Welcome to the chat server!"</span>)

    <span class="hljs-keyword">for</span> {
        fmt.Print(<span class="hljs-string">"&gt;&gt; "</span>)
        message, _ := inputReader.ReadString(<span class="hljs-string">'\n'</span>)
        message = strings.TrimSpace(message)

        <span class="hljs-keyword">if</span> message == <span class="hljs-string">""</span> {
            <span class="hljs-keyword">continue</span>
        }

        conn.Write([]<span class="hljs-keyword">byte</span>(message + <span class="hljs-string">"\n"</span>))
    }
}
</code></pre>
<h4 id="heading-how-the-client-works">How the Client Works:</h4>
<p>The client uses two goroutines to handle communication simultaneously. The main goroutine reads from stdin (your keyboard) and sends messages to the server. When you type a message and press Enter, it gets sent over the TCP connection immediately.</p>
<p>The background goroutine continuously reads from the server. Whenever a message arrives, it prints it to your screen. The <code>\r</code> (carriage return) clears the current <code>&gt;&gt;</code> prompt before printing the message, so new messages don't appear on the same line as your input. After printing the message, it reprints the prompt so you can continue typing.</p>
<p>This dual-goroutine design means you can receive messages while typing. If someone sends a message while you're in the middle of typing yours, their message appears immediately and your prompt reappears below it.</p>
<p>The <code>defer conn.Close()</code> ensures the connection is properly closed when the function exits. If the server disconnects, the read goroutine gets an error and calls <code>os.Exit(0)</code> to terminate the entire client program gracefully.</p>
<h3 id="heading-how-to-create-entry-points">How to Create Entry Points</h3>
<p>Create <code>cmd/server/main.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"os"</span>

    <span class="hljs-string">"github.com/yourusername/chatroom/internal/chatroom"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
    fmt.Println(<span class="hljs-string">"Starting server from cmd/server..."</span>)
    chatroom.StartServer()
    os.Exit(<span class="hljs-number">0</span>)
}
</code></pre>
<p>Create <code>cmd/client/main.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"github.com/yourusername/chatroom/internal/chatroom"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
    fmt.Println(<span class="hljs-string">"Starting client from cmd/client..."</span>)
    chatroom.StartClient()
}
</code></pre>
<p>Add a wrapper function in <code>internal/chatroom/server.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">StartServer</span><span class="hljs-params">()</span></span> {
    runServer()
}
</code></pre>
<p>With all your entry points created, your chatroom is complete and ready to test. The next step is learning how to test your implementation to ensure everything works correctly.</p>
<h2 id="heading-how-to-test-your-chatroom">How to Test Your Chatroom</h2>
<p>Testing a concurrent system like a chatroom requires a different approach than testing typical sequential code. You need to verify that goroutines coordinate correctly, messages arrive in the right order, and the system handles edge cases like disconnections.</p>
<h3 id="heading-how-to-write-unit-tests">How to Write Unit Tests</h3>
<p>Unit tests verify individual components in isolation. For your chatroom, the most important test is verifying that messages broadcast correctly to all connected clients.</p>
<p>Create <code>internal/chatroom/chatroom_test.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"testing"</span>
    <span class="hljs-string">"strings"</span>
    <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">TestBroadcast</span><span class="hljs-params">(t *testing.T)</span></span> {
    cr, _ := NewChatRoom(<span class="hljs-string">"./testdata"</span>)
    <span class="hljs-keyword">defer</span> cr.shutdown()

    <span class="hljs-keyword">go</span> cr.Run()

    <span class="hljs-comment">// Create mock clients</span>
    client1 := &amp;Client{
        username: <span class="hljs-string">"Alice"</span>,
        outgoing: <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>, <span class="hljs-number">10</span>),
    }
    client2 := &amp;Client{
        username: <span class="hljs-string">"Bob"</span>,
        outgoing: <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>, <span class="hljs-number">10</span>),
    }

    <span class="hljs-comment">// Join clients</span>
    cr.join &lt;- client1
    cr.join &lt;- client2
    time.Sleep(<span class="hljs-number">100</span> * time.Millisecond)

    <span class="hljs-comment">// Broadcast message</span>
    cr.broadcast &lt;- <span class="hljs-string">"[Alice]: Hello!"</span>

    <span class="hljs-comment">// Verify both receive it</span>
    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> msg := &lt;-client1.outgoing:
        <span class="hljs-keyword">if</span> !strings.Contains(msg, <span class="hljs-string">"Hello!"</span>) {
            t.Fatal(<span class="hljs-string">"Client1 didn't receive correct message"</span>)
        }
    <span class="hljs-keyword">case</span> &lt;-time.After(<span class="hljs-number">1</span> * time.Second):
        t.Fatal(<span class="hljs-string">"Client1 didn't receive message"</span>)
    }

    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> msg := &lt;-client2.outgoing:
        <span class="hljs-keyword">if</span> !strings.Contains(msg, <span class="hljs-string">"Hello!"</span>) {
            t.Fatal(<span class="hljs-string">"Client2 didn't receive correct message"</span>)
        }
    <span class="hljs-keyword">case</span> &lt;-time.After(<span class="hljs-number">1</span> * time.Second):
        t.Fatal(<span class="hljs-string">"Client2 didn't receive message"</span>)
    }
}
</code></pre>
<h4 id="heading-understanding-the-test">Understanding the Test:</h4>
<p>This test creates a chatroom instance and starts its event loop with <code>go cr.Run()</code>. Then it creates two mock clients. Notice these aren't real TCP connections – they're just Client structs with outgoing channels. This lets you test the broadcast logic without needing actual network connections.</p>
<p>The test sends both clients to the join channel, waits 100 milliseconds for them to be processed, then broadcasts a message. The <code>select</code> statements with timeout are crucial. They try to receive from each client's outgoing channel, but if nothing arrives within 1 second, the test fails. This prevents the test from hanging forever if something goes wrong.</p>
<p>The <code>time.Sleep(100 * time.Millisecond)</code> gives the event loop time to process the join events before broadcasting. In a real system, you'd use channels to synchronize, but for tests, a small sleep is acceptable.</p>
<p>Run tests with:</p>
<pre><code class="lang-go"><span class="hljs-keyword">go</span> test ./internal/chatroom -v
</code></pre>
<p>The <code>-v</code> flag shows verbose output, printing each test as it runs. You'll see whether the broadcast test passes and how long it took. Below is the output showing that the test passed:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770480869735/1102c089-1d10-43fc-bd7f-5571283213a0.png" alt="Chatroom unit test" class="image--center mx-auto" width="1294" height="456" loading="lazy"></p>
<h3 id="heading-how-to-do-integration-testing">How to Do Integration Testing</h3>
<p>Integration tests verify the entire system working together – the real server, real clients, and real network connections. Unlike unit tests that mock components, integration tests exercise the full stack.</p>
<p>Test the full client-server flow:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Terminal 1: Start server</span>
go run cmd/server/main.go

<span class="hljs-comment"># Terminal 2: Client 1</span>
go run cmd/client/main.go
<span class="hljs-comment"># Enter username: Alice</span>

<span class="hljs-comment"># Terminal 3: Client 2  </span>
go run cmd/client/main.go
<span class="hljs-comment"># Enter username: Bob</span>

<span class="hljs-comment"># Terminal 4: Client 3  </span>
go run cmd/client/main.go
<span class="hljs-comment"># Enter username: John</span>

<span class="hljs-comment"># Test messaging between clients</span>
</code></pre>
<h4 id="heading-what-to-test">What to Test:</h4>
<p>Once you have the server running and multiple clients connected, you can verify all the features you built. Here's what a complete test session looks like:</p>
<ol>
<li><p><strong>Basic Messaging:</strong> Send a message from Alice and verify Bob and John both receive it. You should see the message appear in all client windows with the sender's username in brackets. Try sending from each client to verify the broadcast works in all directions.</p>
</li>
<li><p><strong>Join and Leave Announcements:</strong> When a new client connects, all existing clients should see a "joined the chat" announcement. When someone disconnects (either with <code>/quit</code> or by closing their terminal), everyone should see a "left the chat" message. This confirms your join and leave handlers work correctly.</p>
</li>
<li><p><strong>Private Messaging:</strong> Use <code>/msg Bob this is a private message</code> from Alice's client. The message should appear only in Bob's window, not in John's or Alice's. Try sending private messages between different pairs of users to verify the routing works correctly. The sender should receive a confirmation that the message was sent.</p>
</li>
<li><p><strong>User List:</strong> Run <code>/users</code> from any client. You should see a list of all connected users. If someone has been idle for over a minute, they should show an "(idle)" status. The command should also display total message count and server uptime.</p>
</li>
<li><p><strong>Chat History:</strong> New clients should automatically receive the last 10 messages when they join. You can also use <code>/history 20</code> to request the last 20 messages. This verifies your message persistence is working.</p>
</li>
<li><p><strong>Session Reconnection:</strong> From one client, use <code>/token</code> to get your reconnection token. It will look something like <code>reconnect:Alice:338f04ca...</code>. Copy this token, disconnect the client with Ctrl+C, start a new client, and paste the reconnection string when prompted. You should rejoin the chat with your previous identity, and other users won't see duplicate join announcements.</p>
</li>
<li><p><strong>Statistics:</strong> Use <code>/stats</code> to see how many messages you've sent and received, and when you were last active. This verifies the client-side statistics tracking works.</p>
</li>
<li><p><strong>Error Handling:</strong> Try connecting with a username that's already in use – you should be rejected. Try sending a private message to a non-existent user – you should get an error. Try using an invalid reconnection token – you should be denied. These tests verify your validation logic works.</p>
</li>
</ol>
<p>Look at the server terminal to see the server's perspective. You'll see connection logs, broadcast confirmations, and any errors. When clients disconnect, you should see their sessions being updated. When the server creates snapshots, you'll see those logged, too.</p>
<p>Integration testing catches problems that unit tests miss, like network timeouts, message ordering issues across multiple clients, or problems with how the WAL file is created and locked. The screenshot below shows a successful integration test with three clients (Alice, Bob, and John) all communicating successfully, with private messages, public broadcasts, and proper join/leave handling.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770348441963/d66bb2da-088b-4b9c-95a2-e5f401c5f49e.png" alt="chatroom broadcast test" class="image--center mx-auto" width="2833" height="644" loading="lazy"></p>
<h2 id="heading-how-to-deploy-your-server">How to Deploy Your Server</h2>
<p>Deploying your chatroom means running it on a server that stays up 24/7, automatically restarts if it crashes, and starts when the server boots. There are several approaches depending on your infrastructure.</p>
<h3 id="heading-how-to-use-systemd">How to Use Systemd</h3>
<p>Systemd is the standard init system on most Linux distributions. It manages services, handles restarts, and ensures your chatroom starts on boot.</p>
<p>Create <code>/etc/systemd/system/chatroom.service</code>:</p>
<pre><code class="lang-ini"><span class="hljs-section">[Unit]</span>
<span class="hljs-attr">Description</span>=Chatroom Server
<span class="hljs-attr">After</span>=network.target

<span class="hljs-section">[Service]</span>
<span class="hljs-attr">Type</span>=simple
<span class="hljs-attr">User</span>=chatroom
<span class="hljs-attr">WorkingDirectory</span>=/opt/chatroom
<span class="hljs-attr">ExecStart</span>=/opt/chatroom/server
<span class="hljs-attr">Restart</span>=<span class="hljs-literal">on</span>-failure
<span class="hljs-attr">RestartSec</span>=<span class="hljs-number">5</span>s

<span class="hljs-section">[Install]</span>
<span class="hljs-attr">WantedBy</span>=multi-user.target
</code></pre>
<h4 id="heading-understanding-the-configuration">Understanding the Configuration:</h4>
<p>The <code>[Unit]</code> section describes the service and its dependencies. <code>After=network.target</code> ensures the network is up before starting your chatroom.</p>
<p>The <code>[Service]</code> section defines how to run your server. <code>Type=simple</code> means systemd should just run the command and consider it started. <code>User=chatroom</code> runs the server as a dedicated user (not root) for security. <code>WorkingDirectory</code> sets where the server runs, which is important because your WAL and snapshot files are created relative to this directory.</p>
<p><code>Restart=on-failure</code> tells systemd to automatically restart your server if it crashes. <code>RestartSec=5s</code> waits 5 seconds before restarting, preventing rapid restart loops if there's a persistent problem.</p>
<p>The <code>[Install]</code> section makes your service start at boot when you enable it.</p>
<h4 id="heading-deploying-your-server">Deploying Your Server:</h4>
<p>First, build your server binary:</p>
<pre><code class="lang-bash">go build -o server cmd/server/main.go
</code></pre>
<p>Then copy it to the deployment location:</p>
<pre><code class="lang-bash">sudo mkdir -p /opt/chatroom
sudo cp server /opt/chatroom/
sudo mkdir -p /opt/chatroom/chatdata
</code></pre>
<p>Create a dedicated user for running the service:</p>
<pre><code class="lang-bash">sudo useradd -r -s /bin/<span class="hljs-literal">false</span> chatroom
sudo chown -R chatroom:chatroom /opt/chatroom
</code></pre>
<p>Enable and start the service:</p>
<pre><code class="lang-bash">sudo systemctl <span class="hljs-built_in">enable</span> chatroom
sudo systemctl start chatroom
</code></pre>
<p>Check that it's running:</p>
<pre><code class="lang-bash">sudo systemctl status chatroom
</code></pre>
<p>You can view logs with:</p>
<pre><code class="lang-bash">sudo journalctl -u chatroom -f
</code></pre>
<p>The <code>-f</code> flag follows the logs in real-time, similar to <code>tail -f</code>.</p>
<h3 id="heading-how-to-use-docker">How to Use Docker</h3>
<p>Docker packages your application with all its dependencies, making it easy to deploy anywhere that runs Docker.</p>
<p>Create a <code>Dockerfile</code>:</p>
<pre><code class="lang-dockerfile"><span class="hljs-keyword">FROM</span> golang:<span class="hljs-number">1.23</span>-alpine AS builder
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>
<span class="hljs-keyword">COPY</span><span class="bash"> go.mod go.sum ./</span>
<span class="hljs-keyword">RUN</span><span class="bash"> go mod download</span>
<span class="hljs-keyword">COPY</span><span class="bash"> . .</span>
<span class="hljs-keyword">RUN</span><span class="bash"> go build -o server cmd/server/main.go</span>

<span class="hljs-keyword">FROM</span> alpine:latest
<span class="hljs-keyword">RUN</span><span class="bash"> apk --no-cache add ca-certificates</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /root/</span>
<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder /app/server .</span>
<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder /app/chatdata ./chatdata</span>
<span class="hljs-keyword">EXPOSE</span> <span class="hljs-number">9000</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"./server"</span>]</span>
</code></pre>
<h4 id="heading-understanding-the-dockerfile">Understanding the Dockerfile:</h4>
<p>This uses a multi-stage build. The first stage (<code>builder</code>) uses the full Go image to compile your server. The second stage uses a minimal Alpine Linux image and copies only the compiled binary. This keeps the final image small (about 20MB instead of 800MB).</p>
<p><code>EXPOSE 9000</code> documents which port the container uses. <code>CMD ["./server"]</code> specifies what command runs when the container starts.</p>
<p>Build and Run:</p>
<pre><code class="lang-bash">docker build -t chatroom .
docker run -p 9000:9000 -v $(<span class="hljs-built_in">pwd</span>)/chatdata:/root/chatdata chatroom
</code></pre>
<p>The <code>-p 9000:9000</code> maps port 9000 in the container to port 9000 on your host, making the chatroom accessible. The <code>-v $(pwd)/chatdata:/root/chatdata</code> mounts your local chatdata directory into the container, so messages persist even if you stop and remove the container.</p>
<h4 id="heading-running-in-production">Running in Production:</h4>
<p>For production, you'd typically use Docker Compose or Kubernetes. Here's a simple <code>docker-compose.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3.8'</span>
<span class="hljs-attr">services:</span>
  <span class="hljs-attr">chatroom:</span>
    <span class="hljs-attr">build:</span> <span class="hljs-string">.</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9000:9000"</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./chatdata:/root/chatdata</span>
    <span class="hljs-attr">restart:</span> <span class="hljs-string">unless-stopped</span>
</code></pre>
<p>Run with:</p>
<pre><code class="lang-bash">docker-compose up -d
</code></pre>
<p>The <code>restart: unless-stopped</code> policy ensures your container restarts automatically if it crashes or if the Docker daemon restarts</p>
<h2 id="heading-enhancements-you-could-add">Enhancements You Could Add</h2>
<h3 id="heading-1-multi-room-support">1. Multi-Room Support</h3>
<p>You could add the concept of channels/rooms like this:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> ChatRoom <span class="hljs-keyword">struct</span> {
    rooms <span class="hljs-keyword">map</span>[<span class="hljs-keyword">string</span>]*Room
}

<span class="hljs-keyword">type</span> Room <span class="hljs-keyword">struct</span> {
    name    <span class="hljs-keyword">string</span>
    clients <span class="hljs-keyword">map</span>[*Client]<span class="hljs-keyword">bool</span>
    history []Message
}
</code></pre>
<h3 id="heading-2-user-authentication">2. User Authentication</h3>
<p>You could replace simple usernames with proper authentication for added security:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> User <span class="hljs-keyword">struct</span> {
    ID           <span class="hljs-keyword">int</span>
    Username     <span class="hljs-keyword">string</span>
    PasswordHash <span class="hljs-keyword">string</span>
    Email        <span class="hljs-keyword">string</span>
    CreatedAt    time.Time
}
</code></pre>
<h3 id="heading-3-file-sharing">3. File Sharing</h3>
<p>You could allow users to upload files:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> FileMessage <span class="hljs-keyword">struct</span> {
    Message
    FileName <span class="hljs-keyword">string</span>
    FileSize <span class="hljs-keyword">int64</span>
    FileURL  <span class="hljs-keyword">string</span>
}
</code></pre>
<h3 id="heading-4-websocket-support">4. WebSocket Support</h3>
<p>You could add HTTP/WebSocket endpoint for web clients.</p>
<h3 id="heading-5-horizontal-scaling">5. Horizontal Scaling</h3>
<p>For massive scale, you could shard across multiple servers using Redis pub/sub or NATS for inter-server communication.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You've now built a production-ready distributed chatroom from scratch. This project demonstrates important distributed systems concepts including concurrency patterns, network programming, state management, persistence, and fault tolerance.</p>
<p>Additional resources:</p>
<ul>
<li><p><strong>Go Concurrency</strong>: "Concurrency in Go" by Katherine Cox-Buday</p>
</li>
<li><p><strong>Distributed Systems</strong>: "Designing Data-Intensive Applications" by Martin Kleppmann</p>
</li>
<li><p><strong>Networking</strong>: "Unix Network Programming" by Stevens</p>
</li>
</ul>
<p>The full source code is available on <a target="_blank" href="https://github.com/Caesarsage/distributed-system/tree/main/chatroom-with-broadcast">GitHub</a>. Feel free to open issues or contribute improvements.</p>
<p>As always, I hope you enjoyed this guide and learned something. If you want to stay connected or see more hands-on DevOps content, you can follow me on <a target="_blank" href="https://www.linkedin.com/in/destiny-erhabor">LinkedIn</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Why You Should Stop Managing Kafka Manually – A Guide to Kafka UI and Cruise Control ]]>
                </title>
                <description>
                    <![CDATA[ Over 80% of Fortune 100 companies use Apache Kafka. That's not surprising, as Kafka has revolutionized how we build real-time data pipelines and streaming applications. If you're working in software engineering today, chances are you've encountered K... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/stop-managing-kafka-manually-a-guide-to-kafka-ui-and-cruise-control/</link>
                <guid isPermaLink="false">6967bd3e7b94dad713ce9fae</guid>
                
                    <category>
                        <![CDATA[ kafka ]]>
                    </category>
                
                    <category>
                        <![CDATA[ distributed system ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Operational Efficiency ]]>
                    </category>
                
                    <category>
                        <![CDATA[ automation ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ramesh Sinha ]]>
                </dc:creator>
                <pubDate>Wed, 14 Jan 2026 15:58:54 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768353949324/18ce4a43-fb21-4e9b-9285-7c4db7b7ae2e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Over 80% of Fortune 100 companies use Apache Kafka. That's not surprising, as Kafka has revolutionized how we build real-time data pipelines and streaming applications. If you're working in software engineering today, chances are you've encountered Kafka in some capacity.</p>
<p>But here's the thing: while Kafka itself is incredibly powerful, managing Kafka clusters is notoriously challenging. This isn't a flaw in Kafka – it's just the reality of distributed systems. The bigger your cluster grows, the more complex operations become.</p>
<p>The most painful aspect? Manual cluster management. It's tedious, error-prone, and doesn't scale. What starts as simple topic creation with a few brokers turns into hours of carefully orchestrating partition reassignments across dozens of machines. One typo in a JSON file at 3 AM can take down production.</p>
<p>Sound familiar? You're not alone.</p>
<p>In this guide, you'll learn how two tools can transform Kafka operations from a manual slog into a manageable process:</p>
<ul>
<li><p><strong>Kafka UI</strong> – A modern web interface that replaces cryptic CLI commands with visual cluster management</p>
</li>
<li><p><strong>Cruise Control</strong> – LinkedIn's automation engine that handles cluster balancing and self-healing</p>
</li>
</ul>
<p>We'll start by experiencing the pain of manual management firsthand, then see how these tools solve real-world operational challenges. You'll set up everything locally with <code>Docker</code> and by the end you’ll know exactly how to manage Kafka clusters without the headache.</p>
<h2 id="heading-what-well-cover"><strong>What We’ll Cover:</strong></h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-setting-up-our-unmanaged-cluster">Setting Up Our Unmanaged Cluster</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-starting-the-cluster-amp-verification">Starting the Cluster &amp; Verification</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-topics-the-manual-way">Creating Topics: The Manual Way</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-kafka-ui">Kafka UI</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-setting-up-kafka-ui">Setting up Kafka UI</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-drawbacks-of-kafka-ui">Drawbacks of Kafka UI</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-cruise-control">Cruise Control</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-how-cruise-control-works">How Cruise Control Works</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-setting-up-cruise-control">Setting Up Cruise Control</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cruise-control-configuration-file">Cruise Control Configuration File</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-the-imbalance">Creating the Imbalance</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-attempting-manual-rebalancing">Attempting Manual Rebalancing</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-rebalancing-using-cruise-control">Rebalancing Using Cruise Control</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-the-problem-manual-kafka-management">The Problem: Manual Kafka Management</h2>
<p>Let’s dive right in. First, I'm going to show you what managing a Kafka cluster looks like without any tools – just you, the command line, and dozens of manual operations.</p>
<p>You’ll spin up a small cluster locally, create some topics, and simulate the kind of growth you'd see in a real production environment. By the end of this section, you'll understand exactly why teams spend thousands of engineering hours just keeping Kafka clusters running smoothly.</p>
<p>Fair warning: this is going to feel tedious but it’s ok – <em>that’s the point</em>.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before we dive in, make sure you have:</p>
<ol>
<li><p><strong>Docker Desktop installed and running</strong></p>
<ul>
<li><p>Mac and Windows users: <a target="_blank" href="https://www.docker.com/products/docker-desktop/">https://www.docker.com/products/docker-desktop/</a></p>
</li>
<li><p>Linux users can install Docker Engine via their package manager </p>
</li>
</ul>
</li>
<li><p><strong>Basic Kafka knowledge.</strong> You should understand:</p>
<ul>
<li><p><strong>Topics</strong>: Categories for organizing messages</p>
</li>
<li><p><strong>Partitions</strong>: How topics are divided for parallelism</p>
</li>
<li><p><strong>Brokers</strong>: The Kafka servers that store data</p>
</li>
<li><p><strong>Producers and Consumers</strong>: Applications that write to and read from Kafka</p>
</li>
<li><p><strong>KRaft</strong>: Kafka consensus based discovery?</p>
</li>
</ul>
</li>
</ol>
<p>    If these terms are new to you, <a target="_blank" href="https://www.freecodecamp.org/news/apache-kafka-handbook/">here’s a great handbook about them</a>. I’d also recommend reading <a target="_blank" href="https://kafka.apache.org/intro">Kafka's Introduction</a> first.</p>
<ol start="3">
<li><p><strong>System Requirements</strong></p>
<ul>
<li><p>At least 8GB Ram</p>
</li>
<li><p>10GB Free Disk space</p>
</li>
</ul>
</li>
<li><p>Some basic understanding of <strong>containers</strong> is good to have:</p>
<ul>
<li><p>Docker</p>
</li>
<li><p>Images</p>
</li>
<li><p>Volumes</p>
</li>
<li><p>Networks</p>
</li>
</ul>
</li>
</ol>
<h2 id="heading-setting-up-our-unmanaged-cluster">Setting Up Our Unmanaged Cluster</h2>
<p>Let’s go ahead and build the cluster so that we can see the problems firsthand. We’ll use Docker to spin up three Kafka brokers running in <code>KRaft</code> mode (the modern, ZooKeeper-free approach).</p>
<p>Start by creating a file called <code>docker-compose-basic.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3.8'</span>

<span class="hljs-attr">services:</span>
  <span class="hljs-attr">kafka-1:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-1</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9092:9092"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-2:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-2</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9093:9093"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-3:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-3</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9094:9094"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">3</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3-data:/var/lib/kafka/data</span>

<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">kafka-1-data:</span>
  <span class="hljs-attr">kafka-2-data:</span>
  <span class="hljs-attr">kafka-3-data:</span>
</code></pre>
<p>In the above configuration file, we’re creating three Kafka brokers (<code>kafka-1, kafka-2, kafka-3</code>). Each one uses the <code>confluentinc/cp-kafka:7.6.0</code> image and has its port opened (<code>9092, 9093, 9094</code>).</p>
<p>The environment variables are:</p>
<ul>
<li><p><strong>KAFKA_NODE_ID</strong> – A unique identifier for each broker (1,2,3). No two brokers can have the same ID.</p>
</li>
<li><p><strong>KAFKA_PROCESS_ROLES: broker, controller</strong> – This tells Kafka to run in <code>KRaft</code> mode (without ZooKeeper). Each broker acts as both a data broker and a controller for cluster coordination.</p>
</li>
<li><p><strong>KAFKA_CONTROLLER_QUORUM_VOTERS</strong> – The membership list that tells each broker how to find the others. All three brokers must have the identical list: <code>1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</code>. This is how they discover each other and elect a leader.</p>
</li>
<li><p><strong>CLUSTER_ID</strong> – A unique identifier for the entire cluster. All brokers must use the <strong>exact same value</strong> or they won't recognize each other as part of the same cluster. The actual value (<code>MkU3OEVBNTcwNTJENDM2Qk</code>) doesn't matter as long as long as it is consistent across brokers. One important thing to note is that CLUSTER_ID must be a valid <code>base64-encoded UUID</code>  per Kafka’s requirement.</p>
</li>
<li><p><strong>KAFKA_LISTENERS</strong> - Defines which network interfaces and ports Kafka listens on. We have three listeners:</p>
<ul>
<li><p><strong>PLAINTEXT://0.0.0.0:29092</strong>: For inter-broker communication (brokers talking to each other)</p>
</li>
<li><p><strong>CONTROLLER://0.0.0.0:29093</strong>: For controller communication in <code>KRaft</code> mode</p>
</li>
<li><p><strong>PLAINTEXT_HOST://0.0.0.0:9092</strong> (varies per broker): For external connections from your machine</p>
</li>
</ul>
</li>
<li><p><strong>KAFKA_ADVERTISED_LISTENERS</strong> – Tells clients (producers/consumers) how to connect to this broker. This is what gets returned when a client asks "<code>where should I connect?</code>" The PLAINTEXT_HOST://localhost:9092 part is what allows you to connect from your Mac.</p>
</li>
</ul>
<p>Note: <strong>Listener configuration is critical.</strong> Incorrect settings will prevent clients from connecting even when brokers are running. These settings work for local Docker environments where Docker's internal DNS resolves broker names (<code>kafka-1, kafka-2, kafka-3</code>). For production, replace hostnames with actual IP addresses or FQDNs - (Fully Qualified Domain Name):</p>
<ul>
<li><p><strong>KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2</strong> – How many copies of consumer offset data to keep. We use 2 instead of 3 because with only three brokers, this prevents issues during rolling restarts. In production with more brokers, you'd use 3 or more.</p>
</li>
<li><p><strong>The Volumes</strong> –  <code>kafka-x-data:/var/lib/kafka/data</code>  creates persistent storage for each broker’s data. Without volumes you will lose your topics and messages if you stop or restart your containers. Volumes are assigned to each broker so they don’t accidentally share data.</p>
</li>
</ul>
<p>Note: For a restart from scratch you need to delete the volumes using the following command. The -v flag removes volumes. Without it, old data persists even after down. </p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml down -v
</code></pre>
<p>If you're using the legacy <code>docker-compose</code> tool (V1), replace <code>docker compose</code> with <code>docker-compose</code> in all commands throughout this tutorial.</p>
<h3 id="heading-ports"><strong>Ports</strong></h3>
<p>Three ports are used for any given broker. Their purposes are:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Port</td><td>Purpose</td></tr>
</thead>
<tbody>
<tr>
<td><strong>9092</strong></td><td>external connections (producers, consumers from you Mac)</td></tr>
<tr>
<td><strong>29092</strong></td><td>Internal broker-to-broker communication</td></tr>
<tr>
<td><strong>29093</strong></td><td>Cluster coordination via KRaft</td></tr>
</tbody>
</table>
</div><h2 id="heading-starting-the-cluster-amp-verification">Starting the Cluster &amp; Verification</h2>
<p>Now that we have the basic docker configuration for Kafka, let’s run it and verify the results.</p>
<p>Run the following command in the same directory where you saved <code>docker-compose-basic.yml</code>:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml up -d
</code></pre>
<p>The <code>-d</code> flag runs the containers in detached mode (in the background), so you get your terminal back.</p>
<p>You should see output like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767554351786/4500c108-e9b6-403f-98cf-15198b3a9831.png" alt="Docker compose command result" class="image--center mx-auto" width="2484" height="830" loading="lazy"></p>
<p>Using the following command, check if the containers running Kafka brokers are up:</p>
<pre><code class="lang-bash">docker ps
</code></pre>
<p>You should see three Kafka containers (kafka-1, kafka-2, kafka-3) with status “<code>Up</code>” – something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767554452185/4e605547-8153-4a85-9903-e3b1132889a8.png" alt="Running Kafka Containers" class="image--center mx-auto" width="2486" height="278" loading="lazy"></p>
<p>Run the following command to verify that all three brokers are registered in the cluster:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-broker-api-versions --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092
</code></pre>
<p>You should see API version information for all three brokers (IDs 1, 2, 3) without any connection errors.</p>
<p>Note that we’re using <code>kafka-1:29092,kafka-2:29092,kafka-3:29092</code> here (the internal Docker addresses) instead of localhost:9092 because this command runs inside the <code>kafka-1</code> container by virtue of <code>docker exec -it kafka-1</code>, where <code>localhost</code> only refers to that specific container.</p>
<p>If any of the above verification returns errors or doesn’t show expected result as shown in screenshots, you can run the following command to see logs and debug:</p>
<pre><code class="lang-bash">docker logs kafka-1
</code></pre>
<h2 id="heading-creating-topics-the-manual-way">Creating Topics: The Manual Way</h2>
<p>Now that we have a cluster running, let’s simulate a real production use case where different teams need Kafka topics for their applications – payments, logs, events, metrics notifications, you name it.</p>
<p>Let’s start by creating a topic for logs. The command to do this is:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 12 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy
</code></pre>
<p>You’ll need to specify some command parameters, which are:</p>
<ol>
<li><p>The exact broker address <code>kafka-1:29092,kafka-2:29092,kafka-3:29092</code> (or the IP address of your servers in production)</p>
</li>
<li><p>The number of partitions – I have used <code>12</code> in the above command. Creating too few partitions creates bottlenecks, while creating too many adds overhead.</p>
</li>
<li><p>Retention policy – I have used 7 days (that is, 604800000 milliseconds)</p>
</li>
<li><p>Compression type</p>
</li>
</ol>
<p>Manually managing these parameters and running the command a handful of times is okay – but what if you have to run this for every team in your enterprise? Each team will have different requirements. The grind of copy, paste, adjust becomes painful if you have 100+ topics and multiple clusters (dev, staging, prod).</p>
<p>Feel the pain yet? Well, let’s just go on for a minute and we’ll address this issue shortly. For now, if you run the above command you should see the “<strong>Created topic</strong>” message:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767554860337/81937532-b88b-4d9a-bfca-f2f9b0433e4d.png" alt="Create Kafka Topic result" class="image--center mx-auto" width="2454" height="536" loading="lazy"></p>
<p>Note: We’re using <code>kafka-1:29092,kafka-2:29092,kafka-3:29092</code> to reach Kafka brokers because we’re running the command inside of broker kafka-1  by running using <code>docker exec</code>.</p>
<p>Let's keep going. We’ll create more topics using the same command by changing the topic name and partitions. Copy, paste, update, and run the above commands a couple times. On my machine, I ran it 3 more times like below (you can choose to run couple more times with changed values – it won’t matter because concrete values are not important for this tutorial):</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-views \    
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 20 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy


docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-analytics \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 3 \ 
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy


docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-articles \ 
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 5 \ 
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy
</code></pre>
<p>After creating the topics, let’s see all the ones you have now by running the following command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \ --list \ --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092
</code></pre>
<p>You should see a list of topics like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767555021681/451f69bc-5f91-432a-9c74-ca56b79aa179.png" alt="Kafka list of Topics" class="image--center mx-auto" width="2392" height="412" loading="lazy"></p>
<p>Notice that you just get the list of topics but no meaningful information, like:</p>
<ul>
<li><p>How many partitions does each have?</p>
</li>
<li><p>Which brokers are hosting them?</p>
</li>
<li><p>Are they evenly distributed?</p>
</li>
<li><p>What are their configurations?</p>
</li>
</ul>
<h3 id="heading-partition-information">Partition Information</h3>
<p>Let’s try to get information about our partitions. For this tutorial, I have created 4 topics and a total of 40 partitions spread across three brokers. I want to see which broker has the most partitions.</p>
<p>In a well-managed cluster, you’d want them roughly evenly distributed. But how can we check that? </p>
<p>Maybe the describe command shown below can help. Let’s run it:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --describe \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092
</code></pre>
<p>It will return a wall of text, something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767555186583/e50948e9-48f8-4431-8008-38f24da98373.png" alt="Kafka describe Topics" class="image--center mx-auto" width="2360" height="1344" loading="lazy"></p>
<p>So, we have partition information but:</p>
<ul>
<li><p>No summary or aggregation</p>
</li>
<li><p>No visual representation </p>
</li>
<li><p>It’s difficult to scan and compare</p>
</li>
<li><p>It gets exponentially worse with more topics</p>
</li>
</ul>
<h3 id="heading-counting-leaders">Counting Leaders</h3>
<p>The Leader field in the above screenshot tells you which broker is the leader for each partition. Leaders handle all read and write requests, so you want them evenly distributed or else some brokers will become overloaded. </p>
<p>Let’s try to count how many partitions each broker leads. To do that, run the following command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 kafka-topics \
  --describe \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 | grep <span class="hljs-string">"Leader: 1"</span> | wc -l
</code></pre>
<p>It will show something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767555304446/6a58be3a-e21f-4209-8165-1544c5bc6c20.png" alt="Kafka Leader Count result" class="image--center mx-auto" width="2466" height="292" loading="lazy"></p>
<p>Per my topic creation, <code>14</code> is the count of partitions where <code>broker 1 (Leader : 1)</code> is the leader. You might see a different number depending on how many topics and how many partitions you have created.</p>
<p>You can repeat this command to see the count of partitions led by other brokers. To do so, just change <code>Leader: 1</code> to <code>Leader: 2</code> or <code>Leader: 3.</code>. I get <code>14, 12, 14</code>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767555418151/8c66afe7-b5db-4106-800f-f6e7e8926ba2.png" alt="Kafka Leader Count for all Topics" class="image--center mx-auto" width="2318" height="704" loading="lazy"></p>
<p>That’s somewhat balanced, but you had to run the command multiple times, parse using <code>grep</code> and <code>wc</code>, and this is just 3 brokers. What if you had 100+? Also, what if you have to get the replicas’ information?</p>
<p>I could go on and on with the data we need and the commands to get that information. But the point I’m trying to make here is that sooner or later this becomes impossible to manage. Your team is going to need an army, and to be honest there isn’t much value in doing all of this manually. </p>
<p>So far, you’ve seen only simple operational commands, but the problems don’t stop there. In a real production environment there are more complex and challenging operations like:</p>
<ul>
<li><p><strong>Consumer Lag Monitoring</strong>: When consumers fall behind, you need to track which partitions are lagging, which consumer instances own them, and where the lag is growing or shrinking. With CLI tools, you get raw numbers but no trends or context.</p>
</li>
<li><p><strong>Broker Failures</strong>: When a broker fails, you need to identify under-replicated partitions, trigger leader elections, and create partition reassignment <code>JSON</code> files manually. One mistake in that JSON can cause data loss.</p>
</li>
<li><p><strong>Cluster rebalancing</strong>: You’ll see that when you add new brokers, they sit empty until you manually redistribute partitions. Similarly for removing brokers, you need to move all their partitions first. These operations require calculating optimal placement and creating complex reassignment plans.</p>
</li>
</ul>
<p>If you’re still with me, you’re probably thinking that there has to be a better way. Fortunately, there is – actually, there are a couple complimentary ways and we are going to talk about those next.</p>
<h2 id="heading-kafka-ui">Kafka UI</h2>
<p>Kafka UI is a modern, open-source web interface for managing Kafka clusters. It replaces the <code>command line chaos</code> we just experienced with a clean, visual dashboard.</p>
<p>Kafka UI provides the following features:</p>
<ul>
<li><p><code>Visual cluster Overview</code>: see all brokers, topics, and partitions at a glance.</p>
</li>
<li><p><code>Topic management</code>: create, configure, and delete topics with a GUI</p>
</li>
<li><p><code>Consumer group monitoring</code>: track lags, offsets, and consumer health in real-time</p>
</li>
<li><p><code>Message browsing</code>: view actual messages in topics without command line tools </p>
</li>
</ul>
<p>Without further ado, let’s set up Kafka UI.</p>
<h3 id="heading-setting-up-kafka-ui">Setting Up Kafka UI</h3>
<p>To setup up Kafka UI, let’s modify our existing <code>docker-compose-basic.yml</code> like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3.8'</span>

<span class="hljs-attr">services:</span>
  <span class="hljs-attr">kafka-1:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-1</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9092:9092"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-2:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-2</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9093:9093"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-3:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">confluentinc/cp-kafka:7.6.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-3</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9094:9094"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">3</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3-data:/var/lib/kafka/data</span>
<span class="hljs-comment"># Adding kafka-UI service start</span>
  <span class="hljs-attr">kafka-ui:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">provectuslabs/kafka-ui:latest</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-ui</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"8080:8080"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">DYNAMIC_CONFIG_ENABLED:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CLUSTERS_0_NAME:</span> <span class="hljs-string">freecodecamp-cluster</span>
      <span class="hljs-attr">KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS:</span> <span class="hljs-string">kafka-1:29092,kafka-2:29092,kafka-3:29092</span>
    <span class="hljs-attr">depends_on:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3</span>
<span class="hljs-comment"># Adding kafka-UI service end</span>
<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">kafka-1-data:</span>
  <span class="hljs-attr">kafka-2-data:</span>
  <span class="hljs-attr">kafka-3-data:</span>
</code></pre>
<p>The yaml file is pretty much the same as before except that we have added a new service called <code>kafka-ui</code> (for better clarity, I have added the changes in between start and end comments).</p>
<p>Key Configurations are:</p>
<ul>
<li><p><strong>Port 8080</strong> – You can access the UI at <a target="_blank" href="http://localhost:8080">http://localhost:8080</a> from your machine.</p>
</li>
<li><p><strong>KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS</strong> – This environment variable tells Kafka UI where to connect your cluster (using internal Docker addresses).</p>
</li>
<li><p><strong>KAFKA_CLUSTERS_0_NAME</strong> – A friendly name for your cluster in the UI.</p>
</li>
</ul>
<p>Let’s first clean up the old cluster while keeping the topic data intact. Go ahead and run the following command to do so:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml down
</code></pre>
<p>Note that we’re not using <code>-v</code> here, so volumes (topic data) will remain intact.</p>
<p>Wait for couple seconds and then run the following docker up command to bring up our cluster with Kafka UI:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml up -d
</code></pre>
<p>Now open a browser and visit <a target="_blank" href="http://localhost:8080/">http://localhost:8080/</a>. You’ll see the UI like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767556151144/f0dd2a51-79c5-4906-a32d-d7732f9fd242.png" alt="Kafka UI" class="image--center mx-auto" width="2380" height="738" loading="lazy"></p>
<p>You can click around and see all information about the cluster we have created, like:</p>
<ul>
<li><p>Your 3 brokers </p>
</li>
<li><p>The topics you created earlier</p>
</li>
<li><p>Partition counts</p>
</li>
</ul>
<p>For comparison with manual commands, let's look at the Brokers tab. You can see the partition leader count for each broker at a glance – remember that we had to run multiple commands to get this information earlier. Beyond this, the UI provides many other useful metrics that would require separate command-line queries.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767556236480/c99ba704-47b9-42da-b135-e1f9503ee1ab.png" alt="Kakfa UI Brokers" class="image--center mx-auto" width="2402" height="838" loading="lazy"></p>
<p>Remember the CLI commands we had to run to create topics? If you go to the <code>Topics</code> tab, you will notice that Topic management (<code>creation, deletion, data cleanup</code> and so on) are just a few button clicks.  </p>
<p>Similarly, managing Consumers only requires a few button clicks.</p>
<p>After exploring the Kafka UI, you'll see how much easier it is to monitor your cluster compared to running individual CLI commands.</p>
<h3 id="heading-drawbacks-of-kafka-ui">Drawbacks of Kafka UI</h3>
<p>That said, Kafka UI does have some limitations:</p>
<ul>
<li><p><strong>Automatic rebalancing</strong>: One or few brokers having more partitions that others, you must manually reassign them. </p>
</li>
<li><p><strong>Self-healing</strong>: If a broker fails, you have to manually create reassignment plans.</p>
</li>
<li><p><strong>Performance optimization</strong>: The UI can’t recommend intelligent partition placement.</p>
</li>
<li><p><strong>Alerts</strong>: The UI doesn’t warn you before problems happen.</p>
</li>
</ul>
<p>For small clusters (3 - 10 brokers ), Kafka UI and some command execution might be enough. You’ll be able to see problems clearly and fix them when needed.  </p>
<p>For large clusters, manual operations are still not scalable, so we need some kind of a complementary tool…and that tool is <strong>Cruise Control</strong>.</p>
<h2 id="heading-cruise-control">Cruise Control</h2>
<p>Cruise Control is an automation engine for Kafka clusters. While Kafka UI gives you visibility and manual control, Cruise Control provides intelligent automation and self-healing. You can think of Kafka UI as a dashboard with manual controls and Cruise Control as an autopilot. In other words, they complement each other.</p>
<p>Let’s try to create some imbalance in our cluster and fix it manually. This will help you learn how to reason through why you need Cruise Control. </p>
<p>To keep things simple, let’s start from scratch. We will first delete all the Docker resources we have created so far by running the following command:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml down -v
</code></pre>
<p>Running <code>docker-compose down -v</code> will delete all the topics and messages we created so far, but don’t worry –we’ll create them again.</p>
<h3 id="heading-how-cruise-control-works">How Cruise Control Works</h3>
<p>You can think of Cruise Control as a metric-monitoring and action-taking tool. Kafka brokers collect internal metrics (CPU, disk, network traffic, partition sizes), and a metric reporter running inside each broker sends these metrics to a Kafka topic.</p>
<p>Cruise Control then reads from that topic and analyzes the data. Based on that analysis, it proposes partition movements. We’ll see this in action shortly.</p>
<h3 id="heading-setting-up-cruise-control">Setting Up Cruise Control</h3>
<p>As of this writing, I couldn’t find a compatible Kafka and Cruise Control image that supports <code>KRaft</code> (Kafka Consensus Algorithm), so I decided to create Kafka and Cruise Control public images that will help with the tutorial. I don’t recommend using these images in production. For production usage, you should either wait for community to provide an image or create one of your own.</p>
<p>Change the <code>docker-compose-basic.yml</code> file to look like the below:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3.8'</span>

<span class="hljs-attr">services:</span>
  <span class="hljs-attr">kafka-1:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">justramesh2000/kafka-apache-cc:3.8.1</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-1</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9092:9092"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
      <span class="hljs-comment"># Cruise Control Metrics Reporter</span>
      <span class="hljs-attr">KAFKA_METRIC_REPORTERS:</span> <span class="hljs-string">'com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS:</span> <span class="hljs-string">'kafka-1:29092,kafka-2:29092,kafka-3:29092'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS:</span> <span class="hljs-string">'1'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-string">'2'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE:</span> <span class="hljs-string">'false'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS:</span> <span class="hljs-string">'60000'</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-2:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">justramesh2000/kafka-apache-cc:3.8.1</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-2</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9093:9093"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
      <span class="hljs-attr">KAFKA_METRIC_REPORTERS:</span> <span class="hljs-string">com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS:</span> <span class="hljs-string">kafka-1:29092,kafka-2:29092,kafka-3:29092</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE:</span> <span class="hljs-string">'false'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC:</span> <span class="hljs-string">__CruiseControlMetrics</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS:</span> <span class="hljs-string">'1'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-string">'2'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS:</span> <span class="hljs-string">'60000'</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2-data:/var/lib/kafka/data</span>

  <span class="hljs-attr">kafka-3:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">justramesh2000/kafka-apache-cc:3.8.1</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-3</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9094:9094"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">KAFKA_NODE_ID:</span> <span class="hljs-number">3</span>
      <span class="hljs-attr">KAFKA_PROCESS_ROLES:</span> <span class="hljs-string">broker,controller</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_QUORUM_VOTERS:</span> <span class="hljs-number">1</span><span class="hljs-string">@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</span>
      <span class="hljs-attr">KAFKA_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094</span>
      <span class="hljs-attr">KAFKA_ADVERTISED_LISTENERS:</span> <span class="hljs-string">PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094</span>
      <span class="hljs-attr">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP:</span> <span class="hljs-string">PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_CONTROLLER_LISTENER_NAMES:</span> <span class="hljs-string">CONTROLLER</span>
      <span class="hljs-attr">KAFKA_INTER_BROKER_LISTENER_NAME:</span> <span class="hljs-string">PLAINTEXT</span>
      <span class="hljs-attr">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">CLUSTER_ID:</span> <span class="hljs-string">'MkU3OEVBNTcwNTJENDM2Qk'</span>
      <span class="hljs-attr">KAFKA_LOG_DIRS:</span> <span class="hljs-string">/var/lib/kafka/data</span>
      <span class="hljs-attr">KAFKA_METRIC_REPORTERS:</span> <span class="hljs-string">com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS:</span> <span class="hljs-string">kafka-1:29092,kafka-2:29092,kafka-3:29092</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE:</span> <span class="hljs-string">'false'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC:</span> <span class="hljs-string">__CruiseControlMetrics</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS:</span> <span class="hljs-string">'1'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR:</span> <span class="hljs-string">'2'</span>
      <span class="hljs-attr">KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS:</span> <span class="hljs-string">'60000'</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3-data:/var/lib/kafka/data</span>
  <span class="hljs-comment"># Adding kafka-UI service start</span>
  <span class="hljs-attr">kafka-ui:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">provectuslabs/kafka-ui:latest</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka-ui</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"8080:8080"</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">DYNAMIC_CONFIG_ENABLED:</span> <span class="hljs-string">'true'</span>
      <span class="hljs-attr">KAFKA_CLUSTERS_0_NAME:</span> <span class="hljs-string">freecodecamp-cluster</span>
      <span class="hljs-attr">KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS:</span> <span class="hljs-string">kafka-1:29092,kafka-2:29092,kafka-3:29092</span>
    <span class="hljs-attr">depends_on:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./config:/opt/cruise-control/config</span>  
  <span class="hljs-comment"># Adding kafka-UI service end</span>
  <span class="hljs-comment"># Adding cruise-control start</span>
  <span class="hljs-attr">cruise-control:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">justramesh2000/cruise-control-kraft:2.5.142</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">cruise-control</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9090:9090"</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./config/cruisecontrol.properties:/opt/cruise-control/config/cruisecontrol.properties</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./config/capacityJBOD.json:/opt/cruise-control/config/capacityJBOD.json:ro</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./config/log4j.properties:/opt/cruise-control/config/log4j.properties:ro</span>
    <span class="hljs-attr">depends_on:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-1</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kafka-3</span>
   <span class="hljs-comment"># Adding cruise-control end    </span>
<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">kafka-1-data:</span>
  <span class="hljs-attr">kafka-2-data:</span>
  <span class="hljs-attr">kafka-3-data:</span>
</code></pre>
<p>You should have made the following changes to the file:</p>
<ul>
<li><p>Changed Kafka image from <code>confluentinc/cp-kafka:7.6.0</code> to <code>justramesh2000/kafka-apache-cc:3.8.1</code>. The new image contains the Cruise Control metrics exporter which will export metrics data from Kafka brokers to be used by Cruise Control.</p>
</li>
<li><p>Added the following environment variables:</p>
<ul>
<li><p><strong>KAFKA_METRIC_REPORTERS</strong> – This variable tells Kafka to load a plugin called the <code>Cruise Control Metrics Reporter</code>. It runs inside each Kafka broker process, and hooks into Kafka’s internal metrics system. This helps with data collection.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS</strong> – This tells the <code>Cruise Control Metrics Reporter</code> where to send metrics to, meaning which Kafka brokers and which port.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE</strong> – This disables specific Kubernetes behaviors (Pod name, id instead of Host). We are using Docker, so we don’t need K8s behaviors.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_TOPIC</strong> – Specifies the name of the topic where metrics will be published. Default is <code>__CruiseControlMetrics</code> but you can customize using this variable if you want to.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE</strong> – Automatically creates a <code>__CruiseControlMetrics</code> topic if it doesn’t exist. Without this metric, the reporter will fail reporting until you manually create this topic.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS</strong> – Defines the number of partitions for the topic <code>__CruiseControlMetrics</code>.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR</strong> – Tells Kafka how many copies of metrics data to keep. In our case, we’re keeping 2 copies of the data.</p>
</li>
<li><p><strong>KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS</strong> – Tells Kafka how often to send metrics. We’re sending every minute.</p>
</li>
</ul>
</li>
<li><p>Added Cruise-control service using image <code>justramesh2000/cruise-control-kraft:2.5.142</code>. For clarity, I have kept this change between the <code>start</code> and <code>end</code> comments.</p>
</li>
<li><p>Under cruise control, we’ve mounted <code>three</code> Cruise Control configurations files. We’ll talk about those files next.</p>
</li>
</ul>
<h3 id="heading-cruise-control-configuration-file">Cruise Control Configuration File</h3>
<p>To run Cruise Control, we need to provide several configuration files. Among the key pieces of information are:</p>
<ul>
<li><p>Where the Kafka cluster is located</p>
</li>
<li><p>The capacity of each broker</p>
</li>
</ul>
<p>Create a config directory and add the following files:</p>
<pre><code class="lang-bash">mkdir config
</code></pre>
<h4 id="heading-cruisecontrolproperties">cruisecontrol.properties</h4>
<p>This is Cruise Control’s main configuration file.</p>
<p>Save the following content as <code>cruisecontrol.properties</code> in the config directory:</p>
<pre><code class="lang-abap"># Kafka cluster. Tells how <span class="hljs-keyword">to</span> connect <span class="hljs-keyword">to</span> brokers
bootstrap.servers=kafka-<span class="hljs-number">1</span>:<span class="hljs-number">29092</span>,kafka-<span class="hljs-number">2</span>:<span class="hljs-number">29092</span>,kafka-<span class="hljs-number">3</span>:<span class="hljs-number">29092</span>

# <span class="hljs-keyword">Topic</span> <span class="hljs-keyword">from</span> which metrics are <span class="hljs-keyword">to</span> be <span class="hljs-keyword">read</span>
metric.reporter.<span class="hljs-keyword">topic</span>=__CruiseControlMetrics

# Aggregated partition <span class="hljs-keyword">data</span>
partition.metric.sample.store.<span class="hljs-keyword">topic</span>=__KafkaCruiseControlPartitionMetricSamples

#Aggregated broker <span class="hljs-keyword">data</span>
broker.metric.sample.store.<span class="hljs-keyword">topic</span>=__KafkaCruiseControlModelTrainingSamples

# Enable broker failure detection <span class="hljs-keyword">for</span> KRaft <span class="hljs-keyword">mode</span> (no ZooKeeper)
kafka.broker.failure.detection.enable=true

# Capacity. Tells <span class="hljs-keyword">where</span> the capacity file <span class="hljs-keyword">is</span> 
capacity.config.file=config/capacityJBOD.json

# Goals. What <span class="hljs-keyword">to</span> optimize <span class="hljs-keyword">for</span> <span class="hljs-keyword">during</span> cluster balancing. These are the riles <span class="hljs-keyword">for</span> CC <span class="hljs-keyword">to</span> abide <span class="hljs-keyword">to</span> <span class="hljs-keyword">during</span> rebalancing
<span class="hljs-keyword">default</span>.goals=com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.RackAwareGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.ReplicaCapacityGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.DiskCapacityGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.NetworkInboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.NetworkOutboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.CpuCapacityGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.ReplicaDistributionGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.DiskUsageDistributionGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.LeaderReplicaDistributionGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.LeaderBytesInDistributionGoal

# hard goals. 
hard.goals=com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.RackAwareGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.ReplicaCapacityGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.DiskCapacityGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.NetworkInboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.NetworkOutboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.<span class="hljs-keyword">analyzer</span>.goals.CpuCapacityGoal

# Webserver. <span class="hljs-keyword">For</span> WebApi access
webserver.http.port=<span class="hljs-number">9090</span>
webserver.http.address=<span class="hljs-number">0.0</span>.<span class="hljs-number">0.0</span>

# Execution
num.broker.metrics.windows=<span class="hljs-number">1</span>
num.partition.metrics.windows=<span class="hljs-number">1</span>
</code></pre>
<p>I’ve added in line comments to explain much of the above configuration, but I think the <code>Goals</code> need special attention. These are the rules that we as users have set for Cruise Control to abide by.</p>
<p>By defining goals, we tell Cruise Control to do the following:</p>
<ul>
<li><p><code>RackAwareGoal</code> – Spread replicas across racks (or in our case, brokers)</p>
</li>
<li><p><code>ReplicaCapacityGoal</code> – Don't overload brokers with too many replicas</p>
</li>
<li><p><code>DiskCapacityGoal</code> – Don't fill up disk</p>
</li>
<li><p><code>NetworkInboundCapacityGoal</code> – Balance incoming network traffic</p>
</li>
<li><p><code>NetworkOutboundCapacityGoal</code> – Balance outgoing network traffic</p>
</li>
<li><p><code>CpuCapacityGoal</code> – Balance CPU usage</p>
</li>
<li><p><code>ReplicaDistributionGoal</code> – Evenly distribute replicas</p>
</li>
<li><p><code>DiskUsageDistributionGoal</code> – Ensure even disk usage across brokers</p>
</li>
<li><p><code>LeaderReplicaDistributionGoal</code> – Evenly distribute leader replicas</p>
</li>
<li><p><code>LeaderBytesInDistributionGoal</code> – Balance data flowing to leaders</p>
</li>
</ul>
<p>Via Cruise Control configuration, you can define two types of goals: <code>Default goals</code> and <code>Hard goals</code>. Hard goals must be met. Default goals that aren’t part of the hard goals become soft goals. This means that Cruise Control will give its best effort to satisfy them but won’t reject a proposal if it can’t.</p>
<p>Here’s a little summary:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Type</td><td>Meaning</td><td>What CC Does</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Hard Goals</strong></td><td>Must-haves (capacity limits)</td><td><strong>Never violates</strong> – rejects proposal if can't satisfy</td></tr>
<tr>
<td><strong>Soft Goals</strong></td><td>Nice-to-haves (better balance)</td><td><strong>Tries to satisfy</strong> – skips if conflicts with hard goals</td></tr>
<tr>
<td><strong>Default Goals</strong></td><td>Hard + Soft together</td><td><strong>Optimizes for all</strong> – prioritizes hard over soft</td></tr>
</tbody>
</table>
</div><p>Cruise control collects metrics for a defined period (default: 5 minutes) and creates a monitoring window. The following settings control how many windows Cruise Control needs before it’s ready to generate proposals (shortly, we will see what proposals are):</p>
<ul>
<li><p><code>num.broker.metrics.windows=1</code>: Wait for 1 monitoring window before generating proposals. Each window in Cruise Control is 5 minutes by default. This means that Cruise Control will be ready after 5 minutes. I’ve set this to 1 for quick testing. The recommendation is to use a large window in production to avoid false proposals from temporary spikes.</p>
</li>
<li><p><code>num.partition.metrics.windows=1</code>: Wait for 1 window of partition metrics. Same reasoning as above.</p>
</li>
</ul>
<h4 id="heading-capacity">Capacity</h4>
<p>This informs cruise control about the capacity (CPU, DISK) of each broker, which then helps it to make decisions. Using the below file, we’re telling Cruise Control the following:</p>
<ul>
<li><p>What are the brokerIds</p>
</li>
<li><p>What is the disk path <code>/var/lib/kafka/data</code> and disk capacity (<code>100000000</code> MB = 100 GB). This is used by <code>DiskCapacityGoal</code> that we set up in the above <code>cruisecontrol.properties</code> file.</p>
</li>
<li><p>What is the CPU 100% (1 Core). Used by <code>CpuCapacityGoal</code>.</p>
</li>
<li><p>What is the <code>NW_IN</code> Network Inbound Capacity (125,000 KB/s = 1 MB/s –Megabytes per second) = 1 Gbps – Giga <code>bits</code> per second). Used by <code>NetworkInboundCapacityGoal</code>.</p>
</li>
<li><p>What is the <code>NW_OUT</code> Network Outbound Capacity (125,000 KB/s). Used by <code>NetworkOutboundCapacityGoal</code></p>
</li>
</ul>
<p>Save the following content as <code>capacityJBOD.json</code> in the config directory:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"brokerCapacities"</span>:[
    {
      <span class="hljs-attr">"brokerId"</span>: <span class="hljs-string">"1"</span>,
      <span class="hljs-attr">"capacity"</span>: {
        <span class="hljs-attr">"DISK"</span>: {<span class="hljs-attr">"/var/lib/kafka/data"</span>: <span class="hljs-string">"100000000"</span>},
        <span class="hljs-attr">"CPU"</span>: <span class="hljs-string">"100"</span>,
        <span class="hljs-attr">"NW_IN"</span>: <span class="hljs-string">"125000"</span>,
        <span class="hljs-attr">"NW_OUT"</span>: <span class="hljs-string">"125000"</span>
      }
    },
    {
      <span class="hljs-attr">"brokerId"</span>: <span class="hljs-string">"2"</span>,
      <span class="hljs-attr">"capacity"</span>: {
        <span class="hljs-attr">"DISK"</span>: {<span class="hljs-attr">"/var/lib/kafka/data"</span>: <span class="hljs-string">"100000000"</span>},
        <span class="hljs-attr">"CPU"</span>: <span class="hljs-string">"100"</span>,
        <span class="hljs-attr">"NW_IN"</span>: <span class="hljs-string">"125000"</span>,
        <span class="hljs-attr">"NW_OUT"</span>: <span class="hljs-string">"125000"</span>
      }
    },
    {
      <span class="hljs-attr">"brokerId"</span>: <span class="hljs-string">"3"</span>,
      <span class="hljs-attr">"capacity"</span>: {
        <span class="hljs-attr">"DISK"</span>: {<span class="hljs-attr">"/var/lib/kafka/data"</span>: <span class="hljs-string">"100000000"</span>},
        <span class="hljs-attr">"CPU"</span>: <span class="hljs-string">"100"</span>,
        <span class="hljs-attr">"NW_IN"</span>: <span class="hljs-string">"125000"</span>,
        <span class="hljs-attr">"NW_OUT"</span>: <span class="hljs-string">"125000"</span>
      }
    }
  ]
}
</code></pre>
<h4 id="heading-logging">Logging</h4>
<p>This is not important for Cruise Control to work properly, but it’ll help you debug if there are issues. Save the following content as <code>log4j.properties</code> in the config directory. When you execute commands to start Cruise Control and If you see unexpected behaviors like container exiting, you can use the <code>docker logs</code> command to see what happened.</p>
<pre><code class="lang-abap"># Root logger - INFO level, output <span class="hljs-keyword">to</span> console
rootLogger.level=INFO
appenders=console

# Console output (<span class="hljs-keyword">for</span> docker logs)
appender.console.<span class="hljs-keyword">type</span>=Console
appender.console.name=STDOUT
appender.console.layout.<span class="hljs-keyword">type</span>=PatternLayout
appender.console.layout.pattern=[%d] %p %m (%c)%n

# <span class="hljs-keyword">Send</span> root logger <span class="hljs-keyword">to</span> console
rootLogger.appenderRef.console.<span class="hljs-keyword">ref</span>=STDOUT
</code></pre>
<p>Now that we have all the configurations in place, let’s run the following command to start Kafka brokers with Kafka UI and Cruise Control:</p>
<pre><code class="lang-bash">docker compose -f docker-compose-basic.yml up -d
</code></pre>
<p>Using the following command, verify that the three Kafka brokers, Kafka UI, and Cruise Control containers are running:</p>
<pre><code class="lang-bash">docker ps
</code></pre>
<p>You should see something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768085885265/b196c5ce-77b3-4563-b72a-20c6de7123f0.png" alt="Docker running containers" class="image--center mx-auto" width="2634" height="214" loading="lazy"></p>
<p>Now that we have Cruise Control up and running, let’s create some Imbalance and see how much better of an experience we get by using Cruise Control versus mitigating the imbalance manually.</p>
<h3 id="heading-creating-the-imbalance">Creating the Imbalance</h3>
<p>An imbalance is a scenario where some brokers are handling more messages than others – and they may run into high disk usage or high IOPS.</p>
<p>To create the imbalance in our cluster, we’ll have to create a few topics and then produce messages unevenly. Now that you have Kafka UI running, you can create topics using that method or you can create topics using commands. I’m going to use the commands because it’ll be easier for you to reproduce my work (but I recommend UI for production operations because it prevents typos).</p>
<p>If you also decide to use commands, run the following command. Then using UI, verify that the topics have been created.</p>
<p>Note: You’ll find that the commands are different from previous commands. This is because, previously in our <code>docker-compose-basic.yml</code> file, we were using the <code>confluentinc/cp-kafka:7.6.0</code> image for Kafka. But now we’re using the <code>justramesh2000/kafka-apache-cc:3.8.1</code> image which is based off of the <code>apache/kafka:3.8.1</code> image. For different images, the tools are located at different places, so the command needs to be adjusted to account for that.</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">'
/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092 \
  --partitions 12 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-views \
  --bootstrap-server kafka-1:29092 \
  --partitions 20 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-analytics \
  --bootstrap-server kafka-1:29092 \
  --partitions 3 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-articles \
  --bootstrap-server kafka-1:29092 \
  --partitions 5 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy
'</span>
</code></pre>
<p>Run the following command to produce uneven messages on different topics we created above.</p>
<p>Heavy Load on <code>freecodecamp-logs</code>:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">"
for i in {1..100000}; do 
  echo '{\"log_id\":\"'\$i'\",\"level\":\"INFO\",\"message\":\"Log entry '\$i'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-logs --bootstrap-server kafka-1:29092"</span>
</code></pre>
<p>Heavy load on <code>freecodecamp-views</code>:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">"
for i in {1..80000}; do 
  echo '{\"view_id\":\"'\$i'\",\"page\":\"/article/'\$((i % 100))'\",\"user\":\"user_'\$((i % 1000))'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-views --bootstrap-server kafka-1:29092"</span>
</code></pre>
<p>Moderate load on <code>freecodecamp-analytics</code>:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">"
for i in {1..30000}; do 
  echo '{\"event\":\"page_view\",\"user\":\"user_'\$i'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-analytics --bootstrap-server kafka-1:29092"</span>
</code></pre>
<p>Now, produce a message with a <code>fixed key</code> to force all data into one Partition. This is a fast way to create strong disk imbalance. Run the following command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 bash -c <span class="hljs-string">"
for i in {1..300000}; do
  echo 'hotkey:{\"log_id\":'\$i',\"msg\":\"big payload\"}'
done | /opt/kafka/bin/kafka-console-producer.sh \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092 \
  --property parse.key=true \
  --property key.separator=:"</span>
</code></pre>
<p>After running the above commands, come back to the UI, refresh, and you will see a number of messages like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768088817389/13c285b2-308a-4e6c-80f1-4de774f34662.png" alt="Kafka Topics with Message Count" class="image--center mx-auto" width="3738" height="1004" loading="lazy"></p>
<p>Now, go to brokers tab and see the imbalance in Disk Usage:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768088880398/5e615012-a35f-4820-8daa-b46746b65b56.png" alt="Kafka Brokers Disk usage" class="image--center mx-auto" width="4128" height="810" loading="lazy"></p>
<p>You should be able to see that <strong>Broker-2 has only about 47% of the data that Broker-1 has</strong>, and <strong>Broker-3 has about 11% more data than Broker-1</strong>. Broker-2 is significantly underutilized, while Broker-1 and Broker-3 hold most of the data.</p>
<h3 id="heading-attempting-manual-rebalancing">Attempting Manual Rebalancing</h3>
<p><strong>Step 1</strong>: First, we need to find out which topic is heavy – meaning which one handles more data. My setup shows the <code>freecodecamp-logs</code> topic with 8MB of data:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768089339438/e57c1b10-ebac-4824-ad0d-1c8d3ec6ff71.png" alt="Kafka Topics with Message Count" class="image--center mx-auto" width="3716" height="988" loading="lazy"></p>
<p><strong>Step 2</strong>: Let’s see where the heavy partitions are.</p>
<p>Click on <strong>freecodecamp-logs</strong> in Kafka UI and see the partition table. Look at the message count: partition 4 is bigger than the others. The table also gives information about replicas of partitions: partition 4 has replicas on Broker 1 and 3. Broker 2 doesn’t have partition 4 at all. This explains why Broker 2 was underutilized.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768089513001/8993d5ce-c70f-45e3-b643-b8a759dda138.png" alt="Kafka Topic Partitions" class="image--center mx-auto" width="3498" height="1428" loading="lazy"></p>
<p><strong>Step 3:</strong> To balance the cluster, we need to move partition 4 around.</p>
<p>We can move partition 4 to Broker 2. But before that, let’s do some math to be able to rationalize our decision. Note that the calculation doesn’t have to be precise – we just want a relative sense of data between brokers.</p>
<p>Current state:</p>
<ul>
<li><p><strong>Broker 1</strong>: 4.55 MB</p>
</li>
<li><p><strong>Broker 2</strong>: 2.29 MB (underutilized)</p>
</li>
<li><p><strong>Broker 3</strong>: 5.11 MB (over-utilized)</p>
</li>
</ul>
<p>Note that roughly the compressed data size for partition 4 is 2.25 MB (exact size is not critical).</p>
<p>If we move partition 4 from [1,3] to [2,3]:</p>
<ul>
<li><p><strong>Broker 1:</strong> Loses partition 4, so 4.55 + 2.25 = <strong>~2.3 MB</strong></p>
</li>
<li><p><strong>Broker 2:</strong> Gains Partition 4, so 2.33 + 2.25 = ~<strong>4.58 MB</strong></p>
</li>
<li><p><strong>Broker 3:</strong> Already has partition 4, so = <strong>5.11 MB (no change)</strong></p>
</li>
</ul>
<p>The result is that Broker 1 becomes underutilized.</p>
<p>How about if we move partition 4 from [1,3] to [1,2]?</p>
<ul>
<li><p><strong>Broker 1:</strong> Already has partition 4 = <strong>4.55 MB (no change)</strong></p>
</li>
<li><p><strong>Broker 2:</strong> Gains Partition 4, so 2.33 + 2.25 = ~<strong>4.58 MB</strong></p>
</li>
<li><p><strong>Broker 3:</strong> Loses partition 4, so 5.11 + 2.25 = ~<strong>2.8 MB</strong></p>
</li>
</ul>
<p>Hmm, this still creates an imbalance (broker 3 becomes too light).</p>
<p>So basically, manual rebalancing requires complex calculations. Moving a single partition impacts disk usage, leader distribution, and network traffic across multiple brokers. One poorly planned move can create a new imbalance elsewhere. </p>
<p>But, let’s say you somehow landed on a perfect mathematical calculation and you’re ready to make the move to balance. We’ll assume that the perfect plan is to move Partition 4 from [1, 3] to [2, 3]. I know it’s not the perfect move but the point is to see the pain afterwards.</p>
<p><strong>Step 4</strong>: it’s time to move the partition manually.</p>
<p>We need to tell Kafka to move partition 4's replicas from brokers [1,3] to brokers [2,3].</p>
<p>To do that, you need create a file called <code>reassignment.json</code> on your machine:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"version"</span>: <span class="hljs-number">1</span>,
  <span class="hljs-attr">"partitions"</span>: [
    {
      <span class="hljs-attr">"topic"</span>: <span class="hljs-string">"freecodecamp-logs"</span>,
      <span class="hljs-attr">"partition"</span>: <span class="hljs-number">4</span>,
      <span class="hljs-attr">"replicas"</span>: [<span class="hljs-number">2</span>, <span class="hljs-number">3</span>],
      <span class="hljs-attr">"log_dirs"</span>: [<span class="hljs-string">"any"</span>, <span class="hljs-string">"any"</span>]
    }
  ]
}
</code></pre>
<p><strong>What this means:</strong></p>
<ul>
<li><p>"partition": 4 – Target Partition</p>
</li>
<li><p>"replicas": [2, 3] – New placement: brokers 2 and 3</p>
</li>
<li><p>"log_dirs": ["any", "any"] – Let Kafka choose the disk directory</p>
</li>
</ul>
<p>Save this file somewhere accessible.</p>
<p>Then run the following command to copy the JSON to the Kafka cluster:</p>
<pre><code class="lang-bash">docker cp reassignment.json kafka-1:/tmp/reassignment.json
</code></pre>
<p>This copies your local file into the kafka-1 container's /tmp directory.</p>
<p>Run following command to verify the file is there:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 cat /tmp/reassignment.json
</code></pre>
<p>You should see your JSON file content.</p>
<p>Now run the actual reassignment command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 /opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --reassignment-json-file /tmp/reassignment.json \
  --execute
</code></pre>
<p>You will get a message from Kafka that will tell you if Kafka has accepted the reassignment and started moving the data.</p>
<p>You can monitor the reassignment using the following command:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it kafka-1 /opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --reassignment-json-file /tmp/reassignment.json \
  --verify
</code></pre>
<p>I’m not going to run the manual reassignment because I want to keep the imbalance and show how Cruise Control can help reduce the manual steps. Next, let’s see how Cruise Control handles the same imbalance automatically.</p>
<h3 id="heading-rebalancing-using-cruise-control">Rebalancing Using Cruise Control</h3>
<p>After creating the topic and messages, I have let Cruise Control run for a couple minutes. During that time, it collected metrics and trained its linear regression model. You can run the following command to verify if Cruise Control is running fine and it has data (following is a REST API call using curl):</p>
<pre><code class="lang-bash">curl http://localhost:9090/kafkacruisecontrol/state
</code></pre>
<p>You will get multiple JSON object outputs as part of the response. Each JSON object holds some information about the state of Cruise Control and the Kafka cluster. Let’s see each of these one at a time:</p>
<pre><code class="lang-json">MonitorState: {
  state: RUNNING(<span class="hljs-number">20.000</span>% trained),
  NumValidWindows: (<span class="hljs-number">1</span>/<span class="hljs-number">1</span>) (<span class="hljs-number">100.000</span>%),
  NumValidPartitions: <span class="hljs-number">105</span>/<span class="hljs-number">105</span> (<span class="hljs-number">100.000</span>%),
  flawedPartitions: <span class="hljs-number">0</span>
}
</code></pre>
<p>This tells about the state of monitoring based on data collected by Cruise Control:</p>
<ul>
<li><p><code>state: RUNNING(20.000% trained)</code> – Cruise Control is <strong>actively collecting metrics</strong> from your Kafka cluster. Right now it has <strong>trained its model on 20% of the expected monitoring data</strong>.</p>
</li>
<li><p><code>NumValidWindows: (1/1) (100%)</code> – Cruise Control has collected 1 complete monitoring window out of 1 required (100% ready). Remember, we had set <code>num.broker.metrics.windows=1</code> in the <code>cruisecontrol.properties</code> configuration file.</p>
</li>
<li><p><code>NumValidPartitions: 105/105 (100%)</code> – Cruise Control analyzed all 105 partitions and has metrics for <strong>all.</strong></p>
</li>
<li><p><code>flawedPartitions: 0</code> – None of the partitions have problematic or missing metrics.</p>
</li>
</ul>
<pre><code class="lang-json">ExecutorState: {state: NO_TASK_IN_PROGRESS}
</code></pre>
<p>The above response indicates the execution engine is idle – no partition moves or leadership changes are currently in progress. This makes sense since we haven't asked Cruise Control to do anything yet.</p>
<pre><code class="lang-json">AnalyzerState: {
  isProposalReady: <span class="hljs-literal">true</span>,
  readyGoals: [
    NetworkInboundCapacityGoal,
    LeaderBytesInDistributionGoal,
    DiskCapacityGoal,
    ReplicaDistributionGoal,
    RackAwareGoal,
    NetworkOutboundCapacityGoal,
    CpuCapacityGoal,
    DiskUsageDistributionGoal,
    LeaderReplicaDistributionGoal,
    ReplicaCapacityGoal
  ]
}
</code></pre>
<p>AnalyzerState tells whether Cruise Control is ready to show a proposal or not. In this case it’s ready.</p>
<ul>
<li><p><code>isProposalReady: true</code> – Cruise Control has <strong>calculated a potential rebalancing plan</strong> (a proposal) that satisfies the configured goals.</p>
</li>
<li><p><code>readyGoals</code> – These are the goals that are considered <strong>ready and valid</strong> for rebalancing. Examples:</p>
<ul>
<li><p><code>DiskCapacityGoal</code>: balance disk usage among brokers</p>
</li>
<li><p><code>ReplicaDistributionGoal</code>: balance number of replicas per broker</p>
</li>
<li><p><code>RackAwareGoal</code>: maintain replicas across racks for fault tolerance</p>
</li>
<li><p><code>LeaderBytesInDistributionGoal</code>: balance network traffic from leaders</p>
</li>
<li><p><code>DiskUsageDistributionGoal</code>: ensures partitions are spread to prevent skew</p>
</li>
</ul>
</li>
</ul>
<p>Note that these are the goals we had set earlier in the <code>cruisecontrol.properties</code> file.</p>
<pre><code class="lang-json">AnomalyDetectorState: {
  selfHealingEnabled:[],
  selfHealingDisabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY, TOPIC_ANOMALY, MAINTENANCE_EVENT],
  selfHealingEnabledRatio:{...},
  recentGoalViolations:[],
  recentBrokerFailures:[],
  recentMetricAnomalies:[],
  recentDiskFailures:[],
  recentTopicAnomalies:[],
  recentMaintenanceEvents:[],
  metrics:{...},
  ongoingSelfHealingAnomaly:None,
  balancednessScore:<span class="hljs-number">100.000</span>
}
</code></pre>
<p>Anomaly detection shows information about any existing anomaly and healing properties.</p>
<ul>
<li><p><code>selfHealingEnabled: []</code> – Automatic self-healing is <strong>currently off</strong>. Cruise Control will <strong>not move partitions automatically</strong> in response to anomalies.</p>
</li>
<li><p><code>selfHealingDisabled: [...]</code> – Lists the anomaly types that are <strong>disabled for automatic self-healing</strong>, including broker failures, disk failures, and goal violations.</p>
</li>
<li><p><code>recentGoalViolations: []</code> – No goals have been violated recently.</p>
</li>
<li><p><code>balancednessScore: 100.000</code> – This is <strong>how balanced the cluster is according to Cruise Control’s hard goals</strong>. 100% means the cluster is perfectly balanced according to the metrics and hard goals currently active. This metric only cares about Hard Goals (Disk Capacity, CPU capacity) being violated – that’s why it shows 100% even though we know there are some disk usage imbalances in our cluster.</p>
</li>
</ul>
<h4 id="heading-the-proposal">The Proposal</h4>
<p>Via AnalyzerState information, Cruise Control told us that it has a proposal for the cluster. Let’s see what it is. We can fetch the proposal using the proposal end point:</p>
<pre><code class="lang-bash">curl -s <span class="hljs-string">"http://localhost:9090/kafkacruisecontrol/proposals?json=true"</span>
</code></pre>
<p>The JSON response is quite large. Let's focus on the key parts that show our cluster's imbalance and how Cruise Control plans to fix it:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"summary"</span>: {
    <span class="hljs-attr">"numReplicaMovements"</span>: <span class="hljs-number">13</span>,    <span class="hljs-comment">// CC wants to move 13 partition replicas</span>
    <span class="hljs-attr">"numLeaderMovements"</span>: <span class="hljs-number">6</span>,      <span class="hljs-comment">// And reassign 6 partition leaders</span>
    <span class="hljs-attr">"onDemandBalancednessScoreBefore"</span>: <span class="hljs-number">84.67</span>,   <span class="hljs-comment">// Current: 84.67% balanced</span>
    <span class="hljs-attr">"onDemandBalancednessScoreAfter"</span>: <span class="hljs-number">89.76</span>.    <span class="hljs-comment">// After: 89.76% balanced</span>
  },
  <span class="hljs-attr">"goalSummary"</span>: [
    {
      <span class="hljs-attr">"goal"</span>: <span class="hljs-string">"DiskUsageDistributionGoal"</span>,
      <span class="hljs-attr">"status"</span>: <span class="hljs-string">"VIOLATED"</span>
    },
    {
      <span class="hljs-attr">"goal"</span>: <span class="hljs-string">"LeaderBytesInDistributionGoal"</span>,
      <span class="hljs-attr">"status"</span>: <span class="hljs-string">"VIOLATED"</span>
    }
  ]
}
</code></pre>
<p>Based on the calculations, Cruise Control thinks:</p>
<ol>
<li><p>Moving 13 partition replicas will help. Note that manually we decided to move just 1 partition, that is partition 4.</p>
</li>
<li><p>Reassigning 6 partition leaders will help. Manually we didn’t account for any leadership reassignment.</p>
</li>
<li><p><code>DiskUsageDistributionGoal</code> has been violated. We know that the disk usage is not distributed perfectly.</p>
</li>
<li><p><code>LeaderBytesInDistributionGoal</code> has also been violated. We couldn’t find this out manually. Technically, you could find out but it would take a decent amount of manual calculations and would still be error-prone.</p>
</li>
</ol>
<p>Note: While we're focusing on disk usage imbalance, Cruise Control optimizes for 10 different goals (disk, CPU, network, leaders, and so on). This holistic approach gives it a better chance of achieving true cluster balance versus balancing manually.</p>
<h4 id="heading-executing-the-proposal">Executing the proposal</h4>
<p>Let’s run the actual rebalancing using Cruise Control. The command is:</p>
<pre><code class="lang-bash">curl -X POST <span class="hljs-string">'http://localhost:9090/kafkacruisecontrol/rebalance?dryrun=false&amp;json=true'</span>
</code></pre>
<p>Again, you’ll get a huge JSON file similar to the proposal.</p>
<p>You can track the status using following API call:</p>
<pre><code class="lang-bash">curl <span class="hljs-string">"http://localhost:9090/kafkacruisecontrol/user_tasks"</span>
</code></pre>
<p>You will get something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768095347466/64131c47-5884-4894-95df-d46e9eb8cd97.png" alt="Cruise Control Tasks" class="image--center mx-auto" width="2152" height="344" loading="lazy"></p>
<p>Note that the 4th item in the list is our rebalance API call and it’s complete. This was quick for our small Dev cluster, but in large clusters you may see status as <code>InExecution</code>.</p>
<p>Let’s look at the UI to see what is the state of Imbalance now that Cruise Control has completed its execution of the proposal. The UI shows the following for me:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768095510743/16db64f7-d14b-4120-95c9-ac9b1d43f47e.png" alt="Kafka balanced Disk Usage" class="image--center mx-auto" width="3674" height="834" loading="lazy"></p>
<h4 id="heading-comparison">Comparison</h4>
<p>Before rebalancing:</p>
<ul>
<li><p>Broker 1: 4.52 MB, 69 partitions, 35 leaders</p>
</li>
<li><p>Broker 2: 2.22 MB, 69 partitions, 35 leaders (<strong>underutilized</strong>)</p>
</li>
<li><p>Broker 3: 5.05 MB, 72 partitions, 35 leaders (<strong>overutilized</strong>)</p>
</li>
<li><p><strong>Disk range:</strong> 2.83 MB (5.05 - 2.22)</p>
</li>
</ul>
<p>After rebalancing:</p>
<ul>
<li><p>Broker 1: 4.66 MB, 69 partitions, 38 leaders</p>
</li>
<li><p>Broker 2: 3.87 MB, 77 partitions, 31 leaders</p>
</li>
<li><p>Broker 3: 4.87 MB, 64 partitions, 36 leaders</p>
</li>
<li><p><strong>Disk range:</strong> 1.00 MB (4.87 - 3.87)</p>
</li>
</ul>
<p>Results:</p>
<ul>
<li><p><strong>Disk usage balanced</strong> – Range reduced from 2.83 MB to 1.00 MB (64% improvement!)</p>
</li>
<li><p><strong>Replicas redistributed</strong> – Broker 2 gained 8 replicas, Broker 3 lost 8 replicas</p>
</li>
<li><p><strong>Leaders balanced</strong> – Changed from 35-35-35 to 38-31-36. Cruise Control prioritized balancing actual network traffic over leader count.</p>
</li>
</ul>
<p>The cluster is now more balanced across all metrics. Congrats!</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We covered a lot in this tutorial, so let’s take a step back and look at what we did.</p>
<p>You started by experiencing the reality of manual Kafka management – the endless CLI commands, the tedious calculations, the JSON files, and the potential for costly mistakes. If you felt frustrated during that section, that’s to be expected. That frustration is exactly what thousands of engineering teams deal with every day.</p>
<p>Then you were presented with two complementary tools:</p>
<ol>
<li><p><strong>Kafka UI</strong> gave you visibility. No more grepping through command outputs or manually counting partition leaders. Everything you need, broker health, topic configurations, consumer lag is right there in a clean web interface. For small teams and development environments, this alone is a game-changer.</p>
</li>
<li><p><strong>Cruise Control</strong> gave you intelligence. It didn't just automate what you'd do manually – it also did a fundamentally better job. While you were focused on moving one partition (partition 4), Cruise Control analyzed all 105 partitions across 10 different optimization goals and proposed a comprehensive rebalancing plan. That's the difference between human effort and automated intelligence.</p>
</li>
</ol>
<p>I want to call out that this tutorial used a simplified setup. For production, you’ll expect complex configurations like”</p>
<ul>
<li><p>Kafka and Cruise Control running on separate machines</p>
</li>
<li><p>Larger monitoring window for Cruise Control</p>
</li>
<li><p>Some self healing capabilities enabled</p>
</li>
</ul>
<p>If there's one thing you take away from this article, let it be this: you should stop managing your Kafka cluster manually. You've seen there's a better way. Use it. Thanks for reading!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How Message Queues Help Make Distributed Systems More Reliable ]]>
                </title>
                <description>
                    <![CDATA[ Reliable systems consistently perform their intended functions under various conditions while minimizing downtime and failures. As internet users, we tend to take for granted that the systems that we use daily will operate reliably. In this article, ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-message-queues-make-distributed-systems-more-reliable/</link>
                <guid isPermaLink="false">671f94819cd90a859ae00b55</guid>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS SQS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ distributed system ]]>
                    </category>
                
                    <category>
                        <![CDATA[ message queue ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Anant Chowdhary ]]>
                </dc:creator>
                <pubDate>Mon, 28 Oct 2024 13:41:21 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729895479626/5d476c5d-9749-4c2a-977b-bcdd8b2b8199.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Reliable systems consistently perform their intended functions under various conditions while minimizing downtime and failures.</p>
<p>As internet users, we tend to take for granted that the systems that we use daily will operate reliably. In this article, we’ll explore how message queues enhance flexibility and fault tolerance. We’ll also discuss some challenges that we may face while using them.</p>
<p>After reading through, you’ll know how to implement reliable systems and what key performance factors to keep in mind.</p>
<h3 id="heading-prerequisites"><strong>Prerequisites</strong></h3>
<p>Before diving into this article, you should have a foundational understanding of cloud computing. Here are the key concepts:</p>
<ol>
<li><p>Basic principles of Cloud Computing</p>
</li>
<li><p>Availability in Distributed Systems</p>
</li>
<li><p>An understanding of the CAP theorem.</p>
</li>
</ol>
<h3 id="heading-table-of-contents"><strong>Table of Contents</strong></h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-does-reliability-mean-in-the-context-of-distributed-systems">Reliability in Distributed Systems</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-makes-software-reliable">What Makes Software Reliable?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-a-message-queue">What is a Message Queue?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-message-queues-help-make-distributed-systems-more-reliable">How Message Queues Help Make Distributed Systems More Reliable</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-challenges-with-message-queues">Challenges with Message Queues</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-summary">Summary</a></p>
</li>
</ol>
<h2 id="heading-what-does-reliability-mean-in-the-context-of-distributed-systems"><strong>What Does Reliability Mean in the Context of Distributed Systems?</strong></h2>
<p>Reliability, according to the OED, is “the quality of being trustworthy or of performing consistently well”. We can translate this definition to the following in the context of distributed systems:</p>
<ol>
<li><p>The ability of a technological system, device, or component to consistently and dependably perform its intended functions under various conditions over time. For instance, in the context of online banking, reliability refers to the consistent and secure processing of transactions. Users expect to complete transfers and access their accounts without errors or outages.</p>
</li>
<li><p>The system being resilient to unexpected or erroneous interactions by users / other systems interacting with it. For instance, if a user tries to access a deleted file on a cloud storage system, the system can gracefully notify them and suggest alternatives, rather than crashing.</p>
</li>
<li><p>The system performs satisfactorily under its expected conditions of operation, as well as in the case of unexpected load and/or disruptions. An example of this is a video streaming service during a major sporting event. The system is designed to perform well under normal traffic but must also handle sudden spikes in users when a popular game starts</p>
</li>
</ol>
<p>This is quite a general view of what reliability is, and the definition changes with time, as systems change with changing technology.</p>
<h2 id="heading-what-makes-software-reliable"><strong>What Makes Software Reliable?</strong></h2>
<p>There are various key components that are used industry wide to make distributed software reliable as used across large scale systems.</p>
<h3 id="heading-data-replication"><strong>Data Replication</strong></h3>
<p>Data replication is a fundamental concept in system design where data is intentionally duplicated and stored in multiple locations or servers.</p>
<p>This redundancy serves several critical purposes, including enhancing data availability, improving fault tolerance, and enabling load balancing.</p>
<p>By replicating data across different nodes or data centers, we may be able to ensure that, in the event of a hardware failure or network issue, the data remains accessible. This reduces downtime and enhances system reliability.</p>
<p>It's essential to implement replication strategies carefully, considering factors like consistency, synchronization, and conflict resolution to maintain data integrity and reliability in distributed systems.</p>
<p>Let’s look at a concrete example. With a primary-secondary database model such as one used with e-commerce websites, we may have the following:</p>
<ol>
<li><p>Replication: The primary database handles all the write operations, whereas the secondary database(s) handles all the reads. This ensures that reads are spread out across multiple databases, enhancing performance and lowering the probability of a crash.</p>
</li>
<li><p>Consistency: The system may use eventual consistency to maintain integrity, ensuring that all replicas eventually reflect the same data. But during high-traffic periods, the website may temporarily allow for slight inconsistencies, such as showing outdated inventory levels.</p>
</li>
<li><p>Conflict Resolution: If two users attempt to buy a single available item at the same time, a conflict resolution strategy may be used. For instance, the system could use timestamps to determine the customer who gets assigned the product, and this may dictate database updates eventually.</p>
</li>
</ol>
<h3 id="heading-load-distribution-across-machines"><strong>Load Distribution Across Machines</strong></h3>
<p>Load distribution involves distributing computational tasks and network traffic across multiple servers or resources to optimize performance and ensure system scalability.</p>
<p>By intelligently spreading workloads, load distribution prevents any single server from becoming overwhelmed, reducing the risk of bottlenecks and downtime.</p>
<p>Some very commonly used load distribution mechanisms are:</p>
<ol>
<li><p>Using Load Balancers: A load balancer can evenly distribute incoming traffic across multiple servers, preventing any single server from becoming a bottleneck.</p>
</li>
<li><p>Dynamic Scaling: Dynamic or auto-scaling can be used to automatically adjust the number of active servers based on current demand, adding more resources during peak times and scaling down during low traffic.</p>
</li>
<li><p>Caching: Caching layers can be used to store frequently accessed data, reducing the load on backend servers by serving requests directly from the cache.</p>
</li>
</ol>
<h3 id="heading-capacity-planning"><strong>Capacity Planning</strong></h3>
<p>Capacity planning entails analyzing factors such as expected user growth, data storage requirements, and processing capabilities to ensure that the system can handle increased loads without performance degradation or downtime.</p>
<p>By accurately forecasting resource needs and scaling infrastructure accordingly, such planning helps optimize costs, maintain reliability, and provide a seamless user experience. Being proactive can help ensure a system is well-prepared to adapt to changing requirements and remains robust and efficient throughout its lifecycle.</p>
<p>A lot of modern systems can scale automatically with projected loads. When traffic or processing requirements increase, such auto scaling automatically provisions additional resources to handle the load. Conversely, when demand decreases, it scales down resources to optimize cost efficiency.</p>
<h3 id="heading-metrics-and-automated-alerting"><strong>Metrics and Automated Alerting</strong></h3>
<p>Metrics involve collecting and analyzing data points that provide insights into various aspects of system behavior, such as resource utilization, response times, error rates, and more.</p>
<p>Automated alerting complements metrics by enabling proactive monitoring. This involves setting predefined thresholds or conditions based on metrics. When a metric crosses or exceeds these thresholds, automated alerts get triggered. These alerts can notify system administrators or operators, allowing them to take immediate action to address potential issues before they impact the system or users.</p>
<p>When used together, metrics and automated alerting create a robust monitoring and troubleshooting system, helping ensure that anomalies or problems are quickly detected and resolved.</p>
<p>Now that you know a bit about what reliability means in the context of Distributed Systems, we can move on to Message Queues.</p>
<h2 id="heading-what-is-a-message-queue"><strong>What is a Message Queue?</strong></h2>
<p>A message queue is a communication mechanism used in distributed systems to enable asynchronous communication between different components or services. It acts as an intermediary that allows one component to send a message to another without the need for direct, synchronous communication.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1701696280571/8697cb07-c765-4f9e-b709-a7a03adf3e11.png?auto=compress,format&amp;format=webp" alt="Multiple producers adding messages to a message queue that in turn are consumed by a consumer." class="image--center mx-auto" width="480" height="370" loading="lazy"></p>
<p>Above, you can see that there are multiple nodes (called Producers) that create messages that are sent to a message queue. These messages are processed by a node called the Consumer node, which may perform a series of actions (for instance database reads, or writes) as a part of each message being processed.</p>
<p>Now let’s look at an actual example where a message queue may be useful. Let’s assume we have an e-commerce website that allows millions of orders to be processed.</p>
<p>Processing an order may take place in the following steps:</p>
<ol>
<li><p>A user creates an order. This sets off a request to a web server, that in turn creates a message that is placed in the orders queue.</p>
</li>
<li><p>A consumer reads the message, and in turn calls different services while processing the message (for instance the inventory checks, the payment service, the shipping service)</p>
</li>
<li><p>Once all processing steps have completed, the consumer removes the message from the queue.</p>
</li>
</ol>
<p>Note that in case there are parts of the system that fail, the message can be left in the queue to be re-processed.</p>
<p>Even in cases where there is a total outage on the processing side of things, messages can simply pile up in the queue and be consumed once services are functional again. This is an example of a queue being useful in multiple failure scenarios.</p>
<p>Let’s look at some code for this scenario using AWS SQS, which is a popular message queue service that allows users to create queues, send messages to the queue, and also consume messages from queues for processing.</p>
<p>The below example uses <a target="_blank" href="http://boto3.amazonaws.com">Boto3</a> which is a Python Client for AWS SQS.</p>
<p>First, we’ll place an order, assuming we already have an SQS queue called OrderQueue in place.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> boto3
<span class="hljs-keyword">import</span> json

<span class="hljs-comment"># Create an SQS client</span>
sqs = boto3.client(<span class="hljs-string">'sqs'</span>)

<span class="hljs-comment"># Let's assume the queue is called OrderQueue</span>
<span class="hljs-comment"># This is the queue in which orders are placed</span>
queue_url = <span class="hljs-string">'https://sqs.us-east-1.amazonaws.com/2233334/OrderQueue'</span>

<span class="hljs-comment"># Function to send an order message</span>
<span class="hljs-comment"># This places an order in the queue, which can at any time be</span>
<span class="hljs-comment"># picked up by a consumer and then processed</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">send_order</span>(<span class="hljs-params">order_details</span>):</span>
    message_body = json.dumps(order_details)
    response = sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=message_body
    )
    print(<span class="hljs-string">f'Order sent with ID: <span class="hljs-subst">{response[<span class="hljs-string">"MessageId"</span>]}</span>'</span>)

<span class="hljs-comment"># Using the queue to place an order</span>
<span class="hljs-comment"># Defining a sample order</span>

order = {
    <span class="hljs-string">'order_id'</span>: <span class="hljs-string">'12345'</span>,
    <span class="hljs-string">'customer_id'</span>: <span class="hljs-string">'67890'</span>,
    <span class="hljs-string">'items'</span>: [
        {<span class="hljs-string">'product_id'</span>: <span class="hljs-string">'abc123'</span>, <span class="hljs-string">'quantity'</span>: <span class="hljs-number">2</span>},
        {<span class="hljs-string">'product_id'</span>: <span class="hljs-string">'xyz456'</span>, <span class="hljs-string">'quantity'</span>: <span class="hljs-number">1</span>}
    ],
    <span class="hljs-string">'total_price'</span>: <span class="hljs-number">59.99</span>
}

<span class="hljs-comment"># Sending the order to the queue which is expected to be picked up </span>
<span class="hljs-comment"># by a consumer and processed eventually.</span>
send_order(order)
</code></pre>
<p>Then once the order has been placed, here’s some code that illustrates how it’ll be picked up for processing:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> boto3
<span class="hljs-keyword">import</span> json

<span class="hljs-comment"># Create an SQS client</span>
sqs = boto3.client(<span class="hljs-string">'sqs'</span>)

<span class="hljs-comment"># Processing orders from the same queue defined above</span>
queue_url = <span class="hljs-string">'https://sqs.us-east-1.amazonaws.com/2233334/OrderQueue'</span>

<span class="hljs-comment"># Function to receive and process orders</span>
<span class="hljs-comment"># Picking up a maximum of 10 messages at a time to process</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">receive_orders</span>():</span>
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=<span class="hljs-number">10</span>,  <span class="hljs-comment"># Up to 10 messages</span>
        WaitTimeSeconds=<span class="hljs-number">10</span>
    )

    messages = response.get(<span class="hljs-string">'Messages'</span>, [])

    <span class="hljs-keyword">for</span> message <span class="hljs-keyword">in</span> messages:
        order_details = json.loads(message[<span class="hljs-string">'Body'</span>])
        print(<span class="hljs-string">f'Processing order: <span class="hljs-subst">{order_details}</span>'</span>)

        <span class="hljs-comment"># Processing the order with details such as </span>
        <span class="hljs-comment"># processing payments, updating the inventory levels,</span>
        <span class="hljs-comment"># processing shipping etc.</span>

        <span class="hljs-comment"># Delete the message after processing</span>
        <span class="hljs-comment"># This is important since we don't want an</span>
        <span class="hljs-comment"># order to be processed multiple times.</span>
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message[<span class="hljs-string">'ReceiptHandle'</span>]
        )

<span class="hljs-comment"># Receive a batch of orders</span>
receive_orders()
</code></pre>
<h3 id="heading-what-is-an-intermediary-in-a-distributed-system"><strong>What is an Intermediary in a Distributed System?</strong></h3>
<p>In the context of what we’re discussing here, a message queue is an intermediary. Quoting Amazon AWS’ definition of a message queue:</p>
<blockquote>
<p>“<a target="_blank" href="https://aws.amazon.com/sqs/"><strong><em>Amazon Simple Queue Service (Amazon SQS)</em></strong></a> <em>lets you send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available</em>.”</p>
</blockquote>
<p>This is a wonderfully succinct and accurate description of why a message queue (an intermediary) is important.</p>
<p>In a message queue, messages are placed in a queue data structure, which you can think of as a temporary storage area. The producer places messages in the queue, and the consumer retrieves and processes them at its own pace. This decoupling of producers and consumers allows for greater flexibility, scalability, and fault tolerance in distributed systems.</p>
<h2 id="heading-how-message-queues-help-make-distributed-systems-more-reliable">How Message Queues Help Make Distributed Systems More Reliable</h2>
<p>Now let's discuss how Message Queues help make Distributed Systems more reliable.</p>
<h3 id="heading-1-message-queues-provide-flexibility"><strong>1. Message Queues Provide Flexibility</strong></h3>
<p>Message queues allow for <a target="_blank" href="https://en.wikipedia.org/wiki/Asynchrony_\(computer_programming\)"><strong>asynchronous communication</strong></a> between components. This means that producers can send messages to the queue without waiting for immediate processing by consumers. This allows components to work independently and at their own pace, providing flexibility in terms of processing times. So this is a great way to make designs flexible, and as self contained as possible.</p>
<h3 id="heading-2-message-queues-make-systems-scalable"><strong>2. Message Queues Make Systems Scalable</strong></h3>
<p>Message queues are often the bread and butter of scalable distributed systems for the following reasons:</p>
<ol>
<li><p>Multiple producers can add messages to a message queue. This raises the ceiling and allows us to easily horizontally scale applications.</p>
</li>
<li><p>Multiple consumers can read from a message queue. This again allows us to easily scale throughput if needed in a lot of scenarios.</p>
</li>
</ol>
<h3 id="heading-3-message-queues-make-systems-fault-tolerant"><strong>3. Message Queues Make Systems Fault Tolerant</strong></h3>
<p>What happens if a distributed system is overwhelmed? We sometimes need to have the ability to <em>cut the cord</em> in order to get the system back to a working state. We’d ideally want the ability to process requests that weren’t processed when the system was down.</p>
<p>This is exactly what a message queue can help us with. We may have hundreds of thousands of requests that weren’t processed, but are still in the queue. These can be processed once our system is back online.</p>
<h2 id="heading-challenges-with-message-queues"><strong>Challenges with Message Queues</strong></h2>
<p>As with life, using message queues in distributed systems isn’t a silver bullet to scaling problems.</p>
<p>Here are some situations where message queues may be useful:</p>
<ol>
<li><p>Asynchronous Processing: Messages queues are generally an excellent choice in infrastructure wherever asynchronous processing is required. In workflows such as sending confirmation emails or generating reports after an order is placed, message queues can decouple these tasks from the primary application flow.</p>
</li>
<li><p>Load Balancing: As we saw in our example for message queues, in scenarios where traffic spikes occur, message queues can buffer incoming requests, allowing multiple consumers to process messages concurrently. This helps distribute the load evenly across available resources.</p>
</li>
<li><p>Fault Tolerance: In systems where reliability is crucial, message queues provide a mechanism for handling failures. If a service is temporarily down, messages can be retained in the queue until the service is available again, ensuring that no data is lost unless intended.</p>
</li>
</ol>
<p>Here are a some situations where message queues may not be useful:</p>
<ol>
<li><p>Message queues can be great in scenarios where ordering of messages does not matter. But in situations where order does matter, they can sometimes be slow and more expensive to use.</p>
</li>
<li><p>Designing systems with queues that have multiple consumers isn’t trivial. What happens if a message is processed twice? Is <a target="_blank" href="https://en.wikipedia.org/wiki/Idempotence"><strong>idempotency</strong></a> a requirement? Or does it break our use case? These complexities can often lead us to situations where message queues may not be the best solution.</p>
</li>
</ol>
<h2 id="heading-summary"><strong>Summary</strong></h2>
<p>In this article, you learned about reliability in distributed systems, and how message queues can help make such systems more reliable. Here’s a summary of the key takeaways:</p>
<ol>
<li><p>Reliability is central to distributed systems and there are a few common ways this is handled across the tech industry. Data replication, load distribution, and capacity planning are some ways that can improve the reliability of a system.</p>
</li>
<li><p>Message Queues are intermediaries that can store messages from producers. They can be picked up by consumers at a rate that's generally independent of the rate of production.</p>
</li>
<li><p>Queues are flexible, allowing us to immediately stem the flow of unwanted event processing in case of an unforeseen event.</p>
</li>
<li><p>Despite the versatility of message queues, they're not a panacea for reliability issues. There are often multiple considerations to be kept in mind while processing messages in a message queue.</p>
</li>
</ol>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
