distributed system - freeCodeCamp.org

Service-to-Service Communication: When to Use REST, gRPC, and Event-Driven Messaging

Abisoye Alli-Balogun — Tue, 14 Apr 2026 20:37:44 +0000

The communication layer is one of the few architectural decisions that touches everything in your apps. It determines your latency floor, how independently teams can deploy, how failures propagate, and how much pain you feel every time a contract needs to change.

There are three dominant patterns: REST over HTTP, gRPC with Protocol Buffers, and event-driven messaging through a broker. Most production systems use a mix of all three. The skill is knowing which pattern fits which interaction.

In this article, you'll learn the core mechanics of each communication style, the real trade-offs between them across five dimensions (latency, coupling, schema evolution, debugging, and operational complexity), and a decision framework for choosing the right pattern for each service interaction.

Prerequisites

To get the most out of this article, you should be familiar with:

Basic HTTP concepts (request/response, status codes, headers)
Working with APIs in any backend language (the examples use TypeScript and Node.js)
General understanding of microservices architecture
Familiarity with JSON as a data interchange format

Table of Contents

The Three Patterns at a Glance
REST: The Default Choice
gRPC: The Performance Choice
Event-Driven Messaging: The Decoupling Choice
The Five Trade-Off Dimensions
The Decision Framework
Hybrid Architectures: Using All Three
Schema Governance at Scale
Conclusion

The Three Patterns at a Glance

Before diving deep, here's the landscape:

	REST	gRPC	Event-Driven
Communication	Synchronous	Synchronous (+ streaming)	Asynchronous
Protocol	HTTP/1.1 or HTTP/2	HTTP/2	Broker-dependent (TCP)
Serialization	JSON (typically)	Protocol Buffers (binary)	JSON, Avro, Protobuf
Coupling	Request-time	Request-time + schema	Temporal decoupling
Best for	Public APIs, CRUD	Internal high-throughput	Workflows, event sourcing

Each has strengths, and none is universally better. The rest of this article explores why.

REST: The Default Choice

REST over HTTP is the most widely understood communication pattern. Services expose resources at URL endpoints, and clients interact through standard HTTP methods.

// Order service calls the inventory service
async function checkInventory(productId: string): Promise<StockResponse> {
  const response = await fetch(
    `https://inventory-service/api/v1/products/${productId}/stock`,
    {
      method: "GET",
      headers: {
        // A GET has no body, so advertise the expected response type instead
        Accept: "application/json",
        Authorization: `Bearer ${getServiceToken()}`,
      },
    }
  );

  if (!response.ok) {
    throw new HttpError(response.status, await response.text());
  }

  return response.json();
}

Where REST Excels

Every language, framework, and platform speaks HTTP. Your frontend, your mobile app, your partner integrations, and your internal services can all use the same protocol.

The tooling is also mature: load balancers, API gateways, caching proxies, and debugging tools all understand HTTP natively.

It's also relatively simple. A new developer can read a REST call and understand what it does. The URL describes the resource. The HTTP method describes the action. The status code describes the outcome. There's no schema compilation step, no code generation, and no special client library required.

Beyond this, HTTP has built-in caching semantics. A GET /products/123 response with a Cache-Control: max-age=60 header can be cached by every proxy between the caller and the server. gRPC and event-driven patterns have no equivalent built-in mechanism.

// REST response with cache headers
app.get("/api/v1/products/:id", async (req, res) => {
  const product = await getProduct(req.params.id);

  res.set("Cache-Control", "public, max-age=60");
  res.set("ETag", computeETag(product));

  res.json(product);
});
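
The ETag header only pays off when the server also answers conditional requests. Here is a minimal sketch of that check. It assumes a SHA-256-based computeETag (the article doesn't show its implementation) and a hypothetical conditionalGet helper that returns a status and body instead of writing to an Express response. It also compares a single ETag value, whereas a real If-None-Match header may carry a list or `*`.

```typescript
import { createHash } from "node:crypto";

interface Product {
  id: string;
  name: string;
  price: number;
}

// Assumption: a strong ETag derived from the serialized representation.
function computeETag(product: Product): string {
  const digest = createHash("sha256")
    .update(JSON.stringify(product))
    .digest("hex");
  return `"${digest.slice(0, 16)}"`;
}

interface ConditionalResult {
  status: 200 | 304;
  headers: Record<string, string>;
  body?: Product;
}

// Hypothetical helper: picks 304 (no body) when the client's copy is current.
function conditionalGet(product: Product, ifNoneMatch?: string): ConditionalResult {
  const etag = computeETag(product);
  const headers = { "Cache-Control": "public, max-age=60", ETag: etag };

  if (ifNoneMatch === etag) {
    return { status: 304, headers };
  }
  return { status: 200, headers, body: product };
}
```

A client that sends back the ETag it received gets a 304 with no body until the product changes, so revalidation costs a round-trip but not a payload.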

Where REST Falls Short

REST's resource-oriented model often requires multiple round-trips to assemble a response. Fetching an order with its items, customer details, and shipping status might mean three separate HTTP calls. Each call adds network latency and serialization cost, plus connection setup whenever a kept-alive connection can't be reused.

// Three sequential calls to build one view
// OrderDetails is an assumed name for the { order, customer, shipment } shape
async function getOrderDetails(orderId: string): Promise<OrderDetails> {
  const order = await fetch(`/api/orders/${orderId}`).then((r) => r.json());
  const customer = await fetch(`/api/customers/${order.customerId}`).then((r) => r.json());
  const shipment = await fetch(`/api/shipments/${order.shipmentId}`).then((r) => r.json());

  return { order, customer, shipment };
}

You can mitigate this with composite endpoints or GraphQL, but that adds complexity. gRPC handles this more naturally with message composition.
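
Even without a composite endpoint, part of the cost can be recovered on the client. The customer and shipment lookups both depend on the order, but not on each other, so they can run concurrently. The sketch below assumes an injected fetchJson function and illustrative Order, Customer, and Shipment shapes (none of these are defined in the article) so that it runs without a network.

```typescript
interface Order {
  id: string;
  customerId: string;
  shipmentId: string;
}
interface Customer {
  id: string;
  name: string;
}
interface Shipment {
  id: string;
  status: string;
}

// Hypothetical transport abstraction; in production this would wrap fetch().
type FetchJson = <T>(url: string) => Promise<T>;

async function getOrderDetails(orderId: string, fetchJson: FetchJson) {
  // The order must come first: the other two URLs are built from it.
  const order = await fetchJson<Order>(`/api/orders/${orderId}`);

  // These two calls are independent, so issue them at the same time.
  const [customer, shipment] = await Promise.all([
    fetchJson<Customer>(`/api/customers/${order.customerId}`),
    fetchJson<Shipment>(`/api/shipments/${order.shipmentId}`),
  ]);

  return { order, customer, shipment };
}
```

This brings the wait down from three sequential round-trips to two, at the price of failing the whole call if either parallel request rejects.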

The serialization overhead is also an issue. JSON is human-readable but expensive to parse. For high-throughput internal communication where nobody reads the payloads, you are paying a tax in CPU and bandwidth for readability you do not need.

Finally, there's no streaming. Standard REST is request-response. If you need the server to push updates to the client (real-time order tracking, live metrics), REST requires workarounds like polling, Server-Sent Events, or WebSockets. None of these are part of the REST model itself.

gRPC: The Performance Choice

gRPC is a Remote Procedure Call framework built on HTTP/2 and Protocol Buffers. Instead of URLs and JSON, you define services and messages in .proto files, and the framework generates strongly-typed client and server code.

Defining the Contract

// inventory.proto
syntax = "proto3";

package inventory;

import "google/protobuf/timestamp.proto";

service InventoryService {
  // Unary: single request, single response
  rpc CheckStock(StockRequest) returns (StockResponse);

  // Server streaming: single request, stream of responses
  rpc WatchStockLevels(WatchRequest) returns (stream StockUpdate);

  // Client streaming: stream of requests, single response
  rpc BulkUpdateStock(stream StockAdjustment) returns (BulkUpdateResult);
}

message StockRequest {
  string product_id = 1;
  string warehouse_id = 2;
}

message StockResponse {
  string product_id = 1;
  int32 available = 2;
  int32 reserved = 3;
  google.protobuf.Timestamp last_updated = 4;
}

message StockUpdate {
  string product_id = 1;
  int32 available = 2;
  string warehouse_id = 3;
}

// WatchRequest, StockAdjustment, and BulkUpdateResult are omitted for brevity.

After running protoc (the Protocol Buffer compiler), you get generated client and server stubs in your target language. The TypeScript client looks like this:

import { InventoryServiceClient } from "./generated/inventory";
import { credentials } from "@grpc/grpc-js";

const client = new InventoryServiceClient(
  "inventory-service:50051",
  credentials.createInsecure()
);

async function checkStock(productId: string): Promise<StockResponse> {
  return new Promise((resolve, reject) => {
    client.checkStock(
      { productId, warehouseId: "warehouse-eu-1" },
      (error, response) => {
        if (error) reject(error);
        else resolve(response);
      }
    );
  });
}

Where gRPC Excels

Protocol Buffers serialize to a compact binary format. A message that is 1 KB as JSON might be 300 bytes as Protobuf. Combined with HTTP/2 multiplexing (multiple concurrent requests over a single TCP connection), gRPC typically delivers lower latency and higher throughput than REST for internal service calls.

Metric	REST (JSON over HTTP/1.1)	gRPC (Protobuf over HTTP/2)
Serialization size	Larger (text-based JSON)	Significantly smaller (binary Protobuf)
Serialization time	Slower (JSON parse/stringify)	Faster (binary encode/decode)
Requests per connection	One in flight at a time (pipelining is rarely enabled)	Many, multiplexed as HTTP/2 streams
Connection overhead	Keep-alive reuse, but concurrency needs a pool of connections	A single persistent connection carries concurrent calls

The exact improvement depends on payload size, network topology, and server implementation. In benchmarks, the difference ranges from marginal (tiny payloads on fast networks) to an order of magnitude (large payloads, high concurrency).

The takeaway: gRPC's binary serialization and HTTP/2 multiplexing give it a structural advantage for internal traffic, but you should measure in your own environment before making latency claims.
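
To make the size difference concrete, here is a hand-rolled sketch of the Protobuf wire format for the first three StockResponse fields. It is an illustration only: real services use generated code, and this version assumes non-negative integers that fit in 32 bits and follows the proto3 convention of omitting default values.

```typescript
interface StockResponse {
  productId: string;
  available: number;
  reserved: number;
}

// Base-128 varint: 7 payload bits per byte, high bit set on all but the last.
function encodeVarint(value: number): number[] {
  if (!Number.isInteger(value) || value < 0 || value > 0xffffffff) {
    throw new RangeError("This sketch only handles unsigned 32-bit integers");
  }
  const bytes: number[] = [];
  let remaining = value;
  while (remaining > 0x7f) {
    bytes.push((remaining & 0x7f) | 0x80);
    remaining = Math.floor(remaining / 128);
  }
  bytes.push(remaining);
  return bytes;
}

function encodeStockResponse(msg: StockResponse): Uint8Array {
  const out: number[] = [];
  if (msg.productId !== "") {
    const utf8 = Array.from(new TextEncoder().encode(msg.productId));
    // Tag 0x0a = field 1, wire type 2 (length-delimited)
    out.push(0x0a, ...encodeVarint(utf8.length), ...utf8);
  }
  if (msg.available !== 0) {
    // Tag 0x10 = field 2, wire type 0 (varint)
    out.push(0x10, ...encodeVarint(msg.available));
  }
  if (msg.reserved !== 0) {
    // Tag 0x18 = field 3, wire type 0 (varint)
    out.push(0x18, ...encodeVarint(msg.reserved));
  }
  return Uint8Array.from(out);
}

const sample: StockResponse = { productId: "abc-123", available: 42, reserved: 3 };
const binarySize = encodeStockResponse(sample).length;
const jsonSize = new TextEncoder().encode(JSON.stringify(sample)).length;
console.log({ binarySize, jsonSize });
```

The saving comes from replacing field names with one-byte tags and numbers with varints. For string-heavy payloads the gap narrows, which is one reason to measure with your own messages.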

Also, gRPC natively supports four communication patterns: unary (request-response), server streaming, client streaming, and bidirectional streaming. This makes it a natural fit for real-time use cases like live stock updates, log tailing, or progress reporting.

// Server streaming: watch inventory changes in real time
function watchStockLevels(warehouseId: string): void {
  const stream = client.watchStockLevels({ warehouseId });

  stream.on("data", (update: StockUpdate) => {
    console.log(`Product ${update.productId}: ${update.available} available`);
  });

  stream.on("error", (error) => {
    console.error("Stream error:", error.message);
    // Reconnect logic here
  });

  stream.on("end", () => {
    console.log("Stream ended");
  });
}
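
The "Reconnect logic here" comment hides a decision: how long to wait before resubscribing. A common choice is exponential backoff with jitter, sketched below. The function name, the default base and cap, and the injectable random source are all assumptions for illustration, not part of @grpc/grpc-js.

```typescript
// Exponential backoff with "full jitter": wait a random time between 0 and
// min(maxMs, baseMs * 2^attempt). The random source is injectable for testing.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  maxMs: number = 30000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

In the error handler, you would schedule the next watchStockLevels call with setTimeout and backoffDelayMs(attempt), incrementing attempt on each failure and resetting it to zero once data arrives. The jitter keeps many clients from reconnecting at the same instant after an outage.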

Finally, it has strong typing across services. The .proto file is the single source of truth. Both the client and server are generated from it. If the inventory service changes the StockResponse message incompatibly, the order service's build fails as soon as it regenerates its client. You catch breaking changes at compile time, not at 3 AM.

Where gRPC Falls Short

The first key issue is browser support. Browsers can't make native gRPC calls because the browser's fetch API doesn't expose the HTTP/2 framing that gRPC requires (for example, trailers for status codes and fine-grained control over bidirectional streams).

You need gRPC-Web, which uses a proxy like Envoy to translate between the browser-compatible subset of gRPC and the full protocol. Alternatively, you can place a REST or GraphQL gateway in front of your gRPC services.

Either way, gRPC isn't a viable choice for any endpoint that a browser calls directly — which is why the decision framework in this article defaults to REST for public-facing APIs.

It's also difficult to debug. You can't curl a gRPC endpoint. The binary payloads aren't human-readable. Tools like grpcurl and Postman's gRPC support help, but the debugging experience is worse than inspecting a JSON response in a browser's network tab.

# Debugging a REST endpoint
curl -s https://inventory-service/api/v1/products/abc-123/stock | jq

# Debugging a gRPC endpoint (requires grpcurl, plus server reflection or the schema passed via -proto)
grpcurl -plaintext -d '{"product_id": "abc-123"}' \
  inventory-service:50051 inventory.InventoryService/CheckStock

Finally, operational overhead is an issue. You need to manage .proto files, run code generation in your build pipeline, version your proto definitions, and ensure all consumers regenerate when schemas change.

For a team with two services, this is manageable. For twenty services, you need a proto registry and a governance process.

Event-Driven Messaging: The Decoupling Choice

Event-driven communication flips the model. Instead of service A calling service B directly, service A publishes an event to a broker (Kafka, RabbitMQ, Amazon SNS/SQS, or similar), and service B consumes it asynchronously.

// Order service publishes an event after confirming an order
import { Kafka } from "kafkajs";

const kafka = new Kafka({ brokers: ["kafka:9092"] });
const producer = kafka.producer();
// Call `await producer.connect()` once at startup before the first send

async function publishOrderConfirmed(order: Order): Promise<void> {
  await producer.send({
    topic: "order.confirmed",
    messages: [
      {
        key: order.id,
        value: JSON.stringify({
          eventType: "order.confirmed",
          eventId: crypto.randomUUID(),
          timestamp: new Date().toISOString(),
          data: {
            orderId: order.id,
            customerId: order.customerId,
            items: order.items.map((item) => ({
              productId: item.productId,
              quantity: item.quantity,
            })),
            total: order.total,
          },
        }),
      },
    ],
  });
}

// Inventory service consumes the event independently
const consumer = kafka.consumer({ groupId: "inventory-service" });

async function startInventoryConsumer(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topic: "order.confirmed" });

  await consumer.run({
    eachMessage: async ({ message }) => {
      // message.value is null for tombstone records
      if (!message.value) return;
      const event = JSON.parse(message.value.toString());

      for (const item of event.data.items) {
        await decrementStock(item.productId, item.quantity);
      }

      logger.info("Inventory updated for order", {
        orderId: event.data.orderId,
      });
    },
  });
}

Where Event-Driven Excels

First, it employs temporal decoupling. The producer doesn't wait for the consumer. The order service publishes "order confirmed" and moves on. If the inventory service is down, the event sits in the broker until it recovers (within the broker's retention limits). The producer needs no timeout or retry logic aimed at the consumer, so a consumer outage doesn't cascade back to it.

One event can also trigger multiple independent reactions. When an order is confirmed, the inventory service decrements stock, the notification service sends a confirmation email, the analytics service records a conversion, and the shipping service starts fulfillment. The order service doesn't know or care about any of these consumers.

order.confirmed event
  ├── inventory-service    → Decrement stock
  ├── notification-service → Send confirmation email
  ├── analytics-service    → Record conversion
  └── shipping-service     → Create shipment

Adding a new consumer requires zero changes to the producer. This is the lowest coupling you can achieve between services.
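
The fan-out property can be shown with an in-memory stand-in for a topic. This is a sketch, not a broker: InMemoryTopic is a hypothetical class, delivery here is synchronous, and a real broker would also persist events and isolate consumers from each other's failures. What it does show is that publish never changes when a consumer is added.

```typescript
type Handler<T> = (event: T) => void;

// Hypothetical in-memory topic: one publish call, any number of subscribers.
class InMemoryTopic<T> {
  private handlers = new Map<string, Handler<T>>();

  subscribe(consumerName: string, handler: Handler<T>): void {
    this.handlers.set(consumerName, handler);
  }

  publish(event: T): void {
    this.handlers.forEach((handler) => handler(event));
  }
}

interface OrderConfirmed {
  orderId: string;
}

const orderConfirmed = new InMemoryTopic<OrderConfirmed>();
const log: string[] = [];

orderConfirmed.subscribe("inventory-service", (e) => log.push(`decrement stock for ${e.orderId}`));
orderConfirmed.subscribe("notification-service", (e) => log.push(`email for ${e.orderId}`));

// The producer's code: unaware of who is listening.
orderConfirmed.publish({ orderId: "order-1" });

// Adding a consumer later touches only the consumer side.
orderConfirmed.subscribe("analytics-service", (e) => log.push(`conversion for ${e.orderId}`));
orderConfirmed.publish({ orderId: "order-2" });
```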

There's also a natural audit trail. If your broker retains events (Kafka keeps them for a configurable retention period, seven days by default, and can be set to keep them indefinitely), you have a history of everything that happened within that window. You can replay events to rebuild state, debug issues by examining the exact sequence of events, or spin up a new service that processes historical events to backfill its data.

Where Event-Driven Falls Short

After the order service publishes "order confirmed," there's a window where the inventory service hasn't yet processed the event. During that window, a concurrent request might read stale stock levels. If your use case requires "read your own writes" consistency, event-driven communication alone is not enough.

// The problem: order confirmed, but stock not yet decremented
async function handleCheckout(cart: Cart): Promise<Order> {
  const order = await createOrder(cart);
  await publishOrderConfirmed(order);

  // If another request checks stock RIGHT NOW,
  // it sees the old (pre-decrement) value.
  // The inventory consumer hasn't processed the event yet.
  return order;
}

Debugging also gets more complex. When something goes wrong in a synchronous call chain, you get a stack trace. When something goes wrong in an event-driven flow, you get a message in a dead-letter queue and a question: which producer sent this? When? What was the system state at that time? Distributed tracing helps, but correlating events across services is fundamentally harder than following a request through a call stack.

You can also have issues with ordering guarantees. Most brokers guarantee ordering within a partition (Kafka) or a queue, but not globally. If the order service publishes "order confirmed" and then "order cancelled," the inventory service might process the cancellation first if the events land on different partitions.

// Use a consistent partition key to guarantee ordering per entity
await producer.send({
  topic: "order.events",
  messages: [
    {
      // All events for the same order go to the same partition
      key: order.id,
      value: JSON.stringify(event),
    },
  ],
});

Keying messages by entity ID (order ID, customer ID) ensures events for the same entity are processed in order. Events for different entities can be processed in parallel.
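
Why the key matters: the partitioner hashes the key and takes it modulo the partition count, so equal keys always land in the same partition. The sketch below uses FNV-1a as a stand-in hash. Kafka's Java client uses murmur2, so actual partition numbers will differ, but the property being illustrated is the same.

```typescript
// Stand-in hash (FNV-1a, 32-bit). Not the hash Kafka uses.
function fnv1a(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

function partitionFor(key: string, numPartitions: number): number {
  return fnv1a(key) % numPartitions;
}

interface OrderEvent {
  orderId: string;
  type: string;
}

// Simulates appending events to per-partition logs in publish order.
function groupByPartition(
  events: OrderEvent[],
  numPartitions: number
): Map<number, OrderEvent[]> {
  const partitions = new Map<number, OrderEvent[]>();
  events.forEach((event) => {
    const partition = partitionFor(event.orderId, numPartitions);
    const partitionLog = partitions.get(partition) ?? [];
    partitionLog.push(event);
    partitions.set(partition, partitionLog);
  });
  return partitions;
}
```

Because both events for a given order hash to the same partition, "confirmed" is always appended before "cancelled" in that partition's log, whatever happens to other orders.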

Finally, your operations get more complex. Running a message broker isn't free. Kafka requires a metadata quorum (KRaft in current releases, ZooKeeper in older ones), topic management, partition rebalancing, consumer group coordination, and monitoring for consumer lag. Managed services like Amazon MSK, Confluent Cloud, or Amazon SQS reduce this burden but add cost.

Handling Broker Failures

What happens when the broker is unavailable? If your service writes to the database and then publishes an event, a broker outage means the event is lost even though the database write succeeded.

These patterns help:

1. The Outbox Pattern

Instead of publishing directly to the broker, write the event to an "outbox" table in the same database transaction as your business data. A separate process (a poller or a change-data-capture connector like Debezium) reads the outbox table and publishes to the broker.

// Outbox pattern: write event to the database, not the broker
// db injected via dependency injection
async function confirmOrder(order: Order, db: Database): Promise<void> {
  await db.transaction(async (tx) => {
    // Business write and event write in the same transaction
    await tx.update("orders", { id: order.id, status: "confirmed" });
    await tx.insert("outbox", {
      id: crypto.randomUUID(),
      topic: "order.confirmed",
      key: order.id,
      payload: JSON.stringify({
        orderId: order.id,
        customerId: order.customerId,
        items: order.items,
        total: order.total,
      }),
      created_at: new Date(),
    });
  });
  // A separate relay process picks up outbox rows and publishes to Kafka
}

Because the event and the business data are written atomically, you never lose an event due to a broker outage. The relay process retries until the broker is back.
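
The relay side can be sketched as a single polling step. Everything here is an assumption for illustration: the OutboxRow shape with a publishedAt column, the injected publish function standing in for a Kafka producer, and the in-memory array standing in for the outbox table.

```typescript
interface OutboxRow {
  id: string;
  topic: string;
  key: string;
  payload: string;
  publishedAt: Date | null;
}

// Stand-in for a broker client; rejects when the broker is unreachable.
type Publish = (row: OutboxRow) => Promise<void>;

// One polling pass: publish pending rows in insertion order and mark each one
// only after the broker has accepted it. Returns how many rows were published.
async function relayOutboxOnce(rows: OutboxRow[], publish: Publish): Promise<number> {
  let published = 0;
  for (const row of rows) {
    if (row.publishedAt !== null) continue;
    try {
      await publish(row);
    } catch {
      // Broker unavailable: stop here so later rows don't overtake this one.
      break;
    }
    row.publishedAt = new Date();
    published++;
  }
  return published;
}
```

If the relay crashes after publishing but before marking the row, the next pass publishes the same event again. That is at-least-once delivery on the producer side, which is why the consumer still has to tolerate duplicates.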

2. At-least-once delivery

Most brokers guarantee at-least-once delivery, meaning consumers may see the same event more than once (for example, after a rebalance or a retry). Your consumers must be idempotent: processing the same event twice should produce the same result as processing it once.

// Idempotent consumer: use the eventId to deduplicate
async function handleOrderConfirmed(event: EventEnvelope): Promise<void> {
  const alreadyProcessed = await db.query(
    "SELECT 1 FROM processed_events WHERE event_id = $1",
    [event.eventId]
  );

  if (alreadyProcessed.rows.length > 0) {
    logger.info("Duplicate event, skipping", { eventId: event.eventId });
    return;
  }

  await db.transaction(async (tx) => {
    await decrementStock(tx, event.data.items);
    await tx.insert("processed_events", {
      event_id: event.eventId,
      processed_at: new Date(),
    });
  });
}

The combination of the outbox pattern (producer side) and idempotent consumers (consumer side) gives you reliable event-driven communication even when the broker has intermittent failures.
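
The effect of idempotency is easiest to see when the same event is delivered twice. The sketch below replaces the database with an in-memory Set and Map; InventoryProjection and its event shape are hypothetical names. In a real database, a unique constraint on the event ID column plays the role of the Set and also protects against two concurrent deliveries of the same event.

```typescript
interface OrderConfirmedEvent {
  eventId: string;
  data: { items: { productId: string; quantity: number }[] };
}

// Hypothetical in-memory consumer state.
class InventoryProjection {
  private processed = new Set<string>();
  readonly stock: Map<string, number>;

  constructor(initialStock: [string, number][]) {
    this.stock = new Map(initialStock);
  }

  // Returns true if the event was applied, false if it was a duplicate.
  handle(event: OrderConfirmedEvent): boolean {
    if (this.processed.has(event.eventId)) {
      return false;
    }
    for (const item of event.data.items) {
      const current = this.stock.get(item.productId) ?? 0;
      this.stock.set(item.productId, current - item.quantity);
    }
    this.processed.add(event.eventId);
    return true;
  }
}

const inventory = new InventoryProjection([["abc-123", 10]]);
const confirmed: OrderConfirmedEvent = {
  eventId: "evt-1",
  data: { items: [{ productId: "abc-123", quantity: 2 }] },
};

inventory.handle(confirmed); // applied
inventory.handle(confirmed); // redelivery: ignored
```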

The Five Trade-Off Dimensions

Choosing a communication pattern isn't about which is "best." It's about which trade-offs you can accept for each specific interaction. Here are the five dimensions that matter most.

1. Latency

Pattern	Relative Latency	Why
gRPC	Lowest	Binary serialization, HTTP/2 multiplexing, persistent connections
REST	Low-moderate	JSON parsing overhead, typically one in-flight request per HTTP/1.1 connection
Event-driven	Highest (by design)	Broker write, replication, consumer poll interval

Exact numbers depend on payload size, network hops, and infrastructure. The structural ordering is consistent: gRPC is fastest for synchronous calls, REST is close behind, and event-driven messaging trades latency for decoupling.

If the caller needs an immediate response (user-facing checkout, real-time search), use gRPC or REST. If the caller doesn't need the result right now (send email, update analytics), use events.

2. Coupling

Coupling has two dimensions: temporal (does the caller wait for the receiver?) and schema (do they share a contract?).

Pattern	Temporal Coupling	Schema Coupling
REST	High (caller blocks)	Low (JSON is flexible)
gRPC	High (caller blocks)	High (shared `.proto` files)
Event-driven	None (fire and forget)	Medium (shared event schema)

REST's loose typing is a double-edged sword. You can add fields to a JSON response without breaking consumers (additive changes are safe). But you can also accidentally remove a field, and the consumer fails at runtime instead of compile time.

gRPC's strict typing catches breaking changes at build time, but it means every schema change requires regenerating clients. For two services, this is trivial. For twenty services consuming the same proto, you need a coordination process.

Event-driven messaging decouples in time but still couples on the event schema. If the order.confirmed event changes its structure, every consumer must handle both the old and new format during the transition.

3. Schema Evolution

Schema evolution is how you change the contract between services without breaking existing consumers. This is where the three patterns diverge most sharply.

REST (JSON):

// Version 1: price as a number
{ "productId": "abc-123", "price": 49.99 }

// Version 2: price as an object (breaking change)
{ "productId": "abc-123", "price": { "amount": 49.99, "currency": "USD" } }

JSON has no built-in versioning. You manage it through one of three strategies:

Strategy	How It Works	Trade-offs
URL versioning (`/api/v1/` vs `/api/v2/`)	Each version is a separate endpoint. Consumers opt in to the new version explicitly.	Simplest to understand. Duplicates route handlers. Hard to sunset old versions when many consumers pin to `/v1/`.
Header versioning (`Accept: application/vnd.myapi.v2+json`)	Single URL, version negotiated via headers.	Cleaner URLs, no route duplication. Harder to test (you can't just paste a URL into a browser). Proxy and cache behavior is trickier since the response varies by header.
Defensive parsing (consumer-side tolerance)	No explicit versioning. Consumers ignore unknown fields and use defaults for missing ones.	Zero coordination cost for additive changes. Breaks down for structural changes (field renames, type changes) where the consumer can't infer intent.

Additive changes (new fields) are safe with any strategy. Structural changes (renaming fields, changing types) require explicit versioning — URL or header — so consumers can migrate at their own pace.
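
While consumers migrate between versions, a tolerant reader can bridge the two price shapes shown above. This is a transitional sketch, not a replacement for explicit versioning: parsePrice is a hypothetical helper, and treating a bare number as USD is an assumption that only holds if version 1 prices were always in that currency.

```typescript
interface Money {
  amount: number;
  currency: string;
}

// Accepts both the v1 shape (a number) and the v2 shape ({ amount, currency }).
function parsePrice(raw: unknown, defaultCurrency: string = "USD"): Money {
  if (typeof raw === "number") {
    return { amount: raw, currency: defaultCurrency };
  }
  if (typeof raw === "object" && raw !== null) {
    const candidate = raw as { amount?: unknown; currency?: unknown };
    if (typeof candidate.amount === "number") {
      return {
        amount: candidate.amount,
        currency:
          typeof candidate.currency === "string" ? candidate.currency : defaultCurrency,
      };
    }
  }
  // Fail loudly instead of guessing at shapes this reader was never told about.
  throw new TypeError("Unrecognized price format");
}
```

Normalizing at the boundary keeps the rest of the consumer working with a single Money type, and the explicit throw surfaces a third, unplanned shape immediately.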

gRPC (Protocol Buffers):

// Protocol Buffers have built-in evolution rules
message StockResponse {
  string product_id = 1;
  int32 available = 2;
  int32 reserved = 3;
  reserved 4;                    // Field 4 was removed; reserving it prevents reuse
  string warehouse_id = 5;       // New field: old clients ignore it
  optional string region = 6;    // Optional: old clients don't send it
}

Protocol Buffers handle evolution well by design. You can add new fields (old clients ignore them), deprecate fields (stop writing them, keep the number reserved), and use optional for fields that may not be present.

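You can't safely change a field to an incompatible type or reuse a field number, and renaming a field breaks generated code and JSON mappings even though the binary encoding only carries field numbers. protoc doesn't compare schema versions on its own, so enforce these rules with a breaking-change checker such as Buf.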
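
The "old clients ignore new fields" rule follows directly from the wire format: every field carries its number and wire type, so a reader can skip anything it doesn't recognize. Here is a hand-written decoder for an old client that only knows fields 1 to 3 of StockResponse. It is a sketch under stated assumptions: it discards unknown fields (real Protobuf runtimes may preserve them instead) and doesn't support the deprecated group wire types.

```typescript
interface StockResponseV1 {
  productId: string;
  available: number;
  reserved: number;
}

function decodeStockResponseV1(bytes: Uint8Array): StockResponseV1 {
  const msg: StockResponseV1 = { productId: "", available: 0, reserved: 0 };
  let pos = 0;

  const readVarint = (): number => {
    let result = 0;
    let shift = 0;
    for (;;) {
      if (pos >= bytes.length) throw new RangeError("Truncated varint");
      const byte = bytes[pos++];
      result += (byte & 0x7f) * Math.pow(2, shift);
      shift += 7;
      if ((byte & 0x80) === 0) return result;
    }
  };

  while (pos < bytes.length) {
    const tag = readVarint();
    const fieldNumber = Math.floor(tag / 8);
    const wireType = tag % 8;

    if (wireType === 0) {
      const value = readVarint();
      if (fieldNumber === 2) msg.available = value;
      else if (fieldNumber === 3) msg.reserved = value;
      // Any other field number is unknown to this client: value dropped.
    } else if (wireType === 2) {
      const length = readVarint();
      const slice = bytes.subarray(pos, pos + length);
      pos += length;
      if (fieldNumber === 1) msg.productId = new TextDecoder().decode(slice);
    } else if (wireType === 1) {
      pos += 8; // fixed 64-bit value, skipped
    } else if (wireType === 5) {
      pos += 4; // fixed 32-bit value, skipped
    } else {
      throw new Error(`Unsupported wire type ${wireType}`);
    }
  }
  return msg;
}
```

Feeding this decoder a message that also contains warehouse_id (field 5) and region (field 6) still yields a valid StockResponseV1, which is the behavior the evolution rules rely on. It also shows why reusing a field number is dangerous: the old reader would interpret the new data under the old meaning.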

Event-driven (Avro/JSON Schema):

For events, schema registries like Confluent Schema Registry enforce compatibility rules:

// Schema evolved under backward compatibility:
// consumers using the new schema can still read events written with the old one
const schema = {
  type: "record",
  name: "OrderConfirmed",
  fields: [
    { name: "orderId", type: "string" },
    { name: "customerId", type: "string" },
    { name: "total", type: "double" },
    // New field with default: backward compatible
    { name: "currency", type: "string", default: "USD" },
  ],
};

With a schema registry, producers can't publish events that violate the compatibility contract. This is the strongest governance model: the registry rejects incompatible schemas before they reach consumers.

4. Debugging and Observability

Pattern	Debugging Experience
REST	Best. Human-readable payloads, browser DevTools, `curl`, standard HTTP tracing.
gRPC	Moderate. Binary payloads need `grpcurl` or Postman. Metadata is inspectable. Distributed tracing works well.
Event-driven	Hardest. Asynchronous flows require correlation IDs, dead-letter queue inspection, and broker-specific tooling.

For event-driven systems, correlation IDs are essential:

// Always include a correlation ID in events
interface EventEnvelope<T> {
  eventId: string;
  eventType: string;
  correlationId: string; // Links related events across services
  causationId: string;   // The event that caused this one
  timestamp: string;
  source: string;
  data: T;
}

// The payload must expose an entityId, which is used as the partition key below
async function publishEvent<T extends { entityId: string }>(
  topic: string,
  type: string,
  data: T,
  correlationId: string,
  causationId?: string
): Promise<void> {
  const event: EventEnvelope<T> = {
    eventId: crypto.randomUUID(),
    eventType: type,
    correlationId,
    causationId: causationId ?? correlationId,
    timestamp: new Date().toISOString(),
    source: SERVICE_NAME,
    data,
  };

  await producer.send({
    topic,
    messages: [{ key: data.entityId, value: JSON.stringify(event) }],
  });
}

When investigating an issue, you search for the correlation ID across all services and reconstruct the full event chain. Without it, you're searching for a needle in a haystack.
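
Reconstructing a chain from collected events might look like the sketch below. It assumes the events have already been gathered from logs or topics into one array, and that timestamps are the ISO 8601 UTC strings produced by toISOString(), which sort chronologically as plain strings. The TracedEvent shape mirrors the envelope above without its payload.

```typescript
interface TracedEvent {
  eventId: string;
  eventType: string;
  correlationId: string;
  causationId: string;
  timestamp: string;
  source: string;
}

// All events that belong to one business transaction, oldest first.
function eventChain(events: TracedEvent[], correlationId: string): TracedEvent[] {
  return events
    .filter((e) => e.correlationId === correlationId)
    .sort((a, b) => (a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0));
}

// Events directly triggered by a given event, found via causationId.
function directEffects(events: TracedEvent[], eventId: string): TracedEvent[] {
  return events.filter((e) => e.causationId === eventId && e.eventId !== eventId);
}
```

Timestamps come from different machines, so the sorted chain is only approximately ordered. When the exact cause-and-effect relationship matters, follow causationId links with directEffects instead.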

5. Operational Complexity

Pattern	What You Operate
REST	HTTP server, load balancer, API gateway
gRPC	gRPC server, proto registry, code generation pipeline, gRPC-Web proxy (if browser clients exist)
Event-driven	Message broker (Kafka/RabbitMQ/SQS), schema registry, dead-letter queues, consumer lag monitoring

REST has the lowest operational overhead. Every team knows how to run an HTTP server.

gRPC adds a build-time dependency (proto compilation) and requires teams to learn new tooling.

Event-driven adds a runtime dependency (the broker) that must be highly available because if the broker goes down, asynchronous inter-service communication stops.

The Decision Framework

Use this framework when deciding how a specific pair of services should communicate. The answer is rarely one pattern for your entire system.

Does the caller need an immediate response?
├── Yes → Is this a public-facing or browser-accessible API?
│         ├── Yes → REST
│         └── No  → Is throughput or latency critical?
│                   ├── Yes → gRPC
│                   └── No  → REST (simpler, good enough)
└── No  → Can the caller tolerate eventual consistency?
          ├── No  → Use synchronous call (REST or gRPC) with async follow-up
          └── Yes → Does the event need to trigger multiple consumers?
                    ├── Yes → Event-driven messaging
                    └── No  → Is ordering critical?
                              ├── Yes → Event-driven with partition key
                              └── No  → Event-driven (or simple queue like SQS)
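
The same tree can be written as a function, which is handy for documenting decisions in design reviews. The field names below are assumptions chosen to mirror the questions in the tree. Unspecified answers default to false.

```typescript
type Pattern =
  | "REST"
  | "gRPC"
  | "Synchronous call with async follow-up"
  | "Event-driven messaging"
  | "Event-driven with partition key"
  | "Event-driven (or simple queue)";

interface Interaction {
  needsImmediateResponse: boolean;
  publicOrBrowserFacing?: boolean;
  throughputOrLatencyCritical?: boolean;
  toleratesEventualConsistency?: boolean;
  multipleConsumers?: boolean;
  orderingCritical?: boolean;
}

function choosePattern(i: Interaction): Pattern {
  if (i.needsImmediateResponse) {
    if (i.publicOrBrowserFacing) return "REST";
    return i.throughputOrLatencyCritical ? "gRPC" : "REST";
  }
  if (!i.toleratesEventualConsistency) {
    return "Synchronous call with async follow-up";
  }
  if (i.multipleConsumers) return "Event-driven messaging";
  return i.orderingCritical
    ? "Event-driven with partition key"
    : "Event-driven (or simple queue)";
}
```

For example, choosePattern({ needsImmediateResponse: true, throughputOrLatencyCritical: true }) returns "gRPC", matching the internal hot-path branch of the tree.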

Some concrete examples:

Interaction	Pattern	Why
Browser fetches product details	REST	Browser can't call gRPC natively, plus REST offers cacheability
Checkout validates payment in real time	gRPC	Low latency, strong typing, internal-only (no browser in the path)
Order confirmed triggers fulfillment	Event-driven	Multiple consumers, temporal decoupling
Frontend fetches user profile	REST	Simple CRUD, cacheable, browser-native
ML service scores recommendations	gRPC	High throughput, binary payloads, streaming
User signup triggers welcome email	Event-driven	Async, no need for immediate response
Service health checks	REST	Simplicity, universal tooling
Real-time stock level monitoring	gRPC streaming	Continuous updates, bidirectional if needed

Hybrid Architectures: Using All Three

Most production systems use a combination. Here's a pattern that works well:

┌──────────┐    REST     ┌──────────────┐    gRPC    ┌──────────────┐
│ Browser  │────────────▶│  API Gateway │───────────▶│ Order Service│
└──────────┘             └──────────────┘            └──────┬───────┘
                                                           │
                                                    publishes event
                                                           │
                                                           ▼
                                                    ┌─────────────┐
                                                    │    Kafka     │
                                                    └──────┬──────┘
                                          ┌────────────────┼────────────────┐
                                          ▼                ▼                ▼
                                   ┌────────────┐  ┌────────────┐  ┌────────────┐
                                   │ Inventory  │  │ Notification│  │ Analytics  │
                                   │  Service   │  │  Service    │  │  Service   │
                                   └────────────┘  └────────────┘  └────────────┘

REST at the edge: the browser talks to the API gateway using standard HTTP. Cacheable, debuggable, universally supported.
gRPC between the gateway and internal services: low latency, strong typing, efficient serialization.
Event-driven for downstream reactions: the order service publishes an event, and multiple consumers react independently.

The Synchronous Fan-Out Trap

A common mistake is using synchronous calls (REST or gRPC) where events are a better fit. The symptom: a service that makes five synchronous calls during a single request, waiting for each to complete before responding to the caller.

// Anti-pattern: synchronous fan-out
async function confirmOrder(order: Order): Promise<void> {
  await inventoryService.decrementStock(order.items);    // 50ms
  await paymentService.capturePayment(order.paymentId);  // 200ms
  await notificationService.sendConfirmation(order);     // 100ms
  await analyticsService.recordConversion(order);        // 80ms
  await shippingService.createShipment(order);           // 150ms
  // Total: 580ms, and if any one fails, the order fails
}

Only the first two calls (inventory and payment) are critical to confirming the order. The rest are reactions that can happen asynchronously:

// Better: synchronous for critical path, events for reactions
async function confirmOrder(order: Order): Promise<void> {
  // Critical path: must succeed for the order to be valid
  await inventoryService.decrementStock(order.items);
  await paymentService.capturePayment(order.paymentId);

  // Non-critical: publish event, let consumers handle the rest
  await publishOrderConfirmed(order);
  // Total: about 250ms plus the broker publish, and notification/analytics/shipping
  // failures don't block the checkout
}

This is the same tiered approach I describe in my article on designing resilient APIs. Critical operations are synchronous. Non-critical reactions are event-driven. The caller responds faster, and downstream failures do not cascade.

Schema Governance at Scale

As your service count grows, schema management becomes a first-class concern. Here's a practical approach for each pattern.

REST: OpenAPI as the Contract

# openapi/inventory-service.yaml
openapi: "3.1.0"
info:
  title: Inventory Service
  version: "1.2.0"
paths:
  /api/v1/products/{productId}/stock:
    get:
      operationId: getStock
      parameters:
        - name: productId
          in: path
          required: true
          schema:
            type: string
      responses:
        "200":
          description: Stock level for the product
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/StockResponse"
components:
  schemas:
    StockResponse:
      type: object
      required: [productId, available, reserved]
      properties:
        productId:
          type: string
        available:
          type: integer
        reserved:
          type: integer

Generate client SDKs from the OpenAPI spec using tools like openapi-typescript or openapi-generator. This gives you type safety without the build-time coupling of gRPC.

gRPC: Proto Registry

Store .proto files in a shared repository or a dedicated proto registry (Buf Schema Registry is a good option). Use Buf's breaking change detection in CI:

# Detects breaking changes before they merge
buf breaking --against ".git#branch=main"

This command requires a buf.yaml configuration file at the root of your proto directory. The file defines your module name and any lint or breaking change rules. See the Buf documentation for setup details.

The breaking check fails your pull request if you rename a field, change a type, or reuse a field number. Non-breaking changes (adding fields, adding services) pass through.

Events: Schema Registry with Compatibility Modes

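For event-driven systems, a schema registry enforces compatibility whenever a new schema version is registered, which stops an incompatible producer before its events reach the topic. Confluent Schema Registry's four base modes are: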

Mode	Rule	Use Case
BACKWARD	New schema can read old data	Consumer-first evolution
FORWARD	Old schema can read new data	Producer-first evolution
FULL	Both directions	Safest, most restrictive
NONE	No checks	Development only

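Use FULL compatibility as the baseline for production topics. Note that FULL only checks a new schema against the latest registered version. To ensure that any consumer, regardless of which schema version it was built against, can read any event on the topic, use the FULL_TRANSITIVE variant, which checks against every registered version.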
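To make the two directions concrete, here is a minimal Go sketch that uses plain JSON instead of a registry-managed schema. The OrderConfirmedV1 and OrderConfirmedV2 types, the currency field, and its "USD" default are all hypothetical; the point is only that adding an optional field with a default keeps both old and new readers working.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OrderConfirmedV1 is a hypothetical first version of an event payload.
type OrderConfirmedV1 struct {
	OrderID string `json:"orderId"`
	Total   int    `json:"total"`
}

// OrderConfirmedV2 adds one optional field to V1.
type OrderConfirmedV2 struct {
	OrderID  string `json:"orderId"`
	Total    int    `json:"total"`
	Currency string `json:"currency,omitempty"`
}

// readAsV2 models a new consumer reading any event (the BACKWARD direction).
// The default is set before decoding, so it survives when the field is absent.
func readAsV2(data []byte) (OrderConfirmedV2, error) {
	ev := OrderConfirmedV2{Currency: "USD"}
	err := json.Unmarshal(data, &ev)
	return ev, err
}

// readAsV1 models an old consumer reading any event (the FORWARD direction).
// encoding/json ignores unknown fields by default.
func readAsV1(data []byte) (OrderConfirmedV1, error) {
	var ev OrderConfirmedV1
	err := json.Unmarshal(data, &ev)
	return ev, err
}

func main() {
	oldEvent := []byte(`{"orderId":"o-1","total":250}`)
	newEvent := []byte(`{"orderId":"o-2","total":300,"currency":"EUR"}`)

	v2, err := readAsV2(oldEvent)
	fmt.Printf("new reader, old event: %+v (err=%v)\n", v2, err)

	v1, err := readAsV1(newEvent)
	fmt.Printf("old reader, new event: %+v (err=%v)\n", v1, err)
}
```

A change such as renaming orderId or turning total into a string would break at least one of these readers, which is the kind of change a registry in FULL mode is meant to reject.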

Conclusion

In this article, you learned the core mechanics of REST, gRPC, and event-driven messaging, the five trade-off dimensions that matter when choosing between them (latency, coupling, schema evolution, debugging, and operational complexity), and a decision framework for matching patterns to specific service interactions.

The key takeaways:

REST for the edge: Browser clients, public APIs, simple CRUD. Cacheable, debuggable, universally supported.
gRPC for internal hot paths: High-throughput service-to-service calls where latency matters and both sides are under your control.
Events for reactions: When the producer shouldn't wait, when multiple consumers need the same signal, or when temporal decoupling prevents cascading failures.
Use all three: Most production systems combine patterns. REST at the boundary, gRPC internally, events for async workflows.
Schema governance scales the system: OpenAPI for REST, proto registries for gRPC, schema registries for events. Without governance, schema changes become a recurring source of production incidents.

The right communication pattern isn't a global decision. It's a per-interaction decision, made deliberately, based on which trade-offs you can accept for that specific data flow.

How to Build a Production-Grade Distributed Chatroom in Go [Full Handbook]

Destiny Erhabor — Fri, 13 Feb 2026 16:17:41 +0000

If you've ever wondered how chat applications like Slack, Discord, or WhatsApp work behind the scenes, this tutorial will show you. You'll build a real-time chat server from scratch using Go, learning the fundamental concepts that power modern communication systems.

By the end of this guide, you'll have built a working chatroom that supports unlimited concurrent users chatting in real-time, message persistence that survives server crashes, session management so users can reconnect after network interruptions, private messaging between users, and graceful handling of slow or disconnected clients.

More importantly, you'll understand the fundamental concepts behind distributed systems. You'll learn concurrent programming with goroutines and channels, TCP socket programming for network communication, write-ahead logging for data durability, state management with mutexes, and how to design systems that degrade gracefully under failure. These concepts power everything from databases to message queues to web servers.

What is a Distributed Chatroom?
What You'll Learn
Prerequisites
Tutorial Overview
Architecture Overview
Core Concepts You Need to Know
How to Set Up the Project Structure
How to Define Core Data Types
How to Initialize the Server
How to Build the Event Loop
How to Handle Client Connections
How to Implement Message Broadcasting
How to Add Persistence with WAL and Snapshots
How to Implement Session Management
How to Build the Command System
How to Create the Client
How to Test Your Chatroom
How to Deploy Your Server
Enhancements You Can Add
Conclusion

The complete source code for this project is available on GitHub if you'd like to reference it while following along.

What is a Distributed Chatroom?

A chatroom is a server that lets multiple users connect simultaneously and exchange messages in real-time. When we say "production-grade," we mean it includes features you'd expect in a real application: it persists data so messages aren't lost when the server restarts, it handles network failures gracefully, and it can support many concurrent users without slowing down.

The "distributed" aspect refers to how the system manages multiple clients connecting from different locations, all trying to send and receive messages at the same time. This introduces interesting challenges: how do you ensure everyone sees messages in the same order? How do you handle clients with slow internet connections? What happens when someone disconnects unexpectedly?

These aren't just theoretical problems. Every networked application deals with concurrency, state management, and failure handling. Whether you're building a chat app, a multiplayer game, a collaborative editor, or a trading platform, you'll face similar challenges. The patterns you'll learn here apply broadly across distributed systems.

Chat applications are excellent learning projects because they combine several challenging problems in one place. You need to manage concurrent connections safely, broadcast messages to multiple clients without blocking, handle unreliable networks, persist data durably, and ensure the system recovers gracefully from crashes. Each of these topics could be its own tutorial, but here you'll see how they work together in a real application.

What You'll Learn

This tutorial demonstrates several important concepts that are fundamental to building distributed systems. Here's what you'll learn:

1. TCP Socket Programming in Go

You'll learn how to accept incoming TCP connections, read and write data over network sockets, and handle connection failures gracefully. These skills are essential for any networked application, from web servers to database clients.

2. Concurrent Programming with Goroutines and Channels

Go's concurrency model is one of its strongest features. You'll see how to use goroutines to handle multiple clients simultaneously without blocking. You'll use channels to coordinate between goroutines safely, avoiding the common pitfalls of shared memory concurrency like race conditions and deadlocks.

3. State Management in Distributed Systems

Managing shared state across concurrent operations is tricky. You'll learn when to use mutexes versus channels, how to design lock granularity to avoid bottlenecks, and how to ensure data consistency when multiple goroutines access the same data.

4. Write-Ahead Logging (WAL) for Durability

Databases use WAL to ensure data isn't lost during crashes. You'll implement the same pattern, learning how to balance durability with performance. You'll see why fsync is critical, understand the trade-offs of different persistence strategies, and learn how to recover state after unexpected shutdowns.

5. Session Management and Reconnection

Networks are unreliable. Users disconnect, WiFi drops, mobile connections switch towers. You'll build a token-based session system that lets users reconnect seamlessly, preserving their chat history and identity without requiring passwords or complex authentication.

6. Graceful Degradation and Fault Tolerance

Perfect reliability is impossible, so you need to design for partial failures. You'll learn how to prevent slow clients from affecting fast ones, how to continue operating when persistence fails, and how to clean up resources properly when things go wrong.

Prerequisites

To get the most out of this tutorial, you should have some foundational knowledge. You don't need to be an expert, but you should be comfortable with the basics.

Go basics (goroutines, channels, interfaces)
TCP/IP networking fundamentals
Basic concurrency concepts
File I/O operations

Tutorial Overview

This tutorial takes you through building a production-ready chatroom step by step.

You'll start by exploring the overall architecture to understand how components fit together. Then you'll learn about core concepts like concurrency models and persistence strategies.

Next, you'll set up your project structure and define the core data types that represent clients, messages, and the chatroom. Then you'll implement the server initialization and event loop, which is where all coordination happens.

After that, you'll build the networking layer to handle client connections, implement message broadcasting so messages reach all users, and add persistence using write-ahead logging and snapshots.

You'll then implement session management for reconnection, build a command system for user actions, and create a simple client application to test your server.

Finally, you’ll learn how to test and deploy your chatroom, and review key lessons from building a distributed system.

By the end, you'll have a complete, working chatroom and understand how distributed systems handle concurrency, persistence, and failure recovery.

Architecture Overview

The system follows a client-server architecture with internal components that work together to provide a robust chat experience.

High-Level Architecture

Component Breakdown

1. Network Layer

TCP Listener: Accepts incoming connections on port 9000
Connection Handler: Manages individual client connections with dedicated goroutines
Protocol: Simple newline-delimited text protocol

2. Client Management

Each client connection spawns two goroutines:

Read Goroutine: Receives messages from client
Write Goroutine: Sends messages to client (non-blocking with buffered channels)

3. ChatRoom Core

This is the heart of the system – a single goroutine running an event loop:

for {
    select {
    case client := <-cr.join:
        // Handle new client
    case client := <-cr.leave:
        // Handle disconnection
    case message := <-cr.broadcast:
        // Broadcast to all clients
    case client := <-cr.listUsers:
        // Send user list
    case dm := <-cr.directMessage:
        // Handle private message
    }
}

4. State Management

We have three synchronized data structures:

clients map[*Client]bool: Active connections (mutex-protected)
sessions map[string]*SessionInfo: User sessions for reconnection
messages []Message: In-memory message history

5. Persistence Layer

Two-tier approach:

Write-Ahead Log (WAL): Immediate append-only log for durability
Snapshots: Periodic full state dumps for faster recovery

6. Session Management

This enables reconnection with token-based authentication:

Generates unique tokens per user
1-hour session timeout
Preserves chat history for returning users

Message Flow

Here's how a message travels through the system:

User Input → Client Read → Server Receive → Broadcast Channel 
    → ChatRoom Loop → Persist to WAL → Fan-out to All Clients
    → Client Write Goroutines → TCP Send → User Display

The broadcast channel acts as a synchronization point, ensuring total message ordering.

Core Concepts You Need to Know

Understanding the Concurrency Model

This chatroom uses Go's CSP (Communicating Sequential Processes) model. This is a fundamentally different approach to concurrency than you might be used to from other languages.

In traditional concurrent programming, you protect shared memory with locks (mutexes). Multiple threads access the same data structure, and you use locks to ensure only one thread modifies it at a time. This works, but it's error-prone. Forget a lock, and you have a race condition. Acquire two locks in an inconsistent order, and you have a deadlock. Hold a lock too long, and every other thread stalls waiting for it.

Go encourages a different approach: instead of communicating by sharing memory, you share memory by communicating. You pass data between goroutines through channels. Only one goroutine owns the data at a time, eliminating many concurrency bugs by design.

Channels provide several advantages. They eliminate most race conditions by design, because if only one goroutine owns the data at a time, there's no race to access it. They provide natural flow control since channels can block when full (back pressure) or block when empty (waiting for data). They make it easier to reason about message flow because you can trace how data moves through your system by following the channels. And they offer better composability since you can combine channels with select statements to coordinate multiple operations.

That said, we’ll still use mutexes in this project. Channels aren't always the right tool. We’ll use mutexes when multiple goroutines need quick, frequent access to shared data structures like maps. And we’ll use channels when we want to coordinate behavior or transfer ownership of data.
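The two styles can be compared side by side. In this sketch, SafeCounts and countWithOwner are hypothetical names used only for illustration: the first guards a map with a mutex so any goroutine can touch it, while the second gives the map to a single goroutine and has everyone else send it values over a channel.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeCounts is the mutex style: many goroutines call Inc and Get directly.
type SafeCounts struct {
	mu     sync.Mutex
	counts map[string]int
}

func (s *SafeCounts) Inc(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[key]++
}

func (s *SafeCounts) Get(key string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.counts[key]
}

// countWithOwner is the channel style: the calling goroutine owns the map,
// and other goroutines only send keys. It returns once events is closed.
func countWithOwner(events <-chan string) map[string]int {
	counts := make(map[string]int)
	for key := range events {
		counts[key]++
	}
	return counts
}

func main() {
	// Mutex style: 100 goroutines update shared state.
	shared := &SafeCounts{counts: make(map[string]int)}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			shared.Inc("messages")
		}()
	}
	wg.Wait()

	// Channel style: senders never touch the map.
	events := make(chan string)
	result := make(chan map[string]int)
	go func() { result <- countWithOwner(events) }()
	for i := 0; i < 100; i++ {
		events <- "messages"
	}
	close(events)

	fmt.Println(shared.Get("messages"), (<-result)["messages"]) // prints "100 100"
}
```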

Here's how the chatroom uses channels to coordinate everything:

type ChatRoom struct {
    join          chan *Client        // New connections
    leave         chan *Client        // Disconnections
    broadcast     chan string         // Messages to all
    listUsers     chan *Client        // User list requests
    directMessage chan DirectMessage  // Private messages

    // Shared state (mutex-protected)
    clients    map[*Client]bool
    mu         sync.Mutex

    // Message history (separate mutex)
    messages   []Message
    messageMu  sync.Mutex
}

Notice that we have five channels for different types of events. The main event loop receives from all these channels using a select statement. This means all state changes happen sequentially in one place, making the system much easier to reason about.

We could have used one channel that accepts different message types, but separate channels make the code clearer. When you send to chatRoom.join, it's obvious what you're doing. When you send to chatRoom.broadcast, same thing.

The mutexes protect data that many goroutines read frequently. The clients map needs to be accessed every time we broadcast a message. Using a mutex for quick read access is more efficient than passing the entire map through a channel.

Understanding the Persistence Strategy

When your server crashes (and it will eventually), you need to recover the chat history. Users expect their messages to be there when the server restarts. But persistence is expensive: writing to disk is thousands of times slower than writing to memory. So you need a strategy that balances durability with performance.

We’ll use a two-tier approach that's similar to what real databases use: WAL (Write-ahead log) and snapshots.

The WAL is your primary durability mechanism. Here's how it works: every message is immediately appended to a file called messages.wal. This file is append-only, which means we only write to the end. Append-only writes are fast because the disk doesn't need to seek to different locations.

Each message is written as a single line of JSON. After writing each message, we call fsync. This tells the operating system to actually write the data to the physical disk right now, not just buffer it in memory. Without fsync, the OS might lose your data if the power fails before it gets around to writing.

The WAL is append-only and never modified. This makes it very reliable. If the server crashes mid-write, the worst case is one corrupted line at the end, which we can detect and skip during recovery.

The problem with a write-ahead log is that it grows forever. If you have a million messages, you need to replay a million log entries every time you restart the server. That's slow.

Snapshots solve this problem. Every 5 minutes, if the message history holds more than 100 messages, we write the entire message history to a separate file called snapshot.json. This is the complete state of the chat at that moment.

After creating a snapshot, we truncate (empty) the WAL. New messages continue to append to the WAL, but now we only need to replay messages since the last snapshot.

When the server starts, it first loads the snapshot file (if it exists). This gives us the state from the last snapshot, which might be 100,000 messages. Loading this takes about 100ms. Then it replays all entries from the WAL. This gives us messages written since the last snapshot, which might be only 50 messages. Replaying this takes milliseconds. Finally, it resumes normal operation.

Total recovery time is a few hundred milliseconds instead of several minutes.

This two-tier system gives us the best of both worlds: fast writes during normal operation with the append-only WAL, fast recovery after crashes with snapshot plus small WAL replay, guaranteed durability through fsync after every message, and bounded recovery time because the WAL never grows too large.

The trade-off is that snapshots use more disk space temporarily since you have both the snapshot and the WAL. But disk space is cheap, and correctness is expensive.

Now that you understand the key concepts behind the chatroom's design, it's time to start building. You'll begin by setting up your project structure and creating the necessary directories and files.

How to Set Up the Project Structure

First, create the directory structure for your project. You'll create the files themselves as we walk through the tutorial:

mkdir -p chatroom-with-broadcast/cmd/server
mkdir -p chatroom-with-broadcast/cmd/client
mkdir -p chatroom-with-broadcast/internal/chatroom
mkdir -p chatroom-with-broadcast/pkg/token
mkdir -p chatroom-with-broadcast/chatdata
cd chatroom-with-broadcast

Then initialize the Go module.

Note that you'll need Go 1.23.2 or later installed on your machine. Earlier versions might work, but the code examples assume Go 1.23 and above.

go mod init github.com/yourusername/chatroom

Your go.mod file should look like this (the go line reflects the toolchain version you have installed):

module github.com/yourusername/chatroom

go 1.23.2

With your project structure in place, you're ready to start writing code. The first step is defining the data types that will represent the core components of your chatroom: messages, clients, and the chatroom itself.

How to Define Core Data Types

Create a new file internal/chatroom/types.go to define your core data structures. These types form the foundation of your chatroom, so it's important to understand what each one represents and why it's designed the way it is.

package chatroom

import (
    "net"
    "os"
    "sync"
    "time"
)

// Message represents a single chat message with metadata
type Message struct {
    ID        int       `json:"id"`
    From      string    `json:"from"`
    Content   string    `json:"content"`
    Timestamp time.Time `json:"timestamp"`
    Channel   string    `json:"channel"` // "global" or "private:username"
}

// Client represents a connected user
type Client struct {
    conn         net.Conn      // TCP connection
    username     string        // Display name
    outgoing     chan string   // Buffered channel for writes
    lastActive   time.Time     // For idle detection
    messagesSent int           // Statistics
    messagesRecv int
    isSlowClient bool          // Testing flag

    reconnectToken string
    mu             sync.Mutex   // Protects stats fields
}

// ChatRoom is the central coordinator
type ChatRoom struct {
    // Communication channels
    join          chan *Client
    leave         chan *Client
    broadcast     chan string
    listUsers     chan *Client
    directMessage chan DirectMessage

    // State
    clients       map[*Client]bool
    mu            sync.Mutex
    totalMessages int
    startTime     time.Time

    // Message history
    messages      []Message
    messageMu     sync.Mutex
    nextMessageID int

    // Persistence
    walFile       *os.File
    walMu         sync.Mutex
    dataDir       string

    // Sessions
    sessions      map[string]*SessionInfo
    sessionsMu    sync.Mutex
}

// SessionInfo tracks reconnection data
type SessionInfo struct {
    Username       string
    ReconnectToken string
    LastSeen       time.Time
    CreatedAt      time.Time
}

// DirectMessage represents a private message
type DirectMessage struct {
    toClient *Client
    message  string
}

Understanding the Message Type

The Message struct stores everything we need to know about a chat message. The ID field uniquely identifies each message and ensures messages stay in order. The Timestamp lets us show when messages were sent, which is important for chat history.

The Channel field is interesting. Right now, we only use "global" for public messages, but this design lets us add private channels or chat rooms later without changing the data structure. Good data structures anticipate future needs.

Understanding the Client Type

Each connected user is represented by a Client struct. The conn field is their TCP connection – this is how we send and receive data.

The outgoing channel is crucial for performance. Notice it's a chan string, which means it's a channel of strings. We'll make this a buffered channel (size 10). This buffer means we can queue up 10 messages for this client without blocking. If a client is slow to read, we can keep sending to other clients.

Without this buffer, one slow client would block the entire broadcast. With the buffer and a non-blocking send, a slow client simply misses messages once its queue of 10 is full, which is much better than slowing everyone down.
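The buffer alone is not enough: once it fills up, a plain channel send would block. Dropping messages for slow clients relies on a non-blocking send, which in Go is a select with a default case. The trySend helper below is a hypothetical name for that idea, not the tutorial's broadcast code.

```go
package main

import "fmt"

// trySend attempts to queue msg for a client without blocking. It reports
// false when the client's buffer is full and the message is dropped.
func trySend(outgoing chan<- string, msg string) bool {
	select {
	case outgoing <- msg:
		return true
	default:
		return false
	}
}

func main() {
	// A buffer of 10 matches the size described above. Nothing reads from
	// this channel, which models a client that has stopped consuming.
	slow := make(chan string, 10)

	dropped := 0
	for i := 0; i < 15; i++ {
		if !trySend(slow, fmt.Sprintf("message %d", i)) {
			dropped++
		}
	}
	fmt.Println("queued:", len(slow), "dropped:", dropped) // queued: 10 dropped: 5
}
```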

The lastActive timestamp helps us detect idle users. If someone hasn't sent a message in 5 minutes, we can disconnect them to free up resources.

The mu mutex protects the statistics fields. Multiple goroutines will update messagesSent and messagesRecv, so we need a mutex to prevent race conditions.

Understanding the ChatRoom Type

This is the heart of the system. Notice that we have two kinds of fields: channels and protected state.

The five channels (join, leave, broadcast, listUsers, directMessage) are how different parts of the system communicate with the main event loop. When a new client connects, we send them to the join channel. When someone sends a message, it goes to the broadcast channel.

These channels are unbuffered (capacity 0) because we want synchronization. When you send to an unbuffered channel, you block until someone receives. This ensures the event loop processes events in order.

The protected state (maps and slices) needs mutexes because multiple goroutines access it. Notice that we use separate mutexes for different data. The mu mutex protects the clients map. The messageMu mutex protects the messages slice. The sessionsMu mutex protects the sessions map.

Why separate mutexes? Performance. If we used one mutex for everything, broadcasting a message would lock all the data, preventing new clients from joining. Separate mutexes mean different operations can happen concurrently.

The WAL file (walFile) also has its own mutex (walMu) because writing to disk is slow. We don't want to hold the main mutex while waiting for disk I/O.

With your data types defined, the next step is creating a function to initialize the server. This function will set up all your data structures, restore any persisted state from previous runs, and start background workers.

How to Initialize the Server

Server initialization is critical because you need to set up all your data structures in the right order. If you restore state after opening the WAL, you might replay messages twice. If you start accepting connections before loading history, users won't see old messages.

Create a file internal/chatroom/run.go to bootstrap the server:

package chatroom

import (
    "fmt"
    "net"
    "time"
)

func NewChatRoom(dataDir string) (*ChatRoom, error) {
    cr := &ChatRoom{
        clients:       make(map[*Client]bool),
        join:          make(chan *Client),
        leave:         make(chan *Client),
        broadcast:     make(chan string),
        listUsers:     make(chan *Client),
        directMessage: make(chan DirectMessage),
        sessions:      make(map[string]*SessionInfo),
        messages:      make([]Message, 0),
        startTime:     time.Now(),
        dataDir:       dataDir,
    }

    // Restore from snapshot if available
    if err := cr.loadSnapshot(); err != nil {
        fmt.Printf("Failed to load snapshot: %v\n", err)
    }

    // Initialize WAL for new messages
    if err := cr.initializePersistence(); err != nil {
        return nil, err
    }

    // Start background snapshot worker
    go cr.periodicSnapshots()

    return cr, nil
}

func (cr *ChatRoom) periodicSnapshots() {
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()

    for range ticker.C {
        cr.messageMu.Lock()
        messageCount := len(cr.messages)
        cr.messageMu.Unlock()

        if messageCount > 100 {
            if err := cr.createSnapshot(); err != nil {
                fmt.Printf("Snapshot failed: %v\n", err)
            }
        }
    }
}

Let's break down what happens during initialization:

1. Creating Data Structures

We start by creating all the maps and channels. The make function initializes these properly. For maps, this creates an empty map ready to use. For channels, this creates an unbuffered channel (capacity 0).

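Notice we create the messages slice with make([]Message, 0), which yields an empty, non-nil slice with length and capacity 0. This isn't a performance optimization: append works just as well on a nil slice, and either way the first append allocates. The practical difference is that an empty slice encodes to JSON as [] while a nil slice encodes as null, which can matter when the history is written out as JSON.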
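One observable difference between a nil slice and an empty slice created with make shows up when the value is encoded to JSON. The short sketch below uses []string instead of []Message to stay self-contained; encode is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encode returns the JSON encoding of a string slice as text.
func encode(s []string) string {
	data, err := json.Marshal(s)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(data)
}

func main() {
	var nilSlice []string
	emptySlice := make([]string, 0)

	// Both have length 0, and append works on either one.
	fmt.Println(len(nilSlice), len(emptySlice))                          // 0 0
	fmt.Println(len(append(nilSlice, "x")), len(append(emptySlice, "x"))) // 1 1

	// The JSON encodings differ.
	fmt.Println(encode(nilSlice), encode(emptySlice)) // null []
}
```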

2. Loading the Snapshot

Before we accept any connections, we try to load a snapshot from disk. This restores the chat history from the last time the server ran. If the snapshot doesn't exist (first run) or fails to load (corrupted file), we just continue with an empty history.

This step must happen before initializing the WAL. If we opened the WAL first, we might replay messages that are already in the snapshot, creating duplicates.

3. Initializing the WAL

The initializePersistence() function opens the WAL file in append mode. It also replays any entries in the WAL that happened after the last snapshot. This ensures we don't lose any messages that were written to the WAL but not yet included in a snapshot.

If this step fails, we return an error and refuse to start. Why? Because if we can't write to the WAL, we can't guarantee durability. It's better to refuse to start than to lie to users by accepting messages we can't persist.

4. Starting Background Workers

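The periodicSnapshots() function runs in a separate goroutine. It wakes up every 5 minutes and checks if we need to create a snapshot. Notice the defer ticker.Stop(). Stopping a ticker releases its timer resources; before Go 1.23, a ticker that was never stopped could not be reclaimed by the garbage collector. In this version the for range loop never exits, so the deferred call only takes effect if you later add a way to stop the worker.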

The goroutine acquires the messageMu lock just to read the message count, then releases it immediately. We don't hold the lock during the snapshot creation because that's slow and would block message broadcasting.

Why 5 Minutes and 100 Messages?

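These are tunable parameters. The 5-minute interval bounds how much of the WAL the server replays after a crash, provided a snapshot was actually taken on the last tick. The 100-message threshold is meant to avoid snapshotting too often during quiet periods. Note that the check shown above compares the total history length (len(cr.messages)), not the number of messages since the last snapshot, so once the history passes 100 messages a snapshot is written on every tick. To count only new messages, record the history length at each snapshot and compare against it.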

In a production system, you might make these configurable. A high-traffic chat might want shorter intervals. A low-traffic chat might want longer intervals to reduce disk I/O.

Now that your server is initialized with all the necessary data structures and background workers, you need to build the core coordination mechanism. The event loop is where all state changes happen in your chatroom. It's the heartbeat that keeps everything synchronized.

How to Build the Event Loop

The event loop is the heart of your chatroom. Every client connection, message, and disconnection flows through this single point. This might seem like it could be a bottleneck, but it's actually what makes the system simple and safe.

The Run() method implements this loop. Add it to run.go:

func (cr *ChatRoom) Run() {
    fmt.Println("ChatRoom heartbeat started...")
    go cr.cleanupInactiveClients()

    for {
        select {
        case client := <-cr.join:
            cr.handleJoin(client)

        case client := <-cr.leave:
            cr.handleLeave(client)

        case message := <-cr.broadcast:
            cr.handleBroadcast(message)

        case client := <-cr.listUsers:
            cr.sendUserList(client)

        case dm := <-cr.directMessage:
            cr.handleDirectMessage(dm)
        }
    }
}

Understanding the Select Statement

The select statement is one of Go's most powerful concurrency features. It's like a switch statement for channels. The select waits until one of its cases can proceed, then it executes that case.

Here's what happens: The loop blocks on the select statement, waiting for data on any of the five channels. When data arrives on any channel, that case executes. After the case completes, the loop goes back to waiting.

For example, when a new client connects, code elsewhere in your program sends that client to cr.join. The select receives it and executes cr.handleJoin(client). Once that finishes, the loop goes back to waiting.
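Here is a reduced, runnable version of the same pattern with two event channels instead of five. The runLoop function and the done channel are hypothetical additions so the demo can terminate; the real loop runs for the lifetime of the server.

```go
package main

import "fmt"

// runLoop is a reduced event loop. It records each event in the order it
// was handled and returns when done is closed.
func runLoop(join, broadcast <-chan string, done <-chan struct{}) []string {
	var handled []string
	for {
		select {
		case name := <-join:
			handled = append(handled, "join:"+name)
		case msg := <-broadcast:
			handled = append(handled, "msg:"+msg)
		case <-done:
			return handled
		}
	}
}

func main() {
	join := make(chan string)
	broadcast := make(chan string)
	done := make(chan struct{})
	result := make(chan []string)

	go func() { result <- runLoop(join, broadcast, done) }()

	// Unbuffered sends block until the loop receives, so these three
	// events are handled in exactly this order.
	join <- "alice"
	broadcast <- "hello"
	join <- "bob"
	close(done)

	fmt.Println(<-result) // [join:alice msg:hello join:bob]
}
```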

Why Use a Single Event Loop?

This might seem like a bottleneck. You have one goroutine processing all events sequentially. Why not process events in parallel?

The answer is consistency. Here's what you gain from sequential processing:

1. No Race Conditions on State

Only one goroutine modifies the clients map, the messages slice, and the sessions map. You never need to worry about two operations interfering with each other. When you add a client in handleJoin, you know for certain that no other code is simultaneously removing clients or broadcasting messages.

This is incredibly powerful. Most bugs in concurrent systems come from unexpected interleaving of operations. By processing events sequentially, you eliminate an entire class of bugs.

2. Total Ordering of Events

Messages are broadcast in the order they arrive. This seems obvious, but it's important. If Alice sends "Hello" and then Bob sends "Hi", you can guarantee everyone sees them in that order. With parallel processing, you'd need additional synchronization to maintain ordering.

3. Simple State Transitions

You can reason about your system state as a series of transitions. "After this join event, the client is in the map. After this leave event, the client is removed." You don't need to worry about concurrent state changes making your reasoning invalid.

4. Easy to Debug

When something goes wrong, you can add logging to the event loop and see exactly what sequence of events led to the problem. With parallel processing, the order of events depends on goroutine scheduling, making bugs hard to reproduce.

Is This Actually a Bottleneck?

You might worry that sequential processing limits performance. In practice, it's fine for this workload. Here's why:

The handlers are fast. They do simple things like adding to a map, removing from a map, or forwarding a message to channels. These operations take microseconds. The event loop can process thousands of events per second.

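The slowest operation, writing to client connections over the network, happens in the per-client write goroutines. The event loop doesn't wait for them: it hands each message to a client's buffered channel and immediately moves to the next event. The one exception is the WAL append shown in the message flow earlier, which is a small sequential write. If throughput ever becomes a concern, that fsync is the first place to measure.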

If you needed higher throughput, you could shard your chat into multiple rooms, each with its own event loop. But for a single chatroom, sequential processing is both simpler and fast enough.

Understanding the Cleanup Worker

Notice the line go cr.cleanupInactiveClients() before the loop. This starts a background goroutine that periodically checks for idle clients.

Why not include this in the event loop? Because it's time-based, not event-based. The cleanup worker wakes up every 30 seconds and sends disconnect events for idle clients. These events flow through the normal event loop, maintaining our single-threaded state mutation property.

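Now add the runServer() function and shutdown handler. They need three more packages. Merge these into the existing import block at the top of run.go, because Go requires every import declaration to come before the other declarations in a file:

import (
    "os"
    "os/signal"
    "syscall"
)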

func runServer() {
    chatRoom, err := NewChatRoom("./chatdata")
    if err != nil {
        fmt.Printf("Failed to initialize: %v\n", err)
        return
    }
    defer chatRoom.shutdown()

    // Set up signal handling for graceful shutdown
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)

    go func() {
        <-sigChan
        fmt.Println("\nReceived shutdown signal")
        chatRoom.shutdown()
        os.Exit(0)
    }()

    go chatRoom.Run()

    listener, err := net.Listen("tcp", ":9000")
    if err != nil {
        fmt.Println("Error starting server:", err)
        return
    }
    defer listener.Close()

    fmt.Println("Server started on :9000")

    for {
        conn, err := listener.Accept()
        if err != nil {
            fmt.Println("Error accepting connection:", err)
            continue
        }
        fmt.Println("New connection from:", conn.RemoteAddr())
        go handleClient(conn, chatRoom)
    }
}

func (cr *ChatRoom) shutdown() {
    fmt.Println("\nShutting down...")
    if err := cr.createSnapshot(); err != nil {
        fmt.Printf("Final snapshot failed: %v\n", err)
    }
    if cr.walFile != nil {
        cr.walFile.Close()
    }
    fmt.Println("Shutdown complete")
}

The runServer() function ties everything together:

Create the chatroom with NewChatRoom()
Defer the shutdown function so it runs when the function exits
Start the event loop in a separate goroutine with go chatRoom.Run()
Listen for TCP connections on port 9000
For each connection, spawn a goroutine with go handleClient()

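The defer statement is important. Whenever runServer() returns (for example, because the listener failed to start) or panics, the deferred shutdown function runs, creating a final snapshot and closing the WAL file cleanly. Deferred calls don't run when the process ends through os.Exit(), which is why the signal handler calls shutdown() itself before exiting.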

The signal handling goroutine listens for SIGINT (Ctrl+C) or SIGTERM (system shutdown). When it receives one, it calls shutdown() and exits gracefully. This means when you press Ctrl+C, the server saves its state before stopping.
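
Because shutdown() can be reached from more than one place (the deferred call and the signal handler), one option is to make it safe to call repeatedly. The following is a minimal sketch, assuming a stand-in server type guarded by sync.Once. The names server, snapshots, waitForSignal, and demo are hypothetical, and the signal is simulated so the example ends on its own.

```go
package main

import (
    "fmt"
    "os"
    "os/signal"
    "sync"
    "syscall"
)

// server is a hypothetical stand-in for ChatRoom.
type server struct {
    shutdownOnce sync.Once
    snapshots    int // counts how many times the cleanup actually ran
}

// shutdown runs the cleanup at most once, however many callers reach it.
func (s *server) shutdown() {
    s.shutdownOnce.Do(func() {
        s.snapshots++
        fmt.Println("snapshot written, WAL closed")
    })
}

// waitForSignal blocks until a signal arrives, then shuts down.
func waitForSignal(sigChan <-chan os.Signal, s *server) {
    sig := <-sigChan
    fmt.Println("received", sig)
    s.shutdown()
}

// demo simulates Ctrl+C and returns how many times the cleanup ran.
func demo() int {
    s := &server{}
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
    defer signal.Stop(sigChan)

    sigChan <- os.Interrupt // stand-in for a real signal
    waitForSignal(sigChan, s)
    s.shutdown() // a second caller, such as a deferred call, is a no-op
    return s.snapshots
}

func main() {
    fmt.Println("cleanup runs:", demo())
}
```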

With your event loop running and listening for connections, the next step is handling what happens when a client actually connects. This involves reading their username, creating a session, and setting up the communication channels.

How to Handle Client Connections

When a client connects to your server, several things need to happen: you need to establish the TCP connection, prompt for a username, create a Client object to represent them, start goroutines to read and write messages, and handle both normal disconnections and unexpected failures.

Create a file internal/chatroom/io.go for managing client connections. When a client connects, handleClient() manages the entire lifecycle:

package chatroom

import (
    "bufio"
    "fmt"
    "math/rand"
    "net"
    "strings"
    "time"
)

func handleClient(conn net.Conn, chatRoom *ChatRoom) {
    defer func() {
        if r := recover(); r != nil {
            fmt.Printf("Panic in handleClient: %v\n", r)
        }
        conn.Close()
    }()

    // Set initial timeout for username entry
    conn.SetReadDeadline(time.Now().Add(30 * time.Second))

    reader := bufio.NewReader(conn)

    // Prompt for username or reconnection
    conn.Write([]byte("Enter username (or 'reconnect:<username>:<token>'): \n"))

    input, err := reader.ReadString('\n')
    if err != nil {
        fmt.Println("Failed to read username:", err)
        return
    }
    input = strings.TrimSpace(input)

    var username string
    var reconnectToken string
    var isReconnecting bool

    // Parse reconnection attempt
    if strings.HasPrefix(input, "reconnect:") {
        parts := strings.Split(input, ":")
        if len(parts) == 3 {
            username = parts[1]
            reconnectToken = parts[2]
            isReconnecting = true
        } else {
            conn.Write([]byte("Invalid format. Use: reconnect:<username>:<token>\n"))
            return
        }
    } else {
        username = input
    }

    // Generate guest name if empty
    if username == "" {
        username = fmt.Sprintf("Guest%d", rand.Intn(1000))
    }

    // Validate reconnection or check for duplicate
    if isReconnecting {
        if chatRoom.validateReconnectToken(username, reconnectToken) {
            fmt.Printf("%s reconnected successfully\n", username)
            conn.Write([]byte(fmt.Sprintf("Welcome back, %s!\n", username)))
        } else {
            conn.Write([]byte("Invalid token or session expired.\n"))
            return
        }
    } else {
        // Prevent duplicate logins
        if chatRoom.isUsernameConnected(username) {
            conn.Write([]byte("Username already connected. Use reconnect if you lost connection.\n"))
            return
        }

        // Create or retrieve session
        chatRoom.sessionsMu.Lock()
        existingSession := chatRoom.sessions[username]
        chatRoom.sessionsMu.Unlock()

        if existingSession != nil {
            token := existingSession.ReconnectToken
            msg := fmt.Sprintf("Tip: Save this token: %s\n", token)
            msg += fmt.Sprintf("To reconnect: reconnect:%s:%s\n", username, token)
            conn.Write([]byte(msg))
        } else {
            session := chatRoom.createSession(username)
            token := session.ReconnectToken
            msg := fmt.Sprintf("Your token: %s\n", token)
            msg += fmt.Sprintf("To reconnect: reconnect:%s:%s\n", username, token)
            conn.Write([]byte(msg))
        }
    }

    // Create client object
    client := &Client{
        conn:           conn,
        username:       username,
        outgoing:       make(chan string, 10), // Buffered
        lastActive:     time.Now(),
        reconnectToken: reconnectToken,
        isSlowClient:   rand.Float64() < 0.1, // 10% chance for testing
    }

    // Clear timeout for normal operation
    conn.SetReadDeadline(time.Time{})

    // Notify chatroom
    chatRoom.join <- client

    // Send welcome message
    welcomeMsg := buildWelcomeMessage(username)
    conn.Write([]byte(welcomeMsg))

    // Start read/write loops
    go readMessages(client, chatRoom)
    writeMessages(client) // Blocks until disconnect

    // Update session on disconnect
    chatRoom.updateSessionActivity(username)
    chatRoom.leave <- client
}

func buildWelcomeMessage(username string) string {
    msg := fmt.Sprintf("Welcome, %s!\n", username)
    msg += "Commands:\n"
    msg += "  /users - List all users\n"
    msg += "  /history [N] - Show last N messages\n"
    msg += "  /msg <username> <message> - Private message\n"
    msg += "  /token - Show your reconnect token\n"
    msg += "  /stats - Show your stats\n"
    msg += "  /quit - Leave\n"
    return msg
}

The initial 30-second timeout prevents connection exhaustion by disconnecting clients who don't enter a username quickly. The buffered outgoing channel prevents slow clients from blocking the broadcaster. Token-based reconnection lets users resume their session without complex authentication. The dual goroutine design means reading and writing happen independently, so a slow write doesn't block incoming messages.
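
The first-line parsing inside handleClient() can also be pulled into a small pure function, which makes the reconnect:username:token format easy to test on its own. This is only a sketch: the login type, parseLogin, and errBadReconnect are hypothetical names, and rejecting empty username or token parts is an added assumption rather than behavior of the code above.

```go
package main

import (
    "errors"
    "fmt"
    "strings"
)

// login is a hypothetical result type for the first line a client sends.
type login struct {
    username     string
    token        string
    reconnecting bool
}

var errBadReconnect = errors.New("expected reconnect:<username>:<token>")

// parseLogin accepts either a plain username or reconnect:<username>:<token>.
func parseLogin(input string) (login, error) {
    input = strings.TrimSpace(input)
    if !strings.HasPrefix(input, "reconnect:") {
        return login{username: input}, nil
    }
    parts := strings.Split(input, ":")
    if len(parts) != 3 || parts[1] == "" || parts[2] == "" {
        return login{}, errBadReconnect
    }
    return login{username: parts[1], token: parts[2], reconnecting: true}, nil
}

func main() {
    for _, in := range []string{"alice\n", "reconnect:alice:4f2a\n", "reconnect:alice\n"} {
        l, err := parseLogin(in)
        fmt.Printf("%q -> %+v, err=%v\n", in, l, err)
    }
}
```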

How to Read Messages from Clients

Add the readMessages() goroutine to handle all incoming data:

func readMessages(client *Client, chatRoom *ChatRoom) {
    defer func() {
        if r := recover(); r != nil {
            fmt.Printf("Panic in readMessages for %s: %v\n", client.username, r)
        }
    }()

    reader := bufio.NewReader(client.conn)

    for {
        // Set 5-minute idle timeout
        client.conn.SetReadDeadline(time.Now().Add(5 * time.Minute))

        message, err := reader.ReadString('\n')
        if err != nil {
            if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
                fmt.Printf("%s timed out\n", client.username)
            } else {
                fmt.Printf("%s disconnected: %v\n", client.username, err)
            }
            return
        }

        client.markActive() // Update activity timestamp

        message = strings.TrimSpace(message)
        if message == "" {
            continue
        }

        client.mu.Lock()
        client.messagesRecv++
        client.mu.Unlock()

        // Process commands vs. regular messages
        if strings.HasPrefix(message, "/") {
            handleCommand(client, chatRoom, message)
            continue
        }

        // Regular message - format and broadcast
        formatted := fmt.Sprintf("[%s]: %s\n", client.username, message)
        chatRoom.broadcast <- formatted
    }
}

Five minutes of idle time triggers an automatic disconnect, which prevents zombie connections from consuming resources.
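
To see the read deadline in isolation, here is a small sketch that uses an in-memory net.Pipe() instead of a real TCP socket, with a 50ms deadline standing in for the 5-minute one. The readLine and demo helpers are hypothetical names for this example.

```go
package main

import (
    "bufio"
    "errors"
    "fmt"
    "net"
    "time"
)

// readLine waits up to idle for one line. timedOut reports whether the
// deadline expired, as opposed to the peer disconnecting.
func readLine(conn net.Conn, r *bufio.Reader, idle time.Duration) (line string, timedOut bool, err error) {
    conn.SetReadDeadline(time.Now().Add(idle))
    line, err = r.ReadString('\n')
    if err != nil {
        var netErr net.Error
        timedOut = errors.As(err, &netErr) && netErr.Timeout()
    }
    return line, timedOut, err
}

// demo reads one line that arrives in time, then one read that goes idle.
func demo() (first string, idleTimedOut bool) {
    server, client := net.Pipe()
    defer server.Close()
    defer client.Close()

    go client.Write([]byte("hello\n"))

    r := bufio.NewReader(server)
    first, _, _ = readLine(server, r, time.Second)
    // The client now stays silent, so the next read hits the deadline.
    _, idleTimedOut, _ = readLine(server, r, 50*time.Millisecond)
    return first, idleTimedOut
}

func main() {
    first, timedOut := demo()
    fmt.Printf("first=%q timedOut=%v\n", first, timedOut)
}
```

Resetting the deadline before every read is what turns it into an idle timeout: each successful read buys the client another full window.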

How to Write Messages to Clients

Add the writeMessages() function to drain the client's outgoing channel:

func writeMessages(client *Client) {
    defer func() {
        if r := recover(); r != nil {
            fmt.Printf("Panic in writeMessages for %s: %v\n", client.username, r)
        }
    }()

    writer := bufio.NewWriter(client.conn)

    for message := range client.outgoing {
        // Simulate slow client (testing mode)
        if client.isSlowClient {
            time.Sleep(time.Duration(rand.Intn(500)) * time.Millisecond)
        }

        _, err := writer.WriteString(message)
        if err != nil {
            fmt.Printf("Write error for %s: %v\n", client.username, err)
            return
        }

        err = writer.Flush()
        if err != nil {
            fmt.Printf("Flush error for %s: %v\n", client.username, err)
            return
        }
    }
}

Real-world clients have varying network speeds. A client with a slow internet connection shouldn't block message delivery to other users. This is a fundamental challenge in any system that broadcasts to multiple recipients.

To handle this, we use two techniques. First, the outgoing channel is buffered with a size of 10. This means the system can queue up 10 messages for a client without blocking. If a client temporarily slows down (maybe they're loading a large webpage in another tab), the buffer absorbs the slowdown.

Second, when broadcasting messages (which you'll see in the next section), we use non-blocking sends. If a client's buffer is full because they're consistently too slow, we skip sending to them rather than blocking everyone else. The slow client misses some messages, but everyone else continues normally. This is called graceful degradation: the system continues working even when parts of it have problems.
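
The non-blocking send boils down to a select with a default case. Here is a minimal standalone sketch of the technique; trySend is a hypothetical helper name, and the two channels stand in for a healthy client and one whose buffer is already full.

```go
package main

import "fmt"

// trySend queues msg without blocking. It reports false when the buffer is full.
func trySend(ch chan<- string, msg string) bool {
    select {
    case ch <- msg:
        return true
    default:
        return false
    }
}

func main() {
    fast := make(chan string, 10) // plenty of room
    slow := make(chan string, 1)  // stand-in for a client that never drains
    slow <- "older message"

    if trySend(fast, "hello") {
        fmt.Println("fast: queued")
    }
    if !trySend(slow, "hello") {
        fmt.Println("slow: skipped (buffer full)")
    }
}
```

The sender never waits, so one full buffer costs a skipped message for that client instead of a stalled broadcast for everyone.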

With client connections handled, the next step is implementing the core feature of any chat system: broadcasting messages to all connected users. Broadcasting means taking one message and sending it to many recipients efficiently and safely.

How to Implement Message Broadcasting

Broadcasting is the heart of a chat application. When one user sends a message, it needs to reach everyone else instantly. But this is trickier than it sounds because you need to persist the message for durability, send it to clients at different speeds without blocking, and maintain message ordering across all clients.

Create internal/chatroom/handlers.go to handle events.

The handleBroadcast() method is where messages reach all users:

package chatroom

import (
    "fmt"
    "strings"
    "time"
)

func (cr *ChatRoom) handleBroadcast(message string) {
    // Parse message metadata
    parts := strings.SplitN(message, ": ", 2)
    from := "system"
    actualContent := message

    if len(parts) == 2 {
        from = strings.Trim(parts[0], "[]")
        actualContent = parts[1]
    }

    // Create persistent message record
    cr.messageMu.Lock()
    msg := Message{
        ID:        cr.nextMessageID,
        From:      from,
        Content:   actualContent,
        Timestamp: time.Now(),
        Channel:   "global",
    }
    cr.nextMessageID++
    cr.messages = append(cr.messages, msg)
    cr.messageMu.Unlock()

    // Persist to WAL
    if err := cr.persistMessage(msg); err != nil {
        fmt.Printf("Failed to persist: %v\n", err)
        // Continue anyway - availability over consistency
    }

    // Collect current clients
    cr.mu.Lock()
    clients := make([]*Client, 0, len(cr.clients))
    for client := range cr.clients {
        clients = append(clients, client)
    }
    cr.totalMessages++
    cr.mu.Unlock()

    fmt.Printf("Broadcasting to %d clients: %s", len(clients), message)

    // Fan-out to all clients
    for _, client := range clients {
        select {
        case client.outgoing <- message:
            client.mu.Lock()
            client.messagesSent++
            client.mu.Unlock()
        default:
            fmt.Printf("Skipped %s (channel full)\n", client.username)
        }
    }
}

Consistency Trade-off:

If a WAL write fails, you still broadcast the message. Why? Because availability is more important than perfect consistency for a chat application. Users get their messages immediately, and you can handle WAL repair manually if needed.

How to Handle Join and Leave Events

Add these handlers to handlers.go:

func (cr *ChatRoom) handleJoin(client *Client) {
    cr.mu.Lock()
    cr.clients[client] = true
    cr.mu.Unlock()

    client.markActive()

    fmt.Printf("%s joined (total: %d)\n", client.username, len(cr.clients))

    cr.sendHistory(client, 10)

    announcement := fmt.Sprintf("*** %s joined the chat ***\n", client.username)
    cr.handleBroadcast(announcement)
}

func (cr *ChatRoom) handleLeave(client *Client) {
    cr.mu.Lock()
    if !cr.clients[client] {
        // Already removed (for example, by the cleanup worker); nothing to do
        cr.mu.Unlock()
        return
    }
    delete(cr.clients, client)
    remaining := len(cr.clients)
    cr.mu.Unlock()

    fmt.Printf("%s left (total: %d)\n", client.username, remaining)

    // The membership check above guarantees this runs once per client,
    // so the channel is never closed twice
    close(client.outgoing)

    announcement := fmt.Sprintf("*** %s left the chat ***\n", client.username)
    cr.handleBroadcast(announcement)
}

The handleJoin function adds the client to the active clients map, marks them as active for idle tracking, sends them the last 10 messages so they can see recent conversation, and broadcasts an announcement so everyone knows they joined.

The handleLeave function removes the client from the map, closes their outgoing channel, and broadcasts a departure announcement. The membership check at the top makes a repeated leave event for the same client return early, so the channel is closed exactly once. Closing an already-closed channel would panic, and a select that receives from the channel can't serve as an "is it closed?" test, because the receive also succeeds (and discards a message) whenever the buffer still holds data.

How to Send User Lists and History

Add these helper functions to handlers.go:

func (cr *ChatRoom) sendHistory(client *Client, count int) {
    cr.messageMu.Lock()
    defer cr.messageMu.Unlock()

    start := len(cr.messages) - count
    if start < 0 {
        start = 0
    }

    historyMsg := "Recent messages:\n"
    for i := start; i < len(cr.messages); i++ {
        msg := cr.messages[i]
        historyMsg += fmt.Sprintf(" [%s]: %s\n", msg.From, msg.Content)
    }

    select {
    case client.outgoing <- historyMsg:
    default:
    }
}

func (cr *ChatRoom) sendUserList(client *Client) {
    cr.mu.Lock()
    defer cr.mu.Unlock()

    list := "Users online:\n"
    for c := range cr.clients {
        status := ""
        if c.isInactive(1 * time.Minute) {
            status = " (idle)"
        }
        list += fmt.Sprintf("  - %s%s\n", c.username, status)
    }

    list += fmt.Sprintf("\nTotal messages: %d\n", cr.totalMessages)
    list += fmt.Sprintf("Uptime: %s\n", time.Since(cr.startTime).Round(time.Second))

    select {
    case client.outgoing <- list:
    default:
    }
}

func (cr *ChatRoom) handleDirectMessage(dm DirectMessage) {
    select {
    case dm.toClient.outgoing <- dm.message:
        dm.toClient.mu.Lock()
        dm.toClient.messagesSent++
        dm.toClient.mu.Unlock()
    default:
        fmt.Printf("Couldn't deliver DM to %s\n", dm.toClient.username)
    }
}

func (cr *ChatRoom) findClientByUsername(username string) *Client {
    cr.mu.Lock()
    defer cr.mu.Unlock()

    for client := range cr.clients {
        if client.username == username {
            return client
        }
    }
    return nil
}

func (c *Client) markActive() {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.lastActive = time.Now()
}

func (c *Client) isInactive(timeout time.Duration) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    return time.Since(c.lastActive) > timeout
}

You now have a working chat system where clients can connect and exchange messages.

But there's a critical problem: if the server crashes or restarts, all messages are lost. The next step is adding persistence so messages survive failures.

How to Add Persistence with WAL and Snapshots

Persistence ensures your chat history survives server crashes and restarts. Without it, users would lose all their conversations every time the server goes down.

You'll implement this using two complementary mechanisms: a write-ahead log for immediate durability and snapshots for fast recovery.

Create internal/chatroom/persistence.go to handle data durability.

The WAL ensures messages survive crashes:

package chatroom

import (
    "bufio"
    "encoding/json"
    "fmt"
    "io"
    "os"
    "path/filepath"
)

func (cr *ChatRoom) initializePersistence() error {
    if err := os.MkdirAll(cr.dataDir, 0755); err != nil {
        return fmt.Errorf("create data dir: %w", err)
    }

    walPath := filepath.Join(cr.dataDir, "messages.wal")

    if err := cr.recoverFromWAL(walPath); err != nil {
        fmt.Printf("Recovery failed: %v\n", err)
    }

    file, err := os.OpenFile(walPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        return fmt.Errorf("open wal: %w", err)
    }

    cr.walFile = file
    fmt.Printf("WAL initialized: %s\n", walPath)
    return nil
}

func (cr *ChatRoom) recoverFromWAL(walPath string) error {
    file, err := os.Open(walPath)
    if err != nil {
        if os.IsNotExist(err) {
            fmt.Println("No WAL found (fresh start)")
            return nil
        }
        return err
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    recovered := 0

    for scanner.Scan() {
        line := scanner.Text()
        if line == "" {
            continue
        }

        var msg Message
        if err := json.Unmarshal([]byte(line), &msg); err != nil {
            fmt.Printf("Skipping corrupt line: %s\n", line)
            continue
        }

        cr.messages = append(cr.messages, msg)

        if msg.ID >= cr.nextMessageID {
            cr.nextMessageID = msg.ID + 1
        }
        recovered++
    }

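    // Report read errors (such as an overlong line) instead of ignoring them
    if err := scanner.Err(); err != nil {
        return fmt.Errorf("read wal: %w", err)
    }

    fmt.Printf("Recovered %d messages\n", recovered)
    return nil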
}

func (cr *ChatRoom) persistMessage(msg Message) error {
    cr.walMu.Lock()
    defer cr.walMu.Unlock()

    data, err := json.Marshal(msg)
    if err != nil {
        return err
    }

    _, err = cr.walFile.Write(append(data, '\n'))
    if err != nil {
        return err
    }

    return cr.walFile.Sync()
}

Each line is a JSON-encoded message:

{"id":1,"from":"Alice","content":"Hello world","timestamp":"2024-02-06T10:00:00Z","channel":"global"}
{"id":2,"from":"Bob","content":"Hi Alice!","timestamp":"2024-02-06T10:00:05Z","channel":"global"}

The Sync() call is critical for durability. Without it, the OS might keep writes buffered in memory and lose them on a crash. The trade-off is that Sync() is expensive (often in the range of 1-10ms per call, depending on the storage device). Production systems might batch multiple messages per Sync() call to improve throughput.
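
One way to picture batching: write several log lines, then pay for a single Sync(). The sketch below assumes a hypothetical appendBatch helper and a throwaway temp directory; it is not how persistMessage() above works, just an illustration of the idea.

```go
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// appendBatch writes every line to the log and then calls Sync once,
// so the fsync cost is shared by the whole batch.
func appendBatch(f *os.File, lines []string) error {
    var buf strings.Builder
    for _, line := range lines {
        buf.WriteString(line)
        buf.WriteByte('\n')
    }
    if _, err := f.WriteString(buf.String()); err != nil {
        return err
    }
    return f.Sync()
}

// demo appends one batch to a throwaway file and returns the lines on disk.
func demo() ([]string, error) {
    dir, err := os.MkdirTemp("", "wal-sketch")
    if err != nil {
        return nil, err
    }
    defer os.RemoveAll(dir)

    path := filepath.Join(dir, "messages.wal")
    f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    batch := []string{`{"id":1,"content":"Hello"}`, `{"id":2,"content":"Hi"}`}
    if err := appendBatch(f, batch); err != nil {
        return nil, err
    }

    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    return strings.Split(strings.TrimSpace(string(data)), "\n"), nil
}

func main() {
    lines, err := demo()
    fmt.Println(len(lines), "lines on disk, err =", err)
}
```

The catch is that nothing in a batch is durable until the Sync() returns, so a server that wants durability before delivery has to hold the broadcast until the batch is flushed.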

How to Create and Load Snapshots

Add snapshot functionality to persistence.go:

func (cr *ChatRoom) createSnapshot() error {
    snapshotPath := filepath.Join(cr.dataDir, "snapshot.json")
    tempPath := snapshotPath + ".tmp"

    // Serialize under the lock, and remember the count for the log line
    cr.messageMu.Lock()
    data, err := json.MarshalIndent(cr.messages, "", "  ")
    count := len(cr.messages)
    cr.messageMu.Unlock()
    if err != nil {
        return err
    }

    file, err := os.Create(tempPath)
    if err != nil {
        return err
    }

    if _, err := file.Write(data); err != nil {
        file.Close()
        return err
    }
    if err := file.Sync(); err != nil {
        file.Close()
        return err
    }
    // Close before renaming, and check the error: a failed close can mean lost data
    if err := file.Close(); err != nil {
        return err
    }

    // Replace the previous snapshot in one step
    if err := os.Rename(tempPath, snapshotPath); err != nil {
        return err
    }

    fmt.Printf("Snapshot created (%d messages)\n", count)
    return cr.truncateWAL()
}

func (cr *ChatRoom) truncateWAL() error {
    cr.walMu.Lock()
    defer cr.walMu.Unlock()

    if cr.walFile != nil {
        cr.walFile.Close()
    }

    walPath := filepath.Join(cr.dataDir, "messages.wal")
    file, err := os.OpenFile(walPath, os.O_TRUNC|os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        return err
    }
    cr.walFile = file
    fmt.Println("WAL truncated")
    return nil
}

func (cr *ChatRoom) loadSnapshot() error {
    snapshotPath := filepath.Join(cr.dataDir, "snapshot.json")
    file, err := os.Open(snapshotPath)
    if err != nil {
        if os.IsNotExist(err) {
            return nil
        }
        return err
    }
    defer file.Close()

    data, err := io.ReadAll(file)
    if err != nil {
        return err
    }

    cr.messageMu.Lock()
    err = json.Unmarshal(data, &cr.messages)
    cr.messageMu.Unlock()

    if err != nil {
        return err
    }

    for _, msg := range cr.messages {
        if msg.ID >= cr.nextMessageID {
            cr.nextMessageID = msg.ID + 1
        }
    }

    fmt.Printf("Loaded %d messages from snapshot\n", len(cr.messages))
    return nil
}

Writing to .tmp then renaming ensures you never have a half-written snapshot. Even if power fails mid-write, the old snapshot remains valid.
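
The same write-then-rename pattern works for any file. Below is a generic sketch, assuming a hypothetical writeFileAtomic helper. The temp file is created in the same directory as the target, because a rename is only atomic within a single filesystem.

```go
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// writeFileAtomic writes data to a temp file next to path, syncs it, and
// renames it over path. Readers see the old file or the new one, never a mix.
func writeFileAtomic(path string, data []byte) error {
    tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name()) // cleans up on failure; harmless after a rename

    if _, err := tmp.Write(data); err != nil {
        tmp.Close()
        return err
    }
    if err := tmp.Sync(); err != nil {
        tmp.Close()
        return err
    }
    if err := tmp.Close(); err != nil {
        return err
    }
    return os.Rename(tmp.Name(), path)
}

// demo overwrites a snapshot twice and returns the final contents.
func demo() (string, error) {
    dir, err := os.MkdirTemp("", "snapshot-sketch")
    if err != nil {
        return "", err
    }
    defer os.RemoveAll(dir)

    path := filepath.Join(dir, "snapshot.json")
    for _, body := range []string{`["v1"]`, `["v1","v2"]`} {
        if err := writeFileAtomic(path, []byte(body)); err != nil {
            return "", err
        }
    }
    data, err := os.ReadFile(path)
    return string(data), err
}

func main() {
    contents, err := demo()
    fmt.Println(contents, err)
}
```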

Recovery Flow

When the server starts, it first loads the snapshot if it exists. As an illustration, a snapshot of 100K messages might load in roughly 100ms. It then replays the WAL entries written since the snapshot, which are usually only the most recent messages. Recovery then takes seconds instead of the minutes a full log replay could need.
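
The recovery order can be sketched as a pure function over two slices. Note the assumptions: Message here is a trimmed-down stand-in for the article's type, recoverState is a hypothetical helper, and skipping WAL entries whose IDs the snapshot already covers is an extra safeguard this sketch adds (useful if a crash happens after the snapshot is written but before the WAL is truncated).

```go
package main

import "fmt"

// Message is a trimmed-down, hypothetical version of the article's type.
type Message struct {
    ID      int
    Content string
}

// recoverState starts from the snapshot and replays only WAL entries newer
// than the last snapshotted ID. It returns the messages and the next free ID.
func recoverState(snapshot, wal []Message) ([]Message, int) {
    messages := append([]Message(nil), snapshot...)
    lastID := 0
    for _, m := range messages {
        if m.ID > lastID {
            lastID = m.ID
        }
    }
    for _, m := range wal {
        if m.ID <= lastID {
            continue // already covered by the snapshot
        }
        messages = append(messages, m)
        lastID = m.ID
    }
    return messages, lastID + 1
}

func main() {
    snapshot := []Message{{ID: 1, Content: "Hello world"}, {ID: 2, Content: "Hi Alice!"}}
    wal := []Message{{ID: 2, Content: "Hi Alice!"}, {ID: 3, Content: "How are you?"}}

    messages, nextID := recoverState(snapshot, wal)
    fmt.Println(len(messages), "messages recovered, next ID", nextID)
}
```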

With persistence in place, your messages are safe. But network connections are unreliable. Users get disconnected when their WiFi drops, their phone switches towers, or their laptop goes to sleep. The next step is implementing session management so users can reconnect without losing their identity or chat history.

How to Implement Session Management

Session management lets users reconnect to your server after network interruptions without needing to create a new account or re-enter credentials. You'll implement this using cryptographically secure tokens that persist across connections.

Create internal/chatroom/session.go for reconnection handling.

package chatroom

import (
    "fmt"
    "time"

    "github.com/yourusername/chatroom/pkg/token"
)

func (cr *ChatRoom) createSession(username string) *SessionInfo {
    cr.sessionsMu.Lock()
    defer cr.sessionsMu.Unlock()

    tok := token.GenerateToken()

    session := &SessionInfo{
        Username:       username,
        ReconnectToken: tok,
        LastSeen:       time.Now(),
        CreatedAt:      time.Now(),
    }

    cr.sessions[username] = session

    fmt.Printf("Created session for %s (token: %s...)\n", username, tok[:8])

    return session
}

func (cr *ChatRoom) validateReconnectToken(username, token string) bool {
    cr.sessionsMu.Lock()
    defer cr.sessionsMu.Unlock()

    session, exists := cr.sessions[username]
    if !exists {
        return false
    }

    if session.ReconnectToken != token {
        return false
    }

    if time.Since(session.LastSeen) > 1*time.Hour {
        delete(cr.sessions, username)
        return false
    }

    session.LastSeen = time.Now()

    return true
}

func (cr *ChatRoom) updateSessionActivity(username string) {
    cr.sessionsMu.Lock()
    defer cr.sessionsMu.Unlock()

    if session, exists := cr.sessions[username]; exists {
        session.LastSeen = time.Now()
    }
}

func (cr *ChatRoom) isUsernameConnected(username string) bool {
    cr.mu.Lock()
    defer cr.mu.Unlock()

    for client := range cr.clients {
        if client.username == username {
            return true
        }
    }

    return false
}

func (cr *ChatRoom) cleanupInactiveClients() {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        cr.mu.Lock()
        var toRemove []*Client

        for client := range cr.clients {
            if client.isInactive(5 * time.Minute) {
                fmt.Printf("Removing inactive: %s\n", client.username)
                toRemove = append(toRemove, client)
            }
        }
        cr.mu.Unlock()

        for _, client := range toRemove {
            cr.leave <- client
        }
    }
}

How to Generate Secure Tokens

Create pkg/token/token.go for token generation:

package token

import (
    "crypto/rand"
    "encoding/hex"
)

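// GenerateToken returns a cryptographically secure random token:
// 16 random bytes encoded as 32 hex characters
func GenerateToken() string {
    b := make([]byte, 16)
    if _, err := rand.Read(b); err != nil {
        // Without a working entropy source there is no safe fallback
        panic("token: failed to read random bytes: " + err.Error())
    }
    return hex.EncodeToString(b)
}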

Tokens here are transmitted in plaintext over TCP. For production use, you should use TLS encryption to protect tokens in transit, hash tokens before storage so a database breach doesn't expose them, and implement rate limiting on reconnection attempts to prevent brute force attacks.
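
As one possible shape for the "hash tokens before storage" advice, the sketch below stores only a SHA-256 digest and compares digests in constant time. The hashToken and tokenMatches helpers are hypothetical and not part of the session code above. A plain SHA-256 is a reasonable fit here only because the tokens are long and random; it would not be a suitable way to store user-chosen passwords.

```go
package main

import (
    "crypto/sha256"
    "crypto/subtle"
    "encoding/hex"
    "fmt"
)

// hashToken returns the hex SHA-256 digest to store in place of the token.
func hashToken(token string) string {
    sum := sha256.Sum256([]byte(token))
    return hex.EncodeToString(sum[:])
}

// tokenMatches hashes the presented token and compares digests in constant time.
func tokenMatches(storedHash, presented string) bool {
    got := hashToken(presented)
    return subtle.ConstantTimeCompare([]byte(storedHash), []byte(got)) == 1
}

func main() {
    // Example token only; real tokens would come from GenerateToken().
    stored := hashToken("3f9c2b7e5a1d4c8890ab12cd34ef5678")

    fmt.Println(tokenMatches(stored, "3f9c2b7e5a1d4c8890ab12cd34ef5678"))
    fmt.Println(tokenMatches(stored, "wrong-token"))
}
```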

Your chatroom now supports basic messaging and reconnection. But users need ways to interact with the system beyond just sending messages. The command system provides features like listing users, viewing history, and sending private messages.

How to Build the Command System

Commands are messages that start with a forward slash and perform special actions instead of being broadcast to everyone. This is a pattern used by many chat applications like Slack and Discord. You'll implement several useful commands that enhance the user experience.

Add command handling to io.go:

func handleCommand(client *Client, chatRoom *ChatRoom, command string) {
    parts := strings.Fields(command)
    if len(parts) == 0 {
        return
    }

    switch parts[0] {
    case "/users":
        chatRoom.listUsers <- client

    case "/stats":
        client.mu.Lock()
        stats := fmt.Sprintf("Your Stats:\n")
        stats += fmt.Sprintf("  Messages sent: %d\n", client.messagesSent)
        stats += fmt.Sprintf("  Messages received: %d\n", client.messagesRecv)
        stats += fmt.Sprintf("  Last active: %s ago\n", 
            time.Since(client.lastActive).Round(time.Second))
        client.mu.Unlock()

        select {
        case client.outgoing <- stats:
        default:
        }

    case "/msg":
        if len(parts) < 3 {
            select {
            case client.outgoing <- "Usage: /msg <username> <message>\n":
            default:
            }
            return
        }

        targetUsername := parts[1]
        messageText := strings.Join(parts[2:], " ")

        targetClient := chatRoom.findClientByUsername(targetUsername)
        if targetClient == nil {
            select {
            case client.outgoing <- fmt.Sprintf("User '%s' not found\n", targetUsername):
            default:
            }
            return
        }

        privateMsg := fmt.Sprintf("[From %s]: %s\n", client.username, messageText)
        select {
        case targetClient.outgoing <- privateMsg:
        default:
            select {
            case client.outgoing <- fmt.Sprintf("%s's inbox is full\n", targetUsername):
            default:
            }
            return
        }

        select {
        case client.outgoing <- fmt.Sprintf("Message sent to %s\n", targetUsername):
        default:
        }

    case "/history":
        count := 20
        if len(parts) > 1 {
            fmt.Sscanf(parts[1], "%d", &count)
        }
        if count > 100 {
            count = 100
        }
        chatRoom.sendHistory(client, count)

    case "/token":
        chatRoom.sessionsMu.Lock()
        session := chatRoom.sessions[client.username]
        chatRoom.sessionsMu.Unlock()

        if session != nil {
            msg := fmt.Sprintf("Your reconnect token:\n")
            msg += fmt.Sprintf("   reconnect:%s:%s\n", client.username, session.ReconnectToken)
            select {
            case client.outgoing <- msg:
            default:
            }
        }

    case "/quit":
        announcement := fmt.Sprintf("%s left the chat\n", client.username)
        chatRoom.broadcast <- announcement

        select {
        case client.outgoing <- "Goodbye!\n":
        default:
        }

        time.Sleep(100 * time.Millisecond)
        client.conn.Close()

    default:
        select {
        case client.outgoing <- fmt.Sprintf("Unknown: %s\n", parts[0]):
        default:
        }
    }
}
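
Argument parsing is the part of a command handler that is easiest to test on its own. As a sketch, here is a stricter variant of the /history count parsing using strconv.Atoi. The parseHistoryCount helper and the two constants are hypothetical names. Unlike fmt.Sscanf with %d, Atoi rejects input such as "5abc", and this version also falls back to the default for zero or negative counts.

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

const (
    defaultHistory = 20
    maxHistory     = 100
)

// parseHistoryCount reads the optional N from "/history [N]".
// Missing, non-numeric, or non-positive values fall back to the default.
func parseHistoryCount(command string) int {
    parts := strings.Fields(command)
    if len(parts) < 2 {
        return defaultHistory
    }
    n, err := strconv.Atoi(parts[1])
    if err != nil || n <= 0 {
        return defaultHistory
    }
    if n > maxHistory {
        return maxHistory
    }
    return n
}

func main() {
    for _, cmd := range []string{"/history", "/history 5", "/history 500", "/history abc"} {
        fmt.Printf("%-14s -> %d\n", cmd, parseHistoryCount(cmd))
    }
}
```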

Your server is now complete with all the core features: connection handling, message broadcasting, persistence, session management, and commands. But to actually use your chatroom, you need a client application. The client is much simpler than the server because it just needs to connect and relay messages.

How to Create the Client

The client application provides the user interface for your chatroom. It connects to the server, displays incoming messages, and sends outgoing messages typed by the user. While the server is complex with many concurrent components, the client is straightforward.

Create internal/chatroom/client.go for the client implementation.

package chatroom

import (
    "bufio"
    "fmt"
    "net"
    "os"
    "strings"
)

func StartClient() {
    conn, err := net.Dial("tcp", ":9000")
    if err != nil {
        fmt.Println("Error connecting:", err)
        return
    }
    defer conn.Close()

    fmt.Println("Connected to chat server")

    // Background goroutine: read from server
    go func() {
        reader := bufio.NewReader(conn)
        for {
            message, err := reader.ReadString('\n')
            if err != nil {
                fmt.Println("Disconnected from server.")
                os.Exit(0)
            }
            // Clear current prompt line and print message
            fmt.Print("\r" + message)
            fmt.Print(">> ")
        }
    }()

    // Main goroutine: read from stdin
    inputReader := bufio.NewReader(os.Stdin)
    fmt.Println("Welcome to the chat server!")

    for {
        fmt.Print(">> ")
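        message, err := inputReader.ReadString('\n')
        if err != nil {
            // stdin was closed (for example with Ctrl+D), so stop instead of looping forever
            return
        }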
        message = strings.TrimSpace(message)

        if message == "" {
            continue
        }

        conn.Write([]byte(message + "\n"))
    }
}

How the Client Works:

The client uses two goroutines to handle communication simultaneously. The main goroutine reads from stdin (your keyboard) and sends messages to the server. When you type a message and press Enter, it gets sent over the TCP connection immediately.

The background goroutine continuously reads from the server. Whenever a message arrives, it prints it to your screen. The \r (carriage return) moves the cursor back to the start of the line, so the incoming message overwrites the >> prompt instead of appearing after it. After printing the message, it reprints the prompt so you can continue typing.

This dual-goroutine design means you can receive messages while typing. If someone sends a message while you're in the middle of typing yours, their message appears immediately and your prompt reappears below it.

The defer conn.Close() closes the connection whenever StartClient() returns. If the server disconnects, the read goroutine gets an error and calls os.Exit(0), which ends the client process immediately. Deferred calls don't run on os.Exit(), but the operating system closes the socket when the process exits.

How to Create Entry Points

Create cmd/server/main.go:

package main

import (
    "fmt"
    "os"

    "github.com/yourusername/chatroom/internal/chatroom"
)

func main() {
    fmt.Println("Starting server from cmd/server...")
    chatroom.StartServer()
    os.Exit(0)
}

Create cmd/client/main.go:

package main

import (
    "fmt"
    "github.com/yourusername/chatroom/internal/chatroom"
)

func main() {
    fmt.Println("Starting client from cmd/client...")
    chatroom.StartClient()
}

Add a wrapper function in internal/chatroom/server.go:

package chatroom

func StartServer() {
    runServer()
}

With all your entry points created, your chatroom is complete and ready to test. The next step is learning how to test your implementation to ensure everything works correctly.

How to Test Your Chatroom

Testing a concurrent system like a chatroom requires a different approach than testing typical sequential code. You need to verify that goroutines coordinate correctly, messages arrive in the right order, and the system handles edge cases like disconnections.

How to Write Unit Tests

Unit tests verify individual components in isolation. For your chatroom, the most important test is verifying that messages broadcast correctly to all connected clients.

Create internal/chatroom/chatroom_test.go:

package chatroom

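import (
    "strings"
    "testing"
    "time"
)

func TestBroadcast(t *testing.T) {
    // Fail fast if the chatroom can't be created, instead of ignoring the error
    cr, err := NewChatRoom("./testdata")
    if err != nil {
        t.Fatalf("NewChatRoom failed: %v", err)
    }
    defer cr.shutdown()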

    go cr.Run()

    // Create mock clients
    client1 := &Client{
        username: "Alice",
        outgoing: make(chan string, 10),
    }
    client2 := &Client{
        username: "Bob",
        outgoing: make(chan string, 10),
    }

    // Join clients
    cr.join <- client1
    cr.join <- client2
    time.Sleep(100 * time.Millisecond)

    // Broadcast message
    cr.broadcast <- "[Alice]: Hello!"

    // Verify both receive it
    select {
    case msg := <-client1.outgoing:
        if !strings.Contains(msg, "Hello!") {
            t.Fatal("Client1 didn't receive correct message")
        }
    case <-time.After(1 * time.Second):
        t.Fatal("Client1 didn't receive message")
    }

    select {
    case msg := <-client2.outgoing:
        if !strings.Contains(msg, "Hello!") {
            t.Fatal("Client2 didn't receive correct message")
        }
    case <-time.After(1 * time.Second):
        t.Fatal("Client2 didn't receive message")
    }
}

Understanding the Test:

This test creates a chatroom instance and starts its event loop with go cr.Run(). Then it creates two mock clients. Notice these aren't real TCP connections – they're just Client structs with outgoing channels. This lets you test the broadcast logic without needing actual network connections.

The test sends both clients to the join channel, waits 100 milliseconds for them to be processed, then broadcasts a message. The select statements with timeout are crucial. They try to receive from each client's outgoing channel, but if nothing arrives within 1 second, the test fails. This prevents the test from hanging forever if something goes wrong.

The time.Sleep(100 * time.Millisecond) gives the event loop time to process the join events before broadcasting. In a real system, you'd use channels to synchronize, but for tests, a small sleep is acceptable.
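To make the "use channels to synchronize" remark concrete, here is a minimal sketch of a join that carries its own acknowledgement channel, so the caller knows the event loop has processed it without sleeping. The room, joinRequest, and joinAndWait names are hypothetical stand-ins for illustration, not the tutorial's ChatRoom API.

```go
package main

import "fmt"

// joinRequest is a hypothetical message type: it carries the client's
// outgoing channel plus a done channel the loop closes once the join is applied.
type joinRequest struct {
    outgoing chan string
    done     chan struct{}
}

// room is a stripped-down stand-in for the tutorial's ChatRoom.
type room struct {
    join      chan joinRequest
    broadcast chan string
    quit      chan struct{}
}

func newRoom() *room {
    return &room{
        join:      make(chan joinRequest),
        broadcast: make(chan string),
        quit:      make(chan struct{}),
    }
}

// run is the event loop. It alone touches the client list, so no mutex is needed.
func (r *room) run() {
    var clients []chan string
    for {
        select {
        case req := <-r.join:
            clients = append(clients, req.outgoing)
            close(req.done) // tell the caller the join has been processed
        case msg := <-r.broadcast:
            for _, c := range clients {
                c <- msg
            }
        case <-r.quit:
            return
        }
    }
}

// joinAndWait blocks until the event loop has registered the client.
func (r *room) joinAndWait(outgoing chan string) {
    done := make(chan struct{})
    r.join <- joinRequest{outgoing: outgoing, done: done}
    <-done
}

func main() {
    r := newRoom()
    go r.run()
    defer close(r.quit)

    alice := make(chan string, 1)
    bob := make(chan string, 1)
    r.joinAndWait(alice)
    r.joinAndWait(bob)

    r.broadcast <- "[Alice]: Hello!"
    fmt.Println(<-alice)
    fmt.Println(<-bob)
}
```

The trade-off is an extra channel per join request. In exchange, a test built this way no longer depends on how fast the machine schedules the event loop.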

Run tests with:

go test ./internal/chatroom -v

The -v flag shows verbose output, printing each test as it runs. You'll see whether the broadcast test passes and how long it took. Below is the output showing that the test passed:

How to Do Integration Testing

Integration tests verify the entire system working together – the real server, real clients, and real network connections. Unlike unit tests that mock components, integration tests exercise the full stack.

Test the full client-server flow:

# Terminal 1: Start server
go run cmd/server/main.go

# Terminal 2: Client 1
go run cmd/client/main.go
# Enter username: Alice

# Terminal 3: Client 2  
go run cmd/client/main.go
# Enter username: Bob

# Terminal 4: Client 3  
go run cmd/client/main.go
# Enter username: John

# Test messaging between clients

What to Test:

Once you have the server running and multiple clients connected, you can verify all the features you built. Here's what a complete test session looks like:

Basic Messaging: Send a message from Alice and verify Bob and John both receive it. You should see the message appear in all client windows with the sender's username in brackets. Try sending from each client to verify the broadcast works in all directions.
Join and Leave Announcements: When a new client connects, all existing clients should see a "joined the chat" announcement. When someone disconnects (either with /quit or by closing their terminal), everyone should see a "left the chat" message. This confirms your join and leave handlers work correctly.
Private Messaging: Use /msg Bob this is a private message from Alice's client. The message should appear only in Bob's window, not in John's or Alice's. Try sending private messages between different pairs of users to verify the routing works correctly. The sender should receive a confirmation that the message was sent.
User List: Run /users from any client. You should see a list of all connected users. If someone has been idle for over a minute, they should show an "(idle)" status. The command should also display total message count and server uptime.
Chat History: New clients should automatically receive the last 10 messages when they join. You can also use /history 20 to request the last 20 messages. This verifies your message persistence is working.
Session Reconnection: From one client, use /token to get your reconnection token. It will look something like reconnect:Alice:338f04ca.... Copy this token, disconnect the client with Ctrl+C, start a new client, and paste the reconnection string when prompted. You should rejoin the chat with your previous identity, and other users won't see duplicate join announcements.
Statistics: Use /stats to see how many messages you've sent and received, and when you were last active. This verifies the client-side statistics tracking works.
Error Handling: Try connecting with a username that's already in use – you should be rejected. Try sending a private message to a non-existent user – you should get an error. Try using an invalid reconnection token – you should be denied. These tests verify your validation logic works.

Look at the server terminal to see the server's perspective. You'll see connection logs, broadcast confirmations, and any errors. When clients disconnect, you should see their sessions being updated. When the server creates snapshots, you'll see those logged, too.

Integration testing catches problems that unit tests miss, like network timeouts, message ordering issues across multiple clients, or problems with how the WAL file is created and locked. The screenshot below shows a successful integration test with three clients (Alice, Bob, and John) all communicating successfully, with private messages, public broadcasts, and proper join/leave handling.

How to Deploy Your Server

Deploying your chatroom means running it on a server that stays up 24/7, automatically restarts if it crashes, and starts when the server boots. There are several approaches depending on your infrastructure.

How to Use Systemd

Systemd is the standard init system on most Linux distributions. It manages services, handles restarts, and ensures your chatroom starts on boot.

Create /etc/systemd/system/chatroom.service:

[Unit]
Description=Chatroom Server
After=network.target

[Service]
Type=simple
User=chatroom
WorkingDirectory=/opt/chatroom
ExecStart=/opt/chatroom/server
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Understanding the Configuration:

The [Unit] section describes the service and its dependencies. After=network.target ensures the network is up before starting your chatroom.

The [Service] section defines how to run your server. Type=simple means systemd should just run the command and consider it started. User=chatroom runs the server as a dedicated user (not root) for security. WorkingDirectory sets where the server runs, which is important because your WAL and snapshot files are created relative to this directory.

Restart=on-failure tells systemd to automatically restart your server if it crashes. RestartSec=5s waits 5 seconds before restarting, preventing rapid restart loops if there's a persistent problem.

The [Install] section makes your service start at boot when you enable it.

Deploying Your Server:

First, build your server binary:

go build -o server cmd/server/main.go

Then copy it to the deployment location:

sudo mkdir -p /opt/chatroom
sudo cp server /opt/chatroom/
sudo mkdir -p /opt/chatroom/chatdata

Create a dedicated user for running the service:

sudo useradd -r -s /bin/false chatroom
sudo chown -R chatroom:chatroom /opt/chatroom

Enable and start the service:

sudo systemctl enable chatroom
sudo systemctl start chatroom

Check that it's running:

sudo systemctl status chatroom

You can view logs with:

sudo journalctl -u chatroom -f

The -f flag follows the logs in real-time, similar to tail -f.

How to Use Docker

Docker packages your application with all its dependencies, making it easy to deploy anywhere that runs Docker.

Create a Dockerfile:

FROM golang:1.23-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN go build -o server cmd/server/main.go

FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /app/server .
COPY --from=builder /app/chatdata ./chatdata
EXPOSE 9000
CMD ["./server"]

Understanding the Dockerfile:

This uses a multi-stage build. The first stage (builder) uses the full Go image to compile your server. The second stage uses a minimal Alpine Linux image and copies in only the compiled binary and the chatdata directory, leaving the Go toolchain and source code behind. This keeps the final image small (about 20MB instead of 800MB).

EXPOSE 9000 documents which port the container uses. CMD ["./server"] specifies what command runs when the container starts.

Build and Run:

docker build -t chatroom .
docker run -p 9000:9000 -v $(pwd)/chatdata:/root/chatdata chatroom

The -p 9000:9000 maps port 9000 in the container to port 9000 on your host, making the chatroom accessible. The -v $(pwd)/chatdata:/root/chatdata mounts your local chatdata directory into the container, so messages persist even if you stop and remove the container.

Running in Production:

For production, you'd typically use Docker Compose or Kubernetes. Here's a simple docker-compose.yml:

version: '3.8'
services:
  chatroom:
    build: .
    ports:
      - "9000:9000"
    volumes:
      - ./chatdata:/root/chatdata
    restart: unless-stopped

Run with:

docker-compose up -d

The restart: unless-stopped policy ensures your container restarts automatically if it crashes or if the Docker daemon restarts.

Enhancements You Could Add

1. Multi-Room Support

You could add the concept of channels/rooms like this:

type ChatRoom struct {
    rooms map[string]*Room
}

type Room struct {
    name    string
    clients map[*Client]bool
    history []Message
}
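As a rough illustration of how these two structs could be used, the sketch below adds a joinRoom method and a per-room broadcast. Message and Client are minimal stand-ins for the tutorial's types, and joinRoom and broadcast are hypothetical names. The code is not goroutine-safe on its own: it assumes it is called only from the chatroom's event loop.

```go
package main

import "fmt"

// Message and Client are minimal stand-ins for the tutorial's types.
type Message struct {
    From    string
    Content string
}

type Client struct {
    username string
    outgoing chan string
}

type Room struct {
    name    string
    clients map[*Client]bool
    history []Message
}

type ChatRoom struct {
    rooms map[string]*Room
}

// joinRoom adds the client to the named room, creating the room on first use.
func (cr *ChatRoom) joinRoom(name string, c *Client) *Room {
    room, ok := cr.rooms[name]
    if !ok {
        room = &Room{name: name, clients: make(map[*Client]bool)}
        cr.rooms[name] = room
    }
    room.clients[c] = true
    return room
}

// broadcast records the message and delivers it only to members of this room.
func (r *Room) broadcast(msg Message) {
    r.history = append(r.history, msg)
    line := fmt.Sprintf("[#%s] [%s]: %s", r.name, msg.From, msg.Content)
    for c := range r.clients {
        select {
        case c.outgoing <- line:
        default: // drop the message rather than block on a slow client
        }
    }
}

func main() {
    cr := &ChatRoom{rooms: make(map[string]*Room)}
    alice := &Client{username: "Alice", outgoing: make(chan string, 10)}
    bob := &Client{username: "Bob", outgoing: make(chan string, 10)}

    general := cr.joinRoom("general", alice)
    cr.joinRoom("golang", bob)

    general.broadcast(Message{From: "Alice", Content: "Hello!"})
    fmt.Println(<-alice.outgoing)                  // Alice is in #general
    fmt.Println("Bob's queue:", len(bob.outgoing)) // Bob is not, so his queue stays empty
}
```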

2. User Authentication

You could replace simple usernames with proper authentication for added security:

type User struct {
    ID           int
    Username     string
    PasswordHash string
    Email        string
    CreatedAt    time.Time
}

3. File Sharing

You could allow users to upload files:

type FileMessage struct {
    Message
    FileName string
    FileSize int64
    FileURL  string
}

4. WebSocket Support

You could add an HTTP/WebSocket endpoint for web clients.

5. Horizontal Scaling

For massive scale, you could shard across multiple servers using Redis pub/sub or NATS for inter-server communication.

Conclusion

You've now built a production-ready distributed chatroom from scratch. This project demonstrates important distributed systems concepts including concurrency patterns, network programming, state management, persistence, and fault tolerance.

Additional resources:

Go Concurrency: "Concurrency in Go" by Katherine Cox-Buday
Distributed Systems: "Designing Data-Intensive Applications" by Martin Kleppmann
Networking: "Unix Network Programming" by Stevens

The full source code is available on GitHub. Feel free to open issues or contribute improvements.

As always, I hope you enjoyed this guide and learned something. If you want to stay connected or see more hands-on DevOps content, you can follow me on LinkedIn.

Why You Should Stop Managing Kafka Manually – A Guide to Kafka UI and Cruise Control

Ramesh Sinha — Wed, 14 Jan 2026 15:58:54 +0000

Over 80% of Fortune 100 companies use Apache Kafka. That's not surprising, as Kafka has revolutionized how we build real-time data pipelines and streaming applications. If you're working in software engineering today, chances are you've encountered Kafka in some capacity.

But here's the thing: while Kafka itself is incredibly powerful, managing Kafka clusters is notoriously challenging. This isn't a flaw in Kafka – it's just the reality of distributed systems. The bigger your cluster grows, the more complex operations become.

The most painful aspect? Manual cluster management. It's tedious, error-prone, and doesn't scale. What starts as simple topic creation with a few brokers turns into hours of carefully orchestrating partition reassignments across dozens of machines. One typo in a JSON file at 3 AM can take down production.

Sound familiar? You're not alone.

In this guide, you'll learn how two tools can transform Kafka operations from a manual slog into a manageable process:

Kafka UI – A modern web interface that replaces cryptic CLI commands with visual cluster management
Cruise Control – LinkedIn's automation engine that handles cluster balancing and self-healing

We'll start by experiencing the pain of manual management firsthand, then see how these tools solve real-world operational challenges. You'll set up everything locally with Docker, and by the end you’ll know exactly how to manage Kafka clusters without the headache.

What We’ll Cover:

Prerequisites
Setting Up Our Unmanaged Cluster
Starting the Cluster & Verification
Creating Topics: The Manual Way
Kafka UI
- Setting up Kafka UI
- Drawbacks of Kafka UI
Cruise Control
Conclusion

The Problem: Manual Kafka Management

Let’s dive right in. First, I'm going to show you what managing a Kafka cluster looks like without any tools – just you, the command line, and dozens of manual operations.

You’ll spin up a small cluster locally, create some topics, and simulate the kind of growth you'd see in a real production environment. By the end of this section, you'll understand exactly why teams spend thousands of engineering hours just keeping Kafka clusters running smoothly.

Fair warning: this is going to feel tedious but it’s ok – that’s the point.

Prerequisites

Before we dive in, make sure you have:

Docker Desktop installed and running
- Mac and Windows users: https://www.docker.com/products/docker-desktop/
- Linux users can install Docker Engine via their package manager
Basic Kafka knowledge. You should understand:
- Topics: Categories for organizing messages
- Partitions: How topics are divided for parallelism
- Brokers: The Kafka servers that store data
- Producers and Consumers: Applications that write to and read from Kafka
- KRaft: Kafka's built-in Raft-based consensus mode, which manages cluster metadata without ZooKeeper

If these terms are new to you, here’s a great handbook about them. I’d also recommend reading Kafka's Introduction first.

System Requirements
- At least 8GB RAM
- 10GB free disk space
Some basic understanding of containers is good to have:
- Docker
- Images
- Volumes
- Networks

Setting Up Our Unmanaged Cluster

Let’s go ahead and build the cluster so that we can see the problems firsthand. We’ll use Docker to spin up three Kafka brokers running in KRaft mode (the modern, ZooKeeper-free approach).

Start by creating a file called docker-compose-basic.yml:

version: '3.8'

services:
  kafka-1:
    image: confluentinc/cp-kafka:7.6.0
    container_name: kafka-1
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
    volumes:
      - kafka-1-data:/var/lib/kafka/data

  kafka-2:
    image: confluentinc/cp-kafka:7.6.0
    container_name: kafka-2
    ports:
      - "9093:9093"
    environment:
      KAFKA_NODE_ID: 2
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
    volumes:
      - kafka-2-data:/var/lib/kafka/data

  kafka-3:
    image: confluentinc/cp-kafka:7.6.0
    container_name: kafka-3
    ports:
      - "9094:9094"
    environment:
      KAFKA_NODE_ID: 3
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
    volumes:
      - kafka-3-data:/var/lib/kafka/data

volumes:
  kafka-1-data:
  kafka-2-data:
  kafka-3-data:

In the above configuration file, we’re creating three Kafka brokers (kafka-1, kafka-2, kafka-3). Each one uses the confluentinc/cp-kafka:7.6.0 image and has its port opened (9092, 9093, 9094).

The environment variables are:

KAFKA_NODE_ID – A unique identifier for each broker (1,2,3). No two brokers can have the same ID.
KAFKA_PROCESS_ROLES: broker,controller – This tells Kafka to run in KRaft mode (without ZooKeeper). Each broker acts as both a data broker and a controller for cluster coordination.
KAFKA_CONTROLLER_QUORUM_VOTERS – The membership list that tells each broker how to find the others. All three brokers must have the identical list: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093. This is how they discover each other and elect a leader.
CLUSTER_ID – A unique identifier for the entire cluster. All brokers must use the exact same value or they won't recognize each other as part of the same cluster. The actual value (MkU3OEVBNTcwNTJENDM2Qk) doesn't matter as long as it is consistent across brokers. One important thing to note is that CLUSTER_ID must be a valid base64-encoded UUID, per Kafka’s requirement.
KAFKA_LISTENERS – Defines which network interfaces and ports Kafka listens on. We have three listeners:
- PLAINTEXT://0.0.0.0:29092: For inter-broker communication (brokers talking to each other)
- CONTROLLER://0.0.0.0:29093: For controller communication in KRaft mode
- PLAINTEXT_HOST://0.0.0.0:9092 (varies per broker): For external connections from your machine
KAFKA_ADVERTISED_LISTENERS – Tells clients (producers/consumers) how to connect to this broker. This is what gets returned when a client asks "where should I connect?" The PLAINTEXT_HOST://localhost:9092 part is what allows you to connect from your host machine.

Note: Listener configuration is critical. Incorrect settings will prevent clients from connecting even when brokers are running. These settings work for local Docker environments where Docker's internal DNS resolves broker names (kafka-1, kafka-2, kafka-3). For production, replace hostnames with actual IP addresses or FQDNs (Fully Qualified Domain Names).

KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2 – How many copies of consumer offset data to keep. We use 2 instead of 3 because with only three brokers, this prevents issues during rolling restarts. In production with more brokers, you'd use 3 or more.
The Volumes – kafka-x-data:/var/lib/kafka/data creates persistent storage for each broker’s data. Without volumes you will lose your topics and messages if you stop or restart your containers. Volumes are assigned to each broker so they don’t accidentally share data.
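If you want your own CLUSTER_ID rather than the one in this tutorial, Kafka ships a kafka-storage random-uuid command for that. The sketch below only illustrates what "base64-encoded UUID" means in practice: 16 random bytes encoded as URL-safe base64 without padding, which gives a 22-character string like the one above. The function names are hypothetical, and it is an approximation of the format rather than Kafka's own generator.

```go
package main

import (
    "crypto/rand"
    "encoding/base64"
    "fmt"
    "strings"
)

// newClusterID returns 16 random bytes encoded as URL-safe base64 without
// padding, matching the 22-character shape used in the compose file.
func newClusterID() (string, error) {
    raw := make([]byte, 16)
    for {
        if _, err := rand.Read(raw); err != nil {
            return "", err
        }
        id := base64.RawURLEncoding.EncodeToString(raw)
        // Skip IDs starting with '-', which a CLI could mistake for a flag.
        if !strings.HasPrefix(id, "-") {
            return id, nil
        }
    }
}

// looksLikeClusterID reports whether s decodes to exactly 16 bytes.
func looksLikeClusterID(s string) bool {
    raw, err := base64.RawURLEncoding.DecodeString(s)
    return err == nil && len(raw) == 16
}

func main() {
    id, err := newClusterID()
    if err != nil {
        panic(err)
    }
    fmt.Println("generated:", id)
    fmt.Println("tutorial ID has the right shape:", looksLikeClusterID("MkU3OEVBNTcwNTJENDM2Qk"))
}
```

Whichever way you generate the value, remember that it must be identical on all three brokers.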

Note: For a restart from scratch you need to delete the volumes using the following command. The -v flag removes volumes. Without it, old data persists even after down.

docker compose -f docker-compose-basic.yml down -v

If you're using the legacy docker-compose tool (V1), replace docker compose with docker-compose in all commands throughout this tutorial.

Ports

Three ports are used for any given broker. Their purposes are:

Port	Purpose
9092	External connections (producers and consumers on your host machine); 9093 and 9094 for kafka-2 and kafka-3
29092	Internal broker-to-broker communication
29093	Cluster coordination via KRaft

Starting the Cluster & Verification

Now that we have the basic docker configuration for Kafka, let’s run it and verify the results.

Run the following command in the same directory where you saved docker-compose-basic.yml:

docker compose -f docker-compose-basic.yml up -d

The -d flag runs the containers in detached mode (in the background), so you get your terminal back.

You should see output like this:

Using the following command, check if the containers running Kafka brokers are up:

docker ps

You should see three Kafka containers (kafka-1, kafka-2, kafka-3) with status “Up” – something like this:

Run the following command to verify that all three brokers are registered in the cluster:

docker exec -it kafka-1 kafka-broker-api-versions --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092

You should see API version information for all three brokers (IDs 1, 2, 3) without any connection errors.

Note that we’re using kafka-1:29092,kafka-2:29092,kafka-3:29092 here (the internal Docker addresses) instead of localhost:9092 because this command runs inside the kafka-1 container by virtue of docker exec -it kafka-1, where localhost only refers to that specific container.

If any of the above verification steps returns errors or doesn’t show the expected result, you can run the following command to see the logs and debug:

docker logs kafka-1

Creating Topics: The Manual Way

Now that we have a cluster running, let’s simulate a real production use case where different teams need Kafka topics for their applications – payments, logs, events, metrics, notifications, you name it.

Let’s start by creating a topic for logs. The command to do this is:

docker exec -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 12 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

You’ll need to specify some command parameters, which are:

The exact broker address kafka-1:29092,kafka-2:29092,kafka-3:29092 (or the IP address of your servers in production)
The number of partitions – I have used 12 in the above command. Creating too few partitions creates bottlenecks, while creating too many adds overhead.
Retention policy – I have used 7 days (that is, 604800000 milliseconds)
Compression type

Manually managing these parameters and running the command a handful of times is okay – but what if you have to run this for every team in your enterprise? Each team will have different requirements. The grind of copy, paste, adjust becomes painful if you have 100+ topics and multiple clusters (dev, staging, prod).

Feel the pain yet? Well, let’s just go on for a minute and we’ll address this issue shortly. For now, if you run the above command you should see the “Created topic” message:

Note: We’re using kafka-1:29092,kafka-2:29092,kafka-3:29092 to reach the Kafka brokers because docker exec runs the command inside the kafka-1 container.

Let's keep going. We’ll create more topics using the same command by changing the topic name and partitions. Copy, paste, update, and run the above command a couple of times. On my machine, I ran it 3 more times like below (you can choose to run it a couple more times with changed values – it won’t matter because concrete values are not important for this tutorial):

docker exec -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-views \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 20 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy


docker exec -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-analytics \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 3 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy


docker exec -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-articles \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 5 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

After creating the topics, let’s see all the ones you have now by running the following command:

docker exec -it kafka-1 kafka-topics \
  --list \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092

You should see a list of topics like this:

Notice that you just get the list of topics but no meaningful information, like:

How many partitions does each have?
Which brokers are hosting them?
Are they evenly distributed?
What are their configurations?

Partition Information

Let’s try to get information about our partitions. For this tutorial, I have created 4 topics and a total of 40 partitions spread across three brokers. I want to see which broker has the most partitions.

In a well-managed cluster, you’d want them roughly evenly distributed. But how can we check that?

Maybe the describe command shown below can help. Let’s run it:

docker exec -it kafka-1 kafka-topics \
  --describe \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092

It will return a wall of text, something like this:

So, we have partition information but:

No summary or aggregation
No visual representation
It’s difficult to scan and compare
It gets harder to read with every topic and partition you add

Counting Leaders

The Leader field in the above screenshot tells you which broker is the leader for each partition. Leaders handle all read and write requests, so you want them evenly distributed or else some brokers will become overloaded.

Let’s try to count how many partitions each broker leads. To do that, run the following command:

docker exec -it kafka-1 kafka-topics \
  --describe \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 | grep "Leader: 1" | wc -l

It will show something like this:

Per my topic creation, 14 is the count of partitions where broker 1 (Leader: 1) is the leader. You might see a different number depending on how many topics and how many partitions you have created.

You can repeat this command to see the count of partitions led by other brokers. To do so, just change Leader: 1 to Leader: 2 or Leader: 3. I get 14, 12, 14:

That’s somewhat balanced, but you had to run the command multiple times, parse using grep and wc, and this is just 3 brokers. What if you had 100+? Also, what if you have to get the replicas’ information?

I could go on and on with the data we need and the commands to get that information. But the point I’m trying to make here is that sooner or later this becomes impossible to manage. Your team is going to need an army, and to be honest there isn’t much value in doing all of this manually.

So far, you’ve seen only simple operational commands, but the problems don’t stop there. In a real production environment there are more complex and challenging operations like:

Consumer Lag Monitoring: When consumers fall behind, you need to track which partitions are lagging, which consumer instances own them, and whether the lag is growing or shrinking. With CLI tools, you get raw numbers but no trends or context.
Broker Failures: When a broker fails, you need to identify under-replicated partitions, trigger leader elections, and create partition reassignment JSON files manually. One mistake in that JSON can cause data loss.
Cluster rebalancing: You’ll see that when you add new brokers, they sit empty until you manually redistribute partitions. Similarly for removing brokers, you need to move all their partitions first. These operations require calculating optimal placement and creating complex reassignment plans.

If you’re still with me, you’re probably thinking that there has to be a better way. Fortunately, there is – actually, there are a couple of complementary ways, and we are going to talk about those next.

Kafka UI

Kafka UI is a modern, open-source web interface for managing Kafka clusters. It replaces the command line chaos we just experienced with a clean, visual dashboard.

Kafka UI provides the following features:

Visual cluster overview: see all brokers, topics, and partitions at a glance
Topic management: create, configure, and delete topics with a GUI
Consumer group monitoring: track lags, offsets, and consumer health in real-time
Message browsing: view actual messages in topics without command line tools

Without further ado, let’s set up Kafka UI.

Setting Up Kafka UI

To set up Kafka UI, let’s modify our existing docker-compose-basic.yml like this:

version: '3.8'

services:
  kafka-1:
    image: confluentinc/cp-kafka:7.6.0
    container_name: kafka-1
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
    volumes:
      - kafka-1-data:/var/lib/kafka/data

  kafka-2:
    image: confluentinc/cp-kafka:7.6.0
    container_name: kafka-2
    ports:
      - "9093:9093"
    environment:
      KAFKA_NODE_ID: 2
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
    volumes:
      - kafka-2-data:/var/lib/kafka/data

  kafka-3:
    image: confluentinc/cp-kafka:7.6.0
    container_name: kafka-3
    ports:
      - "9094:9094"
    environment:
      KAFKA_NODE_ID: 3
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
    volumes:
      - kafka-3-data:/var/lib/kafka/data
# Adding kafka-UI service start
  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    container_name: kafka-ui
    ports:
      - "8080:8080"
    environment:
      DYNAMIC_CONFIG_ENABLED: 'true'
      KAFKA_CLUSTERS_0_NAME: freecodecamp-cluster
      KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka-1:29092,kafka-2:29092,kafka-3:29092
    depends_on:
      - kafka-1
      - kafka-2
      - kafka-3
# Adding kafka-UI service end
volumes:
  kafka-1-data:
  kafka-2-data:
  kafka-3-data:

The YAML file is pretty much the same as before, except that we have added a new service called kafka-ui (for clarity, I have placed the changes between the start and end comments).

Key Configurations are:

Port 8080 – You can access the UI at http://localhost:8080 from your machine.
KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS – This environment variable tells Kafka UI where to connect to your cluster (using internal Docker addresses).
KAFKA_CLUSTERS_0_NAME – A friendly name for your cluster in the UI.

Let’s first clean up the old cluster while keeping the topic data intact. Go ahead and run the following command to do so:

docker compose -f docker-compose-basic.yml down

Note that we’re not using -v here, so volumes (topic data) will remain intact.

Wait a couple of seconds and then run the following command to bring up our cluster with Kafka UI:

docker compose -f docker-compose-basic.yml up -d

Now open a browser and visit http://localhost:8080/. You’ll see the UI like this:

You can click around and see all information about the cluster we have created, like:

Your 3 brokers
The topics you created earlier
Partition counts

For comparison with manual commands, let's look at the Brokers tab. You can see the partition leader count for each broker at a glance – remember that we had to run multiple commands to get this information earlier. Beyond this, the UI provides many other useful metrics that would require separate command-line queries.

Remember the CLI commands we had to run to create topics? If you go to the Topics tab, you will notice that topic management (creation, deletion, data cleanup, and so on) is just a few button clicks.

Similarly, managing Consumers only requires a few button clicks.

After exploring the Kafka UI, you'll see how much easier it is to monitor your cluster compared to running individual CLI commands.

Drawbacks of Kafka UI

That said, Kafka UI does have some limitations:

Automatic rebalancing: If one or a few brokers hold more partitions than the others, you must reassign them manually.
Self-healing: If a broker fails, you have to manually create reassignment plans.
Performance optimization: The UI can’t recommend intelligent partition placement.
Alerts: The UI doesn’t warn you before problems happen.

For small clusters (3 to 10 brokers), Kafka UI plus a few manual commands might be enough. You’ll be able to see problems clearly and fix them when needed.

For large clusters, manual operations are still not scalable, so we need some kind of a complementary tool…and that tool is Cruise Control.

Cruise Control

Cruise Control is an automation engine for Kafka clusters. While Kafka UI gives you visibility and manual control, Cruise Control provides intelligent automation and self-healing. You can think of Kafka UI as a dashboard with manual controls and Cruise Control as an autopilot. In other words, they complement each other.

Let’s try to create some imbalance in our cluster and fix it manually. This will help you learn how to reason through why you need Cruise Control.

To keep things simple, let’s start from scratch. We will first delete all the Docker resources we have created so far by running the following command:

docker compose -f docker-compose-basic.yml down -v

Running docker compose down -v will delete all the topics and messages we created so far, but don’t worry – we’ll create them again.

How Cruise Control Works

You can think of Cruise Control as a metric-monitoring and action-taking tool. Kafka brokers collect internal metrics (CPU, disk, network traffic, partition sizes), and a metric reporter running inside each broker sends these metrics to a Kafka topic.

Cruise Control then reads from that topic and analyzes the data. Based on that analysis, it proposes partition movements. We’ll see this in action shortly.

Setting Up Cruise Control

As of this writing, I couldn’t find a compatible Kafka and Cruise Control image that supports KRaft (Kafka Raft, Kafka’s built-in consensus protocol that replaces ZooKeeper), so I created public Kafka and Cruise Control images for this tutorial. I don’t recommend using these images in production. For production use, either wait for the community to provide an image or build one of your own.

Change the docker-compose-basic.yml file to look like the below:

version: '3.8'

services:
  kafka-1:
    image: justramesh2000/kafka-apache-cc:3.8.1
    container_name: kafka-1
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
      # Cruise Control Metrics Reporter
      KAFKA_METRIC_REPORTERS: 'com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter'
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS: 'kafka-1:29092,kafka-2:29092,kafka-3:29092'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE: 'true'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS: '1'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR: '2'
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE: 'false'
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS: '60000'
    volumes:
      - kafka-1-data:/var/lib/kafka/data

  kafka-2:
    image: justramesh2000/kafka-apache-cc:3.8.1
    container_name: kafka-2
    ports:
      - "9093:9093"
    environment:
      KAFKA_NODE_ID: 2
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
      KAFKA_METRIC_REPORTERS: com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS: kafka-1:29092,kafka-2:29092,kafka-3:29092
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE: 'false'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC: __CruiseControlMetrics
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE: 'true'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS: '1'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR: '2'
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS: '60000'
    volumes:
      - kafka-2-data:/var/lib/kafka/data

  kafka-3:
    image: justramesh2000/kafka-apache-cc:3.8.1
    container_name: kafka-3
    ports:
      - "9094:9094"
    environment:
      KAFKA_NODE_ID: 3
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
      KAFKA_METRIC_REPORTERS: com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS: kafka-1:29092,kafka-2:29092,kafka-3:29092
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE: 'false'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC: __CruiseControlMetrics
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE: 'true'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS: '1'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR: '2'
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS: '60000'
    volumes:
      - kafka-3-data:/var/lib/kafka/data
  # Adding kafka-UI service start
  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    container_name: kafka-ui
    ports:
      - "8080:8080"
    environment:
      DYNAMIC_CONFIG_ENABLED: 'true'
      KAFKA_CLUSTERS_0_NAME: freecodecamp-cluster
      KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka-1:29092,kafka-2:29092,kafka-3:29092
    depends_on:
      - kafka-1
      - kafka-2
      - kafka-3
    volumes:
      - ./config:/opt/cruise-control/config  
  # Adding kafka-UI service end
  # Adding cruise-control start
  cruise-control:
    image: justramesh2000/cruise-control-kraft:2.5.142
    container_name: cruise-control
    ports:
      - "9090:9090"
    volumes:
      - ./config/cruisecontrol.properties:/opt/cruise-control/config/cruisecontrol.properties
      - ./config/capacityJBOD.json:/opt/cruise-control/config/capacityJBOD.json:ro
      - ./config/log4j.properties:/opt/cruise-control/config/log4j.properties:ro
    depends_on:
      - kafka-1
      - kafka-2
      - kafka-3
   # Adding cruise-control end    
volumes:
  kafka-1-data:
  kafka-2-data:
  kafka-3-data:

You should have made the following changes to the file:

Changed the Kafka image from confluentinc/cp-kafka:7.6.0 to justramesh2000/kafka-apache-cc:3.8.1. The new image contains the Cruise Control metrics reporter, which exports metrics data from the Kafka brokers for Cruise Control to use.
Added the following environment variables:
- KAFKA_METRIC_REPORTERS – This variable tells Kafka to load a plugin called the Cruise Control Metrics Reporter. It runs inside each Kafka broker process, and hooks into Kafka’s internal metrics system. This helps with data collection.
- KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS – This tells the Cruise Control Metrics Reporter where to send metrics to, meaning which Kafka brokers and which port.
- KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE – This disables specific Kubernetes behaviors (Pod name, id instead of Host). We are using Docker, so we don’t need K8s behaviors.
- KAFKA_CRUISE_CONTROL_METRICS_TOPIC – Specifies the name of the topic where metrics will be published. Default is __CruiseControlMetrics but you can customize using this variable if you want to.
- KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE – Automatically creates the __CruiseControlMetrics topic if it doesn’t exist. Without this setting, the reporter will fail to report until you manually create the topic.
- KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS – Defines the number of partitions for the topic __CruiseControlMetrics.
- KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR – Tells Kafka how many copies of metrics data to keep. In our case, we’re keeping 2 copies of the data.
- KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS – Tells Kafka how often to send metrics. We’re sending every minute.
Added Cruise-control service using image justramesh2000/cruise-control-kraft:2.5.142. For clarity, I have kept this change between the start and end comments.
Under the cruise-control service, we’ve mounted three Cruise Control configuration files. We’ll talk about those files next.
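To see how these environment variables relate to broker properties, here is a rough Python sketch. It assumes the image follows the common Kafka Docker convention (drop the KAFKA_ prefix, lowercase the rest, turn underscores into dots). That convention is an assumption about the image rather than something verified here, and the helper name env_to_property is made up for illustration.

```python
def env_to_property(env_name: str) -> str:
    """Translate a KAFKA_* environment variable into a broker property name.

    Assumption: the image strips the KAFKA_ prefix, lowercases the rest,
    and replaces underscores with dots.
    """
    prefix = "KAFKA_"
    if not env_name.startswith(prefix):
        raise ValueError(f"{env_name} is not a KAFKA_* variable")
    return env_name[len(prefix):].lower().replace("_", ".")


env_vars = [
    "KAFKA_METRIC_REPORTERS",
    "KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE",
    "KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS",
]
for name in env_vars:
    print(f"{name} -> {env_to_property(name)}")
```

Under that assumption, KAFKA_METRIC_REPORTERS ends up as the broker's metric.reporters property, which is how the reporter plugin gets loaded.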

Cruise Control Configuration File

To run Cruise Control, we need to provide several configuration files. Among the key pieces of information are:

Where the Kafka cluster is located
The capacity of each broker

Create a config directory and add the following files:

mkdir config

cruisecontrol.properties

This is Cruise Control’s main configuration file.

Save the following content as cruisecontrol.properties in the config directory:

# Kafka cluster. Tells how to connect to brokers
bootstrap.servers=kafka-1:29092,kafka-2:29092,kafka-3:29092

# Topic from which metrics are to be read
metric.reporter.topic=__CruiseControlMetrics

# Aggregated partition data
partition.metric.sample.store.topic=__KafkaCruiseControlPartitionMetricSamples

#Aggregated broker data
broker.metric.sample.store.topic=__KafkaCruiseControlModelTrainingSamples

# Enable broker failure detection for KRaft mode (no ZooKeeper)
kafka.broker.failure.detection.enable=true

# Capacity. Tells where the capacity file is 
capacity.config.file=config/capacityJBOD.json

# Goals. What to optimize for during cluster balancing. These are the rules for CC to abide by during rebalancing
default.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal

# hard goals. 
hard.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal

# Webserver. For WebApi access
webserver.http.port=9090
webserver.http.address=0.0.0.0

# Monitoring windows. How many metric windows to collect before generating proposals
num.broker.metrics.windows=1
num.partition.metrics.windows=1

I’ve added inline comments to explain much of the above configuration, but I think the goals need special attention. These are the rules that we as users have set for Cruise Control to abide by.

By defining goals, we tell Cruise Control to do the following:

RackAwareGoal – Spread replicas across racks (or in our case, brokers)
ReplicaCapacityGoal – Don't overload brokers with too many replicas
DiskCapacityGoal – Don't fill up disk
NetworkInboundCapacityGoal – Keep incoming network traffic within each broker's capacity
NetworkOutboundCapacityGoal – Keep outgoing network traffic within each broker's capacity
CpuCapacityGoal – Keep CPU usage within each broker's capacity
ReplicaDistributionGoal – Evenly distribute replicas
DiskUsageDistributionGoal – Ensure even disk usage across brokers
LeaderReplicaDistributionGoal – Evenly distribute leader replicas
LeaderBytesInDistributionGoal – Balance data flowing to leaders

Via Cruise Control configuration, you can define two types of goals: Default goals and Hard goals. Hard goals must be met. Default goals that aren’t part of the hard goals become soft goals. This means that Cruise Control will give its best effort to satisfy them but won’t reject a proposal if it can’t.

Here’s a little summary:

Type	Meaning	What CC Does
Hard Goals	Must-haves (capacity limits)	Never violates – rejects proposal if can't satisfy
Soft Goals	Nice-to-haves (better balance)	Tries to satisfy – skips if conflicts with hard goals
Default Goals	Hard + Soft together	Optimizes for all – prioritizes hard over soft
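To make the relationship concrete, here is a small Python sketch that derives the soft goals from the two lists in cruisecontrol.properties above. It only illustrates the set relationship (soft = default minus hard), and it isn't how Cruise Control itself is implemented.

```python
# Short goal names taken from default.goals in cruisecontrol.properties.
default_goals = [
    "RackAwareGoal",
    "ReplicaCapacityGoal",
    "DiskCapacityGoal",
    "NetworkInboundCapacityGoal",
    "NetworkOutboundCapacityGoal",
    "CpuCapacityGoal",
    "ReplicaDistributionGoal",
    "DiskUsageDistributionGoal",
    "LeaderReplicaDistributionGoal",
    "LeaderBytesInDistributionGoal",
]

# Short goal names taken from hard.goals in the same file.
hard_goals = {
    "RackAwareGoal",
    "ReplicaCapacityGoal",
    "DiskCapacityGoal",
    "NetworkInboundCapacityGoal",
    "NetworkOutboundCapacityGoal",
    "CpuCapacityGoal",
}

# Soft goals are the default goals that are not hard goals (order preserved).
soft_goals = [goal for goal in default_goals if goal not in hard_goals]
print(soft_goals)
```

With this configuration, the four distribution goals come out as the soft ones, while rack awareness and the capacity goals are hard.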

Cruise Control collects metrics for a defined period (default: 5 minutes) and creates a monitoring window. The following settings control how many windows Cruise Control needs before it’s ready to generate proposals (we’ll see what proposals are shortly):

num.broker.metrics.windows=1: Wait for 1 monitoring window of broker metrics before generating proposals. Each window in Cruise Control is 5 minutes by default, so Cruise Control will be ready after about 5 minutes. I’ve set this to 1 for quick testing. In production, use more windows so that temporary spikes don’t trigger misleading proposals.
num.partition.metrics.windows=1: Wait for 1 window of partition metrics. Same reasoning as above.

Capacity

This file tells Cruise Control the capacity (disk, CPU, network) of each broker, which helps it make decisions. With the file below, we’re telling Cruise Control the following:

The broker IDs (brokerId).
The disk path (/var/lib/kafka/data) and its capacity in MB. The value 100000000 MB is about 100 TB; a 100 GB disk would be 100000. This is used by the DiskCapacityGoal that we set up in cruisecontrol.properties above.
The CPU capacity: 100% (1 core). Used by CpuCapacityGoal.
The NW_IN network inbound capacity: 125,000 KB/s = 125 MB/s (megabytes per second) = 1 Gbps (gigabits per second). Used by NetworkInboundCapacityGoal.
The NW_OUT network outbound capacity: 125,000 KB/s. Used by NetworkOutboundCapacityGoal.
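The units are easy to mix up, so here is a quick Python check of the conversions. It assumes Cruise Control's capacity units (DISK in MB, network in KB/s) and uses decimal prefixes; the variable names are just for illustration.

```python
# Values copied from capacityJBOD.json.
DISK_MB = 100_000_000   # DISK capacity, in MB
NW_KB_PER_S = 125_000   # NW_IN / NW_OUT capacity, in KB/s

# Decimal units: 1 TB = 1,000,000 MB; 1 MB = 1,000 KB; 1 byte = 8 bits.
disk_tb = DISK_MB / 1_000_000
nw_mb_per_s = NW_KB_PER_S / 1_000
nw_gbit_per_s = NW_KB_PER_S * 1_000 * 8 / 1_000_000_000

print(f"DISK:  {disk_tb:g} TB")
print(f"NW_IN: {nw_mb_per_s:g} MB/s = {nw_gbit_per_s:g} Gbps")
```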

Save the following content as capacityJBOD.json in the config directory:

{
  "brokerCapacities":[
    {
      "brokerId": "1",
      "capacity": {
        "DISK": {"/var/lib/kafka/data": "100000000"},
        "CPU": "100",
        "NW_IN": "125000",
        "NW_OUT": "125000"
      }
    },
    {
      "brokerId": "2",
      "capacity": {
        "DISK": {"/var/lib/kafka/data": "100000000"},
        "CPU": "100",
        "NW_IN": "125000",
        "NW_OUT": "125000"
      }
    },
    {
      "brokerId": "3",
      "capacity": {
        "DISK": {"/var/lib/kafka/data": "100000000"},
        "CPU": "100",
        "NW_IN": "125000",
        "NW_OUT": "125000"
      }
    }
  ]
}

Logging

This isn’t required for Cruise Control to work properly, but it’ll help you debug if there are issues. Save the following content as log4j.properties in the config directory. If you see unexpected behavior when you start Cruise Control, such as the container exiting, you can use the docker logs command to see what happened.

# Root logger - INFO level, output to console
rootLogger.level=INFO
appenders=console

# Console output (for docker logs)
appender.console.type=Console
appender.console.name=STDOUT
appender.console.layout.type=PatternLayout
appender.console.layout.pattern=[%d] %p %m (%c)%n

# Send root logger to console
rootLogger.appenderRef.console.ref=STDOUT

Now that we have all the configurations in place, let’s run the following command to start Kafka brokers with Kafka UI and Cruise Control:

docker compose -f docker-compose-basic.yml up -d

Using the following command, verify that the three Kafka brokers, Kafka UI, and Cruise Control containers are running:

docker ps

You should see something like this:

Now that we have Cruise Control up and running, let’s create some imbalance and see how much better the experience is with Cruise Control than when fixing the imbalance manually.

Creating the Imbalance

An imbalance is a scenario where some brokers are handling more messages than others – and they may run into high disk usage or high IOPS.

To create the imbalance in our cluster, we’ll create a few topics and then produce messages unevenly. Now that you have Kafka UI running, you can create the topics there or from the command line. I’m going to use commands because they make it easier for you to reproduce my work (but I recommend the UI for production operations because it prevents typos).

If you also decide to use commands, run the following command. Then, using the UI, verify that the topics have been created.

Note: You’ll find that the commands are different from previous commands. This is because, previously in our docker-compose-basic.yml file, we were using the confluentinc/cp-kafka:7.6.0 image for Kafka. But now we’re using the justramesh2000/kafka-apache-cc:3.8.1 image which is based off of the apache/kafka:3.8.1 image. For different images, the tools are located at different places, so the command needs to be adjusted to account for that.

docker exec -it kafka-1 bash -c '
/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092 \
  --partitions 12 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-views \
  --bootstrap-server kafka-1:29092 \
  --partitions 20 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-analytics \
  --bootstrap-server kafka-1:29092 \
  --partitions 3 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-articles \
  --bootstrap-server kafka-1:29092 \
  --partitions 5 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy
'

Run the following commands to produce uneven numbers of messages on the topics we created above.

Heavy Load on freecodecamp-logs:

docker exec -it kafka-1 bash -c "
for i in {1..100000}; do 
  echo '{\"log_id\":\"'\$i'\",\"level\":\"INFO\",\"message\":\"Log entry '\$i'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-logs --bootstrap-server kafka-1:29092"

Heavy load on freecodecamp-views:

docker exec -it kafka-1 bash -c "
for i in {1..80000}; do 
  echo '{\"view_id\":\"'\$i'\",\"page\":\"/article/'\$((i % 100))'\",\"user\":\"user_'\$((i % 1000))'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-views --bootstrap-server kafka-1:29092"

Moderate load on freecodecamp-analytics:

docker exec -it kafka-1 bash -c "
for i in {1..30000}; do 
  echo '{\"event\":\"page_view\",\"user\":\"user_'\$i'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-analytics --bootstrap-server kafka-1:29092"

Now, produce messages with a fixed key to force all of that data into one partition. This is a fast way to create a strong disk imbalance. Run the following command:

docker exec -it kafka-1 bash -c "
for i in {1..300000}; do
  echo 'hotkey:{\"log_id\":'\$i',\"msg\":\"big payload\"}'
done | /opt/kafka/bin/kafka-console-producer.sh \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092 \
  --property parse.key=true \
  --property key.separator=:"
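Why does a fixed key pile everything into one partition? For keyed records, the default partitioner hashes the key and takes the result modulo the partition count, so identical keys always map to the same partition. The Python sketch below illustrates the idea. Note that zlib.crc32 is only a stand-in for Kafka's actual murmur2 hash, so the partition number it picks won't match the real cluster, and pick_partition is a hypothetical helper.

```python
import zlib

NUM_PARTITIONS = 12  # freecodecamp-logs was created with 12 partitions


def pick_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition: hash(key) mod partition count.

    crc32 is a stand-in hash for illustration; Kafka's default
    partitioner uses murmur2, so real partition numbers differ.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Every record in the loop above uses the same key, "hotkey".
hot = {pick_partition("hotkey") for _ in range(1_000)}
# Distinct keys, by contrast, spread over several partitions.
spread = {pick_partition(f"key-{i}") for i in range(1_000)}

print(f"'hotkey' lands on {len(hot)} partition(s)")
print(f"1000 distinct keys land on {len(spread)} partition(s)")
```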

After running the above commands, come back to the UI, refresh, and you will see a number of messages like this:

Now, go to brokers tab and see the imbalance in Disk Usage:

You should be able to see that Broker-2 has only about 47% of the data that Broker-1 has, and Broker-3 has about 11% more data than Broker-1. Broker-2 is significantly underutilized, while Broker-1 and Broker-3 hold most of the data.

Attempting Manual Rebalancing

Step 1: First, we need to find out which topic is heavy – meaning which one handles more data. My setup shows the freecodecamp-logs topic with 8MB of data:

Step 2: Let’s see where the heavy partitions are.

Click on freecodecamp-logs in Kafka UI and see the partition table. Look at the message count: partition 4 is bigger than the others. The table also gives information about replicas of partitions: partition 4 has replicas on Broker 1 and 3. Broker 2 doesn’t have partition 4 at all. This explains why Broker 2 was underutilized.

Step 3: To balance the cluster, we need to move partition 4 around.

We can move partition 4 to Broker 2. But before that, let’s do some math to be able to rationalize our decision. Note that the calculation doesn’t have to be precise – we just want a relative sense of data between brokers.

Current state:

Broker 1: 4.55 MB
Broker 2: 2.29 MB (underutilized)
Broker 3: 5.11 MB (over-utilized)

Note that the compressed data size of partition 4 is roughly 2.25 MB (the exact size is not critical).

If we move partition 4 from [1,3] to [2,3]:

Broker 1: Loses partition 4, so 4.55 - 2.25 = ~2.30 MB
Broker 2: Gains partition 4, so 2.29 + 2.25 = ~4.54 MB
Broker 3: Already has partition 4, so 5.11 MB (no change)

The result is that Broker 1 becomes underutilized.

How about if we move partition 4 from [1,3] to [1,2]?

Broker 1: Already has partition 4, so 4.55 MB (no change)
Broker 2: Gains partition 4, so 2.29 + 2.25 = ~4.54 MB
Broker 3: Loses partition 4, so 5.11 - 2.25 = ~2.86 MB

Hmm, this still creates an imbalance (broker 3 becomes too light).
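The same what-if arithmetic can be scripted. The following Python sketch uses the approximate figures from the "Current state" list and the rough 2.25 MB partition size. The helper simulate_move is hypothetical, and it models disk usage only, ignoring leadership and network traffic.

```python
# Approximate disk usage per broker (MB), as listed under "Current state".
usage_mb = {1: 4.55, 2: 2.29, 3: 5.11}
PARTITION_4_MB = 2.25  # rough on-disk size of freecodecamp-logs partition 4


def simulate_move(usage, size_mb, old_replicas, new_replicas):
    """Return per-broker usage after moving one partition's replicas."""
    result = dict(usage)
    for broker in set(old_replicas) - set(new_replicas):
        result[broker] -= size_mb  # this broker gives up its replica
    for broker in set(new_replicas) - set(old_replicas):
        result[broker] += size_mb  # this broker receives a new replica
    return {b: round(v, 2) for b, v in result.items()}


for target in ([2, 3], [1, 2]):
    after = simulate_move(usage_mb, PARTITION_4_MB, [1, 3], target)
    spread = round(max(after.values()) - min(after.values()), 2)
    print(f"[1, 3] -> {target}: {after}, max-min spread = {spread} MB")
```

Even in this toy model, every candidate move has to be re-evaluated against all brokers, and a real cluster has many partitions and several resources to track at once.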

So basically, manual rebalancing requires complex calculations. Moving a single partition impacts disk usage, leader distribution, and network traffic across multiple brokers. One poorly planned move can create a new imbalance elsewhere.

But, let’s say you somehow landed on a perfect mathematical calculation and you’re ready to make the move to balance. We’ll assume that the perfect plan is to move Partition 4 from [1, 3] to [2, 3]. I know it’s not the perfect move but the point is to see the pain afterwards.

Step 4: It’s time to move the partition manually.

We need to tell Kafka to move partition 4's replicas from brokers [1,3] to brokers [2,3].

To do that, you need to create a file called reassignment.json on your machine:

{
  "version": 1,
  "partitions": [
    {
      "topic": "freecodecamp-logs",
      "partition": 4,
      "replicas": [2, 3],
      "log_dirs": ["any", "any"]
    }
  ]
}

What this means:

"partition": 4 – Target Partition
"replicas": [2, 3] – New placement: brokers 2 and 3
"log_dirs": ["any", "any"] – Let Kafka choose the disk directory

Save this file somewhere accessible.
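If you need more than one of these files, writing them by hand gets error-prone. As a minimal sketch, the JSON above could be generated with a few lines of Python. The function name build_reassignment is made up for illustration, and the structure simply mirrors the file shown above.

```python
import json


def build_reassignment(topic, partition, replicas):
    """Build the same structure as the reassignment.json shown above."""
    return {
        "version": 1,
        "partitions": [
            {
                "topic": topic,
                "partition": partition,
                "replicas": replicas,
                # "any" lets Kafka choose the log directory on each broker.
                "log_dirs": ["any"] * len(replicas),
            }
        ],
    }


plan = build_reassignment("freecodecamp-logs", 4, [2, 3])
print(json.dumps(plan, indent=2))
```

Redirecting the script's output to reassignment.json gives you the same file as the hand-written version.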

Then run the following command to copy the JSON to the Kafka cluster:

docker cp reassignment.json kafka-1:/tmp/reassignment.json

This copies your local file into the kafka-1 container's /tmp directory.

Run the following command to verify that the file is there:

docker exec -it kafka-1 cat /tmp/reassignment.json

You should see your JSON file content.

Now run the actual reassignment command:

docker exec -it kafka-1 /opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --reassignment-json-file /tmp/reassignment.json \
  --execute

You will get a message from Kafka that will tell you if Kafka has accepted the reassignment and started moving the data.

You can monitor the reassignment using the following command:

docker exec -it kafka-1 /opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --reassignment-json-file /tmp/reassignment.json \
  --verify

I’m not going to run the manual reassignment because I want to keep the imbalance and show how Cruise Control can help reduce the manual steps. Next, let’s see how Cruise Control handles the same imbalance automatically.

Rebalancing Using Cruise Control

After creating the topics and messages, I let Cruise Control run for a couple of minutes. During that time, it collected metrics and trained its linear regression model. You can run the following command to verify that Cruise Control is running and has data (this is a REST API call using curl):

curl http://localhost:9090/kafkacruisecontrol/state

You will get multiple JSON object outputs as part of the response. Each JSON object holds some information about the state of Cruise Control and the Kafka cluster. Let’s see each of these one at a time:

MonitorState: {
  state: RUNNING(20.000% trained),
  NumValidWindows: (1/1) (100.000%),
  NumValidPartitions: 105/105 (100.000%),
  flawedPartitions: 0
}

This describes the state of monitoring, based on the data collected by Cruise Control:

state: RUNNING(20.000% trained) – Cruise Control is actively collecting metrics from your Kafka cluster. Right now it has trained its model on 20% of the expected monitoring data.
NumValidWindows: (1/1) (100%) – Cruise Control has collected 1 complete monitoring window out of 1 required (100% ready). Remember, we had set num.broker.metrics.windows=1 in the cruisecontrol.properties configuration file.
NumValidPartitions: 105/105 (100%) – Cruise Control analyzed all 105 partitions and has metrics for all.
flawedPartitions: 0 – None of the partitions have problematic or missing metrics.

ExecutorState: {state: NO_TASK_IN_PROGRESS}

The above response indicates the execution engine is idle – no partition moves or leadership changes are currently in progress. This makes sense since we haven't asked Cruise Control to do anything yet.

AnalyzerState: {
  isProposalReady: true,
  readyGoals: [
    NetworkInboundCapacityGoal,
    LeaderBytesInDistributionGoal,
    DiskCapacityGoal,
    ReplicaDistributionGoal,
    RackAwareGoal,
    NetworkOutboundCapacityGoal,
    CpuCapacityGoal,
    DiskUsageDistributionGoal,
    LeaderReplicaDistributionGoal,
    ReplicaCapacityGoal
  ]
}

AnalyzerState tells you whether Cruise Control is ready to show a proposal. In this case, it is.

isProposalReady: true – Cruise Control has calculated a potential rebalancing plan (a proposal) that satisfies the configured goals.
readyGoals – These are the goals that are considered ready and valid for rebalancing. Examples:
- DiskCapacityGoal: balance disk usage among brokers
- ReplicaDistributionGoal: balance number of replicas per broker
- RackAwareGoal: maintain replicas across racks for fault tolerance
- LeaderBytesInDistributionGoal: balance network traffic from leaders
- DiskUsageDistributionGoal: ensures partitions are spread to prevent skew

Note that these are the goals we had set earlier in the cruisecontrol.properties file.

AnomalyDetectorState: {
  selfHealingEnabled:[],
  selfHealingDisabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY, TOPIC_ANOMALY, MAINTENANCE_EVENT],
  selfHealingEnabledRatio:{...},
  recentGoalViolations:[],
  recentBrokerFailures:[],
  recentMetricAnomalies:[],
  recentDiskFailures:[],
  recentTopicAnomalies:[],
  recentMaintenanceEvents:[],
  metrics:{...},
  ongoingSelfHealingAnomaly:None,
  balancednessScore:100.000
}

AnomalyDetectorState reports any anomalies Cruise Control has detected, along with the current self-healing settings.

selfHealingEnabled: [] – Automatic self-healing is currently off. Cruise Control will not move partitions automatically in response to anomalies.
selfHealingDisabled: [...] – Lists the anomaly types that are disabled for automatic self-healing, including broker failures, disk failures, and goal violations.
recentGoalViolations: [] – No goals have been violated recently.
balancednessScore: 100.000 – This is how balanced the cluster is according to Cruise Control’s hard goals. 100% means the cluster is perfectly balanced according to the metrics and hard goals currently active. This metric only cares about Hard Goals (Disk Capacity, CPU capacity) being violated – that’s why it shows 100% even though we know there are some disk usage imbalances in our cluster.

The Proposal

Through the AnalyzerState information, Cruise Control told us that it has a proposal for the cluster. Let’s see what it is. We can fetch it using the proposals endpoint:

curl -s "http://localhost:9090/kafkacruisecontrol/proposals?json=true"

The JSON response is quite large. Let's focus on the key parts that show our cluster's imbalance and how Cruise Control plans to fix it:

{
  "summary": {
    "numReplicaMovements": 13,    // CC wants to move 13 partition replicas
    "numLeaderMovements": 6,      // And reassign 6 partition leaders
    "onDemandBalancednessScoreBefore": 84.67,   // Current: 84.67% balanced
    "onDemandBalancednessScoreAfter": 89.76     // After: 89.76% balanced
  },
  "goalSummary": [
    {
      "goal": "DiskUsageDistributionGoal",
      "status": "VIOLATED"
    },
    {
      "goal": "LeaderBytesInDistributionGoal",
      "status": "VIOLATED"
    }
  ]
}

Based on the calculations, Cruise Control thinks:

Moving 13 partition replicas will help. Note that when working manually, we decided to move just one partition (partition 4).
Reassigning 6 partition leaders will help. Manually we didn’t account for any leadership reassignment.
DiskUsageDistributionGoal has been violated. We know that the disk usage is not distributed perfectly.
LeaderBytesInDistributionGoal has also been violated. We couldn’t find this out manually. Technically, you could find out but it would take a decent amount of manual calculations and would still be error-prone.

Note: While we're focusing on disk usage imbalance, Cruise Control optimizes for 10 different goals (disk, CPU, network, leaders, and so on). This holistic approach gives it a better chance of achieving true cluster balance versus balancing manually.
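To make these numbers easier to scan, here is a minimal Python sketch that reads the trimmed proposal shown above and reports the planned movements, the projected score gain, and the violated goals. The field names are copied from the excerpt. The `summarize_proposal` helper is a hypothetical name, and a full response contains many more fields than this sketch handles.

```python
import json

# Trimmed proposal copied from the excerpt above, with the inline
# comments removed so that it parses as plain JSON.
PROPOSAL_JSON = """
{
  "summary": {
    "numReplicaMovements": 13,
    "numLeaderMovements": 6,
    "onDemandBalancednessScoreBefore": 84.67,
    "onDemandBalancednessScoreAfter": 89.76
  },
  "goalSummary": [
    {"goal": "DiskUsageDistributionGoal", "status": "VIOLATED"},
    {"goal": "LeaderBytesInDistributionGoal", "status": "VIOLATED"}
  ]
}
"""


def summarize_proposal(proposal: dict) -> dict:
    """Pull out the handful of fields discussed in this section."""
    summary = proposal["summary"]
    gain = (summary["onDemandBalancednessScoreAfter"]
            - summary["onDemandBalancednessScoreBefore"])
    violated = [entry["goal"] for entry in proposal["goalSummary"]
                if entry["status"] == "VIOLATED"]
    return {
        "replica_moves": summary["numReplicaMovements"],
        "leader_moves": summary["numLeaderMovements"],
        "score_gain": round(gain, 2),
        "violated_goals": violated,
    }


report = summarize_proposal(json.loads(PROPOSAL_JSON))
print(report)
```

The projected gain is about five points, which is a useful sanity check before deciding whether 13 replica movements are worth the extra network traffic.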

Executing the proposal

Let’s run the actual rebalancing using Cruise Control. The command is:

curl -X POST 'http://localhost:9090/kafkacruisecontrol/rebalance?dryrun=false&json=true'

Again, you’ll get a huge JSON file similar to the proposal.
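If you script these calls, it helps to build the URL in one place so that a preview and a real run differ only in the `dryrun` flag. The sketch below is a small assumption-laden helper: `cruise_control_url` is a hypothetical name, the base URL matches the local setup used in this tutorial, and only the `dryrun` and `json` parameters from the commands above are used. With `dryrun=true`, Cruise Control computes the plan without executing it.

```python
from urllib.parse import urlencode

# Base URL of the local Cruise Control instance used in this tutorial.
BASE_URL = "http://localhost:9090/kafkacruisecontrol"


def cruise_control_url(endpoint: str, **params) -> str:
    """Build a Cruise Control URL, rendering booleans as true/false."""
    rendered = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    query = urlencode(rendered)
    url = f"{BASE_URL}/{endpoint}"
    return f"{url}?{query}" if query else url


# Preview first, then execute once the plan looks reasonable.
dry_run_url = cruise_control_url("rebalance", dryrun=True, json=True)
execute_url = cruise_control_url("rebalance", dryrun=False, json=True)
print(dry_run_url)
print(execute_url)
```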

You can track the status using the following API call:

curl "http://localhost:9090/kafkacruisecontrol/user_tasks"

You will get something like this:

Note that the 4th item in the list is our rebalance API call and it’s complete. This was quick for our small Dev cluster, but in large clusters you may see status as InExecution.

Now that Cruise Control has finished executing the proposal, let’s check the UI to see how the imbalance has changed. The UI shows the following for me:

Comparison

Before rebalancing:

Broker 1: 4.52 MB, 69 partitions, 35 leaders
Broker 2: 2.22 MB, 69 partitions, 35 leaders (underutilized)
Broker 3: 5.05 MB, 72 partitions, 35 leaders (overutilized)
Disk range: 2.83 MB (5.05 - 2.22)

After rebalancing:

Broker 1: 4.66 MB, 69 partitions, 38 leaders
Broker 2: 3.87 MB, 77 partitions, 31 leaders
Broker 3: 4.87 MB, 64 partitions, 36 leaders
Disk range: 1.00 MB (4.87 - 3.87)

Results:

Disk usage balanced – Range reduced from 2.83 MB to 1.00 MB (roughly a 65% improvement!)
Replicas redistributed – Broker 2 gained 8 replicas, Broker 3 lost 8 replicas
Leaders balanced – Changed from 35-35-35 to 38-31-36. Cruise Control prioritized balancing actual network traffic over leader count.
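The arithmetic behind these results is easy to reproduce. Here is a short Python sketch that recomputes the disk range and the replica changes from the per-broker numbers listed above. The variable and function names are illustrative only.

```python
# Per-broker disk usage (MB) and replica counts, copied from the lists above.
disk_before = {1: 4.52, 2: 2.22, 3: 5.05}
disk_after = {1: 4.66, 2: 3.87, 3: 4.87}
replicas_before = {1: 69, 2: 69, 3: 72}
replicas_after = {1: 69, 2: 77, 3: 64}


def disk_range(usage_mb: dict) -> float:
    """Gap between the most and least loaded broker."""
    return max(usage_mb.values()) - min(usage_mb.values())


range_before = disk_range(disk_before)
range_after = disk_range(disk_after)
improvement_pct = (range_before - range_after) / range_before * 100

# Positive means the broker gained replicas, negative means it lost some.
replica_delta = {
    broker: replicas_after[broker] - replicas_before[broker]
    for broker in replicas_before
}

print(f"{range_before:.2f} MB -> {range_after:.2f} MB "
      f"({improvement_pct:.1f}% smaller)")
print(replica_delta)
```

The range shrinks by roughly 64.7 percent, and the replica deltas sum to zero, as they should: rebalancing moves replicas between brokers without creating or deleting any.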

The cluster is now more balanced across all metrics. Congrats!

Conclusion

We covered a lot in this tutorial, so let’s take a step back and look at what we did.

You started by experiencing the reality of manual Kafka management – the endless CLI commands, the tedious calculations, the JSON files, and the potential for costly mistakes. If you felt frustrated during that section, that’s to be expected. That frustration is exactly what thousands of engineering teams deal with every day.

Then you were presented with two complementary tools:

Kafka UI gave you visibility. No more grepping through command outputs or manually counting partition leaders. Everything you need (broker health, topic configurations, consumer lag) is right there in a clean web interface. For small teams and development environments, this alone is a game-changer.
Cruise Control gave you intelligence. It didn't just automate what you'd do manually – it also did a fundamentally better job. While you were focused on moving one partition (partition 4), Cruise Control analyzed all 105 partitions across 10 different optimization goals and proposed a comprehensive rebalancing plan. That's the difference between human effort and automated intelligence.

I want to call out that this tutorial used a simplified setup. For production, you should expect a more complex configuration, such as:

Kafka and Cruise Control running on separate machines
Larger monitoring window for Cruise Control
Some self healing capabilities enabled

If there's one thing you take away from this article, let it be this: you should stop managing your Kafka cluster manually. You've seen there's a better way. Use it. Thanks for reading!

How Message Queues Help Make Distributed Systems More Reliable

Anant Chowdhary — Mon, 28 Oct 2024 13:41:21 +0000

Reliable systems consistently perform their intended functions under various conditions while minimizing downtime and failures.

As internet users, we tend to take for granted that the systems that we use daily will operate reliably. In this article, we’ll explore how message queues enhance flexibility and fault tolerance. We’ll also discuss some challenges that we may face while using them.

After reading through, you’ll know how to implement reliable systems and what key performance factors to keep in mind.

Prerequisites

Before diving into this article, you should have a foundational understanding of cloud computing. Here are the key concepts:

Basic principles of Cloud Computing
Availability in Distributed Systems
An understanding of the CAP theorem.

Reliability in Distributed Systems
What Makes Software Reliable?
What is a Message Queue?
How Message Queues Help Make Distributed Systems More Reliable
Challenges with Message Queues
Summary

What Does Reliability Mean in the Context of Distributed Systems?

Reliability, according to the OED, is “the quality of being trustworthy or of performing consistently well”. We can translate this definition to the following in the context of distributed systems:

The ability of a technological system, device, or component to consistently and dependably perform its intended functions under various conditions over time. For instance, in the context of online banking, reliability refers to the consistent and secure processing of transactions. Users expect to complete transfers and access their accounts without errors or outages.
The system being resilient to unexpected or erroneous interactions by users / other systems interacting with it. For instance, if a user tries to access a deleted file on a cloud storage system, the system can gracefully notify them and suggest alternatives, rather than crashing.
The system performs satisfactorily under its expected conditions of operation, as well as in the case of unexpected load and/or disruptions. An example of this is a video streaming service during a major sporting event. The system is designed to perform well under normal traffic but must also handle sudden spikes in users when a popular game starts.

This is quite a general view of what reliability is, and the definition changes with time, as systems change with changing technology.

What Makes Software Reliable?

Several techniques are used across the industry to make large-scale distributed software reliable.

Data Replication

Data replication is a fundamental concept in system design where data is intentionally duplicated and stored in multiple locations or servers.

This redundancy serves several critical purposes, including enhancing data availability, improving fault tolerance, and enabling load balancing.

By replicating data across different nodes or data centers, we may be able to ensure that, in the event of a hardware failure or network issue, the data remains accessible. This reduces downtime and enhances system reliability.

It's essential to implement replication strategies carefully, considering factors like consistency, synchronization, and conflict resolution to maintain data integrity and reliability in distributed systems.

Let’s look at a concrete example. With a primary-secondary database model such as one used with e-commerce websites, we may have the following:

Replication: The primary database handles all the write operations, whereas the secondary databases handle the reads. This spreads reads across multiple databases, enhancing performance and lowering the probability of a crash.
Consistency: The system may use eventual consistency to maintain integrity, ensuring that all replicas eventually reflect the same data. But during high-traffic periods, the website may temporarily allow for slight inconsistencies, such as showing outdated inventory levels.
Conflict Resolution: If two users attempt to buy a single available item at the same time, a conflict resolution strategy may be used. For instance, the system could use timestamps to determine the customer who gets assigned the product, and this may dictate database updates eventually.
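The timestamp-based approach in the last point can be sketched in a few lines of Python. This is only an illustration: the `PurchaseAttempt` type, its field names, and the tie-breaking rule are assumptions, and a real system would also need trustworthy clocks and an atomic inventory update.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurchaseAttempt:
    customer_id: str
    received_at_ms: int  # server-side receive time in milliseconds


def pick_winner(attempts):
    """Return the attempt that gets the last item: earliest timestamp wins.

    Ties are broken by customer_id so the outcome is deterministic.
    """
    if not attempts:
        raise ValueError("no purchase attempts to resolve")
    return min(attempts, key=lambda a: (a.received_at_ms, a.customer_id))


attempts = [
    PurchaseAttempt("customer-b", 1_700_000_000_250),
    PurchaseAttempt("customer-a", 1_700_000_000_120),
]
winner = pick_winner(attempts)
print(winner.customer_id)  # customer-a
```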

Load Distribution Across Machines

Load distribution involves distributing computational tasks and network traffic across multiple servers or resources to optimize performance and ensure system scalability.

By intelligently spreading workloads, load distribution prevents any single server from becoming overwhelmed, reducing the risk of bottlenecks and downtime.

Some very commonly used load distribution mechanisms are:

Using Load Balancers: A load balancer can evenly distribute incoming traffic across multiple servers, preventing any single server from becoming a bottleneck.
Dynamic Scaling: Dynamic or auto-scaling can be used to automatically adjust the number of active servers based on current demand, adding more resources during peak times and scaling down during low traffic.
Caching: Caching layers can be used to store frequently accessed data, reducing the load on backend servers by serving requests directly from the cache.
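As a rough illustration of the first mechanism, here is a minimal round-robin balancer in Python. It is a sketch under simple assumptions (a fixed server list, no health checks, no weighting); the class and server names are made up for the example.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in a fixed rotation so each gets an equal share."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(servers)

    def next_server(self) -> str:
        return next(self._rotation)


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
assignments = [balancer.next_server() for _ in range(6)]
print(assignments)
```

Six requests land on each of the three servers exactly twice. Production load balancers usually add health checks and smarter strategies such as least-connections, but the core idea is the same.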

Capacity Planning

Capacity planning entails analyzing factors such as expected user growth, data storage requirements, and processing capabilities to ensure that the system can handle increased loads without performance degradation or downtime.

By accurately forecasting resource needs and scaling infrastructure accordingly, such planning helps optimize costs, maintain reliability, and provide a seamless user experience. Being proactive can help ensure a system is well-prepared to adapt to changing requirements and remains robust and efficient throughout its lifecycle.

A lot of modern systems can scale automatically with projected loads. When traffic or processing requirements increase, such auto scaling automatically provisions additional resources to handle the load. Conversely, when demand decreases, it scales down resources to optimize cost efficiency.
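The core of such a scaling decision can be sketched as a small calculation. The function below is an assumption-heavy simplification: the name `desired_servers`, the per-server capacity, and the minimum and maximum bounds are all hypothetical, and real auto-scalers also smooth their input metrics and apply cooldown periods.

```python
import math


def desired_servers(requests_per_second: float,
                    capacity_per_server: float,
                    min_servers: int = 2,
                    max_servers: int = 20) -> int:
    """Servers needed for the current load, clamped to a safe range."""
    needed = math.ceil(requests_per_second / capacity_per_server)
    return max(min_servers, min(max_servers, needed))


# Assume each server comfortably handles 500 requests per second.
print(desired_servers(4500, 500))   # peak traffic: scale up
print(desired_servers(100, 500))    # quiet period: stay at the minimum
print(desired_servers(50000, 500))  # extreme spike: capped at the maximum
```

The lower bound keeps some redundancy during quiet periods, and the upper bound protects the budget during extreme spikes.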

Metrics and Automated Alerting

Metrics involve collecting and analyzing data points that provide insights into various aspects of system behavior, such as resource utilization, response times, error rates, and more.

Automated alerting complements metrics by enabling proactive monitoring. This involves setting predefined thresholds or conditions based on metrics. When a metric crosses or exceeds these thresholds, automated alerts get triggered. These alerts can notify system administrators or operators, allowing them to take immediate action to address potential issues before they impact the system or users.

When used together, metrics and automated alerting create a robust monitoring and troubleshooting system, helping ensure that anomalies or problems are quickly detected and resolved.
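Here is a minimal sketch of threshold-based alerting in Python. The metric names and threshold values are invented for illustration, and a real setup would typically rely on a monitoring system rather than hand-written checks.

```python
# Hypothetical thresholds: alert when a metric rises above its limit.
THRESHOLDS = {
    "error_rate": 0.05,       # fraction of failed requests
    "p99_latency_ms": 800,    # 99th percentile response time
    "cpu_utilization": 0.90,  # fraction of CPU in use
}


def check_alerts(metrics: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return one alert message per metric that exceeds its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return alerts


current = {"error_rate": 0.08, "p99_latency_ms": 300}
for alert in check_alerts(current):
    print(alert)
```

Metrics that are missing from the sample, such as `cpu_utilization` here, are skipped rather than treated as healthy or unhealthy. Whether a missing metric should itself raise an alert is a design choice worth making explicitly.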

Now that you know a bit about what reliability means in the context of Distributed Systems, we can move on to Message Queues.

What is a Message Queue?

A message queue is a communication mechanism used in distributed systems to enable asynchronous communication between different components or services. It acts as an intermediary that allows one component to send a message to another without the need for direct, synchronous communication.

In a typical setup, multiple nodes (called Producers) create messages and send them to a message queue. A node called the Consumer then processes these messages, and may perform a series of actions (for instance, database reads or writes) for each one.

Now let’s look at an actual example where a message queue may be useful. Let’s assume we have an e-commerce website that allows millions of orders to be processed.

Processing an order may take place in the following steps:

A user creates an order. This sets off a request to a web server, which in turn creates a message that is placed in the orders queue.
A consumer reads the message and calls different services while processing it (for instance, the inventory service, the payment service, and the shipping service).
Once all processing steps have completed, the consumer removes the message from the queue.

Note that in case there are parts of the system that fail, the message can be left in the queue to be re-processed.

Even in cases where there is a total outage on the processing side of things, messages can simply pile up in the queue and be consumed once services are functional again. This is an example of a queue being useful in multiple failure scenarios.
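The retry behavior described above can be simulated without any cloud service. The sketch below uses an in-memory `deque` as a stand-in for the queue, and the `consume` function, the `flaky_handler`, and the attempt limit are all hypothetical. A managed queue such as SQS achieves a similar effect differently: a message that is received but never deleted becomes visible to consumers again after a timeout.

```python
from collections import deque


def consume(queue: deque, handler, max_attempts: int = 3):
    """Process messages until the queue is empty.

    A message only leaves the queue for good once the handler succeeds.
    On failure it is put back for a later retry, up to max_attempts.
    """
    processed, abandoned = [], []
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
        except Exception:
            message["attempts"] += 1
            if message["attempts"] >= max_attempts:
                abandoned.append(message["body"])  # give up on this one
            else:
                queue.append(message)  # leave it for a later retry
            continue
        processed.append(message["body"])
    return processed, abandoned


# Hypothetical handler: the payment service is down for the first two calls.
calls = {"count": 0}


def flaky_handler(order_id):
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ConnectionError("payment service unavailable")


orders = deque({"body": order_id, "attempts": 0}
               for order_id in ["A1", "B2", "C3"])
processed, abandoned = consume(orders, flaky_handler)
print(processed, abandoned)
```

Orders A1 and B2 fail on their first attempt while the payment service is down, go back into the queue, and succeed on retry. No order is lost.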

Let’s look at some code for this scenario using AWS SQS, which is a popular message queue service that allows users to create queues, send messages to the queue, and also consume messages from queues for processing.

The example below uses Boto3, the AWS SDK for Python, to talk to SQS.

First, we’ll place an order, assuming we already have an SQS queue called OrderQueue in place.

import boto3
import json

# Create an SQS client
sqs = boto3.client('sqs')

# Let's assume the queue is called OrderQueue
# This is the queue in which orders are placed
queue_url = 'https://sqs.us-east-1.amazonaws.com/2233334/OrderQueue'

# Function to send an order message
# This places an order in the queue, which can at any time be
# picked up by a consumer and then processed
def send_order(order_details):
    message_body = json.dumps(order_details)
    response = sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=message_body
    )
    print(f'Order sent with ID: {response["MessageId"]}')

# Using the queue to place an order
# Defining a sample order

order = {
    'order_id': '12345',
    'customer_id': '67890',
    'items': [
        {'product_id': 'abc123', 'quantity': 2},
        {'product_id': 'xyz456', 'quantity': 1}
    ],
    'total_price': 59.99
}

# Sending the order to the queue which is expected to be picked up 
# by a consumer and processed eventually.
send_order(order)

Then once the order has been placed, here’s some code that illustrates how it’ll be picked up for processing:

import boto3
import json

# Create an SQS client
sqs = boto3.client('sqs')

# Processing orders from the same queue defined above
queue_url = 'https://sqs.us-east-1.amazonaws.com/2233334/OrderQueue'

# Function to receive and process orders
# Picking up a maximum of 10 messages at a time to process
def receive_orders():
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # Up to 10 messages
        WaitTimeSeconds=10
    )

    messages = response.get('Messages', [])

    for message in messages:
        order_details = json.loads(message['Body'])
        print(f'Processing order: {order_details}')

        # Processing the order with details such as 
        # processing payments, updating the inventory levels,
        # processing shipping etc.

        # Delete the message after processing
        # This is important since we don't want an
        # order to be processed multiple times.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message['ReceiptHandle']
        )

# Receive a batch of orders
receive_orders()

What is an Intermediary in a Distributed System?

In the context of what we’re discussing here, a message queue is an intermediary. Here’s how AWS describes its own message queue service, Amazon SQS:

“Amazon Simple Queue Service (Amazon SQS) lets you send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available.”

This is a wonderfully succinct and accurate description of why a message queue (an intermediary) is important.

In a message queue, messages are placed in a queue data structure, which you can think of as a temporary storage area. The producer places messages in the queue, and the consumer retrieves and processes them at its own pace. This decoupling of producers and consumers allows for greater flexibility, scalability, and fault tolerance in distributed systems.

How Message Queues Help Make Distributed Systems More Reliable

Now let's discuss how Message Queues help make Distributed Systems more reliable.

1. Message Queues Provide Flexibility

Message queues allow for asynchronous communication between components. This means that producers can send messages to the queue without waiting for immediate processing by consumers. This allows components to work independently and at their own pace, providing flexibility in terms of processing times. So this is a great way to make designs flexible, and as self contained as possible.

2. Message Queues Make Systems Scalable

Message queues are often the bread and butter of scalable distributed systems for the following reasons:

Multiple producers can add messages to a message queue. This raises the ceiling and allows us to easily horizontally scale applications.
Multiple consumers can read from a message queue. This again allows us to easily scale throughput if needed in a lot of scenarios.
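To see several consumers sharing one queue, here is a small sketch built on Python's standard `queue` and `threading` modules. It is a single-process simulation, not a distributed setup, and the function and worker names are illustrative.

```python
import queue
import threading


def run_consumers(messages, worker_count: int = 3):
    """Drain one shared queue with several consumer threads."""
    work = queue.Queue()
    for message in messages:
        work.put(message)

    handled = []
    lock = threading.Lock()

    def worker(name: str):
        while True:
            try:
                message = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            with lock:
                handled.append((name, message))

    threads = [
        threading.Thread(target=worker, args=(f"consumer-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled


handled = run_consumers(range(100))
print(len(handled), "messages handled")
```

Each message is handed to exactly one consumer, so adding workers raises throughput without any change to the producers.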

3. Message Queues Make Systems Fault Tolerant

What happens if a distributed system is overwhelmed? We sometimes need to have the ability to cut the cord in order to get the system back to a working state. We’d ideally want the ability to process requests that weren’t processed when the system was down.

This is exactly what a message queue can help us with. We may have hundreds of thousands of requests that weren’t processed, but are still in the queue. These can be processed once our system is back online.

Challenges with Message Queues

Message queues aren’t a silver bullet for every scaling problem in distributed systems.

Here are some situations where message queues may be useful:

Asynchronous Processing: Message queues are generally an excellent choice wherever asynchronous processing is required. In workflows such as sending confirmation emails or generating reports after an order is placed, message queues can decouple these tasks from the primary application flow.
Load Balancing: As we saw in our example for message queues, in scenarios where traffic spikes occur, message queues can buffer incoming requests, allowing multiple consumers to process messages concurrently. This helps distribute the load evenly across available resources.
Fault Tolerance: In systems where reliability is crucial, message queues provide a mechanism for handling failures. If a service is temporarily down, messages can be retained in the queue until the service is available again, ensuring that no data is lost unless intended.

Here are some situations where message queues may not be useful:

Message queues can be great in scenarios where ordering of messages does not matter. But in situations where order does matter, they can sometimes be slow and more expensive to use.
Designing systems with queues that have multiple consumers isn’t trivial. What happens if a message is processed twice? Is idempotency a requirement? Or does it break our use case? These complexities can often lead us to situations where message queues may not be the best solution.
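One common way to cope with duplicate deliveries is to make the consumer idempotent. The sketch below assumes that each message carries a unique `order_id` (as in the order example earlier) and keeps the set of processed IDs in memory. The class name is hypothetical, and a real system would store these IDs durably, for example in a database with a unique constraint.

```python
class IdempotentOrderProcessor:
    """Apply each order at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self._seen_order_ids = set()
        self.charged = []  # stand-in for the real side effect (a payment)

    def handle(self, message: dict) -> bool:
        """Return True if the order was processed, False if it was a duplicate."""
        order_id = message["order_id"]
        if order_id in self._seen_order_ids:
            return False  # already handled: skip the side effect
        self.charged.append((order_id, message["total_price"]))
        self._seen_order_ids.add(order_id)
        return True


processor = IdempotentOrderProcessor()
order = {"order_id": "12345", "total_price": 59.99}

first = processor.handle(order)
second = processor.handle(order)  # the same message delivered again
print(first, second, processor.charged)
```

The second delivery is recognized and skipped, so the customer is charged once. With this in place, the consumer can safely accept that the queue may deliver a message more than once.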

Summary

In this article, you learned about reliability in distributed systems, and how message queues can help make such systems more reliable. Here’s a summary of the key takeaways:

Reliability is central to distributed systems and there are a few common ways this is handled across the tech industry. Data replication, load distribution, and capacity planning are some ways that can improve the reliability of a system.
Message Queues are intermediaries that can store messages from producers. They can be picked up by consumers at a rate that's generally independent of the rate of production.
Queues are flexible, allowing us to immediately stem the flow of unwanted event processing in case of an unforeseen event.
Despite the versatility of message queues, they're not a panacea for reliability issues. There are often multiple considerations to be kept in mind while processing messages in a message queue.

How to Build a Production-Grade Distributed Chatroom in Go [Full Handbook]

Table of Contents

What is a Distributed Chatroom?

What You'll Learn

1. TCP Socket Programming in Go

2. Concurrent Programming with Goroutines and Channels

3. State Management in Distributed Systems

4. Write-Ahead Logging (WAL) for Durability

5. Session Management and Reconnection

6. Graceful Degradation and Fault Tolerance

Prerequisites

Tutorial Overview

Architecture Overview

High-Level Architecture

Component Breakdown

1. Network Layer

2. Client Management

3. ChatRoom Core

4. State Management

5. Persistence Layer

6. Session Management

Message Flow

Core Concepts You Need to Know

Understanding the Concurrency Model

Understanding the Persistence Strategy

How to Set Up the Project Structure

How to Define Core Data Types

Understanding the Message Type

Understanding the Client Type

Understanding the ChatRoom Type

How to Initialize the Server

1. Creating Data Structures

2. Loading the Snapshot

3. Initializing the WAL

4. Starting Background Workers

Why 5 Minutes and 100 Messages?

How to Build the Event Loop

Understanding the Select Statement

Why Use a Single Event Loop?

Is This Actually a Bottleneck?

Understanding the Cleanup Worker

How to Handle Client Connections

How to Read Messages from Clients

How to Write Messages to Clients

How to Implement Message Broadcasting