<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Prince Onukwili - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Prince Onukwili - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Thu, 14 May 2026 04:32:20 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/onukwilip/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
<![CDATA[ How to Deploy Your Own CockroachDB Instance on Kubernetes [Full Book for Devs] ]]>
                </title>
                <description>
                    <![CDATA[ Developers are smart, wonderful people, and they’re some of the most logical thinkers you’ll ever meet. But we’re pretty terrible at naming things 😂 Like, what in the world – out of every other possible name, they decided to name a database after a ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/deploy-your-own-cockroach-db-instance-on-kubernetes-full-book-for-devs/</link>
                <guid isPermaLink="false">6925e482ccc8b29b82c002c5</guid>
                
                    <category>
                        <![CDATA[ cockroachdb ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Databases ]]>
                    </category>
                
                    <category>
                        <![CDATA[ google cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Prince Onukwili ]]>
                </dc:creator>
                <pubDate>Tue, 25 Nov 2025 17:16:50 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764088553942/496bf5f4-f059-4873-b6c1-419a86e594ef.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Developers are smart, wonderful people, and they’re some of the most logical thinkers you’ll ever meet. But we’re pretty terrible at naming things 😂</p>
<p>Like, what in the world – out of every other possible name, they decided to name a database after a <em>literal cockroach</em>? 🤣</p>
<p>I mean, I get it: cockroaches are known for being resilient, and the devs were probably trying to say “our database never dies”… but still… a cockroach?</p>
<p>The name aside, out of all the databases out there, you might be wondering why would you choose CockroachDB? And if you did choose it, where would you even start when trying to host and deploy it? Would you go for a managed cloud service? Or could you actually self-manage it?</p>
<p>If you ever thought of doing it yourself – maybe in a dev environment, or even introducing it to your company – how would you go about it?</p>
<p>Well, just calm your nerves 😄</p>
<p>In this book, we’ll explore everything you need to know about <strong>deploying and managing CockroachDB on Kubernetes</strong>. We’ll dive deep into:</p>
<ul>
<li><p>Understanding how CockroachDB’s masterless (multi-primary) architecture actually works</p>
</li>
<li><p>Setting up and deploying CockroachDB on a Kubernetes cluster</p>
</li>
<li><p>Automating backups to Google Cloud Storage using just a few queries in the CockroachDB cluster</p>
</li>
<li><p>Managing service accounts and authentication securely</p>
</li>
<li><p>Tuning CockroachDB’s memory settings for stable performance</p>
</li>
<li><p>Scaling the cluster horizontally and vertically without downtime</p>
</li>
<li><p>Monitoring and maintaining the database like a pro</p>
</li>
</ul>
<p>By the end, you’ll not only understand how CockroachDB works, you’ll be confident enough to deploy and manage your own resilient, production-ready instance. 🚀</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-even-is-cockroachdb">What Even Is CockroachDB? 🤔</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-simple-definition">Simple Definition</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-who-made-cockroachdb-when-was-it-released">Who Made CockroachDB? When Was it Released?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-problems-does-cockroachdb-try-to-solve">What Problems Does CockroachDB Try to Solve?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-key-terms-you-should-know-in-plain-language">Key Terms You Should Know (in plain language):</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-the-name-cockroachdb">Why the name “CockroachDB”? 😅</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-why-choose-cockroachdb-over-postgresql-or-mongodb">Why Choose CockroachDB Over PostgreSQL or MongoDB 🤷🏾‍♂️?</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-how-fault-tolerance-is-handled-in-postgresql-and-mongodb">How Fault Tolerance is Handled in PostgreSQL and MongoDB</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-cockroachdb-handles-it-differently">How CockroachDB Handles It Differently</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-how-cockroachdb-works-behind-the-scenes">How CockroachDB Works Behind the Scenes ⚙️</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-ranges-the-small-pieces-of-data">Ranges: The Small Pieces of Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-replication-many-copies-for-safety">Replication: Many Copies for Safety</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-raft-consensus-how-all-copies-agree">Raft Consensus: How All Copies Agree</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-multiraft-keeping-raft-efficient-when-things-scale">MultiRaft: Keeping Raft Efficient When Things Scale</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-rebalancing-movement-for-balance">Rebalancing: Movement for Balance</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-distributed-transactions-doing-work-across-multiple-ranges">Distributed Transactions: Doing Work Across Multiple Ranges</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-it-all-fits-together-read-write-flow-what-happens-when-you-use-it">How It All Fits Together: Read + Write Flow (What Happens When You Use It)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-this-all-matters-putting-it-in-plain-english">Why This All Matters (Putting It in Plain English)</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-where-and-how-should-you-host-cockroachdb">Where (and How) Should You Host CockroachDB? ☁️</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-option-1-cockroachdb-cloud-fully-managed-by-cockroach-labs">Option 1: CockroachDB Cloud (fully managed by Cockroach Labs)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-2-bring-your-own-cloud-byoc">Option 2: Bring Your Own Cloud (BYOC)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-3-use-cloud-marketplaces-aws-gcp-azure">Option 3: Use Cloud Marketplaces (AWS, GCP, Azure)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-4-my-favorite-self-hosting-especially-using-kubernetes">Option 4 (My Favorite 😁): Self-Hosting — Especially Using Kubernetes</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-setting-up-your-local-environment">Setting Up Your Local Environment 🧑‍💻</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-these-tools">Why these tools?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-install-minikube">Step 1: Install Minikube</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-install-kubectl">Step 2: Install kubectl</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-install-helm">Step 3: Install Helm</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-deploying-cockroachdb-on-minikube-the-fun-part-begins">Deploying CockroachDB on Minikube (The Fun Part Begins 😁!)</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-step-1-visit-artifacthub">Step 1: Visit ArtifactHub</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-explore-the-helm-chart">Step 2: Explore the Helm Chart</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-copy-the-default-values">Step 3: Copy the Default Values</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-create-a-folder-for-our-project">Step 4: Create a Folder for Our Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-understanding-the-key-configurations">Step 5: Understanding the Key Configurations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-create-a-simplified-values-config-for-the-cockroachdb-helm-chart">Step 6: Create a Simplified Values Config for the CockroachDB Helm Chart</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-overview-of-the-yaml-values">Overview of the YAML values</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-7-install-the-cockroachdb-cluster-using-helm">🚀 Step 7: Install the CockroachDB Cluster Using Helm</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-accessing-the-cockroachdb-console-amp-viewing-metrics">Accessing the CockroachDB Console &amp; Viewing Metrics</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-step-1-locate-the-cockroachdb-public-service">Step 1: Locate the CockroachDB Public Service</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-learn-more-about-the-service">Step 2: Learn More About the Service</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-access-the-cockroachdb-dashboard">Step 3: Access the CockroachDB Dashboard</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-visit-the-dashboard">Step 4: Visit the Dashboard</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-exploring-the-metrics-dashboard">Step 5: Exploring the Metrics Dashboard</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-creating-a-little-load-on-the-cockroachdb-cluster">Step 6: Creating a Little Load on the CockroachDB Cluster</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-7-viewing-the-metrics-from-the-load">Step 7: Viewing the Metrics from the Load</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-8-view-the-list-of-created-items-in-the-database">Step 8: View the List of Created Items in the Database</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-backing-up-cockroachdb-to-google-cloud-storage">Backing Up CockroachDB to Google Cloud Storage ☁️</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-backups-are-absolutely-critical">Why Backups Are Absolutely Critical</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-to-our-db-installing-beekeeper-studio">Connecting to Our DB – Installing Beekeeper Studio</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-install-beekeeper-studio">How to Install Beekeeper Studio</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-beekeeper-studio-to-cockroachdb">Connecting Beekeeper Studio to CockroachDB</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-exposing-the-cluster-for-local-access">Exposing the Cluster for Local Access</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-via-beekeeper-studio">🐝 Connecting via Beekeeper Studio</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-verify-the-connection">Verify the Connection</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-a-google-cloud-account">Creating a Google Cloud Account</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-a-google-cloud-storage-bucket">Creating a Google Cloud Storage Bucket</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-giving-cockroachdb-access-to-the-bucket">Giving CockroachDB Access to the Bucket</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-attaching-the-key-to-our-cockroachdb-cluster">Attaching the Key to Our CockroachDB Cluster</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-testing-our-backup-disaster-recovery-time">Testing Our Backup — Disaster Recovery Time</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-managing-resources-amp-optimizing-memory-usage">Managing Resources &amp; Optimizing Memory Usage</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-how-cockroachdb-uses-memory">How CockroachDB Uses Memory</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-memory-usage-formula-you-must-follow">The Memory Usage Formula You Must Follow</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-where-you-find-these-settings">Where You Find These Settings</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-concrete-example-step-by-step">Concrete Example (Step-by-Step)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-on-requests-vs-limits-in-kubernetes">⚠️ On Requests vs Limits in Kubernetes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-overriding-the-default-fractions">Overriding the Default Fractions</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-scaling-cockroachdb-the-right-way">Scaling CockroachDB the Right Way</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-key-metrics-to-understand">Key Metrics to Understand</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-when-and-what-to-scale-based-on-your-metrics">When (and What) to Scale Based on Your Metrics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-disk-bound-situations-what-to-do-when-your-disk-is-the-limiting-factor">Disk-Bound Situations — What to Do When Your Disk Is the Limiting Factor</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-memory-pressure-what-to-do-when-your-database-hits-the-limit">Memory Pressure — What to Do When Your Database Hits the Limit</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-when-queries-are-slow-but-everything-else-cpu-memory-amp-disk-looks-fine">When Queries Are Slow but Everything Else (CPU, Memory &amp; Disk) Looks “Fine”</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-disk-speed-iops-amp-throughput-across-cloud-providers">Understanding Disk Speed (IOPS &amp; Throughput) Across Cloud Providers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-downsizing-the-cluster-reducing-replicas">Downsizing the Cluster (Reducing Replicas)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-wrong-way-to-downscale">⚠️ The Wrong Way to Downscale</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-decommissioning-a-node-before-scaling-down-the-cluster">Decommissioning a Node Before Scaling Down the Cluster</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-what-to-consider-when-deploying-cockroachdb-on-google-kubernetes-engine-gke">What to Consider When Deploying CockroachDB on Google Kubernetes Engine (GKE) ☁️</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-creating-your-gke-cluster">Creating Your GKE Cluster</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-to-your-gke-cluster">Connecting to your GKE cluster</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-deploying-cockroachdb-in-production-on-gke">Deploying CockroachDB in Production (on GKE)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-the-configuration">Understanding the Configuration</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-installing-the-cockroachdb-cluster-on-gke">Installing the CockroachDB Cluster on GKE</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-to-our-cockroachdb-cluster-now-that-tls-mtls-are-enabled">Connecting to Our CockroachDB Cluster (Now That TLS + mTLS Are Enabled)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-via-mutual-tls-mtls-why-we-need-a-certificate-for-our-root-user">Connecting via Mutual TLS (mTLS) — Why We Need a Certificate for Our root User</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-lets-explore-our-clusters-certificate">Let’s Explore Our Cluster’s Certificate</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-the-certificate-sections-explained-super-simply">Understanding the Certificate Sections (Explained Super Simply)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-a-client-certificate-so-we-can-finally-connect-to-cockroachdb">Creating a Client Certificate (So We Can Finally Connect to CockroachDB)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-to-our-cockroachdb-cluster-securely-using-mtls">Connecting to Our CockroachDB Cluster Securely (Using mTLS)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-restoring-our-previous-database-into-the-new-gke-cockroachdb-cluster-without-sa-keys">Restoring Our Previous Database into the New GKE CockroachDB Cluster (without SA keys)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-restoring-our-previous-database-from-google-cloud-storage">Restoring Our Previous Database from Google Cloud Storage</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-now-lets-restore-the-data">Now, Let’s Restore the Data 🎉</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-to-the-database-with-a-new-user">Connecting to the Database with a New User</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-with-passwordless-authentication-mutual-tls">Connecting with Passwordless Authentication (Mutual TLS)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-via-mutual-tls-mtls-from-our-apps-on-kubernetes">Connecting via Mutual TLS (mTLS) from Our Apps on Kubernetes</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-get-a-cockroachdb-enterprise-license-for-free">How to Get a CockroachDB Enterprise License for FREEE!</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-three-types-of-licenses">Three Types of Licenses</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-apply-for-the-free-enterprise-license">How to Apply for the Free Enterprise License</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-adding-your-license-to-the-cockroachdb-cluster">Adding Your License to the CockroachDB Cluster</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion-amp-next-steps">Conclusion &amp; Next Steps ✨</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-about-the-author">About the Author 👨🏾‍💻</a></li>
</ul>
</li>
</ol>
<h2 id="heading-what-even-is-cockroachdb">What Even Is CockroachDB? 🤔</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760416037885/c67edcbb-be85-4614-bdf3-104942048eea.jpeg" alt="An image summarizing what CockroachDB is" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Hey! Before we jump into setting up our Kubernetes cluster and deploying our CockroachDB cluster, let’s get grounded in what CockroachDB really is. (Because if you don’t understand the why and how, the implementation and practical sessions will just feel like magic 😅.)</p>
<h3 id="heading-simple-definition">Simple Definition</h3>
<p>CockroachDB is a distributed SQL database. This means it gives you the features of a relational database (tables, SQL queries, JOINs, transactions) but automatically copies your data across multiple replicas (servers, nodes, instances) – no manual sharding needed. 😃</p>
<p>It’s built to survive failures, scale easily (compared to other SQL databases), and keep your data consistent no matter what (across all the instances).</p>
<h3 id="heading-who-made-cockroachdb-when-was-it-released">Who Made CockroachDB? When Was it Released?</h3>
<p>CockroachDB was created by <a target="_blank" href="https://www.cockroachlabs.com/"><strong>Cockroach Labs</strong></a>, founded by Spencer Kimball, Peter Mattis, and Ben Darnell. The idea first started taking shape around 2014, and by 2015 Cockroach Labs was formally founded.</p>
<p>Its 1.0 “production-ready” version was announced in 2017, marking its transition from beta to being suitable for real-world use.</p>
<h3 id="heading-what-problems-does-cockroachdb-try-to-solve">What Problems Does CockroachDB Try to Solve?</h3>
<p>Traditional relational databases are great, but they run into real challenges when your app grows. CockroachDB was built to solve those. Here are the key pain points and how CockroachDB addresses them:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pain Point</td><td>What usually happens</td><td>How CockroachDB fixes it</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Single primary bottleneck</strong></td><td>Only one “primary” node handles writes, updates, and deletes, and that single node is difficult to scale (to keep up with DB usage) without downtime.</td><td>CockroachDB is <strong>multi-primary</strong>, meaning every node can accept reads and writes. No single “primary” for the entire cluster.</td></tr>
<tr>
<td><strong>Manual sharding complexity</strong></td><td>You have to split data (shard) by hand, decide which piece goes where, and handle cross-shard queries – lots of headaches 😖.</td><td>CockroachDB automatically partitions data into smaller units (called <em>ranges</em>) and moves them around to balance load.</td></tr>
<tr>
<td><strong>Failover downtime</strong></td><td>If the primary node fails, you need to promote a replica (read-only instance) and switch over. During that time, your app might be down.</td><td>Because there’s no single primary, if one of the instances fails, the others take over seamlessly (via consensus) without a big outage.</td></tr>
<tr>
<td><strong>Geographic scaling &amp; latency</strong></td><td>Serving users in different regions is hard — either data is far away (slow) or you must build complex replication logic.</td><td>CockroachDB lets you distribute nodes across regions. You can serve local reads/writes while keeping global consistency.</td></tr>
</tbody>
</table>
</div><p>So instead of fighting your database as it grows, CockroachDB handles much of the hard work for you.</p>
<h3 id="heading-key-terms-you-should-know-in-plain-language">Key Terms You Should Know (in plain language):</h3>
<ul>
<li><p><strong>Node:</strong> a single running instance (server) of your database. Together, the nodes make up the cluster and hold copies of your data, also known as replicas. A replica can be read-only (data can only be read from it, for example using SELECT statements), OR read-write (data can be read, created, updated, and deleted).</p>
</li>
<li><p><strong>Replication</strong>: making copies of data on multiple nodes. If one node fails, others still have the data.</p>
</li>
<li><p><strong>Raft (consensus algorithm)</strong>: a system that ensures copies (replicas) agree on changes in a safe, reliable way. For example, when you want to write data, Raft ensures that most copies agree before it’s accepted.</p>
</li>
<li><p><strong>Sharding / Ranges</strong>: Instead of putting all your data in one big blob, CockroachDB splits it into smaller chunks called <em>ranges</em>. Each range is replicated and can move between nodes.</p>
</li>
<li><p><strong>Distributed transaction</strong>: a transaction (series of operations) that might touch data stored in different nodes. CockroachDB manages this, so you still get ACID (atomic, consistent, isolated, durable) properties.</p>
</li>
</ul>
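<p>To make the Raft bullet above concrete, here is a toy Python sketch of the majority rule – a write only commits once a quorum of replicas has acknowledged it. This is illustrative only, not CockroachDB’s actual implementation:</p>

```python
# Toy sketch of Raft-style quorum (not CockroachDB's real code):
# a write is accepted only once a majority of replicas acknowledge it.

def quorum_size(replicas: int) -> int:
    """Smallest majority for a given replica count."""
    return replicas // 2 + 1

def write_accepted(acks: int, replicas: int = 3) -> bool:
    """A write commits when a majority of replicas have acknowledged it."""
    return acks >= quorum_size(replicas)

# With the default 3-way replication, 2 acks are enough -- so one node
# can be down (or slow) and writes still succeed.
print(write_accepted(acks=2, replicas=3))  # True
print(write_accepted(acks=2, replicas=5))  # False: 5 replicas need 3 acks
```

<p>This is why losing a single node in a 3-replica setup doesn’t stop writes: the remaining two still form a majority.</p>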
<h3 id="heading-why-the-name-cockroachdb">Why the name “CockroachDB”? 😅</h3>
<p>You might wonder: <em>Why name a database after a cockroach?</em> It sounds weird at first, but there's a reason:</p>
<p>Cockroaches are known for surviving harsh conditions: radiation, natural disasters, and so on. The founders wanted a database that feels almost “impossible to kill,” that can survive node failures, outages, and network splits. The name is a tongue-in-cheek nod to resilience.</p>
<h2 id="heading-why-choose-cockroachdb-over-postgresql-or-mongodb">Why Choose CockroachDB Over PostgreSQL or MongoDB 🤷🏾‍♂️?</h2>
<p>Let’s compare the classic setup (Postgres / MongoDB) to CockroachDB, especially why you might want to go with CockroachDB, and how it helps ease scaling. I’ll also explain some terms to make sure you’re following.</p>
<p>In many setups, when you use Postgres or MongoDB, you’ll often have one “primary” node that handles all writes (that is, inserts, updates, deletes).</p>
<p>Then you have multiple “read replicas” that copy the primary’s data and serve read requests (selects). That works okay – reads can be spread out – but all write traffic goes to that one primary node.</p>
<p>Usually, the primary eventually gets stressed when the write volume grows (for example, more customers create accounts and products on your platform).</p>
<p>You can add more read replicas (horizontal scaling for reads, for example customers trying to view their accounts, or previously created products on your site), but scaling the primary is much harder.</p>
<p>To scale the primary, you often resort to upgrading its resources (CPU, RAM, disk) – that’s vertical scaling – which often needs downtime (shut down the primary database, increase its CPU and RAM, then spin it back up).</p>
<p>Or you’d have to manually shard (split) your data across multiple primaries, route traffic carefully, and manage complexity.</p>
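<p>To see why manual sharding is such a headache, here’s a hypothetical Python sketch of the routing logic your application would have to own if you split data across multiple primaries (the shard names are made up):</p>

```python
import hashlib

# Hypothetical sketch of the manual sharding CockroachDB spares you from:
# the application itself must decide which shard (primary) owns each key.
SHARDS = ["users-shard-0", "users-shard-1", "users-shard-2"]  # made-up names

def shard_for(user_id: str) -> str:
    """Route a key to a shard by hashing it. Every query, migration, and
    cross-shard JOIN now has to go through logic like this."""
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

print(shard_for("user-42"))  # always the same shard for the same key
```

<p>Notice that adding a fourth shard would silently remap most keys (the modulo changes), so rebalancing becomes yet another migration you’d have to manage by hand.</p>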
<h3 id="heading-how-fault-tolerance-is-handled-in-postgresql-and-mongodb">How Fault Tolerance is Handled in PostgreSQL and MongoDB</h3>
<p>When you try to make Postgres (or MongoDB) highly available and fault tolerant in a self-managed setup, you often need one primary and two or more read replicas.</p>
<p>The tricky part is handling what happens when the primary fails (or is taken down temporarily for an upgrade). You need something that can promote a replica to a primary automatically.</p>
<p>In Postgres land, that’s often handled by <a target="_blank" href="https://github.com/patroni/patroni"><strong>Patroni</strong></a> or <a target="_blank" href="https://www.repmgr.org/"><strong>repmgr</strong></a> (tools that handle cluster management, failover, leader election, and so on).</p>
<p>In MongoDB, such logic is part of the <strong>replica set</strong> behavior: it does automatic elections among replicas.</p>
<p>Here are some of the core challenges with that classic model, along with the extra routing plumbing it demands:</p>
<ul>
<li><p>Every write must go to a single primary. If that primary fails or is overloaded, your whole system suffers.</p>
</li>
<li><p>Scaling reads is easy (add more replicas), but scaling writes is hard.</p>
</li>
<li><p>Vertical scaling (give more resources to one server) has its cons. If the primary node needs more resources, you might experience some downtime when it’s being scaled up.</p>
</li>
<li><p>Manual sharding is messy: you decide which piece of data goes to which shard, handle cross-shard queries, and build routing logic. That’s a lot of maintenance and can lead to unexpected issues if not handled properly.</p>
</li>
<li><p>One service (or load balancer/proxy) points to the primary (for ALL write queries).</p>
</li>
<li><p>Another service or routing logic handles read queries and can share reads across replicas.</p>
</li>
<li><p>You might use <strong>HAProxy</strong>, <strong>pgpool-II</strong>, or <strong>pgBouncer</strong> for Postgres to route traffic, do read/write splitting, or manage connection pooling. These are external (not part of the database core) tools you have to configure.</p>
</li>
</ul>
<p>So when the primary fails, Patroni (or repmgr, and so on) will detect it and promote one of the read replicas to be the new primary.</p>
<p>But that promotion, reconfiguration, and traffic rerouting often cause a brief window of downtime (when your primary database node becomes unavailable).</p>
<h3 id="heading-how-cockroachdb-handles-it-differently">How CockroachDB Handles It Differently</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760416070693/af1ade70-19bb-4e9f-82ec-9711c13d8079.jpeg" alt="A brief look at CockroachDB properties" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>CockroachDB changes the rules:</p>
<ul>
<li><p><strong>All replicas are equal</strong> for reads <em>and</em> writes. You don’t have a special “primary” that handles writes. Every node in the cluster can accept write requests.</p>
</li>
<li><p>CockroachDB breaks your data into small chunks (ranges) and replicates them across nodes. If you add a new node, data moves around automatically to balance the load.</p>
</li>
<li><p>Every write is automatically copied to other replicas, and consistency is managed by a protocol (Raft), so you don’t have to build this yourself.</p>
</li>
<li><p>No manual sharding needed. Because the database handles how data is split and moved, you don’t need to decide how to shard by hand.</p>
</li>
<li><p>You <strong>don’t need a special service</strong> to route write vs read queries. Any node can accept both reads <strong>and</strong> writes.</p>
</li>
<li><p>During scaling, you don’t have to worry about which node is the primary – because <em>there is no primary</em>.</p>
</li>
<li><p>You can scale your nodes one at a time (rollout style). When one node is being upgraded, the others continue to serve traffic. You won’t hit a downtime window just because you're scaling the “primary.”</p>
</li>
<li><p>Because there's no replica promotion logic to fight with, there's no moment where a replica needs to be “elevated” to primary – it’s all just nodes continuing to serve.</p>
</li>
</ul>
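<p>Because every node is an equal entry point, a client can simply spread its connections across all of them. A minimal Python sketch, assuming three nodes listening on CockroachDB’s default SQL port 26257 (the hostnames here are hypothetical):</p>

```python
import itertools

# Sketch: with no primary to locate, a client can round-robin its
# connections across every node -- each one accepts reads AND writes.
nodes = ["cockroachdb-0:26257", "cockroachdb-1:26257", "cockroachdb-2:26257"]
next_node = itertools.cycle(nodes).__next__

print(next_node())  # cockroachdb-0:26257
print(next_node())  # cockroachdb-1:26257 -- writes can land here too
```

<p>In practice you’d usually put a Kubernetes Service in front of the pods to do this balancing for you, but the idea is the same: any node will do.</p>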
<h2 id="heading-how-cockroachdb-works-behind-the-scenes">How CockroachDB Works Behind the Scenes ⚙️</h2>
<p>In CockroachDB, there are many moving parts behind the scenes. But they work together, so you don’t have to babysit them. The core ideas, which we’ve mostly already touched on, are:</p>
<ul>
<li><p>Splitting data into pieces (<strong>ranges</strong>)</p>
</li>
<li><p>Keeping multiple copies of each piece (<strong>replicas/replication</strong>)</p>
</li>
<li><p>Making sure all copies agree via <strong>Raft consensus</strong></p>
</li>
<li><p>Moving pieces around to balance the load (<strong>automatic rebalancing/distribution</strong>)</p>
</li>
<li><p>Coordinating transactions that might touch many pieces</p>
</li>
</ul>
<p>Let’s go through each of those, one by one.</p>
<h3 id="heading-ranges-the-small-pieces-of-data">Ranges: The Small Pieces of Data</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760413105037/984f8b5c-bd53-4850-9704-57ce1dcedb80.png" alt="A little depiction of CockroachDB ranges" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Imagine you have a giant book of recipes. If you try to carry the whole thing, it’s heavy. So you split the book into smaller booklets, each covering recipes for a certain range of meals: breakfasts, lunches, dinners, desserts.</p>
<p>In CockroachDB, data is split into ranges, which are like those smaller booklets:</p>
<ul>
<li><p>Each range covers a certain block of data (like “all users whose ID is 1-1000”)</p>
</li>
<li><p>When a range gets too big (like having too many recipes in one booklet), it’s split into two smaller ones. That makes each piece easier to manage.</p>
</li>
<li><p>If two neighboring ranges have become very small (few recipes), they might be merged (joined) back together so you’re not keeping too many tiny booklets.</p>
</li>
<li><p>These splits and merges happen automatically, behind the scenes, so the database stays smooth as things grow or shrink.</p>
</li>
</ul>
<p>This chopping helps the system in many ways: moving pieces, copying them, balancing load, and recovering from node failures all become easier.</p>
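<p>To picture the split rule, here’s a toy shell sketch (an illustration only, not real CockroachDB logic – the real threshold is a size in bytes, configurable per zone):</p>
<pre><code class="lang-bash"># Toy sketch: a "range" covering user IDs 1-1000 splits in half once it
# grows past a threshold.
range_start=1; range_end=1000; max_keys=600
keys=$(( range_end - range_start + 1 ))
if [ "$keys" -gt "$max_keys" ]; then
  mid=$(( (range_start + range_end) / 2 ))
  echo "split: [$range_start-$mid] and [$(( mid + 1 ))-$range_end]"
else
  echo "no split needed"
fi
</code></pre>
<p>This prints <code>split: [1-500] and [501-1000]</code> – after a split, each half can live on a different node and be moved around independently.</p>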
<h3 id="heading-replication-many-copies-for-safety">Replication: Many Copies for Safety</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760413678362/a0066780-1360-4511-8fd0-466f54ea2135.jpeg" alt="Replication of Ranges across multiple Nodes (databases) in CockroachDB" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Nobody likes losing their work, so you keep backup copies. CockroachDB does this for data as well.</p>
<p>For each range, there are usually 3 copies (replicas) stored on different machines (nodes), so if one machine dies, you still have others (see the <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/architecture/replication-layer">replication layer docs</a>). These copies are always kept in sync: when you write something (for example, insert or update), the change is propagated to the other copies.</p>
<p>The database also tolerates failures. If one node goes down, the system detects it and eventually makes a new copy elsewhere to replace it. So the target number of copies is maintained. This gives you fault tolerance: your data stays safe even when parts of your system fail.</p>
<h3 id="heading-raft-consensus-how-all-copies-agree">Raft Consensus: How All Copies Agree</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760415307117/79859a4b-4341-46eb-91d9-cccc3bde9a66.jpeg" alt="79859a4b-4341-46eb-91d9-cccc3bde9a66" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Having copies is useful, but you also need them to agree with each other – like making sure every copy of a recipe booklet has the same content. The Raft protocol is a way to make sure that happens reliably.</p>
<p>Here’s how Raft works in simple terms:</p>
<ul>
<li><p>Each range has a group of replicas. One of these replicas acts as the <strong>leader</strong>. Others are <strong>followers</strong>.</p>
</li>
<li><p>All write requests for that range go through the leader. The leader gets the request, then tells followers to record the same change.</p>
</li>
<li><p>Once most of the copies (a majority) say “yep, we got it,” the change is considered final (committed). Then the leader tells the client, “Done.”</p>
</li>
<li><p>If the leader stops working (the machine dies or the network fails), the followers notice it (they stop getting regular “I’m alive” messages), then they hold an election to pick a new leader, and the show goes on.</p>
</li>
<li><p>This way, the system ensures everyone has the same final data and no conflicting changes happen.</p>
</li>
</ul>
<p>So Raft is the agreement protocol that keeps all copies in sync and safe.</p>
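<p>To make the “majority” idea concrete, here’s a tiny shell sketch (just arithmetic, not CockroachDB code) that computes the quorum size for common replication factors:</p>
<pre><code class="lang-bash"># A write commits once a majority (quorum) of replicas confirm it:
# quorum = floor(n/2) + 1, so 3 replicas tolerate 1 failure, 5 tolerate 2.
for n in 3 5; do
  quorum=$(( n / 2 + 1 ))
  echo "replicas=$n quorum=$quorum failures_tolerated=$(( n - quorum ))"
done
</code></pre>
<p>Running this prints <code>replicas=3 quorum=2 failures_tolerated=1</code> and <code>replicas=5 quorum=3 failures_tolerated=2</code> – which is why the default of 3 replicas survives a single node going down.</p>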
<h3 id="heading-multiraft-keeping-raft-efficient-when-things-scale">MultiRaft: Keeping Raft Efficient When Things Scale</h3>
<p>When you have many ranges (many pieces of the booklets), each range has its own Raft group. That can mean a lot of “are you alive?” messages between nodes, and a lot of overhead. MultiRaft is the trick CockroachDB uses to make this efficient.</p>
<p>MultiRaft groups together Raft work for many ranges that share nodes, so overhead is reduced. Instead of sending separate heartbeat (are you alive?) messages for each range, some of the messages are bundled.</p>
<p>This reduces network chatter and resource waste and helps the database scale smoothly when you have tons of data and many pieces.</p>
<h3 id="heading-rebalancing-movement-for-balance">Rebalancing: Movement for Balance</h3>
<p>When your ranges are not evenly spread across nodes (machines), some machines are doing way too much work, and some hardly any. That’s not good. So CockroachDB automatically moves pieces around to balance things.</p>
<ul>
<li><p>The system watches how busy each node is (how many ranges it holds, how much data, how much read/write traffic).</p>
</li>
<li><p>If one node is overloaded, it will move some ranges to other nodes.</p>
</li>
<li><p>If a node dies, the system notices and makes sure that ranges that were on that node get copied somewhere else so safety (replica count) is maintained.</p>
</li>
<li><p>If you add a new node, the system starts moving ranges to the new node so its resources are used.</p>
</li>
</ul>
<p>This happens without you having to manually decide “move this here, move that there.”</p>
<h3 id="heading-distributed-transactions-doing-work-across-multiple-ranges">Distributed Transactions: Doing Work Across Multiple Ranges</h3>
<p>Often, an operation touches multiple ranges. For example, “transfer money from account A (in range 1) to account B (in range 2)”. That must be handled carefully so that either both parts succeed, or neither does.</p>
<p>CockroachDB supports <strong>distributed transactions</strong>, meaning a single transaction can work across many ranges. It uses “intent” writes (temporary placeholders) and once everything is ready, it commits the transaction so it becomes permanent. If something fails, it aborts (cancels) the whole thing. The system ensures atomic behavior: all or nothing.</p>
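<p>As a toy analogy (plain shell, nothing CockroachDB-specific), you can think of intent writes as staged files that only become permanent in a single commit step at the end:</p>
<pre><code class="lang-bash"># Toy "all or nothing" sketch: stage both sides of a transfer as intent
# files, then make them permanent together only if every intent was staged.
tmp=$(mktemp -d)
echo "account A: -100" | tee "$tmp/intent_a"
echo "account B: +100" | tee "$tmp/intent_b"

intents=$(ls "$tmp" | wc -l)
if [ "$intents" -eq 2 ]; then
  # commit phase: every intent exists, so apply them as one unit
  cat "$tmp/intent_a" "$tmp/intent_b" | tee "$tmp/committed"
  echo "transaction committed"
else
  # abort phase: throw away all intents, leaving no partial change
  rm -rf "$tmp"
  echo "transaction aborted"
fi
</code></pre>
<p>The real protocol is far more involved (timestamps, intent resolution, retries), but the shape is the same: nothing becomes visible until the single commit decision.</p>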
<h3 id="heading-how-it-all-fits-together-read-write-flow-what-happens-when-you-use-it">How It All Fits Together: Read + Write Flow (What Happens When You Use It)</h3>
<p>Let’s picture a write, step by step:</p>
<ol>
<li><p>Your app sends a write (for example, “add new user”) to any node in the CockroachDB cluster.</p>
</li>
<li><p>That node figures out which range(s) are involved (which pieces hold the data you want to write).</p>
</li>
<li><p>For each range, the write goes to that range’s leader.</p>
</li>
<li><p>The leader writes the change to its own copy, then tells the followers to do the same.</p>
</li>
<li><p>Once most copies confirm they have the change, the leader declares it “committed” and tells your app, “yes, write done.”</p>
</li>
<li><p>If a node is busy or down, others still handle traffic.</p>
</li>
</ol>
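<p>Steps 4 and 5 above boil down to counting acknowledgements. Here’s a tiny shell sketch of that decision (an illustration only, not real CockroachDB internals):</p>
<pre><code class="lang-bash"># The leader commits once a majority of the 3 replicas acknowledge the write.
confirmed=0
for ack in ok ok timeout; do   # leader + one follower answered; one is slow
  if [ "$ack" = "ok" ]; then
    confirmed=$(( confirmed + 1 ))
  fi
done
if [ "$confirmed" -ge 2 ]; then   # quorum of 3 replicas is 2
  echo "write committed ($confirmed of 3 acks)"
else
  echo "no quorum yet - keep waiting or retry"
fi
</code></pre>
<p>Notice that the slow follower doesn’t block the commit: two out of three acknowledgements are enough.</p>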
<p>Read flow:</p>
<ul>
<li><p>Your app sends a read (for example “get user by ID”) to any node.</p>
</li>
<li><p>That node checks its copies. If it has a fresh copy, it answers. If not, it asks the node that does.</p>
</li>
</ul>
<p>Everything works together so data stays correct, up to date, and reliably available even if machines fail or the network lags.</p>
<h3 id="heading-why-this-all-matters-putting-it-in-plain-english">Why This All Matters (Putting It in Plain English)</h3>
<p>All these mechanisms matter for several key reasons. First of all, because data is chopped into ranges and replicated, no single node is a bottleneck. Also, Raft ensures consensus, so you can trust that data is consistent across all working replicas.</p>
<p>Beyond this, rebalancing is automatic, so you don’t have to micromanage shards or worry about nodes drowning in load. And because transactions that touch multiple ranges are coordinated, you can trust ACID properties even in a distributed setup.</p>
<h2 id="heading-where-and-how-should-you-host-cockroachdb">Where (and How) Should You Host CockroachDB? ☁️</h2>
<p>There isn’t just one “right” way to host CockroachDB. There are a few paths you can pick, each with pros and cons. What you pick depends on cost, control, ease of use, and your risk tolerance.</p>
<p>In this section, we’ll explore:</p>
<ul>
<li><p>Cockroach Labs’ own managed cloud (CockroachDB Cloud)</p>
</li>
<li><p>“Bring Your Own Cloud” (BYOC) – letting Cockroach Labs manage it inside <em>your</em> cloud account</p>
</li>
<li><p>Hosting via cloud marketplaces (AWS, GCP, Azure)</p>
</li>
<li><p>Self-hosting / Kubernetes / your own infrastructure</p>
</li>
<li><p>And notes on DigitalOcean support</p>
</li>
</ul>
<p>Let’s dive in.</p>
<h3 id="heading-option-1-cockroachdb-cloud-fully-managed-by-cockroach-labs">Option 1: CockroachDB Cloud (fully managed by Cockroach Labs)</h3>
<p>This is the easiest option if you want to offload operations. You don’t manage nodes (computers, virtual machines, and so on), upgrades, or backups – Cockroach Labs handles all that.</p>
<p><strong>What it offers:</strong></p>
<ul>
<li><p>You sign up and click “create cluster.”</p>
</li>
<li><p>Automatic scaling, zero-downtime upgrades, and managed backups.</p>
</li>
<li><p>It supports multiple cloud providers behind the scenes (you pick region(s)).</p>
</li>
<li><p>You get tools, APIs, and Terraform integration to automate it.</p>
</li>
<li><p>They often give free credits to get started.</p>
</li>
</ul>
<p><strong>Tradeoffs:</strong></p>
<ul>
<li><p>You have less control over underlying infrastructure, for example Virtual Machines, networking, disks, and so on (you trade control for convenience).</p>
</li>
<li><p>You pay for the managed service premium.</p>
</li>
<li><p>You rely on Cockroach Labs’ SLAs, uptime, and support.</p>
</li>
</ul>
<p>If you want, you can check it out here: <a target="_blank" href="https://www.cockroachlabs.com/product/cloud/">CockroachDB Cloud (managed by Cockroach Labs)</a>.</p>
<h3 id="heading-option-2-bring-your-own-cloud-byoc">Option 2: Bring Your Own Cloud (BYOC)</h3>
<p>This is a middle ground: you keep your cloud environment, but let Cockroach Labs manage the database. It gives you control over infrastructure, billing, network, and so on, while still offloading operational complexity.</p>
<p><strong>How it works:</strong></p>
<ul>
<li><p>You run CockroachDB Cloud inside your cloud account (AWS, GCP, and so on).</p>
</li>
<li><p>Cockroach Labs still handles provisioning, upgrades, backups, and observability. You manage roles, networking, and logs.</p>
</li>
<li><p>Useful for complying with regulations, keeping data within your own cloud account, and using your existing cloud discounts.</p>
</li>
</ul>
<p><strong>Tradeoffs:</strong></p>
<ul>
<li><p>You still need to set up cloud aspects (VPCs, IAM, roles) correctly.</p>
</li>
<li><p>There’s more complexity than pure managed, but more control as well.</p>
</li>
<li><p>Cockroach Labs needs access to certain parts of your account (permissions).</p>
</li>
</ul>
<p>If you want to explore BYOC, you can read more here: <a target="_blank" href="https://www.cockroachlabs.com/product/cloud/bring-your-own-cloud/">CockroachDB Bring Your Own Cloud</a>.</p>
<h3 id="heading-option-3-use-cloud-marketplaces-aws-gcp-azure">Option 3: Use Cloud Marketplaces (AWS, GCP, Azure)</h3>
<p>If you already use a cloud provider, sometimes the easiest way is to deploy via their marketplace offerings. It gives you familiarity, billing simplicity, and so on.</p>
<ul>
<li><p><strong>GCP Marketplace</strong> – CockroachDB is available on the Google Cloud Marketplace, making it easier to deploy within your GCP environment. You can learn more here: <a target="_blank" href="https://console.cloud.google.com/marketplace/product/cockroachdb-public/cockroachdb">GCP Marketplace</a>.</p>
</li>
<li><p><strong>AWS Marketplace</strong> – CockroachDB is listed there: <a target="_blank" href="https://aws.amazon.com/marketplace/pp/prodview-n3xpypxea63du">AWS Marketplace</a>.</p>
</li>
<li><p><strong>Azure Marketplace</strong> – Also supported for Azure deployments (SaaS/managed listings): <a target="_blank" href="https://marketplace.microsoft.com/en-us/product/saas/cockroachlabs1586448087626.cockroachdb-azure?tab=overview">Azure Marketplace</a>.</p>
</li>
<li><p><strong>DigitalOcean</strong> – There is support for CockroachDB deployment on DigitalOcean using their infrastructure: <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/deploy-cockroachdb-on-digital-ocean">Deploy CockroachDB on DigitalOcean</a>.</p>
</li>
</ul>
<p>These options let you stay in your cloud console, use your existing cloud accounts, and integrate with other resources you already have.</p>
<p>But you're still responsible for certain operational tasks (networking, security, monitoring, backups) depending on how the marketplace offering is configured.</p>
<h3 id="heading-option-4-my-favorite-self-hosting-especially-using-kubernetes">Option 4 (My Favorite 😁): Self-Hosting — Especially Using Kubernetes</h3>
<p>If you self-host CockroachDB, you get <strong>full control</strong>. You’re the boss of everything: the machines, storage, networking, backups, upgrades, monitoring – all of it.</p>
<p>What’s even better is that using Kubernetes means your setup isn’t tied to one cloud provider. You can run it on AWS, GCP, Azure, or even on-premises later, with very little change. Kubernetes gives you a “portable infra” layer.</p>
<p>Managed CockroachDB services charge you extra for maintenance, upgrades, backups, and so on – those are baked into the price. When you self-host, you accept that burden yourself, but you also avoid paying that extra margin. You pay only for compute, disks, network, and your time/ops work.</p>
<p>You can also self-host in the cloud (using cloud VMs) while still managing every layer: disks, network, security, and so on. Using Kubernetes here is a sweet middle ground: you get cloud-grade reliability for the VMs, but you fully control everything above them.</p>
<h4 id="heading-why-kubernetes-beats-tools-like-docker-swarm-or-hashicorp-nomad-for-databases">Why Kubernetes Beats Tools Like Docker Swarm or Hashicorp Nomad for Databases</h4>
<p>Because CockroachDB is a <strong>stateful</strong> system (it holds data), you need strong support for “data that stays even when a pod restarts or moves.” Kubernetes is designed with good primitives for that. Other tools don’t always shine there.</p>
<p>Here’s the comparison in simple terms:</p>
<ul>
<li><p><strong>Docker Swarm / Docker Compose:</strong> Great for stateless apps (web servers, APIs), but when it comes to databases, it struggles. Swarm doesn’t natively support persistent volume claims at a cluster level, so if a container (database replica) moves to a different node (VM), it might lose access to its storage. Devs often pin containers to specific nodes manually to avoid this.</p>
</li>
<li><p><strong>Nomad:</strong> More flexible and simpler in some ways, but it’s not as rich in features around connectivity, storage management, and built-in tooling for containers. It works well in mixed workloads, but handling complex databases usually means you need to build extra layers.</p>
</li>
<li><p><strong>Kubernetes:</strong> It has built-in support for stateful workloads:</p>
<ul>
<li><p><strong>StatefulSets (Properly managing data for each database):</strong> This ensures that each CockroachDB replica (pod) keeps its identity and storage intact even if the pod restarts. So the database replica doesn’t lose its “name” or data when things change.</p>
</li>
<li><p><strong>Persistent volumes and persistent volume claims (external disks):</strong> These are like dedicated hard drives or disks attached to pods (database replicas). Even if a pod moves, crashes, or restarts, the disk (data) stays. Kubernetes makes sure the data stays safe.</p>
</li>
<li><p><strong>StorageClasses (choose your disk):</strong> You can choose the type of disk your data will be stored on, for example:</p>
<ul>
<li><p>HDD (most affordable, but slower)</p>
</li>
<li><p>Balanced disk (SSD-backed, a balance between cost and speed)</p>
</li>
<li><p>Fast SSD (very fast, recommended by the CockroachDB team, but a bit more expensive than a balanced disk)</p>
</li>
</ul>
</li>
<li><p><strong>Anti-affinity (high availability, fault tolerance):</strong> You can tell Kubernetes, “don’t put more than one CockroachDB replica on the same VM or physical machine.” That way, if one VM goes bad, the other replicas are safe.</p>
</li>
<li><p><strong>Rolling updates (no downtime):</strong> These let you update one replica at a time (configuration, version, resources) without bringing down the whole cluster. While one replica updates, the others serve traffic, which helps you avoid downtime.</p>
</li>
<li><p>Kubernetes also starts and stops replicas in a predictable order (via StatefulSets), so changes stay safe and orderly.</p>
</li>
<li><p><strong>Vertical vs horizontal scaling (earlier talk – reminder)</strong><br>  You remember we talked about scaling in prior sections:</p>
<ul>
<li><p><strong>Horizontal scaling</strong> means adding more replicas (more pods, more nodes) so load spreads out.</p>
</li>
<li><p><strong>Vertical scaling</strong> means increasing the resources (CPU, RAM, disk) of existing nodes/replicas.</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>In tools like Nomad or Docker Swarm, vertical scaling tends to be harder: it often involves stopping services, shutting things down, and restarting VMs, which causes downtime.</p>
<p>Kubernetes makes vertical and horizontal scaling easier at the pod level (you can resize one pod’s CPU + RAM) and manages rolling upgrades so you don’t take everything down at once.</p>
<p>You can also easily add more database replicas to the cluster (to balance load and make the database process queries faster), and the data is automatically copied to the new replica (replication), especially when you use the official CockroachDB Helm chart.</p>
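<p>As a sketch, the anti-affinity rule described above looks roughly like this in raw Kubernetes YAML (the label key here is illustrative – the official CockroachDB Helm chart can generate an equivalent rule for you):</p>
<pre><code class="lang-yaml"># Illustrative pod anti-affinity: never schedule two CockroachDB pods
# on the same VM (node).
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: cockroachdb   # illustrative label
        topologyKey: kubernetes.io/hostname       # one pod per hostname
</code></pre>
<p>With a rule like this, losing one VM takes down at most one replica, and the surviving replicas keep serving traffic.</p>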
<h4 id="heading-why-other-tools-swarm-nomad-docker-compose-dont-match-up-here">Why Other Tools (Swarm / Nomad / Docker Compose) Don’t Match Up Here</h4>
<p>Docker Swarm and Docker Compose are simpler to use and are good when you don’t have much complexity. But they lack robust features for stable storage, default support for replication, vertical scaling, horizontal scaling of stateful services, and so on. For example, Swarm doesn’t have built-in StatefulSets or dynamic volume provisioning like Kubernetes.</p>
<p>Nomad is more flexible than Swarm in some ways, but many users find its storage plugins (CSI) weaker than what Kubernetes offers. It also has less built-in support for ordered startup and rolling updates of stateful apps.</p>
<p>So while these tools work fine for simpler apps (stateless services, small apps), when you run a distributed, stateful SQL database like CockroachDB, Kubernetes gives you more safety, more control, and less chance of data loss or misconfiguration.</p>
<p>Because of all this, running CockroachDB on Kubernetes gives you the tools you need baked in, reducing how much custom plumbing you must write yourself.</p>
<h4 id="heading-trade-offhttpswwwredditcomrhashicorpcomments1ivtuo5utmsourcechatgptcoms-things-to-watch-out-for">Trade-offs (things to watch out for)</h4>
<ul>
<li><p>You have to manage everything: backups, monitoring the ENTIRE CockroachDB cluster, withstanding failures (fault tolerance), and upgrades. That’s work 🥲.</p>
</li>
<li><p>You need to know your way around infra (VMs, disks, networking, and inter-node connections) and operations (or have teammates who do – DevOps Engineers, Cloud Architects, Site Reliability Engineers).</p>
</li>
<li><p>Using managed Kubernetes (like GKE, EKS, AKS) helps as you offload the control plane. You still manage the nodes, storage, and higher layers.</p>
</li>
<li><p>But even with that, you avoid paying for “database management as a service” markup – you're only paying for infrastructure plus your time.</p>
</li>
</ul>
<h2 id="heading-setting-up-your-local-environment"><strong>Setting Up Your Local Environment 🧑‍💻</strong></h2>
<p>Alright, we’ve learned quite a bit so far: what CockroachDB is, how it works behind the scenes, and where you can host it. Now, it’s time to roll up our sleeves and get our hands dirty with some practical setup.</p>
<p>Before we deploy CockroachDB, we need a safe “playground” where we can test and experiment without touching the cloud or spending a dime.</p>
<h3 id="heading-why-these-tools">Why these tools?</h3>
<p>Before we jump into running commands, here’s a quick overview of the tools we’ll use and why:</p>
<ul>
<li><p><strong>Minikube</strong>: A tool that runs a small Kubernetes cluster on your computer. It gives you a local “mini cloud” where you can deploy and experiment.</p>
</li>
<li><p><strong>Kubectl</strong>: The command line tool you’ll use to talk to your Kubernetes cluster to deploy apps, check status, and manage resources.</p>
</li>
<li><p><strong>Helm</strong>: A package manager for Kubernetes. It helps you install complex applications (like CockroachDB) with fewer manual steps.</p>
</li>
</ul>
<h3 id="heading-step-1-install-minikube">Step 1: Install Minikube</h3>
<p><strong>What is Minikube?</strong><br>Minikube is a lightweight tool that helps you run a small Kubernetes cluster on your personal computer.</p>
<p>Think of it as your own mini-cloud environment where you can test, deploy, and learn Kubernetes (and in our case, CockroachDB) locally. It’s perfect for learning and experimenting before deploying on the cloud.</p>
<p>Here’s how to get it on different operating systems:</p>
<h4 id="heading-windows">🪟 Windows</h4>
<ol>
<li><p>Make sure you have a hypervisor (VirtualBox, Hyper-V) or Docker installed.</p>
</li>
<li><p>Open PowerShell as Administrator.</p>
</li>
<li><p>Run:</p>
<pre><code class="lang-bash"> choco install minikube
</code></pre>
<p> or use:</p>
<pre><code class="lang-bash"> winget install minikube
</code></pre>
</li>
<li><p>After installation, check the version:</p>
<pre><code class="lang-bash"> minikube version
</code></pre>
<p> If it returns a version number, you’re good 👍🏾</p>
</li>
</ol>
<p>If you don’t have the <code>choco</code> or <code>winget</code> package manager, you can install Minikube via PowerShell by following the steps in the <a target="_blank" href="https://minikube.sigs.k8s.io/docs/start/?arch=%2Fwindows%2Fx86-64%2Fstable%2F.exe+download">docs</a>.</p>
<h4 id="heading-macos">🍎 macOS</h4>
<ol>
<li><p>Ensure you have Homebrew installed.</p>
</li>
<li><p>In Terminal, run:</p>
<pre><code class="lang-bash"> brew install minikube
</code></pre>
</li>
<li><p>Start the cluster:</p>
<pre><code class="lang-bash"> minikube start
</code></pre>
</li>
<li><p>Verify:</p>
<pre><code class="lang-bash"> minikube version
</code></pre>
</li>
</ol>
<h4 id="heading-linux">🐧 Linux</h4>
<ol>
<li><p>Ensure you’re on a supported distribution (Ubuntu, Fedora, and so on) and virtualization (Docker, KVM, and so on) is enabled.</p>
</li>
<li><p>Run:</p>
<pre><code class="lang-bash"> curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
 sudo install minikube-linux-amd64 /usr/<span class="hljs-built_in">local</span>/bin/minikube
 rm minikube-linux-amd64
</code></pre>
</li>
<li><p>Start the cluster:</p>
<pre><code class="lang-bash"> minikube start
</code></pre>
</li>
<li><p>Verify:</p>
<pre><code class="lang-bash"> minikube status
</code></pre>
</li>
</ol>
<p>✅ At this point you should have a local Kubernetes cluster up and running on your machine! Next, we’ll install Kubectl so you can talk to the cluster from your command line.</p>
<h3 id="heading-step-2-install-kubectl">Step 2: Install kubectl</h3>
<p><strong>What kubectl does:</strong><br>kubectl is the command-line tool that lets you talk to your Kubernetes cluster. Using it, you can deploy applications, check your cluster’s health, and manage resources inside your cluster.</p>
<p>You’ll use it a lot when working with Kubernetes on Minikube and later when you deploy CockroachDB.</p>
<p>Here’s how to install it on Windows, macOS, and Linux:</p>
<h4 id="heading-windows-1">🪟 Windows</h4>
<ol>
<li><p>Open PowerShell as Administrator.</p>
</li>
<li><p>Run:</p>
<pre><code class="lang-bash"> choco install kubernetes-cli
</code></pre>
<p> or if you prefer:</p>
<pre><code class="lang-bash"> choco install kubectl
</code></pre>
</li>
<li><p>Then check the version:</p>
<pre><code class="lang-bash"> kubectl version --client
</code></pre>
<p> If it prints a version number, you’re good.</p>
</li>
</ol>
<h4 id="heading-macos-1">🍎 macOS</h4>
<ol>
<li><p>Open Terminal.</p>
</li>
<li><p>If you have Homebrew installed, run:</p>
<pre><code class="lang-bash"> brew install kubectl
</code></pre>
</li>
<li><p>Check the version:</p>
<pre><code class="lang-bash"> kubectl version --client
</code></pre>
<p> That should show something like “Client Version: v1.x.x”.</p>
</li>
</ol>
<h4 id="heading-linux-1">🐧 Linux</h4>
<ol>
<li><p>Open your terminal.</p>
</li>
<li><p>Download the latest kubectl binary:</p>
<pre><code class="lang-bash"> curl -LO <span class="hljs-string">"https://dl.k8s.io/release/<span class="hljs-subst">$(curl -L -s https://dl.k8s.io/release/stable.txt)</span>/bin/linux/amd64/kubectl"</span>
</code></pre>
</li>
<li><p>Make it executable and move it into your PATH:</p>
<pre><code class="lang-bash"> chmod +x ./kubectl
 sudo mv ./kubectl /usr/<span class="hljs-built_in">local</span>/bin/kubectl
</code></pre>
</li>
<li><p>Verify:</p>
<pre><code class="lang-bash"> kubectl version --client
</code></pre>
</li>
</ol>
<p>After this, you’ll have kubectl installed and ready to use with your local Minikube cluster. Next up we’ll install Helm, which will make deploying CockroachDB much easier.</p>
<h3 id="heading-step-3-install-helm">Step 3: Install Helm</h3>
<p>Helm is basically the package manager for Kubernetes. Think of it like how you use <code>apt</code>, <code>yum</code>, or <code>brew</code> to install software on your computer. Helm does something similar for Kubernetes apps.</p>
<p>With Kubernetes, deploying a full app often means writing lots of configs (manifests – Deployments, Services, PersistentVolumes, ConfigMaps, and so on). Helm lets us bundle all of that into a single “package” (called a chart) so we don’t have to create each resource one after the other (which could be hectic to manage, btw 😖).</p>
<p>Because our goal is to deploy a pretty complex system (CockroachDB) on Kubernetes – which includes stateful nodes, persistent storage, networking, SSL/TLS, and so on – using a Helm chart makes it <em>so much easier</em> than crafting dozens of YAML files from scratch.</p>
<p>So before we install CockroachDB, we’ll install Helm. This gives us the toolkit to deploy and manage our cluster much more easily.</p>
<p>Let’s install Helm on each platform. After this, you’ll have the <code>helm</code> command ready to deploy apps into your Kubernetes cluster.</p>
<h4 id="heading-windows-2">🪟 Windows</h4>
<ol>
<li><p>Open PowerShell as Administrator.</p>
</li>
<li><p>If you have Chocolatey installed, run:</p>
<pre><code class="lang-bash"> choco install kubernetes-helm
</code></pre>
<p> Alternatively:</p>
<pre><code class="lang-bash"> choco install helm
</code></pre>
</li>
<li><p>Confirm installation:</p>
<pre><code class="lang-bash"> helm version
</code></pre>
<p> You should see something like <code>version.BuildInfo{Version:"v3.x.x",…}</code>.</p>
</li>
</ol>
<h4 id="heading-macos-2">🍎 macOS</h4>
<ol>
<li><p>Open Terminal.</p>
</li>
<li><p>With Homebrew installed, run:</p>
<pre><code class="lang-bash"> brew install helm
</code></pre>
</li>
<li><p>Verify:</p>
<pre><code class="lang-bash"> helm version
</code></pre>
<p> If you see version info, you’re good.</p>
</li>
</ol>
<h4 id="heading-linux-2">🐧 Linux</h4>
<ol>
<li><p>Open your terminal.</p>
</li>
<li><p>Download and install the binary (example for the latest version):</p>
<pre><code class="lang-bash"> curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
 chmod 700 get_helm.sh
 ./get_helm.sh
</code></pre>
<p> Or you can directly download the binary and move it into your <code>PATH</code>.</p>
</li>
<li><p>Check version:</p>
<pre><code class="lang-bash"> helm version
</code></pre>
</li>
</ol>
<p>✅ After this, you have <code>helm</code> installed and you’re ready to use it.</p>
<p>In the next part, we’ll use Helm to install CockroachDB into your local Minikube cluster. We’ll add the CockroachDB chart, configure it, and spin up a multi-node replica setup right on your PC.</p>
<h2 id="heading-deploying-cockroachdb-on-minikube-the-fun-part-begins">Deploying CockroachDB on Minikube (The Fun Part Begins 😁!)</h2>
<p>Before we go to the cloud, we’ll deploy CockroachDB locally on Minikube using Helm.</p>
<p>This process will help us:</p>
<ul>
<li><p>Understand how CockroachDB runs in a cluster</p>
</li>
<li><p>Learn how Kubernetes manages database replicas</p>
</li>
<li><p>Gain hands-on experience before deploying to the cloud</p>
</li>
</ul>
<h3 id="heading-step-1-visit-artifacthub">Step 1: Visit ArtifactHub</h3>
<p><strong>ArtifactHub</strong> is like an App Store for Kubernetes Helm Charts – a huge collection of open-source Helm charts and packages you can easily install.</p>
<ol>
<li><p>Go to <a target="_blank" href="https://artifacthub.io">https://artifacthub.io</a></p>
</li>
<li><p>In the search bar, type <strong>CockroachDB</strong></p>
</li>
<li><p>Click the <strong>CockroachDB Helm chart</strong> result (you’ll see it published by <em>Cockroach Labs</em>).</p>
</li>
</ol>
<p>You’ll see something like this 👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760848079912/1778bbcf-088a-4919-80bb-ca24241ffa85.png" alt="The official CockroachDB Helm chart" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-2-explore-the-helm-chart">Step 2: Explore the Helm Chart</h3>
<p>You’ll notice a lot of information on the page:</p>
<ul>
<li><p><strong>README</strong> – the documentation for installing and customizing CockroachDB</p>
</li>
<li><p><strong>Default Values</strong> – all the settings that define how the database runs</p>
</li>
</ul>
<p>Don’t worry if it looks overwhelming. We’ll walk through it together 😉</p>
<h3 id="heading-step-3-copy-the-default-values">Step 3: Copy the Default Values</h3>
<p>Every Helm chart has a <em>default configuration</em> file. These defaults are usually too advanced or too heavy for local setups, so we’ll create our own lighter version. But first, let’s copy the original for reference.</p>
<ol>
<li><p>On the CockroachDB chart page, click the <strong>Default Values</strong> button.</p>
</li>
<li><p>A modal window will pop up showing a long YAML file.</p>
</li>
<li><p>Click the <strong>Copy</strong> icon in the top-right corner to copy all the default values.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760848210119/17cd734b-6d7c-40dc-a8c3-f01c85edd7a7.png" alt="The Default Values button description" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760848520060/1e1ce249-0cf0-46cb-abbc-00efb3ea1343.png" alt="Copy the default values" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-4-create-a-folder-for-our-project">Step 4: Create a Folder for Our Project</h3>
<p>We’ll keep everything organized in a single folder.</p>
<pre><code class="lang-bash">mkdir cockroachdb-tutorial
<span class="hljs-built_in">cd</span> cockroachdb-tutorial
</code></pre>
<p>Inside this folder, create a new file called:</p>
<pre><code class="lang-bash">nano cockroachdb-original-values.yml
</code></pre>
<p>Now paste all the default values you copied earlier (in most terminals: right-click, or <code>Ctrl+Shift+V</code>), then save and exit (<code>Ctrl+O</code>, <code>Enter</code> to confirm, then <code>Ctrl+X</code> in nano).</p>
<p>If you’re on Windows, just open Notepad/VSCode, paste the content, and save the file in the same folder.</p>
<h3 id="heading-step-5-understanding-the-key-configurations">Step 5: Understanding the Key Configurations</h3>
<p>Let’s break down a few important values you’ll notice in the file.</p>
<h4 id="heading-statefulsetreplicas">🧩 <code>statefulset.replicas</code></h4>
<p>This tells CockroachDB how many database nodes (replicas) to run in the cluster. By default, it’s set to 3, meaning you’ll have 3 independent database instances that can all read and write data.</p>
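<p>Why 3? CockroachDB replicates data using the Raft consensus protocol, which needs a <em>majority</em> of replicas to stay available. A quick back-of-the-envelope sketch (plain Python, just for illustration):</p>

```python
# Raft-style consensus needs a majority of replicas alive.
# With n replicas, the cluster survives floor((n - 1) / 2) failed nodes.
def failures_tolerated(replicas: int) -> int:
    return (replicas - 1) // 2

for n in (1, 3, 5):
    print(n, "replicas ->", failures_tolerated(n), "node failure(s) tolerated")
```

<p>So 3 replicas is the smallest cluster that can lose a node and keep serving reads and writes, which is why it’s the chart’s default.</p>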
<h4 id="heading-statefulsetresourcesrequests-and-statefulsetresourceslimits">⚙️ <code>statefulset.resources.requests</code> and <code>statefulset.resources.limits</code></h4>
<p>These settings tell Kubernetes how much CPU and memory to give CockroachDB.</p>
<ul>
<li><p><code>requests</code>: the minimum guaranteed amount</p>
</li>
<li><p><code>limits</code>: the maximum allowed amount</p>
</li>
</ul>
<p>CockroachDB can be a bit greedy with memory 😅, so limits make sure it doesn’t take everything and leave no room for other apps.</p>
<h4 id="heading-storagepersistentvolumesize">💾 <code>storage.persistentVolume.size</code></h4>
<p>This defines how much disk space each CockroachDB node gets. For example, if you set it to <code>10Gi</code> and you have 3 replicas, total usage = <code>30Gi</code>.</p>
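<p>The arithmetic above is simply the per-node size multiplied by the replica count. A tiny sketch (illustrative only, and it only handles <code>Gi</code> sizes):</p>

```python
def total_storage_gi(per_node: str, replicas: int) -> int:
    """Multiply a per-node volume size like '10Gi' by the replica count."""
    assert per_node.endswith("Gi"), "this sketch only handles Gi sizes"
    return int(per_node[:-2]) * replicas

print(total_storage_gi("10Gi", 3))  # -> 30
print(total_storage_gi("5Gi", 3))   # -> 15
```

<p>Keep this in mind when sizing your local disk: the value you set is <em>per node</em>, not for the whole cluster.</p>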
<h4 id="heading-storagepersistentvolumestorageclass">💽 <code>storage.persistentVolume.storageClass</code></h4>
<p>This defines the type of disk to use. The available class names depend on your cluster provider; on Google Cloud (GKE), for example:</p>
<ul>
<li><p><code>standard</code>: HDD (cheap but slow)</p>
</li>
<li><p><code>standard-rwo</code>: SSD (faster and affordable)</p>
</li>
<li><p><code>pd-ssd</code> or <code>fast-ssd</code>: NVMe (super fast but pricey)</p>
</li>
</ul>
<p>You can check available storage classes in your Minikube cluster using:</p>
<pre><code class="lang-bash">kubectl get sc
</code></pre>
<p>On Minikube, the default storage class is usually <code>standard</code>.</p>
<p>You can learn more about <a target="_blank" href="https://cloud.google.com/kubernetes-engine/docs/concepts/storage-overview">Google Cloud storage classes here</a>.</p>
<h4 id="heading-tlsenabled">🔐 <code>tls.enabled</code></h4>
<p>This controls whether CockroachDB requires <strong>TLS certificates</strong> for secure connections.</p>
<p>If <code>true</code>, you’ll need to generate certificates for any app or client that connects to your cluster (instead of using a username and password). This is <strong>strongly recommended for production</strong>, but for our local Minikube setup, we’ll disable it so it’s easier to play around and test connections.</p>
<h3 id="heading-step-6-create-a-simplified-values-config-for-the-cockroachdb-helm-chart">Step 6: Create a Simplified Values Config for the CockroachDB Helm Chart</h3>
<p>We’ll now create a new config file with lighter resource settings for our local test environment.</p>
<p>In the same folder, create:</p>
<pre><code class="lang-bash">nano cockroachdb-values.yml
</code></pre>
<p>Then paste this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">statefulset:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">podSecurityContext:</span>
    <span class="hljs-attr">fsGroup:</span> <span class="hljs-number">1000</span>
    <span class="hljs-attr">runAsUser:</span> <span class="hljs-number">1000</span>
    <span class="hljs-attr">runAsGroup:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"1Gi"</span> <span class="hljs-comment"># You should have 3GB+ of RAM free on your device; else, you can reduce this to 500Mi (this will result in your PC needing just 1.5 GB of RAM free)</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>  <span class="hljs-comment"># The same with this, you can reduce it to 500m CPU if you don't have up to 3 CPU cores (1 CPU core * 3 replicas)</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"1Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">podAntiAffinity:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">""</span>
  <span class="hljs-attr">nodeSelector:</span>
    <span class="hljs-attr">kubernetes.io/hostname:</span> <span class="hljs-string">minikube</span>

<span class="hljs-attr">storage:</span>
  <span class="hljs-attr">persistentVolume:</span>
    <span class="hljs-attr">size:</span> <span class="hljs-string">5Gi</span> <span class="hljs-comment"># Make sure you have 15GB+ of free storage on your local machine, if not, you can reduce it to 2 - 3 Gi</span>
    <span class="hljs-attr">storageClass:</span> <span class="hljs-string">standard</span>

<span class="hljs-attr">tls:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>

<span class="hljs-attr">init:</span>
  <span class="hljs-attr">jobs:</span>
    <span class="hljs-attr">wait:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>Setting the <code>requests</code> and <code>limits</code> to the same value gives the pods the <em>Guaranteed</em> QoS class, which makes Kubernetes far less likely to evict CockroachDB pods when the node comes under memory or CPU pressure.</p>
<p>You can <a target="_blank" href="https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/">read more about this here</a>.</p>
<h3 id="heading-overview-of-the-yaml-values">Overview of the YAML values</h3>
<p>Now, let’s walk through the contents of the <code>cockroachdb-values.yml</code> file together.</p>
<p><code>podSecurityContext</code> – why you needed it on Minikube:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">podSecurityContext:</span>
  <span class="hljs-attr">fsGroup:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">runAsUser:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">runAsGroup:</span> <span class="hljs-number">1000</span>
</code></pre>
<p>This block sets the Linux user and group IDs that the CockroachDB process runs as inside the container, and the group ownership for mounted files.</p>
<p>Why this matters, simply:</p>
<ul>
<li><p>The CockroachDB process runs as <strong>UID 1000</strong> inside the container. If the disk mount (the persistent volume) is owned by a different UID, Cockroach can’t create files there and fails with <code>permission denied</code>.</p>
</li>
<li><p><code>runAsUser</code> and <code>runAsGroup</code> make the container process run as UID/GID 1000.</p>
</li>
<li><p><code>fsGroup</code> sets the group ownership of the mounted volume, so the process can write to <code>/cockroach/cockroach-data</code>.</p>
</li>
</ul>
<p>In short, these lines make sure the DB process has permission to create and write files on the mounted disk (volume), which is especially important on Minikube and other local setups where host-mounted storage can have odd permissions.</p>
<p><code>podAntiAffinity</code> and <code>nodeSelector</code> – what they do:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">podAntiAffinity:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">""</span>

<span class="hljs-attr">nodeSelector:</span>
  <span class="hljs-attr">kubernetes.io/hostname:</span> <span class="hljs-string">minikube</span>
</code></pre>
<p><code>podAntiAffinity</code> is enabled by default in this chart. Normally it tells Kubernetes to <em>spread</em> pods across different nodes (VMs), so replicas don’t all land on the same physical machine. This is good for high availability, because one node failing won’t take down multiple replicas.</p>
<p>By setting <code>type: ""</code> (empty), you <strong>disabled</strong> that spreading rule, so Kubernetes can place multiple CockroachDB replicas on the same node.</p>
<p><code>nodeSelector</code> tells Kubernetes to schedule pods only on nodes that match the label you set (here <code>kubernetes.io/hostname: minikube</code>). That forces all pods to run on the node named <code>minikube</code>.</p>
<p>Quick summary of the effect:</p>
<ul>
<li><p>Good for local testing on a multi-node Minikube cluster, when only one node has properly mounted writable storage.</p>
</li>
<li><p><strong>Not recommended for production</strong>, because it places all replicas on the same machine (single point of failure).</p>
</li>
</ul>
<p>PS: If you’re using another Kubernetes cluster provider, for example K3s or Kind, the pods may never get scheduled (they’ll sit in <code>Pending</code>) because no node carries the <code>kubernetes.io/hostname: minikube</code> label the <code>nodeSelector</code> targets. In that case, I'd advise removing the <code>nodeSelector</code> property entirely:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">nodeSelector:</span>
    <span class="hljs-attr">kubernetes.io/hostname:</span> <span class="hljs-string">minikube</span>
<span class="hljs-string">...</span>
</code></pre>
<p>✅ <strong>At this point</strong>, we’ve:</p>
<ul>
<li><p>Copied the default CockroachDB Helm chart configuration</p>
</li>
<li><p>Created a lightweight version for Minikube</p>
</li>
<li><p>Learned what each key property means</p>
</li>
</ul>
<h3 id="heading-step-7-install-the-cockroachdb-cluster-using-helm">🚀 Step 7: Install the CockroachDB Cluster Using Helm</h3>
<p>Great job so far! You’ve created your <code>cockroachdb-values.yml</code> file and set up your custom configuration for Minikube. Now we’ll actually deploy the cluster.</p>
<p><strong>What we’re going to do:</strong><br>We’ll use Helm to install the official CockroachDB Helm chart using our custom values. This will spin up your 3-node cluster locally so you can play with it.</p>
<p><strong>Command to run:</strong></p>
<pre><code class="lang-bash">helm install crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<p>Here:</p>
<ul>
<li><p><code>crdb</code> is the name we’re giving this release (you can pick something else if you like).</p>
</li>
<li><p><code>cockroachdb/cockroachdb</code> tells Helm which chart to use.</p>
</li>
<li><p><code>-f cockroachdb-values.yml</code> tells Helm to use our custom file instead of default values.</p>
</li>
</ul>
<h4 id="heading-after-the-command-runs">After the command runs:</h4>
<p>After a short while, the command completes and you’ll see output listing the resources that were created (pods, services, persistent volume claims, and so on).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761386160496/babc3e67-1ea9-4aa1-b6a7-516fe3a9972a.png" alt="The CockroachDB Helm Chart post-installation message" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now to check if everything is working, do this:</p>
<pre><code class="lang-bash">kubectl get pods | grep -i crdb
</code></pre>
<p>This filters pods with “crdb” in the name (our release prefix).</p>
<p>You should see something like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761386195190/21469ce5-c909-4336-ba5f-a4c4a776a470.png" alt="The CockroachDB replicas running successfully" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>The three primary pods (<code>0</code>, <code>1</code>, <code>2</code>) should be in <code>Running</code> state. The <code>init</code> job or pod (<code>crdb-cockroachdb-init-xxx</code>) should show <code>Completed</code>. This means the initialization tasks (cluster bootstrap) succeeded.</p>
<p>If you see that, congratulations! You’ve got your local CockroachDB cluster up and running! 🎉</p>
<h2 id="heading-accessing-the-cockroachdb-console-amp-viewing-metrics">Accessing the CockroachDB Console &amp; Viewing Metrics</h2>
<p>Alright! Now that our CockroachDB cluster is up and running, let’s take a peek behind the scenes and explore the CockroachDB Admin Console. It’s a beautiful web dashboard that helps us visualize everything happening in our database cluster.</p>
<p>In this section, we’ll learn how to:</p>
<ul>
<li><p>Access the CockroachDB admin console right from your browser 🖥️</p>
</li>
<li><p>Understand what each built-in dashboard shows (CPU, memory, disk, SQL performance)</p>
</li>
<li><p>Confirm that our cluster is healthy and that all 3 nodes are working together perfectly</p>
</li>
</ul>
<h3 id="heading-step-1-locate-the-cockroachdb-public-service">Step 1: Locate the CockroachDB Public Service</h3>
<p>CockroachDB automatically creates a <strong>public service</strong> that allows us to connect to the database and also access its dashboard.</p>
<p>Let’s check it out by running:</p>
<pre><code class="lang-bash">kubectl get svc | grep -i crdb
</code></pre>
<p>You should see a line similar to:</p>
<pre><code class="lang-bash">crdb-cockroachdb-public   ClusterIP   10.x.x.x   &lt;none&gt;   26257/TCP,8080/TCP   ...
</code></pre>
<p>This service (<code>crdb-cockroachdb-public</code>) is what we’ll use to connect to both:</p>
<ul>
<li><p>The <strong>database</strong> itself (via port 26257)</p>
</li>
<li><p>The <strong>dashboard UI</strong> (via port 8080)</p>
</li>
</ul>
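<p>Since we disabled TLS in our values file, clients connect with an ordinary PostgreSQL-style connection string once port 26257 is reachable (for example, after a <code>kubectl port-forward</code> to it). Here’s a hypothetical helper for building one; the <code>root</code> user and <code>defaultdb</code> database are CockroachDB defaults, and everything else is an assumption about our local setup:</p>

```python
def crdb_url(host: str = "localhost", port: int = 26257,
             user: str = "root", db: str = "defaultdb",
             secure: bool = False) -> str:
    # With tls.enabled: false in the Helm values, the driver must be
    # told to skip TLS entirely via sslmode=disable.
    sslmode = "verify-full" if secure else "disable"
    return f"postgresql://{user}@{host}:{port}/{db}?sslmode={sslmode}"

print(crdb_url())
# -> postgresql://root@localhost:26257/defaultdb?sslmode=disable
```

<p>Any standard PostgreSQL driver should accept a URL in this shape, since CockroachDB speaks the PostgreSQL wire protocol.</p>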
<h3 id="heading-step-2-learn-more-about-the-service">Step 2: Learn More About the Service</h3>
<p>Let’s dig a little deeper to understand it:</p>
<pre><code class="lang-bash">kubectl describe svc crdb-cockroachdb-public
</code></pre>
<p>Here’s what you’ll notice:</p>
<ul>
<li><p><strong>Port 26257</strong> is the <strong>SQL/gRPC port</strong>: applications connect here to send and receive SQL queries (using the PostgreSQL wire protocol), and nodes use it for internal gRPC traffic.</p>
</li>
<li><p><strong>Port 8080</strong> is used for the <strong>web dashboard</strong>, where we can view metrics and monitor performance.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761387757614/dab8cfd0-2d89-45b0-a54f-41e530f1a6ab.png" alt="Description of the crdb-cockroachdb-public service" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-3-access-the-cockroachdb-dashboard">Step 3: Access the CockroachDB Dashboard</h3>
<p>Now, let’s make the dashboard available on your local computer. Run this command:</p>
<pre><code class="lang-bash">kubectl port-forward svc/crdb-cockroachdb-public 8080:8080
</code></pre>
<p>This command simply tells Kubernetes:</p>
<blockquote>
<p>“Hey, please open a tunnel from my local computer’s port 8080 to the CockroachDB service’s port 8080 in the cluster.”</p>
</blockquote>
<p>Once you see something like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761387838362/186ff222-c643-4e67-b0a4-dbaff8777977.png" alt="Result of port-forwarding the crdb-cockroachdb-public service on port 8080" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>...you’re good to go!</p>
<h3 id="heading-step-4-visit-the-dashboard">Step 4: Visit the Dashboard</h3>
<p>Now, open your browser and go to http://localhost:8080.</p>
<p>You’ll see the CockroachDB Admin Console. This is your central command center for monitoring your cluster.</p>
<p>Here, you’ll be able to view:</p>
<ul>
<li><p><strong>Number of replicas (nodes)</strong>: You should see 3 in our setup.</p>
</li>
<li><p><strong>RAM usage</strong> per node: Helps track how much memory each CockroachDB instance is using.</p>
</li>
<li><p><strong>CPU usage</strong>: Useful to know when your database is getting busy.</p>
</li>
<li><p><strong>Disk space</strong>: Shows how much data your cluster is storing and how much free space remains.</p>
</li>
</ul>
<p>Here’s what your dashboard might look like 👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761387968743/327288e5-4811-42bf-8fd8-74ed187792a4.png" alt="The CockroachDB dashboard UI on http://localhost:8080" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-5-exploring-the-metrics-dashboard">Step 5: Exploring the Metrics Dashboard</h3>
<p>Now that you’re inside the CockroachDB Admin Console (<a target="_blank" href="http://localhost:8080">http://localhost:8080</a>), let’s take things a step further by exploring the <strong>Metrics</strong> section. This is where CockroachDB really shines.</p>
<p>On the left-hand side, click on “Metrics.” Here, you’ll find a collection of dashboards showing how your database is performing behind the scenes, things like query activity, performance, memory use, and much more.</p>
<p>These metrics help you understand what’s happening inside your cluster and make data-driven decisions – like when to scale up, optimize queries, or add more nodes.</p>
<p>We’ll start by focusing on some of the most insightful ones, such as:</p>
<ul>
<li><p><strong>SQL Queries Per Second</strong> – how busy your database is</p>
</li>
<li><p><strong>Service Latency (SQL Statements, 99th percentile)</strong> – how fast or slow your queries are</p>
</li>
</ul>
<p>Then, we’ll also look at others like SQL Contention, Replicas per Node, and Capacity to get a complete view of your CockroachDB cluster’s health.</p>
<p>Here’s what each of these metrics means in simple, everyday terms 👇🏾</p>
<h4 id="heading-sql-queries-per-second">SQL Queries Per Second</h4>
<p>This metric shows the number of SQL commands (like <code>SELECT</code>, <code>INSERT</code>, <code>UPDATE</code>, <code>DELETE</code>) your database cluster is handling every second. In simpler words, it’s how busy your database is. Imagine cars passing through a toll booth – this is the count of cars per second.</p>
<p>This is useful to know because if this number is steadily climbing, your system is handling more traffic or work, and you may need to scale up (more nodes, more resources) or optimize queries. If it drops suddenly, something might be wrong (an application outage, a connectivity issue, and so on).</p>
<p>Look for a stable or expected value for your workload. Spikes or sustained high values mean you should check performance.</p>
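<p>To make “queries per second” concrete, here’s a toy calculation (nothing CockroachDB-specific): count the query events observed in a window and divide by the window length:</p>

```python
def queries_per_second(query_timestamps: list[float], window_seconds: float) -> float:
    """Average QPS over a measurement window, given one timestamp per query."""
    return len(query_timestamps) / window_seconds

# 120 queries observed over a 10-second window:
timestamps = [i * 10 / 120 for i in range(120)]
print(queries_per_second(timestamps, 10.0))  # -> 12.0
```

<p>The dashboard does essentially this for you, continuously, over short sliding windows.</p>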
<h4 id="heading-service-latency-sql-statements-99th-percentile">Service Latency: SQL Statements, 99th percentile</h4>
<p>This metric shows the time it takes, for the slowest ~1% of queries, from when the database receives a request until it finishes executing it. Think of waiting in a queue: the 99th percentile is what the slowest people (1 in 100) experienced.</p>
<p>You’ll want to know this because if the slowest queries are taking too long, it might signal a bottleneck (CPU, disk, network, and so on). Low latency = good user experience.</p>
<p>So keep an eye out: if this value rises (gets worse) over time, investigate what’s slowing down. If it stays low and stable, you’re in good shape.</p>
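<p>“99th percentile” just means the value that 99% of samples fall at or below. A minimal nearest-rank sketch (this is one of several common percentile definitions, not necessarily the exact one CockroachDB uses internally):</p>

```python
import math

def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile: the value 99% of samples fall at or below."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(0.99 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# 98 fast queries at 5 ms, plus two slow outliers:
samples = [5.0] * 98 + [200.0, 250.0]
print(p99(samples))  # -> 200.0
```

<p>Notice how a handful of slow queries dominate the p99 value even when the vast majority are fast; that’s exactly why dashboards track it.</p>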
<h4 id="heading-sql-statement-contention">SQL Statement Contention</h4>
<p>Statement contention shows the number of SQL queries that got “stuck” or had to wait because other queries were using the same data or resources. It’s like two people trying to grab the same book – one has to wait. That waiting is contention.</p>
<p>High contention means your queries are competing for the same rows, waiting on locks or resources, which slows everything down. So you’ll want to keep this number as low as possible. If it starts rising, you might need to revisit your schema, your queries, or how you scale.</p>
<h4 id="heading-replicas-per-node">Replicas per Node</h4>
<p>This tells you how many copies (“replicas”) of data ranges live on each database node. If you imagine your data is like documents saved in several safes (nodes), this shows how many copies are in each safe.</p>
<p>This matters, because you want balanced replicas so no node is overloaded with too many copies (which can slow it down or put it at risk).</p>
<p>To check on this, make sure nodes have roughly equal replica counts. If one node has many more replicas, you might need to rebalance or add nodes.</p>
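<p>A toy balance check, assuming you’ve read the per-node replica counts off the dashboard (the node names and the tolerance of 5 here are made up for illustration):</p>

```python
def is_balanced(replicas_per_node: dict[str, int], tolerance: int = 5) -> bool:
    """Roughly balanced if the busiest and idlest nodes differ by <= tolerance ranges."""
    counts = replicas_per_node.values()
    return max(counts) - min(counts) <= tolerance

healthy = {"node1": 38, "node2": 40, "node3": 39}
skewed  = {"node1": 70, "node2": 25, "node3": 24}
print(is_balanced(healthy))  # -> True
print(is_balanced(skewed))   # -> False
```

<p>In practice CockroachDB rebalances ranges automatically, but a skew like the second example is worth investigating.</p>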
<h4 id="heading-capacity">Capacity</h4>
<p>Capacity shows how much disk/storage your cluster has (total), how much is used, and how much is free. Imagine a warehouse: it’s like how many boxes you can store, how many you’ve filled, and how much empty space remains.</p>
<p>You’ll need to know this because if capacity is nearly full, you risk running out of space, which can cause downtime or performance issues.</p>
<p>Keep usage below a healthy threshold (~80% used is a common rule of thumb). If you cross it, plan to add storage or nodes.</p>
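<p>That rule of thumb is easy to turn into a check. A quick sketch (the ~80% threshold comes from the guideline above, not from CockroachDB itself):</p>

```python
def capacity_warning(used_gib: float, total_gib: float, threshold: float = 0.80) -> bool:
    """True when disk usage crosses the ~80% rule-of-thumb threshold."""
    return used_gib / total_gib > threshold

print(capacity_warning(12.5, 15.0))  # ~83% used -> True
print(capacity_warning(6.0, 15.0))   # 40% used -> False
```

<p>With our values file (3 replicas × 5Gi), total capacity is 15Gi, so a warning around 12Gi used would give you time to react.</p>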
<h4 id="heading-why-these-matter-together">Why These Matter Together</h4>
<p>When you combine these metrics, you get a clear picture:</p>
<ul>
<li><p>High Queries Per Second + high latency = maybe you're under-powered.</p>
</li>
<li><p>High contention = your workload design might be fighting itself.</p>
</li>
<li><p>Imbalanced replicas or full capacity = infrastructure issues.</p>
</li>
<li><p>Stable low latency + balanced replicas + plenty of capacity = sounds like a healthy cluster.</p>
</li>
</ul>
<p>So by keeping an eye on these, you make data-driven decisions: when to scale, when to optimize, when to tweak configs.</p>
<h3 id="heading-step-6-creating-a-little-load-on-the-cockroachdb-cluster">Step 6: Creating a Little Load on the CockroachDB Cluster</h3>
<p>So far, we’ve explored the CockroachDB dashboard and understood what each metric means. Now, let’s make things a bit more fun. 🎉</p>
<p>In this part, we’ll run a simple Python app that connects to our CockroachDB cluster and performs a few database operations (creating, updating, deleting, and retrieving some records). This will help us generate a small load on the database so we can actually see the metrics in action.</p>
<p>Here’s what we’ll be doing step-by-step 👇🏾</p>
<h4 id="heading-step-61-create-a-configmap-for-our-books-data">Step 6.1: Create a ConfigMap for Our Books Data</h4>
<p>We’ll first create a list of 20 books that our Python script will interact with. Each book will have basic info like name, author, genre, pages, and price.</p>
<ol>
<li><p>Create a new file called <code>books.json</code></p>
<ul>
<li><p>On Linux:</p>
<pre><code class="lang-bash">  nano books.json
</code></pre>
<p>  Paste the below JSON content into it.</p>
<pre><code class="lang-json">  [
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Bright Signal"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Ava Hart"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9783218196000"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2020</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">234</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fantasy"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">10.99</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Hidden Library"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Liam Stone"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9783863794026"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1993</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">358</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Romance"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">30.2</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Shadow Archive"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Maya Chen"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9781615594078"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2001</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">404</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"History"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">16.21</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Bright Voyage"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Noah Rivers"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9785931034133"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1987</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">507</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fantasy"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">13.14</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Shadow Garden"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Zara Malik"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9785534192834"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2004</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">404</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Sci-Fi"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">28.13</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Crystal Signal"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Ethan Brooks"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9785030564135"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2009</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">508</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Self-Help"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">20.79</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Atomic Atlas"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Iris Park"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9787242388493"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2025</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">442</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Romance"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">18.5</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The First Library"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Caleb Nguyen"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9787101226911"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2017</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">528</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Romance"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">24.47</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Crystal River"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Sofia Diaz"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9781845146276"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2004</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">599</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fiction"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">31.15</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Crystal Archive"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Jude Bennett"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9784893252883"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1996</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">632</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fiction"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">40.47</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Last Compass"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Nina Volkova"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9784303911713"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2018</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">451</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"History"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">29.53</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Crystal Garden"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Omar Haddad"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9784896383461"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1988</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">251</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Thriller"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">36.38</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Silent Signal"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Priya Kapoor"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9781509839308"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2008</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">649</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fantasy"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">28.05</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Hidden Compass"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Felix Romero"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9781834738291"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2025</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">180</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Self-Help"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">19.15</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Lost Signal"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Tara Quinn"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9781165667017"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2010</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">368</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fiction"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">41.37</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Last Signal"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Hana Sato"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9783387262476"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2005</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">467</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Nonfiction"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">42.01</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Crystal Archive"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Leo Fischer"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9780801326776"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1984</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">573</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Nonfiction"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">42.31</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Hidden Atlas"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Mila Novak"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9784746872343"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2005</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">180</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Nonfiction"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">16.58</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Hidden Compass"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Arthur Wells"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9780097882086"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1983</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">713</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fantasy"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">39.42</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Silent Atlas"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Selene Ortiz"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9781939909169"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1991</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">190</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Self-Help"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">33.79</span>
    }
  ]
</code></pre>
<p>  To save and close the file in nano:</p>
<ul>
<li><p>Press <code>CTRL + O</code> → then <code>ENTER</code> (to save)</p>
</li>
<li><p>Press <code>CTRL + X</code> (to exit the editor)</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Then create a ConfigMap from the file:</p>
<pre><code class="lang-bash"> kubectl create configmap books-json --from-file=books.json
</code></pre>
</li>
</ol>
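<p>Before baking the file into a ConfigMap, it can be worth sanity-checking that it's valid JSON — a malformed file will only surface as an error later, inside the job. A minimal stdlib-only sketch (the helper name and usage path are our own, not part of the tutorial's manifests):</p>

```python
import json

def validate_books(text: str) -> int:
    """Return the number of book records if `text` is a well-formed JSON array."""
    books = json.loads(text)  # raises json.JSONDecodeError on malformed input
    if not isinstance(books, list):
        raise ValueError("expected a top-level JSON array of books")
    return len(books)

# Usage (path is an assumption — point it at wherever you saved books.json):
# with open("books.json") as f:
#     print(validate_books(f.read()))
```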
<h4 id="heading-step-62-create-the-python-script-configmap">Step 6.2: Create the Python Script ConfigMap</h4>
<p>Next, we’ll create a simple Python script that:</p>
<ul>
<li><p>Creates a new table for books</p>
</li>
<li><p>Inserts 20 records</p>
</li>
<li><p>Updates 7 of them</p>
</li>
<li><p>Deletes 5</p>
</li>
<li><p>Retrieves 15 books from the database</p>
</li>
</ul>
<p>It’s like simulating a small library app. 📚</p>
<p>Create a new file called <code>books-script.yml</code> and paste the content below:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ConfigMap</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">books-script</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-attr">run.py:</span> <span class="hljs-string">|
    #!/usr/bin/env python3
    import argparse
    import json
    import os
    import sys
    import time
    from typing import List, Dict
</span>
    <span class="hljs-string">import</span> <span class="hljs-string">psycopg</span>
    <span class="hljs-string">from</span> <span class="hljs-string">psycopg.rows</span> <span class="hljs-string">import</span> <span class="hljs-string">dict_row</span>

    <span class="hljs-string">DDL</span> <span class="hljs-string">=</span> <span class="hljs-string">""</span><span class="hljs-string">"
    CREATE TABLE IF NOT EXISTS books (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        name STRING NOT NULL,
        author STRING NOT NULL,
        isbn STRING UNIQUE,
        published_year INT4,
        pages INT4,
        genre STRING,
        price DECIMAL(10,2),
        created_at TIMESTAMPTZ NOT NULL DEFAULT now()
    );
    "</span><span class="hljs-string">""</span>

    <span class="hljs-string">INSERT_SQL</span> <span class="hljs-string">=</span> <span class="hljs-string">""</span><span class="hljs-string">"
    INSERT INTO books (name, author, isbn, published_year, pages, genre, price)
    VALUES (%s, %s, %s, %s, %s, %s, %s);
    "</span><span class="hljs-string">""</span>

    <span class="hljs-string">UPDATE_SQL</span> <span class="hljs-string">=</span> <span class="hljs-string">""</span><span class="hljs-string">"
    UPDATE books
    SET price = %s, pages = %s
    WHERE isbn = %s;
    "</span><span class="hljs-string">""</span>

    <span class="hljs-string">DELETE_SQL</span> <span class="hljs-string">=</span> <span class="hljs-string">""</span><span class="hljs-string">"
    DELETE FROM books
    WHERE isbn = %s;
    "</span><span class="hljs-string">""</span>

    <span class="hljs-string">GET_SQL</span> <span class="hljs-string">=</span> <span class="hljs-string">""</span><span class="hljs-string">"
    SELECT id, name, author, isbn, published_year, pages, genre, price, created_at
    FROM books
    WHERE isbn = %s;
    "</span><span class="hljs-string">""</span>

    <span class="hljs-string">def</span> <span class="hljs-string">load_books(path:</span> <span class="hljs-string">str)</span> <span class="hljs-string">-&gt;</span> <span class="hljs-string">List[Dict]:</span>
        <span class="hljs-string">with</span> <span class="hljs-string">open(path,</span> <span class="hljs-string">"r"</span><span class="hljs-string">)</span> <span class="hljs-attr">as f:</span>
            <span class="hljs-string">return</span> <span class="hljs-string">json.load(f)</span>

    <span class="hljs-string">def</span> <span class="hljs-string">connect_with_retry(dsn:</span> <span class="hljs-string">str,</span> <span class="hljs-attr">attempts:</span> <span class="hljs-string">int</span> <span class="hljs-string">=</span> <span class="hljs-number">30</span><span class="hljs-string">,</span> <span class="hljs-attr">delay:</span> <span class="hljs-string">float</span> <span class="hljs-string">=</span> <span class="hljs-number">2.0</span><span class="hljs-string">):</span>
        <span class="hljs-string">last_exc</span> <span class="hljs-string">=</span> <span class="hljs-string">None</span>
        <span class="hljs-string">for</span> <span class="hljs-string">_</span> <span class="hljs-string">in</span> <span class="hljs-string">range(attempts):</span>
            <span class="hljs-attr">try:</span>
                <span class="hljs-string">conn</span> <span class="hljs-string">=</span> <span class="hljs-string">psycopg.connect(dsn,</span> <span class="hljs-string">autocommit=False)</span>
                <span class="hljs-string">return</span> <span class="hljs-string">conn</span>
            <span class="hljs-attr">except Exception as e:</span>
                <span class="hljs-string">last_exc</span> <span class="hljs-string">=</span> <span class="hljs-string">e</span>
                <span class="hljs-string">time.sleep(delay)</span>
        <span class="hljs-string">raise</span> <span class="hljs-string">last_exc</span>

    <span class="hljs-string">def</span> <span class="hljs-string">main():</span>
        <span class="hljs-string">ap</span> <span class="hljs-string">=</span> <span class="hljs-string">argparse.ArgumentParser()</span>
        <span class="hljs-string">ap.add_argument("--dsn",</span> <span class="hljs-string">required=True,</span> <span class="hljs-string">help="Postgres/CockroachDB</span> <span class="hljs-string">DSN")</span>
        <span class="hljs-string">ap.add_argument("--json",</span> <span class="hljs-string">default="/app/books.json",</span> <span class="hljs-string">help="Path</span> <span class="hljs-string">to</span> <span class="hljs-string">books</span> <span class="hljs-string">JSON")</span>
        <span class="hljs-string">args</span> <span class="hljs-string">=</span> <span class="hljs-string">ap.parse_args()</span>

        <span class="hljs-string">books</span> <span class="hljs-string">=</span> <span class="hljs-string">load_books(args.json)</span>
        <span class="hljs-string">print(f"Loaded</span> {<span class="hljs-string">len(books)</span>} <span class="hljs-string">books")</span>

        <span class="hljs-string">conn</span> <span class="hljs-string">=</span> <span class="hljs-string">connect_with_retry(args.dsn)</span>
        <span class="hljs-string">conn.row_factory</span> <span class="hljs-string">=</span> <span class="hljs-string">dict_row</span>
        <span class="hljs-attr">try:</span>
            <span class="hljs-attr">with conn:</span>
                <span class="hljs-string">with</span> <span class="hljs-string">conn.cursor()</span> <span class="hljs-attr">as cur:</span>
                    <span class="hljs-string">print("Creating</span> <span class="hljs-string">table...")</span>
                    <span class="hljs-string">cur.execute(DDL)</span>

                    <span class="hljs-string">print("Inserting</span> <span class="hljs-number">20</span> <span class="hljs-string">books...")</span>
                    <span class="hljs-string">for</span> <span class="hljs-string">b</span> <span class="hljs-string">in</span> <span class="hljs-string">books[:20]:</span>
                        <span class="hljs-string">cur.execute(INSERT_SQL,</span> <span class="hljs-string">(</span>
                            <span class="hljs-string">b["name"],</span> <span class="hljs-string">b["author"],</span> <span class="hljs-string">b["isbn"],</span>
                            <span class="hljs-string">b.get("published_year"),</span> <span class="hljs-string">b.get("pages"),</span>
                            <span class="hljs-string">b.get("genre"),</span> <span class="hljs-string">b.get("price"),</span>
                        <span class="hljs-string">))</span>

                    <span class="hljs-string">print("Updating</span> <span class="hljs-number">7</span> <span class="hljs-string">books...")</span>
                    <span class="hljs-string">for</span> <span class="hljs-string">b</span> <span class="hljs-string">in</span> <span class="hljs-string">books[:7]:</span>
                        <span class="hljs-string">new_price</span> <span class="hljs-string">=</span> <span class="hljs-string">round(float(b.get("price",</span> <span class="hljs-number">10</span><span class="hljs-string">))</span> <span class="hljs-string">+</span> <span class="hljs-number">1.23</span><span class="hljs-string">,</span> <span class="hljs-number">2</span><span class="hljs-string">)</span>
                        <span class="hljs-string">new_pages</span> <span class="hljs-string">=</span> <span class="hljs-string">int(b.get("pages",</span> <span class="hljs-number">100</span><span class="hljs-string">))</span> <span class="hljs-string">+</span> <span class="hljs-number">5</span>
                        <span class="hljs-string">cur.execute(UPDATE_SQL,</span> <span class="hljs-string">(new_price,</span> <span class="hljs-string">new_pages,</span> <span class="hljs-string">b["isbn"]))</span>

                    <span class="hljs-string">print("Deleting</span> <span class="hljs-number">5</span> <span class="hljs-string">books...")</span>
                    <span class="hljs-string">for</span> <span class="hljs-string">b</span> <span class="hljs-string">in</span> <span class="hljs-string">books[-5:]:</span>
                        <span class="hljs-string">cur.execute(DELETE_SQL,</span> <span class="hljs-string">(b["isbn"],))</span>

                    <span class="hljs-string">print("Performing</span> <span class="hljs-number">15</span> <span class="hljs-string">retrievals...")</span>
                    <span class="hljs-string">for</span> <span class="hljs-string">b</span> <span class="hljs-string">in</span> <span class="hljs-string">books[:15]:</span>
                        <span class="hljs-string">cur.execute(GET_SQL,</span> <span class="hljs-string">(b["isbn"],))</span>
                        <span class="hljs-string">row</span> <span class="hljs-string">=</span> <span class="hljs-string">cur.fetchone()</span>
                        <span class="hljs-attr">if row:</span>
                            <span class="hljs-string">print(f"GET</span> {<span class="hljs-string">b</span>[<span class="hljs-string">'isbn'</span>]}<span class="hljs-string">:</span> {<span class="hljs-string">row</span>[<span class="hljs-string">'name'</span>]} <span class="hljs-string">by</span> {<span class="hljs-string">row</span>[<span class="hljs-string">'author'</span>]} <span class="hljs-string">(${row['price']})")</span>
                        <span class="hljs-attr">else:</span>
                            <span class="hljs-string">print(f"GET</span> {<span class="hljs-string">b</span>[<span class="hljs-string">'isbn'</span>]}<span class="hljs-string">:</span> <span class="hljs-string">not</span> <span class="hljs-string">found</span> <span class="hljs-string">(possibly</span> <span class="hljs-string">deleted)")</span>

            <span class="hljs-string">print("All</span> <span class="hljs-string">operations</span> <span class="hljs-string">completed.")</span>
        <span class="hljs-attr">finally:</span>
            <span class="hljs-string">conn.close()</span>

    <span class="hljs-string">if</span> <span class="hljs-string">__name__</span> <span class="hljs-string">==</span> <span class="hljs-attr">"__main__":</span>
        <span class="hljs-string">main()</span>
</code></pre>
<p>This script connects to the CockroachDB cluster, creates a table (if it doesn’t exist), and performs all those operations in sequence.</p>
<p>It runs around 50 SQL queries in total – a mix of <code>INSERT</code>, <code>UPDATE</code>, <code>DELETE</code>, and <code>SELECT</code> statements.</p>
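<p>The count is easy to tally from the steps the script performs:</p>

```python
# Tallying the statements the script issues (counts taken from the steps above):
inserts, updates, deletes, selects = 20, 7, 5, 15
ddl = 1  # the one CREATE TABLE IF NOT EXISTS statement
total = inserts + updates + deletes + selects + ddl
print(total)  # 48 statements — hence "around 50"
```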
<p>Now apply it:</p>
<pre><code class="lang-bash">kubectl apply -f books-script.yml
</code></pre>
<h4 id="heading-step-63-create-the-job-to-run-the-script">Step 6.3: Create the Job to Run the Script</h4>
<p>Next, let’s create a Kubernetes Job that will actually run our Python script inside a container.</p>
<p>Create a file called <code>books-job.yml</code> and paste the manifest below:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">batch/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Job</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">books-job</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">runner</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">python:3.12-slim</span>
          <span class="hljs-attr">env:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">CRDB_DSN</span>
              <span class="hljs-attr">value:</span> <span class="hljs-string">"postgresql://root@crdb-cockroachdb-public:26257/defaultdb?sslmode=disable"</span>
          <span class="hljs-attr">command:</span> [<span class="hljs-string">"bash"</span>, <span class="hljs-string">"-lc"</span>]
          <span class="hljs-attr">args:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              pip install --no-cache-dir "psycopg[binary]&gt;=3.1,&lt;3.3" &amp;&amp; \
              python /app/run.py --dsn "$CRDB_DSN" --json /app/books.json
</span>          <span class="hljs-attr">volumeMounts:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">script</span>
              <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/app/run.py</span>
              <span class="hljs-attr">subPath:</span> <span class="hljs-string">run.py</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">books</span>
              <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/app/books.json</span>
              <span class="hljs-attr">subPath:</span> <span class="hljs-string">books.json</span>
      <span class="hljs-attr">volumes:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">script</span>
          <span class="hljs-attr">configMap:</span>
            <span class="hljs-attr">name:</span> <span class="hljs-string">books-script</span>
            <span class="hljs-attr">defaultMode:</span> <span class="hljs-number">0555</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">books</span>
          <span class="hljs-attr">configMap:</span>
            <span class="hljs-attr">name:</span> <span class="hljs-string">books-json</span>
</code></pre>
<p>Here’s what’s happening:</p>
<ul>
<li><p>The Job runs a container based on Python 3.12-slim.</p>
</li>
<li><p>It connects to CockroachDB using the connection string <code>postgresql://root@crdb-cockroachdb-public:26257/defaultdb?sslmode=disable</code>. Notice the <code>sslmode=disable</code> parameter: it’s there because we disabled TLS in our Helm values earlier.</p>
</li>
<li><p>The Job mounts the two ConfigMaps we created earlier (<code>books-json</code> and <code>books-script</code>) as <strong>volumes</strong> inside the container. Think of volumes like small external drives that the container can read from.</p>
</li>
</ul>
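<p>If that DSN format is new to you, its pieces map onto standard URL components. A quick sketch using Python's stdlib to pull them apart:</p>

```python
from urllib.parse import urlparse, parse_qs

# The connection string our Job passes to the script:
dsn = "postgresql://root@crdb-cockroachdb-public:26257/defaultdb?sslmode=disable"
u = urlparse(dsn)

print(u.username)          # root — the SQL user
print(u.hostname)          # crdb-cockroachdb-public — the Kubernetes Service name
print(u.port)              # 26257 — CockroachDB's SQL port
print(u.path.lstrip("/"))  # defaultdb — the target database
print(parse_qs(u.query)["sslmode"][0])  # disable — because TLS is off
```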
<p>Apply it:</p>
<pre><code class="lang-bash">kubectl apply -f books-job.yml
</code></pre>
<h4 id="heading-step-64-check-if-the-job-ran-successfully">Step 6.4: Check if the Job Ran Successfully</h4>
<p>After a minute or two, check your pods:</p>
<pre><code class="lang-bash">kubectl get po
</code></pre>
<p>If you see <code>books-job-xxx</code> with the status <strong>Completed</strong>, then your script ran successfully 🎉</p>
<p>That means our database just got a nice little workout – some records were created, updated, deleted, and read.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761460118429/99ed49a3-52e9-4357-ba2b-9295f0dfbdc8.png" alt="The Completed state of the Books Job" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-7-viewing-the-metrics-from-the-load">Step 7: Viewing the Metrics from the Load</h3>
<p>Now that we’ve generated a small load, let’s jump back to the CockroachDB dashboard.</p>
<p>Head to the Metrics section, and under SQL Queries Per Second, you should see a little spike: this shows the activity from our Python job.👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761460175366/6c1e129e-c8bd-4f41-89de-60a1a753026e.png" alt="The SQL Queries Per Second Metric" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Hover your mouse over the graph lines to see exact numbers.</p>
<p>Do the same for Service Latency: SQL Statements (99th percentile). You’ll notice a few bumps showing how long some of the queries took.👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761460224971/8ba9d5ed-0724-4dc6-82f4-7e5d0d05be82.png" alt="The Service Latency Metric" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This small experiment gives you a real feel for how CockroachDB reacts under activity, even a tiny one.</p>
<p>To explore more metrics and dashboards, check out the <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/ui-overview-dashboard">official CockroachDB documentation here</a>.</p>
<h3 id="heading-step-8-view-the-list-of-created-items-in-the-database">Step 8: View the List of Created Items in the Database</h3>
<p>Now that our Python job ran and touched the database (creating, updating, deleting, retrieving records), let’s check the content of our <code>books</code> table just to verify everything really happened.</p>
<p>First, we’ll create another Kubernetes job (or pod) that connects to our CockroachDB cluster and runs a simple SQL query <code>SELECT * FROM books;</code>. This pulls out all the remaining records in the table.</p>
<p>Here’s the manifest to use. Create a file named <code>view-books.yml</code> and paste the below content inside it:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">batch/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Job</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">view-books</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">client</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">cockroachdb/cockroach:v25.3.2</span>
          <span class="hljs-attr">command:</span> [<span class="hljs-string">"bash"</span>, <span class="hljs-string">"-lc"</span>]
          <span class="hljs-attr">args:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              cockroach sql \
                --insecure \
                --host=crdb-cockroachdb-public:26257 \
                --database=defaultdb \
                --format=records \
                --execute="SELECT * FROM public.books;"</span>
</code></pre>
<p>Note: we pass the <code>--insecure</code> flag (the CLI equivalent of <code>sslmode=disable</code>) because we turned off TLS in our Minikube config. This job mounts nothing fancy. It just spins up, connects to the database, runs the <code>SELECT</code>, and displays the result.</p>
<p>Run the job:</p>
<pre><code class="lang-bash">kubectl apply -f view-books.yml
</code></pre>
<p>Wait a minute, then check the pod status:</p>
<pre><code class="lang-bash">kubectl get po
</code></pre>
<p>Look for something like <code>view-books-xxx</code> in the <strong>Completed</strong> state.</p>
<p>Finally, view the job logs to see the actual records:</p>
<pre><code class="lang-bash">kubectl logs job/view-books
</code></pre>
<p>You’ll see output similar to the below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761462270132/c881eca7-18b0-4647-a6b1-2841e7774969.png" alt="The list of created books in the books table in the CockroachDB database" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-backing-up-cockroachdb-to-google-cloud-storage">Backing Up CockroachDB to Google Cloud Storage ☁️</h2>
<p>In this section, we’ll explain how you can automate backups of your CockroachDB cluster using simple SQL commands, a service account (for authenticating to Google Cloud), and a Google Cloud Storage bucket (where the data will be stored).</p>
<h3 id="heading-why-backups-are-absolutely-critical">Why Backups Are Absolutely Critical</h3>
<p>Imagine you’ve built your cluster on Kubernetes, and everything’s humming along for weeks or months. You’ve got tens or hundreds of gigabytes of data and 10k+ users relying on it.</p>
<p>Then <strong>BAM!</strong> Something happens. Maybe someone accidentally overwrote the Helm release (<code>helm upgrade --install …</code> with the same release name, for example <code>crdb</code>), or a cloud disk got deleted, or a critical node failed and you lost the majority of your data replicas. That’s the nightmare we all dread 😭.</p>
<p>Mistakes happen, even if you’re super careful. What matters most is: How fast and easily could you recover?</p>
<p>That’s why we’ll set up <strong>daily backups</strong> of our CockroachDB cluster, targeting a Google Cloud Storage bucket. (Quick note: Google Cloud Storage is a service where you can store large amounts of data in the cloud as “objects”. You can store and retrieve data from it, just like Google Drive or iCloud. 😃)</p>
<p>With your backups going into a storage bucket, if disaster strikes, you can restore the entire cluster (or specific databases/tables) in minutes or hours – instead of days or losing data forever.</p>
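<p>To preview where we're headed: CockroachDB can schedule those backups itself with a single SQL statement. Here's a hedged sketch of what that statement looks like, composed in Python for clarity — the bucket name is a hypothetical placeholder, and the authentication details are covered later in this section:</p>

```python
# A sketch of CockroachDB's scheduled-backup SQL (you'd run the printed
# statement from any SQL client connected to the cluster).
bucket = "my-crdb-backups"  # assumption: replace with your own GCS bucket name
stmt = (
    f"CREATE SCHEDULE FOR BACKUP INTO 'gs://{bucket}/backups?AUTH=implicit' "
    "RECURRING '@daily';"
)
print(stmt)
```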
<h3 id="heading-connecting-to-our-db-installing-beekeeper-studio">Connecting to Our DB – Installing Beekeeper Studio</h3>
<p>So far, we’ve been connecting to our database programmatically, running commands from pods or jobs inside Kubernetes. But what if there was a <em>more visual</em> and <em>user-friendly</em> way to explore our data?</p>
<p>Well, meet my friend <strong>Beekeeper Studio.</strong> 🙂</p>
<p>Beekeeper Studio is a sleek, open-source database management tool that lets you connect to a wide range of databases like PostgreSQL, MySQL, SQLite, and (most importantly for us) CockroachDB.</p>
<p>It comes with a simple, modern interface for running queries, browsing tables, and viewing data – no need to jump into pods or remember command-line flags 😄</p>
<h3 id="heading-how-to-install-beekeeper-studio">How to Install Beekeeper Studio</h3>
<ol>
<li><p>Visit the official Beekeeper Studio download page here: <a target="_blank" href="https://www.beekeeperstudio.io/get">https://www.beekeeperstudio.io/get</a></p>
</li>
<li><p>Click the “Skip to the download” link. You’ll see something like this:</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761542821015/2e7a0fd5-7047-4090-97fb-46b81a3dd638.png" alt="Finding the button to skip to the download page on the Beekeeper Studio website" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p>You’ll be redirected to a page listing download options for different operating systems.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761542877590/6034dcf0-d9b0-447b-bd2b-089458729db7.png" alt="Page to select download option according to the user OS" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p>Choose your OS and download the correct installer.</p>
</li>
<li><p>Afterwards, install the downloaded Beekeeper Studio package following the instructions for your OS.</p>
</li>
</ol>
<h3 id="heading-connecting-beekeeper-studio-to-cockroachdb">Connecting Beekeeper Studio to CockroachDB</h3>
<p>Now that we’ve installed Beekeeper Studio, it’s time to connect it to our CockroachDB cluster running inside Minikube.</p>
<p>But before we jump in, here’s something important to note:👇🏾</p>
<p>Our CockroachDB cluster is running INSIDE Kubernetes, and by default, it’s not accessible from outside the cluster.</p>
<p>To confirm this, run:</p>
<pre><code class="lang-bash">kubectl get svc crdb-cockroachdb-public
</code></pre>
<p>You should see something like this 👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761544640270/2cf9f8f1-15f1-459b-acd0-63b1c361fa54.png" alt="The CockroachDB service being of type ClusterIP" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Notice the <strong>TYPE</strong> column showing <code>ClusterIP</code>. That means the service can only be accessed by other pods INSIDE the Minikube cluster – not from your laptop or external apps.</p>
<h3 id="heading-exposing-the-cluster-for-local-access">Exposing the Cluster for Local Access</h3>
<p>To make our database accessible from your local machine (so Beekeeper Studio can reach it), we’ll use <strong>Kubernetes Port Forwarding</strong>.</p>
<p>In a new terminal tab, run:</p>
<pre><code class="lang-bash">kubectl port-forward svc/crdb-cockroachdb-public 26257
</code></pre>
<p>This command tells Kubernetes to forward your local port 26257 to the CockroachDB service’s port 26257 inside the cluster.</p>
<p>Once it’s running, your CockroachDB instance will now be accessible from <a target="_blank" href="http://localhost:26257"><code>localhost:26257</code></a>.<br>(Note: it’s not accessible via your browser because this isn’t an HTTP endpoint 😅)</p>
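<p>Before pointing a GUI at it, you can confirm the forwarded port is actually open from your machine. A small stdlib check (the helper is our own; it assumes the port-forward from above is still running in another terminal):</p>

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With `kubectl port-forward` running, this should report True:
print(port_open("localhost", 26257))
```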
<h3 id="heading-connecting-via-beekeeper-studio">🐝 Connecting via Beekeeper Studio</h3>
<ol>
<li><p>Open Beekeeper Studio.</p>
</li>
<li><p>Click on the dropdown that says “Select a connection type…”.</p>
</li>
<li><p>Choose CockroachDB from the list.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761544886889/98443b46-574d-4bcc-a41c-d2daa7412201.png" alt="Selecting CockroachDB as a connection type in Beekeeper Studio" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p>In the connection window that pops up:</p>
<ul>
<li><p>Disable the <code>Enable SSL</code> option.</p>
</li>
<li><p>Set User to <code>root</code></p>
</li>
<li><p>Set Default Database to <code>defaultdb</code></p>
</li>
<li><p>Host to <a target="_blank" href="http://localhost"><code>localhost</code></a></p>
</li>
<li><p>Port to <code>26257</code></p>
</li>
</ul>
</li>
<li><p>Now click <strong>Test</strong> (bottom right corner). You should see a success message like <em>Connection looks good</em>.</p>
</li>
</ol>
<p>Your setup should look like this:👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761544818021/0248173e-9969-433c-a9d4-e83684bf34cf.png" alt="Connecting to the CockroachDB cluster from the Beekeeper Studio software" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Finally, click Connect (right beside the Test button).</p>
<h3 id="heading-verify-the-connection">Verify the Connection</h3>
<p>Once connected, you’ll land on a clean workspace where you can run SQL queries.</p>
<p>To confirm you’re connected to the right cluster, run:</p>
<pre><code class="lang-sql">SELECT * FROM books;
</code></pre>
<p>You should see a table containing about 15 books (the same ones we inserted earlier):</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761545094817/99ef4415-bd0d-4452-817f-380996485397.png" alt="List of books in the CockroachDB database" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>And there you go. You’ve now connected Beekeeper Studio to your CockroachDB running inside Minikube! 🚀</p>
<h3 id="heading-creating-a-google-cloud-account">Creating a Google Cloud Account</h3>
<p>Before we can back up our CockroachDB data to Google Cloud Storage, we need to have a Google Cloud account ready.</p>
<h4 id="heading-step-1-visit-the-google-cloud-console">Step 1: Visit the Google Cloud Console</h4>
<p>Head over to 👉🏾 <a target="_blank" href="https://console.cloud.google.com">https://console.cloud.google.com</a></p>
<p>If you don’t have a Google account yet, don’t worry. The process is simple and self-explanatory once you visit the site :). You’ll be guided to create a Google account first, and then your Google Cloud account.</p>
<h4 id="heading-step-2-create-or-use-a-project">Step 2: Create or Use a Project</h4>
<p>Once you’re in the Google Cloud Console, you’ll either:</p>
<ul>
<li><p>Use the <strong>default project</strong> that was automatically created for you, <strong>or</strong></p>
</li>
<li><p>Create a new one by clicking on <strong>“New Project”</strong> and naming it <code>crdb-tutorial</code>.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761546797213/295c7b09-9bb8-4c34-85cf-8701242b2768.png" alt="Creating a new Project in our Google Cloud account" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Projects are like folders that contain all your Google Cloud resources: compute instances, storage buckets, databases, and more.</p>
<h4 id="heading-step-3-link-a-billing-account-optional-but-recommended">Step 3: Link a Billing Account (Optional but Recommended)</h4>
<p>If you already have a billing account, link it to your project.</p>
<p>If not, you can easily create one by <a target="_blank" href="https://docs.cloud.google.com/billing/docs/how-to/create-billing-account">following Google’s instructions here</a>. (You’ll need a valid Debit or Credit card.)</p>
<p>Don’t worry if your card doesn’t link right away. Sometimes Google’s billing system can be picky. 😅</p>
<p>Here’s a quick fix that usually works:</p>
<ol>
<li><p>Add your card to Google Pay first.</p>
</li>
<li><p>Then go to Google Subscriptions in your Google account, and link it to your Google Billing Account.</p>
</li>
</ol>
<p>To add your card via Google Subscriptions, <a target="_blank" href="https://myaccount.google.com/payments-and-subscriptions">visit here</a>. (You need to have a Google account first. Don’t worry, the site will direct you on what to do if you don’t.)</p>
<p>You’ll see a page like this:👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761546938934/9e983134-dd7e-49b1-85a7-cd12bd01bf67.png" alt="Adding a card to Google Subscriptions" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Click Manage payment methods, then add your card details.</p>
<p>Once you’ve done that, refresh your Google Billing Account page – you should now see your card as one of the available options.</p>
<h3 id="heading-creating-a-google-cloud-storage-bucket">Creating a Google Cloud Storage Bucket</h3>
<p>Now that we’ve set up our Google Cloud account and enabled billing, let’s create a Cloud Storage Bucket. This is simply a location (like an online folder) where our CockroachDB backup files will be stored.</p>
<p>In your Google Cloud console, type “storage” in the search bar at the top. From the dropdown results, click on “Cloud Storage”:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089121918/c737c3e1-e45f-48e1-aed9-99e273583425.png" alt="Navigating to the Cloud Storage page" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>On the new page, click on the “Buckets” link in the side menu, then click the “Create Bucket” button.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089164660/8b9336fc-c0c3-4811-ab98-d3538596ee5a.png" alt="Creating a new Bucket in Cloud Storage" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Give your bucket a unique name by appending a few random characters, for example <em>cockroachdb-backup-i8wu</em> or <em>cockroachdb-backup-7gw8u</em>. The random suffix ensures your bucket name is globally unique (no other Google Cloud user will have the same name).</p>
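<p>If you’d rather not invent the random characters yourself, a quick shell snippet can generate them. (This is just an illustrative sketch – the suffix length and character set here are arbitrary; any lowercase-letters-and-digits suffix works for bucket names.)</p>

```shell
# Build a bucket name with a random 4-character lowercase suffix.
# The prefix "cockroachdb-backup-" matches the naming used in this tutorial.
SUFFIX=$(LC_ALL=C tr -dc 'a-z0-9' </dev/urandom | head -c 4)
BUCKET_NAME="cockroachdb-backup-${SUFFIX}"
echo "$BUCKET_NAME"
```

<p>Copy the printed name into the bucket creation form (or keep it in your shell session for the later backup commands).</p>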
<p>Scroll to the bottom and click “Create” to create your bucket.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089287083/a376f695-81b8-4f5a-80a7-cd563c8b4c81.png" alt="Creating your Bucket in Google Cloud Storage" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You’ll see a pop-up asking you to <strong>confirm public access prevention</strong>. This means that only you (and people you explicitly give access to) can view or edit your bucket. Make sure the “Enforce public access prevention on this bucket” checkbox is checked, then click “Confirm.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089404876/38c8e6b5-0de0-4771-9bed-9334f8f8c43a.png" alt="Preventing random users from accessing your bucket" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Perfect! 🎉 You’ve now created a storage bucket where your CockroachDB backups will live.</p>
<h3 id="heading-giving-cockroachdb-access-to-the-bucket">Giving CockroachDB Access to the Bucket</h3>
<p>Our next goal is to let the CockroachDB cluster upload and read files from this bucket. To do this, we’ll create something called a <strong>Service Account</strong> using <strong>Google IAM</strong>.</p>
<p><strong>What’s IAM?</strong><br>IAM stands for <em>Identity and Access Management.</em> It’s basically Google Cloud’s way of managing who can access what in your project.</p>
<p>With IAM, we can create a service account (like a “digital employee”) and give it permission to interact with our bucket instead of using our personal Google account.</p>
<h4 id="heading-creating-a-service-account">Creating a Service Account</h4>
<p>Type “service account” in the search bar and click on “Service Accounts” in the results.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089569066/2855b7fa-d896-4249-825d-4ec590499ca8.png" alt="Navigating the Service Accounts page" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Click “Create Service Account” at the top of the page. On the new page, type <em>cockroachdb-backup</em> as the service account name, then click “Create and Continue.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089677768/05c9f9ed-257f-44c6-89b5-3880c8af017d.png" alt="Creating a new Service Account for the CockroachDB cluster, to give it access to our Cloud Storage Bucket" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now we’ll give this service account permission to work with our storage bucket. In the <em>Permissions</em> section, type “storage object creator” in the filter box and select it from the dropdown.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089744927/64ed65df-88ee-43c9-8be4-892a41a24989.png" alt="Providing our Service Account with the necessary permissions to access the bucket" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Repeat the same for “storage object viewer” and “storage object user”.</p>
<p>At the end, you should see three roles assigned:</p>
<ul>
<li><p>Storage Object Creator</p>
</li>
<li><p>Storage Object Viewer</p>
</li>
<li><p>Storage Object User</p>
</li>
</ul>
<p>Click “Continue”, then “Done.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762092953125/0419abe8-a1ff-4f1c-b367-f9e203bdf6ff.png" alt="The necessary permissions to be assigned to the Service Account" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You’ve now created a service account that can create and read files in your bucket.</p>
<h4 id="heading-downloading-the-service-account-key">Downloading the Service Account Key</h4>
<p>To let our CockroachDB cluster use this service account, we’ll generate a <strong>key file</strong>.</p>
<p><strong>What’s a key file?</strong><br>It’s just a small <strong>JSON file</strong> containing secret information your app (CockroachDB) can use to authenticate securely with Google Cloud – like an ID card.</p>
<p><strong>But be careful ⚠️</strong> If this key gets into the wrong hands, anyone could use it to access your Google Cloud resources. <strong>Never share or upload this file</strong> to your GitHub, BitBucket, or GitLab repository, or any other online repositories.</p>
<p>On the Service Accounts page, find your <code>cockroachdb-backup</code> account, click the three dots (⋮) under the Actions column, then select “Manage Keys.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762090008411/11c4b373-87b0-416d-bf14-1a9ccd15c452.png" alt="Finding the newly created service account, and creating a key" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>On the new page, click “Add Key” then “Create new key.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762090059309/ebe17228-e2a8-4abe-b41b-7378013570d5.png" alt="Creating a new key for the new service account" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>A dialog box will pop up. Choose JSON as the key type, and click “Create.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762090115728/5ed82664-f57a-4489-af08-be85c2ad42e9.png" alt="Selecting the Key Type as JSON" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Google will automatically download a file named something like <code>cockroachdb-backup-1234567890abcdef.json</code>.</p>
<p>We’ll use this key soon when we configure our CockroachDB backup job.</p>
<h3 id="heading-attaching-the-key-to-our-cockroachdb-cluster">Attaching the Key to Our CockroachDB Cluster</h3>
<p>Now that we’ve downloaded the service account key, we need to attach it to our CockroachDB cluster so that the DB can upload and read backups from our Google Cloud Storage bucket.</p>
<p><strong>Why this is needed:</strong><br>Our Minikube cluster (and even any managed Kubernetes cluster like GKE, EKS, or AKS) <strong>doesn’t have direct access</strong> to the files on your computer. So, we’ll upload the key file to Kubernetes as a Secret, and then mount it inside our CockroachDB pods as a volume.</p>
<h4 id="heading-step-1-create-a-kubernetes-secret">Step 1: Create a Kubernetes Secret</h4>
<p>Run the command below in your terminal👇🏾 Replace <code>&lt;PATH_TO_KEY&gt;</code> with the path to your downloaded key file:</p>
<pre><code class="lang-bash">kubectl create secret generic gcs-key --from-file=key.json=&lt;PATH_TO_KEY&gt;
</code></pre>
<p>This command creates a <strong>Kubernetes Secret</strong> named <code>gcs-key</code> that securely stores your Google Cloud key.</p>
<h4 id="heading-step-2-mount-the-secret-to-the-cockroachdb-cluster">Step 2: Mount the Secret to the CockroachDB Cluster</h4>
<p>Now, let’s tell Kubernetes to use this secret inside our CockroachDB cluster.</p>
<p>Open your <code>cockroachdb-values.yml</code> file and scroll to the <code>statefulset:</code> section. Add the following lines under it:👇🏾</p>
<pre><code class="lang-yaml"><span class="hljs-attr">statefulset:</span>
  <span class="hljs-string">...</span>
  <span class="hljs-attr">env:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">GOOGLE_APPLICATION_CREDENTIALS</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">/var/run/gcp/key.json</span>

  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gcp-sa</span>
      <span class="hljs-attr">secret:</span>
        <span class="hljs-attr">secretName:</span> <span class="hljs-string">gcs-key</span>

  <span class="hljs-attr">volumeMounts:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gcp-sa</span>
      <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/var/run/gcp</span>
      <span class="hljs-attr">readOnly:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>Here’s what this does:</p>
<ul>
<li><p>The <code>volumes</code> section tells Kubernetes to create a volume from the secret we just made.</p>
</li>
<li><p>The <code>volumeMounts</code> section attaches that volume inside the CockroachDB container.</p>
</li>
<li><p>The <code>GOOGLE_APPLICATION_CREDENTIALS</code> environment variable points CockroachDB to our key file so it knows where to find it when connecting to Google Cloud.</p>
</li>
</ul>
<p>Your final file should look like this:👇🏾</p>
<pre><code class="lang-yaml"><span class="hljs-attr">statefulset:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">podSecurityContext:</span>
    <span class="hljs-attr">fsGroup:</span> <span class="hljs-number">1000</span>
    <span class="hljs-attr">runAsUser:</span> <span class="hljs-number">1000</span>
    <span class="hljs-attr">runAsGroup:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"1Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"1Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">podAntiAffinity:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">""</span>
  <span class="hljs-attr">nodeSelector:</span>
    <span class="hljs-attr">kubernetes.io/hostname:</span> <span class="hljs-string">minikube</span>
  <span class="hljs-attr">env:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">GOOGLE_APPLICATION_CREDENTIALS</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">/var/run/gcp/key.json</span>
  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gcp-sa</span>
      <span class="hljs-attr">secret:</span>
        <span class="hljs-attr">secretName:</span> <span class="hljs-string">gcs-key</span>
  <span class="hljs-attr">volumeMounts:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gcp-sa</span>
      <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/var/run/gcp</span>
      <span class="hljs-attr">readOnly:</span> <span class="hljs-literal">true</span>

<span class="hljs-attr">storage:</span>
  <span class="hljs-attr">persistentVolume:</span>
    <span class="hljs-attr">size:</span> <span class="hljs-string">5Gi</span>
    <span class="hljs-attr">storageClass:</span> <span class="hljs-string">standard</span>

<span class="hljs-attr">tls:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>

<span class="hljs-attr">init:</span>
  <span class="hljs-attr">jobs:</span>
    <span class="hljs-attr">wait:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>Now, apply the update using Helm:👇🏾</p>
<pre><code class="lang-bash">helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<h4 id="heading-step-3-confirm-the-key-exists-in-the-cluster">Step 3: Confirm the Key Exists in the Cluster</h4>
<p>Once the upgrade is complete, run this command to confirm the key is now inside your CockroachDB pods:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it crdb-cockroachdb-1 -- cat /var/run/gcp/key.json
</code></pre>
<p>You should see something similar to this:👇🏾</p>
<pre><code class="lang-bash">prince@DESKTOP-QHVTAUD:~/programming/cockroachdb-tutorial$ kubectl <span class="hljs-built_in">exec</span> -it crdb-cockroachdb-1 -- cat /var/run/gcp/key.json
{
  <span class="hljs-string">"type"</span>: <span class="hljs-string">"service_account"</span>,
  <span class="hljs-string">"project_id"</span>: ***,
  <span class="hljs-string">"private_key_id"</span>: ***,
  <span class="hljs-string">"private_key"</span>: ***,
  <span class="hljs-string">"client_email"</span>: ***,
  <span class="hljs-string">"client_id"</span>: ***,
  <span class="hljs-string">"auth_uri"</span>: <span class="hljs-string">"https://accounts.google.com/o/oauth2/auth"</span>,
  <span class="hljs-string">"token_uri"</span>: <span class="hljs-string">"https://oauth2.googleapis.com/token"</span>,
  <span class="hljs-string">"auth_provider_x509_cert_url"</span>: <span class="hljs-string">"https://www.googleapis.com/oauth2/v1/certs"</span>,
  <span class="hljs-string">"client_x509_cert_url"</span>: ***,
  <span class="hljs-string">"universe_domain"</span>: <span class="hljs-string">"googleapis.com"</span>
}
</code></pre>
<p>Nice! That means our cluster now has access to the Google Cloud key.</p>
<h4 id="heading-step-4-creating-the-backup-schedule">Step 4: Creating the Backup Schedule</h4>
<p>CockroachDB makes backups super convenient. It can automatically back up your database <strong>on a schedule</strong> (without you needing to manually create Kubernetes CronJobs).</p>
<p>To create an automatic backup schedule, run this SQL command inside the CockroachDB SQL shell 👇🏾(Replace the BUCKET_NAME placeholder with the name of your Google Cloud Storage bucket):</p>
<pre><code class="lang-sql">CREATE SCHEDULE backup_cluster
FOR BACKUP INTO 'gs://&lt;BUCKET_NAME&gt;/cluster?AUTH=implicit'
WITH revision_history
RECURRING '@hourly'
FULL BACKUP '@daily'
WITH SCHEDULE OPTIONS first_run = 'now';
</code></pre>
<p>Here’s what each part means:</p>
<ul>
<li><p><code>AUTH=implicit</code> tells CockroachDB to use the Google key we mounted (<code>GOOGLE_APPLICATION_CREDENTIALS</code>) for authentication.</p>
</li>
<li><p><code>FULL BACKUP '@daily'</code> creates a complete backup of the entire database every day.</p>
</li>
<li><p><code>RECURRING '@hourly'</code> creates smaller, incremental backups every hour, capturing just the changes since the last backup.</p>
</li>
<li><p><code>WITH SCHEDULE OPTIONS first_run = 'now'</code> starts the first backup immediately after running the command.</p>
</li>
</ul>
<p>After running it, CockroachDB will return two rows:</p>
<ul>
<li><p>The first is for the <strong>recurring incremental backup</strong> (hourly updates)</p>
</li>
<li><p>The second is for the <strong>full backup</strong> (daily snapshot)</p>
</li>
</ul>
<p>You can read more about full and incremental backups in the official docs here 👉🏾<a target="_blank" href="https://www.cockroachlabs.com/docs/stable/take-full-and-incremental-backups">CockroachDB Backups Guide</a>.</p>
<h4 id="heading-step-5-checking-backup-status">Step 5: Checking Backup Status</h4>
<p>To see the status of your backups, copy the <strong>Job ID</strong> from the second row (the <code>id</code> column) and run this command:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762103549260/742fc309-9c4d-4967-9436-91539851a9b9.png" alt="The job ID to copy" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-sql">SHOW JOBS FOR SCHEDULE &lt;YOUR_JOB_ID&gt;;
</code></pre>
<p>Replace <code>&lt;YOUR_JOB_ID&gt;</code> with the ID you copied.</p>
<p>You’ll see output similar to this:👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762103606748/8627d561-0b54-4e6d-9109-ba7e1c7a85c3.png" alt="Getting the status of the backup job" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now, do the same for the recurring backup job (the ID in the first row of the previous result).</p>
<p>If both statuses show <code>succeeded</code>, that means your full and recurring backups worked perfectly! If either is still running, just give it a few minutes – backups can take a bit of time :)</p>
<h3 id="heading-testing-our-backup-disaster-recovery-time">Testing Our Backup — Disaster Recovery Time</h3>
<p>Woohoo! We’ve successfully created a backup of our CockroachDB cluster to Google Cloud Storage. That’s a huge milestone. But let’s be honest: how can we be <em>sure</em> it works if we’ve never tried restoring it?</p>
<p>So, in true brave-developer fashion, we’re going to do the unthinkable: <strong>destroy our entire database</strong>...yes, everything! 😬</p>
<p>Why would we do that?! Because in real life, disasters happen. A node crashes, data gets wiped, or an upgrade goes sideways. The question is: <em>Can we recover?</em> Let’s find out.</p>
<h4 id="heading-step-1-uninstall-the-helm-chart">Step 1: Uninstall the Helm Chart</h4>
<p>First, let’s remove the CockroachDB Helm release. This deletes the cluster resources like StatefulSets, pods, and secrets:</p>
<pre><code class="lang-bash">helm uninstall crdb
</code></pre>
<p>This removes the running cluster, but <strong>not the actual data</strong>, which is stored on Persistent Volumes (PVs).</p>
<h4 id="heading-step-2-delete-persistent-volume-claims-pvcs">Step 2: Delete Persistent Volume Claims (PVCs)</h4>
<p>Each CockroachDB node stores its data in a <strong>Persistent Volume Claim</strong> (PVC). These PVCs remain even after uninstalling the Helm release, so let’s manually delete them:</p>
<pre><code class="lang-bash">kubectl delete pvc datadir-crdb-cockroachdb-0
kubectl delete pvc datadir-crdb-cockroachdb-1
kubectl delete pvc datadir-crdb-cockroachdb-2
</code></pre>
<h4 id="heading-step-3-delete-the-persistent-volumes-pvs">Step 3: Delete the Persistent Volumes (PVs)</h4>
<p>Next, list all the Persistent Volumes:</p>
<pre><code class="lang-bash">kubectl get pv
</code></pre>
<p>You’ll see a list of volumes similar to this 👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762107818554/01defffd-543b-486a-aa19-4bbf6f768270.png" alt="List existing Persistent Volumes for CockroachDB" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Look for the PVs that are <strong>bound to the PVCs</strong> you just deleted. Then delete them manually using:</p>
<pre><code class="lang-bash">kubectl delete pv &lt;PV_NAME&gt;
</code></pre>
<p>At this point, you’ve completely wiped out your database like it never existed 🥲. Don’t worry: this is all part of the plan.</p>
<h4 id="heading-step-4-reinstall-the-cluster">Step 4: Reinstall the Cluster</h4>
<p>Let’s bring CockroachDB back to life (an empty one for now):</p>
<pre><code class="lang-bash">helm install crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<p>Once the installation is done, expose the cluster locally again:</p>
<pre><code class="lang-bash">kubectl port-forward svc/crdb-cockroachdb-public 26257
</code></pre>
<h4 id="heading-step-5-check-whats-left">Step 5: Check What’s Left</h4>
<p>Connect Beekeeper Studio to your DB (if you aren’t still connected), and try running the query below:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>You’ll get an error saying the <code>books</code> table doesn’t exist, because this is a <em>brand new</em> database.</p>
<h4 id="heading-step-6-restore-from-google-cloud-storage">Step 6: Restore from Google Cloud Storage</h4>
<p>Now for the magic part, let’s bring our data back from the backup we created earlier 😃!</p>
<p>Run this query on the new cluster:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">RESTORE</span> <span class="hljs-keyword">FROM</span> LATEST <span class="hljs-keyword">IN</span> <span class="hljs-string">'gs://&lt;BUCKET_NAME&gt;/cluster?AUTH=implicit'</span>;
</code></pre>
<p>Replace <code>&lt;BUCKET_NAME&gt;</code> with your actual Google Cloud Storage bucket name (for example: <code>cockroachdb-backup-7gw8u</code>).</p>
<p>CockroachDB will begin restoring your data. This can take a few seconds or minutes depending on your backup size. When it’s done, you’ll see a response showing a success status:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762108106557/0da98d45-d8f4-48ed-b852-9f76209fb20f.png" alt="Database restored successfully" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-step-7-confirm-the-restoration">Step 7: Confirm the Restoration</h4>
<p>Now, run the same query again:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>Boom 💥 your books are back 😁! That means your backup and restore process works perfectly. You just performed a full disaster recovery test.</p>
<p>Congrats! You’ve done something many real-world teams fail to test: a <strong>full backup and restore cycle</strong>. You’ve now proven that your database setup is resilient, even in a worst-case scenario.</p>
<h2 id="heading-managing-resources-amp-optimizing-memory-usage">Managing Resources &amp; Optimizing Memory Usage</h2>
<p>In this section, we’ll learn how CockroachDB handles memory internally (for things like caching and SQL query work), and how to tune these settings so you avoid OOM kills or evictions – that is, Kubernetes killing or stopping the database because it used more memory than it was allocated.</p>
<h3 id="heading-how-cockroachdb-uses-memory">How CockroachDB Uses Memory</h3>
<p>When you deploy CockroachDB nodes (each replica) via Kubernetes, each pod (node) needs memory for multiple things. At a high level, there are two major internal uses:</p>
<ul>
<li><p><strong>Cache</strong> (<code>conf.cache</code>): This is the space CockroachDB uses to keep frequently accessed data in memory so queries can run faster without hitting the disk.</p>
</li>
<li><p><strong>SQL Memory</strong> (<code>conf.max-sql-memory</code>): This is the memory used when running SQL queries (things like sorting, joins, buffering numbers, and temporary data).</p>
</li>
</ul>
<p>Together, they need to be sized appropriately relative to the total memory you give the pod, so there’s room for these internal operations <em>plus</em> other overhead (networking, logging, background tasks).</p>
<h3 id="heading-the-memory-usage-formula-you-must-follow">The Memory Usage Formula You Must Follow</h3>
<p>Here’s the golden rule you should <strong>never forget</strong>:</p>
<pre><code>(2 × max-sql-memory) + cache  ≤  80% of the memory limit
</code></pre>
<p>What this means:</p>
<ul>
<li><p>You take the <code>max-sql-memory</code> value and multiply it by 2 (because SQL work may need space for both input and output, among other things)</p>
</li>
<li><p>Add your <code>cache</code> value</p>
</li>
<li><p>That total must be <strong>less than or equal to 80%</strong> of the pod’s memory limit (<code>statefulset.resources.limits.memory</code>)</p>
</li>
<li><p>The remaining ~20% (or more) is free space for <em>other internal CockroachDB processes</em> like background jobs, metrics, network, and so on</p>
</li>
</ul>
<p>If you give CockroachDB too little “free” memory beyond these two settings, you risk OOM kills (pod gets killed by Kubernetes because it used more memory than allowed) or performance issues.</p>
<h3 id="heading-where-you-find-these-settings">Where You Find These Settings</h3>
<p>If you go to the Helm chart docs on ArtifactHub, <a target="_blank" href="https://artifacthub.io/packages/helm/cockroachdb/cockroachdb">CockroachDB Helm Chart on ArtifactHub</a>, and scroll down to the <strong>Configuration</strong> section (or press Ctrl-F for <code>conf.cache</code>), you’ll see:</p>
<ul>
<li><p><code>conf.cache</code> (cache size)</p>
</li>
<li><p><code>conf.max-sql-memory</code> (SQL memory size)</p>
</li>
<li><p>It states that each of these is by default set to roughly 25% of the memory allocation you set in the <code>resources.limits.memory</code> for the statefulset.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762235290740/bd176882-43bd-4abd-94e0-cce083335d64.png" alt="Artifacthub docs for the CockroachDB Helm chart" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-concrete-example-step-by-step">Concrete Example (Step-by-Step)</h3>
<p>Let’s do the math with numbers in our Minikube environment.</p>
<ul>
<li><p>Suppose we set <code>statefulset.resources.limits.memory</code> = <strong>2 GiB</strong> for each CockroachDB pod (adjust the math below if you used a different limit).</p>
</li>
<li><p>The Helm default of ¼ (25%) rule means:</p>
<ul>
<li><p><code>conf.cache</code> = ¼ × 2 GiB = <strong>512 MiB</strong></p>
</li>
<li><p><code>conf.max-sql-memory</code> = ¼ × 2 GiB = <strong>512 MiB</strong></p>
</li>
</ul>
</li>
<li><p>Apply the formula: <code>(2 × 512 MiB) + 512 MiB = 1,536 MiB</code></p>
</li>
<li><p>Calculate 80% of the memory limit: <code>80% of 2 GiB = 1,638 MiB</code> (approximately)</p>
</li>
<li><p>Compare: 1,536 MiB ≤ 1,638 MiB – so we’re within the safe zone ✅</p>
</li>
<li><p>That means in this configuration, CockroachDB expects to use <strong>~1,536 MiB</strong> for its cache + SQL memory. This leaves <strong>~512 MiB</strong> (20%) of the 2 GiB limit for other internal processes.</p>
</li>
</ul>
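<p>The arithmetic above can be turned into a quick shell sanity check (values in MiB, assuming the Helm defaults of 25% for both <code>conf.cache</code> and <code>conf.max-sql-memory</code> – adapt <code>LIMIT_MIB</code> to your own pod size):</p>

```shell
# Memory-formula sanity check for a 2 GiB (2048 MiB) pod memory limit.
LIMIT_MIB=2048
CACHE_MIB=$(( LIMIT_MIB / 4 ))        # Helm default: conf.cache = 25% of limit
SQL_MEM_MIB=$(( LIMIT_MIB / 4 ))      # Helm default: conf.max-sql-memory = 25% of limit

PLANNED_MIB=$(( 2 * SQL_MEM_MIB + CACHE_MIB ))  # (2 x max-sql-memory) + cache
BUDGET_MIB=$(( LIMIT_MIB * 80 / 100 ))          # 80% of the memory limit

echo "planned=${PLANNED_MIB}MiB budget=${BUDGET_MIB}MiB"
if [ "$PLANNED_MIB" -le "$BUDGET_MIB" ]; then
  echo "within the safe zone"
else
  echo "over budget - lower cache or max-sql-memory"
fi
```

<p>For our 2 GiB example this prints <code>planned=1536MiB budget=1638MiB</code> followed by <code>within the safe zone</code>, matching the numbers worked out above.</p>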
<p>That leftover memory is for things like internal bookkeeping (range rebalancing, replication metadata), communication among database replicas, metric collection, logging, garbage collection, and temporary or unexpected memory spikes.</p>
<p>If you don’t leave this free space, your node might struggle even during normal operations. And on Kubernetes, if the pod uses more memory than <code>limits.memory</code> allows, it can get OOM-killed, which causes downtime or restarts.</p>
<h3 id="heading-on-requests-vs-limits-in-kubernetes">⚠️ On Requests vs Limits in Kubernetes</h3>
<p>Important nuance: Kubernetes schedules pods based on <strong>requests</strong> (what you ask for) but enforces limits based on <strong>limits</strong> (what you allow).</p>
<ul>
<li><p><code>statefulset.resources.requests.memory</code> = what the scheduler guarantees the pod will have.</p>
</li>
<li><p><code>statefulset.resources.limits.memory</code> = the maximum the pod can use before Kubernetes will kill it for excess memory.</p>
</li>
</ul>
<p>Because CockroachDB’s internal memory computations (cache + SQL memory) use the <strong>limit</strong> value to calculate sizing, if you set requests &lt; limits you’ll get a mismatch. Example:</p>
<ul>
<li><p>Suppose requests = 1 GiB, limits = 2 GiB</p>
</li>
<li><p>Kubernetes may schedule the pod on a node that has (at least) 1 GiB free</p>
</li>
<li><p>But internally, CockroachDB will plan for ~1.5 GiB usage (based on the 2 GiB limit)</p>
</li>
<li><p>The node may not actually have that much free memory available</p>
</li>
<li><p>The pod might try to use more memory than the node reserved for it, risking eviction because less memory is left for other pods</p>
</li>
</ul>
<p>✅ <strong>Best practice:</strong> Set requests = limits for memory and CPU for CockroachDB pods. That way the scheduler reserves enough space for what CockroachDB will use internally.</p>
<h3 id="heading-overriding-the-default-fractions">Overriding the Default Fractions</h3>
<p>If you want to set static <code>conf.cache</code> or <code>conf.max-sql-memory</code> values (rather than relying on 25% of limit) you <em>can</em> – but you must still obey the memory usage formula.</p>
<p>For example, if you set:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">conf:</span>
  <span class="hljs-attr">cache:</span> <span class="hljs-string">"1Gi"</span>
  <span class="hljs-attr">max-sql-memory:</span> <span class="hljs-string">"1Gi"</span>
<span class="hljs-attr">statefulset:</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"3Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"3Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
</code></pre>
<p>With this configuration, your pod’s memory request and limit are both <strong>3 GiB</strong>. Now apply the formula:</p>
<pre><code class="lang-yaml"><span class="hljs-string">(2</span> <span class="hljs-string">×</span> <span class="hljs-string">1Gi)</span> <span class="hljs-string">+</span> <span class="hljs-string">1Gi</span> <span class="hljs-string">=</span> <span class="hljs-string">3Gi</span>
<span class="hljs-number">80</span><span class="hljs-string">%</span> <span class="hljs-string">of</span> <span class="hljs-string">3Gi</span> <span class="hljs-string">=</span> <span class="hljs-string">~2.4Gi</span>
</code></pre>
<p>Here <strong>3Gi &gt; 2.4Gi</strong>, so you’d be violating the rule. This is a risky setup.</p>
<p>So you’ll need to either reduce cache or SQL memory, for example to 768Mi (or increase the memory limit, for example 4Gi) so that your formula results in ≤ 80% of the limit.</p>
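<p>For instance, one way to satisfy the rule while keeping the 3 GiB limit is to shrink both values to 768 MiB (a sketch reusing the keys from the example above):</p>
<pre><code class="lang-yaml">conf:
  cache: "768Mi"           # (2 × 768Mi) + 768Mi = 2,304Mi
  max-sql-memory: "768Mi"  # 2,304Mi ≤ 80% of 3Gi (~2,458Mi) ✅
statefulset:
  resources:
    requests:
      memory: "3Gi"
      cpu: 1
    limits:
      memory: "3Gi"
      cpu: 1
</code></pre>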
<h2 id="heading-scaling-cockroachdb-the-right-way">Scaling CockroachDB the Right Way</h2>
<p>In this section we’ll look at when and how you should grow your CockroachDB cluster – whether that means adding more replicas (horizontal scale), giving each node more CPU/RAM (vertical scale), or giving them more storage.</p>
<p>I’ll explain everything in simple terms and cover what metrics to watch, what decisions to make, and how to scale safely.</p>
<p>What we’ll discuss:</p>
<ul>
<li><p>How you can tell it’s time to “grow” your cluster</p>
</li>
<li><p>How to safely add more nodes or upgrade what you already have</p>
</li>
<li><p>How to decide whether you need more nodes, bigger nodes, or bigger disks</p>
</li>
<li><p>How to do all this without causing downtime or stress</p>
</li>
</ul>
<h3 id="heading-key-metrics-to-understand">Key Metrics to Understand</h3>
<p>Before we dive into how to scale our cluster, we need to understand what certain metrics mean. These metrics will help us make calculated decisions, knowing what to scale and when.</p>
<h4 id="heading-read-bytessecond-amp-write-bytessecond-throughput">Read bytes/second &amp; Write bytes/second (Throughput)</h4>
<p>Read bytes/second is how much data (in bytes) the disk is <strong>reading</strong> every second – that is, data passing from the disk to the database application.</p>
<p>Write bytes/second is how much data is being <strong>written</strong> to the disk per second, that is, moving from the database to the disk.</p>
<p>This matters because your database is an application that stores data on disk. If your app needs to read a lot of data (reads) or write a lot of data (writes), this metric shows the <strong>volume</strong> of data flowing to/from disk.</p>
<p>To keep an eye on it, go to your CockroachDB dashboard and navigate to the “Metrics” link on the sidebar. Under the “Metrics” title, click the “Dashboard:…” drop-down and select “Hardware” from the options.</p>
<p>Now, scroll down a bit till you see “Disk Read Bytes/s” and “Disk Write Bytes/s”.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762325396257/553ac9d4-4927-40f3-b654-8b19a0b2aef8.png" alt="The Disk Read &amp; Write Bytes/s metrics" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-read-iops-amp-write-iops">Read IOPS &amp; Write IOPS</h4>
<p><strong>IOPS</strong> = “Input/Output Operations Per Second”. Here, Read IOPS = how many <strong>read operations</strong> the disk is performing per second. Write IOPS = how many <strong>write operations</strong> per second.</p>
<p>This is different from throughput because throughput is about how many bytes (data) are being transferred. IOPS, on the other hand, is about <strong>how many operations</strong> are happening (regardless of size).</p>
<p>Here’s an example: 10 read operations/sec of 1 MiB each = 10 MiB/sec throughput, 10 IOPS. Another scenario: 100 reads/sec of 10 KiB each = ~1 MiB/sec throughput, but 100 IOPS (a higher operation count despite the lower data volume).</p>
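<p>You can sanity-check the relationship yourself – throughput is roughly IOPS × average operation size. A quick shell sketch of the two scenarios above:</p>
<pre><code class="lang-bash"># throughput (KiB/s) = operations per second × size of each operation (KiB)
OPS_A=10;  SIZE_A=1024   # 10 reads/sec of 1 MiB each
OPS_B=100; SIZE_B=10     # 100 reads/sec of 10 KiB each

THROUGHPUT_A=$((OPS_A * SIZE_A))   # 10,240 KiB/s ≈ 10 MiB/s at only 10 IOPS
THROUGHPUT_B=$((OPS_B * SIZE_B))   # 1,000 KiB/s ≈ 1 MiB/s at 100 IOPS

echo "A: ${OPS_A} IOPS, ${THROUGHPUT_A} KiB/s"
echo "B: ${OPS_B} IOPS, ${THROUGHPUT_B} KiB/s"
</code></pre>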
<p>Scroll down a bit more to view the IOPS metrics:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762325699278/dd549ac3-16cf-4373-9637-5a1e798bf5db.png" alt="Illustrating the IOPS metrics on the dashboard" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-sql-p99-latency-99th-percentile-latency">SQL p99 Latency (99th percentile latency)</h4>
<p>P99 latency is the time it takes for the <strong>slowest 1% of queries</strong> to finish.</p>
<p>For example, let’s say you run 1,000 queries. p99 is the time within which the fastest 990 of them completed – only the slowest 10 took longer.</p>
<p>This matters because it’s not about the average query, but about the tail (worst cases). If your p99 is high, it means some queries are seriously lagging. All other queries might be fine, but some are dragging.</p>
<p>So if p99 jumps up (for example, from 10 ms → 300 ms), you should investigate: maybe big joins, missing indexes, contention, or writes taking too long to reach the disk.</p>
<p>To access the SQL P99 Latency metrics, simply click the “Dashboard:…” select field, and choose the “Overview” option from the dropdown.</p>
<p>PS: The higher the p99 latency, the bigger the problem (slower queries).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762326088120/e6f39e6e-942b-4db9-b808-cb228c1e0cc5.png" alt="The SQL p99 latency metric" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-disk-ops-in-progress-queue-depth">Disk Ops In Progress (Queue Depth)</h4>
<p>This shows how many disk reads and writes are waiting <em>in line</em> (queued) because the storage system is busy.</p>
<p>A queue depth of 0–5 is generally OK. If it frequently goes into double-digits (10+), that means storage is struggling and latency may spike. If you see this number high and staying high, you may need faster storage or more database replicas.</p>
<p>Simple rule: if “Ops In Progress” stays above ~9 for an extended time, that’s a bad sign. Time to check disks and I/O.</p>
<p>To access the “Disk Ops In Progress“ metric, return to the “Hardware“ dashboard, and scroll down:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762488796957/b2a215fd-ec51-4ee3-9056-a5fa6d511c61.png" alt="Accessing the Disk Ops In Progress metrics on the CockroachDB dashboard" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>By monitoring these, you can choose:</p>
<ul>
<li><p>“I need <strong>more nodes</strong>” (horizontal scale)</p>
</li>
<li><p>“I need <strong>bigger nodes or faster storage</strong>” (vertical scale)</p>
</li>
<li><p>“I need <strong>better query/index tuning</strong>” (optimize rather than scale)</p>
</li>
</ul>
<h3 id="heading-when-and-what-to-scale-based-on-your-metrics">When (and What) to Scale Based on Your Metrics</h3>
<p>So, let’s imagine you’re watching your CockroachDB dashboard and notice this pattern:</p>
<ul>
<li><p>The <strong>SQL P99 latency</strong> (the slowest 1% of your queries) is high, meaning your queries are taking too long.</p>
</li>
<li><p>The <strong>CPU usage</strong> for your CockroachDB pods (under <em>Cockroach process CPU%</em>) is above <strong>80%</strong> consistently.</p>
</li>
</ul>
<p>That’s a classic sign your cluster is running out of CPU power and the database is struggling to process queries fast enough because the CPU is maxed out.</p>
<p>Here’s how to fix it 👇🏾</p>
<h4 id="heading-step-1-add-more-cpu-power">Step 1: Add More CPU Power</h4>
<p>You can scale up your CPUs directly through the <strong>Helm chart values file</strong>, <code>cockroachdb-values.yml</code>.</p>
<p>In that file, look for the section where CPU and memory requests/limits are defined under <code>statefulset.resources</code>. Then, increase the CPU allocations. For example:</p>
<pre><code class="lang-yaml">statefulset:
  resources:
    requests:
      cpu: <span class="hljs-string">"3"</span>
      memory: <span class="hljs-string">"6Gi"</span>
    limits:
      cpu: <span class="hljs-string">"3"</span>
      memory: <span class="hljs-string">"6Gi"</span>
</code></pre>
<p>This means each CockroachDB pod (replica) will now <em>request</em> 3 vCPUs (guaranteed). Save the file, then apply the update with the Helm command:</p>
<pre><code class="lang-bash">helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<p>Once the upgrade is done, give it 30 minutes to 1 hour to stabilize. The CockroachDB dashboard will automatically start showing you updated metrics.</p>
<p>If you see that the CPU usage drops below 70% and the SQL P99 latency improves, you’re good. 👍🏾</p>
<h4 id="heading-step-2-add-another-replica-new-node">Step 2: Add Another Replica (New Node)</h4>
<p>But…what if the latency is <strong>still high</strong> even after adding more CPU? That likely means the cluster is still overloaded, and it’s time to add another node (replica) to distribute the load.</p>
<p>Here’s why that works: CockroachDB is horizontally scalable, meaning it automatically spreads out your data (remember <strong>ranges</strong>?) and balances reads/writes across all replicas. So, the more nodes you add, the more evenly your cluster can share the work.</p>
<p>To add another replica, simply increase the <code>replicas</code> value in your Helm config:</p>
<pre><code class="lang-yaml">statefulset:
  replicas: 4  <span class="hljs-comment"># If it was 3 before</span>
</code></pre>
<p>Then, redeploy again:</p>
<pre><code class="lang-bash">helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<p>This adds a new pod (a new CockroachDB node) to your cluster. CockroachDB will automatically rebalance your data across nodes – no manual migration needed.</p>
<p>💡 <strong>Tip:</strong> Try to keep one CockroachDB pod (replica) per VM. For example, if you have 3 replicas, you should ideally have 3 separate VMs (worker nodes). This ensures better fault tolerance and performance.</p>
<p>Luckily, the official CockroachDB Helm chart already helps with this by managing <strong>Pod</strong> <strong>anti-affinity rules</strong>, so pods are automatically spread across nodes safely.</p>
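<p>For reference, the chart drives that spreading behavior through values roughly like these. Treat this as a sketch – the exact key names and defaults can differ between chart versions, so check your chart’s <code>values.yaml</code>:</p>
<pre><code class="lang-yaml">statefulset:
  podAntiAffinity:
    topologyKey: kubernetes.io/hostname  # spread replicas across distinct worker nodes
    type: soft    # "hard" strictly forbids co-locating two replicas; "soft" only prefers not to
    weight: 100   # how strongly the scheduler honors the "soft" preference
</code></pre>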
<h3 id="heading-disk-bound-situations-what-to-do-when-your-disk-is-the-limiting-factor">Disk-Bound Situations — What to Do When Your Disk Is the Limiting Factor</h3>
<p>If you’re seeing this kind of pattern in your CockroachDB dashboard and Kubernetes cluster:</p>
<ul>
<li><p>SQL P99 latency is high (queries are slow)</p>
</li>
<li><p>“Disk Ops In Progress” (queue depth) stays above ~9-10 – meaning many disk I/O operations are waiting to be processed</p>
</li>
<li><p>Disk “Read bytes/sec” or “Write bytes/sec” (throughput) are high <strong>or</strong> “Read IOPS” or “Write IOPS” are high (even though CPU looks okay)</p>
</li>
</ul>
<p>Then you’re very likely <strong>disk-bound</strong>, meaning your storage is the bottleneck.</p>
<p>Here’s how to fix it (and yes, it’s a bit more complex than just “add more RAM”)…</p>
<h4 id="heading-step-1-increase-disk-size-in-your-helm-values">Step 1: Increase Disk Size in Your Helm Values</h4>
<p>Often the first problem is that the disk size is too small. Here’s how you can increase it:</p>
<ol>
<li><p>Open your <code>cockroachdb-values.yml</code> (the Helm chart values file)</p>
</li>
<li><p>Look for the storage section, for example:</p>
</li>
</ol>
<pre><code class="lang-yaml">storage:
  persistentVolume:
    size: 5Gi  <span class="hljs-comment"># current size</span>
</code></pre>
<ol start="3">
<li>Update it to a larger size, like:</li>
</ol>
<pre><code class="lang-yaml">storage:
  persistentVolume:
    size: 15Gi  <span class="hljs-comment"># increased size</span>
</code></pre>
<ol start="4">
<li>Save the file and run:</li>
</ol>
<pre><code class="lang-bash">helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<p><strong>N.B.</strong> If this doesn’t work, or Helm complains that it can’t modify certain values (this is normal), resize the disk directly instead 👇🏾 (replace the PVC_NAME and SIZE placeholders accordingly):</p>
<pre><code class="lang-bash">kubectl patch pvc &lt;PVC_NAME&gt; \
  -p <span class="hljs-string">'{"spec":{"resources":{"requests":{"storage":"&lt;SIZE&gt;"}}}}'</span>
</code></pre>
<p>Do that for each PVC (<code>datadir-crdb-cockroachdb-0</code>, <code>datadir-crdb-cockroachdb-1</code>, and so on).</p>
<p><strong>Important:</strong> Increasing size <em>may help</em>, but often alone is not enough because your disk speed (IOPS/throughput) also depends on factors beyond just size.</p>
<p>Let’s break down why that’s the case, and what really affects your disk performance (especially on Google Cloud, which is what I’m using, too).</p>
<h4 id="heading-why-disk-speed-can-vary">Why Disk Speed Can Vary</h4>
<p>Your CockroachDB cluster uses <strong>external disks</strong> provided by your cloud provider (like Google, AWS, or Azure). The speed of those disks – that is, how fast they can read/write data – isn’t fixed. It depends on a few key factors.</p>
<p>On Google Cloud, disk performance depends on three main things:</p>
<ol>
<li><p><strong>Disk type</strong>: HDD, SSD, or fast SSD (pd-ssd) (the faster the disk type, the faster it can handle data operations)</p>
</li>
<li><p><strong>Disk size</strong>: larger disks usually come with higher speed limits (the bigger, the faster)</p>
</li>
<li><p><strong>VM’s vCPU count</strong>: more CPUs mean higher quotas for both</p>
<ul>
<li><p>read/write operations per second (<strong>IOPS</strong>), and</p>
</li>
<li><p>how much data can flow to/from the disk per second (<strong>throughput</strong>)</p>
</li>
</ul>
</li>
</ol>
<h4 id="heading-the-recommended-disk-type-for-cockroachdb">The Recommended Disk Type for CockroachDB</h4>
<p>The pd-ssd (Google’s fast SSD) is the recommended type for CockroachDB.</p>
<ul>
<li><p>Each pd-ssd disk starts with a minimum of 6,000 IOPS (read or write operations per second).</p>
</li>
<li><p>It also has around 240 MiB/s (~252 MB/s) of read/write throughput.</p>
</li>
</ul>
<p>In simple terms, that means your CockroachDB disk can handle up to 6,000 read/write operations EVERY SECOND, and move 250+ MB of data in and out every second. That’s pretty impressive!</p>
<p>But here’s the catch: those numbers can still vary depending on your <strong>VM family</strong> and <strong>CPU count</strong>.</p>
<h4 id="heading-how-vm-family-affects-disk-speed-e2-example">How VM Family Affects Disk Speed (E2 Example)</h4>
<p>If your CockroachDB is running on an E2 VM family (one of Google Cloud’s general-purpose VM types):</p>
<ul>
<li><p>A VM with 2–7 vCPUs can handle up to:</p>
<ul>
<li><p>15k IOPS (read/write operations per second)</p>
</li>
<li><p>250+ MiB/s throughput (which is already far more than many databases ever use 😅)</p>
</li>
</ul>
</li>
<li><p>A VM with 8–15 vCPUs still allows 15k IOPS, but throughput jumps up to ~800 MiB/s 😮 –<br>  meaning your disk can push nearly 0.8 GB of data in/out EVERY SECOND.</p>
</li>
</ul>
<p>The more vCPUs you have, the higher these limits grow, both for IOPS and throughput.</p>
<h4 id="heading-putting-it-all-together">Putting It All Together</h4>
<p>So, if you notice high SQL P99 latency (queries taking long), and disk read and write IOPS or throughput (read &amp; write bytes) usage close to their limits, then your disk may be maxing out, not your database itself.</p>
<p>Here’s what you can do:</p>
<ul>
<li><p>Check your current VM’s vCPU count and disk performance limit for that CPU.</p>
</li>
<li><p>If you’re using E2 with low vCPUs (for example, 2–4), try increasing it to <strong>8 vCPUs or more</strong>. That’ll immediately lift your IOPS and throughput ceiling.</p>
</li>
</ul>
<h4 id="heading-example-e2-vm-family-iopsthroughput-table">Example: E2 VM Family IOPS/Throughput Table</h4>
<pre><code class="lang-bash">E2 per-VM caps (pd-ssd):

e2-medium:     10k write / 12k <span class="hljs-built_in">read</span> IOPS, 200/200 MiB/s
2–7 vCPUs:     15k / 15k IOPS, 240/240 MiB/s
8–15 vCPUs:    15k / 15k IOPS, 800/800 MiB/s
16–31 vCPUs:   25k / 25k IOPS, 1,000 write / 1,200 <span class="hljs-built_in">read</span> MiB/s
32 vCPUs:      60k / 60k IOPS, 1,000 write / 1,200 <span class="hljs-built_in">read</span> MiB/s
</code></pre>
<p>The rule is simple — the higher the CPU tier (2–7, 8–15, and so on), the higher the disk speed cap.</p>
<h4 id="heading-but-what-if-youre-still-seeing-slow-queries">⚠️ But What If You’re Still Seeing Slow Queries?</h4>
<p>If your CockroachDB queries are <em>still</em> slow, but your metrics show that you’re not fully using your disk capacity (based on your VM’s CPU range), then your <strong>disk size</strong> might be the actual limitation.</p>
<p>In that case:</p>
<ul>
<li><p>Gradually increase your disk size, for example from <code>50Gi</code> to <code>70Gi</code> to <code>100Gi</code>.</p>
</li>
<li><p>Each increase lets your disk move more data in and out (especially with pd-ssd).</p>
</li>
<li><p>Remember: once you increase disk size on Google Cloud, <strong>you can’t shrink it back down</strong>, so grow it slowly and observe improvements before scaling again.</p>
</li>
</ul>
<p>This step helps you pinpoint <em>exactly</em> whether the slowdown is coming from insufficient IOPS, throughput, or just a disk that’s too small for CockroachDB’s workload 💪🏾</p>
<h3 id="heading-memory-pressure-what-to-do-when-your-database-hits-the-limit">Memory Pressure — What to Do When Your Database Hits the Limit</h3>
<p>There are some signs you can look out for that tell you your database is getting close to its memory limit. Pods (database replicas) might be getting <strong>OOMKilled</strong> (out of memory) or evicted by Kubernetes, or your memory usage might stay above ~75–80% for a while.</p>
<p>If either of these is the case, you’re often dealing with <strong>memory pressure</strong> (you can check memory usage on the CockroachDB overview dashboard).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762584827011/e7828548-7ed7-4a87-b6b2-fff52c6f6df1.png" alt="Accessing your Cluster memory usage" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-why-this-happens">Why this happens</h4>
<p>If you didn’t set memory requests and limits properly for each replica, the pod might not have enough headroom for all of its internal work (cache, SQL memory, background jobs), and Kubernetes kills it or it crashes.</p>
<p>Also, as you increase load (lots of queries, many users), your database needs more memory for two internal areas:</p>
<ul>
<li><p><code>--cache</code> (or <code>conf.cache</code>): in-memory data caching</p>
</li>
<li><p><code>--max-sql-memory</code> (or <code>conf.max-sql-memory</code>): memory for running SQL queries (joins, sorts, and so on).<br>  And yes, we covered the formula earlier <code>(2 × max-sql-memory) + cache ≤ ~ 80% of RAM limit</code>.</p>
</li>
</ul>
<h4 id="heading-what-to-do">What to do:</h4>
<p>First, you can increase the DB memory. In your Helm chart values (<code>cockroachdb-values.yml</code>), bump up the <code>statefulset.resources.limits.memory</code> and <code>statefulset.resources.requests.memory</code>. Or you can modify <code>conf.cache</code> and <code>conf.max-sql-memory</code> values (if you’re comfortable) but only if the total RAM limit is sufficient to support them.</p>
<p>Because the defaults (when you installed) set each to ~25% of RAM limit, they will scale automatically when you increase RAM.</p>
<p>For example:</p>
<ul>
<li><p>If RAM limit per pod = <strong>5 GiB</strong>, then cache ≈ <strong>1.25 GiB</strong>, max-sql-memory ≈ <strong>1.25 GiB</strong></p>
</li>
<li><p>If you raise RAM limit to <strong>8 GiB</strong>, these become ≈ <strong>2 GiB</strong> each. This keeps you inside the formula and avoids memory crashes.</p>
</li>
</ul>
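<p>Because the defaults track the limit, those numbers fall out of simple arithmetic. A quick shell sketch (values in MiB; the <code>quarter</code> helper is just for illustration):</p>
<pre><code class="lang-bash"># Default sizing: cache and max-sql-memory each get 25% of the pod's RAM limit
quarter() { echo $(( $1 * 25 / 100 )); }   # 25% of a MiB value

echo "5 GiB limit: cache/max-sql-memory = $(quarter 5120) MiB each"   # 1280 MiB ≈ 1.25 GiB
echo "8 GiB limit: cache/max-sql-memory = $(quarter 8192) MiB each"   # 2048 MiB = 2 GiB
</code></pre>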
<h4 id="heading-quick-yaml-snippet-example">Quick YAML snippet example:</h4>
<pre><code class="lang-yaml"><span class="hljs-attr">statefulset:</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"8Gi"</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"8Gi"</span>
<span class="hljs-attr">conf:</span>
  <span class="hljs-attr">cache:</span> <span class="hljs-string">"25%"</span>
  <span class="hljs-attr">max-sql-memory:</span> <span class="hljs-string">"25%"</span>
</code></pre>
<p>After editing your values file, remember to apply it:</p>
<pre><code class="lang-bash">helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<h3 id="heading-when-queries-are-slow-but-everything-else-cpu-memory-amp-disk-looks-fine">When Queries Are Slow but Everything Else (CPU, Memory &amp; Disk) Looks “Fine”</h3>
<p>Sometimes you’ll see that your resource metrics (CPU, memory, disk I/O) all seem healthy. But your queries are still slow.</p>
<p>What then? One important cause: <strong>hotspots</strong> – especially “hot ranges” or “hot nodes” in CockroachDB.</p>
<p>A <strong>hot range</strong> is a portion of data (in CockroachDB, a range is a section of data from a table) that’s receiving much more traffic (reads or writes) than others.</p>
<p>A <strong>hot node</strong>, on the other hand, is a node/replica in the cluster which has significantly more load compared to the other nodes – often because it holds one or more hot ranges.</p>
<p>Because most of the traffic (queries) goes to a range that lives on a specific node, performance still suffers locally even though your overall CPU / memory / disk metrics might look “okay”: queries are funneled into that specific range, creating a “hotspot”.</p>
<p>Learn more about Hotspots <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/understand-hotspots">here</a>.</p>
<h4 id="heading-why-a-high-write-workload-can-slow-reads">Why A High Write Workload Can Slow Reads</h4>
<p>When you have lots of write queries, they may overload specific ranges or nodes (especially if the keyspace is skewed). Writes tend to:</p>
<ul>
<li><p>Acquire locks or latches on rows or ranges</p>
</li>
<li><p>Cause contention among transactions</p>
</li>
<li><p>Require coordination (for example, via Raft consensus) which impacts performance.</p>
</li>
</ul>
<p>When writes dominate a range, read queries that hit the same ranges may get queued behind these write operations, or suffer longer wait times.</p>
<p>Since reads and writes share the same underlying data/ranges, too many writes can delay reads by creating bottlenecks. The docs describe this as part of “write hotspots”.</p>
<h4 id="heading-key-signs-you-might-have-a-hotspot">Key Signs You Might Have a Hotspot</h4>
<ul>
<li><p>One node’s CPU % is much higher than the others (even though overall resources seem fine)</p>
</li>
<li><p>On the Hot Ranges page in the CockroachDB UI, some ranges show very high QPS (queries per second) compared to others.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762586236835/aeb3b0ea-b280-48d3-b12f-4cfe78d11dc1.png" alt="The Hot Ranges page in the CockroachDB dashboard UI" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p>You observe that increasing overall resources (more CPU, more nodes) didn’t resolve the slowness. This suggests the problem isn’t “not enough resources” but “resource imbalance”.</p>
</li>
</ul>
<h4 id="heading-what-you-can-do">What You Can Do</h4>
<p>There are a few things you can do to prevent hotspots:</p>
<ul>
<li><p>Use the <strong>Hot Ranges</strong> UI page (go to the Database Console and then to Hot Ranges) to identify the range IDs and table/indexes causing the issue.</p>
</li>
<li><p>Examine how the key space is being used. If your table/index primary key is monotonically increasing (for example, timestamps or serial IDs), the writes may target a narrow portion of the data, causing a hotspot. The docs suggest using hash-sharded indexes or distributing writes across the key-space.</p>
</li>
<li><p>Ensure load is balanced across nodes: avoid “one node doing most of the work”. If needed, add nodes or ensure range distribution/lease-holder movement is happening.</p>
</li>
<li><p>Monitor your write-versus-read workload. If writes are heavy, they may cause queuing for reads even when resources appear OK. So look at write-heavy traffic patterns and try reducing the volume of writes (if possible).</p>
</li>
</ul>
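<p>To make the hash-sharded index suggestion concrete, here’s a sketch in CockroachDB SQL. The <code>events</code> table and its columns are hypothetical, not part of this book’s deployment:</p>
<pre><code class="lang-sql">-- A timestamp-led primary key normally funnels all inserts into the newest range.
-- USING HASH shards the index so sequential writes fan out across ranges.
CREATE TABLE events (
  event_id UUID NOT NULL DEFAULT gen_random_uuid(),
  ts TIMESTAMPTZ NOT NULL DEFAULT now(),
  payload STRING,
  PRIMARY KEY (ts, event_id) USING HASH
);
</code></pre>
<p>The trade-off: hash-sharding spreads writes but makes ordered range scans on <code>ts</code> more expensive, so use it only where the hotspot is real.</p>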
<h4 id="heading-note">⚠️ Note</h4>
<p>Learning everything about hotspots, key visualizers, and range splitting is a bit advanced. For those wanting to dive deeper: see the CockroachDB <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/performance-recipes">Performance Recipes page</a>.</p>
<h3 id="heading-understanding-disk-speed-iops-amp-throughput-across-cloud-providers">Understanding Disk Speed (IOPS &amp; Throughput) Across Cloud Providers</h3>
<p>So far, we’ve talked about how disk speed affects CockroachDB’s performance – especially how Google Cloud measures it. But it’s important to know that <strong>each cloud provider has its own way of measuring and limiting disk performance</strong> (IOPS and throughput).</p>
<p>So, while our earlier examples focused on Google Cloud, similar logic applies to AWS, Azure, and even DigitalOcean, just with different formulas and limits.</p>
<h4 id="heading-for-google-cloud">For Google Cloud:</h4>
<p>These guides break down how disk performance works:</p>
<ul>
<li><p><a target="_blank" href="https://cloud.google.com/compute/docs/disks/performance">Persistent Disk performance overview</a>: explains how baseline IOPS and throughput are calculated and the per-instance caps.</p>
</li>
<li><p><a target="_blank" href="https://docs.cloud.google.com/compute/docs/disks/persistent-disks">About Persistent Disks</a>: quick definitions of <code>pd-standard</code> (HDD), <code>pd-balanced</code> (SSD), and <code>pd-ssd</code> (SSD).</p>
</li>
<li><p><a target="_blank" href="https://cloud.google.com/compute/docs/disks/optimizing-pd-performance">Optimize PD performance</a>: shows how disk size, machine series, and tuning can affect performance.</p>
</li>
</ul>
<h4 id="heading-for-aws-ebs">For AWS (EBS):</h4>
<p>AWS’s Elastic Block Store (EBS) has several disk types:</p>
<ul>
<li><p><a target="_blank" href="https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volume-types.html">EBS volume types</a>: overview of all SSD and HDD types (<code>gp3</code>, <code>gp2</code>, <code>io2</code>, and so on).</p>
</li>
<li><p><a target="_blank" href="https://docs.aws.amazon.com/ebs/latest/userguide/general-purpose.html">General Purpose SSD (gp3)</a>: lets you provision custom IOPS and throughput for your disks (about 0.25 MiB/s per IOPS, up to 2,000 MiB/s).</p>
</li>
</ul>
<h4 id="heading-for-azure-managed-disks">For Azure (Managed Disks):</h4>
<p>Azure disks also vary by type and size:</p>
<ul>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/disks-types">Disk types overview</a>: compares Standard HDD, Standard SSD, Premium SSD, Premium SSD v2, and Ultra Disk.</p>
</li>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/disks-deploy-premium-v2">Premium SSD v2</a>: lets you independently set IOPS and throughput for your disks.</p>
</li>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/disks-performance">VM &amp; disk performance</a>: lists per-VM IOPS and throughput caps.</p>
</li>
</ul>
<h4 id="heading-for-digitalocean">For DigitalOcean:</h4>
<p>DigitalOcean offers simpler storage setups:</p>
<ul>
<li><p><a target="_blank" href="https://docs.digitalocean.com/products/volumes/">Volumes overview</a>: explains block storage and NVMe details.</p>
</li>
<li><p><a target="_blank" href="https://docs.digitalocean.com/products/volumes/details/limits/">Volume Limits</a>: shows per-Droplet IOPS and throughput caps (including burst windows).</p>
</li>
</ul>
<h3 id="heading-downsizing-the-cluster-reducing-replicas">Downsizing the Cluster (Reducing Replicas)</h3>
<p>Now that we’ve seen how to scale up our CockroachDB cluster, let’s look at how to scale it down safely and correctly.</p>
<p>Let’s assume we scaled our cluster from 3 replicas to 5 replicas earlier (to handle more workload).</p>
<p>PS: If your CockroachDB pods were crashing often, you might need to increase the CPU and memory limits in the Helm chart configuration, like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">statefulset:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">5</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"2Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"3Gi"</span> <span class="hljs-comment"># We can keep the memory requests and limits inconsistent for now, since we're in a development environment</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
<span class="hljs-string">...</span>
</code></pre>
<p>Then, you update the cluster using:</p>
<pre><code class="lang-bash">helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<p>After a few minutes, you can confirm the newly added replicas with <code>kubectl get pods</code>. You should now see five CockroachDB pods running.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762612478598/dee9f9e7-6b31-4b06-aed3-e2b0b97268fd.png" alt="The newly added CockroachDB replicas" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Also, check your CockroachDB Admin UI – the new nodes should now appear in the cluster overview.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762612539734/30e01a7d-3d2b-4160-be90-2988a161d87d.png" alt="Newly added nodes in the cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>P.S: You might experience some issues when upscaling your cluster, especially if you don’t have sufficient memory and CPU on your PC or wherever you’re running your Kubernetes cluster.</p>
<h3 id="heading-the-wrong-way-to-downscale">⚠️ The Wrong Way to Downscale</h3>
<p>Now, what if your workload reduces and you’d like to cut costs by scaling down from 5 replicas back to 3?</p>
<p>You might think, <em>“Oh, I’ll just reduce the number of replicas in the Helm chart from 5 to 3 and redeploy.”</em> But hold on, that’s very wrong! 😅</p>
<p>Scaling up CockroachDB is simple, but scaling down must be done carefully, for reasons we’ll explain below.</p>
<h3 id="heading-decommissioning-a-node-before-scaling-down-the-cluster">Decommissioning a Node Before Scaling Down the Cluster</h3>
<p>Before you go ahead and reduce the number of replicas in your CockroachDB cluster, it’s important to follow the right process.</p>
<p>You <em>can’t</em> just go from 5 replicas down to 3 and expect everything to go smoothly. There are steps you must take.</p>
<h4 id="heading-why-you-cant-just-scale-from-5-to-3-instantly">Why you can’t just scale from 5 to 3 instantly</h4>
<p>If you reduce your cluster size too quickly, you might:</p>
<ul>
<li><p>Lose data redundancy or fail to meet the required replication factor.</p>
</li>
<li><p>Cause data rebalancing to happen under heavy load, which can slow queries.</p>
</li>
<li><p>Put your cluster into a state where certain ranges or data replicas don’t have enough copies to remain fault-tolerant.</p>
</li>
</ul>
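<p>You can check for exactly this situation before and after any scaling operation. As a hedged example (run it from a pod or machine that has the <code>cockroach</code> binary, using the service name from this tutorial), the <code>--ranges</code> flag adds range-health columns to the node status output:</p>
<pre><code class="lang-bash"># Show per-node range health, including the ranges_underreplicated column
./cockroach node status --ranges --insecure --host crdb-cockroachdb-public
</code></pre>
<p>If <code>ranges_underreplicated</code> is non-zero on any node, wait for the cluster to finish rebalancing before removing another node.</p>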
<h4 id="heading-the-correct-approach-decommission-first-then-scale-down-one-node-at-a-time">✅ The correct approach: Decommission first, then scale down one node at a time</h4>
<p>Here’s the safe way to downscale:</p>
<ol>
<li><p><strong>Decommission</strong> the node you plan to remove.</p>
</li>
<li><p>Once decommissioning is complete, <strong>reduce the replica count</strong> (for example, from 5 to 4).</p>
</li>
<li><p>Delete the disk/PVC tied to that removed node.</p>
</li>
<li><p>Repeat the process (remove one node at a time) until you reach your target size (for example, down to 3 replicas).</p>
</li>
</ol>
<h4 id="heading-step-by-step-decommission-the-5th-node-before-scaling-5-to-4">Step-by-step: Decommission the 5th node (before scaling 5 to 4)</h4>
<ol>
<li><p><strong>Create a client pod</strong> to run CockroachDB commands.<br> Create a file named <code>cockroachdb-client.yml</code> with this content:</p>
<pre><code class="lang-yaml"> <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">cockroachdb-client</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">serviceAccountName:</span> <span class="hljs-string">&lt;SA&gt;</span>
   <span class="hljs-attr">containers:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">cockroachdb-client</span>
       <span class="hljs-attr">image:</span> <span class="hljs-string">cockroachdb/cockroach:v25.3.1</span>
       <span class="hljs-attr">imagePullPolicy:</span> <span class="hljs-string">IfNotPresent</span>
       <span class="hljs-attr">command:</span>
         <span class="hljs-bullet">-</span> <span class="hljs-string">sleep</span>
         <span class="hljs-bullet">-</span> <span class="hljs-string">"2147483648"</span>
   <span class="hljs-attr">terminationGracePeriodSeconds:</span> <span class="hljs-number">300</span>
</code></pre>
<p> Replace <code>&lt;SA&gt;</code> with your CockroachDB service account name (find it via <code>kubectl get sa -l app.kubernetes.io/name=cockroachdb</code>).</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762620657038/34d5eb4b-de16-4e8a-b85c-1e7bf6b76172.png" alt="The CockroachDB service account details" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p>Apply the manifest:</p>
<pre><code class="lang-bash"> kubectl apply -f cockroachdb-client.yml
</code></pre>
</li>
<li><p>Confirm the pod is running:</p>
<pre><code class="lang-bash"> kubectl get pods
</code></pre>
<p> You should see <code>cockroachdb-client</code>.</p>
</li>
<li><p>Exec into the client pod:</p>
<pre><code class="lang-bash"> kubectl exec -it cockroachdb-client -- bash
</code></pre>
</li>
<li><p>Get the list of nodes and IDs:</p>
<pre><code class="lang-bash"> ./cockroach node status --insecure --host &lt;SERVICE_NAME&gt;
</code></pre>
<p> Find your service name: <code>kubectl get svc -l app.kubernetes.io/component=cockroachdb</code>. In our case it’s <code>crdb-cockroachdb-public</code>.</p>
<p> You’ll see nodes with IDs 1, 2, 3, 4, 5. Each maps to a replica pod like <code>crdb-cockroachdb-0</code>, <code>-1</code>, <code>-2</code>, <code>-3</code>, <code>-4</code>.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762620790692/af8d382e-71db-4eab-af7a-a3491d98c8a8.png" alt="The nodes in the CockroachDB cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p><strong>Decommission the node with the highest index</strong> (since Kubernetes will remove the highest-numbered replica when scaling down).<br> For example, if you’re removing the pod <code>crdb-cockroachdb-4…</code>, and the node ID is 5:</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762620838125/b51856cb-2fbb-4b24-ba41-21f572c7678c.png" alt="The node to be decommissioned" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p> Run the command below to decommission the 5th node.</p>
<pre><code class="lang-bash"> ./cockroach node decommission 5 --host crdb-cockroachdb-public --insecure
</code></pre>
</li>
<li><p>Navigate to the CockroachDB dashboard, and monitor until the node status shows as <code>decommissioned</code>.<br> In the CockroachDB Console’s Cluster Overview page, you’ll see formerly removed nodes under “Recently Decommissioned Nodes”.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762620923692/e678b21b-e2cc-4fe5-bd5b-46c4b0248958.png" alt="e678b21b-e2cc-4fe5-bd5b-46c4b0248958" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p><strong>Scale down the replicas</strong> in your Helm values file:</p>
<pre><code class="lang-yaml"> <span class="hljs-attr">statefulset:</span>
   <span class="hljs-attr">replicas:</span> <span class="hljs-number">4</span>
 <span class="hljs-string">...</span>
</code></pre>
<p> Then run:</p>
<pre><code class="lang-bash"> helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
</li>
<li><p>Verify pods:</p>
<pre><code class="lang-bash"> kubectl get pods
</code></pre>
<p> You should now see 4 CockroachDB replica pods.</p>
</li>
<li><p><strong>Delete the PVC</strong> for the removed node (to avoid paying for storage you’re no longer using):</p>
</li>
</ol>
<pre><code class="lang-bash">kubectl delete pvc datadir-crdb-cockroachdb-4
</code></pre>
<ol start="11">
<li>Repeat the process for the next node if you want to go from 4 to 3 replicas: decommission node #4 next, scale to 3, delete its PVC, and so on.</li>
</ol>
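<p>The whole procedure above can be condensed into one script. This is a hedged sketch, not a drop-in tool: it assumes the release name <code>crdb</code>, the <code>cockroachdb-client</code> pod, and the service name used in this tutorial, and that node IDs map to pod ordinals in order (node 5 = pod <code>crdb-cockroachdb-4</code>), which you should verify with <code>node status</code> first:</p>
<pre><code class="lang-bash"># Hedged sketch: scale from 5 replicas down to 3, one node at a time
for node_id in 5 4; do
  pod_index=$((node_id - 1))
  # Decommission the node; this command waits until decommissioning finishes
  kubectl exec cockroachdb-client -- \
    ./cockroach node decommission "$node_id" --host crdb-cockroachdb-public --insecure
  # Drop the replica count by one
  helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml \
    --set statefulset.replicas="$pod_index"
  # Delete the disk of the removed pod so you stop paying for it
  kubectl delete pvc "datadir-crdb-cockroachdb-$pod_index"
done
</code></pre>
<p>In practice, you should still confirm each node shows as <code>decommissioned</code> in the Admin UI between iterations, as described above.</p>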
<p>After you’re done, you’ll have the target state (for example, 3 nodes) safely and cleanly without causing cluster instability or data loss.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762621007089/cf7fce07-a3a6-4b01-9536-1d5476c2119e.png" alt="Scaling down to 3 nodes, the nodes status on the CockroachDB dashboard" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>To learn more about scaling down your CockroachDB nodes, visit the <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/scale-cockroachdb-kubernetes?filters=helm#remove-nodes">official CockroachDB docs</a>.</p>
<p>Note that you should <strong>NOT</strong> use Horizontal Pod Autoscalers for scaling up and down your CockroachDB cluster.</p>
<p>Remember, before scaling down, you need to <strong>DECOMMISSION THE NODES FIRST</strong>, and <strong>scale down ONE AT A TIME</strong>!</p>
<p>However, Horizontal Pod Autoscalers do NOT obey this. So if you intend to auto-scale your CockroachDB cluster, it's best to keep a fixed number of replicas, for example, 3, 5, or 7.</p>
<p>Then set up a Vertical Pod Autoscaler to scale their CPU and RAM (remember to set the memory and CPU requests and limits to the same values to prevent eviction, as explained earlier).</p>
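<p>As an illustration, a VerticalPodAutoscaler manifest targeting the StatefulSet might look roughly like this. This is a hedged sketch: it assumes the VPA custom resources are installed in your cluster and that the StatefulSet is named <code>crdb-cockroachdb</code> (check yours with <code>kubectl get statefulset</code>):</p>
<pre><code class="lang-yaml">apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: crdb-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: crdb-cockroachdb
  updatePolicy:
    updateMode: "Auto" # VPA evicts and recreates pods with updated resources
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 1
          memory: 2Gi
        maxAllowed:
          cpu: 4
          memory: 8Gi
</code></pre>
<p>Because the VPA preserves the request-to-limit ratio when it resizes pods, keeping requests equal to limits in your values file means the resized pods stay in that safe configuration too.</p>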
<h2 id="heading-what-to-consider-when-deploying-cockroachdb-on-google-kubernetes-engine-gke">What to Consider When Deploying CockroachDB on Google Kubernetes Engine (GKE) ☁️</h2>
<p>Up until now we’ve been working in a <strong>development environment</strong> (using Minikube, local setups), testing and learning.</p>
<p>Now we’re ready to move into <strong>production mode 🤓</strong>. And one of the best places to host CockroachDB in production is on GKE.</p>
<p>In this section, we’ll cover GKE-specific considerations, such as storage classes, load balancers, networking, and how to secure our CockroachDB cluster on GKE using mTLS for authenticating our clients and encrypting any data sent to and from our CockroachDB cluster.</p>
<h3 id="heading-creating-your-gke-cluster">Creating Your GKE Cluster</h3>
<p>To get started, head over to the <a target="_blank" href="https://console.cloud.google.com/"><strong>Google Cloud Console</strong></a>.</p>
<p>In the search bar at the top, type “Kubernetes” and click on “Kubernetes Engine” from the dropdown.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762836788168/0d509529-69fb-4308-ba05-6a1426ee7fe1.png" alt="Searching the Kubernetes Engine resource" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You’ll be taken to the Kubernetes Engine page. On the left sidebar, click “Clusters.” Then click the “Create” button at the top.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762836843514/fc6d59a2-5b9d-4dee-9fea-7bbb7fc2a023.png" alt="Creating a new cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>💡 <strong>Note:</strong> You’ll need to enable the <strong>Compute Engine API</strong> before you can create a GKE cluster. If you haven’t done that yet, Google Cloud will automatically redirect you to a page where you can enable it. Just click “Enable”, then return to the cluster page.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763998084001/3ecbe47c-3def-4f9c-bc80-dabe2c0002c8.png" alt="Enabling the Compute Engine API" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You can also learn more about enabling APIs in Google Cloud here: <a target="_blank" href="https://docs.cloud.google.com/endpoints/docs/openapi/enable-api">Enable APIs in Google Cloud</a>.</p>
<p>Once you’re back, you’ll see the cluster creation page. If it defaults to Autopilot, click “Switch to Standard cluster” in the top-right corner. This gives you more control over node settings.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762836938958/a2c35e79-6404-4c3a-a821-94d4ce926839.png" alt="Switching to Standard Cluster settings" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Under Cluster basics, give your cluster a name – something like <code>cockroachdb-tutorial</code> works great! Then, set Location type to Zonal (that’s fine for now).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762836985443/eb7b1f79-66e3-4ca4-bfe3-842c5571509b.png" alt="Configuring Zonal clusters" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>On the left sidebar, go to “Node pools.” You’ll see a default pool already added.</p>
<ul>
<li><p>Keep the name as is.</p>
</li>
<li><p>Set the Number of nodes to 1.</p>
</li>
<li><p>Enable the Cluster autoscaler option (so it can scale up automatically later).</p>
</li>
<li><p>Set the Maximum number of Nodes to 10, and the minimum to 0.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762918866561/89a00b2c-46e8-440d-8662-77386cc2cf0e.png" alt="Modifying our default node pool, the cluster autoscaler, etc" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
</ul>
<p>Next, click the dropdown arrow beside “default-pool” and select “Nodes.” Here, set up your node specifications:</p>
<ul>
<li><p><strong>VM family:</strong> <code>E2</code></p>
</li>
<li><p><strong>Machine type:</strong> <code>Custom</code></p>
</li>
<li><p><strong>vCPUs:</strong> 2</p>
</li>
<li><p><strong>Memory:</strong> 7 GB</p>
</li>
<li><p><strong>Boot disk type:</strong> Standard persistent disk</p>
</li>
<li><p><strong>Disk size:</strong> 50 GB</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762837157043/89da8297-8ecc-4369-aef5-c3b0e75e37be.png" alt="Configuring the E2 Machine type" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762920102117/173a1d66-d31b-49e3-835b-436ec2781b49.png" alt="Configuring our default pool CPU, RAM, and disk" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
</ul>
<p>When all that’s set, click “Create.” Your cluster will start provisioning.</p>
<h3 id="heading-connecting-to-your-gke-cluster">Connecting to your GKE cluster</h3>
<p>Once your GKE cluster creation is complete (this might take a few minutes), you’ll see something like this in the console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762844143298/042cc870-82ae-4981-b7c8-d80b187f37a9.png" alt="Accessing out new cluster page" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Next, click the “Connect” link at the top of the page. A modal will pop up. Copy the CLI command you see.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762844213835/119b603c-26c3-46ee-83e1-8feba78031a7.png" alt="Getting the command to access the cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>It’ll look something like:</p>
<pre><code class="lang-bash">gcloud container clusters get-credentials cockroachdb-tutorial --zone us-central1-a --project &lt;PROJECT_NAME&gt;
</code></pre>
<p>📌 <strong>Note:</strong> To run this command successfully, you need to have the <code>gcloud</code> CLI tool installed. If you don’t have it yet, visit <a target="_blank" href="https://docs.cloud.google.com/sdk/docs/install">Install Google Cloud SDK</a> and pick the steps for your OS.</p>
<p>After installing the <code>gcloud</code> CLI, run:</p>
<pre><code class="lang-bash">gcloud auth login
</code></pre>
<p>This authenticates your terminal with your Google Cloud account so you can access the cluster securely.</p>
<p>After authenticating your terminal with access to Google Cloud, run the command you copied earlier. You should see something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762844890936/12e6d8a7-b0ae-44d1-a77c-aeb118ba269b.png" alt="The command to connect your terminal to the newly created Kubernetes cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now run the command to retrieve your pods, <code>kubectl get po</code>. This will retrieve the pods from your new cluster on Google Kubernetes Engine, not Minikube.</p>
<p>We haven’t deployed anything yet, so the namespace should be empty.</p>
<p>But we should have at least 1 worker node available. Run the <code>kubectl get nodes</code> command to view it. You should see something similar to this (GKE takes care of our control plane for us, so when we view the nodes, we’ll only see the worker nodes).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762917947091/c29eb598-1723-43d0-a77f-c6611d04d3d8.png" alt="The available nodes in the GKE cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-deploying-cockroachdb-in-production-on-gke">Deploying CockroachDB in Production (on GKE)</h3>
<p>Now that we’ve successfully created our Google Kubernetes Engine (GKE) cluster, it’s time to deploy our CockroachDB cluster in it – this time, in production mode.</p>
<p>Unlike our earlier Minikube setup (which we used for local development), deploying to GKE introduces new considerations like security, storage classes, and authentication methods – all tailored for a real-world production environment.</p>
<p>To get started, create a new file called <code>cockroachdb-production.yml</code>, and paste the following configuration inside:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">statefulset:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"3Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"3Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">serviceAccount:</span>
    <span class="hljs-attr">create:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">"crdb-cockroachdb"</span>
    <span class="hljs-attr">annotations:</span>
      <span class="hljs-attr">iam.gke.io/gcp-service-account:</span> <span class="hljs-string">&lt;GOOGLE_SERVICE_ACCOUNT&gt;</span>

<span class="hljs-attr">storage:</span>
  <span class="hljs-attr">persistentVolume:</span>
    <span class="hljs-attr">size:</span> <span class="hljs-string">10Gi</span>
    <span class="hljs-attr">storageClass:</span> <span class="hljs-string">premium-rwo</span>

<span class="hljs-attr">tls:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>

<span class="hljs-attr">init:</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app.kubernetes.io/component:</span> <span class="hljs-string">init</span>
  <span class="hljs-attr">jobs:</span>
    <span class="hljs-attr">wait:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>Replace the placeholder <code>&lt;GOOGLE_SERVICE_ACCOUNT&gt;</code> with the <strong>CockroachDB backup service account</strong> you created earlier (in the “Backing Up CockroachDB to Google Cloud Storage” section). It should look something like this: <code>cockroachdb-backup@&lt;PROJECT_ID&gt;.iam.gserviceaccount.com</code>.</p>
<h3 id="heading-understanding-the-configuration">Understanding the Configuration</h3>
<p>Let’s break down what’s happening in this production Helm values configuration and how it differs from the one we used in Minikube.👇🏽</p>
<h4 id="heading-1-modified-the-statefulset-configuration">1. Modified the <code>statefulset</code> Configuration</h4>
<p>We’re allocating 3 GiB of RAM and 1 vCPU to each replica, both as requests and limits.</p>
<p>This ensures that each node has guaranteed resources and prevents Kubernetes from evicting pods for using more than they requested.</p>
<p>We also defined a <strong>service account</strong> and annotated it with a GCP service account using the <code>iam.gke.io/gcp-service-account</code> annotation.</p>
<p>This annotation allows CockroachDB to securely access Google Cloud services (like Google Cloud Storage) without using static JSON key files (key.json), thanks to a GKE feature called <strong>Workload Identity</strong>.</p>
<p>In production, we let GKE handle authentication to Google services instead of mounting key files.</p>
<h4 id="heading-2-removed-podsecuritycontext">2. Removed <code>podSecurityContext</code></h4>
<p>In Minikube, we included this section:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">podSecurityContext:</span>
  <span class="hljs-attr">fsGroup:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">runAsUser:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">runAsGroup:</span> <span class="hljs-number">1000</span>
<span class="hljs-string">...</span>
</code></pre>
<p>We did that to give CockroachDB permission to access our local disk for persistent storage. But in GKE, this isn’t needed. Google Cloud handles storage mounting securely on our behalf, so we can safely omit this part.</p>
<h4 id="heading-3-removed-podantiaffinity-and-nodeselector">3. Removed <code>podAntiAffinity</code> and <code>nodeSelector</code></h4>
<p>In our Minikube deployment, we used:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">podAntiAffinity:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">""</span>
<span class="hljs-attr">nodeSelector:</span>
  <span class="hljs-attr">kubernetes.io/hostname:</span> <span class="hljs-string">minikube</span>
<span class="hljs-string">...</span>
</code></pre>
<p>That was just to <strong>force all CockroachDB instances to run on the same node</strong> on Minikube.</p>
<p>But in production, we <em>want</em> each replica on a different VM. This ensures high availability, even if one VM fails, only one CockroachDB replica is affected, and the cluster stays active.</p>
<p>Since our cluster uses a replication factor of 3, at least 2 replicas (a quorum) must remain active for the database to stay online – otherwise, it becomes unavailable 🥲.</p>
<h4 id="heading-4-removed-env-volumes-and-volumemounts">4. Removed <code>env</code>, <code>volumes</code>, and <code>volumeMounts</code></h4>
<p>In Minikube, we had to manually mount the Service Account key:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">env:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">GOOGLE_APPLICATION_CREDENTIALS</span>
    <span class="hljs-attr">value:</span> <span class="hljs-string">/var/run/gcp/key.json</span>
<span class="hljs-attr">volumes:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gcp-sa</span>
    <span class="hljs-attr">secret:</span>
      <span class="hljs-attr">secretName:</span> <span class="hljs-string">gcs-key</span>
<span class="hljs-attr">volumeMounts:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gcp-sa</span>
    <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/var/run/gcp</span>
    <span class="hljs-attr">readOnly:</span> <span class="hljs-literal">true</span>
<span class="hljs-string">...</span>
</code></pre>
<p>This was needed so CockroachDB could access our Google Cloud Storage bucket for backups.</p>
<p>But in production, we don’t use key files. Instead, we use a GKE feature called Workload Identity.</p>
<p>It securely binds a Kubernetes Service Account to a Google Service Account, giving our CockroachDB pods the same permissions as the GCP account: no keys, no secrets, and much safer 🔒</p>
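<p>Note that the annotation alone isn’t enough: Workload Identity must be enabled on the GKE cluster, and the two accounts must actually be bound together on the Google Cloud side. As a hedged example (the placeholders and the <code>default</code> namespace are assumptions from this tutorial’s setup), the binding is created with:</p>
<pre><code class="lang-bash"># Allow the Kubernetes service account (default/crdb-cockroachdb)
# to impersonate the Google service account
gcloud iam service-accounts add-iam-policy-binding \
  cockroachdb-backup@&lt;PROJECT_ID&gt;.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:&lt;PROJECT_ID&gt;.svc.id.goog[default/crdb-cockroachdb]"
</code></pre>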
<h4 id="heading-5-updated-storagepersistentvolumestorageclass">5. Updated <code>storage.persistentVolume.storageClass</code></h4>
<p>In Minikube, we used a standard disk:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">storage:</span>
  <span class="hljs-attr">persistentVolume:</span>
    <span class="hljs-attr">size:</span> <span class="hljs-string">5Gi</span>
    <span class="hljs-attr">storageClass:</span> <span class="hljs-string">standard</span>
<span class="hljs-string">...</span>
</code></pre>
<p>But for production, we’re switching to a faster SSD:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">storage:</span>
  <span class="hljs-attr">persistentVolume:</span>
    <span class="hljs-attr">size:</span> <span class="hljs-string">10Gi</span>
    <span class="hljs-attr">storageClass:</span> <span class="hljs-string">premium-rwo</span>
<span class="hljs-string">...</span>
</code></pre>
<p>This uses Google Cloud’s <code>pd-ssd</code> disk type which is the recommended choice for CockroachDB due to its <strong>high IOPS</strong> (read/write operations per second) and <strong>throughput</strong>. This gives our cluster faster read and write speeds under load, leading to better performance.</p>
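<p>GKE ships <code>premium-rwo</code> out of the box, so you don’t need to create anything yourself. But for reference, an equivalent custom StorageClass would look roughly like this (a hedged sketch using the GKE Persistent Disk CSI driver; the name <code>fast-ssd</code> is made up):</p>
<pre><code class="lang-yaml">apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd # Google Cloud SSD persistent disk
volumeBindingMode: WaitForFirstConsumer # Provision the disk in the same zone as the pod
reclaimPolicy: Delete
</code></pre>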
<h4 id="heading-6-enabled-tls-for-secure-communication">6. Enabled TLS for Secure Communication</h4>
<p>In development, we disabled TLS:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">tls:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>
</code></pre>
<p>That made it easier and simpler to connect without dealing with certificates.</p>
<p>But in production, security is non-negotiable. We’re enabling TLS to ensure that all communication with CockroachDB is encrypted in transit, and that only clients with <strong>valid certificates</strong> (signed by the same authority) can connect. This is <strong>mutual TLS (mTLS)</strong> authentication.</p>
<p>mTLS ensures that both sides (client and server) prove who they are, preventing impersonation or man-in-the-middle attacks. It’s one of the strongest ways to secure a production database connection.</p>
<p>To learn more about TLS and mTLS encryption, check out:</p>
<ul>
<li><p><a target="_blank" href="https://www.freecodecamp.org/news/understanding-website-encryption/">Understanding Website Encryption (freeCodeCamp)</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/@LukV/mutual-tls-mtls-a-deep-dive-into-secure-client-server-communication-bbb83f463292">Mutual TLS Deep Dive (Medium)</a></p>
</li>
</ul>
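<p>To give you a feel for what an mTLS connection looks like in practice, here’s a hedged example of connecting with the <code>cockroach</code> CLI through a PostgreSQL-style connection URL. It assumes the CA certificate, client certificate, and client key files already exist locally (the file names follow CockroachDB’s conventions for the <code>root</code> user):</p>
<pre><code class="lang-bash"># verify-full: encrypt traffic, verify the server's certificate,
# and present our own client certificate for mTLS
./cockroach sql --url "postgresql://root@localhost:26259/defaultdb?sslmode=verify-full&amp;sslrootcert=ca.crt&amp;sslcert=client.root.crt&amp;sslkey=client.root.key"
</code></pre>
<p>The same <code>sslmode</code>, <code>sslrootcert</code>, <code>sslcert</code>, and <code>sslkey</code> parameters work in most PostgreSQL-compatible clients, including GUI tools.</p>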
<h3 id="heading-installing-the-cockroachdb-cluster-on-gke">Installing the CockroachDB Cluster on GKE</h3>
<p>We’ll use the values file you created (<code>cockroachdb-production.yml</code>) and deploy our CockroachDB cluster in our GKE cluster using Helm.</p>
<h4 id="heading-deploy-the-cluster">Deploy the cluster</h4>
<p>Run the following command:</p>
<pre><code class="lang-bash">helm install crdb cockroachdb/cockroachdb -f cockroachdb-production.yml
</code></pre>
<p>This command tells Helm to install a release named <code>crdb</code> using the <code>cockroachdb/cockroachdb</code> chart with your custom production-values file.</p>
<p>This step will take a few minutes. GKE will spin up 3 (or more) worker nodes to host the CockroachDB replicas.</p>
<p>Thanks to pod anti-affinity rules, you’ll typically see <strong>one replica pod per VM</strong> (which improves fault tolerance).</p>
<h4 id="heading-verify-the-pods">Verify the pods</h4>
<p>Once provisioning is done, check the pods:</p>
<pre><code class="lang-bash">kubectl get pods
</code></pre>
<p>You should see three CockroachDB replica pods (for example: <code>crdb-cockroachdb-0</code>, <code>crdb-cockroachdb-1</code>, <code>crdb-cockroachdb-2</code>) in <code>Running</code> status.</p>
<h4 id="heading-verify-the-storage-class-ssd">Verify the storage class (SSD)</h4>
<p>Now check the persistent volume claims to confirm they’re using the fast SSD storage class you requested:</p>
<pre><code class="lang-bash">kubectl get pvc
</code></pre>
<p>Look for your PVCs (persistent volume claims) and check the <code>STORAGECLASS</code> column. You should see something like <code>premium-rwo</code> instead of <code>standard</code> or <code>standard-rwo</code>. This confirms that your replicas are using the high-performance disk type you configured.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762928441524/d7e3d17f-c144-468f-8cc5-d71628ac6a3b.png" alt="The CockroachDB replicas and disks in production" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>📌 This is important, because in production you want good disk IOPS and throughput. Slower disks can bottleneck the database.</p>
<h3 id="heading-connecting-to-our-cockroachdb-cluster-now-that-tls-mtls-are-enabled">Connecting to Our CockroachDB Cluster (Now That TLS + mTLS Are Enabled)</h3>
<p>Now that we’ve enabled TLS encryption and mTLS authentication, let’s actually try connecting to the cluster so you can <em>see</em> what this security setup looks like in action.</p>
<p>We’ll break down in more detail what TLS and mTLS mean shortly. But for now, let’s jump straight into trying to connect – because once you see the behavior, the explanation becomes much easier to understand.</p>
<h4 id="heading-step-1-expose-the-cockroachdb-cluster-to-your-local-pc-using-port-forwarding">Step 1: Expose the CockroachDB Cluster to Your Local PC (Using Port Forwarding)</h4>
<p>Just like we've been doing from the start, we’ll expose our CockroachDB cluster through <strong>port-forwarding</strong>.</p>
<p>Open a new terminal window and run:</p>
<pre><code class="lang-bash">kubectl port-forward svc/crdb-cockroachdb-public 26259:26257
</code></pre>
<p>What this means:</p>
<ul>
<li><p>The first port (26259) is the port on your computer.</p>
</li>
<li><p>The second port (26257) is the port inside the CockroachDB cluster.</p>
</li>
<li><p>Format is: <code>&lt;YOUR_COMPUTER_PORT&gt;</code> <strong>:</strong> <code>&lt;COCKROACHDB_PORT&gt;</code></p>
</li>
</ul>
<p>So now, CockroachDB will be reachable locally at <code>localhost:26259</code>.</p>
<h4 id="heading-step-2-open-beekeeper-studio-and-create-a-fresh-connection">Step 2: Open Beekeeper Studio and Create a Fresh Connection</h4>
<p>If Beekeeper Studio is still connected to our old Minikube cluster, or you're not seeing the “new connection” screen, just press <code>Ctrl + Shift + N</code>. This opens a new connection window instantly.</p>
<h4 id="heading-step-3-enter-the-connection-details">Step 3: Enter the Connection Details</h4>
<p>Now fill in these fields:</p>
<ul>
<li><p><strong>Port:</strong> <code>26259</code></p>
</li>
<li><p><strong>User:</strong> <code>root</code></p>
</li>
<li><p><strong>Default Database:</strong> <code>defaultdb</code></p>
</li>
</ul>
<p>Now click Test Connection.</p>
<p>And boom! You should see a message telling you something like:</p>
<blockquote>
<p>“This cluster is running in secure mode. You must use SSL to connect.”</p>
</blockquote>
<p>It’ll look similar to this:👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763193779864/f3e7abcb-34b0-4c21-8652-48a03e4ff6c9.png" alt="Trying to connect to the new CockroachDB cluster in insecure mode" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This is good: it means our CockroachDB cluster is officially in <strong>secure mode</strong>, and it’s rejecting any connection that doesn’t include proper TLS certificates.</p>
<h3 id="heading-connecting-via-mutual-tls-mtls-why-we-need-a-certificate-for-our-root-user">Connecting via Mutual TLS (mTLS) — Why We Need a Certificate for Our <code>root</code> User</h3>
<p>Now that our CockroachDB cluster is officially running in secure mode, we can’t just connect to it with a username and port anymore. CockroachDB won’t accept that.</p>
<p>To talk to it, <strong>we must connect using Mutual TLS (mTLS)</strong>.</p>
<p>Why? Because TLS alone only protects the connection in one direction (you verifying the server). mTLS protects the connection in both directions (you verify the server, and the server also verifies <em>you</em>).</p>
<p>Let’s break this down in simple, everyday English 👇🏾</p>
<h4 id="heading-why-tls-exists-in-the-first-place">Why TLS Exists in the First Place</h4>
<p>Whenever you send anything to CockroachDB – a query, a connection attempt, a password – it’s all data moving over a network, such as the internet.</p>
<p>Without protection, anyone could intercept that traffic and read the data being sent to your DB while it’s on its way.<br>TLS fixes that. :)</p>
<p>✔️ The CockroachDB cluster has its own <strong>public key + private key</strong><br>✔️ It has a <strong>certificate</strong> that carries its public key<br>✔️ When you connect, the cluster sends you this certificate<br>✔️ Your database tool (for example, Beekeeper) uses the public key to set up the encryption for all the traffic you send to the DB<br>✔️ Only CockroachDB can decrypt it, with the help of its private key</p>
<p>This gives you encryption and proof you’re really talking to CockroachDB, not some fake service pretending to be it.</p>
<h4 id="heading-why-mtls-exists-mutual-tls">Why mTLS Exists (Mutual TLS)</h4>
<p>TLS protects the server – CockroachDB. mTLS protects <strong>both sides</strong> – you and CockroachDB.</p>
<p>So CockroachDB also wants YOU to send your certificate.</p>
<p>But not just any certificate. Your certificate must be:</p>
<ul>
<li><p>Signed by <strong>THE SAME Certificate Authority (CA)</strong></p>
</li>
<li><p>Trusted by the CockroachDB cluster</p>
</li>
<li><p>Mapped to a CockroachDB user (like <code>root</code>)</p>
</li>
</ul>
<p>This is how CockroachDB says:</p>
<blockquote>
<p>“Let me see your certificate so I know you’re someone I should allow in.”</p>
</blockquote>
<p>And we reply:</p>
<blockquote>
<p>“Here is my certificate, signed by the same CA that signed yours.”</p>
</blockquote>
<p>At that point, both sides trust each other.</p>
<p>If this still feels abstract, <a target="_blank" href="https://www.youtube.com/watch?v=EnY6fSng3Ew">watch this video</a>. It explains TLS beautifully.</p>
<h3 id="heading-lets-explore-our-clusters-certificate">Let’s Explore Our Cluster’s Certificate</h3>
<p>Remember that the Helm chart automatically created:</p>
<ul>
<li><p>The CockroachDB Certificate Authority</p>
</li>
<li><p>The CockroachDB node certificates</p>
</li>
<li><p>The keypairs used for encryption</p>
</li>
</ul>
<p>You can list all the CockroachDB-related Kubernetes secrets with:</p>
<pre><code class="lang-bash">kubectl get secrets
</code></pre>
<p>The one we're interested in is:</p>
<pre><code class="lang-bash">crdb-cockroachdb-node-secret
</code></pre>
<p>If you inspect this secret, you’ll see three keys inside:</p>
<ul>
<li><p><code>ca.crt</code>: the CA’s public certificate</p>
</li>
<li><p><code>tls.key</code>: the CockroachDB node’s private key</p>
</li>
<li><p><code>tls.crt</code>: the CockroachDB node certificate</p>
</li>
</ul>
<p>Now let’s decode the CockroachDB node certificate.</p>
<p>Run this:</p>
<pre><code class="lang-bash">kubectl get secret crdb-cockroachdb-node-secret -o jsonpath=<span class="hljs-string">'{.data.tls\.crt}'</span> | base64 -d &gt; crdb-node.crt
</code></pre>
<p>This gives you the raw certificate (which looks like gibberish):</p>
<pre><code class="lang-bash">-----BEGIN CERTIFICATE-----
MIIEGDCCAwCgAwIBAgIQWgOPJa4OLoZZjcXLgDF3bjANBgkqhkiG9w0BAQsFADAr
...
-----END CERTIFICATE-----
</code></pre>
<p>Let’s decode it into something readable:</p>
<pre><code class="lang-bash">openssl x509 -<span class="hljs-keyword">in</span> ./crdb-node.crt -text -noout &gt; crdb-node.crt.decoded
</code></pre>
<p>Open the <code>crdb-node.crt.decoded</code> file. This is the <strong>human-readable</strong> CockroachDB cluster certificate.</p>
<p><strong>N.B.:</strong> You need the <code>openssl</code> tool installed to make the certificate human-readable. If you don’t have it, <a target="_blank" href="https://github.com/openssl/openssl#download">install it by following this guide</a>.</p>
<h3 id="heading-understanding-the-certificate-sections-explained-super-simply">Understanding the Certificate Sections (Explained Super Simply)</h3>
<h4 id="heading-1-issuer">1. Issuer</h4>
<p>You’ll see something like:</p>
<pre><code class="lang-bash">Issuer: O = Cockroach, CN = Cockroach CA
</code></pre>
<p>This tells us:</p>
<ul>
<li><p>The certificate was signed by a Certificate Authority created by the Helm chart</p>
</li>
<li><p>The <strong>Organization (O)</strong> is “Cockroach”</p>
</li>
<li><p>The <strong>Common Name (CN)</strong> is “Cockroach CA”</p>
</li>
</ul>
<p>This basically means:</p>
<blockquote>
<p>“This certificate comes from the CockroachDB internal CA.”</p>
</blockquote>
<h4 id="heading-2-subject">2. Subject</h4>
<p>You’ll also see this:</p>
<pre><code class="lang-bash">Subject: O = Cockroach, CN = node
</code></pre>
<p>What does this mean?</p>
<p><strong>Organization = Cockroach</strong></p>
<ul>
<li><p>This simply groups all CockroachDB-generated certificates under one “organization label.”</p>
</li>
<li><p>It doesn’t refer to the company. It’s just a logical grouping created by CockroachDB’s built-in toolset.</p>
</li>
</ul>
<p><strong>Common Name = node</strong></p>
<ul>
<li><p>This tells CockroachDB that this certificate belongs to a <strong>cluster node</strong>, not a user or a client machine.</p>
</li>
<li><p>In CockroachDB, node certificates are used for:</p>
<ol>
<li><p>DB-to-DB communication</p>
</li>
<li><p>cluster gossip</p>
</li>
<li><p>handling incoming connections from clients (you)</p>
</li>
</ol>
</li>
</ul>
<p>So this certificate is saying:</p>
<blockquote>
<p>“Hi, I’m a CockroachDB node. Please trust me as part of the cluster.”</p>
</blockquote>
<h4 id="heading-3-extended-key-usage-eku">3. Extended Key Usage (EKU)</h4>
<p>Scroll down and you’ll see:</p>
<pre><code class="lang-bash">X509v3 Extended Key Usage:
    TLS Web Server Authentication
    TLS Web Client Authentication
</code></pre>
<p>This is <em>super important</em>, because it defines <strong>how</strong> this certificate is allowed to be used.</p>
<p>Let’s simplify it:</p>
<h4 id="heading-tls-web-server-authentication">TLS Web Server Authentication</h4>
<p>This means:</p>
<blockquote>
<p>“This certificate can be presented <strong>by a server</strong> to prove its identity.”</p>
</blockquote>
<p>In our case, the CockroachDB node uses this certificate to prove to you (the client) that it is the real CockroachDB server. Think of it like flashing an ID card before letting you in.</p>
<h4 id="heading-tls-web-client-authentication">TLS Web Client Authentication</h4>
<p>This means:</p>
<blockquote>
<p>“This certificate can also be used <strong>as a client certificate</strong>.”</p>
</blockquote>
<p>Why would a server have a client certificate? Well, because in CockroachDB, nodes (DBs) talk to each other. When node A connects to node B, node A is a <strong>client</strong>, and node B is a <strong>server</strong>.</p>
<p>So the same certificate serves two roles. Your local machine will use a different certificate, created specifically for your <code>root</code> user. We’ll generate that soon.</p>
<h3 id="heading-creating-a-client-certificate-so-we-can-finally-connect-to-cockroachdb">Creating a Client Certificate (So We Can Finally Connect to CockroachDB)</h3>
<p>Now that we’ve seen how the CockroachDB node certificate works, let’s generate our client certificate – the one we’ll use to connect from Beekeeper Studio.</p>
<p>Remember: CockroachDB is running in secure mode, so it won’t accept any connection that doesn’t come with a valid, signed certificate.</p>
<p>To fix that, let’s build a tiny Kubernetes pod whose only job is to create a certificate for our <code>root</code> SQL user.</p>
<h4 id="heading-step-1-create-a-file-called-gen-root-certyml">Step 1: Create a File Called <code>gen-root-cert.yml</code></h4>
<p>Paste this into it:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">gen-root-cert</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-ca</span>
      <span class="hljs-attr">secret:</span>
        <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-cockroachdb-ca-secret</span>
        <span class="hljs-attr">items:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">ca.crt</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">ca.crt</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">ca.key</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">ca.key</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gen</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">cockroachdb/cockroach:v25.3.1</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">"sh"</span>, <span class="hljs-string">"-ec"</span>]
      <span class="hljs-attr">args:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">|
          mkdir -p /out
</span>
          <span class="hljs-comment"># Copy the CockroachDB cluster Certificate Authority certificate file `ca.crt` (for Mutual TLS authentication)</span>
          <span class="hljs-string">cp</span> <span class="hljs-string">/ca/ca.crt</span> <span class="hljs-string">/out/ca.crt</span>

          <span class="hljs-comment"># Create the client certificate and key pair for the SQL user 'root' using the CockroachDB cluster Certificate Authority private key `ca.key`</span>
          <span class="hljs-string">/cockroach/cockroach</span> <span class="hljs-string">cert</span> <span class="hljs-string">create-client</span> <span class="hljs-string">root</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--certs-dir=/out</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--ca-key=/ca/ca.key</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--lifetime=5h</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--overwrite</span>

          <span class="hljs-comment"># List the generated files</span>
          <span class="hljs-string">ls</span> <span class="hljs-string">-al</span> <span class="hljs-string">/out</span>

          <span class="hljs-comment"># Keep the pod alive so we can kubectl cp the files</span>
          <span class="hljs-string">sleep</span> <span class="hljs-number">3600</span>
      <span class="hljs-attr">volumeMounts:</span>
        <span class="hljs-bullet">-</span> { <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-ca</span>, <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/ca</span>, <span class="hljs-attr">readOnly:</span> <span class="hljs-literal">true</span> }
      <span class="hljs-attr">resources:</span>
        <span class="hljs-attr">requests:</span>
          <span class="hljs-attr">memory:</span> <span class="hljs-string">"50Mi"</span>
          <span class="hljs-attr">cpu:</span> <span class="hljs-string">"10m"</span>
        <span class="hljs-attr">limits:</span>
          <span class="hljs-attr">memory:</span> <span class="hljs-string">"500Mi"</span>
          <span class="hljs-attr">cpu:</span> <span class="hljs-string">"50m"</span>
</code></pre>
<p>So how does this work?</p>
<p>We previously mentioned that the Helm chart created a secret, <code>crdb-cockroachdb-ca-secret</code>.</p>
<p>This secret contains:</p>
<ul>
<li><p>The Certificate Authority public certificate</p>
</li>
<li><p>The private key (used for signing)</p>
</li>
<li><p>The CA metadata</p>
</li>
</ul>
<p>CockroachDB requires that the server certificate (node cert) and the client certificate (your root cert) be signed by <strong>THE SAME CA</strong>, because this is what ensures both sides trust each other.</p>
<p>So what do we do?</p>
<p>We mount the CA secret into the pod:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">volumes:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-ca</span>
    <span class="hljs-attr">secret:</span>
      <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-cockroachdb-ca-secret</span>
</code></pre>
<p>This gives the pod access to:</p>
<ul>
<li><p><code>/ca/ca.crt</code>: CA public certificate</p>
</li>
<li><p><code>/ca/ca.key</code>: CA <em>private</em> key</p>
</li>
</ul>
<p>And with these, we can sign new client certificates inside the cluster.</p>
<p>The important command inside the pod:</p>
<pre><code class="lang-bash">/cockroach/cockroach cert create-client root \
  --certs-dir=/out \
  --ca-key=/ca/ca.key \
  --lifetime=5h \
  --overwrite
</code></pre>
<p>What this does:</p>
<ul>
<li><p>Generates a brand new public/private key pair for the <code>root</code> SQL user</p>
</li>
<li><p>Uses the CA private key to <strong>sign the client certificate</strong></p>
</li>
<li><p>Places everything inside <code>/out</code></p>
</li>
<li><p>Makes the certificate valid for <strong>5 hours</strong></p>
</li>
</ul>
<p>If we passed <code>demo</code> instead of <code>root</code>, then the certificate CN would be <code>demo</code>, and CockroachDB would treat anyone using that certificate as the <code>demo</code> SQL user.</p>
<p>That’s how CockroachDB identifies and authenticates users when running in secure mode.</p>
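<p>If the “CN = SQL user” idea still feels abstract, you can reproduce it in isolation with plain <code>openssl</code> – using a throwaway demo CA, not your cluster’s CA. This sketch mimics what <code>cockroach cert create-client root</code> does under the hood:</p>

```shell
# 1. Create a throwaway Certificate Authority (self-signed).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout demo-ca.key -out demo-ca.crt -subj "/O=Cockroach/CN=Cockroach CA"

# 2. Create a client key + certificate signing request with CN = root.
openssl req -newkey rsa:2048 -nodes \
  -keyout demo-client.key -out demo-client.csr -subj "/O=Cockroach/CN=root"

# 3. Sign the client certificate with the demo CA's private key.
openssl x509 -req -in demo-client.csr -CA demo-ca.crt -CAkey demo-ca.key \
  -CAcreateserial -days 1 -out demo-client.crt

# 4. The Subject CN is the identity a server would authenticate you as.
openssl x509 -in demo-client.crt -noout -subject
```

<p>The last command should show <code>CN = root</code> in the subject – exactly the field CockroachDB reads to decide which SQL user you are.</p>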
<h4 id="heading-step-2-deploy-the-pod">Step 2: Deploy the Pod</h4>
<p>Run:</p>
<pre><code class="lang-bash">kubectl apply -f gen-root-cert.yml
</code></pre>
<p>Give it a minute to start and generate the files.</p>
<h4 id="heading-step-3-copy-the-certificates-to-your-local-pc">Step 3: Copy the Certificates to Your Local PC</h4>
<p>We need three files:</p>
<ul>
<li><p><code>client.root.crt</code>: client certificate</p>
</li>
<li><p><code>client.root.key</code>: private key</p>
</li>
<li><p><code>ca.crt</code>: CA certificate</p>
</li>
</ul>
<p>Copy them from the pod to your machine:</p>
<pre><code class="lang-bash">kubectl cp default/gen-root-cert:/out/client.root.crt ./client.root.crt
kubectl cp default/gen-root-cert:/out/client.root.key ./client.root.key
kubectl cp default/gen-root-cert:/out/ca.crt ./ca.crt
</code></pre>
<p>Now your folder should contain:</p>
<pre><code class="lang-bash">client.root.crt
client.root.key
ca.crt
</code></pre>
<p>These are the files Beekeeper Studio needs for mTLS.</p>
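<p>For reference, these same three files map directly onto a standard PostgreSQL-style connection URL, which most app drivers (and the <code>cockroach</code> CLI) accept. A sketch, assuming the forwarded port and file locations from above:</p>

```shell
# verify-full: encrypt the connection AND verify the server cert against ca.crt.
CERTS_DIR="$PWD"
DB_URL="postgresql://root@localhost:26259/defaultdb?sslmode=verify-full&sslrootcert=${CERTS_DIR}/ca.crt&sslcert=${CERTS_DIR}/client.root.crt&sslkey=${CERTS_DIR}/client.root.key"
echo "$DB_URL"
```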
<h4 id="heading-step-4-decode-the-client-certificate-just-like-we-did-for-the-node-certificate">Step 4: Decode the Client Certificate (Just Like We Did for the Node Certificate)</h4>
<p>Run:</p>
<pre><code class="lang-bash">openssl x509 -<span class="hljs-keyword">in</span> client.root.crt -text -noout &gt; crdb-root.crt.decoded
</code></pre>
<p>Open the <code>crdb-root.crt.decoded</code> file and look at the contents.</p>
<h4 id="heading-understanding-the-client-certificate">Understanding the Client Certificate</h4>
<ol>
<li><strong>Issuer</strong></li>
</ol>
<p>You'll see <code>Issuer: O = Cockroach, CN = Cockroach CA</code></p>
<p>This is the same Issuer as the CockroachDB node certificate.</p>
<p>This confirms that both certificates were signed by the <em>same</em> Certificate Authority, that they trust each other, and that mTLS will work perfectly.</p>
<ol start="2">
<li><strong>Subject</strong></li>
</ol>
<p>You’ll see: <code>Subject: O = Cockroach, CN = root</code></p>
<p>This means that the Organization is just a label grouping CockroachDB identities, and that the Common Name is <code>root</code>. This is VERY important.</p>
<p>The CN of a client certificate literally tells CockroachDB:</p>
<blockquote>
<p>“This connection belongs to the SQL user named <code>root</code>.”</p>
</blockquote>
<p>If CN was <code>demo</code>, CockroachDB would authenticate you as the <code>demo</code> SQL user.</p>
<h4 id="heading-extended-key-usage-eku">Extended Key Usage (EKU)</h4>
<p>You should see: <code>TLS Web Client Authentication</code>.</p>
<p>This is exactly what we want. It tells CockroachDB:</p>
<blockquote>
<p>“This certificate is only for clients connecting to the database.”</p>
</blockquote>
<p>Unlike node certificates, you will NOT see: <code>TLS Web Server Authentication</code>.</p>
<p>Why?</p>
<p>Because:</p>
<ul>
<li><p><strong>Server Authentication</strong> = for certificates the SERVER SHOWS TO THE CLIENT. For example: CockroachDB nodes proving they are legitimate.</p>
</li>
<li><p><strong>Client Authentication</strong> = for certificates THE CLIENT SENDS TO THE SERVER. For example: You proving you are the real <code>root</code> user.</p>
</li>
</ul>
<h4 id="heading-why-your-client-certificate-cannot-be-used-as-a-server-certificate">Why your client certificate <strong>cannot</strong> be used as a server certificate</h4>
<p>Because a server certificate says:</p>
<blockquote>
<p>“Trust me, I AM the CockroachDB server.”</p>
</blockquote>
<p>But your client certificate says:</p>
<blockquote>
<p>“Trust me, I am an authenticated user.”</p>
</blockquote>
<p>Two very different identities. And CockroachDB will <em>reject</em> any certificate used in the wrong role.</p>
<p>So having only TLS Web Client Authentication in your certificate is perfect for our use case. :)</p>
<h3 id="heading-connecting-to-our-cockroachdb-cluster-securely-using-mtls">Connecting to Our CockroachDB Cluster Securely (Using mTLS)</h3>
<p>Now that we’ve successfully generated the certificates and key pairs we need, it's time to use them to securely connect to our CockroachDB cluster from Beekeeper Studio.</p>
<p>Remember: CockroachDB is running in secure mode, so without these certificates, it will <em>reject all incoming connections</em>, even if you enter the correct username and password.</p>
<p>Let’s walk through the steps. 👇🏾</p>
<h4 id="heading-step-1-make-sure-port-forwarding-is-still-running">Step 1: Make Sure Port Forwarding Is Still Running</h4>
<p>Before connecting, ensure that your CockroachDB cluster is still exposed to your PC.</p>
<p>If you already closed the previous terminal window, simply re-run this:</p>
<pre><code class="lang-bash">kubectl port-forward svc/crdb-cockroachdb-public 26259:26257
</code></pre>
<p>This makes your CockroachDB node reachable at: <code>localhost:26259</code>. If this step isn’t active, <em>Beekeeper Studio will not be able to connect</em>.</p>
<h4 id="heading-step-2-open-beekeeper-studio-and-set-up-the-connection">Step 2: Open Beekeeper Studio and Set Up the Connection</h4>
<p>Launch Beekeeper Studio and open a fresh connection window (<code>Ctrl + Shift + N</code> if needed).</p>
<p>Now fill in the fields like this:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Field</td><td>Value</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Connection Type</strong></td><td>CockroachDB</td></tr>
<tr>
<td><strong>Host</strong></td><td><code>localhost</code></td></tr>
<tr>
<td><strong>Port</strong></td><td><code>26259</code></td></tr>
<tr>
<td><strong>User</strong></td><td><code>root</code></td></tr>
<tr>
<td><strong>Default Database</strong></td><td><code>defaultdb</code></td></tr>
</tbody>
</table>
</div><p>Now enable the <strong>“Enable SSL”</strong> option. Once enabled, expand the SSL section and set the following three fields:</p>
<ul>
<li><p><strong>CA Cert:</strong> Set this to the location of: <code>ca.crt</code>. This is the root Certificate Authority file you copied earlier using: <code>kubectl cp default/gen-root-cert:/out/ca.crt ./ca.crt</code>. It should still be in your project’s root directory (for example, <code>cockroachdb-tutorial/</code>).</p>
</li>
<li><p><strong>Certificate:</strong> Set this to the location of: <code>client.root.crt</code></p>
</li>
<li><p><strong>Key File:</strong> Set this to the location of: <code>client.root.key</code></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763389469459/bbdb17c5-1c3b-4163-932f-3cd5382160f4.png" alt="Connecting to the CockroachDB cluster from Beekeeper Studio in &quot;Secure&quot; mode" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-step-3-click-connect">Step 3: Click “Connect”</h4>
<p>Once all the fields are set properly, click <strong>Connect</strong>.</p>
<p>If everything was done correctly, you should now be connected to your CockroachDB cluster securely over Mutual TLS.</p>
<p>If the connection fails:</p>
<ul>
<li><p>Double-check your certificate paths</p>
</li>
<li><p>Ensure port-forwarding is running</p>
</li>
<li><p>Verify the user is <code>root</code></p>
</li>
<li><p>Confirm the selected connection type is <code>CockroachDB</code></p>
</li>
</ul>
<h4 id="heading-step-4-run-your-first-secure-query">Step 4: Run Your First Secure Query</h4>
<p>Now that you're connected, let’s verify everything works by running:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SHOW</span> <span class="hljs-keyword">users</span>;
</code></pre>
<p>You should see two users automatically created by CockroachDB:</p>
<ul>
<li><p><strong>admin</strong></p>
</li>
<li><p><strong>root</strong></p>
</li>
</ul>
<p>In the next subsection, we’ll create a <strong>new SQL user</strong> and generate a certificate for that user (just like we did for the <code>root</code> user) so you’ll understand how CockroachDB handles user authentication in production environments.</p>
<h3 id="heading-restoring-our-previous-database-into-the-new-gke-cockroachdb-cluster-without-sa-keys">Restoring Our Previous Database into the New GKE CockroachDB Cluster (without SA keys)</h3>
<p>Now that our CockroachDB cluster is up and running on GKE – fully secured with TLS encryption and mTLS authentication – it’s time to bring back the data from our previous setup.</p>
<p>Remember how we backed up our CockroachDB database (running on Minikube) to Google Cloud Storage?</p>
<p>Well, now we’re going to restore that same backup into our new production cluster on GKE. But before CockroachDB can access our bucket, we must give it permission – securely.</p>
<p>And here’s the cool part: <strong>we don’t need to use Service Account keys anymore.</strong></p>
<h4 id="heading-why-we-dont-need-service-account-keys-on-gke">Why We Don’t Need Service Account Keys on GKE</h4>
<p>Earlier, in the backup section, we generated a Service Account key on our PC and mounted it into our Minikube cluster.</p>
<p>But for GKE, we intentionally left out the following fields in our <code>cockroachdb-production.yml</code>:</p>
<ul>
<li><p><code>env</code></p>
</li>
<li><p><code>volumes</code></p>
</li>
<li><p><code>volumeMounts</code></p>
</li>
</ul>
<p>The reason? GKE supports something called <strong>Workload Identity</strong>.</p>
<p>Workload Identity lets us securely connect Kubernetes Service Accounts (KSAs) to Google Cloud Service Accounts (GSAs), without storing or mounting any secret keys. The authentication happens “implicitly” thanks to Google’s metadata server.</p>
<p>💡 Workload Identity works easily when your cluster is running on GKE. It’s more complex to set up on Minikube, Kind, EKS, AKS, or any other non-GKE cluster.</p>
<h4 id="heading-step-1-linking-the-google-service-account-to-our-kubernetes-service-account">Step 1: Linking the Google Service Account to Our Kubernetes Service Account</h4>
<p>We already touched on this when deploying our cluster, but let’s look at the specific line again.</p>
<p>Open your <code>cockroachdb-production.yml</code> Helm values file and scroll to the <code>serviceAccount</code> section. You should see something like this:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">serviceAccount:</span>
    <span class="hljs-attr">create:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">"crdb-cockroachdb"</span>
    <span class="hljs-attr">annotations:</span>
      <span class="hljs-attr">iam.gke.io/gcp-service-account:</span> <span class="hljs-string">cockroachdb-backup@&lt;PROJECT_ID&gt;.iam.gserviceaccount.com</span>
<span class="hljs-string">...</span>
</code></pre>
<p>Replace the <code>&lt;PROJECT_ID&gt;</code> placeholder with your real Google Cloud project ID.</p>
<p>If you’re unsure of the ID, go to Google Cloud Console, then to IAM &amp; Admin, and finally to Service Accounts. Search for <code>cockroachdb-backup</code> and copy the project ID from there.</p>
<p>This annotation instructs GKE to automatically authenticate our CockroachDB pods as the <code>cockroachdb-backup</code> Google Service Account – no keys needed.</p>
<h4 id="heading-step-2-binding-ksa-gsa-using-workload-identity">Step 2: Binding KSA ↔️ GSA Using Workload Identity</h4>
<p>Annotating the Service Account isn’t enough. We still need to explicitly allow our KSA to “impersonate” the GSA.</p>
<p>Run this command to set the active project:</p>
<pre><code class="lang-bash">gcloud config <span class="hljs-built_in">set</span> project &lt;PROJECT_ID&gt;
</code></pre>
<p>Now, apply the IAM policy binding:</p>
<pre><code class="lang-bash">gcloud iam service-accounts add-iam-policy-binding \
  &lt;GOOGLE_SERVICE_ACCOUNT&gt; \
  --role roles/iam.workloadIdentityUser \
  --member <span class="hljs-string">"serviceAccount:&lt;PROJECT_ID&gt;.svc.id.goog[&lt;NAMESPACE&gt;/&lt;KUBERNETES_SERVICE_ACCOUNT&gt;]"</span>
</code></pre>
<p>Replace the placeholders with:</p>
<ul>
<li><p><code>&lt;GOOGLE_SERVICE_ACCOUNT&gt;</code> with <code>cockroachdb-backup@&lt;PROJECT_ID&gt;.iam.gserviceaccount.com</code></p>
</li>
<li><p><code>&lt;PROJECT_ID&gt;</code> with your GCP project ID</p>
</li>
<li><p><code>&lt;NAMESPACE&gt;</code> with where CockroachDB runs (<code>default</code>)</p>
</li>
<li><p><code>&lt;KUBERNETES_SERVICE_ACCOUNT&gt;</code> with <code>crdb-cockroachdb</code></p>
</li>
</ul>
<p>After a few seconds, you should see something like:</p>
<pre><code class="lang-yaml"><span class="hljs-string">Updated</span> <span class="hljs-string">IAM</span> <span class="hljs-string">policy</span> <span class="hljs-string">for</span> <span class="hljs-string">serviceAccount</span> [<span class="hljs-string">cockroachdb-backup@&lt;PROJECT_ID&gt;.iam.gserviceaccount.com</span>]<span class="hljs-string">.</span>
<span class="hljs-attr">bindings:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">members:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">serviceAccount:&lt;PROJECT_ID&gt;.svc.id.goog[default/crdb-cockroachdb]</span>
  <span class="hljs-attr">role:</span> <span class="hljs-string">roles/iam.workloadIdentityUser</span>
<span class="hljs-attr">etag:</span> <span class="hljs-string">***</span>
<span class="hljs-attr">version:</span> <span class="hljs-number">1</span>
</code></pre>
<p>Perfect. Your KSA can now access Google Cloud Storage automatically.</p>
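<p>Before running the restore, you can optionally sanity-check the binding with a short-lived pod that uses the same KSA and asks Google who it is. This is only a hedged sketch – the <code>wi-test</code> name and the <code>google/cloud-sdk:slim</code> image are illustrative choices, not part of the Helm chart:</p>

```yaml
# wi-test.yml -- a one-off pod to confirm Workload Identity is wired up
apiVersion: v1
kind: Pod
metadata:
  name: wi-test
spec:
  restartPolicy: Never
  serviceAccountName: crdb-cockroachdb # the KSA we annotated earlier
  containers:
    - name: test
      image: google/cloud-sdk:slim
      # With Workload Identity working, this should list the
      # cockroachdb-backup Google Service Account as the active account.
      command: ["gcloud", "auth", "list"]
```

<p>Apply it with <code>kubectl apply -f wi-test.yml</code>, check the output with <code>kubectl logs wi-test</code>, then clean up with <code>kubectl delete pod wi-test</code>.</p>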
<h3 id="heading-restoring-our-previous-database-from-google-cloud-storage">Restoring Our Previous Database from Google Cloud Storage</h3>
<p>Now that authentication is set up, let’s restore the backup we previously created in the Minikube cluster.</p>
<p>Open Beekeeper Studio and reconnect to your CockroachDB cluster (the one running on GKE).</p>
<p>Before restoring anything, let’s check if the <code>books</code> table exists:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>You should see an error saying the table doesn’t exist. Don’t worry, that’s expected.</p>
<h3 id="heading-now-lets-restore-the-data">Now, Let’s Restore the Data 🎉</h3>
<p>Run this command:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">RESTORE</span> <span class="hljs-keyword">FROM</span> LATEST <span class="hljs-keyword">IN</span> <span class="hljs-string">'gs://&lt;BUCKET_NAME&gt;/cluster?AUTH=implicit'</span>;
</code></pre>
<p>Replace <code>&lt;BUCKET_NAME&gt;</code> with the name of the bucket you created earlier (for example: <code>cockroachdb-backup-7gw8u</code>).</p>
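<p>Note that <code>RESTORE FROM LATEST IN ...</code> restores the entire cluster backup. CockroachDB also supports more granular restores from the same backup collection – for example, restoring a single table (the target table must not already exist). A hedged sketch:</p>

```sql
-- Restore only the books table from the latest backup in the collection
RESTORE TABLE defaultdb.books
  FROM LATEST IN 'gs://<BUCKET_NAME>/cluster?AUTH=implicit';
```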
<p>CockroachDB will now:</p>
<ul>
<li><p>Authenticate using Workload Identity</p>
</li>
<li><p>Find the latest backup inside your bucket</p>
</li>
<li><p>Restore all tables, schemas, and data into your new GKE cluster</p>
</li>
</ul>
<p>After a couple of minutes, you should get a Success message.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763393752870/f95d76c0-3722-491a-a97c-a1b8a79bdc79.png" alt="Successfully restored CockroachDB database" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now, run the query again:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>Boom! Your books from the Minikube cluster should now appear inside the new CockroachDB cluster running on GKE 😃.</p>
<h3 id="heading-connecting-to-the-database-with-a-new-user">Connecting to the Database with a New User</h3>
<p>So far, we’ve been connecting to our CockroachDB cluster using the <code>root</code> user. While this is super convenient for tutorials, it’s not recommended for real apps.</p>
<p>This is because the <code>root</code> user has advanced privileges – basically, full access to your entire cluster. If an attacker got hold of these credentials, or your application was compromised, they could do <strong>A LOT</strong> of damage. 😬</p>
<p>Instead, it’s best practice to create a user with <strong>limited permissions</strong> for your apps. This way, even if the user is compromised, the damage is contained.</p>
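<p>“Limited permissions” in practice means granting only what the app needs. Here’s a hedged sketch – the <code>app_user</code> name and the specific grants are illustrative, not part of this tutorial’s setup:</p>

```sql
-- A hypothetical app user that can read and write books, and nothing else:
CREATE USER app_user WITH PASSWORD 'change-me';
GRANT SELECT, INSERT, UPDATE ON TABLE defaultdb.public.books TO app_user;
```

<p>If <code>app_user</code> is ever compromised, the attacker can touch the <code>books</code> table, but can’t drop databases, read other tables, or create new admin users.</p>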
<h4 id="heading-authentication-options-for-users">Authentication Options for Users</h4>
<p>CockroachDB is flexible when it comes to authentication:</p>
<ol>
<li><p><strong>Password Authentication:</strong> Create a user with a password and connect using just username + password (no client certificates required).</p>
</li>
<li><p><strong>Passwordless / Mutual TLS Authentication:</strong> Create a user without a password, then connect using client certificates signed by the same CA (like we did for <code>root</code>).</p>
</li>
<li><p><strong>Both Password + Mutual TLS:</strong> Create a user with a password and also connect using client certificates. This adds an extra layer of security.</p>
</li>
</ol>
<p>In this subsection, we’ll start simple and use password authentication.</p>
<h4 id="heading-step-1-create-the-new-user">Step 1: Create the New User</h4>
<p>Open your current connection in Beekeeper Studio (signed in as <code>root</code>) and run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">USER</span> password_auth <span class="hljs-keyword">WITH</span> <span class="hljs-keyword">PASSWORD</span> <span class="hljs-string">'supersecret'</span>;
</code></pre>
<p>You should see a message confirming the user was created successfully.</p>
<h4 id="heading-step-2-connect-as-the-new-user">Step 2: Connect as the New User</h4>
<p>Open a new Beekeeper Studio window (Ctrl + Shift + N). <strong>DO NOT</strong> exit/close the old window, as we’ll need it later.</p>
<p>Fill in the connection fields:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Field</td><td>Value</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Connection Type</strong></td><td>CockroachDB</td></tr>
<tr>
<td><strong>Host</strong></td><td><code>localhost</code></td></tr>
<tr>
<td><strong>Port</strong></td><td><code>26259</code></td></tr>
<tr>
<td><strong>Database</strong></td><td><code>defaultdb</code></td></tr>
<tr>
<td><strong>User</strong></td><td><code>password_auth</code></td></tr>
<tr>
<td><strong>Password</strong></td><td><code>huh</code> (for now, we’ll try a wrong password to see it fail)</td></tr>
</tbody>
</table>
</div><p>Click Connect.</p>
<p>❌ You’ll see an error about SSL connection being required.</p>
<p>Even though we’re connecting with a password instead of certificates, <strong>enabling SSL is still important</strong>. It encrypts the data between Beekeeper Studio and CockroachDB.</p>
<p>Without it, sensitive info like passwords and queries could be intercepted (man-in-the-middle attacks).</p>
<h4 id="heading-step-3-enable-ssl-amp-ca-verification">Step 3: Enable SSL &amp; CA Verification</h4>
<ul>
<li><p>Tick <strong>Enable SSL</strong></p>
</li>
<li><p>Click the <strong>CA Cert</strong> field and select the <code>ca.crt</code> file in your project root (<code>cockroachdb-tutorial/</code>)</p>
</li>
</ul>
<p>This ensures that Beekeeper Studio verifies it’s really talking to our CockroachDB cluster and protects against attackers trying to intercept the connection.</p>
<p>Now, click Connect again.</p>
<p>❌ Initially, you’ll still see a <strong>Password authentication failed</strong> error because we intentionally entered the wrong password.</p>
<h4 id="heading-step-4-connect-with-the-correct-password">Step 4: Connect With the Correct Password</h4>
<p>Replace the password with <code>supersecret</code>, then click Connect.</p>
<p>You are now signed in as the <code>password_auth</code> user!</p>
<h4 id="heading-step-5-check-permissions">Step 5: Check Permissions</h4>
<p>Run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>❌ You should see an error stating that <code>password_auth</code> does not have permission to access the <code>books</code> table.</p>
<p>This is expected, as it confirms that our limited-access user can <strong>only access what we explicitly grant it</strong>. Even if compromised, the attacker can’t modify our entire database.</p>
<h4 id="heading-step-6-granting-access-to-specific-tables">Step 6: Granting Access to Specific Tables</h4>
<p>To allow <code>password_auth</code> to work with the <code>books</code> table, switch back to the <code>root</code> connection Beekeeper Studio window and run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">USAGE</span> <span class="hljs-keyword">ON</span> <span class="hljs-keyword">SCHEMA</span> defaultdb.public <span class="hljs-keyword">TO</span> password_auth;
<span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">SELECT</span>, <span class="hljs-keyword">INSERT</span>, <span class="hljs-keyword">UPDATE</span>, <span class="hljs-keyword">DELETE</span> <span class="hljs-keyword">ON</span> <span class="hljs-keyword">TABLE</span> defaultdb.public.books <span class="hljs-keyword">TO</span> password_auth;
</code></pre>
<p>This gives the user read and write access to the <code>books</code> table only.</p>
<h4 id="heading-step-7-verify-the-new-user-access">Step 7: Verify the New User Access</h4>
<p>Go back to the Beekeeper Studio window where you’re signed in as <code>password_auth</code> and run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>Boom! You should now see the list of books from your restored database.</p>
<p>Our new user is fully functional with <strong>limited privileges</strong>, making it safe for use in real applications.</p>
<h3 id="heading-connecting-with-passwordless-authentication-mutual-tls">Connecting with Passwordless Authentication (Mutual TLS)</h3>
<p>We’ve already seen how to connect to the database using a user that authenticates with a password, and without any client certificates.</p>
<p>Now, let’s look at the opposite scenario: passwordless authentication via Mutual TLS (mTLS).</p>
<p>This is one of the strongest forms of authentication because instead of a password, the database verifies you using a <strong>cryptographically signed certificate</strong>.</p>
<p>Let’s walk through it.</p>
<h4 id="heading-step-1-create-the-mtlsauth-user">Step 1: Create the <code>mtls_auth</code> User</h4>
<p>Navigate back to the Beekeeper Studio window where you're currently signed in as the <code>root</code> user. Run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">USER</span> mtls_auth;
</code></pre>
<p>You should see a success message confirming that the user has been created.</p>
<p><strong>N.B.:</strong> If this query fails, there’s a good chance your <code>root</code> client certificate has expired. Remember that we set a <strong>5-hour lifetime</strong> when generating it earlier.</p>
<p>If this happens, delete the certificate-generation pod:</p>
<pre><code class="lang-bash">kubectl delete po/gen-root-cert
</code></pre>
<p>Then re-apply the <code>gen-root-cert.yml</code> manifest. Copy the newly generated <code>client.root.crt</code>, <code>client.root.key</code>, and <code>ca.crt</code> back to your PC. Then try creating the user again.</p>
<h4 id="heading-step-2-attempt-signing-in-as-mtlsauth-expect-failure">Step 2: Attempt Signing In as <code>mtls_auth</code> (Expect Failure)</h4>
<p>Open a new Beekeeper Studio window (Ctrl + Shift + N).</p>
<p>Try filling in the connection settings using:</p>
<ul>
<li><p>User: <code>mtls_auth</code></p>
</li>
<li><p>SSL enabled</p>
</li>
<li><p>CA Cert: <code>ca.crt</code></p>
</li>
<li><p>Client Cert: <code>client.root.crt</code></p>
</li>
<li><p>Client Key: <code>client.root.key</code></p>
</li>
</ul>
<p>Click Connect.</p>
<p>You’ll see an error message similar to this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763444971964/93f41787-425b-4e36-86da-4b688cef672f.png" alt="Connecting as the mtls_auth user with the wrong certificate and key-pair" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Why does this fail?</p>
<ol>
<li><p>The user has no password, so password login is impossible.</p>
</li>
<li><p>You’re using the <em>root</em> certificate, not a certificate belonging to <code>mtls_auth</code>. CockroachDB is strict: each user must authenticate using <em>their own</em> certificate.</p>
</li>
</ol>
<p>So let's fix that by generating a new certificate + key pair for the <code>mtls_auth</code> user.</p>
<h4 id="heading-step-3-create-certificate-key-for-mtlsauth">Step 3: Create Certificate + Key for <code>mtls_auth</code></h4>
<p>Just like we generated certificates for the <code>root</code> user earlier, we’ll do the same for <code>mtls_auth</code>.</p>
<p>Create a new manifest named <code>gen-mtls_auth-cert.yml</code>.</p>
<p>Paste in this content:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">gen-mtls-auth-cert</span> 
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-ca</span>
      <span class="hljs-attr">secret:</span>
        <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-cockroachdb-ca-secret</span> 
        <span class="hljs-attr">items:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">ca.crt</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">ca.crt</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">ca.key</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">ca.key</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gen</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">cockroachdb/cockroach:v25.3.1</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">"sh"</span>, <span class="hljs-string">"-ec"</span>]
      <span class="hljs-attr">args:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">|
          mkdir -p /out
</span>
          <span class="hljs-comment"># Copy the CA certificate</span>
          <span class="hljs-string">cp</span> <span class="hljs-string">/ca/ca.crt</span> <span class="hljs-string">/out/ca.crt</span>

          <span class="hljs-comment"># Create the client certificate and key pair for user 'mtls_auth'</span>
          <span class="hljs-string">/cockroach/cockroach</span> <span class="hljs-string">cert</span> <span class="hljs-string">create-client</span> <span class="hljs-string">mtls_auth</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--certs-dir=/out</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--ca-key=/ca/ca.key</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--lifetime=5h</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--overwrite</span>

          <span class="hljs-comment"># List generated files</span>
          <span class="hljs-string">ls</span> <span class="hljs-string">-al</span> <span class="hljs-string">/out</span>

          <span class="hljs-comment"># Keep pod alive for kubectl cp</span>
          <span class="hljs-string">sleep</span> <span class="hljs-number">3600</span>
      <span class="hljs-attr">volumeMounts:</span>
        <span class="hljs-bullet">-</span> { <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-ca</span>, <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/ca</span>, <span class="hljs-attr">readOnly:</span> <span class="hljs-literal">true</span> }
      <span class="hljs-attr">resources:</span>
        <span class="hljs-attr">requests:</span>
          <span class="hljs-attr">memory:</span> <span class="hljs-string">"50Mi"</span>
          <span class="hljs-attr">cpu:</span> <span class="hljs-string">"10m"</span>
        <span class="hljs-attr">limits:</span>
          <span class="hljs-attr">memory:</span> <span class="hljs-string">"500Mi"</span>
          <span class="hljs-attr">cpu:</span> <span class="hljs-string">"50m"</span>
</code></pre>
<p>Apply this file, wait for the pod to start, then copy the generated files:</p>
<pre><code class="lang-bash">kubectl cp default/gen-mtls-auth-cert:/out/client.mtls_auth.crt ./client.mtls_auth.crt 
kubectl cp default/gen-mtls-auth-cert:/out/client.mtls_auth.key ./client.mtls_auth.key
kubectl cp default/gen-mtls-auth-cert:/out/ca.crt ./ca.crt
</code></pre>
<p>Now we have the correct certificate + key pair for our new user.</p>
<h4 id="heading-step-4-connect-as-mtlsauth">Step 4: Connect as <code>mtls_auth</code></h4>
<p>Go back to the new Beekeeper Studio window and update the SSL fields:</p>
<ul>
<li><p><strong>CA Cert:</strong> <code>ca.crt</code></p>
</li>
<li><p><strong>Certificate:</strong> <code>client.mtls_auth.crt</code></p>
</li>
<li><p><strong>Key File:</strong> <code>client.mtls_auth.key</code></p>
</li>
</ul>
<p>Click Connect.</p>
<p>This time, the connection should succeed instantly.</p>
<h4 id="heading-step-5-inspect-the-certificate">Step 5: Inspect the Certificate</h4>
<p>To understand how CockroachDB links certificates to users, decode the certificate:</p>
<pre><code class="lang-bash">openssl x509 -<span class="hljs-keyword">in</span> client.mtls_auth.crt -text -noout &gt; client.mtls_auth.crt.decoded
</code></pre>
<p>Open the file, scroll to the Subject field, and you’ll see:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">Subject:</span> <span class="hljs-string">O</span> <span class="hljs-string">=</span> <span class="hljs-string">Cockroach,</span> <span class="hljs-string">CN</span> <span class="hljs-string">=</span> <span class="hljs-string">mtls_auth</span>
<span class="hljs-string">...</span>
</code></pre>
<p>The <code>CN</code> (Common Name) is the username CockroachDB uses to authenticate the session.</p>
<p>This is how CockroachDB knows you’re connecting as the <code>mtls_auth</code> user without any password at all. :)</p>
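<p>As a quick illustration, here’s a tiny Python sketch (a hypothetical helper, not part of the tutorial’s app) that pulls the CN out of a decoded <code>Subject</code> line like the one above:</p>
<pre><code class="lang-python"># Sketch: extract the CN (the CockroachDB username) from an
# `openssl x509 -text` Subject line. Illustrative helper only.
import re

def common_name(subject_line):
    """Return the CN value from a line like 'Subject: O = Cockroach, CN = mtls_auth'."""
    match = re.search(r"CN\s*=\s*([^,\s]+)", subject_line)
    return match.group(1) if match else None

print(common_name("Subject: O = Cockroach, CN = mtls_auth"))  # mtls_auth
</code></pre>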
<h4 id="heading-step-6-try-reading-the-books-table">Step 6: Try Reading the Books Table</h4>
<p>Run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>❌ You’ll get a permission error, just like we did earlier with the <code>password_auth</code> user.</p>
<p>This is expected because <code>mtls_auth</code> has <em>no</em> privileges yet. Perfect!</p>
<h4 id="heading-step-7-grant-permissions-to-mtlsauth">Step 7: Grant Permissions to <code>mtls_auth</code></h4>
<p>Switch to the Beekeeper Studio window where you're signed in as <code>root</code>, and run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">USAGE</span> <span class="hljs-keyword">ON</span> <span class="hljs-keyword">SCHEMA</span> defaultdb.public <span class="hljs-keyword">TO</span> mtls_auth;
<span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">SELECT</span>, <span class="hljs-keyword">INSERT</span>, <span class="hljs-keyword">UPDATE</span>, <span class="hljs-keyword">DELETE</span> <span class="hljs-keyword">ON</span> <span class="hljs-keyword">TABLE</span> defaultdb.public.books <span class="hljs-keyword">TO</span> mtls_auth;
</code></pre>
<p>You should see a success message.</p>
<p>Now return to the <code>mtls_auth</code> session and run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>Boom! You should now see your previously restored list of books.</p>
<p>You’ve successfully connected using passwordless, certificate-based authentication and granted controlled permissions to the new user. :)</p>
<h3 id="heading-connecting-via-mutual-tls-mtls-from-our-apps-on-kubernetes">Connecting via Mutual TLS (mTLS) from Our Apps on Kubernetes</h3>
<p>So far, we’ve been connecting to our CockroachDB cluster <em>securely</em> using Beekeeper Studio thanks to our TLS certificates and mTLS authentication.</p>
<p>But…what happens when we have applications running inside our Kubernetes cluster that need to talk to CockroachDB as well?</p>
<p>Exactly: those apps also need to authenticate using client certificates.</p>
<p>And that brings us to a very important point…</p>
<h4 id="heading-why-we-should-not-generate-client-certificates-using-pods-the-dangerous-way">Why We Should <em>Not</em> Generate Client Certificates Using Pods (The Dangerous Way)</h4>
<p>Up until now, we’ve been generating our client certificates using Kubernetes Pods like:</p>
<ul>
<li><p><code>gen-root-cert</code></p>
</li>
<li><p><code>gen-mtls-auth-cert</code></p>
</li>
</ul>
<p>They <em>work</em>, yes…but they’re not safe for production.</p>
<p>Why? Because these jobs <strong>mount our Certificate Authority (CA) key</strong> inside the pod:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-ca</span>
      <span class="hljs-attr">secret:</span>
        <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-cockroachdb-ca-secret</span>
        <span class="hljs-attr">items:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">ca.crt</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">ca.crt</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">ca.key</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">ca.key</span>
<span class="hljs-string">...</span>
</code></pre>
<p>This is a <em>big</em> security risk!</p>
<p>If an attacker ever gains access to that pod?</p>
<p>🔥 Your CA key is exposed<br>🔥 They can generate <em>their own trusted certificates</em><br>🔥 They can impersonate ANY client/user, including the <code>root</code> and <code>admin</code> users<br>🔥 They’ll have full access to your CockroachDB cluster</p>
<p>And they’ll keep that access <strong>forever</strong>, until you rotate the CA key (which is painful and disruptive).</p>
<p>This is why CockroachDB strongly advises against mounting CA keys into Pods.</p>
<h4 id="heading-the-right-way-using-cert-manager-recommended-by-cockroachdb">The Right Way: Using Cert Manager (Recommended by CockroachDB)</h4>
<p>CockroachDB’s <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/secure-cockroachdb-kubernetes?filters=helm#deploy-cert-manager-for-mtls">official docs recommend</a> managing client certificates using <strong>cert-manager</strong>.</p>
<p>This is because instead of YOU exposing your CA key inside Pods, cert-manager handles everything <em>internally and securely:</em></p>
<ul>
<li><p>Cert-manager stores and protects your CA key</p>
</li>
<li><p>It generates client certificates for you</p>
</li>
<li><p>It issues private keys <em>without ever exposing your CA key</em></p>
</li>
<li><p>It auto-renews certificates before they expire</p>
</li>
<li><p>And it gives you production-grade certificate lifecycle management</p>
</li>
</ul>
<h4 id="heading-but-wait-dont-we-need-the-ca-key-to-generate-client-certificates">But Wait: Don’t We Need the CA Key to Generate Client Certificates?</h4>
<p>Great question.</p>
<p>Yes, normally you need the CA key to sign client certificates…but <strong>cert-manager takes care of that for us</strong>.</p>
<p>You simply:</p>
<ol>
<li><p>Create an Issuer (or ClusterIssuer)</p>
</li>
<li><p>Tell cert-manager to use your CockroachDB CA</p>
</li>
<li><p>Request a Certificate</p>
</li>
</ol>
<p>Then cert-manager automatically:</p>
<ol>
<li><p>Signs it</p>
</li>
<li><p>Stores it in a Kubernetes Secret (where it’s safe)</p>
</li>
<li><p>Rotates it before expiry</p>
</li>
<li><p>Keeps your CA key completely secure</p>
</li>
</ol>
<p>No more exposing the CA key in Pods. No more writing custom Kubernetes Pods.</p>
<h4 id="heading-certificate-rotation-another-huge-win">Certificate Rotation: Another Huge Win</h4>
<p>Let’s talk about expirations.</p>
<p>Right now:</p>
<ul>
<li><p>The <code>mtls_auth</code> client cert we generated manually has <strong>5 hours</strong> validity</p>
</li>
<li><p>After 5 hours, it expires</p>
</li>
<li><p>Your apps will fail all DB connections</p>
</li>
<li><p>You’d need to regenerate a new certificate manually</p>
</li>
<li><p>Or worse: create a CronJob to regenerate them every 4 hours</p>
</li>
</ul>
<p>This is messy and unsafe.</p>
<p>With cert-manager?</p>
<ul>
<li><p>Certificates are automatically rotated</p>
</li>
<li><p>Renewed before expiration</p>
</li>
<li><p>No downtime</p>
</li>
<li><p>No manual intervention</p>
</li>
<li><p>Apps easily reload the new certificates</p>
</li>
</ul>
<h4 id="heading-alright-lets-install-cert-manager">Alright, Let’s Install Cert Manager</h4>
<p>To start using cert-manager, install it using the Helm chart:</p>
<pre><code class="lang-bash">helm repo add cert-manager https://charts.jetstack.io

helm install cert-manager cert-manager/cert-manager \
  --<span class="hljs-built_in">set</span> crds.enabled=<span class="hljs-literal">true</span> \
  --create-namespace \
  -n cert-manager \
  --version 1.19.1
</code></pre>
<p>Once cert-manager is installed, we’ll:</p>
<ol>
<li><p>Create a <strong>ClusterIssuer</strong> that uses our CockroachDB CA</p>
</li>
<li><p>Create a <strong>Certificate</strong> for our <code>mtls_auth</code> user</p>
</li>
<li><p>Mount that Certificate into our application Pods</p>
</li>
<li><p>Connect securely to CockroachDB via mTLS from inside Kubernetes</p>
</li>
</ol>
<p>That’s what we’ll walk through next.</p>
<p>Before cert-manager can issue our certificates, it needs an <strong>Issuer</strong>. And before creating an Issuer, we need a secret that contains our CA certificate and CA key using the correct key names.</p>
<h4 id="heading-creating-a-ca-secret-for-the-issuer">Creating a CA Secret for the Issuer</h4>
<p>cert-manager’s <code>Issuer</code> is a bit picky about the secret format. It expects the secret to contain two keys:</p>
<ul>
<li><p><code>tls.crt</code>: the CA certificate</p>
</li>
<li><p><code>tls.key</code>: the CA private key</p>
</li>
</ul>
<p>But the CockroachDB Helm chart automatically generates a secret named <code>crdb-cockroachdb-ca-secret</code>, which uses different key names:</p>
<ul>
<li><p><code>ca.crt</code></p>
</li>
<li><p><code>ca.key</code></p>
</li>
</ul>
<p>So even though this secret contains exactly what we need, cert-manager won’t accept it because the keys are not named the way it expects.</p>
<p>To fix this, we’ll re-create a new secret with the correct key names. First, copy the existing CA files from Kubernetes to your local machine:</p>
<pre><code class="lang-bash">kubectl get secret crdb-cockroachdb-ca-secret -o jsonpath=<span class="hljs-string">'{.data.ca\.crt}'</span> | base64 -d &gt; ca.crt
</code></pre>
<p>If you get a “permission denied” error, first delete any existing <code>ca.crt</code> file in your project directory, then re-run the command.</p>
<p>Now copy the key:</p>
<pre><code class="lang-bash">kubectl get secret crdb-cockroachdb-ca-secret -o jsonpath=<span class="hljs-string">'{.data.ca\.key}'</span> | base64 -d &gt; ca.key
</code></pre>
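<p>If you’re wondering why the <code>base64 -d</code> step is needed: Kubernetes stores every value inside a Secret base64-encoded. Here’s a small sketch (with a snipped, made-up PEM string) of what the decode undoes:</p>
<pre><code class="lang-python"># Kubernetes base64-encodes Secret values; `base64 -d` reverses that.
import base64

pem = "-----BEGIN CERTIFICATE-----\nMIIB...snipped...\n-----END CERTIFICATE-----\n"
encoded = base64.b64encode(pem.encode()).decode()  # what kubectl stores/shows
decoded = base64.b64decode(encoded).decode()       # what `base64 -d` gives you

print(decoded == pem)  # True
</code></pre>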
<p>Next, create the properly formatted secret:</p>
<pre><code class="lang-bash">kubectl create secret tls crdb-ca-issuer-secret --cert=ca.crt --key=ca.key
</code></pre>
<p>If you describe it:</p>
<pre><code class="lang-bash">kubectl describe secret crdb-ca-issuer-secret
</code></pre>
<p>You should now see <code>tls.crt</code> and <code>tls.key</code> in the <code>Data</code> section – exactly what cert-manager needs.</p>
<h4 id="heading-creating-the-issuer">Creating the Issuer</h4>
<p>Now that we have a properly formatted CA secret, we can create the Issuer that cert-manager will use to sign our client certificates.</p>
<p>Create a file called <code>crdb-issuer.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cert-manager.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Issuer</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-issuer</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">ca:</span>
    <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-ca-issuer-secret</span>
</code></pre>
<p>Apply it:</p>
<pre><code class="lang-bash">kubectl apply -f crdb-issuer.yml
</code></pre>
<p>Confirm that it’s ready:</p>
<pre><code class="lang-bash">kubectl get issuer crdb-issuer
</code></pre>
<p>The <code>Ready</code> column should display <code>True</code>.</p>
<h4 id="heading-creating-the-certificate-manifest">Creating the Certificate Manifest</h4>
<p>Now we’ll define a Certificate object. This doesn’t create the client certificate instantly – instead, it tells cert-manager <strong>what kind</strong> of certificate we need. cert-manager then generates and stores the certificate automatically.</p>
<p>Create a file named <code>crdb-mtls_auth-certificate.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cert-manager.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Certificate</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-mtls-auth-certificate</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-mtls-auth-certificate</span> <span class="hljs-comment"># Secret that will hold the cert+key</span>
  <span class="hljs-attr">commonName:</span> <span class="hljs-string">mtls_auth</span> <span class="hljs-comment"># MUST match Cockroach SQL role</span>
  <span class="hljs-attr">duration:</span> <span class="hljs-string">24h</span> <span class="hljs-comment"># 1 day</span>
  <span class="hljs-attr">renewBefore:</span> <span class="hljs-string">20h</span> <span class="hljs-comment"># renew 4 hours before expiry</span>
  <span class="hljs-attr">privateKey:</span>
    <span class="hljs-attr">algorithm:</span> <span class="hljs-string">RSA</span>
    <span class="hljs-attr">size:</span> <span class="hljs-number">2048</span>
    <span class="hljs-attr">encoding:</span> <span class="hljs-string">PKCS8</span>
  <span class="hljs-attr">usages:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">client</span> <span class="hljs-string">auth</span> <span class="hljs-comment"># important: client certificate</span>
  <span class="hljs-attr">issuerRef:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-issuer</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">Issuer</span>
    <span class="hljs-attr">group:</span> <span class="hljs-string">cert-manager.io</span>
</code></pre>
<p>Let’s look at the important properties so we can understand what the Certificate workload does:</p>
<ul>
<li><p><strong>secretName:</strong> The Kubernetes secret where cert-manager will store the generated certificate, key, and CA certificate. This is where your apps will later mount the certificate files from.</p>
</li>
<li><p><strong>commonName:</strong> Very important! This must match the <strong>CockroachDB SQL user</strong> (<code>mtls_auth</code>), because CockroachDB uses the certificate’s Common Name to identify the connecting user.</p>
</li>
<li><p><strong>duration</strong> and <strong>renewBefore:</strong> <code>duration</code> defines how long the certificate is valid. <code>renewBefore</code> ensures cert-manager renews it early, preventing the certificate from expiring before it gets renewed (avoiding downtime).</p>
</li>
<li><p><strong>usages:</strong> Tells cert-manager what the certificate is for. <code>client auth</code> ensures this certificate is only used by clients connecting to servers, not the other way around.</p>
</li>
<li><p><strong>issuerRef:</strong> Points to the Issuer we created earlier. This tells cert-manager <em>who</em> should sign the certificate.</p>
</li>
</ul>
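<p>A quick sanity check on the timing: with <code>duration: 24h</code> and <code>renewBefore: 20h</code>, cert-manager starts renewal once only 20 hours of validity remain, which is roughly 4 hours after issuance:</p>
<pre><code class="lang-python"># When does renewal kick in? renew_at = issued + duration - renewBefore
from datetime import datetime, timedelta

issued = datetime(2026, 1, 1, 0, 0)   # example issuance time
duration = timedelta(hours=24)        # spec.duration
renew_before = timedelta(hours=20)    # spec.renewBefore

renew_at = issued + duration - renew_before
print(renew_at)  # 2026-01-01 04:00:00
</code></pre>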
<p>Apply the manifest:</p>
<pre><code class="lang-bash">kubectl apply -f crdb-mtls_auth-certificate.yml
</code></pre>
<p>After a few seconds, cert-manager will generate the certificate.</p>
<p>Check the secret:</p>
<pre><code class="lang-bash">kubectl get secret crdb-mtls-auth-certificate
</code></pre>
<p>Describe it to view the keys:</p>
<pre><code class="lang-bash">kubectl describe secret crdb-mtls-auth-certificate
</code></pre>
<p>You should see:</p>
<ul>
<li><p><code>tls.crt</code></p>
</li>
<li><p><code>tls.key</code></p>
</li>
<li><p><code>ca.crt</code></p>
</li>
</ul>
<p>These are the files the application will use.</p>
<p>If you copy the content of <code>tls.crt</code> to your local machine and decode it using the <code>openssl x509...</code> command from earlier, you’ll see details similar to the <code>client.mtls_auth.crt</code> certificate we generated previously, with the Common Name (CN) still being <code>mtls_auth</code>.</p>
<h4 id="heading-creating-a-pod-that-connects-using-the-client-certificate">Creating a Pod That Connects Using the Client Certificate</h4>
<p>Now let’s create a simple Pod that uses our new client certificate to connect to CockroachDB.</p>
<p>Create a file called <code>books-pod.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">books-pod</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-certs</span>
      <span class="hljs-attr">secret:</span>
        <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-mtls-auth-certificate</span>
        <span class="hljs-comment"># Make secret files readable only by the owner: 0400 (Without this, the Python app will throw an error). However, this isn't compulsory for all apps, just the one used in this tutorial :)</span>
        <span class="hljs-attr">defaultMode:</span> <span class="hljs-number">0400</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">books</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">prince2006/cockroachdb-tutorial-python-app:new</span>
      <span class="hljs-attr">imagePullPolicy:</span> <span class="hljs-string">Always</span>
      <span class="hljs-attr">env:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DATABASE_URL</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">&gt;-
            postgresql://mtls_auth@crdb-cockroachdb-public.default:26257/defaultdb?sslmode=verify-full&amp;sslrootcert=/crdb-certs/ca.crt&amp;sslcert=/crdb-certs/tls.crt&amp;sslkey=/crdb-certs/tls.key
</span>      <span class="hljs-attr">volumeMounts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-certs</span>
          <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/crdb-certs</span>
          <span class="hljs-attr">readOnly:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">resources:</span>
        <span class="hljs-attr">limits:</span>
          <span class="hljs-attr">memory:</span> <span class="hljs-string">"100Mi"</span>
          <span class="hljs-attr">cpu:</span> <span class="hljs-string">"50m"</span>
        <span class="hljs-attr">requests:</span>
          <span class="hljs-attr">memory:</span> <span class="hljs-string">"50Mi"</span>
          <span class="hljs-attr">cpu:</span> <span class="hljs-string">"10m"</span>
</code></pre>
<p>Here’s what’s happening:</p>
<ul>
<li><p>We mount the generated certificate secret into <code>/crdb-certs</code>.</p>
</li>
<li><p>The Python app uses those certificate files (<code>tls.crt</code>, <code>tls.key</code>, <code>ca.crt</code>) to authenticate.</p>
</li>
<li><p>The connection string does <strong>NOT</strong> include a password. CockroachDB authenticates the user entirely via the certificate’s Common Name.</p>
</li>
</ul>
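<p>To see what that certificate-only connection string actually encodes, here’s a small Python sketch (standard library only – this parsing helper is our own illustration, not part of the tutorial app) that unpacks the same DSN and confirms there’s no password, only certificate paths:</p>

```python
from urllib.parse import urlsplit, parse_qs

# The same DSN the Pod passes to the app via DATABASE_URL
dsn = ("postgresql://mtls_auth@crdb-cockroachdb-public.default:26257/defaultdb"
       "?sslmode=verify-full&sslrootcert=/crdb-certs/ca.crt"
       "&sslcert=/crdb-certs/tls.crt&sslkey=/crdb-certs/tls.key")

parts = urlsplit(dsn)
params = {k: v[0] for k, v in parse_qs(parts.query).items()}

print(parts.username)     # mtls_auth
print(parts.password)     # None -> no password at all
print(params["sslmode"])  # verify-full
print(params["sslcert"])  # /crdb-certs/tls.crt
```

<p>Note how the password is <code>None</code> – the user’s identity comes entirely from <code>sslcert</code>/<code>sslkey</code>, which is exactly what mTLS means.</p>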
<p>Apply the Pod:</p>
<pre><code class="lang-bash">kubectl apply -f books-pod.yml
</code></pre>
<p>After about a minute, view the logs:</p>
<pre><code class="lang-bash">kubectl logs books-pod
</code></pre>
<p>Or if the Pod already restarted:</p>
<pre><code class="lang-bash">kubectl logs -p books-pod
</code></pre>
<p>You should see a successful connection to CockroachDB using the <code>mtls_auth</code> user, and a list of books:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763534354156/60114f7b-ba62-4706-a0b7-7629e20bfaaa.png" alt="List of books from our books-pod logs" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>If you remove the certificate files or try connecting without them, the app will fail – as expected.</p>
<p><strong>Congratulations!</strong></p>
<p>You’ve officially built a fully secure, production-ready CockroachDB cluster on Kubernetes – complete with:</p>
<ul>
<li><p>End-to-end encryption (TLS)</p>
</li>
<li><p>Mutual TLS authentication (mTLS) for users and apps</p>
</li>
<li><p>Automated, daily backups to Google Cloud Storage</p>
</li>
<li><p>Proper certificate rotation with cert-manager</p>
</li>
</ul>
<h2 id="heading-how-to-get-a-cockroachdb-enterprise-license-for-free">How to Get a CockroachDB Enterprise License for Free</h2>
<p>Okay, so here’s a thing: even though you’ve built a super professional CockroachDB cluster, there’s one small catch: <strong>without a license, your cluster might be “throttled.”</strong></p>
<p>We know this because, when we open our dashboard, we see a message warning that the cluster is being throttled.</p>
<p>That means things slow down: queries take longer, performance gets worse, and scaling up won’t magically make it faster. Yeah, it’s real. 🥲</p>
<p>Why does this happen? Because CockroachDB’s “full feature set” is under a special license. If you don’t set a valid license, it limits how many SQL transactions you can run at a time.</p>
<h3 id="heading-three-types-of-licenses">Three Types of Licenses</h3>
<p>Here’s a breakdown of the different kinds of CockroachDB licenses and what they mean for you:</p>
<ol>
<li><p><strong>Trial License</strong></p>
<ul>
<li><p>Valid for <strong>30 days</strong>.</p>
</li>
<li><p>Lets you try all the “Enterprise” features.</p>
</li>
<li><p>You <em>must</em> send telemetry (more on that soon) while the trial is active.</p>
</li>
</ul>
</li>
<li><p><strong>Enterprise License (Paid)</strong></p>
<ul>
<li><p>This is CockroachDB’s “premium / fully paid” version.</p>
</li>
<li><p>You can pick the kind of license based on your environment: “Production”, “Pre-production”, or “Development.”</p>
</li>
<li><p>Companies with more than <strong>$10 million in annual revenue</strong> need to pay for this license.</p>
</li>
<li><p>There <em>are</em> discounts, startup perks, or “free” versions for smaller companies (more below).</p>
</li>
</ul>
</li>
<li><p><strong>Enterprise Free License</strong></p>
<ul>
<li><p>This is the magic one for early-stage companies or startups: it has exactly the same features as the paid Enterprise license. But it’s free if your business makes <strong>under $10 million per year</strong>.</p>
</li>
<li><p>You <em>do</em> need to renew it each year.</p>
</li>
<li><p>Support for this “Free” license is <strong>community-level</strong> (forums, docs), not paid enterprise.</p>
</li>
</ul>
</li>
</ol>
<p><strong>N.B.:</strong> To keep your free license active and <em>not</em> get throttled, CockroachDB requires telemetry. Telemetry means your cluster sends some usage data back to Cockroach Labs. And no, they’re not “stealing your data”. Here’s what that actually means:</p>
<ul>
<li><p>Telemetry includes basic usage stats, cluster health info, and configuration metrics.</p>
</li>
<li><p>It does NOT send your business data, queries, or personal customer data.</p>
</li>
<li><p>It helps Cockroach Labs <em>make sure the free license is used responsibly</em>, and helps them build better features.</p>
</li>
<li><p>If you stop sending telemetry, your cluster will be throttled (slowed down) after 7 days.</p>
</li>
</ul>
<h3 id="heading-how-to-apply-for-the-free-enterprise-license">How to Apply for the Free Enterprise License</h3>
<p>Here’s how you can try to get that free enterprise license:</p>
<ol>
<li><p>Go to the CockroachDB Cloud Console (sign up if you don’t have an account). Then click the “Organization” link in the menu and select “Enterprise Licenses” from the dropdown.</p>
</li>
<li><p>Click the Create License button → Enable the “Find out if my company qualifies for an Enterprise Free license” option.</p>
</li>
<li><p>Fill in the form: your name, company name, job function, and the intended use of the license.</p>
</li>
<li><p>Click “Continue”.</p>
</li>
</ol>
<p>You should see the success message “Based on your company's intended use, you qualify for an Enterprise Free license.” Now agree to the terms and conditions, then click the “Generate License key” button.</p>
<p>Learn more about CockroachDB licenses here 👉🏾 <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/licensing-faqs">https://www.cockroachlabs.com/docs/stable/licensing-faqs</a></p>
<h3 id="heading-adding-your-license-to-the-cockroachdb-cluster">Adding Your License to the CockroachDB Cluster</h3>
<p>Now that you’ve gotten your shiny new CockroachDB license (whether it’s the Free one or the Enterprise one), the next step is…actually <em>using it</em>.</p>
<p>Let’s add it to your CockroachDB cluster so it stops shouting “THROTTLED!” at you every time you open the dashboard :)</p>
<p>We’ll do this by updating our CockroachDB Helm configuration.</p>
<h4 id="heading-step-1-update-your-cockroachdb-productionyml">Step 1: Update Your <code>cockroachdb-production.yml</code></h4>
<p>Open your production Helm values file, and inside the <code>init</code> section, add the following:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">init:</span>
<span class="hljs-string">...</span>
    <span class="hljs-attr">provisioning:</span>
        <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
        <span class="hljs-attr">clusterSettings:</span>
          <span class="hljs-attr">cluster.organization:</span> <span class="hljs-string">"'&lt;ORGANIZATION&gt;'"</span> <span class="hljs-comment"># Enter the name of your organization here </span>
          <span class="hljs-attr">enterprise.license:</span> <span class="hljs-string">"'&lt;LICENSE&gt;'"</span> <span class="hljs-comment"># Enter your CockroachDB Enterprise license key here</span>
<span class="hljs-string">...</span>
</code></pre>
<p>Now replace:</p>
<ul>
<li><p><code>&lt;ORGANIZATION&gt;</code> with the name of your startup, business, project, or company</p>
</li>
<li><p><code>&lt;LICENSE&gt;</code> with the exact license string CockroachDB gave you</p>
</li>
</ul>
<p>That’s it – super simple.</p>
<h4 id="heading-step-2-apply-the-changes-with-helm">Step 2: Apply the Changes With Helm</h4>
<p>Run your usual Helm upgrade command:</p>
<pre><code class="lang-bash">helm upgrade cockroachdb -f cockroachdb-production.yml cockroachdb/cockroachdb
</code></pre>
<h4 id="heading-step-3-confirm-the-license-was-added-correctly">Step 3: Confirm the License Was Added Correctly</h4>
<p>Now let’s double-check everything worked.</p>
<ol>
<li><p>Connect as the <code>root</code> user: You can connect using Beekeeper Studio (like we’ve been doing).</p>
</li>
<li><p>Run this query to check your license:</p>
</li>
</ol>
<pre><code class="lang-sql"><span class="hljs-keyword">SHOW</span> CLUSTER SETTING enterprise.license;
</code></pre>
<p>If everything went well, you should see your license key printed out in the results.</p>
<h4 id="heading-step-4-make-sure-telemetry-is-enabled-important">Step 4: Make Sure Telemetry Is Enabled (Important!)</h4>
<p>Remember: without telemetry enabled, your cluster will still get throttled, even if you have a valid license 🥲</p>
<p>Run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SHOW</span> CLUSTER SETTING diagnostics.reporting.enabled;
</code></pre>
<p>If the result says “true”, you're good! Telemetry is on, CockroachDB can verify your license, and your cluster will behave normally without slowing down.</p>
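<p>If it comes back as “false” instead, you can turn telemetry back on as the <code>root</code> user with the matching <code>SET</code> statement:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SET</span> CLUSTER SETTING diagnostics.reporting.enabled = <span class="hljs-literal">true</span>;
</code></pre>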
<h2 id="heading-conclusion-amp-next-steps"><strong>Conclusion &amp; Next Steps ✨</strong></h2>
<p>Throughout this book, you’ve gone from “What even is CockroachDB?” to actually running your <strong>own secure, production-ready database</strong> on Kubernetes – and that’s a BIG deal. 🎉</p>
<p>You learned why CockroachDB is special, how it avoids downtime, and why it’s different from the usual databases everyone talks about.</p>
<p>Then you set up your own local environment, practiced everything safely on Minikube, and gradually built your way to a full production setup on GKE.</p>
<p>You explored CockroachDB’s dashboard, checked your cluster’s health, backed up your data to the cloud, and even learned how to keep your database fast, stable, and ready to grow when needed.</p>
<p>Finally, you deployed it on Google Cloud, secured it with encryption and certificates, and connected to it from your own PC – all step-by-step.</p>
<p>By now, you’ve basically gone from curious learner to “I can actually run this thing in production.” 🚀</p>
<p>You’ve covered a lot – and you’ve built something powerful, modern, and production-worthy. Amazing job 👏🏾😁!! And thanks for reading.</p>
<h3 id="heading-about-the-author">About the Author 👨🏾‍💻</h3>
<p>Hi, I’m Prince! I’m a DevOps engineer and Cloud architect passionate about building, deploying, architecting, and managing applications and sharing knowledge with the tech community.</p>
<p>If you enjoyed this book, you can learn more about me by exploring more of my blogs and projects on my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/">LinkedIn profile</a>, and reach out to me on <a target="_blank" href="https://x.com/POnukwili">Twitter (X)</a>. You can find more of my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/details/publications/">articles here</a> or on <a target="_blank" href="https://www.freecodecamp.org/news/author/onukwilip/">my freeCodeCamp blog</a>.</p>
<p>You can also <a target="_blank" href="https://prince-onuk.vercel.app">visit my website</a>. Let’s connect and grow together! 😊</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Load Balancing with Azure Application Gateway and Azure Load Balancer – When to Use Each One ]]>
                </title>
                <description>
                    <![CDATA[ You’ve probably heard someone mention load balancing when talking about cloud apps. Maybe even names like Azure Load Balancer, Azure Application Gateway, or something about Virtual Machines and Scale Sets. 😵‍💫 It all sounds important...but also a l... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/load-balancing-with-azure-application-gateway-and-azure-load-balancer/</link>
                <guid isPermaLink="false">6824f10a7d203c180e5ea4b2</guid>
                
                    <category>
                        <![CDATA[ Load Balancing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Azure ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Azure Application Gateway ]]>
                    </category>
                
                    <category>
                        <![CDATA[ virtual machine ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #virtual machine scale set ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Load Balancer ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Prince Onukwili ]]>
                </dc:creator>
                <pubDate>Wed, 14 May 2025 19:37:46 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747235455030/cb82bfb4-8d7b-47e5-ab31-126906f60b40.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You’ve probably heard someone mention load balancing when talking about cloud apps. Maybe even names like Azure Load Balancer, Azure Application Gateway, or something about Virtual Machines and Scale Sets. 😵‍💫</p>
<p>It all sounds important...but also a little confusing. Like, why are there so many moving parts? And what do they actually do?</p>
<p>In this guide, we’re going to break it all down – step by step – using real examples and simple language.</p>
<p>You’ll learn:</p>
<ul>
<li><p>What load balancers are (and why apps even need them)</p>
</li>
<li><p>How apps were deployed before load balancers existed (hint: everything lived on one lonely server)</p>
</li>
<li><p>How Azure Virtual Machines work – and how they let you scale up your apps</p>
</li>
<li><p>What Virtual Machine Scale Sets are, and how they help handle sudden traffic spikes</p>
</li>
<li><p>The differences between Azure Load Balancer and Azure Application Gateway, and when to use each</p>
</li>
</ul>
<p>By the end, you won’t just understand what these tools do – you’ll know <em>when</em> and <em>why</em> to use them in real-world scenarios.</p>
<p>Whether you’re a curious beginner, a hands-on builder, or someone just trying to wrap their head around Azure’s ecosystem, this guide is for you.</p>
<p>Ready to untangle the cloud spaghetti? Let’s go! 🍝🚀</p>
<h2 id="heading-table-of-contents">📚 Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-are-load-balancers">🧊 What Are Load Balancers?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-applications-were-deployed-before-load-balancers">🖥️ How Applications Were Deployed Before Load Balancers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-azure-virtual-machines-vms-the-building-blocks">⚙️ Azure Virtual Machines (VMs) – The Building Blocks</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-need-for-scaling-vertical-vs-horizontal">📈 The Need for Scaling – Vertical vs Horizontal</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-azure-virtual-machine-scale-sets-vmss-scaling-made-simple">🔁 Azure Virtual Machine Scale Sets (VMSS) – Scaling Made Simple</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-azure-load-balancer-spreading-the-traffic">📦 Azure Load Balancer – Spreading the Traffic</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-azure-application-gateway-smart-routing-for-modern-apps">🍴 Azure Application Gateway – Smart Routing for Modern Apps</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-azure-load-balancer-vs-azure-application-gateway">🔍 Azure Load Balancer vs Azure Application Gateway</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-use-cases-when-to-use-what">🧭</a> <a class="post-section-overview" href="#heading-use-cases-when-to-use-each-one">Use Cases: When to Use Each One</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">✅ Conclusion</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-study-further">Study Further 📚</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-about-the-author">About the Author 👨‍💻</a></p>
</li>
</ol>
<h2 id="heading-what-are-load-balancers">🧊 What Are Load Balancers?</h2>
<p>Imagine you're running a small restaurant with just one chef in the kitchen. Everything goes smoothly when you have a few customers – each order is prepared one after the other, and everyone leaves satisfied.</p>
<p>But what happens when 50 people walk in all at once?</p>
<p>🍽️ One chef can’t handle that many orders at the same time.<br>⏳ People start waiting longer.<br>😤 Some customers leave.<br>💥 The chef gets overwhelmed – and eventually burns out.</p>
<p>This is what can happen to a server (the computer running your app) when too many users try to access it at the same time.</p>
<h3 id="heading-so-what-does-a-load-balancer-do">So, What Does a Load Balancer Do?</h3>
<p>A <strong>load balancer</strong> is like a smart restaurant manager. But instead of food orders, it handles user requests – the things people do when they open your app, click buttons, or load data.</p>
<p>Let’s say you now have three chefs (servers) instead of one. The load balancer’s job is to:</p>
<ul>
<li><p>👀 Watch for incoming orders (user requests)</p>
</li>
<li><p>🧠 Decide which chef (server) is available or least busy</p>
</li>
<li><p>🍽️ Send that request to the right one</p>
</li>
<li><p>🔁 Repeat this over and over, making sure things stay fast and smooth</p>
</li>
</ul>
<p>So in simple terms, a load balancer takes all the incoming traffic to your app and distributes it across multiple servers so no single server gets overloaded – cool, right? 🙂</p>
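<p>If you want to picture the simplest version of this in code, here’s a tiny Python sketch of round-robin balancing (the “chef” names are made up, and a real load balancer like Azure’s does this at the network level, not in application code):</p>

```python
from itertools import cycle

# Three "chefs" (backend servers) and an endless round-robin rotation
servers = ["chef-1", "chef-2", "chef-3"]
rotation = cycle(servers)

# Six incoming "orders" (user requests) get spread evenly across them
assignments = [next(rotation) for _ in range(6)]
print(assignments)
# ['chef-1', 'chef-2', 'chef-3', 'chef-1', 'chef-2', 'chef-3']
```

<p>Round-robin is only one strategy – real load balancers can also pick the least-busy backend or route by health checks.</p>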
<h3 id="heading-why-were-load-balancers-introduced">Why Were Load Balancers Introduced?</h3>
<p>Back in the early days, many applications were hosted on just one machine – called a Single Server Deployment.</p>
<p>That was okay when you had a small number of users. But once things started to grow – more users, more actions, more data – single servers became a bottleneck:</p>
<ul>
<li><p>They could only handle a limited number of requests.</p>
</li>
<li><p>If they went down, your entire app would stop working.</p>
</li>
<li><p>Scaling (adding more power) was expensive and manual.</p>
</li>
</ul>
<p>💡 Enter <strong>load balancers</strong> – designed to solve this by making it possible to:</p>
<ul>
<li><p>Spread traffic across multiple servers (so no one server crashes under pressure),</p>
</li>
<li><p>Replace or restart servers without downtime,</p>
</li>
<li><p>Add or remove servers as needed, depending on how busy your app is (this is called <strong>scaling</strong>).</p>
</li>
</ul>
<h3 id="heading-a-simple-use-case-scenario">A Simple Use-Case Scenario</h3>
<p>Let’s say you're building an online store – your own mini Amazon. At first, you host your app on one Azure Virtual Machine. Things are great. But one day, you run a huge promo and suddenly…thousands of people flood in to browse, shop, and check out.</p>
<p>Your single VM starts lagging.</p>
<p>Orders fail. People complain. Your dream app? Crashing fast. 💥</p>
<p>So what do you do?</p>
<p>You spin up two more VMs to help out – but now you’ve got another problem: <em>How do you divide the traffic between the three?</em></p>
<p>This is where the load balancer steps in. It:</p>
<ul>
<li><p>Looks at every incoming user request</p>
</li>
<li><p>Figures out which VM is available and least busy</p>
</li>
<li><p>Sends the request there</p>
</li>
<li><p>Keeps rotating requests in real-time</p>
</li>
</ul>
<p>And the result?<br>✅ No single VM gets overwhelmed<br>✅ Your app stays fast and responsive<br>✅ Users are happy (and buying stuff again!)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746980088916/41be330b-8d5b-4709-b07d-3f1a19d641e7.png" alt="Load balancer illustration" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-applications-were-deployed-before-load-balancers">🖥️ How Applications Were Deployed Before Load Balancers</h2>
<p>Before cloud tools like load balancers came along, the typical way to run an application was pretty simple: You’d deploy the entire app on a single server, like running a small business from one tiny shop.</p>
<h3 id="heading-first-things-first-whats-a-server">First Things First: What’s a Server?</h3>
<p>Think of a server as a special computer that’s always connected to the internet. Its job is to “serve” your app to people when they visit your website, open your app, or use your service.</p>
<p>In cloud platforms like Azure, we usually call these Virtual Machines (VMs) – basically, software-powered servers you can spin up with a few clicks.</p>
<h3 id="heading-monoliths-vs-microservices">Monoliths vs Microservices</h3>
<p>Now, applications come in different “shapes.” The two most common are:</p>
<ul>
<li><p><strong>Monoliths</strong>: Everything is bundled together into one big app. All the code – from user login to shopping cart to checkout – lives in a single unit.</p>
</li>
<li><p><strong>Microservices</strong>: The app is broken into smaller, independent apps (services). Each service does one job – like login, payments, orders – and runs separately.</p>
</li>
</ul>
<h4 id="heading-how-were-these-apps-deployed">How Were These Apps Deployed?</h4>
<p>Whether it was a monolith or a bunch of microservices, they were all usually deployed on a single server (VM).</p>
<p>For monoliths, you just ran the entire app directly on the server. For microservices: you'd run each service in a separate space on that same server, using <strong>containers</strong>.</p>
<h4 id="heading-wait-whats-a-container">Wait — What’s a Container?</h4>
<p>A container is like a mini-computer <em>inside</em> a computer. It has everything an app needs to run – code, tools, settings – and it keeps each app isolated from the others.</p>
<p>Why use containers?</p>
<ul>
<li><p>You can run multiple services on the same server without their underlying software (software needed for each app to run) interfering with each other.</p>
</li>
<li><p>It’s faster and more efficient than installing everything directly on the server.</p>
</li>
<li><p>They make moving apps between environments (for example, test → production) super smooth (no more “But it works on my machine…”).</p>
</li>
</ul>
<p>Popular tools like Docker make working with containers easy.</p>
<h4 id="heading-connecting-it-all-together-domains-subdomains-and-reverse-proxies">Connecting It All Together: Domains, Subdomains, and Reverse Proxies</h4>
<p>When your app lives on a server, you want people to be able to reach it. That’s where <strong>domain names</strong> come in.</p>
<ul>
<li><p>Your server has a public IP address – a set of numbers like <code>102.80.1.23</code>, that gives it a unique identifier on the public internet</p>
</li>
<li><p>But instead of asking users to type numbers, you link that IP to a domain name, like <code>mycoolapp.com</code></p>
</li>
</ul>
<p>If your app has microservices, you might even assign <strong>subdomains</strong> like:</p>
<ul>
<li><p><code>api.mycoolapp.com</code> for the backend</p>
</li>
<li><p><code>dashboard.mycoolapp.com</code> for the user interface</p>
</li>
<li><p><code>payments.mycoolapp.com</code> for payments</p>
</li>
</ul>
<p>To manage all this, you’d use a <strong>reverse proxy</strong> (like Nginx or Apache). It listens on the main domain and subdomains, and forwards traffic to the right app or service.</p>
<p>Example:</p>
<ul>
<li><p>Someone visits <code>dashboard.mycoolapp.com</code></p>
</li>
<li><p>The reverse proxy checks the domain and forwards the request to the correct container running the dashboard service</p>
</li>
</ul>
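<p>At its core, that forwarding step is just a lookup from host name to backend. Here’s a toy Python sketch of the decision a reverse proxy makes (the subdomains match the examples above, but the container ports are hypothetical):</p>

```python
# Map each (sub)domain to the container that should receive the request
routes = {
    "mycoolapp.com": "monolith container on port 3000",
    "api.mycoolapp.com": "backend container on port 8000",
    "dashboard.mycoolapp.com": "dashboard container on port 8080",
    "payments.mycoolapp.com": "payments container on port 9000",
}

def forward(host: str) -> str:
    """Pick a backend for the incoming Host header, like Nginx's server blocks do."""
    return routes.get(host, "404: no such site")

print(forward("dashboard.mycoolapp.com"))  # dashboard container on port 8080
```

<p>A real reverse proxy like Nginx does the same lookup via its <code>server_name</code> configuration, plus TLS termination, buffering, and more.</p>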
<p>And to help with all of this setup – from deploying containers to configuring reverse proxies – there are developer-friendly tools like <a target="_blank" href="https://coolify.io">Coolify</a>. Coolify is an open-source platform that makes it super easy for developers and DevOps teams to:</p>
<ul>
<li><p>Deploy apps in containers</p>
</li>
<li><p>Set up domains and subdomains</p>
</li>
<li><p>Configure reverse proxies – all from a clean dashboard, no complex terminal commands needed</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746979943646/a6525a09-f44a-4e00-a945-7bded3483b0d.jpeg" alt="Coolify dashboard example" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>All this was set up on ONE SERVER/VM. But here’s the catch: when that one server got overloaded or went down…💥 everything stopped.</p>
<p>That’s why we needed a better way. And that's where <strong>scaling</strong> and <strong>load balancing</strong> came in – to keep apps running smoothly, no matter the traffic.</p>
<h2 id="heading-azure-virtual-machines-vms-the-building-blocks">⚙️ Azure Virtual Machines (VMs) – The Building Blocks</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746980948928/eb6a7fb2-7432-42ed-8cbd-bff6c8250d4e.jpeg" alt="Virtual Machine illustration" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>When it comes to running apps in the cloud, <strong>Virtual Machines (VMs)</strong> are the basic building blocks – kind of like renting an apartment in a giant digital skyscraper.</p>
<p>You don’t need to buy the whole building (aka physical servers), you just rent the space you need, when you need it.</p>
<h3 id="heading-what-exactly-is-a-virtual-machine">What Exactly Is a Virtual Machine?</h3>
<p>A Virtual Machine is a software-based computer that runs inside a real, physical computer (a server) – hosted in a data center, like those run by Microsoft Azure.</p>
<p>It looks and behaves like a normal computer:</p>
<ul>
<li><p>It has an operating system (Windows, Linux)</p>
</li>
<li><p>You can install apps</p>
</li>
<li><p>It has memory (RAM), storage (disks), and CPU</p>
</li>
</ul>
<p>But the best part? You don’t need to worry about the hardware. Azure takes care of that behind the scenes – all you do is say:</p>
<blockquote>
<p>“Hey Azure, give me a Linux VM with 4GB RAM and 2 CPUs.”</p>
</blockquote>
<p>And boom 💥 – it spins up in minutes.</p>
<h3 id="heading-why-use-a-vm">Why Use a VM?</h3>
<p>Let’s say you’ve built a web app – it’s just a simple blog. You want to deploy it and make it accessible to the world.</p>
<p>Here's what you can do with a VM:</p>
<ul>
<li><p>Set it up with your favorite OS (for example, Ubuntu)</p>
</li>
<li><p>Install web servers like Nginx or Apache</p>
</li>
<li><p>Deploy your app</p>
</li>
<li><p>Bind it to your domain name</p>
</li>
<li><p>Let the world visit your blog at <a target="_blank" href="http://myawesomeblog.com"><code>myawesomeblog.com</code></a></p>
</li>
</ul>
<p>It’s your own personal environment – no sharing, full control.</p>
<h2 id="heading-the-need-for-scaling-vertical-vs-horizontal">📈 The Need for Scaling – Vertical vs Horizontal</h2>
<p>Imagine your app is growing. At first, it’s just a few users. Then a few hundred. Then thousands are logging in, placing orders, chatting, uploading photos – all at once 😮</p>
<p>Suddenly, your server (VM) is under pressure. It’s like trying to pour a flood through a straw.</p>
<h3 id="heading-so-what-do-you-do-when-one-server-isnt-enough">So, What Do You Do When One Server Isn’t Enough?</h3>
<p>This is where scaling comes in – the art of upgrading your app’s infrastructure to keep up with traffic.</p>
<p>There are two main ways to scale:</p>
<h4 id="heading-option-1-vertical-scaling-aka-scaling-up">🧱 Option 1: Vertical Scaling (aka Scaling Up)</h4>
<p>You take your existing VM and give it more power:</p>
<ul>
<li><p>Add more CPUs 🧠</p>
</li>
<li><p>Increase RAM 🧵</p>
</li>
<li><p>Add faster disks ⚡</p>
</li>
</ul>
<p>Think of it like upgrading from a regular car to a sports car. It’s the same vehicle, just faster and stronger.</p>
<p><strong>Pros:</strong></p>
<ul>
<li><p>Simple to do</p>
</li>
<li><p>No major changes to your app setup</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p>There’s a limit to how much you can upgrade</p>
</li>
<li><p>Still a single point of failure: if the VM crashes, everything goes down 😬</p>
</li>
</ul>
<h4 id="heading-option-2-horizontal-scaling-aka-scaling-out">🧩 Option 2: Horizontal Scaling (aka Scaling Out)</h4>
<p>Instead of boosting one server, you add more servers – multiple VMs running copies of your app.</p>
<p>Now:</p>
<ul>
<li><p>Users can be distributed across all these VMs</p>
</li>
<li><p>If one goes down, others keep serving traffic</p>
</li>
<li><p>You can <em>dynamically</em> add or remove VMs based on traffic</p>
</li>
</ul>
<p>It’s like opening more checkout counters in a busy supermarket 🛒</p>
<p><strong>Pros:</strong></p>
<ul>
<li><p>The load is evenly distributed. For example, if one server previously handled 100% of the traffic, adding two more servers would result in the traffic being split into approximately 33% to 34% for each server.</p>
</li>
<li><p>Improves both performance and reliability</p>
</li>
<li><p>You can scale based on real-time demand (that is, traffic inflow)</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p>Needs something to split traffic between VMs – Load Balancers</p>
</li>
<li><p>More expensive. You pay the price of one VM multiplied by the number of VMs you run – for example, 3 VMs at $30 each comes to $90 at the end of the month</p>
</li>
</ul>
<h3 id="heading-quick-real-world-example">Quick Real-World Example</h3>
<p>Let’s say you’ve launched an e-commerce site for sneakers 👟 Traffic spikes during a big sale? Your vertical scaling (bigger VM) might choke.</p>
<p>But with horizontal scaling:</p>
<ul>
<li><p>You spin up 5 VMs across different regions</p>
</li>
<li><p>Traffic is shared between them</p>
</li>
<li><p>If one VM slows down, others handle the load</p>
</li>
</ul>
<h4 id="heading-so-remember">So, remember 👇🏾</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Scaling Type</td><td>Description</td><td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td>🧱 Vertical Scaling</td><td>Make 1 VM more powerful (adding more CPU power, SSD, RAM, bandwidth, and so on)</td><td>Easy setup, fewer changes</td><td>Hardware limits, 1 point of failure - If that 1 server/VM goes down, so does your app :(</td></tr>
<tr>
<td>🧩 Horizontal Scaling</td><td>Add more VMs to handle traffic</td><td>Flexible, reliable</td><td>Needs traffic distribution logic (Load Balancer). Usually more expensive (the price of 1 VM times the number of VMs)</td></tr>
</tbody>
</table>
</div><h2 id="heading-azure-virtual-machine-scale-sets-vmss-scaling-made-simple">🔁 Azure Virtual Machine Scale Sets (VMSS) – Scaling Made Simple</h2>
<p>Okay – so we’ve talked about <strong>horizontal scaling</strong>: adding multiple VMs to handle growing traffic. Sounds great, right?</p>
<p>But here’s the thing: manually spinning up and configuring 5, 10, or 100 VMs... every time your app gets busy? Yeah, that’s not fun 🙃</p>
<h3 id="heading-enter-virtual-machine-scale-sets-vmss">Enter: Virtual Machine Scale Sets (VMSS)</h3>
<p>VMSS is Azure’s way of automating horizontal scaling. Instead of creating each VM one by one, you define a template, and Azure takes care of the rest:</p>
<ul>
<li><p>How many VMs to start with</p>
</li>
<li><p>How to configure them (OS, apps, settings) ⚙️</p>
</li>
<li><p>When to add or remove VMs based on traffic 📈📉</p>
</li>
</ul>
<h3 id="heading-a-simple-analogy">A Simple Analogy 🧃</h3>
<p>Think of VMSS like a juice dispenser at a party:</p>
<ul>
<li><p>At first, it pours into 2 cups (VMs)</p>
</li>
<li><p>If 10 guests show up? It starts filling 5 cups</p>
</li>
<li><p>Party slows down? Back to 2 cups again</p>
</li>
</ul>
<p>You never have to refill manually – the dispenser adjusts on its own. 🎉</p>
<h3 id="heading-how-it-works-without-the-jargon">How It Works (Without the Jargon 😌)</h3>
<ol>
<li><p><strong>You set the rules:</strong> “If CPU usage goes above 70%, add 2 more VMs.”</p>
</li>
<li><p><strong>Azure watches traffic and adjusts the number of VMs</strong> automatically.</p>
</li>
<li><p><strong>All VMs are identical</strong> – like clones, all running the same app setup.</p>
</li>
<li><p><strong>It works with Azure Load Balancer</strong> to spread traffic across all these VMs smoothly.</p>
</li>
</ol>
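<p>To make step 1 concrete, here’s a tiny sketch of an autoscale rule written as a decision function. The thresholds, step sizes, and function name are invented for illustration – in Azure you declare rules like this declaratively, and the platform evaluates them for you:</p>

```python
def autoscale_decision(cpu_percent, current_vms, min_vms=2, max_vms=10):
    """Toy version of a VMSS autoscale rule:
    scale out by 2 VMs above 70% CPU, scale in by 1 below 30%."""
    if cpu_percent > 70 and current_vms < max_vms:
        return min(current_vms + 2, max_vms)  # add up to 2 VMs
    if cpu_percent < 30 and current_vms > min_vms:
        return max(current_vms - 1, min_vms)  # remove 1 VM
    return current_vms                        # steady state

print(autoscale_decision(85, 3))  # high lunch-hour load → 5
print(autoscale_decision(20, 5))  # quiet afternoon → 4
```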
<h3 id="heading-real-life-example-food-delivery-app">Real-Life Example: Food Delivery App 🍕📱</h3>
<p>You’ve built an app where users order food. During lunch and dinner, traffic explodes.</p>
<p>💡 With VMSS:</p>
<ul>
<li><p>You start with 3 VMs in the morning</p>
</li>
<li><p>At 12PM, Azure sees high CPU usage, so it spins up 5 more VMs</p>
</li>
<li><p>At 3PM, traffic drops, so Azure removes the extra VMs</p>
</li>
</ul>
<p>You only pay for what you use. And users get a smooth experience – no delays, no crashes 👌🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746982520998/7fe3c997-fc8f-418a-861b-e999905ca43c.png" alt="Auto-scaling illustration" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-azure-load-balancer-spreading-the-traffic">📦 Azure Load Balancer – Spreading the Traffic</h2>
<p>By now, you know that your app can live on multiple Virtual Machines (VMs), and that you can scale them easily using Virtual Machine Scale Sets (VMSS).</p>
<p>But here's the big question: when users start accessing your app – hundreds, even thousands at once – how do you make sure that all that traffic is fairly and efficiently distributed across those VMs?</p>
<p>You don’t want one VM to be overwhelmed while others are just chilling. You need a middleman – something smart enough to balance the load.</p>
<p>That’s where <strong>Azure Load Balancer</strong> steps in. It’s Azure’s way of saying, “Don’t worry, I got this” when traffic starts rolling in.</p>
<h3 id="heading-so-what-is-azure-load-balancer">🏢 So, What Is Azure Load Balancer?</h3>
<p>Azure Load Balancer is a <strong>traffic director</strong>. It takes incoming traffic from the internet (or even internal sources within your network) and intelligently spreads it across multiple backend machines – usually VMs.</p>
<p>It's like having a well-trained receptionist who routes every customer to the next available agent, so no one waits too long and no one gets overwhelmed 😃.</p>
<p>And the best part? This entire process happens in the background – fast, silent, and seamless. Users visiting your app have no idea a traffic manager is working behind the scenes. They just see a fast, responsive experience.</p>
<h3 id="heading-the-frontend-ip-your-apps-public-face">🌐 The Frontend IP – Your App’s Public Face</h3>
<p>Every Azure Load Balancer is tied to a <strong>Frontend IP</strong> – the entry-point IP address of your application, the one users connect to when they open <code>www.yourapp.com</code>.</p>
<p>This IP acts as the entry point. All user traffic comes through it first. But the Load Balancer doesn’t actually run your app. Instead, it accepts the traffic and forwards it to one of the VMs in the backend pool (we’ll get to that shortly).</p>
<p>You can configure this Frontend IP to be either public (accessible over the internet) or private (used for internal traffic within your cloud network – say, between microservices or internal tools).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747055268951/5afbb738-d00d-4f49-9709-2fa1fe7cffdd.png" alt="Frontend IP address illustration" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-backend-pool-where-the-magic-happens">🗂️ Backend Pool – Where the Magic Happens</h3>
<p>Behind every Azure Load Balancer is a <strong>backend pool</strong> – a group of VMs (or VM Scale Set instances) where your actual app is running. These are the real workers, doing all the heavy lifting.</p>
<p>When traffic hits the Frontend IP, the Load Balancer takes that request and hands it off to one of the VMs in the backend pool.</p>
<p>But it doesn’t just randomly pick one. It checks a few things first – like whether the VM is healthy, whether it's already busy, and what rules you’ve set.</p>
<p>Each VM in the pool typically runs the same app or service. This means any of them can handle any incoming request, which is what makes load balancing possible in the first place.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747055337014/e831056d-7c0c-49d9-b05a-6d3dbe3edc76.png" alt="Backend pool illustration" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
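<p>To sketch how that hand-off works: Azure Load Balancer’s default distribution mode hashes each connection’s 5-tuple (source IP, source port, destination IP, destination port, protocol) to pick a backend. The snippet below is an illustrative approximation of that idea, not Azure’s actual algorithm – all names and addresses are made up:</p>

```python
import hashlib

def pick_backend(src_ip, src_port, dst_ip, dst_port, proto, healthy_vms):
    """Pick a VM by hashing the connection's 5-tuple, so packets from the
    same flow keep landing on the same VM while different flows spread
    across the pool (illustrative, not Azure's real implementation)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return healthy_vms[int.from_bytes(digest[:4], "big") % len(healthy_vms)]

pool = ["vm-0", "vm-1", "vm-2"]
# The same flow always maps to the same VM:
first = pick_backend("203.0.113.7", 51000, "52.160.100.5", 80, "tcp", pool)
again = pick_backend("203.0.113.7", 51000, "52.160.100.5", 80, "tcp", pool)
assert first == again and first in pool
```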
<h3 id="heading-health-probes-keeping-tabs-on-the-vms">🩺 Health Probes – Keeping Tabs on the VMs</h3>
<p>Now, how does the Load Balancer know which VM is healthy or not? This is where <strong>health probes</strong> come in. Think of them as regular check-ups.</p>
<p>You configure the Load Balancer to periodically "ping" each VM – maybe by hitting a specific URL (like <code>/health</code>) or a certain port (like 80 for HTTP). If a VM doesn’t respond correctly, Azure marks it as unhealthy and temporarily removes it from the rotation.</p>
<p>This ensures users never get routed to a broken or unresponsive instance of your app. And once the VM becomes healthy again, it's automatically added back to the pool.</p>
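<p>Here’s a hedged sketch of that check-up logic. The function name and the two-failures threshold are invented for illustration – Azure lets you configure the probe interval and unhealthy threshold, but doesn’t expose this exact function:</p>

```python
def update_pool(probe_results, fail_threshold=2):
    """Given recent probe results per VM (True = responded OK),
    return the VMs that should stay in rotation. A VM is pulled out
    once it fails `fail_threshold` probes in a row; a VM with fewer
    probes than the threshold is kept until we know more."""
    in_rotation = []
    for vm, results in probe_results.items():
        recent = results[-fail_threshold:]
        if len(recent) < fail_threshold or any(recent):
            in_rotation.append(vm)  # still healthy enough to serve traffic
    return in_rotation

probes = {
    "vm-0": [True, True, True],    # healthy
    "vm-1": [True, False, False],  # 2 failures in a row → removed
    "vm-2": [False, True, True],   # recovered → back in the pool
}
print(update_pool(probes))  # → ['vm-0', 'vm-2']
```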
<h3 id="heading-load-balancing-rules-who-gets-what">⚖️ Load Balancing Rules – Who Gets What?</h3>
<p>Next, we have <strong>Load Balancing Rules</strong>. These are the instructions that tell Azure Load Balancer exactly how to behave.</p>
<p>You can define rules like:</p>
<ul>
<li><p>“Forward all HTTP (port 80) traffic to backend pool VMs on port 80”</p>
</li>
<li><p>“Forward HTTPS (port 443) traffic to VMs on port 443”</p>
</li>
<li><p>“Only route traffic to healthy VMs”</p>
</li>
</ul>
<p>These rules make Azure Load Balancer highly customizable. You get to decide how traffic flows, which protocols to support, and how to handle backend ports. It's like customizing the rules of a relay race – who gets the baton and when.</p>
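<p>Conceptually, each rule is just a mapping from a frontend protocol and port to a backend port. A minimal sketch – the names are illustrative, not the actual Azure API:</p>

```python
# Each rule maps a frontend (protocol, port) pair to a backend port.
rules = {
    ("tcp", 80):  80,   # HTTP straight through
    ("tcp", 443): 443,  # HTTPS straight through
}

def forward(protocol, frontend_port):
    """Return the backend port a request should be forwarded to,
    or None if no rule matches (the connection is simply dropped)."""
    return rules.get((protocol, frontend_port))

print(forward("tcp", 80))    # → 80
print(forward("tcp", 8080))  # → None
```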
<h3 id="heading-real-world-example-sneaker-sale-rush">👟 Real-World Example: Sneaker Sale Rush</h3>
<p>Imagine you're running an online sneaker store at <code>www.sneakerblast.com</code>. You’re launching a flash sale, and thousands of users are hitting your website all at once.</p>
<p>Thanks to your Azure Load Balancer, here’s what happens:</p>
<ol>
<li><p>All those users land on your Frontend IP, the public face of your site.</p>
</li>
<li><p>The Load Balancer accepts the traffic and checks the health probes of all VMs in the backend pool.</p>
</li>
<li><p>Based on its rules, it forwards each user to a healthy, available VM.</p>
</li>
<li><p>One VM might serve a user in Lagos, another in Nairobi, another in Accra – all seamlessly.</p>
</li>
</ol>
<p>If one VM crashes or lags? The Load Balancer detects it instantly and stops routing traffic to it until it’s back online.</p>
<p>That’s smooth traffic management without any manual effort.</p>
<h2 id="heading-azure-application-gateway-smart-routing-for-modern-apps">🍴 Azure Application Gateway – Smart Routing for Modern Apps</h2>
<p>So far, we’ve seen how Azure Load Balancer helps you split traffic across multiple VMs running a single service – like a monolithic app or a web frontend.</p>
<p>Let’s say you have a web application deployed on a VM. It listens on port 80, and you’ve scaled it into 3 instances. The Azure Load Balancer takes requests from the internet and spreads them across all 3 instances of the same service. Easy, right?</p>
<p>You can even link the Load Balancer’s public IP address to your domain – like <code>mydomain.com</code> – so users can visit your site normally.</p>
<h3 id="heading-but-what-if-you-have-multiple-services">🧠 But What If You Have <em>Multiple</em> Services?</h3>
<p>Now imagine you’ve gone beyond just one app. You’re building something more modern, like a set of microservices.</p>
<p>You now have:</p>
<ul>
<li><p>A payment service listening on port 5000</p>
</li>
<li><p>An authentication service on port 6000</p>
</li>
<li><p>A purchase service on port 7000</p>
</li>
</ul>
<p>All deployed across the same VMs (or Virtual Machine Scale Set), just on different ports.</p>
<p>Here’s the problem: an Azure Load Balancer is designed to route traffic to <em>one</em> backend pool – basically one service – on one port. If you tie it to <code>mydomain.com</code>, it can only send traffic to one of your microservices. 😬</p>
<p>So… what do you do?</p>
<p>You might think: “Let me just create a separate Load Balancer for each service!” 🤕</p>
<p>But that means:</p>
<ul>
<li><p>You’ll have to pay for multiple load balancers</p>
</li>
<li><p>You’ll end up managing 3–5 public IP addresses</p>
</li>
<li><p>You might even need to buy multiple domains like <code>mypayment.com</code>, <code>myauth.com</code>, and so on to route users properly</p>
</li>
</ul>
<p>Yikes. That’s impractical, messy, <em>and</em> expensive 😖💸</p>
<h3 id="heading-enter-azure-application-gateway">🎉 Enter Azure Application Gateway</h3>
<p><strong>Azure Application Gateway</strong> solves this problem beautifully. It’s designed to route traffic intelligently – not just to one service, but to multiple services using just one gateway.</p>
<p>It works like this:</p>
<ol>
<li><p>You create one public-facing frontend IP (like <code>52.160.100.5</code>)</p>
</li>
<li><p>You link that IP address to your main domain, for example <code>mydomain.com</code></p>
</li>
<li><p>Then, you define multiple backend pools – one for each service:</p>
<ul>
<li><p>Payment service (port 5000)</p>
</li>
<li><p>Auth service (port 6000)</p>
</li>
<li><p>Purchase service (port 7000)</p>
</li>
</ul>
</li>
<li><p>Next, you set up routing rules that decide how to forward each request.</p>
</li>
</ol>
<h3 id="heading-two-ways-to-route-with-application-gateway">✨ Two Ways to Route with Application Gateway</h3>
<p>You can configure <strong>smart routing</strong> based on:</p>
<ul>
<li><p><strong>URL paths</strong>:</p>
<ul>
<li><p><code>mydomain.com/payment</code> → Payment service</p>
</li>
<li><p><code>mydomain.com/auth</code> → Auth service</p>
</li>
</ul>
</li>
<li><p><strong>Subdomains</strong> (host headers):</p>
<ul>
<li><p><code>payment.mydomain.com</code> → Payment service</p>
</li>
<li><p><code>auth.mydomain.com</code> → Auth service</p>
</li>
</ul>
</li>
</ul>
<p>This way, all your services share one public IP and one domain – super clean, super efficient 🙌🏾</p>
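<p>Both routing styles can be sketched in a single lookup – host (subdomain) rules checked first, then URL-path prefixes. The pool names and ports mirror the example above, but the function itself is invented for illustration; the real gateway evaluates routing rules you configure, not Python:</p>

```python
def route(host, path):
    """Toy Application Gateway routing: match on subdomain (host header)
    first, then fall back to URL-path prefixes."""
    host_rules = {
        "payment.mydomain.com": ("payment-pool", 5000),
        "auth.mydomain.com": ("auth-pool", 6000),
    }
    path_rules = {
        "/payment": ("payment-pool", 5000),
        "/auth": ("auth-pool", 6000),
    }
    if host in host_rules:
        return host_rules[host]
    for prefix, pool in path_rules.items():
        if path.startswith(prefix):
            return pool
    return ("default-pool", 80)  # fallback backend

print(route("mydomain.com", "/auth/login"))  # → ('auth-pool', 6000)
print(route("payment.mydomain.com", "/"))    # → ('payment-pool', 5000)
```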
<h3 id="heading-real-life-scenario-lets-break-it-down">🤓 Real-Life Scenario (Let’s Break It Down)</h3>
<p>Let’s say you’re building a startup platform that has three key microservices:</p>
<ul>
<li><p><strong>Payment service</strong> that handles transactions</p>
</li>
<li><p><strong>Authentication service</strong> that handles login and user identity</p>
</li>
<li><p><strong>Purchase service</strong> that manages product ordering</p>
</li>
</ul>
<p>Each service is containerized and deployed on the same VM (or across several VMs using a VM Scale Set). But – and this is key – they all listen on <strong>different ports</strong> inside the VMs:</p>
<ul>
<li><p>Payment → port 5000</p>
</li>
<li><p>Auth → port 6000</p>
</li>
<li><p>Purchase → port 7000</p>
</li>
</ul>
<p>Now, without a smart routing solution, you’d be stuck trying to expose just one of these services using a standard Azure Load Balancer. But you need all three to be accessible from the internet – and you don’t want to pay for or manage 3 different Load Balancers 😅</p>
<p>So, what do you do?</p>
<h3 id="heading-using-azure-application-gateway-to-route-traffic-intelligently">🧠 Using Azure Application Gateway to Route Traffic Intelligently</h3>
<p>Here's how you can fix this using <strong>one</strong> Application Gateway:</p>
<ol>
<li><p>Deploy your microservices inside each VM:</p>
<ul>
<li><p>Each service runs on a specific port</p>
</li>
<li><p>All VMs in your scale set are identical (they contain all three services)</p>
</li>
</ul>
</li>
<li><p>Create backend pools in Application Gateway:</p>
<ul>
<li><p>A backend pool for the payment service (pointing to port 5000 on all VMs)</p>
</li>
<li><p>One for the auth service (port 6000)</p>
</li>
<li><p>Another for the purchase service (port 7000)</p>
</li>
</ul>
</li>
<li><p>Create routing rules:</p>
<ul>
<li><p>Option A (Path-based routing):</p>
<ul>
<li><p>Requests to <code>mydomain.com/payment</code> → go to the payment backend pool</p>
</li>
<li><p>Requests to <code>mydomain.com/auth</code> → go to the auth backend pool</p>
</li>
<li><p>Requests to <code>mydomain.com/purchase</code> → go to the purchase backend pool</p>
</li>
</ul>
</li>
<li><p>Option B (Subdomain-based routing):</p>
<ul>
<li><p><code>payment.mydomain.com</code> → payment service</p>
</li>
<li><p><code>auth.mydomain.com</code> → auth service</p>
</li>
<li><p><code>purchase.mydomain.com</code> → purchase service</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<p>You just tell the Application Gateway: “Hey, if a request comes in for this URL or subdomain, send it to this port on these VMs.” And it does just that – consistently and intelligently 🔁</p>
<h3 id="heading-so-whats-really-happening">📦 So, What’s Really Happening?</h3>
<p>Imagine a user visits <code>mydomain.com/auth</code>. Here’s what goes on behind the scenes:</p>
<ol>
<li><p>DNS resolves <code>mydomain.com</code> to your Application Gateway’s public IP</p>
</li>
<li><p>The Gateway receives the request</p>
</li>
<li><p>It checks your routing rules</p>
</li>
<li><p>It sees that <code>/auth</code> should go to the backend pool for port 6000</p>
</li>
<li><p>It forwards the request to one of the VMs running the auth service</p>
</li>
<li><p>The response goes back to the user – fast and seamless ✨</p>
</li>
</ol>
<p>This happens in milliseconds, for every request. And because the Application Gateway is aware of multiple ports and services, it can handle routing logic that a regular Load Balancer just can’t do.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747056436345/7ea97231-d2ee-4f63-aff1-50595e7c06e0.png" alt="Application Gateway Illustration" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-azure-load-balancer-vs-azure-application-gateway">🔍 Azure Load Balancer vs Azure Application Gateway</h2>
<p>By now, you've seen how both tools help route traffic in Azure – but they solve different problems.</p>
<p>Let’s break down how they compare, and when you should use one over the other 👇🏾</p>
<h3 id="heading-1-routing-logic">🛣️ 1. <strong>Routing Logic</strong></h3>
<p><strong>Azure Load Balancer</strong><br>It simply distributes incoming traffic evenly across a pool of VMs. It doesn’t care <em>what</em> the request is – it just balances the load.  </p>
<p>Imagine a delivery guy who doesn't ask questions – he just drops each package at the next available house.  </p>
<p>That’s what Azure Load Balancer does: it sends traffic to one of your servers without looking inside the request.</p>
<p><strong>Azure Application Gateway</strong><br>This is the smart one. It looks at <em>what’s inside</em> each request (like the URL path or domain) and makes intelligent decisions.</p>
<p>Just like a smarter delivery guy who looks at the address and decides where to go: "Oh! This one is for the payment office, not the main office."  </p>
<p>That’s what Application Gateway does: it reads the request (like the URL or domain name) and sends it to the right place according to the routing rules.</p>
<h3 id="heading-2-protocols-handled">🌐 2. <strong>Protocols Handled</strong></h3>
<p><strong>Load Balancer</strong><br>Works at the transport layer (Layer 4 in the OSI model). It deals with raw TCP/UDP traffic, forwarding packets for anything running over those protocols – web requests, video streams, game traffic, and so on – without inspecting them.</p>
<p><strong>Application Gateway</strong><br>Works at the application layer (Layer 7). It handles web traffic only – like websites and apps (HTTP/HTTPS) – and it can actually read what's being asked, like:</p>
<ul>
<li><p>“Go to /login”</p>
</li>
<li><p>“Go to <a target="_blank" href="http://payment.mydomain.com">payment.mydomain.com</a>”.</p>
</li>
</ul>
<p>TL;DR: Load Balancer just pushes packets. App Gateway actually <em>reads</em> your web requests.</p>
<h3 id="heading-3-use-case-scenarios">🔁 3. <strong>Use Case Scenarios</strong></h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Situation</td><td>Best Choice</td></tr>
</thead>
<tbody>
<tr>
<td>You have one big app and just want to spread users across servers</td><td>✅ Load Balancer</td></tr>
<tr>
<td>You have multiple services (like login, payment, and so on) and need to send users to the right one</td><td>✅ Application Gateway</td></tr>
<tr>
<td>You want to use subdomains (like <a target="_blank" href="http://login.mysite.com">login.mysite.com</a>)</td><td>✅ Application Gateway</td></tr>
<tr>
<td>You want to secure your website with HTTPS and Web Application Firewall (WAF)</td><td>✅ Application Gateway</td></tr>
<tr>
<td>You want the simplest setup and lowest cost</td><td>✅ Load Balancer</td></tr>
</tbody>
</table>
</div><h3 id="heading-4-ssl-termination-amp-security-features">🔐 4. <strong>SSL Termination &amp; Security Features</strong></h3>
<p><strong>Load Balancer</strong> doesn’t handle security stuff. You’ll need to secure each server yourself (for example, set up HTTPS on each one).</p>
<p><strong>Application Gateway</strong> can secure everything in one place – you upload your SSL certificate once and it takes care of HTTPS for all services.</p>
<p>It can also protect you from hackers and bad traffic with something called <strong>WAF (Web Application Firewall)</strong>, which protects your app from threats like SQL injection, XSS, and so on (you need to set this up manually).</p>
<h3 id="heading-5-pricing-and-complexity">💰 5. <strong>Pricing and Complexity</strong></h3>
<p><strong>Load Balancer</strong> is cheaper and easier to set up. Great when you don’t need anything fancy.</p>
<p><strong>Application Gateway</strong> costs more, but gives you more control and less headache when working with complex apps and microservices.</p>
<p>Trying to use Load Balancer for multiple services? You’ll need to create one Load Balancer per service, which becomes costly and impractical.</p>
<h3 id="heading-summary-table">🧠 Summary Table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>Load Balancer</td><td>Application Gateway</td></tr>
</thead>
<tbody>
<tr>
<td>Can it understand the request?</td><td>❌ No</td><td>✅ Yes</td></tr>
<tr>
<td>Can it route based on URL or subdomain?</td><td>❌ No</td><td>✅ Yes</td></tr>
<tr>
<td>Can it terminate HTTPS (SSL offload)?</td><td>❌ No</td><td>✅ Yes</td></tr>
<tr>
<td>Is it good for simple apps?</td><td>✅ Yes</td><td>✅ Yes</td></tr>
<tr>
<td>Is it good for complex apps with many services?</td><td>❌ No</td><td>✅ Yes</td></tr>
<tr>
<td>Cost</td><td>💲 Lower</td><td>💰 Higher</td></tr>
</tbody>
</table>
</div><h2 id="heading-use-cases-when-to-use-each-one">🧭 Use Cases: When to Use Each One</h2>
<p>There’s no one-size-fits-all when it comes to hosting apps in the cloud. The right setup depends on what you’re building, how much traffic you expect, and how complex your app is.</p>
<p>Let’s walk through 4 different use-case scenarios, starting from the most basic setup all the way to a fully auto-scaled and smartly routed architecture.</p>
<h3 id="heading-1-single-vm-instance-for-small-projects-or-internal-tools">1️⃣ <strong>Single VM Instance – For Small Projects or Internal Tools</strong></h3>
<p><strong>Use this when:</strong><br>You're just getting started. You’ve built a small app – maybe a portfolio, a blog, or a side project – and you want to make it live, or you’re a startup that just launched.</p>
<p><strong>How it works:</strong><br>You spin up one Azure VM, install your app on it, and open the port it listens on (for example, port 80 for a web server). You can then attach a public IP to the VM and bind it to a custom domain like <code>myawesomeapp.com</code>.</p>
<p><strong>Real-life examples:</strong></p>
<ul>
<li><p>A developer hosting a portfolio website or blog</p>
</li>
<li><p>A startup testing a new product with only a few users</p>
</li>
<li><p>An internal company tool for a small team</p>
</li>
</ul>
<p><strong>Pros:</strong></p>
<ul>
<li><p>Super simple setup</p>
</li>
<li><p>Low cost</p>
</li>
<li><p>Full control of your environment</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p>If the VM goes down, your app goes down</p>
</li>
<li><p>No auto-scaling – performance may drop with traffic spikes (the only way to adapt to increased CPU/memory usage is to manually scale the VM vertically)</p>
</li>
<li><p>You manually maintain and monitor everything</p>
</li>
</ul>
<h3 id="heading-2-manual-horizontal-scaling-for-apps-with-medium-predictable-traffic">2️⃣ <strong>Manual Horizontal Scaling – For Apps With Medium, Predictable Traffic</strong></h3>
<p><strong>Use this when:</strong><br>Your app is growing – maybe you have a few thousand users now, and performance matters. You want more than one server so your app doesn’t crash during busy hours.</p>
<p><strong>How it works:</strong><br>You manually create 2 or 3 Azure VMs with the same app setup. You then add a Load Balancer in front to split traffic evenly across them.</p>
<p><strong>Real-life examples:</strong></p>
<ul>
<li><p>A business with a customer portal</p>
</li>
<li><p>A school website that handles regular logins, lecture video streaming, and so on during class hours</p>
</li>
<li><p>An app that gets traffic mostly during the day (predictable load)</p>
</li>
</ul>
<p><strong>Pros:</strong></p>
<ul>
<li><p>Better performance and availability</p>
</li>
<li><p>Load is shared across multiple VMs</p>
</li>
<li><p>You can scale manually when needed</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p>You must manually add or remove VMs – which takes effort</p>
</li>
<li><p>Still need to monitor performance manually</p>
</li>
<li><p>No built-in automation or auto-healing</p>
</li>
</ul>
<h3 id="heading-3-auto-scaling-with-vm-scale-sets-azure-load-balancer-for-apps-with-spiky-or-unpredictable-traffic">3️⃣ <strong>Auto-Scaling with VM Scale Sets + Azure Load Balancer – For Apps With Spiky or Unpredictable Traffic</strong></h3>
<p><strong>Use this when:</strong><br>You’re building something more serious – traffic comes in waves (for example, a fitness/coach booking app), and you don’t want to sit around scaling VMs all day. You want Azure to automatically scale your infrastructure for you.</p>
<p><strong>How it works:</strong><br>You set up a Virtual Machine Scale Set (VMSS) that can automatically create more VMs when needed (like during high traffic), and remove them when things are calm — saving money. A Load Balancer distributes traffic across all those VMs.</p>
<p><strong>Real-life examples:</strong></p>
<ul>
<li><p>A media platform where people upload videos or photos</p>
</li>
<li><p>A shopping site that gets surges during promotions, for example, Black Friday</p>
</li>
<li><p>A booking platform with peak traffic in evenings/weekends</p>
</li>
</ul>
<p><strong>Pros:</strong></p>
<ul>
<li><p>Automatic scaling – saves time and money</p>
</li>
<li><p>High availability: VMs can be replaced if one fails</p>
</li>
<li><p>Easy to grow as your user base grows</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p>Works best if your app is monolithic (one big service)</p>
</li>
<li><p>No support for routing traffic to specific services – just spreads traffic across VMs</p>
</li>
<li><p>Load Balancer can’t look at URL paths or subdomains</p>
</li>
</ul>
<h3 id="heading-4-vm-scale-set-azure-application-gateway-for-microservices-or-complex-web-apps">4️⃣ <strong>VM Scale Set + Azure Application Gateway – For Microservices or Complex Web Apps</strong></h3>
<p><strong>Use this when:</strong><br>You have a modern, multi-service app – maybe built with microservices. Each service (like payments, authentication, search, and so on) lives on a different port or even in a container.</p>
<p>You want to route traffic smartly – like <code>/login</code> goes to the auth service, <code>/pay</code> to payments, and <code>/search</code> to the search service – all on the same domain.</p>
<p><strong>How it works:</strong><br>You still use a VM Scale Set for auto-scaling, but instead of a basic Load Balancer, you add an Application Gateway. It can inspect each request and send it to the right service based on things like:</p>
<ul>
<li><p>URL path (for example, <code>/payments</code>, <code>/orders</code>)</p>
</li>
<li><p>Subdomain (for example, <code>payments.mydomain.com</code>, <code>auth.mydomain.com</code>)</p>
</li>
</ul>
<p><strong>Real-life examples:</strong></p>
<ul>
<li><p>A full-blown SaaS product with multiple services</p>
</li>
<li><p>An e-commerce site with checkout, account, orders, and admin dashboards</p>
</li>
<li><p>A business migrating from a monolith to a microservices setup</p>
</li>
</ul>
<p><strong>Pros:</strong></p>
<ul>
<li><p>Smart routing based on path or subdomain</p>
</li>
<li><p>Everything runs under one public IP and one domain</p>
</li>
<li><p>Secure HTTPS handling + optional Web Application Firewall (WAF)</p>
</li>
<li><p>Auto-scaling and high availability</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p>More complex setup</p>
</li>
<li><p>Slightly higher cost due to Application Gateway</p>
</li>
<li><p>Needs planning around port numbers and backend pools</p>
</li>
</ul>
<h3 id="heading-quick-summary-table">🧠 Quick Summary Table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Setup</td><td>Best For</td><td>Scaling</td><td>Routing Logic</td><td>Cost</td><td>Ease</td></tr>
</thead>
<tbody>
<tr>
<td>☁️ Single VM</td><td>Small sites, personal apps</td><td>❌ (Manual)</td><td>❌ One app only</td><td>💲 (Lowest)</td><td>⭐⭐⭐⭐</td></tr>
<tr>
<td>🧱 Manual Horizontal Scaling + Load Balancer</td><td>Mid-size apps, predictable traffic</td><td>✅ (Manual)</td><td>❌ One app only</td><td>💲💲💲 (due to multiple VMs running at once without down-scaling — even with no traffic)</td><td>⭐⭐ (due to manual scaling)</td></tr>
<tr>
<td>🔁 VMSS + Load Balancer</td><td>Busy apps, spiky traffic</td><td>✅ (Auto)</td><td>❌ One app only</td><td>💲💲</td><td>⭐⭐⭐</td></tr>
<tr>
<td>🍴 VMSS + App Gateway</td><td>Microservices, modern apps</td><td>✅ (Auto)</td><td>✅ Smart routing (involving multiple microservices)</td><td>💲💲💲💲(Highest)</td><td>⭐⭐</td></tr>
</tbody>
</table>
</div><h2 id="heading-conclusion">✅ Conclusion</h2>
<p>By now, you’ve gone from simply hearing the words “load balancer” or “scale set” to understanding exactly how they work, when to use them, and what problems they solve. Whether you’re just launching a small app or scaling up a high-traffic service, Azure gives you flexible, powerful tools to grow with confidence.</p>
<p>We started from the very beginning – a single virtual machine. It’s simple and great for small apps, but it quickly becomes a bottleneck as traffic grows.</p>
<p>That’s where scaling comes in. We explored:</p>
<ul>
<li><p>🧱 <strong>Vertical scaling</strong> – Upgrading the same VM (quick fix, but limited)</p>
</li>
<li><p>🧩 <strong>Horizontal scaling</strong> – Adding more VMs to handle traffic better</p>
</li>
</ul>
<p>Then we introduced Azure Virtual Machine Scale Sets (VMSS) – which bring auto-scaling to life. No more manual intervention – Azure can scale your servers up and down based on demand.</p>
<p>But where things really get smart is with load balancers:</p>
<ul>
<li><p>📦 <strong>Azure Load Balancer</strong> helps spread traffic across your VMs — great for single-service apps</p>
</li>
<li><p>🍴 <strong>Azure Application Gateway</strong> takes it further by routing requests based on URL paths or subdomains — perfect for multi-service or microservice apps</p>
</li>
</ul>
<h3 id="heading-tldr-what-should-you-use">🎯 TL;DR – What Should You Use?</h3>
<ul>
<li><p><strong>Single VM</strong>: For side projects, portfolios, or internal tools</p>
</li>
<li><p><strong>Manual scaling + Load Balancer</strong>: For medium apps with predictable load</p>
</li>
<li><p><strong>VMSS + Load Balancer</strong>: For monolithic apps with auto-scaling needs</p>
</li>
<li><p><strong>VMSS + Application Gateway</strong>: Also includes auto-scaling but for microservices or smart routing needs</p>
</li>
</ul>
<h3 id="heading-final-thoughts">💡 Final Thoughts</h3>
<p>Cloud apps grow – fast. And with growth comes complexity. But with the right Azure setup, you can stay one step ahead of your traffic, serve users better, and keep costs under control.</p>
<p>Remember: you don’t need to start big. Start small, understand your app's traffic patterns, and scale only when you need to. Tools like Azure VM Scale Sets, Load Balancer, and Application Gateway give you the control and power to build scalable, modern applications without over-engineering.</p>
<p>Thanks for sticking with me through this deep dive. I hope this made things clearer, simpler, and maybe even a little fun 😊</p>
<h2 id="heading-study-further"><strong>Study Further 📚</strong></h2>
<p>If you would like to learn more about Azure Virtual Machines, Scale Sets, Load Balancer, and Application Gateway, you can check out the courses below:</p>
<ul>
<li><p><a target="_blank" href="https://www.coursera.org/specializations/microsoft-azure-fundamentals-az900-exam-prep">Microsoft Azure Fundamentals AZ-900 Exam Prep Specialization</a> — Microsoft, Coursera</p>
</li>
<li><p><a target="_blank" href="https://youtu.be/QOv_-xBXkpo?si=kSijmQdev5cQbRKl">Azure Virtual Machine Tutorial | Creating A Virtual Machine In Azure | Azure Training | Simplilearn</a> — YouTube</p>
</li>
<li><p><a target="_blank" href="https://youtu.be/wN4lRWHUHA0?si=kWBGXhXZTnVgzuEj">Virtual machine scale sets</a> — YouTube</p>
</li>
<li><p><a target="_blank" href="https://youtu.be/VqBGjddK5VY?si=diLGQfuW5i0lxbse">Azure Load Balancer | Azure Load Balancer Tutorial | All About Load Balancer | Edureka</a> — YouTube</p>
</li>
<li><p><a target="_blank" href="https://youtu.be/V9EP4jAg4QM?si=t7EqQjw1eNHqOtjK">Azure Application Gateway Deep dive | Step by step explained</a> — YouTube</p>
</li>
</ul>
<h2 id="heading-about-the-author"><strong>About the Author 👨‍💻</strong></h2>
<p>Hi, I’m Prince! I’m a DevOps engineer and Cloud architect passionate about building, deploying, and managing scalable applications and sharing knowledge with the tech community.</p>
<p>If you enjoyed this article, you can learn more about me by exploring more of my blogs and projects on my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/">LinkedIn profile.</a> You can find my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/details/publications/">LinkedIn articles here</a>. You can also <a target="_blank" href="https://prince-onuk.vercel.app/achievements#articles">visit my website</a> to read more of my articles as well. Let’s connect and grow together! 😊</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn Kubernetes – Full Handbook for Developers, Startups, and Businesses ]]>
                </title>
                <description>
<![CDATA[ You’ve probably heard the word Kubernetes floating around, or its cooler nickname k8s (pronounced “kates”). Maybe in a job post, a tech podcast, or from that one DevOps friend who always brings it up like it’s the secret sauce to everything 😅. It s... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/learn-kubernetes-handbook-devs-startups-businesses/</link>
                <guid isPermaLink="false">68150214fd424d0874293171</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ containers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Prince Onukwili ]]>
                </dc:creator>
                <pubDate>Fri, 02 May 2025 17:34:12 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746205417767/d9d6b0d3-f2a5-44eb-83b5-d1a614bead9f.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You’ve probably heard the word Kubernetes floating around, or its cooler nickname k8s (pronounced “kates”). Maybe in a job post, a tech podcast, or from that one DevOps friend who always brings it up like it’s the secret sauce to everything 😅. It sounds important, but also... kinda mysterious.</p>
<p>So what is Kubernetes, really? Why is it everywhere? And should you care?</p>
<p>In this handbook, we’ll unpack Kubernetes in a way that actually makes sense. No buzzwords. No overwhelming tech-speak. Just straight talk. You’ll learn what Kubernetes is, how it came about, and why it became such a big deal – especially for teams building and running huge apps with millions of users.</p>
<p>We’ll rewind a bit to see how things were done before Kubernetes showed up (spoiler: it wasn’t pretty), and walk through the real problems it was designed to solve.</p>
<p>By the end, you’ll not only understand the purpose of Kubernetes, but you’ll also know how to deploy a simple app on a Kubernetes cluster – even if you’re just getting started.</p>
<p>Yep, by the time we’re done, you’ll go from <em>“I keep hearing about Kubernetes”</em> to <em>“Hey, I kinda get it now!”</em> 😄</p>
<h2 id="heading-table-of-contents">📚 Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-kubernetes">What is Kubernetes?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-applications-were-deployed-before-kubernetes">How Applications Were Deployed Before Kubernetes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-problem-kubernetes-solves">The Problem Kubernetes Solves 🧠</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-kubernetes-works-components-of-a-kubernetes-environment">How Kubernetes Works – Components of a Kubernetes Environment 🧑‍🔧</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-kubernetes-workloads-pods-deployments-services-amp-more">Kubernetes Workloads 🛠️ – Pods, Deployments, Services, &amp; More</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-a-kubernetes-cluster-in-a-demo-environment-with-play-with-k8s">How to Create a Kubernetes Cluster in a Demo Environment with play-with-k8s</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-sign-in-to-play-with-kubernetes">Sign in to Play with Kubernetes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-create-your-kubernetes-cluster">Create Your Kubernetes Cluster</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-deploy-an-application-on-your-kubernetes-cluster">How to Deploy an Application on Your Kubernetes Cluster</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-advantages-of-using-kubernetes-in-business">✅ Advantages of Using Kubernetes in Business</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-disadvantages-of-using-kubernetes">😬 Disadvantages of Using Kubernetes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-use-cases-when-and-when-not-to-use-kubernetes">Use Cases: When (and When Not) to Use Kubernetes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-study-further">Study Further 📚</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-about-the-author">About the Author 👨‍💻</a></p>
</li>
</ol>
<h2 id="heading-what-is-kubernetes"><strong>What is Kubernetes?</strong></h2>
<p>Imagine you're building a huge software platform, like a banking app. This app needs many features, like user onboarding, depositing money, withdrawals, payments, and so on. These features are so big and complex that it’s easier to split them into separate applications. These individual applications are called microservices.</p>
<p><strong>So what are Microservices</strong>? Think of them like little building blocks that work together to create a bigger platform. So, you might have:</p>
<ul>
<li><p>One microservice for user onboarding</p>
</li>
<li><p>Another for processing deposits</p>
</li>
<li><p>Another for handling payments</p>
</li>
<li><p>And many, many more!</p>
</li>
</ul>
<p>To the user, it still looks like they’re using one smooth, unified banking app. But behind the scenes, it’s like a bunch of little apps working together to make everything run.</p>
<h3 id="heading-but-heres-where-things-get-tricky">But here’s where things get tricky...</h3>
<p>When you have dozens (or even hundreds) of these microservices, managing them becomes a nightmare. You might need to:</p>
<ul>
<li><p><strong>Deploy</strong> each one separately</p>
</li>
<li><p><strong>Monitor</strong> them individually (to ensure they don’t crash/become slow due to too much load)</p>
</li>
<li><p><strong>Scale</strong> them (make them bigger to handle more users) as traffic surges, one by one</p>
</li>
</ul>
<p>So, if your banking app suddenly gets millions of users, you'd have to manually tweak and update each microservice to keep it running smoothly. 😖 It’s a lot of work, and if something goes wrong, you’re in deep trouble.</p>
<h3 id="heading-this-is-where-kubernetes-comes-to-the-rescue">This is where Kubernetes comes to the rescue! 🚀</h3>
<p>Kubernetes is like a super-efficient manager for all these microservices. It’s a platform that helps you:</p>
<ul>
<li><p><strong>Automate</strong> the deployment (getting the apps up and running)</p>
</li>
<li><p><strong>Scale</strong> the microservices (making them bigger or smaller as needed based on the inflow of traffic – your customers)</p>
</li>
<li><p><strong>Monitor</strong> them (keeping an eye on their health)</p>
</li>
<li><p><strong>Ensure reliability</strong> (so if one microservice breaks/fails, k8s replaces it immediately)</p>
</li>
</ul>
<p>In simple terms, Kubernetes takes all your little microservices and organizes them, ensuring they run smoothly together, no matter how much traffic your app gets. It handles everything behind the scenes, like a conductor leading an orchestra, so your microservices work together without chaos.</p>
<h2 id="heading-how-applications-were-deployed-before-kubernetes"><strong>How Applications Were Deployed Before Kubernetes</strong></h2>
<p>Before Kubernetes came into the picture, software teams had quite the juggling act when it came to deploying applications – especially when they were made up of lots of microservices.</p>
<p>One popular method was using a <strong>distributed system</strong> setup. Here’s what that looked like:</p>
<p>Imagine each microservice (like your user onboarding, payments, deposits, and so on) being installed on separate servers (physical computers or virtual machines). Each of these servers had to be carefully prepared:</p>
<ul>
<li><p>The microservice itself needed to be installed.</p>
</li>
<li><p>The software dependencies it needed (like programming languages, libraries, tools) also had to be installed.</p>
</li>
<li><p>Everything had to be configured manually ON EACH server.</p>
</li>
</ul>
<p>And all of these servers had to talk to each other – sometimes over the public internet, or via private networks like VPNs.</p>
<p>Sounds like a lot of work, right? 😮 It was! Managing updates, fixing bugs, scaling up during traffic spikes, and keeping things from crashing could turn into a full-time headache for developers and system admins. 😖</p>
<h3 id="heading-then-came-containers">Then Came Containers 🚢</h3>
<p>A more modern solution that eased the pain (a little) was using containers.</p>
<p><strong>So, what are containers?</strong></p>
<p>Think of a container like a lunchbox for your microservice. Instead of installing the microservice and its supporting tools directly on a server, you pack everything it needs – code, settings, software libraries – into this single, neat container. Wherever the container goes, the microservice runs exactly the same way. No surprises!</p>
<p>Tools like <a target="_blank" href="https://www.docker.com/">Docker</a> made this super easy. Once your microservice was packed into a container, you could deploy it on:</p>
<ul>
<li><p>A single server</p>
</li>
<li><p>Multiple servers</p>
</li>
<li><p>Or cloud platforms like AWS Elastic Beanstalk, Azure App Service, or Google Cloud Run.</p>
</li>
</ul>
<h2 id="heading-the-problem-kubernetes-solves"><strong>The Problem Kubernetes Solves</strong> 🧠</h2>
<p>At first, when containers arrived on the scene, it felt like developers had struck gold.</p>
<p>You could package a microservice into a neat little container and run it anywhere – no more installing the same software on every server again and again. Tools like Docker and Docker Compose made this smooth for small projects.</p>
<p>But the real world? That’s where it got messy.</p>
<h3 id="heading-the-growing-headache-of-managing-containers">The Growing Headache of Managing Containers 💡</h3>
<p>When you have just a few microservices, you can manually deploy and manage their containers without much stress. But when your app grows – and you suddenly have dozens or even hundreds of microservices – managing them becomes an uphill battle:</p>
<ul>
<li><p>You had to deploy each container manually.</p>
</li>
<li><p>You had to restart them if one crashed.</p>
</li>
<li><p>You had to scale them one by one when more users started flooding in.</p>
</li>
</ul>
<p>Docker and Docker Compose were great for a small playground or startups, but not for an enterprise application with high traffic inflow.</p>
<h3 id="heading-cloud-managed-services-helped-but-only-up-to-a-point">Cloud-Managed Services Helped... But Only Up To a Point 🧑‍💻</h3>
<p>Cloud services like AWS Elastic Beanstalk, Azure App Service, and Google Cloud Run offered a shortcut. They let you deploy containers without worrying about setting up servers.</p>
<p>You could:</p>
<ul>
<li><p>Deploy each container on its own managed cloud instance.</p>
</li>
<li><p>Scale them automatically based on traffic.</p>
</li>
</ul>
<p>BUT there were still some big headaches:</p>
<h4 id="heading-grouping-microservices-was-awkward-and-expensive">📦 Grouping microservices was awkward and expensive</h4>
<p>Sure, you could organize containers by environment (like “testing” or “production”) or even by team (like “Finance” or “HR”). But each new microservice usually needed its own cloud instance – for example, a separate Azure App Service or Elastic Beanstalk environment FOR EVERY SINGLE CONTAINER.</p>
<p>Imagine this:</p>
<ul>
<li><p>Each App Service instance costs ~$50 per month.</p>
</li>
<li><p>You’ve got 10 microservices.</p>
</li>
<li><p>That’s $500/month... even if they’re barely used. 💸 Yikes!</p>
</li>
</ul>
<h3 id="heading-kubernetes-smarter-leaner-and-more-flexible">Kubernetes: Smarter, Leaner, and More Flexible 💪</h3>
<p>With Kubernetes, you don’t need to spin up a separate server for each microservice. You can start with just one or two servers (VMs) – and Kubernetes will automatically decide which container goes where based on available space and resources.</p>
<p>No stress, no waste! 💡</p>
<h3 id="heading-kubernetes-lets-you-customize-everything">🧑‍🍳 <strong>Kubernetes Lets You Customize Everything</strong></h3>
<ol>
<li><p>You can assign resources to each microservice container.<br> 👉 Example: If you have a "Payment" microservice that’s lightweight, you might give it 0.5 vCPUs and 512MB of memory. If you have a "Data Analytics" microservice that’s resource-hungry, you could give it 2 vCPUs and 4GB of memory.</p>
</li>
<li><p>You can set a minimum number of instances for each microservice.<br> 👉 Example: If you want at least 2 copies of your "Login" service always running (so your app doesn’t break if one fails), Kubernetes makes sure you always have 2 live copies at all times.</p>
</li>
<li><p>You can group your containers however you like:<br> 👉 By teams (Finance, HR, DevOps) or by environments (Testing, Staging, Production). Kubernetes makes this grouping super clean and logical.</p>
</li>
<li><p>You can automatically scale individual containers.<br> 👉 When more users flood your app, Kubernetes can create extra copies (called “replicas”) of only the containers that are under pressure. No more wasting resources on containers that don’t need it.</p>
</li>
<li><p>You can even scale your servers!<br> 👉 Kubernetes can automatically increase the number of servers (VMs) in your environment – called a <strong>Cluster</strong> – when traffic grows. So you could start with 2 VMs at $30 each ($60/month) and let Kubernetes add more servers only when necessary, rather than locking yourself into high fixed costs like $500/month for cloud-managed services.</p>
</li>
</ol>
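<p>To make these customizations concrete, here’s a rough sketch of a manifest expressing points 1–3 for a hypothetical “payment” microservice (the names, namespace, image, and values are all illustrative, not from a real project):</p>
<pre><code class="lang-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment
  namespace: finance          # grouping by team (point 3)
spec:
  replicas: 2                 # minimum number of live copies (point 2)
  selector:
    matchLabels:
      app: payment
  template:
    metadata:
      labels:
        app: payment
    spec:
      containers:
        - name: payment
          image: nginx:1.27   # placeholder image
          resources:          # resources assigned to this container (point 1)
            requests:
              cpu: "500m"     # 0.5 vCPUs
              memory: 512Mi
</code></pre>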
<p>Also, Kubernetes works <strong>the same way everywhere</strong>. Whether you deploy your containers on AWS, Google Cloud, Azure, or even your own laptop – Kubernetes doesn’t care. Your setup stays the same.</p>
<p>Compare that to managed services like Elastic Beanstalk or Azure App Service – which tie you to their platform, making it super hard to switch later.</p>
<p>✅ <strong>In short:</strong> Kubernetes saves you money, time, and a whole lot of headaches. It lets you run, scale, and organize your microservices without being chained to a single cloud provider — and without drowning in manual work.</p>
<h2 id="heading-how-kubernetes-works-components-of-a-kubernetes-environment"><strong>How Kubernetes Works — Components of a Kubernetes Environment</strong> 🧑‍🔧</h2>
<p>So by now you’ve seen the problem: running dozens (or hundreds!) of microservices manually is like juggling too many balls – you’re bound to drop some.</p>
<p>That’s why Kubernetes was created. But... how does it actually do all this magic? Let’s first break it down with the technical definition (simple but sharp – perfect for interviews) and then the layperson’s analogy (so it sticks in your head!).</p>
<h3 id="heading-1-cluster">1️⃣ <strong>Cluster 🏰</strong></h3>
<p>A Kubernetes Cluster is the entire setup of machines (physical or cloud-based) where Kubernetes runs. It’s made of one or more Master Nodes and Worker Nodes, working together to deploy and manage containerized applications.</p>
<p>Think of a Kubernetes Cluster as your entire playground. This is the environment where all your microservices live, grow, and play together.</p>
<p>A cluster is made up of two types of computers (called nodes):</p>
<ul>
<li><p>Master Node (nowadays often called the Control Plane)</p>
</li>
<li><p>Worker Nodes</p>
</li>
</ul>
<h3 id="heading-2-master-node-control-plane">2️⃣ <strong>Master Node (Control Plane) 👑</strong></h3>
<p>The Master Node is like the brain of Kubernetes. It manages and coordinates the whole cluster – deciding which applications run where, monitoring health, and scaling things up or down as needed.</p>
<p>It’s like the boss of the entire cluster. It doesn’t run your applications directly. Instead, it:</p>
<ul>
<li><p>Watches over the worker nodes</p>
</li>
<li><p>Decides which microservice (container) goes where</p>
</li>
<li><p>Makes sure everything runs smoothly and fairly</p>
</li>
</ul>
<p>Think of it like a factory manager who tells machines what to do, when to start, when to stop, and where to send the next package.</p>
<p>Inside the Master Node are a few clever mini-components that handle the real work.</p>
<h3 id="heading-3-api-server">3️⃣ <strong>API Server 💌</strong></h3>
<p>The API Server is the front door to Kubernetes. It handles communication between users and the system, taking commands and feeding them into the cluster.</p>
<p>This is where you (or your team) give Kubernetes instructions. Whether you're deploying a new app or scaling an existing one, you "talk" to the API Server first. It's like submitting a request at the front desk – the API server passes it on to the right people (or machines).</p>
<h3 id="heading-4-scheduler">4️⃣ <strong>Scheduler 📅</strong></h3>
<p>The Scheduler assigns Pods (applications) to Worker Nodes based on available resources and needs.</p>
<p>Imagine you’ve asked Kubernetes to launch a new microservice. The Scheduler checks:</p>
<ul>
<li><p>Which worker node has enough space?</p>
</li>
<li><p>Which node has enough memory and CPU?</p>
</li>
<li><p>Where would this service run best?</p>
</li>
</ul>
<p>It makes the decision and assigns the microservice to the perfect spot. Smart, huh?</p>
<h3 id="heading-5-controller-manager">5️⃣ <strong>Controller Manager 🎛️</strong></h3>
<p>The Controller Manager runs controllers that watch over the cluster and ensures that the system’s actual state matches the desired state.</p>
<p>This component watches over the system like a hawk. Let’s say you told Kubernetes:<br><em>"Hey, I want 3 copies of my payment microservice running at all times."</em></p>
<p>If one of them crashes, the Controller Manager sees that and spins up a new one to replace it automatically. It makes sure the reality always matches the plan.</p>
<h3 id="heading-6-etcd">6️⃣ <strong>etcd 📚</strong></h3>
<p>etcd is Kubernetes' memory – a distributed key-value store where cluster data is saved: config files, state, and metadata.</p>
<p>Imagine a notebook where all rules, records, and plans are written down. Without etcd, Kubernetes would forget everything.</p>
<h3 id="heading-7-worker-nodes">7️⃣ <strong>Worker Nodes 💪</strong></h3>
<p>Worker Nodes are the servers that run the actual application containers, doing the heavy lifting in the cluster.</p>
<p>These are the machines where your microservices actually live and run. The Master Node gives orders, but the Worker Nodes do the heavy lifting – they run your containers!</p>
<p>Each worker node has a few helpers to manage its microservices:</p>
<ul>
<li><p>The Kubelet</p>
</li>
<li><p>The Kube Proxy</p>
</li>
</ul>
<h3 id="heading-8-kubelet">8️⃣ <strong>Kubelet 📢</strong></h3>
<p>The Kubelet is the agent that lives on each Worker Node and makes sure containers are healthy and running as expected.</p>
<p>It listens to the Master Node’s instructions. If the Master Node says <em>"Hey, run this container!"</em>, the Kubelet makes it happen and keeps it running. If something goes wrong, the Kubelet reports back to the Master Node.</p>
<h3 id="heading-9-kube-proxy">9️⃣ <strong>Kube Proxy 🚦</strong></h3>
<p>Kube Proxy handles network traffic, ensuring that Pods can talk to each other and to the outside world.</p>
<p>Imagine your banking app’s login service needs to talk to the payments service. The Kube Proxy handles the routing so the request reaches the right place. It also handles load balancing, so no single microservice gets overwhelmed.</p>
<p>So, to summarize:</p>
<ul>
<li><p>The Master Node is the boss – it plans, watches, and assigns tasks.</p>
</li>
<li><p>The Worker Nodes do the actual work – running your microservices.</p>
</li>
<li><p>Components like etcd, Kubelet, Scheduler, Controller Manager, and Kube Proxy all work together like parts of a well-oiled machine.</p>
</li>
</ul>
<p>Kubernetes is designed to handle your microservices automatically – keeping them alive, scaling them up, moving them around, and restarting them if they crash – so you don’t have to babysit them yourself.</p>
<h2 id="heading-kubernetes-workloads-pods-deployments-services-amp-more">Kubernetes Workloads 🛠️ — Pods, Deployments, Services, &amp; More</h2>
<p>Kubernetes workloads are the objects you use to manage and run your applications. Think of them as blueprints 📐 that tell Kubernetes <strong>what</strong> to run and <strong>how</strong> to run it – whether it’s a single app container, a group of containers, a database, or a batch job. Here are some of the workloads in Kubernetes:</p>
<h3 id="heading-1-pods">1️⃣ <strong>Pods</strong></h3>
<p>A <strong>Pod</strong> is the smallest and simplest unit in the Kubernetes object model. It represents a single instance of a running process in your cluster and can contain one or more containers that share storage and network resources. ​</p>
<p>Think of a Pod as a wrapper around one or more containers that need to work together. They share the same network IP and storage, allowing them to communicate easily and share data. Pods are ephemeral – short-lived and easily replaced. If a Pod dies, Kubernetes can create a new one to replace it almost instantly.</p>
<p>Say you have an application that’s split into two parts – a frontend and a backend. The frontend will run in a container in Pod A, while the backend will run in a container in another Pod, Pod B.</p>
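<p>As a minimal sketch, a Pod manifest for that frontend might look like this (the name and image are placeholders):</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
    - name: frontend
      image: nginx:1.27       # placeholder image for the frontend app
      ports:
        - containerPort: 80   # port the container listens on
</code></pre>
<p>You’d create it with <code>kubectl apply -f pod.yaml</code>. In practice, though, you rarely create bare Pods directly – you let a Deployment manage them.</p>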
<h3 id="heading-2-deployments">2️⃣ <strong>Deployments</strong></h3>
<p>A <strong>Deployment</strong> provides declarative updates for Pods and ReplicaSets. You describe a desired state in a Deployment, and the Deployment Controller changes the actual state to the desired state at a controlled rate.</p>
<p>Deployments manage the lifecycle of your application Pods. They ensure that the specified number of Pods are running and can handle updates, rollbacks, and scaling. If a Pod fails, the Deployment automatically replaces it to maintain the desired state.​</p>
<p>Imagine you're managing a store. A Deployment is like the store manager – you tell it how many workers (Pods) you want, and it makes sure they’re always present. If one doesn't show up for work, the manager finds a replacement automatically. You can also tell it to hire more workers or fire some when needed.</p>
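<p>Here’s an illustrative Deployment manifest for that store analogy – three “workers” (Pods) of a hypothetical app (names and image are placeholders):</p>
<pre><code class="lang-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: store
spec:
  replicas: 3               # how many "workers" (Pods) you want
  selector:
    matchLabels:
      app: store            # must match the Pod template's labels
  template:                 # the Pod template the Deployment stamps out
    metadata:
      labels:
        app: store
    spec:
      containers:
        - name: store
          image: nginx:1.27 # placeholder image
</code></pre>
<p>Scaling is then just a matter of changing <code>replicas</code> and re-applying the file, or running <code>kubectl scale deployment store --replicas=5</code>.</p>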
<h3 id="heading-3-services">3️⃣ <strong>Services</strong></h3>
<p>A <strong>Service</strong> in Kubernetes defines a way to access/communicate with Pods. Services enable communication between different Pods (for example, your frontend Pod A can communicate with your backend Pod B via a service) and can expose your application to external traffic (for example the public internet). ​</p>
<p>Services act as a stable endpoint to access a set of Pods. Even if the underlying Pods change, the Service's IP and DNS name remain constant, ensuring communication between the Pods within the cluster or with the internet.</p>
<p>A Service is like the front door to your app. No matter which worker (Pod) is behind it, people always use the same entrance to access it. It hides the messy stuff happening behind the scenes and gives users a simple way to connect to your app.</p>
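<p>A minimal Service manifest might look like this (assuming Pods carrying an illustrative <code>app: store</code> label):</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Service
metadata:
  name: store
spec:
  selector:
    app: store        # routes traffic to any Pod carrying this label
  ports:
    - port: 80        # the stable port clients connect to
      targetPort: 80  # the port the container listens on
  type: ClusterIP     # internal-only; use NodePort or LoadBalancer to expose externally
</code></pre>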
<h3 id="heading-4-replicasets">4️⃣ <strong>ReplicaSets</strong></h3>
<p>A <strong>ReplicaSet</strong> ensures that a specified number of identical Pods are running at any given time. It is often used to guarantee the availability of a specified number of Pods (horizontal scaling). ​</p>
<p>ReplicaSets maintain a stable set of running Pods. If a Pod crashes or is deleted, the ReplicaSet automatically creates a new one to replace it, ensuring your application remains available.​</p>
<p>Think of a ReplicaSet like a robot that counts how many copies of your app are running. If one goes missing, it automatically makes a new one. It keeps the number steady, just like you told it to.</p>
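<p>A ReplicaSet manifest looks almost identical to a Deployment’s (illustrative names again). In practice you rarely create one directly – Deployments create and manage ReplicaSets for you – but it helps to see the shape:</p>
<pre><code class="lang-yaml">apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: store-rs
spec:
  replicas: 3             # the count the "robot" keeps steady
  selector:
    matchLabels:
      app: store
  template:
    metadata:
      labels:
        app: store
    spec:
      containers:
        - name: store
          image: nginx:1.27 # placeholder image
</code></pre>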
<h3 id="heading-5-daemonsets">5️⃣ <strong>DaemonSets</strong></h3>
<p>A <strong>DaemonSet</strong> ensures that all (or some) Nodes run an instance (a copy) of a specific Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are also removed. ​</p>
<p>DaemonSets are used to deploy a Pod on every node in the cluster. This is useful for running background tasks like log collection or monitoring agents on all nodes (for example to get the CPU, memory, and disk usage of each node).​</p>
<p>A DaemonSet is like saying, “I want this helper app to run on <strong>every single computer</strong> we have.” As mentioned earlier, it’s great for things like log collectors or security checkers – small helpers that every machine should have.</p>
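<p>Here’s a sketch of a DaemonSet for a hypothetical log-collecting helper (the name and image are illustrative). Note that there’s no <code>replicas</code> field – the node count itself determines how many copies run:</p>
<pre><code class="lang-yaml">apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
spec:
  selector:
    matchLabels:
      app: log-collector
  template:                 # one copy of this Pod runs on every node
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
        - name: log-collector
          image: fluent/fluentd:v1.16  # example log-collection agent
</code></pre>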
<h3 id="heading-6-statefulsets">6️⃣ <strong>StatefulSets</strong></h3>
<p>A <strong>StatefulSet</strong> is the workload API object used to manage stateful applications (applications that store data, for example in their filesystem – databases). It manages the deployment and scaling of a set of Pods and provides guarantees about the ordering and uniqueness of these Pods.</p>
<p>StatefulSets are designed for applications that require persistent storage and stable network identities, like databases.</p>
<p>Let’s say you’re running a database or anything that needs to save info. A StatefulSet is like giving each app a name tag and a personal drawer to store their stuff. Even if you restart them, they come back with the same name and same drawer.</p>
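<p>Here’s a sketch of a StatefulSet for a small database (names, image, and sizes are illustrative). The <code>volumeClaimTemplates</code> section is the “personal drawer”: each Pod gets its own persistent volume that survives restarts:</p>
<pre><code class="lang-yaml">apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db          # headless Service giving Pods stable names (db-0, db-1, ...)
  replicas: 2
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:    # each Pod gets its own persistent volume
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
</code></pre>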
<h3 id="heading-7-jobs">7️⃣ <strong>Jobs</strong></h3>
<p>A <strong>Job</strong> creates one or more Pods and ensures that a specified number of them successfully terminate. As Pods successfully complete, the Job tracks the successful completions. When a specified number of successful completions is reached, the Job is complete. ​</p>
<p>A Job is like a one-time task. Imagine sending out a batch of emails or processing a report. You want the task to run, finish, and then stop. That’s exactly what a Job does.</p>
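<p>A minimal Job manifest for a one-off task might look like this (the name and command are placeholders that just simulate work):</p>
<pre><code class="lang-yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: send-report
spec:
  completions: 1            # how many successful runs count as "done"
  backoffLimit: 3           # how many retries before marking the Job failed
  template:
    spec:
      restartPolicy: Never  # Job Pods should finish, not restart forever
      containers:
        - name: send-report
          image: busybox:1.36
          command: ["sh", "-c", "echo processing report"]
</code></pre>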
<h3 id="heading-8-cronjobs">8️⃣ <strong>CronJobs</strong></h3>
<p>A <strong>CronJob</strong> creates Jobs on a time-based schedule. It runs a Job periodically on a given schedule, written in Cron format.</p>
<p>A CronJob is like setting a reminder or alarm. It tells your app (in this case the Job) to do something every night at 2 AM, every Monday morning, or once a month – whatever schedule you give it.</p>
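<p>And a CronJob sketch that runs such a task every night at 2 AM (schedule, name, and command are illustrative):</p>
<pre><code class="lang-yaml">apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-cleanup
spec:
  schedule: "0 2 * * *"     # Cron format: minute hour day-of-month month day-of-week
  jobTemplate:              # the Job to create on each tick of the schedule
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: busybox:1.36
              command: ["sh", "-c", "echo running nightly cleanup"]
</code></pre>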
<h2 id="heading-how-to-create-a-kubernetes-cluster-in-a-demo-environment-with-play-with-k8s">🛠️ How to Create a Kubernetes Cluster in a Demo Environment with <code>play-with-k8s</code></h2>
<p>As we've discussed earlier, a Kubernetes cluster is a set of machines (called nodes) that run containerized applications.</p>
<p>Setting up a Kubernetes cluster locally or in the cloud can be complex and expensive. To simplify the learning process, Docker provides a free, browser-based platform called <a target="_blank" href="https://labs.play-with-k8s.com/">Play with Kubernetes</a>. This environment allows you to create and interact with a Kubernetes cluster without installing anything on your local machine. It's an excellent tool for beginners to get hands-on experience with Kubernetes.​</p>
<h3 id="heading-sign-in-to-play-with-kubernetes">🔐 Sign in to Play with Kubernetes</h3>
<ol>
<li><p><strong>Visit the platform</strong> at <a target="_blank" href="https://labs.play-with-k8s.com/">https://labs.play-with-k8s.com/</a>.​</p>
</li>
<li><p><strong>Authenticate:</strong></p>
<ul>
<li><p>Click on the "Login" button.</p>
</li>
<li><p>You can sign in using your Docker Hub or GitHub account.</p>
</li>
<li><p>If you don't have an account, you can create one for free on <a target="_blank" href="https://hub.docker.com/">Docker Hub</a> or <a target="_blank" href="https://github.com/">GitHub</a>.​</p>
</li>
</ul>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746083007442/a038ee6c-b471-4880-ba17-2e8927678780.png" alt="Sign in to Play with k8s" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-create-your-kubernetes-cluster">🚀 Create Your Kubernetes Cluster</h3>
<p>Once signed in, follow these steps to set up your cluster:</p>
<h4 id="heading-step-1-start-a-new-session">Step 1: Start a New Session:</h4>
<p>Click on the <strong>"Start"</strong> button to initiate a new session. This gives you about 4 hours of play time, after which the cluster and its resources will be automatically terminated.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746083204331/8410e18b-4ed4-4374-8d4f-44f0fefa1623.png" alt="Play with k8s timed session" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-step-2-add-instances">Step 2: Add Instances:</h4>
<p>Then click on <strong>"+ Add New Instance"</strong> to create a new node (Virtual Machine).  </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746083280594/740d963a-c70f-43c6-8354-e6ea0c3d7f41.png" alt="Create new master node (VM)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This will open a terminal window where you can run commands.​  </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746083304493/ffd34d73-e5cd-41d0-908a-2240924e7ad0.png" alt="Terminal of newly created node" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-step-3-initialize-the-master-node">Step 3: Initialize the Master Node:</h4>
<p>In the terminal, run the following command to initialize the master node:​</p>
<pre><code class="lang-bash">kubeadm init --apiserver-advertise-address $(hostname -i) --pod-network-cidr &lt;SPECIFIED_IP_ADDRESS&gt;
</code></pre>
<p>You can find the exact command printed in the terminal. In my case, the CIDR range is <code>10.5.0.0/16</code>. Replace the <code>&lt;SPECIFIED_IP_ADDRESS&gt;</code> placeholder with the CIDR range shown in your terminal.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746083865451/fdf18710-c987-4221-bc02-369cd709a849.png" alt="Initialize the master node and the control plane" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This process will set up the control plane of your Kubernetes cluster.​</p>
<h4 id="heading-step-4-add-worker-nodes">Step 4: Add Worker Nodes:</h4>
<p>If you want to add worker nodes, in the master node terminal, you'll find a <code>kubeadm join...</code> command after running the <code>kubeadm init --apiserver-advertise-address $(hostname -i) --pod-network-cidr &lt;SPECIFIED_IP_ADDRESS&gt;</code> command.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746084559142/6e539ef6-0219-40da-95e7-42abc9f1af8c.png" alt="Command to add worker node to control plane" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Click on <strong>"+ Add New Instance"</strong> to create another node just as you did earlier.</p>
<p>Run this command in the new node's terminal to join it to the cluster:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746084666411/78f07ba1-7f1f-402e-9ed8-c4d6054bdcab.png" alt="Add worker node to control plane" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-step-5-configure-the-clusters-networking">Step 5: Configure the Cluster’s networking:</h4>
<p>Navigate to the master node, and run the command below to configure the cluster’s networking.</p>
<pre><code class="lang-bash">kubectl apply -f https://raw.githubusercontent.com/cloudnativelabs/kube-router/master/daemonset/kubeadm-kuberouter.yaml
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746085296963/ba35966c-5dd1-4e17-b4b5-85639cb3a80d.png" alt="Configure networking in the cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-step-6-verify-the-cluster">Step 6: Verify the Cluster:</h4>
<p>In the master node terminal (the first node with the highlighted user profile), run:​</p>
<pre><code class="lang-bash">kubectl get nodes
</code></pre>
<p>You should see a list of nodes in your cluster, including the master and any worker nodes you've added.​</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746085583418/45e55418-4b0f-461f-98d8-3b0c8f19b839.png" alt="Nodes in the cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Congratulations! You just created your very own Kubernetes cluster with two VMs: the master node (where the control plane resides) and a worker node (where Kubernetes workloads, such as Pods, are deployed).</p>
<h2 id="heading-how-to-deploy-an-application-on-your-kubernetes-cluster">🚀 How to Deploy an Application on Your Kubernetes Cluster</h2>
<p>Now that we've set up our Kubernetes cluster using Play with Kubernetes, it's time to deploy the application and make it accessible over the internet.</p>
<h3 id="heading-understanding-imperative-vs-declarative-approaches-in-kubernetes">🧠 Understanding Imperative vs. Declarative Approaches in Kubernetes</h3>
<p>Before we proceed, it's essential to grasp the two primary methods for managing resources in Kubernetes: <strong>Imperative</strong> and <strong>Declarative</strong>.</p>
<h3 id="heading-imperative-approach">🖋️ Imperative Approach</h3>
<p>In the imperative approach, you directly issue commands to the Kubernetes API to create or modify resources. Each command specifies the desired action, and Kubernetes executes it immediately.​</p>
<p>Imagine telling someone, "Turn on the light." You're giving a direct command, and the action happens right away. Similarly, with imperative commands, you instruct Kubernetes step-by-step on what to do.</p>
<p><strong>Example:</strong><br>To create a Pod running an NGINX container, run the command below in the terminal of the master node:</p>
<pre><code class="lang-bash">kubectl run nginx-pod --image=nginx
</code></pre>
<p>Now wait a few seconds and run the command below to check the status of the pod:</p>
<pre><code class="lang-bash">kubectl get pods
</code></pre>
<p>You should get a response similar to this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746087463204/52ef26e5-96df-4d91-8a2d-7527a38786d2.png" alt="Get pods running in the cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now let’s expose our Pod to the internet by creating a <strong>Service.</strong> Run the command below to expose the Pod:</p>
<pre><code class="lang-bash">kubectl expose pod nginx-pod --<span class="hljs-built_in">type</span>=NodePort --port=80
</code></pre>
<p>To get the IP address of the Cluster so we can access our Pod, run the command below:</p>
<pre><code class="lang-bash">kubectl get svc
</code></pre>
<p>The command displays the IP address from which we can access our service. You should get an output similar to this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746088678881/a4f3bdbc-c7eb-4696-ba6e-587637be5792.png" alt="Get service IP address" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now, copy the IP address for the <code>nginx-pod</code> service and run the command below to make a request to your Pod:</p>
<pre><code class="lang-bash">curl &lt;YOUR-SERVICE-IP-ADDRESS&gt;
</code></pre>
<p>Replace the <code>&lt;YOUR-SERVICE-IP-ADDRESS&gt;</code> placeholder with the IP address of your <code>nginx-pod</code> service. In my case, it’s <code>10.98.108.173</code>.</p>
<p>You should get a response from your <code>nginx-pod</code> Pod:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746088937046/8b86cd63-21f0-45d3-9ab5-59bd630fb37c.png" alt="Make a request to the Nginx Pod running in the Cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>We couldn’t access the Pod from the internet (that is, from our browser) because our cluster isn’t connected to a cloud service like AWS or Google Cloud, which could provision an external load balancer for us.</p>
<p>Now let’s try doing the same thing but using the Declarative method.</p>
<h3 id="heading-declarative-approach">🚀 Declarative Approach</h3>
<p>So far, we used the imperative approach, where we typed commands like <code>kubectl run</code> or <code>kubectl expose</code> directly into the terminal to make Kubernetes do something immediately.</p>
<p>But Kubernetes has another (and often better) way to do things: the declarative approach.</p>
<h4 id="heading-what-is-the-declarative-approach">🧾 What Is the Declarative Approach?</h4>
<p>Instead of giving Kubernetes instructions step-by-step like a chef in a kitchen, you give it a full recipe – a file that describes exactly what you want (for example, what app to run, how many copies of it, how to expose it, and so on).</p>
<p>This recipe is written in a file called a <strong>manifest</strong>.</p>
<h4 id="heading-whats-a-manifest">📘 What’s a Manifest?</h4>
<p>A manifest is a file (usually written in YAML format) that describes a Kubernetes object – like a Pod, a Deployment, or a Service.</p>
<p>It’s like writing down what you want, handing it over to Kubernetes, and saying: “Hey, please make sure this exists exactly how I described it.”</p>
<p>We’ll use two manifests:</p>
<ol>
<li><p>One to deploy our application</p>
</li>
<li><p>Another to expose it to the internet</p>
</li>
</ol>
<p>Let’s walk through it!</p>
<h4 id="heading-step-1-clone-the-github-repo">📁 Step 1: Clone the GitHub Repo</h4>
<p>We already have a GitHub repo that contains the two manifest files we need. Let’s clone it into our Kubernetes environment.</p>
<p>Run this in the terminal (on your master node):</p>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> https://github.com/onukwilip/simple-kubernetes-app
</code></pre>
<p>Now, let’s go into the folder:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> simple-kubernetes-app
</code></pre>
<p>You should see two files:</p>
<ul>
<li><p><code>deployment.yaml</code></p>
</li>
<li><p><code>service.yaml</code></p>
</li>
</ul>
<h4 id="heading-step-2-understanding-the-deployment-manifest-deploymentyaml">📦 Step 2: Understanding the Deployment Manifest (<code>deployment.yaml</code>)</h4>
<p>This manifest will tell Kubernetes to deploy our app and ensure it’s always running.</p>
<p>Here’s what’s inside:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">nginx-deployment</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">nginx</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">nginx</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">nginx</span>
</code></pre>
<p>Now, let’s break this down:</p>
<ul>
<li><p><code>apiVersion: apps/v1</code>: This tells Kubernetes which version of the API we’re using to define this object.</p>
</li>
<li><p><code>kind: Deployment</code>: This means we’re creating a Deployment (a controller that manages Pods).</p>
</li>
<li><p><code>metadata.name</code>: We’re giving our Deployment a name: <code>nginx-deployment</code>.</p>
</li>
<li><p><code>spec.replicas: 3</code>: We’re telling Kubernetes: “Please run 3 copies (replicas) of this app.”</p>
</li>
<li><p><code>selector.matchLabels</code>: Kubernetes will use this label to find which Pods this Deployment is managing.</p>
</li>
<li><p><code>template.metadata.labels</code> &amp; <code>spec.containers</code>: This section describes the Pods that the Deployment should create – each Pod will run a container using the official <code>nginx</code> image.</p>
</li>
</ul>
<p>✅ In plain terms: We're asking Kubernetes to create and maintain 3 copies of an app that runs NGINX, and automatically restart them if any fails.</p>
<h4 id="heading-step-3-understanding-the-service-manifest-serviceyaml">🌐 Step 3: Understanding the Service Manifest (<code>service.yaml</code>)</h4>
<p>This file tells Kubernetes to expose our NGINX app to the outside world using a Service.</p>
<p>Here’s the file – let’s break this down, too:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">nginx-service</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">NodePort</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">nginx</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
    <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">80</span>
</code></pre>
<ul>
<li><p><code>apiVersion: v1</code>: We’re using version 1 of the Kubernetes API.</p>
</li>
<li><p><code>kind: Service</code>: We’re creating a Service object.</p>
</li>
<li><p><code>metadata.name: nginx-service</code>: Giving it a name.</p>
</li>
<li><p><code>spec.type: NodePort</code>: We’re exposing it through a port on the node (so we can access it via the node's IP address).</p>
</li>
<li><p><code>selector.app: nginx</code>: This tells Kubernetes to connect this Service to Pods with the label <code>app: nginx</code>.</p>
</li>
<li><p><code>ports.port</code> and <code>targetPort</code>: The Service will listen on port 80 and forward traffic to port 80 on the Pod.</p>
</li>
</ul>
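<p>One optional tweak worth knowing about: by default, Kubernetes assigns a <code>NodePort</code> Service a random port in the 30000–32767 range. If you’d rather pin it to a predictable port, you can add a <code>nodePort</code> field under <code>ports</code>. This is a variation on the repo’s <code>service.yaml</code>, not part of it – the port value here is just an example:</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
    nodePort: 30080 # optional; must fall within 30000-32767
</code></pre>
<p>If you omit the field (as the repo’s file does), Kubernetes picks the port for you, which is why we’ll look it up with <code>kubectl get svc</code> later.</p>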
<p>✅ In plain terms: This file says, “Expose our NGINX app through the cluster’s network so we can access it from the outside world.”</p>
<h4 id="heading-step-4-clean-up-previous-resources">🧹 Step 4: Clean Up Previous Resources</h4>
<p>If you’re still running the Pod and Service we created using the imperative approach, let’s delete them to avoid conflicts:</p>
<pre><code class="lang-bash">kubectl delete pod nginx-pod
kubectl delete service nginx-pod
</code></pre>
<h4 id="heading-step-5-apply-the-manifests">📥 Step 5: Apply the Manifests</h4>
<p>Now let’s deploy the NGINX app and expose it – this time using the <strong>declarative</strong> way.</p>
<p>From inside the <code>simple-kubernetes-app</code> folder, run:</p>
<pre><code class="lang-bash">kubectl apply -f deployment.yaml
</code></pre>
<p>Then:</p>
<pre><code class="lang-bash">kubectl apply -f service.yaml
</code></pre>
<p>This will create the Deployment and the Service described in the files. 🎉</p>
<h4 id="heading-step-6-check-that-its-running">🔍 Step 6: Check That It’s Running</h4>
<p>Let’s see if the Pods were created:</p>
<pre><code class="lang-bash">kubectl get pods
</code></pre>
<p>You should see 3 Pods running!</p>
<p>And let’s check the service:</p>
<pre><code class="lang-bash">kubectl get svc
</code></pre>
<p>Look for the <code>nginx-service</code>. You’ll see something like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746092825896/617084f1-3a71-4cfd-a287-9f7a9ac08810.png" alt="Access service NodePort" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Note the <strong>NodePort</strong> (for example, <code>30001</code>), as we’ll use it to access the app.</p>
<h4 id="heading-step-7-access-the-app">🌍 Step 7: Access the App</h4>
<p>You can now send a request to your app like this:</p>
<pre><code class="lang-bash">curl http://&lt;YOUR-NODE-IP&gt;:&lt;NODE-PORT&gt;
</code></pre>
<blockquote>
<p>Replace <code>&lt;YOUR-NODE-IP&gt;</code> with the IP of your master node (you’ll usually find this in Play With Kubernetes at the top of your terminal), and <code>&lt;NODE-PORT&gt;</code> with the NodePort shown in the <code>kubectl get svc</code> command.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746092570586/b33cabc0-ea1e-4a70-ab55-9f3a0761bec0.png" alt="Get master node IP address" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You should see the HTML content of the NGINX welcome page printed out.</p>
<p>Now terminate the cluster environment by clicking the <strong>CLOSE SESSION</strong> button:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746093081895/79139f75-5e6b-4991-be74-38ecbbf2ef66.png" alt="79139f75-5e6b-4991-be74-38ecbbf2ef66" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-why-declarative-is-better-in-most-cases">🆚 Why Declarative Is Better (In Most Cases)</h3>
<ul>
<li><p>🔁 <strong>Reusable</strong>: You can use the same files again and again.</p>
</li>
<li><p>📦 <strong>Version-controlled</strong>: You can push these files to GitHub and track changes over time.</p>
</li>
<li><p>🛠️ <strong>Fixes mistakes easily</strong>: Want to change 3 replicas to 5? Just update the file and re-apply!</p>
</li>
<li><p>🧠 <strong>Easier to maintain</strong>: Especially when you have many resources to manage.</p>
</li>
</ul>
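<p>To make that third point concrete: scaling from 3 replicas to 5 means changing a single line in <code>deployment.yaml</code>. Here’s a sketch of just the edited part of the file, not a new manifest:</p>
<pre><code class="lang-yaml">spec:
  replicas: 5 # was 3; Kubernetes reconciles the difference automatically
</code></pre>
<p>Then run <code>kubectl apply -f deployment.yaml</code> again – Kubernetes compares the desired state in the file with what’s actually running and creates the two extra Pods for you.</p>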
<h2 id="heading-advantages-of-using-kubernetes-in-business">💼 Advantages of Using Kubernetes in Business</h2>
<p>Kubernetes isn’t just a developer tool – it’s a business enabler as well. It helps companies deliver products faster, more reliably, and with reduced operational overhead.</p>
<p>Let’s break down how Kubernetes translates to real-world business benefits:</p>
<h3 id="heading-1-better-use-of-cloud-resources-cost-savings">1️⃣ <strong>Better Use of Cloud Resources = Cost Savings</strong></h3>
<p>Before Kubernetes, deploying many microservices for a single application often meant creating separate cloud resources (like one Azure App Service per microservice), which could rack up huge costs quickly. Imagine $50/month per service × 10 services = $500/month 😬.</p>
<p><strong>With Kubernetes:</strong><br>You can run multiple microservices on fewer virtual machines (VMs) while Kubernetes automatically decides the most efficient way to use the available servers. That means you pay for fewer servers and get more out of them 💸.</p>
<h3 id="heading-2-high-availability-and-uptime-happy-customers">2️⃣ <strong>High Availability and Uptime = Happy Customers</strong></h3>
<p>Kubernetes watches your apps like a hawk 👀. If one of them crashes or fails, Kubernetes restarts or replaces it <em>immediately</em> – automatically.</p>
<p><strong>For your business:</strong><br>This means less downtime, fewer support tickets, and happier customers who don’t even notice when things go wrong in the background.</p>
<h3 id="heading-3-easy-scaling-during-high-demand">3️⃣ <strong>Easy Scaling During High Demand</strong></h3>
<p>Manually scaling apps during high traffic (like Black Friday) can be a nightmare 😰. And if you don't act fast, customers experience slowness or crashes.</p>
<p><strong>With Kubernetes:</strong><br>You can configure each microservice to scale automatically – adding more instances of that service <em>only when needed</em> (for example, when a surge of users hits your site to purchase different products) and scaling back down when traffic drops. This ensures your app stays responsive and you only pay for what you use.</p>
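<p>In practice, this kind of automatic scaling is usually configured with a <strong>HorizontalPodAutoscaler</strong>. Here’s a rough sketch of one – the Deployment name and thresholds are illustrative, and your cluster needs a metrics server installed for it to work:</p>
<pre><code class="lang-yaml">apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout # the microservice to scale (illustrative name)
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80 # add Pods when average CPU crosses 80%
</code></pre>
<p>Kubernetes then adds Pods during the Black Friday rush and removes them afterwards – no one has to be on call clicking buttons.</p>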
<h3 id="heading-4-faster-deployment-faster-time-to-market">4️⃣ <strong>Faster Deployment = Faster Time to Market</strong></h3>
<p>Kubernetes supports automation and repeatability. Teams can deploy new features or microservices faster without worrying about infrastructure setup every time.</p>
<p><strong>For business:</strong><br>This means faster product updates, quicker response to market demands, and competitive advantage 🚀.</p>
<h3 id="heading-5-consistent-environments-fewer-bugs">5️⃣ <strong>Consistent Environments = Fewer Bugs</strong></h3>
<p>Each microservice in Kubernetes is containerized, meaning it runs with all its dependencies in a self-contained package. You can run the exact same app setup in:</p>
<ul>
<li><p>Development</p>
</li>
<li><p>Testing</p>
</li>
<li><p>Production</p>
</li>
</ul>
<p>This reduces bugs caused by "it works on my machine" issues 🤦‍♂️ and helps teams build with confidence.</p>
<h3 id="heading-6-vendor-independence-bye-bye-to-vendor-lock-in">6️⃣ <strong>Vendor Independence (Bye-bye to Vendor lock-in)</strong></h3>
<p>When you use cloud-managed services (like AWS Elastic Beanstalk or Azure App Service), it’s often hard to move to another provider because everything is tailored to that specific platform.</p>
<p><strong>With Kubernetes:</strong><br>It works the same way on AWS, Azure, GCP, or even your own data center. This means you can switch cloud providers easily and avoid being locked into one vendor – aka cloud freedom! ☁️🕊️</p>
<h3 id="heading-7-organizational-clarity">7️⃣ <strong>Organizational Clarity</strong></h3>
<p>Kubernetes lets you organize your apps clearly. You can group workloads by:</p>
<ul>
<li><p>Team (for example, Finance, HR)</p>
</li>
<li><p>Environment (for example, testing, staging, production)</p>
</li>
</ul>
<p>This structure helps large teams collaborate better, stay organized, and manage resources efficiently.</p>
<h2 id="heading-disadvantages-of-using-kubernetes">😬 Disadvantages of Using Kubernetes</h2>
<p>Kubernetes isn’t all rainbows and rockets 🚀. Like any other tool, it has its pros and cons. And it’s super important for startup founders, product managers, and even CEOs to know when Kubernetes is the right fit – and when it’s just overkill.</p>
<p>Let’s break down the main disadvantages in a simple, honest way:</p>
<h3 id="heading-1-youll-likely-need-a-devops-engineer-or-team">👨‍🔧 1. You’ll Likely Need a DevOps Engineer or Team</h3>
<p>Kubernetes is powerful, yes. But that power comes with great responsibility 😅.</p>
<p>In simple terms:</p>
<ul>
<li><p>You don't just "click a button" and your app is magically running.</p>
</li>
<li><p>Kubernetes needs someone who understands how to set it up, keep it running, and fix issues when they pop up. This person (or team) is usually called a DevOps Engineer, Site Reliability Engineer, or Cloud Engineer.</p>
</li>
</ul>
<p>Here’s what they’ll typically handle:</p>
<ul>
<li><p>Creating the cluster (the environment where your apps will run)</p>
</li>
<li><p>Defining how your app containers should behave (how many should run, how much memory they need, when they should restart, and so on)</p>
</li>
<li><p>Monitoring the apps and making sure they’re healthy</p>
</li>
<li><p>Ensuring security rules are followed</p>
</li>
<li><p>Handling automated scaling, deployment rollouts, backups, and so on.</p>
</li>
</ul>
<p>💡 <strong>In short:</strong> You’ll need someone skilled to manage this tool. If you’re a solo founder or a small team with no DevOps experience, Kubernetes might be too much upfront.</p>
<h3 id="heading-2-kubernetes-can-be-expensive-if-used-prematurely">💰 2. Kubernetes Can Be Expensive (If Used Prematurely)</h3>
<p>Kubernetes saves money at scale – but can cost more if you adopt it too early or for the wrong use case.</p>
<p>Here's why:</p>
<ul>
<li><p>Kubernetes is meant for managing multiple applications or microservices. If your business only has one small app, you’re using a rocket to deliver a pizza 🍕 – it’s just not necessary.</p>
</li>
<li><p>Kubernetes is also best when you have high or unpredictable traffic. It can automatically scale up your services when traffic spikes...but if your traffic is steady and small, you won’t benefit much from that power.</p>
</li>
</ul>
<p>Let’s say:</p>
<ul>
<li><p>You have one app with moderate traffic.</p>
</li>
<li><p>You deploy it on Kubernetes (which requires at least 1–2 VMs + setup).</p>
</li>
<li><p>You hire a DevOps engineer to manage it.</p>
</li>
<li><p>You pay for cloud compute + storage + monitoring.</p>
</li>
</ul>
<p>You could end up spending $300–$800/month or more... for something that could’ve been hosted on a simple service like <a target="_blank" href="https://render.com">Render</a>, <a target="_blank" href="https://www.heroku.com">Heroku</a>, or a basic VM for a fraction of the cost.</p>
<p>So when <strong>should</strong> you consider Kubernetes?</p>
<ul>
<li><p>When your platform is made up of multiple services (for example, separate services for user auth, payments, analytics, notifications, and so on)</p>
</li>
<li><p>When you’re expecting traffic spikes (for example, launching in new countries, going viral, or seasonal demand like Black Friday)</p>
</li>
<li><p>When you want flexibility in managing your infrastructure across cloud providers (AWS, GCP, Azure) or even on-premises</p>
</li>
</ul>
<h2 id="heading-use-cases-when-and-when-not-to-use-kubernetes">🧭 Use Cases: When (and When Not) to Use Kubernetes</h2>
<p>Kubernetes is an incredibly powerful tool – but it’s not always the right solution from day one.</p>
<p>Let’s break down when it makes sense to use Kubernetes and when it might be overkill 👇</p>
<h3 id="heading-when-you-should-use-kubernetes">✅ When You Should Use Kubernetes</h3>
<p>Kubernetes becomes essential in these scenarios:</p>
<h4 id="heading-1-your-application-is-made-of-many-microservices">1. Your Application Is Made of Many Microservices</h4>
<p>If your app is broken down into multiple microservices – like user authentication, payments, orders, notifications, and more – it’s a good sign that Kubernetes might eventually help.</p>
<p>Kubernetes can:</p>
<ul>
<li><p>Help manage each microservice independently</p>
</li>
<li><p>Automatically scale each one based on demand</p>
</li>
<li><p>Restart failed services automatically</p>
</li>
<li><p>Make it easier to roll out updates to specific parts of the application</p>
</li>
</ul>
<h4 id="heading-2-youre-getting-steady-and-high-traffic">2. You’re Getting <em>Steady and High</em> Traffic</h4>
<p>It’s not just about complexity – it’s about demand.</p>
<p>If your app receives a consistent, high volume of users (like hundreds or thousands every day), and you start seeing signs that your servers are getting overloaded, Kubernetes shines here. It can:</p>
<ul>
<li><p>Automatically increase resources when traffic surges</p>
</li>
<li><p>Balance the load across multiple servers</p>
</li>
<li><p>Prevent downtime due to traffic spikes</p>
</li>
</ul>
<h4 id="heading-3-you-want-portability-and-cloud-independence">3. You Want Portability and Cloud Independence</h4>
<p>If your business doesn’t want to be locked into just one cloud provider (for example, only AWS), Kubernetes gives you flexibility. You can move your application between AWS, GCP, Azure – or even to your own data center – with fewer changes.</p>
<h4 id="heading-4-your-devops-team-is-growing">4. Your DevOps Team Is Growing</h4>
<p>When you have multiple developers or teams working on different parts of the app, Kubernetes helps:</p>
<ul>
<li><p>Organize and isolate workloads per team</p>
</li>
<li><p>Improve collaboration and consistency</p>
</li>
<li><p>Provide easy access control and monitoring</p>
</li>
</ul>
<h3 id="heading-when-you-should-not-use-kubernetes">❌ When You Should Not Use Kubernetes</h3>
<p>Let’s be honest: Kubernetes is not for everyone, especially not at the beginning.</p>
<h4 id="heading-1-you-just-launched-your-app">1. You Just Launched Your App</h4>
<p>In the early days of your product, when you’ve just launched and traffic is still low, Kubernetes is <em>overkill</em>. You don’t need its complexity (yet).</p>
<p>👉 Instead, deploy your app or each microservice on a simple virtual machine (VM). It’s cheaper and faster to get started.</p>
<h4 id="heading-2-you-dont-need-auto-scaling-yet">2. You Don’t Need Auto-scaling (Yet)</h4>
<p>If traffic to your app is still small and manageable, a single server (or a few of them) can easily handle the load. In that case, it’s better to:</p>
<ul>
<li><p>Deploy your microservices manually or with Docker Compose</p>
</li>
<li><p>Monitor and scale manually when needed</p>
</li>
<li><p>Keep things simple until the need for automation becomes obvious</p>
</li>
</ul>
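<p>As a point of comparison, here’s what that simpler Docker Compose setup might look like for a couple of small services. The service names and the <code>my-api</code> image are hypothetical – this is just a sketch of the lighter-weight alternative, not a file from this tutorial’s repo:</p>
<pre><code class="lang-yaml">version: "3.8"
services:
  web:
    image: nginx # hypothetical front-end container
    ports:
      - "80:80"
  api:
    image: my-api:latest # hypothetical application image
    restart: unless-stopped # restart the container if it crashes
</code></pre>
<p>One file, one server, no cluster to operate – which is often all an early-stage product needs until traffic says otherwise.</p>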
<h4 id="heading-3-you-dont-have-a-devops-team">3. You Don’t Have a DevOps Team</h4>
<p>Kubernetes is powerful – but it needs expertise to set up and maintain. If you don’t have a DevOps engineer or someone who understands Kubernetes, it may cause more problems than it solves.</p>
<p>Hiring a DevOps team can be expensive, and setting up Kubernetes incorrectly can lead to outages, security risks, or wasted resources 💸</p>
<h3 id="heading-when-to-move-to-kubernetes">📈 When to Move to Kubernetes</h3>
<p>So, what’s the best path forward?</p>
<p>Here’s a simple roadmap:</p>
<ol>
<li><p><strong>Start small</strong>: Deploy your app (or microservices) on one or a few VMs</p>
</li>
<li><p><strong>Watch traffic</strong>: As user demand grows, increase VM size or replicate the app manually</p>
</li>
<li><p><strong>Track pain points</strong>: If scaling becomes too manual, or if services crash under load...</p>
</li>
<li><p><strong>Then adopt Kubernetes</strong> 🧠</p>
</li>
</ol>
<p>It’s not about how complex your app is – it’s about when the traffic and growth demand an upgrade in how you manage things.</p>
<h3 id="heading-tldr-for-founders-and-devops-teams">🎯 TL;DR for Founders and DevOps Teams</h3>
<ul>
<li><p>Don’t jump to Kubernetes just because it’s trendy</p>
</li>
<li><p>Use it only when traffic grows steadily and auto-scaling becomes necessary</p>
</li>
<li><p>Kubernetes is most valuable when you want to scale reliably and efficiently</p>
</li>
<li><p>Before that point, stick to simple deployments – it’ll save you time, money, and stress</p>
</li>
</ul>
<h2 id="heading-conclusion">🎉 Conclusion</h2>
<p>Wow! What a journey we’ve been on 😄</p>
<p>We started by answering the big question – <strong>What is Kubernetes?</strong> We discovered that it’s not some mythical beast, but a powerful orchestration tool that helps us manage, deploy, scale, and maintain containerized applications in a smarter way.</p>
<p>Then, we took a step back in time to see how applications were deployed before Kubernetes – the headaches of manually installing software on servers, spinning up separate cloud instances for every microservice, and racking up huge cloud bills just to stay afloat. We also saw how containers simplified things, but even they had their own limitations when managed at scale.</p>
<p>That’s where Kubernetes came to the rescue.</p>
<p>We explored:</p>
<ul>
<li><p><strong>The problems Kubernetes solves</strong> – like auto-scaling, efficient resource management, cost savings, and seamless container grouping.</p>
</li>
<li><p><strong>Kubernetes architecture and components</strong> – breaking down complex terms like the cluster, master node, worker nodes, Pods, Services, Kubelet, and more, into simple, easy-to-digest ideas.</p>
</li>
<li><p><strong>Kubernetes workloads</strong> like Deployments, Pods, Services, DaemonSets, and StatefulSets, and what they do behind the scenes to keep our apps running reliably.</p>
</li>
</ul>
<p>From theory to practice, we even got our hands dirty:</p>
<ul>
<li><p>We created a free Kubernetes cluster using Play with Kubernetes 🧪</p>
</li>
<li><p>Deployed a real application using both imperative (direct command) and declarative (manifest file) approaches</p>
</li>
<li><p>Understood why the declarative method makes our infrastructure easier to manage, especially when our systems grow.</p>
</li>
</ul>
<p>Then we took a business lens 🔍 and looked at:</p>
<ul>
<li><p>The advantages of Kubernetes – from auto-scaling during traffic surges, to cost efficiency, and cloud-agnostic deployment.</p>
</li>
<li><p>And also the disadvantages – like needing experienced DevOps engineers and not being ideal for every stage of a product's lifecycle.</p>
</li>
</ul>
<p>Finally, we wrapped up with real-life use cases, highlighting when Kubernetes is a must-have, and when it’s better to wait – especially for early-stage startups still trying to find their audience.</p>
<p>So, whether you're a DevOps newbie, a startup founder, or just someone curious about how modern tech keeps your favorite apps online – you now have a strong foundational understanding of Kubernetes 🙌</p>
<p>Kubernetes is powerful, but it doesn't have to be overwhelming. With a solid grasp of the basics (which you now have 💪), you're well on your way to managing scalable applications like a pro.</p>
<p>Start simple. Grow smart. And when the time is right – Kubernetes will be your best friend.</p>
<h2 id="heading-study-further"><strong>Study Further 📚</strong></h2>
<p>If you would like to learn more about Kubernetes, you can check out the courses below:</p>
<ul>
<li><p><a target="_blank" href="https://www.udemy.com/course/docker-kubernetes-the-practical-guide/">Docker &amp; Kubernetes: The Practical Guide (Academind - Udemy)</a></p>
</li>
<li><p><a target="_blank" href="https://www.coursera.org/specializations/certified-kubernetes-application-developer-ckad-course">Certified Kubernetes Application Developer (CKAD) Specialization (Coursera)</a></p>
</li>
</ul>
<h2 id="heading-about-the-author"><strong>About the Author 👨‍💻</strong></h2>
<p>Hi, I’m Prince! I’m a DevOps engineer and Cloud architect passionate about building, deploying, and managing scalable applications and sharing knowledge with the tech community.</p>
<p>If you enjoyed this article, you can learn more about me by exploring more of my blogs and projects on my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/">LinkedIn profile.</a> You can find my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/details/publications/">LinkedIn articles here</a>. You can also <a target="_blank" href="https://prince-onuk.vercel.app/achievements#articles">visit my website</a> to read more of my articles as well. Let’s connect and grow together! 😊</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Serverless Architecture Handbook: How to Publish a Node Js Docker Image to AWS ECR and Deploy the Container to AWS Lambda ]]>
                </title>
                <description>
                    <![CDATA[ Imagine you’re tasked with building a web application that can handle incoming traffic surges as your users grow without accumulating too much cost. Sounds like a dream, right? But here’s the thing: traditionally, to do this, you would have to manage... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/serverless-architecture-with-aws-lambda/</link>
                <guid isPermaLink="false">68006521c1f51bf42a74f4b0</guid>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ aws lambda ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Node.js ]]>
                    </category>
                
                    <category>
                        <![CDATA[ serverless ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ecr ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Prince Onukwili ]]>
                </dc:creator>
                <pubDate>Thu, 17 Apr 2025 02:19:13 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1744843935296/c359998f-1657-482f-adf4-5ab023cb1c02.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Imagine you’re tasked with building a web application that can handle incoming traffic surges as your users grow without accumulating too much cost. Sounds like a dream, right?</p>
<p>But here’s the thing: traditionally, to do this, you would have to manage lots of infrastructure – resources on which your application will be deployed – which can be a real headache. You’d have servers (VM instances or physical computers) to configure, databases to scale, load balancers to monitor...it’s a whole lot 😩</p>
<p>This is where Serverless architecture comes to the rescue. With the Serverless model, you can deploy applications that serve thousands of users without having to worry about runaway costs or about managing infrastructure, servers, networking, and so on.</p>
<p>In this article, you’ll learn about Serverless Architecture: what it’s all about, and how to deploy your very own application using AWS Lambda. We’ll walk through the entire process step-by-step:</p>
<ul>
<li><p>How to clone your application repository using Git.</p>
</li>
<li><p>How to build an image of your application using Docker.</p>
</li>
<li><p>How to install the AWS CLI on your local machine and create AWS IAM users with the right permissions to push your Docker image to AWS Elastic Container Registry (ECR).</p>
</li>
</ul>
<p>Once the image is stored on ECR, we’ll connect it to AWS Lambda and deploy the container for a fully serverless experience. 💡✨</p>
<p>Ready to go serverless? Let’s get started! 🚀</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-serverless-architecture">What is Serverless Architecture?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-differences-between-serverless-and-other-deployment-models">Differences Between Serverless and Other Deployment Models ⚡</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites-what-you-should-know-before-following-along">🧠 Prerequisites — What You Should Know Before Following Along!</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-the-application-using-git">How to Set Up the Application Using Git 🐙</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-the-codebase">Understanding the Codebase 🔎</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-a-docker-image-of-the-application">How to Create a Docker Image of the Application 🐋</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-a-container-registry-on-aws-elastic-container-registry-ecr">How to Create a Container Registry on AWS Elastic Container Registry (ECR) 📁</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-iam-with-aws-how-to-create-a-user-on-aws-iam-to-allow-access-to-your-aws-ecr">IAM with AWS: How to Create a User on AWS IAM to Allow Access to Your AWS ECR 👤🔐</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-upload-your-docker-image-to-the-aws-ecr-repository">How to Upload Your Docker Image to the AWS ECR repository ⬆️</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-deploy-the-application-container-to-aws-lambda-from-the-image-on-aws-ecr">How to Deploy the Application Container to AWS Lambda from the Image on AWS ECR 🚀</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-advantages-of-adopting-the-serverless-model-in-businesses">Advantages of Adopting the Serverless Model in Businesses 💼</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-disadvantages-of-the-serverless-model">Disadvantages of the Serverless Model 🚫</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-when-to-adopt-the-serverless-model">When to Adopt the Serverless Model 🤔</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion 📝</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-about-the-author">About the Author 👨‍💻</a></p>
</li>
</ol>
<h2 id="heading-what-is-serverless-architecture">What is Serverless Architecture?</h2>
<p>Before we dive deeper, let’s break down what we mean by Servers. In the tech world, servers are powerful computers that store, process, and manage data. Think of them as the behind-the-scenes workhorses that:</p>
<ul>
<li><p><strong>Store your data:</strong> Like a central filing cabinet for your digital documents.</p>
</li>
<li><p><strong>Run your applications:</strong> They execute the code that keeps your app or website running.</p>
</li>
<li><p><strong>Handle requests:</strong> Servers respond to user requests – like loading a webpage or processing a login.</p>
</li>
</ul>
<p>Alright, now let’s talk about Serverless Architecture – but first, let’s clear up a common misconception. When most people hear the word "Serverless", they immediately think, "Wait… no servers? How does that even work?!" 😅</p>
<p>Here’s the truth: Serverless doesn’t mean there are no servers involved (surprise, surprise! 😉). Instead, it means you, as a developer, don’t have to worry about managing the servers that your application runs on. The server-side infrastructure is fully handled by the cloud provider – in this case, AWS Lambda. You just focus on writing code and deploying it, and AWS takes care of the rest.</p>
<h3 id="heading-so-whats-the-big-deal-with-serverless">So, What’s the Big Deal with Serverless?</h3>
<p>In a traditional setup, when you deploy your application, you’re responsible for things like:</p>
<ul>
<li><p><strong>Provisioning servers</strong> (how many servers do you need? What size?)</p>
</li>
<li><p><strong>Scaling resources</strong> (how do you handle traffic spikes without overpaying?)</p>
</li>
<li><p><strong>Monitoring</strong> and keeping everything running smoothly.</p>
</li>
</ul>
<p>Sounds like a lot, right? 🤯 Well, Serverless Architecture simplifies all of that by letting you focus purely on your application code. With Lambda, you can run code in response to events (like an HTTP request, a file upload, or a database change) without worrying about the infrastructure behind it. AWS automatically scales the compute resources as needed, charging you only for the time your code is actually running. ⏱️💸</p>
<p>Imagine you’re at a restaurant. Instead of running the kitchen yourself (like managing your own servers), you just place an order (your code) and the chef (AWS Lambda) makes it for you, on-demand, based on what you need. 🍽️🍴</p>
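<p>To make this concrete, here’s a minimal sketch of a Lambda handler in Node.js (the <code>handler</code> name and response shape follow Lambda’s Node.js convention; the greeting payload is just an illustration):</p>
<pre><code class="lang-javascript">// index.js: a minimal Lambda handler sketch.
// Lambda calls this function with an event object each time a trigger fires,
// then the function "goes back to sleep" until the next event.
exports.handler = async function (event) {
  // Read an optional name from the event payload, defaulting to "world".
  const name = event.name ? event.name : "world";
  return {
    statusCode: 200,
    body: JSON.stringify({ message: "Hello, " + name + "!" }),
  };
};
</code></pre>
<p>AWS spins this function up on demand, runs it, and tears it down afterwards – you never touch the machine it ran on.</p>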
<h2 id="heading-differences-between-serverless-and-other-deployment-models">Differences Between Serverless and Other Deployment Models ⚡</h2>
<p>Now that you understand how Serverless works, let’s take a little detour and explore the other models used to deploy applications. After all, Serverless isn’t the only kid on the block, and this will give you some important perspective when choosing the right model for your use case. 👀</p>
<p>When you build an app, you need somewhere to host it – a home for your code to live and run. Over the years, the tech world has come up with different ways to handle this, and each one gives you a different level of control (and responsibility) over your servers.</p>
<p>Let’s break it down.</p>
<h3 id="heading-infrastructure-as-a-service-iaas">🏠 Infrastructure as a Service (IaaS)</h3>
<p>With IaaS, cloud providers like AWS, Google Cloud, or Microsoft Azure give you the building blocks – virtual servers (also called instances), storage, and networking tools – but it’s still your job to set everything up.</p>
<p>It’s like renting an empty apartment. You get the walls, the doors, and the roof, but you still have to bring your own furniture, set up your Wi-Fi, and clean the place regularly. 🏡🧹</p>
<p>When you choose IaaS, you’re responsible for:</p>
<ul>
<li><p>Configuring the servers (choosing the size, the operating system, and installing software).</p>
</li>
<li><p>Handling updates, patches, and security.</p>
</li>
<li><p>Scaling up or down when traffic changes.</p>
</li>
</ul>
<p><strong>Example:</strong> Amazon EC2 (Elastic Compute Cloud) is a classic IaaS service. You rent a virtual machine, set it up yourself, and manage it like a digital landlord.</p>
<h3 id="heading-platform-as-a-service-paas">🎯 Platform as a Service (PaaS)</h3>
<p>Next up, we’ve got PaaS – a more polished setup.</p>
<p>In this model, the cloud provider takes care of the infrastructure and the underlying operating system, so you don’t have to. You just upload your code, configure a few settings, and the platform runs your app.</p>
<p>It’s like moving into a fully furnished apartment — the kitchen works, the lights are on, and the Wi-Fi is already connected. You just show up with your bags and get to work! 🧳✨</p>
<p><strong>Example:</strong> AWS Elastic Beanstalk, Heroku, or Google App Engine.</p>
<h3 id="heading-serverless-the-special-paas">🌩️ Serverless: The Special PaaS</h3>
<p>Now here’s where things get interesting: Serverless actually falls under the PaaS umbrella, but it deserves its own spotlight. Why? Because it takes the convenience of PaaS and pushes it to the next level.</p>
<p>In a traditional PaaS model (like AWS Fargate or Heroku), your application is running 24/7, whether you have visitors using it or not. You pay for the reserved space and compute power all month long, just like renting an apartment. Even if you didn’t sleep there the entire month, the bill still comes at the end. 💸🏡</p>
<p>But with Serverless, the rules change. You only pay when your code is actually being used.</p>
<h4 id="heading-how-applications-run-in-the-serverless-model">How Applications Run in the Serverless Model ⚙️</h4>
<p>In a Serverless model, your application isn’t just sitting there running all day. It “wakes up” only when it’s needed. But what exactly causes it to wake up? That’s where triggers come in.</p>
<p>Triggers are events that tell your Serverless application, “Hey, it’s time to do something!” These events could be all sorts of things, like:</p>
<ul>
<li><p>A user visiting your website and clicking a button.</p>
</li>
<li><p>Someone uploading a file to your cloud storage (like an image or document).</p>
</li>
<li><p>A new row being added to a database.</p>
</li>
<li><p>An automated schedule (like a reminder that runs every day at 8 AM).</p>
</li>
</ul>
<p>When one of these events happens, your application instantly comes to life, runs the exact task you programmed, and then goes back to “sleep” until the next trigger. This is how Serverless keeps your cloud costs low and your resources efficient – no constant running in the background, only action when there’s actually something to do! ⚡😎</p>
<p>For example, if a user sends a request that triggers your application to run for just 10 seconds, those 10 seconds (at the memory size you allocated) are all you pay for — the exact time and resources consumed.</p>
<p>No users? No requests? No payment. Now that’s a smart way to save money. 🧠💰</p>
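<p>In practice, Lambda-style billing multiplies the memory you allocate by how long your code runs. Here’s a tiny sketch of that arithmetic (the rate below is a made-up illustration, not AWS’s current price list):</p>
<pre><code class="lang-javascript">// Pay-per-use cost sketch. The rate is an assumption for illustration
// only, NOT current AWS pricing.
const PRICE_PER_GB_SECOND = 0.0000166667; // hypothetical USD rate

function estimateCost(memoryMb, durationSeconds, invocations) {
  // Billing multiplies allocated memory (in GB) by total running time.
  const gbSeconds = (memoryMb / 1024) * durationSeconds * invocations;
  return gbSeconds * PRICE_PER_GB_SECOND;
}

// 1,000 requests, each running 10 seconds with 128MB allocated:
const monthlyCost = estimateCost(128, 10, 1000);
</code></pre>
<p>And <code>estimateCost(128, 10, 0)</code> returns 0 – the “no users, no payment” property in one line.</p>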
<h3 id="heading-quick-comparison-paas-vs-serverless">💡 Quick Comparison: PaaS vs Serverless</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Feature</strong></td><td><strong>Traditional PaaS (example: AWS Fargate)</strong></td><td><strong>Serverless PaaS (example: AWS Lambda)</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Server Configuration</td><td>You select compute size &amp; limits.</td><td>No need — AWS handles it all.</td></tr>
<tr>
<td>Scaling</td><td>You configure scaling policies.</td><td>Automatic, event-driven scaling (based on incoming traffic). The higher the traffic, the more compute power is added to your application, and vice versa. 😃</td></tr>
<tr>
<td>Billing</td><td>Charged for running instances 24/7, even when idle.</td><td>Charged only when your code runs. ⏱️💸</td></tr>
<tr>
<td>Deployment</td><td>Deploy full applications.</td><td>Deploy small chunks of code (functions). You can also deploy microservices and full-scale web applications.</td></tr>
</tbody>
</table>
</div><hr>
<h2 id="heading-prerequisites-what-you-should-know-before-following-along">🧠 Prerequisites — What You Should Know Before Following Along</h2>
<p>Before we dive in, here’s the best part: I wrote this article to be super beginner-friendly and detailed, so even if you have little to no programming background, you’ll still be able to follow along.</p>
<p>Whether you’re a developer, a tech-curious startup, or a business leader trying to understand modern cloud solutions, this guide was written for you.</p>
<p>That said, having some light knowledge in these areas will make the ride even smoother:</p>
<ul>
<li><p>🧑‍💻 Basic Programming Concepts – like how Node.js apps run and what a server does.</p>
</li>
<li><p>💡 Familiarity with Common Tech Terms – words like “deploy,” “application,” “CPU,” and “software” will pop up, but don’t worry: I’ve done my best to break these down into simple, relatable explanations.</p>
</li>
</ul>
<p>No prior cloud experience? No problem! This guide holds your hand all the way from setup to deployment – all in plain language, no jargon.</p>
<p>So buckle up, and let’s proceed with deploying your very own application to AWS Lambda. 😁</p>
<h2 id="heading-how-to-set-up-the-application-using-git">How to Set Up the Application Using Git 🐙</h2>
<p>Before we jump into writing code or deploying anything, the very first step is to grab the application we’ll be working with — and for that, we’ll be using Git.</p>
<p>But wait... what’s Git? It’s a Version Control System (VCS) that helps developers track changes to their code, collaborate with teammates without stepping on each other’s toes, and safely store their work in a central place — like GitHub.</p>
<h3 id="heading-clone-the-application-repository">Clone the Application Repository 🧑‍💻</h3>
<p>I’ve already created a simple project for us to use in this tutorial — it’s sitting pretty on GitHub, waiting for you.</p>
<p>To clone the project onto your local machine, open up your terminal and run:</p>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> https://github.com/onukwilip/lambda-tutorial.git
</code></pre>
<p>This command will download all the code from the <code>lambda-tutorial</code> repository into a folder on your computer. 📁</p>
<p>Once the cloning is done, navigate into the project directory like this:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> lambda-tutorial
</code></pre>
<p>Boom — just like that, your local machine is now set up with the same code that’s stored in the GitHub repo. 🏡</p>
<h2 id="heading-understanding-the-codebase">Understanding the Codebase 🔎</h2>
<h3 id="heading-open-the-codebase-in-your-favorite-ide">Open the Codebase in Your Favorite IDE 🧑‍💻</h3>
<p>For this tutorial, we’ll be using Visual Studio Code (VS Code), but feel free to use any editor you’re comfortable with.</p>
<p>Once you open the <code>lambda-tutorial</code> project folder, you’ll notice it’s a simple Node.js web server. Nothing too fancy — just a server that can handle requests and respond with some data.</p>
<p>Now, it’s important to understand what’s going on inside our codebase, especially if you’re coming from deploying on platforms like Render, Vercel, or Google Cloud Run.</p>
<h3 id="heading-deploying-to-lambda-vs-other-serverless-platforms"><strong>Deploying to Lambda vs Other Serverless Platforms ⚡</strong></h3>
<p>When you deploy to platforms like Vercel, Render, or Google Cloud Run, you usually package your web server just the way you wrote it – whether it’s a Node.js Express server or a Next.js app – and the platform handles it pretty much as-is.</p>
<p>Those platforms run your server like a mini container (or microservice) that’s always ready to handle incoming traffic, just like a waiter standing by at your table, waiting for your order.</p>
<p>But AWS Lambda works a little differently.</p>
<p>Lambda expects your code to be organized around functions – not full web servers. Think of Lambda as a chef that only shows up the moment an order is placed, cooks the food, and disappears once the job is done. 👨‍🍳🍽️</p>
<p>So if you’ve got a full-blown Node.js Express server, you’ll need to do a tiny bit of “translation” to fit Lambda’s expectations – and that’s where the <code>lambda.js</code> file comes in.</p>
<h4 id="heading-the-lambdajs-file-your-lambda-translator">The <code>lambda.js</code> File — Your Lambda Translator 🔀</h4>
<p>Here’s what the file looks like:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> serverless = <span class="hljs-built_in">require</span>(<span class="hljs-string">"serverless-http"</span>);
<span class="hljs-keyword">const</span> app = <span class="hljs-built_in">require</span>(<span class="hljs-string">"./app"</span>);

<span class="hljs-keyword">const</span> handler = serverless(app);
<span class="hljs-built_in">module</span>.exports.handler = handler;
</code></pre>
<p>Let’s break it down:</p>
<ul>
<li><p><code>const serverless = require("serverless-http");</code>: This imports a handy little library called <code>serverless-http</code>, which our app needs in order to run properly on AWS Lambda. It acts like a translator: it takes your regular Express app and wraps it so that AWS Lambda can understand it.</p>
</li>
<li><p><code>const handler = serverless(app);</code>: Here’s the magic. This wraps your Express app into a Lambda-compatible function.</p>
</li>
<li><p><code>module.exports.handler = handler;</code>: This exports your wrapped function so AWS Lambda can call it when the application is triggered.</p>
</li>
</ul>
<p>So, instead of starting your server like this:</p>
<pre><code class="lang-javascript">app.listen(<span class="hljs-number">5000</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">"Server running on port 5000"</span>);
});
</code></pre>
<p>You’re handing your app over to Lambda and letting it handle incoming requests, scale, and run the app only when it’s needed.</p>
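<p>Under the hood, <code>serverless-http</code> is doing a translation job. Here’s a stripped-down sketch of the idea – not the real library, just a toy illustration of the pattern (all names are invented for the example):</p>
<pre><code class="lang-javascript">// A toy "wrap" function showing what serverless-http does conceptually:
// turn a Lambda event into a request the app understands, and turn the
// app's answer back into the { statusCode, body } shape Lambda expects.
function wrap(app) {
  return async function handler(event) {
    const request = {
      method: event.httpMethod ? event.httpMethod : "GET",
      path: event.path ? event.path : "/",
    };
    const response = app(request); // the app decides what to answer
    return { statusCode: response.status, body: response.body };
  };
}

// A trivial stand-in for an Express app:
function app(request) {
  return { status: 200, body: "You asked for " + request.path };
}

const handler = wrap(app);
</code></pre>
<p>The real library handles far more (headers, body parsing, binary data), but the shape is the same: event in, HTTP-style response out.</p>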
<h4 id="heading-the-appjs-file-your-classic-express-app">The <code>app.js</code> File — Your Classic Express App 💻</h4>
<p>Your <code>app.js</code> is where the main application logic lives. Here is usually where you:</p>
<ul>
<li><p>Set up Express.</p>
</li>
<li><p>Define routes (like <code>/api</code>, <code>/users</code>, <code>/hello</code>).</p>
</li>
<li><p>Apply middleware (like JSON parsing, logging, CORS, and so on).</p>
</li>
<li><p>Handle HTTP requests and send back responses.</p>
</li>
</ul>
<p>In a normal deployment (Render, Google Cloud Run, DigitalOcean, or your own server), you’d start the server using <code>app.listen(PORT)</code> at the bottom of this file.</p>
<p>But since we’re deploying to Lambda, you don’t directly start the server here. Instead, you export the <code>app</code> like this:</p>
<pre><code class="lang-javascript"><span class="hljs-built_in">module</span>.exports = app;
</code></pre>
<p>This way, your application stays “server-agnostic” – it’s not hardcoded to run on a traditional server. Lambda (via the <code>lambda.js</code> file) takes care of starting and stopping your app whenever it’s triggered by an event (like an HTTP request). Smart, right? 💡</p>
<p>Why this setup? 🤔</p>
<p>This little separation gives you flexibility:</p>
<ul>
<li><p>You can write your Node.js app like you always would (using <code>Express</code>) inside <code>app.js</code>.</p>
</li>
<li><p>And you only tweak the entry point (via <code>lambda.js</code>) to fit AWS Lambda’s expectations.</p>
</li>
</ul>
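<p>The same separation can be sketched in plain Node (no Express, invented names) to show why the app stays server-agnostic:</p>
<pre><code class="lang-javascript">// app.js equivalent: pure request-handling logic. Note there is no
// listen() call here, so nothing ties this code to a long-running server.
function app(path) {
  if (path === "/hello") return "Hi there!";
  return "Not found";
}
module.exports = app;

// In a traditional deployment you would add a server entry point:
//   require("http").createServer(...).listen(5000);

// lambda.js equivalent: the serverless entry point exports a handler instead.
module.exports.handler = async function (event) {
  return { statusCode: 200, body: app(event.path) };
};
</code></pre>
<p>Same application logic, two interchangeable entry points – that’s the flexibility this setup buys you.</p>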
<h2 id="heading-how-to-create-a-docker-image-of-the-application">How to Create a Docker Image of the Application 🐋</h2>
<p>Now that we’ve had a good look at the code, let’s package it up the smart way — using Docker.</p>
<h3 id="heading-what-is-docker">What is Docker? 🐳</h3>
<p>Now, you might be wondering, <em>"Why are we using Docker?"</em></p>
<p>Docker is a tool for creating images of your applications and running those images as containers. Just like real-world shipping containers hold goods securely, Docker containers hold your app, bundled with everything it needs to run: its code, libraries, dependencies, and settings. Everything is wrapped up neatly, so your app runs the same way everywhere, whether on your laptop, AWS Lambda, or even your friend’s machine.</p>
<h3 id="heading-lets-take-a-look-at-the-dockerfile">Let’s Take a Look at the Dockerfile 🔍</h3>
<p>Inside your project folder, you’ll find a file named <code>Dockerfile</code>. This is basically the recipe that Docker uses to build your app’s container image.</p>
<p>Here’s what it looks like:</p>
<pre><code class="lang-dockerfile"><span class="hljs-keyword">FROM</span> node:<span class="hljs-number">18</span>-slim AS builder

<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>

<span class="hljs-keyword">COPY</span><span class="bash"> package.json .</span>

<span class="hljs-keyword">RUN</span><span class="bash"> npm i -f</span>

<span class="hljs-keyword">COPY</span><span class="bash"> . .</span>

<span class="hljs-keyword">USER</span> root

<span class="hljs-keyword">FROM</span> amazon/aws-lambda-nodejs

<span class="hljs-keyword">ENV</span> PORT=<span class="hljs-number">5000</span>

<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder /app/ <span class="hljs-variable">${LAMBDA_TASK_ROOT}</span></span>
<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder /app/node_modules <span class="hljs-variable">${LAMBDA_TASK_ROOT}</span>/node_modules</span>
<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder /app/package.json <span class="hljs-variable">${LAMBDA_TASK_ROOT}</span></span>
<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder /app/package-lock.json <span class="hljs-variable">${LAMBDA_TASK_ROOT}</span></span>

<span class="hljs-keyword">EXPOSE</span> <span class="hljs-number">5000</span>

<span class="hljs-keyword">CMD</span><span class="bash"> [ <span class="hljs-string">"lambda.handler"</span> ]</span>
</code></pre>
<p>Let’s break down the important steps – in plain English: 😎</p>
<ul>
<li><p><code>FROM node:18-slim AS builder</code>: We start by using a lightweight version of Node.js called <code>node:18-slim</code> and give it a tag named <code>builder</code> (think of it as Stage 1). This gives us the tools we need to build a Node.js app, without extra stuff that makes the image heavy. The <code>builder</code> tag lets us reuse the content of this stage in the next one.</p>
</li>
<li><p><code>WORKDIR /app</code>: We set the working directory inside the container to <code>/app</code>. Think of this as telling Docker: <em>"Hey, this is the folder where I’ll be working from!"</em></p>
</li>
<li><p><code>COPY package.json .</code>: This copies the <code>package.json</code> file (which lists your app’s dependencies) into the <code>/app</code> folder inside the container.</p>
</li>
<li><p><code>RUN npm i -f</code>: This installs all the Node.js dependencies (the packages your app needs to work).<br>  The <code>-f</code> flag forces npm to resolve conflicts if any pop up.</p>
</li>
<li><p><code>COPY . .</code>: This copies the rest of your project files from your computer into the container.</p>
</li>
<li><p><code>USER root</code>: This sets the user to root (administrator level) inside the container. Useful when extra permissions are needed for certain tasks.</p>
</li>
<li><p><code>FROM amazon/aws-lambda-nodejs</code>: Now here’s the switch: we swap to the official AWS Lambda base image for Node.js! That is, Stage 2. This image is designed to work smoothly when deploying containers to Lambda.</p>
</li>
<li><p><code>ENV PORT=5000</code>: We set an environment variable for the server port. Our app will listen on port 5000.</p>
</li>
<li><p><code>COPY --from=builder /app/ ${LAMBDA_TASK_ROOT}</code>: This grabs all the files from the builder stage and copies them into Lambda’s special working directory (<code>${LAMBDA_TASK_ROOT}</code>).</p>
</li>
<li><p><code>COPY --from=builder /app/node_modules ${LAMBDA_TASK_ROOT}/node_modules</code>: Same thing, but this one specifically copies the node_modules folder (all your installed dependencies) into Lambda’s working directory.</p>
</li>
<li><p><code>COPY --from=builder /app/package.json ${LAMBDA_TASK_ROOT}</code>: Copies the <code>package.json</code> file into Lambda’s working directory.</p>
</li>
<li><p><code>COPY --from=builder /app/package-lock.json ${LAMBDA_TASK_ROOT}</code>: Copies the lock file for your dependencies – so Lambda knows exactly which versions of libraries to use.</p>
</li>
<li><p><code>EXPOSE 5000</code>: This tells Docker, <em>“Hey, my app is going to listen for requests on port 5000!"</em> (Though Lambda doesn’t use this directly, it’s useful for local testing.)</p>
</li>
<li><p><code>CMD [ "lambda.handler" ]</code>: This tells AWS Lambda which function to run when the container starts.<br>  In this case, it’s looking for a <code>handler</code> function inside your app – that’s the entry point!</p>
</li>
</ul>
<h3 id="heading-how-to-create-our-own-docker-image">How to Create Our Own Docker Image</h3>
<p>Before we proceed, you need to have Docker running on your machine. If you haven’t installed Docker yet, check out the official installation guide here: <a target="_blank" href="https://docs.docker.com/engine/install/">Docker Installation Tutorial</a>. It’s a great resource to get Docker up and running.</p>
<h4 id="heading-ensure-docker-is-running">Ensure Docker is Running</h4>
<p>Make sure Docker Desktop is installed and running. You can usually tell by the Docker icon in your system tray. If it’s not running, start it up before proceeding.</p>
<h4 id="heading-build-the-docker-image">Build the Docker Image</h4>
<p>Now, it’s time to create a Docker image of our application. In your terminal, navigate to the root directory of your project (where your Dockerfile is located). Then run the following command:</p>
<pre><code class="lang-bash">docker build -t demo-lambda-project:latest .
</code></pre>
<ul>
<li><p>The <code>docker build</code> command tells Docker to create an image.</p>
</li>
<li><p>The <code>-t demo-lambda-project:latest</code> flag assigns a tag (or name) to your image (we’ll change this later to the image naming convention supported by AWS Elastic Container Registry – ECR).</p>
<ul>
<li>Here, <code>demo-lambda-project</code> is the name, and <code>latest</code> is the tag indicating the most recent build.</li>
</ul>
</li>
<li><p>The <code>.</code> at the end tells Docker to look for the Dockerfile in the current directory.</p>
</li>
</ul>
<h4 id="heading-what-this-does">What This Does</h4>
<p>Docker will now follow the instructions in your Dockerfile step-by-step. It starts by building your Node.js app (using the lightweight Node 18 image), installs the dependencies, and then copies everything over to an AWS Lambda-ready image. Once done, you have a neat image tagged as <code>demo-lambda-project:latest</code> that’s ready for deployment.</p>
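<p>Before pushing it anywhere, you can sanity-check the image locally. AWS’s Lambda base images ship with a built-in runtime interface emulator, so a quick local test looks like this (the port mapping and invocation URL follow the emulator’s convention):</p>
<pre><code class="lang-bash"># Start the container; the Lambda base image launches its runtime
# interface emulator, which listens on port 8080 inside the container.
docker run -p 9000:8080 demo-lambda-project:latest

# In a second terminal, invoke the function the way Lambda would:
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'
</code></pre>
<p>If you get a response back, your image is wired up correctly and ready for ECR.</p>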
<h2 id="heading-how-to-create-a-container-registry-on-aws-elastic-container-registry-ecr">How to Create a Container Registry on AWS Elastic Container Registry (ECR) 📁</h2>
<p>Okay, let’s dive into creating an image registry on AWS Elastic Container Registry (ECR). Follow these steps closely to set up your repository named <code>lambda-practice</code>:</p>
<h3 id="heading-step-1-sign-in-and-navigate-to-aws-ecr">Step 1: Sign In and Navigate to AWS ECR</h3>
<p>Log in to your AWS Management Console: <a target="_blank" href="https://console.aws.amazon.com/console/home">https://console.aws.amazon.com/console/home</a>.</p>
<p>In the search bar at the top, type "ECR". You should see Amazon ECR pop up in the dropdown results. Click on it to navigate to the Elastic Container Registry section.</p>
<h3 id="heading-step-2-start-creating-your-repository">Step 2: Start Creating Your Repository</h3>
<p>Once you’re in the ECR section, look for a button that says "Create repository". Click this button to start setting up your new container registry.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744649904087/615bbd21-c6ed-4243-9a18-10042eec9634.png" alt="Create new AWS ECR repository" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-3-configuring-the-repository-details">Step 3: Configuring the Repository Details</h3>
<p>You’ll need to add some info like:</p>
<ul>
<li><p><strong>Repository name:</strong> In the form that appears, enter <code>lambda-practice</code> as the repository name. This name will be used to reference your repository later when uploading your Docker image.</p>
</li>
<li><p><strong>Tag mutability:</strong> You’ll also see an option for Tag Mutability. For this tutorial, set it to Mutable. This means that if you need to update or change a tag on your image later, you can do so. (Keep in mind that in some scenarios, you might want immutable tags for images used in production environments – but mutable tags are great for testing and development, especially since we want to use the tag <code>latest</code> for our images.)</p>
</li>
</ul>
<p>When you’re happy with the settings, click the "Create repository" button at the bottom of the form.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744650070919/3010590f-f2e3-4d52-9631-8c5d4e1a5239.png" alt="Configure AWS ECR repository" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-repository-created-now-lets-take-a-look">Repository Created – Now Let’s Take a Look</h3>
<p>After creating the repository, AWS will redirect you to the page listing your repositories.</p>
<p>Find the repository named <code>lambda-practice</code> in the list. This is your newly created container registry where you can push Docker images.</p>
<p>Copy the <code>lambda-practice</code> repository URI, which we’ll need later when we push our image from our local machine. The URI should be in a format similar to this: <code>&lt;aws_account_id&gt;.dkr.ecr.&lt;region&gt;.amazonaws.com/lambda-practice</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744650192129/67d724c7-15da-4ff1-8e38-638c3a8d1aa4.png" alt="Completed creation of AWS ECR repository" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>And that’s it! You’ve now successfully created a container registry on AWS ECR and have your repository (<code>lambda-practice</code>) ready to receive your Docker image. 🚀</p>
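<p>Hold on to that URI – it becomes part of your image’s name. As a quick preview of the naming convention (the <code>ECR_REPOSITORY_URI</code> variable below is a stand-in for the URI you just copied):</p>
<pre><code class="lang-bash"># Re-tag the local image as repository_uri:tag, for example
#   123456789012.dkr.ecr.us-east-1.amazonaws.com/lambda-practice:latest
ECR_REPOSITORY_URI="123456789012.dkr.ecr.us-east-1.amazonaws.com/lambda-practice"
docker tag demo-lambda-project:latest "$ECR_REPOSITORY_URI:latest"
</code></pre>
<p>We’ll walk through the actual tagging and pushing in detail once our permissions are set up.</p>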
<h2 id="heading-iam-with-aws-how-to-create-a-user-on-aws-iam-to-allow-access-to-your-aws-ecr">IAM with AWS: How to Create a User on AWS IAM to Allow Access to Your AWS ECR 👤🔐</h2>
<p>Now that we’ve successfully created our AWS ECR container registry (the home for our Docker image), it’s time to make sure our local machine has the necessary permissions to interact with that registry. Without proper authorization, we won’t be able to upload our image.</p>
<p>To do that, we’ll create an IAM user with the appropriate permissions.</p>
<h3 id="heading-step-1-access-the-iam-console">Step 1: Access the IAM Console</h3>
<p>Start by logging in to your AWS Management Console: <a target="_blank" href="https://console.aws.amazon.com/console/home">https://console.aws.amazon.com/console/home</a>.</p>
<p>In the search bar at the top, type "IAM" and select the IAM service from the dropdown. This brings you to the IAM dashboard where you can manage users, roles, policies, and more.</p>
<h3 id="heading-step-2-navigate-to-the-users-section">Step 2: Navigate to the Users Section</h3>
<p>On the left sidebar of the IAM dashboard, click on "Users". Here you'll see a list of existing users, and this is where you'll add a new one.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744651384601/085a25ca-82eb-447b-8106-46df32264a85.png" alt="Create AWS IAM User" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-3-create-a-new-user">Step 3: Create a New User</h3>
<p>Click the "Add users" button at the top. In the "Set user details" step, enter the username as <code>lambda-practice</code>.</p>
<h3 id="heading-step-4-attach-permissions-directly">Step 4: Attach Permissions Directly</h3>
<p>In the "Set permissions" step, choose "Attach policies directly". In the search box, type <code>AmazonEC2ContainerRegistryPowerUser</code>. Select the <code>AmazonEC2ContainerRegistryPowerUser</code> policy by ticking its checkbox. This policy grants the necessary permissions to work with AWS ECR, such as pushing and pulling Docker images.</p>
<p>Click Next, and verify that the username is <code>lambda-practice</code> and that the AmazonEC2ContainerRegistryPowerUser policy is attached. If everything looks good, click "Create user".</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744651476901/c6d91c8c-9757-4cc6-a00f-c23d3a72de59.png" alt="Add policy to AWS IAM User" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-5-generate-access-keys-for-the-user">Step 5: Generate Access Keys for the User</h3>
<p>Once the user is created, you’ll be redirected to the page listing all IAM users. Locate and click on the user <code>lambda-practice</code>. This action will take you to the user’s summary page.</p>
<ul>
<li><p>Navigate to the "Security credentials" tab.</p>
</li>
<li><p>Under "Access keys", click the "Create access key" button.</p>
</li>
<li><p>A page will appear for configuring the new access key.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744652284582/f6a586e9-d09e-467f-ad12-81ccf538bc34.png" alt="Create Access key for AWS IAM User" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>In the "Access key best practices &amp; alternatives" step, select "Command Line Interface (CLI)".</p>
<p><strong>Why should you select this option?</strong> Choosing CLI ensures that the generated access key is optimized for use with the AWS CLI and other command-line tools (like Docker commands that push images to ECR), which is exactly what we need for our workflow.</p>
<p>Leave the other configurations as their default settings, and then click "Create access key".</p>
<p>Once the key is created, you’ll see the new Access key ID and Secret access key. Make sure to copy and store these credentials securely. They are essential for authorizing your local machine to access AWS ECR and perform operations with the permissions assigned to the <code>lambda-practice</code> user.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744652339772/c3d94e2a-f823-4d73-9a46-ab4d829289e9.png" alt="Completed creation of Access key for AWS IAM User" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
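<p>For reference, the whole user setup above can also be scripted with the AWS CLI. This is a hedged sketch, not a replacement for the console flow – it assumes you already have an administrator profile configured on your machine:</p>
<pre><code class="lang-bash"># Create the user, attach the ECR power-user policy, and generate an access key
aws iam create-user --user-name lambda-practice

aws iam attach-user-policy \
    --user-name lambda-practice \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryPowerUser

# Prints the Access key ID and Secret access key – store them securely
aws iam create-access-key --user-name lambda-practice
</code></pre>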
<h3 id="heading-how-to-authorize-your-local-pc-to-publish-images-to-the-aws-ecr-repository"><strong>How to Authorize Your Local PC to Publish Images to the AWS ECR Repository</strong></h3>
<p>Now that we have our IAM user set up and the access keys in hand, it’s time to authenticate our local PC so we can securely push our Docker images to AWS ECR using the AWS CLI. Follow these steps:</p>
<h4 id="heading-step-1-install-the-aws-cli">Step 1: Install the AWS CLI</h4>
<p>If you haven’t installed the AWS CLI on your machine yet, download and install it using the official guide here: <a target="_blank" href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html">Install the AWS CLI</a>.</p>
<p>This tool allows you to interact with your AWS account right from the command line, which is essential for pushing images to ECR.</p>
<h4 id="heading-step-2-configure-your-aws-cli-credentials">Step 2: Configure Your AWS CLI Credentials</h4>
<p>Once installed, you need to configure your AWS CLI to use the credentials associated with the <code>lambda-practice</code> user. Open your terminal and run the following command to set up a new profile named <code>lambda</code>:</p>
<pre><code class="lang-bash">aws configure --profile lambda
</code></pre>
<p>You’ll be prompted to enter the following details:</p>
<ul>
<li><p><strong>AWS Access Key ID:</strong> Paste the access key ID that you generated for the <code>lambda-practice</code> user.</p>
</li>
<li><p><strong>AWS Secret Access Key:</strong> Paste the corresponding secret access key.</p>
</li>
<li><p><strong>Default region name:</strong> Enter your preferred AWS region (for example, <code>us-east-1</code> or your relevant region).</p>
</li>
<li><p><strong>Default output format:</strong> You can leave this as <code>json</code> or choose your preferred format.</p>
</li>
</ul>
<p>This command configures a new CLI profile called <code>lambda</code> with the credentials of our IAM user.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744652931837/650c93af-25f0-4d7b-a202-50d825a6b77a.png" alt="Authenticate and authorize AWS CLI with AWS IAM User Access key" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-step-3-verify-the-configuration">Step 3: Verify the Configuration</h4>
<p>To ensure everything is set up correctly, run:</p>
<pre><code class="lang-bash">aws sts get-caller-identity --profile lambda
</code></pre>
<p>This command will return details about the IAM user configured for the <code>lambda</code> profile, confirming that your local PC is now authenticated correctly.</p>
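<p>If the profile is configured correctly, the command returns a small JSON document along these lines (the IDs and ARN below are placeholders – yours will show your own account ID and user):</p>
<pre><code class="lang-json">{
    "UserId": "AIDAEXAMPLEUSERID",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/lambda-practice"
}
</code></pre>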
<p>Now you’re all set! Your AWS CLI is configured with the <code>lambda</code> profile, meaning your local machine has the right credentials to interact with your AWS ECR repository and push Docker images using the permissions assigned to your <code>lambda-practice</code> IAM user.</p>
<h2 id="heading-how-to-upload-your-docker-image-to-the-aws-ecr-repository">How to Upload Your Docker Image to the AWS ECR repository ⬆️</h2>
<p>Uploading your Docker image to AWS ECR is the moment when your hard work gets sent off to your repository so AWS Lambda can later grab and run your container. Now that your PC is authorized to talk to ECR, let’s take a look at how to upload the image.</p>
<h3 id="heading-step-1-log-in-to-ecr-with-docker">Step 1: Log in to ECR with Docker</h3>
<p>Before you can push your image, you need to authenticate Docker to your AWS ECR account. You do this by running a command that gets an authentication token from AWS and pipes it to Docker. For example:</p>
<pre><code class="lang-bash">aws ecr get-login-password --region &lt;YOUR_REGION&gt; --profile lambda | docker login --username AWS --password-stdin &lt;YOUR_AWS_ACCOUNT_ID&gt;.dkr.ecr.&lt;YOUR_REGION&gt;.amazonaws.com
</code></pre>
<p>Let’s break it down:</p>
<ul>
<li><p><code>aws ecr get-login-password --region &lt;YOUR_REGION&gt; --profile lambda</code>: This part uses the AWS CLI to get a temporary login password for ECR. Be sure to replace <code>&lt;YOUR_REGION&gt;</code> with the region in which your ECR repository was created (for example, <code>us-east-1</code>).</p>
</li>
<li><p><code>| docker login --username AWS --password-stdin &lt;YOUR_AWS_ACCOUNT_ID&gt;.dkr.ecr.&lt;YOUR_REGION&gt;.amazonaws.com</code>: The pipe (<code>|</code>) takes the password from the AWS CLI command and passes it as input to <code>docker login</code>. The login command then logs Docker into ECR using the provided username (<code>AWS</code>) and the password. Replace <code>&lt;YOUR_AWS_ACCOUNT_ID&gt;</code> with your actual AWS account ID.</p>
</li>
</ul>
<h3 id="heading-step-2-environment-considerations">Step 2: Environment Considerations</h3>
<p>This command works in shell environments such as PowerShell, Zsh, and Bash.</p>
<p><strong>Windows Users (CMD)</strong>:<br>If you’re using the classic Windows Command Prompt (CMD), the piping syntax might not work the same way. In that case, you might consider using Windows PowerShell or Git Bash. Alternatively, you can run the command in an environment like Windows Subsystem for Linux (WSL).</p>
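<p>If you’d rather avoid piping entirely (for example in classic CMD), you can log in manually in two steps. Note that the token printed in step 1 is temporary (valid for 12 hours), but it will sit in your terminal history, so prefer the piped form where possible:</p>
<pre><code class="lang-bash"># 1) Print a temporary login token, then copy it:
aws ecr get-login-password --region &lt;YOUR_REGION&gt; --profile lambda

# 2) Log in, pasting the copied token when Docker prompts for a password:
docker login --username AWS &lt;YOUR_AWS_ACCOUNT_ID&gt;.dkr.ecr.&lt;YOUR_REGION&gt;.amazonaws.com
</code></pre>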
<h4 id="heading-why-use-the-correct-region">Why Use the Correct Region?</h4>
<p>It is crucial to use the exact region where your ECR repository was created. The region is a part of your repository URI. If you use the wrong region, the login will fail because it won’t find the correct repository endpoint.</p>
<h4 id="heading-how-to-check-the-region">How to Check the Region:</h4>
<p>Log in to your AWS Console, navigate to the ECR section, and select your repository. The URI will look similar to this: <code>&lt;YOUR_AWS_ACCOUNT_ID&gt;.dkr.ecr.&lt;YOUR_REGION&gt;.amazonaws.com/lambda-practice</code>. Here, <code>&lt;YOUR_REGION&gt;</code> is the region you must use in your login command.</p>
<h3 id="heading-step-3-build-your-docker-image-with-the-correct-tag">Step 3: Build Your Docker Image with the Correct Tag</h3>
<p>Before pushing the image to ECR, you need to build it on your local machine and tag it with your repository’s name. In your terminal, navigate to your project’s root folder (where your Dockerfile is located), then run the following command, replacing the <code>&lt;YOUR_AWS_ACCOUNT_ID&gt;</code> and <code>&lt;YOUR_REGION&gt;</code> placeholders with your AWS account ID and ECR repository region. Note the trailing dot – it sets the build context to the current directory:</p>
<pre><code class="lang-bash">docker build -t &lt;YOUR_AWS_ACCOUNT_ID&gt;.dkr.ecr.&lt;YOUR_REGION&gt;.amazonaws.com/lambda-practice:latest .
</code></pre>
<h3 id="heading-step-4-push-your-docker-image-to-aws-ecr">Step 4: Push Your Docker Image to AWS ECR</h3>
<p>Once your image is built and tagged, it’s time to push it to your remote ECR repository. Run the following command:</p>
<pre><code class="lang-bash">docker push &lt;YOUR_AWS_ACCOUNT_ID&gt;.dkr.ecr.&lt;YOUR_REGION&gt;.amazonaws.com/lambda-practice:latest
</code></pre>
<p>This command tells Docker to upload (or “push”) your image to the repository you created earlier.</p>
<ul>
<li><p>Make sure the repository URI and tag match what you used in the build command.</p>
</li>
<li><p>Remember, if you use a different region than the one in your repository URI, the push will fail because AWS won’t recognize the repository endpoint.</p>
</li>
</ul>
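<p>Once the push completes, you can confirm the image actually landed in the repository. A quick check using the same CLI profile we configured earlier:</p>
<pre><code class="lang-bash">aws ecr describe-images \
    --repository-name lambda-practice \
    --region &lt;YOUR_REGION&gt; \
    --profile lambda
</code></pre>
<p>The output should list one image whose <code>imageTags</code> array contains <code>latest</code>.</p>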
<h2 id="heading-how-to-deploy-the-application-container-to-aws-lambda-from-the-image-on-aws-ecr">How to Deploy the Application Container to AWS Lambda from the Image on AWS ECR 🚀</h2>
<p>You can deploy your function on AWS Lambda in several ways, each catering to different use cases. Here’s a quick rundown:</p>
<ol>
<li><p><strong>ZIP file upload:</strong> Simply compress your code and dependencies into a ZIP file, then upload it directly via the AWS Lambda console. This traditional method is great for small codebases that don’t require custom runtimes.</p>
</li>
<li><p><strong>Direct editing in the console:</strong> Write or edit your function code directly in the AWS Lambda code editor. Handy for quick tweaks, but not ideal for larger projects.</p>
</li>
<li><p><strong>Container image:</strong> Package your application as a Docker container image and deploy it. This approach is particularly useful if you have complex dependencies, need a custom runtime, or want consistent environments across development and production.</p>
</li>
</ol>
<p>In this tutorial, we’re taking the container image route because it offers flexibility, consistency, and scalability – all while letting us reuse our existing Docker configuration. Let’s walk through the steps for deploying your containerized application to AWS Lambda:</p>
<h3 id="heading-step-1-access-the-aws-lambda-console">Step 1: Access the AWS Lambda Console</h3>
<p>Log into your AWS Management Console. In the search bar at the top, type "Lambda" and select the AWS Lambda service from the dropdown results.</p>
<h3 id="heading-step-2-create-a-new-lambda-function">Step 2: Create a New Lambda Function</h3>
<p>Once on the Lambda page, click the "Create function" button. You’ll see multiple function creation options. For our purposes, select the "Container image" option. This choice tells AWS that you’ll be deploying a containerized application instead of uploading a ZIP file.</p>
<h3 id="heading-step-3-name-your-function">Step 3: Name Your Function</h3>
<p>In the function setup screen, enter <code>lambda-practice</code> as the name of your new Lambda function. This name identifies your function in AWS.</p>
<h3 id="heading-step-4-configure-the-container-image">Step 4: Configure the Container Image</h3>
<p>Under the “Container image” settings, click the "Browse images" button. A new window should appear, listing your available images from AWS Elastic Container Registry (ECR).</p>
<p>Select the repository you previously created (for instance, the one named <code>lambda-practice</code>), and pick the image tagged as <code>latest</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744655907615/df0e3576-5fe6-43a7-8da5-d2964b36a2af.png" alt="Create AWS Lambda function" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744655978526/fafd6b35-579a-4439-b15e-dd5e3dba2acf.png" alt="Connect AWS ECR image to AWS lambda Function" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744656031049/3de3bcc1-2034-4518-acb6-84adb6136752.png" alt="Select Image from AWS ECR repository" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-5-finalize-and-create">Step 5: Finalize and Create</h3>
<p>Now you’ll want to review the basic settings. In this step, you might also configure additional options such as memory allocation, timeout limits, and environment variables, depending on your application needs.</p>
<p>Once everything is set, click "Create function" to finalize the deployment.</p>
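<p>For reference, the same deployment can be done from the CLI. This is a hedged sketch – it assumes you’ve already created a Lambda execution role and have its ARN at hand (the console creates this role for you automatically):</p>
<pre><code class="lang-bash">aws lambda create-function \
    --function-name lambda-practice \
    --package-type Image \
    --code ImageUri=&lt;YOUR_AWS_ACCOUNT_ID&gt;.dkr.ecr.&lt;YOUR_REGION&gt;.amazonaws.com/lambda-practice:latest \
    --role &lt;YOUR_EXECUTION_ROLE_ARN&gt; \
    --region &lt;YOUR_REGION&gt; \
    --profile lambda
</code></pre>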
<h3 id="heading-how-to-enable-access-to-your-lambda-function">How to Enable Access to Your Lambda Function</h3>
<p>Awesome – hurray, you’ve successfully deployed your image from AWS ECR to AWS Lambda! Now the next step is to make sure your function is up and running and can be triggered properly. But you might be wondering, “How do I actually access my Lambda function to see if it’s working?” Let's break it down:</p>
<h4 id="heading-understanding-lambda-function-triggers">Understanding Lambda Function Triggers</h4>
<p>There are several ways to invoke a Lambda function, and AWS supports multiple trigger options. Here are a few:</p>
<ul>
<li><p><strong>Event Source Mapping:</strong> Automatically triggers your function in response to changes in services like DynamoDB, Kinesis, or S3.</p>
</li>
<li><p><strong>Scheduled Events:</strong> Set up cron-like scheduled invocations via Amazon EventBridge (formerly CloudWatch Events).</p>
</li>
<li><p><strong>API Gateway:</strong> Create RESTful APIs that call your function.</p>
</li>
<li><p><strong>AWS SDK/CLI:</strong> Directly invoke the function using the AWS SDK or CLI commands.</p>
</li>
<li><p><strong>Function URLs:</strong> A simple way to expose your function over HTTPS, giving you a public URL that users or applications can call directly.</p>
</li>
</ul>
<p>In this tutorial, we’re going to use a Function URL to trigger our Lambda function via an HTTP event. This method allows you to invoke your function from the public internet and is perfect for testing or building public-facing APIs.</p>
<h3 id="heading-how-to-create-a-function-url-for-your-lambda-function">How to Create a Function URL for Your Lambda Function</h3>
<p>Now that you're on your Lambda function's details page, here’s how to create a Function URL step-by-step:</p>
<p>First, on your Lambda function’s page, click the "Configuration" tab at the top. Within the Configuration section, find and select the "Function URL" sub-tab. This is where you manage the public URL for your function.</p>
<p>Click on the "Create Function URL" button. This will open a new configuration screen for setting up your Function URL.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744656877335/835422c5-8c88-418a-b1f2-3650360069c3.png" alt="Create Function URL for AWS Lambda Function" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<ul>
<li><p><strong>Authentication type:</strong> Set the Auth type to NONE. This setting allows public, unauthenticated access to your function from the internet, which means anyone with the URL can invoke it. (This is great for testing or building public services, but be cautious with security in production environments!)</p>
</li>
<li><p><strong>Additional settings:</strong> Under the Additional Settings section, enable Configure cross-origin resource sharing (CORS). This is useful if you plan to call your function from client-side applications hosted on different domains. Think of it as opening a window for your app to communicate with other web pages or services.</p>
</li>
</ul>
<p>After configuring your settings, click the appropriate button to create or save the Function URL.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744656860868/cd98ce34-7fdf-4cb6-be85-a25d3718e2e6.png" alt="Configure AWS Function URL for AWS Lambda Function" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-verify-your-function-url">Verify Your Function URL</h4>
<p>Once configured, you’ll see the Function URL displayed on the same page. You can now copy this URL.</p>
<p>Paste the URL into a browser or use tools like <code>curl</code> or Postman to send an HTTP request, triggering your Lambda function and verifying that it works as expected.</p>
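<p>For example, with <code>curl</code> (the URL shape below is the format Lambda generates – substitute the actual Function URL you copied):</p>
<pre><code class="lang-bash">curl https://&lt;url-id&gt;.lambda-url.&lt;YOUR_REGION&gt;.on.aws/
</code></pre>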
<p>You should get a response just like this on your browser:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744656939019/fcda2621-8057-438b-8d5a-8ac8936b6322.png" alt="Deployed application on AWS Lambda" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>And that’s it! You’ve successfully set up a public HTTP endpoint that triggers your AWS Lambda function. Whether you're testing your deployment or building a public-facing API, the Function URL makes it easy for anyone to interact with your function.</p>
<h3 id="heading-congrats-you-did-it"><strong>Congrats — You did it!</strong></h3>
<p>You've just walked through the entire journey of deploying a Node.js web server, containerized with Docker, all the way to AWS Lambda using AWS ECR as your image repository. 🚀</p>
<p>From writing and containerizing your Node.js application, creating an AWS ECR repository, setting up IAM users and access keys, pushing your Docker image to ECR, to deploying it on Lambda – you’ve covered it all like a pro. 💪</p>
<p>Not only that, but you also configured a public-facing Function URL so your serverless app can now handle requests from anywhere in the world 🌍.</p>
<p>You’ve just combined modern cloud-native workflows with serverless deployment – giving you flexibility, scalability, and lightning-fast response times without the headache of managing servers 😁.</p>
<p>👏 Give yourself a pat on the back. You’ve officially containerized and deployed your Node.js web server to AWS Lambda!</p>
<h2 id="heading-advantages-of-adopting-the-serverless-model-in-businesses">Advantages of Adopting the Serverless Model in Businesses 💼</h2>
<p>When it comes to deploying applications in the cloud, the serverless model has truly flipped the old playbook and helped businesses save on cloud costs. Let’s break it down in simple, real-world terms.</p>
<h3 id="heading-cost-efficiency"><strong>Cost-Efficiency 💰</strong></h3>
<p>For most businesses – especially startups – serverless offers a major financial advantage. Here’s why:</p>
<p>In traditional models like IaaS (Infrastructure as a Service) and PaaS (Platform as a Service), such as using AWS EC2 or AWS Elastic Beanstalk, you provision resources upfront.</p>
<p>For example: You spin up a server with 4 GB RAM and 4 vCPUs, and AWS charges you $100/month (this covers 730 hours – the whole month). Even if your app barely does anything – say it only serves real requests for 120 hours, and uses just 1 GB of memory – you still pay the full $100, because the resources were reserved and waiting for traffic 24/7.</p>
<p>But with Serverless:</p>
<ul>
<li><p>You don’t pre-allocate or reserve compute power.</p>
</li>
<li><p>Your application only runs when someone actually needs it (for example, when a user makes an HTTP request).</p>
</li>
<li><p>You only pay for the actual execution time and the resources used.</p>
</li>
</ul>
<p>For instance, if your function only runs for 50 hours in a month and uses 1.5 GB RAM, you might pay something like $30, compared to the flat $100 you'd have paid on EC2 or Elastic Beanstalk.</p>
<h3 id="heading-scalability-without-stress"><strong>Scalability Without Stress 📈</strong></h3>
<p>Serverless platforms like AWS Lambda automatically handle:</p>
<ul>
<li><p>Scaling up during high demand.</p>
</li>
<li><p>Scaling down to zero when idle.</p>
</li>
</ul>
<p>This means your team won’t need to predict or provision for resources during traffic surges. Whether 1 or 1 million users visit your app, the cloud provider handles the rest.</p>
<h3 id="heading-simplified-operations"><strong>Simplified Operations ⚙️</strong></h3>
<p>For your software team:</p>
<ul>
<li><p>No more babysitting servers, patching security updates, or worrying about load balancers.</p>
</li>
<li><p>You focus purely on writing the business logic and shipping code.</p>
</li>
<li><p>The cloud provider handles the infrastructure behind the scenes.</p>
</li>
</ul>
<p>This frees up your team’s time, cuts maintenance tasks, and speeds up development times.</p>
<h3 id="heading-better-return-on-investment-roi"><strong>Better Return on Investment (ROI) 📊</strong></h3>
<p>Because you only pay for what you use, the cost-to-value ratio improves significantly. Startups and businesses can:</p>
<ul>
<li><p>Launch faster.</p>
</li>
<li><p>Experiment without financial risk.</p>
</li>
<li><p>Scale without surprise bills.</p>
</li>
<li><p>Avoid overpaying for idle resources.</p>
</li>
</ul>
<h2 id="heading-disadvantages-of-the-serverless-model">Disadvantages of the Serverless Model 🚫</h2>
<p>As exciting and cost-friendly as the serverless model seems, the golden rule in tech still applies:<br>every solution comes with trade-offs.</p>
<p>Let’s walk through a few important downsides you should consider:</p>
<h3 id="heading-no-built-in-support-for-background-jobs"><strong>No Built-in Support for Background Jobs ⏰</strong></h3>
<p>Unlike traditional servers where you can run background processes – like sending out newsletters at midnight or cleaning up databases at scheduled times – serverless platforms such as AWS Lambda don’t natively support background tasks or recurring jobs.</p>
<p>For example, let’s say you wanted your app to automatically generate reports every day at 3 AM. In a typical server setup, you’d just write a cron job and call it a day.</p>
<p>But with Lambda or serverless, you can’t do this directly inside your deployed function. Instead, you need external tools like:</p>
<ul>
<li><p>AWS EventBridge (for scheduling and triggering Lambda functions)</p>
</li>
<li><p>Or other cloud-native schedulers.</p>
</li>
</ul>
<p>This adds a bit of extra setup, management, and sometimes extra cost.</p>
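<p>As a rough sketch, the 3 AM report example above could be wired up with two EventBridge CLI commands. The rule name here is arbitrary, the function ARN is a placeholder, and the cron expression runs in UTC:</p>
<pre><code class="lang-bash"># Fire every day at 03:00 UTC
aws events put-rule \
    --name daily-report \
    --schedule-expression "cron(0 3 * * ? *)"

# Point the rule at your Lambda function
aws events put-targets \
    --rule daily-report \
    --targets "Id"="1","Arn"="&lt;YOUR_LAMBDA_FUNCTION_ARN&gt;"
</code></pre>
<p>You’d also need to grant EventBridge permission to invoke the function (via <code>aws lambda add-permission</code>) – exactly the kind of extra wiring this section is describing.</p>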
<h3 id="heading-unpredictable-cloud-costs"><strong>Unpredictable Cloud Costs 💸</strong></h3>
<p>One of the biggest selling points of serverless is “pay-as-you-use” – but this can also become a financial blind spot, because:</p>
<ul>
<li><p>Costs depend on traffic volume and resource usage.</p>
</li>
<li><p>If your app suddenly goes viral or experiences a traffic spike, your cloud bill could skyrocket without warning.</p>
</li>
</ul>
<p>For example, an app that runs stable at $30/month for low traffic could unexpectedly hit $1000+ if a marketing campaign or external event drives huge numbers of users to your service. While this means your app is succeeding, your budget might take a hit.</p>
<p>In contrast, with traditional models like AWS EC2 or Elastic Beanstalk, your costs are usually predictable – even if your server sits idle all month.</p>
<h2 id="heading-when-to-adopt-the-serverless-model">When to Adopt the Serverless Model 🤔</h2>
<p>So, is Serverless always the right choice? Not necessarily!</p>
<p>If you expect:</p>
<ul>
<li><p><strong>Steady, predictable workloads,</strong> EC2 or Elastic Beanstalk might offer more cost certainty.</p>
</li>
<li><p><strong>Long-running background tasks</strong>, serverless isn’t ideal without extra services.</p>
</li>
<li><p><strong>Real-time control over resource limits</strong>, traditional servers give you more flexibility.</p>
</li>
</ul>
<p>But if your app has burst traffic (users come and go), event-driven logic (like APIs or webhooks), or you want minimal ops overhead, then Serverless can save time, effort, and money.</p>
<h3 id="heading-when-serverless-is-the-perfect-fit-a-startup-building-an-event-driven-api"><strong>When Serverless is the Perfect Fit: A Startup Building an Event-Driven API</strong></h3>
<p>Imagine you’re running a small tech startup that just launched an app for booking fitness classes. Your team is small, budgets are tight, and traffic is unpredictable – some days you have 50 users, some days 5,000.</p>
<p>In this case:</p>
<ul>
<li><p>Your backend mostly handles HTTP requests: new sign-ups, class bookings, cancellations, and payments.</p>
</li>
<li><p>Traffic spikes during lunch breaks and weekends, but is quiet at night.</p>
</li>
<li><p>You don’t want to hire a full-time DevOps engineer just to manage servers.</p>
</li>
</ul>
<p>👉 <strong>Why Serverless is perfect in this case:</strong></p>
<ul>
<li><p>You only pay when people use your app.</p>
</li>
<li><p>No need to manage or provision servers.</p>
</li>
<li><p>AWS Lambda auto-scales based on demand.</p>
</li>
<li><p>Fast to deploy, easy to connect to other AWS services (like DynamoDB for your database, S3 for images, and SES for emails).</p>
</li>
</ul>
<p>By using Serverless in this case, you can save money, scale automatically, and stay laser-focused on features – not infrastructure.</p>
<h3 id="heading-when-serverless-is-not-a-good-fit-a-video-streaming-platform"><strong>When Serverless is Not a Good Fit: A Video Streaming Platform</strong></h3>
<p>Now imagine you’re building the next YouTube-like service for a niche audience – say, education-based content for universities.</p>
<p>In this case:</p>
<ul>
<li><p>Your platform requires continuous background processing: encoding videos, generating thumbnails, and pushing them to CDN.</p>
</li>
<li><p>Users stream content 24/7, meaning your app is always under load.</p>
</li>
<li><p>Background jobs like recommendation engine updates or nightly reports need to run frequently.</p>
</li>
</ul>
<p>👉 <strong>Why Serverless might be a bad idea:</strong></p>
<ul>
<li><p>Functions like AWS Lambda have a timeout limit (for example 15 minutes max per execution).</p>
</li>
<li><p>Continuous processing or streaming doesn’t fit the on-demand, short-lived nature of serverless.</p>
</li>
<li><p>Costs could skyrocket since the app runs almost all the time, making it more expensive than a dedicated EC2 or Kubernetes cluster.</p>
</li>
</ul>
<p><strong>Better alternative:</strong><br>For this kind of use case, a traditional server-based setup – like EC2 or container orchestration via ECS or Kubernetes – would offer more control, predictable pricing, and support for long-running processes.</p>
<p>✅ <strong>Bottom line:</strong><br>Serverless is fantastic for modern apps, but like any tool, it’s best used when its strengths match your project’s needs.</p>
<h2 id="heading-conclusion">Conclusion 📝</h2>
<p>Congratulations on making it to the end of this tutorial! 🚀</p>
<p>In this article, we explored the power of serverless computing by walking step-by-step through the process of deploying a Node.js web server using Docker and AWS Lambda.</p>
<p>From building your container image, pushing it to AWS ECR, and finally deploying it on Lambda – you’ve now seen how easy it is to get an app running without the hassle of provisioning servers.</p>
<p>We also discussed the advantages of adopting the serverless model for deploying your applications, its disadvantages, and real-world use cases in which you should (or shouldn’t) adopt the serverless approach.</p>
<h2 id="heading-about-the-author"><strong>About the Author 👨‍💻</strong></h2>
<p>Hi, I’m Prince! I’m a DevOps engineer and Cloud architect passionate about building, deploying, and managing scalable applications and sharing knowledge with the tech community.</p>
<p>If you enjoyed this article, you can learn more about me by exploring more of my blogs and projects on my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/">LinkedIn profile</a>. You can find my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/details/publications/">LinkedIn articles here</a>. You can also <a target="_blank" href="https://prince-onuk.vercel.app/achievements#articles">visit my website</a> to read more of my articles as well. Let’s connect and grow together! 😊</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The CI/CD Handbook: Learn Continuous Integration and Delivery with GitHub Actions, Docker, and Google Cloud Run ]]>
                </title>
                <description>
                    <![CDATA[ Hey everyone! 🌟 If you’re in the tech space, chances are you’ve come across terms like Continuous Integration (CI), Continuous Delivery (CD), and Continuous Deployment. You’ve probably also heard about automation pipelines, staging environments, pro... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/learn-continuous-integration-delivery-and-deployment/</link>
                <guid isPermaLink="false">6751d2f856661d3d5a501466</guid>
                
                    <category>
                        <![CDATA[ Continuous Integration ]]>
                    </category>
                
                    <category>
                        <![CDATA[ continuous delivery ]]>
                    </category>
                
                    <category>
                        <![CDATA[ continuous deployment ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GitHub Actions ]]>
                    </category>
                
                    <category>
                        <![CDATA[ CI/CD ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Prince Onukwili ]]>
                </dc:creator>
                <pubDate>Thu, 05 Dec 2024 16:21:12 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734119999570/cfbf3375-1e95-41df-b5b0-8fbb8b827f59.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Hey everyone! 🌟 If you’re in the tech space, chances are you’ve come across terms like <strong>Continuous Integration (CI)</strong>, <strong>Continuous Delivery (CD)</strong>, and <strong>Continuous Deployment</strong>. You’ve probably also heard about automation pipelines, staging environments, production environments, and concepts like testing workflows.</p>
<p>These terms might seem complex or interchangeable at first glance, leaving you wondering: What do they actually mean? How do they differ from one another? 🤔</p>
<p>In this handbook, I’ll break down these concepts in a clear and approachable way, drawing on relatable analogies to make each term easier to understand. 🧠💡 Beyond just theory, we’ll dive into a hands-on tutorial where you’ll learn how to set up a CI/CD workflow step by step.</p>
<p>Together, we’ll:</p>
<ul>
<li><p>Set up a Node.js project. ✨</p>
</li>
<li><p>Implement automated tests using Jest and Supertest. 🛠️</p>
</li>
<li><p>Set up a CI/CD workflow using GitHub Actions, triggered on pushes and pull requests, or after a new release. ⚙️</p>
</li>
<li><p>Build and publish a Docker image of your application to Docker Hub. 📦</p>
</li>
<li><p>Deploy your application to a staging environment for testing. 🚀</p>
</li>
<li><p>Finally, roll it out to a production environment, making it live! 🌐</p>
</li>
</ul>
<p>By the end of this guide, not only will you understand the difference between CI/CD concepts, but you’ll also have practical experience in building your own automated pipeline. 😃</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-continuous-integration-deployment-and-delivery"><strong>What is Continuous Integration, Deployment, and Delivery?</strong></a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-differences-between-continuous-integration-continuous-delivery-and-continuous-deployment"><strong>Differences Between Continuous Integration, Continuous Delivery, and Continuous Deployment</strong></a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-a-nodejs-project-with-a-web-server-and-automated-tests"><strong>How to Set Up a Node.js Project with a Web Server and Automated Tests</strong></a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-a-github-repository-to-host-your-codebase"><strong>How to Create a GitHub Repository to Host Your Codebase</strong></a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-the-ci-and-cd-workflows-within-your-project"><strong>How to Set Up the CI and CD Workflows Within Your Project</strong></a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-set-up-a-docker-hub-repository-for-the-projects-image-and-generate-an-access-token-for-publishing-the-image"><strong>Set Up a Docker Hub Repository for the Project's Image and Generate an Access Token for Publishing the Image</strong></a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-create-a-google-cloud-account-project-and-billing-account"><strong>Create a Google Cloud Account, Project, and Billing Account</strong></a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-create-a-google-cloud-service-account-to-enable-deployment-of-the-nodejs-application-to-google-cloud-run-via-the-cd-pipeline"><strong>Create a Google Cloud Service Account to Enable Deployment of the Node.js Application to Google Cloud Run via the CD Pipeline</strong></a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-create-the-staging-branch-and-merge-the-feature-branch-into-it-continuous-integration-and-continuous-delivery"><strong>Create the Staging Branch and Merge the Feature Branch into It (Continuous Integration and Continuous Delivery)</strong></a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-merge-the-staging-branch-into-the-main-branch-continuous-integration-and-continuous-deployment"><strong>Merge the Staging Branch into the Main Branch (Continuous Integration and Continuous Deployment)</strong></a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion"><strong>Conclusion</strong></a></p>
</li>
</ol>
<h2 id="heading-what-is-continuous-integration-deployment-and-delivery"><strong>What is Continuous Integration, Deployment, and Delivery?</strong> 🤔</h2>
<h3 id="heading-continuous-integration-ci"><strong>Continuous Integration (CI)</strong></h3>
<p>Imagine you’re part of a team of six developers, all working on the same project. Without a proper system, chaos would ensue.</p>
<p>Let’s say Mr. A is building a new login feature, Mrs. B is fixing a bug in the search bar, and Mr. C is tweaking the dashboard UI—all at the same time. If everyone is editing the same "folder" or codebase directly, things could go horribly wrong: <em>"Hey! Who just broke the app?!"</em> 😱</p>
<p>To keep everything in order, teams use <strong>Version Control Systems (VCS)</strong> like GitHub, GitLab, or Bitbucket. Think of it as a digital workspace where everyone can safely collaborate without stepping on each other’s toes. 🗂️✨</p>
<p>Here’s how Continuous Integration fits into this process step-by-step:</p>
<h4 id="heading-1-the-main-branch-the-general-folder">1. <strong>The Main Branch: The General Folder</strong> ✨</h4>
<p>At the heart of every project is the <strong>main branch</strong>—the ultimate source of truth. It contains the stable codebase that powers your live app. It’s where every team member contributes their work, but with one important rule: only tested and approved code gets merged here. 🚀</p>
<h4 id="heading-2-feature-branches-personal-workspaces">2. <strong>Feature Branches: Personal Workspaces</strong> 🔨</h4>
<p>When someone like Mr. A wants to work on a new feature, they create a <strong>feature branch</strong>. This branch is essentially a personal copy of the main branch where they can tinker, write code, and test without affecting others. Mrs. B and Mr. C are also working on their own branches. Everyone’s experiments stay neatly organized. 🧪💡</p>
<h4 id="heading-3-merging-changes-the-ci-workflow">3. <strong>Merging Changes: The CI Workflow</strong> 🎉</h4>
<p>When Mr. A is satisfied with his feature, he doesn’t just shove it into the main branch—CI ensures it’s done safely:</p>
<ul>
<li><p><strong>Automated Tests</strong>: Before merging, CI tools automatically run tests on Mr. A’s code to check for bugs or errors. Think of it as a bouncer guarding the main branch, ensuring no bad code gets in. 🕵️‍♂️</p>
</li>
<li><p><strong>Build Verification</strong>: The feature branch code is also "built" (converted into a deployable version of the app) to confirm it works as intended.</p>
</li>
</ul>
<p>Once these checks pass, Mr. A’s feature branch is merged into the main branch. This frequent merging of changes is what we call <strong>Continuous Integration</strong>.</p>
<h3 id="heading-continuous-delivery-cd">Continuous Delivery (CD)</h3>
<p>Continuous Delivery (CD) often gets mixed up with Continuous Deployment, and while they share similarities, they serve distinct purposes in the development lifecycle. Let’s break it down! 🧐</p>
<h4 id="heading-the-need-for-a-staging-area">The Need for a <code>Staging</code> Area 🌉</h4>
<p>In the Continuous Integration (CI) process we discussed above, we primarily dealt with <strong>feature branches</strong> and the <strong>main branch</strong>. But directly merging changes from feature branches into the main branch (which powers the live product) can be risky. Why? 🛑</p>
<p>While automated tests and builds catch many errors, they’re not foolproof. Some edge cases or bugs might slip through unnoticed. This is where the <strong>staging branch</strong> and <strong>staging environment</strong> come into play! 🎭</p>
<p>Think of the staging branch as a “trial run.” Before unleashing changes to real customers, the codebase from feature branches is merged into the staging branch and deployed to a <strong>staging environment</strong>. This environment is an exact replica of the production environment, but it’s used exclusively by the <strong>Quality Assurance (QA) team</strong> for testing.</p>
<p>The QA team takes the role of a “test driver,” running the platform through its paces just as a real user would. They check for usability issues, edge cases, or bugs that automated tests might miss, and provide feedback to developers for fixes. 🚦 If everything passes, the codebase is cleared for deployment to production.</p>
<h4 id="heading-continuous-delivery-in-action">Continuous Delivery in Action 📦</h4>
<p>The process of merging changes into the staging branch and deploying them to the <strong>staging environment</strong> is what we call <strong>Continuous Delivery</strong>. 🛠️ It ensures that the application is always in a deployable state, ready for the next step in the pipeline.</p>
<p>Unlike Continuous Deployment (which we’ll discuss later), Continuous Delivery doesn’t automatically push changes to production (live platform). Instead, it pauses to let humans—namely the QA team or stakeholders—decide when to proceed. This adds an extra layer of quality assurance, reducing the chances of errors making it to the live product. 🕵️‍♂️</p>
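<p>As an aside, GitHub Actions can model this human pause with a protected <em>environment</em>. Here is a sketch, assuming an environment named <code>staging</code> has been configured with required reviewers under the repository’s Settings → Environments (the job and environment names are illustrative):</p>
<pre><code class="lang-yaml">jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    # If the "staging" environment requires reviewers, the job
    # waits here until a human approves the deployment.
    environment: staging
    steps:
      - name: Deploy to staging
        run: echo "Deployment commands go here"
</code></pre>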
<h3 id="heading-continuous-deployment-cd">Continuous Deployment (CD)</h3>
<p>Continuous Deployment (CD) takes automation to its peak. While it shares similarities with Continuous Delivery, the key difference lies in the <strong>final step</strong>: there’s no manual approval required. In Continuous Delivery, a human (the QA testers or the team lead, for example) signs off on the final step of merging the codebase and deploying it live for end users; in Continuous Deployment, that step happens automatically.</p>
<p>Let’s explore what makes Continuous Deployment so powerful (and a little scary)! 😅</p>
<h4 id="heading-the-last-mile-of-the-cicd-pipeline">The Last Mile of the CI/CD Pipeline 🛣️</h4>
<p>Imagine you’ve gone through the rigorous process of Continuous Integration: your teammates have merged their feature branches, automated tests have passed, and the codebase has been deployed to the staging environment during Continuous Delivery.</p>
<p>Now, you’re confident that the application is free of bugs and ready to shine in the production environment—the live version of your platform used by real customers.</p>
<p>In <strong>Continuous Deployment</strong>, this final step of deploying changes to the live environment happens <strong>automatically</strong>. The pipeline triggers whenever specific events occur, such as:</p>
<ul>
<li><p>A <strong>Pull Request (PR)</strong> is merged into the <strong>main branch</strong>.</p>
</li>
<li><p>A new <strong>release version</strong> is created.</p>
</li>
<li><p>A <strong>commit</strong> is pushed directly to the production branch (though this is rare for most teams).</p>
</li>
</ul>
<p>Once triggered, the pipeline springs into action, building, testing, and finally deploying the updated codebase to the production environment. 📡</p>
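<p>Expressed as a workflow trigger, the events above map onto an <code>on</code> block roughly like this (a sketch only; the exact triggers your team uses will vary):</p>
<pre><code class="lang-yaml">on:
  push:
    branches:
      - main            # fires when a PR is merged into main (or a direct push)
  release:
    types: [published]  # fires when a new release version is created
</code></pre>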
<h2 id="heading-differences-between-continuous-integration-continuous-delivery-and-continuous-deployment"><strong>Differences Between Continuous Integration, Continuous Delivery, and Continuous Deployment</strong> 🔍</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Aspect</td><td>Continuous Integration (CI)</td><td>Continuous Delivery (CD)</td><td>Continuous Deployment (CD)</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Primary Focus</strong></td><td>Merging feature branches into the main/general codebase or into the staging codebase.</td><td>Deploying the tested code to a staging environment for QA testing and approval.</td><td>Automatically deploying the code to the live production environment.</td></tr>
<tr>
<td><strong>Automation Level</strong></td><td>Automates testing and building processes for feature branches.</td><td>Automates deployment to staging/test environments after successful testing.</td><td>Fully automates the deployment to production with no manual approval.</td></tr>
<tr>
<td><strong>Testing Scope</strong></td><td>Automated tests run on feature branches to ensure code quality before merging into the main or staging branch.</td><td>Includes automated tests before deployment to staging and allows QA testers to perform manual testing in a controlled environment.</td><td>May include automated tests as a final check, ensuring the production environment is stable before deployment.</td></tr>
<tr>
<td><strong>Branch Involved</strong></td><td>Feature branches merging into the main/general or staging branch.</td><td>Staging branch used as an intermediate step before merging into the main branch.</td><td>Main/general branch deployed directly to production.</td></tr>
<tr>
<td><strong>Environment Target</strong></td><td>Ensures integration and testing within a local environment or build pipeline.</td><td>Deploys to staging/test environments where QA testers validate features.</td><td>Deploys to production/live environment accessed by end users.</td></tr>
<tr>
<td><strong>Key Goal</strong></td><td>Prevent integration conflicts and ensure new changes don’t break the existing codebase.</td><td>Provide a stable, near-production environment for thorough QA testing before final deployment.</td><td>Ensure that new features and updates reach users as soon as possible with minimal delays.</td></tr>
<tr>
<td><strong>Approval Process</strong></td><td>No approval needed. Feature branches are tested and merged upon passing criteria.</td><td>QA team or lead provides feedback/approval before changes are merged into the main branch for production.</td><td>No manual approval. Deployment is entirely automated.</td></tr>
<tr>
<td><strong>Example Trigger</strong></td><td>A developer merges a feature branch into the main branch.</td><td>The staging branch passes automated tests (during PR) and is ready for deployment to the testing environment.</td><td>A new release is created or a pull request is merged into the main branch, triggering an automatic production deployment.</td></tr>
</tbody>
</table>
</div><p>Now that we’ve untangled the mysteries of Continuous Integration, Continuous Delivery, and Continuous Deployment, it’s time to roll up our sleeves and put theory into practice 😁.</p>
<h2 id="heading-how-to-set-up-a-nodejs-project-with-a-web-server-and-automated-tests"><strong>How to Set Up a Node.js Project with a Web Server and Automated Tests</strong> ✨</h2>
<p>In this hands-on section, we’ll build a Node.js web server with automated tests using Jest. From there, we’ll create a CI/CD pipeline with GitHub Actions that automates testing for every <strong>pull request to the staging and main branches</strong>. Finally, we’ll publish an image of our application to Docker Hub and deploy the image to <strong>Google Cloud Run</strong>, first to a staging environment for testing and later to the production environment for live use.</p>
<p>Ready to bring your project to life? Let’s get started! 🚀✨</p>
<h3 id="heading-step-1-install-nodejs">Step 1: Install Node.js 📥</h3>
<p>To get started, you’ll need to have <strong>Node.js</strong> installed on your machine. Node.js provides the JavaScript runtime we’ll use to create our web server.</p>
<ol>
<li><p>Visit <a target="_blank" href="https://nodejs.org/en/download/package-manager">https://nodejs.org/en/download/package-manager</a></p>
</li>
<li><p>Choose your operating system (Windows, macOS, or Linux) and download the installer.</p>
</li>
<li><p>Follow the installation instructions to complete the setup.</p>
</li>
</ol>
<p>To verify that Node.js was installed successfully, open your terminal and run <code>node -v</code>. This should display the installed version of Node.js.</p>
<h3 id="heading-step-2-clone-the-starter-repository">Step 2: Clone the Starter Repository 📂</h3>
<p>The next step is to grab the starter code from GitHub. If you don’t have Git installed, you can download it at <a target="_blank" href="https://git-scm.com/downloads">https://git-scm.com/downloads</a>. Choose your OS and follow the instructions to install Git. Once you’re set, it’s time to clone the repository.</p>
<p>Run the following command in your terminal to clone the boilerplate code:</p>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> --single-branch --branch initial https://github.com/onukwilip/ci-cd-tutorial
</code></pre>
<p>This will download the project files from the <code>initial</code> branch, which contains the starter template for our Node.js web server.</p>
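<p>For orientation, the starter server is conceptually along these lines. This is an illustrative sketch, not the repository’s actual code: the route, message, and function names are made up, and the request handling is factored into a plain function so automated tests (with Jest, for example) can exercise it without a live server:</p>
<pre><code class="lang-javascript">const http = require("http");

// A plain function holding the routing logic, so tests can call it directly.
function handleRequest(url) {
  if (url === "/") return { status: 200, body: "Hello from the CI/CD tutorial!" };
  return { status: 404, body: "Not found" };
}

const server = http.createServer((req, res) => {
  const { status, body } = handleRequest(req.url);
  res.writeHead(status, { "Content-Type": "text/plain" });
  res.end(body);
});

// The tutorial app listens on PORT (5000 by default). unref() lets
// short-lived scripts and tests exit without an explicit server.close().
server.listen(process.env.PORT || 5000).unref();

module.exports = { handleRequest, server };
</code></pre>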
<p>Navigate into the project directory:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> ci-cd-tutorial
</code></pre>
<h3 id="heading-step-3-install-dependencies">Step 3: Install Dependencies 📦</h3>
<p>Once you’re in the project directory, install the required dependencies for the Node.js project. These are the packages that power the application:</p>
<pre><code class="lang-bash">npm install --force
</code></pre>
<p>This will download and set up all the libraries specified in the project. Alright, dependencies installed? You’re one step closer!</p>
<h3 id="heading-step-4-run-automated-tests">Step 4: Run Automated Tests ✅</h3>
<p>Before diving into the code, let’s confirm that the automated tests are functioning correctly. Run:</p>
<pre><code class="lang-bash">npm <span class="hljs-built_in">test</span>
</code></pre>
<p>You should see two successful test results in your terminal. This indicates that the starter project is correctly configured with working automated tests.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733074280408/93b4ea86-1dfa-42eb-a163-b97c19c2a053.png" alt="Successful test run" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-5-start-the-web-server">Step 5: Start the Web Server 🌐</h3>
<p>Finally, let’s start the web server and see it in action. Run the following command:</p>
<pre><code class="lang-bash">npm start
</code></pre>
<p>Wait for the application to start running. Open your browser and visit <a target="_blank" href="http://localhost:5000/">http://localhost:5000</a>. 🎉 You should see the starter web server up and running, ready for your CI/CD magic:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733074667521/7b80bb21-1f43-430e-8a56-2bff8b81ddad.png" alt="Successful project run" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-create-a-github-repository-to-host-your-codebase"><strong>How to Create a GitHub Repository to Host Your Codebase 📂</strong></h2>
<h3 id="heading-step-1-sign-in-to-github">Step 1: Sign In to GitHub</h3>
<ol>
<li><p><strong>Go to GitHub</strong>: Open your browser and visit GitHub - <a target="_blank" href="https://github.com/">https://github.com</a>.</p>
</li>
<li><p><strong>Sign In</strong>: Click on the <strong>Sign In</strong> button in the top-right corner and enter your username and password to log in, OR create an account if you don’t have one by clicking the <strong>Sign up</strong> button.</p>
</li>
</ol>
<h3 id="heading-step-2-create-a-new-repository">Step 2: Create a New Repository</h3>
<p>Once you're signed in, on the main GitHub page, you’ll see a "+" sign in the top-right corner next to your profile picture. Click on it, and select <strong>“New repository”</strong> from the dropdown.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733130465203/dac28dee-74da-4fd4-8a96-bc90aef01207.png" alt="New GitHub repository" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now it’s time to set the repository details. You’ll include:</p>
<ul>
<li><p><strong>Repository Name</strong>: Choose a name for your repository. For example, you can call it <code>ci-cd-tutorial</code>.</p>
</li>
<li><p><strong>Description</strong> (Optional): You can add a short description, like “A tutorial project for CI/CD with Docker and GitHub Actions.”</p>
</li>
<li><p><strong>Visibility</strong>: Choose whether you want your repository to be <strong>public</strong> (accessible by anyone) or <strong>private</strong> (only accessible by you and those you invite). For the sake of this tutorial, make it <strong>public</strong>.</p>
</li>
<li><p><strong>Do Not Check the Add a README File Box</strong>: <strong>Important</strong>: Make sure you <strong>do not check</strong> the option to <strong>Add a README file</strong>. Checking it would automatically create a <code>README.md</code> file in your repository, which could cause conflicts later when you push your local files. We'll add a README manually later if needed.</p>
</li>
</ul>
<p>After filling out the details, click on <strong>“Create repository”</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733130890582/04e09ac8-0ee6-4d26-a9f2-007c0e6ca08f.png" alt="Create GitHub repository" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-3-change-the-remote-destination-and-push-to-your-new-repository">Step 3: Change the Remote Destination and Push to Your New Repository</h3>
<h4 id="heading-update-the-remote-repository-url"><strong>Update the Remote Repository URL</strong>:</h4>
<p>Since you've already cloned the codebase from my repository, you need to update the remote destination to point to your newly created GitHub repository.</p>
<p>Copy your repository URL (the URL of the page you were redirected to after creating the repository). It should look similar to this: <code>https://github.com/&lt;username&gt;/&lt;repo-name&gt;</code>.</p>
<p>Open your terminal in the project directory and run the following commands:</p>
<pre><code class="lang-bash">git remote set-url origin &lt;your-repo-url&gt;
</code></pre>
<p>Replace <code>&lt;your-repo-url&gt;</code> with your GitHub repository URL which you copied earlier.</p>
<h4 id="heading-rename-the-current-branch-to-main"><strong>Rename the Current Branch to</strong> <code>main</code>:</h4>
<p>If your branch is named something other than <code>main</code>, you can rename it to <code>main</code> using:</p>
<pre><code class="lang-bash">git branch -M main
</code></pre>
<h4 id="heading-push-to-your-new-repository"><strong>Push to Your New Repository</strong>:</h4>
<p>Finally, commit any changes you’ve made and push your local repository to the new remote GitHub repository by running:</p>
<pre><code class="lang-bash">git add .
git commit -m <span class="hljs-string">'Created boilerplate'</span>
git push -u origin main
</code></pre>
<p>Now your local codebase is linked to your new GitHub repository, and the files are successfully pushed there. You can verify by visiting your repository on GitHub.</p>
<h2 id="heading-how-to-set-up-the-ci-and-cd-workflows-within-your-project">How to Set Up the CI and CD Workflows Within Your Project ⚙️</h2>
<p>Now it’s time to create the <strong>CI and CD workflows</strong> for our project! These workflows won’t run on your local PC but will be automatically triggered and executed in the cloud once you push your changes to the remote repository. GitHub Actions will detect these workflows and run them based on the triggers you define.</p>
<h3 id="heading-step-1-prepare-the-workflow-directory">Step 1: Prepare the Workflow Directory 📂</h3>
<p>Before adding the CI/CD pipelines, it's a good practice to first create a feature branch. This step mirrors the workflow commonly used in teams, where new features or changes are made in separate branches before they are merged into the main codebase.</p>
<p>To create and switch to a new branch, run the following command:</p>
<pre><code class="lang-bash">git checkout -b feature/ci-cd-pipeline
</code></pre>
<p>This will create a new branch called <code>feature/ci-cd-pipeline</code> and switch to it. Now, you can safely add and test the CI/CD workflows without affecting the main branch.</p>
<p>Once you finish, you’ll be able to merge this feature branch back into <code>main</code> or <code>staging</code> as part of the pull request process.</p>
<p>In the project’s root directory, create a folder named <code>.github</code>. Inside <code>.github</code>, create another folder called <code>workflows</code>.</p>
<p>Any YAML file placed in the <code>.github/workflows</code> directory is automatically recognized as a GitHub Actions workflow. These workflows will execute based on specific triggers, such as pull requests, pushes, or releases.</p>
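<p>On macOS, Linux, or Git Bash on Windows, both folders can be created in one command from the project root:</p>
<pre><code class="lang-bash">mkdir -p .github/workflows
</code></pre>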
<h3 id="heading-step-2-create-the-continuous-integration-workflow">Step 2: Create the Continuous Integration Workflow 🚀</h3>
<p>We’ll now create a CI workflow that automatically tests the application whenever a pull request is made to the <code>main</code> or <code>staging</code> branches.</p>
<p>First, inside the <code>workflows</code> directory, create a file named <code>ci-pipeline.yml</code>.</p>
<p>Paste the following code into the file:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">name:</span> <span class="hljs-string">CI</span> <span class="hljs-string">Pipeline</span> <span class="hljs-string">to</span> <span class="hljs-string">staging/production</span> <span class="hljs-string">environment</span>
<span class="hljs-attr">on:</span>
  <span class="hljs-attr">pull_request:</span>
    <span class="hljs-attr">branches:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">staging</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">main</span>
<span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">test:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">Setup,</span> <span class="hljs-string">test,</span> <span class="hljs-string">and</span> <span class="hljs-string">build</span> <span class="hljs-string">project</span>
    <span class="hljs-attr">env:</span>
      <span class="hljs-attr">PORT:</span> <span class="hljs-number">5001</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v3</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Install</span> <span class="hljs-string">dependencies</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">npm</span> <span class="hljs-string">ci</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Test</span> <span class="hljs-string">application</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">npm</span> <span class="hljs-string">test</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Build</span> <span class="hljs-string">application</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          echo "Run command to build the application if present"
          npm run build --if-present</span>
</code></pre>
<h4 id="heading-explanation-of-the-ci-workflow">Explanation of the CI Workflow</h4>
<p>Here’s a breakdown of each section in the workflow:</p>
<ol>
<li><p><code>name: CI Pipeline to staging/production environment</code>: This is the title of your workflow. It helps you identify this pipeline in GitHub Actions.</p>
</li>
<li><p><code>on</code>: The <code>on</code> parameter determines the events that trigger your workflow. When the workflow YAML file is pushed to the remote GitHub repository, GitHub Actions automatically registers the workflow using the configured triggers in the <code>on</code> field. These triggers act as event listeners that tell GitHub when to execute the workflow.</p>
<p> <strong>For example:</strong></p>
<p> If we set <code>pull_request</code> as the value for the <code>on</code> parameter and specify the branches we want to monitor using the <code>branches</code> key, GitHub sets up event listeners for pull requests to those branches.</p>
<pre><code class="lang-yaml"> <span class="hljs-attr">on:</span>
   <span class="hljs-attr">pull_request:</span>
     <span class="hljs-attr">branches:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-string">main</span>
       <span class="hljs-bullet">-</span> <span class="hljs-string">staging</span>
</code></pre>
<p> This configuration means that GitHub will trigger the workflow whenever a pull request is made to the <code>main</code> or <code>staging</code> branches.</p>
<p> <strong>Multiple Triggers</strong>:<br> You can define multiple event listeners in the <code>on</code> parameter. For instance, in addition to pull requests, you can add a listener for push events.</p>
<pre><code class="lang-yaml"> <span class="hljs-attr">on:</span>
   <span class="hljs-attr">pull_request:</span>
     <span class="hljs-attr">branches:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-string">main</span>
       <span class="hljs-bullet">-</span> <span class="hljs-string">staging</span>
   <span class="hljs-attr">push:</span>
     <span class="hljs-attr">branches:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-string">main</span>
</code></pre>
<p> This configuration ensures that the workflow is triggered when:</p>
<ul>
<li><p>A pull request is made to either the <code>main</code> or <code>staging</code> branch.</p>
</li>
<li><p>A push is made directly to the <code>main</code> branch.</p>
</li>
</ul>
</li>
</ol>
<p>    📘 <strong>Learn more about triggers:</strong> Check out the <a target="_blank" href="https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows">official GitHub documentation here</a>.</p>
<ol start="3">
<li><p><code>jobs</code>: The <code>jobs</code> section outlines the specific tasks (or jobs) that the workflow will execute. Each job is an independent unit of work that runs on a separate virtual machine (VM). This isolation ensures a clean, unique environment for every job, avoiding potential conflicts between tasks.</p>
<p> <strong>Key Points About Jobs:</strong></p>
<ol>
<li><p><strong>Clean VM for Each Job</strong>: When GitHub Actions runs a workflow, it assigns a dedicated VM instance to each job. This means the environment is reset for every job, ensuring there’s no overlap or interference between tasks.</p>
</li>
<li><p><strong>Multiple Jobs</strong>: Workflows can have multiple jobs, each responsible for a specific task. For example:</p>
<ul>
<li><p>A <strong>Test</strong> job to install dependencies and run automated tests.</p>
</li>
<li><p>A <strong>Build</strong> job to compile the application.</p>
</li>
</ul>
</li>
<li><p><strong>Job Organization</strong>: Jobs can be organized to run:</p>
<ul>
<li><p><strong>Sequentially</strong>: Ensures one job completes before the next starts; for example, the Test job must finish before the Build job begins. This sequential flow mimics the "pipeline" structure.</p>
</li>
<li><p><strong>Simultaneously</strong>: Multiple jobs can run in parallel to save time, especially if the jobs are independent of one another.</p>
</li>
</ul>
</li>
<li><p><strong>Single Job in This Workflow</strong>: In our current workflow, there is only one job, <code>test</code>, which:</p>
<ul>
<li><p>Installs dependencies.</p>
</li>
<li><p>Runs automated tests.</p>
</li>
<li><p>Builds the application.</p>
</li>
</ul>
</li>
</ol>
</li>
</ol>
<p>    📘 <strong>Learn more about jobs:</strong> Dive into the <a target="_blank" href="https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/using-jobs-in-a-workflow">GitHub Actions jobs documentation here</a>.</p>
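<p>As a side note, sequential ordering between jobs is expressed with the <code>needs</code> keyword. A sketch with hypothetical job names:</p>
<pre><code class="lang-yaml">jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - run: echo "Run the automated tests"
  build:
    runs-on: ubuntu-latest
    needs: test   # build starts only after the test job succeeds
    steps:
      - run: echo "Build the application"
</code></pre>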
<ol start="4">
<li><p><code>runs-on: ubuntu-latest</code>: Specifies the operating system the job will run on. GitHub provides pre-configured virtual environments, and we’re using the latest Ubuntu image.</p>
</li>
<li><p><code>env</code>: Sets environment variables for the job. Here, we define the <strong>PORT</strong> variable used by our application.</p>
</li>
<li><p><strong>Steps</strong>: Steps define the individual actions to execute within a job:</p>
<ul>
<li><p><code>Checkout</code>: Uses the <code>actions/checkout</code> action to clone the feature branch of the repository into the virtual machine's environment. This step ensures the pipeline has access to the project files.</p>
</li>
<li><p><code>Install dependencies</code>: Runs <code>npm ci</code> to install the required Node.js packages.</p>
</li>
<li><p><code>Test application</code>: Runs the automated tests using the <code>npm test</code> command. This validates the codebase for errors or failing test cases.</p>
</li>
<li><p><code>Build application</code>: Builds the application if a build script is defined in the <code>package.json</code>. The <code>--if-present</code> flag ensures this step doesn’t fail if no build script is present.</p>
</li>
</ul>
</li>
</ol>
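<p>As a sketch, the last two steps above map to lines like these in the workflow file (step names are illustrative):</p>
<pre><code class="lang-yaml">      - name: Test application
        run: npm test

      - name: Build application
        # --if-present skips this step gracefully when package.json
        # does not define a "build" script
        run: npm run build --if-present
</code></pre>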
<p>Now that we’ve completed the CI pipeline, which runs on pull requests to the <code>main</code> or <code>staging</code> branches, let’s move on to setting up the <strong>Continuous Delivery (CD)</strong> and <strong>Continuous Deployment</strong> pipelines. 🚀</p>
<h3 id="heading-step-3-the-continuous-delivery-and-deployment-workflow">Step 3: The Continuous Delivery and Deployment Workflow</h3>
<p><strong>First, create the Pipeline File</strong>:<br>In the <code>.github/workflows</code> folder, create a new file called <code>cd-pipeline.yml</code>. This file will define the workflows for automating delivery and deployment.</p>
<p><strong>Next, paste the configuration</strong>:<br>Copy and paste the following configuration into the <code>cd-pipeline.yml</code> file:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">name:</span> <span class="hljs-string">CD</span> <span class="hljs-string">Pipeline</span> <span class="hljs-string">to</span> <span class="hljs-string">Google</span> <span class="hljs-string">Cloud</span> <span class="hljs-string">Run</span> <span class="hljs-string">(staging</span> <span class="hljs-string">and</span> <span class="hljs-string">production)</span>
<span class="hljs-attr">on:</span>
  <span class="hljs-attr">push:</span>
    <span class="hljs-attr">branches:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">staging</span>
  <span class="hljs-attr">workflow_dispatch:</span> {}
  <span class="hljs-attr">release:</span>
    <span class="hljs-attr">types:</span> <span class="hljs-string">published</span>

<span class="hljs-attr">env:</span>
  <span class="hljs-attr">PORT:</span> <span class="hljs-number">5001</span>
  <span class="hljs-attr">IMAGE:</span> <span class="hljs-string">${{vars.IMAGE}}:${{github.sha}}</span>
<span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">test:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">Setup,</span> <span class="hljs-string">test,</span> <span class="hljs-string">and</span> <span class="hljs-string">build</span> <span class="hljs-string">project</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v3</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Install</span> <span class="hljs-string">dependencies</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">npm</span> <span class="hljs-string">ci</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Test</span> <span class="hljs-string">application</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">npm</span> <span class="hljs-string">test</span>
  <span class="hljs-attr">build:</span>
    <span class="hljs-attr">needs:</span> <span class="hljs-string">test</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">Setup</span> <span class="hljs-string">project,</span> <span class="hljs-string">Authorize</span> <span class="hljs-string">GitHub</span> <span class="hljs-string">Actions</span> <span class="hljs-string">to</span> <span class="hljs-string">GCP</span> <span class="hljs-string">and</span> <span class="hljs-string">Docker</span> <span class="hljs-string">Hub,</span> <span class="hljs-string">and</span> <span class="hljs-string">deploy</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v3</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Authenticate</span> <span class="hljs-string">for</span> <span class="hljs-string">GCP</span>
        <span class="hljs-attr">id:</span> <span class="hljs-string">gcp-auth</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">google-github-actions/auth@v0</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">credentials_json:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.GCP_SERVICE_ACCOUNT</span> <span class="hljs-string">}}</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Set</span> <span class="hljs-string">up</span> <span class="hljs-string">Cloud</span> <span class="hljs-string">SDK</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">google-github-actions/setup-gcloud@v0</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Authenticate</span> <span class="hljs-string">for</span> <span class="hljs-string">Docker</span> <span class="hljs-string">Hub</span>
        <span class="hljs-attr">id:</span> <span class="hljs-string">docker-auth</span>
        <span class="hljs-attr">env:</span>
          <span class="hljs-attr">D_USER:</span> <span class="hljs-string">${{secrets.DOCKER_USER}}</span>
          <span class="hljs-attr">D_PASS:</span> <span class="hljs-string">${{secrets.DOCKER_PASSWORD}}</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          docker login -u $D_USER -p $D_PASS
</span>      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Build</span> <span class="hljs-string">and</span> <span class="hljs-string">tag</span> <span class="hljs-string">Image</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          docker build -t ${{env.IMAGE}} .
</span>      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Push</span> <span class="hljs-string">the</span> <span class="hljs-string">image</span> <span class="hljs-string">to</span> <span class="hljs-string">Docker</span> <span class="hljs-string">hub</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          docker push ${{env.IMAGE}}
</span>      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Enable</span> <span class="hljs-string">the</span> <span class="hljs-string">Billing</span> <span class="hljs-string">API</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          gcloud services enable cloudbilling.googleapis.com --project=${{secrets.GCP_PROJECT_ID}}
</span>      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Deploy</span> <span class="hljs-string">to</span> <span class="hljs-string">GCP</span> <span class="hljs-string">Run</span> <span class="hljs-bullet">-</span> <span class="hljs-string">Production</span> <span class="hljs-string">environment</span> <span class="hljs-string">(If</span> <span class="hljs-string">a</span> <span class="hljs-string">new</span> <span class="hljs-string">release</span> <span class="hljs-string">was</span> <span class="hljs-string">published</span> <span class="hljs-string">from</span> <span class="hljs-string">the</span> <span class="hljs-string">master</span> <span class="hljs-string">branch)</span>
        <span class="hljs-attr">if:</span> <span class="hljs-string">github.event_name</span> <span class="hljs-string">==</span> <span class="hljs-string">'release'</span> <span class="hljs-string">&amp;&amp;</span> <span class="hljs-string">github.event.action</span> <span class="hljs-string">==</span> <span class="hljs-string">'published'</span> <span class="hljs-string">&amp;&amp;</span> <span class="hljs-string">github.event.release.target_commitish</span> <span class="hljs-string">==</span> <span class="hljs-string">'main'</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          gcloud run deploy ${{vars.GCR_PROJECT_NAME}} \
          --region ${{vars.GCR_REGION}} \
          --image ${{env.IMAGE}} \
          --platform "managed" \
          --allow-unauthenticated \
          --tag production
</span>      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Deploy</span> <span class="hljs-string">to</span> <span class="hljs-string">GCP</span> <span class="hljs-string">Run</span> <span class="hljs-bullet">-</span> <span class="hljs-string">Staging</span> <span class="hljs-string">environment</span>
        <span class="hljs-attr">if:</span> <span class="hljs-string">github.ref</span> <span class="hljs-type">!=</span> <span class="hljs-string">'refs/heads/main'</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          echo "Deploying to staging environment"
          # Deploy the service to the staging environment
          gcloud run deploy ${{vars.GCR_STAGING_PROJECT_NAME}} \
          --region ${{vars.GCR_REGION}} \
          --image ${{env.IMAGE}} \
          --platform "managed" \
          --allow-unauthenticated \
          --tag staging</span>
</code></pre>
<p>The <strong>CD pipeline</strong> configuration combines Continuous Delivery and Continuous Deployment workflows into a single file for simplicity. It builds on the concepts of CI/CD we discussed earlier, automating testing, building, and deploying the application to Google Cloud Run.</p>
<h4 id="heading-explanation-of-the-cd-pipeline">Explanation of the CD pipeline:</h4>
<ol>
<li><h4 id="heading-workflow-triggers-on">Workflow Triggers (<code>on</code>)</h4>
</li>
</ol>
<ul>
<li><p><code>push</code>: Workflow triggers on pushes to the <code>staging</code> branch.</p>
</li>
<li><p><code>workflow_dispatch</code>: Enables manual execution of the workflow via the GitHub Actions interface.</p>
</li>
<li><p><code>release</code>: Triggers when a new release is published.<br>  Example: When a release is published from the <code>main</code> branch, the app deploys to the production environment.</p>
</li>
</ul>
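<p>You can publish a release from the GitHub UI (Releases → “Draft a new release”), or, if you have the GitHub CLI installed, from the terminal. The tag name below is illustrative:</p>
<pre><code class="lang-bash"># Publishes release v1.0.0 targeting the main branch,
# which fires the release: published trigger in the workflow
gh release create v1.0.0 --target main --title "v1.0.0" --notes "First production release"
</code></pre>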
<ol start="2">
<li><p><strong>Job 1 – Testing the Codebase:</strong> The first job in the pipeline, Test, ensures the codebase is functional and error-free before proceeding with delivery or deployment.</p>
</li>
<li><p><strong>Job 2 – Building and Deploying the Application:</strong> Aha! Moment ✨: These jobs run sequentially. 😃 The <strong>Build</strong> job begins only after the <strong>Test</strong> job is completed successfully. It prepares the application for deployment and manages the actual deployment process.</p>
<p> Here's what happens:</p>
<ul>
<li><p><strong>Authorization for GCP and Docker Hub</strong>: The workflow authenticates with both Google Cloud Platform (GCP) and Docker Hub. For GCP, it uses the <code>google-github-actions/auth@v0</code> action to handle service account credentials stored as secrets. Similarly, it logs into Docker Hub with stored credentials to enable image uploads.</p>
</li>
<li><p><strong>Build and Push Docker Image</strong>: The application is built into a Docker image and tagged with a unique identifier (<code>${{env.IMAGE}}</code>). This image is then pushed to Docker Hub, making it accessible for deployment.</p>
</li>
<li><p><strong>Deploy to Google Cloud Run</strong>: Based on the event that triggered the workflow, the application is <strong>deployed to either the staging or production environment</strong> in Google Cloud Run. A <strong>push</strong> to the <code>staging</code> branch deploys to the staging environment (Continuous Delivery), while a <strong>release</strong> from the <code>main</code> branch deploys to production (Continuous Deployment).</p>
</li>
</ul>
</li>
</ol>
<p>To ensure the security and flexibility of our pipeline, we rely on external variables and secrets rather than hardcoding sensitive information directly into the workflow file.</p>
<p>Why? Workflow configuration files are part of your repository and accessible to anyone with access to the codebase. If sensitive data, like API keys or passwords, is exposed here, it can be easily compromised. 😨</p>
<p>Instead, we use GitHub’s <strong>Secrets</strong> to securely store and access this information. Secrets allow us to define variables that are encrypted and only accessible by our workflows. For example:</p>
<ul>
<li><p><strong>DockerHub Credentials</strong>: We’ll add a Docker username and access token to the repository’s secrets. These are essential for authenticating with DockerHub to upload the built Docker images.</p>
</li>
<li><p><strong>Google Cloud Service Account Key</strong>: This key will grant the pipeline the necessary permissions to deploy the application on <strong>Google Cloud Run</strong> securely.</p>
</li>
</ul>
<p>We'll set up these variables and secrets incrementally as we proceed, ensuring each step is fully secure and functional. 🎯</p>
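<p>Once stored, secrets and repository variables are referenced in the workflow with the <code>${{ }}</code> expression syntax, as our pipeline already does. A small fragment as a reminder:</p>
<pre><code class="lang-yaml">env:
  D_PASS: ${{ secrets.DOCKER_PASSWORD }}     # encrypted secret, masked in logs
  IMAGE: ${{ vars.IMAGE }}:${{ github.sha }} # plain repository variable
</code></pre>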
<h2 id="heading-set-up-a-docker-hub-repository-for-the-projects-image-and-generate-an-access-token-for-publishing-the-image"><strong>Set Up a Docker Hub Repository for the Project's Image and Generate an Access Token for Publishing the Image</strong> 📦</h2>
<p>Before we dive into the steps, let’s quickly go over what we’re about to do. In this section, you’ll learn how to create a Docker Hub repository, which acts like an online storage space for your application’s container image.</p>
<p>Think of a container image as a snapshot of your application, ready to be deployed anywhere. To ensure smooth and secure access, we’ll also generate a special access token, kind of like a revocable password that our CI/CD pipeline can use to upload your app’s image to Docker Hub. Let’s get started! 🚀</p>
<h3 id="heading-step-1-sign-up-for-docker-hub">Step 1: Sign Up for Docker Hub</h3>
<p>Here are the steps to follow to sign up for Docker Hub:</p>
<ol>
<li><p><strong>Go to the Docker Hub website</strong>: Open your web browser and visit Docker Hub - <a target="_blank" href="https://hub.docker.com/">https://hub.docker.com/</a>.</p>
</li>
<li><p><strong>Create an account</strong>: On the Docker Hub homepage, you’ll see a button labelled <strong>"Sign Up"</strong> in the top-right corner. Click on it.</p>
</li>
<li><p><strong>Fill in your details</strong>: You'll be asked to provide a few details like your username, email address, and password. Choose a strong password that you can remember.</p>
</li>
<li><p><strong>Agree to the terms</strong>: You’ll need to check a box to agree to Docker’s terms of service. After that, click <strong>“Sign Up”</strong> to create your account.</p>
</li>
<li><p><strong>Verify your email</strong>: Docker Hub will send you an email to verify your account. Open that email and click on the verification link to complete your account creation.</p>
</li>
</ol>
<h3 id="heading-step-2-sign-in-to-docker-hub">Step 2: Sign In to Docker Hub</h3>
<p>After verifying your email, go back to Docker Hub, and click on <strong>"Sign In"</strong> at the top right. Then you can use the credentials you just created to log in.</p>
<h3 id="heading-step-3-generate-an-access-token-for-the-cicd-pipeline">Step 3: Generate an Access Token (for the CI/CD pipeline)</h3>
<p>Now that you have an account, you can create an access token. This token will allow your GitHub Actions workflow to securely sign into Docker Hub and upload Docker images.</p>
<p>Once you’re logged into Docker Hub, click on your profile picture (or avatar) in the top right corner. This will open a menu. From the menu, click “Account Settings”.</p>
<p>Then in the left-hand menu of your account settings, scroll to the <strong>"Security"</strong> tab. This section is where you manage your tokens and passwords.</p>
<p>Now you’ll need to create a new access token. In the Security tab, you’ll see a link labelled <strong>“Personal access tokens”</strong> – click on it. Click the button labelled <strong>“Generate new token”</strong>.</p>
<p>You’ll be asked to give your token a description. You can name it something like "GitHub Actions CI/CD" so that you know what it's for.</p>
<p>After giving it a description, click on the <strong>Access permissions</strong> dropdown and select <strong>“Read &amp; Write”</strong> or <strong>“Read, Write, Delete”</strong>. Then click <strong>Generate</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733129374816/c725f041-c0ef-49a0-b8ef-ca62acafc1ee.png" alt="Create Docker access token" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now, you need to copy the credentials. After clicking the generate button, Docker Hub will create an access token. <strong>Immediately copy this token along with your username</strong> and save it somewhere safe, like in a file (don’t worry, we’ll add it to our GitHub secrets). You won’t be able to see this token again, so make sure you save it!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733133363382/33dbf334-a7ec-4151-8639-5368c3ccaedb.png" alt="Copy Docker username + access token" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-4-add-the-token-to-github-as-a-secret">Step 4: Add the Token to GitHub as a Secret</h3>
<p>To do this, open your GitHub repository where the codebase is hosted. In the GitHub repo, click on the <strong>Settings</strong> tab (located near the top of your repo page).</p>
<p>Then on the left sidebar, scroll down and click on <strong>“Secrets and Variables”</strong>, then choose <strong>“Actions”</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733133003023/75c3bd35-1a5b-46fa-845a-0f4fd8305d53.png" alt="Open GitHub Actions Secrets" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Here are the steps to create and manage your new secret:</p>
<ol>
<li><p><strong>Add a new secret</strong>: Click on the <strong>“New repository secret”</strong> button.</p>
</li>
<li><p><strong>Set up the secret</strong>:</p>
<ul>
<li><p>In the <strong>Name</strong> field, type <code>DOCKER_PASSWORD</code>.</p>
</li>
<li><p>In the <strong>Value</strong> field, paste the access token you copied earlier.</p>
</li>
</ul>
</li>
<li><p><strong>Save the secret</strong>: Finally, click <strong>Add secret</strong> to save your Docker access token securely in GitHub.</p>
</li>
</ol>
<p>Then you’ll repeat the process for your Docker username. Create a new secret called <code>DOCKER_USER</code> and add your Docker username that you copied earlier.</p>
<p>And that’s it! Now your CI/CD pipeline can use this token to securely log in to Docker Hub and upload images automatically when triggered. 🎉</p>
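<p>If you’d like to confirm the token works before wiring it into the pipeline, you can test it locally (assuming you have the Docker CLI installed; replace the placeholders with your own values):</p>
<pre><code class="lang-bash"># Reads the token from stdin so it never lands in your shell history
echo "YOUR_ACCESS_TOKEN" | docker login -u YOUR_DOCKER_USERNAME --password-stdin
</code></pre>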
<h3 id="heading-step-5-creating-the-dockerfile-for-the-project"><strong>Step 5: Creating the Dockerfile for the Project</strong></h3>
<p>Before you can build and publish the Docker image to Docker Hub, you need to create a <code>Dockerfile</code> that contains the necessary instructions to build your application.</p>
<p>Follow the steps below to create the <code>Dockerfile</code> in the root folder of your project:</p>
<ol>
<li><p>Navigate to your project’s root folder.</p>
</li>
<li><p>Create a new file named <code>Dockerfile</code>.</p>
</li>
<li><p>Open the <strong>Dockerfile</strong> in a text editor and paste the following content into it:</p>
</li>
</ol>
<pre><code class="lang-dockerfile"><span class="hljs-keyword">FROM</span> node:<span class="hljs-number">18</span>-slim

<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>

<span class="hljs-keyword">COPY</span><span class="bash"> package.json .</span>

<span class="hljs-keyword">RUN</span><span class="bash"> npm install -f</span>

<span class="hljs-keyword">COPY</span><span class="bash"> . .</span>

<span class="hljs-keyword">EXPOSE</span> <span class="hljs-number">5001</span>

<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"npm"</span>, <span class="hljs-string">"start"</span>]</span>
</code></pre>
<h4 id="heading-explanation-of-the-dockerfile">Explanation of the Dockerfile:</h4>
<ul>
<li><p><code>FROM node:18-slim</code>: This sets the base image for the Docker container, which is a slim version of the official Node.js image based on version 18.</p>
</li>
<li><p><code>WORKDIR /app</code>: Sets the working directory for the application inside the container to <code>/app</code>.</p>
</li>
<li><p><code>COPY package.json .</code>: Copies the <code>package.json</code> file into the working directory.</p>
</li>
<li><p><code>RUN npm install -f</code>: Installs the project dependencies using <code>npm</code>. The <code>-f</code> (<code>--force</code>) flag tells npm to proceed even when it would normally stop, for example on peer-dependency conflicts.</p>
</li>
<li><p><code>COPY . .</code>: Copies the rest of the project files into the container.</p>
</li>
<li><p><code>EXPOSE 5001</code>: This tells Docker to expose port <code>5001</code>, which is the port our app will run on inside the container.</p>
</li>
<li><p><code>CMD ["npm", "start"]</code>: This sets the default command to start the application when the container is run, using <code>npm start</code>.</p>
</li>
</ul>
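<p>Before handing the build over to the pipeline, you can sanity-check the Dockerfile locally (the image name below is illustrative):</p>
<pre><code class="lang-bash"># Build the image from the Dockerfile in the current directory
docker build -t my-node-app .

# Run it, mapping the container's port 5001 to localhost:5001
docker run -p 5001:5001 my-node-app
</code></pre>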
<h2 id="heading-create-a-google-cloud-account-project-and-billing-account"><strong>Create a Google Cloud Account, Project, and Billing Account</strong> ☁️</h2>
<p>In this section, we’re laying the foundation for deploying our application to Google Cloud. First, we’ll set up a Google Cloud account (don’t worry, it’s free to get started!). Then, we’ll create a new project where all the resources for your app will live.</p>
<p>Finally, we’ll enable billing so you can unlock the cloud services needed for deployment. Think of this as setting up your workspace in the cloud—organized, ready, and secure! Let’s dive in! ☁️</p>
<h3 id="heading-step-1-create-or-sign-in-to-a-google-cloud-account">Step 1: Create or Sign in to a Google Cloud Account 🌐</h3>
<p>First, go to <a target="_blank" href="https://console.cloud.google.com">Google Cloud Console</a>. If you don’t have a Google Cloud account, you’ll need to create one.</p>
<p>To do this, click on <strong>Get Started for Free</strong> and follow the steps to set up your account (you’ll need to provide payment information, but Google offers $300 in free credits to get started). If you already have a Google account, simply sign in using your credentials.</p>
<p>Once you’ve signed in, you’ll be taken to your Google Cloud dashboard. This is where you can manage all your cloud projects and resources.</p>
<h3 id="heading-step-2-create-a-new-google-cloud-project">Step 2: Create a New Google Cloud Project 🏗️</h3>
<p>At the top left of the Google Cloud Console, you’ll see a drop-down menu beside the Google Cloud logo. Click on this drop-down to display your current projects.</p>
<p>Now it’s time to create a new project. In the top-left corner of the pop-up modal, click on the <strong>New Project</strong> button.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733134260252/6769909a-cf9c-4c91-9d79-7676500f3981.webp" alt="Create Google Cloud Project" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You’ll be redirected to a page where you’ll need to provide some basic details for your new project. So now enter the following information:</p>
<ul>
<li><p><strong>Project Name:</strong> Enter a name of your choice for the project (for example, <code>gcr-ci-cd-project</code>).</p>
</li>
<li><p><strong>Location:</strong> Select a location for your project. You can leave it as the default "No organization" if you're just getting started.</p>
</li>
</ul>
<p>Once you've entered the project name, click the <strong>Create</strong> button. Google Cloud will now start creating your new project. It may take a few seconds.</p>
<h3 id="heading-step-3-access-your-new-project">Step 3: Access Your New Project 🛠️</h3>
<p>After a few seconds, you’ll be redirected to your <strong>Google Cloud dashboard</strong>.</p>
<p>Click on the drop-down menu beside the Google Cloud logo again, and you should now see your newly created project listed in the modal where you can select it.</p>
<p>Then click on the project name (for example, <code>gcr-ci-cd-project</code>) to enter your project’s dashboard.</p>
<h3 id="heading-step-4-link-a-billing-account-to-your-project">Step 4: Link A Billing Account To Your Project 💳</h3>
<p>To access the billing page, in the Google Cloud Console, find the <strong>Navigation Menu</strong> (the three horizontal lines) at the top left of the screen. Click on it to open a list of options. Scroll down and click on <strong>Billing</strong>. This will take you to the billing section of your Google Cloud account.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733134747962/745c8a0e-13c5-4dde-849b-303c1200f495.png" alt="Navigate to Google Cloud Billing dashboard/section " class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>If you haven't set up a billing account yet, you'll be prompted to do so. Click on the <strong>"Link a billing account"</strong> button to start the process.</p>
<p>Now you can create a new billing account (if you don’t have one). You’ll be redirected to a page where you can either select an existing billing account or create a new one. If you don't already have a billing account, click on <strong>"Create a billing account"</strong>.</p>
<p>Provide the necessary details, including:</p>
<ul>
<li><p><strong>Account name</strong> (for example, "Personal Billing Account" or your business name).</p>
</li>
<li><p><strong>Country</strong>: Choose the country where your business or account is based.</p>
</li>
<li><p><strong>Currency</strong>: Choose the currency in which you want to be billed.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733135153425/1287ab53-e9c5-45b5-a09d-3d3a13840ca4.png" alt="Create Google Cloud billing account" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
</ul>
<p>Next, enter your payment information (credit card or bank account details). Google Cloud will verify your payment method, so make sure the information is correct.</p>
<p>Read and agree to the Google Cloud Terms of Service and Billing Account Terms. Once you’ve done this, click <strong>"Start billing"</strong> to finish setting up your billing account.</p>
<p>After setting up your billing account, you’ll be taken to a page that asks you to <strong>link</strong> it to your project. Select the billing account you just created or an existing billing account you want to use. Click <strong>Set Account</strong> to link the billing account to your project.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733337276189/b80702dd-2ff6-42db-a325-c2082e8059e5.png" alt="Link Google Cloud billing account to project" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>After you’ve linked your billing account to your project, you should see a confirmation message indicating that billing has been successfully enabled for your project.</p>
<p>You can always verify this by returning to the Billing section in the Google Cloud Console, where you’ll see your billing account listed.</p>
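<p>If you have the gcloud CLI installed, recent SDK versions also let you confirm the link from the terminal (substitute your own project ID):</p>
<pre><code class="lang-bash"># Shows billingEnabled: true when a billing account is linked
gcloud billing projects describe gcr-ci-cd-project
</code></pre>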
<h2 id="heading-create-a-google-cloud-service-account-to-enable-deployment-of-the-nodejs-application-to-google-cloud-run-via-the-cd-pipeline"><strong>Create a Google Cloud Service Account to Enable Deployment of the Node.js Application to Google Cloud Run via the CD Pipeline</strong> 🚀</h2>
<h3 id="heading-why-do-we-need-a-service-account-and-key">Why Do We Need a Service Account and Key? 🤔</h3>
<p>A <strong>service account</strong> allows our CI/CD pipeline to authenticate and interact with Google Cloud services programmatically. By assigning specific roles (permissions), we ensure the service account can only perform tasks related to deployment, such as managing Google Cloud Run.</p>
<p>The <strong>service account key</strong> is a JSON file containing the credentials used for authentication. We securely store this key as a GitHub secret to protect sensitive information.</p>
<h3 id="heading-step-1-open-the-service-accounts-page">Step 1: Open the Service Accounts Page</h3>
<p>Here are the steps you can follow to set up your service account and get your key:</p>
<p>First, visit the Google Cloud Console at <a target="_blank" href="https://console.cloud.google.com/">https://console.cloud.google.com/</a>. Ensure you’ve selected the correct project (for example, <code>gcr-ci-cd-project</code>). To change projects, click the drop-down menu next to the Google Cloud logo at the top-left corner and select your project.</p>
<p>Then navigate to the Navigation Menu (three horizontal lines in the top-left corner) and click on <strong>IAM &amp; Admin &gt; Service Accounts</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733147553088/e3647442-ca8e-4197-ab5f-91cee5a6d6b0.png" alt="Navigate to Google Cloud IAM - Service Account" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-2-create-a-new-service-account">Step 2: Create a New Service Account</h3>
<p>Click on the "Create Service Account" button. This will open a form where you’ll define your service account details.</p>
<p>Next, enter the Service Account details:</p>
<ul>
<li><p><strong>Name</strong>: Enter a descriptive name (for example, <code>ci-cd-sa</code>).</p>
</li>
<li><p><strong>ID</strong>: This will auto-fill based on the name.</p>
</li>
<li><p><strong>Description</strong>: Add a description to help identify its purpose, such as “Used for deploying Node.js app to Cloud Run.”</p>
</li>
<li><p>Click <strong>Create and Continue</strong> to proceed.</p>
</li>
</ul>
<h3 id="heading-step-3-assign-necessary-roles-permissions">Step 3: Assign Necessary Roles (Permissions)</h3>
<p>On the next screen, you’ll assign roles to the service account. Add the following roles one by one:</p>
<ul>
<li><p><strong>Cloud Run Admin</strong>: Allows management of Cloud Run services.</p>
</li>
<li><p><strong>Service Account User</strong>: Grants the ability to use service accounts.</p>
</li>
<li><p><strong>Service Usage Admin</strong>: Enables control over enabling APIs.</p>
</li>
<li><p><strong>Viewer</strong>: Provides read-only access to view resources.</p>
</li>
</ul>
<p>To add a role:</p>
<ul>
<li><p>Click on <strong>"Select a Role"</strong>.</p>
</li>
<li><p>Use the search bar to type the role name (for example, "Cloud Run Admin") and select it.</p>
</li>
<li><p>Repeat for all four roles.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733147870701/393833c9-c320-49e3-8743-dbc0d739b99b.png" alt="Create Google Cloud Service Account - Add role to a service account during creation" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Your screen should look similar to this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733147949148/c509c810-767d-4900-aa44-a737cc1c8dc1.png" alt="Create a Google Cloud service account (SA) - Done assigning all roles to SA" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>After assigning the roles, click <strong>Continue</strong>.</p>
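<p>If you prefer working from the terminal, the service account creation and role assignments above can also be done with the <code>gcloud</code> CLI. The following is a sketch that assumes the CLI is installed and authenticated (<code>gcloud auth login</code>) and uses the example project and account names from this tutorial:</p>
<pre><code class="lang-bash">PROJECT_ID="gcr-ci-cd-project"

# Create the service account (Step 2)
gcloud iam service-accounts create ci-cd-sa \
  --project="$PROJECT_ID" \
  --display-name="ci-cd-sa" \
  --description="Used for deploying Node.js app to Cloud Run"

# Assign the four roles (Step 3)
for role in roles/run.admin roles/iam.serviceAccountUser \
            roles/serviceusage.serviceUsageAdmin roles/viewer; do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:ci-cd-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
    --role="$role"
done
</code></pre>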
<h3 id="heading-step-4-skip-granting-users-access-to-the-service-account">Step 4: Skip Granting Users Access to the Service Account</h3>
<p>On the next screen, you’ll see an option to grant additional users access to this service account. Click <strong>Done</strong> to complete the creation process.</p>
<h3 id="heading-step-5-generate-a-service-account-key">Step 5: Generate a Service Account Key 🔑</h3>
<p>You should now see your newly created service account in the list. Find the row for your service account (for example, <code>ci-cd-sa</code>) and click the three vertical dots under the “Actions” column. Select <strong>"Manage Keys"</strong> from the drop-down menu.</p>
<p>To add a new key:</p>
<ul>
<li><p>Click on <strong>"Add Key" &gt; "Create New Key"</strong>.</p>
</li>
<li><p>In the pop-up dialog, select <strong>JSON</strong> as the key type.</p>
</li>
<li><p>Click <strong>Create</strong>.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733148120618/c7014982-ae7d-40ed-bbfb-0c8f5c4b8090.png" alt="Create Google Cloud service account key" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
</ul>
<p>Now, download the key file. A JSON file will automatically be downloaded to your computer. This file contains the credentials needed to authenticate with Google Cloud.</p>
<p>Make sure you keep the key secure and store it in a safe location. Don’t share it – treat it as sensitive information.</p>
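<p>If you created the service account with the <code>gcloud</code> CLI, you can also generate the JSON key from the terminal instead of the console. A sketch, assuming the account from the earlier steps exists (the filename <code>ci-cd-sa-key.json</code> is just an example):</p>
<pre><code class="lang-bash">PROJECT_ID="gcr-ci-cd-project"

# Creates a new key and writes it to a local JSON file
gcloud iam service-accounts keys create ci-cd-sa-key.json \
  --iam-account="ci-cd-sa@${PROJECT_ID}.iam.gserviceaccount.com"
</code></pre>
<p>The same warning applies: this key file grants access to your project, so keep it out of version control.</p>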
<h3 id="heading-step-6-add-the-service-account-key-to-github-secrets">Step 6: Add the Service Account Key to GitHub Secrets 🔒</h3>
<p>Start by opening the downloaded JSON file using a text editor (like Notepad or VS Code). Then select and copy the entire contents of the file.</p>
<p>Then navigate to the repository you created for this project on GitHub. Click on the <strong>Settings</strong> tab at the top of the repository. Scroll down and find the <strong>Secrets and variables &gt; Actions</strong> section.</p>
<p>Now you need to add a new secret. Click the <strong>"New repository secret"</strong> button. In the <strong>Name</strong> field, enter <code>GCP_SERVICE_ACCOUNT</code>. In the <strong>Value</strong> field, paste the JSON content you copied earlier. Click <strong>Add secret</strong> to save it.</p>
<p>Do the same for the <code>GCP_PROJECT_ID</code> secret, but now add your Google Project ID as the value. To get your project ID, follow these steps:</p>
<ol>
<li><p><strong>Navigate to the Google Cloud Console</strong>: Open Google Cloud Console at <a target="_blank" href="https://console.cloud.google.com/">https://console.cloud.google.com/</a>.</p>
</li>
<li><p><strong>Locate the Project Dropdown</strong>: At the top-left of the screen, next to the <strong>Google Cloud logo</strong>, you will see a drop-down that shows the name of your current project.</p>
</li>
<li><p><strong>View the Project ID</strong>: Click the drop-down, and you'll see a list of all your projects. Your <strong>Project ID</strong> will be displayed next to the project name. It is a unique identifier used by Google Cloud.</p>
</li>
<li><p><strong>Copy the Project ID</strong>: Copy the <strong>Project ID</strong> that is displayed, and add it as the value of the <code>GCP_PROJECT_ID</code> secret.</p>
</li>
</ol>
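<p>As an alternative to the web UI, both secrets can be added with the GitHub CLI. This sketch assumes <code>gh</code> is authenticated, that you run it from your repository clone, and that <code>ci-cd-sa-key.json</code> is the key file you downloaded in Step 5 (an example name):</p>
<pre><code class="lang-bash"># Set the service account secret from the downloaded key file
gh secret set GCP_SERVICE_ACCOUNT &lt; ci-cd-sa-key.json

# Set the project ID secret from the currently active gcloud project
gh secret set GCP_PROJECT_ID --body "$(gcloud config get-value project)"
</code></pre>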
<h3 id="heading-step-7-adding-external-variables-to-the-github-repository">Step 7: Adding External Variables to the GitHub Repository 🔧</h3>
<p>Before proceeding with deployment, we need to define some external variables that were referenced in the CD workflow. These variables ensure that the pipeline knows critical details about your Google Cloud Run services and Docker container registry.</p>
<p>Here are the steps you’ll need to follow to do this:</p>
<ol>
<li><p>First, go to your repository on GitHub.</p>
</li>
<li><p>Click the <strong>Settings</strong> tab at the top of the repository. Scroll down to <strong>Secrets and variables &gt; Actions</strong>.</p>
</li>
<li><p>Click on the <strong>Variables</strong> tab next to <strong>Secrets</strong>. Click <strong>"New repository variable"</strong> for each variable. Then you’ll need to define these variables:</p>
<ul>
<li><p><code>GCR_PROJECT_NAME</code>: Set this to the name of your Cloud Run service for the production/live environment. For example, <code>gcr-ci-cd-app</code>.</p>
</li>
<li><p><code>GCR_STAGING_PROJECT_NAME</code>: Set this to the name of your Cloud Run service for the staging/test environment. For example, <code>gcr-ci-cd-staging</code>.</p>
</li>
<li><p><code>GCR_REGION</code>: Enter the region where you’d like to deploy the services. For this tutorial, set it to <code>us-central1</code>.</p>
</li>
<li><p><code>IMAGE</code>: Specify the name of the Docker image/container registry where the published image will be uploaded. For example, <code>&lt;dockerhub-username&gt;/ci-cd-tutorial-app</code>.</p>
</li>
</ul>
</li>
<li><p>After entering each variable name and value, click <strong>Add variable</strong>.</p>
</li>
</ol>
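<p>The same four variables can also be defined from the terminal with the GitHub CLI (the <code>gh variable</code> command requires a reasonably recent <code>gh</code> release). A sketch using the example values above:</p>
<pre><code class="lang-bash">gh variable set GCR_PROJECT_NAME --body "gcr-ci-cd-app"
gh variable set GCR_STAGING_PROJECT_NAME --body "gcr-ci-cd-staging"
gh variable set GCR_REGION --body "us-central1"
gh variable set IMAGE --body "&lt;dockerhub-username&gt;/ci-cd-tutorial-app"
</code></pre>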
<h3 id="heading-enabling-the-service-usage-api-on-the-google-cloud-project">Enabling the Service Usage API on the Google Cloud Project 🌐</h3>
<p>To deploy your application, the <strong>Service Usage API</strong> must be enabled in your Google Cloud project. This API allows you to manage Google Cloud services programmatically, including enabling/disabling APIs and monitoring their usage.</p>
<p>Follow these steps to enable it:</p>
<ol>
<li><p>First, visit the Google Cloud Console at <a target="_blank" href="https://console.cloud.google.com/">https://console.cloud.google.com/</a>.</p>
</li>
<li><p>Then make sure you’re in the correct project. Click the project drop-down menu near the <strong>Google Cloud logo</strong> at the top-left corner. Select <code>gcr-ci-cd-project</code>, or the name you gave your project, from the list of projects.</p>
</li>
<li><p>Next you’ll need to access the API library. Open the <strong>Navigation Menu</strong> (three horizontal lines in the top-left corner). Select <strong>APIs &amp; Services &gt; Library</strong> from the menu.</p>
</li>
<li><p>In the API Library, use the search bar to search for <strong>"Service Usage API"</strong>.</p>
</li>
<li><p>Click on the <strong>Service Usage API</strong> from the search results. On the API’s details page, click <strong>Enable</strong>.</p>
</li>
<li><p>To verify, go to <strong>APIs &amp; Services &gt; Enabled APIs &amp; Services</strong> in the Google Cloud Console. Confirm that the <strong>Service Usage API</strong> appears in the list of enabled APIs.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733150269757/00a4e20b-72ac-4bd4-b05f-af6e61600e09.png" alt="Enable the Google Cloud &quot;Service Usage API&quot; in the project" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
</ol>
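<p>You can also enable and verify the API from the terminal, assuming the <code>gcloud</code> CLI is authenticated against your project:</p>
<pre><code class="lang-bash">PROJECT_ID="gcr-ci-cd-project"

# Enable the Service Usage API
gcloud services enable serviceusage.googleapis.com --project="$PROJECT_ID"

# Verify it appears among the enabled services
gcloud services list --enabled \
  --filter="config.name=serviceusage.googleapis.com" \
  --project="$PROJECT_ID"
</code></pre>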
<h2 id="heading-create-the-staging-branch-and-merge-the-feature-branch-into-it-continuous-integration-and-continuous-delivery"><strong>Create the Staging Branch and Merge the Feature Branch into It (Continuous Integration and Continuous Delivery) 🌟</strong></h2>
<p>When changes from the <code>feature/ci-cd-pipeline</code> branch are merged into the <code>staging</code> branch, we complete the <strong>Continuous Integration (CI)</strong> process, and the workflow <code>ci-pipeline.yml</code> will run. This ensures that the changes made in the feature branch are tested and integrated into a shared branch.</p>
<p>Once the pull request (PR) is merged into <code>staging</code>, the <strong>Continuous Delivery (CD)</strong> pipeline automatically triggers, deploying the application to the staging environment. This simulates how updates are tested in a safe environment before being pushed to production.</p>
<h3 id="heading-create-the-staging-branch-on-the-remote-repository">Create the <code>staging</code> Branch on the Remote Repository</h3>
<p>To enable the CI/CD pipeline, we’ll first create a <code>staging</code> branch on the remote GitHub repository. This branch will serve as the test environment where changes are deployed before they reach the production environment.</p>
<p>To create the <code>staging</code> branch directly on GitHub, follow these steps:</p>
<ol>
<li><p>First, navigate to your repository on GitHub. Open your web browser and go to the GitHub repository where you want to create the new <code>staging</code> branch.</p>
</li>
<li><p>Then, switch to the <code>main</code> branch. On the top of the repository page, locate the <strong>Branch</strong> dropdown (usually labelled as <code>main</code> or the current branch name). Click on the dropdown and make sure you are on the <code>main</code> branch.</p>
</li>
<li><p>Next, create the <code>staging</code> branch. In the same dropdown where you see the <code>main</code> branch, type <code>staging</code> into the text box. Once you start typing, GitHub will offer you the option to create a new branch called <code>staging</code>. Select the <strong>Create branch: staging</strong> option from the dropdown.</p>
</li>
<li><p>Finally, verify the branch. After creating the <code>staging</code> branch, GitHub will automatically switch to it. You should now see <code>staging</code> in the branch dropdown, confirming the new branch was created.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733152232155/e6215137-5e3b-474b-88f8-af03269eccc2.png" alt="Create a new Staging branch in the GitHub repository" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
</ol>
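<p>If you prefer the command line, the <code>staging</code> branch can equally be created locally and pushed to GitHub:</p>
<pre><code class="lang-bash">git checkout main
git pull origin main        # make sure main is up to date
git checkout -b staging     # create the branch locally
git push -u origin staging  # publish it and set the upstream
</code></pre>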
<h3 id="heading-merge-your-feature-branch-into-the-staging-branch-via-a-pull-request-pr"><strong>Merge Your Feature Branch into the Staging Branch via a Pull Request (PR)</strong></h3>
<p>This process combines both Continuous Integration (CI) and Continuous Delivery (CD). You will commit changes from your feature branch, push them to the remote feature branch, and then open a PR to merge those changes into the <code>staging</code> branch. Here's how to do it:</p>
<h4 id="heading-step-1-commit-local-changes-on-your-feature-branch"><strong>Step 1: Commit Local Changes on Your Feature Branch</strong></h4>
<p>First, you’ll want to make sure that you are on the correct branch (the feature branch) by running:</p>
<pre><code class="lang-bash">git status
</code></pre>
<p>If you are not on the <code>feature/ci-cd-pipeline</code> branch, switch to it by running:</p>
<pre><code class="lang-bash">git checkout feature/ci-cd-pipeline
</code></pre>
<p>Now, stage the changes you made for the commit:</p>
<pre><code class="lang-bash">git add .
</code></pre>
<p>This stages all changes, including new files, modified files, and deleted files.</p>
<p>Next, commit your changes with a clear and descriptive message:</p>
<pre><code class="lang-bash">git commit -m <span class="hljs-string">"Set up CI/CD pipelines for the project"</span>
</code></pre>
<p>Then you can verify your commit by running:</p>
<pre><code class="lang-bash">git <span class="hljs-built_in">log</span>
</code></pre>
<p>This will display your most recent commits, and you should see the commit message you just added.</p>
<h4 id="heading-step-2-push-your-feature-branch-changes-to-the-remote-repository"><strong>Step 2: Push Your Feature Branch Changes to the Remote Repository</strong></h4>
<p>After committing your changes, push them to the remote repository:</p>
<pre><code class="lang-bash">git push origin feature/ci-cd-pipeline
</code></pre>
<p>This pushes your local changes on the <code>feature/ci-cd-pipeline</code> branch to the remote GitHub repository.</p>
<p>Once the push is successful, visit your GitHub repository in a web browser, and confirm that the <code>feature/ci-cd-pipeline</code> branch is updated with your new commit.</p>
<h4 id="heading-step-3-create-a-pull-request-to-merge-the-feature-branch-into-staging"><strong>Step 3: Create a Pull Request to Merge the Feature Branch into Staging</strong></h4>
<p>Go to your repository on GitHub and ensure that you are on the main page of the repository.</p>
<p>You should see an alert at the top of the page suggesting you create a pull request for the recently pushed branch (<code>feature/ci-cd-pipeline</code>). Click the <strong>Compare &amp; Pull Request</strong> button next to the alert.</p>
<p>Now, it’s time to choose the base and compare branches. On the PR creation page, make sure the <strong>base</strong> branch is set to <code>staging</code> (this is the branch you want to merge your changes into). The <strong>compare</strong> branch should already be set to <code>feature/ci-cd-pipeline</code> (the branch you just pushed). If they’re not selected correctly, use the dropdowns to change them.</p>
<p>You’ll want to come up with a good PR description for this. Write a clear title and description for the pull request, explaining what changes you're merging and why. For example:</p>
<ul>
<li><p><strong>Title</strong>: "Merge CI/CD setup changes from feature branch"</p>
</li>
<li><p><strong>Description</strong>: "This pull request adds the CI/CD pipelines for GitHub Actions and Docker Hub integration to the project. It includes the configurations for both CI and CD workflows."</p>
</li>
</ul>
<p>Now GitHub will show a list of all the changes that will be merged. Take a moment to review them and ensure everything looks correct.</p>
<p>If all looks good after reviewing, click on the <strong>Create pull request</strong> button. This will create the PR and notify team members (if any) that changes are ready to be reviewed and merged.</p>
<p>Wait a few seconds, and you should see a message indicating that all the checks have passed. Click on the link with the description "<strong>CI Pipeline to staging/production environment...</strong>". This should direct you to the Continuous Integration workflow, where you can view the steps that ran.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733153444873/6ecdb277-0a45-44ec-981c-c7ee671cd2f0.png" alt="Create a new pull request (PR) from the feature to the staging branch" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733153637817/e12fefde-9259-41a3-9bd1-63b5da1d88ea.png" alt="CI workflow run from PR (feature to staging branch)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-the-continuous-integration-ci-process">The Continuous Integration (CI) Process</h4>
<p>The CI process begins when a Pull Request is made to the <code>staging</code> branch. It triggers the GitHub Actions workflow defined in the <code>.github/workflows/ci-pipeline.yml</code> file. The workflow runs the necessary steps to set up the environment, install dependencies, and build the Node.js application.</p>
<p>It then runs automated tests (using <code>npm test</code>) to ensure that the changes do not break any functionality in the codebase. If all these steps are completed successfully, the CI pipeline confirms that the feature branch is stable and ready to be merged into the <code>staging</code> branch for further testing and deployment.</p>
<h4 id="heading-step-4-merge-the-pull-request"><strong>Step 4: Merge the Pull Request</strong></h4>
<p>If your team or collaborators are part of the project, they may review your PR. This step may involve discussing any changes or improvements. If everything looks good, a reviewer will merge the PR.</p>
<p>Once the PR has been reviewed and approved, you can merge the PR. To do this, just click on the <strong>Merge pull request</strong> button. Choose <strong>Confirm merge</strong> when prompted.</p>
<p>After merging, you can go to the <code>staging</code> branch to verify that the changes were successfully merged.</p>
<h3 id="heading-navigating-to-the-actions-page-after-merging-the-pr"><strong>Navigating to the Actions Page After Merging the PR</strong></h3>
<p>Once you have successfully merged your pull request from the <code>feature/ci-cd-pipeline</code> branch into the <code>staging</code> branch, the Continuous Delivery (CD) pipeline will be triggered. To view the progress of the CD pipeline, navigate to the <strong>Actions</strong> tab in your GitHub repository. Here's how to do it:</p>
<ol>
<li><p>Go to your GitHub repository.</p>
</li>
<li><p>At the top of the page, you will see the <strong>Actions</strong> tab next to the <strong>Code</strong> tab. Click on it.</p>
</li>
<li><p>On the Actions page, you will see a list of workflows that have been triggered. Look for the one labelled <strong>CD Pipeline to Google Cloud Run (staging and production)</strong>. It should appear as a new run after the PR merge.</p>
</li>
<li><p>Click on the workflow run to view its progress and see the detailed logs for each step.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733154575368/96e236a2-ae66-494b-b544-f96955a18ac9.png" alt="Continuous Delivery workflow from merge to staging (feature to staging)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733159329441/cb7e26a9-7a20-4b1b-9869-e00facc695c1.png" alt="Continuous Delivery workflow Jobs from merge to staging (feature to staging)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733160506355/4682afe3-bb04-405d-af4e-fd9bd3494659.png" alt="Continuous Delivery workflow steps from merge to staging (feature to staging)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This will allow you to monitor the status of the CD pipeline and check if there are any issues during deployment.</p>
<p>If you look at the CD steps and workflow, you'll see that the step to deploy the application to the <strong>production</strong> environment was skipped, while the step to deploy to the <strong>staging</strong> environment was executed.</p>
<h4 id="heading-continuous-delivery-cd-pipeline-whats-going-on"><strong>Continuous Delivery (CD) pipeline – what’s going on:</strong></h4>
<p>The <strong>Continuous Delivery (CD) Pipeline</strong> automates the process of deploying the application to Google Cloud Run (testing environment). This workflow is triggered by a push to the <code>staging</code> branch, which happens after the changes from the feature branch are merged into <code>staging</code>. It can also be manually triggered via <code>workflow_dispatch</code> or upon a new release being published.</p>
<p>The pipeline consists of multiple stages:</p>
<ol>
<li><p><strong>Test Job:</strong> The pipeline begins by setting up the environment and running tests using the <code>npm test</code> command. If the tests pass, the process moves forward.</p>
</li>
<li><p><strong>Build Job:</strong> The next step builds the Docker image of the Node.js application, tags it, and then pushes it to Docker Hub.</p>
</li>
<li><p><strong>Deployment to GCP:</strong> After the image is pushed, the workflow authenticates to Google Cloud and deploys the application. If the event is a release (that is, a push to the <code>main</code> branch), the application is deployed to the production environment. If the event is a push to <code>staging</code>, the app is deployed to the staging environment.</p>
</li>
</ol>
<p>The CD process ensures that any changes made to the <code>staging</code> branch are automatically tested, built, and deployed to the staging environment, ready for further validation. When a release is published, it will trigger deployment to production, ensuring your app is always up to date.</p>
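<p>To make the trigger and branching logic above concrete, the relevant parts of such a workflow might look roughly like this. This is an illustrative sketch, not the exact CD workflow file built earlier in the book:</p>
<pre><code class="lang-yaml">on:
  push:
    branches: [staging]   # deploy to staging on merge
  release:
    types: [published]    # deploy to production on release
  workflow_dispatch:      # allow manual runs

jobs:
  deploy-staging:
    if: github.ref == 'refs/heads/staging'
    runs-on: ubuntu-latest
    steps:
      - run: echo "Deploying ${{ vars.GCR_STAGING_PROJECT_NAME }}"

  deploy-production:
    if: github.event_name == 'release'
    runs-on: ubuntu-latest
    steps:
      - run: echo "Deploying ${{ vars.GCR_PROJECT_NAME }}"
</code></pre>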
<h3 id="heading-accessing-the-deployed-application-in-the-staging-environment-on-google-cloud-run">Accessing the Deployed Application in the Staging Environment on Google Cloud Run 🌐</h3>
<p>Once the deployment to Google Cloud Run is successfully completed, you'll want to access your application running in the <strong>staging</strong> environment. Follow these steps to find and visit your deployed application:</p>
<h4 id="heading-1-navigate-to-the-google-cloud-console">1. <strong>Navigate to the Google Cloud Console</strong></h4>
<p>Open the Google Cloud Console in your browser by visiting <a target="_blank" href="https://console.cloud.google.com">https://console.cloud.google.com</a>. If you're not already signed in, make sure you log in with your Google account.</p>
<h4 id="heading-2-go-to-the-cloud-run-dashboard">2. <strong>Go to the Cloud Run Dashboard</strong></h4>
<p>In the Google Cloud Console, use the Search bar at the top or navigate through the left-hand menu: Go to <strong>Cloud Run</strong> (you can type this into the search bar, or find it under <strong>Products &amp; services</strong> &gt; <strong>Compute</strong> &gt; <strong>Cloud Run</strong>). Click on <strong>Cloud Run</strong> to open the Cloud Run dashboard.</p>
<h4 id="heading-3-select-your-staging-service">3. <strong>Select Your Staging Service</strong></h4>
<p>In the <strong>Cloud Run dashboard</strong>, you should see a list of all your services deployed across various environments. Find the service associated with the staging environment. The name should be similar to what you defined in your workflow (for example, <code>gcr-ci-cd-staging</code>).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733159635861/4ac895d2-5071-4d3f-9ed1-5af2bcca8835.png" alt="Google Cloud Run service for the staging environment" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-4-access-the-service-url">4. <strong>Access the Service URL</strong></h4>
<p>Once you've selected your staging service, you’ll be taken to the <strong>Service details page</strong>. This page provides all the important information about your deployed service.<br>On this page, look for the <strong>URL</strong> section under the <strong>Service URL</strong> heading. The URL will look something like: <code>https://gcr-ci-cd-staging-&lt;unique-id&gt;.run.app</code>.</p>
<h4 id="heading-5-visit-the-application">5. <strong>Visit the Application</strong></h4>
<p>Click on the <strong>Service URL</strong>, and it will open your staging environment in a new tab in your browser. You can now interact with your application as if it were live, but in the <strong>staging environment</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733160050763/b097e647-bf6d-442e-87df-fc7d82d3585c.png" alt="Google Cloud Run service URL for the staging environment" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
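<p>If you'd rather not click through the console, the service URL can also be fetched with a single <code>gcloud</code> command, assuming the service and region names used in this tutorial:</p>
<pre><code class="lang-bash"># Print the staging service URL
gcloud run services describe gcr-ci-cd-staging \
  --region us-central1 \
  --format="value(status.url)"
</code></pre>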
<h2 id="heading-merge-the-staging-branch-into-the-main-branch-continuous-integration-and-continuous-deployment"><strong>Merge the Staging Branch into the Main Branch (Continuous Integration and Continuous Deployment) 🌐</strong></h2>
<p>In this section, we'll take the updates in the staging branch, merge them into the main branch, and trigger the CI/CD pipeline. This process not only ensures your changes are production-ready but also deploys them to the production/live environment. 🚀</p>
<h3 id="heading-step-1-push-local-changes-and-open-a-pull-request">Step 1: Push Local Changes and Open a Pull Request</h3>
<p><strong>Why?</strong> The first step involves merging the staging branch into the main branch. Just like in the previous Continuous Delivery process, this ensures the integration of thoroughly tested updates.</p>
<p>Here’s how to do it:</p>
<p>First, visit the GitHub repository where your project is hosted.</p>
<p>Then go to the <strong>Pull Requests</strong> tab and click <strong>New Pull Request</strong>. Choose <strong>main</strong> as the base (target) branch and <strong>staging</strong> as the compare (source) branch. Add a clear title and description for the pull request, explaining why these updates are ready for production deployment.</p>
<h3 id="heading-step-2-continuous-integration-ci-pipeline-execution">Step 2: Continuous Integration (CI) Pipeline Execution</h3>
<p>After you open the pull request, the <strong>Continuous Integration (CI)</strong> pipeline will automatically run to validate that the changes are still stable when integrated into the <strong>main branch</strong>.</p>
<h4 id="heading-pipeline-steps">Pipeline Steps:</h4>
<ul>
<li><p><strong>Code Checkout</strong>: The workflow fetches the latest code from the <strong>main branch</strong>.</p>
</li>
<li><p><strong>Dependency Installation</strong>: The pipeline installs all required dependencies.</p>
</li>
<li><p><strong>Testing</strong>: Automated tests are run to validate the application's stability.</p>
</li>
</ul>
<h3 id="heading-step-3-create-a-new-release">Step 3: Create a New Release</h3>
<p>The Continuous Deployment (CD) workflow to deploy to the production environment is triggered by the creation of a new release from the main branch.</p>
<p>Let’s walk through the steps to create a release.</p>
<p>On your GitHub repository page, click on the <strong>Releases</strong> section (located under the <strong>Code</strong> tab).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733338781623/c21e7f03-5381-47f9-8807-b5a3360245ad.png" alt="Navigate to the Release page in the GitHub repo" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Next, click <strong>Draft a new release</strong>. Set the <strong>Target</strong> branch to <strong>main</strong>. Enter a <strong>Tag version</strong> (for example, <code>v1.0.0</code>) following semantic versioning. Add a <strong>Release title</strong> and an optional description of the changes.</p>
<p>Then, click <strong>Publish Release</strong> to finalize.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733161473858/6e14214c-31fb-49b3-9dff-a719b9ec1d40.png" alt="Create a new release in the GitHub repo" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
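<p>The release can also be created and published from the terminal with the GitHub CLI, which triggers the production deployment in the same way. A sketch, assuming <code>gh</code> is authenticated:</p>
<pre><code class="lang-bash">gh release create v1.0.0 \
  --target main \
  --title "v1.0.0" \
  --notes "First production release of the CI/CD tutorial app"
</code></pre>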
<h4 id="heading-why-run-the-continuous-deployment-pipeline-on-release-instead-of-on-push">Why run the Continuous Deployment pipeline on release instead of on push? 🤔</h4>
<p>In our setup, we decided not to trigger the Continuous Deployment (CD) pipeline every time changes are pushed to the main branch. Instead, we trigger it only when a new release is created. This gives the team more control over when updates are deployed to the production environment.</p>
<p>Imagine a scenario where developers are working on new features—they may push changes to the main branch as part of their regular workflow, but these features might not be complete or ready for users yet. Automatically deploying every push could accidentally expose unfinished features to your users, which can be confusing or disruptive.</p>
<p>By requiring a release to trigger the deployment, the team gets a chance to finalize and polish all changes before they go live.</p>
<p>For example, developers can test new features in the staging environment, fix any issues, and merge those changes into the main branch without worrying about them immediately appearing in production. This workflow ensures that only well-tested and complete features make their way to your end users.</p>
<p>Ultimately, this approach helps maintain a smooth user experience. Instead of seeing half-built features or unexpected changes, users only see updates that are ready and functional. It also gives the team the flexibility to push changes to the main branch frequently—preventing merge conflicts and making collaboration easier—while keeping control over what gets deployed live. 🚀</p>
<h3 id="heading-step-4-navigate-to-the-actions-page">Step 4: Navigate to the Actions Page</h3>
<p>After the release is published, the CD pipeline for the production environment is triggered. To monitor it, repeat the process you used for the Continuous Delivery workflow:</p>
<ol>
<li><p><strong>Go to the GitHub Actions tab</strong>: In your GitHub repository, click on the <strong>Actions</strong> tab.</p>
</li>
<li><p><strong>Locate the deployment workflow</strong>: Look for the <strong>CD Pipeline to Google Cloud Run (staging and production)</strong> workflow. You’ll notice that the workflow has been triggered on the <strong>main branch</strong> by the release event.</p>
</li>
<li><p><strong>Open the workflow details</strong>: Click on the workflow to view detailed steps, logs, and statuses for each part of the deployment process.</p>
</li>
</ol>
<p>This time, the Continuous Deployment workflow deploys the application to the <strong>production</strong>/<strong>live</strong> environment.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733164741827/303cd415-5bb9-4149-aa5d-7088d0eab582.png" alt="Continuous Deployment workflow from merge to main (staging to main)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-5-access-the-live-application">Step 5: Access the Live Application</h3>
<p>Once the deployment is complete, go to Google Cloud Console at <a target="_blank" href="https://console.cloud.google.com">https://console.cloud.google.com</a>.</p>
<p>Navigate to <strong>Cloud Run</strong> from the menu. Select the service corresponding to the <strong>production environment</strong> (for example, <code>gcr-ci-cd-app</code>).</p>
<p>Locate the <strong>Service URL</strong> in the service details page. Open the URL in your browser to access the live application.</p>
<p>And now, congratulations – you’re done!</p>
<h2 id="heading-conclusion">Conclusion 🌟</h2>
<p>In this article, we explored how to build and automate a CI/CD pipeline for a Node.js application, using GitHub Actions, Docker Hub, and Google Cloud Run.</p>
<p>We set up workflows to handle Continuous Integration by testing and integrating code changes and Continuous Delivery to deploy those changes to a staging environment. We also containerized our app using Docker and deployed it seamlessly to Google Cloud Run.</p>
<p>Finally, we implemented Continuous Deployment, ensuring updates to the production environment happen only when a release is created from the main branch.</p>
<p>This approach gives teams the flexibility to push and test incomplete features without impacting end users. By following these steps, you've built a robust pipeline that makes deploying your application smoother, faster, and more reliable.</p>
<h3 id="heading-study-further">Study Further 📚</h3>
<p>If you would like to learn more about Continuous Integration, Delivery, and Deployment, you can check out the courses below:</p>
<ul>
<li><p><a target="_blank" href="https://www.coursera.org/learn/continuous-integration-and-continuous-delivery-ci-cd"><strong>Continuous Integration and Continuous Delivery (CI/CD)</strong></a> (from IBM on Coursera)</p>
</li>
<li><p><a target="_blank" href="https://www.udemy.com/course/github-actions-the-complete-guide/?couponCode=CMCPSALE24"><strong>GitHub Actions - The Complete Guide</strong></a> (from Udemy)</p>
</li>
<li><p><a target="_blank" href="https://www.freecodecamp.org/news/what-is-ci-cd/"><strong>Learn CI/CD by building a project</strong></a> (freeCodeCamp tutorial)</p>
</li>
</ul>
<h3 id="heading-about-the-author">About the Author 👨‍💻</h3>
<p>Hi, I’m Prince! I’m a software engineer passionate about building scalable applications and sharing knowledge with the tech community.</p>
<p>If you enjoyed this article, you can learn more about me by exploring more of my blogs and projects on my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/">LinkedIn profile</a>. You can find my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/details/publications/">LinkedIn articles here</a>. And you can <a target="_blank" href="https://prince-onuk.vercel.app/achievements#articles">visit my website</a> to read more of my articles as well. Let’s connect and grow together! 😊</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
