<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Load Balancing - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Load Balancing - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Thu, 14 May 2026 04:32:58 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/load-balancing/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Reduce Latency in Your Generative AI Apps with Gemini and Cloud Run ]]>
                </title>
                <description>
                    <![CDATA[ You've built your first Generative AI feature. Now what? When deploying AI, the challenge is no longer if the model can answer, but how fast it can answer for a user halfway across the globe. Low latency is not a luxury, it's a requirement for good u... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-reduce-latency-in-your-generative-ai-apps-with-gemini-and-cloud-run/</link>
                <guid isPermaLink="false">69398520ef68a953062588d1</guid>
                
                    <category>
                        <![CDATA[ optimization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Load Balancing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Amina Lawal ]]>
                </dc:creator>
                <pubDate>Wed, 10 Dec 2025 14:35:12 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765370930321/e4256d2f-cab3-4ae3-9486-c6651e363366.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You've built your first Generative AI feature. Now what? When deploying AI, the challenge is no longer <em>if</em> the model can answer, but <em>how fast</em> it can answer for a user halfway across the globe. Low latency is not a luxury, it's a requirement for good user experience.</p>
<p>Today, we’ve moved beyond simple container deployments and into building <strong>Global AI Architectures</strong>. This setup leverages Google’s infrastructure to deliver context-aware, instant Gen AI responses anywhere in the world. If you're ready to get your hands dirty, let's build the future of global, intelligent features.</p>
<p>In this article, you’re not just going to deploy a container: you’ll be building a global AI architecture.</p>
<p>A global AI architecture is a design pattern that leverages a worldwide network to deploy and manage AI services, ensuring the fastest possible response time (low latency) for users, no matter where they are located. Instead of deploying a feature to a single region, this architecture distributes the service across multiple continents.</p>
<p>Most people deploy a service to a single region. That’s fine for nearby users, but physical distance (and the speed of light) creates terrible latency for everyone else. We are going to eliminate this problem by leveraging Google’s global network to deploy the service in a "triangle" of locations.</p>
<p>The generative AI service you’ll be building is a "Local Guide." This application will be designed to be deeply <strong>hyper-personalized</strong>, changing its personality and providing recommendations based on the user's detected geographical context. For example, if a user is in Paris, the guide will greet them warmly, mentioning their city and suggesting a local activity.</p>
<p>You’re going to build this service to achieve three critical goals:</p>
<ul>
<li><p><strong>Lives Almost Everywhere:</strong> Deployed to three continents simultaneously (USA, Europe, and Asia).</p>
</li>
<li><p><strong>Feels Instant:</strong> Uses Google's global fiber network and Anycast IP to route users to the nearest server, ensuring the lowest possible latency.</p>
</li>
<li><p><strong>Knows Where You Are:</strong> Automatically detects the user's location (without relying on client-side GPS permissions) to provide deeply personalized, location-aware suggestions.</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-phase-1-the-location-aware-code">Phase 1: The "Location-Aware" Code</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-phase-2-build-amp-push">Phase 2: Build &amp; Push</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-phase-3-the-triangle-deployment">Phase 3: The "Triangle" Deployment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-phase-4-the-global-network-the-glue">Phase 4: The Global Network (The Glue)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-phase-5-testing-teleportation-time">Phase 5: Testing (Teleportation Time)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion-the-global-ai-edge">Conclusion: The Global AI Edge</a></p>
</li>
</ul>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>To follow along, you need:</p>
<ol>
<li><p><strong>A Google Cloud Project</strong> (with billing enabled).</p>
</li>
<li><p><strong>Google Cloud Shell</strong> (Recommended! No local setup required). Click the icon in the top right of the GCP Console that looks like a terminal prompt <code>&gt;_</code>.</p>
</li>
</ol>
<p><strong>Note</strong>: The project utilizes various Google Cloud services (Cloud Run, Artifact Registry, Load Balancer, Vertex AI), all of which require a Google Cloud Project with billing enabled to function. While many of these services offer a free tier, you must link a billing account to your project. Although a billing account is required, new Google Cloud users may be eligible for a <a target="_blank" href="https://console.cloud.google.com/freetrial?hl=en&amp;facet_utm_source=google&amp;facet_utm_campaign=%28organic%29&amp;facet_utm_medium=organic&amp;facet_url=https%3A%2F%2Fcloud.google.com%2Fsignup-faqs"><strong>free trial credit</strong></a> that should cover the cost of this lab. <a target="_blank" href="https://cloud.google.com/free/docs/free-cloud-features#free-trial">See credit program eligibility and coverage</a></p>
<h2 id="heading-phase-1-the-location-aware-code"><strong>Phase 1: The "Location-Aware" Code</strong></h2>
<p>We don’t want to build a generic chatbot, so we’ll be building a "Local Guide" that changes its personality based on where the request comes from.</p>
<h3 id="heading-enable-the-apis"><strong>Enable the APIs</strong></h3>
<p>To wake up the services, run this in your terminal:</p>
<pre><code class="lang-bash">gcloud services <span class="hljs-built_in">enable</span> \
  run.googleapis.com \
  artifactregistry.googleapis.com \
  compute.googleapis.com \
  aiplatform.googleapis.com \
  cloudbuild.googleapis.com
</code></pre>
<p>This command enables the necessary Google Cloud APIs for the project:</p>
<ul>
<li><p>Cloud Run (<a target="_blank" href="http://run.googleapis.com">run.googleapis.com</a>)</p>
</li>
<li><p>Artifact Registry (<a target="_blank" href="http://artifactregistry.googleapis.com">artifactregistry.googleapis.com</a>)</p>
</li>
<li><p>Compute Engine (<a target="_blank" href="http://compute.googleapis.com">compute.googleapis.com</a>)</p>
</li>
<li><p>Vertex AI (<a target="_blank" href="http://aiplatform.googleapis.com">aiplatform.googleapis.com</a>)</p>
</li>
<li><p>Cloud Build (<a target="_blank" href="http://cloudbuild.googleapis.com">cloudbuild.googleapis.com</a>).</p>
</li>
</ul>
<p>Enabling them ensures that the services we need are ready to be used.</p>
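<p>Optionally, you can confirm the APIs are active before moving on. This quick check simply greps the enabled-services list and isn't required for the tutorial:</p>
<pre><code class="lang-bash"># Optional sanity check: confirm the five APIs are now enabled
gcloud services list --enabled | grep -E "run|artifactregistry|compute|aiplatform|cloudbuild"
</code></pre>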
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764156603095/fb2ffd56-12e4-4b9f-ac2d-8fbb30fc0a2d.png" alt="Screenshot showing the Google Cloud APIs being successfully completed" class="image--center mx-auto" width="2132" height="280" loading="lazy"></p>
<h3 id="heading-create-and-populate-mainpyhttpmainpy">Create and Populate <a target="_blank" href="http://main.py"><code>main.py</code></a></h3>
<p>This is the brain of our service. In your Cloud Shell terminal, create a file named <a target="_blank" href="http://main.py"><code>main.py</code></a> and paste the following code into it:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> logging
<span class="hljs-keyword">from</span> flask <span class="hljs-keyword">import</span> Flask, request, jsonify
<span class="hljs-keyword">import</span> vertexai
<span class="hljs-keyword">from</span> vertexai.generative_models <span class="hljs-keyword">import</span> GenerativeModel

app = Flask(__name__)

<span class="hljs-comment"># Initialize Vertex AI</span>
PROJECT_ID = os.environ.get(<span class="hljs-string">"GOOGLE_CLOUD_PROJECT"</span>)
vertexai.init(project=PROJECT_ID)

<span class="hljs-meta">@app.route("/", methods=["GET", "POST"])</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate</span>():</span>
    <span class="hljs-comment"># 1. Identify where the code is physically running (We set this ENV var later)</span>
    service_region = os.environ.get(<span class="hljs-string">"SERVICE_REGION"</span>, <span class="hljs-string">"unknown-region"</span>)

    <span class="hljs-comment"># 2. Identify where the user is (Header comes from Global Load Balancer)</span>
    <span class="hljs-comment"># Format typically: "City,State,Country"</span>
    user_location = request.headers.get(<span class="hljs-string">"X-Client-Geo-Location"</span>, <span class="hljs-string">"Unknown Location"</span>)

    model = GenerativeModel(<span class="hljs-string">"gemini-2.5-flash"</span>)

    <span class="hljs-comment"># 3. Construct a location-aware prompt</span>
    prompt = (
        <span class="hljs-string">f"You are a helpful local guide. The user is currently in <span class="hljs-subst">{user_location}</span>. "</span>
        <span class="hljs-string">"Greet them warmly mentioning their city, and suggest one "</span>
        <span class="hljs-string">"hidden gem activity to do nearby right now. Keep it under 50 words."</span>
    )

    <span class="hljs-keyword">try</span>:
        response = model.generate_content(prompt)
        <span class="hljs-keyword">return</span> jsonify({
            <span class="hljs-string">"ai_response"</span>: response.text,
            <span class="hljs-string">"meta"</span>: {
                <span class="hljs-string">"served_from_region"</span>: service_region,
                <span class="hljs-string">"user_detected_location"</span>: user_location
            }
        })
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-keyword">return</span> jsonify({<span class="hljs-string">"error"</span>: str(e)}), <span class="hljs-number">500</span>

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    app.run(debug=<span class="hljs-literal">True</span>, host=<span class="hljs-string">"0.0.0.0"</span>, port=int(os.environ.get(<span class="hljs-string">"PORT"</span>, <span class="hljs-number">8080</span>)))
</code></pre>
<p>It’s a simple Flask web application that relies entirely on a specific HTTP header (<code>X-Client-Geo-Location</code>) that the global load balancer will inject later in the process. This design choice keeps the Python code clean, fast, and focused on using the context that the powerful Google Cloud infrastructure provides. The script uses Vertex AI and the high-performance Gemini 2.5 Flash generative model.</p>
<p>The core logic of the application is a simple Flask web service. It does the following:</p>
<ul>
<li><p><strong>Initialization:</strong> Sets up the Flask app and initializes the Vertex AI client using the project ID.</p>
</li>
<li><p><strong>Context:</strong> It extracts two critical pieces of information: the <code>SERVICE_REGION</code> (where the code is physically running) from the environment variable, and the <code>X-Client-Geo-Location</code> (the user's detected location) from the request header, which will be injected by the global load balancer.</p>
</li>
<li><p><strong>AI Generation:</strong> It uses the high-performance <code>gemini-2.5-flash</code> model.</p>
</li>
<li><p><strong>Prompt Construction:</strong> A dynamic, location-aware prompt is built using the detected city to instruct Gemini to act as a helpful local guide and provide a personalized suggestion.</p>
</li>
<li><p><strong>Response:</strong> The response includes the AI's generated text and a <code>meta</code> section containing both the serving region and the user's detected location, which helps in verification.</p>
</li>
</ul>
<h3 id="heading-create-the-dockerfile"><strong>Create the</strong> <code>Dockerfile</code></h3>
<p>This Dockerfile tells Cloud Run how to build the Python application into a container image. Create a file named <code>Dockerfile</code> in the same directory as <code>main.py</code> and paste the following content into it:</p>
<pre><code class="lang-dockerfile"><span class="hljs-keyword">FROM</span> python:<span class="hljs-number">3.9</span>-slim

<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>
<span class="hljs-keyword">COPY</span><span class="bash"> main.py .</span>

<span class="hljs-comment"># Install Flask and Vertex AI SDK</span>
<span class="hljs-keyword">RUN</span><span class="bash"> pip install flask google-cloud-aiplatform</span>

<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"python"</span>, <span class="hljs-string">"main.py"</span>]</span>
</code></pre>
<p>Here’s what the code does:</p>
<ul>
<li><p>Starts with a lightweight Python base image <code>python:3.9-slim</code>.</p>
</li>
<li><p>Sets the working directory inside the container <code>WORKDIR /app</code>.</p>
</li>
<li><p>Copies your application code into the container.</p>
</li>
<li><p><code>RUN pip install...</code> installs the required Python packages: Flask for the web server and <code>google-cloud-aiplatform</code> for accessing the Gemini model.</p>
</li>
<li><p><code>CMD</code> specifies the command to run when the container starts.</p>
</li>
</ul>
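<p>If you'd like to sanity-check the image locally before using Cloud Build (optional, and assuming Docker is available, as it is in Cloud Shell), you can build and start the container yourself. The AI call will fail without Google Cloud credentials inside the container, but a successful startup confirms the image builds and listens on port 8080:</p>
<pre><code class="lang-bash"># Optional local smoke test (run from the directory containing main.py and the Dockerfile)
docker build -t region-ai-local .
docker run --rm -p 8080:8080 -e PORT=8080 -e SERVICE_REGION=local-test region-ai-local
</code></pre>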
<h2 id="heading-phase-2-build-amp-push"><strong>Phase 2: Build &amp; Push</strong></h2>
<p>Let's package this up. For efficiency and consistency, we’ll follow the best practice of Build Once, Deploy Many. We’ll build the container image once using Cloud Build and store it in Google's Artifact Registry. This guarantees that the same tested application code runs in New York, Belgium, and Tokyo.</p>
<p>First, set an environment variable for your Google Cloud Project ID to simplify later commands:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># 1. Set your Project ID variable</span>
<span class="hljs-built_in">export</span> PROJECT_ID=$(gcloud config get-value project)
</code></pre>
<p>Then create a new Docker repository named <code>gemini-global-repo</code> in the <code>us-central1</code> region to store the application container image:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># 2. Create the repository</span>
gcloud artifacts repositories create gemini-global-repo \
    --repository-format=docker \
    --location=us-central1 \
    --description=<span class="hljs-string">"Repo for Global Gemini App"</span>
</code></pre>
<p>Using the <code>mkdir gemini-app</code> command, create and navigate into a dedicated directory, and make sure your <code>main.py</code> and <code>Dockerfile</code> are placed inside it:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># 3. Prepare the Build Environment (Crucial Step! 💡). To ensure the build process only includes our necessary code and avoids including temporary files from Cloud Shell's home directory </span>
mkdir gemini-app
<span class="hljs-built_in">cd</span> gemini-app
</code></pre>
<p>Next, use <code>gcloud builds submit --tag</code> to build the container image from the files in the current directory and push the resulting image to the newly created Artifact Registry repository:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># 4. Build the image (This takes about 2 minutes)</span>
gcloud builds submit --tag us-central1-docker.pkg.dev/<span class="hljs-variable">$PROJECT_ID</span>/gemini-global-repo/region-ai:v1
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764159484475/97a5b2b6-f3c2-4d1b-8bf8-6f302748e744.png" alt="Screenshot of Cloud Shell Editor showing Dockerfile and terminal build output." class="image--center mx-auto" width="2880" height="1348" loading="lazy"></p>
<p><strong>NOTE:</strong> You might notice that we created the Artifact Registry repository (<code>gemini-global-repo</code>) in the <code>us-central1</code> region. This choice is purely for management and storage of the container image. When you create an image and push it to a regional Artifact Registry, the resulting image is still accessible globally. For this lab, <code>us-central1</code> serves as a reliable, central location for our single, canonical container image, the single source of truth, which is then pulled by Cloud Run in the three separate global regions.</p>
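<p>If you'd like to confirm the push succeeded, you can list the images in the new repository (optional):</p>
<pre><code class="lang-bash"># Optional: confirm the image landed in Artifact Registry
gcloud artifacts docker images list us-central1-docker.pkg.dev/$PROJECT_ID/gemini-global-repo
</code></pre>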
<h2 id="heading-phase-3-the-triangle-deployment"><strong>Phase 3: The "Triangle" Deployment</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764661657796/0890a47b-589a-4cf8-b537-bb61e5e65ee7.png" alt="Diagram of the Global AI Architecture Triangle Deployment." class="image--center mx-auto" width="1024" height="1024" loading="lazy"></p>
<p>We’ll deploy the same image to three corners of the world, forming our "Triangle". This ensures that whether a user is in Lagos, London, or Tokyo, they’ll be geographically close to a server. This is the low-latency core of our architecture.</p>
<p>We’ll use Cloud Run to deploy our services. Cloud Run is a fully managed serverless platform on Google Cloud that enables you to run stateless containers via web requests or events. Crucially, it is serverless, meaning you don't manage any virtual machines, operating system updates, or scaling infrastructure. You provide a container image, and Cloud Run automatically scales it up (and down to zero) in the region you specify.</p>
<p>For this project, we’ll use its regional deployment capability to easily and consistently deploy the exact same container image to New York, Belgium, and Tokyo.</p>
<p><strong>Note:</strong> Setting it up primarily involves enabling the API (done in Phase 1) and using the <code>gcloud run deploy</code> command, which handles provisioning and managing the service in the specified region.</p>
<p>Now, we’ll proceed to deploy the single, canonical container image to three separate Cloud Run regions, forming the "Triangle Deployment".</p>
<p>First, set a variable for the image path, pointing to the image stored in Artifact Registry.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Define our image URL</span>
<span class="hljs-built_in">export</span> IMAGE_URL=us-central1-docker.pkg.dev/<span class="hljs-variable">$PROJECT_ID</span>/gemini-global-repo/region-ai:v1
</code></pre>
<pre><code class="lang-bash">
<span class="hljs-comment"># 1. Deploy to USA (New York)</span>
gcloud run deploy gemini-service \
    --image <span class="hljs-variable">$IMAGE_URL</span> \
    --region us-east4 \
    --set-env-vars SERVICE_REGION=us-east4 \
    --allow-unauthenticated

<span class="hljs-comment"># 2. Deploy to Europe (Belgium)</span>
gcloud run deploy gemini-service \
    --image <span class="hljs-variable">$IMAGE_URL</span> \
    --region europe-west1 \
    --set-env-vars SERVICE_REGION=europe-west1 \
    --allow-unauthenticated

<span class="hljs-comment"># 3. Deploy to Asia (Tokyo)</span>
gcloud run deploy gemini-service \
    --image <span class="hljs-variable">$IMAGE_URL</span> \
    --region asia-northeast1 \
    --set-env-vars SERVICE_REGION=asia-northeast1 \
    --allow-unauthenticated
</code></pre>
<p><code>gcloud run deploy gemini-service...</code> deploys the service. Key flags:</p>
<ul>
<li><p><code>--image $IMAGE_URL</code> specifies the container image to use.</p>
</li>
<li><p><code>--region</code> specifies the deployment region (for example, <code>us-east4</code> for New York).</p>
</li>
<li><p><code>--set-env-vars SERVICE_REGION=...</code> injects an environment variable into the running container to let the <code>main.py</code> code know its own physical region.</p>
</li>
<li><p><code>--allow-unauthenticated</code> makes the service publicly accessible, as required for the Load Balancer to connect.</p>
</li>
</ul>
<p><strong>Note:</strong> The commands are repeated for Europe (<code>europe-west1</code>) and Asia (<code>asia-northeast1</code>) regions.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764160600271/fbb6a810-7496-4b29-a405-b67a22a988ed.png" alt="Screenshot of Cloud Shell terminal showing the execution of the cloud run services." class="image--right mx-auto mr-0" width="2880" height="1348" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764160624375/dd4dc7e7-22a9-4d8b-a36c-7a0988068f57.png" alt="Cloud run Service Url (asia region) terminal screenshot showing the successful execution of the service" class="image--center mx-auto" width="2880" height="536" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764160656898/1b6ca938-9ce4-48f6-bb3b-d09900dbde68.png" alt="Cloud run Service Url (europe region) terminal screenshot showing the successful execution of the service" class="image--center mx-auto" width="2880" height="536" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764160665595/39c2524d-62c8-4187-8b8f-15f7ebbffba4.png" alt="Cloud run Service Url (us-east region) terminal screenshot showing the successful execution of the service" class="image--center mx-auto" width="2880" height="536" loading="lazy"></p>
<p>At this point, <code>user_detected_location</code> is always "Unknown Location". This is expected: you are accessing the Cloud Run URLs directly, not via the global load balancer, so the <code>X-Client-Geo-Location</code> header is not yet being injected.</p>
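<p>If you want to see this for yourself, you can call one of the regional services directly. This optional check uses <code>gcloud run services describe</code> to grab the service URL for one region:</p>
<pre><code class="lang-bash"># Optional: call the us-east4 service directly (no load balancer, so no geo header)
REGIONAL_URL=$(gcloud run services describe gemini-service --region us-east4 --format="value(status.url)")
curl -s "$REGIONAL_URL" | jq .
</code></pre>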
<h2 id="heading-phase-4-the-global-network-the-glue"><strong>Phase 4: The Global Network (The Glue)</strong></h2>
<p>You are now ready to execute the steps to create the <strong>Global External HTTP Load Balancer</strong> infrastructure. This is the "magic" that stitches the three regional services together behind a single <strong>Anycast IP Address</strong>. The load balancer performs two critical functions:</p>
<ol>
<li><p><strong>Global Routing:</strong> It uses Google’s high-speed network to automatically route the user to the closest available region (for example, Tokyo user → Asia service).</p>
</li>
<li><p><strong>Context Injection:</strong> It dynamically adds the <code>X-Client-Geo-Location</code> header to the request, telling your code exactly where the user is.</p>
</li>
</ol>
<h3 id="heading-the-global-ip"><strong>The Global IP</strong></h3>
<p><code>gcloud compute addresses create...</code> creates a single, global, static Anycast IP address (<code>gemini-global-ip</code>) that will serve as the single public entry point for users worldwide:</p>
<pre><code class="lang-bash">gcloud compute addresses create gemini-global-ip \
    --global \
    --ip-version IPV4
</code></pre>
<h3 id="heading-the-network-endpoint-groups-negs"><strong>The Network Endpoint Groups (NEGs)</strong></h3>
<p><code>gcloud compute network-endpoint-groups create...</code> creates a <strong>Serverless Network Endpoint Group (NEG)</strong> for each regional Cloud Run deployment. For example, <code>neg-us</code> is created in <code>us-east4</code> and points to the <code>gemini-service</code> in that region. These map your Cloud Run services to the Load Balancer's backend service:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># USA NEG</span>
gcloud compute network-endpoint-groups create neg-us \
    --region=us-east4 \
    --network-endpoint-type=serverless  \
    --cloud-run-service=gemini-service

<span class="hljs-comment"># Europe NEG</span>
gcloud compute network-endpoint-groups create neg-eu \
    --region=europe-west1 \
    --network-endpoint-type=serverless \
    --cloud-run-service=gemini-service

<span class="hljs-comment"># Asia NEG</span>
gcloud compute network-endpoint-groups create neg-asia \
    --region=asia-northeast1 \
    --network-endpoint-type=serverless \
    --cloud-run-service=gemini-service
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764161003478/549c959d-8ab5-45d6-a2ae-94129529b5b4.png" alt="Screenshot of Cloud Shell terminal showing the execution of global load balancer setup commands." class="image--center mx-auto" width="2880" height="1010" loading="lazy"></p>
<h3 id="heading-the-backend-service-amp-routing"><strong>The Backend Service &amp; Routing</strong></h3>
<p>This is the load balancer's core, distributing traffic across your regions. Connect the NEGs to a global backend.</p>
<p><code>gcloud compute backend-services create...</code> creates the global backend service (<code>gemini-backend-global</code>), which is the core component that manages traffic distribution:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create the backend service</span>
gcloud compute backend-services create gemini-backend-global \
    --global \
    --protocol=HTTP
</code></pre>
<p><code>gcloud compute backend-services add-backend...</code> adds all three regional NEGs (<code>neg-us</code>, <code>neg-eu</code>, <code>neg-asia</code>) as backends to the global service. This tells the load balancer where all the services are located:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Add the 3 regions to the backend</span>
gcloud compute backend-services add-backend gemini-backend-global \
    --global --network-endpoint-group=neg-us --network-endpoint-group-region=us-east4
gcloud compute backend-services add-backend gemini-backend-global \
    --global --network-endpoint-group=neg-eu --network-endpoint-group-region=europe-west1
gcloud compute backend-services add-backend gemini-backend-global \
    --global --network-endpoint-group=neg-asia --network-endpoint-group-region=asia-northeast1
</code></pre>
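<p>One hedged aside: depending on your setup, the geolocation header may need to be configured explicitly as a custom request header on the backend service. If you find later that <code>X-Client-Geo-Location</code> never shows up in real (non-simulated) requests, a sketch of how you might add it is below. The <code>{client_city}</code> and <code>{client_region}</code> placeholders are Cloud Load Balancing custom-header variables; double-check the exact names against the current docs:</p>
<pre><code class="lang-bash"># Optional: explicitly inject the geolocation header on the backend service
gcloud compute backend-services update gemini-backend-global \
    --global \
    --custom-request-header="X-Client-Geo-Location: {client_city},{client_region}"
</code></pre>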
<h3 id="heading-the-url-map-amp-frontend"><strong>The URL Map &amp; Frontend</strong></h3>
<p>Now we can finalize the connection.</p>
<p><code>gcloud compute url-maps create...</code> creates a URL Map (<code>gemini-url-map</code>) to direct all incoming traffic to the Backend Service:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create URL Map (Maps incoming requests to the backend service)</span>
gcloud compute url-maps create gemini-url-map \
    --default-service gemini-backend-global
</code></pre>
<p><code>gcloud compute target-http-proxies create...</code> creates an HTTP Proxy (<code>gemini-http-proxy</code>) that inspects the request and directs it based on the URL map.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create HTTP Proxy (The component that inspects the request headers)</span>
gcloud compute target-http-proxies create gemini-http-proxy \
    --url-map gemini-url-map
</code></pre>
<p><code>export VIP=...</code> retrieves the final, public IP address of the newly created Global IP and stores it in the <code>VIP</code> environment variable.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Get your IP Address variable</span>
<span class="hljs-built_in">export</span> VIP=$(gcloud compute addresses describe gemini-global-ip --global --format=<span class="hljs-string">"value(address)"</span>)
</code></pre>
<p><code>gcloud compute forwarding-rules create...</code> creates the final global Forwarding Rule (<code>gemini-forwarding-rule</code>). This links the Global IP (<code>$VIP</code>) to the HTTP Proxy and opens port 80 for public traffic.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create Forwarding Rule (Open port 80)</span>
gcloud compute forwarding-rules create gemini-forwarding-rule \
    --address=<span class="hljs-variable">$VIP</span> \
    --global \
    --target-http-proxy=gemini-http-proxy \
    --ports=80
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764161323862/299c6c43-9074-493c-95b1-2c08208aa2ec.png" alt="Cloud Shell terminal screenshot showing the successful execution of commands to create the gemini-backend-global service" class="image--center mx-auto" width="2880" height="1010" loading="lazy"></p>
<h2 id="heading-phase-5-testing-teleportation-time"><strong>Phase 5: Testing (Teleportation Time)</strong></h2>
<p>Global load balancers take about <strong>5-7 minutes</strong> to propagate worldwide. Once propagation completes, you'll verify that the global load balancer is working correctly by confirming that it is:</p>
<ul>
<li><p>Using the single <strong>VIP</strong> (Virtual IP) address.</p>
</li>
<li><p><strong>Routing traffic</strong> to the nearest server.</p>
</li>
<li><p><strong>Injecting the</strong> <code>X-Client-Geo-Location</code> header to tell your code where the user is.</p>
</li>
</ul>
<h3 id="heading-1-get-your-global-ip"><strong>1. Get your Global IP</strong></h3>
<p>First, ensure your <code>VIP</code> variable is set and retrieve the final address:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">echo</span> <span class="hljs-string">"http://<span class="hljs-variable">$VIP</span>/"</span>
</code></pre>
<p>The output will be your single point of entry for the entire global architecture.</p>
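<p>Since propagation takes a few minutes, you may want to poll the address until it starts returning responses instead of errors. A small optional loop for that:</p>
<pre><code class="lang-bash"># Optional: poll until the load balancer starts answering (press Ctrl+C to stop)
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" "http://$VIP/"
  sleep 15
done
</code></pre>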
<h3 id="heading-2-test-teleportation"><strong>2. Test "Teleportation"</strong></h3>
<p>These <code>curl</code> commands simulate a user requesting the service from different geographical locations by manually injecting the <code>X-Client-Geo-Location</code> header. This bypasses the need to be physically in those locations for testing.</p>
<h4 id="heading-simulate-europe-paris">Simulate Europe (Paris)</h4>
<p>We expect this to be served by the <code>europe-west1</code> region because it's the closest server.</p>
<pre><code class="lang-bash">curl -H <span class="hljs-string">"X-Client-Geo-Location: Paris,France"</span> http://<span class="hljs-variable">$VIP</span>/
</code></pre>
<p><em>Expected Output:</em> Gemini should say "Bonjour" and mention Paris. The <code>served_from_region</code> should be <code>europe-west1</code>.</p>
<h4 id="heading-simulate-asia-tokyo">Simulate Asia (Tokyo)</h4>
<p>We expect this to be served by the <code>asia-northeast1</code> region.</p>
<pre><code class="lang-bash">curl -H <span class="hljs-string">"X-Client-Geo-Location: Tokyo,Japan"</span> http://<span class="hljs-variable">$VIP</span>/
</code></pre>
<p><em>Expected Output:</em> Gemini should mention Tokyo. The <code>served_from_region</code> should be <code>asia-northeast1</code>.</p>
<h4 id="heading-simulate-usa-new-york">Simulate USA (New York)</h4>
<p>We expect this to be served by the <code>us-east4</code> region.</p>
<pre><code class="lang-bash">curl -s -H <span class="hljs-string">"X-Client-Geo-Location: New York,USA"</span> http://<span class="hljs-variable">$VIP</span>/ | jq .
</code></pre>
<p><em>Expected Output:</em> Gemini should mention New York. The <code>served_from_region</code> should be <code>us-east4</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764161891891/ecc290ef-1c75-4088-b453-093a92b404ff.png" alt="Cloud Shell terminal screenshot showing the results of curl commands simulating users in Paris, Tokyo, and New York." class="image--center mx-auto" width="2880" height="1010" loading="lazy"></p>
<p><strong>Note:</strong> The <code>| jq .</code> part is optional, but highly recommended as it formats the JSON output, making it much easier to read the <code>served_from_region</code> and <code>ai_response</code> details. If <code>jq</code> isn't available, you can just run <code>curl ...</code> without it.</p>
<h2 id="heading-conclusion-the-global-ai-edge">Conclusion: The Global AI Edge</h2>
<p>Congratulations! You have successfully built a sophisticated, global AI architecture that solves the challenges of latency and personalization for generative AI features. By combining the following technologies, you achieved two critical outcomes:</p>
<ul>
<li><p><strong>Guaranteed Low Latency:</strong> By deploying the <strong>Cloud Run</strong> service to a "Triangle" of global regions (USA, Europe, Asia) and using the <strong>Global External HTTP Load Balancer's Anycast IP</strong>, your users are automatically routed across Google’s private fiber network to the closest available server.</p>
</li>
<li><p><strong>Hyper-Personalization:</strong> The global load balancer was configured to dynamically inject the user's geographical location via the <code>X-Client-Geo-Location</code> header. This context was passed directly to the <strong>Gemini 2.5 Flash</strong> model, allowing it to act as a truly location-aware "Local Guide".</p>
</li>
</ul>
<p>This pattern allows you to scale intelligent features globally and is immediately applicable to any application where speed and context are essential, from real-time translations to hyper-local recommendations.</p>
<h3 id="heading-cleanup"><strong>Cleanup</strong></h3>
<p>Don't leave the meter running! Remember to execute the cleanup commands to ensure you don't incur unnecessary charges:</p>
<pre><code class="lang-bash">gcloud run services delete gemini-service --region us-east4 --quiet
gcloud run services delete gemini-service --region europe-west1 --quiet
gcloud run services delete gemini-service --region asia-northeast1 --quiet
gcloud compute forwarding-rules delete gemini-forwarding-rule --global --quiet
gcloud compute addresses delete gemini-global-ip --global --quiet
gcloud compute backend-services delete gemini-backend-global --global --quiet
gcloud compute url-maps delete gemini-url-map --global --quiet
gcloud compute target-http-proxies delete gemini-http-proxy --global --quiet
</code></pre>
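<p>The commands above remove the services and the core load balancer pieces, deleting each resource before the ones it depends on. If you also want to clear out the serverless NEGs and the Artifact Registry repository created earlier (they cost little or nothing, but it keeps the project tidy), these should do it:</p>
<pre><code class="lang-bash"># Optional extra cleanup: remove the NEGs and the image repository
gcloud compute network-endpoint-groups delete neg-us --region=us-east4 --quiet
gcloud compute network-endpoint-groups delete neg-eu --region=europe-west1 --quiet
gcloud compute network-endpoint-groups delete neg-asia --region=asia-northeast1 --quiet
gcloud artifacts repositories delete gemini-global-repo --location=us-central1 --quiet
</code></pre>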
<h3 id="heading-resources">Resources</h3>
<ul>
<li><p>Google Cloud Shell Documentation</p>
</li>
<li><p><a target="_blank" href="https://www.google.com/search?q=https://cloud.google.com/vertex-ai/docs/generative-ai/learn/sdk">Vertex AI Generative AI SDK</a></p>
</li>
<li><p><a target="_blank" href="https://cloud.google.com/artifact-registry/docs">Artifact Registry Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://cloud.google.com/run/docs">Cloud Run Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://www.google.com/search?q=https://cloud.google.com/load-balancing/docs/load-balancing-overview%23external_http_s_load_balancing">Global External HTTP(S) Load Balancer Overview</a></p>
</li>
<li><p><a target="_blank" href="https://www.google.com/search?q=https://cloud.google.com/load-balancing/docs/negs/serverless-neg-overview">Serverless Network Endpoint Groups (NEGs)</a></p>
</li>
<li><p><a target="_blank" href="https://docs.cloud.google.com/run/docs/multiple-regions">Serve traffic from multiple regions</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Load Balancing with Azure Application Gateway and Azure Load Balancer – When to Use Each One ]]>
                </title>
                <description>
                    <![CDATA[ You’ve probably heard someone mention load balancing when talking about cloud apps. Maybe even names like Azure Load Balancer, Azure Application Gateway, or something about Virtual Machines and Scale Sets. 😵‍💫 It all sounds important...but also a l... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/load-balancing-with-azure-application-gateway-and-azure-load-balancer/</link>
                <guid isPermaLink="false">6824f10a7d203c180e5ea4b2</guid>
                
                    <category>
                        <![CDATA[ Load Balancing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Azure ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Azure Application Gateway ]]>
                    </category>
                
                    <category>
                        <![CDATA[ virtual machine ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #virtual machine scale set ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Load Balancer ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Prince Onukwili ]]>
                </dc:creator>
                <pubDate>Wed, 14 May 2025 19:37:46 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747235455030/cb82bfb4-8d7b-47e5-ab31-126906f60b40.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You’ve probably heard someone mention load balancing when talking about cloud apps. Maybe even names like Azure Load Balancer, Azure Application Gateway, or something about Virtual Machines and Scale Sets. 😵‍💫</p>
<p>It all sounds important...but also a little confusing. Like, why are there so many moving parts? And what do they actually do?</p>
<p>In this guide, we’re going to break it all down – step by step – using real examples and simple language.</p>
<p>You’ll learn:</p>
<ul>
<li><p>What load balancers are (and why apps even need them)</p>
</li>
<li><p>How apps were deployed before load balancers existed (hint: everything lived on one lonely server)</p>
</li>
<li><p>How Azure Virtual Machines work – and how they let you scale up your apps</p>
</li>
<li><p>What Virtual Machine Scale Sets are, and how they help handle sudden traffic spikes</p>
</li>
<li><p>The differences between Azure Load Balancer and Azure Application Gateway, and when to use each</p>
</li>
</ul>
<p>By the end, you won’t just understand what these tools do – you’ll know <em>when</em> and <em>why</em> to use them in real-world scenarios.</p>
<p>Whether you’re a curious beginner, a hands-on builder, or someone just trying to wrap their head around Azure’s ecosystem, this guide is for you.</p>
<p>Ready to untangle the cloud spaghetti? Let’s go! 🍝🚀</p>
<h2 id="heading-table-of-contents">📚 Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-are-load-balancers">🧊 What Are Load Balancers?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-applications-were-deployed-before-load-balancers">🖥️ How Applications Were Deployed Before Load Balancers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-azure-virtual-machines-vms-the-building-blocks">⚙️ Azure Virtual Machines (VMs) – The Building Blocks</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-need-for-scaling-vertical-vs-horizontal">📈 The Need for Scaling – Vertical vs Horizontal</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-azure-virtual-machine-scale-sets-vmss-scaling-made-simple">🔁 Azure Virtual Machine Scale Sets (VMSS) – Scaling Made Simple</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-azure-load-balancer-spreading-the-traffic">📦 Azure Load Balancer – Spreading the Traffic</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-azure-application-gateway-smart-routing-for-modern-apps">🍴 Azure Application Gateway – Smart Routing for Modern Apps</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-azure-load-balancer-vs-azure-application-gateway">🔍 Azure Load Balancer vs Azure Application Gateway</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-use-cases-when-to-use-what">🧭</a> <a class="post-section-overview" href="#heading-use-cases-when-to-use-each-one">Use Cases: When to Use Each One</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">✅ Conclusion</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-study-further">Study Further 📚</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-about-the-author">About the Author 👨‍💻</a></p>
</li>
</ol>
<h2 id="heading-what-are-load-balancers">🧊 What Are Load Balancers?</h2>
<p>Imagine you're running a small restaurant with just one chef in the kitchen. Everything goes smoothly when you have a few customers – each order is prepared one after the other, and everyone leaves satisfied.</p>
<p>But what happens when 50 people walk in all at once?</p>
<p>🍽️ One chef can’t handle that many orders at the same time.<br>⏳ People start waiting longer.<br>😤 Some customers leave.<br>💥 The chef gets overwhelmed – and eventually burns out.</p>
<p>This is what can happen to a server (the computer running your app) when too many users try to access it at the same time.</p>
<h3 id="heading-so-what-does-a-load-balancer-do">So, What Does a Load Balancer Do?</h3>
<p>A <strong>load balancer</strong> is like a smart restaurant manager. But instead of food orders, it handles user requests – the things people do when they open your app, click buttons, or load data.</p>
<p>Let’s say you now have three chefs (servers) instead of one. The load balancer’s job is to:</p>
<ul>
<li><p>👀 Watch for incoming orders (user requests)</p>
</li>
<li><p>🧠 Decide which chef (server) is available or least busy</p>
</li>
<li><p>🍽️ Send that request to the right one</p>
</li>
<li><p>🔁 Repeat this over and over, making sure things stay fast and smooth</p>
</li>
</ul>
<p>So in simple terms, a load balancer takes all the incoming traffic to your app and distributes it across multiple servers so no single server gets overloaded – cool, right? 🙂</p>
<h3 id="heading-why-were-load-balancers-introduced">Why Were Load Balancers Introduced?</h3>
<p>Back in the early days, many applications were hosted on just one machine – called a Single Server Deployment.</p>
<p>That was okay when you had a small number of users. But once things started to grow – more users, more actions, more data – single servers became a bottleneck:</p>
<ul>
<li><p>They could only handle a limited number of requests.</p>
</li>
<li><p>If they went down, your entire app would stop working.</p>
</li>
<li><p>Scaling (adding more power) was expensive and manual.</p>
</li>
</ul>
<p>💡 Enter <strong>load balancers</strong> – designed to solve this by making it possible to:</p>
<ul>
<li><p>Spread traffic across multiple servers (so no one server crashes under pressure),</p>
</li>
<li><p>Replace or restart servers without downtime,</p>
</li>
<li><p>Add or remove servers as needed, depending on how busy your app is (this is called <strong>scaling</strong>).</p>
</li>
</ul>
<h3 id="heading-a-simple-use-case-scenario">A Simple Use-Case Scenario</h3>
<p>Let’s say you're building an online store — your own mini Amazon. At first, you host your app on one Azure Virtual Machine. Things are great. But one day, you run a huge promo and suddenly…thousands of people flood in to browse, shop, and check out.</p>
<p>Your single VM starts lagging.</p>
<p>Orders fail. People complain. Your dream app? Crashing fast. 💥</p>
<p>So what do you do?</p>
<p>You spin up two more VMs to help out – but now you’ve got another problem: <em>How do you divide the traffic between the three?</em></p>
<p>This is where the load balancer steps in. It:</p>
<ul>
<li><p>Looks at every incoming user request</p>
</li>
<li><p>Figures out which VM is available and least busy</p>
</li>
<li><p>Sends the request there</p>
</li>
<li><p>Keeps rotating requests in real-time</p>
</li>
</ul>
<p>And the result?<br>✅ No single VM gets overwhelmed<br>✅ Your app stays fast and responsive<br>✅ Users are happy (and buying stuff again!)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746980088916/41be330b-8d5b-4709-b07d-3f1a19d641e7.png" alt="Load balancer illustration" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-applications-were-deployed-before-load-balancers">🖥️ How Applications Were Deployed Before Load Balancers</h2>
<p>Before cloud tools like load balancers came along, the typical way to run an application was pretty simple: You’d deploy the entire app on a single server, like running a small business from one tiny shop.</p>
<h3 id="heading-first-things-first-whats-a-server">First Things First: What’s a Server?</h3>
<p>Think of a server as a special computer that’s always connected to the internet. Its job is to “serve” your app to people when they visit your website, open your app, or use your service.</p>
<p>In cloud platforms like Azure, we usually call these Virtual Machines (VMs) – basically, software-powered servers you can spin up with a few clicks.</p>
<h3 id="heading-monoliths-vs-microservices">Monoliths vs Microservices</h3>
<p>Now, applications come in different “shapes.” The two most common are:</p>
<ul>
<li><p><strong>Monoliths</strong>: Everything is bundled together into one big app. All the code – from user login to shopping cart to checkout – lives in a single unit.</p>
</li>
<li><p><strong>Microservices</strong>: The app is broken into smaller, independent apps (services). Each service does one job – like login, payments, orders – and runs separately.</p>
</li>
</ul>
<h4 id="heading-how-were-these-apps-deployed">How Were These Apps Deployed?</h4>
<p>Whether it was a monolith or a bunch of microservices, they were all usually deployed on a single server (VM).</p>
<p>For monoliths, you just ran the entire app directly on the server. For microservices, you'd run each service in a separate space on that same server, using <strong>containers</strong>.</p>
<h4 id="heading-wait-whats-a-container">Wait — What’s a Container?</h4>
<p>A container is like a mini-computer <em>inside</em> a computer. It has everything an app needs to run – code, tools, settings – and it keeps each app isolated from the others.</p>
<p>Why use containers?</p>
<ul>
<li><p>You can run multiple services on the same server without their underlying software (software needed for each app to run) interfering with each other.</p>
</li>
<li><p>It’s faster and more efficient than installing everything directly on the server.</p>
</li>
<li><p>They make moving apps between environments (for example, test → production) super smooth (no more “But, it works on my machine…”).</p>
</li>
</ul>
<p>Popular tools like Docker make working with containers easy.</p>
<h4 id="heading-connecting-it-all-together-domains-subdomains-and-reverse-proxies">Connecting It All Together: Domains, Subdomains, and Reverse Proxies</h4>
<p>When your app lives on a server, you want people to be able to reach it. That’s where <strong>domain names</strong> come in.</p>
<ul>
<li><p>Your server has a public IP address – a set of numbers like <code>102.80.1.23</code> – that uniquely identifies it on the public internet</p>
</li>
<li><p>But instead of asking users to type numbers, you link that IP to a domain name, like <code>mycoolapp.com</code></p>
</li>
</ul>
<p>If your app has microservices, you might even assign <strong>subdomains</strong> like:</p>
<ul>
<li><p><code>api.mycoolapp.com</code> for the backend</p>
</li>
<li><p><code>dashboard.mycoolapp.com</code> for the user interface</p>
</li>
<li><p><code>payments.mycoolapp.com</code> for payments</p>
</li>
</ul>
<p>To manage all this, you’d use a <strong>reverse proxy</strong> (like Nginx or Apache). It listens on the main domain and subdomains, and forwards traffic to the right app or service.</p>
<p>Example:</p>
<ul>
<li><p>Someone visits <code>dashboard.mycoolapp.com</code></p>
</li>
<li><p>The reverse proxy checks the domain and forwards the request to the correct container running the dashboard service</p>
</li>
</ul>
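<p>To make the example above a bit more concrete from a terminal: all of these subdomains can point at the same server IP, and the reverse proxy chooses the right service purely from the <code>Host</code> header of the request. In this sketch, the IP is the made-up example address from earlier:</p>
<pre><code class="lang-bash"># Both requests hit the same IP; the reverse proxy routes each one by its Host header
curl -H "Host: api.mycoolapp.com" http://102.80.1.23/
curl -H "Host: dashboard.mycoolapp.com" http://102.80.1.23/
</code></pre>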
<p>And to help with all of this setup – from deploying containers to configuring reverse proxies – there are developer-friendly tools like <a target="_blank" href="https://coolify.io">Coolify</a>. Coolify is an open-source platform that makes it super easy for developers and DevOps teams to:</p>
<ul>
<li><p>Deploy apps in containers</p>
</li>
<li><p>Set up domains and subdomains</p>
</li>
<li><p>Configure reverse proxies – all from a clean dashboard, no complex terminal commands needed</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746979943646/a6525a09-f44a-4e00-a945-7bded3483b0d.jpeg" alt="Coolify dashboard example" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>All this was set up on ONE SERVER/VM. But here’s the catch: when that one server got overloaded or went down…💥 everything stopped.</p>
<p>That’s why we needed a better way. And that's where <strong>scaling</strong> and <strong>load balancing</strong> came in – to keep apps running smoothly, no matter the traffic.</p>
<h2 id="heading-azure-virtual-machines-vms-the-building-blocks">⚙️ Azure Virtual Machines (VMs) – The Building Blocks</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746980948928/eb6a7fb2-7432-42ed-8cbd-bff6c8250d4e.jpeg" alt="Virtual Machine illustration" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>When it comes to running apps in the cloud, <strong>Virtual Machines (VMs)</strong> are the basic building blocks – kind of like renting an apartment in a giant digital skyscraper.</p>
<p>You don’t need to buy the whole building (aka physical servers), you just rent the space you need, when you need it.</p>
<h3 id="heading-what-exactly-is-a-virtual-machine">What Exactly Is a Virtual Machine?</h3>
<p>A Virtual Machine is a software-based computer that runs inside a real, physical computer (a server) – hosted in a data center, like those run by Microsoft Azure.</p>
<p>It looks and behaves like a normal computer:</p>
<ul>
<li><p>It has an operating system (Windows, Linux)</p>
</li>
<li><p>You can install apps</p>
</li>
<li><p>It has memory (RAM), storage (disks), and CPU</p>
</li>
</ul>
<p>But the best part? You don’t need to worry about the hardware. Azure takes care of that behind the scenes – all you do is say:</p>
<blockquote>
<p>“Hey Azure, give me a Linux VM with 4GB RAM and 2 CPUs.”</p>
</blockquote>
<p>And boom 💥 — it spins up in minutes.</p>
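<p>On the command line, asking Azure for that VM looks roughly like this. Treat it as an illustrative sketch: the resource group and VM names (<code>my-rg</code>, <code>my-blog-vm</code>) are placeholders, and the size and image are common choices you'd adjust to your needs:</p>
<pre><code class="lang-bash"># Create a resource group, then a small Ubuntu VM (2 vCPUs, 4GB RAM) inside it
az group create --name my-rg --location eastus

az vm create \
  --resource-group my-rg \
  --name my-blog-vm \
  --image Ubuntu2204 \
  --size Standard_B2s \
  --admin-username azureuser \
  --generate-ssh-keys
</code></pre>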
<h3 id="heading-why-use-a-vm">Why Use a VM?</h3>
<p>Let’s say you’ve built a web app – it’s just a simple blog. You want to deploy it and make it accessible to the world.</p>
<p>Here's what you can do with a VM:</p>
<ul>
<li><p>Set it up with your favorite OS (for example, Ubuntu)</p>
</li>
<li><p>Install web servers like Nginx or Apache</p>
</li>
<li><p>Deploy your app</p>
</li>
<li><p>Bind it to your domain name</p>
</li>
<li><p>Let the world visit your blog at <a target="_blank" href="http://myawesomeblog.com"><code>myawesomeblog.com</code></a></p>
</li>
</ul>
<p>It’s your own personal environment – no sharing, full control.</p>
<h2 id="heading-the-need-for-scaling-vertical-vs-horizontal">📈 The Need for Scaling – Vertical vs Horizontal</h2>
<p>Imagine your app is growing. At first, it’s just a few users. Then a few hundred. Then thousands are logging in, placing orders, chatting, uploading photos – all at once 😮</p>
<p>Suddenly, your server (VM) is under pressure. It’s like trying to pour a flood through a straw.</p>
<h3 id="heading-so-what-do-you-do-when-one-server-isnt-enough">So, What Do You Do When One Server Isn’t Enough?</h3>
<p>This is where scaling comes in – the art of upgrading your app’s infrastructure to keep up with traffic.</p>
<p>There are two main ways to scale:</p>
<h4 id="heading-option-1-vertical-scaling-aka-scaling-up">🧱 Option 1: Vertical Scaling (aka Scaling Up)</h4>
<p>You take your existing VM and give it more power:</p>
<ul>
<li><p>Add more CPUs 🧠</p>
</li>
<li><p>Increase RAM 🧵</p>
</li>
<li><p>Add faster disks ⚡</p>
</li>
</ul>
<p>Think of it like upgrading from a regular car to a sports car. It’s the same vehicle, just faster and stronger.</p>
<p><strong>Pros:</strong></p>
<ul>
<li><p>Simple to do</p>
</li>
<li><p>No major changes to your app setup</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p>There’s a limit to how much you can upgrade</p>
</li>
<li><p>Still a single point of failure: if the VM crashes, everything goes down 😬</p>
</li>
</ul>
<h4 id="heading-option-2-horizontal-scaling-aka-scaling-out">🧩 Option 2: Horizontal Scaling (aka Scaling Out)</h4>
<p>Instead of boosting one server, you add more servers – multiple VMs running copies of your app.</p>
<p>Now:</p>
<ul>
<li><p>Users can be distributed across all these VMs</p>
</li>
<li><p>If one goes down, others keep serving traffic</p>
</li>
<li><p>You can <em>dynamically</em> add or remove VMs based on traffic</p>
</li>
</ul>
<p>It’s like opening more checkout counters in a busy supermarket 🛒</p>
<p><strong>Pros:</strong></p>
<ul>
<li><p>The load is evenly distributed. For example, if one server previously handled 100% of the traffic, adding two more servers would result in the traffic being split into approximately 33% to 34% for each server.</p>
</li>
<li><p>Improves both performance and reliability</p>
</li>
<li><p>You can scale based on real-time demand (that is, incoming traffic)</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p>Needs something to split traffic between VMs – Load Balancers</p>
</li>
<li><p>More expensive. You pay the price of one VM (for example, $30) multiplied by the number of VMs you provision – three VMs at $30 each comes to $90 at the end of the month</p>
</li>
</ul>
<h3 id="heading-quick-real-world-example">Quick Real-World Example</h3>
<p>Let’s say you’ve launched an e-commerce site for sneakers 👟 Traffic spikes during a big sale? Your vertical scaling (bigger VM) might choke.</p>
<p>But with horizontal scaling:</p>
<ul>
<li><p>You spin up 5 VMs across different regions</p>
</li>
<li><p>Traffic is shared between them</p>
</li>
<li><p>If one VM slows down, others handle the load</p>
</li>
</ul>
<h4 id="heading-so-remember">So, remember 👇🏾</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Scaling Type</td><td>Description</td><td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td>🧱 Vertical Scaling</td><td>Make 1 VM more powerful (adding more CPU power, SSD, RAM, bandwidth, and so on)</td><td>Easy setup, fewer changes</td><td>Hardware limits, 1 point of failure - If that 1 server/VM goes down, so does your app :(</td></tr>
<tr>
<td>🧩 Horizontal Scaling</td><td>Add more VMs to handle traffic</td><td>Flexible, reliable</td><td>Needs traffic distribution logic (Load Balancer). Usually more expensive (the price of 1 VM times the number of VMs)</td></tr>
</tbody>
</table>
</div><h2 id="heading-azure-virtual-machine-scale-sets-vmss-scaling-made-simple">🔁 Azure Virtual Machine Scale Sets (VMSS) – Scaling Made Simple</h2>
<p>Okay – so we’ve talked about <strong>horizontal scaling</strong>: adding multiple VMs to handle growing traffic. Sounds great, right?</p>
<p>But here’s the thing: manually spinning up and configuring 5, 10, or 100 VMs... every time your app gets busy? Yeah, that’s not fun 🙃</p>
<h3 id="heading-enter-virtual-machine-scale-sets-vmss">Enter: Virtual Machine Scale Sets (VMSS)</h3>
<p>VMSS is Azure’s way of automating horizontal scaling. Instead of creating each VM one by one, you define a template, and Azure takes care of the rest:</p>
<ul>
<li><p>How many VMs to start with</p>
</li>
<li><p>How to configure them (OS, apps, settings) ⚙️</p>
</li>
<li><p>When to add or remove VMs based on traffic 📈📉</p>
</li>
</ul>
<h3 id="heading-a-simple-analogy">A Simple Analogy 🧃</h3>
<p>Think of VMSS like a juice dispenser at a party:</p>
<ul>
<li><p>At first, it pours into 2 cups (VMs)</p>
</li>
<li><p>If 10 guests show up? It starts filling 5 cups</p>
</li>
<li><p>Party slows down? Back to 2 cups again</p>
</li>
</ul>
<p>You never have to refill manually – the dispenser adjusts on its own. 🎉</p>
<h3 id="heading-how-it-works-without-the-jargon">How It Works (Without the Jargon 😌)</h3>
<ol>
<li><p><strong>You set the rules:</strong> “If CPU usage goes above 70%, add 2 more VMs.” (There’s a small sketch of this logic right after this list.)</p>
</li>
<li><p><strong>Azure watches traffic and adjusts the number of VMs</strong> automatically.</p>
</li>
<li><p><strong>All VMs are identical</strong> – like clones, all running the same app setup.</p>
</li>
<li><p><strong>It works with Azure Load Balancer</strong> to spread traffic across all these VMs smoothly.</p>
</li>
</ol>
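<p>To see what a rule like the one in step 1 actually decides, here’s a minimal sketch of the scale-out/scale-in logic in Python. The 70% threshold and the “add 2 more VMs” step come from the example above; the 30% scale-in threshold and the min/max limits are assumptions for this sketch – real VMSS autoscale rules are configured on the scale set itself (portal, CLI, or templates), not hand-rolled in code:</p>
<pre><code class="language-python">def decide_instance_count(current_vms, avg_cpu_percent):
    """Toy version of a VMSS autoscale rule: scale out when CPU is hot,
    scale back in when it cools down, staying inside the min/max limits."""
    SCALE_OUT_CPU = 70.0    # example threshold from the rule above
    SCALE_IN_CPU = 30.0     # assumed scale-in threshold for this sketch
    STEP, MIN_VMS, MAX_VMS = 2, 2, 10

    if avg_cpu_percent > SCALE_OUT_CPU:
        return min(current_vms + STEP, MAX_VMS)   # add 2 more VMs
    if avg_cpu_percent >= SCALE_IN_CPU:
        return current_vms                        # steady state, do nothing
    return max(current_vms - STEP, MIN_VMS)       # quiet period, remove VMs


print(decide_instance_count(current_vms=3, avg_cpu_percent=85))  # -> 5
print(decide_instance_count(current_vms=5, avg_cpu_percent=20))  # -> 3
</code></pre>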
<h3 id="heading-real-life-example-food-delivery-app">Real-Life Example: Food Delivery App 🍕📱</h3>
<p>You’ve built an app where users order food. During lunch and dinner, traffic explodes.</p>
<p>💡 With VMSS:</p>
<ul>
<li><p>You start with 3 VMs in the morning</p>
</li>
<li><p>At 12PM, Azure sees high CPU usage, so it spins up 5 more VMs</p>
</li>
<li><p>At 3PM, traffic drops, so Azure removes the extra VMs</p>
</li>
</ul>
<p>You only pay for what you use. And users get a smooth experience – no delays, no crashes 👌🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746982520998/7fe3c997-fc8f-418a-861b-e999905ca43c.png" alt="Auto-scaling illustration" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-azure-load-balancer-spreading-the-traffic">📦 Azure Load Balancer – Spreading the Traffic</h2>
<p>By now, you know that your app can live on multiple Virtual Machines (VMs), and that you can scale them easily using Virtual Machine Scale Sets (VMSS).</p>
<p>But here's the big question: when users start accessing your app – hundreds, even thousands at once – how do you make sure that all that traffic is fairly and efficiently distributed across those VMs?</p>
<p>You don’t want one VM to be overwhelmed while others are just chilling. You need a middleman – something smart enough to balance the load.</p>
<p>That’s where <strong>Azure Load Balancer</strong> steps in. It’s Azure’s way of saying, “Don’t worry, I got this” when traffic starts rolling in.</p>
<h3 id="heading-so-what-is-azure-load-balancer">🏢 So, What Is Azure Load Balancer?</h3>
<p>Azure Load Balancer is a <strong>traffic director</strong>. It takes incoming traffic from the internet (or even internal sources within your network) and intelligently spreads it across multiple backend machines – usually VMs.</p>
<p>It's like having a well-trained receptionist who routes every customer to the next available agent, so no one waits too long and no one gets overwhelmed 😃.</p>
<p>And the best part? This entire process happens in the background – fast, silent, and seamless. Users visiting your app have no idea a traffic manager is working behind the scenes. They just see a fast, responsive experience.</p>
<h3 id="heading-the-frontend-ip-your-apps-public-face">🌐 The Frontend IP – Your App’s Public Face</h3>
<p>Every Azure Load Balancer is tied to a <strong>Frontend IP</strong>, which is basically the public IP address of your application – the one users connect to when they open <code>www.yourapp.com</code>.</p>
<p>This IP acts as the entry point. All user traffic comes through it first. But the Load Balancer doesn’t actually run your app. Instead, it accepts the traffic and forwards it to one of the VMs in the backend pool (we’ll get to that shortly).</p>
<p>You can configure this Frontend IP to be either public (accessible over the internet) or private (used for internal traffic within your cloud network – say, between microservices or internal tools).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747055268951/5afbb738-d00d-4f49-9709-2fa1fe7cffdd.png" alt="Frontend IP address illustration" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-backend-pool-where-the-magic-happens">🗂️ Backend Pool – Where the Magic Happens</h3>
<p>Behind every Azure Load Balancer is a <strong>backend pool</strong> – a group of VMs (or VM Scale Set instances) where your actual app is running. These are the real workers, doing all the heavy lifting.</p>
<p>When traffic hits the Frontend IP, the Load Balancer takes that request and hands it off to one of the VMs in the backend pool.</p>
<p>But it doesn’t just randomly pick one. It checks a few things first – like whether the VM is healthy, whether it's already busy, and what rules you’ve set.</p>
<p>Each VM in the pool typically runs the same app or service. This means any of them can handle any incoming request, which is what makes load balancing possible in the first place.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747055337014/e831056d-7c0c-49d9-b05a-6d3dbe3edc76.png" alt="Backend pool illustration" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-health-probes-keeping-tabs-on-the-vms">🩺 Health Probes – Keeping Tabs on the VMs</h3>
<p>Now, how does the Load Balancer know which VM is healthy or not? This is where <strong>health probes</strong> come in. Think of them as regular check-ups.</p>
<p>You configure the Load Balancer to periodically "ping" each VM – maybe by hitting a specific URL (like <code>/health</code>) or a certain port (like 80 for HTTP). If a VM doesn’t respond correctly, Azure marks it as unhealthy and temporarily removes it from the rotation.</p>
<p>This ensures users never get routed to a broken or unresponsive instance of your app. And once the VM becomes healthy again, it's automatically added back to the pool.</p>
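<p>Here’s a minimal health-probe sketch in Python, using only the standard library, just to show the idea. The VM addresses and the <code>/health</code> path are made-up examples – in practice Azure runs the probes for you based on the probe settings you configure on the Load Balancer:</p>
<pre><code class="language-python">import urllib.error
import urllib.request

# Hypothetical backend VMs -- in Azure these would be the instances in your backend pool
BACKEND_VMS = ["10.0.0.4", "10.0.0.5", "10.0.0.6"]

def is_healthy(vm_ip, port=80, path="/health", timeout=2):
    """Return True if the VM answers the probe with HTTP 200."""
    url = f"http://{vm_ip}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

# Only VMs that pass the probe stay in the rotation; the rest are skipped
healthy_pool = [vm for vm in BACKEND_VMS if is_healthy(vm)]
print("VMs currently in rotation:", healthy_pool)
</code></pre>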
<h3 id="heading-load-balancing-rules-who-gets-what">⚖️ Load Balancing Rules – Who Gets What?</h3>
<p>Next, we have <strong>Load Balancing Rules</strong>. These are the instructions that tell Azure Load Balancer exactly how to behave.</p>
<p>You can define rules like:</p>
<ul>
<li><p>“Forward all HTTP (port 80) traffic to backend pool VMs on port 80”</p>
</li>
<li><p>“Forward HTTPS (port 443) traffic to VMs on port 443”</p>
</li>
<li><p>“Only route traffic to healthy VMs”</p>
</li>
</ul>
<p>These rules make Azure Load Balancer highly customizable. You get to decide how traffic flows, which protocols to support, and how to handle backend ports. It's like customizing the rules of a relay race – who gets the baton and when.</p>
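<p>If you squint, a load-balancing rule boils down to “traffic arriving on this frontend port goes to that backend port on the next healthy VM”. Here’s a purely illustrative Python sketch of that idea – the ports and IPs are examples, and real rules live on the Load Balancer resource, not in your application code:</p>
<pre><code class="language-python">import itertools

# Example rules: frontend port -> backend port (same idea as the bullets above)
LB_RULES = {80: 80, 443: 443}

# VMs the health probes currently consider healthy (example private IPs)
healthy_vms = ["10.0.0.4", "10.0.0.5", "10.0.0.6"]
rotation = itertools.cycle(healthy_vms)   # simple round-robin rotation

def route(frontend_port):
    """Pick the next healthy VM and the backend port this rule maps to."""
    backend_port = LB_RULES[frontend_port]
    vm = next(rotation)
    return vm, backend_port

for _ in range(4):
    print(route(80))  # cycles through 10.0.0.4, 10.0.0.5, 10.0.0.6, 10.0.0.4, ...
</code></pre>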
<h3 id="heading-real-world-example-sneaker-sale-rush">👟 Real-World Example: Sneaker Sale Rush</h3>
<p>Imagine you're running an online sneaker store at <code>www.sneakerblast.com</code>. You’re launching a flash sale, and thousands of users are hitting your website all at once.</p>
<p>Thanks to your Azure Load Balancer, here’s what happens:</p>
<ol>
<li><p>All those users land on your Frontend IP, the public face of your site.</p>
</li>
<li><p>The Load Balancer accepts the traffic and checks the health probes of all VMs in the backend pool.</p>
</li>
<li><p>Based on its rules, it forwards each user to a healthy, available VM.</p>
</li>
<li><p>One VM might serve a user in Lagos, another in Nairobi, another in Accra – all seamlessly.</p>
</li>
</ol>
<p>If one VM crashes or lags? The Load Balancer detects it instantly and stops routing traffic to it until it’s back online.</p>
<p>That’s smooth traffic management without any manual effort.</p>
<h2 id="heading-azure-application-gateway-smart-routing-for-modern-apps">🍴 Azure Application Gateway – Smart Routing for Modern Apps</h2>
<p>So far, we’ve seen how Azure Load Balancer helps you split traffic across multiple VMs running a single service – like a monolithic app or a web frontend.</p>
<p>Let’s say you have a web application deployed on a VM. It listens on port 80, and you’ve scaled it into 3 instances. The Azure Load Balancer takes requests from the internet and spreads them across all 3 instances of the same service. Easy, right?</p>
<p>You can even link the Load Balancer’s public IP address to your domain – like <code>mydomain.com</code> – so users can visit your site normally.</p>
<h3 id="heading-but-what-if-you-have-multiple-services">🧠 But What If You Have <em>Multiple</em> Services?</h3>
<p>Now imagine you’ve gone beyond just one app. You’re building something more modern, like a set of microservices.</p>
<p>You now have:</p>
<ul>
<li><p>A payment service listening on port 5000</p>
</li>
<li><p>An authentication service on port 6000</p>
</li>
<li><p>A purchase service on port 7000</p>
</li>
</ul>
<p>All deployed across the same VMs (or Virtual Machine Scale Set), just on different ports.</p>
<p>Here’s the problem: an Azure Load Balancer works at the network level – it can forward ports to a backend pool, but it can’t look inside a request to decide <em>which</em> microservice should handle it. If you tie it to <code>mydomain.com</code>, it can’t tell <code>/payment</code> traffic from <code>/auth</code> traffic. 😬</p>
<p>So… what do you do?</p>
<p>You might think: “Let me just create a separate Load Balancer for each service!” 🤕</p>
<p>But that means:</p>
<ul>
<li><p>You’ll have to pay for multiple load balancers</p>
</li>
<li><p>You’ll end up managing 3–5 public IP addresses</p>
</li>
<li><p>You might even need to buy multiple domains like <code>mypayment.com</code>, <code>myauth.com</code>, and so on to route users properly</p>
</li>
</ul>
<p>Yikes. That’s impractical, messy, <em>and</em> expensive 😖💸</p>
<h3 id="heading-enter-azure-application-gateway">🎉 Enter Azure Application Gateway</h3>
<p><strong>Azure Application Gateway</strong> solves this problem beautifully. It’s designed to route traffic intelligently – not just to one service, but to multiple services using just one gateway.</p>
<p>It works like this:</p>
<ol>
<li><p>You create one public-facing frontend IP (like <code>52.160.100.5</code>)</p>
</li>
<li><p>You link that IP address to your main domain, for example <code>mydomain.com</code></p>
</li>
<li><p>Then, you define multiple backend pools – one for each service:</p>
<ul>
<li><p>Payment service (port 5000)</p>
</li>
<li><p>Auth service (port 6000)</p>
</li>
<li><p>Purchase service (port 7000)</p>
</li>
</ul>
</li>
<li><p>Next, you set up routing rules that decide how to forward each request.</p>
</li>
</ol>
<h3 id="heading-two-ways-to-route-with-application-gateway">✨ Two Ways to Route with Application Gateway</h3>
<p>You can configure <strong>smart routing</strong> based on:</p>
<ul>
<li><p><strong>URL paths</strong>:</p>
<ul>
<li><p><code>mydomain.com/payment</code> → Payment service</p>
</li>
<li><p><code>mydomain.com/auth</code> → Auth service</p>
</li>
</ul>
</li>
<li><p><strong>Subdomains</strong> (host headers):</p>
<ul>
<li><p><code>payment.mydomain.com</code> → Payment service</p>
</li>
<li><p><code>auth.mydomain.com</code> → Auth service</p>
</li>
</ul>
</li>
</ul>
<p>This way, all your services share one public IP and one domain – super clean, super efficient 🙌🏾</p>
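<p>Here’s a small Python sketch of the decision the gateway makes for both routing styles. The domain, paths, and ports come from the example above; the real configuration lives in the Application Gateway’s listeners, rules, and backend pools rather than in code like this:</p>
<pre><code class="language-python"># Backend pools keyed by service name (example ports from the scenario above)
BACKEND_POOLS = {"payment": 5000, "auth": 6000, "purchase": 7000}

# Option A: URL path-based rules
PATH_RULES = {"/payment": "payment", "/auth": "auth", "/purchase": "purchase"}

# Option B: subdomain (host header) rules
HOST_RULES = {"payment.mydomain.com": "payment",
              "auth.mydomain.com": "auth",
              "purchase.mydomain.com": "purchase"}

def pick_backend(host, path):
    """Return (service, backend_port) the way an App Gateway routing rule would."""
    if host in HOST_RULES:                     # a matching subdomain wins
        service = HOST_RULES[host]
    else:                                      # otherwise fall back to the URL path
        service = next((svc for prefix, svc in PATH_RULES.items()
                        if path.startswith(prefix)), None)
    if service is None:
        return None, None                      # no rule matched -> default pool / 404
    return service, BACKEND_POOLS[service]

print(pick_backend("mydomain.com", "/auth/login"))      # ('auth', 6000)
print(pick_backend("payment.mydomain.com", "/charge"))  # ('payment', 5000)
</code></pre>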
<h3 id="heading-real-life-scenario-lets-break-it-down">🤓 Real-Life Scenario (Let’s Break It Down)</h3>
<p>Let’s say you’re building a startup platform that has three key microservices:</p>
<ul>
<li><p><strong>Payment service</strong> that handles transactions</p>
</li>
<li><p><strong>Authentication service</strong> that handles login and user identity</p>
</li>
<li><p><strong>Purchase service</strong> that manages product ordering</p>
</li>
</ul>
<p>Each service is containerized and deployed on the same VM (or across several VMs using a VM Scale Set). But – and this is key – they all listen on <strong>different ports</strong> inside the VMs:</p>
<ul>
<li><p>Payment → port 5000</p>
</li>
<li><p>Auth → port 6000</p>
</li>
<li><p>Purchase → port 7000</p>
</li>
</ul>
<p>Now, without a smart routing solution, you’d be stuck trying to expose just one of these services using a standard Azure Load Balancer. But you need all three to be accessible from the internet – and you don’t want to pay for or manage 3 different Load Balancers 😅</p>
<p>So, what do you do?</p>
<h3 id="heading-using-azure-application-gateway-to-route-traffic-intelligently">🧠 Using Azure Application Gateway to Route Traffic Intelligently</h3>
<p>Here's how you can fix this using <strong>one</strong> Application Gateway:</p>
<ol>
<li><p>Deploy your microservices inside each VM:</p>
<ul>
<li><p>Each service runs on a specific port</p>
</li>
<li><p>All VMs in your scale set are identical (they contain all three services)</p>
</li>
</ul>
</li>
<li><p>Create backend pools in Application Gateway:</p>
<ul>
<li><p>A backend pool for the payment service (pointing to port 5000 on all VMs)</p>
</li>
<li><p>One for the auth service (port 6000)</p>
</li>
<li><p>Another for the purchase service (port 7000)</p>
</li>
</ul>
</li>
<li><p>Create routing rules:</p>
<ul>
<li><p>Option A (Path-based routing):</p>
<ul>
<li><p>Requests to <code>mydomain.com/payment</code> → go to the payment backend pool</p>
</li>
<li><p>Requests to <code>mydomain.com/auth</code> → go to the auth backend pool</p>
</li>
<li><p>Requests to <code>mydomain.com/purchase</code> → go to the purchase backend pool</p>
</li>
</ul>
</li>
<li><p>Option B (Subdomain-based routing):</p>
<ul>
<li><p><code>payment.mydomain.com</code> → payment service</p>
</li>
<li><p><code>auth.mydomain.com</code> → auth service</p>
</li>
<li><p><code>purchase.mydomain.com</code> → purchase service</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<p>You just tell the Application Gateway: “Hey, if a request comes in for this URL or subdomain, send it to this port on these VMs.” And it does just that – consistently and intelligently 🔁</p>
<h3 id="heading-so-whats-really-happening">📦 So, What’s Really Happening?</h3>
<p>Imagine a user visits <code>mydomain.com/auth</code>. Here’s what goes on behind the scenes:</p>
<ol>
<li><p>The DNS translates <code>mydomain.com</code> to your Application Gateway’s public IP</p>
</li>
<li><p>The Gateway receives the request</p>
</li>
<li><p>It checks your routing rules</p>
</li>
<li><p>It sees that <code>/auth</code> should go to the backend pool for port 6000</p>
</li>
<li><p>It forwards the request to one of the VMs running the auth service</p>
</li>
<li><p>The response goes back to the user – fast and seamless ✨</p>
</li>
</ol>
<p>This happens in milliseconds, for every request. And because the Application Gateway is aware of multiple ports and services, it can handle routing logic that a regular Load Balancer just can’t do.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747056436345/7ea97231-d2ee-4f63-aff1-50595e7c06e0.png" alt="Application Gateway Illustration" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-azure-load-balancer-vs-azure-application-gateway">🔍 Azure Load Balancer vs Azure Application Gateway</h2>
<p>By now, you've seen how both tools help route traffic in Azure – but they solve different problems.</p>
<p>Let’s break down how they compare, and when you should use one over the other 👇🏾</p>
<h3 id="heading-1-routing-logic">🛣️ 1. <strong>Routing Logic</strong></h3>
<p><strong>Azure Load Balancer</strong><br>It simply distributes incoming traffic evenly across a pool of VMs. It doesn’t care <em>what</em> the request is – it just balances the load.  </p>
<p>Imagine a delivery guy who doesn't ask questions – he just drops each package at the next available house.  </p>
<p>That’s what Azure Load Balancer does: it sends traffic to one of your servers without looking inside the request.</p>
<p><strong>Azure Application Gateway</strong><br>This is the smart one. It looks at <em>what’s inside</em> each request (like the URL path or domain) and makes intelligent decisions.</p>
<p>Just like a smarter delivery guy who looks at the address and decides where to go: "Oh! This one is for the payment office, not the main office."  </p>
<p>That’s what Application Gateway does: it reads the request (like the URL or domain name) and sends it to the right place according to the routing rules.</p>
<h3 id="heading-2-protocols-handled">🌐 2. <strong>Protocols Handled</strong></h3>
<p><strong>Load Balancer</strong><br>Works at the transport layer (Layer 4 in the OSI model). It deals with raw TCP/UDP traffic – whether that traffic carries a website, video streaming, or game data, the Load Balancer just sees packets and ports.</p>
<p><strong>Application Gateway</strong><br>Works at the application layer (Layer 7). It handles web traffic only – like websites and apps (HTTP/HTTPS) – and it can actually read what's being asked, like:</p>
<ul>
<li><p>“Go to /login”</p>
</li>
<li><p>“Go to <a target="_blank" href="http://payment.mydomain.com">payment.mydomain.com</a>”.</p>
</li>
</ul>
<p>TL;DR: Load Balancer just pushes packets. App Gateway actually <em>reads</em> your web requests.</p>
<h3 id="heading-3-use-case-scenarios">🔁 3. <strong>Use Case Scenarios</strong></h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Situation</td><td>Best Choice</td></tr>
</thead>
<tbody>
<tr>
<td>You have one big app and just want to spread users across servers</td><td>✅ Load Balancer</td></tr>
<tr>
<td>You have multiple services (like login, payment, and so on) and need to send users to the right one</td><td>✅ Application Gateway</td></tr>
<tr>
<td>You want to use subdomains (like <a target="_blank" href="http://login.mysite.com">login.mysite.com</a>)</td><td>✅ Application Gateway</td></tr>
<tr>
<td>You want to secure your website with HTTPS and Web Application Firewall (WAF)</td><td>✅ Application Gateway</td></tr>
<tr>
<td>You want the simplest setup and lowest cost</td><td>✅ Load Balancer</td></tr>
</tbody>
</table>
</div><h3 id="heading-4-ssl-termination-amp-security-features">🔐 4. <strong>SSL Termination &amp; Security Features</strong></h3>
<p><strong>Load Balancer</strong> doesn’t handle security stuff. You’ll need to secure each server yourself (for example, set up HTTPS on each one).</p>
<p><strong>Application Gateway</strong> can secure everything in one place – you upload your SSL certificate once and it takes care of HTTPS for all services.</p>
<p>It can also protect you from hackers and bad traffic with something called <strong>WAF (Web Application Firewall)</strong>, which shields your app from threats like SQL injection, XSS, and so on (you need to enable it explicitly – it’s not on by default).</p>
<h3 id="heading-5-pricing-and-complexity">💰 5. <strong>Pricing and Complexity</strong></h3>
<p><strong>Load Balancer</strong> is cheaper and easier to set up. Great when you don’t need anything fancy.</p>
<p><strong>Application Gateway</strong> costs more, but gives you more control and less headache when working with complex apps and microservices.</p>
<p>Trying to use Load Balancer for multiple services? You’ll need to create one Load Balancer per service, which becomes costly and impractical.</p>
<h3 id="heading-summary-table">🧠 Summary Table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>Load Balancer</td><td>Application Gateway</td></tr>
</thead>
<tbody>
<tr>
<td>Can it understand the request?</td><td>❌ No</td><td>✅ Yes</td></tr>
<tr>
<td>Can it route based on URL or subdomain?</td><td>❌ No</td><td>✅ Yes</td></tr>
<tr>
<td>Can it terminate HTTPS (SSL) for you?</td><td>❌ No</td><td>✅ Yes</td></tr>
<tr>
<td>Is it good for simple apps?</td><td>✅ Yes</td><td>✅ Yes</td></tr>
<tr>
<td>Is it good for complex apps with many services?</td><td>❌ No</td><td>✅ Yes</td></tr>
<tr>
<td>Cost</td><td>💲 Lower</td><td>💰 Higher</td></tr>
</tbody>
</table>
</div><h2 id="heading-use-cases-when-to-use-each-one">🧭 Use Cases: When to Use Each One</h2>
<p>There’s no one-size-fits-all when it comes to hosting apps in the cloud. The right setup depends on what you’re building, how much traffic you expect, and how complex your app is.</p>
<p>Let’s walk through 4 different use-case scenarios, starting from the most basic setup all the way to a fully auto-scaled and smartly routed architecture.</p>
<h3 id="heading-1-single-vm-instance-for-small-projects-or-internal-tools">1️⃣ <strong>Single VM Instance – For Small Projects or Internal Tools</strong></h3>
<p><strong>Use this when:</strong><br>You’re just getting started. You’ve built a small app – maybe a portfolio, a blog, or a side project – and you want to make it live, or you’re a startup that just launched.</p>
<p><strong>How it works:</strong><br>You spin up one Azure VM, install your app on it, and open the port it listens on (for example, port 80 for a web server). You can then attach a public IP to the VM and bind it to a custom domain like <code>myawesomeapp.com</code>.</p>
<p><strong>Real-life examples:</strong></p>
<ul>
<li><p>A developer hosting a portfolio website or blog</p>
</li>
<li><p>A startup testing a new product with only a few users</p>
</li>
<li><p>An internal company tool for a small team</p>
</li>
</ul>
<p><strong>Pros:</strong></p>
<ul>
<li><p>Super simple setup</p>
</li>
<li><p>Low cost</p>
</li>
<li><p>Full control of your environment</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p>If the VM goes down, your app goes down</p>
</li>
<li><p>No auto-scaling – performance may drop during traffic spikes (the only way to handle higher CPU/memory usage from incoming traffic is to manually scale the VM vertically)</p>
</li>
<li><p>You manually maintain and monitor everything</p>
</li>
</ul>
<h3 id="heading-2-manual-horizontal-scaling-for-apps-with-medium-predictable-traffic">2️⃣ <strong>Manual Horizontal Scaling – For Apps With Medium, Predictable Traffic</strong></h3>
<p><strong>Use this when:</strong><br>Your app is growing – maybe you have a few thousand users now, and performance matters. You want more than one server so your app doesn’t crash during busy hours.</p>
<p><strong>How it works:</strong><br>You manually create 2 or 3 Azure VMs with the same app setup. You then add a Load Balancer in front to split traffic evenly across them.</p>
<p><strong>Real-life examples:</strong></p>
<ul>
<li><p>A business with a customer portal</p>
</li>
<li><p>A school website that handles regular logins, lecture video streaming, and so on during class hours</p>
</li>
<li><p>An app that gets traffic mostly during the day (predictable load)</p>
</li>
</ul>
<p><strong>Pros:</strong></p>
<ul>
<li><p>Better performance and availability</p>
</li>
<li><p>Load is shared across multiple VMs</p>
</li>
<li><p>You can scale manually when needed</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p>You must manually add or remove VMs – which takes effort</p>
</li>
<li><p>Still need to monitor performance manually</p>
</li>
<li><p>No built-in automation or auto-healing</p>
</li>
</ul>
<h3 id="heading-3-auto-scaling-with-vm-scale-sets-azure-load-balancer-for-apps-with-spiky-or-unpredictable-traffic">3️⃣ <strong>Auto-Scaling with VM Scale Sets + Azure Load Balancer – For Apps With Spiky or Unpredictable Traffic</strong></h3>
<p><strong>Use this when:</strong><br>You’re building something more serious – traffic comes in waves (for example, a fitness/coach booking app), and you don’t want to sit around scaling VMs all day. You want Azure to automatically scale your infrastructure for you.</p>
<p><strong>How it works:</strong><br>You set up a Virtual Machine Scale Set (VMSS) that can automatically create more VMs when needed (like during high traffic), and remove them when things are calm — saving money. A Load Balancer distributes traffic across all those VMs.</p>
<p><strong>Real-life examples:</strong></p>
<ul>
<li><p>A media platform where people upload videos or photos</p>
</li>
<li><p>A shopping site that gets surges during promotions, for example Black Fridays</p>
</li>
<li><p>A booking platform with peak traffic in evenings/weekends</p>
</li>
</ul>
<p><strong>Pros:</strong></p>
<ul>
<li><p>Automatic scaling – saves time and money</p>
</li>
<li><p>High availability: VMs can be replaced if one fails</p>
</li>
<li><p>Easy to grow as your user base grows</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p>Best suited to a monolithic app (one big service)</p>
</li>
<li><p>No support for routing traffic to specific services – just spreads traffic across VMs</p>
</li>
<li><p>Load Balancer can’t look at URL paths or subdomains</p>
</li>
</ul>
<h3 id="heading-4-vm-scale-set-azure-application-gateway-for-microservices-or-complex-web-apps">4️⃣ <strong>VM Scale Set + Azure Application Gateway – For Microservices or Complex Web Apps</strong></h3>
<p><strong>Use this when:</strong><br>You have a modern, multi-service app – maybe built with microservices. Each service (like payments, authentication, search, and so on) lives on a different port or even in a container.</p>
<p>You want to route traffic smartly – like <code>/login</code> goes to the auth service, <code>/pay</code> to payments, and <code>/search</code> to the search service – all on the same domain.</p>
<p><strong>How it works:</strong><br>You still use a VM Scale Set for auto-scaling, but instead of a basic Load Balancer, you add an Application Gateway. It can inspect each request and send it to the right service based on things like:</p>
<ul>
<li><p>URL path (for example, <code>/payments</code>, <code>/orders</code>)</p>
</li>
<li><p>Subdomain (for example, <code>payments.mydomain.com</code>, <code>auth.mydomain.com</code>)</p>
</li>
</ul>
<p><strong>Real-life examples:</strong></p>
<ul>
<li><p>A full-blown SaaS product with multiple services</p>
</li>
<li><p>An e-commerce site with checkout, account, orders, and admin dashboards</p>
</li>
<li><p>A business migrating from a monolith to a microservices setup</p>
</li>
</ul>
<p><strong>Pros:</strong></p>
<ul>
<li><p>Smart routing based on path or subdomain</p>
</li>
<li><p>Everything runs under one public IP and one domain</p>
</li>
<li><p>Secure HTTPS handling + optional Web Application Firewall (WAF)</p>
</li>
<li><p>Auto-scaling and high availability</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p>More complex setup</p>
</li>
<li><p>Slightly higher cost due to Application Gateway</p>
</li>
<li><p>Needs planning around port numbers and backend pools</p>
</li>
</ul>
<h3 id="heading-quick-summary-table">🧠 Quick Summary Table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Setup</td><td>Best For</td><td>Scaling</td><td>Routing Logic</td><td>Cost</td><td>Ease</td></tr>
</thead>
<tbody>
<tr>
<td>☁️ Single VM</td><td>Small sites, personal apps</td><td>❌ (Manual)</td><td>❌ One app only</td><td>💲 (Lowest)</td><td>⭐⭐⭐⭐</td></tr>
<tr>
<td>🧱 Manual Horizontal Scaling + Load Balancer</td><td>Mid-size apps, predictable traffic</td><td>✅ (Manual)</td><td>❌ One app only</td><td>💲💲💲 (due to multiple VMs running at once without down-scaling — even with no traffic)</td><td>⭐⭐ (due to manual scaling)</td></tr>
<tr>
<td>🔁 VMSS + Load Balancer</td><td>Busy apps, spiky traffic</td><td>✅ (Auto)</td><td>❌ One app only</td><td>💲💲</td><td>⭐⭐⭐</td></tr>
<tr>
<td>🍴 VMSS + App Gateway</td><td>Microservices, modern apps</td><td>✅ (Auto)</td><td>✅ Smart routing (involving multiple microservices)</td><td>💲💲💲💲(Highest)</td><td>⭐⭐</td></tr>
</tbody>
</table>
</div><h2 id="heading-conclusion">✅ Conclusion</h2>
<p>By now, you’ve gone from simply hearing the words “load balancer” or “scale set” to understanding exactly how they work, when to use them, and what problems they solve. Whether you’re just launching a small app or scaling up a high-traffic service, Azure gives you flexible, powerful tools to grow with confidence.</p>
<p>We started from the very beginning – a single virtual machine. It’s simple and great for small apps, but it quickly becomes a bottleneck as traffic grows.</p>
<p>That’s where scaling comes in. We explored:</p>
<ul>
<li><p>🧱 <strong>Vertical scaling</strong> – Upgrading the same VM (quick fix, but limited)</p>
</li>
<li><p>🧩 <strong>Horizontal scaling</strong> – Adding more VMs to handle traffic better</p>
</li>
</ul>
<p>Then we introduced Azure Virtual Machine Scale Sets (VMSS) – which bring auto-scaling to life. No more manual intervention – Azure can scale your servers up and down based on demand.</p>
<p>But where things really get smart is with load balancers:</p>
<ul>
<li><p>📦 <strong>Azure Load Balancer</strong> helps spread traffic across your VMs — great for single-service apps</p>
</li>
<li><p>🍴 <strong>Azure Application Gateway</strong> takes it further by routing requests based on URL paths or subdomains — perfect for multi-service or microservice apps</p>
</li>
</ul>
<h3 id="heading-tldr-what-should-you-use">🎯 TL;DR – What Should You Use?</h3>
<ul>
<li><p><strong>Single VM</strong>: For side projects, portfolios, or internal tools</p>
</li>
<li><p><strong>Manual scaling + Load Balancer</strong>: For medium apps with predictable load</p>
</li>
<li><p><strong>VMSS + Load Balancer</strong>: For monolithic apps with auto-scaling needs</p>
</li>
<li><p><strong>VMSS + Application Gateway</strong>: Auto-scaling plus smart routing – for microservices and multi-service apps</p>
</li>
</ul>
<h3 id="heading-final-thoughts">💡 Final Thoughts</h3>
<p>Cloud apps grow – fast. And with growth comes complexity. But with the right Azure setup, you can stay one step ahead of your traffic, serve users better, and keep costs under control.</p>
<p>Remember: you don’t need to start big. Start small, understand your app's traffic patterns, and scale only when you need to. Tools like Azure VM Scale Sets, Load Balancer, and Application Gateway give you the control and power to build scalable, modern applications without over-engineering.</p>
<p>Thanks for sticking with me through this deep dive. I hope this made things clearer, simpler, and maybe even a little fun 😊</p>
<h2 id="heading-study-further"><strong>Study Further 📚</strong></h2>
<p>If you would like to learn more about Azure Virtual Machines, Scale Sets, Load Balancer, and Application Gateway, you can check out the courses below:</p>
<ul>
<li><p><a target="_blank" href="https://www.coursera.org/specializations/microsoft-azure-fundamentals-az900-exam-prep">Microsoft Azure Fundamentals AZ-900 Exam Prep Specialization</a> — Microsoft, Coursera</p>
</li>
<li><p><a target="_blank" href="https://youtu.be/QOv_-xBXkpo?si=kSijmQdev5cQbRKl">Azure Virtual Machine Tutorial | Creating A Virtual Machine In Azure | Azure Training | Simplilearn</a> — YouTube</p>
</li>
<li><p><a target="_blank" href="https://youtu.be/wN4lRWHUHA0?si=kWBGXhXZTnVgzuEj">Virtual machine scale sets</a> — YouTube</p>
</li>
<li><p><a target="_blank" href="https://youtu.be/VqBGjddK5VY?si=diLGQfuW5i0lxbse">Azure Load Balancer | Azure Load Balancer Tutorial | All About Load Balancer | Edureka</a> — YouTube</p>
</li>
<li><p><a target="_blank" href="https://youtu.be/V9EP4jAg4QM?si=t7EqQjw1eNHqOtjK">Azure Application Gateway Deep dive | Step by step explained</a> — YouTube</p>
</li>
</ul>
<h2 id="heading-about-the-author"><strong>About the Author 👨‍💻</strong></h2>
<p>Hi, I’m Prince! I’m a DevOps engineer and Cloud architect passionate about building, deploying, and managing scalable applications and sharing knowledge with the tech community.</p>
<p>If you enjoyed this article, you can learn more about me by exploring more of my blogs and projects on my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/">LinkedIn profile.</a> You can find my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/details/publications/">LinkedIn articles here</a>. You can also <a target="_blank" href="https://prince-onuk.vercel.app/achievements#articles">visit my website</a> to read more of my articles as well. Let’s connect and grow together! 😊</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
