ratelimit - freeCodeCamp.org

How to Implement Token Bucket Rate Limiting with FastAPI

Prosper Ugbovo — Fri, 27 Mar 2026 21:36:52 +0000

APIs power everything from mobile apps to enterprise platforms, quietly handling millions of requests per day. Without safeguards, a single misconfigured client or a burst of automated traffic can overwhelm your service, degrading performance for everyone.

Rate limiting prevents this. It controls how many requests a client can make within a given timeframe, protecting your infrastructure from both intentional abuse and accidental overload.

Among the several algorithms used for rate limiting, the Token Bucket stands out for its balance of simplicity and flexibility. Unlike fixed window counters that reset abruptly, the Token Bucket allows short bursts of traffic while still enforcing a sustainable long-term rate. This makes it a practical choice for APIs where clients occasionally need to send a quick flurry of requests without being penalized.

In this guide, you'll implement a Token Bucket rate limiter in a FastAPI application. You'll build the algorithm from scratch as a Python class, wire it into FastAPI as middleware with per-user tracking, add standard rate limit headers to your responses, and test everything with a simple script. By the end, you'll have a working rate limiter you can drop into any FastAPI project.

What we'll cover:

Prerequisites
Understanding the Token Bucket Algorithm
Setting Up the FastAPI Project
Implementing the Token Bucket Class
Adding Per-User Rate Limiting Middleware
Testing the Rate Limiter
Where Rate Limiting Fits in Your Architecture
Conclusion

Prerequisites

To follow this tutorial, you'll need:

Python 3.9 or later installed on your machine. You can verify your version by running python --version.
Familiarity with Python and basic knowledge of how HTTP APIs work.
A text editor such as VS Code, Vim, or any editor you prefer.

Understanding the Token Bucket Algorithm

Before writing code, it helps to understand the mechanism you'll be building.

The Token Bucket algorithm models rate limiting with two simple concepts: a bucket that holds tokens, and a refill process that adds tokens at a steady rate.

Here is how it works:

The bucket starts full, holding a fixed maximum number of tokens (the capacity).
Each incoming request costs one token. If the bucket has tokens available, the request is allowed, and one token is removed.
If the bucket is empty, the request is rejected with a 429 Too Many Requests response.
Tokens are added back to the bucket at a constant refill rate, regardless of whether requests are coming in. The bucket never exceeds its maximum capacity.

The capacity determines how large a burst the system absorbs. The refill rate defines the sustained throughput. For example, a bucket with a capacity of 10 and a refill rate of 2 tokens per second allows a client to fire 10 requests instantly, but after that, they can only make 2 requests per second until the bucket refills.

This design gives you precise control through two main parameters, plus a refill interval that sets how often tokens are added:

Parameter	Controls	Example
Capacity (max tokens)	Maximum burst size	10 tokens = 10 requests at once
Refill rate	Sustained throughput	2 tokens/sec = 2 requests/sec long-term
Refill interval	Granularity of refill	1.0 sec = tokens added every second
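To make the arithmetic concrete, here is a small sketch of the upper bound on requests a client can get through in a given window: the initial burst plus whatever is refilled in that time. The function name max_requests is hypothetical, and the sketch assumes the bucket starts full and ignores the refill interval (tokens are treated as arriving continuously).

```python
def max_requests(capacity: int, refill_per_sec: float, seconds: float) -> int:
    """Upper bound on requests served in `seconds`, starting from a full bucket."""
    # The burst drains the full bucket; after that only refilled tokens can be spent.
    return capacity + int(refill_per_sec * seconds)


if __name__ == "__main__":
    for window in (0, 1, 5, 60):
        print(f"{window:>2}s window -> at most {max_requests(10, 2, window)} requests")
```

With a capacity of 10 and a refill rate of 2 tokens per second, this gives at most 10, 12, 20, and 130 requests for windows of 0, 1, 5, and 60 seconds.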

Compared to other rate-limiting algorithms:

Fixed Window counters reset at hard boundaries (for example, every minute), which can allow double the intended rate at window edges. The Token Bucket has no such boundary.
Sliding Window counters are more accurate but more complex to implement and maintain.
Leaky Bucket processes requests at a fixed rate and queues the rest. The Token Bucket is similar, but allows bursts instead of forcing a constant pace.
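The window-edge problem is easier to see with numbers. The following is a minimal sketch of a fixed window counter, not part of this tutorial's implementation. The class name FixedWindowCounter and the explicit now argument are assumptions made so the example runs without waiting on a real clock.

```python
from collections import defaultdict


class FixedWindowCounter:
    """Allow at most `limit` requests per `window` seconds, aligned to clock boundaries."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # window index -> requests seen so far

    def allow(self, now: float) -> bool:
        index = int(now // self.window)  # which window this timestamp falls in
        if self.counts[index] >= self.limit:
            return False
        self.counts[index] += 1
        return True


if __name__ == "__main__":
    limiter = FixedWindowCounter(limit=10, window=60.0)
    # 10 requests just before the 60-second boundary, 10 more just after it.
    before = [59.0 + i * 0.05 for i in range(10)]
    after = [60.0 + i * 0.05 for i in range(10)]
    allowed = sum(limiter.allow(t) for t in before + after)
    print(f"{allowed} requests allowed in under 2 seconds")
```

All 20 requests pass even though the configured limit is 10 per minute, because each group of 10 lands in a different window. A Token Bucket with a capacity of 10 would have absorbed the first 10 and then allowed only what had been refilled.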

The Token Bucket is widely used in production systems. AWS API Gateway and Stripe both use variations of it, and NGINX's request limiting uses the closely related Leaky Bucket.

Setting Up the FastAPI Project

Create a project directory and install the dependencies:

mkdir fastapi-ratelimit && cd fastapi-ratelimit

Create and activate a virtual environment:

python -m venv venv

On Linux/macOS:

source venv/bin/activate

On Windows:

venv\Scripts\activate

Install FastAPI and Uvicorn:

pip install fastapi uvicorn

Create the project file structure:

fastapi-ratelimit/
├── main.py
└── ratelimiter.py

Create main.py with a minimal FastAPI application:

from fastapi import FastAPI

app = FastAPI()


@app.get("/")
async def root():
    return {"message": "Hello, world!"}

Start the server to verify the setup:

uvicorn main:app --reload

You should see output similar to:

INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Started reloader process

Open http://127.0.0.1:8000 in your browser or run curl http://127.0.0.1:8000. You should receive:

{"message": "Hello, world!"}

With the project running, you can move on to building the rate limiter.

Implementing the Token Bucket Class

Open ratelimiter.py in your editor and add the following code. This class implements the Token Bucket algorithm with thread-safe operations:

import time
import threading


class TokenBucket:
    """
    Token Bucket rate limiter.

    Each bucket starts full at `max_tokens` and refills `refill_rate`
    tokens every `interval` seconds, up to the maximum capacity.
    """

    def __init__(self, max_tokens: int, refill_rate: int, interval: float):
        """
        Initialize a new Token Bucket.

        :param max_tokens: Maximum number of tokens the bucket can hold (burst capacity).
        :param refill_rate: Number of tokens added per refill interval.
        :param interval: Time in seconds between refills.
        """
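        # Validate with exceptions rather than assert, which is skipped under `python -O`.
        if max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        if refill_rate <= 0:
            raise ValueError("refill_rate must be positive")
        if interval <= 0:
            raise ValueError("interval must be positive")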

        self.max_tokens = max_tokens
        self.refill_rate = refill_rate
        self.interval = interval

        self.tokens = max_tokens
        self.refilled_at = time.time()
        self.lock = threading.Lock()

    def _refill(self):
        """Add tokens based on elapsed time since the last refill."""
        now = time.time()
        elapsed = now - self.refilled_at

        if elapsed >= self.interval:
            num_refills = int(elapsed // self.interval)
            self.tokens = min(
                self.max_tokens,
                self.tokens + num_refills * self.refill_rate
            )
            # Advance the timestamp by the number of full intervals consumed,
            # not to `now`, so partial intervals aren't lost.
            self.refilled_at += num_refills * self.interval

    def allow_request(self, tokens: int = 1) -> bool:
        """
        Attempt to consume `tokens` from the bucket.

        Returns True if the request is allowed, False if the bucket
        does not have enough tokens.
        """
        with self.lock:
            self._refill()

            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

    def get_remaining(self) -> int:
        """Return the current number of available tokens."""
        with self.lock:
            self._refill()
            return self.tokens

    def get_reset_time(self) -> float:
        """Return the Unix timestamp when the next refill occurs."""
        with self.lock:
            return self.refilled_at + self.interval

The class has three public methods:

allow_request() is the core method. It refills tokens based on elapsed time, then tries to consume one. It returns True if the request is allowed, False if the bucket is empty.
get_remaining() returns the number of tokens the client has left. You will use this for response headers.
get_reset_time() returns when the next token will be added. This is also exposed in response headers.

The threading.Lock ensures that concurrent callers don't create race conditions when reading or modifying the token count. This matters whenever a bucket is shared across threads, for example when FastAPI runs synchronous (def) handlers in its thread pool.
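If you want to check the refill arithmetic without sleeping, one option is to pass the clock in as an argument. The sketch below is a stripped-down copy of the refill logic for experimentation only. The class name SimulatedBucket and the now parameter are hypothetical, and the lock is left out, so it is not thread-safe.

```python
class SimulatedBucket:
    """Refill logic from TokenBucket, with the clock passed in (no locking)."""

    def __init__(self, max_tokens: int, refill_rate: int, interval: float, now: float = 0.0):
        self.max_tokens = max_tokens
        self.refill_rate = refill_rate
        self.interval = interval
        self.tokens = max_tokens
        self.refilled_at = now

    def allow_request(self, now: float) -> bool:
        elapsed = now - self.refilled_at
        if elapsed >= self.interval:
            num_refills = int(elapsed // self.interval)
            self.tokens = min(self.max_tokens, self.tokens + num_refills * self.refill_rate)
            # Advance by whole intervals only, so the leftover fraction still counts.
            self.refilled_at += num_refills * self.interval
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = SimulatedBucket(max_tokens=10, refill_rate=2, interval=1.0)
    burst = [bucket.allow_request(now=0.0) for _ in range(12)]
    print("burst at t=0.0:", burst.count(True), "allowed,", burst.count(False), "blocked")
    print("t=0.9:", bucket.allow_request(now=0.9))  # still inside the first interval
    print("t=1.0:", bucket.allow_request(now=1.0))  # one refill of 2 tokens has landed
    print("tokens left:", bucket.tokens)
```

The burst at t=0.0 lets 10 requests through and blocks 2. The request at t=0.9 is still blocked, and the request at t=1.0 succeeds, leaving 1 token in the bucket.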

Note: This implementation stores bucket state in memory. If you restart the server, all buckets reset. For persistence across restarts or multiple server instances, you would store token counts in Redis or a similar external store. The in-memory approach is sufficient for single-instance deployments.

Adding Per-User Rate Limiting Middleware

A single global bucket would throttle all users together. One heavy user could exhaust the limit for everyone. Instead, you'll assign a separate bucket to each user, identified by their IP address.

Add the following to ratelimiter.py, below the TokenBucket class:

# Reuses the `threading` import from the top of ratelimiter.py.


class RateLimiterStore:
    """
    Manages per-user Token Buckets.

    Each unique client key (e.g., IP address) gets its own bucket
    with identical parameters.
    """

    def __init__(self, max_tokens: int, refill_rate: int, interval: float):
        self.max_tokens = max_tokens
        self.refill_rate = refill_rate
        self.interval = interval
        self._buckets: dict[str, TokenBucket] = {}
        self._lock = threading.Lock()

    def get_bucket(self, key: str) -> TokenBucket:
        """
        Return the TokenBucket for a given client key.
        Creates a new bucket if one does not exist yet.
        """
        with self._lock:
            if key not in self._buckets:
                self._buckets[key] = TokenBucket(
                    max_tokens=self.max_tokens,
                    refill_rate=self.refill_rate,
                    interval=self.interval,
                )
            return self._buckets[key]

Now open main.py and replace its contents with the full application, including the rate-limiting middleware:

import time

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

from ratelimiter import RateLimiterStore

app = FastAPI()

# Configure rate limits: 10 requests burst, 2 tokens added every 1 second.
limiter = RateLimiterStore(max_tokens=10, refill_rate=2, interval=1.0)


@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    """
    Middleware that enforces per-IP rate limiting on every request.
    Adds standard rate limit headers to every response.
    """
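    # Identify the client by IP address. `request.client` can be None
    # (for example, in some test or proxy setups), so fall back to a shared key.
    client_ip = request.client.host if request.client else "unknown"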
    bucket = limiter.get_bucket(client_ip)

    # Check if the client has tokens available.
    if not bucket.allow_request():
        retry_after = bucket.get_reset_time() - time.time()
        return JSONResponse(
            status_code=429,
            content={"detail": "Too many requests. Try again later."},
            headers={
                "Retry-After": str(max(1, int(retry_after))),
                "X-RateLimit-Limit": str(bucket.max_tokens),
                "X-RateLimit-Remaining": str(bucket.get_remaining()),
                "X-RateLimit-Reset": str(int(bucket.get_reset_time())),
            },
        )

    # Request is allowed. Process it and add rate limit headers to the response.
    response = await call_next(request)
    response.headers["X-RateLimit-Limit"] = str(bucket.max_tokens)
    response.headers["X-RateLimit-Remaining"] = str(bucket.get_remaining())
    response.headers["X-RateLimit-Reset"] = str(int(bucket.get_reset_time()))
    return response


@app.get("/")
async def root():
    return {"message": "Hello, world!"}


@app.get("/data")
async def get_data():
    return {"data": "Some important information"}


@app.get("/health")
async def health():
    return {"status": "ok"}

The middleware does the following on every incoming request:

Extracts the client's IP address from request.client.host.
Retrieves (or creates) that client's Token Bucket from the store.
Calls allow_request(). If the bucket is empty, it returns a 429 response with a Retry-After header telling the client how long to wait.
If tokens are available, it processes the request normally and attaches rate limit headers to the response.

The three X-RateLimit-* headers follow a widely adopted convention:

Header	Meaning
`X-RateLimit-Limit`	Maximum burst capacity (max tokens)
`X-RateLimit-Remaining`	Tokens left in the current bucket
`X-RateLimit-Reset`	Unix timestamp when the next refill occurs

These headers allow well-behaved clients to self-throttle before hitting the limit.
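On the client side, self-throttling could look something like the sketch below. The function name seconds_to_wait is hypothetical. It assumes Retry-After carries a number of seconds (as the middleware above sends it) rather than an HTTP date, and that header names match exactly. With the requests library, response.headers is case-insensitive, so it can be passed in directly.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(status_code: int, headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long a client should pause before sending its next request."""
    now = time.time() if now is None else now

    # A 429 with Retry-After is the most direct instruction from the server.
    if status_code == 429 and "Retry-After" in headers:
        return float(headers["Retry-After"])

    # Otherwise, pause only when the server reports an empty bucket.
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0

    reset = headers.get("X-RateLimit-Reset")
    if reset is None:
        return 0.0
    return max(0.0, float(reset) - now)


if __name__ == "__main__":
    print(seconds_to_wait(200, {"X-RateLimit-Remaining": "3"}))
    print(seconds_to_wait(429, {"Retry-After": "1"}))
    print(seconds_to_wait(200, {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "105"}, now=100.0))
```

A client would call this after each response and sleep for the returned duration before sending the next request.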

Testing the Rate Limiter

Restart the server if it's not already running:

uvicorn main:app --reload

Manual Testing with curl

Manual testing with curl is useful during development when you want to quickly verify that your middleware is working. A single request lets you confirm that the rate limit headers are present, the values are correct, and one token is consumed as expected.

This approach is fast and requires no additional setup, making it ideal for spot-checking your configuration after making changes.

Send a single request and inspect the response:

curl -i http://127.0.0.1:8000/data

You should see a 200 response with headers like:

HTTP/1.1 200 OK
x-ratelimit-limit: 10
x-ratelimit-remaining: 9
x-ratelimit-reset: 1739836801

Automated Burst Test

While curl confirms that the rate limiter is active, it can't verify that the limiter actually blocks requests when the bucket is empty. For that, you need to send requests faster than the refill rate and observe the 429 responses. An automated burst test is essential before deploying to production, after changing your bucket parameters, or when you need to verify both the blocking and refill behavior.

Create a file called test_ratelimit.py in your project directory:

import requests
import time


def test_burst():
    """Send 15 rapid requests to trigger the rate limit."""
    url = "http://127.0.0.1:8000/data"
    results = []

    for i in range(15):
        response = requests.get(url)
        remaining = response.headers.get("X-RateLimit-Remaining", "N/A")
        results.append((i + 1, response.status_code, remaining))
        print(f"Request {i+1:2d} | Status: {response.status_code} | Remaining: {remaining}")

    print()

    allowed = sum(1 for _, status, _ in results if status == 200)
    blocked = sum(1 for _, status, _ in results if status == 429)
    print(f"Allowed: {allowed}, Blocked: {blocked}")


def test_refill():
    """Exhaust tokens, wait for a refill, then confirm requests succeed again."""
    url = "http://127.0.0.1:8000/data"

    print("\n--- Exhausting tokens ---")
    for i in range(12):
        response = requests.get(url)
        print(f"Request {i+1:2d} | Status: {response.status_code}")

    print("\n--- Waiting 3 seconds for refill ---")
    time.sleep(3)

    print("\n--- Sending requests after refill ---")
    for i in range(5):
        response = requests.get(url)
        remaining = response.headers.get("X-RateLimit-Remaining", "N/A")
        print(f"Request {i+1:2d} | Status: {response.status_code} | Remaining: {remaining}")


if __name__ == "__main__":
    print("=== Burst Test ===")
    test_burst()

    # Allow bucket to refill before next test
    time.sleep(6)

    print("\n=== Refill Test ===")
    test_refill()

Install the requests library if you don't have it:

pip install requests

Run the test:

python test_ratelimit.py

You should see output similar to:

=== Burst Test ===
Request  1 | Status: 200 | Remaining: 9
Request  2 | Status: 200 | Remaining: 8
Request  3 | Status: 200 | Remaining: 7
...
Request 10 | Status: 200 | Remaining: 0
Request 11 | Status: 429 | Remaining: 0
Request 12 | Status: 429 | Remaining: 0
...
Request 15 | Status: 429 | Remaining: 0

Allowed: 10, Blocked: 5

The first 10 requests succeed (one token each from the full bucket). Requests 11 through 15 are rejected because the bucket is empty. The refill test then confirms that after waiting, tokens reappear and requests succeed again.

Note: The exact split between allowed and blocked requests may vary slightly due to timing. Tokens may refill between rapid requests. This is expected behavior.

Where Rate Limiting Fits in Your Architecture

The implementation in this tutorial runs inside your application process, which is the simplest approach and works well for single-instance deployments. In larger systems, rate limiting typically appears at multiple layers:

API gateway level (NGINX, Kong, Traefik, Envoy): A coarse global rate limit applied to all traffic before it reaches your application. This protects against large-scale abuse and DDoS.
Application level (this tutorial): Fine-grained per-user or per-endpoint limits inside your service. This is useful for enforcing different quotas on different API tiers.
Both: Many production systems combine a gateway-level global limiter with an in-app per-user limiter. The gateway catches the flood and the application enforces business rules.

For multi-instance deployments (multiple server processes behind a load balancer), the in-memory RateLimiterStore won't share state across instances. In that case, replace the in-memory dictionary with Redis. The Token Bucket logic stays the same – only the storage layer changes.

Conclusion

In this guide, you built a Token Bucket rate limiter from scratch and integrated it into a FastAPI application with per-user tracking and standard rate limit response headers. You also tested the implementation to verify that burst capacity and refill behavior work as expected.

The Token Bucket algorithm gives you two straightforward controls, capacity for burst tolerance and refill rate for sustained throughput, which cover the vast majority of rate-limiting needs.

From here, you can extend this foundation by:

Replacing the in-memory store with Redis for multi-instance deployments.
Applying different rate limits per endpoint by creating separate RateLimiterStore instances.
Using authenticated user IDs instead of IP addresses for more accurate client identification.
Adding metrics and logging to track how often clients are being throttled.

How to Build an In-Memory Rate Limiter in Next.js

Orim Dominic Adah — Fri, 09 Jan 2026 19:24:26 +0000

An API rate limiter is a server-side component of a web service that limits the number of API requests a client can make to an endpoint within a period of time. For example, X (formerly known as Twitter) limits the number of tweets that a specific user can make to three hundred every three hours.

Rate limiters enforce the responsible use of APIs by blocking requests that exceed the set usage limits.

By following along with this article, you will:

Learn how rate limiters work
Build an in-memory rate limiter for a Next.js pages router project
Use Artillery to load test the rate limiter for accuracy and resilience

Here’s What We’ll Cover:

Benefits of Rate Limiters
How Rate Limiters Work
Rate Limiting Algorithms
How to Build an In-Memory Rate Limiter
- How The In-Memory Rate Limiter Works
- The Request Handler
The Front End
How to Load Test the Rate Limiter for Resilience with Artillery
Conclusion

To get the most out of this article, you should have experience in building APIs with Next.js pages router, Express, or any other Node.js backend framework that uses middleware.

Benefits of Rate Limiters

Rate limiters control how many requests are allowed within a given time window. They have several benefits you should know about if you’re considering using them.

First, they help prevent the abuse of web servers. Rate limiters guard web servers from overuse that needlessly increases their load. They block excessive requests from Denial of Service (DoS) attacks from bots so that the web service doesn’t crash from unnecessary overload and can continue to be available to legitimate users.

They also help manage the cost of using external APIs. Some API endpoints make requests to external APIs to complete their operations – for example, API endpoints that send emails through an email service provider. When an endpoint relies on paid external APIs and user access to the endpoint is not restricted, excessive usage can drive up costs for the web service. Rate limiters block the excessive usage of endpoints like these, helping to keep costs to a reasonable minimum.

How Rate Limiters Work

Rate limiters work using a three-step mechanism. The process includes tracking requests from specific clients, monitoring their usage, and blocking extra requests once the threshold has been exceeded.

In more detail, rate limiters:

Track requests: Rate limiters take note of API clients that make requests and attributes that are specific to the clients (for example, an IP address or a userId). These specific attributes are references or keys that are used to identify clients.
Monitor usage: Depending on the rate limiting mechanism, rate limiters increase or decrease the metric that is used to determine the threshold of use. For example, within a three-hour time period, Twitter can track and increase the number of times a user makes an API request to the create tweet endpoint.
Ensure threshold compliance: Rate limiters check the threshold of use for every request made. If it has been exceeded, they block the request from accessing the functionality of the API endpoint and respond with a status code of 429.

Rate Limiting Algorithms

You can implement rate limiting using different algorithms based on the requirements of the rate limiter. Each rate limiting algorithm has its merits and demerits. Below are some popular rate limiting algorithms you can play around with.

Fixed Window Algorithm

In the fixed window rate limiting algorithm, the number of requests made within a fixed time period is tracked, and every request increases the tracked request count. If the number of requests allowed within the time frame is exceeded, any extra request that comes in within that time frame is blocked. At the end of the time period, the request count is reset and starts increasing again with every new request.

Its mechanism is easy to understand and it’s memory-efficient. Its challenge is that spikes in traffic close to the start or the end of a time window can allow more requests than permitted.

Sliding Window Algorithm

The sliding window algorithm fixes the issue with the fixed window algorithm where spikes in traffic close to the start or end of a time window can allow more requests than permitted.

It works as follows:

It keeps track of the timestamps of requests in a cache.
When there’s a new request, it removes all timestamps that are older than the start of the current time window and it appends the new request’s timestamp to the cache.
If the count of the requests in the cache is higher than the threshold, the request is blocked. Otherwise, it’s allowed.

Although this algorithm is more accurate than the fixed window algorithm, it consumes more memory because of the storage of timestamps.
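The three steps above can be sketched in a few lines. This is a language-neutral illustration written in Python to keep it short, not code from the project used in this article. The class name SlidingWindowLog and the explicit now argument are assumptions made so the example is deterministic.

```python
from collections import deque


class SlidingWindowLog:
    """Sliding window log: keep recent request timestamps and count them."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.log = deque()  # timestamps of requests, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that are older than the start of the current window.
        window_start = now - self.window
        while self.log and self.log[0] < window_start:
            self.log.popleft()
        # Record this request, as the description above does.
        self.log.append(now)
        # Block when the log now holds more than `limit` entries.
        return len(self.log) <= self.limit


if __name__ == "__main__":
    limiter = SlidingWindowLog(limit=10, window=60.0)
    # 10 requests just before the minute mark, 10 more just after it.
    stamps = [59.0 + i * 0.05 for i in range(10)] + [60.0 + i * 0.05 for i in range(10)]
    allowed = sum(limiter.allow(t) for t in stamps)
    print(f"{allowed} of {len(stamps)} requests allowed")
```

Here only 10 of the 20 requests are allowed, because the window moves with each request instead of resetting at the minute mark. Note that this version also logs blocked requests, following the steps as written, so a client that keeps retrying keeps its own window full. The memory cost is visible too: the log holds one timestamp per recent request.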

Token Bucket Algorithm

In the token bucket algorithm, a bucket that contains a predefined number of tokens is assigned to a user. Tokens are added to the bucket at a predefined rate. For example, 2 tokens may be added every second.

Once the bucket is full, no more tokens are added. Each request consumes one or more tokens, and if the tokens are exhausted, requests are blocked until the bucket has tokens again.

The Token Bucket algorithm has the benefits of being memory efficient, easy to implement, and accurate enough to block extra requests even during a burst in traffic.

In this tutorial, we’ll use the fixed window algorithm to build a rate limiter. We’ll also battle-test it for resilience and accuracy using Artillery.

How to Build an In-Memory Rate Limiter

If you’re a backend developer, you may have noticed that users sometimes abuse the reset password API endpoint in your Next.js application. This is a cause for concern because the API endpoint makes a request to your email service provider to send an email and you get charged for it.

Because of this, you may want to limit the requests that users make to this endpoint so that you can prevent the abuse of the API and save costs. And that’s where a rate limiter comes in.

You can get the code for this tutorial here. You can clone it, install the dependencies with npm install, and run it following the instructions in the README file. You’ll need it to follow along with the rest of this article.

I built the project using Next.js and it uses the pages router. I’ve also built the rate limiter and you can find it here. You can see how to use it in the reset password API endpoint here.

It has a user interface that you can use to test the rate limiter – but let’s dive into the code first.

How The In-Memory Rate Limiter Works

To help you better understand the rate limiter, I've created this diagram. We'll walk through what's happening after:

The src/lib/server/rate-limiter.ts file exports a function called applyRateLimiter which accepts three parameters:

the request object
the response object
getOptsFn

getOptsFn is a function that accepts the request object and, when executed, returns properties specific to the request for tracking, monitoring, and blocking by the rate limiter. getOptsFn is a function and not a static object so that the specific properties of a request can be dynamically created by the request handler for each request.

src/lib/server/rate-limiter.ts also has an in-memory map called cache. cache stores the key (or unique identifier) of a request and maps it to its usage. An interval runs every minute to remove keys from the cache once their expiresAt time has passed. This helps to manage the amount of memory used by the cache.

type GetOptionsFn = (req: NextApiRequest) => {
  key: string;
  maxTries: number;
  expiresAt: Date;
};

const cache = new Map();

// clear stale keys from cache every minute
setInterval(() => {
  const currentDate = new Date();
  for (const [key, usage] of cache) {
    if (!usage) continue;

    if (currentDate > usage.expiresAt) {
      cache.delete(key);
    }
  }
}, 60000);

When the rate limiter is executed, it calls getOptsFn to generate the following values, based on the content of the request:

key: The unique identifier for the request that can be used to track its usage
maxTries: The maximum number of times a request can be made within the specified time window
expiresAt: The expiry time of a time window

  const opts = getOptsFn(req);
  const usage = cache.get(opts.key);

  if (!usage) {
    cache.set(opts.key, {
      tries: 1,
      maxTries: opts.maxTries,
      expiresAt: opts.expiresAt,
    });

    return;
  }

The rate limiter then checks if the key of the request exists in the cache. If it doesn’t, it sets it in the cache, mapping it to the following values:

tries : The number of times that the request has been made without being blocked
maxTries: The maximum number of times that the request should be allowed within the time window without blocking
expiresAt: The expiry time of the time window

It also allows the request to continue by exiting the rate limiter through the return statement. The values set in cache will be used to determine if and when consecutive requests with the same key should be blocked or not.

If the request’s key exists in cache, the rate limiter checks if the number of unblocked tries (usage.tries) from cache is less than the number of allowed usage tries (usage.maxTries). If it evaluates to true, it means that the request has not exceeded its maximum tries. It also checks if the expiry time of the time window stored in cache for the request has elapsed.

The request is not blocked if one of the following conditions evaluates to true:

the request has not exceeded its maximum tries AND its time window has not elapsed
the current time window of the request usage in cache (usage.expiresAt) has elapsed

  const currentDate = new Date();
  const retryAfter = usage.expiresAt.getTime() - currentDate.getTime();
  const timeWindowHasElapsed = retryAfter < 0;
  const canProceed = usage.tries < usage.maxTries && !timeWindowHasElapsed;

  if (canProceed) {
    cache.set(opts.key, {
      ...usage,
      tries: usage.tries + 1,
    });

    return;
  }

  if (timeWindowHasElapsed) { // if usage.expiresAt has elapsed
    cache.set(opts.key, {
      tries: 1,
      maxTries: opts.maxTries,
      expiresAt: opts.expiresAt,
    });

    return;
  }

If canProceed is truthy, the rate limiter increases the number of tries (usage.tries) that the request has in the cache and then allows the request to proceed by exiting the rate limiter using the return statement. If timeWindowHasElapsed is truthy, the rate limiter resets the usage of the request in the cache using values gotten from getOptsFn and then allows the request to proceed. If both conditions are falsy, the request is blocked with a 429 response status code.

  res.setHeader("Retry-After", Math.ceil(retryAfter / 1000)); // whole seconds, as Retry-After expects
  return res.status(429).json({
    error: { message: "Too many requests" },
  });

According to the HTTP specification (RFC 6585), a 429 response may include a Retry-After header to let clients know how long to wait before making a new request. The value was calculated earlier from the time remaining in the current window, converted from milliseconds to seconds, and is set on the response object using res.setHeader.
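To see the three outcomes side by side, here is the same decision flow reduced to a pure function. It is a language-neutral sketch written in Python to keep it short, not the project's code. The names Usage and check are hypothetical, timestamps are plain seconds instead of Date objects, and rounding Retry-After up to whole seconds is an assumption of this sketch.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Usage:
    tries: int
    max_tries: int
    expires_at: float  # end of the time window, in seconds


def check(usage: Optional[Usage], max_tries: int, expires_at: float, now: float) -> Tuple[bool, Usage, int]:
    """Return (allowed, usage_to_store, retry_after_seconds)."""
    if usage is None or now > usage.expires_at:
        # First request for this key, or the old window has ended: start a new window.
        return True, Usage(1, max_tries, expires_at), 0
    if usage.tries < usage.max_tries:
        # Still under the limit inside the current window.
        return True, Usage(usage.tries + 1, usage.max_tries, usage.expires_at), 0
    # Over the limit: report how many whole seconds remain in the window.
    return False, usage, math.ceil(usage.expires_at - now)


if __name__ == "__main__":
    # One try allowed per 5-second window.
    ok, usage, retry = check(None, max_tries=1, expires_at=5.0, now=0.0)
    print("t=0:", ok, retry)
    ok, usage, retry = check(usage, max_tries=1, expires_at=7.0, now=2.0)
    print("t=2:", ok, retry)
    ok, usage, retry = check(usage, max_tries=1, expires_at=11.0, now=6.0)
    print("t=6:", ok, retry)
```

With one try per five seconds, the request at t=0 is allowed, the request at t=2 is blocked with a Retry-After of 3 seconds, and the request at t=6 is allowed again because the first window has ended.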

The Request Handler

You can find the reset password request handler in src/pages/api/reset-password-init.ts. First, it performs validation checks on the request method and body to ensure that it is fit for its operations. The validation ensures that the request is a POST request and that the request body includes an email property. It ends the request with the appropriate response code if validation fails.

  if (req.method !== "POST") {
    return res.status(405).json({
      error: { message: "Not allowed" },
    });
  }

  if (!req.body.email || typeof req.body.email !== "string") {
    return res.status(400).json({
      error: { message: "'email' is required" },
    });
  }

generateOptions is the function that is eventually passed as getOptsFn to the rate limiter. The generateOptions function generates the specific properties of the request for the rate limiter. In the case of this endpoint, the properties are:

key: A string in the format [method].[endpoint].[email]. For an email value of “Hello@me.com”, the key will be post.reset-password.hello@me.com which will be constant for every request for that email to this endpoint. This key value format makes it unique and specific to this request.
expiresAt: The time when the time window expires. If the request is already in the cache, the rate limiter ignores this value and uses the cached value instead.
maxTries: The maximum number of tries that should be allowed within the time window. If the request is already in the rate limiter cache, this value is ignored in favor of the cached value.

  const generateOptions = function (req: NextApiRequest) {
    const now = new Date();
    const inFiveSeconds = new Date(now.getTime() + 5000);

    return {
      expiresAt: inFiveSeconds,
      key: `post.reset-password.${req.body.email.toLowerCase()}`,
      maxTries: 1,
    };
  };

For the reset password handler, requests are rate limited to one every five seconds. You can tweak the expiresAt and maxTries values to test how it works. applyRateLimiter is executed with its arguments and if it does not block the request, the handler can go on to send the mail and respond to the client.

The Front End

You can also test the rate limiter manually through the user interface. After running npm run dev, visit the URL shown (http://localhost:3000 by default). You should see the user interface for testing the rate limiter.

How to Load Test the Rate Limiter for Resilience with Artillery

Artillery is a tool for testing and reporting how well web applications can perform under heavy load. In this section, you will use Artillery to test how efficient and accurate the rate limiter that you built is.

To use Artillery, install it globally with the npm install -g artillery@latest command so that the artillery command is available from the CLI.

The Load Test Configuration

In the loadtest folder located at the root of the project, you will find the setup.yaml file. It contains the instructions for Artillery to use to carry out the load test. The instructions tell Artillery to create virtual users that will make API requests to the application with the base URL identified by target in three phases:

Warm up: Make API requests for ten seconds, starting at one request per second and increasing to five requests per second.
Ramp up: After warm up, make API requests for thirty seconds, starting at five requests per second and increasing to ten requests per second.
Spike phase: After ramp up, make API requests for twenty seconds, starting at ten requests per second and increasing to thirty requests per second.

This brings the total time of the load test to sixty seconds.

config:
  target: http://localhost:3000/api

  phases:
    - duration: 10
      arrivalRate: 1
      rampTo: 5
      name: Warm up

    - duration: 30
      arrivalRate: 5
      rampTo: 10
      name: Ramp up

    - duration: 20
      arrivalRate: 10
      rampTo: 30
      name: Spike phase
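
To get a rough sense of how much traffic these phases produce, the arithmetic can be sketched in TypeScript. This assumes each ramp is linear, so the number of arrivals is approximately the average rate multiplied by the duration. The Phase type and estimateArrivals function are hypothetical helpers, and the number of virtual users Artillery actually creates may differ somewhat from this estimate.

```typescript
// Hypothetical mirror of the phases in setup.yaml, for estimation only.
interface Phase {
  name: string;
  duration: number; // seconds
  arrivalRate: number; // arrivals per second at the start of the phase
  rampTo: number; // arrivals per second at the end of the phase
}

const phases: Phase[] = [
  { name: "Warm up", duration: 10, arrivalRate: 1, rampTo: 5 },
  { name: "Ramp up", duration: 30, arrivalRate: 5, rampTo: 10 },
  { name: "Spike phase", duration: 20, arrivalRate: 10, rampTo: 30 },
];

// Assuming a linear ramp: average rate multiplied by duration.
function estimateArrivals(phase: Phase): number {
  return ((phase.arrivalRate + phase.rampTo) / 2) * phase.duration;
}

const totalSeconds = phases.reduce((sum, p) => sum + p.duration, 0);
const totalArrivals = phases.reduce((sum, p) => sum + estimateArrivals(p), 0);

for (const p of phases) {
  console.log(`${p.name}: about ${estimateArrivals(p)} arrivals`);
}
console.log(`Total: about ${totalArrivals} arrivals in ${totalSeconds} seconds`);
```

Under these assumptions the estimate is about 30 + 225 + 400 = 655 arrivals over sixty seconds, which is far more traffic than a limit of one request every five seconds lets through.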

The plugins section configures extensions that you can use to analyse the results from Artillery and get reports. For example, the ensure plugin is set up here with two thresholds: it reports “OK” for a check if the 99th percentile of response times is below 100ms, and likewise if the 95th percentile is below 75ms.

  plugins:
    ensure:
      thresholds:
        - http.response_time.p99: 100
        - http.response_time.p95: 75

The metrics-by-endpoint plugin (not used in this project) is another Artillery plugin that is used to display response time metrics for each URL in the test.

A scenario is a sequence of steps that describes a virtual user session in the app. Each virtual user created in phases will make an API request to the endpoint in flow, and the requests in the loop will run only once per virtual user (because the loop count has a value of 1).

scenarios:
  - flow:
      - loop:
          - post:
              url: "/reset-password-init"
              headers:
                Content-Type: "application/json"
              json:
                email: "j.doe@email.com"

        count: 1

Run the Load Test

Make sure that the application is running and run the load test with the command artillery run loadtest/setup.yaml --output loadtest/results.json from the root folder of the project. This will run the load test on the rate-limited endpoint and save the output of the results in loadtest/results.json.

Review the Results

Regardless of the number of requests made, the setup of our rate limiter allows only one request every five seconds. This means that the number of requests that should be allowed within a span of sixty seconds is twelve (60 / 5 = 12).

If you take a look at loadtest/results.json, you will see that only twelve requests had a status code of 200. If you increase the values of arrivalRate or rampTo in any or all of the phases and run the load test again, there will still be only twelve requests with a status code of 200. This means that our rate limiter remains effective and accurate even under high load.
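
The reasoning behind the count of twelve can be checked with a small simulation. This sketch models a fixed window that opens on the first request after the previous window has expired, which is an assumption about how the limiter behaves. It also feeds in idealized, evenly spaced arrivals rather than Artillery's real traffic. countAllowed and evenArrivals are hypothetical helpers.

```typescript
// Counts how many requests a fixed-window limiter would let through.
// arrivalsMs must be sorted in ascending order (milliseconds from test start).
function countAllowed(arrivalsMs: number[], windowMs: number, maxTries: number): number {
  let windowEndsAt = -Infinity;
  let tries = 0;
  let allowed = 0;

  for (const t of arrivalsMs) {
    // The previous window has expired: this request opens a new one.
    if (t >= windowEndsAt) {
      windowEndsAt = t + windowMs;
      tries = 0;
    }
    if (tries < maxTries) {
      tries += 1;
      allowed += 1;
    }
  }
  return allowed;
}

// Evenly spaced arrivals at a constant rate (an idealization).
function evenArrivals(perSecond: number, totalSeconds: number): number[] {
  const gapMs = 1000 / perSecond;
  return Array.from({ length: perSecond * totalSeconds }, (_, i) => i * gapMs);
}

console.log(countAllowed(evenArrivals(10, 60), 5000, 1)); // 12
console.log(countAllowed(evenArrivals(100, 60), 5000, 1)); // 12
```

Raising the arrival rate tenfold does not change the result, because each five-second window admits one request no matter how many arrive. In a real run, timing jitter decides exactly when each window opens, so the simulation is only an approximation of what Artillery reports.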

For latency, you should consider the report of the ensure plugin which is logged to the terminal at the end of the test. A result such as:

Checks:
ok: http.response_time.p95 < 75
ok: http.response_time.p99 < 100

means that 95% of all requests made had a latency of less than 75 milliseconds and 99% of all requests made had a latency of less than 100 milliseconds. These are good results.
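
To make the meaning of p95 and p99 concrete, here is a minimal sketch of a percentile using the nearest-rank method. It is for illustration only and is not necessarily how Artillery computes its percentiles. The percentile function and the sample latencies are hypothetical.

```typescript
// Nearest-rank percentile: the smallest value such that at least
// p percent of the samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error("percentile requires at least one value");
  }
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up latencies in milliseconds, not real test output.
const latenciesMs = [12, 15, 9, 30, 22, 18, 11, 40, 14, 16];

const p95 = percentile(latenciesMs, 95);
const p99 = percentile(latenciesMs, 99);

// Mirrors the two ensure thresholds from setup.yaml.
console.log(`p95 = ${p95}ms, check passes: ${p95 < 75}`);
console.log(`p99 = ${p99}ms, check passes: ${p99 < 100}`);
```

A check such as http.response_time.p95 < 75 passes when the value at the 95th percentile is below the threshold, so a small number of slow outliers above that percentile do not fail it.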

Conclusion

In this article, you have learned about rate limiters, rate limiting algorithms, and how to build and use an in-memory rate limiter in Next.js.

You also got a brief introduction to load testing with Artillery. Be sure to apply what you have learned in one of your Next.js projects when you need it.

Feel free to connect with me on LinkedIn for questions or clarifications. Thank you for reading this far and I hope this helps you achieve what you intended to achieve. Don’t hesitate to share this article if you feel that it would help someone else out there. Cheers!