You ship your first LLM-powered feature. It works and the users love it. A second team adds another feature calling a different model, and a third integrates a completely different provider.

Six months later, you have fourteen microservices, each holding its own API keys, writing its own retry logic, and failing in its own unique way.

Nobody knows how much you're spending on tokens or which service is hammering the rate limit. And when OpenAI goes down, everything goes down with it.

That scenario plays out across engineering teams every single day, and the root cause is almost always the same: moving fast with LLMs while skipping the infrastructure thinking that holds everything together at scale.

Fortunately, a well-established architectural pattern solves exactly these problems. If you already run Kubernetes, you're more than halfway to implementing it. That pattern is called the LLM Gateway Pattern, and this article walks you through what it is, why it matters, and how to put it into practice.

What Is the LLM Gateway Pattern?

The LLM Gateway Pattern is an architectural approach where all LLM API traffic from your applications flows through a single, centralized proxy service before reaching any external provider. Think of it as the AI equivalent of an API gateway, except it's purpose-built for the unique challenges that come with language models: token budgets, streaming responses, model routing, semantic caching, and multi-provider fallback.

Instead of every service in your cluster talking directly to OpenAI or Anthropic, they all talk to one internal gateway. That gateway handles authentication, routing, rate limiting, logging, and failover. Your application services stay clean and focused on business logic, while the gateway takes on all the messy operational concerns of working with LLMs at scale.

The pattern itself is not new in concept. Engineers have used API gateways for years to manage REST traffic. What makes LLM gateways distinct is that they understand the specific shape of LLM requests, including token counts, model parameters, prompt structure, and streaming semantics.

How It Works

The core components of an LLM Gateway on Kubernetes are straightforward. Here is the high-level flow:

[Figure: how the LLM Gateway works on Kubernetes]

App Pods send requests to the gateway using a standard OpenAI-compatible API format. Because of this, most existing LLM client libraries work without code changes; you only change the base URL to point at your internal gateway service.

The Gateway Service receives each incoming request, authenticates the caller, applies any configured rate limits, checks the cache, selects the appropriate upstream provider based on routing rules, and forwards the request. On the way back, it logs token usage and latency before returning the response to the caller.

ConfigMap holds the routing rules. Which model should handle requests tagged as fast? Which provider should the system fall back to if the primary one is unavailable? All of this lives in configuration, not code, so you can update routing behaviour without rebuilding or redeploying any application; at most, the gateway itself needs to reload its configuration.

Secrets hold the actual API keys for each provider. The gateway is the only service in the cluster that needs access to them. Application pods never touch provider credentials directly.

Provider endpoints are the actual LLM APIs: OpenAI, Anthropic, a self-hosted vLLM instance running in your cluster, or any other provider that exposes an OpenAI-compatible interface.
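
To make the App Pods item above concrete, here is a minimal sketch of an application Deployment that points an OpenAI SDK at the gateway purely through environment variables. The application name, namespace, image, and the `gateway-client-key` Secret are all hypothetical, and the URL assumes a Service called `llm-gateway` in the `ai-platform` namespace listening on port 80. The official OpenAI Python and Node SDKs read `OPENAI_BASE_URL` and `OPENAI_API_KEY` from the environment, so clients built on them should need no code change; other libraries may need the base URL passed explicitly.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: summarizer            # hypothetical application
  namespace: team-a           # hypothetical namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: summarizer
  template:
    metadata:
      labels:
        app: summarizer
    spec:
      containers:
        - name: app
          image: registry.example.com/summarizer:1.0.0   # placeholder image
          env:
            # Send OpenAI-format traffic to the in-cluster gateway instead of api.openai.com.
            - name: OPENAI_BASE_URL
              value: "http://llm-gateway.ai-platform.svc.cluster.local/v1"
            # A gateway-issued key, not a provider key; the Secret name is an assumption.
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: gateway-client-key
                  key: api-key
```

The application never sees a provider credential, which is the property the Secrets item above relies on.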

The Problem Without a Gateway

To appreciate why this pattern matters, it helps to look at what happens when you skip it.

1. Scattered Secrets and No Central Control

Every service that calls an LLM needs an API key. In Kubernetes, this usually means creating a Secret per namespace or per deployment.

When that key rotates or gets compromised, you're hunting through dozens of manifests to update it. There's no single place to revoke access or audit who is calling what.

2. No Visibility into Cost or Usage

LLM APIs charge per token. Without a centralized layer collecting usage data, you have no reliable way to know which service is responsible for that spike in your monthly bill.

3. Provider Lock-in at the Application Level

When you hardcode https://api.openai.com into your service, switching to a different provider or routing certain requests to a cheaper model becomes a code change. You need to redeploy your application just to change which model handles a request type.

4. No Caching

Many LLM applications send semantically similar or identical prompts repeatedly. Without a shared caching layer, each of those requests incurs full token costs and full latency, even when an earlier response could have been reused. The savings from even basic caching can be significant.

All of these problems compound as your team grows and more services start calling LLMs. The gateway pattern cuts through all of them in one architectural decision.

Deploying an LLM Gateway on Kubernetes

Several tools can serve as an LLM gateway for workloads running on Kubernetes. LiteLLM Proxy, Portkey, and Envoy with custom filters can run inside your cluster, while OpenRouter is a hosted service that plays a similar role from outside it.

For the rest of this walkthrough, we'll use LiteLLM Proxy. It ships with a Helm chart, supports over a hundred models across all major providers, and comes with a management UI that makes initial configuration straightforward.

Storing API Keys Securely

Start by creating a Kubernetes Secret that holds your provider API keys. Your gateway pods will consume these credentials as environment variables, which means no provider key ever needs to live inside your application containers:

apiVersion: v1
kind: Secret
metadata:
  name: llm-gateway-secrets
  namespace: ai-platform
type: Opaque
stringData:
  # Placeholder values; supply the real keys from your secret manager rather than committing them.
  OPENAI_API_KEY: "sk-..."
  ANTHROPIC_API_KEY: "sk-ant-..."

Defining Routing Rules in a ConfigMap

The routing configuration tells the gateway which models are available and how to reach each one. Keeping this in a ConfigMap means you can update your routing rules without touching a single line of application code:

apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-gateway-config
  namespace: ai-platform
data:
  config.yaml: |
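    model_list:
      # model_name is the alias callers request; litellm_params.model is the upstream provider/model.
      - model_name: gpt-4o
        litellm_params:
          model: openai/gpt-4o
          # os.environ/NAME reads the value from the gateway's environment (populated from the Secret).
          api_key: os.environ/OPENAI_API_KEY
      - model_name: claude-sonnet
        litellm_params:
          model: anthropic/claude-sonnet-4-20250514
          api_key: os.environ/ANTHROPIC_API_KEY
      # "fast" is a routing alias; repoint it at another model without changing any caller.
      - model_name: fast
        litellm_params:
          model: openai/gpt-4o-mini
          api_key: os.environ/OPENAI_API_KEY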

Once the gateway runs behind a Service named llm-gateway in the ai-platform namespace with this configuration loaded, any application in your cluster can reach it at http://llm-gateway.ai-platform.svc.cluster.local using the standard OpenAI client format, regardless of which actual provider sits behind it.
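
The Secret and ConfigMap above still need a workload to consume them. What follows is a minimal sketch of a Deployment and Service that would do so, not the layout the LiteLLM Helm chart produces. The image tag, the `/etc/litellm` mount path, container port 4000, and the resource figures are assumptions; the object names match the manifests in this walkthrough.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-gateway
  namespace: ai-platform
spec:
  # replicas is left unset so an autoscaler can own the replica count.
  selector:
    matchLabels:
      app: llm-gateway
  template:
    metadata:
      labels:
        app: llm-gateway
    spec:
      containers:
        - name: litellm
          # Assumed image; pin a specific release tag in production.
          image: ghcr.io/berriai/litellm:main-latest
          args: ["--config", "/etc/litellm/config.yaml", "--port", "4000"]
          ports:
            - name: http
              containerPort: 4000
          envFrom:
            # Exposes OPENAI_API_KEY and ANTHROPIC_API_KEY to the gateway only.
            - secretRef:
                name: llm-gateway-secrets
          resources:
            requests:
              cpu: 500m       # illustrative; utilization-based autoscaling needs a CPU request
              memory: 512Mi   # illustrative
          volumeMounts:
            - name: config
              mountPath: /etc/litellm
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: llm-gateway-config   # provides config.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: llm-gateway
  namespace: ai-platform
  labels:
    app: llm-gateway
spec:
  selector:
    app: llm-gateway
  ports:
    - name: http
      port: 80           # lets callers omit the port in the URL
      targetPort: http   # forwards to the container's named port
```

Mapping Service port 80 to the container port is what allows the port-less URL above to work. Because the ConfigMap is mounted as a file, a change to the routing rules will typically need a rolling restart of the gateway pods before it takes effect.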

Scaling the Gateway

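As long as shared state such as cache entries and rate-limit counters lives in an external store like Redis, the gateway pods themselves are stateless and horizontal scaling is straightforward. You can attach a HorizontalPodAutoscaler that scales on CPU utilization, as shown here, or on request rate if you expose that as a custom metric. CPU-based scaling only works when the gateway containers declare CPU requests: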

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-gateway-hpa
  namespace: ai-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-gateway
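  minReplicas: 2   # keep at least two pods so one can restart without an outage
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # average CPU usage as a percentage of each pod's CPU request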

Wiring Up Observability

A gateway you can't observe is a gateway you can't trust, so wiring up monitoring before you go to production is worth the extra hour it takes.

LiteLLM exposes a /metrics endpoint in Prometheus format. You can scrape it with a standard ServiceMonitor if you run the Prometheus Operator, or configure Prometheus directly to target the gateway service.
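
For the Prometheus Operator route, a ServiceMonitor along these lines is one way to wire up the scrape. It is a sketch under several assumptions: the gateway Service carries the label `app: llm-gateway` and names its port `http`, and your Prometheus instance selects ServiceMonitors labelled `release: prometheus` (that label depends entirely on how your Operator was installed). Depending on the LiteLLM version, the Prometheus integration may also need to be switched on in the gateway config, for example through a `callbacks: ["prometheus"]` entry under `litellm_settings`, so check the documentation for the release you run.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-gateway
  namespace: ai-platform
  labels:
    release: prometheus     # assumption: must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: llm-gateway      # assumption: label on the gateway Service
  endpoints:
    - port: http            # assumption: the named port on the gateway Service
      path: /metrics
      interval: 30s         # illustrative scrape interval
```

If the ServiceMonitor lives in a different namespace from the Service, add a `namespaceSelector` so the Operator knows where to look.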

The metrics that matter most in day-to-day operations are token throughput per model, request latency percentiles, error rates per provider, and cache hit ratio.

Once Prometheus is collecting that data, you can build Grafana dashboards that show token spend broken down by caller, model, and time period. This gives engineering managers and finance teams the cost visibility they've been asking for, and it takes surprisingly little effort to set up once the metrics pipeline is in place.

If you run an OpenTelemetry collector in your cluster, you can also configure the gateway to emit trace spans for every LLM request. This lets you see the full latency breakdown from the moment a user action triggers a call in your application all the way through to the provider response. So when something is slow, you can tell immediately whether the bottleneck sits in your service, the gateway, or upstream with the provider.

Features of an LLM Gateway

Not all gateway implementations are equal, so as your needs grow, these are the core capabilities worth evaluating.

Multi-Provider Routing

A well-built gateway routes requests to different providers based on declarative, configurable rules that live entirely outside your application code. This means that changing which model serves a request never requires redeploying the application that sends it.

Semantic Caching

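Rather than only caching byte-for-byte identical prompts, a semantic cache uses embedding similarity to recognise when two different prompts are asking essentially the same thing. For workloads with many near-duplicate prompts this can cut redundant API calls substantially, at the cost of occasionally serving a cached answer to a prompt that only looked similar, so the similarity threshold needs tuning.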
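
As a rough illustration with LiteLLM Proxy, semantic caching is switched on in the gateway's config.yaml. The fragment below is a sketch based on my reading of LiteLLM's caching options, so verify the key names against the version you deploy. The `embedding` alias, the embedding model behind it, and the 0.8 threshold are assumptions chosen for the example, and it presumes a Redis instance with vector search support whose connection details the gateway reads from its environment.

```yaml
# Fragment to merge into the gateway's config.yaml (not a complete file).
model_list:
  - model_name: embedding                      # hypothetical alias used only by the cache
    litellm_params:
      model: openai/text-embedding-3-small     # assumed embedding model
      api_key: os.environ/OPENAI_API_KEY

litellm_settings:
  cache: true
  cache_params:
    type: redis-semantic                       # semantic rather than exact-match caching
    similarity_threshold: 0.8                  # illustrative; higher values mean stricter matching
    redis_semantic_cache_embedding_model: embedding   # must match a model_name above
```

A lower threshold raises the hit rate but also the chance of returning an answer to a different question, so it is worth starting strict and loosening it only after reviewing real cache hits.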

Rate Limiting Per Consumer

The gateway should let you set token budgets and request limits per team, per namespace, or per application, so no single runaway service can starve the rest of your cluster or drive up costs unchecked.

Fallback and Failover

When a primary provider fails or exceeds acceptable latency thresholds, the gateway should automatically retry against a configured fallback. This centralizes logic that is notoriously hard to get right inside individual services.
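
With LiteLLM Proxy, this behaviour is declared in the gateway config rather than coded into each service. The fragment below is a sketch that reuses the `gpt-4o`, `fast`, and `claude-sonnet` aliases from the ConfigMap earlier in this article; the retry count and timeout are illustrative values, and the exact section these keys belong in (`litellm_settings` versus `router_settings`) has varied across LiteLLM versions, so treat the placement as an assumption to confirm.

```yaml
# Fragment to merge into the gateway's config.yaml (not a complete file).
litellm_settings:
  num_retries: 2          # illustrative: retries against the same model before falling back
  request_timeout: 30     # illustrative: seconds before a slow upstream call counts as failed
  fallbacks:
    # If the key's model fails, try the listed aliases in order.
    - gpt-4o: ["claude-sonnet"]
    - fast: ["claude-sonnet"]
```

Falling back across providers means responses may differ in style and cost, so it helps to choose fallback models whose output your applications can tolerate.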

Token Usage Tracking

Every request should produce a detailed usage record capturing input tokens, output tokens, model, caller identity, and latency. This gives engineering managers the clear, actionable picture of AI spending they need.

Wrapping Up

The LLM Gateway Pattern solves a set of operational problems that every team building on language models at scale will eventually run into. Scattered secrets, invisible costs, inconsistent failure handling, and provider lock-in are all symptoms of the same underlying issue: infrastructure concerns leaking into services that shouldn't have to deal with them.

A centralized gateway on Kubernetes gives your application teams a stable, provider-agnostic interface while giving your platform team the visibility and controls they need to manage cost and reliability effectively. When a provider goes down in the middle of the night, your configured fallback kicks in automatically instead of someone waking up to a page.

Start with LiteLLM Proxy, wire up the Prometheus metrics, build a simple Grafana dashboard, and watch how quickly the pattern pays for itself. Once you have seen what centralized LLM traffic management looks like in practice, it becomes very hard to go back to doing it any other way.