Jessica Patel - freeCodeCamp.org

How to Design a Type-Safe, Lazy, and Secure Plugin Architecture in React

Jessica Patel — Mon, 30 Mar 2026 15:00:00 +0000

Modern web applications increasingly need to evolve faster than a single team can maintain a monolithic codebase. Product teams often want to add features independently, experiment with new capabilities, or deploy domain-specific functionality without modifying the core application every time. This is where a plugin architecture becomes valuable.

A plugin architecture allows an application to load external modules that extend its functionality at runtime. Instead of embedding every feature directly in the core application, the system exposes a controlled interface (the host API) that plugins use to integrate with the platform. These plugins can register UI components, contribute functionality, or interact with application services while remaining isolated from the core codebase.

This architectural pattern is widely used across software ecosystems. Platforms such as IDEs, content management systems, and browser extensions rely on plugins to allow third-party developers to extend their functionality without compromising stability.

In a web application context, a similar approach allows large frontend systems to evolve modularly, enabling multiple teams to ship features independently.

In this tutorial, you'll learn how to design a type-safe, lazy-loaded, and secure plugin architecture in React — complete with lifecycle management, independent bundling, hot-loading, and real TypeScript examples.

By the end, you'll have everything you need to transform your React application into a modular platform capable of hosting independent extensions without sacrificing maintainability, performance, or security.

A Common Pain Point: Scaling Frontend Platforms
What This Article Will Cover
Prerequisites
Why a Plugin Architecture?
Core Concepts of a React Plugin Architecture
High-Level Architecture of a React Plugin System
Real TypeScript Example: A Chat Plugin
Best Practices
When NOT to Use a Plugin Architecture
Future Enhancements
Conclusion

A Common Pain Point: Scaling Frontend Platforms

Consider a large internal admin dashboard used by multiple teams across an organization. Each team wants to add its own functionality, like analytics dashboards, workflow management tools, user administration panels, and domain-specific reporting modules.

If all these features are implemented directly in the main React application, several problems quickly emerge. Merge conflicts in the core repository become frequent, unrelated features grow tightly coupled, and release cycles slow down because every change requires redeploying the entire application. Worse, adding new features carries a constant risk of breaking existing functionality.

A plugin architecture solves this problem by allowing each feature to be developed as an independent plugin. The host application provides a stable platform and a controlled API, while teams can ship their own plugins without modifying the core system.

What This Article Will Cover

This guide walks you through how to design a type-safe, lazy-loaded, and secure plugin architecture in React using TypeScript. You'll learn how to design a host API that plugins can safely interact with, how to define a plugin lifecycle for initialization, mounting, updates, and cleanup, and how to bundle plugins independently so they can be developed and deployed separately.

You'll also learn how to lazy-load plugins at runtime to improve performance, how to implement a security model that prevents plugins from accessing sensitive application state, and how to enable hot-loading during development while enforcing safety through CI/CD pipelines.

By the end of this article, you'll understand how to build a flexible plugin system that allows your React application to grow into a modular platform capable of hosting independent extensions without sacrificing maintainability, performance, or security.

Prerequisites

Before following along with this guide, you should be familiar with several core technologies and concepts used throughout the examples.

React Fundamentals
A basic understanding of React components, hooks, and JSX is required. The examples assume familiarity with functional components, useState, and useEffect.

TypeScript Basics
Since the plugin architecture relies heavily on type contracts between the host application and plugins, you should understand TypeScript interfaces, generics, and module exports.

Modern JavaScript Modules
Knowledge of ES modules (import / export) and dynamic imports will help when working with lazy-loaded plugins.

React Tooling (Vite or Webpack)
The examples reference modern frontend build tools such as Vite. Familiarity with how bundlers compile React applications and manage dependencies will help when configuring plugin builds.

Basic Web Security Concepts
Some sections discuss sandboxing and restricted APIs. A general understanding of browser security concepts such as iframes, same-origin policies, and API boundaries is helpful but not strictly required.

Why a Plugin Architecture?

Imagine you're building an internal admin platform where multiple teams need to ship independent features as plugins without risking the core application. A plugin architecture allows each team to contribute functionality safely, while the host maintains type safety, security, and performance.

This guide targets React/TypeScript engineers who want to design a plugin system capable of hosting third-party extensions without compromising maintainability.

The benefits of this approach are significant. Extensibility means developers or third parties can add features without touching core code. Isolation allows plugins to be sandboxed so they can't affect unrelated parts of the application. Lazy loading ensures only the features a user actually needs are fetched, keeping the application fast. TypeScript enforces a strict contract between plugins and the host, catching errors at compile time rather than at runtime. Finally, controlled APIs and permission boundaries prevent malicious or poorly written plugins from interfering with the rest of the system.

A well-architected plugin system balances all of these qualities – flexibility, safety, and maintainability – without forcing unnecessary trade-offs between them.

Core Concepts of a React Plugin Architecture

Before diving into code, it helps to understand the key building blocks that make up a React plugin system.

At a high level, a plugin architecture in React revolves around five concerns.

The Host API is the interface the core application exposes to plugins.
The Plugin Lifecycle defines methods for initialization, mounting, updating, and cleanup.
Bundling means compiling each plugin separately to avoid coupling it to the host.
The Security Model covers permissions and sandboxing to prevent misuse.
Finally, Hot-loading and CI streamline the development and deployment experience.

We'll explore each of these concepts in detail in the sections that follow. First, let's look at how they fit together visually.

High-Level Architecture of a React Plugin System

The following diagram illustrates how the host application interacts with independently bundled plugins. The host exposes a controlled API, loads plugins dynamically, and manages their lifecycle while maintaining security boundaries.

The core application serves as the runtime environment for all plugins, housing the plugin loader, lifecycle manager, and the host API.

The plugin loader dynamically imports plugin bundles at runtime using import(), while the host API ensures plugins interact with the application through a controlled interface rather than accessing internal state directly.

Each plugin is compiled as a separate bundle and registers itself with the host during initialization. A dedicated security layer enforces all of these boundaries, ensuring plugins cannot directly manipulate internal state or sensitive resources.

Together, these pieces ensure that plugins remain independent, lazy-loadable, and secure, while the host application retains full control over lifecycle management and platform stability.

Real TypeScript Example: A Chat Plugin

Now that you have a mental model of the architecture, let's look at a minimal working example before diving into each concept individually. This example demonstrates how a plugin registers itself with the host application and exposes a UI component through the host API.

The following plugin implements a simple chat feature that registers a React component with the host platform.

Chat Plugin Implementation

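// plugins/chat-plugin/src/plugin.ts

import { createElement } from 'react';
import type { Plugin } from '../../../src/plugins/plugin';
import type { HostAPI } from '../../../src/plugins/host';

const ChatPlugin: Plugin = {
  name: 'ChatPlugin',
  version: '1.0.0',
  init(host: HostAPI) {
    // createElement keeps this file valid as plain .ts; JSX would require a .tsx extension.
    host.registerComponent('Chat', () =>
      createElement('div', null, 'Welcome to the Chat Plugin!'),
    );
    host.log('ChatPlugin initialized');
  },
};

export default ChatPlugin;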

Host Application Usage

The host application loads the plugin and renders the component it registered.

const Chat = hostAPI.getComponent('Chat');

return (
  <div>
    {Chat ? <Chat /> : 'Loading Chat Plugin...'}
  </div>
);

In this example, the plugin doesn't directly modify the host application. Instead, it interacts through the Host API, registering a component that the host can render dynamically. The sections below break down exactly how each piece of this system is built.

1. How to Define the Host API

The host API is the contract between the core application and its plugins: it defines exactly which functionality plugins can access. Before plugins can do anything useful, the host must expose this controlled interface.

Example: TypeScript Host API

// src/plugins/host.ts

import type React from 'react';

export interface HostAPI {
  // ComponentType (rather than FC) accepts both class and function components
  registerComponent: (name: string, component: React.ComponentType<any>) => void;
  getComponent: (name: string) => React.ComponentType<any> | undefined;
  log: (message: string) => void;
}

// Note: We still use `any` for props here for extensibility; plugins can define stricter props locally if needed.

// Module-private registry; plugins can only reach it through the host API.
const componentRegistry: Record<string, React.ComponentType<any>> = {};

export const hostAPI: HostAPI = {
  registerComponent(name, component) {
    console.log(`Registered component: ${name}`);
    componentRegistry[name] = component;
  },
  getComponent(name) {
    return componentRegistry[name];
  },
  log(message) {
    console.log(`[PLUGIN LOG]: ${message}`);
  },
};

This API allows plugins to register UI components and log messages, without giving them unrestricted access to the application state.
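One way to tighten this contract further is to hand each plugin its own host API instance, so that registrations are namespaced and log lines can be attributed to the plugin that wrote them. The sketch below is a minimal illustration under stated assumptions: `createScopedHost`, `ScopedHostAPI`, and `Renderable` are hypothetical names, and `Renderable` stands in for `React.ComponentType` only so the example runs without React installed.

```typescript
// Stand-in for React.ComponentType so the sketch has no dependencies.
type Renderable = () => unknown;

interface ScopedHostAPI {
  registerComponent: (name: string, component: Renderable) => void;
  getComponent: (name: string) => Renderable | undefined;
  log: (message: string) => void;
}

// Shared state owned by the host; plugins never receive these directly.
const registry = new Map<string, Renderable>();
const logLines: string[] = [];

// Hypothetical factory: builds a host API bound to a single plugin name.
function createScopedHost(pluginName: string): ScopedHostAPI {
  const key = (name: string) => `${pluginName}/${name}`;
  return {
    registerComponent(name, component) {
      // Reject duplicates so one registration cannot silently replace another.
      if (registry.has(key(name))) {
        throw new Error(`Component already registered: ${key(name)}`);
      }
      registry.set(key(name), component);
    },
    getComponent: (name) => registry.get(key(name)),
    log(message) {
      logLines.push(`[${pluginName}] ${message}`);
    },
  };
}

const chatHost = createScopedHost('ChatPlugin');
const adminHost = createScopedHost('AdminPlugin');
chatHost.registerComponent('Panel', () => 'chat panel');
adminHost.registerComponent('Panel', () => 'admin panel');
chatHost.log('ready');
```

Because every key is prefixed with the plugin name, two plugins can both register a component called `Panel` without overwriting each other.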

2. How to Define the Plugin Lifecycle

A plugin lifecycle ensures consistent behavior across all extensions. Once the host API exists, plugins need a structured way to initialize, render, and clean up resources.

Lifecycle Interface

// src/plugins/plugin.ts

import { HostAPI } from './host';

export interface Plugin {
  name: string;
  version: string;
  init: (host: HostAPI) => void;
  mount?: () => void;
  update?: () => void;
  unmount?: () => void;
}

// Typically, the host calls mount/update/unmount based on route changes, feature flags, or user interactions.

The init method is called when the plugin is first loaded and receives the host API as its argument. mount is called when the plugin's UI is displayed, while update is an optional hook triggered when props or state change.

When a plugin is removed, unmount is called to clean up any resources the plugin was holding, preventing memory leaks and side effects in the host application.
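The host decides when each hook runs, so it helps to centralize that logic in one place. The following is a minimal sketch of such a lifecycle manager, not a definitive implementation: `PluginManager`, `PluginState`, and `PluginContract` are hypothetical names (the interface mirrors `Plugin` above and is renamed only to avoid clashing with the browser's global `Plugin` type in a standalone script), and the host API is reduced to `log` to keep the example self-contained.

```typescript
interface HostAPI {
  log: (message: string) => void;
}

// Mirrors the Plugin interface from src/plugins/plugin.ts.
interface PluginContract {
  name: string;
  version: string;
  init: (host: HostAPI) => void;
  mount?: () => void;
  update?: () => void;
  unmount?: () => void;
}

type PluginState = 'initialized' | 'mounted' | 'unmounted' | 'failed';

// Hypothetical manager: tracks state per plugin and isolates hook failures.
class PluginManager {
  private readonly host: HostAPI;
  private readonly plugins = new Map<string, PluginContract>();
  private readonly states = new Map<string, PluginState>();

  constructor(host: HostAPI) {
    this.host = host;
  }

  register(plugin: PluginContract): void {
    this.plugins.set(plugin.name, plugin);
    this.run(plugin.name, () => plugin.init(this.host), 'initialized');
  }

  mount(name: string): void {
    const plugin = this.plugins.get(name);
    if (!plugin || this.states.get(name) === 'failed') return;
    this.run(name, () => plugin.mount?.(), 'mounted');
  }

  unmount(name: string): void {
    const plugin = this.plugins.get(name);
    // Only mounted plugins have anything to clean up.
    if (!plugin || this.states.get(name) !== 'mounted') return;
    this.run(name, () => plugin.unmount?.(), 'unmounted');
  }

  stateOf(name: string): PluginState | undefined {
    return this.states.get(name);
  }

  // Runs a hook; a throwing plugin is marked failed instead of crashing the host.
  private run(name: string, hook: () => void, next: PluginState): void {
    try {
      hook();
      this.states.set(name, next);
    } catch (error) {
      this.states.set(name, 'failed');
      this.host.log(`${name} failed: ${String(error)}`);
    }
  }
}

const events: string[] = [];
const manager = new PluginManager({ log: (message) => events.push(message) });

manager.register({
  name: 'Chat',
  version: '1.0.0',
  init: () => events.push('init'),
  mount: () => events.push('mount'),
  unmount: () => events.push('unmount'),
});
manager.mount('Chat');
manager.unmount('Chat');

// A plugin whose init throws is recorded as failed and never mounted.
manager.register({
  name: 'Broken',
  version: '0.1.0',
  init: () => {
    throw new Error('boom');
  },
});
manager.mount('Broken');
```

Wrapping every hook in the same `run` helper means a faulty plugin degrades to a logged failure rather than taking down the host.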

3. How to Bundle Plugins Separately

Each plugin should be packaged as an independent module so that it can be developed, versioned, and deployed without tightly coupling it to the host application.

Modern build tools such as Vite or Webpack make it possible to compile plugins into standalone bundles that the host can load dynamically at runtime.

Example Vite Configuration for a Plugin

// vite.config.ts

import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';

export default defineConfig({
  plugins: [react()],
  build: {
    lib: {
      entry: 'src/plugin.ts',
      name: 'MyPlugin',
      fileName: 'my-plugin',
      formats: ['es'],
    },
    rollupOptions: {
      external: ['react', 'react-dom'],
    },
  },
});

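The external option tells the bundler to leave react and react-dom out of the plugin bundle, so the plugin relies on the React instance the host provides instead of shipping a second copy. Because the output keeps bare imports such as 'react', the host must make them resolvable at runtime, for example with an import map.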

4. How to Lazy-Load Plugins

Even when plugins are bundled independently, loading all of them during application startup would significantly increase initial load time. Instead, plugins should be loaded on demand using dynamic imports so that functionality is only fetched when the user actually needs it.

// src/plugins/loader.ts

import type { Plugin } from './plugin';

export async function loadPlugin(url: string): Promise<Plugin> {
  // /* @vite-ignore */ is needed because the URL is dynamic and cannot be statically analyzed by Vite.
  // Tradeoff: the plugin cannot be pre-bundled; only load URLs you trust.
  const module = await import(/* @vite-ignore */ url);
  return module.default as Plugin;
}

Usage in React:

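const [plugin, setPlugin] = React.useState<Plugin | null>(null);

React.useEffect(() => {
  let cancelled = false;
  loadPlugin('/plugins/my-plugin.js')
    .then((p) => {
      if (cancelled) return; // component unmounted before the import resolved
      p.init(hostAPI);
      setPlugin(p);
    })
    .catch((error) => hostAPI.log(`Failed to load plugin: ${String(error)}`));
  return () => {
    cancelled = true;
  };
}, []);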

This pattern allows applications to scale without preloading all plugins, improving initial load time.
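A type assertion such as `as Plugin` does not check anything at runtime, so a loader may also want to validate the module's shape and avoid importing the same URL twice. The sketch below shows one possible approach under stated assumptions: `isPlugin`, `loadPluginOnce`, `Importer`, and `PluginContract` are hypothetical names, and the importer is injected so the example runs outside a browser.

```typescript
// Mirrors the required members of the Plugin interface.
interface PluginContract {
  name: string;
  version: string;
  init: (host: unknown) => void;
}

// Runtime check: verifies the default export actually looks like a plugin.
function isPlugin(value: unknown): value is PluginContract {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.name === 'string' &&
    typeof candidate.version === 'string' &&
    typeof candidate.init === 'function'
  );
}

type Importer = (url: string) => Promise<{ default?: unknown }>;

const pluginCache = new Map<string, Promise<PluginContract>>();

// Hypothetical loader: caches by URL so repeated requests share one import.
function loadPluginOnce(url: string, importer: Importer): Promise<PluginContract> {
  const cached = pluginCache.get(url);
  if (cached) return cached;

  const pending = importer(url).then((module) => {
    if (!isPlugin(module.default)) {
      throw new Error(`Module at ${url} does not export a valid plugin`);
    }
    return module.default;
  });
  // Drop failed loads from the cache so a later retry can succeed.
  pending.catch(() => pluginCache.delete(url));
  pluginCache.set(url, pending);
  return pending;
}

// In the browser the importer would wrap the dynamic import, for example:
// (url) => import(/* @vite-ignore */ url)
const fakeImporter: Importer = async () => ({
  default: { name: 'ChatPlugin', version: '1.0.0', init: () => {} },
});

const first = loadPluginOnce('/plugins/chat.js', fakeImporter);
const second = loadPluginOnce('/plugins/chat.js', fakeImporter);
```

Caching the promise rather than the resolved plugin means two components that request the same plugin at the same time still trigger only one import.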

5. Security & Permission Model

Because plugins run code that originates outside the core application, security boundaries are essential. Even though plugins interact through the host API, the platform must still restrict what capabilities they can access in order to prevent misuse or accidental interference with application state.

Example: Restricted API

export interface SecureHostAPI {
  log: (message: string) => void;
  registerComponent: (name: string, component: React.ComponentType<any>) => void;
  fetchData?: (endpoint: string) => Promise<unknown>; // Only if allowed
}
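The optional `fetchData` member suggests that capabilities are granted per plugin. One way to express that is a factory that only attaches a capability when the host has granted it. This is a sketch under stated assumptions: `createSecureHost`, `Capability`, and `ALLOWED_ENDPOINTS` are hypothetical, the endpoint paths are invented for illustration, and `registerComponent` is omitted so the example has no React dependency.

```typescript
// Hypothetical capability names; a real platform would likely define more.
type Capability = 'network';

interface SecureHostAPI {
  log: (message: string) => void;
  fetchData?: (endpoint: string) => Promise<unknown>;
}

// Hypothetical allowlist of endpoints that plugins may call.
const ALLOWED_ENDPOINTS = new Set(['/api/messages', '/api/users/me']);

// Builds a host API that contains only the capabilities granted to this plugin.
function createSecureHost(
  pluginName: string,
  granted: ReadonlySet<Capability>,
  fetcher: (endpoint: string) => Promise<unknown>,
): SecureHostAPI {
  const api: SecureHostAPI = {
    log: (message) => console.log(`[${pluginName}] ${message}`),
  };

  if (granted.has('network')) {
    api.fetchData = (endpoint) => {
      if (!ALLOWED_ENDPOINTS.has(endpoint)) {
        return Promise.reject(new Error(`Endpoint not allowed: ${endpoint}`));
      }
      return fetcher(endpoint);
    };
  }

  // Freeze so a plugin cannot replace host functions at runtime.
  return Object.freeze(api);
}

// Stub standing in for the host's real data layer.
const stubFetcher = async (endpoint: string) => ({ endpoint });

const untrusted = createSecureHost('ThirdParty', new Set<Capability>(), stubFetcher);
const trusted = createSecureHost('Internal', new Set<Capability>(['network']), stubFetcher);
```

A plugin without the `network` capability never receives a `fetchData` function at all, so there is nothing for it to call or work around.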

You can enhance security further using iframe sandboxing or Web Workers for heavier isolation.

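// Example of a sandboxed iframe plugin (the src path is illustrative)

<iframe src="/plugins/chat-plugin/index.html" sandbox="allow-scripts" title="Chat plugin"></iframe>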

// Advanced isolation notes:
// - You can define different SecureHostAPI shapes for internal vs. third-party plugins,
//   exposing more capabilities to trusted plugins while restricting untrusted ones.
// - For stronger isolation, use message passing (postMessage) with iframes or Web Workers
//   so plugins cannot access the DOM or host state directly.

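Running a plugin in a sandboxed iframe or a Web Worker keeps it away from the host's DOM and in-memory state, so the host API becomes its only channel to the application. Network access is a separate concern: restricting it requires an additional control such as a Content Security Policy.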
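When plugins communicate through postMessage, the host should treat every incoming message as untrusted input and validate it before acting on it. The sketch below shows one way to do that; the `PluginMessage` protocol and `parsePluginMessage` are hypothetical and exist only to illustrate the idea.

```typescript
// Messages a sandboxed plugin is allowed to send to the host (hypothetical protocol).
type PluginMessage =
  | { type: 'log'; message: string }
  | { type: 'resize'; height: number };

// Returns a typed message when the payload matches the protocol, otherwise null.
function parsePluginMessage(data: unknown): PluginMessage | null {
  if (typeof data !== 'object' || data === null) return null;

  const raw = data as Record<string, unknown>;
  if (raw.type === 'log' && typeof raw.message === 'string') {
    return { type: 'log', message: raw.message };
  }
  if (raw.type === 'resize' && typeof raw.height === 'number' && raw.height >= 0) {
    return { type: 'resize', height: raw.height };
  }
  return null; // Unknown or malformed messages are dropped.
}

// In a browser host this could be wired up roughly as follows. A frame sandboxed
// without allow-same-origin reports its origin as the string "null", so the
// sender is identified by comparing event.source with the iframe's window:
//
// window.addEventListener('message', (event) => {
//   if (event.source !== pluginFrame.contentWindow) return;
//   const message = parsePluginMessage(event.data);
//   if (message) handleMessage(message);
// });

const accepted = parsePluginMessage({ type: 'log', message: 'hello from plugin' });
const rejected = parsePluginMessage({ type: 'eval', code: 'alert(1)' });
```

Copying only the expected fields into a new object also strips any extra properties a plugin might attach to the message.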

6. Plugin Hot-loading

Hot-loading is essential for developer productivity. Tools like Vite's HMR let you see plugin updates immediately, speeding up iteration and reducing friction.

React Example with HMR:

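if (import.meta.hot) {
  import.meta.hot.accept('/plugins/my-plugin.js', (newModule) => {
    // Vite passes undefined when the updated module failed to load (e.g. a syntax error).
    if (!newModule) return;
    const updatedPlugin = newModule.default as Plugin;
    updatedPlugin.init(hostAPI);
    setPlugin(updatedPlugin);
  });
}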

With hot-loading, developers can update plugins without restarting the host app.

7. CI & Deployment Considerations

To deploy safely, plugins must be verified and tested. CI/CD pipelines enforce type safety, bundling, and security checks automatically. For a production-grade plugin system, continuous integration pipelines should:

Lint and type-check each plugin using TypeScript.
Run automated tests to ensure plugin compliance.
Bundle plugins independently with versioned outputs.
Deploy plugins to a secure CDN or internal repository.
Verify signatures or hashes to prevent tampering.

GitHub Actions Example for Plugin CI:

name: Build Plugin

on:
  push:
    paths:
      - 'plugins/**'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 20
      - run: npm install
      - run: npm run build --workspace plugins/my-plugin
      - run: npm run test --workspace plugins/my-plugin
      # Optional: sign plugin artifacts or generate a checksum to verify integrity before loading in the host

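This workflow builds and tests a plugin whenever files under plugins/ change. Type errors surface only if the plugin's build script runs the TypeScript compiler, so add explicit lint and type-check steps if it does not.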
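The checksum mentioned in the workflow comment can be produced and verified with Node's built-in crypto module. The following is a minimal sketch, assuming the pipeline publishes a manifest entry next to each bundle; `ManifestEntry`, `sha256Hex`, and `verifyBundle` are hypothetical names, and the bundle contents are a placeholder string.

```typescript
import { createHash } from 'node:crypto';

// Returns the SHA-256 digest of a bundle as lowercase hex.
function sha256Hex(bundle: string | Uint8Array): string {
  return createHash('sha256').update(bundle).digest('hex');
}

// Hypothetical manifest entry published alongside each plugin build.
interface ManifestEntry {
  file: string;
  sha256: string;
}

// True only when the bundle's digest matches the one recorded at build time.
function verifyBundle(bundle: string | Uint8Array, entry: ManifestEntry): boolean {
  return sha256Hex(bundle) === entry.sha256.toLowerCase();
}

// CI side: record the digest of the freshly built bundle.
const bundleSource = 'export default { name: "ChatPlugin" };';
const manifestEntry: ManifestEntry = {
  file: 'chat-plugin.js',
  sha256: sha256Hex(bundleSource),
};

// Host or deploy side: refuse bundles whose contents changed after the build.
const untouched = verifyBundle(bundleSource, manifestEntry);
const tampered = verifyBundle(bundleSource + '/* injected */', manifestEntry);
```

A checksum detects accidental or malicious changes to the file, but it only proves integrity relative to the manifest; protecting the manifest itself would need a signature or a trusted distribution channel.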

Putting It All Together

At this point, you have walked through each architectural layer independently. Here's how all the pieces map to a real project structure:

src/
└── plugins/
    ├── host.ts ← Host API definition
    ├── plugin.ts ← Plugin lifecycle interface
    └── loader.ts ← Dynamic plugin loader
plugins/
└── chat-plugin/
    └── src/
        └── plugin.ts ← Chat plugin implementation

Each file has a single, clear responsibility. host.ts owns the contract, plugin.ts owns the lifecycle shape, loader.ts handles runtime importing, and the plugin itself lives entirely outside the core src/ tree – deployable and versioned independently.

Best Practices

At this point, you have a host API, a well-defined plugin lifecycle, isolated bundles, lazy-loading, and a security model. These foundations ensure plugins are robust, type-safe, and maintainable — ready to be extended with versioning, testing, and CI/CD pipelines.

Type safety: Always define TypeScript interfaces for host APIs and plugin contracts.
Lazy loading: Only load plugins when required.
Security: Expose a minimal API and avoid giving plugins unrestricted access.
Isolated state: Keep plugin state isolated to prevent accidental interference.
Versioning: Maintain plugin versions to ensure compatibility with the host.
Testing: Unit-test plugins against host API mocks.
CI/CD: Automate linting, testing, and bundling for plugins.
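The versioning practice can be enforced at load time with a small compatibility check. The sketch below assumes a simple rule that is not stated in the article: a plugin declares the host API version it was built against, and the host accepts it when the major versions match and the host is at least as new. `parseVersion` and `isCompatible` are hypothetical helpers.

```typescript
interface SemVer {
  major: number;
  minor: number;
  patch: number;
}

// Parses "MAJOR.MINOR.PATCH"; pre-release and build tags are out of scope here.
function parseVersion(version: string): SemVer | null {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) return null;
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    patch: Number(match[3]),
  };
}

// Assumed rule: same major version, and the host API is at least as new as
// the API version the plugin was built against.
function isCompatible(hostApiVersion: string, pluginTargetsApi: string): boolean {
  const host = parseVersion(hostApiVersion);
  const target = parseVersion(pluginTargetsApi);
  if (!host || !target) return false;
  if (host.major !== target.major) return false;
  if (host.minor !== target.minor) return host.minor > target.minor;
  return host.patch >= target.patch;
}

const canLoad = isCompatible('2.3.0', '2.1.5');
const tooOld = isCompatible('2.1.0', '2.3.0');
```

Rejecting unparseable versions by default keeps a malformed plugin manifest from being treated as compatible.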

When NOT to Use a Plugin Architecture

In some cases, introducing a plugin system can add unnecessary complexity without delivering meaningful benefits.

Small or Single-Team Applications

If a project is maintained by a small team and the feature set is relatively stable, a plugin architecture may be excessive. A simpler modular structure within the main codebase is usually easier to maintain and reason about.

Tightly Coupled Features

Plugin systems work best when features can operate independently. If new functionality requires deep access to application state or tightly integrated workflows, forcing it into a plugin model may introduce unnecessary abstractions and complexity rather than solving a real problem.

Performance-Critical Systems

Although lazy-loading can mitigate performance issues, plugin architectures still introduce additional runtime complexity. Applications with strict performance constraints may benefit from a more tightly optimized architecture rather than dynamic plugin loading.

Limited Security Controls

Allowing external code to run inside an application always introduces security risks. If the platform can't enforce strong API boundaries, sandboxing, or validation of plugins, it may be safer to avoid a plugin architecture altogether.

Early-Stage Products

In early product development, requirements often change rapidly. Designing a plugin system too early can slow development because engineers must maintain abstraction layers before the product's core architecture has stabilized. It's usually better to wait until the platform's boundaries are well understood before introducing this level of extensibility.

Future Enhancements

As the platform matures, there are several directions worth exploring.

Dynamic permissions would allow plugins to explicitly request capabilities, with the host deciding whether to grant them. This makes the security model more granular and auditable.

A plugin marketplace could serve as a central registry of verified plugins, making discovery and distribution easier for teams.

For use cases that require stronger isolation, Web Workers or iframes offer more robust sandboxing than API boundaries alone.

An event bus is another useful addition, allowing plugins to communicate with each other through a shared message system rather than direct API calls, which keeps inter-plugin dependencies loose and manageable.
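A typed event bus of this kind can be sketched in a few lines. The example below is illustrative only: `EventBus`, `PluginEvents`, and the event names are hypothetical, and a production version would likely add error isolation around handlers.

```typescript
// Hypothetical event map: each key is an event name, each value its payload type.
interface PluginEvents {
  'chat:message': { from: string; text: string };
  'user:logout': { userId: string };
}

type Handler<T> = (payload: T) => void;

class EventBus<Events> {
  private readonly handlers = new Map<keyof Events, Set<Handler<unknown>>>();

  // Subscribes and returns an unsubscribe function, suitable for a plugin's unmount hook.
  on<K extends keyof Events>(event: K, handler: Handler<Events[K]>): () => void {
    const set = this.handlers.get(event) ?? new Set<Handler<unknown>>();
    this.handlers.set(event, set);
    // Stored untyped internally; the public signatures keep callers type-safe.
    const stored = handler as Handler<unknown>;
    set.add(stored);
    return () => {
      set.delete(stored);
    };
  }

  // Delivers the payload to every handler currently subscribed to the event.
  emit<K extends keyof Events>(event: K, payload: Events[K]): void {
    this.handlers.get(event)?.forEach((handler) => handler(payload));
  }
}

const bus = new EventBus<PluginEvents>();
const received: string[] = [];

const unsubscribe = bus.on('chat:message', ({ from, text }) => {
  received.push(`${from}: ${text}`);
});

bus.emit('chat:message', { from: 'ada', text: 'hello' });
unsubscribe();
bus.emit('chat:message', { from: 'ada', text: 'ignored after unsubscribe' });
```

Because the event map is a type parameter, emitting `'chat:message'` with a payload shaped like `'user:logout'` fails at compile time instead of at runtime.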

Conclusion

Designing a plugin architecture in React is ultimately about treating your application as a platform rather than a single codebase. By defining clear contracts between the host application and its extensions, you enable teams to ship features independently while preserving stability, security, and performance.

If you are building a system that multiple teams (or even third-party developers) need to extend, start by establishing a minimal host API and plugin contract. Focus on strong TypeScript interfaces, clear lifecycle boundaries, and strict API access rules. These foundations ensure that plugins remain predictable and safe as the ecosystem grows.

As your platform evolves, you can gradually introduce more advanced capabilities such as plugin versioning, capability-based permissions, sandboxed execution environments, or an internal plugin marketplace.

Observability and monitoring also become increasingly important as the number of plugins grows, allowing you to detect compatibility issues or performance regressions early.

The key takeaway is to start simple but intentional. A small, well-defined plugin interface combined with lazy loading and secure API boundaries is often enough to support the first generation of extensions. From there, your architecture can expand naturally into a full ecosystem where features are delivered as modular, independently deployable plugins.

When implemented thoughtfully, a React plugin architecture transforms a single application into a scalable, extensible platform capable of supporting long-term growth and collaboration across teams.

How to Build End-to-End LLM Observability in FastAPI with OpenTelemetry

Jessica Patel — Fri, 13 Mar 2026 16:13:16 +0000

This article shows how to build end-to-end, code-first LLM observability in a FastAPI application using the OpenTelemetry Python SDK.

Instead of relying on vendor-specific agents or opaque SDKs, we will manually design traces, spans, and semantic attributes that capture the full lifecycle of an LLM-powered request.

Introduction
Prerequisites and Technical Context
Why LLM Observability Is Fundamentally Different
Reference Architecture: A Traceable RAG Request
Reference Architecture Explained
Why This Design Is Better Than Simpler Alternatives
LLM Models That Work Best for This Architecture
OpenTelemetry Primer (LLM-Relevant Concepts Only)
Designing LLM-Aware Spans
FastAPI Example: End-to-End LLM Spans (Complete and Explained)
Semantic Attributes: Best Practices for LLM Observability
Evaluation Hooks Inside Traces
Exporting and Visualizing Traces (Where This Fits with Vendor Tooling)
Operational Patterns and Anti-Patterns
Extending the System
Conclusion

Introduction

Large Language Models (LLMs) are rapidly becoming a core component of modern software systems. Applications that once relied on deterministic APIs are now incorporating LLM-powered features such as conversational assistants, document summarization, intelligent search, and retrieval-augmented generation (RAG).

While these capabilities unlock new user experiences, they also introduce operational complexity that traditional monitoring approaches were never designed to handle.

Unlike conventional software services, LLM systems are probabilistic by nature. The same request may produce slightly different responses depending on factors such as prompt structure, model configuration, retrieval context, and sampling parameters such as temperature or top-p.

In addition, LLM workloads introduce entirely new operational dimensions such as token consumption, prompt construction latency, inference cost, context window limits, and response quality.

These factors mean that a request can appear technically successful from an infrastructure perspective while still producing an incorrect, hallucinated, or low-quality result.

Traditional observability tools typically focus on infrastructure-level signals such as latency, error rate, and throughput. While these metrics remain important, they are insufficient for understanding how an LLM application behaves in production.

Engineers must also understand what prompt was constructed, which documents were retrieved, how many tokens were consumed, which model configuration was used, and how the final response was evaluated. Without this visibility, debugging LLM behavior becomes extremely difficult and operational costs can quickly spiral out of control.

This is where LLM observability becomes essential. Observability for LLM systems extends beyond infrastructure monitoring. It captures the full lifecycle of an AI-driven request — from user input and context retrieval to prompt construction, model inference, post-processing, and quality evaluation.

When implemented correctly, observability allows teams to answer why the model generated a particular response, which retrieval results influenced the output, how much a request cost in terms of tokens, where latency occurred within the request pipeline, and whether the response passed basic quality or safety checks.

This article demonstrates how to implement end-to-end LLM observability in a FastAPI application using OpenTelemetry. Instead of relying on proprietary monitoring agents or opaque vendor SDKs, we take a code-first approach to instrumentation. By explicitly designing traces, spans, and semantic attributes, we gain precise control over how LLM interactions are observed and analyzed.

Throughout the guide, we will walk through a practical architecture for tracing a retrieval-augmented generation (RAG) workflow, where each stage of the request lifecycle is represented as a trace span. We will explore how to design meaningful span boundaries, capture prompt and model metadata safely, record token usage and cost signals, and attach evaluation results directly to traces.

The article also explains how this instrumentation can be exported to any OpenTelemetry-compatible backend such as Jaeger, Grafana Tempo, or LLM-specific platforms like Phoenix.

By the end of this guide, you will understand how to:

Structure traces so that each user request maps to a single end-to-end LLM interaction
Design span hierarchies that reflect the logical stages of an LLM pipeline
Capture prompt metadata, model configuration, and token usage safely
Attach evaluation and quality signals to traces for deeper analysis
Export observability data to different backends without changing instrumentation

Most importantly, the goal of this article is not simply to demonstrate how to add telemetry to an application. Instead, it aims to show how to think about observability when building LLM-powered systems.

When LLM operations are treated as first-class components within a distributed system, traces become a powerful tool for debugging, optimization, cost management, and continuous improvement of model behavior.

Prerequisites and Technical Context

Before following this guide, you should be familiar with the Python programming language, basic web API concepts, and general microservice architecture. Below are some key tools and concepts used in this article.

FastAPI (Web Framework)

FastAPI is used as the primary web framework for the application. It is a modern Python framework designed for building high-performance APIs using standard Python type hints. FastAPI simplifies request validation, serialization, and API documentation while remaining lightweight and fast.

Large Language Models (LLMs)

Large Language Models (LLMs) are the computational core of the example system. An LLM is a model trained on vast amounts of text data to generate or transform language in ways that resemble human communication. In production environments, LLMs are commonly used for tasks such as conversational interfaces, summarization, and question answering.

Observability (Concept)

Observability is the overarching concept that connects all the technical pieces in this article. At a high level, observability refers to the ability to understand a system's internal behavior by examining the data it produces during execution. Rather than asking whether a system is simply "up" or "down," observability helps answer deeper questions about why a request behaved a certain way, where latency was introduced, or how different components interacted.

OpenTelemetry (Instrumentation Standard)

OpenTelemetry is the mechanism used to implement observability within the application. It is an open, vendor-neutral standard for generating telemetry data such as traces, metrics, and logs. By instrumenting key parts of the LLM workflow, we can observe how requests flow through the system, how long each step takes, and what contextual data influenced the final outcome. OpenTelemetry serves as the foundation for collecting this information in a consistent and portable way, independent of any specific monitoring backend.

Why LLM Observability Is Fundamentally Different

Traditional observability assumes deterministic behavior: the same input produces the same output. LLM systems violate this assumption. The same request can vary due to prompt template changes, retrieval differences, sampling parameters (temperature, top-p), model version upgrades, and context window truncation.

As a result, teams need visibility into what the model saw, how it was configured, what it retrieved, how long it took, and how much it cost, all correlated to a single user request. Logs alone are insufficient, and metrics lack dimensionality. Distributed traces are the backbone of LLM observability.

Reference Architecture: A Traceable RAG Request

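A typical FastAPI-based RAG service follows this flow:

Client sends a request to /chat
FastAPI validates input and authenticates the user
Retriever queries the vector database
Prompt is assembled using retrieved documents
LLM API is invoked
Response is post-processed and returned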

Each step is observable, but only if we deliberately instrument it. The goal is one trace per user request, with child spans representing each logical LLM step.

Reference Architecture Explained

Client Sends a Request to /chat

The architecture begins when a client sends a request to the /chat endpoint. This request typically contains the user's query along with any session or conversation context required by the application.

Keeping the client interface minimal and well-defined is intentional: it ensures the backend receives a predictable input shape and prevents application-specific logic from leaking into downstream LLM processing.

From an observability perspective, this request marks the start of a single end-to-end trace, allowing every subsequent operation to be correlated back to the original user action.

FastAPI Validates Input and Authenticates the User

Once the request reaches the service, FastAPI performs schema validation and authentication. Validation guarantees that only well-formed inputs proceed through the pipeline, while authentication ensures that expensive LLM operations are only executed for authorized users.

Placing this step early reduces unnecessary computation and protects the system from abuse. It also improves trace quality by ensuring that all observed requests represent legitimate execution paths rather than malformed or rejected traffic.

Retriever Queries the Vector Database

After validation, the system queries a vector database to retrieve documents relevant to the user's request. This retrieval step is the foundation of retrieval-augmented generation (RAG). By grounding the LLM in external knowledge, the system improves factual accuracy and reduces hallucinations.

Separating retrieval from generation allows teams to tune similarity thresholds, embedding models, and top-k values independently, and it makes it easier to diagnose whether poor responses are caused by bad retrieval or model behavior.

Prompt Is Assembled Using Retrieved Documents

With relevant documents in hand, the system constructs the final prompt that will be sent to the LLM. This step combines the user query, retrieved context, system instructions, and formatting rules into a single structured prompt.

Making prompt assembly an explicit stage enables prompt versioning, experimentation, and observability. It also provides a natural place to detect issues such as context window overflows or excessive prompt size before invoking the model.
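As a rough sketch of that overflow check, the function below adds retrieved documents in order until a size budget would be exceeded and reports how many were dropped. It assumes a character budget as a stand-in for a real token count, and the name assemble_prompt and its parameters are hypothetical.

```python
from typing import List, Tuple


def assemble_prompt(
    query: str,
    documents: List[str],
    system_instructions: str,
    max_prompt_chars: int,
) -> Tuple[str, int]:
    """Build a prompt within a character budget.

    Returns the prompt and the number of documents that were dropped.
    """

    def render(docs: List[str]) -> str:
        context = "\n".join(docs)
        return f"{system_instructions}\n\nContext:\n{context}\n\nQuestion:\n{query}\n"

    included: List[str] = []
    for doc in documents:
        # Stop at the first document that would push the prompt over budget.
        if len(render(included + [doc])) > max_prompt_chars:
            break
        included.append(doc)

    prompt = render(included)
    if len(prompt) > max_prompt_chars:
        raise ValueError("query and instructions alone exceed the prompt budget")
    return prompt, len(documents) - len(included)


prompt, dropped = assemble_prompt(
    query="What is FastAPI?",
    documents=["a" * 50, "b" * 50, "c" * 50],
    system_instructions="Answer from context only.",
    max_prompt_chars=200,
)
```

The dropped count is a natural candidate for a span attribute, since it shows when retrieval returned more context than the prompt could carry.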

LLM API Is Invoked

The LLM API call is the most expensive and non-deterministic operation in the pipeline, which is why it occurs only after all preparatory work is complete. At this stage, the model receives a fully constructed prompt and produces a response based on its configuration parameters.

This step is the primary focus of latency, cost, and reliability controls such as retries, timeouts, and circuit breakers. From an observability standpoint, this span becomes the anchor for token usage, cost attribution, and prompt-level debugging.
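One way to sketch the retry and timeout controls is a small wrapper around the model call. The example below assumes that timeouts and ConnectionError are the only transient failures worth retrying; real provider SDKs raise their own exception types, and the name call_with_retries is hypothetical. A circuit breaker is deliberately left out to keep the sketch short.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retries(
    operation: Callable[[], Awaitable[T]],
    timeout_seconds: float = 10.0,
    max_attempts: int = 3,
    base_delay_seconds: float = 0.5,
) -> T:
    """Run operation with a per-attempt timeout and exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(operation(), timeout=timeout_seconds)
        except (asyncio.TimeoutError, ConnectionError):
            # Give up after the final attempt and surface the original error.
            if attempt == max_attempts:
                raise
            # Back off: base, 2x base, 4x base, ...
            await asyncio.sleep(base_delay_seconds * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing the call as a zero-argument callable, for example lambda: call_llm(prompt), lets each attempt create a fresh coroutine.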

Response Is Post-Processed and Returned

After the LLM returns a response, the system performs post-processing before sending the result back to the client. This may include formatting, filtering, validation, or enrichment of the output.

Post-processing acts as a final safeguard against malformed or low-quality responses and ensures consistency with application requirements. It also provides a clean boundary for attaching evaluation signals, such as response length, relevance scores, or truncation indicators, before the request completes.
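A post-processing step that also produces evaluation signals might look like the sketch below. It assumes a simple policy (collapse whitespace, then cap the length), and the function name postprocess and the signal keys are hypothetical rather than part of any standard.

```python
from typing import Dict, Tuple, Union


def postprocess(
    raw_text: str, max_chars: int = 2000
) -> Tuple[str, Dict[str, Union[int, bool]]]:
    """Normalize whitespace, cap length, and report simple quality signals."""
    # Collapsing all whitespace is a simplification; it also removes newlines.
    text = " ".join(raw_text.split())
    truncated = len(text) > max_chars
    if truncated:
        text = text[:max_chars].rstrip()
    signals: Dict[str, Union[int, bool]] = {
        "llm.response_length": len(text),
        "llm.response_truncated": truncated,
        "llm.response_empty": len(text) == 0,
    }
    return text, signals


text, signals = postprocess("  Hello   world \n", max_chars=8)
```

Returning the signals alongside the text keeps the function free of tracing code, so the caller decides whether to attach them to a span.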

Why This Design Is Better Than Simpler Alternatives

This architecture intentionally avoids coupling responsibilities together. Validation, retrieval, prompt construction, model execution, and response handling are all distinct steps. This separation makes the system easier to test, easier to observe, and easier to evolve. When something fails, engineers can identify where and why rather than treating the LLM as a black box.

Compared to a monolithic "send user input directly to the LLM" approach, this design offers better correctness, lower cost, and higher resilience. It also aligns naturally with distributed tracing, since each block maps cleanly to a trace span with a clear semantic purpose. As the system grows, additional features such as caching, fallback models, or policy enforcement can be added without destabilizing the entire flow.

Most importantly, this architecture treats the LLM as one component in a larger system, not the system itself. That mindset is essential for building reliable production applications.

LLM Models That Work Best for This Architecture

This architecture is model-agnostic, but certain model characteristics work particularly well with retrieval-augmented workflows.

Models with strong instruction-following and reasoning capabilities tend to perform best, especially when prompts include structured context from retrieved documents. General-purpose models such as GPT-4-class systems perform well when accuracy and reasoning depth are critical.

For lower-latency or cost-sensitive use cases, smaller instruction-tuned models can be effective when paired with high-quality retrieval. Open-source models such as LLaMA-derived or Mistral-based systems also fit well into this architecture, particularly when deployed behind a private inference endpoint.

The key requirement is not the model itself, but how it is used. Models that can reliably ground their responses in provided context, respect system instructions, and produce stable outputs under varying prompts integrate most cleanly into this design. Because retrieval and prompt construction are explicit stages, models can be swapped or compared without changing the overall system structure.

OpenTelemetry Primer (LLM-Relevant Concepts Only)

OpenTelemetry defines three core types of telemetry data: traces, metrics, and logs. For LLM systems, traces are the most important. To make them useful, you need to understand a few building blocks:

A trace represents a single end-to-end request.
A span is a timed operation within that trace.
Attributes are key–value metadata attached to spans.
Events are time-stamped annotations.
Context propagation ensures child spans attach to the correct parent.

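FastAPI’s async nature makes correct context propagation essential. OpenTelemetry’s Python SDK tracks the active span through Python’s contextvars, so child spans attach to the right parent as long as they are started inside the request’s own task, for example with tracer.start_as_current_span.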

With those concepts in place, the next step is to wire OpenTelemetry into the app. Start by configuring the OpenTelemetry SDK in FastAPI: define a TracerProvider, attach a Resource (service name and environment), configure an exporter (Jaeger, Tempo, Phoenix, and so on), and enable FastAPI auto-instrumentation.

Designing LLM-Aware Spans

Span Taxonomy

A clean span hierarchy is critical. In this guide, a single http.request span (usually auto-generated) acts as the root, and it contains child spans such as rag.retrieval, rag.prompt.build, llm.call, llm.postprocess, and, optionally, llm.eval. Each of these spans represents a logical unit of work rather than an implementation detail.

Span Boundaries

Getting span boundaries right is just as important as picking the right span names. Avoid extremes like wrapping the entire LLM workflow in one giant span, creating a separate span for every token, or dumping all data into logs.

Instead, aim for a few coarse-grained spans that each represent a meaningful step in the request, enrich them with well-chosen attributes, and use events to mark important milestones within a span rather than splitting everything into smaller spans.

Instrumenting the LLM Call

When instrumenting the LLM call, treat it as the most critical span in the trace. Whether you are calling OpenAI, Anthropic, or another provider, start the span immediately before the API request and end it only after the full response (or stream) is complete.

Within that span, capture retries, timeouts, and errors so it becomes the central place for latency analysis, cost attribution, and prompt debugging.

For streaming responses, you can emit events for each chunk to track progress, but avoid creating separate child spans unless you truly need fine-grained timing.

FastAPI Example: End-to-End LLM Spans (Complete and Explained)

from fastapi import FastAPI, Request
from opentelemetry import trace
from opentelemetry.trace import Tracer
from typing import List
import asyncio
import hashlib

# Obtain a tracer instance from OpenTelemetry.
# All spans created with this tracer will be part of the same distributed
# tracing system and exported to the configured backend.
tracer: Tracer = trace.get_tracer(__name__)

# Initialize the FastAPI application.
app = FastAPI()

# Helper functions used by the observable endpoint
async def retrieve_documents(query: str) -> List[str]:
    """
    Simulate document retrieval (e.g., vector search or knowledge base lookup).
    This function represents the retrieval stage in a RAG pipeline.
    In a real system, this might query a vector database or search index.
    """
    await asyncio.sleep(0.05)  # Simulate I/O latency
    return [
        "FastAPI enables high-performance async APIs.",
        "OpenTelemetry provides vendor-neutral observability.",
        "LLM observability requires tracing prompts and tokens.",
    ]


def build_prompt(query: str, documents: List[str]) -> str:
    """
    Construct the final prompt from retrieved documents and the user query.
    Prompt construction is kept separate so it can be observed or modified
    independently if needed (for example, to measure prompt assembly latency).
    """
    context = "\n".join(documents)
    return f"""
Context:
{context}

Question:
{query}
"""


class LLMResponse:
    """
    Minimal abstraction for an LLM response.
    This keeps the example self-contained while still allowing us to attach
    token usage and other metadata for observability.
    """

    def __init__(self, text: str, prompt_tokens: int, completion_tokens: int):
        self.text = text
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens
    
    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

async def call_llm(prompt: str) -> LLMResponse:
    """
    Simulate an LLM API call.
    In a real implementation, this would call OpenAI, Anthropic, or another
    provider. The artificial delay represents model latency.
    """
    await asyncio.sleep(0.2)  # Simulate inference time
    response_text = "FastAPI and OpenTelemetry enable end-to-end LLM observability."
    # Token count is approximated here for demonstration purposes.
    prompt_tokens = len(prompt.split())
    completion_tokens = len(response_text.split())
    return LLMResponse(response_text, prompt_tokens, completion_tokens)


def summarize_response(response: LLMResponse) -> str:
    """
    Example post-processing step.
    Post-processing is separated into its own phase so any additional latency
    or errors are not incorrectly attributed to the LLM itself.
    """
    return response.text


# Observable FastAPI endpoint
@app.post("/query")
async def rag_query(request: Request, query: str):
    """
    Handle a single RAG-style request with explicit OpenTelemetry spans.
    This endpoint demonstrates how to create one trace per request, with child
    spans for retrieval, LLM invocation, and post-processing.
    """

    # Create a top-level span for the HTTP request.
    # Even if FastAPI auto-instrumentation is enabled, defining this explicitly
    # allows us to attach domain-specific metadata.
    with tracer.start_as_current_span("http.request") as http_span:
        http_span.set_attribute("http.method", "POST")
        http_span.set_attribute("http.route", "/query")

        # Retrieval phase
        # This span isolates the retrieval step so that relevance issues can be
        # debugged independently of LLM behavior.
        with tracer.start_as_current_span("rag.retrieval") as retrieval_span:
            retrieval_span.set_attribute("rag.top_k", 5)
            retrieval_span.set_attribute("rag.similarity_threshold", 0.8)
            documents = await retrieve_documents(query)

            # Record how many documents were returned.
            # This is a key signal when diagnosing hallucinations
            # or missing context in the final response.
            retrieval_span.set_attribute(
                "rag.documents_returned",
                len(documents),
            )

        # LLM invocation phase
        # This span wraps the actual LLM call and is the primary anchor for
        # latency, cost, and prompt-related analysis.
        with tracer.start_as_current_span("llm.call") as llm_span:
            llm_span.set_attribute("llm.provider", "example")
            llm_span.set_attribute("llm.model", "example-llm")
            llm_span.set_attribute("llm.temperature", 0.7)
            llm_span.set_attribute("llm.prompt_template_id", "rag_v1")

            # Build the final prompt using retrieved context.
            # The raw prompt is intentionally not stored as a span attribute.
            prompt = build_prompt(query, documents)
            
            # Prompt metadata
            prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()
            llm_span.set_attribute("llm.prompt_hash", prompt_hash)
            llm_span.set_attribute("llm.prompt_length", len(prompt))

            response = await call_llm(prompt)

            # Hash the response instead of storing raw text.
            # This allows correlation across traces without exposing content.
            response_hash = hashlib.sha256(
                response.text.encode()
            ).hexdigest()
            llm_span.set_attribute("llm.response_hash", response_hash)

            # Record token usage to enable cost attribution
            # and capacity planning.
            llm_span.set_attribute("llm.usage.prompt_tokens", response.prompt_tokens)
            llm_span.set_attribute("llm.usage.completion_tokens", response.completion_tokens)
            llm_span.set_attribute("llm.usage.total_tokens", response.total_tokens)
            
            # example price per token
            estimated_cost = response.total_tokens * 0.000002
            llm_span.set_attribute("llm.cost_estimated_usd", estimated_cost)

        # Post-processing phase
        # Any transformation after the LLM response is captured here,
        # ensuring inference latency is not overstated.
        with tracer.start_as_current_span("llm.postprocess") as post_span:
            summary = summarize_response(response)
            post_span.set_attribute(
                "llm.summary_length",
                len(summary),
            )

    # Return the final response to the client.
    # All spans above belong to the same distributed trace.
    return {"summary": summary}

Before walking through the code piece by piece, it helps to understand how the instrumentation relates to the observability principles described earlier in this article.

The goal of the example is not simply to show how to create spans, but to demonstrate how a single user request can be represented as a structured trace containing meaningful metadata about each stage of the LLM pipeline.

At a high level, the code follows three key design ideas:

One trace per user request
One span per logical LLM workflow stage
Semantic attributes attached to spans for debugging, cost tracking, and analysis

Each of these concepts directly corresponds to the observability practices discussed earlier.

Top-Level Request Span

The FastAPI endpoint begins by creating a top-level span called http.request. This span represents the entire lifecycle of the incoming request and serves as the root span for the trace.

with tracer.start_as_current_span("http.request") as http_span:

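Although FastAPI can generate HTTP spans automatically through OpenTelemetry auto-instrumentation, explicitly creating this span allows the application to attach domain-specific metadata such as route names or user identifiers. Note that when auto-instrumentation is also enabled, this explicit span becomes a child of the auto-generated server span rather than the root of the trace.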

Attributes such as the HTTP method and route are attached here:

http_span.set_attribute("http.method", "POST")
http_span.set_attribute("http.route", "/query")

This ensures that every trace can be easily filtered by endpoint when analyzing production traffic.

Retrieval Span

The next span captures the retrieval phase of the RAG pipeline:

with tracer.start_as_current_span("rag.retrieval") as retrieval_span:

This span isolates the vector search or knowledge retrieval step from the rest of the pipeline. If users report irrelevant answers, engineers can inspect this span to determine whether the issue originates from poor retrieval results rather than model behavior.

Several semantic attributes are attached here:

rag.top_k – number of documents requested
rag.similarity_threshold – similarity cutoff used for filtering results
rag.documents_returned – number of documents actually retrieved

These attributes align with the RAG observability signals discussed in the earlier section of the article.

LLM Invocation Span

The most important span in the trace is the llm.call span, which wraps the actual model invocation.

with tracer.start_as_current_span("llm.call") as llm_span:

This span captures the latency, configuration, and token usage associated with the LLM request. In production systems, it becomes the primary location for analyzing model behavior and cost.

Key attributes recorded in this span include:

llm.provider – the model provider (OpenAI, Anthropic, etc.)
llm.model – the specific model version
llm.temperature – sampling parameter controlling response randomness
llm.prompt_template_id – identifier for the prompt template used

These attributes make it possible to correlate changes in model configuration with downstream quality or cost changes.

Prompt Handling and Privacy

Instead of storing the full prompt or response text directly in the trace, the example demonstrates a safer practice: hashing sensitive data.

response_hash = hashlib.sha256(response.text.encode()).hexdigest()

The resulting hash is stored as a span attribute:

llm_span.set_attribute("llm.response_hash", response_hash)

This approach allows engineers to correlate repeated responses across traces without exposing potentially sensitive content in observability systems.

Token Usage Tracking

The llm.call span also records token usage:

llm_span.set_attribute(
    "llm.usage.total_tokens",
    response.total_tokens
)

Capturing token usage at the span level is critical for monitoring cost and efficiency, since token consumption directly determines billing for most LLM providers.

Post-Processing Span

Finally, the example includes a llm.postprocess span:

with tracer.start_as_current_span("llm.postprocess") as post_span:

This span represents any transformation applied after the model generates its response. Separating post-processing from the LLM call ensures that additional latency — such as formatting, filtering, or validation — is not incorrectly attributed to the model itself.

An attribute such as response length is recorded here:

post_span.set_attribute("llm.summary_length", len(summary))

This can be useful when diagnosing issues such as unexpectedly short or truncated outputs.

How the Spans Form a Complete Trace

When the request finishes, all spans belong to the same distributed trace:

http.request
 ├── rag.retrieval
 ├── llm.call
 └── llm.postprocess

This hierarchy reflects the logical workflow of a retrieval-augmented LLM system. Because each span contains structured metadata, engineers can quickly answer questions such as:

Was the latency caused by retrieval or model inference?
How many documents influenced the prompt?
Which model configuration produced the response?
How many tokens were consumed?
Was the response post-processed or truncated?

This structured trace design is what transforms observability from simple monitoring into a practical debugging and optimization tool for LLM systems.

Semantic Attributes: Best Practices for LLM Observability

The goal is not to capture every possible detail, but to record the minimal set of stable, high-signal attributes that enable effective debugging, cost control, and quality analysis in production. Poor attribute design leads to noisy traces, privacy risks, and dashboards that are impossible to reason about.

Prompt, Response, and Model Metadata

Storing raw prompts is often unsafe and expensive, so it is better to record minimal, structured metadata instead. In practice, this means attaching a stable template identifier with llm.prompt_template_id, a hashed version of the final prompt using llm.prompt_hash (to avoid storing raw text), and a size indicator such as llm.prompt_length, which captures the number of tokens or characters.

You should also always record key inference parameters: llm.provider (for example, "openai" or "anthropic"), llm.model (for example, "gpt-4.1"), llm.temperature and llm.top_p (sampling parameters), llm.max_tokens (the maximum tokens allowed), and llm.stream to indicate whether streaming was enabled, while staying within your organization’s privacy and compliance requirements.


with tracer.start_as_current_span("llm.call") as llm_span:
    llm_span.set_attribute("llm.provider", "example")
    llm_span.set_attribute("llm.model", "example-llm")
    llm_span.set_attribute("llm.temperature", 0.7)
    llm_span.set_attribute("llm.top_p", 0.9)
    llm_span.set_attribute("llm.max_tokens", 512)
    llm_span.set_attribute("llm.stream", False)
    llm_span.set_attribute("llm.prompt_template_id", "rag_v1")

    # Build the final prompt using retrieved context.
    # The raw prompt is intentionally not stored as a span attribute.
    prompt = build_prompt(query, documents)

    # Prompt metadata
    prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()
    llm_span.set_attribute("llm.prompt_hash", prompt_hash)
    llm_span.set_attribute("llm.prompt_length", len(prompt))

Token Usage and Cost (Why This Matters in Practice)

Token usage is one of the most common blind spots in LLM systems. Many teams monitor latency and error rates but discover runaway costs only after invoices spike. Because token consumption varies significantly by prompt structure, retrieved context, and model configuration, it must be captured explicitly at the span level.

The most important practice is to record token usage at the end of the LLM span, once the model has completed inference. This ensures that the values reflect the full request rather than partial or streamed output.

At minimum, capture the attributes llm.usage.prompt_tokens, llm.usage.completion_tokens, and llm.usage.total_tokens.

class LLMResponse:
    def __init__(self, text: str, prompt_tokens: int, completion_tokens: int):
        self.text = text
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

async def call_llm(prompt: str) -> LLMResponse:
    """
    Simulate an LLM API call.
    In a real implementation, this would call OpenAI, Anthropic, or another
    provider. The artificial delay represents model latency.
    """
    await asyncio.sleep(0.2)  # Simulate inference time
    response_text = "FastAPI and OpenTelemetry enable end-to-end LLM observability."
    # Token count is approximated here for demonstration purposes.
    prompt_tokens = len(prompt.split())
    completion_tokens = len(response_text.split())
    return LLMResponse(response_text, prompt_tokens, completion_tokens)

These values allow you to distinguish between requests that are expensive because of large prompts (often caused by excessive retrieval or poor prompt construction) versus those that are expensive because of long model-generated outputs.

Where possible, also attach an estimated cost: llm.cost_estimated_usd

# example price per token
estimated_cost = response.total_tokens * 0.000002
llm_span.set_attribute("llm.cost_estimated_usd", estimated_cost)

This value is typically derived by multiplying token counts by the model's published pricing. Even if the estimate is approximate, it enables powerful analysis. For example, you can identify which endpoints, prompt templates, or user flows are responsible for the highest cumulative cost, rather than relying on coarse, account-level billing dashboards.
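Providers often price input and output tokens differently, so a slightly more careful estimate keeps the two rates separate. The sketch below assumes a per-model pricing table. The prices shown are made up for illustration and are not any provider's real rates, and the names ModelPricing and estimate_cost_usd are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ModelPricing:
    """USD per one million tokens. The values used below are made up."""

    input_per_million: float
    output_per_million: float


def estimate_cost_usd(
    prompt_tokens: int, completion_tokens: int, pricing: ModelPricing
) -> float:
    """Estimate request cost from token counts and per-direction prices."""
    input_cost = prompt_tokens * pricing.input_per_million / 1_000_000
    output_cost = completion_tokens * pricing.output_per_million / 1_000_000
    return input_cost + output_cost


# Hypothetical pricing table keyed by the llm.model attribute value.
PRICING: Dict[str, ModelPricing] = {
    "example-llm": ModelPricing(input_per_million=2.0, output_per_million=6.0),
}

cost = estimate_cost_usd(1_200, 300, PRICING["example-llm"])
```

Keeping the table keyed by model name means the same function keeps working when models are swapped or compared side by side.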

Once spans carry the right attributes, the next step is to connect them to output quality, not just system health.

Evaluation Hooks Inside Traces

This section describes an additional pattern you can layer on top of the core instrumentation in this guide. It is optional and not implemented in the sample code, but it shows how to attach quality signals directly to your traces.

Observability is not just about whether the system stayed up. It is also about whether the model produced a useful answer. Evaluation hooks inside traces let you attach lightweight quality signals directly to the same spans you use for latency and cost.

Inline evaluations are the simplest approach. You can run quick checks synchronously and record the results as span attributes, such as llm.eval.passed for a simple boolean check, llm.eval.relevance_score for an optional numerical score, or flags like llm.eval.hallucination_detected and llm.eval.refusal_detected. These attributes travel with the trace, so you can filter and aggregate on them in your observability backend just like any other field.
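An inline evaluation can be as small as the sketch below. It assumes two crude heuristics: a fixed list of refusal phrases, and word overlap between the response and the retrieved documents as a rough proxy for relevance. Neither is a real quality metric, and the function name inline_eval and the marker list are hypothetical.

```python
from typing import Dict, List, Union

# Hypothetical refusal phrases; a real list would be tuned per application.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")


def inline_eval(
    response_text: str, documents: List[str], min_overlap: float = 0.2
) -> Dict[str, Union[bool, float]]:
    """Run cheap synchronous checks and return them as span-ready attributes."""
    lowered = response_text.lower()
    refusal = any(marker in lowered for marker in REFUSAL_MARKERS)

    # Fraction of response words that also appear in the retrieved context.
    response_words = set(lowered.split())
    context_words = set(" ".join(documents).lower().split())
    if response_words:
        overlap = len(response_words & context_words) / len(response_words)
    else:
        overlap = 0.0

    return {
        "llm.eval.refusal_detected": refusal,
        "llm.eval.relevance_score": round(overlap, 3),
        "llm.eval.passed": (not refusal) and overlap >= min_overlap,
    }


attrs = inline_eval(
    "FastAPI enables async APIs",
    ["FastAPI enables high-performance async APIs."],
)
```

Because the result is a flat dictionary of booleans and numbers, it can be attached to the active span in one call with the span's set_attributes method.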

For higher accuracy, you can introduce model-based evaluation as a separate step. In this pattern, an evaluator LLM runs asynchronously on the original prompt and response, and its work is captured in a child span (for example, llm.eval) that shares the same trace ID as the main llm.call span. You then attach scores such as relevance, faithfulness, or toxicity to that evaluation span.

Because the evaluation span shares the same trace ID, you can correlate quality regressions with changes in prompts or retrieval.

Exporting and Visualizing Traces (Where This Fits with Vendor Tooling)

This code-first observability design is vendor-agnostic. Once traces are emitted using OpenTelemetry, they can be exported to different backends without changing instrumentation.

General-purpose tracing systems like Jaeger and Grafana Tempo help engineers debug latency, errors, and request flow across retrieval, prompting, and model calls, answering how the system behaved. LLM-focused platforms such as Arize Phoenix use the same data but add model-specific insights like prompt clustering, token analysis, and quality correlation.

Because instrumentation stays OpenTelemetry-native, you maintain full control over attributes and trace structure while still using vendor dashboards, and you can switch backends as your needs evolve without touching the application code.

Operational Patterns and Anti-Patterns

Effective LLM observability requires disciplined practices. High-volume systems should sample traces to limit overhead, and prompts or responses should be hashed by default to reduce storage and privacy risk. Traces must be treated as production data, with proper access control and retention policies.

Common pitfalls include relying only on vendor SDK traces, logging prompts without trace correlation, or ignoring evaluation signals. These issues fragment visibility and hide quality regressions, especially when observability focuses only on agents instead of full application context.

Extending the System

Once traces are reliable, they support advanced capabilities. Metrics like p95 latency can be derived from spans, logs can be linked using trace IDs, and historical traces can power offline evaluation or prompt testing.
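As a small illustration of deriving a metric from spans, the sketch below computes p95 latency per span name from a list of durations. It assumes the durations have already been exported from the tracing backend into a plain dictionary; the function name p95_by_span and the sample data are hypothetical.

```python
import statistics
from typing import Dict, List


def p95_by_span(durations_ms: Dict[str, List[float]]) -> Dict[str, float]:
    """Compute the 95th-percentile duration for each span name."""
    result: Dict[str, float] = {}
    for name, samples in durations_ms.items():
        # Skip span names with too few samples for a meaningful percentile.
        if len(samples) < 2:
            continue
        # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
        result[name] = statistics.quantiles(samples, n=100)[94]
    return result


# Hypothetical durations, in milliseconds, grouped by span name.
latencies = p95_by_span(
    {
        "llm.call": [float(ms) for ms in range(1, 101)],
        "rag.retrieval": [42.0],
    }
)
```

In practice most tracing backends can compute percentiles for you; the point is that the span data already contains everything the metric needs.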

By following OpenTelemetry conventions, the observability stack also stays aligned with emerging LLM semantic standards, keeping the system flexible and future-proof.

Conclusion

End-to-end LLM observability is not achieved by installing another agent. It is achieved through intentional span design, meaningful semantic attributes, and, where needed, lightweight evaluation hooks.

By treating LLM calls as first-class operations within distributed traces, you gain faster debugging, controlled costs, safer deployments, and measurable quality improvements. The backend — Jaeger, Tempo, Phoenix — is interchangeable. The instrumentation strategy is not.

A well-designed trace is the most valuable artifact in a production LLM system.

How to Build Your Own Circuit Breaker in Spring Boot – and Really Understand Resilience4j

Jessica Patel — Mon, 16 Feb 2026 21:09:29 +0000

This article explains how to design and implement your own circuit breaker in Spring Boot using explicit failure tracking, a scheduler-driven recovery model, and clear state transitions.

Instead of relying solely on Resilience4j, we’ll walk through the internal mechanics so you understand how circuit breakers actually work.

What We’ll Cover

Prerequisites and Technical Context
What Is a Circuit Breaker in Distributed Systems
Design Goals for a Custom Circuit Breaker
How to Build a Minimal Working CircuitBreaker Class
Spring Boot Scheduler Example
Custom Breaker vs Resilience4j
When You Should Not Build Your Own
Extending the Design
Common Mistakes
Conclusion

Prerequisites and Technical Context

This article assumes you are comfortable with core Spring Boot and Java concepts. We won’t cover framework fundamentals or basic concurrency principles in depth. Here’s what you’ll need to know:

Spring Boot Basics

You should be comfortable with how dependency injection works in Spring, how to define @Configuration classes and @Bean definitions, and the basic service-layer structure of a Spring application. In this tutorial, we’ll treat the circuit breaker as a plain Java component and wire it into Spring through configuration classes rather than annotations.

Java Concurrency Fundamentals

You don’t need to be a concurrency expert, but you should be comfortable with Java’s basic concurrency tools. The implementation uses AtomicInteger, volatile fields, a ScheduledExecutorService, and simple synchronization, so you should understand why shared mutable state is dangerous, how atomic operations differ from synchronized blocks, and why state transitions in a shared state machine must be serialized.

Functional Interfaces

The circuit breaker exposes an execute(Supplier) method, so you should be comfortable using Supplier, writing simple lambda expressions, and wrapping outbound service calls inside a function you can pass to the breaker.

Resilience4j Basics

You don’t need hands-on Resilience4j experience, but you should know that it’s a lightweight Java fault-tolerance library that offers circuit breakers, retries, rate limiters, and bulkheads, and that it is commonly used in Spring Boot via annotations or configuration. In this article, we’ll only reference Resilience4j for comparison, not for actual configuration or usage.

What Is a Circuit Breaker in Distributed Systems?

A circuit breaker is a fault-tolerance pattern that stops a system from repeatedly attempting operations that are likely to fail.

The name comes from electrical engineering. In a physical circuit, a breaker “opens” when the current becomes unsafe, preventing damage. After a cooldown period, it allows current to flow again to test whether the issue has been resolved.

In software, the same principle applies. When Service A depends on Service B, and Service B becomes slow or unavailable, naïvely retrying every request can:

Exhaust thread pools
Saturate connection pools
Increase latency across the system
Trigger cascading failures
Bring down otherwise healthy services

Instead of continuing to send requests to a failing dependency, a circuit breaker:

Detects repeated failures
Opens the circuit and blocks calls
Fails fast without attempting the operation
Periodically tests whether the dependency has recovered

This turns uncontrolled failure into controlled degradation.

Why Circuit Breakers Matter in Spring Boot

Because circuit breakers are a foundational resilience pattern in distributed systems, most Spring Boot teams reach immediately for Resilience4j or legacy Hystrix‑style abstractions – and for good reason. These libraries are mature, well-tested, and production-proven.

However, treating circuit breakers as black boxes often leads to:

Misconfigured thresholds
Incorrect assumptions about failure handling
Difficulty extending behavior beyond library defaults
Debugging issues where “the breaker opened, but we don’t know why”

Building your own circuit breaker – even if you never ship it to production – forces you to understand the mechanics that actually protect your system. In some cases, a custom implementation also provides flexibility that general-purpose libraries cannot.

Why Circuit Breakers Are Foundational

Circuit breakers are a foundational resilience pattern because they protect your scarcest resources (like threads, network and database connections, and CPU time) from being exhausted by a failing dependency.

Without a breaker, a single slow service can gradually consume all of those resources and turn a local problem into a system-wide outage.

Circuit breakers enforce isolation boundaries between services and sit alongside timeouts, retries, bulkheads, and rate limiters, but they make one crucial strategic choice that simple retries do not: they stop trying for now. That decision is what prevents cascading collapse.

What Problem Circuit Breakers Solve That Timeouts and Retries Do Not

Timeouts and retries are reactive: timeouts cap how long you wait, and retries try the same operation again in the hope it succeeds.

A circuit breaker is proactive. It monitors failure patterns and, once a threshold is crossed, temporarily disables the failing integration point so new requests are rejected immediately instead of timing out. This dramatically reduces resource waste and stabilizes the system under stress.

The Circuit Breaker State Model

Any circuit breaker – library-based or custom – follows the same conceptual state machine.

Closed: In the Closed state, all requests are allowed and failures are simply monitored.
Open: When failures cross a configured threshold, the breaker moves to Open, blocks new requests, and makes them fail immediately.
Half-Open: After a cooldown period, it enters Half-Open, where it lets a small number of trial requests through to test whether the dependency has recovered; based on those results, it either returns to Closed or goes back to Open.

The complexity lies not in the states themselves, but in how and when transitions occur.
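
To make the allowed transitions concrete, here is a minimal sketch of the state machine as a lookup table. The names BreakerState, allowedNext, and canTransitionTo are assumptions made for illustration; they are not part of the class built later in this article.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper: encodes which state changes a breaker is allowed to make.
enum BreakerState {
    CLOSED, OPEN, HALF_OPEN;

    // Returns the states that may legally follow this one.
    Set<BreakerState> allowedNext() {
        switch (this) {
            case CLOSED:
                return EnumSet.of(OPEN);           // failure threshold reached
            case OPEN:
                return EnumSet.of(HALF_OPEN);      // cooldown elapsed
            case HALF_OPEN:
                return EnumSet.of(CLOSED, OPEN);   // trial succeeded or failed
            default:
                throw new AssertionError(this);
        }
    }

    boolean canTransitionTo(BreakerState next) {
        return allowedNext().contains(next);
    }
}
```

A table like this is handy in unit tests: any transition the breaker performs that is not in allowedNext() points to a bug in the transition logic.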

Why Not Just Use Resilience4j?

Resilience4j is excellent, but there are valid reasons to build your own:

You want non-standard failure logic (for example, domain-aware errors).
You need custom recovery strategies.
You want state persisted or shared differently.
You need tight integration with business metrics.
You want to understand the internals for tuning and debugging.

More importantly, understanding the internals prevents misuse. Many production incidents stem from misconfigured circuit breakers rather than missing ones.

Design Goals for a Custom Circuit Breaker

Before writing any code, we need to be clear about what “correct” behavior looks like. A circuit breaker seems simple in theory, but subtle design mistakes can introduce race conditions, false openings, or silent failures where it stops protecting the system.

The following goals shape a predictable and production-safe implementation.

Thread-Safe and Low Overhead

The breaker sits on the hot path of outbound calls, so every protected request passes through it. If it introduces lock contention or heavy synchronization, it quickly becomes a bottleneck.

The implementation needs to avoid coarse-grained locking, use atomic primitives carefully, and serialize state transitions without blocking execution more than necessary. Thread safety is non‑negotiable: a circuit breaker that misbehaves under concurrency is worse than having no breaker at all.

Predictable State Transitions

Circuit breakers are state machines. If their transitions are inconsistent or prone to races, you end up with split‑brain behavior – one thread believes the breaker is OPEN while another believes it is CLOSED – and your protection becomes undefined.

To avoid this, every transition (CLOSED → OPEN → HALF_OPEN → CLOSED) must be explicit, atomic, and deterministic, all guarded by a single transition mechanism. In this design, predictability matters far more than cleverness.

Explicit Failure Tracking

Not every failure should open the breaker. If you blindly count every exception, you risk opening the breaker on client validation errors, treating business rule violations as infrastructure failures, and hiding real domain bugs behind resilience logic.

Failure classification has to be deliberate: the breaker should react only to infrastructure‑level problems such as timeouts, connection errors, and 5xx responses, not to domain logic errors. Keeping that separation ensures your resilience layer stays aligned with actual failure modes.

Time-Based Recovery Using a Scheduler

Some implementations check timestamps on every request to decide when to move from OPEN to HALF_OPEN, adding extra branching to the hot path.

Instead, this design uses a scheduler: when the breaker opens, it schedules a recovery attempt, keeps the OPEN state purely fail‑fast, and avoids request‑driven polling. That approach reduces branching and contention under load. Recovery should be controlled and predictable – not opportunistic.

Framework-Agnostic Core Logic

The breaker itself should be plain Java – no Spring annotations, no AOP, and no direct framework coupling. That choice makes unit testing easier, keeps the component portable, and preserves a clean separation of concerns with less hidden magic. Spring should wrap the breaker, not define it, so your resilience strategy is not trapped inside any one framework’s abstractions.

Easy Integration into Spring Boot

Although the core logic is framework‑agnostic, it still needs to plug cleanly into a Spring application. That means wiring it via @Configuration, supporting dependency injection, and calling it from clear execution points in your service layer. Resilience behavior should be obvious in code reviews. Hiding it behind annotations often leads to confusion when you are debugging production issues.

How to Build a Minimal Working CircuitBreaker Class

Now let’s turn the conceptual components into a single cohesive class. This is still a minimal implementation, but it’s complete enough to demonstrate state, failure tracking, scheduling, and execution logic in one place.

A minimal circuit breaker consists of:

State holder
Failure tracker
Transition rules
Scheduler for recovery
Execution guard

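import java.time.Duration;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public final class CircuitBreaker {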

    enum State {
        CLOSED,
        OPEN,
        HALF_OPEN
    }

    private final ScheduledExecutorService scheduler;
    private final int failureThreshold;
    private final int halfOpenTrialLimit;
    private final Duration openCooldown;

    private final AtomicInteger failureCount = new AtomicInteger(0);
    private final AtomicInteger halfOpenTrials = new AtomicInteger(0);

    // All transitions go through this field, guarded by `synchronized` blocks.
    private volatile State state = State.CLOSED;

    public CircuitBreaker(
            ScheduledExecutorService scheduler,
            int failureThreshold,
            int halfOpenTrialLimit,
            Duration openCooldown
    ) {
        this.scheduler = scheduler;
        this.failureThreshold = failureThreshold;
        this.halfOpenTrialLimit = halfOpenTrialLimit;
        this.openCooldown = openCooldown;
    }

    public <T> T execute(Supplier<T> action) {
        // 1. Guard the call based on the current state.
        // The synchronized block ensures no other thread changes the state
        // while we read it and update the trial counter.
        State current;
        synchronized (this) {
            current = state;

            if (current == State.OPEN) {
                throw new IllegalStateException("Circuit breaker is OPEN. Call rejected.");
            }

            if (current == State.HALF_OPEN) {
                int trials = halfOpenTrials.incrementAndGet();
                if (trials > halfOpenTrialLimit) {
                    // Too many trial requests; fail fast.
                    halfOpenTrials.decrementAndGet();
                    throw new IllegalStateException("Circuit breaker is HALF_OPEN. Trial limit exceeded.");
                }
            }
        }

        // 2. Execute the protected operation, for example an API call to another system.
        try {
            T result = action.get();
            // 3. Record success
            onSuccess();
            return result;
        } catch (Throwable t) {
            // 3. Record failure
            onFailure(t);
            // 4. Propagate to caller
            if (t instanceof RuntimeException re) {
                throw re;
            }
            if (t instanceof Error e) {
                throw e;
            }
            throw new RuntimeException(t);
        }
    }

    private void onSuccess() {
        synchronized (this) {
            failureCount.set(0);

            if (state == State.HALF_OPEN) {
                // A successful trial closes the breaker.
                transitionToClosed();
            }
        }
    }

    private void onFailure(Throwable t) {
        // Example: only count "server-side" failures.
        boolean breakerRelevant = true; // placeholder for domain-specific checks

        if (!breakerRelevant) {
            return;
        }

        synchronized (this) {
            int failures = failureCount.incrementAndGet();
            if (state == State.CLOSED && failures >= failureThreshold) {
                transitionToOpen();
            } else if (state == State.HALF_OPEN) {
                // Any failure in HALF_OPEN sends us back to OPEN.
                transitionToOpen();
            }
        }
    }

    private void transitionToOpen() {
        state = State.OPEN;
        // Reset counters so the next CLOSED phase starts clean.
        failureCount.set(0);
        halfOpenTrials.set(0);
        scheduleHalfOpen();
    }

    private void transitionToHalfOpen() {
        synchronized (this) {
            state = State.HALF_OPEN;
            halfOpenTrials.set(0);
        }
    }

    private void transitionToClosed() {
        state = State.CLOSED;
        failureCount.set(0);
        halfOpenTrials.set(0);
    }

    private void scheduleHalfOpen() {
        scheduler.schedule(
                this::transitionToHalfOpen,
                openCooldown.toMillis(),
                TimeUnit.MILLISECONDS
        );
    }
}

Now we’ll walk through each responsibility in that class: why the fields exist, how state transitions work, where concurrency guarantees matter, how execution is guarded, and how the scheduler drives recovery.

Each subsection maps directly back to part of this class – we’re not introducing new concepts, just explaining the behavior implemented within the code above.

Concurrency and State Transition Guarantees

Although the breaker uses atomic primitives for counters and a volatile state field, this only works because all state transitions are guarded consistently.

In practice, every transition – CLOSED → OPEN, OPEN → HALF_OPEN, HALF_OPEN → CLOSED – must be performed under the same synchronization mechanism as shown below: either a single lock or a CAS-based state machine. Mixing unsynchronized state writes with atomic counters can lead to split-brain behavior (for example, one thread reopening the breaker while another closes it).

synchronized (this) {
            current = state;

            if (current == State.OPEN) {
                throw new IllegalStateException("Circuit breaker is OPEN. Call rejected.");
            }

            if (current == State.HALF_OPEN) {
                int trials = halfOpenTrials.incrementAndGet();
                if (trials > halfOpenTrialLimit) {
                    // Too many trial requests; fail fast.
                    halfOpenTrials.decrementAndGet();
                    throw new IllegalStateException("Circuit breaker is HALF_OPEN. Trial limit exceeded.");
                }
            }
        }

The rule is simple: reads may be optimistic, but writes and transitions must be serialized.
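
For comparison, here is a rough sketch of what the CAS-based alternative could look like. It is not the approach used in this article's class, and CasStateHolder and its method names are made up for illustration. Each method attempts exactly one transition with compareAndSet, so a transition either happens once or not at all.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative CAS-based state holder (an assumption, not the article's class).
final class CasStateHolder {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);

    State current() {
        return state.get();
    }

    // Attempts CLOSED -> OPEN or HALF_OPEN -> OPEN.
    // Returns true only for the thread whose compare-and-set succeeds.
    boolean tryOpen() {
        return state.compareAndSet(State.CLOSED, State.OPEN)
                || state.compareAndSet(State.HALF_OPEN, State.OPEN);
    }

    // Attempts OPEN -> HALF_OPEN; does nothing if the breaker is in another state.
    boolean tryHalfOpen() {
        return state.compareAndSet(State.OPEN, State.HALF_OPEN);
    }

    // Attempts HALF_OPEN -> CLOSED after a successful trial call.
    boolean tryClose() {
        return state.compareAndSet(State.HALF_OPEN, State.CLOSED);
    }
}
```

With this shape, only the thread for which tryOpen() returns true would schedule the recovery task, which avoids duplicate scheduling without holding a lock.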

Explaining the State Model in the Class

At the core of the implementation is a simple but strict state machine represented by the State enum: CLOSED, OPEN, and HALF_OPEN.

The state field is declared volatile so that a write by one thread is visible to any other thread that reads the field afterwards. Volatile alone does not make a read-then-write sequence atomic, which is why transitions are also synchronized.

Alongside the state, the class maintains failureCount and halfOpenTrials counters using AtomicInteger (see the full class above). These track how failures accumulate and how many recovery attempts we have made, without resorting to coarse-grained locks.

The key design idea is separation of responsibilities: the enum captures the current mode of operation, while the atomic counters hold the metrics that influence state transitions. Atomic increments alone do not guarantee safe transitions, though, so all updates to the state still follow a consistent serialization strategy to avoid race conditions.

enum State {
        CLOSED,
        OPEN,
        HALF_OPEN
    }

This structure gives us a clear foundation: a small, explicit state machine with observable transition boundaries.

Failure Tracking Inside the Class

private void onFailure(Throwable t) {
        // Example: only count "server-side" failures.
        boolean breakerRelevant = true; // placeholder for domain-specific checks

        if (!breakerRelevant) {
            return;
        }

        synchronized (this) {
            int failures = failureCount.incrementAndGet();
            if (state == State.CLOSED && failures >= failureThreshold) {
                transitionToOpen();
            } else if (state == State.HALF_OPEN) {
                // Any failure in HALF_OPEN sends us back to OPEN.
                transitionToOpen();
            }
        }
    }

In this implementation, failure tracking is intentionally simple: we count consecutive failures. Each time a protected call throws an exception we classify as breaker‑relevant, failureCount is incremented. On a successful call, the counter resets.

I chose consecutive failures for clarity rather than sophistication. More advanced strategies, like sliding time windows or failure ratios, introduce extra state and timing complexity. When you’re learning how a breaker works, a simple counter makes the transition rules easy to reason about and easy to test.

Equally important, the breaker should not treat every exception the same. Domain validation errors, client misuse, and business rule violations shouldn’t affect the breaker’s state. Only infrastructure‑level problems (like timeouts, connection failures, or 5xx responses) should move the breaker toward OPEN. That separation keeps the breaker focused on dependency instability, not application bugs or bad inputs.
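
As a sketch of what could replace the breakerRelevant placeholder, the classifier below treats I/O and timeout exceptions as infrastructure failures and ignores everything else. FailureClassifier is a hypothetical name, and the choice of exception types is an assumption; your HTTP client may wrap or report errors differently.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;

// Hypothetical classifier: decides whether a failure should count toward opening the breaker.
final class FailureClassifier implements Predicate<Throwable> {

    @Override
    public boolean test(Throwable failure) {
        // Walk the cause chain, because clients often wrap low-level errors.
        for (Throwable cur = failure; cur != null; cur = cur.getCause()) {
            // IOException covers SocketTimeoutException and ConnectException.
            if (cur instanceof IOException || cur instanceof TimeoutException) {
                return true;
            }
        }
        // Validation errors, business rule violations, and the like are ignored.
        return false;
    }
}
```

Inside onFailure, the placeholder line could then become something like boolean breakerRelevant = classifier.test(t), with the classifier passed in through the constructor.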

How Closed State Transitions to Open

When the breaker is in the CLOSED state, all requests flow through normally. In this phase the breaker is purely observational: it monitors outcomes and increments failureCount whenever a breaker‑relevant exception occurs.

Inside the onFailure method (shown in the previous section), once failureCount reaches the configured threshold, the breaker transitions to OPEN. This transition must be atomic and serialized. Otherwise, multiple threads could try to open the breaker at the same time, leading to inconsistent scheduling or duplicate recovery tasks.

private void transitionToOpen() {
        state = State.OPEN;
        // Reset counters so the next CLOSED phase starts clean.
        failureCount.set(0);
        halfOpenTrials.set(0);
        scheduleHalfOpen();
    }

Moving to OPEN immediately changes system behavior. From that point on, new requests are rejected without attempting the protected operation, which shields downstream services and preserves local resources such as threads and connection pools.

OPEN State Behavior in the Class

The OPEN state represents pure fail‑fast behavior. While the breaker is open, no protected calls are executed. The execute() method immediately throws an exception indicating that the circuit is open.

public <T> T execute(Supplier<T> action) {
        // 1. Guard the call based on the current state.
        // The synchronized block ensures no other thread changes the state
        // while we read it and update the trial counter.
        State current;
        synchronized (this) {
            current = state;

            if (current == State.OPEN) {
                throw new IllegalStateException("Circuit breaker is OPEN. Call rejected.");
            }
....
}

This behavior is not about improving latency – it is about resource protection. Letting calls continue and simply “wait for timeouts” would still tie up threads and connections. The value of the OPEN state is that it refuses to participate in propagating failure at all.

In this state, the breaker has a single responsibility: wait for the scheduled recovery attempt. It doesn’t check timestamps on each request or poll in the hot path. Its behavior is deterministic: reject immediately and let the scheduler decide when to try again.
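
One thing to watch for: the class above signals rejection with IllegalStateException, which the protected call itself could also throw for unrelated reasons. A possible refinement, sketched here as an assumption rather than as part of the article's class, is a dedicated exception type plus a small fallback helper. CircuitOpenException and Fallbacks are hypothetical names.

```java
import java.util.function.Supplier;

// Hypothetical exception type used only to signal "call rejected by the breaker".
final class CircuitOpenException extends RuntimeException {
    CircuitOpenException(String message) {
        // Optional optimization: skip stack trace capture, since rejections
        // can be very frequent while the breaker is open.
        super(message, null, false, false);
    }
}

final class Fallbacks {
    private Fallbacks() {}

    // Runs the protected call. If the breaker rejected it, returns the fallback value.
    // Failures thrown by the call itself still propagate to the caller.
    static <T> T orElse(Supplier<T> protectedCall, Supplier<T> fallback) {
        try {
            return protectedCall.get();
        } catch (CircuitOpenException rejected) {
            return fallback.get();
        }
    }
}
```

A caller could then write Fallbacks.orElse(() -> breaker.execute(call), () -> cachedValue), assuming execute() were changed to throw CircuitOpenException.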

Scheduler‑Driven Recovery: Entering HALF_OPEN

When the breaker transitions to OPEN, it immediately schedules a delayed task using the injected ScheduledExecutorService. After the configured cooldown period elapses, that task transitions the breaker to HALF_OPEN.

// The relevant methods from the main class:

private void transitionToOpen() {
        state = State.OPEN;
        // Reset counters so the next CLOSED phase starts clean.
        failureCount.set(0);
        halfOpenTrials.set(0);
        scheduleHalfOpen(); // schedule the delayed HALF_OPEN transition after switching to State.OPEN
    }

private void scheduleHalfOpen() {
        scheduler.schedule(
                this::transitionToHalfOpen,
                openCooldown.toMillis(),
                TimeUnit.MILLISECONDS
        );
    }

This design keeps time-based logic out of the request execution path. Rather than checking elapsed time on every call, the breaker delegates recovery timing to a dedicated scheduler thread. This reduces conditional logic under load and keeps the execute() method focused on guarding execution.

The scheduler must be reliable and isolated. A single-threaded executor is typically sufficient because transitions are rare and lightweight. More importantly, transitions should be idempotent so that unexpected rescheduling does not corrupt state.
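
One way to make rescheduling harmless, sketched below, is to keep a handle to the pending recovery task and cancel it before scheduling a new one. RecoveryTimer is a hypothetical helper, not part of the class above. It complements, rather than replaces, a check inside the transition itself that the breaker is still OPEN.

```java
import java.time.Duration;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: keeps at most one pending recovery task per breaker.
final class RecoveryTimer {

    private final ScheduledExecutorService scheduler;
    private ScheduledFuture<?> pending; // guarded by `this`

    RecoveryTimer(ScheduledExecutorService scheduler) {
        this.scheduler = scheduler;
    }

    // Schedules the task after the delay, cancelling any task scheduled earlier
    // so that a repeated call cannot leave two recovery attempts queued.
    synchronized ScheduledFuture<?> schedule(Runnable task, Duration delay) {
        if (pending != null) {
            pending.cancel(false); // do not interrupt a task that is already running
        }
        pending = scheduler.schedule(task, delay.toMillis(), TimeUnit.MILLISECONDS);
        return pending;
    }
}
```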

Spring Boot Scheduler Example

In Spring Boot, you can wire a dedicated ScheduledExecutorService bean to drive state transitions instead of using plain Java threads.

@Configuration
class CircuitBreakerConfig {

    // First bean 
    @Bean
    ScheduledExecutorService circuitBreakerScheduler() {
        return Executors.newSingleThreadScheduledExecutor();
    }

    // Second bean 
    @Bean
    CircuitBreaker circuitBreaker(ScheduledExecutorService circuitBreakerScheduler) {
        return new CircuitBreaker(
                circuitBreakerScheduler,
                5,                     // failureThreshold
                2,                     // halfOpenTrialLimit
                Duration.ofSeconds(30) // openCooldown
        );
    }
}

The configuration class above wires the circuit breaker into the Spring container without introducing framework coupling into the breaker itself.

The first bean, circuitBreakerScheduler(), defines a dedicated ScheduledExecutorService. This executor is responsible exclusively for time-based state transitions. When the breaker moves to OPEN, it uses this scheduler to queue a delayed task that transitions the state to HALF_OPEN.

Using a single-threaded executor is intentional. Circuit breaker transitions are lightweight and infrequent, so parallel scheduling is unnecessary. A single thread guarantees serialized transition execution and avoids overlapping recovery attempts.

The second bean constructs the CircuitBreaker itself. Here we inject the scheduler and configure three things: a failure threshold of 5 consecutive errors, a half‑open trial limit of 2 test requests, and a 30‑second cooldown before we attempt recovery again. This configuration makes the breaker’s behavior explicit and easy to reason about – there are no hidden properties files or annotations, because everything that affects resilience is defined in one place.

At this point, the breaker is a fully managed Spring bean that you can inject into services and use programmatically.

How This Connects to Execution Flow

Once registered as a bean, the breaker becomes part of the application’s dependency graph. A typical service might inject it and wrap outbound calls:

@Service
class ExternalApiService {

    private final CircuitBreaker circuitBreaker;
    private final RestTemplate restTemplate;

    ExternalApiService(CircuitBreaker circuitBreaker, RestTemplate restTemplate) {
        this.circuitBreaker = circuitBreaker;
        this.restTemplate = restTemplate;
    }

    public String callExternal() {
        return circuitBreaker.execute(() ->
                restTemplate.getForObject("http://external/api", String.class)
        );
    }
}

Every outbound call to the external system flows through the breaker’s execute() method, which enforces the current state rules before allowing the call to proceed. That makes resilience behavior explicit at the integration boundary: anyone reviewing the service can immediately see that the call is protected. There is no hidden interception layer and no AOP proxy quietly changing behavior at runtime.

Scheduler Design and Thread Safety

The scheduler’s only responsibility is delayed state transition. It doesn’t execute business logic and it doesn’t evaluate request outcomes. Its purpose is narrowly scoped: move the breaker from OPEN to HALF_OPEN after a cooldown.

Because the executor is single-threaded, scheduled tasks cannot overlap. But this doesn’t eliminate concurrency concerns entirely. Request threads may still attempt transitions at the same time the scheduler fires. For this reason, transition methods such as transitionToHalfOpen() and transitionToOpen() must remain serialized and idempotent.

In other words, even though the scheduler simplifies time-based recovery, it doesn’t replace the need for careful state management.

The architectural separation looks like this:

Request threads → enforce execution rules and record outcomes
Scheduler thread → handle time-based recovery transitions

Keeping these responsibilities separate reduces complexity in the hot path and improves predictability under load.

Why We Avoid @Scheduled for This Design

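Spring provides @Scheduled as an alternative mechanism for time-based tasks. While convenient, it is built for recurring jobs declared up front (fixed rate, fixed delay, or cron), and those jobs run on the application's shared task scheduler, which reduces isolation.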

By using a dedicated ScheduledExecutorService for the breaker, we avoid interference with other scheduled jobs, keep lifecycle control explicit, and tie scheduling logic directly to breaker transitions.
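
To illustrate what explicit lifecycle control can look like in plain Java, here is a small sketch that creates a named, single-threaded scheduler and shuts it down deliberately. BreakerSchedulers and its method names are assumptions for this example, as are the daemon flag and the one-second wait.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

// Hypothetical factory for a scheduler that is used only by the circuit breaker.
final class BreakerSchedulers {
    private BreakerSchedulers() {}

    // Creates a single-threaded scheduler with a recognizable thread name.
    // The daemon flag keeps this thread from blocking JVM shutdown.
    static ScheduledExecutorService newDedicatedScheduler(String threadName) {
        ThreadFactory factory = runnable -> {
            Thread thread = new Thread(runnable, threadName);
            thread.setDaemon(true);
            return thread;
        };
        return Executors.newSingleThreadScheduledExecutor(factory);
    }

    // Stops the scheduler: waits briefly, then cancels whatever is still pending.
    static void shutdownQuietly(ScheduledExecutorService scheduler) {
        scheduler.shutdown();
        try {
            if (!scheduler.awaitTermination(1, TimeUnit.SECONDS)) {
                scheduler.shutdownNow();
            }
        } catch (InterruptedException e) {
            scheduler.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }
}
```

A named thread also makes recovery transitions easy to spot in thread dumps and log output.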

This design reinforces the principle that resilience components should be isolated and predictable.

Bringing It All Together

At this stage, the full interaction looks like this:

A service wraps its dependency call with circuitBreaker.execute().
If the breaker is CLOSED, the call proceeds and any relevant failures are counted.
When failures reach the threshold, the breaker moves to OPEN and schedules a recovery attempt.
While OPEN, calls fail immediately without hitting the downstream system.
After the cooldown period, the scheduler transitions the breaker to HALF_OPEN.
A small number of trial calls then decide whether the breaker returns to CLOSED or goes back to OPEN.

Nothing is hidden: every transition is visible in code, every configuration value is explicit, and each thread involved has a single responsibility. That clarity is what makes a custom implementation useful for learning – and safe when it is designed correctly.

Observability: Making the Breaker Understandable

A circuit breaker without observability is risky. At a minimum you should expose the current state, the failure count, the time of the last transition, and how long the breaker has been open.

On the metrics side, track how often the breaker opens, how many calls are rejected per second, and the success rate of recovery attempts.

Your logs should record state transitions at INFO level and failure classification decisions at DEBUG. With that level of visibility, your custom breaker is often easier to understand and tune than what many libraries provide out of the box.
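
As a starting point, the sketch below collects the values mentioned above in one small object that the breaker could update from its transition methods. BreakerMetrics and its method names are hypothetical. A real setup would likely publish these numbers through whatever metrics library you already use.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical metrics holder; values are approximate under concurrency,
// which is usually acceptable for monitoring.
final class BreakerMetrics {

    private final Clock clock;
    private final LongAdder opens = new LongAdder();
    private final LongAdder rejectedCalls = new LongAdder();
    private volatile String state = "CLOSED";
    private volatile Instant lastTransition;

    BreakerMetrics(Clock clock) {
        this.clock = clock;
        this.lastTransition = clock.instant();
    }

    // Call from every transition method, for example onTransition("OPEN").
    void onTransition(String newState) {
        if ("OPEN".equals(newState)) {
            opens.increment();
        }
        state = newState;
        lastTransition = clock.instant();
    }

    // Call whenever execute() rejects a request without running it.
    void onRejected() {
        rejectedCalls.increment();
    }

    long openCount() { return opens.sum(); }

    long rejectedCount() { return rejectedCalls.sum(); }

    String state() { return state; }

    // How long the breaker has been in its current state.
    Duration timeInCurrentState() {
        return Duration.between(lastTransition, clock.instant());
    }
}
```

Injecting a Clock keeps the time-based values testable: a unit test can pass a fixed clock instead of sleeping.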

Handling Different Failure Types

Not all failures are equal.

API Response Timeouts → breaker‑relevant
API 5xx responses → breaker‑relevant
API 4xx responses → usually not
Any data or business validation errors → never

A custom breaker lets you apply this kind of business‑aware classification, which is often hard to express cleanly with generic libraries.
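
For HTTP dependencies, the list above could be turned into a status-code check along these lines. HttpFailurePolicy is a hypothetical name. Treating 408 and 429 as breaker-relevant is an assumption on my part (they are the kind of case the word "usually" leaves room for), so adjust it to your own dependency's behavior.

```java
// Hypothetical policy: maps an HTTP status code to a breaker decision.
final class HttpFailurePolicy {
    private HttpFailurePolicy() {}

    // Returns true when a response status should count toward opening the breaker.
    static boolean isBreakerRelevant(int status) {
        if (status >= 500 && status <= 599) {
            return true; // the dependency itself is failing
        }
        // Assumption: 408 (Request Timeout) and 429 (Too Many Requests) indicate
        // a slow or overloaded dependency rather than a caller mistake.
        if (status == 408 || status == 429) {
            return true;
        }
        // Other 4xx codes are the caller's problem; 2xx and 3xx are not failures.
        return false;
    }
}
```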

Custom Breaker vs Resilience4j

Aspect	Custom Breaker	Resilience4j
Learning value	High	Low
Flexibility	High	Medium
Time to implement	Medium	Low
Operational maturity	Depends	High
Custom failure logic	Easy	Limited
Tooling / metrics	You wire metrics, logs, observability manually	Built-in metrics, logging, and integrations

The choice is not binary. Many teams prototype with a custom breaker and later replace it with Resilience4j – now correctly configured.

When You Should Not Build Your Own

Do not build a custom breaker if:

You lack observability.
You do not understand concurrency.
You need advanced features immediately.
Your system is safety-critical.

For example, if you are building a payments platform with strict SLAs and cannot afford to battle-test a custom breaker, stick with a mature library like Resilience4j. The risk of subtle concurrency bugs, misclassified failures, or scheduler misconfigurations is too high to experiment in production.

Extending the Design

Once you understand the core, you can add:

Sliding window metrics.
Adaptive thresholds.
Persistent breaker state.
Distributed breakers (per dependency).
Integration with feature flags.

These extensions are much easier when you control the internals.
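
As an example of the first extension, here is a minimal sketch of a count-based sliding window that tracks the failure rate over the last N calls. It is an illustration only: SlidingWindowFailureRate is a hypothetical class, it is not wired into the breaker above, and it is not how any particular library implements its window.

```java
// Hypothetical count-based sliding window over the most recent call outcomes.
final class SlidingWindowFailureRate {

    private final boolean[] outcomes; // true = failure
    private int next = 0;             // slot that the next outcome overwrites
    private int recorded = 0;         // number of filled slots, capped at the window size
    private int failures = 0;         // failures currently inside the window

    SlidingWindowFailureRate(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.outcomes = new boolean[windowSize];
    }

    // Records one call outcome, evicting the oldest outcome once the window is full.
    synchronized void record(boolean failed) {
        if (recorded == outcomes.length) {
            if (outcomes[next]) {
                failures--; // the evicted outcome was a failure
            }
        } else {
            recorded++;
        }
        outcomes[next] = failed;
        if (failed) {
            failures++;
        }
        next = (next + 1) % outcomes.length;
    }

    // Failure ratio over the calls currently in the window; 0.0 when nothing is recorded.
    synchronized double failureRate() {
        return recorded == 0 ? 0.0 : (double) failures / recorded;
    }

    // True once the window holds a full set of outcomes.
    synchronized boolean isFull() {
        return recorded == outcomes.length;
    }
}
```

A breaker built on this would typically open only when isFull() is true and failureRate() crosses a configured ratio, so that a single early failure does not trip it.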

Common Mistakes

Common mistakes when working with circuit breakers include:

Opening the breaker on the first failure.
Blocking threads while OPEN.
Allowing unlimited HALF_OPEN requests.
Treating all exceptions equally.
Ignoring observability.

Most of these happen when using libraries without understanding them.

Conclusion

Resilience libraries are powerful, but they are not magic. A circuit breaker is fundamentally a state machine with failure tracking and time-based transitions. Building your own – even once – forces you to internalize this reality.

In Spring Boot systems, a custom circuit breaker:

Clarifies failure semantics.
Improves debugging.
Enables domain-specific resilience.
Makes you a better user of Resilience4j.

You may never deploy your own breaker to production. But after building one, you will never configure a circuit breaker blindly again.
