FastAPI - freeCodeCamp.org

How to Serve a Multi-User AI Agent with FastAPI and Streamlit

Darsh Shah — Mon, 20 Jul 2026 22:07:49 +0000

In this tutorial, I’ll show you how to serve a multi-user local AI agent as a REST API using FastAPI, then add a lightweight Streamlit UI on top.

Instead of interacting with the agent through a terminal, we’ll expose it over HTTP so multiple users can access it through a chat-style frontend interface. Each session will maintain its own conversation history and streamed responses.

The local AI agent will be built with LangChain v1, Ollama, Qwen, and Python, running on your own machine and ready to plug into larger applications without any per-call model API charges.

Background
What is FastAPI?
What is Streamlit?
What Is Multi-User Support?
Motivation and Architecture
Step 1: Install Ollama and Pull the Model
Step 2: Install Python Dependencies
Step 3: Build the Agent and API Layer with FastAPI
Step 4: Build Streamlit UI
Step 5: Run the Backend App
Step 6: Run the Frontend App
Sample Output
What to Improve Before Production
Conclusion

Background

Many AI agents start out as simple Python scripts that run in a command-line terminal. You type a message, the agent responds, and everything happens in a single local session.

That setup is great for development and testing, but it becomes limiting when you want other people or applications to interact with the agent.

To make an AI agent truly useful, we need to expose it through an interface that other users can access. A REST API is a practical way to do that.

To follow this tutorial, you'll need Ollama installed on your machine. The tutorial works on macOS, Windows, and Linux. I'm using a MacBook Pro with 32 GB of RAM, but you can run this on a lower-memory machine by choosing a smaller Qwen model from Ollama.

What is FastAPI?

FastAPI is a Python web framework for building APIs. In this tutorial, it gives us a simple way to expose the agent over HTTP so other apps, scripts, or services can call it.

FastAPI is a good fit for AI apps because it gives us a clean boundary around the system. We define the request and response models in Python, FastAPI validates them automatically, and it turns HTTP requests into Python objects and Python objects back into JSON. It also generates interactive API docs for free and supports async endpoints, which is useful for AI workloads that may take longer to respond.
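To make the validation step concrete, here is a rough, standard-library-only sketch of the kind of work FastAPI and Pydantic do for each request: parse the JSON body, check the fields, and hand back a typed Python object. The `ChatPayload` dataclass and `parse_chat_payload` helper are hypothetical names used only for this illustration. They are not part of FastAPI or Pydantic, which do considerably more than this.

```python
import json
from dataclasses import dataclass
from uuid import UUID


@dataclass
class ChatPayload:
    """Typed view of a request body (hypothetical, for illustration only)."""
    user_id: UUID
    message: str


def parse_chat_payload(raw_body: str) -> ChatPayload:
    """Turn a JSON string into a ChatPayload, raising ValueError on bad input."""
    data = json.loads(raw_body)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("body must be a JSON object")
    if not isinstance(data.get("message"), str):
        raise ValueError("message must be a string")
    if not isinstance(data.get("user_id"), str):
        raise ValueError("user_id must be a string")
    # UUID() raises ValueError for malformed identifiers.
    return ChatPayload(user_id=UUID(data["user_id"]), message=data["message"])


good = parse_chat_payload(
    '{"message": "hi", "user_id": "123e4567-e89b-12d3-a456-426614174000"}'
)
print(type(good.user_id).__name__, good.message)

try:
    parse_chat_payload('{"message": "hi", "user_id": "not-a-uuid"}')
except ValueError as exc:
    print("rejected:", exc)
```

With FastAPI, declaring a request model replaces all of this hand-written checking, and a request that fails validation gets an error response before your endpoint code runs.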

What is Streamlit?

Streamlit is a Python framework for building lightweight web interfaces with minimal frontend work. It lets us create interactive browser-based apps using normal Python code instead of HTML, CSS, and JavaScript.

In this tutorial, Streamlit sits on top of the FastAPI backend as a thin client. FastAPI exposes the AI agent over HTTP, and Streamlit gives us a simple UI for calling that API and displaying the results. That separation keeps the backend reusable while still making the agent easy to use in the browser.

What Is Multi-User Support?

Multi-user support means the AI agent can handle requests from more than one user while keeping each user’s session separate.

For example, User 1 asks the agent one question and User 2 asks a different question. The agent should remember the correct context for each user independently. Without multi-user support, all users may end up sharing the same conversation state, which can lead to mixed responses, incorrect memory, or overwritten context.
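The idea can be sketched without any AI libraries at all. The snippet below is a toy illustration, not the mechanism the agent uses: it keeps one history list per user ID, so two users asking the same question see only their own context. The `histories` dictionary and `remember` helper are hypothetical names for this sketch.

```python
from collections import defaultdict

# One history list per user_id. A single shared list would mix everyone's messages.
histories: dict[str, list[str]] = defaultdict(list)


def remember(user_id: str, message: str) -> list[str]:
    """Append a message to this user's history and return a copy of it."""
    histories[user_id].append(message)
    return list(histories[user_id])


remember("user-1", "My name is Ada.")
remember("user-2", "My name is Grace.")

# Same question, different context for each user.
print(remember("user-1", "Who am I?"))
print(remember("user-2", "Who am I?"))
```

Keying state by a per-user identifier is the same principle the backend relies on later, when the user_id is passed to the agent as a thread identifier.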

Motivation and Architecture

Turning an AI agent into an API is the natural next step after building it locally. A Python script is great for experimenting, but an API makes the agent reusable, and multi-user support lets other people use it at the same time without sharing conversation state.

To keep things simple, we’ll use a small local agent powered by Ollama and Qwen. The agent has two tools: one for checking the current time and another for counting words.

FastAPI provides the HTTP layer by exposing one endpoint called /chat/stream. When a request comes in with a user message, Pydantic validates it, LangChain handles the agent loop and tool calling, and the final answer is returned as a stream of plain text. Streamlit sits on top of that API as a frontend that sends requests and displays the results.

Example request:

{
  "message": "How many words are in: LangChain makes tool calling easier",
  "user_id": "123e4567-e89b-12d3-a456-426614174000"
}

Example response (streamed back as plain text, not JSON):

There are **5** words in LangChain makes tool calling easier.

The model runs locally through Ollama, so there are no per-call model API charges.

Step 1: Install Ollama and Pull the Model

To get started, install the Ollama application for your platform.

We’ll use Qwen as the chat model. I’m using qwen3.5:4b. If your machine has less RAM, you can use qwen3.5:0.8b instead.

ollama pull qwen3.5:4b

Step 2: Install Python Dependencies

Create a virtual environment and install the required packages:

python3 -m venv venv
source venv/bin/activate

pip install fastapi uvicorn streamlit requests langchain langchain-core langchain-ollama langgraph

This tutorial requires LangChain >= 1.0.0.

Step 3: Build the Agent and API Layer with FastAPI

This application has three main responsibilities. FastAPI exposes the HTTP endpoint, Pydantic validates the incoming request data, and LangChain runs the agent, including tool calling and short-term memory.

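The user_id sent with each request is used as the thread identifier, allowing the checkpointer to keep each user’s conversation history separate. Because the Streamlit client generates a new user_id for every browser session, each new session starts with its own empty memory. And because InMemorySaver keeps checkpoints in process memory, all histories are lost when the backend restarts.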

Another important detail is that the agent is created only once at startup with agent = build_agent(). Reusing the same agent instance avoids rebuilding the model and tool list for every request, which reduces overhead and improves response times while still supporting multiple users.

Inside the /chat/stream endpoint, the backend uses LangChain’s stream_events(..., version="v3") to generate the response as a stream instead of waiting for the full answer all at once. FastAPI then wraps that stream in a StreamingResponse, so the frontend can receive the output gradually as it's produced. This makes the app feel much more interactive, because users can start reading the answer immediately while the rest is still being generated.
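The streaming pattern itself is just a Python generator on the producing side and a loop on the consuming side. The sketch below imitates that flow with the standard library only: `generate_tokens` stands in for the model emitting text piece by piece, and `consume` stands in for a client that accumulates chunks as they arrive. Both names are hypothetical, and splitting on spaces is a simplification of real tokenization.

```python
import time
from typing import Iterator


def generate_tokens(answer: str, delay: float = 0.0) -> Iterator[str]:
    """Yield the answer one word at a time, like a model emitting tokens."""
    for word in answer.split(" "):
        time.sleep(delay)  # stand-in for model latency
        yield word + " "


def consume(stream: Iterator[str]) -> str:
    """Accumulate chunks as they arrive, as a streaming client would."""
    full_answer = ""
    for chunk in stream:
        full_answer += chunk
        # A UI would redraw here with the partial answer.
    return full_answer.strip()


print(consume(generate_tokens("Paris is the capital of France.")))
```

Because a generator produces values lazily, the consumer can start working on the first chunk before the last one exists, which is what makes the response feel immediate.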

Put together, this gives you a lightweight backend that validates input, preserves separate memory for each user, and streams responses to the UI in real time.

Save the following code as app.py:

from datetime import datetime
from uuid import UUID

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse

from pydantic import BaseModel

from langchain.agents import create_agent
from langchain_core.tools import tool
from langchain_ollama import ChatOllama
from langgraph.checkpoint.memory import InMemorySaver

CHAT_MODEL = "qwen3.5:4b"

SYSTEM_PROMPT = (
    "You are a helpful assistant with access to tools for getting the current time "
    "and counting words in text. "
    "Use tools when needed. If the question does not need a tool, answer directly."
)

# -----------------------------
# Request model
# -----------------------------

class ChatRequest(BaseModel):
    user_id: UUID
    message: str

# -----------------------------
# Tools
# -----------------------------

@tool
def current_time() -> str:
    """Return the current local date and time."""
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")


@tool
def word_count(text: str) -> int:
    """Count the number of words in a piece of text."""
    return len(text.split())


# -----------------------------
# Agent + checkpoint memory
# -----------------------------

# Store conversation history in short term memory
checkpointer = InMemorySaver()

def build_agent():
    model = ChatOllama(model=CHAT_MODEL, temperature=0)
    return create_agent(
        model=model,
        tools=[current_time, word_count],
        system_prompt=SYSTEM_PROMPT,
        checkpointer=checkpointer,
    )


agent = build_agent()

# -----------------------------
# Streaming endpoint
# -----------------------------

app = FastAPI()

@app.post("/chat/stream")
def chat_stream(req: ChatRequest):
    def generate():
        run = agent.stream_events(
            {
                "messages": [{"role": "user", "content": req.message}],
            },
            config={
                "configurable": {
                    # Keep each user's short-term memory isolated
                    # by using their user_id as the thread ID.
                    "thread_id": str(req.user_id),
                }
            },
            version="v3",
        )

        for message in run.messages:
            for token in message.text:
                yield token

    return StreamingResponse(generate(), media_type="text/plain")

Step 4: Build Streamlit UI

The Streamlit code creates a simple chat interface for the AI agent and keeps each browser session tied to a unique user_id.

When the app first loads, it generates and stores a UUID in st.session_state, which is later sent to the backend so the agent can keep that user’s conversation history separate from other users. It also creates a chat_history list in session state so previous messages remain visible every time Streamlit reruns the script. The app then loops through that saved history and displays each message in a chat-style format using st.chat_message().

When the user enters a new message through st.chat_input(), the app immediately saves and displays it, then sends it to the backend API with a POST request to http://127.0.0.1:8001/chat/stream along with the session’s user_id.

The request is made with stream=True, which allows the response to arrive gradually instead of all at once. As each chunk of text is received from the backend, the code appends it to full_answer and updates a placeholder on the page, creating a live streaming effect. Once the response is complete, the final assistant message is stored in chat_history so it remains part of the conversation on the page.

Save the following code as streamlit_app.py:

import uuid
import requests
import streamlit as st

API_URL = "http://127.0.0.1:8001/chat/stream"

st.title("Local AI Agent")

if "user_id" not in st.session_state:
    st.session_state.user_id = str(uuid.uuid4())

if "chat_history" not in st.session_state:
    st.session_state.chat_history = []

# Show previous messages
for item in st.session_state.chat_history:
    with st.chat_message(item["role"]):
        st.markdown(item["content"])

message = st.chat_input("Enter a message")

if message:
    # Save and show user message
    st.session_state.chat_history.append({"role": "user", "content": message})
    with st.chat_message("user"):
        st.markdown(message)

    # Stream assistant response
    full_answer = ""
    with st.chat_message("assistant"):
        placeholder = st.empty()

        # Send the request to the backend API via POST
        with requests.post(
            API_URL,
            json={
                "message": message,
                "user_id": st.session_state.user_id,
            },
            stream=True,
        ) as response:
            response.raise_for_status()

            for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
                if chunk:
                    full_answer += chunk
                    placeholder.markdown(full_answer)

    # Save final assistant response
    st.session_state.chat_history.append(
        {"role": "assistant", "content": full_answer}
    )

Step 5: Run the Backend App

Start the server with Uvicorn:

uvicorn app:app --reload --port 8001

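Once the application starts, open the interactive API docs in your browser:

http://127.0.0.1:8001/docs

The root URL, http://127.0.0.1:8001/, returns 404 Not Found because app.py only defines the /chat/stream route.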

The /docs endpoint is automatically generated by FastAPI using your Pydantic models. It provides an interactive interface where you can test the API without writing any client code.

You can send requests directly from curl. In your terminal, run these commands to invoke the API for the AI agent and check the output:

$ curl -X POST http://127.0.0.1:8001/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message":"What time is it?","user_id":"123e4567-e89b-12d3-a456-426614174000"}'

$ curl -X POST http://127.0.0.1:8001/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message":"How many words are in: LangChain makes tool calling easier","user_id":"123e4567-e89b-12d3-a456-426614174000"}'

$ curl -X POST http://127.0.0.1:8001/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message":"What is the capital of France?","user_id":"123e4567-e89b-12d3-a456-426614174000"}'
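If you'd rather call the endpoint from Python than from curl, the following is one possible sketch using only the standard library. It assumes the backend from Step 3 is running on port 8001. The `build_request` and `stream_chat` helpers are hypothetical names for this example, and reading in 64-byte blocks is an arbitrary choice.

```python
import codecs
import json
import urllib.request

API_URL = "http://127.0.0.1:8001/chat/stream"  # same endpoint as the curl examples


def build_request(message: str, user_id: str) -> urllib.request.Request:
    """Build the POST request with a JSON body."""
    body = json.dumps({"message": message, "user_id": user_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_chat(message: str, user_id: str) -> str:
    """Send the request, print text as it arrives, and return the full answer."""
    # An incremental decoder copes with multi-byte characters split across reads.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    answer = ""
    with urllib.request.urlopen(build_request(message, user_id)) as response:
        while chunk := response.read(64):
            text = decoder.decode(chunk)
            print(text, end="", flush=True)
            answer += text
    answer += decoder.decode(b"", final=True)
    return answer
```

With the server running, calling stream_chat("What time is it?", "123e4567-e89b-12d3-a456-426614174000") should print the agent's reply as it is generated.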

To stop the server, press Ctrl+C in the terminal.

Step 6: Run the Frontend App

In another terminal, go to the project directory, activate the virtual environment, and start Streamlit:

source venv/bin/activate
streamlit run streamlit_app.py

That opens the frontend in your browser at http://localhost:8501/. Try an example prompt such as "What is the capital of France?". You should see the answer in a chat-style interface.

The UI calls the FastAPI endpoint, which invokes the AI agent. You now have a working end-to-end application for your local AI agent that you can play with.

To stop the server, press Ctrl+C in the terminal.

Sample Output

The image below shows two browser sessions of the app running side by side against the same endpoint. Each session is assigned a unique ID, which allows the backend to maintain a separate conversation history for each user.

Even though both users ask the same question, “Who am I?”, the responses are different because each session’s answer is based on its own prior messages.

What to Improve Before Production

Although this application is fully functional, it's still intentionally minimal. It already supports a reusable FastAPI backend, a Streamlit chat interface, per-user conversation history, and streaming responses.

If you wanted to take it further, the next steps would be adding authentication, persistent storage, structured logging, monitoring, and a more robust deployment setup.
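As one illustration of what persistent storage could look like, here is a minimal sketch that keeps chat messages in SQLite using Python's built-in sqlite3 module. This is an assumption-laden example, not how the tutorial's agent stores memory: the table layout and the `open_store`, `save_message`, and `load_history` helpers are all hypothetical, and a real deployment would more likely swap in a persistent checkpointer.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the message store. Pass a file path to persist to disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "user_id TEXT NOT NULL, "
        "role TEXT NOT NULL, "
        "content TEXT NOT NULL)"
    )
    return conn


def save_message(conn: sqlite3.Connection, user_id: str, role: str, content: str) -> None:
    """Insert one message; the connection context manager commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)",
            (user_id, role, content),
        )


def load_history(conn: sqlite3.Connection, user_id: str) -> list[tuple[str, str]]:
    """Return (role, content) pairs for one user, oldest first."""
    cursor = conn.execute(
        "SELECT role, content FROM messages WHERE user_id = ? ORDER BY id",
        (user_id,),
    )
    return cursor.fetchall()


store = open_store()
save_message(store, "user-1", "user", "Who am I?")
save_message(store, "user-2", "user", "What time is it?")
save_message(store, "user-1", "assistant", "You have not told me your name yet.")
print(load_history(store, "user-1"))
```

Filtering every query by user_id preserves the same per-user isolation the in-memory version provides, while a file-backed database would survive a server restart.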

It's also worth noting that if your goal is simply to get a polished self-hosted chat UI up and running quickly, you may not need to build the frontend yourself. Projects like LibreChat and Open WebUI already provide richer interfaces and broader features out of the box.

This tutorial takes a different approach: instead of adopting a full platform, it shows how to build a lightweight custom stack yourself so you can better understand the architecture and have more control over how the agent is exposed.

Conclusion

In this tutorial, we took a local AI agent, wrapped it in a FastAPI app, and added a Streamlit UI on top of it.

This transforms the AI agent from a standalone script into a reusable service. Instead of only working in a terminal, it can now be accessed through a simple HTTP endpoint by other apps, scripts, or internal tools.

By assigning each session a unique id, the service can also maintain separate conversation history for multiple users, making it possible to support a chat-style interface with isolated memory per session.

From here, you can continue extending the same service by adding authentication or production-ready features. Happy tinkering!

If you enjoyed this tutorial, you can find more of my writing on my blog (recent posts include a system design paper series), my work on my personal website, and updates on LinkedIn.

How to Build an End-to-End ML Platform Locally: From Experiment Tracking to CI/CD

Sandeep Bharadwaj Mannapur — Tue, 17 Mar 2026 20:33:56 +0000

Machine learning projects don’t end at training a model in a Jupyter notebook. The hard part is the “last mile”: turning that notebook model into something you can run reliably, update safely, and trust over time.

Most ML systems fail in production for boring (and painful) reasons: the training code and the serving code drift apart, input data changes shape, a “small” preprocessing tweak breaks predictions, or the model silently degrades because real-world behavior shifts. None of these problems are solved by a better algorithm. They’re solved by engineering: repeatable pipelines, validation, versioning, monitoring, and automated checks.

In this hands-on handbook, you’ll build a complete mini ML platform on your local machine, an end-to-end project that takes a model from training to deployment with the core “last mile” infrastructure in place. We’ll use a fraud detection example (predicting fraudulent transactions), but the same workflow works for churn prediction or any binary classification problem. Everything runs locally (no cloud required), and every step is copy-paste runnable so you can follow along and verify outputs as you go.

By the end, you'll have a production-ready ML pipeline running on your machine, from training the model to serving predictions, with the infrastructure to test, monitor, and iterate with confidence. Let's dive in!

📦 Get the Complete Code
All code from this handbook is available in a ready-to-run repository:
Repository: https://github.com/sandeepmb/freecodecamp-local-ml-platform
Clone it and follow along, or use it as a reference implementation.

Project Overview and Setup
Build a Simple Model and API (The Naive Approach)
- Train a Quick Model
- Serve Predictions with FastAPI
Where the Naive Approach Breaks
Add Experiment Tracking and Model Registry with MLflow
Ensure Feature Consistency with Feast
Add Data Validation with Great Expectations
- Define Expectations
- Integrate Validation into FastAPI
Monitor Model Performance and Data Drift
Automate Testing and Deployment with CI/CD
Incident Response Playbook
How to Put It All Together
What’s Next: Scale to Production
Conclusion
References

Project Overview and Setup

Before we jump into coding, let's set the stage. Our use-case is credit card fraud detection – a binary classification problem where we predict whether a transaction is fraudulent (is_fraud = 1) or legitimate (is_fraud = 0). This is a common ML task and a good proxy for production ML challenges because fraud patterns can change over time (allowing us to discuss model drift), and bad input data (for example, malformed transaction info) can cause serious issues if not handled properly.

Tech Stack

We will use Python-based tools that are popular in MLOps but still beginner-friendly:

Tool	Purpose	Why We Chose It
MLflow	Experiment tracking and model registry	Open-source, widely adopted, great UI
Feast	Feature store for consistent feature serving	Production-grade, runs locally, same API for offline/online
FastAPI	High-performance web framework for serving predictions	Fast, automatic docs, modern Python
Great Expectations	Data validation framework	Declarative expectations, great reports
Evidently	Monitoring for data drift and model decay	Beautiful reports, easy to integrate
Docker	Containerization for environment consistency	Industry standard, works everywhere
GitHub Actions	CI/CD automation	Free for public repos, tight GitHub integration

Let me explain each tool briefly:

MLflow is an open-source platform designed to manage the ML lifecycle. It provides experiment tracking (logging parameters, metrics, and artifacts), a model registry (versioning models with aliases), and model serving capabilities. We'll use it to ensure our experiments are reproducible and our models are versioned.

Feast (Feature Store) is an open-source feature store that helps manage and serve features consistently between training and inference. This prevents a common problem called "training-serving skew" where the features used in production differ slightly from those used in training, causing silent accuracy degradation.

FastAPI is a modern, fast web framework for building APIs with Python. It's known for being easy to use, efficient, and producing automatic interactive documentation. We'll use it to serve our model predictions.

Great Expectations is an open-source tool for data quality testing. It allows us to define "expectations" on data (like "amount should be positive" or "hour should be between 0 and 23") and test incoming data against them.

Evidently is an open-source library for monitoring data and model performance over time. It can detect data drift (when input distributions change) and model decay (when accuracy drops).

Docker ensures the same environment and dependencies in development and deployment, avoiding the classic "works on my machine" problem.

GitHub Actions provides CI/CD automation. An efficient CI/CD pipeline helps integrate and deploy changes faster and with fewer errors.

💡 Mental Model: Think of this as building a "safety net" around your ML model. Each tool we add catches a different failure mode, like defensive driving for machine learning.
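To give a feel for what a data "expectation" is, here is a plain-Python sketch of the two example rules mentioned above for Great Expectations ("amount should be positive" and "hour should be between 0 and 23"). It deliberately does not use the Great Expectations API. The `validate_transaction` helper is a hypothetical stand-in that only illustrates the concept.

```python
def validate_transaction(tx: dict) -> list[str]:
    """Return a list of violated expectations; an empty list means the row passed."""
    problems = []

    amount = tx.get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        problems.append("amount should be positive")

    hour = tx.get("hour")
    if not isinstance(hour, int) or not 0 <= hour <= 23:
        problems.append("hour should be between 0 and 23")

    return problems


print(validate_transaction({"amount": 42.5, "hour": 14}))  # passes both checks
print(validate_transaction({"amount": -5, "hour": 27}))    # violates both checks
```

A validation framework adds value on top of checks like these by making them declarative, reusable across datasets, and reportable.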

Prerequisites

You'll need:

Python 3.9+ installed on your machine
Docker Desktop installed and running
GitHub account (if you want to try the CI/CD pipeline)
Basic familiarity with Python and ML concepts (what training and prediction mean)

You don't need MLOps or Kubernetes experience. Everything will be done locally with just Python and Docker – no cloud and no Kubernetes needed.

Project Structure

Let's set up a basic project structure on your local machine. Open your terminal and run:

# Create project directory and subfolders
mkdir ml-platform-tutorial && cd ml-platform-tutorial
mkdir -p data models src tests feature_repo

# Set up a virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

Your project structure should look like this:

ml-platform-tutorial/
├── data/              # Training and test datasets
├── models/            # Saved model files
├── src/               # Source code
├── tests/             # Test files
├── feature_repo/      # Feast feature repository
├── venv/              # Virtual environment
└── requirements.txt   # Dependencies

Next, create a requirements.txt with all the necessary libraries:

# requirements.txt

# Core ML libraries
pandas==2.2.0
numpy==1.26.3
scikit-learn==1.4.0

# Experiment tracking and model registry
mlflow==2.10.0

# Feature store
feast==0.36.0

# API framework
fastapi==0.109.0
uvicorn==0.27.0
httpx==0.26.0

# Data validation
great-expectations==0.18.8

# Monitoring
evidently==0.7.20

# Testing
pytest==8.0.0
pytest-cov==4.1.0

# Utilities
pyarrow==15.0.0
pydantic==2.6.0

📌 Version Note: Exact versions are pinned to ensure reproducibility. Newer versions may work, but all examples were tested with the versions listed here.

Install the dependencies:

pip install -r requirements.txt

This might take a few minutes as it installs all the packages. Once complete, we're ready to start building our project step by step.

Checkpoint: You should have a project folder with data/, models/, src/, tests/, and feature_repo/ directories, and an activated virtual environment with all dependencies installed. Verify by running python -c "import mlflow; import feast; import fastapi; print('All imports successful!')".

Figure 1: The Complete ML Platform We'll Build

Don't worry if this looks complex. We'll build each component step by step, starting with the simplest piece and connecting them together.

1. Build a Simple Model and API (The Naive Approach)

To illustrate why we need all these tools, let's start by building a naive ML system without any MLOps infrastructure. We'll train a simple model and deploy it quickly, then observe what problems arise. This "naive approach" is how most ML projects start – and understanding its limitations will motivate the solutions we implement later.

1.1 Train a Quick Model

First, we need some data. For simplicity, we'll generate a synthetic dataset for fraud detection so that we don't rely on any external data files. The dataset will have features like:

amount: Transaction amount in dollars
hour: Hour of the day (0-23) when the transaction occurred
day_of_week: Day of the week (0=Monday, 6=Sunday)
merchant_category: Type of merchant (grocery, restaurant, retail, online, travel)
is_fraud: Label indicating if the transaction is fraudulent (1) or legitimate (0)

We will simulate that only ~2% of transactions are fraud, which is an imbalance typical in real fraud data. This imbalance is important because it affects how we evaluate our model.

Create src/generate_data.py:

# src/generate_data.py
"""
Generate synthetic fraud detection dataset.

This script creates realistic-looking transaction data where fraudulent
transactions have different patterns than legitimate ones:
- Fraud tends to have higher amounts
- Fraud tends to occur late at night
- Fraud is more common for online and travel merchants
"""
import pandas as pd
import numpy as np

def generate_transactions(n_samples=10000, fraud_ratio=0.02, seed=42):
    """
    Generate synthetic fraud detection dataset.
    
    Args:
        n_samples: Total number of transactions to generate
        fraud_ratio: Proportion of fraudulent transactions (default 2%)
        seed: Random seed for reproducibility
    
    Returns:
        DataFrame with transaction features and fraud labels
    
    Fraud transactions have different patterns:
    - Higher amounts (median ~$245 vs ~$33 for legit)
    - Late night hours (0-5, 23)
    - More likely to be online or travel merchants
    """
    np.random.seed(seed)
    n_fraud = int(n_samples * fraud_ratio)
    n_legit = n_samples - n_fraud

    # Legitimate transactions: normal shopping patterns
    # - Amounts follow a log-normal distribution (most small, some large)
    # - Hours are uniformly distributed throughout the day
    # - Merchant categories weighted toward everyday shopping
    legit = pd.DataFrame({
        "amount": np.random.lognormal(mean=3.5, sigma=1.2, size=n_legit),  # median ~$33
        "hour": np.random.randint(0, 24, size=n_legit),
        "day_of_week": np.random.randint(0, 7, size=n_legit),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_legit,
            p=[0.30, 0.25, 0.25, 0.15, 0.05]  # Weighted toward everyday shopping
        ),
        "is_fraud": 0
    })
    
    # Fraudulent transactions: suspicious patterns
    # - Higher amounts (fraudsters go big)
    # - Late night hours (less scrutiny)
    # - More online and travel (easier to exploit)
    fraud = pd.DataFrame({
        "amount": np.random.lognormal(mean=5.5, sigma=1.5, size=n_fraud),  # median ~$245
        "hour": np.random.choice([0, 1, 2, 3, 4, 5, 23], size=n_fraud),  # Late night
        "day_of_week": np.random.randint(0, 7, size=n_fraud),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_fraud,
            p=[0.05, 0.05, 0.10, 0.60, 0.20]  # Weighted toward online/travel
        ),
        "is_fraud": 1
    })
    
    # Combine and shuffle
    df = pd.concat([legit, fraud], ignore_index=True)
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    
    return df

if __name__ == "__main__":
    # Generate dataset
    print("Generating synthetic fraud detection dataset...")
    df = generate_transactions(n_samples=10000, fraud_ratio=0.02)
    
    # Split into train (80%) and test (20%)
    train_df = df.sample(frac=0.8, random_state=42)
    test_df = df.drop(train_df.index)
    
    # Save to CSV files
    train_df.to_csv("data/train.csv", index=False)
    test_df.to_csv("data/test.csv", index=False)
    
    # Print summary statistics
    print(f"\nDataset generated successfully!")
    print(f"Training set: {len(train_df):,} transactions")
    print(f"Test set: {len(test_df):,} transactions")
    print(f"Overall fraud ratio: {df['is_fraud'].mean():.2%}")
    print(f"\nLegitimate transactions - Average amount: ${df[df['is_fraud']==0]['amount'].mean():.2f}")
    print(f"Fraudulent transactions - Average amount: ${df[df['is_fraud']==1]['amount'].mean():.2f}")
    print(f"\nMerchant category distribution (fraud):")
    print(df[df['is_fraud']==1]['merchant_category'].value_counts(normalize=True))

Run the data generation script:

python src/generate_data.py

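You should see output like the following. Your exact figures will differ. In particular, the printed averages are sample means of right-skewed log-normal data, so expect them to be noticeably higher than the medians of about $33 and $245: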

Generating synthetic fraud detection dataset...

Dataset generated successfully!
Training set: 8,000 transactions
Test set: 2,000 transactions
Overall fraud ratio: 2.00%

Legitimate transactions - Average amount: $33.45
Fraudulent transactions - Average amount: $245.67

Merchant category distribution (fraud):
online        0.60
travel        0.20
retail        0.10
restaurant    0.05
grocery       0.05

Now you have data/train.csv and data/test.csv with 8,000 training and 2,000 test transactions.

Why This Matters: The synthetic data has realistic patterns — fraud is rare (2%), high-value, late-night, and concentrated in certain merchant categories. These patterns give our model something to learn.
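The typical amounts follow from the log-normal parameters passed to np.random.lognormal in generate_transactions. As a quick sanity check, the sketch below applies the standard log-normal formulas (median = exp(mu), mean = exp(mu + sigma^2 / 2)) to those parameters. The helper names are hypothetical and exist only for this illustration.

```python
import math


def lognormal_median(mu: float) -> float:
    """Median of a log-normal distribution with log-scale mean mu."""
    return math.exp(mu)


def lognormal_mean(mu: float, sigma: float) -> float:
    """Mean of a log-normal distribution; the right tail pulls it above the median."""
    return math.exp(mu + sigma**2 / 2)


# Parameters taken from generate_transactions: (mean, sigma) per class.
for label, mu, sigma in [("legit", 3.5, 1.2), ("fraud", 5.5, 1.5)]:
    print(
        f"{label}: median ~ ${lognormal_median(mu):.0f}, "
        f"mean ~ ${lognormal_mean(mu, sigma):.0f}"
    )
```

The medians come out at roughly $33 and $245, while the theoretical means are considerably higher because a few very large transactions dominate the average. With only a couple of hundred fraud rows, the sample mean you observe can also vary quite a bit from run to run if you change the seed.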

Now, let's train a quick model. We'll use a simple Random Forest classifier from scikit-learn to predict is_fraud. In this naive version, we won't do much feature engineering – just label encode the categorical merchant_category and feed everything to the model.
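Label encoding simply maps each distinct category to an integer. The sketch below mimics the idea in plain Python, assigning indices in sorted order, which is also how scikit-learn's LabelEncoder orders its classes. The `fit_label_mapping` and `encode` helpers are hypothetical names for this illustration. Unlike LabelEncoder, which raises a ValueError for unseen labels, this toy version raises a KeyError.

```python
def fit_label_mapping(values):
    """Assign each distinct category an integer, in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def encode(values, mapping):
    """Encode values with a fitted mapping; unseen categories raise KeyError."""
    return [mapping[value] for value in values]


train_categories = ["grocery", "restaurant", "retail", "online", "travel", "online"]
mapping = fit_label_mapping(train_categories)

print(mapping)
print(encode(["travel", "grocery"], mapping))
```

The important design point is that the mapping is fitted once on training data and then reused unchanged at inference time. Refitting it on new data could silently assign different integers to the same category.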

Create src/train_naive.py:

# src/train_naive.py
"""
Train a fraud detection model - NAIVE VERSION.

This script demonstrates the "quick and dirty" approach to ML:
- No experiment tracking
- No model versioning
- Just train and save to a pickle file

We'll improve on this in later sections.
"""
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    f1_score, 
    precision_score, 
    recall_score,
    confusion_matrix,
    classification_report
)

def main():
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    print(f"Training samples: {len(train_df):,}")
    print(f"Test samples: {len(test_df):,}")
    print(f"Training fraud ratio: {train_df['is_fraud'].mean():.2%}")
    
    # Encode the categorical feature
    # We need to save the encoder to use the same mapping at inference time
    print("\nEncoding categorical features...")
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
    print(f"Merchant category mapping: {dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))}")
    
    # Prepare features and labels
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    # Train a Random Forest classifier
    print("\nTraining Random Forest model...")
    model = RandomForestClassifier(
        n_estimators=100,      # Number of trees
        max_depth=10,          # Maximum depth of each tree
        random_state=42,       # For reproducibility
        n_jobs=-1              # Use all CPU cores
    )
    model.fit(X_train, y_train)
    print("Training complete!")
    
    # Evaluate on test data
    print("\n" + "="*50)
    print("MODEL EVALUATION")
    print("="*50)
    
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
    print(f"F1-score:  {f1_score(y_test, y_pred):.4f}")
    
    print("\nConfusion Matrix:")
    cm = confusion_matrix(y_test, y_pred)
    print(f"  True Negatives:  {cm[0][0]:,} (correctly identified legitimate)")
    print(f"  False Positives: {cm[0][1]:,} (legitimate flagged as fraud)")
    print(f"  False Negatives: {cm[1][0]:,} (fraud missed - DANGEROUS!)")
    print(f"  True Positives:  {cm[1][1]:,} (correctly caught fraud)")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))
    
    # Feature importance
    print("\nFeature Importance:")
    for name, importance in sorted(
        zip(feature_cols, model.feature_importances_),
        key=lambda x: x[1],
        reverse=True
    ):
        print(f"  {name}: {importance:.4f}")
    
    # Save the model and encoder together
    print("\nSaving model to models/model.pkl...")
    with open("models/model.pkl", "wb") as f:
        pickle.dump((model, encoder), f)
    
    print("\nModel trained and saved successfully!")
    print("\nWARNING: This naive approach has several problems:")
    print("  - No record of hyperparameters or metrics")
    print("  - No model versioning")
    print("  - No way to reproduce this exact model")
    print("  - We'll fix these issues in the following sections!")

if __name__ == "__main__":
    main()

Run the training script:

python src/train_naive.py

You should see output similar to:

Loading data...
Training samples: 8,000
Test samples: 2,000
Training fraud ratio: 2.00%

Encoding categorical features...
Merchant category mapping: {'grocery': 0, 'online': 1, 'restaurant': 2, 'retail': 3, 'travel': 4}

Training Random Forest model...
Training complete!

==================================================
MODEL EVALUATION
==================================================

Accuracy:  0.9820
Precision: 0.7273
Recall:    0.6154
F1-score:  0.6667

Confusion Matrix:
  True Negatives:  1,956 (correctly identified legitimate)
  False Positives: 4 (legitimate flagged as fraud)
  False Negatives: 32 (fraud missed - DANGEROUS!)
  True Positives:  8 (correctly caught fraud)

Feature Importance:
  amount: 0.5423
  hour: 0.2156
  merchant_encoded: 0.1345
  day_of_week: 0.1076

Important observation: You'll see ~98% accuracy but a lower F1-score (around 0.5-0.7). With only 2% fraud, accuracy is extremely misleading! A model that always predicts "not fraud" would achieve 98% accuracy while catching zero fraud. This is why we focus on F1-score, precision, and recall for imbalanced classification problems.

💡 If you're new to imbalanced classification, remember: high accuracy can be meaningless when the positive class is rare.
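To make the accuracy trap concrete, here is a minimal sketch in plain Python. It assumes a hypothetical test split shaped like ours (2,000 rows, 2% fraud) and a "model" that never flags anything; the labels are synthetic, not read from data/test.csv.

```python
# Synthetic labels mirroring the tutorial's test split: 2,000 rows, 2% fraud.
y_true = [1] * 40 + [0] * 1960

# A "model" that always predicts "not fraud".
y_pred = [0] * len(y_true)


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = safe_div(tp + tn, len(y_true))
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)

print(f"accuracy={accuracy:.2%} recall={recall:.2%} f1={f1:.2f}")
# prints: accuracy=98.00% recall=0.00% f1=0.00
```

The do-nothing model matches our trained model's accuracy while missing all 40 fraud cases, which is why recall and F1 are the numbers to watch.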

The script outputs a file models/model.pkl containing both the trained model and the label encoder (we need both for inference).

Checkpoint: You should now have:

data/train.csv (~8,000 rows)
data/test.csv (~2,000 rows)
models/model.pkl (trained model + encoder)

The model should show ~98% accuracy but F1 around 0.5-0.7. Verify the files exist: ls -la data/ models/

1.2 Serve Predictions with FastAPI

Now that we have a model, let's deploy it as an API so that clients can get predictions. We'll use FastAPI because it's straightforward, very fast, and produces automatic interactive documentation.

FastAPI is known for:

Easy to use: Pythonic syntax with type hints
High performance: One of the fastest Python frameworks
Automatic documentation: Swagger UI out of the box
Data validation: Using Pydantic models

Create src/serve_naive.py:

# src/serve_naive.py
"""
Serve fraud detection model as a REST API - NAIVE VERSION.

This is a simple API that:
1. Loads the trained model at startup
2. Accepts transaction data via POST request
3. Returns fraud prediction

We'll improve this with validation, monitoring, and better
model loading in later sections.
"""
import pickle
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Load the trained model and encoder at startup
# This is loaded once when the server starts, not on every request
print("Loading model...")
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)
print("Model loaded successfully!")

# Create the FastAPI application
app = FastAPI(
    title="Fraud Detection API",
    description="""
    Predict whether a credit card transaction is fraudulent.
    
    This API accepts transaction details and returns:
    - Whether the transaction is predicted to be fraud
    - The probability of fraud (0.0 to 1.0)
    
    **Note:** This is the naive version without validation or monitoring.
    """,
    version="1.0.0"
)

# Define the input schema using Pydantic
# This provides automatic validation and documentation
class Transaction(BaseModel):
    """Schema for a transaction to be evaluated for fraud."""
    amount: float = Field(
        ..., 
        description="Transaction amount in dollars",
        example=150.00
    )
    hour: int = Field(
        ..., 
        description="Hour of the day (0-23)",
        example=14
    )
    day_of_week: int = Field(
        ..., 
        description="Day of week (0=Monday, 6=Sunday)",
        example=3
    )
    merchant_category: str = Field(
        ..., 
        description="Type of merchant",
        example="online"
    )

class PredictionResponse(BaseModel):
    """Schema for the prediction response."""
    is_fraud: bool = Field(description="Whether the transaction is predicted as fraud")
    fraud_probability: float = Field(description="Probability of fraud (0.0 to 1.0)")
    
@app.post("/predict", response_model=PredictionResponse)
def predict(transaction: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Takes transaction details and returns a fraud prediction
    along with the probability score.
    """
    # Convert the request to a dictionary
    # (on Pydantic v2, .dict() is deprecated in favor of transaction.model_dump())
    data = transaction.dict()
    
    # Encode the merchant category using the same encoder from training
    # This ensures consistency between training and serving
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        # Handle unknown merchant categories
        # In production, we'd want better handling here
        data["merchant_encoded"] = 0
    
    # Prepare features in the same order as training
    X = [[
        data["amount"],
        data["hour"],
        data["day_of_week"],
        data["merchant_encoded"]
    ]]
    
    # Get prediction and probability
    prediction = model.predict(X)[0]
    probability = model.predict_proba(X)[0][1]  # Probability of class 1 (fraud)
    
    return PredictionResponse(
        is_fraud=bool(prediction),
        fraud_probability=round(float(probability), 4)
    )

@app.get("/health")
def health_check():
    """
    Health check endpoint.
    
    Returns the status of the API. Useful for:
    - Load balancer health checks
    - Kubernetes liveness probes
    - Monitoring systems
    """
    return {
        "status": "healthy",
        "model_loaded": model is not None
    }

@app.get("/")
def root():
    """Root endpoint with API information."""
    return {
        "message": "Fraud Detection API",
        "version": "1.0.0",
        "docs": "/docs",
        "health": "/health"
    }

A few important things to note about this code:

Pydantic Models: We use BaseModel to define the expected input JSON schema. FastAPI validates incoming requests against this schema and returns a 422 error when a field is missing or has the wrong type.
Type Hints: The type hints (float, int, str) provide both documentation and runtime type validation. They don't check ranges, so an hour of 25 still passes.
Feature Encoding: On each request, we encode the merchant category using the same LabelEncoder we saved from training. This ensures consistency between training and serving.
Health Endpoint: The /health endpoint is standard practice for production APIs - it allows load balancers and monitoring systems to check if the service is running.
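The fallback in the except branch deserves a closer look. The sketch below imitates the encoder with a plain dictionary (LabelEncoder assigns codes in sorted label order, which matches the mapping printed during training). The function names are hypothetical and only illustrate the two behaviors; they are not part of the tutorial's code.

```python
# Categories from the tutorial's printed mapping, in arbitrary order.
TRAINING_CATEGORIES = ["online", "grocery", "travel", "retail", "restaurant"]

# LabelEncoder assigns integer codes in sorted order of the distinct labels.
CATEGORY_TO_CODE = {
    name: code for code, name in enumerate(sorted(set(TRAINING_CATEGORIES)))
}


def encode_with_silent_fallback(category):
    """What the naive API does: unknown categories quietly become code 0."""
    return CATEGORY_TO_CODE.get(category, 0)


def encode_strict(category):
    """Alternative: refuse categories the model never saw during training."""
    if category not in CATEGORY_TO_CODE:
        raise ValueError(f"unknown merchant_category: {category!r}")
    return CATEGORY_TO_CODE[category]


print(CATEGORY_TO_CODE)
# prints: {'grocery': 0, 'online': 1, 'restaurant': 2, 'retail': 3, 'travel': 4}

# The fallback makes an unknown merchant indistinguishable from "grocery".
print(encode_with_silent_fallback("totally_fake") == CATEGORY_TO_CODE["grocery"])
# prints: True
```

With the silent fallback, a misspelled or brand-new category is scored as if it were a grocery purchase, and nothing in the response reveals that.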

To run this API, use Uvicorn (an ASGI server):

uvicorn src.serve_naive:app --reload --host 0.0.0.0 --port 8000

The --reload flag enables auto-reload during development (the server restarts when you change code).

You should see:

Loading model...
Model loaded successfully!
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process

Now open your browser and go to http://localhost:8000/docs. You'll see the Swagger UI – an auto-generated interactive documentation where you can test the API directly from your browser!

Test the API using curl in another terminal:

# Test with a legitimate-looking transaction
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}'

Expected response (the exact probability depends on your trained model):

{"is_fraud": false, "fraud_probability": 0.02}

# Test with a suspicious transaction (high amount, late night, online)
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 500.0, "hour": 3, "day_of_week": 1, "merchant_category": "online"}'

Expected response:

{"is_fraud": true, "fraud_probability": 0.78}

We have a working model served as an API! In a real scenario, we could now integrate this API with a payment processing frontend, mobile app, or any system that needs fraud predictions.

But before we celebrate, let's examine this naive approach for potential pitfalls...

Checkpoint: Your API should be running at http://localhost:8000. The Swagger UI at /docs should list all three endpoints (/predict, /health, and /). Test with curl or the Swagger UI to verify predictions are returned.

2. Where the Naive Approach Breaks

Our quick-and-dirty ML pipeline works on the surface: it can train a model and serve predictions. However, hidden problems will emerge if we try to maintain or scale this system in production.

This section is critical: understanding these issues will motivate the solutions we implement in the following sections. Let's go through the problems one by one.

Problem 1: No Experiment Tracking (Reproducibility)

Try this thought experiment: Run train_naive.py again with different hyperparameters (change n_estimators to 200, or max_depth to 15). Would you be able to exactly reproduce the previous model's results if someone asked?

Probably not. Currently, we have no record of:

Which hyperparameters we used
What metrics we achieved
What version of the data we trained on
What library versions were installed
When the training happened
Who ran the training

Three months from now, if your manager asks "How was this model trained? Can you reproduce the results?" – you'd be in trouble. You might have the code, but you don't know which version of the code, which parameters, or which data produced the model that's currently in production.

Experiment tracking is the practice of logging all these details (code versions, parameters, metrics, data versions, artifacts) so experiments can be compared and replicated. Our naive approach lacks this entirely, making our results hard to trust or build upon.
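To show what such a record contains, here is a hand-rolled sketch that appends one JSON line per training run. It is only an illustration of the idea, not a replacement for a tracking tool; the function names, the runs.jsonl file, and the metric values are all hypothetical.

```python
import hashlib
import json
import platform
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Fingerprint a data file so a run can be tied to the exact bytes it saw."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_run(log_path, params, metrics, data_path):
    """Append one JSON line describing a training run and return the entry."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "params": params,
        "metrics": metrics,
        "data_sha256": file_sha256(data_path),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def load_runs(log_path):
    """Read every recorded run back as a list of dicts."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo in a throwaway directory with a tiny stand-in for data/train.csv.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "train.csv"
data_file.write_text(
    "amount,hour,day_of_week,merchant_category,is_fraud\n50.0,14,3,grocery,0\n"
)
log_file = workdir / "runs.jsonl"

# The metric values below are placeholders, not real results.
record_run(log_file, {"n_estimators": 100, "max_depth": 10}, {"test_f1": 0.61}, data_file)
record_run(log_file, {"n_estimators": 200, "max_depth": 15}, {"test_f1": 0.64}, data_file)

for run in load_runs(log_file):
    print(run["params"], run["metrics"], run["data_sha256"][:12])
```

Even this small log answers "which parameters, which data, when?" for every run. What it lacks (a UI, artifact storage, comparison views) is what a dedicated tracking tool adds.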

Problem 2: Model Versioning and Deployment Chaos

We trained one model and saved it as model.pkl. Now consider this scenario:

You train a new model with different hyperparameters
You overwrite model.pkl with the new model
You deploy it to production
Users start complaining about more false positives
You want to roll back to the previous model
Problem: The previous model was overwritten and is gone forever

There's no systematic versioning. Questions you cannot answer:

Which model version is currently in production?
What were the metrics for model v1 vs v2?
When was each model trained and by whom?
Can we instantly roll back if the new model performs worse?
What changed between versions?

Without version control for models, you're flying blind. Imagine deploying code without Git – that's what we're doing with our model.
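As a stopgap illustration, the overwrite problem disappears as soon as each save gets its own numbered file. This is a hypothetical sketch (the model_v<N>.pkl naming scheme and both helpers are assumptions, not the tutorial's code), and it still records nothing about metrics or who trained what.

```python
import pickle
import tempfile
from pathlib import Path


def save_versioned(obj, models_dir):
    """Pickle obj as model_v<N>.pkl, where N is one above the highest existing version."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    existing = [int(p.stem.split("_v")[1]) for p in models_dir.glob("model_v*.pkl")]
    version = max(existing, default=0) + 1
    path = models_dir / f"model_v{version}.pkl"
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return version, path


def load_version(models_dir, version):
    """Load a specific saved version, for example to roll back."""
    with open(Path(models_dir) / f"model_v{version}.pkl", "rb") as f:
        return pickle.load(f)


# Dictionaries stand in for (model, encoder) tuples; any picklable object works.
models_dir = Path(tempfile.mkdtemp()) / "models"
v1, _ = save_versioned({"n_estimators": 100}, models_dir)
v2, _ = save_versioned({"n_estimators": 200}, models_dir)

print(v1, v2)
# prints: 1 2
print(load_version(models_dir, v1))
# prints: {'n_estimators': 100}
```

The first model survives the second save, so a rollback is possible. A proper registry builds on the same idea and adds metadata, aliases, and an audit trail.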

Problem 3: No Data Validation – Garbage In, Garbage Out

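Right now, our API checks only that each field has the right type. Any float, int, or string passes, however implausible the value, and the model still returns a prediction. Let's see what happens with bad data.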

Create a test script src/test_bad_data.py:

# src/test_bad_data.py
"""Test what happens when we send garbage data to the API."""
import requests

BASE_URL = "http://localhost:8000"

print("Testing API with various bad inputs...\n")

# Test 1: Negative amount
print("Test 1: Negative amount")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -500.0,        # Negative amount - impossible!
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 2: Invalid hour
print("Test 2: Hour = 25 (should be 0-23)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 25,              # Invalid hour!
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 3: Invalid day of week
print("Test 3: day_of_week = 10 (should be 0-6)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 10,       # Invalid day!
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 4: Unknown merchant category
print("Test 4: Unknown merchant category")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "unknown_category"  # Not in training data!
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 5: All bad at once
print("Test 5: Everything wrong")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -1000.0,
    "hour": 99,
    "day_of_week": 15,
    "merchant_category": "totally_fake"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

print("Observation: The API happily accepts ALL garbage and returns predictions!")
print("This is dangerous - bad data leads to bad predictions with no warning.")

Run it (make sure your API is still running):

python src/test_bad_data.py

You'll see something like:

Testing API with various bad inputs...

Test 1: Negative amount
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.15}

Test 2: Hour = 25 (should be 0-23)
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.08}

...

Observation: The API happily accepts ALL garbage and returns predictions!

The API accepts garbage and returns predictions with no warning! In production, this could mean:

Incorrect predictions based on impossible data
Fraud going undetected because of malformed input
Legitimate transactions blocked based on corrupted data
No way to debug why predictions are wrong

As the saying goes: "Garbage in, garbage out." But even worse – we don't even know garbage went in!
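For a sense of what "knowing garbage went in" could look like, here is a minimal range-check sketch in plain Python. The rules are taken from the field descriptions in our schema; the function name and the idea of returning a list of error strings are assumptions made for illustration.

```python
# Categories seen during training, per the encoder mapping printed earlier.
KNOWN_CATEGORIES = {"grocery", "online", "restaurant", "retail", "travel"}


def validate_transaction(tx):
    """Return a list of human-readable problems; an empty list means the input looks sane.

    Assumes types were already checked (amount is a number, hour is an int, and so on).
    """
    errors = []
    if tx["amount"] < 0:
        errors.append(f"amount must not be negative, got {tx['amount']}")
    if not 0 <= tx["hour"] <= 23:
        errors.append(f"hour must be in 0-23, got {tx['hour']}")
    if not 0 <= tx["day_of_week"] <= 6:
        errors.append(f"day_of_week must be in 0-6, got {tx['day_of_week']}")
    if tx["merchant_category"] not in KNOWN_CATEGORIES:
        errors.append(f"unknown merchant_category: {tx['merchant_category']!r}")
    return errors


good = {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
bad = {"amount": -1000.0, "hour": 99, "day_of_week": 15, "merchant_category": "totally_fake"}

print(validate_transaction(good))
# prints: []
for problem in validate_transaction(bad):
    print(problem)
```

An API that ran a check like this before calling the model could reject the "everything wrong" request with a clear message instead of returning a confident-looking probability.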

Problem 4: Model Drift – Performance Decay Over Time

Here's a scenario that plays out in many production ML systems:

January: You train your model on historical fraud data. It achieves 98% accuracy and 0.67 F1-score. Everyone's happy.
February: The model is deployed and working well. Fraud is being caught.
March: Fraudsters adapt. They start using different patterns – smaller amounts, different merchant categories, different times of day.
April: Your model's accuracy has dropped from 98% to 85%. F1-score dropped from 0.67 to 0.35. Fraud is slipping through.
May: A major fraud incident occurs. Investigation reveals the model has been underperforming for 2 months.

The problem: Nobody noticed for 2 months because there was no monitoring.

Two related phenomena are at work here: data drift (when input data distributions change) and concept drift (when the relationship between inputs and outputs changes). Both should be expected in real-world systems.

Without monitoring:

You don't know when performance degrades
You don't know why performance degrades
You can't take corrective action until users complain
By then, significant damage may have occurred
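One simple way to put a number on data drift is the Population Stability Index (PSI), which compares how a feature's values fall into bins in a reference window versus a current window. The sketch below is a plain-Python illustration with made-up transaction amounts and hand-picked bin edges; it is an assumption about how one might check drift, not the monitoring approach used later in this tutorial.

```python
import math


def psi(reference, current, edges):
    """Population Stability Index of `current` against `reference`.

    `edges` are interior bin boundaries: values below edges[0] land in the
    first bin and values at or above edges[-1] land in the last bin.
    Both samples must be non-empty.
    """
    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(1 for e in edges if v >= e)  # number of edges at or below v
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Made-up amounts: the "current" window has shifted toward small transactions,
# like the fraudsters in the March scenario above.
reference_amounts = [20, 35, 50, 80, 120, 200, 300, 450]
current_amounts = [5, 10, 15, 20, 25, 30, 60, 90]
edges = [50, 150, 300]

print(f"PSI vs itself:  {psi(reference_amounts, reference_amounts, edges):.3f}")
print(f"PSI vs current: {psi(reference_amounts, current_amounts, edges):.3f}")
```

Identical distributions give a PSI of 0, and the value grows as the distributions diverge. A commonly quoted rule of thumb treats values above roughly 0.25 as a large shift, but the right threshold depends on the feature and the cost of a false alarm.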

Problem 5: No CI/CD or Deployment Safety

Our "deployment process" was literally:

SSH into the server (or run locally)
Run python src/train_naive.py
Copy model.pkl to the right place
Restart the API
Hope for the best

There's:

No automated testing: A typo could break everything
No staging environment: We test directly in production
No gradual rollout: 100% of traffic hits the new model immediately
No rollback capability: If something breaks, we have to manually fix it
No audit trail: Who deployed what and when?

This is how production incidents happen. A rushed deployment at 5 PM on Friday breaks the fraud detection system, and nobody notices until Monday when fraud losses have spiked.
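Even before a full CI/CD pipeline exists, an automated gate can stop the worst deployments. The sketch below is a hypothetical check that compares a candidate model's metrics with the current model's. The metric names, the 0.5 recall floor, and the sample numbers are assumptions chosen for illustration.

```python
def promotion_blockers(candidate, current, min_recall=0.5):
    """Return reasons NOT to deploy the candidate; an empty list means it may proceed.

    `candidate` and `current` are dicts with "test_f1" and "test_recall" keys.
    """
    reasons = []
    if candidate["test_f1"] < current["test_f1"]:
        reasons.append(
            f"test_f1 dropped from {current['test_f1']:.2f} to {candidate['test_f1']:.2f}"
        )
    if candidate["test_recall"] < min_recall:
        reasons.append(
            f"test_recall {candidate['test_recall']:.2f} is below the floor of {min_recall:.2f}"
        )
    return reasons


# Placeholder metrics, not real results.
current_model = {"test_f1": 0.67, "test_recall": 0.62}
better_model = {"test_f1": 0.71, "test_recall": 0.66}
worse_model = {"test_f1": 0.40, "test_recall": 0.30}

print(promotion_blockers(better_model, current_model))
# prints: []
for reason in promotion_blockers(worse_model, current_model):
    print("BLOCKED:", reason)
```

A deployment script that refuses to continue when this list is non-empty turns "hope for the best" into a repeatable check that can later run inside a CI job.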

Figure 2: Problems with the Naive Approach

Summary: What We Need to Fix

Our simple ML service is missing critical infrastructure. Here's the mapping of problems to solutions:

Problem	Impact	Solution	Section
No experiment tracking	Can't reproduce or compare models	MLflow Tracking	3
No model versioning	Can't roll back or audit	MLflow Registry	3
No feature consistency	Training-serving skew	Feast Feature Store	4
No data validation	Garbage predictions	Great Expectations	5
No monitoring	Drift goes unnoticed	Evidently	6
No CI/CD	Risky deployments	GitHub Actions + Docker	7

The good news: We can fix each of these by incrementally adding components to our pipeline. Each tool addresses a specific problem, and together they form a robust ML platform.

Let's start fixing these issues, one by one.

3. Add Experiment Tracking and Model Registry with MLflow

What breaks without this: You can't reproduce yesterday's results, can't compare experiments, and can't roll back when a new model fails in production.

Our first fix addresses Problems 1 and 2: experiment reproducibility and model versioning.

MLflow is an open-source platform designed to manage the ML lifecycle. We'll use two of its key components:

MLflow Tracking: Log experiments (parameters, metrics, artifacts) so you can compare runs and reproduce results
MLflow Model Registry: Version your models with aliases (champion, challenger) and manage the deployment lifecycle

Why This Matters: Without tracking, ML is guesswork. With MLflow, every run is logged with parameters, metrics, and artifacts. You can compare runs side-by-side, understand what actually improved your model, and reproduce any past experiment. The Model Registry adds governance – you know exactly which model is in production and can roll back in seconds.

3.1 How to Set Up the MLflow Tracking Server

MLflow can log experiments to a local directory by default, but to use the full UI and model registry, it's best to run the MLflow tracking server.

Open a new terminal (keep it separate from your API terminal) and run:

# Create a directory for MLflow data
mkdir -p mlruns

# Start the MLflow server
mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlruns

Let's break down these parameters:

--host 0.0.0.0: Listen on all network interfaces
--port 5000: Run on port 5000
--backend-store-uri sqlite:///mlflow.db: Store experiment metadata in a SQLite database (for production, you'd use PostgreSQL or MySQL)
--default-artifact-root ./mlruns: Store model artifacts (files) in the mlruns directory

You should see:

[INFO] Starting gunicorn 21.2.0
[INFO] Listening at: http://0.0.0.0:5000

Now open your browser and navigate to http://localhost:5000. You'll see the MLflow UI – it should be empty initially since we haven't logged any experiments yet.

3.2 How to Log Experiments in Code

Now let's modify our training script to log everything to MLflow. Create src/train_mlflow.py:

# src/train_mlflow.py
"""
Train fraud detection model with MLflow experiment tracking.

This script demonstrates proper ML experiment tracking:
- Log all hyperparameters
- Log all metrics (train and test)
- Log the trained model as an artifact
- Register the model in the Model Registry

Compare this to train_naive.py to see the difference!
"""
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    roc_auc_score
)
import pickle
from datetime import datetime

# Configure MLflow to use our tracking server
mlflow.set_tracking_uri("http://localhost:5000")

# Create or get the experiment
# All runs will be grouped under this experiment name
mlflow.set_experiment("fraud-detection")

def load_and_preprocess_data():
    """Load and preprocess the training and test data."""
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    # Encode categorical feature
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
    # Prepare features
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    return X_train, y_train, X_test, y_test, encoder

def train_and_log_model(
    n_estimators: int = 100,
    max_depth: int = 10,
    min_samples_split: int = 2,
    min_samples_leaf: int = 1
):
    """
    Train a model and log everything to MLflow.
    
    Args:
        n_estimators: Number of trees in the forest
        max_depth: Maximum depth of each tree
        min_samples_split: Minimum samples required to split a node
        min_samples_leaf: Minimum samples required at a leaf node
    """
    X_train, y_train, X_test, y_test, encoder = load_and_preprocess_data()
    
    # Start an MLflow run - everything logged will be associated with this run
    with mlflow.start_run():
        # Add a descriptive run name
        run_name = f"rf_est{n_estimators}_depth{max_depth}_{datetime.now().strftime('%H%M%S')}"
        mlflow.set_tag("mlflow.runName", run_name)
        
        # Log all hyperparameters
        # These are the "knobs" we can tune
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("min_samples_split", min_samples_split)
        mlflow.log_param("min_samples_leaf", min_samples_leaf)
        mlflow.log_param("model_type", "RandomForestClassifier")
        
        # Log data information
        mlflow.log_param("train_samples", len(X_train))
        mlflow.log_param("test_samples", len(X_test))
        mlflow.log_param("fraud_ratio", float(y_train.mean()))
        mlflow.log_param("n_features", X_train.shape[1])
        
        # Train the model
        print(f"\nTraining model: n_estimators={n_estimators}, max_depth={max_depth}")
        model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            random_state=42,
            n_jobs=-1
        )
        model.fit(X_train, y_train)
        
        # Evaluate and log metrics for BOTH train and test sets
        # This helps detect overfitting
        for dataset_name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
            y_pred = model.predict(X)
            y_prob = model.predict_proba(X)[:, 1]
            
            # Calculate all metrics
            accuracy = accuracy_score(y, y_pred)
            precision = precision_score(y, y_pred, zero_division=0)
            recall = recall_score(y, y_pred, zero_division=0)
            f1 = f1_score(y, y_pred, zero_division=0)
            roc_auc = roc_auc_score(y, y_prob)
            
            # Log metrics with dataset prefix
            mlflow.log_metric(f"{dataset_name}_accuracy", accuracy)
            mlflow.log_metric(f"{dataset_name}_precision", precision)
            mlflow.log_metric(f"{dataset_name}_recall", recall)
            mlflow.log_metric(f"{dataset_name}_f1", f1)
            mlflow.log_metric(f"{dataset_name}_roc_auc", roc_auc)
            
            print(f"  {dataset_name.upper()} - Accuracy: {accuracy:.4f}, F1: {f1:.4f}, ROC-AUC: {roc_auc:.4f}")
        
        # Log feature importance
        for feature, importance in zip(
            X_train.columns,
            model.feature_importances_
        ):
            mlflow.log_metric(f"importance_{feature}", importance)
        
        # Log the model to MLflow AND register it in the Model Registry
        # This creates a new version of the model automatically
        print("\nRegistering model in MLflow Model Registry...")
        mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="model",
            registered_model_name="fraud-detection-model",
            input_example=X_train.iloc[:5]  # Example input for documentation
        )
        
        # Save and log the encoder as a separate artifact
        # We need this for inference
        with open("encoder.pkl", "wb") as f:
            pickle.dump(encoder, f)
        mlflow.log_artifact("encoder.pkl")
        
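        # Get the run and experiment IDs for reference
        # (the experiment ID is not always 1, so read it from the run)
        run = mlflow.active_run()
        run_id = run.info.run_id
        print(f"\nMLflow Run ID: {run_id}")
        print(f"View this run: http://localhost:5000/#/experiments/{run.info.experiment_id}/runs/{run_id}")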
        
        return model, encoder

def run_experiment_sweep():
    """
    Run multiple experiments with different hyperparameters.
    
    This demonstrates how MLflow helps compare different configurations.
    """
    print("="*60)
    print("RUNNING HYPERPARAMETER EXPERIMENT SWEEP")
    print("="*60)
    
    # Define different configurations to try
    experiments = [
        {"n_estimators": 50, "max_depth": 5},
        {"n_estimators": 100, "max_depth": 10},
        {"n_estimators": 100, "max_depth": 15},
        {"n_estimators": 200, "max_depth": 10},
        {"n_estimators": 200, "max_depth": 20},
    ]
    
    for i, params in enumerate(experiments, 1):
        print(f"\n--- Experiment {i}/{len(experiments)} ---")
        train_and_log_model(**params)
    
    print("\n" + "="*60)
    print("EXPERIMENT SWEEP COMPLETE!")
    print("="*60)
    print("\nView all experiments at: http://localhost:5000")
    print("Compare runs to find the best hyperparameters!")

if __name__ == "__main__":
    run_experiment_sweep()

This script:

Connects to MLflow: mlflow.set_tracking_uri("http://localhost:5000")
Creates an experiment: mlflow.set_experiment("fraud-detection")
Logs parameters: All hyperparameters and data info
Logs metrics: Accuracy, precision, recall, F1, ROC-AUC for both train and test sets
Logs the model: Saves the trained model as an artifact
Registers the model: Adds it to the Model Registry with automatic versioning

Run the experiment sweep:

python src/train_mlflow.py

You'll see output for each experiment:

============================================================
RUNNING HYPERPARAMETER EXPERIMENT SWEEP
============================================================

--- Experiment 1/5 ---
Loading data...
Training model: n_estimators=50, max_depth=5
  TRAIN - Accuracy: 0.9821, F1: 0.6545, ROC-AUC: 0.9234
  TEST - Accuracy: 0.9795, F1: 0.5714, ROC-AUC: 0.8956

Registering model in MLflow Model Registry...
MLflow Run ID: abc123...

--- Experiment 5/5 ---
Training model: n_estimators=200, max_depth=20
  TRAIN - Accuracy: 0.9856, F1: 0.7123, ROC-AUC: 0.9567
  TEST - Accuracy: 0.9810, F1: 0.6667, ROC-AUC: 0.9234

============================================================
EXPERIMENT SWEEP COMPLETE!
============================================================

All 5 runs are now logged to MLflow with full metrics comparison available in the UI.

Now refresh the MLflow UI at http://localhost:5000. You'll see:

Experiments tab: Shows the "fraud-detection" experiment with 5 runs
Each run: Shows parameters, metrics, and artifacts
Compare: You can select multiple runs and compare them side-by-side
Models tab: Shows "fraud-detection-model" with 5 versions

MLflow Tracking UI: Compare runs, metrics, and models at a glance

3.3 How to Use the Model Registry

The Model Registry provides a central hub for managing model versions and their lifecycle.

In the MLflow UI:

Click the "Models" tab in the top navigation
Click "fraud-detection-model"
You'll see all 5 versions listed with their metrics

Model Aliases: Recent MLflow versions use aliases instead of stages, which are now deprecated. If you've seen older tutorials using "Staging" and "Production" stages, aliases are the newer, more flexible approach.

@champion: The production model serving live traffic
@challenger: Candidate model being tested
You can also create custom aliases such as @baseline.

Assign an alias:

Open MLflow UI → Models → fraud-detection-model
Click on the version you want to promote
Click "Add Alias"
Enter champion and save

Now you've assigned the @champion alias to your best model. The API we build next loads whichever version has this alias at startup, so rolling back means moving the alias to a different version and restarting the API.
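The same assignment can be scripted instead of clicked. The sketch below wraps two MlflowClient methods, get_model_version_by_alias and set_registered_model_alias, in a hypothetical helper that also reports which version held the alias before, which is handy for rollbacks. The helper name and its return value are assumptions. The client is passed in, so the function itself needs no running server.

```python
def promote_to_champion(client, model_name, version, alias="champion"):
    """Point `alias` at `version`; return the version that held it before, or None.

    `client` is expected to be an mlflow.tracking.MlflowClient (or any object
    exposing the same two methods).
    """
    try:
        previous = client.get_model_version_by_alias(model_name, alias).version
    except Exception:
        # Broad catch for brevity: MLflow raises when the alias is not set yet.
        # A stricter version would catch only MLflow's "not found" error.
        previous = None
    client.set_registered_model_alias(model_name, alias, str(version))
    return previous


# Usage against this tutorial's tracking server (needs mlflow installed and
# the server from section 3.1 running):
#
#   from mlflow.tracking import MlflowClient
#   client = MlflowClient(tracking_uri="http://localhost:5000")
#   old = promote_to_champion(client, "fraud-detection-model", 3)
#   print(f"champion moved from version {old} to version 3")
```

Keeping the previous version number around means a rollback is one more call to the same function with that number.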

Figure 3: MLflow Model Lifecycle — From Training to Production

3.4 Update API to Load from Registry

Now let's update our API to load the champion model from the MLflow Registry instead of a pickle file. Create src/serve_mlflow.py:

# src/serve_mlflow.py
"""
Serve fraud detection model from MLflow Model Registry.

This version loads the @champion model from MLflow, which means:
- Serves whichever version holds the @champion alias when the API starts
- Can roll back by changing the @champion alias
- No manual file copying needed
"""
import mlflow
import mlflow.sklearn
import pickle
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Configure MLflow
mlflow.set_tracking_uri("http://localhost:5000")

print("Loading model from MLflow Model Registry...")

# Load the champion model from the registry
# This automatically gets whichever version has the @champion alias
try:
    model = mlflow.sklearn.load_model("models:/fraud-detection-model@champion")
    print("Successfully loaded champion model from MLflow!")
except Exception as e:
    print(f"Error loading from MLflow: {e}")
    print("Make sure you've assigned the @champion alias to a model in the MLflow UI")
    raise

# Load the encoder from the local encoder.pkl written by train_mlflow.py
# (the same file is logged as a run artifact; a real system would download it from MLflow)
with open("encoder.pkl", "rb") as f:
    encoder = pickle.load(f)
print("Encoder loaded successfully!")

app = FastAPI(
    title="Fraud Detection API (MLflow)",
    description="""
    Fraud detection API that loads models from MLflow Model Registry.
    
    This version always serves the model with the @champion alias.
    To update the model:
    1. Train a new model with train_mlflow.py
    2. Compare metrics in MLflow UI
    3. Assign the @champion alias to the best model
    4. Restart this API
    
    To roll back: Move the @champion alias to a previous version in MLflow UI.
    """,
    version="2.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount in dollars", example=150.00)
    hour: int = Field(..., description="Hour of the day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Monday, 6=Sunday)", example=3)
    merchant_category: str = Field(..., description="Type of merchant", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    model_source: str = "MLflow @champion"

@app.post("/predict", response_model=PredictionResponse)
def predict(tx: Transaction):
    """Predict whether a transaction is fraudulent using the champion model."""
    data = tx.dict()
    
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        data["merchant_encoded"] = 0
    
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        model_source="MLflow Production"
    )

@app.get("/health")
def health():
    return {"status": "healthy", "model_source": "MLflow Registry"}

@app.get("/model-info")
def model_info():
    """Get information about the currently loaded model."""
    return {
        "registry": "MLflow",
        "model_name": "fraud-detection-model",
        "alias": "champion",
        "tracking_uri": "http://localhost:5000"
    }

Stop your old API (Ctrl+C) and start this new one:

uvicorn src.serve_mlflow:app --reload --host 0.0.0.0 --port 8000

Now deploying a new model is a controlled, auditable process:

Train new model → Automatically registered as a new version
Compare metrics → Use the MLflow UI to compare against the current @champion
Set as champion → Assign the @champion alias in the MLflow UI
Restart API → Loads the new @champion model
Roll back if needed → Move the @champion alias to a previous version
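
The promote-and-roll-back flow above works because an alias is just a movable pointer to a version number. The sketch below is a toy, in-memory stand-in (the `ToyRegistry` class and its methods are hypothetical, not MLflow's API) that shows why moving the alias changes which model gets loaded without touching the serving code:

```python
# Toy illustration only: ToyRegistry is a hypothetical stand-in, not MLflow's API.
from dataclasses import dataclass, field


@dataclass
class ToyRegistry:
    versions: dict = field(default_factory=dict)  # version number -> model artifact
    aliases: dict = field(default_factory=dict)   # alias name -> version number

    def register(self, artifact: str) -> int:
        """Store a new model and return its auto-incremented version number."""
        version = len(self.versions) + 1
        self.versions[version] = artifact
        return version

    def set_alias(self, alias: str, version: int) -> None:
        """Point an alias at an existing version (promotion or rollback)."""
        if version not in self.versions:
            raise KeyError(f"version {version} does not exist")
        self.aliases[alias] = version

    def load(self, alias: str) -> str:
        """Resolve the alias at load time, similar in spirit to 'models:/name@alias'."""
        return self.versions[self.aliases[alias]]


registry = ToyRegistry()
v1 = registry.register("model-trained-monday")
v2 = registry.register("model-trained-friday")

registry.set_alias("champion", v1)
before = registry.load("champion")          # serving code sees version 1

registry.set_alias("champion", v2)          # promote: only the pointer moves
after_promote = registry.load("champion")

registry.set_alias("champion", v1)          # roll back: move the pointer again
after_rollback = registry.load("champion")

print(before, after_promote, after_rollback)
```

Because the API in this tutorial resolves the alias once at startup, a restart is still needed after each move.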

Checkpoint:

MLflow UI (http://localhost:5000) should show the "fraud-detection" experiment with 5 runs
The "Models" tab should show "fraud-detection-model" with 5 versions
One version should have @champion alias
The API should load and serve @champion model

4. Ensure Feature Consistency with Feast

⚠️ First time hearing about feature stores? Don't worry.
You don't need to master every Feast detail on the first read.
Focus on why feature consistency matters — you can revisit the implementation later.
Key takeaway: Training and serving must compute features the same way, or your model silently fails.

What breaks without this: Your model sees different feature values in production than it saw during training. Accuracy drops silently. This is called "training-serving skew" and it's one of the most common causes of ML system failures.

Training-serving skew arises when the data transformations applied at training time differ from those applied at inference time. Even small discrepancies can severely degrade performance.

Why This Matters: Imagine you're computing "average transaction amount per merchant category" as a feature. During training, you compute it using pandas in a notebook. During serving, you compute it using SQL in a different system. Small differences in how these computations handle edge cases (nulls, rounding, time windows) cause the model to see different features in production than it was trained on.

The result? Silent failures where accuracy drops but nothing errors out. Your model is making predictions based on features it's never seen before, and you have no idea.
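
To see how small the discrepancy can be, here is a minimal sketch with made-up numbers. Both functions claim to compute the same "average amount" feature, but they treat missing values differently. The function names and data are hypothetical and only serve as illustration:

```python
# Toy illustration of training-serving skew: one feature name, two implementations.
from statistics import mean

# Hypothetical raw amounts for one merchant category; None marks a missing value.
amounts = [20.0, None, 40.0, None, 60.0]


def avg_amount_training(values):
    """Notebook-style logic: missing values are skipped (pandas .mean() does this by default)."""
    present = [v for v in values if v is not None]
    return mean(present)


def avg_amount_serving(values):
    """SQL-style logic after COALESCE(amount, 0): missing values count as zero."""
    return mean(0.0 if v is None else v for v in values)


train_value = avg_amount_training(amounts)
serve_value = avg_amount_serving(amounts)

print(f"training sees avg_amount={train_value:.1f}")
print(f"serving sees  avg_amount={serve_value:.1f}")
print(f"skew = {train_value - serve_value:.1f}")
```

Neither function raises an error, yet the model would be trained on 40.0 and served 24.0 for the same merchant.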

In our naive implementation, we did handle one simple case: we saved the LabelEncoder to ensure merchant_category is encoded the same way in training and serving. But imagine if we had more complex feature engineering:

Rolling averages over time windows
User-level aggregations
Cross-feature interactions
Real-time features from streaming data

Keeping all of this consistent by hand quickly becomes impractical.

4.1 What is Feast and Why Use It?

In production ML platforms, teams use a feature store to guarantee feature consistency between training and serving. Feast is one popular open-source option.

In this tutorial, we use Feast not because you must, but because it makes the training-serving contract explicit and teachable. The principles apply whether you use Feast, Tecton, Featureform, or a custom solution.

Feast provides:

Capability	Description
Single source of truth	Define features once, use everywhere
Offline/online consistency	Same features for training and serving
Point-in-time correctness	Prevents data leakage in training
Low-latency serving	Millisecond feature retrieval
Feature versioning	Track changes to feature definitions

How Feast works:

Define features in Python code (feature definitions)
Materialize features from your data sources to the online store
Retrieve features using the same API for both training (offline) and serving (online)

This ensures that training and serving use exactly the same feature computation logic.
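
The define, materialize, retrieve flow can be sketched in a few lines of plain Python. `ToyFeatureStore` and its method names are hypothetical and far simpler than Feast's API (no timestamps, TTLs, or real storage). The point is only that one feature function feeds both the offline and the online path:

```python
# Toy sketch of the define -> materialize -> retrieve flow.
# ToyFeatureStore is hypothetical; it is not Feast and skips timestamps, TTLs, and storage.
from collections import defaultdict
from statistics import mean


def merchant_stats(rows):
    """Step 1 (define): the single feature definition, written once."""
    by_merchant = defaultdict(list)
    for row in rows:
        by_merchant[row["merchant_category"]].append(row)
    return {
        merchant: {
            "avg_amount": mean(r["amount"] for r in group),
            "transaction_count": len(group),
            "fraud_rate": mean(r["is_fraud"] for r in group),
        }
        for merchant, group in by_merchant.items()
    }


class ToyFeatureStore:
    def __init__(self, feature_fn, rows):
        # Offline store: features computed from the raw source data.
        self.offline = feature_fn(rows)
        # Online store: empty until materialize() is called.
        self.online = {}

    def materialize(self):
        """Step 2: copy the already-computed features to the online store."""
        self.online = dict(self.offline)

    def get_historical_features(self, rows):
        """Step 3a (training): join offline features onto each row by entity key."""
        return [{**row, **self.offline[row["merchant_category"]]} for row in rows]

    def get_online_features(self, merchant_category):
        """Step 3b (serving): look up the same values by entity key."""
        return self.online[merchant_category]


transactions = [
    {"merchant_category": "online", "amount": 100.0, "is_fraud": 1},
    {"merchant_category": "online", "amount": 50.0, "is_fraud": 0},
    {"merchant_category": "grocery", "amount": 30.0, "is_fraud": 0},
]

store = ToyFeatureStore(merchant_stats, transactions)
store.materialize()

training_rows = store.get_historical_features(transactions)
serving_features = store.get_online_features("online")
print(training_rows[0]["avg_amount"], serving_features["avg_amount"])
```

Since the online values are copies of the offline ones, the training path and the serving path cannot drift apart unless someone bypasses the store.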

4.2 Install and Initialize Feast

We already installed Feast via requirements.txt. Now let's initialize a feature repository.

# Navigate to the feature_repo directory
cd feature_repo

# Initialize Feast (this creates template files)
feast init . --minimal

# Go back to project root
cd ..

This creates the basic Feast structure:

feature_repo/
├── feature_store.yaml    # Feast configuration
└── __init__.py

4.3 Define Feature Definitions

First, replace the contents of the generated Feast configuration file with the following:

# feature_repo/feature_store.yaml
project: fraud_detection
registry: ../data/registry.db
provider: local
online_store:
  type: sqlite
  path: ../data/online_store.db
offline_store:
  type: file
entity_key_serialization_version: 3

This configuration:

Names our project "fraud_detection"
Uses SQLite for the online store (for production, you'd use Redis or DynamoDB)
Uses local files for the offline store (for production, you'd use BigQuery or Snowflake)

Now create the feature definitions:

# feature_repo/features.py
"""
Feast feature definitions for fraud detection.

This file defines:
- Entities: The keys we use to look up features (merchant_category)
- Data Sources: Where the raw feature data comes from (Parquet file)
- Feature Views: The features themselves and their schemas

The key insight: These definitions are the SINGLE SOURCE OF TRUTH.
Both training and serving use these exact definitions.
"""
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64

# =============================================================================
# ENTITIES
# =============================================================================
# An entity is the "key" we use to look up features.
# For merchant-level features, the entity is merchant_category.

merchant = Entity(
    name="merchant_category",
    description="Merchant category for the transaction (for example, 'online', 'grocery')",
    value_type=ValueType.STRING,
)

# =============================================================================
# DATA SOURCES
# =============================================================================
# Data sources tell Feast where to find the raw feature data.
# For local development, we use a Parquet file.
# For production, this could be BigQuery, Snowflake, S3, etc.

merchant_stats_source = FileSource(
    name="merchant_stats_source",
    path="../data/merchant_features.parquet",  # We'll create this file
    timestamp_field="event_timestamp",       # Required for point-in-time joins
)

# =============================================================================
# FEATURE VIEWS
# =============================================================================
# A Feature View defines a group of related features.
# It specifies:
# - Which entity the features are for
# - The schema (names and types of features)
# - Where the data comes from
# - How long features are valid (TTL)

merchant_stats_fv = FeatureView(
    name="merchant_stats",
    description="Aggregated statistics per merchant category",
    entities=[merchant],
    ttl=timedelta(days=7),  # Features are valid for 7 days
    schema=[
        Field(name="avg_amount", dtype=Float32, description="Average transaction amount"),
        Field(name="transaction_count", dtype=Int64, description="Number of transactions"),
        Field(name="fraud_rate", dtype=Float32, description="Historical fraud rate"),
    ],
    source=merchant_stats_source,
    online=True,  # Enable online serving (low-latency retrieval)
)

4.4 Materialize Features to Online Store

Now we need to:

Compute the features from our training data
Save them in a format Feast can read
Apply the Feast definitions
Materialize features to the online store

Create src/prepare_feast_features.py:

# src/prepare_feast_features.py
"""
Prepare feature data for Feast.

This script:
1. Computes aggregated merchant features from training data
2. Saves them in Parquet format (Feast's offline store format)
3. Applies Feast feature definitions
4. Materializes features to the online store for low-latency serving

Run this whenever your training data changes or you want to refresh features.
"""
import pandas as pd
import numpy as np
from datetime import datetime
import subprocess
import os

def compute_merchant_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute aggregated features by merchant category.
    
    THIS IS THE SINGLE SOURCE OF TRUTH FOR FEATURE COMPUTATION.
    
    Both training and serving will use features computed by this exact logic.
    Any change here automatically applies everywhere.
    
    Args:
        df: Transaction DataFrame with columns: amount, merchant_category, is_fraud
        
    Returns:
        DataFrame with computed features per merchant category
    """
    print("Computing merchant-level features...")
    
    # Group by merchant category and compute aggregates
    stats = df.groupby('merchant_category').agg({
        'amount': ['mean', 'count'],
        'is_fraud': 'mean'
    }).reset_index()
    
    # Flatten column names
    stats.columns = ['merchant_category', 'avg_amount', 'transaction_count', 'fraud_rate']
    
    # Add timestamp for Feast (required for point-in-time correct joins)
    stats['event_timestamp'] = datetime.now()
    
    # Convert types to match Feast schema
    stats['avg_amount'] = stats['avg_amount'].astype('float32')
    stats['transaction_count'] = stats['transaction_count'].astype('int64')
    stats['fraud_rate'] = stats['fraud_rate'].astype('float32')
    
    return stats

def main():
    print("="*60)
    print("FEAST FEATURE PREPARATION")
    print("="*60)
    
    # Load training data
    print("\n1. Loading training data...")
    train_df = pd.read_csv('data/train.csv')
    print(f"   Loaded {len(train_df):,} transactions")
    
    # Compute merchant features
    print("\n2. Computing merchant features...")
    merchant_features = compute_merchant_features(train_df)
    
    print("\n   Computed features:")
    print(merchant_features.to_string(index=False))
    
    # Save as Parquet (required format for Feast file source)
    print("\n3. Saving features to Parquet...")
    os.makedirs('data', exist_ok=True)
    output_path = 'data/merchant_features.parquet'
    merchant_features.to_parquet(output_path, index=False)
    print(f"   Saved to {output_path}")
    
    # Apply Feast feature definitions
    print("\n4. Applying Feast feature definitions...")
    try:
        result = subprocess.run(
            ['feast', 'apply'],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Feature definitions applied successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error applying Feast: {e.stderr}")
        raise
    
    # Materialize features to online store
    print("\n5. Materializing features to online store...")
    try:
        result = subprocess.run(
            ['feast', 'materialize-incremental', datetime.now().isoformat()],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Features materialized successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error materializing: {e.stderr}")
        raise
    
    print("\n" + "="*60)
    print("FEAST FEATURE PREPARATION COMPLETE!")
    print("="*60)
    print("\nYou can now:")
    print("  - Retrieve features for training: get_training_features()")
    print("  - Retrieve features for serving: get_online_features()")
    print("  - List feature views: feast feature-views list")

if __name__ == "__main__":
    main()

Run the feature preparation:

python src/prepare_feast_features.py

You should see output along these lines (abridged; exact numbers depend on your data):

============================================================
FEAST FEATURE PREPARATION
============================================================

1. Loading training data... 8,000 transactions
2. Computing merchant features...
   grocery: avg=$31.24, fraud_rate=0.85%
   online: avg=$98.45, fraud_rate=4.87%
   restaurant: avg=$28.12, fraud_rate=0.50%
   retail: avg=$45.67, fraud_rate=1.02%
   travel: avg=$156.23, fraud_rate=4.18%
3. Saving to data/merchant_features.parquet ✓
4. Applying Feast definitions... ✓
5. Materializing to online store... ✓

FEAST FEATURE PREPARATION COMPLETE!

4.5 Retrieve Features for Training and Serving

Now let's create utilities to retrieve features consistently for both training and serving:

# src/feast_features.py
"""
Feast feature retrieval for training and serving.

This module provides functions to retrieve features from Feast:
- get_training_features(): For offline training (historical features)
- get_online_features(): For real-time serving (low-latency)

IMPORTANT: Both functions use the SAME feature definitions,
ensuring consistency between training and serving.
"""
import pandas as pd
from feast import FeatureStore
from datetime import datetime

# Initialize Feast store (points to our feature_repo)
store = FeatureStore(repo_path="feature_repo")

def get_training_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Get features for training using Feast's offline store.
    
    Uses point-in-time correct joins to prevent data leakage.
    This means features are looked up as of the time each transaction occurred,
    not as of "now" - preventing you from accidentally using future data.
    
    Args:
        df: DataFrame with at least 'merchant_category' column
        
    Returns:
        DataFrame with original columns plus Feast features
    """
    print("Retrieving training features from Feast offline store...")
    
    # Prepare entity dataframe with timestamps
    # Each row needs: entity key(s) + event_timestamp
    entity_df = df[['merchant_category']].copy()
    entity_df['event_timestamp'] = datetime.now()  # See note below
    entity_df = entity_df.drop_duplicates()
    
    # ⚠️ Simplification: For clarity, we use the current timestamp here.
    # In real systems, this would be the actual event time of each transaction.
    
    # Retrieve historical features
    # Feast handles the point-in-time join automatically
    training_data = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
    ).to_df()
    
    # Merge features back with original dataframe
    result = df.merge(
        training_data[['merchant_category', 'avg_amount', 'transaction_count', 'fraud_rate']],
        on='merchant_category',
        how='left'
    )
    
    print(f"Retrieved features for {len(entity_df)} unique merchants")
    return result

def get_online_features(merchant_category: str) -> dict:
    """
    Get features for real-time serving using Feast's online store.
    
    This is optimized for low-latency retrieval (milliseconds).
    Use this in your prediction API for real-time inference.
    
    Args:
        merchant_category: The merchant category to look up
        
    Returns:
        Dictionary with feature names and values
    """
    # Retrieve from online store (low-latency)
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": merchant_category}],
    ).to_dict()
    
    # Format the response
    return {
        'merchant_avg_amount': feature_vector['avg_amount'][0],
        'merchant_tx_count': feature_vector['transaction_count'][0],
        'merchant_fraud_rate': feature_vector['fraud_rate'][0],
    }

def get_online_features_batch(merchant_categories: list) -> pd.DataFrame:
    """
    Get features for multiple merchants at once (batch serving).
    
    More efficient than calling get_online_features() in a loop.
    
    Args:
        merchant_categories: List of merchant categories to look up
        
    Returns:
        DataFrame with features for each merchant
    """
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": mc} for mc in merchant_categories],
    ).to_df()
    
    return feature_vector

if __name__ == "__main__":
    # Test the feature retrieval functions
    print("="*60)
    print("TESTING FEAST FEATURE RETRIEVAL")
    print("="*60)
    
    # Test offline retrieval (for training)
    print("\n1. Testing OFFLINE feature retrieval (for training)...")
    train_df = pd.read_csv('data/train.csv').head(10)
    enriched = get_training_features(train_df)
    print("\n   Sample enriched training data:")
    print(enriched[['amount', 'merchant_category', 'avg_amount', 'fraud_rate']].head())
    
    # Test online retrieval (for serving)
    print("\n2. Testing ONLINE feature retrieval (for serving)...")
    for category in ['online', 'grocery', 'travel', 'restaurant', 'retail']:
        features = get_online_features(category)
        print(f"   {category}: avg_amount=${features['merchant_avg_amount']:.2f}, "
              f"fraud_rate={features['merchant_fraud_rate']:.2%}")
    
    # Test batch retrieval
    print("\n3. Testing BATCH online retrieval...")
    batch_features = get_online_features_batch(['online', 'grocery', 'travel'])
    print(batch_features)
    
    print("\n" + "="*60)
    print("FEAST FEATURE RETRIEVAL TEST COMPLETE!")
    print("="*60)

Test the feature retrieval:

python src/feast_features.py

You should see output along these lines (abridged; exact numbers depend on your data):

============================================================
TESTING FEAST FEATURE RETRIEVAL
============================================================

1. Testing OFFLINE feature retrieval (for training)...
Retrieving training features from Feast offline store...
Retrieved features for 5 unique merchants

   Sample enriched training data:
   amount merchant_category  avg_amount  fraud_rate
    45.23           grocery       31.24      0.0085
   123.45            online       98.45      0.0487
    ...

2. Testing ONLINE feature retrieval (for serving)...
   online: avg_amount=$98.45, fraud_rate=4.87%
   grocery: avg_amount=$31.24, fraud_rate=0.85%
   travel: avg_amount=$156.23, fraud_rate=4.18%
   restaurant: avg_amount=$28.12, fraud_rate=0.50%
   retail: avg_amount=$45.67, fraud_rate=1.02%

3. Testing BATCH online retrieval...
  merchant_category  avg_amount  transaction_count  fraud_rate
               online       98.45               1234      0.0487
              grocery       31.24               2345      0.0085
               travel      156.23                478      0.0418

Why Feast Over Custom Code?

Aspect	Custom Code	Feast
Consistency	Manual effort to keep in sync	Automatic - same definitions everywhere
Point-in-time correctness	Must implement yourself	Built-in
Online serving	Must build your own cache	Built-in online store
Feature versioning	Not supported	Built-in
Scalability	Limited	Production-ready (BigQuery, Redis, etc.)
Team collaboration	Difficult	Feature registry with documentation
Monitoring	Manual	Built-in feature statistics
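
To make the "Point-in-time correctness" row concrete, here is a rough sketch of what implementing it yourself involves for a single entity. The data, the `feature_as_of` helper, and the TTL handling are all hypothetical simplifications; Feast's real join works over whole DataFrames:

```python
# Sketch of a hand-written point-in-time lookup with a TTL.
# Names and data are hypothetical; this is not how Feast implements its join.
from bisect import bisect_right
from datetime import datetime, timedelta

# Feature history for one entity, sorted by the time each value became known.
fraud_rate_history = [
    (datetime(2024, 1, 1), 0.010),
    (datetime(2024, 1, 8), 0.030),
    (datetime(2024, 1, 15), 0.050),
]


def feature_as_of(history, event_time, ttl):
    """Return the latest value known at event_time, or None if none is fresh enough."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time) - 1  # last entry with ts <= event_time
    if idx < 0:
        return None  # nothing was known yet; a later value would leak the future
    ts, value = history[idx]
    if event_time - ts > ttl:
        return None  # the newest known value is older than the TTL allows
    return value


ttl = timedelta(days=7)
mid_january = feature_as_of(fraud_rate_history, datetime(2024, 1, 10), ttl)
before_data = feature_as_of(fraud_rate_history, datetime(2023, 12, 31), ttl)
long_after = feature_as_of(fraud_rate_history, datetime(2024, 2, 1), ttl)

print(mid_january)  # the Jan 8 value, not the later Jan 15 one
print(before_data)  # None: no value existed yet
print(long_after)   # None: every value is past its TTL
```

Getting these boundary cases right for every feature and every row is the work a feature store takes off your hands.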

💡 Mental Model: Treat feature definitions like database schemas.
You wouldn't compute a column one way in your application and a different way in your reports. Features deserve the same discipline — define once, use everywhere.

Checkpoint: After running prepare_feast_features.py, you should have:

data/merchant_features.parquet (computed features)
data/registry.db (Feast registry)
data/online_store.db (SQLite online store)

Running python src/feast_features.py should successfully retrieve features for all merchant categories.

5. Add Data Validation with Great Expectations

What breaks without this: Your API accepts garbage input (negative amounts, invalid hours) and returns meaningless predictions. Worse, you have no idea it happened.

Recall that our API currently trusts input blindly. We saw how garbage data produces a prediction with no warning. Great Expectations is an open-source tool for data quality testing – defining rules (expectations) and testing data against them.

Why This Matters: Data validation acts as a gatekeeper. Bad data is rejected before it can harm predictions. As the saying goes, "Garbage in, garbage out" – feeding unreliable data yields unreliable results. With validation, we transform this to "Garbage in, error out" – much better for debugging and reliability.

5.1 Define Expectations

What are reasonable expectations for our transaction data? Based on domain knowledge:

Field	Expectation	Reason
`amount`	Positive (> 0)	Negative transactions don't make sense
`amount`	Below $50,000	Extremely large amounts are outliers/errors
`hour`	0-23 inclusive	Valid hours in a day
`day_of_week`	0-6 inclusive	Valid days (Mon=0, Sun=6)
`merchant_category`	One of known categories	Must match training data
All fields	Not null	Required for prediction

Create src/data_validation.py:

# src/data_validation.py
"""
Data validation for fraud detection.

This module provides functions to validate input data BEFORE making predictions.
Invalid data is rejected with clear error messages.

The key insight: It's better to reject bad input than to make garbage predictions.
"""
import pandas as pd
from typing import Dict, List, Any, Optional

# Define the valid merchant categories (must match training data!)
VALID_CATEGORIES = ["grocery", "restaurant", "retail", "online", "travel"]

def validate_transaction(data: Dict[str, Any]) -> Dict[str, Any]:
    """
    Validate a single transaction for fraud prediction.
    
    Checks all business rules and data quality requirements.
    Returns a dictionary with 'valid' (bool) and 'errors' (list).
    
    Args:
        data: Dictionary with transaction fields
        
    Returns:
        {"valid": bool, "errors": list of error messages}
        
    Example:
        >>> validate_transaction({"amount": -100, "hour": 25, ...})
        {"valid": False, "errors": ["amount must be positive", "hour must be 0-23"]}
    """
    errors = []
    
    # ==========================================================================
    # Amount Validation
    # ==========================================================================
    amount = data.get("amount")
    if amount is None:
        errors.append("amount is required")
    elif not isinstance(amount, (int, float)):
        errors.append(f"amount must be a number (got {type(amount).__name__})")
    elif amount <= 0:
        errors.append("amount must be positive")
    elif amount > 50000:
        errors.append(f"amount exceeds maximum allowed value of $50,000 (got ${amount:,.2f})")
    
    # ==========================================================================
    # Hour Validation
    # ==========================================================================
    hour = data.get("hour")
    if hour is None:
        errors.append("hour is required")
    elif not isinstance(hour, int):
        errors.append(f"hour must be an integer (got {type(hour).__name__})")
    elif not (0 <= hour <= 23):
        errors.append(f"hour must be between 0 and 23 (got {hour})")
    
    # ==========================================================================
    # Day of Week Validation
    # ==========================================================================
    day = data.get("day_of_week")
    if day is None:
        errors.append("day_of_week is required")
    elif not isinstance(day, int):
        errors.append(f"day_of_week must be an integer (got {type(day).__name__})")
    elif not (0 <= day <= 6):
        errors.append(f"day_of_week must be between 0 (Monday) and 6 (Sunday) (got {day})")
    
    # ==========================================================================
    # Merchant Category Validation
    # ==========================================================================
    category = data.get("merchant_category")
    if category is None:
        errors.append("merchant_category is required")
    elif not isinstance(category, str):
        errors.append(f"merchant_category must be a string (got {type(category).__name__})")
    elif category not in VALID_CATEGORIES:
        errors.append(
            f"merchant_category must be one of {VALID_CATEGORIES} (got '{category}')"
        )
    
    return {
        "valid": len(errors) == 0,
        "errors": errors
    }

def validate_batch(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Validate a batch of transactions using Great Expectations.
    
    This is useful for validating training data or batch prediction requests.
    Uses Great Expectations for more sophisticated validation.
    
    Args:
        df: DataFrame with transaction data
        
    Returns:
        Dictionary with validation results
    """
    import great_expectations as gx
    
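    # Convert to a Great Expectations dataset.
    # Note: from_pandas belongs to the legacy dataset API; Great Expectations 1.x
    # replaced it with a context-based API, so this call needs an older release.
    ge_df = gx.from_pandas(df)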
    
    results = []
    
    # Amount expectations
    r = ge_df.expect_column_values_to_be_between(
        'amount', min_value=0.01, max_value=50000, mostly=0.99
    )
    results.append(('amount_range', r.success, r.result))
    
    # Hour expectations
    r = ge_df.expect_column_values_to_be_between(
        'hour', min_value=0, max_value=23
    )
    results.append(('hour_range', r.success, r.result))
    
    # Day of week expectations
    r = ge_df.expect_column_values_to_be_between(
        'day_of_week', min_value=0, max_value=6
    )
    results.append(('day_range', r.success, r.result))
    
    # Merchant category expectations
    r = ge_df.expect_column_values_to_be_in_set(
        'merchant_category', VALID_CATEGORIES
    )
    results.append(('category_valid', r.success, r.result))
    
    # No nulls in critical fields
    for col in ['amount', 'hour', 'day_of_week', 'merchant_category']:
        r = ge_df.expect_column_values_to_not_be_null(col)
        results.append((f'{col}_not_null', r.success, r.result))
    
    # Summarize results
    passed = sum(1 for _, success, _ in results if success)
    total = len(results)
    
    return {
        'success': passed == total,
        'passed': passed,
        'total': total,
        'pass_rate': passed / total,
        'details': {name: {'passed': success, 'result': result} 
                   for name, success, result in results}
    }

if __name__ == "__main__":
    print("="*60)
    print("TESTING DATA VALIDATION")
    print("="*60)
    
    # Test single transaction validation
    print("\n1. Single Transaction Validation")
    print("-"*40)
    
    test_cases = [
        {
            "name": "Valid transaction",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Negative amount",
            "data": {"amount": -100.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Invalid hour",
            "data": {"amount": 50.0, "hour": 25, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Unknown merchant",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "unknown"}
        },
        {
            "name": "Everything wrong",
            "data": {"amount": -999, "hour": 99, "day_of_week": 15, "merchant_category": "fake"}
        },
    ]
    
    for tc in test_cases:
        result = validate_transaction(tc["data"])
        status = "PASS" if result["valid"] else "FAIL"
        print(f"\n{tc['name']}: {status}")
        if result["errors"]:
            for error in result["errors"]:
                print(f"  - {error}")
    
    # Test batch validation
    print("\n\n2. Batch Validation with Great Expectations")
    print("-"*40)
    
    train_df = pd.read_csv('data/train.csv')
    results = validate_batch(train_df)
    
    print(f"\nTraining data validation: {results['passed']}/{results['total']} checks passed")
    print(f"Pass rate: {results['pass_rate']:.1%}")
    
    if not results['success']:
        print("\nFailed checks:")
        for name, detail in results['details'].items():
            if not detail['passed']:
                print(f"  - {name}")

When to Use Which Validation Approach

Approach	Use Case	Latency	When to Use
Custom Python (`validate_transaction`)	Real-time API requests	<1ms	Every prediction request
Great Expectations	Batch data quality	Seconds	Training data, periodic audits, CI/CD

We use both in this tutorial because they serve different purposes:

Custom validation is your runtime gatekeeper — fast enough for every request
Great Expectations is your batch auditor — thorough checks on datasets
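
One detail in `validate_batch` is worth a closer look: the `mostly=0.99` argument, which lets a batch check tolerate a small share of outliers. The sketch below mimics that idea in plain Python. `fraction_in_range` and `check_between` are hypothetical helpers, not Great Expectations functions, and skipping nulls is an assumption made for this example:

```python
# Plain-Python sketch of a "mostly" threshold.
# These helpers are hypothetical and not part of Great Expectations.

def fraction_in_range(values, min_value, max_value):
    """Share of non-null values that fall inside [min_value, max_value]."""
    present = [v for v in values if v is not None]  # assumption: nulls are skipped
    if not present:
        return 1.0  # nothing to check
    inside = sum(1 for v in present if min_value <= v <= max_value)
    return inside / len(present)


def check_between(values, min_value, max_value, mostly=1.0):
    """Pass when at least `mostly` of the values are in range."""
    share = fraction_in_range(values, min_value, max_value)
    return {"success": share >= mostly, "in_range_fraction": share}


# 200 hypothetical amounts: 199 plausible values and one outlier.
amounts = [25.0] * 199 + [75000.0]

strict = check_between(amounts, 0.01, 50000)                 # every value must pass
tolerant = check_between(amounts, 0.01, 50000, mostly=0.99)  # up to 1% may fail

print(strict)
print(tolerant)
```

A tolerance like this suits a batch audit, where one odd row should not block a training run. The per-request validator has no such slack: a single bad transaction is always rejected.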

5.2 Integrate Validation into FastAPI

Now let's update our API to reject invalid input with clear error messages:

# src/serve_validated.py
"""
Serve fraud detection model with input validation.

This version adds data validation BEFORE making predictions:
- Invalid inputs are rejected with HTTP 400 and clear error messages
- Valid inputs are processed and predictions returned

This is much safer than the naive version which accepted garbage.
"""
import pickle
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from src.data_validation import validate_transaction

# Load model
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)

app = FastAPI(
    title="Fraud Detection API (Validated)",
    description="""
    Fraud detection API with input validation.
    
    All inputs are validated before prediction:
    - amount: Must be positive and below $50,000
    - hour: Must be 0-23
    - day_of_week: Must be 0-6
    - merchant_category: Must be one of: grocery, restaurant, retail, online, travel
    
    Invalid inputs return HTTP 400 with detailed error messages.
    """,
    version="3.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount (must be positive)", example=150.00)
    hour: int = Field(..., description="Hour of day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Mon, 6=Sun)", example=3)
    merchant_category: str = Field(..., description="Merchant type", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    validation_passed: bool = True

class ValidationErrorResponse(BaseModel):
    detail: dict

@app.post("/predict", response_model=PredictionResponse, responses={400: {"model": ValidationErrorResponse}})
def predict(tx: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Input is validated before prediction. Invalid inputs return HTTP 400.
    """
    data = tx.dict()
    
    # VALIDATE INPUT BEFORE MAKING PREDICTION
    validation = validate_transaction(data)
    
    if not validation["valid"]:
        raise HTTPException(
            status_code=400,
            detail={
                "message": "Validation failed",
                "errors": validation["errors"],
                "input": data
            }
        )
    
    # Input is valid - make prediction
    data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        validation_passed=True
    )

@app.get("/health")
def health():
    return {"status": "healthy", "validation": "enabled"}

Start the validated API:

uvicorn src.serve_validated:app --reload --host 0.0.0.0 --port 8000

Now test with bad data:

curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": -500, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}'

Response (HTTP 400):

{
  "detail": {
    "message": "Validation failed",
    "errors": [
      "amount must be positive",
      "hour must be between 0 and 23 (got 25)",
      "day_of_week must be between 0 (Monday) and 6 (Sunday) (got 10)",
      "merchant_category must be one of ['grocery', 'restaurant', 'retail', 'online', 'travel'] (got 'fake')"
    ],
    "input": {"amount": -500, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}
  }
}

This is a huge improvement! Instead of silently accepting garbage and returning meaningless predictions, we now:

Reject invalid input immediately
Provide clear, actionable error messages
Return the original input for debugging
Use proper HTTP status codes (400 for client error)
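
The endpoint above only relies on validate_transaction returning a dictionary with a "valid" flag and an "errors" list. As a minimal sketch of that contract in plain Python (the function name `validate_transaction_sketch` and the constants are assumptions for illustration, and the project's real validator may be implemented differently), the rules from the API description could be checked like this:

```python
from typing import Any, Dict, List

# Assumed constants, taken from the rules listed in the API description.
VALID_CATEGORIES = ["grocery", "restaurant", "retail", "online", "travel"]
MAX_AMOUNT = 50_000


def validate_transaction_sketch(data: Dict[str, Any]) -> Dict[str, Any]:
    """Return {"valid": bool, "errors": [...]} for one transaction dict."""
    errors: List[str] = []

    amount = data.get("amount")
    if amount is None or amount <= 0:
        errors.append("amount must be positive")
    elif amount >= MAX_AMOUNT:
        errors.append(f"amount must be below {MAX_AMOUNT} (got {amount})")

    hour = data.get("hour")
    if hour is None or not 0 <= hour <= 23:
        errors.append(f"hour must be between 0 and 23 (got {hour})")

    day = data.get("day_of_week")
    if day is None or not 0 <= day <= 6:
        errors.append(
            f"day_of_week must be between 0 (Monday) and 6 (Sunday) (got {day})"
        )

    category = data.get("merchant_category")
    if category not in VALID_CATEGORIES:
        errors.append(
            f"merchant_category must be one of {VALID_CATEGORIES} (got {category!r})"
        )

    # Collect every problem instead of stopping at the first one,
    # so the client can fix all fields in a single round trip.
    return {"valid": not errors, "errors": errors}
```

Collecting all errors before returning is what makes the 400 response list one message per invalid field.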

Checkpoint: Your validated API should:

Accept valid transactions and return predictions
Reject invalid transactions with HTTP 400 and detailed error messages
Show validation errors for each invalid field

6. Monitor Model Performance and Data Drift

What breaks without this: Your model's accuracy drops from 98% to 70% over two months. Nobody notices until customers complain. By then, significant damage has occurred.

Even with a great model and clean input data, time can be an enemy. Model performance can decline as real-world data evolves – this is known as model drift or model decay.

Why This Matters: In traditional software, you monitor CPU, memory, error rates, and response times. In ML, you must also monitor:

Data quality (are inputs within expected ranges?)
Model performance (is accuracy holding up?)
Data drift (has input distribution changed?)
Prediction drift (has the distribution of predictions changed?)

Without monitoring, your model could be silently failing for weeks before anyone notices. By then, significant damage may have occurred – fraud slipping through, good customers blocked, revenue lost.

6.1 The Four Pillars of ML Observability

Pillar	What to Monitor	Why It Matters
Data Quality	Are inputs valid? Nulls? Outliers?	Bad data causes bad predictions
Model Performance	Accuracy, precision, recall, F1	Is the model still working?
Data Drift	Has input distribution changed from training?	Model may not generalize to new data
Prediction Drift	Has prediction distribution changed?	May indicate data or concept drift
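
Of the four pillars, prediction drift is the one you can check without waiting for labels. The sketch below is one simple way to do it, not the method used later in this tutorial: it compares the share of transactions flagged as fraud in a reference window and a current window. The function names, the 0.5 decision threshold, and the 5-point `max_shift` are all assumptions chosen for illustration.

```python
from typing import Dict, Sequence


def flag_rate(probabilities: Sequence[float], decision_threshold: float = 0.5) -> float:
    """Share of predictions at or above the decision threshold."""
    if not probabilities:
        raise ValueError("need at least one prediction")
    flagged = sum(1 for p in probabilities if p >= decision_threshold)
    return flagged / len(probabilities)


def prediction_drift(
    reference_probs: Sequence[float],
    current_probs: Sequence[float],
    max_shift: float = 0.05,
) -> Dict[str, object]:
    """Compare flag rates between two windows of predicted fraud probabilities."""
    ref_rate = flag_rate(reference_probs)
    cur_rate = flag_rate(current_probs)
    shift = cur_rate - ref_rate
    return {
        "reference_flag_rate": ref_rate,
        "current_flag_rate": cur_rate,
        "shift": shift,
        # Alert on large moves in either direction: a sudden drop in
        # flagged transactions can be as suspicious as a sudden rise.
        "alert": abs(shift) > max_shift,
    }
```

A shift in the flag rate does not say why predictions changed, only that they did, which is why it is usually read alongside the data drift checks.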

6.2 Build a Drift Monitor with Evidently

Evidently is an open-source library specifically designed for ML monitoring. It can detect drift, generate reports, and integrate with monitoring systems.

Create src/monitoring.py:

# src/monitoring.py
"""
Model monitoring with Evidently.

This module provides tools to:
1. Detect data drift between training and production data
2. Generate detailed HTML reports
3. Track drift over time
4. Alert when drift exceeds thresholds

In production, you would run drift checks periodically (hourly, daily)
and alert when significant drift is detected.
"""
import pandas as pd
import numpy as np
# Only the Evidently names used below are imported. These import paths follow
# Evidently's older Report API, so pin the Evidently version in requirements.txt.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
from datetime import datetime
from typing import List, Dict, Any, Optional

class DriftMonitor:
    """
    Monitor for detecting data drift between reference (training) and current data.
    
    Implementation Note: We use two approaches here:
    1. SciPy's KS test: a lightweight statistical check that powers check_drift()
    2. Evidently: a full-featured library that renders the HTML report in generate_report()
    
    Because check_drift() depends only on SciPy, drift detection keeps working
    even if Evidently fails to generate a report.
    
    Usage:
        monitor = DriftMonitor(training_data)
        result = monitor.check_drift(production_data)
        if result['drift_detected']:
            alert("Drift detected!")
    """
    
    def __init__(self, reference_data: pd.DataFrame, feature_columns: Optional[List[str]] = None):
        """
        Initialize the drift monitor with reference (training) data.
        
        Args:
            reference_data: The training data to compare against
            feature_columns: Columns to monitor (default: all numeric columns)
        """
        self.reference = reference_data
        self.feature_columns = feature_columns or reference_data.select_dtypes(
            include=[np.number]
        ).columns.tolist()
        self.history: List[Dict[str, Any]] = []
        
        print(f"Drift monitor initialized with {len(self.reference):,} reference samples")
        print(f"Monitoring columns: {self.feature_columns}")
    
    def check_drift(self, current_data: pd.DataFrame, threshold: float = 0.1) -> Dict[str, Any]:
        """
        Check for drift between reference and current data.
        
        Args:
            current_data: Current/production data to check
            threshold: Drift share threshold for alerting (default 10%)
            
        Returns:
            Dictionary with drift results
        """
        from scipy import stats
        
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        # Simple statistical drift detection using KS test
        drifted_columns = []
        for col in self.feature_columns:
            statistic, p_value = stats.ks_2samp(
                ref_subset[col].dropna(),
                cur_subset[col].dropna()
            )
            if p_value < 0.05:  # 5% significance level
                drifted_columns.append(col)
        
        n_features = len(self.feature_columns)
        n_drifted = len(drifted_columns)
        drift_share = n_drifted / n_features if n_features > 0 else 0
        
        result = {
            'timestamp': datetime.now().isoformat(),
            'drift_detected': n_drifted > 0,
            'drift_share': drift_share,
            'drifted_columns': drifted_columns,
            'n_features': n_features,
            'n_drifted': n_drifted,
            'current_samples': len(current_data),
            'threshold': threshold,
            'alert': drift_share > threshold
        }
        
        self.history.append(result)
        
        return result
    
    def generate_report(self, current_data: pd.DataFrame, output_path: str = "drift_report.html"):
        """
        Generate a detailed HTML drift report using Evidently.
        
        Writes the report to output_path; open that file in a browser to inspect drift patterns.
        """
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        try:
            report = Report(metrics=[DataDriftPreset()])
            report.run(reference_data=ref_subset, current_data=cur_subset)
            
            # Save the HTML report directly (no notebook display object needed)
            report.save_html(output_path)
            
            print(f"Drift report saved to {output_path}")
            print("Open this file in a browser to view detailed visualizations.")
        except Exception as e:
            print(f"Could not generate Evidently report: {e}")
            print("Rely on the KS-test results from check_drift() instead.")
    
    def get_alerts(self, threshold: float = 0.1) -> List[Dict[str, Any]]:
        """
        Get all alerts from history where drift exceeded threshold.
        """
        return [
            {
                'timestamp': r['timestamp'],
                'severity': 'HIGH' if r['drift_share'] > 0.3 else 'MEDIUM',
                'drift_share': r['drift_share'],
                'message': f"Drift detected: {r['drift_share']:.1%} of features drifted",
                'drifted_columns': r['drifted_columns']
            }
            for r in self.history
            if r['drift_share'] > threshold
        ]
    
    def summary(self) -> Dict[str, Any]:
        """Get summary statistics from monitoring history."""
        if not self.history:
            return {"message": "No drift checks performed yet"}
        
        drift_shares = [r['drift_share'] for r in self.history]
        alerts = [r for r in self.history if r['alert']]
        
        return {
            'total_checks': len(self.history),
            'total_alerts': len(alerts),
            'avg_drift_share': np.mean(drift_shares),
            'max_drift_share': np.max(drift_shares),
            'first_check': self.history[0]['timestamp'],
            'last_check': self.history[-1]['timestamp']
        }


def simulate_drift_scenarios():
    """
    Demonstrate drift detection with different scenarios.
    
    This simulates what happens when production data differs from training data.
    """
    from src.generate_data import generate_transactions
    
    print("="*70)
    print("DRIFT DETECTION SIMULATION")
    print("="*70)
    
    # Load reference (training) data
    print("\n1. Loading reference data (training set)...")
    reference = pd.read_csv('data/train.csv')
    feature_cols = ['amount', 'hour', 'day_of_week']
    
    # Initialize drift monitor
    monitor = DriftMonitor(reference, feature_cols)
    
    # Scenario 1: Similar data (should show minimal drift)
    print("\n" + "-"*70)
    print("SCENARIO 1: Test data (similar distribution)")
    print("-"*70)
    test_data = pd.read_csv('data/test.csv')
    result = monitor.check_drift(test_data)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 2: Fraud spike (10% fraud instead of 2%)
    print("\n" + "-"*70)
    print("SCENARIO 2: Fraud spike (10% fraud rate instead of 2%)")
    print("-"*70)
    fraud_spike = generate_transactions(n_samples=2000, fraud_ratio=0.10, seed=101)
    result = monitor.check_drift(fraud_spike)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 3: Amount inflation (everything costs more)
    print("\n" + "-"*70)
    print("SCENARIO 3: Amount inflation (2x multiplier)")
    print("-"*70)
    inflated = test_data.copy()
    inflated['amount'] = inflated['amount'] * 2
    result = monitor.check_drift(inflated)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 4: Time shift (more late-night transactions)
    print("\n" + "-"*70)
    print("SCENARIO 4: Time shift (mostly late-night transactions)")
    print("-"*70)
    night_shift = test_data.copy()
    rng = np.random.default_rng(42)  # fixed seed so this scenario is reproducible
    night_shift['hour'] = rng.choice([0, 1, 2, 3, 22, 23], size=len(night_shift))
    result = monitor.check_drift(night_shift)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Generate detailed report for the most drifted scenario
    print("\n" + "-"*70)
    print("GENERATING DETAILED REPORT")
    print("-"*70)
    monitor.generate_report(night_shift, "drift_report.html")
    
    # Print summary
    print("\n" + "-"*70)
    print("MONITORING SUMMARY")
    print("-"*70)
    summary = monitor.summary()
    print(f"  Total checks: {summary['total_checks']}")
    print(f"  Total alerts: {summary['total_alerts']}")
    print(f"  Average drift share: {summary['avg_drift_share']:.1%}")
    print(f"  Maximum drift share: {summary['max_drift_share']:.1%}")
    
    # Print alerts
    alerts = monitor.get_alerts()
    if alerts:
        print(f"\n  Alerts ({len(alerts)}):")
        for alert in alerts:
            print(f"    [{alert['severity']}] {alert['message']}")
    
    print("\n" + "="*70)
    print("DRIFT DETECTION SIMULATION COMPLETE")
    print("="*70)
    print("\nOpen drift_report.html in your browser to see detailed visualizations!")


if __name__ == "__main__":
    simulate_drift_scenarios()

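Run the drift simulation from the project root. The script imports src.generate_data, so if Python reports that it cannot find the src module, run python -m src.monitoring instead: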

python src/monitoring.py

You'll see each scenario's drift share, drifted columns, and alert status printed to the terminal. Then open drift_report.html in your browser to inspect the per-feature drift visualizations.
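
To make the KS test in check_drift less of a black box, here is a small pure-Python sketch of the statistic it is built on: the largest gap between the empirical cumulative distribution functions of the two samples. It is for illustration only (the helper name `ks_statistic` is an assumption, and it does not compute the p-value that scipy.stats.ks_2samp also returns).

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_ref(x) - ECDF_cur(x)|."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")

    max_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to evaluate the gap at every distinct value seen in either sample.
    for x in set(ref) | set(cur):
        ref_cdf = bisect_right(ref, x) / len(ref)  # share of ref values <= x
        cur_cdf = bisect_right(cur, x) / len(cur)  # share of cur values <= x
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap
```

A value of 0 means the two empirical distributions are identical, and a value of 1 means the samples do not overlap at all. The p-value used in check_drift turns this gap into a significance decision that also accounts for sample size.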

6.3 Production Monitoring Strategy

In a production environment, you would:

Log all predictions to a database or data warehouse
Run drift checks periodically (hourly for high-traffic systems, daily for lower traffic)
Set up alerts when drift exceeds thresholds (integrate with PagerDuty, Slack, etc.)
Trigger retraining if drift is severe or sustained
Create dashboards to track drift over time (Grafana, Datadog, etc.)
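
The "severe or sustained" retraining rule in the list above can be made concrete. The sketch below works on dictionaries shaped like the ones DriftMonitor.check_drift appends to its history (it reads only the 'drift_share' and 'alert' keys). The function name and both default thresholds are assumptions to tune for your own system.

```python
from typing import Any, Dict, List


def should_retrain(
    history: List[Dict[str, Any]],
    severe_share: float = 0.5,
    sustained_checks: int = 3,
) -> bool:
    """Decide whether drift is severe (latest check) or sustained (recent checks)."""
    if not history:
        return False

    # Severe: the most recent check shows a large share of drifted features.
    if history[-1]["drift_share"] >= severe_share:
        return True

    # Sustained: every one of the last N checks raised an alert.
    recent = history[-sustained_checks:]
    return len(recent) == sustained_checks and all(r["alert"] for r in recent)
```

With a monitor instance, a scheduled job could call `should_retrain(monitor.history)` after each drift check and start the training pipeline when it returns True.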

Checkpoint: Running python src/monitoring.py should:

Show minimal drift for similar data (test set)
Show significant drift for modified data (fraud spike, inflation, time shift)
Generate an HTML report that you can view in your browser

7. Automate Testing and Deployment with CI/CD

What breaks without this: A typo in your code breaks the API. You deploy on Friday at 5 PM. Nobody notices until Monday. Fraud losses spike over the weekend.

CI/CD (Continuous Integration/Continuous Deployment) ensures reliable, repeatable releases. As JFrog notes: "A strong CI/CD pipeline enables ML teams to build robust, bug-free models more quickly and efficiently."

Why This Matters: In ML, changes aren't just code – they're also data and models. CI/CD ensures that when you change training logic, data preprocessing, or hyperparameters, tests verify the change doesn't break anything before it reaches production. It's the difference between deploying with confidence and deploying with crossed fingers.

7.1 Write Tests for Data and Model

Create tests/test_data_and_model.py:

# tests/test_data_and_model.py
"""
Tests for data quality and model performance.

These tests run in CI/CD to ensure:
1. Data meets quality requirements
2. Model meets performance thresholds
3. No regressions are introduced

Run with: pytest tests/test_data_and_model.py -v
"""
import pandas as pd
import pickle
import pytest
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

class TestDataQuality:
    """Tests for training data quality."""
    
    @pytest.fixture
    def train_data(self):
        return pd.read_csv("data/train.csv")
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_train_data_has_expected_columns(self, train_data):
        """Training data must have all required columns."""
        required_columns = {"amount", "hour", "day_of_week", "merchant_category", "is_fraud"}
        actual_columns = set(train_data.columns)
        missing = required_columns - actual_columns
        assert not missing, f"Missing columns: {missing}"
    
    def test_train_data_not_empty(self, train_data):
        """Training data must have rows."""
        assert len(train_data) > 0, "Training data is empty"
        assert len(train_data) >= 1000, f"Training data too small: {len(train_data)} rows"
    
    def test_no_negative_amounts(self, train_data):
        """Transaction amounts must be non-negative."""
        negative_count = (train_data["amount"] < 0).sum()
        assert negative_count == 0, f"Found {negative_count} negative amounts"
    
    def test_amounts_reasonable(self, train_data):
        """Transaction amounts should be within reasonable bounds."""
        max_amount = train_data["amount"].max()
        assert max_amount <= 100000, f"Max amount {max_amount} exceeds reasonable limit"
    
    def test_hours_valid(self, train_data):
        """Hours must be 0-23."""
        invalid = train_data[(train_data["hour"] < 0) | (train_data["hour"] > 23)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid hours"
    
    def test_days_valid(self, train_data):
        """Days of week must be 0-6."""
        invalid = train_data[(train_data["day_of_week"] < 0) | (train_data["day_of_week"] > 6)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid days"
    
    def test_merchant_categories_valid(self, train_data):
        """Merchant categories must be from known set."""
        valid_categories = {"grocery", "restaurant", "retail", "online", "travel"}
        actual_categories = set(train_data["merchant_category"].unique())
        invalid = actual_categories - valid_categories
        assert not invalid, f"Invalid merchant categories: {invalid}"
    
    def test_fraud_ratio_reasonable(self, train_data):
        """Fraud ratio should be realistic (between 0.1% and 50%)."""
        fraud_ratio = train_data["is_fraud"].mean()
        assert 0.001 <= fraud_ratio <= 0.5, f"Fraud ratio {fraud_ratio:.2%} is unrealistic"
    
    def test_no_nulls_in_critical_columns(self, train_data):
        """Critical columns must not have null values."""
        critical = ["amount", "hour", "day_of_week", "merchant_category", "is_fraud"]
        for col in critical:
            null_count = train_data[col].isnull().sum()
            assert null_count == 0, f"Column {col} has {null_count} null values"


class TestModelPerformance:
    """Tests for model performance thresholds."""
    
    @pytest.fixture
    def model_and_encoder(self):
        with open("models/model.pkl", "rb") as f:
            return pickle.load(f)
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_model_loads_successfully(self, model_and_encoder):
        """Model file must load without errors."""
        model, encoder = model_and_encoder
        assert model is not None, "Model is None"
        assert encoder is not None, "Encoder is None"
    
    def test_model_can_predict(self, model_and_encoder, test_data):
        """Model must be able to make predictions."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        predictions = model.predict(X)
        assert len(predictions) == len(X), "Prediction count mismatch"
    
    def test_accuracy_threshold(self, model_and_encoder, test_data):
        """Model accuracy must be at least 90%."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        accuracy = model.score(X, y)
        assert accuracy >= 0.90, f"Accuracy {accuracy:.2%} below 90% threshold"
    
    def test_f1_threshold(self, model_and_encoder, test_data):
        """Model F1-score must be at least 0.3 (sanity check for imbalanced data)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        f1 = f1_score(y, y_pred)
        assert f1 >= 0.3, f"F1-score {f1:.2f} below 0.3 threshold"
    
    def test_precision_not_zero(self, model_and_encoder, test_data):
        """Model precision must be greater than 0 (at least some fraud predictions are correct)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        precision = precision_score(y, y_pred, zero_division=0)
        assert precision > 0, "Model has zero precision (no correct fraud predictions)"
    
    def test_recall_not_zero(self, model_and_encoder, test_data):
        """Model recall must be greater than 0 (catches at least some fraud)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        recall = recall_score(y, y_pred, zero_division=0)
        assert recall > 0, "Model has zero recall (misses all fraud)"
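
The F1, precision, and recall tests above exist because accuracy alone is misleading on imbalanced data. The short sketch below shows why with a made-up dataset that assumes a 2% fraud rate: a model that never predicts fraud still passes a 90% accuracy threshold. The helper `classification_metrics` is written in plain Python for illustration and is not part of the test suite.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Define the ratios as 0.0 when their denominator is zero,
    # mirroring zero_division=0 in the scikit-learn calls above.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 1,000 transactions, 20 of them fraudulent (2%), and a model that always says "not fraud".
y_true = [1] * 20 + [0] * 980
always_legit = [0] * 1000
metrics = classification_metrics(y_true, always_legit)
print(metrics)
```

This useless model reaches 98% accuracy while its recall and F1 are both zero, which is exactly the failure the extra tests are designed to catch.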

Create tests/test_api.py:

# tests/test_api.py
"""
Tests for the FastAPI prediction service.

These tests ensure the API:
1. Returns correct responses for valid inputs
2. Rejects invalid inputs with proper error messages
3. Health check works

Run with: pytest tests/test_api.py -v
Note: Requires the API to be running on localhost:8000
"""
import pytest
import httpx

BASE_URL = "http://localhost:8000"

class TestPredictionEndpoint:
    """Tests for the /predict endpoint."""
    
    def test_valid_prediction_returns_200(self):
        """Valid input should return HTTP 200 with prediction."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        assert "is_fraud" in data
        assert "fraud_probability" in data
        assert isinstance(data["is_fraud"], bool)
        assert 0 <= data["fraud_probability"] <= 1
    
    def test_high_risk_transaction(self):
        """High-risk transaction should still return a well-formed probability."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 500.0,
            "hour": 3,  # Late night
            "day_of_week": 1,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        # We only check that the probability is well-formed. The exact value
        # depends on the trained model, so we don't assert that it is elevated.
        assert 0.0 <= data["fraud_probability"] <= 1.0
    
    def test_negative_amount_rejected(self):
        """Negative amount should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": -100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
        assert "errors" in response.json()["detail"]
    
    def test_invalid_hour_rejected(self):
        """Invalid hour should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 25,  # Invalid
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_invalid_merchant_rejected(self):
        """Unknown merchant category should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "unknown_category"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_missing_field_rejected(self):
        """Missing required field should be rejected."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14
            # Missing day_of_week and merchant_category
        }, timeout=10)
        
        assert response.status_code == 422  # Pydantic validation error


class TestHealthEndpoint:
    """Tests for the /health endpoint."""
    
    def test_health_returns_200(self):
        """Health endpoint should return 200."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        assert response.status_code == 200
    
    def test_health_returns_healthy_status(self):
        """Health endpoint should indicate healthy status."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        data = response.json()
        assert data["status"] == "healthy"

Run tests locally:

# Run data and model tests (API not needed)
pytest tests/test_data_and_model.py -v

# Run API tests (requires API to be running)
pytest tests/test_api.py -v

7.2 GitHub Actions Workflow

⚠️ Note for Production Teams
In real ML teams, you typically don't retrain full models inside CI — it's slow and resource-intensive.
Here we do it to keep everything local, reproducible, and self-contained for learning.
Production pipelines usually separate training (scheduled jobs) from testing (CI/CD).

Create .github/workflows/ci.yml:

# .github/workflows/ci.yml
name: ML Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: 'pip'
      
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      
      - name: Generate training data
        run: python src/generate_data.py
      
      - name: Train model
        run: python src/train_naive.py
      
      - name: Run data and model tests
        run: pytest tests/test_data_and_model.py -v --tb=short
      
      - name: Build Docker image
        run: docker build -t fraud-detection-api .
      
      - name: Run container for API tests
        run: |
          docker run -d -p 8000:8000 --name test-api fraud-detection-api
          sleep 10  # Wait for API to start
          curl -f http://localhost:8000/health || exit 1
      
      - name: Run API tests
        run: pytest tests/test_api.py -v --tb=short
      
      - name: Cleanup
        if: always()
        run: docker stop test-api || true
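
The fixed sleep 10 in the workflow above can be too short on a slow runner and wastes time on a fast one. One alternative is to poll the health endpoint until it answers. The sketch below uses only the standard library; the function names and default timings are assumptions, not part of the tutorial's project.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def api_is_healthy(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all mean "not ready".
        return False


def wait_until(probe: Callable[[], bool], timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In CI, a small script could call `wait_until(api_is_healthy)` and exit with a non-zero status when it returns False, so the job fails fast with a clear cause instead of failing later inside the API tests.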

7.3 Dockerize the Application

Create Dockerfile:

# Dockerfile
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ src/
COPY models/ models/
COPY data/ data/

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run the API
CMD ["uvicorn", "src.serve_validated:app", "--host", "0.0.0.0", "--port", "8000"]

Create .dockerignore:

# .dockerignore
venv/
__pycache__/
*.pyc
.git/
.github/
mlruns/
*.db
*.html
.pytest_cache/

Build and run locally:

# Build the Docker image
docker build -t fraud-detection-api .

# Run the container
docker run -p 8000:8000 fraud-detection-api

# Test it
curl http://localhost:8000/health

Checkpoint:

All tests pass: pytest tests/test_data_and_model.py -v
Docker image builds successfully
Container runs and responds to health checks

8. Incident Response Playbook

When things go wrong in production (and they will), you need a plan. This section provides playbooks for common ML incidents.

Scenario: False Positive Spike

Symptoms: Your fraud model suddenly flags 40% of legitimate transactions as fraud, blocking customers and overwhelming your manual review team.

Severity: HIGH - Direct customer impact

Phase 1: Mitigation (0-5 minutes)

Acknowledge the incident - Notify stakeholders that you're aware and responding
Roll back to previous model - In MLflow UI, move the @champion alias to the previous model version
Restart the API - docker restart fraud-api or redeploy
Verify - Check that false positive rate has returned to normal
Communicate - "Issue detected and mitigated. Investigating root cause."

Phase 2: Diagnosis (5-60 minutes)

Check drift report - Run python src/monitoring.py with recent production data
Check data validation logs - Did upstream data format change?
Check recent deployments - Was there a new model or code deployed recently?
Compare metrics - What's different between the rolled-back and problematic model?

Example root causes:

Upstream system sent amounts in cents instead of dollars
New merchant category appeared that wasn't in training data
Holiday shopping patterns differed significantly from training data

Phase 3: Remediation (1-24 hours)

Fix the root cause - Add validation for the edge case, or update training data
Retrain if needed - Include new patterns in training data
Add test case - Prevent this from happening again
Document - Add to runbook for future reference

Scenario: Gradual Performance Decay

Symptoms: Monitoring shows fraud recall dropping 2% per week over a month. No sudden failures, just slow degradation.

Severity: MEDIUM - Gradual impact, time to respond

Response:

Investigate drift report - Look for gradual distribution changes
```
python src/monitoring.py
```
Collect recent labeled data - Get confirmed fraud cases from the past month
Analyze patterns - What's different about recent fraud?
- New attack vectors?
- Different time patterns?
- New merchant categories?
Retrain on combined data - Include both old and new patterns
```
python src/train_mlflow.py
```
Deploy via canary - Route 10% of traffic to the new model first
- Monitor metrics for 1-2 days
- If metrics improve, increase to 50%, then 100%
- If metrics worsen, roll back
Set up recurring retraining - Schedule weekly or monthly retraining
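
The canary step above needs a way to send a stable fraction of traffic to the new model. A common approach, sketched here under the assumption that each request carries some stable identifier (the `request_id` parameter and function name are hypothetical), is to hash that identifier into one of 100 buckets:

```python
import hashlib


def route_to_canary(request_id: str, canary_percent: int = 10) -> bool:
    """Return True if this request should be served by the canary model."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")

    # Hashing keeps the decision deterministic: the same ID always lands
    # in the same bucket, so a retried request hits the same model.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent
```

Raising `canary_percent` from 10 to 50 to 100 only adds buckets, so requests already routed to the canary stay there as the rollout widens, and setting it to 0 is an immediate rollback.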

Scenario: Upstream Data Schema Change

Symptoms: API starts returning 500 errors. Logs show KeyError: 'merchant_category'.

Severity: HIGH - Service is down

Response:

Check error logs - Identify the exact error
```
KeyError: 'merchant_category'
```
Check upstream data - Did the field name change?
- merchant_category -> category
- amount -> transaction_amount

Immediate fix - Add field name mapping

# Quick fix in API
if 'category' in data and 'merchant_category' not in data:
    data['merchant_category'] = data['category']

Long-term fix - Add validation that catches schema changes

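required_fields = ['amount', 'hour', 'day_of_week', 'merchant_category']
missing = [f for f in required_fields if f not in data]
if missing:
    # Reuse the API's HTTPException so callers get a clear 400 instead of a 500
    raise HTTPException(status_code=400, detail={"message": "Missing fields", "errors": missing})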

Add integration test - Test with upstream system in CI/CD
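
The field mapping and the missing-field check from this scenario can be combined into one normalization step that runs before validation. This is a sketch only: the alias table covers just the two renames listed above, and the function name `normalize_payload` is an assumption.

```python
from typing import Any, Dict, List, Tuple

# Upstream name -> name the model expects (the two renames from this scenario).
FIELD_ALIASES = {"category": "merchant_category", "transaction_amount": "amount"}
REQUIRED_FIELDS = ["amount", "hour", "day_of_week", "merchant_category"]


def normalize_payload(raw: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Rename known upstream aliases and report any required fields still missing."""
    data = dict(raw)  # copy so the caller's payload is not mutated
    for old_name, new_name in FIELD_ALIASES.items():
        # Never overwrite a field that already arrived under its expected name.
        if old_name in data and new_name not in data:
            data[new_name] = data.pop(old_name)

    missing = [field for field in REQUIRED_FIELDS if field not in data]
    return data, missing
```

The same function is easy to exercise in an integration test: feed it a sample payload captured from the upstream system and assert that `missing` is empty.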

9. How to Put It All Together

Let's step back and appreciate what we've built. Our initial naive system has transformed into a local ML platform with production-grade components.

💡 Mental Model: Each tool in this stack is a "catch net" for a specific failure mode:

MLflow catches "which model is this?"

Feast catches "are features consistent?"

Great Expectations catches "is this data valid?"

Evidently catches "has the world changed?"

CI/CD catches "did we break something?"

Together, they form defense-in-depth for ML systems.

Component	Tool	Problem Solved
Experiment Tracking	MLflow	Every run logged, reproducible
Model Registry	MLflow	Versioned models, rollback capability
Feature Store	Feast	Consistent features, no training-serving skew
Data Validation	Great Expectations	Bad data rejected with clear errors
Monitoring	Evidently	Drift detected before it causes problems
Containerization	Docker	Environment consistency everywhere
CI/CD	GitHub Actions	Automated testing and safe deployments

The Complete Workflow

Here's how all the pieces work together in practice:

Data arrives - New transaction data comes in from upstream systems
Validation gate - Great Expectations rules check data quality. Bad data is rejected with clear error messages before it can cause harm.
Feature computation - Feast computes features using the same definitions for both training and serving. No more training-serving skew.
Training - When you retrain, MLflow logs all parameters, metrics, and artifacts. Every experiment is reproducible and comparable.
Model registry - Trained models are automatically versioned. You can compare metrics, promote the best to Production, and roll back if needed.
Serving - FastAPI loads the @champion model from MLflow. Each request is validated, features are retrieved from Feast, and predictions are returned.
Monitoring - Evidently checks for drift periodically. If input distributions change significantly, alerts are triggered.
Retraining loop - When drift is detected, you retrain on new data, compare metrics, and promote if better. The cycle continues.
CI/CD safety net - All code changes go through automated tests. Docker ensures environment consistency. Nothing reaches production without passing the pipeline.

10. What's Next: Scale to Production

This project runs locally, but the principles and tools extend directly to production deployments. Here's how each component scales:

Scaling Feast for Production

We used Feast with a local SQLite online store and Parquet files as the offline store. For production:

Component	Local	Production
Online Store	SQLite	Redis, DynamoDB, or PostgreSQL
Offline Store	Parquet files	BigQuery, Snowflake, or Redshift
Feature Server	Embedded	Dedicated Feast serving cluster

Benefits at scale:

Sub-10ms feature retrieval
Horizontal scaling for high throughput
Feature monitoring and statistics
Point-in-time joins at petabyte scale

Scaling MLflow for Production

Component	Local	Production
Backend Store	SQLite	PostgreSQL or MySQL
Artifact Store	Local filesystem	S3, GCS, or Azure Blob
Tracking Server	Single instance	Load-balanced cluster

Kubernetes Deployment

When you outgrow Docker Compose:

KServe or Seldon for serverless model serving with auto-scaling
Horizontal Pod Autoscaler to scale based on CPU/memory/custom metrics
Canary deployments to safely roll out new models (route 10% traffic first)
GPU scheduling for inference-heavy models

Advanced Monitoring

Expand observability with:

Prometheus + Grafana for real-time dashboards
OpenTelemetry for distributed tracing
PagerDuty/Slack integration for alerts
Labeled data collection for continuous model evaluation

A/B Testing and Multi-Armed Bandits

Use the model registry to:

Serve multiple models concurrently (champion vs challengers)
Route traffic dynamically based on context
Collect metrics for each model variant
Automatically promote the best performer
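
One way to split traffic between a champion and a challenger is deterministic, weighted hashing. The sketch below is an illustration only: `pick_variant` is a hypothetical helper, the variant names are placeholders, and loading the chosen model from the registry is not shown.

```python
import hashlib


def pick_variant(request_key: str, weights: dict[str, float]) -> str:
    """Deterministically map a request key to a model variant by weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Hash the key to a stable point in [0, 1) so the same user
    # always gets the same variant across requests.
    digest = hashlib.sha256(request_key.encode("utf-8")).hexdigest()
    point = int(digest[:8], 16) / 16**8
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return name
    return name  # guard against floating-point rounding at the upper edge


variant = pick_variant("user-123", {"champion": 0.9, "challenger": 0.1})
```

Hashing a user or session ID instead of picking at random keeps each user on one variant, which makes per-variant metrics easier to compare. The same function covers a canary rollout by moving the challenger weight from 0.1 to 0.5 to 1.0.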

Conclusion

Congratulations on building a production-ready ML system on your local machine!

What we assembled here is a microcosm of real-world ML platforms:

We started with just a model saved to a pickle file
We ended up with MLOps best practices: experiment tracking, model versioning, feature stores, data validation, monitoring, containerization, and CI/CD

The tools we used are production-grade:

MLflow powers ML platforms at companies like Microsoft, Facebook, and Databricks
Feast is used by companies like Gojek, Shopify, and Robinhood
FastAPI is one of the fastest Python web frameworks
Great Expectations is used at companies like GitHub and Shopify
Evidently is used for monitoring ML in production at scale

The principles apply at any scale:

Always track experiments
Always version models
Always validate data
Always monitor for drift
Always containerize for consistency
Always automate testing

Next Steps You Can Try

Deploy to the cloud - Push your Docker container to AWS ECS, Google Cloud Run, or Azure Container Instances
Add model explainability - Use SHAP or LIME to explain individual predictions
Implement A/B testing - Serve multiple models and compare performance
Add feature importance monitoring - Track how feature importance changes over time
Set up real-time alerting - Connect Evidently to Slack or PagerDuty
Implement continuous training - Automatically retrain when drift is detected
Add bias and fairness monitoring - Ensure your model treats all groups fairly

Remember that productionizing ML is an iterative process. There's always another layer of robustness to add, another edge case to handle, another metric to track. But with the foundation you've built here, you're well on your way to taking models from promising notebook experiments to deployed, monitored, and maintainable production applications.

Happy building, and may your models be accurate and your pipelines resilient!

Get the Complete Code

The entire project from this handbook is available as a public GitHub repository:

🔗 github.com/sandeepmb/freecodecamp-local-ml-platform

The repository includes:

All source code (src/ directory)
Test files (tests/ directory)
Feast feature definitions (feature_repo/)
Docker and CI/CD configuration
Ready-to-run scripts

Quick Start:

git clone https://github.com/sandeepmb/freecodecamp-local-ml-platform.git
cd freecodecamp-local-ml-platform
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python src/generate_data.py
python src/train_naive.py

References

MLflow Documentation - Experiment tracking and model registry
Feast Documentation - Feature store
Feast Quickstart - Getting started with Feast
FastAPI Documentation - Modern Python web framework
Great Expectations - Data validation
Evidently AI Documentation - ML monitoring
CI/CD for Machine Learning (JFrog) - CI/CD best practices
Training-Serving Skew Explained - Understanding skew
Docker Documentation - Containerization
GitHub Actions Documentation - CI/CD automation

How to Ship a Production-Ready RAG App with FAISS (Guardrails, Evals, and Fallbacks)

Chidozie Managwu — Mon, 16 Mar 2026 17:43:51 +0000

Most LLM applications look great in a high-fidelity demo. Then they hit the hands of real users and start failing in very predictable yet damaging ways.

They answer questions they should not, they break when document retrieval is weak, they time out due to network latency, and nobody can tell exactly what happened because there are no logs and no tests.

In this tutorial, you’ll build a beginner-friendly Retrieval Augmented Generation (RAG) application designed to survive production realities. This isn’t just a script that calls an API. It’s a system featuring a FastAPI backend, a persisted FAISS vector store, and essential safety guardrails (including a retrieval gate and fallbacks).

Why RAG Alone Does Not Equal Production-Ready
The Architecture You Are Building
Project Setup and Structure
How to Build the RAG Layer with FAISS
How to Add the LLM Call with Structured Output
How to Add Guardrails: Retrieval Gate and Fallbacks
FastAPI App: Creating the /answer Endpoint
How to Add Beginner-Friendly Evals
What to Improve Next: Realistic Upgrades

Why RAG Alone Does Not Equal Production-Ready

Retrieval Augmented Generation (RAG) is often hailed as the hallucination killer. By grounding the model in retrieved text, we provide it with the facts it needs to be accurate. But simply connecting a vector database to an LLM isn’t enough for a production environment.

Production issues usually arise from the silent failures in the system surrounding the model:

Weak retrieval: If the app retrieves irrelevant chunks of text, the model tries to bridge the gap by inventing an answer anyway. Without a designated “I do not know” path, the model is essentially forced to hallucinate.
Lack of visibility: Without structured outputs and basic logging, you can’t tell if bad retrieval, a confusing prompt, or a model update caused a wrong answer.
Fragility: A simple API timeout or malformed provider response becomes a user-facing outage if you don’t implement fallbacks.
No regression testing: In traditional software, we have unit tests. In AI, we need evals. Without them, a small tweak to your prompt might fix one issue but break ten others without you realising it.

We’ll solve each of these issues systematically in this guide.

Prerequisites

This tutorial is beginner-friendly, but it assumes you have a few basics in place so you can focus on building a robust RAG system instead of getting stuck on setup issues.

Knowledge

You should be comfortable with:

Python fundamentals (functions, modules, virtual environments)
Basic HTTP + JSON (requests, response payloads)
APIs with FastAPI (what an endpoint is and how to run a server)
High-level LLM concepts (prompting, temperature, structured outputs)

Tools + Accounts

You’ll need:

Python 3.10+
A working OpenAI-compatible API key (OpenAI or any provider that supports the same request/response shape)
A local environment where you can run a FastAPI app (Mac/Linux/Windows)

What This Tutorial Covers (and What It Doesn’t)

We’ll build a production-minded baseline:

A FAISS-backed retriever with a persisted index + metadata
A retrieval gate to prevent “forced hallucination”
Structured JSON outputs so your backend is stable
Fallback behavior for timeouts and provider errors
A small eval harness to prevent regressions

We won’t implement advanced upgrades such as rerankers, semantic chunking, auth, or background jobs. Those appear only as a roadmap at the end.

The Architecture You Are Building

The flow of our application follows a disciplined path so every answer is grounded in evidence:

User query: The user submits a question via a FastAPI endpoint.
Retrieval: The system embeds the question and retrieves the top-k most similar document chunks.
The retrieval gate: We evaluate the similarity score. If the context is not relevant enough, we stop immediately and refuse the query.
Augmentation and generation: If the gate passes, we send a context-augmented prompt to the LLM.
Structured response: The model returns a JSON object containing the answer, sources used, and a confidence level.

Project Setup and Structure

To keep things organized and maintainable, we’ll use a modular structure. This allows you to swap out your LLM provider or your vector database without rewriting your entire core application.

Project Structure

.
├── app.py              # FastAPI entry point and API logic
├── rag.py              # FAISS index, persistence, and document retrieval
├── llm.py              # LLM API interface and JSON parsing
├── prompts.py          # Centralized prompt templates
├── data/               # Source .txt documents
├── index/              # Persisted FAISS index and metadata
└── evals/              # Evaluation dataset and runner script
    ├── eval_set.json
    └── run_evals.py

Install Dependencies

First, create a virtual environment to isolate your project:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install fastapi uvicorn faiss-cpu numpy pydantic requests python-dotenv

Configure the Environment

Create a .env file in the root directory. We are targeting OpenAI-compatible providers:

OPENAI_API_KEY=your_actual_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o-mini

Important note on compatibility: The code below assumes an OpenAI-style API. If you use a provider that is not compatible, you must change the URL, headers (for example X-API-Key), and the way you extract embeddings and final message content in embed_texts() and call_llm().

How to Build the RAG Layer with FAISS

In rag.py, we handle the “Retriever” part of RAG. This involves turning raw text into mathematical vectors that the computer can compare.

What is FAISS (and What Does It Do)?

FAISS (Facebook AI Similarity Search) is a fast library for vector similarity search. In a RAG system, each chunk of text becomes an embedding vector (a list of floats). FAISS stores those vectors in an index so you can quickly ask:

“Given this question embedding, which document chunks are closest to it?”

In this tutorial, we use IndexFlatIP (an exact inner-product index) and normalise vectors with faiss.normalize_L2(...). With normalised vectors, the inner product equals cosine similarity, which is bounded between -1 and 1 and gives us a stable score we can use for a retrieval gate.
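
To see why normalising first turns the inner product into cosine similarity, here is a small plain-Python check. It does not use FAISS, and the two vectors are arbitrary illustration values:

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


a = [3.0, 4.0]
b = [4.0, 3.0]

# Inner product of the unit-length versions of a and b ...
ip_of_normalized = inner_product(l2_normalize(a), l2_normalize(b))
# ... matches the cosine similarity of the raw vectors (mathematically 24/25).
assert math.isclose(ip_of_normalized, cosine_similarity(a, b))
```

This is the reason both the stored embeddings and the query embedding need to be normalised: if only one side is normalised, the scores are no longer cosine similarities.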

Chunking Strategy With Overlap

We’ll use chunking with overlap. If we split a document at exactly 1,000 characters, we might cut a sentence in half and lose its meaning. With an overlap of, say, 200 characters, the end of one chunk and the beginning of the next share context.
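
A tiny example makes the overlap visible. This is a simplified sketch with deliberately small numbers (size 8, overlap 3); `chunk_with_overlap` is a hypothetical stand-in that leaves out the whitespace stripping done in the real implementation:

```python
def chunk_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    step = max(1, size - overlap)  # how far the window moves each time
    return [text[i : i + size] for i in range(0, len(text), step)]


chunks = chunk_with_overlap("abcdefghijklmnopqrst", size=8, overlap=3)
print(chunks)
# → ['abcdefgh', 'fghijklm', 'klmnopqr', 'pqrst']
```

Each chunk starts with the last three characters of the previous one ("fgh", "klm", "pqr"), so text that straddles a boundary appears whole in at least one chunk.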

Implementation of `rag.py`

import os
import faiss
import numpy as np
import requests
import json
from typing import List, Dict
from dotenv import load_dotenv

load_dotenv()

INDEX_PATH = "index/faiss.index"
META_PATH = "index/meta.json"

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
    chunks = []
    step = max(1, size - overlap)
    for i in range(0, len(text), step):
        chunk = text[i : i + size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

def embed_texts(texts: List[str]) -> np.ndarray:
    # Note: If your provider is not OpenAI-compatible, change this URL and headers
    url = f"{os.getenv('OPENAI_BASE_URL')}/embeddings"
    headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"}
    payload = {"input": texts, "model": "text-embedding-3-small"}

    resp = requests.post(url, headers=headers, json=payload, timeout=30)
    resp.raise_for_status()
    # If your provider uses a different response format, change the line below
    vectors = np.array([item["embedding"] for item in resp.json()["data"]], dtype="float32")
    return vectors

def build_index() -> None:
    all_chunks: List[str] = []
    metadata: List[Dict] = []

    if not os.path.exists("data"):
        os.makedirs("data")
        return

    for file in os.listdir("data"):
        if not file.endswith(".txt"):
            continue

        with open(f"data/{file}", "r", encoding="utf-8") as f:
            text = f.read()

        chunks = chunk_text(text)
        all_chunks.extend(chunks)
        for c in chunks:
            metadata.append({"source": file, "text": c})

    if not all_chunks:
        return

    embeddings = embed_texts(all_chunks)
    faiss.normalize_L2(embeddings)

    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)

    os.makedirs("index", exist_ok=True)
    faiss.write_index(index, INDEX_PATH)

    with open(META_PATH, "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False)

def load_index():
    if not (os.path.exists(INDEX_PATH) and os.path.exists(META_PATH)):
        raise FileNotFoundError(
            "FAISS index not found. Add .txt files to data/ and run build_index()."
        )

    index = faiss.read_index(INDEX_PATH)
    with open(META_PATH, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    return index, metadata

def retrieve(query: str, k: int = 5) -> List[Dict]:
    index, metadata = load_index()

    q_emb = embed_texts([query])
    faiss.normalize_L2(q_emb)

    scores, ids = index.search(q_emb, k)
    results = []
    for score, idx in zip(scores[0], ids[0]):
        if idx == -1:
            continue
        m = metadata[idx]
        results.append(
            {"score": float(score), "source": m["source"], "text": m["text"], "id": int(idx)}
        )
    return results

How to Add the LLM Call with Structured Output

A major failure point in AI apps is the “chatty” nature of LLMs. If your backend expects a list of sources but the LLM returns conversational filler, your code will crash.

We solve this with structured output: instruct the model to return a strict JSON object, then parse it safely.

Implementation of `llm.py`

import json
import requests
import os
from typing import Dict, Any

def call_llm(system_prompt: str, user_prompt: str) -> Dict[str, Any]:
    # Note: Change URL/Headers if using a non-OpenAI compatible provider
    url = f"{os.getenv('OPENAI_BASE_URL')}/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        "Content-Type": "application/json",
    }

    payload = {
        "model": os.getenv("OPENAI_MODEL"),
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "response_format": {"type": "json_object"},
        "temperature": 0,
    }

    try:
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        resp.raise_for_status()
        content = resp.json()["choices"][0]["message"]["content"]

        parsed = json.loads(content)
        parsed.setdefault("answer", "")
        parsed.setdefault("refusal", False)
        parsed.setdefault("confidence", "medium")
        parsed.setdefault("sources", [])
        return parsed

    except (requests.Timeout, requests.ConnectionError):
        return {
            "answer": "The system is temporarily unavailable (network issue). Please try again.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "network_error",
        }
    except Exception:
        return {
            "answer": "A system error occurred while generating the answer.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "unknown_error",
        }
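
The setdefault() calls in call_llm() fill in missing keys, but they don't check types: a model could still return sources as a string or confidence as an unexpected value. As an optional hardening step, a stricter parser might look like the sketch below. `coerce_llm_output` and `ALLOWED_CONFIDENCE` are hypothetical names, not part of the tutorial's code:

```python
import json
from typing import Any, Dict

ALLOWED_CONFIDENCE = {"low", "medium", "high"}


def coerce_llm_output(content: str) -> Dict[str, Any]:
    """Parse model output and force each field into the expected type."""
    try:
        raw = json.loads(content)
    except json.JSONDecodeError:
        raw = {}
    if not isinstance(raw, dict):
        raw = {}

    answer = raw.get("answer")
    confidence = raw.get("confidence")
    sources = raw.get("sources")
    has_answer = isinstance(answer, str) and answer.strip() != ""

    return {
        "answer": answer if has_answer else "",
        # An unusable answer is treated as a refusal instead of an empty reply.
        "refusal": raw.get("refusal") is True or not has_answer,
        "confidence": confidence
        if isinstance(confidence, str) and confidence in ALLOWED_CONFIDENCE
        else "low",
        # Keep only string entries; anything else is dropped.
        "sources": [s for s in sources if isinstance(s, str)]
        if isinstance(sources, list)
        else [],
    }
```

With a helper like this, the json.loads() and setdefault() lines in call_llm() could be replaced by a single call, and every response would have the same four keys with predictable types.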

How to Add Guardrails: Retrieval Gate and Fallbacks

Guardrails are interceptors. They sit between the user and the model to prevent predictable failures.

The Retrieval Gate: How It Works and How to Add It

In a standard RAG pipeline, the system always calls the LLM. If the user asks an irrelevant question, the retriever will still return the “closest” (but wrong) chunks.

The solution is the retrieval gate:

Retrieve top-k chunks and get the top similarity score
If the score is below a threshold (for example 0.30), refuse immediately
Only call the LLM when retrieval is strong enough to ground the answer

A threshold of 0.30 is a reasonable starting point when scores are cosine similarities (inner products of normalised vectors), but you should tune it against your own documents using the evals covered later in this tutorial.
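
One simple way to tune the threshold is to log the top score for a set of labelled questions and see how often each candidate threshold makes the right call. The sketch below assumes you have collected such pairs yourself. `gate_accuracy` is a hypothetical helper, and the scores are made-up illustration values, not measurements:

```python
def gate_accuracy(cases: list[tuple[float, bool]], threshold: float) -> float:
    """Fraction of cases where the gate decision matches the label.

    Each case is (top_score, should_answer). The gate answers when
    top_score >= threshold and refuses otherwise.
    """
    correct = sum(
        (score >= threshold) == should_answer for score, should_answer in cases
    )
    return correct / len(cases)


# Made-up scores purely for illustration; replace them with top_score
# values logged from your own eval runs.
labelled = [
    (0.62, True),
    (0.48, True),
    (0.35, True),
    (0.28, False),
    (0.22, False),
    (0.41, False),
]

for threshold in (0.20, 0.30, 0.40, 0.50):
    accuracy = gate_accuracy(labelled, threshold)
    print(f"threshold={threshold:.2f} accuracy={accuracy:.2f}")
```

A threshold that is too low lets weak retrievals through to the LLM, while one that is too high refuses questions the documents can answer, so it's worth checking both kinds of error rather than only the overall accuracy.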

Fallbacks and Why They Matter

Fallbacks ensure that if an API fails or times out, the user gets a helpful message instead of a crash. They also keep your API response shape consistent, which prevents frontend errors and makes logging meaningful.

In this tutorial, fallbacks are implemented inside call_llm() so your FastAPI layer stays simple.
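
Note that retrieve() also makes a network call (the embedding request), and an error there is not covered by call_llm(). If you want the same protection around retrieval, one option is a small wrapper like the sketch below. `fallback_response` and `run_with_fallback` are hypothetical helpers, and the payload simply mirrors the keys used by the fallbacks in call_llm():

```python
import logging
from typing import Any, Callable, Dict, Optional, Tuple

logger = logging.getLogger("rag_app")


def fallback_response(error_type: str) -> Dict[str, Any]:
    """Build a refusal payload with the same keys as a normal answer."""
    return {
        "answer": "The system is temporarily unavailable. Please try again.",
        "refusal": True,
        "confidence": "low",
        "sources": [],
        "error_type": error_type,
    }


def run_with_fallback(
    step: Callable[[], Any], error_type: str
) -> Tuple[Any, Optional[Dict[str, Any]]]:
    """Run step(); return (result, None) on success or (None, fallback) on failure."""
    try:
        return step(), None
    except Exception:
        # Log the full traceback so the failure is still debuggable.
        logger.exception("step failed: %s", error_type)
        return None, fallback_response(error_type)


# Possible use inside an endpoint (retrieve and question come from the app):
#   results, fallback = run_with_fallback(lambda: retrieve(question, k=5), "retrieval_error")
#   if fallback is not None:
#       return fallback
```

Returning the fallback as data instead of raising keeps the response shape identical on success and failure, which is the property the frontend relies on.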

FastAPI App: Creating the /answer Endpoint

The app.py file is the conductor. It ties retrieval, guardrails, prompting, and generation together.

Implementation of `app.py`

from fastapi import FastAPI
from pydantic import BaseModel
from rag import retrieve
from llm import call_llm
import prompts
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_app")

app = FastAPI(title="Production-Ready RAG")

class QueryRequest(BaseModel):
    question: str

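# Plain def (not async def): retrieve() and call_llm() make blocking requests calls,
# and FastAPI runs sync endpoints in a threadpool so they don't stall the event loop.
@app.post("/answer")
def get_answer(req: QueryRequest):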
    start_time = time.time()
    question = (req.question or "").strip()

    if not question:
        return {
            "answer": "Please provide a non-empty question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
        }

    # 1) Retrieval
    results = retrieve(question, k=5)
    top_score = results[0]["score"] if results else 0.0

    logger.info("query=%r top_score=%.3f num_results=%d", question, top_score, len(results))

    # 2) Retrieval Gate (Guardrail)
    if top_score < 0.30:
        return {
            "answer": "I do not have documents to answer that question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
            "retrieval": {"top_score": top_score, "k": 5},
        }

    # 3) Augment
    context_text = "\n\n".join([f"Source {r['source']}: {r['text']}" for r in results])
    user_prompt = f"Context:\n{context_text}\n\nQuestion: {question}"

    # 4) Generation with Fallback
    response = call_llm(prompts.SYSTEM_PROMPT, user_prompt)

    # 5) Attach debug metadata
    response["latency_sec"] = round(time.time() - start_time, 2)
    response["retrieval"] = {"top_score": top_score, "k": 5}
    return response

Centralized Prompt Template: prompts.py

A small but important habit: keep prompts centralised so they’re versionable and easy to evaluate.

Example `prompts.py`

SYSTEM_PROMPT = """You are a RAG assistant. Use ONLY the provided Context to answer.
If the context does not contain the answer, respond with refusal=true.

Return a valid JSON object with exactly these keys:
- answer: string
- refusal: boolean
- confidence: "low" | "medium" | "high"
- sources: array of strings (source filenames you used)

Do not include any extra keys. Do not include markdown. Do not include commentary."""

How to Add Beginner-Friendly Evals

In AI systems, outputs are probabilistic, which makes testing harder than in traditional software. Evals (evaluations) are a set of “golden questions” and “expected behaviours” you run repeatedly to detect regressions.

Instead of “does it output exactly this string,” you test:

Should the app refuse when the retrieval is weak?
When it answers, does it include sources?
Is the behaviour stable across prompt tweaks and model changes?

Step 1: Create `evals/eval_set.json`

This should contain both positive and negative cases.

[
  {
    "id": "in_scope_01",
    "question": "What is a retrieval gate and why is it important?",
    "expect_refusal": false,
    "notes": "Should explain gating and relate it to hallucination prevention."
  },
  {
    "id": "out_of_scope_01",
    "question": "What is the capital of France?",
    "expect_refusal": true,
    "notes": "If the knowledge base only includes our docs, the app should refuse."
  },
  {
    "id": "edge_01",
    "question": "",
    "expect_refusal": true,
    "notes": "Empty input should not call the LLM."
  }
]

Step 2: Create `evals/run_evals.py`

This runner calls your API endpoint (end-to-end) and checks expected behaviours.

import json
import requests

API_URL = "http://127.0.0.1:8000/answer"

def run():
    with open("evals/eval_set.json", "r", encoding="utf-8") as f:
        cases = json.load(f)

    passed = 0
    failed = 0

    for case in cases:
        resp = requests.post(API_URL, json={"question": case["question"]}, timeout=60)
        resp.raise_for_status()
        out = resp.json()

        got_refusal = bool(out.get("refusal", False))
        expect_refusal = bool(case["expect_refusal"])

        ok = (got_refusal == expect_refusal)

        # Beginner-friendly: if it answers, sources should exist and be a list
        if not got_refusal:
            ok = ok and isinstance(out.get("sources"), list)

        if ok:
            passed += 1
            print(f"PASS {case['id']}")
        else:
            failed += 1
            print(f"FAIL {case['id']} expected_refusal={expect_refusal} got_refusal={got_refusal}")
            print("Output:", json.dumps(out, indent=2))

    print(f"\nDone. Passed={passed} Failed={failed}")
    if failed:
        raise SystemExit(1)

if __name__ == "__main__":
    run()

How to Use Evals in Practice

Run your server:

uvicorn app:app --reload

In another terminal, run evals:

python evals/run_evals.py

If an eval fails, you have a concrete signal that something changed in retrieval, gating, prompting, or provider behaviour.

What to Improve Next: Realistic Upgrades

Building a reliable RAG app is iterative. Here are realistic next steps:

Semantic chunking: Break text based on meaning instead of character count.
Reranking: Use a cross-encoder reranker to reorder the top-k chunks for higher precision.
Metadata filtering: Filter results by category, date, or department to reduce false positives.
Better citations: Store chunk IDs and show exactly which chunk(s) the answer came from.
Observability: Add request IDs, structured logs, and traces so “what happened?” is answerable.
Async + background indexing: Move index building to a background job and keep the API responsive.

Final Thoughts: Production-Ready Is a Set of Habits

Building an AI application that survives in the real world is about building a system that is predictable, measurable, and safe.

Retrieval quality is measurable: Use similarity scores to gate your LLM.
Refusal is a feature: It is better to say “I do not know” than to lie.
Fallbacks are mandatory: Design for the moment the API goes down.
Evals prevent regressions: Never deploy a change without running your tests.

About Me

I am Chidozie Managwu, an award-winning AI Product Architect and founder focused on helping global tech talent build real, production-ready skills. I contribute to global AI initiatives as a GAFAI Delegate and lead AI Titans Network, a community for developers learning how to ship AI products.

My work has been recognized with the Global Tech Hero award and featured on platforms like HackerNoon.

How to Build End-to-End LLM Observability in FastAPI with OpenTelemetry

Jessica Patel — Fri, 13 Mar 2026 16:13:16 +0000

This article shows how to build end-to-end, code-first LLM observability in a FastAPI application using the OpenTelemetry Python SDK.

Instead of relying on vendor-specific agents or opaque SDKs, we will manually design traces, spans, and semantic attributes that capture the full lifecycle of an LLM-powered request.

Introduction
Prerequisites and Technical Context
Why LLM Observability Is Fundamentally Different
Reference Architecture: A Traceable RAG Request
Reference Architecture Explained
Why This Design Is Better Than Simpler Alternatives
LLM Models That Work Best for This Architecture
OpenTelemetry Primer (LLM-Relevant Concepts Only)
Designing LLM-Aware Spans
FastAPI Example: End-to-End LLM Spans (Complete and Explained)
Semantic Attributes: Best Practices for LLM Observability
Evaluation Hooks Inside Traces
Exporting and Visualizing Traces (Where This Fits with Vendor Tooling)
Operational Patterns and Anti-Patterns
Extending the System
Conclusion

Introduction

Large Language Models (LLMs) are rapidly becoming a core component of modern software systems. Applications that once relied on deterministic APIs are now incorporating LLM-powered features such as conversational assistants, document summarization, intelligent search, and retrieval-augmented generation (RAG).

While these capabilities unlock new user experiences, they also introduce operational complexity that traditional monitoring approaches were never designed to handle.

Unlike conventional software services, LLM systems are probabilistic by nature. The same request may produce slightly different responses depending on factors such as prompt structure, model configuration, retrieval context, and sampling parameters such as temperature or top-p.

In addition, LLM workloads introduce entirely new operational dimensions such as token consumption, prompt construction latency, inference cost, context window limits, and response quality.

These factors mean that a request can appear technically successful from an infrastructure perspective while still producing an incorrect, hallucinated, or low-quality result.

Traditional observability tools typically focus on infrastructure-level signals such as latency, error rate, and throughput. While these metrics remain important, they are insufficient for understanding how an LLM application behaves in production.

Engineers must also understand what prompt was constructed, which documents were retrieved, how many tokens were consumed, which model configuration was used, and how the final response was evaluated. Without this visibility, debugging LLM behavior becomes extremely difficult and operational costs can quickly spiral out of control.

This is where LLM observability becomes essential. Observability for LLM systems extends beyond infrastructure monitoring. It captures the full lifecycle of an AI-driven request — from user input and context retrieval to prompt construction, model inference, post-processing, and quality evaluation.

When implemented correctly, observability allows teams to answer why the model generated a particular response, which retrieval results influenced the output, how much a request cost in terms of tokens, where latency occurred within the request pipeline, and whether the response passed basic quality or safety checks.

This article demonstrates how to implement end-to-end LLM observability in a FastAPI application using OpenTelemetry. Instead of relying on proprietary monitoring agents or opaque vendor SDKs, we take a code-first approach to instrumentation. By explicitly designing traces, spans, and semantic attributes, we gain precise control over how LLM interactions are observed and analyzed.

Throughout the guide, we will walk through a practical architecture for tracing a retrieval-augmented generation (RAG) workflow, where each stage of the request lifecycle is represented as a trace span. We will explore how to design meaningful span boundaries, capture prompt and model metadata safely, record token usage and cost signals, and attach evaluation results directly to traces.

The article also explains how this instrumentation can be exported to any OpenTelemetry-compatible backend such as Jaeger, Grafana Tempo, or LLM-specific platforms like Phoenix.

By the end of this guide, you will understand how to:

Structure traces so that each user request maps to a single end-to-end LLM interaction
Design span hierarchies that reflect the logical stages of an LLM pipeline
Capture prompt metadata, model configuration, and token usage safely
Attach evaluation and quality signals to traces for deeper analysis
Export observability data to different backends without changing instrumentation

Most importantly, the goal of this article is not simply to demonstrate how to add telemetry to an application. Instead, it aims to show how to think about observability when building LLM-powered systems.

When LLM operations are treated as first-class components within a distributed system, traces become a powerful tool for debugging, optimization, cost management, and continuous improvement of model behavior.

Prerequisites and Technical Context

Before following this guide, you should be familiar with the Python programming language, basic web API concepts, and general microservice architecture. Below are some key tools and concepts used in this article.

FastAPI (Web Framework)

FastAPI is used as the primary web framework for the application. It is a modern Python framework designed for building high-performance APIs using standard Python type hints. FastAPI simplifies request validation, serialization, and API documentation while remaining lightweight and fast.

Large Language Models (LLMs)

Large Language Models (LLMs) are the computational core of the example system. An LLM is a model trained on vast amounts of text data to generate or transform language in ways that resemble human communication. In production environments, LLMs are commonly used for tasks such as conversational interfaces, summarization, and question answering.

Observability (Concept)

Observability is the overarching concept that connects all the technical pieces in this article. At a high level, observability refers to the ability to understand a system's internal behavior by examining the data it produces during execution. Rather than asking whether a system is simply "up" or "down," observability helps answer deeper questions about why a request behaved a certain way, where latency was introduced, or how different components interacted.

OpenTelemetry (Instrumentation Standard)

OpenTelemetry is the mechanism used to implement observability within the application. It is an open, vendor-neutral standard for generating telemetry data such as traces, metrics, and logs. By instrumenting key parts of the LLM workflow, we can observe how requests flow through the system, how long each step takes, and what contextual data influenced the final outcome. OpenTelemetry serves as the foundation for collecting this information in a consistent and portable way, independent of any specific monitoring backend.

Why LLM Observability Is Fundamentally Different

Traditional observability assumes deterministic behavior: the same input produces the same output. LLM systems violate this assumption. The same request can vary due to prompt template changes, retrieval differences, sampling parameters (temperature, top-p), model version upgrades, and context window truncation.

As a result, teams need visibility into what the model saw, how it was configured, what it retrieved, how long it took, and how much it cost, all correlated to a single user request. Logs alone are insufficient, and metrics lack dimensionality. Distributed traces are the backbone of LLM observability.

Reference Architecture: A Traceable RAG Request

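A typical FastAPI-based RAG service follows this flow:

Client sends a request to /chat
FastAPI validates input and authenticates the user
Retriever queries the vector database
Prompt is assembled using retrieved documents
LLM API is invoked
Response is post-processed and returned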

Each step is observable, but only if we deliberately instrument it. The goal is one trace per user request, with child spans representing each logical LLM step.

Reference Architecture Explained

Client Sends a Request to /chat

The architecture begins when a client sends a request to the /chat endpoint. This request typically contains the user's query along with any session or conversation context required by the application.

Keeping the client interface minimal and well-defined is intentional: it ensures the backend receives a predictable input shape and prevents application-specific logic from leaking into downstream LLM processing.

From an observability perspective, this request marks the start of a single end-to-end trace, allowing every subsequent operation to be correlated back to the original user action.

FastAPI Validates Input and Authenticates the User

Once the request reaches the service, FastAPI performs schema validation and authentication. Validation guarantees that only well-formed inputs proceed through the pipeline, while authentication ensures that expensive LLM operations are only executed for authorized users.

Placing this step early reduces unnecessary computation and protects the system from abuse. It also improves trace quality by ensuring that all observed requests represent legitimate execution paths rather than malformed or rejected traffic.

Retriever Queries the Vector Database

After validation, the system queries a vector database to retrieve documents relevant to the user's request. This retrieval step is the foundation of retrieval-augmented generation (RAG). By grounding the LLM in external knowledge, the system improves factual accuracy and reduces hallucinations.

Separating retrieval from generation allows teams to tune similarity thresholds, embedding models, and top-k values independently, and it makes it easier to diagnose whether poor responses are caused by bad retrieval or model behavior.

Prompt Is Assembled Using Retrieved Documents

With relevant documents in hand, the system constructs the final prompt that will be sent to the LLM. This step combines the user query, retrieved context, system instructions, and formatting rules into a single structured prompt.

Making prompt assembly an explicit stage enables prompt versioning, experimentation, and observability. It also provides a natural place to detect issues such as context window overflows or excessive prompt size before invoking the model.
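The size check described above can be sketched as a small guard that runs before the model call. This is only an illustration: the character budget, the PromptTooLargeError name, and the strategy of dropping the lowest-ranked documents first are assumptions, not part of the example service. A real system would count tokens with the model's tokenizer instead of characters.

```python
from typing import List


class PromptTooLargeError(ValueError):
    """Hypothetical error raised when the query alone exceeds the budget."""


def fit_documents_to_budget(query: str, documents: List[str], max_chars: int) -> List[str]:
    """Keep as many leading documents as fit within a character budget.

    Assumes documents are ordered most-relevant first, so trailing
    documents are the first to be dropped.
    """
    if len(query) > max_chars:
        raise PromptTooLargeError("query alone exceeds the prompt budget")

    kept: List[str] = []
    used = len(query)
    for doc in documents:
        if used + len(doc) > max_chars:
            break  # stop at the first document that would overflow the budget
        kept.append(doc)
        used += len(doc)
    return kept


docs = ["a" * 40, "b" * 40, "c" * 40]
kept = fit_documents_to_budget("What is tracing?", docs, max_chars=100)
print(len(kept), "of", len(docs), "documents kept")
```

The number of dropped documents is a natural candidate for a span attribute, since it shows when truncation, not retrieval quality, removed context from the prompt.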

LLM API Is Invoked

The LLM API call is the most expensive and non-deterministic operation in the pipeline, which is why it occurs only after all preparatory work is complete. At this stage, the model receives a fully constructed prompt and produces a response based on its configuration parameters.

This step is the primary focus of latency, cost, and reliability controls such as retries, timeouts, and circuit breakers. From an observability standpoint, this span becomes the anchor for token usage, cost attribution, and prompt-level debugging.
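As a rough sketch of the timeout and retry controls mentioned above, the helper below wraps any async model call using only the standard library. The function name, the backoff schedule, and the choice of which exceptions count as retryable are assumptions for illustration. A production setup would follow the provider SDK's own error types and might add a circuit breaker on top.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple


async def call_with_retries(
    make_call: Callable[[], Awaitable[str]],
    timeout_s: float,
    max_attempts: int,
    backoff_s: float = 0.01,
) -> Tuple[str, int]:
    """Run make_call with a per-attempt timeout and exponential backoff.

    Returns the result together with the number of attempts used.
    """
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = await asyncio.wait_for(make_call(), timeout=timeout_s)
            return result, attempt
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < max_attempts:
                # Wait longer after each failure: backoff_s, 2x, 4x, ...
                await asyncio.sleep(backoff_s * 2 ** (attempt - 1))
    raise RuntimeError(f"LLM call failed after {max_attempts} attempts") from last_error


_calls = {"count": 0}


async def flaky_llm() -> str:
    """Simulated model call that fails once, then succeeds."""
    _calls["count"] += 1
    if _calls["count"] == 1:
        raise ConnectionError("simulated transient failure")
    return "ok"


text, attempts = asyncio.run(call_with_retries(flaky_llm, timeout_s=1.0, max_attempts=3))
print(text, attempts)
```

Returning the attempt count makes it easy to record it on the LLM span, for example under a hypothetical attribute name such as llm.retry_count.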

Response Is Post-Processed and Returned

After the LLM returns a response, the system performs post-processing before sending the result back to the client. This may include formatting, filtering, validation, or enrichment of the output. Post-processing acts as a final safeguard against malformed or low-quality responses and ensures consistency with application requirements. It also provides a clean boundary for attaching evaluation signals, such as response length, relevance scores, or truncation indicators, before the request completes.

Why This Design Is Better Than Simpler Alternatives

This architecture intentionally avoids coupling responsibilities together. Validation, retrieval, prompt construction, model execution, and response handling are all distinct steps. This separation makes the system easier to test, easier to observe, and easier to evolve. When something fails, engineers can identify where and why rather than treating the LLM as a black box.

Compared to a monolithic "send user input directly to the LLM" approach, this design offers better correctness, lower cost, and higher resilience. It also aligns naturally with distributed tracing, since each block maps cleanly to a trace span with a clear semantic purpose. As the system grows, additional features such as caching, fallback models, or policy enforcement can be added without destabilizing the entire flow.

Most importantly, this architecture treats the LLM as one component in a larger system, not the system itself. That mindset is essential for building reliable production applications.

LLM Models That Work Best for This Architecture

This architecture is model-agnostic, but certain model characteristics work particularly well with retrieval-augmented workflows.

Models with strong instruction-following and reasoning capabilities tend to perform best, especially when prompts include structured context from retrieved documents. General-purpose models such as GPT-4-class systems perform well when accuracy and reasoning depth are critical.

For lower-latency or cost-sensitive use cases, smaller instruction-tuned models can be effective when paired with high-quality retrieval. Open-source models such as LLaMA-derived or Mistral-based systems also fit well into this architecture, particularly when deployed behind a private inference endpoint.

The key requirement is not the model itself, but how it is used. Models that can reliably ground their responses in provided context, respect system instructions, and produce stable outputs under varying prompts integrate most cleanly into this design. Because retrieval and prompt construction are explicit stages, models can be swapped or compared without changing the overall system structure.

OpenTelemetry Primer (LLM-Relevant Concepts Only)

OpenTelemetry defines three core types of telemetry data: traces, metrics, and logs. For LLM systems, traces are the most important. To make them useful, you need to understand a few building blocks:

a trace represents a single end-to-end request
a span is a timed operation within that trace
attributes are key–value metadata attached to spans
events are time-stamped annotations
context propagation ensures child spans attach to the correct parent.

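FastAPI’s async nature makes correct context propagation essential. OpenTelemetry’s Python SDK tracks the active span through contextvars, so spans opened with start_as_current_span inside an async handler attach to the correct parent, even across await points.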

With those concepts in place, the next step is to wire OpenTelemetry into the app. Start by configuring the OpenTelemetry SDK in FastAPI: define a TracerProvider, attach a Resource (service name and environment), configure an exporter (Jaeger, Tempo, Phoenix, and so on), and enable FastAPI auto-instrumentation.
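The setup steps above could look roughly like the following configuration sketch. It assumes the opentelemetry-sdk and opentelemetry-instrumentation-fastapi packages are installed. The service name and environment values are placeholders, and ConsoleSpanExporter stands in for a real backend exporter so the example prints spans to stdout instead of sending them anywhere.

```python
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Resource attributes identify the service in every exported span.
# Both values below are placeholders.
resource = Resource.create(
    {
        "service.name": "rag-service",
        "deployment.environment": "development",
    }
)

# The provider owns span creation; the processor batches finished spans
# and hands them to the exporter.
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

app = FastAPI()

# Auto-instrumentation creates one server span per incoming HTTP request.
FastAPIInstrumentor.instrument_app(app)
```

Swapping the exporter for one that targets Jaeger, Tempo, or Phoenix should only change the exporter line, which is the portability the article relies on.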

Designing LLM-Aware Spans

Span Taxonomy

A clean span hierarchy is critical. In this guide, a single http.request span (usually auto-generated) acts as the root, and it contains child spans such as rag.retrieval, rag.prompt.build, llm.call, llm.postprocess, and, optionally, llm.eval. Each of these spans represents a logical unit of work rather than an implementation detail.

Span Boundaries

Getting span boundaries right is just as important as picking the right span names. Avoid extremes like wrapping the entire LLM workflow in one giant span, creating a separate span for every token, or dumping all data into logs.

Instead, aim for a few coarse-grained spans that each represent a meaningful step in the request, enrich them with well-chosen attributes, and use events to mark important milestones within a span rather than splitting everything into smaller spans.

Instrumenting the LLM Call

When instrumenting the LLM call, treat it as the most critical span in the trace. Whether you are calling OpenAI, Anthropic, or another provider, start the span immediately before the API request and end it only after the full response (or stream) is complete.

Within that span, capture retries, timeouts, and errors so it becomes the central place for latency analysis, cost attribution, and prompt debugging.

For streaming responses, you can emit events for each chunk to track progress, but avoid creating separate child spans unless you truly need fine-grained timing.

FastAPI Example: End-to-End LLM Spans (Complete and Explained)

from fastapi import FastAPI, Request
from opentelemetry import trace
from opentelemetry.trace import Tracer
from typing import List
import asyncio
import hashlib

# Obtain a tracer instance from OpenTelemetry.
# All spans created with this tracer will be part of the same distributed
# tracing system and exported to the configured backend.
tracer: Tracer = trace.get_tracer(__name__)

# Initialize the FastAPI application.
app = FastAPI()

# Helper functions used by the observable endpoint
async def retrieve_documents(query: str) -> List[str]:
    """
    Simulate document retrieval (e.g., vector search or knowledge base lookup).
    This function represents the retrieval stage in a RAG pipeline.
    In a real system, this might query a vector database or search index.
    """
    await asyncio.sleep(0.05)  # Simulate I/O latency
    return [
        "FastAPI enables high-performance async APIs.",
        "OpenTelemetry provides vendor-neutral observability.",
        "LLM observability requires tracing prompts and tokens.",
    ]


def build_prompt(query: str, documents: List[str]) -> str:
    """
    Construct the final prompt from retrieved documents and the user query.
    Prompt construction is kept separate so it can be observed or modified
    independently if needed (for example, to measure prompt assembly latency).
    """
    context = "\n".join(documents)
    return f"""
Context:
{context}

Question:
{query}
"""


class LLMResponse:
    """
    Minimal abstraction for an LLM response.
    This keeps the example self-contained while still allowing us to attach
    token usage and other metadata for observability.
    """

    def __init__(self, text: str, prompt_tokens: int, completion_tokens: int):
        self.text = text
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens
    
    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

async def call_llm(prompt: str) -> LLMResponse:
    """
    Simulate an LLM API call.
    In a real implementation, this would call OpenAI, Anthropic, or another
    provider. The artificial delay represents model latency.
    """
    await asyncio.sleep(0.2)  # Simulate inference time
    response_text = "FastAPI and OpenTelemetry enable end-to-end LLM observability."
    # Token count is approximated here for demonstration purposes.
    prompt_tokens = len(prompt.split())
    completion_tokens = len(response_text.split())
    return LLMResponse(response_text, prompt_tokens, completion_tokens)


def summarize_response(response: LLMResponse) -> str:
    """
    Example post-processing step.
    Post-processing is separated into its own phase so any additional latency
    or errors are not incorrectly attributed to the LLM itself.
    """
    return response.text


# Observable FastAPI endpoint
@app.post("/query")
async def rag_query(request: Request, query: str):
    """
    Handle a single RAG-style request with explicit OpenTelemetry spans.
    This endpoint demonstrates how to create one trace per request, with child
    spans for retrieval, LLM invocation, and post-processing.
    """

    # Create a top-level span for the HTTP request.
    # Even if FastAPI auto-instrumentation is enabled, defining this explicitly
    # allows us to attach domain-specific metadata.
    with tracer.start_as_current_span("http.request") as http_span:
        http_span.set_attribute("http.method", "POST")
        http_span.set_attribute("http.route", "/query")

        # Retrieval phase
        # This span isolates the retrieval step so that relevance issues can be
        # debugged independently of LLM behavior.
        with tracer.start_as_current_span("rag.retrieval") as retrieval_span:
            retrieval_span.set_attribute("rag.top_k", 5)
            retrieval_span.set_attribute("rag.similarity_threshold", 0.8)
            documents = await retrieve_documents(query)

            # Record how many documents were returned.
            # This is a key signal when diagnosing hallucinations
            # or missing context in the final response.
            retrieval_span.set_attribute(
                "rag.documents_returned",
                len(documents),
            )

        # LLM invocation phase
        # This span wraps the actual LLM call and is the primary anchor for
        # latency, cost, and prompt-related analysis.
        with tracer.start_as_current_span("llm.call") as llm_span:
            llm_span.set_attribute("llm.provider", "example")
            llm_span.set_attribute("llm.model", "example-llm")
            llm_span.set_attribute("llm.temperature", 0.7)
            llm_span.set_attribute("llm.prompt_template_id", "rag_v1")

            # Build the final prompt using retrieved context.
            # The raw prompt is intentionally not stored as a span attribute.
            prompt = build_prompt(query, documents)
            
            # Prompt metadata
            prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()
            llm_span.set_attribute("llm.prompt_hash", prompt_hash)
            llm_span.set_attribute("llm.prompt_length", len(prompt))

            response = await call_llm(prompt)

            # Hash the response instead of storing raw text.
            # This allows correlation across traces without exposing content.
            response_hash = hashlib.sha256(
                response.text.encode()
            ).hexdigest()
            llm_span.set_attribute("llm.response_hash", response_hash)

            # Record token usage to enable cost attribution
            # and capacity planning.
            llm_span.set_attribute("llm.usage.prompt_tokens", response.prompt_tokens)
            llm_span.set_attribute("llm.usage.completion_tokens", response.completion_tokens)
            llm_span.set_attribute("llm.usage.total_tokens", response.total_tokens)
            
            # example price per token
            estimated_cost = response.total_tokens * 0.000002
            llm_span.set_attribute("llm.cost_estimated_usd", estimated_cost)

        # Post-processing phase
        # Any transformation after the LLM response is captured here,
        # ensuring inference latency is not overstated.
        with tracer.start_as_current_span("llm.postprocess") as post_span:
            summary = summarize_response(response)
            post_span.set_attribute(
                "llm.summary_length",
                len(summary),
            )

    # Return the final response to the client.
    # All spans above belong to the same distributed trace.
    return {"summary": summary}

Before walking through the code example above piece by piece, it helps to understand how the instrumentation relates to the observability principles described earlier in this article.

The goal of the example is not simply to show how to create spans, but to demonstrate how a single user request can be represented as a structured trace containing meaningful metadata about each stage of the LLM pipeline.

At a high level, the code follows three key design ideas:

One trace per user request
One span per logical LLM workflow stage
Semantic attributes attached to spans for debugging, cost tracking, and analysis

Each of these concepts directly corresponds to the observability practices discussed earlier.

Top-Level Request Span

The FastAPI endpoint begins by creating a top-level span called http.request. This span represents the entire lifecycle of the incoming request and serves as the root span for the trace.

with tracer.start_as_current_span("http.request") as http_span:

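Although FastAPI can generate HTTP spans automatically through OpenTelemetry auto-instrumentation, explicitly creating this span allows the application to attach domain-specific metadata such as route names or user identifiers. When auto-instrumentation is also enabled, this explicit span is nested under the auto-generated server span instead of replacing it.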

Attributes such as the HTTP method and route are attached here:

http_span.set_attribute("http.method", "POST")
http_span.set_attribute("http.route", "/query")

This ensures that every trace can be easily filtered by endpoint when analyzing production traffic.

Retrieval Span

The next span captures the retrieval phase of the RAG pipeline:

with tracer.start_as_current_span("rag.retrieval") as retrieval_span:

This span isolates the vector search or knowledge retrieval step from the rest of the pipeline. If users report irrelevant answers, engineers can inspect this span to determine whether the issue originates from poor retrieval results rather than model behavior.

Several semantic attributes are attached here:

rag.top_k – number of documents requested
rag.similarity_threshold – similarity cutoff used for filtering results
rag.documents_returned – number of documents actually retrieved

These attributes align with the RAG observability signals discussed in the earlier section of the article.

LLM Invocation Span

The most important span in the trace is the llm.call span, which wraps the actual model invocation.

with tracer.start_as_current_span("llm.call") as llm_span:

This span captures the latency, configuration, and token usage associated with the LLM request. In production systems, it becomes the primary location for analyzing model behavior and cost.

Key attributes recorded in this span include:

llm.provider – the model provider (OpenAI, Anthropic, etc.)
llm.model – the specific model version
llm.temperature – sampling parameter controlling response randomness
llm.prompt_template_id – identifier for the prompt template used

These attributes make it possible to correlate changes in model configuration with downstream quality or cost changes.

Prompt Handling and Privacy

Instead of storing the full prompt or response text directly in the trace, the example demonstrates a safer practice: hashing sensitive data.

response_hash = hashlib.sha256(response.text.encode()).hexdigest()

The resulting hash is stored as a span attribute:

llm_span.set_attribute("llm.response_hash", response_hash)

This approach allows engineers to correlate repeated responses across traces without exposing potentially sensitive content in observability systems.
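A small helper can keep the hashing consistent for prompts and responses alike. The sketch below is one possible shape: the text_fingerprint name and the _length attribute suffix are assumptions for illustration, while the _hash suffix matches the llm.response_hash attribute used in the example.

```python
import hashlib
from typing import Dict, Union


def text_fingerprint(prefix: str, text: str) -> Dict[str, Union[str, int]]:
    """Return privacy-preserving span attributes for a piece of text.

    Only the SHA-256 digest and the character count are kept, never
    the raw text itself.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {f"{prefix}_hash": digest, f"{prefix}_length": len(text)}


a = text_fingerprint("llm.response", "Hello")
b = text_fingerprint("llm.response", "Hello")
c = text_fingerprint("llm.response", "hello")

# Identical text yields identical hashes, so repeats can be correlated.
print(a["llm.response_hash"] == b["llm.response_hash"])
# Any change in the text, even letter case, yields a different hash.
print(a["llm.response_hash"] == c["llm.response_hash"])
```

Keep in mind that a plain hash only hides content that cannot be guessed. Very short or highly predictable texts can still be matched by hashing candidate strings.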

Token Usage Tracking

The llm.call span also records token usage:

llm_span.set_attribute(
    "llm.usage.total_tokens",
    response.total_tokens
)

Capturing token usage at the span level is critical for monitoring cost and efficiency, since token consumption directly determines billing for most LLM providers.

Post-Processing Span

Finally, the example includes a llm.postprocess span:

with tracer.start_as_current_span("llm.postprocess") as post_span:

This span represents any transformation applied after the model generates its response. Separating post-processing from the LLM call ensures that additional latency — such as formatting, filtering, or validation — is not incorrectly attributed to the model itself.

An attribute such as response length is recorded here:

post_span.set_attribute("llm.summary_length", len(summary))

This can be useful when diagnosing issues such as unexpectedly short or truncated outputs.

How the Spans Form a Complete Trace

When the request finishes, all spans belong to the same distributed trace:

http.request
 ├── rag.retrieval
 ├── llm.call
 └── llm.postprocess

This hierarchy reflects the logical workflow of a retrieval-augmented LLM system. Because each span contains structured metadata, engineers can quickly answer questions such as:

Was the latency caused by retrieval or model inference?
How many documents influenced the prompt?
Which model configuration produced the response?
How many tokens were consumed?
Was the response post-processed or truncated?

This structured trace design is what transforms observability from simple monitoring into a practical debugging and optimization tool for LLM systems.

Semantic Attributes: Best Practices for LLM Observability

The goal is not to capture every possible detail, but to record the minimal set of stable, high-signal attributes that enable effective debugging, cost control, and quality analysis in production. Poor attribute design leads to noisy traces, privacy risks, and dashboards that are impossible to reason about.

Prompt, Response, and Model Metadata

Storing raw prompts is often unsafe and expensive, so it is better to record minimal, structured metadata instead. In practice, this means attaching a stable template identifier with llm.prompt_template_id, a hashed version of the final prompt using llm.prompt_hash (to avoid storing raw text), and a size indicator such as llm.prompt_length, which captures the number of tokens or characters.

You should also always record key inference parameters: llm.provider (for example, "openai" or "anthropic"), llm.model (for example, "gpt-4.1"), llm.temperature and llm.top_p (sampling parameters), llm.max_tokens (the maximum tokens allowed), and llm.stream to indicate whether streaming was enabled, while staying within your organization’s privacy and compliance requirements.


with tracer.start_as_current_span("llm.call") as llm_span:
    llm_span.set_attribute("llm.provider", "example")
    llm_span.set_attribute("llm.model", "example-llm")
    llm_span.set_attribute("llm.temperature", 0.7)
    llm_span.set_attribute("llm.top_p", 0.9)
    llm_span.set_attribute("llm.max_tokens", 512)
    llm_span.set_attribute("llm.stream", False)
    llm_span.set_attribute("llm.prompt_template_id", "rag_v1")

    # Build the final prompt using retrieved context.
    # The raw prompt is intentionally not stored as a span attribute.
    prompt = build_prompt(query, documents)

    # Prompt metadata
    prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()
    llm_span.set_attribute("llm.prompt_hash", prompt_hash)
    llm_span.set_attribute("llm.prompt_length", len(prompt))

Token Usage and Cost (Why This Matters in Practice)

Token usage is one of the most common blind spots in LLM systems. Many teams monitor latency and error rates but discover runaway costs only after invoices spike. Because token consumption varies significantly by prompt structure, retrieved context, and model configuration, it must be captured explicitly at the span level.

The most important practice is to record token usage at the end of the LLM span, once the model has completed inference. This ensures that the values reflect the full request rather than partial or streamed output.

At minimum, capture the attributes llm.usage.prompt_tokens, llm.usage.completion_tokens, and llm.usage.total_tokens.

class LLMResponse:
    def __init__(self, text: str, prompt_tokens: int, completion_tokens: int):
        self.text = text
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens
    
    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

async def call_llm(prompt: str) -> LLMResponse:
    """
    Simulate an LLM API call.
    In a real implementation, this would call OpenAI, Anthropic, or another
    provider. The artificial delay represents model latency.
    """
    await asyncio.sleep(0.2)  # Simulate inference time
    response_text = "FastAPI and OpenTelemetry enable end-to-end LLM observability."
    # Token count is approximated here for demonstration purposes.
    prompt_tokens = len(prompt.split())
    completion_tokens = len(response_text.split())
    return LLMResponse(response_text, prompt_tokens, completion_tokens)

These values allow you to distinguish between requests that are expensive because of large prompts (often caused by excessive retrieval or poor prompt construction) versus those that are expensive because of long model-generated outputs.

Where possible, also attach an estimated cost as llm.cost_estimated_usd:

    # example price per token
    estimated_cost = response.total_tokens * 0.000002
    llm_span.set_attribute("llm.cost_estimated_usd", estimated_cost)

This value is typically derived by multiplying token counts by the model's published pricing. Even if the estimate is approximate, it enables powerful analysis. For example, you can identify which endpoints, prompt templates, or user flows are responsible for the highest cumulative cost, rather than relying on coarse, account-level billing dashboards.
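The single per-token rate used in the example is a simplification. Providers often price prompt and completion tokens differently, so a slightly more realistic estimate keeps the two rates separate. The sketch below assumes that pricing model. The ModelPricing class and the rates are made up for illustration and do not reflect any provider's actual prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """USD per 1,000 tokens. The rates used below are hypothetical."""

    prompt_per_1k: float
    completion_per_1k: float


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int, pricing: ModelPricing) -> float:
    """Estimate request cost from token counts and per-1k-token rates."""
    prompt_cost = prompt_tokens / 1000 * pricing.prompt_per_1k
    completion_cost = completion_tokens / 1000 * pricing.completion_per_1k
    return round(prompt_cost + completion_cost, 6)


pricing = ModelPricing(prompt_per_1k=0.001, completion_per_1k=0.002)  # made-up rates
cost = estimate_cost_usd(prompt_tokens=2000, completion_tokens=1000, pricing=pricing)
print(cost)
```

Keeping the pricing table in one place, keyed by model name, also makes it straightforward to update estimates when a provider changes its rates.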

Once spans carry the right attributes, the next step is to connect them to output quality, not just system health.

Evaluation Hooks Inside Traces

This section describes an additional pattern you can layer on top of the core instrumentation in this guide. It is optional and not implemented in the sample code, but it shows how to attach quality signals directly to your traces.

Observability is not just about whether the system stayed up; it is also about whether the model produced a useful answer. Evaluation hooks inside traces let you attach lightweight quality signals directly to the same spans you use for latency and cost.

Inline evaluations are the simplest approach. You can run quick checks synchronously and record the results as span attributes, such as llm.eval.passed for a simple boolean check, llm.eval.relevance_score for an optional numerical score, or flags like llm.eval.hallucination_detected and llm.eval.refusal_detected. These attributes travel with the trace, so you can filter and aggregate on them in your observability backend just like any other field.
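As a minimal sketch of such inline checks, the function below computes a few of these attributes with cheap heuristics. Everything about the heuristics is an assumption for illustration: the refusal phrases, the minimum length, and the word-overlap score are crude stand-ins for whatever checks fit your application.

```python
from typing import Dict, List, Union

# Illustrative phrases only; real refusal detection needs a broader approach.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")


def inline_eval(
    response_text: str, documents: List[str], min_length: int = 20
) -> Dict[str, Union[bool, float]]:
    """Run cheap synchronous checks and return them as span attributes."""
    text = response_text.lower()
    refusal = any(marker in text for marker in REFUSAL_MARKERS)

    # Crude relevance proxy: share of response words found in the documents.
    doc_words = set(" ".join(documents).lower().split())
    words = text.split()
    overlap = sum(1 for w in words if w in doc_words) / len(words) if words else 0.0

    passed = (not refusal) and len(response_text) >= min_length
    return {
        "llm.eval.passed": passed,
        "llm.eval.refusal_detected": refusal,
        "llm.eval.relevance_score": round(overlap, 2),
    }


docs = ["OpenTelemetry provides vendor-neutral observability."]
good = inline_eval("OpenTelemetry provides observability for services.", docs)
bad = inline_eval("I cannot help with that request.", docs)
print(good)
print(bad)
```

Inside the endpoint, the returned dictionary could be applied to the current span by calling set_attribute once per key and value.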

For higher accuracy, you can introduce model-based evaluation as a separate step. In this pattern, an evaluator LLM runs asynchronously on the original prompt and response, and its work is captured in a child span (for example, llm.eval) that shares the same trace ID as the main llm.call span. You then attach scores such as relevance, faithfulness, or toxicity to that evaluation span.

Because the evaluation span shares the same trace ID, you can correlate quality regressions with changes in prompts or retrieval.

Exporting and Visualizing Traces (Where This Fits with Vendor Tooling)

This code-first observability design is vendor-agnostic. Once traces are emitted using OpenTelemetry, they can be exported to different backends without changing instrumentation.

General-purpose tracing systems like Jaeger and Grafana Tempo help engineers debug latency, errors, and request flow across retrieval, prompting, and model calls, answering how the system behaved. LLM-focused platforms such as Arize Phoenix use the same data but add model-specific insights like prompt clustering, token analysis, and quality correlation.

Because instrumentation stays OpenTelemetry-native, you maintain full control over attributes and trace structure while still using vendor dashboards, and you can switch backends as your needs evolve without touching the application code.

Operational Patterns and Anti-Patterns

Effective LLM observability requires disciplined practices. High-volume systems should sample traces to limit overhead, and prompts or responses should be hashed by default to reduce storage and privacy risk. Traces must be treated as production data, with proper access control and retention policies.

Common pitfalls include relying only on vendor SDK traces, logging prompts without trace correlation, or ignoring evaluation signals. These issues fragment visibility and hide quality regressions, especially when observability focuses only on agents instead of full application context.

Extending the System

Once traces are reliable, they support advanced capabilities. Metrics like p95 latency can be derived from spans, logs can be linked using trace IDs, and historical traces can power offline evaluation or prompt testing.
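To illustrate how a latency metric can be derived from spans, the sketch below computes a p95 from span durations using the nearest-rank method. The span records are hypothetical dictionaries standing in for whatever your backend exports. Most tracing backends compute percentiles for you, so treat this as an offline-analysis example.

```python
import math
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical exported span records: span name plus duration in milliseconds.
spans = [{"name": "llm.call", "duration_ms": float(ms)} for ms in range(1, 101)]
spans.append({"name": "rag.retrieval", "duration_ms": 50.0})

# Filter by span name so retrieval latency does not skew the LLM numbers.
llm_durations = [s["duration_ms"] for s in spans if s["name"] == "llm.call"]
p95 = percentile(llm_durations, 95)
print(p95)
```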

By following OpenTelemetry conventions, the observability stack also stays aligned with emerging LLM semantic standards, keeping the system flexible and future-proof.

Conclusion

End-to-end LLM observability is not achieved by installing another agent. It is achieved through intentional span design, meaningful semantic attributes, and, where needed, lightweight evaluation hooks.

By treating LLM calls as first-class operations within distributed traces, you gain faster debugging, controlled costs, safer deployments, and measurable quality improvements. The backend — Jaeger, Tempo, Phoenix — is interchangeable. The instrumentation strategy is not.

A well-designed trace is the most valuable artifact in a production LLM system.

How to Use WebSockets: From Python to FastAPI

Nneoma Uche — Thu, 12 Mar 2026 00:19:12 +0000

Real-time data powers much of modern software: live stock prices, chat applications, sports scores, collaborative tools. And to build these systems, you'll need to understand how real-time communication actually works—which isn’t always straightforward.

I ran into this firsthand while trying to build a live options dashboard. HTTP requests weren't going to cut it, and everything I was reading seemed overly complex until I went back to the basics. This article is the result of that process.

We'll cover Python's websockets library from scratch, then move into FastAPI, where many Python backends live. It's worth noting that WebSockets aren't the only solution for real-time communication. WebRTC may be a better fit depending on your use case, but understanding WebSockets is the right starting point before exploring further.

WebSocket Connections and Methods
How to Build Your First WebSocket in Python
File Transfer Over WebSockets
How to Connect to an External WebSocket
WebSockets in FastAPI
How to Handle WebSocket Disconnections in FastAPI
Conclusion

WebSocket Connections and Methods

A WebSocket connection enables bi-directional communication between a client and a server. Once a connection is established, both sides can communicate freely without either having to ask first. This is different from a regular HTTP request, where the client always has to ask before the server can respond.

It looks something like this:

        CLIENT  <===== open connection =====>  SERVER

Note that a WebSocket URL is not a regular web page, so you can't "visit it" like a website. You need a client to talk to it.

Different frameworks provide different methods for handling WebSocket connections. With Python’s websockets library, for instance, a connection is automatically accepted the moment a client connects. With frameworks like FastAPI, you have to explicitly call await websocket.accept(), otherwise the connection gets rejected.

Let’s look at the core methods provided by Python’s websockets library:

websockets.serve(...): starts a WebSocket server.
websockets.connect(...): connects to a WebSocket server.
connection.send(...): sends a message from either side. This is a method on the connection object, not on the websockets module.
connection.recv(): receives the next message, on either the client or the server.

recv() takes no arguments because it's purely a waiting operation. It waits for the next message and returns it:

message = await websocket.recv()

How to Build Your First WebSocket in Python

Before we dive into frameworks, let’s explore Python’s websockets library. You’ll set up a simple server and client, and exchange messages over a WebSocket connection, giving you a solid foundation for understanding WebSockets under the hood.

Environment Setup

Run the following in your virtual environment to install or verify the WebSockets package:

pip install websockets
# or, to check if it's already installed:
pip show websockets

Create the WebSocket Server

Create server.py in your project folder, and paste this:

import asyncio
import websockets

async def handler(connection):
    print("Client connected")

    message = await connection.recv()
    print("Received from client:", message)
    await connection.send("Hello client!")


async def main():
    async with websockets.serve(handler, "localhost", 8000):
        print("Server running at ws://localhost:8000")
        #await asyncio.Future()  # runs forever
        await asyncio.sleep(30)

asyncio.run(main())

When this line executes:

async with websockets.serve(handler, "localhost", 8000):

The library opens a TCP socket on the specified host and port and waits for incoming clients. When one connects, it creates a connection object and passes it into your handler function.

The handler is required because it defines what the server does with each connection. The host and port arguments are also important. Both default to None – passing neither raises an error because the OS cannot bind a network server without a port.

You could pass port=0 to let the OS assign a free port automatically, but then you'd need an extra step to figure out which port was chosen, so the client can connect:

server.sockets[0].getsockname()

It’s simpler to specify both host and port explicitly, so the client knows exactly where the server is running.
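If you do want the OS to pick the port, the lookup step can be sketched as follows. To stay runnable without extra packages, this sketch uses the standard library's asyncio.start_server as a stand-in for websockets.serve; the handler name and the 127.0.0.1 host are assumptions made for illustration.

```python
import asyncio

async def handle(reader, writer):
    # Placeholder handler: close each connection immediately.
    writer.close()
    await writer.wait_closed()

async def find_assigned_port():
    # port=0 asks the OS for any free port.
    # A single IPv4 address is used so only one socket (and one port) is created.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    async with server:
        host, port = server.sockets[0].getsockname()[:2]
        return port

port = asyncio.run(find_assigned_port())
print("OS assigned port:", port)
```

The port printed will differ on each run, which is exactly why a fixed port is easier for a tutorial client to target.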

Set Up the Client

Create client.py in the same folder and add this:

import asyncio
import websockets

async def client():
    async with websockets.connect("ws://localhost:8000") as websocket:
        await websocket.send("Hello server!")
        response = await websocket.recv()
        print("Server replied:", response)

asyncio.run(client())

Test the Connection

First, open a terminal and run server.py. You should see:

Server running at ws://localhost:8000

In a second terminal, run client.py. Messages should appear in both terminals confirming that the connection is active and both sides are communicating.

Note that the server must be running before you start the client – otherwise the client has nothing to connect to, and the connection will fail.

Keeping the server alive: a note on asyncio.Future()

In server.py, there’s a line currently commented out:

await asyncio.Future()

This keeps the server running indefinitely. For local development and testing, however, await asyncio.sleep(30) is a simpler alternative: it keeps the server alive for a fixed period and then lets the program exit.
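To see why awaiting a bare Future keeps a program alive, here is a small sketch that needs only the standard library. The function names are hypothetical, and the 0.2 second timeout exists only so the demo ends.

```python
import asyncio

async def serve_forever_stub():
    # Stands in for the body of `async with websockets.serve(...)`.
    # Nothing ever sets a result on this Future, so the await never finishes
    # unless the task is cancelled.
    await asyncio.Future()

async def demo():
    try:
        # wait_for cancels the stub once the timeout passes.
        await asyncio.wait_for(serve_forever_stub(), timeout=0.2)
    except asyncio.TimeoutError:
        return "still running when the timeout hit"
    return "returned on its own"

outcome = asyncio.run(demo())
print(outcome)
```

With asyncio.sleep(30) in place of the Future, the coroutine would simply return after 30 seconds and the server would shut down with it.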

File Transfer Over WebSockets

WebSockets aren't limited to text. They support raw bytes too, which means you can send files directly over the connection. Here’s how a client can send a file to a server over a WebSocket connection:

Update `server.py`

import asyncio
import websockets

async def file_handler(ws):
    print("Client connected, waiting for file...")
    file_bytes = await ws.recv()  # a binary message arrives as bytes
    with open("received_file.png", "wb") as f:
        f.write(file_bytes)
    print("File received and saved!")
    await ws.send("File received successfully!")

async def main():
    async with websockets.serve(file_handler, "localhost", 8000):
        print("Server running on ws://localhost:8000")
        await asyncio.sleep(50)  # keep server alive

asyncio.run(main())

The handler waits for incoming bytes with await ws.recv(); the websockets library automatically detects whether the incoming message is text or bytes, so no extra configuration is needed. Once received, the file is written to disk in binary mode ("wb") and the server sends a confirmation message back to the client.

Update `client.py`

import asyncio
import websockets

async def send_file():
    uri = "ws://localhost:8000"
    async with websockets.connect(uri) as ws:
        with open("portfolio-image.png", "rb") as f:  #open file in binary mode
            file_bytes = f.read()
        await ws.send(file_bytes)  # send bytes
        response = await ws.recv()
        print("Server response:", response)

asyncio.run(send_file())

The client opens the image in binary mode ("rb"), reads the entire file into memory as bytes, and sends it in a single ws.send() call. It then waits for the server's confirmation before closing the connection.

Test it

Add an image to your project folder and make sure the filename in client.py matches. Run server.py first, then client.py in a second terminal.

Once the transfer completes, the server saves the file as received_file.png in the same directory. You should see it appear in your workspace immediately.

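This approach loads the entire file into memory before sending, and the websockets library rejects incoming messages larger than its max_size limit (1 MiB by default). For large files, it’s better to read and send them in chunks. But this is the easiest way to understand WebSocket byte transfer.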
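A chunked transfer could look roughly like the sketch below. It is not part of the tutorial's server.py or client.py: the function names, the 64 KiB chunk size, and the use of a text message "EOF" as an end marker are all assumptions. The only thing it expects from ws is the send() and recv() coroutines that a websockets connection provides.

```python
CHUNK_SIZE = 64 * 1024  # assumed chunk size, well under the 1 MiB default limit

async def send_file_in_chunks(ws, path, chunk_size=CHUNK_SIZE):
    """Send a file as a series of binary messages, then a text end marker."""
    sent = 0
    with open(path, "rb") as f:
        # Read one chunk at a time so the whole file never sits in memory.
        while chunk := f.read(chunk_size):
            await ws.send(chunk)
            sent += len(chunk)
    await ws.send("EOF")  # text message used as the end marker (a convention assumed here)
    return sent

async def receive_file_in_chunks(ws, output_path):
    """Write binary messages to disk until a text message arrives."""
    with open(output_path, "wb") as f:
        while True:
            message = await ws.recv()
            if isinstance(message, str):  # text message means the sender is done
                break
            f.write(message)
```

The receiver relies on the fact that recv() returns str for text messages and bytes for binary ones, so the two kinds of message can share one connection.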

How to Connect to an External WebSocket

So far you've been connecting to servers you built yourself. But WebSocket clients can also connect to public servers. For example, a client can connect to Postman’s echo server:

import asyncio
import websockets

async def connect_external():
    uri = "wss://ws.postman-echo.com/raw"  # public WebSocket server
    async with websockets.connect(uri) as ws:
        print("Connected to external server!")

        # Send a message
        await ws.send("Hello external server!")
        print("Message sent")

        # Receive response
        response = await ws.recv()
        print("Received from server:", response)

asyncio.run(connect_external())

Notice the client connects to Postman’s echo server using the wss:// URI scheme instead of ws://. This indicates the connection is encrypted using TLS, similar to how https:// secures regular web requests.

An echo server returns exactly what you send it. So "Hello external server!" comes straight back as the response. It's a useful sandbox for testing your client-side WebSocket code without needing your own server.

WebSockets in FastAPI

FastAPI provides a WebSocket object (via Starlette under the hood) to manage real-time connections. You can define WebSocket endpoints just like HTTP routes, while Uvicorn handles the event loop – no manual asyncio server management needed. This makes FastAPI a natural fit for real-time projects, from chat apps to live dashboards and data feeds.

Before jumping into code, here's a quick reference of the core methods you'll be working with.

Accepting:

await websocket.accept(): the accept() method must be called first, before anything else. Skip it and the connection gets rejected.

Sending:

await websocket.send_text(data): sends a string.
await websocket.send_bytes(data): sends binary data.
await websocket.send_json(data): serializes and sends JSON.

Receiving:

await websocket.receive_text(): waits for a text message.
await websocket.receive_bytes(): waits for binary data.
await websocket.receive_json(): receives and deserializes JSON.
async for msg in websocket.iter_text(): iterates over incoming messages, exits cleanly on disconnect.

Closing:

await websocket.close(code=1000): standard code for a normal closure. It accepts an optional “reason” argument.

The WebSocket lifecycle in FastAPI follows the order of the reference above: accept the connection, send and receive messages, then close.

Building a Simple Echo Server with FastAPI

As you saw with the Postman example, an echo server sends back the message a client provides. Let's build one with FastAPI.

1. Install FastAPI:

pip install "fastapi[standard]"

2. Update `server.py`:

from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    data = await websocket.receive_text()
    await websocket.send_text(f"You said: {data}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8000)

A few things to note here compared to the plain websockets library:

WebSocket endpoints are defined with @app.websocket("/ws") just like an HTTP route.
await websocket.accept() is required before anything else. FastAPI won't accept connections without it.
Uvicorn handles the event loop and server startup for you via the if __name__ == "__main__" block. No asyncio.run() or asyncio.Future() needed.

3. Update client.py:

import asyncio
import websockets

async def test_client():
    uri = "ws://127.0.0.1:8000/ws"
    async with websockets.connect(uri) as ws:
        await ws.send("Hello FastAPI server!")
        response = await ws.recv()
        print("Server replied:", response)

asyncio.run(test_client())

Since the FastAPI server isn't secured with TLS, the client URI uses ws:// instead of wss://. Make sure to match the host and port from your server code.

4. Interact with the echo server:

Start server.py, then run client.py in another terminal. The client terminal should print the echoed message: Server replied: You said: Hello FastAPI server!

How to Handle WebSocket Disconnections in FastAPI

Clients will inevitably disconnect in real-time applications, sometimes intentionally, sometimes unexpectedly. If not handled properly, this can crash your server or leave it in a broken state.

FastAPI raises the WebSocketDisconnect exception when the server tries to receive from a connection that the client has closed. Catching it lets the server handle disconnects gracefully, log the event, and clean up resources without crashing.

Here’s an example:

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            data = await ws.receive_text()

            # End the session when the client says goodbye
            if "bye" in data or "quit" in data:
                await ws.send_text("Closing connection")
                await ws.close(code=1000, reason="Server requested close")
                break
            await ws.send_text(f"I got your request: {data}")
    except WebSocketDisconnect:
        print("Client disconnected")  # connection already closed

The server runs a continuous loop waiting for messages. If the client message contains "bye" or "quit", the server responds, calls await ws.close(code=1000), and breaks out of the loop cleanly.

But if the client disconnects unexpectedly, WebSocketDisconnect is caught by the except block and the server moves on without crashing. At this point the connection is already closed on the client side, so calling ws.close() inside the except block is unnecessary.

Conclusion

WebSockets make real-time communication possible by keeping a persistent connection open between client and server. Starting with Python’s websockets library helps clarify how the protocol works under the hood, while frameworks like FastAPI provide the structure needed for production applications.

The parts that trip most people up early on are asyncio and FastAPI's explicit websocket.accept(). With asyncio, the question is usually why it's needed and why the server dies instantly without something keeping it alive. And it's easy to ignore websocket.accept() if you're coming from the plain websockets library where that happens automatically. Once those click, everything else follows naturally.

How to Build and Deploy a Blog-to-Audio Service Using OpenAI

Manish Shivanandhan — Wed, 14 Jan 2026 04:34:50 +0000

Turning written blog posts into audio is a simple way to reach more people. Many users prefer listening during travel or workouts. Others enjoy having both reading and listening options.

With OpenAI’s text-to-speech models, you can build a clean service that takes a blog URL or pasted text and produces a natural-sounding audio file.

In this article, you’ll learn how to build this system end-to-end. You will learn how to fetch blog content, send it to OpenAI’s audio API, save the output as an MP3 file, and serve everything through a small FastAPI app.

At the end, you’ll also build a minimal user interface and deploy it to Sevalla so that anyone can upload text and download audio without touching code.

Understanding the Core Idea
How to Set Up Your Project
How to Fetch and Clean Blog Content
How to Send Text to OpenAI for Audio
How to Build a FastAPI Backend
How to Add a Simple User Interface
How to Deploy Your Service to Sevalla
Conclusion

Understanding the Core Idea

A blog-to-audio service has only three important parts. The first part takes a blog link or text and cleans it. The second part sends the clean text to OpenAI’s text-to-speech model. The third part gives the final MP3 file back to the user.

OpenAI’s speech generation is simple to use. You send text, choose a voice, and get audio back. The quality is high and works well even for long posts. This means you do not need to worry about training models or tuning voices.

The only job left is to make the system easy to use. That is where FastAPI and a small HTML form help. They wrap your code into a web service so anyone can try it.

How to Set Up Your Project

Create a folder for your project. Inside it, create a file called main.py. You will also need a basic HTML file later.

Install the libraries you need with pip:

pip install fastapi uvicorn requests beautifulsoup4 python-multipart

FastAPI gives you a simple backend, and Uvicorn runs it. The Requests library downloads blog pages. BeautifulSoup removes HTML tags and extracts readable text. python-multipart lets FastAPI parse submitted form data.

You must also install the OpenAI client:

pip install openai

Make sure you have your OpenAI API key ready. Set it in your terminal before running the app:

export OPENAI_API_KEY="your-key"

On Windows, you can use setx. It applies to new terminal windows, so open a fresh one before running the app:

setx OPENAI_API_KEY "your-key"

How to Fetch and Clean Blog Content

To convert a blog into audio, you must first extract the main article text. You can fetch the page with requests and parse it with BeautifulSoup.

Below is a simple function that does this.

import requests
from bs4 import BeautifulSoup

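def extract_text_from_url(url: str) -> str:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    # Collect the text of every <p> tag and join it into one string
    paragraphs = soup.find_all("p")
    text = " ".join(p.get_text(strip=True) for p in paragraphs)
    return text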

Here is what happens step by step.

The function downloads the page.
BeautifulSoup reads the HTML and finds all paragraph tags.
It pulls out the text in each paragraph and joins them into one long string.
This gives you a reasonably clean version of the blog post without layout code, although paragraphs from sidebars or footers may come along.

If the user pastes text instead of a URL, you can skip this part and use the text as it is.

How to Send Text to OpenAI for Audio

OpenAI’s text-to-speech API makes this part of the work very easy. You send a message with text and select a voice such as Alloy or Verse. The API returns raw audio bytes. You can save these bytes as an MP3 file.

Here is a helper function to convert text into audio:

from openai import OpenAI
client = OpenAI()

def text_to_audio(text: str, output_path: str):
    audio = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input=text
    )
    with open(output_path, "wb") as f:
        f.write(audio.read())

This function calls the OpenAI client and passes the text, model name, and voice choice. The .read() method extracts the binary audio stream. Writing this to an MP3 file completes the process.

The speech endpoint limits how much text a single request can carry (at the time of writing, the API reference lists a 4,096-character maximum for input). Short posts fit in one request, but for longer posts you need to split the text into chunks and join the audio files afterward.
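One way to split the text is sketched below. The function name and the 4,000-character default are assumptions chosen for illustration; the function breaks on sentence boundaries where it can and falls back to a hard cut for a single oversized sentence.

```python
import re

def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into pieces of at most max_chars, preferring sentence breaks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("One. Two. Three.", max_chars=9))  # ['One. Two.', 'Three.']
```

Each chunk can then be passed to text_to_audio with its own output path.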

How to Build a FastAPI Backend

Now you can wrap both steps into a simple FastAPI server. This server will accept either a URL or pasted text. It will convert the content into audio and return the MP3 file as a response.

Here is the full backend code:

from fastapi import FastAPI, Form
from fastapi.responses import FileResponse
import uuid
import os

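app = FastAPI()

# extract_text_from_url and text_to_audio are the helpers defined earlier in main.py

@app.post("/convert")
def convert(url: str = Form(None), text: str = Form(None)):
    if not url and not text:
        return {"error": "Please provide a URL or text"}
    if url:
        try:
            text_content = extract_text_from_url(url)
        except Exception:
            return {"error": "Could not fetch the URL"}
    else:
        text_content = text
    # Unique file name per request so concurrent users don't overwrite each other
    file_id = uuid.uuid4().hex
    output_path = f"audio_{file_id}.mp3"
    text_to_audio(text_content, output_path)
    # filename sets Content-Disposition so the browser downloads the file
    return FileResponse(output_path, media_type="audio/mpeg", filename="blog_audio.mp3")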

Here is how it works. The user sends form data with either url or text. The server checks which one exists.

If there is a URL, it extracts text with the earlier function. If there is no URL, it uses the provided text directly. A unique file name is created for every request. Then the audio file is generated and returned as an MP3 download.

You can run the server like this:

uvicorn main:app --reload

The server now listens at http://localhost:8000. There is no UI yet, so the root path returns a 404, but the /convert endpoint is working. You can test it with a tool like Postman, through FastAPI's interactive docs at http://localhost:8000/docs, or by building the front end next.

How to Add a Simple User Interface

A service is much easier to use when it has a clean UI. Below is a simple HTML page that sends either a URL or text to your FastAPI backend. Save this file as index.html in the same folder:

<!DOCTYPE html>
<html>
<head>
    <title>Blog to Audio</title>
    <style>
        body { font-family: Arial; padding: 40px; max-width: 600px; margin: auto; }
        input, textarea { width: 100%; padding: 10px; margin-top: 10px; }
        button { padding: 12px 20px; margin-top: 20px; cursor: pointer; }
    </style>
</head>
<body>
    <h2>Convert Blog to Audio</h2>
    <form action="/convert" method="post">
        <label>Blog URL</label>
        <input type="text" name="url" placeholder="Enter a blog link">
        <p>or paste text below</p>
        <textarea name="text" rows="10" placeholder="Paste blog text here"></textarea>
        <button type="submit">Convert to Audio</button>
    </form>
</body>
</html>

This page gives the user two options. They can type a URL or paste text. The form sends the data to /convert using a POST request. The response will be the MP3 file, so the browser will download it.

To serve the HTML file, add this route to your main.py:

from fastapi.responses import HTMLResponse

@app.get("/")
def home():
    with open("index.html", "r") as f:
        html = f.read()
    return HTMLResponse(html)

Now, when you visit the main URL, you will see a clean form.

When you submit a URL, the server will process your request and give you an audio file.

Great. Our text-to-audio service is working. Now let’s get it into production.

How to Deploy Your Service to Sevalla

You can choose any cloud provider, like AWS, DigitalOcean, or others, to host your service. I will be using Sevalla for this example.

Sevalla is a developer-friendly PaaS provider. It offers application hosting, database, object storage, and static site hosting for your projects.

Every platform will charge you for creating a cloud resource. Sevalla comes with a $50 credit for us to use, so we won’t incur any costs for this example.

Let’s push this project to GitHub so that we can connect our repository to Sevalla. We can also enable auto-deployments so that any new change to the repository is automatically deployed.

You can also fork my repository from here.

Log in to Sevalla and click on Applications -> Create new application. You can see the option to link your GitHub repository to create a new application.

Use the default settings. Click “Create application”. Now we have to add our OpenAI API key to the environment variables. Click on the “Environment variables” section once the application is created, and save the OPENAI_API_KEY value as an environment variable.

Now we are ready to deploy our application. Click on “Deployments” and click “Deploy now”. It will take 2–3 minutes for the deployment to complete.

Once done, click on “Visit app”. You will see the application served via a URL ending with sevalla.app. This is your new root URL. You can replace localhost:8000 with this URL and start using it.

Congrats! Your blog-to-audio service is now live. You can extend this by adding other capabilities and pushing your code to GitHub. Sevalla will automatically deploy your application to production.

Conclusion

You now know how to build a full blog-to-audio service using OpenAI. You learned how to fetch blog text, convert it into speech, and serve it with FastAPI. You also learned how to create a simple user interface, allowing people to try it with no setup.

With this foundation, you can turn any written content into smooth, natural audio. This can help creators reach a wider audience, enhance accessibility, and provide users with more ways to enjoy content.

Hope you enjoyed this article. Sign up for my free newsletter TuringTalks.ai for more hands-on tutorials on AI. You can also visit my website.

How to Build and Deploy an AI Agent with LangChain, FastAPI, and Sevalla

Manish Shivanandhan — Thu, 08 Jan 2026 23:43:55 +0000

Artificial intelligence is changing how we build software. Just a few years ago, writing code that could talk, decide, or use external data felt hard.

Today, thanks to new tools, developers can build smart agents that read messages, reason about them, and call functions on their own.

One such platform that makes this easy is LangChain. With LangChain, you can link language models, tools, and apps together. You can also wrap your agent inside a FastAPI server, then push it to a cloud platform for deployment.

This article will walk you through building your first AI agent. You will learn what LangChain is, how to build an agent, how to serve it through FastAPI, and how to deploy it on Sevalla.

What We’ll Cover

What is LangChain?
How to Build Your First Agent with LangChain
Wrapping Your Agent with FastAPI
How to Deploy Your AI Agent to Sevalla
Conclusion

What is LangChain?

LangChain is a framework for working with large language models. It helps you build apps that think, reason, and act.

A model on its own only gives text replies, but LangChain lets it do more. It lets a model call functions, use tools, connect with databases, and follow workflows.

Think of LangChain as a bridge. On one side is the language model. On the other side are your tools, data sources, and business logic. LangChain tells the model what tools exist, when to use them, and how to reply. This makes it ideal for building agents that answer questions, automate tasks, or handle complex flows.

Many developers use LangChain because it is flexible. It supports many AI models. It fits well with Python.

LangChain also makes it easier to move from prototype to production. Once you learn how to create an agent, you can reuse the pattern for more advanced use cases.

I have recently published a detailed LangChain tutorial here.

How to Build Your First Agent with LangChain

Let’s make our first agent. It will respond to user questions and call a tool when needed.

We’ll give it a simple weather tool, then ask it about the weather in a city. Before this, create a file called .env and add your OpenAI API key. LangChain will automatically use it when making requests to OpenAI.

OPENAI_API_KEY=

Here is the code for our agent:


from langchain.agents import create_agent
from dotenv import load_dotenv

# load environment variables
load_dotenv()

# defining the tool that LLM can call
def get_weather(city: str) -> str:
    """Get weather for a given city."""
    return f"It's always sunny in {city}!"

# Creating an agent
agent = create_agent(
    model="gpt-4o",
    tools=[get_weather],
    system_prompt="You are a helpful assistant",
)

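result = agent.invoke(
    {"messages": [{"role": "user", "content": "What is the weather in san francisco?"}]}
)

# The last message in the returned state holds the agent's final answer
print(result["messages"][-1].content)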

This small program shows the power of LangChain agents.

First, we import create_agent, which helps us build the agent. Then we write a function called get_weather. It takes a city name and returns a friendly sentence.

The function acts as our tool. A tool is something the agent can use. In real projects, tools might fetch prices, store notes, or call APIs.

Next, we call create_agent. We give it three things. We pass the model we want to use. We list the tools we want it to call. And we give a system prompt. The system prompt tells the agent who it is and how it should behave.

Finally, we run the agent. We call invoke with a message.

The user asks for the weather in San Francisco. The agent reads this message. It sees that the question needs the weather function. So it calls our tool get_weather, passes the city, and returns an answer.

Even though this example is tiny, it captures the main idea. The agent reads natural language, figures out what tool to use, and sends a reply.

Later, you can add more tools or replace the weather function with one that connects to a real API. But this is enough for us to wrap and deploy.
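To make the read, decide, call cycle concrete, here is a toy sketch in plain Python with no LangChain or OpenAI calls. The fake_model function is a hypothetical stand-in for the language model, and the dictionary it returns is an invented format, not LangChain's internal one.

```python
def get_weather(city: str) -> str:
    """Get weather for a given city."""
    return f"It's always sunny in {city}!"

# Registry of tools the "agent" is allowed to call.
TOOLS = {"get_weather": get_weather}

def fake_model(message: str) -> dict:
    """Hypothetical stand-in for the LLM: decides whether a tool call is needed."""
    marker = "weather in "
    lowered = message.lower()
    if marker in lowered:
        city = message[lowered.index(marker) + len(marker):].strip(" ?.!")
        return {"tool": "get_weather", "args": {"city": city}}
    return {"tool": None, "reply": "I can only help with weather questions."}

def run_agent(message: str) -> str:
    decision = fake_model(message)
    if decision["tool"] is None:
        return decision["reply"]
    tool = TOOLS[decision["tool"]]
    return tool(**decision["args"])  # call the chosen tool with the chosen arguments

print(run_agent("What is the weather in san francisco?"))
```

In the real agent, the model makes this decision from the tool's name, signature, and docstring, and the tool's result goes back to the model so it can word the final reply.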

Wrapping Your Agent with FastAPI

The next step is to serve our agent. FastAPI helps us expose our agent through an HTTP endpoint. That way, users and systems can call it through a URL, send messages, and get replies.

To begin, you install FastAPI and write a simple file like main.py. Inside it, you import FastAPI, load the agent, and write a route.

When someone posts a question, the API forwards it to the agent and returns the answer. The flow is simple.

The user talks to FastAPI. FastAPI talks to your agent. The agent thinks and replies. Here is the FastAPI wrapper for your agent.

from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn
from langchain.agents import create_agent
from dotenv import load_dotenv
import os

load_dotenv()

# defining the tool that LLM can call
def get_weather(city: str) -> str:
    """Get weather for a given city."""
    return f"It's always sunny in {city}!"

# Creating an agent
agent = create_agent(
    model="gpt-4o",
    tools=[get_weather],
    system_prompt="You are a helpful assistant",
)

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

@app.get("/")
def root():
    return {"message": "Welcome to your first agent"}

@app.post("/chat")
def chat(request: ChatRequest):
    result = agent.invoke({"messages":[{"role":"user","content":request.message}]})
    return {"reply": result["messages"][-1].content}

def main():
    port = int(os.getenv("PORT", 8000))
    uvicorn.run(app, host="0.0.0.0", port=port)

if __name__ == "__main__":
    main()

Here, FastAPI defines a /chat endpoint. When someone sends a message, the server calls our agent. The agent processes it as before. Then FastAPI returns a clean JSON reply. The API layer hides the complexity inside a simple interface.

At this point, you have a working agent server. You can run it on your machine, call it with Postman or cURL, and check responses. When this works, you are ready to deploy.
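If you prefer testing from Python instead of Postman or cURL, a small standard-library client could look like the sketch below. The helper names and the default base URL are assumptions; the payload shape and the reply key mirror the ChatRequest model and the /chat route above.

```python
import json
import urllib.request

def build_chat_request(message: str, base_url: str = "http://localhost:8000"):
    """Build a POST request whose JSON body matches the ChatRequest model."""
    payload = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask_agent(message: str, base_url: str = "http://localhost:8000") -> str:
    """Send the message to a running server and return the "reply" field."""
    request = build_chat_request(message, base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["reply"]

# With the server running:
# print(ask_agent("What is the weather in san francisco?"))
```

The call is left commented out because it only works while the FastAPI server is running.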

How to Deploy Your AI Agent to Sevalla

You can choose any cloud provider, like AWS, DigitalOcean, or others to host your agent. I will be using Sevalla for this example.

Sevalla is a developer-friendly PaaS provider. It offers application hosting, database, object storage, and static site hosting for your projects.

Every platform will charge you for creating a cloud resource. Sevalla comes with a $50 credit for us to use, so we won’t incur any costs for this example.

Let’s push this project to GitHub so that we can connect our repository to Sevalla. We can also enable auto-deployments so that any new change to the repository is automatically deployed.

You can also fork my repository from here.

Log in to Sevalla and click on Applications -> Create new application. You can see the option to link your GitHub repository to create a new application.

Use the default settings. Click “Create application”. Now we have to add our OpenAI API key to the environment variables. Click on the “Environment variables” section once the application is created, and save the OPENAI_API_KEY value as an environment variable.

Now we are ready to deploy our application. Click on “Deployments” and click “Deploy now”. It will take 2–3 minutes for the deployment to complete.

Congrats! Your first AI agent with tool calling is now live. You can extend it by adding more tools and other capabilities. When you push your code to GitHub, Sevalla will automatically deploy your application to production.

Conclusion

Building AI agents is no longer a task for experts. With LangChain, you can write a few lines and create reasoning tools that respond to users and call functions on their own.

By wrapping the agent with FastAPI, you give it a doorway that apps and users can access. Finally, Sevalla makes it easy to push your agent live, monitor it, and run it in production.

This journey from agent idea to deployed service shows what modern AI development looks like. You start small. You explore tools. You wrap them and deploy them.

Then you iterate, add more capability, improve logic, and plug in real tools. Before long, you have a smart, living agent online. That is the power of this new wave of technology.

Hope you enjoyed this article. Sign up for my free newsletter TuringTalks.ai for more hands-on tutorials on AI. You can also visit my website.

How to Implement Dependency Injection in FastAPI

Nneoma Uche — Fri, 14 Nov 2025 14:46:01 +0000

Several languages and frameworks depend on dependency injection—no pun intended. Go, Angular, NestJS, and Python's FastAPI all use it as a core pattern.

If you've been working with FastAPI, you've likely encountered dependencies in action. Perhaps you saw Depends() in a tutorial or the docs and were confused for a minute. I certainly was. That confusion sparked weeks of experimenting with this system. The truth is, you can't avoid dependency injection when building backend services with FastAPI. It's baked into the framework's DNA, powering everything from authentication and database connections to request validation.

FastAPI's docs describe its dependency injection system as 'powerful but intuitive.' That’s accurate, once you understand how it works. This article breaks it down, covering function dependencies, class dependencies, dependency scopes, as well as practical examples.

Prerequisites
Dependencies and Dependency Injection in FastAPI
Getting Started: Environment Setup
Types of Dependencies in FastAPI
- How to Use Function Dependencies in FastAPI
- How to Use Class Dependencies in FastAPI
Dependency Scope
Common Use Cases for Dependency Injection
Conclusion

Prerequisites

To follow along with this article, you should have:

Working knowledge of Python.
Ability to create and activate virtual environments.
Basic understanding of FastAPI.
Familiarity with Object-Oriented Programming (OOP) concepts.

Dependencies and Dependency Injection in FastAPI

A dependency is a reusable piece of logic, like authentication, database connection, or validation, that your path operations require. Dependency injection (DI) is how FastAPI delivers these dependencies to specific parts of your application: you declare them using Depends() and FastAPI automatically executes them when the associated route receives a request.

Think of it as requesting the tools your application needs. You declare dependencies once and FastAPI provides them wherever needed, with no repetitive setup across routes.

This makes for modular, scalable applications. Without DI, you would have to repeat the same setup code on every endpoint, making updates tedious and bugs more likely.

Getting Started: Environment Setup

Let's set up your development environment to work through the examples in this guide.

Start by creating a project folder, then:

Create and activate a virtual environment:

python -m venv deps
source deps/bin/activate          # On macOS/Linux
deps\Scripts\activate             # On Windows

Install FastAPI with all dependencies:

pip install 'fastapi[all]'

Organize your project as follows:

fastapi-deps/
├── deps/                 # Virtual environment
├── function_deps.py
├── class_deps.py
├── router_deps.py
├── app.py
└── requirements.txt

Types of Dependencies in FastAPI

In FastAPI, a dependency is a callable object that retrieves or verifies information before a route executes. Dependencies can be implemented as either functions or classes.

Function dependencies are the most straightforward approach and work well for most use cases, including validation, authentication, and data retrieval. Class dependencies can handle the same tasks but are particularly useful when you need stateful logic, multiple instances with different configurations, or prefer object-oriented patterns.

How to Use Function Dependencies in FastAPI

A function dependency is a helper function (such as for authentication or data retrieval) that can be injected into path operations. To demonstrate, we'll create a simple user authentication dependency using an in-memory database—a list of dictionaries.

Recall the folder structure from earlier? We’ll write this code in fastapi-deps/function_deps.py.

Start by importing the required modules:

from fastapi import FastAPI, Depends, HTTPException
import uvicorn

You bring in FastAPI to create the app instance, Depends for dependency injection, and HTTPException to handle errors gracefully. uvicorn will be used to run the application later.

Next, instantiate the FastAPI application:

app = FastAPI()

app = FastAPI() creates your application instance: the object that will hold all your endpoints and dependencies.

Next, create an in-memory database. Define a list of dictionaries to act as your temporary database. Each dictionary represents a user entry containing a name and a password.

users = [
    {"name": "Ore", "password": "jkzvdgwya12"},
    {"name": "Uche", "password": "lga546"},
    {"name": "Seke", "password": "SK99!"},
    {"name": "Afi", "password": "Afi@144"},
    {"name": "Sam", "password": "goTiger72*"},
    {"name": "Ozi", "password": "xx%hI"},
    {"name": "Ella", "password": "Opecluv18"},
    {"name": "Claire", "password": "cBoss@14G"},
    {"name": "Sena", "password": "SenDaBoss5"},
    {"name": "Ify", "password": "184Norab"}  
]

💡

This type of database isn’t persistent; any data stored therein is lost when the application restarts.

Then, define a dependency function for user validation. The simple helper function below checks whether a username and password provided by the user match an existing user in the database.

#the dependency function
def user_dep(name: str, password: str):
    for u in users:
        if u["name"] == name and u["password"] == password:
            return {"name": name, "valid": True}

This function expects two string parameters, name and password, which FastAPI reads from the request's query string. If it finds a match in the users database, it returns a dictionary confirming the user’s validity; otherwise it falls through and returns None. When a path operation returns that dictionary, FastAPI serializes it to JSON.
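Because a dependency is an ordinary function, you can call it directly to see what a route would receive. The sketch below uses a shortened copy of the users list (an assumption made to keep the example small) and shows both outcomes.

```python
users = [
    {"name": "Seke", "password": "SK99!"},
    {"name": "Afi", "password": "Afi@144"},
]


def user_dep(name: str, password: str):
    for u in users:
        if u["name"] == name and u["password"] == password:
            return {"name": name, "valid": True}
    # Falling off the end of the loop returns None implicitly


valid = user_dep("Seke", "SK99!")
invalid = user_dep("Seke", "wrong-password")
print(valid)
print(invalid)
```

The None result for a failed lookup is what a route has to check for before it decides to respond with an error.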

Next, inject the dependency into a path function:

#the web endpoint
@app.get("/users/{user}")
def get_user(user = Depends(user_dep)) -> dict:
    if not user:
        raise HTTPException(status_code=401, detail="Invalid username or password")
    return user

The user_dep function is injected into the path operation using Depends(). When an HTTP request is made to this endpoint, FastAPI executes the dependency first, validates the input, and passes its return value to the user parameter.

The -> dict: annotation indicates that the function returns a dictionary, which FastAPI auto-converts to JSON. If no matching record is found, an HTTPException with a 401 status code is raised; otherwise, the verified user data is returned.

Now start the FastAPI server. Open your terminal in the project directory and run:

uvicorn function_deps:app --reload

function_deps is the name of your Python file (without the .py extension).
--reload automatically restarts the server whenever you save changes.

Once it starts, Uvicorn logs the address it is serving on, which is http://127.0.0.1:8000 by default.

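Now you can test the endpoint. Open your browser or the Postman desktop app to validate the user “Seke”. The {user} path segment is not read by user_dep, so any value works there; the credentials travel in the query string. For example, paste this URL into your browser: http://127.0.0.1:8000/users/Seke?name=Seke&password=SK99!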

Alternatively, you can test the endpoint using FastAPI’s built-in docs at: http://127.0.0.1:8000/docs

In the Swagger UI:

Click on the Get User endpoint
Click Try it out
Enter “Seke” in the name field and “SK99!” in the password field
Click Execute

You should get a 200 status code, with the payload {"name": "Seke", "valid": true}.

You can also test the endpoint with usernames or passwords that don’t exist in the database. Each time, you should see a 401 error with the body {"detail": "Invalid username or password"}.

How to Use Class Dependencies in FastAPI

While functions are the most common way to define dependencies, FastAPI also supports class-based dependencies. Classes are useful when you need reusable instances with configurable state or prefer object-oriented patterns.

Class dependencies inject the same way: through the Depends function in your path operation.

Let's convert the user_dep function dependency to a class. It will authenticate users, grant access to valid credentials, and raise exceptions for unauthorized attempts. We'll apply it to a user dashboard endpoint to ensure only authenticated users access their resources.

#Dependency class for user authentication
class UserAuth:
    def __init__(self, name: str, password: str):
        self.name = name
        self.password = password
        #Depends(UserAuth) only calls the constructor, so run the check here
        self()

    def __call__(self):
        #check if name and password entered correspond to any row in the db
        for user in users:
            if user["name"] == self.name and user["password"] == self.password:
                return
        #If no match found, raise an error
        raise HTTPException(status_code=401, detail="Invalid username or password")

The __init__ method receives the parameters from the request (name and password) and stores them as instance attributes. These can then be accessed in the __call__ method, which contains the dependency logic.

Note that __call__ doesn't return a value in this example. It returns early when the credentials match and raises an HTTPException otherwise. Defining __call__ makes the instance callable like a regular function, but Depends(UserAuth) only calls the class itself, that is, its constructor. FastAPI never invokes __call__ on the resulting instance, which is why __init__ ends with self() to run the check.
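As a plain-Python aside, a callable class involves two separate kinds of call: calling the class runs __init__, while calling the resulting instance runs __call__. Passing a class to Depends() gives FastAPI the class, so the first kind of call is the one it makes. Greeter below is a hypothetical class used only to show the difference.

```python
class Greeter:
    def __init__(self, name: str):
        self.name = name
        self.calls = 0  # counts how many times __call__ has run

    def __call__(self) -> str:
        self.calls += 1
        return f"Hello, {self.name}"


greeter = Greeter("Seke")   # calling the class runs __init__ only
assert greeter.calls == 0   # __call__ has not run yet
message = greeter()         # calling the instance runs __call__
print(message, greeter.calls)
```

Any logic that must run for every request therefore has to be reachable from whichever of the two calls the framework actually performs.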

Here’s how to inject UserAuth into a path function:

#Injecting the class dependency into a path operation
@app.get("/user/dashboard")
def get_dashboard(user: UserAuth = Depends(UserAuth)):
    return {"message": f"Access granted to {user.name}"}

What's happening here:

When a client requests the /user/dashboard endpoint, FastAPI executes the dependency first. Recognizing UserAuth as a class, FastAPI automatically creates an instance and populates it with values from the query parameters.

Here’s the execution flow to help you understand:

Depends(UserAuth) tells FastAPI: “Before running this route, create a UserAuth instance.”
FastAPI extracts name and password from the request URL (for example, /user/dashboard?name=Seke&password=SK99!).
It then calls UserAuth(name="Seke", password="SK99!") to create the instance.

The UserAuth instance, with its stored name and password attributes, is passed to the user parameter in get_dashboard.
The route function can access user.name and user.password directly.
If __call__ raises an exception, the route never executes.

Test the endpoint with valid credentials from the users list. For example, requesting /user/dashboard?name=Seke&password=SK99! returns {"message": "Access granted to Seke"}.

A closer look at FastAPI’s official documentation provides an alternative approach to classes as dependencies. However, using the __call__ method, in my opinion, is the most straightforward and self-contained approach. It keeps your authentication logic modular without adding extra code to the path operation.

The trade-off is that class dependencies are more verbose than helper functions, but cleaner for complex logic.

Dependency Scope

FastAPI offers two ways to inject dependencies into a path operation: as a function parameter or via the path decorator. When you include a dependency as a function parameter, the dependency's return value is available within the function. But when injected into the decorator, the dependency executes without passing a return value to the path function.

Beyond single endpoints, FastAPI lets you inject dependencies at the router or global level. Let’s examine these scopes in more detail.

Path Operation Level

While the first example injected dependencies into path function parameters, you can also inject them directly into the decorator using the dependencies parameter. This approach is useful for side-effects (for example, authentication guards, rate limiting or request logging) where the return data is not required in the path operation.

Replace the previous code in fastapi-deps/function_deps.py with this:

#dep function to pass in decorator
def user_dep(name: str, password: str):
    for u in users:
        if u["name"] == name and u["password"] == password:
            return
    raise HTTPException(status_code=401, detail="Invalid username or password")

#path function
@app.get("/users/{user}", dependencies=[Depends(user_dep)])
def get_user() -> dict:
    return {"message" : "Access granted!"}

This decorator-based dependency acts as a pre-check before the endpoint executes. It validates credentials without passing any values to the path function. On authentication failure, FastAPI raises an HTTPException and prevents the path operation from running.

If you test this using a valid name and password from the in-memory database, the response body should be {"message": "Access granted!"}.

Router Level

Injecting dependencies at the router level allows multiple endpoints to share common logic without repeating the dependency in each route.

We'll use the same user_dep function but inject it at the router level. Add these imports to fastapi-deps/router_deps.py:

from fastapi import APIRouter, Depends

#import the dependency function
from function_deps import user_dep

Then, create an APIRouter instance, passing your dependency to the dependencies parameter. This makes the dependency run automatically for every route you define under this router.

In this example, user_dep executes before get_user() and any other endpoints you add to the router, eliminating the need to declare it on each route.

router = APIRouter(prefix="/users", dependencies=[Depends(user_dep)])

#define the routes with or without additional dependencies
@router.get("/{user}")
def get_user() -> dict:
    return {"message" : "Access granted!"}

In your main application file (app.py), import the router and register it with your FastAPI application using include_router(). This makes all routes defined in the router accessible through your application.

from fastapi import FastAPI
import uvicorn
from router_deps import router as user_router

app = FastAPI()
app.include_router(user_router)

if __name__ == "__main__":
    uvicorn.run("app:app", reload=True)

Start your server and test the route using a valid name–password pair from the users list, then try a mismatched one. You should get a 200 status for the correct credentials and 401 for invalid ones.

Application Level

Application-level dependencies (also called global dependencies) are defined when instantiating the FastAPI app and apply to every route in your application. Unlike router-level dependencies that target specific endpoint groups, app-level dependencies extend across the entire application. Any dependency injected into the FastAPI app object will automatically execute for all path functions.

Let's inject a simple logging dependency alongside the user authentication dependency we've used throughout this article.

Update fastapi-deps/app.py with this code:

from fastapi import FastAPI, Depends
import uvicorn
from function_deps import user_dep
from router_deps import router as user_router
from datetime import datetime

#Basic logging dependency
def log_request():
    print(f"[{datetime.now()}] Request received.")

app = FastAPI(dependencies=[Depends(log_request), Depends(user_dep)])
app.include_router(user_router)

@app.get("/home")
def get_main():
    return "Welcome back!!!"


if __name__ == "__main__":
    uvicorn.run("app:app", reload=True)

When you send a request to any endpoint within this application, log_request acknowledges it and outputs what time the request was made. Since we aren’t sending the logs to any database in particular, it simply prints a line in the form [<timestamp>] Request received. to the terminal (or console).

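Request the /home endpoint with valid credentials (for example, /home?name=Seke&password=SK99!) using your browser, cURL, Postman, or the Swagger UI. The credentials are required because user_dep now runs for every route. You should get "Welcome back!!!" in response.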

💡

Although the same authentication and logging logic applies to every registered route, the specific message users see depends on what you program into each route.

Common Use Cases for Dependency Injection

Dependency injection solves several common challenges in API development. Here are the most frequent use cases where you'll apply this pattern.

Database Connections: Reusing connection logic across multiple endpoints prevents connection leaks and ensures each request has an isolated session.
Authentication & Authorization: Dependencies help validate tokens and verify user roles across protected routes.
Logging & Monitoring: A logging dependency can automatically record each request to your monitoring system or database. It is beneficial for debugging and tracking API usage.
Rate Limiting: You can control request frequency and prevent API abuse by injecting rate-limiting logic in path functions.
Configuration & Settings: FastAPI’s dependency injection system simplifies configuration management by letting you inject settings such as API keys or environment variables wherever needed, keeping your code consistent.
Pagination & Filtering: Injecting common parameters like page_size and limit standardizes data retrieval patterns across endpoints.
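As one illustration of the pagination use case, the sketch below defines a small function that normalizes paging inputs. In a FastAPI app it could be injected with Depends(pagination), and FastAPI would fill page and page_size from the query string. The function name, the defaults, and the upper bound of 100 are all assumptions made for this example.

```python
def pagination(page: int = 1, page_size: int = 20) -> dict:
    """Normalize paging inputs into an offset/limit pair."""
    page = max(page, 1)
    page_size = min(max(page_size, 1), 100)  # 100 is an assumed upper bound
    return {"offset": (page - 1) * page_size, "limit": page_size}


items = list(range(1, 51))  # pretend these are 50 database rows
window = pagination(page=2, page_size=10)
page_items = items[window["offset"]: window["offset"] + window["limit"]]
print(page_items)
```

Keeping the clamping rules in one function means every endpoint that lists data applies the same limits.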

Conclusion

FastAPI's dependency injection system helps you manage shared logic and resources efficiently while adhering to DRY principles. However, knowing when to inject a dependency versus when to skip it is a skill that comes with practice.

Dependency injection isn't needed for simple, standalone logic. But for resources requiring lifecycle management, shared logic, or modularity, FastAPI's dependency injection system simplifies checks and app operations—with or without return values.

How to Deploy Your FastAPI + PostgreSQL App on Render: A Beginner's Guide

Preston Osoro — Thu, 22 May 2025 15:55:44 +0000

This guide is a comprehensive roadmap for deploying a FastAPI backend connected to a PostgreSQL database using Render, a cloud platform that supports hosting Python web apps and managed PostgreSQL databases.

You can find the complete source code here.

Deployment Context

When deploying a FastAPI app connected to PostgreSQL, you need to select a platform that supports Python web applications and managed databases. This guide uses Render as the example platform because it provides both web hosting and a PostgreSQL database service in one environment, making it straightforward to connect your backend with the database.

You can apply the concepts here to other cloud providers as well, but the steps will differ depending on the platform’s specifics.

Here’s what we’ll cover:

Project Structure for a Real-World FastAPI App
What You'll Need Before You Start
Deployment Steps
Local Development Workflow
Best Practices and Common Troubleshooting Tips
Common Issues and Solutions
Conclusion

Project Structure

If you’re building a real-world API with FastAPI you’ll quickly outgrow a single main.py file. That’s when modular project structure becomes essential for maintainability.

Here’s an example structure we’ll use throughout this guide:

FastAPI/
├── database/
│   ├── base.py
│   ├── database.py
│   └── __init__.py
├── fastapi_app/
│   └── main.py
├── items/
│   ├── models/
│   │   ├── __init__.py
│   │   └── item.py
│   ├── routes/
│   │   ├── __init__.py
│   │   └── item.py
│   └── schemas/
│       ├── __init__.py
│       └── item.py
├── models/
│   └── __init__.py
├── orders/
│   ├── models/
│   │   ├── __init__.py
│   │   └── order.py
│   ├── routes/
│   │   ├── __init__.py
│   │   └── order.py
│   └── schemas/
│       ├── __init__.py
│       └── order.py
└── users/
    ├── models/
    │   ├── __init__.py
    │   └── user.py
    ├── routes/
    │   ├── __init__.py
    │   └── user.py
    └── schemas/
        ├── __init__.py
        └── user.py

What You'll Need Before You Start

Before diving in, make sure you've got:

A free Render account (sign up if you don't have one)
A GitHub or GitLab repository for your FastAPI project
Basic familiarity with Python, FastAPI, and Git
Your project structure set up similarly to the example above

Deployment Steps

Step 1: Set Up Local PostgreSQL Database

For local development, you'll need to set up PostgreSQL on your machine like this:

-- 1. Log in as superuser
psql -U postgres

-- 2. Create a new database
CREATE DATABASE your_db;

-- 3. Create a user with password
CREATE USER your_user WITH PASSWORD 'your_secure_password';

-- 4. Grant all privileges on the database
GRANT ALL PRIVILEGES ON DATABASE your_db TO your_user;

-- 5. (Optional) Allow the user to create new databases
ALTER USER your_user CREATEDB;

-- 6. Exit
\q

After setting up your local database, create a .env file in your project root:

DATABASE_URL=postgresql://your_user:your_secure_password@localhost:5432/your_db

Step 2: Set Up Your Database Connection

Create database/database.py to manage your PostgreSQL connection with SQLAlchemy. This file is crucial because it creates the database engine, defines session management, and provides a dependency function for your routes:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
import os
from dotenv import load_dotenv

load_dotenv()

DATABASE_URL = os.getenv("DATABASE_URL")
# The engine manages the connection to the database and handles query execution.
engine = create_engine(DATABASE_URL)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

# Database dependency for routes
def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()
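To see why get_db uses yield inside try/finally, the sketch below drives the same generator by hand. FakeSession is a hypothetical stand-in for a SQLAlchemy session so that no database is needed, and the manual next()/close() calls only approximate what FastAPI does around a request.

```python
class FakeSession:
    """Hypothetical stand-in for a SQLAlchemy session."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def get_db():
    db = FakeSession()
    try:
        yield db
    finally:
        db.close()


gen = get_db()
session = next(gen)              # runs up to `yield`: the session is handed to the route
open_during_request = not session.closed
gen.close()                      # ends the generator, so the `finally` block runs
print(open_during_request, session.closed)
```

The code before yield is the setup, the yielded value is what the route receives, and the finally block is the cleanup that runs even if the route raised an error.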

And add database/base.py for the base class:

from sqlalchemy.orm import declarative_base  # preferred import location since SQLAlchemy 1.4
Base = declarative_base()

Step 3: Configure Your FastAPI Main Application

Create the main FastAPI application file, fastapi_app/main.py, which imports all your route modules:

import os
from fastapi import FastAPI, APIRouter
from fastapi.openapi.utils import get_openapi
from fastapi.security import OAuth2PasswordBearer
import uvicorn
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Database imports
from database import Base, engine

# Import models to ensure they're registered with SQLAlchemy
import models

# Import router modules
from items.routes import item_router
from orders.routes import order_router
from users.routes import user_router

# Initialize FastAPI app
app = FastAPI(
    title="Store API",
    version="1.0.0",
    description="API documentation for Store API"
)

# Create database tables on startup
Base.metadata.create_all(bind=engine)

# Root endpoint
@app.get("/")
async def root():
    return {"message": "Welcome to FastAPI Store"}

# Setup versioned API router and include module routers
api_router = APIRouter(prefix="/v1")
api_router.include_router(item_router)
api_router.include_router(order_router)
api_router.include_router(user_router)

# Register the master router with the app
app.include_router(api_router)

# Setup OAuth2 scheme for Swagger UI login flow
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="/v1/auth/login")

# Custom OpenAPI schema with security configuration
def custom_openapi():
    if app.openapi_schema:
        return app.openapi_schema

    openapi_schema = get_openapi(
        title=app.title,
        version=app.version,
        description=app.description,
        routes=app.routes,
    )

    # Add security scheme (create "components" first if the schema has none)
    openapi_schema.setdefault("components", {})["securitySchemes"] = {
        "BearerAuth": {
            "type": "http",
            "scheme": "bearer",
            "bearerFormat": "JWT",
        }
    }

    # Apply global security requirement
    openapi_schema["security"] = [{"BearerAuth": []}]

    app.openapi_schema = openapi_schema
    return app.openapi_schema

app.openapi = custom_openapi

# Run the app using Uvicorn when executed directly
if __name__ == "__main__":
    port = os.environ.get("PORT")
    if not port:
        raise EnvironmentError("PORT environment variable is not set")
    uvicorn.run("fastapi_app.main:app", host="0.0.0.0", port=int(port), reload=False)

Step 4: Create a Requirements File

In your project root, create a requirements.txt file that includes all the necessary dependencies:

fastapi>=0.68.0
uvicorn>=0.15.0
sqlalchemy>=1.4.23
psycopg2-binary>=2.9.1
python-dotenv>=0.19.0
pydantic>=1.8.2

Step 5: Provision a PostgreSQL Database on Render

Log in to your Render dashboard, then click "New +" in the top right and select "PostgreSQL".

Fill in the details:

Name: your-app-db (choose a descriptive name)
Database: your_app (this will be your database name)
User: leave default (auto-generated)
Region: Choose the closest to your target users
Plan: Free tier

Save and note the Internal Database URL shown after creation, which will look something like this:

postgres://user:password@postgres-instance.render.com/your_app
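One detail worth checking: SQLAlchemy 1.4 and later only recognize the postgresql:// scheme, so a URL that begins with postgres://, like the sample above, fails when passed to create_engine. If the URL you copy has the older prefix, a small helper can rewrite it before the engine is created. The name normalize_database_url is hypothetical and the function is only a sketch.

```python
def normalize_database_url(url: str) -> str:
    """Rewrite the legacy postgres:// scheme to postgresql://, leaving other URLs unchanged."""
    legacy_prefix = "postgres://"
    if url.startswith(legacy_prefix):
        return "postgresql://" + url[len(legacy_prefix):]
    return url


sample = "postgres://user:password@postgres-instance.render.com/your_app"
print(normalize_database_url(sample))
```

In database/database.py you could wrap the os.getenv("DATABASE_URL") value with this helper, or simply edit the prefix by hand when you paste the value into Render's environment settings.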

Step 6: Deploy Your FastAPI App on Render

With your database provisioned, it's time to deploy your API. You can do that by following these steps:

In the Render dashboard, click "New +" and select "Web Service"
Connect your GitHub/GitLab repository
Name your service
Then configure the build settings:
- Environment: Python 3
- Build Command: pip install -r requirements.txt
- Start Command: python3 -m fastapi_app.main
Add your environment variables:
- Click "Environment" tab
- Add your database URL:
  - Key: DATABASE_URL
  - Value: Paste the Internal Database URL from your PostgreSQL service
- Add any other environment variables your application needs
Finally, click Deploy Web Service.
- Render will start building and deploying your application
- This process takes a few minutes. You can monitor logs during build and deployment in real-time

Step 7: Test Your API Endpoints

Once deployed, access your API’s URL (for example, https://your-app-name.onrender.com).

Navigate to /docs to open the interactive Swagger UI, where you can test your endpoints directly:

Expand an endpoint
Click Try it out
Provide any required input
Click Execute
View the response

Local Development Workflow

While your app is deployed, you'll still need to work on it locally. Here's how to maintain a smooth development workflow:

First, create a local .env file (don't commit this to Git):

DATABASE_URL=postgresql://username:password@localhost:5432/your_local_db

Then install your dependencies in a virtual environment:

python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

Next, run your local server:

python3 -m fastapi_app.main

This command triggers the __main__ block in fastapi_app/main.py, which starts the FastAPI app using Uvicorn. It reads the PORT from your environment, so ensure it's set (e.g., via a .env file).

Then make changes to your code and test locally before pushing to GitHub/GitLab. You can push your changes to automatically trigger a new deployment on Render.

Best Practices and Tips

Use database migrations: Add Alembic to your project for managing schema changes
```
 pip install alembic
 alembic init migrations
```

Separate development and production configurations:

 import os

 if os.environ.get("ENVIRONMENT") == "production":
     # Production settings (DEBUG is a placeholder name)
     DEBUG = False
 else:
     # Development settings
     DEBUG = True

Monitor your application:
- Render provides logs and metrics for your application. You can set up alerts for errors or high resource usage.
Optimize database queries:
- Use SQLAlchemy's relationship loading options.
- Consider adding indexes to frequently queried fields.
Scale when needed:
- Render allows you to upgrade your plan as your application grows. Consider upgrading your database plan for production applications.
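The development/production split mentioned in the tips above can also be gathered into one settings object, so the rest of the code never reads environment variables directly. This is a sketch under assumptions: the Settings fields, the load_settings name, and the "debug only outside production" rule are illustrative choices, not part of the project.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    environment: str
    debug: bool
    database_url: str


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build settings from an environment mapping such as os.environ."""
    environment = env.get("ENVIRONMENT", "development")
    is_production = environment == "production"
    return Settings(
        environment=environment,
        debug=not is_production,           # assumed convention: debug only outside production
        database_url=env["DATABASE_URL"],  # fail fast with KeyError if it is missing
    )


settings = load_settings(
    {"ENVIRONMENT": "production", "DATABASE_URL": "postgresql://localhost/your_db"}
)
print(settings)
```

Passing the mapping in as an argument, instead of reading os.environ inside the function, keeps the loader easy to test with plain dictionaries.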

Common Issues and Solutions

When deploying a Python web app on Render, a few issues can commonly occur. Here's a more detailed look at them and how you can resolve each one.

Database connection errors:

If your app can’t connect to the database, first double-check that your DATABASE_URL environment variable is correctly set in your Render dashboard. Make sure the URL includes the right username, password, host, port, and database name.

Also, confirm that your SQLAlchemy models match the actual schema in your database. A mismatch here can lead to errors during migrations or app startup. If you're using Postgres, ensure that the database user has permission to read/write tables and perform migrations.

Deployment fails entirely:

When deployment fails, Render usually provides helpful logs under the “Events” tab. Check there for any error messages. A few common culprits include:

A missing requirements.txt file or forgotten dependencies.
A bad start command in the Render settings. Double-check that it points to your correct entry point (for example, gunicorn app:app or uvicorn main:app --host=0.0.0.0 --port=10000).
An unsupported or mismatched Python version. On Render, you can pin the version by setting the PYTHON_VERSION environment variable (for example, 3.11.1).

API returns 500 Internal Server errors:

Internal server errors can happen for several reasons. To debug:

Open your Render logs and look for Python tracebacks or unhandled exceptions.
Try to reproduce the issue locally using the same request and data.
Add try/except blocks around critical logic to capture and log errors more gracefully.

Even better, set up structured logging or error tracking (for example, with Sentry) to catch these before your users do.
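A minimal sketch of the try/except advice, using only the standard logging module: the logger name store_api and both functions are hypothetical, and a real handler would wrap whatever logic is critical in your own routes.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("store_api")  # logger name is an assumption


def parse_quantity(raw: str) -> int:
    """Hypothetical piece of critical logic that can fail on bad input."""
    return int(raw)


def safe_parse_quantity(raw: str, default: int = 0) -> int:
    try:
        return parse_quantity(raw)
    except ValueError:
        # logger.exception records the message plus the full traceback
        logger.exception("Could not parse quantity %r", raw)
        return default


print(safe_parse_quantity("3"), safe_parse_quantity("three"))
```

Catching the specific exception you expect, rather than a bare except, keeps genuinely unexpected failures visible in the Render logs.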

Slow response times:

If your app is slow or intermittently timing out, check:

Whether you're still on the free Render tier, which has limited CPU and memory. Consider upgrading if you’re handling production-level traffic.
If you're running heavy or unoptimized database queries, inspect them with PostgreSQL's EXPLAIN ANALYZE, or pass echo=True to create_engine so SQLAlchemy logs every statement it runs.
If you’re frequently fetching the same data, try caching it using a lightweight in-memory cache like functools.lru_cache or a Redis instance.
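As a rough sketch of the in-memory caching option, functools.lru_cache can wrap a read-heavy lookup so repeated calls with the same argument skip the slow path. The get_category_name function and the lookups counter are hypothetical; a real application would query the database where the placeholder string is built.

```python
from functools import lru_cache

lookups = 0  # counts how often the slow path actually runs


@lru_cache(maxsize=128)
def get_category_name(category_id: int) -> str:
    """Hypothetical expensive lookup; a real app would query the database here."""
    global lookups
    lookups += 1
    return f"category-{category_id}"


get_category_name(1)
get_category_name(1)  # served from the cache, so the function body does not run again
get_category_name(2)
print(lookups, get_category_name.cache_info().hits)
```

Keep in mind that this cache lives inside a single process and never expires on its own, so call get_category_name.cache_clear() after the underlying data changes, or prefer Redis when several workers need to share cached values.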

Conclusion

Deploying a FastAPI app connected to PostgreSQL on Render is straightforward with the right structure and setup. While this guide used Render as an example, the concepts apply broadly across cloud platforms.

With this setup, you can develop, test, and deploy robust Python APIs backed by PostgreSQL databases efficiently.

The free tier on Render has some limitations, including PostgreSQL databases that expire after 90 days unless upgraded. For production applications, consider upgrading to a paid plan for better performance and reliability.

Happy coding!

Use the FARM Stack to Develop Full Stack Apps

Beau Carnes — Wed, 18 Sep 2024 13:55:42 +0000

The FARM stack is a modern web development stack that combines three powerful technologies: FastAPI, React, and MongoDB. This full-stack solution provides developers with a robust set of tools to build scalable, efficient, and high-performance web applications.

In this article, I'll be giving you an introduction to each of the key technologies, and then we'll build a project using the FARM stack and Docker so you can see how everything works together.

This article is based on a course I created on the freeCodeCamp.org YouTube channel. Watch it here:

Introduction to the FARM Stack

The FARM in FARM stack stands for:

F: FastAPI (Backend)
R: React (Frontend)
M: MongoDB (Database)

The FARM stack is designed to leverage the strengths of each component, allowing developers to create feature-rich applications with a smooth development experience.

Components of FARM Stack

FastAPI: FastAPI is a modern, high-performance Python web framework for building APIs. It's designed to be easy to use, fast to code, and ready for production environments. FastAPI is built on top of Starlette for the web parts and Pydantic for the data parts, making it a powerful choice for building robust backend services.
React: React is a popular JavaScript library for building user interfaces. Developed and maintained by Facebook, React allows developers to create reusable UI components that efficiently update and render as data changes. Its component-based architecture and virtual DOM make it an excellent choice for building dynamic and responsive frontend applications.
MongoDB: MongoDB is a document-oriented NoSQL database. It stores data in flexible, JSON-like documents, meaning fields can vary from document to document and data structure can be changed over time. This flexibility makes MongoDB an ideal choice for applications that need to evolve quickly and handle diverse data types.

Advantages of using FARM Stack

High Performance: FastAPI is one of the fastest Python frameworks available, while React's virtual DOM ensures efficient UI updates. MongoDB's document model allows for quick reads and writes.
Scalability: All components of the FARM stack are designed to scale. FastAPI can handle concurrent requests efficiently, React applications can manage complex UIs, and MongoDB can distribute data across multiple servers.
Community and Ecosystem: All three technologies have large, active communities and rich ecosystems of libraries and tools.
Flexibility: The FARM stack is flexible enough to accommodate various types of web applications, from simple CRUD apps to complex, data-intensive systems.

By combining these technologies, the FARM stack provides a comprehensive solution for building modern web applications. It allows developers to create fast, scalable backends with FastAPI, intuitive and responsive frontends with React, and flexible, efficient data storage with MongoDB. This stack is particularly well-suited for applications that require real-time updates, complex data models, and high performance.

Project Overview: Todo Application

In the video course, I cover more about each individual technology in the FARM Stack. But in this article, we are going to jump right into a project to put everything together.

We will be creating a todo application to help us understand the FARM stack. Before we start creating the application, let’s discuss its features and software architecture.

Features of the todo application

Our FARM stack todo application will include the following features:

Multiple Todo Lists:
- Users can create, view, update, and delete multiple todo lists.
- Each list has a name and contains multiple todo items.
Todo Items:
- Within each list, users can add, view, update, and delete todo items.
- Each item has a label, a checked/unchecked status, and belongs to a specific list.
Real-time Updates:
- The UI updates in real-time when changes are made to lists or items.
Responsive Design:
- The application will be responsive and work well on both desktop and mobile devices.

System architecture

Our todo application will follow a typical FARM stack architecture:

Frontend (React):
- Provides the user interface for interacting with todo lists and items.
- Communicates with the backend via RESTful API calls.
Backend (FastAPI):
- Handles API requests from the frontend.
- Implements business logic for managing todo lists and items.
- Interacts with the MongoDB database for data persistence.
Database (MongoDB):
- Stores todo lists and items.
- Provides efficient querying and updating of todo data.
Docker:
- Containerizes each component (frontend, backend, database) for easy development and deployment.

Data model design

Our MongoDB data model will consist of two main structures:

Todo List:

   {
     "_id": ObjectId,
     "name": String,
     "items": [
       {
         "id": String,
         "label": String,
         "checked": Boolean
       }
     ]
   }

List Summary (for displaying in the list of all todo lists):

   {
     "_id": ObjectId,
     "name": String,
     "item_count": Integer
   }
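To make the relationship between the two shapes concrete, here is a minimal Python sketch that derives a list summary from a full todo list document. The `summarize` helper and the sample values are assumptions for illustration only. In the application itself, MongoDB computes the summary inside the data access layer.

```python
def summarize(todo_list: dict) -> dict:
    """Build the List Summary shape from a full Todo List document."""
    return {
        "_id": todo_list["_id"],
        "name": todo_list["name"],
        "item_count": len(todo_list["items"]),
    }


# Sample document; the ID and item values are made up for illustration.
groceries = {
    "_id": "665f1c2e9b1d8a3f4c7e2a10",
    "name": "Groceries",
    "items": [
        {"id": "a1", "label": "Milk", "checked": False},
        {"id": "b2", "label": "Eggs", "checked": True},
    ],
}

print(summarize(groceries)["item_count"])  # 2
```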

API endpoint design

Our FastAPI backend will expose the following RESTful endpoints:

Todo Lists:
- GET /api/lists: Retrieve all todo lists (summary view)
- POST /api/lists: Create a new todo list
- GET /api/lists/{list_id}: Retrieve a specific todo list with all its items
- DELETE /api/lists/{list_id}: Delete a specific todo list
Todo Items:
- POST /api/lists/{list_id}/items: Add a new item to a specific list
- PATCH /api/lists/{list_id}/checked_state: Update the checked state of an item
- DELETE /api/lists/{list_id}/items/{item_id}: Delete a specific item from a list

This project will provide a solid foundation in FARM stack development and Docker containerization, which you can then expand upon for more complex applications in the future.

So let's get started with the project.

Project Tutorial

Project Setup and Backend Development

Step 1: Set up the project structure

Create a new directory for your project:

   mkdir farm-stack-todo
   cd farm-stack-todo

Create subdirectories for the backend and frontend:

   mkdir backend frontend

Step 2: Set up the backend environment

Navigate to the backend directory:

   cd backend

Create a virtual environment and activate it:

   python -m venv venv
   source venv/bin/activate  # On Windows, use: venv\Scripts\activate

Create the following files in the backend directory:

- Dockerfile
- pyproject.toml

In your terminal, install the required packages:

pip install "fastapi[all]" "motor[srv]" beanie aiostream

Generate the requirements.txt file:

pip freeze > requirements.txt

After generating the requirements.txt file, you can install the same dependencies in another environment using:

   pip install -r requirements.txt

Add the following content to Dockerfile:

   FROM python:3

   WORKDIR /usr/src/app
   COPY requirements.txt ./

   RUN pip install --no-cache-dir --upgrade -r ./requirements.txt

   EXPOSE 3001

   CMD [ "python", "./src/server.py" ]

Add the following content to pyproject.toml:

   [tool.pytest.ini_options]
   pythonpath = "src"

Step 4: Set up the backend structure

Create a src directory inside the backend directory:

   mkdir src

Create the following files inside the src directory:

- server.py
- dal.py

Step 5: Implement the Data Access Layer (DAL)

Open src/dal.py and add the following content:

from bson import ObjectId
from motor.motor_asyncio import AsyncIOMotorCollection
from pymongo import ReturnDocument

from pydantic import BaseModel

from uuid import uuid4

class ListSummary(BaseModel):
  id: str
  name: str
  item_count: int

  @staticmethod
  def from_doc(doc) -> "ListSummary":
      return ListSummary(
          id=str(doc["_id"]),
          name=doc["name"],
          item_count=doc["item_count"],
      )

class ToDoListItem(BaseModel):
  id: str
  label: str
  checked: bool

  @staticmethod
  def from_doc(item) -> "ToDoListItem":
      return ToDoListItem(
          id=item["id"],
          label=item["label"],
          checked=item["checked"],
      )

class ToDoList(BaseModel):
  id: str
  name: str
  items: list[ToDoListItem]

  @staticmethod
  def from_doc(doc) -> "ToDoList":
      return ToDoList(
          id=str(doc["_id"]),
          name=doc["name"],
          items=[ToDoListItem.from_doc(item) for item in doc["items"]],
      )

class ToDoDAL:
  def __init__(self, todo_collection: AsyncIOMotorCollection):
      self._todo_collection = todo_collection

  async def list_todo_lists(self, session=None):
      async for doc in self._todo_collection.find(
          {},
          projection={
              "name": 1,
              "item_count": {"$size": "$items"},
          },
          sort={"name": 1},
          session=session,
      ):
          yield ListSummary.from_doc(doc)

  async def create_todo_list(self, name: str, session=None) -> str:
      response = await self._todo_collection.insert_one(
          {"name": name, "items": []},
          session=session,
      )
      return str(response.inserted_id)

  async def get_todo_list(self, id: str | ObjectId, session=None) -> ToDoList:
      doc = await self._todo_collection.find_one(
          {"_id": ObjectId(id)},
          session=session,
      )
      return ToDoList.from_doc(doc)

  async def delete_todo_list(self, id: str | ObjectId, session=None) -> bool:
      response = await self._todo_collection.delete_one(
          {"_id": ObjectId(id)},
          session=session,
      )
      return response.deleted_count == 1

  async def create_item(
      self,
      id: str | ObjectId,
      label: str,
      session=None,
  ) -> ToDoList | None:
      result = await self._todo_collection.find_one_and_update(
          {"_id": ObjectId(id)},
          {
              "$push": {
                  "items": {
                      "id": uuid4().hex,
                      "label": label,
                      "checked": False,
                  }
              }
          },
          session=session,
          return_document=ReturnDocument.AFTER,
      )
      if result:
          return ToDoList.from_doc(result)

  async def set_checked_state(
      self,
      doc_id: str | ObjectId,
      item_id: str,
      checked_state: bool,
      session=None,
  ) -> ToDoList | None:
      result = await self._todo_collection.find_one_and_update(
          {"_id": ObjectId(doc_id), "items.id": item_id},
          {"$set": {"items.$.checked": checked_state}},
          session=session,
          return_document=ReturnDocument.AFTER,
      )
      if result:
          return ToDoList.from_doc(result)

  async def delete_item(
      self,
      doc_id: str | ObjectId,
      item_id: str,
      session=None,
  ) -> ToDoList | None:
      result = await self._todo_collection.find_one_and_update(
          {"_id": ObjectId(doc_id)},
          {"$pull": {"items": {"id": item_id}}},
          session=session,
          return_document=ReturnDocument.AFTER,
      )
      if result:
          return ToDoList.from_doc(result)
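If the update operators in create_item, set_checked_state, and delete_item are new to you, the following sketch imitates what each one does to a single document, using plain Python dictionaries. It is an in-memory stand-in rather than MongoDB code, and the function names `push_item`, `set_checked`, and `pull_item` are hypothetical.

```python
from uuid import uuid4


def push_item(doc: dict, label: str) -> dict:
    """Like {"$push": {"items": {...}}}: append a new unchecked item."""
    doc["items"].append({"id": uuid4().hex, "label": label, "checked": False})
    return doc


def set_checked(doc: dict, item_id: str, checked: bool) -> dict:
    """Like {"$set": {"items.$.checked": ...}}: update the first matching item."""
    for item in doc["items"]:
        if item["id"] == item_id:
            item["checked"] = checked
            break
    return doc


def pull_item(doc: dict, item_id: str) -> dict:
    """Like {"$pull": {"items": {"id": ...}}}: drop every item with that id."""
    doc["items"] = [item for item in doc["items"] if item["id"] != item_id]
    return doc


todo = {"name": "Chores", "items": []}
push_item(todo, "Sweep")
first_id = todo["items"][0]["id"]
set_checked(todo, first_id, True)
print(todo["items"][0]["checked"])  # True
pull_item(todo, first_id)
print(todo["items"])  # []
```

One difference worth keeping in mind: the real set_checked_state returns None when no item matches, because the query itself includes the item ID, while this sketch simply returns the unchanged document.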

This concludes Part 1 of the tutorial, where we set up the project structure and implemented the Data Access Layer for our FARM stack todo application. In the next part, we'll implement the FastAPI server and create the API endpoints.

Implementing the FastAPI Server

Step 6: Implement the FastAPI server

Open src/server.py and add the following content:

from contextlib import asynccontextmanager
from datetime import datetime
import os
import sys

from bson import ObjectId
from fastapi import FastAPI, status
from motor.motor_asyncio import AsyncIOMotorClient
from pydantic import BaseModel
import uvicorn

from dal import ToDoDAL, ListSummary, ToDoList

COLLECTION_NAME = "todo_lists"
MONGODB_URI = os.environ["MONGODB_URI"]
DEBUG = os.environ.get("DEBUG", "").strip().lower() in {"1", "true", "on", "yes"}


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup:
    client = AsyncIOMotorClient(MONGODB_URI)
    database = client.get_default_database()

    # Ensure the database is available:
    pong = await database.command("ping")
    if int(pong["ok"]) != 1:
        raise Exception("Cluster connection is not okay!")

    todo_lists = database.get_collection(COLLECTION_NAME)
    app.todo_dal = ToDoDAL(todo_lists)

    # Yield back to FastAPI Application:
    yield

    # Shutdown:
    client.close()


app = FastAPI(lifespan=lifespan, debug=DEBUG)


@app.get("/api/lists")
async def get_all_lists() -> list[ListSummary]:
    return [i async for i in app.todo_dal.list_todo_lists()]


class NewList(BaseModel):
    name: str


class NewListResponse(BaseModel):
    id: str
    name: str


@app.post("/api/lists", status_code=status.HTTP_201_CREATED)
async def create_todo_list(new_list: NewList) -> NewListResponse:
    return NewListResponse(
        id=await app.todo_dal.create_todo_list(new_list.name),
        name=new_list.name,
    )


@app.get("/api/lists/{list_id}")
async def get_list(list_id: str) -> ToDoList:
    """Get a single to-do list"""
    return await app.todo_dal.get_todo_list(list_id)


@app.delete("/api/lists/{list_id}")
async def delete_list(list_id: str) -> bool:
    return await app.todo_dal.delete_todo_list(list_id)


class NewItem(BaseModel):
    label: str


class NewItemResponse(BaseModel):
    id: str
    label: str


@app.post(
    "/api/lists/{list_id}/items/",
    status_code=status.HTTP_201_CREATED,
)
async def create_item(list_id: str, new_item: NewItem) -> ToDoList:
    return await app.todo_dal.create_item(list_id, new_item.label)


@app.delete("/api/lists/{list_id}/items/{item_id}")
async def delete_item(list_id: str, item_id: str) -> ToDoList:
    return await app.todo_dal.delete_item(list_id, item_id)


class ToDoItemUpdate(BaseModel):
    item_id: str
    checked_state: bool


@app.patch("/api/lists/{list_id}/checked_state")
async def set_checked_state(list_id: str, update: ToDoItemUpdate) -> ToDoList:
    return await app.todo_dal.set_checked_state(
        list_id, update.item_id, update.checked_state
    )


class DummyResponse(BaseModel):
    id: str
    when: datetime


@app.get("/api/dummy")
async def get_dummy() -> DummyResponse:
    return DummyResponse(
        id=str(ObjectId()),
        when=datetime.now(),
    )


def main(argv=sys.argv[1:]):
    try:
        uvicorn.run("server:app", host="0.0.0.0", port=3001, reload=DEBUG)
    except KeyboardInterrupt:
        pass


if __name__ == "__main__":
    main()

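This implementation sets up the FastAPI server, connects to MongoDB during startup, and defines the API endpoints for our todo application. Note that it does not add CORS middleware: the Nginx proxy configured later serves the frontend and the API from the same origin.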
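The lifespan function in server.py may look unusual if you have not used async context managers before. Here is a minimal sketch of the same pattern using only the standard library, so you can see the order in which things run. `FakeClient` and the dictionary standing in for the app are hypothetical, and the try/finally is a choice made for this sketch rather than something the server code does.

```python
import asyncio
from contextlib import asynccontextmanager

events = []


class FakeClient:
    """Hypothetical stand-in for a database client."""

    def close(self):
        events.append("client closed")


@asynccontextmanager
async def lifespan(app: dict):
    # Startup: create the shared resource and attach it to the app.
    client = FakeClient()
    app["client"] = client
    events.append("startup")
    try:
        # Control returns to the caller here while the app is running.
        yield
    finally:
        # Shutdown: runs once the caller leaves the "async with" block.
        client.close()


async def main():
    app = {}
    async with lifespan(app):
        events.append("serving requests")


asyncio.run(main())
print(events)
# ['startup', 'serving requests', 'client closed']
```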

Step 7: Set up environment variables

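Create a .env file in the root directory of the project (farm-stack-todo) with the following content, replacing the example connection string with the one for your own MongoDB cluster. Make sure the database name ("todo") comes right after ".mongodb.net/".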

MONGODB_URI='mongodb+srv://beau:codecamp@cluster0.ji7hu.mongodb.net/todo?retryWrites=true&w=majority&appName=Cluster0'
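The database name in the URI matters because server.py calls client.get_default_database(), which uses the database named in the connection string. If you want to double-check your own URI before starting the containers, a small sketch like this one can help. The `default_database_name` helper is hypothetical, and the URIs below are placeholders rather than real credentials.

```python
from urllib.parse import urlsplit


def default_database_name(uri: str):
    """Return the database name embedded in a MongoDB URI, or None."""
    path = urlsplit(uri).path  # for example "/todo"
    name = path.lstrip("/")
    return name or None


# Placeholder URIs; the user, password, and host are not real.
with_db = "mongodb+srv://user:password@cluster0.example.mongodb.net/todo?retryWrites=true&w=majority"
without_db = "mongodb+srv://user:password@cluster0.example.mongodb.net/?retryWrites=true&w=majority"

print(default_database_name(with_db))     # todo
print(default_database_name(without_db))  # None
```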

Step 8: Create a docker-compose file

In the root directory of your project (farm-stack-todo), create a file named compose.yml with the following content:

name: todo-app
services:
  nginx:
    image: nginx:1.17
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/conf.d/default.conf
    ports:
      - 8000:80
    depends_on:
      - backend
      - frontend
  frontend:
    image: "node:22"
    user: "node"
    working_dir: /home/node/app
    environment:
      - NODE_ENV=development
      - WDS_SOCKET_PORT=0
    volumes:
      - ./frontend/:/home/node/app
    expose:
      - "3000"
    ports:
      - "3000:3000"
    command: "npm start"
  backend:
    image: todo-app/backend
    build: ./backend
    volumes:
      - ./backend/:/usr/src/app
    expose:
      - "3001"
    ports:
      - "8001:3001"
    command: "python src/server.py"
    environment:
      - DEBUG=true
    env_file:
      - path: ./.env
        required: true

Step 9: Set up Nginx configuration

Create a directory named nginx in the root of your project:

mkdir nginx

Create a file named nginx.conf inside the nginx directory with the following content:

server {
    listen 80;
    server_name farm_intro;

    location / {
        proxy_pass http://frontend:3000;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    location /api {
        proxy_pass http://backend:3001/api;
    }
}
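To see how Nginx decides between the two location blocks, here is a rough Python sketch of longest-prefix matching, which is how Nginx chooses among plain prefix locations like these. It only models the routing decision, and the `pick_upstream` name is an assumption for illustration.

```python
# Prefix -> upstream, mirroring the two location blocks in nginx.conf.
LOCATIONS = {
    "/": "http://frontend:3000",
    "/api": "http://backend:3001",
}


def pick_upstream(path: str) -> str:
    """Choose the upstream whose prefix is the longest match for the path."""
    matches = [prefix for prefix in LOCATIONS if path.startswith(prefix)]
    return LOCATIONS[max(matches, key=len)]


print(pick_upstream("/api/lists"))   # http://backend:3001
print(pick_upstream("/index.html"))  # http://frontend:3000
```

Because the match is a plain string prefix, a path such as /apidocs would also be sent to the backend.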

This concludes Part 2 of the tutorial, where we implemented the FastAPI server, set up environment variables, created a docker-compose file, and configured Nginx. In the next part, we'll focus on setting up the React frontend for our FARM stack todo application.

Setting up the React Frontend

Step 10: Create the React application

Navigate to the frontend directory:

cd ../frontend

Create a new React application using Create React App:

npx create-react-app .

Install additional dependencies:

   npm install axios react-icons

Step 11: Set up the main App component

Replace the content of src/App.js with the following:

import { useEffect, useState } from "react";
import axios from "axios";
import "./App.css";
import ListToDoLists from "./ListTodoLists";
import ToDoList from "./ToDoList";

function App() {
  const [listSummaries, setListSummaries] = useState(null);
  const [selectedItem, setSelectedItem] = useState(null);

  useEffect(() => {
    reloadData().catch(console.error);
  }, []);

  async function reloadData() {
    const response = await axios.get("/api/lists");
    const data = await response.data;
    setListSummaries(data);
  }

  function handleNewToDoList(newName) {
    const updateData = async () => {
      const newListData = {
        name: newName,
      };

      await axios.post(`/api/lists`, newListData);
      reloadData().catch(console.error);
    };
    updateData();
  }

  function handleDeleteToDoList(id) {
    const updateData = async () => {
      await axios.delete(`/api/lists/${id}`);
      reloadData().catch(console.error);
    };
    updateData();
  }

  function handleSelectList(id) {
    console.log("Selecting item", id);
    setSelectedItem(id);
  }

  function backToList() {
    setSelectedItem(null);
    reloadData().catch(console.error);
  }

  if (selectedItem === null) {
    return (
      <div className="App">
        <ListToDoLists
          listSummaries={listSummaries}
          handleSelectList={handleSelectList}
          handleNewToDoList={handleNewToDoList}
          handleDeleteToDoList={handleDeleteToDoList}
        />
      </div>
    );
  } else {
    return (
      <div className="App">
        <ToDoList listId={selectedItem} handleBackButton={backToList} />
      </div>
    );
  }
}

export default App;

Step 12: Create the ListTodoLists component

Create a new file src/ListTodoLists.js with the following content:

import "./ListTodoLists.css";
import { useRef } from "react";
import { BiSolidTrash } from "react-icons/bi";

function ListToDoLists({
  listSummaries,
  handleSelectList,
  handleNewToDoList,
  handleDeleteToDoList,
}) {
  const labelRef = useRef();

  if (listSummaries === null) {
    return <div className="ListToDoLists loading">Loading to-do lists ...</div>;
  } else if (listSummaries.length === 0) {
    return (
      <div className="ListToDoLists">
        <div className="box">
        <label>
          New To-Do List: 
          <input ref={labelRef} type="text" />
        </label>
        <button
          onClick={() => handleNewToDoList(labelRef.current.value)}
        >
          New
        </button>
        </div>
        <p>There are no to-do lists!</p>
      </div>
    );
  }
  return (
    <div className="ListToDoLists">
      <h1>All To-Do Lists</h1>
      <div className="box">
        <label>
          New To-Do List: 
          <input ref={labelRef} type="text" />
        </label>
        <button
          onClick={() => handleNewToDoList(labelRef.current.value)}
        >
          New
        </button>
      </div>
      {listSummaries.map((summary) => {
        return (
          <div
            key={summary.id}
            className="summary"
            onClick={() => handleSelectList(summary.id)}
          >
            <span className="name">{summary.name} </span>
            <span className="count">({summary.item_count} items)</span>
            <span className="flex"></span>
            <span
              className="trash"
              onClick={(evt) => {
                evt.stopPropagation();
                handleDeleteToDoList(summary.id);
              }}
            >
              <BiSolidTrash />
            </span>
          </div>
        );
      })}
    </div>
  );
}

export default ListToDoLists;

Create a new file src/ListTodoLists.css with the following content:

.ListToDoLists .summary {
    border: 1px solid lightgray;
    padding: 1em;
    margin: 1em;
    cursor: pointer;
    display: flex;
}

.ListToDoLists .count {
    padding-left: 1ex;
    color: blueviolet;
    font-size: 92%;
}

Step 13: Create the ToDoList component

Create a new file src/ToDoList.js with the following content:

import "./ToDoList.css";
import { useEffect, useState, useRef } from "react";
import axios from "axios";
import { BiSolidTrash } from "react-icons/bi";

function ToDoList({ listId, handleBackButton }) {
  let labelRef = useRef();
  const [listData, setListData] = useState(null);

  useEffect(() => {
    const fetchData = async () => {
      const response = await axios.get(`/api/lists/${listId}`);
      const newData = await response.data;
      setListData(newData);
    };
    fetchData();
  }, [listId]);

  function handleCreateItem(label) {
    const updateData = async () => {
      const response = await axios.post(`/api/lists/${listData.id}/items/`, {
        label: label,
      });
      setListData(await response.data);
    };
    updateData();
  }

  function handleDeleteItem(id) {
    const updateData = async () => {
      const response = await axios.delete(
        `/api/lists/${listData.id}/items/${id}`
      );
      setListData(await response.data);
    };
    updateData();
  }

  function handleCheckToggle(itemId, newState) {
    const updateData = async () => {
      const response = await axios.patch(
        `/api/lists/${listData.id}/checked_state`,
        {
          item_id: itemId,
          checked_state: newState,
        }
      );
      setListData(await response.data);
    };
    updateData();
  }

  if (listData === null) {
    return (
      <div className="ToDoList loading">
        <button className="back" onClick={handleBackButton}>
          Back
        </button>
        Loading to-do list ...
      </div>
    );
  }
  return (
    <div className="ToDoList">
      <button className="back" onClick={handleBackButton}>
        Back
      </button>
      <h1>List: {listData.name}</h1>
      <div className="box">
        <label>
          New Item: 
          <input ref={labelRef} type="text" />
        </label>
        <button
          onClick={() => handleCreateItem(labelRef.current.value)}
        >
          New
        </button>
      </div>
      {listData.items.length > 0 ? (
        listData.items.map((item) => {
          return (
            <div
              key={item.id}
              className={item.checked ? "item checked" : "item"}
              onClick={() => handleCheckToggle(item.id, !item.checked)}
            >
              <span>{item.checked ? "✅" : "⬜️"} </span>
              <span className="label">{item.label} </span>
              <span className="flex"></span>
              <span
                className="trash"
                onClick={(evt) => {
                  evt.stopPropagation();
                  handleDeleteItem(item.id);
                }}
              >
                <BiSolidTrash />
              </span>
            </div>
          );
        })
      ) : (
        <div className="box">There are currently no items.</div>
      )}
    </div>
  );
}

export default ToDoList;

Create a new file src/ToDoList.css with the following content:

.ToDoList .back {
    margin: 0 1em;
    padding: 1em;
    float: left;
}

.ToDoList .item {
    border: 1px solid lightgray;
    padding: 1em;
    margin: 1em;
    cursor: pointer;
    display: flex;
}

.ToDoList .label {
    margin-left: 1ex;
}

.ToDoList .checked .label {
    text-decoration: line-through;
    color: lightgray;
}

Step 14: Update the main CSS file

Replace the content of src/index.css with the following:

html, body {
  margin: 0;
  font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Oxygen',
    'Ubuntu', 'Cantarell', 'Fira Sans', 'Droid Sans', 'Helvetica Neue',
    sans-serif;
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
  font-size: 12pt;
}

input, button {
  font-size: 1em;
}

code {
  font-family: source-code-pro, Menlo, Monaco, Consolas, 'Courier New',
    monospace;
}

.box {
    border: 1px solid lightgray;
    padding: 1em;
    margin: 1em;
}

.flex {
  flex: 1;
}

This concludes Part 3 of the tutorial, where we set up the React frontend for our FARM stack todo application. We've created the main App component, the ListTodoLists component for displaying all todo lists, and the ToDoList component for individual todo lists. In the next part, we'll focus on running and testing the application.

Running and Testing the Application

Step 18: Run the application using Docker Compose

Make sure you have Docker and Docker Compose installed on your system
Open a terminal in the root directory of your project (farm-stack-todo)
Build and start the containers:

docker-compose up --build

Once the containers are up and running, open your web browser and go to http://localhost:8000
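If you would like to check the API without the browser, a short script can prepare a couple of requests against the Nginx port published in compose.yml. This is only a sketch: the `build_request` helper is hypothetical, and the actual network call is left commented out so the script does nothing until your containers are running.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # Nginx port published in compose.yml


def build_request(method: str, path: str, payload=None) -> Request:
    """Prepare (but do not send) a JSON request for the todo API."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


create_list = build_request("POST", "/api/lists", {"name": "Groceries"})
fetch_lists = build_request("GET", "/api/lists")

print(create_list.get_method(), create_list.full_url)
# POST http://localhost:8000/api/lists

# With the containers running, a request could be sent like this:
# from urllib.request import urlopen
# with urlopen(fetch_lists) as response:
#     print(json.load(response))
```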

Step 19: Stopping the application

If you're running the application without Docker:
- Stop the React development server by pressing Ctrl+C in its terminal
- Stop the FastAPI server by pressing Ctrl+C in its terminal
- Stop the MongoDB server by pressing Ctrl+C in its terminal
If you're running the application with Docker Compose:
- Press Ctrl+C in the terminal where you ran docker-compose up
- Run the following command to stop and remove the containers:

     docker-compose down


Congratulations! You have successfully built and tested a FARM stack todo application. This application demonstrates the integration of FastAPI, React, and MongoDB in a full-stack web application.

Here are some potential next steps to enhance your application:

Add user authentication and authorization
Implement data validation and error handling
Add more features like due dates, priorities, or categories for todo items
Improve the UI/UX with a more polished design
Write unit and integration tests for both frontend and backend
Set up continuous integration and deployment (CI/CD) for your application

Remember to keep your dependencies updated and follow best practices for security and performance as you continue to develop your application.

Conclusion and Next Steps

Congratulations on completing this comprehensive FARM stack tutorial! By building this todo application, you've gained hands-on experience with some of the most powerful and popular technologies in modern web development. You've learned how to create a robust backend API with FastAPI, build a dynamic and responsive frontend with React, persist data with MongoDB, and containerize your entire application using Docker. This project has demonstrated how these technologies work together seamlessly to create a full-featured, scalable web application.

FastAPI Handbook – How to Develop, Test, and Deploy APIs

Atharva Shah — Tue, 25 Jul 2023 20:54:10 +0000

Welcome to the world of FastAPI, a sleek and high-performance web framework for constructing Python APIs. Don't worry if you're new to API programming – we'll start at the beginning.

An API (Application Programming Interface) connects software programs, allowing them to communicate and exchange information. APIs are essential in modern software development because they define how an application's backend exposes its data and functionality to other programs.

After reading this quick start guide, you will be able to develop a course administration API using FastAPI and MongoDB. The best part is that you will not only be writing APIs but also testing and containerizing the app.

In this walkthrough project, we'll create a Python backend system using FastAPI, a fast web framework, and a MongoDB database for course information storage and retrieval.

The system will allow users to access course details, view chapters, rate individual chapters, and aggregate ratings.

The project is designed for Python developers with basic programming knowledge and some NoSQL knowledge. Familiarity with MongoDB, Docker, and PyTest is not required since I will be highlighting everything you need to know for the scope of this project.

What We'll Build

Here's what we are going to be building:

FastAPI Backend: It will serve as the interface for handling API requests and responses. FastAPI is chosen for its ease of use, performance, and intuitive design.

MongoDB Database: A NoSQL database to store course information. MongoDB's flexible schema allows us to store data in JSON-like documents, making it suitable for this project.

Course Information: Users will be able to view various course details, such as course name, description, instructor, etc.

Chapter Details: The system will provide information about the chapters in a course, including chapter names, descriptions, and any other relevant data.

Chapter Rating: Users will have the ability to rate individual chapters. We will implement functionality to record and retrieve chapter ratings.

Course Aggregated Rating: The system will calculate and display the aggregated rating for each course based on the ratings of its chapters.

This walkthrough shows how to set up a development environment, build a FastAPI backend, integrate MongoDB, define API endpoints, add chapter rating functionality, and compute aggregate course ratings. It covers fundamental project concepts as well as Python, MongoDB, and NoSQL databases.

By the end, this useful backend system will manage chapter details, course information, and user ratings, serving as the basis for a complex and rewarding project.

The goal is to create a system that processes course-related queries. For each request, the relevant course information must be retrieved from MongoDB, and the response data must be returned in a standard format (JSON).

We'll begin with a script that reads the course information from courses.json. This data will be stored in the MongoDB instance. Once the data has been loaded, our API code may connect to this database to allow for simple data retrieval.

The interesting aspect is creating several endpoints with FastAPI. Our API will be able to:

Fetch a list of all courses
Show a comprehensive course overview
List detailed information about certain chapters
Record user scores for each chapter.

Additionally, for each course, we will aggregate all reviews, providing visitors with relevant information regarding course popularity and quality.

This tutorial focuses on building a scalable, efficient, and user-friendly API. Once we've tested everything, we'll containerize the application using Docker. This will greatly simplify deployment, maintenance, and installation.

Here are the sections of this tutorial:

API Methods
Client and Server
How to Set Up the MongoDB Database
How to Parse and Insert Course Data into MongoDB
How to Design the FastAPI Endpoints
Automated API Endpoint Testing with PyTest
How to Containerize the Application with Docker
Conclusion

API Methods

HTTP (Hypertext Transfer Protocol) methods specify the action to be taken on a resource. The following are the methods most commonly used in API development:

GET: Requests information from a server. When a client submits a GET request, it is requesting data from the server.

POST: Sends data to the server for processing. When a client submits a POST request, it is often delivering data to the server to create or update a resource.

PUT: Updates server data. When a client submits a PUT request, the resource indicated in the request is updated.

DELETE: A client sending a DELETE request is asking for the removal of the specified resource.
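To make these four methods a little more concrete, here is a small sketch that applies them to an in-memory dictionary instead of a real server. The `handle` function and the status codes it picks are assumptions for illustration (for example, some APIs let PUT create a missing resource instead of returning 404).

```python
from http import HTTPStatus

courses = {}  # in-memory stand-in for server-side data


def handle(method: str, key: str, body=None):
    """Return (status, data) for a request against the in-memory store."""
    if method == "GET":
        if key in courses:
            return HTTPStatus.OK, courses[key]
        return HTTPStatus.NOT_FOUND, None
    if method == "POST":
        courses[key] = body
        return HTTPStatus.CREATED, body
    if method == "PUT":
        if key not in courses:
            return HTTPStatus.NOT_FOUND, None
        courses[key] = body
        return HTTPStatus.OK, body
    if method == "DELETE":
        if key not in courses:
            return HTTPStatus.NOT_FOUND, None
        del courses[key]
        return HTTPStatus.NO_CONTENT, None
    return HTTPStatus.METHOD_NOT_ALLOWED, None


status, _ = handle("POST", "fastapi-101", {"name": "FastAPI 101"})
print(status.value)  # 201
status, data = handle("GET", "fastapi-101")
print(status.value, data)  # 200 {'name': 'FastAPI 101'}
status, _ = handle("DELETE", "fastapi-101")
print(status.value)  # 204
```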

Client and Server

The client is often a front-end application that sends requests to the server, such as a web browser or a mobile app. The server, on the other hand, is the back-end application in charge of processing client requests and responding appropriately.

A request is a communication delivered by the client to the server that specifies the intended action and any required data. The HTTP method, URL (Uniform Resource Locator), headers, and, in the case of POST or PUT requests, the data payload are all part of a request.

After the server gets the request, it processes it and returns a response. The response is the message given back to the client by the server that contains the requested data or the outcome of the activity.

A response generally comprises an HTTP status code indicating the success or failure of the request, as well as any data sent back to the client by the server.

Diagram showing how APIs work

How to Set Up the MongoDB Database

MongoDB is a type of NoSQL database. It is non-relational and saves information as collections and documents.

Install MongoDB for your operating system from the official website.

Now run the mongosh command in your terminal to verify that the installation was successful.

Running the mongosh command should yield this output

Connect to the MongoDB server with MongoDB Compass. I recommend that you set up MongoDB by specifying settings such as port number, storage engine, authentication, and so forth.

Create a new MongoDB connection

Now that the connection is established, the next step is to create a database. Call this database "courses". It will be empty for now. In just a minute we'll insert the documents using a Python script.

How to Parse and Insert Course Data into MongoDB

You could insert records one by one, but it is best to use a JSON file to simplify that process. Download this file courses.json from GitHub. All course information is present in it (as a list of courses).

Specifically, each course has the following structure:

name: The title of the course.
date: Creation date as a UNIX timestamp.
description: The description of the course.
domain: List of the course domain(s).
chapters: List of the course chapters. Each chapter has a title name and content text.
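Before loading a file like this into a database, it can be useful to confirm that every course has the fields listed above. The sketch below is one way to do that. The `missing_fields` helper is hypothetical, the sample course is made up rather than taken from courses.json, and the chapter key names `name` and `text` are my assumption based on the description.

```python
REQUIRED_COURSE_FIELDS = {"name", "date", "description", "domain", "chapters"}
REQUIRED_CHAPTER_FIELDS = {"name", "text"}  # assumed key names


def missing_fields(course: dict) -> list:
    """List the required fields missing from a course or its chapters."""
    problems = sorted(REQUIRED_COURSE_FIELDS - set(course))
    for index, chapter in enumerate(course.get("chapters", [])):
        for field in sorted(REQUIRED_CHAPTER_FIELDS - set(chapter)):
            problems.append(f"chapters[{index}].{field}")
    return problems


# Made-up sample; not taken from courses.json.
sample = {
    "name": "Intro to Databases",
    "date": 1690000000,
    "description": "A short example course.",
    "domain": ["databases"],
    "chapters": [{"name": "Getting Started", "text": "Welcome."}],
}

print(missing_fields(sample))  # []
```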

You will need a few Python packages for this project.

BSON - Binary serialization format that is used in MongoDB for efficient data storage and retrieval. It comes bundled with PyMongo.
FastAPI - Web framework for creating Python APIs that offer high performance, automatic validation, interactive documentation, and support for async operations.
PyMongo - Official MongoDB driver for Python. It serves as a high-level API for integrating MongoDB within Python.
Uvicorn - ASGI server that runs the FastAPI application and handles incoming HTTP connections.
Starlette - ASGI framework that powers FastAPI and allows rapid prototyping development.
Pydantic - Integrated data validation and parsing library. We need it to create interactive API documentation while automatically validating incoming request data and enforcing data type rules.

Install them with pip like so:

pip install fastapi pymongo uvicorn starlette pydantic

Now, let's write a Python script to insert all this course data into the database so that we can start building API routes. Spin up your IDE, create a file called script.py, and make sure it is in the same directory as the courses.json file.

""" 
Script to parse course information from courses.json, create the appropriate databases and
collection(s) on a local instance of MongoDB, create the appropriate indices (for efficient retrieval)
and finally add the course data on the collection(s).
"""

import pymongo
import json

# Connect to MongoDB
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["courses"]
collection = db["courses"]

# Read courses from courses.json
with open("courses.json", "r") as f:
    courses = json.load(f)

# Create index for efficient retrieval
collection.create_index("name")

# add rating field to each course
for course in courses:
    course['rating'] = {'total': 0, 'count': 0}

# add rating field to each chapter
for course in courses:
    for chapter in course['chapters']:
        chapter['rating'] = {'total': 0, 'count': 0}

# Add courses to collection
for course in courses:
    collection.insert_one(course)

# Close MongoDB connection
client.close()

This script populates a MongoDB database with the course information from the JSON file.

It begins by connecting to the local MongoDB instance and reading the course data from a file called courses.json. It then creates an index on the name field to speed up data retrieval and adds a zeroed rating field to every course and every chapter. Lastly, the course data is inserted into the MongoDB collection.
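To make the resulting document shape concrete, here is a minimal sketch of the rating initialization step applied to an in-memory list instead of MongoDB. The sample course is made up for illustration, and add_rating_fields is a hypothetical helper name that does not appear in the original script.

```python
import copy


def add_rating_fields(courses):
    """Return a copy of courses with a zeroed rating on each course and chapter."""
    prepared = copy.deepcopy(courses)  # leave the caller's data untouched
    for course in prepared:
        course["rating"] = {"total": 0, "count": 0}
        for chapter in course.get("chapters", []):
            chapter["rating"] = {"total": 0, "count": 0}
    return prepared


# Made-up sample record that uses the field names described earlier
sample = [{
    "name": "Sample Course",
    "date": 1659906000,
    "description": "Placeholder description.",
    "domain": ["programming"],
    "chapters": [{"name": "Intro", "text": "Placeholder text."}],
}]

prepared = add_rating_fields(sample)
print(prepared[0]["rating"])
print(prepared[0]["chapters"][0]["rating"])
```

Every course and every chapter ends up with its own rating dictionary, which is the structure the API endpoints rely on later.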

It's a straightforward script for managing course data in a database. After you run the script, all records from courses.json should be in the courses database. Switch to MongoDB Compass to verify it.

You should be able to see the JSON items in your courses database after running the Python script.

How to Design the FastAPI Endpoints

These API endpoints provide an efficient way to manage course information, retrieve course details, and allow user interactions for rating chapters.

I recommend designing the API endpoints first along with the HTTP request type before writing the code. This acts as a good reference and provides clarity during the coding process.

Endpoint	Request Type	Description
/courses	GET	Get a list of all available courses with sorting options. Options: sort by title (ascending), date (descending), or total course rating (descending). Optional filtering based on domain is supported.
/courses/{course_id}	GET	Get the overview of a specific course identified by course_id.
/courses/{course_id}/{chapter_id}	GET	Get information about a specific chapter within a course.
/courses/{course_id}/{chapter_id}	POST	Rate a specific chapter within a course. Options: positive rating (1), negative rating (-1). The ratings are aggregated for each course.

Okay, time to dive into the API code. Create a brand new Python file and call it main.py:

import contextlib
from fastapi import FastAPI, HTTPException, Query
from pymongo import MongoClient
from bson import ObjectId
from fastapi.encoders import jsonable_encoder

app = FastAPI()
client = MongoClient('mongodb://localhost:27017/')
db = client['courses']

The code imports the essential modules and creates an instance of the FastAPI class named app. It also establishes a connection to the local MongoDB instance using the PyMongo library, and the db variable now stores a reference to the courses database.

Let's go over each of these endpoints in more detail now.

The Get All Courses Endpoint (`/courses` – GET)

This endpoint allows you to retrieve a list of all available courses. You can sort the courses based on different criteria, such as alphabetical order (based on the course title in ascending order), date (in descending order), or total course rating (in descending order). Also, we'll allow users to filter the courses based on their domain.

@app.get('/courses')
def get_courses(sort_by: str = 'date', domain: str = None):
    # set the rating.total and rating.count to all the courses based on the sum of the chapters rating
    for course in db.courses.find():
        total = 0
        count = 0
        for chapter in course['chapters']:
            with contextlib.suppress(KeyError):
                total += chapter['rating']['total']
                count += chapter['rating']['count']
        db.courses.update_one({'_id': course['_id']}, {'$set': {'rating': {'total': total, 'count': count}}})


    # sort_by == 'date' [DESCENDING]
    if sort_by == 'date':
        sort_field = 'date'
        sort_order = -1

    # sort_by == 'rating' [DESCENDING]
    elif sort_by == 'rating':
        sort_field = 'rating.total'
        sort_order = -1

    # sort_by == 'alphabetical' [ASCENDING]
    else:  
        sort_field = 'name'
        sort_order = 1

    query = {}
    if domain:
        query['domain'] = domain


    courses = db.courses.find(query, {'name': 1, 'date': 1, 'description': 1, 'domain':1,'rating':1,'_id': 0}).sort(sort_field, sort_order)
    return list(courses)

This code defines an endpoint in the FastAPI application to retrieve a list of all available courses. The endpoint can be accessed using an HTTP GET request to the '/courses' URL.

The @app.get() decorator attached to the get_courses function takes care of this.

When a request is made to this endpoint, the code first calculates the total course rating by summing up the ratings of all the chapters in each course. It then updates the rating field of each course in the MongoDB database with the computed total and count of ratings.

Next, the code determines the sorting mode based on the sort_by query parameter. If sort_by is set to date, the courses will be sorted by their creation date in descending order. If it is set to rating, the courses will be sorted by their total rating in descending order. Otherwise, the courses will be sorted alphabetically by their names in ascending order.

If the optional domain query parameter is provided, the code will filter the courses based on the specified domain.

Finally, the code queries the MongoDB database to retrieve the relevant course information, including the course name, creation date, description, domain, and rating. The courses are sorted according to the selected sorting mode and returned as a list.
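The two pieces of logic in this endpoint, rating aggregation and sort mode selection, can be sketched as plain functions that run without a database. This is only an illustration under stated assumptions: aggregate_rating and resolve_sort are hypothetical names, and the sample course is made up.

```python
def aggregate_rating(course):
    """Sum the chapter ratings of one course, skipping chapters without a rating."""
    total = 0
    count = 0
    for chapter in course.get("chapters", []):
        rating = chapter.get("rating")
        if rating is None:
            continue  # mirrors the suppressed KeyError in the endpoint
        total += rating["total"]
        count += rating["count"]
    return {"total": total, "count": count}


def resolve_sort(sort_by):
    """Map the sort_by query parameter to a (field, direction) pair."""
    if sort_by == "date":
        return ("date", -1)          # newest first
    if sort_by == "rating":
        return ("rating.total", -1)  # highest total first
    return ("name", 1)               # any other value sorts alphabetically


# Made-up course with one unrated chapter
course = {"chapters": [
    {"rating": {"total": 2, "count": 3}},
    {"name": "Unrated chapter"},
    {"rating": {"total": -1, "count": 1}},
]}

print(aggregate_rating(course))
print(resolve_sort("rating"))
```

Note that any unrecognized sort_by value, including a typo, silently falls back to alphabetical order, which matches the else branch in the endpoint.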

That was the code explanation, but what about the actual API response? Run the command below in your terminal from the current working directory:

uvicorn main:app --reload

Uvicorn is an ASGI webserver. You can interact with API endpoints right on your local machine without any external server. On running the above command you should see a success message stating that the server has started.

Fire up your browser and enter http://127.0.0.1:8000/courses in the URL bar. The output that you will see will be the JSON response directly from the server.

Verify that the first object contains the following:

{
"name": "Introduction to Programming",
"date": 1659906000,
"description": "An introduction to programming using a language called Python. Learn how to read and write code as well as how to test and \"debug\" it. Designed for students with or without prior programming experience who'd like to learn Python specifically. Learn about functions, arguments, and return values (oh my!); variables and types; conditionals and Boolean expressions; and loops. Learn how to handle exceptions, find and fix bugs, and write unit tests; use third-party libraries; validate and extract data with regular expressions; model real-world entities with classes, objects, methods, and properties; and read and write files. Hands-on opportunities for lots of practice. Exercises inspired by real-world programming problems. No software required except for a web browser, or you can write code on your own PC or Mac.",
"domain": [
    "programming"
    ],
"rating": {
    "total": 6,
    "count": 12
    }
}

Guess what? It is a list of all the courses that we stored in our database. Your front-end application may now iterate over all these items and present them in a fancy way to the user. That is the power of APIs.

The rating for the entire course is updated from the aggregated sum of its chapter ratings.

At this point, if you wish to see the documentation for your API, navigate to the http://127.0.0.1:8000/docs endpoint. This interactive documentation comes prepackaged with FastAPI. How cool is that?

FastAPI docs for all your API endpoints

Don't like the plain old look of the docs? Fret not, there is also a /redoc endpoint with a slightly fancier interface. Just navigate to http://127.0.0.1:8000/redoc and you will be greeted with this screen.

FastAPI alternate redoc interface with search and download options

The Get Course Overview Endpoint (`/courses/{course_id}` – GET)

You'll use this endpoint to get an overview of a specific course. Simply provide the course_id in the URL, and the API will return detailed information about that particular course.

@app.get('/courses/{course_id}')
def get_course(course_id: str):
    course = db.courses.find_one({'_id': ObjectId(course_id)}, {'_id': 0, 'chapters': 0})
    if not course:
        raise HTTPException(status_code=404, detail='Course not found')
    try:
        course['rating'] = course['rating']['total']
    except KeyError:
        course['rating'] = 'Not rated yet' 

    return course

This code snippet searches the MongoDB database for the course with the specified course_id and returns the course information while leaving out the _id and chapters fields.

If it cannot find the course, it raises an HTTPException with the status code 404. If it finds it, it tries to access the rating field and replaces it with its 'total' value to display the total rating. If the course has no rating field, the rating is set to 'Not rated yet'.

Finally, it returns the course information as the JSON response, without the chapters field and with the total rating.
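One case the snippet does not cover is a course_id that is not a well-formed ObjectId string. The ObjectId constructor rejects such input with an exception, which the endpoint does not catch, so the client would see a server error instead of a 404. A minimal sketch of a pre-check, assuming the IDs are the usual 24-character hexadecimal strings, could look like the following. The looks_like_object_id name is hypothetical, and bson also ships its own ObjectId.is_valid check that you may prefer.

```python
import re

# A string-form ObjectId is 24 hexadecimal characters
_OBJECT_ID_PATTERN = re.compile(r"[0-9a-fA-F]{24}")


def looks_like_object_id(value: str) -> bool:
    """Return True if value has the shape of a string-form ObjectId."""
    return bool(_OBJECT_ID_PATTERN.fullmatch(value))


print(looks_like_object_id("6431137ab5da949e5978a281"))
print(looks_like_object_id("not-an-id"))
```

Inside the endpoint, a failed check could raise the same HTTPException(status_code=404, detail='Course not found') before ObjectId(course_id) is ever called.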

Single Course Overview Endpoint Response

Get Specific Chapter Information Endpoint (`/courses/{course_id}/{chapter_id}` – GET)

Hitting this endpoint returns specific information about a chapter within a course. By specifying both the course_id and the chapter_id in the URL, you can access the details of that particular chapter.

@app.get('/courses/{course_id}/{chapter_id}')
def get_chapter(course_id: str, chapter_id: str):    
    course = db.courses.find_one({'_id': ObjectId(course_id)}, {'_id': 0, })
    if not course:
        raise HTTPException(status_code=404, detail='Course not found')
    chapters = course.get('chapters', [])
    try:
        chapter = chapters[int(chapter_id)]
    except (ValueError, IndexError) as e:
        raise HTTPException(status_code=404, detail='Chapter not found') from e
    return chapter

As you might expect, course_id identifies the course, and chapter_id is the zero-based position of the chapter in that course's chapters list.

When a request is made to this endpoint, the code first searches the MongoDB database for the course with the specified course_id, excluding the _id field from the result.

If the course with the supplied course_id cannot be found in the database, the code raises an HTTPException with the status code 404, indicating that the course could not be located.

The code then uses the dictionary's get method to retrieve the list of chapters for the course, defaulting to an empty list if the 'chapters' field does not exist.

Using the chapter_id provided in the request, the code then attempts to retrieve the exact chapter within the list of chapters. If the chapter_id is not a valid integer or is out of range for the list of chapters, the code raises an HTTPException with the status code 404. This indicates that it could not locate the chapter.

If it locates the chapter, the response contains information on the individual chapter within the course.
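One subtlety of indexing a Python list with int(chapter_id) is that negative values are valid indices, so a chapter_id of -1 would return the last chapter instead of a 404. The following sketch shows a stricter lookup on made-up data. The find_chapter helper is hypothetical and returns None where the endpoint would raise its HTTPException.

```python
def find_chapter(chapters, chapter_id):
    """Return the chapter at a zero-based index, or None if the index is invalid."""
    try:
        index = int(chapter_id)
    except ValueError:
        return None  # not an integer at all
    if index < 0 or index >= len(chapters):
        return None  # reject negative and out-of-range positions
    return chapters[index]


# Made-up chapter list
chapters = [{"name": "A"}, {"name": "B"}, {"name": "C"}]

print(find_chapter(chapters, "1"))
print(find_chapter(chapters, "-1"))
print(find_chapter(chapters, "990"))
```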

Chapter Detail Endpoint

Rate Chapter Endpoint (`/courses/{course_id}/{chapter_id}` – POST)

This endpoint allows users to rate individual chapters within a course. You can provide a rating of 1 for a positive review or -1 for a negative review. The API aggregates all the ratings for each course, providing valuable feedback for future improvements.

Up until now, we've mostly seen GET requests. But now let's see how you can send data to the server, validate it, and insert it in the application database.

@app.post('/courses/{course_id}/{chapter_id}')
def rate_chapter(course_id: str, chapter_id: str, rating: int = Query(..., gt=-2, lt=2)):
    course = db.courses.find_one({'_id': ObjectId(course_id)}, {'_id': 0, })
    if not course:
        raise HTTPException(status_code=404, detail='Course not found')
    chapters = course.get('chapters', [])
    try:
        chapter = chapters[int(chapter_id)]
    except (ValueError, IndexError) as e:
        raise HTTPException(status_code=404, detail='Chapter not found') from e
    try:
        chapter['rating']['total'] += rating
        chapter['rating']['count'] += 1
    except KeyError:
        chapter['rating'] = {'total': rating, 'count': 1}
    db.courses.update_one({'_id': ObjectId(course_id)}, {'$set': {'chapters': chapters}})
    return chapter

We have put in place an endpoint for users to rate each chapter within a course using an HTTP POST request to the /courses/{course_id}/{chapter_id} URL. Users can provide a rating value of 1 for a positive rating or -1 for a negative rating, passed as the rating query parameter. The code queries the MongoDB database to find the course with the specified course_id, excluding the _id field.

If it doesn't find the course, it raises an HTTP exception with a status code of 404. The code retrieves the list of chapters, setting the default value to an empty list.

If the chapter_id is not a valid integer or is out of range, it raises an HTTPException with a status code of 404. If the chapter is found, the code updates its rating by incrementing the total rating value with the provided rating and incrementing the count value.

If the chapter does not have an existing rating field, it creates one and initializes it with the provided rating and a count of 1. The updated chapters list is then written back to the database, and the updated chapter is returned as the response, providing feedback to the user about their rating for that chapter.
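Note that the Query(..., gt=-2, lt=2) constraint accepts -1, 0, and 1, so a rating of 0 passes validation and increases the count without changing the total. The sketch below shows the rating update as a plain function that only accepts 1 or -1. The apply_rating name and the sample chapter are assumptions made for illustration.

```python
def apply_rating(chapter, rating):
    """Add a +1 or -1 rating to a chapter dictionary and return the chapter."""
    if rating not in (1, -1):
        raise ValueError("rating must be 1 or -1")
    # Create the rating field on first use, as the endpoint does
    current = chapter.setdefault("rating", {"total": 0, "count": 0})
    current["total"] += rating
    current["count"] += 1
    return chapter


# Made-up chapter without a rating field yet
chapter = {"name": "Sample chapter"}
apply_rating(chapter, 1)
apply_rating(chapter, -1)
apply_rating(chapter, 1)
print(chapter["rating"])
```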

POST Request to add a rating to a chapter

To make a POST request, open the docs and click on the request highlighted in the above image. Then, click on "Try it out", fill in the post data, and press the Execute button right below. This sends the POST data to the server which is then validated.

If all the submitted data is as expected, the server accepts and shows the 200 status code meaning that the operation was successful. The submitted data is now in the MongoDB document.

Post Request Success

That's a wrap on the API development part.

Automated API Endpoint Testing with PyTest

As the complexity of modern web applications increases, so does the number of API endpoints and their interactions.

In a dynamic e-commerce web app, there could be hundreds of endpoints, each supporting multiple HTTP request methods. And these endpoints might be intricately interconnected.

Ensuring the proper functioning of all these endpoints after each development iteration becomes a formidable task for developers and QA teams. Here is where automated testing comes to the rescue.

Create a file test_app.py in the same directory as courses.json and main.py:

from fastapi.testclient import TestClient
from pymongo import MongoClient
from bson import ObjectId
import pytest
from main import app

client = TestClient(app)
mongo_client = MongoClient('mongodb://localhost:27017/')
db = mongo_client['courses']

That sets up an automated testing environment.

FastAPI Test Client simulates HTTP requests to the web app. With this, you can pretend to be a user, sending requests to your app and getting responses back, just like a real user would.

The MongoClient connection lets the tests read course data directly from MongoDB, so the API responses can be compared against what is actually stored.

Note that, as written, the tests connect to the same courses database that main.py uses, so they are not isolated from your real course documents.

With this configuration, you can now create test functions that send requests to your FastAPI app using the TestClient. The rating test further down writes to the database, so point both the app and the tests at a separate test database if you need to keep your data untouched.

How to Test the "Get Courses List" Endpoint

These test functions use TestClient to interact with the "/courses" endpoint of the FastAPI application. They check if the endpoint behaves as expected when different parameters, such as sorting and filtering by domain, are provided.

The tests verify the status codes, data presence, sorting order, and domain filtering in the API responses, ensuring the functionality of the course endpoint is correct and reliable.

def test_get_courses_no_params():
    response = client.get("/courses")
    assert response.status_code == 200

def test_get_courses_sort_by_alphabetical():
    response = client.get("/courses?sort_by=alphabetical")
    assert response.status_code == 200
    courses = response.json()
    assert len(courses) > 0
    assert sorted(courses, key=lambda x: x['name']) == courses


def test_get_courses_sort_by_date():
    response = client.get("/courses?sort_by=date")
    assert response.status_code == 200
    courses = response.json()
    assert len(courses) > 0
    assert sorted(courses, key=lambda x: x['date'], reverse=True) == courses

def test_get_courses_sort_by_rating():
    response = client.get("/courses?sort_by=rating")
    assert response.status_code == 200
    courses = response.json()
    assert len(courses) > 0
    assert sorted(courses, key=lambda x: x['rating']['total'], reverse=True) == courses

def test_get_courses_filter_by_domain():
    response = client.get("/courses?domain=mathematics")
    assert response.status_code == 200
    courses = response.json()
    assert len(courses) > 0
    assert all([c['domain'][0] == 'mathematics' for c in courses])

def test_get_courses_filter_by_domain_and_sort_by_alphabetical():
    response = client.get("/courses?domain=mathematics&sort_by=alphabetical")
    assert response.status_code == 200
    courses = response.json()
    assert len(courses) > 0
    assert all([c['domain'][0] == 'mathematics' for c in courses])
    assert sorted(courses, key=lambda x: x['name']) == courses

def test_get_courses_filter_by_domain_and_sort_by_date():
    response = client.get("/courses?domain=mathematics&sort_by=date")
    assert response.status_code == 200
    courses = response.json()
    assert len(courses) > 0
    assert all([c['domain'][0] == 'mathematics' for c in courses])
    assert sorted(courses, key=lambda x: x['date'], reverse=True) == courses

Pay attention to the assert statements. Each one compares an expected result against the actual result. If the comparison is true, the test continues. If it is false, assert raises an AssertionError and pytest marks the test as failed. The objective is to get all the tests to pass.
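The sorting tests all follow the same pattern: sort a copy of the response and compare it with the original order. Here is a small self-contained sketch of that pattern on made-up data, which also shows what a failing assert does. The is_sorted_by helper is a hypothetical name.

```python
def is_sorted_by(items, key, reverse=False):
    """Return True if items are already in the order sorted() would produce."""
    return sorted(items, key=key, reverse=reverse) == items


# Made-up response: alphabetical by name, newest date first
courses = [
    {"name": "Algebra", "date": 300},
    {"name": "Calculus", "date": 200},
    {"name": "Zoology", "date": 100},
]

assert is_sorted_by(courses, key=lambda c: c["name"])
assert is_sorted_by(courses, key=lambda c: c["date"], reverse=True)

try:
    # The list is not in ascending date order, so this assert fails
    assert is_sorted_by(courses, key=lambda c: c["date"])
except AssertionError:
    print("AssertionError raised: list is not sorted by ascending date")
```

Under pytest you would not catch the AssertionError yourself. The test runner catches it and reports the test as failed.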

How to Test the "Get Single Course Info" Endpoint

The tests use TestClient to send requests to the "/courses/{course_id}" endpoint and fetch the same course directly from MongoDB with the db.courses.find_one function. Comparing the API response to the database record shows whether the endpoint handles both existing and non-existent course IDs. MongoDB generates the ObjectId values at insert time, so replace the IDs in these tests with ones from your own database.

def test_get_course_by_id_exists():
    response = client.get("/courses/6431137ab5da949e5978a281")
    assert response.status_code == 200
    course = response.json()
    # get the course from the database
    course_db = db.courses.find_one({'_id': ObjectId('6431137ab5da949e5978a281')})
    # get the name of the course from the database
    name_db = course_db['name']
    # get the name of the course from the response
    name_response = course['name']
    # compare the two
    assert name_db == name_response


def test_get_course_by_id_not_exists():
    response = client.get("/courses/6431137ab5da949e5978a280")
    assert response.status_code == 404
    assert response.json() == {'detail': 'Course not found'}

How to Test the "Get Course Chapter Info" Endpoint

These tests send requests to the "/courses/{course_id}/{chapter_id}" endpoint through the TestClient and expect it to return the chapter at the given index for the given course.

The assertions check that the response contains the expected chapter data for an existing chapter, and that a non-existent chapter produces a 404 response with a "Chapter not found" detail.

def test_get_chapter_info():
    response = client.get("/courses/6431137ab5da949e5978a281/1")
    assert response.status_code == 200
    chapter = response.json()
    assert chapter['name'] == 'Big Picture of Calculus'
    assert chapter['text'] == 'Highlights of Calculus'


def test_get_chapter_info_not_exists():
    response = client.get("/courses/6431137ab5da949e5978a281/990")
    assert response.status_code == 404
    assert response.json() == {'detail': 'Chapter not found'}

How to Test the "Post Course Rating" Endpoint

To test the rating capability, the test function specifies the course ID, chapter ID, and rating variables. It uses the TestClient's post method to submit a POST request to the "/courses/{course_id}/{chapter_id}" endpoint, providing the course ID and chapter number in the URL and passing the rating variable as a query parameter.

The TestClient mimics a user rating a certain chapter of a course. The test expects a 200 status code, checks the JSON body for the "name" and "rating" keys as well as the nested "total" and "count" keys, and asserts that the total rating and rating count are both greater than 0. That last check assumes the chapter's running total is still positive after this rating.

def test_rate_chapter():
    course_id = "6431137ab5da949e5978a281"
    chapter_id = "1"
    rating = 1

    response = client.post(f"/courses/{course_id}/{chapter_id}?rating={rating}")

    assert response.status_code == 200

    # Check if the response body has the expected structure
    assert "name" in response.json()
    assert "rating" in response.json()
    assert "total" in response.json()["rating"]
    assert "count" in response.json()["rating"]

    assert response.json()["rating"]["total"] > 0
    assert response.json()["rating"]["count"] > 0

def test_rate_chapter_not_exists():
    # Use the real rating route with a chapter index that is out of range
    response = client.post("/courses/6431137ab5da949e5978a281/990?rating=1")
    assert response.status_code == 404
    assert response.json() == {'detail': 'Chapter not found'}

This verification makes sure that the rating addition endpoint works as intended, with the API returning the correct success code and expected information about the chapter, including its name and updated rating details.

By running the pytest command, all the test functions in the test_app.py file will be executed, and you'll get feedback on whether the endpoints are functioning as expected or if any errors or regressions have occurred. This allows developers and QA teams to catch issues early in the development cycle and maintain the application's reliability and stability.

As you can see in the image below, all the tests are passing. Good job! As you keep on adding more features and endpoints to the app, keep adding the associated tests in order to validate correctness. Writing those tests before the feature code is the practice known as Test Driven Development (TDD).

Running API Tests with Pytest

Running the pytest command shows the output as illustrated in the image above. It says that 13 tests passed. This means that all our endpoints are functional and return the expected responses.

Endpoint tests like these catch regressions and errors in how components work together, and they confirm that the application's essential operations behave correctly. Load, performance, and security testing are separate concerns that need their own tools.

Pytest helps you make sure that API endpoints work well together, and also helps you cover failures and edge cases.

How to Containerize the Application with Docker

You can put your application and all of its dependencies together into a single unit called a container. This is called containerization. It separates the application from the underlying system, which maintains consistency across different operating systems.

Docker is a modern containerization technology that makes it easier to create, distribute, and execute containers. It enables developers to consistently and reproducibly build, ship, and execute apps without building from source.

Get Docker installed from here: https://www.docker.com/get-started.

Dockerizing Python programs helps you make sure that they run consistently across multiple computers, eliminating compatibility difficulties. It containerizes the software, its dependencies, and customizations, making it portable.

In the same directory as other files, make a new file called Dockerfile. Note that it does not require any extension.

# Use an official Python runtime as a parent image
FROM python:3.9-slim-buster

# Set the working directory to /app
WORKDIR /app

# Copy the requirements file first so the dependency layer can be cached
COPY ./requirements.txt /app/requirements.txt

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir --upgrade -r /app/requirements.txt

# Copy the rest of the application code into the container at /app
COPY . /app

# Start the FastAPI app in main.py with Uvicorn when the container launches
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]

Starting with the official Python 3.9 slim image, the Dockerfile defines the image's blueprint.

It changes the working directory to /app, which is where the application code will be stored. This project's requirements are listed in the requirements.txt file, which is copied into the container, so that file must exist in the project directory.

The RUN command uses pip to install Python requirements. COPY moves the app's code from the host to the container's /app directory. CMD provides the command that will be executed when the container starts.

In this case, it runs "uvicorn main:app" (the main.py FastAPI app) with host set to 0.0.0.0 and port 80.

How to Run the Docker Container

Build the Docker image in the same directory as the Dockerfile using: **docker build -t my_python_app .**

Containerizing the FastAPI app with Docker

Run the container in detached mode using the command **docker run -d -p 80:80 my_python_app**.

Once you do this, you can view the status of the containers and the image from Docker Desktop.

Docker Desktop shows that our container image is now in a running state on port 80

How to Terminate the Docker Container

Find the container ID or name with **docker ps**. Stop the container using its ID or name: **docker stop <container_id_or_name>**

This walkthrough has only addressed development, testing, and containerization. Just note that post-deployment container security, if neglected, introduces risks like vulnerabilities, misconfigurations, and attacks. You should ideally take advantage of a CNAPP (Cloud Native Application Protection Platform) to scan images, stick to best practices, and monitor running containers for protection.

The takeaway is that Docker containerization allows bundling of Python scripts with dependencies, making them consistent and portable. The Dockerfile describes how the image should be created.

Running the container after it has been constructed is as simple as issuing a single command. It's just as simple to put a stop to it. Docker makes it simple to manage Python application distribution.

Conclusion

This tutorial was a quick start guide to help you leverage the power of FastAPI. We built a course administration API that efficiently handles queries related to courses.

We did this by importing course data from a JSON file into MongoDB and then creating multiple endpoints for users to access course lists, overviews, chapter information, and user scores. We also added a review aggregation feature to demonstrate using HTTP POST and HTTP GET methods so that you can grab data as well as post data to the server.

PyTest helped us handle automated testing, ensuring dependability and stability. We then containerized the application with Docker, which simplifies deployment and maintenance.

My GitHub repository contains the complete code covered in this quick start walkthrough. Subscribe to my technical blog for technical cheat sheets and resources.

How to Add JWT Authentication in FastAPI – A Practical Guide

freeCodeCamp — Tue, 07 Jun 2022 23:28:25 +0000

By Abdullah Adeel

FastAPI is a modern, fast, battle tested and light-weight web development framework written in Python. Other popular options in the space are Django, Flask and Bottle.

And since it's new, FastAPI comes with both advantages and disadvantages.

On the positive side, FastAPI implements all the modern standards, taking full advantage of the features supported by the latest Python versions. It has async support and type hinting. And it's also fast (hence the name FastAPI), unopinionated, robust, and easy to use.

On the negative side, FastAPI lacks some complex features like out of the box user management and admin panel that come baked in with Django. The community support for FastAPI is good but not as great as other frameworks that have been out there for years and have hundreds if not thousands of open-source projects for different use cases.

That was a very brief introduction to FastAPI. In this article, you'll learn how to implement JWT (JSON Web Token) authentication in FastAPI with a practical example.

Project Setup

In this example, I am going to use replit (a great web-based IDE). Alternatively, you can simply set up your FastAPI project locally by following the docs or use this replit starter template by forking it. This template has all the required dependencies already installed.

If you have the project set up in your local environment, here are the dependencies that you need to install for JWT authentication (assuming that you have a FastAPI project running):

pip install "python-jose[cryptography]" "passlib[bcrypt]" python-multipart

NOTE: In order to store users, I am going to use replit's built-in database. But you can apply similar operations if you are using any standard database like PostgreSQL, MongoDB, and so on.

If you want to see the complete implementation, I have this full video tutorial that includes everything a production ready FastAPI application might have.

https://replit.com/@abdadeel/FastAPIwithJWTauth

Authentication with FastAPI

Authentication in general can have a lot of moving parts, from handling password hashing and assigning tokens to validating tokens on each request.

FastAPI leverages dependency injection (a software engineering design pattern) to handle authentication schemes. Here is the list of some general steps in the process:

Password hashing
Creating and assigning JWT tokens
User creation
Validating tokens on each request to ensure authentication

Password Hashing

When creating a user with a username and password, you need to hash passwords before storing them in the database. Let's see how to easily hash passwords.

Create a file named utils.py in the app directory and add the following function to hash user passwords.

from passlib.context import CryptContext

password_context = CryptContext(schemes=["bcrypt"], deprecated="auto")


def get_hashed_password(password: str) -> str:
    return password_context.hash(password)


def verify_password(password: str, hashed_pass: str) -> bool:
    return password_context.verify(password, hashed_pass)

We're using passlib to create the configuration context for password hashing. Here we are configuring it to use bcrypt.

The get_hashed_password function takes a plain password and returns the hash for it that can be safely stored in the database. The verify_password function takes the plain and hashed passwords and returns a boolean representing whether the passwords match or not.
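To illustrate what a hash and verify pair does under the hood, here is a standard-library sketch that uses salted PBKDF2 instead of bcrypt. It is not what passlib does internally and is not meant to replace the functions above. The function names, the iteration count, and the storage format are all assumptions chosen for the example.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor, not a recommendation


def hash_password(password: str) -> str:
    """Return 'salt$digest' (both hex) for a password using PBKDF2-HMAC-SHA256."""
    salt = os.urandom(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return f"{salt.hex()}${digest.hex()}"


def check_password(password: str, stored: str) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    salt_hex, digest_hex = stored.split("$")
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), bytes.fromhex(salt_hex), ITERATIONS
    )
    return hmac.compare_digest(digest.hex(), digest_hex)


stored = hash_password("s3cret")
print(check_password("s3cret", stored))
print(check_password("wrong", stored))
```

Because the salt is random, hashing the same password twice gives two different strings, which is why verification has to recompute the digest from the stored salt instead of comparing hashes directly.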

How to Generate JWT Tokens

In this section, we will write two helper functions to generate access and refresh tokens with a particular payload. Later we can use these functions to generate tokens for a particular user by passing the user-related payload.

Inside the app/utils.py file that you created earlier, add the following import statements:

import os
from datetime import datetime, timedelta
from typing import Union, Any
from jose import jwt

Add the following constants that will be passed when creating JWTs:

ACCESS_TOKEN_EXPIRE_MINUTES = 30  # 30 minutes
REFRESH_TOKEN_EXPIRE_MINUTES = 60 * 24 * 7 # 7 days
ALGORITHM = "HS256"
JWT_SECRET_KEY = os.environ['JWT_SECRET_KEY']   # should be kept secret
JWT_REFRESH_SECRET_KEY = os.environ['JWT_REFRESH_SECRET_KEY']    # should be kept secret

JWT_SECRET_KEY and JWT_REFRESH_SECRET_KEY can be any strings, but make sure to keep them secret and set them as environment variables.
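If you need values for these variables, one option is Python's secrets module. This is only a sketch of one way to generate them; the 32-byte length is an assumption, and the variable names are chosen to mirror the constants above.

```python
import secrets

# Generate two independent random keys, one per token type.
# 32 random bytes become 64 hexadecimal characters each.
jwt_secret_key = secrets.token_hex(32)
jwt_refresh_secret_key = secrets.token_hex(32)

# Print them in NAME=value form so they can be copied into your environment.
print(f"JWT_SECRET_KEY={jwt_secret_key}")
print(f"JWT_REFRESH_SECRET_KEY={jwt_refresh_secret_key}")
```

Run it once, store the output as environment variables, and do not commit the values to source control.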

If you are following along on replit.com, you can set these environment variables from the Secrets tab on the left menu bar.

Add the following functions at the end of the app/utils.py file:

def create_access_token(subject: Union[str, Any], expires_delta: Union[timedelta, None] = None) -> str:
    if expires_delta is not None:
        expires_delta = datetime.utcnow() + expires_delta
    else:
        expires_delta = datetime.utcnow() + timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)

    to_encode = {"exp": expires_delta, "sub": str(subject)}
    encoded_jwt = jwt.encode(to_encode, JWT_SECRET_KEY, ALGORITHM)
    return encoded_jwt

def create_refresh_token(subject: Union[str, Any], expires_delta: Union[timedelta, None] = None) -> str:
    if expires_delta is not None:
        expires_delta = datetime.utcnow() + expires_delta
    else:
        expires_delta = datetime.utcnow() + timedelta(minutes=REFRESH_TOKEN_EXPIRE_MINUTES)

    to_encode = {"exp": expires_delta, "sub": str(subject)}
    encoded_jwt = jwt.encode(to_encode, JWT_REFRESH_SECRET_KEY, ALGORITHM)
    return encoded_jwt

These two functions differ in two ways: the refresh token has a longer expiration time than the access token, and each token type is signed with its own secret key.

The functions take the subject to include inside the JWT. Usually you would store an identifier such as the user ID here. Whatever you pass is converted with str(subject) and stored in the sub claim, alongside the exp claim that holds the expiration time. The functions return tokens as strings.
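To see what a token produced this way is made of, here is a hand-rolled HS256 sketch using only the standard library. It is for illustration and is not python-jose's implementation; the helper names, the example payload, and the demo secret are all hypothetical.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(text: str) -> bytes:
    """Restore the stripped padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def encode_hs256(payload: dict, secret: str) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    signature = hmac.new(
        secret.encode(), signing_input.encode(), hashlib.sha256
    ).digest()
    return f"{signing_input}.{b64url(signature)}"


token = encode_hs256({"sub": "user@example.com", "exp": 1700000000}, "demo-secret")
header_part, payload_part, signature_part = token.split(".")

# Anyone holding the token can read the payload; only the signature needs the secret.
print(json.loads(b64url_decode(payload_part)))
# {'sub': 'user@example.com', 'exp': 1700000000}
```

Because the payload is signed but not encrypted, avoid putting anything sensitive in it.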

In the end your app/utils.py file should look something like this:

from passlib.context import CryptContext
import os
from datetime import datetime, timedelta
from typing import Union, Any
from jose import jwt

ACCESS_TOKEN_EXPIRE_MINUTES = 30  # 30 minutes
REFRESH_TOKEN_EXPIRE_MINUTES = 60 * 24 * 7 # 7 days
ALGORITHM = "HS256"
JWT_SECRET_KEY = os.environ['JWT_SECRET_KEY']     # should be kept secret
JWT_REFRESH_SECRET_KEY = os.environ['JWT_REFRESH_SECRET_KEY']      # should be kept secret

password_context = CryptContext(schemes=["bcrypt"], deprecated="auto")

def get_hashed_password(password: str) -> str:
    return password_context.hash(password)


def verify_password(password: str, hashed_pass: str) -> bool:
    return password_context.verify(password, hashed_pass)


def create_access_token(subject: Union[str, Any], expires_delta: Union[timedelta, None] = None) -> str:
    if expires_delta is not None:
        expires_delta = datetime.utcnow() + expires_delta
    else:
        expires_delta = datetime.utcnow() + timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)

    to_encode = {"exp": expires_delta, "sub": str(subject)}
    encoded_jwt = jwt.encode(to_encode, JWT_SECRET_KEY, ALGORITHM)
    return encoded_jwt

def create_refresh_token(subject: Union[str, Any], expires_delta: Union[timedelta, None] = None) -> str:
    if expires_delta is not None:
        expires_delta = datetime.utcnow() + expires_delta
    else:
        expires_delta = datetime.utcnow() + timedelta(minutes=REFRESH_TOKEN_EXPIRE_MINUTES)

    to_encode = {"exp": expires_delta, "sub": str(subject)}
    encoded_jwt = jwt.encode(to_encode, JWT_REFRESH_SECRET_KEY, ALGORITHM)
    return encoded_jwt

How to Handle User Signups

Inside the app/app.py file, create another endpoint for handling user signups. The endpoint should take the username/email and password as data. It then checks to make sure another account with the email/username does not exist. Then it creates the user and saves it to the database.

In app/app.py, add the following handler function:

from fastapi import FastAPI, status, HTTPException
from fastapi.responses import RedirectResponse
from app.schemas import UserOut, UserAuth
from replit import db
from app.utils import get_hashed_password
from uuid import uuid4

@app.post('/signup', summary="Create new user", response_model=UserOut)
async def create_user(data: UserAuth):
    # query the database to check whether the user already exists
    user = db.get(data.email, None)
    if user is not None:
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail="User with this email already exists"
        )
    user = {
        'email': data.email,
        'password': get_hashed_password(data.password),
        'id': str(uuid4())
    }
    db[data.email] = user    # saving user to database
    return user

How to Handle Logins

FastAPI has a standard way of handling logins to comply with OpenAPI standards. This automatically adds authentication to the Swagger docs without any extra configuration.

Add the following handler function for user logins and assign each user access and refresh tokens. Don't forget to include imports.

from fastapi import FastAPI, status, HTTPException, Depends
from fastapi.security import OAuth2PasswordRequestForm
from fastapi.responses import RedirectResponse
from app.schemas import UserOut, UserAuth, TokenSchema
from replit import db
from app.utils import (
    get_hashed_password,
    create_access_token,
    create_refresh_token,
    verify_password
)
from uuid import uuid4

@app.post('/login', summary="Create access and refresh tokens for user", response_model=TokenSchema)
async def login(form_data: OAuth2PasswordRequestForm = Depends()):
    user = db.get(form_data.username, None)
    if user is None:
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail="Incorrect email or password"
        )

    hashed_pass = user['password']
    if not verify_password(form_data.password, hashed_pass):
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail="Incorrect email or password"
        )

    return {
        "access_token": create_access_token(user['email']),
        "refresh_token": create_refresh_token(user['email']),
    }

This endpoint is a bit different from the other post endpoints where you defined the schema for filtering incoming data.

For login endpoints, we use OAuth2PasswordRequestForm as a dependency. It extracts the data from the request and passes it as the form_data argument to the login handler function. python-multipart is used to extract form data, so make sure that you have installed it.
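Because the login handler reads form fields rather than JSON, a client has to send a form-encoded body. The sketch below builds such a request with the standard library without sending it. The host, port, and credentials are placeholders, not values from this project.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder URL and credentials; adjust to wherever your app is running.
LOGIN_URL = "http://localhost:8000/login"

# The email goes in the "username" field, because the handler looks the user
# up with form_data.username.
body = urlencode({"username": "user@example.com", "password": "s3cret"}).encode()
request = Request(
    LOGIN_URL,
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)

print(request.get_method())  # POST
print(body.decode())         # username=user%40example.com&password=s3cret
# To actually send it: urllib.request.urlopen(request)
```

Sending the same fields as a JSON body would not match what OAuth2PasswordRequestForm expects.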

The endpoint will show up in the Swagger docs with inputs for username and password.

On a successful response, you will get an access token and a refresh token.

How to Add Protected Routes

Now that we have added support for login and signup, we can add protected endpoints. In FastAPI, protected endpoints are handled using dependency injection. FastAPI records the security dependency in the OpenAPI schema, so it is reflected in the Swagger docs.

Let's see the power of dependency injection. At this point, there is no way we can authenticate from the docs. This is because currently we don't have any protected endpoint, so the OpenAPI schema does not have enough information about the login strategy we are using.

No button in swagger docs to login.

Let's create our custom dependency. It's simply a function that runs before the actual handler function and whose return value is passed to the handler as an argument. Let's see a practical example.

Create another file, app/deps.py, and add the following function to it:

from typing import Union, Any
from datetime import datetime
from fastapi import Depends, HTTPException, status
from fastapi.security import OAuth2PasswordBearer
from .utils import (
    ALGORITHM,
    JWT_SECRET_KEY
)

from jose import jwt
from pydantic import ValidationError
from app.schemas import TokenPayload, SystemUser
from replit import db

reuseable_oauth = OAuth2PasswordBearer(
    tokenUrl="/login",
    scheme_name="JWT"
)


async def get_current_user(token: str = Depends(reuseable_oauth)) -> SystemUser:
    try:
        payload = jwt.decode(
            token, JWT_SECRET_KEY, algorithms=[ALGORITHM]
        )
        token_data = TokenPayload(**payload)

        if datetime.fromtimestamp(token_data.exp) < datetime.now():
            raise HTTPException(
                status_code = status.HTTP_401_UNAUTHORIZED,
                detail="Token expired",
                headers={"WWW-Authenticate": "Bearer"},
            )
    except (jwt.JWTError, ValidationError):
        raise HTTPException(
            status_code=status.HTTP_403_FORBIDDEN,
            detail="Could not validate credentials",
            headers={"WWW-Authenticate": "Bearer"},
        )

    user: Union[dict[str, Any], None] = db.get(token_data.sub, None)


    if user is None:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail="Could not find user",
        )

    return SystemUser(**user)

Here we are defining the get_current_user function as a dependency which in turn takes an instance of OAuth2PasswordBearer as a dependency.
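The two checks the dependency relies on, a valid signature and an unexpired exp claim, can be sketched with the standard library alone. This is an illustration under assumptions, not how python-jose implements jwt.decode; the sign and verify helpers and the demo secrets are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(payload: dict, secret: str) -> str:
    """Create a demo HS256 token so there is something to verify."""
    head = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    sig = hmac.new(secret.encode(), f"{head}.{body}".encode(), hashlib.sha256).digest()
    return f"{head}.{body}.{_b64url(sig)}"


def verify(token: str, secret: str) -> dict:
    """Return the payload if the signature matches and 'exp' is in the future."""
    try:
        head, body, sig = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = hmac.new(secret.encode(), f"{head}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(body))
    if payload["exp"] < time.time():
        raise ValueError("token expired")
    return payload


good = sign({"sub": "user@example.com", "exp": time.time() + 60}, "demo-secret")
expired = sign({"sub": "user@example.com", "exp": time.time() - 1}, "demo-secret")

print(verify(good, "demo-secret")["sub"])  # user@example.com

for token, secret in [(good, "other-secret"), (expired, "demo-secret")]:
    try:
        verify(token, secret)
    except ValueError as err:
        print(err)  # "bad signature", then "token expired"
```

In get_current_user, these failure cases are the points where an HTTPException is raised instead of a ValueError.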

reuseable_oauth = OAuth2PasswordBearer(
    tokenUrl="/login",
    scheme_name="JWT"
)

OAuth2PasswordBearer takes one required parameter, tokenUrl, which is the URL in your application that handles user login and returns tokens. The Swagger docs use it to submit the login form and keep the returned token in memory. The optional scheme_name, set to JWT here, is the name given to the security scheme in the OpenAPI schema. Each subsequent request to a protected endpoint then carries the token in the Authorization header so OAuth2PasswordBearer can parse it.
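The header parsing step can be pictured with a small sketch. It only approximates what OAuth2PasswordBearer does with an "Authorization: Bearer <token>" header; the function name is hypothetical, and the real class, by default, responds with an error when the header is missing rather than returning None.

```python
from typing import Optional


def extract_bearer_token(authorization: Optional[str]) -> Optional[str]:
    """Return the token from 'Bearer <token>', or None if the header is unusable."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return None
    return token


print(extract_bearer_token("Bearer abc.def.ghi"))  # abc.def.ghi
print(extract_bearer_token("Basic dXNlcjpwdw=="))  # None
print(extract_bearer_token(None))                  # None
```

The extracted string is what arrives as the token argument of get_current_user.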

Now let's add a protected endpoint that returns user account information as the response. For this, a user has to be logged in and the endpoint will respond with information for the currently logged-in user.

In app/app.py create another handler function. Make sure to include imports as well.

from fastapi import Depends
from app.deps import get_current_user
from app.schemas import UserOut, SystemUser

@app.get('/me', summary='Get details of currently logged in user', response_model=UserOut)
async def get_me(user: SystemUser = Depends(get_current_user)):
    return user

As soon as you add this endpoint, you will be able to see the Authorize button in the swagger docs and a 🔒 icon in front of the protected endpoint /me.

This is the power of dependency injection and FastAPI's ability to generate an automatic OpenAPI schema.

Clicking the Authorize button will open the authorization form with the required fields for login. On a successful response, the tokens will be saved and sent with subsequent requests in the headers.

Swagger integrated login form

successfully logged in

At this point, you can access all the protected endpoints. To make an endpoint protected, you just need to add the get_current_user function as a dependency. That's all you need to do!

Conclusion

If you followed along, you should have a working FastAPI application with JWT authentication. If not, you can always run this repl and play around with it or visit this deployed version. You can find the GitHub code for this project here.

If you found this article helpful, give me a follow on Twitter @abdadeel_. And don't forget that you can always watch this video for a detailed explanation with a practical example.

Thanks ;)

How to Create Microservices with FastAPI

Beau Carnes — Thu, 24 Mar 2022 16:43:34 +0000

FastAPI is a Web framework for developing RESTful APIs in Python. It is a great choice when you want to build an app based on microservices.

We just published a course on the freeCodeCamp.org YouTube channel that will teach you how to develop a microservices app using FastAPI.

In this course, you will create a simple microservices app using Python FastAPI with React on the frontend. You will learn how to use RedisJSON as a database and dispatch events with Redis Streams. RedisJSON is a NoSQL database just like MongoDB, and Redis Streams is an event bus just like RabbitMQ or Apache Kafka.

Antonio Papa from Scalable Scripts developed this course. He has a bunch of experience working with a variety of frontend and backend frameworks.

Here are the sections in this course:

App Demo
Inventory Microservice Setup
Redis Cloud
Connect to Redis Cloud
Products CRUD
Payment Microservice Setup
Internal Http Requests
Background Tasks
Redis Streams
Frontend

Watch the full course below or on the freeCodeCamp.org YouTube channel (1.5 hour watch).
