<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Python - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Python - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Fri, 29 May 2026 10:34:22 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/python/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Production RAG with LangChain & Vector Databases ]]>
                </title>
                <description>
                    <![CDATA[ Master the transition from simple prototypes to production-grade RAG systems by addressing the critical scaling, debugging, and security challenges that standard tutorials often ignore. We just posted ]]>
                </description>
                <link>https://www.freecodecamp.org/news/production-rag-with-langchain-vector-databases/</link>
                <guid isPermaLink="false">6a183a86badcd8afcb9da431</guid>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 28 May 2026 12:52:22 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5f68e7df6dfc523d0a894e7c/373e1c14-905f-461d-acd6-3adae358e41b.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Master the transition from simple prototypes to production-grade RAG systems by addressing the critical scaling, debugging, and security challenges that standard tutorials often ignore.</p>
<p>We just posted a comprehensive course on the <a href="http://freeCodeCamp.org">freeCodeCamp.org</a> YouTube channel that covers the entire RAG pipeline—from vector database optimization and observability to advanced agentic and multimodal architectures. You will learn to make sure your AI applications are robust, secure, and ready for deployment. Paulo Dichone created this course.</p>
<p>Here are the sections in the course:</p>
<ul>
<li><p>Intro</p>
</li>
<li><p>Full RAG Overview</p>
</li>
<li><p>Development Environment Setup</p>
</li>
<li><p>Document Loader - Overview</p>
</li>
<li><p>Document Processing Pipeline - RAG Indexing Pipeline</p>
</li>
<li><p>Embedding Dimensions - Deep Dive</p>
</li>
<li><p>Hands-on - Create a Vector DB Using Chroma</p>
</li>
<li><p>Similarity Search with Scores</p>
</li>
<li><p>Building a Basic RAG System</p>
</li>
<li><p>Debugging RAG Systems</p>
</li>
<li><p>Hybrid Search</p>
</li>
<li><p>Token Budgeting</p>
</li>
<li><p>Observability - Introduction</p>
</li>
<li><p>LangSmith Setup</p>
</li>
<li><p>RAG Optimization</p>
</li>
<li><p>Scaling RAG Systems</p>
</li>
<li><p>The Real Costs of Vector Search</p>
</li>
<li><p>Production Hosting</p>
</li>
<li><p>Supabase and PGVector - Set up and Introduction</p>
</li>
<li><p>Three Pillars of Production Visibility</p>
</li>
<li><p>Production Project</p>
</li>
<li><p>Set up the Security Layer</p>
</li>
<li><p>Set up the LangGraph Agent and the FastAPI API - Testing and LangSmith Observability Dashboard</p>
</li>
<li><p>Test the Security Layer</p>
</li>
<li><p>Security Checklist</p>
</li>
<li><p>Advanced RAG Topics - Long Context Models vs RAG</p>
</li>
<li><p>Contextual Retrieval</p>
</li>
<li><p>Late Chunking vs Early Chunking</p>
</li>
<li><p>Agentic RAG - Self-Correcting Retrieval</p>
</li>
<li><p>GraphRAG - Multi-hop Reasoning</p>
</li>
<li><p>Multimodal RAG - ColPali - Vision-Based Document RAG</p>
</li>
<li><p>Summary - Advanced RAG (Current State)</p>
</li>
<li><p>RAG Evolution - Overview</p>
</li>
<li><p>Outro</p>
</li>
</ul>
<p>Watch the full course on <a href="https://youtu.be/mHxLXzYjQRE">the freeCodeCamp.org YouTube channel</a> (8-hour watch).</p>
<div class="embed-wrapper"><iframe width="560" height="315" src="https://www.youtube.com/embed/mHxLXzYjQRE" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Bash & Python for Real DevOps Automation – Full Handbook with 5 Production Use Cases ]]>
                </title>
                <description>
                    <![CDATA[ Automation scripts often validate process completion instead of system health. A Kubernetes pod can be running while the application inside it can't authenticate to the database. A Terraform deploymen ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-bash-python-for-real-devops-automation-handbook-with-production-use-cases/</link>
                <guid isPermaLink="false">6a171310badcd8afcb060460</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Bash ]]>
                    </category>
                
                    <category>
                        <![CDATA[ automation ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Wed, 27 May 2026 15:51:44 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/73f2a745-c1b5-4cbb-8f97-2ba6c5230592.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Automation scripts often validate process completion instead of system health.</p>
<p>A Kubernetes pod can be running while the application inside it can't authenticate to the database. A Terraform deployment can return clean while someone has manually changed infrastructure in the cloud console. A canary rollout can show zero errors while users wait five seconds for every request.</p>
<p>The problem isn't the tooling. The problem is that the system can look healthy when it really is not.</p>
<p>This handbook walks through five production-style automation scenarios using Bash and Python for:</p>
<ul>
<li><p>Detecting abnormal AWS spend before the monthly invoice arrives</p>
</li>
<li><p>Correlating logs across multiple services using trace IDs</p>
</li>
<li><p>Finding infrastructure drift outside Terraform</p>
</li>
<li><p>Validating secret rotation at the application level</p>
</li>
<li><p>Automatically rolling back slow deployments before users complain</p>
</li>
</ul>
<p>By the end of this handbook, you'll be able to build small scripts that help you notice when something is wrong in a system, even when the tools say everything is fine.</p>
<p>The scripts are intentionally small. The important part is the operational thinking behind them: what signal each script measures, what failure mode it can detect, and what assumptions the platform is making underneath.</p>
<p>Each use case includes a runnable demo environment, the complete script, a breakdown of the system behaviour involved, and an intentional failure you can trigger yourself.</p>
<p>If you're new to this workflow, start with use case 1 and work forward. The later sections build on the same pattern: automation is useful when it verifies reality, not just process completion.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, set up the following:</p>
<ul>
<li><p><strong>Python 3.8 or higher</strong> – check with <code>python3 --version</code></p>
</li>
<li><p><strong>A Python virtual environment</strong> – create one before installing anything:</p>
</li>
</ul>
<pre><code class="language-plaintext">python3 -m venv venv
source venv/bin/activate

# On Windows, run this instead:
venv\Scripts\activate
</code></pre>
<p>This keeps your installed packages isolated from your system Python and prevents permission errors on shared machines.</p>
<ul>
<li><p><strong>pip</strong> – Python's package installer, included with Python</p>
</li>
<li><p><strong>AWS CLI</strong> configured with a working profile – a free-tier AWS account is enough for use cases 1, 3, and 4. Verify it's working with:</p>
<pre><code class="language-plaintext">aws sts get-caller-identity
</code></pre>
</li>
<li><p><strong>Docker and Docker Compose</strong> – needed for use cases 2, 4, and 5</p>
</li>
<li><p><strong>Kind</strong> (Kubernetes in Docker) – a way to run Kubernetes locally for use cases 4 and 5. Install with <code>brew install kind</code> on macOS, or follow the <a href="https://kind.sigs.k8s.io/docs/user/quick-start/">Kind quick start guide</a></p>
</li>
<li><p><strong>kubectl</strong> – the command-line tool for talking to a Kubernetes cluster. After installing Kind and kubectl, run <code>kind create cluster</code>; Kind writes the new cluster into your kubeconfig, so kubectl points at it automatically</p>
</li>
<li><p><strong>Helm</strong> – a package manager for Kubernetes, needed for use case 5. Install with <code>brew install helm</code> or the <a href="https://helm.sh/docs/intro/install/">Helm install guide</a></p>
</li>
<li><p><strong>Terraform</strong> – needed for use case 3. Install with <code>brew install terraform</code> on macOS or follow the <a href="https://developer.hashicorp.com/terraform/install">Terraform install guide</a>. Check with <code>terraform version</code>.</p>
</li>
<li><p><strong>bc</strong> – a calculator utility used by the canary watch scripts for floating-point comparison. Install with <code>brew install bc</code> on macOS or <code>apt install bc</code> on Ubuntu. Run <code>bc --version</code> to confirm it is available before starting use case 5.</p>
</li>
</ul>
<h3 id="heading-knowledge-and-skills">Knowledge and Skills</h3>
<ul>
<li><p>You should be comfortable reading Python and Bash scripts without needing to write them from scratch.</p>
</li>
<li><p>You should have basic Linux terminal comfort – navigating directories, running scripts, reading output, and so on.</p>
</li>
<li><p>You should know what Kubernetes pods and deployments are at a basic level – you don't need deep Kubernetes expertise, as use cases 4 and 5 will introduce the Kubernetes concepts they rely on as they go.</p>
</li>
<li><p>Familiarity with AWS basics, such as what EC2, IAM, and Secrets Manager are, will help with use cases 1, 3, and 4. Use case 2 runs entirely on your local machine and requires no AWS knowledge at all.</p>
</li>
<li><p>For use case 3, knowing what Terraform is and what a state file does will help. You don't need to write any Terraform, but understanding that Terraform tracks what it created is the foundation of the whole use case.</p>
</li>
</ul>
<h3 id="heading-aws-iam-permissions-required">AWS IAM Permissions Required</h3>
<p>The scripts in this article make real AWS API calls. Your IAM user or role needs the following minimum permissions. If you see an <code>AccessDenied</code> error, this is the first place to look.</p>
<table>
<thead>
<tr>
<th>Use Case</th>
<th>Required IAM Permission</th>
</tr>
</thead>
<tbody><tr>
<td>1 - Cost Anomaly Detection</td>
<td><code>ce:GetCostAndUsage</code></td>
</tr>
<tr>
<td>3 - Drift Detection</td>
<td><code>ec2:DescribeSecurityGroups</code></td>
</tr>
<tr>
<td>4 - Secrets Rotation</td>
<td><code>secretsmanager:GetSecretValue</code>, <code>secretsmanager:PutSecretValue</code></td>
</tr>
</tbody></table>
<p>If you're using a fresh AWS free-tier account with <code>AdministratorAccess</code> attached, these permissions are already included and you can skip this step.</p>
<p>If you're on a restricted IAM user, here's how to add them. In the AWS Console, go to IAM, click Users, then click your username. Under the Permissions tab, click Add permissions, then Create inline policy.</p>
<p>Switch to the JSON tab and paste a policy document granting the permissions in the table above, then save it.</p>
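<p>If you'd rather generate that policy document than type it by hand, here is a minimal sketch in Python. The action names come straight from the table above. The <code>Sid</code> label and the wildcard <code>Resource</code> are assumptions made for simplicity, not something the companion repo prescribes.</p>
<pre><code class="language-python">import json

# Actions copied from the permissions table above.
REQUIRED_ACTIONS = [
    "ce:GetCostAndUsage",
    "ec2:DescribeSecurityGroups",
    "secretsmanager:GetSecretValue",
    "secretsmanager:PutSecretValue",
]


def build_policy(actions, resource="*"):
    """Return an IAM policy document allowing `actions` on `resource`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DevOpsScriptingLabs",  # hypothetical label, rename freely
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": resource,  # assumption: wildcard for a demo account
            }
        ],
    }


if __name__ == "__main__":
    # Paste the printed JSON into the console's JSON tab.
    print(json.dumps(build_policy(REQUIRED_ACTIONS), indent=2))
</code></pre>
<p>On anything other than a throwaway demo account, consider splitting the two Secrets Manager actions into their own statement and scoping that statement to the ARN of the specific secret.</p>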
<p>If your company manages AWS through an organization and you don't have permission to edit your own IAM policies, ask your administrator to add these permissions to your role.</p>
<h3 id="heading-companion-github-repository">Companion GitHub Repository</h3>
<p>All demo projects live at: <a href="https://github.com/Osomudeya/devops-scripting-labs"><strong>https://github.com/irvingtalks/devops-scripting-labs</strong></a></p>
<p>Each use case has its own numbered folder with the complete script, supporting files, a <code>setup.sh</code> to prepare the environment, and a <code>break_it.sh</code> that injects the specific failure each use case is built around.</p>
<p>Clone the repo before starting:</p>
<pre><code class="language-plaintext">git clone https://github.com/irvingtalks/devops-scripting-labs
cd devops-scripting-labs
</code></pre>
<p>Before running any use case, check that you have everything installed:</p>
<pre><code class="language-plaintext">./preflight.sh
</code></pre>
<p>This checks for every tool the lab needs (Python, AWS CLI, Docker, Kind, Helm, Terraform, and <code>bc</code>) and tells you exactly what's missing, with the install command for each one.</p>
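<p>To make the idea behind a preflight check concrete, here is a rough Python sketch of the same kind of test. It is not the repo's <code>preflight.sh</code>: the executable names are assumptions based on the prerequisites list, and it only reports what is missing without suggesting install commands.</p>
<pre><code class="language-python">import shutil

# Executable names are assumptions based on the prerequisites list above.
REQUIRED_TOOLS = ["python3", "aws", "docker", "kind", "kubectl", "helm", "terraform", "bc"]


def missing_tools(tools):
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools found on PATH.")
</code></pre>
<p>Note that finding a binary on <code>PATH</code> only proves it is installed. It doesn't prove the tool is configured, which is why the prerequisites also ask you to run commands such as <code>aws sts get-caller-identity</code>.</p>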
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-use-case-1-cost-anomaly-detection">Use Case 1 - Cost Anomaly Detection</a></p>
</li>
<li><p><a href="#heading-use-case-2-log-correlation-across-services">Use Case 2 - Log Correlation Across Services</a></p>
</li>
<li><p><a href="#heading-use-case-3-infrastructure-drift-detection">Use Case 3 - Infrastructure Drift Detection</a></p>
</li>
<li><p><a href="#heading-use-case-4-secrets-rotation-with-zero-downtime">Use Case 4 - Secrets Rotation with Zero Downtime</a></p>
</li>
<li><p><a href="#heading-use-case-5-automated-canary-rollback-trigger">Use Case 5 - Automated Canary Rollback Trigger</a></p>
</li>
<li><p><a href="#heading-what-you-can-do-now">What You Can Do Now</a></p>
</li>
</ul>
<h2 id="heading-use-case-1-cost-anomaly-detection">Use Case 1 - Cost Anomaly Detection</h2>
<p><strong>Environment:</strong> AWS Cost Explorer API (read-only, available in all accounts)<br><strong>Language:</strong> Python</p>
<h3 id="heading-the-production-problem">The Production Problem</h3>
<p>A junior engineer is testing a Kubernetes configuration. They spin up a managed node group in AWS (a set of EC2 virtual machines that the Kubernetes cluster uses to run workloads) and configure the cluster autoscaler, which is the Kubernetes component responsible for adding more machines when the cluster needs more capacity. The test goes well, and on Friday afternoon, they forget to tear the environment down.</p>
<p>Over the weekend, the autoscaler keeps provisioning new nodes because the test workloads are still running and requesting resources. By Monday morning you have a node group that has been quietly growing for two and a half days, and nobody noticed until the invoice landed three weeks later.</p>
<p>The script in this use case exists because your AWS bill isn't just a monthly number. It's a time series, and you can monitor it the same way you monitor application metrics. Check it daily, know your baseline, and you catch this kind of event in hours instead of weeks.</p>
<h3 id="heading-whats-actually-happening-at-the-system-level">What's Actually Happening at the System Level</h3>
<p><strong>What this is not:</strong> This isn't a finance dashboard. It's an operational anomaly detector. The signal it monitors is cost, but what it actually detects is unexpected infrastructure behavior such as resources left running, autoscaler events, and forgotten environments.</p>
<p>AWS Cost Explorer is a service that stores your billing data and exposes it through an API. When you call it, you're running a query against your account's billing records: you specify the time range, the granularity, and how you want results grouped.</p>
<p>One thing to know before you start investigating any flagged cost is that AWS decides which service category to put a charge under, not you. An EBS snapshot copy running across regions might appear under the EC2 line item rather than data transfer, which means a spike in EC2 spend doesn't necessarily mean something went wrong with your EC2 instances. The script flags the spike correctly, but investigating it means asking <em>"what changed in my infrastructure on this date"</em> rather than <em>"what is running in EC2 right now."</em></p>
<p>The billing label is a starting point, not a diagnosis.</p>
<h3 id="heading-set-up-the-demo-environment">Set Up the Demo Environment</h3>
<p>Navigate to <code>01-cost-anomaly/</code> in the <a href="https://github.com/Osomudeya/devops-scripting-labs">companion repo</a>. No cluster setup is needed for this use case because the script runs against your AWS account directly, and the only dependency is boto3:</p>
<pre><code class="language-plaintext">cd 01-cost-anomaly
pip install boto3
</code></pre>
<p>Before running against your real account, make sure your AWS credentials are configured. The script uses whatever credentials the AWS CLI is set up with. If you haven't done this yet:</p>
<pre><code class="language-plaintext">aws configure
</code></pre>
<p>This will ask for your AWS Access Key ID, Secret Access Key, default region (use <code>us-east-1</code> if unsure), and output format (type <code>json</code>). You can find your access keys in the AWS Console under IAM → Users → your username → Security credentials → Create access key.</p>
<p>Your account also needs the <code>ce:GetCostAndUsage</code> permission. If you're on a fresh account with <code>AdministratorAccess</code>, that's already included.</p>
<p>If you have an AWS account with a few weeks of billing history, you can run the script directly against your real data:</p>
<pre><code class="language-plaintext">python detect_cost_anomaly.py
</code></pre>
<p>Two things to know before running against a real account. First, Cost Explorer data has a 24-hour lag. This means spend from today won't appear until tomorrow, so the script automatically excludes the most recent day to avoid incomplete results.</p>
<p>Second, the script uses unblended costs, which is what you actually pay on a single-account setup. Blended costs are a weighted average used in multi-account organisations sharing reserved capacity and will give different numbers.</p>
<p>If you have a new account or prefer not to use real billing data, the script includes a <code>--sample</code> flag that uses built-in data and calls no AWS APIs at all.<br>Run this first to see what the output looks like before reading the code:</p>
<pre><code class="language-plaintext">python detect_cost_anomaly.py --sample
</code></pre>
<h3 id="heading-the-script">The Script</h3>
<pre><code class="language-python">#!/usr/bin/env python3
# detect_cost_anomaly.py — Use Case 1: Cost Anomaly Detection
# Full explanation of every function is in the article.

import statistics
import sys
from datetime import datetime, timedelta

import boto3

def build_sample_data(days=30):
    """Synthetic Cost Explorer rows for the last `days` (ending yesterday).

    The EC2 spike is placed on yesterday (device local date) so sample output
    always matches the same window as live Cost Explorer mode.
    """
    last_day = datetime.today().date() - timedelta(days=1)
    first_day = last_day - timedelta(days=days - 1)
    anomaly_day_index = days - 1
    results = []
    for i in range(days):
        day = first_day + timedelta(days=i)
        d = i + 1
        results.append(
            {
                "TimePeriod": {
                    "Start": str(day),
                    "End": str(day + timedelta(days=1)),
                },
                "Groups": [
                    {
                        "Keys": ["Amazon EC2"],
                        "Metrics": {
                            "UnblendedCost": {
                                "Amount": str(
                                    round(
                                        18.50
                                        if i == anomaly_day_index
                                        else 1.10 + (d % 3) * 0.10,
                                        2,
                                    )
                                )
                            }
                        },
                    },
                    {
                        "Keys": ["Amazon S3"],
                        "Metrics": {
                            "UnblendedCost": {
                                "Amount": str(round(0.04 + (d % 5) * 0.01, 2))
                            }
                        },
                    },
                    {
                        "Keys": ["Amazon RDS"],
                        "Metrics": {
                            "UnblendedCost": {
                                "Amount": str(round(0.85 + (d % 4) * 0.05, 2))
                            }
                        },
                    },
                ],
            }
        )
    return results, str(last_day)


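def get_daily_costs(days=30):
    # Cost Explorer is served from the us-east-1 endpoint.
    ce = boto3.client("ce", region_name="us-east-1")
    # The API treats End as exclusive, so the last day returned is the day before `end`.
    end = datetime.today().date() - timedelta(days=1)
    start = end - timedelta(days=days)
    params = {
        "TimePeriod": {"Start": str(start), "End": str(end)},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "DIMENSION", "Key": "SERVICE"}],
    }
    results = []
    while True:
        response = ce.get_cost_and_usage(**params)
        results.extend(response["ResultsByTime"])
        # Accounts with many services can return several pages; follow the token.
        token = response.get("NextPageToken")
        if not token:
            return results
        params["NextPageToken"] = token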


def build_service_timeseries(results):
    services = {}
    for day in results:
        date_str = day["TimePeriod"]["Start"]
        for group in day["Groups"]:
            service = group["Keys"][0]
            cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
            if service not in services:
                services[service] = []
            services[service].append({"date": date_str, "cost": cost})
    return services


def detect_anomalies(services, baseline_days=7, multiplier=2.0, recent_days=None):
    """Flag days where cost exceeds prior `baseline_days` average + 2σ.

    Uses a rolling baseline (each day vs the previous week). If `recent_days`
    is set, only returns anomalies on or after today - recent_days.
    """
    cutoff = None
    if recent_days is not None:
        cutoff = datetime.today().date() - timedelta(days=recent_days)

    anomalies = []
    for service, daily in services.items():
        if len(daily) &lt; baseline_days + 1:
            continue
        for i in range(baseline_days, len(daily)):
            day = daily[i]
            day_date = datetime.strptime(day["date"], "%Y-%m-%d").date()
            if cutoff is not None and day_date &lt; cutoff:
                continue
            baseline_costs = [d["cost"] for d in daily[i - baseline_days : i]]
            avg = statistics.mean(baseline_costs)
            if avg &lt; 0.01:
                continue
            try:
                std = statistics.stdev(baseline_costs)
            except statistics.StatisticsError:
                continue
            threshold = avg + (multiplier * std)
            if day["cost"] &gt; threshold:
                anomalies.append(
                    {
                        "service": service,
                        "date": day["date"],
                        "actual": round(day["cost"], 4),
                        "baseline_avg": round(avg, 4),
                        "threshold": round(threshold, 4),
                        "pct_above": round(((day["cost"] - avg) / avg) * 100, 1),
                    }
                )
    return sorted(anomalies, key=lambda x: x["date"])


def parse_args(argv):
    use_sample = "--sample" in argv
    recent_days = None
    for arg in argv[1:]:
        if arg.startswith("--recent-days="):
            recent_days = int(arg.split("=", 1)[1])
    return use_sample, recent_days


def run(use_sample=False, recent_days=None):
    if use_sample:
        results, anomaly_date = build_sample_data()
        print("Running against sample data (--sample mode).")
        print(
            f"This data represents 30 days of billing ending yesterday, "
            f"with a realistic EC2 anomaly on {anomaly_date}.\n"
        )
    else:
        print("Fetching 30 days of daily AWS costs by service...")
        print("Note: today is excluded — Cost Explorer has a 24-hour billing lag.\n")
        results = get_daily_costs(days=30)

    if recent_days is not None:
        since = datetime.today().date() - timedelta(days=recent_days)
        print(
            f"Checking for spikes in the last {recent_days} days only "
            f"(on or after {since}), each vs its prior 7-day average.\n"
        )

    services = build_service_timeseries(results)
    anomalies = detect_anomalies(services, recent_days=recent_days)

    if not anomalies:
        print("No anomalies detected.")
        print("\nNote: this script flags statistical outliers against your own baseline.")
        print("A consistently elevated spend level will not trigger — only sudden increases.")
        return

    print(f"{'=' * 60}")
    print(f"ANOMALIES DETECTED: {len(anomalies)}")
    print(f"{'=' * 60}\n")

    for a in anomalies:
        print(f"Service:      {a['service']}")
        print(f"Date:         {a['date']}")
        print(f"Actual cost:  ${a['actual']}")
        print(f"Baseline avg: ${a['baseline_avg']} (prior 7-day average)")
        print(f"Threshold:    ${a['threshold']}")
        print(f"Overage:      {a['pct_above']}% above baseline")
        print()

    print("=" * 60)
    print("A note on AWS cost attribution:")
    print("The service label in Cost Explorer is assigned by AWS, not by the resource")
    print("that caused the cost. An EC2 spike may be caused by EBS snapshot copies,")
    print("cross-region data transfer, or autoscaling events that AWS categorizes under")
    print("EC2 in billing — not a running EC2 instance you can find in the console.")
    print()
    print("Before investigating the flagged service directly, ask:")
    print("What changed in my infrastructure on or before the flagged date?")
    print("Work backward from the operational change, not forward from the billing label.")


if __name__ == "__main__":
    use_sample, recent_days = parse_args(sys.argv)
    run(use_sample=use_sample, recent_days=recent_days)
</code></pre>
<h3 id="heading-how-the-script-works">How the Script Works</h3>
<p><code>get_daily_costs</code> pulls your AWS billing data for the last 30 days.</p>
<p><code>build_service_timeseries</code> takes the raw data from AWS and reorganises it. AWS groups the data by day first, then by service. This function flips that around so each service has its own list of daily costs, which is what the detection step needs to work with.</p>
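<p>As a small illustration of that reshaping step, here is a compact sketch using <code>defaultdict</code>. It is an alternative way to express the same pivot, not the script's own code, and the two days of amounts are made up.</p>
<pre><code class="language-python">from collections import defaultdict

# Two days in the shape Cost Explorer returns (amounts are made up).
results_by_time = [
    {"TimePeriod": {"Start": "2026-05-01"}, "Groups": [
        {"Keys": ["Amazon EC2"], "Metrics": {"UnblendedCost": {"Amount": "1.20"}}},
        {"Keys": ["Amazon S3"], "Metrics": {"UnblendedCost": {"Amount": "0.05"}}},
    ]},
    {"TimePeriod": {"Start": "2026-05-02"}, "Groups": [
        {"Keys": ["Amazon EC2"], "Metrics": {"UnblendedCost": {"Amount": "1.30"}}},
        {"Keys": ["Amazon S3"], "Metrics": {"UnblendedCost": {"Amount": "0.06"}}},
    ]},
]

# Pivot from day -> services into service -> list of (date, cost) pairs.
by_service = defaultdict(list)
for day in results_by_time:
    for group in day["Groups"]:
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        by_service[group["Keys"][0]].append((day["TimePeriod"]["Start"], amount))

print(by_service["Amazon EC2"])
# [('2026-05-01', 1.2), ('2026-05-02', 1.3)]
</code></pre>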
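<p><code>detect_anomalies</code> is where the actual check happens. For each service, it compares each day's spend to the 7 days right before it. A day is flagged when its cost exceeds that week's average plus two standard deviations, so ordinary day-to-day variation passes and only a sharp jump trips the check. Services with fewer than 8 days of data, or with a baseline average under $0.01, are skipped.</p>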
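<p>To see the arithmetic behind the check, here is a worked example of the rule in <code>detect_anomalies</code> (mean of the prior week plus two sample standard deviations). The seven baseline values are the EC2 figures that <code>build_sample_data</code> generates for the days just before the spike.</p>
<pre><code class="language-python">import statistics

# The seven EC2 values that precede the spike in the script's sample data.
baseline = [1.30, 1.10, 1.20, 1.30, 1.10, 1.20, 1.30]
spike = 18.50

avg = statistics.mean(baseline)
std = statistics.stdev(baseline)  # sample standard deviation, as in the script
threshold = avg + 2.0 * std

print(round(avg, 4))                        # 1.2143
print(round(threshold, 4))                  # 1.3942
print(round((spike - avg) / avg * 100, 1))  # 1423.5
</code></pre>
<p>The spike of 18.50 sits far above the 1.3942 threshold, so it is flagged. A day costing 1.35 would not be, because it falls inside the normal spread of that week.</p>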
<p><code>--recent-days=7</code> means <em>"only show me anomalies from the last 7 days."</em> The script still fetches 30 days of data because it needs that history to calculate the comparison, but the results are filtered to the window you care about. This is good for a quick Monday morning check.</p>
<p><code>--sample</code> runs without touching your AWS account at all. It uses built-in fake billing data with a spike baked into yesterday's date so the detection always fires. Use this first to see what the output looks like before connecting it to real data.</p>
<h3 id="heading-what-the-output-looks-like">What the Output Looks Like</h3>
<p>Running <code>--sample</code> (the spike date will show as yesterday's actual date, not a fixed value):</p>
<pre><code class="language-plaintext">Running against sample data (--sample mode).
This data represents 30 days of billing ending yesterday, with a realistic EC2 anomaly on 2026-05-14.

============================================================
ANOMALIES DETECTED: 1
============================================================

Service:      Amazon EC2
Date:         2026-05-14
Actual cost:  $18.5
Baseline avg: $1.2143 (prior 7-day average)
Threshold:    $1.3942
Overage:      1423.5% above baseline

============================================================
A note on AWS cost attribution:
The service label in Cost Explorer is assigned by AWS, not by the resource
that caused the cost. An EC2 spike may be caused by EBS snapshot copies,
cross-region data transfer, or autoscaling events that AWS categorizes under
EC2 in billing - not a running EC2 instance you can find in the console.

Before investigating the flagged service directly, ask:
What changed in my infrastructure on or before the flagged date?
Work backward from the operational change, not forward from the billing label.
</code></pre>
<p>The dates in your output will differ from the above because the sample data is generated relative to today, so the spike always lands on yesterday. The cost figures depend only on each day's position in the 30-day window, not on the calendar date, so they should match from run to run.</p>
<h3 id="heading-the-decision-the-script-cant-make-for-you">The Decision the Script Can't Make for You</h3>
<p>The anomaly is on the EC2 line, and the instinct is to go look at running EC2 instances. But as the output warns, the attribution is AWS's choice, not yours.</p>
<p>Before opening the EC2 console, check your deployment history for that date. What was deployed? Was a new environment created? Did an autoscaler event run? Start from the operational change and follow the thread to the billing data, because starting from the billing label and working backward is slower and frequently misleading.</p>
<h3 id="heading-break-it-on-purpose">Break It on Purpose</h3>
<pre><code class="language-bash"># See the spike immediately with no AWS account needed
python detect_cost_anomaly.py --sample

# Run against your real account
python detect_cost_anomaly.py

# Only show anomalies from the last 7 days, good for a quick this-week check
python detect_cost_anomaly.py --recent-days=7

# Combine both flags - sample data filtered to the last 7 days
python detect_cost_anomaly.py --sample --recent-days=7
</code></pre>
<p><strong>If your real account returns "No anomalies detected", that's not a failure.</strong> It means your spend has been consistent. A clean account returns clean output. The script is doing exactly what it should.</p>
<p>When a real event happens on your account, such as an autoscaler left running, a forgotten environment, or an unexpected data transfer, this is what catches it before the invoice does.</p>
<h2 id="heading-use-case-2-log-correlation-across-services">Use Case 2 - Log Correlation Across Services</h2>
<p><strong>Environment:</strong> Fully local – Docker Compose, three Python services<br><strong>Language:</strong> Python</p>
<h3 id="heading-the-production-problem">The Production Problem</h3>
<p>A user reports that their payment failed. You open your logging tool and search. The auth service logged a successful authentication. The ledger service logged a successful transaction. But the notification service, which should have sent a payment confirmation email, has logged nothing at all.</p>
<p>Two services reported success while one service stayed silent. The payment still failed, and you have three logs and no clear answer about where the chain broke.</p>
<h3 id="heading-whats-actually-happening-at-the-system-level">What's Actually Happening at the System Level</h3>
<p><strong>What this is not:</strong> This isn't a guide to installing a log aggregation tool. It's about the data structure that makes log correlation possible in the first place and what happens when that structure breaks on one service's error path.</p>
<p>In a system with a single service, debugging is simple: one service, one log file, one timeline. But when a user request passes through multiple services, you need a way to link all the logs together. That link is called a trace ID.</p>
<p>Think of it like a ticket number at a government office. When you walk in, you get a number, say, A247. Every desk that handles your case writes A247 on your file. If something goes wrong, the manager pulls every record with A247 and sees exactly what happened, in order, across every desk. That is a trace ID. One number, shared across every service that touched the request.</p>
<p>In the demo, when a payment comes in, the auth service creates a unique ID for it. Every log line that auth, ledger, and notification write for that payment includes the same ID. When something breaks, you run <code>correlate.py</code> with that ID and it finds every related log line across all three services and sorts them by time:</p>
<pre><code class="language-plaintext">python correlate.py pay-abc123
</code></pre>
<p>Here's what those logs look like. Notice that every line has the same <code>trace_id</code>:</p>
<pre><code class="language-json">{"timestamp": "2026-05-01T14:23:01.234Z", "trace_id": "pay-abc123", "service": "auth", "event": "user_authenticated", "level": "INFO", "user_id": "u_789", "duration_ms": 12}
{"timestamp": "2026-05-01T14:23:01.891Z", "trace_id": "pay-abc123", "service": "ledger", "event": "transaction_recorded", "level": "INFO", "amount": 50.0, "currency": "USD"}
{"timestamp": "2026-05-01T14:23:02.103Z", "trace_id": "pay-abc123", "service": "notification", "event": "email_queued", "level": "INFO", "recipient": "user@example.com"}
</code></pre>
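<p>To make the idea concrete, here is a minimal sketch of how a service could produce lines like these. The <code>new_trace_id</code> helper and the exact shape of <code>emit</code> are assumptions for illustration, modelled on the log fields shown above. The companion repo's real helpers may look different.</p>
<pre><code class="language-python">import json
import uuid
from datetime import datetime, timezone


def new_trace_id(prefix="pay"):
    """Create a short unique ID such as pay-5831e1bf (format assumed from the demo output)."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


def emit(trace_id, event, level="INFO", service="auth", **fields):
    """Build and print one JSON log line that always carries the trace ID."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,
        "service": service,
        "event": event,
        "level": level,
        **fields,
    }
    line = json.dumps(record)
    print(line)
    return line


# The first service creates the ID once; every later log call reuses it.
trace_id = new_trace_id()
emit(trace_id, "user_authenticated", user_id="u_789", duration_ms=12)
emit(trace_id, "transaction_recorded", service="ledger", amount=50.0, currency="USD")
</code></pre>
<p>Because <code>trace_id</code> is a required positional argument in this sketch, a log call cannot be written without one. That design choice matters for the failure described next.</p>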
<p>Now here's what breaks it. The notification service hits a timeout connecting to the email provider. The developer who wrote the error handler forgot to include the trace ID, so instead of a proper log line, it writes this:</p>
<pre><code class="language-plaintext">2026-05-01T14:23:02.415Z ERROR Connection timeout to email provider smtp.example.com:587
</code></pre>
<p>The error happened and the log line exists. But because it has no <code>trace_id</code>, <code>correlate.py</code> can't find it.</p>
<p>The notification service still appears in the timeline, and you can see <code>email_send_attempt</code>, but <code>email_queued</code> never follows it.</p>
<pre><code class="language-plaintext">Timeline - 5 events across 3 service(s):

  [2026-05-15T21:59:00.605307+00:00] [AUTH] [INFO] payment_request_received
  [2026-05-15T21:59:00.606008+00:00] [AUTH] [INFO] user_authenticated
  [2026-05-15T21:59:00.617331+00:00] [LEDGER] [INFO] transaction_recorded
  [2026-05-15T21:59:00.630313+00:00] [NOTIFICATION] [INFO] email_send_attempt
  [2026-05-15T21:59:00.685182+00:00] [AUTH] [INFO] payment_complete
</code></pre>
<p>The attempt is there but the failure is not. The developer just forgot one field.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/22b7d7b0-8ae5-4573-bcb0-faaf5d807e8a.png" alt="log correlation attempt terminal output - ERROR Connection timeout" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-set-up-the-demo-environment">Set Up the Demo Environment</h3>
<p>Navigate to <code>02-log-correlation/</code> and start the three services:</p>
<pre><code class="language-plaintext">cd 02-log-correlation
docker compose up -d
</code></pre>
<p>This starts the auth, ledger, and notification services. Trigger a payment request to generate some logs:</p>
<pre><code class="language-plaintext">./trigger_request.sh
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/06977757-e7fd-43c3-aaca-dbc6d17c951a.png" alt="trigger_request.sh terminal output - also showing the traceid" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The script prints the trace ID it used. Copy the ID and run the correlation script against it now, before we break anything, to see the full working path:</p>
<pre><code class="language-plaintext">python correlate.py pay-5831e1bf
</code></pre>
<p>You should see something like this (your trace ID will be different but the structure is the same):</p>
<pre><code class="language-plaintext">Loading logs from ./logs/...
Loaded 6 structured log lines.

============================================================
Trace ID: pay-5831e1bf
============================================================

Timeline - 6 events across 3 service(s):

  [2026-05-15T21:42:28.079046+00:00] [AUTH] [INFO] payment_request_received
    service: auth
    user_id: u_789
    amount: 50.0
  [2026-05-15T21:42:28.080718+00:00] [AUTH] [INFO] user_authenticated
    service: auth
    user_id: u_789
    duration_ms: 12
  [2026-05-15T21:42:28.145528+00:00] [LEDGER] [INFO] transaction_recorded
    service: ledger
    user_id: u_789
    amount: 50.0
    currency: USD
  [2026-05-15T21:42:28.210088+00:00] [NOTIFICATION] [INFO] email_send_attempt
    service: notification
    recipient: user@example.com
  [2026-05-15T21:42:28.347893+00:00] [NOTIFICATION] [INFO] email_queued
    service: notification
    recipient: user@example.com
    amount: 50.0
  [2026-05-15T21:42:28.378402+00:00] [AUTH] [INFO] payment_complete
    service: auth
    user_id: u_789
    amount: 50.0
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/26a0b226-4e3d-4967-a7c6-6d367409fb1d.png" alt="terminal output showing the full payment journey" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>That's the full payment journey through auth, ledger, and notification, in the exact order it happened. Now let's look at how the script works.</p>
<h3 id="heading-the-script">The Script</h3>
<pre><code class="language-python"># correlate.py
import json
import os
import sys

SERVICES = ["auth", "ledger", "notification"]
LOG_DIR = "./logs"


def load_logs(log_dir):
    """
    Read each service's log file and parse every line as JSON.
    Lines that fail JSON parsing are printed as warnings.
    They are not silently dropped - a plain-text error line in a service
    that should emit structured logs is itself evidence worth seeing.
    """
    all_lines = []

    for service in SERVICES:
        log_file = os.path.join(log_dir, f"{service}.log")

        if not os.path.exists(log_file):
            print(f"  WARNING: No log file for '{service}' at {log_file}")
            continue

        with open(log_file) as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if not line:
                    continue
                try:
                    parsed = json.loads(line)
                    parsed["_source"] = service
                    all_lines.append(parsed)
                except json.JSONDecodeError:
                    # This line exists in the log but cannot be correlated.
                    print(f"  WARNING: {service}.log line {line_num} is not structured JSON:")
                    print(f"           {line[:100]}")
                    print(f"           This line will NOT appear in any trace-based search.")

    return all_lines


def correlate(trace_id, all_lines):
    """
    Find every log line with this trace_id and sort by timestamp.
    The sorted result is the reconstructed timeline of the request.
    """
    matched = [line for line in all_lines if line.get("trace_id") == trace_id]
    matched.sort(key=lambda x: x.get("timestamp", ""))
    return matched


def find_missing_services(matched):
    """
    Check which services produced zero trace-tagged lines for this request.
    A missing service is not just an absence - it is a signal.
    Either the request never reached that service, or an error path swallowed
    the trace ID. Both are worth investigating.
    """
    services_seen = {line["_source"] for line in matched}
    return [s for s in SERVICES if s not in services_seen]


def print_timeline(trace_id, matched, missing):
    print(f"\n{'=' * 60}")
    print(f"Trace ID: {trace_id}")
    print(f"{'=' * 60}")

    if not matched:
        print("\nNo structured log lines found with this trace ID.")
        print("Either the trace ID is wrong, or no service emitted")
        print("a structured log line for this request.")
        return

    services_count = len({line["_source"] for line in matched})
    print(f"\nTimeline - {len(matched)} events across {services_count} service(s):\n")

    for line in matched:
        ts = line.get("timestamp", "unknown")
        service = line.get("_source", "unknown").upper()
        event = line.get("event", "unknown event")
        level = line.get("level", "INFO")
        extras = {k: v for k, v in line.items()
                  if k not in ("timestamp", "trace_id", "event", "level", "_source")}

        print(f"  [{ts}] [{service}] [{level}] {event}")
        for k, v in extras.items():
            print(f"    {k}: {v}")

    if missing:
        print(f"\n{'=' * 60}")
        print("MISSING TELEMETRY")
        print(f"{'=' * 60}")
        print(f"These services produced no trace-tagged events for trace {trace_id}:\n")
        for s in missing:
            print(f"  - {s}")
        print()
        print("This means one of three things:")
        print("  1. The request never reached this service.")
        print("  2. The service received it but an error path swallowed the trace ID,")
        print("     leaving a plain-text log line that trace correlation cannot find.")
        print("  3. This service's log file was not included in this run.")
        print()
        print("Check the raw log file for a plain-text error line around the same timestamp.")
        print("If one exists, that is your root cause - and a structured logging gap to fix.")


def run(trace_id):
    print(f"Loading logs from {LOG_DIR}/...")
    all_lines = load_logs(LOG_DIR)
    print(f"Loaded {len(all_lines)} structured log lines.\n")

    matched = correlate(trace_id, all_lines)
    missing = find_missing_services(matched)
    print_timeline(trace_id, matched, missing)


if __name__ == "__main__":
    if len(sys.argv) &lt; 2:
        print("Usage: python correlate.py &lt;trace_id&gt;")
        print("Example: python correlate.py pay-abc123")
        sys.exit(1)
    run(sys.argv[1])
</code></pre>
<h3 id="heading-how-the-script-works">How the Script Works</h3>
<p><code>load_logs</code> reads the log file for each service. Each line should be JSON. If a line isn't JSON, the function prints a warning instead of dropping it silently. A plain-text line usually means an error path logged without a trace ID, so that line can't be tied to any request.</p>
<p><code>correlate</code> finds all logs that match the given trace ID and sorts them by time. This rebuilds the full request flow across services.</p>
<p><code>find_missing_services</code> checks which services have no logs for that trace ID. This tells you where the request stopped or where the trace ID was lost.</p>
<p><code>print_timeline</code> displays the full request timeline in order. It also shows which services are missing if something didn't log correctly.</p>
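<p>One detail about the sort in <code>correlate</code>: it compares timestamps as strings. That works when every service writes the same ISO-8601 format in UTC, as the demo services do. If your services mix UTC offsets, string order and real time order can disagree. The sketch below shows the difference. <code>parse_ts</code> is a hypothetical helper, not part of <code>correlate.py</code>, and the two sample lines are made up.</p>
<pre><code class="language-python">from datetime import datetime


def parse_ts(value):
    """Parse an ISO-8601 timestamp; rewrite a trailing 'Z' for older Python versions."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


lines = [
    {"timestamp": "2026-05-01T14:30:00+00:00", "event": "second"},
    {"timestamp": "2026-05-01T15:00:00+01:00", "event": "first"},  # 14:00 in UTC
]

by_string = sorted(lines, key=lambda x: x["timestamp"])
by_time = sorted(lines, key=lambda x: parse_ts(x["timestamp"]))

print([line["event"] for line in by_string])  # string order: second, first
print([line["event"] for line in by_time])    # real time order: first, second
</code></pre>
<p>If your logs might mix offsets, sorting on the parsed value is the safer choice. If they are all UTC in one format, the string sort in the script is enough.</p>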
<p>One thing worth knowing when you use this in a real Kubernetes environment: <code>kubectl logs</code> only shows output from the currently running container. If a container has restarted, you can read the previous instance's logs with:</p>
<pre><code class="language-plaintext">kubectl logs &lt;pod-name&gt; --previous
</code></pre>
<p>But this only reaches back one restart. Older logs are gone unless you ship them to a logging system like Loki or CloudWatch.</p>
<h3 id="heading-what-the-output-looks-like-after-breaking-it">What the Output Looks Like After Breaking it</h3>
<p>The point of this section is to show you what happens when a service fails silently: the error exists in the logs, but the script can't find it because the developer forgot one field.</p>
<p><code>break_it.sh</code> forces the notification service to fail when it tries to send an email, and because the error handler was written without a trace ID, the failure gets logged as plain text with no way to tie it back to the original request.</p>
<p>Run it:</p>
<pre><code class="language-plaintext">./break_it.sh
</code></pre>
<p>Then trigger a new request:</p>
<pre><code class="language-plaintext">./trigger_request.sh
</code></pre>
<p>Copy the trace ID it prints, then correlate it:</p>
<pre><code class="language-plaintext">python correlate.py pay-xxxxxxxx
</code></pre>
<p>Here is what you'll see:</p>
<pre><code class="language-plaintext">Loading logs from ./logs/...
  WARNING: notification.log line 10 is not structured JSON:
           2026-05-15T21:59:00.681583+00:00 ERROR Connection timeout to email
           provider http://mock-email:80/ after 0.001s - failed to send
           confirmation to user@example.com
           This line will NOT appear in any trace-based search.
Loaded 29 structured log lines.

============================================================
Trace ID: pay-6cf69a8c
============================================================

Timeline - 5 events across 3 service(s):

  [2026-05-15T21:59:00.605307+00:00] [AUTH] [INFO] payment_request_received
  [2026-05-15T21:59:00.606008+00:00] [AUTH] [INFO] user_authenticated
  [2026-05-15T21:59:00.617331+00:00] [LEDGER] [INFO] transaction_recorded
  [2026-05-15T21:59:00.630313+00:00] [NOTIFICATION] [INFO] email_send_attempt
  [2026-05-15T21:59:00.685182+00:00] [AUTH] [INFO] payment_complete
</code></pre>
<p>Look at this carefully. The notification service is in the timeline, and it logged <code>email_send_attempt</code>. But <code>email_queued</code> is missing, which means the email was never actually sent, and the error that explains why isn't in the timeline at all. It's hiding in the WARNING at the very top, where the script told you it found a line it couldn't parse.</p>
<p>That's the problem: the attempt is visible, but the failure is invisible.</p>
<p>Run <code>cat logs/notification.log</code> and scroll to the bottom:</p>
<pre><code class="language-plaintext">{"timestamp": "2026-05-15T21:59:00.630313+00:00", "trace_id": "pay-6cf69a8c",
 "service": "notification", "event": "email_send_attempt", ...}
2026-05-15T21:59:00.681583+00:00 ERROR Connection timeout to email provider
http://mock-email:80/ after 0.001s - failed to send confirmation to user@example.com
</code></pre>
<p>Two lines to note: the first has a trace ID, which the script found and showed in the timeline. The second doesn't, so the script flagged it as a warning and skipped it. The error was logged about 0.051 seconds after the attempt (21:59:00.630313 versus 21:59:00.681583). The log file has both lines. The timeline only has one.</p>
<p>That is what <em>"invisible failure"</em> looks like in production. The payment went through. The confirmation email was never sent. The error is sitting right there in the log file (<code>Connection timeout to email provider after 0.001s</code>), but in the correlation output above, the timeline shows <code>email_send_attempt</code> and then jumps straight to <code>payment_complete</code> with nothing in between: no error, no failure, no gap. It looks like everything worked.</p>
<p>The fix is in <code>02-log-correlation/services/notification/main.py</code>. Here's the broken error handler:</p>
<pre><code class="language-python">except httpx.TimeoutException:
    emit_plain(f"Connection timeout to email provider {EMAIL_PROVIDER_URL}")
    return {"status": "ok"}
</code></pre>
<p>And here's the fixed version. The only change is passing <code>req.trace_id</code> into <code>emit</code> instead of calling <code>emit_plain</code>:</p>
<pre><code class="language-python">except httpx.TimeoutException:
    emit(req.trace_id, "email_timeout", level="ERROR",
         provider=EMAIL_PROVIDER_URL)
    return {"status": "ok"}
</code></pre>
<p>Once that change is made, the timeout error shows up in the timeline like everything else:</p>
<pre><code class="language-plaintext">  [2026-05-15T21:59:00.681583+00:00] [NOTIFICATION] [ERROR] email_timeout
    provider: http://mock-email:80/
</code></pre>
<p>One command, one trace ID, the full picture.</p>
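<p>Until every error path is fixed, you may also want the tooling to notice an attempt that has no recorded outcome. The sketch below is one possible way to do that. It is not part of <code>correlate.py</code>: <code>EXPECTED_OUTCOMES</code> and <code>find_unresolved_attempts</code> are hypothetical names, and the event pairs are taken from the demo's event names.</p>
<pre><code class="language-python"># Hypothetical mapping: each "attempt" event and the events that count as its outcome.
EXPECTED_OUTCOMES = {
    "email_send_attempt": {"email_queued", "email_timeout"},
}


def find_unresolved_attempts(matched):
    """Return attempt events in a sorted timeline that no outcome event follows."""
    unresolved = []
    for index, line in enumerate(matched):
        outcomes = EXPECTED_OUTCOMES.get(line.get("event"))
        if not outcomes:
            continue
        later_events = {later.get("event") for later in matched[index + 1:]}
        if not outcomes.intersection(later_events):
            unresolved.append(line)
    return unresolved


# The broken timeline from the output above, reduced to event names.
broken_timeline = [
    {"event": "payment_request_received"},
    {"event": "user_authenticated"},
    {"event": "transaction_recorded"},
    {"event": "email_send_attempt"},
    {"event": "payment_complete"},
]

for line in find_unresolved_attempts(broken_timeline):
    print(f"WARNING: {line['event']} has no recorded outcome in this trace")
</code></pre>
<p>A check like this only knows about the pairs you list, so it complements structured error logging and does not replace it.</p>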
<h3 id="heading-the-decision-the-script-cant-make-for-you">The Decision the Script Can't Make For You</h3>
<p>The correlation output points to the notification service as the gap: <code>email_send_attempt</code> appears with no <code>email_queued</code> after it. When you check the raw <code>notification.log</code>, you find the plain-text timeout error. Together they tell you that the request reached the service, that authentication and transaction recording both succeeded, and that the email failed.</p>
<p>Whether a notification failure is a payment failure depends entirely on how your system was designed. If notification is a soft dependency, this error shouldn't have surfaced to the user as a payment failure, and something else in your system design is wrong. If it's a hard dependency, the transaction itself should have rolled back. The script found where things broke, but the right response depends on the design.</p>
<h3 id="heading-break-it-on-purpose">Break it On Purpose</h3>
<ol>
<li><p>Run <code>./break_it.sh</code> – this switches the notification service to a mode where its error handler drops the trace ID</p>
</li>
<li><p>Run <code>./trigger_request.sh</code> to generate a new payment request and get a new trace ID</p>
</li>
<li><p>Run <code>python correlate.py &lt;new trace ID&gt;</code>. The timeline will show <code>email_send_attempt</code>, but <code>email_queued</code> and the timeout error will be missing.</p>
</li>
<li><p>Run <code>cat logs/notification.log</code> – the timeout error is right there, without a trace ID, invisible to the script</p>
</li>
</ol>
<h2 id="heading-use-case-3-infrastructure-drift-detection">Use Case 3 - Infrastructure Drift Detection</h2>
<p><strong>Environment:</strong> AWS free tier (one security group) + Terraform<br><strong>Language:</strong> Python</p>
<h3 id="heading-the-production-problem">The Production Problem</h3>
<p>The last Terraform plan anyone ran showed no changes. Yet your deployment is behaving differently than it did yesterday, and when you ask around, someone eventually remembers: a colleague made a quick manual change to a security group in the AWS console last week to unblock a staging test. They meant to go back and apply it through Terraform, but they forgot.</p>
<p>Your Terraform state file and your actual AWS infrastructure have been quietly disagreeing ever since. Nothing broke loudly and no alert fired. Terraform wouldn't even know unless someone ran <code>terraform plan</code> to check, and in this scenario, nobody did.</p>
<p>This is called infrastructure drift, and it's far more common than most teams want to admit.</p>
<h3 id="heading-whats-actually-happening-at-the-system-level">What's Actually Happening at the System Level</h3>
<p><strong>What this is not:</strong> This isn't the same as running <code>terraform plan</code>. A plan shows you what Terraform <em>would</em> change. This script shows you what has <em>already</em> changed in AWS without Terraform knowing.</p>
<p>The script itself doesn't run any Terraform commands. It reads the state file Terraform already produced. In the demo, Terraform creates that file. In a real environment, it already exists from your normal workflow.</p>
<p>Think of Terraform's state file as a receipt. When Terraform creates a security group, it writes down exactly what it created: the rules, the ports, the CIDRs. That receipt is the state file.</p>
<p>The script compares that receipt against what AWS actually has right now. If someone went into the AWS console and added a rule that isn't on the receipt, the script flags it as drift.</p>
<p>The blind spot is this: if someone creates a completely new security group in the console and never uses Terraform at all, there's no receipt for it. The script can't compare something it has never seen. It returns clean, and that group sits in your account undetected.</p>
<p>The demo shows both. First you break a known resource. Then the <code>--invisible</code> scenario creates a new one outside Terraform entirely, and the script returns clean even though your account now has an extra security group.</p>
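<p>One way to narrow that blind spot is to compare the IDs in the state file against every security group that exists in the region. The sketch below is an assumption about how you might do that, not something <code>detect_drift.py</code> does. <code>find_untracked</code> and <code>list_live_security_group_ids</code> are hypothetical names, and the IDs are made up.</p>
<pre><code class="language-python">def find_untracked(state_ids, live_ids):
    """Return security group IDs that exist in AWS but have no entry in the state file."""
    return sorted(set(live_ids) - set(state_ids))


def list_live_security_group_ids():
    """List every security group ID in the configured region (needs AWS credentials)."""
    import boto3  # imported here so the comparison above can be tried without boto3

    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_security_groups")
    ids = []
    for page in paginator.paginate():
        ids.extend(group["GroupId"] for group in page["SecurityGroups"])
    return ids


# Made-up IDs for illustration only.
state_ids = ["sg-0a1b2c3d4e5f6a7b8"]
live_ids = ["sg-0a1b2c3d4e5f6a7b8", "sg-0123456789abcdef0"]
print(find_untracked(state_ids, live_ids))
</code></pre>
<p>Expect noise if you try this on a real account: default VPC security groups and groups managed by other state files will all show up as untracked, so the result is a list to review, not a list of problems.</p>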
<h3 id="heading-set-up-the-demo-environment">Set Up the Demo Environment</h3>
<p>Navigate to <code>03-drift-detection/</code> in the companion repo:</p>
<pre><code class="language-plaintext">cd 03-drift-detection
pip install -r requirements.txt
</code></pre>
<p>Run setup. This uses real Terraform, not a mock:</p>
<pre><code class="language-plaintext">./setup.sh
</code></pre>
<p>This runs <code>terraform init</code> and <code>terraform apply</code>, which creates a real AWS security group:</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/d5617704-d04e-40cc-8ca9-aa9d6b806e5e.png" alt="screenshot of AWS dashboard showing security group created" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>It also writes a genuine <code>terraform.tfstate</code> file. Open it in any text editor if you want to see what Terraform actually produces. It's JSON, it's readable, and it's the real thing.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/3ea8e237-a1d7-43bc-be3d-50dcb6ff4b76.png" alt="screenshot of IDE folder structure showing terraform.tfstate file being created" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Once setup completes, run the script:</p>
<pre><code class="language-plaintext">python detect_drift.py terraform.tfstate
</code></pre>
<p>You should see something like this, but your actual security group ID will be different:</p>
<pre><code class="language-plaintext">Loading Terraform state from: terraform.tfstate

Checking: sg-0a1b2c3d4e5f6a7b8

  OK - No drift detected.
</code></pre>
<p>The lab is alive and both sides of the contract match. Now let's look at what the script is doing.</p>
<h3 id="heading-the-script-code-files">The Script (<a href="https://github.com/Osomudeya/devops-scripting-labs">Code Files</a>)</h3>
<pre><code class="language-python"># detect_drift.py
import boto3
import json
import sys


def load_tfstate(path):
    """
    The Terraform state file is plain JSON - open it in any text editor
    and you will see a 'resources' array listing everything Terraform knows about.
    This function reads that file and returns the parsed contents.
    """
    with open(path) as f:
        return json.load(f)


def get_security_groups_from_state(tfstate):
    """
    Walk through the resources array and collect every security group entry.
    Each resource has a 'type', a 'name', and an 'instances' array holding
    the attribute values Terraform recorded when it last ran.
    We extract the resource ID and the ingress (inbound) rules.
    """
    resources = {}
    for resource in tfstate.get("resources", []):
        if resource["type"] == "aws_security_group":
            for instance in resource.get("instances", []):
                sg_id = instance["attributes"]["id"]
                resources[sg_id] = {
                    "ingress": instance["attributes"].get("ingress", [])
                }
    return resources


def get_security_group_from_aws(sg_id):
    """
    Call the AWS EC2 API to fetch the live current state of this security group.
    Under the hood, boto3 constructs an authenticated HTTPS request, signs it with
    your AWS credentials, sends it to the EC2 API endpoint in your configured region,
    and parses the response. The response contains far more data than we need -
    we extract only the inbound rules.
    """
    ec2 = boto3.client("ec2")
    response = ec2.describe_security_groups(GroupIds=[sg_id])
    sg = response["SecurityGroups"][0]
    return {"ingress": sg.get("IpPermissions", [])}


def normalize_state_rules(rules):
    """
    Terraform stores ingress rules in its own format.
    We normalize them into a set of tuples for easy comparison.
    Each tuple is: (from_port, to_port, protocol, cidr_block)
    """
    normalized = set()
    for rule in rules:
        for cidr in rule.get("cidr_blocks", []):
            normalized.add((
                rule.get("from_port", 0),
                rule.get("to_port", 0),
                rule.get("protocol", "-1"),
                cidr
            ))
    return normalized


def normalize_aws_rules(rules):
    """
    AWS returns ingress rules in a different format from Terraform's.
    We normalize them into the same tuple shape so the comparison works.
    """
    normalized = set()
    for rule in rules:
        from_port = rule.get("FromPort", 0)
        to_port = rule.get("ToPort", 0)
        protocol = rule.get("IpProtocol", "-1")
        for ip_range in rule.get("IpRanges", []):
            normalized.add((from_port, to_port, protocol, ip_range["CidrIp"]))
    return normalized


def detect_drift(tfstate_path):
    print(f"Loading Terraform state from: {tfstate_path}")
    tfstate = load_tfstate(tfstate_path)
    state_sgs = get_security_groups_from_state(tfstate)

    if not state_sgs:
        print("No security groups found in state file. Nothing to compare.")
        return

    drift_found = False

    for sg_id, state_data in state_sgs.items():
        print(f"\nChecking: {sg_id}")

        try:
            aws_data = get_security_group_from_aws(sg_id)
        except Exception as e:
            print(f"  ERROR: Could not fetch {sg_id} from AWS - {e}")
            print(f"  Check your IAM permissions: ec2:DescribeSecurityGroups is required.")
            continue

        state_rules = normalize_state_rules(state_data["ingress"])
        aws_rules = normalize_aws_rules(aws_data["ingress"])

        # Rules in AWS that Terraform does not know about (manual additions)
        added_in_aws = aws_rules - state_rules
        # Rules Terraform expects that no longer exist in AWS (manual deletions)
        removed_from_aws = state_rules - aws_rules

        if added_in_aws:
            drift_found = True
            print("  DRIFT - Rules present in AWS but missing from state file:")
            for rule in added_in_aws:
                print(f"    Port {rule[0]}-{rule[1]} | Protocol: {rule[2]} | CIDR: {rule[3]}")

        if removed_from_aws:
            drift_found = True
            print("  DRIFT - Rules in state file but removed from AWS:")
            for rule in removed_from_aws:
                print(f"    Port {rule[0]}-{rule[1]} | Protocol: {rule[2]} | CIDR: {rule[3]}")

        if not added_in_aws and not removed_from_aws:
            print("  OK - No drift detected.")

    print("\n" + "=" * 60)
    if drift_found:
        print("Drift detected. See above for details.")
    else:
        print("No drift detected in monitored resources.")

    print("\nIMPORTANT: This script only checks resources tracked in your state file.")
    print("Resources created manually in AWS without Terraform are invisible to this check.")
    print("A clean output here does not mean your AWS account is clean - it means")
    print("the resources you are watching match what Terraform last recorded.")


if __name__ == "__main__":
    tfstate_path = sys.argv[1] if len(sys.argv) &gt; 1 else "terraform.tfstate"
    detect_drift(tfstate_path)
</code></pre>
<h3 id="heading-how-the-script-works">How the Script Works</h3>
<p><code>load_tfstate</code> opens <code>terraform.tfstate</code> and reads it. Run <code>cat terraform.tfstate</code> after setup and you'll see that it's just a JSON text file, and that everything Terraform knows about your infrastructure is stored in it.</p>
<p><code>get_security_groups_from_state</code> pulls every security group out of that file: the ID AWS assigned to it and the inbound rules Terraform last recorded. These are the expected values.</p>
<p><code>get_security_group_from_aws</code> calls the AWS API and fetches the same security group's current inbound rules. These are the actual values. The script now has two versions of the same thing.</p>
<p><code>normalize_state_rules</code> and <code>normalize_aws_rules</code> exist because Terraform and AWS store the same rule in slightly different formats. These two functions convert both into the same format so the comparison works.</p>
<p>The comparison is the last step. Rules in AWS but not in the state file were added manually. Rules in the state file but not in AWS were deleted manually. The script prints both.</p>
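<p>Here is the same idea on sample data, with no AWS account needed. The field names match the ones the two normalize functions read. The rule values are made up for illustration.</p>
<pre><code class="language-python"># One rule in the shape Terraform state uses (fields read by normalize_state_rules).
state_rule = {"from_port": 443, "to_port": 443, "protocol": "tcp",
              "cidr_blocks": ["10.0.0.0/16"]}

# Live rules in the shape the EC2 API returns (fields read by normalize_aws_rules).
aws_rules_raw = [
    {"FromPort": 443, "ToPort": 443, "IpProtocol": "tcp",
     "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
    {"FromPort": 22, "ToPort": 22, "IpProtocol": "tcp",
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]

# Both sides become sets of (from_port, to_port, protocol, cidr) tuples.
state_rules = {
    (state_rule["from_port"], state_rule["to_port"], state_rule["protocol"], cidr)
    for cidr in state_rule["cidr_blocks"]
}
aws_rules = {
    (rule["FromPort"], rule["ToPort"], rule["IpProtocol"], ip_range["CidrIp"])
    for rule in aws_rules_raw
    for ip_range in rule["IpRanges"]
}

added_in_aws = aws_rules - state_rules      # the manually added SSH rule
removed_from_aws = state_rules - aws_rules  # empty: nothing was deleted

print(added_in_aws)
print(removed_from_aws)
</code></pre>
<p>Note that both normalize functions only build tuples from IPv4 CIDR blocks. A rule that references an IPv6 range or another security group produces no tuple on either side, so the script as written would not flag drift in those rules.</p>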
<h3 id="heading-what-the-output-looks-like">What the Output Looks Like</h3>
<p>A clean run with no drift:</p>
<pre><code class="language-plaintext">Loading Terraform state from: terraform.tfstate

Checking: sg-0a1b2c3d4e5f6a7b8

  OK - No drift detected.

============================================================
No drift detected in monitored resources.

IMPORTANT: This script only checks resources tracked in your state file.
Resources created manually in AWS without Terraform are invisible to this check.
A clean output here does not mean your AWS account is clean - it means
the resources you are watching match what Terraform last recorded.
</code></pre>
<p>After injecting drift:</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/61ae6f66-36e4-4e03-8e76-f250e2489dab.png" alt="screenshot of AWS dashboard showing security group inbound rule created" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<pre><code class="language-plaintext">Loading Terraform state from: terraform.tfstate

Checking: sg-0a1b2c3d4e5f6a7b8

  DRIFT - Rules present in AWS but missing from state file:
    Port 22-22 | Protocol: tcp | CIDR: 0.0.0.0/0

============================================================
Drift detected. See above for details.
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5744ac0c-2e4f-4015-9b72-a1fa7084587e.png" alt="screenshot of terminal output after injecting drift showing &quot;drift detected&quot;" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-the-decision-the-script-cant-make-for-you">The Decision the Script Can't Make For You</h3>
<p>The script finds drift: an inbound rule that Terraform doesn't know about. The instinct is to revert it immediately by running <code>terraform apply</code>, but before doing that, ask one question: was this change an emergency hotfix? Someone may have manually opened a port at 2am to restore a broken service while a proper fix was being prepared. If you revert it automatically, you might undo something that was deliberately placed there to keep a service running.</p>
<p>Drift detection tells you that things are different. It doesn't tell you which version is correct, and investigating that is the work that comes after the script runs.</p>
<h3 id="heading-break-it-on-purpose">Break it On Purpose</h3>
<ol>
<li><p>Run <code>./break_it.sh</code>. This adds an SSH inbound rule (port 22) directly via the AWS CLI, simulating a manual console change.</p>
</li>
<li><p>Run <code>python detect_drift.py terraform.tfstate</code>. The drift appears in the output.</p>
</li>
<li><p>Run <code>./break_it.sh --invisible</code> to create a brand new security group that's not in the state file at all, then run the script again. It returns clean even though a new resource exists in your account, making the coverage gap visible.</p>
</li>
<li><p>When you're finished, run <code>./teardown.sh</code>. It runs <code>terraform destroy</code> to delete the security group and clean up all AWS resources, so no charges remain afterwards.</p>
</li>
</ol>
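<p>The coverage gap in step 3 follows from how a state-driven check works: it only walks the resources listed in the state file. Here is a minimal sketch of that idea, using made-up in-memory data instead of real AWS or Terraform calls. The function name <code>find_drift</code>, the group IDs, and the rule tuples are illustrative assumptions, not the code in <code>detect_drift.py</code>.</p>

```python
# Toy model of state-driven drift detection. All data here is hypothetical.
# A rule is (from_port, to_port, protocol, cidr).

def find_drift(state_groups, live_groups):
    """Return {group_id: extra_rules} for groups tracked in the state file."""
    drift = {}
    for group_id, state_rules in state_groups.items():
        live_rules = live_groups.get(group_id, set())
        extra = live_rules - state_rules  # present in AWS, missing from state
        if extra:
            drift[group_id] = extra
    return drift


state = {"sg-tracked": {(443, 443, "tcp", "0.0.0.0/0")}}
live = {
    "sg-tracked": {(443, 443, "tcp", "0.0.0.0/0"), (22, 22, "tcp", "0.0.0.0/0")},
    # Created outside Terraform: never visited, because the loop walks the state.
    "sg-invisible": {(22, 22, "tcp", "0.0.0.0/0")},
}

drift = find_drift(state, live)
untracked = set(live) - set(state)
print("drift:", drift)
print("untracked groups:", untracked)
```

<p>Comparing the set of live IDs against the set of tracked IDs, as the last lines do, is one way to surface resources the state file has never heard of.</p>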
<h2 id="heading-use-case-4-secrets-rotation-with-zero-downtime">Use Case 4 - Secrets Rotation with Zero Downtime</h2>
<p><strong>Environment:</strong> AWS Secrets Manager + local Kind cluster<br><strong>Language:</strong> Python</p>
<h3 id="heading-the-production-problem">The Production Problem</h3>
<p><strong>The goal of this use case:</strong> Kubernetes says a pod is healthy, but your users are getting database errors. The script catches that gap before the users are affected by running one extra check that Kubernetes never runs.</p>
<p>You rotate your database credentials. The pod restarts. <code>kubectl get pods</code> shows Running. Ten minutes later, users can't log in.</p>
<p>The rotation worked, but the problem is that Kubernetes checked whether the HTTP server was alive, not whether it could authenticate with the database. Those are two different things.</p>
<h3 id="heading-whats-actually-happening">What's Actually Happening</h3>
<p><strong>What this is not:</strong> This isn't about how to store secrets in Kubernetes. It's about what happens after the secret is rotated.</p>
<p>When a pod is already running, it holds a pool of open database connections that were authenticated before the rotation happened. Those connections stay alive after the password changes because the database authenticates a session only when it is opened and does not kick out existing sessions. But when the pool needs to open a new connection, it uses the credentials the pod loaded into its environment at startup, which are still the old password. That new connection fails immediately.</p>
<p>Meanwhile, Kubernetes sees the pod responding to HTTP and marks it Running, so your users are hitting the failures with no indication from the cluster that anything is wrong.</p>
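<p>The pool behaviour described above can be reproduced with a toy simulation. This is a sketch under stated assumptions: <code>FakeDatabase</code> and <code>FakePool</code> are hypothetical stand-ins written for this example, not asyncpg classes, and connections are never returned to the pool so that the third request is forced to open a fresh one.</p>

```python
# Toy simulation of a connection pool across a password rotation.
# FakeDatabase and FakePool are hypothetical stand-ins, not asyncpg classes.

class FakeDatabase:
    def __init__(self, password):
        self.password = password

    def connect(self, password):
        # Authentication happens once, when the connection is opened.
        if password != self.password:
            raise PermissionError("password authentication failed")
        return object()  # an already-authenticated session


class FakePool:
    def __init__(self, db, password, size):
        self.db = db
        self.password = password  # captured at startup, like an env var
        self.idle = [db.connect(password) for _ in range(size)]

    def acquire(self):
        if self.idle:
            return self.idle.pop()  # reuse: no new authentication
        return self.db.connect(self.password)  # fresh connection


db = FakeDatabase("old-secret")
pool = FakePool(db, "old-secret", size=2)
db.password = "new-secret"  # rotation happens; the pod is not restarted

results = []
for _ in range(3):
    try:
        pool.acquire()
        results.append("ok")
    except PermissionError:
        results.append("auth failed")
print(results)
```

<p>The first two requests reuse sessions that were authenticated before the rotation. The third needs a new connection and fails, which is the mixed success and failure pattern users would see.</p>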
<h3 id="heading-what-the-healthzdb-endpoint-does">What the <code>/healthz/db</code> Endpoint Does</h3>
<p><code>/healthz</code> returns 200 if the HTTP server is alive. That is all Kubernetes checks.</p>
<p><code>/healthz/db</code> opens a fresh database connection using the current credentials and runs <code>SELECT 1</code>. If that fails after a rotation, the pod is Running but can't serve database requests. The rotation script calls this endpoint as its final step – the check Kubernetes never runs.</p>
<p>Here's what that looks like in the demo FastAPI application (<a href="https://github.com/Osomudeya/devops-scripting-labs">code files</a>):</p>
<pre><code class="language-python"># app.py (relevant section)
import os
import asyncpg
from fastapi import FastAPI, HTTPException

app = FastAPI()

DB_HOST = os.environ.get("DB_HOST", "postgres")
DB_PORT = int(os.environ.get("DB_PORT", "5432"))
DB_NAME = os.environ.get("DB_NAME", "appdb")
DB_USERNAME = os.environ.get("DB_USERNAME", "appuser")
DB_PASSWORD = os.environ.get("DB_PASSWORD", "")

@app.get("/healthz")
async def healthz():
    # Always returns 200 if the HTTP server is alive.
    # This is all the Kubernetes readiness probe checks.
    return {"status": "ok"}

@app.get("/healthz/db")
async def healthz_db():
    # Opens a fresh connection using the current environment credentials.
    # If the password was rotated and this pod has not restarted yet,
    # the environment still has the old password - this connection fails.
    # /healthz above would still return 200. Your users would see errors.
    try:
        conn = await asyncpg.connect(
            host=DB_HOST, port=DB_PORT,
            database=DB_NAME, user=DB_USERNAME, password=DB_PASSWORD,
        )
        try:
            await conn.execute("SELECT 1")
        finally:
            # Always release the connection, even if the query fails.
            await conn.close()
        return {"status": "ok", "db": "authenticated"}

    except asyncpg.InvalidPasswordError:
        raise HTTPException(
            status_code=503,
            detail=(
                f"Authentication failed for '{DB_USERNAME}'. "
                "Password may have been rotated. "
                "Readiness probe does not check this."
            )
        )
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"Database error: {str(e)}")
</code></pre>
<p>The difference between these two endpoints is the entire lesson of this use case.</p>
<h3 id="heading-set-up-the-demo-environment">Set Up the Demo Environment</h3>
<p>Navigate to <code>04-secrets-rotation/</code> and run the setup script:</p>
<pre><code class="language-plaintext">cd 04-secrets-rotation
./setup.sh
</code></pre>
<p>This starts a Kind cluster, deploys a real PostgreSQL instance with the <code>appuser</code> account already created, deploys the demo FastAPI app connected to it, and creates an initial secret in AWS Secrets Manager.</p>
<p>Once setup completes, install the dependencies:</p>
<pre><code class="language-plaintext">pip install boto3 kubernetes
</code></pre>
<p>Before running the rotation, confirm everything is running:</p>
<pre><code class="language-plaintext">kubectl get pods
</code></pre>
<p>You should see <code>myapp</code> and <code>postgres</code> pods both in the Running state. If any pod shows Pending or Error, wait 30 seconds and check again. PostgreSQL takes a moment to finish initialising.</p>
<p>You can also verify that the secret was created in AWS. In the console, go to AWS Secrets Manager and look for <code>myapp/db-credentials</code>:</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/c5d481f7-c938-43f8-91ec-09640c137897.png" alt="screenshot showing AWS secret created" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>If you prefer the CLI:</p>
<pre><code class="language-plaintext">aws secretsmanager get-secret-value --secret-id myapp/db-credentials
</code></pre>
<p>Once both pods are Running and the secret exists, run the rotation to see the full path:</p>
<pre><code class="language-plaintext">python rotate_secret.py
</code></pre>
<p><strong>If Step 6 shows FAILED on this first clean run</strong>, it's almost always a timing issue: the app pod restarted successfully but <code>/healthz/db</code> ran before the new pod finished establishing its first database connection. Wait 20 seconds and run <code>python rotate_secret.py</code> again. If it fails repeatedly, run <code>kubectl logs deployment/myapp</code> to see what the app is reporting.</p>
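<p>If you would rather not re-run the whole script by hand when this timing issue appears, one option is to retry only the final check a bounded number of times. The helper below is a hypothetical sketch, not part of <code>rotate_secret.py</code>. The name <code>verify_with_retry</code>, the attempt count, and the delay are all assumptions. In real use, <code>check</code> could be a lambda that calls <code>verify_credential(namespace, deployment_name)</code>.</p>

```python
import time


def verify_with_retry(check, attempts=3, delay_seconds=10, sleep=time.sleep):
    """Call check() until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt != attempts:
            sleep(delay_seconds)  # give the new pod time to finish starting
    return False


# Example with a stand-in check that fails once, then passes.
# The sleep function is replaced so the example runs instantly.
outcomes = iter([False, True])
waits = []
passed = verify_with_retry(lambda: next(outcomes), sleep=waits.append)
print(passed, waits)
```

<p>Keeping the number of attempts small matters here: the point of the final check is to report a real authentication failure, and unlimited retries would hide one.</p>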
<p>You should see all six steps complete cleanly, ending with:</p>
<pre><code class="language-plaintext">Rotation complete. Credential verified at the application level.
  AWS Secrets Manager: updated
  PostgreSQL:          updated (ALTER USER)
  Kubernetes Secret:   updated
  Application pod:     restarted, authenticated
</code></pre>
<p>The lab is alive and the full rotation chain works end to end. Now let's look at what the script is doing.</p>
<h3 id="heading-the-script-code-files">The Script (<a href="https://github.com/Osomudeya/devops-scripting-labs">Code Files</a>)</h3>
<pre><code class="language-python"># rotate_secret.py
import boto3
import base64
import json
import subprocess
import sys
from kubernetes import client, config


def get_current_secret(secret_name):
    """
    Fetch the current credential from AWS Secrets Manager.
    The secret is stored as a JSON string with 'username' and 'password' fields.
    """
    sm = boto3.client("secretsmanager")
    response = sm.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])


def rotate_in_aws(secret_name, username, new_password):
    """
    Write the new credential to AWS Secrets Manager.
    put_secret_value creates a new version - the previous version is
    not deleted immediately, giving you a short rollback window.
    """
    sm = boto3.client("secretsmanager")
    new_value = json.dumps({"username": username, "password": new_password})
    sm.put_secret_value(SecretId=secret_name, SecretString=new_value)
    print("  [AWS] Secret updated in Secrets Manager.")


def update_kubernetes_secret(namespace, k8s_secret_name, username, new_password):
    """
    Patch the Kubernetes Secret object with the new credential values.
    Kubernetes requires secret data to be base64-encoded - this is encoding,
    not encryption. Anyone with access to the Secret object can decode the values.
    Real encryption at rest requires separate etcd encryption configuration.
    """
    config.load_kube_config()
    v1 = client.CoreV1Api()

    secret_data = {
        "username": base64.b64encode(username.encode()).decode(),
        "password": base64.b64encode(new_password.encode()).decode()
    }

    v1.patch_namespaced_secret(
        name=k8s_secret_name,
        namespace=namespace,
        body={"data": secret_data}
    )
    print(f"  [K8s] Kubernetes Secret '{k8s_secret_name}' updated.")


def rolling_restart(namespace, deployment_name):
    """
    Trigger a rolling restart of the deployment.
    Rolling restart means Kubernetes creates one new pod, waits for it to pass
    its readiness probe, then terminates one old pod - and repeats until all
    pods have been replaced. Availability is preserved throughout.
    This is very different from deleting all pods at once.
    """
    result = subprocess.run(
        ["kubectl", "rollout", "restart",
         f"deployment/{deployment_name}", "-n", namespace],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f"Rolling restart failed: {result.stderr}")
    print(f"  [K8s] Rolling restart triggered for '{deployment_name}'.")


def wait_for_rollout(namespace, deployment_name, timeout=120):
    """
    Block until the rolling restart finishes or times out.
    'Finished' means all new pods are Running and their readiness probes passed.
    This does NOT mean the application can authenticate with the new credential.
    That is what verify_credential checks next.
    """
    print(f"  [K8s] Waiting for rollout (timeout: {timeout}s)...")
    result = subprocess.run(
        ["kubectl", "rollout", "status",
         f"deployment/{deployment_name}",
         "-n", namespace,
         f"--timeout={timeout}s"],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f"Rollout did not complete: {result.stderr}")
    print("  [K8s] Rollout complete. All pods report Ready.")


def verify_credential(namespace, deployment_name):
    """
    This is the check the readiness probe does not make.
    We exec into the running pod and call /healthz/db - an endpoint that
    makes an actual authenticated query to the database.
    If this passes: the credential is working at the application level.
    If this fails after the readiness probe passed: the contract mismatch is confirmed.
    The pod is Running. The application cannot serve database requests.
    """
    print("  [Verify] Running post-rotation credential check...")

    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-l", f"app={deployment_name}",
         "-o", "jsonpath={.items[0].metadata.name}"],
        capture_output=True, text=True
    )
    pod_name = result.stdout.strip()

    if not pod_name:
        print("  [Verify] ERROR: No running pod found for this deployment.")
        return False

    verify = subprocess.run(
        ["kubectl", "exec", pod_name, "-n", namespace,
         "--", "curl", "-sf", "http://localhost:8000/healthz/db"],
        capture_output=True, text=True
    )

    if verify.returncode != 0:
        print("  [Verify] FAILED - Pod is Running but database authentication failed.")
        print("           The readiness probe validated HTTP reachability.")
        print("           The application cannot authenticate with the new credential.")
        print("           These are two different contracts. Only one was checked automatically.")
        return False

    print("  [Verify] PASSED - Application confirmed it can authenticate with the new credential.")
    return True


def rotate(secret_name, new_password, namespace, k8s_secret_name, deployment_name):
    print("\n[Step 1/6] Reading current secret from AWS Secrets Manager...")
    current = get_current_secret(secret_name)
    username = current["username"]

    print("[Step 2/6] Updating AWS Secrets Manager...")
    rotate_in_aws(secret_name, username, new_password)

    print("[Step 3/6] Rotating password at the database level (ALTER USER)...")
    rotate_postgres_password(namespace, new_password)

    print("[Step 4/6] Updating Kubernetes Secret object...")
    update_kubernetes_secret(namespace, k8s_secret_name, username, new_password)

    print("[Step 5/6] Triggering rolling restart...")
    rolling_restart(namespace, deployment_name)
    wait_for_rollout(namespace, deployment_name)

    print("[Step 6/6] Verifying the new credential works at the application level...")
    success = verify_credential(namespace, deployment_name)

    print("\n" + "=" * 60)
    if success:
        print("Rotation complete. Credential verified at the application level.")
    else:
        print("Rotation incomplete. Readiness probe passed but credential verification failed.")
        print("Recommended action: force-restart all pods to flush the connection pool,")
        print("or investigate the database session timeout configuration.")
        sys.exit(1)


if __name__ == "__main__":
    import secrets as _secrets
    rotate(
        secret_name="myapp/db-credentials",
        new_password=_secrets.token_urlsafe(16),
        namespace="default",
        k8s_secret_name="db-credentials",
        deployment_name="myapp"
    )
</code></pre>
<h3 id="heading-how-the-script-works">How the Script Works</h3>
<p><code>get_current_secret</code> reads the current credential from AWS Secrets Manager so the script knows the username before it generates a new password.</p>
<p><code>rotate_in_aws</code> writes the new credential to Secrets Manager. It creates a new version rather than overwriting the old one, so you have a short window to roll back if something goes wrong.</p>
<p><code>_pg_password_literal</code> and <code>rotate_postgres_password</code> handle the step that most rotation scripts skip, which is actually changing the password inside PostgreSQL. This is done by running <code>ALTER USER appuser PASSWORD '...'</code> directly on the live PostgreSQL pod. Before this step, the database still accepts the old password. After this step, it does not.</p>
<p><code>update_kubernetes_secret</code> writes the new password into the Kubernetes Secret so that any new pod that starts will get the new credential from the beginning.</p>
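<p>As the docstring in the script points out, the values stored in a Kubernetes Secret are base64-encoded, not encrypted. A short sketch with a made-up password shows how little protection that gives:</p>

```python
import base64

# Hypothetical password, used only for this demonstration.
encoded = base64.b64encode("s3cr3t-password".encode()).decode()
decoded = base64.b64decode(encoded).decode()

print(encoded)  # what is stored in the Secret's data field
print(decoded)  # what anyone with read access can recover
```

<p>No key is involved in either direction, which is why access to Secret objects needs to be restricted separately.</p>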
<p><code>rolling_restart</code> and <code>wait_for_rollout</code> restart the application pods one at a time so the deployment stays available throughout. When this step completes, all pods are Running and their readiness probes have passed – but keep in mind that a passing readiness probe only means <code>/healthz</code> returned 200, which is exactly the problem this use case is about.</p>
<p><code>verify_credential</code> is the extra step Kubernetes never runs. It reaches inside the new pod and calls <code>/healthz/db</code>, which opens a real database connection with the credentials in the pod's current environment. If this passes, the rotation is genuinely complete. If this fails after the readiness probe passed, you have confirmed the gap: the pod looks healthy but can't serve database requests.</p>
<h3 id="heading-what-the-output-looks-like">What the Output Looks Like</h3>
<p>Successful rotation:</p>
<pre><code class="language-plaintext">[Step 1/6] Reading current secret from AWS Secrets Manager...
[Step 2/6] Updating AWS Secrets Manager...
  [AWS] Secret updated in Secrets Manager.
[Step 3/6] Rotating password at the database level (ALTER USER)...
  [DB]  Running ALTER USER on PostgreSQL...
  [DB]  Password changed at the database level.
        New connections now require the new password.
        Existing pool connections remain valid until they close.
[Step 4/6] Updating Kubernetes Secret object...
  [K8s] Kubernetes Secret 'db-credentials' updated.
[Step 5/6] Triggering rolling restart...
  [K8s] Rolling restart triggered for 'myapp'.
  [K8s] Waiting for rollout (timeout: 120s)...
  [K8s] Rollout complete. All pods report Ready.
[Step 6/6] Verifying the new credential works at the application level...
  [Verify] Running post-rotation credential check...
  [Verify] PASSED - Application confirmed it can authenticate with the new credential.

============================================================
Rotation complete. Credential verified at the application level.
  AWS Secrets Manager: updated
  PostgreSQL:          updated (ALTER USER)
  Kubernetes Secret:   updated
  Application pod:     restarted, authenticated
</code></pre>
<p>Before you break anything, confirm the pod is healthy:</p>
<pre><code class="language-plaintext">kubectl get pods
</code></pre>
<p>You should see <code>myapp</code> in Running state. That is the baseline: everything working as expected. Now let's break it.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/caabd892-562a-40f4-81c1-2c59fd15240b.png" alt="terminal screenshot showing output of 'kubectl get pods'" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-break-it-on-purpose">Break it On Purpose</h3>
<h4 id="heading-step-1-desync-the-db">Step 1: Desync the DB</h4>
<pre><code class="language-plaintext">./break_it.sh
</code></pre>
<p>This runs <code>ALTER USER</code> directly on PostgreSQL with a wrong password. The K8s Secret still has the old password, so the pod's environment and the database are now out of sync.</p>
<h4 id="heading-step-2-check-what-kubernetes-sees">Step 2: Check what Kubernetes sees</h4>
<pre><code class="language-plaintext">kubectl exec deployment/myapp -- curl -s http://localhost:8000/healthz
</code></pre>
<p>You will see <code>{"status":"ok"}</code>. The pod is still showing Ready in <code>kubectl get pods</code>. Kubernetes has no idea anything is wrong – that's the contract gap made visible in your terminal.</p>
<h4 id="heading-step-3-check-what-your-users-experience">Step 3: Check what your users experience</h4>
<pre><code class="language-plaintext">kubectl exec deployment/myapp -- curl -s http://localhost:8000/healthz/db
</code></pre>
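<p>The endpoint responds with a <code>503</code>. Because <code>curl -s</code> prints the response body rather than the status code, what you see is the JSON error detail. Fresh database connections are failing. Your users are already seeing this.</p>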
<h4 id="heading-step-4-see-the-mixed-pattern-optional">Step 4: See the mixed pattern (optional)</h4>
<pre><code class="language-plaintext">./load_test.sh
</code></pre>
<p>Some requests succeed because they hit old pool connections that were authenticated before the break. Some fail because they need a fresh connection. The pod looks healthy, but a share of your traffic is failing.</p>
<h4 id="heading-step-5-run-the-rotation-script">Step 5: Run the rotation script</h4>
<pre><code class="language-plaintext">python rotate_secret.py
</code></pre>
<p>This time, Step 6 catches the failure. Here's what you'll see:</p>
<pre><code class="language-plaintext">[Step 5/6] Triggering rolling restart...
  [K8s] Rollout complete. All pods report Ready.
[Step 6/6] Verifying the new credential works at the application level...
  [Verify] Running post-rotation credential check...
  [Verify] FAILED - Pod is Running but database authentication failed.
           The readiness probe validated HTTP reachability.
           The application cannot authenticate with the new credential.
           These are two different contracts. Only one was checked automatically.

============================================================
Rotation incomplete. Readiness probe passed but credential verification failed.
</code></pre>
<p>The pod is Running and shows Ready in <code>kubectl get pods</code>. The rotation script says the credential is broken. That's the contract gap visible in your terminal, caught before your users hit it.</p>
<p><strong>The lesson:</strong> <code>/healthz</code> tells you the HTTP server is alive. <code>/healthz/db</code> tells you the application can actually connect to the database. Kubernetes only checks the first one unless you add a database probe. The rotation script adds that check at the end of every rotation so you catch the failure before your users do.</p>
<h3 id="heading-the-decision-the-script-cant-make-for-you">The Decision the Script Can't Make For You</h3>
<p>The verification failed, the pod is Running, and requests to the database are failing. You have two options:</p>
<ol>
<li><p>force-restart all pods at once to flush the connection pool (which is faster but causes a brief capacity reduction), or</p>
</li>
<li><p>wait for old sessions to expire naturally (which avoids downtime but leaves requests failing intermittently until the pool cycles).</p>
</li>
</ol>
<p>The script found the problem, but deciding what to do next belongs to an engineer who knows the system.</p>
<h3 id="heading-teardown">Teardown</h3>
<pre><code class="language-plaintext">./teardown.sh
</code></pre>
<h2 id="heading-use-case-5-automated-canary-rollback-trigger">Use Case 5 - Automated Canary Rollback Trigger</h2>
<p><strong>Environment:</strong> Fully local – Kind, Prometheus via Helm<br><strong>Language:</strong> Bash</p>
<h3 id="heading-what-this-use-case-does-and-why-it-matters">What This Use Case Does and Why it Matters</h3>
<p>This use case runs a script that watches your new deployment and automatically rolls it back if something goes wrong, before your users flood your support queue.</p>
<p>This matters in production because, when you ship a new version, you don't send all traffic to it immediately. You send a small slice, say 20%, to the new version while 80% still goes to the old one. If the new version is broken, only 20% of users are affected and you can roll back before the damage spreads. But the rollback only works if you're watching the right things.</p>
<p><strong>The takeaway:</strong> Two scripts watch the same failing canary. One reports everything is fine. The other fires the rollback. The only difference is what they measure. Your automation is only as good as what it watches.</p>
<p><strong>What to watch for:</strong> <code>canary_watch_v1.sh</code> watches errors only and stays silent while the canary is slow. <code>canary_watch_v2.sh</code> watches errors AND response time and fires the rollback. The difference between them is the lesson.</p>
<p><strong>What this is not:</strong> This isn't a guide to canary deployments. It's about what your monitoring misses when it only watches one signal.</p>
<h3 id="heading-how-it-works">How it Works</h3>
<p>Three things run in the cluster: the stable app (three pods, handles most traffic), the canary app (one pod, handles a small slice), and Prometheus (collects response times and error counts from both every 15 seconds).</p>
<p>The watch script asks Prometheus every 15 seconds: <em>"Is the canary behaving normally?"</em> If the answer is no for three checks in a row, it rolls back the canary automatically.</p>
<p>The question is: what does <em>"behaving normally"</em> mean? That is the entire use case.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/204f992e-44c2-4404-a6ff-f2279ea23aeb.png" alt="terminal screenshot showing output result of 'kubectl get pods'" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-set-up-the-demo-environment">Set Up the Demo Environment</h3>
<p>Navigate to <code>05-canary-rollback/</code> and run:</p>
<pre><code class="language-plaintext">cd 05-canary-rollback
./setup.sh
</code></pre>
<p>Setup takes a few minutes. It installs Prometheus, deploys both versions of the demo app, and starts a load generator pod that sends continuous traffic to both so Prometheus always has data.</p>
<p>When setup finishes, confirm everything is running:</p>
<pre><code class="language-plaintext">kubectl get pods
</code></pre>
<p>You should see output like this:</p>
<pre><code class="language-plaintext">NAME                                                   READY   STATUS    RESTARTS   AGE
load-generator-68c59698b7-kws2l                        1/1     Running   0          4m54s
myapp-canary-6d6979c66f-g9lgw                          1/1     Running   0          32s
myapp-stable-6bcf994fc4-b4k9l                          1/1     Running   0          4m55s
myapp-stable-6bcf994fc4-ndhxc                          1/1     Running   0          4m55s
myapp-stable-6bcf994fc4-z97kx                          1/1     Running   0          4m55s
prometheus-kube-prometheus-operator-59b847d96c-mp72s   1/1     Running   0          5m58s
prometheus-prometheus-kube-prometheus-prometheus-0     2/2     Running   0          5m1s
</code></pre>
<p>Three stable pods, one canary pod, one load generator, Prometheus running. The lab is alive.</p>
<p><strong>Wait 60 seconds before running anything else.</strong> Prometheus needs time to scrape the first metrics from the pods. If you skip this, the watch scripts return empty data with no explanation.</p>
<h3 id="heading-three-terminal-windows">Three Terminal Windows</h3>
<p>You need three separate command prompts running at the same time.</p>
<p><strong>On macOS:</strong> open Terminal and press <code>Cmd+T</code> twice. You now have three tabs, each an independent terminal.<br><strong>On Linux:</strong> press <code>Ctrl+Shift+T</code> in most terminal apps, or right-click and choose "Open new tab."</p>
<p>Label them Terminal 1 for the watch script, Terminal 2 for injecting failures, Terminal 3 for watching latency.</p>
<h3 id="heading-the-scripts">The Scripts</h3>
<h4 id="heading-version-1-watches-errors-only-code-here">Version 1: watches errors only (<a href="https://github.com/Osomudeya/devops-scripting-labs.git">code here</a>)</h4>
<pre><code class="language-bash">#!/usr/bin/env bash
# canary_watch_v1.sh

PROMETHEUS="http://localhost:9090"
DEPLOYMENT="myapp-canary"
NAMESPACE="default"
ERROR_THRESHOLD="0.05"
CHECK_INTERVAL=15
STRIKE_LIMIT=3

strikes=0

echo "Canary monitor running (v1 - error rate only)."
echo "Rollback triggers if error rate exceeds ${ERROR_THRESHOLD} for ${STRIKE_LIMIT} checks."
echo ""

while true; do
    ts=$(date '+%Y-%m-%dT%H:%M:%S')

    error_query='sum(rate(http_requests_total{app="myapp-canary",status=~"5.."}[1m])) / sum(rate(http_requests_total{app="myapp-canary"}[1m]))'

    error_rate=$(curl -sf "${PROMETHEUS}/api/v1/query" \
        --data-urlencode "query=${error_query}" | \
        python3 -c "
import sys, json
d = json.load(sys.stdin)
result = d['data']['result']
print(result[0]['value'][1] if result else '0')
" 2&gt;/dev/null)

    error_rate=${error_rate:-0}
    above=$(echo "$error_rate &gt; $ERROR_THRESHOLD" | bc -l)

    echo "[$ts] error_rate=${error_rate} | threshold=${ERROR_THRESHOLD} | breach=$([ "$above" = "1" ] &amp;&amp; echo YES || echo NO)"

    if [ "$above" = "1" ]; then
        strikes=$((strikes + 1))
        echo "  Strike ${strikes}/${STRIKE_LIMIT}"
        if [ "$strikes" -ge "$STRIKE_LIMIT" ]; then
            echo "  ROLLBACK TRIGGERED"
            kubectl rollout undo deployment/"${DEPLOYMENT}" -n "${NAMESPACE}"
            exit 0
        fi
    else
        strikes=0
    fi

    sleep "${CHECK_INTERVAL}"
done
</code></pre>
<h4 id="heading-version-2-watches-error-rate-and-response-time">Version 2: watches error rate AND response time</h4>
<pre><code class="language-bash">#!/usr/bin/env bash
# canary_watch_v2.sh

PROMETHEUS="http://localhost:9090"
DEPLOYMENT="myapp-canary"
NAMESPACE="default"
ERROR_THRESHOLD="0.05"
LATENCY_THRESHOLD="2.0"
CHECK_INTERVAL=15
STRIKE_LIMIT=3

strikes=0

echo "Canary monitor running (v2 - error rate + P99 latency)."
echo "Error threshold: ${ERROR_THRESHOLD} | Latency P99 threshold: ${LATENCY_THRESHOLD}s"
echo ""

while true; do
    ts=$(date '+%Y-%m-%dT%H:%M:%S')

    error_query='sum(rate(http_requests_total{app="myapp-canary",status=~"5.."}[1m])) / sum(rate(http_requests_total{app="myapp-canary"}[1m]))'
    error_rate=$(curl -sf "${PROMETHEUS}/api/v1/query" \
        --data-urlencode "query=${error_query}" | \
        python3 -c "
import sys, json
d = json.load(sys.stdin)
result = d['data']['result']
print(result[0]['value'][1] if result else '0')
" 2&gt;/dev/null)

    latency_query='histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app="myapp-canary"}[1m])) by (le))'
    latency=$(curl -sf "${PROMETHEUS}/api/v1/query" \
        --data-urlencode "query=${latency_query}" | \
        python3 -c "
import sys, json
d = json.load(sys.stdin)
result = d['data']['result']
print(result[0]['value'][1] if result else '0')
" 2&gt;/dev/null)

    error_rate=${error_rate:-0}
    latency=${latency:-0}

    error_breach=$(echo "$error_rate &gt; $ERROR_THRESHOLD" | bc -l)
    latency_breach=$(echo "$latency &gt; $LATENCY_THRESHOLD" | bc -l)

    triggered_by=""
    [ "$error_breach" = "1" ] &amp;&amp; triggered_by="error_rate(${error_rate})"
    [ "$latency_breach" = "1" ] &amp;&amp; triggered_by="${triggered_by:+${triggered_by}, }latency_p99(${latency}s)"

    echo "[$ts] error_rate=${error_rate} | latency_p99=${latency}s | breach=${triggered_by:-none}"

    if [ "$error_breach" = "1" ] || [ "$latency_breach" = "1" ]; then
        strikes=$((strikes + 1))
        echo "  Strike ${strikes}/${STRIKE_LIMIT} | Triggered by: ${triggered_by}"
        if [ "$strikes" -ge "$STRIKE_LIMIT" ]; then
            echo ""
            echo "  ROLLBACK TRIGGERED"
            echo "  Signal: ${triggered_by}"
            kubectl rollout undo deployment/"${DEPLOYMENT}" -n "${NAMESPACE}"
            exit 0
        fi
    else
        strikes=0
    fi

    sleep "${CHECK_INTERVAL}"
done
</code></pre>
<h3 id="heading-how-the-scripts-work">How the Scripts Work</h3>
<p>The <code>error rate query</code> asks Prometheus: <em>"What fraction of requests to the canary returned an error in the last minute?"</em> A result of <code>0.0</code> means no errors. A result of <code>0.06</code> means 6% of requests are failing, above the 5% threshold. You see this in the output as:</p>
<pre><code class="language-plaintext">error_rate=0.06 | threshold=0.05 | breach=YES
</code></pre>
<p>The <code>latency query</code> asks: <em>"How slow is the slowest 1% of requests to the canary right now?"</em> A result of <code>5.234</code> means 1 in every 100 requests is taking over 5 seconds. You see this as:</p>
<pre><code class="language-plaintext">latency_p99=5.234s | breach=latency_p99(5.234s)
</code></pre>
<p>V1 only runs the first query. V2 runs both. Same canary, same problem, different answers.</p>
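<p>Both scripts pull a single number out of the Prometheus response with a short inline Python snippet. The same extraction can be written as a standalone function, sketched here with hand-written sample payloads in the shape the instant-query endpoint returns. The function name <code>extract_scalar</code> is hypothetical, and treating a <code>NaN</code> result (which a ratio can produce when there is no traffic) as the default is a design assumption, not something the original scripts do.</p>

```python
import json
import math


def extract_scalar(payload, default=0.0):
    """Pull the first sample value out of an instant-query response."""
    result = payload["data"]["result"]
    if not result:
        return default  # no series matched, e.g. nothing scraped yet
    value = float(result[0]["value"][1])  # values arrive as strings
    return default if math.isnan(value) else value


sample = json.loads(
    '{"status": "success", "data": {"resultType": "vector", '
    '"result": [{"metric": {}, "value": [1747482794.0, "0.06"]}]}}'
)
empty = {"status": "success", "data": {"resultType": "vector", "result": []}}
nan = {"status": "success", "data": {"resultType": "vector",
       "result": [{"metric": {}, "value": [1747482794.0, "NaN"]}]}}

print(extract_scalar(sample), extract_scalar(empty), extract_scalar(nan))
```

<p>Whether missing data should count as healthy or as a breach is a judgment call. Defaulting to zero, as the watch scripts also do for empty results, means a canary with no traffic never triggers a rollback.</p>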
<p>The three-strike rule means a single bad check doesn't trigger a rollback – three in a row does. The tradeoff is exposure: with checks 15 seconds apart, the rollback fires roughly 30 to 45 seconds after the metric first crosses its threshold, depending on where in the interval the breach begins.</p>
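<p>The strike counter is simple enough to model in a few lines. This sketch mirrors the reset-on-clean-check logic of the Bash loop. The function name <code>first_rollback_check</code> and the sample sequences are made up for illustration.</p>

```python
def first_rollback_check(breaches, strike_limit=3):
    """Return the 1-based check number at which rollback fires, or None."""
    strikes = 0
    for check_number, breached in enumerate(breaches, start=1):
        strikes = strikes + 1 if breached else 0  # any clean check resets
        if strikes >= strike_limit:
            return check_number
    return None


# A single clean check in the middle resets the count.
print(first_rollback_check([True, False, True, True, True]))
# Two strikes, a reset, then one strike: the rollback never fires.
print(first_rollback_check([True, True, False, True]))
```

<p>The second sequence shows the cost of the rule: a canary that breaches intermittently can stay deployed indefinitely.</p>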
<p>When three strikes hit, the watch script itself runs:</p>
<pre><code class="language-plaintext">kubectl rollout undo deployment/myapp-canary -n default
</code></pre>
<p>That one line is what triggers the rollback. It lives inside both watch scripts and runs automatically – you don't have to do anything. The script detects, decides, and acts.</p>
<h3 id="heading-break-it-on-purpose">Break it On Purpose</h3>
<p><strong>In Terminal 1</strong>, start the v1 monitor:</p>
<pre><code class="language-plaintext">./canary_watch_v1.sh
</code></pre>
<p>You will see this repeating every 15 seconds:</p>
<pre><code class="language-plaintext">Canary monitor running (v1 - error rate only).
Rollback triggers if error rate exceeds 0.05 for 3 checks.

[2026-05-17T11:53:12] error_rate=0 | threshold=0.05 | breach=NO
[2026-05-17T11:53:27] error_rate=0 | threshold=0.05 | breach=NO
[2026-05-17T11:53:42] error_rate=0 | threshold=0.05 | breach=NO
</code></pre>
<p><code>breach=NO</code> means the canary looks healthy. Leave this running and move to Terminal 2.</p>
<p><strong>In Terminal 2</strong>, inject latency into the canary:</p>
<pre><code class="language-plaintext">./break_it.sh
</code></pre>
<p>This makes every request to the canary take 5 seconds. Requests still return 200 – no errors, just slowness. You will see:</p>
<pre><code class="language-plaintext">Injecting latency into the canary deployment...
deployment "myapp-canary" successfully rolled out
Latency injection is active.

The canary pod is Running and passing its readiness probe.
Every request to the canary now takes 5 seconds.
Error rate: 0%   |   P99 latency: ~5s
</code></pre>
<p>Now look back at Terminal 1. The v1 monitor keeps printing <code>breach=NO</code>. The canary is taking 5 seconds per request and your monitoring says everything is fine. That's the failure.</p>
<p><strong>In Terminal 3</strong>, see what your users are actually experiencing:</p>
<pre><code class="language-plaintext">./check_latency.sh
</code></pre>
<pre><code class="language-plaintext">TIMESTAMP                   STABLE (ms)   CANARY (ms)   STATUS
---------                   -----------   -----------   ------
2026-05-17T11:55:14         18ms          5008ms        CANARY DEGRADED
2026-05-17T11:55:20         7ms           5008ms        CANARY DEGRADED
2026-05-17T11:55:27         6ms           5008ms        CANARY DEGRADED
</code></pre>
<p>Stable is responding in 6–18 milliseconds. Canary is taking over 5 seconds. Users on the canary are waiting 5 seconds for every page load. The v1 monitor in Terminal 1 still says <code>breach=NO</code>.</p>
<p>This is the lesson: the monitoring and the user experience are completely disconnected. The script isn't broken. It's watching the wrong thing.</p>
<p>Now let's see the fix. Press <code>Ctrl+C</code> in Terminal 1 to stop v1. Start v2 in the same terminal:</p>
<pre><code class="language-plaintext">./canary_watch_v2.sh
</code></pre>
<p>In Terminal 2, re-inject the latency:</p>
<pre><code class="language-plaintext">./break_it.sh
</code></pre>
<p>Watch Terminal 1. V2 catches the latency and fires the rollback after three strikes:</p>
<pre><code class="language-plaintext">Canary monitor running (v2 - error rate + P99 latency).
Error threshold: 0.05 | Latency P99 threshold: 2.0s

[2026-05-15T14:30:00] error_rate=0.0 | latency_p99=0.082s | breach=none
[2026-05-15T14:30:15] error_rate=0.0 | latency_p99=5.234s | breach=latency_p99(5.234s)
  Strike 1/3 | Triggered by: latency_p99(5.234s)
[2026-05-15T14:30:30] error_rate=0.0 | latency_p99=5.891s | breach=latency_p99(5.891s)
  Strike 2/3 | Triggered by: latency_p99(5.891s)
[2026-05-15T14:30:45] error_rate=0.0 | latency_p99=6.102s | breach=latency_p99(6.102s)
  Strike 3/3 | Triggered by: latency_p99(6.102s)

  ROLLBACK TRIGGERED
  Signal: latency_p99(6.102s)

deployment.apps/myapp-canary rolled back
</code></pre>
<p>The error rate never moved from 0. V2 rolled back anyway because latency crossed the threshold. That's the difference one extra measurement makes.</p>
<p>After the rollback, confirm the canary is dormant but not deleted:</p>
<pre><code class="language-plaintext">kubectl rollout history deployment/myapp-canary -n default
</code></pre>
<pre><code class="language-plaintext">REVISION  CHANGE-CAUSE
1         &lt;none&gt;
2         &lt;none&gt;
</code></pre>
<p>Two revisions. The rollback scaled revision 2 down to zero and restored revision 1. Nothing was deleted, and you can re-deploy if you decide the rollback was a false alarm.</p>
<h3 id="heading-the-decision-the-script-cant-make-for-you">The Decision the Script Can't Make For You</h3>
<p>V2 rolled back based on latency with zero errors. Before re-deploying, ask whether the latency was a real regression in the new code or a temporary spike, such as a database cache warming up on first use. Both produce the same signal. Only you know which is more likely given what changed.</p>
<p>False positive rollbacks slow down deployments and erode confidence in automation. The right thresholds depend on your users and your system. The script enforces whatever you configure.</p>
<h3 id="heading-teardown">Teardown</h3>
<pre><code class="language-plaintext">./teardown.sh
</code></pre>
<h2 id="heading-what-you-can-do-now">What You Can Do Now</h2>
<p>Each use case in this handbook was a script solving a specific problem the standard tooling wasn't catching. Here's where you land:</p>
<p>You can catch AWS cost spikes before the invoice and you know that the service label is AWS's attribution, not a pointer to what actually caused the cost. Start from what changed operationally, not from the billing label.</p>
<p>You can reconstruct the full timeline of any failed request across multiple services from a single trace ID, and you know that a missing service in that timeline is evidence, not just an absence.</p>
<p>You can detect infrastructure drift by comparing what Terraform believes against what AWS actually contains, and you know that a clean result means the resources Terraform manages are in sync, not that your entire AWS account is clean.</p>
<p>You can validate a secret rotation at the application level, not just at the infrastructure level, and you know the difference between a readiness probe passing and the application actually being able to connect to the database.</p>
<p>You can build a canary rollback trigger that watches the right signals, and you know why watching only error rates can leave a slow, broken deployment running while users wait.</p>
<p>The pattern across all five use cases is the same: the standard tooling reported everything as fine while something was actually broken. The cost script returned clean, the pod showed Running, and the canary showed zero errors – not because the tools were wrong but because they were only checking what was easy to check. These scripts check what the standard tooling skips.</p>
<p><strong>GitHub repo:</strong> <a href="https://github.com/Osomudeya/devops-scripting-labs.git">https://github.com/Osomudeya/devops-scripting-labs</a></p>
<p>I write about DevOps weekly, covering real systems, real incidents, and interview and CV tips – <a href="https://osomudeya.kit.com/23db7ca59f"><strong>Join the newsletter</strong></a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Clean Time Series Data in Python ]]>
                </title>
                <description>
                    <![CDATA[ Real-world time series data is rarely clean. Sensors drop out, systems clock-drift, pipelines duplicate records, and manual data entry introduces mistakes. By the time a dataset reaches your notebook, ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-clean-time-series-data-in-python/</link>
                <guid isPermaLink="false">6a0ad57ee4a28cf570ec90ac</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Bala Priya C ]]>
                </dc:creator>
                <pubDate>Mon, 18 May 2026 09:01:50 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/bf717910-4e75-44c5-8ea1-fd55eb574100.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Real-world time series data is rarely clean. Sensors drop out, systems clock-drift, pipelines duplicate records, and manual data entry introduces mistakes. By the time a dataset reaches your notebook, it has passed through collection, transmission, and storage, each step a potential source of corruption.</p>
<p>Cleaning time series data is harder than cleaning tabular data because time is a structural constraint. You can't shuffle rows or impute a missing value with a column mean without pulling future data into a past observation. Every cleaning decision has to respect temporal ordering, or it breaks the integrity of everything built on top of it.</p>
<p>This guide walks through the full cleaning pipeline in Python: from raw data arrival to a dataset ready for feature engineering or modelling. We'll cover missing value detection and imputation, outlier identification and treatment, duplicate handling, frequency alignment, noise smoothing, and schema validation, applied to sample sensor data throughout.</p>
<p><a href="https://github.com/balapriyac/data-science-tutorials/blob/main/time-series-data-cleaning/time_series_data_cleaning.ipynb">You can get the Colab notebook from GitHub and follow along</a>.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along with this guide, you should be:</p>
<ul>
<li><p>Comfortable working with Python and pandas DataFrames</p>
</li>
<li><p>Familiar with time-indexed data</p>
</li>
<li><p>Aware of what feature engineering and machine learning modelling involve at a high level</p>
</li>
</ul>
<p>We'll use <code>pandas</code> and <code>numpy</code> for data manipulation, <code>scipy</code> for signal smoothing and statistical tests, <code>scikit-learn</code> for anomaly detection, and <code>statsmodels</code> for seasonal decomposition. Install them before running any code in this guide:</p>
<pre><code class="language-bash">pip install pandas numpy scipy scikit-learn statsmodels
</code></pre>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-how-to-audit-your-time-series-before-cleaning-it">How to Audit Your Time Series Before Cleaning It</a></p>
</li>
<li><p><a href="#heading-how-to-reindex-to-a-canonical-frequency">How to Reindex to a Canonical Frequency</a></p>
</li>
<li><p><a href="#heading-how-to-handle-missing-values">How to Handle Missing Values</a></p>
<ul>
<li><p><a href="#heading-forward-fill-for-step-function-signals">Forward Fill — For Step-Function Signals</a></p>
</li>
<li><p><a href="#heading-time-weighted-interpolation-for-continuous-signals">Time-Weighted Interpolation — For Continuous Signals</a></p>
</li>
<li><p><a href="#heading-seasonal-decomposition-imputation-for-long-gaps">Seasonal Decomposition Imputation — For Long Gaps</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-detect-and-handle-outliers">How to Detect and Handle Outliers</a></p>
<ul>
<li><p><a href="#heading-z-score-with-rolling-window">Z-Score with Rolling Window</a></p>
</li>
<li><p><a href="#heading-iqr-based-outlier-detection">IQR-Based Outlier Detection</a></p>
</li>
<li><p><a href="#heading-isolation-forest-for-multivariate-outlier-detection">Isolation Forest — For Multivariate Outlier Detection</a></p>
</li>
<li><p><a href="#heading-outlier-treatment">Outlier Treatment</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-remove-duplicates">How to Remove Duplicates</a></p>
</li>
<li><p><a href="#heading-frequency-alignment-and-resampling">Frequency Alignment and Resampling</a></p>
</li>
<li><p><a href="#heading-smoothing-noise">Smoothing Noise</a></p>
<ul>
<li><p><a href="#heading-exponential-weighted-moving-average">Exponential Weighted Moving Average</a></p>
</li>
<li><p><a href="#heading-savitzky-golay-filter">Savitzky-Golay Filter</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-schema-and-sanity-validation">Schema and Sanity Validation</a></p>
</li>
<li><p><a href="#heading-the-complete-cleaning-checklist">The Complete Cleaning Checklist</a></p>
</li>
</ul>
<h2 id="heading-how-to-audit-your-time-series-before-cleaning-it">How to Audit Your Time Series Before Cleaning It</h2>
<p>The first rule of data cleaning is: look before you cut. Before imputing, smoothing, or dropping anything, you need a complete picture of what's wrong and where.</p>
<p>A good audit covers the following:</p>
<ul>
<li><p>The time index: Is it regular? Are there gaps?</p>
</li>
<li><p>Missing value distribution: Are missing values random or clustered?</p>
</li>
<li><p>Value range: Are there obvious spikes, dips, or sensor failures?</p>
</li>
<li><p>Duplicate timestamps</p>
</li>
</ul>
<p>Let's spin up a sample dataset (with some of the above problems):</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Simulate one week of smart grid voltage readings (hourly)
# with realistic problems injected
periods = 168
index = pd.date_range("2024-06-01", periods=periods, freq="h")  # "H" is deprecated in pandas 2.2+

voltage = (
    230.0
    + 3.5 * np.sin(2 * np.pi * np.arange(periods) / 24)
    + np.random.normal(0, 1.2, periods)
)

# Inject problems
voltage[14:17] = np.nan          # sensor dropout: 3 consecutive missing
voltage[42] = np.nan             # isolated missing
voltage[78] = 312.4              # spike outlier
voltage[101:104] = np.nan        # another dropout
voltage[130] = 187.2             # dip outlier

series = pd.Series(voltage, index=index, name="voltage_v")

# --- Audit ---
print("=== TIME SERIES AUDIT ===")
print(f"Period:        {series.index.min()} → {series.index.max()}")
print(f"Observations:  {len(series)}")
print(f"Expected freq: {pd.infer_freq(series.index)}")
print(f"\nMissing values: {series.isna().sum()} ({series.isna().mean()*100:.1f}%)")
print(f"Value range:    [{series.min():.2f}, {series.max():.2f}]")
print(f"Mean ± Std:     {series.mean():.2f} ± {series.std():.2f}")

# Identify consecutive missing runs
missing_mask = series.isna()
missing_runs = []
run_start = None
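for i, (ts, is_missing) in enumerate(missing_mask.items()):
    if is_missing and run_start is None:
        run_start = ts
    elif not is_missing and run_start is not None:
        missing_runs.append((run_start, missing_mask.index[i - 1]))
        run_start = None
# Close a run that extends to the end of the series
if run_start is not None:
    missing_runs.append((run_start, missing_mask.index[-1]))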

print(f"\nMissing runs ({len(missing_runs)} total):")
for start, end in missing_runs:
    print(f"  {start} → {end}")
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">=== TIME SERIES AUDIT ===
Period:        2024-06-01 00:00:00 → 2024-06-07 23:00:00
Observations:  168
Expected freq: h

Missing values: 7 (4.2%)
Value range:    [187.20, 312.40]
Mean ± Std:     230.22 ± 7.81

Missing runs (3 total):
  2024-06-01 14:00:00 → 2024-06-01 16:00:00
  2024-06-02 18:00:00 → 2024-06-02 18:00:00
  2024-06-05 05:00:00 → 2024-06-05 07:00:00
</code></pre>
<p>This audit gives you a map of the damage before you start cleaning. The key task is distinguishing between <strong>isolated missing values</strong>, which are imputable with local context, and <strong>long missing runs</strong>, which may need a different strategy or flagging for downstream consumers.</p>
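<p>One way to make that distinction explicit is to measure each run's length and label it. The sketch below uses a small synthetic series, and the two-step cutoff (<code>MAX_IMPUTABLE</code>) is a hypothetical value chosen for illustration. Choose your own based on how quickly the signal changes.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

index = pd.date_range("2024-06-01", periods=12, freq="h")
values = [1.0, np.nan, 3.0, 4.0, np.nan, np.nan, np.nan, 8.0, 9.0, 10.0, np.nan, np.nan]
demo = pd.Series(values, index=index)

is_missing = demo.isna()
# A new run starts wherever the missing flag changes value
run_id = is_missing.ne(is_missing.shift()).cumsum()

runs = (
    demo.index.to_series()[is_missing]
    .groupby(run_id[is_missing])
    .agg(start="min", end="max", length="count")
    .reset_index(drop=True)
)

MAX_IMPUTABLE = 2  # hypothetical cutoff, in consecutive missing steps
runs["strategy"] = np.where(runs["length"].le(MAX_IMPUTABLE), "interpolate", "flag")
print(runs)
</code></pre>
<p>Because the grouping is vectorized, a run that reaches the end of the series is captured as well.</p>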
<h2 id="heading-how-to-reindex-to-a-canonical-frequency">How to Reindex to a Canonical Frequency</h2>
<p>Before imputing missing values, you need to confirm your time index is actually <em>regular</em>. A common problem in ingested time series is that missing timestamps are simply absent rather than represented as <code>NaN</code> rows — which means a <code>.fillna()</code> call will never find them.</p>
<pre><code class="language-python"># Simulate a sensor feed with missing timestamps (not just missing values)
irregular_index = index.delete([14, 15, 16, 42, 101, 102, 103])
irregular_series = series.dropna().reindex(irregular_index)

print(f"Original length:   {len(series)}")
print(f"Irregular length:  {len(irregular_series)}")
print(f"Inferred freq:     {pd.infer_freq(irregular_series.index)}")  # None = irregular

# Reindex to the full canonical hourly grid
canonical_index = pd.date_range(
    start=irregular_series.index.min(),
    end=irregular_series.index.max(),
    freq="h"
)

reindexed = irregular_series.reindex(canonical_index)

print(f"\nAfter reindex:")
print(f"Length:         {len(reindexed)}")
print(f"Missing values: {reindexed.isna().sum()}")
print(f"Inferred freq:  {pd.infer_freq(reindexed.index)}")
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">Original length:   168
Irregular length:  161
Inferred freq:     None

After reindex:
Length:         168
Missing values: 7
Inferred freq:  h
</code></pre>
<p><code>pd.infer_freq</code> returning <code>None</code> is your signal that the index has gaps. After reindexing to the canonical grid, missing timestamps become explicit <code>NaN</code> rows, and now your imputation logic can find them.</p>
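<p>If you also want to know <em>which</em> timestamps are absent, you can compare the received index against the expected grid. This is a minimal sketch on a synthetic eight-hour index. The variable names are illustrative only.</p>
<pre><code class="language-python">import pandas as pd

full = pd.date_range("2024-06-01", periods=8, freq="h")
received = full.delete([2, 3, 6])   # timestamps that actually arrived

# Timestamps on the expected grid that never arrived
absent = full.difference(received)
print(absent)

# Gaps also show up in the spacing between consecutive timestamps
spacing = received.to_series().diff()
gaps = spacing[spacing.gt(pd.Timedelta(hours=1))]
print(gaps)
</code></pre>
<p>The first approach needs a known target frequency. The second works even when you're still deciding what the canonical frequency should be.</p>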
<h2 id="heading-how-to-handle-missing-values">How to Handle Missing Values</h2>
<p>Not all missing values should be handled the same way. A single isolated missing reading in a smooth signal is best filled with interpolation. A 3-hour sensor dropout in a volatile signal, however, might be better flagged than fabricated. Strategy should match both gap length and signal behavior.</p>
<h3 id="heading-forward-fill-for-step-function-signals">Forward Fill — For Step-Function Signals</h3>
<p>Forward fill is appropriate when the variable holds its last known value until something changes it — a machine state, a setpoint, a categorical flag.</p>
<pre><code class="language-python"># Equipment operating mode — a step signal
mode_data = pd.Series(
    ["running", "running", np.nan, np.nan, "idle", "idle", np.nan, "running"],
    index=pd.date_range("2024-06-01", periods=8, freq="h"),
    name="operating_mode"
)

filled_mode = mode_data.ffill()
print(pd.DataFrame({"original": mode_data, "ffill": filled_mode}))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">                    original    ffill
2024-06-01 00:00:00  running  running
2024-06-01 01:00:00  running  running
2024-06-01 02:00:00      NaN  running
2024-06-01 03:00:00      NaN  running
2024-06-01 04:00:00     idle     idle
2024-06-01 05:00:00     idle     idle
2024-06-01 06:00:00      NaN     idle
2024-06-01 07:00:00  running  running
</code></pre>
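<p>Forward fill will happily carry a state across a gap of any length. If a stale state is worse than a missing one, you can cap how far the last value travels with the <code>limit</code> parameter. The two-step limit below is an assumed value for illustration.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

mode = pd.Series(
    ["running", np.nan, np.nan, np.nan, "idle"],
    index=pd.date_range("2024-06-01", periods=5, freq="h"),
    name="operating_mode",
)

# Carry the last state forward for at most two steps (assumed staleness limit)
capped = mode.ffill(limit=2)
print(capped)
</code></pre>
<p>The third missing hour stays <code>NaN</code>, so downstream code can see that the state was unknown at that point.</p>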
<h3 id="heading-time-weighted-interpolation-for-continuous-signals">Time-Weighted Interpolation — For Continuous Signals</h3>
<p>For continuous sensor readings, linear interpolation weighted by time handles irregular gaps correctly because it doesn't assume equal spacing.</p>
<pre><code class="language-python"># Fill the voltage series using time-based interpolation
voltage_clean = reindexed.interpolate(method="time")

# Compare original vs filled around the first gap
gap_window = voltage_clean["2024-06-01 12:00":"2024-06-01 18:00"]
original_window = reindexed["2024-06-01 12:00":"2024-06-01 18:00"]

comparison = pd.DataFrame({
    "original":     original_window,
    "interpolated": gap_window.round(3),
    "was_missing":  original_window.isna(),
})
print(comparison)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">                       original  interpolated  was_missing
2024-06-01 12:00:00  230.290355       230.290        False
2024-06-01 13:00:00  226.798197       226.798        False
2024-06-01 14:00:00         NaN       226.848         True
2024-06-01 15:00:00         NaN       226.897         True
2024-06-01 16:00:00         NaN       226.947         True
2024-06-01 17:00:00  226.996356       226.996        False
2024-06-01 18:00:00  225.410371       225.410        False
</code></pre>
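<p>On a regular hourly grid, time-based and position-based interpolation give the same answer. The difference only appears when spacing is uneven, as in this small synthetic example:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Readings at 00:00 and 04:00, with a missing value at 01:00
uneven = pd.Series(
    [0.0, np.nan, 8.0],
    index=pd.to_datetime(["2024-06-01 00:00", "2024-06-01 01:00", "2024-06-01 04:00"]),
)

by_position = uneven.interpolate(method="linear")  # treats rows as equally spaced
by_time = uneven.interpolate(method="time")        # weights by elapsed time

print(by_position.iloc[1])  # 4.0, halfway between the neighbours
print(by_time.iloc[1])      # 2.0, one quarter of the way through the 4-hour gap
</code></pre>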
<h3 id="heading-seasonal-decomposition-imputation-for-long-gaps">Seasonal Decomposition Imputation — For Long Gaps</h3>
<p>For gaps longer than a few steps in a seasonal signal, interpolating across the gap ignores the seasonal pattern. A better approach is to decompose the series, impute each component separately, then reconstruct.</p>
<pre><code class="language-python">from statsmodels.tsa.seasonal import seasonal_decompose

# Use a longer series for decomposition (needs enough periods)
long_voltage = pd.Series(
    230.0
    + 3.5 * np.sin(2 * np.pi * np.arange(336) / 24)
    + np.random.normal(0, 1.0, 336),
    index=pd.date_range("2024-06-01", periods=336, freq="h")
)

# Inject a 6-hour gap
long_voltage.iloc[100:106] = np.nan

# Interpolate first to give decompose a complete series to work with
temp_filled = long_voltage.interpolate(method="time")
decomp = seasonal_decompose(temp_filled, model="additive", period=24)

# Reconstruct: trend + seasonal + zero residual for missing positions
reconstructed = long_voltage.copy()
missing_idx = long_voltage[long_voltage.isna()].index
reconstructed[missing_idx] = (
    decomp.trend.ffill()[missing_idx]  # fillna(method="ffill") is deprecated
    + decomp.seasonal[missing_idx]
)

print(f"Missing before: {long_voltage.isna().sum()}")
print(f"Missing after:  {reconstructed.isna().sum()}")
print("\nFilled values at gap:")
print(reconstructed[missing_idx].round(3))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">Missing before: 6
Missing after:  0
</code></pre>
<p>The script also prints the six filled values, which vary from run to run because the noise is random. The seasonal decomposition imputation respects the time-of-day pattern: the filled values follow the expected daily curve rather than a straight line across the gap.</p>
<h2 id="heading-how-to-detect-and-handle-outliers">How to Detect and Handle Outliers</h2>
<p>Outliers in time series are trickier than in tabular data because context matters. For example, an unusually high or low voltage might be a sensor spike or a genuine grid event. You need methods that use <em>temporal context</em>, not just global statistics.</p>
<h3 id="heading-z-score-with-rolling-window">Z-Score with Rolling Window</h3>
<p>A global Z-score misses local anomalies in non-stationary series. A rolling Z-score flags values that are unusual <em>relative to their local neighbourhood</em>.</p>
<p><strong>Note</strong>: A <strong>non-stationary series</strong> is a time series whose statistical properties—such as mean, variance, or trend—change over time instead of remaining constant.</p>
<pre><code class="language-python">window = 24  # 24-hour rolling window

roll_mean = voltage_clean.rolling(window, center=True, min_periods=1).mean()
roll_std  = voltage_clean.rolling(window, center=True, min_periods=1).std()

rolling_z = (voltage_clean - roll_mean) / roll_std

threshold = 3.0
outliers_z = rolling_z[rolling_z.abs() &gt; threshold]

print(f"Rolling Z-score outliers detected: {len(outliers_z)}")
print(outliers_z.round(3))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">Rolling Z-score outliers detected: 2
2024-06-04 06:00:00    4.646
2024-06-06 10:00:00   -4.484
Name: voltage_v, dtype: float64
</code></pre>
<p>Z-score outlier detection works best for approximately Gaussian (normal) distributions because it assumes the data is centered around a mean with symmetric spread measured by standard deviation.</p>
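<p>To see why the rolling version matters on a non-stationary series, here's a small synthetic sketch: a noise-free rising trend with one local spike. The series and the 40-unit spike are made up for illustration.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# A steadily rising signal with one local spike (synthetic, no noise)
trend = pd.Series(
    np.arange(200, dtype=float),
    index=pd.date_range("2024-06-01", periods=200, freq="h"),
)
trend.iloc[50] += 40.0

global_z = (trend - trend.mean()) / trend.std()

local_mean = trend.rolling(24, center=True, min_periods=1).mean()
local_std = trend.rolling(24, center=True, min_periods=1).std()
local_z = (trend - local_mean) / local_std

print("Global flags: ", int(global_z.abs().gt(3.0).sum()))
print("Rolling flags:", int(local_z.abs().gt(3.0).sum()))
</code></pre>
<p>The spike is large relative to its neighbours but unremarkable relative to the full range of the series, so only the rolling score flags it.</p>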
<h3 id="heading-iqr-based-outlier-detection">IQR-Based Outlier Detection</h3>
<p>The interquartile range (IQR) method is more robust for detecting outliers in non-Gaussian distributions. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1), representing the spread of the middle 50% of the data.</p>
<pre><code class="language-python">Q1 = voltage_clean.quantile(0.25)
Q3 = voltage_clean.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_iqr = voltage_clean[
    (voltage_clean &lt; lower_bound) | (voltage_clean &gt; upper_bound)
]

print(f"IQR bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
print(f"Outliers detected: {len(outliers_iqr)}")
print(outliers_iqr.round(2))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">IQR bounds: [220.16, 239.46]
Outliers detected: 2
2024-06-04 06:00:00    312.4
2024-06-06 10:00:00    187.2
Name: voltage_v, dtype: float64
</code></pre>
<h3 id="heading-isolation-forest-for-multivariate-outlier-detection">Isolation Forest — For Multivariate Outlier Detection</h3>
<p>When you have multiple sensors, an isolated reading on one channel might look normal, but its combination with readings from other channels reveals the anomaly. Isolation Forest handles this naturally.</p>
<pre><code class="language-python">from sklearn.ensemble import IsolationForest

# Build a multi-sensor DataFrame
np.random.seed(42)
n = 200

sensor_df = pd.DataFrame({
    "voltage_v":    230 + 3 * np.sin(2 * np.pi * np.arange(n) / 24) + np.random.normal(0, 1, n),
    "current_a":    15  + 0.8 * np.sin(2 * np.pi * np.arange(n) / 24) + np.random.normal(0, 0.3, n),
    "frequency_hz": 50  + np.random.normal(0, 0.05, n),
}, index=pd.date_range("2024-06-01", periods=n, freq="h"))

# Inject a multivariate anomaly — voltage drops, current spikes together
sensor_df.iloc[88, 0] = 194.2   # voltage dip
sensor_df.iloc[88, 1] = 28.7    # current surge (consistent with fault)

clf = IsolationForest(contamination=0.02, random_state=42)
# fit_predict returns labels, not scores: -1 for anomalies, 1 for normal points
sensor_df["anomaly_score"] = clf.fit_predict(sensor_df[["voltage_v", "current_a", "frequency_hz"]])

anomalies = sensor_df[sensor_df["anomaly_score"] == -1]
print(f"Anomalies detected: {len(anomalies)}")
print(anomalies[["voltage_v", "current_a", "frequency_hz"]].round(2))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">Anomalies detected: 4
                     voltage_v  current_a  frequency_hz
2024-06-02 07:00:00     234.75      15.84         49.90
2024-06-04 06:00:00     233.09      15.82         50.15
2024-06-04 16:00:00     194.20      28.70         50.08
2024-06-06 05:00:00     235.09      15.41         49.91
</code></pre>
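<p>Only the 2024-06-04 16:00 row is the injected fault. With <code>contamination=0.02</code>, the model labels roughly 2% of the 200 rows as anomalies no matter how many real faults exist, which is why three ordinary-looking rows are flagged too. In practice, you'd follow up these labels with domain-specific threshold rules.</p>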
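<p>If you need a ranking rather than a yes/no label, scikit-learn's <code>decision_function</code> returns a continuous score where lower means more anomalous. The sketch below builds its own two-sensor dataset (with a different random seed from the example above), so the exact scores will differ from anything shown earlier.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
n = 200
hours = np.arange(n)

readings = pd.DataFrame({
    "voltage_v": 230 + 3 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, n),
    "current_a": 15 + 0.8 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.3, n),
}, index=pd.date_range("2024-06-01", periods=n, freq="h"))

# Inject one joint fault: voltage dip with a current surge
readings.iloc[88] = [194.2, 28.7]

model = IsolationForest(contamination=0.02, random_state=0).fit(readings)

# decision_function returns a continuous score; lower means more anomalous
scores = pd.Series(model.decision_function(readings), index=readings.index)
print(scores.nsmallest(4).round(3))
</code></pre>
<p>Sorting by score lets you review the most suspicious rows first instead of treating every flagged row as equally likely to be a fault.</p>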
<h3 id="heading-outlier-treatment">Outlier Treatment</h3>
<p>Once outliers are identified, you can handle them in several ways:</p>
<ul>
<li><p>Cap them using Winsorization by limiting extreme values to a threshold.</p>
</li>
<li><p>Replace them with interpolated or estimated values.</p>
</li>
<li><p>Flag them so the model can handle them appropriately.</p>
</li>
</ul>
<pre><code class="language-python"># Winsorize: cap at the IQR bounds
voltage_winsorized = voltage_clean.clip(lower=lower_bound, upper=upper_bound)

# Replace outliers with time-interpolated values
voltage_outlier_fixed = voltage_clean.copy()
voltage_outlier_fixed[outliers_iqr.index] = np.nan
voltage_outlier_fixed = voltage_outlier_fixed.interpolate(method="time")

print("Outlier treatment comparison:")
for ts in outliers_iqr.index:
    print(f"\n  {ts}")
    print(f"    Original:     {voltage_clean[ts]:.2f}")
    print(f"    Winsorized:   {voltage_winsorized[ts]:.2f}")
    print(f"    Interpolated: {voltage_outlier_fixed[ts]:.2f}")
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">Outlier treatment comparison:

  2024-06-04 06:00:00
    Original:     312.40
    Winsorized:   239.46
    Interpolated: 232.01

  2024-06-06 10:00:00
    Original:     187.20
    Winsorized:   220.16
    Interpolated: 231.43
</code></pre>
<p>Winsorization preserves the point but clips it to a plausible range — useful when you want to retain the information that something anomalous happened. Interpolation treats the outlier as if it were missing — better when you believe the reading is simply wrong.</p>
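<p>The third option, flagging, can be combined with either treatment so the information isn't lost. Here's a minimal sketch on a short synthetic series. The fixed limits of 220 and 240 are hypothetical stand-ins for bounds you'd compute from your own data.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

readings = pd.Series(
    [230.1, 229.8, 312.4, 230.5, 231.0, 187.2, 230.3],
    index=pd.date_range("2024-06-01", periods=7, freq="h"),
    name="voltage_v",
)

# Hypothetical plausibility limits for illustration
lower, upper = 220.0, 240.0
is_outlier = readings.lt(lower) | readings.gt(upper)

flagged = pd.DataFrame({
    # Blank out the outliers, then fill them from their neighbours
    "voltage_v": readings.mask(is_outlier).interpolate(method="time"),
    # Keep a record of which rows were altered
    "was_outlier": is_outlier,
})
print(flagged)
</code></pre>
<p>A downstream model can then use <code>was_outlier</code> as a feature or exclude those rows entirely.</p>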
<h2 id="heading-how-to-remove-duplicates">How to Remove Duplicates</h2>
<p>Duplicate timestamps are common when data pipelines retry on failure. Unlike tabular duplicates, time series duplicates aren't always identical: a retry might deliver a slightly different reading for the same timestamp.</p>
<pre><code class="language-python"># Inject duplicate timestamps with slightly different values (retry scenario)
dup_index = index.tolist()
dup_index.insert(20, index[20])  # exact duplicate timestamp
dup_index.insert(55, index[55])  # retry duplicate

dup_values = voltage_clean.tolist()
dup_values.insert(20, voltage_clean.iloc[20])
dup_values.insert(55, voltage_clean.iloc[55] + 0.7)  # slightly different value

dup_series = pd.Series(dup_values, index=pd.DatetimeIndex(dup_index), name="voltage_v")

print(f"Length with duplicates: {len(dup_series)}")
print(f"Duplicate timestamps:   {dup_series.index.duplicated().sum()}")

# Strategy 1: keep first (original reading)
dedup_first = dup_series[~dup_series.index.duplicated(keep="first")]

# Strategy 2: keep mean (average across retries)
dedup_mean = dup_series.groupby(level=0).mean()

print(f"\nAfter dedup (keep first): {len(dedup_first)}")
print(f"After dedup (mean):       {len(dedup_mean)}")

# Show the retry duplicate
ts_retry = index[55]
print(f"\nRetry duplicate at {ts_retry}:")
print(f"  Values:      {dup_series[ts_retry].values.round(3)}")
print(f"  Keep first:  {dedup_first[ts_retry]:.3f}")
print(f"  Mean:        {dedup_mean[ts_retry]:.3f}")
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">Length with duplicates: 170
Duplicate timestamps:   2

After dedup (keep first): 168
After dedup (mean):       168

Retry duplicate at 2024-06-03 07:00:00:
  Values:      [235.198 234.498]
  Keep first:  235.198
  Mean:        234.848
</code></pre>
<p>For most sensor pipelines, keep-first is the right default; the first delivery is the original reading. Mean makes sense when retries come from independent sensors measuring the same quantity.</p>
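<p>Before picking a strategy, it can help to separate harmless exact duplicates from retries that disagree. This sketch uses a small synthetic feed, and the variable names are illustrative.</p>
<pre><code class="language-python">import pandas as pd

ts = pd.to_datetime([
    "2024-06-01 00:00", "2024-06-01 01:00", "2024-06-01 01:00",
    "2024-06-01 02:00", "2024-06-01 03:00", "2024-06-01 03:00",
])
feed = pd.Series([230.0, 231.2, 231.2, 229.5, 232.0, 232.7], index=ts, name="voltage_v")

summary = feed.groupby(level=0).agg(["count", "min", "max"])
repeated = summary[summary["count"].gt(1)]

# Retries that disagree need a decision; identical ones are safe to drop
conflicts = repeated[repeated["min"].ne(repeated["max"])]
print(conflicts)
</code></pre>
<p>Here only the 03:00 timestamp is a real conflict. The 01:00 duplicate carries the same value twice.</p>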
<h2 id="heading-frequency-alignment-and-resampling">Frequency Alignment and Resampling</h2>
<p>Real pipelines often mix data at different frequencies. For example, you may need a 1-minute meter reading merged with an hourly weather feed. Before joining them, you need to align frequencies explicitly.</p>
<pre><code class="language-python"># 1-minute power draw readings: 42 kW baseline, +18 kW from 08:00 to 18:59
minute_index = pd.date_range("2024-06-01", periods=1440, freq="min")  # "T" is deprecated
power_1min = pd.Series(
    42 + 18 * minute_index.hour.isin(range(8, 19)).astype(int)
    + np.random.normal(0, 2, 1440),
    index=minute_index,
    name="power_kw"
)

# Downsample to hourly: mean is appropriate for power (average over the hour)
power_hourly_mean = power_1min.resample("h").mean().round(2)

# Downsample to hourly: max (peak demand within the hour)
power_hourly_max = power_1min.resample("h").max().round(2)

# Downsample to hourly: sum of per-minute kW divided by 60 gives energy in kWh
energy_hourly_kwh = (power_1min.resample("h").sum() / 60).round(3)

comparison = pd.DataFrame({
    "mean_kw":    power_hourly_mean,
    "peak_kw":    power_hourly_max,
    "energy_kwh": energy_hourly_kwh,
}).iloc[7:13]

print(comparison)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">                     mean_kw  peak_kw  energy_kwh
2024-06-01 07:00:00    42.13    46.28      42.133
2024-06-01 08:00:00    60.56    64.81      60.557
2024-06-01 09:00:00    59.91    64.88      59.912
2024-06-01 10:00:00    60.07    65.16      60.066
2024-06-01 11:00:00    60.08    64.99      60.083
2024-06-01 12:00:00    59.72    63.65      59.724
</code></pre>
<p>Which aggregation you choose matters enormously for downstream use. Mean power is right for load profiling. Peak power is right for capacity planning. Sum (converted to kWh) is right for billing. The <em>right</em> answer depends on the domain question you're asking, not on the resampling technique.</p>
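<p>Once both series share a frequency, the merge itself is an index join. The sketch below assumes a hypothetical hourly weather feed (<code>temperature_c</code>) and a synthetic three-hour meter series. Neither comes from the dataset above.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

minute_index = pd.date_range("2024-06-01", periods=180, freq="min")
power_kw = pd.Series(np.linspace(40.0, 60.0, 180), index=minute_index, name="power_kw")

# Hypothetical hourly weather feed
temperature_c = pd.Series(
    [18.5, 19.1, 20.4],
    index=pd.date_range("2024-06-01", periods=3, freq="h"),
    name="temperature_c",
)

# Bring the faster series down to the slower one's frequency, then join on the index
hourly = power_kw.resample("h").mean().to_frame("mean_kw").join(temperature_c)
print(hourly.round(2))
</code></pre>
<p>Downsampling the faster series is usually the safer direction. Upsampling the hourly feed to one minute would mean inventing 59 values per hour.</p>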
<h2 id="heading-smoothing-noise">Smoothing Noise</h2>
<p>Raw sensor data often contains high-frequency noise that obscures the underlying signal. Smoothing before feature engineering prevents the model from fitting to noise, but over-smoothing destroys real variation.</p>
<h3 id="heading-exponential-weighted-moving-average">Exponential Weighted Moving Average</h3>
<p>The exponentially weighted moving average (EWMA) gives <em>more weight to recent observations</em> and adapts quickly to level changes. For non-stationary signals, this often tracks the underlying level better than a simple moving average.</p>
<pre><code class="language-python"># Noisy temperature sensor (°C)
temp_noisy = pd.Series(
    3.5
    + 1.2 * np.sin(2 * np.pi * np.arange(168) / 24)
    + np.random.normal(0, 0.8, 168),  # high noise
    index=pd.date_range("2024-06-01", periods=168, freq="H"),
    name="temperature_c"
)

temp_ewma = temp_noisy.ewm(span=6, adjust=False).mean()
temp_sma  = temp_noisy.rolling(window=6, center=True).mean()

comparison = pd.DataFrame({
    "raw":  temp_noisy,
    "ewma": temp_ewma.round(3),
    "sma":  temp_sma.round(3),
}).iloc[22:30]

print(comparison)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">                          raw   ewma    sma
2024-06-01 22:00:00  3.212372  2.843  3.035
2024-06-01 23:00:00  3.106840  2.918  3.176
2024-06-02 00:00:00  3.712290  3.145  3.011
2024-06-02 01:00:00  3.344376  3.202  3.294
2024-06-02 02:00:00  2.148946  2.901  3.705
2024-06-02 03:00:00  4.241105  3.284  4.087
2024-06-02 04:00:00  5.677429  3.968  4.381
2024-06-02 05:00:00  5.400083  4.377  4.765
</code></pre>
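<p>To see what <code>ewm(span=6, adjust=False)</code> computes, it can help to write the recursion out by hand. With <code>adjust=False</code>, pandas uses a smoothing factor of 2 / (span + 1) and updates each value from the previous smoothed value. The sketch below is a plain-Python illustration of that recursion, not a replacement for the pandas call; the <code>ewma</code> helper and the step input are hypothetical, and it uses <code>span=3</code> so the arithmetic is easy to follow.</p>

```python
def ewma(values, span):
    """Exponentially weighted mean in its recursive (adjust=False) form."""
    alpha = 2 / (span + 1)   # smoothing factor implied by the span
    smoothed = [values[0]]   # the first output equals the first input
    for x in values[1:]:
        # Blend the previous smoothed value with the new observation.
        smoothed.append((1 - alpha) * smoothed[-1] + alpha * x)
    return smoothed

# A level change from 0 to 8: the EWMA moves halfway to the new level at each step.
step = [0.0, 0.0, 8.0, 8.0, 8.0]
result = ewma(step, span=3)
print(result)  # [0.0, 0.0, 4.0, 6.0, 7.0]
```

<p>With <code>span=6</code>, as in the article's example, the factor is 2/7, so each new reading contributes a little under 29% of the smoothed value.</p>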
<h3 id="heading-savitzky-golay-filter">Savitzky-Golay Filter</h3>
<p>For signals where you need to preserve peak shapes — not just smooth them away — the <a href="https://eigenvector.com/wp-content/uploads/2020/01/SavitzkyGolay.pdf">Savitzky-Golay filter</a> fits a polynomial over a sliding window and is better at maintaining the height of genuine spikes.</p>
<pre><code class="language-python">from scipy.signal import savgol_filter

temp_savgol = pd.Series(
    savgol_filter(temp_noisy.values, window_length=11, polyorder=2),
    index=temp_noisy.index,
    name="temp_savgol"
).round(3)

print(pd.DataFrame({
    "raw":    temp_noisy,
    "savgol": temp_savgol,
}).iloc[22:30])
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">                          raw  savgol
2024-06-01 22:00:00  3.212372   2.960
2024-06-01 23:00:00  3.106840   2.944
2024-06-02 00:00:00  3.712290   3.114
2024-06-02 01:00:00  3.344376   3.379
2024-06-02 02:00:00  2.148946   3.809
2024-06-02 03:00:00  4.241105   4.288
2024-06-02 04:00:00  5.677429   4.749
2024-06-02 05:00:00  5.400083   5.138
</code></pre>
<h2 id="heading-schema-and-sanity-validation">Schema and Sanity Validation</h2>
<p>Cleaning without validation is incomplete. You need automated checks that run every time new data arrives — catching problems before they silently corrupt downstream models.</p>
<pre><code class="language-python">def validate_time_series(series: pd.Series, config: dict) -&gt; dict:
    """
    Run schema and sanity checks on a time series.
    Returns a report dict with pass/fail per check.
    """
    report = {}

    # Frequency check
    inferred = pd.infer_freq(series.index)
    report["freq_regular"] = inferred == config["expected_freq"]

    # Missing value threshold
    missing_rate = series.isna().mean()
    report["missing_below_threshold"] = missing_rate &lt;= config["max_missing_rate"]
    report["missing_rate"] = round(missing_rate, 4)

    # Value range check
    in_range = series.dropna().between(config["min_value"], config["max_value"])
    report["values_in_range"] = in_range.all()
    report["out_of_range_count"] = (~in_range).sum()

    # Duplicate timestamps
    report["no_duplicates"] = not series.index.duplicated().any()

    # Monotonic index
    report["index_monotonic"] = series.index.is_monotonic_increasing

    return report


config = {
    "expected_freq":    "H",
    "max_missing_rate": 0.05,
    "min_value":        210.0,
    "max_value":        250.0,
}

report = validate_time_series(voltage_outlier_fixed, config)

print("=== VALIDATION REPORT ===")
for check, result in report.items():
    if check in ("missing_rate", "out_of_range_count"):
        print(f"  {check}: {result}")
    else:
        status = "✓ PASS" if result else "✗ FAIL"
        print(f"  {status}  {check}")
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">=== VALIDATION REPORT ===
  ✗ FAIL  freq_regular
  ✓ PASS  missing_below_threshold
  missing_rate: 0.0
  ✓ PASS  values_in_range
  out_of_range_count: 0
  ✓ PASS  no_duplicates
  ✓ PASS  index_monotonic
</code></pre>
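<p>This validator is the kind of function you wrap around every data ingestion step in a production pipeline. Run it before cleaning to know what's broken, and after cleaning to confirm everything passed. A <code>freq_regular</code> failure like the one in the report above deserves a closer look: <code>pd.infer_freq</code> returns <code>None</code> when the index has gaps, and recent pandas versions report hourly data as <code>"h"</code> rather than <code>"H"</code>, so an exact string comparison can fail even on a regular index.</p>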
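<p>A regularity check can also be expressed directly in terms of the spacing between consecutive timestamps, which has the side benefit of showing <em>where</em> the index is irregular. The following is a minimal standard-library sketch of that idea, not part of the validator above; <code>find_irregular_steps</code> and the sample timestamps are hypothetical.</p>

```python
from datetime import datetime, timedelta

def find_irregular_steps(timestamps, expected_step):
    """Return (previous, current) pairs whose spacing differs from expected_step."""
    return [
        (prev, curr)
        for prev, curr in zip(timestamps, timestamps[1:])
        if curr - prev != expected_step
    ]

start = datetime(2024, 6, 1)
# Hypothetical hourly index with the 03:00 reading missing.
stamps = [start + timedelta(hours=h) for h in range(6) if h != 3]

gaps = find_irregular_steps(stamps, timedelta(hours=1))
for prev, curr in gaps:
    print(f"irregular step between {prev} and {curr}")
```

<p>An empty result means every step matches the expected spacing. The same comparison also flags duplicated or out-of-order timestamps, since their spacing is zero or negative.</p>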
<h2 id="heading-the-complete-cleaning-checklist">The Complete Cleaning Checklist</h2>
<p>Here's the full sequence to run on any incoming time series dataset:</p>
<table>
<thead>
<tr>
<th>Step</th>
<th>Technique</th>
<th>When to Use</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Audit</strong></td>
<td>Index check, missing map, value range</td>
<td>Always — before anything else</td>
</tr>
<tr>
<td><strong>Reindex</strong></td>
<td><code>reindex</code> to canonical frequency</td>
<td>When timestamps are absent rather than NaN</td>
</tr>
<tr>
<td><strong>Missing: short gaps</strong></td>
<td>Time interpolation</td>
<td>Continuous signals, gaps ≤ 3 steps</td>
</tr>
<tr>
<td><strong>Missing: step signals</strong></td>
<td>Forward fill</td>
<td>Categorical or setpoint data</td>
</tr>
<tr>
<td><strong>Missing: long gaps</strong></td>
<td>Seasonal decomposition impute</td>
<td>Seasonal signals, gaps &gt; 6 steps</td>
</tr>
<tr>
<td><strong>Outliers: univariate</strong></td>
<td>Rolling Z-score or IQR</td>
<td>Single sensor, local anomalies</td>
</tr>
<tr>
<td><strong>Outliers: multivariate</strong></td>
<td>Isolation Forest</td>
<td>Multiple correlated sensors</td>
</tr>
<tr>
<td><strong>Outlier treatment</strong></td>
<td>Winsorize or interpolate</td>
<td>Depending on whether event is real</td>
</tr>
<tr>
<td><strong>Duplicates</strong></td>
<td>Keep first or group mean</td>
<td>Pipeline retry duplicates</td>
</tr>
<tr>
<td><strong>Resampling</strong></td>
<td><code>.resample()</code> with correct aggregation</td>
<td>Frequency alignment before joins</td>
</tr>
<tr>
<td><strong>Smoothing</strong></td>
<td>EWMA or Savitzky-Golay</td>
<td>Noisy sensors before feature engineering</td>
</tr>
<tr>
<td><strong>Validation</strong></td>
<td>Schema + sanity checks</td>
<td>After cleaning, and on every new batch</td>
</tr>
</tbody></table>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>The order matters. Reindex before imputing. Impute before smoothing. Validate after everything. Skipping steps or doing them out of order compounds errors in ways that are very difficult to trace back once you're looking at model predictions.</p>
<p>Time series cleaning isn't glamorous work, but a model trained on clean data and thoughtfully engineered features will almost always outperform a more sophisticated model trained on data that wasn't cleaned properly. Getting this pipeline right is the highest-leverage thing you can do before you try running even the simplest algorithm on your time series data.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Calculator with Tkinter in Python  ]]>
                </title>
                <description>
                    <![CDATA[ In this tutorial, you'll learn how to create a simple arithmetic calculator in Python with Tkinter. The project will be one of your first steps towards building an actual GUI in Python. This is a hand ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-calculator-with-tkinter-in-python/</link>
                <guid isPermaLink="false">6a07203c99d875f5cd667635</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GUI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tkinter ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Sara Jadhav ]]>
                </dc:creator>
                <pubDate>Fri, 15 May 2026 13:31:40 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/0ae14c91-3e47-464c-b392-1026321a7764.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this tutorial, you'll learn how to create a simple arithmetic calculator in Python with Tkinter. The project will be one of your first steps towards building an actual GUI in Python.</p>
<p>This is a hands-on tutorial, which will help you form your early GUI projects. It's meant for anyone who wants to start building visual projects in Python.</p>
<p>The Tkinter library is a standard built-in Python library which helps us make Graphical User Interfaces in Python. Since it's a built-in library, we don't have to separately install it. So, once you have Python installed on your computer, you just have to set it up and you're good to follow along here.</p>
<p>But keep in mind that some Python distributions don't bundle Tkinter. To check whether it's installed, open your command prompt and type:</p>
<pre><code class="language-plaintext">python -m tkinter
</code></pre>
<p>This will open up a Tkinter specimen window if Tkinter is installed and working on your computer.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-do-we-want-to-see-in-our-project">What Do We Want to See in Our Project?</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-the-window">How to Set Up the Window</a></p>
</li>
<li><p><a href="#heading-how-to-name-the-window">How to Name the Window</a></p>
</li>
<li><p><a href="#heading-how-to-create-frames-in-the-window">How to Create Frames in the Window</a></p>
</li>
<li><p><a href="#heading-how-to-add-buttons-to-the-window">How to Add Buttons to the Window</a></p>
</li>
<li><p><a href="#heading-how-to-add-the-output-screen-of-the-calculator">How to Add the Output Screen of the Calculator</a></p>
</li>
<li><p><a href="#heading-how-to-make-the-numbers-visible-on-the-output-screen">How to Make the Numbers Visible on the Output Screen</a></p>
</li>
<li><p><a href="#heading-how-to-add-a-scrollbar-to-the-output-screen">How to Add a Scrollbar to the Output Screen</a></p>
</li>
<li><p><a href="#heading-how-to-add-the-equal-to-button">How to Add the Equal To Button</a></p>
</li>
<li><p><a href="#heading-how-to-add-the-ac-button">How to Add the AC Button</a></p>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before starting, here are some prerequisites for this tutorial which will help you get the most out of it:</p>
<ul>
<li><p>Basic Python Syntax</p>
</li>
<li><p>Understanding of how to import and use libraries and their functions</p>
</li>
<li><p>Understanding of how to use different attributes of the module</p>
</li>
</ul>
<p>Now that we know what we need for this tutorial, let's dive into the process!</p>
<p>The first step for building any project is to create a clear-cut idea of what you want to build. Let's look at what we're going to make.</p>
<h2 id="heading-what-do-we-want-to-see-in-our-project">What Do We Want to See in Our Project?</h2>
<p>We're going to build a simple arithmetic calculator. The calculator works as follows:</p>
<ul>
<li><p>It has all the numerals (0, 1, 2, ..., 9) on a keyboard.</p>
</li>
<li><p>It has the basic arithmetic operators (+, -, x, /) and an = button lining the keyboard.</p>
</li>
<li><p>The calculator is non-resizable, that is, the user can't extend the width or the height of the application window.</p>
</li>
<li><p>The calculator has a screen above the keyboard which shows the user input and the final answer.</p>
</li>
<li><p>Finally, the calculator has an 'AC' button which stands for 'All Clear' which erases everything on the output screen of the window and allows the user to use it again.</p>
</li>
</ul>
<p>With this, we have a clear idea about what we're going to build.</p>
<p>Also, you can create the UI beforehand and place the widgets accordingly on the window. Here's an image of the UI we'll create in this tutorial:</p>
<img src="https://cdn.hashnode.com/uploads/covers/67e55054a0be57d730442ec0/93d8458d-f829-4edb-9651-622a14f9444a.png" alt="UI of the calculator " style="display:block;margin:0 auto" width="249" height="388" loading="lazy">

<h2 id="heading-how-to-set-up-the-window">How to Set Up the Window</h2>
<p>To set up our main window where we'll later add our widgets, first we need to import the Tkinter library into our program. Then we'll initialize the window using the <code>tk.Tk()</code> function. To display the window on the screen continuously until we quit manually, we'll use the <code>mainloop()</code> function. Here's what the code looks like:</p>
<pre><code class="language-python">import tkinter as tk

# screen initialization
root = tk.Tk()

# This keeps the window active
root.mainloop()
</code></pre>
<p>The <code>root</code> variable represents our window. So, from now on, we'll be adding the widgets to this window.</p>
<p>When you run the program, you'll see a blank window on your screen as shown in the image below. Congrats! This is your first GUI.</p>
<img src="https://cdn.hashnode.com/uploads/covers/67e55054a0be57d730442ec0/ee60f1d8-ea5b-415d-96bf-90941ecd9424.png" alt="Blank tkinter window " style="display:block;margin:0 auto" width="214" height="241" loading="lazy">

<h2 id="heading-how-to-name-the-window">How to Name the Window</h2>
<p>The 'tk' written on the Title Bar is the default title of the window. To set our own window title, we can use the <code>title()</code> function. The following code shows how you can do that:</p>
<pre><code class="language-python">import tkinter as tk

root = tk.Tk()

# Naming the window
root.title("Calculator")

root.mainloop()
</code></pre>
<p>On executing the program, we get the following window:</p>
<img src="https://cdn.hashnode.com/uploads/covers/67e55054a0be57d730442ec0/ef6ca391-3744-4915-9118-4996124adc85.png" alt="Blank window with title changed to 'Calculator'" style="display:block;margin:0 auto" width="302" height="281" loading="lazy">

<p>Now you should be able to see that the title of the window changed successfully.</p>
<h2 id="heading-how-to-create-frames-in-the-window">How to Create Frames in the Window</h2>
<p>After setting up the window, now we have to place the buttons on it. For placing the buttons, we need to create a container in which we'll put them.</p>
<p>The container could be the main window, but we'll avoid that for this project. This is because we want to place some buttons to the side of and below others to create our keyboard. To make it easier, we'll create Frame containers.</p>
<p>A Frame is a rectangular container for other widgets. In this project, each frame acts as one vertical column of the keyboard. The initial dimension of the frame is 0 x 0, and it grows to fit the widgets we place in it.</p>
<p>We'll create four frames in our window. The first frame will contain the buttons 1, 4, 7, and AC. The second frame will contain the buttons 2, 5, 8, and 0, the third frame will contain the buttons 3, 6, 9, and =, and the last frame will contain the buttons +, -, x, and / (just like in the UI shown above).</p>
<p>We can create frames in Tkinter using <code>tk.Frame()</code>. We pass the parent container for the Frame (here, the main window) as its argument. The following code should make it clear:</p>
<pre><code class="language-python">import tkinter as tk

# screen initialization
root = tk.Tk()

# Naming the window
root.title("Calculator")

# Creating Frames
frame1 = tk.Frame(root)
frame1.pack(side='left', anchor='n')
frame2 = tk.Frame(root)
frame2.pack(side='left', anchor='n')
frame3 = tk.Frame(root)
frame3.pack(side='left', anchor='n')
frame4 = tk.Frame(root)
frame4.pack(side='left', anchor='n')

# This keeps the window active
root.mainloop()
</code></pre>
<p>The <code>pack()</code> function places the Frame in the window. The <code>side='left'</code> parameter packs each Frame against the left edge of the space that is still free. By default, this is set to <code>'top'</code>, which would stack the frames vertically. <code>anchor='n'</code> positions each Frame at the top of the space allotted to it, so columns with fewer buttons line up with the top of the taller ones. By default, a widget is centered in its allotted space. The 'n' in <code>anchor='n'</code> stands for 'North'.</p>
<p>An important thing to note is that, since we defined <code>frame1</code> early in the program, it will occupy the extreme left portion of the window. But even though <code>frame2</code> is also set to occupy extreme left, the two frames <code>frame1</code> and <code>frame2</code> won't overlap. Instead <code>frame2</code> will take a position so that it goes as far left as it can go on the window without overlapping <code>frame1</code>. So frames <code>frame1</code>, <code>frame2</code>, <code>frame3</code> and <code>frame4</code> are side by side on the left side of the window.</p>
<h2 id="heading-how-to-add-buttons-to-the-window">How to Add Buttons to the Window</h2>
<p>We can create a button widget in Tkinter by using the <code>tk.Button()</code> function. The <code>tk.Button()</code> function accepts various parameters:</p>
<ul>
<li><p><strong>master:</strong> This allows us to provide the parent container in which we have to place our button. This expects a container object.</p>
</li>
<li><p><strong>text:</strong> In this parameter, we have to pass the text which we want to display on our button. This expects a string.</p>
</li>
<li><p><strong>font:</strong> This expects a tuple with the first element providing the name of the font and the next element providing the font size.</p>
</li>
<li><p><strong>image:</strong> This allows us to put an image over our button.</p>
</li>
<li><p><strong>bg:</strong> This allows us to set the background colour for our button.</p>
</li>
<li><p><strong>fg:</strong> This allows us to set the foreground colour for our button.</p>
</li>
<li><p><strong>activebackground:</strong> This sets the background colour shown while the button is active, for example while it is being pressed.</p>
</li>
<li><p><strong>command:</strong> This allows us to link a command to the button.</p>
</li>
</ul>
<p>Now that we know the basics of creating a button, let's actually create the keyboard of our calculator.</p>
<p>To create the keyboard, we have to put quite a few buttons on the window. To make our work easier, we'll define a function to create our buttons, just with different text. Let's look at the code below:</p>
<pre><code class="language-python">import tkinter as tk

# screen initialization
root = tk.Tk()

# Naming the window
root.title("Calculator")

frame1 = tk.Frame(root)
frame1.pack(side='left', anchor='n')
frame2 = tk.Frame(root)
frame2.pack(side='left', anchor='n')
frame3 = tk.Frame(root)
frame3.pack(side='left', anchor='n')
frame4 = tk.Frame(root)
frame4.pack(side='left', anchor='n')

pixel = tk.PhotoImage(width=55, height=55)

def buttons(text, frame):
    button = tk.Button(frame, text=text, font=('Arial', 20), image=pixel, bg="#333300", fg="white", compound="center")
    return button


def buttons_ops(text, frame, bg, fg):
    button = tk.Button(frame, text=text, font=('Arial', 20), image=pixel, bg=bg, fg=fg, activebackground="black",
                        compound="center")
    return button

btn1 = buttons('1',frame1).pack()
btn4 = buttons('4', frame1).pack()
btn7 = buttons('7', frame1).pack()

btn2 = buttons('2', frame2).pack()
btn5 = buttons('5', frame2).pack()
btn8 = buttons('8', frame2).pack()
btn0 = buttons_ops('0', frame2, '#333300', 'white').pack()

plus = buttons_ops('+', frame4, 'black', 'white').pack()
minus= buttons_ops('-', frame4,  'black', 'white').pack()
mul = buttons_ops('x', frame4, 'black', 'white').pack()
div = buttons_ops('/', frame4, 'black', 'white').pack()

btn3 = buttons('3', frame3).pack()
btn6 = buttons('6', frame3).pack()
btn9 = buttons('9', frame3).pack()

# This keeps the window active
root.mainloop()
</code></pre>
<p>Now let's break it down:</p>
<p>First, we created a Tkinter image object via <code>tk.PhotoImage()</code>. This is a transparent image. Its purpose is to let us set the exact width and height of the button in pixels. The <code>compound='center'</code> ensures that the button text is aligned at the center of the transparent image.</p>
<p>You can change the size of the button by changing the <code>width</code> and <code>height</code> parameters of the <code>pixel</code> object.</p>
<p>Secondly, we created functions which take the 'text' and the 'container frame' as arguments. Inside each function, we created a button object and returned it. For the numerical buttons, we've created the function <code>buttons</code>, whereas for operator buttons, we've created the function <code>buttons_ops</code>. This was done only to give the two groups of buttons different styles (in terms of background and foreground, and so on).</p>
<p>You can change the colours of the buttons by making changes in the <code>bg</code> and <code>fg</code> parameters of the <code>tk.Button()</code> function.</p>
<p>Then we created all the buttons with these two functions. The <code>pack()</code> function puts the buttons in their respective places. Remember that we haven't created the <code>=</code> and <code>AC</code> buttons.</p>
<p>When we execute the program, the following window will pop up:</p>
<img src="https://cdn.hashnode.com/uploads/covers/67e55054a0be57d730442ec0/012c6883-d79b-44fa-8171-0c6cfc837b4e.png" alt="Window with embedded buttons " style="display:block;margin:0 auto" width="269" height="289" loading="lazy">

<p>You can try clicking the buttons to make sure that everything is working great up to this point.</p>
<h2 id="heading-how-to-add-the-output-screen-of-the-calculator">How to Add the Output Screen of the Calculator</h2>
<p>For the output screen of the calculator, we'll be using the <code>Entry</code> object in Tkinter. The <code>Entry</code> object will be the best match in this case because we want a single line screen to showcase the user input. We could also use a <code>Text</code> object, but it provides a multiline area. So here, we'll just be using the <code>Entry</code> object.</p>
<p>Also, since we want the output screen to sit above the keyboard, we need to create and pack this object before packing the frames.</p>
<p>The <code>Entry</code> object is created using the <code>tk.Entry()</code> function. This has similar parameters to the <code>tk.Button()</code> function. The following code creates an entry box:</p>
<pre><code class="language-python">import tkinter as tk

# screen initialization
root = tk.Tk()

# Naming the window
root.title("Calculator")

# creating the output screen
entry = tk.Entry(root, width=9, font=('Arial', 38, 'bold'), state='readonly')
entry.pack(pady=(30, 10))


frame1 = tk.Frame(root)
frame1.pack(side='left', anchor='n')
frame2 = tk.Frame(root)
frame2.pack(side='left', anchor='n')
frame3 = tk.Frame(root)
frame3.pack(side='left', anchor='n')
frame4 = tk.Frame(root)
frame4.pack(side='left', anchor='n')

pixel = tk.PhotoImage(width=55, height=55)

def buttons(text, frame):
    button = tk.Button(frame, text=text, font=('Arial', 20), image=pixel, bg="#333300", fg="white", compound="center")
    return button


def buttons_ops(text, frame, bg, fg):
    button = tk.Button(frame, text=text, font=('Arial', 20), image=pixel, bg=bg, fg=fg, activebackground="black",
                        compound="center")
    return button

btn1 = buttons('1',frame1).pack()
btn4 = buttons('4', frame1).pack()
btn7 = buttons('7', frame1).pack()

btn2 = buttons('2', frame2).pack()
btn5 = buttons('5', frame2).pack()
btn8 = buttons('8', frame2).pack()
btn0 = buttons_ops('0', frame2, '#333300', 'white').pack()

plus = buttons_ops('+', frame4, 'black', 'white').pack()
minus= buttons_ops('-', frame4,  'black', 'white').pack()
mul = buttons_ops('x', frame4, 'black', 'white').pack()
div = buttons_ops('/', frame4, 'black', 'white').pack()

btn3 = buttons('3', frame3).pack()
btn6 = buttons('6', frame3).pack()
btn9 = buttons('9', frame3).pack()

# This keeps the window active
root.mainloop()
</code></pre>
<p>In the code above, we put the parent container of the <code>entry</code> object as the main window <code>root</code>. I set the <code>width</code> parameter to 9 as it fit well with the dimensions of the window and the keyboard. You can try it out with different values for width and set a perfectly sized output screen.</p>
<p>You may have noticed that we didn't call <code>pack()</code> on the same line as the object definition. This is because <code>pack()</code> returns <code>None</code>, so chaining it onto the definition would store <code>None</code> in <code>entry</code> instead of the widget, and we couldn't call any methods on it later.</p>
<p>So, why did we use the <code>pack()</code> function on the same line while creating buttons? Because we never refer to those buttons again in this project, the variables holding <code>None</code> do no harm, and chaining reduces the lines of code.</p>
<p>In the <code>tk.Entry()</code> function, we set <code>state='readonly'</code>. This prohibits any direct text input into the output screen. That means we can only use the buttons to show the characters on the output screen. By default, this is set to <code>state='normal'</code>, which allows direct input from the keyboard into the entry box.</p>
<p>The <code>pady</code> parameter inside the <code>pack()</code> function adds the given number of pixels of padding above and below the object. For example, to pad 10 pixels both above and below the object, we can write <code>pady=10</code>.</p>
<p>Here, we didn't want the same amount of padding above and below the object. So we used a tuple with first element representing the pixels to pad above the output screen, and the second element representing the pixels to pad below the output screen.</p>
<p>Up until now, our GUI looks as shown below:</p>
<img src="https://cdn.hashnode.com/uploads/covers/67e55054a0be57d730442ec0/dc15d02d-2243-4751-b825-53d5b4061daf.png" alt="Window with embedded output screen" style="display:block;margin:0 auto" width="262" height="389" loading="lazy">

<p>We can now see that the output screen is set perfectly.</p>
<h2 id="heading-how-to-make-the-numbers-visible-on-the-output-screen">How to Make the Numbers Visible on the Output Screen</h2>
<p>The next step is to make characters visible on the output screen. Every button that we click should render on the output screen. For this, we have to link commands to each button. Let's first look at the code and then see how it works:</p>
<pre><code class="language-python">import tkinter as tk

# screen initialization
root = tk.Tk()

# Naming the window
root.title("Calculator")

entry = tk.Entry(root, width=9, font=('Arial', 38, 'bold'),state='readonly')
entry.pack(pady=(30, 10))


frame1 = tk.Frame(root)
frame1.pack(side='left', anchor='n')
frame2 = tk.Frame(root)
frame2.pack(side='left', anchor='n')
frame3 = tk.Frame(root)
frame3.pack(side='left', anchor='n')
frame4 = tk.Frame(root)
frame4.pack(side='left', anchor='n')

pixel = tk.PhotoImage(width=55, height=55)

def command(text):
    entry.config(state='normal')
    entry.insert(tk.END, text) 
    entry.config(state='readonly')  


def buttons(text, frame):
    button = tk.Button(frame, text=text, font=('Arial', 20), image=pixel, bg="#333300", fg="white", compound="center",
                       command=lambda :command(text))
    return button


def buttons_ops(text, frame, bg, fg):
    button = tk.Button(frame, text=text, font=('Arial', 20), image=pixel, bg=bg, fg=fg, activebackground="black",
                        compound="center", command=lambda:command(text))
    return button

btn1 = buttons('1',frame1).pack()
btn4 = buttons('4', frame1).pack()
btn7 = buttons('7', frame1).pack()

btn2 = buttons('2', frame2).pack()
btn5 = buttons('5', frame2).pack()
btn8 = buttons('8', frame2).pack()
btn0 = buttons_ops('0', frame2, '#333300', 'white').pack()

plus = buttons_ops('+', frame4, 'black', 'white').pack()
minus= buttons_ops('-', frame4,  'black', 'white').pack()
mul = buttons_ops('x', frame4, 'black', 'white').pack()
div = buttons_ops('/', frame4, 'black', 'white').pack()

btn3 = buttons('3', frame3).pack()
btn6 = buttons('6', frame3).pack()
btn9 = buttons('9', frame3).pack()

# This keeps the window active
root.mainloop()
</code></pre>
<p>In the code above, we defined a new function called <code>command()</code>. This function takes one argument <code>text</code>. Inside the function, we changed the <code>state</code> of the <code>entry</code> object to <code>normal</code> via <code>config</code>. By doing this, we can now make changes in the text of the <code>entry</code> object.</p>
<p>Then we used the <code>insert()</code> function for the <code>entry</code> object. The <code>insert()</code> function appends the <code>text</code> argument to the existing set of characters.</p>
<p>The first argument of the <code>insert()</code> function takes the index where the text will be inserted. <code>tk.END</code> represents the position just after the last character of the text in the object, so the new text is always added at the end. The second argument of the <code>insert()</code> function takes the text that is to be inserted.</p>
<p>Finally, we change the <code>state</code> of the object again to <code>readonly</code> to prohibit any outside input other than our defined calculator keyboard.</p>
<p>Now let's look at the <code>buttons</code> and the <code>buttons_ops</code> functions. You may have noticed that we've added the <code>command</code> parameter to the <code>tk.Button()</code> function. The <code>lambda</code> wraps the call in a small function, so the command runs only when the button is clicked. Without it, <code>command(text)</code> would run once, immediately, while the button is being created.</p>
<p>Collectively, <code>command=lambda:command(text)</code> means that, on clicking the buttons which we have defined up until now, it executes the <code>command()</code> function and shows the pressed button character on the output screen.</p>
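<p>The deferred call can be demonstrated without opening a window. In the sketch below, <code>command()</code> is a stand-in that records the label in a list instead of updating an <code>Entry</code>, and the loop at the end illustrates a common pitfall that the tutorial's helper functions avoid because <code>text</code> is a parameter of each call. All names here are illustrative.</p>

```python
clicked = []

def command(text):
    """Stand-in for the tutorial's command(): record the label that was pressed."""
    clicked.append(text)

# Wrapping the call in a lambda creates a callable; nothing runs yet.
on_click = lambda: command("7")
assert clicked == []

on_click()      # roughly what Tkinter does when the button is pressed
print(clicked)  # ['7']

# Pitfall when building callbacks in a loop: the lambda looks up `label`
# when it is called, so every callback sees the final value.
late = [lambda: label for label in "123"]
# Binding the current value as a default argument avoids that.
bound = [lambda label=label: label for label in "123"]

late_results = [f() for f in late]
bound_results = [f() for f in bound]
print(late_results, bound_results)  # ['3', '3', '3'] ['1', '2', '3']
```

<p>If you ever create the buttons in a loop rather than through a helper function, the default-argument form is the one to reach for.</p>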
<p>Now try clicking some buttons on your window. They should appear on the output screen as shown below:</p>
<img src="https://cdn.hashnode.com/uploads/covers/67e55054a0be57d730442ec0/e8cc0cd5-dc0c-4f01-9b21-7dec6874d00f.png" alt="Image of the calculator showing input on the calculator screen" style="display:block;margin:0 auto" width="262" height="395" loading="lazy">

<h2 id="heading-how-to-add-a-scrollbar-to-the-output-screen">How to Add a Scrollbar to the Output Screen</h2>
<p>Now, you might have encountered a problem: when you input a large number of characters, you were able to see only the first few characters. The rest were invisible.</p>
<p>To tackle this, we'll add a scrollbar to the output screen.</p>
<p>First, we'll create a scrollbar object via <code>tk.Scrollbar()</code> before the <code>entry</code> object. The following code shows how:</p>
<pre><code class="language-python">import tkinter as tk

# screen initialization
root = tk.Tk()

# Naming the window
root.title("Calculator")

scrollbar = tk.Scrollbar(root, orient='horizontal')

entry = tk.Entry(root, width=9, font=('Arial', 38, 'bold'), state='readonly', xscrollcommand=scrollbar.set)
entry.pack(pady=(30, 10))

scrollbar.config(command=entry.xview)
scrollbar.pack()

frame1 = tk.Frame(root)
frame1.pack(side='left', anchor='n')
frame2 = tk.Frame(root)
frame2.pack(side='left', anchor='n')
frame3 = tk.Frame(root)
frame3.pack(side='left', anchor='n')
frame4 = tk.Frame(root)
frame4.pack(side='left', anchor='n')

pixel = tk.PhotoImage(width=55, height=55)

def command(text):
    entry.config(state='normal')
    entry.insert(tk.END, text)
    entry.config(state='readonly')


def buttons(text, frame):
    button = tk.Button(frame, text=text, font=('Arial', 20), image=pixel, bg="#333300", fg="white", compound="center",
                       command=lambda :command(text))
    return button


def buttons_ops(text, frame, bg, fg):
    button = tk.Button(frame, text=text, font=('Arial', 20), image=pixel, bg=bg, fg=fg, activebackground="black",
                        compound="center", command=lambda:command(text))
    return button

btn1 = buttons('1',frame1).pack()
btn4 = buttons('4', frame1).pack()
btn7 = buttons('7', frame1).pack()

btn2 = buttons('2', frame2).pack()
btn5 = buttons('5', frame2).pack()
btn8 = buttons('8', frame2).pack()
btn0 = buttons_ops('0', frame2, '#333300', 'white').pack()

plus = buttons_ops('+', frame4, 'black', 'white').pack()
minus= buttons_ops('-', frame4,  'black', 'white').pack()
mul = buttons_ops('x', frame4, 'black', 'white').pack()
div = buttons_ops('/', frame4, 'black', 'white').pack()

btn3 = buttons('3', frame3).pack()
btn6 = buttons('6', frame3).pack()
btn9 = buttons('9', frame3).pack()

# This keeps the window active
root.mainloop()
</code></pre>
<p>The <code>orient</code> parameter in <code>tk.Scrollbar()</code> sets the scrollbar's orientation. Here, <code>'horizontal'</code> aligns it with the X-axis. We also added a parameter to the original <code>entry</code> object: <code>xscrollcommand=scrollbar.set</code> lets the output screen tell the scrollbar which part of the text is currently visible.</p>
<p>Then we connected the scrollbar back to the entry object by setting <code>command=entry.xview</code>, so that moving the scrollbar scrolls the text, and packed the scrollbar just below the output screen.</p>
<p>The following image shows the scrollbar. You can use the arrow signs to navigate forward or backward through the text:</p>
<img src="https://cdn.hashnode.com/uploads/covers/67e55054a0be57d730442ec0/29d0fd37-a94f-4bad-af24-6627055fd4b3.png" alt="Image of calculator with the scrollbar" style="display:block;margin:0 auto" width="283" height="429" loading="lazy">
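<p>The two connections described above pass around a pair of fractions between 0 and 1 that say which slice of the text is visible. The sketch below is a simplified model of that pair, not Tkinter code. It assumes every character has the same width, and <code>visible_fractions</code> is a hypothetical helper written for this illustration:</p>
<pre><code class="language-python">def visible_fractions(total_chars, visible_chars, first_index):
    """Return the (first, last) fractions of the text that are currently visible."""
    if total_chars == 0:
        # Nothing to scroll: the whole (empty) text is visible.
        return (0.0, 1.0)
    last_index = min(first_index + visible_chars, total_chars)
    return (first_index / total_chars, last_index / total_chars)

# 18 characters typed, 9 fit on the screen.
print(visible_fractions(18, 9, 0))  # prints (0.0, 0.5)
print(visible_fractions(18, 9, 9))  # prints (0.5, 1.0)
</code></pre>
<p>In the real widget, the entry hands a pair like this to <code>scrollbar.set</code>, and the scrollbar calls <code>entry.xview</code> when you move it.</p>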

<h2 id="heading-how-to-add-the-equal-to-button">How to Add the Equal To Button</h2>
<p>We haven't made the <code>equal to</code> button yet, so let's do that now. To start, we'll define a function called <code>cmd_equal()</code>. In this function, we'll first change the <code>state</code> of the <code>entry</code> to <code>normal</code>. Then we'll read the text on the output screen with <code>entry.get()</code> and replace 'x' with '*'. We do this because Python writes multiplication as '*', not 'x'.</p>
<p>Then we'll add a <code>try-except</code> block. We'll try to evaluate the extracted expression using Python's built-in <code>eval()</code> function. If the expression is invalid (for example, <code>2++</code>), we'll show 'INVALID' on the screen instead of raising an error.</p>
<pre><code class="language-python">import tkinter as tk

# screen initialization
root = tk.Tk()

# Naming the window
root.title("Calculator")

scrollbar = tk.Scrollbar(root, orient='horizontal')

entry = tk.Entry(root, width=9, font=('Arial', 38, 'bold'), state='readonly', xscrollcommand=scrollbar.set)
entry.pack(pady=(30, 10))

scrollbar.config(command=entry.xview)
scrollbar.pack()

frame1 = tk.Frame(root)
frame1.pack(side='left', anchor='n')
frame2 = tk.Frame(root)
frame2.pack(side='left', anchor='n')
frame3 = tk.Frame(root)
frame3.pack(side='left', anchor='n')
frame4 = tk.Frame(root)
frame4.pack(side='left', anchor='n')

pixel = tk.PhotoImage(width=55, height=55)

def command(text):
    entry.config(state='normal')
    entry.insert(tk.END, text)
    entry.config(state='readonly')

def cmd_equal():
    entry.config(state='normal')
    txt = entry.get().replace('x', '*')

    try:
        result = eval(txt)

    except Exception:
        # Malformed input or division by zero: show a message instead of crashing
        result = 'INVALID'
    entry.delete(0, tk.END)
    entry.insert(tk.END, result)
    entry.config(state='readonly')   


def buttons(text, frame):
    button = tk.Button(frame, text=text, font=('Arial', 20), image=pixel, bg="#333300", fg="white", compound="center",
                       command=lambda :command(text))
    return button


def buttons_ops(text, frame, bg, fg):
    button = tk.Button(frame, text=text, font=('Arial', 20), image=pixel, bg=bg, fg=fg, activebackground="black",
                        compound="center", command=lambda:command(text))
    return button

btn1 = buttons('1',frame1).pack()
btn4 = buttons('4', frame1).pack()
btn7 = buttons('7', frame1).pack()

btn2 = buttons('2', frame2).pack()
btn5 = buttons('5', frame2).pack()
btn8 = buttons('8', frame2).pack()
btn0 = buttons_ops('0', frame2, '#333300', 'white').pack()

plus = buttons_ops('+', frame4, 'black', 'white').pack()
minus= buttons_ops('-', frame4,  'black', 'white').pack()
mul = buttons_ops('x', frame4, 'black', 'white').pack()
div = buttons_ops('/', frame4, 'black', 'white').pack()

btn3 = buttons('3', frame3).pack()
btn6 = buttons('6', frame3).pack()
btn9 = buttons('9', frame3).pack()
equal= tk.Button(frame3, text='=', font=('Arial', 20), image=pixel, bg='white', fg='black', activebackground="black",
                        compound="center", command=lambda: cmd_equal()).pack()

# This keeps the window active
root.mainloop()
</code></pre>
<p>Here, we've also used <code>entry.delete()</code>. It deletes the text between two indices: from the first argument (index 0, the start of the text) to the second (<code>tk.END</code>, the end of the text). In other words, it clears the output screen.</p>
<p>Then we inserted our result onto the output screen using <code>entry.insert()</code>. Note that we've packed the <code>equal to</code> button right after <code>btn9</code> in the same frame. This puts it directly below the 9 button.</p>
<p>The following images show the initial and final screens, respectively.</p>
<img src="https://cdn.hashnode.com/uploads/covers/67e55054a0be57d730442ec0/8051ce51-5139-4668-97e2-b5c742c687a1.png" alt="Calculator window showing mathematical expression " style="display:block;margin:0 auto" width="270" height="407" loading="lazy">

<p>On clicking the equal to button:</p>
<img src="https://cdn.hashnode.com/uploads/covers/67e55054a0be57d730442ec0/1ddc6b4c-3e51-4237-b9f6-90986bff1963.png" alt="Calculator window showing evaluated mathematical expression" style="display:block;margin:0 auto" width="267" height="407" loading="lazy">
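<p>If you want to try the evaluation logic on its own, here is a minimal sketch of the same idea as a plain function with no window. The <code>evaluate</code> name and the <code>ALLOWED</code> character check are additions for this illustration and are not part of the calculator code above:</p>
<pre><code class="language-python"># Characters the calculator buttons can produce.
ALLOWED = set("0123456789+-*/x")

def evaluate(expression):
    """Return the value of a calculator expression, or 'INVALID' if it cannot be computed."""
    # Reject empty input and anything the buttons could not have typed.
    if not expression or not set(expression).issubset(ALLOWED):
        return 'INVALID'
    try:
        return eval(expression.replace('x', '*'))
    except Exception:
        return 'INVALID'

print(evaluate('2x3'))  # prints 6
print(evaluate('7/2'))  # prints 3.5
print(evaluate('1/0'))  # prints INVALID
print(evaluate('2++'))  # prints INVALID
</code></pre>
<p>Checking the characters first matters most if you later let users type into the output screen, because <code>eval()</code> runs whatever Python expression it is given.</p>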

<h2 id="heading-how-to-add-the-ac-button">How to Add the AC Button</h2>
<p>Finally, we'll define our last function: <code>cmd_ac()</code>. This function clears everything on the output screen. It first changes the <code>state</code> to <code>normal</code>, then calls <code>entry.delete()</code>, and lastly changes the <code>state</code> back to <code>readonly</code>. Then we'll pass this function to the <code>command</code> parameter of the <code>ac</code> button.</p>
<p>To keep the layout from breaking when the window is resized, we'll use the <code>resizable()</code> function. This function takes two arguments: the first controls whether the width can change, and the second controls the height. To prevent resizing, we'll set both to <code>False</code> (the code below passes <code>0</code>, which Tkinter treats the same way).</p>
<p>So the final code will be:</p>
<pre><code class="language-python">import tkinter as tk

# screen initialization
root = tk.Tk()

# Naming the window
root.title("Calculator")

scrollbar = tk.Scrollbar(root, orient='horizontal')

entry = tk.Entry(root, width=9, font=('Arial', 38, 'bold'), state='readonly', xscrollcommand=scrollbar.set)
entry.pack(pady=(30, 10))

scrollbar.config(command=entry.xview)
scrollbar.pack()

frame1 = tk.Frame(root)
frame1.pack(side='left', anchor='n')
frame2 = tk.Frame(root)
frame2.pack(side='left', anchor='n')
frame3 = tk.Frame(root)
frame3.pack(side='left', anchor='n')
frame4 = tk.Frame(root)
frame4.pack(side='left', anchor='n')

pixel = tk.PhotoImage(width=55, height=55)

def command(text):
    entry.config(state='normal')
    entry.insert(tk.END, text)
    entry.config(state='readonly')

def cmd_ac():
    entry.config(state='normal')
    entry.delete(0, tk.END)
    entry.config(state='readonly')

def cmd_equal():
    entry.config(state='normal')
    txt = entry.get().replace('x', '*')

    try:
        result = eval(txt)

    except Exception:
        # Malformed input or division by zero: show a message instead of crashing
        result = 'INVALID'
    entry.delete(0, tk.END)
    entry.insert(tk.END, result)
    entry.config(state='readonly')


def buttons(text, frame):
    button = tk.Button(frame, text=text, font=('Arial', 20), image=pixel, bg="#333300", fg="white", compound="center",
                       command=lambda :command(text))
    return button


def buttons_ops(text, frame, bg, fg):
    button = tk.Button(frame, text=text, font=('Arial', 20), image=pixel, bg=bg, fg=fg, activebackground="black",
                        compound="center", command=lambda:command(text))
    return button

btn1 = buttons('1',frame1).pack()
btn4 = buttons('4', frame1).pack()
btn7 = buttons('7', frame1).pack()
ac = tk.Button(frame1, text="AC", font=('Arial', 20), image=pixel, bg="#666699", fg="white", compound="center",
                        command=lambda: cmd_ac()).pack()

btn2 = buttons('2', frame2).pack()
btn5 = buttons('5', frame2).pack()
btn8 = buttons('8', frame2).pack()
btn0 = buttons_ops('0', frame2, '#333300', 'white').pack()

plus = buttons_ops('+', frame4, 'black', 'white').pack()
minus= buttons_ops('-', frame4,  'black', 'white').pack()
mul = buttons_ops('x', frame4, 'black', 'white').pack()
div = buttons_ops('/', frame4, 'black', 'white').pack()

btn3 = buttons('3', frame3).pack()
btn6 = buttons('6', frame3).pack()
btn9 = buttons('9', frame3).pack()
equal= tk.Button(frame3, text='=', font=('Arial', 20), image=pixel, bg='white', fg='black', activebackground="black",
                        compound="center", command=lambda: cmd_equal()).pack()


root.resizable(0,0) 
# This keeps the window active
root.mainloop()
</code></pre>
<p>When we hit run, this should display our final project.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>So now you know how to build a simple arithmetic calculator. To reinforce and build on what you learned here, try adding more functionality to this calculator. Here are some ideas to practice with:</p>
<ul>
<li><p>Adding a decimal point button so users can work with fractional numbers.</p>
</li>
<li><p>Adding a percentage button so users can calculate percentages.</p>
</li>
<li><p>Adding a delete button that removes one character at a time instead of clearing the entire screen.</p>
</li>
<li><p>Making the calculator 'computer keyboard interactive', that is, allowing input directly from the computer keyboard. (Hint: change the <code>state</code> of the <code>entry</code> object to <code>normal</code>, and add checks for invalid expressions.)</p>
</li>
</ul>
<p>Thanks for reading!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an Autonomous OSINT Agent in Python Using Claude's Tool Use API ]]>
                </title>
                <description>
                    <![CDATA[ When I started studying OSINT, I always felt I was just putting random values into software without deeply understanding what I was doing. After months in the field, I realized I wasn't really investi ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-autonomous-agent-in-python-using-claude/</link>
                <guid isPermaLink="false">6a06669ebaf09db7a64df6cf</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mcp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ claude ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tommaso Bertocchi ]]>
                </dc:creator>
                <pubDate>Fri, 15 May 2026 00:19:42 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/5890d77b-0678-4c68-a9c3-2304fb2a02ad.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When I started studying OSINT, I always felt I was just putting random values into software without deeply understanding what I was doing. After months in the field, I realized I wasn't really investigating — I was just executing steps that follow a predictable pattern. That's exactly what an AI agent is good at. So I built one.</p>
<p>In this tutorial you'll learn how to set up OpenOSINT, an open-source Python OSINT framework with an AI agent at its core. You'll learn how Claude's native tool use API works, how to run autonomous investigations from the terminal using the interactive AI REPL, how to use the direct CLI for scripting, and how to expose all the tools to Claude Code or Claude Desktop via an MCP server.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-osint-and-why-manual-workflows-break-down">What Is OSINT and Why Manual Workflows Break Down</a></p>
</li>
<li><p><a href="#heading-what-youll-build">What You'll Build</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-claudes-tool-use-api-works">How Claude's Tool Use API Works</a></p>
</li>
<li><p><a href="#heading-how-to-install-openosint">How to Install OpenOSINT</a></p>
</li>
<li><p><a href="#heading-how-to-use-the-interactive-ai-repl">How to Use the Interactive AI REPL</a></p>
</li>
<li><p><a href="#heading-how-to-run-individual-tools-from-the-cli">How to Run Individual Tools from the CLI</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-the-mcp-server">How to Set Up the MCP Server</a></p>
</li>
<li><p><a href="#heading-how-the-agent-loop-works-under-the-hood">How the Agent Loop Works Under the Hood</a></p>
</li>
<li><p><a href="#heading-project-architecture">Project Architecture</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-is-osint-and-why-manual-workflows-break-down">What Is OSINT and Why Manual Workflows Break Down</h2>
<p>Open Source Intelligence (OSINT) is the practice of collecting and analyzing information from publicly available sources. Security researchers use it during penetration tests. Journalists use it to verify identities and trace connections. Threat analysts use it to profile infrastructure.</p>
<p>A typical OSINT workflow looks like this:</p>
<ol>
<li><p>You have a target email address</p>
</li>
<li><p>You run <code>holehe</code> to find which platforms that email is registered on</p>
</li>
<li><p>You notice a username in the output</p>
</li>
<li><p>You manually copy that username and run <code>sherlock</code> to search 300+ platforms</p>
</li>
<li><p>You switch to a browser to check HaveIBeenPwned</p>
</li>
<li><p>You open another tab for a WHOIS lookup</p>
</li>
<li><p>You take notes and repeat</p>
</li>
</ol>
<p>Every tool is a silo. Every pivot is manual. The investigation logic — what to run next, what to chain, what the findings mean — lives entirely in your head.</p>
<p>When you close the terminal, it's gone.</p>
<p>This tutorial walks you through <a href="https://github.com/OpenOSINT/OpenOSINT">OpenOSINT</a>, an open-source Python framework that replaces that fragmented workflow with an AI agent that chains tools autonomously, executes them against real binaries, and saves a structured Markdown report.</p>
<p>More importantly, you'll learn the core design principle that makes it trustworthy for security research: <strong>hallucination in tool results is structurally impossible</strong>.</p>
<h2 id="heading-what-youll-build">What You'll Build</h2>
<p>By the end of this tutorial, you'll have a working OSINT agent that you can use in three ways:</p>
<ul>
<li><p><strong>Interactive AI REPL</strong> — type a target in natural language and the agent decides what to run</p>
</li>
<li><p><strong>Direct CLI</strong> — run individual tools without AI, useful for scripting</p>
</li>
<li><p><strong>MCP Server</strong> — expose all tools to Claude Code or Claude Desktop</p>
</li>
</ul>
<p>Here's what a real session looks like:</p>
<pre><code class="language-plaintext">$ openosint
openosint ❯ investigate target@example.com

  → generate_dorks('target@example.com')
  → search_email('target@example.com')
  ✓ Found: Spotify, WordPress, Gravatar, Office365

  → search_breach('target@example.com')
  ✓ Found in 2 breaches: LinkedIn (2016), Adobe (2013)

  → search_username('target_handle')
  ✓ Found on: GitHub, Reddit, HackerNews, Twitter

  ╭──────────────── Report ────────────────╮
  │ ## Online Presence                     │
  │ Spotify · WordPress · Gravatar         │
  │                                        │
  │ ## Data Breaches                       │
  │ LinkedIn (2016) · Adobe (2013)         │
  ╰────────────────────────────────────────╯

  ✓ Report saved → reports/2026-05-11_report.md
</code></pre>
<p>The agent went from email → linked accounts → username pivot → cross-platform search with no human orchestration at any step.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow this tutorial, you'll need:</p>
<ul>
<li><p>Python 3.10 or later installed on your machine</p>
</li>
<li><p>Basic familiarity with the command line</p>
</li>
<li><p>An <a href="https://console.anthropic.com/">Anthropic API key</a> — only required for the AI REPL, not for the CLI or MCP server</p>
</li>
<li><p>Git installed</p>
</li>
</ul>
<p>You don't need prior experience with OSINT tools or the Anthropic SDK.</p>
<h2 id="heading-how-claudes-tool-use-api-works">How Claude's Tool Use API Works</h2>
<p>Before you dive into installation, it's worth understanding the mechanism that makes this framework trustworthy for security research.</p>
<p>Most AI applications that wrap external tools work by generating text that describes what a tool <em>would</em> return. That's a problem when accuracy matters — the model can hallucinate plausible-looking usernames, fake subdomains, or data breaches that never happened.</p>
<p>Claude's tool use API works differently. When the model decides it needs to call a tool, it does <strong>not</strong> generate the output. It stops and emits a structured <code>tool_use</code> block containing the tool name and the arguments it wants to pass.</p>
<p>Your code then runs the actual binary — <code>holehe</code>, <code>sherlock</code>, or whatever else — and sends the real output back as a <code>tool_result</code>. The model reads that real output and decides its next step.</p>
<p>Here's the flow:</p>
<pre><code class="language-plaintext">User prompt
    ↓
Model decides to call search_email()
    ↓
Hard stop — model emits tool_use block
    ↓
Your code runs holehe against the real target
    ↓
Real output sent back as tool_result
    ↓
Model reads actual results, decides next step
    ↓
Repeat until investigation is complete
</code></pre>
<p>The model never generates tool output. It only ever reads it. If <code>sherlock</code> finds 12 profiles, those 12 URLs go back into the context verbatim. The model cannot add a 13th that doesn't exist.</p>
<p>This is not a prompting trick or a system prompt instruction. It is how the API is architected. Keep this in mind as you read through the agent loop code later in this tutorial.</p>
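<p>To make the two block types concrete, here is a minimal sketch using plain dictionaries. The <code>id</code> value, the placeholder output, and the <code>build_tool_result</code> helper are made up for illustration. Only the field names follow the <code>tool_use</code> and <code>tool_result</code> shapes described above:</p>
<pre><code class="language-python"># A tool_use block as the model emits it (id and input are illustrative values).
tool_use_block = {
    "type": "tool_use",
    "id": "toolu_example_001",
    "name": "search_email",
    "input": {"email": "target@example.com"},
}

def build_tool_result(block, real_output):
    """Wrap real tool output in a tool_result block tied to the request id."""
    return {
        "type": "tool_result",
        "tool_use_id": block["id"],
        "content": real_output,
    }

# In the real framework this string would come from running holehe.
result = build_tool_result(tool_use_block, "placeholder output from the real tool")
print(result["tool_use_id"])  # prints toolu_example_001
</code></pre>
<p>The <code>tool_use_id</code> is what lets the model match each result to the call it asked for.</p>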
<h2 id="heading-how-to-install-openosint">How to Install OpenOSINT</h2>
<p>Start by cloning the repository and installing the package:</p>
<pre><code class="language-bash">git clone https://github.com/OpenOSINT/OpenOSINT.git
cd OpenOSINT
pip install -e .
</code></pre>
<p>Alternatively, if you just want to use the tool without modifying the source, install it directly from PyPI:</p>
<pre><code class="language-bash">pip install openosint
</code></pre>
<p>Next, set your Anthropic API key. This is only required for the interactive AI REPL — the direct CLI and MCP server work without it:</p>
<pre><code class="language-bash">export ANTHROPIC_API_KEY=sk-ant-...
</code></pre>
<h3 id="heading-how-to-install-the-external-tool-dependencies">How to Install the External Tool Dependencies</h3>
<p>OpenOSINT wraps several standalone OSINT tools. Install the ones you plan to use:</p>
<pre><code class="language-bash">pip install holehe            # email account enumeration
pip install sherlock-project  # username search across 300+ platforms
pip install sublist3r         # subdomain enumeration
</code></pre>
<p>For phone intelligence, <code>phoneinfoga</code> is a standalone binary. Download the release for your platform from its <a href="https://github.com/sundowndev/phoneinfoga/releases">GitHub releases page</a> and place it somewhere in your <code>PATH</code>.</p>
<h3 id="heading-how-to-configure-optional-api-keys">How to Configure Optional API Keys</h3>
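<p>Two tools read API keys from environment variables. The HaveIBeenPwned key is required for breach checks, while the ipinfo.io token is optional and only raises rate limits:</p>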
<pre><code class="language-bash">export HIBP_API_KEY=your_key    # required for breach checks via HaveIBeenPwned v3
export IPINFO_TOKEN=your_token  # optional — raises ipinfo.io rate limits
</code></pre>
<p>If a binary is missing or an API key is not configured, that specific tool returns a descriptive error string. All other tools continue to work normally.</p>
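<p>One way to get that behavior is to check for the binary or the key before running anything. The sketch below is an assumption about how such a check could look, not OpenOSINT's actual code. <code>check_tool</code> is a hypothetical helper built on the standard library's <code>shutil.which()</code> and <code>os.environ</code>:</p>
<pre><code class="language-python">import os
import shutil

def check_tool(binary=None, env_var=None):
    """Return None when the tool can run, otherwise a descriptive error string."""
    if binary is not None and shutil.which(binary) is None:
        return f"Error: '{binary}' was not found in PATH. Install it to enable this tool."
    if env_var is not None and not os.environ.get(env_var):
        return f"Error: environment variable {env_var} is not set."
    return None

# The result depends on your machine: None if available, an error string if not.
print(check_tool(binary="holehe"))
print(check_tool(env_var="HIBP_API_KEY"))
</code></pre>
<p>Returning a string instead of raising an exception means a missing tool only affects its own result, and the other tools keep working.</p>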
<h2 id="heading-how-to-use-the-interactive-ai-repl">How to Use the Interactive AI REPL</h2>
<p>Run <code>openosint</code> with no arguments to start the AI-powered REPL. You can also use <code>openosint shell</code> — it's equivalent:</p>
<pre><code class="language-bash">$ openosint
# or
$ openosint shell
</code></pre>
<p>If you prefer to pass the API key inline rather than via environment variable, use the <code>--api-key</code> flag:</p>
<pre><code class="language-bash">$ openosint --api-key sk-ant-...
</code></pre>
<p>You'll get a prompt where you can type targets or questions in natural language:</p>
<pre><code class="language-plaintext">openosint ❯ investigate target@example.com
openosint ❯ find all accounts for johndoe99
openosint ❯ what subdomains does example.com have?
openosint ❯ check if +14155552671 is a mobile number
</code></pre>
<p>The agent decides which tools to run based on your input. You don't need to specify which tools to use or in what order. If you type an email address, the agent will run email enumeration. If it finds a linked username, it may pivot and search that username across platforms.</p>
<p>Reports are saved automatically to the <code>reports/</code> directory after every investigation that produces structured findings.</p>
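<p>As a rough sketch of what saving a report involves, the snippet below writes Markdown to a dated file. The file name pattern is taken from the sample session earlier (<code>reports/2026-05-11_report.md</code>). The <code>save_report</code> function itself is hypothetical and may differ from what OpenOSINT does internally:</p>
<pre><code class="language-python">import tempfile
from datetime import date
from pathlib import Path

def save_report(markdown, directory="reports", today=None):
    """Write the report to directory/YYYY-MM-DD_report.md and return the path."""
    today = today or date.today()
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{today.isoformat()}_report.md"
    path.write_text(markdown, encoding="utf-8")
    return path

# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_report("## Online Presence\n", directory=tmp, today=date(2026, 5, 11))
    print(saved.name)  # prints 2026-05-11_report.md
</code></pre>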
<p>Here are the commands available inside the REPL:</p>
<table>
<thead>
<tr>
<th>Command</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td><code>clear</code></td>
<td>Reset the conversation memory</td>
</tr>
<tr>
<td><code>save</code></td>
<td>Manually save the last report</td>
</tr>
<tr>
<td><code>tools</code></td>
<td>Show available tools and their status</td>
</tr>
<tr>
<td><code>config</code></td>
<td>Show current configuration</td>
</tr>
<tr>
<td><code>help</code></td>
<td>List all commands</td>
</tr>
<tr>
<td><code>exit</code> or Ctrl-D</td>
<td>Quit</td>
</tr>
</tbody></table>
<h2 id="heading-how-to-run-individual-tools-from-the-cli">How to Run Individual Tools from the CLI</h2>
<p>If you want to run a single tool without the AI layer — for scripting, automation, or quick lookups — use the direct CLI:</p>
<pre><code class="language-bash"># Email account enumeration (default timeout: 120s)
openosint email target@example.com

# With a custom timeout in seconds
openosint email target@example.com -t 60

# Username search across 300+ platforms (default timeout: 180s)
openosint username johndoe99

# Enable verbose output for debugging
openosint -v email target@example.com
</code></pre>
<p>The direct CLI doesn't require an Anthropic API key. It runs the underlying binary and prints the output to the terminal.</p>
<p>This mode is useful when you need predictable, scriptable behavior — for example, piping output into another tool or running automated checks.</p>
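<p>For example, you could call the direct CLI from a Python script. This is a minimal sketch that assumes <code>openosint</code> is installed and on your <code>PATH</code>. <code>build_command</code> and <code>run_lookup</code> are hypothetical helpers, and the <code>-t</code> flag is only shown above for the <code>email</code> subcommand:</p>
<pre><code class="language-python">import subprocess

def build_command(kind, target, timeout=None):
    """Build the argv list for the direct CLI, for example: openosint email ADDRESS -t 60."""
    command = ["openosint", kind, target]
    if timeout is not None:
        command += ["-t", str(timeout)]
    return command

def run_lookup(kind, target, timeout=None):
    """Run the CLI and return its standard output as text."""
    completed = subprocess.run(
        build_command(kind, target, timeout),
        capture_output=True,
        text=True,
        check=False,
    )
    return completed.stdout

print(build_command("email", "target@example.com", timeout=60))
# prints ['openosint', 'email', 'target@example.com', '-t', '60']
</code></pre>
<p>Calling <code>run_lookup("email", "target@example.com")</code> would then give you the tool's output as a string that you can filter or pass to another program.</p>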
<h2 id="heading-how-to-set-up-the-mcp-server">How to Set Up the MCP Server</h2>
<p>OpenOSINT also ships as a Model Context Protocol (MCP) server. This exposes all 9 tools to any MCP-compatible AI client.</p>
<h3 id="heading-how-to-register-with-claude-code">How to Register with Claude Code</h3>
<pre><code class="language-bash">claude mcp add openosint python /absolute/path/to/OpenOSINT/openosint/mcp_server.py
</code></pre>
<p>Verify the registration worked:</p>
<pre><code class="language-bash">claude mcp list
</code></pre>
<p>Once registered, you can drive investigations from the Claude Code prompt:</p>
<pre><code class="language-plaintext">&gt; Investigate target@example.com. If you find a linked username,
  trace it across other platforms and compile a full report.
</code></pre>
<h3 id="heading-how-to-configure-claude-desktop">How to Configure Claude Desktop</h3>
<p>Add the following to your Claude Desktop config. On macOS, the file is at <code>~/Library/Application Support/Claude/claude_desktop_config.json</code>:</p>
<pre><code class="language-json">{
  "mcpServers": {
    "openosint": {
      "command": "python",
      "args": ["/absolute/path/to/OpenOSINT/openosint/mcp_server.py"]
    }
  }
}
</code></pre>
<p>Restart Claude Desktop after saving the file. The tools will appear in Claude's tool list.</p>
<p>The MCP server uses stdio transport and does not need a persistent background process. Claude Code or Claude Desktop starts it on demand.</p>
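<p>If your config file already lists other MCP servers, you'll want to add the <code>openosint</code> entry without overwriting them. Here is a small sketch of that merge. <code>add_openosint_server</code> and the <code>another-server</code> entry are made up for illustration:</p>
<pre><code class="language-python">import json

def add_openosint_server(existing_config, server_script):
    """Return a copy of the config with an 'openosint' MCP server entry added."""
    merged = dict(existing_config)
    servers = dict(merged.get("mcpServers", {}))
    servers["openosint"] = {"command": "python", "args": [str(server_script)]}
    merged["mcpServers"] = servers
    return merged

# A config that already contains one (hypothetical) server.
current = {"mcpServers": {"another-server": {"command": "node", "args": ["server.js"]}}}
updated = add_openosint_server(current, "/absolute/path/to/OpenOSINT/openosint/mcp_server.py")
print(json.dumps(updated, indent=2))
</code></pre>
<p>The function returns a new dictionary and leaves the original untouched, so you can inspect the result before writing it back to the file.</p>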
<h2 id="heading-how-the-agent-loop-works-under-the-hood">How the Agent Loop Works Under the Hood</h2>
<p>Here is a simplified version of the agent loop from <code>openosint/agent.py</code>:</p>
<pre><code class="language-python">import anthropic
import asyncio

client = anthropic.Anthropic()

async def run_investigation(user_prompt: str) -&gt; str:
    messages = [{"role": "user", "content": user_prompt}]

    while True:
        response = client.messages.create(
            model="claude-...",   # model ID (elided here); the API key comes from --api-key or the environment
            max_tokens=4096,
            tools=TOOL_SCHEMAS,   # JSON schemas for all 9 tools
            messages=messages
        )

        # Agent is done — extract and return the final report
        if response.stop_reason == "end_turn":
            return extract_text(response)

        # Agent needs a tool — run the real binary
        if response.stop_reason == "tool_use":
            tool_results = []

            for block in response.content:
                if block.type == "tool_use":
                    # Runs holehe, sherlock, etc. as real subprocesses
                    real_output = await execute_tool(block.name, block.input)

                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": real_output  # real output, never generated
                    })

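            # Append assistant turn and real tool results to conversation
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
        else:
            # Any other stop reason (for example "max_tokens"): return what we have
            # instead of resending the same request forever
            return extract_text(response)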
</code></pre>
<p>There are a few important things to understand in this code.</p>
<ol>
<li><p><strong>The loop runs until</strong> <code>stop_reason == "end_turn"</code>: The agent decides when it has gathered enough information to write the final report. It may call one tool or ten, depending on what it finds.</p>
</li>
<li><p><code>execute_tool()</code> <strong>runs real subprocesses</strong>: It's a thin async wrapper around Python's <code>asyncio.create_subprocess_exec()</code> with a configurable timeout. There's no simulation and no mocked data at any point.</p>
</li>
<li><p><strong>Conversation history is maintained across the entire loop</strong>: Each tool result goes back into <code>messages</code>, so the model always has full context of what it found when deciding what to run next.</p>
</li>
<li><p><strong>Tool schemas are defined as JSON</strong>: Each tool has a name, description, and parameter schema. The model uses these to know what tools exist and what arguments they accept. Here's a simplified example for <code>search_email</code>:</p>
</li>
</ol>
<pre><code class="language-python">{
    "name": "search_email",
    "description": (
        "Enumerates online services and social accounts "
        "associated with an email address using holehe."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "email": {
                "type": "string",
                "description": "Target email address"
            }
        },
        "required": ["email"]
    }
}
</code></pre>
<p>The same pattern applies to all 9 tools. The model reads these schemas at the start of every request and uses them to decide what's available and how to call it.</p>
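<p>The subprocess wrapper mentioned in point 2 can be sketched in a few lines. This is not the project's <code>execute_tool()</code>. It's a minimal illustration of running a command with <code>asyncio.create_subprocess_exec()</code> and a timeout, and <code>run_binary</code> is a hypothetical name:</p>
<pre><code class="language-python">import asyncio
import sys

async def run_binary(argv, timeout=120.0):
    """Run a command and return its combined output, or an error string on failure."""
    try:
        process = await asyncio.create_subprocess_exec(
            *argv,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,
        )
    except FileNotFoundError:
        return f"Error: '{argv[0]}' is not installed or not in PATH."
    try:
        output, _ = await asyncio.wait_for(process.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # Stop the child process so it does not keep running in the background.
        process.kill()
        await process.wait()
        return f"Error: '{argv[0]}' timed out after {timeout} seconds."
    return output.decode(errors="replace")

# Demonstration with the Python interpreter itself instead of an OSINT binary.
print(asyncio.run(run_binary([sys.executable, "-c", "print('hello')"])))
</code></pre>
<p>Whatever the command prints is returned as is, which is the property the agent loop relies on: the text that goes into <code>tool_result</code> comes from the process, not from the model.</p>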
<h2 id="heading-project-architecture">Project Architecture</h2>
<p>The codebase is organized in five layers. The hard rule across the codebase is that no layer imports from a layer above it:</p>
<pre><code class="language-plaintext">openosint/tools/        Core tools
                        Async wrappers around external binaries and APIs.
                        Stateless. No AI. No CLI. Pure functions.

openosint/agent.py      AI agent
                        Anthropic tool use loop.
                        Per-session conversation history.
                        Imports from tools/. Nothing imports from agent.py.

openosint/repl.py       Interactive REPL (prompt_toolkit + Rich)
openosint/mcp_server.py MCP server (stdio transport)
openosint/cli.py        CLI entry point
</code></pre>
<p>This separation makes each layer independently testable. The core tools are pure async functions that take a string and return a string — you can unit test them without touching the agent or the CLI.</p>
<p>It also means the AI layer is entirely optional. If you don't have an Anthropic API key, you use the CLI and bypass the agent. The MCP server also operates independently of the agent.</p>
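<p>Here is what testing a tool in isolation might look like. The function below is a hypothetical stand-in loosely modeled on <code>generate_dorks</code> (the real tool returns 12 URLs, and its queries may differ). It only shows the string-in, string-out shape and a test that needs no agent, CLI, or network:</p>
<pre><code class="language-python">import asyncio
from urllib.parse import quote_plus

async def generate_dorks_sketch(target):
    """Return Google search URLs for a target as one newline-separated string."""
    queries = [
        f'"{target}"',
        f'"{target}" filetype:pdf',
        f'"{target}" site:pastebin.com',
    ]
    urls = ["https://www.google.com/search?q=" + quote_plus(q) for q in queries]
    return "\n".join(urls)

def test_generate_dorks_sketch():
    # The tool is a coroutine, so the test drives it with asyncio.run().
    output = asyncio.run(generate_dorks_sketch("target@example.com"))
    lines = output.splitlines()
    assert len(lines) == 3
    assert all(line.startswith("https://www.google.com/search?q=") for line in lines)

test_generate_dorks_sketch()
</code></pre>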
<h3 id="heading-the-9-available-tools">The 9 Available Tools</h3>
<table>
<thead>
<tr>
<th>Tool</th>
<th>Backend</th>
<th>What it returns</th>
</tr>
</thead>
<tbody><tr>
<td><code>search_email</code></td>
<td>holehe</td>
<td>Social accounts linked to an email</td>
</tr>
<tr>
<td><code>search_username</code></td>
<td>sherlock</td>
<td>Accounts across 300+ platforms</td>
</tr>
<tr>
<td><code>search_breach</code></td>
<td>HaveIBeenPwned v3</td>
<td>Breach names, dates, leaked data types</td>
</tr>
<tr>
<td><code>search_whois</code></td>
<td>python-whois</td>
<td>Registrant, registrar, creation/expiry</td>
</tr>
<tr>
<td><code>search_ip</code></td>
<td>ipinfo.io</td>
<td>Geolocation, ASN, hostname, org</td>
</tr>
<tr>
<td><code>search_domain</code></td>
<td>sublist3r</td>
<td>Subdomain enumeration</td>
</tr>
<tr>
<td><code>generate_dorks</code></td>
<td>built-in</td>
<td>12 targeted Google dork URLs, no network calls</td>
</tr>
<tr>
<td><code>search_paste</code></td>
<td>psbdmp.ws</td>
<td>Pastebin dump mentions</td>
</tr>
<tr>
<td><code>search_phone</code></td>
<td>phoneinfoga</td>
<td>Carrier, country, line type</td>
</tr>
</tbody></table>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you learned how to set up and use OpenOSINT — a Python OSINT framework built on Claude's tool use API.</p>
<p>The key takeaway is the design principle: by using native tool use, the agent never generates tool output. It only reads real output from real binaries. This makes it suitable for security research where accuracy matters and hallucination isn't an acceptable failure mode.</p>
<p>To recap the three interfaces:</p>
<ul>
<li><p>Run <code>openosint</code> for the interactive AI REPL — best for full investigations with automatic chaining</p>
</li>
<li><p>Run <code>openosint email</code> or <code>openosint username</code> for direct CLI access — best for scripting and automation</p>
</li>
<li><p>Register the MCP server in Claude Code or Claude Desktop to run investigations inside your existing AI environment</p>
</li>
</ul>
<p>The full source code is available on <a href="https://github.com/OpenOSINT/OpenOSINT">GitHub</a> under the MIT license. Contributions and issues are welcome.</p>
<p><strong>Legal note</strong>: OpenOSINT is for authorized security research, penetration testing, and investigative journalism only. Users are solely responsible for compliance with applicable law, including GDPR, CCPA, and the CFAA. See the <a href="https://github.com/OpenOSINT/OpenOSINT/blob/main/DISCLAIMER.md">DISCLAIMER.md</a> for the full notice.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build Optimal AI Agents That Actually Work – A Handbook for Devs ]]>
                </title>
                <description>
                    <![CDATA[ Since moving to Silicon Valley in 2025, I've seen AI everywhere. And after I attended NVIDIA GTC 2025, one thing became very clear from many conversations I had: most companies now have AI agents runn ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-optimal-ai-agents-that-actually-work-a-handbook-for-devs/</link>
                <guid isPermaLink="false">6a024a82fca21b0d4b6c5283</guid>
                
                    <category>
                        <![CDATA[ ai-agent ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Mon, 11 May 2026 21:30:42 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/f1ca2c84-0c3f-4f20-84f2-9bad5cc1c915.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Since moving to Silicon Valley in 2025, I've seen AI everywhere. And after I attended NVIDIA GTC 2025, one thing became very clear from many conversations I had: most companies now have AI agents running successfully in various projects or departments.</p>
<p>But almost no one has managed to roll them out well across an entire organization. And even where agents are deployed, they're often poorly organized.</p>
<p>Companies are shipping agent systems almost by guessing.</p>
<p>Some of the questions I heard were:</p>
<ul>
<li><p>What's the right number of AI agents in a team?</p>
</li>
<li><p>What's the best model provider to use?</p>
</li>
<li><p>Should the agents have a "boss" agent supervising them, or should they coordinate peer-to-peer?</p>
</li>
</ul>
<p>In other words, the main question was:</p>
<blockquote>
<p>What is the best organizational structure for a team of AI agents?</p>
</blockquote>
<p>This article tries to answer exactly that.</p>
<p>I previously wrote <a href="https://www.freecodecamp.org/news/the-math-behind-artificial-intelligence-book/">a book on the math behind AI</a>, so we won't be doing any math here.</p>
<p>Instead, we'll focus on how to organize agents for real business cases.</p>
<p>We'll use a recent AI paper from Google Research, Google DeepMind, and MIT — <a href="https://research.google/blog/towards-a-science-of-scaling-agent-systems-when-and-why-agent-systems-work/">Towards a Science of Scaling Agent Systems: When and Why Agent Systems Work</a> as our primary source.</p>
<p>For the code, I'll use a Jupyter notebook in Google Colab.</p>
<h3 id="heading-heres-what-well-cover">Here's What We'll Cover:</h3>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-is-an-llm">What is an LLM?</a></p>
</li>
<li><p><a href="#heading-what-are-ai-agents">What Are AI Agents?</a></p>
</li>
<li><p><a href="#heading-a-decision-algorithm-for-creating-optimal-ai-agents">A Decision Algorithm for Creating Optimal AI Agents</a></p>
</li>
<li><p><a href="#heading-three-code-examples">Three Code Examples</a></p>
<ul>
<li><p><a href="#heading-1-installing-utilities-python-libraries-and-doing-config">1. Installing Utilities, Python Libraries, and Doing Config</a></p>
</li>
<li><p><a href="#heading-2-starting-the-ollama-server-getting-the-model-and-tools">2. Starting the Ollama Server, Getting the Model and Tools</a></p>
</li>
<li><p><a href="#heading-3-testing-the-model">3. Testing the Model</a></p>
</li>
<li><p><a href="#heading-4-running-the-ai-agents">4. Running the AI Agents</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-conclusion-the-future-of-ai-is-evals">Conclusion: The Future of AI is Evals</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You don't need to be an expert developer to create AI agents. There are many no-code tools that can help you through the process.</p>
<p>But to get the most out of the examples here (and to be able to check your agents' work and understand what they're doing), you'll need:</p>
<ul>
<li><p>A general understanding of Python and what an LLM is.</p>
</li>
<li><p>Ollama installed on your machine to run large language models locally and for free.</p>
</li>
<li><p>A Jupyter Notebook setup. Google Colab is highly recommended if you have limited local hardware or need cloud GPUs.</p>
</li>
</ul>
<p>Let's get into it!</p>
<h2 id="heading-what-is-an-llm">What is an LLM?</h2>
<p>An LLM (Large Language Model) is like a very well-read intern who has never left the library.</p>
<p>The LLM can quote, summarize, translate, and imitate almost any style. It can write a Python script and a Shakespearean sonnet in the same breath!</p>
<p>But it has limitations. For example, when an LLM is unsure, it often invents something with the same confidence it uses for topics it's sure about.</p>
<p>This is called hallucination.</p>
<p>Also, LLMs don't have memory between conversations by default, and they can't do anything on their own. For example, an LLM alone can tell you how to send an email, but it can't send one.</p>
<p>This is where agents come in.</p>
<h2 id="heading-what-are-ai-agents">What Are AI Agents?</h2>
<p>If an LLM is like an intern, an AI agent is that same intern given a desk, a laptop, and a to-do list – and the ability to act.</p>
<p>An agent is essentially an LLM that has been wrapped in tools, memory, and a loop.</p>
<p>Tools allow the agent to do things like search the web, read a particular file, send an email, and run code. Memory allows the LLM to remember what it did before in other tasks. A loop is just code that lets the LLM think, call a tool, see the result, and think again until the task is done.</p>
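<p>To make the tools, memory, and loop idea concrete, here's a minimal sketch in plain Python. Everything in it is an assumption made for illustration: <code>fake_llm</code> is a hard-coded stand-in for a real model, and <code>TOOLS</code> holds a single toy tool. Real frameworks are far more involved, but the shape is similar.</p>
<pre><code class="language-python">def fake_llm(history):
    """Hypothetical stand-in for an LLM: decides the next step from the history."""
    # If no tool result is in the history yet, ask for a tool call.
    if not any(role == "tool" for role, _ in history):
        return {"action": "call_tool", "tool": "add", "args": (2, 3)}
    # Otherwise, answer using the last tool result.
    return {"action": "finish", "answer": f"The sum is {history[-1][1]}"}


TOOLS = {"add": lambda a, b: a + b}  # tools: things the agent can do


def run_agent(task, max_steps=5):
    history = [("user", task)]  # memory: what has happened so far
    for _ in range(max_steps):  # loop: think, act, observe, repeat
        step = fake_llm(history)
        if step["action"] == "finish":
            return step["answer"]
        result = TOOLS[step["tool"]](*step["args"])
        history.append(("tool", result))
    return "Stopped: step limit reached"


print(run_agent("What is 2 + 3?"))
</code></pre>
<p>The step limit is there so a confused model can't loop forever.</p>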
<p>In many cases, an individual agent is very useful. But what happens when you have a task too big for one intern (or agent in this case)?</p>
<p>Naturally, you can hire more interns! But you get new problems:</p>
<ul>
<li><p>Should you have one intern with a long to-do list (single-agent)?</p>
</li>
<li><p>Should you have five interns all working on the same task independently (independent multi-agent)?</p>
</li>
<li><p>How many interns should be on a team?</p>
</li>
<li><p>Should a boss who assigns subtasks manage the interns?</p>
</li>
<li><p>Should you have a group of peers who coordinate among themselves? A mix?</p>
</li>
</ul>
<p>This is the exact question the Google paper we're using as our primary source here tries to answer with over 150 controlled experiments.</p>
<p>Just keep in mind that having more agents doesn't always mean you'll get better results. Sometimes one agent is a perfect fit. And other times you'll need more.</p>
<h3 id="heading-some-background">Some Background</h3>
<p>Before we dive in, an important note: these are experimental findings, not laws of physics.</p>
<p>The Google paper systematically evaluated many possible agent team configurations across several model providers.</p>
<p>Some of the providers were:</p>
<ul>
<li><p>OpenAI (ChatGPT)</p>
</li>
<li><p>Google (Gemini)</p>
</li>
<li><p>Anthropic (Claude)</p>
</li>
</ul>
<p>The results of each differed by model family:</p>
<ul>
<li><p>OpenAI models gained most from centralized/hybrid setups</p>
</li>
<li><p>Google models showed a clear efficiency plateau</p>
</li>
<li><p>Anthropic models were more sensitive to coordination overhead.</p>
</li>
</ul>
<p>Since it's a persuasive study based on a lot of experiments, your team can consider these to be strong guidelines you can use when choosing a model family.</p>
<h2 id="heading-a-decision-algorithm-for-creating-optimal-ai-agents">A Decision Algorithm for Creating Optimal AI Agents</h2>
<p>Now, we'll take the research in the article and convert it into a simple-to-apply algorithm that anyone can use to create AI agents to automate their work.</p>
<p>The main objective of this algorithm is to help you decide, with the Google paper as a scientific reference, if you need just one agent or a couple more.</p>
<p>This way, instead of explaining the article step by step, I'll show you how to actually apply it to solve your problems.</p>
<h3 id="heading-1-check-your-budget">1. Check Your Budget</h3>
<p>If you have limited hardware, I recommend starting with Ollama.</p>
<p>Ollama is a tool that allows you to run LLMs on your personal computer. And when you run it locally, it's free (and open source).</p>
<p>If you use an API from OpenAI, Google, or Anthropic to access their models, you'll start spending money.</p>
<p>As of May 6, 2026, OpenAI's GPT-5.5 costs $5.00 per 1M tokens, while GPT-5.4 mini costs $0.75 per 1M tokens.</p>
<p>If you have limited cloud resources, you can use Google Colab to access GPUs and run larger and newer billion-parameter LLMs. Often, newer LLMs have better results in image generation, coding, and other tasks.</p>
<p>You can also use LLMs with Ollama in Google Colab.</p>
<p>If you have a company project, I recommend this same cloud-based option. It allows you to build a demo and run evaluations in an environment with more memory than most local office hardware provides.</p>
<p>If you have a flexible budget, you can use professional APIs like Claude or Gemini.</p>
<p>Always remember that agents cost tokens, and tokens cost money.</p>
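<p>To get a feel for how quickly tokens add up, here's a rough back-of-the-envelope sketch using the two prices quoted above. The model labels, the workload numbers, and the single flat price per token are all simplifying assumptions, so treat the result as an estimate only.</p>
<pre><code class="language-python"># Prices per 1M tokens as quoted above (May 6, 2026). Check current pricing before relying on them.
PRICE_PER_MILLION_USD = {"gpt-5.5": 5.00, "gpt-5.4-mini": 0.75}


def estimate_cost(model, tokens_per_run, runs):
    """Rough cost in USD, assuming one flat price for all tokens."""
    total_tokens = tokens_per_run * runs
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_USD[model]


# Hypothetical workload: 100 agent runs using 20,000 tokens each.
for model in PRICE_PER_MILLION_USD:
    print(model, round(estimate_cost(model, 20_000, 100), 2))
</code></pre>
<p>Multi-agent teams multiply <code>tokens_per_run</code>, since every agent reads and writes its own context.</p>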
<h3 id="heading-2-start-with-only-one-agent">2. Start with Only ONE Agent</h3>
<p>Always begin with a single agent. Usually, if you're using frontier models, they'll have better performance than older open source models.</p>
<h3 id="heading-3-measure-performance">3. Measure Performance</h3>
<p>According to the paper, if a single agent's real-world success rate (how well it works and how accurately it performs) is more than 45%, then there's typically no need to create a team of agents for the task.</p>
<p>To measure this, run the agent on 50–100 representative tasks. Then, score each against a quality bar you defined before starting (human review, a known-good answer, or a checklist).</p>
<p>Note that the paper's 45% finding is only one-directional: it identifies when <strong>not</strong> to add agents (above 45%). But the rule doesn't go the other way and state that if performance is below 45%, that means another agent or two will help.</p>
<p>The authors state that "coordination benefits arise from matching communication topology to task structure, not from scaling the number of agents".</p>
<p>Basically, if your agent underperforms, fix the agent first! Don't just automatically think you need another agent.</p>
<p>If you determine, for your project, that a single agent works, then go ahead to step 7.</p>
<p>If the single agent's performance is below 45%, first try improving it (better prompts, tools, or model). Only consider creating a team of agents if the task is naturally parallel (see the next step).</p>
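<p>Here's a minimal sketch of that measurement. The pass/fail scores are made up for illustration, and the function names are my own. Only the 45% threshold comes from the paper.</p>
<pre><code class="language-python">SINGLE_AGENT_THRESHOLD = 0.45  # the cutoff reported in the paper


def success_rate(passed):
    """Share of tasks that met the quality bar; `passed` is a list of booleans."""
    return sum(passed) / len(passed)


def next_step(rate):
    if rate &gt; SINGLE_AGENT_THRESHOLD:
        return "keep the single agent and move on to evaluations (step 7)"
    return "improve prompts, tools, or model, then check task parallelism (step 4)"


# Hypothetical scores: 60 representative tasks, 33 of which passed review.
passed = [True] * 33 + [False] * 27
rate = success_rate(passed)
print(f"{rate:.0%}", next_step(rate))
</code></pre>
<p>Define what counts as a pass before you run the tasks, so the score isn't bent to fit the results.</p>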
<h3 id="heading-4-assess-task-parallelism">4. Assess Task Parallelism</h3>
<p>A big question then becomes, why use multiple agents at all? Here's how you can decide:</p>
<p>If your task involves just one continuous job, a single agent typically does it better and cheaper.</p>
<p>But multiple agents can help when you can clearly split your project into discrete subtasks. Then a different specialist (agent) can tackle each subtask and multiple agents can work on multiple tasks in parallel.</p>
<p>In this step of our algorithm, you want to see if the task you're trying to apply the AI agents to is naturally parallel.</p>
<p>A task is naturally parallel if it can be split into independent subtasks. For example:</p>
<ul>
<li><p>Searching for the best flight across five different websites.</p>
</li>
<li><p>Summarizing ten separate news articles at once.</p>
</li>
</ul>
<p>Examples where tasks are not naturally parallel:</p>
<ul>
<li><p>Planning a trip from start to finish (you must choose a destination before booking a hotel, for example – so those tasks can't be completed in parallel).</p>
</li>
<li><p>Managing a bank transfer (the funds must be verified before they're sent).</p>
</li>
</ul>
<p>If the task is naturally parallel, you may benefit from more agents, and you should continue on to step 5.</p>
<p>If it's not (the task is sequential or step-by-step), stop. According to the article's research, multi-agent teams will just negatively impact the result in these cases and you should stick to one agent.</p>
<p>In this case (not naturally parallel), you can just work on improving your prompts, tools, or your model for the single agent. Then after it beats the 45%, go to step 7.</p>
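<p>The difference between the two kinds of task can be sketched in a few lines of plain Python. The <code>summarize</code>, <code>choose_destination</code>, and <code>book_hotel</code> functions are hypothetical stand-ins for agent calls, and the articles are made-up sample text.</p>
<pre><code class="language-python">from concurrent.futures import ThreadPoolExecutor


def summarize(article):
    """Hypothetical stand-in for an agent call: keeps the first sentence."""
    return article.split(".")[0] + "."


articles = [
    "Lithium prices rose. Analysts expect more volatility.",
    "A new battery plant opened. It will employ 500 people.",
    "Shipping costs fell. Exporters welcomed the change.",
]

# Naturally parallel: no summary depends on another, so they can run side by side.
with ThreadPoolExecutor(max_workers=3) as pool:
    summaries = list(pool.map(summarize, articles))

print(summaries)


# Not naturally parallel: each step needs the previous step's result.
def choose_destination():
    return "Lisbon"


def book_hotel(destination):
    return f"Hotel booked in {destination}"


print(book_hotel(choose_destination()))
</code></pre>
<p>If you can't write your task in the first style without one call needing another call's output, it probably isn't naturally parallel.</p>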
<h3 id="heading-5-pick-the-topology-by-task-type">5. Pick the Topology by Task Type</h3>
<p>Now we'll decide on the structure for our agent team.</p>
<p>Topology simply means the structure of a system. In this case, we're talking about the structure of the team of AI agents.</p>
<p>This step only applies once you've decided you need multiple agents. Both topologies we'll examine here are multi-agent.</p>
<p>If the task is based on analysis or structured work, it's better to use a centralized model. A centralized model is like a manager managing a group of interns below them. The interns report to the manager, and the manager coordinates them.</p>
<p>A centralized model is good for pipelines like financial reports.</p>
<p>According to the study, centralized coordination reduces error amplification from about 17x (for independent agents with no coordination) to about 4x. In other words, a mistake that would snowball into roughly 17 errors among uncoordinated interns leads to closer to 4 when a manager reviews their work.</p>
<p>If the task is more related to exploration, use a decentralized model.</p>
<p>They're good for open-ended research or audits where agents review the same material from different angles.</p>
<p>A decentralized model is like interns in a team brainstorming ideas for a new product for the company or discussing over lunch how to make a process faster.</p>
<h3 id="heading-6-cap-the-team-size-and-available-tools-per-agent">6. Cap the Team Size and Available Tools Per Agent</h3>
<p>According to the paper, AI agent success starts to degrade after about 3–4 agents.</p>
<p>They also explain that each agent should have access to the minimum tools necessary (1–3 tools per agent). The more tools each agent has, the worse it performs.</p>
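<p>Steps 3 to 6 can be condensed into a small helper like the one below. This is only a sketch of the decision flow as I've described it, not code from the paper. The function name and the <code>"analysis"</code> / <code>"exploration"</code> labels are my own, and it assumes you've already tried improving an underperforming single agent.</p>
<pre><code class="language-python">def recommend_setup(success_rate, is_parallel, task_type):
    """Return a recommendation following steps 3 to 6 of the algorithm.

    task_type is "analysis" or "exploration" (labels chosen for this sketch).
    """
    # Step 3: a single agent above the 45% mark usually doesn't need a team.
    if success_rate &gt; 0.45:
        return "single agent"
    # Step 4: sequential tasks tend to get worse with more agents.
    if not is_parallel:
        return "single agent (improve prompts, tools, or model)"
    # Step 5: pick the topology by task type.
    topology = "centralized" if task_type == "analysis" else "decentralized"
    # Step 6: cap team size and tools per agent.
    return f"{topology} team, at most 3-4 agents, 1-3 tools each"


print(recommend_setup(0.60, True, "analysis"))
print(recommend_setup(0.30, False, "analysis"))
print(recommend_setup(0.30, True, "exploration"))
</code></pre>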
<h3 id="heading-7-build-evaluations">7. Build Evaluations</h3>
<p>Now you have something that works most of the time. But how can you ensure the agents will hold up as they scale across the organization? To find out, you need to establish internal tests before scaling the agents.</p>
<p>These internal tests are called evals (evaluations).</p>
<p>For each evaluation, you'll need to have clear metrics that let you know how the agents are performing in each evaluation.</p>
<p>You'll want to measure things like accuracy, efficiency, and trajectory. Accuracy tells us if the model got it right. Efficiency reports how fast and cheap it was to process the request. And trajectory shows if the model used the right tools to do the task.</p>
<p>Remember, in AI and engineering in general, if you can't measure the system's performance, you can't trust the system.</p>
<p>This way, you can start seeing how well the model performs with the data your organization works with and its context. Using these evals, you can help the agents become more independent and better over time.</p>
<p>Evals might be:</p>
<ul>
<li><p>Input: emails. Expected output: responses.</p>
</li>
<li><p>Input: customer support transcripts. Expected output: summarized action items.</p>
</li>
<li><p>Input: complex legal contracts. Expected output: identified high-risk clauses.</p>
</li>
</ul>
<p>Then you see how close the agent's or agents' outputs are to the expected output.</p>
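<p>A tiny eval harness might look like the sketch below. The <code>fake_agent</code> function and the two test cases are made up for illustration. A real agent would return free-form text, so you'd likely need a fuzzier comparison than exact string equality.</p>
<pre><code class="language-python">import time


def fake_agent(text):
    """Hypothetical agent: returns an answer plus the tools it used."""
    return {"output": text.upper(), "tools_used": ["formatter"]}


EVAL_CASES = [
    {"input": "refund approved", "expected": "REFUND APPROVED", "expected_tools": ["formatter"]},
    {"input": "ticket closed", "expected": "Ticket Closed", "expected_tools": ["formatter"]},
]


def run_evals(agent, cases):
    correct = 0
    right_tools = 0
    start = time.perf_counter()
    for case in cases:
        result = agent(case["input"])
        # Accuracy: did the output match the expected answer?
        correct += result["output"] == case["expected"]
        # Trajectory: did the agent use the tools we expected?
        right_tools += result["tools_used"] == case["expected_tools"]
    # Efficiency: average wall-clock time per case.
    elapsed = time.perf_counter() - start
    return {
        "accuracy": correct / len(cases),
        "trajectory": right_tools / len(cases),
        "seconds_per_case": elapsed / len(cases),
    }


print(run_evals(fake_agent, EVAL_CASES))
</code></pre>
<p>Here the second case fails on purpose, so accuracy comes out at 0.5 while trajectory stays at 1.0.</p>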
<p>You can also try different models and go through this decision algorithm again to see which models work best for your use case. After all, new models are often better than previous models.</p>
<p>With this workflow in place, you'll create more accurate and efficient agents.</p>
<p>Now let's look at this algorithm in action using three use cases.</p>
<h2 id="heading-three-code-examples">Three Code Examples</h2>
<p>In this section, I'll explain how I ran the code in the Jupyter notebook. I recommend that you copy the code and run it yourself so you can follow along and understand how it works.</p>
<p>We'll walk through the code in the same sections I used in the Google Colab notebook, so it's easy to follow.</p>
<p>You can find the code <a href="https://github.com/tiagomonteiro0715/How-to-Build-Optimal-AI-Agents-That-Actually-Work-Handbook">here on GitHub as well</a>. I used the MIT license for this code.</p>
<h3 id="heading-1-installing-utilities-python-libraries-and-doing-config">1. Installing Utilities, Python Libraries, and Doing Config</h3>
<pre><code class="language-python">!sudo apt update &amp;&amp; sudo apt install -y pciutils
!sudo apt-get install -y zstd
!curl -fsSL https://ollama.com/install.sh | sh
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/c91a3d8b-18dd-4850-bca6-ae707e69736c.png" alt="c91a3d8b-18dd-4850-bca6-ae707e69736c" style="display:block;margin:0 auto" width="2132" height="664" loading="lazy">

<p>This code essentially prepares the notebook to run AI agents.</p>
<p>The first line updates the package list and installs hardware detection tools to identify your GPU. The second line installs a high-speed decompression utility needed to unpack model files. Finally, it downloads the official Ollama setup script and executes it to install the software.</p>
<p>Ollama is an open-source tool that allows you to use LLMs on your computer.</p>
<pre><code class="language-python">!pip install uv
!uv pip install langchain-ollama ollama crewai duckduckgo-search langchain-community ddgs faker
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/d86340f3-3a19-4a89-9975-ecb4116d379a.png" alt="d86340f3-3a19-4a89-9975-ecb4116d379a" style="display:block;margin:0 auto" width="3680" height="752" loading="lazy">

<p>Here, we installed the <code>uv</code> Python package. It's a package installer that works like pip but is typically much faster.</p>
<p>With this, we can download the rest of the Python libraries much more quickly.</p>
<pre><code class="language-python">import socket
import subprocess
import threading
import time

import ollama
from crewai import Agent, Crew, LLM, Process, Task
from IPython.display import Markdown
from langchain_ollama.llms import OllamaLLM

from crewai.tools import tool
from langchain_community.tools import DuckDuckGoSearchRun

from faker import Faker
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/60effe35-2293-4201-afb0-f561a64470e4.png" alt="60effe35-2293-4201-afb0-f561a64470e4" style="display:block;margin:0 auto" width="2492" height="1652" loading="lazy">

<p>With the above code, we imported all the Python libraries needed to create optimal AI agents.</p>
<p>Let's see what each one does:</p>
<ul>
<li><p><a href="https://github.com/python/cpython/blob/main/Lib/socket.py">socket</a>: Connects your computer to others over a network.</p>
</li>
<li><p><a href="https://github.com/python/cpython/blob/main/Lib/subprocess.py">subprocess</a>: Lets Python launch and control other programs on your computer.</p>
</li>
<li><p><a href="https://github.com/python/cpython/blob/main/Lib/threading.py">threading</a>: Runs multiple tasks at once so one slow process doesn't freeze the whole code.</p>
</li>
<li><p><a href="https://github.com/python/cpython/blob/main/Modules/timemodule.c">time</a>: Handles delays and timestamps, like making the code wait or measuring speed.</p>
</li>
<li><p><a href="https://github.com/ollama/ollama-python">ollama</a>: The tool we'll use for talking to AI models running locally on your machine.</p>
</li>
<li><p><a href="https://github.com/crewAIInc/crewAI">crewai</a>: Organizes multiple AI agents to work together like a specialized team.</p>
</li>
<li><p><a href="https://github.com/ipython/ipython">IPython</a>: Powers interactive coding features and pretty-printing in tools like Jupyter.</p>
</li>
<li><p><a href="https://github.com/langchain-ai/langchain/blob/master/libs/partners/ollama/README.md">langchain_ollama</a>: Plugs local Ollama models into the popular LangChain AI framework.</p>
</li>
<li><p><a href="https://github.com/langchain-ai/langchain-community">langchain_community</a>: Offers hundreds of extra "connectors" to link AI to the outside world.</p>
</li>
<li><p><a href="https://github.com/joke2k/faker">faker</a>: Generates realistic "dummy" data (names, emails) for testing your code safely.</p>
</li>
</ul>
<pre><code class="language-python">fake = Faker("en_US")

Faker.seed(42)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/6d896775-9db5-4d1a-b144-07b035f1dc35.png" alt="6d896775-9db5-4d1a-b144-07b035f1dc35" style="display:block;margin:0 auto" width="2080" height="664" loading="lazy">

<p>In these two lines of code, we configured the Faker Python library to generate fake data in United States English, and we seeded it with 42 so that it produces the same fake data on every run.</p>
<h3 id="heading-2-starting-the-ollama-server-getting-the-model-and-tools">2. Starting the Ollama Server, Getting the Model and Tools</h3>
<pre><code class="language-python">with open("ollama.log", "w") as log_file:
    process = subprocess.Popen(["ollama", "serve"], stdout=log_file, stderr=log_file)

def is_server_ready(port=11434):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(('localhost', port)) == 0

print("Booting Ollama server...")
max_retries = 20
ready = False

for i in range(max_retries):
    if is_server_ready():
        ready = True
        break
    time.sleep(1)
    if i % 5 == 0:
        print(f"Still waiting... ({i}s)")

if ready:
    print("\n Success! Ollama is running and ready for models.")
    !curl -s http://localhost:11434 | grep "Ollama is running"
else:
    print("\n Error: Ollama server failed to start. Check 'ollama.log' for details.")
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/1daf506b-fb25-4487-9bb3-887b37bb0aaf.png" alt="1daf506b-fb25-4487-9bb3-887b37bb0aaf" style="display:block;margin:0 auto" width="3512" height="2552" loading="lazy">

<p>This code helps ensure that your local environment is fully prepared before your AI models try to run.</p>
<p>AI servers often take some time to boot, so just be patient.</p>
<p>This script prevents "connection refused" errors by using a background process to start Ollama and a network "handshake" to confirm that it's awake.</p>
<pre><code class="language-python">!ollama pull mistral-small3.2
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/ce54b7e0-0b4f-4751-b797-ac4bd45cae63.png" alt="ce54b7e0-0b4f-4751-b797-ac4bd45cae63" style="display:block;margin:0 auto" width="2080" height="528" loading="lazy">

<p>In this line, we downloaded the <code>mistral-small3.2</code> LLM into the Google Colab notebook.</p>
<p>Mistral is a model developed by a well-known French startup, Mistral AI SAS.</p>
<pre><code class="language-python">_ddg = DuckDuckGoSearchRun()

@tool("web_search")
def web_search(query: str) -&gt; str:
    """Search the public web via DuckDuckGo. Input: a concise search query string. Returns: top result snippets as plain text."""
    return _ddg.run(query)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/0cadabf5-d454-418d-844c-3167a68283bd.png" alt="0cadabf5-d454-418d-844c-3167a68283bd" style="display:block;margin:0 auto" width="3680" height="1024" loading="lazy">

<p>In this code we've created a tool for our agents to use: we're giving the agents the ability to search the web with DuckDuckGo. DuckDuckGo is one of the most popular privacy-focused search engines on the web.</p>
<p>This is crucial because it enables our agents to provide recent information that wasn't part of the model's training data.</p>
<h3 id="heading-3-testing-the-model">3. Testing the Model</h3>
<p>Now we'll write the code that defines and tests the LLM.</p>
<p>We're initializing both a standard model for direct tasks and a specialized LLM object for the CrewAI framework. It's the specialized LLM object for the CrewAI framework that we'll use to power our AI agents.</p>
<p>This initial configuration is important because it validates that your machine is properly communicating with the software before you try to create AI agents.</p>
<pre><code class="language-python">AI_prompt = "Write a quick system prompt for an AI agent whose job is to summarize financial documents."

AI_model = OllamaLLM(model="mistral-small3.2")

crew_llm = LLM(
    model="ollama/mistral-small3.2",
    base_url="http://localhost:11434"
)

print("Running Mistral...")
AI_response = AI_model.invoke(AI_prompt)
display(Markdown(f"### AI Output:\n{AI_response}"))
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/5f76b8c8-6713-40dd-a624-fc83fb35f666.png" alt="5f76b8c8-6713-40dd-a624-fc83fb35f666" style="display:block;margin:0 auto" width="3680" height="1564" loading="lazy">

<h3 id="heading-4-running-the-ai-agents">4. Running the AI Agents</h3>
<p>Now, we'll run three different agent configurations.</p>
<p>The first one is a single agent for sequential tasks. The second one is a centralized team, and the third one is a decentralized team.</p>
<h4 id="heading-sequential-tasks-with-a-single-agent">Sequential Tasks with a Single Agent</h4>
<pre><code class="language-python">doc_5_1 = f"""{fake.company()} {fake.company_suffix()} — Q3 2026 Earnings Report
Prepared by: {fake.name()}, CFO
KEY METRICS
Revenue: ${fake.random_int(50, 500)}M (up {fake.random_int(5, 25)}% YoY)
Net Income: ${fake.random_int(10, 80)}M
Operating Margin: {fake.random_int(12, 28)}%
Active Customers: {fake.random_int(10_000, 500_000):,}
Cash on Hand: ${fake.random_int(100, 900)}M
Employee Headcount: {fake.random_int(200, 5000):,}
MANAGEMENT COMMENTARY
{fake.paragraph(nb_sentences=5)}
RISK FACTORS
{fake.paragraph(nb_sentences=4)}
"""
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/15c0b2f4-9e8e-4ed1-950b-2d897502ae28.png" alt="15c0b2f4-9e8e-4ed1-950b-2d897502ae28" style="display:block;margin:0 auto" width="3328" height="1652" loading="lazy">

<p>In this code, we built a template for an earnings report and filled it in with fake data from Faker.</p>
<pre><code class="language-python">print(doc_5_1)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/c16aa43e-da98-4255-be6e-0ba60b342163.png" alt="c16aa43e-da98-4255-be6e-0ba60b342163" style="display:block;margin:0 auto" width="2080" height="528" loading="lazy">

<pre><code class="language-plaintext">Rodriguez, Figueroa and Sanchez and Sons — Q3 2026 Earnings Report
Prepared by: Megan Mcclain, CFO
KEY METRICS
Revenue: $94M (up 23% YoY)
Net Income: $64M
Operating Margin: 13%
Active Customers: 25,622
Cash on Hand: $195M
Employee Headcount: 1,991
MANAGEMENT COMMENTARY
Own night respond red information last everything. Serve civil institution. Choice whatever from behavior benefit. Page southern role movie win her.
RISK FACTORS
Stop peace technology officer relate. Product significant world. Term herself law street class. Decide environment view possible participant commercial. Clear here writer policy news.
</code></pre>
<p>With this code, we printed the document the agent will process.</p>
<pre><code class="language-python">analyst = Agent(
    role="Senior Financial Document Specialist",
    goal=(
        "Read the provided document end-to-end, extract the 5 most decision-relevant KPIs "
        "(with units, period, and source line when available), and produce a CEO-ready summary. "
        "When a figure is missing or ambiguous, use web_search to verify it against public sources."
    ),
    backstory=(
        "You have 10+ years auditing 10-Ks, earnings releases, and investor decks at a Big Four firm. "
        "You work linearly, cite page/section for every metric, and never invent numbers — "
        "if a value isn't in the text, you search for it or mark it as 'not disclosed'."
    ),
    tools=[web_search],
    llm=crew_llm,
    verbose=True,
    allow_delegation=False,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/528b2693-3b24-4119-b88e-3eda4d1d9141.png" alt="528b2693-3b24-4119-b88e-3eda4d1d9141" style="display:block;margin:0 auto" width="3680" height="2464" loading="lazy">

<p>In this code, we defined an agent that acts as an analyst. This analyst will analyze the report that's generated. It will also have access to DuckDuckGo.</p>
<pre><code class="language-python">task_1 = Task(
    description=(
        "Analyze the following document for KPI metrics.\n\n"
        "DOCUMENT:\n"
        f"{doc_5_1}"
    ),
    agent=analyst,
    expected_output="A list of 5 key KPIs found in the text.",
)

task_2 = Task(
    description="Based on the KPIs extracted in the previous task, write a professional executive summary.",
    agent=analyst,
    expected_output="A 200-word summary suitable for a CEO.",
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/b5737a9c-ccc8-477c-b859-bf6de5a82f87.png" alt="b5737a9c-ccc8-477c-b859-bf6de5a82f87" style="display:block;margin:0 auto" width="3680" height="1924" loading="lazy">

<p>The analyst has only two tasks: the first is to find KPI metrics, and the second is to write an executive summary based on them. This way, we have sequential tasks performed by only one AI agent, which follows the empirical guidelines of the Google paper.</p>
<pre><code class="language-python">sequential_crew = Crew(
    agents=[analyst],
    tasks=[task_1, task_2],
    process=Process.sequential
)

print("Running Case 1: Sequential...")
result_1 = sequential_crew.kickoff()
display(Markdown(f"### Case 1 Result:\n{result_1}"))
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/c1a24352-e8e3-4f49-a2d0-c7e0bd75d3db.png" alt="c1a24352-e8e3-4f49-a2d0-c7e0bd75d3db" style="display:block;margin:0 auto" width="3680" height="1204" loading="lazy">

<pre><code class="language-plaintext">Dear CEO,

I am pleased to present a concise overview of Rodriguez, Figueroa and Sanchez and Sons Q3 2026 Earnings Report. Our company has demonstrated strong financial performance this quarter. We reported a significant increase in revenue, achieving $94 million, which represents a substantial 23% year-over-year growth. This growth is a testament to our effective business strategies and the increasing demand for our products or services.

Our net income for the quarter stands at $64 million, showcasing our ability to maintain robust profitability. The operating margin of 13% further highlights our efficient cost management and operational excellence. Customer satisfaction and engagement continue to be a priority, as evidenced by our growing base of 25,622 active customers.

In terms of liquidity, we have a solid cash position of $195 million, ensuring that we have the necessary resources to seize new opportunities and navigate any challenges that may arise. Our employee headcount of 1,991 reflects our commitment to talent acquisition and development.

In conclusion, this quarter's results underscore our strong market position and the successful execution of our business strategies. We remain optimistic about our future prospects and are committed to driving sustainable growth and shareholder value. Let's continue to build on this momentum in the coming quarters.

Best Regards, [Your Name]
</code></pre>
<p>Finally, we ran the crew. The text above is the report the agent produced.</p>
<h4 id="heading-centralized-team-of-four-agents">Centralized Team of Four Agents</h4>
<p>Now we'll create a team of four agents so you can see how multiple agents work.</p>
<p>This team researches lithium market trends to carry out financial modeling and generate an investment proposal based on data.</p>
<p>A centralized team works here because each step feeds into the next. We start with research, then model the research data, and finally make a recommendation.</p>
<p>Let's build the first agent, which will research the market:</p>
<pre><code class="language-python">researcher = Agent(
    role="Commodity Market Researcher (Battery Metals)",
    goal=(
        "Produce dated, sourced price data points for 2026 lithium carbonate and lithium hydroxide forecasts. "
        "Always pull from web_search; never guess. Return each data point as: value, unit, date, source URL."
    ),
    backstory=(
        "Ex-analyst at a commodities desk. You trust only primary sources (IEA, Benchmark Mineral Intelligence, "
        "Fastmarkets, company filings) and you flag any figure that lacks a verifiable source."
    ),
    tools=[web_search],
    llm=crew_llm,
    verbose=True,
    allow_delegation=False,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/6d204267-0a65-4b0a-b93a-844282724550.png" alt="6d204267-0a65-4b0a-b93a-844282724550" style="display:block;margin:0 auto" width="3680" height="2104" loading="lazy">

<p>The first agent we created will search the web for data related to lithium. For this task it will have access to DuckDuckGo.</p>
<p>Next, we'll create a finance agent to model the data the researcher gathered.</p>
<pre><code class="language-python">finance_pro = Agent(
    role="Capex Financial Modeler",
    goal=(
        "Take the researcher's price data and run a 10-year NPV and IRR simulation at a 10% discount rate, "
        "stating all assumptions explicitly and returning a table plus a short narrative."
    ),
    backstory=(
        "You've built DCF models for gigafactory investments. You show your formulas, label base/bull/bear cases, "
        "and refuse to produce a number without stating the inputs behind it."
    ),
    llm=crew_llm,
    verbose=True,
    allow_delegation=False,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/375e5943-3bd4-4c05-8ab1-4fcc10dab892.png" alt="375e5943-3bd4-4c05-8ab1-4fcc10dab892" style="display:block;margin:0 auto" width="3680" height="1924" loading="lazy">

<p>The finance agent will take the researcher's data and run simulations on it.</p>
<p>From there, we'll define another agent that will advise us on strategy based on the financial model:</p>
<pre><code class="language-python">strategy_advisor = Agent(
    role="Investment Strategy Advisor",
    goal=(
        "Synthesize the researcher's price data and the modeler's NPV/IRR results into a "
        "clear go/no-go recommendation, with the top 3 risks and the conditions under which "
        "the recommendation flips."
    ),
    backstory=(
        "Former MD at a project-finance fund. You translate models into decisions and always "
        "name the sensitivities that would change your call."
    ),
    llm=crew_llm,
    verbose=True,
    allow_delegation=False,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/daf6b079-cb53-410b-a2cb-5d7d933a13f6.png" alt="daf6b079-cb53-410b-a2cb-5d7d933a13f6" style="display:block;margin:0 auto" width="3676" height="1744" loading="lazy">

<p>This way, we have one agent to do the research, another to do the modeling, and a final one to advise us on strategy.</p>
<pre><code class="language-python">centralized_crew = Crew(
    agents=[researcher, finance_pro, strategy_advisor],
    tasks=[
        Task(description="Research 2026 lithium price forecasts.", agent=researcher, expected_output="Price data points."),
        Task(description="Run an NPV simulation using prices.", agent=finance_pro, expected_output="Full NPV report."),
        Task(description="Issue a go/no-go recommendation based on the NPV report.", agent=strategy_advisor, expected_output="Go/no-go memo with top 3 risks."),
    ],
    process=Process.hierarchical,
    manager_llm=crew_llm
)

print("Running Case 2: Centralized (Hierarchical)...")
result_2 = centralized_crew.kickoff()
display(Markdown(f"### Case 2 Result:\n{result_2}"))
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/90723254-2519-4187-a208-d014c7b20b66.png" alt="90723254-2519-4187-a208-d014c7b20b66" style="display:block;margin:0 auto" width="3680" height="1924" loading="lazy">

<p>The crew definition above also creates the fourth agent. Setting <code>manager_llm</code> auto-spawns a manager that reviews the other agents' work.</p>
<p>Calling <code>kickoff()</code> then runs the three agents together under that manager.</p>
<h4 id="heading-decentralized-team-of-three-agents">Decentralized Team of Three Agents</h4>
<p>Now we'll create a decentralized team of three agents. Once again, the first step is to create the data.</p>
<p>A decentralized model fits here because the auditors review the same data from different angles and cross-reference each other's findings.</p>
<pre><code class="language-python">groups = ["Group A (men)", "Group B (women)", "Group C (under-40)", "Group D (over-40)"]
hiring_stats = "\n".join(
    f"{g}: {fake.random_int(40, 120)} applicants, {fake.random_int(5, 25)} hired"
    for g in groups
)
feedback = "\n".join(
    f'- Candidate {fake.name()}: "{fake.sentence(nb_words=12)}"'
    for _ in range(6)
)
doc_5_3 = f"""Q1 2026 Hiring Audit Data — {fake.company()}
APPLICANT POOL &amp; SELECTION RATES
{hiring_stats}
INTERVIEWER FEEDBACK NOTES (sample)
{feedback}
"""
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/5ff84edc-306e-460b-bb3a-181254cbab79.png" alt="5ff84edc-306e-460b-bb3a-181254cbab79" style="display:block;margin:0 auto" width="3680" height="1744" loading="lazy">

<p>We also defined a general template to generate the fake data.</p>
<pre><code class="language-python">print(doc_5_3)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/d68ddc9a-15c6-4f0f-aa12-ecdf08e6c7d0.png" alt="d68ddc9a-15c6-4f0f-aa12-ecdf08e6c7d0" style="display:block;margin:0 auto" width="3680" height="528" loading="lazy">

<pre><code class="language-plaintext">Q1 2026 Hiring Audit Data — Zimmerman Inc
APPLICANT POOL &amp; SELECTION RATES
Group A (men): 81 applicants, 6 hired
Group B (women): 69 applicants, 6 hired
Group C (under-40): 80 applicants, 17 hired
Group D (over-40): 74 applicants, 7 hired
INTERVIEWER FEEDBACK NOTES (sample)
- Candidate Tommy Walter: "Defense material those poor central cause seat much section investment on gun."
- Candidate Brenda Snyder PhD: "Check civil quite others his other life edge."
- Candidate Terri Frazier: "Race Mr environment political born itself law west."
- Candidate Deborah Mason: "Medical blood personal success medical current hear claim well."
- Candidate Tamara George: "Affect upon these story film around there water beat magazine attorney set she campaign."
- Candidate Joshua Baker: "Institution deep much role cut find yet practice just military building different full open discover detail."
</code></pre>
<p>Above is the fake data we generated.</p>
<p>Now, we'll create three auditors.</p>
<p>The first auditor focuses on selection rates across the demographic groups of applicants.</p>
<pre><code class="language-python">auditor_a = Agent(
    role="Statistical Hiring Auditor",
    goal=(
        "Compute selection-rate ratios across demographic groups for the Q1 hiring batch, "
        "apply the 4/5ths rule, and flag any group where the ratio falls below 0.80. "
        "Use web_search only to confirm regulatory definitions."
    ),
    backstory=(
        "Former EEOC compliance analyst. You are rigorously numerical, cite the Uniform "
        "Guidelines on Employee Selection Procedures, and never draw qualitative conclusions "
        "outside your lane."
    ),
    tools=[web_search],
    llm=crew_llm,
    verbose=True,
    allow_delegation=False,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/bd05e48c-156e-4f34-aaa7-6ded4e460a46.png" alt="bd05e48c-156e-4f34-aaa7-6ded4e460a46" style="display:block;margin:0 auto" width="3680" height="2104" loading="lazy">

<p>Then we'll define the second auditor, which reviews the interview notes and looks for bias in the way interviews are conducted.</p>
<pre><code class="language-python">auditor_b = Agent(
    role="Qualitative Bias Reviewer",
    goal=(
        "Read interview notes and written feedback for coded language, inconsistent rubric "
        "application, and sentiment skew across candidate groups. Combine your findings with "
        "the statistical auditor's numbers into one final report."
    ),
    backstory=(
        "I/O psychologist with a focus on structured-interview research. You cite specific "
        "phrases as evidence and distinguish 'concerning pattern' from 'isolated incident'."
    ),
    tools=[web_search],
    llm=crew_llm,
    verbose=True,
    allow_delegation=False,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/bcb01353-cab0-4fa1-8ca5-22aacc8ed88e.png" alt="bcb01353-cab0-4fa1-8ca5-22aacc8ed88e" style="display:block;margin:0 auto" width="3680" height="2192" loading="lazy">

<p>Finally, we create a third auditor that focuses on whether the hiring policies are being followed.</p>
<pre><code class="language-python">auditor_c = Agent(
    role="Process &amp; Policy Compliance Auditor",
    goal=(
        "Review the hiring process for adherence to documented policy: structured-interview "
        "use, rubric consistency, and required approval steps. Cross-check the statistical "
        "and qualitative findings to surface root-cause process gaps."
    ),
    backstory=(
        "Internal audit lead with an HR-ops background. You map findings to specific policy "
        "clauses and recommend concrete process fixes."
    ),
    tools=[web_search],
    llm=crew_llm,
    verbose=True,
    allow_delegation=True,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/d1be79dd-7346-4d6a-b794-672050a97aa4.png" alt="d1be79dd-7346-4d6a-b794-672050a97aa4" style="display:block;margin:0 auto" width="3640" height="1832" loading="lazy">

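<p>Note the <code>allow_delegation</code> flag. In the code above, only <code>auditor_c</code> sets <code>allow_delegation=True</code>, which lets that agent delegate work to its peers and ask them questions. Set it to <code>True</code> on the other two auditors as well if you want all three to communicate with each other.</p>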
<p>Then we give each auditor a task.</p>
<pre><code class="language-python">task_audit_stats = Task(
    description=(
        "Audit the Q1 hiring batch for structural bias. "
        "Compute selection rates per group and flag any disparities.\n\n"
        "DATA:\n"
        f"{doc_5_3}"
    ),
    agent=auditor_a,
    expected_output="A report highlighting any group disparities found.",
)

task_audit_review = Task(
    description=(
        "Review the findings of the Statistical Auditor and add qualitative "
        "context from the interviewer notes in the original document."
    ),
    agent=auditor_b,
    expected_output="A final combined audit report with numbers and narrative.",
)

task_audit_process = Task(
    description=(
        "Using the statistical and qualitative findings above, identify process-level root "
        "causes (e.g. unstructured interviews, missing rubrics, approval gaps) and propose fixes."
    ),
    agent=auditor_c,
    expected_output="A process-gap list with policy references and recommended fixes.",
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/5af5e0b0-14d7-4a5b-a274-a0df4b7012cb.png" alt="5af5e0b0-14d7-4a5b-a274-a0df4b7012cb" style="display:block;margin:0 auto" width="3680" height="3004" loading="lazy">

<p>Finally, we assemble the auditor team:</p>
<pre><code class="language-python">decentralized_crew = Crew(
    agents=[auditor_a, auditor_b, auditor_c],
    tasks=[task_audit_stats, task_audit_review, task_audit_process],
    process=Process.sequential,
)

print("Running Case 3: Decentralized (Peer Review)...")
result_3 = decentralized_crew.kickoff()
display(Markdown(f"### Case 3 Result:\n{result_3}"))
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/c9cfff42-eb86-4f57-9840-7f85cc83768a.png" alt="c9cfff42-eb86-4f57-9840-7f85cc83768a" style="display:block;margin:0 auto" width="2732" height="1204" loading="lazy">

<pre><code class="language-plaintext">
Case 3 Result:
Combined Audit Report: Q1 Hiring Batch Audit for Structural Bias
Statistical Audit Findings:

    Applicant Pool and Selection Rates:
        Group A (men): 81 applicants, 6 hired
            Selection Rate: 6/81 = 0.074074 (7.41%)
        Group B (women): 69 applicants, 6 hired
            Selection Rate: 6/69 = 0.08696 (8.70%)
        Group C (under-40): 80 applicants, 17 hired
            Selection Rate: 17/80 = 0.2125 (21.25%)
        Group D (over-40): 74 applicants, 7 hired
            Selection Rate: 7/74 = 0.094595 (9.46%)

    Selection Rate Ratios:
        Group A / Group B: 0.074074 / 0.08696 = 0.85 (85%)
        Group C / Group D: 0.2125 / 0.094595 = 2.24 (224%)

    Application of the 4/5ths Rule:
        Group A (men) vs Group B (women): The selection rate ratio is 0.85, which is above the 0.80 threshold.
        Group C (under-40) vs Group D (over-40): The selection rate ratio is 2.24, which is above the 0.80 threshold.

    Conclusion: Based on the selection rate analysis, no group disparities are flagged as falling below the 0.80 threshold according to the 4/5ths rule.

Qualitative Audit Findings:
Group A (men) vs Group B (women):

    Concerning Patterns:
        Feedback Inconsistency:
            Isolated Incident: "Candidate lacked experience but showed strong potential."
                This feedback was given to a female candidate but not to similarly situated male candidates.
        Sentiment Skew:
            Concerning Pattern: More frequently in female candidate assessments the phrases "needs improvement in leadership skills" and "less assertive" were observed.

Group C (under-40) vs Group D (over-40):

    Concerning Patterns:
        Feedback Inconsistency:
            Concerning Pattern: Phrases like "strong strategic thinker" and "in-depth industry knowledge" frequently used to describe over-40 candidates.
                Similar competence indicators were not noted in feedback for candidates under 40.
        Sentiment Skew:
            Isolated Incident: For a few under-40 candidates, feedback noted "lacks experience in leading teams."
                This sentiment was not applied to under-40 candidates with similar profiles but differed in gender.

Additional Notes:

    Rubric Application:
        Concerning Pattern: The rubric application was inconsistent when evaluating "leadership skills" and "assertiveness" especially between male and female candidates.
        Isolated Incident: Some reviewers emphasized "cultural fit" for female candidates which was not a requirement and was not consistently applied.

Final Conclusion:

Based on the selection rate analysis, no group disparities are flagged as falling below the 0.80 threshold according to the 4/5ths rule. However, qualitative findings indicate potential biases in feedback and rubric application which could influence hiring decisions. Recommendations:

    Standardize evaluation criteria and implement unbiased language in evaluations.
    Conduct further training to ensure consistent understanding and application of rubric standards across all reviewers.
    Monitor the impact of these interventions in future hiring cycles to ensure equitable selection practices.
</code></pre>
<p>Above, you can see the report from the three auditors about the hiring process.</p>
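<p>Because the selection rates are plain arithmetic, this part of the report is easy to verify without an LLM. The sketch below is not part of the original notebook. It recomputes the rates from the generated data and assumes the 4/5ths rule is applied within each comparison pair by dividing the lower selection rate by the higher one. The names <code>pairs</code>, <code>selection_rate</code>, and <code>impact_ratio</code> are illustrative.</p>
<pre><code class="language-python"># Applicant and hire counts copied from the generated audit data above.
groups = {
    "Group A (men)": (81, 6),
    "Group B (women)": (69, 6),
    "Group C (under-40)": (80, 17),
    "Group D (over-40)": (74, 7),
}

# Assumption: groups are compared pairwise, one attribute at a time.
pairs = [
    ("Group A (men)", "Group B (women)"),
    ("Group C (under-40)", "Group D (over-40)"),
]

THRESHOLD = 0.80  # the 4/5ths rule cut-off


def selection_rate(name):
    applicants, hired = groups[name]
    return hired / applicants


def impact_ratio(first, second):
    """Return (lower-rate group, lower rate / higher rate) for one pair."""
    low, high = sorted((first, second), key=selection_rate)
    return low, selection_rate(low) / selection_rate(high)


results = [impact_ratio(a, b) for a, b in pairs]
flagged = [name for name, ratio in results if ratio &lt; THRESHOLD]

for name, ratio in results:
    print(f"{name}: impact ratio {ratio:.2f}")
print("Flagged:", flagged)
</code></pre>
<p>Under that assumption, Group D (over-40) has an impact ratio of about 0.45 against Group C and would be flagged, even though the agents' report treated the inverse ratio (2.24) as passing. A deterministic check like this is a small example of the kind of eval the conclusion argues for.</p>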
<h2 id="heading-conclusion-the-future-of-ai-is-evals">Conclusion: The Future of AI is Evals</h2>
<p>If you remember one thing from this article, let it be this: <strong>The organizations that win with AI agents are not the ones with the most agents. They are the ones with the best evals.</strong></p>
<p>The Google paper gave us simple rules for picking agent architectures. Those rules are very useful, and I've laid them out in the form of an algorithm.</p>
<p>But those rules were derived from benchmarks, not an organization's data. For that reason, you have to build your own evals. Nobody knows what "correct" looks like in your domain except you.</p>
<p>This is the same point made by Sam Bhagwat in <a href="https://mastra.ai/blog/principles-of-ai-engineering">Principles of Building AI Agents</a>, which I'd recommend to anyone shipping agents.</p>
<p>So here's the playbook again:</p>
<ol>
<li><p><strong>Check your budget first:</strong> Tokens cost money. Know what you can spend per task.</p>
</li>
<li><p><strong>Always start with one agent:</strong> If it solves the task &gt;45% of the time, ship it. Don't add agents.</p>
</li>
<li><p><strong>Only build a team if the task is naturally parallel:</strong> Sequential tasks get worse with a team.</p>
</li>
<li><p><strong>Match topology to task:</strong> For analysis, a centralized team works better. For open web research, a decentralized team works better. For sequential work, a single agent is best.</p>
</li>
<li><p><strong>Cap teams at 3–4 agents and no more than 3 tools per agent:</strong> As in real life, the smaller the team, the more agile it is and the fewer mistakes it makes.</p>
</li>
<li><p><strong>Put a supervisor on any parallel setup:</strong> According to the study, unchecked swarms amplify errors ~17×, while supervised ones amplify them ~4×.</p>
</li>
<li><p><strong>Build evals before you scale:</strong> Synthetic tests, historical back-tests, LLM-as-judge with human calibration.</p>
</li>
</ol>
<p>And keep humans in the loop for high-stakes decisions.</p>
<p>Once again, agents are like interns. Whether they produce great work or burn down the organization depends on how well you organize and check their work.</p>
<p>You can find the <a href="https://github.com/tiagomonteiro0715/How-to-Build-Optimal-AI-Agents-That-Actually-Work-Handbook">code on GitHub here</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Live Options Database in Python – A Complete Guide ]]>
                </title>
                <description>
                    <![CDATA[ Live options analytics change constantly. Implied volatility shifts, Greeks drift, and the shape of the surface can look different even a few minutes later. But a lot of teams still treat these number ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-live-options-database-in-python-a-complete-guide/</link>
                <guid isPermaLink="false">69fd19789f93a850a43041c9</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Databases ]]>
                    </category>
                
                    <category>
                        <![CDATA[ stockmarket ]]>
                    </category>
                
                    <category>
                        <![CDATA[ trading,  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nikhil Adithyan ]]>
                </dc:creator>
                <pubDate>Thu, 07 May 2026 23:00:08 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4ecffa99-c492-4959-9899-885021d11ee4.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Live options analytics change constantly. Implied volatility shifts, Greeks drift, and the shape of the surface can look different even a few minutes later.</p>
<p>But a lot of teams still treat these numbers like something you glance at once. A screenshot in a deck. A one-off notebook cell. A quick check in a UI before a meeting.</p>
<p>That works until you need to answer basic questions that show up in real workflows:</p>
<p>What did TSLA's surface look like at 10:32? When did skew start steepening? Did the change come from the wings moving or the ATM shifting?</p>
<p>If you don't store the data as it arrives, you can't replay it, compare it, or audit it. You're stuck with whatever you happened to look at in the moment.</p>
<p>In this walkthrough, we'll build something small but practical: an internal database that continuously captures SpiderRock MLink's LiveImpliedQuote analytics for TSLA, stores each snapshot as queryable history, and also maintains a "latest view" table so you can pull the current surface state without scanning the full history.</p>
<p><strong>The goal is not to build a trading system. It's to build a reliable internal dataset that you can monitor and query.</strong></p>
<p>Note: SpiderRock MLink's LiveImpliedQuote analytics is a product offered for a fee, which includes exchange charges for the underlying market data used in its creation.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-data-were-using">What Data We're Using</a></p>
</li>
<li><p><a href="#heading-setup-importing-packages">Setup: Importing Packages</a></p>
</li>
<li><p><a href="#heading-database-design">Database Design</a></p>
</li>
<li><p><a href="#heading-pulling-liveimpliedquote">Pulling LiveImpliedQuote</a></p>
</li>
<li><p><a href="#heading-normalizing-the-response-into-rows">Normalizing the Response Into Rows</a></p>
</li>
<li><p><a href="#heading-writing-to-the-database">Writing To The Database</a></p>
</li>
<li><p><a href="#heading-running-a-short-polling-capture">Running a Short Polling Capture</a></p>
</li>
<li><p><a href="#heading-analysis-smile-reconstruction-from-the-database">Analysis: Smile Reconstruction From the Database</a></p>
<ul>
<li><p><a href="#heading-pick-an-expiry-with-good-coverage">Pick an Expiry with Good Coverage</a></p>
</li>
<li><p><a href="#heading-rebuild-the-smile-across-snapshots">Rebuild the Smile Across Snapshots</a></p>
</li>
<li><p><a href="#heading-zoom-in-around-spot">Zoom-In Around Spot</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-analysis-atm-iv-and-skew-over-time">Analysis: ATM IV and Skew Over Time</a></p>
</li>
<li><p><a href="#heading-alert-style-thresholds">Alert-Style Thresholds</a></p>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before running any of the code in this walkthrough, there are a few things you need to have in place.</p>
<p>On the API side, you need a SpiderRock MLink account with access to the LiveImpliedQuote feed. The examples use the REST interface, so no websocket setup is required, but you do need a valid API key. If you don't have one yet, you can reach out to SpiderRock directly to get access.</p>
<p>On the Python side, the environment is minimal. You need Python 3.10 or later for the tuple type hint syntax used in one of the function signatures. The external packages are requests, pandas, numpy, and matplotlib. Everything else – sqlite3, time, datetime – is part of the standard library. You can install the external dependencies with:</p>
<pre><code class="language-plaintext">pip install requests pandas numpy matplotlib
</code></pre>
<p>No database setup is required beyond a writable local path. SQLite creates the file automatically on first run, so there's nothing to install or configure separately.</p>
<p>Finally, the walkthrough uses TSLA as the target symbol because it has a liquid and active options chain. If you want to swap in a different underlying, the only thing you need to change is the symbol variable in the config block.</p>
<h2 id="heading-what-data-were-using">What Data We're Using</h2>
<p>This build is driven by one OptAnalytics message type from SpiderRock MLink: <a href="https://docs.spiderrockconnect.com/docs/next/MessageSchemas/Schema/Topics/analytics/LiveImpliedQuote/"><strong>LiveImpliedQuote</strong></a>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/7150e733-6238-410b-afe7-abc781d67e7a.png" alt="LiveImpliedQuote docs page" style="display:block;margin:0 auto" width="1000" height="451" loading="lazy">

<p>Each message represents an option contract and comes with the analytics you actually need for monitoring:</p>
<ul>
<li><p>the option identifier (symbol, expiry, strike, call or put)</p>
</li>
<li><p>surface IV (sVol) and related surface fields</p>
</li>
<li><p>Greeks (delta, gamma, theta, vega)</p>
</li>
<li><p>context fields like underlying price (uPrc), time to expiry (years), and rate (rate)</p>
</li>
<li><p>timestamps and calc source markers, which matter when you're turning a live feed into a database</p>
</li>
</ul>
<p>We'll treat sVol as the main volatility field for the article and refer to it as surface IV. That keeps the workflow consistent when we rebuild smiles or compute skew proxies from stored history.</p>
<p>The demo uses TSLA because it has a rich and active options chain, which makes the database and queries more interesting even in a short capture window. The same pipeline works for any other underlying&nbsp;– the only thing you change is the symbol filter.</p>
<h2 id="heading-setup-importing-packages">Setup: Importing Packages</h2>
<p>Before touching the database or the API, we set up a small, repeatable environment. This section is intentionally minimal. We only import what we need for three things: making REST calls, storing data in SQLite, and doing basic analysis and plots.</p>
<pre><code class="language-python">import requests
import sqlite3
import pandas as pd
import numpy as np
import time
from datetime import datetime, timezone
import matplotlib.pyplot as plt
plt.style.use('ggplot')
</code></pre>
<ul>
<li><p><code>requests</code> is used for calling MLink REST endpoints.</p>
</li>
<li><p><code>sqlite3</code> gives us a lightweight database we can write to locally without extra setup.</p>
</li>
<li><p><code>pandas</code> and <code>numpy</code> are only for shaping and filtering the data once it comes back.</p>
</li>
<li><p><code>time</code> and <code>datetime</code> help us run a polling loop and timestamp each snapshot so the database becomes a real time series.</p>
</li>
</ul>
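<p>To make the timestamping concrete, here is a minimal sketch of one way to produce snapshot timestamps. It assumes UTC ISO-8601 strings with a fixed precision, and the variable names are illustrative rather than taken from the pipeline. In a live loop you would call <code>datetime.now(timezone.utc).isoformat(timespec="seconds")</code>. Fixed values are used here so the result is repeatable.</p>
<pre><code class="language-python">from datetime import datetime, timedelta, timezone

# One timestamp per poll, always in UTC so every row uses the same offset.
first_poll = datetime(2026, 5, 7, 14, 30, 0, tzinfo=timezone.utc)
polls = [first_poll + timedelta(seconds=10 * i) for i in range(3)]

# ISO-8601 text is easy to store in a SQLite text column.
stamps = [p.isoformat(timespec="seconds") for p in polls]
print(stamps[0])  # 2026-05-07T14:30:00+00:00

# With a fixed format and offset, string order matches time order,
# so sorting and range filters on the text column behave as expected.
shuffled = [stamps[2], stamps[0], stamps[1]]
print(sorted(shuffled) == stamps)  # True
</code></pre>
<p>Keeping the format and offset identical on every row is what makes plain string comparison safe for ordering snapshots.</p>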
<h2 id="heading-database-design">Database Design</h2>
<p>If the goal is to make live analytics queryable, the database design has to support two different needs.</p>
<p>First, you want an audit trail. Every snapshot should be preserved so you can reconstruct what the surface looked like at a specific time.</p>
<p>Second, you also want a fast way to answer "what does it look like right now" without scanning everything you've ever stored.</p>
<p>So we use two tables:</p>
<ul>
<li><p><code>implied_quote_history</code>: Append-only. Every poll inserts a full snapshot.</p>
</li>
<li><p><code>implied_quote_latest</code>: One row per option contract. Each poll upserts into this table so it always reflects the most recent snapshot.</p>
</li>
</ul>
<p>The core of both tables is a stable option identifier. In the feed, the option key is nested, so we normalize it into a single <code>option_key</code> string that includes symbol, expiry, strike, call or put, and venue fields. This becomes the primary key for the latest table and the main join key for queries.</p>
<pre><code class="language-python">#config
api_key = "YOUR SPIDERROCK API KEY"
mlink_url = "https://mlink-live.nms.saturn.spiderrockconnect.com/rest/json"

msg_type = "LiveImpliedQuote"

symbol = "TSLA"
poll_interval_s = 10
poll_duration_s = 120
limit = 2000

#create db connection
db_path = "/mnt/data/optanalytics_iv_greeks.db"

def get_conn(path: str = db_path):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL;")
    conn.execute("PRAGMA synchronous=NORMAL;")
    return conn

#create db schema
def setup_db(path: str = db_path):
    conn = get_conn(path)
    cur = conn.cursor()

    cur.execute("""
    create table if not exists implied_quote_history (
        id integer primary key autoincrement,
        asof_ts text not null,

        option_key text not null,
        symbol text not null,
        expiry text not null,
        strike real not null,
        cp text not null,

        calc_source text,
        u_prc real,
        years real,
        rate real,

        s_vol real,
        atm_vol real,
        s_mark real,

        o_bid real,
        o_ask real,
        o_bid_iv real,
        o_ask_iv real,

        delta real,
        gamma real,
        theta real,
        vega real,

        src_ts text
    );
    """)

    cur.execute("""
    create index if not exists idx_hist_symbol_expiry_asof
    on implied_quote_history(symbol, expiry, asof_ts);
    """)

    cur.execute("""
    create index if not exists idx_hist_option_asof
    on implied_quote_history(option_key, asof_ts);
    """)

    cur.execute("""
    create table if not exists implied_quote_latest (
        option_key text primary key,

        last_asof_ts text not null,
        symbol text not null,
        expiry text not null,
        strike real not null,
        cp text not null,

        calc_source text,
        u_prc real,
        years real,
        rate real,

        s_vol real,
        atm_vol real,
        s_mark real,

        o_bid real,
        o_ask real,
        o_bid_iv real,
        o_ask_iv real,

        delta real,
        gamma real,
        theta real,
        vega real,

        src_ts text
    );
    """)

    cur.execute("""
    create index if not exists idx_latest_symbol_expiry
    on implied_quote_latest(symbol, expiry);
    """)

    conn.commit()
    conn.close()

setup_db()
</code></pre>
<p>This creates the SQLite database file and both tables. The history table is append-only and indexed for the two queries we'll run later: pulling snapshots by expiry and time, and pulling a specific option's timeline by <code>option_key</code>. The latest table is keyed by <code>option_key</code>, which lets us upsert and maintain a consistent "current view."</p>
<p>The columns we store are intentionally opinionated. We keep surface IV (s_vol), surface mark (s_mark), Greeks, and a few context fields. We also store timestamps so later we can reason about when a value was produced.</p>
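<p>If you want to confirm that the two history indexes serve the queries described above, SQLite's <code>EXPLAIN QUERY PLAN</code> is a quick check. The sketch below uses an in-memory database and a trimmed copy of the history table. The <code>plan</code> helper, the expiry string, and the option key value are placeholders for illustration, and the exact wording of the plan text varies between SQLite versions.</p>
<pre><code class="language-python">import sqlite3

conn = sqlite3.connect(":memory:")

# Trimmed copy of the history table: only the columns the indexes touch.
conn.execute("""
create table implied_quote_history (
    id integer primary key autoincrement,
    asof_ts text not null,
    option_key text not null,
    symbol text not null,
    expiry text not null,
    s_vol real
)
""")
conn.execute("""
create index idx_hist_symbol_expiry_asof
on implied_quote_history(symbol, expiry, asof_ts)
""")
conn.execute("""
create index idx_hist_option_asof
on implied_quote_history(option_key, asof_ts)
""")


def plan(sql, params):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    rows = conn.execute("explain query plan " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)


# Snapshots for one expiry over time (placeholder expiry value).
by_expiry = plan(
    "select asof_ts, s_vol from implied_quote_history "
    "where symbol = ? and expiry = ? order by asof_ts",
    ("TSLA", "2026-06-19"),
)

# Timeline for a single option (placeholder key value).
by_option = plan(
    "select asof_ts, s_vol from implied_quote_history "
    "where option_key = ? order by asof_ts",
    ("hypothetical-key",),
)

print(by_expiry)
print(by_option)
conn.close()
</code></pre>
<p>Each printed plan should name the matching index. If a plan reports a full table scan instead, the query's filter columns probably don't line up with the leading columns of the index.</p>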
<h2 id="heading-pulling-liveimpliedquote">Pulling LiveImpliedQuote</h2>
<p>Now we do the first live pull. The goal here is not to build a perfect filter. It's to confirm that we can retrieve a meaningful slice of TSLA option analytics and that the response structure is what we expect.</p>
<p>We request LiveImpliedQuote and filter by symbol using the where clause. The response is a list where most rows are actual LiveImpliedQuote messages, and one row at the end is a QueryResult summary.</p>
<pre><code class="language-python">def fetch_live_implied_quote(symbol: str, limit: int = 2000):
    where = f"okey.tk:eq:{symbol}"

    params = {
        "apiKey": api_key,
        "cmd": "getmsgs",
        "msgType": msg_type,
        "where": where,
        "limit": limit
    }

    # A timeout keeps a stalled connection from hanging the poll indefinitely.
    r = requests.get(mlink_url, params=params, timeout=30)
    r.raise_for_status()
    return r.json()

raw = fetch_live_implied_quote(symbol, limit=limit)
print("raw messages:", len(raw))
print("first type:", raw[0].get("header", {}).get("mTyp") if raw else None)
</code></pre>
<p>This is a straight REST <code>getmsgs</code> call. We pass the API key, message type, and a simple symbol filter. The <code>limit</code> is important. It caps how many messages we get back in one poll, so for active underlyings, the returned set of strikes and expiries can vary between polls. That's fine for this tutorial, because the goal is to show the database pattern and the types of monitoring queries it enables.</p>
<p>This is the output you should see:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/606259cd-e6ed-4f6f-b24f-48fafe9c561b.png" alt="LiveImpliedQuote sample pull" style="display:block;margin:0 auto" width="988" height="170" loading="lazy">
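<p>Because the response mixes data rows with a trailing summary row, it can help to separate the two before doing anything else. The sketch below is a minimal illustration on hand-made sample data. The payload shape is assumed from the description above (a <code>header.mTyp</code> field on each row), and <code>split_response</code> is a hypothetical helper, not part of MLink.</p>
<pre><code class="language-python">def split_response(raw, msg_type="LiveImpliedQuote"):
    """Separate data messages from any other rows (for example a QueryResult summary)."""
    data, other = [], []
    for row in raw:
        m_typ = row.get("header", {}).get("mTyp")
        (data if m_typ == msg_type else other).append(row)
    return data, other

# Hand-made sample shaped like the response described above (values are illustrative only).
sample = [
    {"header": {"mTyp": "LiveImpliedQuote"}, "message": {"sVol": 0.51}},
    {"header": {"mTyp": "LiveImpliedQuote"}, "message": {"sVol": 0.49}},
    {"header": {"mTyp": "QueryResult"}, "message": {}},
]

data_rows, other_rows = split_response(sample)
print("data rows:", len(data_rows), "| other rows:", len(other_rows))
</code></pre>
<p>Counting the two groups separately gives a quick sanity check on each poll: the number of data rows is what you should compare against <code>limit</code>, not the raw list length.</p>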

<h2 id="heading-normalizing-the-response-into-rows">Normalizing the Response Into Rows</h2>
<p>Right now, <code>raw</code> is a list of nested message objects. That format is fine for transport, but it's not something you can store or query directly. So now, we turn each LiveImpliedQuote message into one flat row with a consistent schema.</p>
<pre><code class="language-python">def make_option_key(okey: dict) -&gt; str:
    return "|".join([
        str(okey.get("tk")),
        str(okey.get("dt")),
        str(okey.get("xx")),
        str(okey.get("cp")),
        str(okey.get("at")),
        str(okey.get("ts")),
    ])

def normalize_liq(raw: list, asof_ts: str, keep_calc_source: str = "Loop") -&gt; pd.DataFrame:
    rows = []

    for row in raw:
        if row.get("header", {}).get("mTyp") != "LiveImpliedQuote":
            continue

        m = row.get("message", {})
        if keep_calc_source and m.get("calcSource") != keep_calc_source:
            continue

        pkey = m.get("pkey", {})
        okey = pkey.get("okey", {})
        if not okey:
            continue

        s_vol = m.get("sVol")
        if s_vol is None or s_vol == 0:
            continue

        o_bid = m.get("oBid", 0) or 0
        o_ask = m.get("oAsk", 0) or 0

        quote_ok = int(not (o_bid == 0 and o_ask == 0))

        rows.append({
            "asof_ts": asof_ts,
            "option_key": make_option_key(okey),

            "symbol": okey.get("tk"),
            "expiry": okey.get("dt"),
            "strike": okey.get("xx"),
            "cp": okey.get("cp"),

            "calc_source": m.get("calcSource"),
            "u_prc": m.get("uPrc"),
            "years": m.get("years"),
            "rate": m.get("rate"),

            "s_vol": s_vol,
            "atm_vol": m.get("atmVol"),
            "s_mark": m.get("sMark"),

            "o_bid": o_bid,
            "o_ask": o_ask,
            "o_bid_iv": m.get("oBidIv"),
            "o_ask_iv": m.get("oAskIv"),
            "quote_ok": quote_ok,

            "delta": m.get("de"),
            "gamma": m.get("ga"),
            "theta": m.get("th"),
            "vega": m.get("ve"),

            "src_ts": m.get("timestamp"),
        })

    df = pd.DataFrame(rows)
    if df.empty:
        return df

    df = (
        df.sort_values("src_ts")
          .drop_duplicates(subset=["option_key"], keep="last")
          .reset_index(drop=True)
    )
    return df

asof_ts = datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z")
snapshot_df = normalize_liq(raw, asof_ts)

print("snapshot rows:", len(snapshot_df))
print("quote_ok distribution:", snapshot_df["quote_ok"].value_counts().to_dict() if not snapshot_df.empty else {})
snapshot_df.head()
</code></pre>
<p>There are three practical decisions baked into this normalization step:</p>
<ul>
<li><p>First, we build a stable <code>option_key</code> from the option identifier so we have a consistent primary key for the latest table.</p>
</li>
<li><p>Second, we keep only <code>calcSource="Loop"</code>. LiveImpliedQuote can include both Tick and Loop records. Loop records tend to be more consistent for snapshot-style analysis because the underlying reference price is stable across the surface.</p>
</li>
<li><p>Third, we avoid aggressive filtering. In this dataset, the top-of-book bid and ask fields can be zero even when the analytics fields are populated. So instead of dropping those rows, we store a <code>quote_ok</code> flag and keep the record. That keeps the pipeline usable while still making it obvious later which rows had live quotes.</p>
</li>
</ul>
<p>This is the output:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/7d04a9e8-d3ec-4737-a0a7-64cb3888380c.png" alt="LiveImpliedQuote snapshot" style="display:block;margin:0 auto" width="1500" height="496" loading="lazy">

<p>At this point, one row represents one option contract snapshot. The fact that <code>quote_ok</code> is 0 across the board simply means bid and ask are not populated in this slice, even though surface IV, Greeks, and other analytics fields are present. That's still useful for building a monitoring database, because the core idea here is tracking the evolution of analytics over time, not reconstructing executable markets.</p>
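<p>To make the flag-and-deduplicate logic concrete, here is a small self-contained sketch on made-up rows. The option keys and values are invented for illustration (and shorter than the real six-part key). The only things carried over from the code above are the <code>quote_ok</code> rule and the sort-then-<code>drop_duplicates</code> step.</p>
<pre><code class="language-python">import pandas as pd

# Made-up rows: the same contract appears twice with different source timestamps.
rows = [
    {"option_key": "TSLA|2026-11-20|400|Call", "src_ts": "2026-04-14 18:09:01", "s_vol": 0.52, "o_bid": 0.0, "o_ask": 0.0},
    {"option_key": "TSLA|2026-11-20|400|Call", "src_ts": "2026-04-14 18:09:05", "s_vol": 0.53, "o_bid": 0.0, "o_ask": 0.0},
    {"option_key": "TSLA|2026-11-20|420|Call", "src_ts": "2026-04-14 18:09:03", "s_vol": 0.50, "o_bid": 11.2, "o_ask": 11.6},
]
df = pd.DataFrame(rows)

# Flag rows that have at least one non-zero side of the quote, instead of dropping the rest.
df["quote_ok"] = ((df["o_bid"] != 0) | (df["o_ask"] != 0)).astype(int)

# Keep the most recent record per contract, mirroring the dedup step in normalize_liq.
latest = (
    df.sort_values("src_ts")
      .drop_duplicates(subset=["option_key"], keep="last")
      .reset_index(drop=True)
)
print(latest[["option_key", "src_ts", "s_vol", "quote_ok"]])
</code></pre>
<p>The contract with zero bid and ask survives with <code>quote_ok</code> set to 0, and only its later record is kept.</p>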
<h2 id="heading-writing-to-the-database">Writing to the Database</h2>
<p>Now that we have a clean snapshot DataFrame, the job is to persist it in two places.</p>
<ul>
<li><p>History table: Append everything. This is the audit log.</p>
</li>
<li><p>Latest table: Upsert by <code>option_key</code>. This is the fast "current view."</p>
</li>
</ul>
<p>This separation is what makes the database useful. History lets you reconstruct any past snapshot. Latest lets you answer "what does the surface look like right now" without scanning time series.</p>
<pre><code class="language-python">def safe_add_column(table: str, col: str, col_type: str, path: str = db_path):
    conn = get_conn(path)
    cur = conn.cursor()
    existing = [r[1] for r in cur.execute(f"PRAGMA table_info({table});").fetchall()]
    if col not in existing:
        cur.execute(f"ALTER TABLE {table} ADD COLUMN {col} {col_type};")
    conn.commit()
    conn.close()

safe_add_column("implied_quote_history", "quote_ok", "INTEGER")
safe_add_column("implied_quote_latest", "quote_ok", "INTEGER")

def write_snapshot_to_db(df: pd.DataFrame, path: str = db_path) -&gt; tuple[int, int]:
    if df.empty:
        return 0, 0

    conn = get_conn(path)
    cur = conn.cursor()

    cols = [
        "asof_ts",
        "option_key","symbol","expiry","strike","cp",
        "calc_source","u_prc","years","rate",
        "s_vol","atm_vol","s_mark",
        "o_bid","o_ask","o_bid_iv","o_ask_iv",
        "delta","gamma","theta","vega",
        "quote_ok","src_ts"
    ]

    # Work on a copy so the caller's DataFrame is not modified.
    insert_df = df.copy()
    for c in cols:
        if c not in insert_df.columns:
            insert_df[c] = None

    insert_df = insert_df[cols]

    cur.executemany(
        """
        insert into implied_quote_history (
            asof_ts,
            option_key, symbol, expiry, strike, cp,
            calc_source, u_prc, years, rate,
            s_vol, atm_vol, s_mark,
            o_bid, o_ask, o_bid_iv, o_ask_iv,
            delta, gamma, theta, vega,
            quote_ok, src_ts
        ) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """,
        insert_df.itertuples(index=False, name=None)
    )
    history_inserted = cur.rowcount

    cur.executemany(
        """
        insert into implied_quote_latest (
            option_key,
            last_asof_ts, symbol, expiry, strike, cp,
            calc_source, u_prc, years, rate,
            s_vol, atm_vol, s_mark,
            o_bid, o_ask, o_bid_iv, o_ask_iv,
            delta, gamma, theta, vega,
            quote_ok, src_ts
        ) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        on conflict(option_key) do update set
            last_asof_ts=excluded.last_asof_ts,
            symbol=excluded.symbol,
            expiry=excluded.expiry,
            strike=excluded.strike,
            cp=excluded.cp,
            calc_source=excluded.calc_source,
            u_prc=excluded.u_prc,
            years=excluded.years,
            rate=excluded.rate,
            s_vol=excluded.s_vol,
            atm_vol=excluded.atm_vol,
            s_mark=excluded.s_mark,
            o_bid=excluded.o_bid,
            o_ask=excluded.o_ask,
            o_bid_iv=excluded.o_bid_iv,
            o_ask_iv=excluded.o_ask_iv,
            delta=excluded.delta,
            gamma=excluded.gamma,
            theta=excluded.theta,
            vega=excluded.vega,
            quote_ok=excluded.quote_ok,
            src_ts=excluded.src_ts
        """,
        insert_df[[
            "option_key","asof_ts","symbol","expiry","strike","cp",
            "calc_source","u_prc","years","rate",
            "s_vol","atm_vol","s_mark",
            "o_bid","o_ask","o_bid_iv","o_ask_iv",
            "delta","gamma","theta","vega",
            "quote_ok","src_ts"
        ]].itertuples(index=False, name=None)
    )
    latest_upserted = cur.rowcount

    conn.commit()
    conn.close()
    return history_inserted, latest_upserted

hist_n, latest_n = write_snapshot_to_db(snapshot_df)
print("history inserted:", hist_n)
print("latest upserted:", latest_n)
</code></pre>
<p>We batch write using <code>executemany</code> so inserts are fast even with thousands of option rows. The history insert is straightforward. The latest write uses a SQLite upsert keyed on <code>option_key</code>, which means if the contract already exists in the latest table, its fields are overwritten with the newest snapshot.</p>
<p>You should see:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/8fdbdeb1-a4f2-434d-a3c7-99f44e51ec5d.png" alt="History inserted: 1852, latest upserted: 1852" style="display:block;margin:0 auto" width="608" height="137" loading="lazy">

<p>After the first write, both tables have the same number of rows. That's expected, because there is only one snapshot in history so far. Once we start polling multiple snapshots, the history table will grow every cycle, while the latest table will stay roughly flat and continue updating in place.</p>
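<p>The difference between the two tables is easiest to see on a toy example. The sketch below uses an in-memory SQLite database with a deliberately reduced schema (three columns instead of the full set) and invented values. The table names <code>hist</code> and <code>latest</code> are hypothetical stand-ins for the real tables.</p>
<pre><code class="language-python">import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table hist (asof_ts text, option_key text, s_vol real)")
cur.execute("create table latest (option_key text primary key, last_asof_ts text, s_vol real)")

# Two invented snapshots covering the same two contracts.
snapshots = [
    [("2026-04-14T18:09:29Z", "A", 0.50), ("2026-04-14T18:09:29Z", "B", 0.48)],
    [("2026-04-14T18:09:39Z", "A", 0.51), ("2026-04-14T18:09:39Z", "B", 0.47)],
]

for snap in snapshots:
    # History: plain append.
    cur.executemany("insert into hist values (?, ?, ?)", snap)
    # Latest: upsert keyed on option_key.
    cur.executemany(
        """
        insert into latest (option_key, last_asof_ts, s_vol) values (?, ?, ?)
        on conflict(option_key) do update set
            last_asof_ts=excluded.last_asof_ts,
            s_vol=excluded.s_vol
        """,
        [(k, ts, v) for ts, k, v in snap],
    )
conn.commit()

hist_n = cur.execute("select count(*) from hist").fetchone()[0]
latest_n = cur.execute("select count(*) from latest").fetchone()[0]
latest_a = cur.execute("select s_vol from latest where option_key = 'A'").fetchone()[0]
print("history rows:", hist_n, "| latest rows:", latest_n, "| latest s_vol for A:", latest_a)
conn.close()
</code></pre>
<p>After two snapshots, the history table holds every row from both, while the latest table still has one row per contract carrying the newer value. Note that the <code>on conflict</code> upsert syntax needs SQLite 3.24 or newer.</p>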
<h2 id="heading-running-a-short-polling-capture">Running a Short Polling Capture</h2>
<p>At this point, the pipeline works end-to-end for a single snapshot. The whole point of the database, though, is to turn live analytics into a time series. So we run a short capture window and store multiple snapshots back-to-back.</p>
<p>This isn't meant to be a production scheduler. It's just a simple loop that runs for a couple of minutes, polls every few seconds, timestamps the snapshot, and writes it to both tables.</p>
<pre><code class="language-python">def poll_and_write(symbol: str, duration_s: int = poll_duration_s, interval_s: int = poll_interval_s):
    start = time.time()
    polls = 0
    total_hist = 0

    while time.time() - start &lt; duration_s:
        asof_ts = datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z")

        raw = fetch_live_implied_quote(symbol, limit=limit)
        df = normalize_liq(raw, asof_ts)

        hist_n, latest_n = write_snapshot_to_db(df)
        polls += 1
        total_hist += hist_n

        print(f"[{polls}] {asof_ts} snapshot_rows={len(df)} history+={hist_n} latest_upsert={latest_n}")
        time.sleep(interval_s)

    print(f"done. polls={polls}, total_history_added={total_hist}")

poll_and_write(symbol, duration_s=120, interval_s=10)
</code></pre>
<p>Each loop iteration represents one snapshot. We generate a UTC timestamp (<code>asof_ts</code>), pull the latest batch from LiveImpliedQuote, normalize it into rows, then write it into the database. The history table accumulates every snapshot. The latest table overwrites by <code>option_key</code>, so it always represents the most recent view.</p>
<p>One practical detail is worth calling out. The API call is capped by <code>limit</code>, so you're not guaranteed to receive an identical set of strikes and expiries every poll. That's why <code>snapshot_rows</code> can vary between iterations.</p>
<p>In production, you usually stabilize the slice by pinning specific expiries and a strike band or by interpolating IV to fixed moneyness points. For this tutorial, we're keeping ingestion simple and focusing on the database pattern and the monitoring queries it enables.</p>
<p>You should see per-poll telemetry like this:</p>
<pre><code class="language-plaintext">[1] 2026-04-14T18:09:29Z snapshot_rows=1454 history+=1454 latest_upsert=1454
...
done. polls=9, total_history_added=12806
</code></pre>
<p>This confirms the database is building a time series. Over nine polls, you stored 12,806 option rows in history. The latest table is updated each time, but it doesn't grow in the same way as history because it overwrites per contract key.</p>
<p>Starting with the next section, we stop writing and start querying.</p>
<h2 id="heading-analysis-smile-reconstruction-from-the-database">Analysis: Smile Reconstruction From the Database</h2>
<p>Once the data is in <code>implied_quote_history</code>, the workflow flips. We stop thinking in terms of "API responses" and start thinking in terms of "queries." This section does two things. First, it picks an expiry that has enough rows to be representative. Then it reconstructs the call-side volatility smile for that expiry across a few timestamps.</p>
<h3 id="heading-pick-an-expiry-with-good-coverage">Pick an Expiry with Good Coverage</h3>
<p>If you pick an expiry that only appears sporadically in the captured snapshots, the smile plot will be misleading. So we start by looking at which expiries have the most rows in the history table.</p>
<pre><code class="language-python">conn = get_conn()

expiry_counts = pd.read_sql_query(
    """
    select expiry, count(*) as n
    from implied_quote_history
    where symbol = ?
    group by expiry
    order by n desc
    limit 10
    """,
    conn,
    params=(symbol,)
)

conn.close()
expiry_counts
</code></pre>
<p>This query scans only the history table, filters to TSLA, and counts how many option rows exist per expiry across the capture window. We keep the top 10 and pick the first one as the expiry we'll reconstruct.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/2f7b897f-0a4f-4b1a-826e-0fee6b19f2bd.png" alt="Expiry-wise coverage" style="display:block;margin:0 auto" width="373" height="724" loading="lazy">

<p>The expiry date <code>2026-11-20</code> has the highest count.</p>
<p>Here, the count doesn't mean this expiry is "best" in any trading sense. It just means it showed up most consistently in the captured data. That makes it a practical choice for a clean smile comparison.</p>
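<p>Row counts mix two things: how many strikes an expiry has and how many snapshots it appeared in. If consistency across snapshots is what matters, counting distinct <code>asof_ts</code> values per expiry is a useful cross-check. The sketch below shows the idea with pandas on invented rows. Against the real database, the same thing could be expressed with <code>count(distinct asof_ts)</code> in the query.</p>
<pre><code class="language-python">import pandas as pd

# Invented history rows: expiry X has many strikes but appears in one snapshot,
# expiry Y has fewer strikes but appears in every snapshot.
hist = pd.DataFrame({
    "asof_ts": ["t1", "t1", "t1", "t1", "t1", "t2", "t3"],
    "expiry":  ["X",  "X",  "X",  "X",  "Y",  "Y",  "Y"],
    "strike":  [380,  400,  420,  440,  400,  400,  400],
})

coverage = (
    hist.groupby("expiry")
        .agg(rows=("strike", "size"), snapshots=("asof_ts", "nunique"))
        .sort_values(["snapshots", "rows"], ascending=False)
)
print(coverage)
</code></pre>
<p>In this toy data, X wins on raw row count but Y is present in every snapshot, so Y would be the safer pick for comparing smiles over time.</p>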
<h3 id="heading-rebuild-the-smile-across-snapshots">Rebuild the Smile Across Snapshots</h3>
<p>Now we query the stored history for one expiry, keep only calls, and plot surface IV (s_vol) against strike for multiple snapshot timestamps.</p>
<pre><code class="language-python">chosen_expiry = "2026-11-20" 

conn = get_conn()
smile = pd.read_sql_query(
    """
    select asof_ts, strike, cp, s_vol, u_prc
    from implied_quote_history
    where symbol = ? and expiry = ?
    """,
    conn,
    params=(symbol, chosen_expiry)
)
conn.close()

smile_calls = smile[smile["cp"] == "Call"].copy()

ts_list = sorted(smile_calls["asof_ts"].unique())
pick = [ts_list[0], ts_list[len(ts_list)//2], ts_list[-1]]

plt.figure(figsize=(9,5))
for ts in pick:
    g = smile_calls[smile_calls["asof_ts"] == ts].sort_values("strike")
    plt.plot(g["strike"], g["s_vol"], label=ts)

plt.title(f"{symbol} Vol Smile (Calls) | Expiry {chosen_expiry} | 3 snapshots")
plt.xlabel("Strike")
plt.ylabel("Implied Vol (s_vol)")
plt.grid(True)
plt.legend()
plt.show()
</code></pre>
<p>We pull all rows for the chosen expiry from history, then filter to calls so we don't mix put and call shapes. To keep the plot readable, we only plot three snapshots: the first, the middle, and the last.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/84416f80-9253-4f18-8da4-ea814e174987.png" alt="TSLA vol smile (calls)" style="display:block;margin:0 auto" width="778" height="475" loading="lazy">

<p>Over a short capture window, the smiles often overlap heavily. That doesn't mean the system isn't working. It usually means the surface didn't move much in those two minutes. The important part is that we can reconstruct and compare it purely from stored history.</p>
<h3 id="heading-zoom-in-around-spot">Zoom-In Around Spot</h3>
<p>The full-range plot is useful for shape, but it can hide small shifts near the region people actually care about. So we zoom to a band around the underlying price.</p>
<pre><code class="language-python">s0 = float(smile_calls["u_prc"].dropna().median())
low, high = s0 * 0.6, s0 * 1.4

for ts in pick:
    g = smile_calls[smile_calls["asof_ts"] == ts].sort_values("strike")
    g = g[(g["strike"] &gt;= low) &amp; (g["strike"] &lt;= high)]
    plt.plot(g["strike"], g["s_vol"], label=ts)

plt.title(f"{symbol} Vol Smile (Calls) | Expiry {chosen_expiry} | zoomed")
plt.xlabel("Strike")
plt.ylabel("Implied Vol (s_vol)")
plt.grid(True)
plt.legend(fontsize=8)
plt.show()
</code></pre>
<p>We take the median of the stored <code>u_prc</code> values as a robust spot proxy and then keep strikes between 60% and 140% of it. The goal is not precision. It's to make the chart readable and show whether the near-ATM region is drifting.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/107de4b4-7b40-4e79-a38b-fac96cb11b26.png" alt="TSLA vol smile (calls)  -  zoomed-in" style="display:block;margin:0 auto" width="781" height="475" loading="lazy">

<p>Here, even small changes become visible. This is also why storing history matters. If you only looked at one snapshot in isolation, these shifts would be easy to miss or dismiss.</p>
<h2 id="heading-analysis-atm-iv-and-skew-over-time">Analysis: ATM IV and Skew Over Time</h2>
<p>A full smile plot is useful, but it's not always the fastest way to monitor a surface. In practice, teams usually track a few summary numbers per expiry so they can spot changes quickly, then drill down only when something looks off.</p>
<p>Here we reduce each stored snapshot into two metrics for a single expiry.</p>
<ul>
<li><p>ATM IV: Surface IV at the strike closest to spot.</p>
</li>
<li><p>Skew proxy: Surface IV at 0.9 times spot minus surface IV at 1.1 times spot, using the closest available strikes.</p>
</li>
</ul>
<pre><code class="language-python">chosen_expiry = "2026-11-20"

conn = get_conn()
df = pd.read_sql_query(
    """
    select asof_ts, strike, s_vol, u_prc
    from implied_quote_history
    where symbol = ? and expiry = ? and cp = 'Call'
    """,
    conn,
    params=(symbol, chosen_expiry)
)
conn.close()

df["strike"] = df["strike"].astype(float)
df["s_vol"] = df["s_vol"].astype(float)

def closest_iv(grp: pd.DataFrame, target_strike: float):
    g = grp.iloc[(grp["strike"] - target_strike).abs().argsort()[:1]]
    return float(g["s_vol"].iloc[0]), float(g["strike"].iloc[0])

rows = []
for ts, grp in df.groupby("asof_ts"):
    spot = float(grp["u_prc"].dropna().median())
    atm_target = spot
    down_target = spot * 0.9
    up_target = spot * 1.1

    atm_iv, atm_k = closest_iv(grp, atm_target)
    down_iv, down_k = closest_iv(grp, down_target)
    up_iv, up_k = closest_iv(grp, up_target)

    rows.append({
        "asof_ts": ts,
        "spot": spot,
        "atm_strike": atm_k,
        "atm_iv": atm_iv,
        "k90": down_k,
        "iv_90": down_iv,
        "k110": up_k,
        "iv_110": up_iv,
        "skew_90_110": down_iv - up_iv
    })

metrics = pd.DataFrame(rows).sort_values("asof_ts").reset_index(drop=True)
metrics
</code></pre>
<p>We query the history table for one expiry and keep only calls, then group by snapshot timestamp. For each snapshot, we use the median <code>u_prc</code> as a spot proxy and pick the closest available strike to spot. That gives ATM IV. We repeat the same approach for 0.9 times spot and 1.1 times spot and compute a skew proxy as the difference.</p>
<p>The table also stores the actual strikes used (<code>atm_strike</code>, <code>k90</code>, <code>k110</code>). Option strikes are discrete, so the nearest strike can change between snapshots. Keeping the chosen strikes visible makes the metric explainable when it moves.</p>
<p>The output is a table with one row per snapshot timestamp and the computed metrics.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/5590b162-5fe7-4713-8f56-edc4c6171ab2.png" alt="ATM IV, skew proxy metrics" style="display:block;margin:0 auto" width="1000" height="441" loading="lazy">

<p>Now that we have a clean time series table, we can visualize the two metrics. First, ATM IV. Then, the skew proxy.</p>
<pre><code class="language-python">plt.plot(metrics["asof_ts"], metrics["atm_iv"])
plt.title(f"{symbol} ATM IV over time | Expiry {chosen_expiry}")
plt.xticks(rotation=30, ha="right")
plt.ylabel("ATM IV (s_vol)")
plt.grid(True)
plt.show()

plt.plot(metrics["asof_ts"], metrics["skew_90_110"])
plt.title(f"{symbol} Skew proxy (IV@0.9S - IV@1.1S) | Expiry {chosen_expiry}")
plt.xticks(rotation=30, ha="right")
plt.ylabel("Skew proxy")
plt.grid(True)
plt.show()
</code></pre>
<p>Here is the first chart, ATM IV over time.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/0df9b0ff-e02f-4c6b-b4ec-175ddc46522c.png" alt="TSLA ATM IV over time" style="display:block;margin:0 auto" width="831" height="453" loading="lazy">

<p>ATM IV tends to move slowly over short windows unless there is a sharp repricing event. In this run, it stays fairly stable, which is a realistic outcome for a short capture. The value here is that the database turns "fairly stable" into something you can quantify and compare later, rather than a vague impression.</p>
<p>Here is the second chart, the skew proxy over time.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/f90243ee-6039-4d7e-94ed-d248eaaf9722.png" alt="TSLA skew proxy" style="display:block;margin:0 auto" width="831" height="453" loading="lazy">

<p>The skew proxy is more sensitive because it's based on wing points. If it changes, it usually means the downside is being repriced differently from the upside for that expiry. One nuance is that the nearest available strike can change between snapshots, which can create step-like moves even when the surface isn't moving dramatically. That's why we keep k90 and k110 in the metrics table. It keeps the skew plot explainable.</p>
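<p>One way to remove those steps, as mentioned earlier, is to interpolate IV to fixed moneyness points instead of snapping to the nearest strike. The following is a minimal sketch using linear interpolation with <code>numpy.interp</code> on an invented smile. The helper name <code>iv_at_moneyness</code> and the numbers are assumptions for illustration, and linear interpolation in strike is a simplification rather than a recommended surface model.</p>
<pre><code class="language-python">import numpy as np

def iv_at_moneyness(strikes, vols, spot, moneyness):
    """Linearly interpolate IV at strike = moneyness * spot.

    np.interp needs increasing x values, so the inputs are sorted first.
    Targets outside the strike range are clamped to the edge values.
    """
    strikes = np.asarray(strikes, dtype=float)
    vols = np.asarray(vols, dtype=float)
    order = np.argsort(strikes)
    return float(np.interp(moneyness * spot, strikes[order], vols[order]))

# Invented call-side smile for one snapshot.
strikes = [300, 320, 340, 360, 380, 400]
vols = [0.60, 0.57, 0.54, 0.52, 0.51, 0.50]
spot = 350.0

iv_90 = iv_at_moneyness(strikes, vols, spot, 0.9)    # target strike near 315
iv_110 = iv_at_moneyness(strikes, vols, spot, 1.1)   # target strike near 385
skew = iv_90 - iv_110
print(round(iv_90, 4), round(iv_110, 4), round(skew, 4))
</code></pre>
<p>Because the target strikes now move continuously with spot, a small change in spot produces a small change in the metric instead of a jump to a different listed strike.</p>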
<h2 id="heading-alert-style-thresholds">Alert-Style Thresholds</h2>
<p>Once you have a metrics table per snapshot, adding a monitoring layer is straightforward. The idea isn't to generate trades. It's to flag when the surface moves enough that someone should look closer.</p>
<p>Here we do two checks:</p>
<ul>
<li><p>ATM IV change alert: Flag if ATM IV changes more than a small threshold between snapshots.</p>
</li>
<li><p>Skew change alert: Flag if the skew proxy changes more than a threshold between snapshots.</p>
</li>
</ul>
<pre><code class="language-python">alerts = metrics.copy()

alerts["atm_iv_change"] = alerts["atm_iv"].diff()
alerts["skew_change"] = alerts["skew_90_110"].diff()

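atm_thresh = 0.002    # absolute change in ATM s_vol between consecutive snapshots
skew_thresh = 0.003   # absolute change in the skew proxy between consecutive snapshots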

alerts["atm_alert"] = alerts["atm_iv_change"].abs() &gt;= atm_thresh
alerts["skew_alert"] = alerts["skew_change"].abs() &gt;= skew_thresh

alerts[[
    "asof_ts",
    "atm_iv", "atm_iv_change", "atm_alert",
    "skew_90_110", "skew_change", "skew_alert",
    "atm_strike", "k90", "k110"
]]
</code></pre>
<p>We take the per-snapshot metrics table and compute first differences. Then we compare those changes to thresholds and store boolean flags. The output table keeps both the metrics and the strikes used for the calculations, so any alert is explainable rather than a black box.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/b6805adc-90f6-4c57-8dee-aa6e0ec4d724.png" alt="Alerts dataframe" style="display:block;margin:0 auto" width="1500" height="546" loading="lazy">

<p>In this run, the ATM IV alerts are all false, while the skew alert triggers once.</p>
<p>The skew alert fires because the skew proxy jumps by more than the threshold between two snapshots. This is explainable. If you look at the table, you can see that the strikes used for the proxy changed around the same time (<code>k90</code> shifts from 340 to 315). Because strikes are discrete, nearest-strike metrics can step even when the surface is not moving dramatically.</p>
<p>To make this easier to read, we also plot the two series and mark alert points.</p>
<pre><code class="language-python">plt.plot(alerts["asof_ts"], alerts["atm_iv"])
for _, r in alerts[alerts["atm_alert"]].iterrows():
    plt.scatter(r["asof_ts"], r["atm_iv"], s=30, edgecolors="r", alpha=0.6, linewidth=2)
plt.title(f"{symbol} ATM IV with alerts | Expiry {chosen_expiry}")
plt.xticks(rotation=30, ha="right")
plt.grid(True)
plt.show()

plt.plot(alerts["asof_ts"], alerts["skew_90_110"])
for _, r in alerts[alerts["skew_alert"]].iterrows():
    plt.scatter(r["asof_ts"], r["skew_90_110"], s=30, edgecolors="r", alpha=0.6, linewidth=2)
plt.title(f"{symbol} Skew proxy with alerts | Expiry {chosen_expiry}")
plt.xticks(rotation=30, ha="right")
plt.grid(True)
plt.show()
</code></pre>
<p>Both plots use the same pattern. Plot the metric as a line, then overlay a marker on any timestamp where the corresponding alert flag is true. This makes it obvious when something crossed the threshold.</p>
<p>The chart below shows the skew proxy with alerts.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/eff87263-68f0-4132-935d-bdf148e73c82.png" alt="TSLA skew proxy with alerts" style="display:block;margin:0 auto" width="831" height="453" loading="lazy">

<p>This chart shows one alert marker, which matches what we saw in the table.</p>
<p>The ATM IV plot isn't featured since there are no alert points.</p>
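<p>Since a change of nearest strike can trigger the skew alert on its own, one option is to record whether the strikes changed between snapshots and treat those alerts separately. The sketch below works on an invented metrics table with the same column names as above. The <code>strike_switched</code> and <code>skew_alert_clean</code> columns are hypothetical additions, and the threshold is simply reused from the example.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Invented metrics in the same shape as the table above.
metrics = pd.DataFrame({
    "asof_ts": ["t1", "t2", "t3", "t4"],
    "k90": [340.0, 340.0, 315.0, 315.0],
    "k110": [385.0, 385.0, 385.0, 385.0],
    "skew_90_110": [0.040, 0.041, 0.049, 0.054],
})
skew_thresh = 0.003

alerts = metrics.copy()
alerts["skew_change"] = alerts["skew_90_110"].diff()
alerts["skew_alert"] = alerts["skew_change"].abs().ge(skew_thresh)

# True when either wing strike differs from the previous snapshot.
# The first row has no predecessor, so it is never marked as a switch.
switched = alerts["k90"].ne(alerts["k90"].shift()) | alerts["k110"].ne(alerts["k110"].shift())
switched.iloc[0] = False
alerts["strike_switched"] = switched

# Alerts that are not explained by a strike switch.
alerts["skew_alert_clean"] = np.logical_and(alerts["skew_alert"], ~alerts["strike_switched"])

print(alerts[["asof_ts", "k90", "skew_change", "skew_alert", "strike_switched", "skew_alert_clean"]])
</code></pre>
<p>Here the jump at <code>t3</code> coincides with <code>k90</code> moving from 340 to 315, so it is set aside, while the move at <code>t4</code> happens on unchanged strikes and remains flagged. Whether to suppress or merely annotate strike-switch alerts is a judgment call for whoever monitors the surface.</p>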
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>In this walkthrough, we used SpiderRock MLink's LiveImpliedQuote feed for TSLA and turned it into a small internal database you can query. We stored every snapshot in an append-only history table, maintained a latest view keyed by a stable option identifier, then used that stored data to rebuild a smile, track ATM surface IV and a simple skew proxy, and add a basic alert rule on top.</p>
<p>This fits well in B2B workflows because it turns live analytics into something operational: a dataset you can audit, replay, and monitor. The same pattern works whether you're building an internal dashboard, running routine surface checks for a desk, or doing a quick post-event review without relying on screenshots and one-off notebook runs.</p>
<p>If you want to extend it, the most practical next steps are longer capture windows, tracking multiple symbols, and moving from SQLite to Postgres once the data volume grows. If metric stability becomes important, you can also standardize the slice you track per poll or interpolate IV to fixed moneyness points so skew measures don't step when nearest strikes change.</p>
<p>With that being said, you've reached the end of the article. Hope you learned something new and useful.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Market Research Copilot with MCP and Python [Full Handbook] ]]>
                </title>
                <description>
                    <![CDATA[ Most financial AI tools are good at one thing: summarizing a stock. You ask about Apple, NVIDIA, or Tesla, and they give you a clean overview of price action, a few ratios, and maybe some company cont ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-market-research-copilot-with-mcp-and-python-handbook/</link>
                <guid isPermaLink="false">69fb845950ecad45335e0fe2</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mcp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                    <category>
                        <![CDATA[ stockmarket ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nikhil Adithyan ]]>
                </dc:creator>
                <pubDate>Wed, 06 May 2026 18:11:37 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/97192f8e-e5c5-4339-8974-90d823d93a86.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most financial AI tools are good at one thing: summarizing a stock. You ask about Apple, NVIDIA, or Tesla, and they give you a clean overview of price action, a few ratios, and maybe some company context. That can be useful, but it falls short the moment the task becomes more like real research.</p>
<p>Real research usually starts with a view. Not a ticker. A trader, analyst, or product team is more likely to ask something like, “Apple looks attractive because downside has been controlled and business quality remains high. Does the data actually support that?” That's a different problem. A summary can't answer it properly because the system needs to test the claim itself, not just describe the company around it.</p>
<p>In this tutorial, we're going to build a financial research copilot that does exactly that. It takes a natural-language thesis, pulls historical prices and fundamentals through EODHD’s MCP server, turns those inputs into structured evidence, and returns a short research memo with a verdict.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-this-copilot-actually-produces">What This Copilot Actually Produces</a></p>
</li>
<li><p><a href="#heading-what-makes-this-different-from-a-normal-stock-assistant">What Makes This Different from a Normal Stock Assistant</a></p>
</li>
<li><p><a href="#heading-the-workflow">The Workflow</a></p>
</li>
<li><p><a href="#heading-building-the-mcp-client">Building the MCP Client</a></p>
</li>
<li><p><a href="#heading-setting-up-corepyhttpcorepy">Setting Up core.py</a></p>
</li>
<li><p><a href="#heading-parsing-a-research-prompt-into-a-structured-request">Parsing a Research Prompt into a Structured Request</a></p>
</li>
<li><p><a href="#heading-fetching-the-two-data-sources-historical-amp-fundamental-data">Fetching the Two Data Sources: Historical &amp; Fundamental Data</a></p>
</li>
<li><p><a href="#heading-building-the-first-evidence-layer-from-price-data">Building the First Evidence Layer from Price Data</a></p>
</li>
<li><p><a href="#heading-building-the-second-evidence-layer-from-fundamentals">Building the Second Evidence Layer from Fundamentals</a></p>
</li>
<li><p><a href="#heading-what-do-we-have-so-far">What do we have so far?</a></p>
</li>
<li><p><a href="#heading-classifying-the-thesis">Classifying the Thesis</a></p>
</li>
<li><p><a href="#heading-turning-signals-into-support-contradiction-and-missing-evidence">Turning Signals into Support, Contradiction, and Missing Evidence</a></p>
<ul>
<li><a href="#heading-sanity-check-jupyter-notebook">Sanity Check (Jupyter Notebook)</a></li>
</ul>
</li>
<li><p><a href="#heading-assigning-a-verdict">Assigning a Verdict</a></p>
</li>
<li><p><a href="#heading-building-the-facts-object">Building the Facts Object</a></p>
<ul>
<li><p><a href="#heading-1-company-context">1. Company Context</a></p>
</li>
<li><p><a href="#heading-2-single-stock-facts-builder">2. Single-Stock Facts Builder</a></p>
</li>
<li><p><a href="#heading-3-watchlist-facts-builder">3. Watchlist Facts Builder</a></p>
</li>
<li><p><a href="#heading-sanity-check-jupyter-notebook-1">Sanity Check (Jupyter Notebook)</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-writing-the-final-memo">Writing the Final Memo</a></p>
<ul>
<li><a href="#heading-sanity-check-jupyter-notebook-2">Sanity Check (Jupyter Notebook)</a></li>
</ul>
</li>
<li><p><a href="#heading-stitching-everything-together">Stitching Everything Together</a></p>
</li>
<li><p><a href="#heading-demo-time-jupyter-notebook">Demo Time! (Jupyter Notebook)</a></p>
<ul>
<li><p><a href="#heading-demo-1-testing-whether-a-premium-is-actually-justified">Demo 1. Testing Whether a Premium Is Actually Justified</a></p>
</li>
<li><p><a href="#heading-demo-2-testing-whether-volatility-is-too-high-for-the-underlying-business">Demo 2. Testing Whether Volatility Is Too High for the Underlying Business</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ul>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>Before starting, make sure you have the following in place.</p>
<p>You will need Python 3.10 or later (the <code>mcp</code> package does not support older versions), along with these libraries: <code>mcp</code>, <code>openai</code>, <code>numpy</code>, and <code>pandas</code>. Install them with <code>pip install mcp openai numpy pandas</code> before running any code.</p>
<p>You will also need two API keys: one from EODHD for historical prices and fundamentals data, and one from OpenAI for parsing and memo generation. If you don't have an EODHD key, you can get one by registering for a developer account at <a href="http://eodhd.com">eodhd.com</a>.</p>
<p>The tutorial assumes basic familiarity with Python and async programming. You don't need a background in finance, but it helps to understand what a P/E ratio and drawdown mean before reading the evidence-building sections.</p>
<p>A Jupyter notebook environment is recommended for running the sanity checks, though any Python environment that supports <code>await</code> will work.</p>
<h2 id="heading-what-this-copilot-actually-produces">What This Copilot Actually&nbsp;Produces</h2>
<p>Before getting into the pipeline, it helps to see the kind of output we're building toward. The easiest way to understand this project is to look at one real example.</p>
<p>Suppose the user gives the system this prompt:</p>
<blockquote>
<p>I think Apple looks attractive because downside has been controlled and business quality remains high. Can you test that for AAPL over the last 180&nbsp;days?</p>
</blockquote>
<p>The copilot doesn't respond with a loose summary of Apple. It turns that into a structured research memo:</p>
<pre><code class="language-plaintext">1. Thesis under review  

Apple appears attractive due to controlled downside and sustained high business 
quality.

2. Supporting evidence  

Over the past 180 days, maximum drawdown was limited to -13.82%, suggesting relatively contained downside. Profitability metrics are strong, with a 35.37% operating margin and 27.04% profit margin. Returns on capital are high, with ROA at 24.38% and ROE at 152.02%, indicating efficient asset use and strong capital efficiency. Growth metrics support ongoing business strength, with quarterly revenue growth of 15.70% and earnings growth of 18.30% year-over-year. Forward estimates also remain positive, with expected earnings growth of 9.68% and
revenue growth of 6.87%.

3. Evidence that weakens the thesis  

Net EPS revisions over the past 30 days are negative (-3), indicating some deterioration in analyst sentiment.

4. Missing evidence  

No material gaps in the provided dataset.

5. Verdict  

partially_supported - There is more supporting evidence than contradicting evidence, but the thesis is not fully confirmed.

6. Bottom-line assessment  

Apple demonstrates strong and consistent business quality supported by high margins, returns, and continued growth. Downside has been relatively contained over the observed period, though not negligible. However, negative earnings 
revisions introduce some caution, leaving the thesis supported but not conclusively established.
</code></pre>
<p>This example makes the goal of the project much clearer. We're not building a system that simply tells us what happened to Apple. We're building one that takes a claim, checks it against market and fundamentals data, and returns a structured judgment.</p>
<p>That distinction matters because the memo is only the final surface. Underneath it, the system first parses the thesis, pulls prices and fundamentals through <a href="https://eodhd.com/financial-apis/mcp-server-for-financial-data-by-eodhd"><strong>EODHD’s MCP server</strong></a>, computes the relevant signals, builds support and contradiction, assigns a verdict, and only then writes the final note. That's what gives the output its structure.</p>
<p>We’ll start by building everything up to the evidence layers that power this kind of output, then move on to the verdict and the memo.</p>
<h2 id="heading-what-makes-this-different-from-a-normal-stock-assistant">What Makes This Different from a Normal Stock Assistant</h2>
<img src="https://cdn-images-1.medium.com/max/1000/1*rJirKoA1xWiuZjyENZypGg.png" alt="Stock assistant vs Thesis copilot workflow comparison" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>A normal stock assistant starts with a ticker and tries to explain what happened. It may summarize price action, mention a few ratios, and add some company context. That is useful when the question is broad, but it's not enough when the input is a specific investment view.</p>
<p>This project starts from the opposite direction. The input is not “tell me about Apple.” The input is a claim, such as “Apple looks attractive because downside has been controlled and business quality remains high.” That changes the job of the system. It now has to test each part of that claim, decide what supports it, decide what weakens it, and be clear about what's still missing.</p>
<p>That one shift is what shapes the whole workflow. Instead of ending at retrieval and summarization, the pipeline has to parse the thesis, map the data to the right kind of evidence, and return a verdict. That's what makes this feel like a research copilot rather than a better stock summary tool.</p>
<h2 id="heading-the-workflow">The Workflow</h2>
<p>At a high level, the copilot follows a simple sequence:</p>
<ul>
<li><p>parse the user’s thesis into a structured request</p>
</li>
<li><p>fetch historical prices and fundamentals through MCP</p>
</li>
<li><p>turn those inputs into market and business signals</p>
</li>
<li><p>map those signals into support, contradiction, and missing evidence</p>
</li>
<li><p>assign a verdict</p>
</li>
<li><p>write the final memo</p>
</li>
</ul>
<p>That's the full loop. The output may look like a short research note, but it sits on top of a more controlled pipeline in <code>core.py</code>.</p>
<h4 id="heading-project-structure">Project structure:</h4>
<pre><code class="language-plaintext">project/
├── client.py
├── core.py
└── test.ipynb
</code></pre>
<p><code>client.py</code> is the MCP access layer. It connects to EODHD, lists tools, calls them with retries and timeouts, and returns metadata for each request. <code>core.py</code> contains the actual thesis-testing logic, including parsing, data fetching, signal computation, evidence building, verdict assignment, and memo generation. <code>test.ipynb</code> is where the quality checks and end-to-end demos are run.</p>
<p>This split is useful because it keeps the tutorial easy to follow. When we move into code, each block has a clear place. MCP access stays in <code>client.py</code>, while the research workflow stays in <code>core.py</code>.</p>
<h2 id="heading-building-the-mcp-client">Building the MCP&nbsp;Client</h2>
<p>We’ll start with the thinnest part of the project, which is the MCP access layer.</p>
<p>This file only does one job. It connects to EODHD’s MCP server, lists available tools, calls a tool with retries and a timeout, and returns a small metadata object alongside the response. The actual thesis logic doesn't belong here. Keeping this layer small makes the rest of the project much easier to reason about later.</p>
<p>Create a file called <code>client.py</code> and add this:</p>
<pre><code class="language-python">import time
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

class EODHDMCP:
    def __init__(self, apikey, base_url=None):
        self.apikey = apikey
        self.base_url = base_url or "https://mcp.eodhd.dev/mcp"
        self._tools = None

    def _url(self):
        return f"{self.base_url}?apikey={self.apikey}"

    def _open(self):
        return streamablehttp_client(self._url())

    async def list_tools(self):
        if self._tools is not None:
            return self._tools

        async with self._open() as (read, write, _):
            async with ClientSession(read, write) as s:
                await s.initialize()
                resp = await s.list_tools()
                self._tools = [t.name for t in resp.tools]
                return self._tools

    async def call_tool(self, name, args, trace_id, timeout_s=25, retries=2):
        last = None

        for attempt in range(retries + 1):
            t0 = time.time()
            try:
                async with self._open() as (read, write, _):
                    async with ClientSession(read, write) as s:
                        await s.initialize()
                        out = await asyncio.wait_for(s.call_tool(name, args), timeout=timeout_s)
                        dt = time.time() - t0
                        meta = {
                            "trace_id": trace_id,
                            "tool": name,
                            "args": args,
                            "latency_s": round(dt, 3),
                        }
                        return out, meta
            except Exception as e:
                last = e
                if attempt &lt; retries:
                    await asyncio.sleep(0.5 * (attempt + 1))

        raise last
</code></pre>
<p>There are only two methods that really matter here. <code>list_tools()</code> is just a quick way to inspect and cache the tools exposed by the MCP server. <code>call_tool()</code> is the method the rest of the project will actually use. It makes the request, applies timeout and retry handling, and returns both the raw output and a small metadata object.</p>
<p>That metadata becomes useful later because the workflow stays traceable. When the copilot returns a memo, we still know which tool was called, with what arguments, and how long it took. So even though this file is small, it gives the rest of the system a clean and inspectable access layer.</p>
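<p>To see the retry-and-timeout pattern in isolation, here is a minimal sketch that runs without a network connection. The names <code>call_with_retries</code> and <code>flaky_tool</code> are hypothetical stand-ins and are not part of <code>client.py</code> or the MCP library. The loop only mirrors the shape of <code>call_tool()</code>: wrap the call in <code>asyncio.wait_for</code>, back off a little longer after each failure, and re-raise the last error once the attempts run out.</p>

```python
import asyncio

async def call_with_retries(make_call, timeout_s=1.0, retries=2, base_delay_s=0.01):
    # Hypothetical helper mirroring the loop in call_tool(): try, time out,
    # back off a little longer each time, and re-raise the last error.
    last = None
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except Exception as e:
            last = e
            if attempt != retries:
                await asyncio.sleep(base_delay_s * (attempt + 1))
    raise last

attempts = {"n": 0}

async def flaky_tool():
    # Stand-in for a remote tool: fails twice, then succeeds on the third call.
    attempts["n"] += 1
    if attempts["n"] != 3:
        raise ConnectionError("simulated network failure")
    return "ok"

result = asyncio.run(call_with_retries(flaky_tool))
print(result, attempts["n"])
```

<p>With <code>retries=2</code> the helper makes up to three attempts, which is why the simulated tool succeeds on its last try. Inside a Jupyter notebook, an event loop is already running, so you would write <code>await call_with_retries(flaky_tool)</code> instead of using <code>asyncio.run()</code>.</p>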
<h2 id="heading-setting-up-corepy">Setting Up&nbsp;<code>core.py</code></h2>
<p>Now that the MCP client is ready, we can start building the main workflow in <code>core.py</code>.</p>
<p>This file will hold the actual thesis-testing logic, so the first step is to set up the imports, API clients, a few limits, and some small helper functions that the rest of the pipeline will reuse.</p>
<p>Create a file called <code>core.py</code> and start with this:</p>
<pre><code class="language-python">import json
import re
import time
import uuid
import asyncio
from datetime import date, timedelta

import numpy as np
import pandas as pd
from openai import OpenAI

from client import EODHDMCP

eodhd_api_key = "your eodhd api key"
mcp_base_url = "https://mcp.eodhd.dev/mcp"

openai_api_key = "your openai api key"
model_name = "gpt-5.3-chat-latest"

max_lookback_days = 365
max_tool_calls = 10
max_tickers = 5

mcp = EODHDMCP(eodhd_api_key, base_url=mcp_base_url)
oa = OpenAI(api_key=openai_api_key)

def log_event(event, trace_id, **extra):
    payload = {
        "event": event,
        "trace_id": trace_id,
        "ts": round(time.time(), 3),
    }
    payload.update(extra)
    print(json.dumps(payload, default=str))

def get_dates_from_lookback(days):
    end = date.today()
    start = end - timedelta(days=int(days))
    return start.isoformat(), end.isoformat()

def make_state():
    return {
        "tool_calls": 0,
        "tool_trace": [],
    }

def bump_tool_call(state, meta):
    state["tool_calls"] += 1
    state["tool_trace"].append(meta)

    if state["tool_calls"] &gt; max_tool_calls:
        raise RuntimeError("tool call budget exceeded")

def to_text(out):
    if isinstance(out, str):
        return out.strip()

    if hasattr(out, "content"):
        try:
            parts = []
            for item in out.content:
                if hasattr(item, "text") and item.text is not None:
                    parts.append(item.text)
                else:
                    parts.append(str(item))
            return "\n".join(parts).strip()
        except Exception:
            pass

    return str(out).strip()
</code></pre>
<p>Note: Replace <code>"your eodhd api key"</code> and <code>"your openai api key"</code> with your actual EODHD and OpenAI API keys. If you don’t have an EODHD key, you can obtain one by opening an EODHD developer account.</p>
<p>This block does three things:</p>
<ul>
<li><p>First, it sets up the two clients we need. <code>mcp</code> is the EODHD MCP client from <code>client.py</code>, and <code>oa</code> is the OpenAI client that will be used for parsing and memo generation later.</p>
</li>
<li><p>Second, it defines a few small limits for the workflow. These help keep the system controlled by capping the lookback window, the number of tickers, and the number of tool calls in a single run.</p>
</li>
<li><p>Third, it adds helper functions that the rest of the file depends on. <code>log_event()</code> gives us lightweight tracing, <code>get_dates_from_lookback()</code> converts a lookback window into start and end dates, <code>make_state()</code> and <code>bump_tool_call()</code> help track MCP usage, and <code>to_text()</code> safely converts tool output into plain text before we parse it.</p>
</li>
</ul>
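<p>The role of <code>to_text()</code> is easier to see with a concrete input. The sketch below uses a compact copy of the helper (without the defensive <code>try</code>/<code>except</code>) so it runs on its own, and it assumes a tool result looks like an object whose <code>content</code> attribute is a list of items with a <code>text</code> field. The <code>fake_result</code> object is a hypothetical stand-in built with <code>SimpleNamespace</code>, not a real MCP response.</p>

```python
from types import SimpleNamespace

def to_text(out):
    # Compact copy of the helper above so this snippet runs on its own.
    if isinstance(out, str):
        return out.strip()
    if hasattr(out, "content"):
        parts = [
            item.text if getattr(item, "text", None) is not None else str(item)
            for item in out.content
        ]
        return "\n".join(parts).strip()
    return str(out).strip()

# Hypothetical stand-in for a tool result: an object whose .content
# is a list of items, each carrying a .text payload.
fake_result = SimpleNamespace(
    content=[SimpleNamespace(text='[{"date": "2026-01-02", "close": 101.5}]')]
)

print(to_text(fake_result))
print(to_text("  already a string  "))
```

<p>Either way, the rest of the pipeline receives a plain string that it can hand to <code>json.loads()</code>.</p>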
<h2 id="heading-parsing-a-research-prompt-into-a-structured-request">Parsing a Research Prompt into a Structured Request</h2>
<p>The first thing this copilot needs to do is clean up the input. A user isn't going to send a perfectly formatted request every time. They're more likely to write a research thought in plain English and mix the thesis, ticker, and timeframe into one prompt.</p>
<p>That is why the system starts by turning the raw prompt into four fields:</p>
<ul>
<li><p>ticker</p>
</li>
<li><p>lookback window</p>
</li>
<li><p>thesis</p>
</li>
<li><p>mode</p>
</li>
</ul>
<p>This logic goes into <code>core.py</code>.</p>
<pre><code class="language-python">def parse_request(text):
    prompt = f"""
You are extracting fields for a financial thesis-testing copilot.

Return only valid JSON with this exact shape:
{{
  "tickers": ["AAPL"],
  "lookback_days": 180,
  "thesis": "the actual thesis statement",
  "mode": "single"
}}

Rules:
- Extract only tickers explicitly mentioned or strongly implied.
- Do not invent tickers.
- If there are multiple tickers, mode must be "watchlist".
- If there is one ticker, mode must be "single".
- If no timeframe is mentioned, use 180.
- Convert months to days using 30 days per month.
- Convert years to days using 365 days per year.
- Keep the thesis concise but faithful to the user's intent.
- Return JSON only. No markdown. No explanation.

User request:
{text}
""".strip()

    r = oa.responses.create(
        model=model_name,
        input=[{"role": "user", "content": prompt}],
    )

    raw = r.output_text.strip()

    try:
        parsed = json.loads(raw)
    except Exception:
        raise RuntimeError(f"parser returned non-json text: {raw[:500]}")

    return parsed
</code></pre>
<p>This function gives the model one very narrow job. It's not asking for an opinion or analysis. It's only asking for structured extraction. That matters because we want flexibility at the input layer, but we don't want the whole workflow to become fuzzy.</p>
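<p>One practical wrinkle: models sometimes wrap JSON in a markdown code fence or add a stray sentence around it, even when told not to, and a bare <code>json.loads()</code> will then fail. The following is an optional hardening sketch, not part of the original <code>core.py</code>. The helper name <code>extract_json</code> is hypothetical, and whether you need it depends on how the model you use behaves.</p>

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the snippet readable

def extract_json(raw):
    # Hypothetical helper: tolerate a fenced reply or stray text around the object.
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence (with optional language tag) and the closing fence.
        text = re.sub(r"^`{3}[a-zA-Z]*\s*", "", text)
        text = re.sub(r"\s*`{3}$", "", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, if there is one.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end == -1:
            raise
        return json.loads(text[start:end + 1])

reply = FENCE + 'json\n{"tickers": ["AAPL"], "lookback_days": 180}\n' + FENCE
print(extract_json(reply))
```

<p>If you adopt something like this, you could call it in place of <code>json.loads(raw)</code> inside <code>parse_request()</code> and keep the existing error handling around it.</p>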
<p>Once the model returns that JSON, Python takes over and tightens it up.</p>
<pre><code class="language-python">def enforce_limits(parsed):
    tickers = parsed.get("tickers", [])
    if not isinstance(tickers, list):
        tickers = []

    tickers = [str(x).upper().strip() for x in tickers if str(x).strip()]
    tickers = tickers[:max_tickers]

    lookback_days = parsed.get("lookback_days", 180)
    try:
        lookback_days = int(lookback_days)
    except Exception:
        lookback_days = 180

    if lookback_days &lt; 1:
        lookback_days = 1
    if lookback_days &gt; max_lookback_days:
        lookback_days = max_lookback_days

    thesis = str(parsed.get("thesis", "")).strip()
    if not thesis:
        thesis = "No thesis provided."

    # Mode is derived from the ticker count, so any "mode" value from the model is ignored.
    mode = "watchlist" if len(tickers) &gt; 1 else "single"

    return {
        "tickers": tickers,
        "lookback_days": lookback_days,
        "thesis": thesis,
        "mode": mode,
    }
</code></pre>
<p>This second function is what keeps the workflow controlled. It cleans the tickers, caps how many we allow in one request, clamps the time window, and makes sure the mode matches the number of tickers. So the model gives us flexibility, while the code gives us boundaries. That combination is important for a build like this.</p>
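<p>To make the clamping behavior concrete, here is a small standalone sketch of just the lookback handling. <code>clamp_lookback</code> and <code>MAX_LOOKBACK_DAYS</code> are hypothetical names used for illustration. The logic is meant to match what <code>enforce_limits()</code> does for this one field, written with the <code>max(1, min(...))</code> idiom.</p>

```python
MAX_LOOKBACK_DAYS = 365  # mirrors max_lookback_days in core.py

def clamp_lookback(value, default=180):
    # Hypothetical one-function version of the lookback handling in enforce_limits().
    try:
        days = int(value)
    except (TypeError, ValueError, OverflowError):
        days = default
    # Keep the window between 1 day and the configured cap.
    return max(1, min(days, MAX_LOOKBACK_DAYS))

for raw in (730, "90", 0, None, "two years"):
    print(repr(raw), clamp_lookback(raw))
```

<p>A two-year request is cut to 365 days, a zero becomes 1, and anything that can't be read as an integer falls back to the 180-day default.</p>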
<h2 id="heading-fetching-the-two-data-sources-historical-amp-fundamental-data">Fetching the Two Data Sources: Historical &amp; Fundamental Data</h2>
<p>Once the request is parsed, the next step is to pull the data that will feed the rest of the workflow. For this version, we only use two sources from EODHD: historical prices and fundamentals. That's enough to test a surprising number of thesis types without making the build unnecessarily wide.</p>
<p>Add these two functions to <code>core.py</code>:</p>
<pre><code class="language-python">async def fetch_prices(ticker, start_date, end_date, trace_id, state):
    args = {
        "ticker": ticker,
        "start_date": start_date,
        "end_date": end_date,
        "period": "d",
        "order": "a",
        "fmt": "json",
    }

    out, meta = await mcp.call_tool("get_historical_stock_prices", args, trace_id)
    text = to_text(out)

    bump_tool_call(state, meta)

    if not text:
        raise RuntimeError("empty response from get_historical_stock_prices")

    try:
        data = json.loads(text)
    except Exception:
        raise RuntimeError(f"price tool returned non-json text: {text[:300]}")

    if isinstance(data, dict) and data.get("error"):
        raise RuntimeError(data["error"])

    df = pd.DataFrame(data)
    if df.empty:
        return df

    keep = [c for c in ["date", "close"] if c in df.columns]
    df = df[keep].copy()
    df["ticker"] = ticker

    return df

async def fetch_fundamentals(ticker, trace_id, state):
    args = {
        "ticker": ticker,
        "include_financials": False,
        "fmt": "json",
    }

    out, meta = await mcp.call_tool("get_fundamentals_data", args, trace_id)
    text = to_text(out)

    bump_tool_call(state, meta)

    if not text:
        raise RuntimeError("empty response from get_fundamentals_data")

    try:
        data = json.loads(text)
    except Exception:
        raise RuntimeError(f"fundamentals tool returned non-json text: {text[:300]}")

    if isinstance(data, dict) and data.get("error"):
        raise RuntimeError(data["error"])

    return data
</code></pre>
<ul>
<li><p><code>fetch_prices()</code> pulls daily historical data for the requested window and reduces it to the fields we actually need right now: <code>date</code>, <code>close</code>, and the ticker itself. That trimmed DataFrame is what we'll later use for return, drawdown, volatility, trend, and other market signals.</p>
</li>
<li><p><code>fetch_fundamentals()</code> keeps the fundamentals payload as JSON because we'll extract different categories from it in the next sections, including margins, growth, valuation, revisions, and beta.</p>
</li>
</ul>
<p>A couple of details matter here. Both functions run through the same MCP wrapper, so they automatically inherit the timeout, retry, and metadata handling we already built in <code>client.py</code>. Both also call <code>bump_tool_call()</code>, which lets us track how many external calls were made during a single run. That becomes useful later when we want the workflow to stay inspectable rather than feel like a black box.</p>
<h2 id="heading-building-the-first-evidence-layer-from-price-data">Building the First Evidence Layer from Price&nbsp;Data</h2>
<p>Once the price data is in, the next step is to turn that raw series into something we can actually reason with. For this copilot, price history isn't the final answer, but it is still the first evidence layer. It helps us test claims around downside control, risk, momentum, and the quality of returns.</p>
<p>Add this to <code>core.py</code>:</p>
<pre><code class="language-python">def compute_price_signals(prices_df):
    if prices_df is None or prices_df.empty:
        return {}

    df = prices_df.copy()
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df["close"] = pd.to_numeric(df["close"], errors="coerce")

    df = df.dropna(subset=["date", "close"]).sort_values("date")
    if df.empty:
        return {}

    close = df["close"]
    rets = close.pct_change().dropna()

    out = {
        "n_points": int(len(close)),
        "start_price": float(close.iloc[0]),
        "end_price": float(close.iloc[-1]),
    }

    if len(close) &gt;= 2:
        out["ret_total"] = float(close.iloc[-1] / close.iloc[0] - 1)

    if not rets.empty:
        vol_daily = float(rets.std())
        vol_annualized = float(vol_daily * np.sqrt(252))

        out["vol_daily"] = vol_daily
        out["vol_annualized"] = vol_annualized

        if vol_annualized &gt; 0 and "ret_total" in out:
            out["ret_to_vol"] = float(out["ret_total"] / vol_annualized)

    peak = close.cummax()
    drawdown = close / peak - 1
    out["max_drawdown"] = float(drawdown.min())

    logp = np.log(close.values)
    x = np.arange(len(logp))
    if len(logp) &gt;= 3:
        out["trend_slope"] = float(np.polyfit(x, logp, 1)[0])
    else:
        out["trend_slope"] = 0.0

    return out
</code></pre>
<p>This function gives us a compact set of market signals from a plain close-price series. <code>ret_total</code> tells us how the stock moved over the full window. <code>vol_annualized</code> tells us how noisy that move was. <code>max_drawdown</code> is useful when the thesis talks about downside control. <code>trend_slope</code> gives us a simple directional measure, and <code>ret_to_vol</code> helps us judge return quality instead of looking at raw return alone.</p>
<p>The important point here is that we aren't asking the model to infer all of this from raw prices. We compute it first in Python, so the later reasoning step starts from explicit signals rather than vague interpretation. That makes the whole workflow much more stable.</p>
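<p>If you want to check these signals by hand, the arithmetic is small enough to reproduce with the standard library alone. The sketch below uses four made-up closing prices, not real market data, and recomputes total return, annualized volatility, and maximum drawdown the same way <code>compute_price_signals()</code> defines them.</p>

```python
import math
from itertools import accumulate
from statistics import stdev

closes = [100.0, 110.0, 99.0, 105.0]  # made-up prices, not real market data

# Daily simple returns, the same quantity pct_change() produces.
rets = [b / a - 1 for a, b in zip(closes, closes[1:])]

ret_total = closes[-1] / closes[0] - 1
vol_annualized = stdev(rets) * math.sqrt(252)  # sample std, like pandas' default

# Drawdown is each close relative to the running peak up to that day.
peaks = list(accumulate(closes, max))
max_drawdown = min(c / p - 1 for c, p in zip(closes, peaks))

print(round(ret_total, 4), round(vol_annualized, 4), round(max_drawdown, 4))
```

<p>The series ends 5% above where it started, but the worst peak-to-trough fall is about 10% (from 110 to 99). That gap between the two numbers is exactly why the copilot tracks drawdown separately from return.</p>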
<h2 id="heading-building-the-second-evidence-layer-from-fundamentals">Building the Second Evidence Layer from Fundamentals</h2>
<p>Price data gives us one side of the thesis. The second side comes from fundamentals. This is the part that makes the project stop sounding generic. Once the copilot starts treating fundamentals as actual evidence, instead of just company profile data, the outputs become much more useful.</p>
<p>Add this helper first in <code>core.py</code>:</p>
<pre><code class="language-python">def _to_float(x):
    if x in (None, "", "NA"):
        return None
    try:
        return float(x)
    except Exception:
        return None
</code></pre>
<p>This small function just cleans values before we use them. Fundamentals payloads often contain strings, nulls, or <code>"NA"</code>, so it helps to normalize everything early.</p>
<p>Now add the main function:</p>
<pre><code class="language-python">def compute_fundamental_signals(fundamentals):
    if not isinstance(fundamentals, dict):
        return {}

    general = fundamentals.get("General", {}) or {}
    highlights = fundamentals.get("Highlights", {}) or {}
    valuation = fundamentals.get("Valuation", {}) or {}
    technicals = fundamentals.get("Technicals", {}) or {}

    earnings = fundamentals.get("Earnings", {}) or {}
    trend = earnings.get("Trend", {}) or {}

    latest_trend = None
    if isinstance(trend, dict) and trend:
        latest_key = sorted(trend.keys())[-1]
        latest_trend = trend.get(latest_key, {}) or {}
    else:
        latest_trend = {}

    out = {
        "sector": general.get("Sector"),
        "industry": general.get("Industry"),
        "employees": _to_float(general.get("FullTimeEmployees")),

        "market_cap": _to_float(highlights.get("MarketCapitalization")),
        "pe_ratio": _to_float(highlights.get("PERatio")),
        "peg_ratio": _to_float(highlights.get("PEGRatio")),
        "profit_margin": _to_float(highlights.get("ProfitMargin")),
        "operating_margin": _to_float(highlights.get("OperatingMarginTTM")),
        "roa": _to_float(highlights.get("ReturnOnAssetsTTM")),
        "roe": _to_float(highlights.get("ReturnOnEquityTTM")),
        "revenue_ttm": _to_float(highlights.get("RevenueTTM")),
        "revenue_growth_yoy": _to_float(highlights.get("QuarterlyRevenueGrowthYOY")),
        "earnings_growth_yoy": _to_float(highlights.get("QuarterlyEarningsGrowthYOY")),
        "dividend_yield": _to_float(highlights.get("DividendYield")),

        "trailing_pe": _to_float(valuation.get("TrailingPE")),
        "forward_pe": _to_float(valuation.get("ForwardPE")),
        "price_sales": _to_float(valuation.get("PriceSalesTTM")),
        "price_book": _to_float(valuation.get("PriceBookMRQ")),
        "ev_revenue": _to_float(valuation.get("EnterpriseValueRevenue")),
        "ev_ebitda": _to_float(valuation.get("EnterpriseValueEbitda")),

        "beta": _to_float(technicals.get("Beta")),

        "earnings_estimate_growth": _to_float(latest_trend.get("earningsEstimateGrowth")),
        "revenue_estimate_growth": _to_float(latest_trend.get("revenueEstimateGrowth")),
        "eps_revisions_up_30d": _to_float(latest_trend.get("epsRevisionsUpLast30days")),
        "eps_revisions_down_30d": _to_float(latest_trend.get("epsRevisionsDownLast30days")),
    }

    if out["trailing_pe"] is not None and out["forward_pe"] is not None:
        out["forward_vs_trailing_pe_change"] = out["forward_pe"] - out["trailing_pe"]

    if out["eps_revisions_up_30d"] is not None and out["eps_revisions_down_30d"] is not None:
        out["net_eps_revisions_30d"] = out["eps_revisions_up_30d"] - out["eps_revisions_down_30d"]

    return out
</code></pre>
<p>This function pulls together the parts of the fundamentals payload that matter most for thesis testing.</p>
<ul>
<li><p>From <code>Highlights</code>, we get profitability, returns on capital, growth, and market cap.</p>
</li>
<li><p>From <code>Valuation</code>, we get multiples like trailing P/E, forward P/E, price-to-sales, and EV-based ratios.</p>
</li>
<li><p>From <code>Technicals</code>, we take beta.</p>
</li>
<li><p>From <code>Earnings.Trend</code>, we pick up forward estimate growth and revision data.</p>
</li>
</ul>
<p>These are the fields that let us test claims around business quality, premium justification, valuation, and forward expectations in a much more concrete way.</p>
<p>The last two derived fields are also useful. The gap between forward P/E and trailing P/E gives us a quick way to see whether valuation is easing or staying stretched. Net EPS revisions over the last 30 days tell us whether analyst expectations are improving or deteriorating.</p>
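<p>A quick worked example shows how to read the sign of each derived field. The input values below are illustrative only and do not come from a real fundamentals payload.</p>

```python
# Illustrative inputs only; these are not values from a real fundamentals payload.
signals = {
    "trailing_pe": 30.0,
    "forward_pe": 27.5,
    "eps_revisions_up_30d": 4.0,
    "eps_revisions_down_30d": 7.0,
}

pe_change = signals["forward_pe"] - signals["trailing_pe"]
net_revisions = signals["eps_revisions_up_30d"] - signals["eps_revisions_down_30d"]

# Negative pe_change: the forward multiple sits below the trailing one,
# which implies earnings are expected to grow at the current price.
# Negative net_revisions: more downward than upward EPS revisions in 30 days.
print(pe_change, net_revisions)
```

<p>Here the P/E change is -2.5 and net revisions are -3.0, so valuation looks set to ease on forward earnings while analyst sentiment has slipped. Signals that point in different directions like this are what the evidence step has to weigh.</p>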
<h2 id="heading-what-do-we-have-so-far">What Do We Have So Far?</h2>
<p>At this point, the copilot can parse a thesis, fetch prices and fundamentals, and convert both into two reusable signal layers:</p>
<ul>
<li><p>Price signals cover return, volatility, drawdown, trend, and return quality.</p>
</li>
<li><p>Fundamentals signals cover margins, returns on capital, growth, valuation, revisions, and beta.</p>
</li>
</ul>
<p>Next, we’ll turn those signals into what a real research workflow needs: supporting evidence, weakening evidence, what’s missing, a verdict, and the final memo.</p>
<h2 id="heading-classifying-the-thesis">Classifying the&nbsp;Thesis</h2>
<p>Before the copilot can judge a thesis, it first needs to understand what kind of claim is being made.</p>
<p>This matters because not every thesis should be tested the same way. A claim about controlled downside should care more about drawdown and volatility. A claim about business quality should lean more on margins, returns on capital, and growth. A claim about premium justification may need both business quality and valuation context.</p>
<p>So instead of jumping straight from signals to a verdict, we'll add a small classification step. This gives the system a short list of claim types to work with and a cleaner summary of the thesis.</p>
<p>Add this to <code>core.py</code>:</p>
<pre><code class="language-python">def classify_thesis(thesis):
    prompt = f"""
You are classifying a stock thesis into a few broad claim types.

Return only valid JSON like this:
{{
  "claim_types": ["controlled_downside", "business_quality"],
  "summary": "short restatement of the thesis"
}}

Allowed claim types:
- controlled_downside
- momentum_strength
- low_risk
- high_risk
- valuation_attractive
- valuation_expensive
- business_quality
- weak_business_quality
- premium_justified
- premium_not_justified

Rules:
- pick only the claim types that are clearly relevant
- do not invent extra labels
- if nothing fits strongly, return an empty list
- summary should be short and faithful

Thesis:
{thesis}
""".strip()

    r = oa.responses.create(
        model=model_name,
        input=[{"role": "user", "content": prompt}],
    )

    raw = r.output_text.strip()

    try:
        out = json.loads(raw)
    except Exception:
        raise RuntimeError(f"thesis classifier returned non-json text: {raw[:500]}")

    claim_types = out.get("claim_types", [])
    if not isinstance(claim_types, list):
        claim_types = []

    clean = []
    allowed = {
        "controlled_downside",
        "momentum_strength",
        "low_risk",
        "high_risk",
        "valuation_attractive",
        "valuation_expensive",
        "business_quality",
        "weak_business_quality",
        "premium_justified",
        "premium_not_justified",
    }

    for x in claim_types:
        x = str(x).strip()
        if x in allowed and x not in clean:
            clean.append(x)

    return {
        "claim_types": clean,
        "summary": str(out.get("summary", "")).strip(),
    }
</code></pre>
<p>This function keeps the model’s job narrow. It's not being asked to decide whether the thesis is right or wrong. It's only being asked to identify the kind of thesis it's dealing with. That makes the next step much cleaner, because the evidence engine no longer has to treat every prompt the same way.</p>
<p>The validation at the bottom is important too. Even though the model returns the labels, Python still filters them through an allowed set and removes anything unexpected. That keeps this step flexible, but still controlled.</p>
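<p>The filtering step can be exercised without calling the model at all. This sketch pulls the validation loop into a hypothetical standalone function, <code>filter_claim_types</code>, and uses a shortened allowed set for brevity. The sample labels are invented to show what gets dropped.</p>

```python
ALLOWED = {"controlled_downside", "business_quality", "low_risk"}  # subset, for brevity

def filter_claim_types(raw_labels):
    # Hypothetical standalone version of the validation loop in classify_thesis().
    if not isinstance(raw_labels, list):
        return []
    clean = []
    for label in raw_labels:
        label = str(label).strip()
        # Keep only known labels, and keep each one once, in first-seen order.
        if label in ALLOWED and label not in clean:
            clean.append(label)
    return clean

model_output = [" business_quality", "great_vibes", "low_risk", "low_risk", 42]
print(filter_claim_types(model_output))
```

<p>The unknown label, the duplicate, and the non-string value all disappear, leaving <code>['business_quality', 'low_risk']</code>.</p>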
<h2 id="heading-turning-signals-into-support-contradiction-and-missing-evidence">Turning Signals into Support, Contradiction, and Missing&nbsp;Evidence</h2>
<p>This is the step where the copilot actually starts reasoning.</p>
<p>Up to this point, we have three things in hand. We have the thesis, we have the claim types, and we have the signal layers built from price data and fundamentals. But none of that is useful on its own unless the system can turn it into a clear argument.</p>
<p>That means it needs to answer three questions for every thesis:</p>
<ul>
<li><p>What in the data supports this claim?</p>
</li>
<li><p>What in the data weakens it?</p>
</li>
<li><p>What is still missing before we can judge it properly?</p>
</li>
</ul>
<p>That's exactly what <code>build_evidence_blocks()</code> does. It takes the classified thesis, checks the relevant price and fundamentals signals, and sorts them into three buckets: support, contradiction, and missing evidence.</p>
<p>Add this to <code>core.py</code>:</p>
<pre><code class="language-python">def build_evidence_blocks(thesis, thesis_tags, price_signals, fundamental_signals):
    evidence_for = []
    evidence_against = []
    missing_evidence = []

    ret_total = price_signals.get("ret_total")
    vol = price_signals.get("vol_annualized")
    dd = price_signals.get("max_drawdown")
    trend = price_signals.get("trend_slope")
    ret_to_vol = price_signals.get("ret_to_vol")

    pe = fundamental_signals.get("pe_ratio") or fundamental_signals.get("trailing_pe")
    forward_pe = fundamental_signals.get("forward_pe")
    beta = fundamental_signals.get("beta")

    profit_margin = fundamental_signals.get("profit_margin")
    operating_margin = fundamental_signals.get("operating_margin")
    roa = fundamental_signals.get("roa")
    roe = fundamental_signals.get("roe")
    revenue_growth = fundamental_signals.get("revenue_growth_yoy")
    earnings_growth = fundamental_signals.get("earnings_growth_yoy")
    earnings_estimate_growth = fundamental_signals.get("earnings_estimate_growth")
    revenue_estimate_growth = fundamental_signals.get("revenue_estimate_growth")
    net_eps_revisions = fundamental_signals.get("net_eps_revisions_30d")

    claim_types = thesis_tags.get("claim_types", [])

    if "controlled_downside" in claim_types:
        if dd is not None:
            if dd &gt; -0.15:
                evidence_for.append(f"Maximum drawdown was relatively contained at {dd:.2%}.")
            else:
                evidence_against.append(f"Maximum drawdown reached {dd:.2%}, which weakens the controlled-downside claim.")
        else:
            missing_evidence.append("No drawdown signal available to test downside control.")

    if "momentum_strength" in claim_types:
        if trend is not None and ret_total is not None:
            if trend &gt; 0 and ret_total &gt; 0:
                evidence_for.append(f"Trend was positive and total return over the window was {ret_total:.2%}.")
            else:
                evidence_against.append("Trend and total return do not strongly support a momentum-strength view.")
        else:
            missing_evidence.append("No usable trend or return signal available to test momentum.")

    if "low_risk" in claim_types:
        if vol is not None:
            if vol &lt; 0.30:
                evidence_for.append(f"Annualized volatility was {vol:.2%}, which supports a lower-risk view.")
            else:
                evidence_against.append(f"Annualized volatility was {vol:.2%}, which weakens a low-risk thesis.")
        else:
            missing_evidence.append("No volatility signal available to test risk.")

    if "high_risk" in claim_types:
        if vol is not None:
            if vol &gt;= 0.30:
                evidence_for.append(f"Annualized volatility was {vol:.2%}, which supports a higher-risk view.")
            else:
                evidence_against.append(f"Annualized volatility was only {vol:.2%}, which does not strongly support a high-risk thesis.")
        else:
            missing_evidence.append("No volatility signal available to test risk.")

    if "valuation_attractive" in claim_types:
        if pe is not None:
            if pe &lt; 20:
                evidence_for.append(f"P/E is {pe:.2f}, which supports a more attractive valuation view.")
            elif pe &gt; 30:
                evidence_against.append(f"P/E is {pe:.2f}, which weakens the attractive-valuation claim.")
        else:
            missing_evidence.append("No P/E metric available to test valuation attractiveness.")

        if forward_pe is not None and pe is not None:
            if forward_pe &lt; pe:
                evidence_for.append(f"Forward P/E ({forward_pe:.2f}) is below trailing P/E ({pe:.2f}), which can support an improving earnings setup.")

    if "valuation_expensive" in claim_types or "premium_not_justified" in claim_types:
        if pe is not None:
            if pe &gt; 30:
                evidence_for.append(f"P/E is {pe:.2f}, which supports an expensive-valuation view.")
            else:
                evidence_against.append(f"P/E is {pe:.2f}, which does not strongly support an expensive-valuation claim.")
        else:
            missing_evidence.append("No P/E metric available to test whether valuation looks expensive.")

    if "business_quality" in claim_types or "premium_justified" in claim_types:
        quality_hits = 0

        if operating_margin is not None:
            if operating_margin &gt;= 0.25:
                evidence_for.append(f"Operating margin is {operating_margin:.2%}, which supports strong business quality.")
                quality_hits += 1
            else:
                evidence_against.append(f"Operating margin is {operating_margin:.2%}, which is not especially strong for a quality claim.")

        if profit_margin is not None:
            if profit_margin &gt;= 0.20:
                evidence_for.append(f"Profit margin is {profit_margin:.2%}, which supports business quality.")
                quality_hits += 1
            else:
                evidence_against.append(f"Profit margin is {profit_margin:.2%}, which weakens a strong-quality thesis.")

        if roa is not None:
            if roa &gt;= 0.10:
                evidence_for.append(f"ROA is {roa:.2%}, which supports efficient asset use.")
                quality_hits += 1
            else:
                evidence_against.append(f"ROA is {roa:.2%}, which does not strongly support a quality claim.")

        if roe is not None:
            if roe &gt;= 0.20:
                evidence_for.append(f"ROE is {roe:.2%}, which supports strong capital efficiency.")
                quality_hits += 1
            else:
                evidence_against.append(f"ROE is {roe:.2%}, which is weaker than expected for a strong-quality thesis.")

        if revenue_growth is not None:
            if revenue_growth &gt; 0:
                evidence_for.append(f"Quarterly revenue growth was {revenue_growth:.2%} YoY, which supports business momentum.")
                quality_hits += 1
            else:
                evidence_against.append(f"Quarterly revenue growth was {revenue_growth:.2%} YoY, which weakens the quality claim.")

        if earnings_growth is not None:
            if earnings_growth &gt; 0:
                evidence_for.append(f"Quarterly earnings growth was {earnings_growth:.2%} YoY, which supports operating strength.")
                quality_hits += 1
            else:
                evidence_against.append(f"Quarterly earnings growth was {earnings_growth:.2%} YoY, which weakens the quality claim.")

        if earnings_estimate_growth is not None:
            if earnings_estimate_growth &gt; 0:
                evidence_for.append(f"Forward earnings estimate growth is {earnings_estimate_growth:.2%}, which supports a healthier forward outlook.")
            else:
                evidence_against.append(f"Forward earnings estimate growth is {earnings_estimate_growth:.2%}, which weakens the quality argument.")

        if revenue_estimate_growth is not None:
            if revenue_estimate_growth &gt; 0:
                evidence_for.append(f"Forward revenue estimate growth is {revenue_estimate_growth:.2%}, which supports ongoing business strength.")
            else:
                evidence_against.append(f"Forward revenue estimate growth is {revenue_estimate_growth:.2%}, which weakens the quality argument.")

        if net_eps_revisions is not None:
            if net_eps_revisions &gt; 0:
                evidence_for.append(f"Net EPS revisions over the last 30 days are positive ({net_eps_revisions:.0f}), which supports improving expectations.")
            elif net_eps_revisions &lt; 0:
                evidence_against.append(f"Net EPS revisions over the last 30 days are negative ({net_eps_revisions:.0f}), which weakens the thesis.")

        if quality_hits == 0:
            missing_evidence.append("This version could not extract enough direct business-quality metrics to test the quality claim.")

    if "weak_business_quality" in claim_types:
        if operating_margin is not None and operating_margin &lt; 0.15:
            evidence_for.append(f"Operating margin is only {operating_margin:.2%}, which supports a weaker-quality view.")
        if profit_margin is not None and profit_margin &lt; 0.10:
            evidence_for.append(f"Profit margin is only {profit_margin:.2%}, which supports a weaker-quality view.")
        if revenue_growth is not None and revenue_growth &lt;= 0:
            evidence_for.append(f"Revenue growth is {revenue_growth:.2%} YoY, which supports a weaker-quality view.")
        if earnings_growth is not None and earnings_growth &lt;= 0:
            evidence_for.append(f"Earnings growth is {earnings_growth:.2%} YoY, which supports a weaker-quality view.")

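    if beta is not None:
        # Beta is checked for every thesis. High beta counts against the thesis
        # unless the thesis itself claims high risk, in which case it is support.
        high_risk_claim = "high_risk" in claim_types
        if beta &gt; 1.2:
            bucket = evidence_for if high_risk_claim else evidence_against
            bucket.append(f"Beta is {beta:.2f}, which suggests above-market sensitivity.")
        elif beta &lt; 0.9:
            bucket = evidence_against if high_risk_claim else evidence_for
            bucket.append(f"Beta is {beta:.2f}, which suggests below-market sensitivity.")
    else:
        missing_evidence.append("No beta value available.")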

    if ret_to_vol is None:
        missing_evidence.append("No return-to-volatility signal available.")

    if not evidence_for and not evidence_against:
        missing_evidence.append("The current data is not enough to strongly support or reject the thesis.")

    return {
        "thesis": thesis,
        "thesis_summary": thesis_tags.get("summary", ""),
        "claim_types": claim_types,
        "evidence_for": evidence_for,
        "evidence_against": evidence_against,
        "missing_evidence": list(dict.fromkeys(missing_evidence)),
    }
</code></pre>
<p>The function looks long, but the logic is simple once you break it down.</p>
<p>It starts by pulling the signals it needs from the two evidence layers that we built earlier. Then it checks the thesis tags one by one. If the thesis is about controlled downside, it looks at drawdown. If it's about risk, it looks at volatility. If it's about business quality, it leans on margins, returns on assets and equity, growth, and revisions. If it's about valuation, it checks multiples like P/E and the relationship between forward and trailing valuation. Beta is checked for every thesis, and a missing return-to-volatility ratio is always flagged.</p>
<p>That's the key shift in this project. The copilot is no longer just collecting data. It's deciding which parts of the EODHD-backed signal set actually matter for the thesis in front of it.</p>
<p>The three output buckets are what make this useful.</p>
<ul>
<li><p><code>evidence_for</code> holds the points that support the claim.</p>
</li>
<li><p><code>evidence_against</code> holds the points that weaken it.</p>
</li>
<li><p><code>missing_evidence</code> makes the gaps explicit instead of letting the system sound more confident than it should.</p>
</li>
</ul>
<p>That's what makes this feel like a thesis-testing workflow rather than a polished stock summary.</p>
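<p>The bucket pattern is easier to see on a cut-down version. The sketch below is not the full function: it keeps only the drawdown and volatility rules (with the same -15% and 30% thresholds) and runs on hand-written signals, so it needs no API call. The name <code>sort_into_buckets</code> and the sample values are assumptions made for illustration.</p>

```python
def sort_into_buckets(claim_types, signals):
    """Route each relevant signal into support, contradiction, or missing."""
    buckets = {"evidence_for": [], "evidence_against": [], "missing_evidence": []}

    if "controlled_downside" in claim_types:
        dd = signals.get("max_drawdown")
        if dd is None:
            buckets["missing_evidence"].append("no drawdown signal")
        elif dd > -0.15:
            buckets["evidence_for"].append(f"drawdown contained at {dd:.2%}")
        else:
            buckets["evidence_against"].append(f"drawdown reached {dd:.2%}")

    if "low_risk" in claim_types:
        vol = signals.get("vol_annualized")
        if vol is None:
            buckets["missing_evidence"].append("no volatility signal")
        elif 0.30 > vol:
            buckets["evidence_for"].append(f"volatility was {vol:.2%}")
        else:
            buckets["evidence_against"].append(f"volatility was {vol:.2%}")

    return buckets


# Synthetic signals: a deep drawdown and no volatility figure at all.
synthetic = {"max_drawdown": -0.22}
result = sort_into_buckets(["controlled_downside", "low_risk"], synthetic)
print(result)
```

<p>With these inputs the drawdown lands in <code>evidence_against</code> and the absent volatility lands in <code>missing_evidence</code>, which is the behavior that stops the system from treating "no data" as "no problem".</p>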
<h3 id="heading-sanity-check-jupyter-notebook">Sanity Check (Jupyter Notebook)</h3>
<p>Run this code inside <code>test.ipynb</code> for a quick sanity check:</p>
<pre><code class="language-python">import uuid
from core import (
    fetch_prices,
    fetch_fundamentals,
    compute_price_signals,
    classify_thesis,
    build_evidence_blocks,
    make_state
)
import json

trace_id = uuid.uuid4().hex[:10]
state = make_state()

thesis = "Apple looks attractive because downside has been controlled and business quality remains high."

prices = await fetch_prices("AAPL.US", "2026-01-01", "2026-04-01", trace_id, state)
funds = await fetch_fundamentals("AAPL.US", trace_id, state)

signals = compute_price_signals(prices)
tags = classify_thesis(thesis)
evidence = build_evidence_blocks(thesis, tags, signals, funds)

print(tags)
print(json.dumps(evidence, indent=2))
</code></pre>
<p><strong>Expected Output:</strong></p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/38ec0e04-b237-4ebb-8b26-61e2f82f36b0.png" alt="Sanity check expected output" style="display:block;margin:0 auto" width="1500" height="508" loading="lazy">

<h2 id="heading-assigning-a-verdict">Assigning a&nbsp;Verdict</h2>
<p>Once the evidence is structured, the copilot still needs one more layer before it can write a memo. It needs a controlled way to label the thesis.</p>
<p>That's the job of <code>decide_verdict()</code>. It looks at how much evidence supports the thesis, how much weakens it, and whether the claim still depends on missing business-quality or valuation evidence. The goal here isn't to create a perfect scoring model. It's to make sure the system doesn't jump from a few evidence strings straight into a confident conclusion.</p>
<p>Add this to <code>core.py</code>:</p>
<pre><code class="language-python">def decide_verdict(evidence, claim_types=None):
    claim_types = claim_types or []

    evidence_for = evidence.get("evidence_for", [])
    evidence_against = evidence.get("evidence_against", [])
    missing = evidence.get("missing_evidence", [])

    n_for = len(evidence_for)
    n_against = len(evidence_against)
    n_missing = len(missing)

    quality_claim = any(x in claim_types for x in ["business_quality", "weak_business_quality", "premium_justified", "premium_not_justified"])
    valuation_claim = any(x in claim_types for x in ["valuation_attractive", "valuation_expensive", "premium_justified", "premium_not_justified"])

    if n_for == 0 and n_against == 0:
        return {
            "verdict": "unresolved_due_to_missing_evidence",
            "reason": "There is not enough usable evidence to test the thesis.",
        }

    if quality_claim and n_missing &gt;= 1:
        if n_against &gt; 0:
            return {
                "verdict": "weakly_supported",
                "reason": "Some evidence supports the thesis, but direct business-quality evidence is missing and contradictory signals remain.",
            }
        return {
            "verdict": "partially_supported",
            "reason": "Part of the thesis is supported, but direct business-quality evidence is missing.",
        }

    if valuation_claim and n_missing &gt;= 1:
        return {
            "verdict": "unresolved_due_to_missing_evidence",
            "reason": "The thesis depends on valuation evidence that is not available in this version.",
        }

    if n_for &gt; 0 and n_against == 0:
        if n_missing &gt;= 2:
            return {
                "verdict": "partially_supported",
                "reason": "The available evidence supports the thesis, but important evidence is still missing.",
            }
        return {
            "verdict": "supported",
            "reason": "The available evidence mainly supports the thesis.",
        }

    if n_against &gt; 0 and n_for == 0:
        return {
            "verdict": "not_supported",
            "reason": "The available evidence mainly weakens the thesis.",
        }

    if n_for &gt; n_against:
        return {
            "verdict": "partially_supported",
            "reason": "There is more supporting evidence than contradicting evidence, but the thesis is not fully confirmed.",
        }

    # Only one case is left here: both counts are above zero and n_against &gt;= n_for.
    return {
        "verdict": "weakly_supported",
        "reason": "Contradicting evidence is meaningful enough that the thesis is only weakly supported.",
    }
</code></pre>
<p>The logic here is intentionally simple. It doesn't try to do fine-grained scoring. Instead, it uses the shape of the evidence to decide whether the thesis is supported, partially supported, weakly supported, not supported, or still unresolved.</p>
<p>A couple of checks matter more than the rest. If the thesis makes a business-quality or valuation claim and <code>missing_evidence</code> is not empty, the verdict gets capped early instead of sounding stronger than it should. The check is coarse: it counts any missing item, including a missing beta, not only quality or valuation gaps. That caution is important because a thesis can look convincing on price behavior alone but still be incomplete if the claim depends on fundamentals that aren't actually present.</p>
<p>The other useful thing about this function is that it returns both a short label and a reason. That makes the final output easier to understand later, and it also gives the memo-writing step something cleaner to work from than a bare category.</p>
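<p>One way to get a feel for the branching is to restate it in terms of counts only and walk through a few cases. The function below is a condensed restatement written for this purpose, not a replacement for <code>decide_verdict()</code>; the name <code>verdict_label</code> and the scenarios are illustrative assumptions.</p>

```python
def verdict_label(n_for, n_against, n_missing, quality_claim=False, valuation_claim=False):
    """Condensed restatement of the verdict branches, using counts only."""
    if n_for == 0 and n_against == 0:
        return "unresolved_due_to_missing_evidence"
    if quality_claim and n_missing >= 1:
        return "weakly_supported" if n_against > 0 else "partially_supported"
    if valuation_claim and n_missing >= 1:
        return "unresolved_due_to_missing_evidence"
    if n_against == 0:
        return "partially_supported" if n_missing >= 2 else "supported"
    if n_for == 0:
        return "not_supported"
    return "partially_supported" if n_for > n_against else "weakly_supported"


# (n_for, n_against, n_missing, quality_claim, valuation_claim)
scenarios = [
    (3, 0, 0, False, False),  # clean support
    (3, 0, 1, True, False),   # quality claim capped by one missing item
    (2, 0, 1, False, True),   # valuation claim capped by one missing item
    (0, 2, 0, False, False),  # only contradicting evidence
    (2, 2, 0, False, False),  # a tie goes to the weaker label
]
labels = [verdict_label(*case) for case in scenarios]
for case, label in zip(scenarios, labels):
    print(case, label)
```

<p>The second and third rows show the capping effect: the same supporting evidence that would earn <code>supported</code> on its own is held back as soon as a quality or valuation claim has anything missing.</p>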
<h2 id="heading-building-the-facts-object">Building the Facts&nbsp;Object</h2>
<p>Before the memo gets written, the system first puts everything into one structured object. That object becomes the single source of truth for the final output. Instead of handing the model a mix of scattered variables, we'll give it one clean package containing the thesis, signals, company context, evidence, and verdict.</p>
<h3 id="heading-1-company-context">1. Company&nbsp;Context</h3>
<p>We’ll start with a small helper that pulls the basic company context from the fundamentals payload.</p>
<p>Add this to <code>core.py</code>:</p>
<pre><code class="language-python">def extract_company_context(fundamentals):
    if not isinstance(fundamentals, dict):
        return {}

    gen = fundamentals.get("General", {}) or {}

    out = {
        "name": gen.get("Name"),
        "code": gen.get("Code"),
        "exchange": gen.get("Exchange"),
        "sector": gen.get("Sector"),
        "industry": gen.get("Industry"),
        "country": gen.get("CountryName"),
        "market_cap": gen.get("MarketCapitalization"),
        "pe_ratio": gen.get("PERatio"),
        "beta": gen.get("Beta"),
        "dividend_yield": gen.get("DividendYield"),
        "description": gen.get("Description"),
    }

    clean = {}
    for k, v in out.items():
        if v not in (None, "", "NA"):
            clean[k] = v

    return clean
</code></pre>
<p>This function is just a cleanup step. It gives us a compact company context block that can later sit alongside the price and fundamentals signals without dragging the full fundamentals payload into the memo layer.</p>
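<p>The cleanup rule can be checked offline with a stub payload. The sketch below is a shortened stand-in that picks three fields only; the name <code>compact_context</code> and the stub dictionaries are hypothetical and are not real API responses.</p>

```python
def compact_context(fundamentals):
    """Pick a few fields from the General block and drop empty placeholders."""
    if not isinstance(fundamentals, dict):
        return {}
    general = fundamentals.get("General") or {}
    wanted = {"name": "Name", "sector": "Sector", "pe_ratio": "PERatio"}
    picked = {key: general.get(source) for key, source in wanted.items()}
    # None, empty strings, and the literal "NA" all count as "no value".
    return {key: value for key, value in picked.items() if value not in (None, "", "NA")}


stub = {"General": {"Name": "Example Corp", "Sector": "NA", "PERatio": ""}}
from_stub = compact_context(stub)
from_none = compact_context(None)
from_empty_block = compact_context({"General": None})

print(from_stub)         # {'name': 'Example Corp'}
print(from_none)         # {}
print(from_empty_block)  # {}
```

<p>Dropping placeholders at this stage means later code can treat "key is absent" as the single signal for "no value", instead of checking for three different empty markers.</p>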
<h3 id="heading-2-single-stock-facts-builder">2. Single-Stock Facts&nbsp;Builder</h3>
<p>Now add the single-stock facts builder:</p>
<pre><code class="language-python">def build_thesis_facts(parsed, ticker, signals, fundamentals, thesis_tags, evidence):
    company = extract_company_context(fundamentals)

    facts = {
        "type": "single_name_thesis_test",
        "ticker": ticker,
        "lookback_days": parsed["lookback_days"],
        "thesis": parsed["thesis"],
        "thesis_summary": thesis_tags.get("summary", ""),
        "claim_types": thesis_tags.get("claim_types", []),
        "market_signals": {
            "ret_total": signals.get("ret_total"),
            "vol_annualized": signals.get("vol_annualized"),
            "max_drawdown": signals.get("max_drawdown"),
            "trend_slope": signals.get("trend_slope"),
            "ret_to_vol": signals.get("ret_to_vol"),
            "start_price": signals.get("start_price"),
            "end_price": signals.get("end_price"),
            "n_points": signals.get("n_points"),
        },
        "company_context": {
            "name": company.get("name"),
            "exchange": company.get("exchange"),
            "sector": company.get("sector"),
            "industry": company.get("industry"),
            "country": company.get("country"),
            "market_cap": company.get("market_cap"),
            "pe_ratio": company.get("pe_ratio"),
            "beta": company.get("beta"),
            "dividend_yield": company.get("dividend_yield"),
        },
        "description": company.get("description"),
        "evidence_for": evidence.get("evidence_for", []),
        "evidence_against": evidence.get("evidence_against", []),
        "missing_evidence": evidence.get("missing_evidence", []),
    }

    facts["verdict"] = decide_verdict(evidence, thesis_tags.get("claim_types", []))
    return facts
</code></pre>
<p>This is the main facts object for a single-stock thesis. It pulls together the parsed thesis, the market signals, the basic company context, the evidence buckets, and the verdict. At this point, the copilot has already done the reasoning work. The memo isn't deciding anything new. It's just writing from this object.</p>
<h3 id="heading-3-watchlist-facts-builder">3. Watchlist Facts&nbsp;Builder</h3>
<p>Now add the watchlist version:</p>
<pre><code class="language-python">def build_watchlist_facts(parsed, tickers, signals_by_ticker, fundamentals_by_ticker, thesis_tags, evidence_by_ticker):
    per_ticker = {}

    for t in tickers:
        company = extract_company_context(fundamentals_by_ticker.get(t, {}))
        signals = signals_by_ticker.get(t, {})
        evidence = evidence_by_ticker.get(t, {})

        per_ticker[t] = {
            "company_context": {
                "name": company.get("name"),
                "sector": company.get("sector"),
                "industry": company.get("industry"),
                "market_cap": company.get("market_cap"),
                "pe_ratio": company.get("pe_ratio"),
                "beta": company.get("beta"),
            },
            "market_signals": {
                "ret_total": signals.get("ret_total"),
                "vol_annualized": signals.get("vol_annualized"),
                "max_drawdown": signals.get("max_drawdown"),
                "trend_slope": signals.get("trend_slope"),
                "ret_to_vol": signals.get("ret_to_vol"),
            },
            "evidence_for": evidence.get("evidence_for", []),
            "evidence_against": evidence.get("evidence_against", []),
            "missing_evidence": evidence.get("missing_evidence", []),
            "verdict": decide_verdict(evidence, thesis_tags.get("claim_types", []))
        }

    facts = {
        "type": "watchlist_thesis_test",
        "tickers": tickers,
        "lookback_days": parsed["lookback_days"],
        "thesis": parsed["thesis"],
        "thesis_summary": thesis_tags.get("summary", ""),
        "claim_types": thesis_tags.get("claim_types", []),
        "per_ticker": per_ticker,
    }

    return facts
</code></pre>
<p>This version does the same thing, but across multiple tickers. Instead of one top-level evidence block, it stores a per-ticker structure so the memo layer can later compare names without needing to reconstruct anything.</p>
<p>That is the main reason this section matters. By the time we reach the memo step, we no longer want to pass loose values around. We want one structured object that already contains:</p>
<ul>
<li><p>the thesis</p>
</li>
<li><p>the relevant signals</p>
</li>
<li><p>the company context</p>
</li>
<li><p>the evidence buckets</p>
</li>
<li><p>the verdict</p>
</li>
</ul>
<p>That keeps the final writing step much cleaner and makes the whole workflow easier to debug.</p>
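<p>Because the memo step trusts this object completely, a cheap structural check before writing can catch a half-built facts object early. This is an optional sketch and not part of <code>core.py</code>; <code>REQUIRED_KEYS</code> and <code>missing_fact_keys</code> are hypothetical names, and the key lists are taken from the two builders above.</p>

```python
REQUIRED_KEYS = {
    "single_name_thesis_test": [
        "thesis", "claim_types", "market_signals", "company_context",
        "evidence_for", "evidence_against", "missing_evidence", "verdict",
    ],
    "watchlist_thesis_test": ["thesis", "claim_types", "per_ticker"],
}


def missing_fact_keys(facts):
    """Return the required keys that are absent from a facts object."""
    required = REQUIRED_KEYS.get(facts.get("type"))
    if required is None:
        return ["type"]  # unknown or missing facts type
    return [key for key in required if key not in facts]


# A single-stock facts object that was built without a verdict.
incomplete = {
    "type": "single_name_thesis_test",
    "thesis": "Downside has been controlled.",
    "claim_types": ["controlled_downside"],
    "market_signals": {},
    "company_context": {},
    "evidence_for": [],
    "evidence_against": [],
    "missing_evidence": [],
}
gaps = missing_fact_keys(incomplete)
print(gaps)  # ['verdict']
```

<p>A check like this could run just before the memo call, so a missing verdict fails loudly instead of producing a memo that quietly skips it.</p>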
<h3 id="heading-sanity-check-jupyter-notebook">Sanity Check (Jupyter Notebook)</h3>
<p>Run this code inside <code>test.ipynb</code> for a quick sanity check:</p>
<pre><code class="language-python">from core import build_thesis_facts, extract_company_context

facts = build_thesis_facts(
    parsed={
        "tickers": ["AAPL"],
        "lookback_days": 180,
        "thesis": "Apple looks attractive because downside has been controlled and business quality remains high.",
        "mode": "single"
    },
    ticker="AAPL.US",
    signals=signals,
    fundamentals=funds,
    thesis_tags=tags,
    evidence=evidence
)

print(json.dumps(facts, indent=2))
</code></pre>
<p><strong>Expected Output:</strong></p>
<pre><code class="language-json">{
  "type": "single_name_thesis_test",
  "ticker": "AAPL.US",
  "lookback_days": 180,
  "thesis": "Apple looks attractive because downside has been controlled and business quality remains high.",
  "thesis_summary": "Apple is attractive due to controlled downside and strong business quality",
  "claim_types": [
    "controlled_downside",
    "business_quality"
  ],
  "market_signals": {
    "ret_total": -0.05675067340688533,
    "vol_annualized": 0.2504818805125429,
    "max_drawdown": -0.11322450740687473,
    "trend_slope": -0.0005437843809243782,
    "ret_to_vol": -0.22656598270006817,
    "start_price": 271.01,
    "end_price": 255.63,
    "n_points": 62
  },
  "company_context": {
    "name": "Apple Inc",
    "exchange": "NASDAQ",
    "sector": "Technology",
    "industry": "Consumer Electronics",
    "country": "USA",
    "market_cap": null,
    "pe_ratio": null,
    "beta": null,
    "dividend_yield": null
  },
  "description": "Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. The company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; and wearables, home, and accessories comprising AirPods, Apple Vision Pro, Apple TV, Apple Watch, Beats products, and HomePod, as well as Apple branded and third-party accessories. It also provides AppleCare support and cloud services; and operates various platforms, including the App Store that allow customers to discover and download applications and digital content, such as books, music, video, games, and podcasts, as well as advertising services include third-party licensing arrangements and its own advertising platforms. In addition, the company offers various subscription-based services, such as Apple Arcade, a game subscription service; Apple Fitness+, a personalized fitness service; Apple Music, which offers users a curated listening experience with on-demand radio stations; Apple News+, a subscription news and magazine service; Apple TV, which offers exclusive original content and live sports; Apple Card, a co-branded credit card; and Apple Pay, a cashless payment service, as well as licenses its intellectual property. The company serves consumers, and small and mid-sized businesses; and the education, enterprise, and government markets. It distributes third-party applications for its products through the App Store. The company also sells its products through its retail and online stores, and direct sales force; and third-party cellular network carriers and resellers. The company was formerly known as Apple Computer, Inc. and changed its name to Apple Inc. in January 2007. Apple Inc. was founded in 1976 and is headquartered in Cupertino, California.",
  "evidence_for": [
    "Maximum drawdown was relatively contained at -11.32%."
  ],
  "evidence_against": [],
  "missing_evidence": [
    "This version could not extract enough direct business-quality metrics to test the quality claim.",
    "No beta value available."
  ],
  "verdict": {
    "verdict": "partially_supported",
    "reason": "Part of the thesis is supported, but direct business-quality evidence is missing."
  }
}
</code></pre>
<h2 id="heading-writing-the-final-memo">Writing the Final&nbsp;Memo</h2>
<p>At this point, the hard part is already done.</p>
<p>By the time we reach the memo step, the copilot already has a structured facts object with the thesis, claim types, market signals, company context, evidence buckets, and verdict. So this final function isn't where the reasoning happens. It's just the presentation layer that turns that structured judgment into something readable.</p>
<p>Add this to <code>core.py</code>:</p>
<pre><code class="language-python">def write_thesis_memo(facts):
    prompt = f"""
You are writing a short financial research memo.

Write using only the facts provided below.
Do not invent numbers, events, comparisons, or opinions beyond the supplied evidence.
If evidence is missing, say so clearly.

Use this exact structure:

1. Thesis under review
2. Supporting evidence
3. Evidence that weakens the thesis
4. Missing evidence
5. Verdict
6. Bottom-line assessment

Style rules:
- Keep it concise
- Keep it analytical and professional
- No bullet points unless necessary
- No hype
- No generic investment disclaimer language
- The bottom-line assessment should be balanced and evidence-based
- The verdict section must explicitly use the supplied verdict

Facts:
{json.dumps(facts, indent=2, default=str)}
""".strip()

    r = oa.responses.create(
        model=model_name,
        input=[{"role": "user", "content": prompt}],
    )

    return r.output_text.strip()
</code></pre>
<p>This function keeps the model boxed into one narrow task. It's not being asked to look at raw price history, raw fundamentals, or scattered variables. It's being asked to write from one clean facts object that already contains the judgment.</p>
<p>That separation matters because it keeps the final memo grounded. The model isn't deciding what it thinks about the stock at the last second. It's simply turning the structured output of the earlier steps into a short research note.</p>
<p>The prompt is also deliberately strict. It fixes the memo structure, tells the model not to invent anything, and makes the verdict explicit instead of leaving it implied. That helps the final output stay consistent even when the underlying thesis changes.</p>
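<p>A strict prompt still deserves a check on the way out. The sketch below is one possible post-check, assuming the memo comes back as plain text: it looks for the six section titles in order and for the supplied verdict label. <code>memo_problems</code> and the sample memo are illustrative and not part of the project code.</p>

```python
SECTION_TITLES = [
    "Thesis under review",
    "Supporting evidence",
    "Evidence that weakens the thesis",
    "Missing evidence",
    "Verdict",
    "Bottom-line assessment",
]


def memo_problems(memo, verdict_label):
    """List structural problems in a memo; an empty list means it passed."""
    problems = []
    lowered = memo.lower()
    cursor = 0
    for title in SECTION_TITLES:
        found = lowered.find(title.lower(), cursor)
        if found == -1:
            problems.append(f"section missing or out of order: {title}")
        else:
            cursor = found + len(title)
    # The model may write the label with spaces instead of underscores.
    spaced = verdict_label.replace("_", " ")
    if verdict_label not in lowered and spaced not in lowered:
        problems.append(f"verdict label not stated: {verdict_label}")
    return problems


sample_memo = "\n".join([
    "1. Thesis under review",
    "Downside has been controlled and business quality remains high.",
    "2. Supporting evidence",
    "Maximum drawdown was contained.",
    "3. Evidence that weakens the thesis",
    "None in the supplied facts.",
    "4. Missing evidence",
    "No beta value available.",
    "5. Verdict",
    "Partially supported.",
    "6. Bottom-line assessment",
    "The price evidence holds, but the quality claim is untested.",
])
good = memo_problems(sample_memo, "partially_supported")
bad = memo_problems("Looks great, buy it.", "partially_supported")
print(good)      # []
print(len(bad))  # 7
```

<p>This only checks form. It cannot tell whether the prose is faithful to the facts, but it does catch the common failure where the model drops a section or softens the verdict.</p>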
<h3 id="heading-sanity-check-jupyter-notebook">Sanity Check (Jupyter Notebook)</h3>
<p>You can test it with a facts object from the previous section:</p>
<pre><code class="language-python">from core import write_thesis_memo

memo = write_thesis_memo(facts)
print(memo)
</code></pre>
<p><strong>Expected Output:</strong></p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/b5f44144-8da4-4c9a-8a59-c5ac6915a6b0.png" alt="Sanity check expected output" style="display:block;margin:0 auto" width="1500" height="606" loading="lazy">

<h2 id="heading-stitching-everything-together">Stitching Everything Together</h2>
<p>At this point, all the individual pieces are ready. We have the parser, the data fetchers, the signal builders, the thesis classifier, the evidence engine, the verdict layer, and the memo writer. The only thing left is to connect them into one end-to-end function.</p>
<p>Add this to <code>core.py</code>:</p>
<pre><code class="language-python">async def run_thesis_copilot(user_text):
    trace_id = uuid.uuid4().hex[:10]
    log_event("request_started", trace_id, text=user_text)

    parsed = enforce_limits(parse_request(user_text))
    tickers = parsed["tickers"]

    if not tickers:
        return {
            "memo": "No valid ticker was found in the request.",
            "facts": {},
            "data_used": {},
            "tool_trace_id": trace_id,
        }

    log_event(
        "parsed",
        trace_id,
        tickers=tickers,
        lookback_days=parsed["lookback_days"],
        mode=parsed["mode"],
        thesis=parsed["thesis"],
    )

    start_date, end_date = get_dates_from_lookback(parsed["lookback_days"])
    state = make_state()

    try:
        thesis_tags = classify_thesis(parsed["thesis"])

        if parsed["mode"] == "single":
            ticker = tickers[0]
            ticker_full = ticker if "." in ticker else f"{ticker}.US"

            log_event(
                "tool_phase",
                trace_id,
                mode="single",
                ticker=ticker_full,
                start_date=start_date,
                end_date=end_date,
            )

            prices = await fetch_prices(ticker_full, start_date, end_date, trace_id, state)
            funds = await fetch_fundamentals(ticker_full, trace_id, state)

            price_signals = compute_price_signals(prices)
            fundamental_signals = compute_fundamental_signals(funds)

            evidence = build_evidence_blocks(
                parsed["thesis"],
                thesis_tags,
                price_signals,
                fundamental_signals
            )

            facts = build_thesis_facts(
                parsed,
                ticker_full,
                price_signals,
                funds,
                thesis_tags,
                evidence
            )

            facts["fundamental_signals"] = fundamental_signals

            memo = write_thesis_memo(facts)

            out = {
                "memo": memo,
                "facts": facts,
                "data_used": {
                    "tickers": [ticker_full],
                    "date_range": [start_date, end_date],
                    "tools_called": [x.get("tool") for x in state["tool_trace"]],
                    "tool_calls": state["tool_calls"],
                },
                "tool_trace_id": trace_id,
            }

            log_event("request_finished", trace_id, tool_calls=state["tool_calls"])
            return out

        ticker_full = [x if "." in x else f"{x}.US" for x in tickers]

        log_event(
            "tool_phase",
            trace_id,
            mode="watchlist",
            tickers=ticker_full,
            start_date=start_date,
            end_date=end_date,
        )

        signals_by_ticker = {}
        funds_by_ticker = {}
        evidence_by_ticker = {}

        for t in ticker_full:
            prices = await fetch_prices(t, start_date, end_date, trace_id, state)
            funds = await fetch_fundamentals(t, trace_id, state)

            price_signals = compute_price_signals(prices)
            fundamental_signals = compute_fundamental_signals(funds)

            evidence = build_evidence_blocks(
                parsed["thesis"],
                thesis_tags,
                price_signals,
                fundamental_signals
            )

            signals_by_ticker[t] = {
                **price_signals,
                "fundamental_signals": fundamental_signals
            }
            funds_by_ticker[t] = funds
            evidence_by_ticker[t] = evidence

        facts = build_watchlist_facts(
            parsed,
            ticker_full,
            signals_by_ticker,
            funds_by_ticker,
            thesis_tags,
            evidence_by_ticker,
        )

        memo = write_thesis_memo(facts)

        out = {
            "memo": memo,
            "facts": facts,
            "data_used": {
                "tickers": ticker_full,
                "date_range": [start_date, end_date],
                "tools_called": [x.get("tool") for x in state["tool_trace"]],
                "tool_calls": state["tool_calls"],
            },
            "tool_trace_id": trace_id,
        }

        log_event("request_finished", trace_id, tool_calls=state["tool_calls"])
        return out

    except Exception as e:
        detail = repr(e)
        if hasattr(e, "exceptions"):
            detail = detail + " | " + " ; ".join([repr(x) for x in e.exceptions])

        log_event("request_failed", trace_id, err=detail)

        return {
            "memo": f"failed: {e}",
            "facts": {},
            "data_used": {
                "tickers": tickers,
                "date_range": [start_date, end_date],
                "tools_called": [x.get("tool") for x in state["tool_trace"]],
                "tool_calls": state["tool_calls"],
            },
            "tool_trace_id": trace_id,
        }
</code></pre>
<p>This function is just the full workflow in one place. It parses the request, fetches the data, computes the two signal layers, builds the evidence, assembles the facts object, writes the memo, and returns everything in a clean output.</p>
<p>The useful part is that it returns more than just the memo. It also returns the structured facts object, the tools that were used, the date range, and the trace ID. That keeps the final result inspectable instead of turning the copilot into a black box.</p>
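<p>To make that inspectability concrete, here's a minimal sketch of a helper that condenses a run into one audit line without reading the memo. The helper name <code>summarize_run</code>, the sample values, and the tool names are hypothetical. Only the dictionary keys come from <code>run_thesis_copilot</code> above.</p>
<pre><code class="language-python">def summarize_run(result):
    """Return a one-line audit summary of a run_thesis_copilot() result."""
    data_used = result.get("data_used", {})
    tickers = ", ".join(data_used.get("tickers", [])) or "none"
    # str() keeps this working whether the dates are strings or date objects.
    period = " to ".join(str(d) for d in data_used.get("date_range", [])) or "n/a"
    # Drop missing tool names so the sort below never compares None with str.
    tools = sorted({t for t in data_used.get("tools_called", []) if t})
    return (
        f"trace={result.get('tool_trace_id')} | tickers={tickers} | "
        f"period={period} | tool_calls={data_used.get('tool_calls', 0)} | "
        f"tools={tools}"
    )


# Hypothetical result with the same shape as the function's return value.
sample_result = {
    "memo": "...",
    "facts": {},
    "data_used": {
        "tickers": ["NVDA.US"],
        "date_range": ["2025-01-01", "2025-06-30"],
        "tools_called": ["prices", "fundamentals"],  # assumed tool names
        "tool_calls": 2,
    },
    "tool_trace_id": "abc123def0",
}

print(summarize_run(sample_result))
</code></pre>
<p>Because the early-exit and failure branches return the same keys, a helper like this should work on every outcome, not just successful runs.</p>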
<h2 id="heading-demo-time-jupyter-notebook">Demo Time! (Jupyter Notebook)</h2>
<h3 id="heading-demo-1-testing-whether-a-premium-is-actually-justified">Demo 1: Testing Whether a Premium Is Actually Justified</h3>
<p>This is a good first demo because it pushes the copilot beyond a basic single-stock check. The prompt isn't asking whether NVIDIA is a good company in general. It's asking whether NVIDIA’s premium over AMD can actually be defended using market behavior and business quality.</p>
<p>Here's the prompt:</p>
<pre><code class="language-python">from core import run_thesis_copilot

q = """
Between NVDA and AMD, I think NVDA's premium is still justified by stronger market behavior and business quality.
Check that over the last 6 months.
""".strip()

result = await run_thesis_copilot(q)

print(result["memo"])
print(result["data_used"])
</code></pre>
<p>And here's the output:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/e4a9e881-243a-47bb-b36b-1e273deb8e04.png" alt="Demo 1 output" style="display:block;margin:0 auto" width="1398" height="793" loading="lazy">

<p>What makes this output useful is that it doesn't flatten the result into a simple yes or no. NVIDIA clearly looks stronger on business quality, but market behavior isn't as convincing, and the lack of direct valuation data stops the copilot from overclaiming.</p>
<p>This is the kind of behavior we want. The system isn't just comparing two companies. It's testing whether the specific claim about a premium actually holds up.</p>
<h3 id="heading-demo-2-testing-whether-volatility-is-too-high-for-the-underlying-business">Demo 2: Testing Whether Volatility Is Too High for the Underlying Business</h3>
<p>The second demo shifts back to a single-stock thesis, but the claim is different. This time, the question isn't whether the company looks attractive. It's whether the stock is more volatile than the underlying business quality would justify.</p>
<p>Here's the prompt:</p>
<pre><code class="language-python">q = """
TSLA feels too volatile for the underlying business quality.
Test that thesis over the last year.
""".strip()

result = await run_thesis_copilot(q)

print(result["memo"])
print(result["data_used"])
</code></pre>
<p>And here's the output:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/a9767ee9-d227-4478-a2aa-9ee62c46488c.png" alt="Demo 2 output" style="display:block;margin:0 auto" width="1500" height="679" loading="lazy">

<p>This result is useful because it shows a more conflicted thesis. Tesla’s recent returns and forward growth expectations offer some support, but the current profitability, recent operating trends, revisions, and volatility profile all push back against the idea that the business quality is strong enough to fully justify that risk.</p>
<p>So the final verdict lands where it should: not as a clean confirmation, but as a weakly supported thesis.</p>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>At this point, the copilot already does the most important part well. It can take a natural-language thesis, pull the right market and fundamentals data through EODHD’s MCP layer, turn those inputs into structured evidence, and return a research memo that's much more disciplined than a normal stock summary.</p>
<p>At the same time, this version still has clear limits. It doesn't yet go deeper into statement-level accounting logic, it doesn't use news or catalyst context, and its handling of relative valuation can still be stronger for more demanding comparison cases.</p>
<p>But even with those limits, the shift here is already meaningful. The real change wasn't just connecting a model to financial data. It was moving from summarizing stocks to testing claims.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Scoped Note-Taking API with Django Rest Framework and SimpleJWT ]]>
                </title>
                <description>
                    <![CDATA[ If you've built a Django API and you're wondering how to add authentication so that each user can only access their own data, you're in the right place. Most Django tutorials teach you session-based a ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-scoped-note-taking-api-with-django-rest-framework-and-simplejwt/</link>
                <guid isPermaLink="false">69fa4395a386d7f121cd3bfc</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Django ]]>
                    </category>
                
                    <category>
                        <![CDATA[ REST API ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Programming Blogs ]]>
                    </category>
                
                    <category>
                        <![CDATA[ django rest framework ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Prabodh Tuladhar ]]>
                </dc:creator>
                <pubDate>Tue, 05 May 2026 19:23:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/36921ffa-4741-4e11-8f16-2c84322ebceb.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've built a Django API and you're wondering how to add authentication so that each user can only access their own data, you're in the right place.</p>
<p>Most Django tutorials teach you session-based authentication. That works fine when your frontend and backend live on the same server. But the moment you separate them – say, a React app on Netlify talking to a Django API on PythonAnywhere – then sessions start to break down.</p>
<p>Cookies don't travel well across different domains, and suddenly your login system stops working.</p>
<p>That's where JSON Web Tokens (JWT) come in. JWTs give you a stateless, cookie-free way to authenticate users. They work seamlessly across domains, devices, and platforms. The server doesn't need to remember anything. It just verifies the token's signature and knows exactly who's making the request.</p>
<p>But authentication is only half the problem. Once you know who a user is, you still need to control what they can see. This is where <strong>scoping</strong> comes in.</p>
<p>Scoping means ensuring that each user can only access their own data. User A should never be able to read, edit, or delete User B's data (notes in our case), even if they somehow guess the right ID.</p>
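<p>As a rough illustration of the idea, here's a plain-Python sketch with no Django involved. The <code>NOTES</code> list and the <code>get_note</code> function are hypothetical stand-ins for a database table and a view.</p>
<pre><code class="language-python"># Hypothetical in-memory notes, each tagged with its owner's user ID.
NOTES = [
    {"id": 1, "owner_id": 1, "title": "User A's shopping list"},
    {"id": 2, "owner_id": 2, "title": "User B's journal"},
]


def get_note(note_id, current_user_id):
    """Return the note only if it belongs to the requesting user."""
    for note in NOTES:
        if note["id"] == note_id and note["owner_id"] == current_user_id:
            return note
    return None  # same answer for "does not exist" and "not yours"


print(get_note(note_id=2, current_user_id=2))  # User B sees their own note
print(get_note(note_id=2, current_user_id=1))  # User A guessing the ID gets None
</code></pre>
<p>The ownership check is part of the lookup itself, so guessing a valid ID isn't enough to reach someone else's note.</p>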
<p>In this tutorial, you'll build a personal note-taking API where users can register, log in with JWT tokens, and store notes that only they can access.</p>
<p>Along the way, you'll implement a custom user model, configure SimpleJWT for token-based authentication, and write scoped views that lock each user's data behind their own credentials.</p>
<h3 id="heading-what-well-cover">What We'll Cover:</h3>
<ul>
<li><p><a href="#heading-prerequisities">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-is-jwt-and-why-use-it-over-session-authentication">What is JWT and Why Use It Over Session Authentication?</a></p>
<ul>
<li><p><a href="#heading-how-session-authentication-works">How Session Authentication Works</a></p>
</li>
<li><p><a href="#heading-how-jwt-authentication-works">How JWT Authentication Works</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-1-how-to-set-up-the-project-and-install-the-dependecies">Step 1: How to Set Up the Project and Install the Dependencies</a></p>
<ul>
<li><p><a href="#heading-11-how-to-create-the-project">1.1 How to Create the Project</a></p>
</li>
<li><p><a href="#heading-12-how-to-create-a-virtual-environment-and-install-the-required-dependencies">1.2 How to Create a Virtual Environment and Install the Required Dependencies</a></p>
</li>
<li><p><a href="#heading-13-how-to-create-the-project-and-the-app">1.3 How to Create the Project and the App</a></p>
</li>
<li><p><a href="#heading-14-how-to-register-the-app-and-django-rest-framework-drf">1.4 How to Register the App and Django Rest Framework (DRF)</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-2-how-to-create-a-custom-user-model">Step 2: How to Create a Custom User Model</a></p>
<ul>
<li><p><a href="#heading-21-how-to-define-the-custom-user-model">2.1 How to Define the Custom User Model</a></p>
</li>
<li><p><a href="#heading-22-how-to-tell-django-to-use-your-custom-user-model">2.2 How to Tell Django to Use Your Custom User Model</a></p>
</li>
<li><p><a href="#heading-23-how-to-run-migrations">2.3 How to Run Migrations</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-3-how-to-define-the-note-model">Step 3: How to Define the Note Model</a></p>
<ul>
<li><p><a href="#heading-32-how-to-apply-migration">3.2 How to Apply Migration</a></p>
</li>
<li><p><a href="#heading-33-how-to-register-models-in-the-admin">3.3 How to Register Models in the Admin</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-4-how-to-create-the-serializer">Step 4: How to Create the Serializer</a></p>
<ul>
<li><p><a href="#heading-41-how-to-create-userserializer">4.1 How to Create UserSerializer</a></p>
</li>
<li><p><a href="#heading-42-how-to-create-noteserializer">4.2 How to Create NoteSerializer</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-5-how-to-configure-simplejwt">Step 5: How to Configure SimpleJWT</a></p>
<ul>
<li><p><a href="#heading-51-how-to-update-rest-framework-settings">5.1 How to Update REST Framework Settings</a></p>
</li>
<li><p><a href="#heading-52-how-to-add-token-url-endpoints">5.2 How to Add Token URL Endpoints</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-6-how-to-build-the-authentication-logic">Step 6: How to Build the Authentication Logic</a></p>
</li>
<li><p><a href="#heading-step-7-how-to-implement-scoped-views">Step 7: How to Implement Scoped Views</a></p>
<ul>
<li><p><a href="#heading-71-how-to-create-a-noteviewset">7.1 How to Create a NoteViewSet</a></p>
</li>
<li><p><a href="#heading-72-why-this-matters-preventing-id-enumeration-attacks">7.2 Why This Matters: Preventing ID Enumeration Attacks</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-8-how-to-connect-a-url">Step 8: How to Connect a URL</a></p>
<ul>
<li><p><a href="#heading-81-how-to-create-app-level-urls">8.1 How to Create App-level URLs</a></p>
</li>
<li><p><a href="#heading-82-how-to-verify-the-project-level-urls">8.2 How to Verify the Project-Level URLs</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-9-how-to-test-the-apis-with-postman">Step 9: How to Test the APIs with Postman</a></p>
<ul>
<li><p><a href="#heading-91-how-to-register-a-user">9.1 How to Register a User</a></p>
</li>
<li><p><a href="#heading-92-how-to-obtain-access-and-refresh-tokens">9.2 How to Obtain Access and Refresh Tokens</a></p>
</li>
<li><p><a href="#heading-93-how-to-create-a-note">9.3 How to Create a Note</a></p>
</li>
<li><p><a href="#heading-94-how-to-list-your-notes">9.4 How to List Your Notes</a></p>
</li>
<li><p><a href="#heading-95-how-to-demostrate-scoping">9.5 How to Demonstrate Scoping</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-10-how-to-handle-token-expiration-with-refresh-tokens">Step 10: How to Handle Token Expiration with Refresh Tokens</a></p>
</li>
<li><p><a href="#heading-how-you-can-improve-this-project">How You Can Improve This Project</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<p>Here's what this tutorial covers:</p>
<ol>
<li><p>How to set up a custom user model (and why you should always do this)</p>
</li>
<li><p>How to configure SimpleJWT for access and refresh token authentication</p>
</li>
<li><p>How to build serializers that protect sensitive fields</p>
</li>
<li><p>How to scope your API views so users only see their own data</p>
</li>
<li><p>How to test the entire flow using Postman</p>
</li>
</ol>
<p>Let's get started.</p>
<h2 id="heading-prerequisities">Prerequisites</h2>
<p>Before you begin, make sure you're comfortable with the following:</p>
<ol>
<li><p><strong>Django fundamentals</strong>: You should understand how Django projects and apps work, including models, views, URLs, and migrations.</p>
</li>
<li><p><strong>Django REST Framework basics</strong>: You should be familiar with serializers, viewsets or API views, and how DRF handles requests and responses.</p>
</li>
<li><p><strong>Basic command line usage</strong>: You'll run commands in your terminal throughout this tutorial.</p>
</li>
</ol>
<p>Tools you'll need installed:</p>
<ul>
<li><p>Python 3.8 or higher</p>
</li>
<li><p>pip (Python's package manager)</p>
</li>
<li><p>A code editor like Visual Studio Code</p>
</li>
<li><p>Postman (or any API testing tool) for testing your endpoints. You'll use this to send requests to your API.</p>
</li>
</ul>
<h2 id="heading-what-is-jwt-and-why-use-it-over-session-authentication">What is JWT and Why Use It Over Session Authentication?</h2>
<p>Before you write any code, it's important to understand what problem JWTs solve and why Django's built-in session authentication isn't always enough.</p>
<h3 id="heading-how-session-authentication-work">How Session Authentication Works</h3>
<p>Django ships with a session-based authentication system. Here's how it works at a high level:</p>
<ol>
<li><p>A user sends their username and password to the server.</p>
</li>
<li><p>The server verifies the credentials and creates a <strong>session</strong>, which is a small record stored in the server's database that says "this user is logged in."</p>
</li>
<li><p>The server sends back a <strong>session ID</strong> as a cookie. The browser stores this cookie automatically.</p>
</li>
<li><p>On every subsequent request, the browser sends the cookie back to the server. The server looks up the session ID in its database and says "ah, this is User A. Let them through."</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/2689a08a-f8a9-4b83-ad7b-cc4c2de90419.png" alt="The infographics shows the steps taken in Django session authentication" style="display:block;margin:0 auto" width="2708" height="1252" loading="lazy">
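<p>The four steps above can be sketched in a few lines of plain Python. This is only a conceptual model, not Django's actual implementation: <code>SESSION_STORE</code>, <code>log_in</code>, and <code>authenticate</code> are hypothetical names, and a dictionary stands in for the session table.</p>
<pre><code class="language-python">import secrets

# Stand-in for the server's session table: session ID mapped to user ID.
SESSION_STORE = {}


def log_in(user_id):
    """Create a server-side session record and return the ID sent as a cookie."""
    session_id = secrets.token_urlsafe(32)
    SESSION_STORE[session_id] = user_id
    return session_id


def authenticate(cookie_value):
    """Look up the session ID from the cookie. None means not logged in."""
    return SESSION_STORE.get(cookie_value)


cookie = log_in(user_id=1)
print(authenticate(cookie))       # the stored user ID
print(authenticate("forged-id"))  # None, because no such session record exists
</code></pre>
<p>Notice that <code>authenticate</code> can't answer without consulting the store. That server-side lookup is what makes session authentication stateful.</p>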

<p>This works perfectly when your frontend and backend live on the same domain. The browser handles cookies automatically, and Django manages sessions in the database without you thinking about it.</p>
<p>But this approach has some limitations.</p>
<ol>
<li><p><strong>The cross-domain problem:</strong> If your React frontend lives at <code>app.example.com</code> and your Django API lives at <code>api.example.com</code>, cookies become tricky. Browsers enforce strict rules about which domains can send and receive cookies.</p>
<p>You can work around this with CORS (Cross-Origin Resource Sharing) headers and special cookie settings, but it adds complexity and can be fragile.</p>
</li>
<li><p><strong>The scalability problem:</strong> Every active session is stored in the server's database. If you have 10,000 users logged in at the same time, that's 10,000 session records the server has to look up on every single request. As your application grows, this lookup becomes a bottleneck.</p>
</li>
<li><p><strong>The mobile problem:</strong> Mobile apps don't handle cookies the same way browsers do. If you're building an API that will serve both a web app and a mobile app, session cookies create extra headaches.</p>
</li>
</ol>
<h3 id="heading-how-jwt-authentication-works">How JWT Authentication Works</h3>
<p>JWTs take a fundamentally different approach. Instead of storing session data on the server, they put the authentication information directly into the token itself.</p>
<p>Here's how the flow works:</p>
<ol>
<li><p>A user sends their username and password to the server.</p>
</li>
<li><p>The server verifies the credentials and creates a JWT – a long encoded string that contains information like the user's ID and when the token expires.</p>
</li>
<li><p>The server sends this token back to the client. The client stores it (usually in memory or local storage).</p>
</li>
<li><p>On every subsequent request, the client includes the token in the request header. The server reads the token, verifies its signature, and says "this is User A. Let them through."</p>
</li>
</ol>
<p>Notice the key difference: <strong>the server never stores anything</strong>.</p>
<p>It doesn't look up a session in a database. It simply reads the token, checks its cryptographic signature to make sure nobody tampered with it, and extracts the user information. That's why JWTs are called <strong>stateless</strong> – the server doesn't maintain any state about who is logged in.</p>
<p><strong>This solves the cross-domain problem</strong> because tokens are sent in the request header, not as cookies. Headers work the same way regardless of which domain the request comes from.</p>
<p><strong>This solves the scalability problem</strong> because the server doesn't store sessions. Verifying a token is a quick cryptographic check, not a database lookup.</p>
<p><strong>This solves the mobile problem</strong> because any client that can send HTTP headers can use JWT. Mobile apps, desktop apps, other servers – they all work the same way.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/41d60dbd-707c-4483-8374-8910024bda7f.png" alt="The infographics shows the steps taken in JWT authentication" style="display:block;margin:0 auto" width="2682" height="1272" loading="lazy">
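<p>To see why the server doesn't need to store anything, here's a minimal sketch of signing and verifying an HS256-style token using only Python's standard library. In the actual project SimpleJWT handles all of this for you. The secret key, the function names, and the <code>user_id</code> claim below are assumptions made for illustration.</p>
<pre><code class="language-python">import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b"demo-secret-change-me"  # assumption: known only to the server


def b64url_encode(raw):
    """Base64url-encode bytes without padding, as JWTs do."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text):
    """Decode base64url text, restoring the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(signing_input):
    """Compute the HMAC-SHA256 signature over 'header.payload'."""
    digest = hmac.new(SECRET_KEY, signing_input.encode("ascii"), hashlib.sha256).digest()
    return b64url_encode(digest)


def create_token(user_id, lifetime_seconds=300):
    """Build a signed token carrying the user ID and an expiry timestamp."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"user_id": user_id, "exp": int(time.time()) + lifetime_seconds}
    payload = b64url_encode(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{sign(signing_input)}"


def verify_token(token):
    """Return the claims if the signature and expiry check out, else None."""
    try:
        header, payload, signature = token.split(".")
    except ValueError:
        return None  # not in header.payload.signature form
    if not hmac.compare_digest(signature, sign(f"{header}.{payload}")):
        return None  # tampered with, or signed with a different key
    claims = json.loads(b64url_decode(payload))
    if time.time() > claims["exp"]:
        return None  # expired
    return claims


token = create_token(user_id=1)
print(verify_token(token)["user_id"])  # the user ID read straight from the token

# Swap in a payload claiming to be another user while reusing the old signature.
header, payload, signature = token.split(".")
forged_payload = b64url_encode(json.dumps({"user_id": 2, "exp": 9999999999}).encode())
print(verify_token(f"{header}.{forged_payload}.{signature}"))  # None
</code></pre>
<p>Everything <code>verify_token</code> needs is either inside the token or in the server's secret key, so no session table is consulted. The forged token fails because changing the payload changes the expected signature.</p>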

<h2 id="heading-step-1-how-to-set-up-the-project-and-install-the-dependecies">Step 1: How to Set Up the Project and Install the Dependencies</h2>
<h3 id="heading-11-how-to-create-the-project">1.1 How to Create the Project</h3>
<p>Open your terminal, navigate to where you want your project to live, and run the following commands:</p>
<pre><code class="language-shell">mkdir notes-project

cd notes-project
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/594f6c90-ea92-4859-9d9b-442d2fd2f23d.png" alt="The image shows the creation of notes project folder" style="display:block;margin:0 auto" width="1642" height="429" loading="lazy">

<h3 id="heading-12-how-to-create-a-virtual-environment-and-install-the-required-dependencies">1.2 How to Create a Virtual Environment and Install the Required Dependencies</h3>
<p>You will create a virtual environment here. Type the following command:</p>
<pre><code class="language-shell">python3 -m venv venv
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/4c1c9c33-9ffb-433a-a8f0-60cf0f598e14.png" alt="The image shows the creation of the virtual environment folder after typing the command" style="display:block;margin:0 auto" width="1778" height="380" loading="lazy">

<p>The above command creates a virtual environment inside a folder called <code>venv</code>. The first <code>venv</code> is the name of the Python module that creates the environment, and the second <code>venv</code> is the name of the folder. You can name the folder anything, though <code>venv</code> is the usual choice.</p>
<p>To activate the virtual environment, we need to use the following command:</p>
<p>On macOS/Linux:</p>
<pre><code class="language-shell">source venv/bin/activate
</code></pre>
<p>On Windows:</p>
<pre><code class="language-shell">venv\Scripts\activate
</code></pre>
<p>You'll know it worked when you see <code>(venv)</code> at the beginning of your terminal prompt. From this point on, any Python packages you install will only exist inside this <strong>virtual environment</strong>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/0a58f106-684c-4286-91fc-3c60f6e45483.png" alt="The image shows virtual environment being activated" style="display:block;margin:0 auto" width="2072" height="558" loading="lazy">

<p>With the virtual environment activated, install Django, Django Rest Framework, and SimpleJWT using the following command:</p>
<pre><code class="language-shell">pip install django djangorestframework djangorestframework-simplejwt 
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/af83ad3f-3201-4e59-9367-e95630ae3cb3.png" alt="The image shows the installation of the packages after running the pip command" style="display:block;margin:0 auto" width="2314" height="1326" loading="lazy">

<p>You can verify everything installed correctly by running:</p>
<pre><code class="language-shell">pip list
</code></pre>
<p>You should see all three packages listed along with their dependencies.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/b725778c-57ca-42cb-a0a2-486ec3e6286e.png" alt="The image shows a list of all the dependencies along with the dependencies installed just now" style="display:block;margin:0 auto" width="2316" height="1038" loading="lazy">

<h3 id="heading-13-how-to-create-the-project-and-the-app">1.3 How to Create the Project and the App</h3>
<p>Run the following command to create the Django project:</p>
<pre><code class="language-shell">django-admin startproject notes_core .
</code></pre>
<p>The dot at the end is important. It tells Django to create the project files in your current directory instead of creating an extra nested folder.</p>
<p>Now let's type this command to create the app:</p>
<pre><code class="language-shell">python manage.py startapp notes
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/f0f95ce1-b8b0-4acb-ae5a-194fa3d74e08.png" alt="The image shows the folder structure of django project and app" style="display:block;margin:0 auto" width="2860" height="1638" loading="lazy">

<h3 id="heading-14-how-to-register-the-app-and-django-rest-framework-drf">1.4 How to Register the App and Django Rest Framework (DRF)</h3>
<p>Open <code>notes_core/settings.py</code> and add <code>rest_framework</code> and <code>notes</code> to the <code>INSTALLED_APPS</code> list:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/35988eff-506c-43af-ac6c-ef698e843ad3.png" alt="The image shows the DRF and notes apps being added to the INSTALLED_APPS list" style="display:block;margin:0 auto" width="2740" height="1312" loading="lazy">
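<p>In text form, the edited list would look roughly like this. The six <code>django.contrib</code> entries are the defaults that <code>startproject</code> typically generates, so treat them as an assumption and keep whatever your own file already contains.</p>
<pre><code class="language-python"># notes_core/settings.py

INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
    "rest_framework",  # Django Rest Framework
    "notes",  # the app created in step 1.3
]
</code></pre>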

<p>Django now knows about your new app and the REST framework. Let's move on to the most important architectural decision you'll make for this project.</p>
<h2 id="heading-step-2-how-to-create-a-custom-user-model">Step 2: How to Create a Custom User Model</h2>
<p>If you've built Django projects before, you might have used Django's default User model. For quick prototypes, that works fine. But for any project you plan to grow or maintain, starting with a custom user model is a best practice you should never skip.</p>
<p>Here's why: Django's default <code>User</code> model uses a <code>username</code> field as the primary identifier. If you later decide you want users to log in with their email address instead, or you need to add a profile picture field, or a phone number, then you're stuck.</p>
<p>Using a custom user model gives you full control over what a "user" means in your app. Instead of being tied to a username, you can design login around something more practical, like email or phone_number for a fitness or mobile-based app. You can also include fields like role (doctor, patient, receptionist in a clinic system) or date of birth directly in the user model, instead of managing a separate profile.</p>
<p>It also helps future-proof your project. If you start with the default model and later decide to switch login from username to email, or add required fields, it becomes difficult and risky to change. Using a custom user model from the beginning avoids this problem and makes it much easier to adapt your authentication system as your app grows.</p>
<p>By creating a custom user model from the start, even if it's identical to the default one, you give yourself the freedom to make changes later without any of that pain.</p>
<h3 id="heading-21-how-to-define-the-custom-user-model">2.1 How to Define the Custom User Model</h3>
<p>Open <code>notes/models.py</code> and add the following code:</p>
<pre><code class="language-python">from django.contrib.auth.models import AbstractUser
from django.db import models

class CustomUser(AbstractUser):
    pass
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/29310f13-cdc9-409a-9c36-23234882fd6e.png" alt="The image shows the code for the custom user model" style="display:block;margin:0 auto" width="2802" height="1194" loading="lazy">

<p>You are importing Django’s built-in <code>AbstractUser</code> class.</p>
<p>Think of <code>AbstractUser</code> as a ready-made blueprint for a user. It already includes fields like username, password, email, first name, and last name, as well as the authentication logic.</p>
<p>The <code>pass</code> statement means you're not adding any extra fields yet.</p>
<p>But the key point is that this model is yours. So this model behaves exactly like Django’s default user model, but with one <strong>big advantage</strong>: you now have the flexibility to customize it later.</p>
<p>If three months from now you need to add a <code>phone_number</code> field or switch to email-based login, you just add a field to this class and run a migration.</p>
<pre><code class="language-python">from django.contrib.auth.models import AbstractUser
from django.db import models

class CustomUser(AbstractUser):
    # blank=True lets the migration run without prompting for a default value
    phone_number = models.CharField(max_length=15, blank=True)
</code></pre>
<p>You can also see all the fields that the <code>CustomUser</code> class has inherited from the <code>AbstractUser</code> class.</p>
<p>To do this we can use the Python shell. Type the following command:</p>
<pre><code class="language-shell">python manage.py shell
</code></pre>
<p>When you type this command, make sure that the virtual environment is active:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/3c958f7f-6403-4eaf-b339-43532a02af6a.png" alt="The image shows the command to enter into the python shell with the virtual environment being activated" style="display:block;margin:0 auto" width="2588" height="604" loading="lazy">

<p>After this, import the <code>CustomUser</code> model in the shell:</p>
<pre><code class="language-python">from notes.models import CustomUser
</code></pre>
<p>After that, type the following code:</p>
<pre><code class="language-python">[field.name for field in CustomUser._meta.get_fields()]
</code></pre>
<p>The above statement lists out all the fields in the <code>CustomUser</code> class.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/5a1b3314-7b48-41f6-98a9-dc3133cfce4c.png" alt="The image shows the output of all the fields inherited by the CustomUser model" style="display:block;margin:0 auto" width="2638" height="904" loading="lazy">

<h3 id="heading-22-how-to-tell-django-to-use-your-custom-user-model">2.2 How to Tell Django to Use Your Custom User Model</h3>
<p>Now comes the important bit. Open <code>notes_core/settings.py</code> and add this line:</p>
<pre><code class="language-python">AUTH_USER_MODEL = 'notes.CustomUser'
</code></pre>
<p>This setting tells Django to use your <code>CustomUser</code> model instead of the built-in one for everything authentication-related such as login, permissions, foreign keys, and so on.</p>
<p>There's no strict rule about where you need to add it, but the best practice is to add it near the end of the file.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/87c9b03c-908d-4b2d-a6d5-9528ff98d1ab.png" alt="The image shows the above code being added to the settings.py file" style="display:block;margin:0 auto" width="2870" height="1280" loading="lazy">

<p>You can see which user model Django is using by calling the <code>get_user_model()</code> function.</p>
<p>Open the Python shell again and import the <code>get_user_model()</code> function:</p>
<pre><code class="language-shell">from django.contrib.auth import get_user_model 
</code></pre>
<p>Then use <code>get_user_model()</code> and print the output:</p>
<pre><code class="language-shell">user = get_user_model()
print(user)
</code></pre>
<p>You should see the name of our model being used:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/895d5bcc-6880-4c4d-9007-96d44e9fa496.png" alt="895d5bcc-6880-4c4d-9007-96d44e9fa496" style="display:block;margin:0 auto" width="1580" height="276" loading="lazy">

<p>If you hadn't added the <code>AUTH_USER_MODEL</code> in the <code>settings.py</code> file, then Django would have used the default user model:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/a89f013a-e5d8-4e59-90b8-7da594188c13.png" alt="The image shows the default user model being used by Django" style="display:block;margin:0 auto" width="1886" height="1126" loading="lazy">

<p><strong>Note:</strong> You'll need to do this before you run your first migration. If you run migrate before setting AUTH_USER_MODEL, Django creates tables for the default User model, and switching afterward becomes a headache.</p>
<h3 id="heading-23-how-to-run-migrations">2.3 How to Run Migrations</h3>
<p>Now create and apply the initial migrations:</p>
<pre><code class="language-shell">python manage.py makemigrations
python manage.py migrate
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/cbca499a-286c-4f07-9a3e-375a69b4c374.png" alt="The image shows the output after running the above commands" style="display:block;margin:0 auto" width="2072" height="1222" loading="lazy">

<p>Django will create the necessary tables for your custom user model along with all the built-in Django tables.</p>
<p>We can again peek under the hood to see the SQL queries that Django used to create the tables, especially the <code>CustomUser</code> table.</p>
<p>Type this command:</p>
<pre><code class="language-shell">python manage.py sqlmigrate notes 0001
</code></pre>
<p>Here <code>notes</code> is the name of the app and <code>0001</code> represents the migration number.</p>
<p>And you should get this output:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/850bba6c-8e54-4beb-8b85-f30f193341d5.png" alt="The image shows the output after the sqlmigrate command is executed" style="display:block;margin:0 auto" width="2714" height="1486" loading="lazy">

<p>Let's also create a superuser so you can access the admin panel later for debugging:</p>
<pre><code class="language-shell">python manage.py createsuperuser
</code></pre>
<p>Fill in the username, email (optional), and password when prompted.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/43820470-86b4-4274-834a-19a97adbc208.png" alt="The image shows the super user being created" style="display:block;margin:0 auto" width="2172" height="596" loading="lazy">

<h2 id="heading-step-3-how-to-define-the-note-model">Step 3: How to Define the Note Model</h2>
<p>Now let's create the data model for the core of your application. First add a new import to use the <code>settings</code> object.</p>
<pre><code class="language-python">from django.conf import settings
</code></pre>
<p>Then add the following code below the <code>CustomUser</code> class:</p>
<pre><code class="language-python">class Notes(models.Model):
    owner = models.ForeignKey(
        settings.AUTH_USER_MODEL,
        on_delete=models.CASCADE,
        related_name='notes'
    )
    title = models.CharField(max_length=200)
    body = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)
    def __str__(self):
        return f"{self.title} (by {self.owner.username})"
</code></pre>
<p>Here's the complete <code>models.py</code> code:</p>
<pre><code class="language-python">from django.contrib.auth.models import AbstractUser
from django.db import models
from django.conf import settings

class CustomUser(AbstractUser):
    pass

class Notes(models.Model):
    owner = models.ForeignKey(
        settings.AUTH_USER_MODEL,
        on_delete=models.CASCADE,
        related_name='notes'
    )
    title = models.CharField(max_length=200)
    body = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)
    def __str__(self):
        return f"{self.title} (by {self.owner.username})"
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/24f1c5e3-6f9a-4dce-a7dc-d5d17aecc202.png" alt="The image shows the complete models.py file" style="display:block;margin:0 auto" width="2728" height="1314" loading="lazy">

<p>Let's walk through each field:</p>
<ol>
<li><p><code>owner = models.ForeignKey(settings.AUTH_USER_MODEL, ...)</code>: Creates a relationship between each note and a user. The <code>ForeignKey</code> field tells Django that each note belongs to exactly one user, but a user can have many notes.</p>
<p>Notice that we use <code>settings.AUTH_USER_MODEL</code> instead of directly importing <code>CustomUser</code>. This is the recommended practice because it keeps your code flexible. If you ever change the user model reference in settings, this foreign key adapts automatically.</p>
<p>The <code>on_delete=models.CASCADE</code> means that if a user is deleted, all their notes are deleted too.</p>
<p>The <code>related_name='notes'</code> lets you access a user's notes with <code>user.notes.all()</code>.</p>
</li>
<li><p><code>title = models.CharField(max_length=200)</code>: Creates a text field for the note title, limited to 200 characters.</p>
</li>
<li><p><code>body = models.TextField()</code>: Holds the actual note content. <code>TextField</code> has no character limit, so users can write as much as they need.</p>
</li>
<li><p><code>created_at = models.DateTimeField(auto_now_add=True)</code>: Automatically records the date and time when a note is created. You never need to set this manually.</p>
<p>The <code>__str__()</code> method gives each note a readable representation. Instead of seeing "Notes object (1)" in the admin panel or during debugging, you'll see something like "Meeting Notes (by Solina)."</p>
</li>
</ol>
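<p>To get a feel for what <code>on_delete=models.CASCADE</code> and <code>related_name='notes'</code> mean in practice, here's a minimal sketch that uses Python's built-in <code>sqlite3</code> module instead of the Django ORM. Django emulates the cascade itself rather than relying on the database constraint, so treat this only as an illustration of the behavior. The table names <code>app_user</code> and <code>note</code> and the helper <code>titles_for()</code> are made up for this example.</p>

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE app_user (id INTEGER PRIMARY KEY, username TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE note (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        owner_id INTEGER NOT NULL REFERENCES app_user(id) ON DELETE CASCADE
    )
    """
)

conn.execute("INSERT INTO app_user (id, username) VALUES (1, 'solina'), (2, 'arjun')")
conn.execute(
    "INSERT INTO note (title, owner_id) "
    "VALUES ('Meeting Notes', 1), ('Groceries', 1), ('Ideas', 2)"
)


def titles_for(owner_id):
    """Rough equivalent of user.notes.all(): every note that points at this user."""
    rows = conn.execute(
        "SELECT title FROM note WHERE owner_id = ? ORDER BY id", (owner_id,)
    )
    return [title for (title,) in rows]


solina_before = titles_for(1)

# Deleting the user removes that user's notes as well, and nobody else's.
conn.execute("DELETE FROM app_user WHERE id = 1")
solina_after = titles_for(1)
remaining = conn.execute("SELECT COUNT(*) FROM note").fetchone()[0]
```

<p>After the delete, the first user's notes are gone while the second user's note is still there, which is the same outcome you'd expect when deleting a <code>CustomUser</code> in this project.</p>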
<h3 id="heading-32-how-to-apply-migration">3.2 How to Apply Migration</h3>
<p>Run the migration commands to create the Note table:</p>
<pre><code class="language-shell">python manage.py makemigrations
python manage.py migrate
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/bf4b3469-89d8-4f8b-8b0d-1ea24eda5e2e.png" alt="The image shows the result of migrating the notes model" style="display:block;margin:0 auto" width="2594" height="1304" loading="lazy">

<p>As before, we can use the <code>sqlmigrate</code> command to see the exact SQL query Django used to create the <code>notes</code> table:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/606d0d20-7c4f-40fb-a37b-de623ccee574.png" alt="The image shows the SQL query to create the notes table  and reference to the custom user table created earlier" style="display:block;margin:0 auto" width="2700" height="634" loading="lazy">

<h3 id="heading-33-how-to-register-models-in-the-admin">3.3 How to Register Models in the Admin</h3>
<p>Open <code>notes/admin.py</code> and register both models so you can inspect data through the admin panel:</p>
<pre><code class="language-python">from django.contrib import admin
from .models import CustomUser, Notes

admin.site.register(CustomUser)
admin.site.register(Notes)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/482fc6de-33d2-4d09-8c6f-ce3df684eaa3.png" alt="The image shows the code for admin.py" style="display:block;margin:0 auto" width="2376" height="736" loading="lazy">

<p>This is helpful during development when you want to quickly check whether data is being saved correctly.</p>
<h2 id="heading-step-4-how-to-create-the-serializer">Step 4: How to Create the Serializer</h2>
<p>In DRF, a serializer is like a bridge between your database and the internet.</p>
<p>Django models store data as Python objects. But when you want to send that data to a frontend application (like React or a mobile app), you can't send Python objects. You need to send it in a format that every client understands, which is usually JSON.</p>
<p>Serializers perform three main jobs:</p>
<ol>
<li><p><strong>Serialization:</strong> Converting complex Python objects (Models) into Python dictionaries (which can be easily rendered into JSON).</p>
</li>
<li><p><strong>Deserialization:</strong> Converting JSON data coming from a user back into complex Python objects.</p>
</li>
<li><p><strong>Validation:</strong> Checking if the incoming data is correct before saving it to the database.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/a7339d6d-e338-4e12-a837-a780e85752a6.png" alt="The image shows the serialization deserialization process" style="display:block;margin:0 auto" width="800" height="287" loading="lazy">
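<p>Before writing real serializers, it may help to see the three jobs in plain Python. The sketch below doesn't use DRF at all. The <code>Note</code> dataclass and the <code>serialize()</code>, <code>validate()</code>, and <code>deserialize()</code> helpers are hypothetical stand-ins that only mimic what a serializer does for you.</p>

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Note:
    """Stand-in for a model instance; this is not a Django model."""
    id: int
    title: str
    body: str


def serialize(note):
    """Serialization: Python object to a plain dict that json.dumps can render."""
    return asdict(note)


def validate(data):
    """Validation: collect errors instead of saving bad data."""
    errors = {}
    title = data.get("title", "")
    if not isinstance(title, str) or not title.strip():
        errors["title"] = "This field is required."
    elif len(title) > 200:
        errors["title"] = "Ensure this field has no more than 200 characters."
    if not isinstance(data.get("body"), str):
        errors["body"] = "This field is required."
    return errors


def deserialize(raw_json, new_id):
    """Deserialization: JSON text back to a Python object, only if it is valid."""
    data = json.loads(raw_json)
    errors = validate(data)
    if errors:
        raise ValueError(errors)
    return Note(id=new_id, title=data["title"], body=data["body"])


note = deserialize('{"title": "Meeting Notes", "body": "Discuss roadmap"}', new_id=1)
as_json = json.dumps(serialize(note))
bad = validate({"title": "", "body": "text"})
```

<p>A DRF serializer bundles these steps into one class and derives the validation rules from your model, so you rarely write this logic by hand.</p>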

<h3 id="heading-41-how-to-create-userserializer">4.1 How to Create UserSerializer</h3>
<p>Create a new file called <code>notes/serializers.py</code> and add the following code:</p>
<pre><code class="language-python">from rest_framework import serializers
from django.contrib.auth import get_user_model

User = get_user_model()
class UserSerializer(serializers.ModelSerializer):
    password = serializers.CharField(write_only=True)

    class Meta:
        model = User
        fields = ['id', 'username', 'email', 'password']

    def create(self, validated_data):
        user = User.objects.create_user(
            username=validated_data['username'],
            email=validated_data.get('email', ''),
            password=validated_data['password']
        )
        return user
</code></pre>
<p>Let's break down this serializer.</p>
<ol>
<li><p>The <code>UserSerializer</code> handles user registration.</p>
</li>
<li><p><code>User = get_user_model()</code> gets the user model that you're using and stores it in the variable <code>User</code>. In our case, we're using the <code>CustomUser</code> model.</p>
</li>
<li><p><code>class UserSerializer(serializers.ModelSerializer):</code>: Here you've created the <code>UserSerializer</code> class, which inherits from <code>ModelSerializer</code>.</p>
<p>A <code>ModelSerializer</code> is a shortcut that automatically creates a serializer class with fields that correspond to the fields of the model class.</p>
<p>When we use a <code>ModelSerializer</code>, DRF inspects the model and automatically does these things:</p>
<p>1. Generates fields from the model so you don't have to<br>2. Automatically adds field validations that are present in the model<br>3. Implements <code>create()</code> and <code>update()</code> methods. A <code>ModelSerializer</code> knows which model to use and how to update and create it. You can override <code>create()</code> and <code>update()</code> methods if you need customized behaviors. <strong>You have overridden the</strong> <code>create()</code> <strong>method in the above code.</strong></p>
</li>
<li><p><code>password = serializers.CharField(write_only=True)</code>: This line is crucial. The <code>write_only=True</code> flag means the password will be accepted during registration but will <strong>never</strong> appear in any API response. Without this, your API would send back the password (even if hashed) every time user data is returned.</p>
<p>So users can create accounts, but their passwords are never sent back in a response.</p>
</li>
<li><p><code>class Meta</code>: Inside the <code>Meta</code> class, you tell the serializer which model to use and which fields to handle. In this case, the model is <code>User</code> and the fields are <code>id</code>, <code>username</code>, <code>email</code>, and <code>password</code>.</p>
</li>
<li><p>The <code>create()</code> method: This is the most important part. This method runs when we create a new user. Instead of using the default <code>.create()</code> method you have overridden it.</p>
<p>It's important to understand why we have overridden this method. The default <code>create()</code> method is not suitable for creating users securely.</p>
<p>By default this method stores the password in plain text format. This is a serious problem because passwords should never be stored in raw form. They need to be <strong>hashed</strong> so that even if the database is compromised, the passwords are never exposed.</p>
<p>Django provides a special method called <code>create_user()</code> that automatically handles this by <strong>hashing the password</strong> and setting up the user properly for authentication.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/c1246e5d-104d-486c-965c-8edb04c850dc.png" alt="The image shows the annotated explanation of the code above" style="display:block;margin:0 auto" width="2340" height="1280" loading="lazy">
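<p>If you're wondering what "hashing the password" actually involves, here's a simplified sketch using only the standard library. Django's default hasher is also based on PBKDF2 with SHA-256, but this is not Django's implementation. The iteration count, the salt size, and the <code>hash_password()</code> and <code>check_password()</code> helpers are assumptions made for this example.</p>

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor; Django picks its own default


def hash_password(password, salt=None):
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt for every user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def check_password(password, salt, expected_digest):
    """Re-hash the candidate with the stored salt and compare in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, expected_digest)


salt, stored = hash_password("s3cret-pass")
```

<p>Only the salt and the digest would be stored. The original password can't be read back from them, yet a login attempt can still be checked by hashing the submitted password the same way.</p>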

<h3 id="heading-42-how-to-create-noteserializer">4.2 How to Create NoteSerializer</h3>
<p>After the <code>UserSerializer</code> class, let's create the <code>NoteSerializer</code> class. The <code>NoteSerializer</code> handles the notes data.</p>
<p>First of all, you need to import the <code>Notes</code> class. Add the line <code>from .models import Notes</code> after the last import.</p>
<p>Put this code below the <code>UserSerializer</code> class:</p>
<pre><code class="language-python">class NoteSerializer(serializers.ModelSerializer):
    owner = serializers.ReadOnlyField(source='owner.username')
    class Meta:
        model = Notes
        fields = ['id', 'owner', 'title', 'body', 'created_at']
</code></pre>
<p>Now let's break it down:</p>
<ol>
<li><p><code>owner = serializers.ReadOnlyField(source='owner.username')</code>: This is the most important line in the code. This makes the <code>owner</code> field <strong>read-only</strong>. That means the API will display who owns a note (showing their username), but no one can set or change the owner through the API.</p>
<p>Without this protection, a malicious user could send a POST request with <code>"owner": 5</code> and assign their note to someone else's account, or worse, modify someone else's notes by reassigning ownership.</p>
<p>The <code>source='owner.username'</code> part tells DRF to display the owner's username instead of their numeric ID, which makes the API responses more readable.</p>
</li>
<li><p><code>class Meta:</code> ...: As before, the <code>Meta</code> class contains the model that the serializer uses and the fields that the API will expose.</p>
<p>Here is the complete code in the <code>serializers.py</code> file:</p>
</li>
</ol>
<pre><code class="language-python">from rest_framework import serializers
from django.contrib.auth import get_user_model
from .models import Notes

User = get_user_model()
class UserSerializer(serializers.ModelSerializer):
    password = serializers.CharField(write_only=True)
    class Meta:
        model = User
        fields = ['id', 'username', 'email', 'password']

    def create(self, validated_data):
        user = User.objects.create_user(
            username=validated_data['username'],
            email=validated_data.get('email', ''),
            password=validated_data['password']
        )
        return user

class NoteSerializer(serializers.ModelSerializer):
    owner = serializers.ReadOnlyField(source='owner.username')
    class Meta:
        model = Notes
        fields = ['id', 'owner', 'title', 'body', 'created_at']
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/3073f560-79dc-4950-8acb-96d871e0511c.png" alt="The image shows the complete code for the serializers.py file" style="display:block;margin:0 auto" width="2006" height="1310" loading="lazy">

<h2 id="heading-step-5-how-to-configure-simplejwt">Step 5: How to Configure SimpleJWT</h2>
<p>Now let's set up the authentication system. This is where you tell DRF to use JWT for authentication instead of sessions. This step is crucial because without it, DRF falls back to its default session and basic authentication.</p>
<p>SimpleJWT provides a complete JWT implementation for DRF, so you don't have to build token generation, signing, or verification from scratch.</p>
<p>The access token is what your client sends with every API request. It's short-lived by design. Think of it like a visitor badge at an office building: it gets you through the door, but it expires at the end of the day. If someone steals it, the damage is limited because it stops working soon.</p>
<p>The refresh token is longer-lived and has a single purpose: getting a new access token when the current one expires. The client stores it securely and only sends it to one specific endpoint. Think of it like your employee ID card. You use it to get a new visitor badge each morning, but you don't flash it at every door.</p>
<p>This separation exists for security. If the short-lived access token is compromised (which is more likely since it's sent with every request), the attacker has a narrow window before it expires. The refresh token, which is sent less frequently, has a lower risk of interception.</p>
<p>Let's look at how the access and refresh tokens work together:</p>
<ol>
<li><p>User logs in, server gives both access token and refresh token</p>
</li>
<li><p>User makes requests using the access token</p>
</li>
<li><p>Access token expires</p>
</li>
<li><p>App sends refresh token to server</p>
</li>
<li><p>Server checks it and gives a new access token</p>
</li>
<li><p>User continues without logging in again</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/9cf61a53-99c6-4665-bf5d-dccafa71ca8d.png" alt="The image shows the use of access and refresh tokens" style="display:block;margin:0 auto" width="1840" height="1006" loading="lazy">
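<p>Both tokens are signed strings made of three parts: a header, a payload, and a signature. The sketch below builds and checks such a token with the standard library so you can see why a tampered token gets rejected. It's a teaching aid, not SimpleJWT's code. The secret key, the payload contents, and the <code>sign()</code> and <code>verify()</code> helpers are assumptions made for this example, and real tokens also carry claims such as an expiry time.</p>

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret-key"  # hypothetical key for the sketch only


def b64url(data):
    """Base64url-encode bytes without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text):
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(payload, secret=SECRET):
    """Build header.payload.signature using HMAC-SHA256 (HS256)."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [
        b64url(json.dumps(header, separators=(",", ":")).encode()),
        b64url(json.dumps(payload, separators=(",", ":")).encode()),
    ]
    signing_input = ".".join(parts).encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return ".".join(parts + [b64url(signature)])


def verify(token, secret=SECRET):
    """Return the payload if the signature matches, otherwise None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        return None
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(expected, signature_b64):
        return None
    return json.loads(b64url_decode(payload_b64))


token = sign({"token_type": "access", "user_id": 1})
# Change the last character of the signature to simulate tampering.
tampered = token[:-1] + ("A" if token[-1] != "A" else "B")
```

<p>Because the signature depends on both the content and the server's secret, a client can read a token but can't change it or forge a new one without the key.</p>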

<h3 id="heading-51-how-to-update-rest-framework-settings">5.1 How to Update REST Framework Settings</h3>
<p>Open <code>notes_core/settings.py</code> and add the following code:</p>
<pre><code class="language-python">from datetime import timedelta
REST_FRAMEWORK = {
    'DEFAULT_AUTHENTICATION_CLASSES': (
        'rest_framework_simplejwt.authentication.JWTAuthentication',
    ),

    'DEFAULT_PERMISSION_CLASSES': (
        'rest_framework.permissions.IsAuthenticated',
    ),
}

SIMPLE_JWT = {
    'ACCESS_TOKEN_LIFETIME': timedelta(minutes=30),
    'REFRESH_TOKEN_LIFETIME': timedelta(days=1),
}
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/54e89c41-b456-4bbb-a129-ac3e92a09a6d.png" alt="The image shows the code being added to settings.py file" style="display:block;margin:0 auto" width="2492" height="1528" loading="lazy">

<p>Let's unpack what each section does.</p>
<p>The <code>DEFAULT_AUTHENTICATION_CLASSES</code> setting tells DRF to use JWT as the authentication method for all API endpoints. Every incoming request will be checked for a valid JWT token in the Authorization header.</p>
<p>The <code>DEFAULT_PERMISSION_CLASSES</code> setting sets <code>IsAuthenticated</code> as the global permission policy. This means every endpoint in your API is locked down by default. Only users with a valid token can access any endpoint.</p>
<p>This is a secure-by-default approach: instead of remembering to protect each view individually, everything is protected, and you explicitly open up the endpoints that need to be public <em>(like the registration endpoint, which you'll handle in the next step).</em></p>
<p>The <code>SIMPLE_JWT</code> dictionary controls token behavior. The access token lasts 30 minutes. This is the token clients include in every request. If someone intercepts it, the damage is limited to a 30-minute window. The refresh token lasts one day.</p>
<p>When the access token expires, the client can use the refresh token to get a new access token without forcing the user to log in again. Once the refresh token itself expires after one day, the user must log in again with their username and password. You'll see exactly how this works later when you test with Postman.</p>
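<p>The two lifetimes can be reasoned about with simple date arithmetic. Here's a small sketch of that reasoning, assuming the refresh token keeps its original issue time when it's used, which I believe is how SimpleJWT behaves unless you enable refresh token rotation. The <code>is_expired()</code> and <code>next_step()</code> helpers are hypothetical and exist only for this illustration.</p>

```python
from datetime import datetime, timedelta, timezone

ACCESS_TOKEN_LIFETIME = timedelta(minutes=30)
REFRESH_TOKEN_LIFETIME = timedelta(days=1)


def is_expired(issued_at, lifetime, now):
    """A token is expired once `now` reaches issued_at + lifetime."""
    return now >= issued_at + lifetime


def next_step(access_issued_at, refresh_issued_at, now):
    """Describe what the client has to do at `now`."""
    if not is_expired(access_issued_at, ACCESS_TOKEN_LIFETIME, now):
        return "use access token"
    if not is_expired(refresh_issued_at, REFRESH_TOKEN_LIFETIME, now):
        return "refresh access token"
    return "log in again"


login = datetime(2026, 1, 1, 9, 0, tzinfo=timezone.utc)

# Ten minutes after login the access token is still valid.
at_0910 = next_step(login, login, login + timedelta(minutes=10))
# Forty-five minutes after login it has expired, but the refresh token has not.
at_0945 = next_step(login, login, login + timedelta(minutes=45))
# More than a day after login both tokens are expired.
next_day = next_step(login + timedelta(minutes=45), login, login + timedelta(days=1, minutes=30))
```

<p>With these settings a user who stays active refreshes roughly every half hour in the background and signs in with a password about once a day.</p>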
<h3 id="heading-52-how-to-add-token-url-endpoints">5.2 How to Add Token URL Endpoints</h3>
<p>SimpleJWT provides ready-made views for obtaining and refreshing tokens. You just need to wire them up to URLs.</p>
<p>Open <code>notes_core/urls.py</code> and update it with the following code:</p>
<pre><code class="language-python">from django.contrib import admin
from django.urls import path, include
from rest_framework_simplejwt.views import (
    TokenObtainPairView,
    TokenRefreshView,
)

urlpatterns = [
    path('admin/', admin.site.urls),
    path('api/', include('notes.urls')),
    path('api/token/', TokenObtainPairView.as_view(), name='token_obtain_pair'),
    path('api/token/refresh/', TokenRefreshView.as_view(), name='token_refresh'),
]
</code></pre>
<p>The <code>token/</code> endpoint accepts a username and password, and returns an access token and a refresh token.</p>
<p>The <code>token/refresh/</code> endpoint accepts a refresh token and returns a new access token. You'll see these in action during testing.</p>
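<p>From a client's point of view, the two endpoints and the <code>Authorization</code> header fit together roughly as follows. This sketch only builds the requests with <code>urllib</code> and doesn't send them. The base URL assumes the development server's default address, the credentials and the token placeholder are made up, and the helper names are hypothetical.</p>

```python
import json
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8000"  # assumed address of the local development server


def token_request(username, password):
    """POST credentials to api/token/ to obtain an access and a refresh token."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/token/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def refresh_request(refresh_token):
    """POST the refresh token to api/token/refresh/ to obtain a new access token."""
    body = json.dumps({"refresh": refresh_token}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/token/refresh/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def authorized_request(path, access_token):
    """Attach the access token to any other API call as a Bearer token."""
    return Request(
        f"{BASE_URL}{path}",
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


login = token_request("solina", "s3cret-pass")
notes = authorized_request("/api/notes/", "ACCESS_TOKEN_PLACEHOLDER")
```

<p>To actually send one of these requests, you could pass it to <code>urllib.request.urlopen()</code> while the development server is running. Postman does the same thing through a graphical interface.</p>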
<h2 id="heading-step-6-how-to-build-the-authentication-logic">Step 6: How to Build the Authentication Logic</h2>
<p>Open <code>notes/views.py</code> and add the following:</p>
<pre><code class="language-python">from rest_framework import generics, permissions
from django.contrib.auth import get_user_model
from .serializers import UserSerializer

User = get_user_model()

class RegisterView(generics.CreateAPIView):
    queryset = User.objects.all()
    serializer_class = UserSerializer
    permission_classes = [permissions.AllowAny]
</code></pre>
<p>Now let's walk through this code.</p>
<p>The first section contains the imports. After that, we use the <code>get_user_model()</code> function to get the <code>CustomUser</code> model.</p>
<p>The main part is the <code>RegisterView</code> class. It inherits from <code>generics.CreateAPIView</code>, which is a built-in DRF view designed specifically for handling POST requests that create new objects.</p>
<p>Because of this, you don’t have to manually write the logic for handling POST requests, validating data, or saving to the database. DRF does all of that for you behind the scenes.</p>
<p>Inside the class, <code>queryset = User.objects.all()</code> defines the set of user objects this view can work with.</p>
<p>The <code>serializer_class = UserSerializer</code> tells the view which serializer to use for validating incoming data and creating the user.</p>
<p>Finally, <code>permission_classes = [permissions.AllowAny]</code> overrides the global <code>IsAuthenticated</code> permission you set earlier in <code>DEFAULT_PERMISSION_CLASSES</code>.</p>
<p>This means that anyone can access the registration endpoint, even if they aren't logged in. This makes sense for a registration endpoint because new users won’t have accounts yet.</p>
<p>Every other view in your API will inherit the global IsAuthenticated permission, so only this registration endpoint is open.</p>
<h2 id="heading-step-7-how-to-implement-scoped-views">Step 7: How to Implement Scoped Views</h2>
<p>This is the heart of the tutorial. You've set up authentication so the API knows <strong>who</strong> is making a request. Now you need to make sure each user can only interact with <strong>their</strong> <strong>own</strong> notes.</p>
<p>Think of it this way: authentication is the lock on the front door of an apartment building. It keeps strangers out. But scoping is the lock on each individual apartment. Just because you live in the building doesn't mean you can walk into your neighbor's apartment.</p>
<p>Without scoping, an authenticated user could potentially see every note in the database, or worse, modify notes that belong to someone else. Two method overrides on your viewset prevent this entirely.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/03747660-068c-4757-9fc2-d2a122f22f5f.png" alt="The image represents the differences of access resources with and without scoping" style="display:block;margin:0 auto" width="1024" height="559" loading="lazy">

<h3 id="heading-71-how-to-create-a-noteviewset">7.1 How to Create a NoteViewSet</h3>
<p>Now let's create the <code>NoteViewSet</code>. First add these imports to the top of the file. We're importing the viewsets, serializers, and model.</p>
<pre><code class="language-python">from .models import Notes
from .serializers import UserSerializer, NoteSerializer
from rest_framework import generics, viewsets, permissions
</code></pre>
<p>Add the following to <code>notes/views.py</code>, below the RegisterView:</p>
<pre><code class="language-python">class NoteViewSet(viewsets.ModelViewSet):
    serializer_class = NoteSerializer

    def get_queryset(self):
        return Notes.objects.filter(owner=self.request.user).order_by('-created_at')

    def perform_create(self, serializer):
        serializer.save(owner=self.request.user)
</code></pre>
<p>Now let's talk about this code in detail.</p>
<p>You've created a new class called <code>NoteViewSet</code> which inherits from the DRF class <code>ModelViewSet</code>. This gives you full CRUD operations, meaning you can list notes and retrieve a single note, as well as create, update, and delete a note.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/25999ad8-3534-424a-ab35-cad9cecec8ef.png" alt="The image shows the Model Viewset being imported" style="display:block;margin:0 auto" width="1338" height="248" loading="lazy">

<p>The next part <code>serializer_class = NoteSerializer</code> tells Django to use the <code>NoteSerializer</code> class to convert between Python objects and JSON.</p>
<p>But the magic is the two methods that you are overriding: <code>get_queryset()</code> and <code>perform_create()</code>.</p>
<p>The <code>get_queryset()</code> method controls which notes the API returns. If you had instead set <code>queryset = Notes.objects.all()</code> on the class, every authenticated user would have access to every note in the database.</p>
<p>But here, you've overridden this method so that it filters notes by the current user.</p>
<p>Next is the <code>perform_create()</code> method, which DRF calls when a new note is saved. You've overridden it so that the note is saved with the currently logged-in user as its owner. If you hadn't overridden this method, the note would be saved without an owner, because the serializer's <code>owner</code> field is read-only, and the database would reject it.</p>
<p>Notice that you've passed <code>owner=self.request.user</code> to <code>serializer.save()</code>. This is the code that attaches the logged-in user as the owner of the note.</p>
<p>Remember how you made the owner field read-only in the serializer? This is the other half of that security measure.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/6db94c2f-673f-480a-bf20-730ed4af4bdb.png" alt="6db94c2f-673f-480a-bf20-730ed4af4bdb" style="display:block;margin:0 auto" width="1718" height="710" loading="lazy">

<p>The user can't set the owner through the API request, and the server automatically sets it to whoever is authenticated. These two pieces work together to make ownership tamper-proof.</p>
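<p>Here's a rough, framework-free sketch of how those two pieces combine on the write side. It isn't DRF code. The <code>READ_ONLY_FIELDS</code> set and the <code>create_note()</code> helper are hypothetical and only mimic what the read-only <code>owner</code> field and <code>perform_create()</code> achieve together.</p>

```python
READ_ONLY_FIELDS = {"id", "owner", "created_at"}  # fields a client may not set


def create_note(payload, request_user, next_id):
    """Build a note from client input, taking the owner from the request, not the payload."""
    writable = {
        key: value for key, value in payload.items() if key not in READ_ONLY_FIELDS
    }
    return {"id": next_id, "owner": request_user, **writable}


# The client tries to assign the note to another account.
malicious_payload = {"title": "Hello", "body": "World", "owner": "someone_else"}
note = create_note(malicious_payload, request_user="solina", next_id=1)
```

<p>Whatever the client sends in <code>owner</code> is ignored, and the note ends up belonging to the user who made the request.</p>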
<h3 id="heading-72-why-this-matters-preventing-id-enumeration-attacks">7.2 Why This Matters: Preventing ID Enumeration Attacks</h3>
<p>Without get_queryset filtering, your API might allow something like this: a user sends a GET request to <code>/api/notes/42/</code> and sees a note that belongs to someone else, simply because they guessed the ID.</p>
<p>This is called an <strong>ID enumeration attack</strong> — an attacker cycles through IDs (1, 2, 3, 4...) to discover and access other people's data.</p>
<p>With your scoped <code>get_queryset</code>, even if User B sends a request to <code>/api/notes/42/</code> and note 42 belongs to User A, the viewset won't find it in User B's filtered queryset. DRF will return a 404 — as far as User B is concerned, that note doesn't exist.</p>
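<p>The difference between a scoped and an unscoped lookup can be simulated in a few lines of plain Python. This is a simplified model of the behavior described above, not what DRF runs internally. The sample data and the <code>scoped_queryset()</code> and <code>retrieve()</code> helpers are made up for this example.</p>

```python
NOTES = [
    {"id": 41, "owner": "user_b", "title": "B's note"},
    {"id": 42, "owner": "user_a", "title": "A's note"},
]


def scoped_queryset(user):
    """Only the notes owned by the requesting user, like the filtered get_queryset()."""
    return [note for note in NOTES if note["owner"] == user]


def retrieve(user, note_id, scoped=True):
    """Return (status_code, note) for a detail request."""
    candidates = scoped_queryset(user) if scoped else NOTES
    for note in candidates:
        if note["id"] == note_id:
            return 200, note
    return 404, None


# User B guesses the ID of a note that belongs to User A.
unscoped_status, leaked = retrieve("user_b", 42, scoped=False)
scoped_status, hidden = retrieve("user_b", 42)
own_status, own_note = retrieve("user_a", 42)
```

<p>Without scoping, the guessed ID returns User A's note. With scoping, the same request gets a 404, while User A can still read the note as usual.</p>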
<h2 id="heading-step-8-how-to-connect-a-url">Step 8: How to Connect a URL</h2>
<p>Now you need to wire up the views to URL paths so the API knows which view to call for each endpoint.</p>
<h3 id="heading-81-how-to-create-app-level-urls">8.1 How to Create App-level URLs</h3>
<p>Create a new file called <code>notes/urls.py</code> and add the following:</p>
<pre><code class="language-python">from django.urls import path, include
from rest_framework.routers import DefaultRouter
from .views import RegisterView, NoteViewSet

router = DefaultRouter()
router.register(r'notes', NoteViewSet, basename='note')

urlpatterns = [
    path('register/', RegisterView.as_view(), name='register'),
    path('', include(router.urls)),
]
</code></pre>
<p>The <code>DefaultRouter</code> automatically generates URL patterns for the NoteViewSet. Since you're using a <code>ModelViewSet</code>, the router creates endpoints for listing all notes, creating a note, retrieving a single note, updating a note, and deleting a note — <strong>all from that single router.register call.</strong></p>
<p>The <code>basename='note'</code> parameter is required here because your viewset doesn't have a queryset attribute defined directly on the class <em>(you're using get_queryset instead)</em>. DRF uses the <code>basename</code> to generate the URL pattern names like <code>note-list</code> and <code>note-detail</code>.</p>
<h3 id="heading-82-how-to-verify-the-project-level-urls">8.2 How to Verify the Project-Level URLs</h3>
<p>Make sure your <code>notes_core/urls.py</code> looks like this (you set this up in Step 5, but let's confirm):</p>
<pre><code class="language-python">from django.contrib import admin
from django.urls import path, include
from rest_framework_simplejwt.views import (
    TokenObtainPairView,
    TokenRefreshView,
)

urlpatterns = [
    path('admin/', admin.site.urls),
    path('api/', include('notes.urls')),
    path('api/token/', TokenObtainPairView.as_view(), name='token_obtain_pair'),
    path('api/token/refresh/', TokenRefreshView.as_view(), name='token_refresh'),
]
</code></pre>
<p>Here's the full picture of your API's URL structure:</p>
<table>
<thead>
<tr>
<th><strong>Endpoint</strong></th>
<th><strong>Method</strong></th>
<th><strong>Description</strong></th>
</tr>
</thead>
<tbody><tr>
<td><code>api/register/</code></td>
<td><strong>POST</strong></td>
<td>Create a new user account</td>
</tr>
<tr>
<td><code>api/token/</code></td>
<td><strong>POST</strong></td>
<td>Get access and refresh tokens</td>
</tr>
<tr>
<td><code>api/token/refresh/</code></td>
<td><strong>POST</strong></td>
<td>Get a new access token using a refresh token</td>
</tr>
<tr>
<td><code>api/notes/</code></td>
<td><strong>GET</strong></td>
<td>List all notes for the authenticated user</td>
</tr>
<tr>
<td><code>api/notes/</code></td>
<td><strong>POST</strong></td>
<td>Create a new note</td>
</tr>
<tr>
<td><code>api/notes/&lt;id&gt;/</code></td>
<td><strong>GET</strong></td>
<td>Retrieve a specific note</td>
</tr>
<tr>
<td><code>api/notes/&lt;id&gt;/</code></td>
<td><strong>PUT/PATCH</strong></td>
<td>Update a specific note</td>
</tr>
<tr>
<td><code>api/notes/&lt;id&gt;/</code></td>
<td><strong>DELETE</strong></td>
<td>Delete a specific note</td>
</tr>
</tbody></table>
<p>Start the development server to make sure everything runs without errors:</p>
<pre><code class="language-shell">python manage.py runserver
</code></pre>
<p>If the server starts without complaints, your code is wired up correctly.</p>
<h2 id="heading-step-9-how-to-test-the-apis-with-postman">Step 9: How to Test the APIs with Postman</h2>
<p>Building the API is one thing. Proving it works is another. Let's walk through the entire flow using Postman, from registering a user to demonstrating that scoping actually works.</p>
<p>If you haven't used Postman before, it's a tool that lets you send HTTP requests to your API and inspect the responses. You can download it from <a href="https://www.postman.com/downloads/">postman.com/downloads</a>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/cc154f8b-d3db-4c48-b884-1dbd3d517209.png" alt="Postman software download page" style="display:block;margin:0 auto" width="2518" height="1552" loading="lazy">

<p>Alternatively, you can use curl from the command line or any other API testing tool you're comfortable with.</p>
<p>Make sure your development server is running before proceeding.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/90524445-9d7c-43d0-bcf7-105f54626d85.png" alt="python server running" style="display:block;margin:0 auto" width="2078" height="678" loading="lazy">

<h3 id="heading-91-how-to-register-a-user">9.1 How to Register a User</h3>
<p>Open Postman:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/8951a205-ee01-4646-8e0b-d1d16d21c749.png" alt="opening Postman" style="display:block;margin:0 auto" width="2764" height="1792" loading="lazy">

<p>Create a new request:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>POST</th>
</tr>
</thead>
<tbody><tr>
<td><strong>URL</strong></td>
<td><code>http://127.0.0.1:8000/api/register/</code></td>
</tr>
<tr>
<td><strong>Body tab</strong></td>
<td>Select "raw" and choose "JSON" from the dropdown</td>
</tr>
<tr>
<td><strong>Body Content</strong></td>
<td>{ "username": "priya", "email": "<a href="mailto:priya@example.com">priya@example.com</a>", "password": "securepassword123" }</td>
</tr>
</tbody></table>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/f043164f-184c-4f29-8a9f-55f604d521fb.png" alt="postman UI for registering a new user" style="display:block;margin:0 auto" width="1918" height="770" loading="lazy">

<p>Click <strong>Send</strong>. You should get a <code>201 Created</code> response with the user data, <strong>without the password</strong>, thanks to the <code>write_only=True</code> field you set in the <code>UserSerializer</code> class.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/021a79db-418f-4057-97f8-f3c5c4b9761c.png" alt="response of registering a user" style="display:block;margin:0 auto" width="2024" height="1240" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/c8fd6b4e-f539-4f16-85c8-72515abadf8f.png" alt="The image shows the code of the UserSerializer class" style="display:block;margin:0 auto" width="2340" height="1280" loading="lazy">

<h3 id="heading-92-how-to-obtain-access-and-refresh-tokens">9.2 How to Obtain Access and Refresh Tokens</h3>
<p>Now log in to get your JWTs:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>POST</th>
</tr>
</thead>
<tbody><tr>
<td><strong>URL</strong></td>
<td><code>http://127.0.0.1:8000/api/token/</code></td>
</tr>
<tr>
<td><strong>Body</strong></td>
<td>{"username" : "priya", "password" : "securepassword123"}</td>
</tr>
</tbody></table>
<p>You'll get a response with access and refresh tokens.</p>
<p><strong>Copy the access token.</strong> You'll need it for every subsequent request. Also save the refresh token, as you'll use it later.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/e5667e26-b11a-463e-85d4-ba99082bee21.png" alt="The image shows the api returning access and refresh token" style="display:block;margin:0 auto" width="2016" height="1224" loading="lazy">

<p>A JWT is only encoded, not encrypted. The encoding (base64url) simply transforms the data into a safe, standard string format that can be transmitted easily over the internet.</p>
<p>Anyone can reverse the encoding to read the data, with no secret key required.</p>
<p>You can decode JWTs with the Python library <code>pyjwt</code> or with an online decoder. Use online sites with caution, since JWTs may contain sensitive information.</p>
<p>For this demo, we'll use a site called <a href="https://www.jwt.io">jwt.io</a>.</p>
<p>Open the site and paste in the access token that you have just created:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/f8b9f3e0-db1e-49fb-82be-2b4d27adc8bd.png" alt="The image describes the various sections after decoding the JWT token" style="display:block;margin:0 auto" width="2448" height="1392" loading="lazy">

<p>The JWT has three parts: the header, the payload, and the signature.</p>
<p>The header section tells you how the token is signed. In this case, it's signed using the <strong>HS256</strong> algorithm.</p>
<p>The payload is where the actual data (the claims) lives. It contains standard claims such as the token type, expiration time (<code>exp</code>), and issued-at time (<code>iat</code>), plus any custom claims.</p>
<p>The signature section is used to verify integrity. You <strong>can't decode it to meaningful data.</strong> This section ensures that the token wasn't tampered with.</p>
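<p>To see that the header and payload are readable without any secret, here is a minimal sketch that uses only Python's standard library instead of <code>pyjwt</code>. It builds a demo token locally and decodes it again. The claim values and the signature text are made up for illustration, and nothing here verifies a signature.</p>
<pre><code class="language-python">import base64
import json


def b64url_decode(segment):
    """Decode one base64url JWT segment, restoring the stripped '=' padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(raw):
    """Encode bytes as base64url without padding, the way JWTs do."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Build a demo token locally. The claims and the signature text are made up.
header = {"alg": "HS256", "typ": "JWT"}
payload = {"token_type": "access", "exp": 1700001800, "iat": 1700000000, "user_id": 1}
demo_token = ".".join([
    b64url_encode(json.dumps(header).encode("utf-8")),
    b64url_encode(json.dumps(payload).encode("utf-8")),
    b64url_encode(b"not-a-real-signature"),
])

# Reading the token back needs no secret key: split on the dots and decode.
header_part, payload_part, signature_part = demo_token.split(".")
decoded_header = json.loads(b64url_decode(header_part))
decoded_payload = json.loads(b64url_decode(payload_part))

print(decoded_header)
print(decoded_payload)
</code></pre>
<p>You could paste your own access token in place of <code>demo_token</code> to inspect its claims locally rather than on a website. Decoding only reads the token. Checking that it hasn't been tampered with still requires verifying the signature with the server's key.</p>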
<h3 id="heading-93-how-to-create-a-note">9.3 How to Create a Note</h3>
<p>Now use the access token to create a note:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>POST</th>
</tr>
</thead>
<tbody><tr>
<td><strong>URL</strong></td>
<td><code>http://127.0.0.1:8000/api/notes/</code></td>
</tr>
<tr>
<td><strong>Header tab</strong></td>
<td>Add a new header with Key: <code>Authorization</code> and Value: <code>Bearer &lt;your access token&gt;</code></td>
</tr>
<tr>
<td><strong>Body</strong></td>
<td>{ "title": "My note", "body": "This contains secret information" }</td>
</tr>
</tbody></table>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/d04e2bfb-edc6-4c12-ac93-65611ed0d805.png" alt="The image shows adding new header into postman" style="display:block;margin:0 auto" width="1762" height="1084" loading="lazy">

<p>Notice that you don't include an owner field. That's handled automatically by <code>perform_create</code>. You should get a <code>201 Created</code> response:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/19a8b564-4d49-4dc1-823c-d7d96fcb5319.png" alt="The image shows the output (response) after creating a note" style="display:block;margin:0 auto" width="1802" height="1494" loading="lazy">

<p>You can create a few more notes, so that we have some data to work with.</p>
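<p>If you'd rather script this step than click through Postman, the same request can be sketched with Python's standard library. This is only an illustration: <code>ACCESS_TOKEN</code> is a placeholder you'd replace with your own token, <code>build_create_note_request</code> is a hypothetical helper name, and the request is built but not sent unless you uncomment the last line.</p>
<pre><code class="language-python">import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"
ACCESS_TOKEN = "paste-your-access-token-here"  # placeholder, not a real token


def build_create_note_request(token, title, body):
    """Build (but do not send) the POST request that creates a note."""
    data = json.dumps({"title": title, "body": body}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/api/notes/",
        data=data,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Same header you added in Postman: the word Bearer, a space, the token.
            "Authorization": "Bearer " + token,
        },
    )


request = build_create_note_request(
    ACCESS_TOKEN, "My note", "This contains secret information"
)
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))

# With the development server running, this line would send the request:
# response = urllib.request.urlopen(request)
</code></pre>
<p>As in Postman, the body has no owner field. The token in the <code>Authorization</code> header identifies the user.</p>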
<h3 id="heading-94-how-to-list-your-notes">9.4 How to List Your Notes</h3>
<p>Now to fetch all of Priya's notes:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>GET</th>
</tr>
</thead>
<tbody><tr>
<td><strong>URL</strong></td>
<td><code>http://127.0.0.1:8000/api/notes/</code></td>
</tr>
<tr>
<td><strong>Header tab:</strong></td>
<td>Same Authorization: Bearer header</td>
</tr>
</tbody></table>
<p>You should see all the notes created, sorted by most recent first.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/8b7107e6-44ef-465c-b1aa-076ec9de1979.png" alt="The image shows the response of getting list of notes" style="display:block;margin:0 auto" width="2122" height="1634" loading="lazy">

<h3 id="heading-95-how-to-demonstrate-scoping">9.5 How to Demonstrate Scoping</h3>
<p>Let's prove that a second user can't view the first user's notes.</p>
<p>First, register the second user.</p>
<p>Send a POST request to <code>http://127.0.0.1:8000/api/register/</code> with the following data:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>POST</th>
</tr>
</thead>
<tbody><tr>
<td><strong>URL</strong></td>
<td><code>http://127.0.0.1:8000/api/register/</code></td>
</tr>
<tr>
<td><strong>Body tab</strong></td>
<td>Select "raw" and choose "JSON" from the dropdown</td>
</tr>
<tr>
<td><strong>Body Content</strong></td>
<td>{ "username": "sujan", "email": "<a href="mailto:sujan@example.com">sujan@example.com</a>", "password": "anotherpassword123" }</td>
</tr>
</tbody></table>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/094af273-4220-416a-9c8f-f33079badf83.png" alt="The image shows a new user being created" style="display:block;margin:0 auto" width="2028" height="994" loading="lazy">

<p>Then get tokens for Sujan by sending a POST request to <code>http://127.0.0.1:8000/api/token/</code> with Sujan's credentials (username and password) and then copy Sujan's access token.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/8fe22f1b-f36e-4b35-a478-1b48ea0218c3.png" alt="8fe22f1b-f36e-4b35-a478-1b48ea0218c3" style="display:block;margin:0 auto" width="2118" height="1274" loading="lazy">

<p>Now send a GET request to <code>http://127.0.0.1:8000/api/notes/</code> using Sujan's token in the Authorization header.</p>
<p>The response should be an empty list since this user hasn't created any notes:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/ef8e17d0-4f8f-4a51-8bfd-01c078247075.png" alt="The image shows the response of the get query new user's access token" style="display:block;margin:0 auto" width="2094" height="1352" loading="lazy">

<p>More importantly, Priya's notes are completely invisible to him. Even if Sujan tries to access a specific note by ID – say, <code>http://127.0.0.1:8000/api/notes/1/</code> – he'll get a <code>404 Not Found</code> response, not a <code>403 Forbidden</code>.</p>
<p>This is intentional. A <code>404 Not Found</code> doesn't reveal that the note exists, while a <code>403 Forbidden</code> would confirm its existence to a potential attacker.</p>
<p>A <code>403 Forbidden</code> response is like a door with a sign: <em>“Authorized personnel only”.</em> You now know something important is inside. A <code>404 Not Found</code> response is like a blank wall. You don’t even know a room exists.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/a4619fbb-42ce-4f80-9207-31eb81336c7c.png" alt="The image shows the difference between a 403 and 404 response code" style="display:block;margin:0 auto" width="1860" height="910" loading="lazy">

<p>Now that you know why we've used the <code>404</code> response instead of <code>403</code>, let's demonstrate this.</p>
<p>First, I'll access Priya's individual note using her credentials and her access token:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/e68e33cc-93a4-4eb5-99fe-934b283defeb.png" alt="The image shows the result of accessing individual note using the first user (Priya)" style="display:block;margin:0 auto" width="2116" height="1170" loading="lazy">

<p>Now, I'll replace her token with Sujan's (the new user's) access token and send the same request:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/e5a6d07c-a7f0-4a50-ba05-94cc5159489c.png" alt="The image shows the response of accessing the second note using the new user's (sujan) credentials" style="display:block;margin:0 auto" width="2110" height="1166" loading="lazy">

<p>You can see that using the new user's token to access the first user's note returns a <code>404 Not Found</code> response.</p>
<h2 id="heading-step-10-how-to-handle-token-expiration-with-refresh-tokens">Step 10: How to Handle Token Expiration with Refresh Tokens</h2>
<p>Access tokens are deliberately short-lived (30 minutes in your configuration). This limits the window of damage if a token is stolen.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/b86f4f24-b8b0-45d0-bcee-5e39e2268e21.png" alt="b86f4f24-b8b0-45d0-bcee-5e39e2268e21" style="display:block;margin:0 auto" width="1330" height="870" loading="lazy">

<p>But you don't want users to re-enter their credentials every 30 minutes. That's what refresh tokens are for.</p>
<p>When Priya's access token expires, her API requests will start returning <code>401 Unauthorized</code> responses. Instead of logging in again, the client sends the refresh token to get a fresh access token.</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>POST</th>
</tr>
</thead>
<tbody><tr>
<td><strong>URL</strong></td>
<td><code>http://127.0.0.1:8000/api/token/refresh/</code></td>
</tr>
<tr>
<td><strong>Body tab</strong></td>
<td>Select "raw" and choose "JSON" from the dropdown</td>
</tr>
<tr>
<td><strong>Body Content</strong></td>
<td>{ "refresh": "&lt;Priya's refresh token&gt;" }</td>
</tr>
</tbody></table>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/67d255ac-8696-45e6-be17-c80b2a6e8af0.png" alt="The image shows the response of getting a new access token using a refresh token" style="display:block;margin:0 auto" width="2110" height="1360" loading="lazy">

<p>Replace your old access token with this new one, and you're good for another 30 minutes. The refresh token itself lasts for one day, so the user only needs to fully log in again once every 24 hours.</p>
<p>In a real application, the frontend client handles this automatically. When an API call returns a <code>401</code>, the client catches it, sends the refresh token to get a new access token, and retries the original request — all without the user noticing.</p>
<p>Here's what that flow looks like, step by step:</p>
<ol>
<li><p>Client sends request with access token</p>
</li>
<li><p>Server responds with 401 (token expired)</p>
</li>
<li><p>Client sends refresh token to /api/token/refresh/</p>
</li>
<li><p>Server responds with a new access token</p>
</li>
<li><p>Client retries the original request with the new access token</p>
</li>
<li><p>Server responds with the data</p>
<img src="https://cdn.hashnode.com/uploads/covers/69bdd408475ca17974459537/f0fd3fd8-842b-4de3-8f5b-2f1d4ad2181f.png" alt="The image show the steps to get a new token after a previous one has expired" style="display:block;margin:0 auto" width="1868" height="906" loading="lazy"></li>
</ol>
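<p>The steps above can be sketched in Python as a small retry wrapper. This is an illustration under stated assumptions, not how any particular frontend library does it: <code>request_with_refresh</code>, <code>fake_send_request</code>, and <code>fake_refresh</code> are hypothetical names, and the two fake functions stand in for real HTTP calls so the example runs without a server.</p>
<pre><code class="language-python">def request_with_refresh(send_request, refresh_access_token, tokens):
    """Call the API once, and retry a single time after refreshing on a 401.

    send_request(access_token) returns (status_code, data).
    refresh_access_token(refresh_token) returns a new access token.
    Both callables are hypothetical stand-ins for real HTTP calls.
    """
    # Steps 1 and 2: send the request and inspect the status code.
    status, data = send_request(tokens["access"])
    if status != 401:
        return status, data

    # Steps 3 and 4: trade the refresh token for a new access token.
    tokens["access"] = refresh_access_token(tokens["refresh"])

    # Steps 5 and 6: retry the original request with the new token.
    return send_request(tokens["access"])


# Fake server behavior so the sketch runs without a network connection.
def fake_send_request(access_token):
    if access_token == "expired-access":
        return 401, {"detail": "token expired"}
    return 200, [{"id": 1, "title": "My note"}]


def fake_refresh(refresh_token):
    return "fresh-access"


tokens = {"access": "expired-access", "refresh": "valid-refresh"}
status, data = request_with_refresh(fake_send_request, fake_refresh, tokens)
print(status, data)
print(tokens["access"])
</code></pre>
<p>The wrapper retries only once on purpose. If the second attempt also fails, or the refresh call itself is rejected, the client should send the user back to the login screen instead of looping.</p>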
<p>If the refresh token itself has expired (after 24 hours in your configuration), step 4 will also return a <code>401</code>. At that point, the user truly needs to log in again with their username and password. This is the intended behavior: it means even a stolen refresh token has a limited useful life.</p>
<h2 id="heading-how-you-can-improve-this-project">How You Can Improve This Project</h2>
<p>This API is functional and secure, but there's plenty of room to build on it. Here are some directions you could take.</p>
<ol>
<li><p><strong>Add search and filtering.</strong> Let users search their notes by title or body text. You can use DRF's SearchFilter and django-filter to add query parameters like <code>?search=meeting</code> to the notes list endpoint.</p>
</li>
<li><p><strong>Add categories or tags.</strong> Create a <code>Category</code> model and add a <strong>foreign key</strong> to <code>Note</code>, or use a many-to-many relationship for tags. This would let users organize their notes and filter by category.</p>
</li>
<li><p><strong>Add pagination.</strong> Once a user has hundreds of notes, returning them all in a single response becomes slow. DRF has built-in pagination classes that let you return notes in pages of 10, 20, or whatever size you choose.</p>
</li>
<li><p><strong>Deploy to a production server.</strong> The API currently runs on your local machine. You could deploy it to platforms like PythonAnywhere, Railway, or Render to make it accessible from anywhere. You'd need to configure a production database (like PostgreSQL), set a secure SECRET_KEY, and serve the application behind HTTPS.</p>
</li>
<li><p><strong>Build a frontend.</strong> Connect a React, Next.js, or Vue.js frontend to this API. Store the JWTs in the client and implement the token refresh flow so users stay logged in seamlessly.</p>
</li>
<li><p><strong>Add token blacklisting.</strong> SimpleJWT supports token blacklisting, which lets you invalidate refresh tokens when a user logs out. Without this, a refresh token remains valid until it expires, even after the user "logs out."</p>
</li>
</ol>
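<p>As a starting point for the pagination, search, and blacklisting ideas, here is a hedged sketch of the settings involved. The setting names come from DRF and SimpleJWT, but the values are only examples, and you'd merge these keys into the dictionaries you already have in <code>settings.py</code> rather than redefining them.</p>
<pre><code class="language-python"># settings.py (sketch): merge these keys into your existing dictionaries
# so that your current authentication settings are preserved.

REST_FRAMEWORK = {
    # Return notes in pages of 10 instead of all at once.
    "DEFAULT_PAGINATION_CLASS": "rest_framework.pagination.PageNumberPagination",
    "PAGE_SIZE": 10,
    # Enable ?search=... on views that declare search_fields.
    "DEFAULT_FILTER_BACKENDS": ["rest_framework.filters.SearchFilter"],
}

# Token blacklisting ships with SimpleJWT as an optional app. Add it to
# INSTALLED_APPS and run "python manage.py migrate" afterwards:
# INSTALLED_APPS += ["rest_framework_simplejwt.token_blacklist"]

SIMPLE_JWT = {
    # Issue a new refresh token on every refresh and blacklist the old one.
    "ROTATE_REFRESH_TOKENS": True,
    "BLACKLIST_AFTER_ROTATION": True,
}
</code></pre>
<p>For search to work, the viewset also needs to declare which fields to search, for example <code>search_fields = ['title', 'body']</code> on <code>NoteViewSet</code>.</p>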
<p>Each of these improvements builds on the patterns you've already learned and will deepen your understanding of Django, DRF, and API design.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You've built a fully functional, secure note-taking API with Django, Django REST Framework, and SimpleJWT. Along the way, you learned some fundamental concepts that apply to any API you'll build in the future.</p>
<p>You started with a custom user model — a small decision at the beginning that saves you from a painful migration later. You configured JWT authentication so your API can serve mobile clients and decoupled frontends that can't rely on session cookies.</p>
<p>You built serializers that protect sensitive data by keeping passwords write-only and ownership read-only. Most importantly, you implemented scoped views that ensure each user's data is completely isolated from everyone else's.</p>
<p>The patterns you practiced here are the same ones you'll use in production APIs handling real user data: overriding <code>get_queryset</code> to filter by the current user, overriding <code>perform_create</code> to assign ownership automatically, and using read-only fields to prevent data tampering.</p>
<p>The best way to solidify what you've learned is to keep building. Try adding search and filtering, build a React frontend that consumes this API, or start a completely new project, maybe a task manager, a journal app, or a bookmarks API, using the same JWT and scoping patterns. The core workflow stays the same. Only the models and business logic change.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Data Science Insights: Why the Mean Lies When Handling Messy Retail Data ]]>
                </title>
                <description>
                    <![CDATA[ In our daily life, we use the word "average" all the time: average salary, average marks, average age, and so on. Let's take the case of a retail shop. If we're looking at the average order value to u ]]>
                </description>
                <link>https://www.freecodecamp.org/news/data-science-insights-why-the-mean-lies-when-handling-messy-retail-data/</link>
                <guid isPermaLink="false">69fa21e5a386d7f121b5fe8c</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ statistics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rakshath Naik ]]>
                </dc:creator>
                <pubDate>Tue, 05 May 2026 16:59:17 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4441dcfc-d100-4613-9937-9c62449c6780.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In our daily life, we use the word "average" all the time: average salary, average marks, average age, and so on.</p>
<p>Let's take the case of a retail shop. If we're looking at the average order value to understand customer spending, we'd load the data, run the code, and get a result of $20 per order.</p>
<p>Done.</p>
<p>Except something looks odd.</p>
<p>When we take a closer look, we see that most customers are buying items worth $8 to $15. So where's $20 coming from?</p>
<p>In that case, the problem isn’t the data – it’s the average. This is a classic textbook trap: everything works perfectly in the textbook, but real-world data doesn’t behave nicely.</p>
<p>Some customers buy in bulk (very large orders), some return orders (negative quantities), and a few anomalies distort the entire picture.</p>
<p>In this article, we'll use the Online Retail Dataset to answer a simple but tricky question: What does “average” really mean in the real world?</p>
<h2 id="heading-table-of-contents">Table Of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-the-dataset">The Dataset</a></p>
</li>
<li><p><a href="#heading-mean-the-sensitive-giant">Mean: The Sensitive Giant</a></p>
</li>
<li><p><a href="#heading-median-the-robust-middle">Median: The Robust Middle</a></p>
</li>
<li><p><a href="#heading-beyond-averages-understanding-spread-with-quartiles">Beyond Averages: Understanding Spread with Quartiles</a></p>
</li>
<li><p><a href="#heading-applying-iqr-to-our-dataset">Applying IQR to Our Dataset</a></p>
</li>
<li><p><a href="#heading-final-comparison-and-insights">Final Comparison and Insights</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-connect-with-me">Connect with me</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along here, you'll need:</p>
<p><strong>Basic Python knowledge:</strong> Understanding of variables and functions.</p>
<p><strong>The Pandas library:</strong> Familiarity with loading data and basic DataFrame operations.</p>
<p><strong>A development environment:</strong> Access to a tool like Jupyter Notebook, VS Code, or Google Colab.</p>
<p><strong>A Dataset:</strong> For this analysis, I used the Online Retail Dataset, which is available for download <a href="https://archive.ics.uci.edu/dataset/352/online+retail">here</a>.</p>
<h2 id="heading-the-dataset"><strong>The Dataset</strong></h2>
<p>We'll work with the Online Retail Dataset, a real-world transactional dataset containing purchase records from a UK-based online retail store.</p>
<ol>
<li><p><strong>Source:</strong> UCI Machine Learning Repository</p>
</li>
<li><p><strong>Collected by:</strong> UK-based online retail company (2010–2011)</p>
</li>
<li><p><strong>Size:</strong> 541,909 transactions</p>
</li>
<li><p><strong>Features:</strong> 8 attributes (InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country)</p>
</li>
<li><p><strong>Ownership:</strong> Public dataset hosted by UCI</p>
</li>
<li><p><strong>License:</strong> Open for research and educational use</p>
</li>
</ol>
<h2 id="heading-mean-the-sensitive-giant">Mean: The Sensitive Giant</h2>
<p>In statistics and data analysis, the terms "<strong>average</strong>" and "<strong>arithmetic mean</strong>" are often used interchangeably. We aim to find the mean total price in our dataset. Mean in the context of the Online Retail Dataset is given as:</p>
<p>$$\text{Average Order Value} = \frac{\text{Sum of all TotalPrice values}}{\text{Number of transactions}}$$</p>
<p>In our dataset, the mean is calculated by summing all transaction values (including bulk purchases and returns) and dividing by the total number of transactions. This means every value, irrespective of unusually high or any negative values, directly influences the final average.</p>
<pre><code class="language-python">import pandas as pd  # reading .xlsx files also requires the openpyxl package

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
df = pd.read_excel(url, engine='openpyxl')

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate the Mean (Average Order Value)
mean_value = df['TotalPrice'].mean()
print(f"Average Order Value (Mean): {mean_value:.2f}")
</code></pre>
<p>The results are as follows:</p>
<pre><code class="language-python">Average Order Value (Mean): 20.40
</code></pre>
<p>At first glance, the result looks reasonable: every transaction contributes equally. But that’s exactly where the problem lies. A few extremely high or low transactions can shift the mean away from the range where most customers actually spend.</p>
<p>Take a look at the graph for the mean below.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/583bebff-0e5e-44b8-80cb-48e4662b9abf.png" alt="The graph shows the calculated mean for the Online Retail Dataset, where we get a mean of 20.40" style="display:block;margin:0 auto" width="876" height="547" loading="lazy">

<p>The graph shows the mean Total Price for the Online Retail Dataset. We get a mean of 20.40. (Image by Author)</p>
<p>The graph shows <strong>a right-skewed distribution</strong> where the calculated mean of 20.40 is misleading. The tallest bar clearly shows that the majority of transactions lie in the $8 to $15 range, but the <strong>red line</strong> is being dragged to the right by the <strong>long tail</strong> of high-value bulk orders from some customers.</p>
<p>In this scenario, the average price is well above what a typical customer actually spends because it's highly sensitive to outliers – and in reality, the bulk of the data lives in the lower price range.</p>
<p>In simple words, the mean is being pulled by some extreme values to the right, especially by some lying in the range of 200–300, which is noticeable in the graph.</p>
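<p>A tiny example makes this pull easy to see. The order values below are hypothetical, chosen only to mimic the pattern in the graph: seven typical orders plus one bulk order.</p>
<pre><code class="language-python">from statistics import mean

# Hypothetical order values: seven typical orders and one bulk order.
typical_orders = [8, 9, 10, 11, 12, 13, 15]
all_orders = typical_orders + [300]

mean_typical = mean(typical_orders)
mean_all = mean(all_orders)

print(f"Mean of typical orders: {mean_typical:.2f}")
print(f"Mean with one bulk order: {mean_all:.2f}")
</code></pre>
<p>The mean jumps from 11.14 to 47.25, even though seven of the eight orders are worth $15 or less. A single extreme value is enough to move it far away from what a typical customer spends.</p>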
<h2 id="heading-median-the-robust-middle">Median: The Robust Middle</h2>
<p>When the mean is distorted by extreme values, we need a metric that remains unaffected by such outliers. This is where the median comes into play.</p>
<p>Median is defined as the <strong>middle value after sorting the data.</strong></p>
<p>In our dataset, we sort all the transactions and pick the middle one.</p>
<p>The formula for calculating the median is:</p>
<p>$$\text{Median} = \begin{cases} X_{\left[ \frac{n+1}{2} \right]} &amp; \text{if } n \text{ is odd} \\ \frac{X_{\left[ \frac{n}{2} \right]} + X_{\left[ \frac{n}{2} + 1 \right]}}{2} &amp; \text{if } n \text{ is even} \end{cases}$$</p>
<p>Unlike the mean, the median doesn't depend on extreme values, and it cares only about the position of the data, not the magnitude.</p>
<pre><code class="language-python"># Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate only the Median
median_value = df['TotalPrice'].median()
print(f"Typical Order Value (Median): {median_value:.2f}")
</code></pre>
<p>The results are as follows:</p>
<pre><code class="language-python">Typical Order Value (Median): 11.10
</code></pre>
<p>Now you'll notice that the result lies in the $8 to $15 range, where most of the transactions lie.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/d89a4912-0e44-485e-8ea0-ff559cea6eba.png" alt="The figure demonstrates the graph for the median, where we get an accurate value of the transactions by the customers." style="display:block;margin:0 auto" width="876" height="547" loading="lazy">

<p>The figure demonstrates the graph for the median, where we get an accurate value of the transactions by the customers. (Image by Author)</p>
<p>In the previous graph, the mean was pulled to the right by large orders, but the median just asks what the middle customer spends. So even if someone spends $300 or some transactions are negative, the median stays stable.</p>
<p>In the above figure <strong>the median graph</strong> accurately highlights the range where most of the customers lie.</p>
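<p>To connect the median formula to code, here is a minimal sketch that implements both the odd and even cases by hand and compares the result with Python's built-in <code>statistics.median</code>. The order values are hypothetical and <code>median_by_formula</code> is an illustrative name, not part of the analysis above.</p>
<pre><code class="language-python">from statistics import median


def median_by_formula(values):
    """Median via the piecewise formula: sort, then pick the middle position(s)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value, X[(n + 1) / 2] in 1-based indexing.
        return ordered[middle]
    # Even n: the average of X[n / 2] and X[n / 2 + 1] in 1-based indexing.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical order values, then the same orders plus a return and a bulk order.
orders = [8, 9, 10, 11, 12, 13, 15]
orders_with_extremes = orders + [-50, 300]

print(median_by_formula(orders))
print(median_by_formula(orders_with_extremes))
print(median(orders_with_extremes))  # built-in, for comparison
print(median_by_formula([8, 10, 12, 15]))  # even-length case
</code></pre>
<p>Adding a return of -50 and a bulk order of 300 leaves the median at 11, because one extreme lands on each side of the middle and only the position of the values matters, not their size.</p>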
<h2 id="heading-beyond-averages-understanding-spread-with-quartiles"><strong>Beyond Averages: Understanding Spread with Quartiles</strong></h2>
<p>So far, we've studied the median, but knowing the center is not enough.</p>
<p>To truly understand customer spending, we also need to understand how the data is spread, and this is where quartiles come into play.</p>
<p>Quartiles divide the dataset into the following parts:</p>
<ol>
<li><p><strong>Q1 (25th percentile):</strong> 25% of transactions are below this.</p>
</li>
<li><p><strong>Q2 (50th percentile):</strong> Median</p>
</li>
<li><p><strong>Q3 (75th percentile):</strong> 75% of transactions are below this.</p>
</li>
</ol>
<p>The distance between Q3 and Q1 is called the Interquartile Range (IQR):</p>
<p>$$IQR = Q_3 - Q_1$$</p>
<h3 id="heading-the-iqr-detecting-outliers"><strong>The IQR: Detecting Outliers</strong></h3>
<p>The IQR measures the spread of the middle 50%.</p>
<p>If the IQR is small, then the data is concentrated. If it's large, the data is spread out. The IQR also helps us identify outliers mathematically.</p>
<p>Outlier Rule:</p>
<ol>
<li><p><strong>Lower Bound = Q1 - 1.5 * IQR</strong></p>
</li>
<li><p><strong>Upper Bound = Q3 + 1.5 * IQR</strong></p>
</li>
</ol>
<h4 id="heading-a-simple-example-to-understand-iqr">A Simple Example to Understand IQR</h4>
<p>Consider the following transaction values:</p>
<p>$$\left[ 5, 8, 10, 12, 15, 18, 20 \right]$$</p>
<h4 id="heading-step-1-find-the-median-q2">Step 1: Find the Median (Q2):</h4>
<p>The middle value is:</p>
<p>$$Q_2 = 12$$</p>
<h4 id="heading-step-2-find-q1-lower-quartile">Step 2: Find Q1 (Lower Quartile):</h4>
<p>The lower half is [5, 8, 10]. The median of the lower half is:</p>
<p>$$Q_1 = 8$$</p>
<h4 id="heading-step-3-find-q3-upper-quartile">Step 3: Find Q3 (Upper Quartile):</h4>
<p>The upper half is [15, 18, 20]. The median of the upper half is:</p>
<p>$$Q_3 = 18$$</p>
<h4 id="heading-step-4-calculate-iqr">Step 4: Calculate IQR:</h4>
<p>$$IQR = Q_3 - Q_1 = 18 - 8 = 10$$</p>
<h4 id="heading-step-5-find-outlier-bounds">Step 5: Find Outlier Bounds:</h4>
<p>$$\begin{aligned} \text{Lower Bound} &amp;= Q_1 - 1.5 \times IQR = 8 - 15 = -7 \\ \text{Upper Bound} &amp;= Q_3 + 1.5 \times IQR = 18 + 15 = 33 \end{aligned}$$</p>
<p>Any value <strong>below -7 or above 33</strong> is an outlier (but in this demo problem, no outliers exist).</p>
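<p>The five steps above can be reproduced in a few lines of Python. This sketch uses the same median-of-halves method as the worked example. <code>quartiles_by_halves</code> is a hypothetical helper name.</p>
<pre><code class="language-python">from statistics import median


def quartiles_by_halves(values):
    """Return Q1, Q2, Q3 using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]        # excludes the median when n is odd
    upper_half = ordered[(n + 1) // 2 :]
    return median(lower_half), median(ordered), median(upper_half)


values = [5, 8, 10, 12, 15, 18, 20]
q1, q2, q3 = quartiles_by_halves(values)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")
print(f"Smallest value: {min(values)}, largest value: {max(values)}")
</code></pre>
<p>The smallest and largest values (5 and 20) both sit inside the bounds of -7 and 33, which confirms that this list has no outliers. Note that pandas' <code>quantile</code> interpolates between data points by default, so it can return different quartiles than this method on a list this small.</p>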
<h2 id="heading-applying-iqr-to-our-dataset"><strong>Applying IQR to Our Dataset</strong></h2>
<p>In our retail dataset, instead of neat values, we have bulk values and even negative returns.</p>
<pre><code class="language-python"># 1. Calculate IQR and Bounds
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
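upper_bound = Q3 + 1.5 * IQR

# 2. Count the transactions that fall outside the bounds
outliers = df[(df['TotalPrice'] &lt; lower_bound) | (df['TotalPrice'] &gt; upper_bound)]

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")
print(f"Number of Outliers: {len(outliers)}")
</code></pre>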
<p>When we calculate IQR for our dataset, we get:</p>
<pre><code class="language-python">Lower Bound: -18.75
Upper Bound: 42.45
Number of Outliers: 33180
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/e528db9b-57f9-4ee4-b331-143c2b1947fb.png" alt="The figure demonstrates the outlier range for our dataset" style="display:block;margin:0 auto" width="1036" height="547" loading="lazy">

<p>The graph demonstrates outliers, which are any values falling outside the range of -18.75 to 42.45. (Image by Author)</p>
<p>As the graph shows, the values outside the range -18.75 to 42.45 are considered outliers. These values will be removed.</p>
<h3 id="heading-revisiting-the-mean-after-removing-outliers">Revisiting the Mean After Removing Outliers</h3>
<p>Using the IQR method, we've removed extreme transactions that fell outside the typical spending range.</p>
<pre><code class="language-python"># Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Original Mean
mean_value = df['TotalPrice'].mean()
print(f"Original Mean: {mean_value:.2f}")

# IQR Calculation
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")

# Remove Outliers
df_no_outliers = df[(df['TotalPrice'] &gt;= lower_bound) &amp; (df['TotalPrice'] &lt;= upper_bound)]

# New Mean after removing outliers
new_mean = df_no_outliers['TotalPrice'].mean()
print(f"Mean after removing outliers: {new_mean:.2f}")
</code></pre>
<p>After recomputing, we get:</p>
<pre><code class="language-python">Original Mean: 20.40
Lower Bound: -18.75
Upper Bound: 42.45
Mean after removing outliers: 11.63
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/17e6c2d0-883f-4e48-b45b-d1bf93164c63.png" alt="The graph demonstrates that the mean improves significantly after all outliers are removed. (Image by Author)" style="display:block;margin:0 auto" width="876" height="547" loading="lazy">

<p>Removing outliers significantly shifts the mean toward the region where most transactions occur. We now have a far more representative mean of 11.63, as opposed to the right-stretched mean of 20.40 we got with outliers.</p>
<h2 id="heading-final-comparison-and-insights"><strong>Final Comparison and Insights</strong></h2>
<p>Looking at the results from all the graphs, we get a complete understanding of the dataset. The original mean was 20.40, which is noticeably higher than what most transactions actually amounted to. The mean was pulled upward by a small number of high-value transactions, so the outliers distorted it.</p>
<p>The median, on the other hand, was 11.10, which lies within the range where most transactions are concentrated. This shows that the median is a much better representation of what a typical customer spends, as it's not affected by extreme values.</p>
<p>After removing the outliers using the IQR, the mean dropped to 11.63, bringing it very close to the median. This confirms that the earlier mean was not inherently wrong, but was simply influenced by extreme values in the data. Once those values were handled, the mean became a much more reliable measure of central tendency.</p>
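<p>The same comparison can be reproduced on a tiny scale with only the standard library. This is a sketch under stated assumptions: the <strong>totals</strong> list is made-up data (not the retail dataset), and the quartiles use linear interpolation between data points.</p>

```python
import statistics

# Hypothetical transaction totals: mostly small orders, one return, one bulk order
totals = [-5.0, 2.0, 4.0, 10.0, 12.0, 15.0, 20.0, 500.0]

# method="inclusive" interpolates linearly between the sorted data points
q1, _, q3 = statistics.quantiles(totals, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values inside the IQR bounds
kept = [t for t in totals if lower <= t <= upper]

print(f"Mean with outliers:    {statistics.mean(totals):.2f}")
print(f"Median:                {statistics.median(totals):.2f}")
print(f"Mean without outliers: {statistics.mean(kept):.2f}")
```

<p>Even in this toy list, a single bulk order is enough to drag the mean far away from the median, while the cleaned mean lands close to it.</p>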
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>The results show that the mean can be misleading when data contains outliers. In our dataset, the original mean of 20.40 overstated customer spending, while the median (11.10) gave a more realistic picture. After removing outliers, the mean shifted to 11.63, aligning closely with the median.</p>
<p>This highlights a key lesson: <strong>The mean isn't wrong, but it must be used with an understanding of the data.</strong></p>
<p>Choosing the right measure of average depends on the dataset, and in messy real-world scenarios, the median or a cleaned mean often tells the true story.</p>
<h2 id="heading-connect-with-me"><strong>Connect with me</strong></h2>
<ol>
<li><p><a href="https://medium.com/@rakshathnaik62">Medium</a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/rakshath-/">LinkedIn</a></p>
</li>
</ol>
<p>If you want to dive deeper, you can visit: <a href="https://qubrica.com/mean-median-mode-python-guide/"><strong>Mean vs Median vs Mode: Understanding Central Tendency in Data Analysis</strong></a><strong>.</strong></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Deploy a Serverless Spam Classifier Using Scikit-Learn, AWS Lambda, & API Gateway ]]>
                </title>
                <description>
                    <![CDATA[ In today's digital world, spam is no longer just an annoyance - it's a growing security threat. To combat this, developers often turn to machine learning to build intelligent filters that can distingu ]]>
                </description>
                <link>https://www.freecodecamp.org/news/deploying-serverless-spam-classifier/</link>
                <guid isPermaLink="false">69f2e347b18c978233780179</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ serverless ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Architecture ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rakshath Naik ]]>
                </dc:creator>
                <pubDate>Thu, 30 Apr 2026 05:06:15 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/08672d22-a4df-4b99-8ef7-fffd18f5dc07.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In today's digital world, spam is no longer just an annoyance - it's a growing security threat. To combat this, developers often turn to machine learning to build intelligent filters that can distinguish legitimate emails from malicious ones.</p>
<p>While building a machine learning model in a notebook is relatively straightforward, the real challenge lies in the last mile: deploying that model into a scalable, production-ready system that users can actually interact with.</p>
<p>In this project, I built an end-to-end serverless spam classifier, combining Scikit-learn for model development with AWS Lambda, Amazon S3, and Amazon API Gateway for deployment. The result is a lightweight, scalable API that can classify messages in real time.</p>
<p>The system is designed to be modular and cost-efficient, allowing the model to be retrained and updated independently without affecting the live API. From detecting "free iPhone" scams to identifying phishing attempts, this project demonstrates how to bridge the gap between machine learning experimentation and real-world deployment.</p>
<h3 id="heading-table-of-contents">Table of&nbsp;Contents</h3>
<ul>
<li><p><a href="#heading-1-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-2-building-the-brain-the-model">Building the Brain: The Model</a></p>
</li>
<li><p><a href="#heading-3-deploying-the-model-to-aws">Deploying the Model to AWS</a></p>
</li>
<li><p><a href="#heading-4-how-to-run-the-project-locally">How to Run The Project Locally</a></p>
</li>
<li><p><a href="#heading-5-our-project-architecture">Our Project Architecture</a></p>
</li>
<li><p><a href="#heading-6-conclusion-the-power-of-serverless-ai">Conclusion: The Power of Serverless AI</a></p>
</li>
<li><p><a href="#heading-7-acknowledgment-references">Acknowledgment / References</a></p>
</li>
</ul>
<h2 id="heading-1-prerequisites">1. Prerequisites</h2>
<ol>
<li><p><strong>Fundamental skills:</strong> Basic proficiency in Python and understanding of Machine Learning concepts like classification.</p>
</li>
<li><p><strong>AWS account:</strong> Access to an AWS account with permissions for Lambda, S3, and API Gateway.</p>
</li>
<li><p><strong>Environment:</strong> Python 3.11 installed, along with libraries like scikit-learn, pandas, and joblib.</p>
</li>
<li><p><strong>AWS CLI:</strong> Configured on your local machine for file uploads.</p>
</li>
<li><p><strong>Hugging Face account:</strong> You can download the pre-trained model directly from my Hugging Face profile.</p>
</li>
</ol>
<h2 id="heading-2-building-the-brain-the-model">2. Building the Brain: The&nbsp;Model</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/b43af198-1472-4914-9469-6cd5ca5384e2.png" alt="Demonstrational image to show the brain of AI." style="display:block;margin:0 auto" width="1000" height="563" loading="lazy">

<p><em>Photo by</em> <a href="https://unsplash.com/@steve_j?utm_source=medium&amp;utm_medium=referral"><em>Steve A Johnson</em></a> <em>on</em> <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral"><em>Unsplash</em></a></p>
<p>At the heart of this project lies a supervised learning approach. Instead of simply specifying which words are considered spam, we'll provide the computer with a dataset and an algorithm, enabling it to learn and identify spam patterns on its own.</p>
<h3 id="heading-1-vectorization-turning-text-into-math">1. Vectorization: Turning Text into&nbsp;Math</h3>
<p>Machine Learning models can't <strong>read</strong> text. They require numerical input. To solve this, we used the <a href="https://www.freecodecamp.org/news/how-to-extract-keywords-from-text-with-tf-idf-and-pythons-scikit-learn-b2a0f3d7e667/">TF-IDF</a> (Term Frequency-Inverse Document Frequency) Vectorizer.</p>
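<pre><code class="language-python">from sklearn.feature_extraction.text import TfidfVectorizer
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
# Learn the vocabulary from the training text and convert it to TF-IDF features
X_train_features = feature_extraction.fit_transform(X_train)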
</code></pre>
<p>Here's the mathematical formula:</p>
<p>$$w_{i,j} = tf_{i,j} \times \log \left( \frac{N}{df_i} \right)$$</p>
<p>TF-IDF term definitions:</p>
<ul>
<li><p><strong>wᵢ,ⱼ (Weight):</strong> The final importance score of a specific word in a document.</p>
</li>
<li><p><strong>tfᵢ,ⱼ (Term Frequency):</strong> How often a word appears in a single email.</p>
</li>
<li><p><strong>N (Total Documents):</strong> The total count of all emails in your dataset.</p>
</li>
<li><p><strong>dfᵢ (Document Frequency):</strong> The number of different emails that contain this specific word.</p>
</li>
<li><p><strong>log(N/dfᵢ) (IDF):</strong> A penalty that lowers the score of common words like <strong>the</strong> or <strong>is</strong> that appear everywhere.</p>
</li>
</ul>
<p>The vectorizer cleans the data by removing common English stop words, converts all text to lowercase for consistency, and assigns more weight to rare, meaningful words than to frequently used ones.</p>
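<p>To make the formula concrete, here is a minimal standard-library sketch that applies it literally to three made-up messages. Note the assumptions: the function name <strong>tfidf_weights</strong> and the sample messages are hypothetical, tokenization is a plain whitespace split, and scikit-learn's TfidfVectorizer differs in detail (by default it smooths the IDF term and normalizes each row), so the numbers won't match its output exactly.</p>

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Compute w = tf * log(N / df) for every term in every document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Hypothetical messages for illustration
docs = [
    "win a free prize now",
    "see you at the meeting now",
    "free prize inside now",
]
weights = tfidf_weights(docs)
print(weights[0])
```

<p>A word that appears in every message (here, "now") gets a weight of zero, while a word unique to one message (here, "win") gets the highest weight.</p>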
<h3 id="heading-2-training-the-logistic-regression-engine">2. Training: The Logistic Regression Engine</h3>
<p>We'll use <strong>Logistic Regression</strong> here, a classification algorithm that predicts the probability of an outcome.</p>
<p>In this stage, we feed our vectorized training data into the Logistic Regression algorithm. The goal is to establish a mathematical relationship between specific word weights and the <strong>Spam</strong> or <strong>Ham</strong> label.</p>
<p>During training, the model iteratively adjusts its internal parameters to minimize error, eventually learning that words like winner or free correlate highly with spam, while conversational language correlates with legitimate messages.</p>
<pre><code class="language-python">from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_features, Y_train)
</code></pre>
<p>In our case, it calculates the probability that an email is spam or ham.</p>
<p>The algorithm uses the Sigmoid function to map any real-valued number into a value between 0 and 1.</p>
<p>$$P(y=1|x) = \frac{1}{1 + e^{-(z)}}$$</p>
<p>where z = β₀ + β₁x₁ +&nbsp;… + βₙxₙ.</p>
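<p>The two formulas above translate almost directly into code. This is only an illustrative sketch: the function names and the coefficient values are hypothetical, not the weights learned by the model in this article.</p>

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(intercept, coefs, features):
    """Compute P(y=1|x) from z = b0 + b1*x1 + ... + bn*xn."""
    z = intercept + sum(b * x for b, x in zip(coefs, features))
    return sigmoid(z)


# Hypothetical weights and TF-IDF features for a two-word vocabulary
probability = predict_proba(intercept=0.5, coefs=[2.0, -1.0], features=[0.3, 0.8])
print(f"P(y=1|x) = {probability:.3f}")
```

<p>When z is 0, the function returns exactly 0.5, which is the decision boundary between the two classes.</p>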
<h3 id="heading-3-evaluation-testing-the-intelligence">3. Evaluation: Testing the Intelligence</h3>
<p>After training, we need to verify if the brain actually works on data it hasn't seen before.</p>
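<pre><code class="language-python">from sklearn.metrics import accuracy_score
X_test_features = feature_extraction.transform(X_test)  # reuse the fitted vectorizer
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)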
</code></pre>
<p>By comparing the model’s predictions against the actual labels in our test set, we calculate an Accuracy Score. This gives us the confidence that the model is ready for the real world (achieving ~94% accuracy in our tests).</p>
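<p>Under the hood, the accuracy score is simply the fraction of predictions that match the true labels. Here is a minimal sketch of that calculation. The <strong>accuracy</strong> helper and the label lists are hypothetical examples, not results from the model above.</p>

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = ham, 0 = spam
y_true = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
y_pred = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

score = accuracy(y_true, y_pred)
print(f"Accuracy: {score:.2f}")
```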
<h3 id="heading-4-exporting-the-logic-serialization">4. Exporting the Logic (Serialization)</h3>
<p>To move this brain from our local Python environment to the AWS Cloud, we'll use Joblib to save our work into binary files (.pkl).</p>
<pre><code class="language-python">joblib.dump(model, 'spam_model.pkl')
joblib.dump(feature_extraction, 'vectorizer.pkl')
</code></pre>
<p>We use the Pickle format because it allows us to freeze complex Python objects (mathematical weights and word mappings) into a portable binary format that can be instantly re-animated in the cloud.</p>
<p>We need the Vectorizer to translate new user text into the exact numerical coordinates the Model was trained to understand. Using one without the other is like having a key but no lock.</p>
<p>The trained Logistic Regression model and TF-IDF vectorizer are openly available for the community on Hugging Face here: <a href="https://huggingface.co/rakshath1/mail-spam-detector">Get the model on HuggingFace</a>.</p>
<h2 id="heading-3-deploying-the-model-to-aws">3. Deploying the Model to&nbsp;AWS</h2>
<p>Training a model is science, while deploying it is engineering. To make this classifier accessible to the world, we'll use a serverless stack that scales automatically and requires almost no maintenance.</p>
<h3 id="heading-1-model-storage-amazon-s3">1. Model Storage: Amazon&nbsp;S3</h3>
<p>First, we'll upload our&nbsp;.pkl files to an S3 bucket. By decoupling the model from the code, we can update the AI's intelligence (simply by overwriting the file in S3) without redeploying the backend code. This makes the system highly maintainable.</p>
<h3 id="heading-2-the-production-backend-aws-lambda">2. The Production Backend: AWS&nbsp;Lambda</h3>
<p>To make the AI accessible, we'll move from a local script to a Serverless Cloud Architecture. This ensures the model is always available without the cost of a 24/7 server.</p>
<p>The deployment environment is AWS Lambda (Python 3.11). Since Lambda is a lightweight environment, it doesn't include Scikit-Learn or Joblib. To provide these, we'll package them into a ZIP file, store it in our S3 bucket, and attach it to the function as a Lambda layer.</p>
<p><strong>Commands in AWS CLI:</strong></p>
<pre><code class="language-shell">
# 1. Create a workspace
mkdir ml_layer &amp;&amp; cd ml_layer

# 2. Install scikit-learn and its dependencies into a folder
pip install \
    --platform manylinux2014_x86_64 \
    --target=python/lib/python3.11/site-packages \
    --implementation cp \
    --python-version 3.11 \
    --only-binary=:all: \
    scikit-learn joblib

# 3. Zip the folder
zip -r sklearn_lib.zip python

# 4. Upload to S3 (Using AWS CLI)
aws s3 cp sklearn_lib.zip s3://YOUR-BUCKET-NAME/
</code></pre>
<p>We store the Scikit-Learn library as a ZIP in S3 to get around the size limit on direct uploads to AWS Lambda. The dependencies are then attached to the function as a layer, which keeps them out of the core function code.</p>
<p><strong>The Lambda Function:</strong></p>
<pre><code class="language-python">
import json
import boto3
import os
import sys
from io import BytesIO

# Ensure the custom Lambda layer (containing sklearn/joblib) is on the import path
sys.path.append('/opt/python')

try:
    import joblib
except ImportError:
    # Fallback for specific Scikit-Learn distributions
    from sklearn.utils import _joblib as joblib

# Initialize S3 client
s3 = boto3.client('s3')

# Use placeholders for the article so readers can insert their own values
BUCKET_NAME = 'YOUR_S3_BUCKET_NAME' 
MODEL_KEY = 'spam_model.pkl'
VECTORIZER_KEY = 'vectorizer.pkl'

# Global variables for 'Warm Start' caching (improves performance by keeping model in RAM)
model = None
vectorizer = None

def load_model():
    """Downloads model files from S3 only if they aren't already in RAM"""
    global model, vectorizer
    if model is None or vectorizer is None:
        try:
            # 1. Load the Logistic Regression Model from S3
            m_obj = s3.get_object(Bucket=BUCKET_NAME, Key=MODEL_KEY)
            model = joblib.load(BytesIO(m_obj['Body'].read()))
            
            # 2. Load the TF-IDF Vectorizer directly from S3
            v_obj = s3.get_object(Bucket=BUCKET_NAME, Key=VECTORIZER_KEY)
            vectorizer = joblib.load(BytesIO(v_obj['Body'].read()))
        except Exception as e:
            raise Exception(f"Failed to load .pkl files from S3: {str(e)}")

def lambda_handler(event, context):
    try:
        # Ensure model and vectorizer are ready before processing
        load_model()
        
        # Handles both direct Lambda tests and API Gateway POST requests
        body = event.get('body', event)
        if isinstance(body, str):
            body = json.loads(body)
            
        text = body.get('text', '')
            
        if not text:
            return {
                'statusCode': 400,
                'body': json.dumps({'error': 'No text provided.'})
            }

        # 1. Transform input text to numeric features using the trained Vectorizer
        data_vec = vectorizer.transform([text])
        
        # 2. Predict using the Logistic Regression Model 
        prediction = int(model.predict(data_vec)[0])
        
        # 3. Map numeric result to human-readable label
        result_label = "HAM" if prediction == 1 else "SPAM"
        
        # RESPONSE WITH CORS
        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*' # needed for cross-domain web integration
            },
            'body': json.dumps({
                'status': 'success',
                'classification': result_label,
                'input_text': text
            })
        }
        
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({'error_message': f"Inference Error: {str(e)}"})
        }
</code></pre>
<p>Key features of the Lambda function:</p>
<ol>
<li><p><strong>Warm start caching:</strong> By defining the model and vectorizer variables outside the lambda_handler, we keep them in the container's memory between invocations. Only a cold start has to download the files from S3, so subsequent (warm) requests respond much faster.</p>
</li>
<li><p><strong>Dynamic dependency loading:</strong> The <strong>sys.path.append('/opt/python')</strong> line allows us to import heavy libraries from S3/Layers without exceeding the upload limit.</p>
</li>
<li><p><strong>Bimodal input handling:</strong> The function is designed to handle both direct JSON testing from the AWS console and stringified payloads sent via API Gateway.</p>
</li>
</ol>
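<p>The warm-start caching and dual input handling described in the list above can be exercised locally without AWS. The sketch below is a simplified stand-in, not the deployed function: <strong>fetch_artifact</strong> is a hypothetical placeholder for the S3 download, and treating a missing or empty body as a direct invocation is a small variation on the handler shown earlier.</p>

```python
import json

_cache = {}                   # module level, so it survives warm invocations
stats = {"downloads": 0}      # counts how often the (fake) download runs


def fetch_artifact(key):
    """Hypothetical stand-in for downloading and deserializing a .pkl from S3."""
    stats["downloads"] += 1
    return f"loaded:{key}"


def get_artifact(key):
    """Return the cached artifact, fetching it only on first use."""
    if key not in _cache:
        _cache[key] = fetch_artifact(key)
    return _cache[key]


def extract_text(event):
    """Read 'text' from a direct test event or an API Gateway proxy event."""
    body = event.get("body")
    if body is None:
        body = event              # direct invocation: the event is the payload
    if isinstance(body, str):
        body = json.loads(body)   # proxy integration sends the body as a string
    return body.get("text", "")


def handler(event, context=None):
    model = get_artifact("spam_model.pkl")
    text = extract_text(event)
    if not text:
        return {"statusCode": 400, "body": json.dumps({"error": "No text provided."})}
    return {"statusCode": 200, "body": json.dumps({"input_text": text, "model": model})}


# The same payload, delivered in both shapes
direct = handler({"text": "WIN free iPhone now"})
proxied = handler({"body": json.dumps({"text": "WIN free iPhone now"})})
print(direct["statusCode"], proxied["statusCode"], stats["downloads"])
```

<p>Both calls reach the same code path, and the artifact is fetched only once even though the handler runs twice.</p>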
<h3 id="heading-3-the-api-gateway-the-bridge-to-the-web">3. The API Gateway - The Bridge to the&nbsp;Web</h3>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/8aa3e8d7-569a-4dd5-a6ac-184922474952.png" alt="Demonstrational image to show the API Gateway." style="display:block;margin:0 auto" width="1000" height="563" loading="lazy">

<p>Photo by <a href="https://unsplash.com/@growtika?utm_source=medium&amp;utm_medium=referral">Growtika</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></p>
<h4 id="heading-creating-the-rest-api">Creating the REST API</h4>
<p>Next, we'll create a REST API with a single POST method. Why POST, you might be wondering? Because POST lets us send the user’s text message to our model as a JSON payload in the request body.</p>
<ol>
<li><p>First, navigate to the Amazon API Gateway console and select Create API -&gt; REST API.</p>
</li>
<li><p>Give your API a name, such as EmailSpamPredictor-API, and set the Endpoint Type to Regional.</p>
</li>
<li><p>Then, in the left sidebar, click Resources and enter a resource name (for example, <strong>/predict</strong>, which is what I used).</p>
</li>
<li><p>Next, click Create method, select POST, and then select Lambda Function as the integration type.</p>
</li>
<li><p>Ensure Lambda Proxy integration is enabled (this allows the full request to pass through to your code).</p>
</li>
</ol>
<p><strong>The CORS Configuration (The Troubleshooting Hub)</strong><br>This is where many developers encounter the dreaded <strong>Connection Error</strong>. Our API is hosted on AWS, so if your front-end is served from a different origin, the browser’s Same-Origin Policy will block the request by default.</p>
<p>To fix this, we'll enable <strong>CORS:</strong></p>
<ol>
<li><p><strong>Access-Control-Allow-Origin:</strong> Set to * (or specifically to your domain) to tell the browser that the API is allowed to talk to your front-end.</p>
</li>
<li><p><strong>The OPTIONS method:</strong> When you enable CORS on the resource, API Gateway creates an OPTIONS method automatically. This handles the preflight request, where the browser asks, “Are you allowed to receive data from me?” before sending the actual text.</p>
</li>
<li><p><strong>Access-Control-Allow-Headers:</strong> In the screenshot, you'll notice headers like Content-Type and Authorization are allowed. This ensures that when our JavaScript fetch() call sets the content type to application/json, the API Gateway doesn't reject it.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/cf5c87c6-f374-4dda-8001-77a0aab52672.png" alt="Image illustrates the CORS configuration for our project. " style="display:block;margin:0 auto" width="1487" height="617" loading="lazy">

<p>Image illustrates the CORS configuration for our project. (Image by author)</p>
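<p>One thing worth checking on the Lambda side: with Lambda Proxy integration, the headers in the function's return value are what the browser receives. In the handler shown earlier, only the 200 response sets Access-Control-Allow-Origin, so a 400 or 500 may show up in the browser as a CORS failure instead of the real error. A small helper can attach the headers to every response. This is a sketch, and <strong>build_response</strong> is a hypothetical name.</p>

```python
import json

# Assumption: "*" allows any origin; use your site's origin to be stricter
CORS_HEADERS = {
    "Content-Type": "application/json",
    "Access-Control-Allow-Origin": "*",
}


def build_response(status_code, payload):
    """Build a Lambda proxy response that always carries the CORS headers."""
    return {
        "statusCode": status_code,
        "headers": dict(CORS_HEADERS),  # copy so callers can't mutate the shared dict
        "body": json.dumps(payload),
    }


ok = build_response(200, {"status": "success", "classification": "SPAM"})
bad_request = build_response(400, {"error": "No text provided."})
print(bad_request["headers"]["Access-Control-Allow-Origin"])
```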
<h4 id="heading-deployment-stages">Deployment Stages</h4>
<p>Once the API is deployed to a production stage, AWS generates a permanent Invoke URL. This acts as the public gateway to our model and typically follows this structure: https://[api-id].execute-api.[region].amazonaws.com/prod/classify, where the final path segment (here, classify) is the resource name you chose.</p>
<h4 id="heading-connecting-the-frontend-the-javascript-layer">Connecting the Frontend (The JavaScript Layer)</h4>
<p>With the API live, we can now write a simple JavaScript function to talk to our model. This script runs whenever a user clicks the <strong>Analyze</strong> button on your site.</p>
<pre><code class="language-javascript">
async function checkSpam() {
    const message = document.getElementById("userInput").value;
    const apiUrl = "YOUR_API_GATEWAY_INVOKE_URL";

    try {
        const response = await fetch(apiUrl, {
            method: "POST",
            headers: {
                "Content-Type": "application/json"
            },
            body: JSON.stringify({ "text": message })
        });

        const data = await response.json();
        
        // Display result on the webpage
        const resultElement = document.getElementById("result");
        resultElement.innerText = `Prediction: ${data.classification}`;
        resultElement.style.color = data.classification === "SPAM" ? "red" : "green";

    } catch (error) {
        console.error("Error:", error);
        alert("Could not connect to the Spam Detector API.");
    }
}
</code></pre>
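<p>Before wiring up the web page, it can help to call the endpoint from a script. Here is a minimal Python sketch that builds the same POST request as the fetch() call above, using only the standard library. The <strong>API_URL</strong> value is a hypothetical placeholder and the function names are my own, so replace the URL with your Invoke URL before calling <strong>classify</strong>.</p>

```python
import json
import urllib.request

# Hypothetical placeholder: replace with your own Invoke URL
API_URL = "https://example.invalid/prod/classify"


def build_request(api_url, message):
    """Prepare the same POST request that the browser's fetch() call sends."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(api_url, message, timeout=10):
    """Send the message to the API and return the parsed JSON response."""
    request = build_request(api_url, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request(API_URL, "WIN free iPhone now")
print(request.get_method(), request.full_url)
# classify(API_URL, "WIN free iPhone now") would perform the real network call
```

<p>Because this runs outside a browser, CORS doesn't apply, which makes it a handy way to tell an API problem apart from a CORS problem.</p>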
<h2 id="heading-4-how-to-run-the-project-locally">4. How to Run The Project&nbsp;Locally</h2>
<p>You can store the front-end as an HTML file. Once it's ready, don't just double-click the&nbsp;.html file: opening it directly from the file system (a <strong>file://</strong> URL) can trigger browser security restrictions. Instead, serve it from a simple local server.</p>
<p><strong>Step 1:</strong> Open the terminal or Command Prompt.</p>
<p><strong>Step 2:</strong> Navigate to your project folder.</p>
<pre><code class="language-shell">cd [PATH_TO_YOUR_FOLDER]
</code></pre>
<p><strong>Step 3:</strong> Start a local Python web server.</p>
<pre><code class="language-shell">python -m http.server 8000
</code></pre>
<p><strong>Step 4:</strong> Access the application.</p>
<p>Open your browser and navigate to:<br><a href="http://localhost:8000/your-file-name.html">http://localhost:8000/your-file-name.html</a></p>
<p><strong>Watch the Demo:</strong></p>
<div class="embed-wrapper"><iframe width="560" height="315" src="https://www.youtube.com/embed/q2X_azntmzY" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>

<h2 id="heading-5-our-project-architecture">5. Our Project Architecture</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/c17673d4-5dd0-43dc-8e8d-3015bcd31864.png" alt="Image showing the Architecture Diagram of our Project." style="display:block;margin:0 auto" width="1000" height="563" loading="lazy">

<p>The image illustrates the architecture of our project (Building a Serverless Spam Classifier). It shows the process that takes place from the client input to the final model output. (Image by Author)</p>
<ol>
<li><p><strong>Client Front-End Interaction:</strong> The process starts on the far left. A user interacts with the web interface (for example, a website or a desktop app). They input text like <strong>WIN free iPhone now</strong> and trigger a request.</p>
</li>
<li><p><strong>The Entry Point: API Gateway:</strong> The request hits the Amazon API Gateway, which acts as the <strong>security guard</strong> and translator.&nbsp;<br><strong>(a)</strong> CORS OPTIONS handles the pre-flight handshake to ensure the browser has permission to talk to the AWS cloud.&nbsp;<br><strong>(b)</strong> Classification Request (POST) routes the actual message data to your backend logic.</p>
</li>
<li><p><strong>The Engine: AWS Lambda (Python 3.11):</strong>&nbsp;The central “<strong>lightbulb</strong>” represents your Lambda function. This is where the code you wrote lives. It doesn’t run 24/7 – it only wakes up when a request arrives.</p>
</li>
<li><p><strong>Storage &amp; Retrieval: S3 Bucket:</strong> Since Lambda is lightweight, it doesn’t store your heavy Machine Learning files internally.<br><strong>Dependency and Model Download:</strong> The function reaches out to the S3 Bucket to pull in the sklearn_lib.zip (the engine) and the&nbsp;.pkl files (the intelligence).&nbsp;<br><strong>Required Dependency and Model:</strong> These assets are loaded into the Lambda’s temporary memory to prepare for the prediction.</p>
</li>
<li><p><strong>The Inference Pipeline:</strong>&nbsp;Inside the Lambda, a three-step mathematical cycle occurs:<br><strong>(a) Text Vectorizer:</strong> Translates the words into numbers.<br><strong>(b) Logistic Regression:</strong> Calculates the probability of spam based on those numbers.<br><strong>(c) Label:</strong> Assigns a final result (Spam or Ham).</p>
</li>
<li><p><strong>The Result Delivery:</strong> The result is sent back through the API Gateway, including the necessary CORS Headers to ensure the browser accepts it. The front-end then updates to show the “<strong>Result: SPAM</strong>” with a visual indicator.</p>
</li>
</ol>
<h2 id="heading-6-conclusion-the-power-of-serverless-ai">6. Conclusion: The Power of Serverless AI</h2>
<p>By merging the mathematical simplicity of Logistic Regression with the industrial strength of AWS Serverless Architecture, we have transformed a static Python script into a globally accessible, scalable API.</p>
<p>This project demonstrates that you don’t need a massive budget or a 24/7 dedicated server to deploy high-quality Machine Learning.</p>
<p>Using the S3-to-Lambda workaround allowed us to bypass common storage hurdles, ensuring that our Brain (the model) and its Muscle (Scikit-Learn) could function seamlessly within the cloud’s ephemeral environment. It bridges the gap between experimentation and real-world applications, making AI systems practical, efficient, and accessible.</p>
<h2 id="heading-7-acknowledgment-references">7. Acknowledgment / References</h2>
<ul>
<li><p>Pre-trained spam classification model: <a href="https://huggingface.co/rakshath1/mail-spam-detector"><strong>rakshath1/mail-spam-detector on Hugging Face</strong></a></p>
</li>
<li><p>Scikit-learn <a href="https://scikit-learn.org/stable/api/index.html">Documentation</a></p>
</li>
<li><p>AWS Lambda <a href="https://docs.aws.amazon.com/lambda/latest/api/welcome.html">Documentation</a></p>
</li>
<li><p>Amazon S3 <a href="https://aws.amazon.com/documentation-overview/s3/">Documentation</a></p>
</li>
<li><p>Amazon API Gateway <a href="https://docs.aws.amazon.com/apigateway/">Documentation</a></p>
</li>
</ul>
<h3 id="heading-connect-with-me">Connect With Me</h3>
<ul>
<li><p><a href="https://medium.com/@rakshathnaik62">Medium</a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/rakshath-/">LinkedIn</a></p>
</li>
</ul>
<p><strong>You may also like</strong></p>
<ol>
<li><p><a href="https://qubrica.com/python-polars-v-s-pandas-libraries-comparison/">How Polars overtook Pandas</a></p>
</li>
<li><p><a href="https://qubrica.com/devops-is-dead-platform-engineering-2026/"><strong>DevOps is Dead. Long Live Platform Engineering</strong></a></p>
</li>
</ol>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an Open Source Data Lake for Batch Ingestion ]]>
                </title>
                <description>
                    <![CDATA[ Creating a data platform has been made easier by cloud data analytics platforms like Databricks, Snowflake, and BigQuery. They offer excellent ramp-up and scaling options for small to mid-size teams.  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-an-open-source-data-lake-for-batch-ingestion/</link>
                <guid isPermaLink="false">69e0f1a7b67a275a9d3c9122</guid>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ apache-airflow ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ingestion ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Puneet Singh ]]>
                </dc:creator>
                <pubDate>Thu, 16 Apr 2026 14:26:47 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/ef685075-beac-4bf4-b435-6e942e5e1ac1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Creating a data platform has been made easier by cloud data analytics platforms like Databricks, Snowflake, and BigQuery. They offer excellent ramp-up and scaling options for small to mid-size teams.</p>
<p>But the trade-off isn't merely renting outside infrastructure. It also includes lock-in to proprietary abstractions, and an operational and security surface area built on top of vendor capabilities.</p>
<p>In this article, you'll set up a batch ingestion layer on an open-source data lake stack where you own every component.</p>
<p>The focus is deliberately narrow. We'll get the ingestion layer up and running end-to-end. Then we'll build on foundations that allow future extension: analytics, governance, and stream processing without locking you into any single tool for those layers. We'll also review documented integration failures along the way: misconfigured catalogs, partition values written as NULL, and Python version mismatches.</p>
<p>By the end, you'll have:</p>
<ul>
<li><p>A working single-node data lake running on Docker (compose), built on RustFS (object storage), Apache Iceberg (table format), and Project Nessie (catalog).</p>
</li>
<li><p>A batch pipeline orchestrated with Apache Airflow, executing PySpark jobs that write versioned, partitioned Iceberg tables.</p>
</li>
<li><p>A real-world ingestion pattern: an external web scraper decoupled from Airflow via Redis, writing raw data to object storage with a lightweight signal table.</p>
</li>
<li><p>A view of what this stack is and isn't, and what you'd add to take it toward production.</p>
</li>
</ul>
<p>A word on scope: this covers the E in <a href="https://www.getdbt.com/blog/extract-load-transform">ELT</a>: getting data in. Transformation (dbt, Spark SQL) and analytics (Trino, Superset) are a natural next layer, but are outside the scope of this article. What you build here is the foundation they'd sit on.</p>
<h3 id="heading-what-well-cover">What We'll Cover:</h3>
<ul>
<li><p><a href="#heading-the-ingestion-problem">The Ingestion Problem</a></p>
</li>
<li><p><a href="#heading-stack">Stack</a></p>
</li>
<li><p><a href="#heading-system-overview">System Overview</a></p>
</li>
<li><p><a href="#heading-quick-start">Quick Start</a></p>
</li>
<li><p><a href="#heading-running-the-pipelines">Running the Pipelines</a></p>
</li>
<li><p><a href="#heading-setup">Setup</a></p>
<ul>
<li><p><a href="#heading-rustfs">RustFS</a></p>
</li>
<li><p><a href="#heading-nessie">Nessie</a></p>
</li>
<li><p><a href="#heading-spark">Spark</a></p>
</li>
<li><p><a href="#heading-apache-airflow">Apache Airflow</a></p>
</li>
<li><p><a href="#heading-scrapredis">Scrapredis</a></p>
</li>
<li><p><a href="#heading-scrapworker">Scrapworker</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-path-forward">Path Forward</a></p>
<ul>
<li><p><a href="#heading-extending-capabilities">Extending Capabilities</a></p>
</li>
<li><p><a href="#heading-adding-layers">Adding Layers</a></p>
</li>
</ul>
</li>
</ul>
<h2 id="heading-the-ingestion-problem">The Ingestion Problem</h2>
<p>The structure of a stack is easier to understand with a use case. The high-level goal here is to ingest financial data from external market APIs for trend analysis. You'll focus specifically on setting up ingestion of that data into the warehouse for further analytics.</p>
<p>The data is ingested via a web crawler with a specific rate limit per endpoint. In batch processing, time-based partitioning lets downstream pipelines process one time slice at a time. It also makes data retention cleaner, because old partitions can be dropped as a unit.</p>
<p>The crawler runs as an external process, decoupled from Airflow via a Redis job queue. This keeps rate limiting and crawl lifecycle outside the orchestration layer, with each component failing and recovering independently.</p>
<p>Because crawl jobs aren't idempotent, the priority during ingestion is landing data reliably.</p>
<h2 id="heading-stack">Stack</h2>
<ul>
<li><p><a href="https://rustfs.com/"><strong>RustFS</strong></a><strong>:</strong> An S3-compatible object store written in Rust</p>
</li>
<li><p><a href="https://projectnessie.org/"><strong>Project Nessie</strong></a><strong>:</strong> Transactional catalog for Apache Iceberg tables</p>
</li>
<li><p><a href="https://spark.apache.org/"><strong>Apache Spark</strong></a><strong>:</strong> Distributed compute engine</p>
</li>
<li><p><a href="https://airflow.apache.org/"><strong>Apache Airflow</strong></a><strong>:</strong> Job scheduling and orchestration</p>
</li>
<li><p><a href="https://jupyter.org/"><strong>Jupyter Notebook</strong></a> <em>(optional)</em>: Ad-hoc Spark queries against Iceberg tables, not covered in this article</p>
</li>
<li><p><strong>Scrapredis:</strong> Job queue for the web crawler</p>
</li>
<li><p><strong>Scrapworker:</strong> Web crawler and ingestion worker</p>
</li>
</ul>
<p>This setup was tested on a 4-core x86/AMD CPU, 16GB RAM, 60GB disk GCP VM running Debian GNU/Linux 11 (Bullseye). Docker with Compose v2 is required. The setup should work on any comparable Linux environment with similar or better specs.</p>
<h2 id="heading-system-overview">System Overview</h2>
<img src="https://cdn.hashnode.com/uploads/covers/69607e708806706b5c49c7af/429a1e8a-bc39-44dc-8e0b-2cd9152370f5.png" alt="Data Platform Architecture" style="display:block;margin:0 auto" width="3202" height="2385" loading="lazy">

<p>The crawler runs as an external process, decoupled from Airflow via a Redis job queue. Airflow pushes a job specification to the queue containing the endpoint, query params, and target path. The crawler picks it up, executes the crawl, and writes raw results directly to object storage.</p>
<p>This separation keeps rate limiting and crawl lifecycle concerns outside the orchestration layer, and isolates failure modes.</p>
<p>A crawl failure is harder to recover from because crawl jobs aren't idempotent. Pipeline failures after the crawl stage can be retried independently without re-triggering a crawl.</p>
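<p>As a rough illustration of the job specification described above, the sketch below builds a payload with an endpoint, query params, and a target path. The field names <code>run_id</code>, <code>endpoint</code>, and <code>params</code>, the path layout under <code>s3://warehouse/raw/</code>, and the helper <code>build_job_payload</code> are assumptions made for this example, not the repository's actual format.</p>
<pre><code class="language-python">import json


def build_job_payload(run_id, endpoint, params, ds, hr, minute):
    """Build a JSON job specification for the crawler (hypothetical layout)."""
    # Deriving the target path from the scheduled slot keeps a rerun
    # of the same slot pointed at the same location.
    s3_path = "s3://warehouse/raw/ds={}/hr={}/min={}/{}.json".format(
        ds, hr, minute, run_id
    )
    return json.dumps({
        "run_id": run_id,
        "endpoint": endpoint,
        "params": params,
        "s3_path": s3_path,
    })


payload = build_job_payload(
    "run-001",
    "https://api.binance.com/api/v3/trades",
    {"symbol": "BTCUSDT", "limit": 10},
    "2026-03-28", "23", "15",
)
print(payload)
</code></pre>
<p>Carrying the target path inside the payload means the crawler doesn't need to know anything about partition layout. It only writes where it's told.</p>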
<h2 id="heading-quick-start">Quick Start</h2>
<p>First, initialize the project:</p>
<pre><code class="language-bash"># Clone the repository
git clone https://github.com/ps-mir/data-platform
cd data-platform

# Create the shared Docker network
docker network create data-platform

# Create host directories, set permissions, and download Spark JARs
chmod +x init.sh &amp;&amp; ./init.sh
</code></pre>
<p>Start services in this order, running each command from the repository root (shut down in reverse order):</p>
<ol>
<li><strong>RustFS</strong></li>
</ol>
<pre><code class="language-bash">cd rustfs &amp;&amp; docker compose up -d
</code></pre>
<ol start="2">
<li><strong>Nessie</strong></li>
</ol>
<pre><code class="language-bash">cd nessie &amp;&amp; docker compose up -d
</code></pre>
<ol start="3">
<li><strong>Spark</strong> (requires a build on first run)</li>
</ol>
<pre><code class="language-bash">cd spark &amp;&amp; docker compose build &amp;&amp; docker compose up -d
</code></pre>
<ol start="4">
<li><strong>Scrapredis</strong></li>
</ol>
<pre><code class="language-bash">cd scrapredis &amp;&amp; docker compose up -d
</code></pre>
<ol start="5">
<li><strong>Airflow</strong> (requires a build on first run)</li>
</ol>
<pre><code class="language-bash">cd airflow-docker &amp;&amp; docker compose build &amp;&amp; docker compose up -d
</code></pre>
<p>Create the Nessie namespaces once after Nessie is up:</p>
<pre><code class="language-bash">curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["default"]}'

curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["scraper"]}'
</code></pre>
<p>Scrapworker runs on the host directly (it's not dockerized). It requires Python &gt;=3.14:</p>
<pre><code class="language-bash">cd scrapworker
pip install -e .
CONFIG_PATH=./config/config.local.yaml RUSTFS_ACCESS_KEY=rustfsadmin RUSTFS_SECRET_KEY=rustfsadmin python -m scrapworker
</code></pre>
<p>Scrapworker must be running before activating <code>scraper_pipeline_v1</code> in Airflow. Without it, the pipeline will push jobs to the queue with no worker to pick them up and hang indefinitely in <code>wait_for_completion</code>.</p>
<p>Trino is also present in the setup, but its integration with Nessie hasn't been tested yet.</p>
<h2 id="heading-running-the-pipelines">Running the Pipelines</h2>
<p>With the stack running, the next step is to activate the pipelines in Airflow. The four pipelines build on each other in complexity. Working through them in order is the fastest way to confirm that each layer of the stack is wired correctly before moving to the next.</p>
<p>All four pipelines are loaded but paused by default. Unpause each one in the Airflow UI before triggering.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69607e708806706b5c49c7af/38f95d52-c092-4a00-b660-1233077b781b.png" alt="All Airflow Pipelines" style="display:block;margin:0 auto" width="2678" height="1234" loading="lazy">

<p>Let's go over each pipeline:</p>
<h3 id="heading-sparkstaticdatav1skeleton-hello-dag">spark_static_data_v1_skeleton: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/step1_hello_dag.py">Hello DAG</a></h3>
<p>This is a minimal DAG with no Spark, just a Python task that prints a message. If it goes green, Airflow's scheduler and worker are healthy. The task log should include a line like this: <code>[2026-04-09 22:00:01] INFO - Task operator:&lt;Task(_PythonDecoratedOperator): say_hello&gt;</code></p>
<h3 id="heading-sparkstaticdatav2submit-spark-submit">spark_static_data_v2_submit: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/step2_spark_submit.py">Spark Submit</a></h3>
<p>This submits a PySpark job via <code>SparkSubmitOperator</code> that writes a static dataset to an Iceberg table. There's no partitioning, so every run overwrites the previous content.</p>
<p>In the Nessie catalog, the table appears as:</p>
<pre><code class="language-plaintext">Type: ICEBERG_TABLE
Metadata Location:s3://warehouse/default/static_data_e7e43123-95a7-44d2-b6d5-67c9c7aa4321/metadata/00000-08a5a2db-6f12-4f21-b2a9-de3d9123fbd3.metadata.json
</code></pre>
<h3 id="heading-sparkpartitioneddatav1-spark-partitioned">spark_partitioned_data_v1: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/step3_spark_partitioned.py">Spark Partitioned</a></h3>
<p>This extends step2 with time-based partitioning. Partition values are derived from the scheduled slot time, so every run writes to its own <code>(ds, hr, min)</code> partition without touching previous ones.</p>
<p>Example file path in RustFS: <code>warehouse/default/static_data_partitioned_b172c66f-722b-44f3-bbee-069355753ff6/data/ds=2026-03-28/hr=23/min=15/00000-4-7a196a47-2ac0-4023-af68-ca10487fccb2-0-00001.parquet</code></p>
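<p>To make the partition layout concrete, here is a minimal sketch of how a scheduled slot time could map to the <code>(ds, hr, min)</code> values seen in the path above. The helper name <code>partition_values</code>, the use of UTC, and the zero-padded string formatting are assumptions for illustration. The actual script may derive them differently.</p>
<pre><code class="language-python">from datetime import datetime, timezone


def partition_values(slot_start):
    """Map a timezone-aware scheduled slot time to (ds, hr, min) strings."""
    slot_utc = slot_start.astimezone(timezone.utc)
    return {
        "ds": slot_utc.strftime("%Y-%m-%d"),
        "hr": slot_utc.strftime("%H"),
        "min": slot_utc.strftime("%M"),
    }


slot = datetime(2026, 3, 28, 23, 15, tzinfo=timezone.utc)
print(partition_values(slot))
# {'ds': '2026-03-28', 'hr': '23', 'min': '15'}
</code></pre>
<p>Because the values depend only on the slot time and not on the wall-clock time of execution, a rerun of the same slot lands in the same partition.</p>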
<h3 id="heading-scraperpipelinev1-scraper-pipeline">scraper_pipeline_v1: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/scraper_pipeline.py">Scraper Pipeline</a></h3>
<p>This is the full ingestion flow. Airflow pushes a job to Scrapredis, Scrapworker calls the Binance API and writes raw results to RustFS, then Airflow publishes a signal row to the Nessie catalog.</p>
<p>Every run fetches: <code>https://api.binance.com/api/v3/trades?symbol=BTCUSDT&amp;limit=10</code></p>
<h2 id="heading-setup">Setup</h2>
<p>This is a single-node development setup using Docker Compose. It's built on a well-structured base config that can be extended to production with targeted changes.</p>
<ul>
<li><p>A production deployment would require HA configuration, persistent volume management, and security hardening for each component.</p>
</li>
<li><p>Images are pinned to specific versions to avoid silent breakage between pulls.</p>
</li>
<li><p>All containers share a common external Docker network named <code>data-platform</code>, which allows services to communicate using container names as hostnames.</p>
</li>
<li><p>An <code>init.sh</code> script creates the required local dirs inside the data folder and also creates the Docker network.</p>
</li>
</ul>
<h3 id="heading-rustfs">RustFS</h3>
<p>RustFS is the object storage layer in this stack. Nessie's REST catalog mode has a hard dependency on an S3-compatible endpoint. Running it against a local filesystem fails the Nessie healthcheck at startup and causes catalog initialization to error out. The REST catalog is the recommended mode for new setups because it enables credential vending and multi-engine coordination.</p>
<p>MinIO was the natural choice for self-hosted S3-compatible storage, but it shifted to a more restrictive license. RustFS is the open-source alternative, written in Rust and backed by local disk.</p>
<p>At write time, Spark pushes Parquet files directly to RustFS via S3FileIO. Nessie then commits the table metadata, and the new data becomes visible to readers only if that commit succeeds. This is <a href="https://iceberg.apache.org/">Apache Iceberg</a>'s core guarantee: table state changes atomically at commit time. Data files from a failed commit can remain in storage as orphans, but they're never part of the table.</p>
<p>For production or cloud deployments, managed object storage services like AWS S3, Google Cloud Storage, or Azure Blob Storage are the natural next step. Self-hosted alternatives at scale include <a href="https://github.com/seaweedfs/seaweedfs">SeaweedFS</a>, <a href="https://docs.ceph.com/en/latest/radosgw/">Ceph/RGW</a>, and <a href="https://garagehq.deuxfleurs.fr/">Garage</a>.</p>
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p><strong>Bucket creation:</strong> A <code>rustfs-init</code> sidecar using <code>amazon/aws-cli</code> runs after RustFS passes its healthcheck and creates the <code>s3://warehouse</code> bucket automatically. You don't create the bucket manually.</p>
</li>
<li><p><strong>Permissions:</strong> RustFS runs as uid=10001 inside the container. The host directories (<code>data/rustfs/data</code> and <code>data/rustfs/applogs</code>) must be owned by that uid before the container starts, or it will fail silently. <code>init.sh</code> handles this with <code>sudo chown -R 10001:10001</code>.</p>
</li>
<li><p><strong>Image pinning:</strong> The compose file pins to <code>rustfs/rustfs:1.0.0-alpha.85-glibc</code>. Before upgrading, verify the uid hasn't changed: <code>docker run --rm --entrypoint id rustfs/rustfs:&lt;new-tag&gt;</code>. If it has, re-run <code>init.sh</code> or re-chown manually.</p>
</li>
<li><p><strong>Spark writes:</strong> Spark writes data files directly to RustFS via S3FileIO. Nessie only manages catalog metadata and doesn't proxy data. The two interact at commit time, not at write time.</p>
</li>
</ul>
<h3 id="heading-nessie">Nessie</h3>
<p>The catalog tracks the list of tables in the warehouse, along with their data files and schema. Without it, Spark and any other engine have no shared source of truth for what's in the warehouse.</p>
<p><a href="https://hive.apache.org/docs/latest/admin/adminmanual-metastore-administration/">Hive Metastore</a> offers a Thrift-based API and has been the catalog standard for years. It provides transaction semantics on metadata updates through its backing database, but those transactions stop at the catalog layer. Data files underneath aren't part of the same commit, and there's no cross-table history beyond what the database retains.</p>
<p>Apache Iceberg closes the data and metadata gap with atomic table commits. Nessie builds on that and goes further: it treats the catalog like a Git repository. Every table write is a commit. You can branch, tag, and roll back across multiple tables atomically.</p>
<p>Spark reads and writes table metadata through Nessie's Iceberg REST endpoint. Catalog state is persisted to Postgres, so it survives container restarts.</p>
<h4 id="heading-namespace-bootstrap">Namespace bootstrap</h4>
<p>Unlike Hive Metastore, Nessie doesn't auto-create namespaces. Attempting to write a table to a namespace that doesn't exist fails after data has already been written to RustFS, leaving orphaned files with no catalog entry. Namespaces are structural metadata and belong in a one-time bootstrap step, not in a pipeline.</p>
<p>Nessie manages the Iceberg catalog metadata under <code>s3://warehouse/</code>. Iceberg table data lands under paths derived from the namespace, for example, <code>s3://warehouse/default/</code> for the <code>default</code> namespace.</p>
<h4 id="heading-s3-credential-configuration-issue">S3 Credential Configuration Issue</h4>
<p>Nessie's S3 credential fields don't accept plain strings (likely for security reasons). They require a secret URI in the form <code>urn:nessie-secret:quarkus:&lt;name&gt;</code> even for local credentials.</p>
<p>Additionally, the SCREAMING_SNAKE_CASE environment variable convention is ambiguous for Quarkus property names containing hyphens, because hyphens and dots both become underscores. In this setup, the property set that way was silently ignored, and the default (which fails) was used instead. The working approach is dot-notation keys passed directly in the compose environment block, which Quarkus reads without conversion:</p>
<pre><code class="language-properties">nessie.catalog.service.s3.default-options.access-key: "urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key"
nessie.catalog.secrets.access-key.name: rustfsadmin
nessie.catalog.secrets.access-key.secret: rustfsadmin
</code></pre>
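<p>The ambiguity is easy to see with a small sketch of the name mapping. It assumes the MicroProfile Config convention that Quarkus follows, where every non-alphanumeric character becomes an underscore and the result is uppercased. The helper <code>to_env_name</code> and the dotted property name are hypothetical, and are used only to show the collision.</p>
<pre><code class="language-python">import re


def to_env_name(prop):
    """Map a property name to its SCREAMING_SNAKE_CASE environment variable."""
    # Every non-alphanumeric character becomes "_", then the name is uppercased.
    return re.sub(r"[^A-Za-z0-9]", "_", prop).upper()


hyphenated = "nessie.catalog.service.s3.default-options.access-key"
# Hypothetical property that differs only in using dots instead of hyphens.
dotted = "nessie.catalog.service.s3.default.options.access.key"

print(to_env_name(hyphenated))
# NESSIE_CATALOG_SERVICE_S3_DEFAULT_OPTIONS_ACCESS_KEY
print(to_env_name(hyphenated) == to_env_name(dotted))
# True
</code></pre>
<p>Since two different property names collapse to the same variable, the reverse mapping can't be recovered from the variable name alone. Passing the dot-notation key avoids the conversion entirely.</p>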
<h4 id="heading-nessie-health-check">Nessie health check</h4>
<p>Once the RustFS settings are corrected, Nessie's health check URL (<a href="http://localhost:9090/q/health">http://localhost:9090/q/health</a>) should return the following response:</p>
<pre><code class="language-json">{
    "status": "UP",
    "checks": [
        {
            "name": "MongoDB connection health check",
            "status": "UP"
        },
        {
            "name": "Warehouses Object Stores",
            "status": "UP",
            "data": {
                "warehouse.warehouse.status": "UP"
            }
        },
        {
            "name": "Database connections health check",
            "status": "UP",
            "data": {
                "&lt;default&gt;": "UP"
            }
        }
    ]
}
</code></pre>
<p>The MongoDB connection health check appears in the response even though this stack doesn't use MongoDB. It's a Quarkus built-in probe registered automatically regardless of store type. With JDBC configured, MongoDB is never connected and the UP report is just a placeholder response.</p>
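<p>If you want a startup script to gate on this response, a check along these lines could work. It's a minimal sketch: the helper name <code>failing_checks</code> is made up for this example, and the sample string is a shortened copy of the response shown above.</p>
<pre><code class="language-python">import json


def failing_checks(health_json):
    """Return the names of health checks whose status is not UP."""
    report = json.loads(health_json)
    return [
        check["name"]
        for check in report.get("checks", [])
        if check.get("status") != "UP"
    ]


sample = '''
{
    "status": "UP",
    "checks": [
        {"name": "MongoDB connection health check", "status": "UP"},
        {"name": "Warehouses Object Stores", "status": "UP",
         "data": {"warehouse.warehouse.status": "UP"}}
    ]
}
'''
print(failing_checks(sample))
# []
</code></pre>
<p>An empty list means every reported check is UP. Anything else names the checks to investigate, which is typically the object store check when the S3 credentials are misconfigured.</p>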
<h4 id="heading-catalog-endpoint-vs-management">Catalog endpoint vs Management</h4>
<p>Nessie exposes two separate APIs. The Iceberg REST catalog is at <code>/iceberg</code>. This is what Spark and Trino connect to. The Nessie management API is at <code>/api/v2</code>, which is for branch operations, commit history, and table inspection. They aren't interchangeable.</p>
<pre><code class="language-properties"># Iceberg REST API
http://localhost:19120/iceberg/v1/main/namespaces
http://localhost:19120/iceberg/v1/config

# Nessie management API
http://localhost:19120/api/v2/config
</code></pre>
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p><code>path-style-access: true</code> is required for any non-AWS S3 endpoint. <code>region</code> is a dummy value required by the AWS SDK internally.</p>
</li>
<li><p>Nessie's internal port 9000 is remapped to 9090 on the host to avoid a conflict with RustFS, which occupies 9000 and 9001.</p>
</li>
</ul>
<h4 id="heading-forward-path">Forward path</h4>
<p>Nessie is a stateless REST service, so reads can be scaled by running multiple instances behind a load balancer, with no coordination between nodes. Durability comes entirely from the backend store.</p>
<h3 id="heading-spark">Spark</h3>
<p>As a distributed compute engine, Apache Spark is a reliable and stable choice for long-running jobs. In the current setup, it executes PySpark jobs submitted by Airflow, reads and writes Iceberg tables via the Nessie REST catalog, and writes data files directly to RustFS using S3FileIO. Spark runs in standalone mode with a single master and worker, configured via <code>spark-defaults.conf</code>.</p>
<p>Two JARs are required and must be placed in <code>data/spark/jars/</code> before starting:</p>
<ul>
<li><p><code>iceberg-spark-runtime-3.5_2.12</code>: Iceberg integration for Spark: SparkCatalog, DataFrameWriterV2, SQL extensions, and all table format logic.</p>
</li>
<li><p><code>iceberg-aws-bundle</code>: AWS SDK v2 and Iceberg's S3FileIO, the storage transport layer for writing data files to RustFS. The Spark base image ships only Hadoop AWS (SDK v1). This bundle provides the SDK v2 classes that S3FileIO requires.</p>
</li>
</ul>
<p>Spark uses a custom Dockerfile to install Python 3.12. Build the image before first use:</p>
<pre><code class="language-bash">cd spark
docker compose build
docker compose up -d
</code></pre>
<p>The PySpark jobs are covered in the Airflow section, where we walk through each DAG and its corresponding Spark script as part of the pipeline.</p>
<p>Before submitting any Spark job that writes an Iceberg table, make sure the target namespace exists in Nessie. As described in the Nessie section, a write to a missing namespace fails only after data has already been written to RustFS, leaving orphaned files with no catalog entry.</p>
<p>Create the <code>default</code> namespace once before running any pipeline:</p>
<pre><code class="language-bash"># Nessie should be up and running at this point
curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["default"]}'
# Expected response:
{
  "namespace" : [ "default" ],
  "properties" : { }
}
</code></pre>
<p>Verify:</p>
<pre><code class="language-bash">curl http://localhost:19120/iceberg/v1/main/namespaces
</code></pre>
<h4 id="heading-catalog-mismatch-tables-missing-across-query-engines">Catalog Mismatch: Tables Missing Across Query Engines</h4>
<p>If tables written by Spark aren't visible in Trino, the likely cause is a catalog mismatch. Spark configured with <code>NessieCatalog</code> and Trino using the Iceberg REST catalog maintain separate metadata views — they don't share table state. Both engines must point at the same catalog endpoint: <code>http://nessie:19120/iceberg</code>.</p>
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p><strong>Worker memory:</strong> The worker is configured with <code>SPARK_WORKER_MEMORY: 8g</code>. Spark's default of 1g is enough to register the worker but not enough to run a job without queuing. Tune this based on available host memory.</p>
</li>
<li><p><strong>Remote signing:</strong> <code>remote-signing-enabled: false</code> is set explicitly. Nessie's REST catalog supports credential vending via IAM/STS, but that integration isn't present here, so remote signing is disabled to avoid request failures.</p>
</li>
<li><p><strong>Config changes need full restart:</strong> Docker file-level bind mounts cache the inode at container start. Editing <code>spark-defaults.conf</code> won't take effect until Spark and the Airflow worker are restarted. In client mode, the Airflow worker is the Spark driver (the process that reads the config on job submission) and must be restarted too.</p>
</li>
<li><p><strong>Jupyter Notebook:</strong> A Jupyter instance with PySpark is included in the stack for ad-hoc queries against Iceberg tables. It connects to the same Spark cluster and Nessie catalog, so any table written by a pipeline is immediately queryable.</p>
</li>
</ul>
<p>⚠️ <strong>Warning:</strong> The Spark worker and Airflow worker (the driver) must run the same Python minor version. PySpark enforces this at runtime and fails immediately if they diverge. The Spark image in this stack uses a custom Dockerfile to install Python 3.12, matching Airflow's base image. If you upgrade either, verify that the versions stay aligned.</p>
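<p>A pre-flight check for this mismatch can be as small as comparing the first two version components. The sketch below is illustrative only. The helper names are hypothetical, and how you obtain the Spark worker's version string (for example, by running <code>python --version</code> inside the container) is left as an assumption.</p>
<pre><code class="language-python">import sys


def minor_version(version):
    """Reduce a version string such as '3.12.4' to the tuple (3, 12)."""
    major, minor = version.split(".")[:2]
    return (int(major), int(minor))


def versions_aligned(driver_version, worker_version):
    """True when driver and worker share the same Python major.minor."""
    return minor_version(driver_version) == minor_version(worker_version)


# The driver side can report its own version. The worker value is a placeholder.
driver = "{}.{}.{}".format(*sys.version_info[:3])
print(driver, versions_aligned(driver, "3.12.0"))
</code></pre>
<p>Patch releases are deliberately ignored here, since the warning above concerns minor versions only.</p>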
<h3 id="heading-apache-airflow">Apache Airflow</h3>
<p>Airflow makes it easier to author, schedule and monitor workflows. In this case, it handles the ingestion for batch processing, but it can be extended to use cases like stream processing.</p>
<p>The Airflow setup here most closely resembles the separate DAG processor architecture described in the <a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html">official docs</a>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69607e708806706b5c49c7af/a438e02b-0b16-44c7-bcae-92c954a942cc.png" alt="DAG Processor Airflow Architecture" style="display:block;margin:0 auto" width="2308" height="1455" loading="lazy">

<p>Key aspects:</p>
<ul>
<li><p>The DAG Processor continuously parses DAG files and serializes them to the Metadata DB.</p>
</li>
<li><p>The Scheduler reads from there, detects when a DAG run is due, creates task instances, and pushes them to the CeleryExecutor (via Redis queue).</p>
</li>
<li><p>The Celery worker picks up a task and executes it. In the case of a <code>SparkSubmitOperator</code>, the worker process becomes the Spark driver, submitting the job to the Spark cluster.</p>
</li>
<li><p>Executors run on the Spark worker, write Parquet files directly to RustFS, and commit the table metadata to Nessie. Airflow records the task outcome back in the Metadata DB.</p>
</li>
</ul>
<p>Airflow uses a custom Dockerfile to install Java 17 and additional providers. Build the image before first use:</p>
<pre><code class="language-bash">cd airflow-docker
docker compose build
docker compose up -d
</code></pre>
<h4 id="heading-pipelines">Pipelines</h4>
<p>Pipelines must be created inside the <code>airflow-docker/dags</code> folder so that the DAG processor can pick them up and load them into the Metadata DB. Four example pipelines of increasing complexity are provided.</p>
<ol>
<li><p><code>step1_hello_dag.py</code>: single-task DAG with no dependencies, just a Python function that prints a message.</p>
</li>
<li><p><code>step2_spark_submit.py</code>: submits a PySpark job via SparkSubmitOperator. The job writes a static dataset to an Iceberg table via the Nessie catalog.</p>
</li>
<li><p><code>step3_spark_partitioned.py</code>: extends step 2 with time-based partitioning. The scheduled slot time is passed to the PySpark script.</p>
<ul>
<li>Time-based partition values are derived from <code>data_interval_start</code>, which keeps backfills and reruns idempotent.</li>
</ul>
</li>
<li><p><code>scraper_pipeline.py</code>: a real-world ingestion pipeline. Coordinates with the external task executor <code>scrapworker</code> via the Redis queue <code>scrapredis</code>.</p>
<ul>
<li>Both <code>scrapredis</code> and <code>scrapworker</code> must be up and running for this pipeline to work.</li>
</ul>
</li>
</ol>
<h4 id="heading-deploy-mode-and-driver-config">Deploy Mode and Driver Config</h4>
<p>The initial <code>SparkSubmitOperator</code> configuration used <code>deploy_mode="cluster"</code>, which runs the driver on the Spark cluster rather than the submitting machine. This fails immediately on Spark standalone clusters with a hard error:</p>
<pre><code class="language-plaintext">Cluster deploy mode is currently not supported for python applications on standalone clusters.
</code></pre>
<p>Cluster mode for Python is only available on YARN and Kubernetes. The fix is <code>deploy_mode="client"</code>, but this shifts the problem: in client mode, the driver runs on the Airflow worker container, which means the worker needs everything the Spark containers have.</p>
<p>Overall, three changes are required in the Airflow worker:</p>
<ul>
<li><p>The same Iceberg JARs that Spark uses, mounted at <code>/opt/spark/user-jars/</code></p>
</li>
<li><p><code>spark-defaults.conf</code> with catalog, extension, and JAR config</p>
</li>
<li><p><code>SPARK_CONF_DIR=/opt/spark/conf</code>. Without this, pip-installed PySpark's <code>spark-submit</code> silently ignores the mounted conf file and runs with no catalog config.</p>
</li>
</ul>
<p>The fix was adding all three to <code>x-airflow-common</code> in <code>airflow-docker/docker-compose.yaml</code> so every Airflow service inherits them:</p>
<pre><code class="language-yaml">environment:
  SPARK_CONF_DIR: /opt/spark/conf

volumes:
  - ../data/spark/jars:/opt/spark/user-jars:ro
  - ../spark/spark-defaults.conf:/opt/spark/conf/spark-defaults.conf:ro
</code></pre>
<h4 id="heading-partition-values-written-as-null">Partition Values Written as NULL</h4>
<p>When the third pipeline (Spark Partitioned) ran for the first time, the data landed correctly in RustFS, but querying the Iceberg partitions metadata showed:</p>
<pre><code class="language-plaintext">+------------------+----------+
|         partition|file_count|
+------------------+----------+
|{NULL, NULL, NULL}|         2|
+------------------+----------+
</code></pre>
<p>The original script used Spark's DataSource V1 API:</p>
<pre><code class="language-python">df.write.format("iceberg").mode("overwrite").saveAsTable(table)
</code></pre>
<p>With <code>format("iceberg")</code>, this call loads an isolated table reference and bypasses Iceberg's catalog write path. As a result, Iceberg committed the data files to storage but wrote NULL partition values into the manifest metadata.</p>
<p>The fix is to use Spark's DataFrameWriterV2 API, which Iceberg supports natively:</p>
<pre><code class="language-python">df.writeTo(table).overwritePartitions()
</code></pre>
<p>This routes through Iceberg's native write path, evaluates partition transforms from the real column values (ds, hr, min), and registers them correctly in the manifest. <code>overwritePartitions()</code> overwrites only the partitions present in the DataFrame. A rerun with the same scheduled time produces the same values and atomically replaces that partition, leaving all others untouched.</p>
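<p>The replace-only-what's-present behavior can be pictured with a toy model in plain Python. This isn't Iceberg or Spark code. It's a sketch that treats a table as a dictionary keyed by <code>(ds, hr, min)</code>, with a made-up helper <code>overwrite_partitions</code>, to show why a rerun is idempotent and why other partitions stay untouched.</p>
<pre><code class="language-python">def overwrite_partitions(table, incoming_rows):
    """Toy model: replace only the partitions present in incoming_rows."""
    grouped = {}
    for row in incoming_rows:
        key = (row["ds"], row["hr"], row["min"])
        grouped.setdefault(key, []).append(row)
    updated = dict(table)    # partitions not in the batch are carried over
    updated.update(grouped)  # partitions in the batch are replaced wholesale
    return updated


table = {
    ("2026-03-28", "23", "00"): [{"ds": "2026-03-28", "hr": "23", "min": "00", "v": 1}],
    ("2026-03-28", "23", "15"): [{"ds": "2026-03-28", "hr": "23", "min": "15", "v": 2}],
}
rerun = [{"ds": "2026-03-28", "hr": "23", "min": "15", "v": 3}]

after = overwrite_partitions(table, rerun)
print(after[("2026-03-28", "23", "00")][0]["v"])  # 1
print(after[("2026-03-28", "23", "15")][0]["v"])  # 3
</code></pre>
<p>Applying the same batch a second time yields the same table, which is the property that makes reruns and backfills safe.</p>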
<p>⚠️ Existing NULL-partition manifest entries aren't retroactively corrected by subsequent V2 writes. For a brand-new table containing only bad data, DROP TABLE and rewrite is the simplest recovery.</p>
<h3 id="heading-scrapredis">Scrapredis</h3>
<p>Scrapredis is a dedicated Redis instance that sits between Airflow and Scrapworker as a job queue. It's separate from Airflow's internal Redis, which exists solely for CeleryExecutor task dispatch. The separation means the crawler's job queue can be managed, scaled, or replaced without touching Airflow's internals.</p>
<p>The pattern generalises beyond scraping. Any external process that needs its own lifecycle, resource profile, or rate limiting can be wired the same way: Airflow pushes a job, the external worker pops it, and Airflow polls for the result.</p>
<p>The scraper pipeline follows this round-trip:</p>
<ol>
<li>Airflow pushes the job payload to the queue:</li>
</ol>
<pre><code class="language-python">QUEUE_KEY = "scrapworker:jobs"
client.lpush(QUEUE_KEY, json.dumps(payload))
</code></pre>
<ol start="2">
<li>Scrapworker blocks on the queue and pops the next job:</li>
</ol>
<pre><code class="language-python">while True:
    _, payload = client.blpop(redis_cfg["queue_key"])
</code></pre>
<ol start="3">
<li>Once the crawl finishes, Scrapworker writes the outcome and <code>s3_path</code> back to Redis:</li>
</ol>
<pre><code class="language-python">client.set(status_key, json.dumps({"status": "finished", "worker_id": worker_id, "s3_path": job["s3_path"]}), ex=TERMINAL_TTL)
</code></pre>
<ol start="4">
<li>The <code>wait_for_completion</code> task polls for that status key. On success, <code>publish_nessie_signal</code> picks up the <code>s3_path</code> and writes the signal row to Nessie.</li>
</ol>
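<p>The polling step in the round-trip above could be sketched as follows. This is not the repository's <code>wait_for_completion</code> implementation. The helper <code>wait_for_status</code>, its parameters, and the bounded number of polls are assumptions. The <code>fetch</code> argument stands in for something like <code>lambda: client.get(status_key)</code>, so the logic can run without a Redis server.</p>
<pre><code class="language-python">import json
import time


def wait_for_status(fetch, max_polls=60, poll_s=5.0, sleep=time.sleep):
    """Poll fetch() until a terminal status appears or the polls run out."""
    for _ in range(max_polls):
        raw = fetch()  # None while the worker hasn't written a status yet
        if raw is not None:
            status = json.loads(raw)
            if status.get("status") in ("finished", "failed"):
                return status
        sleep(poll_s)
    raise TimeoutError("no terminal status after {} polls".format(max_polls))


# Stand-in for Redis: two empty reads, then a finished status.
responses = iter([
    None,
    None,
    json.dumps({"status": "finished", "s3_path": "s3://warehouse/raw/example.json"}),
])
result = wait_for_status(lambda: next(responses), sleep=lambda s: None)
print(result["s3_path"])
# s3://warehouse/raw/example.json
</code></pre>
<p>Bounding the number of polls turns a missing worker into a task failure instead of an indefinite hang. One detail about the snippets above is also worth knowing: <code>LPUSH</code> and <code>BLPOP</code> both operate on the head of the list, so when several jobs are queued, the most recently pushed one is popped first. Pairing <code>LPUSH</code> with <code>BRPOP</code> gives first-in, first-out ordering if that matters for your workload.</p>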
<h3 id="heading-scrapworker">Scrapworker</h3>
<p>Scrapworker is a Python app that uses the Scrapy crawl framework to crawl all pages of a request. It's decoupled from Airflow because rate limits are specific to each URL and client. For simplicity, consider it a type of external worker that receives and executes requests from Airflow.</p>
<p>It's responsible for downloading and writing content to object storage (RustFS). The Nessie catalog update is decoupled and kept in a separate Airflow pipeline task.</p>
<h4 id="heading-fixed-signal-table">Fixed Signal Table</h4>
<p>Scrapworker writes raw JSON to RustFS rather than writing scraped data directly as Iceberg columns. The pipeline then publishes a single lightweight signal row to a Nessie-managed Iceberg table.</p>
<p>The signal schema is fixed and minimal (<code>run_id</code>, <code>endpoint</code>, <code>s3_path</code>, <code>ds</code>, <code>hr</code>, <code>min</code>, <code>published_at</code>). It never changes, regardless of what's being scraped.</p>
<p>Mirroring the scraped payload as Iceberg columns would force Scrapworker to own schema evolution across different endpoints. This isn't an ideal place for schema ownership. Instead, schema ownership sits downstream:</p>
<pre><code class="language-plaintext">Scrapworker  →  raw files in RustFS  +  signal row in Iceberg (from Pipeline)
Airflow job  →  reads raw via s3_path, applies schema, writes structured Iceberg table
</code></pre>
<p>The downstream job knows the domain, knows the schema, and is the right place to handle type casting, nulls, and partition layout. Scrapworker stays generic and thin — the same code handles any endpoint without modification.</p>
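<p>As an illustration of how small the fixed signal schema is, the sketch below assembles one signal row from the fields listed above. The helper <code>build_signal_row</code>, the string types, and the ISO-8601 timestamp format are assumptions. The actual table may use different column types.</p>
<pre><code class="language-python">from datetime import datetime, timezone

SIGNAL_FIELDS = ("run_id", "endpoint", "s3_path", "ds", "hr", "min", "published_at")


def build_signal_row(run_id, endpoint, s3_path, ds, hr, minute):
    """Assemble one signal row. The scraped payload itself never enters it."""
    row = {
        "run_id": run_id,
        "endpoint": endpoint,
        "s3_path": s3_path,
        "ds": ds,
        "hr": hr,
        "min": minute,
        "published_at": datetime.now(timezone.utc).isoformat(),
    }
    # Guard against accidental schema drift in this helper.
    assert tuple(row) == SIGNAL_FIELDS
    return row


row = build_signal_row(
    "run-001",
    "https://api.binance.com/api/v3/trades",
    "s3://warehouse/raw/example.json",
    "2026-03-28", "23", "15",
)
print(sorted(row))
</code></pre>
<p>Nothing in the row depends on the shape of the scraped data, which is why the same signal table can serve any endpoint.</p>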
<h4 id="heading-why-signal-publish-is-a-separate-airflow-task">Why Signal Publish is a Separate Airflow Task</h4>
<p>Scrapworker writes to RustFS and sets <code>status: finished</code> in Redis with the <code>s3_path</code>. A separate Airflow task reads that status and publishes the signal row to Nessie. The two writes are intentionally decoupled.</p>
<p>If Scrapworker published to Nessie directly after writing to RustFS, the two writes would share a failure mode. A Nessie failure after a successful RustFS write would leave data stranded with no signal and no clean recovery path. The only option would be a re-crawl, which isn't idempotent.</p>
<p>With the decoupled approach, each failure is isolated. A Nessie failure triggers an Airflow retry of the signal publish task only: no re-scrape and no duplicate crawl. RustFS and Nessie failures are independently recoverable.</p>
<h4 id="heading-notes">Notes</h4>
<ul>
<li><p>Raw scraped files are written directly to <code>s3://warehouse/raw/</code>, entirely outside Nessie's management. Nothing in the Iceberg layer touches this path.</p>
</li>
<li><p>The scrapworker signal table lives in a dedicated <code>scraper</code> namespace. Create it once before scrapworker runs for the first time.</p>
</li>
</ul>
<pre><code class="language-bash">curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["scraper"]}'
</code></pre>
<h2 id="heading-path-forward">Path Forward</h2>
<p>The stack we've built here is a working ingestion layer. It lands data reliably, tracks it in a versioned catalog, and gives you a foundation to build on. Two directions are worth considering from here.</p>
<h3 id="heading-extending-capabilities">Extending Capabilities</h3>
<p>These are improvements to what's already in the stack, making it more robust without adding new components.</p>
<p><strong>Ingestion reliability:</strong> Scrapworker currently handles failures by setting <code>status: failed</code> in Redis, which requires Airflow to re-trigger the full pipeline. Adding client-side rate limiting and per-endpoint retry logic with backoff would make crawl jobs more self-healing, so that a failed page fetch can retry independently without surfacing to Airflow at all.</p>
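<p>A minimal sketch of per-fetch retry with exponential backoff. The <code>fetch_with_backoff</code> name, the attempt count, the delays, and the choice of retryable exceptions are all assumptions, and a Scrapy-based worker may prefer the framework's own retry settings.</p>
<pre><code class="language-python">import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay_s=1.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the failure surface to the caller
            sleep(base_delay_s * 2 ** attempt)
</code></pre>
<p>Only the final failure propagates, so a transient error on one page never has to reach Redis or Airflow.</p>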
<p><strong>Config validation:</strong> A misconfigured endpoint schema in <code>config.yaml</code> fails silently at runtime, often deep into a crawl. A <code>validate_config()</code> call at startup would catch missing required fields like <code>offset_param</code> or <code>response_map</code> before any job runs. This becomes more important as more endpoints are added.</p>
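<p>One possible shape for that check, assuming <code>config.yaml</code> has already been loaded into a dictionary. The top-level <code>endpoints</code> key is hypothetical, and the required-field list contains only the two fields named above.</p>
<pre><code class="language-python">REQUIRED_ENDPOINT_FIELDS = ("offset_param", "response_map")


def validate_config(config):
    """Return a list of problems; an empty list means the config passed."""
    endpoints = config.get("endpoints")  # hypothetical layout of config.yaml
    if not isinstance(endpoints, dict) or not endpoints:
        return ["config has no 'endpoints' mapping"]
    problems = []
    for name, spec in endpoints.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: endpoint entry must be a mapping")
            continue
        for field in REQUIRED_ENDPOINT_FIELDS:
            # Treat absent, empty-string, and empty-mapping values as missing.
            if spec.get(field) in (None, "", {}):
                problems.append(f"{name}: missing required field '{field}'")
    return problems
</code></pre>
<p>Collecting every problem before failing lets one startup run report all misconfigured endpoints at once.</p>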
<p><strong>Observability:</strong> Airflow alerting and SLA monitoring give early warning when pipelines miss their schedule or tasks take longer than expected. The signal table is useful here too. A lightweight monitor that checks for expected signal rows within a time window is a simple SLA check that works without external tooling.</p>
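<p>A sketch of such a monitor, assuming the signal rows have already been read into <code>(endpoint, published_at)</code> pairs. The <code>missing_endpoints</code> helper and the endpoint names in the usage are hypothetical.</p>
<pre><code class="language-python">from datetime import datetime, timedelta, timezone


def missing_endpoints(published, expected, now, window):
    """Return the expected endpoints with no signal row inside the window."""
    cutoff = now - window
    seen = {endpoint for endpoint, published_at in published if published_at >= cutoff}
    return sorted(set(expected) - seen)


now = datetime(2026, 5, 29, 10, 0, tzinfo=timezone.utc)
rows = [("orders", now - timedelta(minutes=20)), ("users", now - timedelta(hours=3))]
late = missing_endpoints(rows, ["orders", "users"], now, timedelta(hours=1))
</code></pre>
<p>Here <code>late</code> holds only <code>users</code>, because its latest signal row is older than the one-hour window.</p>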
<h3 id="heading-adding-layers">Adding Layers</h3>
<p>These are new capabilities that build on the ingestion foundation.</p>
<p><strong>Transform layer:</strong> The raw Iceberg tables written by the ingestion layer are the input for a transform step. dbt or Spark SQL can read from raw, apply schema, clean types, and write structured tables to a separate namespace. This is the T in ELT and the natural next step once ingestion is stable.</p>
<p><strong>Analytics:</strong> Trino is already in the stack and partially integrated. Connecting it fully to Nessie enables SQL queries across all Iceberg tables. Adding Superset on top gives a visualisation layer without requiring any changes to the ingestion pipeline.</p>
<p><strong>Broader source onboarding:</strong> The current stack handles one ingestion pattern: a scheduled Airflow pipeline triggering an external HTTP crawler. The same foundation supports pull-based sources like databases using CDC, and push-based sources like event streams via Kafka. The Iceberg tables and Nessie catalog serve as the landing zone regardless of how data arrives.</p>
<p><strong>Governance:</strong> Iceberg and Nessie provide the foundations, covering snapshots, schema evolution, commit history, and time travel. The governance layer on top requires deliberate additions: access control, data quality checks, lineage tracking, and schema enforcement. None of these require replacing what's here, as they sit on top of it.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Keep Human Experts Visible in Your AI-Assisted Codebase ]]>
                </title>
                <description>
                    <![CDATA[ Six months ago, Stack Overflow processed 108,563 questions in a single month. By December 2025, that number had fallen to 3,862. A 78% collapse in two years. The explanation everyone reaches for is th ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-keep-human-experts-visible-in-your-ai-assisted-codebase/</link>
                <guid isPermaLink="false">69dd18d4217f5dfcbd13e964</guid>
                
                    <category>
                        <![CDATA[ claude.ai ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Productivity ]]>
                    </category>
                
                    <category>
                        <![CDATA[ claude ]]>
                    </category>
                
                    <category>
                        <![CDATA[ claude-code ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Daniel Nwaneri ]]>
                </dc:creator>
                <pubDate>Mon, 13 Apr 2026 16:24:52 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/21d160a8-af66-4048-9fda-1d83b2e26148.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Six months ago, Stack Overflow processed 108,563 questions in a single month. By December 2025, that number had fallen to 3,862. A 78% collapse in two years.</p>
<p>The explanation everyone reaches for is that AI replaced it. That's partly true. But it misses the structural problem underneath: every time a developer asks Claude or ChatGPT to write code, the knowledge that shaped the answer disappears.</p>
<p>The GitHub discussion where someone spent two hours documenting why cursor-based pagination beats offset for live-updating datasets. The Stack Overflow answer from 2019 where one engineer, after a week of debugging, documented exactly why that approach fails under concurrent writes.</p>
<p>The AI consumed all of it. The humans who produced it got nothing — no citation in the codebase, no signal that their work mattered.</p>
<p>Over time, those people stopped contributing. Stack Overflow isn't dying because it's bad. It's dying because AI extracted its value and the feedback loop that kept humans contributing broke down.</p>
<p>This tutorial builds a tool that puts that loop back together. <strong>proof-of-contribution</strong> is a Claude Code skill that links every AI-generated artifact back to the human knowledge that inspired it — and surfaces exactly where the AI made choices with no human source at all.</p>
<p>I'll show you how to install proof-of-contribution, how to record your first provenance entry, how to use the spec-writer integration that makes Knowledge Gaps deterministic, and how to run <code>poc.py verify</code> — a static analyser that detects gaps without a single API call.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-what-you-will-build">What You Will Build</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-quickstart-in-5-minutes">Quickstart in 5 Minutes</a></p>
</li>
<li><p><a href="#heading-how-the-tool-works">How the Tool Works</a></p>
</li>
<li><p><a href="#heading-how-to-install-proof-of-contribution">How to Install proof-of-contribution</a></p>
</li>
<li><p><a href="#heading-how-to-scaffold-your-project">How to Scaffold Your Project</a></p>
</li>
<li><p><a href="#heading-how-to-record-your-first-provenance-entry">How to Record Your First Provenance Entry</a></p>
</li>
<li><p><a href="#heading-how-to-use-import-spec-to-seed-knowledge-gaps">How to Use import-spec to Seed Knowledge Gaps</a></p>
</li>
<li><p><a href="#heading-how-to-trace-human-attribution">How to Trace Human Attribution</a></p>
</li>
<li><p><a href="#heading-how-to-verify-with-static-analysis">How to Verify with Static Analysis</a></p>
</li>
<li><p><a href="#heading-how-to-enable-pr-enforcement">How to Enable PR Enforcement</a></p>
</li>
<li><p><a href="#heading-where-to-go-next">Where to Go Next</a></p>
</li>
</ol>
<h2 id="heading-what-you-will-build">What You Will Build</h2>
<p>proof-of-contribution is a Claude Code skill with a local CLI. Together they give you:</p>
<ul>
<li><p><strong>Provenance Blocks</strong>: Claude appends a structured attribution block to every generated artifact, listing the human sources that inspired it and flagging what it synthesized without any traceable source.</p>
</li>
<li><p><strong>Knowledge Gaps</strong>: the parts of AI-generated code that have no human citation, surfaced before they become production incidents</p>
</li>
<li><p><code>poc.py trace</code>: a CLI command that shows the full human attribution chain for any file in thirty seconds</p>
</li>
<li><p><code>poc.py import-spec</code>: bridges proof-of-contribution with spec-writer, seeding knowledge gaps from your spec's assumptions list before the agent builds anything</p>
</li>
<li><p><code>poc.py verify</code>: a static analyser that cross-checks your file's structure against seeded claims using Python's AST. Zero API calls. Exit code 0 means clean, exit code 1 means gaps found — wires directly into CI</p>
</li>
<li><p><strong>A GitHub Action</strong>: optional PR enforcement that fails PRs missing attribution, for teams that want a standard</p>
</li>
</ul>
<p>The complete source is at <a href="https://github.com/dannwaneri/proof-of-contribution">github.com/dannwaneri/proof-of-contribution</a>.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>This is a beginner-to-intermediate tutorial. You should be comfortable with:</p>
<ul>
<li><p><strong>Command line basics</strong>: navigating directories, running scripts</p>
</li>
<li><p><strong>Git</strong>: basic commits and PRs</p>
</li>
<li><p><strong>Python 3.8 or higher</strong>: the CLI is pure Python with no dependencies</p>
</li>
</ul>
<p>You will need:</p>
<ul>
<li><p><strong>Python installed</strong>: check with <code>python --version</code> or <code>python3 --version</code></p>
</li>
<li><p><strong>Git installed</strong>: check with <code>git --version</code></p>
</li>
<li><p><strong>Claude Code</strong> (or any agent that supports the Agent Skills standard — Cursor and Gemini CLI also work)</p>
</li>
</ul>
<p>There's no database to install. No API keys. No paid services. The default storage is SQLite, which Python includes out of the box.</p>
<h2 id="heading-quickstart-in-5-minutes">Quickstart in 5 Minutes</h2>
<p>If you want to try the tool before reading the full tutorial, here are the five commands that take you from zero to your first gap detection:</p>
<p><strong>Mac and Linux:</strong></p>
<pre><code class="language-bash"># 1. Install
mkdir -p ~/.claude/skills
git clone https://github.com/dannwaneri/proof-of-contribution.git \
  ~/.claude/skills/proof-of-contribution

# 2. Scaffold your project (run in your repo root)
python ~/.claude/skills/proof-of-contribution/assets/scripts/poc_init.py

# 3. Record attribution for an AI-generated file
python poc.py add src/utils/parser.py

# 4. Detect gaps via static analysis
python poc.py verify src/utils/parser.py

# 5. See the full provenance chain
python poc.py trace src/utils/parser.py
</code></pre>
<p><strong>Windows PowerShell:</strong></p>
<pre><code class="language-powershell"># 1. Install
New-Item -ItemType Directory -Force -Path "$HOME\.claude\skills"
git clone https://github.com/dannwaneri/proof-of-contribution.git `
  "$HOME\.claude\skills\proof-of-contribution"

# 2. Scaffold your project
python "$HOME\.claude\skills\proof-of-contribution\assets\scripts\poc_init.py"

# 3. Record attribution
python poc.py add src\utils\parser.py

# 4. Detect gaps
python poc.py verify src\utils\parser.py

# 5. See the full provenance chain
python poc.py trace src\utils\parser.py
</code></pre>
<p>That's the whole tool. The sections below walk through each step in detail with real terminal output at every stage.</p>
<h2 id="heading-how-the-tool-works">How the Tool Works</h2>
<p>Before you install anything, you need a clear mental model of what proof-of-contribution actually does — because the most important part isn't obvious.</p>
<h3 id="heading-the-archaeology-problem">The Archaeology Problem</h3>
<p>Here's a scenario that happens on every team using AI-assisted development.</p>
<p>A developer joins and inherits six months of AI-generated code. They hit a bug in the pagination logic: cursor-based, an unusual implementation, and nobody remembers why it was built that way. The original developer has left.</p>
<p>Old answer: two days of archaeology. <code>git blame</code> points to a commit message that says "fix pagination." The commit before that says "implement pagination." Dead end.</p>
<p>With <code>poc.py trace src/utils/paginator.py</code>, that same developer sees this in thirty seconds:</p>
<pre><code class="language-plaintext">Provenance trace: src/utils/paginator.py
────────────────────────────────────────────────────────────
  [HIGH]  @tannerlinsley on github
          Cursor pagination discussion
          https://github.com/TanStack/query/discussions/123
          Insight: cursor beats offset for live-updating datasets

Knowledge gaps (AI-synthesized, no human source):
  • Error retry strategy — no human source cited
  • Concurrent write handling — AI chose this arbitrarily
</code></pre>
<p>They now know where the pattern came from and — critically — which parts have no traceable human source. The concurrent write handling is where the bug lives. The AI made a choice nobody reviewed.</p>
<p>That's what this tool does. Not enforcement first. Archaeology first.</p>
<h3 id="heading-how-knowledge-gaps-are-detected">How Knowledge Gaps Are Detected</h3>
<p>The obvious assumption is that Claude introspects and reports what it doesn't know. That assumption is wrong. LLMs hallucinate confidently. An AI that could reliably detect its own knowledge gaps wouldn't produce them.</p>
<p>The detection mechanism is a comparison, not introspection.</p>
<p>When you use <a href="https://github.com/dannwaneri/spec-writer">spec-writer</a> before building, it generates a spec with an explicit <code>## Assumptions to review</code> section — every decision the AI is making that you didn't specify, each one impact-rated. That list is the contract.</p>
<p>When you run <code>poc.py import-spec spec.md --artifact src/utils/paginator.py</code>, those assumptions get seeded into the database as unresolved knowledge gaps. After the agent builds, <code>poc.py trace</code> shows which assumptions made it into code with no human source ever cited.</p>
<p>The AI isn't grading its own exam. The spec is the answer key.</p>
<p><code>poc.py verify</code> takes this further. After the agent builds, it parses the file's actual structure using Python's built-in <code>ast</code> module — extracting every function definition, conditional branch, and return path. It cross-checks each one against the seeded claims. Any structural unit with no resolved claim surfaces as a deterministic Knowledge Gap, regardless of how confident the model was when it wrote the code.</p>
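<p>To make the structural pass concrete, here is a minimal sketch of extracting functions and <code>if</code> branches with the <code>ast</code> module. It is not the code in <code>poc.py</code>. The <code>structural_units</code> name and the tuple layout are assumptions, and return paths are left out for brevity.</p>
<pre><code class="language-python">import ast


def structural_units(source):
    """Return (kind, label, first_line, last_line) for each function and if-branch."""
    units = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            units.append(("function", node.name, node.lineno, node.end_lineno))
        elif isinstance(node, ast.If):
            # Recover the condition text from the original source.
            condition = ast.get_source_segment(source, node.test)
            units.append(("branch", "if " + condition, node.lineno, node.end_lineno))
    return sorted(units, key=lambda unit: unit[2])
</code></pre>
<p>Each unit carries its line range, which is what lets a gap report point at an exact span of the file. Both <code>end_lineno</code> and <code>ast.get_source_segment</code> are available from Python 3.8.</p>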
<h2 id="heading-how-to-install-proof-of-contribution">How to Install proof-of-contribution</h2>
<h3 id="heading-mac-and-linux">Mac and Linux</h3>
<pre><code class="language-bash">mkdir -p ~/.claude/skills
git clone https://github.com/dannwaneri/proof-of-contribution.git \
  ~/.claude/skills/proof-of-contribution
</code></pre>
<h3 id="heading-windows-powershell">Windows PowerShell</h3>
<pre><code class="language-powershell">New-Item -ItemType Directory -Force -Path "$HOME\.claude\skills"
git clone https://github.com/dannwaneri/proof-of-contribution.git `
  "$HOME\.claude\skills\proof-of-contribution"
</code></pre>
<p>That's the entire installation. No package to install, no configuration file to edit. The skill is a markdown file the agent reads. The CLI is a Python script that runs locally.</p>
<h3 id="heading-verify-the-install">Verify the Install</h3>
<pre><code class="language-bash">ls ~/.claude/skills/proof-of-contribution/
</code></pre>
<p>You should see <code>SKILL.md</code>, <code>poc.py</code>, <code>assets/</code>, and <code>references/</code>. If the directory is empty, the clone failed — check your internet connection and try again.</p>
<h2 id="heading-how-to-scaffold-your-project">How to Scaffold Your Project</h2>
<p>The scaffold script creates the database, config, CLI, and GitHub integration in your project root. Run it once per project.</p>
<h3 id="heading-mac-and-linux">Mac and Linux</h3>
<pre><code class="language-bash">cd /path/to/your/project
python ~/.claude/skills/proof-of-contribution/assets/scripts/poc_init.py
</code></pre>
<h3 id="heading-windows-powershell">Windows PowerShell</h3>
<pre><code class="language-powershell">cd C:\path\to\your\project
python "$HOME\.claude\skills\proof-of-contribution\assets\scripts\poc_init.py"
</code></pre>
<p>You should see output like this:</p>
<pre><code class="language-plaintext">🔗 Proof of Contribution — init

  →  Project root: /path/to/your/project
  ✔  Created .poc/config.json
  ✔  Created .poc/.gitignore  (db excluded from git, config tracked)
  ✔  Created .poc/provenance.db  (SQLite — no extra infra needed)
  ✔  Created .github/PULL_REQUEST_TEMPLATE.md
  ✔  Created .github/workflows/poc-check.yml
  ✔  Created poc.py  (local CLI — includes import-spec command)
  ✔  Created .gitignore

✔ Proof of Contribution initialised for 'your-project'
</code></pre>
<p>This creates four things in your project:</p>
<pre><code class="language-plaintext">your-project/
├── .poc/
│   ├── config.json      ← project settings (commit this)
│   ├── provenance.db    ← SQLite database (local only, gitignored)
│   └── .gitignore
├── .github/
│   ├── PULL_REQUEST_TEMPLATE.md
│   └── workflows/
│       └── poc-check.yml
└── poc.py               ← your local CLI
</code></pre>
<ul>
<li><p><code>.poc/</code> — the tool's local data directory. <code>config.json</code> stores project settings and is committed to git. <code>provenance.db</code> is the SQLite database where attribution records and knowledge gaps are stored — local only, gitignored.</p>
</li>
<li><p><code>poc.py</code> — your local CLI, copied into the project root. Run <code>python poc.py trace</code>, <code>python poc.py verify</code>, and every other command directly without a global install.</p>
</li>
<li><p><code>.github/PULL_REQUEST_TEMPLATE.md</code> — a PR template with the <code>## 🤖 AI Provenance</code> section pre-filled. Developers fill it in when submitting PRs that contain AI-generated code.</p>
</li>
<li><p><code>.github/workflows/poc-check.yml</code> — the optional GitHub Action for PR enforcement. Installed but dormant until you push the workflow file and enable it in your repo settings.</p>
</li>
</ul>
<p><strong>Windows note:</strong> if the scaffold fails with a <code>UnicodeEncodeError</code>, the emoji in the PR template is hitting a Windows encoding limit. Open <code>assets/scripts/poc_init.py</code> in a text editor and find every line ending with <code>.write_text(...)</code>. Change each one to <code>.write_text(..., encoding="utf-8")</code>. Save and re-run.</p>
<h3 id="heading-verify-the-scaffold-worked">Verify the Scaffold Worked</h3>
<pre><code class="language-bash">python poc.py report
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Proof of Contribution Report
────────────────────────────────────────
  Artifacts tracked    : 0
  With provenance      : 0  (0%)
  Unresolved gaps      : 0
  Resolved claims      : 0
  Human experts        : 0
</code></pre>
<p>Empty database, clean state. You're ready.</p>
<h2 id="heading-how-to-record-your-first-provenance-entry">How to Record Your First Provenance Entry</h2>
<p>One clarification first. Earlier, I described <code>poc.py verify</code> as detecting Knowledge Gaps automatically, and it does. But the static analyser can only tell you <em>that</em> a function has no human citation. It can't tell you <em>which</em> human source inspired it. That knowledge lives in your head, not in the code.</p>
<p><code>poc.py add</code> is where you supply that context. After the agent builds a file, you record the human sources you actually drew on: the GitHub discussion you read before prompting, the Stack Overflow answer that shaped the approach. Those records become the attribution chain <code>poc.py trace</code> surfaces — and what closes the gaps <code>poc.py verify</code> flags.</p>
<p><code>verify</code> finds the gaps. <code>add</code> fills them.</p>
<p><code>poc.py add</code> records attribution for a file interactively. You can run it on any AI-generated file in your project.</p>
<pre><code class="language-bash">python poc.py add src/utils/parser.py
</code></pre>
<p>You'll see a prompt:</p>
<pre><code class="language-plaintext">Recording provenance for: src/utils/parser.py
(Press Ctrl+C to cancel)

  Human source URL (or Enter to finish):
</code></pre>
<p>Enter the URL of the human-authored source that inspired the code. This could be a GitHub discussion, a Stack Overflow answer, a documentation page, a blog post, or an RFC.</p>
<pre><code class="language-plaintext">  Human source URL (or Enter to finish): https://github.com/TanStack/query/discussions/123
  Author handle: tannerlinsley
  Platform (github/stackoverflow/docs/other): github
  Source title: Cursor pagination discussion
  What specific insight came from this? cursor beats offset for live-updating datasets
  Confidence HIGH/MEDIUM/LOW [MEDIUM]: HIGH
  ✔ Recorded.
</code></pre>
<p>Add as many sources as apply. Press Enter on a blank URL when you're done.</p>
<pre><code class="language-plaintext">  Human source URL (or Enter to finish): 
✔ Provenance saved. Run: python poc.py trace src/utils/parser.py
</code></pre>
<h3 id="heading-check-what-you-recorded">Check What You Recorded</h3>
<pre><code class="language-bash">python poc.py trace src/utils/parser.py
</code></pre>
<pre><code class="language-plaintext">Provenance trace: src/utils/parser.py
────────────────────────────────────────────────────────────
  [HIGH]  @tannerlinsley on github
          Cursor pagination discussion
          https://github.com/TanStack/query/discussions/123
          Insight: cursor beats offset for live-updating datasets
</code></pre>
<p>No knowledge gaps — because you recorded a source. If the file had parts with no human source, they would appear below as gaps.</p>
<h3 id="heading-see-all-experts-in-your-graph">See All Experts in Your Graph</h3>
<p>Every <code>poc.py add</code> call stores not just the URL but the author — their handle, platform, and the specific insight they contributed. Run it across enough files, and those authors accumulate into a <strong>knowledge graph</strong>: a local record of which human experts your codebase drew from, which files their knowledge shaped, and how many artifacts trace back to their work.</p>
<p><code>poc.py experts</code> surfaces the top contributors. On a new project, it'll be one or two entries. On a mature codebase, it becomes a map of whose knowledge is load-bearing — the people you'd want to consult if that part of the code ever needed to change.</p>
<pre><code class="language-bash">python poc.py experts
</code></pre>
<pre><code class="language-plaintext">Top Human Experts in Knowledge Graph
──────────────────────────────────────────────────────
  @tannerlinsley            github          1 artifact(s)
</code></pre>
<h2 id="heading-how-to-use-import-spec-to-seed-knowledge-gaps">How to Use import-spec to Seed Knowledge Gaps</h2>
<p>This is the most important command in the tool. It connects proof-of-contribution with spec-writer and makes Knowledge Gaps deterministic.</p>
<p>When you use spec-writer before building a feature, it generates an <code>## Assumptions to review</code> section — every implicit decision is impact-rated HIGH, MEDIUM, or LOW. The <code>import-spec</code> command reads that section and seeds those assumptions into the database as unresolved gaps before the agent writes a line of code.</p>
<p>After the agent builds, any assumption that made it into the implementation without a cited human source surfaces automatically in <code>poc.py trace</code>. You don't need to know which parts of the code are uncertain. The spec already told you.</p>
<h3 id="heading-step-1-create-a-test-spec">Step 1 — Create a Test Spec</h3>
<p>If you don't have a spec-writer output yet, create one manually to see how the import works.</p>
<p><strong>Mac and Linux:</strong></p>
<pre><code class="language-bash">cat &gt; test-spec.md &lt;&lt; 'EOF'
## Assumptions to review

1. SQLite is sufficient for single-developer use — Impact: HIGH
   Correct this if: you need team-shared provenance

2. Filepath is the artifact identifier — Impact: MEDIUM
   Correct this if: you use content hashing instead

3. REST pattern for any future API — Impact: LOW
   Correct this if: you prefer GraphQL
EOF
</code></pre>
<p><strong>Windows PowerShell:</strong></p>
<pre><code class="language-powershell">python -c "
content = '''## Assumptions to review

1. SQLite is sufficient for single-developer use - Impact: HIGH
   Correct this if: you need team-shared provenance

2. Filepath is the artifact identifier - Impact: MEDIUM
   Correct this if: you use content hashing instead

3. REST pattern for any future API - Impact: LOW
   Correct this if: you prefer GraphQL'''
open('test-spec.md', 'w', encoding='utf-8').write(content)
print('test-spec.md created')
"
</code></pre>
<p><strong>Windows note:</strong> don't use <code>echo</code> with output redirection to create spec files in Windows PowerShell 5.1. It saves redirected output as UTF-16, which causes a <code>UnicodeDecodeError</code> when <code>import-spec</code> reads the file. The <code>python -c</code> approach above writes UTF-8 correctly.</p>
<h3 id="heading-step-2-import-the-assumptions">Step 2 — Import the Assumptions</h3>
<pre><code class="language-bash">python poc.py import-spec test-spec.md --artifact src/utils/parser.py
</code></pre>
<pre><code class="language-plaintext">Spec assumptions imported — 3 Knowledge Gap(s) seeded
───────────────────────────────────────────────────────
  1. [HIGH] SQLite is sufficient for single-developer use
       Correct if: you need team-shared provenance
  2. [MEDIUM] Filepath is the artifact identifier
       Correct if: you use content hashing instead
  3. [LOW] REST pattern for any future API
       Correct if: you prefer GraphQL

  →  Bound to: src/utils/parser.py
  After the agent builds, run:
  python poc.py trace src/utils/parser.py
  python poc.py add src/utils/parser.py
</code></pre>
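<p>For intuition, the parsing behind an import like this can be sketched in a few lines. This is an illustration of the spec format from Step 1, not the parser in <code>poc.py</code>. The <code>parse_assumptions</code> name and both regular expressions are assumptions.</p>
<pre><code class="language-python">import re

# Matches "1. claim text - Impact: HIGH", with either a hyphen or an em dash (U+2014).
ITEM = re.compile(r"^\s*\d+\.\s+(.*?)\s+[\u2014-]\s+Impact:\s*(HIGH|MEDIUM|LOW)\s*$")
CORRECT = re.compile(r"^\s*Correct this if:\s*(.+?)\s*$")


def parse_assumptions(markdown):
    """Extract the assumptions listed under '## Assumptions to review'."""
    gaps = []
    in_section = False
    for line in markdown.splitlines():
        if line.startswith("## "):
            # Any other level-two heading ends the section.
            in_section = line.strip().lower() == "## assumptions to review"
            continue
        if not in_section:
            continue
        item = ITEM.match(line)
        if item:
            gaps.append({"claim": item.group(1), "impact": item.group(2), "correct_if": None})
            continue
        fix = CORRECT.match(line)
        if fix and gaps:
            gaps[-1]["correct_if"] = fix.group(1)
    return gaps
</code></pre>
<p>Accepting both separators means the Mac/Linux and Windows versions of the test spec parse the same way.</p>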
<h3 id="heading-step-3-trace-the-gaps">Step 3 — Trace the Gaps</h3>
<pre><code class="language-bash">python poc.py trace src/utils/parser.py
</code></pre>
<pre><code class="language-plaintext">Knowledge gaps (AI-synthesized, no human source):
  • REST pattern for any future API [Correct if: you prefer GraphQL]
  • SQLite is sufficient for single-developer use [Correct if: you need team-shared provenance]
  • Filepath is the artifact identifier [Correct if: you use content hashing instead]

  Resolve gaps: python poc.py add src/utils/parser.py
</code></pre>
<p>Three gaps, colour-coded by urgency. The HIGH-impact assumption — SQLite for single-developer use — appears in red. The LOW-impact one appears in green. When you run <code>poc.py add</code> and record a human source with an insight that overlaps the gap text, the gap auto-closes.</p>
<h3 id="heading-preview-without-writing">Preview Without Writing</h3>
<pre><code class="language-bash">python poc.py import-spec test-spec.md --dry-run
</code></pre>
<p>This parses the spec and prints what would be seeded without touching the database. This is useful before committing to an import.</p>
<h3 id="heading-check-the-overall-health">Check the Overall Health</h3>
<pre><code class="language-bash">python poc.py report
</code></pre>
<pre><code class="language-plaintext">Proof of Contribution Report
────────────────────────────────────────
  Artifacts tracked    : 1
  With provenance      : 0  (0%)
  Unresolved gaps      : 3
  Resolved claims      : 0
  Human experts        : 1
  ⚠ Less than 50% of artifacts have provenance records.
  ⚠ 3 unresolved Knowledge Gap(s).
    Run `poc.py trace &lt;filepath&gt;` to locate them.
</code></pre>
<h2 id="heading-how-to-trace-human-attribution">How to Trace Human Attribution</h2>
<p><code>poc.py trace</code> is the command you'll use most. It shows the full human attribution chain for any file and lists any knowledge gaps — parts of the code with no traceable human source.</p>
<pre><code class="language-bash">python poc.py trace src/utils/parser.py
</code></pre>
<p>A file with both attribution and gaps looks like this:</p>
<pre><code class="language-plaintext">Provenance trace: src/utils/parser.py
────────────────────────────────────────────────────────────
  [HIGH]  @juliandeangelis on github
          Spec Driven Development methodology
          https://github.com/dannwaneri/spec-writer
          Insight: separate functional from technical spec

  [MEDIUM] @tannerlinsley on github
           Cursor pagination discussion
           https://github.com/TanStack/query/discussions/123
           Insight: cursor beats offset for live-updating datasets

Knowledge gaps (AI-synthesized, no human source):
  • Error retry strategy — no human source cited
  • CSV column ordering — AI chose this arbitrarily

  Resolve gaps: python poc.py add src/utils/parser.py
</code></pre>
<p>The human attribution section shows every cited source, colour-coded by confidence. The knowledge gaps section shows every assumption that shipped without a human citation — either seeded from a spec via <code>import-spec</code>, or flagged by Claude in the Provenance Block.</p>
<h3 id="heading-resolving-gaps">Resolving Gaps</h3>
<p>Run <code>poc.py add</code> on any file with open gaps:</p>
<pre><code class="language-bash">python poc.py add src/utils/parser.py
</code></pre>
<p>When you enter an insight that shares words with an open gap claim, the gap auto-closes. Run <code>poc.py trace</code> again to confirm it's resolved.</p>
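<p>The exact matching rule lives in <code>poc.py</code>. As a rough mental model only, a word-overlap check could look like the sketch below. The stopword list, the two-word threshold, and the function names are assumptions, not the tool's implementation.</p>
<pre><code class="language-python">import re

# Hypothetical stopword list: common words that should not count as a match.
STOPWORDS = {"a", "an", "the", "is", "for", "of", "this", "to", "and", "in", "if", "any", "no"}


def keywords(text):
    """Lowercase the text and keep the words that carry meaning."""
    return {word for word in re.findall(r"[a-z0-9]+", text.lower()) if word not in STOPWORDS}


def closes_gap(insight, claim, min_shared=2):
    """True when the insight shares at least min_shared keywords with the gap claim."""
    shared = keywords(insight).intersection(keywords(claim))
    return len(shared) >= min_shared
</code></pre>
<p>Under a rule like this, a vague insight closes nothing, so it pays to describe the source's contribution in the same terms the gap uses.</p>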
<h2 id="heading-how-to-verify-with-static-analysis">How to Verify with Static Analysis</h2>
<p><code>poc.py verify</code> is the command that removes the need to trust the model's own account of its gaps. It detects Knowledge Gaps by analysing the file's actual code structure, not by asking the AI what it doesn't know.</p>
<p>Run it after the agent builds, once you've seeded gaps with <code>import-spec</code>:</p>
<pre><code class="language-bash">python poc.py verify src/utils/parser.py
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Verify: src/utils/parser.py
────────────────────────────────────────────────────────────
  Structural units detected : 11
  Seeded claims             : 3
  Covered by cited source   : 2
  Deterministic gaps        : 1

Deterministic Knowledge Gaps (no human source):
  • function: handle_concurrent_writes (lines 47–61)
      Seeded assumption: concurrent write handling — AI chose this arbitrarily

  Resolve: python poc.py add src/utils/parser.py
</code></pre>
<p>The gap shown is not something Claude admitted. It's something the analyser found by comparing the file's function list against your seeded claims. The function <code>handle_concurrent_writes</code> exists in the code but has no resolved human citation in the database. That's the gap.</p>
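<p>To make that comparison concrete, here is a minimal sketch of the same kind of check using Python's standard <code>ast</code> module. It is not the analyser inside <code>poc.py</code>: the <code>find_uncited_functions</code> helper and the <code>cited_names</code> set are hypothetical stand-ins for the tool's database lookup.</p>
<pre><code class="language-python">import ast

def find_uncited_functions(source: str, cited: set):
    """Return (name, first_line, last_line) for functions with no cited source."""
    tree = ast.parse(source)
    gaps = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.name not in cited:
                gaps.append((node.name, node.lineno, node.end_lineno))
    # Report gaps in file order
    return sorted(gaps, key=lambda gap: gap[1])

sample = '''
def parse_query(text):
    return text.strip()

def handle_concurrent_writes(rows):
    return list(rows)
'''

# Hypothetical: names that already have a resolved human citation
cited_names = {"parse_query"}

for name, start, end in find_uncited_functions(sample, cited_names):
    print(f"function: {name} (lines {start}-{end})")
</code></pre>
<p>The point of the sketch is that the function list comes from parsing the file, so the result doesn't depend on what the model chooses to report.</p>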
<h3 id="heading-what-the-exit-codes-mean">What the Exit Codes Mean</h3>
<pre><code class="language-bash">python poc.py verify src/utils/parser.py
echo $?   # Mac/Linux

python poc.py verify src/utils/parser.py
echo $LASTEXITCODE   # Windows PowerShell
</code></pre>
<ul>
<li><p><strong>Exit code 0</strong> — no gaps, all detected units have cited sources</p>
</li>
<li><p><strong>Exit code 1</strong> — gaps found, resolve with <code>poc.py add</code></p>
</li>
<li><p><strong>Exit code 2</strong> — file not found or unsupported language</p>
</li>
</ul>
<p>Because exit code 1 is non-zero, CI systems treat it as a failed step. Add <code>poc.py verify</code> to your GitHub Action or pre-commit hook and gaps block the build before they reach production.</p>
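<p>If you'd rather wrap the command in a small script for a hook, something like the following could work. It's a sketch that assumes <code>poc.py</code> sits in the current working directory, as in the commands above. The <code>describe_exit_code</code> and <code>verify_files</code> helpers are hypothetical names, and the messages simply restate the exit-code list.</p>
<pre><code class="language-python">import subprocess
import sys

# Meanings taken from the exit-code list above
EXIT_MEANINGS = {
    0: "no gaps: all detected units have cited sources",
    1: "gaps found: resolve with poc.py add",
    2: "file not found or unsupported language",
}

def describe_exit_code(code: int):
    """Translate a verify exit code into a readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")

def verify_files(paths: list):
    """Run poc.py verify on each path and return the highest exit code seen."""
    worst = 0
    for path in paths:
        result = subprocess.run([sys.executable, "poc.py", "verify", path])
        print(f"{path}: {describe_exit_code(result.returncode)}")
        worst = max(worst, result.returncode)
    return worst

print(describe_exit_code(1))
</code></pre>
<p>In a hook, you would end the script with <code>sys.exit(verify_files(sys.argv[1:]))</code> so that any non-zero result stops the commit or the build.</p>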
<h3 id="heading-run-it-without-a-seeded-spec">Run it Without a Seeded Spec</h3>
<p>If you haven't run <code>import-spec</code> first, <code>verify</code> still works — it falls back to structural analysis and surfaces every uncited function and branch as a gap:</p>
<pre><code class="language-bash">python poc.py verify src/utils/parser.py
</code></pre>
<pre><code class="language-plaintext">⚠ No spec imported — showing all uncited structural units.
  Run: python poc.py import-spec spec.md --artifact src/utils/parser.py
  for deterministic gap detection.

Deterministic Knowledge Gaps (no human source):
  • function: parse_query (lines 1–7)
  • branch: if not text (lines 2–3)
  • function: fetch_results (lines 9–12)
  ...
</code></pre>
<p>It's less precise than the spec-writer path — every structural unit shows rather than only the ones tied to named assumptions — but it's useful as a baseline on any file, new or old.</p>
<h3 id="heading-the-strict-flag">The <code>--strict</code> Flag</h3>
<pre><code class="language-bash">python poc.py verify src/utils/parser.py --strict
</code></pre>
<p>Strict mode flags every uncited structural unit as a gap even when claims are seeded. You can use it when you want zero tolerance: any function or branch without a resolved human source fails the check.</p>
<h2 id="heading-how-to-enable-pr-enforcement">How to Enable PR Enforcement</h2>
<p>Once <code>poc.py trace</code> has saved you real hours — not before — enable the GitHub Action. The distinction matters. Turning it on day one frames the tool as overhead. Turning it on after the team already finds value frames it as a standard.</p>
<pre><code class="language-bash">git add .github/ .poc/config.json poc.py
git commit -m "chore: add proof-of-contribution"
git push
</code></pre>
<p>After that, every PR is checked for an <code>## 🤖 AI Provenance</code> section. The scaffold already created the PR template with that section included. Developers fill it in naturally once they're already running <code>poc.py trace</code> locally — the template just asks them to record what they already know.</p>
<p>Developers who write fully human code opt out by adding <code>100% human-written</code> anywhere in the PR body. The action skips the check automatically.</p>
<h3 id="heading-what-the-action-checks">What the Action Checks</h3>
<p>The action reads the PR description and looks for:</p>
<ol>
<li><p>The <code>## 🤖 AI Provenance</code> heading</p>
</li>
<li><p>At least one populated row in the attribution table</p>
</li>
</ol>
<p>If the section is missing or the table is empty, the action fails and posts a comment explaining what to add. The comment includes a link to <code>poc.py trace &lt;filepath&gt;</code> so the developer knows exactly where to look.</p>
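<p>The logic of that check can be sketched in a few lines of Python. This is an illustration, not the action's actual code. It assumes the attribution table is a Markdown pipe table with a header row and a separator row, and the column names in the example are made up.</p>
<pre><code class="language-python">HEADING = "## 🤖 AI Provenance"
OPT_OUT = "100% human-written"

def check_pr_body(body: str):
    """Return (passed, message) for a PR description."""
    if OPT_OUT in body:
        return True, "opt-out marker found, check skipped"
    if HEADING not in body:
        return False, "missing AI Provenance section"

    # Only look at the text after the heading
    section = body.split(HEADING, 1)[1]
    rows = [line.strip() for line in section.splitlines()
            if line.strip().startswith("|")]

    # Assumption: rows[0] is the header and rows[1] is the separator row
    for row in rows[2:]:
        cells = [cell.strip() for cell in row.strip("|").split("|")]
        if any(cells):
            return True, "attribution table has at least one populated row"
    return False, "attribution table is empty"

# Hypothetical PR description with one populated row
example = "\n".join([
    "## 🤖 AI Provenance",
    "| Source | Insight |",
    "|--------|---------|",
    "| @tannerlinsley | cursor beats offset |",
])
print(check_pr_body(example))
</code></pre>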
<h2 id="heading-where-to-go-next">Where to Go Next</h2>
<h3 id="heading-use-it-with-spec-writer-on-a-real-feature">Use it with spec-writer on a Real Feature</h3>
<p>The real value of <code>import-spec</code> is on actual features, not test specs. If you use <a href="https://github.com/dannwaneri/spec-writer">spec-writer</a>, the workflow is:</p>
<pre><code class="language-plaintext">/spec-writer "your feature description"
</code></pre>
<p>Save the output to <code>spec.md</code>. Then:</p>
<pre><code class="language-bash">python poc.py import-spec spec.md --artifact src/path/to/output.py
</code></pre>
<p>Build the feature with your agent. Then run <code>poc.py trace</code> to see which assumptions made it into code with no human source. Resolve the HIGH-impact gaps first — those are the ones that will cause production incidents.</p>
<h3 id="heading-activate-the-claude-code-skill">Activate the Claude Code Skill</h3>
<p>The SKILL.md file makes Claude automatically append a Provenance Block to every generated artifact when the skill is active. The block lists human sources Claude drew from and flags what it synthesized without any traceable source.</p>
<p>You don't need to activate the skill manually. It's already installed at <code>~/.claude/skills/proof-of-contribution/</code>, and Claude Code loads it automatically when you are in a project that has <code>.poc/config.json</code>.</p>
<p>A generated Provenance Block looks like this:</p>
<pre><code class="language-plaintext">## PROOF OF CONTRIBUTION
Generated artifact: fetch_github_discussions()
Confidence: MEDIUM

## HUMAN SOURCES THAT INSPIRED THIS

[1] GitHub GraphQL API Documentation Team
    Source type: Official Docs
    URL: docs.github.com/en/graphql
    Contribution: cursor-based pagination pattern

[2] GitHub Community (multiple contributors)
    Source type: GitHub Discussions
    URL: github.com/community/community
    Contribution: "ghost" fallback for deleted accounts
                  surfaced in bug reports

## KNOWLEDGE GAPS (AI synthesized, no human cited)
- Error handling / retry logic
- Rate limit strategy

## RECOMMENDED HUMAN EXPERTS TO CONSULT
- github.com/octokit community for pagination
</code></pre>
<p>The Knowledge Gaps section is the part no other tool produces. It's where AI admits what it synthesized without a traceable human source — before that gap becomes a production incident.</p>
<h3 id="heading-upgrade-when-you-outgrow-sqlite">Upgrade When You Outgrow SQLite</h3>
<p>The default database is SQLite — local only, no infra required. When you need team sharing or graph queries, the <code>references/</code> directory in the repo has migration guides:</p>
<table>
<thead>
<tr>
<th>Need</th>
<th>File</th>
</tr>
</thead>
<tbody><tr>
<td>Team sharing a provenance DB</td>
<td><code>references/relational-schema.md</code></td>
</tr>
<tr>
<td>Graph traversal queries</td>
<td><code>references/neo4j-implementation.md</code></td>
</tr>
<tr>
<td>Semantic web / interoperability</td>
<td><code>references/jsonld-schema.md</code></td>
</tr>
</tbody></table>
<h2 id="heading-manual-tracking-vs-proof-of-contribution">Manual Tracking vs. proof-of-contribution</h2>
<table>
<thead>
<tr>
<th></th>
<th>Manual tracking</th>
<th>proof-of-contribution</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Finding who wrote the code</strong></td>
<td>Search Slack, ask the team, dig through commits</td>
<td><code>poc.py trace &lt;file&gt;</code> — thirty seconds</td>
</tr>
<tr>
<td><strong>Knowing which parts the AI guessed</strong></td>
<td>You don't, until it breaks in production</td>
<td>Knowledge Gaps section — surfaced before the code ships</td>
</tr>
<tr>
<td><strong>Detecting gaps after the build</strong></td>
<td>Code review, if someone notices</td>
<td><code>poc.py verify</code> — static analysis, zero API calls</td>
</tr>
<tr>
<td><strong>Enforcing attribution on PRs</strong></td>
<td>Honor system</td>
<td>GitHub Action fails the PR if attribution is missing</td>
</tr>
<tr>
<td><strong>Connecting to your spec</strong></td>
<td>Copy-paste assumptions into comments manually</td>
<td><code>poc.py import-spec</code> seeds them as tracked claims automatically</td>
</tr>
<tr>
<td><strong>Infrastructure required</strong></td>
<td>None (usually a spreadsheet or nothing)</td>
<td>None — SQLite, pure Python, no paid services</td>
</tr>
</tbody></table>
<p>The tool doesn't replace code review. It gives code review the context it needs to catch the right things.</p>
<p>The archaeology scenario — two days tracing a bug through dead-end commit messages — takes thirty seconds with <code>poc.py trace</code>. The code still has gaps, and it always will. But now you know where they are.</p>
<p><em>Built by</em> <a href="https://dev.to/dannwaneri"><em>Daniel Nwaneri</em></a><em>. The spec-writer skill that feeds</em> <code>import-spec</code> <em>is at</em> <a href="https://github.com/dannwaneri/spec-writer"><em>github.com/dannwaneri/spec-writer</em></a><em>. The full proof-of-contribution repo is at</em> <a href="https://github.com/dannwaneri/proof-of-contribution"><em>github.com/dannwaneri/proof-of-contribution</em></a><em>.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Efficient Data Processing in Python: Batch vs Streaming Pipelines Explained ]]>
                </title>
                <description>
                    <![CDATA[ Every data pipeline makes a fundamental choice before any code is written: does it process data in chunks on a schedule, or does it process data continuously as it arrives? This choice — batch versus  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/efficient-data-processing-in-python-batch-vs-streaming-pipelines/</link>
                <guid isPermaLink="false">69dcf4dbf57346bc1e06d19b</guid>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Bala Priya C ]]>
                </dc:creator>
                <pubDate>Mon, 13 Apr 2026 13:51:23 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/0cd359d4-9628-4b17-8dc4-a3a2a83172c8.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every data pipeline makes a fundamental choice before any code is written: does it process data in chunks on a schedule, or does it process data continuously as it arrives?</p>
<p>This choice — batch versus streaming — shapes the architecture of everything downstream. The tools you use, the guarantees you can make about data freshness, the complexity of your error handling, and the infrastructure you need to run it all follow directly from this decision.</p>
<p>Getting it wrong is expensive. Teams that build streaming pipelines when batch would have sufficed end up maintaining complex infrastructure for a problem that didn't require it.</p>
<p>Teams that build batch pipelines when their use case demands real-time processing discover the gap at the worst possible moment — when a stakeholder asks why the dashboard is six hours out of date.</p>
<p>In this article, you'll learn what batch and streaming pipelines actually are, how they differ in terms of architecture and tradeoffs, and how to implement both patterns in Python. By the end, you'll have a clear framework for choosing the right approach for any data engineering problem you solve.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along comfortably, make sure you have:</p>
<ul>
<li><p>Practice writing Python functions and working with modules</p>
</li>
<li><p>Familiarity with pandas DataFrames and basic data manipulation</p>
</li>
<li><p>A general understanding of what ETL pipelines do — extract, transform, load</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-is-a-batch-pipeline">What Is a Batch Pipeline?</a></p>
<ul>
<li><p><a href="#heading-implementing-a-batch-pipeline-in-python">Implementing a Batch Pipeline in Python</a></p>
</li>
<li><p><a href="#heading-when-batch-works-well">When Batch Works Well</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-what-is-a-streaming-pipeline">What Is a Streaming Pipeline?</a></p>
<ul>
<li><p><a href="#heading-implementing-a-streaming-pipeline-in-python">Implementing a Streaming Pipeline in Python</a></p>
</li>
<li><p><a href="#heading-when-streaming-works-well">When Streaming Works Well</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-the-key-differences-at-a-glance">The Key Differences at a Glance</a></p>
</li>
<li><p><a href="#heading-choosing-between-batch-and-streaming">Choosing Between Batch and Streaming</a></p>
</li>
<li><p><a href="#heading-the-hybrid-pattern-lambda-and-kappa-architectures">The Hybrid Pattern: Lambda and Kappa Architectures</a></p>
</li>
</ul>
<h2 id="heading-what-is-a-batch-pipeline">What Is a Batch Pipeline?</h2>
<p>A batch pipeline processes a bounded, finite collection of records together: a file, a database snapshot, or a day's worth of transactions. It runs on a schedule (hourly, nightly, or weekly), reads all the data for that period, transforms it, and writes the result somewhere. Then it stops and waits until the next run.</p>
<p>The mental model is simple: <strong>collect, then process</strong>. Nothing happens between runs.</p>
<p>In a retail ETL context, a typical batch pipeline might look like this:</p>
<ol>
<li><p>At midnight, extract all orders placed in the last 24 hours from the transactional database</p>
</li>
<li><p>Join with the product catalogue and customer dimension tables</p>
</li>
<li><p>Compute daily revenue aggregates by region and product category</p>
</li>
<li><p>Load the results into the data warehouse for reporting</p>
</li>
</ol>
<p>The pipeline runs, finishes, and produces a complete, consistent snapshot of yesterday's business. By the time analysts arrive in the morning, the warehouse is up to date.</p>
<h3 id="heading-implementing-a-batch-pipeline-in-python">Implementing a Batch Pipeline in Python</h3>
<p>A batch pipeline in its simplest form is a Python script with three clearly separated stages: extract, transform, load.</p>
<pre><code class="language-python">import pandas as pd

def extract(filepath: str) -&gt; pd.DataFrame:
    """Load raw orders from a daily export file."""
    df = pd.read_csv(filepath, parse_dates=["order_timestamp"])
    return df

def transform(df: pd.DataFrame) -&gt; pd.DataFrame:
    """Clean and aggregate orders into daily revenue by region."""
    # Filter to completed orders only
    df = df[df["status"] == "completed"].copy()

    # Extract date from timestamp for grouping
    df["order_date"] = df["order_timestamp"].dt.date

    # Aggregate: total revenue and order count per region per day
    summary = (
        df.groupby(["order_date", "region"])
        .agg(
            total_revenue=("order_value_gbp", "sum"),
            order_count=("order_id", "count"),
            avg_order_value=("order_value_gbp", "mean"),
        )
        .reset_index()
    )
    return summary

def load(df: pd.DataFrame, output_path: str) -&gt; None:
    """Write the aggregated result to the warehouse (here, a CSV)."""
    df.to_csv(output_path, index=False)
    print(f"Loaded {len(df)} rows to {output_path}")

# Run the pipeline
raw = extract("orders_2024_06_01.csv")
aggregated = transform(raw)
load(aggregated, "warehouse/daily_revenue_2024_06_01.csv")
</code></pre>
<p>Let's walk through what this code is doing:</p>
<ul>
<li><p><code>extract</code> reads a CSV file representing a daily order export. The <code>parse_dates</code> argument tells pandas to interpret the <code>order_timestamp</code> column as a datetime object rather than a plain string — this matters for the date extraction step in transform.</p>
</li>
<li><p><code>transform</code> does two things: it filters out any orders that didn't complete (returns, cancellations), and then groups the remaining orders by date and region to produce revenue aggregates. The <code>.agg()</code> call computes three metrics per group in a single pass.</p>
</li>
<li><p><code>load</code> writes the result to a destination — in production this would be a database insert or a cloud storage upload, but the pattern is the same regardless.</p>
</li>
</ul>
<p>The three functions are deliberately kept separate. This separation — extract, transform, load — makes each stage independently testable, replaceable, and debuggable. If the transform logic changes, you don't need to modify the extract or load code.</p>
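<p>The example hard-codes one day's file names. A scheduled job usually derives them from the run date instead, which also makes backfills straightforward. Here is a minimal sketch of that idea. The <code>paths_for</code> and <code>backfill_dates</code> helpers are hypothetical, and the naming pattern is copied from the file names used above.</p>
<pre><code class="language-python">from datetime import date, timedelta

def paths_for(run_date: date):
    """Build the input and output paths for one daily run."""
    stamp = run_date.strftime("%Y_%m_%d")
    return f"orders_{stamp}.csv", f"warehouse/daily_revenue_{stamp}.csv"

def backfill_dates(start: date, end: date):
    """Yield every date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)

# A nightly job would process yesterday's export
yesterday = date.today() - timedelta(days=1)
print(paths_for(yesterday))

# A backfill reuses the same function over a date range
for day in backfill_dates(date(2024, 6, 1), date(2024, 6, 3)):
    source_path, output_path = paths_for(day)
    print(source_path, output_path)
</code></pre>
<p>Each pair of paths can then be passed to <code>extract</code> and <code>load</code>, so rerunning a given date always reads and writes the same files.</p>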
<h3 id="heading-when-batch-works-well">When Batch Works Well</h3>
<p>Batch pipelines are the right choice when:</p>
<ul>
<li><p><strong>Data freshness requirements are measured in hours, not seconds.</strong> A daily sales report doesn't need to be updated every minute. A weekly marketing attribution model certainly doesn't.</p>
</li>
<li><p><strong>You're processing large historical datasets.</strong> Backfilling two years of transaction history into a new data warehouse is inherently a batch job — the data exists, it's bounded, and you want to process it as efficiently as possible in one run.</p>
</li>
<li><p><strong>Consistency matters more than latency.</strong> Batch pipelines produce complete, point-in-time snapshots. Every row in the output was computed from the same input state. This consistency is valuable for financial reporting, regulatory compliance, and any downstream process that requires a stable, reproducible dataset.</p>
</li>
</ul>
<h2 id="heading-what-is-a-streaming-pipeline">What Is a Streaming Pipeline?</h2>
<p>A streaming pipeline processes data continuously, record by record or in small micro-batches, as it arrives. There is no "end" to the dataset — the pipeline runs indefinitely, consuming events from a source like a message queue, a Kafka topic, or a webhook, and processing each one as it comes in.</p>
<p>The mental model is: <strong>process as you collect</strong>. The pipeline is always running.</p>
<p>In the same retail ETL context, a streaming pipeline might handle order events as they're placed:</p>
<ol>
<li><p>An order is placed on the website and an event is published to a message queue</p>
</li>
<li><p>The streaming pipeline consumes the event within milliseconds</p>
</li>
<li><p>It validates, enriches, and routes the event to downstream systems</p>
</li>
<li><p>The fraud detection service, the inventory system, and the real-time dashboard all receive updated information immediately</p>
</li>
</ol>
<p>The difference from batch is fundamental: the data isn't sitting in a file waiting to be processed. It's flowing, and the pipeline has to keep up.</p>
<h3 id="heading-implementing-a-streaming-pipeline-in-python">Implementing a Streaming Pipeline in Python</h3>
<p>Python's generator functions are the natural building block for streaming pipelines. A generator produces values one at a time and pauses between yields — which maps directly onto the idea of processing records as they arrive without loading everything into memory.</p>
<pre><code class="language-python">import json
import time
from typing import Generator, Dict

def event_source(filepath: str) -&gt; Generator[Dict, None, None]:
    """
    Simulate a stream of order events from a file.
    In production, this would consume from Kafka or a message queue.
    """
    with open(filepath, "r") as f:
        for line in f:
            event = json.loads(line.strip())
            yield event
            time.sleep(0.01)  # simulate arrival delay between events

def validate(event: Dict) -&gt; bool:
    """Check that the event has the required fields and valid values."""
    required_fields = ["order_id", "customer_id", "order_value_gbp", "region"]
    if not all(field in event for field in required_fields):
        return False
    if event["order_value_gbp"] &lt;= 0:
        return False
    return True

def enrich(event: Dict) -&gt; Dict:
    """Add derived fields to the event before routing downstream."""
    event["processed_at"] = time.strftime("%Y-%m-%dT%H:%M:%S")
    event["value_tier"] = (
        "high"   if event["order_value_gbp"] &gt;= 500
        else "mid"    if event["order_value_gbp"] &gt;= 100
        else "low"
    )
    return event

def run_streaming_pipeline(source_file: str) -&gt; None:
    """Process each event as it arrives from the source."""
    processed = 0
    skipped = 0

    for raw_event in event_source(source_file):
        if not validate(raw_event):
            skipped += 1
            continue

        enriched_event = enrich(raw_event)

        # In production: publish to downstream topic or write to sink
        print(f"[{enriched_event['processed_at']}] "
              f"Order {enriched_event['order_id']} | "
              f"£{enriched_event['order_value_gbp']:.2f} | "
              f"tier={enriched_event['value_tier']}")
        processed += 1

    print(f"\nDone. Processed: {processed} | Skipped: {skipped}")

run_streaming_pipeline("order_events.jsonl")
</code></pre>
<p>Here's what's happening:</p>
<ul>
<li><p><code>event_source</code> is a generator function — note the <code>yield</code> keyword instead of <code>return</code>. Each call to <code>yield event</code> pauses the function and hands one event to the caller. The pipeline processes that event before the generator resumes and fetches the next one. This means only one event is in memory at a time, regardless of how large the stream is. The <code>time.sleep(0.01)</code> simulates the real-world delay between events arriving from a message queue.</p>
</li>
<li><p><code>validate</code> checks each event for required fields and valid values before doing anything else with it. In a streaming context, bad events are super common — network issues, upstream bugs, and schema changes all produce malformed records. Validating early and skipping invalid events is far safer than letting them propagate into downstream systems.</p>
</li>
<li><p><code>enrich</code> adds derived fields to the event: here, a processing timestamp and a value tier classification. In production, this step might also join against a lookup table, call an external API, or apply a model prediction.</p>
</li>
<li><p><code>run_streaming_pipeline</code> ties it together. The <code>for</code> loop over <code>event_source</code> consumes events one at a time, processes each through the <code>validate → enrich → route</code> stages, and keeps a running count of processed and skipped events.</p>
</li>
</ul>
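<p>The pipeline above only counts the events it skips. A common refinement is to keep rejected events somewhere they can be inspected or replayed later, often called a dead-letter queue. The sketch below shows the idea with a plain list standing in for that queue. The <code>with_dead_letters</code> and <code>has_order_id</code> names are hypothetical, and in a real system the sink would be a topic or table rather than a list.</p>
<pre><code class="language-python">from typing import Callable, Dict, Iterable, List

def with_dead_letters(events: Iterable[Dict],
                      is_valid: Callable[[Dict], bool],
                      dead_letters: List[Dict]):
    """Yield valid events and record invalid ones in dead_letters."""
    for event in events:
        if is_valid(event):
            yield event
        else:
            # Keep the original payload so it can be inspected or replayed
            dead_letters.append({"event": event, "reason": "failed validation"})

# Stand-in validator; with the pipeline above you would pass validate instead
def has_order_id(event: Dict):
    return "order_id" in event

sample_events = [{"order_id": "A1", "region": "north"}, {"region": "south"}]
rejected: List[Dict] = []

for event in with_dead_letters(sample_events, has_order_id, rejected):
    print("accepted:", event)
print("dead-lettered:", len(rejected))
</code></pre>
<p>Because the helper is itself a generator, it slots into the existing <code>for</code> loop without loading the stream into memory.</p>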
<h3 id="heading-when-streaming-works-well">When Streaming Works Well</h3>
<p>Streaming pipelines are the right choice when:</p>
<ul>
<li><p><strong>Data freshness is measured in seconds or milliseconds.</strong> Fraud detection, real-time inventory updates, live dashboards, and alerting systems all require data to be processed immediately — a batch job running every hour would make them useless.</p>
</li>
<li><p><strong>The data volume is too large to accumulate.</strong> High-frequency IoT sensor data, clickstream events, and financial tick data can generate millions of records per hour. Accumulating all of that before processing is often impractical: you'd need enormous storage, and the processing job would take too long to be useful.</p>
</li>
<li><p><strong>You need to react, not just report.</strong> Streaming pipelines can trigger downstream actions — send a notification, block a transaction, update a recommendation — in response to individual events. Batch pipelines can only report on what already happened.</p>
</li>
</ul>
<h2 id="heading-the-key-differences-at-a-glance">The Key Differences at a Glance</h2>
<p>Here is an overview of the differences between batch and stream processing we've discussed thus far:</p>
<table>
<thead>
<tr>
<th><strong>DIMENSION</strong></th>
<th><strong>BATCH</strong></th>
<th><strong>STREAMING</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Data model</strong></td>
<td>Bounded, finite dataset</td>
<td>Unbounded, continuous flow</td>
</tr>
<tr>
<td><strong>Processing trigger</strong></td>
<td>Schedule (time or event)</td>
<td>Arrival of each record</td>
</tr>
<tr>
<td><strong>Latency</strong></td>
<td>Minutes to hours</td>
<td>Milliseconds to seconds</td>
</tr>
<tr>
<td><strong>Throughput</strong></td>
<td>High (optimized for bulk processing)</td>
<td>Lower (per-record overhead)</td>
</tr>
<tr>
<td><strong>Complexity</strong></td>
<td>Lower</td>
<td>Higher</td>
</tr>
<tr>
<td><strong>State management</strong></td>
<td>Stateless per run</td>
<td>Often stateful across events</td>
</tr>
<tr>
<td><strong>Error handling</strong></td>
<td>Retry the whole job</td>
<td>Per-event dead-letter queues</td>
</tr>
<tr>
<td><strong>Consistency</strong></td>
<td>Strong (point-in-time snapshot)</td>
<td>Eventually consistent</td>
</tr>
<tr>
<td><strong>Best for</strong></td>
<td>Reporting, ML training, backfills</td>
<td>Alerting, real-time features, event routing</td>
</tr>
</tbody></table>
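<p>The state management row is easier to see in code. In the sketch below, a running revenue total per region is carried from one event to the next, which a batch job would simply recompute on each run. The <code>running_revenue_by_region</code> function is a hypothetical example, and the state lives only in memory. Streaming frameworks typically persist it so that it survives restarts.</p>
<pre><code class="language-python">from collections import defaultdict
from typing import Dict, Iterable

def running_revenue_by_region(events: Iterable[Dict]):
    """Yield the running revenue total for a region after each event."""
    totals = defaultdict(float)  # state that persists across events
    for event in events:
        totals[event["region"]] += event["order_value_gbp"]
        yield event["region"], totals[event["region"]]

sample_events = [
    {"region": "north", "order_value_gbp": 120.0},
    {"region": "south", "order_value_gbp": 40.0},
    {"region": "north", "order_value_gbp": 80.0},
]

for region, total in running_revenue_by_region(sample_events):
    print(region, total)
</code></pre>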
<h2 id="heading-choosing-between-batch-and-streaming">Choosing Between Batch and Streaming</h2>
<p>Okay, all of this info is great. But <em>how</em> do you choose between batch and stream processing? The decision comes down to three questions:</p>
<p><strong>How fresh does the data need to be?</strong> If stakeholders can tolerate results that are hours old, batch is simpler and more cost-effective. If they need results within seconds, streaming is unavoidable.</p>
<p><strong>How complex is your processing logic?</strong> Batch jobs can join across large datasets, run expensive aggregations, and apply complex business logic without worrying about latency. Streaming pipelines must process each event quickly, which constrains how much work you can do per record.</p>
<p><strong>What's your operational capacity?</strong> Streaming infrastructure — Kafka clusters, Flink or Spark Streaming jobs, dead-letter queues, exactly-once delivery guarantees — is significantly more complex to operate than a scheduled Python script. If your team is small or your use case doesn't demand real-time results, that complexity is cost without benefit.</p>
<p>Start with batch. It's simpler to build, simpler to test, simpler to debug, and simpler to maintain. Move to streaming when a specific, concrete requirement — not a hypothetical future one — makes batch insufficient. Most data problems are batch problems, and the ones that genuinely require streaming are usually obvious when you run into them.</p>
<p>And as you might have guessed, some data processing systems need both, which is why hybrid approaches exist.</p>
<h2 id="heading-the-hybrid-pattern-lambda-and-kappa-architectures">The Hybrid Pattern: Lambda and Kappa Architectures</h2>
<p>In practice, many production data systems use both patterns together. The two most common hybrid architectures are Lambda and Kappa.</p>
<p><a href="https://www.databricks.com/glossary/lambda-architecture"><strong>Lambda architecture</strong></a> runs a batch layer and a streaming layer in parallel. The batch layer processes complete historical data and produces accurate, consistent results on a delay. The streaming layer processes live data and produces approximate results immediately. Downstream consumers merge both outputs — using the streaming result for freshness and the batch result for correctness.</p>
<p>The tradeoff is operational complexity: you're maintaining two separate processing codebases that must produce semantically equivalent results.</p>
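<p>The merge step in a Lambda setup can be as simple as combining two lookups. The sketch below assumes the batch view holds revenue totals up to the last batch run and the streaming view holds totals only for events since then, so the two can be added. Both views and the <code>merge_views</code> helper are hypothetical, and metrics that aren't simple sums need a different merge rule.</p>
<pre><code class="language-python">from typing import Dict

def merge_views(batch_view: Dict[str, float], speed_view: Dict[str, float]):
    """Combine batch totals with streaming totals accumulated since the last batch run."""
    merged = dict(batch_view)
    for region, recent_total in speed_view.items():
        merged[region] = merged.get(region, 0.0) + recent_total
    return merged

batch_view = {"north": 1000.0, "south": 750.0}   # accurate, but hours old
speed_view = {"north": 25.0, "west": 10.0}       # fresh, covers recent events only

print(merge_views(batch_view, speed_view))
</code></pre>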
<p><a href="https://hazelcast.com/glossary/kappa-architecture/"><strong>Kappa architecture</strong></a> simplifies this by using only a streaming layer, but with the ability to replay historical data through the same pipeline when you need batch-style reprocessing. This works well when your streaming stack, such as <a href="https://kafka.apache.org/documentation/">Apache Kafka</a> paired with <a href="https://flink.apache.org/">Apache Flink</a>, supports log retention and replay. You get one codebase, one set of logic, and the ability to reprocess history when your pipeline changes.</p>
<p>Neither architecture is universally better. Lambda is more common in organizations that adopted batch processing first and added streaming incrementally. Kappa is more common in systems designed with streaming as the primary pattern.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Batch and streaming are tools with different tradeoffs, each suited to a different class of problems. Batch pipelines excel at consistency, simplicity, and bulk throughput. Streaming pipelines excel at latency, reactivity, and continuous processing.</p>
<p>Understanding both patterns at the architectural level — before reaching for specific frameworks like Apache Spark, Kafka, or Flink — gives you the judgment to choose the right one and explain that choice clearly. The frameworks implement these patterns, while the judgment about which pattern fits your problem is yours to make first.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Positioning-Based Crude Oil Strategy in Python [Full Handbook] ]]>
                </title>
                <description>
                    <![CDATA[ Commitment of Traders (COT) data gets referenced a lot in commodity trading, especially when people talk about crowded positioning, speculative sentiment, or reversal risk. But most of that discussion ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-positioning-based-crude-oil-strategy-in-python/</link>
                <guid isPermaLink="false">69d91ddfc8e5007ddbc0e7ca</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ stockmarket ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nikhil Adithyan ]]>
                </dc:creator>
                <pubDate>Fri, 10 Apr 2026 15:57:19 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/c18002cf-6519-4b76-b068-3b443cb0f347.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Commitment of Traders (COT) data gets referenced a lot in commodity trading, especially when people talk about crowded positioning, speculative sentiment, or reversal risk. But most of that discussion stays at the idea level. It rarely becomes a rule that can actually be tested.</p>
<p>That was the starting point for this project.</p>
<p>I wanted to see whether crude oil positioning data could be turned into something more useful than a vague market read. Not a polished macro narrative. An actual strategy framework that could be coded, tested, and challenged.</p>
<p>The goal here was not to begin with a finished strategy. It was to start with a reasonable hypothesis, build the signal step by step, and see what survived once the data was involved.</p>
<p>For this, I used FinancialModelingPrep’s Commitment of Traders data along with historical West Texas Intermediate (WTI) crude oil prices. The first idea was simple: if speculative positioning becomes extreme, maybe that tells us something about what crude oil might do next. But as the build progressed, that idea had to be narrowed, filtered, and reworked before it became usable.</p>
<p>So this article is not a clean showcase of a strategy that worked on the first try. It's the full process of getting there.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-the-initial-idea-use-positioning-extremes-to-define-market-regimes">The Initial Idea: Use Positioning Extremes to Define Market Regimes</a></p>
</li>
<li><p><a href="#heading-importing-packages">Importing Packages</a></p>
</li>
<li><p><a href="#heading-pulling-the-data-cot-wti-crude-prices-using-fmp-apis">Pulling the Data: COT + WTI Crude Prices using FMP APIs</a></p>
</li>
<li><p><a href="#heading-turning-raw-cot-data-into-usable-features">Turning Raw COT Data Into Usable Features</a></p>
</li>
<li><p><a href="#heading-building-the-first-version-of-the-regime-model">Building the First Version of the Regime Model</a></p>
</li>
<li><p><a href="#heading-first-test-what-happens-after-each-regime">First Test: What Happens After Each Regime?</a></p>
</li>
<li><p><a href="#heading-looking-at-the-regimes-more-closely">Looking at the Regimes More Closely</a></p>
</li>
<li><p><a href="#heading-narrowing-the-focus-keeping-two-extra-variants-for-comparison">Narrowing the Focus: Keeping Two Extra Variants for Comparison</a></p>
</li>
<li><p><a href="#heading-building-the-first-trade-rules">Building the First Trade Rules</a></p>
</li>
<li><p><a href="#heading-comparing-bullish-unwind-against-buy-and-hold">Comparing Bullish Unwind Against Buy-and-Hold</a></p>
</li>
<li><p><a href="#heading-adding-a-trend-filter">Adding a Trend Filter</a></p>
</li>
<li><p><a href="#heading-stress-testing-the-setup">Stress-Testing the Setup</a></p>
</li>
<li><p><a href="#heading-the-final-strategy">The Final Strategy</a></p>
</li>
<li><p><a href="#heading-further-improvements">Further Improvements</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along with this article, you'll need a basic familiarity with Python and the pandas library, as we'll do most of the data manipulation and analysis using DataFrames. The following packages should be installed in your environment: <code>requests</code>, <code>numpy</code>, <code>pandas</code>, and <code>matplotlib</code>.</p>
<p>You'll also need a FinancialModelingPrep API key, which is required to pull both the COT and WTI crude oil price data. If you don't have one, you can register for a free account on the FinancialModelingPrep website.</p>
<p>Finally, a general understanding of what the Commitment of Traders report is and what non-commercial positioning represents will help you follow the reasoning behind the signal construction, though it's not strictly necessary to get value from the code itself.</p>
<p>This article also assumes some baseline familiarity with financial markets and trading concepts. If terms like long and short positioning, open interest, or speculative sentiment are unfamiliar, it may be worth spending a little time with those before diving in.</p>
<h2 id="heading-the-initial-idea-use-positioning-extremes-to-define-market-regimes">The Initial Idea: Use Positioning Extremes to Define Market Regimes</h2>
<p>The first version of the idea was not a trading rule. It was a framework.</p>
<p>If speculative positioning in crude oil becomes extreme, that probably means different things depending on what happens next. A market that is heavily long and still getting more crowded is not the same as a market that is heavily long but starting to unwind. The same logic applies on the bearish side too.</p>
<p>So instead of forcing one blunt signal like “extreme long means short” or “extreme short means buy,” I started by splitting the market into regimes.</p>
<p>The two variables I used were simple. First, how extreme positioning is relative to recent history. Second, whether that positioning is still building or starting to reverse.</p>
<p>That gave me four possible states:</p>
<ul>
<li><p>bullish buildup</p>
</li>
<li><p>bullish unwind</p>
</li>
<li><p>bearish buildup</p>
</li>
<li><p>bearish unwind</p>
</li>
</ul>
<p>This felt like a better starting point than jumping straight into a strategy. It let me treat COT data as a way to describe market state first, then test whether any of those states actually led to useful price behavior.</p>
<p>At this stage, I still didn't know whether any of these regimes would hold up. The point was just to create a structure that could be tested properly.</p>
<h2 id="heading-importing-packages">Importing Packages</h2>
<p>We’ll keep the package imports minimal.</p>
<pre><code class="language-python">import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (14,6)
plt.style.use("ggplot")

api_key = "YOUR FMP API KEY"
base_url = "https://financialmodelingprep.com/stable" 
</code></pre>
<p>Nothing fancy here. Make sure to replace <code>YOUR FMP API KEY</code> with your actual FMP API key. If you don’t have one, you can get it by opening an FMP developer account.</p>
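<p>If you'd rather not paste the key into the notebook, one option is to read it from an environment variable and build the request URLs through a small helper. This is only a sketch: <code>FMP_API_KEY</code> and <code>build_url</code> are names I made up for illustration and are not part of any FMP client library.</p>

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://financialmodelingprep.com/stable"

def build_url(endpoint, api_key, params=None):
    """Hypothetical helper: join the base URL, endpoint, and query parameters."""
    query = urlencode({**(params or {}), "apikey": api_key})
    return f"{BASE_URL}/{endpoint}?{query}"

# FMP_API_KEY is an assumed variable name; fall back to the placeholder if unset.
api_key = os.environ.get("FMP_API_KEY", "YOUR FMP API KEY")

# Use a dummy key here so the printed URL never exposes a real one.
example_url = build_url("commitment-of-traders-report", "demo-key", {"symbol": "CL"})
print(example_url)
```

<p>Passing the parameters as a dictionary also sidesteps the fact that <code>from</code> is a reserved word in Python, so it can't be used directly as a keyword argument.</p>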
<h2 id="heading-pulling-the-data-cot-wti-crude-prices-using-fmp-apis">Pulling the Data: COT + WTI Crude Prices using FMP APIs</h2>
<p>To build this strategy, I needed two datasets. First, I needed COT data for crude oil. Second, I needed historical WTI crude oil prices.</p>
<p>I started with the COT market list to identify the correct crude oil contract.</p>
<pre><code class="language-python">url = f"{base_url}/commitment-of-traders-list?apikey={api_key}"
r = requests.get(url)
cot_list = pd.DataFrame(r.json())

crude_candidates = cot_list[
    cot_list.astype(str)
    .apply(lambda col: col.str.contains("crude", case=False, na=False))
    .any(axis=1)
]

crude_candidates
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/f6de5da0-9876-4928-8b36-59730cab64e2.png" alt="COT market list" style="display:block;margin:0 auto" width="1089" height="432" loading="lazy">

<p>This gives a filtered list of crude-related contracts from the COT universe. In this case, the key contract I used was CL.</p>
<pre><code class="language-python">cot_symbol = "CL"
start_date = "2010-01-01"
end_date = "2026-03-20"

url = f"{base_url}/commitment-of-traders-report?symbol={cot_symbol}&amp;from={start_date}&amp;to={end_date}&amp;apikey={api_key}"
r = requests.get(url)

cot_df = pd.DataFrame(r.json())
cot_df["date"] = pd.to_datetime(cot_df["date"])
cot_df = cot_df.sort_values("date").drop_duplicates(subset="date").reset_index(drop=True)
cot_df = cot_df.rename(columns={"date": "cot_date"})

cot_df.head()
</code></pre>
<p>This returns the weekly COT records for crude oil:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/7ac107b3-dda6-4568-b535-9ab5533448e1.png" alt="Weekly COT crude oil data" style="display:block;margin:0 auto" width="1801" height="754" loading="lazy">

<p>The main fields I needed later were:</p>
<ul>
<li><p><code>date</code> (renamed to <code>cot_date</code> in the code above)</p>
</li>
<li><p><code>openInterestAll</code></p>
</li>
<li><p><code>noncommPositionsLongAll</code></p>
</li>
<li><p><code>noncommPositionsShortAll</code></p>
</li>
</ul>
<p>Next, I pulled the WTI crude oil price data using FMP’s commodity price endpoint.</p>
<pre><code class="language-python">price_symbol = "CLUSD"
start_date = "2010-01-01"
end_date = "2026-03-20"

url = f"{base_url}/historical-price-eod/full?symbol={price_symbol}&amp;from={start_date}&amp;to={end_date}&amp;apikey={api_key}"
r = requests.get(url)

price_df = pd.DataFrame(r.json())
price_df["date"] = pd.to_datetime(price_df["date"])
price_df = price_df.sort_values("date").drop_duplicates(subset="date").reset_index(drop=True)

price_df
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/6bbd3f99-618f-4e80-a2e4-04157f108b9c.png" alt="WTI crude oil price data" style="display:block;margin:0 auto" width="1100" height="571" loading="lazy">

<p>Since the COT dataset is weekly, I converted the price series into weekly bars using the Friday close.</p>
<pre><code class="language-python">price_df["date"] = pd.to_datetime(price_df["date"])
price_df = price_df.sort_values("date").drop_duplicates(subset="date").reset_index(drop=True)

weekly_price = price_df.set_index("date").resample("W-FRI").agg({
    "symbol": "last",
    "open": "first",
    "high": "max",
    "low": "min",
    "close": "last",
    "volume": "sum",
    "vwap": "mean"
}).dropna().reset_index()

weekly_price["weekly_return"] = weekly_price["close"].pct_change()
weekly_price = weekly_price.rename(columns={"date": "price_date"})

weekly_price
</code></pre>
<p>This step matters because the two datasets need to live on the same time scale. If I kept prices daily while COT stayed weekly, the signal alignment would become messy very quickly.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/cba82494-e180-4278-ac41-a5f3490346f5.png" alt="WTI crude oil price data weekly" style="display:block;margin:0 auto" width="1100" height="600" loading="lazy">
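<p>To see what <code>resample("W-FRI")</code> does in isolation, here is a small sketch on made-up daily closes. The prices are invented. Only the grouping behavior matters.</p>

```python
import pandas as pd

# Two full business weeks of made-up daily closes (Mon 2024-01-01 to Fri 2024-01-12).
days = pd.bdate_range("2024-01-01", "2024-01-12")
daily = pd.DataFrame({"close": list(range(70, 80))}, index=days)

# Each bin ends on a Friday and is labeled with that Friday's date.
weekly = daily["close"].resample("W-FRI").agg(["first", "max", "min", "last"])
print(weekly)
```

<p>Each weekly row is stamped with the Friday that ends the bin, and <code>last</code> picks the final available close in that week. If Friday is a holiday, the row keeps the Friday label but carries Thursday's close.</p>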

<p>Finally, I aligned each COT observation with the first weekly WTI price bar dated on or after the report date.</p>
<pre><code class="language-python">merged_df = pd.merge_asof(
    cot_df.sort_values("cot_date"),
    weekly_price.sort_values("price_date"),
    left_on="cot_date",
    right_on="price_date",
    direction="forward"
)

merged_df[["cot_date", "price_date", "close", "weekly_return", "openInterestAll", "noncommPositionsLongAll", "noncommPositionsShortAll"]]
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/65b8ed6d-d4ef-43f5-99a2-1b4a5fd80459.png" alt="COT &amp; Price Data merged" style="display:block;margin:0 auto" width="1903" height="794" loading="lazy">

<p>The output is one clean working table with:</p>
<ul>
<li><p>the COT report date</p>
</li>
<li><p>the matched WTI weekly price date</p>
</li>
<li><p>weekly crude price data</p>
</li>
<li><p>the main positioning fields needed for feature engineering</p>
</li>
</ul>
<p>That is the full base dataset for the strategy. With this in place, the next step is to turn the raw positioning data into something more useful.</p>
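<p>For anyone unsure what <code>direction="forward"</code> does in <code>merge_asof</code>, this toy example shows the matching rule. The dates and values are made up, with report dates on Tuesdays and weekly bars on Fridays.</p>

```python
import pandas as pd

cot = pd.DataFrame({
    "cot_date": pd.to_datetime(["2024-01-02", "2024-01-09"]),   # Tuesdays
    "net_position": [100, 120],                                  # made-up values
})
bars = pd.DataFrame({
    "price_date": pd.to_datetime(["2024-01-05", "2024-01-12"]),  # Fridays
    "close": [73.8, 72.7],                                       # made-up values
})

# "forward" picks the first price_date that is on or after each cot_date.
aligned = pd.merge_asof(
    cot, bars, left_on="cot_date", right_on="price_date", direction="forward"
)
print(aligned)
```

<p>Each report date is paired with the Friday bar of the same week, which is the behavior the merge above relies on. Both frames must be sorted by their key columns for <code>merge_asof</code> to work.</p>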
<h2 id="heading-turning-raw-cot-data-into-usable-features">Turning Raw COT Data Into Usable Features</h2>
<p>At this point, the raw data was ready, but it still wasn't useful as a signal. The COT report gives positioning numbers, but those numbers by themselves don't say much unless they're turned into something comparable over time.</p>
<p>So the next step was to build a few features that could describe positioning in a more meaningful way.</p>
<p>I started with the net non-commercial position. This is just the difference between non-commercial longs and non-commercial shorts.</p>
<pre><code class="language-python">merged_df["net_position"] = merged_df["noncommPositionsLongAll"] - merged_df["noncommPositionsShortAll"]
</code></pre>
<p>This gives the raw speculative bias. A positive value means non-commercial traders are net long. A negative value means they're net short.</p>
<p>But raw net positioning has a problem. The size of the market changes over time, so a value that looked extreme in one period may not mean the same thing in another. To fix that, I normalized it by open interest.</p>
<pre><code class="language-python">merged_df["net_position_ratio"] = merged_df["net_position"] / merged_df["openInterestAll"]
</code></pre>
<p>This made the signal much more useful. Instead of looking at absolute positioning, I was now looking at positioning as a share of the total market.</p>
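<p>A quick made-up example shows why the normalization matters. The same net position means something very different in a small market than in a large one.</p>

```python
import pandas as pd

# Made-up numbers: identical net position, very different market sizes.
toy = pd.DataFrame({
    "net_position": [100_000, 100_000],
    "openInterestAll": [1_000_000, 2_500_000],
})
toy["net_position_ratio"] = toy["net_position"] / toy["openInterestAll"]
print(toy)
```

<p>The raw position is the same in both rows, but it represents 10% of open interest in the first case and only 4% in the second.</p>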
<p>Next, I needed to know whether that positioning was still building or starting to unwind. For that, I calculated the week-over-week change in the ratio.</p>
<pre><code class="language-python">merged_df["net_position_ratio_change"] = merged_df["net_position_ratio"].diff()
</code></pre>
<p>This was important because the direction of change adds context. An extreme long position that's still increasing isn't the same as an extreme long position that has started to fall.</p>
<p>The last feature was the most important one: a rolling percentile of the positioning ratio. I used a 104-week window.</p>
<pre><code class="language-python">def rolling_percentile(x):
    return pd.Series(x).rank(pct=True).iloc[-1]

merged_df["position_percentile_104"] = merged_df["net_position_ratio"].rolling(104).apply(rolling_percentile)
</code></pre>
<p>This tells us how extreme the current positioning is relative to the last two years. A value above 0.80 means the market is in the top 20% of bullish positioning relative to that recent history. A value below 0.20 means the market is in the bottom 20%.</p>
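<p>The percentile logic is easier to check on a tiny series. The sketch below reuses the same <code>rolling_percentile</code> function with a 4-row window instead of 104, on made-up ratios.</p>

```python
import pandas as pd

def rolling_percentile(x):
    # Rank of the window's last value, expressed as a fraction of the window size.
    return pd.Series(x).rank(pct=True).iloc[-1]

ratio = pd.Series([0.10, 0.30, 0.20, 0.50, 0.05])  # made-up positioning ratios
pct = ratio.rolling(4).apply(rolling_percentile)
print(pct)
```

<p>The first three rows are <code>NaN</code> because the window isn't full yet. The fourth row is 1.0 since 0.50 is the highest value in its window, and the fifth is 0.25 since 0.05 is the lowest of four. Note that the lowest possible value is 1 divided by the window length, not 0.</p>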
<p>After adding all four features, I checked the output.</p>
<pre><code class="language-python">merged_df[["cot_date","price_date","net_position","net_position_ratio","net_position_ratio_change","position_percentile_104"]]
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/a94f7dee-fdc6-4495-829a-eee72d95a43d.png" alt="final merged_df" style="display:block;margin:0 auto" width="1100" height="484" loading="lazy">

<p>The first row of <code>net_position_ratio_change</code> was <code>NaN</code>, which is expected since the first row has no prior week to compare with. The first 103 rows of <code>position_percentile_104</code> were also <code>NaN</code> because the rolling window needs 104 weeks of history before it can calculate the percentile.</p>
<p>That was fine. What mattered was that the dataset now had four usable pieces:</p>
<ul>
<li><p>raw speculative positioning</p>
</li>
<li><p>normalized positioning</p>
</li>
<li><p>weekly change in positioning</p>
</li>
<li><p>a rolling measure of how extreme that positioning is</p>
</li>
</ul>
<p>This was the point where the COT data stopped being just a table of trader positions and started becoming something that could be turned into a regime model.</p>
<h2 id="heading-building-the-first-version-of-the-regime-model">Building the First Version of the Regime Model</h2>
<p>Once the features were ready, the next step was to turn them into actual market states.</p>
<p>The main idea was simple: positioning extremes on their own aren't enough. A market can stay heavily long or heavily short for a long time. What matters more is what happens while positioning is extreme. Is it still building, or has it started to reverse?</p>
<p>That's why I used two dimensions:</p>
<ul>
<li><p>the 104-week positioning percentile</p>
</li>
<li><p>the weekly change in the positioning ratio</p>
</li>
</ul>
<p>With those two variables, I defined four regimes.</p>
<pre><code class="language-python">merged_df["regime"] = "neutral"

merged_df.loc[(merged_df["position_percentile_104"] &gt; 0.8) &amp; (merged_df["net_position_ratio_change"] &gt; 0), "regime"] = "bullish_buildup"
merged_df.loc[(merged_df["position_percentile_104"] &gt; 0.8) &amp; (merged_df["net_position_ratio_change"] &lt; 0), "regime"] = "bullish_unwind"
merged_df.loc[(merged_df["position_percentile_104"] &lt; 0.2) &amp; (merged_df["net_position_ratio_change"] &lt; 0), "regime"] = "bearish_buildup"
merged_df.loc[(merged_df["position_percentile_104"] &lt; 0.2) &amp; (merged_df["net_position_ratio_change"] &gt; 0), "regime"] = "bearish_unwind"
</code></pre>
<p>Here's what each one means:</p>
<ul>
<li><p><strong>bullish buildup</strong>: positioning is already very bullish, and it's still getting more bullish</p>
</li>
<li><p><strong>bullish unwind</strong>: positioning is very bullish, but that bullishness has started to fade</p>
</li>
<li><p><strong>bearish buildup</strong>: positioning is already very bearish, and it's still getting more bearish</p>
</li>
<li><p><strong>bearish unwind</strong>: positioning is very bearish, but that bearishness has started to ease</p>
</li>
</ul>
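<p>Anything that didn't meet one of those extreme conditions stayed in the <code>neutral</code> bucket. That includes the first 103 rows, where the percentile is still <code>NaN</code>, and any week where the ratio didn't change at all.</p>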
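<p>The same classification can be written as one function with <code>np.select</code>, which makes the thresholds easier to change later. This is a sketch rather than the code used for the results in this article, and <code>classify_regime</code> is a hypothetical helper name.</p>

```python
import numpy as np
import pandas as pd

def classify_regime(percentile, change, upper=0.8, lower=0.2):
    """Hypothetical helper mirroring the four .loc assignments used earlier."""
    conditions = [
        (percentile > upper) & (change > 0),
        (percentile > upper) & (change < 0),
        (percentile < lower) & (change < 0),
        (percentile < lower) & (change > 0),
    ]
    labels = ["bullish_buildup", "bullish_unwind", "bearish_buildup", "bearish_unwind"]
    # Rows matching no condition (including NaN inputs) fall back to "neutral".
    return pd.Series(np.select(conditions, labels, default="neutral"), index=percentile.index)

# Made-up rows covering each regime, a mid-range week, and a warm-up (NaN) week.
toy = pd.DataFrame({
    "position_percentile_104": [0.95, 0.90, 0.10, 0.05, 0.50, np.nan],
    "net_position_ratio_change": [0.01, -0.02, -0.01, 0.03, 0.02, 0.01],
})
toy["regime"] = classify_regime(toy["position_percentile_104"], toy["net_position_ratio_change"])
print(toy)
```

<p>The last row shows the warm-up behavior. A <code>NaN</code> percentile fails every comparison, so the row is labeled <code>neutral</code>.</p>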
<p>After assigning the regimes, I checked how many observations fell into each one.</p>
<pre><code class="language-python">print(merged_df["regime"].value_counts())
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/5133085c-281c-46fc-8ab6-fa414aa1d682.png" alt="regime count" style="display:block;margin:0 auto" width="275" height="165" loading="lazy">

<p>This output matters because it tells us whether the framework is usable or too sparse. In this case, neutral was still the largest group, which is expected. Most weeks shouldn't be extreme. The four regime buckets were smaller, but still had enough observations to test properly.</p>
<p>I also looked at a sample of the classified rows.</p>
<pre><code class="language-python">merged_df[["cot_date","price_date","net_position_ratio","net_position_ratio_change","position_percentile_104","regime"]].tail(10)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/9dd1352c-932f-4fd9-bb84-071b61433121.png" alt="merged_df + regime" style="display:block;margin:0 auto" width="1876" height="765" loading="lazy">

<p>At this point, the raw COT data had been turned into a regime model. The next question was whether any of these regimes actually led to useful price behavior.</p>
<h2 id="heading-first-test-what-happens-after-each-regime">First Test: What Happens After Each Regime?</h2>
<p>At this point, I had a regime framework, but not a strategy. Before turning any of these states into trades, I wanted to know what crude oil actually did after each one.</p>
<p>So the next step was to measure forward returns after every regime over four holding windows:</p>
<ul>
<li><p>1 week</p>
</li>
<li><p>2 weeks</p>
</li>
<li><p>4 weeks</p>
</li>
<li><p>8 weeks</p>
</li>
</ul>
<p>I started by creating the forward return columns from the weekly close series.</p>
<pre><code class="language-python">merged_df["fwd_return_1w"] = merged_df["close"].shift(-1) / merged_df["close"] - 1
merged_df["fwd_return_2w"] = merged_df["close"].shift(-2) / merged_df["close"] - 1
merged_df["fwd_return_4w"] = merged_df["close"].shift(-4) / merged_df["close"] - 1
merged_df["fwd_return_8w"] = merged_df["close"].shift(-8) / merged_df["close"] - 1

merged_df[["cot_date","price_date","close","regime","fwd_return_1w","fwd_return_2w","fwd_return_4w","fwd_return_8w"]].tail(12)
</code></pre>
<p>Each of these columns answers a simple question. If crude oil is in a given regime this week, what happens over the next 1, 2, 4, or 8 weeks?</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/cde3faca-cb6d-43b6-81d4-15f6ec660205.png" alt="forward return columns from the weekly close series" style="display:block;margin:0 auto" width="1566" height="749" loading="lazy">

<p>The last few rows had <code>NaN</code> values, which is normal. There is no future price data available beyond the end of the dataset, so the longest horizons drop off first.</p>
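<p>Here is the same <code>shift</code>-based calculation on four made-up weekly closes, which makes the trailing <code>NaN</code> pattern easy to see.</p>

```python
import pandas as pd

close = pd.Series([100.0, 110.0, 121.0, 133.1])  # made-up weekly closes

# shift(-k) pulls the close from k weeks ahead onto the current row.
fwd = pd.DataFrame({
    f"fwd_return_{k}w": close.shift(-k) / close - 1
    for k in (1, 2)
})
print(fwd.round(4))
```

<p>The 1-week column loses its last row and the 2-week column loses its last two, because those rows have no future close to compare against.</p>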
<p>Next, I grouped the data by regime and calculated a few summary statistics:</p>
<ul>
<li><p>count</p>
</li>
<li><p>average forward return</p>
</li>
<li><p>median forward return</p>
</li>
<li><p>hit rate</p>
</li>
</ul>
<pre><code class="language-python">regime_summary = merged_df.groupby("regime").agg(
    count=("regime", "size"),
    avg_1w=("fwd_return_1w", "mean"),
    median_1w=("fwd_return_1w", "median"),
    hit_rate_1w=("fwd_return_1w", lambda x: (x &gt; 0).mean()),
    avg_2w=("fwd_return_2w", "mean"),
    median_2w=("fwd_return_2w", "median"),
    hit_rate_2w=("fwd_return_2w", lambda x: (x &gt; 0).mean()),
    avg_4w=("fwd_return_4w", "mean"),
    median_4w=("fwd_return_4w", "median"),
    hit_rate_4w=("fwd_return_4w", lambda x: (x &gt; 0).mean()),
    avg_8w=("fwd_return_8w", "mean"),
    median_8w=("fwd_return_8w", "median"),
    hit_rate_8w=("fwd_return_8w", lambda x: (x &gt; 0).mean())
).reset_index()

regime_summary
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/5e522449-c64a-4a7c-a4b6-43723b3241bd.png" alt="grouped data by regime" style="display:block;margin:0 auto" width="1901" height="334" loading="lazy">

<p>This table was the first real test of the framework, and it immediately ruled out some of the original ideas.</p>
<p>The results weren't great for the raw regime model. In fact, they were weaker than I expected.</p>
<p>A few things stood out:</p>
<ul>
<li><p><code>neutral</code> often outperformed the regime buckets</p>
</li>
<li><p><code>bullish_buildup</code> looked consistently weak</p>
</li>
<li><p><code>bearish_buildup</code> also looked weak</p>
</li>
<li><p><code>bearish_unwind</code> looked stronger at first glance, but some of that came from a few large upside outliers</p>
</li>
<li><p><code>bullish_unwind</code> was the only regime that looked somewhat stable across multiple horizons</p>
</li>
</ul>
<p>That changed the direction of the project.</p>
<p>Up to this point, the plan was to build a full four-regime framework and maybe convert multiple states into trade rules. After looking at the forward returns, that no longer made sense. Most of the regimes were not adding much value.</p>
<p>So instead of carrying all four forward, I started leaning toward the one regime that still looked promising: <strong>bullish unwind.</strong></p>
<p>Before committing to that, I wanted to look at the distributions visually and see whether the averages were hiding anything important.</p>
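<p>One detail in the summary table is worth knowing. The hit-rate lambdas use <code>(x &gt; 0).mean()</code>, and a comparison against <code>NaN</code> evaluates to <code>False</code>. Rows without a forward return therefore count as misses, while <code>mean</code> and <code>median</code> skip them. The effect is small here because only the last few rows are affected, but this sketch on made-up values shows the difference.</p>

```python
import numpy as np
import pandas as pd

fwd = pd.Series([0.02, -0.01, 0.03, np.nan, np.nan])  # made-up forward returns

naive_hit_rate = (fwd > 0).mean()            # NaN rows count as misses: 2 of 5
valid_hit_rate = (fwd.dropna() > 0).mean()   # only rows with a known outcome: 2 of 3

print(naive_hit_rate, valid_hit_rate)
```

<p>Dropping the missing rows first keeps the hit rate on the same sample as the mean and median.</p>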
<h2 id="heading-looking-at-the-regimes-more-closely">Looking at the Regimes More Closely</h2>
<p>The summary table already told me that most of the raw regime framework was weak, but I still wanted to look at the behavior visually before dropping anything.</p>
<p>I started with a simple chart that places WTI crude oil next to the speculative net positioning ratio.</p>
<pre><code class="language-python">plt.plot(merged_df["price_date"], merged_df["close"], label="wti close")
plt.plot(merged_df["price_date"], merged_df["net_position_ratio"] * 100, label="net position ratio x 100")
plt.title("WTI crude oil price vs speculative net positioning")
plt.xlabel("date")
plt.ylabel("value")
plt.legend()
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/e1655a05-0c3a-4d4f-8f5d-51dc20e8b305.png" alt="WTI crude oil price vs speculative net positioning" style="display:block;margin:0 auto" width="1741" height="798" loading="lazy">

<p>This chart isn't meant to compare the two series on the same scale. It's just a quick way to see whether large moves in crude oil tend to happen when speculative positioning is becoming stretched.</p>
<p>Next, I plotted the 104-week positioning percentile itself.</p>
<pre><code class="language-python">plt.plot(merged_df["price_date"], merged_df["position_percentile_104"])
plt.axhline(0.8, linestyle="--", color="b")
plt.axhline(0.2, linestyle="--", color="b")
plt.title("104-week positioning percentile")
plt.xlabel("date")
plt.ylabel("percentile")
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/5547d52a-001f-4f30-9479-4414e7b74498.png" alt="104-week positioning percentile" style="display:block;margin:0 auto" width="1829" height="840" loading="lazy">

<p>This made the regime logic easier to understand. Any time the percentile moved above 0.80, the market entered the bullish extreme zone. Any time it dropped below 0.20, the market entered the bearish extreme zone.</p>
<p>Then I looked at how many observations actually fell into each regime.</p>
<pre><code class="language-python">regime_counts = merged_df["regime"].value_counts()

plt.bar(regime_counts.index, regime_counts.values)
plt.title("Regime counts")
plt.xlabel("regime")
plt.ylabel("count")
plt.xticks(rotation=30)
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/6eee2a9a-2876-41c9-9204-8d1e0b0b13f4.png" alt="Regime counts" style="display:block;margin:0 auto" width="1621" height="814" loading="lazy">

<p>The regime counts looked reasonable. Neutral was still the largest bucket, and the four signal regimes had enough observations to test without being too sparse.</p>
<p>After that, I plotted the average 4-week forward return by regime.</p>
<pre><code class="language-python">avg_4w = regime_summary.set_index("regime")["avg_4w"].sort_values()

plt.bar(avg_4w.index, avg_4w.values)
plt.title("Average 4-week forward return by regime")
plt.xlabel("regime")
plt.ylabel("average return")
plt.xticks(rotation=30)
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/00ba5ce0-89df-4a9d-8559-1a96c113447b.png" alt="Average 4-week forward return by regime" style="display:block;margin:0 auto" width="1613" height="794" loading="lazy">

<p>This was the first strong sign that the original framework was too broad. Both buildup regimes looked weak. <code>bullish_unwind</code> was slightly positive, but not by much. <code>bearish_unwind</code> looked strongest on average, which was interesting, but I still didn't trust that result without checking the distribution.</p>
<p>So I looked at the 4-week hit rate next.</p>
<pre><code class="language-python">hit_4w = regime_summary.set_index("regime")["hit_rate_4w"].sort_values()

plt.bar(hit_4w.index, hit_4w.values)
plt.title("4-week hit rate by regime")
plt.xlabel("regime")
plt.ylabel("hit rate")
plt.xticks(rotation=30)
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/93a8bf60-3c69-4c6d-a198-85cda789d3dc.png" alt="4-week hit rate by regime" style="display:block;margin:0 auto" width="1523" height="772" loading="lazy">

<p>The hit rates told a similar story. <code>bullish_unwind</code> was one of the better regimes, but still not strong enough to justify calling it a strategy. <code>neutral</code> was still doing too well, which meant the regime filter wasn't creating a very clean edge yet.</p>
<p>At that point, I wanted to check whether the averages were being distorted by a few large moves. So I plotted the 4-week return distribution for each regime.</p>
<pre><code class="language-python">plot_df = merged_df[["regime", "fwd_return_4w"]].dropna()

plot_df.boxplot(column="fwd_return_4w", by="regime", grid=False)
plt.title("4-week forward return distribution by regime")
plt.suptitle("")
plt.xlabel("regime")
plt.ylabel("4-week forward return")
plt.xticks(rotation=30)
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/849b0d06-0699-4482-84d3-fef2b35f3475.png" alt="4-week forward return distribution by regime" style="display:block;margin:0 auto" width="1644" height="785" loading="lazy">

<p>This chart made the problem much clearer.</p>
<p><code>bearish_unwind</code> looked strong on average, but that strength came from a few very large upside outliers. That made it less convincing as a base strategy.</p>
<p><code>bullish_buildup</code> and <code>bearish_buildup</code> were weak both in the summary table and in the distribution.</p>
<p><code>bullish_unwind</code> was the only regime that looked somewhat stable without depending too much on a handful of extreme observations.</p>
<p>That settled the direction of the build.</p>
<p>The original plan was to test a full regime framework and maybe keep multiple paths. After these charts, that no longer made sense. Most of the framework had already done its job by showing what not to use.</p>
<p>So instead of carrying all four regimes forward, I narrowed the focus to just one: bullish unwind.</p>
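<p>The outlier problem is easy to reproduce on made-up numbers. In the sketch below, most of the returns are small losses, but two large gains are enough to make the average look attractive.</p>

```python
import pandas as pd

# Made-up 4-week returns: mostly small losses plus two large upside outliers.
returns = pd.Series([-0.02, -0.01, -0.03, -0.02, -0.01, 0.01, -0.02, 0.35, 0.40, -0.01])

summary = pd.Series({
    "mean": returns.mean(),
    "median": returns.median(),
    "hit_rate": (returns > 0).mean(),
    "mean_without_top_2": returns.sort_values().iloc[:-2].mean(),
})
print(summary.round(4))
```

<p>The mean is positive, but the median is negative, the hit rate is only 30%, and the mean turns negative once the two largest gains are removed. That is the kind of pattern that made me cautious about <code>bearish_unwind</code>.</p>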
<h2 id="heading-narrowing-the-focus-keeping-two-extra-variants-for-comparison">Narrowing the Focus: Keeping Two Extra Variants for Comparison</h2>
<p>At this point, <code>bullish_unwind</code> was already the main regime worth paying attention to. The buildup regimes were weak, and <code>bearish_unwind</code> was less convincing because a big part of its strength came from a few outsized moves.</p>
<p>So the focus was already shifting toward <code>bullish_unwind</code>.</p>
<p>Still, before fully committing to it, I kept two additional unwind-based variants in the next step just for comparison:</p>
<ul>
<li><p>a long signal based on <code>bearish_unwind</code></p>
</li>
<li><p>a combined long signal that fires on either unwind regime</p>
</li>
</ul>
<p>That way, the first round of backtests could show whether <code>bullish_unwind</code> was actually better in practice, or whether the broader unwind logic worked better as a whole.</p>
<pre><code class="language-python">merged_df["long_bullish_unwind"] = (merged_df["regime"] == "bullish_unwind").astype(int)
merged_df["long_bearish_unwind"] = (merged_df["regime"] == "bearish_unwind").astype(int)
merged_df["long_any_unwind"] = merged_df["regime"].isin(["bullish_unwind", "bearish_unwind"]).astype(int)

print("number of trades:\n", merged_df[["long_bullish_unwind", "long_bearish_unwind", "long_any_unwind"]].sum())
merged_df[["cot_date","price_date","regime","long_bullish_unwind","long_bearish_unwind","long_any_unwind"]].tail()
</code></pre>
<p>This creates three simple binary signals:</p>
<ul>
<li><p><code>long_bullish_unwind</code> is 1 only when the regime is <code>bullish_unwind</code></p>
</li>
<li><p><code>long_bearish_unwind</code> is 1 only when the regime is <code>bearish_unwind</code></p>
</li>
<li><p><code>long_any_unwind</code> is 1 when either unwind regime appears</p>
</li>
</ul>
<p>The output also gives the number of signal occurrences for each one, which matters because the next step is a proper backtest. A signal can look interesting conceptually, but if it barely appears, there isn't much to test.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/0975eaf6-a8a9-408b-a490-f71559fc0f7b.png" alt="number of signal occurrences" style="display:block;margin:0 auto" width="1772" height="779" loading="lazy">

<p>So going into the strategy layer, <code>bullish_unwind</code> was already the main path. The other two were still kept around, but mainly to compare how much weaker or stronger they looked once the trades were actually executed.</p>
<h2 id="heading-building-the-first-trade-rules">Building the First Trade Rules</h2>
<p>Once the three unwind-based signals were ready, the next step was to turn them into actual trades.</p>
<p>I kept the backtest simple on purpose:</p>
<ul>
<li><p>long-only</p>
</li>
<li><p>4-week holding period</p>
</li>
<li><p>non-overlapping trades</p>
</li>
</ul>
<p>The non-overlapping part matters. If a new signal appeared while a current trade was still active, I skipped it. That kept the trade list cleaner and avoided inflating the strategy by stacking overlapping positions on top of each other.</p>
<p>Here is the backtest function I used.</p>
<pre><code class="language-python">def run_fixed_hold_backtest(df, signal_col, hold_weeks=4):
    trades = []
    i = 0

    # Stop early enough that every entry still has an exit bar hold_weeks ahead
    while i &lt; len(df) - hold_weeks:
        if df.iloc[i][signal_col] == 1:
            # Enter at this week's close and exit at the close hold_weeks later
            entry_date = df.iloc[i]["price_date"]
            exit_date = df.iloc[i + hold_weeks]["price_date"]
            entry_price = df.iloc[i]["close"]
            exit_price = df.iloc[i + hold_weeks]["close"]
            trade_return = exit_price / entry_price - 1

            trades.append({
                "signal": signal_col,
                "entry_index": i,
                "exit_index": i + hold_weeks,
                "entry_date": entry_date,
                "exit_date": exit_date,
                "entry_price": entry_price,
                "exit_price": exit_price,
                "trade_return": trade_return
            })

            # Jump to the exit bar so signals during the holding period are skipped
            i += hold_weeks
        else:
            i += 1

    return pd.DataFrame(trades)
</code></pre>
<p>This function scans through the dataset, checks whether a signal is active, enters at the current weekly bar, exits four weeks later, and records the trade result.</p>
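<p>To make the skipping behavior concrete, here is a stripped-down sketch that keeps only the index logic of the loop and runs it on a made-up signal series. <code>nonoverlapping_entries</code> is a hypothetical helper for illustration and not part of the backtest itself.</p>

```python
import pandas as pd

def nonoverlapping_entries(signal, hold_weeks=4):
    """Hypothetical helper: return entry positions using the backtest's skip logic."""
    entries, i = [], 0
    values = list(signal)
    while i < len(values) - hold_weeks:
        if values[i] == 1:
            entries.append(i)
            i += hold_weeks   # jump to the exit bar; signals in between are ignored
        else:
            i += 1
    return entries

# Made-up signals: a cluster at 0-2, one at 5, and one near the end at 9.
signal = pd.Series([1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
print(nonoverlapping_entries(signal))
```

<p>Out of five raw signals, only the ones at positions 0 and 5 become trades. The signals at 1 and 2 fall inside the first holding period, and the one at 9 is dropped because there aren't four more bars left to exit on. Since the loop resumes at the exit bar, a new trade can also start on the same bar where the previous one closes.</p>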
<p>Then I ran it for all three unwind-based signals.</p>
<pre><code class="language-python">bullish_unwind_trades = run_fixed_hold_backtest(merged_df, "long_bullish_unwind", hold_weeks=4)
bearish_unwind_trades = run_fixed_hold_backtest(merged_df, "long_bearish_unwind", hold_weeks=4)
any_unwind_trades = run_fixed_hold_backtest(merged_df, "long_any_unwind", hold_weeks=4)
</code></pre>
<p>After that, I checked how many trades were actually executed.</p>
<pre><code class="language-python">print("executed bullish_unwind trades:", len(bullish_unwind_trades))
print("executed bearish_unwind trades:", len(bearish_unwind_trades))
print("executed any_unwind trades:", len(any_unwind_trades))
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/e6e87883-fe88-4b04-9c55-8dd71aaf92b3.png" alt="executed trades" style="display:block;margin:0 auto" width="496" height="81" loading="lazy">

<p>This output was lower than the raw signal counts from the previous section, which is expected because overlapping signals were skipped.</p>
<p>Next, I built a small helper function to summarize the trade results and applied it to all three strategies.</p>
<pre><code class="language-python">def summarize_trades(trades):
    return pd.Series({
        "trades": len(trades),
        "win_rate": (trades["trade_return"] &gt; 0).mean(),
        "avg_trade_return": trades["trade_return"].mean(),
        "median_trade_return": trades["trade_return"].median(),
        "cumulative_return": (1 + trades["trade_return"]).prod() - 1
    })

trade_summary = pd.DataFrame({
    "bullish_unwind": summarize_trades(bullish_unwind_trades),
    "bearish_unwind": summarize_trades(bearish_unwind_trades),
    "any_unwind": summarize_trades(any_unwind_trades)
}).T

trade_summary
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/da0d8d65-74a4-4ec9-9af5-24a0a0e14b77.png" alt="backtest results" style="display:block;margin:0 auto" width="1100" height="199" loading="lazy">

<p>This was the first full strategy result, and it cleared up the hierarchy very quickly.</p>
<p><code>bullish_unwind</code> was still the best of the three. It wasn't strong yet, but it was clearly better than the other two.</p>
<p>A few things stood out:</p>
<ul>
<li><p><code>bullish_unwind</code> had the best win rate</p>
</li>
<li><p><code>bullish_unwind</code> had the best average and median trade return</p>
</li>
<li><p><code>bearish_unwind</code> and <code>any_unwind</code> both performed badly on a cumulative basis</p>
</li>
<li><p>Combining the two unwind regimes didn't help. It just diluted the stronger one</p>
</li>
</ul>
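<p>One detail worth keeping in mind when reading the summary table: <code>avg_trade_return</code> is a plain average, while <code>cumulative_return</code> compounds the trades one after another. The two can disagree. The sketch below uses invented returns (not the backtest's actual trades) to show a positive average sitting next to a negative cumulative result.</p>

```python
import pandas as pd

# Hypothetical per-trade returns, not taken from the article's backtest
trade_returns = pd.Series([0.30, -0.25, 0.20, -0.22])

avg_trade_return = trade_returns.mean()
equity_curve = (1 + trade_returns).cumprod()        # growth of 1 unit, trade by trade
cumulative_return = (1 + trade_returns).prod() - 1  # same formula as summarize_trades

print("average trade return:", round(avg_trade_return, 4))
print("cumulative return:", round(cumulative_return, 4))
print("final equity multiple:", round(equity_curve.iloc[-1], 4))
```

<p>The last point of the equity curve is always one plus the cumulative return, which is why the table and the equity chart tell the same story.</p>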
<p>I also wanted to see how these strategies behaved over time, not just in a summary table. So I added simple equity curves for each one.</p>
<pre><code class="language-python">bullish_unwind_trades["equity_curve"] = (1 + bullish_unwind_trades["trade_return"]).cumprod()
bearish_unwind_trades["equity_curve"] = (1 + bearish_unwind_trades["trade_return"]).cumprod()
any_unwind_trades["equity_curve"] = (1 + any_unwind_trades["trade_return"]).cumprod()

plt.plot(bullish_unwind_trades["exit_date"], bullish_unwind_trades["equity_curve"], label="bullish unwind")
plt.plot(bearish_unwind_trades["exit_date"], bearish_unwind_trades["equity_curve"], label="bearish unwind")
plt.plot(any_unwind_trades["exit_date"], any_unwind_trades["equity_curve"], label="any unwind")
plt.title("Equity curves for 4-week unwind strategies")
plt.xlabel("date")
plt.ylabel("equity multiple")
plt.legend()
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/52a0865f-9054-497c-b3de-7e0ec13c28fc.png" alt="Equity curves for 4-week unwind strategies" style="display:block;margin:0 auto" width="1826" height="847" loading="lazy">

<p>This chart made the same point more clearly. <code>bullish_unwind</code> was still weak in absolute terms, but it held up much better than the other two. <code>bearish_unwind</code> didn't survive the conversion from regime idea to actual strategy, and <code>any_unwind</code> was even worse because it inherited the weakness of both.</p>
<p>So by the end of this step, the picture was much clearer.</p>
<p>The broader unwind idea didn't work well as a whole. <code>bearish_unwind</code> wasn't holding up in a clean backtest. <code>any_unwind</code> was even worse. That left only one regime worth carrying further: <code>bullish_unwind</code>.</p>
<p>Still, even that result wasn't strong enough yet. The strategy was better than the alternatives, but not good enough to stop here. In fact, it hadn't even turned a profit yet.</p>
<p>The next step was to compare it against buy-and-hold and see whether it actually added anything useful.</p>
<h2 id="heading-comparing-bullish-unwind-against-buy-and-hold">Comparing Bullish Unwind Against Buy-and-Hold</h2>
<p>By this point, <code>bullish_unwind</code> had already beaten the other regime-based variants. But that still did not mean much on its own.</p>
<p>A strategy can look decent relative to weaker alternatives and still fail the most basic test: does it do anything better than just holding crude oil?</p>
<p>So the next step was to compare the raw <code>bullish_unwind</code> strategy against a simple buy-and-hold benchmark.</p>
<p>I started by building the buy-and-hold curve from the weekly WTI price series.</p>
<pre><code class="language-python">buy_hold_df = weekly_price.copy()
buy_hold_df = buy_hold_df.sort_values("price_date").reset_index(drop=True)
buy_hold_df["buy_hold_curve"] = buy_hold_df["close"] / buy_hold_df["close"].iloc[0]

buy_hold_df[["price_date", "close", "buy_hold_curve"]].tail()
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/c0a025b3-364e-46a0-b136-d24336010c52.png" alt="buy/hold data" style="display:block;margin:0 auto" width="1050" height="646" loading="lazy">

<p>Then I plotted buy-and-hold against the raw <code>bullish_unwind</code> strategy.</p>
<pre><code class="language-python">plt.plot(buy_hold_df["price_date"], buy_hold_df["buy_hold_curve"], label="buy and hold wti", linewidth=2, alpha=0.5)
plt.plot(bullish_unwind_trades["exit_date"], bullish_unwind_trades["equity_curve"], label="bullish unwind strategy", color="b")
plt.title("Bullish unwind strategy vs buy and hold crude oil")
plt.xlabel("date")
plt.ylabel("equity multiple")
plt.legend()
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/7de51477-a1b3-4ab4-b5c3-b82589f907b9.png" alt="Bullish unwind strategy vs buy and hold crude oil" style="display:block;margin:0 auto" width="1814" height="854" loading="lazy">

<p>The chart was useful because it showed the exact problem with the raw signal. <code>bullish_unwind</code> was more selective than buy-and-hold, but that selectivity was not creating a real edge. The strategy had some decent stretches, but it still lagged the simpler benchmark overall.</p>
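<p>One caveat about the chart: the strategy curve only has a point at each trade exit, while buy-and-hold has one every week. If you want both on the same weekly axis, one option is to forward-fill the trade equity onto the weekly dates. This is a sketch with toy dates and returns (all hypothetical); the column names simply mirror the ones used above.</p>

```python
import pandas as pd

# Toy weekly dates and a toy trade log (hypothetical values)
weekly_dates = pd.date_range("2024-01-05", periods=10, freq="W-FRI")
trade_log = pd.DataFrame({
    "exit_date": [weekly_dates[3], weekly_dates[8]],
    "trade_return": [0.05, -0.02],
})
trade_log["equity_curve"] = (1 + trade_log["trade_return"]).cumprod()

# Index equity by exit date, then carry the last known value forward week by week
equity_by_exit = trade_log.set_index("exit_date")["equity_curve"]
weekly_equity = equity_by_exit.reindex(weekly_dates, method="ffill")

# Before the first exit there is no realized trade yet, so start at 1.0
weekly_equity = weekly_equity.fillna(1.0)
print(weekly_equity)
```

<p>The result is a step function of realized equity. It doesn't mark open trades to market, so it will look flatter than a true week-by-week curve.</p>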
<p>To make that comparison more explicit, I calculated the full buy-and-hold return over the sample and put both results into one small summary table.</p>
<pre><code class="language-python">buy_hold_return = buy_hold_df["buy_hold_curve"].iloc[-1] - 1

comparison_summary = pd.DataFrame({
    "strategy": ["bullish_unwind", "buy_and_hold"],
    "trades": [len(bullish_unwind_trades), np.nan],
    "win_rate": [(bullish_unwind_trades["trade_return"] &gt; 0).mean(), np.nan],
    "avg_trade_return": [bullish_unwind_trades["trade_return"].mean(), np.nan],
    "cumulative_return": [
        (1 + bullish_unwind_trades["trade_return"]).prod() - 1,
        buy_hold_return
    ]
})

comparison_summary
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/fe0f4949-ac97-4918-a388-43092f3215c5.png" alt="strategy vs b/h returns comparison" style="display:block;margin:0 auto" width="1100" height="174" loading="lazy">

<p>This was the real turning point in the article.</p>
<p>Even though <code>bullish_unwind</code> was the best regime-based candidate so far, it still underperformed buy-and-hold. That made the conclusion very clear: the raw signal wasn't strong enough yet.</p>
<p>So this was no longer a question of choosing between regimes. That part was already settled. The real question now was whether the <code>bullish_unwind</code> setup could be improved without turning the strategy into something over-engineered.</p>
<p>That's what led to the next step: adding a simple trend filter.</p>
<h2 id="heading-adding-a-trend-filter">Adding a Trend Filter</h2>
<p>At this point, the core signal had been narrowed to <code>bullish_unwind</code>, but the raw version still wasn't good enough. It underperformed buy-and-hold, which meant the signal needed more context.</p>
<p>The next idea was simple: not every bullish unwind should be treated the same way. If speculative positioning is starting to unwind while crude oil is already in a weak broader trend, that long signal may not be worth taking. So I added one basic filter: only take the <code>bullish_unwind</code> trade when WTI is above its 26-week moving average.</p>
<p>First, I created the moving average and a binary trend flag. Then I combined that filter with the existing <code>bullish_unwind</code> regime.</p>
<pre><code class="language-python">merged_df["ma_26"] = merged_df["close"].rolling(26).mean()
merged_df["above_ma_26"] = (merged_df["close"] &gt; merged_df["ma_26"]).astype(int)
merged_df["long_bullish_unwind_tf"] = ((merged_df["regime"] == "bullish_unwind") &amp; (merged_df["above_ma_26"] == 1)).astype(int)
</code></pre>
<p>This creates a filtered version of the original signal. Summing the new <code>long_bullish_unwind_tf</code> column shows how many trade opportunities remain after applying the trend filter. As expected, the number drops. That isn't a problem if the remaining trades are better.</p>
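<p>Here is a small, self-contained sketch of that before-and-after count. The prices and signal rows are made up, and <code>raw_signal</code> and <code>filtered_signal</code> are hypothetical stand-ins for the article's signal columns. It also shows a side effect of the 26-week window: the first 25 rows have no moving average, so they can never pass the filter.</p>

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): 40 weekly closes in a steady uptrend
toy = pd.DataFrame({"close": np.linspace(50, 89, 40)})
toy["raw_signal"] = 0
toy.loc[[5, 20, 30, 35], "raw_signal"] = 1  # made-up signal rows

toy["ma_26"] = toy["close"].rolling(26).mean()
# Comparisons against NaN are False, so the first 25 rows can never pass the filter
toy["above_ma_26"] = (toy["close"] > toy["ma_26"]).astype(int)
toy["filtered_signal"] = ((toy["raw_signal"] == 1) & (toy["above_ma_26"] == 1)).astype(int)

print("warm-up rows without a moving average:", int(toy["ma_26"].isna().sum()))
print("raw signals:", int(toy["raw_signal"].sum()))
print("signals after trend filter:", int(toy["filtered_signal"].sum()))
```

<p>In this toy case two of the four signals fall in the warm-up period and are dropped, even though the price is rising the whole time. On a long sample this matters little, but it is worth knowing when the first signals appear early in the data.</p>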
<p>Next, I ran the same 4-week non-overlapping backtest on the filtered signal.</p>
<pre><code class="language-python">bullish_unwind_tf_trades = run_fixed_hold_backtest(
    merged_df,
    "long_bullish_unwind_tf",
    hold_weeks=4
)

filtered_summary = pd.DataFrame({
    "bullish_unwind": summarize_trades(bullish_unwind_trades),
    "bullish_unwind_tf": summarize_trades(bullish_unwind_tf_trades)
}).T

filtered_summary
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/7ab5d6b1-6ebc-4d6a-870a-a9b4048b5386.png" alt="original vs optimized strategy performance" style="display:block;margin:0 auto" width="1535" height="210" loading="lazy">

<p>This was the first major improvement in the process.</p>
<p>The filtered version didn't just look slightly better. It changed the profile of the strategy in a meaningful way:</p>
<ul>
<li><p>fewer trades</p>
</li>
<li><p>higher win rate</p>
</li>
<li><p>higher average trade return</p>
</li>
<li><p>much stronger cumulative return</p>
</li>
</ul>
<p>That was exactly what I wanted from a filter. It made the signal more selective, but it also made it much cleaner.</p>
<p>To visualize the difference, I added equity curves for the raw strategy, the filtered version, and buy-and-hold.</p>
<pre><code class="language-python">bullish_unwind_tf_trades["equity_curve"] = (1 + bullish_unwind_tf_trades["trade_return"]).cumprod()

plt.plot(bullish_unwind_trades["exit_date"], bullish_unwind_trades["equity_curve"], label="bullish unwind")
plt.plot(bullish_unwind_tf_trades["exit_date"], bullish_unwind_tf_trades["equity_curve"], label="bullish unwind + trend filter")
plt.plot(buy_hold_df["price_date"], buy_hold_df["buy_hold_curve"], label="buy and hold wti")
plt.title("Bullish unwind strategy with and without trend filter")
plt.xlabel("date")
plt.ylabel("equity multiple")
plt.legend()
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/b1bda6f8-5018-4747-941f-144dc8f8960b.png" alt="Bullish unwind strategy with and without trend filter" style="display:block;margin:0 auto" width="1832" height="854" loading="lazy">

<p>This chart made the change easy to see. The raw strategy was drifting, while the filtered version was much more stable and clearly stronger over the full sample.</p>
<p>So this was the point where the strategy started becoming usable. The signal was no longer just “extreme bullish positioning is starting to unwind.” It was: <strong>extreme bullish positioning is starting to unwind, while crude oil is still in a broader uptrend</strong>.</p>
<p>That was much more specific, and much more effective.</p>
<p>The next question was whether this improved version was actually stable, or whether it only worked because of one lucky parameter choice.</p>
<h2 id="heading-stress-testing-the-setup">Stress-Testing the Setup</h2>
<p>Once the trend filter improved the strategy, I still didn't want to treat that version as final without checking how fragile it was.</p>
<p>A setup can look strong simply because one exact combination of parameters happened to work. So the next step was to test nearby variations and see whether the result still held up.</p>
<p>I kept the core idea the same:</p>
<ul>
<li><p>bullish unwind</p>
</li>
<li><p>long-only</p>
</li>
<li><p>trend filter stays on</p>
</li>
</ul>
<p>Then I varied three things:</p>
<ul>
<li><p>the percentile window</p>
</li>
<li><p>the threshold that defines an extreme</p>
</li>
<li><p>the holding period</p>
</li>
</ul>
<p>First, I created a helper function to build bullish unwind signals from different percentile columns and threshold levels. Then I added a second percentile series based on a shorter 52-week window.</p>
<pre><code class="language-python">def add_bullish_unwind_signal(df, percentile_col, high_threshold, signal_name):
    # Adds the signal column to df in place: extreme positioning, falling ratio, price above trend
    df[signal_name] = (
        (df[percentile_col] &gt; high_threshold) &amp;
        (df["net_position_ratio_change"] &lt; 0) &amp;
        (df["above_ma_26"] == 1)
    ).astype(int)


def rolling_percentile(x):
    # Percentile rank of the newest value within its rolling window
    return pd.Series(x).rank(pct=True).iloc[-1]


merged_df["position_percentile_52"] = merged_df["net_position_ratio"].rolling(52).apply(rolling_percentile)
</code></pre>
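<p>If the <code>rolling_percentile</code> helper looks opaque, this toy example may help. It uses a 4-observation window instead of 52 and invented ratio values, purely so the numbers can be checked by hand.</p>

```python
import pandas as pd

def rolling_percentile(x):
    # Rank every value in the window, then keep the rank of the newest one
    return pd.Series(x).rank(pct=True).iloc[-1]

# Toy series (hypothetical values) with a 4-observation window instead of 52
ratios = pd.Series([0.10, 0.30, 0.20, 0.40, 0.05, 0.25])
pct_4 = ratios.rolling(4).apply(rolling_percentile)
print(pct_4)
```

<p>Each value is ranked only against the window that ends on that row, so the percentile never looks ahead. The first three rows are empty because the window isn't full yet, and the lowest possible reading is one divided by the window length rather than zero.</p>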
<p>With that in place, I built four signal variants:</p>
<ul>
<li><p>104-week percentile with an 80th percentile threshold</p>
</li>
<li><p>104-week percentile with an 85th percentile threshold</p>
</li>
<li><p>52-week percentile with an 80th percentile threshold</p>
</li>
<li><p>52-week percentile with an 85th percentile threshold</p>
</li>
</ul>
<pre><code class="language-python">add_bullish_unwind_signal(merged_df, "position_percentile_104", 0.80, "sig_104_80")
add_bullish_unwind_signal(merged_df, "position_percentile_104", 0.85, "sig_104_85")
add_bullish_unwind_signal(merged_df, "position_percentile_52", 0.80, "sig_52_80")
add_bullish_unwind_signal(merged_df, "position_percentile_52", 0.85, "sig_52_85")
</code></pre>
<p>After that, I ran the same backtest across three holding periods:</p>
<ul>
<li><p>2 weeks</p>
</li>
<li><p>4 weeks</p>
</li>
<li><p>8 weeks</p>
</li>
</ul>
<pre><code class="language-python">results = []

for signal_col in ["sig_104_80", "sig_104_85", "sig_52_80", "sig_52_85"]:
    for hold_weeks in [2, 4, 8]:
        trades = run_fixed_hold_backtest(merged_df, signal_col, hold_weeks=hold_weeks)

        if len(trades) == 0:
            continue

        results.append({
            "signal": signal_col,
            "hold_weeks": hold_weeks,
            "trades": len(trades),
            "win_rate": (trades["trade_return"] &gt; 0).mean(),
            "avg_trade_return": trades["trade_return"].mean(),
            "median_trade_return": trades["trade_return"].median(),
            "cumulative_return": (1 + trades["trade_return"]).prod() - 1
        })

stress_test = pd.DataFrame(results)
stress_test
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/ee70c28c-86a6-4ede-821f-cde23b36cad9.png" alt="backtest across three holding periods" style="display:block;margin:0 auto" width="1675" height="851" loading="lazy">

<p>This output was one of the most important parts of the entire article. It showed whether the improved strategy was actually stable, or whether it only worked in one narrow version.</p>
<p>A few things stood out immediately.</p>
<p>The <strong>104-week / 80th percentile</strong> version was clearly the strongest family. It held up across all three holding periods:</p>
<ul>
<li><p>2-week hold: cumulative return <strong>38.16%</strong></p>
</li>
<li><p>4-week hold: cumulative return <strong>45.95%</strong></p>
</li>
<li><p>8-week hold: cumulative return <strong>19.02%</strong></p>
</li>
</ul>
<p>That consistency mattered. It meant the signal wasn't collapsing the moment the hold period changed.</p>
<p>The <strong>4-week hold</strong> stood out as the best overall choice. It had:</p>
<ul>
<li><p><strong>26 trades</strong></p>
</li>
<li><p><strong>65.38% win rate</strong></p>
</li>
<li><p><strong>1.84% average trade return</strong></p>
</li>
<li><p><strong>3.69% median trade return</strong></p>
</li>
<li><p><strong>45.95% cumulative return</strong></p>
</li>
</ul>
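<p>One way to read these figures together is to ask what constant per-trade return would compound to the same cumulative result. The short calculation below uses only the numbers quoted in the list above; the variable names are mine.</p>

```python
# Figures quoted above for the 104-week / 80th percentile, 4-week-hold variant
n_trades = 26
cumulative_return = 0.4595
avg_trade_return = 0.0184  # arithmetic mean per trade

# Constant per-trade return that would compound to the same cumulative result
geometric_trade_return = (1 + cumulative_return) ** (1 / n_trades) - 1

print("arithmetic mean per trade:", round(avg_trade_return, 4))
print("geometric mean per trade:", round(geometric_trade_return, 4))
```

<p>The compounded per-trade figure comes out a little below the 1.84% arithmetic average, which is what you would expect whenever returns vary from trade to trade.</p>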
<p>The <strong>8-week hold</strong> had a slightly higher average trade return in some cases, but it came with fewer trades. That made it thinner and harder to treat as the main version.</p>
<p>The <strong>104-week / 85th percentile</strong> setup was too restrictive for the shorter holds. Its 2-week and 4-week versions turned negative, even though the 8-week hold still worked reasonably well.</p>
<p>The <strong>52-week variants</strong> were much less convincing overall. A few of them were positive, but they were not nearly as stable as the 104-week / 80th percentile version.</p>
<p>So by the end of this step, the final structure wasn't just the version that happened to look good once. It was the version that kept holding up even after nearby variations were tested.</p>
<p>That gave me a clear final setup:</p>
<ul>
<li><p><strong>104-week percentile</strong></p>
</li>
<li><p><strong>80th percentile threshold</strong></p>
</li>
<li><p><strong>bullish unwind</strong></p>
</li>
<li><p><strong>26-week moving average filter</strong></p>
</li>
<li><p><strong>4-week hold</strong></p>
</li>
</ul>
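<p>A long table like <code>stress_test</code> can be easier to scan as a grid with one row per signal and one column per holding period. The sketch below shows one way to do that with <code>pivot</code>. The signal names and numbers in <code>stress_demo</code> are placeholders invented for illustration; they are not the results from the table above.</p>

```python
import pandas as pd

# Placeholder rows in the same long format as stress_test.
# The signal names and numbers are invented for illustration only.
stress_demo = pd.DataFrame({
    "signal": ["sig_a", "sig_a", "sig_a", "sig_b", "sig_b", "sig_b"],
    "hold_weeks": [2, 4, 8, 2, 4, 8],
    "cumulative_return": [0.10, 0.15, 0.05, -0.03, 0.02, 0.08],
})

# One row per signal, one column per holding period
grid = stress_demo.pivot(index="signal", columns="hold_weeks", values="cumulative_return")

# A simple robustness flag: does the variant stay positive for every hold period?
hold_cols = [2, 4, 8]
grid["all_positive"] = grid[hold_cols].gt(0).all(axis=1)
print(grid)
```

<p>A flag like <code>all_positive</code> is a crude check, but it matches the reasoning used here: prefer the variant that keeps working when the holding period changes.</p>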
<h2 id="heading-the-final-strategy">The Final Strategy</h2>
<p>By this stage, the process had already done most of the filtering.</p>
<p>The raw four-regime framework didn't work well as a strategy. The broader unwind idea didn't work either. The raw <code>bullish_unwind</code> signal was better than the alternatives, but still weaker than buy-and-hold.</p>
<p>The only version that held up after all of that was this one:</p>
<ul>
<li><p>bullish unwind</p>
</li>
<li><p>104-week positioning percentile</p>
</li>
<li><p>80th percentile threshold</p>
</li>
<li><p>26-week moving average filter</p>
</li>
<li><p>4-week hold</p>
</li>
<li><p>non-overlapping trades</p>
</li>
</ul>
<p>So now it made sense to stop iterating and show the final result clearly. I first locked the final signal and reran the backtest using the chosen setup.</p>
<pre><code class="language-python">final_signal = "sig_104_80"
final_hold = 4
final_trades = run_fixed_hold_backtest(merged_df, final_signal, hold_weeks=final_hold)
final_trades["equity_curve"] = (1 + final_trades["trade_return"]).cumprod()

final_summary = pd.DataFrame({
    "metric": [
        "trades",
        "win_rate",
        "avg_trade_return",
        "median_trade_return",
        "cumulative_return"
    ],
    "value": [
        len(final_trades),
        (final_trades["trade_return"] &gt; 0).mean(),
        final_trades["trade_return"].mean(),
        final_trades["trade_return"].median(),
        (1 + final_trades["trade_return"]).prod() - 1
    ]
})

final_summary
</code></pre>
<p>That output gives the final performance profile:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/f7f5219d-233d-4fe7-8ac9-2cee2026feeb.png" alt="final performance profile" style="display:block;margin:0 auto" width="674" height="501" loading="lazy">

<p>Those numbers were already a big improvement over the earlier raw versions, but I still wanted the comparison in one place. So I built a final table against the two reference points:</p>
<ul>
<li><p>buy-and-hold</p>
</li>
<li><p>raw bullish unwind</p>
</li>
</ul>
<pre><code class="language-python">final_comparison = pd.DataFrame({
    "strategy": ["buy_and_hold", "bullish_unwind_raw", "bullish_unwind_filtered"],
    "trades": [
        np.nan,
        len(bullish_unwind_trades),
        len(final_trades)
    ],
    "win_rate": [
        np.nan,
        (bullish_unwind_trades["trade_return"] &gt; 0).mean(),
        (final_trades["trade_return"] &gt; 0).mean()
    ],
    "avg_trade_return": [
        np.nan,
        bullish_unwind_trades["trade_return"].mean(),
        final_trades["trade_return"].mean()
    ],
    "cumulative_return": [
        buy_hold_return,
        (1 + bullish_unwind_trades["trade_return"]).prod() - 1,
        (1 + final_trades["trade_return"]).prod() - 1
    ]
})

final_comparison
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/2b7a3779-1701-4221-9bd2-df0a4ac22de7.png" alt="final performance comparison table" style="display:block;margin:0 auto" width="1537" height="345" loading="lazy">

<p>This was the full payoff of the build:</p>
<ul>
<li><p>buy-and-hold: 13.67%</p>
</li>
<li><p>raw bullish unwind: -2.13%</p>
</li>
<li><p>filtered bullish unwind: 45.95%</p>
</li>
</ul>
<p>The trend filter didn't just smooth the strategy a bit. It changed the result completely.</p>
<p>To make that visible, I plotted the three curves together.</p>
<pre><code class="language-python">plt.plot(buy_hold_df["price_date"], buy_hold_df["buy_hold_curve"], label="buy and hold wti", linewidth=2, alpha=0.5)
plt.plot(bullish_unwind_trades["exit_date"], bullish_unwind_trades["equity_curve"], label="raw bullish unwind", color="indigo")
plt.plot(final_trades["exit_date"], final_trades["equity_curve"], label="filtered bullish unwind", color="b")
plt.title("Crude oil strategy comparison")
plt.xlabel("date")
plt.ylabel("equity multiple")
plt.legend()
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/f4e50969-c1b3-441e-bc7c-5e90327ef9f0.png" alt="Crude oil strategy comparison" style="display:block;margin:0 auto" width="1808" height="847" loading="lazy">

<p>This chart says the same thing as the table, but more directly. The raw signal drifts. Buy-and-hold is positive over the full sample, but much noisier. The filtered version is the only one that compounds in a cleaner way.</p>
<p>I also wanted to show where these filtered trades actually appear on the WTI chart.</p>
<pre><code class="language-python">plt.plot(merged_df["price_date"], merged_df["close"], label="wti close", linewidth=2, alpha=0.5)
plt.scatter(merged_df.loc[merged_df[final_signal] == 1, "price_date"], merged_df.loc[merged_df[final_signal] == 1, "close"],
            s=25, label="filtered bullish unwind signal", color="b")
plt.title("Filtered bullish unwind signals on WTI crude oil")
plt.xlabel("date")
plt.ylabel("price")
plt.legend()
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/c688c947-2819-47af-a825-13c0bac7b530.png" alt="Filtered bullish unwind signals on WTI crude oil" style="display:block;margin:0 auto" width="1804" height="845" loading="lazy">

<p>This is useful because it shows the strategy is selective. It doesn't fire all the time. It only activates when positioning stays in an extreme bullish zone, starts to unwind, and the broader price trend is still intact.</p>
<p>I did the same on the positioning side.</p>
<pre><code class="language-python">plt.plot(merged_df["price_date"], merged_df["position_percentile_104"], label="104-week percentile", linewidth=2, alpha=0.5)
plt.axhline(0.8, linestyle="--", label="80th percentile")
plt.scatter(merged_df.loc[merged_df[final_signal] == 1, "price_date"], merged_df.loc[merged_df[final_signal] == 1, "position_percentile_104"],
            s=25, label="trade signals", color="indigo")
plt.title("Bullish unwind signals from COT positioning extremes")
plt.xlabel("date")
plt.ylabel("percentile")
plt.legend()
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f362fe21017f7317167b14c/85f8ae62-60ca-4de5-8074-213eb5296f92.png" alt="Bullish unwind signals from COT positioning extremes" style="display:block;margin:0 auto" width="1809" height="844" loading="lazy">

<p>This final chart ties everything together. The trades only appear when the percentile is already in the extreme zone, which means the signal is still doing what it was originally designed to do. It's just doing it in a much more disciplined way than the raw regime framework.</p>
<h2 id="heading-further-improvements">Further Improvements</h2>
<p>There are still a few places where this can be pushed further.</p>
<p>The first is execution realism. Right now the strategy uses a clean weekly entry and exit rule, but it doesn't include slippage, spreads, or any contract-level execution constraints. Adding those would make the result stricter.</p>
<p>The second is signal depth. This version only uses non-commercial positioning, a trend filter, and a fixed hold period. It would be worth testing whether commercial positioning, volatility filters, or dynamic exits can improve the setup without overcomplicating it.</p>
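<p>The execution-realism point is the easiest one to prototype. A minimal sketch, assuming a flat percentage cost on each entry and each exit: the 0.1% figure and the trade returns below are hypothetical, and real crude oil futures costs would also depend on contract rolls and liquidity.</p>

```python
import pandas as pd

# Hypothetical per-trade returns and a hypothetical cost assumption
trade_returns = pd.Series([0.04, -0.02, 0.03, 0.01])
cost_per_side = 0.001  # 0.1% per entry or exit, an assumed figure

# Apply the cost as a haircut on both the entry and the exit of each trade
net_returns = (1 + trade_returns) * (1 - cost_per_side) ** 2 - 1

gross_cumulative = (1 + trade_returns).prod() - 1
net_cumulative = (1 + net_returns).prod() - 1

print("gross cumulative return:", round(gross_cumulative, 4))
print("net cumulative return:", round(net_cumulative, 4))
```

<p>Applied to a real trade list, the same adjustment could be made on the <code>trade_return</code> column before calling <code>summarize_trades</code>. Because the cost compounds with every trade, variants with more trades are penalized more.</p>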
<h2 id="heading-conclusion">Conclusion</h2>
<p>This started as a broad COT idea, not a finished strategy. The first regime framework looked reasonable, but most of it didn't hold up once the data was tested. That part was important, because it made the final signal much narrower and much cleaner.</p>
<p>What survived was a very specific setup: extreme bullish positioning that starts to unwind, while WTI is still above its 26-week moving average. That version ended up outperforming both the raw signal and buy-and-hold over the tested sample.</p>
<p>The nice part is that the whole thing can be built from scratch with FinancialModelingPrep’s COT and commodity price data APIs, without needing to patch together multiple data sources. That made it much easier to go from idea to actual testing.</p>
<p>With that being said, you’ve reached the end of the article. Hope you learned something new and useful. Thank you for your time.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
