How to Build a Production-Safe Agent Loop: From Exit Conditions to Audit Trails

Daniel Nwaneri — Mon, 15 Jun 2026 23:18:49 +0000

In July 2025, a Claude Code recursion loop burned between 16,000 USD and 50,000 USD in five hours. There was no crash or error, just agents doing exactly what they were told, indefinitely, because nobody told them when to stop.

Four months later, a four-agent LangChain loop ran for eleven days and cost 47,000 USD. Nobody noticed until the invoice arrived. The pipeline worked correctly in testing, and the agents were doing exactly what they were told. Same pattern.

This tutorial is about that missing instruction.

You'll build five small Python primitives that catch most agent loop failures before they ship:

A spec writer that forces you to define done before the loop starts
A circuit breaker that kills the loop when it exceeds hard limits
A ledger that records every turn in an append-only SQLite audit trail
An agent loop that ties all three together
A review surface that forces human attestation before downstream systems receive anything

By the end you'll have a working repo you can drop into any agent project. The full code is at github.com/dannwaneri/production-safe-agent-loop.

Why This Keeps Happening
Prerequisites
Phase 1: Define Done Before You Build
Phase 2: Enforce Done at Runtime
Phase 3: Record Everything
Phase 4: The Loop That Respects Its Boundaries
Phase 5: The Review Surface
Phase 6: A Real Example, SEO Audit Agent
Pluggable LLM Client
Running the Tests
What You've Built
Next Steps

Why This Keeps Happening

The math that got companies into trouble was simple. A chatbot costs roughly 0.04 USD per interaction. An orchestrated multi-agent workflow costs 1.20 USD. That's a 30x multiplier — and production benchmarks show it can reach 70x on complex tasks.

The problem isn't that agents are expensive. The problem is that most teams budgeted for chatbot costs and deployed agent architectures. Gartner found the token consumption gap between pilot chatbots and production agent workflows sits at 5-30x. The FinOps Foundation's 2026 State of FinOps report found 73% of enterprises say AI costs exceeded original projections.
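
To make the budgeting gap concrete, here is a rough sketch of the arithmetic. The per-interaction figures are the ones quoted above; the monthly volume of 10,000 runs is a made-up number for illustration only, and the function name is hypothetical.

```python
# Costs are kept in integer cents to avoid float rounding surprises.
CHATBOT_CENTS = 4    # roughly 0.04 USD per chatbot interaction (figure from the text)
AGENT_CENTS = 120    # roughly 1.20 USD per orchestrated workflow (figure from the text)


def monthly_cost_usd(cents_per_run: int, runs_per_month: int) -> float:
    """Monthly spend in USD for a given per-run cost and run count."""
    return cents_per_run * runs_per_month / 100


multiplier = AGENT_CENTS / CHATBOT_CENTS
runs = 10_000  # hypothetical monthly volume, chosen only for illustration
budgeted = monthly_cost_usd(CHATBOT_CENTS, runs)
actual = monthly_cost_usd(AGENT_CENTS, runs)

print(f"multiplier: {multiplier:.0f}x")
print(f"budgeted: {budgeted:,.0f} USD, actual: {actual:,.0f} USD")
```

The same traffic that was budgeted at chatbot prices lands at thirty times the cost once it runs through an agent workflow, before any retry loop is involved.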

The mechanism is straightforward once you see it. When an agent fails a task and retries, it doesn't start fresh. It re-reads the entire context window, including every prior failed attempt, before trying again. Iteration one costs 100 tokens. Iteration two costs 200. By iteration ten, the cumulative bill runs into the thousands. You're paying for every failure, over and over.
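
A minimal sketch of that growth, assuming each attempt adds a fixed 100 tokens and every retry re-reads all earlier attempts. Real token counts vary from turn to turn; the function name and the numbers are illustrative only.

```python
def tokens_per_iteration(attempt_tokens: int, iterations: int) -> list[int]:
    """Tokens billed on each iteration when every retry re-reads all prior attempts."""
    return [attempt_tokens * i for i in range(1, iterations + 1)]


per_turn = tokens_per_iteration(attempt_tokens=100, iterations=10)
total_with_reread = sum(per_turn)   # every retry pays for the whole history again
total_if_fresh = 100 * 10           # what ten independent attempts would cost

print(per_turn[0], per_turn[1], per_turn[-1])
print(total_with_reread, total_if_fresh)
```

Under these assumptions the tenth iteration alone costs 1,000 tokens and the run totals 5,500, against 1,000 if each attempt started from a clean context. Per-turn cost grows linearly, so the cumulative bill grows quadratically with the number of retries.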

# This is the entire problem in three lines
while True:
    result = agent.run(task)
    # done when...?

That question mark is where the money goes.

The other thing making it worse: agents don't fail loudly. Traditional code hits an undefined state and crashes. An LLM hits ambiguity and tries to be helpful. It retries. It reformats the tool call. It spins up a verification agent. The verification agent finds something. A correction agent fires. Nobody defined what "correct" means. The loop looks beautiful on every dashboard you have — activity, tool calls, completion rate — while quietly burning through your budget.

Gartner predicts that 40% of agentic projects will be scrapped by 2027 due to economic failure. Most of that failure is preventable. Not with better models, but with exit conditions.

Prerequisites

Python 3.10+
An Anthropic API key (or any provider — more on that later)
Basic familiarity with Python classes and SQLite

git clone https://github.com/dannwaneri/production-safe-agent-loop
cd production-safe-agent-loop
pip install -r requirements.txt
export ANTHROPIC_API_KEY=sk-...

Phase 1: Define Done Before You Build

The most expensive mistake in agent development isn't a bad model choice or a missing retry limit. It's starting the build before you can answer one question in one sentence:

What does done look like?

Most teams can't answer it. Not because they're careless, but because nothing forces them to before they open the terminal. The spec writer is that forcing function.

# spec_writer.py
from spec_writer import SpecWriter

spec = SpecWriter(db_path="spec.db").run()

When you call .run(), it won't return until you've answered three questions:

What does this do?
What does this NOT do?
What does done look like in one sentence?
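
The forcing behaviour can be sketched in a few lines. This is not the repo's SpecWriter, just a minimal illustration of the idea, assuming an empty answer is the only thing rejected; `ask_until_answered` and `collect_spec` are hypothetical names, and the injectable `input_fn` / `output_fn` parameters are there so the prompt can be driven by a script instead of stdin.

```python
from typing import Callable

QUESTIONS = (
    "What does this do?",
    "What does this NOT do?",
    "What does done look like in one sentence?",
)


def ask_until_answered(
    question: str,
    input_fn: Callable[[str], str] = input,
    output_fn: Callable[[str], None] = print,
) -> str:
    """Re-prompt until the answer is non-empty."""
    while True:
        answer = input_fn(f"{question} ").strip()
        if answer:
            return answer
        output_fn("An answer is required before the loop can start.")


def collect_spec(input_fn=input, output_fn=print) -> dict[str, str]:
    """Ask all three questions in order and return the answers keyed by question."""
    return {q: ask_until_answered(q, input_fn, output_fn) for q in QUESTIONS}


# Scripted run: the first (empty) answer is rejected and the question is asked again.
scripted = iter([
    "",
    "Audits one page for SEO issues.",
    "Does not fix anything.",
    "Stops after flagging missing tags.",
])
spec = collect_spec(input_fn=lambda _prompt: next(scripted), output_fn=lambda _msg: None)
print(spec["What does done look like in one sentence?"])
```

The sketch only checks that something was typed. Whether the answer is specific enough to enforce is still a judgment the author has to make.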

<p>The third question is the one that matters. It's also the hardest. "The agent audits the site" is not an answer. "The agent crawls the target URL, extracts all <code><title></code> and <code><meta description></code> tags, flags any missing or over-length, and stops" is an answer. Only the second gives the circuit breaker something to enforce.</p>
<p>The spec is stored in SQLite, and <code>.run()</code> returns a <code>SpecResult</code> dataclass with a <code>session_id</code>. That ID becomes the thread connecting your spec, your ledger rows, and your loop result. One session, traceable end to end.</p>
<pre><code class="language-python">from dataclasses import dataclass


@dataclass(frozen=True)
class SpecResult:
    what_it_does: str
    what_it_does_not: str
    done_looks_like: str
    session_id: str
</code></pre>
<p><code>frozen=True</code> matters. The spec is a commitment, not a draft. Once it's written, the loop runs against it. No mid-run revisions.</p>
<p>For testing, <code>SpecWriter</code> accepts injectable <code>input_fn</code> and <code>output_fn</code> callables. No stdin monkey-patching required. See <code>tests/test_spec_writer.py</code> for working examples: the suite uses a small <code>scripted_input</code> helper that returns answers from a generator, and writes to a per-test SQLite file via pytest's <code>tmp_path</code> fixture. SQLite's <code>:memory:</code> isn't safe here, because <code>SpecWriter</code> opens a fresh connection per method and each <code>:memory:</code> connection is its own isolated database.</p>
<h2 id="heading-phase-2-enforce-done-at-runtime">Phase 2: Enforce Done at Runtime</h2>
<p>Defining the exit condition upstream is discipline. The circuit breaker is enforcement.</p>
<pre><code class="language-python"># circuit_breaker.py
from circuit_breaker import CircuitBreaker, CircuitBreakerError

breaker = CircuitBreaker(turn_limit=5, token_limit=15000)
breaker.check(turn_count, accumulated_tokens)  # raises on breach
</code></pre>
<p>Two ceilings.
Both hard.</p>
<p><code>turn_limit</code> caps how many times the loop can call the LLM. <code>token_limit</code> caps total token consumption across all turns. Either one tripping raises <code>CircuitBreakerError</code> immediately.</p>
<p>The boundary is strict: <code>turn_count == turn_limit</code> is allowed. <code>turn_count == turn_limit + 1</code> trips. No grace periods or warnings. A hard stop forces a human checkpoint.</p>
<pre><code class="language-python">from dataclasses import dataclass


@dataclass
class CircuitBreakerError(Exception):
    reason: str  # "turn_ceiling" or "token_ceiling"
    turn_count: int
    accumulated_tokens: int

    def __post_init__(self) -> None:
        super().__init__(
            f"circuit breaker tripped: {self.reason} "
            f"(turn={self.turn_count}, tokens={self.accumulated_tokens})"
        )


class CircuitBreaker:
    def __init__(self, turn_limit: int = 5, token_limit: int = 15000) -> None:
        self.turn_limit = turn_limit
        self.token_limit = token_limit

    def check(self, turn_count: int, accumulated_tokens: int) -> None:
        if turn_count > self.turn_limit:
            self._trip("turn_ceiling", turn_count, accumulated_tokens)
        if accumulated_tokens > self.token_limit:
            self._trip("token_ceiling", turn_count, accumulated_tokens)

    def _trip(self, reason: str, turn_count: int, accumulated_tokens: int) -> None:
        print(
            "\n=== CIRCUIT BREAKER CHECKPOINT ===\n"
            f"reason : {reason}\n"
            f"turn_count : {turn_count} / limit {self.turn_limit}\n"
            f"tokens_used : {accumulated_tokens} / limit {self.token_limit}\n"
            "action : halt loop, surface to human reviewer\n"
            "=================================="
        )
        raise CircuitBreakerError(
            reason=reason,
            turn_count=turn_count,
            accumulated_tokens=accumulated_tokens,
        )
</code></pre>
<p><code>CircuitBreakerError</code> is an exception, not a return code. That's intentional. A return code can be ignored. An uncaught exception can't. Silent breach is impossible.
The human-readable checkpoint banner is printed to stdout by <code>_trip()</code> <em>before</em> the exception is raised, so even if a caller swallows the exception the operator still sees state.</p>
<p>The critical rule: call <code>.check()</code> <strong>before</strong> every LLM call, not after. Post-flight checking means you've already burned the tokens before you knew the limit was exceeded.</p>
<pre><code class="language-python"># Wrong: post-flight
result = client.messages.create(...)
breaker.check(turn_count, accumulated_tokens)  # too late

# Right: pre-flight
breaker.check(turn_count, accumulated_tokens)  # raises before any spend
result = client.messages.create(...)
</code></pre>
<p>The defaults (5 turns, 15,000 tokens) match a tight tutorial demo. Your production budget is different. Tune at instantiation:</p>
<pre><code class="language-python"># Production example: larger token budget, more turns
breaker = CircuitBreaker(turn_limit=10, token_limit=50000)
</code></pre>
<h2 id="heading-phase-3-record-everything">Phase 3: Record Everything</h2>
<p>The circuit breaker protects your bank account. The ledger protects your understanding of what happened.</p>
<p>Most teams log for debugging: they want to know what went wrong after it went wrong. The ledger has a different purpose. It's governance. Every row is proof that the loop stayed within its boundaries, or didn't, and exactly when.</p>
<pre><code class="language-python"># ledger.py
from ledger import Ledger

ledger = Ledger(db_path="ledger.db")
ledger.write(
    session_id=spec.session_id,
    turn_count=1,
    state_origin="llm",
    input_str=task,
    token_delta=523,
    execution_time_ms=1240,
    pass_fail=True,
)
</code></pre>
<p>One row per turn. Append-only, no updates, and no deletes.
The immutability is the point: a ledger you can edit isn't a ledger, it's a notebook.</p>
<p>The schema:</p>
<pre><code class="language-sql">CREATE TABLE IF NOT EXISTS ledger (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id        TEXT    NOT NULL,
    turn_count        INTEGER NOT NULL,
    state_origin      TEXT    NOT NULL,
    input_hash        TEXT    NOT NULL,
    token_delta       INTEGER NOT NULL,
    execution_time_ms INTEGER NOT NULL,
    pass_fail         INTEGER NOT NULL,  -- 1=pass, 0=fail
    breach_reason     TEXT,              -- NULL unless circuit breaker fired
    created_at        TEXT    NOT NULL   -- ISO 8601, UTC
);

CREATE INDEX IF NOT EXISTS idx_ledger_session ON ledger(session_id);
</code></pre>
<p>The index turns <code>get_session(session_id)</code>, the primary read path, into a B-tree lookup instead of a full table scan, so reads stay fast as the ledger grows.</p>
<p>Three decisions worth explaining:</p>
<ol> <li><p><code>input_hash</code> <strong>not</strong> <code>input_text</code><strong>.</strong> The raw input string never persists. Only its SHA-256 hash does. There are two benefits to this: identical inputs across runs are detectable, and PII never enters the audit trail.</p> </li>
<li><p><code>pass_fail</code> <strong>as</strong> <code>INTEGER</code> <strong>not</strong> <code>BOOLEAN</code><strong>.</strong> SQLite has no boolean type. <code>1</code> and <code>0</code> are canonical. Clean Python ergonomics at the API edge, correct SQL types on disk.</p> </li>
<li><p><code>created_at</code> <strong>as</strong> <code>datetime.now(timezone.utc).isoformat()</code><strong>.</strong> <code>datetime.utcnow()</code> was deprecated in Python 3.12.
Timezone-aware timestamps avoid the footgun in any system that crosses timezones.</p> </li> </ol>
<p>Retrieve by session:</p>
<pre><code class="language-python">rows = ledger.get_session(spec.session_id)
for row in rows:
    print(f"Turn {row.turn_count}: {'PASS' if row.pass_fail else 'FAIL'} "
          f"| {row.token_delta} tokens | {row.execution_time_ms}ms")
</code></pre>
<h2 id="heading-phase-4-the-loop-that-respects-its-boundaries">Phase 4: The Loop That Respects Its Boundaries</h2>
<p>The agent loop wires the three primitives together. It's the only component that calls the LLM. Everything else is local.</p>
<pre><code class="language-python"># agent_loop.py
from agent_loop import AgentLoop

loop = AgentLoop(spec, breaker, ledger, client)
result = loop.run(task)
# LoopResult(success, turns, total_tokens, session_id, breach_reason)
</code></pre>
<p>The anatomy of a turn, in order:</p>
<ol> <li><p><code>circuit_breaker.check(turn_count, accumulated_tokens)</code>: raises if either ceiling is exceeded</p> </li>
<li><p><code>client.messages.create(...)</code>: the actual LLM call</p> </li>
<li><p><code>ledger.write(...)</code>: one row, append-only</p> </li>
<li><p>If <code>stop_reason == "end_turn"</code>, return.
Otherwise loop.</p> </li> </ol>
<p>Pre-flight checking before every LLM call, with no exceptions.</p>
<pre><code class="language-python">import time


class AgentLoop:
    # Constructor omitted here: it stores spec, circuit_breaker, ledger,
    # client, model, and max_tokens on self.

    def run(self, task: str) -> LoopResult:
        session_id = self.spec.session_id
        messages: list[dict] = [{"role": "user", "content": task}]
        turn = 0
        total_tokens = 0
        try:
            while True:
                turn += 1
                # Pre-flight: raises before any tokens are spent on this turn
                self.circuit_breaker.check(turn, total_tokens)

                started = time.perf_counter()
                response = self.client.messages.create(
                    model=self.model,
                    max_tokens=self.max_tokens,
                    system=self._system_prompt(),
                    messages=messages,
                )
                elapsed_ms = int((time.perf_counter() - started) * 1000)

                turn_tokens = (
                    getattr(response.usage, "input_tokens", 0)
                    + getattr(response.usage, "output_tokens", 0)
                )
                total_tokens += turn_tokens

                text = self._text_from(response)
                messages.append({"role": "assistant", "content": text})

                self.ledger.write(
                    session_id=session_id,
                    turn_count=turn,
                    state_origin="llm",
                    input_str=task,
                    token_delta=turn_tokens,
                    execution_time_ms=elapsed_ms,
                    pass_fail=True,
                )

                if getattr(response, "stop_reason", "end_turn") == "end_turn":
                    return LoopResult(
                        success=True,
                        turns=turn,
                        total_tokens=total_tokens,
                        session_id=session_id,
                    )

                messages.append({"role": "user", "content": "continue"})
        except CircuitBreakerError as err:
            # Record the breach as its own ledger row, then report it to the caller
            self.ledger.write(
                session_id=session_id,
                turn_count=turn,
                state_origin="circuit_breaker",
                input_str=task,
                token_delta=0,
                execution_time_ms=0,
                pass_fail=False,
                breach_reason=err.reason,
            )
            return LoopResult(
                success=False,
                turns=turn,
                total_tokens=total_tokens,
                session_id=session_id,
                breach_reason=err.reason,
            )

    def _system_prompt(self) -> str:
        return (
            "You are an agent working on a tightly-scoped task.\n\n"
            f"What this does: {self.spec.what_it_does}\n"
            f"What this does NOT do: {self.spec.what_it_does_not}\n"
            f"Done looks like: {self.spec.done_looks_like}\n"
        )

    @staticmethod
    def _text_from(response) -> str:
        content = getattr(response, "content", None)
        if not content:
            return ""
        block = content[0]
        return getattr(block, "text", "") or ""
</code></pre>
<p>A
few choices worth calling out in this body:</p> <ul> <li><p><strong>The whole</strong> <code>while True:</code> <strong>is wrapped in one</strong> <code>try/except CircuitBreakerError</code><strong>.</strong> The check happens at the top of every turn, so a breach is caught the same way whether it fires on turn 1 or turn 6.</p> </li> <li><p><code>input_str=task</code> on every ledger row — the original task, not the last assistant message. The <code>input_hash</code> column then groups rows that share the same starting input across the run.</p> </li> <li><p><code>pass_fail=True</code> <strong>for every LLM turn that returns</strong>, <code>False</code> only on breach. The pass/fail flag tracks whether the loop <em>reached</em> the row legitimately, not whether the model's output was good. Quality scoring is a separate concern.</p> </li> <li><p><code>_system_prompt()</code> <strong>uses all three spec fields</strong>, not just <code>done_looks_like</code>. The model needs the negative scope (<code>what_it_does_not</code>) at least as much as the positive scope.</p> </li> <li><p><code>time.perf_counter()</code> <strong>not</strong> <code>time.time()</code> — monotonic, immune to wall-clock adjustments mid-run.</p> </li> </ul> <p><code>LoopResult.session_id</code> is inherited from <code>spec.session_id</code>. The ledger rows tie back to the spec without a join. One session ID, one traceable run, start to finish.</p> <h2 id="heading-phase-5-the-review-surface">Phase 5: The Review Surface</h2> <p>The circuit breaker protects your bank account. The ledger records what happened. But neither tells you whether what happened matched what you promised.</p> <p>That gap is where bad loops get approved. Polished output, green dashboard, missed commitment. A reviewer sees the artifact, decides it looks acceptable, and signs off. Nobody asked whether the original promise was kept.</p> <p>The review surface closes that gap. 
It reads the session from SQLite, assembles the five-element frame, and forces a comparison before anything downstream receives the output.</p>
<pre><code class="language-python">from review_surface import ReviewSurface

rs = ReviewSurface(spec_db_path="spec.db", ledger_db_path="ledger.db")
print(rs.render(result.session_id))
</code></pre>
<p>Here's the five-element frame, in order:</p>
<ol> <li><p><strong>Original promise</strong>: pulled from the spec table (what it does, what it doesn't do, what done looks like)</p> </li>
<li><p><strong>Acceptance criteria</strong>: the <code>done_looks_like</code> field rendered as the explicit benchmark</p> </li>
<li><p><strong>Diff</strong>: first turn input vs final turn output, turns completed, total tokens, whether the loop breached</p> </li>
<li><p><strong>Evidence</strong>: all ledger rows for the session, with turn-by-turn pass/fail, token delta, and execution time</p> </li>
<li><p><strong>Unresolved assumptions</strong>: derived from breach rows and failed turns. Empty when clean.</p> </li> </ol>
<p>When the reviewer is satisfied, they attest:</p>
<pre><code class="language-python">attestation = rs.attest(
    session_id=result.session_id,
    reviewer="daniel",
    notes="Output matches spec. Approved.",
)
print(attestation.frame_hash)
</code></pre>
<p><code>.attest()</code> writes to the <code>attestations</code> table in <code>ledger.db</code>. The <code>frame_hash</code> is a SHA-256 of the canonical frame data, so it's deterministic across reviewers attesting the same session. It's the audit receipt. It proves the reviewer saw the exact frame as rendered, not a summary or a paraphrase.</p>
<p>Approval confirms the process ran. Attestation confirms the reviewer compared output to commitment.
When the loop touches something regulated, those are different legal documents.</p>
<pre><code class="language-python">from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewFrame:
    session_id: str
    original_promise: SpecResult
    acceptance_criteria: str
    diff: DiffResult
    evidence: tuple  # tuple[LedgerRow, ...]
    unresolved_assumptions: tuple  # tuple[str, ...]
    created_at: str
</code></pre>
<p><code>ReviewFrame</code> is frozen for the same reason <code>SpecResult</code> is: the frame is evidence, not a draft. <code>evidence</code> and <code>unresolved_assumptions</code> are tuples because <code>frozen=True</code> only blocks attribute reassignment. A list field could still be mutated in place, and it would make the frame unhashable, since the generated <code>__hash__</code> hashes every field.</p>
<p>The full end-to-end flow with the review surface lives in <code>examples/review_example.py</code> in the repo. Run it after any completed session: it renders the five-element frame, prompts for attestation, and writes the receipt if you approve.</p>
<p>The loop runs to you. Downstream systems get nothing until someone signs.</p>
<h2 id="heading-phase-6-a-real-example-seo-audit-agent">Phase 6: A Real Example — SEO Audit Agent</h2>
<p>The pattern only makes sense against a real problem. This is the same agent architecture behind my <a href="https://github.com/dannwaneri/seo-agent">seo-agent</a> project.</p>
<p>SEO audits have a natural cadence: crawl, surface what's broken, fix, wait for reindex. Running the agent continuously doesn't change that cadence. It just burns tokens in the empty space between the moments that matter.
A cron job wired to the loop is the honest architecture.</p>
<pre><code class="language-python"># examples/seo_audit_example.py
import requests
from bs4 import BeautifulSoup
import anthropic

from spec_writer import SpecWriter
from circuit_breaker import CircuitBreaker
from ledger import Ledger
from agent_loop import AgentLoop


def crawl_url(url: str) -> str:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail fast on 4xx/5xx instead of auditing an error page
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("title")
    meta_desc = soup.find("meta", attrs={"name": "description"})
    h1_tags = soup.find_all("h1")
    # .get() avoids a KeyError when the meta tag exists but has no content attribute
    description = meta_desc.get("content", "MISSING") if meta_desc else "MISSING"
    return (
        f"URL: {url}\n"
        f"Title: {title.text if title else 'MISSING'}\n"
        f"Meta description: {description}\n"
        f"H1 count: {len(h1_tags)}\n"
        f"H1 tags: {[h.text[:50] for h in h1_tags]}"
    )


def run_seo_audit(url: str) -> None:
    # Step 1: Define done before the loop starts
    spec = SpecWriter(db_path="spec.db").run()

    # Step 2: Initialise circuit breaker and ledger
    breaker = CircuitBreaker(turn_limit=5, token_limit=15000)
    ledger = Ledger(db_path="ledger.db")
    client = anthropic.Anthropic()

    # Step 3: Crawl the URL
    site_data = crawl_url(url)

    # Step 4: Run the loop
    # AgentLoop catches CircuitBreakerError internally and returns
    # LoopResult(success=False, breach_reason=...). Branch on the
    # result; do NOT wrap loop.run() in try/except CircuitBreakerError.
    loop = AgentLoop(spec, breaker, ledger, client)
    result = loop.run(
        f"Audit this page for SEO issues:\n\n{site_data}"
    )

    # Step 5: Print the ledger
    print(f"\nResult: {'SUCCESS' if result.success else 'BREACH'}")
    if not result.success:
        print(f"Breach reason: {result.breach_reason}")
    print(f"Turns: {result.turns} | Tokens: {result.total_tokens}")
    print("\nAudit trail:")
    for row in ledger.get_session(result.session_id):
        status = "PASS" if row.pass_fail else "FAIL"
        print(f"  Turn {row.turn_count}: {status} | "
              f"{row.token_delta} tokens | {row.execution_time_ms}ms")


if __name__ == "__main__":
    import sys
    run_seo_audit(sys.argv[1] if len(sys.argv) > 1 else "https://example.com")
</code></pre>
<p>Run it:</p>
<pre><code class="language-bash">python examples/seo_audit_example.py https://yourdomain.com
</code></pre>
<p>The spec writer prompts you. The loop runs, the circuit breaker fires if the limits are exceeded, and the ledger records every turn. The output lands in front of you and you decide what to fix.</p>
<p>The loop runs to you, not into a void.</p>
<h2 id="heading-pluggable-llm-client">Pluggable LLM Client</h2>
<p>The loop works with any client that satisfies the <code>LLMClient</code> protocol (Anthropic by default). Bring your own via a ~20-line adapter.</p>
<pre><code class="language-python"># agent_loop.py
from typing import Protocol, runtime_checkable


@runtime_checkable
class MessagesEndpoint(Protocol):
    def create(self, *, model: str, max_tokens: int, system: str, messages: list) -> object: ...


@runtime_checkable
class LLMClient(Protocol):
    messages: MessagesEndpoint
</code></pre>
<p><code>messages</code> is an instance attribute (not a nested class) because that's how the real Anthropic SDK exposes it: <code>anthropic.Anthropic().messages.create(...)</code>. Modeling it as a nested class would mean the real client wouldn't satisfy the Protocol.
The <code>@runtime_checkable</code> decorator lets you sanity-check conformance with <code>isinstance(client, LLMClient)</code>, and the repo's test suite uses exactly that assertion against the <code>FakeClient</code> test double.</p>
<p>Here's an illustrative OpenAI adapter. A production adapter would also map streaming, tool-use, and error shapes:</p>
<pre><code class="language-python"># openai_adapter.py: illustrative sketch, not production-ready.
from types import SimpleNamespace

from openai import OpenAI as _OpenAI


def _adapt_response(completion):
    # One possible reshaping (an assumption, not the repo's code): expose the
    # Anthropic-shaped surface AgentLoop reads, i.e. response.usage.{input,output}_tokens,
    # response.content[0].text, and response.stop_reason.
    choice = completion.choices[0]
    return SimpleNamespace(
        usage=SimpleNamespace(
            input_tokens=completion.usage.prompt_tokens,
            output_tokens=completion.usage.completion_tokens,
        ),
        content=[SimpleNamespace(text=choice.message.content or "")],
        # OpenAI reports a normal finish as "stop"; the loop checks for "end_turn"
        stop_reason="end_turn" if choice.finish_reason == "stop" else choice.finish_reason,
    )


class _MessagesAdapter:
    def __init__(self, client):
        self._client = client

    def create(self, *, model, max_tokens, system, messages):
        completion = self._client.chat.completions.create(
            model=model,
            max_tokens=max_tokens,
            # OpenAI takes the system prompt as the first message, not a top-level field
            messages=[{"role": "system", "content": system}] + messages,
        )
        return _adapt_response(completion)


class OpenAIAdapter:
    def __init__(self, api_key: str):
        self._client = _OpenAI(api_key=api_key)
        self.messages = _MessagesAdapter(self._client)  # instance attr, not a nested class
</code></pre>
<p>The adapter pattern is worth teaching explicitly. Provider APIs don't share a shape. Anthropic puts <code>system</code> at the top level. OpenAI puts it inside the messages array. An adapter shim is ~20 lines and makes the loop provider-agnostic without rewriting anything.
Note that <code>self.messages</code> is assigned in <code>__init__</code> so it's a real attribute on each adapter instance, the same shape as the actual SDK.</p>
<h2 id="heading-running-the-tests">Running the Tests</h2>
<pre><code class="language-bash">python -m pytest tests/
</code></pre>
<p>With coverage:</p>
<pre><code class="language-bash">python -m coverage run --source=circuit_breaker,ledger,spec_writer,agent_loop,review_surface -m pytest tests/
python -m coverage report -m
</code></pre>
<p>80 tests, 100% coverage on all five core modules. The loop is exercised against a <code>FakeClient</code> test double defined inline in <code>tests/test_agent_loop.py</code>. It satisfies the <code>LLMClient</code> protocol via duck typing: <code>messages</code> is set to <code>self</code>, so <code>client.messages.create(...)</code> routes back to the same object, which returns scripted responses for each test scenario. Clone the repo and run <code>pytest</code> to see all 80 tests pass without touching the network or needing an API key.</p>
<p><code>circuit_breaker.py</code> has 100% coverage, with no untested paths. It's the financial safety component.
Every path through it is exercised.</p>
<h2 id="heading-what-youve-built">What You've Built</h2>
<p>In this tutorial, you've built five small primitives, each independently usable.</p>
<table> <thead> <tr> <th>Module</th> <th>Role</th> <th>Lines</th> </tr> </thead>
<tbody><tr> <td><code>spec_writer.py</code></td> <td>Forces three answers before the loop runs</td> <td>104</td> </tr>
<tr> <td><code>circuit_breaker.py</code></td> <td>Hard ceilings on turns and tokens</td> <td>41</td> </tr>
<tr> <td><code>ledger.py</code></td> <td>Append-only SQLite audit trail</td> <td>113</td> </tr>
<tr> <td><code>agent_loop.py</code></td> <td>The loop that respects both</td> <td>128</td> </tr>
<tr> <td><code>review_surface.py</code></td> <td>Assembles the five-element frame, records human attestation</td> <td>114</td> </tr> </tbody></table>
<p>The pattern: upstream discipline defines the boundaries. Downstream enforcement breaks the circuit. Neither trusts the model to police itself.</p>
<p>A loop that runs without an exit condition isn't autonomous. It's a billing event waiting to happen.</p>
<p>Define what done looks like before you start. That's the job, and always has been.</p>
<h2 id="heading-next-steps">Next Steps</h2>
<p>The repo is at <a href="https://github.com/dannwaneri/production-safe-agent-loop">github.com/dannwaneri/production-safe-agent-loop</a>.</p>
<p>There are three natural extensions if you want to go further:</p>
<h3 id="heading-1-graduation-to-distributed-systems">1. Graduation to Distributed Systems</h3>
<p>The SQLite ledger works for isolated sequential loops. The moment you run multiple agents against shared state, you need serializable isolation, because concurrent writes to flat JSON corrupt silently. The README documents the three tipping points where a flat ledger needs to graduate.</p>
<h3 id="heading-2-cryptographic-signing">2.
Cryptographic Signing</h3>
<p>For compliance-scale systems where the auditor wasn't present when the loop ran, SQLite rows aren't enough. A database admin can run an <code>UPDATE</code> query. Ed25519 signing wraps each ledger row in a receipt that proves the log wasn't altered after execution. But that's a different tutorial.</p>
<h3 id="heading-wiring-a-cron-job">3. Wiring a Cron Job</h3>
<p>The honest architecture for the SEO audit agent isn't 24/7 autonomous operation. It's a cron job that runs on schedule, surfaces what's broken, and stops. <code>0 3 * * 2 python examples/seo_audit_example.py https://yourdomain.com</code> (03:00 every Tuesday) is the whole thing. The loop runs to you, not into a void.</p>
<p>If you need this architecture built for your own stack (circuit breakers, audit trails, production-safe agent loops), I do freelance work. <a href="https://dannwaneri.com/ai-agents/">dannwaneri.com/ai-agents/</a></p>
</article>
<article>
<h1> How to Build a Self-Learning RAG System with Knowledge Reflection </h1>
<p>Daniel Nwaneri — Fri, 24 Apr 2026 20:52:49 +0000</p>
<p>Every RAG system I've seen, including the one I wrote a handbook about on this site, has the same fundamental problem.</p>
<p>It doesn't learn.</p>
<p>You ingest 500 documents. You ask a question. The system retrieves the three most similar chunks and hands them to the LLM. Repeat for the next query.</p>
<p>The system knows exactly as much as it did on day one. It's a library that never builds a card catalog, never cross-references its own shelves, never notices that three of its books are saying contradictory things.</p>
<p>That's what I set out to fix with a knowledge reflection layer. After every ingest, the system finds semantically related documents already in the index and asks an LLM to synthesise what's new, how it connects, and what gap remains.
That synthesis gets embedded, stored, and boosted in search results.</p> <p>The knowledge base gets smarter as you add more documents — not just bigger.</p> <p>This tutorial shows you exactly how to build it.</p> <h2 id="heading-table-of-contents">Table of Contents</h2> <ol> <li><p><a href="#heading-what-you-will-build">What You Will Build</a></p> </li> <li><p><a href="#heading-prerequisites">Prerequisites</a></p> </li> <li><p><a href="#heading-how-to-set-up-the-base-system">How to Set Up the Base System</a></p> </li> <li><p><a href="#heading-why-standard-rag-has-a-memory-problem">Why Standard RAG Has a Memory Problem</a></p> </li> <li><p><a href="#heading-step-1-schema-update">Step 1: Schema Update</a></p> </li> <li><p><a href="#heading-step-2-the-reflection-engine">Step 2: The Reflection Engine</a></p> </li> <li><p><a href="#heading-step-3-consolidation">Step 3: Consolidation</a></p> </li> <li><p><a href="#heading-step-4-wire-it-into-your-ingest-handler">Step 4: Wire It Into Your Ingest Handler</a></p> </li> <li><p><a href="#heading-step-5-boost-reflections-in-search">Step 5: Boost Reflections in Search</a></p> </li> <li><p><a href="#heading-step-6-filtering-by-doc_type">Step 6: Filtering by doc_type</a></p> </li> <li><p><a href="#heading-what-changes-after-you-build-this">What Changes After You Build This</a></p> </li> <li><p><a href="#heading-deploying">Deploying</a></p> </li> <li><p><a href="#heading-what-to-build-next">What to Build Next</a></p> </li> </ol> <h2 id="heading-what-you-will-build">What You Will Build</h2> <p>In this tutorial, you'll build a post-ingest reflection pipeline that:</p> <ol> <li><p>Fires automatically after every document ingest</p> </li> <li><p>Finds the most semantically related documents already in the index</p> </li> <li><p>Asks Kimi K2.5 to synthesise a three-sentence insight linking the new document to existing knowledge</p> </li> <li><p>Stores that reflection with <code>doc_type=reflection</code> and a 1.5× ranking boost in 
search results</p> </li> <li><p>Consolidates reflections into summaries every three ingests</p> </li> </ol>
<p>By the end, searching your knowledge base will surface both raw document chunks and reflection artifacts the system wrote on ingest.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You will need:</p>
<ul> <li><p>A Cloudflare account (free tier works)</p> </li>
<li><p>Node.js v18+ and Wrangler CLI installed (<code>npm install -g wrangler</code>)</p> </li>
<li><p>Basic TypeScript familiarity</p> </li> </ul>
<p>No external API keys. Everything runs on Cloudflare's infrastructure.</p>
<h2 id="heading-how-to-set-up-the-base-system">How to Set Up the Base System</h2>
<p>If you have already built the RAG system from my <a href="https://www.freecodecamp.org/news/build-a-production-rag-system-with-cloudflare-workers-handbook">freeCodeCamp handbook</a>, skip this section: your system is ready for the reflection layer.</p>
<p>If you're starting fresh, this section gets you to a working base in about 15 minutes.</p>
<h3 id="heading-scaffold-the-project">Scaffold the Project</h3>
<pre><code class="language-bash">npm create cloudflare@latest rag-reflection-system
cd rag-reflection-system
</code></pre>
<p>Choose: Hello World example → TypeScript → No deploy yet.</p>
<h3 id="heading-create-the-vectorize-index-and-d1-database">Create the Vectorize Index and D1 Database</h3>
<pre><code class="language-bash">npx wrangler vectorize create rag-index --dimensions=384 --metric=cosine
npx wrangler d1 create rag-db
</code></pre>
<h3 id="heading-configure-wranglertoml">Configure wrangler.toml</h3>
<pre><code class="language-toml">name = "rag-reflection-system"
main = "src/index.ts"
compatibility_date = "2026-01-01"

[[vectorize]]
binding = "VECTORIZE"
index_name = "rag-index"

[[d1_databases]]
binding = "DB"
database_name = "rag-db"
database_id = "YOUR_DB_ID"

[ai]
binding = "AI"
</code></pre>
<h3 id="heading-create-the-documents-table">Create the <code>documents</code>
Table</h3> <pre><code class="language-sql">-- migrations/001_init.sql
CREATE TABLE IF NOT EXISTS documents (
  id TEXT PRIMARY KEY,
  content TEXT NOT NULL,
  source TEXT,
  date_created TEXT DEFAULT (datetime('now'))
);
</code></pre> <pre><code class="language-bash">npx wrangler d1 execute rag-db --remote --file=./migrations/001_init.sql
</code></pre> <h3 id="heading-add-the-ingest-and-search-endpoints">Add the <code>ingest</code> and <code>search</code> Endpoints</h3> <p>Replace <code>src/index.ts</code> with this minimal working system:</p> <pre><code class="language-typescript">export interface Env {
  VECTORIZE: VectorizeIndex;
  DB: D1Database;
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const url = new URL(request.url);

    if (url.pathname === '/ingest' && request.method === 'POST') {
      const { id, content, source } = await request.json() as any;

      // Embed the first 512 characters of the document
      const embResult = await env.AI.run('@cf/baai/bge-small-en-v1.5', {
        text: [content.slice(0, 512)],
      }) as any;
      const vector = embResult.data[0];

      await env.VECTORIZE.upsert([{
        id,
        values: vector,
        metadata: { content: content.slice(0, 1000), source, doc_type: 'raw' },
      }]);

      await env.DB.prepare(
        'INSERT OR REPLACE INTO documents (id, content, source) VALUES (?, ?, ?)'
      ).bind(id, content, source ?? '').run();

      return Response.json({ success: true, id });
    }

    if (url.pathname === '/search' && request.method === 'POST') {
      const { query } = await request.json() as any;

      const embResult = await env.AI.run('@cf/baai/bge-small-en-v1.5', {
        text: [query],
      }) as any;
      const vector = embResult.data[0];

      const results = await env.VECTORIZE.query(vector, {
        topK: 5,
        returnMetadata: 'all',
      });

      // Join the stored chunk text into a single context string
      const context = results.matches
        .map(m => m.metadata?.content as string)
        .filter(Boolean)
        .join('\n\n');

      const answer = await env.AI.run('@cf/moonshotai/kimi-k2.5', {
        messages: [
          { role: 'system', content: 'Answer using only the context provided.' },
          { role: 'user', content: `Context:\n${context}\n\nQuestion: ${query}` },
        ],
        max_tokens: 256,
      }) as any;

      return Response.json({ answer: answer.response, sources: results.matches.map(m => m.id) });
    }

    return new Response('RAG system running', { status: 200 });
  },
};
</code></pre> <h3 id="heading-deploy-and-verify">Deploy and Verify</h3> <pre><code class="language-bash">npx wrangler deploy
</code></pre> <p>Test it:</p> <pre><code class="language-bash"># Ingest a document
curl -X POST https://your-worker.workers.dev/ingest \
  -H "Content-Type: application/json" \
  -d '{"id": "doc-001", "content": "Cursor pagination beats offset pagination for live-updating datasets because offset becomes unreliable when rows are inserted or deleted during pagination."}'

# Search
curl -X POST https://your-worker.workers.dev/search \
  -H "Content-Type: application/json" \
  -d '{"query": "what pagination approach should I use?"}'
</code></pre> <p>If you get a grounded answer back, the base system is working. The next sections add the reflection layer on top of this foundation.</p> <h2 id="heading-why-standard-rag-has-a-memory-problem">Why Standard RAG Has a Memory Problem</h2> <p>Standard RAG retrieval is stateless. Every query goes in cold. The system has no memory of what it found before, no synthesis of what it learned across documents, and no growing understanding of what questions remain unanswered.</p> <p>Imagine you've ingested 200 documents about your product. Twelve of them touch on a pricing decision made last year. No single one has the full picture: it's distributed across quarterly reports, meeting notes, an internal Slack export, a few Notion pages.</p> <p>A user asks: "Why did we change our pricing structure?"</p> <p>Standard RAG retrieves the three most similar chunks. If those three chunks collectively have the answer, great. If they don't, because the real answer requires synthesising across those twelve documents, the system has no mechanism for that. It returns fragments. 
The LLM makes its best guess.</p> <p>The reflection layer addresses this directly. When the twelfth pricing document gets ingested, the system finds the eleven related documents, synthesises what connects them, and stores that synthesis as a retrievable artifact. The answer to "why did we change our pricing structure" exists in the index before anyone asks the question.</p> <p>Not smarter retrieval. Smarter indexing.</p> <h2 id="heading-step-1-schema-update">Step 1: Schema Update</h2> <p>The reflection layer needs three new columns in your D1 documents table. Run this migration:</p> <pre><code class="language-sql">-- migrations/003_add_reflection_fields.sql
ALTER TABLE documents ADD COLUMN doc_type TEXT DEFAULT 'raw';
ALTER TABLE documents ADD COLUMN reflection_score REAL DEFAULT 0;
ALTER TABLE documents ADD COLUMN parent_reflection_id TEXT;
</code></pre> <p>Apply it (if you followed the fresh setup above, your database is named <code>rag-db</code>, so use that name instead of <code>mcp-knowledge-db</code>):</p> <pre><code class="language-bash">wrangler d1 execute mcp-knowledge-db --remote --file=./migrations/003_add_reflection_fields.sql
</code></pre> <p><code>doc_type</code> distinguishes raw documents (<code>raw</code>), single-document reflections (<code>reflection</code>), and consolidated multi-reflection summaries (<code>summary</code>). You'll use this field to filter: exposing only reflections to users who want the distilled view, or excluding them for users who want raw source chunks.</p> <p>Two things to check before moving on. The code in Step 2 also writes to a <code>parent_id</code> column, which the minimal <code>documents</code> table above does not have, so add that column in this migration if your table lacks it. And filtering on <code>doc_type</code> in Vectorize requires a metadata index on that property, created with <code>wrangler vectorize create-metadata-index</code> before the vectors are upserted.</p> <h2 id="heading-step-2-the-reflection-engine">Step 2: The Reflection Engine</h2> <p>Create <code>src/engines/reflection.ts</code>. This is the core of the layer.</p> <p>The imports from <code>../types/env</code> and <code>../config/models</code>, and the <code>EMBEDDING_MODEL</code> and <code>REFLECTION_MODEL</code> variables, are not part of the minimal base system above. If you started fresh, substitute your own <code>Env</code> interface and the model IDs you used in <code>src/index.ts</code>.</p> <pre><code class="language-typescript">import { Env } from '../types/env';
import { resolveEmbeddingModel, resolveReflectionModel } from '../config/models';

const REFLECTION_BOOST = 1.5;
const CONSOLIDATION_THRESHOLD = 3; // consolidate every N new reflections

export async function reflect(
  newDocId: string,
  newDocContent: string,
  env: Env
): Promise<void> {
  // 1. Find semantically related documents already in the index
  const embModel = resolveEmbeddingModel(env.EMBEDDING_MODEL);
  const embResult = await env.AI.run(embModel.id as any, {
    text: [newDocContent.slice(0, 512)],
  });
  const queryVector = (embResult as any).data?.[0];
  if (!queryVector) return;

  const related = await env.VECTORIZE.query(queryVector, {
    topK: 5,
    filter: { doc_type: { $eq: 'raw' } },
    returnMetadata: 'all',
  });
  const relatedDocs = (related.matches ?? []).filter(
    m => m.id !== newDocId && (m.score ?? 0) > 0.65
  );
  if (relatedDocs.length === 0) return; // nothing related yet, so skip

  // 2. Build synthesis prompt
  const relatedSummaries = relatedDocs
    .slice(0, 3)
    .map((m, i) => `Document ${i + 1}: ${String(m.metadata?.content ?? '').slice(0, 300)}`)
    .join('\n\n');

  const prompt = `You are synthesising knowledge across documents in a knowledge base.

New document:
${newDocContent.slice(0, 600)}

Related existing documents:
${relatedSummaries}

Write exactly three sentences:
1. What the new document adds that the existing documents don't already cover
2. How the new document connects to or extends the existing documents
3. What gap or question remains unanswered across all these documents

Be specific. Reference actual content. Do not summarise; synthesise.`;

  // 3. Call the reflection model
  const reflModel = resolveReflectionModel(env.REFLECTION_MODEL);
  const llmResp = await env.AI.run(reflModel.id as any, {
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 180,
  });
  const reflectionText = (llmResp as any)?.response?.trim();
  if (!reflectionText || reflectionText.length < 40) return;

  // 4. Embed and store the reflection
  const reflEmbResult = await env.AI.run(embModel.id as any, {
    text: [reflectionText],
  });
  const reflVector = (reflEmbResult as any).data?.[0];
  if (!reflVector) return;

  const reflectionId = `refl_${newDocId}_${Date.now()}`;
  await env.VECTORIZE.upsert([
    {
      id: reflectionId,
      values: reflVector,
      metadata: {
        content: reflectionText,
        doc_type: 'reflection',
        parent_id: newDocId,
        reflection_score: REFLECTION_BOOST,
        source_doc_ids: relatedDocs.map(m => m.id).join(','),
        date_created: new Date().toISOString(),
      },
    },
  ]);

  await env.DB.prepare(
    `INSERT INTO documents (id, content, doc_type, reflection_score, parent_id, date_created)
     VALUES (?, ?, 'reflection', ?, ?, ?)`
  )
    .bind(reflectionId, reflectionText, REFLECTION_BOOST, newDocId, new Date().toISOString())
    .run();

  // 5. Check if consolidation is due.
  // date_created holds an ISO string here, so it is wrapped in datetime()
  // to compare timestamps rather than raw strings.
  const recentCount = await env.DB
    .prepare(
      `SELECT COUNT(*) as cnt FROM documents
       WHERE doc_type = 'reflection'
         AND datetime(date_created) > datetime('now', '-1 hour')`
    )
    .first<{ cnt: number }>();
  if ((recentCount?.cnt ?? 0) >= CONSOLIDATION_THRESHOLD) {
    await consolidate(env);
  }
}
</code></pre> <p>Two things worth noting here.</p> <p>First, the semantic threshold (<code>score > 0.65</code>) matters. Too low and you're synthesising unrelated documents. Too high and you're rarely finding connections. 0.65 works well with <code>bge-small</code>. You can bump it to 0.72 with <code>qwen3-0.6b</code> (1024d) where scores cluster higher.</p> <p>Second, the prompt structure is deliberate. Three sentences, each doing a specific job: what's new, how it connects, what remains. This keeps reflections useful for retrieval. A freeform synthesis prompt produces beautiful prose that doesn't retrieve well. 
This structure produces retrievable artifacts.</p> <h2 id="heading-step-3-consolidation">Step 3: Consolidation</h2> <p>As reflections accumulate, they need their own synthesis layer; otherwise you're adding noise at a higher abstraction level.</p> <p>Add this to <code>src/engines/reflection.ts</code>:</p> <pre><code class="language-typescript">export async function consolidate(env: Env): Promise<void> {
  // Fetch recent reflections not yet consolidated
  const recent = await env.DB
    .prepare(
      `SELECT id, content FROM documents
       WHERE doc_type = 'reflection'
         AND id NOT IN (
           SELECT DISTINCT parent_id FROM documents
           WHERE doc_type = 'summary' AND parent_id IS NOT NULL
         )
       ORDER BY date_created DESC
       LIMIT 6`
    )
    .all<{ id: string; content: string }>();
  if (!recent.results || recent.results.length < CONSOLIDATION_THRESHOLD) return;

  const reflectionTexts = recent.results
    .map((r, i) => `Reflection ${i + 1}: ${r.content}`)
    .join('\n\n');

  const prompt = `You are consolidating multiple knowledge reflections into a single compressed insight.

${reflectionTexts}

Write two to three sentences that capture the most important cross-cutting pattern or tension across these reflections. What does the knowledge base now understand that it didn't before these documents were added? What's the most important open question?

Be precise. No preamble.`;

  const reflModel = resolveReflectionModel(env.REFLECTION_MODEL);
  const llmResp = await env.AI.run(reflModel.id as any, {
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 320,
  });
  const summaryText = (llmResp as any)?.response?.trim();
  if (!summaryText || summaryText.length < 40) return;

  // Embed the summary so it is searchable alongside raw chunks
  const embModel = resolveEmbeddingModel(env.EMBEDDING_MODEL);
  const embResult = await env.AI.run(embModel.id as any, { text: [summaryText] });
  const summaryVector = (embResult as any).data?.[0];
  if (!summaryVector) return;

  const summaryId = `summary_${Date.now()}`;
  await env.VECTORIZE.upsert([
    {
      id: summaryId,
      values: summaryVector,
      metadata: {
        content: summaryText,
        doc_type: 'summary',
        reflection_score: REFLECTION_BOOST * 1.2,
        source_reflection_ids: recent.results.map(r => r.id).join(','),
        date_created: new Date().toISOString(),
      },
    },
  ]);

  await env.DB.prepare(
    `INSERT INTO documents (id, content, doc_type, reflection_score, date_created)
     VALUES (?, ?, 'summary', ?, ?)`
  )
    .bind(summaryId, summaryText, REFLECTION_BOOST * 1.2, new Date().toISOString())
    .run();
}
</code></pre> <p>Summaries get a 1.2× multiplier on top of the base reflection boost. In search results, a summary synthesising twelve related documents should rank above any single document chunk on broad conceptual queries. On specific factual queries, the raw chunks will score higher. The ranking sorts itself.</p> <p>One caveat about the query above: summary rows are inserted without a <code>parent_id</code>, so the <code>NOT IN</code> subquery never excludes anything, and each run simply takes the six most recent reflections. If you want each reflection consolidated only once, record the summary's ID on the reflections it covers (the <code>parent_reflection_id</code> column from Step 1 is one place to put it) and filter on that instead.</p> <h2 id="heading-step-4-wire-it-into-your-ingest-handler">Step 4: Wire It Into Your Ingest Handler</h2> <p>The reflection runs as a background job. It doesn't block the ingest response, because that would add 2–3 seconds to every ingest call.</p> <p>In your <code>src/handlers/ingest.ts</code> (or wherever your <code>/ingest</code> route lives; in the minimal base system that is <code>src/index.ts</code>), after you've stored the document:</p> <pre><code class="language-typescript">import { reflect } from '../engines/reflection';

// ... existing ingest logic ...

// After VECTORIZE.upsert() and DB insert succeed:
ctx.waitUntil(
  reflect(documentId, content, env).catch(err => {
    console.warn('[reflection] failed for', documentId, err.message);
  })
);

return new Response(JSON.stringify({
  success: true,
  documentId,
  chunks: chunkCount,
  // ... rest of response
}), { headers: { 'Content-Type': 'application/json' } });
</code></pre> <p><code>ctx.waitUntil()</code> is the Cloudflare Workers primitive for background work. The response returns immediately. The reflection runs after. The ingest API stays fast.</p> <p>The <code>.catch()</code> is important. A failed reflection should never fail an ingest. Raw documents are the source of truth. Reflections are derived value: useful, but not on the critical path.</p> <h2 id="heading-step-5-boost-reflections-in-search">Step 5: Boost Reflections in Search</h2> <p>Add the reflection boost to your ranking logic in <code>src/engines/hybrid.ts</code>. After RRF fusion and before returning results:</p> <pre><code class="language-typescript">// Apply reflection boost
const boosted = results.map(r => ({
  ...r,
  score: r.doc_type === 'reflection' || r.doc_type === 'summary'
    ? r.score * (r.reflection_score ?? 1.5)
    : r.score,
}));

return boosted.sort((a, b) => b.score - a.score);
</code></pre> <p>This is a post-fusion boost, not a pre-fusion rerank. The reasoning: apply RRF across all results first, so reflections earn their place on raw relevance before getting boosted. 
A reflection that would not rank in the top 20 on raw similarity shouldn't appear just because it has a boost multiplier.</p> <h2 id="heading-step-6-filtering-by-doctype">Step 6: Filtering by <code>doc_type</code></h2> <p>Your search endpoint should accept a <code>doc_type</code> filter so callers can control what they see:</p> <pre><code class="language-typescript">// In your search request handler:
const docTypeFilter = body.filters?.doc_type;

// Pass to Vectorize query:
const vectorFilter: Record<string, unknown> = {};
if (docTypeFilter) {
  vectorFilter.doc_type = docTypeFilter;
}
</code></pre> <p>This gives callers three modes:</p> <pre><code class="language-bash"># Only reflections and summaries
POST /search
{ "query": "pricing decisions", "filters": { "doc_type": { "$in": ["reflection", "summary"] } } }

# Only source documents
POST /search
{ "query": "pricing decisions", "filters": { "doc_type": { "$eq": "raw" } } }

# Default: all types, reflections boosted
POST /search
{ "query": "pricing decisions" }
</code></pre> <p>The default (no filter) is the most useful. Let the boost do its job. Restrict to raw when you need citations. Restrict to reflections when you want the synthesised view.</p> <h2 id="heading-what-changes-after-you-build-this">What Changes After You Build This</h2> <p>At 200 documents, the difference becomes noticeable. Queries that previously returned five fragmented chunks now surface a reflection that already synthesised those chunks. Broad conceptual queries ("what do we know about X?") start returning genuinely useful summaries instead of just the most-similar individual paragraph.</p> <p>At 2,000 documents, the reflection layer is the most valuable part of the system. The raw chunks answer specific factual questions. The reflections and summaries answer conceptual questions that could not be answered from any single document. 
The system has learned something no individual document contains.</p> <p>One failure mode worth knowing: if your embedding model has poor semantic clustering (old <code>bge-small</code> at 384d with mixed-domain documents), the related-documents retrieval step will surface weak connections and produce shallow reflections. The 0.65 threshold filters most of this out, but if you're seeing reflections that seem off-topic, your embeddings are the first thing to check.</p> <h2 id="heading-deploying">Deploying</h2> <p>Apply the migration if you haven't already, then deploy. As in Step 1, use your own D1 database name (<code>rag-db</code> if you followed the fresh setup):</p> <pre><code class="language-bash">wrangler d1 execute mcp-knowledge-db --remote --file=./migrations/003_add_reflection_fields.sql
wrangler deploy
</code></pre> <p>Then ingest a few documents and watch what happens:</p> <pre><code class="language-bash"># Ingest document 1
curl -X POST https://your-worker.workers.dev/ingest \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"id": "doc-001", "content": "Your document text here..."}'

# After a few seconds, check if a reflection was created
curl "https://your-worker.workers.dev/search" \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "your topic", "filters": {"doc_type": {"$eq": "reflection"}}}'
</code></pre> <p>Reflections won't appear until there are related documents to synthesise. Ingest at least three documents on similar topics before expecting to see them.</p> <h2 id="heading-what-to-build-next">What to Build Next</h2> <p>The reflection layer as described here fires after every ingest. That's expensive at high ingest volume: if you're batch-importing 10,000 documents, you don't want 10,000 individual reflection calls.</p> <p>For bulk ingestion, gate it: call <code>reflect()</code> only when a document's similarity search returns a match above 0.8, or batch-run reflection after the bulk import completes. 
The <code>POST /ingest/batch</code> endpoint in the <a href="https://github.com/dannwaneri/vectorize-mcp-worker">full repo</a> does this.</p> <p>The second thing worth building: surfacing reflections in your UI with a visual distinction. A search result that's a reflection should look different from a raw chunk. In the dashboard included in the repo, reflections render with a <code>💡</code> badge and a "synthesised from N documents" note.</p> <p>Full source at <a href="https://github.com/dannwaneri/vectorize-mcp-worker">github.com/dannwaneri/vectorize-mcp-worker</a> — reflection engine, consolidation, batch ingest, dashboard, OpenAPI spec.</p> <p>The codebase is TypeScript, deploys with a single <code>wrangler deploy</code>, runs for roughly $1–5/month at 10,000 queries/day.</p> <p>Standard RAG retrieves. This learns.</p> </article> <article> <h1> How to Keep Human Experts Visible in Your AI-Assisted Codebase </h1> <p>Daniel Nwaneri — Mon, 13 Apr 2026 16:24:52 +0000</p> <p>Six months ago, Stack Overflow processed 108,563 questions in a single month. By December 2025, that number had fallen to 3,862. A 78% collapse in two years.</p> <p>The explanation everyone reaches for is that AI replaced it. That's partly true. But it misses the structural problem underneath: every time a developer asks Claude or ChatGPT to write code, the knowledge that shaped the answer disappears.</p> <p>The GitHub discussion where someone spent two hours documenting why cursor-based pagination beats offset for live-updating datasets. The Stack Overflow answer from 2019 where one engineer, after a week of debugging, documented exactly why that approach fails under concurrent writes.</p> <p>The AI consumed all of it. The humans who produced it got nothing — no citation in the codebase, no signal that their work mattered.</p> <p>Over time, those people stopped contributing. Stack Overflow isn't dying because it's bad. 
It's dying because AI extracted its value and the feedback loop that kept humans contributing broke down.</p> <p>This tutorial builds a tool that puts that loop back together. <strong>proof-of-contribution</strong> is a Claude Code skill that links every AI-generated artifact back to the human knowledge that inspired it — and surfaces exactly where the AI made choices with no human source at all.</p> <p>I'll show you how to install proof-of-contribution, how to record your first provenance entry, how to use the spec-writer integration that makes Knowledge Gaps deterministic, and how to run <code>poc.py verify</code> — a static analyser that detects gaps without a single API call.</p> <h2 id="heading-table-of-contents">Table of Contents</h2> <ol> <li><p><a href="#heading-what-you-will-build">What You Will Build</a></p> </li> <li><p><a href="#heading-prerequisites">Prerequisites</a></p> </li> <li><p><a href="#heading-quickstart-in-5-minutes">Quickstart in 5 Minutes</a></p> </li> <li><p><a href="#heading-how-the-tool-works">How the Tool Works</a></p> </li> <li><p><a href="#heading-how-to-install-proof-of-contribution">How to Install proof-of-contribution</a></p> </li> <li><p><a href="#heading-how-to-scaffold-your-project">How to Scaffold Your Project</a></p> </li> <li><p><a href="#heading-how-to-record-your-first-provenance-entry">How to Record Your First Provenance Entry</a></p> </li> <li><p><a href="#heading-how-to-use-import-spec-to-seed-knowledge-gaps">How to Use import-spec to Seed Knowledge Gaps</a></p> </li> <li><p><a href="#heading-how-to-trace-human-attribution">How to Trace Human Attribution</a></p> </li> <li><p><a href="#heading-how-to-verify-with-static-analysis">How to Verify with Static Analysis</a></p> </li> <li><p><a href="#heading-how-to-enable-pr-enforcement">How to Enable PR Enforcement</a></p> </li> <li><p><a href="#heading-where-to-go-next">Where to Go Next</a></p> </li> </ol> <h2 id="heading-what-you-will-build">What You Will Build</h2> 
<p>proof-of-contribution is a Claude Code skill with a local CLI. Together they give you:</p> <ul> <li><p><strong>Provenance Blocks</strong>: Claude appends a structured attribution block to every generated artifact, listing the human sources that inspired it and flagging what it synthesized without any traceable source.</p> </li> <li><p><strong>Knowledge Gaps</strong>: the parts of AI-generated code that have no human citation, surfaced before they become production incidents</p> </li> <li><p><code>poc.py trace</code>: a CLI command that shows the full human attribution chain for any file in thirty seconds</p> </li> <li><p><code>poc.py import-spec</code>: bridges proof-of-contribution with spec-writer, seeding knowledge gaps from your spec's assumptions list before the agent builds anything</p> </li> <li><p><code>poc.py verify</code>: a static analyser that cross-checks your file's structure against seeded claims using Python's AST. Zero API calls. Exit code 0 means clean, exit code 1 means gaps found — wires directly into CI</p> </li> <li><p><strong>A GitHub Action</strong>: optional PR enforcement that fails PRs missing attribution, for teams that want a standard</p> </li> </ul> <p>The complete source is at <a href="https://github.com/dannwaneri/proof-of-contribution">github.com/dannwaneri/proof-of-contribution</a>.</p> <h2 id="heading-prerequisites">Prerequisites</h2> <p>This is a beginner-to-intermediate tutorial. 
You should be comfortable with:</p> <ul> <li><p><strong>Command line basics</strong>: navigating directories, running scripts</p> </li> <li><p><strong>Git</strong>: basic commits and PRs</p> </li> <li><p><strong>Python 3.8 or higher</strong>: the CLI is pure Python with no dependencies</p> </li> </ul> <p>You will need:</p> <ul> <li><p><strong>Python installed</strong>: check with <code>python --version</code> or <code>python3 --version</code></p> </li> <li><p><strong>Git installed</strong>: check with <code>git --version</code></p> </li> <li><p><strong>Claude Code</strong> (or any agent that supports the Agent Skills standard; Cursor and Gemini CLI also work)</p> </li> </ul> <p>There's no database to install. No API keys. No paid services. The default storage is SQLite, which Python includes out of the box.</p> <h2 id="heading-quickstart-in-5-minutes">Quickstart in 5 Minutes</h2> <p>If you want to try the tool before reading the full tutorial, here are the five commands that take you from zero to your first gap detection:</p> <p><strong>Mac and Linux:</strong></p> <pre><code class="language-bash"># 1. Install
mkdir -p ~/.claude/skills
git clone https://github.com/dannwaneri/proof-of-contribution.git \
  ~/.claude/skills/proof-of-contribution

# 2. Scaffold your project (run in your repo root)
python ~/.claude/skills/proof-of-contribution/assets/scripts/poc_init.py

# 3. Record attribution for an AI-generated file
python poc.py add src/utils/parser.py

# 4. Detect gaps via static analysis
python poc.py verify src/utils/parser.py

# 5. See the full provenance chain
python poc.py trace src/utils/parser.py
</code></pre> <p><strong>Windows PowerShell:</strong></p> <pre><code class="language-powershell"># 1. Install
New-Item -ItemType Directory -Force -Path "$HOME\.claude\skills"
git clone https://github.com/dannwaneri/proof-of-contribution.git `
  "$HOME\.claude\skills\proof-of-contribution"

# 2. Scaffold your project
python "$HOME\.claude\skills\proof-of-contribution\assets\scripts\poc_init.py"

# 3. Record attribution
python poc.py add src\utils\parser.py

# 4. Detect gaps
python poc.py verify src\utils\parser.py

# 5. See the full provenance chain
python poc.py trace src\utils\parser.py
</code></pre> <p>That's the whole tool. The sections below walk through each step in detail with real terminal output at every stage.</p> <h2 id="heading-how-the-tool-works">How the Tool Works</h2> <p>Before you install anything, you need a clear mental model of what proof-of-contribution actually does, because the most important part isn't obvious.</p> <h3 id="heading-the-archaeology-problem">The Archaeology Problem</h3> <p>Here's a scenario that happens on every team using AI-assisted development.</p> <p>A developer joins. They inherit six months of AI-generated code. They hit a bug in the pagination logic: cursor-based, unusual implementation, nobody remembers why it was built that way. The original developer has left.</p> <p>Old answer: two days of archaeology. <code>git blame</code> points to a commit message that says "fix pagination." The commit before that says "implement pagination." Dead end.</p> <p>With <code>poc.py trace src/utils/paginator.py</code>, that same developer sees this in thirty seconds:</p> <pre><code class="language-plaintext">Provenance trace: src/utils/paginator.py
────────────────────────────────────────────────────────────
[HIGH] @tannerlinsley on github
  Cursor pagination discussion
  https://github.com/TanStack/query/discussions/123
  Insight: cursor beats offset for live-updating datasets

Knowledge gaps (AI-synthesized, no human source):
  • Error retry strategy — no human source cited
  • Concurrent write handling — AI chose this arbitrarily
</code></pre> <p>They now know where the pattern came from and, critically, which parts have no traceable human source. The concurrent write handling is where the bug lives. 
The AI made a choice nobody reviewed.</p> <p>That's what this tool does. Not enforcement first. Archaeology first.</p> <h3 id="heading-how-knowledge-gaps-are-detected">How Knowledge Gaps Are Detected</h3> <p>The obvious assumption is that Claude introspects and reports what it doesn't know. That assumption is wrong. LLMs hallucinate confidently. An AI that could reliably detect its own knowledge gaps wouldn't produce them.</p> <p>The detection mechanism is a comparison, not introspection.</p> <p>When you use <a href="https://github.com/dannwaneri/spec-writer">spec-writer</a> before building, it generates a spec with an explicit <code>## Assumptions to review</code> section — every decision the AI is making that you didn't specify, each one impact-rated. That list is the contract.</p> <p>When you run <code>poc.py import-spec spec.md --artifact src/utils/paginator.py</code>, those assumptions get seeded into the database as unresolved knowledge gaps. After the agent builds, <code>poc.py trace</code> shows which assumptions made it into code with no human source ever cited.</p> <p>The AI isn't grading its own exam. The spec is the answer key.</p> <p><code>poc.py verify</code> takes this further. After the agent builds, it parses the file's actual structure using Python's built-in <code>ast</code> module — extracting every function definition, conditional branch, and return path. It cross-checks each one against the seeded claims. 
Any structural unit with no resolved claim surfaces as a deterministic Knowledge Gap, regardless of how confident the model was when it wrote the code.</p> <h2 id="heading-how-to-install-proof-of-contribution">How to Install proof-of-contribution</h2> <h3 id="heading-mac-and-linux">Mac and Linux</h3> <pre><code class="language-bash">mkdir -p ~/.claude/skills
git clone https://github.com/dannwaneri/proof-of-contribution.git \
  ~/.claude/skills/proof-of-contribution
</code></pre> <h3 id="heading-windows-powershell">Windows PowerShell</h3> <pre><code class="language-powershell">New-Item -ItemType Directory -Force -Path "$HOME\.claude\skills"
git clone https://github.com/dannwaneri/proof-of-contribution.git `
  "$HOME\.claude\skills\proof-of-contribution"
</code></pre> <p>That's the entire installation. No package to install, no configuration file to edit. The skill is a markdown file the agent reads. The CLI is a Python script that runs locally.</p> <h3 id="heading-verify-the-install">Verify the Install</h3> <pre><code class="language-bash">ls ~/.claude/skills/proof-of-contribution/
</code></pre> <p>You should see <code>SKILL.md</code>, <code>poc.py</code>, <code>assets/</code>, and <code>references/</code>. If the directory is empty, the clone failed. Check your internet connection and try again.</p> <h2 id="heading-how-to-scaffold-your-project">How to Scaffold Your Project</h2> <p>The scaffold script creates the database, config, CLI, and GitHub integration in your project root. Run it once per project.</p> <h3 id="heading-mac-and-linux">Mac and Linux</h3> <pre><code class="language-bash">cd /path/to/your/project
python ~/.claude/skills/proof-of-contribution/assets/scripts/poc_init.py
</code></pre> <h3 id="heading-windows-powershell">Windows PowerShell</h3> <pre><code class="language-powershell">cd C:\path\to\your\project
python "$HOME\.claude\skills\proof-of-contribution\assets\scripts\poc_init.py"
</code></pre> <p>You should see output like this:</p> <pre><code class="language-plaintext">🔗 Proof of Contribution — init
→ Project root: /path/to/your/project
✔ Created .poc/config.json
✔ Created .poc/.gitignore (db excluded from git, config tracked)
✔ Created .poc/provenance.db (SQLite — no extra infra needed)
✔ Created .github/PULL_REQUEST_TEMPLATE.md
✔ Created .github/workflows/poc-check.yml
✔ Created poc.py (local CLI — includes import-spec command)
✔ Created .gitignore
✔ Proof of Contribution initialised for 'your-project'
</code></pre> <p>This creates four things in your project:</p> <pre><code class="language-plaintext">your-project/
├── .poc/
│   ├── config.json      ← project settings (commit this)
│   ├── provenance.db    ← SQLite database (local only, gitignored)
│   └── .gitignore
├── .github/
│   ├── PULL_REQUEST_TEMPLATE.md
│   └── workflows/
│       └── poc-check.yml
└── poc.py               ← your local CLI
</code></pre> <ul> <li><p><code>.poc/</code>: the tool's local data directory. <code>config.json</code> stores project settings and is committed to git. <code>provenance.db</code> is the SQLite database where attribution records and knowledge gaps are stored (local only, gitignored).</p> </li> <li><p><code>poc.py</code>: your local CLI, copied into the project root. Run <code>python poc.py trace</code>, <code>python poc.py verify</code>, and every other command directly without a global install.</p> </li> <li><p><code>.github/PULL_REQUEST_TEMPLATE.md</code>: a PR template with the <code>## 🤖 AI Provenance</code> section pre-filled. 
Developers fill it in when submitting PRs that contain AI-generated code.</p> </li> <li><p><code>.github/workflows/poc-check.yml</code>: the optional GitHub Action for PR enforcement. Installed but dormant until you push the workflow file and enable it in your repo settings.</p> </li> </ul> <p><strong>Windows note:</strong> if the scaffold fails with a <code>UnicodeEncodeError</code>, the emoji in the PR template is hitting a Windows encoding limit. Open <code>assets/scripts/poc_init.py</code> in a text editor and find every line ending with <code>.write_text(...)</code>. Change each one to <code>.write_text(..., encoding="utf-8")</code>. Save and re-run.</p> <h3 id="heading-verify-the-scaffold-worked">Verify the Scaffold Worked</h3> <pre><code class="language-bash">python poc.py report
</code></pre> <p>Expected output:</p> <pre><code class="language-plaintext">Proof of Contribution Report
────────────────────────────────────────
Artifacts tracked : 0
With provenance : 0 (0%)
Unresolved gaps : 0
Resolved claims : 0
Human experts : 0
</code></pre> <p>Empty database, clean state. You're ready.</p> <h2 id="heading-how-to-record-your-first-provenance-entry">How to Record Your First Provenance Entry</h2> <p>One clarification before you start. Earlier, I described <code>poc.py verify</code> as detecting Knowledge Gaps automatically, and it does. But the static analyser can only tell you <em>that</em> a function has no human citation. It can't tell you <em>which</em> human source inspired it. That knowledge lives in your head, not in the code.</p> <p><code>poc.py add</code> is where you supply that context. After the agent builds a file, you record the human sources you actually drew on: the GitHub discussion you read before prompting, the Stack Overflow answer that shaped the approach. Those records become the attribution chain <code>poc.py trace</code> surfaces, and they are what closes the gaps <code>poc.py verify</code> flags.</p> <p><code>verify</code> finds the gaps. <code>add</code> fills them.</p> <p><code>poc.py add</code> records attribution for a file interactively. You can run it on any AI-generated file in your project.</p> <pre><code class="language-bash">python poc.py add src/utils/parser.py
</code></pre> <p>You'll see a prompt:</p> <pre><code class="language-plaintext">Recording provenance for: src/utils/parser.py
(Press Ctrl+C to cancel)

Human source URL (or Enter to finish):
</code></pre> <p>Enter the URL of the human-authored source that inspired the code. This could be a GitHub discussion, a Stack Overflow answer, a documentation page, a blog post, or an RFC.</p> <pre><code class="language-plaintext">  Human source URL (or Enter to finish): https://github.com/TanStack/query/discussions/123
  Author handle: tannerlinsley
  Platform (github/stackoverflow/docs/other): github
  Source title: Cursor pagination discussion
  What specific insight came from this? cursor beats offset for live-updating datasets
  Confidence HIGH/MEDIUM/LOW [MEDIUM]: HIGH
  ✔ Recorded.
</code></pre> <p>Add as many sources as apply. Press Enter on a blank URL when you're done.</p> <pre><code class="language-plaintext">  Human source URL (or Enter to finish):
✔ Provenance saved. Run: python poc.py trace src/utils/parser.py
</code></pre> <h3 id="heading-check-what-you-recorded">Check What You Recorded</h3> <pre><code class="language-bash">python poc.py trace src/utils/parser.py
</code></pre> <pre><code class="language-plaintext">Provenance trace: src/utils/parser.py
────────────────────────────────────────────────────────────
[HIGH] @tannerlinsley on github
  Cursor pagination discussion
  https://github.com/TanStack/query/discussions/123
  Insight: cursor beats offset for live-updating datasets
</code></pre> <p>No knowledge gaps, because you recorded a source. If the file had parts with no human source, they would appear below as gaps.</p> <h3 id="heading-see-all-experts-in-your-graph">See All Experts in Your Graph</h3> <p>Every <code>poc.py add</code> call stores not just the URL but the author: their handle, platform, and the specific insight they contributed. Run it across enough files, and those authors accumulate into a <strong>knowledge graph</strong>: a local record of which human experts your codebase drew from, which files their knowledge shaped, and how many artifacts trace back to their work.</p> <p><code>poc.py experts</code> surfaces the top contributors. On a new project, it'll be one or two entries. On a mature codebase, it becomes a map of whose knowledge is load-bearing: the people you'd want to consult if that part of the code ever needed to change.</p> <pre><code class="language-bash">python poc.py experts
</code></pre> <pre><code class="language-plaintext">Top Human Experts in Knowledge Graph
──────────────────────────────────────────────────────
@tannerlinsley github 1 artifact(s)
</code></pre> <h2 id="heading-how-to-use-import-spec-to-seed-knowledge-gaps">How to Use import-spec to Seed Knowledge Gaps</h2> <p>This is the most important command in the tool. It connects proof-of-contribution with spec-writer and makes Knowledge Gaps deterministic.</p> <p>When you use spec-writer before building a feature, it generates an <code>## Assumptions to review</code> section in which every implicit decision is impact-rated HIGH, MEDIUM, or LOW. The <code>import-spec</code> command reads that section and seeds those assumptions into the database as unresolved gaps before the agent writes a line of code.</p> <p>After the agent builds, any assumption that made it into the implementation without a cited human source surfaces automatically in <code>poc.py trace</code>. You don't need to know which parts of the code are uncertain. 
The spec already told you.</p> <h3 id="heading-step-1-create-a-test-spec">Step 1 — Create a Test Spec</h3> <p>If you don't have a spec-writer output yet, create one manually to see how the import works.</p> <p><strong>Mac and Linux:</strong></p> <pre><code class="language-bash">cat > test-spec.md << 'EOF'
## Assumptions to review

1. SQLite is sufficient for single-developer use — Impact: HIGH
   Correct this if: you need team-shared provenance

2. Filepath is the artifact identifier — Impact: MEDIUM
   Correct this if: you use content hashing instead

3. REST pattern for any future API — Impact: LOW
   Correct this if: you prefer GraphQL
EOF
</code></pre> <p><strong>Windows PowerShell:</strong></p> <pre><code class="language-powershell">python -c "
content = '''## Assumptions to review

1. SQLite is sufficient for single-developer use - Impact: HIGH
   Correct this if: you need team-shared provenance

2. Filepath is the artifact identifier - Impact: MEDIUM
   Correct this if: you use content hashing instead

3. REST pattern for any future API - Impact: LOW
   Correct this if: you prefer GraphQL'''
open('test-spec.md', 'w', encoding='utf-8').write(content)
print('test-spec.md created')
"
</code></pre> <p><strong>Windows note:</strong> don't use PowerShell's <code>echo</code> to create spec files. PowerShell saves files as UTF-16, which causes a <code>UnicodeDecodeError</code> when <code>import-spec</code> reads them. The <code>python -c</code> approach above writes UTF-8 correctly.</p> <h3 id="heading-step-2-import-the-assumptions">Step 2 — Import the Assumptions</h3> <pre><code class="language-bash">python poc.py import-spec test-spec.md --artifact src/utils/parser.py
</code></pre> <pre><code class="language-plaintext">Spec assumptions imported — 3 Knowledge Gap(s) seeded
───────────────────────────────────────────────────────
  1. [HIGH] SQLite is sufficient for single-developer use
     Correct if: you need team-shared provenance
  2. [MEDIUM] Filepath is the artifact identifier
     Correct if: you use content hashing instead
  3. [LOW] REST pattern for any future API
     Correct if: you prefer GraphQL

→ Bound to: src/utils/parser.py

After the agent builds, run:
  python poc.py trace src/utils/parser.py
  python poc.py add src/utils/parser.py
</code></pre> <h3 id="heading-step-3-trace-the-gaps">Step 3 — Trace the Gaps</h3> <pre><code class="language-bash">python poc.py trace src/utils/parser.py
</code></pre> <pre><code class="language-plaintext">Knowledge gaps (AI-synthesized, no human source):
  • REST pattern for any future API [Correct if: you prefer GraphQL]
  • SQLite is sufficient for single-developer use [Correct if: you need team-shared provenance]
  • Filepath is the artifact identifier [Correct if: you use content hashing instead]

Resolve gaps: python poc.py add src/utils/parser.py
</code></pre> <p>Three gaps, colour-coded by urgency. The HIGH-impact assumption — SQLite for single-developer use — appears in red. The LOW-impact one appears in green. When you run <code>poc.py add</code> and record a human source with an insight that overlaps the gap text, the gap auto-closes.</p> <h3 id="heading-preview-without-writing">Preview Without Writing</h3> <pre><code class="language-bash">python poc.py import-spec test-spec.md --dry-run
</code></pre> <p>This parses the spec and prints what would be seeded without touching the database. This is useful before committing to an import.</p> <h3 id="heading-check-the-overall-health">Check the Overall Health</h3> <pre><code class="language-bash">python poc.py report
</code></pre> <pre><code class="language-plaintext">Proof of Contribution Report
────────────────────────────────────────
Artifacts tracked : 1
With provenance : 0 (0%)
Unresolved gaps : 3
Resolved claims : 0
Human experts : 1

⚠ Less than 50% of artifacts have provenance records.
⚠ 3 unresolved Knowledge Gap(s). Run `poc.py trace <filepath>` to locate them.
</code></pre> <h2 id="heading-how-to-trace-human-attribution">How to Trace Human Attribution</h2> <p><code>poc.py trace</code> is the command you'll use most. It shows the full human attribution chain for any file and lists any knowledge gaps — parts of the code with no traceable human source.</p> <pre><code class="language-bash">python poc.py trace src/utils/parser.py </code></pre> <p>A file with both attribution and gaps looks like this:</p> <pre><code class="language-plaintext">Provenance trace: src/utils/parser.py ──────────────────────────────────────────────────────────── [HIGH] @juliandeangelis on github Spec Driven Development methodology https://github.com/dannwaneri/spec-writer Insight: separate functional from technical spec [MEDIUM] @tannerlinsley on github Cursor pagination discussion https://github.com/TanStack/query/discussions/123 Insight: cursor beats offset for live-updating datasets Knowledge gaps (AI-synthesized, no human source): • Error retry strategy — no human source cited • CSV column ordering — AI chose this arbitrarily Resolve gaps: python poc.py add src/utils/parser.py </code></pre> <p>The human attribution section shows every cited source, colour-coded by confidence. The knowledge gaps section shows every assumption that shipped without a human citation — either seeded from a spec via <code>import-spec</code>, or flagged by Claude in the Provenance Block.</p> <h3 id="heading-resolving-gaps">Resolving Gaps</h3> <p>Run <code>poc.py add</code> on any file with open gaps:</p> <pre><code class="language-bash">python poc.py add src/utils/parser.py </code></pre> <p>When you enter an insight that shares words with an open gap claim, the gap auto-closes. Run <code>poc.py trace</code> again to confirm it's resolved.</p> <h2 id="heading-how-to-verify-with-static-analysis">How to Verify with Static Analysis</h2> <p><code>poc.py verify</code> is the command that closes the epistemic trust gap completely. 
It detects Knowledge Gaps by analysing the file's actual code structure — not by asking the AI what it doesn't know.</p> <p>Run it after the agent builds, once you've seeded gaps with <code>import-spec</code>:</p> <pre><code class="language-bash">python poc.py verify src/utils/parser.py
</code></pre> <p>Expected output:</p> <pre><code class="language-plaintext">Verify: src/utils/parser.py
────────────────────────────────────────────────────────────
Structural units detected : 11
Seeded claims : 3
Covered by cited source : 2
Deterministic gaps : 1

Deterministic Knowledge Gaps (no human source):
  • function: handle_concurrent_writes (lines 47–61)
    Seeded assumption: concurrent write handling — AI chose this arbitrarily

Resolve: python poc.py add src/utils/parser.py
</code></pre> <p>The gap shown is not something Claude admitted. It's something the analyser found by comparing the file's function list against your seeded claims. The function <code>handle_concurrent_writes</code> exists in the code but has no resolved human citation in the database. That's the gap.</p> <h3 id="heading-what-the-exit-codes-mean">What the Exit Codes Mean</h3> <pre><code class="language-bash">python poc.py verify src/utils/parser.py
echo $?              # Mac/Linux

python poc.py verify src/utils/parser.py
echo $LASTEXITCODE   # Windows PowerShell
</code></pre> <ul> <li><p><strong>Exit code 0</strong> — no gaps, all detected units have cited sources</p> </li> <li><p><strong>Exit code 1</strong> — gaps found, resolve with <code>poc.py add</code></p> </li> <li><p><strong>Exit code 2</strong> — file not found or unsupported language</p> </li> </ul> <p>Exit code 1 integrates directly into CI pipelines. 
Add <code>poc.py verify</code> to your GitHub Action or pre-commit hook and gaps block the build before they reach production.</p> <h3 id="heading-run-it-without-a-seeded-spec">Run it Without a Seeded Spec</h3> <p>If you haven't run <code>import-spec</code> first, <code>verify</code> still works — it falls back to structural analysis and surfaces every uncited function and branch as a gap:</p> <pre><code class="language-bash">python poc.py verify src/utils/parser.py </code></pre> <pre><code class="language-plaintext">⚠ No spec imported — showing all uncited structural units. Run: python poc.py import-spec spec.md --artifact src/utils/parser.py for deterministic gap detection. Deterministic Knowledge Gaps (no human source): • function: parse_query (lines 1–7) • branch: if not text (lines 2–3) • function: fetch_results (lines 9–12) ... </code></pre> <p>It's less precise than the spec-writer path — every structural unit shows rather than only the ones tied to named assumptions — but it's useful as a baseline on any file, new or old.</p> <h3 id="heading-the-strict-flag">The <code>--strict</code> Flag</h3> <pre><code class="language-bash">python poc.py verify src/utils/parser.py --strict </code></pre> <p>Strict mode flags every uncited structural unit as a gap even when claims are seeded. You can use it when you want zero tolerance: any function or branch without a resolved human source fails the check.</p> <h2 id="heading-how-to-enable-pr-enforcement">How to Enable PR Enforcement</h2> <p>Once <code>poc.py trace</code> has saved you real hours — not before — enable the GitHub Action. The distinction matters. Turning it on day one frames the tool as overhead. Turning it on after the team already finds value frames it as a standard.</p> <pre><code class="language-bash">git add .github/ .poc/config.json poc.py git commit -m "chore: add proof-of-contribution" git push </code></pre> <p>After that, every PR is checked for an <code>## 🤖 AI Provenance</code> section. 
The scaffold already created the PR template with that section included. Developers fill it in naturally once they're already running <code>poc.py trace</code> locally — the template just asks them to record what they already know.</p> <p>Developers who write fully human code opt out by adding <code>100% human-written</code> anywhere in the PR body. The action skips the check automatically.</p> <h3 id="heading-what-the-action-checks">What the Action Checks</h3> <p>The action reads the PR description and looks for:</p> <ol> <li><p>The <code>## 🤖 AI Provenance</code> heading</p> </li> <li><p>At least one populated row in the attribution table</p> </li> </ol> <p>If the section is missing or the table is empty, the action fails and posts a comment explaining what to add. The comment includes a link to <code>poc.py trace <filepath></code> so the developer knows exactly where to look.</p> <h2 id="heading-where-to-go-next">Where to Go Next</h2> <h3 id="heading-use-it-with-spec-writer-on-a-real-feature">Use it with spec-writer on a Real Feature</h3> <p>The real value of <code>import-spec</code> is on actual features, not test specs. If you use <a href="https://github.com/dannwaneri/spec-writer">spec-writer</a>, the workflow is:</p> <pre><code class="language-plaintext">/spec-writer "your feature description" </code></pre> <p>Save the output to <code>spec.md</code>. Then:</p> <pre><code class="language-bash">python poc.py import-spec spec.md --artifact src/path/to/output.py </code></pre> <p>Build the feature with your agent. Then run <code>poc.py trace</code> to see which assumptions made it into code with no human source. Resolve the HIGH-impact gaps first — those are the ones that will cause production incidents.</p> <h3 id="heading-activate-the-claude-code-skill">Activate the Claude Code Skill</h3> <p>The SKILL.md file makes Claude automatically append a Provenance Block to every generated artifact when the skill is active. 
The block lists human sources Claude drew from and flags what it synthesized without any traceable source.</p> <p>To activate it in Claude Code, the skill is already installed at <code>~/.claude/skills/proof-of-contribution/</code>. Claude Code loads it automatically when you are in a project that has <code>.poc/config.json</code>.</p> <p>A generated Provenance Block looks like this:</p> <pre><code class="language-plaintext">## PROOF OF CONTRIBUTION Generated artifact: fetch_github_discussions() Confidence: MEDIUM ## HUMAN SOURCES THAT INSPIRED THIS [1] GitHub GraphQL API Documentation Team Source type: Official Docs URL: docs.github.com/en/graphql Contribution: cursor-based pagination pattern [2] GitHub Community (multiple contributors) Source type: GitHub Discussions URL: github.com/community/community Contribution: "ghost" fallback for deleted accounts surfaced in bug reports ## KNOWLEDGE GAPS (AI synthesized, no human cited) - Error handling / retry logic - Rate limit strategy ## RECOMMENDED HUMAN EXPERTS TO CONSULT - github.com/octokit community for pagination </code></pre> <p>The Knowledge Gaps section is the part no other tool produces. It's where AI admits what it synthesized without a traceable human source — before that gap becomes a production incident.</p> <h3 id="heading-upgrade-when-you-outgrow-sqlite">Upgrade When You Outgrow SQLite</h3> <p>The default database is SQLite — local only, no infra required. 
When you need team sharing or graph queries, the <code>references/</code> directory in the repo has migration guides:</p> <table> <thead> <tr> <th>Need</th> <th>File</th> </tr> </thead> <tbody><tr> <td>Team sharing a provenance DB</td> <td><code>references/relational-schema.md</code></td> </tr> <tr> <td>Graph traversal queries</td> <td><code>references/neo4j-implementation.md</code></td> </tr> <tr> <td>Semantic web / interoperability</td> <td><code>references/jsonld-schema.md</code></td> </tr> </tbody></table> <h2 id="heading-manual-tracking-vs-proof-of-contribution">Manual Tracking vs. proof-of-contribution</h2> <table> <thead> <tr> <th></th> <th>Manual tracking</th> <th>proof-of-contribution</th> </tr> </thead> <tbody><tr> <td><strong>Finding who wrote the code</strong></td> <td>Search Slack, ask the team, dig through commits</td> <td><code>poc.py trace <file></code> — thirty seconds</td> </tr> <tr> <td><strong>Knowing which parts the AI guessed</strong></td> <td>You don't, until it breaks in production</td> <td>Knowledge Gaps section — surfaced before the code ships</td> </tr> <tr> <td><strong>Detecting gaps after the build</strong></td> <td>Code review, if someone notices</td> <td><code>poc.py verify</code> — static analysis, zero API calls</td> </tr> <tr> <td><strong>Enforcing attribution on PRs</strong></td> <td>Honor system</td> <td>GitHub Action fails the PR if attribution is missing</td> </tr> <tr> <td><strong>Connecting to your spec</strong></td> <td>Copy-paste assumptions into comments manually</td> <td><code>poc.py import-spec</code> seeds them as tracked claims automatically</td> </tr> <tr> <td><strong>Infrastructure required</strong></td> <td>None (usually a spreadsheet or nothing)</td> <td>None — SQLite, pure Python, no paid services</td> </tr> </tbody></table> <p>The tool doesn't replace code review. 
It gives code review the context it needs to catch the right things.</p> <p>The archaeology scenario — two days tracing a bug through dead-end commit messages — takes thirty seconds with <code>poc.py trace</code>. The code still has gaps, and it always will. But now you know where they are.</p> <p><em>Built by</em> <a href="https://dev.to/dannwaneri"><em>Daniel Nwaneri</em></a><em>. The spec-writer skill that feeds</em> <code>import-spec</code> <em>is at</em> <a href="https://github.com/dannwaneri/spec-writer"><em>github.com/dannwaneri/spec-writer</em></a><em>. The full proof-of-contribution repo is at</em> <a href="https://github.com/dannwaneri/proof-of-contribution"><em>github.com/dannwaneri/proof-of-contribution</em></a><em>.</em></p> </article> <article> <h1> How to Build a Cost-Efficient AI Agent with Tiered Model Routing </h1> <p>Daniel Nwaneri — Wed, 08 Apr 2026 22:59:09 +0000</p> <p>Most AI agent tutorials make the same mistake: they route every task to the most expensive model available.</p> <p>A character count doesn't need GPT-4. A presence check doesn't need Sonnet. A regex doesn't need anything except Python.</p> <p>The mistake isn't using AI — it's not knowing when to stop using it.</p> <p>This tutorial shows you how to build a tiered routing system that sends tasks to the cheapest model that can solve them. The pattern is called the cost curve. 
It comes from a comment thread on a DEV.to article, implemented by three developers over a weekend, and it cut the per-URL cost of a real SEO audit agent from $0.006 to effectively $0 for most pages.</p> <p>By the end, you'll have a working <code>cost_curve.py</code> module you can drop into any agent project.</p> <h2 id="heading-what-youll-build">What You'll Build</h2> <p>A three-tier routing function that:</p> <ul> <li><p>Runs deterministic Python checks first — zero API cost</p> </li> <li><p>Escalates to Claude Haiku only for genuinely ambiguous cases — ~$0.0001 per call</p> </li> <li><p>Escalates to Claude Sonnet only when semantic judgment is required — ~$0.006 per call</p> </li> <li><p>Falls back gracefully when any tier fails</p> </li> <li><p>Returns a consistent result schema regardless of which tier handled the request</p> </li> </ul> <p>The full implementation is part of <a href="https://github.com/dannwaneri/seo-agent">dannwaneri/seo-agent</a>, an open-core SEO audit agent. The cost curve module is the premium routing layer, and the principle applies to any agent with mixed-complexity tasks.</p> <h2 id="heading-prerequisites">Prerequisites</h2> <ul> <li><p>Python 3.11 or higher</p> </li> <li><p>An Anthropic API key</p> </li> <li><p>Basic familiarity with Python and the Claude API</p> </li> </ul> <h2 id="heading-table-of-contents">Table of Contents</h2> <ol> <li><p><a href="#heading-the-problem-with-calling-claude-on-everything">The Problem with Calling Claude on Everything</a></p> </li> <li><p><a href="#heading-the-cost-curve-explained">The Cost Curve Explained</a></p> </li> <li><p><a href="#heading-project-setup">Project Setup</a></p> </li> <li><p><a href="#heading-tier-1-deterministic-python">Tier 1: Deterministic Python</a></p> </li> <li><p><a href="#heading-tier-2-claude-haiku-for-ambiguous-cases">Tier 2: Claude Haiku for Ambiguous Cases</a></p> </li> <li><p><a href="#heading-tier-3-claude-sonnet-for-semantic-judgment">Tier 3: Claude Sonnet for 
Semantic Judgment</a></p> </li> <li><p><a href="#heading-the-router-auditurl">The Router: audit_url()</a></p> </li> <li><p><a href="#heading-graceful-fallback">Graceful Fallback</a></p> </li> <li><p><a href="#heading-testing-the-cost-curve">Testing the Cost Curve</a></p> </li> <li><p><a href="#heading-applying-this-pattern-to-your-agent">Applying This Pattern to Your Agent</a></p> </li> </ol> <h2 id="heading-the-problem-with-calling-claude-on-everything">The Problem with Calling Claude on Everything</h2> <p>Here's what most agent code looks like:</p> <pre><code class="language-python"># client, build_prompt and parse_response are assumed to be defined elsewhere
def audit_url(snapshot: dict) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,  # required by the Messages API
        messages=[{"role": "user", "content": build_prompt(snapshot)}]
    )
    return parse_response(response)
</code></pre> <p>This works. It also calls Sonnet for every URL in the list — including the ones where the title is 142 characters long and the answer is obviously FAIL without any model involvement.</p> <p>Claude Sonnet 4 is priced at $3 per million input tokens and $15 per million output tokens. A typical page snapshot is around 500 input tokens. That's $0.0015 per URL just for input — before output tokens. Across a 20-URL weekly audit, the total is around $0.12. Not expensive. But most of those pages have mechanical SEO issues: missing descriptions, titles over 60 characters, no canonical tag. A character count catches all of that. You don't need a model.</p> <p>The cost curve fixes this by routing based on what the task actually requires, not on what the model is capable of.</p> <h2 id="heading-the-cost-curve-explained">The Cost Curve Explained</h2> <p>In the cost curve, we have three tiers, three tools, and three price points:</p> <p><strong>Tier 1 — Deterministic Python. Cost: $0.</strong> Check title length, description length, H1 count, canonical presence. These are not judgment calls. They're string operations. If title length > 60, FAIL. 
No model needed.</p> <p><strong>Tier 2 — Claude Haiku. Cost: ~$0.0001 per call.</strong> Title present but only 4 characters long. Description present but only 30 characters. Status code is a redirect. These pass the mechanical audit but something is off. Haiku is fast and cheap enough that escalating ambiguous cases costs less than the debugging time you'd spend on false positives.</p> <p><strong>Tier 3 — Claude Sonnet. Cost: ~$0.006 per call.</strong> Pages Haiku flags as needing semantic judgment. "This title passes length but reads like a navigation label." "This description duplicates the title verbatim." Sonnet earns its cost on genuinely hard cases — not on every URL in the list.</p> <p>The first routing decision, whether to call a model at all, happens before any API call. The result schema is identical regardless of which tier handled the request.</p> <h2 id="heading-project-setup">Project Setup</h2> <pre><code class="language-bash">mkdir cost-curve-demo && cd cost-curve-demo
pip install anthropic
</code></pre> <p>Set your API key:</p> <pre><code class="language-bash"># macOS/Linux
export ANTHROPIC_API_KEY="sk-ant-..."

# Windows PowerShell
$env:ANTHROPIC_API_KEY = "sk-ant-..."
</code></pre> <p>Create <code>cost_curve.py</code> — you'll build this module step by step.</p> <h2 id="heading-tier-1-deterministic-python">Tier 1: Deterministic Python</h2> <p>Tier 1 runs first on every URL. It checks four fields using only Python string operations. 
There's no API call, no latency, and no cost.</p> <pre><code class="language-python">import json
import logging
import os
import re
from datetime import datetime, timezone

import anthropic

logger = logging.getLogger(__name__)

REDIRECT_CODES = {301, 302, 307, 308}

# Fields that trigger Tier 2 escalation
# Title or description present but suspiciously short
AMBIGUOUS_TITLE_MAX = 10   # chars — present but too short to be real
AMBIGUOUS_DESC_MAX = 50    # chars — present but too short to be useful


def _now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def _build_result(snapshot: dict, method: str) -> dict:
    """Base result skeleton — same schema regardless of tier."""
    return {
        "url": snapshot.get("final_url", ""),
        "final_url": snapshot.get("final_url", ""),
        "status_code": snapshot.get("status_code"),
        "title": {"value": None, "length": 0, "status": "PASS"},
        "description": {"value": None, "length": 0, "status": "PASS"},
        "h1": {"count": 0, "value": None, "status": "PASS"},
        "canonical": {"value": None, "status": "PASS"},
        "flags": [],
        "human_review": False,
        "audited_at": _now_iso(),
        "method": method,
        "needs_tier3": False,
    }


def tier1_check(snapshot: dict) -> dict:
    """
    Pure Python SEO checks. Zero API calls.
    Returns a result dict with method="deterministic".
    Sets needs_tier3=False always — Tier 1 never escalates to Tier 3 directly.
    Escalation to Tier 2 is decided by the router, not here.
    """
    result = _build_result(snapshot, "deterministic")

    title = snapshot.get("title") or ""
    description = snapshot.get("meta_description") or ""
    h1s = snapshot.get("h1s") or []
    canonical = snapshot.get("canonical") or ""

    # Title check
    result["title"]["value"] = title or None
    result["title"]["length"] = len(title)
    if not title or len(title) > 60:
        result["title"]["status"] = "FAIL"
        msg = "Title is missing" if not title else f"Title is {len(title)} characters (max 60)"
        result["flags"].append(msg)

    # Description check
    result["description"]["value"] = description or None
    result["description"]["length"] = len(description)
    if not description or len(description) > 160:
        result["description"]["status"] = "FAIL"
        msg = "Meta description is missing" if not description else f"Meta description is {len(description)} characters (max 160)"
        result["flags"].append(msg)

    # H1 check
    result["h1"]["count"] = len(h1s)
    result["h1"]["value"] = h1s[0] if h1s else None
    if len(h1s) == 0:
        result["h1"]["status"] = "FAIL"
        result["flags"].append("H1 tag is missing")
    elif len(h1s) > 1:
        result["h1"]["status"] = "FAIL"
        result["flags"].append(f"Multiple H1 tags found ({len(h1s)})")

    # Canonical check
    result["canonical"]["value"] = canonical or None
    if not canonical:
        result["canonical"]["status"] = "FAIL"
        result["flags"].append("Canonical tag is missing")

    return result
</code></pre> <p>The key design decision: <code>tier1_check()</code> never decides whether to escalate. It just runs the checks and returns. The router decides escalation, and it does so from the raw snapshot fields rather than from the Tier 1 verdict.</p> <h2 id="heading-tier-2-claude-haiku-for-ambiguous-cases">Tier 2: Claude Haiku for Ambiguous Cases</h2> <p>Tier 2 runs when a field passes the mechanical check but still looks suspicious. A 4-character title that is present but clearly wrong. A 30-character description that's technically there but useless. A redirect status that needs a human-readable explanation.</p> <p>Haiku is the right model here. 
It's fast, cheap ($1 input / $5 output per million tokens), and sufficient for triage-level judgment. The prompt asks a narrow question: is this ambiguous enough to need Sonnet?</p> <pre><code class="language-python">def tier2_check(snapshot: dict) -> dict:
    """
    Claude Haiku call for ambiguous cases.
    Returns result with method="haiku".
    Sets needs_tier3=True if Haiku determines the case needs semantic judgment.
    Falls back to Tier 1 result on API error.
    """
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise OSError("ANTHROPIC_API_KEY is not set.")

    client = anthropic.Anthropic(api_key=api_key)

    title = snapshot.get("title") or ""
    description = snapshot.get("meta_description") or ""
    status_code = snapshot.get("status_code")

    prompt = f"""You are an SEO auditor doing a quick triage check.

Page data:
- Title: {repr(title)} ({len(title)} chars)
- Meta description: {repr(description)} ({len(description)} chars)
- Status code: {status_code}

Answer these two questions with only "yes" or "no":
1. Does this page need semantic judgment beyond simple length/presence checks?
   (e.g. title is present but clearly wrong, description is present but meaningless)
2. Is the status code a redirect that needs investigation?

Respond in this exact JSON format and nothing else:
{{"needs_tier3": true_or_false, "reason": "one sentence explanation"}}"""

    try:
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=150,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = response.content[0].text.strip()

        # Strip markdown fences if present
        if raw.startswith("```"):
            lines = raw.splitlines()
            raw = "\n".join(lines[1:-1] if lines[-1].strip() == "```" else lines[1:])

        parsed = json.loads(raw)
        result = _build_result(snapshot, "haiku")

        # Copy Tier 1 field checks — Haiku doesn't redo those
        t1 = tier1_check(snapshot)
        result["title"] = t1["title"]
        result["description"] = t1["description"]
        result["h1"] = t1["h1"]
        result["canonical"] = t1["canonical"]
        result["flags"] = t1["flags"]

        result["needs_tier3"] = parsed.get("needs_tier3", False)
        if result["needs_tier3"]:
            result["flags"].append(f"Escalated to Tier 3: {parsed.get('reason', '')}")

        return result

    except Exception as exc:
        logger.warning("[tier2] Haiku API error: %s — falling back to Tier 1 result", exc)
        fallback = tier1_check(snapshot)
        fallback["method"] = "haiku-fallback"
        return fallback
</code></pre> <p>The fallback is the critical piece. If Haiku fails — rate limit, network error, malformed response — the function returns the Tier 1 result rather than crashing. The audit continues. The URL gets flagged with <code>method="haiku-fallback"</code> so you can identify it later.</p> <h2 id="heading-tier-3-claude-sonnet-for-semantic-judgment">Tier 3: Claude Sonnet for Semantic Judgment</h2> <p>Tier 3 is where the full extraction prompt runs. This is the same call you'd make in a naïve implementation — the difference is that only a small fraction of URLs reach this tier.</p> <pre><code class="language-python">def tier3_check(snapshot: dict) -> dict:
    """
    Claude Sonnet call for semantic judgment.
    Returns result with method="sonnet".
    This is the full extraction prompt — same as calling the model directly.
    """
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise OSError("ANTHROPIC_API_KEY is not set.")

    client = anthropic.Anthropic(api_key=api_key)

    prompt = f"""You are an SEO auditor. Analyze this page snapshot and return ONLY a JSON object.
No prose. No explanation. No markdown fences. Raw JSON only.

Page data:
- URL: {snapshot.get('final_url')}
- Status code: {snapshot.get('status_code')}
- Title: {snapshot.get('title')}
- Meta description: {snapshot.get('meta_description')}
- H1 tags: {snapshot.get('h1s')}
- Canonical: {snapshot.get('canonical')}

Return this exact schema:
{{
  "url": "string",
  "final_url": "string",
  "status_code": number,
  "title": {{"value": "string or null", "length": number, "status": "PASS or FAIL"}},
  "description": {{"value": "string or null", "length": number, "status": "PASS or FAIL"}},
  "h1": {{"count": number, "value": "string or null", "status": "PASS or FAIL"}},
  "canonical": {{"value": "string or null", "status": "PASS or FAIL"}},
  "flags": ["array of strings describing specific issues"],
  "human_review": false,
  "audited_at": "ISO timestamp"
}}

PASS/FAIL rules:
- title: FAIL if null or length > 60 characters, or if present but clearly not a real title
- description: FAIL if null or length > 160 characters, or if present but meaningless
- h1: FAIL if count is 0 or count > 1
- canonical: FAIL if null
- audited_at: use current UTC time"""

    try:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = response.content[0].text.strip()

        if raw.startswith("```"):
            lines = raw.splitlines()
            raw = "\n".join(lines[1:-1] if lines[-1].strip() == "```" else lines[1:])

        result = json.loads(raw)
        result["method"] = "sonnet"
        result["needs_tier3"] = False
        return result

    except Exception as exc:
        logger.warning("[tier3] Sonnet API error: %s — falling back to Tier 1 result", exc)
        fallback = tier1_check(snapshot)
        fallback["method"] = "sonnet-fallback"
        return fallback
</code></pre> <p>Note the prompt addition in Tier 3 that isn't in Tier 1: <code>"or if present but clearly not a real title"</code> and <code>"or if present but meaningless"</code>. That's the semantic judgment Haiku identified as needed. Tier 3 acts on it.</p> <h2 id="heading-the-router-auditurl">The Router: audit_url()</h2> <p>The router is the public interface. Everything else is an implementation detail.</p> <pre><code class="language-python">def audit_url(snapshot: dict, tiered: bool = False) -> dict: """ Route a page snapshot through the appropriate audit tier. Args: snapshot: Page data from browser.py — must contain final_url, status_code, title, meta_description, h1s, canonical. tiered: If False, delegates directly to Tier 3 (Sonnet). If True, routes through the cost curve. Returns: Audit result dict with method field indicating which tier ran. """ if not tiered: # Non-tiered mode: call Sonnet directly, same as v1 behavior return tier3_check(snapshot) # Tier 1: always runs first t1_result = tier1_check(snapshot) # Check if escalation to Tier 2 is warranted title = snapshot.get("title") or "" description = snapshot.get("meta_description") or "" status_code = snapshot.get("status_code") needs_tier2 = ( # Title present but suspiciously short (title and len(title) < AMBIGUOUS_TITLE_MAX) or # Description present but suspiciously short (description and len(description) < AMBIGUOUS_DESC_MAX) or # Redirect status — may need explanation (status_code in REDIRECT_CODES) ) if not needs_tier2: # Tier 1 result is definitive — return without any API call return t1_result # Tier 2: Haiku triage t2_result = tier2_check(snapshot) if not t2_result.get("needs_tier3", False): # Haiku determined no semantic judgment needed return t2_result # Tier 3: Sonnet for semantic judgment return tier3_check(snapshot) </code></pre> <p>The router logic is explicit and readable. Each decision point is a named condition. 
When <code>tiered=False</code>, behavior is identical to the v1 naive implementation — this is the backward compatibility guarantee that lets you add the cost curve incrementally without breaking existing audits.</p> <h2 id="heading-graceful-fallback">Graceful Fallback</h2> <p>The fallback pattern appears in both Tier 2 and Tier 3. It's worth making explicit:</p> <pre><code class="language-python"># Pattern used in both tier2_check() and tier3_check() except Exception as exc: logger.warning("[tierN] API error: %s — falling back to Tier 1 result", exc) fallback = tier1_check(snapshot) fallback["method"] = "tierN-fallback" return fallback </code></pre> <p>Three things this does:</p> <ol> <li><p>Logs the error with enough context to debug later</p> </li> <li><p>Returns a valid result — the Tier 1 deterministic check always runs regardless</p> </li> <li><p>Tags the result with the fallback method so you can filter these in your report</p> </li> </ol> <p>An agent that crashes on API errors is not production-ready. 
An agent that degrades gracefully and continues is.</p>
<h2 id="heading-testing-the-cost-curve">Testing the Cost Curve</h2>
<p>Create <code>test_cost_curve.py</code> to verify routing behavior without live API calls:</p>
<pre><code class="language-python">import json
from unittest import mock

from cost_curve import audit_url, tier1_check


def make_snapshot(
    title="Normal Title Under 60 Chars",
    description="A normal meta description that is under 160 characters and describes the page content well.",
    h1s=None,
    canonical="https://example.com/page",
    status_code=200,
    final_url="https://example.com/page",
):
    return {
        "title": title,
        "meta_description": description,
        # Build the default list per call to avoid a shared mutable default
        "h1s": ["Single H1"] if h1s is None else h1s,
        "canonical": canonical,
        "status_code": status_code,
        "final_url": final_url,
    }


def test_clean_page_returns_tier1_no_api_calls():
    """Clean page: all checks pass deterministically, so no API call."""
    snapshot = make_snapshot()
    with mock.patch("anthropic.Anthropic") as mock_client:
        result = audit_url(snapshot, tiered=True)
    assert result["method"] == "deterministic"
    mock_client.assert_not_called()
    print("PASS: clean page → Tier 1, zero API calls")


def test_long_title_returns_tier1_fail_no_api_call():
    """Title >60 chars: FAIL from Tier 1, no API call."""
    snapshot = make_snapshot(title="A" * 70)
    with mock.patch("anthropic.Anthropic") as mock_client:
        result = audit_url(snapshot, tiered=True)
    assert result["method"] == "deterministic"
    assert result["title"]["status"] == "FAIL"
    mock_client.assert_not_called()
    print("PASS: title >60 → Tier 1 FAIL, zero API calls")


def test_suspiciously_short_title_escalates_to_tier2():
    """Title present but only 3 chars: escalates to Tier 2."""
    snapshot = make_snapshot(title="SEO")  # 3 chars, under AMBIGUOUS_TITLE_MAX
    mock_response = mock.MagicMock()
    mock_response.content = [mock.MagicMock(
        text='{"needs_tier3": false, "reason": "title is short but not ambiguous"}'
    )]
    with mock.patch("anthropic.Anthropic") as mock_client:
        mock_client.return_value.messages.create.return_value = mock_response
        result = audit_url(snapshot, tiered=True)
    assert result["method"] == "haiku"
    assert mock_client.return_value.messages.create.call_count == 1
    print("PASS: short title → Tier 2 (Haiku called once)")


def test_tiered_false_calls_sonnet_directly():
    """tiered=False: Sonnet called regardless of snapshot content."""
    snapshot = make_snapshot()  # clean page, would be Tier 1 in tiered mode
    mock_response = mock.MagicMock()
    mock_response.content = [mock.MagicMock(text=json.dumps({
        "url": "https://example.com/page",
        "final_url": "https://example.com/page",
        "status_code": 200,
        "title": {"value": "Normal Title Under 60 Chars", "length": 27, "status": "PASS"},
        "description": {"value": "desc", "length": 4, "status": "PASS"},
        "h1": {"count": 1, "value": "Single H1", "status": "PASS"},
        "canonical": {"value": "https://example.com/page", "status": "PASS"},
        "flags": [],
        "human_review": False,
        "audited_at": "2026-04-01T00:00:00+00:00",
    }))]
    with mock.patch("anthropic.Anthropic") as mock_client:
        mock_client.return_value.messages.create.return_value = mock_response
        result = audit_url(snapshot, tiered=False)
    assert result["method"] == "sonnet"
    assert mock_client.return_value.messages.create.call_count == 1
    print("PASS: tiered=False → Sonnet called directly")


def test_haiku_api_failure_falls_back_to_tier1():
    """Haiku failure: falls back to Tier 1 result, no crash."""
    snapshot = make_snapshot(title="SEO")  # triggers Tier 2
    with mock.patch("anthropic.Anthropic") as mock_client:
        mock_client.return_value.messages.create.side_effect = Exception("rate limit")
        result = audit_url(snapshot, tiered=True)
    assert result["method"] == "haiku-fallback"
    print("PASS: Haiku failure → fallback to Tier 1, no crash")


if __name__ == "__main__":
    test_clean_page_returns_tier1_no_api_calls()
    test_long_title_returns_tier1_fail_no_api_call()
    test_suspiciously_short_title_escalates_to_tier2()
    test_tiered_false_calls_sonnet_directly()
    test_haiku_api_failure_falls_back_to_tier1()
    print("\nAll tests passed.")
</code></pre>
<p>Run it:</p>
<pre><code class="language-bash">python test_cost_curve.py
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">PASS: clean page → Tier 1, zero API calls
PASS: title >60 → Tier 1 FAIL, zero API calls
PASS: short title → Tier 2 (Haiku called once)
PASS: tiered=False → Sonnet called directly
PASS: Haiku failure → fallback to Tier 1, no crash

All tests passed.
</code></pre>
<h2 id="heading-applying-this-pattern-to-your-agent">Applying This Pattern to Your Agent</h2>
<p>The cost curve is not SEO-specific. Any agent with mixed-complexity tasks can use it.</p>
<p>The principle: classify tasks by what they actually require before deciding which model to invoke.</p>
<p><strong>Customer support agent:</strong></p>
<ul>
<li><p>Tier 1: keyword matching for known FAQ topics (no model)</p> </li>
<li><p>Tier 2: Haiku for intent classification on ambiguous queries</p> </li>
<li><p>Tier 3: Sonnet for complex complaints requiring judgment</p> </li>
</ul>
<p><strong>Code review agent:</strong></p>
<ul>
<li><p>Tier 1: lint rules, syntax checks (no model)</p> </li>
<li><p>Tier 2: Haiku for common pattern detection</p> </li>
<li><p>Tier 3: Sonnet for architectural review</p> </li>
</ul>
<p><strong>Content moderation agent:</strong></p>
<ul>
<li><p>Tier 1: blocklist matching (no model)</p> </li>
<li><p>Tier 2: Haiku for borderline cases</p> </li>
<li><p>Tier 3: Sonnet for context-dependent judgment</p> </li>
</ul>
<p>The implementation pattern is the same in all three cases. The <code>audit_url()</code> router becomes <code>route_task()</code>. The tier functions change their prompts and escalation conditions. The fallback logic stays identical.</p>
<p>The key question to ask before writing any agent code: what fraction of my inputs are mechanically solvable? That fraction goes to Tier 1. The rest escalate.
The cost curve routes everything else.</p> <h2 id="heading-wrapping-up">Wrapping Up</h2> <p>The full implementation — including the SEO audit agent that uses this module in production — is at <a href="https://github.com/dannwaneri/seo-agent">dannwaneri/seo-agent</a>. The <code>core/</code> directory is MIT licensed. The tiered routing lives in <code>premium/cost_curve.py</code>.</p> <p><em>This tutorial is the companion piece to</em> <a href="https://dev.to/dannwaneri/i-was-paying-0006-per-url-for-seo-audits-until-i-realized-most-needed-0-132j">I Was Paying $0.006 Per URL for SEO Audits Until I Realized Most Needed $0</a> <em>on DEV.to, which covers the architecture decisions behind the cost curve.</em></p> </article> <article> <h1> How to Build a Local SEO Audit Agent with Browser Use and Claude API </h1> <p>Daniel Nwaneri — Mon, 30 Mar 2026 23:37:08 +0000</p> <p>Every digital marketing agency has someone whose job involves opening a spreadsheet, visiting each client URL, checking the title tag, meta description, and H1, noting broken links, and pasting everything into a report. Then doing it again next week.</p> <p>That work is deterministic. An agent can do it.</p> <p>In this tutorial, you'll build a local SEO audit agent from scratch using Python, Browser Use, and the Claude API. The agent visits real pages in a visible browser window, extracts SEO signals using Claude, checks for broken links asynchronously, handles edge cases with a human-in-the-loop pause, and writes a structured report — all resumable if interrupted.</p> <p>By the end, you'll have a working agent you can run against any list of URLs. 
It costs less than $0.01 per URL to run.</p> <h2 id="heading-what-youll-build">What You'll Build</h2> <p>A seven-module Python agent that:</p> <ul> <li><p>Reads a URL list from a CSV file</p> </li> <li><p>Visits each URL in a real Chromium browser (not a headless scraper)</p> </li> <li><p>Extracts title, meta description, H1s, and canonical tag via Claude API</p> </li> <li><p>Checks for broken links asynchronously using httpx</p> </li> <li><p>Detects edge cases (404s, login walls, redirects) and pauses for human input</p> </li> <li><p>Writes results to <code>report.json</code> incrementally — safe to interrupt and resume</p> </li> <li><p>Generates a plain-English <code>report-summary.txt</code> on completion</p> </li> </ul> <p>The full code is on GitHub at <a href="https://github.com/dannwaneri/seo-agent">dannwaneri/seo-agent</a>.</p> <h2 id="heading-prerequisites">Prerequisites</h2> <ul> <li><p>Python 3.11 or higher</p> </li> <li><p>An Anthropic API key (get one at console.anthropic.com)</p> </li> <li><p>Windows, macOS, or Linux</p> </li> <li><p>Basic familiarity with Python and the command line</p> </li> </ul> <h2 id="heading-table-of-contents">Table of Contents</h2> <ol> <li><p><a href="#heading-why-browser-use-instead-of-a-scraper">Why Browser Use Instead of a Scraper</a></p> </li> <li><p><a href="#heading-project-structure">Project Structure</a></p> </li> <li><p><a href="#heading-setup">Setup</a></p> </li> <li><p><a href="#heading-module-1-state-management">Module 1: State Management</a></p> </li> <li><p><a href="#heading-module-2-browser-integration">Module 2: Browser Integration</a></p> </li> <li><p><a href="#heading-module-3-claude-extraction-layer">Module 3: Claude Extraction Layer</a></p> </li> <li><p><a href="#heading-module-4-broken-link-checker">Module 4: Broken Link Checker</a></p> </li> <li><p><a href="#heading-module-5-human-in-the-loop">Module 5: Human-in-the-Loop</a></p> </li> <li><p><a href="#heading-module-6-report-writer">Module 6: Report 
Writer</a></p> </li> <li><p><a href="#heading-module-7-the-main-loop">Module 7: The Main Loop</a></p> </li> <li><p><a href="#heading-running-the-agent">Running the Agent</a></p> </li> <li><p><a href="#heading-scheduling-for-agency-use">Scheduling for Agency Use</a></p> </li> <li><p><a href="#heading-what-the-results-look-like">What the Results Look Like</a></p> </li> </ol> <h2 id="heading-why-browser-use-instead-of-a-scraper">Why Browser Use Instead of a Scraper</h2> <p>The standard approach to SEO auditing is to fetch page HTML with <code>requests</code> and parse it with BeautifulSoup. That works on static pages. It breaks on JavaScript-rendered content, misses dynamically injected meta tags, and fails entirely on authenticated pages.</p> <p>Browser Use (84,000+ GitHub stars, MIT license) takes a different approach. It controls a real Chromium browser, reads the DOM after JavaScript executes, and exposes the page through Playwright's accessibility tree. The agent sees what a human would see.</p> <p>The practical difference: a requests-based scraper might miss a meta description injected by a React component. Browser Use won't.</p> <p>The other difference worth naming: Browser Use reads pages semantically. A Playwright script breaks when a button's CSS class changes from <code>btn-primary</code> to <code>button-main</code>. Browser Use identifies it's still a "Submit" button and acts accordingly. 
The extraction logic lives in the Claude prompt, not in brittle CSS selectors.</p> <h2 id="heading-project-structure">Project Structure</h2> <pre><code class="language-plaintext">seo-agent/ ├── index.py # Main audit loop ├── browser.py # Browser Use / Playwright page driver ├── extractor.py # Claude API extraction layer ├── linkchecker.py # Async broken link checker ├── hitl.py # Human-in-the-loop pause logic ├── reporter.py # Report writer ├── state.py # State persistence (resume on interrupt) ├── input.csv # Your URL list ├── requirements.txt ├── .env.example └── .gitignore </code></pre> <h2 id="heading-setup">Setup</h2> <p>Create a project folder and install dependencies:</p> <pre><code class="language-bash">mkdir seo-agent && cd seo-agent pip install browser-use anthropic playwright httpx playwright install chromium </code></pre> <p>Create <code>input.csv</code> with your URLs:</p> <pre><code class="language-plaintext">url https://example.com https://example.com/about https://example.com/contact </code></pre> <p>Create <code>.env.example</code>:</p> <pre><code class="language-plaintext">ANTHROPIC_API_KEY=your-key-here </code></pre> <p>Set your API key as an environment variable before running:</p> <pre><code class="language-bash"># macOS/Linux export ANTHROPIC_API_KEY="sk-ant-..." # Windows PowerShell $env:ANTHROPIC_API_KEY = "sk-ant-..." </code></pre> <p>Create <code>.gitignore</code>:</p> <pre><code class="language-plaintext">state.json report.json report-summary.txt .env __pycache__/ *.pyc </code></pre> <h2 id="heading-module-1-state-management">Module 1: State Management</h2> <p>The agent needs to track which URLs it has already audited. 
If the run is interrupted (power cut, keyboard interrupt, network error), it should resume from where it stopped, not start over.</p>
<p><code>state.py</code> handles this with a flat JSON file:</p>
<pre><code class="language-python">import json
import os

STATE_FILE = os.path.join(os.path.dirname(__file__), "state.json")

_DEFAULT_STATE = {"audited": [], "pending": [], "needs_human": []}


def load_state() -> dict:
    # Create the state file on first run so later reads always succeed
    if not os.path.exists(STATE_FILE):
        save_state(_DEFAULT_STATE.copy())
    with open(STATE_FILE, encoding="utf-8") as f:
        return json.load(f)


def save_state(state: dict) -> None:
    with open(STATE_FILE, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)


def is_audited(url: str) -> bool:
    return url in load_state()["audited"]


def mark_audited(url: str) -> None:
    state = load_state()
    if url not in state["audited"]:
        state["audited"].append(url)
        save_state(state)


def add_to_needs_human(url: str) -> None:
    state = load_state()
    if url not in state["needs_human"]:
        state["needs_human"].append(url)
        save_state(state)
</code></pre>
<p>The design is intentional: <code>mark_audited()</code> is called immediately after a URL is processed and written to the report. If the agent crashes mid-run, it loses at most one URL's work.</p>
<h2 id="heading-module-2-browser-integration">Module 2: Browser Integration</h2>
<p><code>browser.py</code> does the actual page navigation. It uses Playwright directly (which Browser Use installs as a dependency) to open a visible Chromium window, navigate to the URL, capture HTTP status and redirect information, and extract the raw SEO signals from the DOM.</p>
<p>The key design decisions:</p>
<p><strong>Visible browser, not headless.</strong> Set <code>headless=False</code> so you can watch the agent work. This matters for the demo and for debugging.</p>
<p><strong>Status capture via response listener.</strong> Playwright's <code>page.goto()</code> does not raise on 4xx/5xx responses; it raises only on timeouts and network-level failures. The <code>on("response", ...)</code> handler records the status of the first response the page receives, so the code is still available when navigation later times out or errors, and a redirect shows up as its original 3xx status. We capture status there.</p>
<p><strong>2-second delay between visits.</strong> Reduces the risk of triggering rate limiting or bot detection on agency client sites.</p>
<p>Here is the core navigation function:</p>
<pre><code class="language-python">import sys
import time

from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

TIMEOUT = 20_000  # 20 seconds


def fetch_page(url: str) -> dict:
    result = {
        "final_url": url,
        "status_code": None,
        "title": None,
        "meta_description": None,
        "h1s": [],
        "canonical": None,
        "raw_links": [],
    }
    first_status = {"code": None}

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()

        def on_response(response):
            # Keep only the first response status (the original URL, before redirects)
            if first_status["code"] is None:
                first_status["code"] = response.status

        page.on("response", on_response)

        try:
            page.goto(url, wait_until="domcontentloaded", timeout=TIMEOUT)
            result["status_code"] = first_status["code"] or 200
            result["final_url"] = page.url

            # Extract SEO signals from DOM
            result["title"] = page.title() or None
            result["meta_description"] = page.evaluate(
                "() => { const m = document.querySelector('meta[name=\"description\"]'); "
                "return m ? m.getAttribute('content') : null; }"
            )
            result["h1s"] = page.evaluate(
                "() => Array.from(document.querySelectorAll('h1')).map(h => h.innerText.trim())"
            )
            result["canonical"] = page.evaluate(
                "() => { const c = document.querySelector('link[rel=\"canonical\"]'); "
                "return c ? c.getAttribute('href') : null; }"
            )
            result["raw_links"] = page.evaluate(
                "() => Array.from(document.querySelectorAll('a[href]'))"
                ".map(a => a.href).filter(Boolean).slice(0, 100)"
            )
        except PlaywrightTimeout:
            result["status_code"] = first_status["code"] or 408
        except Exception as exc:
            print(f"[browser] Error: {exc}", file=sys.stderr)
            # May still be None; should_pause() treats that as a failed navigation
            result["status_code"] = first_status["code"]
        finally:
            browser.close()

    time.sleep(2)  # polite delay between visits
    return result
</code></pre>
<p>A few things worth noting:</p>
<p>The <code>raw_links</code> cap at 100 is deliberate.
DEV.to profile pages have hundreds of links — you don't need all of them for broken link detection.</p> <p>The <code>wait_until="domcontentloaded"</code> setting is faster than <code>networkidle</code> and sufficient for meta tag extraction. JavaScript-rendered content needs the DOM to be ready, not all network requests to complete.</p> <h2 id="heading-module-3-claude-extraction-layer">Module 3: Claude Extraction Layer</h2> <p><code>extractor.py</code> takes the raw page snapshot from <code>browser.py</code> and calls Claude to produce a structured SEO audit result.</p> <p>This is where most tutorials go wrong. They either write complex parsing logic in Python (fragile) or ask Claude for a free-form response and try to parse prose (unreliable). The right approach: give Claude a strict JSON schema and tell it to return nothing else.</p> <p><strong>The prompt engineering that makes this reliable:</strong></p> <pre><code class="language-python">import json import os import sys from datetime import datetime, timezone import anthropic MODEL = "claude-sonnet-4-20250514" client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY")) def _strip_fences(text: str) -> str: """Remove accidental markdown code fences from Claude's response.""" text = text.strip() if text.startswith("```"): lines = text.splitlines() # Drop opening fence lines = lines[1:] if lines[0].startswith("```") else lines # Drop closing fence if lines and lines[-1].strip() == "```": lines = lines[:-1] text = "\n".join(lines).strip() return text def extract(snapshot: dict) -> dict: if not os.environ.get("ANTHROPIC_API_KEY"): raise OSError("ANTHROPIC_API_KEY is not set.") prompt = f"""You are an SEO auditor. Analyze this page snapshot and return ONLY a JSON object. No prose. No explanation. No markdown fences. Raw JSON only. 
Page data: - URL: {snapshot.get('final_url')} - Status code: {snapshot.get('status_code')} - Title: {snapshot.get('title')} - Meta description: {snapshot.get('meta_description')} - H1 tags: {snapshot.get('h1s')} - Canonical: {snapshot.get('canonical')} Return this exact schema: {{ "url": "string", "final_url": "string", "status_code": number, "title": {{"value": "string or null", "length": number, "status": "PASS or FAIL"}}, "description": {{"value": "string or null", "length": number, "status": "PASS or FAIL"}}, "h1": {{"count": number, "value": "string or null", "status": "PASS or FAIL"}}, "canonical": {{"value": "string or null", "status": "PASS or FAIL"}}, "flags": ["array of strings describing specific issues"], "human_review": false, "audited_at": "ISO timestamp" }} PASS/FAIL rules: - title: FAIL if null or length > 60 characters - description: FAIL if null or length > 160 characters - h1: FAIL if count is 0 (missing) or count > 1 (multiple) - canonical: FAIL if null - flags: list every failing field with a clear description - audited_at: use current UTC time in ISO 8601 format""" response = client.messages.create( model=MODEL, max_tokens=1000, messages=[{"role": "user", "content": prompt}], ) raw = response.content[0].text clean = _strip_fences(raw) try: return json.loads(clean) except json.JSONDecodeError as exc: print(f"[extractor] JSON parse error: {exc}", file=sys.stderr) return _error_result(snapshot, str(exc)) def _error_result(snapshot: dict, reason: str) -> dict: return { "url": snapshot.get("final_url", ""), "final_url": snapshot.get("final_url", ""), "status_code": snapshot.get("status_code"), "title": {"value": None, "length": 0, "status": "ERROR"}, "description": {"value": None, "length": 0, "status": "ERROR"}, "h1": {"count": 0, "value": None, "status": "ERROR"}, "canonical": {"value": None, "status": "ERROR"}, "flags": [f"Extraction error: {reason}"], "human_review": True, "audited_at": datetime.now(timezone.utc).isoformat(), } </code></pre> 
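<p>The prompt asks for a strict schema, but nothing in the code above checks that the parsed JSON actually matches it. As a minimal sketch, assuming a hypothetical helper named <code>validate_result()</code> (it is not part of the repository), a structural check could run on the output of <code>json.loads()</code> before the result is written to the report:</p>
<pre><code class="language-python"># Hypothetical helper, not part of the seo-agent repository.
# It checks only the shape described in the prompt's schema, not the content.

REQUIRED_FIELDS = ("title", "description", "h1", "canonical")
VALID_STATUSES = {"PASS", "FAIL"}


def validate_result(result: object) -> list[str]:
    """Return a list of schema problems; an empty list means the shape is valid."""
    if not isinstance(result, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for field in REQUIRED_FIELDS:
        entry = result.get(field)
        if not isinstance(entry, dict):
            problems.append(f"{field}: missing or not an object")
        elif entry.get("status") not in VALID_STATUSES:
            problems.append(f"{field}: status must be PASS or FAIL")

    if not isinstance(result.get("flags"), list):
        problems.append("flags: missing or not a list")
    return problems


if __name__ == "__main__":
    good = {
        "title": {"value": "Home", "length": 4, "status": "PASS"},
        "description": {"value": None, "length": 0, "status": "FAIL"},
        "h1": {"count": 1, "value": "Home", "status": "PASS"},
        "canonical": {"value": "https://example.com/", "status": "PASS"},
        "flags": ["description: missing"],
    }
    print(validate_result(good))  # prints []
    print(validate_result({"title": "x"}))  # lists every malformed or missing field
</code></pre>
<p>If the list is non-empty, one option is to send the URL down the same path as <code>_error_result()</code> so it lands in human review rather than in the report as a trusted result. The check is meant for the parsed model response only; results built by <code>_error_result()</code> use the <code>ERROR</code> status and would be reported as invalid by this sketch.</p>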
<p>Two things make this reliable in production:</p> <p>First, <code>_strip_fences()</code> handles the case where Claude wraps its response in <code>```json</code> fences despite being told not to. This happens occasionally with Sonnet and consistently breaks <code>json.loads()</code> if you don't handle it.</p> <p>Second, the <code>_error_result()</code> fallback means the agent never crashes on a bad Claude response — it logs the error and marks the URL for human review, then continues to the next URL.</p> <p><strong>Cost:</strong> Claude Sonnet 4 is priced at $3 per million input tokens and $15 per million output tokens. A typical page snapshot is around 500 input tokens; the structured JSON response is around 300 output tokens. That works out to roughly $0.006 per URL — about $0.12 for a 20-URL audit.</p> <h2 id="heading-module-4-broken-link-checker">Module 4: Broken Link Checker</h2> <p><code>linkchecker.py</code> takes the <code>raw_links</code> list from the browser snapshot and checks same-domain links for broken status using async HEAD requests.</p> <p>The design choices:</p> <ul> <li><p><strong>Same-domain only.</strong> Checking every external link on a page would take minutes and isn't what agency clients need. Filter to links on the same domain as the page being audited.</p> </li> <li><p><strong>HEAD requests, not GET.</strong> Faster, lower bandwidth, sufficient for status code detection.</p> </li> <li><p><strong>Cap at 50 links.</strong> Pages like DEV.to article listings have hundreds of internal links. 
Checking all of them would dominate the runtime.</p> </li>
<li><p><strong>Concurrent requests via asyncio.</strong> All links are checked concurrently, not sequentially.</p> </li>
</ul>
<pre><code class="language-python">import asyncio
import logging
from urllib.parse import urlparse

import httpx

CAP = 50
TIMEOUT = 5.0

logger = logging.getLogger(__name__)


def _same_domain(link: str, final_url: str) -> bool:
    if not link:
        return False
    lower = link.strip().lower()
    # Skip anchors and non-HTTP schemes
    if lower.startswith(("#", "mailto:", "javascript:", "tel:", "data:")):
        return False
    try:
        page_host = urlparse(final_url).netloc.lower()
        parsed = urlparse(link)
        return parsed.scheme in ("http", "https") and parsed.netloc.lower() == page_host
    except Exception:
        return False


async def _check_link(client: httpx.AsyncClient, url: str) -> tuple[str, bool]:
    try:
        resp = await client.head(url, follow_redirects=True, timeout=TIMEOUT)
        # Treat only 4xx/5xx as broken; other 2xx codes such as 204 are healthy
        return url, resp.status_code >= 400
    except Exception:
        return url, True  # Timeout or connection error = broken


async def _run_checks(links: list[str]) -> list[str]:
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*[_check_link(client, url) for url in links])
    return [url for url, broken in results if broken]


def check_links(raw_links: list[str], final_url: str) -> dict:
    same_domain = [link for link in raw_links if _same_domain(link, final_url)]
    capped = len(same_domain) > CAP
    if capped:
        logger.warning("Page has %d same-domain links; capping at %d.", len(same_domain), CAP)
        same_domain = same_domain[:CAP]
    broken = asyncio.run(_run_checks(same_domain))
    return {
        "broken": broken,
        "count": len(broken),
        "status": "FAIL" if broken else "PASS",
        "capped": capped,
    }
</code></pre>
<h2 id="heading-module-5-human-in-the-loop">Module 5: Human-in-the-Loop</h2>
<p>This is the part most automation tutorials skip. What happens when the agent hits a login wall? A page that returns 403?
A URL that redirects to a "Subscribe to continue reading" page?</p>
<p>Most scripts either crash or silently skip. Neither is acceptable in an agency context.</p>
<p><code>hitl.py</code> handles this with two functions: one that detects whether a pause is needed, and one that handles the pause itself.</p>
<pre><code class="language-python">from state import add_to_needs_human

LOGIN_KEYWORDS = {"login", "sign in", "sign-in", "access denied", "log in", "unauthorized"}
REDIRECT_CODES = {301, 302, 307, 308}


def should_pause(snapshot: dict) -> bool:
    code = snapshot.get("status_code")

    # Navigation failed entirely
    if code is None:
        return True

    # Non-200, non-redirect
    if code != 200 and code not in REDIRECT_CODES:
        return True

    # Login wall detection
    title = (snapshot.get("title") or "").lower()
    h1s = [h.lower() for h in (snapshot.get("h1s") or [])]
    if any(kw in title for kw in LOGIN_KEYWORDS):
        return True
    if any(kw in h1 for kw in LOGIN_KEYWORDS for h1 in h1s):
        return True

    return False


def pause_reason(snapshot: dict) -> str:
    code = snapshot.get("status_code")
    if code is None:
        return "Navigation failed (None status)"
    if code != 200 and code not in REDIRECT_CODES:
        return f"Unexpected status code: {code}"
    return "Possible login wall detected"


def pause_and_prompt(url: str, reason: str) -> str:
    print(f"\n⚠️ HUMAN REVIEW NEEDED")
    print(f" URL: {url}")
    print(f" Reason: {reason}")
    print(f" Options: [s] skip [r] retry [q] quit\n")
    while True:
        choice = input("Your choice: ").strip().lower()
        if choice in ("s", "r", "q"):
            return {"s": "skip", "r": "retry", "q": "quit"}[choice]
        print(" Enter s, r, or q.")
</code></pre>
<p>The <code>should_pause()</code> function catches four cases: navigation failure, unexpected HTTP status, login keywords in the title, and login keywords in H1 tags.
The login keyword check is what catches "Please sign in to continue" pages that return 200 but are effectively inaccessible.</p>
<p>In <code>--auto</code> mode (for scheduled runs), the main loop skips the <code>pause_and_prompt()</code> call and automatically handles these cases by logging the URL to <code>needs_human[]</code> in state and continuing.</p>
<h2 id="heading-module-6-report-writer">Module 6: Report Writer</h2>
<p><code>reporter.py</code> writes results incrementally. This is important: results are written after each URL is audited, not batched at the end. If the run is interrupted, you don't lose completed work.</p>
<pre><code class="language-python">import json
import os

REPORT_JSON = os.path.join(os.path.dirname(__file__), "report.json")
REPORT_TXT = os.path.join(os.path.dirname(__file__), "report-summary.txt")


def _load_report() -> list:
    if not os.path.exists(REPORT_JSON):
        return []
    with open(REPORT_JSON, encoding="utf-8") as f:
        return json.load(f)


def write_result(result: dict) -> None:
    """Append or update a result in report.json."""
    entries = _load_report()
    url = result.get("url", "")

    # Update existing entry if URL already present (handles retries)
    for i, entry in enumerate(entries):
        if entry.get("url") == url:
            entries[i] = result
            break
    else:
        entries.append(result)

    with open(REPORT_JSON, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2, ensure_ascii=False)


def _is_overall_pass(result: dict) -> bool:
    fields = ["title", "description", "h1", "canonical"]
    for field in fields:
        # Anything other than PASS (FAIL, ERROR, missing) fails the URL
        if result.get(field, {}).get("status") != "PASS":
            return False
    if result.get("broken_links", {}).get("status") == "FAIL":
        return False
    return True


def write_summary() -> None:
    entries = _load_report()
    passed = sum(1 for e in entries if _is_overall_pass(e))

    lines = []
    for entry in entries:
        overall = "PASS" if _is_overall_pass(entry) else "FAIL"
        failed_fields = [
            f for f in ["title", "description", "h1", "canonical", "broken_links"]
            if entry.get(f, {}).get("status") == "FAIL"
        ]
        suffix = f" [{', '.join(failed_fields)}]" if failed_fields else ""
        lines.append(f"{entry.get('url', 'unknown'):<60} | {overall}{suffix}")

    lines.append("")
    lines.append(f"{passed}/{len(entries)} URLs passed")

    with open(REPORT_TXT, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
</code></pre>
<p>The deduplication in <code>write_result()</code> handles retries cleanly. If a URL is retried after a human reviews a login wall and authenticates, the new result replaces the old one rather than creating a duplicate entry.</p>
<h2 id="heading-module-7-the-main-loop">Module 7: The Main Loop</h2>
<p><code>index.py</code> wires everything together. It reads the URL list, loads state, skips already-audited URLs, and runs the audit loop.</p>
<pre><code class="language-python">import argparse
import csv
import os
import sys

from state import is_audited, mark_audited, add_to_needs_human
from browser import fetch_page
from extractor import extract
from linkchecker import check_links
from hitl import should_pause, pause_reason, pause_and_prompt
from reporter import write_result, write_summary

INPUT_CSV = os.path.join(os.path.dirname(__file__), "input.csv")


def read_urls(path: str) -> list[str]:
    with open(path, newline="", encoding="utf-8") as f:
        # "or ''" guards against rows where the url cell is missing
        return [
            row["url"].strip()
            for row in csv.DictReader(f)
            if (row.get("url") or "").strip()
        ]


def run(auto: bool = False):
    if not os.environ.get("ANTHROPIC_API_KEY"):
        print("Error: ANTHROPIC_API_KEY environment variable is not set.")
        sys.exit(1)

    urls = read_urls(INPUT_CSV)
    pending = [u for u in urls if not is_audited(u)]
    print(f"Starting audit: {len(pending)} pending, {len(urls) - len(pending)} already done.\n")
    total = len(urls)

    try:
        for url in pending:
            position = urls.index(url) + 1
            print(f"[{position}/{total}] {url}", end=" -> ", flush=True)

            # Browser navigation
            snapshot = fetch_page(url)

            # Human-in-the-loop check
            if should_pause(snapshot):
                reason = pause_reason(snapshot)
                if auto:
                    print(f"AUTO-SKIPPED ({reason})")
                    add_to_needs_human(url)
                    mark_audited(url)
                    continue
                action = pause_and_prompt(url, reason)
                if action == "quit":
                    print("Exiting.")
                    break
                elif action == "skip":
                    add_to_needs_human(url)
                    mark_audited(url)
                    continue
                # "retry": re-fetch once, then continue with the new snapshot
                snapshot = fetch_page(url)

            # Claude extraction
            result = extract(snapshot)

            # Broken link check
            links = check_links(snapshot.get("raw_links", []), snapshot.get("final_url", url))
            result["broken_links"] = links

            # Write result immediately
            write_result(result)
            mark_audited(url)

            fields_pass = all(
                result.get(f, {}).get("status") == "PASS"
                for f in ["title", "description", "h1", "canonical"]
            )
            overall = "PASS" if fields_pass and links["status"] == "PASS" else "FAIL"
            print(overall)
    except KeyboardInterrupt:
        print("\n\nInterrupted. Progress saved. Re-run to continue.")
        return

    # Pass counts are computed inside write_summary()
    write_summary()
    print("\nAudit complete. Report saved to report.json and report-summary.txt")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--auto", action="store_true", help="Auto-skip URLs requiring human review")
    args = parser.parse_args()
    run(auto=args.auto)
</code></pre>
<p>The <code>KeyboardInterrupt</code> handler is the resume mechanism. When you press Ctrl+C, the handler prints a message and exits cleanly.
Because <code>mark_audited()</code> is called after <code>write_result()</code> for each URL, the next run skips everything already processed.</p>
<h2 id="heading-running-the-agent">Running the Agent</h2>
<p>Interactive mode (pauses on edge cases):</p>
<pre><code class="language-bash">python index.py
</code></pre>
<p>Auto mode (skips edge cases, adds to <code>needs_human[]</code>):</p>
<pre><code class="language-bash">python index.py --auto
</code></pre>
<p>When it runs, you'll see the browser window open for each URL and the terminal print progress:</p>
<pre><code class="language-plaintext">Starting audit: 7 pending, 0 already done.

[1/7] https://example.com -> PASS
[2/7] https://example.com/about -> FAIL
[3/7] https://example.com/contact -> AUTO-SKIPPED (Unexpected status code: 404)
...

Audit complete. Report saved to report.json and report-summary.txt
</code></pre>
<p>To resume after an interruption:</p>
<pre><code class="language-bash">python index.py --auto
# Starting audit: 4 pending, 3 already done.
</code></pre>
<h2 id="heading-scheduling-for-agency-use">Scheduling for Agency Use</h2>
<p>For recurring weekly audits, create a batch file and schedule it with Windows Task Scheduler.</p>
<p>Create <code>run-audit.bat</code>:</p>
<pre><code class="language-batch">@echo off
set ANTHROPIC_API_KEY=your-key-here
cd /d C:\Users\yourname\Desktop\seo-agent
python index.py --auto
</code></pre>
<p>In Windows Task Scheduler:</p>
<ol> <li><p>Create a new Basic Task</p> </li> <li><p>Set the trigger to Weekly, Monday at 7:00 AM</p> </li> <li><p>Set the action to "Start a program"</p> </li> <li><p>Browse to your <code>run-audit.bat</code> file</p> </li> </ol>
<p>Check <code>report-summary.txt</code> on Monday morning.
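</p>
<p>That Monday check can itself be scripted. A minimal sketch, assuming <code>state.json</code> is a JSON object whose <code>needs_human</code> key holds a list of URLs, and that the last line of <code>report-summary.txt</code> is the totals line written by <code>write_summary()</code>. The <code>monday_check</code> helper is hypothetical and not part of the repo:</p>
<pre><code class="language-python">import json
from pathlib import Path


def monday_check(base_dir="."):
    """Return (totals line, URLs needing review) for an audit directory."""
    base = Path(base_dir)
    summary_path = base / "report-summary.txt"
    state_path = base / "state.json"

    # The summary ends with a line such as "3/7 URLs passed"
    totals = "no summary found"
    if summary_path.exists():
        lines = summary_path.read_text(encoding="utf-8").splitlines()
        if lines:
            totals = lines[-1]

    # Assumed layout: {"needs_human": ["https://...", ...], ...}
    needs_human = []
    if state_path.exists():
        state = json.loads(state_path.read_text(encoding="utf-8"))
        needs_human = state.get("needs_human", [])

    return totals, needs_human


if __name__ == "__main__":
    totals, needs_human = monday_check()
    print(totals)
    for url in needs_human:
        print("needs review:", url)
</code></pre>
<p>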
URLs in <code>needs_human[]</code> in <code>state.json</code> need manual review — login walls, paywalls, or pages that returned unexpected status codes.</p>
<p>For macOS/Linux, use cron:</p>
<pre><code class="language-bash"># Run every Monday at 7am
0 7 * * 1 cd /path/to/seo-agent && ANTHROPIC_API_KEY=your-key python index.py --auto
</code></pre>
<h2 id="heading-what-the-results-look-like">What the Results Look Like</h2>
<p>I ran this agent against seven of my own published pages across Hashnode, freeCodeCamp, and DEV.to. Every single one failed.</p>
<pre><code class="language-plaintext">https://hashnode.com/@dannwaneri | FAIL [h1]
https://freecodecamp.org/news/claude-code-skill | FAIL [description]
https://freecodecamp.org/news/stop-letting-ai-guess | FAIL [description]
https://freecodecamp.org/news/rag-system-handbook | FAIL [title, description]
https://freecodecamp.org/news/author/dannwaneri | FAIL [description]
https://dev.to/dannwaneri/gatekeeping-panic | FAIL [title]
https://dev.to/dannwaneri/production-rag-system | FAIL [title]

0/7 URLs passed
</code></pre>
<p>The freeCodeCamp description issues are partly platform-level — freeCodeCamp's template sometimes truncates or omits meta descriptions for article listing pages. The DEV.to title issues are mine. Article titles that work as headlines often exceed 60 characters in the <code>&lt;title&gt;</code> tag.</p>
<p>A note on the 60-character title rule: this is a display threshold, not a ranking penalty. Google indexes titles of any length. The 60-character guideline reflects approximately how many characters fit in a desktop SERP result before truncation. Titles over 60 characters often still rank — they just get cut off in search results, which can hurt click-through rate. The agent flags display risk, not a ranking violation.</p>
<h2 id="heading-next-steps">Next Steps</h2>
<p>The agent as built handles the core SEO audit workflow.
Obvious extensions:</p> <ul> <li><p><strong>Performance metrics</strong> — add a Lighthouse or PageSpeed Insights API call per URL</p> </li> <li><p><strong>Structured data validation</strong> — check for JSON-LD schema markup and validate it</p> </li> <li><p><strong>Email delivery</strong> — send <code>report-summary.txt</code> via SMTP after the run completes</p> </li> <li><p><strong>Multi-client support</strong> — separate <code>input.csv</code> files per client, separate report directories</p> </li> </ul> <p>The full code including all seven modules is at <a href="https://github.com/dannwaneri/seo-agent">dannwaneri/seo-agent</a>. Clone it, add your URLs, and run it.</p> <p><em>If you found this useful, I write about practical AI agent setups for developers and agencies at</em> <a href="https://dev.to/dannwaneri"><em>DEV.to/@dannwaneri</em></a><em>. The DEV.to companion piece covers the design decisions behind the agent — why HITL matters, why Browser Use over scrapers, and what the audit results mean for your own published content.</em></p> </article> <article> <h1> How to Build Your Own Claude Code Skill </h1> <p>Daniel Nwaneri — Fri, 27 Mar 2026 20:47:26 +0000</p> <p>Every developer eventually has a workflow they repeat. A way they write commit messages. A checklist they run before opening a pull request. A structure they follow when reviewing code. They do it manually, explain it to their agents in every session, and watch the agent interpret it differently each time.</p> <p>Agent skills fix this. A skill is a markdown file that loads into Claude Code's context automatically when you need it. You write the workflow once. The agent follows it every time. And because skills follow an open standard, the same file works in Claude Code, GitHub Copilot, Cursor, and Gemini CLI.</p> <p>This tutorial shows you how to build a skill from scratch. 
You will build a commit-message-writer — a skill that reads your staged changes and generates a structured commit message following the Conventional Commits standard. By the end, you will have a working skill installed and ready to use, and you will understand the structure well enough to build any skill you need.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol> <li><p><a href="#heading-what-an-agent-skill-is">What an Agent Skill Is</a></p> </li> <li><p><a href="#heading-how-to-choose-what-to-build">How to Choose What to Build</a></p> </li> <li><p><a href="#heading-how-to-structure-your-skill">How to Structure Your Skill</a></p> </li> <li><p><a href="#heading-how-to-write-the-description">How to Write the Description</a></p> </li> <li><p><a href="#heading-how-to-write-the-instructions">How to Write the Instructions</a></p> </li> <li><p><a href="#heading-how-to-build-the-commit-message-writer-skill">How to Build the commit-message-writer Skill</a></p> </li> <li><p><a href="#heading-how-to-install-and-test-your-skill">How to Install and Test Your Skill</a></p> </li> <li><p><a href="#heading-how-to-improve-your-skill-over-time">How to Improve Your Skill Over Time</a></p> </li> <li><p><a href="#heading-where-to-go-next">Where to Go Next</a></p> </li> </ol>
<h2 id="heading-what-an-agent-skill-is">What an Agent Skill Is</h2>
<p>A skill is a folder containing a <code>SKILL.md</code> file. That file has two parts: a YAML frontmatter block at the top, and a markdown body below it.</p>
<pre><code class="language-plaintext">my-skill/
└── SKILL.md
</code></pre>
<p>The frontmatter tells the agent what the skill is called and when to use it. The body tells the agent what to do when it loads the skill. Here is the minimal structure:</p>
<pre><code class="language-yaml">---
name: my-skill
description: What this skill does and when to use it.
---

# My Skill

Instructions for the agent go here.
</code></pre>
<p>When you invoke a skill — either explicitly with <code>/skill-name</code> or by describing what you want — the agent reads the SKILL.md body and follows the instructions inside it. The frontmatter never reaches the agent's instructions. It's metadata the skill system uses to decide whether to load the skill at all.</p>
<h3 id="heading-how-the-agent-decides-to-load-a-skill">How the Agent Decides to Load a Skill</h3>
<p>This is the most important thing to understand before you write your first skill: <strong>the agent decides whether to load your skill based entirely on the description field.</strong></p>
<p>Skills appear in Claude Code's context as a list of names and descriptions. When you make a request, the agent scans that list and loads any skill whose description matches what you're asking for. If the description is vague, the skill won't load when you need it. If the description is too narrow, it won't load for variations of the same request.</p>
<p>The instructions in the body only matter after the skill loads. Getting the description right is what determines whether the skill loads at all.</p>
<h3 id="heading-what-skills-are-not">What Skills Are Not</h3>
<p>Skills are instruction files. They cannot run code on their own — but they can instruct the agent to run code using its existing tools. They are not plugins, extensions, or packages. They have no runtime. They are markdown files the agent reads, like a recipe a chef follows.</p>
<h2 id="heading-how-to-choose-what-to-build">How to Choose What to Build</h2>
<p>The best skills share three properties.</p>
<ol> <li><p><strong>They encode a repeatable workflow.</strong> If you do something differently every time, a skill won't help. If you follow the same steps every session — even if you explain them differently each time — that's a skill candidate.</p> </li> <li><p><strong>They have a clear trigger.</strong> You should be able to finish the sentence "I need this skill when I want to...".
If you can't finish that sentence in one clause, the workflow isn't scoped enough for a skill.</p> </li> <li><p><strong>They produce a consistent output format.</strong> Skills that output in a fixed structure — a commit message, a code review, a spec — are easier to build and test than skills that produce open-ended prose.</p> </li> </ol>
<p>Good candidates: commit messages, pull request descriptions, code reviews, changelog entries. Bad candidates: "help me think through this", "make this better" — too open-ended to encode in a skill.</p>
<p>For this tutorial, commit message generation is the right scope. The trigger is obvious (you want to commit), the workflow is defined (read staged changes, apply Conventional Commits format), and the output is structured (a commit message with a specific shape).</p>
<h2 id="heading-how-to-structure-your-skill">How to Structure Your Skill</h2>
<p>Every skill starts as a single folder with a single file:</p>
<pre><code class="language-plaintext">commit-message-writer/
└── SKILL.md
</code></pre>
<p>As skills grow, they can include additional files the agent loads as needed:</p>
<pre><code class="language-plaintext">commit-message-writer/
├── SKILL.md          ← always loaded when skill triggers
└── references/
    └── examples.md   ← loaded only when the agent needs examples
</code></pre>
<p>The SKILL.md body should stay under 500 lines. If your instructions are growing beyond that, move supporting detail into a <code>references/</code> subfolder and tell the agent when to read those files. This keeps the skill lean — the agent only loads what it needs.</p>
<p>For this tutorial, a single SKILL.md is enough.</p>
<h2 id="heading-how-to-write-the-description">How to Write the Description</h2>
<p>The description field is the trigger condition. It determines when your skill loads and when it doesn't.
Most skills fail not because the instructions are wrong, but because the description doesn't match how people actually ask for help.</p> <p>Here is a weak description:</p> <pre><code class="language-yaml">description: Generates commit messages. </code></pre> <p>This will undertrigger. "Generate a commit message" will load it. "Write a commit for my changes" probably won't. "Summarize my staged diff" definitely won't — even though all three are asking for the same thing.</p> <p>Here is a stronger description:</p> <pre><code class="language-yaml">description: Generates structured commit messages following the Conventional Commits standard. Use when you want to commit your changes and need a well-formatted message. Triggers on "write a commit message", "commit my changes", "summarize my staged diff", "what should my commit say", or any request to describe or document code changes for version control. </code></pre> <p>The pattern is: <strong>what the skill does + when to use it + specific trigger phrases</strong>. The trigger phrases cover the different ways a developer might ask for the same thing.</p> <p>Two rules for descriptions:</p> <p><strong>Be specific about the output.</strong> "Generates commit messages" is vague. "Generates structured commit messages following the Conventional Commits standard" tells the agent and the user exactly what they'll get.</p> <p><strong>Be slightly pushy.</strong> The agent has a natural tendency to undertrigger skills — to handle requests itself rather than loading a skill. A description that explicitly lists trigger phrases counteracts this. You are not being redundant. You are training the trigger.</p> <h2 id="heading-how-to-write-the-instructions">How to Write the Instructions</h2> <p>The body of SKILL.md is where you define what the agent does when the skill loads. 
Good instructions follow two principles.</p>
<p><strong>Generate first, clarify second.</strong> The agent should produce output immediately rather than asking clarifying questions. If it needs to make assumptions, it should make them and flag them — not ask. Asking questions before producing output adds friction and loses the benefit of having a skill at all.</p>
<p><strong>Define the output format explicitly.</strong> Don't say "write a good commit message." Say exactly what the structure is, what fields are required, what the character limits are. The more specific the output format, the more consistent the results.</p>
<p>Here is what weak instructions look like:</p>
<pre><code class="language-markdown"># Commit Message Writer

Look at the staged changes and write a commit message that describes what changed.
</code></pre>
<p>That will produce different results every time — different formats, different lengths, different conventions. It's not a skill. It's a prompt.</p>
<p>Here is what strong instructions look like:</p>
<pre><code class="language-markdown"># Commit Message Writer

Read the staged diff using `git diff --staged`. Generate a commit message following the Conventional Commits standard.

Output format:

type(scope): short description under 72 characters

Body (if changes are non-trivial):
- What changed and why, not how
- One bullet per logical change

Footer (if applicable):
BREAKING CHANGE: description
Closes #issue-number
</code></pre>
<p>The agent knows exactly what to produce. The output will be consistent across sessions, across projects, and across agents that support the standard.</p>
<h2 id="heading-how-to-build-the-commit-message-writer-skill">How to Build the <code>commit-message-writer</code> Skill</h2>
<p>Now build it.
Create the skill directory:</p>
<pre><code class="language-bash">mkdir -p ~/.claude/skills/commit-message-writer
</code></pre>
<p>On Windows PowerShell:</p>
<p><strong>Note:</strong> PowerShell uses backtick (<code>`</code>) for line continuation, not backslash.</p>
<pre><code class="language-powershell">New-Item -ItemType Directory -Force -Path "$HOME\.claude\skills\commit-message-writer"
</code></pre>
<p>Create the SKILL.md file inside that directory. Here is the complete content:</p>
<pre><code class="language-markdown">---
name: commit-message-writer
description: Generates structured commit messages following the Conventional Commits standard. Use when you want to commit your changes and need a well-formatted message. Triggers on "write a commit message", "commit my changes", "summarize my staged diff", "what should my commit say", or any request to describe or document staged changes for version control.
---

# commit-message-writer

You generate structured commit messages from staged git changes.

## How to invoke

Run `git diff --staged` to read the staged changes. If nothing is staged, tell the user and suggest they run `git add` first.

Generate first. Do not ask clarifying questions before producing the commit message. If you need to make assumptions about scope or type, make them and note them after the output.

## Output format

~~~
type(scope): short description

[body — optional, include if changes are non-trivial]

[footer — optional]
~~~

**Type** — choose one:
- `feat` — a new feature
- `fix` — a bug fix
- `docs` — documentation changes only
- `refactor` — code change that neither fixes a bug nor adds a feature
- `test` — adding or updating tests
- `chore` — build process, tooling, or dependency updates

**Scope** — the module, file, or area affected. Use the directory name or component name. Omit if the change spans the entire codebase.

**Short description** — imperative mood, under 72 characters, no period at the end. "Add user authentication" not "Added user authentication" or "Adds user authentication."

**Body** — what changed and why, not how. One bullet per logical change. Skip if the short description is self-explanatory.

**Footer** — include `BREAKING CHANGE:` if the commit breaks backward compatibility. Include `Closes #N` if it resolves a GitHub issue.

## Quality rules

- Never use "updated", "changed", or "modified" in the short description — be specific
- Never write "various improvements" or "misc fixes" — name what improved
- If more than three files changed across unrelated concerns, flag it: "These changes may be better split into separate commits: [list concerns]"
- The short description must be under 72 characters — count before outputting

## Example output

Input: staged changes adding a rate limiter to an API endpoint

~~~
feat(api): add rate limiting to /query endpoint

- Limits requests to 100 per minute per IP using Cloudflare's rate limit binding
- Returns 429 with Retry-After header when limit is exceeded
- Adds rate limit configuration to wrangler.toml

Closes #47
~~~
</code></pre>
<p>Save that file. The skill is built.</p>
<h2 id="heading-how-to-install-and-test-your-skill">How to Install and Test Your Skill</h2>
<h3 id="heading-verify-the-file-exists">Verify the File Exists</h3>
<pre><code class="language-bash">cat ~/.claude/skills/commit-message-writer/SKILL.md
</code></pre>
<p>You should see the full SKILL.md content. If you get an error, check the directory path.</p>
<h3 id="heading-test-the-skill">Test the Skill</h3>
<p>Open Claude Code in any git repository that has staged changes.
Type:</p>
<pre><code class="language-plaintext">/commit-message-writer
</code></pre>
<p>The agent will read your staged diff and produce a commit message following the format you defined.</p>
<p>You can also trigger it naturally:</p>
<pre><code class="language-plaintext">write a commit message for my staged changes
</code></pre>
<pre><code class="language-plaintext">what should my commit say
</code></pre>
<pre><code class="language-plaintext">summarize my diff for git
</code></pre>
<p>All three should load the skill and produce a structured commit message. If the skill doesn't trigger on natural language requests, the description needs more trigger phrases — see the improvement section below.</p>
<h3 id="heading-test-edge-cases">Test Edge Cases</h3>
<p>Test these cases before relying on the skill in production:</p>
<pre><code class="language-bash"># Unstage everything, then ask for a commit message
git reset
# In Claude Code: "write a commit message"
# Expected: skill tells you nothing is staged and suggests git add
</code></pre>
<pre><code class="language-bash"># Stage changes across unrelated files
git add src/api.ts src/styles.css README.md
# In Claude Code: "write a commit message"
# Expected: skill flags that commits may be better split
</code></pre>
<h2 id="heading-how-to-improve-your-skill-over-time">How to Improve Your Skill Over Time</h2>
<p>The first version of any skill is a draft. You improve it by observing where it produces inconsistent or wrong output, then updating the instructions.</p>
<h3 id="heading-when-the-skill-undertriggers">When the Skill Undertriggers</h3>
<p>If you type "summarize my changes for git" and the skill doesn't load, add that phrase to the description's trigger list:</p>
<pre><code class="language-yaml">description: ... Triggers on "write a commit message", "commit my changes", "summarize my staged diff", "summarize my changes for git", ...
</code></pre>
<p>The description is your primary lever for fixing triggering problems.</p>
<h3 id="heading-when-the-output-format-drifts">When the Output Format Drifts</h3>
<p>If the agent starts producing commit messages that don't match your format — wrong type, missing scope, body in the wrong style — the instructions need to be more explicit. Add a concrete example that shows the failure and the correct output:</p>
<pre><code class="language-markdown">## Common mistakes to avoid

Wrong: "Updated the authentication flow"
Right: "refactor(auth): simplify token validation logic"

Wrong: "Fixed bugs"
Right: "fix(api): handle null response from upstream service"
</code></pre>
<p>Concrete counterexamples are more effective than abstract rules.</p>
<h3 id="heading-when-the-scope-grows">When the Scope Grows</h3>
<p>If you find yourself wanting the skill to handle related tasks — reviewing commit messages, generating changelogs, writing PR descriptions — resist the urge to add everything to one skill. Build separate skills. Each skill should do one thing well. The Agent Skills standard is designed for composition, not for monolithic instructions.</p>
<h2 id="heading-where-to-go-next">Where to Go Next</h2>
<p>The commit-message-writer covers the core pattern. The same structure works for any repeatable workflow.</p>
<p><strong>Pull request descriptions</strong> follow the same shape — read the diff, apply a structure, produce consistent output. The trigger phrases are different ("write a PR description", "summarize my branch for review") and the output format adds sections for motivation and testing, but the SKILL.md structure is identical.</p>
<p><strong>Code review checklists</strong> work well as skills when your team has a standard review process.
The trigger is "review this code" or "check this PR", and the instructions encode whatever your team actually checks — security concerns, test coverage, naming conventions.</p> <p>The commit-message-writer is the simplest skill architecture — instructions only. As your skills grow more specialized, two other patterns become useful.</p> <p>The first adds a <code>references/</code> directory: the voice-humanizer skill loads a CORPUS.md file containing the author's published writing, which the agent reads when it needs to check output against a specific style. The second adds quality rules and structured output formats that make results stricter and more consistent — that's the pattern spec-writer uses to surface assumptions inline. Each is the same SKILL.md structure at a different level of complexity.</p> <p>Start with instructions only. Add references when the agent needs external context. Add output format rules when consistency matters more than flexibility.</p> <p>The Agent Skills standard is supported in Claude Code, GitHub Copilot in VS Code, Cursor, and Gemini CLI. A skill you build once installs across all of them. The install path differs by agent:</p> <table> <thead> <tr> <th>Agent</th> <th>Skills directory</th> </tr> </thead> <tbody><tr> <td>Claude Code</td> <td><code>~/.claude/skills/</code></td> </tr> <tr> <td>GitHub Copilot</td> <td><code>~/.copilot/skills/</code> or <code>.github/skills/</code></td> </tr> <tr> <td>Cursor</td> <td><code>~/.cursor/skills/</code></td> </tr> <tr> <td>Gemini CLI</td> <td><code>~/.gemini/skills/</code></td> </tr> </tbody></table> <p>The SKILL.md format is the same across all of them.</p> <p>The commit-message-writer you just built is a working skill. The next one will take less time. 
By the third, you will start seeing workflows you repeat and immediately think: that should be a skill.</p>
<p>That's the point.</p>
</article> <article> <h1> How to Stop Letting AI Agents Guess Your Requirements </h1>
<p>Daniel Nwaneri — Tue, 24 Mar 2026 00:35:37 +0000</p>
<p>I spent 64% of my weekly Claude budget before Wednesday building a tool designed to reduce Claude usage. That's the kind of irony that deserves its own specification.</p>
<p>The tool is spec-writer: a Claude Code skill that takes a vague feature request and generates a structured spec, technical plan, and task breakdown before a single line of code gets written.</p>
<p>The problem it solves is one most developers hit within their first week of using AI coding agents seriously: the agent writes confidently in the wrong direction and you pay for it twice, once in tokens, once in rewrites.</p>
<p>This tutorial shows you how to install spec-writer, how to invoke it on a real feature, and how to read the output so you can catch the assumptions that would have wasted your time.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol> <li><p><a href="#heading-the-problem-with-prompting-agents-directly">The Problem with Prompting Agents Directly</a></p> </li> <li><p><a href="#heading-what-spec-driven-development-is">What Spec-Driven Development Is</a></p> </li> <li><p><a href="#heading-how-spec-writer-works">How spec-writer Works</a></p> </li> <li><p><a href="#heading-how-to-install-spec-writer">How to Install spec-writer</a></p> </li> <li><p><a href="#heading-how-to-write-your-first-spec">How to Write Your First Spec</a></p> </li> <li><p><a href="#heading-how-to-read-the-output">How to Read the Output</a></p> </li> <li><p><a href="#heading-how-to-hand-the-spec-to-your-agent">How to Hand the Spec to Your Agent</a></p> </li> <li><p><a href="#heading-where-to-go-next">Where to Go Next</a></p> </li> </ol>
<h2 id="heading-the-problem-with-prompting-agents-directly">The Problem with Prompting 
Agents Directly</h2> <p>Here is what happens when you skip the spec.</p> <p>You have a feature in your head: "Add a way for users to export their data." You open Claude Code and describe it. The agent produces code. It looks right. You run it. It's mostly right – except it exports everything including soft-deleted records, it doesn't paginate, it times out on large accounts, and it has no authentication check on the export endpoint.</p> <p>None of those things were in your prompt. The agent guessed, and it guessed plausibly – which is worse than guessing obviously wrong. You didn't notice until testing.</p> <p>This is the fundamental problem with prompting agents directly on anything non-trivial: your prompt carries your conscious requirements, but every feature has a shadow of requirements you didn't think to state. And the agent fills that shadow with assumptions.</p> <p>Most of the time, those assumptions are reasonable. Some of the time, they're wrong in ways that take hours to unravel.</p> <p>The failure mode isn't hallucination. It's the agent being exactly as helpful as the prompt allowed, which wasn't helpful enough.</p> <p>Spec-Driven Development addresses this directly. The methodology – documented extensively by practitioners like Julián Deangelis – argues that a written spec isn't documentation overhead. It's the mechanism that forces you to make decisions before the agent does.</p> <h2 id="heading-what-spec-driven-development-is">What Spec-Driven Development Is</h2> <p>Spec-Driven Development is the practice of writing a structured specification before you write code or prompt an agent. The spec defines what the feature must do, what assumptions are being made, and what tasks the implementation breaks into.</p> <p>The key insight is what a spec is <em>for</em>. A spec is not trying to replace code. It's trying to surface the decisions that would otherwise be invisible. The agent will make those decisions either way: with a spec, you make them first. 
Without a spec, you discover them during testing.</p> <p>The strongest counterargument to SDD comes from Gabriella Gonzalez: <em>a sufficiently detailed spec is just code</em>. She's right that some specs devolve into pseudocode so specific they might as well be implementations.</p> <p>But that's a spec written at the wrong level of abstraction. The goal is to name the decisions, not to pre-implement them. "Only authenticated users can trigger this export" is a decision. "Call <code>verifyJWT(token)</code> and return 401 if it fails" is implementation. The spec needs the first. The agent handles the second.</p> <p>SDD has three levels:</p> <ol> <li><p><strong>Spec-First</strong>: write a spec before every feature and hand it to the agent as context. This is the entry point and the workflow this tutorial focuses on.</p> </li> <li><p><strong>Spec-Anchored</strong>: the spec lives in the repository and evolves alongside the code. When requirements change, you update the spec and re-prompt the agent to realign.</p> </li> <li><p><strong>Spec-as-Source</strong>: the spec is the primary artifact. Code is generated from it and considered disposable. This is the most ambitious level and the direction many teams are moving toward.</p> </li> </ol> <p>spec-writer gets you to Spec-First immediately, with no ceremony.</p> <h2 id="heading-how-spec-writer-works">How spec-writer Works</h2> <p>spec-writer is a Claude Code skill – a markdown file that loads into the agent's context and changes how it responds when invoked.</p> <p>The skill follows one rule: generate first, flag assumptions inline. Instead of asking you clarifying questions before producing output, it generates the full spec immediately and marks every decision it made without your explicit input using <code>[ASSUMPTION: ...]</code> tags. 
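</p>
<p>Because the tags follow a fixed textual pattern, they are easy to pull out mechanically for review. The sketch below is not part of spec-writer. It assumes each tag appears exactly as <code>[ASSUMPTION: ...]</code> on a single line with no nested brackets, and <code>extract_assumptions</code> and the sample text are hypothetical names used only for illustration.</p>
<pre><code class="language-python">import re

# Matches "[ASSUMPTION: some text]" and captures the text after the colon
ASSUMPTION_RE = re.compile(r"\[ASSUMPTION:\s*(.+?)\]")


def extract_assumptions(spec_text):
    """Return the text of every [ASSUMPTION: ...] tag, in order of appearance."""
    return [match.strip() for match in ASSUMPTION_RE.findall(spec_text)]


sample = (
    "Export includes all user records [ASSUMPTION: soft-deleted rows are excluded].\n"
    "Endpoint requires login [ASSUMPTION: existing session auth is reused]."
)

for number, text in enumerate(extract_assumptions(sample), start=1):
    print(f"{number}. {text}")
</code></pre>
<p>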
Then you correct what's wrong.</p> <p>This is faster than Q&A because it makes the decisions visible in a form you can react to rather than anticipate.</p> <p>The output has three sections in fixed order:</p> <ol> <li><p><strong>SPEC</strong>: the what. One-line purpose, user stories, requirements, edge cases, and acceptance criteria in Given/When/Then format.</p> </li> <li><p><strong>PLAN</strong>: the how. Stack and architecture decisions, data model changes, API contracts, testing strategy, and security constraints.</p> </li> <li><p><strong>TASKS</strong>: the breakdown. Ordered, self-contained tasks each completable in a single agent session, each with its own acceptance criteria.</p> </li> </ol> <p>After the three sections, the skill produces an <strong>Assumptions summary</strong>: every <code>[ASSUMPTION: ...]</code> from the output, ranked by impact. This is the part you review before handing anything to the agent.</p> <p>The skill is compatible with <a href="https://github.com/github/spec-kit">GitHub Spec Kit</a> and <a href="https://github.com/Fission-AI/OpenSpec">OpenSpec</a>. If you use either framework, save the spec output to your <code>.specify/</code> or <code>openspec/changes/</code> directory and continue from there.</p> <h2 id="heading-how-to-install-spec-writer">How to Install spec-writer</h2> <p>spec-writer uses the Agent Skills standard, which means the same SKILL.md file works across Claude Code, Cursor, GitHub Copilot, Gemini CLI, and any other agent that supports the standard. 
You install it once and it works everywhere.</p> <h3 id="heading-installation">Installation</h3> <p>Create the skills directory if it doesn't exist and clone the repo:</p> <pre><code class="language-bash">mkdir -p ~/.claude/skills
git clone https://github.com/dannwaneri/spec-writer.git ~/.claude/skills/spec-writer
</code></pre> <p>On Windows PowerShell:</p> <p><strong>Note:</strong> PowerShell uses backtick (<code>`</code>) for line continuation, not backslash.</p> <pre><code class="language-powershell">New-Item -ItemType Directory -Force -Path "$HOME\.claude\skills"
git clone https://github.com/dannwaneri/spec-writer.git "$HOME\.claude\skills\spec-writer"
</code></pre> <p>That's the entire installation. No package to install, no configuration file to edit, no API key. The skill is a markdown file. The agent reads it.</p> <h3 id="heading-verification">Verification</h3> <p>Open Claude Code and type:</p> <pre><code class="language-plaintext">/spec-writer test
</code></pre> <p>If the skill is installed correctly, the agent will read the SKILL.md and produce a spec structure – even for "test" as input. You'll see the three sections and the Assumptions summary. If nothing happens, confirm that the <code>~/.claude/skills/spec-writer/SKILL.md</code> file exists.</p> <h2 id="heading-how-to-write-your-first-spec">How to Write Your First Spec</h2> <p>The invocation pattern is:</p> <pre><code class="language-plaintext">/spec-writer [your feature description]
</code></pre> <p>The description can be as rough as you'd write in a Slack message. The skill's job is to structure it. You don't need to be precise. Being precise is what the spec is for.</p> <h3 id="heading-a-real-example">A Real Example</h3> <p>Here is a real invocation from the Foundation project, a federated knowledge base built on Cloudflare Workers:</p> <pre><code class="language-plaintext">/spec-writer Add CLI capture for Claude Code sessions stored in ~/.claude/projects/
</code></pre> <p>That's twelve words. 
No mention of the tech stack, no data model details, no authentication requirements, no error handling spec. All of that comes out in the spec.</p> <p>Here is the abbreviated output:</p> <p><strong>SPEC</strong></p> <p><strong>One-line purpose:</strong> Developers can capture Claude Code session logs from their local filesystem into Foundation before they are deleted after 30 days.</p> <p><strong>Users and use cases:</strong></p> <ul> <li><p>As a developer using Claude Code daily, I want my coding sessions captured automatically so that architectural decisions and debugging patterns don't disappear after 30 days.</p> </li> <li><p>As a Foundation user, I want CLI sessions and browser extension captures in the same knowledge base so that I have one place to query across all my AI interactions.</p> </li> </ul> <p><strong>Acceptance criteria:</strong></p> <pre><code class="language-plaintext">Given a developer with Claude Code sessions in ~/.claude/projects/
When they run the Foundation CLI capture command
Then a list of uncaptured sessions is displayed, sorted by most recent first

Given a session already in Foundation
When the developer runs capture again
Then that session does not appear in the review list

Given a malformed .jsonl file
When the capture command processes it
Then an error is logged for that file and processing continues on remaining sessions
</code></pre> <p><strong>PLAN</strong></p> <p><strong>Stack and architecture:</strong></p> <pre><code class="language-plaintext">CLI capture runs locally — it needs to read the local filesystem, which Cloudflare Workers cannot do.
[ASSUMPTION: CLI capture is a local Node.js or Bun script that calls the Foundation API to insert sessions, rather than a Worker itself]
</code></pre> <p><strong>API contracts:</strong></p> <ul> <li><p><code>POST /api/sessions/import</code> – accepts parsed session JSON, returns session ID</p> </li> <li><p>Returns 409 if session already exists</p> </li> <li><p>Triggers evaluator pipeline after successful insert</p> </li> </ul> <p><strong>TASKS</strong></p> <pre><code class="language-plaintext">## Task 1: .jsonl parser

**What to build:** A parser that reads a Claude Code .jsonl session file and outputs Foundation's internal message format.

**Files likely affected:** cli/src/parsers/claudeCode.js (create)

**Acceptance criteria:**
1. Parses a valid .jsonl file into the correct message format
2. Handles malformed lines without crashing — logs and skips
3. Returns empty array for empty file

**Dependencies:** none
</code></pre> <p><strong>Assumptions to review:</strong></p> <pre><code class="language-plaintext">1. CLI capture is a local script calling the Foundation API, not a Worker
   Impact: HIGH
   Correct this if: you want a purely serverless approach

2. Manual curation before capture, not automatic bulk import
   Impact: HIGH
   Correct this if: you want automatic background capture

3. Session ID from .jsonl filename is the deduplication key
   Impact: MEDIUM
   Correct this if: session IDs are stored differently in your schema

4. No sensitive data scrubbing in v1
   Impact: MEDIUM
   Correct this if: your sessions contain credentials or keys
</code></pre> <p>Twelve words in, four decisions surfaced immediately – three of which had real architectural implications.</p> <p>The third assumption ("Session ID from .jsonl filename is the deduplication key") is the one that would have caused the most subtle bug. The agent would have implemented deduplication based on the filename and it would have worked until a session was renamed. 
The spec caught it before a line of code was written.</p> <h2 id="heading-how-to-read-the-output">How to Read the Output</h2> <p>The output is designed to be scanned for <code>[ASSUMPTION: ...]</code> tags first, read for the tasks second.</p> <h3 id="heading-reading-the-assumptions">Reading the Assumptions</h3> <p>Every <code>[ASSUMPTION: ...]</code> tag marks a place where the agent filled in something you didn't specify. Your job is to go through the Assumptions summary and decide for each one:</p> <ul> <li><p><strong>Correct</strong>: the assumption is right, leave it</p> </li> <li><p><strong>Override</strong>: the assumption is wrong, restate it and re-run the spec</p> </li> <li><p><strong>Defer</strong>: the assumption doesn't matter for this iteration, mark it and move on</p> </li> </ul> <p>The impact rating tells you which assumptions to fix before you start coding. HIGH-impact assumptions affect architecture or data model. If they're wrong, fixing them requires rework. LOW-impact assumptions affect behavior details that are easy to change later.</p> <h3 id="heading-reading-the-acceptance-criteria">Reading the Acceptance Criteria</h3> <p>The acceptance criteria in Given/When/Then format are the most useful part of the spec for catching scope errors. Read each one and ask: is this actually what I want?</p> <p>Criteria are binary by design. "Returns 401 when unauthenticated" is a criterion. "Works correctly" is not. If you find yourself reading a criterion and thinking "well, it depends", then that's a signal that the criterion is hiding an assumption. Restate it.</p> <h3 id="heading-reading-the-tasks">Reading the Tasks</h3> <p>The tasks are ordered and self-contained. Each task produces a verifiable change. Before you hand any task to an agent, check two things:</p> <ol> <li><p>Does the task have all the context it needs? 
If a task says "follow the existing auth pattern" and you haven't pointed the agent at your auth code, it will guess.</p> </li> <li><p>Do the acceptance criteria match what you'd actually test? If the criteria are vague, tighten them before the agent sees the task.</p> </li> </ol> <h2 id="heading-how-to-hand-the-spec-to-your-agent">How to Hand the Spec to Your Agent</h2> <p>The spec is context, not a prompt. When you start an agent session for a task, include the relevant spec sections alongside the task description.</p> <p>For Task 1 from the example above, your agent session might open like this:</p> <pre><code class="language-plaintext">Context:
- This is a federated knowledge base built on Cloudflare Workers, D1, and Vectorize
- Sessions are stored in ~/.claude/projects/ as .jsonl files
- The API runs at https://<your-worker>.workers.dev

Spec:
[paste the SPEC and PLAN sections]

Task:
[paste Task 1]
</code></pre> <p>The context block is just an example. Replace it with your own project's tech stack, file locations, and API URL. The point is to give the agent the same context a new team member would need on day one.</p> <p>The agent now has requirements, architecture context, and a single scoped task with binary acceptance criteria. It cannot guess the deduplication key incorrectly because the spec already resolved that assumption. It cannot skip error handling because the acceptance criteria explicitly require it.</p> <p>This is the workflow the spec is designed for. The spec doesn't replace the agent. 
Rather, it removes the decisions from the agent's hands and puts them in yours, before the work starts.</p> <h3 id="heading-saving-the-spec-for-later">Saving the Spec for Later</h3> <p>If you want to move toward Spec-Anchored development – where the spec lives in the repository – save the output to a <code>specs/</code> directory in your project:</p> <pre><code class="language-bash"># Create specs directory
mkdir -p specs

# Save your spec
# Paste the output into specs/cli-capture.md
</code></pre> <p>When requirements change, update the spec and re-prompt the agent to realign the implementation. The spec becomes the source of truth, not the code comments.</p> <h2 id="heading-where-to-go-next">Where to Go Next</h2> <p>Try it on your next feature before you write a line of code. The assumptions it flags will tell you something about your feature you hadn't consciously decided yet – and correcting the HIGH-impact ones before you hand anything to an agent is the whole point. Skipping that step is the same as prompting directly.</p> <p>If your project is growing, move toward Spec-Anchored. Save specs in your repository under <code>specs/</code>. When a new contributor joins or an agent starts a session cold, the specs give them the decisions that got made without requiring them to reverse-engineer the code.</p> <p>The strongest ongoing challenge to this workflow is Gabriella Gonzalez's argument that detailed specs become code. If your specs are getting implementation-specific, you've crossed a line. Pull back to decisions – "only authenticated users can trigger this" – and leave implementation to the agent. The spec's job is to name what the agent would have guessed wrong, not to write the feature in prose.</p> <p>The Agent Skills standard now works across Claude Code, GitHub Copilot, Cursor, and Gemini CLI. 
The spec-writer repo is at <a href="https://github.com/dannwaneri/spec-writer">github.com/dannwaneri/spec-writer</a>.</p> <p>The irony of spending 64% of a Claude budget building a token-efficiency tool is real. But the spec surfaced four decisions on a twelve-word prompt. The third one – the deduplication key assumption – would have produced a bug that worked perfectly until a session got renamed.</p> <p>That's not a hallucination. That's the agent being exactly as helpful as the prompt allowed.</p> <p>The spec is how you raise the ceiling on what "helpful" means.</p> </article> <article> <h1> How to Build a Production RAG System with Cloudflare Workers – a Handbook for Devs </h1> <p>Daniel Nwaneri — Wed, 18 Mar 2026 23:05:13 +0000</p> <p>Most RAG tutorials show you a working demo and call it done. You copy the code, it runs locally, and then you try to put it in production and everything falls apart.</p> <p>This tutorial is different. I run a production RAG system (<a href="https://github.com/dannwaneri/vectorize-mcp-worker">vectorize-mcp-worker</a>) that handles real traffic at a total cost of $5/month. The alternatives I evaluated ranged from $100–$200/month. The difference isn't magic. It's architecture.</p> <p>Here, you'll build <code>rag-tutorial-simple</code>: a clean, minimal RAG chatbot deployed on Cloudflare Workers. No external API keys. No paid vector database subscriptions. No servers to manage. 
Just Cloudflare's free tier – Workers, Vectorize, and Workers AI – doing the heavy lifting at the edge.</p> <h2 id="heading-table-of-contents">Table of Contents</h2> <ol> <li><p><a href="#heading-what-you-will-build">What You Will Build</a></p> </li> <li><p><a href="#heading-prerequisites">Prerequisites</a></p> </li> <li><p><a href="#heading-how-rag-works">How RAG Works</a></p> </li> <li><p><a href="#heading-how-to-set-up-your-project">How to Set Up Your Project</a></p> </li> <li><p><a href="#heading-how-to-build-the-data-pipeline">How to Build the Data Pipeline</a></p> </li> <li><p><a href="#heading-how-to-build-the-query-pipeline">How to Build the Query Pipeline</a></p> </li> <li><p><a href="#heading-how-to-add-error-handling-and-security">How to Add Error Handling and Security</a></p> </li> <li><p><a href="#heading-performance-and-cost-analysis">Performance and Cost Analysis</a></p> </li> <li><p><a href="#heading-conclusion">Conclusion</a></p> </li> </ol> <h2 id="heading-what-you-will-build">What You Will Build</h2> <p>By the end of this tutorial, you'll have a globally deployed RAG API that:</p> <ul> <li><p>Accepts a natural language question via HTTP</p> </li> <li><p>Converts it to a vector embedding using Workers AI</p> </li> <li><p>Searches a knowledge base stored in Cloudflare Vectorize</p> </li> <li><p>Passes the retrieved context to an LLM (also on Workers AI) to generate an answer</p> </li> <li><p>Returns a grounded, accurate response (not a hallucination)</p> </li> </ul> <p>The complete source code is available at <a href="https://github.com/dannwaneri/rag-tutorial-simple">github.com/dannwaneri/rag-tutorial-simple</a>.</p> <h2 id="heading-prerequisites">Prerequisites</h2> <p>This is an intermediate-level tutorial. 
You should be comfortable with:</p> <ul> <li><p><strong>JavaScript/TypeScript</strong>: async/await, promises, basic types</p> </li> <li><p><strong>HTTP APIs</strong>: REST, request/response, JSON</p> </li> <li><p><strong>Command line basics</strong>: running npm commands, navigating directories</p> </li> </ul> <p>You will need:</p> <ul> <li><p><strong>Node.js 18 or higher</strong>: check with <code>node --version</code></p> </li> <li><p><strong>A Cloudflare account</strong>: free tier is fine, sign up at <a href="https://dash.cloudflare.com/sign-up">cloudflare.com</a></p> </li> <li><p><strong>A code editor</strong>: VS Code recommended for TypeScript support</p> </li> </ul> <p>That's it. No OpenAI key. No credit card for embeddings. Let's build.</p> <h2 id="heading-how-rag-works">How RAG Works</h2> <p>Before you write any code, you'll need a clear mental model of what you're building. This section explains the three core components of a RAG system, how data flows between them, and why this architecture works at scale.</p> <h3 id="heading-the-mental-model">The Mental Model</h3> <p>Think of a traditional LLM like a doctor who studied medicine for years but has been in a remote cabin with no internet since their graduation day. They are brilliant, but they only know what they knew when they left. Ask them about a drug approved last year and they'll either say they don't know or – worse – confidently give you wrong information.</p> <p>RAG gives that doctor access to an up-to-date medical library. Before answering your question, they can look up the relevant pages, read them, and use that information to give you an accurate answer. 
Their training still matters (that is, they know how to read and interpret the information), but they're no longer limited to what they memorized years ago.</p> <p>In technical terms, RAG works in three steps on every request:</p> <ol> <li><p><strong>Retrieve</strong>: find the most relevant documents from your knowledge base</p> </li> <li><p><strong>Augment</strong>: add those documents to the LLM prompt as context</p> </li> <li><p><strong>Generate</strong>: let the LLM produce an answer using both its training and the retrieved context</p> </li> </ol> <h3 id="heading-the-three-components">The Three Components</h3> <p>Every RAG system has three moving parts. Understanding each one will help you debug problems and make better architectural decisions as you build.</p> <h4 id="heading-the-embedding-model">The Embedding Model</h4> <p>An embedding model converts text into a vector – an array of numbers that represents the meaning of that text. The model you will use in this tutorial, <code>@cf/baai/bge-base-en-v1.5</code>, outputs 768 numbers for any piece of text you give it.</p> <p>The critical property of embeddings is that semantically similar text produces numerically similar vectors. "How do I install Node.js?" and "What's the process for setting up Node?" will produce vectors that are close together. "How do I install Node.js?" and "What is the capital of France?" will produce vectors that are far apart.</p> <p>This is what makes semantic search possible. You aren't matching keywords, you're matching meaning.</p> <p>One rule you must never break: your documents and your queries must be embedded with the same model. If you embed your documents with <code>bge-base-en-v1.5</code> and your queries with a different model, the vectors won't be comparable and your searches will return garbage.</p> <h4 id="heading-the-vector-database">The Vector Database</h4> <p>The vector database stores your embeddings and lets you search them by similarity. 
In this tutorial, you'll use Cloudflare Vectorize.</p> <p>When you run a similarity search, you pass in a query vector and Vectorize returns the K most similar vectors it has stored, along with their metadata and similarity scores. This is called approximate nearest neighbor search, and Vectorize is optimized to do it fast even across millions of vectors.</p> <p>The key advantage of using Vectorize over an external vector database like Pinecone is co-location. Vectorize runs in the same Cloudflare network as your Worker. There's no external API call, no authentication roundtrip, and no network latency between your application and your database.</p> <h4 id="heading-the-language-model">The Language Model</h4> <p>The LLM is responsible for one thing: reading the retrieved context and generating a natural language answer. It doesn't search anything. It doesn't decide what's relevant. It just reads what you give it and writes a response.</p> <p>This separation of concerns is intentional. The LLM is good at language: understanding questions, synthesizing information, writing clearly. The vector database is good at retrieval: finding relevant documents fast. RAG combines their strengths without asking either component to do something it is not designed for.</p> <p>In this tutorial you'll use <code>@cf/meta/llama-3.3-70b-instruct-fp8-fast</code> through Workers AI. No API key required.</p> <h3 id="heading-a-note-on-visual-embeddings">A Note on Visual Embeddings</h3> <p>If you plan to extend this system to search images, you may be tempted to use a vision-language model like CLIP to generate visual embeddings (vectors that represent the image itself rather than a text description of it). This sounds clever but works worse for RAG in practice.</p> <p>Visual embeddings match pixel similarity. They are good for "find images that look like this one." 
They are poor for "find the login screen" or "find dashboards showing error rates" because those queries are about meaning, not pixels.</p> <p>The better approach – used in production – is to pass the image through a multimodal model like Llama 4 Scout, which generates a detailed text description and extracts visible text via OCR. You then embed that description using the same BGE model as your other documents.</p> <p>The result lives in one unified index, works with your existing query pipeline, and produces better search results than visual embeddings for RAG use cases.</p> <p>Cloudflare Workers AI does not support CLIP anyway. But even if it did, descriptions would outperform it for semantic search.</p> <h3 id="heading-how-a-query-flows-through-the-system">How a Query Flows Through the System</h3> <p>Here is exactly what happens when a user sends the question "What is RAG?" to your finished Worker:</p> <ol> <li><p><strong>Step 1 – Embed the question (20-30ms)</strong>: Your Worker calls Workers AI with the question text. The embedding model returns a 768-dimensional vector representing the meaning of the question.</p> </li> <li><p><strong>Step 2 – Search Vectorize (30-50ms)</strong>: Your Worker passes that vector to Vectorize, which searches your knowledge base and returns the 3 most similar documents with their similarity scores.</p> </li> <li><p><strong>Step 3 – Filter and build context (< 1ms)</strong>: Documents with a similarity score below 0.5 are discarded. The remaining document texts are joined into a context string.</p> </li> <li><p><strong>Step 4 – Generate the answer (500-1500ms)</strong>: Your Worker sends the context and the question to the LLM. The LLM reads the context and generates a grounded answer.</p> </li> <li><p><strong>Step 5 – Return to the user</strong>: The answer and source metadata are returned as JSON.</p> </li> </ol> <p>Total time: typically 600-1600ms end to end. The LLM generation step dominates. 
Everything else is fast.</p> <h3 id="heading-why-this-works-at-scale">Why This Works at Scale</h3> <p>A common objection to Cloudflare RAG is that it cannot meet sub-200ms retrieval requirements. That objection comes from a specific architectural mistake: trying to run the entire RAG pipeline, including heavy embedding generation and reranking, inside a single synchronous request. That's the wrong architecture.</p> <p>The architecture you're building in this tutorial separates the loading step (which is slow and runs once) from the query step (which is fast and runs on every request). By the time a user asks a question, your documents are already embedded and stored. The query pipeline only needs to embed the question, run one vector search, and call the LLM. The first two steps, which make up retrieval, take roughly 50-80ms combined, well inside a 200ms budget. The LLM call accounts for most of the remaining response time.</p> <p>My production system (<a href="https://github.com/dannwaneri/vectorize-mcp-worker">vectorize-mcp-worker</a>) runs this architecture and handles real traffic at $5/month. The <a href="https://dev.to/dannwaneri/i-built-a-production-rag-system-for-5month-most-alternatives-cost-100-200-21hj">full performance breakdown is here</a>. Cloudflare RAG works. 
You just have to build it correctly.</p> <h2 id="heading-how-to-set-up-your-project">How to Set Up Your Project</h2> <p>In this section, you'll scaffold a Cloudflare Worker, create a Vectorize index to store your embeddings, and configure the bindings that connect them together.</p> <h3 id="heading-how-to-create-the-project">How to Create the Project</h3> <p>Open your terminal and create a new directory for the project.</p> <p>On Mac/Linux:</p> <pre><code class="language-bash">mkdir rag-tutorial-simple && cd rag-tutorial-simple
</code></pre> <p>On Windows PowerShell:</p> <pre><code class="language-powershell">mkdir rag-tutorial-simple
cd rag-tutorial-simple
</code></pre> <p>Then run the Cloudflare scaffolding tool:</p> <pre><code class="language-bash">npm create cloudflare@latest
</code></pre> <p>Answer the prompts like this:</p> <ul> <li><p><strong>Directory/app name</strong>: <code>rag-tutorial-simple</code></p> </li> <li><p><strong>What would you like to start with?</strong> Hello World example</p> </li> <li><p><strong>TypeScript?</strong> Yes</p> </li> <li><p><strong>Deploy?</strong> No</p> </li> </ul> <p>When it finishes, you'll have a working TypeScript Worker with Wrangler already configured.</p> <h3 id="heading-how-to-create-the-vectorize-index">How to Create the Vectorize Index</h3> <p>Vectorize is Cloudflare's vector database. It lives in the same network as your Worker, which means no external API call and no added latency when you search it.</p> <pre><code class="language-bash">npx wrangler vectorize create rag-tutorial-index --dimensions=768 --metric=cosine
</code></pre> <p>Two things to note here.</p> <p><code>--dimensions=768</code> tells Vectorize how many numbers make up each embedding. This must match the output of the embedding model you use. The model you will use (<code>@cf/baai/bge-base-en-v1.5</code>) outputs 768 dimensions. 
If this number doesn't match, your searches will fail.</p> <p><code>--metric=cosine</code> is how Vectorize measures similarity between vectors. Cosine similarity measures the angle between two vectors rather than the distance between them. For text embeddings, this captures semantic meaning more accurately than other metrics.</p> <h3 id="heading-how-to-configure-wranglertoml">How to Configure wrangler.toml</h3> <p>Open <code>wrangler.toml</code> and replace its contents with the following:</p> <pre><code class="language-toml">name = "rag-tutorial-simple"
main = "src/index.ts"
compatibility_date = "2026-02-25"

[[vectorize]]
binding = "VECTORIZE"
index_name = "rag-tutorial-index"

[ai]
binding = "AI"
</code></pre> <p>The <code>[[vectorize]]</code> block connects your Worker to the index you just created. The <code>[ai]</code> block gives your Worker access to Workers AI – both for generating embeddings and for running the language model that produces answers.</p> <p>Notice that there are no API keys anywhere. Cloudflare handles authentication internally because everything – your Worker, Vectorize, and Workers AI – runs under the same account.</p> <h3 id="heading-how-to-update-srcindexts">How to Update src/index.ts</h3> <p>Open <code>src/index.ts</code> and replace the generated code with this:</p> <pre><code class="language-typescript">export interface Env {
  VECTORIZE: VectorizeIndex;
  AI: Ai;
  LOAD_SECRET: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    return new Response("RAG tutorial worker is running", { status: 200 });
  },
};
</code></pre> <p>The <code>Env</code> interface tells TypeScript what bindings are available inside your Worker. 
<code>VectorizeIndex</code> and <code>Ai</code> are types provided by Cloudflare's type definitions. <code>LOAD_SECRET</code> is a secret you'll set later to protect the load endpoint.</p> <h3 id="heading-how-to-verify-your-setup">How to Verify Your Setup</h3> <p>Start the local development server:</p> <pre><code class="language-bash">npx wrangler dev
</code></pre> <p>Open your browser and visit <code>http://localhost:8787</code>. You should see:</p> <pre><code class="language-plaintext">RAG tutorial worker is running
</code></pre> <p>You will see two warnings in your terminal. Both are expected.</p> <p>The first warning says that Vectorize doesn't support local mode. This means Vectorize queries won't work during local development unless you run with the <code>--remote</code> flag. You'll do this later when testing the full pipeline.</p> <p>The second warning says the AI binding always accesses remote resources. This means that embedding generation and LLM calls always hit Cloudflare's servers, even in local development. This is fine: usage within the free tier limits costs nothing.</p> <p>Your project structure is shown below. The <code>scripts/</code> folder does not exist yet; you'll create it in the next section:</p> <pre><code class="language-plaintext">rag-tutorial-simple/
├── scripts/
│   └── knowledge-base.ts
├── src/
│   └── index.ts
├── wrangler.toml
├── package.json
└── tsconfig.json
</code></pre> <h2 id="heading-how-to-build-the-data-pipeline">How to Build the Data Pipeline</h2> <p>The data pipeline is responsible for two things: generating embeddings for each document in your knowledge base, and storing those embeddings in Vectorize. You'll handle both steps inside the Worker itself using a <code>/load</code> endpoint.</p> <p>This approach has a key advantage: you don't need an API token, an Account ID, or any external tooling. 
Everything uses the bindings you already configured in <code>wrangler.toml</code>.</p> <h3 id="heading-how-to-create-the-knowledge-base">How to Create the Knowledge Base</h3> <p>Create a <code>scripts/</code> folder in your project and add a file called <code>knowledge-base.ts</code>:</p> <pre><code class="language-bash">mkdir scripts
</code></pre> <p>Add your documents to <code>scripts/knowledge-base.ts</code>:</p> <pre><code class="language-typescript">export const documents = [
  {
    id: "1",
    text: "Cloudflare Workers run JavaScript at the edge, in over 300 data centers worldwide. Requests are handled close to the user, reducing latency significantly compared to a single-region server.",
    metadata: { source: "cloudflare-docs", category: "workers" },
  },
  {
    id: "2",
    text: "Vectorize is Cloudflare's vector database. It stores embeddings and lets you search them by semantic similarity. It runs in the same network as your Worker, so there is no external API call needed.",
    metadata: { source: "cloudflare-docs", category: "vectorize" },
  },
  {
    id: "3",
    text: "Workers AI lets you run machine learning models directly on Cloudflare's infrastructure. You can generate embeddings and run LLM inference without leaving the Cloudflare network.",
    metadata: { source: "cloudflare-docs", category: "workers-ai" },
  },
  {
    id: "4",
    text: "RAG stands for Retrieval Augmented Generation. Instead of relying only on what the LLM was trained on, RAG retrieves relevant context from a knowledge base and adds it to the prompt before generating an answer.",
    metadata: { source: "ai-concepts", category: "rag" },
  },
  {
    id: "5",
    text: "An embedding is a numerical representation of text. Similar pieces of text produce similar embeddings. This is what makes semantic search possible — you search by meaning, not exact keywords.",
    metadata: { source: "ai-concepts", category: "embeddings" },
  },
  {
    id: "6",
    text: "The BGE model (bge-base-en-v1.5) is available through Workers AI. It generates 768-dimensional embeddings and works well for English semantic search tasks.",
    metadata: { source: "cloudflare-docs", category: "workers-ai" },
  },
  {
    id: "7",
    text: "Cosine similarity measures the angle between two vectors. For text embeddings, it captures semantic similarity regardless of text length, which makes it more reliable than Euclidean distance.",
    metadata: { source: "ai-concepts", category: "embeddings" },
  },
  {
    id: "8",
    text: "Cloudflare Workers have a free tier that includes 100,000 requests per day. Vectorize is available on both the Workers Free and Paid plans. The free tier lets you prototype and experiment. The Workers Paid plan starts at $5/month and includes higher usage allocations for production workloads.",
    metadata: { source: "cloudflare-docs", category: "pricing" },
  },
];
</code></pre> <p>Each document has three fields. The <code>id</code> is a unique string that Vectorize uses to identify the vector. The <code>text</code> is what gets converted into an embedding. The <code>metadata</code> is stored alongside the vector and returned in search results. You'll use it later to display the source of each answer.</p> <h3 id="heading-understanding-embeddings">Understanding Embeddings</h3> <p>Before writing the loading code, it helps to understand what you're actually generating.</p> <p>An embedding is an array of 768 numbers that represents the meaning of a piece of text. The model reads a sentence and outputs those 768 numbers in a way where similar sentences produce similar arrays of numbers.</p> <p>When a user asks a question, you convert that question into an embedding using the same model, then ask Vectorize to find the stored embeddings that are closest to it. 
The documents those embeddings came from are your most relevant context.</p> <p>This is why the model choice matters: your documents and your queries must be embedded with the same model, or the similarity scores will be meaningless.</p> <h3 id="heading-how-to-build-the-load-endpoint">How to Build the Load Endpoint</h3> <p>Open <code>src/index.ts</code> and update it with a <code>/load</code> route. Here is the complete file at this stage:</p> <pre><code class="language-typescript">import { documents } from "../scripts/knowledge-base";

export interface Env {
  VECTORIZE: VectorizeIndex;
  AI: Ai;
  LOAD_SECRET: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    if (url.pathname === "/load" && request.method === "POST") {
      return handleLoad(env, request);
    }

    return new Response("RAG tutorial worker is running", { status: 200 });
  },
};

async function handleLoad(env: Env, request: Request): Promise<Response> {
  // Reject callers that do not present the shared secret.
  const authHeader = request.headers.get("X-Load-Secret");
  if (authHeader !== env.LOAD_SECRET) {
    return Response.json({ error: "Unauthorized" }, { status: 401 });
  }

  const results: { id: string; status: string }[] = [];

  for (const doc of documents) {
    // Generate one embedding per document.
    const response = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: [doc.text],
    }) as { data: number[][] };

    // Store the vector, keeping the original text in metadata.
    await env.VECTORIZE.upsert([
      {
        id: doc.id,
        values: response.data[0],
        metadata: {
          ...doc.metadata,
          text: doc.text,
        },
      },
    ]);

    results.push({ id: doc.id, status: "loaded" });
  }

  return Response.json({ success: true, loaded: results });
}
</code></pre> <p>Notice that <code>env.AI.run()</code> and <code>env.VECTORIZE.upsert()</code> require no credentials. The bindings handle authentication because the Worker runs inside your Cloudflare account. There are no secrets to manage for internal service communication.</p> <p>The <code>text: doc.text</code> field inside <code>metadata</code> is important. 
Vectorize stores the vector values and whatever metadata you provide, but it doesn't store the original text separately. By including the text in metadata, you can retrieve and display it in search results later.</p>
<p>The <code>as { data: number[][] }</code> cast is necessary because the TypeScript type definitions for Workers AI do not yet reflect the exact return shape of every model. The actual response always contains a <code>data</code> array, and the cast tells TypeScript to trust that.</p>
<h3 id="heading-how-to-deploy-and-load-your-knowledge-base">How to Deploy and Load Your Knowledge Base</h3>
<p>First, set the secret that will protect your load endpoint:</p>
<pre><code class="language-bash">npx wrangler secret put LOAD_SECRET
</code></pre>
<p>Type a strong value when prompted. Then deploy:</p>
<pre><code class="language-bash">npx wrangler deploy
</code></pre>
<p>Trigger the load endpoint. You only need to do this once, or any time you update your knowledge base:</p>
<pre><code class="language-bash">curl -X POST https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/load \
  -H "X-Load-Secret: your-secret-value"
</code></pre>
<p>On Windows PowerShell:</p>
<p><strong>Note:</strong> PowerShell uses backtick (<code>`</code>) for line continuation, not backslash.</p>
<pre><code class="language-powershell">Invoke-WebRequest `
  -Uri "https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/load" `
  -Method POST `
  -Headers @{"X-Load-Secret"="your-secret-value"} `
  -UseBasicParsing
</code></pre>
<p>You should see:</p>
<pre><code class="language-json">{
  "success": true,
  "loaded": [
    { "id": "1", "status": "loaded" },
    { "id": "2", "status": "loaded" },
    { "id": "3", "status": "loaded" },
    { "id": "4", "status": "loaded" },
    { "id": "5", "status": "loaded" },
    { "id": "6", "status": "loaded" },
    { "id": "7", "status": "loaded" },
    { "id": "8", "status": "loaded" }
  ]
}
</code></pre>
<p>Your knowledge base is now stored in Vectorize as vectors. 
In the next section, you'll build the query pipeline that searches those vectors and generates answers.</p>
<h2 id="heading-how-to-build-the-query-pipeline">How to Build the Query Pipeline</h2>
<p>The query pipeline is the core of your RAG system. When a user sends a question, the pipeline runs four steps in sequence: embed the question, search Vectorize, build context from the results, and generate an answer with the LLM.</p>
<p>Add a <code>/query</code> route to your fetch handler and the complete <code>handleQuery</code> function. Here is the full updated <code>src/index.ts</code>:</p>
<pre><code class="language-typescript">import { documents } from "../scripts/knowledge-base";

export interface Env {
  VECTORIZE: VectorizeIndex;
  AI: Ai;
  LOAD_SECRET: string;
}

export default {
  async fetch(request: Request, env: Env): Promise&lt;Response&gt; {
    const url = new URL(request.url);

    if (url.pathname === "/load" &amp;&amp; request.method === "POST") {
      return handleLoad(env, request);
    }

    if (url.pathname === "/query" &amp;&amp; request.method === "POST") {
      return handleQuery(request, env);
    }

    return new Response("RAG tutorial worker is running", { status: 200 });
  },
};

async function handleLoad(env: Env, request: Request): Promise&lt;Response&gt; {
  const authHeader = request.headers.get("X-Load-Secret");
  if (authHeader !== env.LOAD_SECRET) {
    return Response.json({ error: "Unauthorized" }, { status: 401 });
  }

  const results: { id: string; status: string }[] = [];

  for (const doc of documents) {
    const response = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: [doc.text],
    }) as { data: number[][] };

    await env.VECTORIZE.upsert([
      {
        id: doc.id,
        values: response.data[0],
        metadata: {
          ...doc.metadata,
          text: doc.text,
        },
      },
    ]);

    results.push({ id: doc.id, status: "loaded" });
  }

  return Response.json({ success: true, loaded: results });
}

async function handleQuery(request: Request, env: Env): Promise&lt;Response&gt; {
  const body = await request.json() as { question: string };

  if (!body.question) {
    return Response.json({ error: "question is required" }, { status: 400 });
  }

  // Step 1: Embed the question using the same model as your documents
  const embeddingResponse = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [body.question],
  }) as { data: number[][] };

  // Step 2: Search Vectorize for the 3 most similar documents
  const searchResults = await env.VECTORIZE.query(
    embeddingResponse.data[0],
    {
      topK: 3,
      returnMetadata: "all",
    }
  );

  // Step 3: Build context from results above the similarity threshold
  const context = searchResults.matches
    .filter((match) => match.score > 0.5)
    .map((match) => match.metadata?.text as string)
    .filter(Boolean)
    .join("\n\n");

  if (!context) {
    return Response.json({
      answer: "I could not find relevant information to answer that question.",
      sources: [],
    });
  }

  // Step 4: Generate an answer using the retrieved context
  const aiResponse = await env.AI.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", {
    messages: [
      {
        role: "system",
        content: "You are a helpful assistant. Answer the question using only the context provided. If the context does not contain enough information, say so.",
      },
      {
        role: "user",
        content: `Context:\n${context}\n\nQuestion: ${body.question}`,
      },
    ],
    max_tokens: 256,
  }) as { response: string };

  // Step 5: Return the answer with its sources
  const sources = searchResults.matches
    .filter((match) => match.score > 0.5)
    .map((match) => match.metadata?.source as string)
    .filter(Boolean);

  return Response.json({
    answer: aiResponse.response,
    sources: [...new Set(sources)],
  });
}
</code></pre>
<p>What each step does:</p>
<ol>
<li><p><strong>Step 1 – Embed the question</strong>: You convert the user's question into a 768-dimensional vector using the same model you used when loading your documents. 
This is critical: the question and the documents must be embedded with the same model or the similarity scores will be meaningless.</p> </li>
<li><p><strong>Step 2 – Search Vectorize</strong>: You pass the question embedding to Vectorize, which returns the three most similar documents. <code>returnMetadata: "all"</code> tells Vectorize to include the metadata you stored alongside each vector, including the original text.</p> </li>
<li><p><strong>Step 3 – Build context</strong>: You filter out any results with a similarity score of 0.5 or lower and join the remaining document texts into a single context string. The 0.5 threshold prevents the LLM from receiving irrelevant documents just because nothing better matched.</p> </li>
<li><p><strong>Step 4 – Generate the answer</strong>: You pass the context and the question to the LLM using the chat format with <code>messages</code>. The system prompt explicitly instructs the model to answer using only the provided context. This is what keeps the LLM grounded. Without this instruction, the model may fall back on its training data instead of your context.</p> </li>
<li><p><strong>Step 5 – Return sources</strong>: You include the source metadata in the response so callers know which documents the answer came from. 
The <code>Set</code> deduplicates sources in case multiple chunks came from the same document.</p> </li>
</ol>
<h3 id="heading-how-to-test-the-query-pipeline">How to Test the Query Pipeline</h3>
<p>Deploy your Worker:</p>
<pre><code class="language-bash">npx wrangler deploy
</code></pre>
<p>Send a question:</p>
<pre><code class="language-bash">curl -X POST https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is RAG?"}'
</code></pre>
<p>On Windows PowerShell:</p>
<pre><code class="language-powershell">Invoke-WebRequest `
  -Uri "https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/query" `
  -Method POST `
  -ContentType "application/json" `
  -Body '{"question": "What is RAG?"}' `
  -UseBasicParsing
</code></pre>
<p>You should receive a response like this:</p>
<pre><code class="language-json">{
  "answer": "RAG stands for Retrieval Augmented Generation. It's a method that enhances generation by retrieving relevant context from a knowledge base and adding it to the prompt before generating an answer.",
  "sources": ["ai-concepts"]
}
</code></pre>
<p>The answer came from your knowledge base, not from the LLM's training data. That's the entire point of RAG: grounded, verifiable answers with traceable sources.</p>
<h2 id="heading-how-to-add-error-handling-and-security">How to Add Error Handling and Security</h2>
<p>A tutorial that only shows the happy path is not production-ready. In this section, you'll add error handling to every step of the query pipeline and protect the <code>/load</code> endpoint from unauthorized access.</p>
<h3 id="heading-how-to-secure-the-load-endpoint">How to Secure the Load Endpoint</h3>
<p>The <code>/load</code> endpoint generates embeddings and writes to your Vectorize index. 
Without protection, anyone who discovers your Worker URL can trigger it repeatedly, consuming your Workers AI quota and overwriting your data.</p>
<p>The <code>LOAD_SECRET</code> binding you added to <code>Env</code> and the <code>wrangler secret put</code> command you ran earlier handle this. The check at the top of <code>handleLoad</code> rejects any request that doesn't include the correct secret header:</p>
<pre><code class="language-typescript">const authHeader = request.headers.get("X-Load-Secret");
if (authHeader !== env.LOAD_SECRET) {
  return Response.json({ error: "Unauthorized" }, { status: 401 });
}
</code></pre>
<p>A request without the header returns <code>{"error":"Unauthorized"}</code> with a 401 status. The secret itself is stored as an encrypted environment variable in your Worker. It never appears in your code or <code>wrangler.toml</code>.</p>
<p>To trigger the load endpoint, you must include the secret in the request header:</p>
<pre><code class="language-bash">curl -X POST https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/load \
  -H "X-Load-Secret: your-secret-value"
</code></pre>
<h3 id="heading-how-to-handle-query-errors">How to Handle Query Errors</h3>
<p>Replace your <code>handleQuery</code> function with this hardened version:</p>
<pre><code class="language-typescript">async function handleQuery(request: Request, env: Env): Promise&lt;Response&gt; {
  // Guard against malformed request body
  let body: { question: string };
  try {
    body = await request.json() as { question: string };
  } catch {
    return Response.json({ error: "Invalid JSON in request body" }, { status: 400 });
  }

  if (!body.question || typeof body.question !== "string" || body.question.trim() === "") {
    return Response.json({ error: "question must be a non-empty string" }, { status: 400 });
  }

  // Sanitize: trim whitespace and cap length
  const question = body.question.trim().slice(0, 500);

  // Step 1: Embed the question
  let embeddingResponse: { data: number[][] };
  try {
    embeddingResponse = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: [question],
    }) as { data: number[][] };
  } catch (err) {
    console.error("Embedding generation failed:", err);
    return Response.json({ error: "Failed to process your question" }, { status: 503 });
  }

  // Step 2: Search Vectorize
  let searchResults: Awaited&lt;ReturnType&lt;typeof env.VECTORIZE.query&gt;&gt;;
  try {
    searchResults = await env.VECTORIZE.query(
      embeddingResponse.data[0],
      { topK: 3, returnMetadata: "all" }
    );
  } catch (err) {
    console.error("Vectorize query failed:", err);
    return Response.json({ error: "Failed to search knowledge base" }, { status: 503 });
  }

  // Step 3: Build context
  const context = searchResults.matches
    .filter((match) => match.score > 0.5)
    .map((match) => match.metadata?.text as string)
    .filter(Boolean)
    .join("\n\n");

  if (!context) {
    return Response.json({
      answer: "I could not find relevant information to answer that question. Try rephrasing or asking something else.",
      sources: [],
    });
  }

  // Step 4: Generate answer
  let aiResponse: { response: string };
  try {
    aiResponse = await env.AI.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", {
      messages: [
        {
          role: "system",
          content: "You are a helpful assistant. Answer the question using only the context provided. If the context does not contain enough information, say so.",
        },
        {
          role: "user",
          content: `Context:\n${context}\n\nQuestion: ${question}`,
        },
      ],
      max_tokens: 256,
    }) as { response: string };
  } catch (err) {
    console.error("LLM generation failed:", err);
    return Response.json({ error: "Failed to generate an answer" }, { status: 503 });
  }

  // Step 5: Return answer with sources
  const sources = searchResults.matches
    .filter((match) => match.score > 0.5)
    .map((match) => match.metadata?.source as string)
    .filter(Boolean);

  return Response.json({
    answer: aiResponse.response,
    sources: [...new Set(sources)],
  });
}
</code></pre>
<p>What each error handling decision means:</p>
<ul>
<li><p><code>try/catch</code> <strong>around</strong> <code>request.json()</code>: <code>request.json()</code> throws if the body is not valid JSON. Without this catch, a malformed request crashes your Worker with an unhandled 500 error. With it, the caller gets a clear 400 explaining what went wrong.</p> </li>
<li><p><strong>Input validation before processing</strong>: You check that <code>question</code> exists, is a string, and is not empty before calling any external service. This prevents wasted AI calls on invalid input.</p> </li>
<li><p><code>.slice(0, 500)</code> <strong>on the question</strong>: This caps the input length before it reaches the embedding model. Without it, a malicious caller could send a very long string designed to inflate your AI usage or hit Workers CPU limits.</p> </li>
<li><p><strong>503 for AI and Vectorize failures</strong>: HTTP 503 means "service temporarily unavailable." It signals to callers that the error is on the server side and the request can be retried.</p> </li>
<li><p><code>.filter(Boolean)</code> <strong>on context</strong>: After mapping <code>match.metadata?.text</code>, some results may be <code>undefined</code> if metadata was stored without a <code>text</code> field. 
This filters them out before joining, so the context string you send to the LLM contains no empty entries or stray separators.</p> </li>
</ul>
<h3 id="heading-how-to-test-error-handling">How to Test Error Handling</h3>
<p>Deploy your updated Worker:</p>
<pre><code class="language-bash">npx wrangler deploy
</code></pre>
<p>Test each error case:</p>
<pre><code class="language-bash"># Missing secret on load endpoint: should return 401
curl -X POST https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/load

# Invalid JSON: should return 400
curl -X POST https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/query \
  -H "Content-Type: application/json" \
  -d 'not json'

# Empty question: should return 400
curl -X POST https://rag-tutorial-simple.&lt;your-subdomain&gt;.workers.dev/query \
  -H "Content-Type: application/json" \
  -d '{"question": ""}'
</code></pre>
<h2 id="heading-performance-and-cost-analysis">Performance and Cost Analysis</h2>
<p>This section uses real production data from my <a href="https://github.com/dannwaneri/vectorize-mcp-worker">vectorize-mcp-worker</a> deployment. It uses the same architecture you just built, measured from Port Harcourt, Nigeria to Cloudflare's edge.</p>
<h3 id="heading-real-performance-numbers">Real Performance Numbers</h3>
<p>Here is what the pipeline actually costs in time on every request:</p>
<table>
<thead>
<tr> <th>Operation</th> <th>Time</th> </tr>
</thead>
<tbody>
<tr> <td>Embedding generation</td> <td>142ms</td> </tr>
<tr> <td>Vector search</td> <td>223ms</td> </tr>
<tr> <td>Response formatting</td> <td>&lt;5ms</td> </tr>
<tr> <td><strong>Total</strong></td> <td><strong>~365ms</strong></td> </tr>
</tbody>
</table>
<p>This covers embedding generation and vector search only – the retrieval layer. LLM generation adds 500-1500ms on top, which puts end-to-end response time at roughly 900-1900ms.</p>
<p>The embedding step and vector search dominate. Everything else is negligible. 
For context, a comparable setup using OpenAI embeddings and Pinecone would add two external API roundtrips on top of this, easily pushing total latency past 1 second.</p>
<p>These numbers come from a single-region measurement. Your actual latency will vary based on your location and Cloudflare's load at the time of the request. The architectural point holds regardless: co-locating everything on the edge eliminates inter-service network hops, which is where most latency in traditional RAG stacks comes from.</p>
<h3 id="heading-real-cost-breakdown">Real Cost Breakdown</h3>
<p>For 10,000 searches per day (300,000 per month) with 10,000 stored vectors:</p>
<p><strong>This stack:</strong></p>
<table>
<thead>
<tr> <th>Service</th> <th>Monthly Cost</th> </tr>
</thead>
<tbody>
<tr> <td>Workers</td> <td>~$3</td> </tr>
<tr> <td>Workers AI</td> <td>~$3-5</td> </tr>
<tr> <td>Vectorize</td> <td>~$2</td> </tr>
<tr> <td><strong>Total</strong></td> <td><strong>$8-10</strong></td> </tr>
</tbody>
</table>
<p><strong>Traditional alternatives for the same volume:</strong></p>
<table>
<thead>
<tr> <th>Solution</th> <th>Monthly Cost</th> </tr>
</thead>
<tbody>
<tr> <td>Pinecone Standard</td> <td>$50-70</td> </tr>
<tr> <td>Weaviate Serverless</td> <td>$25-40</td> </tr>
<tr> <td>Self-hosted pgvector</td> <td>$40-60</td> </tr>
</tbody>
</table>
<p>Based on the figures above, that is roughly a 60-90% cost reduction depending on which alternative you compare against. For a bootstrapped startup adding semantic search, the difference works out to about $180-$740 per year.</p>
<h3 id="heading-why-the-cost-difference-is-so-large">Why the Cost Difference Is So Large</h3>
<p>Traditional RAG stacks have three cost problems that compound each other.</p>
<p>The first is idle compute. A dedicated server or container running your embedding service costs money even when no searches are happening. Cloudflare Workers charge only for actual execution time.</p>
<p>The second is inter-service data transfer. 
Every time your application calls an external service for an embedding, then calls a separate service for a search, you're paying for two external API calls with metered pricing. In this stack, both operations happen inside Cloudflare's network at no additional transfer cost.</p> <p>The third is minimum plan pricing. Pinecone's Standard plan costs $50/month as a floor, regardless of how little you use it. Cloudflare's pricing scales from the $5/month Workers Paid plan base.</p> <h3 id="heading-when-the-included-allocation-is-enough">When the Included Allocation Is Enough</h3> <p>For smaller usage levels, you may not pay beyond the $5/month Workers Paid base price:</p> <ul> <li><p>Workers: 10 million requests per month included</p> </li> <li><p>Workers AI: generous daily neuron allocation included</p> </li> <li><p>Vectorize: available on both Free and Paid plans, with a free allocation included</p> </li> </ul> <p>A side project, internal tool, or small business with under 3,000 searches per day will likely stay within the included allocations entirely.</p> <h3 id="heading-the-trade-off-to-know-about">The Trade-off to Know About</h3> <p>This cost advantage comes with one operational constraint worth understanding before you build: Vectorize does not work in local development mode.</p> <p>When you run <code>wrangler dev</code>, your Worker runs locally but Vectorize calls fail. You have to deploy to Cloudflare to test your vector search. For most development workflows this means testing your query logic locally with mocked responses, then deploying to a staging environment for full integration tests.</p> <p>This is a real friction point. It's the honest trade-off for having a managed vector database with no infrastructure to operate.</p> <h2 id="heading-conclusion">Conclusion</h2> <p>In this tutorial, you have built and deployed a production-ready RAG system on Cloudflare's edge network. 
Let's recap what you built and where to take it next.</p>
<h3 id="heading-what-you-built">What You Built</h3>
<p>Your completed system has three endpoints:</p>
<ul>
<li><p><code>GET /</code>: health check confirming the Worker is running</p> </li>
<li><p><code>POST /load</code>: loads your knowledge base into Vectorize, protected by a secret header</p> </li>
<li><p><code>POST /query</code>: accepts a question, retrieves relevant context, and returns a grounded answer with sources</p> </li>
</ul>
<p>The full query pipeline runs in four steps on every request:</p>
<ol>
<li><p>The question is converted to a 768-dimensional embedding using <code>@cf/baai/bge-base-en-v1.5</code></p> </li>
<li><p>Vectorize finds the three most semantically similar documents</p> </li>
<li><p>Documents above the 0.5 similarity threshold are assembled into context</p> </li>
<li><p>Llama 3.3 generates an answer using only that context</p> </li>
</ol>
<p>Everything runs on Cloudflare's infrastructure. No external API keys. No separate vector database subscription. No servers to manage.</p>
<h3 id="heading-what-to-build-next">What to Build Next</h3>
<p>This tutorial covered the core RAG pattern. Here are five directions to take it further.</p>
<h4 id="heading-add-more-documents">Add more documents</h4>
<p>The knowledge base in this tutorial has 8 documents. A real system might have thousands. The loading pattern is identical: add documents to <code>knowledge-base.ts</code>, hit <code>/load</code> with your secret, and Vectorize handles the rest.</p>
<p>For very large knowledge bases, update <code>handleLoad</code> to batch documents in groups of 20-50 rather than upserting one at a time.</p>
<h4 id="heading-improve-chunking">Improve chunking</h4>
<p>Each document in this tutorial is a single short paragraph. Real-world documents such as PDFs, articles, and documentation pages need to be split into chunks before embedding. 
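</p>
<p>One way to do that splitting can be sketched as follows. This is a minimal illustration under stated assumptions, not the approach used in the production system: <code>chunkText</code> is a hypothetical helper, word counts stand in for token counts, and the default sizes are placeholders to tune against your own data.</p>
<pre><code class="language-typescript">// Hypothetical helper: splits text at blank-line paragraph boundaries and
// carries a short tail of each chunk into the next one as overlap.
// Word counts are a rough stand-in for tokens.
function chunkText(text: string, maxWords = 300, overlapWords = 50): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks: string[] = [];
  let current: string[] = [];

  for (const paragraph of paragraphs) {
    const words = paragraph.split(/\s+/);
    const wouldOverflow = current.length + words.length > maxWords;

    // Close the current chunk when this paragraph would push it past the budget.
    if (current.length > 0) {
      if (wouldOverflow) {
        chunks.push(current.join(" "));
        // Keep the last few words so context survives the boundary.
        current = overlapWords > 0 ? current.slice(-overlapWords) : [];
      }
    }
    // A single paragraph longer than maxWords is kept whole in this sketch.
    current.push(...words);
  }

  if (current.length > 0) {
    chunks.push(current.join(" "));
  }
  return chunks;
}
</code></pre>
<p>Each chunk would then be embedded and upserted as its own vector with its own <code>id</code>.</p>
<p>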
Chunk at natural boundaries like paragraphs and sentences, aim for 200-400 tokens per chunk, and include 50-token overlaps between chunks to preserve context across boundaries.</p>
<h4 id="heading-add-conversation-history">Add conversation history</h4>
<p>The current system treats every query as independent. To support follow-up questions, store previous messages in a Cloudflare KV namespace and include the last 2-3 exchanges in the LLM <code>messages</code> array alongside the retrieved context.</p>
<h4 id="heading-stream-the-response">Stream the response</h4>
<p>For long answers, users stare at a blank screen until generation completes. Cloudflare Workers support streaming responses via <code>TransformStream</code>. With streaming, the first tokens can appear in under 100ms while the rest of the answer is still being generated.</p>
<h4 id="heading-consider-dimensions-vs-reranking-trade-offs">Consider dimensions vs reranking trade-offs</h4>
<p>This tutorial uses <code>bge-base-en-v1.5</code> at 768 dimensions. My production system uses <code>bge-small-en-v1.5</code> at 384 dimensions. Testing showed that upgrading from 384 to 768 dimensions improved accuracy by only about 2% but doubled cost and latency.</p>
<p>Adding a reranker (<code>@cf/baai/bge-reranker-base</code>) gave a larger accuracy improvement than the dimension upgrade for a fraction of the cost. The exact improvement will vary by domain and query distribution, so test both on your actual data before deciding. 
If you're optimizing for production, add a reranker before you increase dimensions.</p>
<h3 id="heading-the-complete-project">The Complete Project</h3>
<p>Clone and deploy in six commands:</p>
<pre><code class="language-bash">git clone https://github.com/dannwaneri/rag-tutorial-simple
cd rag-tutorial-simple
npm install
npx wrangler vectorize create rag-tutorial-index --dimensions=768 --metric=cosine
npx wrangler secret put LOAD_SECRET
npx wrangler deploy
</code></pre>
<p>Then load your knowledge base:</p>
<pre><code class="language-bash">curl -X POST https://&lt;your-worker&gt;.workers.dev/load \
  -H "X-Load-Secret: your-secret"
</code></pre>
<p>If you found this useful, the production system this tutorial is based on is open source at <a href="https://github.com/dannwaneri/vectorize-mcp-worker">github.com/dannwaneri/vectorize-mcp-worker</a>. It extends this foundation with hybrid search combining vector and BM25, multimodal support for searching images with AI vision, a reranker for more accurate results, and a live dashboard. It runs on the same Cloudflare stack you just built – Workers, Vectorize, Workers AI – plus D1 for document storage.</p>
<p>One difference you'll notice: the production system uses <code>bge-small-en-v1.5</code> at 384 dimensions rather than the 768 dimensions in this tutorial. That is an intentional trade-off: the reranker adds more accuracy than the extra dimensions at lower cost. The jump from what you built today to that system is smaller than it looks.</p>
</article> </main></body></html>

Daniel Nwaneri - freeCodeCamp.org

How to Build a Production-Safe Agent Loop: From Exit Conditions to Audit Trails

Table of Contents

Why This Keeps Happening

Prerequisites

Phase 1: Define Done Before You Build