claude.ai - freeCodeCamp.org

How to Use Claude Code to Build Flutter Apps Faster — Best Practices for 2026

Jesutoni Aderibigbe — Mon, 29 Jun 2026 14:05:14 +0000

In early 2023, I was interning at a US-based company, long before agentic AI became part of everyday development.

We had tools like ChatGPT, Gemini, and Copilot, but they were mostly chat interfaces: you pasted code, got a response, and moved on.

During that time, my manager, who worked in AI/ML, told me that a day would come when developers would collaborate with AI agents and that learning how to write effective prompts would become a valuable skill.

I took that advice seriously. I spent countless nights experimenting with prompts, refining instructions, and learning how to communicate with AI systems effectively.

Today, while I still write code by hand and believe strongly in fundamentals, those early lessons have paid off. In an era where AI is embedded into the development workflow, I've been able to leverage it to significantly amplify my productivity as a software engineer.

You've probably seen all the excitement around AI coding assistants. But if you've tried using one on a real Flutter project, whether it's a fintech app, an e-commerce platform, or any application with a well-structured architecture, you've likely experienced the frustration, too.

The assistant generates a widget. You paste it in. It doesn't fit your architecture. It ignores your naming conventions. It recreates functionality that already exists somewhere else in your codebase. Before long, you've spent twenty minutes fixing code that was supposed to save you time.

The problem isn't the AI. The problem is that most developers still use AI as an advanced autocomplete tool when it can function as something much more powerful: a second engineer that understands your codebase, follows your conventions, and tackles parallel tasks while you focus on solving the hard problems.

In this article, I'll show you what has actually worked for me. We'll cover how to structure your Flutter projects so Claude Code can navigate them effectively and how to use skills, loops, and subagents to automate repetitive development tasks and dramatically increase your productivity.

Prerequisites

Before following along, you should be comfortable with the basics of Flutter development: building widgets, managing state, and running the app from the terminal. You don't need to be an expert.

On the tooling side, you'll need:

Flutter SDK (3.x or later): the framework we're building with. Install it from flutter.dev.
Claude Code: Anthropic's agentic coding tool that runs in your terminal alongside your editor. Install it with npm install -g @anthropic-ai/claude-code, then run claude in your project directory to start a session. You'll need an Anthropic account and API key.
A code editor: VS Code or Android Studio both work well. Claude Code operates in the terminal and reads/writes files directly, so it works alongside whatever editor you use.
Git: version control is assumed throughout. Claude Code integrates with Git for commits, diffs, and branch awareness.

Here's a quick overview of the Claude Code concepts we'll use throughout the article:

CLAUDE.md: a markdown file at your project root that Claude reads at the start of every session. Think of it as a briefing document: your architecture, your conventions, your commands.
Skills: reusable instruction packs stored in .claude/skills/. You define them once, and Claude invokes them automatically when the task matches, or you call them manually with /skillname.
Subagents: isolated Claude instances that handle a focused task in their own context window, then return only a summary. Great for parallel work without polluting your main session.
Hooks: shell commands or scripts that fire on lifecycle events (before a tool runs, after a turn completes, and so on). They bypass Claude's judgment entirely — useful for enforcing rules deterministically.
/loop: a built-in skill that reruns a task repeatedly until a condition you define is met.

None of these require special configuration to unlock. They’re all available once you have Claude Code installed.

1. Why Architecture Comes First
2. Setting Up Your CLAUDE.md
3. Feature-First Folder Structure — The Details
4. Writing Skills for Your Most Repeated Tasks
5. Using /loop for Self-Correcting Workflows
6. Subagents for Parallel Screen Development
7. Hooks — Enforcing Rules Deterministically
8. Putting It All Together: A Real Sprint Workflow
Key Takeaways

1. Why Architecture Comes First

Before you write a single skill or configure a single hook, your folder structure needs to make sense to an AI reading it cold.

Claude Code reads your files to understand your project. If your code is scattered across a layer-first structure (lib/models/, lib/services/, lib/widgets/), Claude has to piece together what each feature does by jumping between folders. It makes mistakes. It creates files in the wrong place. It generates code that doesn't conform to the pattern used in the rest of the app.

The fix is a feature-first structure. Each feature is a self-contained module. Everything Claude needs to understand the transfer flow, for example, lives inside lib/features/transfer/.

lib/
├── core/
│   ├── constants/
│   ├── errors/
│   ├── router/
│   └── theme/
├── features/
│   ├── auth/
│   │   ├── data/
│   │   │   ├── models/         # Freezed models
│   │   │   └── repositories/
│   │   ├── presentation/
│   │   │   ├── screens/
│   │   │   ├── widgets/
│   │   │   └── providers/      # Riverpod providers
│   │   └── auth.dart           # barrel export
│   ├── transfer/
│   │   ├── data/
│   │   ├── presentation/
│   │   └── transfer.dart
│   └── wallet/
│       ├── data/
│       ├── presentation/
│       └── wallet.dart
└── main.dart

This structure tells Claude immediately: "Everything for the transfer feature is in lib/features/transfer/." When you ask it to "add a beneficiary validation to the transfer flow," it knows exactly where to look and where to create new files.

It also maps cleanly to Riverpod with code generation. Each feature's providers live close to the screens that use them, which means build_runner output lands in the right place, too.
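To make the layout concrete, here is a minimal sketch of a scaffolding helper that creates the same sub-folders for a new feature. It is written in Python purely for illustration; the helper name `scaffold_feature` and the barrel-file comment are hypothetical, and only the folder names come from the tree above.

```python
from pathlib import Path
import tempfile

# Sub-folders copied from the feature tree shown above.
FEATURE_DIRS = (
    "data/models",
    "data/repositories",
    "presentation/screens",
    "presentation/widgets",
    "presentation/providers",
)


def scaffold_feature(lib_dir: Path, feature: str) -> Path:
    """Create lib/features/<feature>/ with the standard sub-folders and a barrel file."""
    root = lib_dir / "features" / feature
    for sub in FEATURE_DIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    barrel = root / f"{feature}.dart"
    if not barrel.exists():
        # Hypothetical placeholder content; real barrel files hold export statements.
        barrel.write_text(f"// Barrel export for the {feature} feature.\n")
    return root


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing in a real project is touched.
    with tempfile.TemporaryDirectory() as tmp:
        root = scaffold_feature(Path(tmp) / "lib", "beneficiary")
        for path in sorted(root.rglob("*")):
            print(path.relative_to(root).as_posix())
```

Creating every feature from the same template is one way to keep the structure consistent enough for Claude to pattern-match against.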

2. Setting Up Your CLAUDE.md

CLAUDE.md is arguably the most important file in your Claude Code setup. It's loaded at the beginning of every session. It remains in context throughout the conversation, helping Claude stay aligned with your project's architecture, conventions, and development practices no matter how long the session becomes.

Create it at the root of your project:

touch CLAUDE.md

Here's a template shaped for a Flutter/Riverpod project:

# My Flutter App

## Commands
- `flutter pub get` — install dependencies
- `dart run build_runner build --delete-conflicting-outputs` — generate code
- `flutter analyze` — run linter
- `flutter test` — run tests
- `flutter run` — start dev build

## Architecture
Feature-first folder structure. Each feature lives in lib/features//.
State management: Riverpod with @riverpod code generation (AsyncNotifier pattern).
HTTP: Dio with interceptors in lib/core/network/.
Navigation: GoRouter with named routes defined in lib/core/router/.
Models: Freezed + JsonSerializable. Run build_runner after any model change.

## Conventions
- All monetary amounts in the smallest unit (e.g. kobo for NGN), stored as int — never use doubles for money
- Use ref.invalidate() not ref.refresh()
- No business logic in widgets — all logic goes in notifiers or repositories
- Widget files contain only one public widget per file
- Barrel exports via feature.dart in each feature root
- Prefix private widgets with an underscore

## What NOT to do
- Do not add new packages without asking first
- Do not modify *.g.dart or *.freezed.dart files directly — regenerate with build_runner
- Do not put API calls directly in notifiers — always go through the repository layer

A few things to note about this file:

Keep it honest: If your conventions don't match what's actually in the codebase, Claude will get confused. The CLAUDE.md should reflect how the code actually works today, not aspirationally.

The "What NOT to do" section matters: AI assistants are optimistic. They'll solve the problem in front of them without thinking about side effects. Explicitly telling Claude what to avoid saves a lot of cleanup.

Don't make it too long: Every line in CLAUDE.md costs tokens on every single turn of every session. Put team-wide, always-relevant rules here. Everything else should be a skill (covered next).
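The money convention in the template is worth a quick illustration. The sketch below is in Python rather than Dart, but the same integer arithmetic applies in either language. The function name `format_naira` and the `NGN` prefix are illustrative choices, not part of any project described here.

```python
def format_naira(kobo: int) -> str:
    """Format an integer amount of kobo as a naira string using integer maths only."""
    if not isinstance(kobo, int) or isinstance(kobo, bool):
        raise TypeError("amounts must be stored as int kobo")
    sign = "-" if kobo < 0 else ""
    naira, remainder = divmod(abs(kobo), 100)
    return f"{sign}NGN {naira:,}.{remainder:02d}"


# Binary floating point cannot represent most decimal fractions exactly,
# so sums of fractional amounts drift. Integer kobo does not.
print(0.1 + 0.2 == 0.3)          # False
print(10 + 20 == 30)             # True
print(format_naira(1_250_050))   # NGN 12,500.50
```

Conversion to naira happens only at the display boundary; everything stored or computed stays an integer.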

3. Feature-First Folder Structure — The Details

Let's look inside a feature in more detail, using a wallet feature as an example:

lib/features/wallet/
├── data/
│   ├── models/
│   │   ├── wallet.dart             # Freezed model
│   │   ├── wallet.freezed.dart     # Generated
│   │   ├── wallet.g.dart           # Generated
│   │   └── transaction.dart
│   └── repositories/
│       ├── wallet_repository.dart  # Abstract class
│       └── wallet_repository_impl.dart
├── presentation/
│   ├── screens/
│   │   ├── wallet_screen.dart
│   │   └── transaction_history_screen.dart
│   ├── widgets/
│   │   ├── balance_card.dart
│   │   └── transaction_tile.dart
│   └── providers/
│       ├── wallet_provider.dart
│       └── wallet_provider.g.dart  # Generated
└── wallet.dart                     # Barrel export

And here's what a clean Riverpod provider looks like in this structure:

// lib/features/wallet/presentation/providers/wallet_provider.dart

import 'package:riverpod_annotation/riverpod_annotation.dart';
import '../../data/models/wallet.dart';
import '../../data/repositories/wallet_repository.dart';

part 'wallet_provider.g.dart';

@riverpod
class WalletNotifier extends _$WalletNotifier {
  // Return type assumes the Freezed model in wallet.dart is named Wallet.
  @override
  Future<Wallet> build() async {
    return ref.watch(walletRepositoryProvider).getWallet();
  }

  Future<void> refreshBalance() async {
    state = const AsyncValue.loading();
    state = await AsyncValue.guard(
      () => ref.read(walletRepositoryProvider).getWallet(),
    );
  }
}

When Claude Code sees this pattern repeated across multiple features, it learns to replicate it. The more consistent your structure, the better Claude's output matches what you'd write yourself.

4. Writing Skills for Your Most Repeated Tasks

Skills are reusable instruction packs that Claude Code loads when they're relevant. They live in .claude/skills//SKILL.md and can be invoked manually with /skillname or triggered automatically when Claude recognises the right context.

A simple way to think about a Skill is as a specialist on your team. Imagine working with a designer, a QA engineer, and a security expert. You don't explain their entire job every time you need their help. Each person already knows their responsibilities and follows a defined process.

Skills work the same way. Instead of repeatedly telling Claude how to generate Riverpod providers, write tests, or review security concerns, you package those instructions into a Skill and let Claude load them whenever they're needed.

Think of a Skill as a saved recipe. Instead of writing out the ingredients and cooking steps every time you want to make a meal, you keep the recipe in one place and reuse it whenever needed.

Skills do the same thing for development workflows. They allow you to save a set of instructions once and have Claude follow them consistently every time a similar task comes up.

The key thing to understand is that the description field is what triggers a skill. Claude evaluates it on every turn and decides whether the current task matches. Because of this, you should describe it using the same verbs that developers actually type in real workflows, like build, commit, release, or fix lint, instead of documentation-style language.

Creating Your First Skill

Before you write a skill, think about the tasks you perform over and over again. A good skill captures a workflow you already know by heart. If you find yourself giving Claude the same instructions every session, such as "run flutter analyze, then run build_runner, then execute the tests," that's a good candidate for a skill.

Start with one task. Keep the steps in the exact order you expect Claude to follow, and clearly define what a successful outcome looks like. Don't try to cover every possible edge case. The goal is to automate your normal workflow so Claude can handle the repetitive work consistently, while you step in only when something unexpected happens.

mkdir -p .claude/skills/flutter-release
touch .claude/skills/flutter-release/SKILL.md

---
name: flutter-release
description: |
  Use this skill when building a release APK or preparing the app for deployment.
  Triggers on: "build release", "generate apk", "prepare release", "release build".
allowed-tools: Bash Read
---

# Flutter release checklist

Run these steps in order. Do not skip any step.

1. Run `flutter pub get`
2. Run `dart run build_runner build --delete-conflicting-outputs`
3. Run `flutter analyze` and treat warnings as errors: fix every reported issue before proceeding.
4. Run `flutter test` — if any test fails, fix it before continuing
5. Run `flutter build apk --release`
6. Confirm build output at `build/app/outputs/flutter-apk/app-release.apk`
7. Create a git commit: `chore: release build vX.X.X`

If any step fails, stop and report the error clearly. Do not skip ahead.

Now, whenever you type "prepare a release" or "build the apk", Claude follows this checklist without you having to remind it of the steps.

A Skill For Conventional Commits

mkdir -p .claude/skills/commit
touch .claude/skills/commit/SKILL.md

---
name: commit
description: |
  Use when committing changes or writing a commit message.
  Triggers on: "commit", "git commit", "commit changes", "write a commit message".
---

Follow Conventional Commits format:

Types: feat | fix | chore | refactor | docs | test | perf

Format: `type(scope): short imperative summary`

Rules:
- Subject line max 72 characters
- Imperative mood — "add" not "added", "fix" not "fixed"
- Scope = the feature name (auth, transfer, wallet, cards)

Examples:
- `feat(transfer): add beneficiary validation on amount input`
- `fix(wallet): correct kobo-to-naira display conversion`
- `chore(deps): upgrade riverpod to 2.6.1`

Always run `flutter analyze` before committing. Never commit with lint errors.
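The rules in this skill are mechanical enough to check in code as well. Here is a rough Python sketch of a subject-line checker built from the types, format, and length limit listed above. The function name, the regex, and the small past-tense word list are my own assumptions, and the mood check is only a crude heuristic.

```python
import re
from typing import List

TYPES = ("feat", "fix", "chore", "refactor", "docs", "test", "perf")
SUBJECT_RE = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")\((?P<scope>[a-z0-9_-]+)\): (?P<summary>\S.*)$"
)
MAX_SUBJECT_LENGTH = 72
# Rough heuristic only: a few common past-tense openers, not a grammar check.
PAST_TENSE_OPENERS = {"added", "fixed", "updated", "removed", "changed"}


def check_subject(subject: str) -> List[str]:
    """Return a list of problems with a commit subject line (empty means it passes)."""
    problems = []
    if len(subject) > MAX_SUBJECT_LENGTH:
        problems.append(f"subject is {len(subject)} characters (max {MAX_SUBJECT_LENGTH})")
    match = SUBJECT_RE.match(subject)
    if match is None:
        problems.append("expected format: type(scope): short imperative summary")
        return problems
    first_word = match["summary"].split()[0].lower()
    if first_word in PAST_TENSE_OPENERS:
        problems.append(f"use imperative mood instead of '{first_word}'")
    return problems


print(check_subject("feat(transfer): add beneficiary validation on amount input"))  # []
print(check_subject("fix(wallet): fixed kobo-to-naira display conversion"))
```

A checker like this could run in a Git commit-msg hook as a backstop, so the convention holds even when a commit is written by hand.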

Dynamic Context Injection

Skills support a powerful trick: you can inject live shell output directly into the skill body using !`command` syntax. Claude receives the output as part of the skill, not as a separate step.

For example, you could embed !`git status` inside a skill so Claude always sees the current state of your repository when applying it. In a Flutter workflow, you could use !`flutter test` so the skill includes the latest test results before Claude suggests fixes or improvements.

---
name: sprint-status
description: |
  Use when asked about current status, what's left to do, or what changed.
---

## Current git status
!`git status --short`

## Uncommitted changes
!`git diff --stat HEAD`

## Recent commits
!`git log --oneline -10`

## Lint status
!`flutter analyze 2>&1 | tail -20`

Review the above and give a concise summary of: what's done, what's broken, and what needs attention before the next commit.

Type /sprint-status and Claude gets a live snapshot of your project state before responding.
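If the injection idea feels abstract, the following Python sketch emulates it: each !`command` placeholder is replaced with that command's output before the text is handed over. This is not Claude Code's implementation, only an illustration of the substitution; `render_skill_body` is a hypothetical name.

```python
import re
import subprocess

# Matches placeholders of the form !`command`.
INJECTION_RE = re.compile(r"!`([^`]+)`")


def render_skill_body(body: str, timeout: float = 30.0) -> str:
    """Replace every !`command` placeholder with the command's combined output."""

    def run(match: re.Match) -> str:
        # shell=True executes arbitrary text, so only use this on files you trust.
        result = subprocess.run(
            match.group(1), shell=True, capture_output=True, text=True, timeout=timeout
        )
        return (result.stdout + result.stderr).strip()

    return INJECTION_RE.sub(run, body)


if __name__ == "__main__":
    print(render_skill_body("## Recent commits\n!`git log --oneline -3`"))
```

The practical point is that the commands run first, so the model reads their results as plain text rather than deciding whether to run them.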

5. Using /loop for Self-Correcting Workflows

/loop is a built-in Claude Code skill that reruns a task repeatedly until a condition is met. It's the difference between "fix this lint error" (one shot) and "fix all lint errors" (autonomous loop).

For example, instead of running a one-time prompt like “fix this lint error,” you would use /loop fix lint errors in this Flutter project until there are no warnings left. Claude will then repeatedly check the output, apply fixes, and recheck until the condition is satisfied.

A more realistic Flutter workflow could look like /loop run flutter analyze and fix all reported issues until analysis passes clean. In this case, Claude keeps running analyses, fixing issues, and revalidating until the project reaches a clean state.

It's worth noting that /loop and a Skill solve two different problems, and it helps to think of them like this:

A Skill is knowledge.
A Loop is behavior over time.

The pattern is always the same: tell Claude what to run, what to check, and when to stop.

Fix Until Clean

/loop
Run flutter analyze.
If there are any errors or warnings, read each one carefully and fix it.
Run flutter analyze again.
Continue until flutter analyze reports zero issues.
Do not move on while there are errors remaining.

TDD Loop

/loop
Run: flutter test --name "WalletNotifier"
If the test fails, read the failure output carefully.
Make the minimal code change required to fix the failure.
Do not change the test itself.
Run the test again.
Stop when the test passes with no errors.

Build a Screen, Check it, Iterate

/loop
Look at the Figma spec notes in CLAUDE.md under "Remaining screens".
Pick the next incomplete screen.
Build the screen following the architecture pattern in lib/features/wallet/presentation/.
After building, run flutter analyze and fix any issues.
Add a comment `// DONE` at the top of the completed screen file.
Move to the next screen.
Stop after completing 3 screens.

A word of caution: /loop is powerful, but give Claude a clear stop condition. "Keep going until it's perfect" is not a stop condition. "Stop when flutter analyze and flutter test both pass with zero issues." is.
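The "run, check, stop" pattern is an ordinary bounded loop. Here is a minimal Python sketch of it, assuming a `check` callable that returns the current list of issues and a `fix` callable that handles one issue. Both are hypothetical stand-ins for running flutter analyze and editing code. The `max_rounds` cap is my addition, as a second stop condition in case the first is never reached.

```python
from typing import Callable, List


def run_until_clean(
    check: Callable[[], List[str]],
    fix: Callable[[str], None],
    max_rounds: int = 10,
) -> int:
    """Repeat check -> fix until check reports no issues; return the rounds used."""
    for round_number in range(1, max_rounds + 1):
        issues = check()
        if not issues:  # explicit stop condition: zero issues reported
            return round_number
        for issue in issues:
            fix(issue)
    raise RuntimeError(f"still failing after {max_rounds} rounds")


# Demo with a fake issue list standing in for analyzer output.
pending = ["unused_import", "missing_return_type"]
rounds = run_until_clean(check=lambda: list(pending), fix=pending.remove)
print(rounds)  # 2 (one round of fixes, one clean re-check)
```

A vague goal such as "until it's perfect" gives the loop nothing to evaluate, which is why the stop condition has to be something a command can report.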

6. Subagents for Parallel Screen Development

Subagents are isolated Claude instances that run a task in their own context window and then return only a summary to the main session. This changes how you think about working with Claude Code on a multi-screen project.

A simple way to understand it is to imagine building a full Flutter app with multiple screens. Without subagents, you would design the home screen, then the profile screen, then settings, all in one long conversation. Over time, the context gets heavier, and Claude starts losing focus on earlier decisions.

With subagents, it's like giving each screen to a different engineer. One works on the home screen, another builds the profile screen, and another handles settings. Each one works independently, follows the same project rules, and reports back only when the screen is ready. You then combine their output into the main project without losing clarity or consistency.

Setting Up a Screen-Builder Subagent

Create a file at .claude/agents/screen-builder.md:

---
name: screen-builder
description: Builds a single Flutter screen following the app's feature-first Riverpod architecture
model: claude-sonnet-4-6
tools: [Read, Write, Bash, Glob]
---

You are a Flutter engineer building a screen for a fintech app.

Before building anything:
1. Read lib/features/wallet/presentation/screens/wallet_screen.dart to understand the existing screen pattern
2. Read CLAUDE.md for conventions and architecture rules
3. Read the feature's existing providers in the presentation/providers/ folder

When building the screen:
- Follow the exact same structure as the existing screens
- Use AsyncValue pattern for loading/error/data states
- No business logic in the widget — all state goes through the provider
- Every monetary amount displayed in naira but stored in kobo (divide by 100 for display)
- Use GoRouter for navigation, not Navigator.push

After building:
- Run flutter analyze on the file
- Fix any errors
- Return a summary: file path created, provider used, any decisions made

Using it

In your main session, you can now say:

Use the screen-builder subagent to build the Transaction History screen.
The screen should show a list of transactions from the WalletNotifier provider.
Each item should display: amount (formatted), description, date, and status badge.

Claude dispatches the subagent, which reads your existing code for context, builds the screen following your patterns, fixes any lint errors, and returns a clean summary, without cluttering your main thread with every intermediate step.

You can also run multiple subagents simultaneously for truly parallel work:

Dispatch three screen-builder subagents in parallel:
1. Transaction History screen (list of transactions)
2. Send Money screen (amount input + recipient selection)
3. Wallet Top-Up screen (amount input + payment method)

Each should follow the existing wallet feature patterns.
Report back when all three are complete.

7. Hooks — Enforcing Rules Deterministically

Skills and subagents influence how Claude thinks and plans, but hooks are different. Hooks are deterministic. They run automatically at specific lifecycle events, no matter what Claude decides to do. This makes them useful for enforcing hard rules in your workflow.

A simple way to understand it is to think of hooks as guards in a real engineering pipeline. For example, before any code is committed, a PreToolUse hook can run to check formatting or block unsafe changes. After a tool runs, a PostToolUse hook can validate the output. When Claude finishes a turn, a Stop hook can trigger cleanup tasks or logging. Other events, such as SessionStart and PreCompact, help you initialize context or manage memory before Claude continues working.

In practice, hooks are how you enforce consistency. While Skills and subagents guide Claude’s behavior, hooks ensure certain actions always happen at the right moment, without relying on Claude to “remember” or “decide.”

Block Edits to Generated Files

Generated files like *.g.dart and *.freezed.dart should never be edited manually — they get overwritten by build_runner. This hook blocks Claude from writing to them:

Create .claude/hooks.json:

{
  "PreToolUse": [
    {
      "matcher": "Write|Edit",
      "command": "bash -c 'if [[ \"$CLAUDE_TOOL_INPUT_PATH\" == *.g.dart ]] || [[ \"$CLAUDE_TOOL_INPUT_PATH\" == *.freezed.dart ]]; then echo \"Blocked: Do not edit generated files. Run build_runner instead.\"; exit 1; fi'"
    }
  ]
}
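The same guard can be written as a standalone script, which is easier to test than an inline bash one-liner. The sketch below rests on assumptions you should verify against the hooks reference for your Claude Code version: that the hook receives the tool call as JSON on stdin, that the target path sits under `tool_input.file_path`, and that exit code 2 marks the call as blocked with the stderr message passed back to Claude.

```python
import io
import json
import sys

GENERATED_SUFFIXES = (".g.dart", ".freezed.dart")


def blocked_reason(payload: dict):
    """Return a message if the tool call targets a generated file, else None."""
    # Assumption: the path arrives as payload["tool_input"]["file_path"].
    tool_input = payload.get("tool_input") or {}
    path = str(tool_input.get("file_path", ""))
    if path.endswith(GENERATED_SUFFIXES):
        return f"Blocked: {path} is generated. Run build_runner instead."
    return None


def main(stream=sys.stdin) -> int:
    """Read one hook payload from the stream and return the exit code to use."""
    try:
        payload = json.load(stream)
    except json.JSONDecodeError:
        return 0  # not a payload we understand, so do not block
    reason = blocked_reason(payload)
    if reason:
        print(reason, file=sys.stderr)
        return 2  # assumption: exit code 2 signals "block this tool call"
    return 0


# Demo with an in-memory payload instead of real stdin.
demo = io.StringIO(json.dumps({"tool_input": {"file_path": "lib/features/wallet/data/models/wallet.g.dart"}}))
print(main(demo))  # 2
```

To wire it up as a real hook, you would replace the demo lines with `sys.exit(main())` and point the hook's command at the script.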

Run Analyze Before Every Stop

This hook runs flutter analyze before Claude considers its turn complete, catching lint errors before they accumulate:

{
  "Stop": [
    {
      "command": "bash -c 'result=$(flutter analyze 2>&1); if echo \"$result\" | grep -q \"error •\"; then echo \"Flutter analyze found errors. Fix before stopping:\"; echo \"$result\"; exit 1; fi'"
    }
  ]
}

Now Claude can't finish a turn if there are lint errors. It gets blocked and has to fix them first.

8. Putting It All Together: A Real Sprint Workflow

Here's what a typical feature development session looks like when all of this is configured:

Morning: Check Project State

/sprint-status

Claude reads live Git status, recent commits, and current lint output, then summarises what needs attention.

Start a New Feature

I need to build the beneficiary management feature. 
Users should be able to save, view, and delete beneficiaries for the transfer flow.
Start with the data layer — Freezed model and repository interface.

Claude reads your CLAUDE.md and existing feature patterns, then builds the model and repository in the right place, following your conventions.

Generate All the Screens in Parallel

Use the screen-builder subagent to build:
1. BeneficiaryListScreen — shows saved beneficiaries with search
2. AddBeneficiaryScreen — form with account number and bank selection
3. BeneficiaryDetailScreen — shows details with delete option

Fix Everything Until it's Clean

/loop
Run flutter analyze.
Fix all errors.
Run flutter test.
Fix any test failures.
Stop when both pass with zero issues.

Commit Cleanly

Commit the beneficiary feature

The commit skill triggers, runs analyze one more time, and creates a correctly-formatted conventional commit message.

Key Takeaways

If there's one key takeaway from all of this, it's that Claude Code isn't just about prompting. It's about setup. The quality of its output is shaped far more by what you define about your project upfront than by what you type in the moment.

This is also what separates vibe coding from real AI-assisted engineering. Without structure, you end up guessing and reacting, which feels fast but breaks down quickly.

With the right setup, Claude becomes a pair programming partner that follows your conventions and handles execution while you focus on decisions that actually require engineering judgment. That shift is what lets you spend less time fixing generated code and more time solving the problems that matter.

The payoff compounds. A CLAUDE.md takes 20 minutes to write. A skill for your release flow takes 10 minutes. But both of those pay for themselves the first time Claude correctly follows your process without you having to walk it through every step.

Start small: write your CLAUDE.md this week. Add one skill for the task you repeat most — committing, releasing, or running lint. Then, when you're comfortable, try a /loop on your next test-fixing session. The rest follows naturally.

The goal isn't to let AI write all your code. It's to stop spending your limited engineering time on the parts that don't require your judgment, and to spend more of it on the parts that do.

How to Keep Human Experts Visible in Your AI-Assisted Codebase

Daniel Nwaneri — Mon, 13 Apr 2026 16:24:52 +0000

Six months ago, Stack Overflow processed 108,563 questions in a single month. By December 2025, that number had fallen to 3,862. A 78% collapse in two years.

The explanation everyone reaches for is that AI replaced it. That's partly true. But it misses the structural problem underneath: every time a developer asks Claude or ChatGPT to write code, the knowledge that shaped the answer disappears.

The GitHub discussion where someone spent two hours documenting why cursor-based pagination beats offset for live-updating datasets. The Stack Overflow answer from 2019 where one engineer, after a week of debugging, documented exactly why that approach fails under concurrent writes.

The AI consumed all of it. The humans who produced it got nothing — no citation in the codebase, no signal that their work mattered.

Over time, those people stopped contributing. Stack Overflow isn't dying because it's bad. It's dying because AI extracted its value and the feedback loop that kept humans contributing broke down.

This tutorial builds a tool that puts that loop back together. proof-of-contribution is a Claude Code skill that links every AI-generated artifact back to the human knowledge that inspired it — and surfaces exactly where the AI made choices with no human source at all.

I'll show you how to install proof-of-contribution, how to record your first provenance entry, how to use the spec-writer integration that makes Knowledge Gaps deterministic, and how to run poc.py verify — a static analyser that detects gaps without a single API call.

What You Will Build
Prerequisites
Quickstart in 5 Minutes
How the Tool Works
How to Install proof-of-contribution
How to Scaffold Your Project
How to Record Your First Provenance Entry
How to Use import-spec to Seed Knowledge Gaps
How to Trace Human Attribution
How to Verify with Static Analysis
How to Enable PR Enforcement
Where to Go Next

What You Will Build

proof-of-contribution is a Claude Code skill with a local CLI. Together they give you:

Provenance Blocks: Claude appends a structured attribution block to every generated artifact, listing the human sources that inspired it and flagging what it synthesized without any traceable source.
Knowledge Gaps: the parts of AI-generated code that have no human citation, surfaced before they become production incidents
poc.py trace: a CLI command that shows the full human attribution chain for any file in thirty seconds
poc.py import-spec: bridges proof-of-contribution with spec-writer, seeding knowledge gaps from your spec's assumptions list before the agent builds anything
poc.py verify: a static analyser that cross-checks your file's structure against seeded claims using Python's AST. Zero API calls. Exit code 0 means clean, exit code 1 means gaps found — wires directly into CI
A GitHub Action: optional PR enforcement that fails PRs missing attribution, for teams that want a standard

The complete source is at github.com/dannwaneri/proof-of-contribution.

Prerequisites

This is a beginner-to-intermediate tutorial. You should be comfortable with:

Command line basics: navigating directories, running scripts
Git: basic commits and PRs
Python 3.8 or higher: the CLI is pure Python with no dependencies

You will need:

Python installed: check with python --version or python3 --version
Git installed: check with git --version
Claude Code (or any agent that supports the Agent Skills standard — Cursor and Gemini CLI also work)

There's no database to install. No API keys. No paid services. The default storage is SQLite, which Python includes out of the box.

Quickstart in 5 Minutes

If you want to try the tool before reading the full tutorial, here are the five commands that take you from zero to your first gap detection:

Mac and Linux:

# 1. Install
mkdir -p ~/.claude/skills
git clone https://github.com/dannwaneri/proof-of-contribution.git \
  ~/.claude/skills/proof-of-contribution

# 2. Scaffold your project (run in your repo root)
python ~/.claude/skills/proof-of-contribution/assets/scripts/poc_init.py

# 3. Record attribution for an AI-generated file
python poc.py add src/utils/parser.py

# 4. Detect gaps via static analysis
python poc.py verify src/utils/parser.py

# 5. See the full provenance chain
python poc.py trace src/utils/parser.py

Windows PowerShell:

# 1. Install
New-Item -ItemType Directory -Force -Path "$HOME\.claude\skills"
git clone https://github.com/dannwaneri/proof-of-contribution.git `
  "$HOME\.claude\skills\proof-of-contribution"

# 2. Scaffold your project
python "$HOME\.claude\skills\proof-of-contribution\assets\scripts\poc_init.py"

# 3. Record attribution
python poc.py add src\utils\parser.py

# 4. Detect gaps
python poc.py verify src\utils\parser.py

# 5. See the full provenance chain
python poc.py trace src\utils\parser.py

That's the whole tool. The sections below walk through each step in detail with real terminal output at every stage.

How the Tool Works

Before you install anything, you need a clear mental model of what proof-of-contribution actually does — because the most important part isn't obvious.

The Archaeology Problem

Here's a scenario that happens on every team using AI-assisted development.

A developer joins. They inherit six months of AI-generated code. They hit a bug in the pagination logic: cursor-based, an unusual implementation, and nobody remembers why it was built that way. The original developer has left.

Old answer: two days of archaeology. git blame points to a commit message that says "fix pagination." The commit before that says "implement pagination." Dead end.

With poc.py trace src/utils/paginator.py, that same developer sees this in thirty seconds:

Provenance trace: src/utils/paginator.py
────────────────────────────────────────────────────────────
  [HIGH]  @tannerlinsley on github
          Cursor pagination discussion
          https://github.com/TanStack/query/discussions/123
          Insight: cursor beats offset for live-updating datasets

Knowledge gaps (AI-synthesized, no human source):
  • Error retry strategy — no human source cited
  • Concurrent write handling — AI chose this arbitrarily

They now know where the pattern came from and — critically — which parts have no traceable human source. The concurrent write handling is where the bug lives. The AI made a choice nobody reviewed.

That's what this tool does. Not enforcement first. Archaeology first.

How Knowledge Gaps Are Detected

The obvious assumption is that Claude introspects and reports what it doesn't know. That assumption is wrong. LLMs hallucinate confidently. An AI that could reliably detect its own knowledge gaps wouldn't produce them.

The detection mechanism is a comparison, not introspection.

When you use spec-writer before building, it generates a spec with an explicit ## Assumptions to review section — every decision the AI is making that you didn't specify, each one impact-rated. That list is the contract.

When you run poc.py import-spec spec.md --artifact src/utils/paginator.py, those assumptions get seeded into the database as unresolved knowledge gaps. After the agent builds, poc.py trace shows which assumptions made it into code with no human source ever cited.

The AI isn't grading its own exam. The spec is the answer key.

poc.py verify takes this further. After the agent builds, it parses the file's actual structure using Python's built-in ast module — extracting every function definition, conditional branch, and return path. It cross-checks each one against the seeded claims. Any structural unit with no resolved claim surfaces as a deterministic Knowledge Gap, regardless of how confident the model was when it wrote the code.
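To make the comparison concrete, here is a minimal sketch of that kind of structural scan. It is not the tool's actual implementation: the helper names (structural_units, uncited), the sample source, and the cited_names set are all hypothetical, and it only covers function definitions and if-branches, not return paths.

```python
import ast


def structural_units(source: str) -> list[dict]:
    """Collect function definitions and if-branches with their line ranges."""
    units = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            units.append({"kind": "function", "name": node.name,
                          "lines": (node.lineno, node.end_lineno)})
        elif isinstance(node, ast.If):
            units.append({"kind": "branch", "name": "if " + ast.unparse(node.test),
                          "lines": (node.lineno, node.end_lineno)})
    return sorted(units, key=lambda unit: unit["lines"])


def uncited(units: list[dict], cited_names: set[str]) -> list[dict]:
    """Return units whose name has no resolved claim (cited_names is hypothetical)."""
    return [unit for unit in units if unit["name"] not in cited_names]


SAMPLE = '''def parse_query(text):
    if not text:
        return None
    return text.strip()

def handle_concurrent_writes(rows):
    return rows
'''

units = structural_units(SAMPLE)
gaps = uncited(units, cited_names={"parse_query"})
for gap in gaps:
    start, end = gap["lines"]
    print(f"{gap['kind']}: {gap['name']} (lines {start}-{end})")
```

The point of the sketch is that nothing here asks a model anything. The gap list falls out of a set difference between what the parser found and what a human has cited.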

How to Install proof-of-contribution

Mac and Linux

mkdir -p ~/.claude/skills
git clone https://github.com/dannwaneri/proof-of-contribution.git \
  ~/.claude/skills/proof-of-contribution

Windows PowerShell

New-Item -ItemType Directory -Force -Path "$HOME\.claude\skills"
git clone https://github.com/dannwaneri/proof-of-contribution.git `
  "$HOME\.claude\skills\proof-of-contribution"

That's the entire installation. No package to install, no configuration file to edit. The skill is a markdown file the agent reads. The CLI is a Python script that runs locally.

Verify the Install:

ls ~/.claude/skills/proof-of-contribution/

You should see SKILL.md, poc.py, assets/, and references/. If the directory is empty, the clone failed — check your internet connection and try again.

How to Scaffold Your Project

The scaffold script creates the database, config, CLI, and GitHub integration in your project root. Run it once per project.

Mac and Linux

cd /path/to/your/project
python ~/.claude/skills/proof-of-contribution/assets/scripts/poc_init.py

Windows PowerShell

cd C:\path\to\your\project
python "$HOME\.claude\skills\proof-of-contribution\assets\scripts\poc_init.py"

You should see output like this:

🔗 Proof of Contribution — init

  →  Project root: /path/to/your/project
  ✔  Created .poc/config.json
  ✔  Created .poc/.gitignore  (db excluded from git, config tracked)
  ✔  Created .poc/provenance.db  (SQLite — no extra infra needed)
  ✔  Created .github/PULL_REQUEST_TEMPLATE.md
  ✔  Created .github/workflows/poc-check.yml
  ✔  Created poc.py  (local CLI — includes import-spec command)
  ✔  Created .gitignore

✔ Proof of Contribution initialised for 'your-project'

This creates four things in your project:

your-project/
├── .poc/
│   ├── config.json      ← project settings (commit this)
│   ├── provenance.db    ← SQLite database (local only, gitignored)
│   └── .gitignore
├── .github/
│   ├── PULL_REQUEST_TEMPLATE.md
│   └── workflows/
│       └── poc-check.yml
└── poc.py               ← your local CLI

.poc/ — the tool's local data directory. config.json stores project settings and is committed to git. provenance.db is the SQLite database where attribution records and knowledge gaps are stored — local only, gitignored.
poc.py — your local CLI, copied into the project root. Run python poc.py trace, python poc.py verify, and every other command directly without a global install.
.github/PULL_REQUEST_TEMPLATE.md — a PR template with the ## 🤖 AI Provenance section pre-filled. Developers fill it in when submitting PRs that contain AI-generated code.
.github/workflows/poc-check.yml — the optional GitHub Action for PR enforcement. Installed but dormant until you push the workflow file and enable it in your repo settings.

Windows note: if the scaffold fails with a UnicodeEncodeError, the emoji in the PR template is hitting a Windows encoding limit. Open assets/scripts/poc_init.py in a text editor and find every line ending with .write_text(...). Change each one to .write_text(..., encoding="utf-8"). Save and re-run.

Verify the Scaffold Worked

python poc.py report

Expected output:

Proof of Contribution Report
────────────────────────────────────────
  Artifacts tracked    : 0
  With provenance      : 0  (0%)
  Unresolved gaps      : 0
  Resolved claims      : 0
  Human experts        : 0

Empty database, clean state. You're ready.

How to Record Your First Provenance Entry

Before we dive in here, I just want to clear something up. Earlier, I described poc.py verify as detecting Knowledge Gaps automatically — and it does. But the static analyser can only tell you that a function has no human citation. It can't tell you which human source inspired it. That knowledge lives in your head, not in the code.

poc.py add is where you supply that context. After the agent builds a file, you record the human sources you actually drew on: the GitHub discussion you read before prompting, the Stack Overflow answer that shaped the approach. Those records become the attribution chain poc.py trace surfaces — and what closes the gaps poc.py verify flags.

verify finds the gaps. add fills them.

poc.py add records attribution for a file interactively. You can run it on any AI-generated file in your project.

python poc.py add src/utils/parser.py

You'll see a prompt:

Recording provenance for: src/utils/parser.py
(Press Ctrl+C to cancel)

  Human source URL (or Enter to finish):

Enter the URL of the human-authored source that inspired the code. This could be a GitHub discussion, a Stack Overflow answer, a documentation page, a blog post, or an RFC.

  Human source URL (or Enter to finish): https://github.com/TanStack/query/discussions/123
  Author handle: tannerlinsley
  Platform (github/stackoverflow/docs/other): github
  Source title: Cursor pagination discussion
  What specific insight came from this? cursor beats offset for live-updating datasets
  Confidence HIGH/MEDIUM/LOW [MEDIUM]: HIGH
  ✔ Recorded.

Add as many sources as apply. Press Enter on a blank URL when you're done.

  Human source URL (or Enter to finish): 
✔ Provenance saved. Run: python poc.py trace src/utils/parser.py

Check What You Recorded

python poc.py trace src/utils/parser.py

Provenance trace: src/utils/parser.py
────────────────────────────────────────────────────────────
  [HIGH]  @tannerlinsley on github
          Cursor pagination discussion
          https://github.com/TanStack/query/discussions/123
          Insight: cursor beats offset for live-updating datasets

No knowledge gaps — because you recorded a source. If the file had parts with no human source, they would appear below as gaps.

See All Experts in Your Graph

Every poc.py add call stores not just the URL but the author — their handle, platform, and the specific insight they contributed. Run it across enough files, and those authors accumulate into a knowledge graph: a local record of which human experts your codebase drew from, which files their knowledge shaped, and how many artifacts trace back to their work.

poc.py experts surfaces the top contributors. On a new project, it'll be one or two entries. On a mature codebase, it becomes a map of whose knowledge is load-bearing — the people you'd want to consult if that part of the code ever needed to change.

python poc.py experts

Top Human Experts in Knowledge Graph
──────────────────────────────────────────────────────
  @tannerlinsley            github          1 artifact(s)
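If you want a feel for how that ranking could be computed, here is a rough sketch. The record layout (artifact, handle, platform) and the top_experts helper are assumptions made for illustration; the real tool reads from its SQLite database, whose schema isn't shown in this article.

```python
from collections import defaultdict


def top_experts(records: list[tuple[str, str, str]]) -> list[tuple[str, str, int]]:
    """Rank (handle, platform) pairs by how many distinct artifacts cite them."""
    artifacts_by_expert = defaultdict(set)
    for artifact, handle, platform in records:
        artifacts_by_expert[(handle, platform)].add(artifact)
    ranked = sorted(artifacts_by_expert.items(),
                    key=lambda item: (-len(item[1]), item[0]))
    return [(handle, platform, len(files)) for (handle, platform), files in ranked]


# Hypothetical attribution records: (artifact path, author handle, platform)
RECORDS = [
    ("src/utils/parser.py", "tannerlinsley", "github"),
    ("src/utils/paginator.py", "tannerlinsley", "github"),
    ("src/utils/parser.py", "juliandeangelis", "github"),
    ("src/utils/parser.py", "tannerlinsley", "github"),  # repeat citation, counted once
]

for handle, platform, count in top_experts(RECORDS):
    print(f"@{handle:<24} {platform:<14} {count} artifact(s)")
```

Counting distinct artifacts rather than raw citations keeps one heavily cited file from inflating an author's rank.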

How to Use import-spec to Seed Knowledge Gaps

This is the most important command in the tool. It connects proof-of-contribution with spec-writer and makes Knowledge Gaps deterministic.

When you use spec-writer before building a feature, it generates an ## Assumptions to review section — every implicit decision is impact-rated HIGH, MEDIUM, or LOW. The import-spec command reads that section and seeds those assumptions into the database as unresolved gaps before the agent writes a line of code.

After the agent builds, any assumption that made it into the implementation without a cited human source surfaces automatically in poc.py trace. You don't need to know which parts of the code are uncertain. The spec already told you.

Step 1 — Create a Test Spec

If you don't have a spec-writer output yet, create one manually to see how the import works.

Mac and Linux:

cat > test-spec.md << 'EOF'
## Assumptions to review

1. SQLite is sufficient for single-developer use — Impact: HIGH
   Correct this if: you need team-shared provenance

2. Filepath is the artifact identifier — Impact: MEDIUM
   Correct this if: you use content hashing instead

3. REST pattern for any future API — Impact: LOW
   Correct this if: you prefer GraphQL
EOF

Windows PowerShell:

python -c "
content = '''## Assumptions to review

1. SQLite is sufficient for single-developer use - Impact: HIGH
   Correct this if: you need team-shared provenance

2. Filepath is the artifact identifier - Impact: MEDIUM
   Correct this if: you use content hashing instead

3. REST pattern for any future API - Impact: LOW
   Correct this if: you prefer GraphQL'''
open('test-spec.md', 'w', encoding='utf-8').write(content)
print('test-spec.md created')
"

Windows note: don't use PowerShell's echo with > redirection to create spec files. Windows PowerShell 5.1 writes redirected output as UTF-16, which causes a UnicodeDecodeError when import-spec reads the file. The python -c approach above writes UTF-8 correctly.
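You can reproduce the encoding failure in a few lines of Python. The sketch below writes a UTF-16 file, shows that a strict UTF-8 read fails, and then recovers it by checking for a byte-order mark. The read_spec helper is hypothetical and is not part of poc.py; it only illustrates what is going wrong.

```python
import tempfile
from pathlib import Path


def read_spec(path: Path) -> str:
    """Read a spec as UTF-8, falling back to UTF-16 when a UTF-16 BOM is present."""
    raw = path.read_bytes()
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        return raw.decode("utf-16")
    return raw.decode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    spec = Path(tmp) / "test-spec.md"
    # Mimic a file saved as UTF-16 by a shell redirect.
    spec.write_text("## Assumptions to review\n", encoding="utf-16")
    try:
        spec.read_text(encoding="utf-8")
        strict_read_failed = False
    except UnicodeDecodeError:
        strict_read_failed = True
    recovered = read_spec(spec)

print("strict UTF-8 read failed:", strict_read_failed)
print("recovered heading:", recovered.strip())
```

If you already have a UTF-16 spec on disk, re-saving it as UTF-8 in your editor is the simpler fix.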

Step 2 — Import the Assumptions

python poc.py import-spec test-spec.md --artifact src/utils/parser.py

Spec assumptions imported — 3 Knowledge Gap(s) seeded
───────────────────────────────────────────────────────
  1. [HIGH] SQLite is sufficient for single-developer use
       Correct if: you need team-shared provenance
  2. [MEDIUM] Filepath is the artifact identifier
       Correct if: you use content hashing instead
  3. [LOW] REST pattern for any future API
       Correct if: you prefer GraphQL

  →  Bound to: src/utils/parser.py
  After the agent builds, run:
  python poc.py trace src/utils/parser.py
  python poc.py add src/utils/parser.py
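If you're curious what the parsing step might look like, here is a small sketch that reads the same spec format and produces unresolved gap records. It is an illustration under assumptions, not the code inside poc.py: the regular expressions, the parse_assumptions name, and the record fields are all mine.

```python
import re

# "1. Some claim - Impact: HIGH", accepting either a hyphen or an em dash (U+2014).
ITEM = re.compile(
    r"^\s*\d+\.\s+(?P<claim>.+?)\s+[\u2014-]\s+Impact:\s*(?P<impact>HIGH|MEDIUM|LOW)\s*$"
)
CORRECT = re.compile(r"^\s*Correct this if:\s*(?P<cond>.+?)\s*$")


def parse_assumptions(markdown: str) -> list[dict]:
    """Extract impact-rated assumptions from the '## Assumptions to review' section."""
    gaps, in_section = [], False
    for line in markdown.splitlines():
        if line.startswith("## "):
            in_section = line.strip().lower() == "## assumptions to review"
            continue
        if not in_section:
            continue
        item = ITEM.match(line)
        if item:
            gaps.append({"claim": item["claim"], "impact": item["impact"],
                         "correct_if": None, "resolved": False})
            continue
        cond = CORRECT.match(line)
        if cond and gaps:
            gaps[-1]["correct_if"] = cond["cond"]
    return gaps


SPEC = """## Assumptions to review

1. SQLite is sufficient for single-developer use - Impact: HIGH
   Correct this if: you need team-shared provenance

2. Filepath is the artifact identifier - Impact: MEDIUM
   Correct this if: you use content hashing instead
"""

for gap in parse_assumptions(SPEC):
    print(f"[{gap['impact']}] {gap['claim']} (correct if: {gap['correct_if']})")
```

Every record starts with resolved set to False. That default is the whole idea: an assumption counts as a gap until a human source closes it.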

Step 3 — Trace the Gaps

python poc.py trace src/utils/parser.py

Knowledge gaps (AI-synthesized, no human source):
  • REST pattern for any future API [Correct if: you prefer GraphQL]
  • SQLite is sufficient for single-developer use [Correct if: you need team-shared provenance]
  • Filepath is the artifact identifier [Correct if: you use content hashing instead]

  Resolve gaps: python poc.py add src/utils/parser.py

Three gaps, colour-coded by urgency. The HIGH-impact assumption — SQLite for single-developer use — appears in red. The LOW-impact one appears in green. When you run poc.py add and record a human source with an insight that overlaps the gap text, the gap auto-closes.

Preview Without Writing

python poc.py import-spec test-spec.md --dry-run

This parses the spec and prints what would be seeded without touching the database. This is useful before committing to an import.

Check the Overall Health

python poc.py report

Proof of Contribution Report
────────────────────────────────────────
  Artifacts tracked    : 1
  With provenance      : 0  (0%)
  Unresolved gaps      : 3
  Resolved claims      : 0
  Human experts        : 1
  ⚠ Less than 50% of artifacts have provenance records.
  ⚠ 3 unresolved Knowledge Gap(s).
    Run `poc.py trace ` to locate them.

How to Trace Human Attribution

poc.py trace is the command you'll use most. It shows the full human attribution chain for any file and lists any knowledge gaps — parts of the code with no traceable human source.

python poc.py trace src/utils/parser.py

A file with both attribution and gaps looks like this:

Provenance trace: src/utils/parser.py
────────────────────────────────────────────────────────────
  [HIGH]  @juliandeangelis on github
          Spec Driven Development methodology
          https://github.com/dannwaneri/spec-writer
          Insight: separate functional from technical spec

  [MEDIUM] @tannerlinsley on github
           Cursor pagination discussion
           https://github.com/TanStack/query/discussions/123
           Insight: cursor beats offset for live-updating datasets

Knowledge gaps (AI-synthesized, no human source):
  • Error retry strategy — no human source cited
  • CSV column ordering — AI chose this arbitrarily

  Resolve gaps: python poc.py add src/utils/parser.py

The human attribution section shows every cited source, colour-coded by confidence. The knowledge gaps section shows every assumption that shipped without a human citation — either seeded from a spec via import-spec, or flagged by Claude in the Provenance Block.

Resolving Gaps

Run poc.py add on any file with open gaps:

python poc.py add src/utils/parser.py

When you enter an insight that shares words with an open gap claim, the gap auto-closes. Run poc.py trace again to confirm it's resolved.
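The article doesn't spell out the exact matching rule, so treat the following as one plausible reading of "shares words": a keyword-overlap check with a small stopword list. The closes_gap function, the stopword set, and the min_shared threshold are all assumptions.

```python
import re

# Small illustrative stopword list; a real matcher would likely use a longer one.
STOPWORDS = {"a", "an", "the", "is", "for", "of", "to", "and", "or", "in", "this", "any", "no"}


def keywords(text: str) -> set[str]:
    """Lowercased alphanumeric words, minus stopwords."""
    return {word for word in re.findall(r"[a-z0-9]+", text.lower())
            if word not in STOPWORDS}


def closes_gap(insight: str, gap_claim: str, min_shared: int = 1) -> bool:
    """True when the insight and the gap claim share enough keywords."""
    return len(keywords(insight) & keywords(gap_claim)) >= min_shared


gap = "SQLite is sufficient for single-developer use"
print(closes_gap("SQLite handles single-developer workloads well", gap))
print(closes_gap("cursor beats offset for live-updating datasets", gap))
```

A practical consequence of any overlap-based rule: write the insight in the same vocabulary as the gap you intend to close, or the gap will stay open.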

How to Verify with Static Analysis

poc.py verify is the command that removes the need to trust the model's own account of its gaps. It detects Knowledge Gaps by analysing the file's actual code structure, not by asking the AI what it doesn't know.

Run it after the agent builds, once you've seeded gaps with import-spec:

python poc.py verify src/utils/parser.py

Expected output:

Verify: src/utils/parser.py
────────────────────────────────────────────────────────────
  Structural units detected : 11
  Seeded claims             : 3
  Covered by cited source   : 2
  Deterministic gaps        : 1

Deterministic Knowledge Gaps (no human source):
  • function: handle_concurrent_writes (lines 47–61)
      Seeded assumption: concurrent write handling — AI chose this arbitrarily

  Resolve: python poc.py add src/utils/parser.py

The gap shown is not something Claude admitted. It's something the analyser found by comparing the file's function list against your seeded claims. The function handle_concurrent_writes exists in the code but has no resolved human citation in the database. That's the gap.

What the Exit Codes Mean

python poc.py verify src/utils/parser.py
echo $?   # Mac/Linux

python poc.py verify src/utils/parser.py
echo $LASTEXITCODE   # Windows PowerShell

Exit code 0 — no gaps, all detected units have cited sources
Exit code 1 — gaps found, resolve with poc.py add
Exit code 2 — file not found or unsupported language

Exit code 1 integrates directly into CI pipelines. Add poc.py verify to your GitHub Action or pre-commit hook and gaps block the build before they reach production.
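Here is a minimal sketch of a wrapper you could call from a pre-commit hook or CI step. It relies only on the three exit codes listed above. The verify_files and describe_exit helpers are hypothetical, and the sketch assumes poc.py sits in the current working directory.

```python
import subprocess
import sys

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {
    0: "no gaps: all detected units have cited sources",
    1: "gaps found: resolve with poc.py add",
    2: "file not found or unsupported language",
}


def describe_exit(code: int) -> str:
    """Translate a poc.py verify exit code into a readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def verify_files(paths: list[str]) -> int:
    """Run `python poc.py verify` on each path and return the worst exit code seen."""
    worst = 0
    for path in paths:
        code = subprocess.run([sys.executable, "poc.py", "verify", path]).returncode
        print(f"{path}: {describe_exit(code)}")
        worst = max(worst, code)
    return worst


# In a hook or CI step you would end with something like:
# sys.exit(verify_files(["src/utils/parser.py"]))
```

Returning the worst code rather than stopping at the first failure means one run reports every file with gaps.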

Run it Without a Seeded Spec

If you haven't run import-spec first, verify still works — it falls back to structural analysis and surfaces every uncited function and branch as a gap:

python poc.py verify src/utils/parser.py

⚠ No spec imported — showing all uncited structural units.
  Run: python poc.py import-spec spec.md --artifact src/utils/parser.py
  for deterministic gap detection.

Deterministic Knowledge Gaps (no human source):
  • function: parse_query (lines 1–7)
  • branch: if not text (lines 2–3)
  • function: fetch_results (lines 9–12)
  ...

It's less precise than the spec-writer path — every structural unit shows rather than only the ones tied to named assumptions — but it's useful as a baseline on any file, new or old.

The `--strict` Flag

python poc.py verify src/utils/parser.py --strict

Strict mode flags every uncited structural unit as a gap even when claims are seeded. You can use it when you want zero tolerance: any function or branch without a resolved human source fails the check.

How to Enable PR Enforcement

Once poc.py trace has saved you real hours — not before — enable the GitHub Action. The distinction matters. Turning it on day one frames the tool as overhead. Turning it on after the team already finds value frames it as a standard.

git add .github/ .poc/config.json poc.py
git commit -m "chore: add proof-of-contribution"
git push

After that, every PR is checked for an ## 🤖 AI Provenance section. The scaffold already created the PR template with that section included. Developers fill it in naturally once they're already running poc.py trace locally — the template just asks them to record what they already know.

Developers who write fully human code opt out by adding 100% human-written anywhere in the PR body. The action skips the check automatically.

What the Action Checks

The action reads the PR description and looks for:

The ## 🤖 AI Provenance heading
At least one populated row in the attribution table

If the section is missing or the table is empty, the action fails and posts a comment explaining what to add. The comment includes a link to poc.py trace so the developer knows exactly where to look.
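The same two checks can be sketched in Python. This is not the workflow's actual code: the check_pr_body name is mine, and I'm assuming the attribution table is a standard Markdown pipe table with a header row and a separator row. The opt-out phrase comes from the previous section.

```python
HEADING = "## 🤖 AI Provenance"
OPT_OUT = "100% human-written"


def check_pr_body(body: str) -> tuple[bool, str]:
    """Return (passed, reason) for a PR description."""
    if OPT_OUT in body:
        return True, "skipped: author declared fully human-written code"
    if HEADING not in body:
        return False, "missing AI Provenance section"
    # Keep only the text between the heading and the next level-2 heading.
    section = body.split(HEADING, 1)[1].split("\n## ", 1)[0]
    rows = [line.strip() for line in section.splitlines()
            if line.strip().startswith("|")]
    data_rows = rows[2:]  # assumed layout: header row, separator row, then data
    populated = [row for row in data_rows
                 if any(cell.strip() for cell in row.strip("|").split("|"))]
    if not populated:
        return False, "attribution table is empty"
    return True, "ok"


body = (
    HEADING + "\n"
    "| Source | Author |\n"
    "|---|---|\n"
    "| https://github.com/TanStack/query/discussions/123 | @tannerlinsley |\n"
)
print(check_pr_body(body))
print(check_pr_body("Fixes a typo in the README."))
```

Checking the opt-out first mirrors the behaviour described above: a fully human-written PR never reaches the table check.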

Where to Go Next

Use it with spec-writer on a Real Feature

The real value of import-spec is on actual features, not test specs. If you use spec-writer, the workflow is:

/spec-writer "your feature description"

Save the output to spec.md. Then:

python poc.py import-spec spec.md --artifact src/path/to/output.py

Build the feature with your agent. Then run poc.py trace to see which assumptions made it into code with no human source. Resolve the HIGH-impact gaps first — those are the ones that will cause production incidents.

Activate the Claude Code Skill

The SKILL.md file makes Claude automatically append a Provenance Block to every generated artifact when the skill is active. The block lists human sources Claude drew from and flags what it synthesized without any traceable source.

There's no separate activation step in Claude Code. The skill is already installed at ~/.claude/skills/proof-of-contribution/, and Claude Code loads it automatically when you are in a project that has .poc/config.json.

A generated Provenance Block looks like this:

## PROOF OF CONTRIBUTION
Generated artifact: fetch_github_discussions()
Confidence: MEDIUM

## HUMAN SOURCES THAT INSPIRED THIS

[1] GitHub GraphQL API Documentation Team
    Source type: Official Docs
    URL: docs.github.com/en/graphql
    Contribution: cursor-based pagination pattern

[2] GitHub Community (multiple contributors)
    Source type: GitHub Discussions
    URL: github.com/community/community
    Contribution: "ghost" fallback for deleted accounts
                  surfaced in bug reports

## KNOWLEDGE GAPS (AI synthesized, no human cited)
- Error handling / retry logic
- Rate limit strategy

## RECOMMENDED HUMAN EXPERTS TO CONSULT
- github.com/octokit community for pagination

The Knowledge Gaps section is the part no other tool produces. It's where AI admits what it synthesized without a traceable human source — before that gap becomes a production incident.

Upgrade When You Outgrow SQLite

The default database is SQLite — local only, no infra required. When you need team sharing or graph queries, the references/ directory in the repo has migration guides:

Need	File
Team sharing a provenance DB	`references/relational-schema.md`
Graph traversal queries	`references/neo4j-implementation.md`
Semantic web / interoperability	`references/jsonld-schema.md`

Manual Tracking vs. proof-of-contribution

	Manual tracking	proof-of-contribution
Finding who wrote the code	Search Slack, ask the team, dig through commits	`poc.py trace` — thirty seconds
Knowing which parts the AI guessed	You don't, until it breaks in production	Knowledge Gaps section — surfaced before the code ships
Detecting gaps after the build	Code review, if someone notices	`poc.py verify` — static analysis, zero API calls
Enforcing attribution on PRs	Honor system	GitHub Action fails the PR if attribution is missing
Connecting to your spec	Copy-paste assumptions into comments manually	`poc.py import-spec` seeds them as tracked claims automatically
Infrastructure required	None (usually a spreadsheet or nothing)	None — SQLite, pure Python, no paid services

The tool doesn't replace code review. It gives code review the context it needs to catch the right things.

The archaeology scenario — two days tracing a bug through dead-end commit messages — takes thirty seconds with poc.py trace. The code still has gaps, and it always will. But now you know where they are.

Built by Daniel Nwaneri. The spec-writer skill that feeds import-spec is at github.com/dannwaneri/spec-writer. The full proof-of-contribution repo is at github.com/dannwaneri/proof-of-contribution.

How to Build a Cost-Efficient AI Agent with Tiered Model Routing

Daniel Nwaneri — Wed, 08 Apr 2026 22:59:09 +0000

Most AI agent tutorials make the same mistake: they route every task to the most expensive model available.

A character count doesn't need GPT-4. A presence check doesn't need Sonnet. A regex doesn't need anything except Python.

The mistake isn't using AI — it's not knowing when to stop using it.

This tutorial shows you how to build a tiered routing system that sends each task to the cheapest tier that can handle it. The pattern is called the cost curve. It came out of a comment thread on a DEV.to article, was implemented by three developers over a weekend, and cut the per-URL cost of a real SEO audit agent from $0.006 to effectively $0 for most pages.

By the end, you'll have a working cost_curve.py module you can drop into any agent project.

What You'll Build

A three-tier routing function that:

Runs deterministic Python checks first — zero API cost
Escalates to Claude Haiku only for genuinely ambiguous cases — ~$0.0001 per call
Escalates to Claude Sonnet only when semantic judgment is required — ~$0.006 per call
Falls back gracefully when any tier fails
Returns a consistent result schema regardless of which tier handled the request

The full implementation is part of dannwaneri/seo-agent, an open-core SEO audit agent. The cost curve module is the premium routing layer, and the principle applies to any agent with mixed-complexity tasks.

Prerequisites

Python 3.11 or higher
An Anthropic API key
Basic familiarity with Python and the Claude API

The Problem with Calling Claude on Everything
The Cost Curve Explained
Project Setup
Tier 1: Deterministic Python
Tier 2: Claude Haiku for Ambiguous Cases
Tier 3: Claude Sonnet for Semantic Judgment
The Router: audit_url()
Graceful Fallback
Testing the Cost Curve
Applying This Pattern to Your Agent

The Problem with Calling Claude on Everything

Here's what most agent code looks like:

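def audit_url(snapshot: dict) -> dict:
    # Every URL goes straight to Sonnet, however trivial the check.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,  # required by the Messages API
        messages=[{"role": "user", "content": build_prompt(snapshot)}]
    )
    return parse_response(response)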

This works. It also calls Sonnet for every URL in the list — including the ones where the title is 142 characters long and the answer is obviously FAIL without any model involvement.

Claude Sonnet 4 is priced at $3 per million input tokens and $15 per million output tokens. A typical page snapshot is around 500 input tokens. That's $0.0015 per URL just for input — before output tokens. Across a 20-URL weekly audit, the total is around $0.12. Not expensive. But most of those pages have mechanical SEO issues: missing descriptions, titles over 60 characters, no canonical tag. A character count catches all of that. You don't need a model.
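The arithmetic is worth seeing once. The sketch below uses the prices and the 500-token snapshot from the paragraph above. The 300 output tokens per response is my assumption, chosen because it reproduces the roughly $0.006 per-call figure used in this tutorial.

```python
SONNET_INPUT_PER_MTOK = 3.00    # USD per million input tokens, as quoted above
SONNET_OUTPUT_PER_MTOK = 15.00  # USD per million output tokens, as quoted above


def call_cost(input_tokens: int, output_tokens: int,
              input_price: float = SONNET_INPUT_PER_MTOK,
              output_price: float = SONNET_OUTPUT_PER_MTOK) -> float:
    """Cost in USD of one API call at per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


input_only = call_cost(500, 0)   # the 500-token snapshot from the text
per_url = call_cost(500, 300)    # 300 output tokens is an assumption
weekly = 20 * per_url            # the 20-URL weekly audit

print(f"input only: ${input_only:.4f}")  # input only: $0.0015
print(f"per URL:    ${per_url:.4f}")     # per URL:    $0.0060
print(f"20 URLs:    ${weekly:.2f}")      # 20 URLs:    $0.12
```

Notice that output tokens account for three quarters of the per-URL cost under this assumption, which is why skipping the call entirely saves more than trimming the prompt.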

The cost curve fixes this by routing based on what the task actually requires, not on what the model is capable of.

The Cost Curve Explained

In the cost curve, we have three tiers, three tools, and three price points:

Tier 1 — Deterministic Python. Cost: $0. Check title length, description length, H1 count, canonical presence. These are not judgment calls. They're string operations. If title length > 60, FAIL. No model needed.

Tier 2 — Claude Haiku. Cost: ~$0.0001 per call. Title present but only 4 characters long. Description present but only 30 characters. Status code is a redirect. These pass the mechanical audit but something is off. Haiku is fast and cheap enough that escalating ambiguous cases costs less than the debugging time you'd spend on false positives.

Tier 3 — Claude Sonnet. Cost: ~$0.006 per call. Pages Haiku flags as needing semantic judgment. "This title passes length but reads like a navigation label." "This description duplicates the title verbatim." Sonnet earns its cost on genuinely hard cases — not on every URL in the list.

The routing decision happens before any API call. The result schema is identical regardless of which tier handled the request.
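A shared schema is only useful if you can check it. Here is a small sketch of a validator you could run on any tier's output. The key names match the result skeleton built in the Tier 1 section of this tutorial; the schema_problems helper itself is hypothetical and is not part of the seo-agent module.

```python
REQUIRED_KEYS = {
    "url", "final_url", "status_code", "title", "description",
    "h1", "canonical", "flags", "human_review", "audited_at", "method",
}
CHECKED_FIELDS = ("title", "description", "h1", "canonical")


def schema_problems(result: dict) -> list[str]:
    """List schema violations; an empty list means the result is well-formed."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(result))]
    for field in CHECKED_FIELDS:
        value = result.get(field)
        if isinstance(value, dict) and value.get("status") in ("PASS", "FAIL"):
            continue
        if field in result:
            problems.append(f"{field}: status must be PASS or FAIL")
    return problems


example = {
    "url": "https://example.com/",
    "final_url": "https://example.com/",
    "status_code": 200,
    "title": {"value": "Home", "length": 4, "status": "PASS"},
    "description": {"value": None, "length": 0, "status": "FAIL"},
    "h1": {"count": 1, "value": "Home", "status": "PASS"},
    "canonical": {"value": "https://example.com/", "status": "PASS"},
    "flags": ["Meta description is missing"],
    "human_review": False,
    "audited_at": "2026-01-01T00:00:00+00:00",
    "method": "deterministic",
}

print(schema_problems(example))  # []
```

A check like this matters most for the model-backed tiers, where the result comes from parsed model output rather than from your own code.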

Project Setup

mkdir cost-curve-demo && cd cost-curve-demo
pip install anthropic

Set your API key:

# macOS/Linux
export ANTHROPIC_API_KEY="sk-ant-..."

# Windows PowerShell
$env:ANTHROPIC_API_KEY = "sk-ant-..."

Create cost_curve.py — you'll build this module step by step.

Tier 1: Deterministic Python

Tier 1 runs first on every URL. It checks four fields using only Python string operations. There's no API call, no latency, and no cost.

import json
import logging
import os
import re
from datetime import datetime, timezone

import anthropic

logger = logging.getLogger(__name__)

REDIRECT_CODES = {301, 302, 307, 308}

# Fields that trigger Tier 2 escalation
# Title or description present but suspiciously short
AMBIGUOUS_TITLE_MAX = 10   # chars — present but too short to be real
AMBIGUOUS_DESC_MAX = 50    # chars — present but too short to be useful


def _now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def _build_result(snapshot: dict, method: str) -> dict:
    """Base result skeleton — same schema regardless of tier."""
    return {
        "url": snapshot.get("final_url", ""),
        "final_url": snapshot.get("final_url", ""),
        "status_code": snapshot.get("status_code"),
        "title": {"value": None, "length": 0, "status": "PASS"},
        "description": {"value": None, "length": 0, "status": "PASS"},
        "h1": {"count": 0, "value": None, "status": "PASS"},
        "canonical": {"value": None, "status": "PASS"},
        "flags": [],
        "human_review": False,
        "audited_at": _now_iso(),
        "method": method,
        "needs_tier3": False,
    }


def tier1_check(snapshot: dict) -> dict:
    """
    Pure Python SEO checks. Zero API calls.

    Returns a result dict with method="deterministic".
    Sets needs_tier3=False always — Tier 1 never escalates to Tier 3 directly.
    Escalation to Tier 2 is decided by the router, not here.
    """
    result = _build_result(snapshot, "deterministic")

    title = snapshot.get("title") or ""
    description = snapshot.get("meta_description") or ""
    h1s = snapshot.get("h1s") or []
    canonical = snapshot.get("canonical") or ""

    # Title check
    result["title"]["value"] = title or None
    result["title"]["length"] = len(title)
    if not title or len(title) > 60:
        result["title"]["status"] = "FAIL"
        msg = "Title is missing" if not title else f"Title is {len(title)} characters (max 60)"
        result["flags"].append(msg)

    # Description check
    result["description"]["value"] = description or None
    result["description"]["length"] = len(description)
    if not description or len(description) > 160:
        result["description"]["status"] = "FAIL"
        msg = "Meta description is missing" if not description else f"Meta description is {len(description)} characters (max 160)"
        result["flags"].append(msg)

    # H1 check
    result["h1"]["count"] = len(h1s)
    result["h1"]["value"] = h1s[0] if h1s else None
    if len(h1s) == 0:
        result["h1"]["status"] = "FAIL"
        result["flags"].append("H1 tag is missing")
    elif len(h1s) > 1:
        result["h1"]["status"] = "FAIL"
        result["flags"].append(f"Multiple H1 tags found ({len(h1s)})")

    # Canonical check
    result["canonical"]["value"] = canonical or None
    if not canonical:
        result["canonical"]["status"] = "FAIL"
        result["flags"].append("Canonical tag is missing")

    return result

The key design decision: tier1_check() never decides whether to escalate. It just runs the checks and returns. The router decides escalation based on the result.

Tier 2: Claude Haiku for Ambiguous Cases

Tier 2 runs when a page passes Tier 1's mechanical checks but the result might still need a second look: a 4-character title that is present but clearly wrong, a 30-character description that's technically there but useless, or a redirect status that needs a human-readable explanation.

Haiku is the right model here. It's fast, cheap ($1 input / $5 output per million tokens), and sufficient for triage-level judgment. The prompt asks a narrow question: is this ambiguous enough to need Sonnet?

def tier2_check(snapshot: dict) -> dict:
    """
    Claude Haiku call for ambiguous cases.

    Returns result with method="haiku".
    Sets needs_tier3=True if Haiku determines the case needs semantic judgment.
    Falls back to Tier 1 result on API error.
    """
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise OSError("ANTHROPIC_API_KEY is not set.")

    client = anthropic.Anthropic(api_key=api_key)

    title = snapshot.get("title") or ""
    description = snapshot.get("meta_description") or ""
    status_code = snapshot.get("status_code")

    prompt = f"""You are an SEO auditor doing a quick triage check.

Page data:
- Title: {repr(title)} ({len(title)} chars)
- Meta description: {repr(description)} ({len(description)} chars)
- Status code: {status_code}

Consider these two questions:
1. Does this page need semantic judgment beyond simple length/presence checks? 
   (e.g. title is present but clearly wrong, description is present but meaningless)
2. Is the status code a redirect that needs investigation?

Respond in this exact JSON format and nothing else:
{{"needs_tier3": true_or_false, "reason": "one sentence explanation"}}"""

    try:
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=150,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = response.content[0].text.strip()
        # Strip markdown fences if present
        if raw.startswith("```"):
            lines = raw.splitlines()
            raw = "\n".join(lines[1:-1] if lines[-1].strip() == "```" else lines[1:])
        parsed = json.loads(raw)

        result = _build_result(snapshot, "haiku")
        # Copy Tier 1 field checks — Haiku doesn't redo those
        t1 = tier1_check(snapshot)
        result["title"] = t1["title"]
        result["description"] = t1["description"]
        result["h1"] = t1["h1"]
        result["canonical"] = t1["canonical"]
        result["flags"] = t1["flags"]
        result["needs_tier3"] = parsed.get("needs_tier3", False)
        if result["needs_tier3"]:
            result["flags"].append(f"Escalated to Tier 3: {parsed.get('reason', '')}")

        return result

    except Exception as exc:
        logger.warning("[tier2] Haiku API error: %s — falling back to Tier 1 result", exc)
        fallback = tier1_check(snapshot)
        fallback["method"] = "haiku-fallback"
        return fallback

The fallback is the critical piece. If Haiku fails — rate limit, network error, malformed response — the function returns the Tier 1 result rather than crashing. The audit continues. The URL gets flagged with method="haiku-fallback" so you can identify it later.
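The fence-stripping and fallback logic is easy to test in isolation if you pull it into a helper. This is a sketch of one way to do that, not a change to the module above: strip_fences and parse_model_json are hypothetical names, and the default dict stands in for whatever Tier 1 result you would fall back to.

```python
import json

FENCE = "`" * 3  # three backticks, built this way so the example embeds cleanly


def strip_fences(raw: str) -> str:
    """Remove a surrounding Markdown code fence, if the reply has one."""
    raw = raw.strip()
    if not raw.startswith(FENCE):
        return raw
    lines = raw.splitlines()[1:]      # drop the opening fence line
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]            # drop the closing fence line
    return "\n".join(lines)


def parse_model_json(raw: str, default: dict) -> dict:
    """Parse a model reply as a JSON object, returning default on any mismatch."""
    try:
        parsed = json.loads(strip_fences(raw))
    except json.JSONDecodeError:
        return default
    return parsed if isinstance(parsed, dict) else default


reply = FENCE + 'json\n{"needs_tier3": true, "reason": "title reads like a nav label"}\n' + FENCE
print(parse_model_json(reply, default={"needs_tier3": False}))
print(parse_model_json("Sorry, I can't help with that.", default={"needs_tier3": False}))
```

Defaulting needs_tier3 to False on a malformed reply keeps the cheap path cheap: a parsing failure never escalates a URL to the most expensive tier.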

Tier 3: Claude Sonnet for Semantic Judgment

Tier 3 is where the full extraction prompt runs. This is the same call you'd make in a naïve implementation — the difference is that only a small fraction of URLs reach this tier.

def tier3_check(snapshot: dict) -> dict:
    """
    Claude Sonnet call for semantic judgment.

    Returns result with method="sonnet".
    This is the full extraction prompt — same as calling the model directly.
    """
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise OSError("ANTHROPIC_API_KEY is not set.")

    client = anthropic.Anthropic(api_key=api_key)

    prompt = f"""You are an SEO auditor. Analyze this page snapshot and return ONLY a JSON object.
No prose. No explanation. No markdown fences. Raw JSON only.

Page data:
- URL: {snapshot.get('final_url')}
- Status code: {snapshot.get('status_code')}
- Title: {snapshot.get('title')}
- Meta description: {snapshot.get('meta_description')}
- H1 tags: {snapshot.get('h1s')}
- Canonical: {snapshot.get('canonical')}

Return this exact schema:
{{
  "url": "string",
  "final_url": "string",
  "status_code": number,
  "title": {{"value": "string or null", "length": number, "status": "PASS or FAIL"}},
  "description": {{"value": "string or null", "length": number, "status": "PASS or FAIL"}},
  "h1": {{"count": number, "value": "string or null", "status": "PASS or FAIL"}},
  "canonical": {{"value": "string or null", "status": "PASS or FAIL"}},
  "flags": ["array of strings describing specific issues"],
  "human_review": false,
  "audited_at": "ISO timestamp"
}}

PASS/FAIL rules:
- title: FAIL if null or length > 60 characters, or if present but clearly not a real title
- description: FAIL if null or length > 160 characters, or if present but meaningless
- h1: FAIL if count is 0 or count > 1
- canonical: FAIL if null
- audited_at: use current UTC time"""

    try:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = response.content[0].text.strip()
        if raw.startswith("```"):
            lines = raw.splitlines()
            raw = "\n".join(lines[1:-1] if lines[-1].strip() == "```" else lines[1:])

        result = json.loads(raw)
        result["method"] = "sonnet"
        result["needs_tier3"] = False
        return result

    except Exception as exc:
        logger.warning("[tier3] Sonnet API error: %s — falling back to Tier 1 result", exc)
        fallback = tier1_check(snapshot)
        fallback["method"] = "sonnet-fallback"
        return fallback

Note the two clauses in the Tier 3 rules that Tier 1 cannot check, because Tier 1 is deterministic and has no prompt: "or if present but clearly not a real title" and "or if present but meaningless". That's the semantic judgment Haiku identified as needed. Tier 3 acts on it.

The Router: audit_url()

The router is the public interface. Everything else is an implementation detail.

def audit_url(snapshot: dict, tiered: bool = False) -> dict:
    """
    Route a page snapshot through the appropriate audit tier.

    Args:
        snapshot: Page data from browser.py — must contain final_url,
                  status_code, title, meta_description, h1s, canonical.
        tiered: If False, delegates directly to Tier 3 (Sonnet).
                If True, routes through the cost curve.

    Returns:
        Audit result dict with method field indicating which tier ran.
    """
    if not tiered:
        # Non-tiered mode: call Sonnet directly, same as v1 behavior
        return tier3_check(snapshot)

    # Tier 1: always runs first
    t1_result = tier1_check(snapshot)

    # Check if escalation to Tier 2 is warranted
    title = snapshot.get("title") or ""
    description = snapshot.get("meta_description") or ""
    status_code = snapshot.get("status_code")

    needs_tier2 = (
        # Title present but suspiciously short
        (title and len(title) < AMBIGUOUS_TITLE_MAX) or
        # Description present but suspiciously short
        (description and len(description) < AMBIGUOUS_DESC_MAX) or
        # Redirect status — may need explanation
        (status_code in REDIRECT_CODES)
    )

    if not needs_tier2:
        # Tier 1 result is definitive — return without any API call
        return t1_result

    # Tier 2: Haiku triage
    t2_result = tier2_check(snapshot)

    if not t2_result.get("needs_tier3", False):
        # Haiku determined no semantic judgment needed
        return t2_result

    # Tier 3: Sonnet for semantic judgment
    return tier3_check(snapshot)

The router logic is explicit and readable. Each decision point is a named condition. When tiered=False, behavior is identical to the v1 naive implementation — this is the backward compatibility guarantee that lets you add the cost curve incrementally without breaking existing audits.

Graceful Fallback

The fallback pattern appears in both Tier 2 and Tier 3. It's worth making explicit:

# Pattern used in both tier2_check() and tier3_check()
except Exception as exc:
    logger.warning("[tierN] API error: %s — falling back to Tier 1 result", exc)
    fallback = tier1_check(snapshot)
    fallback["method"] = "tierN-fallback"
    return fallback

Three things this does:

Logs the error with enough context to debug later
Returns a valid result — the Tier 1 deterministic check always runs regardless
Tags the result with the fallback method so you can filter these in your report

An agent that crashes on API errors is not production-ready. An agent that degrades gracefully and continues is.
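
Because the same except block appears in two functions, one option is to factor it into a decorator. The sketch below is an illustration, not the code from the repository: with_tier1_fallback is a hypothetical name, and tier1_check here is a stand-in stub for the real deterministic check.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def tier1_check(snapshot: dict) -> dict:
    """Stand-in for the real deterministic check (assumption for this sketch)."""
    return {"url": snapshot.get("final_url"), "method": "deterministic", "flags": []}


def with_tier1_fallback(tier_name: str):
    """Hypothetical decorator: on any exception, return the Tier 1 result instead."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(snapshot: dict) -> dict:
            try:
                return func(snapshot)
            except Exception as exc:
                logger.warning("[%s] API error: %s; falling back to Tier 1", tier_name, exc)
                fallback = tier1_check(snapshot)
                fallback["method"] = f"{tier_name}-fallback"
                return fallback
        return wrapper
    return decorator


@with_tier1_fallback("haiku")
def flaky_tier2(snapshot: dict) -> dict:
    # Simulates a model call that fails, for example on a rate limit
    raise RuntimeError("rate limit")


result = flaky_tier2({"final_url": "https://example.com/page"})
print(result["method"])  # haiku-fallback
```

The trade-off is that the fallback becomes less visible at the call site, which is presumably why the tutorial keeps the except block inline.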

Testing the Cost Curve

Create test_cost_curve.py to verify routing behavior without live API calls:

import json
from unittest import mock

from cost_curve import audit_url, tier1_check


def make_snapshot(title="Normal Title Under 60 Chars",
                  description="A normal meta description that is under 160 characters and describes the page content well.",
                  h1s=["Single H1"],
                  canonical="https://example.com/page",
                  status_code=200,
                  final_url="https://example.com/page"):
    return {
        "title": title,
        "meta_description": description,
        "h1s": h1s,
        "canonical": canonical,
        "status_code": status_code,
        "final_url": final_url,
    }


def test_clean_page_returns_tier1_no_api_calls():
    """Clean page: all checks pass deterministically — no API call."""
    snapshot = make_snapshot()
    with mock.patch("anthropic.Anthropic") as mock_client:
        result = audit_url(snapshot, tiered=True)
        assert result["method"] == "deterministic"
        mock_client.assert_not_called()
    print("PASS: clean page → Tier 1, zero API calls")


def test_long_title_returns_tier1_fail_no_api_call():
    """Title >60 chars: FAIL from Tier 1, no API call."""
    snapshot = make_snapshot(title="A" * 70)
    with mock.patch("anthropic.Anthropic") as mock_client:
        result = audit_url(snapshot, tiered=True)
        assert result["method"] == "deterministic"
        assert result["title"]["status"] == "FAIL"
        mock_client.assert_not_called()
    print("PASS: title >60 → Tier 1 FAIL, zero API calls")


def test_suspiciously_short_title_escalates_to_tier2():
    """Title present but 3 chars: escalates to Tier 2."""
    snapshot = make_snapshot(title="SEO")  # 3 chars — under AMBIGUOUS_TITLE_MAX
    mock_response = mock.MagicMock()
    mock_response.content = [mock.MagicMock(
        text='{"needs_tier3": false, "reason": "title is short but not ambiguous"}'
    )]
    with mock.patch("anthropic.Anthropic") as mock_client:
        mock_client.return_value.messages.create.return_value = mock_response
        result = audit_url(snapshot, tiered=True)
        assert result["method"] == "haiku"
        assert mock_client.return_value.messages.create.call_count == 1
    print("PASS: short title → Tier 2 (Haiku called once)")


def test_tiered_false_calls_sonnet_directly():
    """tiered=False: Sonnet called regardless of snapshot content."""
    snapshot = make_snapshot()  # clean page, would be Tier 1 in tiered mode
    mock_response = mock.MagicMock()
    mock_response.content = [mock.MagicMock(text=json.dumps({
        "url": "https://example.com/page",
        "final_url": "https://example.com/page",
        "status_code": 200,
        "title": {"value": "Normal Title Under 60 Chars", "length": 27, "status": "PASS"},
        "description": {"value": "desc", "length": 4, "status": "PASS"},
        "h1": {"count": 1, "value": "Single H1", "status": "PASS"},
        "canonical": {"value": "https://example.com/page", "status": "PASS"},
        "flags": [],
        "human_review": False,
        "audited_at": "2026-04-01T00:00:00+00:00",
    }))]
    with mock.patch("anthropic.Anthropic") as mock_client:
        mock_client.return_value.messages.create.return_value = mock_response
        result = audit_url(snapshot, tiered=False)
        assert result["method"] == "sonnet"
        assert mock_client.return_value.messages.create.call_count == 1
    print("PASS: tiered=False → Sonnet called directly")


def test_haiku_api_failure_falls_back_to_tier1():
    """Haiku failure: falls back to Tier 1 result, no crash."""
    snapshot = make_snapshot(title="SEO")  # triggers Tier 2
    with mock.patch("anthropic.Anthropic") as mock_client:
        mock_client.return_value.messages.create.side_effect = Exception("rate limit")
        result = audit_url(snapshot, tiered=True)
        assert result["method"] == "haiku-fallback"
    print("PASS: Haiku failure → fallback to Tier 1, no crash")


if __name__ == "__main__":
    test_clean_page_returns_tier1_no_api_calls()
    test_long_title_returns_tier1_fail_no_api_call()
    test_suspiciously_short_title_escalates_to_tier2()
    test_tiered_false_calls_sonnet_directly()
    test_haiku_api_failure_falls_back_to_tier1()
    print("\nAll tests passed.")

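Run it. The API calls are mocked, but tier3_check() still verifies that ANTHROPIC_API_KEY is set before it creates the client, so export any non-empty placeholder value first: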

python test_cost_curve.py

Expected output:

PASS: clean page → Tier 1, zero API calls
PASS: title >60 → Tier 1 FAIL, zero API calls
PASS: short title → Tier 2 (Haiku called once)
PASS: tiered=False → Sonnet called directly
PASS: Haiku failure → fallback to Tier 1, no crash

All tests passed.

Applying This Pattern to Your Agent

The cost curve is not SEO-specific. Any agent with mixed-complexity tasks can use it.

The principle: classify tasks by what they actually require before deciding which model to invoke.

Customer support agent:

Tier 1: keyword matching for known FAQ topics — no model
Tier 2: Haiku for intent classification on ambiguous queries
Tier 3: Sonnet for complex complaints requiring judgment

Code review agent:

Tier 1: lint rules, syntax checks — no model
Tier 2: Haiku for common pattern detection
Tier 3: Sonnet for architectural review

Content moderation agent:

Tier 1: blocklist matching — no model
Tier 2: Haiku for borderline cases
Tier 3: Sonnet for context-dependent judgment

The implementation pattern is the same in all three cases. The audit_url() router becomes route_task(). The tier functions change their prompts and escalation conditions. The fallback logic stays identical.

The key question to ask before writing any agent code: what fraction of my inputs are mechanically solvable? That fraction stays in Tier 1 and costs nothing. The cost curve routes the rest to Tier 2 or Tier 3.
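
To make the generalization concrete, here is a minimal sketch of what a route_task() router could look like for the customer support case. Everything in it is hypothetical: the function names, the FAQ table, and the two model tiers, which are plain stubs rather than real API calls.

```python
from typing import Callable

Handler = Callable[[dict], dict]


def route_task(task: dict, tier1: Handler, needs_tier2: Callable[[dict], bool],
               tier2: Handler, tier3: Handler) -> dict:
    """Generic three-tier router: deterministic first, cheap model next, strong model last."""
    baseline = tier1(task)
    if not needs_tier2(task):
        return baseline  # solved mechanically, no model call

    try:
        triage = tier2(task)
    except Exception:
        return {**baseline, "method": "tier2-fallback"}
    if not triage.get("needs_tier3", False):
        return triage

    try:
        return tier3(task)
    except Exception:
        return {**baseline, "method": "tier3-fallback"}


# Toy customer-support tiers. The two model tiers are stubs, not real API calls.
FAQ = {"refund": "See the refund policy page.", "password": "Use the password reset link."}


def faq_lookup(task: dict) -> dict:
    text = task["text"].lower()
    answer = next((a for kw, a in FAQ.items() if kw in text), None)
    return {"answer": answer, "method": "deterministic"}


def no_faq_match(task: dict) -> bool:
    return faq_lookup(task)["answer"] is None


def stub_triage(task: dict) -> dict:
    return {"answer": None, "method": "tier2", "needs_tier3": "damaged" in task["text"].lower()}


def stub_judgment(task: dict) -> dict:
    return {"answer": "Escalated to a detailed reply.", "method": "tier3"}


tiers = (faq_lookup, no_faq_match, stub_triage, stub_judgment)
easy = route_task({"text": "How do I get a refund?"}, *tiers)
hard = route_task({"text": "My order arrived damaged twice."}, *tiers)
print(easy["method"], hard["method"])  # deterministic tier3
```

The router itself never mentions SEO or support. Only the four callables change between agents, which mirrors the claim above that the fallback logic stays identical.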

Wrapping Up

The full implementation — including the SEO audit agent that uses this module in production — is at dannwaneri/seo-agent. The core/ directory is MIT licensed. The tiered routing lives in premium/cost_curve.py.

This tutorial is the companion piece to I Was Paying $0.006 Per URL for SEO Audits Until I Realized Most Needed $0 on DEV.to, which covers the architecture decisions behind the cost curve.

How to Build a Local SEO Audit Agent with Browser Use and Claude API

Daniel Nwaneri — Mon, 30 Mar 2026 23:37:08 +0000

Every digital marketing agency has someone whose job involves opening a spreadsheet, visiting each client URL, checking the title tag, meta description, and H1, noting broken links, and pasting everything into a report. Then doing it again next week.

That work is deterministic. An agent can do it.

In this tutorial, you'll build a local SEO audit agent from scratch using Python, Browser Use, and the Claude API. The agent visits real pages in a visible browser window, extracts SEO signals using Claude, checks for broken links asynchronously, handles edge cases with a human-in-the-loop pause, and writes a structured report — all resumable if interrupted.

By the end, you'll have a working agent you can run against any list of URLs. It costs less than $0.01 per URL to run.

What You'll Build

A seven-module Python agent that:

Reads a URL list from a CSV file
Visits each URL in a real Chromium browser (not a headless scraper)
Extracts title, meta description, H1s, and canonical tag via Claude API
Checks for broken links asynchronously using httpx
Detects edge cases (404s, login walls, redirects) and pauses for human input
Writes results to report.json incrementally — safe to interrupt and resume
Generates a plain-English report-summary.txt on completion

The full code is on GitHub at dannwaneri/seo-agent.

Prerequisites

Python 3.11 or higher
An Anthropic API key (get one at console.anthropic.com)
Windows, macOS, or Linux
Basic familiarity with Python and the command line

Why Browser Use Instead of a Scraper
Project Structure
Setup
Module 1: State Management
Module 2: Browser Integration
Module 3: Claude Extraction Layer
Module 4: Broken Link Checker
Module 5: Human-in-the-Loop
Module 6: Report Writer
Module 7: The Main Loop
Running the Agent
Scheduling for Agency Use
What the Results Look Like

Why Browser Use Instead of a Scraper

The standard approach to SEO auditing is to fetch page HTML with requests and parse it with BeautifulSoup. That works on static pages. It breaks on JavaScript-rendered content, misses dynamically injected meta tags, and fails entirely on authenticated pages.

Browser Use (84,000+ GitHub stars, MIT license) takes a different approach. It controls a real Chromium browser, reads the DOM after JavaScript executes, and exposes the page through Playwright's accessibility tree. The agent sees what a human would see.

The practical difference: a requests-based scraper might miss a meta description injected by a React component. Browser Use won't.

The other difference worth naming: Browser Use reads pages semantically. A Playwright script breaks when a button's CSS class changes from btn-primary to button-main. Browser Use identifies it's still a "Submit" button and acts accordingly. The extraction logic lives in the Claude prompt, not in brittle CSS selectors.

Project Structure

seo-agent/
├── index.py          # Main audit loop
├── browser.py        # Browser Use / Playwright page driver
├── extractor.py      # Claude API extraction layer
├── linkchecker.py    # Async broken link checker
├── hitl.py           # Human-in-the-loop pause logic
├── reporter.py       # Report writer
├── state.py          # State persistence (resume on interrupt)
├── input.csv         # Your URL list
├── requirements.txt
├── .env.example
└── .gitignore

Setup

Create a project folder and install dependencies:

mkdir seo-agent && cd seo-agent
pip install browser-use anthropic playwright httpx
playwright install chromium

Create input.csv with your URLs:

url
https://example.com
https://example.com/about
https://example.com/contact

Create .env.example:

ANTHROPIC_API_KEY=your-key-here

Set your API key as an environment variable before running:

# macOS/Linux
export ANTHROPIC_API_KEY="sk-ant-..."

# Windows PowerShell
$env:ANTHROPIC_API_KEY = "sk-ant-..."

Create .gitignore:

state.json
report.json
report-summary.txt
.env
__pycache__/
*.pyc

Module 1: State Management

The agent needs to track which URLs it has already audited. If the run is interrupted — power cut, keyboard interrupt, network error — it should resume from where it stopped, not start over.

state.py handles this with a flat JSON file:

import json
import os

STATE_FILE = os.path.join(os.path.dirname(__file__), "state.json")

_DEFAULT_STATE = {"audited": [], "pending": [], "needs_human": []}


def load_state() -> dict:
    if not os.path.exists(STATE_FILE):
        save_state(_DEFAULT_STATE.copy())
    with open(STATE_FILE, encoding="utf-8") as f:
        return json.load(f)


def save_state(state: dict) -> None:
    with open(STATE_FILE, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)


def is_audited(url: str) -> bool:
    return url in load_state()["audited"]


def mark_audited(url: str) -> None:
    state = load_state()
    if url not in state["audited"]:
        state["audited"].append(url)
    save_state(state)


def add_to_needs_human(url: str) -> None:
    state = load_state()
    if url not in state["needs_human"]:
        state["needs_human"].append(url)
    save_state(state)

The design is intentional: mark_audited() is called immediately after a URL is processed and written to the report. If the agent crashes mid-run, it loses at most one URL's work.
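
One caveat: save_state() rewrites state.json in place, so a crash in the middle of the write could leave a truncated file. A common hardening, which is not part of the original module, is to write to a temporary file and swap it in with os.replace(). The rename is atomic on POSIX systems when both paths are on the same filesystem. The name save_state_atomic and the demo path below are assumptions for this sketch.

```python
import json
import os
import tempfile


def save_state_atomic(state: dict, path: str) -> None:
    """Write JSON to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk before the swap
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "state.json")
save_state_atomic({"audited": ["https://example.com"], "pending": [], "needs_human": []}, demo_path)
with open(demo_path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded["audited"])
```

With this variant, a reader of state.json sees either the old complete file or the new complete file, never a half-written one.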

Module 2: Browser Integration

browser.py does the actual page navigation. It uses Playwright directly (which Browser Use installs as a dependency) to open a visible Chromium window, navigate to the URL, capture HTTP status and redirect information, and extract the raw SEO signals from the DOM.

The key design decisions:

Visible browser, not headless. Set headless=False so you can watch the agent work. This matters for the demo and for debugging.

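Status capture via response listener. page.goto() does not raise on 4xx/5xx responses; it only raises on failures such as timeouts or network errors. The on("response", ...) handler records the status of the first response the page receives, so a redirected URL is reported as 301/302 rather than as the 200 of the page it lands on.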

2-second delay between visits. Prevents triggering rate limiting or bot detection on agency client sites.

Here is the core navigation function:

import asyncio
import sys
import time
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

TIMEOUT = 20_000  # 20 seconds


def fetch_page(url: str) -> dict:
    result = {
        "final_url": url,
        "status_code": None,
        "title": None,
        "meta_description": None,
        "h1s": [],
        "canonical": None,
        "raw_links": [],
    }

    first_status = {"code": None}

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()

        def on_response(response):
            if first_status["code"] is None:
                first_status["code"] = response.status

        page.on("response", on_response)

        try:
            page.goto(url, wait_until="domcontentloaded", timeout=TIMEOUT)
            result["status_code"] = first_status["code"] or 200
            result["final_url"] = page.url

            # Extract SEO signals from DOM
            result["title"] = page.title() or None
            result["meta_description"] = page.evaluate(
                "() => { const m = document.querySelector('meta[name=\"description\"]'); "
                "return m ? m.getAttribute('content') : null; }"
            )
            result["h1s"] = page.evaluate(
                "() => Array.from(document.querySelectorAll('h1')).map(h => h.innerText.trim())"
            )
            result["canonical"] = page.evaluate(
                "() => { const c = document.querySelector('link[rel=\"canonical\"]'); "
                "return c ? c.getAttribute('href') : null; }"
            )
            result["raw_links"] = page.evaluate(
                "() => Array.from(document.querySelectorAll('a[href]'))"
                ".map(a => a.href).filter(Boolean).slice(0, 100)"
            )

        except PlaywrightTimeout:
            result["status_code"] = first_status["code"] or 408
        except Exception as exc:
            print(f"[browser] Error: {exc}", file=sys.stderr)
            result["status_code"] = first_status["code"]
        finally:
            browser.close()

    time.sleep(2)
    return result

A few things worth noting:

The raw_links cap at 100 is deliberate. DEV.to profile pages have hundreds of links — you don't need all of them for broken link detection.

The wait_until="domcontentloaded" setting is faster than networkidle and sufficient for meta tags that are present in the served HTML. Tags injected by client-side JavaScript after that event can be missed. For heavily client-rendered sites, wait_until="networkidle" is the safer, slower choice.

Module 3: Claude Extraction Layer

extractor.py takes the raw page snapshot from browser.py and calls Claude to produce a structured SEO audit result.

This is where most tutorials go wrong. They either write complex parsing logic in Python (fragile) or ask Claude for a free-form response and try to parse prose (unreliable). The right approach: give Claude a strict JSON schema and tell it to return nothing else.

Here is the prompt and parsing code that makes this reliable:

import json
import os
import sys
from datetime import datetime, timezone
import anthropic

MODEL = "claude-sonnet-4-20250514"
client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))


def _strip_fences(text: str) -> str:
    """Remove accidental markdown code fences from Claude's response."""
    text = text.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # Drop opening fence
        lines = lines[1:] if lines[0].startswith("```") else lines
        # Drop closing fence
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        text = "\n".join(lines).strip()
    return text


def extract(snapshot: dict) -> dict:
    if not os.environ.get("ANTHROPIC_API_KEY"):
        raise OSError("ANTHROPIC_API_KEY is not set.")

    prompt = f"""You are an SEO auditor. Analyze this page snapshot and return ONLY a JSON object.
No prose. No explanation. No markdown fences. Raw JSON only.

Page data:
- URL: {snapshot.get('final_url')}
- Status code: {snapshot.get('status_code')}
- Title: {snapshot.get('title')}
- Meta description: {snapshot.get('meta_description')}
- H1 tags: {snapshot.get('h1s')}
- Canonical: {snapshot.get('canonical')}

Return this exact schema:
{{
  "url": "string",
  "final_url": "string",
  "status_code": number,
  "title": {{"value": "string or null", "length": number, "status": "PASS or FAIL"}},
  "description": {{"value": "string or null", "length": number, "status": "PASS or FAIL"}},
  "h1": {{"count": number, "value": "string or null", "status": "PASS or FAIL"}},
  "canonical": {{"value": "string or null", "status": "PASS or FAIL"}},
  "flags": ["array of strings describing specific issues"],
  "human_review": false,
  "audited_at": "ISO timestamp"
}}

PASS/FAIL rules:
- title: FAIL if null or length > 60 characters
- description: FAIL if null or length > 160 characters  
- h1: FAIL if count is 0 (missing) or count > 1 (multiple)
- canonical: FAIL if null
- flags: list every failing field with a clear description
- audited_at: use current UTC time in ISO 8601 format"""

    response = client.messages.create(
        model=MODEL,
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}],
    )

    raw = response.content[0].text
    clean = _strip_fences(raw)

    try:
        return json.loads(clean)
    except json.JSONDecodeError as exc:
        print(f"[extractor] JSON parse error: {exc}", file=sys.stderr)
        return _error_result(snapshot, str(exc))


def _error_result(snapshot: dict, reason: str) -> dict:
    return {
        "url": snapshot.get("final_url", ""),
        "final_url": snapshot.get("final_url", ""),
        "status_code": snapshot.get("status_code"),
        "title": {"value": None, "length": 0, "status": "ERROR"},
        "description": {"value": None, "length": 0, "status": "ERROR"},
        "h1": {"count": 0, "value": None, "status": "ERROR"},
        "canonical": {"value": None, "status": "ERROR"},
        "flags": [f"Extraction error: {reason}"],
        "human_review": True,
        "audited_at": datetime.now(timezone.utc).isoformat(),
    }

Two things make this reliable in production:

First, _strip_fences() handles the case where Claude wraps its response in ```json fences despite being told not to. This happens occasionally with Sonnet and consistently breaks json.loads() if you don't handle it.

Second, the _error_result() fallback means the agent never crashes on a bad Claude response — it logs the error and marks the URL for human review, then continues to the next URL.
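
One gap remains. A successful json.loads() only proves the response is valid JSON, not that it matches the schema, and the model has no clock, so its audited_at value is a guess. A possible post-processing step is sketched below. normalize_result, REQUIRED_FIELDS, and VALID_STATUSES are hypothetical names and are not part of extractor.py.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("title", "description", "h1", "canonical")
VALID_STATUSES = {"PASS", "FAIL"}


def normalize_result(result: dict, snapshot: dict) -> dict:
    """Check the model's JSON against the expected shape and fill in trusted values."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = result.get(field)
        if not isinstance(value, dict) or value.get("status") not in VALID_STATUSES:
            problems.append(f"Missing or invalid field: {field}")

    # Values the code knows better than the model
    result["url"] = snapshot.get("final_url", "")
    result["final_url"] = snapshot.get("final_url", "")
    result["status_code"] = snapshot.get("status_code")
    result["audited_at"] = datetime.now(timezone.utc).isoformat()

    flags = result.get("flags")
    if not isinstance(flags, list):
        flags = []
    if problems:
        flags.extend(problems)
        result["human_review"] = True
    result["flags"] = flags
    return result


snapshot = {"final_url": "https://example.com/page", "status_code": 200}
model_output = {
    "title": {"value": "Home", "length": 4, "status": "PASS"},
    "description": {"value": None, "length": 0, "status": "FAIL"},
    "h1": {"count": 1, "value": "Home", "status": "PASS"},
    # "canonical" is missing entirely
    "audited_at": "2024-01-01T00:00:00Z",  # the model's guess
}
checked = normalize_result(model_output, snapshot)
print(checked["flags"], checked["human_review"])
```

In extract(), this would wrap the return value of json.loads(clean), so incomplete responses are routed to human review in the same way as unparseable ones.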

Cost: Claude Sonnet 4 is priced at $3 per million input tokens and $15 per million output tokens. A typical page snapshot is around 500 input tokens; the structured JSON response is around 300 output tokens. That works out to roughly $0.006 per URL — about $0.12 for a 20-URL audit.
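
The arithmetic behind that estimate, as a small sketch you can adapt to your own token counts. The prices and token counts are the ones quoted above; cost_per_url is a hypothetical helper, not part of the agent.

```python
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens (from the text above)
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens


def cost_per_url(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD of one extraction call."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


per_url = cost_per_url(500, 300)
print(f"${per_url:.4f} per URL")           # $0.0060 per URL
print(f"${per_url * 20:.2f} for 20 URLs")  # $0.12 for 20 URLs
```

Output tokens account for three quarters of the cost here ($0.0045 of $0.006), so trimming the response schema saves more than trimming the prompt.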

Module 4: Broken Link Checker

linkchecker.py takes the raw_links list from the browser snapshot and checks same-domain links for broken status using async HEAD requests.

The design choices:

Same-domain only. Checking every external link on a page would take minutes and isn't what agency clients need. Filter to links on the same domain as the page being audited.
HEAD requests, not GET. Faster, lower bandwidth, sufficient for status code detection.
Cap at 50 links. Pages like DEV.to article listings have hundreds of internal links. Checking all of them would dominate the runtime.
Concurrent requests via asyncio. All links are checked concurrently on one event loop rather than one after another.

import asyncio
import logging
from urllib.parse import urlparse
import httpx

CAP = 50
TIMEOUT = 5.0
logger = logging.getLogger(__name__)


def _same_domain(link: str, final_url: str) -> bool:
    if not link:
        return False
    lower = link.strip().lower()
    if lower.startswith(("#", "mailto:", "javascript:", "tel:", "data:")):
        return False
    try:
        page_host = urlparse(final_url).netloc.lower()
        parsed = urlparse(link)
        return parsed.scheme in ("http", "https") and parsed.netloc.lower() == page_host
    except Exception:
        return False


async def _check_link(client: httpx.AsyncClient, url: str) -> tuple[str, bool]:
    try:
        resp = await client.head(url, follow_redirects=True, timeout=TIMEOUT)
        return url, resp.status_code != 200
    except Exception:
        return url, True  # Timeout or connection error = broken


async def _run_checks(links: list[str]) -> list[str]:
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*[_check_link(client, url) for url in links])
    return [url for url, broken in results if broken]


def check_links(raw_links: list[str], final_url: str) -> dict:
    same_domain = [l for l in raw_links if _same_domain(l, final_url)]

    capped = len(same_domain) > CAP
    if capped:
        logger.warning("Page has %d same-domain links — capping at %d.", len(same_domain), CAP)
        same_domain = same_domain[:CAP]

    broken = asyncio.run(_run_checks(same_domain))

    return {
        "broken": broken,
        "count": len(broken),
        "status": "FAIL" if broken else "PASS",
        "capped": capped,
    }
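
asyncio.gather() starts every request at once. With a cap of 50 that is manageable, but if you raise the cap you may want to bound how many requests are in flight. A minimal sketch using asyncio.Semaphore follows. fake_check stands in for the real httpx call, and MAX_CONCURRENT is an assumed setting, not a value from the repository.

```python
import asyncio

MAX_CONCURRENT = 3  # assumed limit for this sketch


async def run_bounded(urls: list[str], limit: int = MAX_CONCURRENT) -> tuple[list[str], int]:
    """Check URLs concurrently, never more than `limit` at a time.

    Returns (broken_urls, peak_concurrency).
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_check(url: str) -> tuple[str, bool]:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for client.head(url)
            in_flight -= 1
            return url, url.endswith("/missing")

    results = await asyncio.gather(*(fake_check(u) for u in urls))
    return [url for url, broken in results if broken], peak


urls = [f"https://example.com/page{i}" for i in range(9)] + ["https://example.com/missing"]
broken, peak = asyncio.run(run_bounded(urls))
print(broken, peak)
```

In the real checker, the async with semaphore block would wrap the client.head() call inside _check_link(), and the rest of check_links() would stay the same.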

Module 5: Human-in-the-Loop

This is the part most automation tutorials skip. What happens when the agent hits a login wall? A page that returns 403? A URL that redirects to a "Subscribe to continue reading" page?

Most scripts either crash or silently skip. Neither is acceptable in an agency context.

hitl.py handles this with two functions: one that detects whether a pause is needed, and one that handles the pause itself.

from state import add_to_needs_human

LOGIN_KEYWORDS = {"login", "sign in", "sign-in", "access denied", "log in", "unauthorized"}
REDIRECT_CODES = {301, 302, 307, 308}


def should_pause(snapshot: dict) -> bool:
    code = snapshot.get("status_code")

    # Navigation failed entirely
    if code is None:
        return True

    # Non-200, non-redirect
    if code != 200 and code not in REDIRECT_CODES:
        return True

    # Login wall detection
    title = (snapshot.get("title") or "").lower()
    h1s = [h.lower() for h in (snapshot.get("h1s") or [])]

    if any(kw in title for kw in LOGIN_KEYWORDS):
        return True
    if any(kw in h1 for kw in LOGIN_KEYWORDS for h1 in h1s):
        return True

    return False


def pause_reason(snapshot: dict) -> str:
    code = snapshot.get("status_code")
    if code is None:
        return "Navigation failed (None status)"
    if code != 200 and code not in REDIRECT_CODES:
        return f"Unexpected status code: {code}"
    return "Possible login wall detected"


def pause_and_prompt(url: str, reason: str) -> str:
    print(f"\n⚠️  HUMAN REVIEW NEEDED")
    print(f"   URL:    {url}")
    print(f"   Reason: {reason}")
    print(f"   Options: [s] skip  [r] retry  [q] quit\n")

    while True:
        choice = input("Your choice: ").strip().lower()
        if choice in ("s", "r", "q"):
            return {"s": "skip", "r": "retry", "q": "quit"}[choice]
        print("   Enter s, r, or q.")

The should_pause() function catches four cases: navigation failure, unexpected HTTP status, login keywords in the title, and login keywords in H1 tags. The login keyword check is what catches "Please sign in to continue" pages that return 200 but are effectively inaccessible.

In --auto mode (for scheduled runs), the main loop skips the pause_and_prompt() call and automatically handles these cases by logging the URL to needs_human[] in state and continuing.

Module 6: Report Writer

reporter.py writes results incrementally. This is important: results are written after each URL is audited, not batched at the end. If the run is interrupted, you don't lose completed work.

import json
import os
from datetime import datetime, timezone

REPORT_JSON = os.path.join(os.path.dirname(__file__), "report.json")
REPORT_TXT = os.path.join(os.path.dirname(__file__), "report-summary.txt")


def _load_report() -> list:
    if not os.path.exists(REPORT_JSON):
        return []
    with open(REPORT_JSON, encoding="utf-8") as f:
        return json.load(f)


def write_result(result: dict) -> None:
    """Append or update a result in report.json."""
    entries = _load_report()
    url = result.get("url", "")

    # Update existing entry if URL already present (handles retries)
    for i, entry in enumerate(entries):
        if entry.get("url") == url:
            entries[i] = result
            break
    else:
        entries.append(result)

    with open(REPORT_JSON, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2, ensure_ascii=False)


def _is_overall_pass(result: dict) -> bool:
    fields = ["title", "description", "h1", "canonical"]
    for field in fields:
        if result.get(field, {}).get("status") not in ("PASS",):
            return False
    if result.get("broken_links", {}).get("status") == "FAIL":
        return False
    return True


def write_summary() -> None:
    entries = _load_report()
    passed = sum(1 for e in entries if _is_overall_pass(e))

    lines = []
    for entry in entries:
        overall = "PASS" if _is_overall_pass(entry) else "FAIL"
        failed_fields = [
            f for f in ["title", "description", "h1", "canonical", "broken_links"]
            if entry.get(f, {}).get("status") == "FAIL"
        ]
        suffix = f" [{', '.join(failed_fields)}]" if failed_fields else ""
        lines.append(f"{entry.get('url', 'unknown'):<60} | {overall}{suffix}")

    lines.append("")
    lines.append(f"{passed}/{len(entries)} URLs passed")

    with open(REPORT_TXT, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

The deduplication in write_result() handles retries cleanly. If a URL is retried after a human reviews a login wall and authenticates, the new result replaces the old one rather than creating a duplicate entry.

Module 7: The Main Loop

index.py wires everything together. It reads the URL list, loads state, skips already-audited URLs, and runs the audit loop.

import csv
import os
import sys
import time
import argparse

from state import load_state, is_audited, mark_audited, add_to_needs_human
from browser import fetch_page
from extractor import extract
from linkchecker import check_links
from hitl import should_pause, pause_reason, pause_and_prompt
from reporter import write_result, write_summary

INPUT_CSV = os.path.join(os.path.dirname(__file__), "input.csv")


def read_urls(path: str) -> list[str]:
    with open(path, newline="", encoding="utf-8") as f:
        return [row["url"].strip() for row in csv.DictReader(f) if row.get("url", "").strip()]


def run(auto: bool = False):
    if not os.environ.get("ANTHROPIC_API_KEY"):
        print("Error: ANTHROPIC_API_KEY environment variable is not set.")
        sys.exit(1)

    urls = read_urls(INPUT_CSV)
    pending = [u for u in urls if not is_audited(u)]

    print(f"Starting audit: {len(pending)} pending, {len(urls) - len(pending)} already done.\n")

    total = len(urls)

    try:
        for url in pending:
            position = urls.index(url) + 1
            print(f"[{position}/{total}] {url}", end=" -> ", flush=True)

            # Browser navigation
            snapshot = fetch_page(url)

            # Human-in-the-loop check
            if should_pause(snapshot):
                reason = pause_reason(snapshot)

                if auto:
                    print(f"AUTO-SKIPPED ({reason})")
                    add_to_needs_human(url)
                    mark_audited(url)
                    continue

                action = pause_and_prompt(url, reason)
                if action == "quit":
                    print("Exiting.")
                    break
                elif action == "skip":
                    add_to_needs_human(url)
                    mark_audited(url)
                    continue
                # "retry" falls through to re-fetch below
                snapshot = fetch_page(url)

            # Claude extraction
            result = extract(snapshot)

            # Broken link check
            links = check_links(snapshot.get("raw_links", []), snapshot.get("final_url", url))
            result["broken_links"] = links

            # Write result immediately
            write_result(result)
            mark_audited(url)

            overall = "PASS" if all(
                result.get(f, {}).get("status") == "PASS"
                for f in ["title", "description", "h1", "canonical"]
            ) and links["status"] == "PASS" else "FAIL"

            print(overall)

    except KeyboardInterrupt:
        print("\n\nInterrupted. Progress saved. Re-run to continue.")
        return

    write_summary()
    print("\nAudit complete. Report saved to report.json and report-summary.txt")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--auto", action="store_true", help="Auto-skip URLs requiring human review")
    args = parser.parse_args()
    run(auto=args.auto)

The KeyboardInterrupt handler makes interruption safe; the saved state is what makes resuming possible. When you press Ctrl+C, the handler prints a message and exits cleanly. Because each result is written with write_result() and then recorded with mark_audited() before the loop moves on, the next run skips everything already processed.
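
The resume behavior is easier to see in isolation. The following is a small simulation, not the agent's code: run_batch, the audited set, and the interrupt_after parameter are all hypothetical stand-ins for the loop, the state file, and a Ctrl+C.

```python
def run_batch(urls, audited, interrupt_after=None):
    """Process pending URLs in order, recording each one as soon as it finishes.

    `audited` stands in for the persisted state file; `interrupt_after`
    simulates pressing Ctrl+C after that many URLs. Both are illustrative.
    """
    processed = []
    try:
        for url in urls:
            if url in audited:
                continue  # already done in an earlier run
            if interrupt_after is not None and len(processed) >= interrupt_after:
                raise KeyboardInterrupt
            processed.append(url)  # stands in for write_result(result)
            audited.add(url)       # stands in for mark_audited(url)
    except KeyboardInterrupt:
        pass  # progress is already recorded, so exiting here loses nothing
    return processed


if __name__ == "__main__":
    urls = ["a", "b", "c", "d"]
    state = set()
    first = run_batch(urls, state, interrupt_after=2)
    second = run_batch(urls, state)
    print(first, second)
```

The second call picks up exactly where the first one stopped, because progress is recorded per URL rather than at the end of the run.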

Running the Agent

Interactive mode (pauses on edge cases):

python index.py

Auto mode (skips edge cases, adds to needs_human[]):

python index.py --auto

When it runs, you'll see the browser window open for each URL and the terminal print progress:

Starting audit: 7 pending, 0 already done.

[1/7] https://example.com -> PASS
[2/7] https://example.com/about -> FAIL
[3/7] https://example.com/contact -> AUTO-SKIPPED (Unexpected status code: 404)
...
Audit complete. Report saved to report.json and report-summary.txt

To resume after an interruption:

python index.py --auto
# Starting audit: 4 pending, 3 already done.

Scheduling for Agency Use

For recurring weekly audits, create a batch file and schedule it with Windows Task Scheduler.

Create run-audit.bat:

@echo off
set ANTHROPIC_API_KEY=your-key-here
cd /d C:\Users\yourname\Desktop\seo-agent
python index.py --auto

In Windows Task Scheduler:

Create a new Basic Task
Set the trigger to Weekly, Monday at 7:00 AM
Set the action to "Start a program"
Browse to your run-audit.bat file

Check report-summary.txt on Monday morning. URLs in needs_human[] in state.json need manual review — login walls, paywalls, or pages that returned unexpected status codes.

For macOS/Linux, use cron:

# Run every Monday at 7am
0 7 * * 1 cd /path/to/seo-agent && ANTHROPIC_API_KEY=your-key python index.py --auto
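
For the Monday review, it can help to print the URLs waiting for a human. This is a minimal sketch that assumes state.json holds a top-level "needs_human" list of URL strings, as the needs_human[] notation above suggests; the function names are hypothetical and not part of the agent.

```python
import json
from pathlib import Path


def urls_needing_review(state: dict) -> list[str]:
    """Return the URLs flagged for manual review, without duplicates, in order."""
    seen = set()
    ordered = []
    for url in state.get("needs_human", []):  # assumed key name
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


def load_state_file(path: str = "state.json") -> dict:
    """Read the state file, or return an empty state if it does not exist yet."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text(encoding="utf-8"))


if __name__ == "__main__":
    for url in urls_needing_review(load_state_file()):
        print(url)
```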

What the Results Look Like

I ran this agent against seven of my own published pages across Hashnode, freeCodeCamp, and DEV.to. Every single one failed.

https://hashnode.com/@dannwaneri                    | FAIL [h1]
https://freecodecamp.org/news/claude-code-skill     | FAIL [description]
https://freecodecamp.org/news/stop-letting-ai-guess | FAIL [description]
https://freecodecamp.org/news/rag-system-handbook   | FAIL [title, description]
https://freecodecamp.org/news/author/dannwaneri     | FAIL [description]
https://dev.to/dannwaneri/gatekeeping-panic         | FAIL [title]
https://dev.to/dannwaneri/production-rag-system     | FAIL [title]

0/7 URLs passed
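
The [title] flags in this listing come down to a length check on the page title. In the agent, the field statuses appear to come from the extraction step, so the following is only a sketch of the rule itself. It assumes a 60-character threshold and a result shape with a "status" key; check_title_length, the other keys, and the choice to fail an empty title are illustrative.

```python
def check_title_length(title: str, limit: int = 60) -> dict:
    """Flag titles longer than `limit` characters as a display risk.

    Assumption: an empty title also fails, since there is nothing to display.
    """
    cleaned = title.strip()
    length = len(cleaned)
    status = "PASS" if 0 < length <= limit else "FAIL"
    return {"value": cleaned, "length": length, "status": status}


if __name__ == "__main__":
    result = check_title_length("How to Build Your Own Claude Code Skill")
    print(result["length"], result["status"])
```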

The freeCodeCamp description issues are partly platform-level — freeCodeCamp's template sometimes truncates or omits meta descriptions for article listing pages. The DEV.to title issues are mine. Article titles that work as headlines often exceed 60 characters in the <code>&lt;title&gt;</code> tag.

A note on the 60-character title rule: this is a display threshold, not a ranking penalty. Google indexes titles of any length. The 60-character guideline reflects approximately how many characters fit in a desktop SERP result before truncation. Titles over 60 characters often still rank — they just get cut off in search results, which can hurt click-through rate. The agent flags display risk, not a ranking violation.

<h2 id="heading-next-steps">Next Steps</h2>
The agent as built handles the core SEO audit workflow. Obvious extensions:
<ul>
<li>Performance metrics — add a Lighthouse or PageSpeed Insights API call per URL </li>
<li>Structured data validation — check for JSON-LD schema markup and validate it </li>
<li>Email delivery — send <code>report-summary.txt</code> via SMTP after the run completes </li>
<li>Multi-client support — separate <code>input.csv</code> files per client, separate report directories </li>
</ul>
The full code including all seven modules is at <a href="https://github.com/dannwaneri/seo-agent">dannwaneri/seo-agent</a>. Clone it, add your URLs, and run it. If you found this useful, I write about practical AI agent setups for developers and agencies at <a href="https://dev.to/dannwaneri">DEV.to/@dannwaneri</a>. The DEV.to companion piece covers the design decisions behind the agent — why HITL matters, why Browser Use over scrapers, and what the audit results mean for your own published content.
</article>
<article>
<h1> How to Build Your Own Claude Code Skill </h1>
Daniel Nwaneri — Fri, 27 Mar 2026 20:47:26 +0000
Every developer eventually has a workflow they repeat. A way they write commit messages. A checklist they run before opening a pull request. 
A structure they follow when reviewing code. They do it manually, explain it to their agents in every session, and watch the agent interpret it differently each time. Agent skills fix this. A skill is a markdown file that loads into Claude Code's context automatically when you need it. You write the workflow once. The agent follows it every time. And because skills follow an open standard, the same file works in Claude Code, GitHub Copilot, Cursor, and Gemini CLI. This tutorial shows you how to build a skill from scratch. You will build a commit-message-writer — a skill that reads your staged changes and generates a structured commit message following the Conventional Commits standard. By the end, you will have a working skill installed and ready to use, and you will understand the structure well enough to build any skill you need. <h2 id="heading-table-of-contents">Table of Contents</h2> <ol> <li><a href="#heading-what-an-agent-skill-is">What an Agent Skill Is</a> </li> <li><a href="#heading-how-to-choose-what-to-build">How to Choose What to Build</a> </li> <li><a href="#heading-how-to-structure-your-skill">How to Structure Your Skill</a> </li> <li><a href="#heading-how-to-write-the-description">How to Write the Description</a> </li> <li><a href="#heading-how-to-write-the-instructions">How to Write the Instructions</a> </li> <li><a href="#heading-how-to-build-the-commit-message-writer-skill">How to Build the commit-message-writer Skill</a> </li> <li><a href="#heading-how-to-install-and-test-your-skill">How to Install and Test Your Skill</a> </li> <li><a href="#heading-how-to-improve-your-skill-over-time">How to Improve Your Skill Over Time</a> </li> <li><a href="#heading-where-to-go-next">Where to Go Next</a> </li> </ol> <h2 id="heading-what-an-agent-skill-is">What an Agent Skill Is</h2> A skill is a folder containing a <code>SKILL.md</code> file. That file has two parts: a YAML frontmatter block at the top, and a markdown body below it. 
<pre><code class="language-plaintext">my-skill/
└── SKILL.md
</code></pre>
The frontmatter tells the agent what the skill is called and when to use it. The body tells the agent what to do when it loads the skill. Here is the minimal structure:
<pre><code class="language-yaml">---
name: my-skill
description: What this skill does and when to use it.
---

# My Skill

Instructions for the agent go here.
</code></pre>
When you invoke a skill — either explicitly with <code>/skill-name</code> or by describing what you want — the agent reads the SKILL.md body and follows the instructions inside it. The frontmatter never reaches the agent's instructions. It's metadata the skill system uses to decide whether to load the skill at all.
<h3 id="heading-how-the-agent-decides-to-load-a-skill">How the Agent Decides to Load a Skill</h3>
This is the most important thing to understand before you write your first skill: the agent decides whether to load your skill based entirely on the description field. Skills appear in Claude Code's context as a list of names and descriptions. When you make a request, the agent scans that list and loads any skill whose description matches what you're asking for. If the description is vague, the skill won't load when you need it. If the description is too narrow, it won't load for variations of the same request. The instructions in the body only matter after the skill loads. Getting the description right is what determines whether the skill loads at all.
<h3 id="heading-what-skills-are-not">What Skills Are Not</h3>
Skills are instruction files. They cannot run code on their own — but they can instruct the agent to run code using its existing tools. They are not plugins, extensions, or packages. They have no runtime. They are markdown files the agent reads, like a recipe a chef follows.
<h2 id="heading-how-to-choose-what-to-build">How to Choose What to Build</h2>
The best skills share three properties.
<ol>
<li>They encode a repeatable workflow. 
If you do something differently every time, a skill won't help. If you follow the same steps every session — even if you explain them differently each time — that's a skill candidate. </li>
<li>They have a clear trigger. You should be able to finish the sentence "I need this skill when I want to...". If you can't finish that sentence in one clause, the workflow isn't scoped enough for a skill. </li>
<li>They produce a consistent output format. Skills that output in a fixed structure — a commit message, a code review, a spec — are easier to build and test than skills that produce open-ended prose. </li>
</ol>
Good candidates: commit messages, pull request descriptions, code reviews, changelog entries. Bad candidates: "help me think through this", "make this better" — too open-ended to encode in a skill. For this tutorial, commit message generation is the right scope. The trigger is obvious (you want to commit), the workflow is defined (read staged changes, apply Conventional Commits format), and the output is structured (a commit message with a specific shape).
<h2 id="heading-how-to-structure-your-skill">How to Structure Your Skill</h2>
Every skill starts as a single folder with a single file:
<pre><code class="language-plaintext">commit-message-writer/
└── SKILL.md
</code></pre>
As skills grow, they can include additional files the agent loads as needed:
<pre><code class="language-plaintext">commit-message-writer/
├── SKILL.md              ← always loaded when skill triggers
└── references/
    └── examples.md       ← loaded only when the agent needs examples
</code></pre>
The SKILL.md body should stay under 500 lines. If your instructions are growing beyond that, move supporting detail into a <code>references/</code> subfolder and tell the agent when to read those files. This keeps the skill lean — the agent only loads what it needs. For this tutorial, a single SKILL.md is enough. 
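The structural rules in this section (a frontmatter block with a name and a description, and a body under 500 lines) can be checked mechanically. What follows is a minimal sketch in Python, not part of any Agent Skills tooling: check_skill_file and its naive key lookup are assumptions made for illustration, and it does not parse full YAML.

```python
def check_skill_file(text: str, max_body_lines: int = 500) -> list[str]:
    """Return a list of problems found in SKILL.md text (empty means it looks fine)."""
    problems = []
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ["missing opening '---' frontmatter delimiter"]
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return ["missing closing '---' frontmatter delimiter"]

    # Naive key lookup: enough for flat `key: value` frontmatter, not full YAML.
    keys = {line.split(":", 1)[0].strip() for line in lines[1:end] if ":" in line}
    for required in ("name", "description"):
        if required not in keys:
            problems.append(f"frontmatter is missing '{required}'")

    body = lines[end + 1:]
    if len(body) > max_body_lines:
        problems.append(f"body has {len(body)} lines, over the {max_body_lines}-line guideline")
    return problems


if __name__ == "__main__":
    sample = "---\nname: my-skill\ndescription: What this skill does.\n---\n# My Skill\n"
    print(check_skill_file(sample))
```

An empty list means the file has the expected shape. It says nothing about whether the description will actually trigger the skill.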
<h2 id="heading-how-to-write-the-description">How to Write the Description</h2> The description field is the trigger condition. It determines when your skill loads and when it doesn't. Most skills fail not because the instructions are wrong, but because the description doesn't match how people actually ask for help. Here is a weak description: <pre><code class="language-yaml">description: Generates commit messages. </code></pre> This will undertrigger. "Generate a commit message" will load it. "Write a commit for my changes" probably won't. "Summarize my staged diff" definitely won't — even though all three are asking for the same thing. Here is a stronger description: <pre><code class="language-yaml">description: Generates structured commit messages following the Conventional Commits standard. Use when you want to commit your changes and need a well-formatted message. Triggers on "write a commit message", "commit my changes", "summarize my staged diff", "what should my commit say", or any request to describe or document code changes for version control. </code></pre> The pattern is: what the skill does + when to use it + specific trigger phrases. The trigger phrases cover the different ways a developer might ask for the same thing. Two rules for descriptions: Be specific about the output. "Generates commit messages" is vague. "Generates structured commit messages following the Conventional Commits standard" tells the agent and the user exactly what they'll get. Be slightly pushy. The agent has a natural tendency to undertrigger skills — to handle requests itself rather than loading a skill. A description that explicitly lists trigger phrases counteracts this. You are not being redundant. You are training the trigger. <h2 id="heading-how-to-write-the-instructions">How to Write the Instructions</h2> The body of SKILL.md is where you define what the agent does when the skill loads. Good instructions follow two principles. Generate first, clarify second. 
The agent should produce output immediately rather than asking clarifying questions. If it needs to make assumptions, it should make them and flag them — not ask. Asking questions before producing output adds friction and loses the benefit of having a skill at all.

Define the output format explicitly. Don't say "write a good commit message." Say exactly what the structure is, what fields are required, what the character limits are. The more specific the output format, the more consistent the results. Here is what weak instructions look like:
<pre><code class="language-markdown"># Commit Message Writer

Look at the staged changes and write a commit message that describes what changed.
</code></pre>
That will produce different results every time — different formats, different lengths, different conventions. It's not a skill. It's a prompt. Here is what strong instructions look like:
<pre><code class="language-markdown"># Commit Message Writer

Read the staged diff using `git diff --staged`. Generate a commit message following the Conventional Commits standard.

Output format:

type(scope): short description under 72 characters

Body (if changes are non-trivial):
- What changed and why, not how
- One bullet per logical change

Footer (if applicable):
BREAKING CHANGE: description
Closes #issue-number
</code></pre>
The agent knows exactly what to produce. The output will be consistent across sessions, across projects, and across agents that support the standard.
<h2 id="heading-how-to-build-the-commit-message-writer-skill">How to Build the <code>commit-message-writer</code> Skill</h2>
Now build it. Create the skill directory:
<pre><code class="language-bash">mkdir -p ~/.claude/skills/commit-message-writer
</code></pre>
Note: PowerShell uses the backtick (<code>`</code>) for line continuation, not the backslash. On Windows PowerShell: 
<pre><code class="language-powershell">New-Item -ItemType Directory -Force -Path "$HOME\.claude\skills\commit-message-writer"
</code></pre>
Create the SKILL.md file inside that directory. Here is the complete content:
<pre><code class="language-markdown">---
name: commit-message-writer
description: Generates structured commit messages following the Conventional Commits standard. Use when you want to commit your changes and need a well-formatted message. Triggers on "write a commit message", "commit my changes", "summarize my staged diff", "what should my commit say", or any request to describe or document staged changes for version control.
---

# commit-message-writer

You generate structured commit messages from staged git changes.

## How to invoke

Run `git diff --staged` to read the staged changes. If nothing is staged, tell the user and suggest they run `git add` first.

Generate first. Do not ask clarifying questions before producing the commit message. If you need to make assumptions about scope or type, make them and note them after the output.

## Output format

~~~
type(scope): short description

[body — optional, include if changes are non-trivial]

[footer — optional]
~~~

**Type** — choose one:
- `feat` — a new feature
- `fix` — a bug fix
- `docs` — documentation changes only
- `refactor` — code change that neither fixes a bug nor adds a feature
- `test` — adding or updating tests
- `chore` — build process, tooling, or dependency updates

**Scope** — the module, file, or area affected. Use the directory name or component name. Omit if the change spans the entire codebase.

**Short description** — imperative mood, under 72 characters, no period at the end. "Add user authentication" not "Added user authentication" or "Adds user authentication."

**Body** — what changed and why, not how. One bullet per logical change. Skip if the short description is self-explanatory.

**Footer** — include `BREAKING CHANGE:` if the commit breaks backward compatibility. Include `Closes #N` if it resolves a GitHub issue.

## Quality rules

- Never use "updated", "changed", or "modified" in the short description — be specific
- Never write "various improvements" or "misc fixes" — name what improved
- If more than three files changed across unrelated concerns, flag it: "These changes may be better split into separate commits: [list concerns]"
- The short description must be under 72 characters — count before outputting

## Example output

Input: staged changes adding a rate limiter to an API endpoint

~~~
feat(api): add rate limiting to /query endpoint

- Limits requests to 100 per minute per IP using Cloudflare's rate limit binding
- Returns 429 with Retry-After header when limit is exceeded
- Adds rate limit configuration to wrangler.toml

Closes #47
~~~
</code></pre>
Save that file. The skill is built.
<h2 id="heading-how-to-install-and-test-your-skill">How to Install and Test Your Skill</h2>
<h3 id="heading-verify-the-file-exists">Verify the File Exists</h3>
<pre><code class="language-bash">cat ~/.claude/skills/commit-message-writer/SKILL.md
</code></pre>
You should see the full SKILL.md content. If you get an error, check the directory path.
<h3 id="heading-test-the-skill">Test the Skill</h3>
Open Claude Code in any git repository that has staged changes. Type:
<pre><code class="language-plaintext">/commit-message-writer
</code></pre>
The agent will read your staged diff and produce a commit message following the format you defined. You can also trigger it naturally:
<pre><code class="language-plaintext">write a commit message for my staged changes
</code></pre>
<pre><code class="language-plaintext">what should my commit say
</code></pre>
<pre><code class="language-plaintext">summarize my diff for git
</code></pre>
All three should load the skill and produce a structured commit message. If the skill doesn't trigger on natural language requests, the description needs more trigger phrases — see the improvement section below. 
<h3 id="heading-test-edge-cases">Test Edge Cases</h3>
Test these cases before relying on the skill in production:
<pre><code class="language-bash"># Stage nothing, then ask for a commit message
git add -p  # answer "n" to every hunk so nothing is staged
# In Claude Code: "write a commit message"
# Expected: skill tells you nothing is staged and suggests git add
</code></pre>
<pre><code class="language-bash"># Stage changes across unrelated files
git add src/api.ts src/styles.css README.md
# In Claude Code: "write a commit message"
# Expected: skill flags that commits may be better split
</code></pre>
<h2 id="heading-how-to-improve-your-skill-over-time">How to Improve Your Skill Over Time</h2>
The first version of any skill is a draft. You improve it by observing where it produces inconsistent or wrong output, then updating the instructions.
<h3 id="heading-when-the-skill-undertriggers">When the Skill Undertriggers</h3>
If you type "summarize my changes for git" and the skill doesn't load, add that phrase to the description's trigger list:
<pre><code class="language-yaml">description: ... Triggers on "write a commit message", "commit my changes", "summarize my staged diff", "summarize my changes for git", ...
</code></pre>
The description is your primary lever for fixing triggering problems.
<h3 id="heading-when-the-output-format-drifts">When the Output Format Drifts</h3>
If the agent starts producing commit messages that don't match your format — wrong type, missing scope, body in the wrong style — the instructions need to be more explicit. Add a concrete example that shows the failure and the correct output:
<pre><code class="language-markdown">## Common mistakes to avoid

Wrong: "Updated the authentication flow"
Right: "refactor(auth): simplify token validation logic"

Wrong: "Fixed bugs"
Right: "fix(api): handle null response from upstream service"
</code></pre>
Concrete counterexamples are more effective than abstract rules. 
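When you are checking for format drift, it can help to test the skill's output against the rules mechanically instead of by eye. The following is a minimal sketch in Python, not part of the skill itself: check_subject is a hypothetical helper, and it assumes the 72-character limit applies to the whole first line.

```python
import re

ALLOWED_TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")
BANNED_WORDS = ("updated", "changed", "modified")
SUBJECT_RE = re.compile(r"^(?P<type>[a-z]+)(\((?P<scope>[^)]+)\))?: (?P<desc>.+)$")


def check_subject(subject: str, max_len: int = 72) -> list[str]:
    """Return the rule violations found in a commit subject line."""
    problems = []
    # Assumption: "under 72 characters" is applied to the full subject line.
    if len(subject) >= max_len:
        problems.append(f"subject line is {len(subject)} characters; keep it under {max_len}")

    match = SUBJECT_RE.match(subject)
    if not match:
        problems.append("does not match 'type(scope): short description'")
        return problems

    if match["type"] not in ALLOWED_TYPES:
        problems.append(f"unknown type '{match['type']}'")

    desc = match["desc"]
    if desc.endswith("."):
        problems.append("short description ends with a period")

    words = re.findall(r"[a-z]+", desc.lower())
    for banned in BANNED_WORDS:
        if banned in words:
            problems.append(f"vague word '{banned}' in short description")
    return problems


if __name__ == "__main__":
    for line in ("feat(api): add rate limiting to /query endpoint",
                 "Updated the authentication flow"):
        print(line, "->", check_subject(line) or "ok")
```

A check like this only covers the mechanical rules. Whether the message describes the change accurately still needs a human read.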
<h3 id="heading-when-the-scope-grows">When the Scope Grows</h3> If you find yourself wanting the skill to handle related tasks — reviewing commit messages, generating changelogs, writing PR descriptions — resist the urge to add everything to one skill. Build separate skills. Each skill should do one thing well. The Agent Skills standard is designed for composition, not for monolithic instructions. <h2 id="heading-where-to-go-next">Where to Go Next</h2> The commit-message-writer covers the core pattern. The same structure works for any repeatable workflow. Pull request descriptions follow the same shape — read the diff, apply a structure, produce consistent output. The trigger phrases are different ("write a PR description", "summarize my branch for review") and the output format adds sections for motivation and testing, but the SKILL.md structure is identical. Code review checklists work well as skills when your team has a standard review process. The trigger is "review this code" or "check this PR", and the instructions encode whatever your team actually checks — security concerns, test coverage, naming conventions. The commit-message-writer is the simplest skill architecture — instructions only. As your skills grow more specialized, two other patterns become useful. The first adds a <code>references/</code> directory: the voice-humanizer skill loads a CORPUS.md file containing the author's published writing, which the agent reads when it needs to check output against a specific style. The second adds quality rules and structured output formats that make results stricter and more consistent — that's the pattern spec-writer uses to surface assumptions inline. Each is the same SKILL.md structure at a different level of complexity. Start with instructions only. Add references when the agent needs external context. Add output format rules when consistency matters more than flexibility. 
The Agent Skills standard is supported in Claude Code, GitHub Copilot in VS Code, Cursor, and Gemini CLI. A skill you build once installs across all of them. The install path differs by agent: <table> <thead> <tr> <th>Agent</th> <th>Skills directory</th> </tr> </thead> <tbody><tr> <td>Claude Code</td> <td><code>~/.claude/skills/</code></td> </tr> <tr> <td>GitHub Copilot</td> <td><code>~/.copilot/skills/</code> or <code>.github/skills/</code></td> </tr> <tr> <td>Cursor</td> <td><code>~/.cursor/skills/</code></td> </tr> <tr> <td>Gemini CLI</td> <td><code>~/.gemini/skills/</code></td> </tr> </tbody></table> The SKILL.md format is the same across all of them. The commit-message-writer you just built is a working skill. The next one will take less time. By the third, you will start seeing workflows you repeat and immediately think: that should be a skill. That's the point. </article> <article> <h1> What Happened When I Replaced Copilot with Claude Code for 2 Weeks </h1> Balajee Asish Brahmandam — Fri, 27 Mar 2026 18:46:22 +0000 GitHub Copilot costs $10/month, and I'd been using it for two years without thinking twice. But when Claude Code launched, I got curious. What if I just... switched? I didn't want to just add Claude Code to my stack. I actually wanted to replace Copilot entirely for two weeks. I kept everything else the same – same editor, same projects, same workflow. I just swapped the autocomplete suggestion tool. Here's what broke, what improved, and whether I went back. 
<h2 id="heading-table-of-contents">Table of Contents</h2> <ul> <li><a href="#heading-the-setup">The Setup</a> </li> <li><a href="#heading-what-worked-better">What Worked Better</a> </li> <li><a href="#heading-what-broke-or-slowed-things-down">What Broke (Or Slowed Things Down)</a> </li> <li><a href="#heading-the-first-week-vs-the-second-week">The First Week vs The Second Week</a> </li> <li><a href="#heading-why-i-went-back">Why I Went Back</a> </li> <li><a href="#heading-the-honest-verdict">The Honest Verdict</a> </li> <li><a href="#heading-what-i-actually-use-now">What I Actually Use Now</a> </li> <li><a href="#heading-copilot-vs-claude-code-the-breakdown">Copilot vs Claude Code — The Breakdown</a> </li> <li><a href="#heading-a-word-on-developer-experience">A Word on Developer Experience</a> </li> <li><a href="#heading-what-would-make-me-switch">What Would Make Me Switch</a> </li> <li><a href="#heading-final-thoughts">Final Thoughts</a> </li> </ul> <h2 id="heading-the-setup">The Setup</h2> Environment: <ul> <li>Python 3.12 for backend work (Django REST framework specifically) </li> <li>React/TypeScript for frontend </li> <li>VSCode as my editor </li> <li>A mid-sized project with about 15k lines of code across backend and frontend </li> <li>Two weeks, normal workload (roughly 30-40 hours of coding) </li> <li>Working on features I'd normally tackle: adding endpoints, debugging issues, writing tests </li> </ul> What I did: <ul> <li>Disabled GitHub Copilot completely. Uninstalled the extension. </li> <li>Set up Claude Code (via their CLI and VSCode integration). </li> <li>Kept everything else identical: same repos, same Git flow, same daily work. </li> <li>Tracked time on each task to see if there was a real difference. </li> </ul> Ground rules: <ul> <li>I couldn't use Copilot as a fallback. This was an honest comparison. </li> <li>I logged every time I got frustrated or felt like Claude Code was slowing me down. </li> <li>I kept track of bugs I caught vs. 
bugs I missed. </li>
</ul>
The goal: Does Claude Code work as a day-to-day replacement for Copilot, or does it force me back?
<h2 id="heading-what-worked-better">What Worked Better</h2>
<h3 id="heading-accuracy">Accuracy</h3>
Copilot sometimes suggests things that are close but not quite right. It might finish a regex pattern 80% correctly, and I have to tweak it. It happens maybe 20% of the time. Claude Code was more accurate. In the first week, I noticed fewer "close but wrong" suggestions. When I typed a function signature, Claude got the implementation right more often than Copilot did. One example: I was writing a utility to parse JSON and handle errors. Copilot suggested:
<pre><code class="language-python">def parse_json(data):
    try:
        return json.loads(data)
    except:
        return None
</code></pre>
That's sloppy. It catches all exceptions and silently fails. Claude Code suggested:
<pre><code class="language-python">def parse_json(data):
    try:
        return json.loads(data)
    except json.JSONDecodeError as e:
        logging.error(f"Failed to parse JSON: {e}")
        return None
    except Exception as e:
        logging.error(f"Unexpected error: {e}")
        raise
</code></pre>
Better error handling. More production-ready. That's a real difference. I estimate Claude Code's suggestions were "immediately usable" about 85% of the time. Copilot was more like 70%.
<h3 id="heading-understanding-context">Understanding Context</h3>
Claude Code seems to understand your project better than Copilot. When I opened a file with Claude Code context, it knew:
<ul>
<li>My project's naming conventions (I use <code>fetch_</code> for async functions, <code>get_</code> for sync). </li>
<li>My error handling style. </li>
<li>What libraries I was using. </li>
</ul>
Copilot sometimes forgot these patterns or suggested things using the wrong library. Claude Code was more consistent. One morning I was adding a new endpoint to an existing API. 
I typed the route signature:
<pre><code class="language-python">@app.post("/api/users")
async def create_user(data: UserPayload):
</code></pre>
Copilot might suggest:
<pre><code class="language-python">    response = requests.post(...)
</code></pre>
(Wrong! That's sync. This function is async.) Claude Code suggested:
<pre><code class="language-python">    async with httpx.AsyncClient() as client:
        response = await client.post(...)
</code></pre>
It remembered that the entire codebase uses async/await and httpx for async calls. That's attention to detail.
<h3 id="heading-reasoning-about-requirements">Reasoning About Requirements</h3>
Sometimes Copilot just completes code. It doesn't think about whether it makes sense. Claude Code seemed to reason about whether the suggestion was actually what you wanted. A few times, when I was writing ambiguous code, Claude Code offered a clarifying suggestion instead of just finishing it. Example: I started a function for sorting users:
<pre><code class="language-python">def sort_users(users):
</code></pre>
Copilot would auto-complete with some sorting logic, but I'd have to check if it was what I meant. Claude Code would sometimes suggest:
<pre><code class="language-python">def sort_users(users, key="created_at", reverse=False):
</code></pre>
It was thinking: "Sorting is ambiguous. What key? What order?" It was right more often than not.
<h2 id="heading-what-broke-or-slowed-things-down">What Broke (Or Slowed Things Down)</h2>
<h3 id="heading-response-time">Response Time</h3>
This was the biggest issue. Copilot is instant. I type <code>def get_</code> and it finishes before I can blink. It's autocomplete, and autocomplete needs to be fast. The latency is maybe 100-200ms. Claude Code has a noticeable delay. Maybe 1-2 seconds before suggestions appear. On day one, that felt fine – I had time to think. By day two, I was annoyed. By day three, I was genuinely frustrated. Over a day of coding, that adds up. 
If you're typing 20 functions and each one has a 2-second delay, that's 40 seconds of just waiting. It doesn't sound like much, but it breaks flow. Flow is where the good coding happens. By day three, I was getting frustrated. I'd type faster than Claude Code could suggest, which meant I'd often just finish the code myself. The second a suggestion appeared, I'd already moved on. Defeating the purpose. I tested this by tracking time. Same function, same complexity: <ul> <li>With Copilot: 3 minutes (including auto-complete time) </li> <li>With Claude Code: 5 minutes (waiting for suggestions + finishing manually) </li> </ul> The delay isn't theoretical. It's real and measurable. The truth: Copilot is an autocomplete tool. It needs sub-second latency. Claude Code, being more powerful, is inherently slower. That's a fundamental tradeoff. You can't have both "instant" and "smart." Choose one. <h3 id="heading-no-inline-acceptance">No Inline Acceptance</h3> With Copilot, I press Tab to accept. It's in my muscle memory. Tab = accept. Claude Code doesn't work exactly the same way. I had to click or use a different keyboard shortcut. Small thing, but it broke my rhythm constantly. I'd write code, see a suggestion, and instinctively press Tab. Nothing would happen. Then I'd remember: "Oh right, it's a different tool." After two weeks, I never fully got used to it. <h3 id="heading-disconnected-from-flow">Disconnected From Flow</h3> Copilot is so embedded in the editor that I don't think about it. It's just there, like spellcheck. Claude Code feels like a separate tool I'm using, which means I'm more aware of it. That sounds like a good thing, but it's actually more cognitively expensive. I wanted to type and have suggestions appear. Instead, I felt like I was using a tool. There's a difference. It's the same difference between walking and thinking about walking. When you're thinking about your walking mechanics, you walk worse. 
This affected my productivity more than I expected. On day three, I found myself just typing manually instead of waiting for suggestions. It wasn't a conscious decision. I'd just start typing and then remember "oh, the suggestion came in." By then I'd already finished half the function myself. <h3 id="heading-limited-to-the-file">Limited to the File</h3> Copilot understands your entire project. It knows what's in other files, what libraries you import, what conventions you follow. If I'm importing a utility function that doesn't exist yet, Copilot knows to suggest the import with the path I'd use. Claude Code seemed more limited to the current file. Sometimes it would suggest imports that weren't already in the file, or use patterns different from the rest of my codebase. Not often, but enough to notice. On one occasion, it suggested a database query pattern that was different from my whole codebase. It would've worked, but it would've been inconsistent. This is less of a limitation and more of a design difference. Claude Code is built for depth on individual files, not breadth across a project. <h2 id="heading-the-first-week-vs-the-second-week">The First Week vs The Second Week</h2> Week 1: I was excited. Claude Code felt smarter. I noticed the accuracy advantage. But the latency was starting to annoy me. Week 2: The novelty wore off. The latency was more annoying. I was missing Copilot's speed. I found myself disabling Claude Code's suggestions and typing manually more often, which defeated the purpose. "If I'm typing it all manually anyway, why switch?" By day 10, I was typing code faster with Claude Code disabled than with it enabled. That's when I knew it wasn't working for me. <h2 id="heading-why-i-went-back">Why I Went Back</h2> On day 14, I re-enabled Copilot. The first thing I noticed: speed. Code was completing again instantly. My rhythm came back. I hit Tab, it accepted, I moved on. That's the entire appeal of Copilot-it's frictionless. 
I also realized how much I'd been manually typing. On days 10-14, I was writing more code by hand because the suggestions felt too slow to be worth waiting for. Without realizing it, I'd completely stopped using Claude Code's suggestions. I was just typing. That's the worst of both worlds: no AI help and the cognitive burden of being aware you're using a tool that's not helping. Was I sacrificing accuracy? A little. But I'm accurate enough that I catch mistakes in review. For day-to-day, Copilot is fine. The second thing: it just works. No weird setup, no integration issues. It's part of VSCode. It's always there. By day 15, I was back to normal productivity, maybe even higher because the flow was better. <h2 id="heading-the-honest-verdict">The Honest Verdict</h2> Claude Code isn't a Copilot replacement. It's not worse. It's different. It's like comparing a calculator to a calculator app on your phone. One is designed for speed and muscle memory. One is designed to be a full computer in your pocket. They're not competitors. If I'd tried Claude Code expecting it to be better at debugging, I would've been happy. I was trying it expecting it to replace my autocomplete, which is where it falls flat. The experiment was valuable, though. It taught me that: <ol> <li>Latency matters more than I expected. A 2-second delay breaks flow. </li> <li>Familiarity matters. Tab to accept is burned into my muscle memory. </li> <li>Tool stacking works. Claude Code is great for debugging. Copilot is great for autocomplete. Together they're better than either alone. </li> </ol> <h2 id="heading-what-i-actually-use-now">What I Actually Use Now</h2> I didn't abandon Claude Code. I just changed how I use it. <ul> <li>Claude Code: For debugging, analysis, and big changes. "Why is this function slow?" "Refactor this for readability." I invoke it deliberately when I need thinking, not continuous autocomplete. </li> <li>Copilot: For routine coding. 
Finishing functions, auto-completing imports, normal flow. </li> </ul> That's the working solution. Claude Code is powerful, but it's not a Copilot replacement for daily work. It's a different tool for a different use case. <h2 id="heading-copilot-vs-claude-code-the-breakdown">Copilot vs Claude Code: The Breakdown</h2> Copilot is better for: <ul> <li>Pure autocomplete speed </li> <li>Routine, well-understood coding </li> <li>Low friction, high flow state </li> <li>Simple suggestions </li> </ul> Claude Code is better for: <ul> <li>Complex suggestions that require reasoning </li> <li>Debugging and analysis </li> <li>Understanding intent (not just completing code) </li> <li>Asking questions about code you've written </li> </ul> If you're a Copilot user thinking about switching, don't do it as a straight replacement. Claude Code isn't faster. It's smarter, but slower, and for day-to-day autocomplete, faster wins. Try using both. Use Copilot for normal coding, Claude Code for debugging and complex changes. If you only want to pay for one, stick with Copilot. It's cheaper, it's faster, and it does the job. If you're a heavy debugger and you spend a lot of time analyzing code, Claude Code might be worth it. But as a Copilot replacement? No. <h2 id="heading-a-word-on-developer-experience">A Word on Developer Experience</h2> What surprised me wasn't just the latency. It was how much I missed the seamlessness of Copilot. With Copilot, I don't think about it. It's like breathing-automatic. I type, it suggests, I accept or reject, I move on. With Claude Code, I was constantly aware I was using a tool. I'd finish typing before the suggestion appeared. I'd have to remember the keyboard shortcut. I'd have to context-switch to look at the suggestion. That awareness is exhausting. It's why flow state is so important to programming. The best tools get out of your way. Copilot gets out of the way. Claude Code, for autocomplete purposes, doesn't. 
Developer experience isn't a nice-to-have. It's core to productivity. A tool that's 10% smarter but 50% more annoying is worse, not better. <h2 id="heading-what-would-make-me-switch">What Would Make Me Switch</h2> <ul> <li>Claude Code needs to get faster. Sub-second latency for suggestions. </li> <li>It needs better editor integration. Tab to accept, like Copilot. </li> <li>It needs to understand the full project, not just the current file. </li> </ul> Once those three things happen, it'd be competitive. Until then, Copilot is still the better choice for daily coding work. <h2 id="heading-final-thoughts">Final Thoughts</h2> This experiment taught me something: better isn't always better. Claude Code is arguably smarter than Copilot. But Copilot is more efficient. For autocomplete, efficiency matters more than intelligence. It's like comparing a sports car to a Jeep. The sports car is faster on a highway. The Jeep is better on a mountain trail. Neither is "better." They're different. Copilot is trying to predict the next line of code fast. Claude Code is trying to understand your code deeply. They're solving different problems. I went back to Copilot not because Claude Code is bad. It's actually impressive. But it's a different category of tool. Using it for autocomplete is like using a hammer when you need a screwdriver. The hammer might be fancier, but the screwdriver does the job. What surprised me most was how much latency matters. I didn't expect a 2-second delay to be that noticeable. But when you're in the zone, typing code, and the autocomplete lags, it completely breaks your flow. It's not about the absolute time. It's about the interruption. Don't take my word for it though. Run your own two-week experiment. Pick a tool, commit to it, and see what happens. Track your productivity. Track your frustration. The best tool is the one you'll actually use. And you can only find that out by using it. 
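If you do run your own experiment, it helps to put the numbers side by side rather than rely on impressions. The sketch below only redoes the arithmetic already quoted in this article (20 suggestions at roughly 2 seconds each, and 3 versus 5 minutes for the same function). The function names are hypothetical, and the inputs are rough figures, not new measurements.

```python
def waiting_seconds(suggestions: int, delay_s: float) -> float:
    """Total time spent waiting for suggestions to appear."""
    return suggestions * delay_s


def slowdown_percent(baseline_min: float, other_min: float) -> float:
    """How much longer other_min is than baseline_min, in percent."""
    return (other_min - baseline_min) / baseline_min * 100


if __name__ == "__main__":
    # 20 functions, each with a 2-second suggestion delay
    print(waiting_seconds(20, 2.0))
    # 3 minutes with one tool vs 5 minutes with the other
    print(round(slowdown_percent(3, 5), 1))
```

Swap in your own timings from a day of normal work. A handful of logged tasks per tool is probably enough to see whether the delay matters for the way you type.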
<h2 id="heading-whats-next">What's Next?</h2> If you found this useful, I write about Docker, AI tools, and developer workflows every week. I'm Balajee Asish - Docker Captain, freeCodeCamp contributor, and currently building my way through the AI tools space one project at a time. Got questions or built something similar? Drop a comment below or find me on <a href="https://github.com/balajee-asish">GitHub</a> and <a href="https://linkedin.com/in/balajee-asish">LinkedIn</a>. Happy building. </article> <article> <h1> How to Stop Letting AI Agents Guess Your Requirements </h1> Daniel Nwaneri — Tue, 24 Mar 2026 00:35:37 +0000 I spent 64% of my weekly Claude budget before Wednesday building a tool designed to reduce Claude usage. That's the kind of irony that deserves its own specification. The tool is spec-writer: a Claude Code skill that takes a vague feature request and generates a structured spec, technical plan, and task breakdown before a single line of code gets written. The problem it solves is one most developers hit within their first week of using AI coding agents seriously: the agent writes confidently in the wrong direction and you pay for it twice, once in tokens, once in rewrites. This tutorial shows you how to install spec-writer, how to invoke it on a real feature, and how to read the output so you can catch the assumptions that would have wasted your time. 
<h2 id="heading-table-of-contents">Table of Contents</h2> <ol> <li><a href="#heading-the-problem-with-prompting-agents-directly">The Problem with Prompting Agents Directly</a> </li> <li><a href="#heading-what-spec-driven-development-is">What Spec-Driven Development Is</a> </li> <li><a href="#heading-how-spec-writer-works">How spec-writer Works</a> </li> <li><a href="#heading-how-to-install-spec-writer">How to Install spec-writer</a> </li> <li><a href="#heading-how-to-write-your-first-spec">How to Write Your First Spec</a> </li> <li><a href="#heading-how-to-read-the-output">How to Read the Output</a> </li> <li><a href="#heading-how-to-hand-the-spec-to-your-agent">How to Hand the Spec to Your Agent</a> </li> <li><a href="#heading-where-to-go-next">Where to Go Next</a> </li> </ol> <h2 id="heading-the-problem-with-prompting-agents-directly">The Problem with Prompting Agents Directly</h2> Here is what happens when you skip the spec. You have a feature in your head: "Add a way for users to export their data." You open Claude Code and describe it. The agent produces code. It looks right. You run it. It's mostly right – except it exports everything including soft-deleted records, it doesn't paginate, it times out on large accounts, and it has no authentication check on the export endpoint. None of those things were in your prompt. The agent guessed, and it guessed plausibly – which is worse than guessing obviously wrong. You didn't notice until testing. This is the fundamental problem with prompting agents directly on anything non-trivial: your prompt carries your conscious requirements, but every feature has a shadow of requirements you didn't think to state. And the agent fills that shadow with assumptions. Most of the time, those assumptions are reasonable. Some of the time, they're wrong in ways that take hours to unravel. The failure mode isn't hallucination. It's the agent being exactly as helpful as the prompt allowed, which wasn't helpful enough. 
Spec-Driven Development addresses this directly. The methodology – documented extensively by practitioners like Julián Deangelis – argues that a written spec isn't documentation overhead. It's the mechanism that forces you to make decisions before the agent does. <h2 id="heading-what-spec-driven-development-is">What Spec-Driven Development Is</h2> Spec-Driven Development is the practice of writing a structured specification before you write code or prompt an agent. The spec defines what the feature must do, what assumptions are being made, and what tasks the implementation breaks into. The key insight is what a spec is for. A spec is not trying to replace code. It's trying to surface the decisions that would otherwise be invisible. The agent will make those decisions either way: with a spec, you make them first. Without a spec, you discover them during testing. The strongest counterargument to SDD comes from Gabriella Gonzalez: a sufficiently detailed spec is just code. She's right that some specs devolve into pseudocode so specific they might as well be implementations. But that's a spec written at the wrong level of abstraction. The goal is to name the decisions, not to pre-implement them. "Only authenticated users can trigger this export" is a decision. "Call <code>verifyJWT(token)</code> and return 401 if it fails" is implementation. The spec needs the first. The agent handles the second. SDD has three levels: <ol> <li>Spec-First: write a spec before every feature and hand it to the agent as context. This is the entry point and the workflow this tutorial focuses on. </li> <li>Spec-Anchored: the spec lives in the repository and evolves alongside the code. When requirements change, you update the spec and re-prompt the agent to realign. </li> <li>Spec-as-Source: the spec is the primary artifact. Code is generated from it and considered disposable. This is the most ambitious level and the direction many teams are moving toward. 
</li> </ol> spec-writer gets you to Spec-First immediately, with no ceremony. <h2 id="heading-how-spec-writer-works">How spec-writer Works</h2> spec-writer is a Claude Code skill – a markdown file that loads into the agent's context and changes how it responds when invoked. The skill follows one rule: generate first, flag assumptions inline. Instead of asking you clarifying questions before producing output, it generates the full spec immediately and marks every decision it made without your explicit input using <code>[ASSUMPTION: ...]</code> tags. Then you correct what's wrong. This is faster than Q&A because it makes the decisions visible in a form you can react to rather than anticipate. The output has three sections in fixed order: <ol> <li>SPEC: the what. One-line purpose, user stories, requirements, edge cases, and acceptance criteria in Given/When/Then format. </li> <li>PLAN: the how. Stack and architecture decisions, data model changes, API contracts, testing strategy, and security constraints. </li> <li>TASKS: the breakdown. Ordered, self-contained tasks each completable in a single agent session, each with its own acceptance criteria. </li> </ol> After the three sections, the skill produces an Assumptions summary: every <code>[ASSUMPTION: ...]</code> from the output, ranked by impact. This is the part you review before handing anything to the agent. The skill is compatible with <a href="https://github.com/github/spec-kit">GitHub Spec Kit</a> and <a href="https://github.com/Fission-AI/OpenSpec">OpenSpec</a>. If you use either framework, save the spec output to your <code>.specify/</code> or <code>openspec/changes/</code> directory and continue from there. <h2 id="heading-how-to-install-spec-writer">How to Install spec-writer</h2> spec-writer uses the Agent Skills standard, which means the same SKILL.md file works across Claude Code, Cursor, GitHub Copilot, Gemini CLI, and any other agent that supports the standard. 
You install it once and it works everywhere. <h3 id="heading-installation">Installation</h3> Create the skills directory if it doesn't exist and clone the repo: <pre><code class="language-bash">mkdir -p ~/.claude/skills
git clone https://github.com/dannwaneri/spec-writer.git ~/.claude/skills/spec-writer
</code></pre> On Windows PowerShell: (Note: PowerShell uses backtick (<code>`</code>) for line continuation, not backslash.) <pre><code class="language-powershell">New-Item -ItemType Directory -Force -Path "$HOME\.claude\skills"
git clone https://github.com/dannwaneri/spec-writer.git "$HOME\.claude\skills\spec-writer"
</code></pre> That's the entire installation. No package to install, no configuration file to edit, no API key. The skill is a markdown file. The agent reads it. <h3 id="heading-verification">Verification</h3> Open Claude Code and type: <pre><code class="language-plaintext">/spec-writer test </code></pre> If the skill is installed correctly, the agent will read the SKILL.md and produce a spec structure – even for "test" as input. You'll see the three sections and the Assumptions summary. If nothing happens, confirm that the <code>~/.claude/skills/spec-writer/SKILL.md</code> file exists. <h2 id="heading-how-to-write-your-first-spec">How to Write Your First Spec</h2> The invocation pattern is: <pre><code class="language-plaintext">/spec-writer [your feature description] </code></pre> The description can be as rough as you'd write in a Slack message. The skill's job is to structure it. You don't need to be precise. Being precise is what the spec is for. <h3 id="heading-a-real-example">A Real Example</h3> Here is a real invocation from the Foundation project, a federated knowledge base built on Cloudflare Workers: <pre><code class="language-plaintext">/spec-writer Add CLI capture for Claude Code sessions stored in ~/.claude/projects/ </code></pre> That's twelve words. 
No mention of the tech stack, no data model details, no authentication requirements, no error handling spec. All of that comes out in the spec. Here is the abbreviated output: SPEC One-line purpose: Developers can capture Claude Code session logs from their local filesystem into Foundation before they are deleted after 30 days. Users and use cases: <ul> <li>As a developer using Claude Code daily, I want my coding sessions captured automatically so that architectural decisions and debugging patterns don't disappear after 30 days. </li> <li>As a Foundation user, I want CLI sessions and browser extension captures in the same knowledge base so that I have one place to query across all my AI interactions. </li> </ul> Acceptance criteria: <pre><code class="language-plaintext">Given a developer with Claude Code sessions in ~/.claude/projects/ When they run the Foundation CLI capture command Then a list of uncaptured sessions is displayed, sorted by most recent first Given a session already in Foundation When the developer runs capture again Then that session does not appear in the review list Given a malformed .jsonl file When the capture command processes it Then an error is logged for that file and processing continues on remaining sessions </code></pre> PLAN Stack and architecture: <pre><code class="language-plaintext">CLI capture runs locally — it needs to read the local filesystem, which Cloudflare Workers cannot do. 
[ASSUMPTION: CLI capture is a local Node.js or Bun script that calls the Foundation API to insert sessions, rather than a Worker itself] </code></pre> API contracts: <ul> <li><code>POST /api/sessions/import</code> – accepts parsed session JSON, returns session ID </li> <li>Returns 409 if session already exists </li> <li>Triggers evaluator pipeline after successful insert </li> </ul> TASKS <pre><code class="language-plaintext">## Task 1: .jsonl parser **What to build:** A parser that reads a Claude Code .jsonl session file and outputs Foundation's internal message format. **Files likely affected:** cli/src/parsers/claudeCode.js (create) **Acceptance criteria:** 1. Parses a valid .jsonl file into the correct message format 2. Handles malformed lines without crashing — logs and skips 3. Returns empty array for empty file **Dependencies:** none </code></pre> Assumptions to review: <pre><code class="language-plaintext">1. CLI capture is a local script calling the Foundation API, not a Worker Impact: HIGH Correct this if: you want a purely serverless approach 2. Manual curation before capture, not automatic bulk import Impact: HIGH Correct this if: you want automatic background capture 3. Session ID from .jsonl filename is the deduplication key Impact: MEDIUM Correct this if: session IDs are stored differently in your schema 4. No sensitive data scrubbing in v1 Impact: MEDIUM Correct this if: your sessions contain credentials or keys </code></pre> Twelve words in, four decisions surfaced immediately – three of which had real architectural implications. The third assumption ("Session ID from .jsonl filename is the deduplication key") is the one that would have caused the most subtle bug. The agent would have implemented deduplication based on the filename and it would have worked until a session was renamed. The spec caught it before a line of code was written. 
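To make Task 1's acceptance criteria concrete, here is a minimal sketch of what a parser satisfying them could look like. It is written in Python purely for illustration (the spec above targets <code>cli/src/parsers/claudeCode.js</code>), the function name is hypothetical, and since Foundation's internal message format isn't shown, the sketch assumes each line is a JSON object and simply returns those objects as dicts.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("capture")


def parse_session_file(path: Path) -> list[dict]:
    """Parse a .jsonl session file into a list of message dicts.

    Malformed lines are logged and skipped; an empty file yields [].
    """
    messages = []
    with path.open(encoding="utf-8") as handle:
        for line_number, raw in enumerate(handle, start=1):
            line = raw.strip()
            if not line:
                continue  # blank lines carry no message
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                # Criterion 2: log and skip instead of crashing
                logger.warning("%s:%d skipped: %s", path.name, line_number, exc)
                continue
            if isinstance(record, dict):
                messages.append(record)
            else:
                logger.warning("%s:%d skipped: not a JSON object", path.name, line_number)
    return messages
```

Each acceptance criterion maps to one branch: valid lines are appended, malformed lines hit the <code>except</code> path, and an empty file never enters the loop, so the function returns an empty list.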
<h2 id="heading-how-to-read-the-output">How to Read the Output</h2> The output is designed to be scanned for <code>[ASSUMPTION: ...]</code> tags first, read for the tasks second. <h3 id="heading-reading-the-assumptions">Reading the Assumptions</h3> Every <code>[ASSUMPTION: ...]</code> tag marks a place where the agent filled in something you didn't specify. Your job is to go through the Assumptions summary and decide for each one: <ul> <li>Correct: the assumption is right, leave it </li> <li>Override: the assumption is wrong, restate it and re-run the spec </li> <li>Defer: the assumption doesn't matter for this iteration, mark it and move on </li> </ul> The impact rating tells you which assumptions to fix before you start coding. HIGH-impact assumptions affect architecture or data model. If they're wrong, fixing them requires rework. LOW-impact assumptions affect behavior details that are easy to change later. <h3 id="heading-reading-the-acceptance-criteria">Reading the Acceptance Criteria</h3> The acceptance criteria in Given/When/Then format are the most useful part of the spec for catching scope errors. Read each one and ask: is this actually what I want? Criteria are binary by design. "Returns 401 when unauthenticated" is a criterion. "Works correctly" is not. If you find yourself reading a criterion and thinking "well, it depends", then that's a signal that the criterion is hiding an assumption. Restate it. <h3 id="heading-reading-the-tasks">Reading the Tasks</h3> The tasks are ordered and self-contained. Each task produces a verifiable change. Before you hand any task to an agent, check two things: <ol> <li>Does the task have all the context it needs? If a task says "follow the existing auth pattern" and you haven't pointed the agent at your auth code, it will guess. </li> <li>Do the acceptance criteria match what you'd actually test? If the criteria are vague, tighten them before the agent sees the task. 
</li> </ol> <h2 id="heading-how-to-hand-the-spec-to-your-agent">How to Hand the Spec to Your Agent</h2> The spec is context, not a prompt. When you start an agent session for a task, include the relevant spec sections alongside the task description. For Task 1 from the example above, your agent session might open like this: <pre><code class="language-plaintext">Context:
- This is a federated knowledge base built on Cloudflare Workers, D1, and Vectorize
- Sessions are stored in ~/.claude/projects/ as .jsonl files
- The API runs at https://<your-worker>.workers.dev

Spec:
[paste the SPEC and PLAN sections]

Task:
[paste Task 1]
</code></pre> The context block is just an example. Replace it with your own project's tech stack, file locations, and API URL. The point is to give the agent the same context a new team member would need on day one. The agent now has requirements, architecture context, and a single scoped task with binary acceptance criteria. It cannot guess the deduplication key incorrectly because the spec already resolved that assumption. It cannot skip error handling because the acceptance criteria explicitly require it. This is the workflow the spec is designed for. The spec doesn't replace the agent. Rather, it removes the decisions from the agent's hands and puts them in yours, before the work starts. <h3 id="heading-saving-the-spec-for-later">Saving the Spec for Later</h3> If you want to move toward Spec-Anchored development – where the spec lives in the repository – save the output to a <code>specs/</code> directory in your project: <pre><code class="language-bash"># Create specs directory
mkdir -p specs

# Save your spec
# Paste the output into specs/cli-capture.md
</code></pre> When requirements change, update the spec and re-prompt the agent to realign the implementation. The spec becomes the source of truth, not the code comments. <h2 id="heading-where-to-go-next">Where to Go Next</h2> Try it on your next feature before you write a line of code. 
The assumptions it flags will tell you something about your feature you hadn't consciously decided yet – and correcting the HIGH-impact ones before you hand anything to an agent is the whole point. Skipping that step is the same as prompting directly. If your project is growing, move toward Spec-Anchored. Save specs in your repository under <code>specs/</code>. When a new contributor joins or an agent starts a session cold, the specs give them the decisions that got made without requiring them to reverse-engineer the code. The strongest ongoing challenge to this workflow is Gabriella Gonzalez's argument that detailed specs become code. If your specs are getting implementation-specific, you've crossed a line. Pull back to decisions – "only authenticated users can trigger this" – and leave implementation to the agent. The spec's job is to name what the agent would have guessed wrong, not to write the feature in prose. The Agent Skills standard now works across Claude Code, GitHub Copilot, Cursor, and Gemini CLI. The spec-writer repo is at <a href="https://github.com/dannwaneri/spec-writer">github.com/dannwaneri/spec-writer</a>. The irony of spending 64% of a Claude budget building a token-efficiency tool is real. But the spec surfaced four decisions on a twelve-word prompt. The third one – the deduplication key assumption – would have produced a bug that worked perfectly until a session got renamed. That's not a hallucination. That's the agent being exactly as helpful as the prompt allowed. The spec is how you raise the ceiling on what "helpful" means. </article> <article> <h1> How to Build an MCP Server with Python, Docker, and Claude Code </h1> Balajee Asish Brahmandam — Tue, 10 Mar 2026 21:41:44 +0000 Every MCP tutorial I've found so far has followed the same basic script: build a server, point Claude Desktop at it, screenshot the chat window, done. This is fine if you want a demo. 
But it's not fine if you want something you can ship, defend in an interview, or hand to another developer without a README that starts with "first, install this Electron app." So I built an MCP server in Python, containerized it with Docker, and wired it into Claude Code – all from the terminal, no GUI required. This article walks through the full loop in one afternoon: what MCP actually is, why it matters now that OpenAI and Google have adopted it, the real security problems nobody puts in their tutorial (complete with CVEs), and every command you need to go from an empty directory to a working tool. If you're between jobs and need a portfolio project that shows you understand how AI tooling actually works under the hood, this is the one. <h2 id="heading-table-of-contents">Table of Contents</h2> <ul> <li><a href="#heading-what-you-will-build">What You Will Build</a> </li> <li><a href="#heading-prerequisites">Prerequisites</a> </li> <li><a href="#heading-what-is-mcp-and-why-should-you-care">What is MCP (and Why Should You Care)?</a> </li> <li><a href="#heading-why-claude-code-instead-of-claude-desktop">Why Claude Code Instead of Claude Desktop?</a> </li> <li><a href="#heading-step-1-build-the-mcp-server">Step 1: Build the MCP Server</a> </li> <li><a href="#heading-step-2-test-it-locally">Step 2: Test It Locally</a> </li> <li><a href="#heading-step-3-dockerize-it">Step 3: Dockerize It</a> </li> <li><a href="#heading-step-4-wire-it-into-claude-code">Step 4: Wire It Into Claude Code</a> </li> <li><a 
href="#heading-step-5-use-it">Step 5: Use It</a> </li> <li><a href="#heading-security-what-the-other-tutorials-leave-out">Security: What the Other Tutorials Leave Out</a> </li> <li><a href="#heading-what-to-do-next">What to Do Next</a> </li> <li><a href="#heading-wrapping-up">Wrapping Up</a> </li> </ul> <h2 id="heading-what-you-will-build">What You Will Build</h2> By the end of this tutorial, you will have: <ul> <li>A Python MCP server that exposes custom tools to any MCP-compatible AI client </li> <li>A Docker container that packages the server for reproducible deployment </li> <li>A working connection between that container and Claude Code in your terminal </li> <li>An understanding of the security risks involved and how to mitigate the worst of them </li> </ul> The server we are building is a project scaffolder. You give it a project name and a language, and it generates a starter directory structure with the right files. It's simple enough to build in an afternoon, but useful enough to actually put on your résumé. <h2 id="heading-prerequisites">Prerequisites</h2> You will need the following installed on your machine: <ul> <li>Python 3.10+ (check with <code>python3 --version</code>) </li> <li>Docker (check with <code>docker --version</code>) </li> <li>Claude Code with an active Claude Pro, Max, or API plan (check with <code>claude --version</code>) </li> <li>Node.js 20+ (required by Claude Code – check with <code>node --version</code>) </li> <li>A terminal you are comfortable in </li> </ul> If you don't have Claude Code installed yet, follow the <a href="https://code.claude.com/docs/en/getting-started">official installation instructions</a>. The npm installation method is deprecated, so make sure you use the native binary installer instead. 
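If you would rather check everything in one go, a short script can do it. This is a rough sketch, not part of the tutorial's project: the function names are made up for illustration, and it only confirms that the commands are on your PATH and that the interpreter is new enough. It does not verify the Node.js or Docker versions, so the individual <code>--version</code> checks above are still worth running.

```python
import shutil
import sys

# Commands the tutorial expects to find on PATH
REQUIRED_COMMANDS = ["docker", "claude", "node"]


def missing_commands(commands: list[str]) -> list[str]:
    """Return the commands that cannot be found on PATH."""
    return [name for name in commands if shutil.which(name) is None]


def python_is_supported(version_info=sys.version_info) -> bool:
    """True when the given interpreter version is Python 3.10 or newer."""
    return tuple(version_info[:2]) >= (3, 10)


if __name__ == "__main__":
    print("Python 3.10+:", python_is_supported())
    for name in missing_commands(REQUIRED_COMMANDS):
        print(f"Not found on PATH: {name}")
```

Run it with <code>python3</code> before you start. No output after the first line means all three commands were found.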
<h2 id="heading-what-is-mcp-and-why-should-you-care">What is MCP (and Why Should You Care)?</h2> The Model Context Protocol (MCP) is an open standard that lets AI models connect to external tools and data sources. Anthropic released it in November 2024, and within a year it became the default way to extend what an LLM can do. OpenAI adopted it in March 2025. Google DeepMind followed in April. The protocol now has over 97 million monthly SDK downloads and more than 10,000 active servers. The easiest way to think about MCP is as a USB-C port for AI. Before MCP, every AI provider had its own way of calling tools. OpenAI had function calling. Google had their own format. If you wanted your tool to work with multiple models, you had to implement it multiple times. MCP gives you one interface that works everywhere. Here is how the pieces fit together: <ul> <li>An MCP server exposes tools, resources, and prompts. It is your code. </li> <li>An MCP client (like Claude Code, Claude Desktop, or Cursor) discovers those tools and calls them on behalf of the LLM. </li> <li>The transport is how they communicate. For local servers, that's usually stdio (standard input/output). For remote servers, it's HTTP. </li> </ul> When you type a message in Claude Code and it decides to use one of your tools, here is what happens: Claude Code sends a JSON-RPC 2.0 message to your server over stdin, your server executes the tool and writes the result to stdout, and Claude Code reads it back. The LLM never talks to your server directly. The client is always in the middle. If you want the deeper architecture breakdown, freeCodeCamp already has a <a href="https://www.freecodecamp.org/news/how-does-an-mcp-work-under-the-hood/">solid explainer on how MCP works under the hood</a>. Here, I will focus on building. <h2 id="heading-why-claude-code-instead-of-claude-desktop">Why Claude Code Instead of Claude Desktop?</h2> Most MCP tutorials use Claude Desktop as the client. 
That works, but Claude Code has a few advantages for developers: <ol> <li>It lives in your terminal. No GUI to configure. No JSON files to hand-edit in hidden config directories. You add an MCP server with one command and you are done. </li> <li>It's already where you code. If you're writing the server, testing it, and connecting it, doing all of that in the same terminal session cuts the context switching. </li> <li>It works on headless machines. If you're SSHing into a dev box or running in CI, Claude Desktop isn't an option. Claude Code is. </li> <li>It's also an MCP server itself. Claude Code can expose its own tools (file reading, writing, shell commands) to other MCP clients via <code>claude mcp serve</code>. That's a neat trick we won't use today, but it's worth knowing about. </li> </ol> The relevant commands: <pre><code class="language-bash"># Add an MCP server
claude mcp add <name> -- <command>

# List configured servers
claude mcp list

# Remove a server
claude mcp remove <name>

# Check MCP status inside Claude Code
/mcp
</code></pre> <h2 id="heading-step-1-build-the-mcp-server">Step 1: Build the MCP Server</h2> We're using <a href="https://github.com/jlowin/fastmcp">FastMCP</a>, a Python framework that handles all the protocol plumbing so you can focus on your tools. The official MCP Python SDK bundles a version of it as <code>mcp.server.fastmcp</code>, which is what the <code>mcp[cli]</code> package installs. Create a new project directory and set it up: <pre><code class="language-bash">mkdir mcp-scaffolder && cd mcp-scaffolder
python3 -m venv .venv
source .venv/bin/activate
pip install "mcp[cli]>=1.25,<2"
</code></pre> Why pin the version? The MCP Python SDK v2.0 is in development and will change the transport layer significantly. Pinning to <code>>=1.25,<2</code> keeps your server working until you're ready to migrate. 
Now create <code>server.py</code>: <pre><code class="language-python"># server.py
from mcp.server.fastmcp import FastMCP
import os
import json

mcp = FastMCP("project-scaffolder")

# Templates for different languages.
# Literal braces are doubled ({{ and }}) because each template is rendered with str.format().
TEMPLATES = {
    "python": {
        "files": {
            "main.py": '"""Entry point."""\n\n\ndef main():\n    print("Hello, world!")\n\n\nif __name__ == "__main__":\n    main()\n',
            "requirements.txt": "",
            "README.md": "# {name}\n\nA Python project.\n\n## Setup\n\n```bash\npip install -r requirements.txt\npython main.py\n```\n",
            ".gitignore": "__pycache__/\n*.pyc\n.venv/\n",
        },
        "dirs": ["tests"],
    },
    "node": {
        "files": {
            "index.js": 'console.log("Hello, world!");\n',
            "package.json": '{{\n  "name": "{name}",\n  "version": "1.0.0",\n  "main": "index.js"\n}}\n',
            "README.md": "# {name}\n\nA Node.js project.\n\n## Setup\n\n```bash\nnpm install\nnode index.js\n```\n",
            ".gitignore": "node_modules/\n",
        },
        "dirs": [],
    },
    "go": {
        "files": {
            "main.go": 'package main\n\nimport "fmt"\n\nfunc main() {{\n\tfmt.Println("Hello, world!")\n}}\n',
            "go.mod": "module {name}\n\ngo 1.21\n",
            "README.md": "# {name}\n\nA Go project.\n\n## Setup\n\n```bash\ngo run main.go\n```\n",
            ".gitignore": "bin/\n",
        },
        "dirs": ["cmd", "internal"],
    },
}


@mcp.tool()
def scaffold_project(name: str, language: str) -> str:
    """Create a new project directory structure.

    Args:
        name: The project name (used as the directory name)
        language: The programming language - one of: python, node, go
    """
    language = language.lower().strip()
    if language not in TEMPLATES:
        return json.dumps({
            "error": f"Unsupported language: {language}",
            "supported": list(TEMPLATES.keys()),
        })

    template = TEMPLATES[language]
    base_path = os.path.join(os.getcwd(), name)

    if os.path.exists(base_path):
        return json.dumps({
            "error": f"Directory already exists: {name}",
        })

    # Create the project directory
    os.makedirs(base_path, exist_ok=True)

    # Create subdirectories
    for dir_name in template["dirs"]:
        os.makedirs(os.path.join(base_path, dir_name), exist_ok=True)

    # Create files
    created_files = []
    for filename, content in template["files"].items():
        filepath = os.path.join(base_path, filename)
        # str.format fills in {name} and turns the doubled braces back into single ones
        formatted_content = content.format(name=name)
        with open(filepath, "w") as f:
            f.write(formatted_content)
        created_files.append(filename)

    return json.dumps({
        "status": "created",
        "path": base_path,
        "language": language,
        "files": created_files,
        "directories": template["dirs"],
    })


@mcp.tool()
def list_templates() -> str:
    """List all available project templates and their contents."""
    result = {}
    for lang, template in TEMPLATES.items():
        result[lang] = {
            "files": list(template["files"].keys()),
            "directories": template["dirs"],
        }
    return json.dumps(result, indent=2)


if __name__ == "__main__":
    mcp.run(transport="stdio")
</code></pre> A few things to notice about this code: Tools return strings. MCP tools communicate through text. I'm returning JSON strings so the LLM can parse the results reliably. You could return plain text, but structured data gives the model more to work with. The <code>@mcp.tool()</code> decorator does the heavy lifting. FastMCP reads your function signature and docstring to generate the JSON schema that tells the LLM what this tool does, what arguments it takes, and what types they are. 
Good docstrings aren't optional here – they're how the LLM decides whether to call your tool. <code>transport="stdio"</code> is the key line. This tells FastMCP to communicate over standard input/output, which is what Claude Code expects for local servers. <h2 id="heading-step-2-test-it-locally">Step 2: Test It Locally</h2> Before we Dockerize anything, make sure the server actually works: <pre><code class="language-bash"># Quick smoke test - the server should start without errors
python server.py
</code></pre> You should see... nothing. That is correct. An MCP server over stdio just sits there waiting for JSON-RPC messages on stdin. Press <code>Ctrl+C</code> to stop it. For a proper test, use the MCP Inspector (Anthropic's debugging tool): <pre><code class="language-bash"># Install and run the inspector
npx @modelcontextprotocol/inspector python server.py
</code></pre> This opens a web interface where you can see your tools, call them manually, and inspect the JSON-RPC messages going back and forth. Verify that both <code>scaffold_project</code> and <code>list_templates</code> show up and return sensible results. Here's a debugging tip that will save you time: If your MCP server logs anything to stdout, it will corrupt the JSON-RPC stream and the client will disconnect. Use stderr for all logging: <code>print("debug info", file=sys.stderr)</code> (add <code>import sys</code> at the top of <code>server.py</code> first). This is the single most common source of "my server connects but then immediately fails" bugs. The New Stack called stdio transport "incredibly fragile" for exactly this reason. <h2 id="heading-step-3-dockerize-it">Step 3: Dockerize It</h2> Create a <code>Dockerfile</code> in your project root: <pre><code class="language-dockerfile">FROM python:3.12-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy server code
COPY server.py .

# MCP servers over stdio need unbuffered output
ENV PYTHONUNBUFFERED=1

# The server reads from stdin and writes to stdout
CMD ["python", "server.py"]
</code></pre> Create <code>requirements.txt</code>: <pre><code class="language-plaintext">mcp[cli]>=1.25,<2
</code></pre> Build and verify: <pre><code class="language-bash">docker build -t mcp-scaffolder .

# Quick test - should start without errors
docker run -i mcp-scaffolder
</code></pre> Again, you'll see nothing because the server is waiting for input. Press <code>Ctrl+C</code> to stop. Two things matter in this Dockerfile: <ol> <li><code>PYTHONUNBUFFERED=1</code> is critical. Without it, Python buffers stdout, and the MCP client may hang waiting for responses that are sitting in a buffer. This is one of those bugs that works fine in local testing and breaks in Docker. </li> <li><code>docker run -i</code> (interactive mode) is required. The <code>-i</code> flag keeps stdin open so the MCP client can send messages to the container. Without it, the server gets an immediate EOF and exits. </li> </ol> <h2 id="heading-step-4-wire-it-into-claude-code">Step 4: Wire It Into Claude Code</h2> Now connect your Docker container to Claude Code: <pre><code class="language-bash">claude mcp add scaffolder -- docker run -i --rm mcp-scaffolder
</code></pre> That's the whole command. Let me break it down: <ul> <li><code>claude mcp add</code> registers a new MCP server </li> <li><code>scaffolder</code> is the name you will reference it by </li> <li>Everything after <code>--</code> is the command Claude Code runs to start the server </li> <li><code>docker run -i --rm mcp-scaffolder</code> starts the container with interactive stdin and removes it when done </li> </ul> Verify that it registered: <pre><code class="language-bash">claude mcp list
</code></pre> You should see <code>scaffolder</code> in the output with a <code>stdio</code> transport type. 
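To make the stdio exchange described earlier a little more concrete, here is a rough sketch of the kind of request a client writes to the container's stdin when it calls a tool. The <code>tools/call</code> method and the <code>name</code>/<code>arguments</code> parameters come from the MCP specification. The helper names <code>build_tool_call</code> and <code>encode_for_stdio</code> are invented for this example, and a real session also includes an initialization handshake and tool discovery that are left out here.

<pre><code class="language-python">import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request asking an MCP server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def encode_for_stdio(message):
    """Serialize a message as one line; stdio messages are newline-delimited."""
    return json.dumps(message, separators=(",", ":")) + "\n"


request = build_tool_call(
    1, "scaffold_project", {"name": "weather-api", "language": "python"}
)
line = encode_for_stdio(request)
</code></pre>

Because each message has to fit on a single line of the stream, any stray output the server prints to stdout lands in the middle of this traffic, which is why logging belongs on stderr.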
Now launch Claude Code and check the connection: <pre><code class="language-bash">claude
</code></pre> Once inside Claude Code, type <code>/mcp</code> to see the status of your MCP servers. You should see <code>scaffolder</code> listed as connected with two tools available. <h2 id="heading-step-5-use-it">Step 5: Use It</h2> Still inside Claude Code, try it out: <pre><code class="language-plaintext">Create a new Python project called "weather-api"
</code></pre> Claude Code should discover your <code>scaffold_project</code> tool, call it with <code>name="weather-api"</code> and <code>language="python"</code>, and report back what it created. Note that the returned <code>path</code> points inside the container (under <code>/app</code>), because that is the server's working directory. With the <code>docker run -i --rm</code> command from Step 4, nothing is mounted from your host, so the files exist only in the container and are removed when it exits. To keep them, bind-mount a host directory at the location the tool writes to. Try a few more: <pre><code class="language-plaintext">What project templates are available?
</code></pre> <pre><code class="language-plaintext">Scaffold a Go project called "url-shortener"
</code></pre> If Claude Code doesn't pick up your tools, run <code>/mcp</code> to check the connection status. If it shows as disconnected, the most common causes are that the Docker image failed to build, stdout is being polluted (check for stray print statements), or the Docker daemon is not running. <h2 id="heading-security-what-the-other-tutorials-leave-out">Security: What the Other Tutorials Leave Out</h2> This is the section most MCP tutorials skip. They should not. MCP has had real security incidents, not theoretical ones, and understanding them makes you a better developer. <h3 id="heading-the-prompt-injection-problem">The Prompt Injection Problem</h3> MCP servers execute code on your machine based on what an LLM decides to do. If an attacker can influence what the LLM sees, they can influence what your server does. This is called prompt injection, and it is the number one unsolved security problem in the MCP ecosystem. In May 2025, researchers at Invariant Labs demonstrated this against the official GitHub MCP server. 
They created a malicious GitHub issue that, when read by an AI agent, hijacked the agent into leaking private repository data (including salary information) into a public pull request. The root cause was an overly broad Personal Access Token combined with untrusted content landing in the LLM's context window. This was not a contrived lab demo. It used the official GitHub MCP server, the kind of thing people install from the MCP server directory without a second thought. <h3 id="heading-real-cves-not-theory">Real CVEs, Not Theory</h3> The ecosystem has accumulated real vulnerability reports: <ul> <li>CVE-2025-6514: A critical command-injection bug in <code>mcp-remote</code>, a popular OAuth proxy that 437,000+ environments used. An attacker could execute arbitrary OS commands through crafted OAuth redirect URIs. </li> <li>CVE-2025-6515: Session hijacking in <code>oatpp-mcp</code> through predictable session IDs, letting attackers inject prompts into other users' sessions. </li> <li>MCP Inspector RCE: Anthropic's own debugging tool allowed unauthenticated remote code execution. Inspecting a malicious server meant giving the attacker a shell on your machine. </li> </ul> An Equixly security assessment found command injection in 43% of tested MCP server implementations. Nearly a third were vulnerable to server-side request forgery. <h3 id="heading-what-you-should-actually-do">What You Should Actually Do</h3> For the server we built today, here is what matters: <h4 id="heading-limit-file-system-access">Limit file system access</h4> Our Docker container doesn't mount your home directory. That's intentional. If you need the server to write files to your host, mount only the specific directory you need: <code>docker run -i --rm -v $(pwd)/projects:/app/projects mcp-scaffolder</code>. Never mount <code>/</code> or <code>~</code>. 
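The same principle can be applied inside the server: before writing anything, resolve the target path and refuse it if it falls outside the one directory the tool is meant to touch. The following is a minimal sketch under that assumption. <code>resolve_inside</code> is a hypothetical helper, not part of the tutorial's <code>server.py</code> or of any library.

<pre><code class="language-python">import os


def resolve_inside(base_dir, name):
    """Return the absolute path for `name` under `base_dir`, or raise ValueError."""
    # realpath normalizes ".." segments and follows symlinks on both sides.
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, name))
    # commonpath equals base only when candidate is base itself or beneath it;
    # the base directory itself is rejected because a project needs its own folder.
    if candidate == base or os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"Refusing to write outside {base}: {name!r}")
    return candidate
</code></pre>

One way to use it would be to compute <code>base_path = resolve_inside(os.getcwd(), name)</code> in <code>scaffold_project</code> and return a JSON error when it raises. Checking the resolved path rather than the raw string also covers absolute paths and symlinks, which a simple substring check can miss.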
<h4 id="heading-validate-all-inputs">Validate all inputs</h4> Our <code>scaffold_project</code> tool checks that the language is in a known list and that the directory does not already exist. But think about what happens if someone passes <code>name="../../etc/passwd"</code> as the project name. Path traversal is the kind of thing you need to catch. Add this to the tool: <pre><code class="language-python"># Add this validation at the top of scaffold_project
if ".." in name or "/" in name or "\\" in name:
    return json.dumps({"error": "Invalid project name"})
</code></pre> <h4 id="heading-use-least-privilege-tokens">Use least-privilege tokens</h4> If your MCP server connects to an API, give it the minimum permissions it needs. The GitHub MCP incident happened because the PAT had access to every private repo. A read-only token scoped to one repo would have contained the blast radius. <h4 id="heading-do-not-install-mcp-servers-from-untrusted-sources">Do not install MCP servers from untrusted sources</h4> A malicious npm package posing as a "Postmark MCP Server" was caught silently BCC'ing all emails to an attacker's address. Treat MCP server packages with the same caution you would give any code that runs on your machine with your permissions. <h2 id="heading-what-to-do-next">What to Do Next</h2> You have a working MCP server in a Docker container, connected to Claude Code. Here is how to make it portfolio-ready: <ol> <li>Add more tools: The scaffolder is a starting point. Add a tool that reads a project's dependency file and lists outdated packages. Add one that generates a Dockerfile for an existing project. Each tool is a function with a decorator – the pattern is the same every time. </li> <li>Add tests: Write pytest tests that call your tool functions directly and verify the output. MCP tools are just Python functions. Test them like Python functions. </li> <li>Push the Docker image: Tag it and push to Docker Hub or GitHub Container Registry. 
Then your <code>claude mcp add</code> command becomes <code>claude mcp add scaffolder -- docker run -i --rm yourusername/mcp-scaffolder:latest</code> and anyone can use it. </li> <li>Write a README that explains the security model: What permissions does your server need? What file system access? What happens if inputs are malicious? Answering these questions in your README signals that you think about security, which is exactly what hiring managers are looking for right now. </li> </ol> <h2 id="heading-wrapping-up">Wrapping Up</h2> We built a Python MCP server with FastMCP, containerized it with Docker, and connected it to Claude Code. The whole thing fits in about 100 lines of Python, a six-line Dockerfile, and one <code>claude mcp add</code> command. The MCP ecosystem is real and growing fast. The protocol has the backing of Anthropic, OpenAI, and Google. It's now governed by the Linux Foundation. But it's also young, and the security story is still being written. Build with it, but build with your eyes open. If you want to go deeper, here are the resources I found most useful: <ul> <li><a href="https://modelcontextprotocol.io/specification/2025-11-25">MCP specification</a>: the actual protocol docs </li> <li><a href="https://code.claude.com/docs/en/mcp">Claude Code MCP documentation</a>: how Claude Code implements MCP </li> <li><a href="https://github.com/jlowin/fastmcp">FastMCP GitHub</a>: the Python framework we used </li> <li><a href="https://authzed.com/blog/timeline-mcp-breaches">AuthZed's timeline of MCP security incidents</a>: required reading if you are building MCP servers for production </li> <li><a href="https://simonwillison.net/2025/Apr/9/mcp-prompt-injection/">Simon Willison on MCP prompt injection</a>: the clearest explanation of why this is hard to solve </li> </ul> The complete source code for this tutorial is on <a href="https://github.com/balajeeasish/ai-workshop/tree/main/mcp-server">GitHub</a>. 
</article> <article> <h1> Learn to Use Claude AI to Build Text Summarizers, Image Describers, and More </h1> Beau Carnes — Tue, 22 Oct 2024 15:13:45 +0000 From summarizing lengthy articles to providing detailed descriptions of images, AI models are becoming essential tools for developers. One such powerful tool is Claude, a state-of-the-art AI language model developed by Anthropic. Whether you're an aspiring AI enthusiast or an experienced developer, learning how to leverage Claude’s capabilities can open up a world of creative and practical possibilities. We just published a course on the <a target="_blank" href="http://freeCodeCamp.org">freeCodeCamp.org</a> YouTube channel that will teach you all about Claude AI and how to build exciting projects using Anthropic's API. In this course, you'll dive into Claude’s capabilities and discover how to use this AI model to create applications like text summarizers and image describers. The course is packed with hands-on coding challenges that will help you build practical skills while working with real-world AI tasks. Shant Dashjian from Scrimba developed this course. You'll begin by learning the basics: getting familiar with Claude, understanding its potential, and setting up your Anthropic API key. From there, you’ll quickly progress to interacting with Claude through conversations, where you’ll learn how to craft effective prompts to control its responses. The course will guide you through building two main projects that showcase how AI can process both text and visual data. The projects are: <ul> <li>🗞️ a text summarizer </li> <li>🖼️ an image describer </li> </ul> In addition to learning how to work with Claude’s API, you’ll also develop important skills like error handling, prompt engineering, and cloud deployment, which are essential for creating robust, real-world applications. 
By the end of the course, you’ll not only have built two impressive projects but also gained a deeper understanding of how Claude fits into the broader AI landscape and how to effectively use AI models in your own projects. Ready to meet Claude and start building? Watch the <a target="_blank" href="https://www.youtube.com/watch?v=QfJB9d0J3Iw">full course on the freeCodeCamp.org YouTube channel</a> (1-hour watch). <div class="embed-wrapper"> </div> </article> </main></body></html>

How to Use Claude Code to Build Flutter Apps Faster — Best Practices for 2026

Prerequisites

Table of Contents

1. Why Architecture Comes First

2. Setting Up Your CLAUDE.md

3. Feature-First Folder Structure — The Details

4. Writing Skills for Your Most Repeated Tasks

Creating Your First Skill

A Skill For Conventional Commits

Dynamic Context Injection

5. Using /loop for Self-Correcting Workflows

Fix Until Clean

TDD Loop

Build a Screen, Check it, Iterate

6. Subagents for Parallel Screen Development

Setting Up a Screen-Builder Subagent

Using it

7. Hooks — Enforcing Rules Deterministically

Block Edits to Generated Files

Run Analyze Before Every Stop

8. Putting It All Together: A Real Sprint Workflow

Morning: Check Project State

Start a New Feature

Generate All the Screens in Parallel

Fix Everything Until it's Clean

Commit Cleanly

Key Takeaways

How to Keep Human Experts Visible in Your AI-Assisted Codebase

Table of Contents

What You Will Build

Prerequisites

Quickstart in 5 Minutes

How the Tool Works

The Archaeology Problem

How Knowledge Gaps Are Detected

How to Install proof-of-contribution

Mac and Linux

Windows PowerShell

Verify the Install:

How to Scaffold Your Project

Mac and Linux

Windows PowerShell

Verify the Scaffold Worked

How to Record Your First Provenance Entry

Check What You Recorded

See All Experts in Your Graph

How to Use import-spec to Seed Knowledge Gaps

Step 1 — Create a Test Spec

Step 2 — Import the Assumptions

Step 3 — Trace the Gaps

Preview Without Writing

Check the Overall Health

How to Trace Human Attribution

Resolving Gaps

How to Verify with Static Analysis

What the Exit Codes Mean

Run it Without a Seeded Spec

The --strict Flag

How to Enable PR Enforcement

What the Action Checks

Where to Go Next

Use it with spec-writer on a Real Feature

Activate the Claude Code Skill

Upgrade When You Outgrow SQLite

Manual Tracking vs. proof-of-contribution

How to Build a Cost-Efficient AI Agent with Tiered Model Routing

What You'll Build

Prerequisites

Table of Contents

The Problem with Calling Claude on Everything

The Cost Curve Explained

Project Setup

Tier 1: Deterministic Python

Tier 2: Claude Haiku for Ambiguous Cases

Tier 3: Claude Sonnet for Semantic Judgment

The Router: audit_url()

Graceful Fallback

Testing the Cost Curve

Applying This Pattern to Your Agent
