Christopher Galliart - freeCodeCamp.org

How to Trace Multi-Agent AI Swarms with Jaeger v2

Christopher Galliart — Thu, 23 Apr 2026 23:41:57 +0000

When you run a single AI agent, debugging is straightforward. You read the log, you see what happened.

When you run five agents in a swarm, each spawning its own tool calls and producing its own output, "read the log" stops being a strategy.

I built Claude Forge as an adversarial multi-agent coding framework on top of Claude Code. A typical run spawns a planner, an implementer, a reviewer, and a fixer. They evaluate each other's work and loop back when quality checks fail.

When something went wrong, though, I had timestamps and text dumps but no way to see which agent was responsible, how long it actually took, or where the tokens went.

Jaeger fixed that. This article covers setting up Jaeger v2 with Docker, wiring it into a multi-agent system through OpenTelemetry, and what I learned along the way.

What Is Distributed Tracing?
Why Jaeger v2?
Prerequisites
Installing Docker on Debian
Setting Up Jaeger v2
Setting Up Claude Forge Tracing
Understanding the Span Model
Instrumenting a Multi-Agent Swarm
Viewing Traces in the Jaeger UI
Lessons from the Trenches
Environment Variable Reference
Wrapping Up

What Is Distributed Tracing?

Distributed tracing tracks a single operation as it moves through multiple services. A span is one unit of work with a start time, end time, and key-value attributes. Spans nest into parent-child trees. One tree per operation is one trace.

Microservices people already know this pattern: follow an HTTP request from the gateway through auth, the database, and the cache. Same idea works for multi-agent AI. Follow one swarm invocation from the orchestrator through each subagent and its tool calls.

OpenTelemetry (OTel) is the standard. It gives you SDKs for creating spans and shipping them over OTLP. Jaeger receives that data and renders it as a searchable timeline.

Why Jaeger v2?

Jaeger started at Uber and graduated as a CNCF project in 2019. v1 hit end of life in December 2025. v2 is the current release, built on the OpenTelemetry Collector framework. It ships as a single binary containing the collector, query service, and UI, and it speaks OTLP natively on ports 4317 (gRPC) and 4318 (HTTP). There's no separate collector needed for local work.

One important difference from v1: configuration moved from CLI flags and environment variables to a YAML file. The old -e SPAN_STORAGE_TYPE=badger env vars are silently ignored in v2. The container starts fine but falls back to in-memory storage. I lost two days of traces before noticing. More on the correct setup below.

Prerequisites

Docker installed and running.
Claude Code installed.
Python 3.8+ for the tracing hook.
Claude Forge or another multi-agent system to instrument.

Installing Docker on Debian

Skip this if you already have Docker. macOS and Windows users can use Docker Desktop. On Debian:

sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
  https://download.docker.com/linux/debian \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker

Ubuntu users: replace both linux/debian URLs with linux/ubuntu.

Setting Up Jaeger v2

Basic Run

For quick testing with no persistence:

docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/jaeger:2.17.0

Port 16686 is the UI. Port 4317 is OTLP/gRPC ingestion. Port 4318 is OTLP/HTTP. Remove the container and your traces are gone.

Persistent Storage with Badger

v2 reads configuration from a YAML file, not environment variables. Save this as ~/.local/share/jaeger/config.yaml:

service:
  extensions: [jaeger_storage, jaeger_query, healthcheckv2]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger_storage_exporter]
extensions:
  healthcheckv2:
    use_v2: true
    http: { endpoint: 0.0.0.0:13133 }
  jaeger_query:
    storage: { traces: main_store }
  jaeger_storage:
    backends:
      main_store:
        badger:
          directories: { keys: /badger/key, values: /badger/data }
          ephemeral: false
          ttl: { spans: 720h }
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
processors:
  batch:
exporters:
  jaeger_storage_exporter:
    trace_storage: main_store

The Jaeger container runs as UID 10001. Docker named volumes default to root ownership. Without fixing permissions first, the container crash-loops with mkdir /badger/key: permission denied.

Pre-create the volume and fix ownership:

docker volume create jaeger-data

docker run --rm \
  -v jaeger-data:/badger \
  alpine sh -c "mkdir -p /badger/data /badger/key && chown -R 10001:10001 /badger"

Then run Jaeger with the config mounted in:

docker run -d --name jaeger \
  --restart unless-stopped \
  -v ~/.local/share/jaeger/config.yaml:/etc/jaeger/config.yaml:ro \
  -v jaeger-data:/badger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/jaeger:2.17.0 \
  --config /etc/jaeger/config.yaml

Open http://localhost:16686 and you should see the UI. To verify persistence, record a trace, run docker restart jaeger, and confirm the trace is still there afterwards.

Setting Up Claude Forge Tracing

Installing Claude Forge

Install it through the Claude Code plugin marketplace:

/plugin marketplace add hatmanstack/claude-forge
/plugin install forge@claude-forge
/reload-plugins

The install opens a TUI to confirm scope and settings. After reload, commands use the forge: prefix (for example, /forge:pipeline).

You can also clone the repo from GitHub.

Installing the Tracing Hook

From your target project directory, run the install script. For plugin installs:

cd your-project
forge-trace                # if you set up the alias from the README
# or, without the alias:
bash "$(find ~/.claude -path '*/forge*' -name install-tracing.sh 2>/dev/null | head -1)"

For clone installs:

cd your-project
bash /path/to/claude-forge/bin/install-tracing.sh

The script builds a dedicated venv at ~/.local/share/claude-forge/venv (prefers uv, falls back to python3 -m venv), installs the OpenTelemetry packages, copies the hook into place, merges hook entries into .claude/settings.local.json, and self-tests against the OTLP endpoint.

Pass --no-settings to skip the settings merge, or --uninstall to tear everything down.

Opting In

Add to your shell init and restart your terminal:

export CLAUDE_FORGE_TRACING=1

Restart Claude Code, run the pipeline command (/forge:pipeline on a plugin install), then check http://localhost:16686 for the claude-forge service.

Understanding the Span Model

Here's what the hierarchy looks like for a typical swarm run:

session: "implement login form with OAuth"        <- root span
├── subagent:planner
│   ├── tool:Write  (Phase-0.md)                  <- mutation spans (on by default)
│   ├── tool:Write  (Phase-1.md)
│   └── subagent_result:planner                   <- duration, token counts, output
├── subagent:implementer
│   ├── tool:Edit   (src/auth.ts)
│   ├── tool:Bash   (npm test)
│   ├── tool:Write  (src/oauth.ts)
│   └── subagent_result:implementer
├── subagent:reviewer
│   └── subagent_result:reviewer
└── session_complete                              <- session totals

The root span's name comes from the first line of your prompt. Find traces by what you asked for, not by a UUID.

Subagents get an anchor span on start and a result span on completion. The result carries duration, token counts, prompt, and output.

Three Tiers of Detail

Not all inner tool calls are equally interesting. Write, Edit, MultiEdit, and Bash are mutational: small in number, high signal. They tell you what actually changed. Read, Glob, Grep, and WebFetch are navigation: lots of them, mostly noise.

Tracing captures mutations by default. That middle ground turned out to be the right one. Before this change, you either saw nothing inside subagents or you saw 200+ spans per run.

Mode	Subagents	Mutations (Write/Edit/Bash)	Other inner tools
Default	yes	yes	no
`CLAUDE_FORGE_TRACE_INNER=1`	yes	yes	yes (minus blocklist)
`CLAUDE_FORGE_TRACE_MUTATIONS=0`	yes	no	no (or per INNER)
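
The table above reduces to a small decision function. The following is a minimal sketch of that decision, not the hook's actual code: the function name should_trace_tool and the env parameter are hypothetical, the mutation set comes from the tools listed earlier, and the blocklist default is the one from the environment variable reference. It also assumes that mutation tools follow the MUTATIONS flag only, even when inner tracing is on.

```python
import os

# Tools the article treats as mutations (small in number, high signal).
MUTATION_TOOLS = {"Write", "Edit", "MultiEdit", "Bash"}
# Default blocklist as documented for CLAUDE_FORGE_TRACE_TOOL_BLOCKLIST.
DEFAULT_BLOCKLIST = "Read,Glob,Grep,TodoWrite,NotebookRead"
TRUTHY = {"1", "true", "yes", "on"}


def should_trace_tool(tool_name, env=None):
    """Return True if a tool call inside a subagent should get its own span."""
    env = os.environ if env is None else env
    # Mutation spans are on unless the variable is set to a non-truthy value.
    mutations_on = env.get("CLAUDE_FORGE_TRACE_MUTATIONS", "1").strip().lower() in TRUTHY
    # Inner tracing is off unless explicitly enabled.
    inner_on = env.get("CLAUDE_FORGE_TRACE_INNER", "").strip().lower() in TRUTHY
    raw_blocklist = env.get("CLAUDE_FORGE_TRACE_TOOL_BLOCKLIST", DEFAULT_BLOCKLIST)
    blocklist = {name.strip() for name in raw_blocklist.split(",") if name.strip()}

    if tool_name in MUTATION_TOOLS:
        # Assumption: mutation tools are governed by the MUTATIONS flag alone.
        return mutations_on
    return inner_on and tool_name not in blocklist
```

Passing a plain dict as env keeps the function easy to check without touching the real environment, for example should_trace_tool("Grep", {"CLAUDE_FORGE_TRACE_INNER": "1"}).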

Span Attributes

On session_complete: session.tokens.input, session.tokens.output, session.tokens.total, session.tokens.turns, session.duration_ms, user.prompt (first 2KB).

On subagent_result: agent.description, agent.prompt, agent.output, agent.duration_ms, agent.is_error, agent.tokens.input, agent.tokens.output.

On tool:*: tool.name, tool.input, tool.output, tool.duration_ms, tool.is_error.

Instrumenting a Multi-Agent Swarm

Hook Architecture

Claude Code has lifecycle hooks that fire scripts on specific events. Four matter here:

UserPromptSubmit: create the root span.
PreToolUse: start a span.
PostToolUse: end it with results.
Stop: finalize the trace.

Each hook gets a JSON payload on stdin and runs as a subprocess.
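
To make the event flow concrete, here is a rough sketch of a hook entry point that reads the payload and decides what to do. It is an illustration only: route_event, handle_raw, and the returned action strings are hypothetical names, and reading the event from a hook_event_name field is an assumption about the payload shape that you should check against the payloads your Claude Code version emits.

```python
import json
import sys


def route_event(payload):
    """Map a hook payload to one of the four tracing actions."""
    event = payload.get("hook_event_name")  # assumed field name
    if event == "UserPromptSubmit":
        return "create_root_span"
    if event == "PreToolUse":
        return "start_span"
    if event == "PostToolUse":
        return "end_span"
    if event == "Stop":
        return "finalize_trace"
    return "ignore"


def handle_raw(raw):
    """Parse the raw stdin text; anything malformed is ignored, never fatal."""
    try:
        payload = json.loads(raw)
    except ValueError:
        return "ignore"
    if not isinstance(payload, dict):
        return "ignore"
    return route_event(payload)


def main():
    action = handle_raw(sys.stdin.read())
    # A real hook would open or close spans here; this sketch only reports.
    print(action, file=sys.stderr)
    return 0  # always succeed so tracing never interrupts the swarm
```

In a hook script you would call main() from the entry point and exit with its return value. Because every invocation is a fresh subprocess, nothing in this function can rely on in-memory state from a previous event.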

Sending Spans with OpenTelemetry

Here's some minimal Python to get a span into Jaeger:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "my-agent-system"})
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent-tracer")

with tracer.start_as_current_span("my-agent-task") as span:
    span.set_attribute("agent.name", "planner")
    span.set_attribute("agent.tokens.input", 1500)
    span.set_attribute("agent.tokens.output", 800)

Refresh localhost:16686, pick your service, click "Find Traces."

Correlating Pre and Post Events

You need to match each PreToolUse to its PostToolUse. Agent-type tool calls didn't include a tool_use_id in the payload, so I hashed the tool name and input instead. Pre and Post carry identical tool_input, so the hashes line up.

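import hashlib
import json

def correlation_key(tool_name: str, tool_input: dict) -> str:
    """Derive a stable key shared by a PreToolUse event and its PostToolUse."""
    # sort_keys makes the JSON text independent of dict key order.
    content = json.dumps({"tool": tool_name, "input": tool_input}, sort_keys=True)
    # SHA-1 is only a short fingerprint here, not a security boundary.
    return hashlib.sha1(content.encode()).hexdigest()[:16]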

State Across Invocations

Every hook call is a separate process. No shared memory. So I wrote span context to JSON files on Pre and read them back on Post:

/tmp/claude-forge-tracing//
├── _root.json              # trace ID, root span context
├── _session_start_ns.json  # timestamp for duration calculation
├── subagent_.json    # per-subagent span context
└── tool_.json        # per-tool span context

File names get sanitized against path traversal. _safe_name() strips everything outside [A-Za-z0-9._-] and falls back to a SHA1 slug.
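
A stdlib-only sketch of that state handoff might look like the following. The names safe_name, save_span_state, and load_span_state are hypothetical, as is the tool_ prefix plus key file naming. Treating dots-only names such as ".." as a case for the SHA1 fallback is my assumption, not something the hook is confirmed to do. IDs are stored as hex because OpenTelemetry trace and span IDs are 128-bit and 64-bit integers.

```python
import hashlib
import json
import re
import tempfile
from pathlib import Path

_UNSAFE = re.compile(r"[^A-Za-z0-9._-]")


def safe_name(raw):
    """Strip characters outside [A-Za-z0-9._-]; fall back to a SHA1 slug."""
    cleaned = _UNSAFE.sub("", raw)
    # Assumption: empty or dots-only results ("", ".", "..") use the fallback.
    if not cleaned.strip("."):
        return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:16]
    return cleaned


def save_span_state(state_dir, key, trace_id, span_id):
    """Called on Pre: persist the span context so Post can find it."""
    state_dir = Path(state_dir)
    state_dir.mkdir(parents=True, exist_ok=True)
    path = state_dir / f"tool_{safe_name(key)}.json"
    path.write_text(json.dumps({
        "trace_id": format(trace_id, "032x"),  # 128-bit ID as 32 hex chars
        "span_id": format(span_id, "016x"),    # 64-bit ID as 16 hex chars
    }))
    return path


def load_span_state(state_dir, key):
    """Called on Post: return (trace_id, span_id) or None if nothing was saved."""
    path = Path(state_dir) / f"tool_{safe_name(key)}.json"
    try:
        data = json.loads(path.read_text())
    except (OSError, ValueError):
        return None  # Post without a matching Pre: nothing to close
    return int(data["trace_id"], 16), int(data["span_id"], 16)


# Usage: round-trip one span context through a scratch directory.
scratch = tempfile.mkdtemp()
save_span_state(scratch, "a1b2c3", trace_id=0x1234, span_id=0x56)
restored = load_span_state(scratch, "a1b2c3")
```

Returning None on a missing or corrupt file keeps the Post handler from raising, which matters because a tracing failure should never stop the swarm.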

Flushing Without Blocking

try:
    provider.force_flush(timeout_millis=1000)
except Exception:
    pass  # Never block the swarm

I tried 2000ms first and the swarm felt slow. 100ms lost spans on cold TLS connections. 1000ms worked. If Jaeger is down, the swarm keeps running regardless.

Viewing Traces in the Jaeger UI

Open http://localhost:16686. Pick claude-forge from the service dropdown. Click "Find Traces."

The trace search filters by operation name, tags, and time range. Since session spans take their name from your prompt, searching "login form" pulls up the runs where you asked for one.

The timeline view is where I spend most of my time. Every span is a horizontal bar, nested by parent-child relationships. I can see the planner took 12 seconds, the implementer 45, the reviewer 8. Click any bar to see token counts, prompts, outputs, error status.

Trace comparison puts two runs side by side. This is good for figuring out why one run succeeded and another did not.

Lessons from the Trenches

One trace per swarm, not per subagent: My first version wiped the root span's state file on every Stop event, so each subagent started a new trace. I changed Stop to mark a timestamp while preserving the root.

Use descriptions, not type names: Subagents all report their type as general-purpose. The description field is where the actual role lives.

Token attribution needs per-agent transcripts: Claude Code writes subagent transcripts to ~/.claude/projects///subagents/agent-*.jsonl. Match them via agent-*.meta.json.

Parse boolean env vars explicitly: bool("0") in Python is True. Use an allowlist: {"1", "true", "yes", "on"}.
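
A small helper makes the allowlist approach concrete. This is a sketch with a hypothetical name (env_flag), not the hook's own function:

```python
import os

TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, env=None):
    """Read a boolean flag from the environment using an explicit allowlist."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default  # unset: use the caller's default
    # Anything not in the allowlist, including "0" and "", counts as off.
    return raw.strip().lower() in TRUTHY
```

With this helper, env_flag("CLAUDE_FORGE_TRACING") stays False when the variable is set to "0", where a bare bool() on the same string would have returned True.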

Environment Variable Reference

Variable	Purpose
`CLAUDE_FORGE_TRACING=1`	Master opt-in. Hook is a no-op without this.
`CLAUDE_FORGE_TRACE_MUTATIONS=0`	Disable default mutation spans (Write/Edit/Bash). On by default.
`CLAUDE_FORGE_TRACE_INNER=1`	Capture all inner tool calls as child spans (off by default).
`CLAUDE_FORGE_TRACE_TOOL_BLOCKLIST`	Comma-separated tools to skip when inner tracing is on. Defaults to `Read,Glob,Grep,TodoWrite,NotebookRead`.
`CLAUDE_FORGE_HOOK_DEBUG=1`	Enable debug logging of raw hook payloads. Off by default.
`CLAUDE_FORGE_HOOK_DEBUG_LOG`	Override debug log path. Defaults to `~/.cache/claude-forge/hook.log`.
`OTEL_EXPORTER_OTLP_ENDPOINT`	OTLP/gRPC endpoint. Defaults to `http://localhost:4317`.

Wrapping Up

Without visibility into the process, you're being inefficient with tokens and your time. Multi-agent swarms cost real money on every run. When an agent fails and retries, or when a reviewer rejects work that was close, you're paying for that blind.

Tracing gives you the map. You find out where the failure modes are. You find out which agents burn tokens going nowhere. A 45-second implementer run might have been 10 seconds with a better planner prompt. But you would never know that without seeing the breakdown.

Get observability in early. Jaeger and OpenTelemetry make it cheap to set up. Once you can see where things go wrong you can actually fix them.

Claude Forge tracing is on the main branch.

Why Chrome OS Is the Operating System the AI Era Was Built For

Christopher Galliart — Fri, 17 Apr 2026 18:05:16 +0000

Chrome OS runs on a read-only filesystem. You can't install executables on the host. There's no traditional desktop environment. Everything that interacts with the underlying system does so through a sandboxed browser, a containerized Linux terminal, or a cloud connection.

For years, that list of constraints was the reason people dismissed it. But in 2026, it's the reason Chrome OS might be the most correctly designed operating system for what's coming.

The security architecture treats the endpoint as untrusted by default. The containerized Linux environment gives developers a full headless stack without compromising the host. And an upcoming OS-level rewrite, Aluminium, puts Google's on-device AI models directly into the kernel.

This article covers security architecture, the container-based developer environment, cloud-streamed creative tools via AWS NICE DCV, cloud gaming, and what Aluminium OS means for on-device AI.

Here's what we'll cover:

Security-First Architecture in an Era of AI-Powered Threats
A Headless Linux Stack That's More Flexible Than It Looks
AWS NICE DCV Changes the Creative Tools Conversation
Cloud Gaming Works
Aluminium OS: On-Device Models on Google's Own Architecture
Where This Lands

Security-First Architecture in an Era of AI-Powered Threats

Threat actors are getting better tools. Models like Mythos are lowering the barrier for generating convincing phishing campaigns, crafting polymorphic malware, and automating social engineering at scale.

Traditional operating systems present exactly the attack surface these tools target: writable system files, user-installable executables, patches that sit uninstalled for weeks because someone clicked "remind me later."

Chrome OS sidesteps most of this by design. The root filesystem is read-only and cryptographically verified on every boot through a process called Verified Boot.

If anything has modified the OS files since the last verified state, whether that's malware, a compromised package, or a rogue AI agent that decided to start deleting system files, the device detects it at startup and either self-corrects or refuses to boot.

For malware, persistence across reboots isn't just difficult. It's architecturally impossible through software alone.

Updates happen silently. While you're working, the system downloads the next OS version to an inactive partition. On your next reboot, it pivots to the updated version. No prompts, no deferred patches, no exposure window.

Major updates ship every four to six weeks. Security patches land every two to three weeks. The gap between vulnerability discovery and remediation is measured in days.

Chrome OS is consistently absent from the top 50 products by CVE count in the NIST vulnerability database, while Windows and the Linux kernel sit near the top every year. When AI is actively being weaponized to find and exploit vulnerabilities faster than humans can patch them, a read-only, verified, automatically updated endpoint is a different category of security posture.

The tradeoff is trust. Chrome OS's security model means trusting Google as the root authority for your entire computing stack: updates, certificate trust, telemetry. Organizations with strict data sovereignty requirements should weigh that dependency carefully.

A Headless Linux Stack That's More Flexible Than It Looks

Chrome OS is a text-based operating system. There's no native GUI layer. Stop and sit with that for a second, because it's the thing that makes people dismiss Chrome OS and also the thing that makes it work.

The entire graphical interface you interact with IS the Chrome browser. The Ash shell, Chrome's window manager, is the desktop. You don't install applications onto it the way you install .exe files on Windows or drag .app bundles into a macOS Applications folder. If it isn't running in a browser tab, an Android VM, or a Linux container, it doesn't run. That restriction is what keeps the host locked down, and it's what makes everything else possible.

Under the hood, Chrome OS runs a minimal virtual machine called Termina through crosvm, Google's Rust-based VM monitor.

Inside Termina, LXD manages Linux containers. The default container, penguin, is a Debian environment with a special trick: it bridges GUI-based Linux applications directly into the Chrome OS desktop through a Wayland proxy called Sommelier. Install VS Code, GIMP, or LibreOffice in penguin and they show up in your Chrome OS app launcher, running in windows alongside your browser tabs. For a lot of developers, penguin alone covers the daily workflow.

But Termina gives you more than penguin. Through the LXD layer you can spin up independent containers that are fully isolated operating systems: Arch, Alpine, Ubuntu, whatever you need.

These aren't attached to the GUI bridge. They run headless, natively, with their own systemd, their own package managers, their own persistent state. Need a clean Ubuntu environment to test a deployment script without touching your main setup? lxc launch and you're there. Need to blow it away? lxc delete and it's gone. No orphaned files on the host, no cross-contamination between environments.

The key distinction from Docker is that LXD runs system containers (a full OS userland with its own init system) rather than application containers. You get background services, persistent daemons, the works. You can also run Docker inside any of these LXD containers if you need application-level containerization on top of that.

Snapshot your entire environment with lxc snapshot before a risky dependency install and roll back instantly if something breaks. That kind of safety net is broader than version control alone: it captures your full OS configuration, not just code.

Pair this with browser-native tools like GitHub Codespaces, Google Colab, AWS CloudShell, or vscode.dev, and the terminal handles your local tooling while the browser handles everything else.

AI coding assistants like Claude and Gemini already operate natively in the browser. The distance between "cloud IDE" and "local IDE" keeps shrinking.

There are friction points: no custom kernel modules inside Crostini. Nested KVM requires Intel Gen 10+ processors. VPN routing into the Linux container from the Chrome OS host can be a headache, with WireGuard requiring userspace workarounds inside the container.

But none of these break the core architecture for cloud-native work. They're just worth knowing about before you commit.

AWS NICE DCV Changes the Creative Tools Conversation

One of the longest-standing arguments against Chrome OS has been the absence of professional creative software. There's no Premiere, no DaVinci Resolve, no Blender, no Ableton. For years, this was a dead-end conversation.

AWS NICE DCV (Desktop Cloud Visualization) reopens it. DCV is a high-performance remote display protocol that streams GPU-accelerated desktop sessions from EC2 instances to any device, including a Chromebook running the browser-based DCV client. It supports OpenGL, Vulkan, and DirectX rendering, with adaptive encoding that adjusts to network conditions. On AWS, the DCV license is free. You pay only for the EC2 compute time.

Netflix engineers use DCV to stream content creation applications to remote artists. Volkswagen runs 3D CAD simulations across their engineering division through it. A VFX studio called RVX used it to deliver visual effects for HBO's The Last of Us, streaming Nuke, Maya, Houdini, and Blender to artists distributed across Europe from servers in Iceland. Their team said it was the best remote experience they'd ever worked with.

So: a Chromebook connected to a g5.xlarge EC2 instance (one A10G GPU) can run Blender, DaVinci Resolve, or any other GPU-accelerated creative application with full hardware acceleration. The rendering happens in the data center. DCV streams the pixels. The creative professional gets a responsive, high-fidelity workspace on a $400 machine that couldn't locally render a single frame.

The constraints are connectivity and cost. You need sustained bandwidth (25+ Mbps for 1080p work, more for 4K multi-monitor setups) and leaving a GPU instance running around the clock adds up. But for studios and professionals who already budget for high-end workstations, the math often pencils out, especially when you factor in zero local hardware maintenance and the ability to scale GPU power on demand.

Cloud Gaming Works

GeForce NOW survived where Stadia failed because it made a better business decision: bring your own games. Connect your existing Steam, Epic, or Ubisoft library and stream from NVIDIA's server-side hardware. The Ultimate tier now runs on RTX 5080-class infrastructure. 4K at 120fps with ray tracing, on a fanless Chromebook.

Chrome OS has a structural advantage as a cloud gaming client. GeForce NOW runs natively in the Chromium browser via WebRTC, and users consistently report less micro-stuttering and tighter input handling than the standalone Windows desktop app. Under good network conditions, measured total latency runs 13 to 14ms, with sub-3ms ping documented near datacenter proximity. That's below the human perceptual threshold for most game types.

Anti-cheat systems like Easy Anti-Cheat and Riot Vanguard are a non-issue in this model. They run on the server where the game executes, not on your local endpoint. On-device gaming isn't viable on Chrome OS and likely never will be. The architecture isn't designed for it, and even projects attempting to bridge local GPUs hit bottlenecks in the container layers. Cloud gaming is the path, and it works.

The limiting factors are network-dependent. Latency spikes above 500ms on bad connections make fast-twitch games unplayable, and NVIDIA's 100-hour monthly cap on the Ultimate tier has drawn criticism. But cloud gaming on Chrome OS has crossed the line from novelty to daily-driver viable for most use cases.

Aluminium OS: On-Device Models on Google's Own Architecture

The most consequential near-term development for Chrome OS is Project Aluminium, a ground-up rewrite that replaces the current Chrome OS foundation with a native Android kernel. Not another bolted-on compatibility layer: a new operating system built on Android 16, designed to run Android applications natively with direct hardware acceleration instead of routing them through the resource-heavy ARCVM virtual machine that currently eats CPU cycles on even basic app launches.

The AI story is the real story. Aluminium is being built with Gemini models integrated directly into the OS: the file system, the application launcher, the window manager.

Google serving their own proprietary models on their own devices, using an architecture optimized specifically to run them, is a level of vertical integration that no other OS vendor has in the pipeline. Apple has the silicon advantage for local inference. Google has the model-to-OS integration advantage. Those are competing theses about where AI compute should live, and both are worth taking seriously.

The rollout timeline from court documents and leaked roadmaps puts a trusted tester program on select hardware in late 2026, premium tablets by early 2027, and general consumer availability in 2028. Chrome OS Classic gets maintained through existing support obligations until 2033 or 2034.

The launch won't be perfect. Google's track record on platform transitions gives the community earned skepticism. But the ability to iterate a natively AI-integrated OS on hardware they control is the kind of capability that compounds over time.

Where This Lands

Two years ago, calling Chrome OS a serious platform for development or creative work would have been a stretch. Today you can run a full Debian environment with systemd daemons, snapshot your workspace, stream Blender from a GPU-backed data center, play AAA games at 4K on hardware you don't own, and do all of it from a verified, read-only endpoint that patches itself while you sleep.

The remaining gaps are real. But they're concentrated in workflows that are themselves moving to the cloud. Chrome OS was designed around assumptions about computing that used to be premature. They're not premature anymore.

How to Apply GAN Architecture to Multi-Agent Code Generation

Christopher Galliart — Wed, 25 Mar 2026 16:49:56 +0000

Ask an AI coding agent to build a feature and it will probably do a decent job. Ask it to review its own work and it will tell you everything looks great.

This is the fundamental problem with single-pass AI code generation: the same context that created the code is the one evaluating it. There's no adversarial pressure. No second opinion. No fresh eyes.

What if you could structure the work so that separate agents generate and critique each other in iterative loops, the way a generator and discriminator improve each other in a GAN? The code that reaches you has already survived an argument between agents who disagreed about whether it was good enough.

This article walks through why that pattern works, how to build it, and when it is (and is not) worth the extra tokens. The concrete example is an open source project called Claude Forge, but the ideas are framework-agnostic. Anything that supports subagent spawning with fresh context windows can implement this pattern.

The Single-Pass Problem
What the Ecosystem Is Solving
The GAN Pattern Applied to Code
Why Rhetorical Questions Outperform Direct Instructions
Feedback as Filesystem
The Zero-Context Engineer
Phase-0: Immutable Conventions
Convergence Design: Knowing When to Stop
Ground Truth Documents and the Pipeline
What the Adversarial Loop Actually Catches
Honest Trade-offs
When to Use This (And When Not To)
Getting Started

Prerequisites

Familiarity with Claude Code or a similar AI coding agent
A working installation of Claude Code (for the hands-on sections)
Basic understanding of how LLM context windows work
Git installed and configured

No machine learning background is required. The GAN concepts are explained from first principles where they appear.

The Single-Pass Problem

The AI generates code in one pass. If it hallucinates a file path, misunderstands the architecture, or writes tests that don't actually test anything, you catch it during review. Or worse, you don't.

This isn't a hypothetical. Anyone who has used AI coding agents at scale has seen placeholder tests like expect(true).toBe(true), phantom dependencies where Phase 2 assumes a model that Phase 1 never creates, and instructions so ambiguous that two valid interpretations exist. These aren't rare edge cases. They're the predictable failure mode of single-pass generation.

The problem compounds with task complexity. A simple utility function generates fine in one pass. An auth middleware with token refresh, error handling, rate limiting, and logging across multiple files? The agent starts cutting corners, because the entire generation happened inside one context window that is simultaneously tracking the plan, the code, the tests, and the growing weight of its own prior reasoning.

What the Ecosystem Is Solving

There is a growing ecosystem of frameworks tackling different aspects of this problem. They each bring real contributions worth understanding.

Superpowers focuses on development methodology. It uses subagent-driven development, TDD enforcement, and multi-stage review. The framework generates a design spec, then an implementation plan, then dispatches subagents to execute. Review subagents check the output, and if they find issues, the implementer revises and gets re-reviewed until approved.

Get Shit Done (GSD) focuses on context engineering. Its key insight is fighting context window degradation through fresh 200k subagent contexts, parallel wave execution, and XML-structured plans. A JavaScript CLI handles the deterministic work (tracking progress, dependency ordering, context budgets) so the LLM never wastes tokens on bookkeeping it would do unreliably anyway.

Both frameworks share a crucial design decision: fresh context windows. When an agent has been reasoning for 100k tokens, its attention degrades. By spawning subagents with clean 200k contexts, these frameworks sidestep the "context rot" problem that plagues long-running agent sessions.

Where these frameworks diverge is in how they handle quality assurance. GSD relies on mechanical verification: lint, test, type-check, and auto-fix retries if the checks fail. There is no agent reading another agent's code to assess whether it matches the spec's intent. The "review" is whether npm run test passes.

Superpowers does have agent-to-agent review with iterative loops. But the review is enforced by in-context instructions, which means the agent can (and frequently does) rationalize skipping the review step to save tokens.

This is a known issue in the project. When review enforcement lives inside the same prompt that the model is also using to make efficiency decisions, the model sometimes decides that review is not worth the cost.

The adversarial GAN pattern addresses this differently. Instead of asking an agent to review its own work or trusting in-context instructions to enforce review, it structures the pipeline so that review is architecturally mandatory. The reviewer is a separate agent that cannot be skipped, because the orchestrator will not advance the pipeline without the reviewer's signal. The reviewer cannot modify source code, only feedback.md. The generator cannot approve its own output. Role separation is enforced by the system, not suggested by the prompt.

The GAN Pattern Applied to Code

In machine learning, GANs pit two networks against each other: a generator creates content, a discriminator evaluates it, and the feedback loop between them drives both to improve. The generator gets better at producing realistic output. The discriminator gets better at finding flaws. The adversarial tension is what produces quality.

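Applied to software development, this creates two stacked feedback loops: a plan loop between the Planner and the Plan Reviewer, and a code loop between the Implementer and the Code Reviewer.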

Each role runs as a separate agent with its own fresh context window. The Plan Reviewer has never seen the Planner's reasoning process. It only sees the output. The Code Reviewer has never seen the Implementer's struggles. It only sees the code.

This separation fundamentally changes what the reviewer can catch. When a reviewer shares context with the generator, it inherits the generator's blind spots. When a reviewer starts fresh, it reads the plan the way an actual engineer would: with no assumptions about what the author "meant" versus what they wrote.

The adversarial Plan Reviewer doesn't just verify structure. It actively tries to break the plan:

Deadlock search: Is there a task ordering that would deadlock the implementer? (Task 3 needs the output of Task 5.)
False positive verification: Could any verification checklist pass even with a wrong implementation?
Ambiguity search: Are there instructions that could be interpreted two valid ways?
Missing context: Could the implementer get stuck because a task assumes knowledge not provided?

This is where the GAN analogy is most literal. The discriminator isn't checking if the plan looks good. It's trying to find failure modes.

Why Rhetorical Questions Outperform Direct Instructions

When a reviewer finds an issue, there are two ways to communicate it.

Direct instruction:

Fix line 45: the error handler returns 500 instead of 401 for invalid tokens.

Rhetorical question:

Consider: The test test_invalid_token_rejection expects a 401 status code.
Are you returning the correct HTTP status in your error handling?

Think about: In src/auth/middleware.js:45, what happens when the token is
invalid? Is the error properly caught?

Reflect: Look at how other middleware handles auth errors. Are you following
the same pattern?

The direct instruction produces a mechanical edit. The agent changes line 45 and moves on. The rhetorical question produces a deeper investigation. The agent re-examines the surrounding code, considers the pattern used elsewhere, and is more likely to find the root cause rather than just patching the symptom.

This maps to how the underlying models work. When given an explicit instruction, the model follows it literally. When guided to reason about a problem, it activates a broader search through its understanding of the codebase. The fix addresses related issues that a mechanical edit would miss.

Reviewer prompts structured around "Consider," "Think about," and "Reflect" prefixes consistently produce better fixes than "Fix" or "Change" directives. The implementer agent receives these as feedback in feedback.md and addresses them in the next iteration of the GAN loop.
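As a rough illustration of that prefix structure, a reviewer-side helper could render one finding as three guiding questions. This is a minimal sketch: the function name and its parameters are hypothetical, not Claude Forge's actual prompt code.

```python
def rhetorical_feedback(expectation: str, location: str, pattern_hint: str) -> str:
    """Render one finding as guiding questions instead of a direct 'Fix ...' order."""
    return "\n\n".join([
        f"Consider: {expectation}",
        f"Think about: In {location}, what happens on the failing path?",
        f"Reflect: {pattern_hint}",
    ])


text = rhetorical_feedback(
    expectation="The test test_invalid_token_rejection expects a 401 status code.",
    location="src/auth/middleware.js:45",
    pattern_hint="Look at how other middleware handles auth errors. "
                 "Are you following the same pattern?",
)
print(text)
```

The point of the shape is that none of the three lines names the edit to make. The implementer has to work that out from the surrounding code.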

Feedback as Filesystem

Most agent orchestration systems rely on some form of message passing: API calls, databases, queue systems, in-memory state. These all work, but they introduce infrastructure dependencies and make the agent conversation opaque after the fact.

An alternative: use the filesystem as the message bus and git as the orchestration layer.

All agent communication flows through feedback.md, a structured markdown file with two sections:

## Active Feedback (OPEN)

### FB-001: Auth middleware missing rate limiting
- **Status:** OPEN
- **Source:** Plan Reviewer
- **Phase:** 1
- **Detail:** The plan specifies JWT validation but does not address rate
  limiting for failed auth attempts. Consider: what happens if an attacker
  brute-forces tokens?

## Resolved Feedback

### FB-000: Missing error codes in API spec
- **Status:** RESOLVED
- **Resolution:** Added error code table to Phase-0 conventions

This design has several properties that matter in practice:

Full audit trail: Every piece of feedback, every resolution, every signal is committed to git alongside the code it produced. When you want to understand why the auth middleware was designed a certain way, the conversation that shaped it is right there in the commit history.

State recovery: If a pipeline gets interrupted (token limits, network issues, you need to step away), resuming is trivial. The orchestrator re-reads feedback.md and git log, determines what stage the pipeline reached, and picks up where it left off. No cloud infrastructure, no database, no queue. Just files.

Transparency: You can read the agent conversation in your editor. You can see exactly what the reviewer flagged, exactly how the implementer responded, and whether the resolution actually addressed the concern.
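To make the state-recovery property concrete, here is a minimal sketch of how an orchestrator could find open items in a file shaped like the example above. The parser and its function names are assumptions for illustration. The real orchestrator is a Claude Code session that reads the file directly, not a Python script.

```python
import re

SAMPLE = """\
## Active Feedback (OPEN)

### FB-001: Auth middleware missing rate limiting
- **Status:** OPEN
- **Source:** Plan Reviewer
- **Phase:** 1

## Resolved Feedback

### FB-000: Missing error codes in API spec
- **Status:** RESOLVED
- **Resolution:** Added error code table to Phase-0 conventions
"""

ITEM = re.compile(r"^### (FB-\d+): (.+)$")        # e.g. "### FB-001: title"
FIELD = re.compile(r"^- \*\*(\w+):\*\* (.+)$")    # e.g. "- **Status:** OPEN"


def parse_feedback(markdown: str) -> list[dict]:
    """Collect each FB-nnn entry with its single-line fields."""
    items: list[dict] = []
    for line in markdown.splitlines():
        if m := ITEM.match(line):
            items.append({"id": m.group(1), "title": m.group(2)})
        elif (m := FIELD.match(line)) and items:
            items[-1][m.group(1).lower()] = m.group(2)
    return items


def open_items(markdown: str) -> list[str]:
    """IDs that still need work when a run resumes."""
    return [item["id"] for item in parse_feedback(markdown) if item.get("status") == "OPEN"]


print(open_items(SAMPLE))  # ['FB-001']
```

Because the whole state is one text file, resuming a run only requires reading it again. No session or queue has to be restored.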

Agents communicate through structured signals routed by the orchestrator:

PLAN_COMPLETE / REVISION_REQUIRED / PLAN_APPROVED (plan GAN loop)
IMPLEMENTATION_COMPLETE / CHANGES_REQUESTED / PHASE_APPROVED (code GAN loop)
GO / NO-GO (final gate)
VERIFIED / UNVERIFIED (post-remediation verification)

Each signal marks a state transition. The orchestrator reads the signal, determines the next agent to invoke, and passes it the relevant context. The orchestrator itself is a Claude Code session, but the agents it spawns are fresh subagents with clean context windows.
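The routing step can be pictured as a lookup table. The sketch below is illustrative only: the signal names come from the list above, but the destination names and the exact mapping are assumptions, and the verification signals are left out because their routing is not described here.

```python
# Hypothetical routing table: signal -> next step in the pipeline.
NEXT_STEP = {
    "PLAN_COMPLETE": "plan_reviewer",
    "REVISION_REQUIRED": "planner",
    "PLAN_APPROVED": "implementer",
    "IMPLEMENTATION_COMPLETE": "code_reviewer",
    "CHANGES_REQUESTED": "implementer",
    "PHASE_APPROVED": "final_reviewer",
    "GO": "done",
    "NO-GO": "rollback",
}


def route(signal: str) -> str:
    """Return the next step, refusing anything that is not a known signal."""
    try:
        return NEXT_STEP[signal]
    except KeyError:
        raise ValueError(f"unrecognized signal: {signal!r}") from None


print(route("PLAN_COMPLETE"))      # plan_reviewer
print(route("CHANGES_REQUESTED"))  # implementer
```

In this sketch the only way to reach the implementer is through PLAN_APPROVED or CHANGES_REQUESTED, both of which are reviewer signals. That is one way the "review cannot be skipped" property could be enforced structurally rather than by prompt.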

The Zero-Context Engineer

One of the most effective constraints in the system is the "zero-context engineer" framing. The Planner writes every plan as if it will be executed by an engineer who:

Is skilled but has zero context on the codebase
Is unfamiliar with the toolset and problem domain
Will follow instructions precisely
Will not infer missing details. If it's not in the plan, it won't happen.

This constraint forces explicit instructions. No "add the usual auth middleware." Instead: which library, which pattern, which error codes, which files to create, which existing files to modify, and how to verify the result.

The Plan Reviewer then simulates this zero-context experience: "If I knew nothing about this codebase, could I follow these instructions and produce a working result?"

This framing catches a class of failures that are invisible to someone with context. The author of the plan knows what they meant. The zero-context reviewer only knows what is written. The gap between intention and specification is where bugs live.

Phase-0: Immutable Conventions

Every pipeline run starts with a Phase-0 document that defines immutable rules: tech stack, testing strategy, deployment approach, shared patterns, commit format. Every subsequent phase inherits from Phase-0. Every reviewer checks against it.

This solves a common multi-agent problem: drift. Without a shared source of truth, Agent A might decide to use Jest while Agent B sets up Vitest. Agent C might use a different error handling pattern than Agent D. Phase-0 prevents this by establishing conventions before any code is written.

The conventions aren't suggestions. They're constraints that every agent in the pipeline must respect, and every reviewer must verify against.

Convergence Design: Knowing When to Stop

An adversarial loop without exit conditions is just two agents arguing forever. The convergence design has three mechanisms:

Iteration caps: Each GAN loop (plan review, code review) runs a maximum of 3 iterations. If the planner and reviewer cannot converge in 3 rounds, the issue requires human judgment, not more machine cycles.

Signal protocol: The structured signals (PLAN_APPROVED, GO, NO-GO) are explicit state transitions, not suggestions. When the final reviewer issues NO-GO, the pipeline rolls back the phase. There is no "let's try one more time." The rollback is automatic.

Token budget: Each phase targets roughly 50k tokens with a 75k hard ceiling. This prevents any single phase from consuming the entire context budget and ensures the orchestrator retains enough headroom to manage the pipeline.

These caps exist because adversarial loops have a cost curve. The first iteration catches major issues. The second iteration catches subtle issues. The third iteration catches edge cases. A fourth iteration almost never catches anything the previous three missed, but it costs just as many tokens. Three iterations hit the sweet spot between thoroughness and efficiency.
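The iteration cap is simple to express in code. This is a minimal sketch with stand-in generator and reviewer callables. The `ESCALATE_TO_HUMAN` label and the function names are hypothetical, not part of the signal protocol.

```python
from typing import Callable

MAX_ITERATIONS = 3  # the cap described above


def run_review_loop(
    generate: Callable[[list[str]], str],
    review: Callable[[str], list[str]],
) -> tuple[str, int]:
    """Alternate generator and reviewer until approval or the cap is hit."""
    feedback: list[str] = []
    for iteration in range(1, MAX_ITERATIONS + 1):
        artifact = generate(feedback)
        feedback = review(artifact)
        if not feedback:  # reviewer raised nothing: converged
            return "APPROVED", iteration
    return "ESCALATE_TO_HUMAN", MAX_ITERATIONS


def demo(rounds_needed: int) -> tuple[str, int]:
    """Stand-in agents: the reviewer is satisfied after `rounds_needed` drafts."""
    state = {"drafts": 0}

    def generate(feedback: list[str]) -> str:
        state["drafts"] += 1
        return f"draft-{state['drafts']}"

    def review(artifact: str) -> list[str]:
        return [] if state["drafts"] >= rounds_needed else ["still failing"]

    return run_review_loop(generate, review)


print(demo(2))  # ('APPROVED', 2)
print(demo(5))  # ('ESCALATE_TO_HUMAN', 3)
```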

Ground Truth Documents and the Pipeline

The adversarial pipeline doesn't start from a vague prompt. Every workflow begins with an intake skill that produces a structured ground truth document. The pipeline then runs from that document, not from the original user request.

Brainstorm: Turning Ideas into Specs

The /brainstorm skill is the feature creation workflow. Given a feature idea, it first explores the codebase to understand the existing architecture, tech stack, and patterns. Then it asks 5-15 clarifying questions designed to front-load high-impact decisions:

The codebase uses DynamoDB for storage. For this feature's data, should we:

A) Add tables to the existing DynamoDB setup
B) Use a different storage approach (e.g., S3 for documents)
C) Both - DynamoDB for metadata, S3 for content

These aren't generic questions. They're grounded in what the skill found during codebase exploration. The skill identifies the real decision points for this specific project and surfaces them before any planning or code generation begins.

The output is brainstorm.md, a structured design spec. Not a conversation transcript, but a distilled set of decisions that the Planner agent can consume cold. This document becomes the single source of truth for the entire pipeline run.

Repository Evaluation, Health, and Documentation Audits

The same ground-truth-document pattern applies to the audit workflows:

/repo-eval spawns three evaluator agents in parallel (the Pragmatist, the Oncall Engineer, the Team Lead), each scoring the codebase from a different lens across 12 pillars. The output is eval.md.
/repo-health runs a technical debt auditor across four vectors (architectural, structural, operational, hygiene). The output is health-audit.md.
/doc-health runs six detection phases comparing documentation against actual code. The output is doc-audit.md.
/audit runs any combination of the above. It asks scoping questions once, then spawns up to 5 agents in parallel (3 evaluators + health auditor + doc auditor). All intake documents land in one directory.

Each of these intake skills produces a read-only assessment. The agents doing the evaluation never modify the codebase. They only write their findings into the intake document.

The Pipeline Runs from Ground Truth

The /pipeline skill reads whatever intake documents exist and runs the adversarial GAN loop from them. For a feature, it reads brainstorm.md. For an audit, it reads whichever combination of eval.md, health-audit.md, and doc-audit.md are present.

When multiple intake documents exist (from a combined audit), the Planner reads all findings together and consolidates overlapping concerns into a single unified plan. Phases are tagged by implementer type and ordered:

[HYGIENIST] phases first: subtractive cleanup (deleting dead code, simplifying over-abstractions)
[IMPLEMENTER] phases next: structural fixes on clean code
[FORTIFIER] phases next: locking in the clean state (linting, CI checks, git hooks)
[DOC-ENGINEER] phases last: documentation reflecting final code

The ordering matters. You don't want the implementer building on top of dead code that the hygienist would have removed. You don't want the doc-engineer documenting an API that the fortifier is about to add validation to.
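The tag ordering amounts to a stable sort. Here is a minimal sketch, assuming each phase title starts with its tag in square brackets. The function name and the sample phase titles are made up for illustration.

```python
import re

# Order taken from the list above.
ORDER = ["HYGIENIST", "IMPLEMENTER", "FORTIFIER", "DOC-ENGINEER"]
TAG = re.compile(r"^\[([A-Z-]+)\]")


def order_phases(phases: list[str]) -> list[str]:
    """Sort phases by implementer type, keeping the original order within a type."""
    def rank(phase: str) -> int:
        match = TAG.match(phase)
        if not match or match.group(1) not in ORDER:
            raise ValueError(f"phase has no recognized tag: {phase!r}")
        return ORDER.index(match.group(1))

    return sorted(phases, key=rank)  # sorted() is stable


plan = [
    "[DOC-ENGINEER] Update README",
    "[IMPLEMENTER] Fix retry logic",
    "[HYGIENIST] Delete dead code",
    "[FORTIFIER] Add lint step to CI",
    "[IMPLEMENTER] Split oversized module",
]
for phase in order_phases(plan):
    print(phase)
```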

This separation between intake and pipeline is deliberate. The intake skills are exploratory and interactive. They ask questions, explore the codebase, and produce a document. The pipeline is autonomous. It reads the document and runs through the adversarial loops with minimal human intervention, stopping only at explicit decision points.

What the Adversarial Loop Actually Catches

In practice, the adversarial loops catch issues that single-pass generation consistently misses.

Plan Review catches:

Hallucinated file paths (the Planner says "modify" a file that doesn't exist)
Phantom dependencies (Phase 2 assumes a model that Phase 1 never creates)
Test strategies that require live cloud resources instead of mocks
Ambiguous instructions that a zero-context engineer could misinterpret
Deadlocks in task ordering (Task 3 needs the output of Task 5)

Code Review catches:

Placeholder tests (expect(true).toBe(true))
Deviations from Phase-0 architecture conventions
Missing error path coverage (only happy paths tested)
Hardcoded secrets and input validation gaps

Verification catches:

Remediation targets that weren't actually addressed
Regressions introduced during fixes
Partial fixes where the symptom changed but the root cause remains

An earlier design re-ran the full evaluator or auditor agents after remediation, which meant 3-5 agents re-scanning the entire codebase. This was token-expensive and redundant, since the per-phase reviewers had already verified each fix. The current design uses a single verification agent with a targeted scope: read the original intake document findings and check each specific file:line location. One agent, targeted scope, a fraction of the tokens. Evaluator and auditor agents run exactly once (during intake) and never again.

Honest Trade-offs

This pipeline is not free. Here are the trade-offs to weigh:

Token Cost

Multiple agents reviewing each other's work uses significantly more tokens than a single-pass approach. The adversarial loops can triple the total token usage for a feature. On a subscription plan, this means hitting session limits faster. On API billing, this means real money.

Time

A feature that takes one agent 10 minutes might take the pipeline 30-45 minutes with review loops. Multi-agent frameworks in general are slower than single-pass. The adversarial loops add time on top of the orchestration overhead that any multi-agent system carries.

Orchestrator Context Pressure

The orchestrator accumulates agent result summaries across phases. Long pipelines with many phases may hit context compression, which degrades the orchestrator's ability to route effectively.

Not Fire-and-Forget

Despite the automation, complex features benefit from human checkpoints. The pipeline stops and asks for judgment at key moments. If you skip those checkpoints, you may end up with technically correct code that misses the actual requirement.

Diminishing Returns on Simple Tasks

For a quick script, a utility function, or a prototype, the adversarial overhead is pure waste. Single-pass generation is faster, cheaper, and sufficient.

The trade-off is worth it for features where correctness matters more than speed: anything touching auth, payments, data integrity, or infrastructure. When the cost of a bug in production exceeds the cost of the extra tokens to prevent it, the math works. For everything else, single-pass is fine.

When to Use This (And When Not To)

Use adversarial multi-agent patterns when:

The feature touches authentication, authorization, or session management
The code handles payments or financial transactions
Data integrity is critical (migrations, schema changes, ETL pipelines)
Infrastructure changes could affect production (IaC, CI/CD modifications)
The codebase is unfamiliar to the agents (large legacy systems)

Use single-pass generation when:

Prototyping or exploring an idea
Writing utility scripts or one-off tools
Making small, well-scoped changes to familiar code
Speed matters more than thoroughness
You will review the output carefully yourself anyway

Getting Started

Claude Forge is built entirely from Claude Code custom skills. No external tooling, no CI integration required. Install by copying the skills directory into your project:

git clone https://github.com/hatmanstack/claude-forge.git
cp -r claude-forge/.claude/skills/ /path/to/your-project/.claude/skills/

Then in your project:

# Feature development
/brainstorm I want to add webhook support for payment events
/pipeline 2026-03-12-payment-webhooks

# Full audit (health + eval + docs), one command
/audit all
/pipeline 2026-03-16-audit-remediation

# Individual audits
/repo-eval
/repo-health
/doc-health

The pipeline handles the orchestration. You'll see progress reports between stages, and it will stop and ask when something needs human judgment.

Wrapping Up

The adversarial pattern (separate generator and discriminator with isolated context windows, structured feedback as the communication channel, iteration caps for convergence) can be implemented in any agent system that supports subagent spawning with fresh contexts. The specific implementation uses Claude Code skills, but the pattern is the contribution, not the tooling.

Sometimes the best code comes from the argument, not the agreement.

How to Build a Serverless RAG Pipeline on AWS That Scales to Zero

Christopher Galliart — Wed, 11 Mar 2026 18:19:40 +0000

Most RAG tutorials end the same way: you've got a working prototype and a bill for a vector database that runs whether anyone's querying it or not. Add an always-on embedding service, a hosted LLM endpoint, and the usual AWS infrastructure, and you're looking at real money before a single user shows up.

But it doesn't have to work that way. In this tutorial, you'll deploy a fully serverless RAG pipeline that processes documents, images, video, and audio, then scales to zero when nobody's using it.

Everything runs in your AWS account, your data never leaves your infrastructure, and your ongoing monthly cost for a modest knowledge base will be closer to 2-3 USD than 300 USD.

We'll use RAGStack-Lambda, an open-source project I built on AWS. By the end, you'll have a deployed pipeline with a dashboard, an AI chat interface with source citations, a drop-in web component you can embed in any app, and an MCP server you can use to feed your assistant context.

What This Actually Costs

Before we build anything, let's talk money, because the cost story is the whole point.

RAG pipelines have two cost phases: ingestion (processing your documents once) and operation (querying them over time).

Most platforms charge you a flat monthly rate regardless of which phase you're in. A serverless architecture flips that: ingestion costs something, and then everything scales to zero.

Ingestion: The One-Time Hit

When you upload documents, several things happen: text extraction (OCR for PDFs and images), embedding generation, metadata extraction, and storage. Here's what that actually costs per service:

Textract (OCR): This is the most expensive part of ingestion, and it only applies to scanned PDFs and images that need text extraction. Plain text, HTML, CSV, and other text-based formats skip this entirely.

Textract charges about 1.50 USD per 1,000 pages for standard text detection. If you're uploading 500 pages of scanned PDFs, that's about 0.75 USD. A heavy initial load of several thousand scanned pages might run 5-10 USD. But once your documents are processed, you never pay this again unless you add new ones.

Bedrock Embeddings (Nova Multimodal): This is where your content gets converted into vectors for semantic search. The pricing is almost comically cheap:

Text: 0.00002 USD per 1,000 input tokens
Images: 0.00115 USD per image
Video/Audio: 0.00200 USD per minute

To put that in perspective: if you have 1,500 text documents averaging 2,500 tokens each after chunking, your total embedding cost is about 0.08 USD. A knowledge base with 500 images runs 0.58 USD. Even a mixed corpus of text, images, and a few hours of video stays well under 2 USD for the entire embedding pass. This is a one-time cost – you only re-embed if you add or update documents.

Bedrock LLM (Metadata Extraction): RAGStack uses an LLM to analyze each document and extract structured metadata automatically. This is a few inference calls per document using Nova Lite or a similar model. At 0.06 USD/0.24 USD per million input/output tokens, processing 1,500 documents costs well under 1 USD.

S3 Vectors (Storage): Storing your embeddings. At 0.06 USD per GB/month, a knowledge base of 1,500 documents with 1,024-dimension vectors takes up a trivially small amount of space. We're talking pennies per month.

S3 (Document Storage): Your source documents in standard S3. Even cheaper, 0.023 USD per GB/month.

DynamoDB: Stores document metadata and processing state. The on-demand pricing model means you pay per request during ingestion, then essentially nothing at rest. A few cents for the initial load.

To put real numbers on it: if you upload 200 text documents (PDFs, HTML, markdown), your total ingestion cost is likely under 1 USD. If you upload 1,000 scanned PDFs that need OCR, you might see 5-8 USD as a one-time hit. That 7-10 USD figure you might see referenced? That's the upper end for a heavy initial load with lots of OCR work.
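The Textract and embedding figures above can be checked with a few lines of arithmetic. This sketch uses only the prices quoted in this section. The function names are made up, and the prices may change, so treat it as a calculator template rather than a quote.

```python
# Prices quoted in this article, in USD.
TEXTRACT_PER_1000_PAGES = 1.50
EMBED_TEXT_PER_1000_TOKENS = 0.00002
EMBED_PER_IMAGE = 0.00115
EMBED_PER_MEDIA_MINUTE = 0.00200


def textract_cost(pages: int) -> float:
    """One-time OCR cost for scanned pages."""
    return pages / 1000 * TEXTRACT_PER_1000_PAGES


def embedding_cost(text_tokens: int = 0, images: int = 0, media_minutes: float = 0.0) -> float:
    """One-time embedding cost across text, images, and video/audio."""
    return (
        text_tokens / 1000 * EMBED_TEXT_PER_1000_TOKENS
        + images * EMBED_PER_IMAGE
        + media_minutes * EMBED_PER_MEDIA_MINUTE
    )


print(f"{textract_cost(500):.2f}")                         # 500 scanned pages
print(f"{embedding_cost(text_tokens=1_500 * 2_500):.3f}")  # 1,500 docs x 2,500 tokens
print(f"{embedding_cost(images=500):.3f}")                 # 500 images
```

The three lines print 0.75, 0.075, and 0.575, which match the rounded figures quoted above (0.75, about 0.08, and 0.58 USD).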

Operation: Where Scale-to-Zero Shines

Once your documents are ingested, the pipeline is waiting. Not running. Waiting. Here's what each query costs:

Lambda: Invocations are billed per request and duration. The free tier covers 1 million requests/month. For a personal or small-team knowledge base, you may never leave the free tier.

S3 Vectors (Queries): 2.50 USD per million query API calls, plus a per-TB data processing charge. For a small index queried a few hundred times a month, this rounds to effectively zero.

Bedrock (Chat Inference): This is your main operating cost. Each chat response requires an LLM call. Using Nova Lite at 0.06 USD per million input tokens and 0.24 USD per million output tokens, a typical RAG query (retrieval context + user question + response) might cost 0.001-0.003 USD per query. A hundred queries a month is 0.10-0.30 USD.

Step Functions: Orchestrates the document processing pipeline. Standard workflows charge 0.025 USD per 1,000 state transitions. Minimal during operation since it's only active during ingestion.

Cognito: User authentication. Free for the first 10,000 monthly active users.

CloudFront: Serves the dashboard UI. Free tier covers 1 TB of data transfer per month.

API Gateway: Handles GraphQL API requests. Free tier covers 1 million API calls per month.

Add it all up for a knowledge base with 500 documents getting a few hundred queries per month, and your monthly operating cost is somewhere between 0.50 USD and 3.00 USD. Most of that is the LLM inference for chat responses.
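Since chat inference dominates the operating cost, the per-query figure is worth sanity-checking. The sketch below uses the Nova Lite prices quoted above. The token counts (20,000 input, 1,000 output) are purely illustrative assumptions, and real queries will vary with how much context is retrieved.

```python
# Nova Lite prices quoted in this article, in USD per million tokens.
INPUT_USD_PER_MILLION = 0.06
OUTPUT_USD_PER_MILLION = 0.24


def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Inference cost of a single chat response."""
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000


# Hypothetical query: retrieved context + question in, a few paragraphs out.
per_query = query_cost(input_tokens=20_000, output_tokens=1_000)
monthly = per_query * 100  # a hundred queries a month

print(f"per query: {per_query:.5f} USD")
print(f"per month: {monthly:.3f} USD")
```

With those assumed token counts, one query costs about 0.0014 USD and a hundred queries about 0.14 USD, inside the ranges given above. This covers inference only, not the other services listed.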

The Comparison That Matters

Here's the same pipeline on a traditional always-on stack:

Service	RAGStack-Lambda	Traditional Stack
Vector Database	S3 Vectors: pennies/mo	Pinecone Starter: 70 USD/mo
Vector Database (alt)	S3 Vectors: pennies/mo	OpenSearch Serverless: about 350 USD/mo min
Compute	Lambda: free tier	EC2 or ECS: 50-150 USD/mo
LLM Inference	Same per-query cost	Same per-query cost
Total (idle)	about 0.50-3.00 USD/mo	120-500 USD/mo

The LLM inference cost per query is roughly the same everywhere – that's Bedrock's on-demand pricing regardless of your architecture. The difference is everything else. Traditional stacks pay a floor cost whether anyone's using them or not. A serverless stack pays for what it uses, and idle costs essentially nothing.

What About Transcribe?

If you're uploading video or audio, AWS Transcribe adds cost for speech-to-text conversion. Standard transcription runs about 0.024 USD per minute of audio. A 10-minute video costs 0.24 USD to transcribe. This is a one-time ingestion cost: once transcribed and embedded, the resulting text chunks are queried like any other document.

What You're Building

By the end of this tutorial, you'll have a deployed pipeline that does the following:

You upload a document (PDF, image, video, audio, HTML, CSV, and many more formats) through a web dashboard.
The pipeline detects the file type and routes it to the right processor. Scanned PDFs go through OCR via Textract. Video and audio go through Transcribe for speech-to-text, split into 30-second searchable chunks with speaker identification. Images get visual embeddings and any caption text you provide.
An LLM analyzes each document and extracts structured metadata: topic, document type, date range, people mentioned, whatever's relevant. This happens automatically.
Everything gets embedded using Amazon Nova Multimodal Embeddings and stored in a Bedrock Knowledge Base backed by S3 Vectors.
You (or your users) ask questions through an AI chat interface. The pipeline retrieves relevant documents, passes them as context to a Bedrock LLM, and returns an answer with collapsible source citations, including timestamp links for video and audio that jump to the exact position.

All of this runs in your AWS account. No external control plane, no third-party services beyond AWS itself.

The Architecture

A few things to note about this architecture:

Step Functions orchestrate everything. When a document is uploaded, a state machine manages the entire processing flow: detecting the file type, routing to the right processor, waiting for async operations like Transcribe jobs, then triggering embedding and metadata extraction.

This is what makes the pipeline reliable without a running server. If a step fails, it retries. You can see exactly where every document is in the processing pipeline.

Lambda does the compute. Every processing step is a Lambda function. They spin up when needed, run for a few seconds to a few minutes, and shut down. There's no EC2 instance idling at 3 AM.

S3 Vectors is the vector store. Your embeddings live in S3's purpose-built vector storage rather than in a dedicated vector database like Pinecone or OpenSearch.

This is what makes the "scale to zero" cost possible: you're paying object storage rates for vector data instead of keeping a database cluster warm. It also means your vectors are sitting in your own S3 bucket, not in a third-party managed service that holds your data on their terms.

Cognito handles auth. The dashboard and API are protected with Cognito user pools. When you deploy, you get a temporary password via email. The web component uses IAM-based authentication, and server-side integrations use API key auth.

CloudFront serves the UI. The dashboard is a static React app served through CloudFront, so there's no web server to maintain.
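The "detect the file type and route it" step that the state machine performs can be sketched as a small function. Everything here is an assumption for illustration: the processor names, the extension sets, and the `has_text_layer` flag are hypothetical, and RAGStack's actual detection logic may differ.

```python
from pathlib import Path

# Hypothetical extension groups; the project's real lists may differ.
TEXT = {".txt", ".md", ".html", ".csv", ".json", ".xml"}
IMAGES = {".jpg", ".jpeg", ".png", ".gif", ".webp"}
MEDIA = {".mp4", ".mp3", ".wav"}


def choose_processor(filename: str, has_text_layer: bool = True) -> str:
    """Pick a processing branch from the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in MEDIA:
        return "transcribe"            # speech-to-text first
    if suffix in IMAGES:
        return "image"                 # visual embedding plus caption/OCR text
    if suffix == ".pdf":
        # Scanned PDFs have no extractable text and need OCR.
        return "direct_extraction" if has_text_layer else "textract_ocr"
    if suffix in TEXT:
        return "direct_extraction"
    raise ValueError(f"unsupported file type: {suffix or filename!r}")


print(choose_processor("notes.md"))                         # direct_extraction
print(choose_processor("scan.pdf", has_text_layer=False))   # textract_ocr
print(choose_processor("standup.MP4"))                      # transcribe
```

In a Step Functions workflow, the equivalent decision would typically live in a choice state that branches on a value computed by an earlier step.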

Two Ways to Deploy

You have two deployment paths depending on what you want:

AWS Marketplace (the fast path): Click deploy, fill in two fields (stack name and email), and wait about 10 minutes. No local tooling required. This is the path we'll walk through first.

From Source (the developer path): Clone the repo, run publish.py, and deploy via SAM CLI. This is the path for when you want to customize the processing pipeline, modify the UI, or contribute to the project. We'll cover this after the Marketplace walkthrough.

Both paths produce the same stack. The Marketplace version just wraps the CloudFormation template in a one-click deployment.

Prerequisites

Before you deploy, you'll need:

An AWS account with permissions to create CloudFormation stacks, Lambda functions, S3 buckets, DynamoDB tables, and Cognito user pools. If you're using an admin account, you're covered.
Bedrock model access: RAGStack defaults to us-east-1 because that's where Nova Multimodal Embeddings is available. Amazon's own models (including Nova) are available by default in Bedrock, with no manual enablement required. Just make sure your IAM role has the necessary bedrock:InvokeModel permissions.
For the Marketplace path: just a web browser.
For the source path: Python 3.13+, Node.js 24+, AWS CLI and SAM CLI configured, and Docker (for building Lambda layers).

Deploying from AWS Marketplace

This is the fastest path – no local tools, no CLI, no Docker. You'll launch a CloudFormation stack and have a working pipeline in about 10 minutes.

Step 1: Launch the Stack

Click the direct deploy link to open CloudFormation's "Quick create stack" page with the template pre-loaded.

Step 2: Fill In Two Fields

The page has a lot of options, but you only need two:

Stack name: Must be lowercase. This becomes the prefix for all your AWS resources (for example, my-docs, team-kb, project-notes). Keep it short.
Admin Email: Under Required Settings. Cognito will send your temporary login credentials here. Use an email you can access right now.

Everything else – Build Options, Advanced Settings, OCR Backend, model selections – can stay at the defaults. They're there for customization later, but the defaults work out of the box.

Step 3: Deploy

Scroll to the bottom, check the three acknowledgment boxes under "Capabilities and transforms," and click Create stack.

Deployment takes roughly 10 minutes. You can watch the progress in the CloudFormation Events tab if you're curious, but there's nothing to do until the stack status flips to CREATE_COMPLETE.

Step 4: Log In

Once the stack finishes, check your email. Cognito sends you the dashboard URL and a temporary password. Log in, set a new password, and you're looking at an empty dashboard ready for documents.

Deploying from Source

If you want to customize the pipeline, modify the UI, or contribute to the project, deploy from source instead.

Step 1: Clone and Set Up

git clone https://github.com/HatmanStack/RAGStack-Lambda.git
cd RAGStack-Lambda

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Step 2: Deploy

The publish.py script handles everything: building the frontend, packaging Lambda functions, and deploying via SAM CLI.

python publish.py \
  --project-name my-docs \
  --admin-email admin@example.com

This defaults to us-east-1 for Nova Multimodal Embeddings. The script will build the React dashboard, build the web component, package all Lambda layers with Docker, and deploy the CloudFormation stack through SAM.

First deploy takes longer (15-20 minutes) because it's building everything from scratch. Subsequent deploys are faster since SAM caches unchanged resources.

If you only want to iterate on the backend and skip UI builds:

# Skip dashboard build (still builds web component)
python publish.py --project-name my-docs --admin-email admin@example.com --skip-ui

# Skip ALL UI builds
python publish.py --project-name my-docs --admin-email admin@example.com --skip-ui-all

Once it finishes, you'll get the same Cognito email and dashboard URL as the Marketplace path.

Uploading Your First Documents

The dashboard has tabs for different content types. We'll start with the Documents tab since that's the most common use case.

Documents

Click the Documents tab and upload a file. RAGStack accepts a wide range of formats: PDF, DOCX, XLSX, HTML, CSV, JSON, XML, EML, EPUB, TXT, and Markdown. Drag and drop or use the file picker.

Once uploaded, the document enters the processing pipeline. You'll see the status update in real time:

UPLOADED: File received and stored in S3.
PROCESSING: Step Functions has picked it up and routed it to the right processor. Text-based files (HTML, CSV, Markdown) go through direct extraction. Scanned PDFs and images go through Textract OCR. The LLM analyzes the content and extracts structured metadata: topic, document type, people mentioned, date ranges, whatever's relevant to the content.
INDEXED: Embeddings generated, vectors stored, document is searchable.

Text documents typically process in 1-5 minutes. OCR-heavy documents (scanned PDFs, images with text) can take 2-15 minutes depending on page count.

Images

The Images tab works differently. Upload a JPG, PNG, GIF, or WebP and you can add a caption. Both the visual content and caption text get embedded using Nova Multimodal Embeddings, so you can search by what's in the image or by your description of it.

This is where multimodal embeddings earn their keep. A traditional text-only RAG pipeline would need you to describe every image manually. Here, the image itself becomes searchable, and since everything stays in your AWS account, you're not sending personal photos or sensitive visual content to an external service to get there.

What About Video and Audio?

Upload video or audio files and RAGStack routes them through Amazon Transcribe for speech-to-text conversion. The transcript gets split into 30-second chunks with speaker identification, then embedded like any other document. When chat results reference a video source, you get timestamp links that jump to the exact position in the recording.
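To make the 30-second chunking concrete, here is a rough sketch of how speaker-labeled transcript segments could be grouped into fixed windows. The Segment shape, chunk_transcript, and timestamp_link are hypothetical names for illustration, and the `#t=` link format is an assumption borrowed from standard media fragment URLs, not something confirmed by RAGStack.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds from the start of the recording
    speaker: str
    text: str


def chunk_transcript(segments, window=30.0):
    """Group segments into fixed windows keyed by their start time."""
    chunks = {}
    for seg in segments:
        index = int(seg.start // window)
        chunk = chunks.setdefault(index, {"start": index * window, "lines": []})
        chunk["lines"].append(f"{seg.speaker}: {seg.text}")
    # Return chunks in chronological order, one text blob per window.
    return [
        {"start": c["start"], "text": "\n".join(c["lines"])}
        for _, c in sorted(chunks.items())
    ]


def timestamp_link(url, start):
    """Build a link that opens a recording at `start` seconds."""
    return f"{url}#t={int(start)}"
```

Keeping the window start on each chunk is what makes the timestamp links possible later: the chunk that matched the query already knows where it sits in the recording.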

Web Scraping

The Scrape tab lets you pull websites directly into your knowledge base. Enter a URL and RAGStack crawls the page, extracts the content, and processes it through the same pipeline as uploaded documents: metadata extraction, embedding, and indexing.

This is useful for building a knowledge base from existing web content without manually saving and uploading pages. Documentation sites, blog archives, reference material, anything publicly accessible.

Chatting With Your Knowledge Base

This is the payoff. Go to the Chat tab, type a question, and RAGStack retrieves relevant documents from your knowledge base, passes them as context to a Bedrock LLM, and returns an answer with source citations.

The citations are collapsible: click one to expand it and see which documents informed the answer, with the option to download the source file. For video and audio sources, you get clickable timestamps that jump to the relevant moment.
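The retrieve-then-answer flow boils down to packing the retrieved passages into the prompt in a way that lets the model point back at them. The sketch below is a generic illustration of that idea, not RAGStack's prompt: build_prompt, citations, and the passage dictionary shape are all hypothetical.

```python
def build_prompt(question, passages):
    """Number each retrieved passage so the model can cite it as [n]."""
    context = "\n\n".join(
        f"[{i}] ({p['source']}) {p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"{context}\n\nQuestion: {question}"
    )


def citations(passages):
    """Map citation numbers back to source files for display."""
    return {i: p["source"] for i, p in enumerate(passages, start=1)}
```

Because the numbering is deterministic, the UI can resolve a `[2]` in the answer back to the file it came from without asking the model for filenames.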

Metadata Filtering

If you've uploaded enough documents to have meaningful metadata categories, the chat interface lets you filter search results by metadata before querying. RAGStack auto-discovers the metadata structure from your documents, so you don't configure this manually. The filters just appear as your knowledge base grows.

This is useful when you have a large mixed corpus. Instead of hoping the vector search picks the right context from thousands of documents, you can narrow it down: "only search documents about project X" or "only search content from Q4 2024."
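Conceptually, metadata filtering shrinks the candidate set before similarity ranking happens. Here is a minimal in-memory sketch of that pattern, assuming each document is a dictionary with a "vector" and a "metadata" field. Those field names and the filtered_search function are hypothetical, and a real vector backend may apply filters differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, docs, filters, top_k=3):
    """Keep docs whose metadata matches every filter, then rank by similarity."""
    candidates = [
        d for d in docs
        if all(d["metadata"].get(k) == v for k, v in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda d: cosine(query_vec, d["vector"]),
        reverse=True,
    )
    return ranked[:top_k]
```

With a filter like {"project": "X"}, a near-duplicate document from another project never reaches the ranking step, which is the "narrow it down" behavior described above.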

Embedding the Web Component in Your App

The dashboard is useful for managing your knowledge base, but the real power is embedding RAGStack's chat in your own application. The web component works with any framework: React, Vue, Angular, Svelte, or plain HTML.

Load the script once from your CloudFront distribution:

Then drop the component wherever you want a chat interface:

That's it. The component handles authentication (via IAM), manages conversation state, and renders source citations, all self-contained. Your CloudFront URL is in the stack outputs.

For server-side integrations that don't need a UI, the GraphQL API is available with API key authentication. You can find your endpoint and API key in the dashboard under Settings.
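For a server-side caller, a GraphQL request is just an HTTP POST with a JSON body. The sketch below builds such a request with Python's standard library. Treat the details as assumptions: the `x-api-key` header name follows a common AWS convention but is not confirmed here, build_graphql_request is a hypothetical helper, and the query uses only GraphQL's built-in `__typename` field because the actual RAGStack schema isn't shown in this article.

```python
import json
import urllib.request


def build_graphql_request(endpoint, api_key, query, variables=None):
    """Build (but do not send) a POST request for a GraphQL endpoint."""
    payload = json.dumps({"query": query, "variables": variables or {}})
    return urllib.request.Request(
        endpoint,
        data=payload.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # assumed header name
        },
        method="POST",
    )


# Placeholder values; use the endpoint and key from the dashboard Settings.
request = build_graphql_request(
    "https://example.com/graphql",
    "YOUR_API_KEY",
    "query { __typename }",
)
```

Sending it is a matter of passing the request to urllib.request.urlopen and decoding the JSON response. Check the project's API docs for the real query and mutation names before building on this.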

Using the MCP Server

RAGStack includes an MCP server that connects your knowledge base to AI assistants like Claude Desktop, Cursor, VS Code, and Amazon Q CLI. Instead of switching to the dashboard to search your documents, you ask your assistant directly.

Install it:

pip install ragstack-mcp

Then add it to your AI assistant's MCP configuration:

{
  "ragstack": {
    "command": "uvx",
    "args": ["ragstack-mcp"],
    "env": {
      "RAGSTACK_GRAPHQL_ENDPOINT": "YOUR_ENDPOINT",
      "RAGSTACK_API_KEY": "YOUR_API_KEY"
    }
  }
}

Your endpoint and API key are in the dashboard under Settings. Once configured, type @ragstack in your assistant's chat to invoke the MCP server, then ask things like "search my knowledge base for authentication docs" and it queries RAGStack directly.

See the MCP Server docs for the full list of available tools and setup details.

What You Can Build From Here

You've got a deployed RAG pipeline that costs almost nothing to run and handles text, images, video, and audio. A few directions you might take it:

A searchable personal archive. Every conference talk you've saved, every PDF textbook, every tutorial video that's sitting in a folder somewhere. Upload it all, and now you have one search interface across years of accumulated material. The multimodal embeddings mean your screenshots and diagrams are searchable too, not just the text.

I built a family archive app this way (scanned letters, old photos, home videos), with RAGStack deployed as a nested CloudFormation stack so the whole family can search across decades of memories using the chat widget.

A second brain for a client project. Scrape the client's existing docs, upload the SOW and meeting notes, drop in the codebase documentation. Now you've got a searchable knowledge base scoped to that engagement. Spin it up at the start, tear it down when the contract ends. At these costs, it's disposable infrastructure.

AI chat over a niche dataset. Recipe collections, legal filings, research papers, local government meeting minutes: any corpus that's too specialized for general-purpose LLMs to know well. The web component means you can ship it as a standalone tool without building a frontend from scratch.

RAG for your MCP workflow. If you're already using Claude Desktop or Cursor, the MCP server turns your knowledge base into another tool your assistant can reach for. Upload your team's runbooks and architecture docs, and now @ragstack in your editor gives you instant context without tab-switching.

Wrapping Up

The serverless RAG pipeline you just deployed handles document processing, multimodal embeddings, metadata extraction, and AI chat with source citations, all scaling to zero when idle, all running in your AWS account. Your documents, your vectors, your infrastructure. The traditional approach to this stack costs 120-500 USD/month in baseline infrastructure. This one costs pocket change.

The full source is at github.com/HatmanStack/RAGStack-Lambda. File issues, open PRs, or just poke around the architecture. If you want to go deeper on the technical tradeoffs, particularly how filtered vector search behaves on cost-optimized backends like S3 Vectors, that's a story for the next post.
