<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Nataraj Sundar - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Nataraj Sundar - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Thu, 14 May 2026 17:32:56 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/natarajsundar/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Use Context Hub (chub) to Build a Companion Relevance Engine ]]>
                </title>
                <description>
                    <![CDATA[ Large language models can write code quickly, but they still misremember APIs, miss version-specific details, and forget what they learned at the end of a session. That is the problem Context Hub is t ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-context-hub-chub-to-build-a-companion-relevance-engine/</link>
                <guid isPermaLink="false">69e299d0fd22b8ad6276817b</guid>
                
                    <category>
                        <![CDATA[ context-hub ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Developer Tools ]]>
                    </category>
                
                    <category>
                        <![CDATA[ search ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agents ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nataraj Sundar ]]>
                </dc:creator>
                <pubDate>Fri, 17 Apr 2026 20:36:32 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/14f9768e-436d-4c7e-b86c-3d380e821354.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Large language models can write code quickly, but they still misremember APIs, miss version-specific details, and forget what they learned at the end of a session.</p>
<p>That is the problem Context Hub is trying to solve.</p>
<p>Context Hub (<code>chub</code>) gives coding agents curated, versioned documentation and skills that they can search and fetch through a CLI. It also gives them two learning loops: local annotations for agent memory and feedback for maintainers.</p>
<p>In this tutorial, you'll learn how the official <code>chub</code> workflow works, how Context Hub organizes docs and skills, how annotations and feedback create a memory loop, and how to build a <a href="https://github.com/natarajsundar/context-hub-relevance-engine/">companion relevance engine</a> that improves retrieval without breaking the upstream content model.</p>
<p>This tutorial uses two public repositories side by side:</p>
<ul>
<li><p>the official upstream project: <a href="https://github.com/andrewyng/context-hub">andrewyng/context-hub</a></p>
</li>
<li><p>the companion implementation for this article: <a href="https://github.com/natarajsundar/context-hub-relevance-engine/">natarajsundar/context-hub-relevance-engine</a></p>
</li>
</ul>
<p>I've also opened a corresponding upstream pull request from my fork to the main project. If you want to track that work from the article, use the upstream pull request list filtered by author: <a href="https://github.com/andrewyng/context-hub/pulls?q=is%3Apr+author%3Anatarajsundar">andrewyng/context-hub pull requests by <code>natarajsundar</code></a>.</p>
<h2 id="heading-what-well-build">What We'll Build</h2>
<p>By the end of this tutorial, you'll have:</p>
<ul>
<li><p>a clear mental model for how Context Hub works</p>
</li>
<li><p>a working local install of the official <code>chub</code> CLI</p>
</li>
<li><p>a repeatable workflow for search, fetch, annotations, and feedback</p>
</li>
<li><p>a companion repo that adds an additive reranking layer on top of a Context-Hub-style content tree</p>
</li>
<li><p>a small benchmark and local comparison UI you can run end to end</p>
</li>
<li><p>a clear bridge between the companion repo and the smaller upstream PR</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have:</p>
<ul>
<li><p>Node.js 18 or newer</p>
</li>
<li><p>npm</p>
</li>
<li><p>comfort with the terminal</p>
</li>
<li><p>basic familiarity with Markdown</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-how-to-understand-context-hub">How to Understand Context Hub</a></p>
</li>
<li><p><a href="#heading-how-to-understand-the-official-repo-the-companion-repo-and-the-upstream-pr">How to Understand the Official Repo, the Companion Repo, and the Upstream PR</a></p>
</li>
<li><p><a href="#heading-how-to-install-and-use-the-official-cli">How to Install and Use the Official CLI</a></p>
</li>
<li><p><a href="#heading-how-to-understand-docs-skills-and-the-content-layout">How to Understand Docs, Skills, and the Content Layout</a></p>
</li>
<li><p><a href="#heading-how-to-use-incremental-fetch-and-layered-sources">How to Use Incremental Fetch and Layered Sources</a></p>
</li>
<li><p><a href="#heading-how-to-use-annotations-and-feedback-to-create-a-memory-loop">How to Use Annotations and Feedback to Create a Memory Loop</a></p>
</li>
<li><p><a href="#heading-how-to-see-where-relevance-still-misses">How to See Where Relevance Still Misses</a></p>
</li>
<li><p><a href="#heading-how-the-companion-relevance-engine-improves-retrieval">How the Companion Relevance Engine Improves Retrieval</a></p>
</li>
<li><p><a href="#heading-how-to-run-the-companion-repo-end-to-end">How to Run the Companion Repo End to End</a></p>
</li>
<li><p><a href="#heading-how-to-read-the-benchmark-honestly">How to Read the Benchmark Honestly</a></p>
</li>
<li><p><a href="#heading-how-to-connect-the-companion-repo-to-the-upstream-pr">How to Connect the Companion Repo to the Upstream PR</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-sources">Sources</a></p>
</li>
</ol>
<h2 id="heading-how-to-understand-context-hub">How to Understand Context Hub</h2>
<p>Context Hub is easiest to understand as a workflow for turning fast-moving documentation into a reliable input for coding agents.</p>
<p>Instead of asking an agent to rely on whatever it remembers from training data, you give it a predictable contract:</p>
<ol>
<li><p>search for the right entry</p>
</li>
<li><p>fetch the right doc or skill</p>
</li>
<li><p>write code against that curated content</p>
</li>
<li><p>save local lessons as annotations</p>
</li>
<li><p>send doc-quality feedback back to maintainers</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/09d75c85-fbb0-4c9a-86d5-8acdff4e1abf.png" alt="Diagram showing the Context Hub loop from developer prompt to agent search and fetch, then annotations and maintainer feedback." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>That system boundary matters.</p>
<p>It makes the agent easier to audit, easier to improve, and easier to extend. It also keeps the interface small enough that you can reason about where the failures happen. If the agent still misses the answer, you can ask whether the problem happened during search, fetch, context selection, or generation.</p>
<h2 id="heading-how-to-understand-the-official-repo-the-companion-repo-and-the-upstream-pr">How to Understand the Official Repo, the Companion repo, and the Upstream PR</h2>
<p>This tutorial is intentionally split across two codebases and one contribution path.</p>
<p>The official upstream project, <a href="https://github.com/andrewyng/context-hub">andrewyng/context-hub</a>, is the source of truth for the real CLI, the content model, and the documented workflows. That's the codebase you should use to learn how <code>chub</code> works today.</p>
<p>The companion repository, <a href="https://github.com/natarajsundar/context-hub-relevance-engine/">natarajsundar/context-hub-relevance-engine</a>, is where the relevance ideas in this article are made concrete. It's a companion implementation, not a replacement product. Its job is to make retrieval tradeoffs visible, measurable, and easy to run locally.</p>
<p>The upstream PR is the bridge between those two worlds. The companion repo is where you can iterate faster on benchmarks, reranking, and the comparison UI. The upstream PR is where the smallest reviewable slices can be proposed back to the main project. You can track that thread here: <a href="https://github.com/andrewyng/context-hub/pulls?q=is%3Apr+author%3Anatarajsundar">upstream PR search filtered by author</a>.</p>
<p>That three-part framing keeps the article honest:</p>
<ul>
<li><p><strong>use the upstream repo</strong> to understand the current system</p>
</li>
<li><p><strong>use the companion repo</strong> to explore relevance improvements end to end</p>
</li>
<li><p><strong>use the upstream PR</strong> to show how a larger idea can be broken into reviewable pieces</p>
</li>
</ul>
<h2 id="heading-how-to-install-and-use-the-official-cli">How to Install and Use the Official CLI</h2>
<p>The official quick start is intentionally small.</p>
<pre><code class="language-bash">npm install -g @aisuite/chub
</code></pre>
<p>Once the CLI is installed, you can search for what is available and fetch a specific entry:</p>
<pre><code class="language-bash">chub search openai
chub get openai/chat --lang py
</code></pre>
<p>That's the happy path, but it helps to think through the request flow.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/c5ff71d4-5e51-48b8-bbd3-fc2aafa93b9d.png" alt="Sequence diagram showing the developer asking the agent for current docs, the agent calling chub search and chub get, and the CLI fetching docs from the registry." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In practice, the most useful detail is that the CLI is designed for the <strong>agent</strong> to use, not just for the human to use by hand.</p>
<p>That's why the upstream CLI also ships a <code>get-api-docs</code> skill. For example, if you use Claude Code, you can copy the skill into your local project like this:</p>
<pre><code class="language-bash">mkdir -p .claude/skills
cp $(npm root -g)/@aisuite/chub/skills/get-api-docs/SKILL.md \
  .claude/skills/get-api-docs.md
</code></pre>
<p>That step teaches the agent a retrieval habit:</p>
<blockquote>
<p>Before you write code against a third-party SDK or API, use <code>chub</code> instead of guessing.</p>
</blockquote>
<p>That behavioral rule is often as important as the docs themselves.</p>
<h2 id="heading-how-to-understand-docs-skills-and-the-content-layout">How to Understand Docs, Skills, and the Content Layout</h2>
<p>Context Hub separates content into two categories:</p>
<ul>
<li><p><strong>docs</strong>, which answer “what should the agent know?”</p>
</li>
<li><p><strong>skills</strong>, which answer “how should the agent behave?”</p>
</li>
</ul>
<p>That distinction makes the content model easier to scale. Docs can be versioned and language-specific. Skills can stay short and operational.</p>
<p>The directory structure is also predictable. The content guide organizes entries by author, then by <code>docs</code> or <code>skills</code>, then by entry name.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/3ac72bc2-c869-4e2e-9294-d63b35991135.png" alt="Diagram showing the content tree from author to docs and skills, with DOC.md and SKILL.md feeding a build step that emits registry and search artifacts." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>A small example looks like this:</p>
<pre><code class="language-text">author/docs/payments/python/DOC.md
author/docs/payments/python/references/errors.md
author/skills/login-flows/SKILL.md
</code></pre>
<p>This is one of the reasons Context Hub is easy to work with.</p>
<p>The shape of the content is plain Markdown, the main entry file is predictable, and the build output is inspectable. You don't have to reverse engineer a hidden prompt layer to figure out what the agent is reading.</p>
<h2 id="heading-how-to-use-incremental-fetch-and-layered-sources">How to Use Incremental Fetch and Layered Sources</h2>
<p>One of the best design choices in Context Hub is that it doesn't force you to inject every file into the model on every request.</p>
<p>Instead, the entry file gives you the overview, and the reference files hold the deeper material.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/88d80a48-c991-495a-af25-14a0c0ac9868.png" alt="Diagram showing how chub get can fetch just the main entry file, a specific reference file, or the full entry directory." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>That lets you fetch content in progressively larger slices.</p>
<pre><code class="language-bash">chub get stripe/webhooks --lang py
chub get stripe/webhooks --lang py --file references/raw-body.md
chub get stripe/webhooks --lang py --full
</code></pre>
<p>This is a token-budget feature as much as it is a documentation feature. A good agent should first load the overview, decide what part of the task matters, and only then fetch the specific supporting file.</p>
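<p>If you want that habit as code instead of shell history, here is a minimal sketch of the progressive pattern. It only wraps the <code>chub</code> commands shown above; the helper name and the decision rule are illustrative assumptions, not part of the official tooling.</p>
<pre><code class="language-js">// Sketch: progressively fetch chub content by shelling out to the CLI.
// The command shapes come from the examples above; everything else here
// is an illustrative assumption.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function fetchEntry(entry, lang, file) {
  const args = ["get", entry, "--lang", lang];
  if (file) args.push("--file", file); // deeper slice only when needed
  const { stdout } = await run("chub", args);
  return stdout;
}

// Load the overview first, then fetch one reference file on demand.
const overview = await fetchEntry("stripe/webhooks", "py");
if (overview.includes("references/raw-body.md")) {
  console.log(await fetchEntry("stripe/webhooks", "py", "references/raw-body.md"));
}
</code></pre>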
<p>Context Hub also supports layered sources. You can merge public content with your own local build output through <code>~/.chub/config.yaml</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/67465254-7a7c-4cfc-b9f0-9e94d8c3e2f3.png" alt="Diagram showing community, official, and local team sources merging into one search surface for chub search and chub get." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>A minimal configuration looks like this:</p>
<pre><code class="language-yaml">sources:
  - name: community
    url: https://cdn.aichub.org/v1
  - name: my-team
    path: /opt/team-docs/dist
</code></pre>
<p>That means you can keep public docs in one lane and team-specific runbooks in another lane while still giving the agent one search surface.</p>
<h2 id="heading-how-to-use-annotations-and-feedback-to-create-a-memory-loop">How to Use Annotations and Feedback to Create a Memory Loop</h2>
<p>Context Hub has two different improvement channels.</p>
<p>Annotations are local. They help your agent remember what worked last time. Feedback is shared. It helps maintainers improve the docs for everyone.</p>
<p>That distinction matters because not every lesson belongs in the shared registry. Some lessons are environment-specific. Others point to content quality issues that should be fixed centrally.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/a8514430-08cb-4085-8047-64df25c603c7.png" alt="Diagram showing the agent fetch/write cycle, then branching to local annotations or maintainer feedback before the next task." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Here is what local memory looks like in practice:</p>
<pre><code class="language-bash">chub annotate stripe/webhooks \
  "Remember: Flask request.data must stay raw for Stripe signature verification."
</code></pre>
<p>And here's the feedback path:</p>
<pre><code class="language-bash">chub feedback stripe/webhooks up
</code></pre>
<p>That loop is simple, but it's one of the most important ideas in the project. It turns a one-off debugging lesson into either persistent local memory or a signal that the shared docs need to improve.</p>
<h2 id="heading-how-to-see-where-relevance-still-misses">How to See Where Relevance Still Misses</h2>
<p>The upstream project already has a real ranking story. It uses BM25 and lexical rescue so that package-like identifiers, exact tokens, and fuzzy matches still have a chance to surface.</p>
<p>That is a strong baseline.</p>
<p>But developer queries are often much messier than package names.</p>
<p>People search for:</p>
<ul>
<li><p><code>rrf</code></p>
</li>
<li><p><code>signin</code></p>
</li>
<li><p><code>pg vector</code></p>
</li>
<li><p><code>hnsw</code></p>
</li>
<li><p><code>raw body stripe</code></p>
</li>
</ul>
<p>Those aren't “bad” queries. They're realistic shorthand.</p>
<p>And they expose an opportunity in the content model itself: many of the exact answers live in reference files such as <code>references/rrf.md</code>, <code>references/raw-body.md</code>, and <code>references/hnsw.md</code>.</p>
<p>So the question is not whether the current search works at all. It clearly does. The better question is this:</p>
<blockquote>
<p>How can you improve retrieval without breaking the content contract that already makes Context Hub useful?</p>
</blockquote>
<p>The answer in the companion repo is to keep the current model and add a reranking layer on top of it.</p>
<h2 id="heading-how-the-companion-relevance-engine-improves-retrieval">How the Companion Relevance Engine Improves Retrieval</h2>
<p>The companion repository in this article is <a href="https://github.com/natarajsundar/context-hub-relevance-engine/"><code>context-hub-relevance-engine</code></a>.</p>
<p>It keeps the same broad ideas that make Context Hub attractive:</p>
<ul>
<li><p>plain Markdown content</p>
</li>
<li><p><code>DOC.md</code> and <code>SKILL.md</code> entry points</p>
</li>
<li><p>build artifacts you can inspect</p>
</li>
<li><p>local annotations and feedback</p>
</li>
<li><p>progressive fetch behavior</p>
</li>
</ul>
<p>Then it adds one new build artifact: <code>signals.json</code>.</p>
<p>At build time, the engine extracts extra signals such as:</p>
<ul>
<li><p>headings from the main file</p>
</li>
<li><p>titles and tokens from reference files</p>
</li>
<li><p>language and version metadata</p>
</li>
<li><p>source metadata and freshness</p>
</li>
<li><p>annotation overlap</p>
</li>
<li><p>feedback priors</p>
</li>
</ul>
<p>The first pass stays cheap and transparent. The reranker only runs after the baseline has done its work.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/2ed2dadb-8fff-41ee-904b-0792cafcf744.png" alt="Diagram showing the relevance pipeline from query to BM25 and lexical rescue, then synonym expansion, candidate set building, reranking signals, and final results." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>That approach matters for two reasons.</p>
<p>First, it's additive. You don't have to redesign the content tree.</p>
<p>Second, it's measurable. You can define concrete failure modes, fix them one by one, and run the same benchmark every time you change the scorer.</p>
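<p>To make “additive” concrete, here is a minimal sketch of a rerank pass over <code>signals.json</code>-style data. The field names and weights are illustrative assumptions, not the companion repo's actual scorer.</p>
<pre><code class="language-js">// Sketch of an additive rerank step. `candidates` come from the baseline
// pass (BM25 plus lexical rescue); `signals` holds per-entry data in the
// spirit of signals.json. Field names and weights are assumptions.
function rerank(queryTokens, candidates, signals) {
  return candidates
    .map((candidate) => {
      const s = signals[candidate.id] ?? {};
      let bonus = 0;
      for (const token of queryTokens) {
        if ((s.referenceTitles ?? []).includes(token)) bonus += 40; // e.g. references/rrf.md
        if ((s.headings ?? []).some((h) => h.includes(token))) bonus += 15;
        if ((s.annotationTokens ?? []).includes(token)) bonus += 10; // local memory overlap
      }
      bonus += (s.feedbackPrior ?? 0) * 5; // shared thumbs-up prior
      return { ...candidate, score: candidate.score + bonus };
    })
    .sort((a, b) => b.score - a.score);
}
</code></pre>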
<h2 id="heading-how-to-run-the-companion-repo-end-to-end">How to Run the Companion Repo End to End</h2>
<p>Open the repository on <a href="https://github.com/natarajsundar/context-hub-relevance-engine/">GitHub</a>, clone it using GitHub’s normal clone flow, and then run the commands below from the project root.</p>
<pre><code class="language-bash">cd context-hub-relevance-engine
npm install
npm run build
npm test
</code></pre>
<p>The repository has no third-party runtime dependencies, so <code>npm install</code> is mostly there to keep the workflow familiar. The main commands are all plain Node scripts.</p>
<h3 id="heading-how-to-reproduce-a-baseline-miss">How to Reproduce a Baseline Miss</h3>
<p>Start with the query <code>rrf</code>.</p>
<pre><code class="language-bash">node bin/chub-lab.mjs search rrf --mode baseline --lang python
</code></pre>
<p>Expected output:</p>
<pre><code class="language-text">No results.
</code></pre>
<p>Now run the improved mode.</p>
<pre><code class="language-bash">node bin/chub-lab.mjs search rrf --mode improved --lang python
</code></pre>
<p>Expected top result:</p>
<pre><code class="language-text">langchain/retrievers [doc] score=320.24
  Composable retrieval patterns for hybrid search, parent documents, query expansion, and reranking.
</code></pre>
<p>That win happens because the improved mode looks beyond the top-level entry description. It also sees the reference file title <code>rrf</code>, the related terms from query expansion, and the broader token overlap in the extracted signals.</p>
<h3 id="heading-how-to-reproduce-a-workflow-intent-win">How to Reproduce a Workflow-intent Win</h3>
<p>Try a sign-in query.</p>
<pre><code class="language-bash">node bin/chub-lab.mjs search signin --mode baseline
node bin/chub-lab.mjs search signin --mode improved
</code></pre>
<p>The baseline misses. The improved mode returns <code>playwright-community/login-flows</code> because the reranker treats <code>signin</code>, <code>sign in</code>, <code>login</code>, and <code>authentication</code> as related intent.</p>
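<p>A minimal version of that intent grouping is a hand-curated expansion table applied before scoring. The entries below are an illustrative sample, not the companion repo's actual list.</p>
<pre><code class="language-js">// Sketch: expand shorthand queries into related intent tokens before
// scoring. The table is a small illustrative sample.
const RELATED = new Map([
  ["signin", ["sign in", "login", "authentication"]],
  ["rrf", ["reciprocal rank fusion", "reranking"]],
]);

function expandQuery(query) {
  const tokens = query.toLowerCase().split(/\s+/);
  const expanded = new Set(tokens);
  for (const token of tokens) {
    for (const alias of RELATED.get(token) ?? []) expanded.add(alias);
  }
  return [...expanded];
}

console.log(expandQuery("signin"));
// [ "signin", "sign in", "login", "authentication" ]
</code></pre>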
<h3 id="heading-how-to-test-the-memory-loop">How to Test the Memory Loop</h3>
<p>Write a local note:</p>
<pre><code class="language-bash">node bin/chub-lab.mjs annotate stripe/webhooks \
  "Remember: Flask request.data must stay raw for Stripe signature verification."
</code></pre>
<p>Then fetch the doc:</p>
<pre><code class="language-bash">node bin/chub-lab.mjs get stripe/webhooks --lang python
</code></pre>
<p>You will see the main doc content, the list of available reference files, and the appended annotation.</p>
<p>That's the behavior you want from an agent memory loop: learn once, reuse many times.</p>
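<p>Under the hood, a memory loop like this can be as simple as an append-only local file that gets replayed on fetch. The path and record shape below are assumptions for illustration, not the lab's actual storage format.</p>
<pre><code class="language-js">// Sketch: append-only annotation store keyed by entry id. The file path
// and record shape are illustrative assumptions.
import { appendFile, readFile } from "node:fs/promises";

const STORE = "/tmp/chub-lab-annotations.jsonl"; // hypothetical location

async function annotate(entryId, note) {
  const record = { entryId, note, at: new Date().toISOString() };
  await appendFile(STORE, JSON.stringify(record) + "\n");
}

async function annotationsFor(entryId) {
  const text = await readFile(STORE, "utf8").catch(() => "");
  return text
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line))
    .filter((record) => record.entryId === entryId); // appended after the doc body
}
</code></pre>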
<h3 id="heading-how-to-run-the-benchmark">How to Run the Benchmark</h3>
<p>Start from an empty store:</p>
<pre><code class="language-bash">npm run reset-store
node bin/chub-lab.mjs evaluate
</code></pre>
<p>The included synthetic stress set reports the following summary with an empty store:</p>
<table>
<thead>
<tr>
<th>Mode</th>
<th>Top-1 Accuracy</th>
<th>MRR</th>
</tr>
</thead>
<tbody><tr>
<td>baseline</td>
<td>0.333</td>
<td>0.333</td>
</tr>
<tr>
<td>improved</td>
<td>1.000</td>
<td>1.000</td>
</tr>
</tbody></table>
<p>You can also seed the store and rerun the evaluation:</p>
<pre><code class="language-bash">npm run seed-demo
node bin/chub-lab.mjs evaluate
</code></pre>
<p>That demonstrates how annotations and feedback can push relevant entries even higher when the query overlaps with the agent’s own history.</p>
<h3 id="heading-how-to-launch-the-local-comparison-ui">How to Launch the Local Comparison UI</h3>
<pre><code class="language-bash">npm run serve
</code></pre>
<p>Then open <code>http://localhost:8787</code> in your browser.</p>
<p>The UI lets you compare baseline and improved retrieval, inspect stored annotations and feedback, rebuild the local artifacts, and rerun the benchmark from one place.</p>
<h2 id="heading-how-to-read-the-benchmark-honestly">How to Read the Benchmark Honestly</h2>
<p>The benchmark in this repo is intentionally small.</p>
<p>That is a feature, not a flaw.</p>
<p>The point is not to claim universal search quality. The point is to make a handful of realistic failure modes easy to reproduce:</p>
<ul>
<li><p>acronym queries</p>
</li>
<li><p>shorthand workflow queries</p>
</li>
<li><p>reference-file topic queries</p>
</li>
<li><p>memory-aware reranking</p>
</li>
</ul>
<p>That keeps the evaluation honest.</p>
<p>If a future scoring change breaks <code>rrf</code>, <code>signin</code>, or <code>raw body stripe</code>, you'll know immediately. And if you add a stronger dataset later, you can keep these tests as regression guards.</p>
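<p>Those regression guards can be ordinary tests. Here is a sketch using <code>node:test</code> that pins the <code>rrf</code> behavior demonstrated earlier; the command and the expected entry come straight from the outputs above.</p>
<pre><code class="language-js">// Sketch: a regression guard for the `rrf` query shown earlier.
import test from "node:test";
import assert from "node:assert/strict";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

test("improved mode surfaces langchain/retrievers for rrf", async () => {
  const { stdout } = await run("node", [
    "bin/chub-lab.mjs", "search", "rrf", "--mode", "improved", "--lang", "python",
  ]);
  assert.match(stdout, /langchain\/retrievers/);
});
</code></pre>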
<p>The benchmark files included in the repo are:</p>
<ul>
<li><p><code>demo/benchmark.json</code></p>
</li>
<li><p><code>docs/benchmark-empty-store.json</code></p>
</li>
<li><p><code>docs/benchmark-seeded-store.json</code></p>
</li>
<li><p><code>docs/relevance-improvement-plan.md</code></p>
</li>
</ul>
<h2 id="heading-how-to-connect-the-companion-repo-to-the-upstream-pr">How to Connect the Companion Repo to the Upstream PR</h2>
<p>A good companion repo is broad enough to explore ideas quickly. A good upstream PR is narrow enough to review.</p>
<p>That's why the two shouldn't be identical.</p>
<p>The companion repository is where you can keep the full relevance story together:</p>
<ul>
<li><p>the local comparison UI</p>
</li>
<li><p>the synthetic benchmark</p>
</li>
<li><p>the richer reranking signals</p>
</li>
<li><p>the debug and explain surfaces</p>
</li>
<li><p>the documentation that walks through tradeoffs end to end</p>
</li>
</ul>
<p>The upstream PR should be smaller and more surgical. In practice, that usually means proposing the most reviewable slices first, such as:</p>
<ol>
<li><p>reference-file signal extraction</p>
</li>
<li><p>explainable score output for debugging</p>
</li>
<li><p>a lightweight benchmark fixture format</p>
</li>
<li><p>one additive reranking hook behind a flag</p>
</li>
</ol>
<p>That keeps the main repository maintainable while still letting the article and companion repo tell the full engineering story. The upstream thread for this work lives here: <a href="https://github.com/andrewyng/context-hub/pulls?q=is%3Apr+author%3Anatarajsundar">andrewyng/context-hub pull requests by <code>natarajsundar</code></a>.</p>
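<p>As an example of the third slice above, a lightweight fixture format only needs a query, an optional language filter, and an expected winner. The shape below is a suggestion, not the upstream or companion repo's exact schema.</p>
<pre><code class="language-js">// Sketch: a minimal benchmark fixture shape. The queries and expected
// entries come from this article; the schema itself is a suggestion.
const fixtures = [
  { query: "rrf", lang: "python", expectTop: "langchain/retrievers" },
  { query: "signin", expectTop: "playwright-community/login-flows" },
  { query: "raw body stripe", lang: "python", expectTop: "stripe/webhooks" },
];

export default fixtures;
</code></pre>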
<h2 id="heading-conclusion">Conclusion</h2>
<p>What makes Context Hub interesting is not just that it stores documentation. It gives you a clear system boundary for improving coding agents.</p>
<p>You can inspect what the agent reads. You can decide when it should retrieve. You can layer public and private sources. You can persist local lessons. And you can improve ranking without tearing the whole model apart.</p>
<p>The companion relevance engine shows how to keep what already works, make one part of the system measurably better, and package the result in a way other developers can run, inspect, and extend. The upstream PR, in turn, shows how to turn a broad idea into smaller pieces that are realistic to review in the main project.</p>
<h2 id="heading-diagram-attribution">Diagram Attribution</h2>
<p>All diagrams used in this article were created by the author specifically for this tutorial and its companion repository.</p>
<h2 id="heading-sources">Sources</h2>
<ul>
<li><p><a href="https://github.com/andrewyng/context-hub">Context Hub repository</a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/blob/main/README.md">Context Hub README</a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/blob/main/cli/README.md">Context Hub CLI README</a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/blob/main/docs/cli-reference.md">Context Hub CLI reference</a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/blob/main/docs/content-guide.md">Context Hub content guide</a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/blob/main/docs/byod-guide.md">Context Hub bring-your-own-docs guide</a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/blob/main/docs/feedback-and-annotations.md">Context Hub feedback and annotations guide</a></p>
</li>
<li><p><a href="https://github.com/natarajsundar/context-hub-relevance-engine/">Companion repository: <code>context-hub-relevance-engine</code></a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/pulls?q=is%3Apr+author%3Anatarajsundar">Upstream pull request search filtered by author</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Set Up OpenClaw and Design an A2A Plugin Bridge ]]>
                </title>
                <description>
                    <![CDATA[ OpenClaw is getting attention because it turns a popular AI idea into something you can actually run yourself. Instead of opening one more browser tab, you run a Gateway on your own machine or server  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/openclaw-a2a-plugin-architecture-guide/</link>
                <guid isPermaLink="false">69d542ca5da14bc70e7c1559</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Node.js ]]>
                    </category>
                
                    <category>
                        <![CDATA[ software architecture ]]>
                    </category>
                
                    <category>
                        <![CDATA[ APIs ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nataraj Sundar ]]>
                </dc:creator>
                <pubDate>Tue, 07 Apr 2026 17:45:46 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4be03b02-d128-49e9-afcb-fea0f771e746.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>OpenClaw is getting attention because it turns a popular AI idea into something you can actually run yourself. Instead of opening one more browser tab, you run a Gateway on your own machine or server and connect it to communication tools you already use.</p>
<p>That matters because OpenClaw is self-hosted, multi-channel, open source, and built around agent workflows such as sessions, tools, plugins, and multi-agent routing. It feels less like a toy chatbot and more like an operator-controlled agent runtime.</p>
<p>In this guide, you'll do three things. First, you'll learn what OpenClaw is and why developers are paying attention to it. Second, you'll get it running the beginner-friendly way through the dashboard. Third, you'll walk through an original design contribution: a proposed OpenClaw-to-A2A plugin architecture and a <a href="https://github.com/natarajsundar/openclaw-a2a-secure-agent-runtime">proof-of-concept</a> relay that shows how OpenClaw’s session model could map to the A2A protocol.</p>
<p>That last part is important, so I want to frame it carefully. The A2A integration in this article is <strong>not</strong> presented as a built-in OpenClaw feature. It's a documented architecture proposal built on top of the extension points OpenClaw already exposes.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>This guide is beginner-friendly for OpenClaw itself, but it assumes a few basics so you can follow the architecture and proof-of-concept sections comfortably.</p>
<p>Before you continue, you should be familiar with:</p>
<ul>
<li><p>Basic JavaScript or Node.js (reading and running scripts)</p>
</li>
<li><p>How HTTP APIs work (requests, responses, JSON payloads)</p>
</li>
<li><p>Using a terminal to run commands</p>
</li>
<li><p>High-level concepts like services, APIs, or microservices</p>
</li>
</ul>
<p>You don't need prior experience with OpenClaw or A2A. The setup steps walk through everything you need to get started.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-what-openclaw-is">What OpenClaw Is</a></p>
</li>
<li><p><a href="#heading-why-openclaw-is-getting-so-much-attention">Why OpenClaw Is Getting So Much Attention</a></p>
</li>
<li><p><a href="#heading-what-the-a2a-protocol-is">What the A2A Protocol Is</a></p>
</li>
<li><p><a href="#heading-how-openclaw-and-a2a-relate">How OpenClaw and A2A Relate</a></p>
</li>
<li><p><a href="#heading-what-you-need-before-you-start">What You Need Before You Start</a></p>
</li>
<li><p><a href="#heading-step-1-install-openclaw">Install OpenClaw</a></p>
</li>
<li><p><a href="#heading-step-2-run-the-onboarding-wizard">Run the Onboarding Wizard</a></p>
</li>
<li><p><a href="#heading-step-3-check-the-gateway-and-open-the-dashboard">Check the Gateway and Open the Dashboard</a></p>
</li>
<li><p><a href="#heading-step-4-use-openclaw-as-a-private-coding-assistant">Use OpenClaw as a Private Coding Assistant</a></p>
</li>
<li><p><a href="#heading-step-5-understand-multi-agent-routing">Understand Multi Agent Routing</a></p>
</li>
<li><p><a href="#heading-where-a2a-could-fit-later">Where A2A Could Fit Later</a></p>
</li>
<li><p><a href="#heading-a-proposed-openclaw-to-a2a-plugin-architecture">A Proposed OpenClaw to A2A Plugin Architecture</a></p>
</li>
<li><p><a href="#heading-build-the-proof-of-concept-relay">Build the Proof of Concept Relay</a></p>
</li>
<li><p><a href="#heading-how-the-proof-of-concept-maps-to-a-real-openclaw-plugin">How the Proof of Concept Maps to a Real OpenClaw Plugin</a></p>
</li>
<li><p><a href="#heading-security-notes-before-you-go-further">Security Notes Before You Go Further</a></p>
</li>
<li><p><a href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ol>
<h2 id="heading-what-openclaw-is">What OpenClaw Is</h2>
<p>According to the <a href="https://docs.openclaw.ai/">official docs</a>, OpenClaw is a self-hosted gateway that connects chat apps like WhatsApp, Telegram, Discord, iMessage, and a browser dashboard to AI agents.</p>
<p>That wording is useful because it tells you where OpenClaw sits in the stack. It's not just a model wrapper. It's a Gateway that handles sessions, routing, and app connections, while agents, tools, plugins, and providers do the actual work.</p>
<p>Here is the simplest mental model:</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/ad5f3295-8fdf-4f9c-8488-f69808850295.png" alt="Diagram showing OpenClaw architecture where multiple chat apps and a browser dashboard connect to a central Gateway, which routes requests to different agents that use model providers and tools." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>If you're new to the project, this is the practical way to think about it:</p>
<ul>
<li><p>your chat apps are the front door</p>
</li>
<li><p>the Gateway is the traffic and control layer</p>
</li>
<li><p>the agent is the reasoning layer</p>
</li>
<li><p>the model provider and tools are what let the agent actually do work</p>
</li>
</ul>
<p>That's one reason OpenClaw feels different from a normal browser-only assistant.</p>
<h2 id="heading-why-developers-are-paying-attention-to-openclaw">Why Developers Are Paying Attention to OpenClaw</h2>
<p>OpenClaw is getting a lot of attention for a few reasons.</p>
<p>The first reason is control. The docs position OpenClaw as self-hosted and multi-channel, which means you can run it on your own machine or server instead of depending on a fully hosted assistant.</p>
<p>The second reason is that OpenClaw already looks like an agent platform. The docs talk about sessions, plugins, tools, skills, multi-agent routing, and ACP-backed external coding harnesses. That's a much richer story than “ask a model a question in a web page.”</p>
<p>The third reason is workflow fit. A lot of people don't want another inbox. They want an assistant that can live in the tools they already check every day.</p>
<p>There's also a broader industry trend behind the hype. Developers are actively looking for ways to connect multiple agents and multiple tools without giving up visibility into what's happening. OpenClaw sits directly in that conversation.</p>
<h2 id="heading-what-the-a2a-protocol-is">What the A2A Protocol Is</h2>
<p>A2A, short for Agent2Agent, is an open protocol for communication between agent systems. The <a href="https://a2a-protocol.org/latest/specification/">A2A specification</a> says its purpose is to help independent agent systems discover each other, negotiate interaction modes, manage collaborative tasks, and exchange information without exposing internal memory, tools, or proprietary logic.</p>
<p>That last point matters. A2A is about interoperability between agent systems, not about exposing all of one agent's internals to another.</p>
<p>A2A introduces a few core concepts that are worth learning early:</p>
<ul>
<li><p><strong>Agent Card</strong>: a JSON description of the remote agent, its URL, skills, capabilities, and auth requirements</p>
</li>
<li><p><strong>Task</strong>: the main unit of remote work</p>
</li>
<li><p><strong>Artifact</strong>: the output of a task</p>
</li>
<li><p><strong>Context ID</strong>: a stable interaction boundary across multiple related turns</p>
</li>
</ul>
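<p>To make the Agent Card concrete, here is a trimmed illustrative example. The general shape follows the concepts above, but treat the exact field names as an approximation and check the A2A specification for the authoritative schema.</p>
<pre><code class="language-js">// Trimmed, illustrative Agent Card. Field names approximate the spec's
// shape; consult the A2A specification for the authoritative schema.
const agentCard = {
  name: "code-review-specialist",
  description: "Reviews diffs and returns annotated findings.",
  url: "https://agents.example.com/a2a",
  version: "1.0.0",
  capabilities: { streaming: false, pushNotifications: false },
  skills: [
    {
      id: "review-diff",
      name: "Review a diff",
      description: "Accepts a unified diff and returns findings.",
      tags: ["code-review"],
    },
  ],
};
</code></pre>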
<p>A2A tasks follow a fairly clean lifecycle:</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/3b5a43e8-dabd-45e3-bff1-0081e2b37e0d.png" alt="State diagram illustrating the A2A task lifecycle including submitted, working, input required, completed, failed, rejected, and canceled states.." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The A2A docs also explain that A2A and MCP are complementary, not competing. A2A is for agent-to-agent collaboration. MCP is for agent-to-tool communication.</p>
<p>That distinction is useful when you compare A2A with OpenClaw, because OpenClaw already has strong local tool and session concepts.</p>
<h2 id="heading-how-openclaw-and-a2a-relate">How OpenClaw and A2A Relate</h2>
<p>OpenClaw and A2A are not the same thing, but they line up in interesting ways.</p>
<p>OpenClaw already documents several features that point in a multi-agent direction:</p>
<ul>
<li><p><a href="https://docs.openclaw.ai/concepts/multi-agent/">multi-agent routing</a> for multiple isolated agents in one running Gateway</p>
</li>
<li><p><a href="https://docs.openclaw.ai/concepts/session-tool/">session tools</a> such as <code>sessions_send</code> and <code>sessions_spawn</code></p>
</li>
<li><p>a <a href="https://docs.openclaw.ai/tools/plugin/">plugin system</a> that can register tools, HTTP routes, Gateway RPC methods, and background services</p>
</li>
<li><p><a href="https://docs.openclaw.ai/tools/acp-agents/">ACP support</a> and the <a href="https://docs.openclaw.ai/cli/acp"><code>openclaw acp</code> bridge</a> for external coding clients</p>
</li>
</ul>
<p>But it's still important to stay precise here.</p>
<p>OpenClaw documents ACP, plugins, and local multi-agent coordination today. The docs I checked do <strong>not</strong> describe native A2A support as a first-class built-in capability.</p>
<p>That means the honest claim is this:</p>
<p><strong>OpenClaw can be meaningfully connected to A2A in theory because the architectural pieces line up, but the A2A bridge still has to be built.</strong></p>
<h3 id="heading-acp-versus-a2a">ACP versus A2A</h3>
<p>ACP and A2A solve different problems.</p>
<p>ACP in OpenClaw today is about bridging an IDE or coding client to a Gateway-backed session.</p>
<p>A2A is about one agent system talking to another agent system across a protocol boundary.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/9790f239-528c-422f-bbc5-3e82c7f1a171.png" alt="Diagram showing A2A interaction where an OpenClaw agent communicates through a plugin to discover a remote agent via an Agent Card and send tasks for execution." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/c4d4279b-3099-4c1b-92b6-3eaf817a6e84.png" alt="Diagram showing ACP flow where an IDE or coding client connects through an OpenClaw ACP bridge to a Gateway-backed session." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>That difference is one reason I prefer the phrase <strong>plugin bridge</strong> here instead of <strong>native A2A support</strong>.</p>
<h2 id="heading-what-you-need-before-you-start">What You Need Before You Start</h2>
<p>The easiest first run does <strong>not</strong> require WhatsApp, Telegram, or Discord.</p>
<p>The OpenClaw onboarding docs say the fastest first chat is the dashboard. That makes this a much more approachable beginner setup.</p>
<p>Before you start, you'll need:</p>
<ol>
<li><p>Node 24 if possible, or Node 22.16+ for compatibility</p>
</li>
<li><p>an API key for the model provider you want to use</p>
</li>
<li><p>If you're on Windows, WSL2 is the recommended path for the full experience. Native Windows works for core CLI and Gateway flows, but the docs call out caveats and position WSL2 as the more stable setup.</p>
</li>
<li><p>about five minutes for the first dashboard-based run</p>
</li>
</ol>
<h2 id="heading-step-1-install-openclaw">Step 1: Install OpenClaw</h2>
<p>The official getting-started page recommends the installer script.</p>
<p>On macOS, Linux, or WSL2, run:</p>
<pre><code class="language-bash">curl -fsSL https://openclaw.ai/install.sh | bash
</code></pre>
<p>On Windows PowerShell, the docs show this:</p>
<pre><code class="language-powershell">iwr -useb https://openclaw.ai/install.ps1 | iex
</code></pre>
<p>If you're on Windows, the platform docs recommend installing WSL2 first:</p>
<pre><code class="language-powershell">wsl --install
</code></pre>
<p>Then open Ubuntu and continue with the Linux commands there.</p>
<h2 id="heading-step-2-run-the-onboarding-wizard">Step 2: Run the Onboarding Wizard</h2>
<p>Once the CLI is installed, run the onboarding wizard.</p>
<pre><code class="language-bash">openclaw onboard --install-daemon
</code></pre>
<p>The onboarding wizard is the recommended path in the docs. It configures auth, gateway settings, optional channels, skills, and workspace defaults in one guided flow.</p>
<p>The most beginner-friendly choice is to keep the first run simple. Don't worry about chat apps yet. Get the local Gateway working first.</p>
<h2 id="heading-step-3-check-the-gateway-and-open-the-dashboard">Step 3: Check the Gateway and Open the Dashboard</h2>
<p>After onboarding, verify that the Gateway is running.</p>
<pre><code class="language-bash">openclaw gateway status
</code></pre>
<p>Then open the dashboard:</p>
<pre><code class="language-bash">openclaw dashboard
</code></pre>
<p>The docs call this the fastest first chat because it avoids channel setup. It's also the safest way to start, because the dashboard is local and the OpenClaw docs clearly say the Control UI is an admin surface and should not be exposed publicly.</p>
<p>The beginner setup flow looks like this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/eab78250-65d6-4d97-be3d-bf7167b9099e.png" alt="Sequence diagram showing OpenClaw setup flow from installation and onboarding to starting the Gateway and opening the dashboard for the first chat." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>If you can chat in the dashboard, your day-zero setup is working.</p>
<h2 id="heading-step-4-use-openclaw-as-a-private-coding-assistant">Step 4: Use OpenClaw as a Private Coding Assistant</h2>
<p>The best first use case is not to drop OpenClaw into a public group chat.</p>
<p>Use it as a private coding assistant in the dashboard.</p>
<p>For example, try a prompt like this:</p>
<blockquote>
<p>I am building a small Node.js utility that reads Markdown files and generates a table of contents. Turn this idea into a project plan, a README outline, and the first five implementation tasks.</p>
</blockquote>
<p>That kind of prompt is ideal for a first run because it gives you something concrete back right away.</p>
<p>You can also use it to:</p>
<ol>
<li><p>turn rough notes into a plan,</p>
</li>
<li><p>summarize a bug report into action items,</p>
</li>
<li><p>draft a README,</p>
</li>
<li><p>propose a folder structure, or</p>
</li>
<li><p>write a safe first implementation checklist.</p>
</li>
</ol>
<p>That is already enough to make OpenClaw useful before you touch any advanced protocol work.</p>
<h2 id="heading-step-5-understand-multi-agent-routing">Step 5: Understand Multi Agent Routing</h2>
<p>Once the basic setup is working, it helps to understand OpenClaw’s local multi-agent model.</p>
<p>The docs describe multi-agent routing as a way to run multiple isolated agents in one Gateway, with separate workspaces, state directories, and sessions.</p>
<p>That means you can imagine setups like this:</p>
<ul>
<li><p>a personal assistant</p>
</li>
<li><p>a coding assistant</p>
</li>
<li><p>a research assistant</p>
</li>
<li><p>an alerts assistant</p>
</li>
</ul>
<p>OpenClaw already has a model for that:</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/c640a7c4-0421-4513-a2c2-658916504e3b.png" alt="Diagram illustrating OpenClaw multi-agent routing where incoming messages are matched to different agents such as main, coding, and alerts, each with separate sessions." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>You don't need to set this up on day one.</p>
<p>But it matters for the A2A discussion, because once you understand how OpenClaw routes work between local agents, it becomes much easier to think about routing work to <strong>remote</strong> agents through a protocol like A2A.</p>
<h2 id="heading-where-a2a-could-fit-later">Where A2A Could Fit Later</h2>
<p>A2A could fit into OpenClaw in two broad ways.</p>
<h3 id="heading-option-1-openclaw-as-an-a2a-client">Option 1: OpenClaw as an A2A Client</h3>
<p>In this model, OpenClaw stays your personal edge assistant.</p>
<p>It receives a request from the dashboard or a chat app, decides the task needs a specialist, discovers a remote A2A agent through an Agent Card, sends the task, waits for updates or artifacts, and translates the result back into a normal OpenClaw reply.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/99a2e611-54ac-4c0f-8f8f-c1ce3246bb96.png" alt="Diagram showing OpenClaw acting as an A2A client, delegating tasks from a local session to a remote agent via an Agent Card and returning results to the user." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>This is the cleaner story for a personal assistant. OpenClaw stays the front door, and A2A becomes a delegation path behind the scenes.</p>
<h3 id="heading-option-2-openclaw-as-an-a2a-server">Option 2: OpenClaw as an A2A Server</h3>
<p>In this model, OpenClaw exposes some of its own capabilities to other agents.</p>
<p>A plugin could theoretically publish an A2A Agent Card, advertise a narrow skill set, accept A2A tasks, and map those tasks into OpenClaw sessions or sub-agent runs.</p>
<p>That's technically plausible because the plugin system can register HTTP routes, tools, Gateway methods, and background services.</p>
<p>It's also the riskier direction for a personal assistant, which is why I think <strong>client-first</strong> is the right starting point.</p>
<h2 id="heading-a-proposed-openclaw-to-a2a-plugin-architecture">A Proposed OpenClaw to A2A Plugin Architecture</h2>
<p>This section is my original design contribution in this article.</p>
<p>I think the cleanest first architecture is <strong>not</strong> a full bidirectional bridge. It's a narrow outbound delegation plugin that lets OpenClaw call a small allowlist of remote A2A agents.</p>
<p>The design goal is simple:</p>
<p><strong>Reuse OpenClaw for user-facing conversations and local tool access, but use A2A only when a remote specialist agent is the best place to do the work.</strong></p>
<p>Here is the architecture I would start with:</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/e88f06dd-f108-48b2-a9ee-b74eac6b733b.png" alt="Architecture diagram of an OpenClaw-to-A2A plugin showing components such as delegation tool, policy engine, Agent Card cache, session-to-task mapper, task poller, and remote A2A agent." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-why-this-design-is-a-good-fit-for-openclaw">Why This Design is a Good Fit for OpenClaw</h3>
<p>This proposal is grounded in extension points OpenClaw already documents.</p>
<p>A plugin can register:</p>
<ul>
<li><p>an <strong>agent tool</strong> for delegation,</p>
</li>
<li><p>a <strong>Gateway method</strong> for health and diagnostics,</p>
</li>
<li><p>an <strong>HTTP route</strong> for future callbacks or webhook verification, and</p>
</li>
<li><p>a <strong>background service</strong> for cache warming, task subscriptions, or cleanup.</p>
</li>
</ul>
<p>That means the bridge doesn't have to modify OpenClaw core to be credible.</p>
<h3 id="heading-the-mapping-table">The Mapping Table</h3>
<p>The most important design decision is how to map OpenClaw’s session model to A2A’s task model.</p>
<p>Here is the mapping I recommend:</p>
<table>
<thead>
<tr>
<th>OpenClaw concept</th>
<th>A2A concept</th>
<th>Why this mapping works</th>
</tr>
</thead>
<tbody><tr>
<td><code>sessionKey</code></td>
<td><code>contextId</code></td>
<td>A single OpenClaw conversation should keep a stable remote context across related delegated turns</td>
</tr>
<tr>
<td>one delegated remote call</td>
<td>one <code>Task</code></td>
<td>each remote specialization request becomes a discrete unit of work</td>
</tr>
<tr>
<td>plugin tool call</td>
<td><code>SendMessage</code></td>
<td>the delegation tool is the natural point where the local agent crosses the protocol boundary</td>
</tr>
<tr>
<td>remote output</td>
<td><code>Artifact</code></td>
<td>A2A wants task outputs returned as artifacts rather than chat-only replies</td>
</tr>
<tr>
<td>plugin HTTP route</td>
<td>callback or future push handler</td>
<td>gives you a place to verify webhooks if you later adopt async push</td>
</tr>
<tr>
<td>Gateway method</td>
<td>status endpoint</td>
<td>gives operators a direct way to inspect relay health without going through the model</td>
</tr>
<tr>
<td>background service</td>
<td>polling or cache work</td>
<td>keeps asynchronous and maintenance work out of the tool call path</td>
</tr>
</tbody></table>
<p>This is the key architectural claim in the article:</p>
<p><strong>Treat the OpenClaw session as the long-lived conversational boundary, and treat each remote A2A task as one delegated execution inside that boundary.</strong></p>
<p>That preserves both sides cleanly.</p>
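<p>In code, that claim reduces to a tiny lookup from local session to remote context. The sketch below uses the <code>latestForSession</code> name that appears in the relay excerpt later in this article; the in-memory storage is illustrative.</p>
<pre><code class="language-js">// Sketch: map an OpenClaw sessionKey to its most recent remote A2A
// context. The method name matches the relay excerpt below; the
// in-memory storage here is illustrative.
const records = new Map(); // "sessionKey::remoteBaseUrl" -> record

export const sessionTaskMap = {
  async record(sessionKey, remoteBaseUrl, contextId, taskId) {
    records.set(`${sessionKey}::${remoteBaseUrl}`, {
      contextId,
      taskId,
      at: Date.now(),
    });
  },
  async latestForSession(sessionKey, remoteBaseUrl) {
    return records.get(`${sessionKey}::${remoteBaseUrl}`) ?? null;
  },
};
</code></pre>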
<h3 id="heading-the-design-in-one-sentence">The Design in One Sentence</h3>
<p>The <code>a2a_delegate</code> tool should:</p>
<ol>
<li><p>resolve an allowlisted remote Agent Card,</p>
</li>
<li><p>reuse an existing A2A <code>contextId</code> for the current <code>sessionKey</code> when possible,</p>
</li>
<li><p>create a fresh remote <code>Task</code> for the new delegated turn,</p>
</li>
<li><p>normalize remote artifacts back into a simple local answer, and</p>
</li>
<li><p>never expose the whole OpenClaw Gateway directly to the public internet.</p>
</li>
</ol>
<p>I like this design because it is incremental, testable, and consistent with OpenClaw’s personal-assistant trust model.</p>
<h2 id="heading-build-the-proof-of-concept-relay">Build the Proof of Concept Relay</h2>
<p>To make the architecture concrete, I built a small proof-of-concept relay.</p>
<p><a href="https://github.com/natarajsundar/openclaw-a2a-secure-agent-runtime">https://github.com/natarajsundar/openclaw-a2a-secure-agent-runtime</a></p>
<p>It's intentionally small. It doesn't try to become a full production plugin. Instead, it proves the hardest conceptual part of the bridge: how to map one OpenClaw session to a reusable A2A context while creating a fresh A2A task per delegated turn.</p>
<p>Here's the repository layout:</p>
<pre><code class="language-plaintext">openclaw-a2a-secure-agent-runtime/
├── README.md
├── package.json
├── examples/
│   └── openclaw-plugin-entry.example.ts
├── src/
│   ├── a2a-client.mjs
│   ├── agent-card-cache.mjs
│   ├── demo.mjs
│   ├── mock-remote-agent.mjs
│   ├── openclaw-a2a-relay.mjs
│   ├── session-task-map.mjs
│   └── utils.mjs
└── test/
    └── relay.test.mjs
</code></pre>
<p>The PoC does six things:</p>
<ol>
<li><p>fetches a remote Agent Card from <code>/.well-known/agent-card.json</code>,</p>
</li>
<li><p>caches it with simple <code>ETag</code> revalidation,</p>
</li>
<li><p>records local <code>sessionKey</code> to remote <code>contextId</code> mappings,</p>
</li>
<li><p>sends an A2A <code>SendMessage</code> request,</p>
</li>
<li><p>polls <code>GetTask</code> until the task finishes, and</p>
</li>
<li><p>converts the remote artifact into a local text answer.</p>
</li>
</ol>
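<p>The first two steps are plain HTTP. Here is a sketch of the card fetch with <code>ETag</code> revalidation; the well-known path comes from the list above, while the cache shape is illustrative.</p>
<pre><code class="language-js">// Sketch: fetch a remote Agent Card with simple ETag revalidation.
// The well-known path comes from the PoC description; the cache
// structure is an illustrative assumption.
const cache = new Map(); // baseUrl -> { etag, card }

async function getAgentCard(baseUrl) {
  const cached = cache.get(baseUrl);
  const headers = cached?.etag ? { "If-None-Match": cached.etag } : {};
  const res = await fetch(new URL("/.well-known/agent-card.json", baseUrl), { headers });
  if (res.status === 304) return cached.card; // only possible when cached
  if (!res.ok) throw new Error("agent card fetch failed: " + res.status);
  const card = await res.json();
  cache.set(baseUrl, { etag: res.headers.get("etag"), card });
  return card;
}
</code></pre>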
<h3 id="heading-run-the-demo">Run the Demo</h3>
<p>The repo uses only built-in Node.js modules.</p>
<pre><code class="language-shell">cd openclaw-a2a-secure-agent-runtime
npm run demo
</code></pre>
<p>The demo spins up a mock remote A2A server, delegates one task, delegates a second task from the <strong>same</strong> local session, and shows that the same remote <code>contextId</code> is reused.</p>
<h3 id="heading-the-core-relay-idea">The Core Relay Idea</h3>
<p>This is the important logic in plain English:</p>
<ol>
<li><p>look up the most recent remote mapping for the current OpenClaw <code>sessionKey</code></p>
</li>
<li><p>reuse the old <code>contextId</code> if one exists</p>
</li>
<li><p>create a fresh A2A <code>Task</code> for the new request</p>
</li>
<li><p>poll until that task becomes <code>TASK_STATE_COMPLETED</code></p>
</li>
<li><p>turn the returned artifact into a normal text result that OpenClaw can send back to the user</p>
</li>
</ol>
<p>That makes the bridge predictable.</p>
<p>Here's a shortened version of the relay logic:</p>
<pre><code class="language-js">const previous = await sessionTaskMap.latestForSession(sessionKey, remoteBaseUrl);
const contextId = previous?.contextId ?? crypto.randomUUID();

const sendResult = await client.sendMessage({
  text,
  contextId,
  metadata: {
    openclawSessionKey: sessionKey,
    requestedSkillId: skillId,
  },
});

let task = sendResult.task;
while (!isTerminalTaskState(task.status?.state)) {
  await sleep(pollIntervalMs);
  task = await client.getTask(task.id);
}

return {
  contextId,
  taskId: task.id,
  answer: taskArtifactsToText(task),
};
</code></pre>
<p>That's the heart of the design.</p>
<h3 id="heading-why-this-repo-is-a-useful-proof-of-concept">Why This Repo is a Useful Proof of Concept</h3>
<p>A lot of “integration” articles stay too abstract. This repo avoids that problem in three ways.</p>
<p>First, it makes the session-to-context mapping explicit.</p>
<p>Second, it includes a mock remote A2A agent so you can test the flow without needing a large external setup.</p>
<p>Third, it includes a test that checks the most important invariant: repeated delegations from one local OpenClaw session reuse the same A2A context.</p>
<p>That is the piece I most wanted to make concrete, because it is where architecture turns into implementation.</p>
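<p>That invariant is easy to state as a test. The sketch below assumes a <code>delegate(...)</code> entry point that returns the shape shown in the relay excerpt above; the exact export name in the repo may differ.</p>
<pre><code class="language-js">// Sketch: the key invariant. Two delegations from one OpenClaw session
// must reuse one A2A contextId. The `delegate` export name is assumed.
import test from "node:test";
import assert from "node:assert/strict";
import { delegate } from "../src/openclaw-a2a-relay.mjs"; // export name assumed

test("same session reuses the remote contextId", async () => {
  const first = await delegate({ sessionKey: "s-1", text: "task one" });
  const second = await delegate({ sessionKey: "s-1", text: "task two" });
  assert.equal(second.contextId, first.contextId);
});
</code></pre>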
<h2 id="heading-how-the-proof-of-concept-maps-to-a-real-openclaw-plugin">How the Proof of Concept Maps to a Real OpenClaw Plugin</h2>
<p>The proof of concept is the relay core.</p>
<p>A real OpenClaw plugin would wrap that relay with four extension surfaces that the OpenClaw docs already describe.</p>
<h3 id="heading-1-a-delegation-tool">1: A Delegation Tool</h3>
<p>This is the main entry point.</p>
<p>A plugin would register an optional tool like <code>a2a_delegate</code> so the local agent can explicitly choose to delegate work.</p>
<p>That tool should be optional, not always-on, because remote delegation is a side effect and should be easy to disable.</p>
<h3 id="heading-2-a-gateway-method-for-diagnostics">2: A Gateway Method for Diagnostics</h3>
<p>A method like <code>a2a.status</code> would let you inspect whether the relay is healthy, which remote cards are cached, and whether any tasks are still being tracked.</p>
<p>That is much better than asking the model to “tell me if the bridge is healthy.”</p>
<h3 id="heading-3-a-plugin-http-route">3: A Plugin HTTP Route</h3>
<p>You may not need this on day one.</p>
<p>But once you move beyond polling and want push-style callbacks or webhook verification, a plugin route gives you the right boundary for that work.</p>
<h3 id="heading-4-a-background-service">4: A Background Service</h3>
<p>A small service is a clean place to do cache warming, cleanup, or later subscription handling.</p>
<p>That keeps the tool path focused on delegation instead of maintenance work.</p>
<p>If I were turning this into a real plugin package, I would sequence the work in this order:</p>
<ol>
<li><p>wrap the relay in <code>registerTool</code>,</p>
</li>
<li><p>add a small config schema with an allowlist of remote agents,</p>
</li>
<li><p>add <code>a2a.status</code>,</p>
</li>
<li><p>keep polling as the first async model,</p>
</li>
<li><p>add a callback route only if a real use case needs it.</p>
</li>
</ol>
<p>That is the most practical path from theory to a real extension.</p>
<p>I tested the relay flow locally with the mock remote agent and confirmed that repeated delegations from the same local session reused the same remote <code>contextId</code>.</p>
<h2 id="heading-security-notes-before-you-go-further">Security Notes Before You Go Further</h2>
<p>This is the section you should not skip.</p>
<p>The OpenClaw security docs explicitly say the project assumes a <strong>personal assistant</strong> trust model: one trusted operator boundary per Gateway. They also say a shared Gateway for mutually untrusted or adversarial users is not the supported boundary model.</p>
<p>That has a direct consequence for A2A.</p>
<p>A2A is designed for communication across agent systems and organizational boundaries. That is powerful, but it is also a different threat model from a single private OpenClaw deployment.</p>
<p>So the safer design is <strong>not</strong> this:</p>
<ul>
<li><p>expose your personal OpenClaw Gateway publicly,</p>
</li>
<li><p>let arbitrary remote agents reach it,</p>
</li>
<li><p>and hope the tool boundaries are enough.</p>
</li>
</ul>
<p>The safer design is closer to this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/5ab4460a-6c00-4880-a29c-ddc1db00b5fa.png" alt="Diagram illustrating separation between a private OpenClaw deployment and an external A2A interoperability boundary, highlighting secure delegation through a controlled relay." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>This diagram shows two separate trust boundaries.</p>
<p>On the left is your <strong>private OpenClaw deployment</strong>. This includes your Gateway, your sessions, your workspace, and any credentials or tools your agent can access. This boundary is designed for a single trusted operator.</p>
<p>On the right is the <strong>external A2A ecosystem</strong>, where remote agents live. These agents may belong to other teams or organizations and operate under different security assumptions.</p>
<p>The key idea is that communication between these two sides should happen through a <strong>controlled relay layer</strong>, not by directly exposing your OpenClaw Gateway. The relay enforces allowlists, limits what data is sent out, and ensures that remote agents cannot directly access your local tools or state.</p>
<p>This separation lets you experiment with agent interoperability while keeping your personal assistant environment safe.</p>
<p>In plain English, keep your personal assistant boundary private.</p>
<p>If you experiment with A2A, treat that as a <strong>separate exposure boundary</strong> with its own allowlists, auth, and operational controls.</p>
<p>That is why the proof-of-concept relay in this article starts with an explicit remote allowlist.</p>
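<p>As a sketch, that gate can be as small as this (the allowlisted URL is a placeholder):</p>
<pre><code class="language-js">// Outbound allowlist: refuse to delegate to any remote agent not named here.
const REMOTE_ALLOWLIST = new Set([
  "https://research-agent.internal.example", // placeholder
]);

function assertAllowedRemote(baseUrl) {
  const origin = new URL(baseUrl).origin;
  if (!REMOTE_ALLOWLIST.has(origin)) {
    throw new Error(`Remote agent not in allowlist: ${origin}`);
  }
}
</code></pre>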
<h3 id="heading-why-this-design-and-not-the-other-one">Why This Design and Not the Other One?</h3>
<p>A natural question is why this article proposes an <strong>outbound-only A2A bridge first</strong>, instead of immediately building a full bidirectional or server-style integration.</p>
<p>The short answer is that OpenClaw’s current design is centered around a <strong>personal assistant trust boundary</strong>, where one operator controls the Gateway, sessions, and tools. Introducing external agents into that environment requires careful control over what is exposed.</p>
<p>Starting with outbound delegation gives you a safer and more incremental path.</p>
<p>Outbound-only first means:</p>
<ul>
<li><p>preserving the personal-assistant trust boundary, so your local OpenClaw deployment remains private and operator-controlled</p>
</li>
<li><p>avoiding exposing the OpenClaw Gateway as a public A2A server before you have strong auth, policy, and monitoring in place</p>
</li>
<li><p>allowing you to test remote delegation patterns (Agent Cards, tasks, artifacts) without committing to full interoperability complexity</p>
</li>
<li><p>keeping OpenClaw as the user-facing control plane, while remote agents act as optional specialists</p>
</li>
</ul>
<p>This approach follows a common systems design pattern: start with <strong>controlled outbound integration</strong>, validate behavior and constraints, and only then consider expanding to inbound or bidirectional communication.</p>
<p>In practice, this means you can experiment with A2A safely, learn how the models fit together, and evolve the system without introducing unnecessary risk early on.</p>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>OpenClaw is worth learning because it gives you a self-hosted assistant that can live in the communication tools you already use.</p>
<p>The simplest beginner path is still the right one:</p>
<ol>
<li><p>install it,</p>
</li>
<li><p>run onboarding,</p>
</li>
<li><p>check the Gateway,</p>
</li>
<li><p>open the dashboard,</p>
</li>
<li><p>try one private workflow.</p>
</li>
</ol>
<p>That is already a real end-to-end setup.</p>
<p>A2A belongs in the conversation because it gives you a credible way to connect OpenClaw to remote specialist agents later.</p>
<p>But the most important thing in this article isn't the buzzword. It's the boundary design.</p>
<p>If you keep OpenClaw as the private user-facing edge and use a narrow plugin bridge for outbound delegation, the OpenClaw session model and the A2A task model can fit together cleanly.</p>
<p>That is the architectural idea I wanted to make concrete here.</p>
<h3 id="heading-diagram-attribution">Diagram Attribution</h3>
<p>All diagrams in this article were created by the author specifically for this guide.</p>
<h2 id="heading-further-reading">Further Reading</h2>
<ul>
<li><p><a href="https://docs.openclaw.ai/">OpenClaw docs home</a></p>
</li>
<li><p><a href="https://docs.openclaw.ai/start/getting-started">OpenClaw Getting Started</a></p>
</li>
<li><p><a href="https://docs.openclaw.ai/start/wizard">OpenClaw Onboarding Wizard</a></p>
</li>
<li><p><a href="https://docs.openclaw.ai/concepts/multi-agent/">OpenClaw Multi-Agent Routing</a></p>
</li>
<li><p><a href="https://docs.openclaw.ai/concepts/session-tool/">OpenClaw Session Tools</a></p>
</li>
<li><p><a href="https://docs.openclaw.ai/tools/plugin/">OpenClaw Plugin System</a></p>
</li>
<li><p><a href="https://docs.openclaw.ai/plugins/agent-tools">OpenClaw Plugin Agent Tools</a></p>
</li>
<li><p><a href="https://docs.openclaw.ai/cli/acp">OpenClaw ACP bridge</a></p>
</li>
<li><p><a href="https://docs.openclaw.ai/gateway/security">OpenClaw Security</a></p>
</li>
<li><p><a href="https://a2a-protocol.org/latest/specification/">A2A specification</a></p>
</li>
<li><p><a href="https://a2a-protocol.org/latest/topics/agent-discovery/">A2A Agent Discovery</a></p>
</li>
<li><p><a href="https://a2a-protocol.org/latest/topics/a2a-and-mcp/">A2A and MCP</a></p>
</li>
<li><p><a href="https://a2a-protocol.org/latest/definitions/">A2A protocol definition and schema</a></p>
</li>
<li><p><a href="https://a2a-protocol.org/latest/announcing-1.0/">A2A version 1.0 announcement</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Production-Ready Voice Agent Architecture with WebRTC ]]>
                </title>
                <description>
                    <![CDATA[ In this tutorial, you'll build a production-ready voice agent architecture: a browser client that streams audio over WebRTC (Web Real-Time Communication), a backend that mints short-lived session toke ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-production-ready-voice-agents/</link>
                <guid isPermaLink="false">69ab2f260bca1a3976458b2a</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Accessibility ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Voice ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nataraj Sundar ]]>
                </dc:creator>
                <pubDate>Fri, 06 Mar 2026 19:46:46 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/c61b4358-66d9-434d-8555-d8921313e573.png" medium="image" />
                <content:encoded>
<![CDATA[ <p>In this tutorial, you'll build a production-ready voice agent architecture: a browser client that streams audio over WebRTC (Web Real-Time Communication), a backend that mints short-lived session tokens, an agent runtime that orchestrates speech and tools safely, and a post-call pipeline that generates artifacts for downstream workflows.</p>
<p>This article is intentionally vendor-neutral. You can implement these patterns using any AI voice platform that supports WebRTC (directly or via an SFU, selective forwarding unit) and server-side token minting. The goal is to help you ship a voice agent architecture that is secure, observable, and operable in production.</p>
<blockquote>
<p><em>Disclosure: This article reflects my personal views and experience. It does not represent the views of my employer or any vendor mentioned.</em></p>
</blockquote>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#what-youll-build">What You'll Build</a></p>
</li>
<li><p><a href="#how-to-avoid-common-production-failures-in-voice-agents">How to Avoid Common Production Failures in Voice Agents</a></p>
</li>
<li><p><a href="#how-to-design-a-latency-budget-for-a-real-time-voice-agent">How to Design a Latency Budget for a Real-Time Voice Agent</a></p>
</li>
<li><p><a href="#production-voice-agent-architecture-vendor-neutral">Production Voice Agent Architecture (Vendor-Neutral)</a></p>
<ul>
<li><p><a href="#step-0-set-up-the-project">Step 0: Set Up the Project</a></p>
</li>
<li><p><a href="#step-1-keep-credentials-server-side">Step 1: Keep Credentials Server-side</a></p>
</li>
<li><p><a href="#step-2-build-a-backend-token-endpoint">Step 2: Build a Backend Token Endpoint</a></p>
</li>
<li><p><a href="#step-3-connect-from-the-web-client-webrtc--sfu">Step 3: Connect from the Web Client (WebRTC + SFU)</a></p>
</li>
<li><p><a href="#step-4-add-client-actions-agent-suggests-app-executes">Step 4: Add Client Actions (Agent Suggests, App Executes)</a></p>
</li>
<li><p><a href="#step-5-add-tool-integrations-safely">Step 5: Add Tool Integrations Safely</a></p>
</li>
<li><p><a href="#step-6-add-post-call-processing-where-durable-value-appears">Step 6: Add post-call processing (where durable value appears)</a></p>
</li>
</ul>
</li>
<li><p><a href="#production-readiness-checklist">Production readiness checklist</a></p>
</li>
<li><p><a href="#closing">Closing</a></p>
</li>
</ul>
<h2 id="heading-what-youll-build">What You'll Build</h2>
<p>By the end, you'll have:</p>
<ul>
<li><p>A web client that streams microphone audio and plays agent audio.</p>
</li>
<li><p>A backend token endpoint that keeps credentials server-side.</p>
</li>
<li><p>A safe coordination channel between the agent and the application.</p>
</li>
<li><p>Structured messages between the application and the agent.</p>
</li>
<li><p>A production checklist for security, reliability, observability, and cost control.</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You should be comfortable with:</p>
<ul>
<li><p>JavaScript or TypeScript</p>
</li>
<li><p>Node.js 18+ (so <code>fetch</code> works server-side) and an HTTP framework (Express in examples)</p>
</li>
<li><p>Browser microphone permissions</p>
</li>
<li><p>Basic WebRTC concepts (high level is fine)</p>
</li>
</ul>
<h2 id="heading-tldr">TL;DR</h2>
<p>A <strong>production-ready voice agent</strong> needs:</p>
<ul>
<li><p>A <strong>server-side token service</strong> (no secrets in the browser)</p>
</li>
<li><p>A <strong>real-time media plane</strong> (WebRTC) for low-latency audio</p>
</li>
<li><p>A <strong>data channel</strong> for structured messages between your app and the agent</p>
</li>
<li><p><strong>Tool guardrails</strong> (allowlists, confirmations, timeouts, audit logs)</p>
</li>
<li><p><strong>Post-call processing</strong> (summary, actions, CRM (Customer Relationship Management), tickets)</p>
</li>
<li><p><strong>Observability-first</strong> implementation (state transitions + metrics)</p>
</li>
</ul>
<h2 id="heading-how-to-avoid-common-production-failures-in-voice-agents">How to Avoid Common Production Failures in Voice Agents</h2>
<p>If you've operated distributed systems, you've seen most failures happen at boundaries:</p>
<ul>
<li><p>timeouts and partial connectivity</p>
</li>
<li><p>retries that amplify load</p>
</li>
<li><p>unclear ownership between components</p>
</li>
<li><p>missing observability</p>
</li>
<li><p>“helpful automation” that becomes unsafe</p>
</li>
</ul>
<p>Voice agents amplify those risks because:</p>
<p><strong>Latency is User Experience</strong>: A slow agent feels broken. Conversational UX is less forgiving than web UX.</p>
<p><strong>Audio + UI + Tools is a Distributed System</strong>: You coordinate browser audio capture, WebRTC transport, STT (speech-to-text), model reasoning, tool calls, TTS (text-to-speech), and playback buffering. Each stage has different clocks and failure modes.</p>
<p><strong>Security Boundaries are Non-negotiable</strong>: A leaked API key is catastrophic. A tool misfire can trigger real-world side effects.</p>
<p><strong>Debuggability determines whether you can ship</strong>: If you don't log state transitions and capture post-call artifacts, you can't operate or improve the system safely.</p>
<h2 id="heading-how-to-design-a-latency-budget-for-a-real-time-voice-agent">How to Design a Latency Budget for a Real-Time Voice Agent</h2>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/694ca88d5ac09a5d68c63854/8bb5c6d5-4250-457b-94a2-fcb748050731.png" alt="Latency budget for a real-time voice agent showing mic capture, network RTT, STT, reasoning, tools, TTS, and playback buffering." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Conversations have a “feel.” That feel is mostly latency.</p>
<p>A practical guideline:</p>
<ul>
<li><p>Under <strong>~200ms</strong> feels instant</p>
</li>
<li><p><strong>300–500ms</strong> feels responsive</p>
</li>
<li><p>Over <strong>~700ms</strong> feels broken</p>
</li>
</ul>
<p>Your end-to-end latency is the sum of mic capture, network RTT (round-trip time), STT, reasoning, tool execution, TTS, and playback buffering. Budget for it explicitly or you’ll ship a technically correct system that users perceive as unintelligent.</p>
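<p>A quick way to make the budget explicit is to write it down as numbers and keep the sum honest. This is a minimal sketch; the figures are illustrative targets, not measurements:</p>
<pre><code class="language-javascript">// Illustrative per-stage targets in milliseconds; replace with your own measurements.
const budget = {
  micCapture: 20,
  networkRtt: 60,
  stt: 150,
  reasoning: 200, // budget tool calls separately for turns that use them
  tts: 120,
  playbackBuffer: 50,
};

const total = Object.values(budget).reduce((sum, ms) =&gt; sum + ms, 0);
console.log(`end-to-end: ${total}ms`); // 600ms here; over ~700ms feels broken
</code></pre>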
<h2 id="heading-how-to-design-a-production-voice-agent-architecture-vendor-neutral">How to Design a Production Voice Agent Architecture (Vendor-Neutral)</h2>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/694ca88d5ac09a5d68c63854/f0411ddc-d3fb-48e4-be72-37d9765bf0a7.png" alt="Production-ready voice agent architecture showing web client, token service, WebRTC real-time plane, agent runtime, tool layer, and post-call processing." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>A scalable <strong>voice agent architecture</strong> typically has these layers:</p>
<ol>
<li><p><strong>Web client</strong>: mic capture, audio playback, UI state</p>
</li>
<li><p><strong>Token service</strong>: short-lived session tokens (secrets stay server-side)</p>
</li>
<li><p><strong>Real-time plane</strong>: WebRTC media + a data channel</p>
</li>
<li><p><strong>Agent runtime</strong>: STT → reasoning → TTS, plus tool orchestration</p>
</li>
<li><p><strong>Tool layer</strong>: external actions behind safety controls</p>
</li>
<li><p><strong>Post-call processor</strong>: summary + structured outputs after the session ends</p>
</li>
</ol>
<p>This separation makes failure domains and trust boundaries explicit.</p>
<h2 id="heading-step-0-set-up-the-project">Step 0: Set Up the Project</h2>
<p>Create a new project directory:</p>
<pre><code class="language-shell">mkdir voice-agent-app
cd voice-agent-app
npm init -y
npm pkg set type=module
npm pkg set scripts.start="node server.js"
</code></pre>
<p>Install dependencies:</p>
<pre><code class="language-shell">npm install express dotenv
</code></pre>
<p>Create this folder structure:</p>
<pre><code class="language-plaintext">voice-agent-app/
├── server.js
├── .env
└── public/
    ├── index.html
    └── client.js
</code></pre>
<p>Add a <code>.env</code> file:</p>
<pre><code class="language-shell">VOICE_PLATFORM_URL=https://your-provider.example
VOICE_PLATFORM_API_KEY=your_api_key_here
</code></pre>
<p>Now you’re ready to implement each part of the system.</p>
<h2 id="heading-step-1-keep-credentials-server-side">Step 1: Keep Credentials Server-side</h2>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/694ca88d5ac09a5d68c63854/d522fdf2-bb96-4531-b4ff-3a364336178c.png" alt="Security trust boundary diagram showing browser as untrusted zone and backend/tooling as trusted zone with secrets server-side." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Treat every API key like production credentials:</p>
<ul>
<li><p>store it in environment variables or a secrets manager</p>
</li>
<li><p>rotate it if exposed</p>
</li>
<li><p>never embed it in browser or mobile apps</p>
</li>
<li><p>avoid logging secrets (log only a short suffix if necessary)</p>
</li>
</ul>
<p>Even if a vendor supports CORS, the browser is not a safe place for long-lived credentials.</p>
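<p>For the "log only a short suffix" rule above, a small helper keeps key material out of your logs. A minimal sketch:</p>
<pre><code class="language-javascript">// Log at most a short suffix of a secret, never the full value.
function maskSecret(secret) {
  if (!secret || secret.length &lt; 8) return "[redacted]";
  return `...${secret.slice(-4)}`;
}

console.log("Voice platform key loaded:", maskSecret(process.env.VOICE_PLATFORM_API_KEY));
</code></pre>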
<h2 id="heading-step-2-build-a-backend-token-endpoint">Step 2: Build a Backend Token Endpoint</h2>
<p>Your backend should:</p>
<ul>
<li><p>authenticate the user</p>
</li>
<li><p>mint a short-lived session token using your platform API</p>
</li>
<li><p>return only what the client needs (URL + token + expiry)</p>
</li>
</ul>
<h3 id="heading-create-serverjs-nodejs-express">Create server.js (Node.js + Express)</h3>
<pre><code class="language-javascript">import express from "express";
import dotenv from "dotenv";
import path from "path";
import { fileURLToPath } from "url";

dotenv.config();

const app = express();
app.use(express.json());

// Serve the web client from /public
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);
app.use(express.static(path.join(__dirname, "public")));

const VOICE_PLATFORM_URL = process.env.VOICE_PLATFORM_URL;
const VOICE_PLATFORM_API_KEY = process.env.VOICE_PLATFORM_API_KEY;

app.post("/api/voice-token", async (req, res) =&gt; {
  res.setHeader("Cache-Control", "no-store");

  try {
    if (!VOICE_PLATFORM_URL || !VOICE_PLATFORM_API_KEY) {
      return res.status(500).json({
        error: "Missing VOICE_PLATFORM_URL or VOICE_PLATFORM_API_KEY in .env",
      });
    }

    // TODO: Authenticate the caller before minting tokens.

    const r = await fetch(`${VOICE_PLATFORM_URL}/api/v1/token`, {
      method: "POST",
      headers: {
        "X-API-Key": VOICE_PLATFORM_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ participant_name: "Web User" }),
    });

    if (!r.ok) {
      const detail = await r.text().catch(() =&gt; "");
      return res.status(r.status).json({ error: "Token request failed", detail });
    }

    const data = await r.json();

    res.json({
      rtc_url: data.rtc_url || data.livekit_url,
      token: data.token,
      expires_in: data.expires_in,
    });
  } catch (err) {
    res.status(500).json({ error: "Failed to mint token" });
  }
});

app.listen(3000, () =&gt; console.log("Open http://localhost:3000"));
</code></pre>
<h3 id="heading-run-the-server">Run the server</h3>
<pre><code class="language-shell">npm start
</code></pre>
<p>Then open: <a href="http://localhost:3000">http://localhost:3000</a></p>
<h3 id="heading-how-this-code-works">How this code works</h3>
<ul>
<li><p>You load credentials from environment variables so secrets never enter the browser.</p>
</li>
<li><p>The <code>/api/voice-token</code> endpoint calls the voice platform’s token API.</p>
</li>
<li><p>You return only the <code>rtc_url</code>, <code>token</code>, and expiration time.</p>
</li>
<li><p>The browser never sees the API key.</p>
</li>
<li><p>If the provider returns an error, you forward a structured error response.</p>
</li>
</ul>
<h3 id="heading-production-notes"><strong>Production Notes</strong></h3>
<ul>
<li><p>rate-limit <code>/api/voice-token</code> (cost + abuse control; see the sketch below)</p>
</li>
<li><p>instrument token mint latency and error rate</p>
</li>
<li><p>keep TTL short and handle refresh/reconnect</p>
</li>
<li><p>return minimal fields</p>
</li>
</ul>
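<p>Rate limiting is one line of middleware if you bring in a limiter. A minimal sketch, assuming the <code>express-rate-limit</code> package:</p>
<pre><code class="language-javascript">import rateLimit from "express-rate-limit"; // npm install express-rate-limit

// At most 10 token mints per IP per minute; tune for your traffic.
const tokenLimiter = rateLimit({ windowMs: 60_000, max: 10 });

// Register before the route definition in server.js.
app.use("/api/voice-token", tokenLimiter);
</code></pre>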
<h2 id="heading-step-3-connect-from-the-web-client-webrtc-sfu">Step 3: Connect from the Web Client (WebRTC + SFU)</h2>
<p>In this step, you'll build a minimal web UI that:</p>
<ul>
<li><p>Requests a short-lived token from your backend</p>
</li>
<li><p>Connects to a real-time WebRTC room (often via an SFU)</p>
</li>
<li><p>Plays the agent's audio track</p>
</li>
<li><p>Captures and publishes microphone audio</p>
</li>
</ul>
<h3 id="heading-create-publicindexhtml">Create <code>public/index.html</code></h3>
<pre><code class="language-html">&lt;!doctype html&gt;
&lt;html&gt;
  &lt;head&gt;
    &lt;meta charset="UTF-8" /&gt;
    &lt;meta name="viewport" content="width=device-width,initial-scale=1" /&gt;
    &lt;title&gt;Voice Agent Demo&lt;/title&gt;
  &lt;/head&gt;
  &lt;body&gt;
    &lt;h1&gt;Voice Agent Demo&lt;/h1&gt;

    &lt;button id="startBtn"&gt;Start Call&lt;/button&gt;
    &lt;button id="endBtn" disabled&gt;End Call&lt;/button&gt;

    &lt;p id="status"&gt;Idle&lt;/p&gt;

    &lt;script type="module" src="/client.js"&gt;&lt;/script&gt;
  &lt;/body&gt;
&lt;/html&gt;
</code></pre>
<h3 id="heading-create-publicclientjs">Create <code>public/client.js</code></h3>
<p>Note: This uses a LiveKit-style client SDK to demonstrate the pattern. If you're using a different provider, swap this import and the connect/publish calls for your provider's WebRTC client.</p>
<pre><code class="language-javascript">import { Room, RoomEvent, Track } from "https://unpkg.com/livekit-client@2.10.1/dist/livekit-client.esm.mjs";

const startBtn = document.getElementById("startBtn");
const endBtn = document.getElementById("endBtn");
const statusEl = document.getElementById("status");

let room = null;
let intentionallyDisconnected = false;
let audioEls = [];

function setStatus(text) {
  statusEl.textContent = text;
}

function detachAllAudio() {
  for (const el of audioEls) {
    try { el.pause?.(); } catch {}
    el.remove();
  }
  audioEls = [];
}

async function mintToken() {
  const res = await fetch("/api/voice-token", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ participant_name: "Web User" }),
    cache: "no-store",
  });

  if (!res.ok) {
    const detail = await res.text().catch(() =&gt; "");
    throw new Error(`Token request failed: ${detail || res.status}`);
  }

  const { rtc_url, token } = await res.json();
  if (!rtc_url || !token) throw new Error("Token response missing rtc_url or token");
  return { rtc_url, token };
}

function wireRoomEvents(r) {
  // 1) Play the agent audio track when subscribed
  r.on(RoomEvent.TrackSubscribed, (track) =&gt; {
    if (track.kind !== Track.Kind.Audio) return;

    const el = track.attach();
    audioEls.push(el);
    document.body.appendChild(el);

    // Autoplay restrictions vary by browser/device.
    el.play?.().catch(() =&gt; {
      setStatus("Connected (audio may be blocked — click the page to enable)");
    });
  });

  // 2) Reconnect on disconnect (token expiry often shows up this way)
  r.on(RoomEvent.Disconnected, async () =&gt; {
    if (intentionallyDisconnected) return;
    setStatus("Disconnected (reconnecting...)");
    await attemptReconnect();
  });
}

async function connectOnce() {
  const { rtc_url, token } = await mintToken();

  const r = new Room();
  wireRoomEvents(r);

  await r.connect(rtc_url, token);

  // Mic permission + publish mic
  try {
    await r.localParticipant.setMicrophoneEnabled(true);
  } catch {
    try { r.disconnect(); } catch {}
    throw new Error("Microphone access denied. Allow mic permission and try again.");
  }

  return r;
}

async function startCall() {
  if (room) return;

  intentionallyDisconnected = false;
  setStatus("Connecting...");

  room = await connectOnce();

  setStatus("Connected");
  startBtn.disabled = true;
  endBtn.disabled = false;
}

async function stopCall() {
  intentionallyDisconnected = true;

  try {
    await room?.localParticipant?.setMicrophoneEnabled(false);
  } catch {}

  try {
    room?.disconnect();
  } catch {}

  room = null;
  detachAllAudio();

  setStatus("Disconnected");
  startBtn.disabled = false;
  endBtn.disabled = true;
}

async function attemptReconnect() {
  // Simplified exponential backoff reconnect.
  // In production, add jitter, max attempts, and better error classification.
  const delaysMs = [250, 500, 1000, 2000];

  for (const delay of delaysMs) {
    if (intentionallyDisconnected) return;

    try {
      // Tear down current state before reconnecting
      try { room?.disconnect(); } catch {}
      room = null;
      detachAllAudio();

      await new Promise((r) =&gt; setTimeout(r, delay));

      room = await connectOnce();
      setStatus("Reconnected");
      startBtn.disabled = true;
      endBtn.disabled = false;
      return;
    } catch {
      // keep retrying
    }
  }

  setStatus("Disconnected (reconnect failed)");
  startBtn.disabled = false;
  endBtn.disabled = true;
}

startBtn.addEventListener("click", async () =&gt; {
  try {
    await startCall();
  } catch (err) {
    setStatus(err?.message || "Connection failed");
    startBtn.disabled = false;
    endBtn.disabled = true;
    room = null;
    detachAllAudio();
  }
});

endBtn.addEventListener("click", async () =&gt; {
  await stopCall();
});
</code></pre>
<h3 id="heading-how-this-step-works-and-why-these-details-matter">How this Step works (and why these details matter)</h3>
<ul>
<li><p>The Start button gives you a user gesture so browsers are more likely to allow audio playback.</p>
</li>
<li><p>Mic permission is handled explicitly: if the user denies access, you show a clear error and avoid a half-connected session.</p>
</li>
<li><p>Disconnect cleanup removes audio elements so you don't leak resources across retries.</p>
</li>
<li><p>The reconnect loop demonstrates the production pattern: if a disconnect happens (often due to token expiry or network churn), the client re-mints a token and reconnects.</p>
</li>
</ul>
<p>In the next step, you'll add a structured data-channel handler to safely process agent-suggested “client actions”.</p>
<h3 id="heading-handle-these-explicitly"><strong>Handle These Explicitly</strong></h3>
<h3 id="heading-autoplay-restriction-example">Autoplay Restriction Example</h3>
<p>This relies on the same controls as <code>index.html</code> from Step 3:</p>
<pre><code class="language-html">&lt;button id="startBtn"&gt;Start Call&lt;/button&gt;
&lt;button id="endBtn" disabled&gt;End Call&lt;/button&gt;
&lt;div id="status"&gt;&lt;/div&gt;
</code></pre>
<p>In <code>client.js</code>:</p>
<pre><code class="language-javascript">const startBtn = document.getElementById("startBtn");
const endBtn = document.getElementById("endBtn");
const statusEl = document.getElementById("status");

let room;

startBtn.addEventListener("click", async () =&gt; {
  try {
    room = await connectOnce(); // defined in Step 3
    statusEl.textContent = "Connected";
    startBtn.disabled = true;
    endBtn.disabled = false;
  } catch (err) {
    statusEl.textContent = "Connection failed";
  }
});
</code></pre>
<h3 id="heading-microphone-denial">Microphone denial</h3>
<pre><code class="language-javascript">try {
  await navigator.mediaDevices.getUserMedia({ audio: true });
} catch (err) {
  statusEl.textContent = "Microphone access denied";
  throw err;
}
</code></pre>
<h3 id="heading-disconnect-cleanup">Disconnect cleanup</h3>
<pre><code class="language-javascript">endBtn.addEventListener("click", () =&gt; {
  if (room) {
    room.disconnect();
    statusEl.textContent = "Disconnected";
    startBtn.disabled = false;
    endBtn.disabled = true;
  }
});
</code></pre>
<h3 id="heading-token-refresh-simplified">Token refresh (simplified)</h3>
<pre><code class="language-javascript">room.on(RoomEvent.Disconnected, async () =&gt; {
  const res = await fetch("/api/voice-token");
  const { rtc_url, token } = await res.json();
  await room.connect(rtc_url, token);
});
</code></pre>
<h2 id="heading-step-4-add-client-actions-agent-suggests-app-executes">Step 4: Add Client Actions (Agent Suggests, App Executes)</h2>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/694ca88d5ac09a5d68c63854/2304be1c-3451-45f8-ae44-2519fa92c82a.png" alt="Sequence diagram showing agent requesting a client action, app validating allowlist, user confirming, and app executing the side effect." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>A production voice agent often needs to:</p>
<ul>
<li><p>open a runbook/dashboard URL</p>
</li>
<li><p>show a checklist in the UI</p>
</li>
<li><p>request confirmation for an irreversible action</p>
</li>
<li><p>receive structured context (account, region, incident ID)</p>
</li>
</ul>
<p>The key safety rule:</p>
<p><strong>The agent suggests actions. The application validates and executes them.</strong></p>
<p>Use structured messages over the data channel:</p>
<pre><code class="language-json">{
&nbsp;&nbsp;"type": "client_action",
&nbsp;&nbsp;"action": "open_url",
&nbsp;&nbsp;"payload": { "url": "https://internal.example.com/runbook" },
&nbsp;&nbsp;"id": "action_123"
}
</code></pre>
<p><strong>Add guardrails</strong>:</p>
<ul>
<li><p>allowlist permitted actions</p>
</li>
<li><p>validate payload shape</p>
</li>
<li><p>confirmation gates for irreversible actions</p>
</li>
<li><p>idempotency via id</p>
</li>
<li><p>audit logs for every request and outcome</p>
</li>
</ul>
<p>This boundary limits damage from hallucinations or prompt injection.</p>
<pre><code class="language-javascript">// Guardrails: allowlist + validation + idempotency + confirmation

const ALLOWED_ACTIONS = new Set(["open_url", "request_confirm"]);
const EXECUTED_ACTION_IDS = new Set();
const ALLOWED_HOSTS = new Set(["internal.example.com"]);

function parseClientAction(text) {
  let msg;
  try {
    msg = JSON.parse(text);
  } catch {
    return null;
  }

  if (msg?.type !== "client_action") return null;
  if (typeof msg.id !== "string") return null;
  if (!ALLOWED_ACTIONS.has(msg.action)) return null;

  return msg;
}

async function handleClientAction(msg, room) {
  if (EXECUTED_ACTION_IDS.has(msg.id)) return; // idempotency
  EXECUTED_ACTION_IDS.add(msg.id);

  console.log("[client_action]", msg); // audit log (demo)

  if (msg.action === "open_url") {
    const url = msg.payload?.url;
    if (typeof url !== "string") return;

    const u = new URL(url);
    if (!ALLOWED_HOSTS.has(u.host)) {
      console.warn("Blocked navigation to:", u.host);
      return;
    }

    window.open(url, "_blank", "noopener,noreferrer");
    return;
  }

  if (msg.action === "request_confirm") {
    const prompt = msg.payload?.prompt || "Confirm this action?";
    const ok = window.confirm(prompt);

    // Send confirmation back to agent/app
    room.localParticipant.publishData(
      new TextEncoder().encode(
        JSON.stringify({ type: "user_confirmed", id: msg.id, ok })
      ),
      { topic: "client_events", reliable: true }
    );
  }
}
</code></pre>
<pre><code class="language-javascript">room.on(RoomEvent.DataReceived, (payload, participant, kind, topic) =&gt; {
  if (topic !== "client_actions") return;

  const text = new TextDecoder().decode(payload);
  const msg = parseClientAction(text);
  if (!msg) return;

  handleClientAction(msg, room);
});
</code></pre>
<h2 id="heading-step-5-add-tool-integrations-safely">Step 5: Add Tool Integrations Safely</h2>
<p>Tools turn a voice agent into automation. Regardless of vendor, enforce these rules:</p>
<ul>
<li><p>timeouts on every tool call</p>
</li>
<li><p>circuit breakers for flaky dependencies</p>
</li>
<li><p>audit logs (inputs, outputs, duration, trace IDs)</p>
</li>
<li><p>explicit confirmation for destructive actions</p>
</li>
<li><p>credentials stored server-side (never in prompts or clients)</p>
</li>
</ul>
<p>If tools fail, degrade gracefully (“I can’t access that system right now, here’s the manual fallback.”). Silence reads as failure.</p>
<p><strong>Create a server-side tool runner (example)</strong></p>
<p>Paste this into <code>server.js</code>:</p>
<pre><code class="language-javascript">const TOOL_ALLOWLIST = {
  get_status: { destructive: false },
  create_ticket: { destructive: true },
};

let failures = 0;
let circuitOpenUntil = 0;

function circuitOpen() {
  return Date.now() &lt; circuitOpenUntil;
}

async function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise((_, reject) =&gt; setTimeout(() =&gt; reject(new Error("timeout")), ms)),
  ]);
}

async function runToolSafely(tool, args) {
  if (circuitOpen()) throw new Error("circuit_open");

  try {
    // Demo stub: replace Promise.resolve(...) with the real tool call.
    const result = await withTimeout(Promise.resolve({ ok: true, tool, args }), 2000);
    failures = 0;
    return result;
  } catch (err) {
    failures++;
    if (failures &gt;= 3) circuitOpenUntil = Date.now() + 10_000;
    throw err;
  }
}

app.post("/api/tools/run", async (req, res) =&gt; {
  const { tool, args, user_confirmed } = req.body || {};

  if (!TOOL_ALLOWLIST[tool]) return res.status(400).json({ error: "Tool not allowed" });

  if (TOOL_ALLOWLIST[tool].destructive &amp;&amp; user_confirmed !== true) {
    return res.status(400).json({ error: "Confirmation required" });
  }

  try {
    const started = Date.now();
    const result = await runToolSafely(tool, args);
    console.log("[tool_call]", { tool, ms: Date.now() - started }); // audit log
    res.json({ ok: true, result });
  } catch (err) {
    console.log("[tool_error]", { tool, err: String(err) });
    res.status(500).json({ ok: false, error: "Tool call failed" });
  }
});
</code></pre>
<h2 id="heading-step-6-add-post-call-processing-where-durable-value-appears">Step 6: Add post-call processing (where durable value appears)</h2>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/694ca88d5ac09a5d68c63854/65d350ff-8f20-489f-b5de-9cd59dda5b8c.png" alt="Post-call processing workflow showing transcript storage, queue/worker, summaries/action items, and integration updates." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>After a call ends, generate structured artifacts:</p>
<ul>
<li><p>summary</p>
</li>
<li><p>action items</p>
</li>
<li><p>follow-up email draft</p>
</li>
<li><p>CRM entry or ticket creation</p>
</li>
</ul>
<p>A production pattern:</p>
<ul>
<li><p>store transcript + metadata</p>
</li>
<li><p>enqueue a background job (queue/worker)</p>
</li>
<li><p>produce outputs as JSON + a human-readable report</p>
</li>
<li><p>apply integrations with retries + idempotency (sketch below)</p>
</li>
<li><p>store a “call report” for audits and incident reviews</p>
</li>
</ul>
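<p>The retries + idempotency item deserves a concrete shape. Here's a minimal sketch; in production you'd back the "applied" set with durable storage:</p>
<pre><code class="language-javascript">const APPLIED = new Set(); // demo only: use durable storage in production

async function applyWithRetry(idempotencyKey, fn, attempts = 3) {
  if (APPLIED.has(idempotencyKey)) return; // already applied, skip

  for (let i = 0; i &lt; attempts; i++) {
    try {
      await fn();
      APPLIED.add(idempotencyKey);
      return;
    } catch (err) {
      if (i === attempts - 1) throw err;
      await new Promise((r) =&gt; setTimeout(r, 500 * 2 ** i)); // exponential backoff
    }
  }
}
</code></pre>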
<p><strong>Create a post-call webhook endpoint (example)</strong></p>
<p>Paste into <code>server.js</code>:</p>
<pre><code class="language-javascript">app.post("/webhooks/call-ended", async (req, res) =&gt; {
  const payload = req.body;

  console.log("[call_ended]", {
    call_id: payload.call_id,
    ended_at: payload.ended_at,
  });

  setImmediate(() =&gt; processPostCall(payload));
  res.json({ ok: true });
});

function processPostCall(payload) {
  const transcript = payload.transcript || [];
  const summary = transcript.slice(0, 3).map(t =&gt; `- ${t.speaker}: ${t.text}`).join("\n");

  const report = {
    call_id: payload.call_id,
    summary,
    action_items: payload.action_items || [],
    created_at: new Date().toISOString(),
  };

  console.log("[call_report]", report);
}
</code></pre>
<h3 id="heading-test-it-locally">Test it locally</h3>
<pre><code class="language-shell">curl -X POST http://localhost:3000/webhooks/call-ended \
  -H "Content-Type: application/json" \
  -d '{
    "call_id": "call_123",
    "ended_at": "2026-02-26T00:10:00Z",
    "transcript": [
      {"speaker": "user", "text": "I need help resetting my password."},
      {"speaker": "agent", "text": "Sure — I can help with that."}
    ],
    "action_items": ["Send password reset link", "Verify account email"]
  }'
</code></pre>
<h2 id="heading-production-readiness-checklist">Production readiness checklist</h2>
<h3 id="heading-security"><strong>Security</strong></h3>
<ul>
<li><p>no API keys in the browser</p>
</li>
<li><p>strict allowlist for client actions</p>
</li>
<li><p>confirmation gates for destructive actions</p>
</li>
<li><p>schema validation on all inbound messages</p>
</li>
<li><p>audit logging for actions and tool calls</p>
</li>
</ul>
<h3 id="heading-reliability"><strong>Reliability</strong></h3>
<ul>
<li><p>reconnect strategy for expired tokens</p>
</li>
<li><p>timeouts + circuit breakers for tools</p>
</li>
<li><p>graceful degradation when dependencies fail</p>
</li>
<li><p>idempotent side effects</p>
</li>
</ul>
<h3 id="heading-observability"><strong>Observability</strong></h3>
<p>Log state transitions (for example):<br><strong>listening → thinking → speaking → ended</strong></p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/694ca88d5ac09a5d68c63854/a1302294-4338-4a3a-ab0d-c50fd34c117f.png" alt="Voice agent state machine showing listening, thinking, speaking, and ended states for observability." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><strong>Track:</strong></p>
<ul>
<li><p>connect failure rate</p>
</li>
<li><p>end-to-end latency (STT + reasoning + TTS)</p>
</li>
<li><p>tool error rate</p>
</li>
<li><p>reconnect frequency</p>
</li>
</ul>
<h3 id="heading-cost-control"><strong>Cost control</strong></h3>
<ul>
<li><p>rate-limit token minting and sessions</p>
</li>
<li><p>cap max call duration (sketch below)</p>
</li>
<li><p>bound context growth (summarize or truncate)</p>
</li>
<li><p>track per-call usage drivers (STT/TTS minutes, tool calls)</p>
</li>
</ul>
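<p>Capping call duration can start as a simple client-side timer that reuses the <code>stopCall</code> helper from Step 3; pair it with a server-side cap so the client can't opt out. A minimal sketch:</p>
<pre><code class="language-javascript">const MAX_CALL_MS = 10 * 60 * 1000; // 10 minutes; tune per use case

let callTimer = null;

function startCallTimer() {
  callTimer = setTimeout(() =&gt; {
    setStatus("Call ended (max duration reached)");
    stopCall(); // from Step 3
  }, MAX_CALL_MS);
}

function clearCallTimer() {
  if (callTimer) clearTimeout(callTimer);
  callTimer = null;
}
</code></pre>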
<h2 id="heading-optional-resources">Optional resources</h2>
<h3 id="heading-how-to-try-a-managed-voice-platform-quickly">How to Try a Managed Voice Platform Quickly</h3>
<p>If you want a managed provider to test quickly, you can sign up for a <a href="https://vocalbridgeai.com/">Vocal Bridge account</a> and implement these steps using their token minting + real-time session APIs.</p>
<p>But the core production voice agent architecture in this article is vendor-agnostic. You can replace any component (SFU, STT/TTS, agent runtime, tool layer) as long as you preserve the boundaries: secure token service, real-time media, safe tool execution, and strong observability.</p>
<h3 id="heading-watch-a-full-demo-and-explore-a-complete-reference-repo">Watch a full demo and explore a complete reference repo</h3>
<p>If you'd like to see these patterns working together in a realistic scenario (incident triage), here are two optional resources:</p>
<ul>
<li><p><strong>Demo video:</strong> <a href="https://youtu.be/TqrtOKd8Zug">Voice-First Incident Triage (end-to-end run)</a><br>This is a hackathon run-through showing client actions, decision boundaries for irreversible actions, and a structured post-call summary.</p>
</li>
<li><p><strong>GitHub repo (architecture + design + working code):</strong> <code>https://github.com/natarajsundar/voice-first-incident-triage</code></p>
</li>
</ul>
<p>These links are optional; you can follow the tutorial end-to-end without them.</p>
<h2 id="heading-closing">Closing</h2>
<p>Production-ready voice agents work when you treat them like real-time distributed systems.</p>
<p>Start with the baseline:</p>
<ul>
<li>token service + web client + real-time audio</li>
</ul>
<p>Then layer in:</p>
<ul>
<li><p>controlled client actions</p>
</li>
<li><p>safe tools</p>
</li>
<li><p>post-call automation</p>
</li>
<li><p>observability and cost controls</p>
</li>
</ul>
<p>That’s how you ship a voice agent architecture you can operate. You now have a vendor-neutral reference architecture you can adapt to your stack, with clear trust boundaries, safe tool execution, and operational visibility.</p>
<p>If you’re shipping real-time AI systems, what’s been your biggest production bottleneck so far: <strong>latency, reliability, or tool safety</strong>? I’d love to hear what you’re seeing in the wild. Connect with me on <a href="https://www.linkedin.com/in/natarajsundar/">LinkedIn</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build AI Agents That Remember User Preferences (Without Breaking Context) ]]>
                </title>
                <description>
                    <![CDATA[ Why Personalization Breaks Most AI Agents Personalization is one of the most requested features in AI-powered applications. Users expect an agent to remember their preferences, adapt to their style, and improve over time. In practice, personalization... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-ai-agents-that-remember-user-preferences-without-breaking-context/</link>
                <guid isPermaLink="false">698cc32db8fec0245bd9996d</guid>
                
                    <category>
                        <![CDATA[ ai agents ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ System Design ]]>
                    </category>
                
                    <category>
                        <![CDATA[ software architecture ]]>
                    </category>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Developer Tools ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tools ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nataraj Sundar ]]>
                </dc:creator>
                <pubDate>Wed, 11 Feb 2026 17:58:05 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770832641633/da49bdca-617e-4272-b5b7-012f3c6c1d61.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <h2 id="heading-why-personalization-breaks-most-ai-agents"><strong>Why Personalization Breaks Most AI Agents</strong></h2>
<p>Personalization is one of the most requested features in AI-powered applications. Users expect an agent to remember their preferences, adapt to their style, and improve over time.</p>
<p>In practice, personalization is unfortunately also one of the fastest ways to break an otherwise working AI agent.</p>
<p>Many agents start with a simple idea: keep adding more conversation history to the prompt. This approach works for demos, but it quickly fails in real applications. Context windows grow too large. Irrelevant information leaks into decisions. Costs increase. Debugging becomes nearly impossible.</p>
<p>If you want a personalized agent that survives production, you need more than a large language model. You need a way to connect the agent to tools, manage multi-step workflows, and store user preferences safely over time – without turning your system into a tangled mess of prompts and callbacks.</p>
<p>In this tutorial, you’ll learn how to design a personalized AI agent using three core building blocks:</p>
<ul>
<li><p><strong>Agent Development Kit (ADK)</strong> to orchestrate agent reasoning and execution</p>
</li>
<li><p><strong>Model Context Protocol (MCP)</strong> to connect tools with clear boundaries</p>
</li>
<li><p><strong>Long-term memory</strong> to store preferences without polluting context</p>
</li>
</ul>
<p>Rather than focusing on setup commands or vendor-specific walkthroughs, we'll focus on the architectural patterns that make personalized agents reliable, debuggable, and maintainable.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770578645884/2fd77443-31d5-4db3-98f0-bba685122a6f.png" alt="User preferences influence an AI agent’s personalized response" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><em>Figure 1 — Personalization influences agent responses</em></p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#what-personalized-means-in-a-real-ai-agent">What “Personalized” Means in a Real AI Agent</a></p>
</li>
<li><p><a class="post-section-overview" href="#how-the-agent-architecture-fits-together">How the Agent Architecture Fits Together</a></p>
</li>
<li><p><a class="post-section-overview" href="#how-to-design-the-agent-core-with-adk">How to Design the Agent Core with ADK</a></p>
</li>
<li><p><a class="post-section-overview" href="#how-to-connect-tools-safely-with-mcp">How to Connect Tools Safely with MCP</a></p>
</li>
<li><p><a class="post-section-overview" href="#how-to-add-long-term-memory-without-polluting-context">How to Add Long-Term Memory Without Polluting Context</a></p>
<ul>
<li><a class="post-section-overview" href="#privacy-consent-and-lifecycle-controls-production-checklist">Privacy, Consent, and Lifecycle Controls (Production Checklist)</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#how-the-end-to-end-agent-flow-works">How the End-to-End Agent Flow Works</a></p>
</li>
<li><p><a class="post-section-overview" href="#common-pitfalls-youll-hit-and-how-to-avoid-them">Common Pitfalls You’ll Hit (and How to Avoid Them)</a></p>
</li>
<li><p><a class="post-section-overview" href="#what-you-learned-and-where-to-go-next">What You Learned and Where to Go Next</a></p>
</li>
</ul>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>To follow along with this tutorial, you should have:</p>
<ul>
<li><p>Basic familiarity with Python</p>
</li>
<li><p>A general understanding of how large language models work</p>
</li>
<li><p>Optional: a Google Cloud account if you want to run an end-to-end demo. Otherwise, you can follow the architecture and code patterns locally with stubs. We’ll avoid deep infrastructure setup and focus on design patterns rather than deployment mechanics.</p>
</li>
</ul>
<p>You don’t need prior experience with ADK or MCP. I’ll introduce each concept as it appears.</p>
<h2 id="heading-what-personalized-means-in-a-real-ai-agent"><strong>What “Personalized” Means in a Real AI Agent</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770578714303/4d25a7e4-fcdd-4a1a-a12c-411e41f2021f.png" alt="An AI agent accesses external tools through a protocol boundary/control layer" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><em>Figure 2 — Keep preferences out of the prompt: agent ↔ tools across a protocol boundary</em></p>
<p>Before writing any code, it’s important to define what personalization means in an AI agent.</p>
<p>Personalization is not the same as “remembering everything.” In practice, agent state usually falls into three categories:</p>
<ol>
<li><p><strong>Short-term context:</strong> Information needed to complete the current task. This belongs in the prompt.</p>
</li>
<li><p><strong>Session state:</strong> Temporary decisions or selections made during a workflow. This should be structured and scoped to a session.</p>
</li>
<li><p><strong>Long-term memory:</strong> Durable user preferences or facts that should persist across sessions.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770577191953/3df5aa02-2eb9-4214-bbef-52f18ddb353a.png" alt="Three panels comparing short-term context, session state, and long-term memory" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><em>Figure 3 — Three kinds of agent state: context (now), session (today), memory (always)</em></p>
<p>Most problems happen when these categories are mixed together.</p>
<p>If you store long-term preferences directly in the prompt, the agent’s behavior becomes unpredictable. If you store everything permanently, memory grows without bounds. If you don’t scope memory at all, unrelated sessions start influencing each other.</p>
<p>A well-designed, personalized agent treats memory as a first-class system component, not as extra text added to a prompt.</p>
<p>In the next section, we'll look at how to structure the agent so these concerns stay separated. </p>
<p>By the end of this tutorial, you’ll understand how to design a personalized AI agent that uses long-term memory safely, connects to tools through clear boundaries, and remains debuggable as it grows.</p>
<h2 id="heading-how-the-agent-architecture-fits-together"><strong>How the Agent Architecture Fits Together</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770577351960/9b14cadf-d650-4098-8ce1-9fd706537bb9.png" alt="Reference architecture showing a user, an AI agent core, tools, a memory service, and an orchestration runtime" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><em>Figure 4 — Reference architecture: agent core + tools + memory service + orchestration runtime</em></p>
<p>The diagram above shows the high-level architecture of a personalized AI agent: an agent core handles reasoning and planning while interacting with a tool interface layer, a long-term memory service, and an orchestration runtime.</p>
<p>Let’s now understand the moving parts of a personalized agent and how they interact.</p>
<p>At a high level, the system has four responsibilities:</p>
<ol>
<li><p><strong>Reasoning</strong> – deciding what to do next</p>
</li>
<li><p><strong>Execution</strong> – calling tools and services</p>
</li>
<li><p><strong>Memory</strong> – storing and retrieving long-term preferences</p>
</li>
<li><p><strong>Boundaries</strong> – controlling what the agent is allowed to do</p>
</li>
</ol>
<p>A common mistake is blurring these responsibilities together: letting the model decide when to write memory, for example, or allowing tools to execute actions without clear constraints.</p>
<p>Instead, you'll design the system so each responsibility has a clear owner. The core components look like this:</p>
<ul>
<li><p><strong>Agent core</strong>: Handles reasoning and planning</p>
</li>
<li><p><strong>Tools</strong>: Perform external actions (read or write)</p>
</li>
<li><p><strong>MCP layer</strong>: Defines how tools are exposed and invoked</p>
</li>
<li><p><strong>Memory services</strong>: Store long-term user data safely</p>
</li>
</ul>
<p>ADK sits at the center, orchestrating how requests flow between these components. The model never directly talks to databases or services. It reasons about actions, and ADK coordinates execution.</p>
<p>This separation makes the system easier to reason about, debug, and extend.</p>
<h2 id="heading-how-to-design-the-agent-core-with-adk"><strong>How to Design the Agent Core with ADK</strong></h2>
<p>Before we dive in, a quick note on what ADK is.</p>
<p><strong>Agent Development Kit (ADK)</strong> is an agent orchestration framework – the glue code between a large language model and your application. Instead of treating the model as a black box that directly “does things”, ADK helps you structure the agent as a system:</p>
<ul>
<li><p>The model focuses on <strong>reasoning</strong> (turning user intent, context, and memory into a structured plan)</p>
</li>
<li><p>Your runtime stays in control of <strong>execution</strong> (deciding which tools can run, how they run, and what gets logged or persisted)</p>
</li>
</ul>
<p>In other words, ADK is what lets you take tool calling and multi-step workflows out of a giant prompt and turn them into a maintainable and testable architecture. In this tutorial, we’ll use ADK to refer to that orchestration layer. The same patterns apply if you use a different agent framework.</p>
<p><strong>Note:</strong> The following code snippets are simplified reference examples intended to illustrate architectural patterns. They’re not production-ready drop-ins.</p>
<p>Once you understand the architecture, you can start designing the agent core. The agent core is responsible for reasoning, not execution.</p>
<p>A helpful mental model is to think of the agent as a planner, not a doer. Its role is to interpret the user’s goal, consider available context and memory, and produce a structured plan that can later be executed in a controlled way.</p>
<p>To make this concrete, the following example shows how an agent can translate user input and memory into an explicit plan. In practice, ADK orchestrates this using a large language model, but the important idea is that the output is structured intent, not side effects.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Reference example for illustration.</span>

<span class="hljs-keyword">from</span> dataclasses <span class="hljs-keyword">import</span> dataclass
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> List, Dict, Any

<span class="hljs-meta">@dataclass</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Step</span>:</span>
    tool: str
    args: Dict[str, Any]

<span class="hljs-meta">@dataclass</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Plan</span>:</span>
    goal: str
    steps: List[Step]

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">build_plan</span>(<span class="hljs-params">user_text: str, memory: Dict[str, Any]</span>) -&gt; Plan:</span>
    <span class="hljs-comment"># In practice, the LLM produces this structure via ADK orchestration.</span>
    goal = <span class="hljs-string">f"Help user: <span class="hljs-subst">{user_text}</span>"</span>
    steps = []
    <span class="hljs-keyword">if</span> memory.get(<span class="hljs-string">"prefers_short_answers"</span>):
        steps.append(Step(tool=<span class="hljs-string">"set_style"</span>, args={<span class="hljs-string">"verbosity"</span>: <span class="hljs-string">"low"</span>}))
    steps.append(Step(tool=<span class="hljs-string">"search_docs"</span>, args={<span class="hljs-string">"query"</span>: user_text}))
    steps.append(Step(tool=<span class="hljs-string">"summarize"</span>, args={<span class="hljs-string">"max_bullets"</span>: <span class="hljs-number">5</span>}))
    <span class="hljs-keyword">return</span> Plan(goal=goal, steps=steps)
</code></pre>
<p>This example illustrates an important constraint: the agent produces a plan, but it doesn’t execute anything directly.</p>
<p>The agent decides <em>what</em> should happen and <em>in what order</em>, while ADK controls <em>when</em> and <em>how</em> each step runs. This separation lets you inspect, test, and reason about decisions before they result in real-world actions.</p>
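<p>Because plans are plain data, you can unit-test the reasoning layer without touching any tools. Here’s a minimal sketch that reuses the <code>build_plan</code> example above (the assertions reflect that example’s behavior, not a fixed ADK contract):</p>
<pre><code class="lang-python"># Illustrative sketch: plans are inspectable data, so reasoning can be
# tested without executing a single tool.

def test_short_answer_preference_shapes_plan():
    memory = {"prefers_short_answers": True}
    plan = build_plan("What is MCP?", memory)

    tools = [step.tool for step in plan.steps]
    # The stored preference influenced planning...
    assert tools[0] == "set_style"
    # ...but nothing has executed yet: no side effects occurred.
    assert tools == ["set_style", "search_docs", "summarize"]
</code></pre>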
<p>When personalization is involved, this distinction becomes critical: preferences may influence planning, but execution should remain tightly controlled by the runtime.</p>
<p>Concretely, the agent should not:</p>
<ul>
<li><p>Perform side effects directly</p>
</li>
<li><p>Write to databases</p>
</li>
<li><p>Call external APIs without supervision</p>
</li>
</ul>
<p>ADK makes this separation natural: the agent emits intents and tool calls, and the runtime alone decides how and when they execute.</p>
<p>This design has two major benefits:</p>
<ol>
<li><p><strong>Safety</strong> – you can restrict which tools the agent can access (see the sketch after this list)</p>
</li>
<li><p><strong>Debuggability</strong> – you can inspect decisions before execution</p>
</li>
</ol>
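<p>To make the safety benefit concrete, here is one way a runtime could vet a plan against an allowlist before anything runs. This is a sketch: <code>ALLOWED_TOOLS</code> and <code>PolicyError</code> are illustrative names, not ADK APIs.</p>
<pre><code class="lang-python"># Illustrative sketch: the runtime checks every planned step against an
# allowlist before execution begins.

ALLOWED_TOOLS = {"set_style", "search_docs", "summarize"}

class PolicyError(Exception):
    pass

def vet_plan(plan: Plan) -&gt; Plan:
    for step in plan.steps:
        if step.tool not in ALLOWED_TOOLS:
            # Reject the whole plan before any side effects can happen.
            raise PolicyError(f"Tool not allowed by policy: {step.tool}")
    return plan
</code></pre>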
<h2 id="heading-how-to-connect-tools-safely-with-mcp"><strong>How to Connect Tools Safely with MCP</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770578793149/2e3f8282-341a-4f03-9313-df3f8c9c5174.png" alt="Tool call routed through a control layer with request, validation, execution, and response steps." class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><em>Figure 5 — Tool calls with guardrails: request → validate → execute → respond</em></p>
<p>Tools are how agents interact with the real world. They fetch data, generate artifacts, and sometimes perform actions with side effects.</p>
<p>Without clear boundaries, tool usage quickly becomes a source of fragility. Hardcoded API calls leak into prompts, tools evolve independently, and agents gain more authority than intended.</p>
<p>To avoid these problems, tools should be explicitly registered and invoked through a narrow interface. The following example shows a simple tool registry pattern that mirrors how MCP exposes tools to an agent without tightly coupling it to implementations.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Reference example (pseudocode for illustration)</span>

<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Callable, Dict, Any

ToolFn = Callable[[Dict[str, Any]], Dict[str, Any]]

TOOLS: Dict[str, ToolFn] = {}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">register_tool</span>(<span class="hljs-params">name: str</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">decorator</span>(<span class="hljs-params">fn: ToolFn</span>):</span>
        TOOLS[name] = fn
        <span class="hljs-keyword">return</span> fn
    <span class="hljs-keyword">return</span> decorator

<span class="hljs-meta">@register_tool("search_docs")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">search_docs</span>(<span class="hljs-params">args: Dict[str, Any]</span>) -&gt; Dict[str, Any]:</span>
    query = args[<span class="hljs-string">"query"</span>]
    <span class="hljs-comment"># Replace with your MCP client call (or local tool implementation).</span>
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"results"</span>: [<span class="hljs-string">f"doc://example?q=<span class="hljs-subst">{query}</span>"</span>]}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">invoke_tool</span>(<span class="hljs-params">name: str, args: Dict[str, Any]</span>) -&gt; Dict[str, Any]:</span>
    <span class="hljs-keyword">if</span> name <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> TOOLS:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"Tool not allowed: <span class="hljs-subst">{name}</span>"</span>)
    <span class="hljs-keyword">return</span> TOOLS[name](args)
</code></pre>
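<p>With this registry in place, the agent never imports tool implementations – it only names them:</p>
<pre><code class="lang-python"># Usage: the agent names a tool; the registry decides whether it runs.
result = invoke_tool("search_docs", {"query": "memory admission policy"})
print(result)  # {'results': ['doc://example?q=memory admission policy']}

invoke_tool("drop_table", {})  # raises ValueError: Tool not allowed: drop_table
</code></pre>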
<p>The Model Context Protocol (MCP) provides a clean way to formalize this pattern. You can think of MCP the same way operating systems treat system calls.</p>
<p>An application does not directly manipulate hardware. Instead, it requests operations through well-defined system calls. The kernel decides whether the operation is allowed and how it executes.</p>
<p>In the same way, the agent knows <em>what</em> capabilities exist, MCP defines <em>how</em> those capabilities are invoked, and the runtime controls <em>when</em> and <em>whether</em> they execute.</p>
<p>This separation prevents several common problems, including hardcoded API details in prompts, unexpected breakage when tools change, and agents performing unrestricted side effects.</p>
<p>When designing tools, it helps to classify them by risk: read tools for safe queries, generate tools for planning or synthesis, and commit tools for irreversible actions. In a personalized agent, commit tools should be rare and tightly guarded.</p>
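<p>One lightweight way to encode that classification is to tag each registered tool with a risk tier that the runtime consults before execution. A sketch, building on the registry above – the tier names and mapping are an assumption for illustration, not part of the MCP spec:</p>
<pre><code class="lang-python"># Illustrative sketch: map each tool to a risk tier so the runtime can
# apply tier-specific policies (e.g., approvals for commit tools).
from typing import Dict

TOOL_RISK: Dict[str, str] = {
    "search_docs": "read",      # safe query, no side effects
    "draft_reply": "generate",  # synthesis, still reversible
    "send_email": "commit",     # irreversible; require approval
}

def risk_of(name: str) -&gt; str:
    # Unknown tools default to the most restrictive tier.
    return TOOL_RISK.get(name, "commit")
</code></pre>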
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770580271505/d5d34514-3b98-4997-85ed-dee55e65d711.png" alt="Observability around tool calls using logs, traces, and timing across decision points" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><em>Figure 6 — Observability around tool calls: logs, traces, timing, decision points</em></p>
<h2 id="heading-how-to-add-long-term-memory-without-polluting-context"><strong>How to Add Long-Term Memory Without Polluting Context</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770577944241/b2a3de65-c5e2-456e-8a33-e9fd4d2695f0.png" alt="Memory candidates extracted from user input, filtered and validated, then stored asynchronously" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><em>Figure 7 — Memory admission pipeline: extract → filter/validate → persist asynchronously</em></p>
<p>Memory is where personalization either succeeds or fails.</p>
<p>You might start by storing everything the user says and feeding it back into the prompt. That works briefly, then collapses under its own weight as context grows, costs rise, and behavior becomes unpredictable.</p>
<p>A better approach is to treat memory as structured, curated data with clear admission rules, so you control what the agent remembers and why. Before persisting anything, the system should explicitly decide whether the information is worth remembering. The following function demonstrates a simple memory admission policy.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Simplified Reference Only</span>
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Optional, Dict, Any

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">memory_candidate</span>(<span class="hljs-params">user_text: str</span>) -&gt; Optional[Dict[str, Any]]:</span>
    text = user_text.lower()

    <span class="hljs-comment"># Durable</span>
    <span class="hljs-keyword">if</span> <span class="hljs-string">"for this session"</span> <span class="hljs-keyword">in</span> text <span class="hljs-keyword">or</span> <span class="hljs-string">"ignore after"</span> <span class="hljs-keyword">in</span> text:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

    <span class="hljs-comment"># Reusable</span>
    <span class="hljs-keyword">if</span> <span class="hljs-string">"my preferred language is"</span> <span class="hljs-keyword">in</span> text:
        <span class="hljs-keyword">return</span> {<span class="hljs-string">"type"</span>: <span class="hljs-string">"preference"</span>, <span class="hljs-string">"key"</span>: <span class="hljs-string">"language"</span>, <span class="hljs-string">"value"</span>: user_text.split()[<span class="hljs-number">-1</span>]}

    <span class="hljs-comment"># Safe (basic example; add PII checks for your use case)</span>
    <span class="hljs-keyword">if</span> <span class="hljs-string">"password"</span> <span class="hljs-keyword">in</span> text <span class="hljs-keyword">or</span> <span class="hljs-string">"ssn"</span> <span class="hljs-keyword">in</span> text:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>  <span class="hljs-comment"># default: don’t store</span>
</code></pre>
<p>This policy encodes three questions every memory candidate must answer:</p>
<ul>
<li><p>Is it durable? Will it still matter in the future?</p>
</li>
<li><p>Is it reusable? Will it influence future decisions meaningfully?</p>
</li>
<li><p>Is it safe to persist? Does it avoid sensitive or session-specific data?</p>
</li>
</ul>
<p>Only information that passes all three checks should become long-term memory. In practice, this usually includes stable preferences and long-lived constraints, not temporary instructions or intermediate reasoning.</p>
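<p>Run against the admission policy above, typical inputs behave like this:</p>
<pre><code class="lang-python">memory_candidate("My preferred language is Python")
# -&gt; {"type": "preference", "key": "language", "value": "Python"}

memory_candidate("For this session, answer in French")
# -&gt; None (not durable: session-scoped instruction)

memory_candidate("My password is hunter2")
# -&gt; None (not safe: sensitive data)
</code></pre>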
<h3 id="heading-privacy-consent-and-lifecycle-controls-production-checklist"><strong>Privacy, Consent, and Lifecycle Controls (Production Checklist)</strong></h3>
<p>Even if your admission rules are solid, long-term memory introduces governance requirements:</p>
<ul>
<li><p><strong>User control:</strong> allow users to view, export, and delete stored preferences at any time.</p>
</li>
<li><p><strong>Sensitive data handling:</strong> never store secrets/PII. Run PII detection on every memory candidate (and consider redaction).</p>
</li>
<li><p><strong>Retention + consent:</strong> use explicit consent for persistent memory and apply retention windows (TTL) so memory expires unless it’s still useful.</p>
</li>
<li><p><strong>Security + auditability:</strong> encrypt at rest, restrict access by service identity, and keep an audit log of memory writes/updates.</p>
</li>
</ul>
<p>Memory writes should also be asynchronous. The agent should never block while persisting memory, which keeps interactions responsive and avoids coupling reasoning to storage latency.</p>
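<p>Here’s one way that can look in practice – a sketch assuming a thread-based write queue, with a TTL field for the retention windows mentioned in the checklist above; <code>persist</code> is a hypothetical durable-store call:</p>
<pre><code class="lang-python"># Illustrative sketch: a background worker drains a queue so the request
# path never blocks on storage; each record carries a TTL for retention.
import queue
import threading
import time
from typing import Any, Dict

_writes: "queue.Queue[Dict[str, Any]]" = queue.Queue()

def write_async(user_id: str, candidate: Dict[str, Any], ttl_days: int = 90) -&gt; None:
    # Enqueue and return immediately; the agent keeps responding.
    record = {**candidate, "user_id": user_id,
              "expires_at": time.time() + ttl_days * 86400}
    _writes.put(record)

def _worker() -&gt; None:
    while True:
        record = _writes.get()
        persist(record)  # hypothetical durable-store call (encrypt at rest!)
        _writes.task_done()

threading.Thread(target=_worker, daemon=True).start()
</code></pre>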
<h2 id="heading-how-the-end-to-end-agent-flow-works"><strong>How the End-to-End Agent Flow Works</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770578847727/f3cbc4b9-5bc9-4026-ae69-6fd7bc1625fc.png" alt="End-to-end flow showing user input, agent reasoning, tool invocation, and memory updates with feedback loops" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><em>Figure 8 — End-to-end request lifecycle: user input → plan → tools → memory updates</em></p>
<p>With the individual components in place, it’s helpful to trace exactly how memory and tools work together during a single request. The following example walks through the full lifecycle of a personalized interaction, from user input to response.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Reference example (pseudocode for illustration)</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handle_request</span>(<span class="hljs-params">user_id: str, user_text: str</span>) -&gt; str:</span>
    memory = memory_store.get(user_id)  <span class="hljs-comment"># e.g., {"prefers_short_answers": True}</span>
    plan = build_plan(user_text, memory)

    tool_outputs = []
    <span class="hljs-keyword">for</span> step <span class="hljs-keyword">in</span> plan.steps:
        out = invoke_tool(step.tool, step.args)
        tool_outputs.append({step.tool: out})

    response = render_response(goal=plan.goal, tool_outputs=tool_outputs, memory=memory)

    cand = memory_candidate(user_text)
    <span class="hljs-keyword">if</span> cand:
        <span class="hljs-comment"># Never block the user on storage.</span>
        memory_store.write_async(user_id, cand)
    <span class="hljs-keyword">return</span> response
</code></pre>
<p>At a high level, the flow looks like this:</p>
<ol>
<li><p>The user sends a message.</p>
</li>
<li><p>Relevant long-term memory is retrieved.</p>
</li>
<li><p>The agent reasons about the request and produces a plan.</p>
</li>
<li><p>ADK invokes tools through MCP as needed.</p>
</li>
<li><p>Results flow back to the agent.</p>
</li>
<li><p>The agent decides whether new information should be persisted.</p>
</li>
<li><p>Memory is written asynchronously.</p>
</li>
<li><p>The final response is returned to the user.</p>
</li>
</ol>
<p>Notice what does <strong>not</strong> happen: the model does not directly write memory, tools do not execute without coordination, and context does not grow without bounds. This structure keeps personalization controlled and predictable.</p>
<h2 id="heading-common-pitfalls-youll-hit-and-how-to-avoid-them"><strong>Common Pitfalls You’ll Hit (and How to Avoid Them)</strong></h2>
<p>Even with a solid architecture, there are a few failure modes that show up repeatedly in real systems. Many of them stem from allowing agents to perform irreversible actions without explicit checks.</p>
<p>The following example shows a simple guardrail for commit-style tools that require approval before execution.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Reference example (pseudocode for illustration)</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">invoke_commit_tool</span>(<span class="hljs-params">name: str, args: Dict[str, Any], approved: bool</span>) -&gt; Dict[str, Any]:</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> approved:
        <span class="hljs-comment"># Require explicit confirmation or policy approval before side effects.</span>
        <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"blocked"</span>, <span class="hljs-string">"reason"</span>: <span class="hljs-string">"commit tools require approval"</span>}

    <span class="hljs-comment"># For example: create_ticket, send_email, submit_order, update_record</span>
    <span class="hljs-keyword">return</span> invoke_tool(name, args)
</code></pre>
<p>This pattern forces a clear decision point before side effects occur. It also creates an audit trail that explains <em>why</em> an action was allowed or blocked.</p>
<p>Other common pitfalls include over-personalization, leaky memory that persists session-specific data, uncontrolled tool growth, and debugging blind spots caused by unclear boundaries. If you see these symptoms, it usually means responsibilities are not clearly separated.</p>
<h2 id="heading-what-you-learned-and-where-to-go-next"><strong>What You Learned and Where to Go Next</strong></h2>
<p>Personalized AI agents are powerful, but they require discipline. The key insight is that personalization is a <strong>systems problem</strong>, not a prompt problem.</p>
<p>By separating reasoning from execution, structuring memory carefully, and using protocols like MCP to enforce boundaries, you can build agents that scale beyond demos and remain maintainable in production.</p>
<p>As you extend this system, resist the urge to add “just one more prompt tweak.” Instead, ask whether the change belongs in memory, tools, or orchestration.  </p>
<p>That mindset will save you time as your agent grows in complexity.  </p>
<p>If you’d like to continue the conversation, you can find me on <a target="_blank" href="https://www.linkedin.com/in/natarajsundar/">LinkedIn</a>.</p>
<p><em>All diagrams in this article were created by the author for educational purposes.</em></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
