Building reliable workflows across multiple microservices is challenging. In a monolith, a database transaction can ensure that multiple operations either succeed or fail together. But once data is spread across different services and databases, that guarantee disappears.
This is where the Saga Pattern comes in. Instead of using distributed transactions, a saga coordinates a sequence of local transactions and runs compensation actions when something goes wrong.
In this article, we'll build an orchestrated Saga Pattern using NestJS, gRPC, PostgreSQL, and Sequelize. You'll learn how to coordinate work across services, implement compensation-based rollbacks, handle idempotency, and track workflow progress in a production-style microservice architecture.
Prerequisites
This article assumes you're already familiar with some backend development concepts. You don't need prior experience with the Saga Pattern, but you should be comfortable with:
JavaScript, TypeScript, Node.js
NestJS fundamentals (controllers, services, dependency injection)
Basic PostgreSQL concepts
Database transactions
Docker (recommended for local development)
Microservice architecture basics
gRPC fundamentals (helpful but not required)
If you've already built a few backend services with NestJS and PostgreSQL, you'll have everything you need to follow this guide.
1. Introduction
A saga is a sequence of local transactions across multiple services. Each step commits its own database transaction. If a later step fails, the saga runs compensating transactions to semantically undo the work already committed.
The pattern was first described by Hector Garcia-Molina and Kenneth Salem in 1987 for long-lived database transactions. It was rediscovered a decade ago when companies started splitting monoliths into microservices and realised that the database transaction — the single most powerful tool in a backend developer's belt — stops working at the service boundary.
This article walks through an orchestrated saga in Node.js (NestJS + gRPC) for onboarding an agency, where two services must agree on a single business outcome:
agency-service: owns the agency record.
auth-service: owns the organization, user and role.
If either side fails, the system must end up as if nothing ever happened. No half-created users, orphan organizations, or 3am Slack threads.
2. The Problem in One Picture
Here's the bug a saga is built to prevent:
Step 1: auth-service ✅ creates Organization #42
Step 2: auth-service ✅ creates User #99
Step 3: agency-service ❌ fails (DB down, validation, network blip…)
Result without a saga:
Organization #42 and User #99 still exist.
There is no Agency row.
The user can log in but has nothing to manage.
Support gets a ticket. Engineer writes a one-off SQL cleanup.
Repeat every week.
The saga's job is to detect that step 3 failed and explicitly delete Organization #42 and User #99, so the system is consistent again — even though those rows live in a different service's database.
3. Why You Need a Saga
In a monolith, you wrap everything in one DB transaction and let the database handle atomicity:
await sequelize.transaction(async (tx) => {
await Organization.create({...}, { transaction: tx });
await User.create({...}, { transaction: tx });
await Agency.create({...}, { transaction: tx });
});
In microservices, each service has its own database. You can't wrap two services in one ACID transaction. The classic alternatives all have problems:
| Option | Problem |
|---|---|
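| Two-Phase Commit (2PC) | Holds locks across services until every participant has voted, makes the coordinator a single point of failure, and scales poorly. It also needs every participant to expose a prepare/commit protocol (for example XA), which services behind HTTP or gRPC APIs typically don't. |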
| "Just hope it works" | Leaves orphan users / billing rows when half the flow fails. Real data corruption — and the longer the system runs, the more orphans accumulate. |
| Manual cleanup scripts | Works for a week. Bugs hide for months. New engineers don't know they exist. |
| Eventual consistency without compensation | Fine for some domains (analytics) but completely wrong for billing, identity, or anything with money. |
| Saga pattern | Each service commits locally. The orchestrator owns the workflow and runs explicit compensation on failure. It's auditable, restartable, and easy to reason about. |
The saga gives you eventual consistency with a clear, auditable rollback path — without distributed locks.
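To make the idea concrete before looking at the real services, here is a minimal, framework-free sketch of a saga runner. The names SagaStep, SagaOutcome and runSaga are hypothetical and do not come from the reference implementation. The sketch only shows the core mechanic: run local steps in order and, on failure, run the compensations of the already-committed steps in reverse.

```typescript
// A step pairs a forward action with the compensation that semantically undoes it.
interface SagaStep<Ctx> {
  name: string;
  action: (ctx: Ctx) => Promise<void>;
  compensate: (ctx: Ctx) => Promise<void>;
}

type SagaOutcome =
  | { ok: true }
  | { ok: false; failedStep: string; compensated: string[] };

async function runSaga<Ctx>(steps: SagaStep<Ctx>[], ctx: Ctx): Promise<SagaOutcome> {
  const done: SagaStep<Ctx>[] = [];
  for (const step of steps) {
    try {
      await step.action(ctx);
      done.push(step); // remember what has been committed so far
    } catch {
      const compensated: string[] = [];
      // Undo the committed steps, most recent first.
      for (const prev of [...done].reverse()) {
        await prev.compensate(ctx);
        compensated.push(prev.name);
      }
      return { ok: false, failedStep: step.name, compensated };
    }
  }
  return { ok: true };
}
```

This sketch keeps everything in memory and lets a compensation error propagate to the caller. A real orchestrator also has to persist each transition, which is what the rest of the article adds.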
4. Choreography vs Orchestration
There are two ways to implement a saga:
Choreography
With Choreography, services emit events and other services subscribe and react.
auth-service → emits "UserCreated"
agency-service → listens, creates agency, emits "AgencyCreated"
billing-service → listens, creates subscription…
It's simple at first, but brittle later. The workflow is scattered across N codebases. Nobody owns it. Debugging means tracing events across logs. Adding a step means changing several services.
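A small in-memory sketch shows why the choreographed flow is hard to see in one place. InMemoryBus and wireChoreography are hypothetical names used only for illustration. A real system would use a message broker, and each handler would live in a different codebase.

```typescript
type Handler = (payload: Record<string, unknown>) => void;

// Minimal synchronous event bus standing in for a real broker.
class InMemoryBus {
  private handlers = new Map<string, Handler[]>();

  on(event: string, handler: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  emit(event: string, payload: Record<string, unknown>): void {
    for (const handler of this.handlers.get(event) ?? []) handler(payload);
  }
}

// Each subscription would normally be registered inside a different service.
function wireChoreography(bus: InMemoryBus, trace: string[]): void {
  bus.on("UserCreated", (payload) => {
    trace.push("agency-service: create agency");
    bus.emit("AgencyCreated", payload);
  });
  bus.on("AgencyCreated", () => {
    trace.push("billing-service: create subscription");
  });
}
```

Notice that no function here describes the whole workflow. The order of steps only emerges from who subscribes to what.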
Orchestration
With Orchestration, one service is the conductor. It calls the others in order.
orchestrator:
1. authClient.provisionAccount(...)
2. agencyRepo.create(...)
3. authClient.sendWelcomeEmail(...)
There's slightly more coupling here (the orchestrator imports clients), but the entire workflow lives in one file. Onboarding new engineers becomes a one-hour task. Adding a step is a single PR.
Pick orchestration unless you have a strong reason not to. This article — and the reference implementation — uses orchestration.
5. The Example Project
Our goal here is to create an Agency in the system. This is the moment a new B2B customer signs up.
It requires two services to agree on a single outcome:
auth-service must create:
an Organization row (the tenant)
a User row (the agency admin who will log in)
a UserRole row linking the user to the AGENCY_ADMIN role
agency-service must create:
an Agency row containing business details (size, registration number, website, branches…), linked to the user/organization above
These rows have foreign-key relationships within a service, but not across services — Postgres can't enforce that the user in auth's DB matches the authUserId in agency's DB. The application has to do it.
auth-service DB agency-service DB
───────────────── ─────────────────
organizations ◄────────┐
│ │
│ (1:1) │ foreign reference (no FK)
▼ │ agencies
users ──────► user_roles ─ authUserId
└ authOrganizationId
If the agency-service step fails after the auth-service step succeeded, we end up with a user who can authenticate but has no agency, which is the exact bug from section 2. That's what the saga prevents.
6. Architecture
┌───────────────────────────────┐
│ API Gateway │
└──────────────┬────────────────┘
│ HTTP
▼
┌──────────────────────────────────────────────────┐
│ agency-service │
│ ┌─────────────────────────────────────────┐ │
│ │ AgencyOnboardingOrchestrator (SAGA) │ │
│ └───────────────┬─────────────────────────┘ │
│ │ writes state │
│ ▼ │
│ agency_onboarding_sagas (Postgres) │
└───────────────┬─────────────────┬────────────────┘
│ gRPC │ gRPC
provisionAgencyAccount compensateAgencyAccount
│ │
▼ ▼
┌──────────────────────────────────────────────────┐
│ auth-service │
│ AgencyProvisioningService (Participant) │
│ │
│ organizations · users · user_roles │
│ agency_provision_records ← idempotency log │
└──────────────────────────────────────────────────┘
Three components do all the work:
AgencyOnboardingOrchestrator in agency-service: drives the workflow.
agency_onboarding_sagas table in agency-service: the durable log of the saga's progress.
AgencyProvisioningService in auth-service: exposes a do operation (provisionAgencyAccount) and an undo operation (compensateAgencyAccount). It's backed by its own agency_provision_records idempotency table.
The orchestrator never reaches into the auth database directly. The boundary is enforced by gRPC.
7. The Saga Flow, Step by Step
This sequence diagram shows the complete lifecycle of the onboarding saga. The workflow begins when a client sends a request to create a new agency. The orchestrator first creates a saga record in its database and marks it as STARTED, giving it a durable record of the workflow before any business action takes place.
Next, the orchestrator asks auth-service to provision the organization, user, and role in a single local transaction. Once that succeeds, the orchestrator creates the agency record in its own database.
If every step succeeds, the saga reaches the COMPLETED state. If the agency creation fails after the auth resources have already been created, the orchestrator triggers a compensation step that instructs auth-service to remove everything it previously provisioned.
The key idea is that each service commits its own local transaction, while the saga coordinates the overall business workflow and ensures the system can return to a consistent state when failures occur.
sequenceDiagram
autonumber
participant C as Client
participant AS as agency-service<br/>Orchestrator
participant DB1 as saga store
participant AU as auth-service
participant DB2 as auth DB
C->>AS: POST /agencies
AS->>DB1: INSERT saga (STARTED, payload)
AS->>AU: provisionAgencyAccount(sagaId, …)
AU->>DB2: BEGIN TX
AU->>DB2: create org + user + role + provision_record
AU->>DB2: COMMIT
AU-->>AS: { userId, organizationId, roleId }
AS->>DB1: UPDATE saga (AUTH_PROVISIONED)
AS->>AS: create Agency row
alt Agency row OK
AS->>DB1: UPDATE saga (AGENCY_CREATED → COMPLETED)
AS->>AU: sendAgencyWelcomeEmail (non-critical)
AS-->>C: 200 OK + sagaId
else Agency row fails
AS->>DB1: UPDATE saga (COMPENSATING)
AS->>AU: compensateAgencyAccount(sagaId)
AU->>DB2: BEGIN TX
AU->>DB2: delete role + token + user + org + record
AU->>DB2: COMMIT
AS->>DB1: UPDATE saga (COMPENSATED → FAILED)
AS-->>C: 5xx + error code
end
Read this once top to bottom and you'll understand the entire onboarding workflow. That's the value of orchestration — the sequence diagram is the architecture.
8. The State Machine
Every transition is written to agency_onboarding_sagas before the next step runs. That is what makes the saga observable and recoverable.
export enum AgencyOnboardingSagaStatus {
STARTED = 'STARTED', // Row exists, no side effects yet
AUTH_PROVISIONED = 'AUTH_PROVISIONED', // Auth side committed
AGENCY_CREATED = 'AGENCY_CREATED', // Agency row committed
COMPLETED = 'COMPLETED', // Happy-path terminal state
COMPENSATING = 'COMPENSATING', // Rollback in progress
COMPENSATED = 'COMPENSATED', // Rollback finished
FAILED = 'FAILED', // Terminal failure (with or without compensation)
}
Why so many states? Because "what went wrong here?" is a question someone will ask at 2am. A saga that only stores success | failure is useless for forensics.
┌── auth fails ──────────► FAILED (nothing to compensate)
│
STARTED ──► AUTH_PROVISIONED ──► AGENCY_CREATED ──► COMPLETED (happy path)
│
agency fails ───┘
▼
COMPENSATING
│
▼
COMPENSATED ──► FAILED (consistent again)
The threshold that matters is AUTH_PROVISIONED. Before it, we can fail fast because there's nothing to undo. After it, every failure path must go through compensation.
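One way to keep the code honest about this diagram is to encode the allowed transitions as data and check them before every write. The sketch below is an assumption, not part of the reference implementation. The names ALLOWED, canTransition and assertTransition are hypothetical, and the table mirrors only the arrows drawn above.

```typescript
type SagaStatus =
  | "STARTED"
  | "AUTH_PROVISIONED"
  | "AGENCY_CREATED"
  | "COMPLETED"
  | "COMPENSATING"
  | "COMPENSATED"
  | "FAILED";

// Each key lists the states it may move to, taken from the diagram.
const ALLOWED: Record<SagaStatus, readonly SagaStatus[]> = {
  STARTED: ["AUTH_PROVISIONED", "FAILED"],
  AUTH_PROVISIONED: ["AGENCY_CREATED", "COMPENSATING"],
  AGENCY_CREATED: ["COMPLETED"],
  COMPLETED: [],
  COMPENSATING: ["COMPENSATED"],
  COMPENSATED: ["FAILED"],
  FAILED: [],
};

function canTransition(from: SagaStatus, to: SagaStatus): boolean {
  return ALLOWED[from].includes(to);
}

function assertTransition(from: SagaStatus, to: SagaStatus): void {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal saga transition: ${from} -> ${to}`);
  }
}
```

Calling assertTransition inside the repository's update method would turn an accidental jump, such as COMPLETED to COMPENSATING, into a loud error instead of a silently corrupted saga row.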
9. Implementing the Orchestrator
The orchestrator is the only place that knows the workflow. Each step is a private method, and each step persists its result before returning.
Creating the Saga Record
// agency-onboarding.saga.repository.ts
async createSaga(payload: CreateAgencyOrchestrationInput) {
return this.sagaModel.create({
sagaId: randomUUID(), // correlation id for everything
status: AgencyOnboardingSagaStatus.STARTED,
currentStep: 'STARTED',
payload, // full input snapshot for replay
});
}
The sagaId is a UUID generated once and propagated to every downstream call. It's the single identifier that ties the saga log on the orchestrator side to the provision record on the participant side.
The Main Loop
// agency-onboarding.orchestrator.ts (trimmed for the article)
async execute(input: CreateAgencyOrchestrationInput) {
  const saga = await this.sagaRepository.createSaga(input); // STARTED
  // Track the latest persisted row so the catch-all never compensates with stale IDs.
  let activeSaga = saga;
  try {
    // Step 1: auth-service work
    const authStep = await this.provisionAuth(saga, input);
    if (!authStep.ok) {
      await this.markFailed(saga, authStep.failure); // nothing to compensate
      return authStep.failure;
    }
    // Step 2: agency-service work
    activeSaga = authStep.saga; // status: AUTH_PROVISIONED
    try {
      activeSaga = await this.createAgencyRow(activeSaga, input, authStep.authIds);
    } catch (err) {
      // The expensive case: undo what auth-service did
      await this.compensateAuth(activeSaga, 'SAGA_FAILED');
      const message = err instanceof Error ? err.message : String(err);
      const failure = mapSagaFailure(message, 'SAGA_FAILED', 'CREATE_AGENCY');
      await this.markFailed(activeSaga, failure);
      return failure;
    }
    // Step 3: mark done and run non-critical side effects
    activeSaga = await this.sagaRepository.updateSaga(activeSaga, {
      status: AgencyOnboardingSagaStatus.COMPLETED,
    });
    await this.sendWelcomeEmail(input, activeSaga); // best-effort
    return mapSagaSuccess(activeSaga, await this.agencyModel.findByPk(activeSaga.agencyId!));
  } catch (error) {
    // Defensive catch-all (lost DB connection, unexpected throw).
    // A saga that already reached COMPLETED must not be rolled back.
    if (activeSaga.status === AgencyOnboardingSagaStatus.COMPLETED) throw error;
    await this.compensateAuth(activeSaga, 'SAGA_FAILED');
    const message = error instanceof Error ? error.message : String(error);
    const failure = mapSagaFailure(message, 'SAGA_FAILED', 'SAGA');
    await this.markFailed(activeSaga, failure);
    return failure;
  }
}
A Single Step in Detail
private async provisionAuth(saga: AgencyOnboardingSaga, input: ...) {
this.logger.log(`[${saga.sagaId}] PROVISION_AUTH`);
const auth = await firstValueFrom(
this.authClient.provisionAgencyAccount({
sagaId: saga.sagaId, // <-- correlation
organizationName: input.agencyName.trim(),
email: input.email.trim().toLowerCase(),
// …
}),
);
if (!auth.status || !auth.data) {
return { ok: false, failure: mapAuthProvisionFailure(auth) };
}
// Persist the IDs we will need if we have to compensate later
const updated = await this.sagaRepository.updateSaga(saga, {
authOrganizationId: Number(auth.data.organizationId),
authUserId: Number(auth.data.userId),
authUserRoleId: Number(auth.data.userRoleId),
status: AgencyOnboardingSagaStatus.AUTH_PROVISIONED,
});
return { ok: true, saga: updated, authIds: auth.data };
}
The line that does most of the work is the updateSaga call. It stores the foreign IDs returned by auth-service on the saga row, so even if the orchestrator process crashes and restarts, a recovery job can read that row and still know what to compensate.
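The article does not show the recovery job itself, so the following is only a sketch of how one might decide what to do with a saga row found after a crash. SagaRow, RecoveryAction and planRecovery are hypothetical names. The policy also assumes that the agency row and the AGENCY_CREATED status are written in the same local transaction, which the trimmed code does not confirm.

```typescript
type SagaStatus =
  | "STARTED" | "AUTH_PROVISIONED" | "AGENCY_CREATED" | "COMPLETED"
  | "COMPENSATING" | "COMPENSATED" | "FAILED";

interface SagaRow {
  sagaId: string;
  status: SagaStatus;
  authUserId: number | null;
  authOrganizationId: number | null;
}

type RecoveryAction =
  | { kind: "NONE" }
  | { kind: "MARK_COMPLETED" }
  | { kind: "MARK_FAILED" }
  | { kind: "COMPENSATE_AUTH"; sagaId: string; userId: number | null; organizationId: number | null };

function planRecovery(row: SagaRow): RecoveryAction {
  switch (row.status) {
    case "COMPLETED":
    case "FAILED":
      return { kind: "NONE" }; // terminal, nothing to do
    case "AGENCY_CREATED":
      return { kind: "MARK_COMPLETED" }; // all critical work committed, roll forward
    case "COMPENSATED":
      return { kind: "MARK_FAILED" }; // rollback finished, close the saga
    case "STARTED":
    case "AUTH_PROVISIONED":
    case "COMPENSATING":
      // Send whatever IDs were persisted. The participant can fall back to sagaId.
      return {
        kind: "COMPENSATE_AUTH",
        sagaId: row.sagaId,
        userId: row.authUserId,
        organizationId: row.authOrganizationId,
      };
  }
}
```

A STARTED row is compensated too, because the orchestrator may have crashed after auth-service committed but before the IDs were saved. Since the participant looks up its own record by sagaId, that call is harmless when nothing was provisioned.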
Habits Worth Copying
Persist after every successful step, including the IDs you'll need to undo it.
Distinguish critical vs non-critical steps. Welcome emails, audit logs and analytics events are not worth rolling a saga back for. They're best-effort.
One log line per transition, prefixed with [${sagaId}]. Grep is your debugger.
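The second habit can be made explicit in code with a small wrapper for non-critical steps. This is a sketch under assumptions: bestEffort is a hypothetical helper, not something the reference implementation is known to contain. It swallows the error, logs one line with the sagaId prefix, and reports whether the step succeeded.

```typescript
async function bestEffort(
  label: string,
  sagaId: string,
  fn: () => Promise<void>,
  log: (line: string) => void = console.warn,
): Promise<boolean> {
  try {
    await fn();
    return true;
  } catch (err) {
    // Non-critical: record the failure but never let it reach the saga's catch blocks.
    const reason = err instanceof Error ? err.message : String(err);
    log(`[${sagaId}] ${label} failed (non-critical): ${reason}`);
    return false;
  }
}
```

Wrapping the welcome email this way would look like `await bestEffort('WELCOME_EMAIL', saga.sagaId, () => sendIt())`, where sendIt stands for whatever sends the email. The saga stays COMPLETED even when the SMTP provider is having a bad day.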
10. Implementing the Participant
The participant (auth-service) wraps all of its own work in a local DB transaction. Inside that boundary it's still ACID — the saga only handles the cross-service problem.
// agency-provisioning.service.ts (trimmed)
async provisionAgencyAccount(req: ProvisionAgencyAccountInput) {
// 1. Idempotency — return the previous result if this sagaId already provisioned.
const existing = await this.provisionRecordModel.findOne({
where: { sagaId: req.sagaId },
});
if (existing) {
return serviceSuccess('Agency admin already onboarded', {
userId: Number(existing.userId),
organizationId: Number(existing.organizationId),
userRoleId: Number(existing.roleId),
});
}
// 2. Domain validation BEFORE the transaction (fail fast).
if (await this.emailExists(req.email)) {
return serviceFailure('Email already exists', { code: 'EMAIL_EXISTS' });
}
if (await this.organizationExists(req.organizationName)) {
return serviceFailure('Organization already exists', { code: 'ORGANIZATION_EXISTS' });
}
// 3. The actual work — atomic at the auth-service boundary.
return withSequelizeTransaction(this.sequelize, async (tx) => {
const org = await this.organizationModel.create({ ... }, { transaction: tx });
const user = await this.userModel.create({ ..., organizationId: org.id }, { transaction: tx });
await this.userRoleModel.create({ userId: user.id, roleId: agencyAdminRole.id }, { transaction: tx });
// The audit record that makes compensation possible later.
await this.provisionRecordModel.create(
{ sagaId: req.sagaId, organizationId: org.id, userId: user.id, roleId: agencyAdminRole.id },
{ transaction: tx },
);
return serviceSuccess('Provisioned', {
userId: user.id, organizationId: org.id, userRoleId: agencyAdminRole.id,
});
});
}
Three things make this method "saga-safe":
Idempotency check first: If the orchestrator retries (network blip, gRPC timeout), the second call is a no-op that returns the same IDs. No duplicate users.
Validation outside the transaction: Cheap reads first, expensive writes second.
One transaction wraps every write: If any insert fails, the whole thing rolls back automatically. The orchestrator sees a clean failure response and knows nothing was persisted.
The agency_provision_records table is the single most important piece of the participant. It's both the idempotency key and the compensation lookup — keyed by the same sagaId the orchestrator uses.
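The idempotency behaviour is easy to see in isolation. The sketch below replaces the database with a Map and uses hypothetical names (InMemoryAuthParticipant, ProvisionResult). It illustrates the "same sagaId, same result" contract and nothing more.

```typescript
interface ProvisionResult {
  organizationId: number;
  userId: number;
}

class InMemoryAuthParticipant {
  // Stands in for the agency_provision_records table, keyed by sagaId.
  private readonly records = new Map<string, ProvisionResult>();
  private nextId = 1;

  provisionAgencyAccount(sagaId: string): ProvisionResult {
    const existing = this.records.get(sagaId);
    if (existing) return existing; // retry: return the first call's IDs, create nothing

    const result: ProvisionResult = {
      organizationId: this.nextId++,
      userId: this.nextId++,
    };
    this.records.set(sagaId, result);
    return result;
  }

  count(): number {
    return this.records.size;
  }
}
```

In a real database two concurrent retries could both pass the lookup before either inserts. The unique constraint on sagaId is what turns that race into a failed insert rather than a duplicate user, so it is worth treating the constraint as part of the design and not as an optional index.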
11. Rollback (Compensation)
Compensation is just another gRPC call. The orchestrator sends the sagaId and the IDs it remembers. The participant deletes everything it created, in reverse dependency order, inside its own DB transaction.
On the Orchestrator Side
private async compensateAuth(saga: AgencyOnboardingSaga, errorCode?: string) {
if (!saga.authUserId && !saga.authOrganizationId) {
// Nothing was provisioned — nothing to compensate.
return;
}
// Mark the saga as compensating BEFORE the call, so the row is consistent
// even if the compensating RPC times out.
await this.sagaRepository.updateSaga(saga, {
status: AgencyOnboardingSagaStatus.COMPENSATING,
currentStep: 'COMPENSATING',
errorCode,
});
try {
const rollback = await firstValueFrom(this.authClient.compensateAgencyAccount({
sagaId: saga.sagaId,
organizationId: saga.authOrganizationId,
userId: saga.authUserId,
}));
if (!rollback.status) {
this.logger.error(`[${saga.sagaId}] Auth compensation returned failure: ${rollback.message}`);
}
} catch (err) {
this.logger.error(`[${saga.sagaId}] Auth compensation RPC failed: ${err.message}`);
}
await this.sagaRepository.updateSaga(saga, {
status: AgencyOnboardingSagaStatus.COMPENSATED,
currentStep: 'COMPENSATED',
});
}
On the Participant Side
private async rollbackProvisionedAuth(req, sagaId: string, tx: Transaction) {
// Use the saga log as the source of truth — even if the caller forgot IDs.
const record = await this.provisionRecordModel.findOne({
where: { sagaId }, transaction: tx,
});
const userId = req.userId ?? record?.userId;
const organizationId = req.organizationId ?? record?.organizationId;
if (userId) {
const user = await this.userModel.findByPk(userId, { transaction: tx, attributes: ['email'] });
await this.userRoleModel.destroy({ where: { userId }, transaction: tx });
if (user?.email) {
await this.passwordResetTokenModel.destroy({ where: { email: user.email }, transaction: tx });
}
await this.userModel.destroy({ where: { id: userId }, transaction: tx });
}
if (organizationId) {
await this.organizationModel.destroy({ where: { id: organizationId }, transaction: tx });
}
if (record) {
await record.destroy({ transaction: tx });
}
}
Rules of a Good Compensation
Reverse the order of creation: Children first (user_roles, tokens), then parents (users, organizations). The same rule you follow for DROP TABLE statements.
Be idempotent: Receiving the same sagaId twice must be safe. Every destroy is a no-op if the row is already gone.
Use the saga log, not just the request: If the caller forgets an ID or sends a partial payload, look it up by sagaId. Defence in depth.
Wrap it in a local transaction: The rollback must itself be atomic. Half-undone is worse than not-undone.
Always close the loop on the orchestrator side: Mark COMPENSATED even if the RPC failed. The failure should also be surfaced (log, metric, alert). A stuck COMPENSATING row is an operational landmine.
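The first three rules can be demonstrated without a database. In this sketch AuthStore and compensateAgencyAccount are hypothetical in-memory stand-ins: sets play the role of tables, and the function returns the list of things it actually deleted so that a second call visibly does nothing.

```typescript
interface AuthStore {
  userRoles: Set<number>; // keyed by userId for simplicity
  users: Set<number>;
  organizations: Set<number>;
  records: Map<string, { userId: number; organizationId: number }>;
}

function compensateAgencyAccount(
  store: AuthStore,
  sagaId: string,
  ids: { userId?: number; organizationId?: number } = {},
): string[] {
  // Fall back to the participant's own record when the caller sent no IDs.
  const record = store.records.get(sagaId);
  const userId = ids.userId ?? record?.userId;
  const organizationId = ids.organizationId ?? record?.organizationId;

  const deleted: string[] = [];
  if (userId !== undefined) {
    // Children before parents.
    if (store.userRoles.delete(userId)) deleted.push("user_role");
    if (store.users.delete(userId)) deleted.push("user");
  }
  if (organizationId !== undefined && store.organizations.delete(organizationId)) {
    deleted.push("organization");
  }
  if (store.records.delete(sagaId)) deleted.push("record");
  return deleted; // empty on a repeated call
}
```

Set.delete and Map.delete return false when the key is already gone, which is the in-memory equivalent of a destroy that matches zero rows.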
What Happens if the Compensation Itself Fails?
This is the worst case in any saga design. There are three reasonable strategies:
First, you can retry with exponential backoff. This works for transient failures (network, deadlocks).
Second, you can dead-letter the saga — write it to a "needs human attention" queue and alert.
Third, you can expose a manual rollback endpoint. This reference implementation does that via RollbackAgencyOnboarding gRPC, so an operator can replay compensation with the same sagaId.
A production system should combine all three. The pattern doesn't decide for you. You decide based on your business risk.
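As an illustration of how the first two strategies fit together, here is a sketch of a retry loop that falls back to a dead-letter callback. compensateWithRetry, RetryOptions and the deadLetter callback are hypothetical. The sleep function is injectable so the delays can be observed without waiting.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  sleep?: (ms: number) => Promise<void>;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function compensateWithRetry(
  sagaId: string,
  compensate: () => Promise<void>,
  deadLetter: (sagaId: string, reason: string) => void,
  opts: RetryOptions,
): Promise<boolean> {
  const sleep = opts.sleep ?? defaultSleep;
  let lastReason = "unknown";
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      await compensate();
      return true;
    } catch (err) {
      lastReason = err instanceof Error ? err.message : String(err);
      if (attempt < opts.maxAttempts) {
        // Exponential backoff: base, 2x base, 4x base, ...
        await sleep(opts.baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  // Out of attempts: hand the saga to a human instead of dropping it.
  deadLetter(sagaId, lastReason);
  return false;
}
```

Retrying is only safe because the participant's compensation is idempotent. Production code would usually add jitter and a cap on the delay, which this sketch leaves out for brevity.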
12. Tracking, Idempotency and Observability
Two tables, both keyed by the same UUID sagaId, give you full traceability across services.
Orchestrator Side — agency_onboarding_sagas
| column | purpose |
|---|---|
| sagaId (UUID, unique) | Propagated to every RPC. The join key across services. |
| status | Current state in the state machine. |
| currentStep | Human-readable label for dashboards (PROVISION_AUTH, CREATE_AGENCY…). |
| payload (JSONB) | Snapshot of the input, used for replay, debug, support. |
| authOrganizationId, authUserId, authUserRoleId | Foreign IDs needed for compensation. |
| agencyId | Set once the agency row exists. |
| errorCode, errorMessage | Filled on failure. |
| createdAt, updatedAt | Timeline for the saga. |
A real row in COMPLETED state looks roughly like this:
{
"sagaId": "0a4f3e2c-7b11-4f8d-9a2c-90b6f5f5b8a1",
"status": "COMPLETED",
"currentStep": "COMPLETED",
"agencyId": 17,
"authOrganizationId": 42,
"authUserId": 99,
"authUserRoleId": 3,
"errorCode": null,
"errorMessage": null,
"payload": { "agencyName": "Acme Education", "email": "admin@acme.com", "...": "..." },
"createdAt": "2026-05-22T10:14:32.118Z",
"updatedAt": "2026-05-22T10:14:33.412Z"
}
Participant Side — agency_provision_records
| column | purpose |
|---|---|
| sagaId (unique) | Idempotency key. The same sagaId from the orchestrator. |
| userId, organizationId, roleId | What to delete on compensation. |
| createdAt, updatedAt | Audit timestamps. |
Observability for Free
Because every log line is prefixed with [${sagaId}], a single grep across both services gives the full timeline:
[0a4f3e2c…] PROVISION_AUTH agency-service
[0a4f3e2c…] provisionAgencyAccount: ok auth-service
[0a4f3e2c…] CREATE_AGENCY agency-service
[0a4f3e2c…] Agency step failed: ... agency-service
[0a4f3e2c…] Auth compensation completed auth-service
In a structured-logging setup (Loki, Elasticsearch, Datadog) this becomes a one-click filter. The sagaId is your distributed trace.
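If your logs are already structured, the same filter is a few lines of code. The sketch below assumes a hypothetical LogEntry shape with an ISO-8601 UTC timestamp. It is not tied to any particular logging backend.

```typescript
interface LogEntry {
  ts: string; // ISO-8601 UTC timestamp, e.g. "2026-05-22T10:14:32.118Z"
  service: string;
  sagaId: string;
  message: string;
}

// Merge entries from several services into one ordered timeline for a saga.
function sagaTimeline(entries: LogEntry[], sagaId: string): string[] {
  return entries
    .filter((e) => e.sagaId === sagaId)
    // ISO-8601 strings in the same format and timezone sort chronologically as plain strings.
    .sort((a, b) => (a.ts < b.ts ? -1 : a.ts > b.ts ? 1 : 0))
    .map((e) => `[${e.sagaId}] ${e.message} (${e.service})`);
}
```

Because filter returns a new array, the sort does not reorder the caller's input.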
13. Testing a Saga
A saga is just a state machine, so the test matrix is finite and small. Cover at least these cases:
| # | Scenario | Expected end state |
|---|---|---|
| 1 | Happy path | COMPLETED, agency exists, user exists |
| 2 | Auth step fails (e.g. email exists) | FAILED, no rows on either side |
| 3 | Agency step fails | FAILED (after passing through COMPENSATING and COMPENSATED), auth rows gone, no agency |
| 4 | Compensation RPC times out | Failure logged and alerted, saga row closed rather than left in COMPENSATING, operator replays compensation with the same sagaId |
| 5 | Caller retries with the same sagaId | Second call returns the first call's result; no duplicate rows |
| 6 | Welcome email fails | COMPLETED still — non-critical step did not cascade |
Two practical tips for testing:
First, mock the gRPC client at the orchestrator level, not the network. You want to assert that compensateAgencyAccount was called with the right sagaId, not that bytes hit a socket.
Second, spin up a real Postgres in integration tests (Testcontainers, or a Docker Compose postgres service). The saga state machine is too easy to "test" against a mock and too easy to break against a real DB.
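To illustrate the first tip, here is a sketch of a hand-written fake that records compensation calls. AuthClient, FakeAuthClient and the simplified onboard function are hypothetical stand-ins for the real gRPC client and orchestrator. In a NestJS project you would more likely swap the client through the testing module, but the assertion is the same.

```typescript
interface AuthClient {
  provisionAgencyAccount(req: { sagaId: string }): Promise<{ userId: number; organizationId: number }>;
  compensateAgencyAccount(req: { sagaId: string }): Promise<void>;
}

// Fake client: no network, just records what the orchestrator asked for.
class FakeAuthClient implements AuthClient {
  compensatedSagaIds: string[] = [];

  async provisionAgencyAccount(): Promise<{ userId: number; organizationId: number }> {
    return { userId: 99, organizationId: 42 };
  }

  async compensateAgencyAccount(req: { sagaId: string }): Promise<void> {
    this.compensatedSagaIds.push(req.sagaId);
  }
}

// Heavily simplified orchestrator, only here to give the fake something to drive.
async function onboard(
  sagaId: string,
  auth: AuthClient,
  createAgency: () => Promise<void>,
): Promise<"COMPLETED" | "FAILED"> {
  await auth.provisionAgencyAccount({ sagaId });
  try {
    await createAgency();
    return "COMPLETED";
  } catch {
    await auth.compensateAgencyAccount({ sagaId });
    return "FAILED";
  }
}
```

A test for scenario 3 then forces createAgency to throw and checks that compensatedSagaIds contains exactly the sagaId that was passed in.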
14. When NOT to Use a Saga
Sagas are not free. Skip them when:
One service does all the writes. Use a regular DB transaction. Don't reinvent the wheel.
The workflow is read-only or analytical. No rollback semantics exist for a SELECT.
The "rollback" is impossible. You sent a real email. You charged a credit card and the gateway doesn't support refunds. In those cases, design forward: send an apology email, queue a manual refund. Sagas can't unsend physical actions.
You don't actually have multiple services yet. A saga in a monolith is over-engineering. Wait until the service boundary is real.
A saga adds a state table, a compensation method per step, and an operational habit of grepping by sagaId. That cost is worth paying when the alternative is orphaned data — and not before.
15. Trade-offs and Lessons Learned
Things that worked well in this design:
Synchronous orchestration is easier to debug than choreography. A new engineer reads one file and understands the whole flow.
Idempotency at the participant is non-negotiable. Retries from the orchestrator must be safe. Build it in from day one — retro-fitting is painful.
The saga table replaces tribal knowledge. Ops can answer "what happened to this signup?" with a single SQL query. The payload JSONB is gold during incidents.
sagaId as the trace key plays nicely with OpenTelemetry / Datadog / Loki, with no extra infra to set up.
Things to know before copying this pattern:
A failing compensation is the worst case. If compensateAgencyAccount itself errors, you have inconsistent state. Plan for retries + dead-letter + a manual rollback endpoint from the start.
Non-critical steps must be marked explicitly. Here, the welcome email is allowed to fail without rolling back the agency. Don't accidentally compensate over a flaky SMTP provider.
Sagas aren't a replacement for local transactions. Inside each service, still use a real DB transaction. The saga only handles the cross-service seam.
Synchronous gRPC is simple but couples availability. If auth-service is down, agency creation fails. Swap the gRPC calls for a durable message bus (RabbitMQ / Kafka) and treat each step as a command + reply when you need higher resilience.
The orchestrator becomes a critical service. Treat its uptime accordingly: monitor saga durations, alert on stuck COMPENSATING rows, and run more than one replica.
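Alerting on stuck rows can start as a very small check. The sketch below uses hypothetical names (SagaRowSummary, findStuckSagas) and works on rows already loaded into memory. In practice the same condition would be a WHERE clause on status and updatedAt.

```typescript
interface SagaRowSummary {
  sagaId: string;
  status: string;
  updatedAt: Date;
}

// Returns sagas that have been COMPENSATING for longer than maxAgeMs.
function findStuckSagas(rows: SagaRowSummary[], now: Date, maxAgeMs: number): SagaRowSummary[] {
  return rows.filter(
    (row) =>
      row.status === "COMPENSATING" &&
      now.getTime() - row.updatedAt.getTime() > maxAgeMs,
  );
}
```

The threshold is a business decision. A reasonable starting point is a small multiple of the normal compensation time, tuned once you have real saga durations to look at.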
16. Conclusion
The saga pattern isn't magic. It's a disciplined version of what experienced engineers already do by hand: commit locally, record what you did, and know how to undo it.
In Node.js with NestJS, you only need three ingredients:
A state table to track the saga.
An orchestrator that drives the workflow and writes that state.
A participant that exposes a do and an undo operation, both idempotent and keyed by sagaId.
Get those three right and your microservices can offer the same "all-or-nothing" feel as a monolithic transaction — without the operational pain of distributed locks.
Start simple, use orchestration, make every step idempotent, persist before you call, and always know how to undo. That's the whole pattern.