Md Tarikul Islam - freeCodeCamp.org

How to Implement PayPal in a Microservice Architecture Using NestJS, gRPC, and Docker

Md Tarikul Islam — Thu, 16 Jul 2026 22:56:30 +0000

In this tutorial, you'll build a production-ready PayPal payment service using NestJS microservices. Along the way, you'll learn how to isolate payment logic into its own service, communicate between services using gRPC, publish payment events with RabbitMQ, and deploy everything with Docker.

By the end, you'll have a scalable payment architecture that can be reused across multiple business domains.

Introduction
Why Use a Dedicated Payment Service?
Architecture Overview
- Payment State Machine
Prerequisites
PayPal Concepts You Need to Know
- Sandbox vs Live
- Orders API Flow (What We Use)
- Environment Variables
Project Structure
Step 1 — Create the Payment Service
Step 2 — Define the gRPC Contract
Step 3 — Implement the PayPal Service
Step 4 — Build the Payment Flow (Create → Approve → Capture)
- 4a. Create Payment
- 4b. User Approves on PayPal
- 4c. Capture Payment
Step 5 — Connect Domain Services via gRPC
- Domain Service Business Logic Example
Step 6 — Add the API Gateway Layer
Step 7 — Publish Payment Events with RabbitMQ
- Two Paths to Mark an Order as Paid
Step 8 — Database Schema and Migrations
- Production Migration Gotcha
Step 9 — Local Development Setup (Docker)
- Environment Variables (.env)
- Docker Compose (Local)
- Start Services
- Verify Health
- Test Payment Flow
Step 10 — Production Deployment
- PayPal Live Credentials
- Production .env
- Docker Compose (Production)
- Deploy Commands
- Verify Production
- Frontend Domain in Production
Step 11 — Health Checks and Monitoring
Complete Request Flow (Real Example)
Coupon Support (Optional)
PayPal Webhooks (Optional but Recommended)
Testing Checklist
Wrapping Up
Further Reading

Introduction

Payment logic doesn't belong inside every microservice. When you scatter PayPal API calls across user-service, order-service, and billing-service, you end up with:

Duplicated PayPal credentials and SDK code
Inconsistent error handling and idempotency
Hard-to-audit payment records
Painful environment switching (sandbox to live)

The solution is a dedicated payment microservice that owns all PayPal interactions. Other services call it over gRPC, and payment outcomes are broadcast over RabbitMQ so domain services can update their own data.

This guide walks you through that pattern using a real-world stack:

Layer	Technology
Payment service	NestJS
Inter-service communication	gRPC
Event bus	RabbitMQ
Database	PostgreSQL
API exposure	API Gateway (HTTP)
Containerization	Docker Compose
PayPal API	Orders v2 (Create, Approve, Capture)

Why Use a Dedicated Payment Service?

A dedicated payment service centralizes all payment-related responsibilities in one place. Instead of every microservice communicating directly with PayPal, they simply request payment operations from the payment service.

This service manages PayPal authentication, order creation, payment captures, wallet updates, ledger records, and webhook processing. Meanwhile, domain services remain focused on business logic such as student applications or subscriptions.

Domain services only need to know:

How much to charge
Who is paying
What business entity the payment is for (referenceId)
Where to redirect the user after payment (returnUrl / cancelUrl)

They do not need PayPal credentials.

Architecture Overview

Users initiate payments from the Frontend, and requests are routed through the API Gateway to the Students Service. The service uses gRPC to communicate with the Payment Service, which handles all interactions with PayPal.

Once the payment is completed, the Payment Service publishes an event to RabbitMQ, enabling the Students Service to update the payment status asynchronously.

┌────────────────────────────────────────────────────────────┐
│                     PRESENTATION LAYER                     │
├────────────────────────────────────────────────────────────┤
│ Frontend (React)                                           │
└───────────────────────┬────────────────────────────────────┘
                        │ HTTP
                        ▼

┌────────────────────────────────────────────────────────────┐
│                       GATEWAY LAYER                        │
├────────────────────────────────────────────────────────────┤
│ student-apigw                                               │
└───────────────────────┬────────────────────────────────────┘
                        │ gRPC
                        ▼

┌────────────────────────────────────────────────────────────┐
│                       DOMAIN LAYER                         │
├────────────────────────────────────────────────────────────┤
│ students-service                                            │
└───────────────────────┬────────────────────────────────────┘
                        │ gRPC
                        ▼

┌────────────────────────────────────────────────────────────┐
│                      PAYMENT LAYER                         │
├────────────────────────────────────────────────────────────┤
│ payment-service                                             │
│                                                            │
│ • Create Payment                                           │
│ • Capture Payment                                          │
│ • Wallet Management                                        │
│ • Ledger                                                   │
│ • Webhooks                                                 │
│ • Event Publishing                                         │
└──────────────┬───────────────────────┬─────────────────────┘
               │                       │
               │ REST                  │ RabbitMQ
               ▼                       ▼

      ┌───────────────┐      ┌────────────────────┐
      │    PayPal     │      │   payment_events   │
      │   Checkout    │      │       Queue        │
      └───────────────┘      └─────────┬──────────┘
                                       │
                                       ▼

                           ┌────────────────────┐
                           │ students-service   │
                           │ Event Consumer     │
                           └────────────────────┘

Payment State Machine

A payment state machine represents the lifecycle of a payment, tracking its progress from creation to completion (or failure). Each state reflects the current status of the payment, making it easier to monitor, retry, and prevent invalid operations.

NOT_STARTED → EXECUTING → SUCCESS
                      └→ FAILED

NOT_STARTED — order record created in DB
EXECUTING — PayPal order created, waiting for user approval
SUCCESS — funds captured, ledger updated, event published
FAILED — capture failed or user cancelled
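
One way to enforce these states in code is a small transition table. The sketch below is illustrative only: the status names come from the diagram above, while `ALLOWED_TRANSITIONS`, `canTransition`, and `transition` are hypothetical names, and the table assumes that only the transitions drawn in the diagram are legal.

```typescript
// Status names are taken from the state machine above.
type PaymentStatus = 'NOT_STARTED' | 'EXECUTING' | 'SUCCESS' | 'FAILED';

// Assumption: only the arrows shown in the diagram are valid.
const ALLOWED_TRANSITIONS: Record<PaymentStatus, readonly PaymentStatus[]> = {
  NOT_STARTED: ['EXECUTING'],
  EXECUTING: ['SUCCESS', 'FAILED'],
  SUCCESS: [], // terminal state
  FAILED: [], // terminal state
};

function canTransition(from: PaymentStatus, to: PaymentStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new status, or throws if the move is not allowed.
function transition(from: PaymentStatus, to: PaymentStatus): PaymentStatus {
  if (!canTransition(from, to)) {
    throw new Error(`Invalid payment transition: ${from} -> ${to}`);
  }
  return to;
}
```

Guarding every status update this way means a late or repeated capture attempt on a SUCCESS or FAILED order is rejected instead of silently overwriting the record.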

Prerequisites

Before you start, make sure you have:

Node.js 18+
Docker and Docker Compose
NestJS basics
A PayPal Developer account
Basic understanding of gRPC and message queues

PayPal Concepts You Need to Know

Before integrating PayPal, it's helpful to understand a few core concepts. PayPal provides separate environments for development and production, along with an order-based payment workflow that your application follows.

Sandbox vs Live

Environment	API Base URL	Checkout URL
Sandbox (dev)	`https://api-m.sandbox.paypal.com`	`https://www.sandbox.paypal.com/checkoutnow?token=...`
Live (prod)	`https://api-m.paypal.com`	`https://www.paypal.com/checkoutnow?token=...`

Always develop in sandbox. Switch to live only in production.

Orders API Flow (What We Use)

PayPal's Orders v2 API follows three steps:

Create Order: your backend creates an order with amount and return URLs
Approve: user is redirected to PayPal and approves payment
Capture: your backend captures the approved funds

This is different from the older Payments REST API. Orders v2 is the recommended approach for new integrations.

Environment Variables

The PayPal service reads its configuration from environment variables. This keeps sensitive credentials out of your source code and makes it easy to switch between sandbox and production environments.

PAYPAL_CLIENT_ID=your_client_id
PAYPAL_CLIENT_SECRET=your_client_secret
PAYPAL_API_BASE=https://api-m.sandbox.paypal.com   # or https://api-m.paypal.com for live

Never commit real credentials to Git. Use .env files and Docker environment injection.
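
It can help to validate these variables once at startup rather than discovering a missing secret on the first payment. The following is a minimal sketch, not the service's actual code: `loadPayPalConfig` and the `PayPalConfig` shape are hypothetical, and the sandbox default mirrors the fallback used later in `PayPalService`.

```typescript
interface PayPalConfig {
  clientId: string;
  clientSecret: string;
  apiBase: string;
  environment: 'sandbox' | 'live';
}

const SANDBOX_BASE = 'https://api-m.sandbox.paypal.com';
const LIVE_BASE = 'https://api-m.paypal.com';

// Takes a plain key/value record so it can be called with process.env.
function loadPayPalConfig(env: Record<string, string | undefined>): PayPalConfig {
  const clientId = env['PAYPAL_CLIENT_ID'];
  const clientSecret = env['PAYPAL_CLIENT_SECRET'];
  const apiBase = env['PAYPAL_API_BASE'] || SANDBOX_BASE; // default to sandbox

  if (!clientId || !clientSecret) {
    throw new Error('PAYPAL_CLIENT_ID and PAYPAL_CLIENT_SECRET must be set');
  }
  if (apiBase !== SANDBOX_BASE && apiBase !== LIVE_BASE) {
    throw new Error(`Unexpected PAYPAL_API_BASE: ${apiBase}`);
  }

  return {
    clientId,
    clientSecret,
    apiBase,
    environment: apiBase === LIVE_BASE ? 'live' : 'sandbox',
  };
}
```

In a Node.js process you would call it as `loadPayPalConfig(process.env)` during bootstrap and fail fast if it throws.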

Project Structure

apps/
├── core/
│   └── payment-service/          # Owns all PayPal logic
│       ├── src/
│       │   ├── app/payment/
│       │   │   ├── paypal/paypal.service.ts
│       │   │   ├── payment.service.ts
│       │   │   ├── payment.grpc.controller.ts
│       │   │   ├── payment.http.controller.ts
│       │   │   └── events/payment-events.publisher.ts
│       │   ├── migrations/       # DB schema
│       │   └── routes/health.routes.ts
│       └── Dockerfile
├── services/
│   └── students-service/         # Domain service example
│       └── src/app/payment/
│           ├── payment-client.service.ts      # gRPC client
│           ├── application-payment.service.ts # business logic
│           └── payment-events.consumer.ts     # RabbitMQ listener
└── gateways/
    └── student-apigw/            # HTTP API for frontend
libs/
└── shared/dto/src/lib/payment/
    └── payment.proto             # Shared gRPC contract

Step 1 — Create the Payment Service

The payment service runs two servers in one process:

Protocol	Port	Purpose
HTTP	3003	Health checks, webhooks, admin APIs
gRPC	50061	Internal service-to-service calls

The payment service exposes both an HTTP server and a gRPC server in the same NestJS application. The HTTP server handles health checks, webhooks, and external requests, while the gRPC server accepts internal requests from other microservices.


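// apps/core/payment-service/src/main.ts

import { NestFactory } from '@nestjs/core';
import { Transport } from '@nestjs/microservices';
import { join } from 'path';
import { AppModule } from './app/app.module'; // path assumed
import { healthRouter } from './routes/health.routes'; // export name assumed
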
async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  // Health route (outside /api prefix)
  app.use('/health', healthRouter);

  // gRPC microservice
  app.connectMicroservice({
    transport: Transport.GRPC,
    options: {
      package: 'payment',
      protoPath: join(process.cwd(), 'libs/shared/dto/src/lib/payment/payment.proto'),
      url: `0.0.0.0:${process.env.GRPC_PORT || '50061'}`,
    },
  });

  app.setGlobalPrefix('api');
  await app.startAllMicroservices();
  await app.listen(process.env.PORT || 3003);
}

bootstrap();

During startup, NestJS initializes both servers, allowing external clients and internal services to communicate through the appropriate protocol.

Key design choice: HTTP is for external/webhook traffic. gRPC is for fast, typed internal calls between services.

Step 2 — Define the gRPC Contract

Next, you'll create a shared .proto file that defines the gRPC contract between microservices. Because every service works from the same file, each one communicates with the payment service using the same request and response structure, regardless of the programming language.

// libs/shared/dto/src/lib/payment/payment.proto

syntax = "proto3";
package payment;

service PaymentService {
  rpc CreatePayment(CreatePaymentRequest) returns (CreatePaymentResponse) {}
  rpc CapturePayment(CapturePaymentRequest) returns (CapturePaymentResponse) {}
  rpc GetPaymentStatus(GetPaymentStatusRequest) returns (GetPaymentStatusResponse) {}
  rpc ListPayments(ListPaymentsRequest) returns (ListPaymentsResponse) {}
}

message CreatePaymentRequest {
  string checkout_id = 1;
  string payment_order_id = 2;
  string domain = 3;           // e.g. "application", "subscription"
  string reference_id = 4;     // business entity ID
  string payer_id = 5;
  string amount = 6;
  string currency = 7;
  string buyer_email = 8;
  string seller_account = 9;
  string payment_category = 10;
  string return_url = 11;      // PayPal redirect on success
  string cancel_url = 12;      // PayPal redirect on cancel
  string idempotency_key = 13;
  string metadata = 14;
  string description = 15;
}

message CreatePaymentResponse {
  int32 status = 1;
  string message = 2;
  string payment_order_id = 3;
  string paypal_order_id = 4;
  string approve_url = 5;      // Redirect user here
  string payment_order_status = 6;
}

The domain + reference_id pair lets one payment service handle payments for applications, subscriptions, university fees, and more without coupling to any single business model.
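
Because the contract carries amount as a string, it's worth checking the most important fields before a request reaches PayPal. Here's a minimal validation sketch. The function name, the reduced `CreatePaymentInput` shape, and the two-decimal amount rule are assumptions for illustration (the rule fits currencies like USD, not zero-decimal currencies).

```typescript
// A reduced, camelCase view of CreatePaymentRequest (illustrative only).
interface CreatePaymentInput {
  domain: string;
  referenceId: string;
  amount: string;
  currency: string;
  idempotencyKey: string;
}

// Returns a list of problems; an empty list means the input looks valid.
function validateCreatePayment(input: CreatePaymentInput): string[] {
  const errors: string[] = [];

  if (!input.domain) errors.push('domain is required');
  if (!input.referenceId) errors.push('reference_id is required');
  if (!input.idempotencyKey) errors.push('idempotency_key is required');

  // Assumption: two-decimal currencies, e.g. "25.00"
  if (!/^\d+\.\d{2}$/.test(input.amount) || Number(input.amount) <= 0) {
    errors.push('amount must be a positive decimal string such as "25.00"');
  }
  if (!/^[A-Z]{3}$/.test(input.currency)) {
    errors.push('currency must be a 3-letter uppercase code such as "USD"');
  }

  return errors;
}
```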

Step 3 — Implement the PayPal Service

Now, you'll create a dedicated PayPalService that wraps the PayPal REST API.

Instead of calling the PayPal API throughout the application, we encapsulate all PayPal communication inside a dedicated service. This keeps authentication, order creation, and payment capture logic centralized and easier to maintain.

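// apps/core/payment-service/src/app/payment/paypal/paypal.service.ts

import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';
import axios from 'axios';

@Injectable()
export class PayPalService {
  private accessToken: string | null = null;
  private tokenExpiresAt = 0;

  // ConfigService is assumed to come from @nestjs/config
  constructor(private readonly configService: ConfigService) {}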

  private get apiBase(): string {
    return this.configService.get('PAYPAL_API_BASE')
      || 'https://api-m.sandbox.paypal.com';
  }

  // Step 1: Get OAuth access token (cached until expiry)
  private async getAccessToken(): Promise<string> {
    const now = Date.now();
    if (this.accessToken && now < this.tokenExpiresAt) {
      return this.accessToken;
    }

    const response = await axios.post(
      `${this.apiBase}/v1/oauth2/token`,
      'grant_type=client_credentials',
      {
        auth: {
          username: this.configService.get('PAYPAL_CLIENT_ID'),
          password: this.configService.get('PAYPAL_CLIENT_SECRET'),
        },
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      }
    );

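    // Keep the token in a local so the return type stays a plain string
    const token: string = response.data.access_token;
    this.accessToken = token;
    // Refresh 60 seconds early to avoid using a token that is about to expire
    this.tokenExpiresAt = now + (response.data.expires_in - 60) * 1000;
    return token;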
  }

  // Step 2: Create PayPal checkout order
  async createOrder(input: PayPalCreateOrderInput) {
    const token = await this.getAccessToken();

    const response = await axios.post(
      `${this.apiBase}/v2/checkout/orders`,
      {
        intent: 'CAPTURE',
        purchase_units: [{
          custom_id: input.paymentOrderId,
          description: input.description,
          amount: {
            currency_code: input.currency,
            value: input.amount,
          },
        }],
        application_context: {
          return_url: input.returnUrl,
          cancel_url: input.cancelUrl,
          brand_name: 'YourApp',
          user_action: 'PAY_NOW',
        },
      },
      {
        headers: {
          Authorization: `Bearer ${token}`,
          'PayPal-Request-Id': input.idempotencyKey,
        },
      }
    );

    const paypalOrderId = response.data.id;
    const approveUrl = response.data.links
      ?.find((l: { rel: string; href: string }) => l.rel === 'approve')?.href;

    return { paypalOrderId, approveUrl };
  }

  // Step 3: Capture approved order
  async captureOrder(paypalOrderId: string) {
    const token = await this.getAccessToken();

    const response = await axios.post(
      `${this.apiBase}/v2/checkout/orders/${paypalOrderId}/capture`,
      {},
      { headers: { Authorization: `Bearer ${token}` } }
    );

    const capture = response.data.purchase_units?.[0]?.payments?.captures?.[0];
    return { status: response.data.status, captureId: capture?.id || '' };
  }
}

On startup, log configuration (with masked secrets) so you can verify sandbox vs live at a glance:

PayPal configuration check:
  PAYPAL_API_BASE: https://api-m.paypal.com
  PAYPAL_CLIENT_ID: AQb2...aq1M (80 chars)
  credentialsPresent: true
  environment: live
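
A masking helper that produces output in this shape could look like the sketch below. `maskSecret` is a hypothetical name, and showing four characters at each end is an assumption based on the sample log line above.

```typescript
// Shows only the first and last few characters plus the total length.
function maskSecret(value: string | undefined, visible = 4): string {
  if (!value) {
    return '(missing)';
  }
  // Very short values are fully hidden so nothing meaningful leaks.
  if (value.length <= visible * 2) {
    return `${'*'.repeat(value.length)} (${value.length} chars)`;
  }
  return `${value.slice(0, visible)}...${value.slice(-visible)} (${value.length} chars)`;
}
```

Logging the length alongside the masked value makes it easier to spot a truncated or whitespace-padded credential without exposing it.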

Notice that the access token is cached until it expires. This avoids requesting a new OAuth token for every payment, improving performance and reducing unnecessary API calls.
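
To see the caching rule in isolation, here's a small standalone sketch. It's not the service code: `CachedToken` is a hypothetical class, the token fetcher and the clock are injected so the behavior can be exercised without calling PayPal, and the 60-second safety margin mirrors the one used in `getAccessToken()`.

```typescript
interface TokenResponse {
  access_token: string;
  expires_in: number; // lifetime in seconds
}

class CachedToken {
  private token: string | null = null;
  private expiresAt = 0;
  private readonly fetchToken: () => Promise<TokenResponse>;
  private readonly now: () => number;

  constructor(
    fetchToken: () => Promise<TokenResponse>,
    now: () => number = () => Date.now(),
  ) {
    this.fetchToken = fetchToken;
    this.now = now;
  }

  async get(): Promise<string> {
    const current = this.now();
    // Reuse the cached token while it is still valid.
    if (this.token !== null && current < this.expiresAt) {
      return this.token;
    }
    const response = await this.fetchToken();
    this.token = response.access_token;
    // Expire 60 seconds early so an in-flight request never uses a stale token.
    this.expiresAt = current + (response.expires_in - 60) * 1000;
    return response.access_token;
  }
}
```

Injecting the clock is a testing convenience: you can advance time in a unit test and confirm that the second fetch happens only after the cached token has aged out.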

Step 4 — Build the Payment Flow (Create, Approve, Capture)

Create Payment

PaymentService.createPayment() does the following:

Checks idempotency key and returns an existing order if one is already created
Creates a payment_events checkout record
Creates a payment_orders row with status NOT_STARTED
Calls PayPalService.createOrder()
Updates order status to EXECUTING
Returns approveUrl to the caller

async createPayment(input: CreatePaymentPayload) {
  // Idempotency: prevent duplicate charges
  const existing = await this.paymentOrderModel.findOne({
    where: { idempotencyKey: input.idempotencyKey },
  });
  if (existing) return this.buildCreateResponse(existing);

  const order = await this.paymentOrderModel.create({
    paymentOrderId: input.paymentOrderId,
    amount: input.amount,
    currency: input.currency,
    paymentOrderStatus: PaymentOrderStatus.NOT_STARTED,
    domain: input.domain,
    referenceId: input.referenceId,
    // ...
  });

  const paypalOrder = await this.paypalService.createOrder({
    paymentOrderId: order.paymentOrderId,
    amount: input.amount,
    currency: input.currency,
    returnUrl: input.returnUrl,
    cancelUrl: input.cancelUrl,
    idempotencyKey: input.idempotencyKey,
  });

  await order.update({
    paymentOrderStatus: PaymentOrderStatus.EXECUTING,
    paypalOrderId: paypalOrder.paypalOrderId,
  });

  return {
    approveUrl: paypalOrder.approveUrl,
    paypalOrderId: paypalOrder.paypalOrderId,
    paymentOrderStatus: PaymentOrderStatus.EXECUTING,
  };
}
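
The idempotency check is the part of this method that prevents duplicate orders, so it's worth looking at on its own. The sketch below uses an in-memory map as a stand-in for the payment_orders table. `InMemoryOrderStore` and `createOnce` are hypothetical names. In a real database, a unique constraint on the idempotency key column (an assumption about the schema) is what closes the gap between the lookup and the insert when two requests arrive at the same time.

```typescript
// In-memory stand-in for the payment_orders table (illustration only).
interface PaymentOrder {
  paymentOrderId: string;
  idempotencyKey: string;
  amount: string;
}

class InMemoryOrderStore {
  private ordersByKey = new Map<string, PaymentOrder>();

  // Returns the existing order for a key, or stores and returns the new one.
  createOnce(candidate: PaymentOrder): { order: PaymentOrder; created: boolean } {
    const existing = this.ordersByKey.get(candidate.idempotencyKey);
    if (existing !== undefined) {
      return { order: existing, created: false };
    }
    this.ordersByKey.set(candidate.idempotencyKey, candidate);
    return { order: candidate, created: true };
  }
}
```

When `created` is false, the caller should return the stored order's response and skip the PayPal call entirely.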

User Approves on PayPal

The frontend redirects the user to approveUrl. PayPal handles authentication and approval, then redirects back to your returnUrl.

Capture Payment

After approval, call capturePayment() with either paymentOrderId or paypalOrderId:

async capturePayment(paymentOrderId?: string, paypalOrderId?: string) {
  const order = await this.findOrder(paymentOrderId, paypalOrderId);

  if (order.paymentOrderStatus === PaymentOrderStatus.SUCCESS) {
    return this.buildCaptureResponse(order); // already captured
  }

  const capture = await this.paypalService.captureOrder(order.paypalOrderId);

  if (capture.status !== 'COMPLETED') {
    throw new Error(`PayPal capture status: ${capture.status}`);
  }

  await this.finalizeSuccessfulPayment(order, capture.captureId);
  return this.buildCaptureResponse(order);
}

finalizeSuccessfulPayment() runs in a database transaction:

Updates order status to SUCCESS
Updates seller wallet balance
Creates ledger entries (audit trail)
Marks the checkout event as done
Publishes a payment.{domain}.completed event to RabbitMQ
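
To make the wallet and ledger steps more concrete, here's a simplified in-memory sketch. Everything in it is an assumption for illustration: the `Books` shape, the paired DEBIT/CREDIT entries, and amounts stored as integer cents. The real service would perform the equivalent writes inside the database transaction described above.

```typescript
interface LedgerEntry {
  paymentOrderId: string;
  account: string;
  direction: 'DEBIT' | 'CREDIT';
  amountCents: number;
}

interface Books {
  wallets: Map<string, number>; // seller account -> balance in cents
  ledger: LedgerEntry[];
}

// Credits the seller wallet and records one balanced pair of ledger entries.
function recordCapture(
  books: Books,
  paymentOrderId: string,
  buyerAccount: string,
  sellerAccount: string,
  amountCents: number,
): void {
  if (!Number.isInteger(amountCents) || amountCents <= 0) {
    throw new Error('amountCents must be a positive integer');
  }
  const currentBalance = books.wallets.get(sellerAccount) || 0;
  books.wallets.set(sellerAccount, currentBalance + amountCents);
  books.ledger.push(
    { paymentOrderId, account: buyerAccount, direction: 'DEBIT', amountCents },
    { paymentOrderId, account: sellerAccount, direction: 'CREDIT', amountCents },
  );
}
```

Storing amounts as integer cents avoids floating-point rounding, and writing entries in balanced pairs lets an audit confirm that total debits equal total credits.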

Step 5 — Connect Domain Services via gRPC

Domain services (like students-service) never talk to PayPal directly. Instead, the Students Service uses a gRPC client to invoke the strongly typed remote procedures exposed by the payment service:

// apps/services/students-service/src/app/payment/payment.module.ts

ClientsModule.registerAsync([{
  name: 'PAYMENT_SERVICE',
  useFactory: () => ({
    transport: Transport.GRPC,
    options: {
      package: 'payment',
      protoPath: 'libs/shared/dto/src/lib/payment/payment.proto',
      url: process.env.PAYMENT_SERVICE_URL || 'payment-service:50061',
    },
  }),
}])

// payment-client.service.ts

@Injectable()
export class PaymentClientService implements OnModuleInit {
  private paymentService: PaymentGrpcService;

  constructor(@Inject('PAYMENT_SERVICE') private client: ClientGrpc) {}

  onModuleInit() {
    this.paymentService = this.client.getService('PaymentService');
  }

  async createPayment(data: CreatePaymentRequest) {
    return firstValueFrom(this.paymentService.CreatePayment(data));
  }

  async capturePayment(data: { payment_order_id?: string; paypal_order_id?: string }) {
    return firstValueFrom(this.paymentService.CapturePayment(data));
  }
}

Domain Service Business Logic Example

This example shows how a domain service prepares business-specific data before delegating payment processing to the Payment Service.

// application-payment.service.ts

async initiateTuitionPayment(applicationId: number, options: { domain: string }) {
  const application = await this.applicationModel.findByPk(applicationId);

  // Build PayPal return URLs from frontend domain
  const frontendBase = options.domain; // e.g. https://crm.yourapp.com
  const returnUrl = `${frontendBase}/payment/successful?applicationId=${application.applicationId}`;
  const cancelUrl = `${frontendBase}/payment/failure?applicationId=${application.applicationId}`;

  const result = await this.paymentClient.createPayment({
    checkout_id: `checkout-app-${application.id}`,
    payment_order_id: uuidv4(),
    domain: 'application',
    reference_id: String(application.id),
    payer_id: application.studentId,
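    amount: finalAmount.toFixed(2), // finalAmount: resolved amount after any coupon (computation omitted here)
    currency: 'USD',
    buyer_email: buyerEmail, // resolved elsewhere (omitted here)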
    seller_account: `university-${application.universityId}`,
    payment_category: 'tuition_deposit',
    return_url: returnUrl,
    cancel_url: cancelUrl,
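    // Note: the random suffix makes every call a new payment attempt.
    // For retry-safe deduplication, reuse the same key when retrying a request.
    idempotency_key: `app-${application.id}-tuition-${uuidv4()}`,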
  });

  return {
    approveUrl: result.approve_url,
    paypalOrderId: result.paypal_order_id,
    paymentOrderId: result.payment_order_id,
  };
}

The domain service remains responsible for business rules, while the payment service handles the payment workflow itself.

Important: The frontend must send its own origin as domain so return URLs point to the correct environment (localhost in dev, production URL in prod).
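
Since the domain value arrives from the client, it's a good idea to check it against a known list before building redirect URLs from it. The sketch below shows one way to do that. `buildReturnUrls` and `ALLOWED_ORIGINS` are hypothetical, and the allowlist simply reuses the two example origins from this guide.

```typescript
// Hypothetical allowlist: the origins used as examples in this guide.
const ALLOWED_ORIGINS = ['http://localhost:3000', 'https://crm.yourapp.com'];

function buildReturnUrls(
  domain: string,
  applicationId: string | number,
): { returnUrl: string; cancelUrl: string } {
  // Normalize a trailing slash so "https://host/" and "https://host" match.
  const origin = domain.replace(/\/+$/, '');
  if (ALLOWED_ORIGINS.indexOf(origin) === -1) {
    throw new Error(`Origin not allowed: ${domain}`);
  }
  const id = encodeURIComponent(String(applicationId));
  return {
    returnUrl: `${origin}/payment/successful?applicationId=${id}`,
    cancelUrl: `${origin}/payment/failure?applicationId=${id}`,
  };
}
```

Rejecting unknown origins keeps a tampered request from sending users to an attacker-controlled page after they approve a payment.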

Step 6 — Add the API Gateway Layer

The API gateway exposes HTTP endpoints to the frontend and forwards requests to the domain services:

POST /applications/:id/pay/applicationfee
Body: { "domain": "https://crm.yourapp.com", "couponCode": "SAVE10" }

// student-apigw → students-service (gRPC) → payment-service (gRPC) → PayPal

Gateway responsibilities:

Authentication (JWT)
Request validation
No PayPal credentials

The capture endpoint, called after the PayPal redirect:

POST /applications/:id/pay/applicationfee/capture
Body: { "paypalOrderId": "PAYPAL_ORDER_ID_FROM_URL" }
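
When PayPal redirects the buyer back to your return URL, it appends the order ID as a `token` query parameter (along with `PayerID`). The frontend can read it and build the capture request body. The helper names below are hypothetical, and the query string is parsed by hand only to keep the sketch dependency-free. In a browser you would typically use `URLSearchParams` instead.

```typescript
// Reads a single query parameter from a URL string.
function getQueryParam(url: string, name: string): string | null {
  const hashStart = url.indexOf('#');
  const end = hashStart === -1 ? url.length : hashStart;
  const queryStart = url.indexOf('?');
  if (queryStart === -1 || queryStart > end) {
    return null;
  }
  const query = url.slice(queryStart + 1, end);
  for (const pair of query.split('&')) {
    const eq = pair.indexOf('=');
    const key = decodeURIComponent(eq === -1 ? pair : pair.slice(0, eq));
    if (key === name) {
      return eq === -1 ? '' : decodeURIComponent(pair.slice(eq + 1));
    }
  }
  return null;
}

// Builds the body for POST /applications/:id/pay/applicationfee/capture.
function buildCaptureBody(returnedUrl: string): { paypalOrderId: string } {
  const token = getQueryParam(returnedUrl, 'token');
  if (!token) {
    throw new Error('PayPal order ID (token) is missing from the return URL');
  }
  return { paypalOrderId: token };
}
```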

Step 7 — Publish Payment Events with RabbitMQ

RabbitMQ enables asynchronous communication between services. Instead of waiting for every service to finish processing after a payment succeeds, the payment service simply publishes an event and lets interested services handle it independently.

After a successful capture, the payment service publishes an event:

// payment-events.publisher.ts

async publishCompleted(event: PaymentCompletedEvent) {
  const pattern = `payment.${event.domain}.completed`; // e.g. payment.application.completed
  this.eventsClient.emit(pattern, { ...event, eventId: uuidv4() });
}

Each domain service subscribes to payment events that are relevant to its business domain. For example, the Students Service listens for payment.application.completed events so it can mark student applications as paid.

// payment-events.consumer.ts (students-service)

@EventPattern('payment.application.completed')
async handlePaymentCompleted(@Payload() data: PaymentCompletedPayload) {
  await this.applicationPaymentService.handlePaymentCompletedEvent(data);
  // Marks application as PAID, records payment history
}

This decouples payment completion from domain updates. If students-service is temporarily down, the events stay in the queue (assuming it's durable) and are processed once the consumer is back up.

Two Paths to Mark an Order as Paid

Path	When used
Synchronous capture	Frontend calls capture API after PayPal redirect
Async event	RabbitMQ consumer updates domain state after payment service publishes event

Using both paths, with idempotent handlers, gives you reliability: the synchronous path gives the user immediate feedback, and the asynchronous path acts as a safety net if the synchronous update never happens.
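
For both paths to coexist, the domain update has to be safe to run twice. Here's a minimal sketch of such a handler. `markPaid`, `ApplicationRecord`, and the `paidVia` field are hypothetical names used only to illustrate the idea.

```typescript
type PaymentSource = 'sync-capture' | 'event';

interface ApplicationRecord {
  id: number;
  paymentStatus: 'UNPAID' | 'PAID';
  paidVia?: PaymentSource;
}

// Returns true only when this call actually changed the record.
function markPaid(app: ApplicationRecord, source: PaymentSource): boolean {
  if (app.paymentStatus === 'PAID') {
    return false; // the other path got here first; nothing to do
  }
  app.paymentStatus = 'PAID';
  app.paidVia = source;
  return true;
}
```

Whichever path arrives first wins, and the second call becomes a no-op, so side effects such as payment history rows or confirmation emails can be tied to the `true` result.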

Step 8 — Database Schema and Migrations

The payment service maintains its own database schema. Each table has a specific responsibility, allowing payment records, financial transactions, and webhook processing to remain isolated from other business services.

Table	Purpose
`payment_events`	Checkout session (buyer/seller info)
`payment_orders`	Individual payment attempts with PayPal IDs
`ledger_entries`	Financial audit trail
`wallets`	Seller balance tracking
`processed_webhooks`	Webhook deduplication
`coupons` / `coupon_redemptions`	Discount codes (optional)
`sequelize_meta`	Migration tracking

Production Migration Gotcha

In production Docker images, migration .ts files are not available unless you compile them to JavaScript and copy them into the image:

# Dockerfile — compile migrations for production
RUN pnpm exec tsc --project apps/core/payment-service/tsconfig.migrations.json
COPY --from=builder /app/dist/apps/core/payment-service/migrations ./migrations

Without this, you'll see Executed 0 migrations in logs and no tables will be created.

Create the database user before first deploy:

CREATE USER payment_user WITH PASSWORD 'payment_pass';
CREATE DATABASE payment_db;
GRANT ALL PRIVILEGES ON DATABASE payment_db TO payment_user;

Step 9 — Local Development Setup (Docker)

In this section, we'll configure the payment service for local development using Docker Compose. This setup provides a complete environment for testing payments without deploying to production.

Environment Variables (`.env`)

PAYPAL_CLIENT_ID=your_sandbox_client_id
PAYPAL_CLIENT_SECRET=your_sandbox_client_secret
PAYPAL_API_BASE=https://api-m.sandbox.paypal.com

Docker Compose (local)

The following configuration starts the payment service together with its required dependencies, including PostgreSQL and RabbitMQ.

payment-service:
  build:
    dockerfile: apps/core/payment-service/Dockerfile.dev
  ports:
    - '3003:3003'    # HTTP
    - '50061:50061'  # gRPC
  environment:
    - PAYPAL_API_BASE=https://api-m.sandbox.paypal.com
    - PAYPAL_CLIENT_ID=${PAYPAL_CLIENT_ID}
    - PAYPAL_CLIENT_SECRET=${PAYPAL_CLIENT_SECRET}
    - DB_HOST=postgres
    - DB_NAME=payment_db
    - DB_USER=payment_user
    - DB_PASSWORD=payment_pass
    - RABBITMQ_URL=amqp://rabbitmq:5672

students-service:
  environment:
    - PAYMENT_SERVICE_URL=payment-service:50061
  depends_on:
    payment-service:
      condition: service_healthy

Start Services

Once the configuration is complete, start the containers and verify that every service is running correctly before testing the payment flow.

docker compose up -d payment-service students-service student-apigw

Verify Health

curl http://localhost:3003/health
# {"status":"healthy","service":"payment-service",...}

Test Payment Flow

Call POST /applications/:id/pay/applicationfee with { "domain": "http://localhost:3000" }
Open the returned approveUrl in a browser
Log in with a PayPal sandbox buyer account
After approval, call POST /applications/:id/pay/applicationfee/capture with the paypalOrderId
Confirm application status is PAID

Step 10 — Production Deployment

After verifying everything locally, the next step is deploying the payment service to production. The main differences are using PayPal Live credentials, production environment variables, and production-ready Docker images.

PayPal Live Credentials

Go to PayPal Developer Dashboard → Live apps
Create a Live REST API app
Copy Client ID and Secret

Production `.env` (on Server — Never Commit)

PAYPAL_CLIENT_ID=your_live_client_id
PAYPAL_CLIENT_SECRET=your_live_secret
PAYPAL_API_BASE=https://api-m.paypal.com

Docker Compose (Production)

payment-service:
  build:
    dockerfile: apps/core/payment-service/Dockerfile
  environment:
    - NODE_ENV=production
    - PAYPAL_API_BASE=${PAYPAL_API_BASE:-https://api-m.paypal.com}
    - PAYPAL_CLIENT_ID=${PAYPAL_CLIENT_ID}
    - PAYPAL_CLIENT_SECRET=${PAYPAL_CLIENT_SECRET}
    - DB_HOST=${DB_HOST}
    - DB_NAME=payment_db
    - DB_USER=payment_user
    - DB_PASSWORD=${DB_PASSWORD}   # inject from the server's .env instead of hardcoding
    - RABBITMQ_URL=amqp://${RABBITMQ_USER}:${RABBITMQ_PASS}@rabbitmq:5672
  labels:
    - 'traefik.http.routers.payment.rule=Host(`payment-service.yourapp.com`)'

students-service:
  environment:
    - PAYMENT_SERVICE_URL=payment-service:50061
  depends_on:
    payment-service:
      condition: service_healthy

Deploy Commands

docker compose -f docker-compose.prod.yml build --no-cache payment-service
docker compose -f docker-compose.prod.yml up -d payment-service students-service

Verify Production

curl https://payment-service.yourapp.com/health

docker logs -f apply-goal-payment-service
# Look for:
#   environment: live
#   Found 8 pending migrations
#   Executed 8 migrations

Frontend Domain in Production

The frontend must send the production CRM URL when initiating payment:

{ "domain": "https://crm.yourapp.com" }

Not localhost. This controls where PayPal redirects after payment.

Step 11 — Health Checks and Monitoring

Health checks allow orchestration tools such as Docker and Traefik to verify that the payment service is running correctly. Monitoring these endpoints helps detect failures early and improves application reliability.

// GET /health
{ "status": "healthy", "service": "payment-service", "timestamp": "...", "version": "1.0.0" }

Used by:

Docker HEALTHCHECK
Traefik load balancer
Uptime monitoring

PayPal credential check runs on startup via PayPalService.logConfiguration().

Complete Request Flow (Real Example)

Scenario: A student pays the tuition fee for a university application.

1. Frontend
   POST /applications/42/pay/applicationfee
   Body: { "domain": "https://crm.yourapp.com" }
        │
        ▼
2. student-apigw (HTTP → gRPC)
   InitiateApplicationTuitionPayment(applicationId: 42)
        │
        ▼
3. students-service
   - Validates application not already paid
   - Resolves tuition amount
   - Optionally validates coupon via payment-service gRPC
   - Builds returnUrl / cancelUrl from domain
   - Calls payment-service CreatePayment (gRPC)
        │
        ▼
4. payment-service
   - Creates payment_orders record (EXECUTING)
   - Calls PayPal POST /v2/checkout/orders
   - Returns approveUrl
        │
        ▼
5. Frontend redirects user to approveUrl (PayPal checkout)
        │
        ▼
6. User approves → PayPal redirects to returnUrl
        │
        ▼
7. Frontend
   POST /applications/42/pay/applicationfee/capture
   Body: { "paypalOrderId": "PAYPAL_ORDER_ID" }
        │
        ▼
8. payment-service
   - POST /v2/checkout/orders/{id}/capture
   - Updates order → SUCCESS
   - Updates wallet + ledger
   - Publishes payment.application.completed → RabbitMQ
        │
        ▼
9. students-service (event consumer)
   - Marks application paymentStatus = PAID
   - Records payment in application_payments table

Coupon Support (Optional)

Before creating a PayPal order, validate a coupon via gRPC:

const validation = await this.paymentClient.validateCoupon({
  code: 'SAVE20',
  universityId: application.universityId,
  originalAmount: 500,
  paymentType: 'application_fee',
});

const finalAmount = validation.data.finalAmount;

// If coupon covers 100% — skip PayPal entirely
if (finalAmount <= 0) {
  await this.markApplicationPaid(applicationId, { amount: 0, source: 'coupon' });
  return { paymentOrderStatus: 'COMPLETED' };
}

Coupon logic lives in payment-service so discount rules are centralized.
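
The discount arithmetic itself is simple, but it's easy to get rounding wrong when working with decimals. The sketch below assumes a percentage-based coupon and integer cents. The `Coupon` shape, `applyCoupon`, and `centsToAmount` are hypothetical and are not the payment service's actual coupon model.

```typescript
interface Coupon {
  code: string;
  percentOff: number; // 0 to 100
}

// Computes the amount left to pay and whether PayPal can be skipped.
function applyCoupon(
  originalCents: number,
  coupon: Coupon | null,
): { finalCents: number; skipPayPal: boolean } {
  if (coupon === null) {
    return { finalCents: originalCents, skipPayPal: originalCents <= 0 };
  }
  // Clamp the percentage so a bad coupon can never produce a negative total.
  const percent = Math.min(Math.max(coupon.percentOff, 0), 100);
  const discountCents = Math.round((originalCents * percent) / 100);
  const finalCents = Math.max(originalCents - discountCents, 0);
  return { finalCents, skipPayPal: finalCents <= 0 };
}

// Converts cents to the decimal string format used in the gRPC contract.
function centsToAmount(cents: number): string {
  return (cents / 100).toFixed(2);
}
```

For example, a 20% coupon on a 500.00 fee leaves 400.00 to pay through PayPal, while a 100% coupon sets `skipPayPal` to true.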

PayPal Webhooks (Optional but Recommended)

Point your PayPal app's webhook at the payment service's HTTP endpoint:

https://payment-service.yourapp.com/api/v1/payments/webhooks/paypal

The payment service handles:

Event	Action
`CHECKOUT.ORDER.APPROVED`	Auto-capture the order
`PAYMENT.CAPTURE.COMPLETED`	Finalize payment if not already done

Webhook events are deduplicated via the processed_webhooks table to prevent double-processing.
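
The routing and deduplication logic can be sketched as follows. This is illustrative only: `WebhookRouter` is a hypothetical class, an in-memory set stands in for the processed_webhooks table, and the sketch relies on the `id` and `event_type` fields that PayPal includes in webhook payloads. A real handler should also verify the webhook signature before trusting the payload.

```typescript
type WebhookAction = 'CAPTURE_ORDER' | 'FINALIZE_PAYMENT' | 'IGNORE' | 'DUPLICATE';

interface PayPalWebhookEvent {
  id: string; // unique ID of the webhook event
  event_type: string;
}

class WebhookRouter {
  // Stand-in for the processed_webhooks table.
  private processedIds = new Set<string>();

  decide(event: PayPalWebhookEvent): WebhookAction {
    if (this.processedIds.has(event.id)) {
      return 'DUPLICATE'; // PayPal may deliver the same event more than once
    }
    this.processedIds.add(event.id);

    switch (event.event_type) {
      case 'CHECKOUT.ORDER.APPROVED':
        return 'CAPTURE_ORDER';
      case 'PAYMENT.CAPTURE.COMPLETED':
        return 'FINALIZE_PAYMENT';
      default:
        return 'IGNORE';
    }
  }
}
```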

Testing Checklist

[ ] GET /health returns 200
[ ] PayPal logs show credentialsPresent: true
[ ] Database tables exist after startup (payment_orders, payment_events, etc.)
[ ] Create payment returns valid approveUrl
[ ] Sandbox buyer can complete checkout
[ ] Capture returns payment_order_status: SUCCESS
[ ] Application marked as PAID in domain service
[ ] RabbitMQ event payment.application.completed is consumed
[ ] Duplicate capture is handled gracefully (idempotent)
[ ] Coupon 100% discount skips PayPal
[ ] Production uses https://api-m.paypal.com (live)

Wrapping Up

Integrating PayPal in a microservice architecture comes down to a few principles:

One payment service owns all PayPal API calls
gRPC connects domain services to the payment service internally
RabbitMQ broadcasts payment outcomes so domain services stay decoupled
Idempotency keys prevent duplicate charges
Environment variables switch between sandbox and live — no code changes
Migrations must be compiled for production Docker images
Frontend sends domain so return URLs work in every environment
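
As an illustration of the sandbox/live switch, here is a minimal sketch. The variable name `PAYPAL_MODE` and the helper `resolvePayPalBaseUrl` are assumptions made for this example, not necessarily the names used in the service; the two hosts are PayPal's REST API endpoints for each environment.

```typescript
type PayPalMode = "sandbox" | "live";

// PayPal REST API hosts for each environment.
const PAYPAL_BASE_URLS: Record<PayPalMode, string> = {
  sandbox: "https://api-m.sandbox.paypal.com",
  live: "https://api-m.paypal.com",
};

// PAYPAL_MODE is an assumed variable name. Anything other than "live"
// falls back to sandbox, so a missing value never targets real money.
function resolvePayPalBaseUrl(env: Record<string, string | undefined>): string {
  const mode: PayPalMode = env["PAYPAL_MODE"] === "live" ? "live" : "sandbox";
  return PAYPAL_BASE_URLS[mode];
}
```

In a Node.js service you would call it as `resolvePayPalBaseUrl(process.env)` once at startup and pass the result to the PayPal client.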

This pattern scales: add a new payment type (subscription, agency fee, university service fee) by sending a different domain and payment_category — no changes to PayPal integration code.

The Saga Pattern in Node.js: How to Roll Back Distributed Transactions Across Microservices

Md Tarikul Islam — Sat, 13 Jun 2026 06:45:43 +0000

Building reliable workflows across multiple microservices is challenging. In a monolith, a database transaction can ensure that multiple operations either succeed or fail together. But once data is spread across different services and databases, that guarantee disappears.

This is where the Saga Pattern comes in. Instead of using distributed transactions, a saga coordinates a sequence of local transactions and runs compensation actions when something goes wrong.

In this article, we'll build an orchestrated Saga Pattern using NestJS, gRPC, PostgreSQL, and Sequelize. You'll learn how to coordinate work across services, implement compensation-based rollbacks, handle idempotency, and track workflow progress in a production-style microservice architecture.

Prerequisites
1. Introduction
2. The Problem in One Picture
3. Why You Need a Saga
4. Choreography vs Orchestration
- Choreography
- Orchestration
5. The Example Project
6. Architecture
7. The Saga Flow, Step by Step
8. The State Machine
9. Implementing the Orchestrator
10. Implementing the Participant
11. Rollback (Compensation)
12. Tracking, Idempotency and Observability
13. Testing a Saga
14. When NOT to Use a Saga
15. Trade-offs and Lessons Learned
16. Conclusion

Prerequisites

This article assumes you're already familiar with some backend development concepts. You don't need prior experience with the Saga Pattern, but you should be comfortable with:

JavaScript, TypeScript, Node.js
NestJS fundamentals (controllers, services, dependency injection)
Basic PostgreSQL concepts
Database transactions
Docker (recommended for local development)
Microservice architecture basics
gRPC fundamentals (helpful but not required)

If you've already built a few backend services with NestJS and PostgreSQL, you'll have everything you need to follow this guide.

1. Introduction

A saga is a sequence of local transactions across multiple services. Each step commits its own database transaction. If a later step fails, the saga runs compensating transactions to semantically undo the work already committed.

The pattern was first described by Hector Garcia-Molina and Kenneth Salem in 1987 for long-lived database transactions. It was rediscovered a decade ago when companies started splitting monoliths into microservices and realised that the database transaction — the single most powerful tool in a backend developer's belt — stops working at the service boundary.

This article walks through an orchestrated saga in Node.js (NestJS + gRPC) for onboarding an agency, where two services must agree on a single business outcome:

agency-service — owns the agency record.
auth-service — owns the organization, user and role.

If either side fails, the system must end up as if nothing ever happened. No half-created users, orphan organizations, or 3am Slack threads.

2. The Problem in One Picture

Here's the bug a saga is built to prevent:

Step 1: auth-service     ✅ creates Organization #42
Step 2: auth-service     ✅ creates User #99
Step 3: agency-service   ❌ fails (DB down, validation, network blip…)

Result without a saga:
   Organization #42 and User #99 still exist.
   There is no Agency row.
   The user can log in but has nothing to manage.
   Support gets a ticket. Engineer writes a one-off SQL cleanup.
   Repeat every week.

The saga's job is to detect that step 3 failed and explicitly delete Organization #42 and User #99, so the system is consistent again — even though those rows live in a different service's database.

3. Why You Need a Saga

In a monolith, you wrap everything in one DB transaction and let the database handle atomicity:

await sequelize.transaction(async (tx) => {
  await Organization.create({...}, { transaction: tx });
  await User.create({...}, { transaction: tx });
  await Agency.create({...}, { transaction: tx });
});

In microservices, each service has its own database. You can't wrap two services in one ACID transaction. The classic alternatives all have problems:

Option	Problem
Two-Phase Commit (2PC)	Locks rows across services, coordinator is a single point of failure, and doesn't scale. Most modern databases don't support it well across HTTP/gRPC.
"Just hope it works"	Leaves orphan users / billing rows when half the flow fails. Real data corruption — and the longer the system runs, the more orphans accumulate.
Manual cleanup scripts	Works for a week. Bugs hide for months. New engineers don't know they exist.
Eventual consistency without compensation	Fine for some domains (analytics) but completely wrong for billing, identity, or anything with money.
Saga pattern	Each service commits locally. The orchestrator owns the workflow and runs explicit compensation on failure. It's auditable, restartable, and reasonable.

The saga gives you eventual consistency with a clear, auditable rollback path — without distributed locks.
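
Before looking at the real services, the core idea can be sketched in a few lines. This is a generic illustration, not the article's orchestrator: `SagaStep` and `runSaga` are hypothetical names, and the sketch keeps no durable state, which the real implementation adds later.

```typescript
// One saga step: a local action plus the compensation that undoes it.
interface SagaStep {
  name: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Runs steps in order. On failure, compensates the steps that already
// completed, in reverse order, and reports which step failed.
async function runSaga(
  steps: SagaStep[],
): Promise<{ ok: true } | { ok: false; failedStep: string }> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
    } catch {
      for (const done of completed.reverse()) {
        await done.compensate();
      }
      return { ok: false, failedStep: step.name };
    }
  }
  return { ok: true };
}
```

The failed step itself is not compensated, because its own local transaction is assumed to have rolled back.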

4. Choreography vs Orchestration

There are two ways to implement a saga:

Choreography

With choreography, services emit events and other services subscribe and react.

auth-service → emits "UserCreated"
agency-service → listens, creates agency, emits "AgencyCreated"
billing-service → listens, creates subscription…

It's simple at first, but brittle later. The workflow is scattered across N codebases. Nobody owns it. Debugging means tracing events across logs. Adding a step means changing several services.

Orchestration

With orchestration, one service is the conductor. It calls the others in order.

orchestrator:
   1. authClient.provisionAccount(...)
   2. agencyRepo.create(...)
   3. authClient.sendWelcomeEmail(...)

There's slightly more coupling here (the orchestrator imports clients), but the entire workflow lives in one file. Onboarding new engineers becomes a one-hour task. Adding a step is a single PR.

Pick orchestration unless you have a strong reason not to. This article — and the reference implementation — uses orchestration.

5. The Example Project

Our goal here is to create an Agency in the system. This is the moment a new B2B customer signs up.

It requires two services to agree on a single outcome:

auth-service must create:

an Organization row (the tenant)
a User row (the agency admin who will log in)
a UserRole row linking the user to the AGENCY_ADMIN role

agency-service must create:

an Agency row containing business details (size, registration number, website, branches…), linked to the user/organization above

These rows have foreign-key relationships within a service, but not across services — Postgres can't enforce that the user in auth's DB matches the authUserId in agency's DB. The application has to do it.

auth-service DB                    agency-service DB
─────────────────                  ─────────────────
organizations  ◄────────┐
   │                    │
   │ (1:1)              │   foreign reference (no FK)
   ▼                    │           agencies
users  ──────► user_roles                     ─ authUserId
                                              └ authOrganizationId

If step 2 fails after step 1 succeeded, we end up with a user who can authenticate but has no agency, which is the exact bug from Section 2. That's what the saga prevents.

6. Architecture

                     ┌───────────────────────────────┐
                     │        API Gateway            │
                     └──────────────┬────────────────┘
                                    │ HTTP
                                    ▼
   ┌──────────────────────────────────────────────────┐
   │              agency-service                      │
   │   ┌─────────────────────────────────────────┐    │
   │   │   AgencyOnboardingOrchestrator (SAGA)   │    │
   │   └───────────────┬─────────────────────────┘    │
   │                   │ writes state                 │
   │                   ▼                              │
   │      agency_onboarding_sagas  (Postgres)         │
   └───────────────┬─────────────────┬────────────────┘
                   │ gRPC            │ gRPC
       provisionAgencyAccount   compensateAgencyAccount
                   │                 │
                   ▼                 ▼
   ┌──────────────────────────────────────────────────┐
   │              auth-service                        │
   │   AgencyProvisioningService  (Participant)       │
   │                                                  │
   │   organizations · users · user_roles             │
   │   agency_provision_records  ← idempotency log    │
   └──────────────────────────────────────────────────┘

Three components do all the work:

AgencyOnboardingOrchestrator in agency-service — drives the workflow.
agency_onboarding_sagas table in agency-service — the durable log of the saga's progress.
AgencyProvisioningService in auth-service — exposes a do operation (provisionAgencyAccount) and an undo operation (compensateAgencyAccount). It's backed by its own agency_provision_records idempotency table.

The orchestrator never reaches into the auth database directly. The boundary is enforced by gRPC.

7. The Saga Flow, Step by Step

The sequence diagram below shows the complete lifecycle of the onboarding saga. The workflow begins when a client sends a request to create a new agency. The orchestrator first creates a saga record in its database and marks it as STARTED, giving it a durable record of the workflow before any business action takes place.

The orchestrator then asks auth-service to provision the organization, user, and role. Once that succeeds, it creates the agency record in its own database.

If every step succeeds, the saga reaches the COMPLETED state. If the agency creation fails after the auth resources have already been created, the orchestrator triggers a compensation step that instructs auth-service to remove everything it previously provisioned.

The key idea is that each service commits its own local transaction, while the saga coordinates the overall business workflow and ensures the system can return to a consistent state when failures occur.

sequenceDiagram
    autonumber
    participant C as Client
    participant AS as agency-service Orchestrator
    participant DB1 as saga store
    participant AU as auth-service
    participant DB2 as auth DB

    C->>AS: POST /agencies
    AS->>DB1: INSERT saga (STARTED, payload)
    AS->>AU: provisionAgencyAccount(sagaId, …)
    AU->>DB2: BEGIN TX
    AU->>DB2: create org + user + role + provision_record
    AU->>DB2: COMMIT
    AU-->>AS: { userId, organizationId, userRoleId }
    AS->>DB1: UPDATE saga (AUTH_PROVISIONED)
    AS->>AS: create Agency row
    alt Agency row OK
        AS->>DB1: UPDATE saga (AGENCY_CREATED → COMPLETED)
        AS->>AU: sendAgencyWelcomeEmail (non-critical)
        AS-->>C: 200 OK + sagaId
    else Agency row fails
        AS->>DB1: UPDATE saga (COMPENSATING)
        AS->>AU: compensateAgencyAccount(sagaId)
        AU->>DB2: BEGIN TX
        AU->>DB2: delete role + token + user + org + record
        AU->>DB2: COMMIT
        AS->>DB1: UPDATE saga (COMPENSATED → FAILED)
        AS-->>C: 5xx + error code
    end

Read this once top to bottom and you'll understand the entire onboarding workflow. That's the value of orchestration — the sequence diagram is the architecture.

8. The State Machine

Every transition is written to agency_onboarding_sagas before the next step runs. That is what makes the saga observable and recoverable.

export enum AgencyOnboardingSagaStatus {
  STARTED            = 'STARTED',            // Row exists, no side effects yet
  AUTH_PROVISIONED   = 'AUTH_PROVISIONED',   // Auth side committed
  AGENCY_CREATED     = 'AGENCY_CREATED',     // Agency row committed
  COMPLETED          = 'COMPLETED',          // Happy-path terminal state
  COMPENSATING       = 'COMPENSATING',       // Rollback in progress
  COMPENSATED        = 'COMPENSATED',        // Rollback finished
  FAILED             = 'FAILED',             // Terminal failure (with or without compensation)
}

Why so many states? Because "what went wrong here?" is a question someone will ask at 2am. A saga that only stores success | failure is useless for forensics.

                ┌── auth fails ──────────► FAILED  (nothing to compensate)
                │
STARTED ──► AUTH_PROVISIONED ──► AGENCY_CREATED ──► COMPLETED  (happy path)
                                       │
                       agency fails ───┘
                                       ▼
                                COMPENSATING
                                       │
                                       ▼
                                COMPENSATED ──► FAILED  (consistent again)

The “point of no return” is AUTH_PROVISIONED. Before it, we can fail fast — there's nothing to undo. After it, every failure path must go through compensation.
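
One way to keep the saga row honest is to encode the diagram as a transition table and check it before every write. The sketch below is an assumption-laden illustration: it uses a plain string union instead of the enum to stay self-contained, `canTransition` and `assertTransition` are hypothetical helpers, and the `AGENCY_CREATED` to `COMPENSATING` edge is my reading of the defensive catch-all rather than something the diagram states explicitly.

```typescript
type SagaStatus =
  | "STARTED" | "AUTH_PROVISIONED" | "AGENCY_CREATED" | "COMPLETED"
  | "COMPENSATING" | "COMPENSATED" | "FAILED";

// Allowed transitions. COMPLETED and FAILED are terminal.
const ALLOWED_TRANSITIONS: Record<SagaStatus, readonly SagaStatus[]> = {
  STARTED: ["AUTH_PROVISIONED", "FAILED"],
  AUTH_PROVISIONED: ["AGENCY_CREATED", "COMPENSATING"],
  AGENCY_CREATED: ["COMPLETED", "COMPENSATING"], // second edge is an assumption
  COMPLETED: [],
  COMPENSATING: ["COMPENSATED"],
  COMPENSATED: ["FAILED"],
  FAILED: [],
};

function canTransition(from: SagaStatus, to: SagaStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Throws instead of silently writing an impossible state to the saga row.
function assertTransition(from: SagaStatus, to: SagaStatus): void {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal saga transition: ${from} -> ${to}`);
  }
}
```

Calling `assertTransition` inside the repository's update method would turn a logic bug into an immediate error rather than a confusing row discovered during an incident.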

9. Implementing the Orchestrator

The orchestrator is the only place that knows the workflow. Each step is a private method, and each step persists its result before returning.

Creating the Saga Record

// agency-onboarding.saga.repository.ts
async createSaga(payload: CreateAgencyOrchestrationInput) {
  return this.sagaModel.create({
    sagaId: randomUUID(),                          // correlation id for everything
    status: AgencyOnboardingSagaStatus.STARTED,
    currentStep: 'STARTED',
    payload,                                       // full input snapshot for replay
  });
}

The sagaId is a UUID generated once and propagated to every downstream call. It's the single identifier that ties the saga log on the orchestrator side to the provision record on the participant side.

The Main Loop

// agency-onboarding.orchestrator.ts (trimmed for the article)
async execute(input: CreateAgencyOrchestrationInput) {
  const saga = await this.sagaRepository.createSaga(input); // STARTED

  try {
    // Step 1 — auth-service work
    const authStep = await this.provisionAuth(saga, input);
    if (!authStep.ok) {
      await this.markFailed(saga, authStep.failure); // nothing to compensate
      return authStep.failure;
    }

    // Step 2 — agency-service work
    let activeSaga = authStep.saga; // status: AUTH_PROVISIONED
    try {
      activeSaga = await this.createAgencyRow(activeSaga, input, authStep.authIds);
    } catch (err) {
      // The expensive case: undo what auth-service did
      await this.compensateAuth(activeSaga, 'SAGA_FAILED');
      const failure = mapSagaFailure(err.message, 'SAGA_FAILED', 'CREATE_AGENCY');
      await this.markFailed(activeSaga, failure);
      return failure;
    }

    // Step 3 — mark done and run non-critical side effects
    activeSaga = await this.sagaRepository.updateSaga(activeSaga, {
      status: AgencyOnboardingSagaStatus.COMPLETED,
    });
    await this.sendWelcomeEmail(input, activeSaga); // best-effort

    return mapSagaSuccess(activeSaga, await this.agencyModel.findByPk(activeSaga.agencyId!));
  } catch (error) {
    // Defensive catch-all (lost DB connection, unexpected throw)
    await this.compensateAuth(saga, 'SAGA_FAILED');
    const failure = mapSagaFailure(error.message, 'SAGA_FAILED', 'SAGA');
    await this.markFailed(saga, failure);
    return failure;
  }
}

A Single Step in Detail

private async provisionAuth(saga: AgencyOnboardingSaga, input: ...) {
  this.logger.log(`[${saga.sagaId}] PROVISION_AUTH`);

  const auth = await firstValueFrom(
    this.authClient.provisionAgencyAccount({
      sagaId: saga.sagaId,                  // <-- correlation
      organizationName: input.agencyName.trim(),
      email: input.email.trim().toLowerCase(),
      // …
    }),
  );

  if (!auth.status || !auth.data) {
    return { ok: false, failure: mapAuthProvisionFailure(auth) };
  }

  // Persist the IDs we will need if we have to compensate later
  const updated = await this.sagaRepository.updateSaga(saga, {
    authOrganizationId: Number(auth.data.organizationId),
    authUserId: Number(auth.data.userId),
    authUserRoleId: Number(auth.data.userRoleId),
    status: AgencyOnboardingSagaStatus.AUTH_PROVISIONED,
  });

  return { ok: true, saga: updated, authIds: auth.data };
}

The line that does most of the work is the updateSaga call. It stores the foreign IDs returned by auth-service on the saga row, so even if the orchestrator process crashes and restarts, a recovery job can read that row and still know what to compensate.
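
The article does not show the recovery job itself, so the following is only a sketch of the selection logic such a job might use. `SagaRow`, `findSagasNeedingCompensation`, and the choice of which statuses count as recoverable are all assumptions; the rows are passed in as a plain array so the example runs without a database.

```typescript
// Minimal projection of a saga row (hypothetical shape for this sketch).
interface SagaRow {
  sagaId: string;
  status: string;
  authUserId: number | null;
  authOrganizationId: number | null;
  updatedAt: Date;
}

// Assumption: a saga that stalls in one of these states left auth rows behind.
const RECOVERABLE_STATUSES = new Set(["AUTH_PROVISIONED", "COMPENSATING"]);

// Returns sagas that have sat in a recoverable state longer than maxAgeMs
// and that recorded the auth IDs needed to compensate them.
function findSagasNeedingCompensation(
  rows: SagaRow[],
  now: Date,
  maxAgeMs: number,
): SagaRow[] {
  return rows.filter((row) => {
    const stale = now.getTime() - row.updatedAt.getTime() > maxAgeMs;
    const hasAuthIds = row.authUserId !== null || row.authOrganizationId !== null;
    return RECOVERABLE_STATUSES.has(row.status) && stale && hasAuthIds;
  });
}
```

The age threshold matters: without it, the job could race against an orchestrator that is still working on a healthy saga.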

Habits Worth Copying

Persist after every successful step, including the IDs you'll need to undo it.
Distinguish critical vs non-critical steps. Welcome emails, audit logs and analytics events are not worth rolling a saga back for. They're best-effort.
One log line per transition, prefixed with [${sagaId}]. Grep is your debugger.

10. Implementing the Participant

The participant (auth-service) wraps all of its own work in a local DB transaction. Inside that boundary it's still ACID — the saga only handles the cross-service problem.

// agency-provisioning.service.ts (trimmed)
async provisionAgencyAccount(req: ProvisionAgencyAccountInput) {

  // 1. Idempotency — return the previous result if this sagaId already provisioned.
  const existing = await this.provisionRecordModel.findOne({
    where: { sagaId: req.sagaId },
  });
  if (existing) {
    return serviceSuccess('Agency admin already onboarded', {
      userId: Number(existing.userId),
      organizationId: Number(existing.organizationId),
      userRoleId: Number(existing.roleId),
    });
  }

  // 2. Domain validation BEFORE the transaction (fail fast).
  if (await this.emailExists(req.email)) {
    return serviceFailure('Email already exists', { code: 'EMAIL_EXISTS' });
  }
  if (await this.organizationExists(req.organizationName)) {
    return serviceFailure('Organization already exists', { code: 'ORGANIZATION_EXISTS' });
  }

  // 3. The actual work — atomic at the auth-service boundary.
  return withSequelizeTransaction(this.sequelize, async (tx) => {
    const org = await this.organizationModel.create({ ... }, { transaction: tx });
    const user = await this.userModel.create({ ..., organizationId: org.id }, { transaction: tx });
    await this.userRoleModel.create({ userId: user.id, roleId: agencyAdminRole.id }, { transaction: tx });

    // The audit record that makes compensation possible later.
    await this.provisionRecordModel.create(
      { sagaId: req.sagaId, organizationId: org.id, userId: user.id, roleId: agencyAdminRole.id },
      { transaction: tx },
    );

    return serviceSuccess('Provisioned', {
      userId: user.id, organizationId: org.id, userRoleId: agencyAdminRole.id,
    });
  });
}

Three things make this method "saga-safe":

Idempotency check first: If the orchestrator retries (network blip, gRPC timeout), the second call is a no-op that returns the same IDs. No duplicate users.
Validation outside the transaction: Cheap reads first, expensive writes second.
One transaction wraps every write: If any insert fails, the whole thing rolls back automatically. The orchestrator sees a clean failure response and knows nothing was persisted.

The agency_provision_records table is the single most important piece of the participant. It's both the idempotency key and the compensation lookup — keyed by the same sagaId the orchestrator uses.

11. Rollback (Compensation)

Compensation is just another gRPC call. The orchestrator sends the sagaId and the IDs it remembers. The participant deletes everything it created, in reverse dependency order, inside its own DB transaction.

On the Orchestrator Side

private async compensateAuth(saga: AgencyOnboardingSaga, errorCode?: string) {
  if (!saga.authUserId && !saga.authOrganizationId) {
    // Nothing was provisioned — nothing to compensate.
    return;
  }

  // Mark the saga as compensating BEFORE the call, so the row is consistent
  // even if the compensating RPC times out.
  await this.sagaRepository.updateSaga(saga, {
    status: AgencyOnboardingSagaStatus.COMPENSATING,
    currentStep: 'COMPENSATING',
    errorCode,
  });

  try {
    const rollback = await firstValueFrom(this.authClient.compensateAgencyAccount({
      sagaId: saga.sagaId,
      organizationId: saga.authOrganizationId,
      userId: saga.authUserId,
    }));
    if (!rollback.status) {
      this.logger.error(`[${saga.sagaId}] Auth compensation returned failure: ${rollback.message}`);
    }
  } catch (err) {
    this.logger.error(`[${saga.sagaId}] Auth compensation RPC failed: ${err.message}`);
  }

  await this.sagaRepository.updateSaga(saga, {
    status: AgencyOnboardingSagaStatus.COMPENSATED,
    currentStep: 'COMPENSATED',
  });
}

On the Participant Side

private async rollbackProvisionedAuth(req, sagaId: string, tx: Transaction) {
  // Use the saga log as the source of truth — even if the caller forgot IDs.
  const record = await this.provisionRecordModel.findOne({
    where: { sagaId }, transaction: tx,
  });
  const userId         = req.userId         ?? record?.userId;
  const organizationId = req.organizationId ?? record?.organizationId;

  if (userId) {
    const user = await this.userModel.findByPk(userId, { transaction: tx, attributes: ['email'] });
    await this.userRoleModel.destroy({ where: { userId }, transaction: tx });
    if (user?.email) {
      await this.passwordResetTokenModel.destroy({ where: { email: user.email }, transaction: tx });
    }
    await this.userModel.destroy({ where: { id: userId }, transaction: tx });
  }
  if (organizationId) {
    await this.organizationModel.destroy({ where: { id: organizationId }, transaction: tx });
  }
  if (record) {
    await record.destroy({ transaction: tx });
  }
}

Rules of a Good Compensation

Reverse the order of creation: Children first (user_roles, tokens), then parents (users, organizations). The same rule you follow for DROP TABLE statements.
Be idempotent: Receiving the same sagaId twice must be safe — every destroy is a no-op if the row is already gone.
Use the saga log, not just the request: If the caller forgets an ID or sends a partial payload, look it up by sagaId. Defence in depth.
Wrap it in a local transaction: The rollback must itself be atomic — half-undone is worse than not-undone.
Always close the loop on the orchestrator side: Mark COMPENSATED even if the RPC failed. The failure should also be surfaced (log, metric, alert). A stuck COMPENSATING row is an operational landmine.

What Happens if the Compensation Itself Fails?

This is the worst case in any saga design. There are three reasonable strategies:

First, you can retry with exponential backoff. This works for transient failures (network, deadlocks).

Second, you can dead-letter the saga — write it to a "needs human attention" queue and alert.

Third, you can expose a manual rollback endpoint. This reference implementation does that via RollbackAgencyOnboarding gRPC, so an operator can replay compensation with the same sagaId.

A production system should combine all three. The pattern doesn't decide for you. You decide based on your business risk.
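
The first strategy is the easiest to sketch. The helper below is a generic illustration, not code from the reference implementation: `retryWithBackoff` and `backoffDelayMs` are hypothetical names, and the default attempt count and delays are arbitrary values you would tune.

```typescript
// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 10000): number {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Retries `operation` up to maxAttempts times, then rethrows the last error
// so the caller can dead-letter the saga.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await sleep(backoffDelayMs(attempt, baseMs));
      }
    }
  }
  throw lastError;
}
```

Because compensation is idempotent by sagaId, wrapping the compensating RPC in a helper like this is safe: a retry after a timeout cannot delete anything twice.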

12. Tracking, Idempotency and Observability

Two tables, both keyed by the same UUID sagaId, give you full traceability across services.

Orchestrator Side — `agency_onboarding_sagas`

column	purpose
`sagaId` (UUID, unique)	Propagated to every RPC. The join key across services.
`status`	Current state in the state machine.
`currentStep`	Human-readable label for dashboards (`PROVISION_AUTH`, `CREATE_AGENCY`…).
`payload` (JSONB)	Snapshot of the input — used for replay, debug, support.
`authOrganizationId`, `authUserId`, `authUserRoleId`	Foreign IDs needed for compensation.
`agencyId`	Set once the agency row exists.
`errorCode`, `errorMessage`	Filled on failure.
`createdAt`, `updatedAt`	Timeline for the saga.

A real row in COMPLETED state looks roughly like this:

{
  "sagaId": "0a4f3e2c-7b11-4f8d-9a2c-90b6f5f5b8a1",
  "status": "COMPLETED",
  "currentStep": "COMPLETED",
  "agencyId": 17,
  "authOrganizationId": 42,
  "authUserId": 99,
  "authUserRoleId": 3,
  "errorCode": null,
  "errorMessage": null,
  "payload": { "agencyName": "Acme Education", "email": "admin@acme.com", "...": "..." },
  "createdAt": "2026-05-22T10:14:32.118Z",
  "updatedAt": "2026-05-22T10:14:33.412Z"
}

Participant Side — `agency_provision_records`

column	purpose
`sagaId` (unique)	Idempotency key. The same `sagaId` from the orchestrator.
`userId`, `organizationId`, `roleId`	What to delete on compensation.
`createdAt`, `updatedAt`	Audit timestamps.

Observability for Free

Because every log line is prefixed with [${sagaId}], a single grep across both services gives the full timeline:

[0a4f3e2c…] PROVISION_AUTH                  agency-service
[0a4f3e2c…] provisionAgencyAccount: ok      auth-service
[0a4f3e2c…] CREATE_AGENCY                   agency-service
[0a4f3e2c…] Agency step failed: ...         agency-service
[0a4f3e2c…] Auth compensation completed     auth-service

In a structured-logging setup (Loki, Elasticsearch, Datadog) this becomes a one-click filter. The sagaId is your distributed trace.

13. Testing a Saga

A saga is just a state machine, so the test matrix is finite and small. Cover at least these cases:

#	Scenario	Expected end state
1	Happy path	`COMPLETED`, agency exists, user exists
2	Auth step fails (e.g. email exists)	`FAILED`, no rows on either side
3	Agency step fails	`COMPENSATED` → `FAILED`, auth rows gone, no agency
4	Compensation RPC times out	`COMPENSATING` → operator-driven recovery
5	Caller retries with the same `sagaId`	Second call returns the first call's result; no duplicate rows
6	Welcome email fails	`COMPLETED` still — non-critical step did not cascade

Two practical tips for testing:

First, mock the gRPC client at the orchestrator level, not the network. You want to assert that compensateAgencyAccount was called with the right sagaId, not that bytes hit a socket.

Second, spin up a real Postgres in integration tests (Testcontainers, or a Docker Compose postgres service). The saga state machine is too easy to "test" against a mock and too easy to break against a real DB.
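
To make the first tip concrete, here is a rough sketch of a hand-rolled fake client. Everything in it is simplified for illustration: `AuthClient`, `FakeAuthClient`, and `onboardAgency` are hypothetical stand-ins, and the methods return Promises where the real NestJS gRPC client returns Observables.

```typescript
// The slice of the auth client the orchestrator depends on (hypothetical shape).
interface AuthClient {
  provisionAgencyAccount(req: { sagaId: string }): Promise<{ userId: number; organizationId: number }>;
  compensateAgencyAccount(req: { sagaId: string }): Promise<void>;
}

// Hand-rolled fake: records compensation calls instead of doing network I/O.
class FakeAuthClient implements AuthClient {
  compensatedSagaIds: string[] = [];

  async provisionAgencyAccount(_req: { sagaId: string }) {
    return { userId: 99, organizationId: 42 };
  }

  async compensateAgencyAccount(req: { sagaId: string }) {
    this.compensatedSagaIds.push(req.sagaId);
  }
}

// Simplified stand-in for the orchestrator: provision, create the agency row,
// and compensate the auth side if the agency step throws.
async function onboardAgency(
  sagaId: string,
  auth: AuthClient,
  createAgencyRow: () => Promise<void>,
): Promise<"COMPLETED" | "FAILED"> {
  await auth.provisionAgencyAccount({ sagaId });
  try {
    await createAgencyRow();
    return "COMPLETED";
  } catch {
    await auth.compensateAgencyAccount({ sagaId });
    return "FAILED";
  }
}
```

A test then forces `createAgencyRow` to throw and asserts that `compensatedSagaIds` contains the expected sagaId, which is the behavior that matters.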

14. When NOT to Use a Saga

Sagas are not free. Skip them when:

One service does all the writes. Use a regular DB transaction. Don't reinvent the wheel.
The workflow is read-only or analytical. No rollback semantics exist for a SELECT.
The "rollback" is impossible. You sent a real email. You charged a credit card and the gateway doesn't support refunds. In those cases, design forward: send an apology email, queue a manual refund. Sagas can't unsend physical actions.
You don't actually have multiple services yet. A saga in a monolith is over-engineering. Wait until the service boundary is real.

A saga adds a state table, a compensation method per step, and an operational habit of grepping by sagaId. That cost is worth paying when the alternative is orphaned data — and not before.

15. Trade-offs and Lessons Learned

Things that worked well in this design:

Synchronous orchestration is easier to debug than choreography. A new engineer reads one file and understands the whole flow.
Idempotency at the participant is non-negotiable. Retries from the orchestrator must be safe. Build it in from day one — retro-fitting is painful.
The saga table replaces tribal knowledge. Ops can answer "what happened to this signup?" with a single SQL query. The payload JSONB is gold during incidents.
sagaId as the trace key plays nicely with OpenTelemetry / Datadog / Loki — no extra infra to set up.

Things to know before copying this pattern:

A failing compensation is the worst case. If compensateAgencyAccount itself errors, you have inconsistent state. Plan for retries + dead-letter + a manual rollback endpoint from the start.
Non-critical steps must be marked explicitly. Here, the welcome email is allowed to fail without rolling back the agency. Don't accidentally compensate over a flaky SMTP provider.
Sagas aren't a replacement for local transactions. Inside each service, still use a real DB transaction. The saga only handles the cross-service seam.
Synchronous gRPC is simple but couples availability. If auth-service is down, agency creation fails. Swap the gRPC calls for a durable message bus (RabbitMQ / Kafka) and treat each step as a command + reply when you need higher resilience.
The orchestrator becomes a critical service. Treat its uptime accordingly — monitor saga durations, alert on stuck COMPENSATING rows, and run more than one replica.

16. Conclusion

The saga pattern isn't magic. It's a disciplined version of what experienced engineers already do by hand: commit locally, record what you did, and know how to undo it.

In Node.js with NestJS, you only need three ingredients:

A state table to track the saga.
An orchestrator that drives the workflow and writes that state.
A participant that exposes a do and an undo operation, both idempotent and keyed by sagaId.

Get those three right and your microservices can offer the same "all-or-nothing" feel as a monolithic transaction — without the operational pain of distributed locks.

Start simple, use orchestration, make every step idempotent, persist before you call, and always know how to undo. That's the whole pattern.

How to Self‑Host an S3‑Compatible Object Store with MinIO on Your Staging Server (and Save Hundreds of Dollars a Month)

Md Tarikul Islam — Mon, 01 Jun 2026 14:40:43 +0000

This article is a complete copy‑paste guide to running MinIO behind Traefik with HTTPS, custom domains, and pre-signed upload/download URLs — using only Docker Compose.

Your production will keep using a managed S3 / Cloudflare R2 / Hetzner Object Storage, while every staging upload, download, and pre-signed URL goes to your own server for free.

1. Why Self‑Host Object Storage on Staging?
2. The Architecture: Production vs. Staging
3. Prerequisites
4. Step 1 — DNS: Point Your Domains to the Staging Server
5. Step 2 — Run MinIO with Docker Compose
6. Step 3 — Expose MinIO over HTTPS with Traefik
7. Step 4 — Create the Bucket and Access Keys
8. Step 5 — Configure Your App to Use MinIO on Staging Only
9. Step 6 — Upload Files (3 Ways)
10. Step 7 — Generate Presigned URLs (PUT and GET)
11. Step 8 — Get Public URLs for Documents
12. Step 9 — Lock Down CORS, Lifecycle, and Security
13. Step 10 — Backups and Monitoring
14. Troubleshooting Cheat Sheet
15. Wrapping Up

1. Why Self‑Host Object Storage on Staging?

If your app handles documents — PDFs, profile pictures, application transcripts, recordings — every test upload your QA team makes costs real money on AWS S3, Cloudflare R2, or Hetzner Object Storage. The price isn't huge per file, but staging is where you:

run automated end‑to‑end tests that upload thousands of dummy files,
reset databases nightly (which leaves orphan objects behind),
let developers experiment with broken code that re‑uploads the same files,
and hold months of test data nobody ever deletes.

In production those costs are justified. Managed storage gives you replication, availability, and someone else's pager. In staging, those costs are pure waste.

MinIO is a free, open‑source, S3‑compatible object server. Same API, same SDKs, same presigned URLs, same mc/aws s3 CLIs — but running on your own VPS, billed at $0 per gigabyte. Point your staging app at MinIO, point your production app at S3/R2, and the only thing that changes is an environment variable.

The result: identical code paths in both environments, zero storage bill on staging, and a nice fallback if your cloud provider ever has an outage.

2. The Architecture: Production vs. Staging

In real-world applications, you usually don’t want your development or staging environment writing directly to production storage.

A common and cost-effective setup is:

Production: managed cloud object storage
Staging / Development: self-hosted S3-compatible storage

The good part is that your application code doesn't need to change.

As long as both services are S3-compatible, the same SDK and upload logic work everywhere. Only the environment variables differ.

High-Level Architecture

The same application communicates with different storage providers depending on the deployment environment.

In the production environment, uploads are stored in a managed object storage service such as AWS S3, Cloudflare R2, or Hetzner Object Storage. These services handle durability, scalability, backups, and infrastructure management.

In the staging environment, uploads are directed to a self-hosted MinIO instance running inside Docker on a VPS. MinIO implements the S3 API, making it behave similarly to production storage while keeping costs low.

Because both storage systems are S3-compatible, the application uses the same upload logic in every environment. The only difference is the configuration provided through environment variables.

Why This Architecture Is Useful

This setup gives you:

A cheap staging environment
Production-like testing
Zero storage vendor lock-in
The ability to switch providers without rewriting application code

Because both environments speak the S3 protocol, your upload logic remains identical.

Example Environment Variables

Your application only reads environment variables like these:

S3_ENDPOINT=
S3_REGION=
S3_ACCESS_KEY=
S3_SECRET_KEY=
S3_BUCKET=

Switch the values, and the exact same application now uploads files to a different backend.
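Because everything hinges on these five variables, it can help to fail fast when one is missing. The following is a minimal sketch, not part of the original setup: loadStorageConfig and REQUIRED_VARS are hypothetical names, and the function only checks that the values exist and that the endpoint parses as a URL.

```javascript
// Hypothetical helper: validate the S3_* variables before building a client.
const REQUIRED_VARS = [
  "S3_ENDPOINT",
  "S3_REGION",
  "S3_ACCESS_KEY",
  "S3_SECRET_KEY",
  "S3_BUCKET",
];

function loadStorageConfig(env) {
  // Collect every missing or empty variable so the error lists them all at once
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing storage variables: ${missing.join(", ")}`);
  }
  // new URL() throws on a malformed endpoint, e.g. one without a scheme
  const endpoint = new URL(env.S3_ENDPOINT);
  return {
    endpoint: endpoint.origin, // scheme + host (+ port), no trailing slash
    region: env.S3_REGION,
    bucket: env.S3_BUCKET,
    credentials: {
      accessKeyId: env.S3_ACCESS_KEY,
      secretAccessKey: env.S3_SECRET_KEY,
    },
  };
}
```

In an app you would call loadStorageConfig(process.env) once at startup, so a misconfigured staging or production environment stops immediately instead of failing on the first upload.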

Production Storage Example

In production, you typically use managed object storage providers such as:

AWS S3
Cloudflare R2
Hetzner Object Storage

Example:

S3_ENDPOINT=https://<ACCOUNT_ID>.r2.cloudflarestorage.com

The benefits: managed storage is highly scalable, globally available, and durable, with managed backups and no infrastructure to maintain.

Staging Environment Example

For staging, a lightweight self-hosted MinIO container is often enough.

Next.js App
     ↓
MinIO Container (inside Docker on VPS)

Example domains:

Service	Domain	Internal Port
MinIO S3 API	`minio-staging.domain.com`	`9000`
MinIO Web Console	`minio-console-staging.domain.com`	`9001`

This allows you to:

Test uploads safely
Avoid production storage costs
Reproduce production-like behavior locally

3. Prerequisites

You'll need:

A Linux VPS (Hetzner, DigitalOcean, Contabo, OVH — anything with a public IP).
Two A records pointing at that IP (we'll register them next).
Docker + Docker Compose v2.
Traefik v2 in front, with Let's Encrypt configured (any reverse proxy works – the labels below are Traefik's flavor).
Open ports 80 and 443 on the firewall for Let's Encrypt + HTTPS.
~10 GB free disk for the MinIO data volume to start.

If Docker isn't installed:

curl -fsSL https://get.docker.com | sh
sudo apt-get install -y docker-compose-plugin
docker --version && docker compose version

4. Step 1 — DNS: Point Your Domains to the Staging Server

In your DNS provider (Cloudflare, Route 53, Namecheap, and so on), create two A records pointing at your staging server's public IP:

minio-staging.domain.com           A    203.0.113.45
minio-console-staging.domain.com   A    203.0.113.45

If you use Cloudflare, set the proxy status to DNS only (gray cloud) for minio-staging.*. Cloudflare's free plan caps uploads at 100 MB, and you don't want it stripping S3 signing headers. The console subdomain can stay proxied if you want a WAF in front of it.

Wait a minute and verify:

dig +short minio-staging.domain.com
# 203.0.113.45

5. Step 2 — Run MinIO with Docker Compose

Add this service to your staging compose file (docker-compose.staging.yml). MinIO is just one container — the disk is mounted as a Docker volume so data survives upgrades.

# docker-compose.staging.yml
networks:
  proxy:
    external: true
    name: proxy
  internal:
    name: internal

volumes:
  minio-data:

services:
  minio:
    image: minio/minio:latest
    container_name: minio-staging
    restart: unless-stopped
    environment:
      - MINIO_ROOT_USER=${MINIO_ROOT_USER:-admin}
      - MINIO_ROOT_PASSWORD=${MINIO_ROOT_PASSWORD:-change-me-please}
      # Tell MinIO which public domain to sign URLs with
      - MINIO_SERVER_URL=https://minio-staging.domain.com
      - MINIO_BROWSER_REDIRECT_URL=https://minio-console-staging.domain.com
    command: server /data --console-address ":9001"
    volumes:
      - minio-data:/data
    networks:
      - proxy
      - internal
    ports:
      # Bind to loopback only; Traefik reaches MinIO over the proxy network
      - "127.0.0.1:9000:9000"  # S3 API
      - "127.0.0.1:9001:9001"  # Web console
    healthcheck:
      # Recent MinIO images no longer include curl; mc ready checks the same health state
      test: ["CMD", "mc", "ready", "local"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s

Two things deserve attention:

MINIO_SERVER_URL is the secret sauce. Without it, MinIO signs presigned URLs using its internal hostname (http://minio:9000), which then fails verification when the browser hits the public domain. Set it to the exact HTTPS URL clients will use.
MINIO_BROWSER_REDIRECT_URL does the same for the web console (login redirects, OIDC callbacks, and so on).

Bring it up:

docker compose -f docker-compose.staging.yml up -d minio
docker compose -f docker-compose.staging.yml logs -f minio

You should see API: http://... and Console: http://... lines.

6. Step 3 — Expose MinIO over HTTPS with Traefik

We don't expose ports 9000/9001 to the world directly — Traefik does that for us, terminating TLS with a free Let's Encrypt certificate.

Add these labels to the minio service:

    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"

      # ---- S3 API (port 9000) ----
      - "traefik.http.routers.minio-staging.rule=Host(`minio-staging.domain.com`)"
      - "traefik.http.routers.minio-staging.entrypoints=websecure"
      - "traefik.http.routers.minio-staging.tls.certresolver=letsencrypt"
      - "traefik.http.routers.minio-staging.service=minio-staging"
      - "traefik.http.services.minio-staging.loadbalancer.server.port=9000"

      # ---- Web Console (port 9001) ----
      - "traefik.http.routers.minio-console-staging.rule=Host(`minio-console-staging.domain.com`)"
      - "traefik.http.routers.minio-console-staging.entrypoints=websecure"
      - "traefik.http.routers.minio-console-staging.tls.certresolver=letsencrypt"
      - "traefik.http.routers.minio-console-staging.service=minio-console-staging"
      - "traefik.http.services.minio-console-staging.loadbalancer.server.port=9001"

You also need an entrypoint for :443 and a certificatesresolver named letsencrypt. Here's the minimum Traefik config (traefik.staging.yml):

api:
  dashboard: true

entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"

certificatesResolvers:
  letsencrypt:
    acme:
      httpChallenge:
        entryPoint: web
      email: admin@domain.com
      storage: /etc/traefik/acme.json

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"
    exposedByDefault: false
    network: proxy

Restart and watch the cert get issued:

docker compose -f docker-compose.staging.yml up -d
docker compose -f docker-compose.staging.yml logs -f traefik | grep -i acme

Sanity check from your laptop:

curl -I https://minio-staging.domain.com/minio/health/live
# HTTP/2 200

You can now log in to the web console at https://minio-console-staging.domain.com with admin / change-me-please.

Important upload size tweak: if you're behind Cloudflare or NGINX in front of Traefik, raise the request body limit. Traefik itself has no default limit, but Cloudflare's free plan refuses anything over 100 MB. For self‑hosted edge proxies, set client_max_body_size 0; (NGINX) or the equivalent.

7. Step 4 — Create the Bucket and Access Keys

Anything that speaks S3 can talk to MinIO. The easiest tool is mc (the official MinIO client), shipped inside the same image.

7.1 Connect mc to your server

docker exec -it minio-staging \
  mc alias set local http://localhost:9000 admin change-me-please

7.2 Create a bucket

docker exec -it minio-staging mc mb local/domain-files-staging

7.3 Choose a bucket policy

You have three choices, so just pick based on what you store:

Policy	When to use
`private` (default)	Anything sensitive — student transcripts, contracts, internal docs. Reads only via presigned URL.
`download`	Public read, no listing. Good for CDN‑style assets like avatars.
`public`	Anyone can read AND list. Use only for truly public content.

Set one:

# Private (recommended for documents)
docker exec -it minio-staging \
  mc anonymous set none local/domain-files-staging

# OR public read for static assets only:
docker exec -it minio-staging \
  mc anonymous set download local/domain-files-staging

7.4 Create a dedicated app user (don't use root keys!)

The admin account can wipe everything. Make a least‑privilege user for your app:

docker exec -it minio-staging mc admin user add local \
  domain-app a-long-random-secret-key

# Create a custom policy that grants full S3 access, scoped to one bucket via JSON:
cat > /tmp/policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:*"],
      "Resource": [
        "arn:aws:s3:::domain-files-staging",
        "arn:aws:s3:::domain-files-staging/*"
      ]
    }
  ]
}
EOF

docker cp /tmp/policy.json minio-staging:/tmp/policy.json
docker exec -it minio-staging \
  mc admin policy create local domain-rw /tmp/policy.json
docker exec -it minio-staging \
  mc admin policy attach local domain-rw --user domain-app

Save the user name (domain-app) and the secret you passed to mc admin user add. They are your S3_ACCESS_KEY and S3_SECRET_KEY.

8. Step 5 — Configure Your App to Use MinIO on Staging Only

The trick to "MinIO in staging, real S3 in prod" is to use the same S3 client in your code and only swap the env vars.

Your staging.env (loaded by your staging compose stack):

# ---- Staging: self-hosted MinIO ----
STORAGE_ENABLED=true
S3_ENDPOINT=https://minio-staging.domain.com
S3_PUBLIC_ENDPOINT=https://minio-staging.domain.com
S3_BUCKET=domain-files-staging
S3_ACCESS_KEY=domain-app
S3_SECRET_KEY=a-long-random-secret-key
S3_REGION=us-east-1
S3_FORCE_PATH_STYLE=true

Your production.env:

# ---- Production: Cloudflare R2 ----
STORAGE_ENABLED=true
S3_ENDPOINT=https://<ACCOUNT_ID>.r2.cloudflarestorage.com
S3_PUBLIC_ENDPOINT=https://files.domain.com
S3_BUCKET=domain-files
S3_ACCESS_KEY=<R2_ACCESS_KEY_ID>
S3_SECRET_KEY=<R2_SECRET_ACCESS_KEY>
S3_REGION=auto
S3_FORCE_PATH_STYLE=true

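S3_FORCE_PATH_STYLE=true is critical for MinIO in this setup, and it also works with R2 and Hetzner. Without it, the SDK tries https://bucket.minio-staging.domain.com (virtual‑host style), which won't resolve because only minio-staging.domain.com has a DNS record and a certificate.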
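To make the difference between the two addressing styles concrete, here is a small illustrative sketch. objectUrl is a hypothetical helper that only mimics how the request URL is shaped; it is not how the AWS SDK builds requests internally.

```javascript
// Illustrative only: shows where the bucket name ends up in each addressing style.
function objectUrl({ endpoint, bucket, key, forcePathStyle }) {
  const url = new URL(endpoint);
  if (forcePathStyle) {
    // Path style: the bucket is the first path segment
    url.pathname = `/${bucket}/${key}`;
  } else {
    // Virtual-host style: the bucket becomes a subdomain of the endpoint host
    url.hostname = `${bucket}.${url.hostname}`;
    url.pathname = `/${key}`;
  }
  return url.toString();
}

const base = {
  endpoint: "https://minio-staging.domain.com",
  bucket: "domain-files-staging",
  key: "users/42/avatar.png",
};
console.log(objectUrl({ ...base, forcePathStyle: true }));
console.log(objectUrl({ ...base, forcePathStyle: false }));
```

The second URL points at a hostname that needs its own DNS record and TLS certificate, which is why path style is the simpler choice for a single-domain MinIO deployment.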

Now in your application code (Node.js example using AWS SDK v3):

// src/lib/s3.js
import { S3Client } from "@aws-sdk/client-s3";

export const s3 = new S3Client({
  endpoint: process.env.S3_ENDPOINT,
  region: process.env.S3_REGION,
  credentials: {
    accessKeyId: process.env.S3_ACCESS_KEY,
    secretAccessKey: process.env.S3_SECRET_KEY,
  },
  forcePathStyle: process.env.S3_FORCE_PATH_STYLE === "true",
});

export const BUCKET = process.env.S3_BUCKET;
export const PUBLIC_ENDPOINT = process.env.S3_PUBLIC_ENDPOINT;

The same s3 instance now talks to MinIO on staging and to R2 in production with no code change.

9. Step 6 — Upload Files (3 Ways)

9.1 From a server (best for trusted backends)

import { PutObjectCommand } from "@aws-sdk/client-s3";
import { s3, BUCKET } from "./lib/s3.js";
import { readFile } from "node:fs/promises";

export async function uploadDocument(localPath, key, contentType) {
  const Body = await readFile(localPath);
  await s3.send(new PutObjectCommand({
    Bucket: BUCKET,
    Key: key,
    Body,
    ContentType: contentType,
    // Optional: per-object metadata, useful for audits
    Metadata: { uploadedBy: "system", env: process.env.NODE_ENV },
  }));
  return key;
}

9.2 With the mc CLI (good for one‑off uploads / migrations)

mc alias set staging https://minio-staging.domain.com domain-app a-long-random-secret-key
mc cp ./report.pdf staging/domain-files-staging/reports/2026/report.pdf
mc ls staging/domain-files-staging --recursive

9.3 Directly from the browser via a presigned PUT URL

The recommended pattern for user uploads is: the file goes from the browser to MinIO with zero bytes touching your API server.

We'll cover this in detail next.

10. Step 7 — Generate Presigned URLs (PUT and GET)

A presigned URL is a regular HTTPS URL with a time‑limited signature in the query string. Anyone with the URL can do exactly the action it was signed for (PUT this object, or GET that object) for the next N minutes — and nothing else.

This is what makes "users upload directly to storage" safe.
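The time limit and the signed headers are visible in the URL itself, which is handy when debugging. Below is a minimal sketch that reads the standard SigV4 query parameters (X-Amz-Date, X-Amz-Expires, X-Amz-SignedHeaders). describePresignedUrl is a hypothetical helper name, and the function does not verify the signature.

```javascript
// Hypothetical debugging helper: read the SigV4 parameters of a presigned URL.
function describePresignedUrl(presignedUrl) {
  const url = new URL(presignedUrl);
  const params = url.searchParams;
  const amzDate = params.get("X-Amz-Date");    // e.g. 20260101T000000Z (UTC)
  const expires = params.get("X-Amz-Expires"); // lifetime in seconds
  const match = /^(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})Z$/.exec(amzDate ?? "");
  if (!match || expires === null) {
    throw new Error("URL has no SigV4 presign parameters");
  }
  const [year, month, day, hour, minute, second] = match.slice(1).map(Number);
  const signedAt = new Date(Date.UTC(year, month - 1, day, hour, minute, second));
  const expiresAt = new Date(signedAt.getTime() + Number(expires) * 1000);
  return {
    host: url.host,
    path: url.pathname,
    // Headers the client must send exactly as they were signed
    signedHeaders: (params.get("X-Amz-SignedHeaders") ?? "")
      .split(";")
      .filter(Boolean),
    signedAt,
    expiresAt,
    isExpired: (now = new Date()) => now > expiresAt,
  };
}
```

If signedHeaders lists more than host, those extra headers have to match on the actual request, which is a quick way to narrow down a SignatureDoesNotMatch error.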

10.1 Presigned PUT (for uploads)

// src/lib/presign.js
import { PutObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";
import { s3, BUCKET } from "./s3.js";
import { randomUUID } from "node:crypto";

export async function presignUpload({ filename, contentType, userId }) {
  const key = `users/${userId}/${randomUUID()}-${filename}`;
  const cmd = new PutObjectCommand({
    Bucket: BUCKET,
    Key: key,
    ContentType: contentType,
  });
  const uploadUrl = await getSignedUrl(s3, cmd, { expiresIn: 60 * 5 }); // 5 min
  return { uploadUrl, key };
}

Wire it to your API:

// POST /api/uploads/presign
app.post("/api/uploads/presign", requireAuth, async (req, res) => {
  const { filename, contentType } = req.body;
  const result = await presignUpload({
    filename,
    contentType,
    userId: req.user.id,
  });
  res.json(result); // { uploadUrl, key }
});
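The handler above passes filename and contentType from the request body straight into the object key. One way to tighten that is sketched below, under assumptions: the allow-list in ALLOWED_TYPES and the helper names sanitizeFilename and validateUploadRequest are hypothetical, and you would adapt both to the files your app accepts.

```javascript
// Hypothetical allow-list; adjust to the document types your app supports.
const ALLOWED_TYPES = new Set(["application/pdf", "image/png", "image/jpeg"]);

function sanitizeFilename(filename) {
  // Keep only the last path segment so "../../x" cannot create extra folders
  const base = String(filename).split(/[\\/]/).pop();
  // Replace anything outside a conservative character set, then cap the length
  const cleaned = base.replace(/[^A-Za-z0-9._-]/g, "_").slice(-100);
  // Strip leading dots; fall back to a generic name if nothing is left
  return cleaned.replace(/^\.+/, "") || "file";
}

function validateUploadRequest({ filename, contentType }) {
  if (typeof filename !== "string" || filename.length === 0) {
    return { ok: false, error: "filename is required" };
  }
  if (!ALLOWED_TYPES.has(contentType)) {
    return { ok: false, error: `unsupported content type: ${contentType}` };
  }
  return { ok: true, filename: sanitizeFilename(filename), contentType };
}
```

In the route, you could call validateUploadRequest(req.body) first, respond with a 400 status when ok is false, and pass the sanitized filename to presignUpload otherwise.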

The browser uploads straight to MinIO:

// In your frontend
async function uploadFile(file) {
  const { uploadUrl, key } = await fetch("/api/uploads/presign", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ filename: file.name, contentType: file.type }),
  }).then(r => r.json());

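  const putRes = await fetch(uploadUrl, {
    method: "PUT",
    headers: { "Content-Type": file.type },
    body: file,
  });
  // fetch() only rejects on network errors, so check the HTTP status explicitly
  if (!putRes.ok) throw new Error(`Upload failed with status ${putRes.status}`);

  // Persist `key` in your DB so you can retrieve it later
  await fetch("/api/documents", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ key, originalName: file.name }),
  });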
}

The Content-Type you send during PUT must match the one you signed with, or MinIO will reject the request with SignatureDoesNotMatch. This catches everyone the first time.
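One subtle source of mismatches is that browsers report an empty file.type for extensions they don't recognize. A small sketch of one way to handle this: resolve the type once and use the same value in both the presign request and the PUT. resolveContentType is a hypothetical helper name.

```javascript
// Hypothetical helper: pick one content type and reuse it for signing and upload.
function resolveContentType(file) {
  // file.type is "" when the browser cannot determine a MIME type
  const type = typeof file.type === "string" ? file.type.trim() : "";
  return type !== "" ? type : "application/octet-stream";
}

// Usage sketch: compute once, send the same value to the API and in the PUT header
const contentType = resolveContentType({ name: "notes.xyz", type: "" });
console.log(contentType); // application/octet-stream
```

The key point is that the value is computed in one place, so the signed type and the uploaded type cannot drift apart.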

10.2 Presigned GET (for downloads)

Same idea, but with GetObjectCommand:

export async function presignDownload(key, expiresIn = 60 * 10) {
  const cmd = new GetObjectCommand({ Bucket: BUCKET, Key: key });
  return getSignedUrl(s3, cmd, { expiresIn });
}

A typical "view document" endpoint:

app.get("/api/documents/:id/url", requireAuth, async (req, res) => {
  const doc = await db.documents.findById(req.params.id);
  if (!doc || !canUserSee(req.user, doc)) return res.sendStatus(403);
  const url = await presignDownload(doc.key, 600);
  res.json({ url });
});

The frontend just opens that URL — the file streams from MinIO directly to the user.

10.3 Why presigned URLs beat "proxy through the API"

	Proxy through API	Presigned URL
Bytes through your app	All of them	Zero
API CPU/RAM cost	High	None
Throughput limit	Your API	MinIO's NIC
Auth check	Your code	Your code (still — check before signing)

11. Step 8 — Get Public URLs for Documents

Sometimes you want a permanent, unauthenticated URL — for example public profile pictures.

If the bucket policy allows anonymous reads (mc anonymous set download …), the public URL pattern is:

https://minio-staging.domain.com/<bucket>/<key>

So users/42/avatar.png becomes:

https://minio-staging.domain.com/domain-files-staging/users/42/avatar.png

In code:

export function publicUrl(key) {
  return `${process.env.S3_PUBLIC_ENDPOINT}/${BUCKET}/${key}`;
}

For private buckets (most documents), don't use public URLs at all — always go through presignDownload(key) so you can re‑check authorization on every request and expire links.
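Keys that contain spaces or other special characters need to be URL-encoded before they go into a public link. Below is a hedged variant of the public URL builder that encodes each path segment. encodedPublicUrl is a hypothetical name, and the endpoint and bucket are passed in explicitly so the function is easy to test.

```javascript
// Hypothetical variant of publicUrl() that encodes each key segment.
function encodedPublicUrl(publicEndpoint, bucket, key) {
  // Drop trailing slashes so the result never contains "//" after the host
  const base = publicEndpoint.replace(/\/+$/, "");
  // Encode segment by segment so the "/" separators in the key are preserved
  const encodedKey = key.split("/").map((segment) => encodeURIComponent(segment)).join("/");
  return `${base}/${bucket}/${encodedKey}`;
}

console.log(
  encodedPublicUrl(
    "https://minio-staging.domain.com/",
    "domain-files-staging",
    "users/42/my photo.png"
  )
);
// https://minio-staging.domain.com/domain-files-staging/users/42/my%20photo.png
```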

12. Step 9 — Lock Down CORS, Lifecycle, and Security

12.1 Allow your frontend origins (CORS)

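Browser uploads need CORS rules on the bucket. The commands below assume your mc and MinIO builds support bucket-level CORS; if yours rejects them, the server-wide MINIO_API_CORS_ALLOW_ORIGIN setting is the usual fallback. Drop this JSON via mc: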

cat > /tmp/cors.json <<'EOF'
{
  "CORSRules": [
    {
      "AllowedOrigins": [
        "https://crm-staging.domain.com",
        "http://localhost:3000"
      ],
      "AllowedMethods": ["GET", "PUT", "POST", "HEAD"],
      "AllowedHeaders": ["*"],
      "ExposeHeaders": ["ETag"],
      "MaxAgeSeconds": 3000
    }
  ]
}
EOF

docker cp /tmp/cors.json minio-staging:/tmp/cors.json
docker exec -it minio-staging \
  mc cors set local/domain-files-staging /tmp/cors.json

12.2 Auto‑delete old test files (lifecycle)

Staging accumulates junk. Tell MinIO to expire anything older than 30 days:

docker exec -it minio-staging \
  mc ilm rule add --expire-days 30 local/domain-files-staging

12.3 Encrypt at rest

docker exec -it minio-staging \
  mc encrypt set sse-s3 local/domain-files-staging

12.4 Hard rules

Never ship the compose defaults (MINIO_ROOT_USER=admin / MINIO_ROOT_PASSWORD=change-me-please) to a server reachable from the internet. Generate strong values and store them in your secret manager.
The root account should be used only by mc admin, never by your app. The app uses a scoped IAM user (§7.4).
If the console subdomain is reachable from the public internet, put it behind an IP allow‑list or basic auth via Traefik middleware.
Rotate the app access keys at least every 90 days.

13. Step 10 — Backups and Monitoring

13.1 Backups: mirror to a cheap cold bucket weekly

Set up a tiny cron job that uses mc mirror to push to Backblaze B2, R2, or another cheap S3 endpoint:

mc alias set b2 https://s3.us-east-005.backblazeb2.com "$B2_KEY" "$B2_SECRET"
mc mirror --overwrite --remove \
  staging/domain-files-staging \
  b2/domain-staging-backup

Even at $6/TB/month this is essentially free for staging volumes.

13.2 Monitoring with Prometheus

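MinIO exposes Prometheus metrics at /minio/v2/metrics/cluster. By default the endpoint expects a bearer token (mc admin prometheus generate prints a scrape config that includes one); setting MINIO_PROMETHEUS_AUTH_TYPE=public on the server allows unauthenticated scraping instead. A basic scrape job looks like this: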

scrape_configs:
  - job_name: minio
    metrics_path: /minio/v2/metrics/cluster
    scheme: https
    static_configs:
      - targets: ["minio-staging.domain.com"]

If you have Grafana, import dashboard ID 13502 for an instant overview (capacity, request rates, latency, error counts).

14. Troubleshooting Cheat Sheet

Symptom	Likely cause	Fix
`SignatureDoesNotMatch` on presigned PUT	Browser sent a different `Content-Type` than what was signed	Send the exact same `Content-Type` header during PUT
Presigned URL works locally but not in browser	`MINIO_SERVER_URL` not set, so URLs are signed for `minio:9000`	Set `MINIO_SERVER_URL=https://minio-staging.domain.com` and restart
`403 SignatureDoesNotMatch` after going through Cloudflare	Cloudflare strips/modifies headers	Set the DNS record to DNS‑only (gray cloud)
`NoSuchBucket`	App pointing at the wrong endpoint or bucket	Re‑check `S3_ENDPOINT` and `S3_BUCKET` in env
Browser CORS preflight fails	No CORS rule on the bucket	Apply the CORS JSON from §12.1
Upload works for small files, fails at 100 MB	Cloudflare free plan body limit	Use Cloudflare paid plan, or skip CF proxy
`x509: certificate signed by unknown authority` from your app	App container doesn't trust Let's Encrypt	Update CA bundle (`apt install ca-certificates`) or use HTTP inside the Docker network
Web console redirects to `http://minio:9001/login`	`MINIO_BROWSER_REDIRECT_URL` missing	Set it to `https://minio-console-staging.domain.com`

Useful diagnostics:

# Check MinIO health
curl -I https://minio-staging.domain.com/minio/health/live

# List all objects in a bucket
docker exec -it minio-staging mc ls --recursive local/domain-files-staging

# Tail MinIO logs
docker compose -f docker-compose.staging.yml logs -f minio

# Decode a presigned URL to see what it was signed for
echo "<presigned-url>" | tr '&' '\n'

15. Wrapping Up

Here's what you have now:

A free, S3‑compatible object store running on your own staging server.
Real HTTPS on a real domain (https://minio-staging.domain.com), thanks to Traefik + Let's Encrypt.
A scoped, least‑privilege application user — root keys stay locked away.
The same exact code paths in staging and production. Switching between MinIO / R2 / Hetzner / AWS S3 only means changing a handful of variables in the env file.
Presigned PUT URLs so users upload straight to storage, bypassing your API.
Presigned GET URLs so private documents are short‑lived and authorization‑gated.
Lifecycle rules that nuke old test files automatically.
Optional weekly mirror to a cold backup bucket.

Production keeps running on managed storage where the SLA matters. Staging now costs you exactly $0 per month per gigabyte uploaded — and you can finally stop telling QA to "delete the test files when you're done."

How to Deploy a Full-Stack Next.js App on Cloudflare Workers with GitHub Actions CI/CD

Md Tarikul Islam — Wed, 29 Apr 2026 14:23:26 +0000

I typically build my projects using Next.js 14 (App Router) and Supabase for authentication along with Postgres. The default deployment choice for a Next.js app is usually Vercel, and for good reason: it provides an excellent developer experience.

But after running the same project on both platforms for about a week, I started exploring Cloudflare Workers as an alternative. I noticed improvements in latency (lower TTFB) and found the free tier to be more flexible for my use case.

Deploying Next.js apps on Cloudflare used to be challenging. Earlier solutions like Cloudflare Pages had limitations with full Next.js features, and tools like next-on-pages often lagged behind the latest releases.

That changed with the introduction of @opennextjs/cloudflare. It allows you to compile a standard Next.js application into a Cloudflare Worker, supporting features like SSR, ISR, middleware, and the Image component – all without requiring major code changes.

In this guide, I’ll walk you through the exact steps I used to deploy my full-stack Next.js + Supabase application to Cloudflare Workers.

This article is the runbook I wish I had when I started.

Why Choose Cloudflare Workers Over Vercel?
Prerequisites
The Stack
Step 1 — Install the Cloudflare Adapter
Step 2 — Wire OpenNext into next dev
Step 3 — Local Environment Setup with .dev.vars
Step 4 — Deploy Your App from Your Local Machine
Step 5 — Push Your Secrets to the Worker
Step 6 — Set Up Continuous Deployment with GitHub Actions
Step 7 — Updating the Project (the Daily Workflow)
Final Thoughts

Why Choose Cloudflare Workers Over Vercel?

When deploying a Next.js application, Vercel is often the default choice. It offers a smooth developer experience and tight integration with Next.js.

But Cloudflare Workers provides a compelling alternative, especially when you care about global performance and cost efficiency.

Here’s a high-level comparison (at the time of writing):

Concern	Vercel (Hobby)	Cloudflare Workers (Free Tier)
Requests	Fair usage limits	100,000 requests per day
Cold starts	~100–300 ms (region-based)	Near-zero (V8 isolates)
Edge locations	Limited regions for SSR	300+ global edge locations
Bandwidth	~100 GB/month (soft cap)	Generous / no strict cap on free tier
Custom domains	Supported	Supported
Image optimization	Counts toward usage	Available via `IMAGES` binding
Pricing beyond free	Starts at ~$20/month	Low-cost, usage-based pricing

Key Takeaways

Lower latency globally: Cloudflare runs your app across hundreds of edge locations, reducing response time for users worldwide.
Minimal cold starts: Thanks to V8 isolates, functions start almost instantly.
Cost efficiency: The free tier is generous enough for portfolios, blogs, and many small-to-medium apps.

Trade-offs to Consider

Cloudflare Workers use a V8 isolate runtime, not a full Node.js environment. That means:

Some Node.js APIs like fs or child_process aren't available
Native binaries or certain libraries may not work

That said, for most modern stacks – like Next.js + Supabase + Stripe + Resend – this limitation is rarely an issue.

In short, choose Vercel if you want the simplest, plug-and-play Next.js deployment. Choose Cloudflare Workers if you want better edge performance and more flexible scaling.

Prerequisites

Before getting started, make sure you have the following set up. Most of these take only a few minutes:

Node.js 18+ and pnpm 9+ (you can also use npm or yarn, but this guide uses pnpm.)
A Cloudflare account 👉 https://dash.cloudflare.com/sign-up
A Supabase account (if your app uses a database) 👉 https://supabase.com
A GitHub repository for your project (required later for CI/CD setup)
A domain name (optional) – You’ll get a free *.workers.dev URL by default.

Install Wrangler (Cloudflare CLI)

We’ll use Wrangler to build and deploy the application:

pnpm add -D wrangler

The Stack

Here’s the tech stack used in this project:

Next.js (v14.2.x): Using the App Router with Edge runtime for both public and dashboard routes
Supabase: Handles authentication, Postgres database, and Row-Level Security (RLS)
Tailwind CSS + UI utilities: For styling, along with lightweight animation using Framer Motion
Cloudflare Workers: Deployment powered by @opennextjs/cloudflare and wrangler
GitHub Actions: Used to automate CI/CD and deployments

Note: If you're using Next.js 15 or later, you can remove the --dangerouslyUseUnsupportedNextVersion flag from the build script, as it's only required for certain Next.js 14 setups.

Step 1 — Install the Cloudflare Adapter

From inside your existing Next.js project, install the OpenNext adapter along with Wrangler (Cloudflare’s CLI tool):

pnpm add @opennextjs/cloudflare
pnpm add -D wrangler

Then add the deploy scripts to package.json:

{
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "next start",
    "lint": "next lint",

    "cloudflare-build": "opennextjs-cloudflare build --dangerouslyUseUnsupportedNextVersion",
    "preview":          "pnpm cloudflare-build && opennextjs-cloudflare preview",
    "deploy":           "pnpm cloudflare-build && wrangler deploy",
    "upload":           "pnpm cloudflare-build && opennextjs-cloudflare upload",
    "cf-typegen":       "wrangler types --env-interface CloudflareEnv cloudflare-env.d.ts"
  }
}

What each script does:

Script	What it does
`pnpm cloudflare-build`	Compiles your Next app into `.open-next/` (the Worker bundle). No upload.
`pnpm preview`	Builds and runs the Worker locally with `wrangler dev`. Closest thing to prod.
`pnpm run deploy`	Builds and uploads to Cloudflare. This ships to production. (Plain `pnpm deploy` invokes pnpm's built-in workspace command, not this script.)
`pnpm upload`	Builds and uploads a new version without promoting it (for staged rollouts).
`pnpm cf-typegen`	Regenerates `cloudflare-env.d.ts` types after editing `wrangler.jsonc`.

Heads up: the Pages-based @cloudflare/next-on-pages is a different tool. We are not using Pages — we're deploying as a real Worker. Don't mix the two.

Step 2 — Wire OpenNext into `next dev`

So that pnpm dev can read your Cloudflare bindings (env vars, R2, KV, D1, …) the same way production will, edit next.config.mjs:

/** @type {import('next').NextConfig} */
const nextConfig = {};

if (process.env.NODE_ENV !== "production") {
  const { initOpenNextCloudflareForDev } = await import(
    "@opennextjs/cloudflare"
  );
  initOpenNextCloudflareForDev();
}

export default nextConfig;

We only call it in development so next build stays fast and CI doesn't spin up a Miniflare instance for nothing.

Step 3 — Local Environment Setup with `.dev.vars`

When working with Cloudflare Workers locally, Wrangler uses a file called .dev.vars to store environment variables (instead of .env.local used by Next.js).

A simple and reliable approach is to keep an example file in your repo and ignore the real one.

Example: `.dev.vars.example` (committed)

NEXT_PUBLIC_SUPABASE_URL="https://YOUR-PROJECT-ref.supabase.co"
NEXT_PUBLIC_SUPABASE_ANON_KEY="YOUR-ANON-KEY"
NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL="admin@example.com"

Set Up Your Local Environment

Run the following commands:

cp .dev.vars.example .dev.vars
cp .dev.vars .env.local

.dev.vars is used by Wrangler (wrangler dev)
.env.local is used by Next.js (next dev)

Why Use Both Files?

next dev reads from .env.local
wrangler dev (used in pnpm preview) reads from .dev.vars

Keeping both files in sync ensures your app behaves consistently in development and when running in the Cloudflare runtime.
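Since the two files are maintained by hand, they can drift apart without anyone noticing. The sketch below compares them. parseEnvFile and envDrift are hypothetical helper names, and the parser only handles simple NAME=value lines with optional quotes, not every dotenv feature.

```javascript
// Hypothetical helpers: detect differences between .dev.vars and .env.local.
function parseEnvFile(text) {
  const vars = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks and comments
    const eq = line.indexOf("=");
    if (eq === -1) continue; // not a NAME=value line
    const name = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    // Strip one pair of matching single or double quotes
    const quote = value[0];
    if (value.length >= 2 && (quote === '"' || quote === "'") && value.endsWith(quote)) {
      value = value.slice(1, -1);
    }
    vars[name] = value;
  }
  return vars;
}

function envDrift(devVarsText, envLocalText) {
  const a = parseEnvFile(devVarsText);
  const b = parseEnvFile(envLocalText);
  const names = new Set([...Object.keys(a), ...Object.keys(b)]);
  // A name drifts if it is missing from one file or has a different value
  return [...names].filter((name) => a[name] !== b[name]).sort();
}
```

A small script could read both files with Node's fs module, pass their contents to envDrift, and exit with a non-zero status when the returned list is not empty.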

Update `.gitignore`

Make sure these files are ignored:

.dev.vars
.env*.local
.open-next
.wrangler

Step 4 — Deploy Your App from Your Local Machine

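Once pnpm preview is working correctly, you're ready to deploy your application. Use pnpm run deploy rather than pnpm deploy, because the bare form invokes pnpm's built-in deploy command for workspace packages instead of the script defined in package.json:

pnpm run deploy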

Under the hood that runs:

pnpm cloudflare-build && wrangler deploy

The first time, Wrangler will:

Compile your app to .open-next/worker.js.
Upload the script + assets to Cloudflare.
Print your live URL, e.g. https://porfolio.<your-subdomain>.workers.dev.

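Open it in a browser. Congratulations, you're now running on Cloudflare's edge network, which spans 330+ cities. The page should typically be served with a low TTFB (under 100 ms in many regions), although the exact figure depends on where your data lives.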

Here's the live version of my own portfolio deployed this way

Step 5 — Push Your Secrets to the Worker

Local .dev.vars is not uploaded by wrangler deploy. You have to push secrets explicitly:

# wrangler is a local devDependency, so run it through pnpm
pnpm exec wrangler secret put NEXT_PUBLIC_SUPABASE_URL
pnpm exec wrangler secret put NEXT_PUBLIC_SUPABASE_ANON_KEY
pnpm exec wrangler secret put NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL

Each command prompts you for the value and stores it encrypted on Cloudflare. Or do it visually:

Cloudflare Dashboard → Workers & Pages → your worker → Settings → Variables and Secrets → Add.

Important: NEXT_PUBLIC_* vars are inlined into the client bundle at build time, so they also need to be available when pnpm cloudflare-build runs (locally, that's your .env.local; in CI, see Step 6).
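Because a missing NEXT_PUBLIC_* value only shows up later as undefined in the browser, a pre-build check can catch it earlier. This is a minimal sketch: missingPublicVars is a hypothetical helper, and the list of names is taken from the .dev.vars.example file shown earlier.

```javascript
// Hypothetical pre-build check for the variables that get inlined at build time.
const REQUIRED_PUBLIC_VARS = [
  "NEXT_PUBLIC_SUPABASE_URL",
  "NEXT_PUBLIC_SUPABASE_ANON_KEY",
  "NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL",
];

function missingPublicVars(env, required = REQUIRED_PUBLIC_VARS) {
  // Treat undefined and whitespace-only values as missing
  return required.filter(
    (name) => typeof env[name] !== "string" || env[name].trim() === ""
  );
}

// Usage sketch: report what the current environment lacks
const missing = missingPublicVars(
  typeof process !== "undefined" ? process.env : {}
);
if (missing.length > 0) {
  console.warn(`Build would inline empty values for: ${missing.join(", ")}`);
}
```

In a real script you would likely exit with a non-zero status instead of warning, so the build stops before producing a bundle with missing values.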

Step 6 — Set Up Continuous Deployment with GitHub Actions

Once your local deployment is working, the next step is automating deployments so every push to the main branch updates production automatically.

With this workflow:

Pull requests will run validation checks
Production deploys only happen after successful builds
Broken code never reaches your live site

Create the following file inside your project:

.github/workflows/deploy.yml

name: CI / Deploy to Cloudflare Workers

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:

concurrency:
  group: cloudflare-deploy-${{ github.ref }}
  cancel-in-progress: true

jobs:
  verify:
    name: Lint and Build
    runs-on: ubuntu-latest
    timeout-minutes: 10

    steps:
      - uses: actions/checkout@v4

      - uses: pnpm/action-setup@v4
        with:
          version: 10

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm

      - run: pnpm install --frozen-lockfile
      - run: pnpm lint
      - run: pnpm build
        env:
          NEXT_PUBLIC_SUPABASE_URL: ${{ secrets.NEXT_PUBLIC_SUPABASE_URL }}
          NEXT_PUBLIC_SUPABASE_ANON_KEY: ${{ secrets.NEXT_PUBLIC_SUPABASE_ANON_KEY }}
          NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL: ${{ secrets.NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL }}

  deploy:
    name: Deploy to Cloudflare Workers
    needs: verify
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    timeout-minutes: 15

    steps:
      - uses: actions/checkout@v4

      - uses: pnpm/action-setup@v4
        with:
          version: 10

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm

      - run: pnpm install --frozen-lockfile

      - name: Build and Deploy
        run: pnpm run deploy
        env:
          CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          CLOUDFLARE_ACCOUNT_ID: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
          NEXT_PUBLIC_SUPABASE_URL: ${{ secrets.NEXT_PUBLIC_SUPABASE_URL }}
          NEXT_PUBLIC_SUPABASE_ANON_KEY: ${{ secrets.NEXT_PUBLIC_SUPABASE_ANON_KEY }}
          NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL: ${{ secrets.NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL }}

Required GitHub repo secrets

Go to GitHub repo → Settings → Secrets and variables → Actions → New repository secret and add:

Secret	Where to get it
`CLOUDFLARE_API_TOKEN`	https://dash.cloudflare.com/profile/api-tokens → "Edit Cloudflare Workers" template
`CLOUDFLARE_ACCOUNT_ID`	Cloudflare dashboard → right sidebar, "Account ID"
`CLOUDFLARE_ACCOUNT_SUBDOMAIN`	Your `*.workers.dev` subdomain (used only for the deployment URL link)
`NEXT_PUBLIC_SUPABASE_URL`	Supabase project settings
`NEXT_PUBLIC_SUPABASE_ANON_KEY`	Supabase project settings
`NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL`	Email pre-filled on `/dashboard/login`

That's it. Push it to main and it'll go live in about 90 seconds. PRs run lint and build only, so broken code never reaches production.
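If one of these secrets is missing, the failure tends to surface late as an unclear build or deploy error. As a rough sketch (not part of the workflow above), a small preflight function could fail fast instead. The function name `require_env` is hypothetical; the variable names are the ones from the table.

```shell
# Preflight check: report every expected variable that is unset or empty.
# require_env is a hypothetical helper, not something the workflow above defines.
require_env() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

You could call it in a step before the deploy, for example `require_env CLOUDFLARE_API_TOKEN CLOUDFLARE_ACCOUNT_ID NEXT_PUBLIC_SUPABASE_URL`, so the job stops with a readable message rather than a half-finished build.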

Step 7 — Updating the Project (the Daily Workflow)

After the initial setup, the loop is boringly simple — which is the whole point. Here's what I actually do day-to-day:

Code Change

git checkout -b feat/new-section
# ...edit files...
pnpm dev                # iterate locally
pnpm preview            # final smoke test on the Worker runtime
git commit -am "feat: add new section"
git push origin feat/new-section

Open a PR and the verify job runs. Review and merge it, and the deploy job ships to Cloudflare automatically.

Updating env Vars / Secrets

# Local
nano .dev.vars

# Production
wrangler secret put NEXT_PUBLIC_SUPABASE_URL
# ...etc.

Final Thoughts

When I started this migration, I was nervous about leaving Vercel — the Next.js DX there is genuinely excellent. But the moment you push beyond a hobby site, Cloudflare's economics and edge performance are not close.

With @opennextjs/cloudflare, the developer experience has also caught up: my pnpm dev loop is identical, my pnpm preview mimics production, and git push deploys globally in ~90 seconds.

If you've been holding off because the old Cloudflare Pages + Next.js story was rough, that era is over. Try this runbook on a side project this weekend and see for yourself.

If you found this useful, the full repo is here — feel free to clone it as a starter.

Happy shipping.

— Tarikul

How I Built a Production-Ready CI/CD Pipeline for a Monorepo-Based Microservices System with Jenkins, Docker Compose, and Traefik

Md Tarikul Islam — Thu, 23 Apr 2026 18:11:20 +0000

This tutorial is a complete, real-world guide to building a production-ready CI/CD pipeline using Jenkins, Docker Compose, and Traefik on a single Linux server.

You’ll learn how to expose services on a custom domain with auto-renewing HTTPS, and implement a smart deployment strategy that detects changes and redeploys only the affected microservices. This helps avoid unnecessary full-stack redeploys. We'll also cover real production issues and the exact fixes for each one.

1. What you'll build
2. Architecture
3. Server prerequisites
4. Traefik — the reverse proxy
5. Run Jenkins in Docker
6. Expose Jenkins on a domain via Traefik
7. First-time Jenkins setup
8. Add the GitHub credential
9. Create the pipeline job
10. The Jenkinsfile (deploy only what changed)
11. End-to-end test
12. Troubleshooting — every error we hit
13. Mental model: host vs. container
14. Daily operations cheat sheet
15. What I'd do differently next time
Closing thoughts

1. What You'll Build

In this tutorial, you'll build a Jenkins instance running inside Docker on the same Linux server as your application stack.

Traefik will act as a reverse proxy in front of Jenkins, exposing it via a clean URL (https://jenkins.example.com) with auto-renewing Let's Encrypt certificates.

You'll also create a Jenkinsfile in your application repository that:

Automatically triggers on every push to the staging branch,
Detects which microservices changed in each commit,
Pulls the latest code on the host machine,
Rebuilds and restarts only the affected services.

On every push, only the relevant services are redeployed.

Prerequisites

Before jumping in, this guide assumes you’re already comfortable with a few core concepts and tools.

This isn't a beginner-level tutorial — we’ll be working directly with infrastructure, containers, and CI/CD pipelines.

You should be familiar with:

Basic Linux commands (SSH, file system navigation, permissions)
Docker fundamentals (images, containers, volumes, networks)
Git workflows (clone, pull, branches)
General idea of CI/CD pipelines

Tools and environment required:

A Linux server (Ubuntu recommended)
Docker Engine + Docker Compose (v2)
A domain name (for Traefik + HTTPS)
GitHub repository (for your backend project)
Basic understanding of microservices architecture

If you’re comfortable with the above, you’re ready to follow along.

2. Architecture

Here's an overview of the architecture:

┌──────────────────────────── Linux server (Ubuntu) ────────────────────────────┐
│                                                                               │
│   /home/developer/projects/                                                  │
│       └── projects-prod-configs/            ← infra repo (compose, Traefik) │
│              ├── docker-compose.staging.yml                                   │
│              ├── traefik.staging.yml                                          │
│              └── projects-backend/         ← app repo (services, gateways) │
│                     ├── Jenkinsfile                                           │
│                     ├── docker-compose.staging.yml                            │
│                     └── apps/                                                 │
│                            ├── services//                               │
│                            ├── gateways//                               │
│                            └── core//                                   │
│                                                                               │
│   ┌─────────────────────── Docker network: proxy ──────────────────────┐      │
│   │  traefik (80, 443)                                                 │      │
│   │     │                                                              │      │
│   │     ├──► jenkins  (projects-jenkins-staging)                     │      │
│   │     │      ↳ /projects  ← bind-mount of the host project tree     │      │
│   │     │      ↳ /var/run/docker.sock ← controls host Docker           │      │
│   │     │                                                              │      │
│   │     └──► your services & gateways (built by the pipeline)          │      │
│   └────────────────────────────────────────────────────────────────────┘      │
│                                                                               │
└───────────────────────────────────────────────────────────────────────────────┘
            ▲
            │  webhook on push
            │
   GitHub: /projects-backend (branch: staging)

There are two key ideas here:

Jenkins runs in a container, but it controls the host's Docker by mounting /var/run/docker.sock. It also bind-mounts the project folder as /projects/..., so it can cd into the real code on the host and run docker compose there.
The Jenkinsfile lives inside the app repo, so the pipeline definition is versioned with the code. Jenkins simply points at it.

3. Server Prerequisites

Before we start configuring Jenkins or Traefik, we need to prepare the server properly.

In this step, we’ll:

Create a dedicated Linux user for managing the project
Install Docker and Docker Compose
Set up the folder structure for our repositories

This ensures our CI/CD pipeline runs in a clean and predictable environment.

# Linux user that owns the project tree
sudo adduser developer

# Docker engine + Compose plugin
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker developer

# Sanity check Compose v2
docker compose version
# -> Docker Compose version v2.x.y

# Find where the Compose plugin binary lives — write it down, you'll need it
ls /usr/libexec/docker/cli-plugins/docker-compose
# (some distros use /usr/lib/docker/cli-plugins/docker-compose)

# Project layout
sudo mkdir -p /home/developer/projects
sudo chown -R developer:developer /home/developer/projects

# Clone both repos in the right place
cd /home/developer/projects
git clone https://github.com//projects-prod-configs.git
cd projects-prod-configs
git clone -b staging https://github.com//projects-backend.git

You should now have:

/home/developer/projects/projects-prod-configs/projects-backend

Memorize this path — your Jenkinsfile references it.

DNS

Point an A-record for your Jenkins subdomain to the server's public IP before the next steps so Let's Encrypt can validate via HTTP challenge:

jenkins.example.com   A

4. Traefik — the Reverse Proxy

Traefik acts as the entry point to your entire system. Instead of exposing each service manually with ports, Traefik automatically:

Routes traffic based on domain names
Generates and renews HTTPS certificates using Let’s Encrypt
Connects to Docker and detects services dynamically

In simple terms, Traefik lets you access services like:

https://jenkins.example.com
https://api.example.com

…without manually configuring NGINX or managing SSL certificates.

In this setup, Traefik watches Docker containers and routes traffic using labels we'll define later.

Traefik gives every container a real domain and a real cert with zero per-service config — you just add a few labels.

`traefik.staging.yml` (static config)

Put this at the root of your infra repo:

api:
  dashboard: true

entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"

certificatesResolvers:
  letsencrypt:
    acme:
      httpChallenge:
        entryPoint: web
      email: admin@example.com           # ← change me
      storage: /etc/traefik/acme.json

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"
    exposedByDefault: false              # only containers with traefik.enable=true
    network: proxy
  file:
    directory: /etc/traefik/dynamic
    watch: true

log:
  level: INFO

accessLog: {}

The Traefik service in `docker-compose.staging.yml`

networks:
  proxy:
    name: proxy
    driver: bridge
  internal:
    name: internal
    driver: bridge

volumes:
  acme-data:
  traefik-logs:
  jenkins-data:

services:
  traefik:
    image: traefik:v2.11
    container_name: projects-traefik-staging
    restart: unless-stopped
    ports:
      - "80:80"        # HTTP (auto-redirects to HTTPS)
      - "443:443"      # HTTPS
      - "8080:8080"    # Traefik dashboard (internal only — protect via firewall)
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik.staging.yml:/etc/traefik/traefik.yml:ro
      - ./dynamic:/etc/traefik/dynamic:ro
      - acme-data:/etc/traefik           # persists Let's Encrypt certs
      - traefik-logs:/var/log/traefik
    networks:
      - proxy
    command:
      - '--api.insecure=false'
      - '--api.dashboard=true'
      - '--providers.docker=true'
      - '--providers.docker.exposedbydefault=false'
      - '--providers.docker.network=proxy'
      - '--entrypoints.web.address=:80'
      - '--entrypoints.websecure.address=:443'
      - '--entrypoints.web.http.redirections.entryPoint.to=websecure'
      - '--entrypoints.web.http.redirections.entryPoint.scheme=https'
      - '--certificatesresolvers.letsencrypt.acme.httpchallenge=true'
      - '--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web'
      - '--certificatesresolvers.letsencrypt.acme.email=${ACME_EMAIL:-admin@example.com}'
      - '--certificatesresolvers.letsencrypt.acme.storage=/etc/traefik/acme.json'
      - '--log.level=INFO'
      - '--accesslog=true'
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"
      # Traefik's own dashboard
      - "traefik.http.routers.traefik-dash.rule=Host(`traefik.example.com`)"
      - "traefik.http.routers.traefik-dash.entrypoints=websecure"
      - "traefik.http.routers.traefik-dash.tls.certresolver=letsencrypt"
      - "traefik.http.routers.traefik-dash.service=api@internal"

Bring it up:

cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d traefik

Watch the logs the first time — Traefik will request a cert for the dashboard host as soon as DNS resolves.

docker logs -f projects-traefik-staging

Tip. While testing, point ACME at the Let's Encrypt staging endpoint (certificatesresolvers.letsencrypt.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory) so you don't burn through Let's Encrypt's rate limits if you misconfigure DNS. Remove that flag before going live.

5. Run Jenkins in Docker

Add this Jenkins service to the same docker-compose.staging.yml. Every line matters (and the comments explain why).

  jenkins:
    image: jenkins/jenkins:lts
    container_name: projects-jenkins-staging
    restart: unless-stopped
    user: root                           # to use host docker.sock without UID juggling
    environment:
      - JAVA_OPTS=-Xmx1g -Xms512m -Duser.timezone=Asia/Dhaka
      - TZ=Asia/Dhaka                    # OS-level timezone inside container
      - JENKINS_OPTS=--prefix=/
    ports:
      - "3095:8080"                      # web UI (also reachable directly if needed)
      - "50000:50000"                    # inbound agent port
    volumes:
      - jenkins-data:/var/jenkins_home   # Jenkins config/jobs/secrets persistence
      - /var/run/docker.sock:/var/run/docker.sock                          # control host Docker
      - /usr/bin/docker:/usr/bin/docker                                     # docker CLI from host
      - /usr/libexec/docker/cli-plugins:/usr/libexec/docker/cli-plugins:ro  # docker compose plugin
      - /home/developer/projects:/projects                                # project tree
      - /etc/localtime:/etc/localtime:ro                                    # match host clock
      - /etc/timezone:/etc/timezone:ro
    networks:
      - proxy
      - internal
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8080/login']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
    deploy:
      resources:
        limits:
          memory: 1024M

Why user: root? It's the simplest way to share docker.sock and the project bind-mount without UID/GID gymnastics. If you prefer an unprivileged user, you'll need to add the container user to the host's docker group (for example with Compose's group_add) and align UIDs/perms on host folders, which is possible but out of scope here.

6. Expose Jenkins on a Domain via Traefik

This is the section many guides skip. We'll add labels to the Jenkins service so Traefik picks it up automatically. No editing of Traefik config required.

  jenkins:
    # ... everything above ...
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"

      # 1) Router — match incoming Host
      - "traefik.http.routers.jenkins.rule=Host(`jenkins.example.com`)"
      - "traefik.http.routers.jenkins.entrypoints=websecure"
      - "traefik.http.routers.jenkins.tls.certresolver=letsencrypt"
      - "traefik.http.routers.jenkins.service=jenkins"

      # 2) Service — tell Traefik which container port is the app
      - "traefik.http.services.jenkins.loadbalancer.server.port=8080"

      # 3) Middleware — Jenkins needs X-Forwarded-Proto so it knows it's behind HTTPS
      - "traefik.http.middlewares.jenkins-headers.headers.customrequestheaders.X-Forwarded-Proto=https"
      - "traefik.http.routers.jenkins.middlewares=jenkins-headers"

What each line does:

Label	Purpose
`traefik.enable=true`	Opts this container in (we set `exposedByDefault=false`).
`traefik.docker.network=proxy`	Tells Traefik which network to talk to Jenkins on (Jenkins is on both `proxy` and `internal`).
`routers.jenkins.rule=Host(...)`	Forwards only this hostname to Jenkins.
`routers.jenkins.entrypoints=websecure`	Listens only on 443. (HTTP redirect was set up in section 4.)
`routers.jenkins.tls.certresolver=letsencrypt`	Auto-issues + renews the cert.
`services.jenkins.loadbalancer.server.port=8080`	Jenkins listens on 8080 inside the container.
`customrequestheaders.X-Forwarded-Proto=https`	Without this, Jenkins generates `http://` URLs in webhooks/links and breaks.

Bring Jenkins up:

cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d jenkins

# Watch Traefik issue the certificate
docker logs -f projects-traefik-staging | grep -i acme

After 10–60 seconds you should be able to open https://jenkins.example.com and see Jenkins's setup wizard with a valid lock icon.

Inside Jenkins (after first login):

Manage Jenkins → System → Jenkins URL → set this to: https://jenkins.example.com/

This is important because Jenkins uses this base URL to generate:

Webhook endpoints (for GitHub triggers)
Links inside emails and build logs

If this isn't set correctly, GitHub webhooks may fail, and any links Jenkins generates will point to the wrong address (often localhost or internal IPs).

7. First-Time Jenkins Setup

If you're running Jenkins for the first time on this server, follow this section to complete the initial setup.

If you already have Jenkins configured, you can skip this section — but make sure the required plugins and settings match what we use later in this guide.

Open https://jenkins.example.com. Get the initial admin password:

docker exec projects-jenkins-staging cat /var/jenkins_home/secrets/initialAdminPassword

Paste it, choose Install suggested plugins.
Create your admin user.
Manage Jenkins → Plugins → Available and install:
- GitHub (and GitHub Branch Source)
- Pipeline: GitHub
- Credentials Binding (usually preinstalled)

That's all the plugins you need for the rest of this guide.

8. Add the GitHub Credential

Jenkins needs permission to access your GitHub repository.

This is done using a GitHub Personal Access Token (PAT), which acts like a password for secure API and Git operations.

We’ll store this token inside Jenkins as a credential so it can pull code during pipeline execution and authenticate securely without exposing secrets in code.

This single credential is used both for the SCM checkout and for the deploy-time git pull.

Create a Personal Access Token (classic) on GitHub with repo scope.
In Jenkins: Manage Jenkins → Credentials → System → Global → Add Credentials.
Fill in:
- Kind: Username with password
- Username: your GitHub username
- Password: the token
- ID: github_classic_token (the Jenkinsfile references this exact ID)

9. Create the Pipeline Job

Now that Jenkins has access to your repository, the next step is to define how deployments should run.

A pipeline job tells Jenkins:

where your code lives,
which branch to monitor,
and how to execute your deployment process.

In Jenkins, create a new Pipeline job and connect it to your GitHub repository. Once this is set up, Jenkins will automatically trigger deployments whenever you push to the staging branch.

Start by creating a new job:

New Item → Pipeline → name it projects-staging → OK

Then configure the job:

Under Build Triggers, enable:
GitHub hook trigger for GITScm polling
Under Pipeline:
- Definition: Pipeline script from SCM
- SCM: Git
- Repository URL: https://github.com//projects-backend.git
- Credentials: github_classic_token
- Branch: */staging
- Script Path: Jenkinsfile

Save the configuration.

At this point, Jenkins is fully connected to your repository and ready to run your deployment pipeline automatically.

10. The Jenkinsfile (Deploy Only What Changed)

Place this at the root of the app repo (projects-backend/Jenkinsfile), branch staging.

pipeline {
  agent any

  environment {
    PROJECT_PATH = "/projects/projects-prod-configs/projects-backend"
    COMPOSE_FILE = "docker-compose.staging.yml"
  }

  stages {

    stage('Checkout') {
      steps {
        checkout scm
        echo "Checkout completed for branch: ${env.BRANCH_NAME ?: 'staging'}"
      }
    }

    stage('Detect Changes') {
      steps {
        script {
          def changedFiles = sh(
            script: "git diff --name-only HEAD~1 HEAD",
            returnStdout: true
          ).trim()

          echo "Changed files:\n${changedFiles}"

          def services = [] as Set
          changedFiles.split('\n').each { file ->
            def svc  = file =~ /^apps\/services\/([a-z0-9-]+)\//
            def gw   = file =~ /^apps\/gateways\/([a-z0-9-]+)\//
            def core = file =~ /^apps\/core\/([a-z0-9-]+)\//
            if (svc)  { services << svc[0][1]  }
            if (gw)   { services << gw[0][1]   }
            if (core) { services << core[0][1] }
          }
          services = services.findAll { !it.endsWith('-e2e') }
          env.CHANGED_SERVICES = services.join(' ')

          echo "Services to deploy: ${env.CHANGED_SERVICES ?: '(none)'}"
        }
      }
    }

    stage('Deploy') {
      when { expression { return env.CHANGED_SERVICES?.trim() } }
      steps {
        withCredentials([usernamePassword(
          credentialsId: 'github_classic_token',
          usernameVariable: 'GIT_USER',
          passwordVariable: 'GIT_TOKEN'
        )]) {
          sh '''
            set -eu
            git config --global --add safe.directory "${PROJECT_PATH}"
            cd "${PROJECT_PATH}"
            git remote set-url origin "https://github.com//projects-backend.git"
            git -c credential.helper= \
                -c "credential.helper=!f() { echo username=${GIT_USER}; echo password=${GIT_TOKEN}; }; f" \
                pull origin staging
            docker compose -f "${COMPOSE_FILE}" up -d --build ${CHANGED_SERVICES}
          '''
        }
        echo "Deployed: ${env.CHANGED_SERVICES}"
      }
    }

    stage('Skip Deployment') {
      when { expression { return !env.CHANGED_SERVICES?.trim() } }
      steps { echo "No service changes detected — nothing to deploy." }
    }
  }
}

Why each tricky line is there:

git config --global --add safe.directory ... — git refuses to operate on a repo whose owner UID differs from the current user's. The repo on disk is owned by developer, but Git inside the container runs as root. This whitelists the path.
git remote set-url origin "https://..." — flips the on-disk remote to HTTPS so the token can be used. (A PAT can't authenticate git@github.com: URLs — those use SSH.) Idempotent — safe to re-run.
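git -c credential.helper="!f() { echo username=...; echo password=...; }; f": feeds the username/token to git for that one command without writing the token to disk. Because the helper string is double-quoted, the shell expands the variables before git starts, so the token is briefly visible in git's process arguments on the host.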
${CHANGED_SERVICES} is unquoted on purpose so multiple service names expand as separate args.
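The Detect Changes stage is easier to reason about if you can run the same matching logic outside Jenkins. Here is a minimal shell sketch of it, assuming the same apps/services, apps/gateways, and apps/core layout. The function name `changed_services` is hypothetical, and this is an approximation of the Groovy stage, not a replacement for it.

```shell
# Reads changed file paths on stdin and prints the unique service names to deploy.
# Mirrors the Jenkinsfile regexes: apps/<services|gateways|core>/<name>/...
changed_services() {
  sed -nE 's#^apps/(services|gateways|core)/([a-z0-9-]+)/.*#\2#p' \
    | { grep -v -e '-e2e$' || true; } \
    | sort -u
}
```

For example, `git diff --name-only HEAD~1 HEAD | changed_services` prints one service name per line, skipping files outside apps/ and any folder ending in -e2e.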

11. End-to-End Test

Before considering the setup complete, we need to verify that the entire pipeline works as expected.

This end-to-end test ensures that:

GitHub webhooks are triggering Jenkins correctly,
Jenkins can detect which services changed,
and only the affected services are rebuilt and deployed.

In other words, this simulates a real production deployment.

Start by making a small change in your repository. For example, modify a file inside:

apps/gateways/student-apigw/

Then push the change to the staging branch.

Once pushed, Jenkins should automatically trigger via the webhook. If not, you can manually click Build Now.

Now open the build’s Console Output and verify the flow. You should see something like:

Checkout completed for branch: staging
Services to deploy: student-apigw
git pull origin staging (successful)
docker compose ... up -d --build student-apigw
Deployed: student-apigw

If you see this sequence, your pipeline is working correctly.

If anything fails, don’t worry — jump to Section 12 where every common issue and its fix is documented.

12. Troubleshooting — Every Error We Hit

This section covers real issues we faced while setting up this pipeline — and more importantly, why each fix works. Understanding the “why” will help you debug similar problems in your own setup.

cd: can't cd to /projects/projects-prod-configs/projects-backend

Cause:
The Jenkinsfile runs cd $PROJECT_PATH, but inside the container that path doesn’t exist. This usually happens when:

the project wasn’t cloned on the host, or
the bind mount isn’t configured correctly.

Fix:

ls /home/developer/projects/projects-prod-configs/projects-backend
# If missing: git clone -b staging  there.

Confirm the bind mount:

docker inspect projects-jenkins-staging --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{println}}{{end}}'

If missing, recreate the container:

docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins

Why this works:

Jenkins runs inside a container, but your code lives on the host. The bind mount connects them. Without it, Jenkins cannot access your project directory.

fatal: detected dubious ownership in repository

Cause:
Git blocks access when the repository owner differs from the current user.

Repo owner: developer (host)
Git runs as: root (inside container)

Fix:

git config --global --add safe.directory "${PROJECT_PATH}"

Why this works:

This explicitly tells Git that the directory is trusted, bypassing ownership mismatch security restrictions.
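One side effect worth knowing: git config --add appends a new entry every time it runs, so a pipeline that calls it on each build slowly accumulates duplicate safe.directory lines in the global config. A small sketch of an idempotent variant follows. The helper name `trust_dir` is hypothetical, and the Jenkinsfile in this guide does not use it.

```shell
# Add a directory to git's global safe.directory list only if it is not there yet.
# trust_dir is a hypothetical helper name.
trust_dir() {
  dir="$1"
  # --get-all prints every configured value, one per line;
  # grep -Fx matches the whole line as a literal string.
  if git config --global --get-all safe.directory 2>/dev/null | grep -Fxq -- "$dir"; then
    return 0
  fi
  git config --global --add safe.directory "$dir"
}
```

In the Deploy stage you would call `trust_dir "${PROJECT_PATH}"` in place of the plain git config line.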

`Host key verification failed` / `Could not read from remote repository`

Cause:

The repository uses SSH (git@github.com:...), but:

the container has no SSH keys
no known_hosts file exists

Also, GitHub tokens cannot authenticate over SSH.

Fix (recommended):

git remote set-url origin "https://github.com//projects-backend.git"

Why this works:

HTTPS uses token-based authentication (PAT), which works inside containers without SSH configuration.

`unknown shorthand flag: 'f' in -f` (`docker compose`)

Cause:
The Docker CLI exists, but the Docker Compose plugin is missing inside the container.

Fix:

volumes:
  - /usr/libexec/docker/cli-plugins:/usr/libexec/docker/cli-plugins:ro

Find your path if needed:

find /usr -name docker-compose -type f 2>/dev/null

Verify:

docker exec projects-jenkins-staging docker compose version

Why this works:

Docker Compose v2 is a CLI plugin. Mounting this directory makes the docker compose command available inside the container.
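Since the plugin directory differs between distros, you can check the candidates in one go before editing the volume mount. This is a minimal sketch; the function name `find_compose_plugin` is hypothetical, and the two directories in the usage note are the ones mentioned earlier in this guide.

```shell
# Print the first directory that contains an executable docker-compose plugin.
# Returns non-zero if none of the given directories has one.
find_compose_plugin() {
  for dir in "$@"; do
    if [ -x "$dir/docker-compose" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}
```

Run it on the host as `find_compose_plugin /usr/libexec/docker/cli-plugins /usr/lib/docker/cli-plugins` and use the printed directory on both sides of the bind mount.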

Wrong timezone in build timestamps and Jenkins UI

Fix: Set both env var and JVM flag, and bind-mount the host's clock files:

environment:
  - TZ=Asia/Dhaka
  - JAVA_OPTS=... -Duser.timezone=Asia/Dhaka
volumes:
  - /etc/localtime:/etc/localtime:ro
  - /etc/timezone:/etc/timezone:ro

You must recreate the container for env-var changes to take effect:

docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins

Why this works:
Jenkins runs on Java, which uses its own timezone separate from the OS.
By aligning OS timezone, JVM timezone, and host clock, you ensure consistent timestamps everywhere.

ERR_SOCKET_TIMEOUT (pnpm install fails)

Cause:

If you have multiple services building in parallel and each runs pnpm install with ~1500 packages, the network gets saturated and a timeout occurs.

Fixes:

a) Increase timeout + control concurrency

RUN pnpm install --frozen-lockfile --ignore-scripts \
    --network-timeout 600000 \
    --network-concurrency 8

Why: Gives pnpm more time and reduces network overload.

b) Enable pnpm cache (BuildKit)

RUN --mount=type=cache,id=pnpm-store,target=/root/.local/share/pnpm/store \
    pnpm install --frozen-lockfile --ignore-scripts

Why: Dependencies are cached and reused instead of downloading every time.

c) Avoid unnecessary rebuilds

docker compose -f $COMPOSE_FILE build $CHANGED_SERVICES
docker compose -f $COMPOSE_FILE up -d --no-build $CHANGED_SERVICES

Why: Only changed services are rebuilt → less network load → fewer failures.
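A further option, which the pipeline in this guide does not use, is to retry the build step a few times when the network is flaky. The sketch below is a generic retry wrapper. The name `retry` and its argument order (max attempts, delay in seconds, then the command) are assumptions made for illustration.

```shell
# retry <max-attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempt limit is reached.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed, retrying in ${delay}s" >&2
    sleep "$delay"
    n=$((n + 1))
  done
}
```

In the Deploy stage this could look like `retry 3 10 docker compose -f "$COMPOSE_FILE" build $CHANGED_SERVICES`. Retrying hides the symptom rather than the cause, so it works best combined with the cache mount from option (b).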

Container changes don’t apply after editing docker-compose.yml

Cause:

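docker compose up -d only recreates a container when Compose detects a change to its configuration or image, so an already-running container can be left as it is.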

Fix:

docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins

Why this works:

This forces Docker to recreate the container with updated configuration (env, volumes, labels).

Traefik shows default certificate (no HTTPS)

Common causes:

DNS not pointing to server
Port 80 blocked
Wrong Docker network

Check:

dig +short jenkins.example.com
docker logs projects-traefik-staging 2>&1 | grep -i acme

Why this works:

Let’s Encrypt uses HTTP-01 challenge, so it must reach your server via port 80. If DNS or networking is wrong, certificate issuance fails.

Jenkins: "Reverse proxy setup is broken"

Fix:

Set the Jenkins URL to https://jenkins.example.com/
Ensure header:

X-Forwarded-Proto: https

Why this works:

Jenkins needs to know it's behind HTTPS. Without this, it generates incorrect URLs (http instead of https), breaking redirects and webhooks.

13. Mental Model: Host vs. Container

Many setup mistakes come from confusing the host filesystem with the container filesystem. This table makes it explicit:

Inside the Jenkins container	Comes from on the host
`/var/jenkins_home`	docker volume `jenkins-data` (Jenkins config, jobs, secrets)
`/projects/...`	`/home/developer/projects/...` (your project tree)
`/usr/bin/docker`	host's `/usr/bin/docker`
`/usr/libexec/docker/cli-plugins/docker-compose`	host plugin (lets `docker compose` work)
`/var/run/docker.sock`	host Docker daemon (so builds happen on the host's engine)
`/etc/localtime`, `/etc/timezone`	host clock
`~/.ssh`	nothing — that's why SSH-to-GitHub doesn't work without extra setup

When debugging, always ask: "Inside which filesystem is this command running, and does the file/folder it's looking for exist there?"
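To make the mapping concrete, here is a tiny sketch that translates a path seen inside the Jenkins container back to the host path, assuming the /home/developer/projects:/projects bind mount from section 5. The function name `to_host_path` is hypothetical.

```shell
# Translate a path inside the Jenkins container to the matching host path.
# Only the /projects bind mount is covered; anything else returns non-zero,
# because named volumes such as /var/jenkins_home have no host path in the project tree.
to_host_path() {
  case "$1" in
    /projects)   printf '%s\n' "/home/developer/projects" ;;
    /projects/*) printf '%s\n' "/home/developer/projects/${1#/projects/}" ;;
    *)           return 1 ;;
  esac
}
```

For instance, `to_host_path /projects/projects-prod-configs/projects-backend` prints the directory you cloned in section 3, which is the place to look when a pipeline step says a file is missing.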

14. Daily Operations Cheat Sheet

# Recreate Jenkins after changing compose
cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins

# Tail Jenkins logs
docker logs -f projects-jenkins-staging

# Open a shell inside the Jenkins container
docker exec -it projects-jenkins-staging bash

# From inside the container — sanity checks
docker compose version
ls /projects/projects-prod-configs/projects-backend
git -C /projects/projects-prod-configs/projects-backend remote -v

# Manually trigger the same deploy the pipeline does
cd /projects/projects-prod-configs/projects-backend
git pull origin staging
docker compose -f docker-compose.staging.yml up -d --build student-apigw

# Inspect Traefik routing decisions
docker logs projects-traefik-staging 2>&1 | grep -i jenkins

# Check renewed certs
docker exec projects-traefik-staging cat /etc/traefik/acme.json | head -50

15. What I'd Do Differently Next Time

Pre-build a base image with all node_modules baked in. With ~1500 packages × 15 services, every clean build re-downloads ~22k tarballs. A shared base could cut that by roughly 90%.
Run a private npm proxy (Verdaccio / Nexus / GitHub Packages) on the same Docker network — eliminates flaky npmjs.org timeouts entirely.
Per-service Jenkinsfile if your services drift apart in tooling. With one Jenkinsfile, every team contends for the same pipeline definition.
Replace git diff HEAD~1 HEAD with git diff $(git merge-base HEAD origin/staging~1) HEAD so squash-merges and force-pushes don't accidentally skip services.
Move secrets to a vault (HashiCorp Vault / AWS Secrets Manager / Doppler). PATs in Jenkins work, but rotation across many jobs is painful.
Use Jenkins' Configuration-as-Code (JCasC) so the entire Jenkins setup (jobs, credentials definitions, plugins) is in git. Then a server rebuild is a one-command operation.
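On the change-detection point, another approach (different from the merge-base command above) is to diff against the last commit Jenkins built successfully, so a failed build doesn't cause its services to be skipped next time. The sketch below assumes the Git plugin's GIT_PREVIOUS_SUCCESSFUL_COMMIT variable is available in your job, which depends on how the checkout is done. The function name `diff_base` is hypothetical.

```shell
# Pick the commit to diff against when detecting changed services.
# Assumption: Jenkins exports GIT_PREVIOUS_SUCCESSFUL_COMMIT for this job.
# Falls back to HEAD~1 (the behaviour of the current Jenkinsfile) when it is unset or empty.
diff_base() {
  if [ -n "${GIT_PREVIOUS_SUCCESSFUL_COMMIT:-}" ]; then
    printf '%s\n' "$GIT_PREVIOUS_SUCCESSFUL_COMMIT"
  else
    printf '%s\n' "HEAD~1"
  fi
}
```

The Detect Changes stage would then run `git diff --name-only "$(diff_base)" HEAD`. Verify that the variable is actually populated in your setup before relying on it.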

Closing Thoughts

The pipeline itself is just three stages: Checkout → Detect Changes → Deploy — but a real production setup is mostly about plumbing: reverse proxy, certificates, bind-mounts, credentials, timezones, build caches. None of these are exotic. Together they decide whether your Friday-afternoon deploy goes silently green or eats your weekend.

Follow sections 1–11 to get a working pipeline. Bookmark section 12 to keep it working.

Happy shipping.
