Saga Pattern¶

Distributed systems face a fundamental challenge: operations that span multiple services or data stores cannot use traditional ACID transactions. The saga pattern addresses this by decomposing a distributed operation into a sequence of local transactions, each with a compensating action that can undo its effects if a subsequent step fails.

The Distributed Transaction Problem¶

Consider an e-commerce order that must:

Reserve inventory in the warehouse service
Charge the customer's payment method
Create a shipping request
Update the order status

In a monolithic system with a single database, this would be a single transaction: all steps succeed or all are rolled back. In a distributed system, each service owns its data, and no single transaction can span them all.

The naive approach (hoping all steps succeed) leads to inconsistent state when failures occur:

Inventory reserved but payment fails → phantom reservation
Payment charged but shipping fails → customer charged, no delivery
Partial failures leave the system in states that violate business invariants

Saga Fundamentals¶

A saga is a sequence of local transactions where:

Each transaction updates data within a single service
Each transaction publishes an event or triggers the next step
Each transaction has a compensating transaction that semantically undoes its effects
If a transaction fails, compensating transactions are executed in reverse order

Forward Flow (Success):
T1 → T2 → T3 → T4 → Complete

Compensation Flow (T3 fails):
T1 → T2 → T3(fail) → C2 → C1 → Aborted

Compensation vs Rollback¶

Compensation is not rollback. A compensating transaction cannot undo time; it can only create new transactions that semantically reverse the effect.

Original Transaction	Compensation	Note
Reserve 5 items	Release 5 items	Inventory returns to available
Charge $100	Refund $100	Creates refund record, not deletion
Create shipment	Cancel shipment	May fail if already shipped
Send email	Send cancellation email	Cannot unsend original

Some compensations may be impossible (email sent, physical item shipped). Saga design must account for non-compensatable actions by ordering them last.

Saga Coordination Patterns¶

Choreography¶

In choreography, each service listens for events and decides locally what to do next. No central coordinator exists.

Implementation:

// Order Service - initiates saga
public class OrderService {

    public Order createOrder(OrderRequest request) {
        Order order = new Order(request);
        order.setStatus(OrderStatus.PENDING);
        orderRepository.save(order);

        // Publish event to start saga
        eventPublisher.publish(new OrderCreatedEvent(order));

        return order;
    }

    @EventListener
    public void onPaymentCharged(PaymentChargedEvent event) {
        Order order = orderRepository.findById(event.getOrderId());
        order.setStatus(OrderStatus.PAYMENT_CONFIRMED);
        orderRepository.save(order);
    }

    @EventListener
    public void onPaymentFailed(PaymentFailedEvent event) {
        Order order = orderRepository.findById(event.getOrderId());
        order.setStatus(OrderStatus.CANCELLED);
        orderRepository.save(order);
        // Inventory service will also receive this and release reservation
    }
}

// Inventory Service - reacts to OrderCreated
public class InventoryService {

    @EventListener
    public void onOrderCreated(OrderCreatedEvent event) {
        try {
            reserveInventory(event.getOrderId(), event.getItems());
            eventPublisher.publish(new InventoryReservedEvent(event.getOrderId()));
        } catch (InsufficientInventoryException e) {
            eventPublisher.publish(new InventoryReservationFailedEvent(
                event.getOrderId(), e.getMessage()));
        }
    }

    @EventListener
    public void onPaymentFailed(PaymentFailedEvent event) {
        // Compensate: release the reservation
        releaseReservation(event.getOrderId());
        eventPublisher.publish(new InventoryReleasedEvent(event.getOrderId()));
    }
}

Advantages: - Simple to implement initially - Loose coupling between services - No single point of failure

Disadvantages: - Difficult to understand the complete flow - Cyclic dependencies can emerge - Testing requires understanding event sequences - Hard to add new steps or change order

Orchestration¶

In orchestration, a central coordinator (orchestrator) directs the saga, telling each participant what to do and when.

Implementation:

public class OrderSagaOrchestrator {

    public void execute(OrderRequest request) {
        UUID sagaId = UUID.randomUUID();
        SagaState state = createSagaState(sagaId, request);

        try {
            // Step 1: Reserve inventory
            state.setCurrentStep("RESERVE_INVENTORY");
            saveSagaState(state);

            InventoryReservation reservation = inventoryService.reserve(
                request.getItems(), sagaId);
            state.setInventoryReservationId(reservation.getId());

            // Step 2: Process payment
            state.setCurrentStep("PROCESS_PAYMENT");
            saveSagaState(state);

            PaymentResult payment = paymentService.charge(
                request.getCustomerId(), request.getTotal(), sagaId);
            state.setPaymentId(payment.getId());

            // Step 3: Create shipment
            state.setCurrentStep("CREATE_SHIPMENT");
            saveSagaState(state);

            Shipment shipment = shippingService.createShipment(
                request.getShippingAddress(), request.getItems(), sagaId);
            state.setShipmentId(shipment.getId());

            // Step 4: Confirm order
            state.setCurrentStep("CONFIRM_ORDER");
            state.setStatus(SagaStatus.COMPLETED);
            saveSagaState(state);

            orderService.confirmOrder(request.getOrderId());

        } catch (Exception e) {
            compensate(state, e);
        }
    }

    private void compensate(SagaState state, Exception cause) {
        state.setStatus(SagaStatus.COMPENSATING);
        state.setFailureReason(cause.getMessage());
        saveSagaState(state);

        // Compensate in reverse order
        String failedStep = state.getCurrentStep();

        if (shouldCompensate(failedStep, "CREATE_SHIPMENT") &&
            state.getShipmentId() != null) {
            shippingService.cancelShipment(state.getShipmentId());
        }

        if (shouldCompensate(failedStep, "PROCESS_PAYMENT") &&
            state.getPaymentId() != null) {
            paymentService.refund(state.getPaymentId());
        }

        if (shouldCompensate(failedStep, "RESERVE_INVENTORY") &&
            state.getInventoryReservationId() != null) {
            inventoryService.releaseReservation(state.getInventoryReservationId());
        }

        state.setStatus(SagaStatus.COMPENSATED);
        saveSagaState(state);

        orderService.cancelOrder(state.getOrderId(), cause.getMessage());
    }
}

Advantages: - Easy to understand the complete flow - Centralized control and monitoring - Simpler to add/modify steps - Better for complex sagas with many steps

Disadvantages: - Orchestrator is a single point of failure - Tighter coupling to orchestrator - Can become a bottleneck

Saga State Management with Cassandra¶

The saga's state must be persisted to survive crashes. Cassandra provides durable storage for saga coordination.

Saga State Schema¶

CREATE TABLE saga_instances (
    saga_id UUID,
    saga_type TEXT,
    status TEXT,                    -- STARTED, RUNNING, COMPENSATING, COMPLETED, FAILED
    current_step TEXT,
    started_at TIMESTAMP,
    updated_at TIMESTAMP,
    completed_at TIMESTAMP,
    payload BLOB,                   -- Serialized saga data
    step_results MAP<TEXT, BLOB>,   -- Results from each step
    failure_reason TEXT,
    PRIMARY KEY (saga_id)
);

CREATE TABLE saga_steps (
    saga_id UUID,
    step_name TEXT,
    step_order INT,
    status TEXT,                    -- PENDING, RUNNING, COMPLETED, FAILED, COMPENSATED
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    request_data BLOB,
    response_data BLOB,
    error_message TEXT,
    retry_count INT,
    PRIMARY KEY ((saga_id), step_order)
) WITH CLUSTERING ORDER BY (step_order ASC);

State Machine Implementation¶

Model the saga as a state machine:

public class SagaStateMachine {

    private final SagaDefinition definition;
    private SagaInstance instance;

    public void advance() {
        while (true) {
            SagaStep currentStep = definition.getStep(instance.getCurrentStep());

            if (currentStep == null) {
                // No more steps - saga complete
                instance.setStatus(SagaStatus.COMPLETED);
                instance.setCompletedAt(Instant.now());
                saveInstance();
                return;
            }

            try {
                // Execute current step
                StepResult result = executeStep(currentStep);
                recordStepResult(currentStep, result);

                // Move to next step
                instance.setCurrentStep(currentStep.getNextStep());
                instance.setUpdatedAt(Instant.now());
                saveInstance();

            } catch (Exception e) {
                handleStepFailure(currentStep, e);
                return;
            }
        }
    }

    private void handleStepFailure(SagaStep failedStep, Exception e) {
        if (failedStep.isRetryable() &&
            getRetryCount(failedStep) < failedStep.getMaxRetries()) {

            incrementRetryCount(failedStep);
            scheduleRetry(failedStep);
            return;
        }

        // Start compensation
        instance.setStatus(SagaStatus.COMPENSATING);
        instance.setFailureReason(e.getMessage());
        saveInstance();

        compensate();
    }

    private void compensate() {
        List<SagaStep> completedSteps = getCompletedStepsInReverseOrder();

        for (SagaStep step : completedSteps) {
            if (step.hasCompensation()) {
                try {
                    executeCompensation(step);
                    markStepCompensated(step);
                } catch (Exception e) {
                    // Compensation failed - requires manual intervention
                    instance.setStatus(SagaStatus.COMPENSATION_FAILED);
                    alertOperations(instance, step, e);
                    saveInstance();
                    return;
                }
            }
        }

        instance.setStatus(SagaStatus.COMPENSATED);
        instance.setCompletedAt(Instant.now());
        saveInstance();
    }
}

Handling Failure Scenarios¶

Step Failure with Successful Compensation¶

The normal compensation path:

T1(success) → T2(success) → T3(fail)
                              ↓
              C2(success) ← compensation starts
                   ↓
              C1(success)
                   ↓
              Saga Compensated

Compensation Failure¶

When compensation fails, human intervention may be required:

public class CompensationFailureHandler {

    public void handleCompensationFailure(SagaInstance saga,
                                          SagaStep step,
                                          Exception e) {
        // Log detailed information for operations team
        log.error("Compensation failed for saga {} step {}: {}",
            saga.getId(), step.getName(), e.getMessage(), e);

        // Create incident ticket
        incidentService.createIncident(
            IncidentSeverity.HIGH,
            "Saga Compensation Failure",
            String.format("Saga %s failed to compensate step %s. " +
                "Manual intervention required.", saga.getId(), step.getName()),
            Map.of(
                "saga_id", saga.getId().toString(),
                "saga_type", saga.getType(),
                "failed_step", step.getName(),
                "error", e.getMessage()
            )
        );

        // Persist failure state for manual resolution
        saga.setStatus(SagaStatus.COMPENSATION_FAILED);
        saga.setRequiresManualIntervention(true);
        sagaRepository.save(saga);
    }
}

Timeout Handling¶

Steps may hang indefinitely. Implement timeouts and treat them as failures:

public StepResult executeStepWithTimeout(SagaStep step, Duration timeout) {
    CompletableFuture<StepResult> future = CompletableFuture.supplyAsync(
        () -> step.execute(instance.getPayload())
    );

    try {
        return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        future.cancel(true);
        throw new StepTimeoutException(step, timeout);
    }
}

Idempotent Steps¶

Steps may be retried after partial execution. Ensure idempotency:

public class IdempotentInventoryReservation {

    public InventoryReservation reserve(UUID sagaId, List<Item> items) {
        // Check if already reserved
        InventoryReservation existing = reservationRepository
            .findBySagaId(sagaId);

        if (existing != null) {
            log.info("Reservation already exists for saga {}", sagaId);
            return existing;
        }

        // Create new reservation
        InventoryReservation reservation = new InventoryReservation(sagaId, items);
        // ... reserve logic ...

        return reservationRepository.save(reservation);
    }
}

Saga Recovery¶

After a system crash, incomplete sagas must be recovered.

Recovery Process¶

@Scheduled(fixedDelay = 60000)  // Every minute
public void recoverIncompleteSagas() {
    Instant staleThreshold = Instant.now().minus(Duration.ofMinutes(5));

    List<SagaInstance> staleSagas = sagaRepository
        .findByStatusInAndUpdatedAtBefore(
            List.of(SagaStatus.RUNNING, SagaStatus.COMPENSATING),
            staleThreshold
        );

    for (SagaInstance saga : staleSagas) {
        log.info("Recovering stale saga {}", saga.getId());

        try {
            if (saga.getStatus() == SagaStatus.RUNNING) {
                // Determine if last step completed
                if (lastStepCompleted(saga)) {
                    advanceSaga(saga);
                } else {
                    // Retry or compensate based on step configuration
                    handleIncompleteStep(saga);
                }
            } else if (saga.getStatus() == SagaStatus.COMPENSATING) {
                continueCompensation(saga);
            }
        } catch (Exception e) {
            log.error("Recovery failed for saga {}", saga.getId(), e);
        }
    }
}

Saga Timeout¶

Sagas that run too long may indicate problems:

@Scheduled(fixedDelay = 300000)  // Every 5 minutes
public void timeoutStaleSagas() {
    Instant timeout = Instant.now().minus(Duration.ofHours(1));

    List<SagaInstance> timedOutSagas = sagaRepository
        .findByStatusAndStartedAtBefore(SagaStatus.RUNNING, timeout);

    for (SagaInstance saga : timedOutSagas) {
        log.warn("Saga {} timed out after 1 hour, starting compensation", saga.getId());

        saga.setStatus(SagaStatus.COMPENSATING);
        saga.setFailureReason("Saga timeout exceeded");
        sagaRepository.save(saga);

        compensate(saga);
    }
}

Monitoring Sagas¶

Key Metrics¶

Metric	Description	Alert Threshold
`saga.started`	Sagas initiated per minute	Baseline deviation
`saga.completed`	Successfully completed	Drop indicates failures
`saga.compensated`	Required compensation	> 1% of started
`saga.compensation_failed`	Compensation failed	Any occurrence
`saga.duration_p99`	99th percentile duration	> SLA threshold
`saga.step.{name}.failures`	Per-step failure rate	Step-specific

Dashboard Queries¶

-- Active sagas by status
SELECT status, COUNT(*) as count
FROM saga_instances
WHERE started_at > ?
GROUP BY status;

-- Failure reasons
SELECT failure_reason, COUNT(*) as count
FROM saga_instances
WHERE status IN ('COMPENSATED', 'COMPENSATION_FAILED')
  AND completed_at > ?
GROUP BY failure_reason
ORDER BY count DESC;

-- Stuck sagas (running > 1 hour)
SELECT saga_id, saga_type, current_step, started_at
FROM saga_instances
WHERE status = 'RUNNING'
  AND updated_at < ?;

When to Use the Saga Pattern¶

Appropriate Use Cases¶

Multi-service transactions: Operations spanning multiple bounded contexts
Long-running processes: Operations that take minutes or hours
Eventual consistency acceptable: Business can tolerate temporary inconsistency
Compensatable operations: All steps can be semantically reversed

Consider Alternatives When¶

Single service: If all data is in one service, use local transactions
Strong consistency required: If temporary inconsistency is unacceptable, consider different architecture
Non-compensatable steps: If steps cannot be reversed (external API calls with no undo), saga may not fit
High-frequency operations: Saga overhead may be prohibitive for very high-volume operations

Summary¶

The saga pattern enables distributed transactions through:

Local transactions that each service can commit independently
Compensating transactions that semantically reverse completed steps
Coordination via choreography (events) or orchestration (central controller)
Persistent state that survives failures and enables recovery
Idempotent operations that handle retries safely

Cassandra provides durable storage for saga state, enabling recovery after failures. The pattern accepts temporary inconsistency (between step completion and final commit or compensation) in exchange for availability and partition tolerance.

Ledger Pattern - Financial transactions with sagas
Transactional Outbox - Reliable event publishing for choreography
Idempotency Patterns - Safe step retries
Event Sourcing - Events as coordination mechanism