
Dead Letter Queue

Capture workflow items that fail processing after all retry attempts are exhausted. Store them safely for investigation, manual correction, and replay — never silently lose data.


When to Use This Pattern

Use a dead letter queue (DLQ) when:

  • Your workflow processes items that can fail (API calls, data transforms, external systems)
  • The Retry with Exponential Backoff pattern has been exhausted — all retries failed
  • You cannot afford to lose data — every item must be eventually processed
  • Operations teams need a way to investigate, fix, and replay failed items

How It Works

After the final retry fails, the workflow writes the failed item and its full error details to the dead letter queue and alerts the operations team. The team investigates the entry, fixes the root cause, and replays the item through the original workflow. Successful replays are marked and eventually archived; unrecoverable items are abandoned.

Implementation Guide

Step 1: Design the Dead Letter Record

Each DLQ entry should capture everything needed to understand and replay the failure:

| Field | Description | Example |
| --- | --- | --- |
| `dlq_id` | Unique identifier | `DLQ-2025-0042` |
| `source_workflow` | Which workflow sent this | Invoice Processing |
| `original_payload` | The complete input data | `{ "invoiceId": "INV-1234", "amount": 5200, ... }` |
| `error_message` | What went wrong | "API returned 500: Internal Server Error" |
| `error_code` | Structured error code | `HTTP_500` |
| `retry_count` | How many retries were attempted | 4 |
| `last_retry_at` | When the last retry failed | 2025-03-15T14:22:33Z |
| `status` | Current DLQ status | new / investigating / fixed / replayed / abandoned |
| `assigned_to` | Who is investigating | ops-team@company.com |
| `resolution_notes` | What was done to fix it | "Vendor API was down. Replayed successfully after recovery." |
| `created_on` | When the item entered the DLQ | 2025-03-15T14:22:33Z |
| `replayed_on` | When the item was successfully reprocessed | 2025-03-16T09:15:00Z |

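The record above can be sketched as a small data class. Field names mirror the table; the types and defaults are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeadLetterRecord:
    """One DLQ entry; captures everything needed to understand and replay a failure."""
    dlq_id: str                           # e.g. "DLQ-2025-0042"
    source_workflow: str                  # e.g. "Invoice Processing"
    original_payload: dict                # complete input data
    error_message: str                    # e.g. "API returned 500: Internal Server Error"
    error_code: str                       # structured code, e.g. "HTTP_500"
    retry_count: int                      # retries attempted before giving up
    last_retry_at: str                    # ISO 8601 timestamp of the final failed retry
    status: str = "new"                   # new / investigating / fixed / replayed / abandoned
    assigned_to: Optional[str] = None     # who is investigating
    resolution_notes: Optional[str] = None
    created_on: Optional[str] = None      # when the item entered the DLQ
    replayed_on: Optional[str] = None     # when it was successfully reprocessed
```

New entries start in the `new` status with no assignee, matching the lifecycle in the table.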
Step 2: Route Failures to the DLQ

After the last retry fails:

  1. Serialize the complete state — the original input, any partial outputs, and the full error chain
  2. Write to the DLQ — a SharePoint list, database table, or dedicated queue
  3. Alert the operations team — send a notification with the DLQ ID and a summary
  4. Mark the original workflow item as "failed-to-dlq" so it's not picked up by the next batch run
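The four steps above can be sketched as a single handler called after the last retry fails. Here `dlq_store`, `notify`, and `mark_failed` are placeholders for your storage, alerting, and source-system integrations (all assumptions).

```python
import json
import uuid
from datetime import datetime, timezone

def route_to_dlq(item, error, retry_count, dlq_store, notify, mark_failed):
    """Called after the final retry has failed; moves the item to the DLQ."""
    record = {
        "dlq_id": f"DLQ-{uuid.uuid4().hex[:8]}",
        "source_workflow": item["workflow"],
        # 1. Serialize the complete state, including the error detail.
        "original_payload": json.dumps(item["payload"]),
        "error_message": str(error),
        "retry_count": retry_count,
        "last_retry_at": datetime.now(timezone.utc).isoformat(),
        "status": "new",
    }
    dlq_store.append(record)                    # 2. Write to the DLQ.
    notify(f"{record['dlq_id']}: {error}")      # 3. Alert the ops team with ID + summary.
    mark_failed(item, "failed-to-dlq")          # 4. Exclude from the next batch run.
    return record["dlq_id"]
```

In a real workflow the store would be a SharePoint list, database table, or queue rather than an in-memory list.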
Step 3: Build an Investigation Interface

Give the ops team a dashboard (or simple list view) showing:

  • All DLQ items sorted by age (oldest first)
  • Filter by source workflow, error type, status
  • The full payload and error details for each item
  • Action buttons: Investigate, Reassign, Replay, Abandon
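The list view's query logic can be sketched as a filter plus an oldest-first sort; the record shape follows the table in step 1 (an assumption).

```python
def dlq_view(records, source=None, error_code=None, status=None):
    """Filter DLQ records by workflow, error type, or status, oldest first."""
    rows = [
        r for r in records
        if (source is None or r["source_workflow"] == source)
        and (error_code is None or r["error_code"] == error_code)
        and (status is None or r["status"] == status)
    ]
    # Oldest items first, so they are triaged before they are forgotten.
    return sorted(rows, key=lambda r: r["created_on"])
```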
Step 4: Implement the Replay Mechanism

When the root cause is fixed, the ops team clicks "Replay":

  1. Read the original payload from the DLQ record
  2. Re-submit it to the processing workflow
  3. If it succeeds → update DLQ status to "replayed" with timestamp
  4. If it fails again → increment retry count, leave in DLQ
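The replay flow above can be sketched as follows, where `process` re-runs the workflow on the original payload and raises on failure (an assumed interface).

```python
from datetime import datetime, timezone

def replay(record, process):
    """Re-submit a DLQ record to the processing workflow; returns True on success."""
    try:
        process(record["original_payload"])   # 2. Re-submit to the workflow.
    except Exception:
        record["retry_count"] += 1            # 4. Failed again: stay in the DLQ.
        return False
    record["status"] = "replayed"             # 3. Success: mark it replayed...
    record["replayed_on"] = datetime.now(timezone.utc).isoformat()  # ...with timestamp.
    return True
```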
Step 5: Set Retention Policies
| DLQ Status | Retention |
| --- | --- |
| New / Investigating | Indefinite — must be resolved |
| Replayed (success) | 90 days, then archive |
| Abandoned | 1 year, then delete |
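A scheduled retention sweep over the table above might look like this sketch, assuming `created_on` is a timezone-aware datetime and `archive`/`delete` are your own hooks.

```python
from datetime import datetime, timedelta, timezone

# Retention windows from the table; statuses with no entry are kept indefinitely.
RETENTION = {"replayed": timedelta(days=90), "abandoned": timedelta(days=365)}

def sweep(records, archive, delete, now=None):
    """Archive or delete DLQ records whose retention window has expired."""
    now = now or datetime.now(timezone.utc)
    for r in records:
        window = RETENTION.get(r["status"])
        if window is None or now - r["created_on"] < window:
            continue                  # new/investigating items stay until resolved
        if r["status"] == "replayed":
            archive(r)                # 90 days, then archive
        else:
            delete(r)                 # abandoned: 1 year, then delete
```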

Tips & Best Practices

Warning

The DLQ is not a trash can. Every item in it represents a failure that needs resolution. Set up alerts if the DLQ grows beyond a threshold (e.g., >10 items) — it might indicate a systemic problem.

  • Make replay idempotent. The item might have been partially processed before failing. Replaying it should not create duplicate records. Use idempotency keys.
  • Classify error types. Some errors are always transient (API timeout) and can be auto-replayed after a delay. Others (validation error) need human correction before replay.
  • Monitor DLQ age. Alert if any item has been in the DLQ for more than 48 hours without being assigned. Old items are forgotten items.
  • Include the workflow version. If the workflow was updated between failure and replay, note which version created the DLQ entry. The fix might have already been deployed.
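The idempotency advice above can be sketched as a guard around the processing step: work already completed under a key is never repeated. Here `seen` stands in for a persistent key store (an assumption).

```python
def process_once(payload, idempotency_key, seen, handler):
    """Run `handler` at most once per idempotency key; replays return the prior result."""
    if idempotency_key in seen:
        return seen[idempotency_key]   # already processed: no duplicate records
    result = handler(payload)
    seen[idempotency_key] = result     # record the key only after success
    return result
```

Deriving the key from the business identity of the item (e.g. the invoice ID) rather than the DLQ entry keeps replays safe even if the item was partially processed before failing.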

Related Patterns