
Dead Letter Queue

Capture workflow items that fail processing after all retry attempts are exhausted. Store them safely for investigation, manual correction, and replay — never silently lose data.


When to Use This Pattern

Use a dead letter queue (DLQ) when:

  • Your workflow processes items that can fail (API calls, data transforms, external systems)
  • The Retry with Exponential Backoff pattern has been exhausted — all retries failed
  • You cannot afford to lose data — every item must be eventually processed
  • Operations teams need a way to investigate, fix, and replay failed items

How It Works

After the final retry fails, the workflow writes the failed item and its full error details to the dead letter queue and alerts the operations team. The team investigates the entry, fixes the root cause, and replays the item through the original workflow. Successful replays are marked and eventually archived; unrecoverable items are abandoned.

Implementation Guide

Step 1: Design the Dead Letter Record

Each DLQ entry should capture everything needed to understand and replay the failure:

| Field | Description | Example |
| --- | --- | --- |
| `dlq_id` | Unique identifier | `DLQ-2025-0042` |
| `source_workflow` | Which workflow sent this | Invoice Processing |
| `original_payload` | The complete input data | `{ "invoiceId": "INV-1234", "amount": 5200, ... }` |
| `error_message` | What went wrong | "API returned 500: Internal Server Error" |
| `error_code` | Structured error code | `HTTP_500` |
| `retry_count` | How many retries were attempted | 4 |
| `last_retry_at` | When the last retry failed | 2025-03-15T14:22:33Z |
| `status` | Current DLQ status | new / investigating / fixed / replayed / abandoned |
| `assigned_to` | Who is investigating | ops-team@company.com |
| `resolution_notes` | What was done to fix it | "Vendor API was down. Replayed successfully after recovery." |
| `created_on` | When the item entered the DLQ | 2025-03-15T14:22:33Z |
| `replayed_on` | When the item was successfully reprocessed | 2025-03-16T09:15:00Z |

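The record above can be sketched as a small data class. Field names mirror the table; the types and defaults are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeadLetterRecord:
    """One DLQ entry; captures everything needed to understand and replay a failure."""
    dlq_id: str                           # e.g. "DLQ-2025-0042"
    source_workflow: str                  # e.g. "Invoice Processing"
    original_payload: dict                # complete input data
    error_message: str                    # e.g. "API returned 500: Internal Server Error"
    error_code: str                       # structured code, e.g. "HTTP_500"
    retry_count: int                      # retries attempted before giving up
    last_retry_at: str                    # ISO 8601 timestamp of the final failed retry
    status: str = "new"                   # new / investigating / fixed / replayed / abandoned
    assigned_to: Optional[str] = None     # who is investigating
    resolution_notes: Optional[str] = None
    created_on: Optional[str] = None      # when the item entered the DLQ
    replayed_on: Optional[str] = None     # when it was successfully reprocessed
```

New entries start in the `new` status with no assignee, matching the lifecycle in the table.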
Step 2: Route Failures to the DLQ

After the last retry fails:

  1. Serialize the complete state — the original input, any partial outputs, and the full error chain
  2. Write to the DLQ — a SharePoint list, database table, or dedicated queue
  3. Alert the operations team — send a notification with the DLQ ID and a summary
  4. Mark the original workflow item as "failed-to-dlq" so it's not picked up by the next batch run
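The four steps above can be sketched as a single handler called after the last retry fails. Here `dlq_store`, `notify`, and `mark_failed` are placeholders for your storage, alerting, and source-system integrations (all assumptions).

```python
import json
import uuid
from datetime import datetime, timezone

def route_to_dlq(item, error, retry_count, dlq_store, notify, mark_failed):
    """Called after the final retry has failed; moves the item to the DLQ."""
    record = {
        "dlq_id": f"DLQ-{uuid.uuid4().hex[:8]}",
        "source_workflow": item["workflow"],
        # 1. Serialize the complete state, including the error detail.
        "original_payload": json.dumps(item["payload"]),
        "error_message": str(error),
        "retry_count": retry_count,
        "last_retry_at": datetime.now(timezone.utc).isoformat(),
        "status": "new",
    }
    dlq_store.append(record)                    # 2. Write to the DLQ.
    notify(f"{record['dlq_id']}: {error}")      # 3. Alert the ops team with ID + summary.
    mark_failed(item, "failed-to-dlq")          # 4. Exclude from the next batch run.
    return record["dlq_id"]
```

In a real workflow the store would be a SharePoint list, database table, or queue rather than an in-memory list.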
Step 3: Build an Investigation Interface

Give the ops team a dashboard (or simple list view) showing:

  • All DLQ items sorted by age (oldest first)
  • Filter by source workflow, error type, status
  • The full payload and error details for each item
  • Action buttons: Investigate, Reassign, Replay, Abandon
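The list view's query logic can be sketched as a filter plus an oldest-first sort; the record shape follows the table in step 1 (an assumption).

```python
def dlq_view(records, source=None, error_code=None, status=None):
    """Filter DLQ records by workflow, error type, or status, oldest first."""
    rows = [
        r for r in records
        if (source is None or r["source_workflow"] == source)
        and (error_code is None or r["error_code"] == error_code)
        and (status is None or r["status"] == status)
    ]
    # Oldest items first, so they are triaged before they are forgotten.
    return sorted(rows, key=lambda r: r["created_on"])
```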
Step 4: Implement the Replay Mechanism

When the root cause is fixed, the ops team clicks "Replay":

  1. Read the original payload from the DLQ record
  2. Re-submit it to the processing workflow
  3. If it succeeds → update DLQ status to "replayed" with timestamp
  4. If it fails again → increment retry count, leave in DLQ
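The replay flow above can be sketched as follows, where `process` re-runs the workflow on the original payload and raises on failure (an assumed interface).

```python
from datetime import datetime, timezone

def replay(record, process):
    """Re-submit a DLQ record to the processing workflow; returns True on success."""
    try:
        process(record["original_payload"])   # 2. Re-submit to the workflow.
    except Exception:
        record["retry_count"] += 1            # 4. Failed again: stay in the DLQ.
        return False
    record["status"] = "replayed"             # 3. Success: mark it replayed...
    record["replayed_on"] = datetime.now(timezone.utc).isoformat()  # ...with timestamp.
    return True
```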
Step 5: Set Retention Policies
| DLQ Status | Retention |
| --- | --- |
| New / Investigating | Indefinite — must be resolved |
| Replayed (success) | 90 days, then archive |
| Abandoned | 1 year, then delete |
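A scheduled retention sweep over the table above might look like this sketch, assuming `created_on` is a timezone-aware datetime and `archive`/`delete` are your own hooks.

```python
from datetime import datetime, timedelta, timezone

# Retention windows from the table; statuses with no entry are kept indefinitely.
RETENTION = {"replayed": timedelta(days=90), "abandoned": timedelta(days=365)}

def sweep(records, archive, delete, now=None):
    """Archive or delete DLQ records whose retention window has expired."""
    now = now or datetime.now(timezone.utc)
    for r in records:
        window = RETENTION.get(r["status"])
        if window is None or now - r["created_on"] < window:
            continue                  # new/investigating items stay until resolved
        if r["status"] == "replayed":
            archive(r)                # 90 days, then archive
        else:
            delete(r)                 # abandoned: 1 year, then delete
```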

Tips & Best Practices

Warning

The DLQ is not a trash can. Every item in it represents a failure that needs resolution. Set up alerts if the DLQ grows beyond a threshold (e.g., >10 items) — it might indicate a systemic problem.

  • Make replay idempotent. The item might have been partially processed before failing. Replaying it should not create duplicate records. Use idempotency keys.
  • Classify error types. Some errors are always transient (API timeout) and can be auto-replayed after a delay. Others (validation error) need human correction before replay.
  • Monitor DLQ age. Alert if any item has been in the DLQ for more than 48 hours without being assigned. Old items are forgotten items.
  • Include the workflow version. If the workflow was updated between failure and replay, note which version created the DLQ entry. The fix might have already been deployed.
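The idempotency advice above can be sketched as a guard around the processing step: work already completed under a key is never repeated. Here `seen` stands in for a persistent key store (an assumption).

```python
def process_once(payload, idempotency_key, seen, handler):
    """Run `handler` at most once per idempotency key; replays return the prior result."""
    if idempotency_key in seen:
        return seen[idempotency_key]   # already processed: no duplicate records
    result = handler(payload)
    seen[idempotency_key] = result     # record the key only after success
    return result
```

Deriving the key from the business identity of the item (e.g. the invoice ID) rather than the DLQ entry keeps replays safe even if the item was partially processed before failing.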

Related Patterns