Dead Letter Queue
Capture workflow items that fail processing after all retry attempts are exhausted. Store them safely for investigation, manual correction, and replay — never silently lose data.
Visual Flow
When to Use This Pattern
Use a dead letter queue (DLQ) when:
- Your workflow processes items that can fail (API calls, data transforms, external systems)
- The Retry with Exponential Backoff pattern has been exhausted — all retries failed
- You cannot afford to lose data — every item must eventually be processed
- Operations teams need a way to investigate, fix, and replay failed items
How It Works
Implementation Guide
Step 1: Design the Dead Letter Record
Each DLQ entry should capture everything needed to understand and replay the failure:
| Field | Description | Example |
|---|---|---|
| dlq_id | Unique identifier | DLQ-2025-0042 |
| source_workflow | Which workflow sent this | Invoice Processing |
| original_payload | The complete input data | { "invoiceId": "INV-1234", "amount": 5200, ... } |
| error_message | What went wrong | "API returned 500: Internal Server Error" |
| error_code | Structured error code | HTTP_500 |
| retry_count | How many retries were attempted | 4 |
| last_retry_at | When the last retry failed | 2025-03-15T14:22:33Z |
| status | Current DLQ status | new / investigating / fixed / replayed / abandoned |
| assigned_to | Who is investigating | ops-team@company.com |
| resolution_notes | What was done to fix it | "Vendor API was down. Replayed successfully after recovery." |
| created_on | When item entered DLQ | 2025-03-15T14:22:33Z |
| replayed_on | When item was successfully reprocessed | 2025-03-16T09:15:00Z |
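As a sketch, this record can be modeled as a simple data structure. The Python dataclass below mirrors the table's fields and is illustrative rather than a required schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional

@dataclass
class DeadLetterRecord:
    """One failed item, captured with enough context to investigate and replay."""
    dlq_id: str                       # e.g. "DLQ-2025-0042"
    source_workflow: str              # e.g. "Invoice Processing"
    original_payload: dict[str, Any]  # the complete input data, stored as-is
    error_message: str                # human-readable description of the failure
    error_code: str                   # structured code, e.g. "HTTP_500"
    retry_count: int                  # retries attempted before dead-lettering
    last_retry_at: datetime           # when the final retry failed
    status: str = "new"               # new / investigating / fixed / replayed / abandoned
    assigned_to: Optional[str] = None
    resolution_notes: Optional[str] = None
    created_on: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    replayed_on: Optional[datetime] = None
```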
Step 2: Route Failures to the DLQ
After the last retry fails:
- Serialize the complete state — the original input, any partial outputs, and the full error chain
- Write to the DLQ — a SharePoint list, database table, or dedicated queue
- Alert the operations team — send a notification with the DLQ ID and a summary
- Mark the original workflow item as "failed-to-dlq" so it's not picked up by the next batch run
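A minimal sketch of this routing step is shown below. The `dlq_store`, `notify_ops`, and `mark_item_status` arguments are hypothetical stand-ins for whatever storage (SharePoint list, database table, queue), alerting, and source-system clients you actually use.

```python
import json
import uuid
from datetime import datetime, timezone

def route_to_dlq(item, error, retry_count, dlq_store, notify_ops, mark_item_status):
    """Move a failed item into the DLQ after the last retry has failed."""
    dlq_id = f"DLQ-{datetime.now(timezone.utc):%Y}-{uuid.uuid4().hex[:8]}"

    record = {
        "dlq_id": dlq_id,
        "source_workflow": item["workflow_name"],
        # Serialize the complete state: original input, partial outputs, error chain.
        "original_payload": json.dumps(item["payload"]),
        "error_message": str(error),
        "error_code": getattr(error, "code", "UNKNOWN"),
        "retry_count": retry_count,
        "last_retry_at": datetime.now(timezone.utc).isoformat(),
        "status": "new",
        "created_on": datetime.now(timezone.utc).isoformat(),
    }

    dlq_store.insert(record)                              # write to the DLQ
    notify_ops(f"{dlq_id}: {item['workflow_name']} failed after "
               f"{retry_count} retries: {error}")         # alert the operations team
    mark_item_status(item["id"], "failed-to-dlq")         # keep it out of the next batch run
    return dlq_id
```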
Step 3: Build an Investigation Interface
Give the ops team a dashboard (or simple list view) showing:
- All DLQ items sorted by age (oldest first)
- Filter by source workflow, error type, status
- The full payload and error details for each item
- Action buttons: Investigate, Reassign, Replay, Abandon
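The backing query for that view can be a plain filter-and-sort over the DLQ store. In this sketch, `fetch_all` is an assumed method on whatever list or table client you use.

```python
def list_dlq_items(dlq_store, source_workflow=None, error_code=None, status=None):
    """Return DLQ items for the investigation view, oldest first."""
    items = dlq_store.fetch_all()  # stand-in for a list/table query

    def matches(item):
        return ((source_workflow is None or item["source_workflow"] == source_workflow)
                and (error_code is None or item["error_code"] == error_code)
                and (status is None or item["status"] == status))

    # Oldest first, so long-stuck items surface at the top of the dashboard.
    return sorted((i for i in items if matches(i)), key=lambda i: i["created_on"])
```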
Step 4: Implement the Replay Mechanism
When the root cause is fixed, the ops team clicks "Replay":
- Read the original payload from the DLQ record
- Re-submit it to the processing workflow
- If it succeeds → update DLQ status to "replayed" with timestamp
- If it fails again → increment retry count, leave in DLQ
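A sketch of the replay handler, assuming the same hypothetical `dlq_store` client and a `submit_to_workflow` callable that triggers the original processing workflow:

```python
import json
from datetime import datetime, timezone

def replay_dlq_item(dlq_record, dlq_store, submit_to_workflow):
    """Re-submit a fixed DLQ item to its original processing workflow."""
    payload = json.loads(dlq_record["original_payload"])
    try:
        submit_to_workflow(dlq_record["source_workflow"], payload)
    except Exception:
        # Failed again: keep it in the DLQ and record the extra attempt.
        dlq_store.update(dlq_record["dlq_id"], {
            "retry_count": dlq_record["retry_count"] + 1,
            "last_retry_at": datetime.now(timezone.utc).isoformat(),
        })
        raise
    dlq_store.update(dlq_record["dlq_id"], {
        "status": "replayed",
        "replayed_on": datetime.now(timezone.utc).isoformat(),
    })
```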
Step 5: Set Retention Policies
| DLQ Status | Retention |
|---|---|
| New / Investigating | Indefinite — must be resolved |
| Replayed (success) | 90 days, then archive |
| Abandoned | 1 year, then delete |
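These policies can run as a scheduled sweep. The sketch below assumes the same `dlq_store` plus an `archive_store` for cold storage; the retention windows mirror the table.

```python
from datetime import datetime, timedelta, timezone

RETENTION = {
    "replayed": timedelta(days=90),    # then archive
    "abandoned": timedelta(days=365),  # then delete
    # "new" and "investigating" items are kept indefinitely.
}

def apply_retention(dlq_store, archive_store):
    """Archive or delete DLQ items whose retention window has passed."""
    now = datetime.now(timezone.utc)
    for item in dlq_store.fetch_all():
        window = RETENTION.get(item["status"])
        if window is None:
            continue  # unresolved items are never purged
        reference = item.get("replayed_on") or item["created_on"]
        if now - datetime.fromisoformat(reference) < window:
            continue
        if item["status"] == "replayed":
            archive_store.insert(item)   # move successful replays to the archive
        dlq_store.delete(item["dlq_id"])
```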
Tips & Best Practices
- The DLQ is not a trash can. Every item in it represents a failure that needs resolution. Set up alerts if the DLQ grows beyond a threshold (e.g., >10 items) — it might indicate a systemic problem.
- Make replay idempotent. The item might have been partially processed before failing. Replaying it should not create duplicate records. Use idempotency keys.
- Classify error types. Some errors are always transient (API timeout) and can be auto-replayed after a delay. Others (validation error) need human correction before replay.
- Monitor DLQ age. Alert if any item has been in the DLQ for more than 48 hours without being assigned. Old items are forgotten items.
- Include the workflow version. If the workflow was updated between failure and replay, note which version created the DLQ entry. The fix might have already been deployed.
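The classification and alerting tips can be combined into a small triage pass over the queue. The error codes, threshold, and helper callables below are illustrative, not prescriptive.

```python
from datetime import datetime, timedelta, timezone

TRANSIENT_CODES = {"HTTP_500", "HTTP_502", "HTTP_503", "TIMEOUT"}  # safe to auto-replay
DLQ_DEPTH_THRESHOLD = 10                  # alert if the queue grows beyond this
MAX_UNASSIGNED_AGE = timedelta(hours=48)  # alert on forgotten items

def triage(dlq_items, notify_ops, schedule_auto_replay):
    """Auto-replay transient failures and raise alerts for systemic or stale problems."""
    now = datetime.now(timezone.utc)

    if len(dlq_items) > DLQ_DEPTH_THRESHOLD:
        notify_ops(f"DLQ depth is {len(dlq_items)} items; possible systemic problem.")

    for item in dlq_items:
        age = now - datetime.fromisoformat(item["created_on"])
        if item["status"] == "new" and item["error_code"] in TRANSIENT_CODES:
            schedule_auto_replay(item["dlq_id"])  # transient error: retry later without a human
        elif not item.get("assigned_to") and age > MAX_UNASSIGNED_AGE:
            notify_ops(f"{item['dlq_id']} has been unassigned for over 48 hours.")
```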
Related Patterns
- Retry with Exponential Backoff — what to do before items reach the DLQ
- Circuit Breaker — stop calling a failing dependency before it drags you down; after a threshold of errors, open the circuit, fail fast for a cooldown period, then cautiously let traffic back in
- Scheduled Batch Processor — a common source of DLQ items
- Human-in-the-Loop Review — for investigating DLQ items