Circuit Breaker
Stop calling a failing dependency before it drags you down with it. After a threshold of errors, open the circuit, fail fast for a cooldown period, then cautiously let traffic back in.
When to Use This Pattern
Use a circuit breaker when you call something that can be down or slow and you want to fail fast instead of fail slow:
- HTTP calls to third-party APIs
- Database queries that sometimes hang
- Internal services behind a mesh
- Any network boundary where slow equals bad
Retries and timeouts handle transient failures. A circuit breaker handles sustained failures — the thing is actually broken, and continuing to call it just wastes resources and amplifies the outage.
How It Works
The breaker is a state machine with three states:
- Closed — calls pass through. Count failures over a rolling window.
- Open — fail immediately without calling the dependency. Stays open for a cooldown (often 30–60 seconds).
- Half-Open — after the cooldown, let one request through as a probe. If it succeeds, close the circuit. If it fails, go back to open.
You need a rule for what counts as "failure" (HTTP 5xx, timeouts, specific exceptions) and a threshold (e.g. "5 failures in 30 seconds" or "failure rate > 50%"). Tune these to the dependency's actual behaviour, not a generic default.
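Here's a minimal sketch of that state machine in Python. Everything in it — the `CircuitBreaker` name, the default thresholds, the single-probe half-open behaviour — is illustrative rather than taken from any particular library, and it ignores thread safety for clarity:

```python
import enum
import time


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Count-based breaker: opens after N failures inside a rolling window."""

    def __init__(self, failure_threshold=5, window_seconds=30.0, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self.state = State.CLOSED
        self._failure_times = []   # timestamps of recent failures (rolling window)
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if time.monotonic() - self._opened_at < self.cooldown_seconds:
                raise CircuitOpenError("failing fast: circuit is open")
            self._transition(State.HALF_OPEN)   # cooldown over: this call is the probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_success(self):
        if self.state is State.HALF_OPEN:
            self._transition(State.CLOSED)      # probe succeeded: recover
            self._failure_times.clear()

    def _record_failure(self):
        now = time.monotonic()
        if self.state is State.HALF_OPEN:
            self._opened_at = now
            self._transition(State.OPEN)        # probe failed: back to open
            return
        # Only count failures that fall inside the rolling window.
        self._failure_times = [t for t in self._failure_times if now - t < self.window_seconds]
        self._failure_times.append(now)
        if len(self._failure_times) >= self.failure_threshold:
            self._opened_at = now
            self._transition(State.OPEN)

    def _transition(self, new_state):
        self.state = new_state  # a real breaker would emit a metric here (see below)
```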
Emit a metric on every state transition. "Circuit opened for payment-service" should be a visible alert — it means customers are about to feel pain.
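One way to wire that up, extending the sketch above — the reporting callback is a placeholder for whatever metrics or alerting client you actually use:

```python
def report_transition(name, old, new):
    # Placeholder: swap in your real metrics/alerting client.
    print(f"circuit_breaker.transition dependency={name} {old.value} -> {new.value}")


class ReportingCircuitBreaker(CircuitBreaker):
    """Hypothetical extension of the sketch above that reports every transition."""

    def __init__(self, name, on_transition=report_transition, **kwargs):
        super().__init__(**kwargs)
        self.name = name
        self._on_transition = on_transition

    def _transition(self, new_state):
        old = self.state
        super()._transition(new_state)
        self._on_transition(self.name, old, new_state)
```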
Implementation Guide
Step 1: Pick your threshold carefully
Count-based thresholds (5 failures in a row) are noisy at low traffic. Rate-based thresholds (50% failure rate over 30 seconds with at least 20 calls) behave better under realistic load. Use rate-based unless traffic is tiny.
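A rate-based trip condition can stay very small. This sketch mirrors the 50%-over-20-calls example above; treat the numbers as starting points to tune, not recommendations:

```python
def should_trip(outcomes, min_calls=20, max_failure_rate=0.5):
    """outcomes: (timestamp, succeeded) pairs already pruned to the rolling window."""
    if len(outcomes) < min_calls:
        return False   # too little traffic to judge a rate
    failures = sum(1 for _, ok in outcomes if not ok)
    return failures / len(outcomes) > max_failure_rate
```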
Step 2: Decide what "failure" means
- HTTP 5xx: yes. HTTP 4xx: usually no (that's a you-problem, not a them-problem).
- Timeouts: yes. Connection errors: yes.
- Specific exceptions: map them deliberately.
Wrong classification here produces circuits that trip when the dependency is fine.
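A deliberate mapping might look like this sketch, which follows the rules above (the function name and parameters are illustrative):

```python
def counts_as_failure(exc=None, status=None):
    """Map a call outcome to a breaker failure."""
    if status is not None:
        return status >= 500   # 5xx counts against the circuit; 4xx is the caller's problem
    # Timeouts and connection errors count; map other exceptions deliberately.
    return isinstance(exc, (TimeoutError, ConnectionError))
```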
Step 3: Pair it with a fallback
What does your code do when the circuit is open? Options:
- Return a cached/stale value
- Return a sensible default (empty list, zero score)
- Propagate the failure to the caller with a clean error
- Queue the request and replay it once the circuit closes
Pick one deliberately per call site.
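As an example of the first two options, a small wrapper around the `CircuitBreaker` sketch above (the helper name is illustrative):

```python
def call_with_fallback(breaker, func, fallback_value):
    """Serve a deliberate default (or cached value) when the circuit is open."""
    try:
        return breaker.call(func)
    except CircuitOpenError:
        return fallback_value

# e.g. recommendations degrade to an empty list instead of an error page:
# items = call_with_fallback(recs_breaker, fetch_recommendations, [])
```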
Step 4: Keep state per dependency
One breaker per (dependency, operation) pair. A single global breaker for "all external calls" will trip on the first broken service and take your whole system down.
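A lazy registry keyed by (dependency, operation) is usually enough — again a sketch built on the `CircuitBreaker` above:

```python
_breakers = {}


def breaker_for(dependency, operation):
    """One breaker per (dependency, operation) pair, created lazily on first use."""
    key = (dependency, operation)
    if key not in _breakers:
        _breakers[key] = CircuitBreaker()
    return _breakers[key]

# e.g. breaker_for("payment-service", "charge") and
#      breaker_for("payment-service", "refund") trip independently.
```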
Step 5: Surface state in your dashboards
Your on-call should see which breakers are open at a glance. Half-open probes should be visible too — they tell you whether a dependency is recovering.
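Given the registry above, a snapshot function can feed a dashboard or a health endpoint (names are illustrative):

```python
def breaker_snapshot():
    """Current state of every breaker, for a dashboard or health endpoint."""
    return {f"{dep}:{op}": b.state.value for (dep, op), b in _breakers.items()}
```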
Tips & Best Practices
- Don't share breaker state across instances unless you have a reason. Per-instance is fine and simpler.
- Combine with a bulkhead. Isolate thread/connection pools per dependency.
- Add jitter to cooldowns (see the sketch after this list). If all your instances come back at the same instant, the dependency gets slammed again.
- Test it. Breakers that never trip in tests will misbehave in prod. Inject failures deliberately in staging.
- Don't use it as a replacement for timeouts. Breakers + timeouts + retries work together. Each handles a different failure mode.
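The jitter mentioned above can be as simple as randomizing the cooldown around its base value, so instances don't re-probe in lockstep:

```python
import random


def jittered_cooldown(base_seconds=30.0, jitter=0.25):
    """Randomize the cooldown (here ±25%) so instances reopen at different times."""
    return base_seconds * random.uniform(1.0 - jitter, 1.0 + jitter)

# e.g. CircuitBreaker(cooldown_seconds=jittered_cooldown())
```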
Related patterns
Retry with Exponential Backoff
Gracefully handle transient API failures by retrying with increasing delays. Essential for any workflow that calls external services.
Dead Letter Queue
Capture workflow items that fail processing after all retry attempts are exhausted. Store them safely for investigation, manual correction, and replay — never silently lose data.