Circuit Breaker
Stop calling a failing dependency before it drags you down with it. After a threshold of errors, open the circuit, fail fast for a cooldown period, then cautiously let traffic back in.
When to Use This Pattern
Use a circuit breaker when you call something that can be down or slow and you want to fail fast instead of fail slow:
- HTTP calls to third-party APIs
- Database queries that sometimes hang
- Internal services behind a mesh
- Any network boundary where slow equals bad
Retries and timeouts handle transient failures. A circuit breaker handles sustained failures — the thing is actually broken, and continuing to call it just wastes resources and amplifies the outage.
How It Works
The breaker is a state machine with three states:
- Closed — calls pass through. Count failures over a rolling window.
- Open — fail immediately without calling the dependency. Stays open for a cooldown (often 30–60 seconds).
- Half-Open — after the cooldown, let one request through as a probe. If it succeeds, close the circuit. If it fails, go back to open.
You need a rule for what counts as "failure" (HTTP 5xx, timeouts, specific exceptions) and a threshold (e.g. "5 failures in 30 seconds" or "failure rate > 50%"). Tune these to the dependency's actual behaviour, not a generic default.
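Here's a minimal sketch of that state machine in Python. Everything in it — the `CircuitBreaker` name, the default thresholds, the single-probe half-open behaviour — is illustrative rather than taken from any particular library, and it ignores thread safety for clarity:

```python
import enum
import time


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Count-based breaker: opens after N failures inside a rolling window."""

    def __init__(self, failure_threshold=5, window_seconds=30.0, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self.state = State.CLOSED
        self._failure_times = []   # timestamps of recent failures (rolling window)
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if time.monotonic() - self._opened_at < self.cooldown_seconds:
                raise CircuitOpenError("failing fast: circuit is open")
            self._transition(State.HALF_OPEN)   # cooldown over: this call is the probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_success(self):
        if self.state is State.HALF_OPEN:
            self._transition(State.CLOSED)      # probe succeeded: recover
            self._failure_times.clear()

    def _record_failure(self):
        now = time.monotonic()
        if self.state is State.HALF_OPEN:
            self._opened_at = now
            self._transition(State.OPEN)        # probe failed: back to open
            return
        # Only count failures that fall inside the rolling window.
        self._failure_times = [t for t in self._failure_times if now - t < self.window_seconds]
        self._failure_times.append(now)
        if len(self._failure_times) >= self.failure_threshold:
            self._opened_at = now
            self._transition(State.OPEN)

    def _transition(self, new_state):
        self.state = new_state  # a real breaker would emit a metric here (see below)
```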
Emit a metric on every state transition. "Circuit opened for payment-service" should be a visible alert — it means customers are about to feel pain.
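One way to wire that up, extending the sketch above — the reporting callback is a placeholder for whatever metrics or alerting client you actually use:

```python
def report_transition(name, old, new):
    # Placeholder: swap in your real metrics/alerting client.
    print(f"circuit_breaker.transition dependency={name} {old.value} -> {new.value}")


class ReportingCircuitBreaker(CircuitBreaker):
    """Hypothetical extension of the sketch above that reports every transition."""

    def __init__(self, name, on_transition=report_transition, **kwargs):
        super().__init__(**kwargs)
        self.name = name
        self._on_transition = on_transition

    def _transition(self, new_state):
        old = self.state
        super()._transition(new_state)
        self._on_transition(self.name, old, new_state)
```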
Implementation Guide
Step 1: Pick your threshold carefully
Count-based thresholds (5 failures in a row) are noisy at low traffic. Rate-based thresholds (50% failure rate over 30 seconds with at least 20 calls) behave better under realistic load. Use rate-based unless traffic is tiny.
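A rate-based trip condition can stay very small. This sketch mirrors the 50%-over-20-calls example above; treat the numbers as starting points to tune, not recommendations:

```python
def should_trip(outcomes, min_calls=20, max_failure_rate=0.5):
    """outcomes: (timestamp, succeeded) pairs already pruned to the rolling window."""
    if len(outcomes) < min_calls:
        return False   # too little traffic to judge a rate
    failures = sum(1 for _, ok in outcomes if not ok)
    return failures / len(outcomes) > max_failure_rate
```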
Step 2: Decide what "failure" means
- HTTP 5xx: yes. HTTP 4xx: usually no (that's a you-problem, not a them-problem).
- Timeouts: yes. Connection errors: yes.
- Specific exceptions: map them deliberately.
Wrong classification here produces circuits that trip when the dependency is fine.
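A deliberate mapping might look like this sketch, which follows the rules above (the function name and parameters are illustrative):

```python
def counts_as_failure(exc=None, status=None):
    """Map a call outcome to a breaker failure."""
    if status is not None:
        return status >= 500   # 5xx counts against the circuit; 4xx is the caller's problem
    # Timeouts and connection errors count; map other exceptions deliberately.
    return isinstance(exc, (TimeoutError, ConnectionError))
```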
Step 3: Pair it with a fallback
What does your code do when the circuit is open? Options:
- Return a cached/stale value
- Return a sensible default (empty list, zero score)
- Propagate the failure to the caller with a clean error
- Queue the request and replay it once the circuit closes
Pick one deliberately per call site.
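As an example of the first two options, a small wrapper around the `CircuitBreaker` sketch above (the helper name is illustrative):

```python
def call_with_fallback(breaker, func, fallback_value):
    """Serve a deliberate default (or cached value) when the circuit is open."""
    try:
        return breaker.call(func)
    except CircuitOpenError:
        return fallback_value

# e.g. recommendations degrade to an empty list instead of an error page:
# items = call_with_fallback(recs_breaker, fetch_recommendations, [])
```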
Step 4: Keep state per dependency
One breaker per (dependency, operation) pair. A single global breaker for "all external calls" will trip on the first broken service and take your whole system down.
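A lazy registry keyed by (dependency, operation) is usually enough — again a sketch built on the `CircuitBreaker` above:

```python
_breakers = {}


def breaker_for(dependency, operation):
    """One breaker per (dependency, operation) pair, created lazily on first use."""
    key = (dependency, operation)
    if key not in _breakers:
        _breakers[key] = CircuitBreaker()
    return _breakers[key]

# e.g. breaker_for("payment-service", "charge") and
#      breaker_for("payment-service", "refund") trip independently.
```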
Step 5: Surface state in your dashboards
Your on-call should see which breakers are open at a glance. Half-open probes should be visible too — they tell you whether a dependency is recovering.
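Given the registry above, a snapshot function can feed a dashboard or a health endpoint (names are illustrative):

```python
def breaker_snapshot():
    """Current state of every breaker, for a dashboard or health endpoint."""
    return {f"{dep}:{op}": b.state.value for (dep, op), b in _breakers.items()}
```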
Tips & Best Practices
- Don't share breaker state across instances unless you have a reason. Per-instance is fine and simpler.
- Combine with a bulkhead. Isolate thread/connection pools per dependency.
- Add jitter to cooldowns (see the sketch after this list). If all your instances come back at the same instant, the dependency gets slammed again.
- Test it. Breakers that never trip in tests will misbehave in prod. Inject failures deliberately in staging.
- Don't use it as a replacement for timeouts. Breakers + timeouts + retries work together. Each handles a different failure mode.
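The jitter mentioned above can be as simple as randomizing the cooldown around its base value, so instances don't re-probe in lockstep:

```python
import random


def jittered_cooldown(base_seconds=30.0, jitter=0.25):
    """Randomize the cooldown (here ±25%) so instances reopen at different times."""
    return base_seconds * random.uniform(1.0 - jitter, 1.0 + jitter)

# e.g. CircuitBreaker(cooldown_seconds=jittered_cooldown())
```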
Related patterns
Retry with Exponential Backoff
Gracefully handle transient API failures by retrying with increasing delays. Essential for any workflow that calls external services.
Dead Letter Queue
Capture workflow items that fail processing after all retry attempts are exhausted. Store them safely for investigation, manual correction, and replay — never silently lose data.