
Circuit Breaker

Stop calling a failing dependency before it drags you down with it. After a threshold of errors, open the circuit, fail fast for a cooldown period, then cautiously let traffic back in.

When to Use This Pattern

Use a circuit breaker when you call something that can be down or slow and you want to fail fast instead of fail slow:

  • HTTP calls to third-party APIs
  • Database queries that sometimes hang
  • Internal services behind a mesh
  • Any network boundary where slow equals bad

Retries and timeouts handle transient failures. A circuit breaker handles sustained failures — the thing is actually broken, and continuing to call it just wastes resources and amplifies the outage.

How It Works

The breaker is a state machine with three states:

  • Closed — calls pass through. Count failures over a rolling window.
  • Open — fail immediately without calling the dependency. Stays open for a cooldown (often 30–60 seconds).
  • Half-Open — after the cooldown, let one request through as a probe. If it succeeds, close the circuit. If it fails, go back to open.

You need a rule for what counts as "failure" (HTTP 5xx, timeouts, specific exceptions) and a threshold (e.g. "5 failures in 30 seconds" or "failure rate > 50%"). Tune these to the dependency's actual behaviour, not a generic default.
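
A minimal sketch of the three-state machine in Python. The class name and defaults (5 consecutive failures, 30-second cooldown) are illustrative, not any particular library's API, and for brevity it counts consecutive failures and treats every exception as a failure; Steps 1 and 2 below refine both of those choices.

    import time

    class CircuitOpenError(Exception):
        """Raised when the breaker fails fast instead of calling the dependency."""

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
            self.failure_threshold = failure_threshold
            self.cooldown_seconds = cooldown_seconds
            self.state = "closed"
            self.failures = 0
            self.opened_at = 0.0

        def call(self, fn, *args, **kwargs):
            if self.state == "open":
                if time.monotonic() - self.opened_at < self.cooldown_seconds:
                    raise CircuitOpenError("failing fast: circuit is open")
                self.state = "half_open"  # cooldown elapsed: let one probe through
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self._record_failure()
                raise
            # Success (normal call or half-open probe) closes the circuit.
            self.state = "closed"
            self.failures = 0
            return result

        def _record_failure(self):
            if self.state == "half_open":
                self._trip()  # probe failed: reopen and restart the cooldown
                return
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._trip()

        def _trip(self):
            self.state = "open"
            self.opened_at = time.monotonic()
            self.failures = 0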

Tip

Emit a metric on every state transition. "Circuit opened for payment-service" should be a visible alert — it means customers are about to feel pain.

Implementation Guide

Step 1: Pick your threshold carefully

Count-based thresholds (5 failures in a row) are noisy at low traffic. Rate-based thresholds (50% failure rate over 30 seconds with at least 20 calls) behave better under realistic load. Use rate-based unless traffic is tiny.
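
A sketch of a rate-based check using the figures from the text (30-second window, 50% rate, at least 20 calls); the class name and structure are illustrative:

    import time
    from collections import deque

    class FailureRateWindow:
        def __init__(self, window_seconds=30.0, min_calls=20, trip_rate=0.5):
            self.window_seconds = window_seconds
            self.min_calls = min_calls
            self.trip_rate = trip_rate
            self.samples = deque()  # (timestamp, succeeded) pairs

        def record(self, succeeded):
            self.samples.append((time.monotonic(), succeeded))

        def should_trip(self):
            cutoff = time.monotonic() - self.window_seconds
            while self.samples and self.samples[0][0] < cutoff:
                self.samples.popleft()  # evict calls older than the window
            if len(self.samples) < self.min_calls:
                return False  # too little traffic to judge a rate
            failures = sum(1 for _, ok in self.samples if not ok)
            return failures / len(self.samples) > self.trip_rate

The min_calls guard is what keeps quiet periods from tripping the breaker on one or two unlucky requests.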

Step 2: Decide what "failure" means
  • HTTP 5xx: yes. HTTP 4xx: usually no (that's a you-problem, not a them-problem).
  • Timeouts: yes. Connection errors: yes.
  • Specific exceptions: map them deliberately.

Wrong classification here produces circuits that trip when the dependency is fine.
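
As a sketch, using Python's requests library (its Timeout, ConnectionError, and Response types are real; the mapping itself is the part you tune per dependency):

    import requests

    def counts_as_failure(outcome):
        # Network-level problems: the dependency is unreachable or slow.
        if isinstance(outcome, (requests.Timeout, requests.ConnectionError)):
            return True
        # 5xx is a them-problem; 4xx is a you-problem and must not trip the breaker.
        if isinstance(outcome, requests.Response):
            return outcome.status_code >= 500
        # Anything else (application exceptions) is unclassified: map it deliberately.
        return False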

Step 3: Pair it with a fallback

What does your code do when the circuit is open? Options:

  • Return a cached/stale value
  • Return a sensible default (empty list, zero score)
  • Propagate the failure to the caller with a clean error
  • Queue the request for later when the circuit closes

Pick one deliberately per call site.
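
Two hypothetical call sites, building on the CircuitBreaker sketch above: one serves a stale cache, one propagates a clean error. fetch_recommendations, submit_payment, breaker, and cache are stand-ins for your own code.

    class PaymentUnavailableError(Exception):
        pass

    def get_recommendations(user_id):
        try:
            result = breaker.call(fetch_recommendations, user_id)
            cache[user_id] = result  # keep a copy so there is something stale to serve
            return result
        except CircuitOpenError:
            return cache.get(user_id, [])  # stale value, or a sensible empty default

    def charge_card(payment):
        # There is no safe default for money: propagate a clean, typed error.
        try:
            return breaker.call(submit_payment, payment)
        except CircuitOpenError:
            raise PaymentUnavailableError("payment provider is degraded")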

Step 4: Keep state per dependency

One breaker per (dependency, operation) pair. A single global breaker for "all external calls" will trip on the first broken service and take your whole system down.
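
One way to keep that per-pair state, again with hypothetical names built on the sketch above:

    _breakers = {}

    def breaker_for(dependency, operation):
        key = (dependency, operation)
        if key not in _breakers:
            _breakers[key] = CircuitBreaker()  # tune thresholds per dependency here
        return _breakers[key]

    # An outage in inventory lookups cannot trip the payments breaker:
    breaker_for("payment-service", "charge").call(submit_payment, payment)
    breaker_for("inventory-service", "lookup").call(fetch_stock, sku)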

Step 5: Surface state in your dashboards

Your on-call should see which breakers are open at a glance. Half-open probes should be visible too — they tell you whether a dependency is recovering.
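
The cheapest version is a log line on every transition; swap in your metrics client. A hypothetical hook, assuming the CircuitBreaker sketch above invokes it from its state changes:

    import logging

    log = logging.getLogger("circuit_breaker")

    def on_transition(dependency, old_state, new_state):
        # Every transition is a data point; "-> open" should page someone.
        log.warning("circuit %s: %s -> %s", dependency, old_state, new_state)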

Tips & Best Practices

  • Don't share breaker state across instances unless you have a reason, e.g. per-instance traffic so low that thresholds never accumulate. Per-instance is fine and simpler.
  • Combine with a bulkhead. Isolate thread/connection pools per dependency.
  • Add jitter to cooldowns. If all your instances come back at the same instant, the dependency gets slammed again. A sketch follows this list.
  • Test it. Breakers that never trip in tests will misbehave in prod. Inject failures deliberately in staging.
  • Don't use it as a replacement for timeouts. Breakers + timeouts + retries work together. Each handles a different failure mode.
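
A jittered cooldown is a one-liner; the plus-or-minus 20% spread here is an illustrative figure:

    import random

    def jittered_cooldown(base_seconds=30.0):
        # Spread probes across instances so they don't all hit the dependency at once.
        return base_seconds * random.uniform(0.8, 1.2)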
