
Change Data Capture Stream

Stream row-level changes out of a database in near real-time using the transaction log. No polling, no app changes — downstream systems get inserts, updates, and deletes as they happen.


When to Use This Pattern

Use CDC when you need to stream changes from a database to other systems without polling or modifying the application:

  • Replicating OLTP data to a warehouse for analytics
  • Keeping a search index (Elasticsearch, OpenSearch) in sync
  • Invalidating caches when source data changes
  • Building event streams from legacy systems you can't modify

CDC shines when the source database can't be changed, latency matters (sub-second to a few seconds), and you need every change, not just a snapshot.

How It Works

Databases write every change to a transaction log before applying it. CDC tools (Debezium, AWS DMS, Fivetran HVR, native CDC in SQL Server / Postgres logical replication) tail that log and emit a stream of change events, each describing:

  • The table and primary key
  • The change type: insert, update, or delete
  • The row's before and after state

The stream is typically published to Kafka or a similar event bus. Downstream consumers subscribe and apply changes to their own stores.
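
Concretely, a change event might look like the following sketch. The envelope is loosely modeled on Debezium's, but field names and nesting vary by tool and connector — treat it as illustrative, not any connector's exact schema:

```python
# Illustrative CDC change event. The envelope is loosely modeled on
# Debezium's; field names and nesting vary by tool and connector.
change_event = {
    "source": {"table": "customers", "lsn": 123456789},  # log position
    "op": "u",                       # "c" = insert, "u" = update, "d" = delete
    "key": {"id": 42},               # primary key of the affected row
    "before": {"id": 42, "email": "old@example.com"},  # row state before
    "after": {"id": 42, "email": "new@example.com"},   # row state after
    "ts_ms": 1700000000000,          # commit timestamp at the source
}

def apply_change(event: dict, store: dict) -> dict:
    """Apply one change event to a downstream key-value store."""
    key = event["key"]["id"]
    if event["op"] == "d":
        store.pop(key, None)          # delete: remove the row downstream
    else:
        store[key] = event["after"]   # insert/update: upsert the after-image
    return store
```

A consumer only needs the operation type and the key to stay in sync; the before-image is mainly useful for auditing and cache invalidation.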

Warning

CDC captures every change including sensitive ones. If your CDC stream flows to a less secure environment, you've just leaked PII. Plan encryption, masking, and access control before you connect anything.
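
A minimal sketch of pre-publish masking, assuming hypothetical sensitive field names (`email`, `ssn`) and a one-way hash so downstream joins on the masked value still work:

```python
import hashlib

SENSITIVE = {"email", "ssn"}  # hypothetical field names; adjust per schema

def mask(value: str) -> str:
    # One-way hash: values stay joinable downstream without exposing PII.
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def scrub(event: dict) -> dict:
    """Mask sensitive fields in both before- and after-images
    before the event leaves the secure environment."""
    out = dict(event)
    for image in ("before", "after"):
        row = event.get(image)
        if row:
            out[image] = {
                k: mask(v) if k in SENSITIVE else v for k, v in row.items()
            }
    return out
```

Run this transform in the secure environment, between the CDC connector and the bus — never in the consumers.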

Implementation Guide

Step 1: Enable the log feature on the source

Postgres: set wal_level = logical and create a publication. SQL Server: enable CDC on the database and on each table. MySQL: set binlog_format = ROW. Each has knock-on effects on disk usage and backups — read the docs.
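
The Postgres prerequisites can be sanity-checked before connecting a CDC tool. In this sketch the `settings` dict stands in for the values you'd read with `SHOW` on the source:

```python
def check_postgres_cdc_prereqs(settings: dict) -> list[str]:
    # `settings` stands in for the output of `SHOW wal_level;` etc.,
    # gathered from the source database by your deployment tooling.
    problems = []
    if settings.get("wal_level") != "logical":
        problems.append("set wal_level = logical (requires a restart)")
    if int(settings.get("max_replication_slots", 0)) < 1:
        problems.append("allow at least one replication slot")
    if int(settings.get("max_wal_senders", 0)) < 1:
        problems.append("allow at least one WAL sender process")
    return problems  # empty list means logical replication can start
```

Running a check like this in CI for each environment catches the classic failure mode: CDC works in staging but the production server was never restarted with the new wal_level.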

Step 2: Handle the initial snapshot

The log only has changes. Downstream consumers need the current state too. Most CDC tools do a one-time snapshot first, then switch to tailing the log from the position recorded when the snapshot began. Plan for this window — snapshotting a 500GB table takes time, and changes keep accumulating while it runs.
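
The two-phase handoff can be sketched in miniature — `table_rows` and `log` stand in for the real table and transaction log, and `start_position` is the log position recorded when the snapshot began:

```python
def snapshot_then_tail(table_rows, log, start_position):
    # Phase 1: emit a read event for every current row (the snapshot).
    for row in table_rows:
        yield {"op": "r", "after": row}       # "r" = snapshot read
    # Phase 2: tail the log from the position recorded at snapshot
    # start, so nothing committed during the snapshot is missed.
    for position, event in enumerate(log):
        if position >= start_position:
            yield event
```

Note that events from phase 2 may re-apply changes the snapshot already saw; downstream writes must therefore be idempotent upserts, not blind inserts.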

Step 3: Protect against schema changes

A column added upstream needs to propagate. A column dropped could break consumers. Decide: hard-fail loudly on schema drift, or route old-schema events through a compatibility layer?
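
One way to make that decision executable — `EXPECTED_COLUMNS` is a hypothetical consumer-side schema; here dropped columns hard-fail and new columns pass through a trivial compatibility layer:

```python
EXPECTED_COLUMNS = {"id", "email", "created_at"}  # consumer's known schema

class SchemaDrift(Exception):
    pass

def check_schema(row: dict) -> dict:
    cols = set(row)
    missing = EXPECTED_COLUMNS - cols   # dropped upstream: would break us
    extra = cols - EXPECTED_COLUMNS     # added upstream: needs a decision
    if missing:
        # Hard-fail loudly: a stopped pipeline beats silent corruption.
        raise SchemaDrift(f"columns dropped upstream: {sorted(missing)}")
    if extra:
        # Compatibility layer: ignore unknown columns until the
        # consumer schema is updated to accept them.
        return {k: v for k, v in row.items() if k in EXPECTED_COLUMNS}
    return row
```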

Step 4: Guarantee ordering where it matters

Most CDC tools preserve per-row ordering (all events for user-42 arrive in order). Cross-row ordering is usually not preserved. If you need causal consistency across tables, design for it explicitly.
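
Per-row ordering usually falls out of keyed partitioning: all events for the same key land on the same partition, which a consumer reads in order. A Kafka-style sketch (using a stable CRC32 hash, since Python's built-in `hash()` is seeded per process):

```python
import zlib

def partition_for(key, num_partitions: int) -> int:
    # Same key -> same partition, always. Within a partition, events
    # are consumed in order, so per-row ordering is preserved. Events
    # for different keys may land on different partitions, which is
    # exactly why cross-row ordering is NOT guaranteed.
    return zlib.crc32(str(key).encode()) % num_partitions
```

If two tables must be applied in a causally consistent order, one option is to route related rows to the same partition by sharing a key (e.g. partition order rows by their customer id), at the cost of partition skew.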

Step 5: Monitor replication lag

Lag is the most important metric. A 30-second lag is healthy; a 30-minute lag means your search index is 30 minutes stale. Alert on sustained lag over a threshold.
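
Alerting on sustained lag rather than a single sample avoids paging on transient spikes; the threshold and sample count here are illustrative:

```python
def sustained_lag_alert(lag_samples_s, threshold_s=60, sustained=3):
    # Fire only after lag exceeds the threshold for `sustained`
    # consecutive samples; a single spike resets nothing downstream.
    streak = 0
    for lag in lag_samples_s:
        streak = streak + 1 if lag > threshold_s else 0
        if streak >= sustained:
            return True
    return False
```

Lag itself is typically computed as the difference between the current time and the source commit timestamp of the last applied event.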

Tips & Best Practices

  • Test with real schema changes in staging. CDC breaking in prod at 2am is the worst.
  • Never rely on CDC for exact balance reconciliation. Use it for speed; reconcile with a nightly batch.
  • Keep retention high enough to replay. If a consumer is down for 4 hours, can you replay 4 hours of changes? If retention is 1 hour, you're reseeding from snapshot.
  • Use CDC for data movement, not business logic. Consumers that react to CDC events by executing business rules start to resemble distributed transactions gone wrong.
  • Budget for storage carefully. The transaction log grows when a consumer falls behind.
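
The replay-vs-reseed decision in the retention tip reduces to a comparison (a sketch; `retention_hours` is whatever your event bus is configured to keep):

```python
def recovery_action(downtime_hours: float, retention_hours: float) -> str:
    # If the outage outlasts stream retention, the oldest missed events
    # are already gone: the consumer must be reseeded from a fresh
    # snapshot instead of replaying the backlog.
    return "replay" if downtime_hours <= retention_hours else "resnapshot"
```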

Related Patterns