Building an Analytics Pipeline with Kafka + ClickHouse: A Production Blueprint
data engineering · streaming · devops

2026-02-28

Blueprint for building resilient Kafka-to-ClickHouse streaming ETL with schema evolution, observability, and backups for 2026-grade real-time analytics.

Why your analytics pipeline is failing when you need it most

If your analytics cluster lags by minutes, schema changes break downstream consumers, or production restores feel like a scramble, you're not alone. Teams building real-time analytics pipelines in 2026 face three compounding pressures: higher event volumes, frequent product-driven schema changes, and tighter SLOs for fresh data. This blueprint shows how to architect a resilient streaming ETL that pairs Kafka with ClickHouse, implements robust schema evolution, and adds the observability, backups, and CI/CD patterns needed for production at scale.

Executive summary (most important first)

  • Use Kafka as the durable event log and ClickHouse as the OLAP engine for sub-second to minute-level analytics.
  • Prefer schema-managed wire formats (Avro/Protobuf) with a schema registry and automated compatibility tests to manage evolution.
  • Ingest into ClickHouse using either the Kafka table engine + materialized views for simplicity, or a Kafka Connect / sink connector for richer transformations and resilience.
  • Ensure idempotency and deduplication: design ClickHouse tables with Replacing/Collapsing MergeTree or use deterministic dedup keys to make writes safe to replay.
  • Instrument both Kafka and ClickHouse with Prometheus/OpenTelemetry metrics, set SLO-driven alerts, and maintain runbooks for lag, reprocessing, and partial restores.

Why Kafka + ClickHouse in 2026?

By late 2025 / early 2026 the OLAP landscape shifted: ClickHouse gained serious market momentum (including significant funding) and feature velocity, improving cloud adapters, replication, and operator tooling. Kafka remains the de facto event backbone for streaming ETL. The combination gives teams a cost-effective, high-throughput stack for real-time analytics where Kafka guarantees durability and ClickHouse excels at fast ad-hoc and aggregated queries.

High-level architecture

At a glance, a production-grade pipeline has these layers:

  1. Producers: Services emitting events into Kafka (Avro/Protobuf/JSON Schema).
  2. Kafka cluster: Topic partitioning, retention, compacted topics for keys, and Schema Registry for schema governance.
  3. Transformation layer: Kafka Streams/ksqlDB or Kafka Connect with SMTs for lightweight transformations.
  4. Sinks: ClickHouse ingestion via native Kafka engine + Materialized View or Kafka Connect sink connector.
  5. Storage & compute: ClickHouse Distributed tables, replicated MergeTree, and S3 long-term storage for backups and cold data.
  6. Observability & Ops: Prometheus, Grafana, tracing (OpenTelemetry), runbooks and CI/CD for schema and infra changes.

Design decisions: Kafka table engine vs Connect sink

Option A — ClickHouse Kafka engine + materialized views (fast, simple)

The ClickHouse Kafka engine lets ClickHouse itself act as a Kafka consumer. A typical pattern:

CREATE TABLE events_kafka (
  event_id String,
  user_id String,
  ts String,
  payload String
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'broker1:9092',
         kafka_topic_list = 'events',
         kafka_group_name = 'clickhouse-events',
         kafka_format = 'AvroConfluent',
         format_avro_schema_registry_url = 'http://schema-registry:8081';

CREATE TABLE events_raw (
  event_id String,
  user_id String,
  ts DateTime64(3),
  payload String
) ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/events', '{replica}') ORDER BY (event_id, ts);

CREATE MATERIALIZED VIEW events_mv TO events_raw AS SELECT
  event_id,
  user_id,
  parseDateTimeBestEffort(ts) AS ts,
  payload
FROM events_kafka;

Pros: fewer moving parts, low-latency ingestion, and native integration with ClickHouse replication. Cons: limited pre-processing, schema-registry integration needs careful configuration, and reprocessing requires manual steps.

Option B — Kafka Connect / ClickHouse sink (flexible, production-friendly)

Use a connector for richer transformations, buffering, dead-letter queues, and better error handling. Typical topology: producers -> Kafka -> Kafka Connect with SMTs -> ClickHouse sink. Advantages: connector-managed delivery semantics, offload complex parsing, easier retry/DLQ handling, CI for connector configs.
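
To make this concrete, here is a sketch of a sink connector configuration registered through the Kafka Connect REST API. The connector class, hostnames, database, and topic names are assumptions; adjust them to the connector distribution you actually deploy.

```python
# Sketch of a ClickHouse sink connector config for Kafka Connect.
# Connector class, hosts, and topic names are assumptions -- verify against
# the docs of the connector you deploy.
import json

connector_config = {
    "name": "clickhouse-events-sink",
    "config": {
        "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
        "topics": "events",
        "hostname": "clickhouse-1",
        "port": "8123",
        "database": "analytics",
        "errors.tolerance": "all",  # route bad records to a DLQ instead of failing
        "errors.deadletterqueue.topic.name": "events-dlq",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

# POST this JSON body to the Connect REST endpoint (e.g. http://connect:8083/connectors).
payload = json.dumps(connector_config)
```

Keeping the config as code (rather than hand-edited JSON) is what makes "CI for connector configs" possible: the same dict can be linted, diffed, and applied by a pipeline.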

Making writes safe: idempotency and deduplication

ClickHouse is not transactional like OLTP databases. To tolerate at-least-once delivery and replay events safely without duplicates at query time:

  • Design tables with unique keys and use ReplacingMergeTree or CollapsingMergeTree to deduplicate. Include a version/timestamp column to choose the latest event.
  • Emit a deterministic event key from producers (e.g., event_id). Use compacted topics to retain latest value for each key for reprocessing and recovery.
  • For high-integrity pipelines, maintain a separate idempotency table that records processed offsets or message ids.

CREATE TABLE events_dedup (
  event_id String,
  user_id String,
  ts DateTime64(3),
  payload String,
  version UInt64
) ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/events_dedup', '{replica}')
ORDER BY event_id
SETTINGS index_granularity = 8192;
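
For intuition, the deduplication that ReplacingMergeTree performs at merge time (or at query time with `FINAL`) can be sketched in a few lines: for each ORDER BY key, only the row with the highest version survives. This is a pure-Python illustration of the semantics, not ClickHouse itself.

```python
# Illustration of ReplacingMergeTree dedup semantics: per key, the row with
# the highest version wins. Replayed or duplicated writes become harmless.
def dedup_latest(rows):
    """rows: iterable of dicts carrying 'event_id' and 'version'."""
    latest = {}
    for row in rows:
        key = row["event_id"]
        if key not in latest or row["version"] > latest[key]["version"]:
            latest[key] = row
    return list(latest.values())

rows = [
    {"event_id": "e1", "version": 1, "payload": "old"},
    {"event_id": "e1", "version": 2, "payload": "new"},  # replayed update wins
    {"event_id": "e2", "version": 1, "payload": "only"},
]
deduped = dedup_latest(rows)
```

Remember that merges are asynchronous: until a merge (or a `SELECT ... FINAL`, which is more expensive) runs, duplicates can still be visible to queries.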

Schema evolution strategy (practical rules)

Schema drift is the top cause of pipeline incidents. Implement a schema-first workflow:

  1. Wire format and registry: Use Avro or Protobuf on the wire and a Schema Registry (Confluent or open-source) to enforce compatibility.
  2. Compatibility mode: Prefer BACKWARD or FULL compatibility by default. Use explicit team approvals for breaking changes.
  3. Subject naming: Use topic+value or topic+key naming strategies to isolate changes and reduce accidental coupling.
  4. Automated contract tests: Add CI checks that validate producer and consumer schemas; run consumer contracts against mock topics in integration tests.
  5. Feature flags + dual write: For risky changes, implement dual-write producers (write old+new schema) and consumers that read both formats while migrating.
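
The single rule teams trip over most often in BACKWARD mode is: a new reader schema can still decode old data only if every field it adds carries a default. A toy checker over dict-shaped record schemas makes the rule concrete; real Avro resolution covers far more (type promotion, aliases, unions), so treat this as a simplified sketch, not a registry replacement.

```python
# Toy BACKWARD-compatibility check: the new (reader) schema can decode data
# written with the old schema only if every added field has a default.
# Simplified on purpose -- real Avro resolution handles many more cases.
def is_backward_compatible(old_schema, new_schema):
    old_fields = {f["name"] for f in old_schema["fields"]}
    for field in new_schema["fields"]:
        if field["name"] not in old_fields and "default" not in field:
            return False
    return True

old = {"fields": [{"name": "event_id", "type": "string"}]}
ok = {"fields": [{"name": "event_id", "type": "string"},
                 {"name": "region", "type": "string", "default": "eu"}]}
bad = {"fields": [{"name": "event_id", "type": "string"},
                  {"name": "region", "type": "string"}]}  # no default -> breaking
```

A check like this is cheap enough to run on every PR that touches a schema file, before the authoritative registry compatibility test.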

Example compatibility config (Schema Registry):

# Set compatibility for subjects
curl -X PUT http://schema-registry:8081/config/events-value -H 'Content-Type: application/json' -d '{"compatibility":"BACKWARD"}'

Handling breaking schema changes

  • If you must rename a required field: introduce a new optional field, dual-write for one release, migrate consumers, then deprecate the old field after a safe window.
  • To change a type (e.g., int -> long): add the new field, backfill if necessary, and alter consumers to prefer the new field.
  • When removing fields: mark them deprecated and retain them for a retention window in compacted topics to support reprocessing.
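
During such a migration window, consumers should prefer the new field and fall back to the old one. A minimal sketch, with hypothetical field names (`ts_ms` replacing a legacy `ts` in seconds):

```python
# Dual-read during a field migration: prefer the new field, fall back to the
# legacy one. Field names and units here are hypothetical examples.
def read_timestamp(event):
    if "ts_ms" in event:          # new field: epoch milliseconds
        return int(event["ts_ms"])
    if "ts" in event:             # legacy field: epoch seconds
        return int(event["ts"]) * 1000
    raise KeyError("event carries neither ts_ms nor ts")
```

Once metrics show no events arriving with only the legacy field, the fallback branch (and eventually the field itself) can be retired.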

Reprocessing strategy

Reprocessing is inevitable. Build for it:

  1. Use compacted topics for idempotent keys (state topics) and time-partitioned topics for append-only events.
  2. Store checkpoints — either consumer offset checkpointing in Kafka or external metadata (e.g., a processing log table in ClickHouse) — to resume safely.
  3. Version your transforms. Tag a pipeline release and make transforms replayable by referencing the release tag in reprocessing jobs.
  4. Support two reprocessing modes: fast replay (best-effort, small window) and controlled backfill (strict, slow, runs in maintenance windows).
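
The checkpointing idea in step 2 reduces to a tiny invariant: track the highest processed offset per topic-partition and resume from the next one. A sketch, assuming the checkpoint store is an in-memory dict; in production it would be a ClickHouse processing-log table or Kafka's own committed offsets.

```python
# External checkpointing sketch: highest processed offset per topic-partition;
# on restart, resume from checkpoint + 1. The dict stands in for a durable
# store (ClickHouse table or Kafka committed offsets).
checkpoints = {}  # (topic, partition) -> last processed offset

def record(topic, partition, offset):
    key = (topic, partition)
    checkpoints[key] = max(offset, checkpoints.get(key, -1))

def resume_from(topic, partition):
    # -1 sentinel means "never processed", so an unseen partition resumes at 0.
    return checkpoints.get((topic, partition), -1) + 1

record("events", 0, 41)
record("events", 0, 42)
record("events", 1, 7)
```

Taking the max on every write makes the record operation idempotent, which matters when the checkpoint writer itself can be replayed.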

Backups, disaster recovery, and data retention

Kafka

  • Enable replication across racks or regions. Use MirrorMaker 2 (or vendor tooling such as Confluent Cluster Linking) for cross-region DR.
  • Retain topic data long enough for usual reprocessing windows. For long-term state, materialize views into ClickHouse or S3.
  • Keep a compacted topic for idempotency keys to support full replays.

ClickHouse

  • Use replicated MergeTree families with proper shard/replica layout.
  • Regularly snapshot and offload to S3 using tools such as clickhouse-backup or built-in remote disk support. Test restores quarterly.
  • For schema changes, use new tables with backfills and controlled cutover rather than destructive in-place ALTERs when possible.

# Example clickhouse-backup usage
clickhouse-backup create nightly_backup
clickhouse-backup upload nightly_backup --config /etc/clickhouse-backup.yml

Observability: metrics, tracing, and alerts

Visibility is non-negotiable. Monitor Kafka, ClickHouse, and the transformation layer with the same rigor you use for services.

Core metrics to collect

  • Kafka: consumer lag (per topic/partition), under-replicated partitions, produce/consume rates, broker disk usage
  • ClickHouse: insert rates, parts count, merge queue length, query latency P50/P95/P99, disk usage, replication lag (system.replication_queue)
  • Connectors: task failures, REST API errors, connector lag, DLQ rate
  • Business metrics: freshness (time since last event per view), missing partitions, rows processed per minute
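
The freshness metric in the last bullet is simple to compute but catches a failure mode raw consumer lag can miss: a pipeline that is "caught up" on an empty or stalled topic. A sketch (times as epoch seconds, view names illustrative):

```python
# Freshness = now - timestamp of the newest row that reached each view.
# Emit this per materialized view; alert when it exceeds the SLO.
def freshness_seconds(now, last_event_ts_by_view):
    """last_event_ts_by_view: view name -> epoch seconds of its newest row."""
    return {view: now - ts for view, ts in last_event_ts_by_view.items()}

now = 1_750_000_600
metrics = freshness_seconds(now, {
    "events_mv": 1_750_000_590,    # healthy: 10s behind
    "sessions_mv": 1_750_000_000,  # stalled: 10 minutes behind
})
```

In practice the `last_event_ts_by_view` values come from a cheap `max(ts)` query per view, and the result is exported as a Prometheus gauge.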

Tracing and logs

Instrument producers and connectors with OpenTelemetry. Capture trace ids in events to correlate from producer to final ClickHouse write. Centralize logs and use structured logging for quick troubleshooting.

Alerting: SLO-driven thresholds

  • High priority: consumer lag > X minutes for 5+ partitions, under-replicated partitions > 0, ClickHouse replication delay > Y seconds
  • Medium: increase in DLQ rate, merge queue length exceeding threshold, low disk headroom
  • Low: sustained changes in insert rate or query P95 drift

CI/CD and testing for data pipelines

Treat schemas, connectors, and transforms as code. Your CI should run these checks on every PR:

  • Schema linting and compatibility tests against the registry
  • Unit tests for transformation logic (ksqlDB or Kafka Streams)
  • Integration tests using ephemeral environments (Docker Compose or testcontainers) that spin up Kafka + Schema Registry + ClickHouse and validate end-to-end ingestion
  • Contract tests: validate that consumers can handle older/newer schema versions

# Example GitHub Actions job (conceptual)
jobs:
  contract-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Start test stack
        run: docker-compose -f ci/docker-compose.yml up -d --build
      - name: Run schema compatibility tests
        run: ./ci/check-schema-compatibility.sh
      - name: Run consumer integration tests
        run: mvn -Dtest=ConsumerIntegrationTest test

Operational runbooks (concise, actionable)

Runbook: consumer lag spike

  1. Check consumer group lag via Kafka UI (or CLI: kafka-consumer-groups.sh).
  2. If lag maps to a stuck connector task, restart the task. If task fails, inspect DLQ and connector logs.
  3. If ClickHouse ingestion is slow, check merge queue and disk I/O; scale the ClickHouse cluster if necessary.

Runbook: reprocessing after a schema bug

  1. Lock downstream writes to prevent partial reads.
  2. Create a new topic or compacted mirror for corrected events.
  3. Run the reprocessing job under a fresh consumer group (with auto.offset.reset=earliest, so reads begin from the earliest retained offset) and direct writes into a staging ClickHouse table.
  4. Validate row counts and checksums, then swap staging into production (rename tables or update views).
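
The validation in step 4 boils down to comparing a row count plus an order-independent checksum over a stable key column. In ClickHouse the query-side equivalent would be `count()` plus something like `groupBitXor(cityHash64(event_id))`; here is the same idea as a pure-Python sketch over exported key lists.

```python
# Order-independent table fingerprint: (row count, XOR of per-key hashes).
# XOR makes the checksum insensitive to row order, so staging and production
# can be compared even if they were written in different orders.
import hashlib

def table_checksum(event_ids):
    digest = 0
    for eid in event_ids:
        h = int.from_bytes(hashlib.sha256(eid.encode()).digest()[:8], "big")
        digest ^= h
    return len(event_ids), digest

prod_stats = table_checksum(["e1", "e2", "e3"])
staging_stats = table_checksum(["e3", "e1", "e2"])  # same rows, different order
```

If counts match but checksums differ, the staging table has the right volume but the wrong rows, which is exactly the case a count-only check would miss.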

Scaling ClickHouse for analytics

Use Distributed tables over sharded MergeTree replicas. Partition by time (or another low-cardinality key); partitioning by a high-cardinality field creates too many parts and degrades merges. Key considerations:

  • Shard count should reflect expected concurrency and query patterns, not just data volume.
  • Set TTLs to move cold data to S3 or drop it automatically.
  • Monitor parts and merges to avoid merge storms; tune max_parts_in_total and merge settings.

CREATE TABLE events_local (
  event_id String,
  user_id String,
  ts DateTime64(3),
  payload String
) ENGINE = ReplacingMergeTree(ts)
PARTITION BY toYYYYMM(ts)
ORDER BY (user_id, event_id)
SETTINGS index_granularity = 8192;

CREATE TABLE events_dist AS events_local
ENGINE = Distributed(cluster, database, events_local, cityHash64(user_id));

Security and compliance

  • Encrypt data in transit (TLS for Kafka and ClickHouse) and at rest (S3/KMS for backups).
  • Use RBAC for Kafka (ACLs) and ClickHouse users with least privilege.
  • Audit sensitive columns and classify PII; consider tokenization or hashing before inserting into ClickHouse.
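
One lightweight pattern for the last bullet is keyed hashing: pseudonymize identifiers before insert so raw PII never reaches ClickHouse, while joins on the hashed value still work. A sketch; the secret key would be fetched from a KMS or secret manager, never hard-coded.

```python
# Pseudonymize PII with a keyed hash (HMAC-SHA256) before inserting into
# ClickHouse. Deterministic, so the same input always maps to the same token
# and joins/group-bys on the token still work.
import hashlib
import hmac

SECRET_KEY = b"replace-with-kms-managed-key"  # assumption: loaded from KMS in production

def pseudonymize(value):
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

token_a = pseudonymize("user@example.com")
token_b = pseudonymize("user@example.com")
```

Using an HMAC rather than a bare hash means an attacker with the ClickHouse data alone cannot brute-force common emails without also obtaining the key.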

What's next for streaming analytics

Looking into 2026, expect:

  • Stronger managed ClickHouse offerings and operators — lowering operational burden.
  • Better connector ecosystems and more first-party sink connectors that understand ClickHouse MergeTree semantics.
  • Shift toward event schema governance tooling that automates compatibility analysis across multiple consumers.

Design your pipeline assuming rapid feature addition and schema churn — invest early in schema governance, contract testing, and idempotent data models.

Concrete checklist to get started (actionable takeaways)

  1. Pick a wire format (Avro/Protobuf) and deploy a Schema Registry with BACKWARD compatibility.
  2. Choose an ingestion pattern: Kafka engine for simplicity, Kafka Connect for production-grade control.
  3. Design ClickHouse tables with dedup keys and Replacing/Collapsing MergeTree for idempotency.
  4. Implement automated schema compatibility tests in CI and run integration tests against ephemeral ClickHouse + Kafka stacks.
  5. Instrument metrics from Kafka, ClickHouse, and connectors. Create SLOs for freshness and latency with alerting and runbooks.
  6. Build backup policies: Kafka replication + MirrorMaker, ClickHouse snapshots to S3, and quarterly restore drills.

"Build for reprocessability: if you can't replay, you can't recover."

Sample minimal deployment roadmap (4 sprints)

Sprint 1 — Foundations

  • Deploy Kafka + Schema Registry (dev), ClickHouse single-node
  • Implement producer libraries that write Avro/Protobuf

Sprint 2 — Ingestion

  • Add ClickHouse Kafka engine or Kafka Connect sink, simple materialized view ingestion
  • Set up Prometheus exporters for basic metrics

Sprint 3 — Reliability

  • Enable replication/sharding in ClickHouse, configure DLQs
  • Create CI contract tests and schema compatibility checks

Sprint 4 — Hardening

  • Set backup/restore automation, create runbooks and alerts, run a restore drill
  • Stress test with realistic event volumes and tune merge settings

Closing: operational confidence with Kafka + ClickHouse

Real-time analytics in 2026 is less about raw speed and more about operational resilience: quick recoveries, safe schema changes, and predictable freshness. Kafka provides durable, ordered event storage; ClickHouse provides fast analytics; the rest is engineering discipline — schema governance, idempotent writes, observability, and automated CI. Follow the patterns in this blueprint and you’ll reduce incidents, shorten time-to-insight, and scale your analytics with confidence.

Call to action

Ready to implement this blueprint? Start with a small proof-of-concept using your most important event topic. If you want a jump-start, download our CI test harness (Kafka + Schema Registry + ClickHouse docker-compose) and a pre-built ClickHouse DDL + connector configs to validate a 1-week migration plan. Visit programa.space/blueprints to get the repo and a step-by-step guide.
