Stateful Kafka Streams Processors: Reducing Rebalancing - Lessons Learned

Stateful Kafka Streams applications on Kubernetes need deliberate configuration to stay stable during rolling updates without building up lag. If your processor must stay fast, runs on few partitions, and tolerates only short lag, the defaults are not enough. Enable static membership with a session timeout based on the worst-case duration of your rolling updates.

Fundamentals

Kafka Streams was designed for building mission-critical real-time applications and microservices (kafka.apache.org). It is used when both input and output data live in Kafka clusters.

Kafka Cluster

In a distributed system, multiple computers communicate over the network but appear as one system. In Kafka, a cluster is a group of servers working together. A Kafka cluster needs a few components to work: brokers, partitions, and replication (confluent.io).

Terminology

| Distributed concept | Kafka Streams | Kafka Core |
|---|---|---|
| Worker | Stream application instance | Consumer instance |
| Unit of work | Task | You define it (poll loop) |
| State | State store (managed) | You manage it (DB, memory) |
| Threading | StreamThread | You manage it |
| Partition assignment | Automatic per task | Partition assignment (same mechanism) |
Source: Claude, Sonnet 4.6

Replication and Brokers

In distributed systems, we design for the loss of any computer at any time. That event must not cause data loss. Replication solves this problem. The pattern uses a leader and followers. Followers stay in sync with the leader and keep a copy of the data. In distributed systems, a computer that is part of the cluster is called a node.

A broker does compute: it handles replication, reads and writes, and offset management, and one broker acts as the controller that runs leader election. When you write a consumer or producer app and deploy it to a pod, that pod contains business logic and internally calls consumer.poll() or producer.send().

If you have multiple Kafka producers, consumers, and stream processors, they each run in their own pods. But all of them talk to a common set of brokers, which are shared infrastructure. When you produce a message, you logically send it to a topic, but each partition of that topic lives on a broker.

With 3 brokers, 3 partitions, and a replication factor of 3, the distribution looks like this:

Broker-0: Partition-0 (leader), Partition-1 (follower), Partition-2 (follower)
Broker-1: Partition-1 (leader), Partition-0 (follower), Partition-2 (follower)
Broker-2: Partition-2 (leader), Partition-0 (follower), Partition-1 (follower)

Multiple leader partitions per broker are common when the number of partitions exceeds the number of brokers.

Each partition is copied for durability. The number of copies is controlled by the replication factor. The Apache Kafka default is 1, but production clusters almost always use 3. With a replication factor of 2, partition 0 of Topic A lives on two brokers: one leader, one follower.
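As a sketch, these defaults can be set cluster-wide in the broker configuration (the values shown are the common production choice, not the shipped defaults):

```properties
# server.properties (broker): replication defaults for new topics
default.replication.factor=3
# Writes with acks=all must reach at least this many in-sync replicas
min.insync.replicas=2
```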

What happens if a broker fails? The Kafka controller elects a new leader from the remaining in-sync replicas. KRaft (KIP-500) is the system that keeps the controller itself highly available, replacing ZooKeeper in Kafka 3.3+. If all replicas for a partition are lost, that data is gone.

Let’s pause and sum up. In a Kafka cluster, the shared infrastructure is the brokers. Producers, consumers, and processors all communicate through topics and partitions on that shared infrastructure. To keep data durable, we use replication: copies are stored across brokers. The controller promotes a follower to leader if a broker fails.

Topics, Partitions, Consumer Groups, Offsets, and Lag

A topic is like a table in a database: it stores data under a name. That data is split into physical buckets called partitions. Multiple partitions can live on a single broker.

Producers write data to topics. Each topic is consumed by consumers within consumer groups. A consumer group tracks where each consumer finished consuming via an offset. The distance between that committed offset and the latest message is called lag.

Kafka Streams vs Kafka Core

In Kafka Core, you write low-level code to manage everything yourself. You call consumer.poll() to get messages and store them in Redis or Postgres. Kafka Core is just a client library.

As computation grows more complex, you start building wrappers on top of the low-level Kafka client. Imagine you want to count words in a stream. With Kafka Core, you need:

consumer.poll()
└── groupBy key + count
└── upsert to database
    ├── race conditions        (2 instances update same key)
    ├── duplicate processing   (what if app crashes mid-upsert?)
    ├── rebalance problem      (partition moves to another instance, state stays in DB but local cache is lost)
    └── consistency            (local count vs global count out of sync)

These are distributed problems that would be tedious and risky to solve in every computation unit. That is where Kafka Streams comes in.

Kafka Streams provides a high-level wrapper around the low-level Kafka client that solves those distributed problems for you. The same problem in Kafka Streams is a one-liner:

stream
  .groupByKey()
  .count()

Kafka Core

Consumer Group

Each consumer works within a consumer group: a set of consumers that consume from the same topic. This enables parallelism. With 4 partitions and 4 consumers, each consumer handles exactly one partition, giving a parallelism factor of 4. Different consumer groups can consume from the same partitions independently, because each group maintains its own offset.

The consumer group is a fundamental Kafka concept: it enables parallel consumption while preserving ordering guarantees, which are both important and tricky in distributed systems. Order is guaranteed only within a single partition.

Rebalance

Rebalance is the process that assigns partitions to consumers. The Group Coordinator, which is part of the broker, handles this.

Example with 4 partitions and 3 consumers in the consumer group:

  • Consumer 0 -> Partition-0, Partition-1
  • Consumer 1 -> Partition-2
  • Consumer 2 -> Partition-3

When does it happen? When a consumer joins or leaves the group, or when the partitions of a subscribed topic change.

To reassign partitions, the Group Coordinator first stops all consumers from polling. They resume only after reassignment is complete. This is the stop-the-world rebalance. Kafka 2.4 introduced cooperative rebalancing (KIP-429), enabled by default from Kafka 3.1. It stops only the partitions that need to move, not all of them.
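For a plain Kafka Core consumer running on a version where the eager protocol is still the default, cooperative rebalancing can be opted into explicitly through the assignor. A minimal sketch (Kafka Streams manages its own assignor internally, so this applies to Core consumers only):

```properties
# Consumer config: opt into cooperative rebalancing
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```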

Kafka Streams

Topology

Your application code defines a Topology, which is a DAG (Directed Acyclic Graph). It contains your entire processing pipeline:

Source (topic) -> Processor1 -> Processor2 -> … -> ProcessorN -> Sink (topic)

We consume from a topic, process the data through multiple processors, and produce to a different output topic. The topology describes what happens to the data at each step.

Word Count example:

Source("input-topic")
└── FlatMapValues (split sentence into words)
└── GroupByKey
└── Count  (stateful processor, uses state store)
└── Sink("output-topic")

Processor

A single component in the topology is called a processor. A processor does one thing. There are two types:

Stateless processors:

  • filter
  • map
  • flatMap
  • branch

Stateful processors:

  • count
  • aggregate
  • join
  • reduce

The word count example needs stateful processors for aggregation.

Task

A task is one instance of the topology (or a subgraph of it) bound to a single input partition. It is the unit of parallelism.

State and Durability in Kafka Streams

Kafka Core requires low-level coding. To store state, you would need an external store like Postgres or Redis. Kafka Streams includes all of this, with the hard distributed problems solved for you.

Where is the magic? There is none. Kafka Streams uses RocksDB, a disk-based embedded database, as the local state store. Local disk alone does not make state durable: if the node is lost, its RocksDB files go with it. Durability comes from the Kafka cluster itself, through replication across brokers.

Kafka Streams adds one more layer: an internal topic called the changelog. It applies the Event Sourcing pattern: an ordered log of modifications that can be replayed to rebuild state.

Pure event sourcing is just a list of changes. Over time, the list grows and becomes slow to replay. The classic fix is snapshotting: periodically collapse the event list into a snapshot and replay only the events that came after it. In Kafka, this role is played by log compaction: it keeps only the latest value for each key, which is exactly what the internal state store needs.
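Kafka Streams creates its changelog topics with compaction already enabled; the equivalent topic-level configuration, shown here as a sketch, is:

```properties
# Topic-level setting applied to changelog topics:
# retain only the latest value per key
cleanup.policy=compact
```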

Rebalance in Kafka Streams

Rebalancing in Kafka Streams works the same way as in Kafka Core, but stateful applications have state attached to each partition.

In Kafka Core:

Kafka Core partition moves: 
Consumer-A loses Partition-0 
└── Consumer-B gets Partition-0 
└── just starts polling from offset ← trivial

In Kafka Streams, the new task owner also needs to restore state:

Kafka Streams partition moves: 
Consumer-A loses Partition-0 
└── RocksDB state stays on Consumer-A's disk 
└── Consumer-B gets Partition-0 
└── has no local RocksDB state 
└── must rebuild from changelog topic ← expensive

Partition assignment is just metadata in the broker. The state store is actual data on disk. When a rebalance happens, the partition assignment is fast: it is just a metadata update. But the new owner may not have the state locally, so it must replay the changelog to rebuild it.

In practice, eliminating or limiting changelog replay is the goal. How? Keep reading.

How to limit unnecessary rebalancing?

Cooperative Rebalancing and Static Membership

Before cooperative rebalancing, stop-the-world froze ALL partitions during a rebalance. With cooperative rebalancing (KIP-429), only the partitions that need to move are paused.

Cooperative rebalancing changes HOW partitions are reassigned DURING a rebalance.

Static membership (KIP-345) operates one level earlier: it changes WHETHER a rebalance happens at all. If a pod restarts briefly, does anything need to change?

These two mechanisms are complementary. Together they minimize partition movement in the cluster, reducing disruption for any application using it.

session.timeout.ms controls how long the cluster waits before declaring a member dead.

Static Membership in Kafka Streams

With very large state, rebalancing often means downtime. Enable static membership with an appropriate timeout to limit it. Static membership and cooperative rebalancing matter more in Kafka Streams than in Kafka Core, because state physically lives on each node.
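A minimal sketch of the two settings involved, assuming a StatefulSet whose pod name (e.g. stream-processor-0) is exposed to the container, and a worst-case rolling update of about five minutes. Kafka Streams is configured through a Properties object in code; the properties-file form below is for readability, and ${POD_NAME} is a placeholder you substitute yourself when building the config:

```properties
# Static membership: a stable, unique id per instance.
# Derive it from the StatefulSet pod name so it survives restarts.
group.instance.id=${POD_NAME}

# How long the broker waits before declaring this member dead.
# Cover the worst-case rolling update duration (here: 5 minutes).
session.timeout.ms=300000
```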

Sticky Assignor

The StickyAssignor has existed since Kafka 0.11: instead of reshuffling all assignments blindly, it preserves existing ones and only moves what is necessary. Kafka 2.4 went further and introduced the CooperativeStickyAssignor, combining stickiness with cooperative rebalancing. That combination became the default from Kafka 3.1.

This is important for Kafka Streams: it keeps state stores on the nodes that already have the data.

Persistent Volumes (Kubernetes)

By default, data stored in a Kubernetes pod lives on the pod’s disk. When the pod is deleted, the data is gone. Persistent Volumes solve this. A PV is storage that lives independently of any pod. A pod requests disk by creating a Persistent Volume Claim (PVC): “I need 20 GB, fast SSD, read/write”.
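That claim can be sketched as a manifest; the name and storage class below are illustrative assumptions:

```yaml
# "I need 20 GB, fast SSD, read/write" as a PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-stream-processor-0
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd   # assumed storage class name
  resources:
    requests:
      storage: 20Gi
```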

StatefulSet Identity

In Kubernetes, two workload controllers matter here: Deployments and StatefulSets. The difference is that StatefulSets give each pod a stable identifier:

- `stream-processor-0`
- `stream-processor-1`
- `stream-processor-2`
- `stream-processor-3`

A StatefulSet with PVC works like Kafka’s sticky assignor: each pod with the same ID always gets the same PVC:

Pod `stream-processor-0` will always mount PVC `data-stream-processor-0`

When a pod restarts, Kubernetes deletes and recreates it with the same name, and reattaches the same PVC.
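A minimal StatefulSet sketch, with names and the image as placeholder assumptions. The volumeClaimTemplates entry is what generates the per-pod PVCs (data-stream-processor-0, data-stream-processor-1, and so on):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stream-processor
spec:
  serviceName: stream-processor      # headless service, assumed to exist
  replicas: 4
  selector:
    matchLabels:
      app: stream-processor
  template:
    metadata:
      labels:
        app: stream-processor
    spec:
      containers:
        - name: processor
          image: example/stream-processor:latest  # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka-streams   # assumed RocksDB state dir
  volumeClaimTemplates:
    - metadata:
        name: data          # yields PVCs named data-stream-processor-N
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
```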

This is another element that limits rebalancing in Kafka Streams.

Standby Replicas

With the Sticky Assignor, Cooperative Rebalancing, and Static Membership, unnecessary rebalancing is limited. StatefulSet and PVC ensure predictable restarts: the same PVC always goes to the same pod. The static membership timeout gives Kubernetes time to finish before Kafka declares a member dead.

But what if the timeout is not enough? What if Kubernetes does not finish before the timeout expires?

Kafka will treat this as a consumer leave event and trigger rebalancing. It removes the dead member from the group and redistributes its partitions to the remaining consumers. When the pod eventually becomes healthy and rejoins, a second rebalance is triggered. With a large state store, these two rebalances can lead to downtime while the changelog is replayed.

To guard against this without adding complexity, Kafka Streams has Standby Replicas.

Standby Replicas are a Kafka Streams concept. They are controlled by num.standby.replicas and work like this:

  1. Assign an active task and a standby task based on num.standby.replicas.
  2. Only the active task processes data. The standby replays the changelog to keep its state current.
  3. With num.standby.replicas=1, you always have one active copy and one standby copy on a different node. This requires at least 2 running application instances; with only one, Kafka Streams silently skips creating the standby.
  4. When a node leaves the cluster, another node already has a hot copy of the data.
  5. The leave event triggers failover. Instead of replaying the full changelog, the standby only needs to catch up on changes since its last sync.

Standby replicas provide fast failover when something goes wrong in the Kubernetes cluster.
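The configuration itself is a single line; the sketch below assumes at least two running instances, as noted above:

```properties
# Keep one hot standby copy of each state store on another instance
num.standby.replicas=1
```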

TL;DR

Five configuration decisions that keep stateful Kafka Streams stable on Kubernetes:

  1. Static membership (group.instance.id + session.timeout.ms) — prevents rebalancing on brief pod restarts. Set the timeout to cover your worst-case rolling update duration.
  2. Cooperative rebalancing — pauses only the partitions that need to move, not all of them. Enabled by default from Kafka 3.1; enable explicitly on older versions.
  3. Sticky assignor — preserves existing partition assignments during rebalancing, keeping state stores on the nodes that already hold the data. Default from Kafka 3.x.
  4. StatefulSet + Persistent Volumes — gives each pod a stable identity and reattaches the same PVC on restart, so RocksDB state survives pod deletion without changelog replay.
  5. Standby replicas (num.standby.replicas) — keeps a hot copy of state on a second node. If the primary fails after the session timeout expires, failover skips a full changelog replay and catches up from the last sync point only.