Why Rebalancing Looks Like a Deployment Incident
Rebalancing is the normal process of recalculating and redistributing partition assignments within a consumer group. It is not a failure, but it can look like one when timing and readiness are poor.
In production, teams often notice lag spikes during deployment before they notice throughput tuning opportunities.
The reason is simple: releasing and reassigning partitions creates a short processing gap. That gap shows up as a lag spike and, to an unprepared team, can look like an outage.
When Unintended Rebalancing Happens
Rebalancing occurs when group membership changes or when brokers treat a member as unhealthy:
- A new consumer instance joins the group.
- An existing consumer instance terminates or fails.
- Polling is delayed beyond max.poll.interval.ms.
- Heartbeats are interrupted for longer than session.timeout.ms.
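For reference, both timeouts are plain consumer properties. The values below are illustrative, not recommendations; start from your client version's defaults and your measured processing time.

```properties
# illustrative values only; derive from measured processing time
# max gap between poll() calls before the member leaves the group
max.poll.interval.ms=300000
# max gap between heartbeats before the broker considers the member dead
session.timeout.ms=45000
# heartbeat cadence; conventionally about one third of session.timeout.ms
heartbeat.interval.ms=15000
```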
You cannot eliminate rebalancing. You can design your system so delay and duplicate processing impact stay bounded when it happens.
Eager vs Cooperative: Reducing Partition Movement Impact
The default eager strategy releases all assignments first, then redistributes. It is simple but can move many partitions at once.
The cooperative strategy (CooperativeStickyAssignor) moves only required partitions gradually.
In rolling deployments or environments with frequent instance churn, this approach often reduces latency spikes.
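The difference can be illustrated with a small sketch (not the client's actual protocol code): eager revokes everything a member owns, while cooperative revokes only the partitions that actually move.

```java
import java.util.HashSet;
import java.util.Set;

public class RevocationDemo {

    // eager protocol: every owned partition is revoked before reassignment,
    // regardless of where partitions end up afterwards
    static Set<Integer> eagerRevoked(Set<Integer> owned, Set<Integer> next) {
        return new HashSet<>(owned);
    }

    // cooperative protocol: only partitions leaving this member are revoked;
    // the rest keep processing through the rebalance
    static Set<Integer> cooperativeRevoked(Set<Integer> owned, Set<Integer> next) {
        Set<Integer> revoked = new HashSet<>(owned);
        revoked.removeAll(next);
        return revoked;
    }

    public static void main(String[] args) {
        Set<Integer> owned = Set.of(0, 1, 2, 3);
        Set<Integer> next = Set.of(0, 1, 2); // partition 3 moves to a new member
        System.out.println("eager revokes: " + eagerRevoked(owned, next).size());             // 4
        System.out.println("cooperative revokes: " + cooperativeRevoked(owned, next).size()); // 1
    }
}
```

With four owned partitions and one moving away, eager pauses all four while cooperative pauses only the one that moves.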
Example consumer.properties
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
Default assignor configuration can differ by Kafka client version. Verify version-specific defaults first. Some newer versions include cooperative assignors in default strategy lists, but many teams pin explicit values for operational consistency.
Reducing Reassignment with Static Membership
In rolling deployments, instances often go down briefly and return.
If group.instance.id is set, brokers can treat the returning instance as the same member and reduce unnecessary reassignment.
spring:
kafka:
consumer:
properties:
group.instance.id: order-worker-${HOSTNAME}
The main risk is ID collision. If two instances start with the same group.instance.id, the broker fences the older member, which then fails with a fenced-instance error.
Inject unique values such as hostname or pod name from deployment automation.
In environments where pod names change on every restart (for example, a Kubernetes Deployment), ${HOSTNAME} changes too often and static membership delivers less benefit than expected. In such cases, a stable identity model (for example, a StatefulSet) or a fixed-ID strategy aligned with instance lifecycle is more effective.
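One way to wire this safely is to derive the id from a platform-provided stable name and skip static membership entirely when none exists. `POD_NAME` here is a hypothetical environment variable your deployment tooling would inject; the prefix is likewise illustrative.

```java
import java.util.Map;

public class InstanceIdResolver {

    // prefer a stable platform identity; fall back to hostname; return null
    // to signal "do not set group.instance.id" when no stable name exists
    static String resolveInstanceId(Map<String, String> env) {
        String pod = env.get("POD_NAME"); // hypothetical stable id injected by tooling
        if (pod != null && !pod.isBlank()) {
            return "order-worker-" + pod;
        }
        String host = env.get("HOSTNAME");
        return (host != null && !host.isBlank()) ? "order-worker-" + host : null;
    }
}
```

Returning null and falling back to dynamic membership is safer than registering an id that changes on every restart.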
Timeout Design to Reduce Poll Delay
Many rebalancing issues start when processing gets long and poll cadence breaks. External API calls, slow DB I/O, and large batch operations are common causes.
@KafkaListener(topics = "order-events", groupId = "order-worker-group")
public void consume(ConsumerRecord<String, OrderEvent> record, Acknowledgment ack)
        throws SocketTimeoutException { // declared so the checked transient error can propagate for retry
long startedAt = System.currentTimeMillis();
OrderEvent event = record.value();
try {
paymentService.processWithTimeout(event); // It is safer to use timeouts for external calls.
ack.acknowledge();
log.info("consume success. key={}, partition={}, offset={}, orderId={}, elapsedMs={}",
record.key(),
record.partition(),
record.offset(),
event.getOrderId(),
System.currentTimeMillis() - startedAt);
} catch (SocketTimeoutException transientError) {
log.warn("transient error. key={}, partition={}, offset={}, orderId={}",
record.key(), record.partition(), record.offset(), event.getOrderId(), transientError);
throw transientError;
} catch (Exception permanentError) {
log.error("permanent error. key={}, partition={}, offset={}, orderId={}",
record.key(), record.partition(), record.offset(), event.getOrderId(), permanentError);
throw permanentError;
}
}
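`processWithTimeout` above is assumed to bound the external call; one generic way to do that, sketched with `CompletableFuture` (thread-pool sizing and the real client's interruption behavior are glossed over):

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class BoundedCall {

    // run the call on another thread and give up after the timeout so a slow
    // external dependency cannot stall the poll loop past max.poll.interval.ms
    static <T> T callWithTimeout(Supplier<T> call, Duration timeout)
            throws TimeoutException, InterruptedException, ExecutionException {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(call);
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException slow) {
            future.cancel(true); // stop waiting; the underlying task may still finish
            throw slow;
        }
    }
}
```

Note that cancelling the future only stops the wait; the external request itself should also carry its own connect/read timeouts.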
AckMode.MANUAL or MANUAL_IMMEDIATE gives explicit commit control at success points.
Because manual commits are sensitive to coding mistakes, tests should verify both duplicate and missing-processing scenarios.
Settings That Reduce Delay in Rolling Deployments
A practical sequence used in production:
- Measure consumer processing time and set a realistic max.poll.interval.ms.
- Tune heartbeat-related values without large deviations from defaults.
- Avoid replacing too many instances at once during rolling deployment.
- Add a short drain period before consumer shutdown.
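The first item can be turned into a rough sizing rule, assuming you know per-record p99 processing time and your max.poll.records setting. The safety factor is a judgment call, not a Kafka constant.

```java
public class PollIntervalSizing {

    // worst-case time to drain one poll batch, padded by a safety factor;
    // max.poll.interval.ms should comfortably exceed this lower bound
    static long safePollIntervalMs(int maxPollRecords, long p99PerRecordMs, double safetyFactor) {
        return (long) (maxPollRecords * p99PerRecordMs * safetyFactor);
    }

    public static void main(String[] args) {
        // 500 records * 20 ms p99 * 2x headroom = 20,000 ms lower bound
        System.out.println(safePollIntervalMs(500, 20, 2.0));
    }
}
```

If the computed bound approaches your client's configured max.poll.interval.ms, reduce max.poll.records rather than raising the interval indefinitely.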
In Spring, you can configure listener behavior to support controlled shutdown.
spring:
kafka:
listener:
ack-mode: manual
missing-topics-fatal: false
@PreDestroy
public void onShutdown() {
    log.info("consumer shutting down. drain in progress");
    // stop listener containers so in-flight records finish before the process exits;
    // kafkaListenerEndpointRegistry is an injected KafkaListenerEndpointRegistry
    kafkaListenerEndpointRegistry.stop();
}
From a performance standpoint, large partition movement in one step causes the biggest delays. That is why the combination of cooperative assignor and static membership is frequently used for deployment stabilization.
Idempotency with Duplicate Delivery as a Baseline
Idempotency means that processing the same message repeatedly yields the same final result as processing it once.
During rebalancing, the same message can be delivered again depending on commit boundaries. Treat this as a normal scenario, not an exceptional one.
@Transactional
public void processWithIdempotency(OrderEvent event) {
String idempotencyKey = event.getOrderId() + ":" + event.getEventType();
try {
    // flush immediately so a duplicate key fails here, not at commit;
    // note: with JPA this path may still mark the transaction rollback-only,
    // so some teams insert the key in a REQUIRES_NEW transaction instead
    processedEventRepository.saveAndFlush(new ProcessedEvent(idempotencyKey));
} catch (DataIntegrityViolationException alreadyProcessed) {
    log.info("already processed. idempotencyKey={}", idempotencyKey);
    return;
}
paymentGatewayClient.charge(event);
}
Allow duplicate delivery but enforce single-effect application. This is the most predictable structure under rebalancing.
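The DB-unique-key pattern above reduces to "first insert wins, later inserts are no-ops". An in-memory sketch of that contract (in production the boundary is the database constraint, not a process-local set):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class IdempotencyDemo {

    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    // returns true only for the first delivery of a key; the side effect
    // runs at most once no matter how many times the message is redelivered
    boolean processOnce(String idempotencyKey, Runnable effect) {
        if (!processed.add(idempotencyKey)) {
            return false; // duplicate delivery: skip the effect
        }
        effect.run();
        return true;
    }

    public static void main(String[] args) {
        IdempotencyDemo demo = new IdempotencyDemo();
        AtomicInteger charges = new AtomicInteger();
        demo.processOnce("order-1:PAID", charges::incrementAndGet);
        demo.processOnce("order-1:PAID", charges::incrementAndGet); // redelivery
        System.out.println(charges.get()); // 1: single effect despite duplicate delivery
    }
}
```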
Operational Steps for Partition Expansion
Partition expansion itself is simple, but operational planning must include rebalancing impact. A common staged process:
- Collect baseline before expansion: lag, throughput, average processing time, rebalance count
- Apply a small increase first during off-peak (for example, +20 to 30%)
- Monitor reassignment behavior and lag spikes in real time
- Expand further in phases if stable
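The phase sizing above can be made explicit with a small helper, assuming the +20 to 30% guideline (the ratio is this article's rule of thumb, not a Kafka requirement):

```java
public class PartitionExpansion {

    // next partition count for one expansion phase, rounded up so the
    // increase never silently rounds to zero for small topics
    static int nextPartitionCount(int current, double increaseRatio) {
        return (int) Math.ceil(current * (1 + increaseRatio));
    }

    public static void main(String[] args) {
        System.out.println(nextPartitionCount(12, 0.25)); // 15
        System.out.println(nextPartitionCount(3, 0.25));  // 4
    }
}
```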
Operational observation points
- rebalance duration
- consumer lag (max/avg)
- processing success rate and retry rate
- external I/O timeout ratio
Large one-shot expansion increases rebalancing shock, so smaller phases are safer.
Partitions Cannot Be Reduced: Rollback and Migration
Because direct partition reduction is not supported, practical rollback must be designed as topic transition.
Before expansion, define rollback triggers. For example, start rollback if any condition persists beyond a threshold window:
- Rebalance duration increases significantly over baseline
- Lag stays above threshold for a prolonged period
- Timeout/failure rate exceeds critical threshold
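The triggers can be encoded as a single predicate evaluated over the observation window. All thresholds below are placeholders; real values come from the baseline collected before expansion.

```java
public class RollbackTrigger {

    // start rollback when any condition holds for the whole threshold window;
    // the 2x rebalance-duration multiplier is an illustrative placeholder
    static boolean shouldRollback(long rebalanceMs, long baselineRebalanceMs,
                                  long maxLag, long lagThreshold,
                                  double failureRate, double failureRateThreshold) {
        return rebalanceMs > baselineRebalanceMs * 2
                || maxLag > lagThreshold
                || failureRate > failureRateThreshold;
    }
}
```

Evaluating one predicate on a dashboard alert keeps the rollback decision mechanical instead of debated mid-incident.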
Rollback therefore takes one of two shapes:
- Direct partition reduction: not supported
- Recommended rollback: return traffic to the existing topic, or create a new topic with a smaller partition count and migrate gradually
Idempotency is mandatory during rollback too, because duplicate delivery likelihood increases during transition.
Operational Summary
Rebalancing is a recurring event during deployment and failure recovery. The goal is not to eliminate rebalancing. The goal is to limit its impact when it occurs.
The items below are recommendations that work best when applied together.
- Minimize partition movement: Use cooperative assignor plus static membership to reduce unnecessary reassignment.
- Optimize delay windows: Keep processing within max.poll.interval.ms; tune max.poll.records if needed.
- Absorb duplicate delivery: Handle duplicates with idempotency boundaries such as DB unique keys.
- Integrate monitoring: Track lag, rebalance frequency, rebalance duration, and failure rate on one dashboard.