KAFKA-693

Consumer rebalance fails if no leader available for a partition and stops all fetchers

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: core
    • Labels:

      Description

      I am currently experiencing this with the MirrorMaker but I assume it happens for any rebalance. The symptoms are:

      I have a replication factor of 1.

      1. If I start the MirrorMaker (bin/kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config mirror-consumer.properties --producer.config mirror-producer.properties --blacklist 'xdummyx' --num.streams=1 --num.producers=1) with a broker down
      1.1 I set refresh.leader.backoff.ms to 600000 (10 min) so that the ConsumerFetcherManager doesn't retry too often to get the unavailable partitions
      1.2 The rebalance starts at the init step and fails: Exception in thread "main" kafka.common.ConsumerRebalanceFailedException: KafkaMirror_mirror-01-1357893495345-fac86b15 can't rebalance after 4 retries
      1.3 After the exception, everything stops (fetchers and queues)
      1.4 I attached the full logs (info & debug) for this case

      2. If I start the MirrorMaker with all the brokers up and then kill a broker
      2.1 The first rebalance is successful
      2.2 The consumer correctly handles the broker going down and stops the associated ConsumerFetcherThread
      2.3 The refresh.leader.backoff.ms setting of 600000 works correctly
      2.4 If something triggers a rebalance (new topic, partition reassignment...), we are back in case 1: the rebalance fails and stops everything.

      I think the desired behavior is to consume whatever is available and retry the rest at some interval. I would be glad to help on this issue, although the consumer code seems a little tough to get into.

      Attachments

      1. KAFKA-693.patch (10 kB) - Maxime Brugidou
      2. KAFKA-693-v2.patch (13 kB) - Maxime Brugidou
      3. KAFKA-693-v3.patch (15 kB) - Maxime Brugidou
      4. mirror_debug.log (151 kB) - Maxime Brugidou
      5. mirror.log (70 kB) - Maxime Brugidou

        Issue Links

          Activity

          Neha Narkhede made changes -
          Labels: p2
          Neha Narkhede made changes -
          Status: Resolved -> Closed
          Jun Rao made changes -
          Status: Patch Available -> Resolved
          Fix Version/s: 0.8
          Resolution: Fixed
          Jun Rao added a comment -

          Thanks for the patch. Committed to 0.8 with the following minor changes.

          1. ConsumerFetcherManager: fixed the bug in the new warn logging.
          2. AbstractFetcherThread: moved isOffsetInvalid() to where InvalidOffset is defined.

          Maxime Brugidou made changes -
          Attachment: KAFKA-693-v3.patch
          Maxime Brugidou added a comment -

          Added v3 with your remarks

          Jun Rao added a comment -

          Thanks for patch v2. Looks good. Some minor comments:

          11. I think we still need to change ConsumerFetcherManager.doWork(): Currently, if we hit an exception when calling addFetcher(), we won't remove any partition from noLeaderPartitionSet, including those that have been processed successfully. We can change it so that we remove each partition from noLeaderPartitionSet after calling addFetcher() successfully.

          20. AbstractFetcherThread: Instead of doing initialOffset < 0, could we define an isOffsetInvalid() method?
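
          For illustration only, a minimal sketch of the constant and helper being discussed (the names InvalidOffset and isOffsetInvalid come from this thread; the enclosing object is a simplified stand-in, not the actual 0.8 code):

          // Sketch only: a sentinel offset plus a helper, as suggested in items 10 and 20.
          // Simplified stand-in, not the real PartitionTopicInfo.
          object PartitionTopicInfoSketch {
            // "No initial offset known yet", e.g. a consumer group with no offset in ZooKeeper.
            val InvalidOffset: Long = -1L

            def isOffsetInvalid(offset: Long): Boolean = offset < 0
          }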

          Maxime Brugidou added a comment -

          10. Created PartitionTopicInfo.InvalidOffset

          11. In ConsumerFetcherManager.doWork(), I believe addFetcher() is called before the partition is removed from noLeaderPartitionSet; if an exception is caught, the partition will still be in noLeaderPartitionSet, so I didn't change anything.

          12. done

          13. done

          Maxime Brugidou made changes -
          Attachment: KAFKA-693-v2.patch
          Jun Rao added a comment -

          Thanks for the patch. Some comments:

          10. ZookeeperConsumerConnector: Let's define a constant InvalidOffset, instead of using -1 directly.

          11. ConsumerFetcherManager.doWork(): After we identify the leader of a partition, the leader could change immediately. So, we may hit the exception when calling addFetcher(). When this happens, we haven't added the partition to the fetcher and we don't want to lose it. So, we should add it back to noLeaderPartitionSet so that we can find the new leader later.

          12. ReplicaFetcherThread: Yes, it should also throw an exception if getOffsetBefore returns an error.

          13. AbstractFetcherThread.doWork(): We need to handle the exception when calling handleOffsetOutOfRange(). If we get an exception, we should add the partition to partitionsWithError. This will cover both ConsumerFetcherThread and ReplicaFetcherThread.
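
          As a rough illustration of item 13, a hedged sketch of the intended guard: if resetting the offset fails (for example because no leader is reachable), the partition is parked for a later retry instead of the exception killing the fetcher thread. The types and members below are simplified assumptions, not the actual 0.8 AbstractFetcherThread code.

          import scala.collection.mutable

          // Sketch only (simplified, assumed types).
          case class TopicAndPartition(topic: String, partition: Int)

          class FetcherSketch(handleOffsetOutOfRange: TopicAndPartition => Long) {
            val partitionMap = mutable.Map.empty[TopicAndPartition, Long]
            val partitionsWithError = mutable.Set.empty[TopicAndPartition]

            def addPartition(tp: TopicAndPartition, initialOffset: Long): Unit = {
              if (initialOffset >= 0) {
                partitionMap.put(tp, initialOffset)
              } else {
                try {
                  // May need to contact the (possibly unavailable) leader.
                  partitionMap.put(tp, handleOffsetOutOfRange(tp))
                } catch {
                  case _: Throwable =>
                    // Record the failure and keep the fetcher running; retried later.
                    partitionsWithError += tp
                }
              }
            }
          }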

          Maxime Brugidou made changes -
          Assignee: Maxime Brugidou
          Maxime Brugidou made changes -
          Status: Open -> Patch Available
          Maxime Brugidou added a comment -

          Here is a patch:

          1. AbstractFetcherThread.addPartition(): call handleOffsetOutOfRange if initialOffset < 0

          2. I didn't touch ConsumerFetcherManager.doWork(), since addFetcher() is called for partitions with leaders only (which is why 3 is unnecessary).

          3. ConsumerFetcherThread.handleOffsetOutOfRange: check partitionErrorAndOffset.error and throw an appropriate exception (which should have been done anyway; I don't think this is strictly necessary for the patch)
          3.1 Note: this should probably be done in the ReplicaFetcherThread too?

          4. ZookeeperConsumerConnector.ZkRebalanceListener: Do not compute leaderIdForPartitionMap in rebalance() and set PartitionTopicInfo offsets to -1 if not in Zk (new consumer)

          5. PartitionTopicInfo: removed brokerId

          6. Fixed tests for compilation (I am having a hard time running tests since ./sbt test does not seem to work very well for me)

          7. Should we increase the default refresh.leader.backoff.ms? It's a tradeoff between quickly picking up a new leader to consume from (useful when replication is on) and not flooding the brokers when there is no leader (or replication is off). 200ms is very short, but something hybrid like "try 5 times at 200ms backoff, then every 5 min" would cover all use cases (sketched below).

          I am running this on test clusters with a MirrorMaker, and the error from my initial test case (in the description) does not occur anymore.
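
          A sketch of the hybrid backoff idea from point 7, for illustration only (the attempt count and delays are the example numbers above; they are assumptions, not actual Kafka defaults or config keys):

          // Sketch only: "try a few times quickly, then back off for a long time".
          object HybridLeaderRefreshBackoff {
            val fastBackoffMs: Long = 200L            // short backoff for the first tries
            val slowBackoffMs: Long = 5 * 60 * 1000L  // then every 5 minutes
            val fastAttempts: Int = 5

            def backoffMs(failedAttempts: Int): Long =
              if (failedAttempts <= fastAttempts) fastBackoffMs else slowBackoffMs
          }

          The leader-finder loop could then sleep for backoffMs(attempts) instead of a fixed refresh.leader.backoff.ms.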

          Maxime Brugidou made changes -
          Attachment: KAFKA-693.patch
          Jun Rao added a comment -

          Another quick thought on this: instead of using Option for the offset, we could still use AtomicLong and pass in something like -1 to indicate a non-existent offset.

          Maxime Brugidou added a comment -

          Looks good; it should work, but I still have a pain point with PartitionTopicInfo, which uses AtomicLong to track the consume/fetch offsets. Using Option[AtomicLong] looks strange, because I would have to turn the two counters into vars... and it's probably not thread safe at all, so I would need some sort of lock to "initialize" the counters (see the sketch below).
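
          To make the tradeoff concrete, a sketch of the two shapes being discussed (the classes below are simplified placeholders, not the actual PartitionTopicInfo):

          import java.util.concurrent.atomic.AtomicLong

          // Sketch only: two ways to model a "not yet known" fetch offset.

          // (a) Option[AtomicLong]: the field has to become a var and needs a lock
          //     (or volatile + synchronized) for the None -> Some transition.
          class InfoWithOption {
            @volatile private var fetchedOffset: Option[AtomicLong] = None
            def initialize(offset: Long): Unit = synchronized {
              if (fetchedOffset.isEmpty) fetchedOffset = Some(new AtomicLong(offset))
            }
            def offset: Option[Long] = fetchedOffset.map(_.get)
          }

          // (b) AtomicLong with a sentinel: the field stays a val and is updated atomically.
          object InfoWithSentinel { val InvalidOffset = -1L }
          class InfoWithSentinel {
            private val fetchedOffset = new AtomicLong(InfoWithSentinel.InvalidOffset)
            def isInitialized: Boolean = fetchedOffset.get != InfoWithSentinel.InvalidOffset
            def initialize(offset: Long): Unit = fetchedOffset.set(offset)
            def offset: Long = fetchedOffset.get
          }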

          Jun Rao added a comment -

          That's a good point. I overlooked this. Your understanding is correct. We could move the offset initialization logic into AbstractFetcherThread. The following is one way to do this. Not sure if this is the best way.

          1. In AbstractFetcher:
          Change addPartition to pass in initialOffset: Option[Long].
          If initialOffset is none, we call handleOffsetOutOfRange to get the offset. If we hit any exception while doing this, we pass the exception to the caller without adding the partition to partitionMap.

          2. In ConsumerFetcherManager.doWork():
          If we hit any exception when calling addFetcher, we add the partition back to noLeaderPartitionSet (a sketch of this pattern follows the list).

          3. In ConsumerFetcherThread.handleOffsetOutOfRange():
          We need to check if the offset response has any error. If so, we throw an exception to the caller.

          4. In ZookeeperConsumerConnector.addPartitionTopicInfo(): If initial offset doesn't exist in ZK, we pass in none to PartitionTopicInfo.

          5. In PartitionTopicInfo: Make fetchedOffset Option[AtomicLong].
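
          To make step 2 concrete, a minimal sketch of the add-it-back-on-failure pattern (the names follow this discussion; the real leader-finder logic in ConsumerFetcherManager is more involved, so treat this as an assumption-laden illustration):

          import scala.collection.mutable

          // Sketch only: if addFetcher fails (e.g. the leader moved again), return the
          // partition to noLeaderPartitionSet so a later pass can look up the new leader.
          case class TopicAndPartition(topic: String, partition: Int)

          class LeaderFinderSketch(addFetcher: (TopicAndPartition, Int) => Unit) {
            val noLeaderPartitionSet = mutable.Set.empty[TopicAndPartition]

            def assignLeaders(leaders: Map[TopicAndPartition, Int]): Unit = {
              for ((tp, leaderBrokerId) <- leaders) {
                try {
                  addFetcher(tp, leaderBrokerId)
                  noLeaderPartitionSet -= tp  // forget it only once it was added successfully
                } catch {
                  case _: Throwable => noLeaderPartitionSet += tp
                }
              }
            }
          }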

          Maxime Brugidou added a comment -

          I looked at the code in detail and I am stuck, because during the rebalance() operation the ZookeeperConsumerConnector's topicRegistry is updated with PartitionTopicInfo objects that need to store the consumerOffset and fetchOffset. During addPartitionTopicInfo(), the consumer offset is read from ZooKeeper; however, it needs to be initialized if no offsetString is available in ZooKeeper (the first time a consumer starts), and we need to access the broker/leader to get the starting offset (using SimpleConsumer.earliestOrLatestOffset() in addPartitionTopicInfo()).

          I dug in a bit, and we could probably initialize the offset later in the ConsumerFetcherManager? I could help with a patch if I get general directions, because I'm not 100% familiar with the codebase yet.

          Jun Rao added a comment -

          OK, this is actually a real problem. During rebalance, we try to get the leader even though we don't really need it at rebalancing time. The fix seems easy.

          In ZKRebalancerListener.addPartitionTopicInfo(), we don't really need to get the leaderId, which is not used in PartitionTopicInfo. So, we can just get rid of that code. We can also get rid of the code in rebalance() that computes leaderIdForPartitionsMap.

          Maxime Brugidou made changes -
          Link: This issue relates to KAFKA-691
          Maxime Brugidou made changes -
          Attachment: mirror.log
          Attachment: mirror_debug.log
          Maxime Brugidou created issue -

            People

            • Assignee: Maxime Brugidou
            • Reporter: Maxime Brugidou
            • Votes: 1
            • Watchers: 3
