  ZooKeeper / ZOOKEEPER-2172

Cluster crashes when reconfig a new node as a participant

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.5.0
    • Fix Version/s: 3.5.3, 3.6.0
    • Component/s: leaderElection, quorum, server
    • Labels:
      None
    • Environment:

      Ubuntu 12.04 + java 7

    • Hadoop Flags:
      Reviewed

      Description

      The operations are quite simple: start three zk servers one by one, then reconfig the cluster to add the new one as a participant. When I add the third one, the zk cluster may enter a weird state from which it cannot recover.

      I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1 cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in the node-1 log. So the first node received the reconfig cmd at 12:53:48. Later, it logged “2015-04-20 12:53:52,230 [myid:1] - ERROR [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE /10.0.0.2:55890 ********”. From then on, the first node and second node rejected all client connections and the third node didn’t join the cluster as a participant. The whole cluster was effectively down.

      When the problem happened, all three nodes just used the same dynamic config file zoo.cfg.dynamic.10000005d which only contained the first two nodes. But there was another unused dynamic config file in node-1 directory zoo.cfg.dynamic.next which already contained three nodes.

      When I extended the waiting time between starting the third node and reconfiguring the cluster, the problem didn’t show up again, so it is likely a race condition.

      1. history.txt
        63 kB
        Hitoshi Mitake
      2. node-1.log
        171 kB
        Ziyou Wang
      3. node-2.log
        108 kB
        Ziyou Wang
      4. node-3.log
        20 kB
        Ziyou Wang
      5. zoo.cfg.dynamic.10000005d
        0.2 kB
        Ziyou Wang
      6. zoo.cfg.dynamic.next
        0.2 kB
        Ziyou Wang
      7. zoo-1.log
        1.55 MB
        Ziyou Wang
      8. zoo-2.log
        495 kB
        Ziyou Wang
      9. zoo-2-1.log
        1.60 MB
        Ziyou Wang
      10. zoo-2-2.log
        435 kB
        Ziyou Wang
      11. zoo-2212-1.log
        3.92 MB
        Ziyou Wang
      12. zoo-2212-2.log
        3.10 MB
        Ziyou Wang
      13. zoo-2212-3.log
        120 kB
        Ziyou Wang
      14. zoo-2-3.log
        60 kB
        Ziyou Wang
      15. zoo-3.log
        57 kB
        Ziyou Wang
      16. zoo-3-1.log
        1.76 MB
        Ziyou Wang
      17. zoo-3-2.log
        860 kB
        Ziyou Wang
      18. zoo-3-3.log
        68 kB
        Ziyou Wang
      19. zoo-4-1.log
        1.39 MB
        Ziyou Wang
      20. zoo-4-2.log
        440 kB
        Ziyou Wang
      21. zoo-4-3.log
        60 kB
        Ziyou Wang
      22. zookeeper-1.log
        1.31 MB
        Ziyou Wang
      23. zookeeper-1.out
        51 kB
        Hitoshi Mitake
      24. zookeeper-2.log
        253 kB
        Ziyou Wang
      25. zookeeper-2.out
        33 kB
        Hitoshi Mitake
      26. ZOOKEEPER-2172.patch
        0.7 kB
        Hitoshi Mitake
      27. ZOOKEEPER-2172-02.patch
        2 kB
        Mohammad Arshad
      28. ZOOKEEPER-2172-03.patch
        12 kB
        Mohammad Arshad
      29. ZOOKEEPER-2172-04.patch
        13 kB
        Mohammad Arshad
      30. ZOOKEEPER-2172-06.patch
        14 kB
        Alexander Shraer
      31. ZOOKEEPER-2172-07.patch
        13 kB
        Mohammad Arshad
      32. zookeeper-3.log
        62 kB
        Ziyou Wang
      33. zookeeper-3.out
        59 kB
        Hitoshi Mitake
      34. ZOOKEPER-2172-05.patch
        16 kB
        Alexander Shraer


          Activity

          ziyouw Ziyou Wang added a comment -

          Error log when this bug happened.

          michim Michi Mutsuzaki added a comment -

          Ziyou Wang could you also post the initial configuration files for all the nodes? Alexander Shraer could you take a look at these log files when you get a chance?

          shralex Alexander Shraer added a comment -

          Thanks Michi for the pointer, I'll take a look, but may not have time today.

          In general the .next file is created when a server acks a reconfig request and deleted when it commits the request.
          We check that a quorum is synced before starting the reconfig but once someone knows about a reconfig proposal and can talk to a quorum of nodes, there is no going back – if a quorum of the new config isn't available the system will get stuck (just like without reconfig the system will get stuck if there's no quorum).

          michim Michi Mutsuzaki added a comment -

          Would it be possible that this is hitting the case described in http://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html#sc_reconfig_general :

          Finally, note that once connected to the leader, a joiner adopts the last committed configuration, in which it is absent (the initial config of the joiner is backed up before being rewritten). If the joiner restarts in this state, it will not be able to boot since it is absent from its configuration file. In order to start it you’ll once again have to specify an initial configuration.

          shralex Alexander Shraer added a comment -

          Quite possible. The question is why 1 and 2 weren't enough to form a quorum; 3 doesn't need to be up if both 1 and 2 are participants in the last committed config.

          shralex Alexander Shraer added a comment -

          Looks like the config here is using the patch from ZOOKEEPER-2031.

          michim Michi Mutsuzaki added a comment -

          node1 doesn't seem to receive the vote from itself. it receives votes from node2 and node3:

          node-1.log:2015-04-20 12:55:03,358 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@698] - Notification: 2 (message format version), 2 (n.leader), 0x100000084 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LEADING (my state)10000005d (n.config version)
          node-1.log:2015-04-20 12:55:51,547 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@698] - Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 3 (n.sid), 0x1 (n.peerEPoch), LEADING (my state)10000005d (n.config version)
          

          node2 receives votes from node1 and itself:

          node-2.log:2015-04-20 12:55:03,361 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection@698] - Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0xffffffffffffffff (n.round), LEADING (n.state), 1 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)10000005d (n.config version)
          node-2.log:2015-04-20 12:55:54,564 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection@698] - Notification: 2 (message format version), 2 (n.leader), 0x100000084 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)10000005d (n.config version)
          

          Is node3's vote somehow confusing node1?

          Yes, I think this cluster is using the patch from ZOOKEEPER-2031. Do you think that might be related to this issue?

          shralex Alexander Shraer added a comment -

          Server 1 continues to be leading till the end of the execution. It is able to push its version to server 3, but for some reason its leader election messages have round 0xffffffffffffffff, so I think this is why servers 2 and 3 don't adopt it as leader. It also doesn't time out for some reason.

          Flavio Junqueira, this looks related to ZOOKEEPER-1732 and ZOOKEEPER-1805, any thoughts?

          Unlike what the description says, the .next file provided here is identical to the other config file and contains a config with servers 1 and 2, so it probably resulted from the reconfiguration adding server 2. Which server is this coming from? Server 2, probably? That may happen if server 1 committed the reconfig but server 2 hasn't learned the commit yet (but its other config file has to be different in this case).

          It would be helpful if you could reproduce the scenario without the ZK-2031 patch and provide all the config files, including the initial ones, from the servers, as well as all the reconfig commands you run and when.

          I can see from the logs that there were many attempts to reconfigure (probably to add server 2) before it was synced with server 1, so they failed, which is normal. Then a reconfig succeeds at 12:51:48, and more reconfig commands are invoked (e.g. at 12:51:56), which is before server 3 even starts. Is this intentional? What do these commands attempt to do?

          michim Michi Mutsuzaki added a comment -

          Ziyou Wang would it be possible to reproduce it without zk-2031? If not, could you try reproducing this with debug log enabled? Thanks!

          ziyouw Ziyou Wang added a comment -

          Hi Alexander and Michi,

          Thanks a lot for looking at this bug. ZK-2031 was developed as part of our system, so I cannot remove that patch to reproduce the problem. I will try to reproduce the problem with the debug log enabled.

          BTW, we have hit this problem many times since reporting this issue. If we move to a slower environment, e.g. a nested host, it needs more time between start and reconfig to avoid it. So I suspect it may be caused when node 3 hasn't finished syncing the latest data from the leader, and the reconfig comes in and breaks something. Do you have any ideas about this?

          For Alexander's questions: our system had a bug in that case when this jira was reported. It just retried the reconfig to add node 2 even though node 2 was already in the list. I have already fixed that, but we still hit this jira's problem now.

          ziyouw Ziyou Wang added a comment -

          I enabled the ZooKeeper debug log and reproduced the problem again.

          ziyouw Ziyou Wang added a comment -

          No, it is just a bug in the previous system: it would retry the reconfig for the second node even though it had already joined the cluster. It also just used "reconfig -add" to add the second node to the participant list.

          ziyouw Ziyou Wang added a comment -

          Hi Michi Mutsuzaki and A. M.,

          I enabled the debug log for ZooKeeper and reproduced this problem. Could you help check the problem again? Thanks a lot.

          michim Michi Mutsuzaki added a comment -

          Thanks Ziyou.

          Flavio Junqueira Alexander Shraer it looks like node1 and node2 are not forming a quorum because node2 has seen zxid 0x100000059 but node1 keeps sending 0x0 as its zxid. Isn't node1 supposed to send the highest zxid it has seen?

          From zookeeper-1.log:

          2015-05-25 12:34:36,920 [myid:1] - DEBUG [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@423] - Sending new notification. My id =1 recipient=2 zxid=0x0 leader=1 config version = 100000049
          2015-05-25 12:34:39,090 [myid:1] - DEBUG [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@423] - Sending new notification. My id =1 recipient=3 zxid=0x0 leader=1 config version = 100000049
          2015-05-25 12:35:28,128 [myid:1] - DEBUG [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@423] - Sending new notification. My id =1 recipient=2 zxid=0x0 leader=1 config version = 100000049
          2015-05-25 12:35:30,301 [myid:1] - DEBUG [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@423] - Sending new notification. My id =1 recipient=3 zxid=0x0 leader=1 config version = 100000049
          

          From zookeeper-2.log:

          2015-05-25 12:34:36,918 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection@698] - Notification: 2 (message format version), 2 (n.leader), 0x100000059 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)100000049 (n.config version)
          2015-05-25 12:34:36,923 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection@698] - Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0xffffffffffffffff (n.round), LEADING (n.state), 1 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)100000049 (n.config version)
          2015-05-25 12:35:28,124 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.0.0.2:1300:FastLeaderElection@688] - Sending Notification: 2 (n.leader), 0x100000059 (n.zxid), 0x1 (n.round), 1 (recipient), 2 (myid), 0x1 (n.peerEpoch)
          2015-05-25 12:35:28,125 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.0.0.2:1300:FastLeaderElection@688] - Sending Notification: 2 (n.leader), 0x100000059 (n.zxid), 0x1 (n.round), 2 (recipient), 2 (myid), 0x1 (n.peerEpoch)
          
          shralex Alexander Shraer added a comment -

          Michi, I agree that it's weird, and again I see this 0xffffffffffffffff round number, which I think causes the other server to ignore the leader's messages and continue looking even though server 1 is leading (without timing out for some reason). This looks related to ZOOKEEPER-1732 and ZOOKEEPER-1805. Flavio Junqueira could you take a look?

          It also seems like there are a lot of client sessions being established and destroyed (clients connect and disconnect).
          And in particular when the reconfig adding server 3 happens (12:33:21,797 on server 1) the client session 0x14d8b08424a0014 (this is the client that submitted the reconfig) gets closed in the middle of the operation. Then, the connection to server 2 is suddenly closed with the error (on server 2)

          2015-05-25 12:33:25,786 [myid:2] - WARN [QuorumPeer[myid=2]/10.0.0.2:1300:Follower@92] - Exception when following the leader java.net.SocketTimeoutException: Read timed out

          Could it be that the termination of a client session in the middle of an op messes up server-to-server connections?

          ziyouw Ziyou Wang added a comment -

          The reason there are many client connects and disconnects is that we use the zk CLI to do the operations, so the CLI just creates a client, does the operation, and shuts it down.

          From our tests, I think it is a race condition problem and may not be related to the CLI. When the cluster is running in a faster environment, we just need to wait a short interval between starting node 3 and reconfiguring it to join the cluster, and then we can almost always avoid this problem. But when we move to a slower environment, it can be reproduced consistently again. So the temporary solution is extending the waiting time, which lets us avoid it again.

          shralex Alexander Shraer added a comment -

          When you run in the slower environment, does server 2 still disconnect (as in the logs) when you add server 3?

          Can you please provide your initial configs for all 3 servers, and the reconfig commands you run? I'd like to try to reproduce this.

          ziyouw Ziyou Wang added a comment -

          Yes, it has the same problem.

          The step is:
          1. Start server 1, its zoo.cfg contains:
          standaloneEnabled=false
          syncLimit=2
          tickTime=2000
          initLimit=5

          and its dynamic config file just contains itself.

          2. Start server 2; its zoo.cfg is the same and its initial dynamic config file contains 1 and itself. Then reconfig it as a participant after it syncs data from 1 (this needs to be retried a few times until the reconfig can be completed).

          3. Start server 3, with the same zoo.cfg and an initial dynamic config file containing 1, 2, and itself. Then reconfig it as the third participant (this is where we hit the bug).

          Since it is a race condition, it may not be easy to reproduce in your environment. In my dev setup, the zk servers run in a docker-in-VM environment. If I change the script to wait just 5 secs after starting server 3, the problem can be reproduced every time.
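          For reference, a rough sketch of what the dynamic config files and the reconfig command could look like for these steps; the ports and client ports below are made up for illustration, and the actual values are in the attached zoo.cfg.dynamic files:

          # zoo.cfg.dynamic for server 1 at step 1 (hypothetical ports)
          server.1=10.0.0.1:2888:3888:participant;2181

          # initial zoo.cfg.dynamic for server 3 at step 3: lists 1, 2 and itself
          server.1=10.0.0.1:2888:3888:participant;2181
          server.2=10.0.0.2:2888:3888:participant;2181
          server.3=10.0.0.3:2888:3888:participant;2181

          # reconfig issued through the CLI against server 1 after server 3 starts
          reconfig -add server.3=10.0.0.3:2888:3888:participant;2181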

          shralex Alexander Shraer added a comment -

          Thanks. The basic scenario you described works for me, but you're right of course - there may be a race condition.

          It would be great if you could help investigate this further. For example, why does server 2 disconnect when you add server 3? BTW, server 3 doesn't even have to be up in order for you to add it to the ensemble, since you have 2 out of 3 servers without it.

          I also wonder if the problem has anything to do with the client shutting off right after – does it happen if you don't close the connection on the client?

          ziyouw Ziyou Wang added a comment -

          BTW, all the reconfigs are sent to server 1 to be processed.

          I can look into the problem, but I am not familiar with the state machine in ZooKeeper. And you are right, we could reconfig server 3 even before it starts. But since we need to deploy the cluster automatically, it would be weird to reconfig server 2 after it starts but server 3 before it starts.

          As I mentioned, we just use the CLI to do the reconfig, so I need to change the CLI to test this. I think the client should already have gotten the result from the server before it shuts down.

          shralex Alexander Shraer added a comment -

          I was just suggesting this to see if we can minimize the scenario causing the race condition. For example, what happens when you don't start server 3 at all? I also really doubt that the problem is at the client. ZK should work whether or not the client disconnects in the middle of an operation. I'm just trying to think how we could track down the bug.

          I'd try to understand whether the client's shutdown is somehow not handled well on the server.
          Could you see if the problem is still there when you (a) don't shut down the client after the reconfig? If the problem goes away, can you try (b) doing a write after the reconfig and shutting down the client only after the write completes?
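          For what it's worth, test (b) could look roughly like the following zkCli session; the server address, znode path, and data here are hypothetical, and the reconfig arguments would match whatever you use today:

          $ bin/zkCli.sh -server 10.0.0.1:2181
          reconfig -add server.3=10.0.0.3:2888:3888:participant;2181
          create /reconfig-probe "done"
          quit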

          ziyouw Ziyou Wang added a comment -

          OK, I see. I will try this.

          For debugging purposes, do you want to add more logging to the related code? You can point out the places and I can add them. Right now, even with debug logging enabled, I still don't see much useful logging at the moment the bug happens.

          For the server 2 disconnect problem: since all the servers are running on one OS (in different containers), they use exactly the same system clock. Server 2 got a read timeout first and server 1 shut down the connection later. So I think server 1's state machine may have entered some error state first.

          shralex Alexander Shraer added a comment -

          I'm not sure where to add the debug messages yet, but happy to point you to the relevant places once we determine where they are. Is it possible that servers 2 and 3 are using the same resources somehow? Ports, data directory, or something else? This may explain why one joining messes up the connection to the other.

          ziyouw Ziyou Wang added a comment -

          No, they are in totally independent docker containers (with isolated IPs, ports, and file systems), so they don't share any resources directly.

          I made the client change, but the problem is still there. I also tried the same reconfig steps in a faster environment (outside of the docker containers), and it cannot be reproduced there any more. So the slow environment just extends the bug window and makes it reproducible consistently.

          shralex Alexander Shraer added a comment -

          Can you post the logs from the run you mention where the client doesn't disconnect?

          fpj Flavio Junqueira added a comment -

          There are a few really weird things here. Check these notifications:

          Notification: 2 (message format version), -9223372036854775808 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 3 (n.sid), 0x0 (n.peerEPoch), LEADING (my state)100000049 (n.config version)
          

          I checked the logs of 3 and it does sound like it sent this notification.

          Sending Notification: -9223372036854775808 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 1 (recipient), 3 (myid), 0x0 (n.peerEpoch)
          

          The initialization of leader election here doesn't sound right. And, as Alexander Shraer has pointed out, 2 and 3 apparently received notifications with 0xffffffffffffffff as the round of the sender.

          Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0xffffffffffffffff (n.round), LEADING (n.state), 1 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)100000049 (n.config version)
          

          I found no evidence in the log of 1 that it has actually set or sent such a value.

          The values I'm seeing in the notification across logs look a bit strange.

          michim Michi Mutsuzaki added a comment -

          I'm guessing node1 is hitting this case? https://github.com/apache/zookeeper/blob/76bb6747c8250f28157636cf4011b78e7569727a/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L332 In this case we don't log the message that gets sent out.
          ziyouw Ziyou Wang added a comment -

          In this test, the ReconfigCommand class is changed to keep the client around after finishing the reconfig request. The client connection for reconfiguring node 2 stays open for about 5 extra seconds. But the client connection for reconfiguring node 3 still fails immediately. So it means the client connection is broken by something on the server side.

          ziyouw Ziyou Wang added a comment -

          Hi Alexander Shraer,

          I tried this and uploaded the logs. It seems the bug itself breaks the client connection. On the client side, after it issues the request, I just get an exception: KeeperErrorCode = ConnectionLoss. So even if we don't close the session, the server does.

          ziyouw Ziyou Wang added a comment -

          When the bug is triggered, the client connection used to issue the reconfig request is also broken. The client gets a ConnectionLoss error as the reconfig result.

          rgs Raul Gutierrez Segales added a comment -

          That's expected right?

          rgs Raul Gutierrez Segales added a comment -

          Slightly related: ZOOKEEPER-2202.

          ziyouw Ziyou Wang added a comment -

          Alexander Shraer thought the bug may be caused when the client connection is closed. But the result shows that the server closes the connection before the client does. In the correct case, the connection should be kept until the client receives the new config status.

          shralex Alexander Shraer added a comment -

          In zoo-2-3.log you can see server 3 processing requests till the end of the run. The other servers seem ok too, the only thing is the connection being closed right after the reconfig, but this is expected - the leader election algorithm is being reset. Server 2 should still be connected to server 1 through the leader port and can continue processing requests. So the last logs seem quite ok.

          Can we conclude that the bug happens when the client disconnects before getting a response?

          ziyouw Ziyou Wang added a comment -

          The bug shows up again with the same code change as last time.

          ziyouw Ziyou Wang added a comment -

          The bug just was not triggered in that run. Please check the new log: I used the same code to try again and hit the bug this time. I really doubt that it is caused by client connections, because it is hard to explain why we cannot hit this bug with the same client after waiting some time.

          rgs Raul Gutierrez Segales added a comment -

          Ziyou Wang: it might be helpful to dump the FLE (FastLeaderElection) and ZAB messages when the bug is triggered. You could do this with zktraffic (https://github.com/twitter/zktraffic), i.e.:

          
          

          $ sudo pip install zktraffic
          $ sudo fle-dump --iface=eth0

          # from another terminal
          $ sudo zab-dump --iface=eth0
          suda Akihiro Suda added a comment -

          Ziyou Wang
          FYI this patch for zktraffic might be useful: https://github.com/twitter/zktraffic/pull/30 (Deeper dissection of Quorum packets and Reconfig packets)

          shralex Alexander Shraer added a comment -

          Server 1:

          2015-06-03 17:15:27,692 [myid:1] - DEBUG [ProcessThread(sid:1 cport:-1)::QuorumCnxManager@365] - Opening channel to server 3
          2015-06-03 17:15:27,710 [myid:1] - DEBUG [ProcessThread(sid:1 cport:-1)::QuorumCnxManager@371] - Connected to server 3
          ...
          2015-06-03 17:15:31,649 [myid:1] - ERROR [LearnerHandler-/10.0.0.2:50710:LearnerHandler@580] - Unexpected exception causing shutdown while sock still open
          java.net.SocketTimeoutException: Read timed out
          at java.net.SocketInputStream.socketRead0(Native Method)
          at java.net.SocketInputStream.read(Unknown Source)
          at java.net.SocketInputStream.read(Unknown Source)
          at java.io.BufferedInputStream.fill(Unknown Source)
          at java.io.BufferedInputStream.read(Unknown Source)
          at java.io.DataInputStream.readInt(Unknown Source)
          at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
          at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
          at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
          at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:493)
          2015-06-03 17:15:31,650 [myid:1] - WARN [LearnerHandler-/10.0.0.2:50710:LearnerHandler@595] - ******* GOODBYE /10.0.0.2:50710 ********
          Server 3:

          2015-06-03 17:15:27,702 [myid:3] - INFO [/10.0.0.3:1200:QuorumCnxManager$Listener@556] - Received connection request /10.0.0.1:39184
          2015-06-03 17:15:31,718 [myid:3] - WARN [/10.0.0.3:1200:QuorumCnxManager@268] - Exception reading or writing challenge: java.net.SocketTimeoutException: Read timed out

          It seems that server 3 is still in the initial connection phase while server 1 already passed it and wants to start working (fails on the qp = new QuorumPacket() line).

          It may be helpful to log everything server 3 gets from server 1 before it crashes in QuorumCnxnManager receiveConnection, exactly what execution path is taken, etc. You could also print server 1's side in LearnerHandler.java

          Also, are you sure that the zookeeper distribution on these two servers is the same ?

          shralex Alexander Shraer added a comment -

          btw, sorry for being slow to respond

          shralex Alexander Shraer added a comment -

          I might have misread the log - the exception on server 1 may be due to a problem talking with server 2. It would be good to log the server id to which the LearnerHandler belongs.

          Another suggestion is to log more things in Follower.java - case Leader.PROPOSAL and Leader.COMMIT.
          It's possible that something happens there that causes the follower to hang and the leader's socket to time out.

          ziyouw Ziyou Wang added a comment -

          Yes, I also suspect this happens when server 3 is still syncing data from server 1 or server 2. That would explain why this bug is easier to trigger in a slow environment and can be avoided by waiting.

          So you mean I need to add extra logging in QuorumCnxnManager and LearnerHandler for this case?

          Yes, the three servers use the same docker image.

          ziyouw Ziyou Wang added a comment -

          Hi Raul Gutierrez Segales and Akihiro Suda,

          Thanks a lot for this info. I can try this when I try to reproduce it next time.

          shralex Alexander Shraer added a comment -

          Yeah, just add a lot of logging statements if you could, in QuorumCnxnManager, LearnerHandler and Follower.java where proposals and COMMITS are received. I'd actually start from the latter to see if server 2 somehow gets stuck in the middle of processing a reconfig command.

          shralex Alexander Shraer added a comment -

          Hi Ziyou Wang, could you please check whether the bug still exists with the 2212 patch applied?

          ziyouw Ziyou Wang added a comment -

          Hi Alexander Shraer, thanks for pointing this out. I tried the 2212 patch today and found that it indeed reduces the chance of reproducing the problem. From the log, I can see this logic:

          if (!rqv.equals(curQV)) {
              LOG.info("restarting leader election");
              self.shuttingDownLE = true;
              self.getElectionAlg().shutdown();
              break;
          }

          is executed once in node 2 and twice in node 3 in the successful case.

          But I can still hit the problem (although it is harder than before). In the failing case, I find the above logic is executed only once, in node 3.

          Sorry for updating this jira late; I was busy with some urgent issues.

          ziyouw Ziyou Wang added a comment -

          Log files after applying patch 2212. The bug is harder to reproduce after applying 2212.

          suda Akihiro Suda added a comment -

          ZOOKEEPER-2212 seems not directly related to this bug.

          If ZK hits a race condition that is resolved in ZOOKEEPER-2212, LOG.debug("Skip processReconfig(), state: {}", self.getServerState()); must be found in log files.
          https://github.com/apache/zookeeper/blob/ec056d3c3a18b862d0cd83296b7d4319652b0b1c/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L308

          Perhaps the bug got harder to hit due to other causes (e.g., the network latency got shorter)?

          ziyouw Ziyou Wang added a comment -

I just checked the patch again. Yes, we should not hit 2212 here. But it is really strange to see its effect in our case. I just run each node in one of three Docker containers on the same host, so I don't think the network latency changes, but I am not sure whether the disk latency changes or not.

          suda Akihiro Suda added a comment -

          Hi,

It is very interesting that both servers 1 and 2 time out at 15:55:08,439 after the reconfig begins at 15:55:04.
          After these timeouts, neither ZooKeeperServer can be revived and the ensemble gets into a weird state.
          (However, in zoo-3-2.log (Jun 3), server 2 raises an EOFException, not a SocketTimeoutException, at 17:15:31.)

          These timeouts are raised by this while loop in server 1 and this while loop in server 2.

          Unfortunately, we are not sure which types of QuorumPacket are triggering these timeouts.
          So I think it might be helpful to add LOG.debug(qp.getType()) at this switch for server 1 and this switch for server 2.
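
          For illustration, a minimal sketch of that kind of logging on the follower side (hypothetical placement, not a patch; only the debug line is new):

              protected void processPacket(QuorumPacket qp) throws Exception {
                  // hypothetical debug line: record every quorum packet type the follower reads,
                  // so the packet (or its absence) preceding a read timeout shows up in the log
                  LOG.debug("received quorum packet, type={}", qp.getType());
                  switch (qp.getType()) {
                  case Leader.PING:
                      ping(qp);
                      break;
                  // ... remaining cases unchanged
                  }
              }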

          Perhaps they are not pinging each other?
          This comment in LearnerHandler.ping() seems interesting.

          // If learner hasn't sync properly yet, don't send ping packet
          // otherwise, the learner will crash

          ziyouw Ziyou Wang added a comment -

Added a log statement to record quorum packet types.

          ziyouw Ziyou Wang added a comment -

Thanks for looking into this. I always suspected this problem may be related to the sync, because I need to wait longer to avoid it when the cluster is running with a slow disk.

          I uploaded the log files after adding the log statement to record the quorum packet type.

          suda Akihiro Suda added a comment -

Hi, Ziyou Wang,

          Thank you for logging.

From the latest logs, we can confirm that:

          • server 1 is not pinging (packet type=5) server 2 after the reconfig at 13:30:33,573, and hence server 2 cannot pong back to server 1. Server 1 then times out waiting for the pong from server 2 and shuts down (this at 13:30:37,560 and this at 13:30:53,638).
          • server 1 is not proposing (packet type=2) to server 2. CommitProcessor is working at 13:30:33,575, but ProposalRequestProcessor seems not to be.

          Maybe Leader.lead() and Leader.propose() are racing due to some reason related to syncing?
          I will check this later.

          ziyouw Ziyou Wang added a comment -

Can we add more logs to find the race condition here? I guess the problem happens when server 3 is still syncing data from the leader. So if you can slow down this sync process and do the reconfig before it finishes, we may also be able to reproduce the problem in a normal environment.

          suda Akihiro Suda added a comment -

Yes, as many logs as possible might be helpful.
          Plus, additional information such as the exact ZK version, workload scripts, or filesystem information might also be helpful.

          I am trying to reproduce the bug by injecting some Thread.sleep() calls into syncing-related functions using byteman.
          But I have not been able to reproduce the bug so far, as I am not sure which function should be injected.

          mitake Hitoshi Mitake added a comment -

          Hi Ziyou Wang,

It seems that I could reproduce this problem: just adding new servers with reconfig one by one, the ensemble ends up rejecting every client request.
          (Of course there is a possibility that I am misunderstanding something.)

          I used our distributed systems debugger named earthquake. It uses byteman to inspect the execution of the debuggee (the ZooKeeper server in this case) and tries to cause corner-case situations that are hard to produce in ordinary testing by reordering the inspected method calls and returns.

          We are preparing a Docker image so the problem can easily be reproduced in your environment. Please wait for a while.

          I'm analyzing the problem and would like to post the root cause and a patch, but it may take some time because I'm new to ZooKeeper. So I attached the logs (zookeeper-123.out) and the history of the ensemble (history.txt). The logs seem to be similar to yours.
          The history is in an earthquake-specific format, so it isn't easy to read, but I think you can interpret the event sequence roughly (it is just a sequence of method calls and returns plus their stack traces). It would be great to hear your comments.

          shralex Alexander Shraer added a comment -

          It looks like server 2 is crashing before the first reconfig is invoked (18:14:48 is when the reconfig happens).

          This is what happens on server 2:

          INFO: paramMap:

          {quorumPacket=PING 200000001 null}

          quorum packet send in Follower : org.jboss.byteman.rule.exception.ExecuteException: MethodExpression.interpret : exception invoking method packetToString file /earthquake/reconfig-trunk/materials/server.btm line 34
          2015-07-21 18:14:43,735 [myid:2] - ERROR [SyncThread:2:ZooKeeperCriticalThread@48] - Severe unrecoverable error, from thread : SyncThread:2
          org.jboss.byteman.rule.exception.ExecuteException: MethodExpression.interpret : exception invoking method packetToString file /earthquake/reconfig-trunk/materials/server.btm line 34
          at org.jboss.byteman.rule.expression.MethodExpression.interpret(MethodExpression.java:347)
          at org.jboss.byteman.rule.expression.MethodExpression.interpret(MethodExpression.java:334)
          at org.jboss.byteman.rule.Action.interpret(Action.java:144)
          at net.osrg.earthquake.PBEQHelper_HelperAdapter_Interpreted_2.fire(server.btm)
          at net.osrg.earthquake.PBEQHelper_HelperAdapter_Interpreted_2.execute0(server.btm)
          at net.osrg.earthquake.PBEQHelper_HelperAdapter_Interpreted_2.execute(server.btm)
          at org.jboss.byteman.rule.Rule.execute(Rule.java:684)
          at org.jboss.byteman.rule.Rule.execute(Rule.java:653)
          at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java)
          at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:62)
          at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:186)
          at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:113)
          Caused by: java.lang.NullPointerException
          at org.apache.zookeeper.server.quorum.LearnerHandler.packetToString(LearnerHandler.java:261)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at org.jboss.byteman.rule.expression.MethodExpression.interpret(MethodExpression.java:341)
          ... 11 more
          2015-07-21 18:14:43,735 [myid:2] - INFO [SyncThread:2:ZooKeeperServer$ZooKeeperServerListenerImpl@442] - Thread SyncThread:2 exits, error code 1
          2015-07-21 18:14:43,735 [myid:2] - INFO [SyncThread:2:LearnerZooKeeperServer@165] - Shutting down
          2015-07-21 18:14:43,735 [myid:2] - INFO [SyncThread:2:ZooKeeperServer@465] - shutting down
          2015-07-21 18:14:43,735 [myid:2] - INFO [SyncThread:2:FollowerRequestProcessor@138] - Shutting down
          2015-07-21 18:14:43,735 [myid:2] - INFO [SyncThread:2:CommitProcessor@358] - Shutting down
          2015-07-21 18:14:43,736 [myid:2] - INFO [FollowerRequestProcessor:2:FollowerRequestProcessor@109] - FollowerRequestProcessor exited loop!

Then server 1, the leader, sends the incremental reconfig, times out, and closes the connection.

          2015-07-21 18:14:48,750 [myid:1] - INFO [ProcessThread(sid:1 cport:-1)::PrepRequestProcessor@512] - Incremental reconfig
          Jul 21, 2015 6:14:48 PM net.osrg.earthquake.PBInspector EventFuncReturn
          INFO: paramMap:

          {quorumPacket=PROPOSAL 200000002 null}

          .....
          2015-07-21 18:15:02,739 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):LearnerHandler@937] - Closing connection to peer due to transaction timeout.

          shralex Alexander Shraer added a comment -

          I may be wrong, but it looks like the crash on server 2 happens after the sync, when the server tries to respond to the leader's PING. It executes the following:

    protected void processPacket(QuorumPacket qp) throws Exception {
        switch (qp.getType()) {
        case Leader.PING:
            ping(qp);

So it just tries to send a PING back. The ping() function in Learner.java sends the same packet back, only changing the data; it doesn't create a new packet. I wonder if it is somehow a problem that a new packet is not being created. The exception is a NullPointerException, so maybe the old packet somehow gets deallocated?
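
          A rough sketch of the reuse pattern being described (paraphrased, not the exact source; buildSessionTouchData() is a placeholder, not a real Learner method):

              protected void ping(QuorumPacket qp) throws IOException {
                  // reuse the leader's PING packet: only the payload is replaced before sending it back
                  qp.setData(buildSessionTouchData());   // placeholder for serializing the session touch table
                  writePacket(qp, true);
              }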

          mitake Hitoshi Mitake added a comment -

          Hi Alexander Shraer, thanks for your reply.

As you pointed out, the crash is caused in the inspection layer (written in byteman). Sorry for the noise.

          But the NullPointerException is a little bit odd. The exception is caused by a byteman script like this:
          RULE quorum packet receive in Follower
          CLASS Learner
          METHOD readPacket
          HELPER net.osrg.earthquake.PBEQHelper
          BIND argMap = new java.util.HashMap()
          AT EXIT
          IF $# == 1
          DO
          argMap.put("quorumPacket", org.apache.zookeeper.server.quorum.LearnerHandler.packetToString($1));
          eventFuncReturn("Learner.readPacket", argMap);
          ENDRULE

IIUC, the QuorumPacket will never be null in the follower. I'll look into the problem.

          ziyouw Ziyou Wang added a comment -

Thanks for looking at this. For this JIRA, I think the problem happens if we do the reconfig before the new node can sync all its data from the leader. So could we use earthquake to simulate this race condition? Currently, this problem can only be hit in a VM environment.

          mitake Hitoshi Mitake added a comment -

          Hi Ziyou Wang,

Could you check whether my understanding is correct? IIUC, your situation is like below:
          1. server 1 boots
          2. server 2 boots
          3. client issues reconfig to server 1
          4. server 2 tries to sync with server 1 via Learner.syncWithLeader()
          5. server 3 boots
          6. client issues reconfig to server 1

          (the reconfig requests in 3 and 6 overlap)

          If this is correct, I'll be able to reproduce the situation with earthquake.

          mitake Hitoshi Mitake added a comment -

Ziyou Wang BTW, if possible, could you share the Dockerfile you use for your testing?

          mitake Hitoshi Mitake added a comment -

          Sorry for bothering with duplicated replies...

          mitake Hitoshi Mitake added a comment -

          Hi Ziyou Wang,

I found a slightly strange code path, like below:
          1. In the tail of Leader.shutdown(), the leader tries to remove all learner handlers inside synchronized (learners). The loop calls LearnerHandler.shutdown().
          2. In LearnerHandler.shutdown(), leader.removeLearnerHandler() is called.
          3. In Leader.removeLearnerHandler(), the learners member of Leader is again locked with synchronized.

          It seems that the above sequence could cause a deadlock.

          I removed synchronized (learners) in removeLearnerHandler in the attached patch. Could you test it in your environment?

The targeted version is 3.5.0.

          mitake Hitoshi Mitake added a comment -

Sorry, synchronized is reentrant, so the patch would be wrong... please ignore it.
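
          As a minimal standalone illustration (plain Java, not ZooKeeper code) of why the nested locking above does not deadlock:

              // the same thread may re-enter a monitor it already holds,
              // so shutdownAll() -> remove() below never blocks on itself
              class HandlerRegistry {
                  private final java.util.List<Object> learners = new java.util.ArrayList<>();

                  void shutdownAll() {
                      synchronized (learners) {
                          for (Object h : new java.util.ArrayList<>(learners)) {
                              remove(h);              // re-acquires the same lock
                          }
                      }
                  }

                  void remove(Object h) {
                      synchronized (learners) {       // same monitor, same thread: reentrant, no deadlock
                          learners.remove(h);
                      }
                  }
              }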

          mitake Hitoshi Mitake added a comment -

But the attached logs (at DEBUG level) don't contain any messages from QuorumPeer.updateServerState(). Perhaps the leader's shutdown process is stopping the QuorumPeer main thread?

          ziyouw Ziyou Wang added a comment -

Sorry, I had some problems receiving mail at my registered mail address last week.

The situation is quite simple: start one ZK server, then reconfig on server 1 to add the new server to the cluster, then start and reconfig the next server. But step 4 should finish before step 3, because when we increase the node count from 1 to 2, ZK must make sure the cluster still has a majority after the reconfig. After the second server has been added to the cluster, server 3 is started and reconfigured to become part of the cluster.

          ziyouw Ziyou Wang added a comment -

Sorry, it is commercial code and I don't have the right to share it. But the environment is quite simple: I just run each ZK server in one Docker container, and all the containers run in the same VM. I think the key factor in this environment is that the disk latency is longer than in normal cases, and the problem becomes even more serious if all of them are running on a slow disk.

          arshad.mohammad Mohammad Arshad added a comment -

We also faced this issue.
          The problem occurs when the reconfig's PROPOSAL and COMMITANDACTIVATE arrive between the snapshot and the UPTODATE.
          The following steps can be followed to reproduce this issue very easily:

          1. Start a three-server ZooKeeper cluster; let's say the servers are server-1, server-2, and server-3.
          2. Create a large amount of data in ZooKeeper, around 150 MB.
          3. Install the fourth server, server-4, and add the server information of all four servers to server-4's config:
            server.1=192.168.1.3:2888:3888:participant
            server.2=192.168.1.3:2889:3889:participant
            server.3=192.168.1.3:2890:3890:participant
            server.4=192.168.1.2:2890:3890:participant
            
          4. Connect to any of the existing servers.
          5. Start server-4 and immediately run the reconfig command from the already connected client:
            reconfig -add server.4=192.168.1.2:2890:3890:participant;2181
          6. Open the zookeeper/conf folder; you will find zoo.cfg.dynamic.next and the existing quorum dynamic configuration file zoo.cfg.dynamic.100000000:
            zoo.cfg.dynamic.next --> this has information of all the servers
            zoo.cfg.dynamic.100000000 --> this has information of only existing servers server-1,server-2,server-3
          7. Even though server-4 started and joined the quorum, if you try to restart it, it will fail with the following error:
            2016-07-24 11:00:11,689 [myid:4] - ERROR [main:QuorumPeerMain@98] - Unexpected exception, exiting abnormally
            java.lang.RuntimeException: My id 4 not in the peer list
            		at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:748)
            		at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:183)
            		at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:120)
            		at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:79)
            
          arshad.mohammad Mohammad Arshad added a comment -

The issue occurs when the reconfig's PROPOSAL and COMMITANDACTIVATE arrive between the snapshot and the UPTODATE while syncing with the leader.
          In the existing code the reconfig commit is not processed as it should be for a follower. In the observer case, the reconfig's commit is processed properly.
          We can fix this issue by processing the reconfig's commit for the follower in the same way it is processed for the observer.
          Submitting the fix.

          shralex Alexander Shraer added a comment -

Could you please provide the logs?

          The existence of the .next file indicates that there was a failure in the middle of the reconfig, and the COMMITANDACTIVATE
          message didn't arrive at the server on which you found this file. Which server was it? Just 4, or all of them?
          The zoo.cfg.dynamic.100000000 file is the old configuration.

          Did the other servers continue to operate normally? Did they reboot? Were they able to serve requests afterwards? It would be helpful if you describe this too.

          Step 7 is actually expected if server 4 crashed before it got the COMMITANDACTIVATE message. I described this in the manual:

          "Finally, note that once connected to the leader, a joiner adopts the last committed configuration, in which it is absent (the initial config of the joiner is backed up before being rewritten). If the joiner restarts in this state, it will not be able to boot since it is absent from its configuration file. In order to start it you’ll once again have to specify an initial configuration."

          arshad.mohammad Mohammad Arshad added a comment -

The existence of the .next file indicates that there was a failure in the middle of the reconfig, and the COMMITANDACTIVATE message didn't arrive at the server on which you found this file.

A failure is one scenario in which the .next file is not deleted, but it is not the only one. In this scenario the .next file exists because of a logical problem in the code.
          The COMMITANDACTIVATE message arrives but is not processed, for the following reason.
          Sequence of events:
          1) case Leader.NEWLEADER:
          lastSeenQuorumVerifier is updated with 100000000 and the .next file is created.
          2) case Leader.PROPOSAL (reconfig):
          lastSeenQuorumVerifier is updated with 200000000 and the earlier .next file is overwritten.
          3) case Leader.COMMITANDACTIVATE:
          Because the snapshot is taken in step 1, snapshotTaken=true and self.processReconfig() is not called. This call was supposed to delete the .next file and create the updated zoo.cfg.dynamic.200000000 file.
          code reference:

          if (!snapshotTaken) {
              ----
              boolean majorChange =
                  self.processReconfig(qv, ByteBuffer.wrap(qp.getData()).getLong(), qp.getZxid(), true);
              ----
          }
          

4) case Leader.UPTODATE:
          This calls self.processReconfig(), but again it is skipped because the lastSeenQuorumVerifier version is higher; it got updated in step 2).

          public synchronized QuorumVerifier setQuorumVerifier(QuorumVerifier qv, boolean writeToDisk) {
              if ((quorumVerifier != null) && (quorumVerifier.getVersion() >= qv.getVersion())) {
          
          

Which server was it? Just 4, or all of them?

          Just server 4.

          Step 7 is actually expected if server 4 crashed before it got the COMMITANDACTIVATE message. I described this in the manual:

          But the server did not crash. It is in the normal flow.

          Could you please provide the logs?

          Can you please try to reproduce it with the above steps, so we can reach a conclusion fast? Let me know; if it does not reproduce, I will reproduce it and share the logs.

          shralex Alexander Shraer added a comment -

Hi Arshad, your patch makes sense to me (as you say, processReconfig should be called, and it wasn't when the snapshot flag was on),
          but we probably shouldn't commit it without a test. Since you have a clear scenario where this fails without the patch, would you be able to add a test?
          I'm away until Aug 8 but can review when I'm back.

          arshad.mohammad Mohammad Arshad added a comment -

Alexander Shraer, please find the new patch ZOOKEEPER-2172-03.patch, which includes a test case as well.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12822094/ZOOKEEPER-2172-03.patch
          against trunk revision 1755100.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3323//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3323//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3323//console

          This message is automatically generated.

          shralex Alexander Shraer added a comment -

          Arshad Mohammad, thanks for adding the test!

Does this test consistently fail without your fix? The reason I'm asking is that you pointed out previously that the issue occurs when a large snapshot is being taken, whereas in the test I don't see that the state is large. How do you make sure that COMMITANDACTIVATE is received while snapshotting?

          arshad.mohammad Mohammad Arshad added a comment -
1. Without the fix, I ran the test 10 times: it failed 9 times and passed once.
            With the fix: it passed 10 times out of 10 runs.
            In the first place, covering this reconfig scenario through test cases is very difficult.
            This test case has been added to give the reviewer confidence in the fix. I hope it serves the purpose.
          2. How do you make sure that COMMITANDACTIVATE is received while snapshotting?

            Not while snapshotting, but in between snapshotting and the UPTODATE message.
            Before sending Leader.ACK a sleep is added, which gives a bigger time window for the reconfig PROPOSAL and COMMITANDACTIVATE to arrive,
            and finally the packet arrival sequence becomes Leader.NEWLEADER (snapshotting done here) --> reconfig Leader.PROPOSAL --> reconfig Leader.COMMITANDACTIVATE --> Leader.UPTODATE (see the sketch below).
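
          For illustration, a rough sketch of that delaying trick (a hypothetical test helper, not the committed test code; the DelayingFollower name, the delayNextAck flag, and the 5-second sleep are made up for the example, and the class is assumed to live in the org.apache.zookeeper.server.quorum package):

              // a Follower whose writePacket() holds back the ACK it sends in response to NEWLEADER,
              // widening the window in which the reconfig PROPOSAL and COMMITANDACTIVATE can arrive
              // before UPTODATE
              class DelayingFollower extends Follower {
                  volatile boolean delayNextAck = false;   // set by the test once NEWLEADER has been received

                  DelayingFollower(QuorumPeer self, FollowerZooKeeperServer zk) {
                      super(self, zk);
                  }

                  @Override
                  void writePacket(QuorumPacket pp, boolean flush) throws java.io.IOException {
                      if (delayNextAck && pp != null && pp.getType() == Leader.ACK) {
                          try {
                              Thread.sleep(5000);          // give the leader time to send the reconfig first
                          } catch (InterruptedException e) {
                              Thread.currentThread().interrupt();
                          }
                          delayNextAck = false;
                      }
                      super.writePacket(pp, flush);
                  }
              }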

          shralex Alexander Shraer added a comment -

          Thanks Arshad, I verified that the test fails before the patch and passes afterwards. I realize that the patch is complex, and appreciate you writing it! I was trying to understand the logic.

          A few minor comments:

          • Please rename the test so that the name reflects what is being tested. Such as ReconfigDuringLeaderSync / ReconfigDuringSnapshot or something similar.
          • Please don't assume that SERVER_COUNT = 3 (so the extra server can be number 4). You could use a different variable in the test instead of SERVER_COUNT, or just give the new server an id based on SERVER_COUNT instead of 4.
          • The test shuts down the 4th server in one method and the other three in another. Consider doing this in the same place and also closing the client handles.
          • getQP could be static, maybe rename to getCustomQuorumPeer or something like that
          • Please add more comments explaining the logic of the test. For example, in writePacket please add a comment that you're delaying the ACK message a follower sends in response to a NEWLEADER message, so that the leader has a chance to send the reconfig and only then the UPTODATE (basically your comment in this thread above). Without the comment it's not clear why you're checking for ACK while setting the newLeaderMessage flag. (I hope I understood the logic correctly.) Similarly, when the 4th server is being started, please add a comment saying that this server will delay the response to a NEWLEADER message. Basically, more comments would be helpful.
          • In the code, shouldn't the condition of the if be "pif.hdr.getZxid() == qp.getZxid() && qp.getType() == Leader.COMMITANDACTIVATE"?
            Otherwise the logic inside the if may not be correct (the configuration info qv is extracted from pif).

          Maybe change to:

              if (pif.hdr.getZxid() != qp.getZxid()) {
                  LOG.warn("Committing " + qp.getZxid() + ", but next proposal is " + pif.hdr.getZxid());
              } else if (qp.getType() == Leader.COMMITANDACTIVATE) {

          shralex Alexander Shraer added a comment -

Another thing you could add is a check that the QuorumVerifier was updated: e.g., when you start the fourth server, do that with only itself and the leader in its config,
          and then at the end of the test check that its quorum verifier includes all 4 servers. Checking the file deletion is OK, but it's sort of a symptom of the fact
          that the reconfiguration didn't complete on that server, so this is an attempt to check the configuration explicitly.
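
          A minimal sketch of the kind of assertion meant here (hypothetical; newServer and its getQuorumPeer() accessor are placeholders for however the test exposes the fourth server's QuorumPeer):

              // check the active configuration directly instead of only the dynamic file on disk
              QuorumVerifier qv = newServer.getQuorumPeer().getQuorumVerifier();
              Assert.assertEquals("all four servers should be in the active config",
                      4, qv.getAllMembers().size());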

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12823593/ZOOKEEPER-2172-04.patch
          against trunk revision 1756262.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3363//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3363//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3363//console

          This message is automatically generated.

          arshad.mohammad Mohammad Arshad added a comment -

Thanks Alexander Shraer; I addressed all the comments in the latest patch.

          shralex Alexander Shraer added a comment -

          Thanks!
          I made a few more changes:

          • starting the new server with only itself and the leader in the initial config, so we can know that its config was really updated after the reconfig
          • changed the Learner.java logic a bit, to match what was there before ZK-107 (see revision history). Also I noticed that if majorChange = true we often skip important logic when throwing an exception, so I moved the exception part to the end of each code block.

          Can another committer take a look before we commit? Chris Nauroth Raul Gutierrez Segales Flavio Junqueira

          shralex Alexander Shraer added a comment -

          Mohammad Arshad, please take a look

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12823597/ZOOKEPER-2172-05.patch
          against trunk revision 1756262.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3365//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3365//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3365//console

          This message is automatically generated.

          shralex Alexander Shraer added a comment -

          On second thought, investigating the right way to throw the exception should probably be done in a separate jira, so I reverted some of my changes.

          shralex Alexander Shraer added a comment -

          I opened https://issues.apache.org/jira/browse/ZOOKEEPER-2513 as a followup.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12823602/ZOOKEEPER-2172-06.patch
          against trunk revision 1756262.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3366//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3366//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3366//console

          This message is automatically generated.

          arshad.mohammad Mohammad Arshad added a comment -

          Thanks Alexander Shraer for authoring the patch; it is much improved.
          It is good that the changes you introduced in ZOOKEEPER-2172-05.patch have been removed in the latest patch and that a new jira was created to address those concerns.
          I corrected the following nits in ZOOKEEPER-2172-06.patch and submitted a new patch (the code these fragments come from is sketched after this comment):

          ZOOKEEPER-2172-06.patch:10: trailing whitespace.
          ZOOKEEPER-2172-06.patch:30: trailing whitespace.
                                  boolean majorChange =
          ZOOKEEPER-2172-06.patch:31: space before tab in indent.
                                                  self.processReconfig(qv, ByteBuffer.wrap(qp.getData()).getLong(), qp.getZxid(), true);
          ZOOKEEPER-2172-06.patch:35: trailing whitespace.
                              }
          ZOOKEEPER-2172-06.patch:111: trailing whitespace.
               *
          
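          The fragments flagged in the whitespace nits above ("boolean majorChange =" and the processReconfig(...) call) belong to the reconfig handling this patch adds to Learner.java, one of the files listed in the commit below. The following is only a rough sketch of how those pieces fit together during leader sync, not a verbatim excerpt of ZOOKEEPER-2172-07.patch: here "self" is the learner's QuorumPeer, "qp" the QuorumPacket received from the leader, "newConfig" a stand-in for the configuration string carried by the reconfig transaction, and QuorumPeer#configFromString is assumed to be the helper used to parse it, as elsewhere in the reconfig code.

              // Rough sketch (assumptions noted above, not the committed patch):
              // apply a reconfig that arrives while the learner is still syncing.
              QuorumVerifier qv = self.configFromString(newConfig);
              // Arguments: new verifier, the long encoded at the start of the packet
              // data, the zxid of the reconfig, and a flag to restart leader election.
              boolean majorChange =
                      self.processReconfig(qv, ByteBuffer.wrap(qp.getData()).getLong(), qp.getZxid(), true);
              if (majorChange) {
                  // A change significant enough to require a new leader election
                  // aborts the current sync instead of leaving the cluster in the
                  // half-applied state described in this issue.
                  throw new Exception("changes proposed in reconfig");
              }
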
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12823615/ZOOKEEPER-2172-07.patch
          against trunk revision 1756262.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3368//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3368//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3368//console

          This message is automatically generated.

          shralex Alexander Shraer added a comment -

          +1 from me. Since I removed most of my changes, I feel comfortable committing this, but let's wait a week or so to let
          other people comment if they wish.

          Mohammad Arshad would you like to take a stab at ZOOKEEPER-2513 ?

          phunt Patrick Hunt added a comment -

          Given this has sign-off from Alex and it's been pending for a while, I've committed it to 3.5.3 and trunk. Thanks Arshad (and Alex)!

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build ZooKeeper-trunk #3070 (See https://builds.apache.org/job/ZooKeeper-trunk/3070/)
          ZOOKEEPER-2172: Cluster crashes when reconfig a new node as a participant (Arshad Mohammad via phunt) (phunt: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1759907)

          • (edit) trunk/CHANGES.txt
          • (edit) trunk/src/java/main/org/apache/zookeeper/server/quorum/Learner.java
          • (add) trunk/src/java/test/org/apache/zookeeper/server/quorum/ReconfigDuringLeaderSyncTest.java
          arshad.mohammad Mohammad Arshad added a comment -

          Thanks Bhupendra Kumar Jain and Ajith S for helping me analyse this issue.
          Thanks Alexander Shraer and Patrick Hunt for reviewing and committing the patch.
          Thanks everyone for participating in the discussion and providing useful information.


            People

            • Assignee:
              arshad.mohammad Mohammad Arshad
            • Reporter:
              ziyouw Ziyou Wang
            • Votes:
              0
            • Watchers:
              13
