[ZOOKEEPER-713] zookeeper fails to start - broken snapshot? - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Invalid
Affects Version/s: 3.2.2
Fix Version/s: None
Component/s: None
Labels:
None
Environment:

debian lenny; ia64; xen virtualization

Description

Hi guys,

The following is not a bug report but rather a question - but as I am attaching large files I am posting it here rather than on mailinglist.

Today we had major failure in our production environment. Machines in zookeeper cluster gone wild and all clients got disconnected.
We tried to restart whole zookeeper cluster but cluster got stuck in leader election phase.

Calling stat command on any machine in the cluster resulted in 'ZooKeeperServer not running' message
In one of logs I noticed 'Invalid snapshot' message which disturbed me a bit.

We did not manage to make cluster work again with data. We deleted all version-2 directories on all nodes and then cluster started up without problems.
Is it possible that snapshot/log data got corrupted in a way which made cluster unable to start?
Fortunately we could rebuild data we store in zookeeper as we use it only for locks and most of nodes is ephemeral.

I am attaching contents of version-2 directory from all nodes and server logs.
Source problem occurred some time before 15. First cluster restart happened at 15:03.
At some point later we experimented with deleting version-2 directory so I would not look at following restart because they can be misleading due to our actions.

I am also attaching zoo.cfg. Maybe something is wrong at this place.
As I know look into logs i see read timeout during initialization phase after 20secs (initLimit=10, tickTime=2000).
Maybe all I have to do is increase one or other. which one? Are there any downsides of increasing tickTime.

Best regards, Łukasz Osipiuk

PS. due to attachment size limit I used split. to untar use
cat nodeX-version-2.tgz-* |tar -xz

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

zoo.cfg
18/Mar/10 15:54
0.5 kB
Lukasz Osipiuk
node3-zookeeper.log.gz
18/Mar/10 15:54
192 kB
Lukasz Osipiuk
node3-version-2.tgz-ac
18/Mar/10 16:11
1.75 MB
Lukasz Osipiuk
node3-version-2.tgz-ab
18/Mar/10 16:08
9.00 MB
Lukasz Osipiuk
node3-version-2.tgz-aa
18/Mar/10 16:08
9.00 MB
Lukasz Osipiuk
node2-zookeeper.log.gz
18/Mar/10 15:54
237 kB
Lukasz Osipiuk
node2-version-2.tgz-ac
18/Mar/10 16:07
5.41 MB
Lukasz Osipiuk
node2-version-2.tgz-ab
18/Mar/10 16:06
9.00 MB
Lukasz Osipiuk
node2-version-2.tgz-aa
18/Mar/10 16:05
9.00 MB
Lukasz Osipiuk
node1-zookeeper.log.gz
18/Mar/10 15:54
213 kB
Lukasz Osipiuk
node1-version-2.tgz-ab
18/Mar/10 16:03
6.21 MB
Lukasz Osipiuk
node1-version-2.tgz-aa
18/Mar/10 16:02
9.00 MB
Lukasz Osipiuk

Issue Links

is related to

ZOOKEEPER-714 snapshotting doesn't handle runtime exceptions (like out of memory) well

Open

ZOOKEEPER-715 add better reporting for initLimit being reached

Open

Activity

People

Assignee:: Unassigned

Reporter:: Lukasz Osipiuk

Votes:: 0 Vote for this issue

Watchers:: 0 Start watching this issue

Dates

Created:: 18/Mar/10 15:53

Updated:: 26/Mar/10 17:20

Resolved:: 18/Mar/10 17:41