[ZOOKEEPER-1277] servers stop serving when lower 32bits of zxid roll over - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 3.3.3
Fix Version/s: 3.3.5, 3.4.4, 3.5.0
Component/s: server
Labels: None
Hadoop Flags: Reviewed
Release Note:
Workaround: there is a simple workaround for this issue. Force a leader re-election before the lower 32 bits of the zxid reach 0xffffffff.

Most users won't even see this given the number of writes on a typical installation. Say you are doing 500 writes/second: you'd see this after ~3 months if the quorum is stable. Any change that forces a new leader election (such as upgrading the server software) would cause the lower 32 bits of the zxid to be reset, thereby staving off this issue for another period.
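The ~3 month figure can be checked with a small calculation. The following is only a rough sketch: the class and method names are hypothetical (they are not part of ZooKeeper), and it assumes a constant write rate with exactly one zxid consumed per write.

```java
// Hypothetical helper for illustration; not ZooKeeper code.
final class RolloverEstimate {
    // Largest value the lower 32 bits of a zxid can hold.
    static final long COUNTER_MAX = 0xffffffffL;

    // Number of zxids left before the lower 32 bits reach 0xffffffff.
    static long remaining(long zxid) {
        return COUNTER_MAX - (zxid & COUNTER_MAX);
    }

    // Days until rollover, assuming a constant rate and one zxid per write.
    static double daysUntilRollover(long zxid, double writesPerSecond) {
        return remaining(zxid) / writesPerSecond / 86_400.0;
    }

    public static void main(String[] args) {
        long freshEpoch = 7L << 32; // arbitrary epoch 7, counter 0
        System.out.printf("%.1f days%n", daysUntilRollover(freshEpoch, 500.0));
    }
}
```

A monitoring script could use something like `remaining` to decide when to schedule the forced leader re-election, well before the headroom reaches zero.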


Description

When the lower 32 bits of a zxid "roll over" (a zxid is a 64-bit number, of which the upper 32 bits are considered the epoch number), the epoch number is incremented and the lower 32 bits start at 0 again.
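The layout described above can be illustrated with a minimal sketch. The class and method names here are hypothetical and chosen for illustration only; the sketch assumes the zxid is advanced by a plain 64-bit increment, which is what makes the carry spill into the epoch bits.

```java
// Illustrative sketch only; names are hypothetical, not ZooKeeper's API.
final class ZxidSketch {
    static final long COUNTER_MASK = 0xffffffffL;

    // Upper 32 bits: the epoch number.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Lower 32 bits: the per-epoch counter.
    static long counterOf(long zxid) {
        return zxid & COUNTER_MASK;
    }

    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & COUNTER_MASK);
    }

    // Plain 64-bit increment: at counter 0xffffffff the carry bumps the epoch.
    static long next(long zxid) {
        return zxid + 1;
    }

    public static void main(String[] args) {
        long last = makeZxid(5, COUNTER_MASK);
        long rolled = next(last);
        System.out.printf("before: epoch=%d counter=0x%x%n", epochOf(last), counterOf(last));
        System.out.printf("after:  epoch=%d counter=0x%x%n", epochOf(rolled), counterOf(rolled));
    }
}
```

In this sketch the epoch changes without any leader election having taken place, which is the situation the rest of this description is concerned with.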

This should work fine; however, in the current 3.3 branch the followers see this as a NEWLEADER message, which it is not, and effectively stop serving clients. Attached clients seem to eventually time out, given that heartbeats (or any other operation) are no longer processed. The follower doesn't recover from this.

I've tested this on the 3.3 branch and confirmed the problem; however, I haven't tried it on 3.4/3.5. It may not happen on the newer branches due to ZOOKEEPER-335; however, there is certainly an issue with updating the "acceptedEpoch" files contained in the datadir. (I'll enter a separate jira for that.)

Attachments

| File | Date | Size | Author |
| --- | --- | --- | --- |
| ZOOKEEPER-1277_br33.patch | 15/Mar/12 05:45 | 22 kB | Patrick D. Hunt |
| ZOOKEEPER-1277_br33.patch | 14/Mar/12 23:39 | 22 kB | Patrick D. Hunt |
| ZOOKEEPER-1277_br33.patch | 14/Mar/12 17:29 | 20 kB | Patrick D. Hunt |
| ZOOKEEPER-1277_br33.patch | 12/Nov/11 01:08 | 12 kB | Patrick D. Hunt |
| ZOOKEEPER-1277_br34.patch | 15/Mar/12 05:45 | 26 kB | Patrick D. Hunt |
| ZOOKEEPER-1277_br34.patch | 15/Mar/12 04:52 | 26 kB | Patrick D. Hunt |
| ZOOKEEPER-1277_trunk.patch | 15/Mar/12 05:46 | 25 kB | Patrick D. Hunt |
| ZOOKEEPER-1277_trunk.patch | 15/Mar/12 04:53 | 25 kB | Patrick D. Hunt |

Issue Links

Blocked

ZOOKEEPER-2789 Reassign `ZXID` for solving 32bit overflow problem

Open

is related to

ZOOKEEPER-4870 Proactive leadership transfer

Open

ZOOKEEPER-4883 Rollover leader epoch when counter part of zxid reach limit

Open

relates to

ZOOKEEPER-1278 acceptedEpoch not handling zxid rollover in lower 32bits

Resolved

ZOOKEEPER-3253 client should not send requests with cxid=-4, -2, or -1

Closed

Activity

People

Assignee: Patrick D. Hunt

Reporter: Patrick D. Hunt

Votes: 0

Watchers: 12

Dates

Created: 02/Nov/11 16:46

Updated: 28/Oct/24 04:40

Resolved: 15/Mar/12 16:55