ZooKeeper / ZOOKEEPER-1277

servers stop serving when lower 32 bits of zxid roll over

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.3.3
    • Fix Version/s: 3.3.5, 3.4.4, 3.5.0
    • Component/s: server
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Workaround: force a leader re-election before the lower 32 bits of the zxid reach 0xffffffff.

      Most users won't even see this, given the number of writes on a typical installation. If you are doing 500 writes/second, for example, you would see this after ~3 months if the quorum is stable. Any change that triggers a leader election (such as upgrading the server software) resets the lower 32 bits of the zxid, thereby staving off this issue for another period.
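The "~3 months" figure in the release note can be checked with simple arithmetic: the counter occupies the lower 32 bits of the zxid, so it has 2^32 values per epoch. The sketch below is only an illustration and assumes a steady write rate and one zxid per write. The class `ZxidRolloverEstimate`, its methods, and the 0.9 threshold are hypothetical names and values chosen for this example; they are not part of ZooKeeper.

```java
// Illustrative sketch only: estimates time to rollover and flags a zxid
// whose lower 32 bits are close to 0xffffffff. All names are hypothetical.
final class ZxidRolloverEstimate {
    // The counter is the lower 32 bits of the zxid: 2^32 values per epoch.
    static final long COUNTER_SPACE = 1L << 32;
    static final double SECONDS_PER_DAY = 86_400.0;

    /** Days of stable leadership before the counter wraps, at a steady write rate. */
    static double daysUntilRollover(double writesPerSecond) {
        return COUNTER_SPACE / writesPerSecond / SECONDS_PER_DAY;
    }

    /** True once the counter has consumed at least the given fraction of its range. */
    static boolean nearRollover(long zxid, double fraction) {
        long counter = zxid & 0xffffffffL;
        return counter >= (long) (COUNTER_SPACE * fraction);
    }

    public static void main(String[] args) {
        // 2^32 / 500 / 86400 is roughly 99.4 days, i.e. a little over 3 months.
        System.out.println(daysUntilRollover(500.0));
        // Counter 0xf0000000 is past 90% of the range, so this prints true.
        System.out.println(nearRollover(0x1f0000000L, 0.9));
    }
}
```

A check like `nearRollover` could be used to decide when to apply the workaround, that is, when to force a leader re-election ahead of the rollover. How the current zxid is obtained is left out here on purpose.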

      Description

      When the lower 32 bits of a zxid "roll over" (a zxid is a 64-bit number whose upper 32 bits are treated as the epoch number), the epoch number is incremented and the lower 32 bits start at 0 again.

      This should work fine; however, in the current 3.3 branch the followers treat this as a NEWLEADER message, which it is not, and effectively stop serving clients. Attached clients seem to eventually time out, given that heartbeats (or any other operations) are no longer processed. The follower doesn't recover from this.

      I've tested this on the 3.3 branch and confirmed the problem, but I haven't tried it on 3.4/3.5. It may not happen on the newer branches due to ZOOKEEPER-335; however, there is certainly an issue with updating the "acceptedEpoch" files contained in the datadir. (I'll enter a separate JIRA for that.)
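The zxid layout described above can be sketched in a few lines. This is a minimal illustration that assumes the zxid is advanced by a plain 64-bit increment. The class `ZxidLayout` and its helpers are hypothetical names for this example and are not taken from the ZooKeeper source.

```java
// Illustrative sketch only: splits a 64-bit zxid into epoch (upper 32 bits)
// and counter (lower 32 bits) and shows what a rollover looks like.
final class ZxidLayout {
    /** Upper 32 bits of the zxid. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** Lower 32 bits of the zxid. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Packs an epoch and a counter into one 64-bit value. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        long last = makeZxid(5L, 0xffffffffL); // last zxid of epoch 5
        long next = last + 1;                  // the carry spills into the epoch bits

        System.out.println(Long.toHexString(last)); // prints 5ffffffff
        System.out.println(Long.toHexString(next)); // prints 600000000
        System.out.println(epochOf(next));          // prints 6
        System.out.println(counterOf(next));        // prints 0
    }
}
```

After the increment the epoch field reads 6 even though no leader election took place, which is consistent with the report that followers mistake the rollover for a change of leader.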

      1. ZOOKEEPER-1277_br33.patch
        22 kB
        Patrick Hunt
      2. ZOOKEEPER-1277_br33.patch
        22 kB
        Patrick Hunt
      3. ZOOKEEPER-1277_br33.patch
        20 kB
        Patrick Hunt
      4. ZOOKEEPER-1277_br33.patch
        12 kB
        Patrick Hunt
      5. ZOOKEEPER-1277_br34.patch
        26 kB
        Patrick Hunt
      6. ZOOKEEPER-1277_br34.patch
        26 kB
        Patrick Hunt
      7. ZOOKEEPER-1277_trunk.patch
        25 kB
        Patrick Hunt
      8. ZOOKEEPER-1277_trunk.patch
        25 kB
        Patrick Hunt

        Issue Links

          Activity

          Patrick Hunt created issue -
          Patrick Hunt made changes -
          Field Original Value New Value
          Release Note [ set to the workaround text shown in the Release Note field above ]
          Patrick Hunt made changes -
          Link This issue relates to ZOOKEEPER-1278 [ ZOOKEEPER-1278 ]
          Patrick Hunt made changes -
          Attachment ZOOKEEPER-1277_br33.patch [ 12503459 ]
          Patrick Hunt made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Flavio Junqueira made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Patrick Hunt made changes -
          Fix Version/s 3.3.5 [ 12319081 ]
          Fix Version/s 3.3.4 [ 12316276 ]
          Patrick Hunt made changes -
          Fix Version/s 3.3.6 [ 12320172 ]
          Fix Version/s 3.3.5 [ 12319081 ]
          Priority Blocker [ 1 ] Critical [ 2 ]
          Patrick Hunt made changes -
          Attachment ZOOKEEPER-1277_br33.patch [ 12518345 ]
          Patrick Hunt made changes -
          Attachment ZOOKEEPER-1277_br33.patch [ 12518400 ]
          Patrick Hunt made changes -
          Attachment ZOOKEEPER-1277_br34.patch [ 12518420 ]
          Patrick Hunt made changes -
          Attachment ZOOKEEPER-1277_trunk.patch [ 12518421 ]
          Patrick Hunt made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Fix Version/s 3.3.5 [ 12319081 ]
          Fix Version/s 3.4.4 [ 12319841 ]
          Fix Version/s 3.5.0 [ 12316644 ]
          Fix Version/s 3.3.6 [ 12320172 ]
          Patrick Hunt made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Patrick Hunt made changes -
          Attachment ZOOKEEPER-1277_br33.patch [ 12518424 ]
          Attachment ZOOKEEPER-1277_br34.patch [ 12518425 ]
          Patrick Hunt made changes -
          Attachment ZOOKEEPER-1277_trunk.patch [ 12518426 ]
          Patrick Hunt made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Patrick Hunt made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Resolution Fixed [ 1 ]

            People

            • Assignee:
              Patrick Hunt
              Reporter:
              Patrick Hunt
            • Votes:
              0
              Watchers:
              6
