ZooKeeper / ZOOKEEPER-3890

Ephemeral node not deleted after session is gone, then elected as leader


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 3.4.14, 3.5.7
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      When a ZooKeeper client session disappears, the associated ephemeral node that is used for leader election is occasionally not deleted and persists (indefinitely, it seems).
      A leader election process may select such a stale node as the leader. Where redundant services rely on a ZooKeeper election to decide which of them becomes active, electing the stale ephemeral node leaves none of the services active.
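The failure mode can be illustrated with a small, self-contained model. It assumes the common election recipe in which each participant creates an ephemeral sequential znode and the lowest sequence number wins; the report does not say which recipe is in use, so that is an assumption. `Candidate`, `elect_leader`, the paths, and the session ids are hypothetical, and nothing here talks to a real ZooKeeper server.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for an election znode and its owning session."""
    path: str            # e.g. "/election/n_0000000003"
    owner_session: int   # id of the session that created the ephemeral node
    owner_alive: bool    # whether a client process still backs that session


def sequence_number(path: str) -> int:
    """Extract the numeric suffix of a sequential znode name."""
    return int(path.rsplit("_", 1)[1])


def elect_leader(candidates: List[Candidate]) -> Optional[Candidate]:
    """Assumed recipe: the candidate with the lowest sequence number leads."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: sequence_number(c.path))


# One stale node left behind by a killed client, plus two live participants.
candidates = [
    Candidate("/election/n_0000000003", owner_session=0x1000, owner_alive=False),
    Candidate("/election/n_0000000007", owner_session=0x2000, owner_alive=True),
    Candidate("/election/n_0000000008", owner_session=0x3000, owner_alive=True),
]

leader = elect_leader(candidates)
# The stale node wins the election, so no live service becomes active.
service_active = leader is not None and leader.owner_alive
print(leader.path, service_active)
```

In this model the live participants keep waiting behind a leader that no process backs, which matches the symptom described in the report.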

      One scenario that produces such a stale ephemeral node is force-killing the ZooKeeper server (kill -9 <pid>) together with the client. After the server is restarted, it recreates the session on its side even though the actual client session is gone. From then on, the node also survives regular restarts. Unlike an active session, its owner session sends no pings, yet the session never expires. This scenario involves a single ZooKeeper server, but the problem has also been observed in a cluster of three.
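For context on why a silent session would normally be expected to expire: the server bounds the timeout a client requests and expires a session it has not heard from within the negotiated value. The sketch below models that rule with the `tickTime`, `minSessionTimeout`, and `maxSessionTimeout` values that appear in the server log quoted further down in this report. The function names are hypothetical, and a real server checks expiry periodically rather than at the exact millisecond, so this is an approximation only.

```python
# Values taken from the server log in this report.
TICK_TIME_MS = 2000
MIN_SESSION_TIMEOUT_MS = 4000    # equals 2 * tickTime
MAX_SESSION_TIMEOUT_MS = 40000   # equals 20 * tickTime


def negotiate_timeout(requested_ms: int) -> int:
    """Clamp the client's requested timeout into the server's allowed range."""
    return max(MIN_SESSION_TIMEOUT_MS, min(requested_ms, MAX_SESSION_TIMEOUT_MS))


def is_expired(last_heard_ms: int, now_ms: int, timeout_ms: int) -> bool:
    """Simplified rule: expired once the silence exceeds the negotiated timeout."""
    return now_ms - last_heard_ms > timeout_ms


too_short = negotiate_timeout(1000)    # raised to the minimum
too_long = negotiate_timeout(60000)    # lowered to the maximum
in_range = negotiate_timeout(10000)    # accepted unchanged

# A session that has been silent for 45 s exceeds even the maximum timeout.
silent_session_expired = is_expired(last_heard_ms=0, now_ms=45000, timeout_ms=too_long)
print(too_short, too_long, in_range, silent_session_expired)
```

Under this model a session that sends no pings should be gone after at most 40 seconds, which is why the never-expiring session in the report stands out.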

      When the ephemeral node is first persisted after restarting (and every restart thereafter), the following is observable in the ZooKeeper server logs. The scenario involves a local ZooKeeper server (version 3.5.7) and a single leader election participant.

      Opening datadir:/my/path snapDir:/my/path
      zookeeper.snapshot.trust.empty : true
      tickTime set to 2000
      minSessionTimeout set to 4000
      maxSessionTimeout set to 40000
      zookeeper.snapshotSizeFactor = 0.33
      Reading snapshot /my/path/version-2/snapshot.71
      Created new input stream /my/path/version-2/log.4b
      Created new input archive /my/path/version-2/log.4b
      EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.4b
      Created new input stream /my/path/version-2/log.72
      Created new input archive /my/path/version-2/log.72
      Ignoring processTxn failure hdr: -1 : error: -110
      Ignoring processTxn failure hdr: -1, error: -110, path: null
      Ignoring processTxn failure hdr: -1 : error: -110
      Ignoring processTxn failure hdr: -1, error: -110, path: null
      Ignoring processTxn failure hdr: -1 : error: -110
      Ignoring processTxn failure hdr: -1, error: -110, path: null
      Ignoring processTxn failure hdr: -1 : error: -110
      Ignoring processTxn failure hdr: -1, error: -110, path: null
      Ignoring processTxn failure hdr: -1 : error: -110
      Ignoring processTxn failure hdr: -1, error: -110, path: null
      Ignoring processTxn failure hdr: -1 : error: -110
      Ignoring processTxn failure hdr: -1, error: -110, path: null
      EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.72
      Snapshotting: 0x8b to /my/path/version-2/snapshot.8b
      ZKShutdownHandler is not registered, so ZooKeeper server won't take any action on ERROR or SHUTDOWN server state changes
      autopurge.snapRetainCount set to 3
      autopurge.purgeInterval set to 3
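The `error: -110` values in the log above are numeric `KeeperException` codes, and -110 corresponds to `NODEEXISTS`. The sketch below counts such replay failures in a log excerpt. The regular expression is written against the two line formats shown above only, the code table is a small hand-copied subset, and `summarize_failures` is a hypothetical helper rather than part of any ZooKeeper tooling.

```python
import re
from collections import Counter

# Small subset of KeeperException codes; not an exhaustive table.
KEEPER_CODES = {
    -101: "NONODE",
    -110: "NODEEXISTS",
    -111: "NOTEMPTY",
    -112: "SESSIONEXPIRED",
}

# Matches both "hdr: -1 : error: -110" and "hdr: -1, error: -110, path: null".
FAILURE_RE = re.compile(
    r"Ignoring processTxn failure hdr:\s*(-?\d+)\s*[:,]\s*error:\s*(-?\d+)"
)


def summarize_failures(log_text: str) -> Counter:
    """Count ignored processTxn failures per error name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            code = int(match.group(2))
            counts[KEEPER_CODES.get(code, f"UNKNOWN({code})")] += 1
    return counts


sample = """\
Ignoring processTxn failure hdr: -1 : error: -110
Ignoring processTxn failure hdr: -1, error: -110, path: null
EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.72
"""

summary = summarize_failures(sample)
print(dict(summary))
```

Run against the full log, a count like this shows at a glance whether every ignored failure during replay was a `NODEEXISTS`, as it is in the excerpt quoted in this report.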

      Could this problem be solved by ZooKeeper checking the sessions for each participating node before starting a leader election?
      So far only manual intervention (removing the stale ephemeral node) seems to "fix" the issue temporarily.
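The manual intervention can be narrowed down by comparing each election znode's owner with the sessions known to be backed by a live client. The sketch below assumes the caller has already collected the `ephemeralOwner` value of each znode (a field of the znode's stat) and a set of live session ids by some other means. `find_stale_nodes`, `format_session_id`, and the sample data are hypothetical, and no ZooKeeper API is called.

```python
from typing import Dict, List, Set


def find_stale_nodes(owners: Dict[str, int], live_sessions: Set[int]) -> List[str]:
    """Return paths whose owning session is not in the set of live sessions."""
    return sorted(path for path, owner in owners.items() if owner not in live_sessions)


def format_session_id(session_id: int) -> str:
    """Render a session id in hexadecimal for display."""
    return f"0x{session_id:x}"


# Hypothetical data: znode path -> ephemeralOwner session id.
owners = {
    "/election/n_0000000003": 0x1000,
    "/election/n_0000000007": 0x2000,
    "/election/n_0000000008": 0x3000,
}
live_sessions = {0x2000, 0x3000}

stale = find_stale_nodes(owners, live_sessions)
for path in stale:
    print(path, "owned by", format_session_id(owners[path]))
```

The returned paths are only candidates for removal; how the set of live sessions is obtained, and whether deleting a node is safe, depends on the deployment.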

      Attachments

        1. cmdline-feedback.txt
          4 kB
          Lea Morschel
        2. zkLogsAndSnapshots.tar.xz
          46 kB
          Lea Morschel

          People

            Assignee: Unassigned
            Reporter: Lea Morschel (lemora)
            Votes: 2
            Watchers: 6
