  1. Oozie
  2. OOZIE-2654

ZooKeeper-dependent services should not depend on the connection state being valid before cleaning up

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.2.0
    • Fix Version/s: 5.0.0b1, 4.3.1
    • Component/s: HA
    • Labels: None

    Description

      Currently, the ZKUtils, ZKLocks and ZKJobsConcurrency services do not properly tear down their ZooKeeper connections if the callback from ZooKeeper that updates the connection state was never received.

      We can get into this situation if, for example, ZK closed the session before any callback was received to update the connection state. This can prevent an Oozie server in HA mode from terminating, leaving one or more sockets in the CLOSE_WAIT state.

      Here is an instance of this issue.

      In the network connections, one connection to ZooKeeper remains in CLOSE_WAIT indefinitely:

      tcp6 143 0 x.x.x.1:46710 x.x.x.2:2181 CLOSE_WAIT 4688/java off (0.00/0/0)

      From the ZooKeeper logs:

      2016-08-18 20:45:29,921 - INFO NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868 - Client attempting to establish new session at /x.x.x.1:46710
      2016-08-18 20:45:29,926 - INFO CommitProcessor:1:ZooKeeperServer@617 - Established session 0x1569f576843000e with negotiated timeout 40000 for client /x.x.x.1:46710

      and later

      2016-08-18 20:46:34,008 - INFO CommitProcessor:1:NIOServerCnxn@1007 - Closed socket connection for client /x.x.x.1:46710 which had sessionid 0x1569f576843000e

      The fix is to stop checking the connection state during service destroy and to tear down the ZK connections unconditionally.
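
      The difference between the two destroy paths can be sketched roughly as follows. This is only an illustration, not the attached patch: FakeZkClient, ZkBackedService, onStateChanged, destroyGuarded and destroyUnconditional are hypothetical names, and the exact condition Oozie checked is not shown in this issue, so the guard below is an assumption.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a ZooKeeper client handle (not Oozie's or Curator's API). */
class FakeZkClient implements Closeable {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    @Override
    public void close() {
        closed.set(true);
    }

    boolean isClosed() {
        return closed.get();
    }
}

/** Hypothetical service that learns the connection state only through a callback. */
class ZkBackedService {
    enum State { UNKNOWN, CONNECTED, LOST }

    private final FakeZkClient client;
    // Stays UNKNOWN if ZooKeeper never delivers a state-change callback.
    private volatile State lastKnownState = State.UNKNOWN;

    ZkBackedService(FakeZkClient client) {
        this.client = client;
    }

    /** Invoked by the (assumed) connection state listener. */
    void onStateChanged(State newState) {
        lastKnownState = newState;
    }

    /** Problematic pattern: cleanup is skipped unless a callback marked the state valid. */
    void destroyGuarded() {
        if (lastKnownState == State.CONNECTED) {
            client.close();
        }
    }

    /** Pattern described by the fix: always release the client on destroy. */
    void destroyUnconditional() {
        client.close();
    }
}
```

      With no callback delivered, destroyGuarded() leaves the client open, which corresponds to the lingering socket described above, while destroyUnconditional() closes it regardless of the last known state.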

      Attachments

        1. OOZIE-2654.diff
          2 kB
          Venkat Ranganathan

          People

            Assignee: Venkat Ranganathan (venkatnrangan)
            Reporter: Venkat Ranganathan (venkatnrangan)
            Votes: 0
            Watchers: 5
