Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-12186

ActiveStandbyElector shouldn't call monitorLockNodeAsync multiple times

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 2.7.1
    • 2.8.0, 2.7.2, 3.0.0-alpha1
    • ha
    • None
    • Reviewed

    Description

      ActiveStandbyElector shouldn't call monitorLockNodeAsync before StatCallback for previous zkClient.exists is received.
      We saw RM shutdown because ActiveStandbyElector retrying monitorLockNodeAsync exceeded limit. The following is the logs.
      Based on the log, it looks like multiple monitorLockNodeAsync are called at the same time due to back-to-back SyncConnected event received.
      The current code doesn't prevent zkClient.exists from being called before AsyncCallback.StatCallback for previous zkClient.exists is received.
      So the retry for monitorLockNodeAsync doesn't work correctly sometimes.

      2015-07-01 19:24:12,806 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6674ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect
      2015-07-01 19:24:12,919 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
      2015-07-01 19:24:14,704 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-1.internal/192.168.123.3:2181. Will not attempt to authenticate using SASL (unknown error)
      2015-07-01 19:24:14,704 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.123.3:43487, server: node-1.internal/192.168.123.3:2181
      2015-07-01 19:24:14,707 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-1.internal/192.168.123.3:2181, sessionid = 0x14e47693cc20007, negotiated timeout = 10000
      2015-07-01 19:24:14,712 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
      2015-07-01 19:24:21,374 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6667ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect
      2015-07-01 19:24:21,477 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
      2015-07-01 19:24:22,640 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-1.internal/192.168.123.3:2181. Will not attempt to authenticate using SASL (unknown error)
      2015-07-01 19:24:22,640 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.123.3:43526, server: node-1.internal/192.168.123.3:2181
      2015-07-01 19:24:22,641 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-1.internal/192.168.123.3:2181, sessionid = 0x14e47693cc20007, negotiated timeout = 10000
      2015-07-01 19:24:22,642 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
      2015-07-01 19:24:29,310 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6669ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect
      2015-07-01 19:24:29,413 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
      2015-07-01 19:24:30,738 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-1.internal/192.168.123.3:2181. Will not attempt to authenticate using SASL (unknown error)
      2015-07-01 19:24:30,739 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.123.3:43574, server: node-1.internal/192.168.123.3:2181
      2015-07-01 19:24:30,739 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-1.internal/192.168.123.3:2181, sessionid = 0x14e47693cc20007, negotiated timeout = 10000
      2015-07-01 19:24:30,740 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
      2015-07-01 19:24:37,409 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6670ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect
      2015-07-01 19:24:37,512 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
      2015-07-01 19:24:38,979 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-1.internal/192.168.123.3:2181. Will not attempt to authenticate using SASL (unknown error)
      2015-07-01 19:24:38,979 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.123.3:43598, server: node-1.internal/192.168.123.3:2181
      2015-07-01 19:24:38,980 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-1.internal/192.168.123.3:2181, sessionid = 0x14e47693cc20007, negotiated timeout = 10000
      2015-07-01 19:24:38,981 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
      2015-07-01 19:24:45,649 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6669ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect
      2015-07-01 19:24:45,752 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
      2015-07-01 19:24:45,855 INFO org.apache.zookeeper.ZooKeeper: Session: 0x14e47693cc20007 closed
      2015-07-01 19:25:07,932 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x14e47693cc20007
      2015-07-01 19:25:07,932 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type EMBEDDED_ELECTOR_FAILED. Cause:
      Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
      2015-07-01 19:25:07,932 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x14e47693cc20007
      2015-07-01 19:25:07,932 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x14e47693cc20007
      2015-07-01 19:25:07,933 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x14e47693cc20007
      2015-07-01 19:25:07,933 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
      2015-07-01 19:25:08,036 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
      

      Attachments

        1. HADOOP-12186.000.patch
          6 kB
          Zhihai Xu

        Activity

          People

            zxu Zhihai Xu
            zxu Zhihai Xu
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: