Hadoop HDFS / HDFS-12834

DFSZKFailoverController on error exits with 0 error code


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.7.3, 3.0.0-alpha4
    • Fix Version/s: None
    • Component/s: ha
    • Labels: None

    Description

      On error, DFSZKFailoverController exits with return code 0. This causes problems when integrating it with scripts and monitoring tools. For example, systemd, when configured to restart the service only on failure, does not restart ZKFC because the process exited with 0.

      In my case, systemd reported that zkfc exited successfully, but I found this in the logs:

      2017-11-14 05:33:55,075 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x15fb794bd240001, closing socket connection and attempting reconnect
      2017-11-14 05:33:55,178 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
      2017-11-14 05:33:55,564 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.9.4.73/10.9.4.73:2182. Will not attempt to authenticate using SASL (unknown error)
      2017-11-14 05:33:55,566 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.9.4.73/10.9.4.73:2182, initiating session
      2017-11-14 05:33:55,569 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.9.4.73/10.9.4.73:2182, sessionid = 0x15fb794bd240001, negotiated timeout = 5000
      2017-11-14 05:33:55,570 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
      2017-11-14 05:33:58,230 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x15fb794bd240001, likely server has closed socket, closing socket connection and attempting reconnect
      2017-11-14 05:33:58,335 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
      2017-11-14 05:33:58,402 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.9.4.138/10.9.4.138:2181. Will not attempt to authenticate using SASL (unknown error)
      2017-11-14 05:33:58,403 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.9.4.138/10.9.4.138:2181, initiating session
      2017-11-14 05:33:58,406 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x15fb794bd240001, likely server has closed socket, closing socket connection and attempting reconnect
      2017-11-14 05:33:59,218 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.9.4.228/10.9.4.228:2183. Will not attempt to authenticate using SASL (unknown error)
      2017-11-14 05:33:59,219 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.9.4.228/10.9.4.228:2183, initiating session
      2017-11-14 05:33:59,221 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x15fb794bd240001, likely server has closed socket, closing socket connection and attempting reconnect
      2017-11-14 05:34:01,094 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.9.4.73/10.9.4.73:2182. Will not attempt to authenticate using SASL (unknown error)
      2017-11-14 05:34:01,094 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 1773ms for sessionid 0x15fb794bd240001, closing socket connection and attempting reconnect
      2017-11-14 05:34:01,196 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
      2017-11-14 05:34:02,153 INFO org.apache.zookeeper.ZooKeeper: Session: 0x15fb794bd240001 closed
      2017-11-14 05:34:02,154 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
      2017-11-14 05:34:02,154 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
      2017-11-14 05:34:05,208 INFO org.apache.hadoop.ipc.Server: Stopping server on 8019
      2017-11-14 05:34:05,487 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8019
      2017-11-14 05:34:05,488 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
      2017-11-14 05:34:05,487 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
      2017-11-14 05:34:05,488 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
      2017-11-14 05:34:05,490 FATAL org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Got a fatal error, exiting now
      java.lang.RuntimeException: ZK Failover Controller failed: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
              at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:369)
              at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:238)
              at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
              at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
              at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
              at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
              at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
              at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:181)
      

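      The code that seems responsible is in DFSZKFailoverController.java. retCode is initialized to 0 and is not changed in the catch block, so when zkfc.run() throws, the process still calls System.exit(0):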

        public static void main(String args[])
            throws Exception {
      ...
          int retCode = 0;
          try {
            retCode = zkfc.run(parser.getRemainingArgs());
          } catch (Throwable t) {
            LOG.fatal("Got a fatal error, exiting now", t); 
          }   
          System.exit(retCode);
        }
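
      One possible way to address this is to return a non-zero code when a Throwable is caught. The following is a minimal, standalone sketch of that idea and not the actual patch: the class ExitCodeSketch, the helper runAndGetExitCode, and the constant FAILURE_EXIT_CODE are hypothetical names, and a Callable stands in for zkfc.run(...).

```java
import java.util.concurrent.Callable;

public class ExitCodeSketch {
    // Hypothetical non-zero status reported when the task fails unexpectedly.
    static final int FAILURE_EXIT_CODE = 1;

    // Runs the task and returns the exit code the process should use.
    // The Callable is a stand-in for zkfc.run(parser.getRemainingArgs()).
    static int runAndGetExitCode(Callable<Integer> task) {
        try {
            return task.call();
        } catch (Throwable t) {
            // Unlike the quoted main(), a failure here does not fall through as 0.
            System.err.println("Got a fatal error, exiting now: " + t);
            return FAILURE_EXIT_CODE;
        }
    }

    public static void main(String[] args) {
        // Simulate the failure from the stack trace with a thrown RuntimeException.
        int retCode = runAndGetExitCode(() -> {
            throw new RuntimeException("ZK Failover Controller failed (simulated)");
        });
        // A real main() would pass this value to System.exit(retCode).
        System.out.println("exit code: " + retCode);
    }
}
```

      With an exit code like this, a supervisor that restarts only on failure (as in the systemd case described above) would see the crash as a failure rather than a clean exit.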
      

      Attachments

        1. HDFS-12834.00.patch (0.7 kB, Bharat Viswanadham)
        2. HDFS-12834.01.patch (1 kB, Bharat Viswanadham)

          People

            Assignee: Bharat Viswanadham (bharat)
            Reporter: Zbigniew Kostrzewa (kostrzewa)