Hadoop Common / HADOOP-11328

ZKFailoverController does not log Exception when doRun raises errors


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.5.1
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: ha
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      In ZKFailoverController.java, the Exception caught by the run() method is never logged. This causes latent problems that only manifest during failover.

      The problem we encountered

      An Exception is thrown from the doRun() method during initHM() because of a configuration error. To reproduce, set
      "ha.health-monitor.connect-retry-interval.ms" to any nonsensical value.

      ZKFailoverController.java
        private int doRun(String[] args) {
          ...
          initRPC();
          initHM();
          startRPC();
          ...
        }
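
The reproduction step can be illustrated outside Hadoop. This is a minimal sketch, assuming the interval is parsed as a long integer. The class `RetryIntervalDemo`, the helper `readIntervalMs`, and the default of 1000 ms are hypothetical stand-ins, and `java.util.Properties` takes the place of Hadoop's configuration object.

```java
import java.util.Properties;

public class RetryIntervalDemo {
    static final String KEY = "ha.health-monitor.connect-retry-interval.ms";

    // Hypothetical stand-in for a numeric configuration lookup (not Hadoop's API).
    // Returns the default when the key is unset; otherwise parses the value as a long.
    static long readIntervalMs(Properties conf, long defaultMs) {
        String raw = conf.getProperty(KEY);
        if (raw == null) {
            return defaultMs;
        }
        // A non-numeric value makes Long.parseLong throw NumberFormatException.
        return Long.parseLong(raw.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(KEY, "not-a-number"); // the "nonsensical value"
        try {
            readIntervalMs(conf, 1000L); // 1000 is an arbitrary default for the sketch
        } catch (NumberFormatException e) {
            System.out.println("initialization failed: " + e);
        }
    }
}
```

An exception of this kind, raised during initialization, is what propagates out of doRun().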
      

      The Exception is caught in the run() method, as follows:

      ZKFailoverController.java
        public int run(final String[] args) throws Exception {
          ...
          try {
            ...
              @Override
              public Integer run() {
                try {
                  return doRun(args);
                } catch (Exception t) {
                  throw new RuntimeException(t);
                } finally {
                  if (elector != null) {
                    elector.terminateConnection();
                  }
                }
              }
            });
          } catch (RuntimeException rte) {
            throw (Exception)rte.getCause();
          }
        }
      

      Unfortunately, the Exception (causing the shutdown of the process) is not logged at all. This causes latent errors that only manifest during failover (because ZKFC is dead). The tricky part is that everything looks perfectly fine: the jps command shows a running DFSZKFailoverController process, and the two NameNodes (active and standby) work fine.
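
The silent failure can be reproduced with a small standalone program. This is a sketch of the wrap-and-unwrap shape only, not the ZKFC code: `SilentFailureDemo` is a hypothetical class, its `doRun()` always fails, and a plain `Supplier` replaces the wrapper that is elided in the snippet above.

```java
import java.util.function.Supplier;

public class SilentFailureDemo {
    // Stand-in for doRun(): fails the way a bad configuration value would.
    static int doRun() throws Exception {
        throw new IllegalArgumentException("bad config value");
    }

    // Same shape as run(): wrap in RuntimeException inside the action,
    // unwrap and rethrow outside it. Nothing is logged along the way.
    static int run() throws Exception {
        try {
            Supplier<Integer> action = () -> {
                try {
                    return doRun();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            };
            return action.get();
        } catch (RuntimeException rte) {
            throw (Exception) rte.getCause();
        }
    }

    public static void main(String[] args) {
        try {
            run();
        } catch (Exception e) {
            // The failure becomes visible only if the caller reports it.
            System.err.println("run() failed: " + e);
        }
    }
}
```

If the caller does not report the exception either, the failure leaves no trace in the log.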

      Patch

      We strongly suggest adding an error log to report the caught error, such as:

      --- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (revision 1641307)
      +++ hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (working copy)

      @@ -178,6 +178,7 @@
               }
             });
           } catch (RuntimeException rte) {
      +      LOG.fatal("The failover controller encounters runtime error: " + rte);
             throw (Exception)rte.getCause();
           }
         }
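
One possible refinement of the same idea is to pass the unwrapped cause to the logger as a Throwable, so that the stack trace is recorded and not only the exception's string form. The sketch below uses `java.util.logging` to stay self-contained; it is not the logging API used by ZKFailoverController, and `LogCauseDemo` and `runLogged` are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogCauseDemo {
    static final Logger LOG = Logger.getLogger("LogCauseDemo");

    // Runs the body with the same wrap-and-unwrap shape as run(),
    // but logs the underlying cause (with stack trace) before rethrowing it.
    static int runLogged(Callable<Integer> body) throws Exception {
        try {
            try {
                return body.call();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        } catch (RuntimeException rte) {
            Throwable cause = rte.getCause();
            // Passing the Throwable attaches it to the log record.
            LOG.log(Level.SEVERE, "The failover controller encountered a runtime error", cause);
            throw (Exception) cause;
        }
    }

    public static void main(String[] args) {
        try {
            runLogged(() -> {
                throw new IllegalStateException("bad config value");
            });
        } catch (Exception e) {
            System.err.println("exiting after: " + e);
        }
    }
}
```

Logging the cause, not the wrapping RuntimeException, keeps the message focused on the original failure.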
      

      Thanks!


          People

            Assignee: Tianyin Xu (tianyin)
            Reporter: Tianyin Xu (tianyin)