Hadoop Common / HADOOP-11328

ZKFailoverController does not log Exception when doRun raises errors

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.5.1
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: ha
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      In ZKFailoverController.java, the run() method catches the Exception but never logs it. This causes latent problems that only manifest during failover.

      The problem we encountered

      An Exception is thrown from the doRun() method during initHM() (caused by a configuration error). To reproduce, set
      "ha.health-monitor.connect-retry-interval.ms" to any nonsensical value.

      ZKFailoverController.java
        private int doRun(String[] args) {
          ...
          initRPC();
          initHM();
          startRPC();
          ...
        }
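
One way a nonsensical value could make initHM() fail is a number-parsing error. The report does not show the parsing code, so the following is only a minimal sketch under that assumption: it uses java.util.Properties as a stand-in for the Hadoop configuration, and readIntervalMs and the "1000" fallback are hypothetical names and values chosen for illustration.

```java
import java.util.Properties;

/** Illustration only: Properties stands in for the real Hadoop configuration object. */
class BadIntervalDemo {
  static final String KEY = "ha.health-monitor.connect-retry-interval.ms";

  // Hypothetical helper: reads the interval as a long, as a millisecond
  // setting is commonly parsed. The "1000" fallback is a placeholder.
  static long readIntervalMs(Properties conf) {
    return Long.parseLong(conf.getProperty(KEY, "1000").trim());
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(KEY, "not-a-number"); // the "nonsensical value"
    try {
      readIntervalMs(conf);
    } catch (NumberFormatException e) {
      // In the scenario from this report, an exception like this would
      // propagate out of doRun() during initialization.
      System.out.println("startup would fail: " + e.getMessage());
    }
  }
}
```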
      

      The Exception is caught in the run() method, as follows:

      ZKFailoverController.java
        public int run(final String[] args) throws Exception {
          ...
          try {
            ...
              @Override
              public Integer run() {
                try {
                  return doRun(args);
                } catch (Exception t) {
                  throw new RuntimeException(t);
                } finally {
                  if (elector != null) {
                    elector.terminateConnection();
                  }
                }
              }
            });
          } catch (RuntimeException rte) {
            throw (Exception)rte.getCause();
          }
        }
      

      Unfortunately, the Exception (which causes the process to shut down) is not logged at all. This causes latent errors that only manifest during failover (because ZKFC is dead). The tricky thing here is that everything looks perfectly fine: the jps command shows a running DFSZKFailoverController process and the two NameNodes (active and standby) work fine.
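
The silent path can be reproduced outside Hadoop. The sketch below is a simplified stand-in, not the actual ZKFailoverController code: SilentRethrowDemo, CapturingHandler and countLogRecordsDuringFailure are hypothetical names, and java.util.logging replaces the project's logger. It mirrors only the wrap-then-unwrap shape of run() and counts how many log records the failure produces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical stand-in for the wrap/unwrap pattern in run(). */
class SilentRethrowDemo {
  static final Logger LOG = Logger.getLogger("SilentRethrowDemo");

  /** Collects log records in memory so they can be counted. */
  static final class CapturingHandler extends Handler {
    final List<LogRecord> records = new ArrayList<>();
    @Override public void publish(LogRecord r) { records.add(r); }
    @Override public void flush() {}
    @Override public void close() {}
  }

  // Stand-in for doRun(): fails the way a bad configuration value might.
  static int doRun() throws Exception {
    throw new IllegalArgumentException("bad configuration value");
  }

  // Same shape as run(): wrap the exception, then unwrap and rethrow it.
  static int run() throws Exception {
    try {
      try {
        return doRun();
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    } catch (RuntimeException rte) {
      throw (Exception) rte.getCause(); // nothing is logged on this path
    }
  }

  /** Triggers the failure once and returns how many records were logged. */
  static int countLogRecordsDuringFailure() {
    CapturingHandler handler = new CapturingHandler();
    LOG.addHandler(handler);
    try {
      run();
    } catch (Exception expected) {
      // The caller in this sketch also stays silent.
    } finally {
      LOG.removeHandler(handler);
    }
    return handler.records.size();
  }

  public static void main(String[] args) {
    System.out.println("records logged: " + countLogRecordsDuringFailure());
  }
}
```

The original exception does reach the caller, but unless the caller reports it, no trace of the failure is left in the log.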

      Patch

      We strongly suggest adding an error log to report the caught error, such as:

      --- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (revision 1641307)
      +++ hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (working copy)

      @@ -178,6 +178,7 @@
               }
             });
           } catch (RuntimeException rte) {
      +      LOG.fatal("The failover controller encounters runtime error: " + rte);
             throw (Exception)rte.getCause();
           }
         }
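
A possible variant of the suggested log call passes the underlying cause to the logger as a separate argument instead of concatenating it into the message, so that the stack trace is recorded too. The sketch below is self-contained and is not the committed fix: LoggedRethrowDemo and CAPTURED are hypothetical names, and java.util.logging with Level.SEVERE stands in for the project's LOG.fatal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical sketch: log the unwrapped cause before rethrowing it. */
class LoggedRethrowDemo {
  static final Logger LOG = Logger.getLogger("LoggedRethrowDemo");
  static final List<LogRecord> CAPTURED = new ArrayList<>();

  static {
    // Capture records in memory instead of printing them to the console.
    LOG.setUseParentHandlers(false);
    LOG.addHandler(new Handler() {
      @Override public void publish(LogRecord r) { CAPTURED.add(r); }
      @Override public void flush() {}
      @Override public void close() {}
    });
  }

  // Stand-in for doRun(): fails the way a bad configuration value might.
  static int doRun() throws Exception {
    throw new IllegalArgumentException("bad configuration value");
  }

  static int run() throws Exception {
    try {
      try {
        return doRun();
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    } catch (RuntimeException rte) {
      Throwable cause = rte.getCause();
      // Passing the Throwable itself keeps its stack trace in the record.
      LOG.log(Level.SEVERE, "The failover controller encountered a runtime error", cause);
      throw (Exception) cause;
    }
  }

  public static void main(String[] args) {
    try {
      run();
    } catch (Exception e) {
      System.out.println("rethrown: " + e.getClass().getSimpleName());
    }
    System.out.println("records logged: " + CAPTURED.size());
  }
}
```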
      

      Thanks!

            People

            • Assignee: Tianyin Xu
            • Reporter: Tianyin Xu