Hadoop YARN / YARN-2832

Wrong Check Logic of NodeHealthCheckerService Causes Latent Errors

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.1, 2.5.1
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels: None
    • Environment: Any environment

    Description

      NodeManager (NM) lets users specify the health-checker script that the health-checker service invokes, via the configuration parameter "yarn.nodemanager.health-checker.script.path".

      During serviceInit() of the health-checker service, the NM checks whether the parameter is set correctly by calling shouldRun(), as follows:

      /* NodeHealthCheckerService.java */
        protected void serviceInit(Configuration conf) throws Exception {
          if (NodeHealthScriptRunner.shouldRun(conf)) {
            nodeHealthScriptRunner = new NodeHealthScriptRunner();
            addService(nodeHealthScriptRunner);
          }
          addService(dirsHandler);
          super.serviceInit(conf);
        }
      

      The problem is that if the parameter is misconfigured (e.g., a permission problem or a wrong path), the NM logs no message to inform users. This can cause latent errors or mysterious problems (e.g., "why doesn't my script work?").

      The checking and logging logic currently lives in serviceStart() in NodeHealthScriptRunner.java (see the following code snippet). However, this logic is ineffective: for a misconfigured parameter that fails the shouldRun() check, serviceStart() is never called, because the NodeHealthScriptRunner instance is never created (see the code snippet above).

      /* NodeHealthScriptRunner.java */
        protected void serviceStart() throws Exception {
          // if health script path is not configured don't start the thread.
          if (!shouldRun(conf)) {
            LOG.info("Not starting node health monitor");
            return;
          }
          ... 
        }  
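
      The control flow described above can be modeled in a few lines. The sketch below is a simplified stand-in, not Hadoop code: the names UnreachableLogDemo, Runner, and initAndStart, and the "runner started" message, are hypothetical and exist only for illustration. It shows that when the check fails, no runner is created, so the "Not starting node health monitor" branch is never reached and nothing is logged.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the init/start sequence; hypothetical, not Hadoop code. */
final class UnreachableLogDemo {

    /** Stand-in for the script runner; records messages in a list instead of a logger. */
    static final class Runner {
        private final boolean shouldRun;
        private final List<String> log;

        Runner(boolean shouldRun, List<String> log) {
            this.shouldRun = shouldRun;
            this.log = log;
        }

        void start() {
            // Same shape as the serviceStart() check quoted in the report.
            if (!shouldRun) {
                log.add("Not starting node health monitor");
                return;
            }
            log.add("runner started"); // placeholder message for this model only
        }
    }

    /** The runner is created only when the check passes, mirroring serviceInit(). */
    static List<String> initAndStart(boolean shouldRun) {
        List<String> log = new ArrayList<>();
        Runner runner = shouldRun ? new Runner(shouldRun, log) : null;
        if (runner != null) {
            runner.start();
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(initAndStart(false)); // prints []
        System.out.println(initAndStart(true));  // prints [runner started]
    }
}
```

      In this model the failing case produces an empty log: the message inside start() can only appear if the runner exists, and the runner only exists when the check has already passed.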
      

      I think the checking and logging logic should be moved into serviceInit() in NodeHealthCheckerService, instead of serviceStart() in NodeHealthScriptRunner.

      See the attachment for a simple patch.
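
      As a rough illustration of the kind of diagnostic that could run at init time, the sketch below reports why a configured script path is unusable so that the caller can log the reason. It is an assumption-laden example, not the attached patch and not Hadoop API: HealthScriptPathCheck and whyNotRunnable are hypothetical names, and the three conditions checked (unset path, missing file, non-executable file) are inferred from the failure cases named in this report.

```java
import java.io.File;
import java.util.Optional;

/** Hypothetical helper, not part of Hadoop: explains why a script path is unusable. */
final class HealthScriptPathCheck {

    /**
     * Returns an empty Optional when the script looks runnable, otherwise a
     * human-readable reason that the caller can write to its log.
     */
    static Optional<String> whyNotRunnable(String scriptPath) {
        if (scriptPath == null || scriptPath.trim().isEmpty()) {
            return Optional.of("health-checker script path is not configured");
        }
        File script = new File(scriptPath.trim());
        if (!script.exists()) {
            return Optional.of("health-checker script does not exist: " + script);
        }
        if (!script.canExecute()) {
            return Optional.of("health-checker script is not executable: " + script);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The path would normally come from yarn.nodemanager.health-checker.script.path.
        String configured = args.length > 0 ? args[0] : null;
        Optional<String> reason = whyNotRunnable(configured);
        if (reason.isPresent()) {
            System.out.println("Not starting node health monitor: " + reason.get());
        } else {
            System.out.println("Health-checker script is usable: " + configured);
        }
    }
}
```

      Returning the reason instead of a bare boolean lets the init path emit one specific message per failure mode, which is the information a user needs when a script silently never runs.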

            People

              Assignee: Unassigned
              Reporter: Tianyin Xu (tianyin)
              Votes: 0
              Watchers: 3
