Hadoop Common / HADOOP-3184

HOD should gracefully exclude "bad" nodes during ring formation

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: contrib/hod
    • Labels: None
    • Hadoop Flags: Incompatible change, Reviewed
    • Release Note:
      Modified HOD to handle master (NameNode or JobTracker) failures on bad nodes by trying to bring them up on another node in the ring. Introduced new property ringmaster.max-master-failures to specify the maximum number of times a master is allowed to fail.
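
      For reference, the new option would presumably be set in the hodrc
      configuration file under the ringmaster section; the section name and
      the value below are illustrative assumptions, not defaults from the
      patch:

        [ringmaster]
        # Abort allocation only after a master (NameNode or JobTracker)
        # has failed to come up this many times (illustrative value).
        max-master-failures = 5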

      Description

      HOD clusters sometimes fail to allocate because of a single "bad" node. During ring formation, the entire ring should not depend on every single node being good. Instead, it should exclude any ring member that does not successfully join the ring within a specified amount of time.

      This is a frequent HOD user issue (although not directly caused by HOD).

      Examples of bad nodes: missing Java, an incorrect version of HOD or Hadoop, a corrupt local name cache, slow network links, drives just beginning to fail, etc.

      Many of these conditions are known, and we can monitor for those separately, but this enhancement would shield users from unknown failure conditions that we haven't yet anticipated. This way, a user will get a cluster, instead of hanging indefinitely.
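
      As a concrete illustration of the idea (a minimal sketch in plain
      Python, not HOD's actual code; has_joined, join_timeout_secs, and
      min_ring_size are hypothetical names):

        import time

        def form_ring(candidates, join_timeout_secs, min_ring_size):
            """Sketch: admit each candidate that joins the ring within the
            timeout, and exclude the rest instead of aborting or hanging."""
            ring, excluded = [], []
            for node in candidates:
                deadline = time.time() + join_timeout_secs
                while time.time() < deadline:
                    if node.has_joined():    # hypothetical membership check
                        ring.append(node)
                        break
                    time.sleep(1)
                else:
                    excluded.append(node)    # "bad" node: skip it, move on
            if len(ring) < min_ring_size:
                raise RuntimeError("too few good nodes to form a ring")
            return ring, excluded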

    Attachments

    1. 3184.1.patch (6 kB) - Hemanth Yamijala
    2. 3184.2.patch (9 kB) - Hemanth Yamijala

        Activity

        Hemanth Yamijala added a comment -

        There are two possible types of issues users are facing:

        • HOD allocations fail (that is, the allocate command returns with a non-zero exit code) due to some of the conditions mentioned above, and retrying doesn't help unless the condition is rectified or the offending node is removed from the resource manager's list. This is particularly true with Torque, as it returns the same set of nodes in the same order, so the failure condition is mostly repeated.
          OR
        • HOD allocation hangs (without returning), again due to some of the conditions mentioned.

        First, can you please confirm which of these is the bigger issue?

        AFAIK, the second case is a Torque issue where we do not even get control to do anything. We could attempt to fix the first one, maybe even outside of HOD: we could offline a node if HOD allocations fail a couple of times on it, so that the offending node is removed in an automated manner and further attempts would work. A sketch of this idea follows.
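
        A minimal sketch of that automation (illustrative Python, not part
        of HOD; the threshold is an assumption, and pbsnodes -o is Torque's
        command for marking a node offline):

          import subprocess

          FAILURE_THRESHOLD = 2    # "a couple of times"; illustrative
          failure_counts = {}

          def record_allocation_failure(node_name):
              """Count allocation failures per node; once a node crosses the
              threshold, mark it offline in Torque so the resource manager
              stops handing it out."""
              failure_counts[node_name] = failure_counts.get(node_name, 0) + 1
              if failure_counts[node_name] >= FAILURE_THRESHOLD:
                  subprocess.run(["pbsnodes", "-o", node_name], check=True)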

        Hemanth Yamijala added a comment -

        Another approach is the following:

        Mostly, HOD allocations fail if the RingMaster does not come up or the JobTracker does not come up. If the JobTracker does not come up, the hodring on that node can report a failure, and another node that asks for the hadoop command can be asked to run the JobTracker. If the RingMaster does not come up, it's a bit more difficult, because the RingMaster is what controls the whole process. So, maybe in that case, the RingMaster should somehow bring up another instance of itself on a different machine and then die gracefully.

        I think the latter change would be quite involved. The former should be simpler.

        Hemanth Yamijala added a comment -

        Patch that addresses the issue of JobTracker or NameNode failure.

        Hemanth Yamijala added a comment -

        The attached patch solves the problem of cluster allocation failing due to a single bad JobTracker node in the entire cluster. It does not handle RingMaster failures, which are much tougher to solve at this point.

        Description of the solution:

        This patch builds on the solution of HADOOP-3464, where we introduced an RPC message (setHodRingErrors) that a HodRing calls when it fails to launch the Hadoop daemons on a node (e.g. because of a missing Hadoop installation). In HADOOP-3464, upon receiving this error, we checked whether it came while launching a master command (i.e. a NameNode or JobTracker command) and, if so, simply propagated it back to the client, which deallocated the cluster after displaying the error message from the hodring.

        In this patch, we keep track of how many times such master commands failed in a variable in the service object. We also introduce a config variable, ringmaster.max-master-failures. The RingMaster returns an error to the client only when the number of master command failures exceeds the configured value. If the number is not exceeded, the next HodRing that asks for a command to launch is given the master command again.

        The value of ringmaster.max-master-failures is bounded by a function of the maximum number of requested nodes, in case the requested nodes are fewer than the configured value. This ensures that cluster allocation can fail once there are no longer enough nodes available to bring up the masters.
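
        A rough sketch of that bookkeeping (illustrative Python, not the
        actual patch code; the class name and the exact bounding rule are
        assumptions):

          class MasterCommandTracker:
              """Track failed launches of a master command (NameNode or
              JobTracker) and decide whether to retry it on another node."""

              def __init__(self, configured_max_failures, requested_nodes):
                  # Bound the configured limit by the number of requested
                  # nodes, so allocation fails once no nodes remain to try.
                  self.max_failures = min(configured_max_failures,
                                          requested_nodes - 1)
                  self.failures = 0

              def on_master_failure(self):
                  """Called when a HodRing reports, via setHodRingErrors,
                  that the master command failed to launch on its node."""
                  self.failures += 1
                  if self.failures > self.max_failures:
                      # Too many bad nodes: propagate the error so the
                      # client deallocates the cluster.
                      return "fail-allocation"
                  # Otherwise, hand the master command to the next HodRing
                  # that asks for a command to launch.
                  return "reissue-master-command"

        For example, with ringmaster.max-master-failures set to 3 on a
        five-node request, up to three master launch failures would be
        tolerated before the allocation is abandoned.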

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12383452/3184.1.patch
        against trunk revision 663487.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2589/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2589/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2589/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2589/console

        This message is automatically generated.

        Mahadev konar added a comment -

        1) shouldRetryMasterLaunch is defined but is not used anywhere.
        2) You might want to wrap most of the statements, since they exceed 80 columns.
        3) Is there something we can report back to the user on the command line when some machines are faulty ("CRITICAL: contact admin")? It would be really helpful if we could do that.

        Hemanth Yamijala added a comment -

        Patch addressing some of Mahadev's comments.

        Hemanth Yamijala added a comment -

        Mahadev, thank you for the review.

        1) shouldRetryMasterLaunch is defined but is not used anywhere.

        This is removed now.

        2) You might want to wrap most of the statements, since they exceed 80 columns.

        Also done.

        3) Is there something we can report back to the user on the command line when some machines are faulty ("CRITICAL: contact admin")? It would be really helpful if we could do that.

        I presume you mean the case where some machines failed but the cluster eventually came up, right? Because in the other case, we do print a report on the command line telling users that the hodring on a given machine failed and for what reason. The services folks could then check the ringmaster log to see what other machines failed.

        If you meant the former (i.e. the case of eventual success), I agree that it would be a useful feature to have. However, it would take more work to build this functionality into the client. I propose we leave this as is for now and make the enhancement in a later release.

        Hemanth Yamijala added a comment -

        I presume you mean the case where some machines failed but the cluster eventually came up, right? Because in the other case, we do print a report on the command line telling users that the hodring on a given machine failed and for what reason. The services folks could then check the ringmaster log to see what other machines failed.

        In an offline conversation I had with Mahadev, I found that he had actually meant the latter, which is supported. So, all is good. The other feature would still be useful, though it can be done as an enhancement at a later stage.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12383534/3184.2.patch
        against trunk revision 663841.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2602/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2602/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2602/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2602/console

        This message is automatically generated.

        Hemanth Yamijala added a comment -

        The core test failure is unrelated to the patch.

        Devaraj Das added a comment -

        I just committed this. Thanks, Hemanth!


          People

          • Assignee: Hemanth Yamijala
          • Reporter: Marco Nicosia
          • Votes: 0
          • Watchers: 0
