Hadoop Common / HADOOP-3464

[HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: contrib/hod
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Implemented a mechanism to transfer HOD errors that occur on compute nodes to the submit node running the HOD client, so users have good feedback on why an allocation failed.

      Description

      This issue addresses error messages for failures on compute nodes, while HADOOP-3151 addresses error messages in the hod client.

      1. HADOOP-3464
        7 kB
        Vinod Kumar Vavilapalli
      2. HADOOP-3464.1
        15 kB
        Vinod Kumar Vavilapalli
      3. HADOOP-3464.4
        20 kB
        Hemanth Yamijala

        Activity

        Vinod Kumar Vavilapalli created issue -
        Vinod Kumar Vavilapalli made changes -
        Field Original Value New Value
        Assignee Hemanth Yamijala [ yhemanth ] Vinod Kumar Vavilapalli [ vinodkv ]
        Summary [HOD] HOD can improve error messages by reporting failure on compute nodes back to hod client [HOD] HOD can improve error messages by reporting failures on compute nodes back to hod client
        Component/s contrib/hod [ 12312090 ]
        Vinod Kumar Vavilapalli added a comment -

        Attaching first patch.

        • This solves the problem of reporting errors on the ringmaster side back to the hod client; HodRing problems are still NOT addressed.
        • Changes to hodlib/Common/setup.py are borrowed from the patch to HADOOP-2961. The two need to be merged when committing.
        • Also fixed another issue: earlier, validation errors in ringmaster were not logged because the log was initialized late. This is changed now so that these errors can also be reported back to the hod client.
        • Tested with 1) an invalid tar file, e.g. a junk file, 2) a non-existent path value for hodring.java-home, and 3) a non-existent path value for gridservice-hdfs.pkgs, and verified that errors are properly propagated back to the hod client.
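
        The late-log-initialization point can be illustrated with a minimal sketch. It is not taken from the patch: validate_config, start_ringmaster, the "java-home" dictionary key, and the logger name are all hypothetical, and the only idea shown is that the logger is set up before validation runs so validation errors are captured and can later be relayed.

```python
import logging


def validate_config(config):
    """Hypothetical validation step; returns a list of error strings."""
    errors = []
    if not config.get("java-home"):
        errors.append("hodring.java-home is not set")
    return errors


def start_ringmaster(config):
    """Initialize logging first, then validate, so no error is lost."""
    captured = []

    class ListHandler(logging.Handler):
        # Collects formatted messages so they can be sent to the client.
        def emit(self, record):
            captured.append(record.getMessage())

    log = logging.getLogger("ringmaster-sketch")
    handler = ListHandler()
    log.addHandler(handler)
    log.setLevel(logging.ERROR)
    try:
        # Validation happens only after the logger is ready.
        for err in validate_config(config):
            log.error(err)
    finally:
        log.removeHandler(handler)
    return captured
```

        If validation ran before the handler was attached, the captured list would stay empty and nothing could be reported back.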
        Vinod Kumar Vavilapalli made changes -
        Attachment HADOOP-3464 [ 12383076 ]
        Vinod Kumar Vavilapalli added a comment -

        Attaching a new patch

        • Fixes the problem of transferring hodring error messages to the hod client.
        • Fixes a minor problem in the earlier patch: ringmaster error messages are now printed when the return status is either 5 or 6.

        This patch needs some cleanup: removing extraneous debug statements, improving log statements, improving how data (error messages) is transferred from hodrings to ringmaster and then to the hod client, and perhaps even cleaning up the API.

        Changing the error messages themselves would be extra effort; this patch only addresses bringing them to the hod client. Improving the messages should in any case be part of new JIRA issues.
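
        The relay path described in this comment (hodring to ringmaster to hod client) could look roughly like the sketch below. Every name in it is hypothetical and it uses plain in-process calls rather than HOD's RPC layer. The statuses 5 and 6 are taken from the comment as opaque values; their meaning is not stated in this issue.

```python
# Hypothetical sketch: none of these names come from the HOD source.
RELAY_STATUSES = (5, 6)  # statuses named in the comment; meaning not given


class RingMaster:
    def __init__(self):
        self.own_errors = []        # errors raised by the ringmaster itself
        self.hodring_errors = {}    # host -> list of messages from hodrings

    def log_error(self, msg):
        self.own_errors.append(msg)

    def report_hodring_error(self, host, msg):
        # Called on behalf of a hodring that failed on a compute node.
        self.hodring_errors.setdefault(host, []).append(msg)

    def get_errors(self):
        # Ringmaster errors first, then hodring errors grouped by host.
        lines = list(self.own_errors)
        for host in sorted(self.hodring_errors):
            for msg in self.hodring_errors[host]:
                lines.append("%s: %s" % (host, msg))
        return lines


def client_report(status, ringmaster):
    """Return the messages the client should print for this exit status."""
    if status not in RELAY_STATUSES:
        return []
    return ringmaster.get_errors()
```

        The client side only has to check the status and print whatever the ringmaster collected, which keeps the message text itself out of scope, as the comment intends.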

        Vinod Kumar Vavilapalli made changes -
        Attachment HADOOP-3464.1 [ 12383103 ]
        Hemanth Yamijala added a comment -

        A few comments:

        • When ringmaster fails, we print the errors as an array of strings on a single line. For better readability, they should be printed one per line.
        • When ringmaster fails due to problems with the hadoop pkgs, the error message is not helpful. It says something like "int cannot be NoneType". This should be improved.
        • We use ringmaster.addMasterParams to report errors from the hodrings. This is confusing. We should define a new API, something like setHodRingError, and report errors back using that RPC.
        • The PID of the hodring process is part of the 'host' reporting the error. It appears this is important, as removing the PID caused the functionality to break. However, when we print these messages to the client, the name is printed as hostname_pid, which does not make much sense. So we can try to see whether the pid part can be avoided.
        • In a few places we construct an XML-RPC client object. If one is already constructed, can we reuse it?
        • When hodrings fail due to a config error, we don't report this back. This is because error reporting happens only if the getCommand method has been called by a hodring. In case of config errors, getCommand is not called, so these errors are not caught. The requirement is that we should be able to report master command failures, that is, when an internal HDFS daemon or a MapRed daemon fails. If there are n nodes in the ring, at least 2 hodrings (in the internal case) or 1 hodring should come up successfully for the masters. If the number of reported failures exceeds this, we can report a failure to the service registry client.
        • When a hadoop daemon fails, the message simply says "failed to launch hadoop command". Typically the daemon.err file has more useful information. If possible, this should be fetched and displayed to the client.

        I will try to submit a patch addressing these points.
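
        Several of the points above (a dedicated setHodRingError-style call, one error per line, hiding the pid suffix, and the failure threshold) can be sketched together. This is only an illustration under stated assumptions: the class, its method names, and display_name are hypothetical, the hostname_pid key format is inferred from the comment, and the threshold rule is one reading of "at least 2 (internal) or 1 hodring should come up". A method like set_hodring_error could be registered with an XML-RPC server, but no RPC is set up here.

```python
def display_name(hodring_id):
    """Drop a trailing _<pid> for display; keep the id if it has none."""
    host, sep, tail = hodring_id.rpartition("_")
    return host if sep and tail.isdigit() else hodring_id


class HodRingErrorRegistry:
    """Hypothetical ringmaster-side store for hodring failures."""

    def __init__(self, ring_size, internal_hdfs):
        self.ring_size = ring_size
        # Assumed reading: two masters (HDFS and MapRed) need a hodring
        # each when HDFS is internal, otherwise only one is needed.
        self.required_up = 2 if internal_hdfs else 1
        self.errors = {}

    def set_hodring_error(self, hodring_id, message):
        # The full hostname_pid key is kept internally, since the comment
        # notes that removing the pid broke the functionality.
        self.errors[hodring_id] = message
        return True  # Python's XML-RPC cannot marshal None by default

    def allocation_failed(self):
        # Fail once too few hodrings remain to host the masters.
        return len(self.errors) > self.ring_size - self.required_up

    def format_errors(self):
        # One error per line, with the pid hidden from the user.
        return "\n".join(
            "%s: %s" % (display_name(key), self.errors[key])
            for key in sorted(self.errors)
        )
```

        Because failures are recorded independently of getCommand, a registry like this could also receive config errors, which is the gap the sixth point describes.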

        Hemanth Yamijala added a comment -

        The attached patch addresses most of the points I raised in the previous comment. The two that are not handled are:

        • Not returning an error if all hodrings fail. This will be addressed in the fix for HADOOP-3184.
        • A new XML-RPC client is still created each time, as this is not much overhead.
        Hemanth Yamijala made changes -
        Attachment HADOOP-3464.4 [ 12383376 ]
        Hemanth Yamijala made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Release Note Implemented a mechanism to transfer HOD errors that occur on compute nodes to the submit node running the HOD client, so users have good feedback on why an allocation failed.
        Hadoop Flags [Reviewed]
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12383376/HADOOP-3464.4
        against trunk revision 663079.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2573/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2573/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2573/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2573/console

        This message is automatically generated.

        Mukund Madhugiri added a comment -

        I just committed this for Hemanth. Thanks Vinod for the patch!

        Mukund Madhugiri made changes -
        Resolution Fixed [ 1 ]
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Nigel Daley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Vinod Kumar Vavilapalli
            Reporter:
            Vinod Kumar Vavilapalli
          • Votes:
            0
            Watchers:
            0