Hadoop Common
  HADOOP-3217

[HOD] Be less aggressive when querying job status from resource manager.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.16.2
    • Fix Version/s: 0.17.3, 0.18.2
    • Component/s: contrib/hod
    • Labels: None

      Description

      After a job is submitted, HOD queries Torque periodically until it finds the job running or completed (possibly due to an error). The initial query rate is once every 0.5 seconds for 20 attempts, then once every 10 seconds. This is probably a tad too aggressive: under heavy cluster load, Torque sometimes returns odd errors (HADOOP-3216). It may be better to query at a more relaxed rate.
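
      For context, here is a minimal Python sketch of the polling schedule described above (HOD itself is written in Python). The function and parameter names are illustrative, not HOD's actual API; relaxing the rate means raising the two intervals and lowering the fast-probe count.

        import time

        def poll_job_state(qstat, job_id, fast_interval=0.5, fast_count=20,
                           slow_interval=10):
            """Poll the resource manager until the job runs or completes.

            Mirrors the schedule described above: roughly 20 probes at
            0.5-second intervals, then one probe every 10 seconds. Under
            heavy cluster load this rate can trigger spurious Torque
            errors (HADOOP-3216).
            """
            attempts = 0
            while True:
                state = qstat(job_id)    # hypothetical wrapper around qstat
                if state in ('R', 'C'):  # running or completed
                    return state
                attempts += 1
                time.sleep(fast_interval if attempts < fast_count
                           else slow_interval)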

      Attachments

      1. HADOOP-3217.patch.0.17 (18 kB) - Hemanth Yamijala
      2. HADOOP-3217.patch.0.17 (6 kB) - Hemanth Yamijala
      3. HADOOP-3217.patch.0.17 (6 kB) - Hemanth Yamijala
      4. HADOOP-3217 (7 kB) - Hemanth Yamijala

        Activity

        Arun C Murthy added a comment -

        I messed up the commit message, so the subversion commits are available here:
        trunk: http://svn.apache.org/viewcvs?view=rev&rev=705420
        branch-19: http://svn.apache.org/viewcvs?view=rev&rev=705422
        branch-18: http://svn.apache.org/viewcvs?view=rev&rev=705423
        branch-17: http://svn.apache.org/viewcvs?view=rev&rev=705426
        Hudson added a comment -

        Integrated in Hadoop-trunk #636 (see http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/636/): Decrease the rate at which HOD queries the resource manager for job status. Contributed by Hemanth Yamijala.
        Peeyush Bishnoi added a comment -

        Suman and I verified the patch with Hadoop 0.18, and it is working fine.
        Arun C Murthy added a comment -

        I just committed this. Thanks, Hemanth!
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12392183/HADOOP-3217
        against trunk revision 705215.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3476/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3476/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3476/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3476/console

        This message is automatically generated.

        Hemanth Yamijala added a comment -

        There was a minor issue: when an invalid tarball is specified with the tarball option, a ringmaster failure occurs and the return code is 5, while for 0.17.3 without the patch the return code is 6 in the same scenario.

        I discussed this with Vinod and Suman offline. There has always been a timing issue in this part of the code; the patch makes it more likely that 5 is returned in this case. Note that both codes indicate a ringmaster failure, so I think it is OK to ignore this minor anomaly for now.

        Vinod Kumar Vavilapalli added a comment -

        +1 for both the patches.
        Suman Sehgal added a comment -

        Verified the patch for 0.17.0. This is working fine for dynamic hdfs & mapred. There was a minor issue: when an invalid tarball is specified with the tarball option, a ringmaster failure occurs and the return code is 5, while for 0.17.3 without the patch the return code is 6 in the same scenario.
        Hemanth Yamijala added a comment -

        Patch for Hadoop 0.18. I've verified this also applies to 0.19 and trunk.
        Hemanth Yamijala added a comment -

        I found a bug while testing the previous patch: in a corner case, when we are retrying qsub operations and the user hits control-C, we don't break out of the loop immediately. This is fixed in the last patch I uploaded; the rest of the patch is the same.
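
        As an aside, here is a minimal Python sketch of the fix being described, assuming the retry loop sleeps between attempts (the names are hypothetical, not HOD's actual code). In Python versions of that era, KeyboardInterrupt derived from Exception, so a control-C had to be re-raised explicitly or the loop would keep retrying:

          import time

          def retry_qsub(run_qsub, args, retries=3, interval=10):
              """Retry a hypothetical qsub wrapper while honouring control-C."""
              for attempt in range(retries):
                  try:
                      return run_qsub(args)
                  except KeyboardInterrupt:
                      raise  # control-C: break out of the loop immediately
                  except Exception:
                      if attempt == retries - 1:
                          raise  # exhausted the allowed failures
                      time.sleep(interval)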

        Hemanth Yamijala added a comment -

        Good idea on the loop. I've uploaded a new patch incorporating the comments.
        Vinod Kumar Vavilapalli added a comment -

        Some comments:

        • You've accidentally removed a documentation comment line in hodlib/Hod/hadoop.py (+16).
        • Instead of the big while loop for retrying submitNodeSet, you could just try submitNodeSet repeatedly till max-failures and break appropriately (see the sketch below). This greatly reduces the size of the patch.
        • Should we change the issue heading/description? We are really dealing with two issues here: 1) increasing the retry interval for qstat, and 2) retrying on qsub/qstat failures.
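
        A minimal sketch of the loop shape suggested in the second point, assuming a submitNodeSet callable that raises on failure (everything apart from the submitNodeSet name is illustrative):

          import time

          def submit_with_retries(submitNodeSet, nodeSet, max_failures=3,
                                  retry_interval=10):
              """Try submitNodeSet up to max_failures times, stopping on success."""
              for attempt in range(max_failures):
                  try:
                      return submitNodeSet(nodeSet)
                  except Exception:
                      if attempt == max_failures - 1:
                          raise  # give up after max_failures attempts
                      time.sleep(retry_interval)  # wait before the next try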
        Hemanth Yamijala added a comment -

        Attached a patch for Hadoop 0.17. The following are the changes:

        • For retryable qsub failures, that is, failures other than a qsub options error or insufficient resources, we retry a configurable number of times (default 3), with a configurable wait interval between retries (default 10 seconds).
        • For all qstat errors, we retry a configurable number of times (default 3), with a configurable wait interval between retries (default 10 seconds).
        • For successful qstat queries, where we poll for the job state to become running or completed, the polling interval is made configurable (default 30 seconds).

        Patch for other branches in progress.
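
        For reference, the knobs and defaults listed above, collected in one place; the key names are hypothetical, not HOD's actual configuration options:

          # Hypothetical option names; the defaults match the list above.
          RETRY_DEFAULTS = {
              'qsub-retries': 3,           # retries for retryable qsub failures
              'qsub-retry-interval': 10,   # seconds between qsub retries
              'qstat-retries': 3,          # retries on qstat errors
              'qstat-retry-interval': 10,  # seconds between qstat error retries
              'qstat-poll-interval': 30,   # seconds between successful polls
          }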

        Hemanth Yamijala added a comment -

        This is causing some issues in a production deployment. Will work on a fix ASAP.

          People

          • Assignee: Hemanth Yamijala
          • Reporter: Hemanth Yamijala
          • Votes: 0
          • Watchers: 0
