Hadoop Common

[HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: contrib/hod
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      Modified HOD client to look for specific messages related to resource limit overruns and take appropriate actions - such as either failing to allocate the cluster, or issuing a warning to the user. A tool is provided, specific to Maui and Torque, that will set these specific messages.

      Description

      Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds or violates these limits may 1) get blocked or queued indefinitely, or 2) stay blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time, or forever. This means giving more, and more accurate, information to the submitter.

      (Internal) Use Case:
      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we can set up various types of limits in either the resource manager or a scheduler: a max node limit in Torque (per job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), and so on. But there is one problem with the current setup. For example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by a user over all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs are blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.
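
      For example, the kind of per-user Maui cap described above would be configured in maui.cfg along these lines (values are illustrative):

        # maui.cfg - cap each user at 10 processors aggregated over all
        # running jobs, and at 5 simultaneous jobs
        USERCFG[DEFAULT]  MAXPROC=10
        USERCFG[DEFAULT]  MAXJOB=5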

      Attachments

      1. HADOOP-3376
        7 kB
        Vinod Kumar Vavilapalli
      2. checklimits.sh
        0.7 kB
        Vinod Kumar Vavilapalli
      3. HADOOP-3376.1
        15 kB
        Vinod Kumar Vavilapalli
      4. HADOOP-3376.2
        16 kB
        Vinod Kumar Vavilapalli

        Activity

        Vinod Kumar Vavilapalli added a comment -

        Attaching a patch.

        • This implements the changes required in HOD to deal better with clusters exceeding resource manager or scheduler limits.
        • With this patch, every time HOD detects that the cluster is still queued, it calls the isJobFeasible method of the resource manager interface (src/contrib/hod/hodlib/Hod/nodePool.py) to check whether the job can run at all.
        • The Torque implementation of isJobFeasible (src/contrib/hod/hodlib/NodePools/torque.py) uses the comment field in qstat output. When this comment field becomes equal to hodlib.Common.util.TORQUE_USER_LIMITS_COMMENT_FIELD, HOD deallocates the cluster with the error message "Request execeeded maximum user limits. Cluster will not be allocated." As it stands, this is still only part of the solution - the Torque comment field has to be set to the above string either by a scheduler or by an external tool. (A sketch of this check appears below.)
        • Also introduced a HOD config parameter, check-job-feasibility, which enables the above check. It defaults to false and specifies whether or not to check job feasibility against resource manager and/or scheduler limits.
        • This patch also replaces a few occurrences of the string 'job' with 'cluster'.
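
        For illustration, here is a minimal sketch of what such a comment-field check could look like - the names and qstat parsing below are assumptions for illustration, not the actual code in torque.py:

            # Hypothetical sketch of the qstat-comment-based feasibility
            # check described above; the real implementation lives in
            # src/contrib/hod/hodlib/NodePools/torque.py.
            import subprocess

            TORQUE_USER_LIMITS_COMMENT_FIELD = "User-limits exceeded."

            def is_job_feasible(job_id):
                """Return False once the scheduler (or an external tool such
                as checklimits.sh) has stamped the job's comment field."""
                # 'qstat -f <jobid>' prints full job attributes, one per line.
                out = subprocess.run(["qstat", "-f", str(job_id)],
                                     capture_output=True, text=True).stdout
                for line in out.splitlines():
                    line = line.strip()
                    if line.startswith("comment ="):
                        comment = line.split("=", 1)[1].strip()
                        if comment.startswith(TORQUE_USER_LIMITS_COMMENT_FIELD):
                            return False  # over limits; caller deallocates
                return True  # no limit comment found; job may still run
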
        Vinod Kumar Vavilapalli added a comment -

        Attaching checklimits.sh. This is the utility that will update the Torque comment field. Uploading it for review.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12381947/checklimits.sh
        against trunk revision 655674.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2457/console

        This message is automatically generated.

        Vinod Kumar Vavilapalli added a comment -

        Cancelling the patch to incorporate Hemanth's comments. The following things need to be done:

        • Each time the cluster is checked for feasibility, two qstats are run - reduce this to get the required information in only one trip to the resource manager.
        • There are two ways in which user limits can be crossed: requesting resources beyond the max limit, and cumulative usage crossing the max limit. These two scenarios should be dealt with separately - in the first case the cluster should be deallocated, while in the second the cluster should not be deallocated but the user should be appropriately informed.
        • Do away with the configuration variable check-job-feasibility. Instead have the variable job-feasibility-comment, which will 1) indicate whether the user-limits functionality is to be used and 2) give the comment field that will be set by checklimits.sh - currently checkjob (used by checklimits.sh) prints "job [0-9]* violates active HARD MAXPROC limit of [0-9]* for user [a-z]* (R: [0-9]*, U: [0-9]*)"
        • This patch changes the behavior of getJobState. It should only return True or False in all code paths.
        • Modify the error message TORQUE_USER_LIMITS_EXCEEDED_MSG so that it also prints the max limits, so that the user can adjust his request.
        • checklimits.sh: 1) Submit this too within the patch, as part of src/contrib/hod/support. 2) checklimits.sh should do only one iteration over all incomplete jobs, modifying the comment field according to whether the job crosses the user limits; it should be left to an outside mechanism (like cron) to run checklimits.sh repeatedly at regular intervals. (See the sketch after this list.)
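
        A rough Python rendering of the single-pass sweep described in the last item - the actual tool is the shell script checklimits.sh, and while qselect, checkjob, and qalter are standard Torque/Maui commands, the exact invocations and parsing here are assumptions:

            # Illustrative single pass over incomplete jobs: where Maui's
            # checkjob reports a user-limit violation, stamp the Torque
            # comment field so the HOD client can detect it.
            import re
            import subprocess

            LIMIT_RE = re.compile(r"violates active HARD MAXPROC limit")

            def sweep_incomplete_jobs():
                # 'qselect -s Q' lists queued job ids, one per line.
                job_ids = subprocess.run(["qselect", "-s", "Q"],
                                         capture_output=True,
                                         text=True).stdout.split()
                for job in job_ids:
                    # 'checkjob <id>' (Maui) reports why a job cannot start.
                    diag = subprocess.run(["checkjob", job],
                                          capture_output=True,
                                          text=True).stdout
                    if LIMIT_RE.search(diag):
                        # Assumed way of setting the comment attribute.
                        subprocess.run(["qalter", "-W",
                                        "comment=User-limits exceeded.", job])

            if __name__ == "__main__":
                sweep_incomplete_jobs()  # one pass; cron re-runs it periodically
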
        Vinod Kumar Vavilapalli added a comment -

        Made the suggested changes. Also updated the documentation.

        • When the request asks for resources > max limits, it prints "Request exceeded maximum user limits. CurentUsage:%s, Requested:%s, MaxLimit:%s" at critical log level and deletes the cluster.
        • When the request is within limits but cumulative usage crosses the limits, it prints "Request exceeded maximum user limits. CurentUsage:%s, Requested:%s, MaxLimit:%s. This cluster will remain queued till old clusters free resources" at info level and the cluster stays in the queued state.
        • Replaced the check-job-feasibility config parameter with job-feasibility-attr: it specifies whether to check job feasibility - resource manager and/or scheduler limits - and also gives the attribute value. It defaults to TORQUE_USER_LIMITS_COMMENT_FIELD, which is "User-limits exceeded. Requested([0-9]*) Used([0-9]*) MaxLimit([0-9]*)". (See the hodrc example below.)
        • Made the necessary changes in checklimits.sh, which now lives in the hod/support directory.
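
        For reference, the corresponding hodrc entry would look something like the following - the attribute value is reconstructed from the comments in this issue, so treat the exact escaping as an assumption:

            [hod]
            job-feasibility-attr = User-limits exceeded. Requested([0-9]*) Used([0-9]*) MaxLimit([0-9]*)
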
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12382301/HADOOP-3376.1
        against trunk revision 656939.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2498/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2498/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2498/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2498/console

        This message is automatically generated.

        Hemanth Yamijala added a comment -

        Some comments:

        • I think job-feasibility-attr should be optional. Code that depends on this attribute may need to check for it, or be changed to handle the case where it is not defined:
          In torque.py's isJobFeasible, if job-feasibility-attr is not defined we would get an exception, and the info message printed is not very descriptive - I think it would just print 'job-feasibility-attr' rather than information about what the error is. (A sketch of the defensive lookup being suggested appears after this comment.)
          __check_job_state: doesn't handle the case where job-feasibility-attr is not defined.
        • The messages now read as follows:
          (In the case of requested resources > max resources):
          Request exceeded maximum user limits. CurentUsage:%s, Requested:%s, MaxLimit:%s
          (In the other case):
          Request exceeded maximum user limits. CurentUsage:3, Requested:3, MaxLimit:3 This cluster will remain queued till old clusters free resources.
          The message still does not clarify which resources are being exceeded.

        I suggest the following:
        Request number of nodes exceeded maximum user limits. Current Usage:%s, Requested:%s, Maximum Limit:%s. This cluster cannot be allocated now.

        and

        Request number of nodes exceeded maximum user limits. Current Usage:%s, Requested:%s, Maximum Limit:%s. This cluster allocation will succeed only after other clusters are deallocated.
        (Note: I also corrected some typos in the message)

        • The executable bit is not being turned on for support/checklimits.sh. This is mostly due to a bug in the ant script: for code under the contrib projects, only files under the bin/ folder are made executable when packaged. As this is not a bug in HOD, I think we should leave it as is, but update the usage documentation to tell users to make the script executable.
        • In checklimits.sh, the sleep at the end is not required.
        • In the case when current usage + requested usage exceeds the limits, the critical message is printed every 10 seconds. It should be printed only once.

        Other than these, I tested checklimits and hod for both scenarios and it works fine.

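
        A minimal sketch of the optional-attribute handling suggested in the first comment above - cfg and the lookup names are hypothetical stand-ins for HOD's config machinery:

            # Hypothetical sketch: treat job-feasibility-attr as optional and
            # disable feasibility checking when it is absent, rather than
            # raising an exception with an unhelpful message.
            def get_feasibility_attr(cfg):
                attr = cfg.get('hod', {}).get('job-feasibility-attr')
                if not attr:
                    return None  # attribute absent: skip the feasibility check
                return attr
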
        Vinod Kumar Vavilapalli added a comment -

        Incorporated the above changes. Reattaching.

        This change cannot have really useful test cases: it needs system testing and is too tightly integrated with Torque/Maui.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12382640/HADOOP-3376.2
        against trunk revision 661918.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2529/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2529/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2529/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2529/console

        This message is automatically generated.

        Karam Singh added a comment -

        To check this issue, after setting a MAXPROC limit (say 10) in maui.cfg, I did the following:

        Added the line job-feasibility-attr = User-limits exceeded. Requested([0-9]*) Used([0-9]*) MaxLimit([0-9]*) under the [hod] section in hodrc. HOD requires the exact string in hodrc.

        1. Tried hod allocate with a number of nodes greater than the MAXPROC limit (say 11). Verified that hod exits with exit code 4 and a proper error message: CRITICAL/50 hadoop:216 - Requested number of nodes exceeded maximum user limits. Current Usage:0, Requested:11, Maximum Limit:10 This cluster cannot be allocated now.

        2. Tried a combination: first used hod allocate with 5 nodes, then hod allocate again with 6 nodes. Verified that the job got queued with the message:
        CRITICAL/50 hadoop:216 - Requested number of nodes exceeded maximum user limits. Current Usage:5, Requested:6, Maximum Limit:10 This cluster allocation will succeed only after other clusters are deallocated.
        Also checked that after the first cluster was deallocated, the second cluster got allocated.

        Repeated with more hod allocate combinations.

        Devaraj Das added a comment -

        I just committed this. Thanks, Vinod!


          People

          • Assignee:
            Vinod Kumar Vavilapalli
          • Reporter:
            Vinod Kumar Vavilapalli
          • Votes:
            0
          • Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:
