Hadoop Map/Reduce
MAPREDUCE-1221

Kill tasks on a node if the free physical memory on that machine falls below a configured threshold

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.21.0
    • Component/s: tasktracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The TaskTracker currently supports killing tasks if the virtual memory of a task exceeds a set of configured thresholds. I would like to extend this feature to enable killing tasks if the physical memory used by that task exceeds a certain threshold.

      On a certain operating system (guess?), if user-space processes start using lots of memory, the machine hangs and dies quickly. This means that we would like to prevent map-reduce jobs from triggering this condition. From my understanding, killing based on virtual-memory limits (HADOOP-5883) was designed to address this problem. This works well when most map-reduce jobs are Java jobs with well-defined -Xmx parameters that specify the max virtual memory for each task. On the other hand, if each task forks off mappers/reducers written in other languages (python/php, etc.), the total virtual memory usage of the process subtree varies greatly. In these cases, it is better to kill tasks using physical-memory limits.
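      The kill policy discussed in this issue (when the node-wide physical-memory limit is exceeded, kill the task using the most memory until the tracker is back under the limit) can be illustrated in plain Java. This is a hypothetical sketch, not the actual TaskMemoryManagerThread code; all class and method names here are invented:

```java
import java.util.*;

// Hypothetical sketch of physical-memory-based task killing on a tracker.
public class PhysicalMemoryCheck {
    // A task attempt with its current RSS usage in bytes (simplified model).
    static class Task {
        final String id;
        final long rssBytes;
        Task(String id, long rssBytes) { this.id = id; this.rssBytes = rssBytes; }
    }

    // Sum RSS across all running tasks on the tracker.
    static long totalRss(List<Task> tasks) {
        long sum = 0;
        for (Task t : tasks) sum += t.rssBytes;
        return sum;
    }

    // While the node-wide limit is exceeded, pick the task using the most
    // physical memory as the next victim (the "kill the task with the
    // highest memory usage" policy described in the comments).
    static List<String> tasksToKill(List<Task> tasks, long nodeLimitBytes) {
        List<Task> running = new ArrayList<>(tasks);
        List<String> victims = new ArrayList<>();
        while (totalRss(running) > nodeLimitBytes && !running.isEmpty()) {
            Task worst = Collections.max(running,
                Comparator.comparingLong((Task t) -> t.rssBytes));
            victims.add(worst.id);
            running.remove(worst);
        }
        return victims;
    }

    public static void main(String[] args) {
        List<Task> tasks = Arrays.asList(
            new Task("attempt_1", 4L << 30),   // 4 GB
            new Task("attempt_2", 8L << 30),   // 8 GB
            new Task("attempt_3", 2L << 30));  // 2 GB
        // Node allows 10 GB for tasks; total is 14 GB, so kill the largest
        // task, which brings usage down to 6 GB.
        System.out.println(tasksToKill(tasks, 10L << 30)); // prints [attempt_2]
    }
}
```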

      1. MAPREDUCE-1221-v6.txt
        28 kB
        Scott Chen
      2. MAPREDUCE-1221-v5.1.txt
        27 kB
        Scott Chen
      3. MAPREDUCE-1221-v4.patch
        26 kB
        Scott Chen
      4. MAPREDUCE-1221-v3.patch
        16 kB
        Scott Chen
      5. MAPREDUCE-1221-v2.patch
        16 kB
        Scott Chen
      6. MAPREDUCE-1221-v1.patch
        15 kB
        Scott Chen
      7. ASF.LICENSE.NOT.GRANTED--MAPREDUCE-1221-v5.txt
        27 kB
        Scott Chen

        Activity

        Scott Chen added a comment -

        Thanks for the help, Arun and Amareshwari

        Arun C Murthy added a comment -

        I just committed this. Thanks Scott!

        I've committed this to trunk; the patch doesn't apply to branch-0.21.

        Amareshwari Sriramadasu added a comment -

        Changes look fine to me.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12442340/MAPREDUCE-1221-v6.txt
        against trunk revision 936121.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/121/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/121/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/121/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/121/console

        This message is automatically generated.

        Scott Chen added a comment -

        I made the following change based on Amareshwari's suggestion. Thanks.

           void addToMemoryManager(TaskAttemptID attemptId, boolean isMap, 
                                   JobConf conf) {
        +    if (!isTaskMemoryManagerEnabled()) {
        +      return; // Skip this if TaskMemoryManager is not enabled.
        +    }
        
        Scott Chen added a comment -

        Thanks Amareshwari,
        That totally makes sense! I will upload the patch soon.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12442255/MAPREDUCE-1221-v5.1.txt
        against trunk revision 935427.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/119/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/119/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/119/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/119/console

        This message is automatically generated.

        Amareshwari Sriramadasu added a comment -

        Scott,

        In TaskTracker.addToMemoryManager() method, shall we move the calculation of physicalMemoryLimit and virtualMemoryLimit into the check isTaskMemoryManagerEnabled()?

        My comment there was to move the calculation into the 'if check' in the same method.
        Your patch has :

        +    // Obtain physical memory limits from the job configuration
        +    long physicalMemoryLimit =
        +      conf.getLong(isMap ? JobContext.MAP_MEMORY_PHYSICAL_MB :
        +                   JobContext.REDUCE_MEMORY_PHYSICAL_MB,
        +                   JobConf.DISABLED_MEMORY_LIMIT);
        +    if (physicalMemoryLimit > 0) {
        +      physicalMemoryLimit *= 1024L * 1024L;
        +    }
        +
        +    // Obtain virtual memory limits from the job configuration
        +    long virtualMemoryLimit = isMap ?
        +      conf.getMemoryForMapTask() * 1024 * 1024 :
        +      conf.getMemoryForReduceTask() * 1024 * 1024;
        +
             if (isTaskMemoryManagerEnabled()) {
        -      taskMemoryManager.addTask(attemptId, isMap ? conf
        -          .getMemoryForMapTask() * 1024 * 1024L : conf
        -          .getMemoryForReduceTask() * 1024 * 1024L);
        +      taskMemoryManager.addTask(attemptId, virtualMemoryLimit,
        +                                physicalMemoryLimit);
        

        I was expecting it to be :

             if (isTaskMemoryManagerEnabled()) {
        +    // Obtain physical memory limits from the job configuration
        +    long physicalMemoryLimit =
        +      conf.getLong(isMap ? JobContext.MAP_MEMORY_PHYSICAL_MB :
        +                   JobContext.REDUCE_MEMORY_PHYSICAL_MB,
        +                   JobConf.DISABLED_MEMORY_LIMIT);
        +    if (physicalMemoryLimit > 0) {
        +      physicalMemoryLimit *= 1024L * 1024L;
        +    }
        +
        +    // Obtain virtual memory limits from the job configuration
        +    long virtualMemoryLimit = isMap ?
        +      conf.getMemoryForMapTask() * 1024 * 1024 :
        +      conf.getMemoryForReduceTask() * 1024 * 1024;
        +
        -      taskMemoryManager.addTask(attemptId, isMap ? conf
        -          .getMemoryForMapTask() * 1024 * 1024L : conf
        -          .getMemoryForReduceTask() * 1024 * 1024L);
        +      taskMemoryManager.addTask(attemptId, virtualMemoryLimit,
        +                                physicalMemoryLimit);
        
        

        Does that make sense now?

        Scott Chen added a comment -

        I only made the following change to the comments, to address the comment made by Amareshwari.

         --- src/java/org/apache/hadoop/mapred/TaskMemoryManagerThread.java    (revision 934199)
        558,570c558
         @@ -332,8 +411,9 @@
              List<TaskAttemptID> tasksToExclude = new ArrayList<TaskAttemptID>();
              // Find tasks to kill so as to get memory usage under limits.
              while (memoryStillInUsage > maxMemoryAllowedForAllTasks) {
         -      // Exclude tasks that are already marked for
         -      // killing.
         +      // Exclude tasks that are already marked for killing.
         +      // Note that we do not need to call isKillable() here because the logic
         +      // is contained in taskTracker.findTaskToKill()
                TaskInProgress task = taskTracker.findTaskToKill(tasksToExclude);
                if (task == null) {
                 break; // couldn't find any more tasks to kill.
        
        Scott Chen added a comment -

        Hey Amareshwari, Thanks for the comments.

        In TaskTracker.addToMemoryManager() method, shall we move the calculation of physicalMemoryLimit and virtualMemoryLimit into the check isTaskMemoryManagerEnabled()?

        This would be more efficient. But unfortunately these limits need to be set independently for each job, so we have to keep them in addToMemoryManager. The method isTaskMemoryManagerEnabled() is called only once, to determine whether to start the manager thread at the beginning.

        Shouldn't we call isKillable() in TaskMemoryManagerThread.killTasksWithLeastProgress() also ?

        killTasksWithLeastProgress() calls taskTracker.findTaskToKill(), which implements logic similar to isKillable(). I agree this part of the code is not very clear; I will add a one-line comment explaining why it does not need to call isKillable().

        Amareshwari Sriramadasu added a comment -

        Looked at the attached patch. Some minor comments:

        • In TaskTracker.addToMemoryManager() method, shall we move the calculation of physicalMemoryLimit and virtualMemoryLimit into the check isTaskMemoryManagerEnabled()?
        • Shouldn't we call isKillable() in TaskMemoryManagerThread.killTasksWithLeastProgress() also ?
        dhruba borthakur added a comment -

        Hi Arun/Amareshwari, would you please review this one?

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12441860/MAPREDUCE-1221-v5.txt
        against trunk revision 933441.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/113/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/113/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/113/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/113/console

        This message is automatically generated.

        Scott Chen added a comment -

        Arun, Thanks for the review

        We shouldn't be using the TT.fConf to read config values each time, please save it and re-use:

        Good catch. I made the change: a new member, reservedPhysicalMemoryOnTT, now saves this value.

        I don't understand why 'limitPhysical > 0' is necessary. If 'doCheckPhysicalMemory' returns true, shouldn't limitPhysical always be positive? Please help me understand, thanks.

        You are right about this one; that check is redundant. If doCheckPhysicalMemory returns true, it means the user has configured a physical memory limit (including the per-task limit), and this value should not be negative.
        The logic is now the same as for the virtual memory limit. I have removed the redundant check.

        Arun C Murthy added a comment -

        Scott, I had a question:

        Comparing the following snippets in src/java/org/apache/hadoop/mapred/TaskMemoryManagerThread.java

        +          if (doCheckVirtualMemory() &&
        +              isProcessTreeOverLimit(tid.toString(), currentMemUsage,
                                               curMemUsageOfAgedProcesses, limit)) {
        
        +          } else if (doCheckPhysicalMemory() && limitPhysical > 0 &&
        +              isProcessTreeOverLimit(tid.toString(), currentRssMemUsage,
        +                                curRssMemUsageOfAgedProcesses, limitPhysical)) {
        

        I don't understand why 'limitPhysical > 0' is necessary. If 'doCheckPhysicalMemory' returns true, shouldn't limitPhysical always be positive? Please help me understand, thanks.

        Arun C Murthy added a comment -

        I just started looking at this, some minor comments:

        We shouldn't be using the TT.fConf to read config values each time, please save it and re-use:

        Index: src/java/org/apache/hadoop/mapred/TaskTracker.java
        ===================================================================
        --- src/java/org/apache/hadoop/mapred/TaskTracker.java	(revision 921667)
        +++ src/java/org/apache/hadoop/mapred/TaskTracker.java	(working copy)
           
        +    if (fConf.get(TTConfig.TT_RESERVED_PHYSCIALMEMORY_MB) == null
        +        && totalMemoryAllottedForTasks == JobConf.DISABLED_MEMORY_LIMIT) {
        
        Index: src/java/org/apache/hadoop/mapred/TaskMemoryManagerThread.java
        ===================================================================
        --- src/java/org/apache/hadoop/mapred/TaskMemoryManagerThread.java	(revision 921667)
        +++ src/java/org/apache/hadoop/mapred/TaskMemoryManagerThread.java	(working copy)
        
        +    long reservedRssMemory = taskTracker.getJobConf().
        +            getLong(TTConfig.TT_RESERVED_PHYSCIALMEMORY_MB,
        +                    JobConf.DISABLED_MEMORY_LIMIT);
        
        dhruba borthakur added a comment -

        Thanks Scott for updating the patch with the three configuration variables. Code looks good to me. I would like to get this into the 0.21 release, so if somebody wants to review it again, please do so now. Thanks.

        dhruba borthakur added a comment -

        Hi Amareshwari, you bring up a good point that a failed/killed task might get re-executed on the same machine. But this actually depends on the scheduler one uses. I agree that the schedulers should be intelligent enough not to schedule a task on the same machine on which it was scheduled earlier (if there are other equivalent resources available). This issue does not seem to be directly related to this JIRA, does it?

        Amareshwari Sriramadasu added a comment -

        First up: Do you agree that we need to fail the task and not just kill it, so that the job fails fast?

        One more thing to add here is "If a task is failed, the JobTracker will try to schedule the task on a different machine. If the task is killed, there are more chances that the task will be re-executed on the same tracker".

        If the total limit is violated, the TaskTracker will kill the task with the highest memory usage to relieve the memory pressure.

        So, if we kill the task with the highest memory usage, there are more chances that it will be re-executed on the same machine. Won't this trouble the tracker again?

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12438470/MAPREDUCE-1221-v4.patch
        against trunk revision 923907.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/528/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/528/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/528/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/528/console

        This message is automatically generated.

        Scott Chen added a comment -

        The definitions of the parameters in the previous comment are not very clear.
        Here are the definitions.

        mapreduce.tasktracker.reserved.physicalmemory.mb
        Specifies how much memory on the TT will not be used by tasks.
        For example, if we configure this to be 2G and the machine has 16G of memory, the tasks are only allowed to use 14G. If the limit is exceeded, the TT kills the task that consumes the most memory.
        We use reserved memory instead of total memory because the cluster may be nonuniform.

        mapreduce.map.memory.physical.mb
        Memory limit for one mapper. If the mapper uses more memory than this, it will fail.

        mapreduce.reduce.memory.physical.mb
        Memory limit for one reducer. If the reducer uses more memory than this, it will fail.
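
        For reference, these properties would go in mapred-site.xml. The values below are purely illustrative (they mirror the 16G/2G example above), not recommendations:

        ```xml
        <!-- Illustrative values only: reserve 2 GB for the OS and daemons on a
             16 GB box, and cap each map/reduce task tree's RSS at 1 GB / 2 GB. -->
        <property>
          <name>mapreduce.tasktracker.reserved.physicalmemory.mb</name>
          <value>2048</value>
        </property>
        <property>
          <name>mapreduce.map.memory.physical.mb</name>
          <value>1024</value>
        </property>
        <property>
          <name>mapreduce.reduce.memory.physical.mb</name>
          <value>2048</value>
        </property>
        ```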

        Scott Chen added a comment -

        The patch has been changed over time. Here is a quick overall summary of what this patch does.

        The purpose of this patch is to allow TaskTracker to kill/fail tasks based on the RSS memory status.

        We can set the following three different parameters
        mapreduce.tasktracker.reserved.physicalmemory.mb
        mapreduce.map.memory.physical.mb
        mapreduce.reduce.memory.physical.mb

        They determine the total allowed RSS memory for tasks and the limit for individual tasks.
        If the total limit is violated, the TaskTracker kills the task with the highest memory usage to relieve the memory pressure.
        If the per-task limit is violated, the TaskTracker fails the task that violates the limit.
        If the parameters are not set, there is no limit.

        The implementation mostly follows the virtual memory limiting logic that we already have.
        ProcfsBasedProcessTree also allows us to obtain the physical memory usage of tasks.

        The tests added are the following:
        TestTaskTrackerMemoryManager.testTasksCumulativelyExceedingTTPhysicalLimits()
        TestTaskTrackerMemoryManager.testTasksBeyondPhysicalLimits()
        They verify the behavior when the total memory limit and the per-task limit are triggered.
        There is also a slight modification in TestTaskTrackerMemoryManager.testTasksWithinLimits() to make sure tasks within the physical memory limit run correctly.
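
        As a rough illustration of the two checks described above — this is a sketch only, with invented class and method names, not the patch's actual code:

        ```java
        import java.util.Map;

        /**
         * Hypothetical sketch of the limiting policy described above. A task over
         * its own per-task RSS limit is FAILED; when the node-wide RSS total
         * exceeds (total - reserved), the largest task is KILLED (it may be
         * rescheduled elsewhere).
         */
        public class PhysicalMemoryPolicy {
            /** Per-task check: fail the task if it exceeds its configured limit. */
            public static boolean shouldFail(long taskRssMb, long perTaskLimitMb) {
                return perTaskLimitMb > 0 && taskRssMb > perTaskLimitMb;
            }

            /**
             * Node-wide check: if the summed RSS exceeds the allowed budget,
             * return the id of the task using the most memory; otherwise null.
             */
            public static String pickVictim(Map<String, Long> rssMbByTask,
                                            long totalMb, long reservedMb) {
                long sum = 0;
                long max = Long.MIN_VALUE;
                String victim = null;
                for (Map.Entry<String, Long> e : rssMbByTask.entrySet()) {
                    sum += e.getValue();
                    if (e.getValue() > max) {
                        max = e.getValue();
                        victim = e.getKey();
                    }
                }
                return (sum > totalMb - reservedMb) ? victim : null;
            }
        }
        ```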

        Scott Chen added a comment -

        I have run TestJobClient on two different dev boxes. Both worked. I will submit this to Hudson again.

        Scott Chen added a comment -

        I am checking the failed tasks.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12438470/MAPREDUCE-1221-v4.patch
        against trunk revision 923907.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/527/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/527/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/527/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/527/console

        This message is automatically generated.

        Scott Chen added a comment -

        The following two new parameters are for the per-task physical memory limit.
        mapreduce.map.memory.physical.mb
        mapreduce.reduce.memory.physical.mb

        Scott Chen added a comment -

        Added the per-task physical memory limit. If the RSS memory usage of the task tree is higher than the configured value, the task will fail.

        Scott Chen added a comment -

        @Arun: Sorry for the very late reply. Dhruba and I have been trying to call you but it seems you are busy as well.

        I think I got your point. The problem is that the bad job will never fail; its task gets killed and rescheduled again and again, which keeps hurting the cluster. So we should add a per-task RSS limit in this patch so that we can fail the bad job. This is just like what we currently do in the trunk for virtual memory. But here we offer RSS memory limiting as an option (a trade-off between memory utilization and stability).

        I will make the change and resubmit the patch soon. Thanks again for the help.

        Tomer Shiran added a comment -

        @Dhruba,

        How do you ensure that the physical memory limit is only hit when there is a misbehaving task (which uses unbounded memory), and not during normal operation (without having a per-task limit)? Are you making assumptions about the general expected memory usage of a task, and setting the # of slots accordingly?

        Arun C Murthy added a comment -

        Dhruba, thanks for explaining your use case.

        First up: Do you agree that we need to fail the task and not just kill it, so that the job fails fast?

        dhruba borthakur added a comment -

        @Luke: Thanks for the suggestion. I am pretty sure that we would like to move to a Container model (where each task runs inside its own Container e.g. lxc-execute) in a slightly longer term.

        Luke Lu added a comment -

        @dhruba: Have you guys tried lxc-execute? (It needs a 2.6.27+ kernel.) It gives you fairly complete control of the hardware resources for a group of processes (an arbitrary container of processes and their children) without having to write any code.

        dhruba borthakur added a comment -

        Please allow me to present my use case.

        I have users submitting their own jobs to the cluster. These jobs are neither audited nor vetted by any authority before being deployed on the cluster. The mappers for most of these jobs are written in python or php. In these languages, it is easy for code writers to mistakenly use excessive amounts of memory (via a python dictionary or some such thing). We have seen about 1 such case per month in our cluster. The thing to note is that in 100% of these jobs, the user had a coding error that erroneously kept on inserting elements into his/her dictionary. These are not "valid" jobs, and are usually killed by the user when he/she realises his/her coding mistake.

        The problem we are encountering is that when such a job is let loose in our cluster, many tasks start eating lots of memory, causing excessive swapping and finally making the OS on those nodes hang. This JIRA attempts to prevent that scenario. Once properly configured, this JIRA will make it really hard for a user job to bring down nodes in the Hadoop cluster. This JIRA increases the stability and uptime of our cluster to a great extent. I would request all concerned authorities to review this JIRA from this perspective.

        Arun C Murthy added a comment -

        Allen, thanks for your inputs.

        To be clear, an admin can set a per-task limit, on the TaskTracker for the vmem. This works independent of the scheduler.

        OTOH, since this is per-task, users who need more than the limit have to use multiple-slots for their tasks (jobs) to survive. This is where the scheduler comes in, afaik only the CapacityScheduler supports this.

        Also, the 'per-task' is a misnomer, this limit is actually on the task and all its children - we do this for vmem by walking /proc.
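
        For context on the /proc walking mentioned above, here is an illustrative sketch (not the actual ProcfsBasedProcessTree code) of reading one process's resident set size from its /proc/&lt;pid&gt;/stat line; summing this over the task's whole process subtree gives the task's physical memory:

        ```java
        /**
         * Illustration only: extract RSS from a /proc/<pid>/stat line.
         * Field 24 (1-based) of that line is the resident set size in pages.
         */
        public class ProcStatRss {
            public static long rssPages(String statLine) {
                // The comm field (field 2) may contain spaces, so skip past its ')'.
                int close = statLine.lastIndexOf(')');
                String[] rest = statLine.substring(close + 2).split(" ");
                // rest[0] is field 3 (state), so field 24 is rest[21].
                return Long.parseLong(rest[21]);
            }

            /** Convert pages to bytes; the page size is typically 4096 on Linux x86. */
            public static long rssBytes(String statLine, long pageSize) {
                return rssPages(statLine) * pageSize;
            }
        }
        ```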

        Allen Wittenauer added a comment -

        Bah. I wish I had edit rights to fix the strikeouts.

        One thing I didn't say that I meant to:

        I don't think a phys mem limit is necessarily harmful, but I don't think it is particularly useful. I'm still left with the thought that the vm limit wasn't implemented properly, all in a futile attempt to protect innocent users.

        Allen Wittenauer added a comment -

        I've read through this jira a few times, and looked at some of the previously mentioned jiras around memory limits. I think I see where the issue actually lies.

        Arun started to fill in the historical background, but I think he may have missed a significant point. Let's retell the story, so that we can get to the crux of the ops requirement here....

        Under HOD w/torque, we configured torque such that it would limit the virtual memory size to total vm - 4GB. [This left plenty of RAM for Linux, our monitoring software, etc. So on a machine with 4x4GB swap partitions and 16GB RAM, the vm limit would be set to 28GB.] Now the thing about HOD is that it allocates the entire node to an entire job.... which means there is a subtle point here, easily missed: the vm limit under torque was the aggregate for all of the tasks on the node, not just a single task. So if you had a badly behaving task/job, it would kill all the tasks running on that node.

        To simulate this ops requirement, hadoop should be taking the memory used by all the tasks and then performing some action. While I realize there is a desire to only punish 'bad tasks', I'm not sure if there is an easy way to do that. Putting my jack boots on, my answer is Kill Them All and Let The Users Sort Themselves Out. If I have to pick between killing the system (we're talking hard hang here, not happy little panic in my experiences) and punishing potentially innocent users, the answer is easy.

        Now here is where things get more complex, and there is a very good chance I've gotten this wrong. [Hopefully I have, because it sounds to me like a feature was added in the wrong spot.]

        It sounds like the capacity scheduler has the ability to kill tasks based upon vm per node. It has this idea of a max vm size and how much memory each task is asking for. It then schedules based upon a weird slot+mem ratio.

        While this is a fine and dandy feature that would likely fix the requestor's problem, I think it is a bit short-sighted not to have the kill feature at the task tracker level. The task tracker, regardless of scheduler, should still be able to keep track of all the mem used on the box and kill as necessary. If a scheduler wants to provide alternative logic, more power to it. But tying this to a scheduler just seems a bit ridiculous.

        Scott Chen added a comment -

        +1 on having a JIRA for per-task-physical-memory-limiting

        +1 on Hong's idea of fine tuning the policy. We can make this part configurable.

        Hong Tang added a comment -

        @arun, I think we need both - per-task resource containment at the high level, and os-level protection at the low level. In this jira, let's focus on the low-level protection, and we can have a separate jira to track the per-task resource containment.

        Arun C Murthy added a comment -

        Dhruba, this is not about making users write good code. This is about penalizing poorly behaved applications.

        If a MR job consumes too much memory you want to fail its component tasks and eventually fail the job.

        My concern is that the current implementation of this feature is not doing that.

        To be clear, I'm not against tracking physical memory used by the process. I'm only proposing we need to penalize the applications that consume too much memory. To that effect I'm proposing we fail the task if it exceeds the limit.

        I prefer the per-task limit since it has served us well with virtual memory. Maybe it is a bad idea to use the same model for physical memory; maybe someone can help me understand why that is so. I'm happy to reconsider it then. I've asked Allen Wittenauer for his thoughts on this too.

        dhruba borthakur added a comment -

        > If tasks, and hence jobs, are not penalized i.e. killed rather than failed, what is the incentive for the authors of the bad

        From my understanding, this JIRA is not about making users write good code. It is about making nodes in the cluster not die and rot (via excessive swapping, aka MAPREDUCE-257) when a huge-memory-consuming job comes along. It is also about letting other good jobs continue to run peacefully as much as possible. In a true utility model, the service provider provides service and it is up to the customer to write/deploy whatever code he wants to run on it (as long as he pays for it, of course).

        > So killing some tasks and letting others proceed may still be better than not killing. That said, I agree that we can fine tune the policy

        +1 for Hong's proposal. If somebody has a better policy for which tasks to kill when this situation arises, that would be great. The default policy is to kill the task with the maximum memory used, and I think that is a very practical one, isn't it?

        Arun C Murthy added a comment -

        We have tried once with the per-task limit. We tried to set the threshold higher but it cannot be too high because it hurts the memory utilization.

        Yes, we will need to make trade-offs. And making the right ones is very important for Hadoop Map-Reduce.

        Arun C Murthy added a comment -

        I think Hong is right, this is a duplicate of MAPREDUCE-257.

        Anyway,

        Correct me if I am wrong. Killing a task will not make the job fail. It is different from failing a task. So the 4th attempt getting killed will not fail the job. Also, if it gets killed, it is likely that it itself is the rogue task, because it uses the highest amount of memory.

        Now I'm really worried. If tasks, and hence jobs, are not penalized, i.e. killed rather than failed, what is the incentive for the authors of the bad applications to fix them? If they are just killed, the same tasks will be run over and over again without any penalty - there are no disincentives!

        Hong Tang added a comment -

        Chip in my 2 cents. I think the goal of this jira is to prevent a node from swapping (same as in MAPREDUCE-257). In this regard, containing memory usage for individual tasks may not be sufficient to protect the system from swapping - for instance, there could be foreign processes, or bugs in the TT and DN, that "eat up" RAM on the same node and lead to swapping. Or there could be faulty RAM modules that the OS does not recognize at boot time, leaving the actual amount of RAM less than configured. In my view, this idea is similar to the OOM killer mechanism in the Linux kernel and serves as a low-level protection against faults not easily preventable from upper layers.

        In terms of which tasks to shoot down: when a node goes into swap, there is a high chance that all tasks would fail (time out). Even worse, when the TT is swapping, it may not even respond to kill commands from the JT. So killing some tasks and letting others proceed may still be better than not killing. That said, I agree that we can fine-tune the policy - e.g. avoid killing a 4th task attempt, or favor keeping tasks that have already done lots of work (input bytes consumed, slot secs used).

        Scott Chen added a comment -

        @Arun: I agree. It is reasonable to track these two.

        We have tried once with the per-task limit. We tried to set the threshold higher, but it cannot be too high because it hurts the memory utilization. Even so, that threshold still caused some of our pipeline and user jobs to fail, so we switched it off quickly. But we really need some memory-protecting feature. That was why we started working on this patch.

        Correct me if I am wrong. Killing a task will not make the job fail. It is different from failing a task. So the 4th attempt getting killed will not fail the job. Also, if it gets killed, it is likely that it itself is the rogue task, because it uses the highest amount of memory.

        But in general, I agree with you. The total-RSS-memory-based-and-killing-highest-memory-task approach is not as predictable as the virtual-memory-limit approach. But it still has some benefits that we discussed above. And it is optional to use. This patch will not do anything if we don't configure it.

        Again, thank you for the comment. I think the discussions are very helpful.

        Arun C Murthy added a comment -

        Scott, as I said, it is reasonable to track either virtual memory or physical memory or both.

        "The other main reason that we want to do this is that the per-task virtual-memory limit is an API change for our users."

        A new feature might imply change, no?

        OTOH you could get away with simply setting the default values to be reasonable for a wide variety of uses, so users do not have to do anything.

        "I think it may not be that bad that we kill the task."

        Like I said, the problem is that there is no predictability. What if a job gets unlucky and its 4th attempt gets killed because it happened to run on a node where a rogue task of some other job ... again, predictability is very important. Penalizing the right task is equally important.

        Scott Chen added a comment -

        @Arun: Thanks again for the comment. I appreciate the help.

        I agree that the virtual-memory limit has many benefits.
        It kills a task quickly, before the task starts doing bad things, and it is more predictable.

        But the virtual-memory limit also wastes some memory, because the processes do not actually use that much physical memory.
        For example, there are tasks that acquire 1.6GB of virtual memory but use only 600MB the whole time.
        If you accumulate the total, the memory that is not being used adds up to a lot.

        The other main reason that we want to do this is that the per-task virtual-memory limit is an API change for our users.
        The jobs they used to submit every day may fail, and they would have to learn to set the per-task limit.
        The good thing about this patch is that it is backward compatible.

        I think it may not be that bad that we kill the task.
        It will be scheduled again on another machine and the job can still finish; it will only slow down the job.
        But if we don't kill it, it is very likely that it will cause the machine to fail, and we fail the task anyway.
        Worse still, all the tasks on that machine fail too.

        Arun C Murthy added a comment -

        Ah, fair point - I missed that detail about rlimit, my bad.

        "However, I think the goal of this patch is different - it's to let the jobs use however much memory they want without declaring it in advance, but fix things when we do overcommit."

        I'm trying to parse things chronologically. Please help me understand this.

        The original description says: "virtual-memory is inconvenient in some cases, so we'll do physical-memory". Fair enough.

        However, the current patch seems to reserve some physical memory for the TaskTracker... and is the plan to just kill whichever task is using the most at a given instant?

        If so, speaking from experience with HoD, which had pretty much the same feature (albeit for different reasons, i.e. to protect the Linux kernel), this is a bad idea. The problem is that there is no predictability at all. Tasks get randomly shot down because someone tipped things over the edge.

        I'd rather see a simpler per-task limit which the user sets per task - much like the virtual-memory limit.

        Zheng Shao added a comment -

        Arun, do the explanations from Scott and Matei make sense to you?
        If they look good to you, I would like to commit it.

        Matei Zaharia added a comment -

        Setrlimit does not set a cumulative limit for the children. If you setrlimit to 1 GB for a process and it then spawns a child, the child gets a separate 1 GB limit, and together they have 2 GB. Newer Linux features (cgroups and containers) can be used to set a limit on a whole process tree. However, I think the goal of this patch is different - it's to let the jobs use however much memory they want without declaring it in advance, but fix things when we do overcommit.
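
        Matei's point that setrlimit is per-process rather than cumulative can be checked with a small sketch using Python's resource module (an illustration only, not part of the patch; it assumes a POSIX system where the hard RLIMIT_AS limit is unlimited or at least 1 GB):

```python
import os
import resource

GB = 1 << 30

def soft_as():
    """Current soft virtual-address-space limit of this process."""
    return resource.getrlimit(resource.RLIMIT_AS)[0]

parent_soft = soft_as()
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: cap this process's virtual address space at 1 GB.
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (GB, hard))
    gpid = os.fork()
    if gpid == 0:
        # Grandchild: inherits its own full 1 GB cap -- limits are
        # per process, not pooled across the tree.
        os.write(w, str(soft_as()).encode())
        os._exit(0)
    os.waitpid(gpid, 0)
    os._exit(0)
os.close(w)
os.waitpid(pid, 0)
inherited = int(os.read(r, 64).decode())
assert inherited == GB           # grandchild saw the full 1 GB limit
assert soft_as() == parent_soft  # the parent's limit is unchanged
```

        Each descendant carries its own full allowance, so a tree of N processes can together map N GB - which is why a per-process rlimit cannot enforce a tree-wide cap.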

        Arun C Murthy added a comment -

        Scott: setrlimit works for a process and all its children. So, if the TT sets it for each of the map/reduce tasks we are good, no?

        NAME
             getrlimit, setrlimit -- control maximum system resource consumption
        
        SYNOPSIS
             #include <sys/resource.h>
        
             int
             getrlimit(int resource, struct rlimit *rlp);
        
             int
             setrlimit(int resource, const struct rlimit *rlp);
        
        DESCRIPTION
             Limits on the consumption of system resources by the current process and each
             process it creates may be obtained with the getrlimit() call, and set with
             the setrlimit() call.
        

        The reason I'm pushing for this is that we already have (or had) a feature to set ulimit on the children. This seems like a natural extension of that feature.

        Scott Chen added a comment -

        @Zheng: The failed test is because of MAPREDUCE-1520. It is not related to this patch.

        @Arun: Thank you very much for the comment. It's not late at all. Using setrlimit() is good for limiting the RSS memory of one task, but here we want to limit the total RSS memory on the TT, and to kill the task with the highest memory usage when that limit is exceeded. That is why we do it this way.

        Arun C Murthy added a comment -

        Sorry for coming in late, but isn't using (get|set)rlimit significantly simpler?

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12436385/MAPREDUCE-1221-v3.patch
        against trunk revision 915223.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/479/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/479/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/479/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/479/console

        This message is automatically generated.

        Scott Chen added a comment -

        Sorry about the tabs, my bad. They are removed in this file.

        Zheng Shao added a comment -

        Scott, can you replace TAB with 2 spaces in your code?

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12431623/MAPREDUCE-1221-v2.patch
        against trunk revision 905008.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/422/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/422/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/422/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/422/console

        This message is automatically generated.

        Scott Chen added a comment -

        I have made some changes based on Zheng's comments:
        1. In TaskMemoryManagerThread, line 222: skip already-killed tasks when checking the memory of each task.
        2. In TestTaskTrackerMemoryManager, line 545: add an assertFalse to fail the test immediately if the job finishes successfully.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12428563/MAPREDUCE-1221-v1.patch
        against trunk revision 898486.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/383/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/383/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/383/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/383/console

        This message is automatically generated.

        dhruba borthakur added a comment -

        Hi Vinod, will it be possible for you to review this one?

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12428563/MAPREDUCE-1221-v1.patch
        against trunk revision 894165.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/243/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/243/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/243/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/243/console

        This message is automatically generated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12428563/MAPREDUCE-1221-v1.patch
        against trunk revision 892893.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/232/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/232/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/232/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/232/console

        This message is automatically generated.

        Scott Chen added a comment -

        About virtual-memory limiting: we have tried to use it on our cluster. Our experience is that even if we set the total memory threshold to a high enough value, the TaskTracker would still kill a considerable number of tasks when there was nothing wrong with their RSS memory. So we decided to extend the virtual-memory limiting to this physical one. Anyway, it doesn't hurt to have more options; it will not be turned on if the configuration is not set.

        Scott Chen added a comment -

        The patch allows us to set an amount of physical memory that will not be used to run tasks. If this limit is violated, the task using the highest amount of memory will be killed.

        Ex: Configure mapreduce.tasktracker.reserved.physicalmemory.mb=3072
        If a TaskTracker has 16GB of memory and its tasks are currently using 14GB, then 16GB - 14GB < 3GB. In this case, TaskMemoryManagerThread will kill the task using the highest amount of memory. Note that if the value is not configured, this policy is never triggered.

        Killing a task will slow down the job because the task has to be scheduled again. But if the task is not killed in this case, it is very likely to fail on this node anyway, because the node has no memory to run it; the node might even crash because of this. We choose the highest memory-consuming task to kill because it is most likely to belong to the bad job that is causing the problem.

        A part of this patch was done by Dhruba. He sent me his half-done patch and I continued from there.
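
        The reserved-memory check described in this comment can be sketched in a few lines (an illustrative sketch only - the function name and attempt ids below are hypothetical, not the patch's API):

```python
def task_to_kill(total_mb, reserved_mb, task_rss_mb):
    """Sketch of the reserved-physical-memory policy: if free physical
    memory (total minus the sum of the tasks' RSS) drops below the
    configured reserve, return the id of the task using the most
    memory; otherwise return None."""
    if reserved_mb is None:  # feature is off when the value is not configured
        return None
    used_mb = sum(task_rss_mb.values())
    if total_mb - used_mb >= reserved_mb:
        return None
    return max(task_rss_mb, key=task_rss_mb.get)

# The example from the comment: a 16GB TaskTracker, tasks using 14GB in
# total, and mapreduce.tasktracker.reserved.physicalmemory.mb=3072.
tasks = {"attempt_1": 6144, "attempt_2": 5120, "attempt_3": 3072}
victim = task_to_kill(16384, 3072, tasks)  # 16GB - 14GB < 3GB, so kill one
```

        Here 16384 - 14336 = 2048 MB of free memory is below the 3072 MB reserve, so the heaviest task (attempt_1) is chosen; with the reserve unset, nothing is ever killed.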

        Hong Tang added a comment -

        Is it a duplicate of MAPREDUCE-257?

        dhruba borthakur added a comment -

        > In any case, it currently works only on Linux

        It is, of course, for Linux.

        > The TaskMemoryManager indeed looks at the memory usage of all the processes in the task's process-tree irrespective of whether each of them is a java program or otherwise.

        That is right. But my point was that each python/php process forked by the mapper actually uses about 500MB of virtual memory (at the minimum), and each python library the process includes adds yet another chunk of virtual memory. Most of this virtual memory is unused by the python/php interpreter, i.e. no physical pages are allocated for it. But since the existing MR framework looks at the total virtual memory of the process subtree, it starts killing tasks even though a large chunk of the physical RAM on the machine is unused.

        The proposal to have the ability to kill tasks based on physical memory usage should make system utilization much better, do you agree?
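
        The virtual-versus-physical gap described above is easy to observe on Linux by reading /proc directly (an illustrative, Linux-only sketch; the VmSize/VmRSS field names come from the kernel's /proc/<pid>/status format):

```python
import os
import re

def mem_mb(pid):
    """Return (virtual, resident) memory in MB for a pid, parsed from
    /proc/<pid>/status (Linux only)."""
    fields = {}
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            m = re.match(r"(VmSize|VmRSS):\s+(\d+)\s+kB", line)
            if m:
                fields[m.group(1)] = int(m.group(2)) / 1024.0
    return fields["VmSize"], fields["VmRSS"]

vsz, rss = mem_mb(os.getpid())
# An interpreter's virtual size is typically much larger than its
# resident set -- the gap is memory a virtual-memory limit "charges"
# for even though no physical pages back it.
assert vsz >= rss > 0
```

        This is the same VmRSS figure a physical-memory policy would sum across a task's process tree.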

        Vinod Kumar Vavilapalli added a comment -

        > This works well when most map-reduce jobs are Java jobs and have well-defined -Xmx parameters that specify the max virtual memory for each task. On the other hand, if each task forks off mappers/reducers written in other languages (python/php, etc), the total virtual memory usage of the process-subtree varies greatly.

        This is not quite right. The TaskMemoryManager indeed looks at the memory usage of all the processes in the task's process-tree, irrespective of whether each of them is a java program or otherwise. If one wishes to limit the physical memory usage, the tasks can be configured to be limited accordingly.

        In any case, it currently works only on Linux, so if this is another OS, it needs to be implemented there.


          People

          • Assignee:
            Scott Chen
            Reporter:
            dhruba borthakur
          • Votes:
            0
            Watchers:
            22