Hadoop Common
  1. Hadoop Common
  2. HADOOP-3581

Prevent memory intensive user tasks from taking down nodes

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hide
      Added the ability to kill process trees transgressing memory limits. TaskTracker uses the configuration parameters introduced in HADOOP-3759. In addition, mapred.tasktracker.taskmemorymanager.monitoring-interval specifies the interval for which TT waits between cycles of monitoring tasks' memory usage, and mapred.tasktracker.procfsbasedprocesstree.sleeptime-before-sigkill specifies the time TT waits for sending a SIGKILL to a process-tree that has overrun memory limits, after it has been sent a SIGTERM.
      Show
      Added the ability to kill process trees transgressing memory limits. TaskTracker uses the configuration parameters introduced in HADOOP-3759 . In addition, mapred.tasktracker.taskmemorymanager.monitoring-interval specifies the interval for which TT waits between cycles of monitoring tasks' memory usage, and mapred.tasktracker.procfsbasedprocesstree.sleeptime-before-sigkill specifies the time TT waits for sending a SIGKILL to a process-tree that has overrun memory limits, after it has been sent a SIGTERM.

      Description

      Sometimes user Map/Reduce applications can get extremely memory intensive, maybe due to some inadvertent bugs in the user code, or the amount of data processed. When this happens, the user tasks start to interfere with the proper execution of other processes on the node, including other Hadoop daemons like the DataNode and TaskTracker. Thus, the node would become unusable for any Hadoop tasks. There should be a way to prevent such tasks from bringing down the node.

      1. patch_3581_0.1.txt
        14 kB
        Vinod Kumar Vavilapalli
      2. patch_3581_3.3.txt
        26 kB
        Vinod Kumar Vavilapalli
      3. patch_3581_4.3.txt
        32 kB
        Vinod Kumar Vavilapalli
      4. patch_3581_4.4.txt
        32 kB
        Vinod Kumar Vavilapalli
      5. patch_3581_5.0.txt
        40 kB
        Vinod Kumar Vavilapalli
      6. patch_3581_5.2.txt
        44 kB
        Vinod Kumar Vavilapalli
      7. HADOOP-3581.6.0.txt
        51 kB
        Vinod Kumar Vavilapalli
      8. HADOOP-3581-final.txt
        51 kB
        Vinod Kumar Vavilapalli
      9. HADOOP-3581.20080901.2.txt
        46 kB
        Vinod Kumar Vavilapalli
      10. HADOOP-3581.20080902.txt
        46 kB
        Vinod Kumar Vavilapalli
      11. HADOOP-3581.20080904.txt
        48 kB
        Vinod Kumar Vavilapalli
      12. HADOOP-3581.20080905.txt
        48 kB
        Vinod Kumar Vavilapalli
      13. HADOOP-3581.20080908.txt
        46 kB
        Vinod Kumar Vavilapalli

        Issue Links

          Activity

          Hide
          Hudson added a comment -
          Show
          Hudson added a comment - Integrated in Hadoop-trunk #600 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/600/ )
          Hide
          Devaraj Das added a comment -

          I just committed this. Thanks, Vinod!

          Show
          Devaraj Das added a comment - I just committed this. Thanks, Vinod!
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12389676/HADOOP-3581.20080908.txt
          against trunk revision 693048.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3209/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3209/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3209/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3209/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12389676/HADOOP-3581.20080908.txt against trunk revision 693048. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3209/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3209/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3209/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3209/console This message is automatically generated.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Addressing the above minor changes.

          Show
          Vinod Kumar Vavilapalli added a comment - Addressing the above minor changes.
          Hide
          Devaraj Das added a comment -

          Two minor comments:
          1) The javadoc in TaskLog.java has some changes which are really not required.
          2) The TaskTracker.getLocalDirAllocator is not required. Just get an instance using

           new LocalDirAllocator("mapred.local.dir") 

          at the place where you require it.

          Show
          Devaraj Das added a comment - Two minor comments: 1) The javadoc in TaskLog.java has some changes which are really not required. 2) The TaskTracker.getLocalDirAllocator is not required. Just get an instance using new LocalDirAllocator( "mapred.local.dir" ) at the place where you require it.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          org.apache.hadoop.fs.kfs.TestKosmosFileSystem.testDirs is unrelated (HADOOP-4078).

          Patch still applies cleanly to trunk(after HADOOP-3150), test-cases run successfully. And I verified manually that this patch isn't disturbed anywhere by HADOOP-3150. Committable.

          Show
          Vinod Kumar Vavilapalli added a comment - org.apache.hadoop.fs.kfs.TestKosmosFileSystem.testDirs is unrelated ( HADOOP-4078 ). Patch still applies cleanly to trunk(after HADOOP-3150 ), test-cases run successfully. And I verified manually that this patch isn't disturbed anywhere by HADOOP-3150 . Committable.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12389552/HADOOP-3581.20080905.txt
          against trunk revision 692409.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3186/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3186/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3186/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3186/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12389552/HADOOP-3581.20080905.txt against trunk revision 692409. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3186/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3186/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3186/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3186/console This message is automatically generated.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          In the addTask and removeTask API, the synchonization is not done while checking for whether tasksToBeAdded and tasksToBeRemoved contain the object or not. I think this synchronization is required. Because the run method will access these datastructures and modify them.

          We actually don't need these checks, because no where is a task duplicately added to the data structures in our code. Removing these checks, and attaching a new patch.

          In task cleanup, do we need synchronization when the PID directory is added to the cleaner thread's queue ?

          Not needed. Cleaner thread uses a LinkedBlockingQueue which internally uses locks.

          Show
          Vinod Kumar Vavilapalli added a comment - In the addTask and removeTask API, the synchonization is not done while checking for whether tasksToBeAdded and tasksToBeRemoved contain the object or not. I think this synchronization is required. Because the run method will access these datastructures and modify them. We actually don't need these checks, because no where is a task duplicately added to the data structures in our code. Removing these checks, and attaching a new patch. In task cleanup, do we need synchronization when the PID directory is added to the cleaner thread's queue ? Not needed. Cleaner thread uses a LinkedBlockingQueue which internally uses locks.
          Hide
          Hemanth Yamijala added a comment -

          Changes look fine. Two questions:

          • In the addTask and removeTask API, the synchonization is not done while checking for whether tasksToBeAdded and tasksToBeRemoved contain the object or not. I think this synchronization is required. Because the run method will access these datastructures and modify them.
          • In task cleanup, do we need synchronization when the PID directory is added to the cleaner thread's queue ? It seems like it is not required, but can you confirm once more.
          Show
          Hemanth Yamijala added a comment - Changes look fine. Two questions: In the addTask and removeTask API, the synchonization is not done while checking for whether tasksToBeAdded and tasksToBeRemoved contain the object or not. I think this synchronization is required. Because the run method will access these datastructures and modify them. In task cleanup, do we need synchronization when the PID directory is added to the cleaner thread's queue ? It seems like it is not required, but can you confirm once more.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Incorporated 1 through 4.

          • For (5), Moved Procfs based code in TT to a new package private class TaskMemoryManagerThread.
          • Addressed (6) by introducing two datastructures tasksToBeadded and tasksToBeRemoved int TaskMemoryManagerThread on which addTask and removeTask methods (called from within startNewTask and reportTaskFinished) lock. TaskMemoryManagerThread in turns locks individually on these structures, adds the corresponding tasks to its internal processTreeInfoMap, clears them for future addition/removal, and then proceeds to business of memory monitoring using processTreeInfoMap.
          Show
          Vinod Kumar Vavilapalli added a comment - Incorporated 1 through 4. For (5), Moved Procfs based code in TT to a new package private class TaskMemoryManagerThread. Addressed (6) by introducing two datastructures tasksToBeadded and tasksToBeRemoved int TaskMemoryManagerThread on which addTask and removeTask methods (called from within startNewTask and reportTaskFinished) lock. TaskMemoryManagerThread in turns locks individually on these structures, adds the corresponding tasks to its internal processTreeInfoMap, clears them for future addition/removal, and then proceeds to business of memory monitoring using processTreeInfoMap.
          Hide
          Devaraj Das added a comment -

          Some comments:
          1) fix the bug in setTaskMemoryManagerEnabled that unconditionally sets the flag taskMemoryManagerEnabled to true
          2) Increase the default monitoring interval to 5 seconds. 300 msecs seems really low
          3) Change processTreeInfo to processTreeInfoMap. The name is confusing.
          4) Move the PID file cleanup to cleanup of Task.
          5) Move the Procfs based code to a different class outside the TaskTracker.
          5) The thread doing the memory monitoring is locking quite a big section of code where interaction with the OS is also involved. The same lock is also used for adding tasks to be monitored before their launch. This might negatively impact fast start of tasks.

          Show
          Devaraj Das added a comment - Some comments: 1) fix the bug in setTaskMemoryManagerEnabled that unconditionally sets the flag taskMemoryManagerEnabled to true 2) Increase the default monitoring interval to 5 seconds. 300 msecs seems really low 3) Change processTreeInfo to processTreeInfoMap. The name is confusing. 4) Move the PID file cleanup to cleanup of Task. 5) Move the Procfs based code to a different class outside the TaskTracker. 5) The thread doing the memory monitoring is locking quite a big section of code where interaction with the OS is also involved. The same lock is also used for adding tasks to be monitored before their launch. This might negatively impact fast start of tasks.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          The first findBugs warning is already explained, cannot be avoided.

          The warning "Hard coded reference to an absolute pathname in org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessList()" refers to the absolute path "/proc", which is inevitable.

          The second one relates to making the member class ProcessTreeInfo static private . The current class hierarchy is TaskTracker(non-static)>TaskMemoryManagerThread(non-static)>ProcessTreeInfo. ProcessTreeInfo is only related to TaskMemoryManagerThread, so we wish to leave it like that. And because of this three-level hierarchy, it cannot be made static.

          The findBugs warnings are unavoidable. Patch is committable.

          Show
          Vinod Kumar Vavilapalli added a comment - The first findBugs warning is already explained, cannot be avoided. The warning "Hard coded reference to an absolute pathname in org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessList()" refers to the absolute path "/proc", which is inevitable. The second one relates to making the member class ProcessTreeInfo static private . The current class hierarchy is TaskTracker(non-static) >TaskMemoryManagerThread(non-static) >ProcessTreeInfo. ProcessTreeInfo is only related to TaskMemoryManagerThread, so we wish to leave it like that. And because of this three-level hierarchy, it cannot be made static. The findBugs warnings are unavoidable. Patch is committable.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12389342/HADOOP-3581.20080902.txt
          against trunk revision 691306.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3162/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3162/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3162/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3162/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12389342/HADOOP-3581.20080902.txt against trunk revision 691306. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 2 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3162/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3162/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3162/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3162/console This message is automatically generated.
          Hide
          Hemanth Yamijala added a comment -

          Looks good to me. +1

          Show
          Hemanth Yamijala added a comment - Looks good to me. +1
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Attaching patch addressing the above comments. Renamed creatProcessTreeInfo to a simpler and more relevant name 'addTask', and removeProcessTreeInfo to removeTask.

          Show
          Vinod Kumar Vavilapalli added a comment - Attaching patch addressing the above comments. Renamed creatProcessTreeInfo to a simpler and more relevant name 'addTask', and removeProcessTreeInfo to removeTask.
          Hide
          Hemanth Yamijala added a comment -

          Looks good.

          Few minor comments:

          ProcFsBasedProcessTree

          • allProcessInfo is being used only within getProcessTree, hence it can be a local variable. Otherwise, it will hold memory unnecessarily
          • processTree should be cleared before every re-construction of the process tree.
          • isAlive(pid): If there's an IOException, the code needs to return false.

          TaskMemoryMonitorThread

          • IMO, addTaskToMonitor is a better name than createProcessTreeInfo, as it better conveys the meaning of the method. Likewise removeTaskToMonitor should replace removeProcessTreeInfo.

          hadoop-default.xml

          • I think the documentation need not include details about the classes etc, but should make sense to the administrator. Something like: The interval, in milliseconds, the tasktracker waits between two cycles of monitoring a task's memory usage.
          • Also, generally the framework doesn't seem to define configuration variables with upper case, so maybe mapred.tasktracker.taskmemorymanager.monitoring-interval ?
          • Millisecs should be expanded to milliseconds.
          Show
          Hemanth Yamijala added a comment - Looks good. Few minor comments: ProcFsBasedProcessTree allProcessInfo is being used only within getProcessTree, hence it can be a local variable. Otherwise, it will hold memory unnecessarily processTree should be cleared before every re-construction of the process tree. isAlive(pid): If there's an IOException, the code needs to return false. TaskMemoryMonitorThread IMO, addTaskToMonitor is a better name than createProcessTreeInfo, as it better conveys the meaning of the method. Likewise removeTaskToMonitor should replace removeProcessTreeInfo. hadoop-default.xml I think the documentation need not include details about the classes etc, but should make sense to the administrator. Something like: The interval, in milliseconds, the tasktracker waits between two cycles of monitoring a task's memory usage. Also, generally the framework doesn't seem to define configuration variables with upper case, so maybe mapred.tasktracker.taskmemorymanager.monitoring-interval ? Millisecs should be expanded to milliseconds.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12389297/HADOOP-3581.20080901.2.txt
          against trunk revision 691099.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3155/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3155/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3155/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3155/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12389297/HADOOP-3581.20080901.2.txt against trunk revision 691099. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 2 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3155/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3155/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3155/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3155/console This message is automatically generated.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Included all the above comments. Also removed the ProcessTree abstraction as it proved to be a premature abstraction, and is not panning out nicely while passing things like pid, sigKillInterval.

          • monitoringInterval is configurable via mapred.tasktracker.taskMemoryManager.monitoringInterval and sigkill interval via mapred.tasktracker.procfsBasedProcessTree.sleepTimeBeforeSigKill.
          • Removed the processtreeimpl configuration parameter.
          • Added createProcessTreeInfo(TaskAttemptID tid, long memLimit) and removeProcessTreeInfo(TaskAttemptID tid) to help handle synchronization better. These are respectively called by startNewTask(LaunchTaskAction action) and reportTaskFinished(TaskAttemptID taskid) which update the processTreeInfo map accordingly.
          • A bit of refactoring - made ProcessInfo an inner static class of ProcfsBasedProcessTree.
          Show
          Vinod Kumar Vavilapalli added a comment - Included all the above comments. Also removed the ProcessTree abstraction as it proved to be a premature abstraction, and is not panning out nicely while passing things like pid, sigKillInterval. monitoringInterval is configurable via mapred.tasktracker.taskMemoryManager.monitoringInterval and sigkill interval via mapred.tasktracker.procfsBasedProcessTree.sleepTimeBeforeSigKill. Removed the processtreeimpl configuration parameter. Added createProcessTreeInfo(TaskAttemptID tid, long memLimit) and removeProcessTreeInfo(TaskAttemptID tid) to help handle synchronization better. These are respectively called by startNewTask(LaunchTaskAction action) and reportTaskFinished(TaskAttemptID taskid) which update the processTreeInfo map accordingly. A bit of refactoring - made ProcessInfo an inner static class of ProcfsBasedProcessTree.
          Hide
          Hemanth Yamijala added a comment -

          Few minor nits:

          • In TaskLog the construction of the command line for the PID file can be done using fewer calls to append.
          • isTaskMemoryManagerEnabled should be package private
          • TaskTracker.getPid() should close the streams in a finally block.
          Show
          Hemanth Yamijala added a comment - Few minor nits: In TaskLog the construction of the command line for the PID file can be done using fewer calls to append. isTaskMemoryManagerEnabled should be package private TaskTracker.getPid() should close the streams in a finally block.
          Hide
          Hemanth Yamijala added a comment -

          Some comments:

          ProcessTree:

          • isImplemented need not throw an Exception. It can return false if something fails, as the object can no longer be used.
          • rename getVmem to something like getCumulativeVmem to better reflect what it is doing.

          ProcfsBasedProcessTree:

          • The algorithm in initialize can be improved. Particularly, to construct the process hierarchy, we are using a recursive mechanism which is looking at paths in the process tree hierarchy multiple times. Instead, we could have one pass to get the list of processes, and another to create the parent-child relationship. Building the required tree will then be walking from the process corresponding to the task, and listing its children recursively.
          • In reconstruct we are removing the completed processes by adding that to a delete list and then walking over it to delete one at a time. Can't we use Iterator.remove to achieve what we want ?
          • In reconstruct rather than creating a new HashMap to clear the elements, we can directly call clear on the existing HashMap.
          • SLEEP_TIME_BEFORE_SIGKILL should be made a configuration variable.
          • Give a name to the SigKillThread.

          TaskMemoryManagerThread

          • MONITORING_INTERVAL should be configurable.
          • Instead of using the Object[3] to store the Process related information, we can use a simple private class to hold this information together.
          • The processTreeInfo map should be between TIP and the object described above.
          • Use Configuration(false), which will not load the default configuration, when setting the jobid/
          • Use TaskTracker.getMemoryPerTask(TaskInProgress) instead of getting the value from the JobConf.
          • When walking over the tasks that are running, you must check if the task state is running, or commit pending, and so on.
          Show
          Hemanth Yamijala added a comment - Some comments: ProcessTree : isImplemented need not throw an Exception. It can return false if something fails, as the object can no longer be used. rename getVmem to something like getCumulativeVmem to better reflect what it is doing. ProcfsBasedProcessTree : The algorithm in initialize can be improved. Particularly, to construct the process hierarchy, we are using a recursive mechanism which is looking at paths in the process tree hierarchy multiple times. Instead, we could have one pass to get the list of processes, and another to create the parent-child relationship. Building the required tree will then be walking from the process corresponding to the task, and listing its children recursively. In reconstruct we are removing the completed processes by adding that to a delete list and then walking over it to delete one at a time. Can't we use Iterator.remove to achieve what we want ? In reconstruct rather than creating a new HashMap to clear the elements, we can directly call clear on the existing HashMap. SLEEP_TIME_BEFORE_SIGKILL should be made a configuration variable. Give a name to the SigKillThread. TaskMemoryManagerThread MONITORING_INTERVAL should be configurable. Instead of using the Object [3] to store the Process related information, we can use a simple private class to hold this information together. The processTreeInfo map should be between TIP and the object described above. Use Configuration(false), which will not load the default configuration, when setting the jobid/ Use TaskTracker.getMemoryPerTask(TaskInProgress) instead of getting the value from the JobConf. When walking over the tasks that are running, you must check if the task state is running, or commit pending, and so on.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          And I can't see any contrib tests failing. Not even in the console output. Wrong report by Hudson?

          HADOOP-4037

          Show
          Vinod Kumar Vavilapalli added a comment - And I can't see any contrib tests failing. Not even in the console output. Wrong report by Hudson? HADOOP-4037
          Hide
          Vinod Kumar Vavilapalli added a comment -

          The findbugs warning cannot be avoided. The warning "Hard coded reference to an absolute pathname in org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessList()" refers to the absolute path "/proc", which is inevitable.

          TestMiniMRDFSSort.testMapReduceSort is not related to this patch, HADOOP-3950 addresses this.

          And I can't see any contrib tests failing. Not even in the console output. Wrong report by Hudson?

          Show
          Vinod Kumar Vavilapalli added a comment - The findbugs warning cannot be avoided. The warning "Hard coded reference to an absolute pathname in org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessList()" refers to the absolute path "/proc", which is inevitable. TestMiniMRDFSSort.testMapReduceSort is not related to this patch, HADOOP-3950 addresses this. And I can't see any contrib tests failing. Not even in the console output. Wrong report by Hudson?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12388977/HADOOP-3581-final.txt
          against trunk revision 689380.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3121/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3121/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3121/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3121/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12388977/HADOOP-3581-final.txt against trunk revision 689380. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3121/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3121/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3121/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3121/console This message is automatically generated.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          HADOOP-3759 is committed. Attaching a final patch with merged changes.

          Show
          Vinod Kumar Vavilapalli added a comment - HADOOP-3759 is committed. Attaching a final patch with merged changes.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Attaching a new patch. This one assumes HADOOP-3759 is in. Even after that, it needs minor merges at two places. Made a few changes also.

          • We used to start the TaskMemoryManagerThread all the time, which will run and get disabled if memory management is not abled. Now, changed this behaviour so that this thread is created only if required.
          • In earlier patch, we used to send SIGTERM first, sleep for some interval, then send a SIGKILL. This had the problem of memory-overstepping tasks sneak in and finish off during that wait time. Now that is prevented by moving {sleeping, sending SIGKILL}

            to a new thread(SigKillThread). This way, we 'll have one thread per process-tree to be killed and the number of threads is bounded by the total number of tasks that can be run on a TT. This is nearly fool proof, only tasks that can run with over-quota memory are the ones that have very very short life span of 300ms(TaskMemoryManager sleep time). Very unlikely.

          • Added another testcase TestTaskTrackerMemoryManager which uses miniMR and miniDFS clusters, It adds two tests to test that 1) tasks with memory requirements within what TT can offer will run successfully without any errors and 2) tasks with memory requirements more than what TTs can offer are really killed. Also asserts that the error messages in diagnostic information are as expected and in expected format.
          • This patch also makes the tips killed due to memory transgression to be marked as FAILED. Earlier they were marked as KILLED, TIPS kept getting rescheduled and so the job could go on for ever without finishing.
          Show
          Vinod Kumar Vavilapalli added a comment - Attaching a new patch. This one assumes HADOOP-3759 is in. Even after that, it needs minor merges at two places. Made a few changes also. We used to start the TaskMemoryManagerThread all the time, which will run and get disabled if memory management is not abled. Now, changed this behaviour so that this thread is created only if required. In earlier patch, we used to send SIGTERM first, sleep for some interval, then send a SIGKILL. This had the problem of memory-overstepping tasks sneak in and finish off during that wait time. Now that is prevented by moving {sleeping, sending SIGKILL} to a new thread(SigKillThread). This way, we 'll have one thread per process-tree to be killed and the number of threads is bounded by the total number of tasks that can be run on a TT. This is nearly fool proof, only tasks that can run with over-quota memory are the ones that have very very short life span of 300ms(TaskMemoryManager sleep time). Very unlikely. Added another testcase TestTaskTrackerMemoryManager which uses miniMR and miniDFS clusters, It adds two tests to test that 1) tasks with memory requirements within what TT can offer will run successfully without any errors and 2) tasks with memory requirements more than what TTs can offer are really killed. Also asserts that the error messages in diagnostic information are as expected and in expected format. This patch also makes the tips killed due to memory transgression to be marked as FAILED. Earlier they were marked as KILLED, TIPS kept getting rescheduled and so the job could go on for ever without finishing.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          HADOOP-3759 is ready for committing. Cancelling this patch to attach a new one, merging changes from 3759's patch.

          Show
          Vinod Kumar Vavilapalli added a comment - HADOOP-3759 is ready for committing. Cancelling this patch to attach a new one, merging changes from 3759's patch.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12387547/patch_3581_5.2.txt
          against trunk revision 682978.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 526 javac compiler warnings (more than the trunk's current 525 warnings).

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3009/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3009/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3009/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3009/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12387547/patch_3581_5.2.txt against trunk revision 682978. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 526 javac compiler warnings (more than the trunk's current 525 warnings). -1 findbugs. The patch appears to introduce 1 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3009/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3009/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3009/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3009/console This message is automatically generated.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Attaching modified patch.

          • Incorporated above comments - - Added mapred.tasktracer.processtreeimpl which defaults to null. Also pid files are written only when TaskMemoryManager is enabled which is when both mapred.tasktracer.processtreeimpl and mapred.tasktracker.maxmemory are set. TT only refers to ProcessTree.
          • Added isZombie, isEmpty and getProcessTree to abstract class ProcessTree. getProcessTree replaces initializes(and reconstruct) and returns the ProcessTree with latest state.
          • Did a bit of refactoring of ProcfsBasedProcessTree so that all inherited methods are together at one place.
          Show
          Vinod Kumar Vavilapalli added a comment - Attaching modified patch. Incorporated above comments - - Added mapred.tasktracer.processtreeimpl which defaults to null. Also pid files are written only when TaskMemoryManager is enabled which is when both mapred.tasktracer.processtreeimpl and mapred.tasktracker.maxmemory are set. TT only refers to ProcessTree. Added isZombie, isEmpty and getProcessTree to abstract class ProcessTree. getProcessTree replaces initializes(and reconstruct) and returns the ProcessTree with latest state. Did a bit of refactoring of ProcfsBasedProcessTree so that all inherited methods are together at one place.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Cancelling patch to incorporate Devaraj's review comments offline.

          • Remove all references to ProcfsBasedProcessTree in TT. TT should strictly use the abstract interface ProcessTree. Only. So that implementing and using some other processTree doesn't imply any changes to TT.
          • Let the pidFile be written during task creation only if memory management is enabled. Despite this, hitting the disk to write pid to a file for each task might affect short jobs.
          • TT should get an instance of ProcessTree from a configuration variable - mapred.tasktracker.processtreeimpl
          • Make TT use a string for a pid instead of an integer.
          Show
          Vinod Kumar Vavilapalli added a comment - Cancelling patch to incorporate Devaraj's review comments offline. Remove all references to ProcfsBasedProcessTree in TT. TT should strictly use the abstract interface ProcessTree. Only. So that implementing and using some other processTree doesn't imply any changes to TT. Let the pidFile be written during task creation only if memory management is enabled. Despite this, hitting the disk to write pid to a file for each task might affect short jobs. TT should get an instance of ProcessTree from a configuration variable - mapred.tasktracker.processtreeimpl Make TT use a string for a pid instead of an integer.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Fixed above errors. Ataching a new patch.

          Show
          Vinod Kumar Vavilapalli added a comment - Fixed above errors. Ataching a new patch.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12387330/patch_3581_4.4.txt
          against trunk revision 681493.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          -1 release audit. The applied patch generated 218 release audit warnings (more than the trunk's current 214 warnings).

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3000/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3000/artifact/trunk/current/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3000/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3000/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3000/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12387330/patch_3581_4.4.txt against trunk revision 681493. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs warnings. -1 release audit. The applied patch generated 218 release audit warnings (more than the trunk's current 214 warnings). -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3000/testReport/ Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3000/artifact/trunk/current/releaseAuditDiffWarnings.txt Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3000/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3000/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3000/console This message is automatically generated.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Incorporating findBug warnings' related fixes.

          Show
          Vinod Kumar Vavilapalli added a comment - Incorporating findBug warnings' related fixes.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12387286/patch_3581_4.3.txt
          against trunk revision 681429.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 18 new Findbugs warnings.

          -1 release audit. The applied patch generated 218 release audit warnings (more than the trunk's current 214 warnings).

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2999/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2999/artifact/trunk/current/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2999/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2999/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2999/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12387286/patch_3581_4.3.txt against trunk revision 681429. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 18 new Findbugs warnings. -1 release audit. The applied patch generated 218 release audit warnings (more than the trunk's current 214 warnings). -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2999/testReport/ Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2999/artifact/trunk/current/releaseAuditDiffWarnings.txt Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2999/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2999/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2999/console This message is automatically generated.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Attaching a new patch.

          • Added a test case that creates a "rogue task", verifies it goes beyond a certain limit, kill it, and asserts that it is killed properly.
          • Moved ProcessTree, ProcfsBasedProcessTree and ProcInfo to org.apache.hadoop.util
          • Changed all methods in ProcessTree to be abstract methods.
          Show
          Vinod Kumar Vavilapalli added a comment - Attaching a new patch. Added a test case that creates a "rogue task", verifies it goes beyond a certain limit, kill it, and asserts that it is killed properly. Moved ProcessTree, ProcfsBasedProcessTree and ProcInfo to org.apache.hadoop.util Changed all methods in ProcessTree to be abstract methods.
          Hide
          Vinod Kumar Vavilapalli added a comment - - edited

          Attaching patch for review. It still doesn't have test-cases and documentation.

          Notes:

          • TaskMemoryManagerThread: This is a thread in TaskTracker that manages memory usage of tasks running under this TT. It is responsible for killing any task-trees that over-step memory limits. It uses MONITORING_INTERVAL, the interval for which TaskMemoryManager sleeps before initiating another memory management cycle. Default value 300ms - need an appropriate, but small, value for this.
          • TaskMemoryManagerThread tracks tasks using ProcessTree objects of abstract class ProcessTree. Currently, we only have implementation for Linux and Cygwin - ProcfsBasedProcessTree - a proc file-system based ProcessTree.
          • For managing memory, ProcfsBasedProcessTree needs pid of root task to begin with. For this, the way tasks are started is changed so as to store the pid of the started task process in a temporary PidFile (by echoing $$). By this, we are doing away with the earlier proposal of writing native code to get pid which involves having another external library. Using shell features to get pid is straightforward, simple to incorporate and doesn't need multiple implementations.
          • PidFiles reside in PIDDIR of TaskTracker's work-space. They are removed once a task process-tree gets killed/finished.
          • Processes that survive the initial SIGTERM are killed by sending a subsequent SIGKILL after SLEEP_TIME_BEFORE_SIGKILL. This is currently set to 5 secs, but this should be changed to an appropriate value; the main downside of having this (large a ) value is that it leaves enough time for rogue tasks to behave badly by expanding their memory usage beyond set limits.
          • All the three configuration parameters default to Long.MAX_VALUE (memory management disabled by default).
          • Zombie process-trees : We manage non-empty process-trees even after root processes (Tasks) exit so as to take care of rogue tasks that may fork off offsprings silently before they exit.

          TODO:

          • Deprecate all of the ulimit business - i.e deprecating mapred.child.ulimit feature provided by HADOOP-2765. We may still want to retain this for limiting other things like open files etc., but HADOOP-3675 should be automatically providing such task setup feature. Comments?
          • Incorporate some of the methods in ProcfsBasedProcessTree(isEmpty, isZombie, reconstruct etc) into ProcessTree ?
          • Move ProcessInfo, ProcessTree and ProcfsBasedProcessTree to util package ?

          Also, please comment on a bunch of other minor TODO's marked in the patch.

          Testing:
          Tested the patch on a Linux cluster

          • with no limits (all the three parameters left unspecified),
          • with only TT limit set (tasks get default limits) and
          • with user-configured per-job limits (which override TT's limits). TaskMemoryManager works as desired in all the above scenarios.
          Show
          Vinod Kumar Vavilapalli added a comment - - edited Attaching patch for review. It still doesn't have test-cases and documentation. Notes: TaskMemoryManagerThread: This is a thread in TaskTracker that manages memory usage of tasks running under this TT. It is responsible for killing any task-trees that over-step memory limits. It uses MONITORING_INTERVAL, the interval for which TaskMemoryManager sleeps before initiating another memory management cycle. Default value 300ms - need an appropriate, but small, value for this. TaskMemoryManagerThread tracks tasks using ProcessTree objects of abstract class ProcessTree. Currently, we only have implementation for Linux and Cygwin - ProcfsBasedProcessTree - a proc file-system based ProcessTree. For managing memory, ProcfsBasedProcessTree needs pid of root task to begin with. For this, the way tasks are started is changed so as to store the pid of the started task process in a temporary PidFile (by echoing $$). By this, we are doing away with the earlier proposal of writing native code to get pid which involves having another external library. Using shell features to get pid is straightforward, simple to incorporate and doesn't need multiple implementations. PidFiles reside in PIDDIR of TaskTracker's work-space. They are removed once a task process-tree gets killed/finished. Processes that survive the initial SIGTERM are killed by sending a subsequent SIGKILL after SLEEP_TIME_BEFORE_SIGKILL. This is currently set to 5 secs, but this should be changed to an appropriate value; the main downside of having this (large a ) value is that it leaves enough time for rogue tasks to behave badly by expanding their memory usage beyond set limits. All the three configuration parameters default to Long.MAX_VALUE (memory management disabled by default). Zombie process-trees : We manage non-empty process-trees even after root processes (Tasks) exit so as to take care of rogue tasks that may fork off offsprings silently before they exit. TODO: Deprecate all of the ulimit business - i.e deprecating mapred.child.ulimit feature provided by HADOOP-2765 . We may still want to retain this for limiting other things like open files etc., but HADOOP-3675 should be automatically providing such task setup feature. Comments? Incorporate some of the methods in ProcfsBasedProcessTree(isEmpty, isZombie, reconstruct etc) into ProcessTree ? Move ProcessInfo, ProcessTree and ProcfsBasedProcessTree to util package ? Also, please comment on a bunch of other minor TODO's marked in the patch. Testing: Tested the patch on a Linux cluster with no limits (all the three parameters left unspecified), with only TT limit set (tasks get default limits) and with user-configured per-job limits (which override TT's limits). TaskMemoryManager works as desired in all the above scenarios.
          Hide
          Vivek Ratan added a comment -

          Just FYI, I have summarized the entire proposal (which is split between this Jira and HADOOP-3759) here.

          Show
          Vivek Ratan added a comment - Just FYI, I have summarized the entire proposal (which is split between this Jira and HADOOP-3759 ) here .
          Hide
          Hemanth Yamijala added a comment -

          Some more details on the configuration items:

          mapred.tasktracker.tasks.maxmemory: We can default this to a value like Long.MAX_VALUE, or maybe even -1L, to disable this feature. Only if it is different from default, will memory monitoring be done.

          Regarding mapred.map.task.maxmemory, one thought is whether we need separate items for map and reduce tasks, or can we just do with one item, such as mapred.task.maxmemory, which will define the maximum value that will be taken by any task (Map or Reduce) in the job. If it is typical that one type of task (say Reduce), has significantly different memory requirements than the other, then two items may be required. Comments ?

          Show
          Hemanth Yamijala added a comment - Some more details on the configuration items: mapred.tasktracker.tasks.maxmemory: We can default this to a value like Long.MAX_VALUE, or maybe even -1L, to disable this feature. Only if it is different from default, will memory monitoring be done. Regarding mapred.map.task.maxmemory, one thought is whether we need separate items for map and reduce tasks, or can we just do with one item, such as mapred.task.maxmemory, which will define the maximum value that will be taken by any task (Map or Reduce) in the job. If it is typical that one type of task (say Reduce), has significantly different memory requirements than the other, then two items may be required. Comments ?
          Hide
          eric baldeschwieler added a comment -

          Looks good to me. I think the only constraint we should place on MAX
          task sizes is that there is space for the MAX map plus the MAX reduce
          on the node.

          Show
          eric baldeschwieler added a comment - Looks good to me. I think the only constraint we should place on MAX task sizes is that there is space for the MAX map plus the MAX reduce on the node.
          Hide
          Hemanth Yamijala added a comment -

          A user should specify the MAX RAM in GB or MB that the tasks will use.

          +1. I think that is much easier for a user to specify.

          Here's what I propose with respect to the configuration variables:

          • mapred.tasktracker.tasks.maxmemory: Cumulative memory that can be used by all map/reduce tasks.
          • mapred.map.task.maxmemory: (Overridable per job) Maximum memory any map task of a job can take. By default, mapred.tasktracker.tasks.maxmemory / number of slots on a node
          • mapred.reduce.task.maxmemory: (Overridable per job) Maximum memory any reduce of a job can take. By default, mapred.tasktracker.tasks.maxmemory / number of slots on a node

          Thoughts ? Specifically, on the default values, is it OK to give the same amount of max memory to map tasks and reduce tasks ? Or should we look to divide the max memory so that there's more (say twice) given to the reduce tasks, than to the map tasks ?

          Show
          Hemanth Yamijala added a comment - A user should specify the MAX RAM in GB or MB that the tasks will use. +1. I think that is much easier for a user to specify. Here's what I propose with respect to the configuration variables: mapred.tasktracker.tasks.maxmemory: Cumulative memory that can be used by all map/reduce tasks. mapred.map.task.maxmemory: (Overridable per job) Maximum memory any map task of a job can take. By default, mapred.tasktracker.tasks.maxmemory / number of slots on a node mapred.reduce.task.maxmemory: (Overridable per job) Maximum memory any reduce of a job can take. By default, mapred.tasktracker.tasks.maxmemory / number of slots on a node Thoughts ? Specifically, on the default values, is it OK to give the same amount of max memory to map tasks and reduce tasks ? Or should we look to divide the max memory so that there's more (say twice) given to the reduce tasks, than to the map tasks ?
          Hide
          eric baldeschwieler added a comment -

          A couple of comments:

          As I understand this the job specifies is RAM requirements as a % of
          RAM on a TT? That doesn't fly. A user should specify the MAX RAM in
          GB or MB that the tasks will use. %s are not the right model. IE
          may tasks will use no more than 1.5GB

          I don't think a per job limit on % RAM to map / reduce works. Better
          to just specify the biggest MAP or REDUCE the cluster can support.

          E14

          Show
          eric baldeschwieler added a comment - A couple of comments: As I understand this the job specifies is RAM requirements as a % of RAM on a TT? That doesn't fly. A user should specify the MAX RAM in GB or MB that the tasks will use. %s are not the right model. IE may tasks will use no more than 1.5GB I don't think a per job limit on % RAM to map / reduce works. Better to just specify the biggest MAP or REDUCE the cluster can support. E14
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Summarizing all the discussion that went so far ..

          • TaskTracker tracks the memory usage of all the tasks and their sub-processes (irrespective of which user runs which tasks).
          • It uses per-task-objects of classes implementing ProcessTree as described earlier. Currently, we implement ProcfsBasedProcessTree, which works on both Linux and Cygwin. For other OS' we need classes extending ProcessTree.
          • We will have two configuration properties - per-tracker property mapred.tasktracker.tasks.maxmemory specifying maximum memory usable across all tasks on a tasktracker; and per-job property mapred.map.memlimit.percent for specifying memory usable across all maps in terms of percentage of mapred.tasktracker.tasks.maxmemory. Maximum memory usable by reduce tasks = (100-mapred.map.memlimit.percent )% of mapred.tasktracker.tasks.maxmemory. All these are virtual memory limits, not working set. By default, we can set mapred.tasktracker.tasks.maxmemory to 12GB(4GB RAM + 8GB swap) and mapred.map.memlimit.percent to 33%. Should be ok?
          • After every heartbeat, TT scans through the list of running tasks and finds if any task trees are transgressing limits and kills them by issuing 'kill <pid>' (SIGTERM) to each process in each of the concerned process-trees. TT will monitor if a process-tree is killed successfully. If not, it issues a subsequent SIGKILL.
          • ProcfsBasedProcessTree: Constructs process-tree information from /proc file system. TT obtains pid from the task via the rpc method getPid and then constructs the process tree using procfs.

          Please put forward your objections to the above proposal, if any.

          Show
          Vinod Kumar Vavilapalli added a comment - Summarizing all the discussion that went so far .. TaskTracker tracks the memory usage of all the tasks and their sub-processes (irrespective of which user runs which tasks). It uses per-task-objects of classes implementing ProcessTree as described earlier. Currently, we implement ProcfsBasedProcessTree, which works on both Linux and Cygwin. For other OS' we need classes extending ProcessTree. We will have two configuration properties - per-tracker property mapred.tasktracker.tasks.maxmemory specifying maximum memory usable across all tasks on a tasktracker; and per-job property mapred.map.memlimit.percent for specifying memory usable across all maps in terms of percentage of mapred.tasktracker.tasks.maxmemory. Maximum memory usable by reduce tasks = (100-mapred.map.memlimit.percent )% of mapred.tasktracker.tasks.maxmemory. All these are virtual memory limits, not working set. By default, we can set mapred.tasktracker.tasks.maxmemory to 12GB(4GB RAM + 8GB swap) and mapred.map.memlimit.percent to 33%. Should be ok? After every heartbeat, TT scans through the list of running tasks and finds if any task trees are transgressing limits and kills them by issuing 'kill <pid>' (SIGTERM) to each process in each of the concerned process-trees. TT will monitor if a process-tree is killed successfully. If not, it issues a subsequent SIGKILL. ProcfsBasedProcessTree: Constructs process-tree information from /proc file system. TT obtains pid from the task via the rpc method getPid and then constructs the process tree using procfs. Please put forward your objections to the above proposal, if any.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Could we solve this by adding an extra argument specifying the JobId and the UserId to enable the script to do by job/user accounting ?

          I am not sure if I understand this well enough. If you meant "pass JobId/UserId to the script and do per-job/per-user accounting only", then that won't help - we need overall accounting across all tasks.

          The wrapper I proposed before could solve this problem as a side effect (with /etc/security/limits.conf). But it might not be portable and your solution is maybe for this case.

          limits.conf approach is already being evaluated, it doesn't solve the current problem. See this comment on this very JIRA - https://issues.apache.org/jira/browse/HADOOP-3581?focusedCommentId=12607650#action_12607650

          I'm afraid that many functionality will not to be available for threaded tasks anyway. My next proposition will include a fallback mecanism so you should'nt have to take this in account.

          This looks like an interesting problem - how do we manage resource usage by each thread? Any thread resource management support in Java? What is the use-case for threaded tasks in the first place? If cost of per-taskJvm is the only reason why we want to run each task in a thread instead of a jvm, we can still achieve resource management of all tasks by forking one single jvm and running all tasks as threads of this jvm. This way we can meet our objective here too - shield hadoop from user code.

          Show
          Vinod Kumar Vavilapalli added a comment - Could we solve this by adding an extra argument specifying the JobId and the UserId to enable the script to do by job/user accounting ? I am not sure if I understand this well enough. If you meant "pass JobId/UserId to the script and do per-job/per-user accounting only", then that won't help - we need overall accounting across all tasks. The wrapper I proposed before could solve this problem as a side effect (with /etc/security/limits.conf). But it might not be portable and your solution is maybe for this case. limits.conf approach is already being evaluated, it doesn't solve the current problem. See this comment on this very JIRA - https://issues.apache.org/jira/browse/HADOOP-3581?focusedCommentId=12607650#action_12607650 I'm afraid that many functionality will not to be available for threaded tasks anyway. My next proposition will include a fallback mecanism so you should'nt have to take this in account. This looks like an interesting problem - how do we manage resource usage by each thread? Any thread resource management support in Java? What is the use-case for threaded tasks in the first place? If cost of per-taskJvm is the only reason why we want to run each task in a thread instead of a jvm, we can still achieve resource management of all tasks by forking one single jvm and running all tasks as threads of this jvm. This way we can meet our objective here too - shield hadoop from user code.
          Hide
          Brice Arnould added a comment -

          Implementation in wrapper implies tracking per task and thus we will not have a global picture of resource usage at TaskTracker level. Further, it is set-once-and-run kind of mechanism - before launching tasks itself, we will have to declare the limits within which tasks can run. If we wish to make these limits dynamic, we will need an extra communication pipe between the wrapper and TaskTracker.

          Could we solve this by adding an extra argument specifying the JobId and the UserId to enable the script to do by job/user accounting ?

          We could not find an out-of-the-box OS solution to curtail a process and it's descendents' memory limits. Specifically, ulimit did not seem to handle processes spawned from a parent whose memory limit was set.

          The wrapper I proposed before could solve this problem as a side effect (with /etc/security/limits.conf). But it might not be portable and your solution is maybe for this case.

          One impact of HADOOP-3675 on this work is that, when the mechanism to launch a task becomes pluggable, the way we monitor memory per task might need to change as well. So, for example, if we have a task-per-thread implementation of a task runner, it would be difficult to monitor memory per task because it is in the same process space, right ? In fact this proposal in the patch works only if the task is launched in a separate process.

          I'm afraid that many functionality will not to be available for threaded tasks anyway. My next proposition will include a fallback mecanism so you should'nt have to take this in account.

          PS: I'm quite in hurry, please excuse me for my english :-/

          Show
          Brice Arnould added a comment - Implementation in wrapper implies tracking per task and thus we will not have a global picture of resource usage at TaskTracker level. Further, it is set-once-and-run kind of mechanism - before launching tasks itself, we will have to declare the limits within which tasks can run. If we wish to make these limits dynamic, we will need an extra communication pipe between the wrapper and TaskTracker. Could we solve this by adding an extra argument specifying the JobId and the UserId to enable the script to do by job/user accounting ? We could not find an out-of-the-box OS solution to curtail a process and it's descendents' memory limits. Specifically, ulimit did not seem to handle processes spawned from a parent whose memory limit was set. The wrapper I proposed before could solve this problem as a side effect (with /etc/security/limits.conf ). But it might not be portable and your solution is maybe for this case. One impact of HADOOP-3675 on this work is that, when the mechanism to launch a task becomes pluggable, the way we monitor memory per task might need to change as well. So, for example, if we have a task-per-thread implementation of a task runner, it would be difficult to monitor memory per task because it is in the same process space, right ? In fact this proposal in the patch works only if the task is launched in a separate process. I'm afraid that many functionality will not to be available for threaded tasks anyway. My next proposition will include a fallback mecanism so you should'nt have to take this in account. PS: I'm quite in hurry, please excuse me for my english :-/
          Hide
          Hemanth Yamijala added a comment -

          One impact of HADOOP-3675 on this work is that, when the mechanism to launch a task becomes pluggable, the way we monitor memory per task might need to change as well. So, for example, if we have a task-per-thread implementation of a task runner, it would be difficult to monitor memory per task because it is in the same process space, right ? In fact this proposal in the patch works only if the task is launched in a separate process.

          Show
          Hemanth Yamijala added a comment - One impact of HADOOP-3675 on this work is that, when the mechanism to launch a task becomes pluggable, the way we monitor memory per task might need to change as well. So, for example, if we have a task-per-thread implementation of a task runner, it would be difficult to monitor memory per task because it is in the same process space, right ? In fact this proposal in the patch works only if the task is launched in a separate process.
          Hide
          Hemanth Yamijala added a comment -

          Brice,

          I think that there is a more general problem, that is task insulation, because a bugged process could many other things than just overloading the memory.

          True. In our environments, we have seen processes overloading memory as an oft repeated problem. Hence, we were focussing on that.

          The userBasedInsulator.sh that I proposed in HADOOP-3675 could solve this issue (and a few others) in an easier way.

          So, the goal of this JIRA was to prevent user tasks from adversely affecting one another, or other system daemons on the node by gobbling up memory. We could not find an out-of-the-box OS solution to curtail a process and it's descendents' memory limits. Specifically, ulimit did not seem to handle processes spawned from a parent whose memory limit was set. Maybe virtual machines will help, but IMO, we are still some way off from deciding which tool is suitable for this. That is why this JIRA proposes implementing the tracking of memory on its own.

          If you are aware of a way this can be achieved with an OS specific mechanism, we can gladly look at that. And it would be significantly easy to use the mechanism you propose (via the wrapper script). Then we can focus on HADOOP-3675. Please do let us know of any solution that you have in mind.

          Show
          Hemanth Yamijala added a comment - Brice, I think that there is a more general problem, that is task insulation, because a bugged process could many other things than just overloading the memory. True. In our environments, we have seen processes overloading memory as an oft repeated problem. Hence, we were focussing on that. The userBasedInsulator.sh that I proposed in HADOOP-3675 could solve this issue (and a few others) in an easier way. So, the goal of this JIRA was to prevent user tasks from adversely affecting one another, or other system daemons on the node by gobbling up memory. We could not find an out-of-the-box OS solution to curtail a process and it's descendents' memory limits. Specifically, ulimit did not seem to handle processes spawned from a parent whose memory limit was set. Maybe virtual machines will help, but IMO, we are still some way off from deciding which tool is suitable for this. That is why this JIRA proposes implementing the tracking of memory on its own. If you are aware of a way this can be achieved with an OS specific mechanism, we can gladly look at that. And it would be significantly easy to use the mechanism you propose (via the wrapper script). Then we can focus on HADOOP-3675 . Please do let us know of any solution that you have in mind.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          [Update]

          • Just verified that cygwin supports proc file system exactly as on linux, so the same code should work there too.
          • Also tested process-groups and signal handling to process-groups on cygwin, works exactly as expected. Given that process-groups are supported there too, and thus is fairly portable(assuming it works right away on Solaris and Mac), we could think of moving task creation to be done via job-control.
          Show
          Vinod Kumar Vavilapalli added a comment - [Update] Just verified that cygwin supports proc file system exactly as on linux, so the same code should work there too. Also tested process-groups and signal handling to process-groups on cygwin, works exactly as expected. Given that process-groups are supported there too, and thus is fairly portable(assuming it works right away on Solaris and Mac), we could think of moving task creation to be done via job-control.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Sorry about not stating this before my last comment, but in this JIRA I am restricting myself to writing utilities that track process trees, watch their memory usage, destroy them when overflowing etc. We definitely need the functionality of these classes, and we can use them where ever we wish to, once HADOOP-3675 moves. Will create a new JIRA for this, if that is the correct step forward.

          Regarding where we want to perform the resource management - tracking disk space, memory usage etc., we can either implement it in the wrapper or in the TaskTracker itself.

          • Implementation in wrapper implies tracking per task and thus we will not have a global picture of resource usage at TaskTracker level. Further, it is set-once-and-run kind of mechanism - before launching tasks itself, we will have to declare the limits within which tasks can run. If we wish to make these limits dynamic, we will need an extra communication pipe between the wrapper and TaskTracker.
          • Implementing resource tracking in TaskTracker helps us avoid all this. Moreover, it gives the global idea of usage over all tasks and can thus be flexible.

          Brice, your idea of doing it in the wrapper script relates to the first point above and thus has some disadvantages. It would be good, if we could just restrict the wrapper to do the most basic need of isoated work-space/chroot/jail-like mechanism and do task's resource management in TT. Comments?

          Show
          Vinod Kumar Vavilapalli added a comment - Sorry about not stating this before my last comment, but in this JIRA I am restricting myself to writing utilities that track process trees, watch their memory usage, destroy them when overflowing etc. We definitely need the functionality of these classes, and we can use them where ever we wish to, once HADOOP-3675 moves. Will create a new JIRA for this, if that is the correct step forward. Regarding where we want to perform the resource management - tracking disk space, memory usage etc., we can either implement it in the wrapper or in the TaskTracker itself. Implementation in wrapper implies tracking per task and thus we will not have a global picture of resource usage at TaskTracker level. Further, it is set-once-and-run kind of mechanism - before launching tasks itself, we will have to declare the limits within which tasks can run. If we wish to make these limits dynamic, we will need an extra communication pipe between the wrapper and TaskTracker. Implementing resource tracking in TaskTracker helps us avoid all this. Moreover, it gives the global idea of usage over all tasks and can thus be flexible. Brice, your idea of doing it in the wrapper script relates to the first point above and thus has some disadvantages. It would be good, if we could just restrict the wrapper to do the most basic need of isoated work-space/chroot/jail-like mechanism and do task's resource management in TT. Comments?
          Hide
          Brice Arnould added a comment -

          Hi !
          I think that there is a more general problem, that is task insulation, because a bugged process could many other things than just overloading the memory.

          The userBasedInsulator.sh that I proposed in HADOOP-3675 could solve this issue (and a few others) in an easier way. We don't need HADOOP-3675 to be complete to introduce that approch however.
          The idea is to provide a "wrapper" charged to enforce local policies. This wrapper can be written as a shell script to work on must Unix (and maybe cygwin), and requires much less change to the core of Hadoop, that is :

          +      ArrayList<String> vargs = new ArrayList<String>(8);
          +      //  Check for the eventual wrapper script
          +      final String wrapper = conf.get("mapred.child.wrapper");
          +      if (wrapper != null)
          +        vargs.add(wrapper);
          -      Vector<String> vargs = new Vector<String>(8);
          
          Show
          Brice Arnould added a comment - Hi ! I think that there is a more general problem, that is task insulation, because a bugged process could many other things than just overloading the memory. The userBasedInsulator.sh that I proposed in HADOOP-3675 could solve this issue (and a few others) in an easier way. We don't need HADOOP-3675 to be complete to introduce that approch however. The idea is to provide a "wrapper" charged to enforce local policies. This wrapper can be written as a shell script to work on must Unix (and maybe cygwin), and requires much less change to the core of Hadoop, that is : + ArrayList< String > vargs = new ArrayList< String >(8); + // Check for the eventual wrapper script + final String wrapper = conf.get( "mapred.child.wrapper" ); + if (wrapper != null ) + vargs.add(wrapper); - Vector< String > vargs = new Vector< String >(8);
          Hide
          Vinod Kumar Vavilapalli added a comment -

          After discussing about this with Hemanth, it came out that we need to reorganize the current code. I propose the following;

          TaskTracker maintains a ProcessTree object for each task.

          public abstract class ProcessTree {
          
           /* Initialize the process tree */
           public void initialize();
          
           /* Destroy the process tree */
           public void destroy();
          
           /* Return total virtual memory usage by this process tree */
           public long getVmem();
          
          }
          

          My previous code(ProcessTracker: initialize, kill and getCurrentVmemUsage) would be moved to a class LinuxProcessTree that extends ProcessTree. Cygwin seems to support proc filesystem - so the same can be used for windows , need to confirm this for sure, though. For solaris/other OS we need classes extending ProcessTree.

          Getting pid of the task process : In the first patch, the implementation of getting pid of a process is hacky - put forward here (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4244896), it tries to get the value of a private variable by suppressing the standard Java language access checks. This won't work if it is prevented by security policy.
          Replacing this implementation with a better one Instead, we should let TaskTracker ask Task via TaskUmbilicalProtocol to give its pid through a getPid call:

          public Integer getPid();
          

          The task itself should return its pid(pid on *NIX and cygwin PIDs on Windows) to TT perhaps by calling native code.

          Side notes:

          • Should we reconstruct the process tree on demand when a getVmem() is called, or should we start a thread and update it periodically?
          • If cygwin also supports process groups and sessions (I can see setpgid, setsid etc. in cygwin's POSIX compatible API here http://cygwin.com/cygwin-api/compatibility.html#std-susv3 ), we might want to change the implementation of destroy. This is to be a separate issue where we also modify how we start tasks.
          Show
          Vinod Kumar Vavilapalli added a comment - After discussing about this with Hemanth, it came out that we need to reorganize the current code. I propose the following; TaskTracker maintains a ProcessTree object for each task. public abstract class ProcessTree { /* Initialize the process tree */ public void initialize(); /* Destroy the process tree */ public void destroy(); /* Return total virtual memory usage by this process tree */ public long getVmem(); } My previous code(ProcessTracker: initialize, kill and getCurrentVmemUsage) would be moved to a class LinuxProcessTree that extends ProcessTree. Cygwin seems to support proc filesystem - so the same can be used for windows , need to confirm this for sure, though. For solaris/other OS we need classes extending ProcessTree. Getting pid of the task process : In the first patch, the implementation of getting pid of a process is hacky - put forward here ( http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4244896 ), it tries to get the value of a private variable by suppressing the standard Java language access checks. This won't work if it is prevented by security policy. Replacing this implementation with a better one Instead, we should let TaskTracker ask Task via TaskUmbilicalProtocol to give its pid through a getPid call: public Integer getPid(); The task itself should return its pid(pid on *NIX and cygwin PIDs on Windows) to TT perhaps by calling native code. Side notes: Should we reconstruct the process tree on demand when a getVmem() is called, or should we start a thread and update it periodically? If cygwin also supports process groups and sessions (I can see setpgid, setsid etc. in cygwin's POSIX compatible API here http://cygwin.com/cygwin-api/compatibility.html#std-susv3 ), we might want to change the implementation of destroy. This is to be a separate issue where we also modify how we start tasks.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Attaching a first patch, so that it helps in moving the discussion forward. The patch is still raw and needs a good deal of work up. Much of it is just a proof of concept; enough abstraction is set in place so that actual implementation can be changed easily.

          At present,

          • the process tracker works only on linux, uses proc file system and the process directories inside.
          • uses mapred.child.ulimit to limit the total vmem usage of all the tasks' process trees.
          • once it detects that the total vmem usage of all tasks has crossed over the specified limit, it calls findOOMTaskstoKill to find tasks to be killed.
          • findOOMTaskstoKill returns the list of tasks to be killed. Currently it returns only one task, the one with the highest memory usage.
          • after getting the list of tasks to be killed, it kills each of the corresponding process trees by issuing individual 'kill <pid>' commands (SIGTERM).

          Need thought/TODO:

          • Introduce separate configuration properties for usage of map tasks and reduce tasks? Knock out previous usage of mapred.child.ulimit and its corresponding usage to set ulimits?
          • May want to monitor if the kill went through or not, and then issue a subsequent SIGKILL as needed. Kill mechanism might totally change if we wish to start the tasks using job control.
          • May want to refactor the code a bit and merge killOOMTasks with killOverflowingTasks. Later move all of this together to a single place when HADOOP-3675 goes in.
          • Lot of code paths are not synchronized yet, so might result in threading errors/race conditions.

          We still need decision as to whether we want to 1) limit aggregate usage over all tasks' process trees or 2) limit usage per task's process tree. Believe that both of these can be implemented with the framework setup in current patch.

          Show
          Vinod Kumar Vavilapalli added a comment - Attaching a first patch, so that it helps in moving the discussion forward. The patch is still raw and needs a good deal of work up. Much of it is just a proof of concept; enough abstraction is set in place so that actual implementation can be changed easily. At present, the process tracker works only on linux, uses proc file system and the process directories inside. uses mapred.child.ulimit to limit the total vmem usage of all the tasks' process trees. once it detects that the total vmem usage of all tasks has crossed over the specified limit, it calls findOOMTaskstoKill to find tasks to be killed. findOOMTaskstoKill returns the list of tasks to be killed. Currently it returns only one task, the one with the highest memory usage. after getting the list of tasks to be killed, it kills each of the corresponding process trees by issuing individual 'kill <pid>' commands (SIGTERM). Need thought/TODO: Introduce separate configuration properties for usage of map tasks and reduce tasks? Knock out previous usage of mapred.child.ulimit and its corresponding usage to set ulimits? May want to monitor if the kill went through or not, and then issue a subsequent SIGKILL as needed. Kill mechanism might totally change if we wish to start the tasks using job control. May want to refactor the code a bit and merge killOOMTasks with killOverflowingTasks. Later move all of this together to a single place when HADOOP-3675 goes in. Lot of code paths are not synchronized yet, so might result in threading errors/race conditions. We still need decision as to whether we want to 1) limit aggregate usage over all tasks' process trees or 2) limit usage per task's process tree. Believe that both of these can be implemented with the framework setup in current patch.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          What prevents the task tracker from overflowing memory itself? Considering the memory leaks we've already seen in the name node, I don't trust the task trackers to be leak free either.

          One of the advantages that we have with HOD presently is that because the limit is set prior to the task tracker getting launched is that the task tracker itself is bounded. This makes the entire hadoop task chain limited and not just individual portions.

          Whatever system is designed needs to mimic this functionality.

          Don't we limit task tracker's usage by setting java system property -Xmx? I believe that if there are TT memory leaks, they should be addressed directly in code. The above proposed method, irrespective of the implementation details, prevents tasks from disrupting hadoop daemons, and this is our main objective here.

          Show
          Vinod Kumar Vavilapalli added a comment - What prevents the task tracker from overflowing memory itself? Considering the memory leaks we've already seen in the name node, I don't trust the task trackers to be leak free either. One of the advantages that we have with HOD presently is that because the limit is set prior to the task tracker getting launched is that the task tracker itself is bounded. This makes the entire hadoop task chain limited and not just individual portions. Whatever system is designed needs to mimic this functionality. Don't we limit task tracker's usage by setting java system property -Xmx? I believe that if there are TT memory leaks, they should be addressed directly in code. The above proposed method, irrespective of the implementation details, prevents tasks from disrupting hadoop daemons, and this is our main objective here.
          Hide
          Allen Wittenauer added a comment -

          What prevents the task tracker from overflowing memory itself? Considering the memory leaks we've already seen in the name node, I don't trust the task trackers to be leak free either.

          One of the advantages that we have with HOD presently is that because the limit is set prior to the task tracker getting launched is that the task tracker itself is bounded. This makes the entire hadoop task chain limited and not just individual portions.

          Whatever system is designed needs to mimic this functionality.

          Show
          Allen Wittenauer added a comment - What prevents the task tracker from overflowing memory itself? Considering the memory leaks we've already seen in the name node, I don't trust the task trackers to be leak free either. One of the advantages that we have with HOD presently is that because the limit is set prior to the task tracker getting launched is that the task tracker itself is bounded. This makes the entire hadoop task chain limited and not just individual portions. Whatever system is designed needs to mimic this functionality.
          Hide
          Vinod Kumar Vavilapalli added a comment -
          • Ulimits
            Approach taken by HADOOP-3280. Set ulimit -v for the tasks launched. Doing this would limit the virtual memory usable by the launched task process. All the sub-processes would also inherit the same limit and so are capped by this limit. But this limit can easily be circumvented by a rogue/badly written task by repetitively forking sub-processes. Another mechanism that is discussed on HADOOP-3280 - system wide per user limits via /etc/security/limits.conf, is just another form of ulimit -v : it is ulimit -v simply configured system-wide and kicks in when a user logs in (login shell) and thus applies to all the subsequent sub-processes. This mechanism also fails for the same reasons of repetitive forking.

          As discussed on HADOOP-3280, long term we will likely run TaskTrackers as root and setuid to the submitting user and at that point ulimits definitely will not help in addressing this issue. Each task might be well within limits, but tasks of different users might use up significant memory and affect running of hadoop daemons.

          • TaskTrackers tracking memory usage of tasks' process tree.
            In this approach, TaskTracker tracks the usage of all the tasks and their sub-processes (irrepective of which user runs which tasks). We have two options here.
            Option1:
            Put a limit on the total memory used by all the children of TaskTrackers (aggregate limit over all tasks). This way we can be sure that the running of hadoop daemon is not affected. But it has the obvious disadvantage of tasks intruding into each other, resulting in issues like one task using up all of memory within resource limits and making other tasks fail.
            Option 2:
            Put individual cap on the memory usable by each task. Perhaps separate limits for map tasks and reduce tasks.
            TaskTracker tracks the resource usage of each task and its sub-process tree. Once a task crosses over the configured limits, TaskTracker kills the task's process tree and reports accordingly. We need implementation for different platforms to make it portable. Currently, we can restrict ourselves to using procfs on posix systems. But enough abstraction should be in place for implementation on other platforms. e.g. using windows api.
            In the presence of free slots (if at all), this choice has the disadvantage of running tasks under utililzing memory (not a significant issue?)

          Comments?

          Show
          Vinod Kumar Vavilapalli added a comment - Ulimits Approach taken by HADOOP-3280 . Set ulimit -v for the tasks launched. Doing this would limit the virtual memory usable by the launched task process. All the sub-processes would also inherit the same limit and so are capped by this limit. But this limit can easily be circumvented by a rogue/badly written task by repetitively forking sub-processes. Another mechanism that is discussed on HADOOP-3280 - system wide per user limits via /etc/security/limits.conf, is just another form of ulimit -v : it is ulimit -v simply configured system-wide and kicks in when a user logs in (login shell) and thus applies to all the subsequent sub-processes. This mechanism also fails for the same reasons of repetitive forking. As discussed on HADOOP-3280 , long term we will likely run TaskTrackers as root and setuid to the submitting user and at that point ulimits definitely will not help in addressing this issue. Each task might be well within limits, but tasks of different users might use up significant memory and affect running of hadoop daemons. TaskTrackers tracking memory usage of tasks' process tree. In this approach, TaskTracker tracks the usage of all the tasks and their sub-processes (irrepective of which user runs which tasks). We have two options here. Option1: Put a limit on the total memory used by all the children of TaskTrackers (aggregate limit over all tasks). This way we can be sure that the running of hadoop daemon is not affected. But it has the obvious disadvantage of tasks intruding into each other, resulting in issues like one task using up all of memory within resource limits and making other tasks fail. Option 2: Put individual cap on the memory usable by each task. Perhaps separate limits for map tasks and reduce tasks. TaskTracker tracks the resource usage of each task and its sub-process tree. Once a task crosses over the configured limits, TaskTracker kills the task's process tree and reports accordingly. We need implementation for different platforms to make it portable. Currently, we can restrict ourselves to using procfs on posix systems. But enough abstraction should be in place for implementation on other platforms. e.g. using windows api. In the presence of free slots (if at all), this choice has the disadvantage of running tasks under utililzing memory (not a significant issue?) Comments?
          Hide
          Hemanth Yamijala added a comment -

          mapred.child.ulimit?

          Arun, from the discussion on HADOOP-3280, it appears there are some limitations with the mapred.child.ulimit approach. For e.g. as it is a per process limit, a process could break this by forking sub processes. We are not very sure of this, but there are strong indications it is that way.

          Show
          Hemanth Yamijala added a comment - mapred.child.ulimit? Arun, from the discussion on HADOOP-3280 , it appears there are some limitations with the mapred.child.ulimit approach. For e.g. as it is a per process limit, a process could break this by forking sub processes. We are not very sure of this, but there are strong indications it is that way.
          Hide
          Arun C Murthy added a comment -

          This is seen for Streaming applications.

          mapred.child.ulimit?

          Show
          Arun C Murthy added a comment - This is seen for Streaming applications. mapred.child.ulimit?
          Hide
          Hemanth Yamijala added a comment -

          This is seen for Streaming applications. For Java based applications, we already have a way of limiting the memory using Hadoop configuration.

          Show
          Hemanth Yamijala added a comment - This is seen for Streaming applications. For Java based applications, we already have a way of limiting the memory using Hadoop configuration.

            People

            • Assignee:
              Vinod Kumar Vavilapalli
              Reporter:
              Hemanth Yamijala
            • Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development