Hadoop Common
HADOOP-3759

Provide ability to run memory intensive jobs without affecting other running tasks on the nodes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      In HADOOP-3581, we are discussing how to prevent memory-intensive tasks from affecting Hadoop daemons and other tasks running on a node. A related requirement is that users be provided the ability to run jobs which are memory intensive. The system must provide enough knobs to allow such jobs to be run while still maintaining the requirements of HADOOP-3581.

      1. HADOOP-3759.patch
        9 kB
        Hemanth Yamijala
      2. HADOOP-3759.patch
        12 kB
        Hemanth Yamijala
      3. HADOOP-3759.patch
        27 kB
        Hemanth Yamijala
      4. HADOOP-3759.patch
        26 kB
        Hemanth Yamijala
      5. HADOOP-3759.patch
        26 kB
        Hemanth Yamijala
      6. HADOOP-3759.patch
        31 kB
        Hemanth Yamijala
      7. HADOOP-3759.patch
        33 kB
        Hemanth Yamijala
      8. HADOOP-3759.patch
        32 kB
        Hemanth Yamijala

          Activity

          Hemanth Yamijala added a comment -

          Initial proposal:

          • HADOOP-3581 proposes a maximum amount of virtual memory, say MAX_MEM, that all tasks (and their descendants) on that tasktracker would use.
          • By default, we can translate this to a per task memory limit which is = MAX_MEM / number of slots
          • To allow high memory jobs to run, we define a new configuration variable that users can set to specify the maximum memory they expect their tasks to take.
          • In each heartbeat, the tasktracker computes the amount of free memory (using HADOOP-3581's fix) and reports that to the jobtracker. This is similar to the approach followed in HADOOP-657 for disk space.
          • The jobtracker schedules a new task on this tasktracker only if the job's per-task memory limit is less than the free memory.
          • If a job's task takes more memory than what is specified (either default or explicitly configured), per HADOOP-3581, it will be killed.

          Comments ?
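
          The arithmetic in the proposal above can be sketched as follows. This is a minimal illustration, not code from any patch: the class and method names are hypothetical, and the values (in MB) are made up for the example.

          ```java
          // Sketch of the proposed per-task default and scheduling check.
          // All names and values are illustrative, not from the patch.
          public class MemorySchedulingSketch {
              // Default per-task limit: MAX_MEM divided evenly across slots.
              static long defaultPerTaskLimit(long maxMemMB, int numSlots) {
                  return maxMemMB / numSlots;
              }

              // The jobtracker would schedule a task on a tracker only if the
              // job's per-task limit fits within the reported free memory.
              static boolean canSchedule(long freeMemMB, long jobPerTaskLimitMB) {
                  return jobPerTaskLimitMB <= freeMemMB;
              }

              public static void main(String[] args) {
                  long maxMem = 12 * 1024; // 12 GB tracker-wide limit, in MB
                  long perTask = defaultPerTaskLimit(maxMem, 4);
                  System.out.println(perTask);               // 3072 MB per slot
                  System.out.println(canSchedule(4096, perTask)); // true
              }
          }
          ```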

          Hemanth Yamijala added a comment -

          One obvious impact is on the scheduling of tasks. If a memory-intensive job happens to be the current job being considered, what happens when a TT comes in with less than the required amount of memory? Scheduling tasks of other jobs on this TT could starve the memory-intensive job; not scheduling other jobs could under-utilize the cluster. However, with the fairness scheduling primitives (such as user limits) being discussed in HADOOP-3445 and HADOOP-3746, it may be possible to handle these situations. Some more details will need to be worked out for this aspect.

          Hemanth Yamijala added a comment -

          HADOOP-3581 handles tracking the memory used by a task tree and killing it if the memory used exceeds a configured value.

          Hemanth Yamijala added a comment -

          Some more details on implementation:

          • HADOOP-3581 proposes the configuration parameters introduced to specify the maximum amount of RAM allowed for all tasks on a TaskTracker and memory per task of a job. The per job limit is defined in the JobConf of the job, and the maximum amount of RAM is defined in the JobConf of the tasktracker. (Refer comments here and here)
          • In TaskTracker.transmitHeartbeat(), we compute the amount of free virtual memory as max allowed RAM - sum (max memory per task, for all tasks running on the node)
          • In TaskTrackerStatus.java, we define a map of key-value pairs to transmit such resource information to the JobTracker. This allows additional resources to be added over time without changing the wire protocol. However, for simplicity, we can provide accessor methods in TaskTrackerStatus to get/set these key-value pairs. Something like:
            Map<String, Long> resourceInfo;

            public long getFreeVirtualMemory() {
              return resourceInfo.get("memory").longValue();
            }

            public void setFreeVirtualMemory(long freeVirtualMemory) {
              resourceInfo.put("memory", Long.valueOf(freeVirtualMemory));
            }
            
          • In JobInProgress, we define APIs to get the configured virtual memory requirements for a job. This will read from the jobconf of the Job.
          • Using these, any scheduler (such as HADOOP-3445) can match memory requirements of a Job with the reported resource info in the task tracker and take scheduling decisions.

          Comments ?
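
          The free-memory computation described for TaskTracker.transmitHeartbeat() can be sketched as below. This is an illustrative standalone example with hypothetical names; the real computation lives inside the TaskTracker.

          ```java
          import java.util.Arrays;
          import java.util.List;

          // Sketch of the heartbeat free-memory computation:
          // free = MAX_MEM - sum(per-task maximums of all running tasks).
          // Names and values are illustrative, not from the patch.
          public class HeartbeatMemorySketch {
              static long freeVirtualMemory(long maxMemMB, List<Long> runningTaskLimitsMB) {
                  long reserved = 0;
                  for (long limit : runningTaskLimitsMB) {
                      reserved += limit;
                  }
                  return maxMemMB - reserved;
              }

              public static void main(String[] args) {
                  // 12 GB total, two running tasks reserving 3 GB each (in MB)
                  long free = freeVirtualMemory(12 * 1024, Arrays.asList(3072L, 3072L));
                  System.out.println(free); // 6144
              }
          }
          ```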

          Hemanth Yamijala added a comment -

          Initial patch for review and to encourage discussion

          The patch incorporates the approach mentioned in the earlier comments. Specifically:

          • Includes the configuration variables described and defines accessors in JobConf
          • Includes a Map for storing resource information in TaskTrackerStatus. Though there is only one entry currently (free memory), the map is an attempt to keep the wire protocol the same in the future when more resources are added. I've currently used Long as the value type for these resources, but possibly this should be changed to a Writable so we can pass anything
          • Defines a method to compute minimum free space that would be available for a new task in TaskTracker (using the configuration variables for currently running tasks), and setting that value in TaskTrackerStatus.
          • Accessors in JobInProgress to define memory requirements for the job.

          With these changes in place, schedulers such as HADOOP-3445 can decide to schedule tasks according to the job's memory requirements, and the free memory available on a tasktracker.

          Please provide feedback on this implementation.

          Vivek Ratan added a comment -

          Given the proposals in this Jira, and in HADOOP-3581, I wanted to summarize in one place how this entire feature works. Most, if not all, of this summary is spread out across the two Jiras. I thought it would help to consolidate it in one place.

          The goal is to allow memory-intensive jobs to run without affecting other jobs, while also detecting and killing tasks that violate their memory contract with Hadoop. Here is how we propose to do this:

          1. Each machine can be configured to set a maximum VM limit per task (and its descendants). This limit, let's call it MAX_MEM, is specified by the config variable mapred.tasktracker.tasks.maxmemory and specifies the total VM available on a machine to all TT tasks. By default, each task's maximum limit, call it MAX_MEM_PER_TASK, is MAX_MEM divided by the number of slots that the TT is configured for. For example, if mapred.tasktracker.tasks.maxmemory is set to 12GB, and the TT is configured for 2 map and 2 reduce slots, MAX_MEM_PER_TASK is 3GB, i.e., no single task (and its descendants) should go over 3GB.
            • for simplicity, we assume that Maps and Reduce tasks are treated equivalently. If we need to distinguish them, then we will have separate sets of variables for Maps and Reduce tasks.
            • MAX_MEM may have different values on different machines.
            • MAX_MEM is optional (see here), so it's possible to set up a cluster with no memory limits per task.
          2. The TT will detect if a task is using memory above MAX_MEM_PER_TASK and kill it. This approach is described in HADOOP-3581.
          3. We'd like users to be able to run memory-intensive jobs, and thus to control MAX_MEM_PER_TASK for tasks in their job. Users can optionally specify a per-task memory limit for their job (this limit applies to each task of the job). As described here, we may have separate limits for map and reduce tasks, or just one limit.
          4. Given a task to run, the TT knows the MAX_MEM_PER_TASK for that task (which is either a user-specified limit for that job, or a fraction of MAX_MEM, or no limit at all).
          5. There is a scheduling component to all this, as described here. A scheduler may choose to support memory-intensive jobs in different ways.
            • If a scheduler ignores a user-specified limit, it may end up assigning a task to a TT that has less VM than what the task asked for. This is no worse than what we have today, but we may still see problems with memory intensive tasks bringing down a system.
            • The scheduler in HADOOP-3445 will support memory limits and will assign tasks to TTs only if there's enough VM available. However, tasks with higher memory limits may take a little longer to be scheduled (this can be discussed in more detail in HADOOP-3445).
          Hemanth Yamijala added a comment -

          After some internal discussions, we are proposing the following behavior with respect to default values of the configuration options MAX_MEM and MAX_MEM_PER_TASK. In the description below, MAX_MEM is specified for a tasktracker, and MAX_MEM_PER_TASK is specified in a job's configuration.

          • We should provide an option to turn off the behavior described in the above comment. We can do this by defaulting the configuration option values to something invalid, like -1.
          • This gives rise to the following 4 conditions, and the proposed behavior of the system for each:
            • Both MAX_MEM and MAX_MEM_PER_TASK are set to invalid values: This will result in the current Map/Reduce system behavior. Essentially, it will turn off the fixes in this JIRA and in HADOOP-3581, thereby providing backwards compatibility.
            • Both MAX_MEM and MAX_MEM_PER_TASK are specified as valid values: This will result in the behavior described on this JIRA and also in HADOOP-3581. In short, it will result in the tasks of the job being scheduled on TaskTrackers with sufficient free memory, and tasks which exceed the memory limits will be killed.
            • MAX_MEM is set to an invalid value but MAX_MEM_PER_TASK is specified for a job: MAX_MEM_PER_TASK is ignored. Since MAX_MEM is an administrative option, if an administrator has disabled it, no memory-specific scheduling can be turned on by users.
            • MAX_MEM is specified, but MAX_MEM_PER_TASK is set to an invalid value: A task of this job will run on this TT only if the free memory is at least equal to MAX_MEM / number of slots on the TT. For example, consider a TT with 8 GB as MAX_MEM and 2 slots. Consider two jobs: one with a 6 GB MAX_MEM_PER_TASK and the other with no valid value specified. If a task of the first job is running on this TT, only 2 GB can be guaranteed for any other job. Since MAX_MEM / number of slots is 4 GB, no task of the second job would run on this TT until the high-memory task completes. In other words, we are defining that if a job has not specified any MAX_MEM_PER_TASK, it is guaranteed at least MAX_MEM / number of slots of memory by the system. This seems fairly deterministic compared to scheduling tasks of the second job and binding their MAX_MEM_PER_TASK to some arbitrary amount of free memory.

          As the above choice means that RAM-intensive jobs can prevent other jobs from running, there must be some way to counterbalance this effect. One way could be for schedulers like HADOOP-3445 to artificially consider the RAM-intensive job as using more slots than it actually does, thereby subjecting it to limits faster.

          Comments ?
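
          The four cases above reduce to a small decision function. The sketch below is illustrative only, with hypothetical names; -1 marks an invalid (unset) value, as proposed.

          ```java
          // Sketch of the four default-value cases for MAX_MEM and
          // MAX_MEM_PER_TASK. Names are illustrative, not from the patch.
          public class MemoryConfigSketch {
              static final long DISABLED = -1L;

              // Effective per-task limit a task would run under, or DISABLED
              // if memory-based scheduling is off for this tracker.
              static long effectivePerTaskLimit(long maxMemMB, long jobPerTaskMB, int numSlots) {
                  if (maxMemMB == DISABLED) {
                      // Administrative option off: the job's setting is ignored.
                      return DISABLED;
                  }
                  if (jobPerTaskMB == DISABLED) {
                      // Job did not specify: guaranteed MAX_MEM / number of slots.
                      return maxMemMB / numSlots;
                  }
                  return jobPerTaskMB;
              }

              public static void main(String[] args) {
                  System.out.println(effectivePerTaskLimit(-1, 6144, 2));   // -1: feature off
                  System.out.println(effectivePerTaskLimit(8192, -1, 2));   // 4096: default share
                  System.out.println(effectivePerTaskLimit(8192, 6144, 2)); // 6144: job's own limit
              }
          }
          ```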

          Vivek Ratan added a comment -

          I'm in favor of the approach Hemanth has described above because it lets us reason about memory constraints in terms of slots/containers. In his example, each slot on the TT has a memory limit of 4GB. This also means that a task is guaranteed up to 4GB, if it runs in that slot. So if a task needs 6GB, it ends up using 2 slots. This helps schedulers, because they're really dealing with slots. For example, if they're keeping track of how many slots are used by a user or job, then assigning two slots to the task that wants 6GB works really well.

          Another good thing about this approach is that tasks are not affected by other tasks in terms of memory guarantees. Suppose task1 wanted 7.5 GB. That would leave 0.5 GB for task2, which is not right, as task2 shouldn't suffer for task1's needs. It's better that task2 not run in this case.
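
          The slot-accounting view described above amounts to: a task consumes ceil(taskMem / slotMem) slots. A minimal sketch, with hypothetical names and values in MB:

          ```java
          // Sketch of slot accounting: a task reserving more memory than one
          // slot provides effectively occupies multiple slots.
          // Names and values are illustrative, not from any patch.
          public class SlotAccountingSketch {
              // ceil(taskMemMB / slotMemMB) using integer arithmetic
              static int slotsUsed(long taskMemMB, long slotMemMB) {
                  return (int) ((taskMemMB + slotMemMB - 1) / slotMemMB);
              }

              public static void main(String[] args) {
                  long slotMem = 4096; // 4 GB per slot, as in the example TT
                  System.out.println(slotsUsed(6144, slotMem)); // a 6 GB task uses 2 slots
                  System.out.println(slotsUsed(3072, slotMem)); // a 3 GB task uses 1 slot
              }
          }
          ```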

          Hemanth Yamijala added a comment -

          The attached file implements the proposal mentioned above.

          Following is a summary of the changes:

          • Defined two configuration variables: mapred.tasktracker.tasks.maxmemory and mapred.task.maxmemory. The first is described as MAX_MEM and the other as MAX_MEM_PER_TASK in the comments above. The default values are set to -1, turning them off.
          • JobConf provides accessors for the above
          • In TaskTrackerStatus defined a way to pass free memory and default max memory per task to the JobTracker from the TaskTracker. These are passed using a Map, so that other resource details we might want to pass in the future can be done without changing the protocol.
          • In TaskTracker, implemented changes to compute the free memory. Also, when a job does not define mapred.task.maxmemory, the tasktracker sets this to MAX_MEM / number of slots while localizing the task.

          The patch contains some additional log statements that I will remove after the review is completed. It is also missing unit tests. Please review the code with these points in mind.

          Regarding tests, I've tested the changes manually. I am looking for some ideas on how to automate these. What would be ideal is to test the following: Configure the memory related variables, schedule tasks in a predetermined order, verify that each time, the free memory is computed correctly. The last part seems to require hooks into the heartbeat processing code on JT or TT. Alternatively, we can make the free memory computation package private. The latter seems to be very hacky. Any other ideas ?

          Hemanth Yamijala added a comment -

          Attaching a more complete patch with test cases and after checking test-patch results.

          Hemanth Yamijala added a comment -

          The latest patch adds a JUnit test class that works as follows:

          • The test class defines a custom TaskScheduler that extends JobQueueTaskScheduler. This is only a dummy scheduler that overrides the assignTasks method, in which it verifies the values of the memory-related variables reported by the tasktrackers. This is the core of the test.
          • It uses the SleepJob example as it just needs a dummy job to run.
          • Each test case sets up a MiniMRCluster to use the custom TaskScheduler, adds a SleepJob to it, and configures various values for the memory related configuration variables defined above.
          • Then it lets the scheduler's assignTasks API verify that the free memory etc. are computed and reported correctly to the JobTracker.

          In order for this test to work, I had to make a few changes to some core classes. Please comment on whether these are reasonable. The changes are:

          • In JobTracker, defined a package-private API
            TaskScheduler getTaskScheduler()
          • In MiniMRCluster, defined an API to get the JobTracker instance that is created
          • Again in MiniMRCluster, starting the TaskTracker with a configured JobConf variable, rather than null. I needed this because the TaskTracker should get the memory related configuration variables.
          Vinod Kumar Vavilapalli added a comment -

          Review comments:

          • TaskTracker's private method findFreeVirtualMemory should be synchronized on runningTasks.
          • Would be good if the resources in TaskTrackerStatus are of the form <String, Writable> instead of <String, Long>.

          I am assuming that the accessor method in JobInProgress is only for future use inside a scheduler. +1 for the addition of the API in MiniMRCluster.

          Code looks fine otherwise. Test case too.

          Vinod Kumar Vavilapalli added a comment -

          Just seen that HADOOP-657 is committed. Do we want to put available disk space also as part of the same resource map?

          Ari Rabkin added a comment -

          I had intended to consolidate all the resource estimation and tracking stuff in the ResourceEstimator class. Does that seem sensible?

          It's possible that some of that stuff belongs, at least in part, in the scheduler. On the other hand, tracking current and estimated consumption is in some way orthogonal to all the other scheduling decisions, so I think it still makes sense to keep resource management in its own class[es].

          Hemanth Yamijala added a comment -

          Ari, I had looked at the ResourceEstimator class that you defined in HADOOP-657. In this patch we are not estimating the used memory - rather using a configured value. If we go to a model where we estimate memory per task, then we should definitely use the ResourceEstimator. But for now, we are going with a simpler approach of letting the user specify this to us in configuration.

          On the other hand, to communicate the used disk space, you are following roughly the same mechanism as what is defined here IIRC - which is to compute the free disk space in the heartbeat and communicate it via the TaskTrackerStatus. Vinod's comment was to unify your change as part of the resource map defined here so that it would be done in a like manner.

          Makes sense?

          Ari Rabkin added a comment -

          That'll teach me to comment before coffee.

          Hemanth –

          Yes, I think it makes sense to pass the disk space via the new resourceMap. I won't have time to revise the disk space tracking for a week or two. Shall I open a JIRA, so we don't lose track?

          My only very slight qualm is that I'd rather avoid passing the names, as well as values, of the various resources we're tracking. As I understand, heartbeats are a bottleneck, so it's worth being frugal there. But probably it doesn't make sense to worry about that at this stage. We can do serialization magic later if it turns out to be worthwhile.

          Hemanth Yamijala added a comment -

          Ari - no issues, I will modify it as part of this patch. Because your patch is already committed, I can change it now. Also, it's been raised as a comment on this JIRA.

          Regarding passing names, we did think of the load on heartbeats. However, I think elsewhere the number of heartbeats is being cut down. Hence, we decided to go with an approach that was more clear.

          Hemanth Yamijala added a comment -

          In an offline discussion with Devaraj, he suggested that I not modify the job conf when localizing the task. Also, the patch no longer applies to trunk. I will modify this behavior and upload a new patch.

          Hemanth Yamijala added a comment -

          New patch addressing some of the review comments.

          Hemanth Yamijala added a comment -

          Sigh. Please ignore the last patch. The way I tried to make the resource map handle generic writables seems incorrect. Will upload a new one.

          Hemanth Yamijala added a comment -

          Incorporated review comments and synchronized with trunk.

          TaskTracker's private method findFreeVirtualMemory should be synchronized on runningTasks.

          Done. Synchronized it on the tasktracker object.

          Would be good if the resources in TaskTrackerStatus are of the form <String, Writable> instead of <String, Long>.

          I was not able to see how to make it work with Writables. The problem is with reading back the objects: it seemed we would need to know what Writable the object actually is in order to read it back fully, which would have prevented us from making it generic. There is an ObjectWritable, but that seems like overkill.

          What I've done instead is to create a small wrapper object that implements Writable called ResourceStatus. This is used to encapsulate all fields that need to be reported to the JT in the TaskTrackerStatus. Please comment if this makes sense.

          Just seen that HADOOP-657 is committed. Do we want to put available disk space also as part of the same resource map?

          Modified the availableSpace as a field in ResourceStatus. Ari, can you please check if this change looks OK? I didn't see a test case in HADOOP-657 which I could run.

          In an offline discussion with Devaraj, he suggested that I not modify the job conf when localizing the task.

          Done this. Rather than localizing, I calculate the default free memory when required - for e.g in findFreeVirtualMemory.
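The wrapper approach described above can be sketched as follows. This is an illustrative stand-in, not the committed patch code: the field names (totalVirtualMemory, availableSpace) follow the discussion here and in HADOOP-657, and plain java.io streams stand in for Hadoop's Writable plumbing so the example is self-contained.

```java
import java.io.*;

// Sketch of a ResourceStatus-style wrapper (illustrative, not the patch):
// one object encapsulating all per-tracker resource fields, serialized as
// a unit inside TaskTrackerStatus during the heartbeat.
public class Main {
    static class ResourceStatus {
        long totalVirtualMemory;  // total vmem the TT may hand to tasks
        long availableSpace;      // free disk space, folded in per HADOOP-657

        void write(DataOutput out) throws IOException {
            out.writeLong(totalVirtualMemory);
            out.writeLong(availableSpace);
        }

        void readFields(DataInput in) throws IOException {
            totalVirtualMemory = in.readLong();
            availableSpace = in.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        ResourceStatus sent = new ResourceStatus();
        sent.totalVirtualMemory = 8L * 1024 * 1024 * 1024; // 8 GB
        sent.availableSpace = 500L * 1024 * 1024;          // 500 MB

        // Simulate the heartbeat: serialize on the TT, deserialize on the JT.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        sent.write(new DataOutputStream(buf));

        ResourceStatus received = new ResourceStatus();
        received.readFields(new DataInputStream(
            new ByteArrayInputStream(buf.toByteArray())));

        System.out.println("totalVirtualMemory=" + received.totalVirtualMemory);
        System.out.println("availableSpace=" + received.availableSpace);
    }
}
```

Because both fields round-trip through a single Writable-style object, adding a new resource later only touches this class, not the TaskTrackerStatus serialization itself.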

          Devaraj Das added a comment -

          I am okay with the patch. The one thing i do want to point out is that the ResourceStatus could be made extensible and any component in the TT that wants to advertise a resource Key/Value info can do so (as opposed to hardcoding the memory/disk-space resources only). But this could be for later.
          The other thing is the way we handle -Xmx in this setup. Assume a case where the user hasn't specified any memory requirement for his job. The memory that a task would get is proportional to the amount of memory in the TT/#slots. Let's say for this cluster instance, it is 1G. Now if his -Xmx, which is an absolute number, is above this, say 1.5G, would it work? Note that the task JVM might work even with 1G. It is just the user happened to specify it as 1.5G.

          Arun C Murthy added a comment -

          Minor nit: ResourceStatus should definitely use WritableUtils.{read|write}VLong rather than DataOutput.{read|write}Long.
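The point of the nit is size on the wire: a variable-length encoding spends bytes proportional to the magnitude of the value, while DataOutput.writeLong always spends 8 bytes per field. The following is a simplified base-128 varint for non-negative values to illustrate the saving; it is NOT Hadoop's exact VLong format.

```java
import java.io.*;

// Simplified varint (high bit of each byte marks "more bytes follow").
// Illustrates why WritableUtils.{read|write}VLong is preferred over a
// fixed 8-byte long for heartbeat payloads; not Hadoop's actual encoding.
public class Main {
    static int writeVarLong(DataOutput out, long v) throws IOException {
        int n = 0;
        do {
            byte b = (byte) (v & 0x7f);
            v >>>= 7;
            if (v != 0) b |= (byte) 0x80; // continuation bit
            out.writeByte(b);
            n++;
        } while (v != 0);
        return n; // bytes written
    }

    public static void main(String[] args) throws IOException {
        DataOutputStream out =
            new DataOutputStream(new ByteArrayOutputStream());
        // A small value (e.g. a slot count) costs a single byte...
        int small = writeVarLong(out, 4);
        // ...and even an 8 GB memory figure fits in 5 bytes instead of 8.
        int big = writeVarLong(out, 8L * 1024 * 1024 * 1024);
        System.out.println("small=" + small + " big=" + big + " fixed=8");
    }
}
```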

          Owen O'Malley added a comment -

          I'd suggest that:

          • ResourceStatus class be static and package private instead of non-static and private
          • Add a package private API to get the ResourceStatus, but remove the other public api methods.
          • TaskTrackerStatus should not create a new ResourceStatus in readFields.
          • The JobInProgress should cache the value rather than looking it up in the config and parsing it each time. You've already added the getMaxVirtualMemoryForTask that you'll need to access it.
          • The task tracker should also compute the value once and reuse it rather than recalculate it each time getDefaultMemoryPerTask is called.
          Hemanth Yamijala added a comment -

          This patch addresses all comments from Arun and Owen. Regarding the following comment from Devaraj:

          The other thing is the way we handle -Xmx in this setup. Assume a case where the user hasn't specified any memory requirement for his job. The memory that a task would get is proportional to the amount of memory in the TT/#slots. Let's say for this cluster instance, it is 1G. Now if his -Xmx, which is an absolute number, is above this, say 1.5G, would it work? Note that the task JVM might work even with 1G. It is just the user happened to specify it as 1.5G.

          As far as I could see, the JVM seems to be allocating the Xmx value in chunks. So, it might not fail immediately, and might not at all if the task doesn't require more than 1G. However, if it does come over 1G, it might get killed. There seem to be two ways to handle this:

          • Document this for the user to know and take care of in the job configuration. We've done a similar thing for mapred.child.ulimit.
          • When scheduling, specifically look for the -Xmx value in mapred.child.java.opts and, if it's set to a value higher than ResourceStatus.getDefaultMemoryPerTask(), don't schedule the task on this TT.

          The latter option looks a little hacky (having to parse mapred.child.java.opts etc.) and could also behave differently under different conditions, making it difficult to debug. It seems simpler to just document this. Along these lines, I've updated both the hadoop-default.xml documentation and the Forrest documentation in the Map/Red tutorial that mentions these options. Let me know if this seems fine.

          Also, I've retained mapred.child.ulimit, since it could be used for parameters other than memory.
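For concreteness, the alternative judged too hacky above — inspecting mapred.child.java.opts for an -Xmx value at scheduling time — would have looked roughly like this. The parseXmxBytes helper is hypothetical, written only to show why the parsing felt fragile; it is not code from the patch.

```java
// Sketch of the rejected alternative: parse -Xmx out of
// mapred.child.java.opts and compare it against the per-slot default.
// parseXmxBytes is a hypothetical helper, not part of the patch.
public class Main {
    static long parseXmxBytes(String javaOpts) {
        for (String opt : javaOpts.trim().split("\\s+")) {
            if (!opt.startsWith("-Xmx")) continue;
            String v = opt.substring(4);
            long mult = 1;
            char unit = Character.toLowerCase(v.charAt(v.length() - 1));
            if (unit == 'k') mult = 1024L;
            else if (unit == 'm') mult = 1024L * 1024;
            else if (unit == 'g') mult = 1024L * 1024 * 1024;
            if (mult > 1) v = v.substring(0, v.length() - 1);
            return Long.parseLong(v) * mult;
        }
        return -1; // no -Xmx present
    }

    public static void main(String[] args) {
        long defaultPerTask = 1024L * 1024 * 1024; // 1 GB per slot
        long xmx = parseXmxBytes(
            "-server -Xmx1536m -Djava.net.preferIPv4Stack=true");
        // Here the scheduler would skip this TT: 1.5 GB > 1 GB available.
        System.out.println("xmx=" + xmx
            + " schedulable=" + (xmx <= defaultPerTask));
    }
}
```

Even this small sketch has to guess at unit suffixes and option ordering, which is exactly the kind of environment-dependent behavior the comment above wants to avoid.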

          Vinod Kumar Vavilapalli added a comment -

          Also, I've retained mapred.child.ulimit, since it could be used for parameters other than memory.

          Currently mapred.child.ulimit cannot be used for anything else; it is hardcoded to be used as the virtual memory limit (ulimit -v) only. Should we change this?

          Also, I think we should deprecate setting virtual memory limits via mapred.child.ulimit; they both intend to do the same thing.

          Devaraj Das added a comment -

          Documenting seems fine to me for now. Regarding the 'ulimit -v', I think we should deprecate the API getUlimitMemoryCommand.

          Devaraj Das added a comment -

          Regarding deprecation, here is a way to handle this case. If the new option defined in this JIRA is enabled (by default it is disabled), then the ulimit for memory is not applied when the task is launched. If the new option is disabled (the tasktracker has a negative number for max mem), then we apply the ulimit settings (as it is done today)...
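The precedence being proposed can be sketched as below. The exact method shape is illustrative (it mirrors the spirit of Shell.getUlimitMemoryCommand but is not the committed code), and the parameter names are assumptions for the example.

```java
// Sketch of the proposed precedence (illustrative only): if the
// tasktracker-level memory limit is enabled (non-negative), skip the
// per-task `ulimit -v`; otherwise fall back to the legacy ulimit path.
public class Main {
    static String ulimitPrefix(long ttMaxVirtualMemory, long childUlimitKb) {
        if (ttMaxVirtualMemory >= 0) {
            return null; // new monitoring is on; don't apply ulimit
        }
        if (childUlimitKb > 0) {
            return "ulimit -v " + childUlimitKb; // legacy behavior
        }
        return null; // neither mechanism configured
    }

    public static void main(String[] args) {
        // Admin has set the new TT limit (8 GB): ulimit is suppressed.
        System.out.println(ulimitPrefix(8L << 30, 2097152));
        // New limit disabled (negative): legacy ulimit applies.
        System.out.println(ulimitPrefix(-1, 2097152));
    }
}
```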

          Hemanth Yamijala added a comment -

          New patch that deprecates getUlimitMemoryCommand, and updates documentation for the same. Also, the behavior that Devaraj mentions is implemented with respect to when the ulimit is set. Put another way, if an administrator has specified a value for the tasktracker's memory limit, we give it higher priority than ulimit setting. This seems to be the right thing to do. In the next release, we can completely remove the ulimit configuration.

          Also, I ran test-patch and got a -1 for the number of javac warnings, since the deprecation warning comes up now. This is expected, right?

          Comments from others on these changes ?

          eric baldeschwieler added a comment -

          Why does this need to be deprecated? Generalizing it as a way of setting up other limits or ENV config seems more correct. Might I not want to limit the size of any process as well as the size of the process tree?

          This is more secure when it works, since it keeps me from even creating the large object. The other approach might miss the problem until too late...

          Vinod Kumar Vavilapalli added a comment -

          [...] Generalizing it as a way of setting up other limits or ENV config seems more correct. Might I not want to limit the size of any process as well as the size of the process tree? [..]

          +1. Created HADOOP-3974 for this.

          Owen O'Malley added a comment -

          I agree with Eric that keeping the ulimit capability is a good thing. It is better to prevent the problem than react to the problem, given a choice...

          Hemanth Yamijala added a comment -

          +1 from me too. Since there is consensus on keeping the ulimit method, I've reverted the changes and uploaded a new patch. Will submit this to Hudson, as all comments have been taken care of now.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12388580/HADOOP-3759.patch
          against trunk revision 688101.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3085/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3085/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3085/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3085/console

          This message is automatically generated.

          Hemanth Yamijala added a comment -

          -1 on javadoc is due to HADOOP-3964
          -1 on core tests is due to HADOOP-3950
          I can't figure out why there's a -1 on the contrib tests; no tests appear to have failed, and I can't see anything in the console log either. I notice that many tests which ran on Hudson have also got a -1 on contrib. Therefore, I suspect it is unlikely to be caused by this patch.

          Devaraj Das added a comment -

          I just committed this. Thanks Hemanth!

          Hudson added a comment -

          Integrated in Hadoop-trunk #586 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/586/ )

            People

            • Assignee: Hemanth Yamijala
            • Reporter: Hemanth Yamijala