Hadoop Common / HADOOP-5881

Simplify configuration related to task-memory-monitoring and memory-based scheduling

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.1
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Incompatible change, Reviewed

      Description

      The configuration we have now is pretty complicated. Among other things, not specifying parameters separately for maps and reduces leads to problems like HADOOP-5811. This JIRA should simplify the configuration and, at the same time, solve those problems.

      Attachments

      1. HADOOP-5881-20090526.1.txt
         123 kB
         Vinod Kumar Vavilapalli
      2. HADOOP-5881-20090526-branch-20.1.txt
         122 kB
         Vinod Kumar Vavilapalli

          Activity

          Vinod Kumar Vavilapalli added a comment -

          Some of the problems that stood out and the corresponding solutions, after some discussions with Eric, Arun and Hemanth:

          • The current configuration system doesn't distinguish the memory usage of maps from that of reduces. In general, reduces are more memory-intensive than maps. Also, because of this lack of distinction, we use the total memory available on the TT as a shared resource across slot types, which led to the problems described in HADOOP-5811. The solution is to divide the memory resource between map slots and reduce slots. In the presence of high-memory jobs, map tasks of these jobs will use memory from other map slots but won't take memory away from reduce slots, and vice versa. To reflect the difference in typical usage between map slots and reduce slots, we may want to configure more map slots on a node than reduce slots.
          • We now have separate configuration for specifying default values for a job's memory configuration. This is unnecessary and can be better handled with layers of configuration. As Arun suggests, default values can be provided by cluster admins to users via the configuration files distributed to clients.
          • With the current configuration, we have 1) the total memory available calculated on a TT, and 2) a `reserved` memory on the TT for system usage. Because of this mechanism, if a TT has less memory overall, we assign less memory to a single slot. As per the discussions, this seems like the wrong idea. To paraphrase Eric - "A slot is a slot, is a slot. The TT will just be configured with the number of slots (map & reduce)." In essence, if a TT has less memory, the correct scheme is to decrease the number of slots, not the memory per slot (see the configuration sketch after this list). The memory allotted per slot should be more or less the same across all the TTs in the cluster.
          • We are distinguishing the virtual memory used by processes from the physical memory. This seems necessary when considering streaming/pipes tasks. However, "in Java, once the VM hits swap, performance degrades fast; we want to configure the limits based on the physical memory on the machine (not including swap), to avoid thrashing". With this in view, there doesn't seem to be any need to distinguish vmem from physical memory w.r.t. configuration. Depending on a site's requirements, the configuration items can reflect whether we want tasks to go beyond physical memory or not.
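
          To make the "fix the slot size, vary the slot count" scheme concrete, here is a minimal sketch of how a lower-memory TaskTracker might be configured in mapred-site.xml. The slot-count properties mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum are the existing settings for this; the values are hypothetical and only illustrate shrinking slot counts while keeping memory per slot fixed cluster-wide.

          <!-- mapred-site.xml on a TT with less memory than its peers:
               keep memory per slot identical across the cluster and
               reduce the number of slots instead. Values are illustrative. -->
          <property>
            <name>mapred.tasktracker.map.tasks.maximum</name>
            <value>4</value> <!-- more map slots than reduce slots -->
          </property>
          <property>
            <name>mapred.tasktracker.reduce.tasks.maximum</name>
            <value>2</value>
          </property>
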
          Vinod Kumar Vavilapalli added a comment -

          Here's the proposal:

          mapred.cluster.map.memory.mb
            Type: set by admin, cluster-wide
            Meaning: Cluster definition of memory per map slot. The maximum amount of memory, in MB, each map task on a tasktracker can consume.

          mapred.cluster.reduce.memory.mb
            Type: set by admin, cluster-wide
            Meaning: Cluster definition of memory per reduce slot. The maximum amount of memory, in MB, each reduce task on a tasktracker can consume.

          mapred.job.map.memory.mb
            Type: set by user, per-job
            Meaning: Job requirement for map tasks. The maximum amount of memory, in MB, each map task of the job can consume.

          mapred.job.reduce.memory.mb
            Type: set by user, per-job
            Meaning: Job requirement for reduce tasks. The maximum amount of memory, in MB, each reduce task of the job can consume.

          mapred.cluster.max.map.memory.mb
            Type: set by admin, cluster-wide
            Meaning: Max limit on jobs. The maximum value a user can specify via mapred.job.map.memory.mb, in MB. A job that asks for more than this fails at submission.

          mapred.cluster.max.reduce.memory.mb
            Type: set by admin, cluster-wide
            Meaning: Max limit on jobs. The maximum value a user can specify via mapred.job.reduce.memory.mb, in MB. A job that asks for more than this fails at submission.
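
          As an illustration, the cluster-wide side of this proposal might look like the following in mapred-site.xml (a minimal sketch; all values here are hypothetical):

          <!-- Cluster definition of memory per slot (illustrative values) -->
          <property>
            <name>mapred.cluster.map.memory.mb</name>
            <value>1024</value>
          </property>
          <property>
            <name>mapred.cluster.reduce.memory.mb</name>
            <value>2048</value>
          </property>
          <!-- Upper limit on per-job map requests -->
          <property>
            <name>mapred.cluster.max.map.memory.mb</name>
            <value>4096</value>
          </property>

          A job that uses ToolRunner could then pass, say, -Dmapred.job.map.memory.mb=2048 at submission time; asking for more than the corresponding max limit fails the job at submission.
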
          Vinod Kumar Vavilapalli added a comment -

          Patch implementing the proposal.

          Vinod Kumar Vavilapalli added a comment -

          The uploaded patch (for trunk) passed ant test-patch, run-mapred-tests and test-contrib, leaving aside the usual suspects for failure: TestReduceFetch, TestStreamingBadRecords, TestStreamingExitStatus, TestQueueCapacities and TestJobHistory (which succeeds on a clean workspace but not when run along with all the other tests).

          Vinod Kumar Vavilapalli added a comment -

          Patch for branch-0.20.

          Hemanth Yamijala added a comment -

          I've been reviewing this patch offline and had already given comments, which are incorporated into the new patch. Based on that review, I am +1 on the code changes. Will commit this.

          Hemanth Yamijala added a comment -

          I just committed this to trunk and branch 0.20. Thanks, Vinod!

          Owen O'Malley added a comment -

          I'm concerned that this went in without an opportunity for review. The patch was uploaded at 5:49am and committed at 6:36am.

          I'm also concerned that this incompatibly changes the config variables in the 0.20 branch. What is the rationale for pushing these configuration changes back to 0.20? If they are required, they need a compatibility story.

          Hemanth Yamijala added a comment -

          I had a chat with Owen regarding the objections raised.

          I'm concerned that this went in without an opportunity for review. The patch was uploaded at 5:49am and committed at 6:36am.

          I apologize for the rushed commit and take responsibility. However, as I explained above, I had reviewed the patch over a significant period; the first version I reviewed is not on JIRA. In retrospect, I think that was a mistake, and I will avoid repeating it in the future. Sorry!

          I'm also concerned that this incompatibly changes the config variables in the 0.20 branch. What is the rationale for pushing these configuration changes back to 0.20? If they are required, they need a compatibility story.

          There was a misunderstanding on this point. In an internal discussion, it seemed we had agreed to this approach; however, after rethinking, we are reversing that stand. I opened HADOOP-5919 to discuss and provide a backwards-compatibility story for this feature. Let's keep the discussion there.

          Based on these explanations, can I resolve this bug?

          Owen O'Malley added a comment -

          Yes, thanks for addressing my concerns, Hemanth.

          Ramya Sunil added a comment -

          It would be useful if descriptions of the above memory management parameters were added to mapred-default.xml. Also, mapred.tasktracker.vmem.reserved, mapred.task.default.maxvmem and mapred.task.limit.maxvmem have to be removed from the xml.

          Hemanth Yamijala added a comment -

          It would be useful if descriptions of the above memory management parameters were added to mapred-default.xml. Also, mapred.tasktracker.vmem.reserved, mapred.task.default.maxvmem and mapred.task.limit.maxvmem have to be removed from the xml.

          +1 regarding documentation. I filed HADOOP-5957 to track this. But we are planning to introduce backwards-compatible options, so removing the configuration parameters from the xml will not be necessary. Please see HADOOP-5919.

          Hudson added a comment -

          Integrated in Hadoop-trunk #863 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/863/)

          lu cheng added a comment -

          Hi guys,
          If I set mapred.cluster.max.map.memory.mb and mapred.cluster.max.reduce.memory.mb in mapred-site.xml to, for example, 3072, then when I submit a grep job I get:
          10/04/05 15:23:09 INFO mapred.FileInputFormat: Total input paths to process : 14
          org.apache.hadoop.ipc.RemoteException: java.io.IOException: job_201004051501_0001(-1 memForMapTasks -1 memForReduceTasks): Invalid job requirements.
          at org.apache.hadoop.mapred.JobTracker.checkMemoryRequirements(JobTracker.java:3864)
          at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3014)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:396)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

          at org.apache.hadoop.ipc.Client.call(Client.java:740)
          at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
          at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
          at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
          at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
          at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
          at org.apache.hadoop.examples.Grep.run(Grep.java:69)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.hadoop.examples.Grep.main(Grep.java:93)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
          at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
          at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

          Also, if you set mapred.cluster.map.memory.mb and mapred.cluster.reduce.memory.mb to 512, the system always reports:
          INFO mapred.JobClient: Task Id : attempt_201004051259_0001_r_000000_0, Status : FAILED
          TaskTree [pid=32315,tipID=attempt_201004051259_0001_r_000000_0] is running beyond memory-limits. Current usage : 556195840bytes. Limit : -1048576bytes. Killing task.

          The version I deployed is 0.20.2.

          It seems that in some places the code uses bytes, and in others it uses MB.

          Can someone fix this problem?

          Vinod Kumar Vavilapalli added a comment -

          Are you using all the configuration items properly, as documented here? You should be setting all of them with appropriate values.

          The first error you mention seems to be happening because you only configured the max limits, but not the cluster-wide slot sizes and the per-job configuration; a complete configuration sketch follows at the end of this comment.

          I have no idea about the second problem. If you give more details, we can look into it.

          We have been using this feature with the mentioned configuration items so far without any issues, so I don't think there is any bytes-vs-MB problem at all.
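
          For reference, here is a minimal sketch of a complete, consistent set of these properties in mapred-site.xml; all values are hypothetical, and the per-job values act as defaults that clients may override, as long as they stay within the max limits. The "-1 memForMapTasks" error above is consistent with the per-job values never being set.

          <!-- Cluster-wide slot sizes (illustrative values) -->
          <property>
            <name>mapred.cluster.map.memory.mb</name>
            <value>512</value>
          </property>
          <property>
            <name>mapred.cluster.reduce.memory.mb</name>
            <value>1024</value>
          </property>
          <!-- Upper limits on per-job requests -->
          <property>
            <name>mapred.cluster.max.map.memory.mb</name>
            <value>3072</value>
          </property>
          <property>
            <name>mapred.cluster.max.reduce.memory.mb</name>
            <value>3072</value>
          </property>
          <!-- Per-job requests; without these, the job submission above
               saw -1 (unset) for both map and reduce memory. -->
          <property>
            <name>mapred.job.map.memory.mb</name>
            <value>512</value>
          </property>
          <property>
            <name>mapred.job.reduce.memory.mb</name>
            <value>1024</value>
          </property>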


            People

             • Assignee: Vinod Kumar Vavilapalli
             • Reporter: Vinod Kumar Vavilapalli
             • Votes: 0
             • Watchers: 5
