Hadoop Map/Reduce: MAPREDUCE-220

Collecting cpu and memory usage for MapReduce tasks

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: task, tasktracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Collect cpu and memory statistics per task.

      Description

      It would be nice for TaskTracker to collect cpu and memory usage for individual Map or Reduce tasks over time.

      Attachments

      1. MAPREDUCE-220-v1.txt
        15 kB
        Scott Chen
      2. MAPREDUCE-220-20100818.txt
        13 kB
        Scott Chen
      3. MAPREDUCE-220-20100817.txt
        13 kB
        Scott Chen
      4. MAPREDUCE-220-20100812.txt
        14 kB
        Scott Chen
      5. MAPREDUCE-220-20100811.txt
        14 kB
        Scott Chen
      6. MAPREDUCE-220-20100809.txt
        12 kB
        Scott Chen
      7. MAPREDUCE-220-20100806.txt
        12 kB
        Scott Chen
      8. MAPREDUCE-220-20100804.txt
        12 kB
        Scott Chen
      9. MAPREDUCE-220-20100616.txt
        12 kB
        Scott Chen
      10. MAPREDUCE-220.txt
        15 kB
        Scott Chen
      11. ant-test-patch.log
        2 kB
        Scott Chen
      12. ant-test.log
        48 kB
        Scott Chen

        Issue Links

          Activity

          Hong Tang added a comment -

          Such information may be useful in several ways:

          • Visualizing the resource usage of a user job over time.
          • Identifying user application hotspots.
          • Identifying scheduling anomaly in MapReduce framework.
          Scott Chen added a comment -

           We can make the TaskTracker report its CPU, memory, and bandwidth usage to the JobTracker. The JobTracker can use the information for scheduling tasks and profiling jobs.

           One way to do this is to first make ProcfsBasedProcessTree collect the utilization information (CPU, mem...), write the information into TaskTrackerStatus.taskReports, and send it with the heartbeats. Then we can aggregate this information in JobInProgress to do job profiling and scheduling.
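
           As a rough illustration of the data flow being proposed, a minimal sketch follows. The class and field names here are hypothetical, not the real Hadoop types: the TaskTracker fills one report per task from the process tree and piggy-backs the map on its heartbeat, and the JobTracker aggregates the reports per job.

           // Illustrative sketch only; all names are hypothetical.
           import java.util.HashMap;
           import java.util.Map;

           class TaskResourceReport {
             final long cpuTimeMs;        // cumulative CPU time of the task's process tree
             final long physicalMemBytes; // current physical memory of the process tree
             TaskResourceReport(long cpuTimeMs, long physicalMemBytes) {
               this.cpuTimeMs = cpuTimeMs;
               this.physicalMemBytes = physicalMemBytes;
             }
           }

           class HeartbeatPayload {
             // per-task reports piggy-backed on the TaskTracker heartbeat, keyed by task id
             final Map<String, TaskResourceReport> taskReports = new HashMap<String, TaskResourceReport>();
           }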

          dhruba borthakur added a comment -

          Are we proposing that we add the following metrics to the heartbeat message?

          A1. virtual memory used by each task (in bytes)
          A2. physical memory used by each task (in bytes)
          A3. cpu used by each task (as a percentage of total CPU on that machine)

          B1. available physical memory on this machine (in bytes)
          B2. available cpu on this machine (as a percentage of total CPU on that machine)

          Scott Chen added a comment -

           Dhruba and I discussed this again. We think the following information should be transmitted:

           A1. virtual memory used by each task (in bytes)
           A2. physical memory used by each task (in bytes)
           A3. cumulative cpu time used by each task (in milliseconds)

           B1. available physical memory on this machine (in bytes)
           B2. cumulative used cpu time (for all cores) since the machine came up (in milliseconds)
           B3. cpu speed on this machine (in Hz)
           B4. # of cpu cores on the machine

           The cpu % can be obtained by looking at the difference of the reported cumulative cpu time, as sketched below.
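
           For example, given two consecutive heartbeats, the machine-level CPU usage can be derived from B2 and B4 like this. This is a standalone sketch with illustrative names, not code from any patch:

           public class CpuUsageSketch {
             /**
              * prevCumCpuMs/curCumCpuMs: cumulative CPU time (B2) at the previous/current heartbeat.
              * intervalMs: wall-clock time between the two heartbeats.
              * numCores: number of CPU cores on the machine (B4).
              * Returns CPU usage as a percentage of the whole machine.
              */
             static double cpuPercent(long prevCumCpuMs, long curCumCpuMs,
                                      long intervalMs, int numCores) {
               if (intervalMs <= 0 || numCores <= 0) {
                 return 0.0;
               }
               double usedMs = curCumCpuMs - prevCumCpuMs;          // CPU-ms consumed in the interval
               double availableMs = (double) intervalMs * numCores; // CPU-ms available in the interval
               return 100.0 * usedMs / availableMs;
             }

             public static void main(String[] args) {
               // 4,000 CPU-ms over a 10,000 ms interval on an 8-core machine -> 5%
               System.out.println(cpuPercent(100000L, 104000L, 10000L, 8));
             }
           }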

          Scott Chen added a comment -

           I created a sub-task of this one for collecting the TaskTracker resources, MAPREDUCE-1218.
           I will factor out the TaskTracker resource-collection code there and leave the scheduling-related code here.

          Scott Chen added a comment -

          I posted the above comment in the wrong place. It's supposed to go to MAPREDUCE-961. Sorry for the confusion and spam.

          dhruba borthakur added a comment -

          I am proposing that we hold off doing anything to this JIRA until MAPREDUCE-901 is committed.

           In the meantime, the items marked B1 - B4 can be done as part of MAPREDUCE-1218. B1 - B4 are not task-related metrics (rather, they are TaskTracker-related) and are not dependent on the TaskMetrics changes proposed in MAPREDUCE-901.

          Hong Tang added a comment -

          +1 on holding off this Jira until MAPREDUCE-901 is committed.

          +1 on moving B* to MAPREDUCE-1218.

          I'd also propose A4. Total number of child processes in the process tree rooted from the main task process. (Assuming A1, A2, A3 are cumulative metrics on all sub-processes launched by the main task process).

          Vinod Kumar Vavilapalli added a comment -

          B2. cumulative used cpu time (for all cores) since the machine is up (in millisecond)

          I'd also propose A4. Total number of child processes in the process tree rooted from the main task process.

          What can these two stats possibly be used for?

          Olga Natkovich added a comment -

          It would be very useful for profiling purposes if applications could get resource utilization information via counters. Either detailed information for each map/reduce or min/max/average could be useful.

          • Average CPU utilization
          • Max memory usage
           • Average inbound and outbound I/O. (Not sure if it is possible to obtain this information on a per-process basis.)
          Hong Tang added a comment -

          What can these two stats possibly be used for?

          This would allow us to tell whether a sudden decrease/increase of cpu or memory usage is caused by the spawning of new processes.

          Scott Chen added a comment -

          What can these two stats possibly be used for?

           B2 also allows us to compute the current CPU usage (by taking the difference).
          +1 on Hong's idea of collecting A4. total number of child processes.

          dhruba borthakur added a comment -

          hi folks, we would like to start work on this one. In the earlier discussion, we said that a pre-requisite is to refactor all the metric reporting via MAPREDUCE-901. But 901 is not moving forward. Given that fact, is it ok with people if we start working on this JIRA using the existing reporting framework even before M901 is done?

          Arun C Murthy added a comment -

          Dhruba, I think that makes sense.

          Having said that, if we do manage to get MAPREDUCE-901 committed before this gets in, would it be reasonable to ask for a bit of re-work on this one?

          dhruba borthakur added a comment -

           Agreed. We can start work on this one, get it reviewed by the community, and by the time it is ready, if M-901 is already committed, then we refactor this patch to be compatible with M-901. On the other hand, if M-901 is not committed by the time this one is ready, then we do not hold this one up. Sounds fair?

          Arun C Murthy added a comment -

          Precisely. Thanks!

          Scott Chen added a comment -

          Thanks, Dhruba and Arun.
          I will start working on this one now.

          Scott Chen added a comment -

           This patch implements the feature that transmits the per-task CPU and memory information through the TaskTracker heartbeat.
           Here's a summary of the changes:

           1. Add three methods in LinuxResourceCalculatorPlugin to get the cumulative CPU time, physical memory and virtual memory of a process (aggregated over all its subprocesses).
           2. Add a method called resourceStatusUpdate in TaskTracker to update the resource status in TaskStatus.taskReports before sending a heartbeat.
           3. Add a nested class called ResourceStatus in TaskStatus to store the CPU and memory information (see the sketch below).
           4. Modify TestTTResource to check that the per-task resource information is actually transmitted.
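
           For orientation, items 2 and 3 might look roughly like the following. This is a simplified sketch that follows the description above, not the attached patch verbatim:

           // Simplified sketch of the ResourceStatus holder and the update hook; not the actual patch code.
           class TaskStatusSketch {
             static class ResourceStatus {
               long cumulativeCpuTimeMs;   // from LinuxResourceCalculatorPlugin
               long physicalMemoryBytes;
               long virtualMemoryBytes;
             }

             private final ResourceStatus resourceStatus = new ResourceStatus();

             ResourceStatus getResourceStatus() {
               return resourceStatus;
             }

             // Called by the TaskTracker for each running task just before a heartbeat is sent.
             void resourceStatusUpdate(long cpuMs, long pMemBytes, long vMemBytes) {
               resourceStatus.cumulativeCpuTimeMs = cpuMs;
               resourceStatus.physicalMemoryBytes = pMemBytes;
               resourceStatus.virtualMemoryBytes = vMemBytes;
             }
           }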

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12443120/MAPREDUCE-220.txt
          against trunk revision 938805.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/155/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/155/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/155/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/155/console

          This message is automatically generated.

          Scott Chen added a comment -

           The problem found by Findbugs is that I made the method resourceUpdate() synchronized.
           This is unnecessary. I have removed it.

          Scott Chen added a comment -

           I reran the failed contrib test TestSimulatorDeterministicReplay. It succeeds on my dev box.

           b/c/m/t/TEST-org.apache.hadoop.mapred.TestSimulatorDeterministicReplay.txt                                                                                     
          Testsuite: org.apache.hadoop.mapred.TestSimulatorDeterministicReplay
          Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 31.101 sec
          ------------- Standard Output ---------------
          Job job_200904211745_0002 is submitted at 103010
          Job job_200904211745_0002 completed at 141990 with status: SUCCEEDED runtime: 38980
          Job job_200904211745_0003 is submitted at 984078
          Job job_200904211745_0004 is submitted at 993516
          Job job_200904211745_0003 completed at 1011051 with status: SUCCEEDED runtime: 26973
          Job job_200904211745_0005 is submitted at 1033963
          Done, total events processed: 595469
          Job job_200904211745_0002 is submitted at 103010
          Job job_200904211745_0002 completed at 141990 with status: SUCCEEDED runtime: 38980
          Job job_200904211745_0003 is submitted at 984078
          Job job_200904211745_0004 is submitted at 993516
          Job job_200904211745_0003 completed at 1011051 with status: SUCCEEDED runtime: 26973
          Job job_200904211745_0005 is submitted at 1033963
          Done, total events processed: 595469
          ------------- ---------------- ---------------
          
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12443235/MAPREDUCE-220-v1.txt
          against trunk revision 939505.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/363/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/363/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/363/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/363/console

          This message is automatically generated.

          Scott Chen added a comment -

           Vinod: I think you are very familiar with this part of the code. Is it possible that you can help me review the patch? Thanks.

          Arun: Let me know if there is anything that might cause trouble to MAPREDUCE-901.

          Arun C Murthy added a comment -

          TaskStatus isn't really a 'public' api (MAPREDUCE-1623 & HADOOP-5073).

          Thus TaskStatus.ResourceStatus shouldn't be exposed directly to the end-user. We probably should convert the fields in ResourceStatus into Counters and use that as the primary interface for the end-user, also we should store them into JobHistory etc.

          Vinod Kumar Vavilapalli added a comment -

          We probably should convert the fields in ResourceStatus into Counters and use that as the primary interface for the end-user, also we should store them into JobHistory etc.

          I second that. It will also solve two other issues with the patch:

           • the cpu and memory usage details of each task are sent in every heartbeat, making it bulky. Translating them into Counters will make them be sent only once a minute
           • with Counters, we get for free the logging into JobHistory as well as the display on the web UI.

           Leaving that aside, I have one more comment on the TT side: for getting the cpu/memory usage of a task, we construct the process-tree of the task repeatedly every time a heartbeat is sent.

           • For one, if we go the Counters way, we only need to do the calculations once a minute.
           • Otherwise, the process-trees for all tasks are now constructed both by TaskMemoryManager and by the TT main thread. This can become costly depending on the size of the process-tree. There is an opportunity for refactoring this, I guess - maybe a single class which maintains all the process-trees (TaskMemoryManager.ProcessTreeInfo?) and the corresponding statistics, within a given precision, time-wise (see the sketch below).

          Scott?
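
           An illustrative sketch of that refactoring idea, with a single cache that owns the per-task statistics and re-walks /proc at most once per interval. All names here are hypothetical:

           // Illustrative sketch only; the real Hadoop classes and names differ.
           import java.util.Map;
           import java.util.concurrent.ConcurrentHashMap;

           class ProcessTreeStatsCache {
             static class Sample {
               final long timestampMs, cpuMs, physMemBytes, virtMemBytes;
               Sample(long t, long cpu, long pmem, long vmem) {
                 timestampMs = t; cpuMs = cpu; physMemBytes = pmem; virtMemBytes = vmem;
               }
             }

             private final long refreshIntervalMs;
             private final Map<String, Sample> byTask = new ConcurrentHashMap<String, Sample>();

             ProcessTreeStatsCache(long refreshIntervalMs) {
               this.refreshIntervalMs = refreshIntervalMs;
             }

             /** Return the cached sample for a task, re-walking /proc only when it is stale. */
             Sample get(String taskId) {
               Sample s = byTask.get(taskId);
               long now = System.currentTimeMillis();
               if (s == null || now - s.timestampMs > refreshIntervalMs) {
                 s = readFromProcfs(taskId, now);
                 byTask.put(taskId, s);
               }
               return s;
             }

             // Placeholder for the real procfs walk (ProcfsBasedProcessTree in Hadoop).
             private Sample readFromProcfs(String taskId, long now) {
               return new Sample(now, 0L, 0L, 0L);
             }
           }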

          dhruba borthakur added a comment -

          I like the idea of sending this information via Counters.

           This data could be used by schedulers to make decisions on what/when to schedule new tasks or preempt existing tasks. For this use-case, it would be nice if we can send them to the JT more frequently than 1 minute. Any ideas here?

          Scott Chen added a comment -

          Hey guys, Thanks for the help.

           I am not familiar with the counters. But from Arun and Vinod's comments I can see the benefits:
           1. Reuse of the counter logging and transmitting
           2. Easier to expose to end users
           This is really good!

           But as Dhruba mentioned, we want to use this information for scheduling.
           So measuring it and then sending it with the heartbeat ensures the scheduler gets the latest information.
           One minute may be too slow for scheduling.

           The other question I have is:
           using counters, can we aggregate using other methods (e.g. max) rather than just incrementing values?

           My original plan is to report this information in this issue and aggregate it into job-level status in MAPREDUCE-1739.
           And I am planning to generate these fields after aggregation:
           1. Total CPU cycles (# of giga-cycles)
           2. Total memory occupied time (GB-sec)
           3. Maximum peak memory on one task (GB)
           4. Maximum peak CPU on one task (GHz)
           Is it possible to get these fields by using the counters? (See the sketch below.)
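
           As a rough sketch of that aggregation (hypothetical names, simplified units), the first two fields fold naturally into increment-style counters, while the last two need a max-style merge:

           // Hypothetical aggregation sketch; not code from any patch.
           class JobResourceAggregate {
             long totalCpuMs = 0;            // total CPU time; convertible to cycles via clock speed
             double totalMemGbSec = 0.0;     // memory-occupied time
             long maxTaskPeakMemBytes = 0;   // maximum peak memory over all tasks

             void addTaskSample(long cpuMsDelta, long memBytes, long sampleIntervalMs) {
               totalCpuMs += cpuMsDelta;                                        // increment-friendly
               totalMemGbSec += (memBytes / 1e9) * (sampleIntervalMs / 1000.0); // increment-friendly
               maxTaskPeakMemBytes = Math.max(maxTaskPeakMemBytes, memBytes);   // needs max, not increment
             }
           }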

           I will read the relevant code and think more about it.
           Maybe there's a way to obtain both benefits.

           Vinod: I also feel that there is a lot of redundant creation/computation of the processTree.
           Maybe we should refactor the code and use one thread to compute it and expose the information to the others.

          Scott Chen added a comment -

           I had some discussion with Arun. The problem with the Counter is that it can only be incremented,
           so it is difficult to use for transmitting CPU and memory information (which goes up and down).
           We filed another JIRA, MAPREDUCE-1762, to allow setValue() in Counter.
           Then we may use Counters to send this information.

          What do you think?
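
           To make the distinction concrete, a tiny sketch of the two counter semantics (a hypothetical class, not the real Counter API): an increment-only counter cannot track a value that goes up and down, which is what setValue() would address.

           // Illustrative only; the real Counter class is different.
           class GaugeCounterSketch {
             private long value;
             void increment(long delta) { value += delta; }     // existing increment-only semantics
             void setValue(long newValue) { value = newValue; } // proposed: overwrite with the latest sample
             long getValue() { return value; }
           }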

          Scott Chen added a comment -

          Had some offline discussion with Dhruba.
          We can make COUNTER_UPDATE_INTERVAL configurable here (it is currently hard coded to 1 min).
          Then we can reconfigure it to submit newer information to the scheduler in our case.
          This should solve our problem.

          Scott Chen added a comment -

           Updated the patch. Moved the resource information into TaskCounter.
           We created the following three counters in TaskCounter:

          CPU_MILLISECONDS
          PHYSICAL_MEMORY_BYTES
          VIRTUAL_MEMORY_BYTES
          

          and added a method called Task.updateResourceCounters() which is used by Task.updateCounters().
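
           For reference, the shape of Task.updateResourceCounters() per the description (and per the snippet quoted later in this thread) is roughly the following. It assumes Task's resourceCalculator and counters fields and is a simplified sketch rather than the exact patch code:

           // Simplified sketch; assumes Task's resourceCalculator and counters fields.
           private void updateResourceCounters() {
             long cpuTime = resourceCalculator.getProcCumulativeCpuTime();
             long pMem = resourceCalculator.getProcPhysicalMemorySize();
             long vMem = resourceCalculator.getProcVirtualMemorySize();
             counters.findCounter(TaskCounter.CPU_MILLISECONDS).setValue(cpuTime);
             counters.findCounter(TaskCounter.PHYSICAL_MEMORY_BYTES).setValue(pMem);
             counters.findCounter(TaskCounter.VIRTUAL_MEMORY_BYTES).setValue(vMem);
           }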

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12447289/MAPREDUCE-220-20100616.txt
          against trunk revision 955198.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/576/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/576/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/576/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/576/console

          This message is automatically generated.

          Scott Chen added a comment -

           The failed contrib test is TestSimulatorDeterministicReplay.testMain.
           It is a known issue, MAPREDUCE-1834.

           In the patch we put the task's cumulative CPU time, current physical memory and current virtual memory in task counters.
           So they will be aggregated in JobInProgress.getJobCounter(). We will get the total CPU time and current total memory usage.
           They will go to both the web UI and the history as part of the counters.
           We can access the task counters to obtain this information in places like the task scheduler too.

           @Vinod: I didn't do the refactoring of the ProcTree, because LinuxResourceCalculatorPlugin is now called by the task (updateCounters is in Task.java). It runs in a different process than the TaskTracker, so we cannot reuse the ProcTree. But the /proc/ directory is in memory, so this may not be so bad in terms of performance. What do you think?

          Allen Wittenauer added a comment -

          (You know, it is a shame this was called ProcfsBasedProcessTree with the disclaimer that it only works on Linux. It probably should be renamed LinuxProcfsBasedProcessTree so that other operating systems with /proc could work. I suppose the alternative is to hack this code to be multi-OS aware)

          Scott Chen added a comment -

          Allen: I agree. LinuxProcfsBasedProcessTree is a better name.

          Evan Wang added a comment -

          why not collect disk i/o and bandwidth for scheduling as well?

          Scott Chen added a comment -

          @Evan: That's a very good idea. We can file another JIRA on this one. What do you think?

          Evan Wang added a comment -

           @Scott: I've already made a Java tool for MR profiling through Linux OS tools, which is independent of Hadoop. However, the overhead of the network monitor, tcpdump, is really high. When running gridmix2, tcpdump costs about 20% of one CPU core. The disk monitor also encountered some problems. So, I am not so sure that MR performance is influenced by all these factors: cpu, memory, disk, network. I'd like to complete my baseline experiment first. Could you give me some advice about network and disk monitoring?

          Scott Chen added a comment -

          @Evan: This sounds like a good experiment.

           The CPU and memory collected in this JIRA are obtained by parsing the /proc/ directory.
           It is very good because /proc/ is in memory so the overhead is small.
           However, there is no per-process IO and network information in /proc/.
           And like you mentioned, using tools like tcpdump can be very expensive.

           Another approach is to count the {non,rack,data}-local bytes fetched from HDFS and fetched/served for map output.
           This way we can estimate the IO and network traffic from these numbers.
           The drawback of this approach is that it doesn't capture IO and network traffic that is not introduced by the framework.
           People can write user scripts which do lots of IO. That will not be captured by this.
           Thoughts?

          M. C. Srivas added a comment -

           We've found that disk bandwidth is virtually unlimited compared to other factors, especially the network, thus measuring/collecting it is not worthwhile for scheduling. More interesting is disk-ops-per-second-per-drive. It identifies bad data layout immediately (i.e., one disk will be very hot even though it might be transferring very little data).

           Unfortunately, using ops / second / disk to schedule work is still not very useful, since bad data layout will not change because we schedule less.

           Network is a big bottleneck. But bytes-in/bytes-out per unit of time is not representative of a problem. If we had some measure of the congestion, we could use it to increase/decrease scheduling locality (e.g., if the network gets congested, reduce the percentage of non-local tasks). We need to know round-trip times under "normal" vs "congested" situations, dropped packet counts, retransmit counts, etc. to figure out metrics for congestion. (Perhaps add some sockopts to tell us this? TCP knows this, after all.)

          CPU/memory/swapping still seem to be most useful therefore.

          dhruba borthakur added a comment -

           +1 to Srivas's proposal. Let this JIRA focus on cpu/memory metrics, and then maybe continue the discussion about disk bandwidth in another JIRA.

           Evan: If this is acceptable to you, can you please create a new JIRA for it? Thanks.

          Philip Zeyliger added a comment -

          Scott,

           Quick question: have you tried this patch with JVM re-use enabled? On my quick reading, this patch doesn't handle that case; I don't know if it's a real problem or not.

          Cheers,

          – Philip

          Scott Chen added a comment -

          Hey Philip,

           We haven't tested this with JVM re-use, but I think you are right about this.
           We need to do some more work for this case.

           We can still get the correct PID in the JVM re-use case, because we use

          String pid = System.getenv().get("JVM_PID");
          

           which is invoked from Task.updateCounters().
           So we should be able to get the correct PID for the task whether or not the JVM is reused.

           The problem is the cumulative CPU time, because the process may have been used by another task for a while.
           One way to solve this is to send only the current value instead of the cumulative value.
           Does this sound correct to you?

          Scott

          Philip Zeyliger added a comment -

          Hi Scott,

          You could also "reset" the counters to 0 when the new task is started (sort of like a "tare" button on a scale). If resourceCalculator.getProcCumulativeCpuTime() was rather resourceCalculator.getCumulativeCpuTimeDelta() [cumulative CPU time since last call], you could use counter.incr() for the CPU usage.

          It's also worth mentioning that the memory usage here is the last-known memory usage value. It's not byte-seconds (which wouldn't be that useful), nor is it maximum memory. That seems useful, but it's a bit unintuitive.

          +    long cpuTime = resourceCalculator.getProcCumulativeCpuTime();
          +    long pMem = resourceCalculator.getProcPhysicalMemorySize();
          +    long vMem = resourceCalculator.getProcVirtualMemorySize();
          +    counters.findCounter(TaskCounter.CPU_MILLISECONDS).setValue(cpuTime);
          +    counters.findCounter(TaskCounter.PHYSICAL_MEMORY_BYTES).setValue(pMem);
          +    counters.findCounter(TaskCounter.VIRTUAL_MEMORY_BYTES).setValue(vMem);
          
          Scott Chen added a comment -

           Updated based on Philip's suggestion.
           We obtain an initial cumulative CPU time when the task is initialized and subtract it when reporting the CPU time.
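
           In other words, roughly this (illustrative names; it assumes Task's counters field and is a sketch rather than the exact patch code):

           // Sketch only; assumes Task's counters field.
           private long initCpuCumulativeTime; // recorded once when the task initializes

           private void updateCpuCounter(long currentCumulativeCpuMs) {
             // Report only the CPU time burned by this task, so a re-used JVM
             // does not carry over a previous task's cumulative CPU time.
             long taskCpuMs = currentCumulativeCpuMs - initCpuCumulativeTime;
             counters.findCounter(TaskCounter.CPU_MILLISECONDS).setValue(taskCpuMs);
           }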

          Eli Collins added a comment -

          Hey Scott,

           Latest patch looks good to me. I assume the redundant calls to getProcessTree will be handled in MR-901; worth returning the values as a tuple in the meantime? Out of curiosity, for the test, why did the map and reduce sleep times need to be bumped to 5s? Wouldn't anything >1s pass?

          Thanks,
          Eli

          Scott Chen added a comment -

          Hey Eli,

           I think returning a tuple is a very good idea. I updated the patch based on this suggestion.

           For the sleep time, when I tested it I increased it because I thought I had to wait some time for the counters to update.
           But the counters are updated at the very beginning of the task, so I changed it back in the patch.
           Thanks for the review,

          Scott

          Eli Collins added a comment -

          Looks good. Minor nit: I might rename ProcResourceStatus to something like ProcResourceValues. Also, this inner class technically needs interface annotations (private and unstable). Sanjay and Tom can correct me if I'm wrong but I don't think we decided that classes inherit the annotations of the outer class.

          Scott Chen added a comment -

          Thanks for the comment again, Eli.
          I have changed ProcResourceStatus to ProcResourceValues. It's better.
          Also I added the interface annotations.
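
           For reference, a ProcResourceValues-style tuple (nested in the resource calculator plugin) can be as small as the following sketch. This is illustrative; the class in the attached patch may differ in details. The annotations are the private/unstable interface annotations Eli mentioned:

           // Illustrative sketch; the real nested class may differ.
           @InterfaceAudience.Private
           @InterfaceStability.Unstable
           public static class ProcResourceValues {
             private final long cumulativeCpuTime;
             private final long physicalMemorySize;
             private final long virtualMemorySize;

             public ProcResourceValues(long cumulativeCpuTime, long physicalMemorySize,
                                       long virtualMemorySize) {
               this.cumulativeCpuTime = cumulativeCpuTime;
               this.physicalMemorySize = physicalMemorySize;
               this.virtualMemorySize = virtualMemorySize;
             }

             public long getCumulativeCpuTime() { return cumulativeCpuTime; }
             public long getPhysicalMemorySize() { return physicalMemorySize; }
             public long getVirtualMemorySize() { return virtualMemorySize; }
           }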

          Eli Collins added a comment -

          +1

          Latest patch looks good to me. Thanks Scott.

          Arun C Murthy added a comment -

          Scott, sorry for coming in late.

          I have a nit: we seem to create a new ProcfsBasedProcessTree each time - wouldn't it be easier to re-use the object? Create it once and re-use it each time?

          Scott Chen added a comment -

          Thanks, Arun. I will update the patch soon.

          Scott Chen added a comment -

          Update to address Arun's comment.

          Eli Collins added a comment -

          Caching the process tree this way works with JVM re-use?

          Scott Chen added a comment -

          Hey Eli,

          I think it will still work. The process tree will be initialized in Task.initialized().
          So it will get the correct process id.

          Scott

          Arun C Murthy added a comment -

           Scott, the patch looks good, but you need to generate it with --no-prefix.

           Once it passes through Hudson I'll commit. Thanks!

          Scott Chen added a comment -

           Ah, I see. Thanks, Arun.
           That's why Hudson never responded to my patch. I will submit the patch again with --no-prefix.

          Arun C Murthy added a comment -

          Hudson might be stuck. Can you please attach the output of 'ant test' and 'ant test-patch' here? Thanks.

          Scott Chen added a comment -

          Hey Arun,
          Thanks, I will run the tests and attach them.
          Scott

          Scott Chen added a comment -

           Uploaded the results for "ant test" and "ant test-patch".

           Updated the patch: in Task.java, moved the two newly added variables to the location commented "Fields".

          Arun C Murthy added a comment -

          I just committed this. Thanks Scott!

          Scott Chen added a comment -

          Thanks for the help

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #523 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/523/)


             People

             • Assignee: Scott Chen
             • Reporter: Hong Tang
             • Votes: 0
             • Watchers: 28
