Hadoop Map/Reduce
MAPREDUCE-467

Collect information about number of tasks succeeded / total per time unit for a tasktracker.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      Provide the ability to collect statistics about tasks completed and succeeded for each tracker in time windows. The statistics are available on the JobTracker's nodes UI page.

      Description

      Collecting the number of tasks succeeded / total per tasktracker, and being able to see these counts per hour, per day, and since start time, will help in reasoning about things like the blacklisting strategy.

      Attachments

      1. 467_branch_0.20.patch
        19 kB
        Sharad Agarwal
      2. 467_v4.patch
        19 kB
        Sharad Agarwal
      3. 467_v5.patch
        19 kB
        Sharad Agarwal
      4. 467_v6.patch
        19 kB
        Sharad Agarwal
      5. 467_v7.patch
        19 kB
        Sharad Agarwal
      6. 5931_v1.patch
        28 kB
        Sharad Agarwal
      7. 5931_v2.patch
        31 kB
        Sharad Agarwal
      8. 5931_v3.patch
        15 kB
        Sharad Agarwal

        Activity

        Hemanth Yamijala created issue -
        Sharad Agarwal added a comment -

        To collect stats for the last hour/day, we can have a moving window for that time period. A moving window can contain multiple time slots. The granularity of window movement/update is decided by the slot size. The slot size could be different for different time windows: for example, the hour window could have a 5-minute update granularity and the day window a 1-hour one. In that case the hour window would hold stats in 12 slots of 5 minutes each, and likewise the day window would hold stats in 24 slots of 1 hour each.

        As the last slot's time is crossed, a new slot is added and the very first one is knocked off, thus moving the window by one slot.

        A simple strategy could be to collect this information in TaskTracker and report that to JobTracker via TaskTrackerStatus. A subclass could be added to TaskTrackerStatus with fields, say:
        tasksSinceStarted, tasksSuccededSinceStarted,
        tasksSinceInLastHour, tasksSuccededInLastHour,
        tasksSinceInLastDay, tasksSuccededInLastDay

        To optimize the heartbeat size, we need not send the above fields with every heartbeat. They could be reported only at a certain interval (typically the minimum slot size, 5 minutes in the above example).

        An alternate way could be to compute all this in the JobTracker. My vote goes for doing it in the TaskTracker, as this mostly concerns an individual tasktracker and doesn't need any global information.

        Thoughts?
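
        For illustration only (this is not code from any of the attached patches), a minimal sketch of the moving window described above could look like the following; all class and method names here are made up:

        import java.util.ArrayDeque;
        import java.util.Deque;

        public class MovingWindowStat {
          private static class Slot {
            int tasks;
            int tasksSucceeded;
          }

          private final long slotSizeMs; // e.g. 5 minutes for the hour window
          private final int numSlots;    // e.g. 12 slots => 1 hour window
          private final Deque<Slot> slots = new ArrayDeque<Slot>();
          private long currentSlotStart;

          public MovingWindowStat(long slotSizeMs, int numSlots) {
            this.slotSizeMs = slotSizeMs;
            this.numSlots = numSlots;
            this.currentSlotStart = System.currentTimeMillis();
            slots.addLast(new Slot());
          }

          public synchronized void taskCompleted(boolean succeeded) {
            roll(System.currentTimeMillis());
            Slot current = slots.peekLast();
            current.tasks++;
            if (succeeded) {
              current.tasksSucceeded++;
            }
          }

          // Move the window: open new slots as time passes and knock off the oldest ones.
          private void roll(long now) {
            while (now - currentSlotStart >= slotSizeMs) {
              currentSlotStart += slotSizeMs;
              slots.addLast(new Slot());
              if (slots.size() > numSlots) {
                slots.removeFirst();
              }
            }
          }

          public synchronized int totalTasks() {
            roll(System.currentTimeMillis());
            int sum = 0;
            for (Slot s : slots) {
              sum += s.tasks;
            }
            return sum;
          }
        }

        With this shape, an hour window with 5-minute granularity would be new MovingWindowStat(5 * 60 * 1000L, 12), and a day window with 1-hour granularity new MovingWindowStat(60 * 60 * 1000L, 24).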

        Sharad Agarwal added a comment -

        Correction: the field names in the last comment should read as:
        tasksSinceStarted, tasksSuccededSinceStarted,
        tasksInLastHour, tasksSuccededInLastHour,
        tasksInLastDay, tasksSuccededInLastDay

        Hemanth Yamijala added a comment -

        I am assuming the moving window mechanism would be flexible enough to add new bucket sizes as required.

        Regarding having the computation on the tasktracker and reporting it via the status object, one problem is that if we want to change the bucket size, it would involve a change in the status object.

        Also, one requirement for this is to store this information on the JobTracker. Can you describe how this will be stored, the mechanics with respect to lost tasktrackers, etc.?

        Will this information be available if the JobTracker restarts?

        Sharad Agarwal added a comment -

        I am assuming the moving window mechanism would be flexible enough to add new bucket sizes as required.

        Yes. I am planning to use and extend the metrics framework available in core, through which custom window/bucket sizes can be defined.

        Regarding having the computation on the tasktracker and reporting it via the status object, one problem is that if we want to change the bucket size, it would involve a change in the status object.

        To avoid that, instead of the above fields, we can have, say, a List<MetricInfo> metrics field in TaskTrackerStatus, where MetricInfo could be:

        class MetricInfo {
          String name;
          int tasks;
          int tasksSucceeded;
        }

        Here name would be the name of the metric, e.g. "lasthour", "lastday", etc., which could be configured in the metrics properties file.
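
        As a purely hypothetical sketch (not code from the patch), such a MetricInfo could implement Hadoop's Writable so that a List<MetricInfo> can be serialized inside TaskTrackerStatus and travel with the heartbeat:

        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.Writable;

        public class MetricInfo implements Writable {
          private String name;       // e.g. "lasthour", "lastday"
          private int tasks;
          private int tasksSucceeded;

          public MetricInfo() {}     // no-arg constructor needed for deserialization

          public MetricInfo(String name, int tasks, int tasksSucceeded) {
            this.name = name;
            this.tasks = tasks;
            this.tasksSucceeded = tasksSucceeded;
          }

          public void write(DataOutput out) throws IOException {
            Text.writeString(out, name);
            out.writeInt(tasks);
            out.writeInt(tasksSucceeded);
          }

          public void readFields(DataInput in) throws IOException {
            name = Text.readString(in);
            tasks = in.readInt();
            tasksSucceeded = in.readInt();
          }
        }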

        Also, one requirement for this is to store this information on the JobTracker. Can you describe how this will be stored, the mechanics with respect to lost tasktrackers, etc.?

        Currently the jobtracker doesn't store any information about lost tasktrackers. Storing info about lost trackers is not trivial and deserves a separate jira issue. Consider the case of a tracker getting lost and never coming back, or coming back on a different port: the jobtracker data structures need to be cleaned up for such trackers, otherwise those data structures would linger forever.

        Will this information be available if the JobTracker restarts?

        Yes. Since this info is propagated from the tasktracker, it would be available after the jobtracker restarts.

        Sharad Agarwal made changes -
        Field Original Value New Value
        Assignee Sharad Agarwal [ sharadag ]
        Sharad Agarwal added a comment -

        This patch adds a MovingWindowContext class which captures the metrics in a moving time window. The window and bucket sizes can be configured using hadoop-metrics.properties.
        It is a very early patch. Testing is in progress, and not all fields are captured yet.
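
        Purely as an illustration of the configuration style (the real property names are defined by the patch, not here), a hadoop-metrics.properties fragment for configurable windows and buckets might look like:

        # Hypothetical keys, shown only to illustrate configurable windows/buckets;
        # the actual property names come from the patch.
        # windows: 1 hour and 1 day, in seconds
        tasktracker.movingwindow.windows=3600,86400
        # buckets: 5 minutes and 1 hour, in seconds
        tasktracker.movingwindow.buckets=300,3600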

        Sharad Agarwal made changes -
        Attachment 5931_v1.patch [ 12410381 ]
        Sharad Agarwal added a comment -

        Patch for review.

        Sharad Agarwal made changes -
        Attachment 5931_v2.patch [ 12410757 ]
        Sharad Agarwal made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Fix Version/s 0.21.0 [ 12313563 ]
        Sharad Agarwal added a comment -

        Had an offline discussion with Devaraj/Eric; the concern raised is that the metrics context is an export interface and, instead of using it, we should collect the metrics natively in Hadoop. Administrators should not be able to remove this metric, as it may in future be used by the JobTracker to make decisions. Right?
        Let me clarify a bit. Please note that only the time windows are configured in the metrics properties, and not the actual metric name which gets collected. Also a new context name is defined, "tasktracker" (refer to hadoop-metrics.properties in the patch), so it does not interfere with the existing metric contexts. Those can continue to be chukwa/ganglia etc.
        If this doesn't sound like a good idea, I see a few options:
        1. Give a better name to the added context, say "core-mapred", so that administrators don't override it. It would serve only to add/remove time windows.

        2. Do not use the Metrics API. Expose the time window configuration via mapred-site.xml.

        3. Don't expose the configuration at all and have fixed windows, say "last hour" and "last day".

        I went with extending the metrics API because I thought it would help to collect any other existing metrics in time windows without much change to the code. For example, if we want to collect "mapred" metrics in time windows, then the "mapred" context can point to the composite context, which can be configured to use multiple contexts, one being the time window context.

        Thoughts?

        Sharad Agarwal added a comment -

        Had a discussion with Owen; the following came up:

        • The Metrics API is an export interface, so we should not use it. We want to build the metrics natively in Hadoop, so they should not be exposed via the metrics config file.
        • It is better to do the collection in the jobtracker. The restart concern will go away, since at some point we will have a heartbeat transaction log, so recovery would be generic. Having it in the jobtracker will give us more control to make scheduling decisions.
        Sharad Agarwal made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Sharad Agarwal added a comment -

        The attached patch collects the metrics in the jobtracker. It doesn't use the metrics API. It defines a new class, StatisticsCollector, which keeps statistics in time windows.
        Stats are collected for LAST_HOUR, LAST_DAY and SINCE_START. The stats are shown in the jobtracker web UI on the trackers list page.
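
        For illustration only (this is not the actual StatisticsCollector from the patch), a rough sketch of per-tracker bookkeeping across such windows could look like the following; the names are hypothetical, and window expiry is left out (it would use bucketed slots as sketched in an earlier comment):

        import java.util.EnumMap;
        import java.util.HashMap;
        import java.util.Map;

        public class TrackerTaskStats {
          public enum Window { LAST_HOUR, LAST_DAY, SINCE_START }

          private static class Counts {
            int total;
            int succeeded;
          }

          // tracker name -> window -> counts
          private final Map<String, Map<Window, Counts>> stats =
              new HashMap<String, Map<Window, Counts>>();

          public synchronized void taskFinished(String tracker, boolean succeeded) {
            Map<Window, Counts> perWindow = stats.get(tracker);
            if (perWindow == null) {
              perWindow = new EnumMap<Window, Counts>(Window.class);
              for (Window w : Window.values()) {
                perWindow.put(w, new Counts());
              }
              stats.put(tracker, perWindow);
            }
            for (Window w : Window.values()) {
              Counts c = perWindow.get(w);
              c.total++;
              if (succeeded) {
                c.succeeded++;
              }
            }
          }

          // Drop a tracker's stats when the jobtracker forgets it, so the map
          // does not grow without bound (see the lost-tracker discussion above).
          public synchronized void taskTrackerRemoved(String tracker) {
            stats.remove(tracker);
          }

          public synchronized int getSucceeded(String tracker, Window w) {
            Map<Window, Counts> perWindow = stats.get(tracker);
            return perWindow == null ? 0 : perWindow.get(w).succeeded;
          }
        }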

        Sharad Agarwal made changes -
        Attachment 5931_v3.patch [ 12411062 ]
        Sharad Agarwal made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12411062/5931_v3.patch
        against trunk revision 785928.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/531/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/531/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/531/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/531/console

        This message is automatically generated.

        Owen O'Malley made changes -
        Project Hadoop Common [ 12310240 ] Hadoop Map/Reduce [ 12310941 ]
        Key HADOOP-5931 MAPREDUCE-467
        Component/s mapred [ 12310690 ]
        Fix Version/s 0.21.0 [ 12313563 ]
        Sharad Agarwal added a comment -

        Updated to trunk after the project split.
        Also moved the time window list handling to StatisticsCollector from JobTrackerStatistics.

        Sharad Agarwal made changes -
        Attachment 467_v4.patch [ 12411499 ]
        Sharad Agarwal made changes -
        Attachment 467_v5.patch [ 12412043 ]
        Sharad Agarwal made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Sharad Agarwal made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Amareshwari Sriramadasu added a comment -

        Changes in JobInProgress and machines.jsp look good.
        Comments on the other code:
        1. In JobTracker.ExpireTrackers, statistics.taskTrackerRemoved(trackerName) should be called after the call to lostTaskTracker(current);
        the same call can then be removed from lostTaskTracker(current).
        2. Minor comment on JobTrackerStatistics and StatisticsCollector: all binary operators except "." should be separated from their operands by spaces, and
        a keyword followed by a parenthesis should be separated by a space, per http://java.sun.com/docs/codeconv/html/CodeConventions.doc7.html#475

        Sharad Agarwal added a comment -

        Updated the patch to trunk. Incorporated review comments.

        Sharad Agarwal made changes -
        Attachment 467_v6.patch [ 12412242 ]
        Sharad Agarwal made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Sharad Agarwal made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Amareshwari Sriramadasu added a comment -

        Patch looks fine to me

        Sharad Agarwal added a comment -

        Retrying Hudson.

        Sharad Agarwal made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Sharad Agarwal made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Iyappan Srinivasan added a comment -

        Tested these scenarios and found them to pass:

        a) Start a randomwriter job and check if all windows appear properly in the nodes section.

        Total Tasks last hour and succeeded task last hour
        Total Tasks last day and succeeded task last day
        Total Tasks since start and succeeded task since start

        b) Check after a job is run whether all tasks are captured properly in these windows. The number of tasks should be the same, and all windows need to be populated.

        c) All windows need to be refreshed after the given time window.

        d) Run simultaneous jobs and check if all windows are populated with proper values of tasks.

        e) Kill some task attempts and see if those numbers match.

        f) Run different kinds of jobs and see if the tasktracker is still able to get the number of tasks right.

        g) Kill a job in the middle and see how the tasktracker numbers are populated in these windows.

        h) Check that even after subsequent execution of jobs and the passing of time, tasks are still captured without any error.

        i) Restart job tracker and see if the tasks are captured properly.

        Sharad Agarwal added a comment -

        ant test passed except TestJobInProgressListener, TestJobTrackerRestart and TestJobTrackerRestartWithLostTracker, which are failing on trunk as well.
        test-patch passed as well:
        +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        Devaraj Das added a comment -

        Looks fine to me. The UI can be improved. For example, the metrics could be printed out in a sorted fashion by number of succeeded tasks. The UI could also have percentage information instead of absolute numbers for the succeeded tasks' metrics.

        Sharad Agarwal added a comment -

        the metrics could be printed out in a sorted fashion by number of succeeded tasks.

        I agree that the UI can be improved. Instead of making a piecemeal effort to provide sorting on succeeded tasks, I imagine that sorting capability on all the columns would be useful. Perhaps we can use some JavaScript API which can provide sorting, pagination, etc. This can be done in a follow-up jira.

        Iyappan Srinivasan added a comment -

        Additional testing done, mainly on JobTracker restart, TT restart and blacklisting. All test cases are found to pass:

        1) If a tasktracker is globally blacklisted, that tasktracker should not capture any more tasks. After coming out of blacklisting after 24 hours, it should again start accepting tasks and increasing the task numbers in its windows. : Pass

        2) After restarting a blacklisted tasktracker, it should be made healthy and continue to receive task numbers. - In a 5-node cluster. : Pass

        3) Killing a task tracker and also suspending a task tracker. - Killing a task tracker and restarting a tasktracker takes off all the information. Suspending and continuing with the task tracker retains the information. In both scenarios they continue to receive task numbers after coming back to healthy mode. : Pass

        4) JT is suspended and brought back. - The task number windows are still retained. : Pass

        5) JT restart in different scenarios:
        a) When at least 3 tasks are waiting. b) When the file size of a task is zero. c) After some jobs are completed. d) When it is restarted twice. e) When a job is 20% complete, and when a job is 50% complete. - After the job tracker restarts, all the information for the connected tasktrackers goes away. At this point job.persist is true.

        6) TT is suspended and brought back. Numbers should still be captured. : Pass

        7) When different tasks are run at the same time. - The different task tracker windows are able to capture correct numbers. : Pass

        8) Kill a task tracker and rejoin it. It should work. - Works, but the previous task-succeeded info is gone.

        9) Run jobs with different priorities and then check if this is captured properly. Also restart the JT in this scenario and check if the job is restarted properly. : Pass

        10) Changing job priority dynamically. How will it affect the task tracker's capture of tasks? - It captures them normally.

        11) In the job-level blacklisting scenario, tasks continue to be received.

        Amareshwari Sriramadasu added a comment -

        Sorry for the late comment.
        The statistics update should use tip.machineWhereTaskRan(taskid) instead of status.getTaskTracker(). Then you may have to introduce a non-null check for the lost-tracker case.

        Sharad Agarwal added a comment -

        Incorporated Amareshwari's comments.

        Sharad Agarwal made changes -
        Attachment 467_v7.patch [ 12413002 ]
        Iyappan Srinivasan added a comment -

        Tested some important scenarios and found them to pass:

        1) After restarting a blacklisted tasktracker, it should be made healthy and continue to receive task numbers. - In a 5-node cluster. : Pass

        2) After a task tracker is killed and goes out of the node list, other nodes receive these tasks and execute them. The number of tasks matches.

        3) Some task attempts are killed. The numbers captured reflect the failures properly.

        4) Do a job restart. Task trackers should start receiving tasks again and reflect it in their windows.

        5) For blacklisting scenarios, first MAPREDUCE-746 needs to be fixed.

        Amareshwari Sriramadasu added a comment -

        +1 for the patch

        Sharad Agarwal added a comment -

        I just committed this!

        Sharad Agarwal made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Fix Version/s 0.21.0 [ 12314045 ]
        Resolution Fixed [ 1 ]
        Sharad Agarwal made changes -
        Release Note Provide the ability to collect statistics about tasks completed and succeeded for each tracker in time windows. The statistics are available on the JobTracker's nodes UI page.
        Issue Type Improvement [ 4 ] New Feature [ 2 ]
        Sharad Agarwal added a comment -

        Patch for Yahoo's distribution, for branch 0.20.

        Sharad Agarwal made changes -
        Attachment 467_branch_0.20.patch [ 12413284 ]
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #21 (see http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/21/):
        Provide ability to collect statistics about total tasks and succeeded tasks in different time windows.

        Tom White made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee: Sharad Agarwal
          • Reporter: Hemanth Yamijala
          • Votes: 1
          • Watchers: 6
