Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Updates to task timing information were fixed in some code paths that led to inconsistent metering of task run times. Also, checks were added to guard against inconsistencies and record information about state if they should happen to occur.

      Description

      For some jobs the mapSlotSeconds is incorrect.

      negative value

      09/09/01 18:31:44 INFOmapred.JobInProgress$JobSummary: jobId=job_200908270718_4568,submitTime=1251823543976,launchTime=1251823554310,finishTime=1251829904565, numMaps=7965,numSlotsPerMap=1,numReduces=40,numSlotsPerReduce=1,user=wile,queue=runner,status=SUCCEEDED, mapSlotSeconds=-2503133523,reduceSlotsSeconds=186536,clusterMapCapacity=11262,clusterReduceCapacity=3754

      or too high

      09/09/02 23:59:57 INFO mapred.JobInProgress$JobSummary: jobId=job_200908270718_5861,submitTime=1251935672924,launchTime=1251935687698,finishTime=1251935997949, numMaps=1026,numSlotsPerMap=1,numReduces=10,numSlotsPerReduce=1,user=dfsload,queue=gridops,status=SUCCEEDED,
      mapSlotSeconds=1251949742,reduceSlotsSeconds=537,clusterMapCapacity=11262,clusterReduceCapacity=3754

      1. mapreduce-964-ydist-1.patch
        0.5 kB
        Sreekanth Ramakrishnan
      2. mapreduce-964-ydist.patch
        5 kB
        Sreekanth Ramakrishnan
      3. mapreduce-964-ydist.patch
        6 kB
        Sreekanth Ramakrishnan
      4. MAPREDUCE-964-4.patch
        7 kB
        Hemanth Yamijala
      5. mapreduce-964-3.patch
        6 kB
        Sreekanth Ramakrishnan
      6. mapreduce-964-2.patch
        6 kB
        Sreekanth Ramakrishnan
      7. mapreduce-964-1.patch
        2 kB
        Sreekanth Ramakrishnan

        Activity

        Hide
        Hemanth Yamijala added a comment -

        I didn't catch this miss when reviewing. My bad. Apologies.

        I applied the patch (mapreduce-964-ydist-1.patch) on top of the Yahoo! distribution and reviewed the changes once more. Now, the patches are in sync. +1 for the patch.

        Show
        Hemanth Yamijala added a comment - I didn't catch this miss when reviewing. My bad. Apologies. I applied the patch (mapreduce-964-ydist-1.patch) on top of the Yahoo! distribution and reviewed the changes once more. Now, the patches are in sync. +1 for the patch.
        Hide
        Sreekanth Ramakrishnan added a comment -

        TaskTracker changes got missed in the Y! dist patch. Attaching the TT side changes required for stopping negative values.

        Show
        Sreekanth Ramakrishnan added a comment - TaskTracker changes got missed in the Y! dist patch. Attaching the TT side changes required for stopping negative values.
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #67 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/67/)
        . Fixed start and finish times of TaskStatus to be consistent, thereby fixing inconsistencies in metering tasks. Contributed by Sreekanth Ramakrishnan.

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #67 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/67/ ) . Fixed start and finish times of TaskStatus to be consistent, thereby fixing inconsistencies in metering tasks. Contributed by Sreekanth Ramakrishnan.
        Hide
        Hemanth Yamijala added a comment -

        I committed this to trunk and branch 0.21. Thanks, Sreekanth !

        Show
        Hemanth Yamijala added a comment - I committed this to trunk and branch 0.21. Thanks, Sreekanth !
        Hide
        Hemanth Yamijala added a comment -

        The test failure is tracked on MAPREDUCE-1029 and is not related to this patch. I will commit this.

        Show
        Hemanth Yamijala added a comment - The test failure is tracked on MAPREDUCE-1029 and is not related to this patch. I will commit this.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12420557/MAPREDUCE-964-4.patch
        against trunk revision 818830.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/60/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/60/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/60/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/60/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12420557/MAPREDUCE-964-4.patch against trunk revision 818830. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/60/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/60/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/60/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/60/console This message is automatically generated.
        Hide
        Hemanth Yamijala added a comment -

        Trying Hudson again.

        Show
        Hemanth Yamijala added a comment - Trying Hudson again.
        Hide
        Hemanth Yamijala added a comment -

        The attached file fixes the Mumak test failures. All failures occurred because the change introduced in the patch enforces the condition that finish time is not set without start time. In the Mumak tests in TestSimulatorTaskTracker, CheckedEventQueue.expectReduceTask and CheckedEventQueue.expectMapTask are not setting startTime but setting the finished time. The fix was to set the startTime in both cases as follows:

        In expectMapTask:
        status.setStartTime(mapStart);

        In expectReduceTask:
        status.setStartTime(mapDone);

        With this change, I verified all Mumak tests pass.

        Show
        Hemanth Yamijala added a comment - The attached file fixes the Mumak test failures. All failures occurred because the change introduced in the patch enforces the condition that finish time is not set without start time. In the Mumak tests in TestSimulatorTaskTracker, CheckedEventQueue.expectReduceTask and CheckedEventQueue.expectMapTask are not setting startTime but setting the finished time. The fix was to set the startTime in both cases as follows: In expectMapTask: status.setStartTime(mapStart); In expectReduceTask: status.setStartTime(mapDone); With this change, I verified all Mumak tests pass.
        Hide
        Hemanth Yamijala added a comment -

        Canceling the patch to fix test failures in Mumak. The core test failure is already being tracked in MAPREDUCE-1029 and is not related to this patch.

        Show
        Hemanth Yamijala added a comment - Canceling the patch to fix test failures in Mumak. The core test failure is already being tracked in MAPREDUCE-1029 and is not related to this patch.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12420540/mapreduce-964-3.patch
        against trunk revision 818674.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/5/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/5/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/5/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/5/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12420540/mapreduce-964-3.patch against trunk revision 818674. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/5/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/5/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/5/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/5/console This message is automatically generated.
        Hide
        Sreekanth Ramakrishnan added a comment -

        Attaching patch modifying the log statement.

        Show
        Sreekanth Ramakrishnan added a comment - Attaching patch modifying the log statement.
        Hide
        Sreekanth Ramakrishnan added a comment -

        Attaching latest Y! dist patch with log statement modified and adding stackTrace in log statement, using StringUtils.stringfyException()

        Show
        Sreekanth Ramakrishnan added a comment - Attaching latest Y! dist patch with log statement modified and adding stackTrace in log statement, using StringUtils.stringfyException()
        Hide
        Hemanth Yamijala added a comment -

        Sreekanth, can you please move the LOG statement to ERROR level as discussed above. Other than that, +1 for the patch. Please make it Patch Available after making the change or run tests and test-patch locally.

        Show
        Hemanth Yamijala added a comment - Sreekanth, can you please move the LOG statement to ERROR level as discussed above. Other than that, +1 for the patch. Please make it Patch Available after making the change or run tests and test-patch locally.
        Hide
        Sreekanth Ramakrishnan added a comment -

        Attaching patch incorporating Hemanths comment.

        Show
        Sreekanth Ramakrishnan added a comment - Attaching patch incorporating Hemanths comment.
        Hide
        Sreekanth Ramakrishnan added a comment -

        Attaching Yahoo! distribution patch for the issue with Hemanth comments.

        Show
        Sreekanth Ramakrishnan added a comment - Attaching Yahoo! distribution patch for the issue with Hemanth comments.
        Hide
        Hemanth Yamijala added a comment -

        Also of course, we must always set a start time before starting a task...

        This happens.

        Sounds like things are very messy. Can we refactor the code to make this much simpler / obvious and make it more likely to be correctly maintained?

        Yes, things are messy. I would not attempt refactor in this patch. Times not being set is not the only thing that's messy. The whole status update logic needs a relook. There are duplicate data structures and states and phases meaning different things in different contexts floating all around.

        I'd call no end time a serious ERROR. Not an INFO level issue. I'd be tempted to assert... Although this might not be best to do while trying to stabilize the code. Maybe in the mainline though...

        +1.

        Show
        Hemanth Yamijala added a comment - Also of course, we must always set a start time before starting a task... This happens. Sounds like things are very messy. Can we refactor the code to make this much simpler / obvious and make it more likely to be correctly maintained? Yes, things are messy. I would not attempt refactor in this patch. Times not being set is not the only thing that's messy. The whole status update logic needs a relook. There are duplicate data structures and states and phases meaning different things in different contexts floating all around. I'd call no end time a serious ERROR. Not an INFO level issue. I'd be tempted to assert... Although this might not be best to do while trying to stabilize the code. Maybe in the mainline though... +1.
        Hide
        eric baldeschwieler added a comment -

        Also of course, we must always set a start time before starting a task...

        Sounds like things are very messy. Can we refactor the code to make this much simpler / obvious and make it more likely to be correctly maintained?

        I'd call no end time a serious ERROR. Not an INFO level issue. I'd be tempted to assert... Although this might not be best to do while trying to stabilize the code. Maybe in the mainline though...

        Show
        eric baldeschwieler added a comment - Also of course, we must always set a start time before starting a task... Sounds like things are very messy. Can we refactor the code to make this much simpler / obvious and make it more likely to be correctly maintained? I'd call no end time a serious ERROR. Not an INFO level issue. I'd be tempted to assert... Although this might not be best to do while trying to stabilize the code. Maybe in the mainline though...
        Hide
        Hemanth Yamijala added a comment -

        I looked at this patch mainly from the point of view of verifying two invariants:

        • Whenever we set a startTime in a TaskStatus, we need to set the finishTime as well.
        • A finishTime must be set only if the startTime is > 0.

        AFAIK, I think the first invariant is ensured from this patch. For the next, I tried tracing different code paths, and could see other places where the invariant was broken, though the cases identified in the bug seem to have been addressed. Rather than adding a fix in every place (which does not guarantee that the patch will continue to hold in the face of future changes), I think it is sensible to add a check in the setFinishTime and statusUpdate of TaskStatus itself, and ensure the invariant holds at the root. Let's log an INFO message when we see the invariant is not met to help us debug. Is there a good way to get the stack trace as well, which will be more useful ? Maybe Thread.currentThread().getStackTrace() will do the trick ?

        Also, if we make this change, it would be simple to write a fast unit test that ensures the invariant is being satisfied.

        Let's run MR-reliability tests on this couple of more times for sanity testing it's as good as the original fix.

        Show
        Hemanth Yamijala added a comment - I looked at this patch mainly from the point of view of verifying two invariants: Whenever we set a startTime in a TaskStatus, we need to set the finishTime as well. A finishTime must be set only if the startTime is > 0. AFAIK, I think the first invariant is ensured from this patch. For the next, I tried tracing different code paths, and could see other places where the invariant was broken, though the cases identified in the bug seem to have been addressed. Rather than adding a fix in every place (which does not guarantee that the patch will continue to hold in the face of future changes), I think it is sensible to add a check in the setFinishTime and statusUpdate of TaskStatus itself, and ensure the invariant holds at the root. Let's log an INFO message when we see the invariant is not met to help us debug. Is there a good way to get the stack trace as well, which will be more useful ? Maybe Thread.currentThread().getStackTrace() will do the trick ? Also, if we make this change, it would be simple to write a fast unit test that ensures the invariant is being satisfied. Let's run MR-reliability tests on this couple of more times for sanity testing it's as good as the original fix.
        Hide
        Sreekanth Ramakrishnan added a comment -

        Attaching a patch to fix this issue of negative value which is caused by the finish time of task status not being set during kill. The large values of the seconds was seen because finish time was set but not the start time. The reason for this was due to the kill signal was sent to an attempt which was about to be launched but at same time recv a kill signal due to the job completeion, which results in the path in runner where status of the task is checked and is found to be killed and we dont launch it but set only the finish time.

        The patch fixes the issue by setting finish time only when the start time is set and setting finish time in kill which was missing in TaskTracker.

        Running back to back reliability tests for validating the fix.

        Show
        Sreekanth Ramakrishnan added a comment - Attaching a patch to fix this issue of negative value which is caused by the finish time of task status not being set during kill. The large values of the seconds was seen because finish time was set but not the start time. The reason for this was due to the kill signal was sent to an attempt which was about to be launched but at same time recv a kill signal due to the job completeion, which results in the path in runner where status of the task is checked and is found to be killed and we dont launch it but set only the finish time. The patch fixes the issue by setting finish time only when the start time is set and setting finish time in kill which was missing in TaskTracker. Running back to back reliability tests for validating the fix.
        Hide
        Sreekanth Ramakrishnan added a comment -

        The reason for the very high map slot seconds is because of following senario,

        We hand out of a task to tracker, tracker gets lost, then we kill the task, as TT has not reported back during lost period our task start time is not updated according to JT the task has not been launched and the time and then we fail the task and the task has a finish time but not start time.

        Following is log from JT:

        2009-09-18 13:28:20,990 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200909180756_0009_m_000003_2' to tip task_200909180756_0009_m_000003, for tracker 'tracker_xxx:localhost/127.0.0.1:45507'
        2009-09-18 13:39:51,482 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_200909180756_0009_m_000003_2: Lost task tracker: tracker_xxx:localhost/127.0.0.1:45507
        2009-09-18 13:39:56,997 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_200909180756_0009_m_000003_2' to tip task_200909180756_0009_m_000003, for tracker 'tracker_xxx:localhost/127.0.0.1:60187'
        2009-09-18 13:40:00,000 INFO org.apache.hadoop.mapred.JobInProgress: TaskDebug attemptId : attempt_200909180756_0009_m_000003_2 slots : SLOTS_MILLIS_MAPS tip.numslots is: 1 difference to add : 1253281197003 status start : 0 status end time : 1253281197003
        2009-09-18 13:40:00,000 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200909180756_0009_m_000003_2' from 'tracker_xxx:localhost/127.0.0.1:60187'
        
        Show
        Sreekanth Ramakrishnan added a comment - The reason for the very high map slot seconds is because of following senario, We hand out of a task to tracker, tracker gets lost, then we kill the task, as TT has not reported back during lost period our task start time is not updated according to JT the task has not been launched and the time and then we fail the task and the task has a finish time but not start time. Following is log from JT: 2009-09-18 13:28:20,990 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200909180756_0009_m_000003_2' to tip task_200909180756_0009_m_000003, for tracker 'tracker_xxx:localhost/127.0.0.1:45507' 2009-09-18 13:39:51,482 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_200909180756_0009_m_000003_2: Lost task tracker: tracker_xxx:localhost/127.0.0.1:45507 2009-09-18 13:39:56,997 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_200909180756_0009_m_000003_2' to tip task_200909180756_0009_m_000003, for tracker 'tracker_xxx:localhost/127.0.0.1:60187' 2009-09-18 13:40:00,000 INFO org.apache.hadoop.mapred.JobInProgress: TaskDebug attemptId : attempt_200909180756_0009_m_000003_2 slots : SLOTS_MILLIS_MAPS tip.numslots is: 1 difference to add : 1253281197003 status start : 0 status end time : 1253281197003 2009-09-18 13:40:00,000 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200909180756_0009_m_000003_2' from 'tracker_xxx:localhost/127.0.0.1:60187'
        Hide
        Sreekanth Ramakrishnan added a comment -

        The reason for the negative values being present in the JobSummary results from the taskStatus not having the finishTime field not being set in the taskStatus. This happens especially when the tasks which are being killed when they just have finished, resulting in a race condition. If look at the attempt which is mentioned has actually finished and same time a kill task action has been recv for same attempt at the same time.

        Had modified the code to put in the debug statement following is example of task which upset the metering:

        2009-09-18 06:18:32,001 INFO org.apache.hadoop.mapred.JobInProgress: TaskDebug attemptId : attempt_200909180346_0004_m_000575_0 slots : SLOTS_MILLIS_MAPS tip.numslots is: 1 difference to add : -1253254705577 status start : 1253254705577 status end time : 0 
        

        TT logs :

        2009-09-18 06:18:01,200 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200909180346_0004_m_000575_0 task's state:UNASSIGNED
        2009-09-18 06:18:01,200 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200909180346_0004_m_000575_0 which needs 1 slots
        2009-09-18 06:18:01,200 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_200909180346_0004_m_000575_0 which needs 1 slots
        2009-09-18 06:18:02,750 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_200909180346_0004_m_1883468461 given task: attempt_200909180346_0004_m_000575_0
        2009-09-18 06:18:09,034 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200909180346_0004_m_000575_0 0.18553433% xxxx/my_reliability_test_input/part-00032:335544320+67108864
        2009-09-18 06:18:12,040 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200909180346_0004_m_000575_0 0.25399202% xxx/my_reliability_test_input/part-00032:335544320+67108864
        2009-09-18 06:18:15,317 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200909180346_0004_m_000575_0 0.3378629% xxx/my_reliability_test_input/part-00032:335544320+67108864
        2009-09-18 06:18:15,319 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_200909180346_0004_m_000575_0
        2009-09-18 06:18:15,319 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_200909180346_0004_m_000575_0
        2009-09-18 06:18:20,413 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200909180346_0004_m_000575_0 done; removing files.
        2009-09-18 06:18:20,415 INFO org.apache.hadoop.mapred.IndexCache: Map ID attempt_200909180346_0004_m_000575_0 not found in cache
        2009-09-18 06:18:25,511 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200909180346_0004_m_000575_0 done; removing files.
        2009-09-18 06:18:25,515 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200909180346_0004_m_000575_0 task's state:FAILED_UNCLEAN
        2009-09-18 06:18:25,516 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200909180346_0004_m_000575_0 which needs 1 slots
        2009-09-18 06:18:25,516 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_200909180346_0004_m_000575_0 which needs 1 slots
        2009-09-18 06:18:26,651 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_200909180346_0004_m_-1834354161 given task: attempt_200909180346_0004_m_000575_0
        2009-09-18 06:18:26,888 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_200909180346_0004_m_000575_0
        2009-09-18 06:18:26,888 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_200909180346_0004_m_000575_0
        2009-09-18 06:18:31,986 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200909180346_0004_m_000575_0 0.0% 
        2009-09-18 06:18:31,986 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200909180346_0004_m_000575_0 Ignoring status-update since runState: FAILED
        2009-09-18 06:18:31,989 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200909180346_0004_m_000575_0 done; removing files.
        2009-09-18 06:18:31,990 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call statusUpdate(attempt_200909180346_0004_m_000575_0, org.apache.hadoop.mapred.MapTaskStatus@4e69b84c) from 127.0.0.1:51146: output error
        2009-09-18 06:18:31,995 INFO org.apache.hadoop.mapred.IndexCache: Map ID attempt_200909180346_0004_m_000575_0 not found in cache
        2009-09-18 06:18:34,992 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200909180346_0004_m_000575_0 done; removing files.
        
        Show
        Sreekanth Ramakrishnan added a comment - The reason for the negative values being present in the JobSummary results from the taskStatus not having the finishTime field not being set in the taskStatus. This happens especially when the tasks which are being killed when they just have finished, resulting in a race condition. If look at the attempt which is mentioned has actually finished and same time a kill task action has been recv for same attempt at the same time. Had modified the code to put in the debug statement following is example of task which upset the metering: 2009-09-18 06:18:32,001 INFO org.apache.hadoop.mapred.JobInProgress: TaskDebug attemptId : attempt_200909180346_0004_m_000575_0 slots : SLOTS_MILLIS_MAPS tip.numslots is: 1 difference to add : -1253254705577 status start : 1253254705577 status end time : 0 TT logs : 2009-09-18 06:18:01,200 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200909180346_0004_m_000575_0 task's state:UNASSIGNED 2009-09-18 06:18:01,200 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200909180346_0004_m_000575_0 which needs 1 slots 2009-09-18 06:18:01,200 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_200909180346_0004_m_000575_0 which needs 1 slots 2009-09-18 06:18:02,750 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_200909180346_0004_m_1883468461 given task: attempt_200909180346_0004_m_000575_0 2009-09-18 06:18:09,034 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200909180346_0004_m_000575_0 0.18553433% xxxx/my_reliability_test_input/part-00032:335544320+67108864 2009-09-18 06:18:12,040 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200909180346_0004_m_000575_0 0.25399202% xxx/my_reliability_test_input/part-00032:335544320+67108864 2009-09-18 06:18:15,317 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200909180346_0004_m_000575_0 0.3378629% xxx/my_reliability_test_input/part-00032:335544320+67108864 2009-09-18 06:18:15,319 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_200909180346_0004_m_000575_0 2009-09-18 06:18:15,319 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_200909180346_0004_m_000575_0 2009-09-18 06:18:20,413 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200909180346_0004_m_000575_0 done; removing files. 2009-09-18 06:18:20,415 INFO org.apache.hadoop.mapred.IndexCache: Map ID attempt_200909180346_0004_m_000575_0 not found in cache 2009-09-18 06:18:25,511 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200909180346_0004_m_000575_0 done; removing files. 2009-09-18 06:18:25,515 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200909180346_0004_m_000575_0 task's state:FAILED_UNCLEAN 2009-09-18 06:18:25,516 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200909180346_0004_m_000575_0 which needs 1 slots 2009-09-18 06:18:25,516 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_200909180346_0004_m_000575_0 which needs 1 slots 2009-09-18 06:18:26,651 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_200909180346_0004_m_-1834354161 given task: attempt_200909180346_0004_m_000575_0 2009-09-18 06:18:26,888 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_200909180346_0004_m_000575_0 2009-09-18 06:18:26,888 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_200909180346_0004_m_000575_0 2009-09-18 06:18:31,986 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200909180346_0004_m_000575_0 0.0% 2009-09-18 06:18:31,986 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200909180346_0004_m_000575_0 Ignoring status-update since runState: FAILED 2009-09-18 06:18:31,989 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200909180346_0004_m_000575_0 done; removing files. 2009-09-18 06:18:31,990 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call statusUpdate(attempt_200909180346_0004_m_000575_0, org.apache.hadoop.mapred.MapTaskStatus@4e69b84c) from 127.0.0.1:51146: output error 2009-09-18 06:18:31,995 INFO org.apache.hadoop.mapred.IndexCache: Map ID attempt_200909180346_0004_m_000575_0 not found in cache 2009-09-18 06:18:34,992 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200909180346_0004_m_000575_0 done; removing files.
        Hide
        Rob Weltman added a comment -

        launchTime is frequently 0 (not really a valid timestamp) as well. When that happens, mapSlotSeconds is 0.

        Show
        Rob Weltman added a comment - launchTime is frequently 0 (not really a valid timestamp) as well. When that happens, mapSlotSeconds is 0.
        Hide
        Arun C Murthy added a comment -

        Hmm... I'm concerned we might have a race-condition causing these spurious values - I'll start digging.

        Show
        Arun C Murthy added a comment - Hmm... I'm concerned we might have a race-condition causing these spurious values - I'll start digging.

          People

          • Assignee:
            Sreekanth Ramakrishnan
            Reporter:
            Rajiv Chittajallu
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development