Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.22.0
    • Component/s: security, tasktracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hide
      Moved Task log cleanup into a separate thread in TaskTracker.
      Added configuration "mapreduce.job.userlog.retain.hours" to specify the time(in hours) for which the user-logs are to be retained after the job completion.
      Show
      Moved Task log cleanup into a separate thread in TaskTracker. Added configuration "mapreduce.job.userlog.retain.hours" to specify the time(in hours) for which the user-logs are to be retained after the job completion.

      Description

      Task logs' cleanup is being done in Child now. This is undesirable atleast for two reasons: 1) failures while cleaning up will affect the user's tasks, and 2) the task's wall time will get affected due to operations that TT actually should own.

      1. patch-927-5-dist.txt
        152 kB
        Vinod Kumar Vavilapalli
      2. patch-927-4.txt
        50 kB
        Amareshwari Sriramadasu
      3. patch-927-3.txt
        51 kB
        Amareshwari Sriramadasu
      4. patch-927-2.txt
        43 kB
        Amareshwari Sriramadasu
      5. patch-927-1.txt
        31 kB
        Amareshwari Sriramadasu
      6. patch-927.txt
        31 kB
        Amareshwari Sriramadasu

        Issue Links

          Activity

          Vinod Kumar Vavilapalli created issue -
          Hide
          Vinod Kumar Vavilapalli added a comment -

          There are deeper problems:

          • it doesn't guarantee the cleanup to be finished within the time specified by mapred.task.userlog.retain.hours.
          • Undefined behavior when multiple tasks perform the cleanup simultaneously on a single node
          Show
          Vinod Kumar Vavilapalli added a comment - There are deeper problems: it doesn't guarantee the cleanup to be finished within the time specified by mapred.task.userlog.retain.hours. Undefined behavior when multiple tasks perform the cleanup simultaneously on a single node
          Hide
          Vinod Kumar Vavilapalli added a comment -

          This is a blocker. After MAPREDUCE-842, when LinuxTaskController is in use, user's tasks run as the user and userlogs are owned by users themselves. So cleanup of userlogs of one user cannot be completed in the child jvm of another user.

          Essentially, userlogs' cleanup is completely broken with LinuxTaskController and will hasten MAPREDUCE-1100 issue. This should be fixed.

          Show
          Vinod Kumar Vavilapalli added a comment - This is a blocker. After MAPREDUCE-842 , when LinuxTaskController is in use, user's tasks run as the user and userlogs are owned by users themselves. So cleanup of userlogs of one user cannot be completed in the child jvm of another user. Essentially, userlogs' cleanup is completely broken with LinuxTaskController and will hasten MAPREDUCE-1100 issue. This should be fixed.
          Vinod Kumar Vavilapalli made changes -
          Field Original Value New Value
          Fix Version/s 0.21.0 [ 12314045 ]
          Affects Version/s 0.21.0 [ 12314045 ]
          Priority Major [ 3 ] Blocker [ 1 ]
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Argh... the task-logs cleanup is completely broken. One more issue:

          • The configuration mapred.task.userlog.retain.hours needed for cleaning up is picked up from the job whose task JVM is doing the cleanup and not the original task.

          Side issue: mapred.task.userlog.retain.hours is not renamed according to new conventions. MAPREDUCE-849 missed this.

          Show
          Vinod Kumar Vavilapalli added a comment - Argh... the task-logs cleanup is completely broken. One more issue: The configuration mapred.task.userlog.retain.hours needed for cleaning up is picked up from the job whose task JVM is doing the cleanup and not the original task. Side issue: mapred.task.userlog.retain.hours is not renamed according to new conventions. MAPREDUCE-849 missed this.
          Hide
          Amareshwari Sriramadasu added a comment -

          I'm just thinking that mapred.task.userlog.retain.hours is not a job-level parameter. It should be TaskTracker's parameter and TaskTracker should cleanup the logs after mapred.task.userlog.retain.hours after the job completion.

          Show
          Amareshwari Sriramadasu added a comment - I'm just thinking that mapred.task.userlog.retain.hours is not a job-level parameter. It should be TaskTracker's parameter and TaskTracker should cleanup the logs after mapred.task.userlog.retain.hours after the job completion.
          Vinod Kumar Vavilapalli made changes -
          Link This issue is part of MAPREDUCE-1100 [ MAPREDUCE-1100 ]
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Will be incorporated by MAPREDUCE-1100.

          Show
          Vinod Kumar Vavilapalli added a comment - Will be incorporated by MAPREDUCE-1100 .
          Vinod Kumar Vavilapalli made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Duplicate [ 3 ]
          Amareshwari Sriramadasu made changes -
          Resolution Duplicate [ 3 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Amareshwari Sriramadasu made changes -
          Parent MAPREDUCE-1100 [ 12437951 ]
          Issue Type Bug [ 1 ] Sub-task [ 7 ]
          Hide
          Amareshwari Sriramadasu added a comment -

          I would propose the following to solve this:

          • Task logs directory hierarchy is currently userlogs/attemptid. Instead, it should be userlogs/jobid/attemptid.
          • TaskTracker has a thread for deleting task logs. Whenever TaskTracker gets a KillJobAction, it adds an entry (jobid, timestamp) to delete tasklogs i.e. (jobId, jobCompletionTime + userLogRetainsHours). userlogs/jobid directory will be deleted at/after the timestamp (jobCompletionTime + userLogretainshours). This says user logs will be maintained for userLogRetainsHours duration after the job completion.

          We can have userLogRetainsHours as a TaskTracker's parameter, since it controls the disk space of the tracker. Or we can have it as a job level parameter as it is now, since they are maintained for user to access. If it is a job level parameter, we would need a TaskTracker parameter as an upper bound for the job configuration to control the disk space. Thoughts?

          Show
          Amareshwari Sriramadasu added a comment - I would propose the following to solve this: Task logs directory hierarchy is currently userlogs/attemptid. Instead, it should be userlogs/jobid/attemptid. TaskTracker has a thread for deleting task logs. Whenever TaskTracker gets a KillJobAction, it adds an entry (jobid, timestamp) to delete tasklogs i.e. (jobId, jobCompletionTime + userLogRetainsHours). userlogs/jobid directory will be deleted at/after the timestamp (jobCompletionTime + userLogretainshours). This says user logs will be maintained for userLogRetainsHours duration after the job completion. We can have userLogRetainsHours as a TaskTracker's parameter, since it controls the disk space of the tracker. Or we can have it as a job level parameter as it is now, since they are maintained for user to access. If it is a job level parameter, we would need a TaskTracker parameter as an upper bound for the job configuration to control the disk space. Thoughts?
          Amareshwari Sriramadasu made changes -
          Component/s security [ 12313041 ]
          Hide
          Hemanth Yamijala added a comment -

          This says user logs will be maintained for userLogRetainsHours duration after the job completion.

          It seems reasonable to change the semantics of userLogRetainsHours to start counting after job completion, instead of after task completion. The basic use case for retaining logs is for purposes such as debugging, log collection and analysis, etc. For all of these use cases, it seems like having all the logs of the job is a more reasonable system behavior than having just a few tasks' logs. The only catch we need to be aware of is that this will cause an increase in the demand on disk space because overall logs will be retained for a longer period of time. I doubt if this is a big concern though with the following assumptions:

          • Mostly tasks of a job finish within reasonable amounts of time; long tails that will really make a difference for log space used (hours instead of minutes for a few tasks) can be considered rare.
          • With the proposals of MAPREDUCE-1100, we have ways of controlling the amount of logs retained and can use that to offset the increased demand for space.

          If it is a job level parameter, we would need a TaskTracker parameter as an upper bound for the job configuration to control the disk space.

          I would not take this approach. It is high time Configuration provides a mechanism to define allowed ranges for configuration items with bounded values. For the time being, I would rather take an approach where administrators can lock down the value of userLogRetainHours to a final value if this is required.

          Show
          Hemanth Yamijala added a comment - This says user logs will be maintained for userLogRetainsHours duration after the job completion. It seems reasonable to change the semantics of userLogRetainsHours to start counting after job completion, instead of after task completion. The basic use case for retaining logs is for purposes such as debugging, log collection and analysis, etc. For all of these use cases, it seems like having all the logs of the job is a more reasonable system behavior than having just a few tasks' logs. The only catch we need to be aware of is that this will cause an increase in the demand on disk space because overall logs will be retained for a longer period of time. I doubt if this is a big concern though with the following assumptions: Mostly tasks of a job finish within reasonable amounts of time; long tails that will really make a difference for log space used (hours instead of minutes for a few tasks) can be considered rare. With the proposals of MAPREDUCE-1100 , we have ways of controlling the amount of logs retained and can use that to offset the increased demand for space. If it is a job level parameter, we would need a TaskTracker parameter as an upper bound for the job configuration to control the disk space. I would not take this approach. It is high time Configuration provides a mechanism to define allowed ranges for configuration items with bounded values. For the time being, I would rather take an approach where administrators can lock down the value of userLogRetainHours to a final value if this is required.
          Amareshwari Sriramadasu made changes -
          Assignee Amareshwari Sriramadasu [ amareshwari ]
          Hide
          Amareshwari Sriramadasu added a comment -

          With the current proposal, we found two things that need an answer.

          1. Memory footprint of the TaskTracker: Each map entry (JobID, Long) would take about 40 bytes. If the userLogRetainsHours is configured to 7days and there are 1lakh job's tasks run by a TaskTracker in a day, the map would take up 28MB of memory. I guess this memory footprint is fine compared to persisting the same information to disk and reading it back and forth from disk until the directory is removed.
          2. If TaskTracker is reinited/ restarted and a job completed when the TaskTracker was down, then TaskTracker would not get a KillJobAction for the job. Then we can keep the userlogs for default userLogRetainsHours, after the reinit/restart.

          Thoughts?

          Show
          Amareshwari Sriramadasu added a comment - With the current proposal, we found two things that need an answer. Memory footprint of the TaskTracker: Each map entry (JobID, Long) would take about 40 bytes. If the userLogRetainsHours is configured to 7days and there are 1lakh job's tasks run by a TaskTracker in a day, the map would take up 28MB of memory. I guess this memory footprint is fine compared to persisting the same information to disk and reading it back and forth from disk until the directory is removed. If TaskTracker is reinited/ restarted and a job completed when the TaskTracker was down, then TaskTracker would not get a KillJobAction for the job. Then we can keep the userlogs for default userLogRetainsHours, after the reinit/restart. Thoughts?
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch implementing the above proposal.

          Show
          Amareshwari Sriramadasu added a comment - Patch implementing the above proposal.
          Amareshwari Sriramadasu made changes -
          Attachment patch-927.txt [ 12436285 ]
          Amareshwari Sriramadasu made changes -
          Status Reopened [ 4 ] Patch Available [ 10002 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12436285/patch-927.txt
          against trunk revision 911519.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/329/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/329/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/329/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/329/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12436285/patch-927.txt against trunk revision 911519. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/329/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/329/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/329/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/329/console This message is automatically generated.
          Amareshwari Sriramadasu made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Amareshwari Sriramadasu added a comment -

          Fixed javadoc and findbugs warning

          Show
          Amareshwari Sriramadasu added a comment - Fixed javadoc and findbugs warning
          Amareshwari Sriramadasu made changes -
          Attachment patch-927-1.txt [ 12436547 ]
          Amareshwari Sriramadasu made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12436547/patch-927-1.txt
          against trunk revision 912471.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/471/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/471/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/471/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/471/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12436547/patch-927-1.txt against trunk revision 912471. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/471/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/471/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/471/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/471/console This message is automatically generated.
          Vinod Kumar Vavilapalli made changes -
          Link This issue is part of MAPREDUCE-1100 [ MAPREDUCE-1100 ]
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Looked at the patch. Have some comments:

          TaskLog.java

          • Creation of the localFS should not be in a static block. In the past also, we did this and realized it creates a kind of circular initialization of loggers and results in NPE while creating log objects which can be seen in task-logs. The current way of creation of localFS should be retained.
          • We can move DEFAULT_USER_LOG_RETAIN_HOURS to TaskLogsCleanupThread.
          • Shall we rename getJobUserLogDir() to a simpler {{getJobDir()? And getBaseDir(String) to getAttemptDir(String) to be clear? I think it's ok like this as TaskLog.getJobDir() clearly means it is job-dir for logs.
          • Also, I think it's high time TaskLog.java is made @InterfaceAudience.Private.

          TaskLogCleanupThread

          • Rename the class to TaskLogsMonitor, so that we are consistent going forward with MAPREDUCE-1100.
          • Set the audience visibility of the class to private?
          • threadSleepTime is not configurable. May not be a public documented configuration, but still we need one.
          • Constructor: Shouldn't the volume on which the disk-service works be getUserLogsDir() instead of getBaseLogDir()? The correctness is not lost with the current patch as we are always passing absolute paths to the disk-service, but I think we should change it anyways. Also can't we simply construct a local-filesystem here itself, instead of calling TaskLog.getLocalFileSystem()? This will mostly avoid your changes regarding my first comment in TaskLog.java above.
          • For the sake of correctness, in removeOldUserLogs(), the job should be removed only after the deletion of the log file.
          • Throughout the class, we should use a clock instead of directly using System.currentMillis(). This will better the testing.
          • Shouldn't clearOldUserLogs() be done as part of constructor itself? This is the pattern that MRAsyncDiskService uses, for example.

          TaskTracker.java

          • The logscleanup thread is not joined/killed in the TaskTracker.close(). So, there will be zombie threads in the system on a re-init and may well interfere with the new thread.
          • Shouldn't we be setting 770-user-mapred permissions on userlogs/$jobid as part of job-localization? Granted 711 is enough for now, but this slightly deviates from the current security on the directories - we always have the most secure permissions possible on the files/dirs. If we do this, TestLocalizationWithLinuxTaskController should also test the permissions of the joblogdir.
          • The following code fragment at +1035 can be moved out of localizeJobFiles() into a new method initializeJobLogDir() and can be called directly from localizeJob().
              taskLogCleanupThread.removeJobFromLogDeletion(jobId);
              localizer.initializeJobUserLogDir(jobId);
            
          • One case is still not handled: TT reinits while the Job is still running. After re-init, no tasks of the running job arrive at this TT. Retain-hours after re-init TT removes the job's tasks' logs even though the job is still running elsewhere. Doing this will need TT to specially communicate with JT and from what I understand, we are not doing it here. If that is the case, can we simply add a test for this too in TestUserLogCleanup?

          TestUserLogCleanup

          • Use TaskTracker.initializeJobLogDir() mentioned above in setupJobLogDirectory().
          • Once we use a clock in TaskLogCleanupThread,
            • the tests can be modified to use this.
            • we can test what exactly the modified retainTimeStamp is after re-init.
          • We should create some attempt dirs also in the joblog dir with appropriate permissions and verify the proper cleanup.
          • Document(javadoc) the two tests?

          test-taskcontroller.c

          • This needs to be fixed to reflect the new directory hierarchy of the logs.

          mapred_tutorial.xml

          • Should TaskLogs section be changed to explicitly specify the new directory hierarchy?

          Cancelling the patch for the sake of these changes.

          Show
          Vinod Kumar Vavilapalli added a comment - Looked at the patch. Have some comments: TaskLog.java Creation of the localFS should not be in a static block. In the past also, we did this and realized it creates a kind of circular initialization of loggers and results in NPE while creating log objects which can be seen in task-logs. The current way of creation of localFS should be retained. We can move DEFAULT_USER_LOG_RETAIN_HOURS to TaskLogsCleanupThread . Shall we rename getJobUserLogDir() to a simpler {{getJobDir() ? And getBaseDir(String) to getAttemptDir(String) to be clear? I think it's ok like this as TaskLog.getJobDir() clearly means it is job-dir for logs. Also, I think it's high time TaskLog.java is made @InterfaceAudience.Private. TaskLogCleanupThread Rename the class to TaskLogsMonitor , so that we are consistent going forward with MAPREDUCE-1100 . Set the audience visibility of the class to private? threadSleepTime is not configurable. May not be a public documented configuration, but still we need one. Constructor: Shouldn't the volume on which the disk-service works be getUserLogsDir() instead of getBaseLogDir() ? The correctness is not lost with the current patch as we are always passing absolute paths to the disk-service, but I think we should change it anyways. Also can't we simply construct a local-filesystem here itself, instead of calling TaskLog.getLocalFileSystem()? This will mostly avoid your changes regarding my first comment in TaskLog.java above. For the sake of correctness, in removeOldUserLogs() , the job should be removed only after the deletion of the log file. Throughout the class, we should use a clock instead of directly using System.currentMillis() . This will better the testing. Shouldn't clearOldUserLogs() be done as part of constructor itself? This is the pattern that MRAsyncDiskService uses, for example. TaskTracker.java The logscleanup thread is not joined/killed in the TaskTracker.close() . So, there will be zombie threads in the system on a re-init and may well interfere with the new thread. Shouldn't we be setting 770-user-mapred permissions on userlogs/$jobid as part of job-localization? Granted 711 is enough for now, but this slightly deviates from the current security on the directories - we always have the most secure permissions possible on the files/dirs. If we do this, TestLocalizationWithLinuxTaskController should also test the permissions of the joblogdir. The following code fragment at +1035 can be moved out of localizeJobFiles() into a new method initializeJobLogDir() and can be called directly from localizeJob() . taskLogCleanupThread.removeJobFromLogDeletion(jobId); localizer.initializeJobUserLogDir(jobId); One case is still not handled: TT reinits while the Job is still running. After re-init, no tasks of the running job arrive at this TT. Retain-hours after re-init TT removes the job's tasks' logs even though the job is still running elsewhere. Doing this will need TT to specially communicate with JT and from what I understand, we are not doing it here. If that is the case, can we simply add a test for this too in TestUserLogCleanup ? TestUserLogCleanup Use TaskTracker.initializeJobLogDir() mentioned above in setupJobLogDirectory() . Once we use a clock in TaskLogCleanupThread, the tests can be modified to use this. we can test what exactly the modified retainTimeStamp is after re-init. We should create some attempt dirs also in the joblog dir with appropriate permissions and verify the proper cleanup. Document(javadoc) the two tests? test-taskcontroller.c This needs to be fixed to reflect the new directory hierarchy of the logs. mapred_tutorial.xml Should TaskLogs section be changed to explicitly specify the new directory hierarchy? Cancelling the patch for the sake of these changes.
          Vinod Kumar Vavilapalli made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Amareshwari Sriramadasu added a comment -

          Creation of the localFS should not be in a static block. In the past also, we did this and realized it creates a kind of circular initialization of loggers and results in NPE while creating log objects which can be seen in task-logs.

          Can you elaborate this? How will static initialization code result into circular initialization of loggers? How will this result into NPE?

          Also can't we simply construct a local-filesystem here itself, instead of calling TaskLog.getLocalFileSystem()?

          I'm getting the log file system through TaskLog.getLogFileSystem() to make sure that it is the same FileSystem used in TaskLog creation and deletion.

          Rename the class to TaskLogsMonitor, so that we are consistent going forward with MAPREDUCE-1100.

          The logscleanup thread is not joined/killed in the TaskTracker.close(). So, there will be zombie threads in the system on a re-init and may well interfere with the new thread.

          TaskLogMonitor introduced in MAPREDUCE-1100 is not a daemon thread, but TaskLogCleanupThread needs to be. I don't think we can merge the functionality of TaskLogCleanupThread and TaskLogMonitor into one thread.
          TaskLogCleanupThread is not started after every re-init. It is started only once similar to taskCleanupThread, directoryCleanupThread. So, this should not be joined in TaskTracker.close().

          Show
          Amareshwari Sriramadasu added a comment - Creation of the localFS should not be in a static block. In the past also, we did this and realized it creates a kind of circular initialization of loggers and results in NPE while creating log objects which can be seen in task-logs. Can you elaborate this? How will static initialization code result into circular initialization of loggers? How will this result into NPE? Also can't we simply construct a local-filesystem here itself, instead of calling TaskLog.getLocalFileSystem()? I'm getting the log file system through TaskLog.getLogFileSystem() to make sure that it is the same FileSystem used in TaskLog creation and deletion. Rename the class to TaskLogsMonitor, so that we are consistent going forward with MAPREDUCE-1100 . The logscleanup thread is not joined/killed in the TaskTracker.close(). So, there will be zombie threads in the system on a re-init and may well interfere with the new thread. TaskLogMonitor introduced in MAPREDUCE-1100 is not a daemon thread, but TaskLogCleanupThread needs to be. I don't think we can merge the functionality of TaskLogCleanupThread and TaskLogMonitor into one thread. TaskLogCleanupThread is not started after every re-init. It is started only once similar to taskCleanupThread, directoryCleanupThread. So, this should not be joined in TaskTracker.close().
          Hide
          Amareshwari Sriramadasu added a comment -

          The logscleanup thread is not joined/killed in the TaskTracker.close(). So, there will be zombie threads in the system on a re-init and may well interfere with the new thread.

          I just realized that the thread is initialized in intialize() method. It should be moved to constructor.

          Show
          Amareshwari Sriramadasu added a comment - The logscleanup thread is not joined/killed in the TaskTracker.close(). So, there will be zombie threads in the system on a re-init and may well interfere with the new thread. I just realized that the thread is initialized in intialize() method. It should be moved to constructor.
          Hide
          Ravi Gummadi added a comment -

          >Creation of the localFS should not be in a static block. In the past also, we did this and realized it creates a kind of circular initialization of loggers and results in NPE while creating log objects which can be seen in task-logs.
          >>Can you elaborate this? How will static initialization code result into circular initialization of loggers? How will this result into NPE?

          LogFactory.getLog() in Configuration.java goes through log4j.Logger.getLogger() and calls activateOptions() that again calls TaskLog.getTaskLogFile(). Since TaskLog.getTaskLogFile() is a static method, the static block in TaskLog.java gets executed and that calls new Configuration(). Thus, from static LOG = LogFactory.getLog(Configuration.class) to we came back to constructor of Configuration() and now the constructor is getting executed(even though the static blocks of Configuration.java are not finished execution). The constructor of Configuration() was doing LOG.isDebugEnabled()

          {.....}

          and LOG is null now and thus resulting in NPE.
          Hope that helps.

          Show
          Ravi Gummadi added a comment - >Creation of the localFS should not be in a static block. In the past also, we did this and realized it creates a kind of circular initialization of loggers and results in NPE while creating log objects which can be seen in task-logs. >>Can you elaborate this? How will static initialization code result into circular initialization of loggers? How will this result into NPE? LogFactory.getLog() in Configuration.java goes through log4j.Logger.getLogger() and calls activateOptions() that again calls TaskLog.getTaskLogFile(). Since TaskLog.getTaskLogFile() is a static method, the static block in TaskLog.java gets executed and that calls new Configuration(). Thus, from static LOG = LogFactory.getLog(Configuration.class) to we came back to constructor of Configuration() and now the constructor is getting executed(even though the static blocks of Configuration.java are not finished execution). The constructor of Configuration() was doing LOG.isDebugEnabled() {.....} and LOG is null now and thus resulting in NPE. Hope that helps.
          Hide
          Amareshwari Sriramadasu added a comment -

          Thanks Ravi for the explanation. I have undone the change.

          Shouldn't clearOldUserLogs() be done as part of constructor itself?

          No. Again for the same reason that the thread is a daemon thread and is constructed only once.

          Rename the class to TaskLogsMonitor, so that we are consistent going forward with MAPREDUCE-1100.

          Renamed it to UserLogCleaner.

          We should create some attempt dirs also in the joblog dir with appropriate permissions and verify the proper cleanup.

          All this code path is tested with proper permissions in TestTaskTrackerLocalization and TestLocalizationWIthLinuxTaskController.

          Should TaskLogs section be changed to explicitly specify the new directory hierarchy?

          Did not do this. I don't think we want to publish the directory hierarchy for users in this jira.

          Patch incorporates almost all the review comments.

          Show
          Amareshwari Sriramadasu added a comment - Thanks Ravi for the explanation. I have undone the change. Shouldn't clearOldUserLogs() be done as part of constructor itself? No. Again for the same reason that the thread is a daemon thread and is constructed only once. Rename the class to TaskLogsMonitor, so that we are consistent going forward with MAPREDUCE-1100 . Renamed it to UserLogCleaner. We should create some attempt dirs also in the joblog dir with appropriate permissions and verify the proper cleanup. All this code path is tested with proper permissions in TestTaskTrackerLocalization and TestLocalizationWIthLinuxTaskController. Should TaskLogs section be changed to explicitly specify the new directory hierarchy? Did not do this. I don't think we want to publish the directory hierarchy for users in this jira. Patch incorporates almost all the review comments.
          Amareshwari Sriramadasu made changes -
          Attachment patch-927-2.txt [ 12437694 ]
          Amareshwari Sriramadasu made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Fix Version/s 0.22.0 [ 12314184 ]
          Fix Version/s 0.21.0 [ 12314045 ]
          Hide
          Amareshwari Sriramadasu added a comment -

          All the LinuxTaskController tests passed with the patch.

          Show
          Amareshwari Sriramadasu added a comment - All the LinuxTaskController tests passed with the patch.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12437694/patch-927-2.txt
          against trunk revision 918864.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 17 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/17/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/17/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/17/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/17/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12437694/patch-927-2.txt against trunk revision 918864. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 17 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/17/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/17/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/17/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/17/console This message is automatically generated.
          Amareshwari Sriramadasu made changes -
          Link This issue is blocked by MAPREDUCE-1553 [ MAPREDUCE-1553 ]
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Looks close. Two major comments:

          • Job's log directory need not be 770 in all cases. It can be 570 when the job-submitter is not the same as the TT user. So, this needs a merge with MAPREDUCE-890(which is close to the finish line).
          • After this patch, with a job specific directory for logs, jobacls.xml needs to be moved into job-dir. Will file a separate JIRA issue for this.

          Minor code level comments follow:

          UserLogCleaner

          • The comment for DEFAULT_THREAD_SLEEP_TIME should read 1 hour instead of 1day.
          • I think we can name the APIs in this class a bit better. removeJobFromLogDeletion can be unMarkJobFromLogDeletion? similarly addJobLogsForDeletion as markJobForLogDeletion? removeOldUserLogs to processCompletedJobs?
          • Why maintain two different APIs - addJobLogsForDeletion and addJobForLogDeletion? These two can be merged?
          • In run() method, sleep should be moved into a finally block?
          • addJobLogsForDeletion() should not itself take jobcompletion time, it should instead be taken from clock..

          Miscellaneous

          • Localizer.initializeJobUserLogDir() can be renamed to initializeJobLogDir() and can be helped with some javadoc
          • I actually meant moving TaskTracker.initializeJobLogDir out of localizeJobFiles into localizeJob() itself. Is it possible? Also comment inside this method needs grammatical fixes.
          • No need for documenting mapreduce.tasktracker.userlogcleanup.sleeptime ?
          • TestTaskTrackerLocalization.verifyUserLogsCleanup(): shouldn't you be setting the clock and use an inline cleaner? This is mainly to avoid parallel execution by the thread and the test-case. Otherwise it is possible that we run in errors and test failures .

          TestUserLogsCleanup

          • Should use clock and inline clear similar to the last comment above
          • Can directly use UtilsForTest.FakeClock instead of another FakeClock.
          • Can we augment the current tests to check that TOBEDELETED isn't actually deleted?
          • Rename the two tests appropriately?
          • Should use clock.advance() to improve testUserLogCleanup().
          Show
          Vinod Kumar Vavilapalli added a comment - Looks close. Two major comments: Job's log directory need not be 770 in all cases. It can be 570 when the job-submitter is not the same as the TT user. So, this needs a merge with MAPREDUCE-890 (which is close to the finish line). After this patch, with a job specific directory for logs, jobacls.xml needs to be moved into job-dir. Will file a separate JIRA issue for this. Minor code level comments follow: UserLogCleaner The comment for DEFAULT_THREAD_SLEEP_TIME should read 1 hour instead of 1day. I think we can name the APIs in this class a bit better. removeJobFromLogDeletion can be unMarkJobFromLogDeletion? similarly addJobLogsForDeletion as markJobForLogDeletion? removeOldUserLogs to processCompletedJobs? Why maintain two different APIs - addJobLogsForDeletion and addJobForLogDeletion? These two can be merged? In run() method, sleep should be moved into a finally block? addJobLogsForDeletion() should not itself take jobcompletion time, it should instead be taken from clock.. Miscellaneous Localizer.initializeJobUserLogDir() can be renamed to initializeJobLogDir() and can be helped with some javadoc I actually meant moving TaskTracker.initializeJobLogDir out of localizeJobFiles into localizeJob() itself. Is it possible? Also comment inside this method needs grammatical fixes. No need for documenting mapreduce.tasktracker.userlogcleanup.sleeptime ? TestTaskTrackerLocalization.verifyUserLogsCleanup() : shouldn't you be setting the clock and use an inline cleaner? This is mainly to avoid parallel execution by the thread and the test-case. Otherwise it is possible that we run in errors and test failures . TestUserLogsCleanup Should use clock and inline clear similar to the last comment above Can directly use UtilsForTest.FakeClock instead of another FakeClock. Can we augment the current tests to check that TOBEDELETED isn't actually deleted? Rename the two tests appropriately? Should use clock.advance() to improve testUserLogCleanup().
          Amareshwari Sriramadasu made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch merged with trunk and incorporates most of the comments.

          addJobLogsForDeletion() should not itself take jobcompletion time, it should instead be taken from clock.

          I think it makes sense to pass this as a parameter. In future, we should get job completion time from JobTracker.

          TestTaskTrackerLocalization.verifyUserLogsCleanup(): shouldn't you be setting the clock and use an inline cleaner?

          Should use clock and inline clear similar to the last comment above

          Tests already use inline cleanup.

          Show
          Amareshwari Sriramadasu added a comment - Patch merged with trunk and incorporates most of the comments. addJobLogsForDeletion() should not itself take jobcompletion time, it should instead be taken from clock. I think it makes sense to pass this as a parameter. In future, we should get job completion time from JobTracker. TestTaskTrackerLocalization.verifyUserLogsCleanup(): shouldn't you be setting the clock and use an inline cleaner? Should use clock and inline clear similar to the last comment above Tests already use inline cleanup.
          Amareshwari Sriramadasu made changes -
          Attachment patch-927-3.txt [ 12438174 ]
          Amareshwari Sriramadasu made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12438174/patch-927-3.txt
          against trunk revision 920250.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 17 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/506/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/506/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/506/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/506/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12438174/patch-927-3.txt against trunk revision 920250. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 17 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/506/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/506/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/506/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/506/console This message is automatically generated.
          Hide
          Amareshwari Sriramadasu added a comment -

          Test failure TestBadRecords is not related to the patch. Test failed because of many ZipException and NoClassDefFoundError.
          log for the exception:

          2010-03-08 11:30:53,708 WARN  mapred.TaskTracker (TaskTracker.java:startNewTask(2223)) - Error initializing attempt_20100308113025258_0001_m_000000_1:
          java.lang.RuntimeException: java.util.zip.ZipException: error reading zip file
          	at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1715)
          	at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1529)
          	at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1475)
          	at org.apache.hadoop.conf.Configuration.get(Configuration.java:564)
          	at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:1892)
          	at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:397)
          	at org.apache.hadoop.mapred.TaskTracker.localizeJobFiles(TaskTracker.java:1035)
          	at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:961)
          	at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2218)
          	at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2183)
          Caused by: java.util.zip.ZipException: error reading zip file
          	at java.util.zip.ZipFile.read(Native Method)
          	at java.util.zip.ZipFile.access$1200(ZipFile.java:29)
          	at java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:447)
          	at java.util.zip.ZipFile$1.fill(ZipFile.java:230)
          	at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
          	at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:105)
          	at java.io.FilterInputStream.read(FilterInputStream.java:66)
          	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager$RewindableInputStream.read(XMLEntityManager.java:2910)
          	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:704)
          	at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186)
          	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
          	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
          	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107)
          	at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:225)
          	at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:283)
          	at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
          	at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1628)
          	... 9 more
          

          The test passed on my machine.

          Show
          Amareshwari Sriramadasu added a comment - Test failure TestBadRecords is not related to the patch. Test failed because of many ZipException and NoClassDefFoundError. log for the exception: 2010-03-08 11:30:53,708 WARN mapred.TaskTracker (TaskTracker.java:startNewTask(2223)) - Error initializing attempt_20100308113025258_0001_m_000000_1: java.lang.RuntimeException: java.util.zip.ZipException: error reading zip file at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1715) at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1529) at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1475) at org.apache.hadoop.conf.Configuration.get(Configuration.java:564) at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:1892) at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:397) at org.apache.hadoop.mapred.TaskTracker.localizeJobFiles(TaskTracker.java:1035) at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:961) at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2218) at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2183) Caused by: java.util.zip.ZipException: error reading zip file at java.util.zip.ZipFile.read(Native Method) at java.util.zip.ZipFile.access$1200(ZipFile.java:29) at java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:447) at java.util.zip.ZipFile$1.fill(ZipFile.java:230) at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141) at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:105) at java.io.FilterInputStream.read(FilterInputStream.java:66) at com.sun.org.apache.xerces.internal.impl.XMLEntityManager$RewindableInputStream.read(XMLEntityManager.java:2910) at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:704) at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107) at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:225) at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:283) at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180) at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1628) ... 9 more The test passed on my machine.
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch improves the testcase TestUserLogCleanup. Modifies comment in task-controller.c.
          No other changes.

          Modified test passed successfully.
          All linux task controller tests passed except TestStreamingAsDifferentUser (due to MAPREDUCE-1573)

          Show
          Amareshwari Sriramadasu added a comment - Patch improves the testcase TestUserLogCleanup. Modifies comment in task-controller.c. No other changes. Modified test passed successfully. All linux task controller tests passed except TestStreamingAsDifferentUser (due to MAPREDUCE-1573 )
          Amareshwari Sriramadasu made changes -
          Attachment patch-927-4.txt [ 12438257 ]
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Patch looks real good now. +1. Will check this in after running a couple of sanity tests..

          Show
          Vinod Kumar Vavilapalli added a comment - Patch looks real good now. +1. Will check this in after running a couple of sanity tests..
          Hide
          Vinod Kumar Vavilapalli added a comment -

          I've just committed this. Thanks Amareshwari!

          Show
          Vinod Kumar Vavilapalli added a comment - I've just committed this. Thanks Amareshwari!
          Vinod Kumar Vavilapalli made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          Hide
          Amareshwari Sriramadasu added a comment -

          Not fixing it for 0.21 now since it needs more work. It can be fixed if anybody needs it for 0.21.
          Changing the priority to Major.

          Show
          Amareshwari Sriramadasu added a comment - Not fixing it for 0.21 now since it needs more work. It can be fixed if anybody needs it for 0.21. Changing the priority to Major.
          Amareshwari Sriramadasu made changes -
          Release Note Moved Task log cleanup into a separate thread in TaskTracker.
          Added configuration "mapreduce.job.userlog.retain.hours" to specify the time(in hours) for which the user-logs are to be retained after the job completion.
          Priority Blocker [ 1 ] Major [ 3 ]
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #253 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/253/)
          . Cleanup of task-logs should happen in TaskTracker instead of the Child. Contributed by Amareshwari Sriramadasu.

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #253 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/253/ ) . Cleanup of task-logs should happen in TaskTracker instead of the Child. Contributed by Amareshwari Sriramadasu.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #270 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/270/)

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #270 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/270/ )
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Attaching patch for ydist on behalf of Amareshwari. This one refactors some parts of the ydist patch for truncating logs feature in MAPREDUCE-1100 and so should be applied over that patch.

          This patch is not for commit to any of the apache branches.

          Show
          Vinod Kumar Vavilapalli added a comment - Attaching patch for ydist on behalf of Amareshwari. This one refactors some parts of the ydist patch for truncating logs feature in MAPREDUCE-1100 and so should be applied over that patch. This patch is not for commit to any of the apache branches.
          Vinod Kumar Vavilapalli made changes -
          Attachment patch-927-5-dist.txt [ 12439009 ]
          Tom White made changes -
          Fix Version/s 0.21.0 [ 12314045 ]
          Fix Version/s 0.22.0 [ 12314184 ]
          Tom White made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hide
          Vinod Kumar Vavilapalli added a comment -

          As Amareshwari mentioned before, this wasn't committed to 0.21 as it was more work. Changing fix version.

          Show
          Vinod Kumar Vavilapalli added a comment - As Amareshwari mentioned before, this wasn't committed to 0.21 as it was more work. Changing fix version.
          Vinod Kumar Vavilapalli made changes -
          Fix Version/s 0.22.0 [ 12314184 ]
          Fix Version/s 0.21.0 [ 12314045 ]
          Hide
          Todd Lipcon added a comment -

          With this patch I've seen occasional NPEs in jvmFinished - it seems initalContext.task is ending up null. Have you guys seen this as well? My current theory is that it's because jvmToRunningTask is accessed without synchronization in JvmRunner.runChild, so if that HashMap.get() overlaps with a rehashing due to a concurrent put, it will return a false null. My reasoning is that when I've seen this happen it's within a couple ms of another task being spawned or removed.

          Also, what's the status of this for 22? Should it be considered a blocker?

          Show
          Todd Lipcon added a comment - With this patch I've seen occasional NPEs in jvmFinished - it seems initalContext.task is ending up null. Have you guys seen this as well? My current theory is that it's because jvmToRunningTask is accessed without synchronization in JvmRunner.runChild, so if that HashMap.get() overlaps with a rehashing due to a concurrent put, it will return a false null. My reasoning is that when I've seen this happen it's within a couple ms of another task being spawned or removed. Also, what's the status of this for 22? Should it be considered a blocker?
          Hide
          Todd Lipcon added a comment -

          Er, sorry, I see now that it was committed to trunk, I was confused by the above comment by Vinod saying it wasn't put in 21. I'll see if the synchronization bug is also present in trunk (NPEs I mentioned were seen in the YDH backport)

          Show
          Todd Lipcon added a comment - Er, sorry, I see now that it was committed to trunk, I was confused by the above comment by Vinod saying it wasn't put in 21. I'll see if the synchronization bug is also present in trunk (NPEs I mentioned were seen in the YDH backport)
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Resolved Resolved
          83d 2h 28m 1 Vinod Kumar Vavilapalli 19/Nov/09 13:05
          Resolved Resolved Reopened Reopened
          80d 21h 3m 1 Amareshwari Sriramadasu 08/Feb/10 10:09
          Reopened Reopened Patch Available Patch Available
          10d 18h 21m 1 Amareshwari Sriramadasu 19/Feb/10 04:30
          Patch Available Patch Available Open Open
          15d 8h 23m 3 Amareshwari Sriramadasu 08/Mar/10 10:54
          Open Open Patch Available Patch Available
          1d 22h 5m 3 Amareshwari Sriramadasu 08/Mar/10 11:00
          Patch Available Patch Available Resolved Resolved
          23h 58m 1 Vinod Kumar Vavilapalli 09/Mar/10 10:58
          Resolved Resolved Closed Closed
          168d 10h 17m 1 Tom White 24/Aug/10 22:15

            People

            • Assignee:
              Amareshwari Sriramadasu
              Reporter:
              Vinod Kumar Vavilapalli
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development