Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.3, 0.21.0, 0.21.1, 0.22.0
    • Fix Version/s: None
    • Component/s: task, tasktracker
    • Labels:
    • Attachments:
    1. mapreduce-1716-testcase-race.txt
      5 kB
      Todd Lipcon
    2. patch-log-truncation-bugs-20100514.txt
      23 kB
      Vinod Kumar Vavilapalli
    3. patch-1100-fix-ydist.2.txt
      13 kB
      Vinod Kumar Vavilapalli
    4. MAPREDUCE-1100-20091216.2.txt
      53 kB
      Vinod Kumar Vavilapalli

      Issue Links

        Activity

        Transition: Open → Resolved
        Time In Source Status: 1561d 7h 25m
        Execution Times: 1
        Last Executer: Allen Wittenauer
        Last Execution Date: 30/Jul/14 19:08
        Allen Wittenauer made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 0.22.1 [ 12319242 ]
        Resolution Fixed [ 1 ]
        Allen Wittenauer added a comment -

        Fixed. Sort of.

        Konstantin Shvachko made changes -
        Fix Version/s 0.22.1 [ 12319242 ]
        Fix Version/s 0.22.0 [ 12314184 ]
        Arun C Murthy made changes -
        Labels critical-0.22.0
        Arun C Murthy made changes -
        Target Version/s 0.22.0 [ 12314184 ]
        Arun C Murthy added a comment -

        Konstantin - I'd strongly urge you to consider this for 0.22.0 itself. Without this patch, clusters are very susceptible to tasks whose user code logs excessively, which causes lots of issues for the DN/TT by eating up disk space on nodes.

        Konstantin Shvachko made changes -
        Priority Blocker [ 1 ] Major [ 3 ]
        Konstantin Shvachko added a comment -

        Unblocking this. Target for 0.22.1

        Konstantin Shvachko added a comment -

        Any volunteers to fix this for 0.22?

        Dmitriy V. Ryaboy added a comment -

        Sounds like this went in as part of the 203 release. Can one of the 203 authors comment?

        Todd Lipcon made changes -
        Attachment mapreduce-1716-testcase-race.txt [ 12469884 ]
        Todd Lipcon added a comment -

        This test case has a bit of a race, since it assumes that the logs will be truncated immediately upon job completion.

        This isn't the case, since the logs aren't added to the truncation manager until the JVM has finished, which can be several seconds after the last task finishes (e.g. when sleepTimeBeforeSigKill is 5 seconds). So I found this test to be flaky as is.

        You can show this by adding a 10-second sleep before truncating the logs, for example.

        This delta patch has the tests loop for 20 seconds while checking the logs for truncation. It only fails if the logs aren't truncated after 20 seconds.
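
The polling approach described in this comment can be sketched as follows. This is only an illustration, not code from the attached delta patch: the class name `TruncationWait`, the helper `waitFor`, and the stand-in condition in `main` are all hypothetical, and a real test would check the actual task log files instead.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Hadoop or of the attached patch.
final class TruncationWait {

    /**
     * Re-checks the condition every pollMillis until it holds or
     * timeoutMillis has elapsed. Returns true if the condition was
     * observed to hold, false on timeout or interruption.
     */
    static boolean waitFor(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without seeing the condition hold
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Stand-in for "have the logs been truncated yet?": true after about 300 ms.
        boolean truncated =
                waitFor(() -> System.nanoTime() - start >= 300_000_000L, 20_000, 100);
        if (!truncated) {
            throw new AssertionError("logs not truncated after 20 seconds");
        }
        System.out.println("condition observed before the timeout");
    }
}
```

The test then fails only when the timeout expires, so a delay of a few seconds between task completion and truncation no longer matters.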

        Eli Collins made changes -
        Priority Major [ 3 ] Blocker [ 1 ]
        Eli Collins added a comment -

        Making a blocker since its parent task is.

        Eli Collins made changes -
        Fix Version/s 0.22.0 [ 12314184 ]
        Affects Version/s 0.20.3 [ 12314813 ]
        Affects Version/s 0.21.1 [ 12315272 ]
        Affects Version/s 0.22.0 [ 12314184 ]
        Vinod Kumar Vavilapalli made changes -
        Attachment patch-log-truncation-bugs-20100514.txt [ 12444476 ]
        Vinod Kumar Vavilapalli added a comment -

        Attaching one more patch for ydist that (1) fixes a truncation problem that occurs when binary data is written to the logs and (2) adds a header at the beginning of a truncated log file: "[ ... this log file was truncated because of excess length]\n"
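
A minimal sketch of tail-preserving truncation with such a header might look like the code below. It is not taken from the attached patch. The class `LogTailTruncator`, its `truncate` method, and the temporary-file naming are hypothetical. Working on raw bytes instead of decoded characters is an assumption about one way to stay safe with binary log content.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration, not the code from the ydist patch.
final class LogTailTruncator {
    static final String HEADER =
            "[ ... this log file was truncated because of excess length]\n";

    /**
     * If the log is larger than retainBytes, rewrites it as HEADER followed
     * by its last retainBytes bytes. Returns true if the file was truncated.
     */
    static boolean truncate(Path log, int retainBytes) throws IOException {
        long size = Files.size(log);
        if (size <= retainBytes) {
            return false; // within the limit, leave the file alone
        }
        byte[] header = HEADER.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[header.length + retainBytes];
        System.arraycopy(header, 0, out, 0, header.length);
        try (RandomAccessFile in = new RandomAccessFile(log.toFile(), "r")) {
            in.seek(size - retainBytes);                 // jump to the retained tail
            in.readFully(out, header.length, retainBytes); // raw bytes, no decoding
        }
        // Write to a sibling file first, then replace the original.
        Path tmp = log.resolveSibling(log.getFileName() + ".truncated");
        Files.write(tmp, out);
        Files.move(tmp, log, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("tasklog", ".txt");
        Files.write(log, "0123456789".getBytes(StandardCharsets.UTF_8));
        truncate(log, 4);
        System.out.print(new String(Files.readAllBytes(log), StandardCharsets.UTF_8));
        Files.delete(log);
    }
}
```

Because the sketch holds the retained tail in memory, it only suits small limits. A production version would presumably stream the tail in chunks.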

        Vinod Kumar Vavilapalli made changes -
        Attachment MAPREDUCE-1100-20091216.2.txt [ 12444474 ]
        Attachment patch-1100-fix-ydist.2.txt [ 12444475 ]
        Vinod Kumar Vavilapalli added a comment -

        Attaching the ydist patches for this issue. These were already uploaded on MAPREDUCE-1100 before this sub-task was created.

        Vinod Kumar Vavilapalli added a comment -

        This is related to MAPREDUCE-1100 and solves one part of it.

        mapreduce.task.userlog.limit.kb is not usable in its current form because of the following limitations:

        • If this is used, showing the userlogs is not possible until tasks finish or fail. This is simply not acceptable.
        • The stdout/stderr files are controlled by using 'tail -c' on the stdout/stderr of the task-jvm. This tail command uses some of the precious memory allocated to the users, which is not accounted for or controlled anywhere.
        • syslog files are written to by tasks but the files themselves can be arbitrarily written to by the jvm and its child processes without respecting any of these limits.
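
To make the first two limitations concrete, here is a rough Java sketch of what a "last N bytes" filter has to do. It is an illustration of the general technique only, not Hadoop code and not the implementation of `tail`. The class `TailBytes` and the method `lastBytes` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a "keep the last N bytes" filter.
final class TailBytes {

    /**
     * Reads the stream to EOF and returns its last `limit` bytes
     * (or the whole content if the stream is shorter than that).
     */
    static byte[] lastBytes(InputStream in, int limit) throws IOException {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        byte[] ring = new byte[limit]; // memory held for the whole run
        long total = 0;
        int b;
        while ((b = in.read()) != -1) {
            ring[(int) (total % limit)] = (byte) b; // overwrite the oldest byte
            total++;
        }
        // Only now, at EOF, is it known which bytes are the last ones.
        int kept = (int) Math.min(total, limit);
        int start = (int) ((total - kept) % limit);
        byte[] out = new byte[kept];
        for (int i = 0; i < kept; i++) {
            out[i] = ring[(start + i) % limit];
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "line 1\nline 2\nline 3\n".getBytes(StandardCharsets.US_ASCII));
        System.out.print(new String(lastBytes(in, 7), StandardCharsets.US_ASCII));
    }
}
```

Nothing can be emitted before the input ends, and the buffer of up to `limit` bytes lives outside the task's own accounting. That is one plausible reading of why the logs cannot be shown while a task is running and why the filter's memory use goes uncontrolled.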

        The task-logs truncation functionality through "tail -c" has been broken since 0.19, when the jvm-reuse feature went in via HADOOP-249. Instead of fixing "tail -c", which has the limitations listed above, I propose we throw it away as part of this issue in favour of the truncation code being put in here.

        Note that MAPREDUCE-1648 is a related JIRA trying to solve similar problems, but it tries to 'rewrite' the overall (loosely defined) logging framework for tasks using log4j and is therefore still unsettled on points such as performance and pipes. This issue, on the other hand, is an incremental improvement over what we already have.

        Vinod Kumar Vavilapalli made changes -
        Link This issue relates to MAPREDUCE-1648 [ MAPREDUCE-1648 ]
        Vinod Kumar Vavilapalli made changes -
        Field Original Value New Value
        Affects Version/s 0.21.0 [ 12314045 ]
        Component/s task [ 12312920 ]
        Component/s tasktracker [ 12312906 ]
        Vinod Kumar Vavilapalli created issue -

          People

          • Assignee:
            Vinod Kumar Vavilapalli
            Reporter:
            Vinod Kumar Vavilapalli
          • Votes:
            0
            Watchers:
            5
