Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.0
    • Fix Version/s: 3.0.0-alpha1
    • Component/s: mrv2
    • Labels:
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Hide
      mapreduce.fileoutputcommitter.algorithm.version now defaults to 2.
        
      In algorithm version 1:

        1. commitTask renames directory
        $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/
        to
        $joboutput/_temporary/$appAttemptID/$taskID/

        2. recoverTask renames
        $joboutput/_temporary/$appAttemptID/$taskID/
        to
        $joboutput/_temporary/($appAttemptID + 1)/$taskID/

        3. commitJob merges every task output file in
        $joboutput/_temporary/$appAttemptID/$taskID/
        to
        $joboutput/, then it will delete $joboutput/_temporary/
        and write $joboutput/_SUCCESS

      commitJob's run time, number of RPC, is O(n) in terms of output files, which is discussed in MAPREDUCE-4815, and can take minutes.

      Algorithm version 2 changes the behavior of commitTask, recoverTask, and commitJob.

        1. commitTask renames all files in
        $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/
        to $joboutput/

        2. recoverTask is a nop strictly speaking, but for
        upgrade from version 1 to version 2 case, it checks if there
        are any files in
        $joboutput/_temporary/($appAttemptID - 1)/$taskID/
        and renames them to $joboutput/

        3. commitJob deletes $joboutput/_temporary and writes
        $joboutput/_SUCCESS

      Algorithm 2 takes advantage of task parallelism and makes commitJob itself O(1). However, the window of vulnerability for having incomplete output in $jobOutput directory is much larger. Therefore, pipeline logic for consuming job outputs should be built on checking for existence of _SUCCESS marker.
      Show
      mapreduce.fileoutputcommitter.algorithm.version now defaults to 2.    In algorithm version 1:   1. commitTask renames directory   $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/   to   $joboutput/_temporary/$appAttemptID/$taskID/   2. recoverTask renames   $joboutput/_temporary/$appAttemptID/$taskID/   to   $joboutput/_temporary/($appAttemptID + 1)/$taskID/   3. commitJob merges every task output file in   $joboutput/_temporary/$appAttemptID/$taskID/   to   $joboutput/, then it will delete $joboutput/_temporary/   and write $joboutput/_SUCCESS commitJob's run time, number of RPC, is O(n) in terms of output files, which is discussed in MAPREDUCE-4815 , and can take minutes. Algorithm version 2 changes the behavior of commitTask, recoverTask, and commitJob.   1. commitTask renames all files in   $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/   to $joboutput/   2. recoverTask is a nop strictly speaking, but for   upgrade from version 1 to version 2 case, it checks if there   are any files in   $joboutput/_temporary/($appAttemptID - 1)/$taskID/   and renames them to $joboutput/   3. commitJob deletes $joboutput/_temporary and writes   $joboutput/_SUCCESS Algorithm 2 takes advantage of task parallelism and makes commitJob itself O(1). However, the window of vulnerability for having incomplete output in $jobOutput directory is much larger. Therefore, pipeline logic for consuming job outputs should be built on checking for existence of _SUCCESS marker.

      Description

      This JIRA is to propose making new FileOutputCommitter behavior from MAPREDUCE-4815 enabled by default in trunk, and potentially in branch-2.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                l201514 Siqi Li
                Reporter:
                jira.shegalov Gera Shegalov
              • Votes:
                0 Vote for this issue
                Watchers:
                10 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: