[MAPREDUCE-6336] Enable v2 FileOutputCommitter by default - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.7.0
Fix Version/s: 3.0.0-alpha1
Component/s: mrv2
Labels:
- BB2015-05-TBR

Hadoop Flags:

Incompatible change, Reviewed
Release Note:

Hide
mapreduce.fileoutputcommitter.algorithm.version now defaults to 2.

In algorithm version 1:

  1. commitTask renames directory
  $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/
  to
  $joboutput/_temporary/$appAttemptID/$taskID/

  2. recoverTask renames
  $joboutput/_temporary/$appAttemptID/$taskID/
  to
  $joboutput/_temporary/($appAttemptID + 1)/$taskID/

  3. commitJob merges every task output file in
  $joboutput/_temporary/$appAttemptID/$taskID/
  to
  $joboutput/, then it will delete $joboutput/_temporary/
  and write $joboutput/_SUCCESS

commitJob's run time, number of RPC, is O(n) in terms of output files, which is discussed in ~~MAPREDUCE-4815~~, and can take minutes.

Algorithm version 2 changes the behavior of commitTask, recoverTask, and commitJob.

  1. commitTask renames all files in
  $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/
  to $joboutput/

  2. recoverTask is a nop strictly speaking, but for
  upgrade from version 1 to version 2 case, it checks if there
  are any files in
  $joboutput/_temporary/($appAttemptID - 1)/$taskID/
  and renames them to $joboutput/

  3. commitJob deletes $joboutput/_temporary and writes
  $joboutput/_SUCCESS

Algorithm 2 takes advantage of task parallelism and makes commitJob itself O(1). However, the window of vulnerability for having incomplete output in $jobOutput directory is much larger. Therefore, pipeline logic for consuming job outputs should be built on checking for existence of _SUCCESS marker.

Show
mapreduce.fileoutputcommitter.algorithm.version now defaults to 2.    In algorithm version 1:   1. commitTask renames directory   $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/   to   $joboutput/_temporary/$appAttemptID/$taskID/   2. recoverTask renames   $joboutput/_temporary/$appAttemptID/$taskID/   to   $joboutput/_temporary/($appAttemptID + 1)/$taskID/   3. commitJob merges every task output file in   $joboutput/_temporary/$appAttemptID/$taskID/   to   $joboutput/, then it will delete $joboutput/_temporary/   and write $joboutput/_SUCCESS commitJob's run time, number of RPC, is O(n) in terms of output files, which is discussed in MAPREDUCE-4815 , and can take minutes. Algorithm version 2 changes the behavior of commitTask, recoverTask, and commitJob.   1. commitTask renames all files in   $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/   to $joboutput/   2. recoverTask is a nop strictly speaking, but for   upgrade from version 1 to version 2 case, it checks if there   are any files in   $joboutput/_temporary/($appAttemptID - 1)/$taskID/   and renames them to $joboutput/   3. commitJob deletes $joboutput/_temporary and writes   $joboutput/_SUCCESS Algorithm 2 takes advantage of task parallelism and makes commitJob itself O(1). However, the window of vulnerability for having incomplete output in $jobOutput directory is much larger. Therefore, pipeline logic for consuming job outputs should be built on checking for existence of _SUCCESS marker.

Description

This JIRA is to propose making new FileOutputCommitter behavior from ~~MAPREDUCE-4815~~ enabled by default in trunk, and potentially in branch-2.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

MAPREDUCE-6336.v1.patch
24/Apr/15 17:22
0.8 kB
Siqi Li

Issue Links

depends upon

MAPREDUCE-4815 Speed up FileOutputCommitter#commitJob for many output files

Closed

Activity

People

Assignee:: Siqi Li

Reporter:: Gera Shegalov

Votes:: 0 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 24/Apr/15 00:23

Updated:: 12/May/16 18:22

Resolved:: 27/May/15 22:49