[HADOOP-3149] supporting multiple outputs for M/R jobs - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.19.0
Component/s: None
Labels:
None
Environment:

all

Hadoop Flags:

Reviewed
Release Note:

Hide
Introduced MultipleOutputs class so Map/Reduce jobs can write data to different output files. Each output can use a different OutputFormat. Outpufiles are created within the job output directory. FileOutputFormat.getPathForCustomFile() creates a filename under the outputdir that is named with the task ID and task type (i.e. myfile-r-00001).

Show
Introduced MultipleOutputs class so Map/Reduce jobs can write data to different output files. Each output can use a different OutputFormat. Outpufiles are created within the job output directory. FileOutputFormat.getPathForCustomFile() creates a filename under the outputdir that is named with the task ID and task type (i.e. myfile-r-00001).

Description

The outputcollector supports writing data to a single output, the 'part' files in the output path.

We found quite common that our M/R jobs have to write data to different output. For example when classifying data as NEW, UPDATE, DELETE, NO-CHANGE to later do different processing on it.

Handling the initialization of additional outputs from within the M/R code complicates the code and is counter intuitive with the notion of job configuration.

It would be desirable to:

Configure the additional outputs in the jobconf, potentially specifying different outputformats, key and value classes for each one.
Write to the additional outputs in a similar way as data is written to the outputcollector.
Support the speculative execution semantics for the output files, only visible in the final output for promoted tasks.

To support multiple outputs the following classes would be added to mapred/lib:

MOJobConf : extends JobConf adding methods to define named outputs (name, outputformat, key class, value class)
MOOutputCollector : extends OutputCollector adding a collect(String outputName, WritableComparable key, Writable value) method.
MOMapper and MOReducer: implement Mapper and Reducer adding a new configure, map and reduce signature that take the corresponding MO classes and performs the proper initialization.

The data flow behavior would be: key/values written to the default (unnamed) output (using the original OutputCollector collect signature) take part of the shuffle/sort/reduce processing phases. key/values written to a named output from within a map don't.

The named output files would be named using the task type and task ID to avoid collision among tasks (i.e. 'new-m-00002' and 'new-r-00001').

Together with the setInputPathFilter feature introduced by ~~HADOOP-2055~~ it would become very easy to chain jobs working on particular named outputs within a single directory.

We are using heavily this pattern and it greatly simplified our M/R code as well as chaining different M/R.

We wanted to contribute this back to Hadoop as we think is a generic feature many could benefit from.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

patch3149.txt
08/Jul/08 04:14
32 kB
Alejandro Abdelnur
patch3149.txt
07/Jul/08 07:21
32 kB
Alejandro Abdelnur
patch3149.txt
07/Jul/08 05:29
32 kB
Alejandro Abdelnur
patch3149.txt
26/Jun/08 06:18
32 kB
Alejandro Abdelnur
patch3149.txt
25/Jun/08 06:14
33 kB
Alejandro Abdelnur
patch3149.txt
25/Jun/08 01:07
33 kB
Alejandro Abdelnur
patch3149.txt
12/May/08 07:21
19 kB
Alejandro Abdelnur
patch3149.txt
05/May/08 09:25
19 kB
Alejandro Abdelnur
patch3149.txt
25/Apr/08 02:59
19 kB
Alejandro Abdelnur
patch3149.txt
17/Apr/08 03:52
20 kB
Alejandro Abdelnur
patch3149.txt
15/Apr/08 09:14
19 kB
Alejandro Abdelnur
patch3149.txt
15/Apr/08 04:59
19 kB
Alejandro Abdelnur
patch3149.txt
03/Apr/08 04:15
22 kB
Alejandro Abdelnur
patch3149.txt
02/Apr/08 12:33
33 kB
Alejandro Abdelnur
patch3149.txt
02/Apr/08 08:19
33 kB
Alejandro Abdelnur
patch3149.txt
01/Apr/08 09:21
29 kB
Alejandro Abdelnur

Issue Links

incorporates

HADOOP-3258 FileOutputFormat should have a method to create custom files under the outputdir with a unique name per task to avoid name collision

Closed

is related to

HADOOP-3836 TestMultipleOutputs will fail if it is ran more than one times

Closed

relates to

HADOOP-3150 Move task file promotion into the task

Closed

Activity

People

Assignee:: Alejandro Abdelnur

Reporter:: Alejandro Abdelnur

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 01/Apr/08 09:07

Updated:: 08/Jul/09 16:52

Resolved:: 10/Jul/08 12:30