[HADOOP-372] should allow to specify different inputformat classes for different input dirs for Map/Reduce jobs - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.4.0
Fix Version/s: 0.19.0
Component/s: None
Labels: None
Environment: all
Hadoop Flags: Reviewed

Description

Right now, the user can specify multiple input directories for a map reduce job.
However, the files under all the directories are assumed to be in the same format,
with the same key/value classes. This proves to be a serious limitation in many situations.
Here is an example. Suppose I have three simple tables:
one has URLs and their rank values (page ranks),
another has URLs and their classification values,
and the third one has the URL metadata, such as crawl status, last crawl time, etc.
Suppose now I need a job to generate a list of URLs to be crawled next.
The decision depends on the information in all three tables.
Right now, there is no easy way to accomplish this.

However, this job can be done if the framework allows different input formats to be specified for different input dirs.
Suppose my three tables are in the following directories, respectively: rankTable, classificationTable, and metaDataTable.
If we extend the JobConf class with the following method (as Owen suggested to me):
addInputPath(aPath, anInputFormatClass, anInputKeyClass, anInputValueClass)
Then I can specify my job as follows:
addInputPath(rankTable, SequenceFileInputFormat.class, UTF8.class, DoubleWritable.class)
addInputPath(classificationTable, TextInputFormat.class, UTF8.class, UTF8.class)
addInputPath(metaDataTable, SequenceFileInputFormat.class, UTF8.class, MyRecord.class)
If an input directory is added through the current API, it will keep the same meaning it has now.
Thus this extension will not affect any applications that do not need the new feature.
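
The proposal above can be sketched without any Hadoop dependency. This is only an illustration, assuming a stand-in class named PerPathJobConf and plain strings in place of the real format and key/value classes; neither the class nor its InputSpec helper exists in Hadoop, and the job-wide default passed to the constructor is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the proposed JobConf extension (not Hadoop code).
final class PerPathJobConf {

    // Format and key/value class names attached to one input directory.
    static final class InputSpec {
        final String inputFormat;
        final String keyClass;
        final String valueClass;

        InputSpec(String inputFormat, String keyClass, String valueClass) {
            this.inputFormat = inputFormat;
            this.keyClass = keyClass;
            this.valueClass = valueClass;
        }
    }

    private final InputSpec jobDefault;
    private final Map<String, InputSpec> inputs = new LinkedHashMap<>();

    PerPathJobConf(InputSpec jobDefault) {
        this.jobDefault = jobDefault;
    }

    // Existing-style call: the directory uses the job-wide settings.
    void addInputPath(String path) {
        inputs.put(path, jobDefault);
    }

    // Proposed call: the directory carries its own settings.
    void addInputPath(String path, String inputFormat, String keyClass, String valueClass) {
        inputs.put(path, new InputSpec(inputFormat, keyClass, valueClass));
    }

    // Returns the spec registered for a directory, or null if it was never added.
    InputSpec specFor(String path) {
        return inputs.get(path);
    }

    public static void main(String[] args) {
        // The job-wide default here is an assumption chosen for the demo.
        PerPathJobConf conf = new PerPathJobConf(new InputSpec("TextInputFormat", "UTF8", "UTF8"));
        conf.addInputPath("rankTable", "SequenceFileInputFormat", "UTF8", "DoubleWritable");
        conf.addInputPath("classificationTable", "TextInputFormat", "UTF8", "UTF8");
        conf.addInputPath("metaDataTable", "SequenceFileInputFormat", "UTF8", "MyRecord");
        conf.addInputPath("legacyDir"); // hypothetical directory added the old way
        System.out.println(conf.specFor("rankTable").valueClass);
        System.out.println(conf.specFor("legacyDir").inputFormat);
    }
}
```

Keeping the one-argument overload next to the four-argument one mirrors the compatibility point: directories added the old way simply inherit the job-wide settings.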

It is relatively easy for the M/R framework to create an appropriate record reader for a map task based on the above information.
And that is the only change needed for supporting this extension.
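
One way the framework might pick the format for a map task is to match the task's input file against the registered directories and take the most specific match. The sketch below assumes a hypothetical FormatResolver class, string format names, and a longest-prefix rule; none of these come from the issue or from Hadoop itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps an input file to the format of its enclosing directory.
final class FormatResolver {
    private final Map<String, String> formatByDir = new LinkedHashMap<>();
    private final String defaultFormat;

    FormatResolver(String defaultFormat) {
        this.defaultFormat = defaultFormat;
    }

    // Registers a directory together with the name of its input format.
    void register(String dir, String formatName) {
        formatByDir.put(stripTrailingSlash(dir), formatName);
    }

    // Picks the format of the longest registered directory containing the file,
    // falling back to the job-wide default when no directory matches.
    String formatFor(String splitFile) {
        String best = null;
        for (String dir : formatByDir.keySet()) {
            boolean under = splitFile.equals(dir) || splitFile.startsWith(dir + "/");
            if (under && (best == null || dir.length() > best.length())) {
                best = dir;
            }
        }
        return best == null ? defaultFormat : formatByDir.get(best);
    }

    private static String stripTrailingSlash(String s) {
        return s.length() > 1 && s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }

    public static void main(String[] args) {
        FormatResolver resolver = new FormatResolver("TextInputFormat");
        resolver.register("rankTable", "SequenceFileInputFormat");
        resolver.register("classificationTable", "TextInputFormat");
        resolver.register("metaDataTable", "SequenceFileInputFormat");
        // The part file name is made up for the demo.
        System.out.println(resolver.formatFor("rankTable/part-00000"));
    }
}
```

Matching on a directory boundary (dir + "/") rather than a bare string prefix keeps a directory such as rankTableOld from being mistaken for rankTable.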

Attachments

hadoop-372.patch, 10/Mar/08 17:41, 10 kB, Thomas White
hadoop-372.patch, 10/Jul/08 17:10, 23 kB, Chris Smith
hadoop-372.patch, 11/Jul/08 14:15, 23 kB, Chris Smith
hadoop-372.patch, 14/Jul/08 09:49, 23 kB, Chris Smith
hadoop-372.patch, 14/Jul/08 14:34, 23 kB, Chris Smith

Issue Links

duplicates: HADOOP-450 Remove the need for users to specify the types of the inputs (Closed)

is depended upon by: MAPREDUCE-605 In Streaming, allow different mappers for different subsets of the input (Open)

People

Assignee: Chris Smith
Reporter: Runping Qi
Votes: 0
Watchers: 8

Dates

Created: 19/Jul/06 18:02
Updated: 02/May/13 02:29
Resolved: 18/Jul/08 10:28