Pig
PIG-1518

multi file input format for loaders

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Feature: combine splits of sizes smaller than the value of the property "pig.maxCombinedSplitSize" or, if "pig.maxCombinedSplitSize" is not set, the file system default block size of the load's location. This feature can be turned off by setting the property "pig.splitCombination" to "false". When such a combination is performed, a log message like "Total input paths (combined) to process : 7" will be logged.

      This feature is applicable when a user input, or an intermediate input, has many small files to be loaded that would otherwise cause many more "under-fed" mappers to be launched, potentially slowing down execution.

      This change will not cause any backward compatibility issue except when a loader implementation makes use of the PigSplit object passed to the prepareToRead method; in that case the loader might need to be rebuilt because PigSplit's definition has been modified. However, we currently know of no external use of the object.

      This change also requires the loader to be stateless across invocations of the prepareToRead method; that is, the method should reset any internal state that is not derived from the RecordReader argument.
      Otherwise, this feature should be disabled.

      In addition, if a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations.
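      To make the statelessness requirement concrete, here is a minimal sketch (Python standing in for the Java LoadFunc API; the class and method names are hypothetical, not Pig's actual API) of a loader that resets its per-split state in prepareToRead:

```python
class CountingLoader:
    """Hypothetical loader sketch: under split combination, prepareToRead may
    be called once per underlying split on the SAME loader object, so all
    per-split state must be reset there."""

    def __init__(self):
        self.reader = None
        self.records_read = 0  # per-split state

    def prepare_to_read(self, reader, split):
        # Reset everything not derived from the new reader; a loader that
        # accumulated state across calls would misbehave under combination.
        self.reader = reader
        self.records_read = 0

    def get_next(self):
        # Return the next record from the current reader, or None at end.
        rec = next(self.reader, None)
        if rec is not None:
            self.records_read += 1
        return rec
```

      A counter or buffer left over from the previous prepareToRead call would otherwise corrupt the next split's records when splits are combined.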

      Description

      We frequently run into the situation where Pig needs to deal with many small input files. In this case a separate map task is created for each file, which can be very inefficient.

      It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible.

      There are already a couple of input formats doing a similar thing: MultiFileInputFormat as well as CombineFileInputFormat; however, neither works with the new Hadoop 20 API.

      We at least want to do a feasibility study for Pig 0.8.0.

      1. PIG-1518-0.7.0.patch
        57 kB
        Justin Sanders
      2. PIG-1518.patch
        58 kB
        Yan Zhou
      3. PIG-1518.patch
        58 kB
        Yan Zhou
      4. PIG-1518.patch
        58 kB
        Yan Zhou
      5. PIG-1518.patch
        58 kB
        Yan Zhou
      6. PIG-1518.patch
        58 kB
        Yan Zhou
      7. PIG-1518.patch
        58 kB
        Yan Zhou
      8. PIG-1518.patch
        58 kB
        Yan Zhou
      9. PIG-1518.patch
        52 kB
        Yan Zhou

        Issue Links

          Activity

          Olga Natkovich created issue -
          Yan Zhou added a comment -

          CombinedInputFormat, in lieu of the deprecated MultiFileInputFormat, batches small files on the basis of block locality. For Pig, this umbrella input format will have to work with generic input formats for which the block info is unavailable but the data node and size info are present to let the M/R framework make scheduling decisions. In other words, Pig cannot
          break the original splits apart to "work inside" them but can only use the original splits as building blocks for the combined input splits.

          Consequently, this combined input format will hold multiple generic input splits so that each combined split's size is bounded by a configured limit of, say, pig.maxsplitsize, with the default value being the HDFS block size of the file system the load source sits in.

          However, due to the sortedness constraint on the tables in a merge join, split combination will not be used for any loads that feed a merge join. For map-side cogroup or map-side group by, though, the splits can be combined because each split is only required to contain all duplicates of a key, and combining splits preserves that invariant.

          During combination, splits on the same data node will be merged as much as possible. Leftovers will be merged without regard to data locality. Of all the used data nodes, those with fewer splits will be merged before those with more splits, so as to minimize the leftovers on the data nodes with fewer splits. On each data node, a greedy approach is adopted so that larger splits are merged before smaller ones, because smaller splits are more easily merged later among themselves.
          As a result, the implementation maintains a list of data hosts sorted by the number of splits, each holding a list of the original splits sorted by split size, to perform the above operations efficiently. The complexity should be linear in the number of original splits.

          Note that for data locality, we just honor whatever the generic input split's getLocations() method produces. Any particular input split's implementation actually may or may not hold that property. For instance, CombinedInputFormat will combine
          node-local or rack-local blocks into a split. Essentially, this PIG container input split works on whatever data locality perception the underlying loader provides.

          On the implementation side, PigSplit will no longer hold a single wrapped InputSplit instance but a new CombinedInputSplit instance. Accordingly, PigRecordReader will hold a list
          of wrapped record readers, not just a single one. Correspondingly, PigRecordReader's nextKeyValue() will use the current wrapped record reader to fetch the next values.
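          The reader-delegation scheme described above can be sketched as follows (Python stand-in for the Java classes; names are hypothetical, not the patch's actual API):

```python
class CombinedRecordReader:
    """Sketch of a reader over a combined split: it holds the wrapped
    readers for the underlying splits and advances to the next reader
    when the current one is exhausted."""

    def __init__(self, readers):
        self.readers = list(readers)  # one reader per underlying split
        self.current = None

    def next_key_value(self):
        # Return the next record, moving across underlying readers,
        # or None once every underlying split has been consumed.
        while True:
            if self.current is None:
                if not self.readers:
                    return None
                self.current = self.readers.pop(0)
            rec = next(self.current, None)
            if rec is not None:
                return rec
            self.current = None  # current reader exhausted; advance
```

          Empty underlying splits are simply skipped as the reader advances, matching the behavior discussed later in this thread.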

          Risks include: 1) the test verifications may need major changes since this optimization may cause major ordering changes in results; 2) since LoadFunc.prepareToRead() takes a PigSplit argument, there might be a backward compatibility issue as PigSplit changes its wrapped input split to the combined input split. But this should be very unlikely, as the only known
          use of the PigSplit argument is the internal "index loader" for the right table in merge join.

          Yan Zhou added a comment -

          In contrast with Hive, where CombineFileInputFormat is used to generate input splits on the underlying storage formats, Pig's combined splits work on top of the splits generated by the underlying loaders. In other words, Hive's input splits are CombineFileSplits that create record readers of the underlying storage formats, while Pig's combined input splits contain the underlying storage's splits.

          CombineFileRecordReader would have been reusable if not for its being supported only in 0.18 and its need for a CombineFileSplit, instead of an InputSplit, as a constructor argument (MAPREDUCE-955).

          Yan Zhou added a comment -

          The combination algorithm currently does not consider rack locality, as the generic underlying input splits do not carry rack info. For more specific input splits like FileSplit, the rack info is available, thus allowing generation of combined splits with consideration of rack locality. But this might be out of scope for 0.8, and a separate JIRA, PIG-1535, has been filed for that purpose.

          Yan Zhou added a comment -

          To provide a safety valve for any input formats that might dislike the combination of their splits, a boolean property pig.splitcombination is to be provided to allow disabling this feature. The default value will be true.

          Yan Zhou added a comment -

          The pseudo code of the combination op is as follows:

          for each node of the nodes (sorted in ascending order of split count) {
              while the node's split list (sorted in descending order of split size) is not empty {
                  find the biggest splits that can be combined with the first split of the list;
                  if the accumulated split size is >= half of the limit {
                      generate a combined split;
                      remove the accumulated splits from the node's split list;
                      clear the accumulated split list;
                  } else {
                      break;
                  }
              }
          }

          // leftover combination
          for each node of the nodes {
              for each split of the node's split list {
                  add the split to a leftover list;
              }
          }

          for each split in the leftover list {
              if the accumulated split size is >= the limit {
                  generate a combined split;
                  remove the accumulated splits;
                  clear the accumulated split list;
              }
              if it is the last split in the leftover list {
                  try to add it to an existing combined split;
                  if that is not possible, generate a combined split from the accumulated splits;
              }
          }

          The complexity is n*log(n), with n being the number of original splits that are smaller than the limit.
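          A simplified executable sketch of the size-based greedy packing (Python; it ignores the per-node locality bookkeeping and the half-limit leftover heuristic, so it is an approximation of the idea rather than the patch's algorithm):

```python
def combine_splits(split_sizes, limit):
    """Greedily pack split sizes into combined splits bounded by `limit`.

    Splits are taken largest-first, mirroring the idea that smaller
    splits are more easily merged later among themselves. A single
    split larger than the limit stays in a group of its own.
    """
    remaining = sorted(split_sizes, reverse=True)
    combined, acc, acc_size = [], [], 0
    for s in remaining:
        # Flush the accumulated group when adding s would exceed the limit.
        if acc and acc_size + s > limit:
            combined.append(acc)
            acc, acc_size = [], 0
        acc.append(s)
        acc_size += s
    if acc:
        combined.append(acc)
    return combined
```

          For example, sizes [60, 50, 30, 20, 10] with a limit of 100 yield the groups [60], [50, 30, 20], [10]: every split ends up in exactly one combined split and no group exceeds the limit.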

          Alan Gates added a comment -

          For map-side cogroup or map-side group by, though, the splits can be combined because each split is only required to contain all duplicates of a key, and combining splits preserves that invariant.

          You are correct for mapside group, but not mapside cogroup. Mapside cogroup does require all files being grouped to be processed in an ordered fashion.

          Yan Zhou added a comment -

          Right, map-side cogroup needs sortedness of the input, but only the "side inputs" need the ability to seek on a key; the "base input" only needs all duplicates of a key to be present in a single mapper. I'll mark the "side inputs" as non-combinable.

          Yan Zhou added a comment -

          During the merge process, any empty splits will be skipped. Currently, empty splits are generated for empty files, which is unnecessary in the first place.

          Yan Zhou added a comment -

          There is a bigger question at hand. The semantics of OrderedLoadFunc is that the splits are totally ordered. BinStorage, InterStorage and PigStorage all implement that interface through FileInputLoadFunc. Since the combination of splits as conceived here will definitely destroy the split ordering, if combination is disabled for these storages the feature would be virtually useless for a majority of use cases.

          On the other hand, I'm seeing no use of the comparison capability except for MergeJoinIndexer's getNext() method, which makes me wonder if the OrderedLoadFunc can be removed from the FileInputLoadFunc. Semantically, FileInputLoadFunc should not support the ordering of splits, as Hadoop's FileInputFormat doesn't. When a need arises like in MergeJoinIndexer, we can add that extension on. But the change may incur some backward compatibility issues.
          I'm now soliciting comments in this area.

          Ashutosh Chauhan added a comment -

          This feature of combining multiple splits should honor the OrderedLoadFunc interface: if a loadfunc implements that interface, then the splits it generates should not be combined. However, it's not clear why FileInputLoadFunc implements this interface. AFAIK, the split[] returned by getSplits() on FileInputFormat makes no guarantee that the underlying splits will be returned in an ordered fashion. Though that is the default behavior right now, and thus implementing OrderedLoadFunc doesn't cause any problem in the current implementation, there seems to be no real benefit to FileInputLoadFunc implementing it (with one exception, to which I will come later). So I will argue that FileInputLoadFunc stop implementing OrderedLoadFunc. This will have the immediate benefit of making this change useful for all the fundamental storage mechanisms of Pig, like PigStorage, BinStorage, InterStorage etc. Dropping an interface from an implementing class can be seen as a backward-incompatible change, but I really doubt anyone cares whether PigStorage is reading splits in an ordered fashion.
          The only real victim of this change will be merge join, which will stop working with PigStorage by default. But we have not seen merge join being used with PigStorage in many places. Second, it is anyway based on an assumption about FileInputFormat, which may choose to change its behavior in the future. Third, the solution to this problem is straightforward: have another loader that extends PigStorage and implements OrderedLoadFunc, which can be used to load data for merge join.

          In essence I am arguing to drop the OrderedLoadFunc interface from FileInputLoadFunc so that this feature is useful for a large number of use cases.

          Yan, you also need to watch out for ReadToEndLoader, which also makes assumptions that may break in the presence of this feature.

          Yan Zhou added a comment -

          Another approach is to mark splits as uncombinable only when necessary. Specifically, MergeJoinIndexer and the base load in mapside cogroup need to be excluded from the split combination.

          Breaking backward compatibility is probably too much of a risk to take. In the meanwhile, OrderedLoadFunc is marked as "evolving", which leaves some headroom for future semantic polish.

          Yan Zhou added a comment -

          One experimental result on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM boxes is as follows:

          Query:

          register pigperf.jar;
          A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
          as (user, action, timespent, query_term, ip_addr, timestamp,
          estimated_revenue, page_info, page_links);
          B = foreach A generate user, (double)estimated_revenue;
          B1 = distinct B;
          alpha = load '/user/pig/tests/data/pigmix/users' using PigStorage('\u0001') as (name, phone, address,
          city, state, zip);
          beta = foreach alpha generate name;
          C = join beta by name, B1 by user parallel 300;
          D = group C by $0 parallel 40;
          E = foreach D generate group, SUM(C.estimated_revenue);
          store E into 'spliCombo2.out';

          It creates 3 map/reduce jobs.

          No Split Combination:

                                Mappers   Reducers
          Job 1  number         120       300
                 elapsed time   24s       2m43s
          Job 2  number         301       300
                 elapsed time   46s       3m11s
          Job 3  number         300       40
                 elapsed time   38s       53s
          Total elapsed time: 7m36s

          With Split Combination:

                                Mappers   Reducers
          Job 1  number         120       300
                 elapsed time   22s       2m49s
          Job 2  number         3         300
                 elapsed time   27s       2m46s
          Job 3  number         1         40
                 elapsed time   17s       24s
          Total elapsed time: 7m5s
          Yan Zhou made changes -
          Field Original Value New Value
          Attachment PIG-1518.patch [ 12452408 ]
          Yan Zhou added a comment -

          In summary, split combination is controlled through the following JVM properties:

          pig.maxCombinedSplitSize: specifies the maximum combined split size in bytes; by default, it is the load file system's default block size.

          pig.splitCombination: takes values of "false" and "true". The default is "true". "false" will disable the split combination.
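          A sketch of how these settings might be resolved at run time (Python; the helper name and return convention are illustrative, only the property names and defaults come from the comment above):

```python
def resolve_split_combination(props, default_block_size):
    """Return the effective maximum combined split size in bytes,
    or None when split combination is disabled.

    props: a dict of string properties, e.g. parsed JVM -D options.
    default_block_size: the load file system's default block size.
    """
    if props.get("pig.splitCombination", "true") == "false":
        return None  # feature disabled
    return int(props.get("pig.maxCombinedSplitSize", default_block_size))
```

          For example, with no properties set the limit falls back to the file system block size, and setting pig.splitCombination to "false" disables combination regardless of the size property.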

          Mridul Muralidharan added a comment -

          If the optimizer is turned off, does this also get turned off (pig.splitCombination=false)?

          Yan Zhou added a comment -

          No. It is not part of the optimizer, as it does not transform the logical/physical plans the way the other optimizer rules do.

          Yan Zhou added a comment -

          Style changes, Hudson pass, plus other minor changes. Internal Hudson results:

          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] -1 release audit. The applied patch generated 427 release audit warnings (more than the trunk's current 425 warnings).

          The release audit warnings are on two html files: PigInputFormat.html and PigRecordReader.html

          Yan Zhou made changes -
          Attachment PIG-1518.patch [ 12452679 ]
          Richard Ding added a comment -

          +1. The patch looks good.

          A few minor points:

          • In PigSplit, the method add(InputSplit split) is not used and can be removed
          • In MapRedUtil, it would be better not to leave the debug verification code in the source code
          • In PigRecordReader, the code can be simplified if the initNextRecordReader() call is moved from the constructor to the initialize() method
          Yan Zhou added a comment -

          The add method of PigSplit has been removed. The debug code is left to facilitate future debugging work. The use of initNextRecordReader is pretty much cloned from org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader, and I'll leave it as is too.

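          The initNextRecordReader pattern referenced above can be illustrated with a small standalone model. This is hypothetical code, not Pig's or Hadoop's actual classes; each "split" is simplified to a list of String records. The idea: a reader spanning several underlying splits lazily opens the next split's reader once the current one is exhausted.

```java
// Hypothetical, simplified model of the CombineFileRecordReader pattern --
// not the actual Pig or Hadoop source. Each "split" is a list of records.
import java.util.Iterator;
import java.util.List;

class ChainedReader {
    private final Iterator<List<String>> splits; // remaining splits
    private Iterator<String> current;            // reader over current split

    ChainedReader(List<List<String>> allSplits) {
        this.splits = allSplits.iterator();
        // Pig's patch invokes initNextRecordReader() from the constructor;
        // Richard's review suggested moving this call to initialize().
        initNextReader();
    }

    // Opens a reader over the next split; returns false when none remain.
    private boolean initNextReader() {
        if (!splits.hasNext()) {
            current = null;
            return false;
        }
        current = splits.next().iterator();
        return true;
    }

    // Returns the next record, or null once every split is exhausted.
    String nextRecord() {
        while (current != null) {
            if (current.hasNext()) {
                return current.next();
            }
            if (!initNextReader()) {
                return null;
            }
        }
        return null;
    }
}
```

          Note how empty splits are skipped transparently: the while loop keeps chaining to the next underlying reader until a record is found or the split list runs out.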
          Yan Zhou made changes -
          Attachment PIG-1518.patch [ 12452873 ]
          Yan Zhou added a comment -

          Fix a typo; rebase on the latest trunk.

          Yan Zhou made changes -
          Attachment PIG-1518.patch [ 12452879 ]
          Yan Zhou made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Release Note Feature: combine splits of sizes smaller than the value of property "pig.maxCombinedSplitSize" or, if the property of "pig.maxCombinedSplitSize" is not set, the file system default block size of the load's location. This feature can be turned off through setting the property "pig.noSplitCombination" to true. When such a combination is performed, a log message like "Total input paths (combined) to process : 7" will be logged.

          This feature will be applicable if a user input, or an intermediate input, has many small files to be loaded that would otherwise cause many more "under-fed" mappers to be launched and potentially slowdown of the execution.

          This change will not cause any backward compatibility issue except if a loader implementation makes use of the PigSplit object passed through the prepareToRead method where a rebuild of the loader might be necessary as PigSplit's definition has been modified. However, currently we know of no external use of the object.

          In addition, if a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations.
          Yan Zhou made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Yan Zhou added a comment -

          Minor polish of debugging code inside comments.

          Yan Zhou made changes -
          Attachment PIG-1518.patch [ 12453008 ]
          Yan Zhou made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Mridul Muralidharan added a comment -

          Might be a good idea to contact aruniyer, who maintains the FISH implementation.
          It is essentially built upon PigSplit and a custom loader.

          Yan Zhou made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Yan Zhou added a comment -

          Improvement on logging info.

          Yan Zhou made changes -
          Attachment PIG-1518.patch [ 12453092 ]
          Yan Zhou made changes -
          Attachment PIG-1518.patch [ 12453135 ]
          Yan Zhou added a comment -

          Rebased on the latest trunk.

          Yan Zhou made changes -
          Attachment PIG-1518.patch [ 12453150 ]
          Richard Ding added a comment -

          Patch is committed to trunk. Thanks Yan.

          Richard Ding made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          Dmitriy V. Ryaboy added a comment -

          This is a great feature, thanks Yan.

          Could you comment on what the final solution was as far as PigStorage and OrderedLoadFunc? I see two ideas (yours and Ashutosh's) in the discussion, but not what the ultimate direction you took was.

          Yan Zhou added a comment -

          It is not combinable if the loader is a CollectableLoadFunc AND an OrderedLoadFunc. Since PigStorage is a CollectableLoadFunc but not an OrderedLoadFunc, it is combinable.

          Ashutosh Chauhan added a comment -

          Yan,
          Sorry for being late on this now that it's committed. But I think you have gotten it the other way around. A CollectableLoadFunc is combinable but an OrderedLoadFunc is not. Let's go over all three interfaces:

          • CollectableLoadFunc: A loader implementing it must make sure that all instances of a particular key are present in one split. If you combine splits of such a loader, it will still remain a CollectableLoadFunc because all instances of a key will still be in the same split after the combination. It is dictating a property within a split. Thus, it's combinable.

          • OrderedLoadFunc: OrderedLoadFunc insists that a loader implementing it must read splits in a well-defined order. If you combine the splits, that order may not hold. You can't combine splits for this loader. It's defining a property across multiple splits.

          • IndexableLoadFunc: Says that the loader is indexable, meaning that given a key it will get you as close as possible to that key. It inherently assumes the data is sorted and an index is built for it. Your combined splits may not remain sorted anymore. You can't combine splits for this interface either. It's defining a property across multiple splits.

          If you agree with the above, then PigStorage isn't combinable because

          public class PigStorage extends FileInputLoadFunc implements StoreFuncInterface, LoadPushDown {}
          and
          public abstract class FileInputLoadFunc extends LoadFunc implements OrderedLoadFunc {}

          I also didn't get your logic for "CollectableLoadFunc AND an OrderedLoadFunc". It would help if you could explain that a bit.

          Yan Zhou added a comment -

          MergeJoinIndexer and IndexableLoadFunc are both not combinable.

          Regarding OrderedLoadFunc, the story is a bit more complex. First of all, its only non-overridden method, getSplitComparable, is only used in MergeJoinIndexer, which is already not combinable.

          The big issue is FileInputLoadFunc, which is extended by BinStorage, PigStorage and InterStorage. Semantically, I agree OrderedLoadFunc should not be combinable. However, FileInputLoadFunc's implementation of OrderedLoadFunc makes little sense in that its ordering is based on the (path, offset) pair. This is an ordering, but just an arbitrary ordering. Mathematically one can establish any arbitrary ordering over a discrete set of data. But the point is how the ordering is used. For our purposes, the ordering should be related to some keys used in data manipulation, for which (path, offset) does not serve the purpose. Put another way, a FileInputLoadFunc implicitly still requires that the storage give out splits in some key ordering. If that storage ordering does not actually exist, FileInputLoadFunc as an OrderedLoadFunc will have no use for its "sortedness" because the ordering is just, well, arbitrary. The three extensions of FileInputLoadFunc work on generic data storage. Unless they work on sorted data in general, they should not be OrderedLoadFuncs.

          The other use of OrderedLoadFunc, apart from its non-overridden method getSplitComparable, is by map-side cogroup. But it does not check whether the sort key is the join key, which is critical for correctness. It also requires a CollectableLoadFunc to work properly.

          Since we do not want to break backward compatibility, and the only use of OrderedLoadFunc in Pig, except for MergeJoinIndexer which is already excluded from combining, is in map-side cogroup with CollectableLoadFunc, I mark "CollectableLoadFunc AND OrderedLoadFunc" as non-combinable.

          In the future, we should really clean up the OrderedLoadFunc from FileInputLoadFunc and let the getSplitComparable method provide key-related info rather than the (path, offset) pair. Backward compatibility may need to be addressed too. Only then will the water become clearer and I'll be OK with adjusting the non-combinable setting accordingly.

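          The rule described above can be restated as a tiny standalone sketch. This is hypothetical illustration code, not the Pig source: the marker interfaces are redeclared locally so the example compiles on its own, and the example loader classes are made-up names, not Pig classes. Splits stay uncombined when the loader is an IndexableLoadFunc, or is both a CollectableLoadFunc and an OrderedLoadFunc.

```java
// Hypothetical sketch of the combinability rule -- not the actual Pig
// source. Marker interfaces are redeclared locally for self-containment.
interface LoadFunc {}
interface IndexableLoadFunc extends LoadFunc {}
interface OrderedLoadFunc extends LoadFunc {}
interface CollectableLoadFunc extends LoadFunc {}

class CombinePolicy {
    // Splits are NOT combined when the loader implements IndexableLoadFunc,
    // or implements both CollectableLoadFunc and OrderedLoadFunc.
    static boolean isCombinable(LoadFunc loader) {
        if (loader instanceof IndexableLoadFunc) {
            return false; // merge-join index path: must not combine
        }
        if (loader instanceof CollectableLoadFunc
                && loader instanceof OrderedLoadFunc) {
            return false; // map-side cogroup path: must not combine
        }
        return true;
    }
}

// Illustrative loaders (hypothetical names, not Pig classes):
class PlainLoader implements LoadFunc {}
class IndexedLoader implements IndexableLoadFunc {}
class CogroupLoader implements CollectableLoadFunc, OrderedLoadFunc {}
```

          Under this model a plain loader's splits are combinable, while an indexable loader or a cogroup-capable ordered loader opts out, matching the exclusions in the release note.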
          Olga Natkovich added a comment -

          After discussion with Ashutosh and Yan, the agreement is that in addition to checking interfaces we also need to check whether we are taking advantage of the loader properties before deciding whether to combine or not.

          For instance, even if the loader implements OrderedLoadFunc but there is no merge join in the script, we can still combine.

          Yan, please, compile the list of valid combinations and update the patch, thanks.

          Yan Zhou added a comment -

          In summary, the following functionalities won't see splits combined on loads:

          1) map-side cogroup;
          2) merge join.

          Yan Zhou added a comment -

          All other functionalities except for the two mentioned in the previous comment will see splits combined by default, if necessary.

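          Per the final release note, the behavior is controlled through job properties. A sketch of how a Pig script might set them (the 128 MB value is illustrative, not a recommendation):

```pig
-- turn split combination off entirely
set pig.splitCombination false;

-- or leave it on and cap the combined split size at 128 MB (value in bytes);
-- when unset, the file system's default block size is used as the cap
set pig.maxCombinedSplitSize 134217728;
```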
          Justin Sanders added a comment -

          Backported PIG-1518 to the 0.7.0 branch and wanted to share in case anyone else was trying to do the same thing.

          Justin Sanders made changes -
          Attachment PIG-1518-0.7.0.patch [ 12453951 ]
          Olga Natkovich added a comment -

          Hi Justin, thanks for the patch!

          I don't think we can commit it to the 0.7 branch because we have already done the official 0.7 release and we can't introduce non-backward-compatible changes to this branch.

          However, I think it is great to have the patch on the JIRA so that anybody who is interested in this patch can apply it to their own tree and run with it. We have done similar things in the past (with hadoop versions) and it worked fine.

          Yan Zhou made changes -
          Release Note Feature: combine splits of sizes smaller than the value of property "pig.maxCombinedSplitSize" or, if the property of "pig.maxCombinedSplitSize" is not set, the file system default block size of the load's location. This feature can be turned off through setting the property "pig.noSplitCombination" to true. When such a combination is performed, a log message like "Total input paths (combined) to process : 7" will be logged.

          This feature will be applicable if a user input, or an intermediate input, has many small files to be loaded that would otherwise cause many more "under-fed" mappers to be launched and potentially slowdown of the execution.

          This change will not cause any backward compatibility issue except if a loader implementation makes use of the PigSplit object passed through the prepareToRead method where a rebuild of the loader might be necessary as PigSplit's definition has been modified. However, currently we know of no external use of the object.

          In addition, if a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations.
          Feature: combine splits of sizes smaller than the value of property "pig.maxCombinedSplitSize" or, if the property of "pig.maxCombinedSplitSize" is not set, the file system default block size of the load's location. This feature can be turned off through setting the property "pig.splitCombination" to "false". When such a combination is performed, a log message like "Total input paths (combined) to process : 7" will be logged.

          This feature will be applicable if a user input, or an intermediate input, has many small files to be loaded that would otherwise cause many more "under-fed" mappers to be launched and potentially slowdown of the execution.

          This change will not cause any backward compatibility issue except if a loader implementation makes use of the PigSplit object passed through the prepareToRead method where a rebuild of the loader might be necessary as PigSplit's definition has been modified. However, currently we know of no external use of the object.

          This change also requires the loader to be stateless across the invocations to the prepareToRead method. That is, the method should reset any internal states that are not affected by the RecordReader argument.
          Otherwise, this feature should be disabled.

          In addition, if a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations.
          Olga Natkovich made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Aniket Mokashi made changes -
          Link This issue relates to PIG-2462 [ PIG-2462 ]

            People

            • Assignee: Yan Zhou
            • Reporter: Olga Natkovich
            • Votes: 0
            • Watchers: 3