Pig / PIG-1518

multi file input format for loaders

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Feature: combine splits whose sizes are smaller than the value of the property "pig.maxCombinedSplitSize" or, if that property is not set, the file system default block size of the load's location. The feature can be turned off by setting the property "pig.splitCombination" to "false". When splits are combined, a message like "Total input paths (combined) to process : 7" is logged.

      This feature applies when a user input, or an intermediate input, consists of many small files that would otherwise cause many "under-fed" mappers to be launched, potentially slowing down execution.
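
The combination rule in the feature note can be illustrated with a small sketch. This is not Pig's implementation: the class `SplitCombinationSketch`, its method names, and the greedy first-fit grouping are assumptions made for illustration, and splits are modeled only by their sizes in bytes. Only the two property names and the block-size fallback come from the release note.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class SplitCombinationSketch {
    // Property names as given in the release note.
    static final String ENABLED_KEY = "pig.splitCombination";
    static final String MAX_SIZE_KEY = "pig.maxCombinedSplitSize";

    /** Combination is on unless the property is explicitly "false". */
    static boolean enabled(Properties props) {
        return !"false".equalsIgnoreCase(props.getProperty(ENABLED_KEY, "true"));
    }

    /** Threshold: the property value if set, else the file system block size. */
    static long maxCombinedSize(Properties props, long defaultBlockSize) {
        String value = props.getProperty(MAX_SIZE_KEY);
        return value == null ? defaultBlockSize : Long.parseLong(value.trim());
    }

    /**
     * Hypothetical greedy grouping: splits at or above the threshold stay
     * alone; smaller splits are packed together until adding one more would
     * exceed the threshold. Each inner list stands for one combined split.
     */
    static List<List<Long>> combine(List<Long> splitSizes, Properties props,
                                    long defaultBlockSize) {
        List<List<Long>> result = new ArrayList<>();
        boolean on = enabled(props);
        long max = maxCombinedSize(props, defaultBlockSize);
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : splitSizes) {
            if (!on || size >= max) {
                // Not eligible for combination: emit as its own split.
                List<Long> alone = new ArrayList<>();
                alone.add(size);
                result.add(alone);
                continue;
            }
            if (!current.isEmpty() && currentSize + size > max) {
                // Current group is full: close it and start a new one.
                result.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(MAX_SIZE_KEY, "100");
        List<Long> sizes = new ArrayList<>();
        for (long s : new long[] {10, 20, 30, 40, 50}) {
            sizes.add(s);
        }
        System.out.println(combine(sizes, props, 60L));
    }
}
```

In this sketch, five small splits with a threshold of 100 collapse into two combined splits, so two mappers would run instead of five. The real feature may group splits differently, for example by taking data locality into account.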

      This change does not cause any backward compatibility issue unless a loader implementation uses the PigSplit object passed to the prepareToRead method; in that case the loader may need to be rebuilt, because PigSplit's definition has been modified. However, we currently know of no external use of the object.

      This change also requires the loader to be stateless across invocations of the prepareToRead method. That is, the method should reset any internal state that is not affected by the RecordReader argument.
      Otherwise, this feature should be disabled.
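
The statelessness requirement can be sketched as follows. This is a self-contained illustration, not code against the Pig API: `LineReader` is a hypothetical stand-in for Hadoop's RecordReader, and the class `StatelessLoaderSketch`, its fields, and the per-file header-skipping behavior are all assumptions chosen to show state that must be reset. With combined splits, one loader instance may see prepareToRead called several times, once per underlying split.

```java
import java.util.Arrays;
import java.util.Iterator;

class StatelessLoaderSketch {
    /** Hypothetical stand-in for a record reader; returns null when exhausted. */
    interface LineReader {
        String nextLine();
    }

    private LineReader reader;
    private long recordsRead;      // per-split state
    private boolean headerSkipped; // per-split state

    /**
     * Plays the role of prepareToRead: besides storing the new reader,
     * it resets every field that describes progress through the old split.
     */
    void prepareToRead(LineReader newReader) {
        this.reader = newReader;
        this.recordsRead = 0;
        this.headerSkipped = false;
    }

    /** Returns the next data line, skipping the first line of each split. */
    String getNext() {
        if (!headerSkipped) {
            headerSkipped = true;
            if (reader.nextLine() == null) {
                return null; // empty split
            }
        }
        String line = reader.nextLine();
        if (line != null) {
            recordsRead++;
        }
        return line;
    }

    long recordsRead() {
        return recordsRead;
    }

    /** Helper for the example: a reader over a fixed set of lines. */
    static LineReader readerOver(String... lines) {
        Iterator<String> it = Arrays.asList(lines).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        StatelessLoaderSketch loader = new StatelessLoaderSketch();
        // Two small files end up in one combined split.
        loader.prepareToRead(readerOver("header", "a", "b"));
        while (loader.getNext() != null) { /* consume first file */ }
        loader.prepareToRead(readerOver("header", "c"));
        System.out.println(loader.getNext());
    }
}
```

If `headerSkipped` were not reset in `prepareToRead`, the header line of the second file would be returned as data, which is the kind of bug the release note warns about.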

      In addition, if a loader implements IndexableLoadFunc, or implements both OrderedLoadFunc and CollectableLoadFunc, its input splits are not subject to combination.
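
The exemption rule can be read as a simple type check. The sketch below assumes the rule means "IndexableLoadFunc alone, or OrderedLoadFunc together with CollectableLoadFunc". The three nested interfaces are empty stand-ins that only borrow the names from the release note, and `isCombinable` and the example loader classes are hypothetical.

```java
class CombinationEligibilitySketch {
    // Empty stand-ins named after the Pig interfaces; not the real types.
    interface IndexableLoadFunc {}
    interface OrderedLoadFunc {}
    interface CollectableLoadFunc {}

    /** Returns false for loaders whose splits must be left uncombined. */
    static boolean isCombinable(Object loader) {
        if (loader instanceof IndexableLoadFunc) {
            return false;
        }
        if (loader instanceof OrderedLoadFunc
                && loader instanceof CollectableLoadFunc) {
            return false;
        }
        return true;
    }

    // Hypothetical loaders used only to exercise the check.
    static class PlainLoader {}
    static class IndexedLoader implements IndexableLoadFunc {}
    static class OrderedOnlyLoader implements OrderedLoadFunc {}
    static class OrderedCollectableLoader
            implements OrderedLoadFunc, CollectableLoadFunc {}

    public static void main(String[] args) {
        System.out.println(isCombinable(new PlainLoader()));
        System.out.println(isCombinable(new OrderedCollectableLoader()));
    }
}
```

Under this reading, a loader that implements only OrderedLoadFunc still has its splits combined.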

      Description

      We frequently run into situations where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient.

      It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like this to work with different data formats if possible.

      There are already a couple of input formats that do something similar: MultiFileInputFormat and CombineFileInputFormat; however, neither works with the new Hadoop 0.20 API.

      We at least want to do a feasibility study for Pig 0.8.0.

      1. PIG-1518.patch
        58 kB
        Yan Zhou
      2. PIG-1518.patch
        58 kB
        Yan Zhou
      3. PIG-1518.patch
        58 kB
        Yan Zhou
      4. PIG-1518.patch
        58 kB
        Yan Zhou
      5. PIG-1518.patch
        58 kB
        Yan Zhou
      6. PIG-1518.patch
        58 kB
        Yan Zhou
      7. PIG-1518.patch
        58 kB
        Yan Zhou
      8. PIG-1518.patch
        52 kB
        Yan Zhou
      9. PIG-1518-0.7.0.patch
        57 kB
        Justin Sanders

            People

            • Assignee:
              Yan Zhou
              Reporter:
              Olga Natkovich
            • Votes:
              0
              Watchers:
              3
