Pig / PIG-1518

multi file input format for loaders

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      Feature: combine input splits that are smaller than the value of the property "pig.maxCombinedSplitSize" or, if that property is not set, smaller than the default block size of the file system at the load's location. The feature can be turned off by setting the property "pig.splitCombination" to "false". When splits are combined, a message like "Total input paths (combined) to process : 7" is logged.
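
To make the threshold rule above concrete, here is a minimal sketch in Java. Only the two property names come from the release note; the class name, the method names, and the greedy packing strategy are assumptions made for illustration. Pig's actual implementation may group splits differently (for example, by taking data locality into account).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Simplified illustration only; NOT Pig's actual implementation.
final class SplitCombinationSketch {

    // Threshold: the property value if set, otherwise the default block size.
    static long maxCombinedSize(Properties props, long defaultBlockSize) {
        String value = props.getProperty("pig.maxCombinedSplitSize");
        return (value == null) ? defaultBlockSize : Long.parseLong(value.trim());
    }

    // The feature is treated as enabled unless the property is "false".
    static boolean combinationEnabled(Properties props) {
        return !"false".equalsIgnoreCase(props.getProperty("pig.splitCombination", "true"));
    }

    // Greedily packs consecutive small splits into groups whose total size
    // stays within the threshold (an assumed strategy, for illustration).
    static List<List<Long>> combine(List<Long> splitSizes, Properties props, long defaultBlockSize) {
        List<List<Long>> groups = new ArrayList<>();
        if (!combinationEnabled(props)) {
            for (Long size : splitSizes) {
                groups.add(List.of(size));   // one group per original split
            }
            return groups;
        }
        long max = maxCombinedSize(props, defaultBlockSize);
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (Long size : splitSizes) {
            if (size >= max) {               // not smaller than the threshold: keep as-is
                groups.add(List.of(size));
                continue;
            }
            if (!current.isEmpty() && currentSize + size > max) {
                groups.add(current);         // close the group before it overflows
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("pig.maxCombinedSplitSize", "100");
        List<Long> sizes = List.of(30L, 40L, 50L, 120L, 10L);
        System.out.println(combine(sizes, props, 64L));
    }
}
```

In this sketch, a split at or above the threshold is never merged, and setting "pig.splitCombination" to "false" yields one group per original split.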

      This feature applies when a user input, or an intermediate input, consists of many small files that would otherwise cause many "under-fed" mappers to be launched and potentially slow down execution.

      This change does not cause any backward compatibility issue, with one exception: a loader implementation that uses the PigSplit object passed to the prepareToRead method might need to be rebuilt, because PigSplit's definition has been modified. However, we currently know of no external use of the object.

      This change also requires the loader to be stateless across invocations of the prepareToRead method. That is, the method should reset any internal state that is not affected by the RecordReader argument.
      Otherwise, this feature should be disabled.
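
The statelessness requirement can be illustrated with a small stand-in written in Java. This is a sketch under stated assumptions: the class below is hypothetical and does not extend Pig's LoadFunc, the reader is modelled as a plain Iterator<String> instead of a Hadoop RecordReader, and the header-skipping behaviour is invented purely to show why per-split state has to be reset. It assumes that, with combination enabled, prepareToRead may be invoked more than once on the same loader instance.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a loader; a real Pig loader would extend
// org.apache.pig.LoadFunc and receive a RecordReader and a PigSplit.
final class StatelessLoaderSketch {
    private Iterator<String> reader;   // replaced on every prepareToRead call
    private long recordsRead;          // per-split state
    private boolean headerSkipped;     // per-split state

    // Resets ALL per-split state, not only the reader reference.
    void prepareToRead(Iterator<String> newReader) {
        this.reader = newReader;
        this.recordsRead = 0;
        this.headerSkipped = false;
    }

    // Returns the next record, or null at the end of the current split.
    String getNext() {
        if (!headerSkipped) {
            headerSkipped = true;
            if (reader.hasNext()) {
                reader.next();         // drop the (hypothetical) header line
            }
        }
        if (!reader.hasNext()) {
            return null;
        }
        recordsRead++;
        return reader.next();
    }

    long getRecordsRead() {
        return recordsRead;
    }

    public static void main(String[] args) {
        StatelessLoaderSketch loader = new StatelessLoaderSketch();
        List<List<String>> files = List.of(
            List.of("header", "a", "b"),
            List.of("header", "c"));
        for (List<String> file : files) {
            loader.prepareToRead(file.iterator());
            for (String rec = loader.getNext(); rec != null; rec = loader.getNext()) {
                System.out.println(rec);
            }
        }
    }
}
```

If prepareToRead did not reset headerSkipped, the header line of the second file would be returned as a data record once the two files were read through the same loader instance.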

      In addition, if a loader implements IndexableLoadFunc, or implements both OrderedLoadFunc and CollectableLoadFunc, its input splits will not be combined.
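
One reading of the exemption rule above, written out as a boolean check in Java. The three nested interfaces are empty local stand-ins that only borrow the names of the Pig interfaces, and the class, method, and sample loader names are hypothetical. The grouping of the conditions (Indexable, or Ordered together with Collectable) follows the wording of the release note and is an assumption about the actual behaviour.

```java
// Illustration only: empty stand-ins, not the real org.apache.pig interfaces.
final class SplitEligibilitySketch {
    interface IndexableLoadFunc {}
    interface OrderedLoadFunc {}
    interface CollectableLoadFunc {}

    // Hypothetical sample loaders covering each case.
    static final class PlainLoader {}
    static final class IndexedLoader implements IndexableLoadFunc {}
    static final class OrderedOnlyLoader implements OrderedLoadFunc {}
    static final class MergeLoader implements OrderedLoadFunc, CollectableLoadFunc {}

    // True if the loader's splits may be combined under the assumed rule.
    static boolean splitsMayBeCombined(Object loader) {
        boolean indexable = loader instanceof IndexableLoadFunc;
        boolean orderedAndCollectable =
            loader instanceof OrderedLoadFunc && loader instanceof CollectableLoadFunc;
        return !(indexable || orderedAndCollectable);
    }

    public static void main(String[] args) {
        System.out.println(splitsMayBeCombined(new PlainLoader()));
        System.out.println(splitsMayBeCombined(new IndexedLoader()));
        System.out.println(splitsMayBeCombined(new OrderedOnlyLoader()));
        System.out.println(splitsMayBeCombined(new MergeLoader()));
    }
}
```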

    Description

      We frequently run into situations where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient.

      It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this work with different data formats if possible.

      There are already a couple of input formats that do something similar: MultiFileInputFormat and CombineFileInputFormat; however, neither works with the new Hadoop 0.20 API.

      We at least want to do a feasibility study for Pig 0.8.0.

      Attachments

        1. PIG-1518.patch
          58 kB
          Yan Zhou
        2. PIG-1518.patch
          58 kB
          Yan Zhou
        3. PIG-1518.patch
          58 kB
          Yan Zhou
        4. PIG-1518.patch
          58 kB
          Yan Zhou
        5. PIG-1518.patch
          58 kB
          Yan Zhou
        6. PIG-1518.patch
          58 kB
          Yan Zhou
        7. PIG-1518.patch
          58 kB
          Yan Zhou
        8. PIG-1518.patch
          52 kB
          Yan Zhou
        9. PIG-1518-0.7.0.patch
          57 kB
          Justin Sanders

            People

              Assignee: Yan Zhou (yanz)
              Reporter: Olga Natkovich (olgan)
              Votes: 0
              Watchers: 3
