Pig / PIG-2647

Split Combining drops splits with empty getLocations()

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0, 0.14.0, 0.13.1
    • Fix Version/s: 0.15.0
    • Component/s: impl
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note: Don't ignore unavailable blocks when combining input splits into PigSplits.

    Description

      In org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil#getCombinePigSplits,
      which is used by PigInputFormat, there is an assumption that every split's getLocations() returns a non-empty array.
      If the following criteria are met:
      1) Split combining is turned on
      2) There is more than one split
      3) There is at least one split that is smaller than the maxCombineSplitSize

      then splits whose getLocations() returns an empty array are silently dropped (ignored), with no warning.

      The Hadoop API does not require every split to return a location, and there are cases where a split may legitimately return none (for example, if the data is not in HDFS, or if the split covers a directory full of HDFS files, in which case there is not much to gain from locality).

      This happens because org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil#getCombinePigSplits
      scans all splits eligible for combining and builds a map of nodes -> splits, then later iterates over the map (not the splits) to do the combining. A split with no locations is never added to the map, so it is never visited.
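
      To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the Pig code: SketchSplit and NodeMapDropSketch are hypothetical stand-ins, and the real method also handles sizes and combining, which are left out here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Hadoop InputSplit; not a Pig or Hadoop class.
final class SketchSplit {
    final String name;
    final String[] locations;

    SketchSplit(String name, String... locations) {
        this.name = name;
        this.locations = locations;
    }
}

final class NodeMapDropSketch {
    // Index splits by host, then walk the index rather than the split list.
    static List<String> combineViaNodeMap(List<SketchSplit> splits) {
        Map<String, List<SketchSplit>> byNode = new LinkedHashMap<>();
        for (SketchSplit s : splits) {
            // Zero iterations when locations is empty, so the split is never indexed.
            for (String host : s.locations) {
                byNode.computeIfAbsent(host, h -> new ArrayList<>()).add(s);
            }
        }
        // Only splits reachable through the map are ever seen from here on.
        List<String> seen = new ArrayList<>();
        for (List<SketchSplit> nodeSplits : byNode.values()) {
            for (SketchSplit s : nodeSplits) {
                if (!seen.contains(s.name)) {
                    seen.add(s.name);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<SketchSplit> splits = List.of(
            new SketchSplit("a", "host1"),
            new SketchSplit("b", "host1", "host2"),
            new SketchSplit("c")); // no locations
        System.out.println(combineViaNodeMap(splits)); // prints [a, b]
    }
}
```

      Split "c" is lost without any error because the second loop has no way to reach it.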

      One solution would be to inject a dummy "empty node" into the map.
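
      A rough sketch of what the "empty node" idea could look like, under the assumption that splits are indexed by host name in a map. This is an illustration only, not the attached patch; LocSplit, EmptyNodeSketch and the NO_LOCATION sentinel are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a Hadoop InputSplit; not a Pig or Hadoop class.
final class LocSplit {
    final String name;
    final String[] locations;

    LocSplit(String name, String... locations) {
        this.name = name;
        this.locations = locations;
    }
}

final class EmptyNodeSketch {
    // Assumed sentinel key; chosen so it cannot collide with a real host name.
    static final String NO_LOCATION = "<no-location>";

    static List<String> combineWithEmptyNode(List<LocSplit> splits) {
        Map<String, List<LocSplit>> byNode = new LinkedHashMap<>();
        for (LocSplit s : splits) {
            // Splits without locations are filed under the dummy node instead of being skipped.
            String[] hosts = s.locations.length == 0
                ? new String[] {NO_LOCATION}
                : s.locations;
            for (String host : hosts) {
                byNode.computeIfAbsent(host, h -> new ArrayList<>()).add(s);
            }
        }
        // Walking the map now reaches every split, including location-less ones.
        Set<String> visited = new LinkedHashSet<>();
        for (List<LocSplit> nodeSplits : byNode.values()) {
            for (LocSplit s : nodeSplits) {
                visited.add(s.name);
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        List<LocSplit> splits = List.of(
            new LocSplit("a", "host1"),
            new LocSplit("b", "host1", "host2"),
            new LocSplit("c")); // no locations
        System.out.println(combineWithEmptyNode(splits)); // prints [a, b, c]
    }
}
```

      The trade-off is that the dummy node is not a real host, so anything that later reports locations for a combined split would need to filter the sentinel out.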

      Overall, the logic in getCombinePigSplits is very complicated and has a lot of edge cases; it might be worth cleaning up.

      Attachments

        1. PIG-2647.patch
          3 kB
          Travis Woodruff

          People

            Assignee: Travis Woodruff (tmwoodruff)
            Reporter: Alex Levenson (alexlevenson)
            Votes: 0
            Watchers: 2