  1. Pig
  2. PIG-1306

[zebra] Support of locally sorted input splits

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • 0.7.0
    • None
    • None

    Description

      Currently, Zebra supports sorted or unsorted input splits on sorted tables or unions of sorted tables. Sorted input splits are based on key ranges that do not overlap. The splits are therefore effectively globally sorted: each split is locally sorted, and the key ranges of different splits do not overlap.

      The main drawback of key-range splits is the performance penalty they incur when the data is skewed. The worst case is a key range that consists solely of one heavily duplicated key: the chunk of data holding that key is effectively unsplittable no matter how many mappers are available, so it has to be processed by a single mapper.
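
      The effect of skew can be illustrated with a small sketch. This is not Zebra code: the function names `key_range_splits` and `locally_sorted_splits`, the toy key list, and the splitting heuristics are all assumptions made for illustration. The first function never separates rows that share a key, so key ranges stay disjoint; the second cuts purely by row position, so each piece is still sorted but neighbouring pieces may share a key.

```python
def key_range_splits(sorted_keys, num_splits):
    """Hypothetical key-range splitter: cut a sorted key list into at most
    num_splits pieces, only at key boundaries, so key ranges never overlap."""
    target = max(1, len(sorted_keys) // num_splits)
    splits, current = [], []
    for key in sorted_keys:
        # Close the current split only when the key changes.
        at_boundary = bool(current) and key != current[-1]
        if len(current) >= target and at_boundary and len(splits) < num_splits - 1:
            splits.append(current)
            current = []
        current.append(key)
    if current:
        splits.append(current)
    return splits


def locally_sorted_splits(sorted_keys, num_splits):
    """Hypothetical locally sorted splitter: cut by row position only.
    Each piece stays sorted, but adjacent pieces may share a key."""
    size = -(-len(sorted_keys) // num_splits)  # ceiling division
    return [sorted_keys[i:i + size] for i in range(0, len(sorted_keys), size)]


# Toy skewed input: key 5 accounts for 10 of the 12 rows.
keys = [1] + [5] * 10 + [9]

by_range = key_range_splits(keys, 4)
by_rows = locally_sorted_splits(keys, 4)

print([len(s) for s in by_range])  # split sizes under key-range splitting
print([len(s) for s in by_rows])   # split sizes under position-based splitting
```

      With this toy input, the key-range splitter cannot break up the run of key 5, so one split carries almost all rows even though four were requested, while the position-based splitter produces four evenly sized pieces.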

      On the other hand, there are scenarios where globally sorted splits are overkill and locally sorted splits are sufficient. One example is the use of a Zebra sorted table as the probe table in a map-side merge inner join.
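
      A minimal sketch of why local sortedness is enough for the merge-join case follows. It is an illustration under stated assumptions, not the Pig or Zebra implementation: `merge_join_split`, the in-memory lists of `(key, value)` pairs, and the sample data are all hypothetical. Each call stands in for one mapper that receives one locally sorted probe split, seeks into the other sorted table at the split's first key, and merges forward.

```python
from bisect import bisect_left


def merge_join_split(probe_split, build_rows):
    """Inner-join one locally sorted probe split against a table sorted by key.

    Both arguments are lists of (key, value) pairs sorted by key.
    (Hypothetical helper; names and data layout are assumptions.)
    """
    if not probe_split:
        return []
    build_keys = [k for k, _ in build_rows]
    # Seek to the first row of the other table that could match this split.
    j = bisect_left(build_keys, probe_split[0][0])
    out = []
    for key, probe_val in probe_split:
        # Skip rows of the other table whose keys are smaller.
        while j < len(build_rows) and build_rows[j][0] < key:
            j += 1
        # Emit matches without consuming them, so that a repeated probe key
        # is matched against the same rows again.
        m = j
        while m < len(build_rows) and build_rows[m][0] == key:
            out.append((key, probe_val, build_rows[m][1]))
            m += 1
    return out


build = [(1, "a"), (5, "b"), (9, "c")]

# Three locally sorted probe splits; key 5 spans all of them.
probe_splits = [
    [(1, "p0"), (5, "p1")],
    [(5, "p2"), (5, "p3")],
    [(5, "p4"), (9, "p5")],
]

joined = [row for split in probe_splits for row in merge_join_split(split, build)]
print(joined)
```

      Because every split is joined independently, nothing in the sketch relies on the key ranges of different splits being disjoint; only the order within each split matters.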

      Attachments

        1. PIG-1306.patch
          134 kB
          Yan Zhou
        2. PIG-1306.patch
          133 kB
          Yan Zhou
        3. PIG-1306.patch
          134 kB
          Yan Zhou
        4. PIG-1306.patch
          148 kB
          Yan Zhou
        5. PIG-1306.patch
          147 kB
          Yan Zhou

          People

            yanz Yan Zhou