Pig
  1. Pig
  2. PIG-1592

ORDER BY distribution is uneven when record size is correlated with order key

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The partitioner contributed in PIG-545 distributes the order key space between partitions so that each partition gets approximately the same number of keys, even when the keys have a non-uniform distribution over the key space.

      Unfortunately this still allows for severe partition imbalance when record size is correlated with the order key. By way of motivating example, consider this script which attempts to produce a list of genuses based on how many species each genus contains:

      set default_parallel 60;
      critters = load 'biodata'' as (genus, species);
      genus_counts = foreach (group critters by genus) generate group as genus, COUNT(critters) as num_species, critters;
      ordered_genuses = order genus_counts by num_species desc;
      store ordered_genuses....
      

      The higher the value of genus_counts, the more species tuples will be contained in the critters bag, the wider the row. This can cause a severe processing imbalance, as the partitioner processing the records with the highest values of genus_counts will have the same number of records as the partitioner processing the lowest number, but it will have far more actual bytes to work on.

        Activity

        Hide
        Dmitriy V. Ryaboy added a comment -

        One proposal is to simply change the default weighted range partitioner to take into account the record size. If record size is uniform, or uniformly distributed, or non-uniformly distributed but independent of the order key, this change shouldn't materially affect the distributions created for data sets not covered by this issue.

        Show
        Dmitriy V. Ryaboy added a comment - One proposal is to simply change the default weighted range partitioner to take into account the record size. If record size is uniform, or uniformly distributed, or non-uniformly distributed but independent of the order key, this change shouldn't materially affect the distributions created for data sets not covered by this issue.

          People

          • Assignee:
            Thejas M Nair
            Reporter:
            Dmitriy V. Ryaboy
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development