PIG-1038

Optimize nested distinct/sort to use secondary key

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.6.0
    • Component/s: impl
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      If a nested foreach plan contains a sort or distinct, Hadoop secondary sort can be used instead of SortedDataBag and DistinctDataBag to optimize the query.

      Eg1:
      A = load 'mydata';
      B = group A by $0;
      C = foreach B {
          D = order A by $1;
          generate group, D;
      }
      store C into 'myresult';

      We can specify a secondary sort on A.$1, and drop "order A by $1".
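The effect of the secondary sort in Eg1 can be sketched outside Pig. The Python below is only a simulation of the shuffle ordering, not Pig or Hadoop code; the function name and the sample records are hypothetical. It assumes each record is a (group key, secondary key) pair, orders the whole stream by that composite key, and then groups on the group key alone, so each group's rows arrive already ordered and no per-group sort is needed.

```python
from itertools import groupby
from operator import itemgetter

def group_with_secondary_sort(records):
    """Simulate a shuffle that orders by (group key, secondary key)
    and then groups by the group key alone."""
    # The framework-level sort: one pass over the composite key.
    shuffled = sorted(records, key=itemgetter(0, 1))
    # Grouping only looks at field 0; rows inside each group keep
    # the order established by the composite sort.
    return [(key, list(rows)) for key, rows in groupby(shuffled, key=itemgetter(0))]

# Hypothetical sample data: ($0, $1) pairs.
records = [("b", 3), ("a", 2), ("b", 1), ("a", 1)]
result = group_with_secondary_sort(records)
for key, rows in result:
    print(key, rows)
```

In this sketch the rows of each group come out ordered by the second field without any sort inside the group, which is the role "order A by $1" played in the nested plan.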

      Eg2:
      A = load 'mydata';
      B = group A by $0;
      C = foreach B {
          D = A.$1;
          E = distinct D;
          generate group, E;
      }
      store C into 'myresult';

      We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct D" to a special version of distinct, which does not do the sorting.
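A distinct that "does not do the sorting" can be illustrated with a small simulation. This is a sketch in Python, not the operator added by the patch; the function names and sample records are hypothetical. It assumes the values of each group arrive already ordered by the secondary key, so duplicates are adjacent and can be dropped by comparing each value with the previous one.

```python
from itertools import groupby
from operator import itemgetter

def distinct_sorted(values):
    """Drop duplicates from an already sorted stream.
    Only the last emitted value is compared, so no sort or hash set is needed."""
    out = []
    for value in values:
        if not out or out[-1] != value:
            out.append(value)
    return out

def group_distinct(records):
    """Simulate the shuffle ordering by ($0, $1), then apply the sort-free distinct per group."""
    shuffled = sorted(records, key=itemgetter(0, 1))
    return [
        (key, distinct_sorted(row[1] for row in rows))
        for key, rows in groupby(shuffled, key=itemgetter(0))
    ]

# Hypothetical sample data: ($0, $1) pairs with duplicates.
records = [("a", 2), ("b", 1), ("a", 1), ("a", 2), ("b", 1)]
distinct_result = group_distinct(records)
for key, values in distinct_result:
    print(key, values)
```

The per-group work in this sketch is a single linear pass, which is the saving the description attributes to relying on the secondary sort key.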

      Attachments

        1. PIG-1038-1.patch
          110 kB
          Daniel Dai
        2. PIG-1038-2.patch
          110 kB
          Daniel Dai
        3. PIG-1038-3.patch
          128 kB
          Daniel Dai
        4. PIG-1038-4.patch
          109 kB
          Daniel Dai
        5. PIG-1038-5.patch
          109 kB
          Daniel Dai

            People

              daijy Daniel Dai
              olgan Olga Natkovich
              Votes: 0
              Watchers: 1