PIG-1038: Optimize nested distinct/sort to use secondary key

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.6.0
    • Component/s: impl
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      If a nested foreach plan contains a sort or distinct, Pig can use Hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query.

      Eg1:
      A = load 'mydata';
      B = group A by $0;
      C = foreach B {
          D = order A by $1;
          generate group, D;
      }
      store C into 'myresult';

      We can specify a secondary sort key on A.$1 and drop "order A by $1", because the tuples of each group then reach the reducer already ordered by $1.
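
To illustrate the idea behind Eg1, here is a minimal Python sketch, not Pig's actual implementation. It assumes the shuffle can be modelled as a single sort on the composite key ($0, $1), followed by grouping on $0 only. The function name `group_with_secondary_sort` and the sample records are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def group_with_secondary_sort(records):
    """Group records by field 0; each group's tuples come out ordered by field 1."""
    # One sort on the composite key (primary = $0, secondary = $1)
    # stands in for the shuffle with a secondary sort key.
    shuffled = sorted(records, key=itemgetter(0, 1))
    # Group on the primary key only. Each bag inherits the secondary
    # order, so no per-group "order A by $1" step is needed.
    for key, rows in groupby(shuffled, key=itemgetter(0)):
        yield key, list(rows)


# Hypothetical sample data: (group key, value) pairs.
records = [("a", 3), ("b", 1), ("a", 1), ("b", 2), ("a", 2)]
result = list(group_with_secondary_sort(records))
```

The point of the sketch is that the ordering work happens once, during the shuffle, rather than once per group inside a sorted bag.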

      Eg2:
      A = load 'mydata';
      B = group A by $0;
      C = foreach B {
          D = A.$1;
          E = distinct D;
          generate group, E;
      }
      store C into 'myresult';

      We can specify a secondary sort key on A.$1 and simplify "D = A.$1; E = distinct D" to a special version of distinct that does not sort: since the input is already ordered by $1, duplicates are adjacent and can be dropped in a single pass.
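
The "special version of distinct" in Eg2 could look roughly like the following Python sketch. It is an assumption about the general technique, not the code in the attached patches: once the values of a group arrive sorted, duplicates can be removed by comparing each value with the previous one. The names `distinct_sorted` and `group_with_distinct` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def distinct_sorted(values):
    """Yield each value once, assuming equal values are adjacent in the input."""
    first = True
    previous = None
    for value in values:
        # A value is new if it is the first one or differs from its predecessor.
        if first or value != previous:
            yield value
        previous = value
        first = False


def group_with_distinct(records):
    """Group by field 0 and emit the distinct values of field 1 per group."""
    # The sort on ($0, $1) stands in for the shuffle with a secondary key.
    shuffled = sorted(records, key=itemgetter(0, 1))
    for key, rows in groupby(shuffled, key=itemgetter(0)):
        # No per-group sort or hash set: one pass over already ordered values.
        yield key, list(distinct_sorted(row[1] for row in rows))


# Hypothetical sample data with duplicate values inside each group.
records = [("a", 2), ("b", 1), ("a", 1), ("a", 2), ("b", 1)]
result = list(group_with_distinct(records))
```

Compared with a bag that sorts its own contents, this variant needs only constant extra memory per group, at the cost of relying on the input order.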

        Attachments

        1. PIG-1038-5.patch (109 kB, Daniel Dai)
        2. PIG-1038-4.patch (109 kB, Daniel Dai)
        3. PIG-1038-3.patch (128 kB, Daniel Dai)
        4. PIG-1038-2.patch (110 kB, Daniel Dai)
        5. PIG-1038-1.patch (110 kB, Daniel Dai)

              People

              • Assignee: Daniel Dai (daijy)
              • Reporter: Olga Natkovich (olgan)