Pig
  1. Pig
  2. PIG-2536

Extend pig to support DISTINCT x.(project)

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Currently, pig does not allow this syntax:

      A = load 'thing' (x:int, y:int, z:int);
      B = distinct A.x;
      C = distinct A.(y,z)
      D = distinct C.$0;
      

      and so on. With this patch, it does. I should probably add more tests, though it's a simple patch... it just turns distinct rel.proj into syntactic sugar for distinct (foreach rel generate proj)

      1. PIG-2436-0.patch
        2 kB
        Jonathan Coveney

        Activity

        Hide
        Jonathan Coveney added a comment -

        Patch is attached. Probably needs more tests.

        Show
        Jonathan Coveney added a comment - Patch is attached. Probably needs more tests.
        Hide
        Dmitriy V. Ryaboy added a comment -

        I think this is fine.. Can I get a second committer opinion?

        Show
        Dmitriy V. Ryaboy added a comment - I think this is fine.. Can I get a second committer opinion?
        Hide
        Julien Le Dem added a comment -

        +1
        Jonathan: you updated the existing tests for DISTINCT so it is covered. LGTM

        Show
        Julien Le Dem added a comment - +1 Jonathan: you updated the existing tests for DISTINCT so it is covered. LGTM
        Hide
        Gianmarco De Francisci Morales added a comment -

        I guess we can commit this, can't we?

        Should we also think about extending this syntax to wherever else it makes sense in Pig Latin?

        Show
        Gianmarco De Francisci Morales added a comment - I guess we can commit this, can't we? Should we also think about extending this syntax to wherever else it makes sense in Pig Latin?
        Hide
        Dmitriy V. Ryaboy added a comment -

        Jon how do you feel about this one?

        Show
        Dmitriy V. Ryaboy added a comment - Jon how do you feel about this one?
        Hide
        Jonathan Coveney added a comment -

        Oh, this patch is fine. I guess we had debated some about whether or not we wanted to include this sort of syntax (since there are other cases where we could use the same), but if people are down, I'm down. just comment here and I'll merge it in.

        Show
        Jonathan Coveney added a comment - Oh, this patch is fine. I guess we had debated some about whether or not we wanted to include this sort of syntax (since there are other cases where we could use the same), but if people are down, I'm down. just comment here and I'll merge it in.
        Hide
        Julien Le Dem added a comment -

        Basically this is making the bag projection syntax work on relations.
        If we want to do this we should look at all the places where a relation can be used and make sure we can have a consistent syntax.
        This type of shortcut is useful only if it consistent, otherwise it is confusing.

        For some it is straightforward:
        B = distinct A.x; => B = distinct (foreach A generate x);
        B = limit A.x 10; => B = limit (foreach A generate x) 10;
        B = sample A.x 0.1; => B = sample (foreach A generate x) 0.1;

        For other operators it could be tricky:
        B = ORDER A.x BY x; => B = ORDER (FOREACH A GENERATE x) BY x;
        B = ORDER A.(y,z) BY x; => B = FOREACH (ORDER A BY x) GENERATE y,z;
        Same for group by:
        B = GROUP A.(y,z) BY x; => B = FOREACH (GROUP A BY x) GENERATE group, A.(y,z);
        And Filter
        B = FILTER A.(y,z) BY x=0 => B = FOREACH (FILTER A BY x=0) GENERATE y,z;

        For Split, Join, cogroup it becomes trickier.

        Show
        Julien Le Dem added a comment - Basically this is making the bag projection syntax work on relations. If we want to do this we should look at all the places where a relation can be used and make sure we can have a consistent syntax. This type of shortcut is useful only if it consistent, otherwise it is confusing. For some it is straightforward: B = distinct A.x; => B = distinct (foreach A generate x); B = limit A.x 10; => B = limit (foreach A generate x) 10; B = sample A.x 0.1; => B = sample (foreach A generate x) 0.1; For other operators it could be tricky: B = ORDER A.x BY x; => B = ORDER (FOREACH A GENERATE x) BY x; B = ORDER A.(y,z) BY x; => B = FOREACH (ORDER A BY x) GENERATE y,z; Same for group by: B = GROUP A.(y,z) BY x; => B = FOREACH (GROUP A BY x) GENERATE group, A.(y,z); And Filter B = FILTER A.(y,z) BY x=0 => B = FOREACH (FILTER A BY x=0) GENERATE y,z; For Split, Join, cogroup it becomes trickier.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Ok I am convinced by Julien's argument for consistency; don't think the cost/benefit is there to work on this right now, though. Let's leave this as-is; the ticket and patch are here if someone wants to come back later and finish the job.

        Show
        Dmitriy V. Ryaboy added a comment - Ok I am convinced by Julien's argument for consistency; don't think the cost/benefit is there to work on this right now, though. Let's leave this as-is; the ticket and patch are here if someone wants to come back later and finish the job.
        Hide
        Haitao Yao added a comment -

        But I do think this feature is very useful. Without this feature, I have to write like this:
        A = load 'test_data' as (a,b,c,d);
        B1 = foreach A generate a,b,c;
        D1 = distinct B1;

        – or like this:
        D = distinct A.(a,b,c);

        less code , less error. I really hope this can be merged into trunk.
        thanks.

        Show
        Haitao Yao added a comment - But I do think this feature is very useful. Without this feature, I have to write like this: A = load 'test_data' as (a,b,c,d); B1 = foreach A generate a,b,c; D1 = distinct B1; – or like this: D = distinct A.(a,b,c); less code , less error. I really hope this can be merged into trunk. thanks.

          People

          • Assignee:
            Jonathan Coveney
            Reporter:
            Jonathan Coveney
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development