Pig / PIG-3010

Allow UDFs to flatten themselves

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels:
      None

      Description

      This is something I've thought would be cool for a while, so I sat down and did it, because I think it would help with some useful debugging tools.

      The idea is that if you attach an annotation to a UDF, the Tuple or DataBag you output will be flattened. This is quite powerful. A very common pattern is:

      a = foreach data generate Flatten(MyUdf(thing)) as (a,b,c);

      This would let you just do:

      a = foreach data generate MyUdf(thing);

      With the exact same result!
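To make the annotation idea concrete, here is a minimal Java sketch of how a planner might detect that a UDF wants its output flattened. It does not use Pig's real classes: the annotation name `FlattenOutput`, the stand-in UDF classes, and the `flattensItself` helper are all hypothetical, and the annotation in the actual patch may be named and wired differently.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;
import java.util.List;

class AnnotationSketch {
    // Hypothetical marker annotation; the name in the actual patch may differ.
    // RUNTIME retention is needed so it can be read through reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface FlattenOutput {}

    // Stand-in for a UDF that returns a three-field tuple and asks to be flattened.
    @FlattenOutput
    static class MyUdf {
        List<Object> exec(Object thing) {
            return Arrays.asList(thing, thing, thing);
        }
    }

    // A UDF without the annotation, for contrast.
    static class PlainUdf {
        List<Object> exec(Object thing) {
            return Arrays.asList(thing);
        }
    }

    // A planner could ask this once per UDF while building the foreach.
    static boolean flattensItself(Class<?> udfClass) {
        return udfClass.isAnnotationPresent(FlattenOutput.class);
    }

    public static void main(String[] args) {
        System.out.println("MyUdf flattens itself: " + flattensItself(MyUdf.class));
        System.out.println("PlainUdf flattens itself: " + flattensItself(PlainUdf.class));
    }
}
```

The point of the design is that the decision moves from the script (an explicit FLATTEN plus an `as` clause) to the UDF author, who declares it once on the class.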

      1. PIG-3010-5.patch
        261 kB
        Jonathan Coveney
      2. PIG-3010-5_nows.patch
        69 kB
        Jonathan Coveney
      3. PIG-3010-4.patch
        261 kB
        Jonathan Coveney
      4. PIG-3010-4_nows.patch
        69 kB
        Jonathan Coveney
      5. PIG-3010-3.patch
        261 kB
        Jonathan Coveney
      6. PIG-3010-3_nows.patch
        69 kB
        Jonathan Coveney
      7. PIG-3010-2_nowhitespace.patch
        68 kB
        Jonathan Coveney
      8. PIG-3010-2.patch
        285 kB
        Jonathan Coveney
      9. PIG-3010-1.patch
        71 kB
        Jonathan Coveney
      10. PIG-3010-0.patch
        79 kB
        Jonathan Coveney

          Activity

          Jonathan Coveney added a comment -

          Here is a patch that does this. The changes reach further than they otherwise might need to, because this is a good time to future-proof flatten by using an enum approach instead.

          A nice side effect is that you can implement FLATTEN as a UDF (though this isn't necessarily desirable as it is going to add some overhead...still, the fact that it can be done is quite powerful). That UDF is src/org/apache/pig/builtin/UdfFlatten.java

          This lets you do a lot of really neat stuff, such as:

          a = load 'data2' as (x:int,y:int);
          b = foreach a generate UdfFlatten(x,y);
          describe b;
          

          which results in:

          b: {x: int,y: int}
          

          Woah! Previously, this was impossible. What happens if you dump? The result is

          (1,10)
          (4,11)
          (5,10)
          

          Woah!

          You can even do the following:

          a = load 'data2' as (x:int,y:int);
          b = foreach a generate UdfFlatten(TOTUPLE(x,y));
          dump b;
          

          And it works for bags as well. The uses are obvious IMHO.
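For readers less familiar with what flattening does to tuples and bags, the semantics can be sketched in plain Java. This is an illustration only: rows, tuples, and bags are modeled as `java.util.List` values rather than Pig's `Tuple` and `DataBag`, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FlattenSemanticsSketch {
    // Flattening a tuple splices its fields into the enclosing row.
    static List<Object> flattenTuple(List<?> prefix, List<?> tuple) {
        List<Object> row = new ArrayList<Object>(prefix);
        row.addAll(tuple);
        return row;
    }

    // Flattening a bag emits one output row per tuple in the bag,
    // so an empty bag produces no rows at all.
    static List<List<Object>> flattenBag(List<?> prefix, List<? extends List<?>> bag) {
        List<List<Object>> rows = new ArrayList<List<Object>>();
        for (List<?> tuple : bag) {
            rows.add(flattenTuple(prefix, tuple));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Tuple case: the fields (1, 10) become top-level columns of the row.
        System.out.println(flattenTuple(Arrays.asList(), Arrays.asList(1, 10)));

        // Bag case: each tuple in the bag becomes its own row, keyed by "k".
        List<List<Integer>> bag = Arrays.asList(Arrays.asList(1, 10), Arrays.asList(4, 11));
        System.out.println(flattenBag(Arrays.asList("k"), bag));
    }
}
```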

          Jonathan Coveney added a comment -

          I've changed nothing, just made sure it was updated with the newest code and diffed off of trunk.

          Dmitriy V. Ryaboy added a comment -

          Can you regenerate without the whitespace changes? It's a 285 kB patch...

          Jonathan Coveney added a comment -

          Attached

          Jonathan Coveney added a comment -

          I went ahead and made a reviewboard here: https://reviews.apache.org/r/9060/

          This is not a small patch, but I'd love comments. I think this would be a huge bump in expressivity for Pig. The current system is tedious and leads to a lot of annoying re-aliasing.

          Jonathan Coveney added a comment -

          The patch applied fine, but I updated it to be cutting edge, here and in the RB. Would love eyes.

          Dmitriy V. Ryaboy added a comment -

          Still has all the whitespace changes...

          Jonathan Coveney added a comment -

          I uploaded a _nows patch?

          Dmitriy V. Ryaboy added a comment -

          Not to RB, though...

          Jonathan Coveney added a comment -

          Ahhh, I was confused about that all along!

          https://reviews.apache.org/r/9529/

          Dmitriy V. Ryaboy added a comment -

          1) patch didn't apply:
          src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java
          Revision 14598cc New Change
          Diff currently unavailable.
          Error: The patch to 'src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java' didn't apply cleanly. The temporary files have been left in '/tmp/reviewboard.3x6PyD' for debugging purposes. `patch` returned:
            patching file /tmp/reviewboard.3x6PyD/tmp7aQP12
            Hunk #5 FAILED at 108.
            1 out of 20 hunks FAILED -- saving rejects to file /tmp/reviewboard.3x6PyD/tmp7aQP12-new.rej

          2) can you describe the general approach here? Looks like the changes are pretty deep.

          Jonathan Coveney added a comment -

          Hmm, odd. Must be something in between... will fix tomorrow (Sverige time). As for the general approach, the idea is simply to replace the boolean flag of "flatten" or "don't flatten" with an enum that can carry more specific information (in this case: do nothing, old flatten, or flatten without alias). The changes are so broad because that boolean flag was read in a lot of places (if I could, I would have flattening handled differently, but alas...). The change to allow UDFs to flatten themselves wasn't too hard in itself, but IMHO the ability to return rows without alias is what makes it useful. Now FLATTEN can be done as a UDF, as can other flatten variants. The range of what we can usefully write is huge now, and we can more effectively manage the namespace cruft that Pig scripts often generate.

          But yeah, the change is pretty simple. I literally just changed the flag to the enum, and followed compiler errors.
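The flag-to-enum change described above can be sketched roughly as follows. This is a minimal illustration, not code from the patch: the enum name `FlattenMode`, its constant names, and both helper methods are hypothetical. The `alias::field` naming used for the old flatten is a simplification of how Pig disambiguates flattened fields, and the reading of "flatten without alias" as "keep the inner field names unprefixed" is an assumption drawn from the comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FlattenModeSketch {
    // Hypothetical names for the three states: do nothing, old flatten,
    // and flatten without alias.
    enum FlattenMode { NONE, FLATTEN, FLATTEN_NO_ALIAS }

    // Bridge for call sites that used to read the old boolean flag.
    static FlattenMode fromLegacyFlag(boolean flatten) {
        return flatten ? FlattenMode.FLATTEN : FlattenMode.NONE;
    }

    // Output column names for a result with the given alias and inner fields.
    // Assumption: old flatten prefixes inner names with the alias, while the
    // new mode leaves them bare.
    static List<String> outputNames(FlattenMode mode, String alias, List<String> innerFields) {
        List<String> names = new ArrayList<String>();
        switch (mode) {
            case NONE:
                names.add(alias); // a single column holding the whole tuple
                break;
            case FLATTEN:
                for (String field : innerFields) {
                    names.add(alias + "::" + field);
                }
                break;
            case FLATTEN_NO_ALIAS:
                names.addAll(innerFields);
                break;
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> inner = Arrays.asList("a", "b", "c");
        for (FlattenMode mode : FlattenMode.values()) {
            System.out.println(mode + " -> " + outputNames(mode, "r", inner));
        }
    }
}
```

An enum also explains why the patch touches so many files: every place that branched on the boolean now has to handle a third case, and the compiler errors point to each of them.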

          Jonathan Coveney added a comment -

          ws rb: https://reviews.apache.org/r/9060/
          nows rb: https://reviews.apache.org/r/9529/

          Should all be good now.

          Alan Gates added a comment -

          Patch no longer applies. This causes review board to not show the diffs either. Sorry for waiting so long on this.

          Daniel Dai added a comment -

          Jonathan Coveney, are you still working on it?


            People

            • Assignee: Jonathan Coveney
            • Reporter: Jonathan Coveney
            • Votes: 0
            • Watchers: 4
