Pig / PIG-3010

Allow UDFs to flatten themselves

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: None
    • Labels:
      None

      Description

      This is something I thought would be cool for a while, so I sat down and did it because I think there are some useful debugging tools it'd help with.

      The idea is that if you attach an annotation to a UDF, the Tuple or DataBag you output will be flattened. This is quite powerful. A very common pattern is:

      a = foreach data generate Flatten(MyUdf(thing)) as (a,b,c);

      This would let you just do:

      a = foreach data generate MyUdf(thing);

      With the exact same result!
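
The annotation idea can be sketched in plain Java, without any Pig dependency. Everything named here is an assumption made for illustration: `OutputFlattened` is a hypothetical annotation name (the attached patches may call it something else), and `Udf` is a stand-in for Pig's `EvalFunc`. The sketch only shows how a planner might detect the marker by reflection when deciding whether to flatten a UDF's output.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;
import java.util.List;

public class SelfFlattenSketch {
    /** Hypothetical marker annotation; the name used in the patch may differ. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface OutputFlattened {}

    /** Stand-in for a UDF; a real one would extend org.apache.pig.EvalFunc. */
    public interface Udf {
        List<Object> exec(Object input);
    }

    /** A UDF that asks for its output tuple to be flattened. */
    @OutputFlattened
    public static class MyUdf implements Udf {
        @Override
        public List<Object> exec(Object input) {
            return Arrays.asList(input, "b", "c");
        }
    }

    /** A UDF without the marker; its output stays a single nested column. */
    public static class PlainUdf implements Udf {
        @Override
        public List<Object> exec(Object input) {
            return Arrays.asList(input);
        }
    }

    /** Reflection check a planner could run while building the foreach. */
    public static boolean shouldFlatten(Udf udf) {
        return udf.getClass().isAnnotationPresent(OutputFlattened.class);
    }

    public static void main(String[] args) {
        System.out.println(shouldFlatten(new MyUdf()));    // prints "true"
        System.out.println(shouldFlatten(new PlainUdf())); // prints "false"
    }
}
```

Under this reading, the script author no longer writes the FLATTEN call; the UDF author opts in once, on the class.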

      1. PIG-3010-5.patch
        261 kB
        Jonathan Coveney
      2. PIG-3010-5_nows.patch
        69 kB
        Jonathan Coveney
      3. PIG-3010-4.patch
        261 kB
        Jonathan Coveney
      4. PIG-3010-4_nows.patch
        69 kB
        Jonathan Coveney
      5. PIG-3010-3.patch
        261 kB
        Jonathan Coveney
      6. PIG-3010-3_nows.patch
        69 kB
        Jonathan Coveney
      7. PIG-3010-2.patch
        285 kB
        Jonathan Coveney
      8. PIG-3010-2_nowhitespace.patch
        68 kB
        Jonathan Coveney
      9. PIG-3010-1.patch
        71 kB
        Jonathan Coveney
      10. PIG-3010-0.patch
        79 kB
        Jonathan Coveney

        Issue Links

          Activity

          Jonathan Coveney created issue -
          Jonathan Coveney added a comment -

          Here is a patch that does this. The changes are further reaching than they otherwise might need to be, but this is because this is a good time to futureproof flatten by using an enum approach instead.

          A nice side effect is that you can implement FLATTEN as a UDF (though this isn't necessarily desirable as it is going to add some overhead...still, the fact that it can be done is quite powerful). That UDF is src/org/apache/pig/builtin/UdfFlatten.java

          This lets you do a lot of really neat stuff, such as:

          a = load 'data2' as (x:int,y:int);
          b = foreach a generate UdfFlatten(x,y);
          describe b;
          

          which results in:

          b: {x: int,y: int}
          

          Woah! Previously, this was impossible. What happens if you dump? The result is

          (1,10)
          (4,11)
          (5,10)
          

          Woah!

          You can even do the following:

          a = load 'data2' as (x:int,y:int);
          b = foreach a generate UdfFlatten(TOTUPLE(x,y));
          dump b;
          

          And it works for bags as well. The uses are obvious IMHO.

          Jonathan Coveney made changes -
          Field Original Value New Value
          Attachment PIG-3010-0.patch [ 12550995 ]
          Jonathan Coveney made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Jonathan Coveney made changes -
          Attachment PIG-3010-1.patch [ 12550997 ]
          Jonathan Coveney made changes -
          Link This issue blocks PIG-3088 [ PIG-3088 ]
          Jonathan Coveney added a comment -

          I've changed nothing, just made sure it was updated with the newest code and diffed off of trunk.

          Jonathan Coveney made changes -
          Attachment PIG-3010-2.patch [ 12560306 ]
          Dmitriy V. Ryaboy added a comment -

          can you regenerate without the ws changes? 285Kb patch..

          Jonathan Coveney made changes -
          Attachment PIG-3010-2_nowhitespace.patch [ 12560703 ]
          Jonathan Coveney added a comment -

          Attached

          Jonathan Coveney added a comment -

          I went ahead and made a reviewboard here: https://reviews.apache.org/r/9060/

          This is not a small patch, but I'd love comments. I think this would be a huge bump in expressivity for Pig. The current system is very annoying and leads to a lot of annoying realiasing.

          Jonathan Coveney made changes -
          Attachment PIG-3010-3_nows.patch [ 12566053 ]
          Attachment PIG-3010-3.patch [ 12566054 ]
          Jonathan Coveney added a comment -

          The patch applied fine, but I updated it to be cutting edge, here and in the RB. Would love eyes.

          Jonathan Coveney made changes -
          Attachment PIG-3010-4_nows.patch [ 12569836 ]
          Attachment PIG-3010-4.patch [ 12569837 ]
          Dmitriy V. Ryaboy added a comment -

          Still has all the whitespace changes...

          Jonathan Coveney added a comment -

          I uploaded a _nows patch?

          Dmitriy V. Ryaboy added a comment -

          not to rb..

          Jonathan Coveney added a comment -

          Ahhh, I was confused about that all along!

          https://reviews.apache.org/r/9529/

          Dmitriy V. Ryaboy added a comment -

          1) patch didn't apply:
          src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java
          Revision 14598cc New Change
          Diff currently unavailable.
          Error: The patch to 'src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java' didn't apply cleanly. The temporary files have been left in '/tmp/reviewboard.3x6PyD' for debugging purposes. `patch` returned: patching file /tmp/reviewboard.3x6PyD/tmp7aQP12 Hunk #5 FAILED at 108. 1 out of 20 hunks FAILED – saving rejects to file /tmp/reviewboard.3x6PyD/tmp7aQP12-new.rej

          2) can you describe the general approach here? Looks like the changes are pretty deep.

          Jonathan Coveney added a comment -

          Hmm, odd. Must be something in between... will fix tomorrow (Sweden time). As for the general approach, the idea is simply to replace the boolean flag of "flatten" or "don't flatten" with an Enum that can carry more specific information (in this case: do nothing, old flatten, or flatten without alias). The reason the changes are so broad is that the boolean flag was read in a lot of places (if I could, I would have flattening handled differently, but alas). The change to allow UDFs to flatten themselves wasn't too hard in itself, but IMHO the ability to return rows without alias is what makes it useful. Now FLATTEN can be done as a UDF, as can other flatten variants. The range of what we can usefully write is huge now, and we can more effectively manage the namespace cruft that Pig scripts often generate.

          But yeah, the change is pretty simple. I literally just changed the flag to the enum, and followed compiler errors.
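
As a rough illustration of the flag-to-enum change described in this comment, here is a minimal, self-contained Java sketch. The enum constant names, the `expand` and `schema` helpers, and the use of `List` as a stand-in for a Pig tuple are all assumptions made for illustration; they are not taken from the patch. The `alias::field` naming for the old flatten and the bare field names for the no-alias variant are one plausible reading of "flatten without alias".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlattenModeSketch {
    /** Hypothetical names for the three states the comment lists. */
    public enum FlattenMode {
        NONE,             // do nothing
        FLATTEN,          // old flatten
        FLATTEN_NO_ALIAS  // flatten without alias
    }

    /** Expand one output column of a row; a List stands in for a Pig tuple. */
    public static List<Object> expand(Object value, FlattenMode mode) {
        List<Object> out = new ArrayList<>();
        if (mode != FlattenMode.NONE && value instanceof List) {
            out.addAll((List<?>) value); // splice the tuple's fields into the row
        } else {
            out.add(value);              // keep the value as a single column
        }
        return out;
    }

    /** Field names produced for a column, given its alias and inner field names. */
    public static List<String> schema(String alias, List<String> inner, FlattenMode mode) {
        List<String> names = new ArrayList<>();
        switch (mode) {
            case NONE:
                names.add(alias);
                break;
            case FLATTEN:
                for (String field : inner) {
                    names.add(alias + "::" + field); // assumed prefixed naming
                }
                break;
            case FLATTEN_NO_ALIAS:
                names.addAll(inner);                 // bare field names
                break;
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> inner = Arrays.asList("x", "y");
        System.out.println(schema("t", inner, FlattenMode.FLATTEN));          // prints "[t::x, t::y]"
        System.out.println(schema("t", inner, FlattenMode.FLATTEN_NO_ALIAS)); // prints "[x, y]"
    }
}
```

Every site that used to test the boolean now has to handle a third case, which may be why a conceptually small change touches so many files.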

          Jonathan Coveney added a comment -

          ws rb https://reviews.apache.org/r/9060/
          nows rb https://reviews.apache.org/r/9529/
          should all be good now

          Jonathan Coveney made changes -
          Attachment PIG-3010-5_nows.patch [ 12570284 ]
          Attachment PIG-3010-5.patch [ 12570285 ]
          Alan Gates added a comment -

          Patch no longer applies. This causes review board to not show the diffs either. Sorry for waiting so long on this.

          Alan Gates made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Gavin made changes -
          Link This issue blocks PIG-3088 [ PIG-3088 ]
          Gavin made changes -
          Link This issue is depended upon by PIG-3088 [ PIG-3088 ]
          Daniel Dai added a comment -

          Jonathan Coveney, are you still working on it?

          Daniel Dai made changes -
          Fix Version/s 0.13.0 [ 12324971 ]
          Fix Version/s 0.12.0 [ 12323380 ]
          Aniket Mokashi made changes -
          Fix Version/s 0.14.0 [ 12326954 ]
          Fix Version/s 0.13.0 [ 12324971 ]
          Daniel Dai made changes -
          Fix Version/s 0.15.0 [ 12328760 ]
          Fix Version/s 0.14.0 [ 12326954 ]
          Daniel Dai made changes -
          Fix Version/s 0.16.0 [ 12332168 ]
          Fix Version/s 0.15.0 [ 12328760 ]
          Transition                 Time In Source Status   Execution Times   Last Executer      Last Execution Date
          Open -> Patch Available    1m 53s                  1                 Jonathan Coveney   26/Oct/12 19:34
          Patch Available -> Open    180d 21h 58m            1                 Alan Gates         25/Apr/13 17:33

            People

            • Assignee:
              Jonathan Coveney
              Reporter:
              Jonathan Coveney
            • Votes:
              2
              Watchers:
              6
