Pig
  1. Pig
  2. PIG-1387

Syntactical Sugar for PIG-1385

    Details

    • Type: Wish Wish
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.10.0
    • Component/s: grunt
    • Labels:

      Description

      From this conversation, extend PIG-1385 to instead of calling UDF use built-in behavior when the (),{},[] groupings are encountered.

      > > What about making them part of the language using symbols?
      > >
      > > instead of
      > >
      > > foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
      > >
      > > have language support
      > >
      > > foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
      > >
      > > or even:
      > >
      > > foreach T generate ($0, $1, $2), {$3, $4, $5}, $6#$7, $8#$9, $10, $11;
      > >
      > >
      > > Is there reason not to do the second or third other than being more
      > > complicated?
      > >
      > > Certainly I'd volunteer to put the top implementation in to the util
      > > package and submit them for builtin's, but the latter syntactic candies
      > > seems more natural..
      > >
      > >
      > >
      > > On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates <gates@yahoo-inc.com> wrote:
      > >
      > >> The grouping package in piggybank is left over from back when Pig
      > allowed
      > >> users to define grouping functions (0.1). Functions like these should
      > go in
      > >> evaluation.util.
      > >>
      > >> However, I'd consider putting these in builtin (in main Pig) instead.
      > >> These are things everyone asks for and they seem like a reasonable
      > addition
      > >> to the core engine. This will be more of a burden to write (as we'll
      > hold
      > >> them to a higher standard) but of more use to people as well.
      > >>
      > >> Alan.
      > >>
      > >>
      > >> On Apr 19, 2010, at 12:53 PM, hc busy wrote:
      > >>
      > >> Some times I wonder... I mean, somebody went to the trouble of making a
      > >>> path
      > >>> called
      > >>>
      > >>> org.apache.pig.piggybank.grouping
      > >>>
      > >>> (where it seems like this code belong), but didn't check in any java
      > code
      > >>> into that package.
      > >>>
      > >>>
      > >>> Any comment about where to put this kind of utility classes?
      > >>>
      > >>>
      > >>>
      > >>> On Mon, Apr 19, 2010 at 12:07 PM, Andrey S <octo47@gmail.com> wrote:
      > >>>
      > >>> 2010/4/19 hc busy <hc.busy@gmail.com>
      > >>>>
      > >>>> That's just the way it is right now, you can't make bags or tuples
      > >>>>> directly... Maybe we should have some UDF's in piggybank for these:
      > >>>>>
      > >>>>> toBag()
      > >>>>> toTuple(); --which is kinda like exec(Tuple in)

      {return in;}

      > >>>>> TupleToBag(); --some times you need it this way for some reason.
      > >>>>>
      > >>>>>
      > >>>>> Ok. I place my current code here, may be later I make a patch (if
      > such
      > >>>> implementation is acceptable of course).
      > >>>>
      > >>>> import org.apache.pig.EvalFunc;
      > >>>> import org.apache.pig.data.BagFactory;
      > >>>> import org.apache.pig.data.DataBag;
      > >>>> import org.apache.pig.data.Tuple;
      > >>>> import org.apache.pig.data.TupleFactory;
      > >>>>
      > >>>> import java.io.IOException;
      > >>>>
      > >>>> /**
      > >>>> * Convert any sequence of fields to bag with specified count of
      > >>>> fields<br>
      > >>>> * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
      > >>>> * Output: count=2, then

      { (fld1, fld2) , (fld3, fld4) ... }

      > >>>> *
      > >>>> * @author astepachev
      > >>>> */
      > >>>> public class ToBag extends EvalFunc<DataBag> {
      > >>>> public BagFactory bagFactory;
      > >>>> public TupleFactory tupleFactory;
      > >>>>
      > >>>> public ToBag()

      { > >>>> bagFactory = BagFactory.getInstance(); > >>>> tupleFactory = TupleFactory.getInstance(); > >>>> }

      > >>>>
      > >>>> @Override
      > >>>> public DataBag exec(Tuple input) throws IOException {
      > >>>> if (input.isNull())
      > >>>> return null;
      > >>>> final DataBag bag = bagFactory.newDefaultBag();
      > >>>> final Integer couter = (Integer) input.get(0);
      > >>>> if (couter == null)
      > >>>> return null;
      > >>>> Tuple tuple = tupleFactory.newTuple();
      > >>>> for (int i = 0; i < input.size() - 1; i++) {
      > >>>> if (i % couter == 0)

      { > >>>> tuple = tupleFactory.newTuple(); > >>>> bag.add(tuple); > >>>> }

      > >>>> tuple.append(input.get(i + 1));
      > >>>> }
      > >>>> return bag;
      > >>>> }
      > >>>> }
      > >>>>
      > >>>> import org.apache.pig.ExecType;
      > >>>> import org.apache.pig.PigServer;
      > >>>> import org.junit.Before;
      > >>>> import org.junit.Test;
      > >>>>
      > >>>> import java.io.IOException;
      > >>>> import java.net.URISyntaxException;
      > >>>> import java.net.URL;
      > >>>>
      > >>>> import static org.junit.Assert.assertTrue;
      > >>>>
      > >>>> /**
      > >>>> * @author astepachev
      > >>>> */
      > >>>> public class ToBagTest {
      > >>>> PigServer pigServer;
      > >>>> URL inputTxt;
      > >>>>
      > >>>> @Before
      > >>>> public void init() throws IOException, URISyntaxException

      { > >>>> pigServer = new PigServer(ExecType.LOCAL); > >>>> inputTxt = > >>>> this.getClass().getResource("bagTest.txt").toURI().toURL(); > >>>> }

      > >>>>
      > >>>> @Test
      > >>>> public void testSimple() throws IOException

      { > >>>> pigServer.registerQuery("a = load '" + inputTxt.toExternalForm() > + > >>>> "' using PigStorage(',') " + > >>>> "as (id:int, a:chararray, b:chararray, c:chararray, > >>>> d:chararray);"); > >>>> pigServer.registerQuery("last = foreach a generate flatten(" + > >>>> ToBag.class.getName() + "(2, id, a, id, b, id, c));"); > >>>> > >>>> pigServer.deleteFile("target/pigtest/func1.txt"); > >>>> pigServer.store("last", "target/pigtest/func1.txt"); > >>>> assertTrue(pigServer.fileSize("target/pigtest/func1.txt") > 0); > >>>> }

      > >>>> }
      > >>>>
      > >>>>
      > >>
      > >
      >

      This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

      1. PIG-1387.4.patch
        7 kB
        Gianmarco De Francisci Morales
      2. PIG-1387.3.patch
        4 kB
        Gianmarco De Francisci Morales
      3. PIG-1387.2.patch
        3 kB
        Gianmarco De Francisci Morales
      4. PIG-1387.1.patch
        3 kB
        Gianmarco De Francisci Morales

        Issue Links

          Activity

          Hide
          Alan Gates added a comment -

          Adding () for tuples, {} for bags, and [] for maps seems reasonable, especially given that we support that for constants. Given that we want to do it in line with constants, the {} for bags should include the () for the tuples inside the bag. So

          foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
          

          above would change to

          foreach T generate ($0, $1, $2), {($3), ($4), ($5)}, $6, $7;
          
          Show
          Alan Gates added a comment - Adding () for tuples, {} for bags, and [] for maps seems reasonable, especially given that we support that for constants. Given that we want to do it in line with constants, the {} for bags should include the () for the tuples inside the bag. So foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7; above would change to foreach T generate ($0, $1, $2), {($3), ($4), ($5)}, $6, $7;
          Hide
          Gianmarco De Francisci Morales added a comment -

          Trying to sort out ideas on how to do this... correct me if I am wrong.

          The result of ($0, $1, $2),

          {($3), ($4), ($5)}

          , $6, $7;
          should be the same of TOTUPLE($0, $1, $2), TOBAG($3, $4, $5), $6, $7;
          and the schema would be {t$0, $1, $2), b:

          {t:($345)}

          , $6, $7}

          So the first way I would try is doing syntactical translation in the parser:
          whenever I encounter {} in the generate, I translate it to a user func call to TOBAG.

          I assume I would get the same result with a cast, but it is more complex as I would have to translate the schema as well.

          An alternative would be to do it in the LogicalPlanBuilder by translating a proper AST token to a UserFuncExpression call, and insert it into the plan in the right place.
          I see this option as more complex, and I don't see any clear advantage.

          Thoughts?

          Show
          Gianmarco De Francisci Morales added a comment - Trying to sort out ideas on how to do this... correct me if I am wrong. The result of ($0, $1, $2), {($3), ($4), ($5)} , $6, $7; should be the same of TOTUPLE($0, $1, $2), TOBAG($3, $4, $5), $6, $7; and the schema would be {t $0, $1, $2), b: {t:($345)} , $6, $7} So the first way I would try is doing syntactical translation in the parser: whenever I encounter {} in the generate, I translate it to a user func call to TOBAG. I assume I would get the same result with a cast, but it is more complex as I would have to translate the schema as well. An alternative would be to do it in the LogicalPlanBuilder by translating a proper AST token to a UserFuncExpression call, and insert it into the plan in the right place. I see this option as more complex, and I don't see any clear advantage. Thoughts?
          Hide
          Thejas M Nair added a comment -

          Yes, I think the translation can be done in either LogicalPlanGenerator.g or LogicalPlanBuilder. In general, I think it is better to have anything that involves more than handful of lines of Java code in LogicalPlanBuilder.java (instead of antlr file). But in this case, it should only be a few lines of java code, so it can be done in LogicalPlanGenerator.g itself.

          Show
          Thejas M Nair added a comment - Yes, I think the translation can be done in either LogicalPlanGenerator.g or LogicalPlanBuilder. In general, I think it is better to have anything that involves more than handful of lines of Java code in LogicalPlanBuilder.java (instead of antlr file). But in this case, it should only be a few lines of java code, so it can be done in LogicalPlanGenerator.g itself.
          Hide
          Gianmarco De Francisci Morales added a comment -

          Very first approach to this feature in PIG-1387.1.patch.
          For now, I am focusing on TOBAG, I assume other TO* functions would be similar.
          We might have a problem for TOTUPLE because it starts with LEFT_PAREN that is also used by other rules.
          I am not sure this is the direction we want to go.

          Basically what I am doing is to do the translation in the parser.
          When I encounter a LEFT_CURLY in a GENERATE statement, I process it as if it were a function call.
          I generate a FUNC_EVAL virtual token and process the rest of the expression as arguments to the function.

          I see some issues with this approach, for example one needs to be careful to change the grammar of the GENERATE statement if one ever changes the grammar of the function call, because there is an implicit dependence between them.
          Also, I feel it might be complicated to properly parse the arguments in some extreme cases (but I have no example at hand).

          The pro is that it is extremely easy to perform the change, and that it is purely syntactical, which means that it reduces the chances of bugs.

          I would like to have the opinion of the community before going on.

          Some small examples:

          
          grunt> cat a.txt
          1       11
          2       3
          3       10
          4       11
          5       10
          6       15
          
          grunt> a = load 'a.txt' as (id,num);  b = foreach a generate TOBAG($0);
          grunt> dump b
          ({(1)})
          ({(2)})
          ({(3)})
          ({(4)})
          ({(5)})
          ({(6)})
          
          grunt> a = load 'a.txt' as (id,num);  b = foreach a generate {$0};        
          grunt> dump b                                                     
          ({(1)})
          ({(2)})
          ({(3)})
          ({(4)})
          ({(5)})
          ({(6)})
          
          grunt> a = load 'a.txt' as (id,num);  b = foreach a generate TOBAG($0,$1);
          grunt> dump b                                                             
          ({(1),(11)})
          ({(2),(3)})
          ({(3),(10)})
          ({(4),(11)})
          ({(5),(10)})
          ({(6),(15)})
          
          grunt> a = load 'a.txt' as (id,num);  b = foreach a generate {$0,$1};
          grunt> dump b                                                        
          ({(1),(11)})
          ({(2),(3)})
          ({(3),(10)})
          ({(4),(11)})
          ({(5),(10)})
          ({(6),(15)})
          
          

          And this is the logical plan generated:

          
          #-----------------------------------------------
          # New Logical Plan:
          #-----------------------------------------------
          b: (Name: LOStore Schema: #191:bag{#192:tuple(#193:bytearray)})
          |
          |---b: (Name: LOForEach Schema: #191:bag{#192:tuple(#193:bytearray)})
              |   |
              |   (Name: LOGenerate[false] Schema: #191:bag{#192:tuple(#193:bytearray)})
              |   |   |
              |   |   (Name: UserFunc(org.apache.pig.builtin.TOBAG) Type: bag Uid: 191)
              |   |   |
              |   |   |---id:(Name: Project Type: bytearray Uid: 183 Input: 0 Column: (*))
              |   |   |
              |   |   |---num:(Name: Project Type: bytearray Uid: 184 Input: 1 Column: (*))
              |   |
              |   |---(Name: LOInnerLoad[0] Schema: id#183:bytearray)
              |   |
              |   |---(Name: LOInnerLoad[1] Schema: num#184:bytearray)
              |
              |---a: (Name: LOLoad Schema: id#183:bytearray,num#184:bytearray)RequiredFields:null
          
          
          Show
          Gianmarco De Francisci Morales added a comment - Very first approach to this feature in PIG-1387 .1.patch. For now, I am focusing on TOBAG, I assume other TO* functions would be similar. We might have a problem for TOTUPLE because it starts with LEFT_PAREN that is also used by other rules. I am not sure this is the direction we want to go. Basically what I am doing is to do the translation in the parser. When I encounter a LEFT_CURLY in a GENERATE statement, I process it as if it were a function call. I generate a FUNC_EVAL virtual token and process the rest of the expression as arguments to the function. I see some issues with this approach, for example one needs to be careful to change the grammar of the GENERATE statement if one ever changes the grammar of the function call, because there is an implicit dependence between them. Also, I feel it might be complicated to properly parse the arguments in some extreme cases (but I have no example at hand). The pro is that it is extremely easy to perform the change, and that it is purely syntactical, which means that it reduces the chances of bugs. I would like to have the opinion of the community before going on. Some small examples: grunt> cat a.txt 1 11 2 3 3 10 4 11 5 10 6 15 grunt> a = load 'a.txt' as (id,num); b = foreach a generate TOBAG($0); grunt> dump b ({(1)}) ({(2)}) ({(3)}) ({(4)}) ({(5)}) ({(6)}) grunt> a = load 'a.txt' as (id,num); b = foreach a generate {$0}; grunt> dump b ({(1)}) ({(2)}) ({(3)}) ({(4)}) ({(5)}) ({(6)}) grunt> a = load 'a.txt' as (id,num); b = foreach a generate TOBAG($0,$1); grunt> dump b ({(1),(11)}) ({(2),(3)}) ({(3),(10)}) ({(4),(11)}) ({(5),(10)}) ({(6),(15)}) grunt> a = load 'a.txt' as (id,num); b = foreach a generate {$0,$1}; grunt> dump b ({(1),(11)}) ({(2),(3)}) ({(3),(10)}) ({(4),(11)}) ({(5),(10)}) ({(6),(15)}) And this is the logical plan generated: #----------------------------------------------- # New Logical Plan: #----------------------------------------------- b: (Name: LOStore Schema: #191:bag{#192:tuple(#193:bytearray)}) | |---b: (Name: LOForEach Schema: #191:bag{#192:tuple(#193:bytearray)}) | | | (Name: LOGenerate[ false ] Schema: #191:bag{#192:tuple(#193:bytearray)}) | | | | | (Name: UserFunc(org.apache.pig.builtin.TOBAG) Type: bag Uid: 191) | | | | | |---id:(Name: Project Type: bytearray Uid: 183 Input: 0 Column: (*)) | | | | | |---num:(Name: Project Type: bytearray Uid: 184 Input: 1 Column: (*)) | | | |---(Name: LOInnerLoad[0] Schema: id#183:bytearray) | | | |---(Name: LOInnerLoad[1] Schema: num#184:bytearray) | |---a: (Name: LOLoad Schema: id#183:bytearray,num#184:bytearray)RequiredFields: null
          Hide
          Thejas M Nair added a comment -

          The approach looks fine. But I think this should be a type of expr and and not flatten_generated_item, so that it can be used anywhere, including as argument to other udfs.
          In case of TOTUPLE, the single element case will cause ambiguity, I think that should not be considered a TOTUPLE call. ie, (20) should be considered the same as 20.

          Show
          Thejas M Nair added a comment - The approach looks fine. But I think this should be a type of expr and and not flatten_generated_item, so that it can be used anywhere, including as argument to other udfs. In case of TOTUPLE, the single element case will cause ambiguity, I think that should not be considered a TOTUPLE call. ie, (20) should be considered the same as 20.
          Hide
          Gianmarco De Francisci Morales added a comment -

          I followed Thejas' comment and implemented the translation at the expr level.
          More precisely, as a projectable_expr, which is the closest gateway to func_eval.

          From my quick tests it looks like it is working.
          I tested all three TO* functions. I also tested casts and operations inside the functions.

          I am not too convinced about handling the single parentheses ($0) as a special case, but it shouldn't be hard to do if we want.

          TODO:
          Write tests

          Show
          Gianmarco De Francisci Morales added a comment - I followed Thejas' comment and implemented the translation at the expr level. More precisely, as a projectable_expr, which is the closest gateway to func_eval. From my quick tests it looks like it is working. I tested all three TO* functions. I also tested casts and operations inside the functions. I am not too convinced about handling the single parentheses ($0) as a special case, but it shouldn't be hard to do if we want. TODO: Write tests
          Hide
          Thejas M Nair added a comment -

          If we treat ($0) as a TOTUPLE case, that would break backward compatibility (eg. 'a + (b)' ). It is also not going to be very intuitive to new users who might not be familiar with this 'advanced' syntax.

          Show
          Thejas M Nair added a comment - If we treat ($0) as a TOTUPLE case, that would break backward compatibility (eg. 'a + (b)' ). It is also not going to be very intuitive to new users who might not be familiar with this 'advanced' syntax.
          Hide
          Gianmarco De Francisci Morales added a comment -

          Back on the task.

          Attaching PIG-1387.3.patch

          Now the feature should work as intended.
          Single element tuples do not trigger a TOTUPLE call.

           a = load 'a.txt' as (id:int,num:int); c = foreach a generate ($1) + 1;
          grunt> dump c
          (12)
          (4)
          (11)
          (12)
          (11)
          (16)
          
          

          For the tests, I was planning to write 2 equivalent scripts (sugarized/non-sugarized) and compare the logical plans.

          This should make them faster and test exactly what the feature should do.

          Any comments on this?
          Also, where should this kind of tests go with the new test layout?

          Show
          Gianmarco De Francisci Morales added a comment - Back on the task. Attaching PIG-1387 .3.patch Now the feature should work as intended. Single element tuples do not trigger a TOTUPLE call. a = load 'a.txt' as (id: int ,num: int ); c = foreach a generate ($1) + 1; grunt> dump c (12) (4) (11) (12) (11) (16) For the tests, I was planning to write 2 equivalent scripts (sugarized/non-sugarized) and compare the logical plans. This should make them faster and test exactly what the feature should do. Any comments on this? Also, where should this kind of tests go with the new test layout?
          Hide
          Alan Gates added a comment -

          If you plan to compare logical plans, this belongs in unit tests, using junit.

          I would encourage you to also write end-to-end tests with the new "sugared" syntax, using old standard syntax as the verification script. I'm happy to help you write the tests if you want me to.

          Show
          Alan Gates added a comment - If you plan to compare logical plans, this belongs in unit tests, using junit. I would encourage you to also write end-to-end tests with the new "sugared" syntax, using old standard syntax as the verification script. I'm happy to help you write the tests if you want me to.
          Hide
          Gianmarco De Francisci Morales added a comment -

          Thanks Alan, any help with the new test infrastructure is very appreciated.

          Is there a specific reason to write end-to-end test for this feature?
          From my point of view, this is a front-end issue, so if the logical plans are the same the rest doesn't matter.

          Show
          Gianmarco De Francisci Morales added a comment - Thanks Alan, any help with the new test infrastructure is very appreciated. Is there a specific reason to write end-to-end test for this feature? From my point of view, this is a front-end issue, so if the logical plans are the same the rest doesn't matter.
          Hide
          Daniel Dai added a comment -

          You can follow existing tests in test/e2e/pig/tests/nightly.conf to get an idea how to write an end-to-end test.

          Yes, this patch is logical plan only. But usually an end-to-end test is encouraged for a new feature to make sure the whole process works. It happens we break a feature without even realizing it.

          Show
          Daniel Dai added a comment - You can follow existing tests in test/e2e/pig/tests/nightly.conf to get an idea how to write an end-to-end test. Yes, this patch is logical plan only. But usually an end-to-end test is encouraged for a new feature to make sure the whole process works. It happens we break a feature without even realizing it.
          Hide
          Gianmarco De Francisci Morales added a comment -

          I am trying to write some end-to-end tests, but I must admit it is not that easy.
          I am having some problems with my perl installation and I cannot run the e2e suite.
          Unfortunately I cannot change the configuration of this machine.
          However I will get a new laptop in 2 weeks.
          So I will put the patch on hold until I fix my setup.

          Show
          Gianmarco De Francisci Morales added a comment - I am trying to write some end-to-end tests, but I must admit it is not that easy. I am having some problems with my perl installation and I cannot run the e2e suite. Unfortunately I cannot change the configuration of this machine. However I will get a new laptop in 2 weeks. So I will put the patch on hold until I fix my setup.
          Hide
          Olga Natkovich added a comment -

          Hi Gianmarco,

          Are you still trying to finish this for 0.10?

          We are planning to branch early next week and we need all features to be in before we branch. Otherwise, we could delay it till the next release.

          Show
          Olga Natkovich added a comment - Hi Gianmarco, Are you still trying to finish this for 0.10? We are planning to branch early next week and we need all features to be in before we branch. Otherwise, we could delay it till the next release.
          Hide
          Gianmarco De Francisci Morales added a comment -

          Hi Olga,
          I will have my new setup on Monday.
          I will try to finish it by Tuesday.
          Writing the tests shouldn't be too complex.

          Show
          Gianmarco De Francisci Morales added a comment - Hi Olga, I will have my new setup on Monday. I will try to finish it by Tuesday. Writing the tests shouldn't be too complex.
          Hide
          Olga Natkovich added a comment -

          Great! Thanks for the update!

          Show
          Olga Natkovich added a comment - Great! Thanks for the update!
          Hide
          Gianmarco De Francisci Morales added a comment -

          Attaching PIG-1387.4.patch

          Added e2e tests in nightly.
          Tested the 3 different TOTUPLE, TOBAG, TOMAP plus the special case TOTUPLE with single element (e.g. ($0) ).

          Tests run fine in local mode.
          I would ask who reviews the patch to run them in cluster mode as I didn't have the time to set up one to test the patch on.

          Show
          Gianmarco De Francisci Morales added a comment - Attaching PIG-1387 .4.patch Added e2e tests in nightly. Tested the 3 different TOTUPLE, TOBAG, TOMAP plus the special case TOTUPLE with single element (e.g. ($0) ). Tests run fine in local mode. I would ask who reviews the patch to run them in cluster mode as I didn't have the time to set up one to test the patch on.
          Hide
          Gianmarco De Francisci Morales added a comment -

          Set up a cluster and tested the patch. It works fine.
          It is ready for review.

          Show
          Gianmarco De Francisci Morales added a comment - Set up a cluster and tested the patch. It works fine. It is ready for review.
          Hide
          Thejas M Nair added a comment -

          +1 (PIG-1387.4.patch)

          Show
          Thejas M Nair added a comment - +1 ( PIG-1387 .4.patch)
          Hide
          Gianmarco De Francisci Morales added a comment -

          Patch committed to trunk!

          Show
          Gianmarco De Francisci Morales added a comment - Patch committed to trunk!

            People

            • Assignee:
              Gianmarco De Francisci Morales
              Reporter:
              hc busy
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development