Pig / PIG-2597

Move grunt from javacc to ANTLR

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Currently, the parser for queries is in ANTLR, but Grunt is still javacc. The javacc parser is very difficult to work with, and next to impossible to understand or modify. ANTLR provides a much cleaner, more standard way to generate parsers, lexers, ASTs, etc., and moving Grunt from javacc to ANTLR would be a huge improvement as we continue to add features to Pig.

      This is a candidate project for Google Summer of Code 2014. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2014

      1. pig02.diff
        25 kB
        Boski Shah

        Issue Links

          Activity

          Daniel Dai added a comment -

          I glanced through Boski Shah's patch. One thing we need to avoid is introducing another driver (GruntDriver); we should merge all parsing logic into QueryParser.

          Kyungho Jeon made changes -
          Summary: changed from "Move grunt from javacc to ANTRL" to "Move grunt from javacc to ANTLR"
          Daniel Dai added a comment -

          Kyungho Jeon, besides "register" and "parameter substitution", which PIG-3359 addresses, grunt shell commands are not supported in Macro. Also, the implementation in PIG-3359 is more of a workaround. It would be better to refactor it to fully integrate the "register" command into QueryParser.

          Kyungho Jeon added a comment -

          Sorry that I haven't been very active. I think I am confused by the above comment by Daniel Dai regarding Macro. I found a test case in the workaround (PIG-3359). Would it be good enough for me to design based on that test case, or are there other cases that the test case does not cover?

          Daniel Dai made changes -
          Link This issue is depended upon by PIG-3772 [ PIG-3772 ]
          Daniel Dai made changes -
          Link This issue is depended upon by PIG-3910 [ PIG-3910 ]
          Daniel Dai made changes -
          Link This issue is depended upon by PIG-19 [ PIG-19 ]
          Daniel Dai added a comment -

          Kyungho Jeon, glad you have reviewed the patch already. Yes, it's not just moving GruntParser from javacc to ANTLR; it is also merging it into QueryParser.

          Kyungho Jeon added a comment -

          Daniel Dai, I reviewed the patch (PIG-3359). It seems that what needs to be done is to merge all language-processing parts into a single one and to implement it with ANTLR. I will revise my proposal accordingly, but it seems it's going to be a bigger project than I imagined from the title "Move grunt from javacc to antlr".

          Kyungho Jeon added a comment -

          Thank you Daniel Dai. I was not aware of the issue with Macro. I will look into the issue and update the proposal.

          Daniel Dai added a comment -

          Thanks Kyungho Jeon, the proposal looks good. Another motivation for moving from javacc to ANTLR is that we want to be able to use all commands in Macro. Some workarounds exist (PIG-3359). After this work, we need to clean up the workarounds and make sure all commands work in Macro.

          Kyungho Jeon added a comment -

          This is my initial draft:
          https://github.com/kyunghoj/GSoC2014

          I am not sure how much detail should be in the design section. Also, I realized that other components based on JavaCC (e.g., parameter substitution) might be bigger beasts. Should those components be included in this project?

          Daniel Dai added a comment -

          Kyungho Jeon, thanks for your interest. If I get multiple proposals, I will have to pick the best one unless another mentor other than me is also interested in mentoring this. You can post either your entire proposal or a link to your proposal on Jira, so I can preview it before you submit to melange.

          Kyungho Jeon added a comment -

          Daniel Dai, I am wondering what would happen if there are multiple applicants for the same project. I am working on a proposal draft, and I don't want to waste my effort. Could you review my proposal if I send it to you or post it here? Thanks!

          Daniel Dai added a comment -

          Vimuth Fernando, I should be available to mentor this.

          Daniel Dai made changes -
          Assignee Daniel Dai [ daijy ]
          Vimuth Fernando added a comment -

          Hi, I'm a third-year computer science student from the University of Moratuwa. I'm interested in taking this on as my GSoC project. I've already checked out the code, built it, and taken a quick look at the Grunt-related code. I'm currently following the tutorials provided on the GSoC getting started page.
          Are there any possible mentors for this project with whom I can discuss how to proceed further?

          Daniel Dai made changes -
          Description: GSoC reference changed from "Google summer of code 2013" (https://cwiki.apache.org/confluence/display/PIG/GSoc2013) to "Google summer of code 2014" (https://cwiki.apache.org/confluence/display/PIG/GSoc2014); the rest of the description is unchanged.
          Daniel Dai made changes -
          Labels: changed from gsoc2013 to gsoc2014
          Daniel Dai made changes -
          Description: GSoC reference changed from "Google summer of code 2012" (https://cwiki.apache.org/confluence/display/PIG/GSoc2012) to "Google summer of code 2013" (https://cwiki.apache.org/confluence/display/PIG/GSoc2013); the rest of the description is unchanged.
          Daniel Dai made changes -
          Labels: changed from GSoC2012 to gsoc2013
          Koji Noguchi added a comment -

          Jonathan, any update on this?

          I'm interested in this status as well.
          Does Boski have a plan to continue working on this?

          Russell Jurney added a comment -

          Jonathan, any update on this?

          Boski Shah made changes -
          Attachment pig02.diff [ 12535778 ]
          Boski Shah added a comment -

          I have modified the Pig Grunt code to use ANTLR for some Grunt commands (CAT, HELP and QUIT). I have attached the diff file for your review. Please find more details about the changes below.

          I have the basic code working, but I still consider it a first draft. I will be refining and cleaning up the code as I proceed. Before I do that, I want to make sure that I am heading in the right direction. Can you please take a look at the code and let me know if you see any issues with my approach?

          Approach:

          Enhanced existing grammar: Instead of creating a new grammar as I suggested earlier, I ended up modifying the existing grammars to add Grunt commands, i.e. I have modified Query{Lexer, Parser}.g, ASTValidator.g and LogicalPlanGenerator.g to support these commands. After trying various approaches, including a new grammar and an enhanced existing grammar with changes in PigServer to support Grunt commands, I think this is the cleanest approach. You had also suggested this as the preferred option.

          Deprecated GruntParser: I have deprecated GruntParser. To replace it, I have created a new class, 'GruntDriver'. Grunt.java now uses this new class instead.
          GruntDriver works in interactive as well as batch mode.
          The GruntDriver.process method is similar to what GruntParser.parseStopOnError() does.
          The process method first uses the grammar to parse the input stream (the parsing code is identical to QueryParserDriver) and creates the tree.
          The process method then traverses the tree: every time it comes across a Grunt command's node, it executes it immediately. For all Pig query nodes, GruntDriver delegates the work to PigServer by calling its registerQuery method.
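
          The traversal described above can be sketched roughly as follows. This is a minimal illustration and is not taken from pig02.diff: the Node class and the QuerySink interface are hypothetical stand-ins for the ANTLR tree nodes and for PigServer, and the commands only record what they would do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GruntDispatchSketch {
    /** Hypothetical stand-in for an AST node; the real patch walks ANTLR trees. */
    static final class Node {
        final String kind;  // e.g. "CAT", "HELP", "QUIT", or "QUERY"
        final String text;  // command argument or raw query text
        Node(String kind, String text) { this.kind = kind; this.text = text; }
    }

    /** Hypothetical stand-in for the one PigServer method the driver calls. */
    interface QuerySink {
        void registerQuery(String query);
    }

    /** Visits top-level nodes in order and returns a log of the actions taken. */
    static List<String> process(List<Node> statements, QuerySink server) {
        List<String> log = new ArrayList<>();
        for (Node n : statements) {
            switch (n.kind) {
                case "CAT":   // Grunt command: executed immediately
                    log.add("cat " + n.text);
                    break;
                case "HELP":
                    log.add("help");
                    break;
                case "QUIT":  // stop processing the remaining statements
                    log.add("quit");
                    return log;
                default:      // Pig query: delegated to the server
                    server.registerQuery(n.text);
                    log.add("registered");
            }
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> registered = new ArrayList<>();
        List<Node> program = Arrays.asList(
            new Node("QUERY", "a = load 'data';"),
            new Node("CAT", "data"),
            new Node("QUIT", ""),
            new Node("QUERY", "b = limit a 10;"));
        System.out.println(process(program, registered::add));
        System.out.println(registered);
    }
}
```

          Keeping the server behind a small interface is only a convenience for the sketch; it makes the dispatch loop easy to exercise without a running Pig instance.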

          Retain the original input text:
          One caveat I encountered was that PigServer.registerQuery expects the raw Pig query string as input, whereas after AST generation GruntDriver no longer has the raw input. I did consider modifying the PigServer code so that it can take the tree as input, but that change seemed far too intrusive. Also, since PigServer is a public interface, I do not feel comfortable with it having an API that takes an AST node.
          So, instead, I modified the grammar so that it retains the original input string as one of the children of every statement. For example, general_statement in QueryParser.g now has an additional child TEXT[$general_statement.text]. This child's value is then used by GruntDriver to pass the original input to PigServer.registerQuery.
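
          The idea of carrying the original text as an extra child can be illustrated with a small sketch. It does not reproduce the grammar change in the patch: the Node class, the "STATEMENT" and "LOAD" type names, and the helper methods are hypothetical, and only the "TEXT" child mirrors the name used in the comment above.

```java
import java.util.ArrayList;
import java.util.List;

public class RetainTextSketch {
    /** Hypothetical minimal tree node (the patch uses ANTLR tree nodes). */
    static final class Node {
        final String type;
        final String text;
        final List<Node> children = new ArrayList<>();
        Node(String type, String text) { this.type = type; this.text = text; }
    }

    /** Builds a statement node and appends a TEXT child holding its source span. */
    static Node statement(String source, int start, int end, Node... parsed) {
        Node stmt = new Node("STATEMENT", null);
        for (Node p : parsed) {
            stmt.children.add(p);
        }
        stmt.children.add(new Node("TEXT", source.substring(start, end)));
        return stmt;
    }

    /** Returns the original text of a statement, or null if it has no TEXT child. */
    static String originalText(Node stmt) {
        for (Node c : stmt.children) {
            if ("TEXT".equals(c.type)) {
                return c.text;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String script = "a = load 'data'; b = limit a 10;";
        int split = script.indexOf(';') + 1;  // end of the first statement
        Node first = statement(script, 0, split, new Node("LOAD", "'data'"));
        Node second = statement(script, split + 1, script.length(), new Node("LIMIT", "10"));
        // A driver would pass these strings to registerQuery.
        System.out.println(originalText(first));
        System.out.println(originalText(second));
    }
}
```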

          Open Items:

          Add all commands: I have added only some commands to GruntDriver so far, and I am working on adding many more. I expect many of them, such as cd and cp, to be trivial to add. Others, such as explain, run and exec, will require more work.

          Secondary Prompt: With this new implementation, the secondary prompt in interactive mode does not work. Existing Pig shows a different prompt (">>") if the statement provided through the Grunt shell is incomplete. With my changes, it reports an error saying that the input was invalid. I am not sure how critical it is to support such secondary prompts. I have a few ideas about how to support it, but I believe it would require a lot of effort and code changes in the grammar. So, before I start on that, I just want to understand how critical it is to retain that feature.
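
          One possible way to decide between the two prompts, outside the grammar, is to buffer input lines and check whether the buffer already ends a statement. The sketch below is only an assumption about how this could be done, not how GruntParser or the patch behaves: it treats a statement as complete when the buffer ends with a ';' that is not inside a single-quoted string, and it ignores comments and commands that do not need a terminator. The class and method names are hypothetical.

```java
public class SecondaryPromptSketch {
    static final String PRIMARY = "grunt> ";
    static final String SECONDARY = ">> ";

    /** True if the buffer ends (ignoring whitespace) with ';' outside a quoted string. */
    static boolean isComplete(String buffer) {
        boolean inQuote = false;
        boolean terminated = false;
        for (int i = 0; i < buffer.length(); i++) {
            char c = buffer.charAt(i);
            if (c == '\\' && inQuote) {
                i++;                      // skip the escaped character
            } else if (c == '\'') {
                inQuote = !inQuote;
                terminated = false;
            } else if (c == ';' && !inQuote) {
                terminated = true;
            } else if (!Character.isWhitespace(c)) {
                terminated = false;       // more input follows the last ';'
            }
        }
        return terminated && !inQuote;
    }

    /** Chooses the prompt to show before reading the next line. */
    static String promptFor(String buffer) {
        return buffer.trim().isEmpty() || isComplete(buffer) ? PRIMARY : SECONDARY;
    }

    public static void main(String[] args) {
        String[] lines = { "a = load 'data'", "  as (x:int);" };
        StringBuilder buffer = new StringBuilder();
        for (String line : lines) {
            System.out.println(promptFor(buffer.toString()) + line);
            buffer.append(line).append('\n');
        }
        System.out.println("complete: " + isComplete(buffer.toString()));
    }
}
```

          A check like this runs before the parser sees the input, so it would not require grammar changes, at the cost of duplicating a small part of the lexer's quoting rules.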

          Thejas M Nair made changes -
          Link This issue is duplicated by PIG-2079 [ PIG-2079 ]
          Gianmarco De Francisci Morales made changes -
          Link This issue is duplicated by PIG-2439 [ PIG-2439 ]
          Jonathan Coveney made changes -
          Field Original Value New Value
          Description: appended the paragraph "This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012"; the original paragraph is unchanged.
          Jonathan Coveney created issue -

            People

            • Assignee: Daniel Dai
            • Reporter: Jonathan Coveney
            • Votes: 1
            • Watchers: 8
