Pig
  1. Pig
  2. PIG-3359

Register Statements and Param Substitution in Macros

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: parser
    • Labels:
      None
    • Release Note:
      Register statements and parameter substitutions can be used inside macro now.

      Description

      There are some gaps in the functionality of macros that I've made a patch to address. The goal is to provide everything you'd need to make reusable algorithms libraries.

      1. You can't register udfs inside a macro
      2. Paramater substitutions aren't done inside macros
      3. Resources (including macros) should not be redundantly acquired if they are already present.

      Rohini's patch https://issues.apache.org/jira/browse/PIG-3204 should address problem 3 where Pig reparses everything every time it reads a line, but there still would be a problem if two separate files import the same macro / udf file.

      To get this working, I moved methods for registering jars/udfs and param substitution from PigServer to PigContext so they can be accessed in QueryParserDriver which processes macros (QPD was already passed a PigContext reference). Is that ok?

      1. PIG-3359_test.tar.gz
        253 kB
        Jonathan Packer
      2. PIG-3359-v1.diff
        35 kB
        Jonathan Packer
      3. PIG-3359-v2.diff
        34 kB
        Jonathan Packer
      4. PIG-3359-v3.diff
        61 kB
        Jonathan Packer
      5. PIG-3359-v3-test-failures.txt
        3 kB
        Cheolsoo Park
      6. PIG-3359-v4.diff
        57 kB
        Jonathan Packer
      7. PIG-3359-v5.diff
        58 kB
        Jonathan Packer
      8. PIG-3359-v6.diff
        63 kB
        Jonathan Packer
      9. PIG-3359-v7.diff
        68 kB
        Jonathan Packer

        Activity

        Hide
        Jonathan Packer added a comment -

        This is my first take at a patch that fixes the problems described above. I'll make a code review request on Review Board as well.

        Show
        Jonathan Packer added a comment - This is my first take at a patch that fixes the problems described above. I'll make a code review request on Review Board as well.
        Hide
        Jonathan Packer added a comment -

        A few notes:

        PIG-3359_test.tar.gz has a pigscript and other files which test registering udfs and parameter substitutions inside macros to demonstrate how this works.

        I haven't been able to upload my diff to ReviewBoard because I generated it with git and I think RB wants something svn-specific. Is there some way I can get git to generate something acceptable?

        I haven't written full unit-tests yet; I'd be happy to write them if people are ok with the code changes in general.

        Finally, there was a problem with unrelated classes FieldSchemaResetter and ExpressionUidResetter which were not declared in their own files, which seemed to break trunk. I moved the former into its own file since it was used by two other classes, and the latter into an inner class in UidResetter and that seemed to fix it. Let me know if anyone has a better idea of what these things are or where they're supposed to go.

        Show
        Jonathan Packer added a comment - A few notes: PIG-3359 _test.tar.gz has a pigscript and other files which test registering udfs and parameter substitutions inside macros to demonstrate how this works. I haven't been able to upload my diff to ReviewBoard because I generated it with git and I think RB wants something svn-specific. Is there some way I can get git to generate something acceptable? I haven't written full unit-tests yet; I'd be happy to write them if people are ok with the code changes in general. Finally, there was a problem with unrelated classes FieldSchemaResetter and ExpressionUidResetter which were not declared in their own files, which seemed to break trunk. I moved the former into its own file since it was used by two other classes, and the latter into an inner class in UidResetter and that seemed to fix it. Let me know if anyone has a better idea of what these things are or where they're supposed to go.
        Hide
        Cheolsoo Park added a comment -

        I haven't been able to upload my diff to ReviewBoard because I generated it with git and I think RB wants something svn-specific. Is there some way I can get git to generate something acceptable?

        You can upload git-generated diff to RB. You should specify pig-git as repository instead of pig. Please upload your patch to RB.

        Show
        Cheolsoo Park added a comment - I haven't been able to upload my diff to ReviewBoard because I generated it with git and I think RB wants something svn-specific. Is there some way I can get git to generate something acceptable? You can upload git-generated diff to RB. You should specify pig-git as repository instead of pig. Please upload your patch to RB.
        Hide
        Jonathan Packer added a comment -

        Hi, I uploaded the diff to RB: https://reviews.apache.org/r/12062/

        Show
        Jonathan Packer added a comment - Hi, I uploaded the diff to RB: https://reviews.apache.org/r/12062/
        Hide
        Cheolsoo Park added a comment - - edited

        Jonathan Packer, I have one question for you.

        I moved some methods from PigServer to PigContext, which QPD has an instance of.

        Why can't you construct a PigServer instance w/ PigContext inside QPD instead of moving those methods? That seems like a pattern being used at several places. For example,

        GroovyScriptEngine.java
            PigServer pigServer = new PigServer(context, false);
            String groovyJar = getJarPath(groovy.util.GroovyScriptEngine.class);
            if (null != groovyJar) {
                pigServer.registerJar(groovyJar);
            }
        

        To me, "pigServer.registerCode()" reads better than "pigContext.registerCode()", so I prefer not moving them to PigContext. What do you think? Please feel free to disagree with me.

        Show
        Cheolsoo Park added a comment - - edited Jonathan Packer , I have one question for you. I moved some methods from PigServer to PigContext, which QPD has an instance of. Why can't you construct a PigServer instance w/ PigContext inside QPD instead of moving those methods? That seems like a pattern being used at several places. For example, GroovyScriptEngine.java PigServer pigServer = new PigServer(context, false ); String groovyJar = getJarPath(groovy.util.GroovyScriptEngine.class); if ( null != groovyJar) { pigServer.registerJar(groovyJar); } To me, "pigServer.registerCode()" reads better than "pigContext.registerCode()", so I prefer not moving them to PigContext. What do you think? Please feel free to disagree with me.
        Hide
        Jonathan Packer added a comment -

        Hi, thanks for taking a look at this. I simply wasn't aware of that pattern--I've uploaded an updated patch with registerCode and registerJar moved back to PigServer. I think it's a bit weird that PigServer instantiates QPD which then instantiates a new PigServer, but I agree its better to be consistent with what's already there.

        doParamSubstitution(...) I kept moved into PigContext because the param lists are stored there. I noticed in some code paths PigContext wasn't being informed of which parameters were being used (ex. GruntParser has its own param substitution method for some reason?) so I wanted to make sure at least that each time doParamSubstitution was called it would update the params List object.

        Finally, the updated patch adds global param substitutions in macros. The use case is as follows: a macro file defines a java evalfunc udf with a constructor argument, which needs to be configurable by the main pigscript. In my case, this was for a K-Nearest-Neighbors udf; I wanted to be able to specify K.

        Thanks for bearing with me--this is my first time touching the main Pig codebase outside of Piggybank and I'm still trying to grasp the overall architecture.

        Show
        Jonathan Packer added a comment - Hi, thanks for taking a look at this. I simply wasn't aware of that pattern--I've uploaded an updated patch with registerCode and registerJar moved back to PigServer. I think it's a bit weird that PigServer instantiates QPD which then instantiates a new PigServer, but I agree its better to be consistent with what's already there. doParamSubstitution(...) I kept moved into PigContext because the param lists are stored there. I noticed in some code paths PigContext wasn't being informed of which parameters were being used (ex. GruntParser has its own param substitution method for some reason?) so I wanted to make sure at least that each time doParamSubstitution was called it would update the params List object. Finally, the updated patch adds global param substitutions in macros. The use case is as follows: a macro file defines a java evalfunc udf with a constructor argument, which needs to be configurable by the main pigscript. In my case, this was for a K-Nearest-Neighbors udf; I wanted to be able to specify K. Thanks for bearing with me--this is my first time touching the main Pig codebase outside of Piggybank and I'm still trying to grasp the overall architecture.
        Hide
        Johnny Zhang added a comment -

        Hi, Jonathan, few questions:

        1. how to use existing script to test issue 3 (Resources (including macros) should not be redundantly acquired if they are already present.) ? nested_macro.pig and macro_with_register.pig doesn't have macros with same name. Do you mean by import the same file twice instead?

        2. what do you mean by issue 2 (Paramater substitutions aren't done inside macros) will this change effect anything if macro and pig script are in the same file?

        3. the test you added duplicatedImportTest() should follow the negative test case format (expecting a exception and verify the error message is the same as expected), please see the example like

        TestMacroExpansion.java
        // macro doesn't contain return alias
            @Test
            public void negativeTest6() throws Throwable {
                String macro = "define group_and_count (A,C) returns B {\n" +
                    "    B = JOIN $A BY $B, $C BY user;\n" +
                    "};\n";
                
                String script = 
                    "alpha = load 'users' as (user, age, zip);\n" +
                    "beta = load 'links' as (user, link, view);\n" +
                    "gamma = group_and_count (alpha,beta);\n" +
                    "store gamma into 'byuser';\n";
                
                String expectedErr = "Reason: Macro 'group_and_count' missing return alias: B";
                
                validateFailure(macro + script, expectedErr);
            }
        
        Show
        Johnny Zhang added a comment - Hi, Jonathan, few questions: 1. how to use existing script to test issue 3 (Resources (including macros) should not be redundantly acquired if they are already present.) ? nested_macro.pig and macro_with_register.pig doesn't have macros with same name. Do you mean by import the same file twice instead? 2. what do you mean by issue 2 (Paramater substitutions aren't done inside macros) will this change effect anything if macro and pig script are in the same file? 3. the test you added duplicatedImportTest() should follow the negative test case format (expecting a exception and verify the error message is the same as expected), please see the example like TestMacroExpansion.java // macro doesn't contain return alias @Test public void negativeTest6() throws Throwable { String macro = "define group_and_count (A,C) returns B {\n" + " B = JOIN $A BY $B, $C BY user;\n" + "};\n" ; String script = "alpha = load 'users' as (user, age, zip);\n" + "beta = load 'links' as (user, link, view);\n" + "gamma = group_and_count (alpha,beta);\n" + "store gamma into 'byuser';\n" ; String expectedErr = "Reason: Macro 'group_and_count' missing return alias: B" ; validateFailure(macro + script, expectedErr); }
        Hide
        Cheolsoo Park added a comment -

        Jonathan Packer, thank you for updating the patch.

        I think it's a bit weird that PigServer instantiates QPD which then instantiates a new PigServer, but I agree its better to be consistent with what's already there.

        I have thought about this more. What if you don't call registerJars and registerCode in QPD at all? Instead can't you just pass filenames to PigServer via PigContext? Then, after QPD finishes parsing, PigServer can call these methods to register them by itself. Currently, jars are only registered in the PigServer constructor (i.e. addJarsFromProperties()), but we can register additional jars after parsing is done. Similarly, we can register scripts after parsing is done.

        The reasons why I am obsessed with de-coupling these methods from QPD is because I think it's good to keep a clear distinction between parser and server. Sorry if I am asking too much. Thanks!

        Show
        Cheolsoo Park added a comment - Jonathan Packer , thank you for updating the patch. I think it's a bit weird that PigServer instantiates QPD which then instantiates a new PigServer, but I agree its better to be consistent with what's already there. I have thought about this more. What if you don't call registerJars and registerCode in QPD at all? Instead can't you just pass filenames to PigServer via PigContext? Then, after QPD finishes parsing, PigServer can call these methods to register them by itself. Currently, jars are only registered in the PigServer constructor (i.e. addJarsFromProperties()), but we can register additional jars after parsing is done. Similarly, we can register scripts after parsing is done. The reasons why I am obsessed with de-coupling these methods from QPD is because I think it's good to keep a clear distinction between parser and server. Sorry if I am asking too much. Thanks!
        Hide
        Cheolsoo Park added a comment -

        I also made some comments in RB:
        https://reviews.apache.org/r/12062/

        I am really looking forward to these new features. That said, I am still struggling with thinking through the best way to implement them. As you know, parameter substitution code is not well organized, and I hope that we don't make it worse. We should seriously consider refactoring it if needed. Please let me know what you think.

        Show
        Cheolsoo Park added a comment - I also made some comments in RB: https://reviews.apache.org/r/12062/ I am really looking forward to these new features. That said, I am still struggling with thinking through the best way to implement them. As you know, parameter substitution code is not well organized, and I hope that we don't make it worse. We should seriously consider refactoring it if needed. Please let me know what you think.
        Hide
        Jonathan Packer added a comment -

        Updated RB: https://reviews.apache.org/r/12062/

        Comments below

        Show
        Jonathan Packer added a comment - Updated RB: https://reviews.apache.org/r/12062/ Comments below
        Hide
        Jonathan Packer added a comment -

        Cheolsoo, I updated patch to address your comments in RB. Johnny, my replies to your questions are at the bottom of this comment.

        Cheolsoo,

        There was a comment where you noted that ParamLoader was being called repeatedly. To address this, and a lot of other messiness with param substitution, I did a refactoring so all methods that substitute parameters call PigContext's doParamSubstitution(), which instantiates and manages ParamSubstitutionPreprocessor. This ensures consistent behavior between the param substitution called in Main, GruntParser, and several other places, and solves the ParamLoader problem. Managed by PigContext, the loading of parameters is separated from substitution itself, is not repeated when substitution is done many times (i.e. for macros).

        As for registerJars and registerCode, I got stuck. I understand your concern about separating parser and server. But QPD returns a LogicalPlan, and I'm pretty sure that the LP generation needs the UDF output schemas to be defined, which requires the UDFs to be registered. I think this could be fixed by splitting the AST generation from the LP generation and allowing PigServer to register resources requested by macros in between the two. But that seems like a big change and I don't want the scope of this patch to expand too much.

        What do you think?

        Johnny,

        To answer your questions:
        1) The query parser seems to be reparsing everything every time it reads a new line of Pig code. So the same macros are parsed over and over. This patch caches the AST's from the first parsing, so when a new line of the Pig script is parsed, the macro is not re-parsed. Also in terms of actually fetching the files, the import sequence goes "test.pig" imports "macros_1.pig" imports "macros_2.pig"; when "test.pig" imports "macros_2.pig" in the next line, it has already been acquired, so it is not loaded. Same idea for udfs, jars. I'm not sure how best to test this within Pig without a mocking framework to intercept when QPD parses a macro and ensure it is not doing it repeatedly. Any ideas?

        2) It should not, since param substitution in the main pigscript happens before macro expansion. This is just to propagate the parameters to imported macro files.

        3) duplicatedImportTest is not supposed to fail; the duplicated import should just be ignored. This is silly within one pigscript, but the it makes sense in the test.pig/macros_1.pig/macros_2.pig example I described above, since both test and macros_1 should be able to import macros_2.

        Show
        Jonathan Packer added a comment - Cheolsoo, I updated patch to address your comments in RB. Johnny, my replies to your questions are at the bottom of this comment. Cheolsoo, There was a comment where you noted that ParamLoader was being called repeatedly. To address this, and a lot of other messiness with param substitution, I did a refactoring so all methods that substitute parameters call PigContext's doParamSubstitution(), which instantiates and manages ParamSubstitutionPreprocessor. This ensures consistent behavior between the param substitution called in Main, GruntParser, and several other places, and solves the ParamLoader problem. Managed by PigContext, the loading of parameters is separated from substitution itself, is not repeated when substitution is done many times (i.e. for macros). As for registerJars and registerCode, I got stuck. I understand your concern about separating parser and server. But QPD returns a LogicalPlan, and I'm pretty sure that the LP generation needs the UDF output schemas to be defined, which requires the UDFs to be registered. I think this could be fixed by splitting the AST generation from the LP generation and allowing PigServer to register resources requested by macros in between the two. But that seems like a big change and I don't want the scope of this patch to expand too much. What do you think? Johnny, To answer your questions: 1) The query parser seems to be reparsing everything every time it reads a new line of Pig code. So the same macros are parsed over and over. This patch caches the AST's from the first parsing, so when a new line of the Pig script is parsed, the macro is not re-parsed. Also in terms of actually fetching the files, the import sequence goes "test.pig" imports "macros_1.pig" imports "macros_2.pig"; when "test.pig" imports "macros_2.pig" in the next line, it has already been acquired, so it is not loaded. Same idea for udfs, jars. I'm not sure how best to test this within Pig without a mocking framework to intercept when QPD parses a macro and ensure it is not doing it repeatedly. Any ideas? 2) It should not, since param substitution in the main pigscript happens before macro expansion. This is just to propagate the parameters to imported macro files. 3) duplicatedImportTest is not supposed to fail; the duplicated import should just be ignored. This is silly within one pigscript, but the it makes sense in the test.pig/macros_1.pig/macros_2.pig example I described above, since both test and macros_1 should be able to import macros_2.
        Hide
        Cheolsoo Park added a comment -

        Hi Jonathan, thank you very much for the update and detailed explanation. I am reviewing your patch now.

        Btw, your patch doesn't compile. I fixed the errors by myself and will continue reviewing it. In the meantime, do you mind uploading a new patch?

        Show
        Cheolsoo Park added a comment - Hi Jonathan, thank you very much for the update and detailed explanation. I am reviewing your patch now. Btw, your patch doesn't compile. I fixed the errors by myself and will continue reviewing it. In the meantime, do you mind uploading a new patch?
        Hide
        Cheolsoo Park added a comment -

        I love what you did for parameter substitution code. Thank you! As for registerJars and registerCode, you're totally right. Let's leave it as is. I made some additional comments in RB.

        I am running full unit tests and seeing several failures (not surprising since you're making quite a few changes). I will post the full list of test failures once completed.

        Show
        Cheolsoo Park added a comment - I love what you did for parameter substitution code. Thank you! As for registerJars and registerCode, you're totally right. Let's leave it as is. I made some additional comments in RB. I am running full unit tests and seeing several failures (not surprising since you're making quite a few changes). I will post the full list of test failures once completed.
        Hide
        Cheolsoo Park added a comment -

        I see 51 test failures and am attaching the full list of test case.

        I ran the unit tests with this repo:
        https://github.com/piaozhexiu/apache-pig/tree/PIG-3359

        Thanks!

        Show
        Cheolsoo Park added a comment - I see 51 test failures and am attaching the full list of test case. I ran the unit tests with this repo: https://github.com/piaozhexiu/apache-pig/tree/PIG-3359 Thanks!
        Hide
        Jonathan Packer added a comment -

        Latest patch should fix the unit-tests failures and address latest comments on RB.

        Show
        Jonathan Packer added a comment - Latest patch should fix the unit-tests failures and address latest comments on RB.
        Hide
        Cheolsoo Park added a comment -

        Great! I just confirmed all unit tests pass now. Thank you!

        I have one last suggestion. Can you make this changes?
        https://github.com/piaozhexiu/apache-pig/commit/9e7c9ec9b70fa902ad4adc1dcf337762a8a14d3a

        While doing more testing, I discovered few more corner cases (see my comments on RB) that break your implementation. In addition, I realized that there is no need to re-implement the doParamSubstitutionForMacroFile() method after all. So I made the above changes to your patch, and your test scripts still work.

        Thanks!

        Show
        Cheolsoo Park added a comment - Great! I just confirmed all unit tests pass now. Thank you! I have one last suggestion. Can you make this changes? https://github.com/piaozhexiu/apache-pig/commit/9e7c9ec9b70fa902ad4adc1dcf337762a8a14d3a While doing more testing, I discovered few more corner cases (see my comments on RB) that break your implementation. In addition, I realized that there is no need to re-implement the doParamSubstitutionForMacroFile() method after all. So I made the above changes to your patch, and your test scripts still work. Thanks!
        Hide
        Jonathan Packer added a comment -

        Latest patch should fix corner cases from RB. I kept doParamSubstitutionForMacros because the regular parser from PigScriptParser.jj did not work for the following macro file:

        /*
        %default BINARY_FUNC 'divide'
        */
        /* a comment */ %default BINARY_FUNC 'multiply'

        REGISTER 'udfs.py' USING jython AS udfs;

        DEFINE ApplyBinaryFunc(data)
        RETURNS funced

        { $funced = FOREACH $data GENERATE udfs.$BINARY_FUNC($0, $1); }

        ;

        DEFINE MacroThatAppliesCross(data1, data2)
        RETURNS crossed

        { $crossed = CROSS $data1, $data2; }

        ;

        It does not substitute $BINARY_FUNC inside the macro. I don't know how the .jj files work but it looks complicated, and I think my latest regex solution, while admittedly hacky, should handle 95%+ of situations (including these test macros).

        Show
        Jonathan Packer added a comment - Latest patch should fix corner cases from RB. I kept doParamSubstitutionForMacros because the regular parser from PigScriptParser.jj did not work for the following macro file: /* %default BINARY_FUNC 'divide' */ /* a comment */ %default BINARY_FUNC 'multiply' REGISTER 'udfs.py' USING jython AS udfs; DEFINE ApplyBinaryFunc(data) RETURNS funced { $funced = FOREACH $data GENERATE udfs.$BINARY_FUNC($0, $1); } ; DEFINE MacroThatAppliesCross(data1, data2) RETURNS crossed { $crossed = CROSS $data1, $data2; } ; It does not substitute $BINARY_FUNC inside the macro. I don't know how the .jj files work but it looks complicated, and I think my latest regex solution, while admittedly hacky, should handle 95%+ of situations (including these test macros).
        Hide
        Cheolsoo Park added a comment -

        Hey Jonathan, thanks for the update!

        I think we need a test suite that covers all the new functionalities that you're adding. Without it, you and I can't be on the same page. Do you mind adding either e2e test cases or unit test cases?

        It does not substitute $BINARY_FUNC inside the macro. I don't know how the .jj files work but it looks complicated, and I think my latest regex solution, while admittedly hacky, should handle 95%+ of situations (including these test macros).

        I can reproduce the issue, and I totally agree that we should get it working. Can we think about this more?

        Show
        Cheolsoo Park added a comment - Hey Jonathan, thanks for the update! I think we need a test suite that covers all the new functionalities that you're adding. Without it, you and I can't be on the same page. Do you mind adding either e2e test cases or unit test cases? It does not substitute $BINARY_FUNC inside the macro. I don't know how the .jj files work but it looks complicated, and I think my latest regex solution, while admittedly hacky, should handle 95%+ of situations (including these test macros). I can reproduce the issue, and I totally agree that we should get it working. Can we think about this more?
        Hide
        Jonathan Packer added a comment -

        Sure thing. I don't think I'll be able to get to this until the weekend, but I'll write some unit tests and see if I can get the .jj parser working with macros.

        Show
        Jonathan Packer added a comment - Sure thing. I don't think I'll be able to get to this until the weekend, but I'll write some unit tests and see if I can get the .jj parser working with macros.
        Hide
        Jonathan Packer added a comment -

        Sorry for the delay! I updated the patch and replaced the regex hack with a small tweak to PigMacro.java that got it working with the existing parser (so I didn't have to touch that thankfully). I also added 6 unit-tests for the issues this JIRA is supposed to address.

        Show
        Jonathan Packer added a comment - Sorry for the delay! I updated the patch and replaced the regex hack with a small tweak to PigMacro.java that got it working with the existing parser (so I didn't have to touch that thankfully). I also added 6 unit-tests for the issues this JIRA is supposed to address.
        Hide
        Cheolsoo Park added a comment -

        Looks great! Let me run the unit tests today. I am also going to test with some of production scripts. Thank you so much!

        Show
        Cheolsoo Park added a comment - Looks great! Let me run the unit tests today. I am also going to test with some of production scripts. Thank you so much!
        Hide
        Cheolsoo Park added a comment -

        All the unit tests pass. I also tested with some of production scripts and verified that "pig -dryrun" generates the same output. Awesome!

        1. The only difference that I see is that a lot more warnings are printed when there are many macro files. For example,

        2013-08-02 19:05:46,158 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for GCI_SOURCE_NETFLIX_SEASONS. Using value 1
        2013-08-02 19:05:46,158 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for GCI_ATTRIBUTE_ORIGINAL_COUNTRY. Using value 1
        2013-08-02 19:05:46,158 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for GCI_ATTRIBUTE_SEASON_SEQUENCE_NBR. Using value 5
        2013-08-02 19:05:46,158 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for GCI_ATTRIBUTE_RELEASE_YEAR. Using value 6
        2013-08-02 19:05:46,158 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for GCI_ATTRIBUTE_EPISODE_COUNT. Using value 8
        2013-08-02 19:05:46,158 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for GCI_ATTRIBUTE_TITLE_TYPE. Using value 20
        

        This makes sense because you load params every time when importing a macro file. Can you please lower the log level to debug for these messages? This may unnecessarily scare the user.

        2. Can you update the doc? I think you can simply remove the following lines from the macro section:

        - Macros can only contain Pig Latin statements. The REGISTER statement is not supported. The shell commands (used with Grunt) are not supported.
        - Parameter substitution cannot be used inside of macros. Parameters should be explicitly passed to macros and parameter substitution used only at the top level.
        

        You can add your examples too, but I will let you decide on that.

        I will commit this as soon as you update your patch. Thanks a lot!

        Show
        Cheolsoo Park added a comment - All the unit tests pass. I also tested with some of production scripts and verified that "pig -dryrun" generates the same output. Awesome! 1. The only difference that I see is that a lot more warnings are printed when there are many macro files. For example, 2013-08-02 19:05:46,158 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for GCI_SOURCE_NETFLIX_SEASONS. Using value 1 2013-08-02 19:05:46,158 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for GCI_ATTRIBUTE_ORIGINAL_COUNTRY. Using value 1 2013-08-02 19:05:46,158 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for GCI_ATTRIBUTE_SEASON_SEQUENCE_NBR. Using value 5 2013-08-02 19:05:46,158 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for GCI_ATTRIBUTE_RELEASE_YEAR. Using value 6 2013-08-02 19:05:46,158 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for GCI_ATTRIBUTE_EPISODE_COUNT. Using value 8 2013-08-02 19:05:46,158 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for GCI_ATTRIBUTE_TITLE_TYPE. Using value 20 This makes sense because you load params every time when importing a macro file. Can you please lower the log level to debug for these messages? This may unnecessarily scare the user. 2. Can you update the doc? I think you can simply remove the following lines from the macro section : - Macros can only contain Pig Latin statements. The REGISTER statement is not supported. The shell commands (used with Grunt) are not supported. - Parameter substitution cannot be used inside of macros. Parameters should be explicitly passed to macros and parameter substitution used only at the top level. You can add your examples too, but I will let you decide on that. I will commit this as soon as you update your patch. Thanks a lot!
        Hide
        Cheolsoo Park added a comment -

        On a second thought, I think we should change the condition for the warnings as follows:

        -        if (param_val.containsKey(key)) {
        +        if (param_val.containsKey(key) && !param_val.get(key).equals(val)) {
                     if (overwrite) {
                         log.warn("Warning : Multiple values found for " + key + ". Using value " + val);
                     } else {
        

        We shouldn't lower the log level. Instead, we should print it only if there are multiple values for the same key. Do you agree?

        Show
        Cheolsoo Park added a comment - On a second thought, I think we should change the condition for the warnings as follows: - if (param_val.containsKey(key)) { + if (param_val.containsKey(key) && !param_val.get(key).equals(val)) { if (overwrite) { log.warn( "Warning : Multiple values found for " + key + ". Using value " + val); } else { We shouldn't lower the log level. Instead, we should print it only if there are multiple values for the same key. Do you agree?
        Hide
        Jonathan Packer added a comment -

        Updated the patch.

        There's a slight complication with the handling of the warnings in that "val" passed to those functions in PreprocessorContext is the raw string, which might not be the same as the val put into param_val. I added a separate hashtable "param_source" to keep track of this original argument, and if it's seen again to skip it instead of logging the warning.

        Also updated the docs.

        Show
        Jonathan Packer added a comment - Updated the patch. There's a slight complication with the handling of the warnings in that "val" passed to those functions in PreprocessorContext is the raw string, which might not be the same as the val put into param_val. I added a separate hashtable "param_source" to keep track of this original argument, and if it's seen again to skip it instead of logging the warning. Also updated the docs.
        Hide
        Cheolsoo Park added a comment -

        +1. I will commit it.

        Show
        Cheolsoo Park added a comment - +1. I will commit it.
        Hide
        Cheolsoo Park added a comment -

        Thank you Jonathan for the excellent work! Look forward to many more!

        Show
        Cheolsoo Park added a comment - Thank you Jonathan for the excellent work! Look forward to many more!
        Hide
        Cheolsoo Park added a comment -
        Show
        Cheolsoo Park added a comment - Committed to trunk: http://svn.apache.org/viewvc?view=revision&revision=1510858

          People

          • Assignee:
            Jonathan Packer
            Reporter:
            Jonathan Packer
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development