Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels:
      None

      Description

      This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.

      Example:
      =======
      Pig script myscript.pig:

      A = load '/data/mydata/%date%';
      B = filter A by $0>'5';
      .....

      Pig command line:

      pig -param date='20080110' myscript.pig

      Proposed interface and implementation:

      Interface:
      =======
      (0) Substitution will be only supported with pig script files.
      (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
      (2) Default values for the parameters can be specified within the script via decare statement:
      decare <param>=<value>
      (3) Withint the script the parameter will be enclosed in %%. % can be used te escape.

      Implementation:
      ============

      Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:

      • create a new file in temp location
      • build a hash of parameters from command line and declare statement
      • for each line in the original script
        if this is a declare line, skip it
        else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found. Write the line to the temp file.
      • pass the temp file to grunt.
      1. PIG-58_v1.patch
        96 kB
        Olga Natkovich
      2. PIG-58_v2
        97 kB
        Olga Natkovich
      3. PIG-58_v3.patch
        103 kB
        Olga Natkovich

        Activity

        Hide
        Alan Gates added a comment -

        The code as checked in included an import that was java 1.6 specific. The import was spurious, so I've checked in a fix that resolved it (revision 646546). But if you see the following error:

        [javac] Compiling 229 source files to /home/gates/src/pig/PIG-111/trunk/build/classes
        [javac] /home/gates/src/pig/PIG-111/trunk/src/org/apache/pig/tools/parameters/ParameterSubstitutionPreprocessor.java:31: cannot find symbol
        [javac] symbol : class Console
        [javac] location: package java.io
        [javac] import java.io.Console;
        [javac] ^
        [javac] Note: Some input files use or override a deprecated API.
        [javac] Note: Recompile with -Xlint:deprecation for details.
        [javac] Note: Some input files use unchecked or unsafe operations.
        [javac] Note: Recompile with -Xlint:unchecked for details.
        [javac] 1 error

        The solution is to either update to the listed revision or remove the import for java.io.Console from src/org/apache/pig/tools/parameters/ParameterSubstitutionPreprocessor.jav

        Show
        Alan Gates added a comment - The code as checked in included an import that was java 1.6 specific. The import was spurious, so I've checked in a fix that resolved it (revision 646546). But if you see the following error: [javac] Compiling 229 source files to /home/gates/src/pig/ PIG-111 /trunk/build/classes [javac] /home/gates/src/pig/ PIG-111 /trunk/src/org/apache/pig/tools/parameters/ParameterSubstitutionPreprocessor.java:31: cannot find symbol [javac] symbol : class Console [javac] location: package java.io [javac] import java.io.Console; [javac] ^ [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. [javac] 1 error The solution is to either update to the listed revision or remove the import for java.io.Console from src/org/apache/pig/tools/parameters/ParameterSubstitutionPreprocessor.jav
        Hide
        Olga Natkovich added a comment -

        patch committed. Thanks Pi and Alan for reviews

        Show
        Olga Natkovich added a comment - patch committed. Thanks Pi and Alan for reviews
        Hide
        Olga Natkovich added a comment -

        I made all the changes requested in the feedback by Alan and Pi. Please, take another look. Unless I hear any objections, I will commit the patch at noon PST tomorrow.

        Show
        Olga Natkovich added a comment - I made all the changes requested in the feedback by Alan and Pi. Please, take another look. Unless I hear any objections, I will commit the patch at noon PST tomorrow.
        Hide
        Olga Natkovich added a comment -

        In reply to last comments from Pi:

        >>I am not sure but seems like all parser we have do that.I think others throw ParseException.

        I checked - they throw both.

        >>Here is comment on static structure:- Having getHash() and setHash() in both PigFileParser and ParamLoader pointing to UtilFunctions.param_val is a kind of weird. If you try to substitute vars multiple times in the same JVM, this can lead to side effects (I don't know if such scenarios exist). I think for the sake of clarity and reusability, UtilFunctions.param_val should be separated as non-static member in a new ParamSubstituteContext class. PigFileParser and ParamLoader then accept the context as constructor parameter.

        Ok, I will look into making UtilFunctions a class without static and just pass it around.

        Show
        Olga Natkovich added a comment - In reply to last comments from Pi: >>I am not sure but seems like all parser we have do that.I think others throw ParseException. I checked - they throw both. >>Here is comment on static structure:- Having getHash() and setHash() in both PigFileParser and ParamLoader pointing to UtilFunctions.param_val is a kind of weird. If you try to substitute vars multiple times in the same JVM, this can lead to side effects (I don't know if such scenarios exist). I think for the sake of clarity and reusability, UtilFunctions.param_val should be separated as non-static member in a new ParamSubstituteContext class. PigFileParser and ParamLoader then accept the context as constructor parameter. Ok, I will look into making UtilFunctions a class without static and just pass it around.
        Hide
        Alan Gates added a comment -

        [Alan before] 1) Dryrun isn't a good name for the command line option. It's far to generic. Knowing nothing about parameter substitution and looking at the usage statement of pig I'd assume that dryrun meant that pig was going to parse my query but not run it. I would suggest a name like preproc or preproconly

        [Olga] I was hoping that we would extend dryrun over time to include things that you were talking about. I did not mean it to stay parameter substitution specific. What do you think?

        Seems fine. We should put in a comment articulating that in the command line option.

        [Alan before] 2) Why did we choose to put all of the data for the unit tests in separate files in a test/data directory? To date the approach has been to have the
        tests themselves generate the data they need on the fly. There are pros and cons to switching, but I think we should discuss and have a policy of how unit tests handle their data before we start adding a directory with a lot of files in it.

        [Olga] This is how the test cases were created when they were given to me. SInce I did not have strong oppinion and did not have time to redo them I left them as is. I am not really sure that we need a policy on this. What are your concerns with different tests doing what makes most sense for the kind of things they are testing? Let me know if you think it is ok to leave it as is or if you think that it needs to be reworked before committing the changes.

        I guess I like having one way to do things instead of 3 (does that mean I'm more of python programmer than perl programmer?). But it isn't worth delaying getting this in for.

        [Alan before] 5) PigFileParser.jj doesn't skip over commented lines in the pig code. It should ignore anything on a line after -

        [Olga] The logic is to not change any lines including comments that don't have parameters to be substituted. Grun parser will skip the comments.

        I was referring to that fact that it will do substitutions in comments. If we had comments that could be terminated, this could lead to strange behavior (consider C
        style /* comment */ comments and then defining a parameter that contained */). But our comments are currently to line end only (--) so it's should be fine.

        Show
        Alan Gates added a comment - [Alan before] 1) Dryrun isn't a good name for the command line option. It's far to generic. Knowing nothing about parameter substitution and looking at the usage statement of pig I'd assume that dryrun meant that pig was going to parse my query but not run it. I would suggest a name like preproc or preproconly [Olga] I was hoping that we would extend dryrun over time to include things that you were talking about. I did not mean it to stay parameter substitution specific. What do you think? Seems fine. We should put in a comment articulating that in the command line option. [Alan before] 2) Why did we choose to put all of the data for the unit tests in separate files in a test/data directory? To date the approach has been to have the tests themselves generate the data they need on the fly. There are pros and cons to switching, but I think we should discuss and have a policy of how unit tests handle their data before we start adding a directory with a lot of files in it. [Olga] This is how the test cases were created when they were given to me. SInce I did not have strong oppinion and did not have time to redo them I left them as is. I am not really sure that we need a policy on this. What are your concerns with different tests doing what makes most sense for the kind of things they are testing? Let me know if you think it is ok to leave it as is or if you think that it needs to be reworked before committing the changes. I guess I like having one way to do things instead of 3 (does that mean I'm more of python programmer than perl programmer?). But it isn't worth delaying getting this in for. [Alan before] 5) PigFileParser.jj doesn't skip over commented lines in the pig code. It should ignore anything on a line after - [Olga] The logic is to not change any lines including comments that don't have parameters to be substituted. Grun parser will skip the comments. I was referring to that fact that it will do substitutions in comments. If we had comments that could be terminated, this could lead to strange behavior (consider C style /* comment */ comments and then defining a parameter that contained */). But our comments are currently to line end only (--) so it's should be fine.
        Hide
        Olga Natkovich added a comment -

        Alan, thanks for your feedback. My comments below:

        >>> 1) Dryrun isn't a good name for the command line option. It's far to generic. Knowing nothing about parameter substitution and looking at the usage statement of pig I'd assume that dryrun meant that pig was going to parse my query but not run it. I would suggest a name like preproc or preproconly

        I was hoping that we would extend dryrun over time to include things that you were talking about. I did not mean it to stay parameter substitution specific. What do you think?

        >>> 2) Why did we choose to put all of the data for the unit tests in separate files in a test/data directory? To date the approach has been to have the
        tests themselves generate the data they need on the fly. There are pros and cons to switching, but I think we should discuss and have a policy of how unit tests handle their data before we start adding a directory with a lot of files in it.

        This is how the test cases were created when they were given to me. SInce I did not have strong oppinion and did not have time to redo them I left them as is. I am not really sure that we need a policy on this. What are your concerns with different tests doing what makes most sense for the kind of things they are testing? Let me know if you think it is ok to leave it as is or if you think that it needs to be reworked before committing the changes.

        >>> 3) A number of the public functions do not have javadoc comments.

        Will fix that

        >>> 4) In general more comments throughout the code on what it is doing would be helpful. For example, in UtilFunctions.substitute, it is totally non-obvious what the line "replaced_line = replaced_line.replaceAll("\\\\\\$","$");

        Will update that

        >>> 5) PigFileParser.jj doesn't skip over commented lines in the pig code. It should ignore anything on a line after -

        The logic is to not change any lines including comments that don't have parameters to be substituted. Grun parser will skip the comments.

        Show
        Olga Natkovich added a comment - Alan, thanks for your feedback. My comments below: >>> 1) Dryrun isn't a good name for the command line option. It's far to generic. Knowing nothing about parameter substitution and looking at the usage statement of pig I'd assume that dryrun meant that pig was going to parse my query but not run it. I would suggest a name like preproc or preproconly I was hoping that we would extend dryrun over time to include things that you were talking about. I did not mean it to stay parameter substitution specific. What do you think? >>> 2) Why did we choose to put all of the data for the unit tests in separate files in a test/data directory? To date the approach has been to have the tests themselves generate the data they need on the fly. There are pros and cons to switching, but I think we should discuss and have a policy of how unit tests handle their data before we start adding a directory with a lot of files in it. This is how the test cases were created when they were given to me. SInce I did not have strong oppinion and did not have time to redo them I left them as is. I am not really sure that we need a policy on this. What are your concerns with different tests doing what makes most sense for the kind of things they are testing? Let me know if you think it is ok to leave it as is or if you think that it needs to be reworked before committing the changes. >>> 3) A number of the public functions do not have javadoc comments. Will fix that >>> 4) In general more comments throughout the code on what it is doing would be helpful. For example, in UtilFunctions.substitute, it is totally non-obvious what the line "replaced_line = replaced_line.replaceAll("\\\\\\$","$"); Will update that >>> 5) PigFileParser.jj doesn't skip over commented lines in the pig code. It should ignore anything on a line after - The logic is to not change any lines including comments that don't have parameters to be substituted. Grun parser will skip the comments.
        Hide
        Alan Gates added a comment -

        Comments:

        1) Dryrun isn't a good name for the command line option. It's far to generic.
        Knowing nothing about parameter substitution and looking at the usage
        statement of pig I'd assume that dryrun meant that pig was going to parse my
        query but not run it. I would suggest a name like preproc or preproconly.

        2) Why did we choose to put all of the data for the unit tests in separate
        files in a test/data directory? To date the approach has been to have the
        tests themselves generate the data they need on the fly. There are pros and
        cons to switching, but I think we should discuss and have a policy of how unit
        tests handle their data before we start adding a directory with a lot of files
        in it.

        3) A number of the public functions do not have javadoc comments.

        4) In general more comments throughout the code on what it is doing would be
        helpful. For example, in UtilFunctions.substitute, it is totally non-obvious
        what the line "replaced_line = replaced_line.replaceAll("\\\\\\$","
        $");

        5) PigFileParser.jj doesn't skip over commented lines in the pig code. It should ignore anything on a line after –

        Show
        Alan Gates added a comment - Comments: 1) Dryrun isn't a good name for the command line option. It's far to generic. Knowing nothing about parameter substitution and looking at the usage statement of pig I'd assume that dryrun meant that pig was going to parse my query but not run it. I would suggest a name like preproc or preproconly. 2) Why did we choose to put all of the data for the unit tests in separate files in a test/data directory? To date the approach has been to have the tests themselves generate the data they need on the fly. There are pros and cons to switching, but I think we should discuss and have a policy of how unit tests handle their data before we start adding a directory with a lot of files in it. 3) A number of the public functions do not have javadoc comments. 4) In general more comments throughout the code on what it is doing would be helpful. For example, in UtilFunctions.substitute, it is totally non-obvious what the line "replaced_line = replaced_line.replaceAll("\\\\\\$"," $"); 5) PigFileParser.jj doesn't skip over commented lines in the pig code. It should ignore anything on a line after –
        Hide
        Pi Song added a comment -

        >>>> 5. Why ParamLoader.Parse() throw IOException?
        >>I am not sure but seems like all parser we have do that.
        I think others throw ParseException.
        >>>> 1. I prefer HashMap to HashTable
        Hashtable just gives me a feeling of old-fashioned programmers

        Here is comment on static structure:-

        Having getHash() and setHash() in both PigFileParser and ParamLoader pointing to UtilFunctions.param_val is a kind of weird. If you try to substitute vars multiple times in the same JVM, this can lead to side effects (I don't know if such scenarios exist).
        I think for the sake of clarity and reusability, UtilFunctions.param_val should be separated as non-static member in a new ParamSubstituteContext class. PigFileParser and ParamLoader then accept the context as constructor parameter.

        Show
        Pi Song added a comment - >>>> 5. Why ParamLoader.Parse() throw IOException? >>I am not sure but seems like all parser we have do that. I think others throw ParseException. >>>> 1. I prefer HashMap to HashTable Hashtable just gives me a feeling of old-fashioned programmers Here is comment on static structure:- Having getHash() and setHash() in both PigFileParser and ParamLoader pointing to UtilFunctions.param_val is a kind of weird. If you try to substitute vars multiple times in the same JVM, this can lead to side effects (I don't know if such scenarios exist). I think for the sake of clarity and reusability, UtilFunctions.param_val should be separated as non-static member in a new ParamSubstituteContext class. PigFileParser and ParamLoader then accept the context as constructor parameter.
        Hide
        Olga Natkovich added a comment -

        Updated patch:

        (1) Changed name for default token. It conflicted with reserved word in javacc. Now it is called PIGDEFAULT
        (2) Made log declaration consistent with other modules
        (3) Made identifier pattern static
        (4) Changed grammar to no longer require quotes around single word values

        Show
        Olga Natkovich added a comment - Updated patch: (1) Changed name for default token. It conflicted with reserved word in javacc. Now it is called PIGDEFAULT (2) Made log declaration consistent with other modules (3) Made identifier pattern static (4) Changed grammar to no longer require quotes around single word values
        Hide
        Olga Natkovich added a comment -

        Pi, thanks for your prompt feedback. Couple of follow up questions/comments.

        >> 1. You assume no escaping in shell command, right?

        Yes, I assume that the command is formed such that it can run as is. At this point it does not allow backticks inside of the command. If this turned out to be needed we can change that. The only escaping that is done is for $identifier to prevent the substitution. This is done by the substitution code itself.

        >> 2. The name "UtilFunctions" implies it does not hold state (even global state). From the way it is used, we should have a better name or refactor is needed.

        This is how the code was submitted to me and renaming things is not one of things I like to do . Since it is not an interface point, I don't really want to change names.

        >> 3. PigFileParser.unquote still doesn't do escaping.

        There are two types of escaping supported for the lines: for quotes and for $identifier. Since the preprocessor does not interpret any other data, I don't think anything else is needed.

        >> 4. <DEFALT> token in PigFileParser misspelled

        Will fix that .

        >> 5. Why ParamLoader.Parse() throw IOException?

        I am not sure but seems like all parser we have do that.

        >> 6. In UtilFunctions.substitute, what does "replaced_line = replaced_line.replaceAll("\\\\\\$","
        $");" do?

        The indent is to allow escaping $identifier. I am not exactly sure why it requires so many escapes. I can try and see if there is another more intuitive way to do that.

        >> 7. Shouldn't logger be declared "private final Logger logger = Logger.getLogger("org.apache.pig.preprocessor.log");" (everything in one line) to make it consistent?

        Will fix that

        >> 1. I prefer HashMap to HashTable

        Any particular reason?

        2. Pattern identifier in UtilFunctions.substitute can be made static, this makes it 1 microsecond faster

        Will fix that .

        I also realized that I we don't have to require that single word parameters have to be quoted. I will make that change in the parser, make the updates that you suggested and submit new patch

        Show
        Olga Natkovich added a comment - Pi, thanks for your prompt feedback. Couple of follow up questions/comments. >> 1. You assume no escaping in shell command, right? Yes, I assume that the command is formed such that it can run as is. At this point it does not allow backticks inside of the command. If this turned out to be needed we can change that. The only escaping that is done is for $identifier to prevent the substitution. This is done by the substitution code itself. >> 2. The name "UtilFunctions" implies it does not hold state (even global state). From the way it is used, we should have a better name or refactor is needed. This is how the code was submitted to me and renaming things is not one of things I like to do . Since it is not an interface point, I don't really want to change names. >> 3. PigFileParser.unquote still doesn't do escaping. There are two types of escaping supported for the lines: for quotes and for $identifier. Since the preprocessor does not interpret any other data, I don't think anything else is needed. >> 4. <DEFALT> token in PigFileParser misspelled Will fix that . >> 5. Why ParamLoader.Parse() throw IOException? I am not sure but seems like all parser we have do that. >> 6. In UtilFunctions.substitute, what does "replaced_line = replaced_line.replaceAll("\\\\\\$"," $");" do? The indent is to allow escaping $identifier. I am not exactly sure why it requires so many escapes. I can try and see if there is another more intuitive way to do that. >> 7. Shouldn't logger be declared "private final Logger logger = Logger.getLogger("org.apache.pig.preprocessor.log");" (everything in one line) to make it consistent? Will fix that >> 1. I prefer HashMap to HashTable Any particular reason? 2. Pattern identifier in UtilFunctions.substitute can be made static, this makes it 1 microsecond faster Will fix that . I also realized that I we don't have to require that single word parameters have to be quoted. I will make that change in the parser, make the updates that you suggested and submit new patch
        Hide
        Pi Song added a comment -

        A quick review:-

        1. You assume no escaping in shell command, right?

        2. The name "UtilFunctions" implies it does not hold state (even global state). From the way it is used, we should have a better name or refactor is needed.

        3. PigFileParser.unquote still doesn't do escaping.

        4. <DEFALT> token in PigFileParser misspelled

        5. Why ParamLoader.Parse() throw IOException?

        6. In UtilFunctions.substitute, what does "replaced_line = replaced_line.replaceAll("\\\\\\$","
        $");" do?

        7. Shouldn't logger be declared "private final Logger logger = Logger.getLogger("org.apache.pig.preprocessor.log");" (everything in one line) to make it consistent?

        Trivial things:-
        1. I prefer HashMap to HashTable
        2. Pattern identifier in UtilFunctions.substitute can be made static, this makes it 1 microsecond faster

        Show
        Pi Song added a comment - A quick review:- 1. You assume no escaping in shell command, right? 2. The name "UtilFunctions" implies it does not hold state (even global state). From the way it is used, we should have a better name or refactor is needed. 3. PigFileParser.unquote still doesn't do escaping. 4. <DEFALT> token in PigFileParser misspelled 5. Why ParamLoader.Parse() throw IOException? 6. In UtilFunctions.substitute, what does "replaced_line = replaced_line.replaceAll("\\\\\\$"," $");" do? 7. Shouldn't logger be declared "private final Logger logger = Logger.getLogger("org.apache.pig.preprocessor.log");" (everything in one line) to make it consistent? Trivial things:- 1. I prefer HashMap to HashTable 2. Pattern identifier in UtilFunctions.substitute can be made static, this makes it 1 microsecond faster
        Hide
        Olga Natkovich added a comment -

        The patch contains the initial implementation provided by a tean at Yahoo. I have made some modification and also integrated it into pig.

        The code includes an extenssive set of unit tests which all pass. Also, all existing unit and end-to-end tests are passing. I ran several end-to-end tests and they also pass. More end-to-end testing is on the way.

        Also, update the specification: http://wiki.apache.org/pig/ParameterSubstitution

        Show
        Olga Natkovich added a comment - The patch contains the initial implementation provided by a tean at Yahoo. I have made some modification and also integrated it into pig. The code includes an extenssive set of unit tests which all pass. Also, all existing unit and end-to-end tests are passing. I ran several end-to-end tests and they also pass. More end-to-end testing is on the way. Also, update the specification: http://wiki.apache.org/pig/ParameterSubstitution
        Hide
        Olga Natkovich added a comment -

        Alan,

        Your comments just make me think again that C-style preprocessor does not really make sense for Pig. So I think in my last proposal, I will change from #define to declare but with extensions that I proposed for #define.

        I like to prepend $ to the var names to make it more readable and easier parsable. Also, if pig ever has full support for variables, this would be useful as well.

        Show
        Olga Natkovich added a comment - Alan, Your comments just make me think again that C-style preprocessor does not really make sense for Pig. So I think in my last proposal, I will change from #define to declare but with extensions that I proposed for #define. I like to prepend $ to the var names to make it more readable and easier parsable. Also, if pig ever has full support for variables, this would be useful as well.
        Hide
        Alan Gates added a comment -

        A couple of comments on using #define

        1) Why do we need $ in our variables after #define? In you example you say:

        #define CMD `generate_date`

        a = load '/data/$CMD';

        It seems the second line should be:

        a = load '/data/CMD';

        People used to using #define in Cpp won't be thinking they need a variable designator such as $ in there.

        2) How will default values be done? I think the proposal was to have values from the command line override any #defines in the script. If we're going to follow the syntax of Cpp we better follow it's semantics as well, else we'll be violating the law of least astonishment. In Cpp #defines are not overridden by values supplied at compile time with -D. Instead, it's done as:

        #ifndef X

        #define X

        #endif

        That way if X is provided on the command line via -D it remains, otherwise a default value is given. We could implement #ifndef without bogging down in a full #ifdef implementation.

        Show
        Alan Gates added a comment - A couple of comments on using #define 1) Why do we need $ in our variables after #define? In you example you say: #define CMD `generate_date` a = load '/data/$CMD'; It seems the second line should be: a = load '/data/CMD'; People used to using #define in Cpp won't be thinking they need a variable designator such as $ in there. 2) How will default values be done? I think the proposal was to have values from the command line override any #defines in the script. If we're going to follow the syntax of Cpp we better follow it's semantics as well, else we'll be violating the law of least astonishment. In Cpp #defines are not overridden by values supplied at compile time with -D. Instead, it's done as: #ifndef X #define X #endif That way if X is provided on the command line via -D it remains, otherwise a default value is given. We could implement #ifndef without bogging down in a full #ifdef implementation.
        Hide
        Olga Natkovich added a comment -

        This is in response to comment https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565959#action_12565959

        >> 1. The parameter usage notation seems to arcane
        Hopefully, the new proposal to use #define to declare all variables addresses this issue

        >> 2. "-param" key seems redundant and will cause confusion
        I usually prefer explicit rather than implicit specifications. Also, the -param approach separates pigs command line namespace from user parameter one. But I could go either way on this.

        >> 3. We need an ability to specify "DFS working directory" for the whole Pig job.
        There is already way in pig to do that by issuing CD command

        >> 4. Need to specify the precedence rules
        I thought I already specified this in the proposal. Please, let me know which parts are unclear.

        >> 5. Can parameters be used in RHS of declare?

        My plan was to avoid it for the initial release to keep things simple since most of the use cases I saw did not need that. However, I don't think it would be difficult to implement so we can put that in.

        Show
        Olga Natkovich added a comment - This is in response to comment https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565959#action_12565959 >> 1. The parameter usage notation seems to arcane Hopefully, the new proposal to use #define to declare all variables addresses this issue >> 2. "-param" key seems redundant and will cause confusion I usually prefer explicit rather than implicit specifications. Also, the -param approach separates pigs command line namespace from user parameter one. But I could go either way on this. >> 3. We need an ability to specify "DFS working directory" for the whole Pig job. There is already way in pig to do that by issuing CD command >> 4. Need to specify the precedence rules I thought I already specified this in the proposal. Please, let me know which parts are unclear. >> 5. Can parameters be used in RHS of declare? My plan was to avoid it for the initial release to keep things simple since most of the use cases I saw did not need that. However, I don't think it would be difficult to implement so we can put that in.
        Hide
        Olga Natkovich added a comment -

        This is in reply to https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565956#action_12565956

        >> Given the use of a distinct preprocessor, are there any existing ones that make sense?

        I found this one: http://antenna.sourceforge.net/wtkpreprocess.php which basically implements C-style preprocessor for JAVA but I am not sure that we want/need to go this far. See my comments on CPP preprocessor.

        >> Does the pig language support literals with whitespace, carriage- returns and so on? If so, how does one define a parameter in a file with such characters in them?

        I believe that pig allows string constants to contain arbitrary text including white spaces and carriage-returns. To put it in the parameter file you would enclose the value in quotes and we can just allow literals to span multiple lines. I can make clarification in the document.

        >> How does this approach compare to solutions used in SQL?

        As far as I could tell this is not part of standard SQL. In MySQL, user can define variables using set statement and then use them in their SQL statement: http://dev.mysql.com/doc/refman/5.0/en/user-variables.html.

        Show
        Olga Natkovich added a comment - This is in reply to https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565956#action_12565956 >> Given the use of a distinct preprocessor, are there any existing ones that make sense? I found this one: http://antenna.sourceforge.net/wtkpreprocess.php which basically implements C-style preprocessor for JAVA but I am not sure that we want/need to go this far. See my comments on CPP preprocessor. >> Does the pig language support literals with whitespace, carriage- returns and so on? If so, how does one define a parameter in a file with such characters in them? I believe that pig allows string constants to contain arbitrary text including white spaces and carriage-returns. To put it in the parameter file you would enclose the value in quotes and we can just allow literals to span multiple lines. I can make clarification in the document. >> How does this approach compare to solutions used in SQL? As far as I could tell this is not part of standard SQL. In MySQL, user can define variables using set statement and then use them in their SQL statement: http://dev.mysql.com/doc/refman/5.0/en/user-variables.html .
        Hide
        Olga Natkovich added a comment -

        This is in response to https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565958#action_12565958
        =============================================================================================

        This is response to Alan's comment: https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565958#action_12565958
        =========================================================================================================

        I think it is an interesting idea and a reasonable approach but I have a few concerns:

        (1) I don't think that C-style preprocessor (CPP) is necessarily that well known and understood among our users and developers
        (2) CPP is complex to use and implement and might be too heavy weight for what we are trying to do. Briefly looked at the CPP code and it is very involved. Translating or writing on our own chunks of it would be a fairly large project.
        (3) For C, CPP is used to influence how code is compiled. For pig we are trying to influence the run time behavior of the pig program and there are other ways to do it. One way to do it is to embed Pig in languages such as Perl, Python or C/C++ which would take care of code inclusion, conditional execution and more. Similarly, we might decide to have this things in pig language but I don't think they belong in our preprocessor. (What's the difference between "if" and "#if" would be in pig.) So my approach is (a) simple things like parameter substitution can be done in preprocessor. (b) more complex things happen in the language itself or in the language in which Pig is embedded.
        (4) CPP does not provide support for command execution which users asked for and just forcing them to run it from command line has limitation in terms of parameterizing command line and also harvesting return codes and error messages.

        There are a couple of things I like from this proposal and would like to use:

        (1) use #define rather than declare
        (2) extend #define to also declare commands
        (3) We can later further expand #define to include more things as we need them.

        This way only variable names would be used outside of define which is nice since if pig later support variables such as for scalars they would have consistent representation.

        So my examples from the document would now look as follows:

        (1)
        A = load '/data/mydata/$date';

        (2)
        #define CMD `generate_date`
        A = load '/data/mydata/$CMD';

        (3)
        #define CMD `generate_name $date`
        A = load '/data/mydata/$CMD';

        (4)
        #define CMD `$cmd $date`;
        A = load '/data/mydata/$CMD';

        I think this also addresses some of the concerns from

        https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565959#action_12565959

        Show
        Olga Natkovich added a comment - This is in response to https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565958#action_12565958 ============================================================================================= This is response to Alan's comment: https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565958#action_12565958 ========================================================================================================= I think it is an interesting idea and a reasonable approach but I have a few concerns: (1) I don't think that C-style preprocessor (CPP) is necessarily that well known and understood among our users and developers (2) CPP is complex to use and implement and might be too heavy weight for what we are trying to do. Briefly looked at the CPP code and it is very involved. Translating or writing on our own chunks of it would be a fairly large project. (3) For C, CPP is used to influence how code is compiled. For pig we are trying to influence the run time behavior of the pig program and there are other ways to do it. One way to do it is to embed Pig in languages such as Perl, Python or C/C++ which would take care of code inclusion, conditional execution and more. Similarly, we might decide to have this things in pig language but I don't think they belong in our preprocessor. (What's the difference between "if" and "#if" would be in pig.) So my approach is (a) simple things like parameter substitution can be done in preprocessor. (b) more complex things happen in the language itself or in the language in which Pig is embedded. (4) CPP does not provide support for command execution which users asked for and just forcing them to run it from command line has limitation in terms of parameterizing command line and also harvesting return codes and error messages. There are a couple of things I like from this proposal and would like to use: (1) use #define rather than declare (2) extend #define to also declare commands (3) We can later further expand #define to include more things as we need them. This way only variable names would be used outside of define which is nice since if pig later support variables such as for scalars they would have consistent representation. So my examples from the document would now look as follows: (1) A = load '/data/mydata/$date'; (2) #define CMD `generate_date` A = load '/data/mydata/$CMD'; (3) #define CMD `generate_name $date` A = load '/data/mydata/$CMD'; (4) #define CMD `$cmd $date`; A = load '/data/mydata/$CMD'; I think this also addresses some of the concerns from https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565959#action_12565959
        Hide
        Olga Natkovich added a comment -

        Comments from a user:

        1. The parameter usage notation seems to arcane.
        Why both %% and $? Why both %% and ``?
        It looks like you need %..% only to be able to put a command inside them – they (%..%) do not seem necessary for variable substitutions.
        Similarly, `..` with this meaning can only be right inside %...%
        So why don't you simplify it to the following:
        Example 1: parameter name specification
        A = load '/data/mydata/$date';
        Example 2: command specification
        A = load '/data/mydata/%generate_date%';
        Example 3: command taking a parameter
        A = load '/data/mydata/%generate_name $date%';
        Example 4: command is passed as a parameter:
        A = load '/data/mydata/%$cmd $date%';

        I would not want to write
        B = filter A by $0>'%$N%';
        (assuming I have N=5 in the command line...)
        I'd prefer
        B = filter A by $0>$N;

        2. (repeated from another message)
        "-param" key seems redundant and will cause confusion.
        Why not assume that any command line argument that has = is a parameter specification? Indeed, we can get rid of -option syntax at all.
        Another way to think about this: "X=Y" and "-X Y" are two ways to specify parameters. Use one.

        3. We need an ability to specify "DFS working directory" for the whole Pig job.
        This is the directory where all the relative Record Set path names (in "load" and "store") are rooted.
        By default, it is the users "home DFS directory".
        It is very convenient to be able to specify it when a script is run.

        4. Need to specify the precedence rules: if for the same parameter name 3 values are specified – in on command line, in "declare", and in parameters file, which one wins? (probable, the order is command line, parameters file, declare ?)

        5. Can parameters be used in RHS of declare? The documents implicitly says "NO", but is a very convenient feature. (I could give examples, but you can come up with plenty on your own.)

        6. Typo:
        The fault parameter values can be specified

        Show
        Olga Natkovich added a comment - Comments from a user: 1. The parameter usage notation seems to arcane. Why both %% and $? Why both %% and ``? It looks like you need %..% only to be able to put a command inside them – they (%..%) do not seem necessary for variable substitutions. Similarly, `..` with this meaning can only be right inside %...% So why don't you simplify it to the following: Example 1: parameter name specification A = load '/data/mydata/$date'; Example 2: command specification A = load '/data/mydata/%generate_date%'; Example 3: command taking a parameter A = load '/data/mydata/%generate_name $date%'; Example 4: command is passed as a parameter: A = load '/data/mydata/%$cmd $date%'; I would not want to write B = filter A by $0>'%$N%'; (assuming I have N=5 in the command line...) I'd prefer B = filter A by $0>$N; 2. (repeated from another message) "-param" key seems redundant and will cause confusion. Why not assume that any command line argument that has = is a parameter specification? Indeed, we can get rid of -option syntax at all. Another way to think about this: "X=Y" and "-X Y" are two ways to specify parameters. Use one. 3. We need an ability to specify "DFS working directory" for the whole Pig job. This is the directory where all the relative Record Set path names (in "load" and "store") are rooted. By default, it is the users "home DFS directory". It is very convenient to be able to specify it when a script is run. 4. Need to specify the precedence rules: if for the same parameter name 3 values are specified – in on command line, in "declare", and in parameters file, which one wins? (probable, the order is command line, parameters file, declare ?) 5. Can parameters be used in RHS of declare? The documents implicitly says "NO", but is a very convenient feature. (I could give examples, but you can come up with plenty on your own.) 6. Typo: The fault parameter values can be specified
        Hide
        Olga Natkovich added a comment -

        comments from Alan Gates:
        ======================

        What if instead we just built a C style preprocessor exactly, with #define, #include, #ifdef, #ifndef, #if, #else, and #endif?

        This would meet requirements 1, 2, 3, 5, and 6. For 4, people would have to generate the values with a script or binary themselves before calling the command line, but this should be trivial: pig -Dbla=`xxx`

        This has the advantage that it is well understood by all, implementers and users. No one has to develop a new set or semantics and we don't have to deal with user questions of how it works.

        Alan.

        Show
        Olga Natkovich added a comment - comments from Alan Gates: ====================== What if instead we just built a C style preprocessor exactly, with #define, #include, #ifdef, #ifndef, #if, #else, and #endif? This would meet requirements 1, 2, 3, 5, and 6. For 4, people would have to generate the values with a script or binary themselves before calling the command line, but this should be trivial: pig -Dbla=`xxx` This has the advantage that it is well understood by all, implementers and users. No one has to develop a new set or semantics and we don't have to deal with user questions of how it works. Alan.
        Hide
        Olga Natkovich added a comment -

        comments from erik14:
        ===================

        This approach seems fairly brittle. For example whitespace in a parameter can have unexpected results.

        Given the use of a distinct preprocessor, are there any existing ones that make sense?

        Does the pig language support literals with whitespace, carriage- returns and so on? If so, how does one define a parameter in a file with such characters in them?

        How does this approach compare to solutions used in SQL?

        In Hadoop, the discussion would be in a jira. Does moving the discussion to a jira make sense?

        Show
        Olga Natkovich added a comment - comments from erik14: =================== This approach seems fairly brittle. For example whitespace in a parameter can have unexpected results. Given the use of a distinct preprocessor, are there any existing ones that make sense? Does the pig language support literals with whitespace, carriage- returns and so on? If so, how does one define a parameter in a file with such characters in them? How does this approach compare to solutions used in SQL? In Hadoop, the discussion would be in a jira. Does moving the discussion to a jira make sense?
        Hide
        Olga Natkovich added a comment -

        Moved the proposal into a separate document:

        http://wiki.apache.org/pig/ParameterSubstitution

        Show
        Olga Natkovich added a comment - Moved the proposal into a separate document: http://wiki.apache.org/pig/ParameterSubstitution
        Hide
        Olga Natkovich added a comment -

        We have 2 additional requirements:

        (1) Specify the substitutions via input file. The input file will have key/value pairs.
        (2) Substitutions, can be strings or an executable that generates the string. Eg 20080122 or a script that generates the date.

        Show
        Olga Natkovich added a comment - We have 2 additional requirements: (1) Specify the substitutions via input file. The input file will have key/value pairs. (2) Substitutions, can be strings or an executable that generates the string. Eg 20080122 or a script that generates the date.
        Hide
        Benjamin Reed added a comment -

        We should also allow positional parameters since nothing following the script:

        pig -param date='aoeuaoeu' myscript.pig firstPos secondPos

        We also don't need to generate a temp file. We can do the substitution on the fly.

        Show
        Benjamin Reed added a comment - We should also allow positional parameters since nothing following the script: pig -param date='aoeuaoeu' myscript.pig firstPos secondPos We also don't need to generate a temp file. We can do the substitution on the fly.

          People

          • Assignee:
            Olga Natkovich
            Reporter:
            Olga Natkovich
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development