Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.4.0
    • Component/s: None
    • Labels:
      None

      Description

      Currently, it is hardwired in PigContext.

      1. PIG-832-2.patch
        8 kB
        Daniel Dai
      2. PIG-832-1.patch
        3 kB
        Daniel Dai

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in Pig-trunk #484 (See http://hudson.zones.apache.org/hudson/job/Pig-trunk/484/)
          : Make import list configurable

          Show
          Hudson added a comment - Integrated in Pig-trunk #484 (See http://hudson.zones.apache.org/hudson/job/Pig-trunk/484/ ) : Make import list configurable
          Hide
          Daniel Dai added a comment -

          Patch committed

          Show
          Daniel Dai added a comment - Patch committed
          Hide
          Daniel Dai added a comment -

          Remove the Yahoo line

          Show
          Daniel Dai added a comment - Remove the Yahoo line
          Hide
          Olga Natkovich added a comment -

          +1 once we answer/resolve issues above

          Show
          Olga Natkovich added a comment - +1 once we answer/resolve issues above
          Hide
          Olga Natkovich added a comment -

          Hi Daniel,

          The patch looks good.

          One comment - I think the Yahoo line that you commented out should be removed

          One question - the way this is implemented, the builtins will take precedence over user defined functions in case of the conflict. I think this is the right approach - I think overwriting builting should be explicit via fully qualified names but I wanted to see what others thought.

          Show
          Olga Natkovich added a comment - Hi Daniel, The patch looks good. One comment - I think the Yahoo line that you commented out should be removed One question - the way this is implemented, the builtins will take precedence over user defined functions in case of the conflict. I think this is the right approach - I think overwriting builting should be explicit via fully qualified names but I wanted to see what others thought.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12411291/PIG-832-2.patch
          against trunk revision 786694.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/95/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/95/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/95/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12411291/PIG-832-2.patch against trunk revision 786694. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/95/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/95/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/95/console This message is automatically generated.
          Hide
          Daniel Dai added a comment -

          Change the patch with a better unit test.

          Show
          Daniel Dai added a comment - Change the patch with a better unit test.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12411247/PIG-832-1.patch
          against trunk revision 786607.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/94/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/94/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/94/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12411247/PIG-832-1.patch against trunk revision 786607. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/94/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/94/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/94/console This message is automatically generated.
          Hide
          Daniel Dai added a comment -

          I submit the patch for the original proposal for now. We can discuss alias things which Milind suggest in a separate case.

          Show
          Daniel Dai added a comment - I submit the patch for the original proposal for now. We can discuss alias things which Milind suggest in a separate case.
          Hide
          Olga Natkovich added a comment -

          I don't believe this prevents future improvements

          Show
          Olga Natkovich added a comment - I don't believe this prevents future improvements
          Hide
          Milind Bhandarkar added a comment -

          Olga,

          As long the suggested improvements do not result in redundancy / make the original solutions obsolete, its fine. But I believe that the core issue, which is, "how does pig resolve UDFs?", is not addressed properly in the "small change to current implementation".

          Show
          Milind Bhandarkar added a comment - Olga, As long the suggested improvements do not result in redundancy / make the original solutions obsolete, its fine. But I believe that the core issue, which is, "how does pig resolve UDFs?", is not addressed properly in the "small change to current implementation".
          Hide
          Olga Natkovich added a comment -

          You use a fully qualified name for the other one.

          I would like for us to continue on our original plan. It might not solve all the issues but it certainly helps and it is a very small change to the current implementation.

          We can discuss improvements in a separate JIRA.

          Show
          Olga Natkovich added a comment - You use a fully qualified name for the other one. I would like for us to continue on our original plan. It might not solve all the issues but it certainly helps and it is a very small change to the current implementation. We can discuss improvements in a separate JIRA.
          Hide
          Milind Bhandarkar added a comment -

          Daniel,

          Pig streaming already uses backquotes for executing external programs. So, users are familiar with this syntax. I believe an ordinary pig user already knows about doing such things in unix shells. But anyway, as Olga said, she is looking for requirements, and not solutions, so, here is a requirement:

          I have two jars: xyz.jar, and abc.jar. I am using two UDFs in my scripts. I want to use function1 from xyz.jar, and function2 from abc.jar. How do I use function2 from abc.jar with full confidence that xyz.jar does not contain a UDF named function2? How do you propose I do that without modifying a whole bunch of pig scripts that I am testing for my functions ?

          In the solution that I proposed, I can just change function2 mapping by including "-Dimport.list=function2:com.yahoo.milind.function2" on the command-line.

          Show
          Milind Bhandarkar added a comment - Daniel, Pig streaming already uses backquotes for executing external programs. So, users are familiar with this syntax. I believe an ordinary pig user already knows about doing such things in unix shells. But anyway, as Olga said, she is looking for requirements, and not solutions, so, here is a requirement: I have two jars: xyz.jar, and abc.jar. I am using two UDFs in my scripts. I want to use function1 from xyz.jar, and function2 from abc.jar. How do I use function2 from abc.jar with full confidence that xyz.jar does not contain a UDF named function2? How do you propose I do that without modifying a whole bunch of pig scripts that I am testing for my functions ? In the solution that I proposed, I can just change function2 mapping by including "-Dimport.list=function2:com.yahoo.milind.function2" on the command-line.
          Hide
          Daniel Dai added a comment -

          yes, `cat myudflist` is a way to get around. However, in my humble opinion, this syntax is not very intuitive to the ordinary user. Many users may have the impression that they have to put their UDFs one by one.

          Show
          Daniel Dai added a comment - yes, `cat myudflist` is a way to get around. However, in my humble opinion, this syntax is not very intuitive to the ordinary user. Many users may have the impression that they have to put their UDFs one by one.
          Hide
          Milind Bhandarkar added a comment -

          Daniel,

          >>>>Hi, Milind, If a user wrote 10 UDFs, I guess he/she does not suppose to put 10 entries in the command line, right?

          No, thats why I have a `cat myudflist` allowed on the command-line.

          Show
          Milind Bhandarkar added a comment - Daniel, >>>>Hi, Milind, If a user wrote 10 UDFs, I guess he/she does not suppose to put 10 entries in the command line, right? No, thats why I have a `cat myudflist` allowed on the command-line.
          Hide
          Daniel Dai added a comment -

          Hi, Milind,
          If a user wrote 10 UDFs, I guess he/she does not suppose to put 10 entries in the command line, right?

          Show
          Daniel Dai added a comment - Hi, Milind, If a user wrote 10 UDFs, I guess he/she does not suppose to put 10 entries in the command line, right?
          Hide
          Milind Bhandarkar added a comment -

          Olga, specifying a list of packages as a path list will have the same issues as

          import com.xyz.package.*;
          

          in java, where it is considered to be a bad practice. So, in the solution that I have proposed, I am assuming the class name is specified on the commandline and not the package name.

          Show
          Milind Bhandarkar added a comment - Olga, specifying a list of packages as a path list will have the same issues as import com.xyz. package .*; in java, where it is considered to be a bad practice. So, in the solution that I have proposed, I am assuming the class name is specified on the commandline and not the package name.
          Hide
          Olga Natkovich added a comment -

          Milind,

          Issue is not the complexity of implementation but that I am not sure we want to support command line aliasing and I want to discuss and understand the use cases for it separately. And we can parameterize PigMix if we needed to - that was just an example of an alternative solution for the issue you specified.

          I looking for a list of requirements - not a solution.

          Another comment is I don't think the solution you are proposing would work. The way the list is used to by prepending the package name to the function name to see if the function exist. It deos not do anything with function name itself.

          Show
          Olga Natkovich added a comment - Milind, Issue is not the complexity of implementation but that I am not sure we want to support command line aliasing and I want to discuss and understand the use cases for it separately. And we can parameterize PigMix if we needed to - that was just an example of an alternative solution for the issue you specified. I looking for a list of requirements - not a solution. Another comment is I don't think the solution you are proposing would work. The way the list is used to by prepending the package name to the function name to see if the function exist. It deos not do anything with function name itself.
          Hide
          Daniel Dai added a comment -

          Hi, Milind,
          For your first comment, yes, user's class have to be PigStorage. For your second comment, we do not put user's jar before pig.jar. We put their udf search path first. Let's say user put "-Dudf.import.list=com.xxx.udf1:com.xxx.udf2", when we see an unknown UDF, we first search in the package "com.xxx.udf1", then "com.xxx.udf2", then "org.apache.pig.builtin". We build this policy in our code. It's not put user.jar in front of pig.jar.

          Show
          Daniel Dai added a comment - Hi, Milind, For your first comment, yes, user's class have to be PigStorage. For your second comment, we do not put user's jar before pig.jar. We put their udf search path first. Let's say user put "-Dudf.import.list=com.xxx.udf1:com.xxx.udf2", when we see an unknown UDF, we first search in the package "com.xxx.udf1", then "com.xxx.udf2", then "org.apache.pig.builtin". We build this policy in our code. It's not put user.jar in front of pig.jar.
          Hide
          Milind Bhandarkar added a comment -

          Daniel: For that to work, user's class will have to be called PigStorage. And also, inserting user's jars before pig jar for looking up methods can have major unintended consequences. pig.jar should always be the first in the classpath.

          Olga: My use case cannot use parameter substitution, because PigMix scrips does not specify PigStorage as, say, $storage. The solution I proposed is as simple to implement as Daniel's original proposal (+= is a syntactic sugar. even = can be used with the same effect.), and it fixes a specific ask, and also allows for extensibility. Am I missing something here ?

          Show
          Milind Bhandarkar added a comment - Daniel: For that to work, user's class will have to be called PigStorage. And also, inserting user's jars before pig jar for looking up methods can have major unintended consequences. pig.jar should always be the first in the classpath. Olga: My use case cannot use parameter substitution, because PigMix scrips does not specify PigStorage as, say, $storage. The solution I proposed is as simple to implement as Daniel's original proposal (+= is a syntactic sugar. even = can be used with the same effect.), and it fixes a specific ask, and also allows for extensibility. Am I missing something here ?
          Hide
          Olga Natkovich added a comment -

          Milind, we have parameter substitution for what you are mentioning as example.

          My proposal would be to keep this issue strictly for the packaging thing. This will already make a lot of people happy and users asked for just that.

          We can discuss and understand more user requirements regarding aliases in a separate thread.

          Show
          Olga Natkovich added a comment - Milind, we have parameter substitution for what you are mentioning as example. My proposal would be to keep this issue strictly for the packaging thing. This will already make a lot of people happy and users asked for just that. We can discuss and understand more user requirements regarding aliases in a separate thread.
          Hide
          Daniel Dai added a comment -

          Hi, Milind, in the use case you mentioned, he/she can write his own PigStorage, put the jar in the import list. Pig will take user supplied UDF first, thus override the buildin PigStorage. How is this?

          Show
          Daniel Dai added a comment - Hi, Milind, in the use case you mentioned, he/she can write his own PigStorage, put the jar in the import list. Pig will take user supplied UDF first, thus override the buildin PigStorage. How is this?
          Hide
          Milind Bhandarkar added a comment -

          Hardwiring names that are not reserved words is likely to cause more pain in the long run. What Daniel is proposing seems to be right for your assumptions 1, 2, and 3, though. But one can easily think of several use cases where overriding on the command-line eases a lot of pain. Assume that someone writes a new superfast PigStorage UDF, and wants to compare its performance with the default PigStorage provided with Pig. Instead of modifying the entire benchmark suite PigMix to use the new storage UDF, he/she can just make PigStorage point to his own UDF on the commandline and run PigMix. It saves a lot of Pain.

          Show
          Milind Bhandarkar added a comment - Hardwiring names that are not reserved words is likely to cause more pain in the long run. What Daniel is proposing seems to be right for your assumptions 1, 2, and 3, though. But one can easily think of several use cases where overriding on the command-line eases a lot of pain. Assume that someone writes a new superfast PigStorage UDF, and wants to compare its performance with the default PigStorage provided with Pig. Instead of modifying the entire benchmark suite PigMix to use the new storage UDF, he/she can just make PigStorage point to his own UDF on the commandline and run PigMix. It saves a lot of Pain.
          Hide
          Olga Natkovich added a comment -

          Also think you are suggesting UDF aliasing on command line which I am not sure is the right place for it.

          The scope of this work is just to make it easier for users to refer to their UDFs.

          Show
          Olga Natkovich added a comment - Also think you are suggesting UDF aliasing on command line which I am not sure is the right place for it. The scope of this work is just to make it easier for users to refer to their UDFs.
          Hide
          Olga Natkovich added a comment -

          Milind,

          Couple of comments and clarifications:

          (1) Builtin UDFs are not reserved words. (Flatten is reserved but it is not a UDF) The issue we have seen is users creating UDFs that had reserved words in the package name and if the package name is registered as proposed in this JIRa, their problem will go away.
          (2) I don't think we need to allow to overwrite the defaults. We are not planning to expand the list beyond default distribution (builtins + piggybank.) The plan is to hardwire this values in the code since they are not likely to change
          (3) Our plan is to keep it simple and to just allow users to add packages based on what they use in their UDFs.

          Show
          Olga Natkovich added a comment - Milind, Couple of comments and clarifications: (1) Builtin UDFs are not reserved words. (Flatten is reserved but it is not a UDF) The issue we have seen is users creating UDFs that had reserved words in the package name and if the package name is registered as proposed in this JIRa, their problem will go away. (2) I don't think we need to allow to overwrite the defaults. We are not planning to expand the list beyond default distribution (builtins + piggybank.) The plan is to hardwire this values in the code since they are not likely to change (3) Our plan is to keep it simple and to just allow users to add packages based on what they use in their UDFs.
          Hide
          Milind Bhandarkar added a comment -

          Instead of a list, if you make it a map (i.e. short name -> fully qualified class name), it will be much easier, as it will guarantee that each name has exactly one udf class associated with it. It will also allow users to use udfs that have class names which are pig reserved words. For example, If I have an existing UDF with a class name such as load or store, I can still use them with a different name like myload, without having to rename the class.

          So, I suggest:

          java -jar pig.jar -Dimport.list+=MyLoad:com.xxxx.Load,Flatten:com.xxxx.Flatten,... 
          

          If I do not specify -Dimport.list on the pig command line, then the default import.list is used.

          Thoughts ?

          Show
          Milind Bhandarkar added a comment - Instead of a list, if you make it a map (i.e. short name -> fully qualified class name), it will be much easier, as it will guarantee that each name has exactly one udf class associated with it. It will also allow users to use udfs that have class names which are pig reserved words. For example, If I have an existing UDF with a class name such as load or store, I can still use them with a different name like myload, without having to rename the class. So, I suggest: java -jar pig.jar -Dimport.list+=MyLoad:com.xxxx.Load,Flatten:com.xxxx.Flatten,... If I do not specify -Dimport.list on the pig command line, then the default import.list is used. Thoughts ?
          Hide
          Daniel Dai added a comment -

          My thinking is that most users will use build-in UDFs. So it is better to be straight-forward to the majority. One thing is, import list is ordered. We can put the buildin UDFs in the end. So if user provide udf with the same name, Pig takes the user defined udf first. How is that?

          Show
          Daniel Dai added a comment - My thinking is that most users will use build-in UDFs. So it is better to be straight-forward to the majority. One thing is, import list is ordered. We can put the buildin UDFs in the end. So if user provide udf with the same name, Pig takes the user defined udf first. How is that?
          Hide
          Milind Bhandarkar added a comment -

          Olga, what I am saying is to have a default import list: which contains default UDFs (tokenize, Max, Min, flatten), followed by piggybank contribs. And the same list can be added to / overridden on the command-line. This has several advantages. Pig built-ins do not have to be reserved words, and can be overridden. For example, recent mails on pig-users have mentioned that tokenize+flatten should be a single udf. This can be done by providing a flatten (which is null), and tokenize, which does tokenize+flatten, and existing scripts will still work. This simplifies pig grammar as well. Users can create udf libraries, and use them with:

          java -Dimport.list += `cat my-udf-lib.import`
          

          Thoughts ?

          Show
          Milind Bhandarkar added a comment - Olga, what I am saying is to have a default import list: which contains default UDFs (tokenize, Max, Min, flatten), followed by piggybank contribs. And the same list can be added to / overridden on the command-line. This has several advantages. Pig built-ins do not have to be reserved words, and can be overridden. For example, recent mails on pig-users have mentioned that tokenize+flatten should be a single udf. This can be done by providing a flatten (which is null), and tokenize, which does tokenize+flatten, and existing scripts will still work. This simplifies pig grammar as well. Users can create udf libraries, and use them with: java -Dimport.list += `cat my-udf-lib. import ` Thoughts ?
          Hide
          Olga Natkovich added a comment -

          Milind, Not quite sure what you are saying. We currently don't have any way to pass the list in. import.list does not exist in pig as far as I know.

          Show
          Olga Natkovich added a comment - Milind, Not quite sure what you are saying. We currently don't have any way to pass the list in. import.list does not exist in pig as far as I know.
          Hide
          Milind Bhandarkar added a comment -

          Can we have:

          java -Dimport.list += com.xxx.udf1 ...
          

          That way, I only add/override the default udf import list. Other wise, we will have two variables - import.list, and udf.import.list, and resolving a udf name will have to check both in specific order.

          Show
          Milind Bhandarkar added a comment - Can we have: java -Dimport.list += com.xxx.udf1 ... That way, I only add/override the default udf import list. Other wise, we will have two variables - import.list, and udf.import.list, and resolving a udf name will have to check both in specific order.
          Hide
          Olga Natkovich added a comment -

          Daniel, your proposal looks good

          Show
          Olga Natkovich added a comment - Daniel, your proposal looks good
          Hide
          Olga Natkovich added a comment -

          In response to Milind. I don't think we are committing to more support for piggybank. All this does is, if you do use UDFs from piggybank, you don't need to use full package name.

          Show
          Olga Natkovich added a comment - In response to Milind. I don't think we are committing to more support for piggybank. All this does is, if you do use UDFs from piggybank, you don't need to use full package name.
          Hide
          Daniel Dai added a comment -

          How about this syntax:
          Add a new java property: udf.import.list. We can put package name to import to this property. We can put multiple packages, separated by colon.

          User runs pig like this:
          java -Dudf.import.list=com.xxx.udf1:com.xxx.udf2 ......

          For that, user can refer to UDFs in these two packages without specify the package name.

          Show
          Daniel Dai added a comment - How about this syntax: Add a new java property: udf.import.list. We can put package name to import to this property. We can put multiple packages, separated by colon. User runs pig like this: java -Dudf.import.list=com.xxx.udf1:com.xxx.udf2 ...... For that, user can refer to UDFs in these two packages without specify the package name.
          Hide
          Milind Bhandarkar added a comment -

          If we include the piggybank functions in the default import list, we need to make sure that they are compiled and tested in the default build, and that the releases will be blocked due to them not compiling etc. Is that the intention ?

          Show
          Milind Bhandarkar added a comment - If we include the piggybank functions in the default import list, we need to make sure that they are compiled and tested in the default build, and that the releases will be blocked due to them not compiling etc. Is that the intention ?
          Hide
          Olga Natkovich added a comment -

          As part of this fix we should also expand the default list to include piggybank functions

          Show
          Olga Natkovich added a comment - As part of this fix we should also expand the default list to include piggybank functions

            People

            • Assignee:
              Daniel Dai
              Reporter:
              Olga Natkovich
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development