Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java.
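As a purely illustrative sketch (the function name and tuple-per-word convention are assumptions, not Pig's actual API), the kind of UDF this issue proposes might be nothing more than a few lines of Python:

```python
# Hypothetical sketch of a scripted UDF: a tokenizer written in plain Python
# instead of a compiled Java class packaged into a jar.
def tokenize(line):
    # emulate TOKENIZE: one single-field tuple per whitespace-separated word
    return [(word,) for word in line.split()]
```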

      1. calltrace.png
        704 kB
        Arnab Nandi
      2. package.zip
        5.17 MB
        Alan Gates
      3. pig.scripting.patch.arnab
        17 kB
        Arnab Nandi
      4. PIG-928.patch
        9 kB
        Dmitriy V. Ryaboy
      5. pig-greek.tgz
        5.62 MB
        Julien Le Dem
      6. pyg.tgz
        5.62 MB
        Julien Le Dem
      7. RegisterPythonUDF_Final.patch
        40 kB
        Aniket Mokashi
      8. RegisterPythonUDF3.patch
        10 kB
        Aniket Mokashi
      9. RegisterPythonUDF4.patch
        26 kB
        Aniket Mokashi
      10. RegisterPythonUDFFinale.patch
        32 kB
        Aniket Mokashi
      11. RegisterPythonUDFFinale3.patch
        35 kB
        Aniket Mokashi
      12. RegisterPythonUDFFinale4.patch
        40 kB
        Aniket Mokashi
      13. RegisterPythonUDFFinale5.patch
        40 kB
        Aniket Mokashi
      14. RegisterPythonUDFLatest.patch
        45 kB
        Aniket Mokashi
      15. RegisterPythonUDFLatest2.patch
        51 kB
        Aniket Mokashi
      16. RegisterScriptUDFDefineParse.patch
        2 kB
        Aniket Mokashi
      17. scripting.tgz
        44 kB
        Woody Anderson
      18. scripting.tgz
        43 kB
        Woody Anderson
      19. test.zip
        2 kB
        Arnab Nandi

        Issue Links

          Activity


          Alan Gates added a comment -

          Attaching some preliminary work by Kishore Gopalakrishna on this. This code is a good start, but not ready for inclusion. It needs to be cleaned up, put in our class structure, etc.

          Comments from Kishore:

          It contains all the libraries required and also the GenericEval UDF and
          GenericFilter UDF

          I didn't get a chance to get the Algebraic function working.

          To test it, just unzip the package and run

          rm -rf wordcount/output;
          pig -x local wordcount.pig          ---> to test eval
          pig -x local wordcount_filter.pig   ---> to test filter [sorry, it should be named filter.pig]
          cat wordcount/output

          Alan Gates added a comment -

          Questions that we need to answer to get this patch ready for commit:

          1) How do we do type conversion? The current patch assumes a single string input and output. We'll want to be able to do conversions from scripting languages to pig types that make sense. How this can be done is tied up with #2 below.

          2) Do we do this using the Bean Scripting Framework or with specific bindings for each language? This patch shows how to do the specific bindings for Groovy. It can be done for Jython, and I'm reasonably sure it can be done for JRuby. The obvious advantage of using the BSF is we get all the languages they support for free. We need to understand the performance costs of each choice. We should be able to use the existing patch to test the difference between using the BSF and direct Groovy bindings. Also, it seems like type conversions will be much easier to do if we use specific bindings, as we can do explicit type mappings for each language. Perhaps this is possible with BSF, but I'm not sure how.
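The per-language type mapping that direct bindings would make possible can be sketched as a simple lookup table (the pairs below are illustrative guesses, not a spec from the patch):

```python
# Hypothetical mapping from Pig types to Python-side types, of the sort a
# direct Jython binding could apply when converting UDF return values.
PIG_TO_PYTHON = {
    "chararray": str,
    "int": int,
    "long": int,
    "double": float,
    "bag": list,
    "tuple": tuple,
}

def convert(pig_type, value):
    # coerce a value coming back from a script into the declared Pig type
    return PIG_TO_PYTHON[pig_type](value)
```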

          3) Grammar for how to declare these. I propose that we allow two options: inlined in define, and file referenced in define. So these would roughly look like:

          define myudf ScriptUDF('groovy', 'return input.get(0).split();');
          define myudf ScriptUDF('python', myudf.py);

          We could also support inlining in the Pig Latin itself, something like:

          B = foreach A generate {'groovy', 'return input.get(0).split();'};

          I'm not a fan of this type of inlining, as I think it makes the code hard to read.

          Alan Gates added a comment -

          I ran some quick and sloppy performance tests on this. I ran it using both BSF and direct bindings to groovy. I also ran it using the builtin TOKENIZE function in Pig. I had it read 5000 lines of text. The groovy (or TOKENIZE) functions handle splitting the line, then we do a standard group/count to count the words. I got the following results:

          Groovy using BSF: 55.070 seconds
          Groovy direct bindings: 58.560 seconds
          TOKENIZE: 2.554 seconds

          So about a 30x slowdown using this. That's pretty painful. I know string translation between languages can be bad. I don't know how much of this is inter-language bindings and how much is groovy. When I get a chance I'll try this in Python and see if I get similar numbers.

          Ashutosh Chauhan added a comment -

          30x is indeed too slow. But between BSF and direct bindings, I would have expected direct bindings to be more performant, since BSF adds an extra layer of translation. Isn't that the case?

          Alan Gates added a comment -

          I expected to see the direct bindings to be faster as well, but the tests didn't show that. In the code contributed by Kishore the type translation was done the same regardless of the bindings used. Perhaps there would be a more efficient way to do the type translation for direct bindings.

          Ashutosh Chauhan added a comment -

          Though a good lesson from this test is that BSF is not slower than direct bindings (it needs additional verification, though). So this feature could be implemented with a lot less code and complexity using BSF, as opposed to maintaining different direct bindings for different languages. On the other hand, the only widely used language BSF currently supports is Ruby. Not sure how many people using Pig will also be interested in groovy, javascript, etc. (the other languages supported by BSF).

          Alan Gates added a comment -

          jython was the one I was assuming people would want.

          Ashutosh Chauhan added a comment -

          Right, I overlooked it. I think Ruby and Python are the two most widely used scripting languages, and both are supported by BSF. So, comparing BSF with direct bindings:
          1) Performance: initial tests show them almost equal.
          2) Support for multiple languages.
          3) Ease of implementation.
          To me, BSF seems to be the way to go for this, at least for the first cut. Implementing this feature using BSF will allow us to expose it to users quickly, and if many people are using it and find one particular language to be slow, then we can explore language bindings for that particular language. Thoughts?

          Alan Gates added a comment -

          A couple thoughts:

          1) I still have to figure out how to do type translation in BSF. The current patch just assumes one string argument and then does reflection on the fly on return to figure out what it is returning. We may or may not be able to expose schemas to scripted UDFs (ala outputSchema and argToFuncMapping) but we at least need to handle multiple and non-string arguments. I need to do more digging in order to understand how to do this type translation in BSF.

          2) For at least one of Jython or JRuby we've got to show better than a 30x differential. There are some products you're just too embarrassed to sell. We may be able to speed this up some by having the framework figure out the return type for this UDF and always convert the returned object based on that return type rather than trying to do reflection.

          I don't know ruby or python, and I don't have time at the moment to go learn either. If someone is willing to give me snippets of python and/or ruby that mimic the split functionality given in the patch, I'm happy to test against those two in BSF and see what happens.
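For reference, a Python snippet mimicking the split functionality in the patch could be as small as the following (the calling convention the BSF harness would use is an assumption; the Ruby equivalent is likewise a one-line `line.split`):

```python
# Python counterpart of the Groovy 'return input.get(0).split();' UDF body.
# Under BSF/Jython the argument would arrive from Java; here it is a plain str.
def split_udf(line):
    return line.split()
```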

          Ashutosh Chauhan added a comment -

          I did a little research on the topic and it turns out there is a third option for doing it. JSR-223 [1], "Scripting for Java", has been approved through the JCP and is now part of the Java platform in the form of javax.script [2] as of Java 6. It aims to provide a consistent API through the Java language itself: no bindings needed, no BSF; all one needs is a "scripting engine". And they claim to support a very long list of languages including awk, python, ruby, groovy, javascript, scheme, php, smalltalk, etc.
          It will be interesting to explore this since:
          1) Support from java platform implies no dependencies on BSF and language bindings jars.
          2) Possibly more performant.
          3) One consistent api for all scripting languages
          4) Longer list of supported languages

          I am currently reading the apis and if I get something to work, will post back here.

          [1] http://www.jcp.org/en/jsr/detail?id=223
          [2] http://java.sun.com/javase/6/docs/api/javax/script/package-summary.html
          [3] https://scripting.dev.java.net/

          Ashutosh Chauhan added a comment -

          I did some quick benchmarking using the BSF approach for UDFs written in Ruby, Python, and Groovy against the native builtin in Pig. It's a standard wordcount example where the udf tokenizes an input string into words. I used the pig sources (src/org/apache/pig) as input, which is more than 210K lines. Since I haven't yet figured out type translation, to be consistent across experiments I passed data as a String argument with return type Object[] in all languages. Following are the numbers I got, averaged over 3 runs:

          Language   Time (seconds)   Factor
          Pig                    17      1
          Ruby                  155      9.1
          Python                178     10.4
          Groovy               1460     85

          This shows the Groovy/BSF combo is super-slow, while Ruby and Python fare much better. These numbers should be seen as an absolute worst case. I believe type translation, compiling the script in the constructor, and using the compiled version instead of evaluating the script in every exec() call will give much better performance. Also, there might exist other optimizations.

          Sometime next week, I will try to repeat the same experiment with javax.script
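The compile-once optimization mentioned above can be sketched in plain Python, with `compile()`/`eval` standing in for a scripting-engine API (the script body is illustrative):

```python
# Contrast of re-evaluating script source on every exec() call with compiling
# once and reusing the code object, as suggested in the comment above.
SOURCE = "line.split()"

def eval_per_call(line):
    # naive: the engine re-parses SOURCE on every invocation
    return eval(SOURCE, {"line": line})

# compiled once (e.g. in the UDF constructor), reused for every call
COMPILED = compile(SOURCE, "<udf>", "eval")

def eval_cached(line):
    return eval(COMPILED, {"line": line})
```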

          Woody Anderson added a comment -

          unpack the file into a directory:
          cd foo;
          tar xvfz scripting.tgz
          mkdata.sh

          time pig -x local tokenize.pig
          time pig -x local js_wc.pig
          time pig -x local pjy_wc.pig

          to run the last one, you'll have to build Code.jar; do this (after installing jython.jar in /tmp):
          mkdir tmp
          scripter --jars '/tmp/jython.jar:spig.jar:pjy.jar:pjs.jar' -c ./Code.jar -w ./tmp/ --javac javac -o pjy_wc.pig pjy_wc.pjy

          Woody Anderson added a comment -

          slight error in the js_wc.js script:
          change line 9 to:
          X = foreach a GENERATE spig_split($0);
          and, if you want schema info in the JS impl, change 'bag' to 'b:{tt:(t:chararray)}' on line 4.

          setenv PIG_HEAPSIZE 2048
          time pig -x local tokenize.pig
          41.724u 2.046s 0:30.52 143.3% 0+0k 0+16io 8pf+0w
          time pig -x local js_wc.pig
          72.079u 2.905s 0:54.50 137.5% 0+0k 0+46io 14pf+0w
          time pig -x local pjy_wc.pig
          41.588u 2.155s 0:33.58 130.2% 0+0k 0+6io 8pf+0w

          so the testing indicates that with this implementation the jython is fairly on par with the java TOKENIZE impl, and js is just shy of twice as slow.

          there are a lot of reasons that the performance of this implementation is startlingly better than the previous numbers, mostly to do with caching the functions, and jython 2.5.1 perhaps being better than whatever python variant was tried above.
          this impl also adheres to the schema system for output data, which does cost some cpu, but is generally not too bad.

          the scripter converter does not have a js handler, but it does convert inlined jython code (anything between @@ jython @@ and a subsequent @@).
          for example (taken from pjy_wc.pjy):

          @@ jython @@
          def split(a):
              """ @return b:{tt:(t:chararray)} """
              return a.split()

          anyway, i'd like to discuss these approaches moving into pig with more out-of-the-box support.
          package: org/apache/pig/scripting is meant to be the harness that i'd like to see as part of pig (or something very like that package)
          packages: org/apache/pig/scripting/js, org/apache/pig/scripting/jython are implementations that i think are pretty useful, but could be improved. distributing these with pig is certainly debatable, esp. as jython requires jython.jar to function, and the js implementation is really just a proof of concept for a second language impl (i didn't even make a FilterFunc yet)

          the scripter functionality is something i'd like to see supported by the pig parser as much as possible, but i don't have a great idea of how to do that yet. perhaps a new statement to allow a user to register a language pack jar would include hooking it into the parser to handle file references etc. as manually handling the dependency graph is a major pita. The creation of a Code jar and the invocation of javac (in particular, this may not be needed) are pretty arduous, so it'd be nice for a general system to make this work.
          I tried to write the script so that you could add new language handlers to it and it would process functions of the form {lang}.{function}(args) and convert appropriately. but i only implemented jython, so the language separation may not be entirely complete, e.g. a language with very different structure may require some other modifications to the script.
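The {lang}.{function}(args) form described above could be recognized with a minimal parser along these lines (a sketch only; the attached scripter's actual grammar may differ):

```python
import re

# Minimal recognizer for {lang}.{function}(args) calls; purely illustrative,
# not the grammar the attached scripter implements.
CALL = re.compile(r"(?P<lang>\w+)\.(?P<func>\w+)\((?P<args>[^)]*)\)")

def parse_call(text):
    # return (language, function name, raw argument text), or None if no match
    m = CALL.match(text)
    return (m.group("lang"), m.group("func"), m.group("args")) if m else None
```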

          i want to close by saying that the initial inspiration for this work and the idea of the pre-process script came from a blog post about a project called baconsnake http://arnab.org/blog/baconsnake, by Arnab Nandi. That post put me on the track of using jython from java code for the first time, and the idea of making the actual script injecting language tolerable. many thanks.

          Woody Anderson added a comment -

          did a bit more classloader work and i removed the need for the rather ugly javac hack.
          so, now the command line is:
          scripter --jars '/tmp/jython.jar:spig.jar:pjy.jar:pjs.jar' -c ./Code.jar -w ./tmp/ -o pjy_wc.pig pjy_wc.pjy

          if https://issues.apache.org/jira/browse/PIG-1242 were accomplished, the code.jar could be omitted in favor of register jython_code.py;, which would be even nicer.

          Ashutosh Chauhan added a comment -

          Hey Woody,

          Great work!! This will definitely be useful for a lot of Pig users. I just hastily looked at your work. One question that struck me: you are doing a lot of heavy lifting to provide multi-language support by figuring out which language the user is asking for and then using reflection to load the appropriate interpreter and such. I think it might be easier to use one of the frameworks here (BSF or javax.script), which hide this and allow handling of multiple languages transparently (at least, that's what they claim to do). Have you taken a look at them? These frameworks would arguably help us support more languages without maintaining a lot of code on our part. Though I am sure they will come at a performance cost (certainly CPU and possibly memory too).

          Woody Anderson added a comment -

          yes, i've looked at both javax.script and BSF, both of which are not well designed for this scenario (in my opinion).
          This comes mostly from their extreme generality and the fact that they do not seem to provide a way to access and subsequently stash a consistent reference to a particular function, aka a pointer.

          This is partly what allows direct use of the jython interpreter to be so fast. Each invocation utilizes a function object directly; it does not have to give a name to an 'engine' which looks up the function and decides the appropriate call context, object context, etc.
          Those things are great, but not if you don't need them.
          Perhaps someone can show me how those systems work much better than i have been able to utilize them, but this approach allows the impl to be agnostic to these frameworks in a way that can boost performance.
          as you may have noticed, the js example uses javax.script, which BSF3 now conforms to. this impl must populate an engine and then use the function name over and over; this involves more function name lookups and is less conducive to lambda functions etc.

          bsf is also extremely easy to integrate under the hood in the same way; it has the same perf costs as javax.script due to the hoop jumping. I tried this out while trying to make perl work, but the perl engine is 6 years old and i was unable to get it to work; the bsf binding part worked well enough though.

          the reflection overhead is pretty minimal, and not really needed if the user writes the code directly (they can simply use the appropriate package directly).
          e.g.
          define spig_println_Tchararray_P1 org.apache.pig.scripting.Eval('js', 'println_Tchararray_P1', 'chararray', 'var println_Tchararray_P1 = function(a0) { println(a0); };');
          vs.
          define spig_println_Tchararray_P1 org.apache.pig.scripting.js.Eval('println_Tchararray_P1', 'chararray', 'var println_Tchararray_P1 = function(a0) { println(a0); };');

          the top level Eval is there simply to allow factory based performance improvements that can be created by knowledgeable implementers.

          if the scriptengine frameworks provided nicer access to functions, and nicer call patterns it would have been nicer to use them.
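The distinction drawn here, stashing a function object versus asking an engine to resolve a name on every call, can be sketched in plain Python (the names are illustrative; Python namespaces stand in for a script engine):

```python
# Engine-style name lookup per call vs. a stashed function reference,
# mirroring the javax.script/BSF critique above.
namespace = {}
exec("def split_words(s):\n    return s.split()", namespace)

def call_by_name(name, arg):
    # engine-style: resolve the function name on every invocation
    return namespace[name](arg)

split_fn = namespace["split_words"]  # resolved once, like a function pointer

def call_direct(arg):
    # direct-binding style: reuse the stashed function object
    return split_fn(arg)
```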

          Prasen Mukherjee added a comment -

          Just curious: could we not implement this along the lines of the DEFINE command? In that case we would let the shell take care of scripting issues, with no need to include scripting-specific jars (Jython etc.). That might require code changes in core Pig and couldn't be implemented as a separate UDF package, though.

          Ashutosh Chauhan added a comment -

          @Prasen

          can we not implement it along the lines of DEFINE commands.

          Yes, this functionality could be partially simulated using a DEFINE/streaming combination, but that may not be the most efficient way to achieve it. First, the streaming script would run in a separate process (as opposed to the same JVM in the approaches discussed above), so there would be CPU cost in moving data between the Java process and the streaming-script process. Then there is the cost of serializing and deserializing parameters: you lose all the type information of the parameters. Once you are in the same runtime you can start doing interesting things. Also, having scripts in DEFINE statements will get kludgy as soon as you start doing complicated things there.

          no need to include scripting-specific jars (jython etc.)

          Do you mean include them in the Pig distribution, or on Pig's classpath at runtime? In either case that is not necessarily a problem. For the first, we can use Ivy to pull the jars instead of including them in the distribution; for the second, we can ship all the jars Pig requires to the compute nodes.

          Ashutosh Chauhan added a comment -

          @Woody

          I agree the frameworks will not be performant. I think their usefulness depends on what we want to achieve. If we want to support many different languages, they might prove useful; if we are only interested in supporting a language or two (Python and Ruby seem to be the most popular), then it won't make sense to pay the overhead associated with them.

          Dmitriy V. Ryaboy added a comment -

          FWIW – I would rather a few languages were supported, and were fast, than support a lot of languages that are all unusably slow. Ten times slower than Pig is in the unusable range, IMO.

          Alan Gates added a comment -

          FWIW - I would rather few languages were supported, and were fast, than support a lot of languages that are all unusably slow. Ten times slower than Pig is in the unusable range, imo.

          +1
          I think if we can get Python going and make it easy to add Ruby, we'll have satisfied 90% of the potential users. I've had a number of people ask me directly if they could program in either of those languages. I've never had anyone say they wish they could write UDFs in Groovy or JavaScript. I think people will pay a 2x cost for Python or Ruby. I don't think they'll pay 10x.

          Woody Anderson added a comment -

          @Ashutosh
          I don't think there is any measurable overhead to the reflection mechanism in the example I provided. The objects are allocated "a few" times due to the schema-interrogation logic of Pig (something that might deserve an entire other bug thread of discussion, as I have no idea why X copies of a UDF have to be allocated for this).
          When it comes time to run (i.e. where it really counts), there is a single invocation of the factory pattern followed by a "huge" (data-set derived) number of calls to that function. The UDF that is called is fully built and fully initialized with final variables etc., facilitating maximally streamlined execution.
          There are certainly things to question about the approach I took, but language-selection overhead is not one of them. If you have profiling numbers that suggest otherwise I'd be suitably surprised.

          A secondary point to the whole idea of needing script-language code beyond, say, BSF or javax.script is type coercion. BSF/javax.script is not usable in a drop-in manner. Each engine unfortunately consumes and produces objects in its own object model. If either of these frameworks had mandated converting input/output to java.util types, things would at least be easier, because we could convert from those to DataBag/Tuple in a unified manner, but this isn't the case. Thus conversion must be implemented per engine, at which point a direct PyArray -> Tuple conversion is more appropriate than PyArray -> List -> Tuple for performance reasons.
          But even for rudimentary correctness, type conversion must be implemented for each engine, at which point a wrapping pattern that selects an appropriate function factory is a necessary pattern anyway.
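          The per-engine coercion pattern described above can be sketched roughly as follows. This is a toy stand-in in plain Python (the converter names and structure are hypothetical, not from any attached patch): each engine registers converters from its own object model, so a Jython engine can map Python lists straight to Pig tuples without a detour through java.util.

```python
# Hypothetical sketch of per-engine type coercion: each engine supplies
# its own converters from its object model to Pig-style values, since
# the frameworks do not mandate a common java.util intermediate form.
def to_pig(value, converters):
    for typ, fn in converters.items():
        if isinstance(value, typ):
            return fn(value)
    return value  # primitives pass through unchanged

# A Jython-style mapping: Python lists become (nested) tuples directly,
# avoiding a PyArray -> List -> Tuple double conversion.
jython_converters = {
    list: lambda xs: tuple(to_pig(x, jython_converters) for x in xs)
}
```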

          @Alan/@Dmitriy
          Orthogonal to the above point: the idea of trying to support multiple script languages vs. a few. I am personally not of the same mind as you guys, I guess.
          I think there is near-zero performance overhead for supporting some unspecified language. Languages continually evolve, and new languages emerge that utilize the JVM better and better. I certainly agree that, at this time, Jython and JRuby seem the best. However, to say that Clojure or JavaScript, or whatever, are not going to move forward and potentially become more effectively integrated with the JVM is a bit premature.

          I would make the sacrifice if the ability to support multiple languages were actually that hard, or had a serious performance cost.
          I just don't think those two issues are real.

          The performance costs come from the individual scripting engines' features with respect to byte-code compilation, function referencing, string manipulation, execution caching, etc., and their type-coercion complexities.
          That is completely different from the cost of Pig supporting multiple languages.
          Also, supporting multiple languages is not that hard. Arnab has thought about this, as have I. I think his ideas, while not perfect, offer a good avenue of exploration that allows integration of Pig with any script language. It (importantly) puts those languages in Pig instead of the other way around, and it allows for multiple interpreter contexts and even multiple languages.

          I'll quote Arnab's quick description here:


          DEFINE CMD `SCRIPTBLOCK` script('javascript')
          This is identical to the commandline streaming syntax, and follows gracefully in the style of the "ship" and "cache" keywords. 

          Thus your javascript example becomes:

          DEFINE JSBlock `
          function split(a) {
              return a.split(" ");
          }
          ` script('JAVASCRIPT');

          Note the use of backticks is consistent with the current syntax, and is unlikely to occur in common scripts, so it saves us the escaping. Also it allows newlines in the code.
          The goal is to create namespaces – you can now call your function as "JSBlock.split(a)". This allows us to have multiple functions in one block.


          This idea, coupled with the ability to register files and directories directly (e.g. register foo.py), provides the ability to load code into an arbitrary namespace/interpreter scope, load it for an arbitrary language, etc.
          And the invocation syntax is nice and clean: Block.foo() calls a method named foo in the interpreter.
          To allow the easy invocation syntax to perform well, we would need to cause it to execute in the same way as:

          define spig_split org.apache.pig.scripting.Eval('jython','split','b:{tt:(t:chararray)}');

          I don't see that as a particularly difficult modification of the function-rationalization logic of Pig. Actually, I think it's a general improvement, as it cuts down on object allocations.

          In the event that this methodology is adopted, you are then still free to write projects that stuff Pig inside Python or Ruby etc. But Pig itself remains an environment that plays well with multiple script engines.

          Conclusion:
          I see it as quite achievable to support any given language with near-zero overhead above the language's ScriptEngine.
          I think it's quite doable to do this in a flexible model that allows languages to be mixed together, even within the same script.
          I think that, overall, this is highly preferable to a single-language or otherwise finite-language situation (though I advocate possibly auto-supporting Jython/JRuby).

          Dmitriy V. Ryaboy added a comment -

          Woody, what I meant by my remark was that I disagree with Ashutosh and agree with you, not that I only want to support Python. If using a framework meant we could support 100 jvm-based languages and your approach meant we could support 2, I'd still go with what actually works.

          By the way, we should adapt this to create a reflection UDF to call out to Java libraries, so we don't have to wrap things like String.split anymore.

          Woody Anderson added a comment -

          Java reflection is very doable; it's kind of a pain, I guess, but you could definitely do it. I think using BeanShell might be a way to use Java syntax if you want to, but Jython and JRuby are also quite good at letting you call Java code very easily and naturally.
          What kind of reflection system are you thinking of? Passing a string as input to some function? Or finding some way to assume you can make certain method calls on the objects that represent various data objects in Pig, e.g. $0.split("."), assuming $0 is a chararray/string?
          Or are you thinking of something that equates to:
          def splitter java.util.regex.Pattern("\.");
          A = foreach B generate splitter.split($0);

          To have it perform at 'peak', you'd need to wrap the reflection into the constructor and cache the java.lang.reflect.Method object.
          It wouldn't be too hard to write (the assumed impl uses constructor args to determine the correct Method via reflection):

          def split org.apache.pig.scripting.Eval('reflect', 'java.util.regex.Pattern', 'split', "\.", 'String', 'b:{tt:(t:chararray)}');
          A = foreach B generate split($0);

          To be more 'generic' but less performant, you could do it more like this (the assumed impl uses less info to simply reflect a particular object):
          def split org.apache.pig.scripting.Eval('reflect', 'java.util.regex.Pattern', 'split', "\.");
          A = foreach B generate split('split', $0);

          The issue here is that each invocation has to determine the correct Method object (after the first it's probably highly cacheable); also, since the method might change as a result of a different name or different args, the lookup might also produce a different output schema. At any rate, I think you could write reasonably performant caching code for this solution, but it'd be more complicated and a tad slower than the former approach.
          Mainly I've tried in all of my impls to do as little as possible in the exec() method, and to make most objects in use final and immutable (e.g. build them all in the constructor).
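          The "resolve once in the constructor, call directly in exec()" pattern above can be sketched in miniature. This is a hypothetical Python stand-in (the `ReflectEval` name is borrowed from the Pig examples in this thread, not an actual class): the method lookup happens once at construction, so the hot path does no per-call name resolution.

```python
import re

# Hypothetical sketch: resolve the target method once at construction
# time and keep a direct callable, so exec() pays no lookup cost.
class ReflectEval:
    def __init__(self, obj, method_name):
        self._fn = getattr(obj, method_name)  # cached "Method" reference

    def exec(self, *args):
        return self._fn(*args)  # direct invocation on the hot path

# Analogue of wrapping java.util.regex.Pattern.split("\.")
split = ReflectEval(re.compile(r"\."), "split")
```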

          You could of course go so far as to delay the creation of the actual Pattern object (i.e. where you first present the split pattern "\."). Again, it lends itself to performance-degrading coding patterns, but if you're careful with your actions, I think you could get most of it back with appropriately cached objects. Doing this in a completely generic fashion... I'll think about it, I guess; I think there's more overhead here than in the other approaches, but if your lib function is more than 'split', the overhead might not be noticeable. Of course, you could implement each of these abstraction levels and use them judiciously.

          Anyway, there are a lot of options here. Are these in line with what you were thinking?

          Julien Le Dem added a comment -

          Hi,
          I'm attaching something I implemented last year. I cleaned it up and updated the dependency to Pig 0.6.0 for the occasion.
          There's probably some overlap with previous posts, sorry about the late submission.
          Here is my approach.
          I wanted to make easier a couple of things:

          • writing programs that require multiple calls to pig
          • UDFs
          • parameter passing to Pig
            So I integrated Pig with Jython so that the whole program (main program, UDFs, Pig scripts) could be in one python script.
            example: python/tc.py in the attachment

          The script defines Python functions that are available as UDFs to pig automatically. The decorator @outputSchema is an easy way to specify what the output schema of the UDF is.
          example (see script): @outputSchema("relationships:{t:(target:chararray, candidate:chararray)}")
          Also notice that the UDFs use the standard Python constructs: tuple, dictionary and list. They are converted to Pig constructs on the fly. This makes the definition of UDFs in Python very easy. Notice that the UDF takes a list of arguments, not a tuple: the input tuple gets automatically mapped to the arguments.
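          A sketch of what such a UDF looks like, with a minimal stand-in for the @outputSchema decorator (the real decorator lives in the attached harness; this stand-in only records the schema string):

```python
# Minimal stand-in for the @outputSchema decorator described above;
# the real implementation is in the attached harness.
def outputSchema(schema):
    def wrap(fn):
        fn.outputSchema = schema
        return fn
    return wrap

@outputSchema("relationships:{t:(target:chararray, candidate:chararray)}")
def related(target, candidate):
    # Input tuple fields map directly onto the arguments; a plain
    # Python list of tuples converts to a Pig bag of tuples on the fly.
    return [(target, candidate)]
```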

          Then the script defines a main() function that will be the main program executed on the client.
          In the main the Python program has access to a global pig variable that provides two methods (for now) and is designed to be an equivalent to PigServer:

          List<ExecJob> executeScript(String script) – execute a pig script in-lined in Python
          deleteFile(String filename) – delete a file

          This looks a little bit like the JDBC approach where you "query" Pig and then can process the data.

          Also, you can embed Python expressions in the Pig statements using ${ ... }, for example ${n - 1}. They get executed in the current scope and substituted into the script.
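          The substitution step can be sketched as follows (a hypothetical helper for illustration; the actual implementation is in the attachment):

```python
import re

# Hypothetical sketch of the ${...} embedding: evaluate each embedded
# Python expression in the given scope and splice the result into the
# Pig statement text.
def substitute(script, scope):
    return re.sub(r"\$\{(.+?)\}",
                  lambda m: str(eval(m.group(1), scope)),
                  script)

substitute("A = LIMIT B ${n - 1};", {"n": 3})  # 'A = LIMIT B 2;'
```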

          To run the example (assuming javac, jar and java are in your PATH):

          • tar xzvf pyg.tgz
          • add pig-0.6.0-core.jar to the lib folder
          • ./makejar.sh
          • ./runme.sh

          It runs the following:
          org.apache.pig.pyg.Pyg local tc.py

          tc.py is a Python script that performs a transitive closure on a list of relations using an iterative algorithm. It defines Python functions.

          Limitations:

          • you cannot include other Python scripts, but this should be doable.
          • I haven't spent much time testing performance. I suspect the Pig<->Python type conversion to be a little slow as it creates many new objects. It could possibly be improved by making the Pig objects implement the Python interfaces.

          (the attachment contains jython.jar 2.5.0 for simplicity)

          Best regards, Julien

          Julien Le Dem added a comment -

          Hi Woody,
          Some comments:

          • Schema parsing:
            I notice that you wrote a Schema parser in EvalBase.
            It took me a while to figure out, but you can do that with the Pig class org.apache.pig.impl.logicalLayer.parser.QueryParser, using the following code:
            QueryParser parser = new QueryParser(new StringReader(schema));
            result = parser.TupleSchema();
            for example:
            String schema = "relationships: {t:(target:chararray, candidate:chararray)}";
            and you get a Schema instance back.

          • Different options for passing the Python code to the hadoop nodes:
            I notice you pass the Python functions by creating a .py file included in the jar, which is then loaded through the class loader.
            I pass the Python code to the nodes by adding it as a parameter of my UDF constructor (encoded in a string). The drawback is that it is verbose, as it gets included for every function.
          Woody Anderson added a comment -

          @Julien
          I have read over your code.

          1. Schema parsing: yup, I much prefer re-using the parser; I wasn't able to find that impl, but should have been more diligent in looking for it.
          2. I love the outputSchema decorator pattern that you use.
          3. Code via a .py file vs. a string literal in the constructor: the .py file is a definite win when dealing with encoding issues (quotes, newlines, etc.). It's also a cleaner way to import larger blocks of code, and it works for Jython files that are used indirectly, etc. The constructor pattern is still supported in my approach; I just use it exclusively for lambda functions.
          4. The pyObjToObj code is simpler in your approach, but limits the integration flexibility, i.e. you explicitly tie tuple:tuple, list:bag. Also, it's not clear how well this would handle sequences and iterators. I personally prefer using the schema to disambiguate the conversion, so that existing Python code can be used to generate bags/tuples etc. via the schema rather than having to convert Python objects using wrapper code.
          5. The outputSchema logic is nice (as I said in #2, I love the decorator thing), but the schema should be cached if it is not a function; if it is a function, the ref should be cached. This is particularly important if you're using the schema to inform the Python -> Pig data coercion.
          6. As I said in previous comments, the scope of the interpreter is important. If you have two different UDFs that you want to share any state (such as counters), a shared interpreter is a good idea. There are also memory gains from sharing. In general, I think you rarely want a distinct interpreter, and as such it should be possible, but not the default.

          Anyway, thanks for attaching the submission, i think there are lots of great ideas in your project. It makes me wish i'd known about it sooner, parsing the pig schema system was not a fun day, though i guess i did learn a bit from it. The decorator thing is lovely. I'll probably borrow those and produce a tighter jython and scripting harness at some point.

          Overall, i'm still firmly in the multi-language camp, but i think this provides nice improvements for a jython impl, and can clearly still swallow whatever language support pig introduces for anyone who wants to drive pig from python. So i think it should still be useful as a standalone project/harness.
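The `@outputSchema` decorator pattern praised above can be sketched roughly as follows. This is a hedged illustration, not the actual Pig/Jython implementation; the attribute name `output_schema` and the decorator shape are assumptions:

```python
# Rough sketch of the @outputSchema decorator idea: the decorator simply
# attaches a schema string (or a schema-computing function) to the Python
# function object, so the Java-side wrapper can look it up at registration
# time. The attribute name is an invented placeholder.
def outputSchema(schema):
    def decorator(func):
        func.output_schema = schema
        return func
    return decorator

@outputSchema("word:chararray")
def to_upper(word):
    return word.upper()
```

The wrapper class would then read `to_upper.output_schema` once (and, per point 5 above, cache it) instead of re-deriving it per call.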

          Julien Le Dem added a comment -

          @Woody

          The main advantage of embedding pig calls in the scripting language is that it enables iterative algorithms, which Pig is not very good at currently. Why would we limit users to UDFs when they can have their whole program in their scripting language of choice?

          4. Python is a very interesting language to integrate with Pig because it has all the same native data structures (tuple:tuple, list:bag, dictionary:map) which makes the UDFs compact and easy to code. That said, in scripting languages that don't match as well as Python to the Pig types, using the schema to disambiguate will be a must have.
          When do we need to convert sequences and iterators? Pig has only tuple, bag and map as complex types AFAIK.
          5. agreed, it should be cached or initialized at the beginning.
          3. and 6. I'll investigate passing the main script through the classpath when I have time. One interpreter would be nice to save memory and initialization time. I'm not sure the shared state is such an advantage, as UDFs should not rely on being run in the same process. Maybe I'm just missing something.

          About the multi-language support: I'm not against it, but there's not that much code to share.
          The scripting<->pig type conversion is specific to each language, as you mentioned. Also, calling functions, getting a list of functions, and defining output schemas will be specific.

          How I see the multilanguage:

          pig local|mapred -script {language} {scriptfile}

          main program:

          • generic: loads the script file
          • generic: makes the script available in the classpath of the tasks (through a jar generated on the fly?)
          • specific: initializes the interpreter for the scripting language
          • specific: adds the global variables defined by pig for the main (in my case: decorators, pig server instance)
          • generic: loads the script in the interpreter
          • specific: figures out the list of functions and registers them automatically as UDFs in PIG using a dedicated UDF wrapper class
          • specific: run the main

          Pig execute call from the script:

          • generic: parse the Pig string to replace ${expression} by the value of the expression as evaluated by the interpreter in the local scope.

          UDF init:

          • generic: loads the script from the classpath
          • specific: initializes the interpreter for the scripting language
          • specific: add the global variables defined by pig for the UDFs (in my case: decorators)
          • generic: loads the script in the interpreter
          • specific: figures out the runtime for the outputSchema: function call or static schema (parsing of schema generic)

          UDF call:

          • specific: convert a pig tuple to a parameter list in the scripting language types
          • specific: call the function with the parameters
          • specific: convert the result to Pig types
          • generic: return the result
          Dmitriy V. Ryaboy added a comment -

          Woody,
          I submitted my attempt at generic Java invocation in PIG-1354. Would appreciate feedback. It's fairly limited (only works for methods that return one of classes that has a Pig equivalent, and takes parameters of the same), but I've already found it quite useful, even in the limited state. Had to break out a separate class for each return type, Pig was giving me trouble otherwise.

          Julien Le Dem added a comment -

          I implemented the modifications mentioned in my previous comment:
          https://issues.apache.org/jira/browse/PIG-928?focusedCommentId=12847986&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12847986

          To run the example (assuming javac, jar and java are in your PATH):

          • tar xzvf pyg.tgz
          • add pig-0.6.0-core.jar to the lib folder
          • ./makejar.sh
          • ./runme.sh

          The python implementation is now decoupled from the generic code.
          The script code is passed through the classpath.
          To implement other scripting languages, extend org.apache.pig.greek.ScriptEngine

          I renamed this thing Pig-Greek

          Julien Le Dem added a comment -

          The attentive reader will have noticed that it should be "tar xzvf pig-greek.tgz" in my previous comment.

          Nicolas Torzec added a comment -

          On the benchmarking side,
          I had a look at the benchmark comparing native Pig built-in functions with UDFs written in Ruby, Python and Groovy using the BSF approach.

          For the sake of comprehensiveness, couldn't we also compare it with Pig streaming through Ruby, Python and Groovy?

          Arnab Nandi added a comment -

          Building on Julien's and Woody's code, this patch provides pluggable scripting support in native Pig.

          ##Syntax:##

          register 'test.py' USING org.apache.pig.scripting.jython.JythonScriptEngine;

          This makes all functions inside test.py available as Pig functions.

          ##Things in this patch: ##

          1. Modifications to parser .jjt file

          2. ScriptEngine abstract class and Jython instantiation.

          3. Ability to ship .py files similar to .jars, loaded on demand.

          4. Input checking and Schema support.

          ##Things NOT in this patch: ##

          1. Inline code support: (Replace 'test.py' with `multiline inline code`, prefer to submit as separate bug)

          2. Scripting engines and examples other than Jython(e.g. beanshell and rhino)

          3. Junit-based test harness (provided as test.zip)

          4. Python<->Pig object transforms are not very efficient (see calltrace.png). I preferred to get the cleaner implementation in first. (Non-obvious optimizations such as object reuse can be introduced as a separate bug.)

          ##Notes: ##

          1. I went with "register" instead of "define" since files can contain multiple functions, similar to .jars. imho this makes more sense, using define would introduce the concept of "codeblock aliases" and function names would look like "alias.functionName()", which is possible but inconsistent since we cannot have "alias2.functionName()" (which would require separate interpreter instances, etc etc).

          2. This has been tested both locally and in mapred mode.

          3. We assume .py files are simply a list of functions. Since the entire file is loaded, you can have dependent functions. No effort is made to resolve imports, though.

          4. You'll need to add jython.jar into classpath, or compile it into pig.jar.

          Would love comments and code-followups!
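For illustration, a hypothetical test.py in the style this patch supports — a plain list of functions, where later functions may depend on earlier ones in the same file (function names and bodies here are made up, not from the patch):

```python
# Hypothetical test.py: plain functions, no imports resolved by Pig.
# greet() depends on normalize(), which works because the whole file is
# loaded into the interpreter at register time.
def normalize(s):
    return s.strip().lower()

def greet(name):
    return "hello, " + normalize(name)
```

After `register 'test.py' USING org.apache.pig.scripting.jython.JythonScriptEngine;` both functions would be callable from Pig Latin, e.g. `b = foreach a generate greet($0);`.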

          Dmitriy V. Ryaboy added a comment -

          I've found that using lazy conversion from objects to tuples can save significant amounts of time when records get later filtered out, only parts of the output used, etc. Perhaps this is something to try if you say pythonToPig is slow?

          Here's what I did with Protocol Buffers: http://github.com/dvryaboy/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/util/ProtobufTuple.java
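The lazy-conversion idea can be sketched as a wrapper that defers per-field conversion until a field is actually read — a sketch of the technique only, not the ProtobufTuple code linked above; all names are invented:

```python
# Sketch of lazy field conversion: keep the raw record and run the
# (possibly expensive) conversion only for fields that are actually
# accessed, caching each converted value so repeated reads are cheap.
# Records that get filtered out never pay the conversion cost.
class LazyTuple(object):
    def __init__(self, raw_fields, convert):
        self._raw = raw_fields
        self._convert = convert
        self._cache = {}

    def get(self, index):
        if index not in self._cache:
            self._cache[index] = self._convert(self._raw[index])
        return self._cache[index]
```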

          Arnab Nandi added a comment -

          Thanks Dmitriy! Lazy objects are a great idea. Note that I'm not saying that pythonToPig is slow per se – it's just the biggest part of the profiler trace, and would be a great place for optimization. I ran some numbers on the patch, and it looks like outside of the runtime instantiation, there is a fairly small performance penalty with the current code (1.2x slower).

          WordCount example from Alan's package.zip:

          Data size Native Jython Factor
          10K 9s 18s 2
          50K 14s 19s 1.35
          500K 54s 64s 1.19

          (Full Data: 8x"War & Peace" from Proj. Gutenberg, 500K lines, 24MB)
          (TOKENIZE was modified to spaces-only, both implementations have identical output)

          Python code:

          @outputSchema("s:{d:(word:chararray)}")
          def tokenize(word):
            if word is not None:
              return word.split(' ')
          
          Ashutosh Chauhan added a comment -

          Arnab,

          Thanks for putting together a patch for this. One question I have is about register vs. define. Currently you are auto-registering all the functions in the script file, and they are then available for later use in the script. But I am not sure how we will handle the case of inlined functions. For inline functions, define seems to be a natural choice, as noted in previous comments on this jira. If so, then we need to modify define to support that use case. I am wondering whether, to remain consistent, we should always use define to define <non-native> functions instead of auto-registering them. I also didn't get why there would be a need for separate interpreter instances in that case.

          Arnab Nandi added a comment -

          Thanks for looking into the patch Ashutosh! Very good question, short answer: I couldn't come up with an elegant solution using define

          I spent a bunch of time thinking about the "right thing to do" before going this way. As Woody mentioned, my initial instinct was to do this in define, but I kept hitting roadblocks when working with define:

          1. I came up with the analogy that "register" is like "import" in java, and "define" is like "alias" in bash. In this interpretation, whenever you want to introduce new code, you register it with Pig. Whenever you want to alias anything for convenience or to add meta-information, you define it.
          2. Define is not amenable to multiple functions in the same script.
            • For example, to follow the stream convention: {define X 'x.py' [inputoutputspec][schemaspec];}. Which function is the input/output spec for? A solution like {[func1():schemaspec1,func2:schemaspec2]} is... ugly.
            • Further, how do we access these functions? One solution is to have the namespace as a codeblock, e.g. X.func1(), which is doable by registering functions as "X.func1", but we're (mis)leading the user to believe there is some sort of real namespacing going on. I foresee multi-function files as a very common use case; people could have a "util.py" with their commonly used suite of functions instead of forcing 1 file per 2-3 line function.
            • Note that Julien's @decorator idea cleanly solves this problem and I think it'll work for all languages.
          3. With inline define, most languages have the convention of declaring function definitions with the function name, input references & return schema spec; it seems redundant to force the user to break this convention and have something like {define x as script('def X(a,b): return a + b;');} and then call x.X(). Lambdas can solve this problem halfway, but then you need to worry about the schema spec and we're back at a kludgy solution!
          4. My plan for inline functions is to write all to a temp file (1 per script engine) and then deal with them as registering a file.
          5. Jython code runs in its own interpreter because I couldn't figure out how to load Jython bytecode into Java, this has something to do with the lack of a jythonc afaik(I may be wrong). There will be one interpreter per non-compilable scriptengine, for others(Janino, Groovy), we load the class directly into the runtime.
          6. From a code-writing perspective, overloading define to tack on a third use case would involve an overhaul of the POStream physical operator and felt very inelegant; register, on the other hand, is well contained to a single purpose – including files for UDFs.
          7. Consider the use of Janino as a ScriptEngine. Unlike the Jython scriptengine, this loads java UDFs into the native runtime and doesn't translate objects; so we're looking at potentially zero loss of performance for inline UDFs (or register 'UDF.java'; ). The difference between native and script code gets blurry here...

          [tl;dr] ...and then I thought fair enough, let's just go with register!

          Julien Le Dem added a comment -

          I like Register better as well.

          With java UDFs, you REGISTER a jar.
          Then you can use the classes in the jar using their fully qualified class name.
          Optionally you can use DEFINE to alias the functions or pass extra initialization parameters.

          with scripting as implemented by Arnab, you REGISTER a script file (adding the script language information as it is not only java anymore) and you can use all the functions in it (just like you do in java).
          Then I would say you should be able to alias them using DEFINE and define a closure by passing extra parameters, DEFINE log2 logn(2, $0); (maybe I am asking too much here)
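On the Python side, the closure idea maps naturally onto functools.partial — a hypothetical sketch of what `DEFINE log2 logn(2, $0);` would bind to (the `logn` function here is invented for illustration):

```python
import math
from functools import partial

def logn(base, x):
    # generic log-to-arbitrary-base UDF
    return math.log(x) / math.log(base)

# Hypothetical Python-side equivalent of: DEFINE log2 logn(2, $0);
# partial() fixes the first parameter, leaving a one-argument function.
log2 = partial(logn, 2)
```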

          Aniket Mokashi added a comment -

          Proposed syntax for the Script UDF registration-

          1. Registration of entire script-
          test.py has helloworld, complex etc.

          register 'test.py' lang python;
          b = foreach a generate helloworld(a.$0), complex(a.$1);
          

          This registers all functions in test.py as pig UDFs.

          Issues- (as per current implementation)
          1. flat namespace- this consumes the UDF namespace. Do we need to have test.py.helloworld?
          2. no way to find signature- We do not verify the signature of helloworld in the front end; the user has no feedback about UDF signatures.
          3. Dependencies- no ship clause.

          Optional command-

          describe 'test.py';
          helloworld{x:chararray};
          complex{i:int};
          

          Changes needed- ScriptEngine needs to have a function that, for a given script and funcSpec, dumps the function signature if the function is present in the script (given its path).
          abstract void dumpFunction(String path, FuncSpec funcSpec, PigContext pigContext);

          2. Registration of single UDF from a script-
          test.py has helloworld which has dependencies in '1.py' and '2.py'.

          define helloworld lang python source 'test.py' ship ('1.py', '2.py');
          OR
          define hello lang python source 'test.py'#helloworld ship ('1.py', '2.py');
          b = foreach a generate helloworld(a.$0);
          

          This registers helloworld (/hello) as pig UDF.

          Also,
          ScriptEngine -> getStandardScriptJarPath() returns path for standard location of jython.jar (user can override this with register jython5.jar etc). We ship this jar if user does not explicitly specify one.
          ScriptEngine.getInstance maps keyword "python" to appropriate ScriptEngine class.

          Attached is initial implementation for register script clause and parse patch has parsing related initial changes for define clause.
          [RegisterPythonUDF2.patch, RegisterScriptUDFDefineParse.patch ]

          Arnab Nandi added a comment -

          > register 'test.py' lang python;

          How does one define an arbitrary "lang"? e.g. I would like to introduce Scala as a UDF engine, preferably as a jar itself. i.e. something like:

          register scalascript.jar;
          register 'test.py' USING scala.Engine();

          Aniket Mokashi added a comment -

          I support the above comment.
          Also, in favor of not breaking old code, I think we should avoid introducing new keywords.

          In the above proposal, by adding python as a lang keyword I meant to hide the extensibility of the ScriptEngine interface by natively supporting python. If we have to allow users to add support for other languages, we need to allow "using org.apache.pig.scripting.jython.JythonScriptEngine". But this would require us to document the ScriptEngine interface.

          Following seems to be more suitable choice. Comments?

          -- register all UDFs inside test.py using custom (or builtin) ScriptEngine
          register 'test.py' using org.apache.pig.scripting.jython.JythonScriptEngine ship ('1.py', '2.py');
          -- namespace? test.helloworld?
          b = foreach a generate helloworld(a.$0), complex(a.$1);
          
          -- register helloworld UDF as hello using JythonScriptEngine
          define hello using org.apache.pig.scripting.jython.JythonScriptEngine from 'test.py'#helloworld ship ('1.py', '2.py');
          b = foreach a generate helloworld(a.$0); 
          

          Also, register scalascript.jar would not be necessary if getStandardScriptJarPath() returns the path of the jar.

          Alan Gates added a comment -

          I propose the following syntax for register:

          REGISTER _filename_ [USING _class_ [AS _namespace_]]
          

          This is backwards compatible with the current version of register.

          The class in the USING clause would need to implement a new interface, ScriptEngine (or something similar), which would be used to interpret the file. If no USING clause is
          given, then it is assumed that filename is a jar. I like this better than the 'lang python' option we had earlier because it allows users to add new engines
          without modifying the parser. We should, however, provide a pre-defined set of scripting engines and names, so that for example python translates to
          org.apache.pig.script.jython.JythonScriptingEngine.

          If the AS clause is not given, then the basename of filename defines the namespace name for all functions defined in that file. This allows us to avoid
          function name clashes. If the AS clause is given, this defines an alternate namespace. This allows us to avoid name clashes for filenames. Functions would
          have to be referenced by full namespace names, though aliases can be given via DEFINE.

          Note that the AS clause is a sub-clause of the USING clause, and cannot be used alone, so there is no ability to give namespaces to jars.

          As far as I can tell there is no need for a SHIP clause in the register. Additional python modules that are needed can be registered. As long as Pig lazily
          searches for functions and does not automatically find every function in every file we register, this will work fine.

          So taken altogether, this would look like the following. Assume we have two python files /home/alan/myfuncs.py

          import mymod
          
          def a():
              ...
          
          def b():
              ...
          

          and /home/bob/myfuncs.py:

          def a():
              ...
          
          def c():
              ...
          

          and the following Pig Latin

          REGISTER /home/alan/myfuncs.py USING python;
          REGISTER /home/alan/mymod.py; -- no need for USING since I won't be looking in here for files, it just has to be moved over
          REGISTER /home/bob/myfuncs.py  USING python AS hisfuncs;
          
          DEFINE b myfuncs.b();
          
          A = LOAD 'mydata' as (x, y, z);
          B = FOREACH A GENERATE myfuncs.a(x), b(y), hisfuncs.a(z);
          ...
          
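          For intuition, the default-vs-AS namespace scheme proposed above can be sketched in plain Python. This is only an illustrative model under stated assumptions: register, FAKE_FILES, and the registry dict are hypothetical names invented here, not Pig APIs.

```python
import os

# Stand-in for the functions Pig would discover in each script file
# (hypothetical data, mirroring the two myfuncs.py files above).
FAKE_FILES = {
    "/home/alan/myfuncs.py": ["a", "b"],
    "/home/bob/myfuncs.py": ["a", "c"],
}

def register(registry, filename, namespace=None):
    """Register every function found in `filename` under a namespace.

    The namespace defaults to the file's basename, as proposed;
    an AS clause would override it to avoid filename clashes."""
    ns = namespace or os.path.splitext(os.path.basename(filename))[0]
    for name in FAKE_FILES[filename]:
        registry[ns + "." + name] = (filename, name)

registry = {}
register(registry, "/home/alan/myfuncs.py")                       # myfuncs.a, myfuncs.b
register(registry, "/home/bob/myfuncs.py", namespace="hisfuncs")  # hisfuncs.a, hisfuncs.c
```

          With this model the two a functions never collide: one resolves as myfuncs.a, the other as hisfuncs.a, matching the FOREACH example above.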
          Julien Le Dem added a comment -

          I like the suggestion. However I would prefer not to use namespaces by default.
          Most likely users will register a few functions and use namespaces only when conflicts happen.
          The shortest syntax should be used for the most common use case.

          most of the time:
          REGISTER /home/alan/myfuncs.py USING python;
          B = FOREACH A GENERATE a;

          when it is needed:
          REGISTER /home/alan/myfuncs.py USING python AS myfuncs;
          B = FOREACH A GENERATE myfuncs.a;

          Also, registering a jar does not prefix classes with the jar name, so that would be inconsistent.
          REGISTER /home/alan/myfuncs.jar;

          Aniket Mokashi added a comment -

          I have attached the patch for proposed changes.

          Few points to note-
          1. As a jar is treated differently (searched in system resources, classloader used, etc.) than other files, we differentiate a jar by its extension.
          2. The namespace default is kept as "" per the above comment; this is implemented as part of the registerFunctions interface of ScriptEngine, so that different engines can have different behavior as necessary.
          3. The keyword python is supported along with custom script engine names.

          Aniket Mokashi added a comment -

          Adding missing scripting files

          Aniket Mokashi added a comment -

          Extension of this jira to track progress for inline script udfs with define clause has been added at https://issues.apache.org/jira/browse/PIG-1471

          Julien Le Dem added a comment -

          I created another extension to discuss the embedding part: https://issues.apache.org/jira/browse/PIG-1479

          Dmitriy V. Ryaboy added a comment -

          Aniket, the patch does not apply cleanly to trunk, can you rebase it?

          Dmitriy V. Ryaboy added a comment -

          I rebased the patch and made it pull jython down via maven. 2.5.1 doesn't appear to be available right now, so this pulls down 2.5.0. Hope that's ok.

          Looks like the tabulation is wrong in most of this patch... someone please hit ctrl-a, ctrl-i next time.

          Needless to say, this thing needs tests, desperately.

          Also imho in order for it to make it into trunk, it should be a compile-time option to support (and pull down) jython or jruby or whatnot, not a default option. Otherwise we are well on our way to making people pull down the internet in order to compile pig.

          Aniket Mokashi added a comment -

          The fix needed some changes in the query parser to support namespaces; I found this through the test cases I added.
          The current EvalFuncSpec logic is convoluted, so I replaced it with a cleaner one.
          I have attached the updated patch with the changes mentioned above.

          I am not sure what needs to be done for jython.jar; my guess was to check it in under /lib. Thoughts?

          Aniket Mokashi added a comment -

          Changes needed for script UDF.
          TODO- jython.jar related changes

          Dmitriy V. Ryaboy added a comment -

          Aniket, I already made the changes you need to pull down jython – take a look at the patch I attached.

          One more general note – let's say jython instead of python (in the grammar, the keywords, everywhere), as there may be slight incompatibilities between the two and we want to be clear on what we are using.

          Aniket Mokashi added a comment -

          I had added a method, getStandardScriptJarPath, to find the path of the jython jar to be shipped as part of job.jar only when the user uses this feature. How do I incorporate this into the new changes?
          Do we want to go for a compile-time support option?

          Julien Le Dem added a comment -

          Aniket, this is assuming the ScriptEngine requires only one jar.
          I would suggest instead having a method ScriptEngine.init(PigContext) that would be called after the ScriptEngine instance has been retrieved from the factory.
          That would let the script engine add whatever is needed to the job.

                  if(scriptingLang != null) {
                      ScriptEngine se = ScriptEngine.getInstance(scriptingLang);
          
                      //pigContext.scriptJars.add(se.getStandardScriptJarPath());
                      se.init(pigContext);
                      se.registerFunctions(path, namespace, pigContext);
                  }
          

          Have a good week end, Julien

          Julien Le Dem added a comment -

          actually, I retract the init() method as it seems this could all happen in registerFunctions()

          public void registerFunctions(String path, String namespace, PigContext pigContext)
          throws IOException {

          pigContext.addJar(JAR_PATH);
          ...

          also I was suggesting this way of automatically figuring out the jar path for a class:

          /**
           * figure out the jar location from the class
           * @param clazz
           * @return the jar file location, null if the class was not loaded from a jar
           */
          protected static String getJar(Class<?> clazz) {
              URL resource = clazz.getClassLoader().getResource(clazz.getCanonicalName().replace(".","/")+".class");
              if (resource.getProtocol().equals("jar")) {
                  return resource.getPath().substring(resource.getPath().indexOf(':')+1,resource.getPath().indexOf('!'));
              }
              return null;
          }

          otherwise the code depends on the path it is run from.

          Julien Le Dem added a comment -

          Argh... Sorry about that

          	/** 
          	 * figure out the jar location from the class 
          	 * @param clazz
          	 * @return the jar file location, null if the class was not loaded from a jar
          	 */
          	protected static String getJar(Class<?> clazz) {
          		URL resource = clazz.getClassLoader().getResource(clazz.getCanonicalName().replace(".","/")+".class");
          		if (resource.getProtocol().equals("jar")) {
          			return resource.getPath().substring(resource.getPath().indexOf(':')+1,resource.getPath().indexOf('!'));
          		}
          		return null;
          	}
          
          Aniket Mokashi added a comment -

          Thanks Dmitriy and Julien for your help.
          Attached is the patch with test cases. The tests passed when run manually.

          Julien Le Dem added a comment -

          ScriptEvalFunc does not do much anymore, so I would suggest removing it.
          If we want to keep it to add shared code in the future, then remove its constructor, as it forces the schema to be fixed.
          The output schema may depend on the input schema in some cases.

          public abstract class ScriptEvalFunc extends EvalFunc<Object> {
              /**
               * Stub constructor to guide derived classes 
               * Avoids extra reference on exec()
               * @param fileName
               * @param functionName
               * @param numArgs
               * @param schema
               */
              public ScriptEvalFunc(String fileName, String functionName, String numArgs, String schema) {
              }
          
              @Override
              public abstract Object exec(Tuple tuple) throws IOException;
          
              @Override
              public abstract Schema outputSchema(Schema input);
          
          }
          

          As a side note, my original posting (see pig-greek.tgz) had a second decorator to handle that. You would provide the name of the function to compute the output schema from the input schema:

          @outputSchemaFunction("fooOutputSchema")
          def foo(someParameter):
             ...
          
          def fooOutputSchema(inputSchema):
             ...
          
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12448831/RegisterPythonUDFFinale3.patch
          against trunk revision 960062.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          -1 javac. The applied patch generated 146 javac compiler warnings (more than the trunk's current 145 warnings).

          -1 findbugs. The patch appears to introduce 4 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/340/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/340/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/340/console

          This message is automatically generated.

          Aniket Mokashi added a comment -

          I see what you mean; if a user needs a generic square function, they can write:

          #!/usr/bin/python
          @outputSchemaFunction("squareSchema")
          def square(number):
              return (number * number)
          def squareSchema(input):
              return input
          

          I will make changes so that I can use a similar approach as pig-greek. Since the output schema needs to know both the input and the name of the outputSchemaFunction, the current code will need further changes.
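          A minimal pure-Python sketch of how such a decorator could work. The attribute name used here is an assumption for illustration; the real plumbing that reads it would live in JythonScriptEngine and is omitted.

```python
def outputSchemaFunction(schema_fn_name):
    # Record, on the UDF itself, the name of the function that
    # computes its output schema from the input schema.
    def wrap(fn):
        fn.outputSchemaFunction = schema_fn_name
        return fn
    return wrap

@outputSchemaFunction("squareSchema")
def square(number):
    return number * number

def squareSchema(input_schema):
    # Identity: squaring an int yields an int, a double a double, etc.
    return input_schema
```

          The decorator only attaches metadata; the engine would look up square.outputSchemaFunction and call squareSchema(inputSchema) when deriving the UDF's schema, falling back to bytearray when no decorator is present.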

          Aniket Mokashi added a comment -

          Added support for a decorator, outputSchemaFunction, that points to a function which defines the schema for the UDF.
          Also, for a function with no decorator, the schema is assumed to be databytearray.

          Aniket Mokashi added a comment -

          I have uploaded a wiki page describing the usage and syntax: http://wiki.apache.org/pig/UDFsUsingScriptingLanguages.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12449018/RegisterPythonUDF_Final.patch
          against trunk revision 960062.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 146 javac compiler warnings (more than the trunk's current 145 warnings).

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/364/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/364/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/364/console

          This message is automatically generated.

          Aniket Mokashi added a comment -

          Fixed @@@ related stuff...
          Parsing of schema from decorators is postponed until the constructor.
          Fixed some test-related changes.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12449105/RegisterPythonUDFFinale4.patch
          against trunk revision 962628.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/365/console

          This message is automatically generated.

          Aniket Mokashi added a comment -

          Rebased version of Finale4

          Alan Gates added a comment -

          ScriptEngine is a new public interface for Pig once we commit this patch. We need to declare it as public and set its stability level (evolving, I'm guessing, since it's new, but I'm open to arguments for other levels). See PIG-1311 for info on how to do this.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12449134/RegisterPythonUDFFinale5.patch
          against trunk revision 963504.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 145 javac compiler warnings (more than the trunk's current 144 warnings).

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/344/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/344/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/344/console

          This message is automatically generated.

          Ashutosh Chauhan added a comment -
          • Do you want to allow: register myJavaUDFs.jar using 'java' as 'javaNameSpace'? The use case would be that if we are allowing namespaces for non-Java UDFs, why not allow them for Java UDFs as well. But then define exists exactly for this purpose, so it may make sense to throw an exception for such a case.
          • In ScriptEngine.getJarPath(), shouldn't you throw a FileNotFoundException instead of returning null?
          • Don't gobble up checked exceptions and then rethrow RuntimeExceptions. Throw checked exceptions if you need to.
          • ScriptEngine.getInstance() should be a singleton, no?
          • In JythonScriptEngine.getFunction(), I think you should check if interpreter.get(functionName) != null and then return it, and call Interpreter.init(path) only if it's null.
          • In JythonUtils, for doing type conversion you should make use of both input and output schemas (whenever they are available) and avoid doing reflection for every element. You can get hold of the input schema through outputSchema() of EvalFunc and then do the UDFContext magic to use it. If schema == null || schema == bytearray, you need to resort to reflection. Similarly, if the output schema is available via decorators, use it to do type conversions.
          • In JythonUtils.pythonToPig(), in the Tuple case you first create an Object[] and then call Arrays.asList(); you can directly create a List<Object> and avoid the unnecessary casting. In the same method, you are only checking for long; don't you need to check for int, String, etc. and then cast appropriately? Also, in the default case I think we can't let the object pass through as-is using Object.class; it could be an object of any type and may cause cryptic errors in the pipeline if let through. We should throw an exception if we don't know what type of object it is. A similar argument applies to the default case of pigToPython().
          • I didn't get why the changes are required in POUserFunc. Can you explain, and also add the explanation as comments in the code?

          Testing:

          • This is a big enough feature to warrant its own test file, so consider adding a new test file (maybe TestNonJavaUDF). Additionally, we see frequent timeouts on TestEvalPipeline; we don't want it to run any longer.
          • Instead of adding the query through the pigServer.registerCode() API, add it through pigServer.registerQuery(register myscript.py using "jython"). This will make sure we are testing the changes in QueryParser.jjt as well.
          • Add more tests. Specifically, cover complex types passed to the UDFs (like bags) and returning a bag. You can get bags after doing a group-by. You can also take a look at Julien's original patch, which contained a Python script; those were, I think, at the right level of complexity to be added as test cases in our junit tests.

          Nit-picks:

          • Unnecessary import in JythonFunction.java
          • In PigContext.java, you are using Vector and LinkedList instead of the usual ArrayList. Any particular reason for it? Just curious.
          • More documentation (in QueryParser.jjt, ScriptEngine, and JythonScriptEngine, specifically for outputSchema, outputSchemaFunction, and schemafunction).
          • Also keep an eye on the recent "mavenization" efforts in Pig; depending on when that gets checked in, you may (or may not) need to make changes to ivy.
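          The conversion points raised above (building a List directly, strict handling of unknown types, and the later question about nil values) can be sketched in plain Python. This is an illustrative model of pythonToPig-style conversion only, not the patch's actual Jython-based code; the function name, the exact type mapping, and the tuple-to-list choice are assumptions for illustration.

```python
# Illustrative model of pythonToPig-style conversion: build lists directly
# instead of going through an interim array, map known Python types to
# Pig-friendly values, pass None (nil) through as null, and fail loudly on
# anything unrecognized instead of letting it into the pipeline.
def python_to_pig(value):
    if value is None:
        return None  # a nil from the UDF stays null in Pig
    if isinstance(value, tuple):
        # Pig tuple: convert elements into a list directly
        return [python_to_pig(v) for v in value]
    if isinstance(value, dict):
        # Pig map: convert each value recursively
        return {k: python_to_pig(v) for k, v in value.items()}
    if isinstance(value, (int, float, str, bytes)):
        return value  # scalar types pass through unchanged
    # Unknown type: throw rather than let a cryptic error surface downstream
    raise TypeError("unsupported type for Pig pipeline: %r" % type(value))
```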
          Aniket Mokashi added a comment -

          Thanks for your comments. I will make the required changes.

          Do you want to allow: register myJavaUDFs.jar using 'java' as 'javaNameSpace' ? Use-case could be that if we are allowing namespaces for non-java, why not allow for Java udfs as well. But then define is exactly for this purpose. So, it may make sense to throw exception for such a case.

          myJavaUDFs.jar can itself have a package structure that defines its own namespace; for example, maths.jar has the function math.sin, etc. I will throw a ParseException for such a case.

          ScriptEngine.getInstance() should be a singleton, no?

          getInstance is a factory method that returns an instance of ScriptEngine based on its type. We create a new instance of the ScriptEngine so that if registerCode is called simultaneously, we can create a different interpreter for each invocation to register these scripts with Pig.

          In JythonScriptEngine.getFunction() I think you should check if interpreter.get(functionName) != null and then return it and call Interpreter.init(path) only if its null.

          This behavior is consistent with the interpreter.get method, which returns null if some resource is not found inside the script. Callers of this function handle RuntimeExceptions. Also, we will fail much earlier if we try to access functions that are not already present/registered, so it should be safe.
          Also, the interpreter is never null because it is a static member of JythonScriptEngine, instantiated statically.

          I didn't get why the changes are required in POUserFunc. Can you explain and also add it as comments in the code.

          POUserFunc has a possible bug: it checks res.result != null when it is always null at this point. If the expected return type is bytearray, we cast the returned object to byte[] with toString().getBytes() (which was never hit due to the bug mentioned above), but when the return type is byte[] we need special handling (this is not the case for other EvalFuncs, as they generally return Pig types).

          Instead of adding query through pigServer.registerCode() api, add it through pigServer.registerQuery(register myscript.py using "jython"). This will make sure we are testing changes in QueryParser.jjt as well.

          register is a Grunt command parsed by the GruntParser and hence doesn't go through the QueryParser. We directly call registerCode from GruntParser. Also, the parsing logic is trivial.
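          For reference, the register syntax under discussion looks like this in a Pig script; the file name, namespace, and field names below are hypothetical:

```pig
register 'myscript.py' using jython as myfuncs;
a = load 'data' as (word:chararray);
b = foreach a generate myfuncs.word_length(word);
```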

          Aniket Mokashi added a comment -

          Commenting on the behavior of EvalFunc<Object>, consider the following UDF:

          public class UDF1 extends EvalFunc<Object> {
              class Student{
                  int age;
                  String name;
                  Student(int a, String nm) {
                      age = a;
                      name = nm;
                  }
              }
              @Override
              public Object exec(Tuple input) throws IOException {
                  return new Student(12, (String)input.get(0));
              }
              @Override
              public Schema outputSchema(Schema input) {
                  return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
              }
          }
          

          Although this UDF defines its output schema as bytearray, we fail it because we do not know how to deserialize Student. Clearly, this is due to the bug in POUserFunc, which fails to convert to bytearray. Hence, the res.result != null check should be changed to result.result != null.

          Aniket Mokashi added a comment -

          Added new test cases covering tuple and bag scenarios; moved them to a new test file.
          Fixed the exception handling.
          Added detailed comments.

          Ashutosh Chauhan added a comment -

          Thanks, Aniket, for making those changes. It's getting closer.

          • I am still not convinced about the changes required in POUserFunc. That logic should really be part of pythonToPig(pyObject). If a Python UDF is returning byte[], it should be turned into a DataByteArray before it gets back into Pig's pipeline. And if we do that conversion in pythonToPig() (which is the right place to do it), we will need no changes in POUserFunc.
          • As I suggested in my previous comment, in the same method you should avoid first creating an array and then turning that array into a list; you can rather create a list upfront and use it.
          • Instead of instanceof, doing a class equality test will be a wee bit faster: instead of (pyObject instanceof PyDictionary), do pyObject.getClass() == PyDictionary.class. Obviously, this only works when you know the exact target class, and not for derived ones.
          • parseSchema(String schema) already exists in the org.apache.pig.impl.util.Utils class, so there is no need for it in ScriptEngine.
          • For the register command, we need to test not only for functionality but for regressions as well. Look at TestGrunt.java in the test package to get an idea of how to write a test for it.
          Ashutosh Chauhan added a comment -

          Addendum:

          • Also, what will happen if the user returns a nil Python object (the null equivalent in Java) from a UDF? It looks to me like that will result in an NPE. Can you add a test for that, and a similar test case for pigToPython()?
          Aniket Mokashi added a comment -

          I am still not convinced about the changes required in POUserFunc. That logic should really be a part of pythonToPig(pyObject). If python UDF is returning byte[], it should be turned into DataByteArray before it gets back into Pig's pipeline. And if we do that conversion in pythonToPig() (which is a right place to do it) we will need no changes in POUserFunc.

          I agree that it is better to move the computation to the JythonFunction side (JythonUtils) for type checking, which should provide more type safety and avoid the complexity of user-defined types. But I would still go for the changes in POUserFunc for result.result, for the case defined in the example above (removing the byte[] scenario).

          Instead of instanceof, doing class equality test will be a wee-bit faster. Like instead of (pyObject instanceof PyDictionary) do pyobject.getClass() == PyDictionary.class. Obviously, it will work when you know exact target class and not for the derived ones.

          Jython has derived classes for each of the basic Jython types; though they aren't used for most of the types as of now, Jython may start returning these derived objects (e.g. PyTupleDerived) in a future implementation, in which case we might break our code. Also, PyLongDerived is already used inside the code. The __tojava__ function just returns a proxy Java object until we ask for a specific type of object. I think it's better to use instanceof instead of class equality here.

          For register command, we need to test not only for functionality but for regressions as well. Look at TestGrunt.java in test package to get an idea how to write test for it.

          The code path for .jar registration is identical to the old code, except that it doesn't "use" any engine or namespace.

          Also what will happen if user returned a nil python object (null equivalent of Java) from UDF. It looks to me that will result in NPE. Can you add a test for that and similar test case from pigToPython()

          A Java null object will be turned into a PyNone object, but the __tojava__ function returns the special object Py.NoConversion if the PyObject cannot be converted to the desired Java class.

          Aniket Mokashi added a comment -

          Added tests for the map UDF, null input/output, and grunt.
          Made the required changes as per the suggestions.

          Daniel Dai added a comment -

          Patch committed. Thanks Aniket!


            People

            • Assignee:
              Aniket Mokashi
              Reporter:
              Alan Gates
            • Votes:
              0
              Watchers:
              14
