Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0, 0.11
    • Component/s: None
    • Labels:
    • Patch Info:
      Patch Available
    • Hadoop Flags:
      Reviewed

      Description

      It should be possible to write UDFs in Ruby. These UDFs will be registered in the same way as python and javascript UDFs.

      1. jruby_scripting_2_real.patch
        84 kB
        Jonathan Coveney
      2. jruby_scripting_3.patch
        107 kB
        Jonathan Coveney
      3. jruby_scripting_4.patch
        105 kB
        Jonathan Coveney
      4. jruby_scripting_5.patch
        101 kB
        Jonathan Coveney
      5. jruby_scripting_6.patch
        119 kB
        Jonathan Coveney
      6. jruby_scripting_7.patch
        130 kB
        Alan Gates
      7. jruby_scripting.patch
        29 kB
        Jacob Perkins
      8. PIG-2317_13_0.10.patch
        20 kB
        Jonathan Coveney
      9. PIG-2317_14_0.10.patch
        21 kB
        Jonathan Coveney
      10. PIG-2317_14.patch
        193 kB
        Jonathan Coveney
      11. PIG-2317_15.patch
        190 kB
        Daniel Dai
      12. PIG-2317_16.10.patch
        192 kB
        Daniel Dai
      13. PIG-2317_16.patch
        192 kB
        Daniel Dai
      14. PIG-2317-10_plus.patch
        1 kB
        Daniel Dai
      15. PIG-2317-10.patch
        188 kB
        Jonathan Coveney
      16. PIG-2317-11.patch
        189 kB
        Jonathan Coveney
      17. PIG-2317-12.patch
        191 kB
        Jonathan Coveney
      18. PIG-2317-13.patch
        191 kB
        Jonathan Coveney
      19. PIG-2317-13.patch
        191 kB
        Jonathan Coveney
      20. PIG-2317-13.patch
        191 kB
        Jonathan Coveney
      21. PIG-2317-8_plus.patch
        6 kB
        Daniel Dai
      22. PIG-2317-8.patch
        93 kB
        Jonathan Coveney
      23. PIG-2317-9.patch
        140 kB
        Jonathan Coveney
      24. pigjruby.rb
        1 kB
        Jonathan Coveney
      25. pigjruby.rb
        1 kB
        Jonathan Coveney
      26. pigjruby.rb
        2 kB
        Jonathan Coveney
      27. pigudf.rb
        3 kB
        Jonathan Coveney
      28. PigUdf.rb
        3 kB
        Jonathan Coveney
      29. PigUdf.rb
        3 kB
        Jonathan Coveney
      30. recursiveinclusion.tar.gz
        0.9 kB
        Daniel Dai

        Issue Links

          Activity

          Hide
          Jacob Perkins added a comment -

          Here's the current way the patch works.

          myudfs.rb
          hello_world_outputschema = "str:chararray"
          def hello_world
            "hello world"
          end
          
          complex_outputschema = "word:chararray,num:long"
          def complex(word)
            [word,word.length]
          end
          

          There really aren't many nice options for attaching metadata (output schema information) to a function in ruby so I went with the above convention.

          And from Pig:

          foo.pig
          register 'myudfs.rb' using jruby as myudfs;
          A = LOAD 'somedata.tsv' AS (word:chararray);
          B = FOREACH A GENERATE myudfs.hello_world(), myudfs.complex(word);
          
          Show
          Jacob Perkins added a comment - Here's the current way the patch works. myudfs.rb hello_world_outputschema = "str:chararray" def hello_world "hello world" end complex_outputschema = "word:chararray,num: long " def complex(word) [word,word.length] end There really aren't many nice options for attaching metadata (output schema information) to a function in ruby so I went with the above convention. And from Pig: foo.pig register 'myudfs.rb' using jruby as myudfs; A = LOAD 'somedata.tsv' AS (word:chararray); B = FOREACH A GENERATE myudfs.hello_world(), myudfs.complex(word);
          Hide
          Alan Gates added a comment -

          I'm reviewing this patch.

          Show
          Alan Gates added a comment - I'm reviewing this patch.
          Hide
          Alan Gates added a comment -

          I haven't finished testing this, but it looks good after my initial review.

          One issue I found is that JrubyScriptEngine.run should throw UnsupportedMethodException. This method is for when embedding Pig in a scripting language is supported, which your patch doesn't do.

          Show
          Alan Gates added a comment - I haven't finished testing this, but it looks good after my initial review. One issue I found is that JrubyScriptEngine.run should throw UnsupportedMethodException. This method is for when embedding Pig in a scripting language is supported, which your patch doesn't do.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Awesome!

          I'm confused by JRuby's 3-way licensing. Is it ok to bundle this with Pig?

          Show
          Dmitriy V. Ryaboy added a comment - Awesome! I'm confused by JRuby's 3-way licensing. Is it ok to bundle this with Pig?
          Hide
          Jonathan Coveney added a comment -

          As far as how to specify schema, another option would be to have a class function "schema" which sets the schema for the following function definition.

          schema "str:chararray"
          def hello_world
          "hello world"
          end

          schema "word:chararray,num:long"
          def complex(word)
          [word,word.length]
          end

          or even (the probably easier to code, and both could be support)

          def hello_world
          "hello world"
          end

          def complex(word)
          [word,word.length]
          end

          schema :hello_world, "str:chararray"
          schema :complex, "word:chararray,num:long"

          I don't think this would be super hard, I'd be willing to help make it work if you guys think it'd be easier or more elegant

          Show
          Jonathan Coveney added a comment - As far as how to specify schema, another option would be to have a class function "schema" which sets the schema for the following function definition. schema "str:chararray" def hello_world "hello world" end schema "word:chararray,num:long" def complex(word) [word,word.length] end or even (the probably easier to code, and both could be support) def hello_world "hello world" end def complex(word) [word,word.length] end schema :hello_world, "str:chararray" schema :complex, "word:chararray,num:long" I don't think this would be super hard, I'd be willing to help make it work if you guys think it'd be easier or more elegant
          Hide
          Jonathan Coveney added a comment -

          Woops, formatting got whacked.

          schema "str:chararray"
          def hello_world
            "hello world"
          end
          
          schema "word:chararray,num:long"
          def complex(word)
            [word,word.length]
          end
          
          def hello_world
            "hello world"
          end
          
          def complex(word)
            [word,word.length]
          end
          
          schema :hello_world, "str:chararray"
          schema :complex, "word:chararray,num:long"
          
          Show
          Jonathan Coveney added a comment - Woops, formatting got whacked. schema "str:chararray" def hello_world "hello world" end schema "word:chararray,num: long " def complex(word) [word,word.length] end def hello_world "hello world" end def complex(word) [word,word.length] end schema :hello_world, "str:chararray" schema :complex, "word:chararray,num: long "
          Hide
          Jonathan Coveney added a comment -

          Here is a modified version of the Jruby patch. It builds off of the excellent work Jacob did, but simplified greatly the function registration and the method invocation.

          There are two supported forms, which can be mix and matched. They generally mirror the python form.

          outputSchema "word:chararray"
          def helloworld
            return 'Hello, World'
          end
          
          outputSchema "word:chararray,num:long"
          def complex word
            return [word.to_s,word.length]
          end
          
          outputSchemaFunction :squareSchema
          def square num
            return num**2
          end
          
          def squareSchema input
            return input
          end
          

          and

          def helloworld
            return 'Hello, World'
          end
          
          def complex word
            return [word.to_s,word.length]
          end
          
          def square num
            return num**2
          end
          
          def squareSchema input
            return input
          end
          
          outputSchema :helloworld, "word:chararray"
          outputSchema :complex, "word:chararray,num:long"
          outputSchemaFunction :square, :squareSchema
          

          There is one key difference, though, which is that this version only registers functions with an explicit schema. While it does mean that you need a schema for every function, I think it more cleanly allows for people to write scripts where there are helper functions.

          Please let me know what you think. Oh, I also split out the JavaScript, Python, and Ruby scripting tests into different files...the ruby and python ones take long enough and are substantial enough that it is annoying to have to run them all at once to test a change just to the ruby implementation.

          Show
          Jonathan Coveney added a comment - Here is a modified version of the Jruby patch. It builds off of the excellent work Jacob did, but simplified greatly the function registration and the method invocation. There are two supported forms, which can be mix and matched. They generally mirror the python form. outputSchema "word:chararray" def helloworld return 'Hello, World' end outputSchema "word:chararray,num: long " def complex word return [word.to_s,word.length] end outputSchemaFunction :squareSchema def square num return num**2 end def squareSchema input return input end and def helloworld return 'Hello, World' end def complex word return [word.to_s,word.length] end def square num return num**2 end def squareSchema input return input end outputSchema :helloworld, "word:chararray" outputSchema :complex, "word:chararray,num: long " outputSchemaFunction :square, :squareSchema There is one key difference, though, which is that this version only registers functions with an explicit schema. While it does mean that you need a schema for every function, I think it more cleanly allows for people to write scripts where there are helper functions. Please let me know what you think. Oh, I also split out the JavaScript, Python, and Ruby scripting tests into different files...the ruby and python ones take long enough and are substantial enough that it is annoying to have to run them all at once to test a change just to the ruby implementation.
          Hide
          Jonathan Coveney added a comment -

          Woops, git has spoiled me. This should be the real patch.

          Show
          Jonathan Coveney added a comment - Woops, git has spoiled me. This should be the real patch.
          Hide
          Jacob Perkins added a comment -

          Looks like the patch doesn't apply to branch 0.9 anymore. PigContext no longer has a method 'addScriptFile' that takes two arguments. This breaks the way required libraries are shipped with the job jar.

          Also, @Jonathan, Awesome work! Good to have someone with a bit more java experience to pick up where I left off

          The one minor concern I have is that with your patch a ruby script containing udfs can't be evaluated outside the context of the script engine. That is, I cant interpret it on the command line without adding the module and class definition that you define in JrubyScriptEngine.java. It's pretty minor but it sure would be nice to be able to test and debug a udf completely independently of Pig itself.

          Thoughts?

          Show
          Jacob Perkins added a comment - Looks like the patch doesn't apply to branch 0.9 anymore. PigContext no longer has a method 'addScriptFile' that takes two arguments. This breaks the way required libraries are shipped with the job jar. Also, @Jonathan, Awesome work! Good to have someone with a bit more java experience to pick up where I left off The one minor concern I have is that with your patch a ruby script containing udfs can't be evaluated outside the context of the script engine. That is, I cant interpret it on the command line without adding the module and class definition that you define in JrubyScriptEngine.java. It's pretty minor but it sure would be nice to be able to test and debug a udf completely independently of Pig itself. Thoughts?
          Hide
          Jonathan Coveney added a comment -

          My pleasure to help

          Hmm, I think it's worth seeing if we can do this without breaking 0.9 compatibility...a lot of shops aren't even on 0.9 yet, so requiring 0.10 is gonna severely limiting potential users.

          I see your issue on the ruby script end and agree...and I think there are a couple of solutions.

          1) we explicitly make people extend a class in order to write a UDF. Then, we could just give them a gem that defines all of this stuff, and they can require 'pigudf' or whatever.
          2) we provide a ruby function which does the wrapping for the user, and then they can run it... so you type "gimmeruby myudfs.rb"

          The second approach is much simpler and would still let people test things independent of pig. The first approach seems to me the "right" approach. Also, it'd have the nice side effect that you could easily extend it to manipulate pig objects on the ruby end more easily. For example: right now, while technically you could create your own schema objects in jruby...good luck (all of the scripting UDFs suffer from this). It'd be pretty nasty, because you'd be creating a bunch of java objects and whatnot. But we could easily just extend it in the gem and bam, provide utilities like that...

          Anything else pinging around your brain?

          Show
          Jonathan Coveney added a comment - My pleasure to help Hmm, I think it's worth seeing if we can do this without breaking 0.9 compatibility...a lot of shops aren't even on 0.9 yet, so requiring 0.10 is gonna severely limiting potential users. I see your issue on the ruby script end and agree...and I think there are a couple of solutions. 1) we explicitly make people extend a class in order to write a UDF. Then, we could just give them a gem that defines all of this stuff, and they can require 'pigudf' or whatever. 2) we provide a ruby function which does the wrapping for the user, and then they can run it... so you type "gimmeruby myudfs.rb" The second approach is much simpler and would still let people test things independent of pig. The first approach seems to me the "right" approach. Also, it'd have the nice side effect that you could easily extend it to manipulate pig objects on the ruby end more easily. For example: right now, while technically you could create your own schema objects in jruby...good luck (all of the scripting UDFs suffer from this). It'd be pretty nasty, because you'd be creating a bunch of java objects and whatnot. But we could easily just extend it in the gem and bam, provide utilities like that... Anything else pinging around your brain?
          Hide
          Jacob Perkins added a comment -

          I like the idea of a gem, especially from the added utilities standpoint. The one thing I'd like to avoid with a gem though is creating a dsl and instead keeping the syntax as obvious as possible. Extending a class feels right:

          concat.rb
          require 'pigudf'
          
          class Concat < PigUdf
            def output_schema
              "str:chararray"
            end
          
            def eval str
              str + str
            end
          end
          

          but leads to lots of extra lines of code when all I really wanted to do was concat two strings...

          I would rather see something like this:

          concat.rb
          require 'pigudf'
          
          Concat = PigUdf.new("str:chararray") do |str|
            str + str
          end
          

          Which is a dead simple class in ruby:

          pigudf.rb
          class PigUdf  
            def self.new output_schema, &blk
              Class.new do
                @@output_schema = output_schema
                @@blk = blk
          
                def output_schema
                  @@output_schema
                end
                
                def eval *args
                  @@blk.call(*args)
                end      
              end    
            end
          end
          

          I also suspect that, with everything wrapped in a class, the java side of things becomes even simpler. Thoughts?

          Show
          Jacob Perkins added a comment - I like the idea of a gem, especially from the added utilities standpoint. The one thing I'd like to avoid with a gem though is creating a dsl and instead keeping the syntax as obvious as possible. Extending a class feels right: concat.rb require 'pigudf' class Concat < PigUdf def output_schema "str:chararray" end def eval str str + str end end but leads to lots of extra lines of code when all I really wanted to do was concat two strings... I would rather see something like this: concat.rb require 'pigudf' Concat = PigUdf. new ( "str:chararray" ) do |str| str + str end Which is a dead simple class in ruby: pigudf.rb class PigUdf def self. new output_schema, &blk Class . new do @@output_schema = output_schema @@blk = blk def output_schema @@output_schema end def eval *args @@blk.call(*args) end end end end I also suspect that, with everything wrapped in a class, the java side of things becomes even simpler. Thoughts?
          Hide
          Jonathan Coveney added a comment -

          I think that this looks really excellent. Very ruby. I think we should allow the simpler syntax, but then also allow a more robust syntax for people who want it.

          Wrapping things in a class should simplify things a lot, especially because we can make helper methods and whatnot to facilitate everything.

          I'd also like to have the more robust syntax for people who want it, something like

          class MyUdfs < PigUdf
            outputSchema "str:chararray"
            def helloworld
              "Hello World"
            end
          
            filterFunc
            def amihappy
              true
            end
          end
          

          also it could support

          NoImNot = PigUdf.filterFunc do |x|
            return !x
          end
          

          The benefit of the latter being brevity, the benefit of the former being that you can have easier helper functions etc.

          also, and this is something the other scripting languages haven't tackled but I think we could do it pretty easily, and it would make it so that serious people could make efficient udfs

          class Count < AlgPigUdf
            def initial t
            end
          
            def intermed t
            end
          
            def final t
            end
          end
          

          another syntax

          Count = PigUdf.algebraic do |udf|
            udf.initial :func1
            udf.intermed :func2
            udf.final :func3
          end
          

          Now, one issue that comes up here that didn't come up otherwise was the fact that in the previous example, we knew exactly what the class was called (since we wrapped their code in it). Now we don't. However, we are having them require the gem, so we'd ideally want a way to know the classes that they defined, and other fun. We could of course do the same trick of wrapping them in a class, but then you have the same issue, so ideally it'd be something all ruby, where after running all of these things, there are functions defined in the gem that give you information on the classes, how to invoke them, etc (ie all the stuff pig needs, but conveniently exposed in ruby so that the user can manipulate it as well). But my ideal would being able to do something like

          PigUdf.getClasses //returns a list of the defined classes
          PigUdf.getFunctions(className) //returns a dictionary of the defined functions, and their schema
          

          And more, who knows.

          If you can think of a more ruby-esque syntax that allows this (I'm sure you can , do tell. It'd be easy for PigUdf.new, but different for the subclasses.

          Love your thoughts.

          Show
          Jonathan Coveney added a comment - I think that this looks really excellent. Very ruby. I think we should allow the simpler syntax, but then also allow a more robust syntax for people who want it. Wrapping things in a class should simplify things a lot, especially because we can make helper methods and whatnot to facilitate everything. I'd also like to have the more robust syntax for people who want it, something like class MyUdfs < PigUdf outputSchema "str:chararray" def helloworld "Hello World" end filterFunc def amihappy true end end also it could support NoImNot = PigUdf.filterFunc do |x| return !x end The benefit of the latter being brevity, the benefit of the former being that you can have easier helper functions etc. also, and this is something the other scripting languages haven't tackled but I think we could do it pretty easily, and it would make it so that serious people could make efficient udfs class Count < AlgPigUdf def initial t end def intermed t end def final t end end another syntax Count = PigUdf.algebraic do |udf| udf.initial :func1 udf.intermed :func2 udf. final :func3 end Now, one issue that comes up here that didn't come up otherwise was the fact that in the previous example, we knew exactly what the class was called (since we wrapped their code in it). Now we don't. However, we are having them require the gem, so we'd ideally want a way to know the classes that they defined, and other fun. We could of course do the same trick of wrapping them in a class, but then you have the same issue, so ideally it'd be something all ruby, where after running all of these things, there are functions defined in the gem that give you information on the classes, how to invoke them, etc (ie all the stuff pig needs, but conveniently exposed in ruby so that the user can manipulate it as well). But my ideal would being able to do something like PigUdf.getClasses //returns a list of the defined classes PigUdf.getFunctions(className) //returns a dictionary of the defined functions, and their schema And more, who knows. If you can think of a more ruby-esque syntax that allows this (I'm sure you can , do tell. It'd be easy for PigUdf.new, but different for the subclasses. Love your thoughts.
          Hide
          Jonathan Coveney added a comment -

          Ok, so I had some time (and by had some time I mean didn't stop working for the past however many hours), and I sat down and tried to do this.

          This code is REALLY rough, but I'd rather start a dialogue around a) the design and b) the implementation. I'm sure I did some dumb stuff and would love to hear it.

          GIGANTIC NOTE: the Algebraic piece is EXTREMELY buggy. Right now it's mainly just a proof of concept: it gets heap space errors right and left. I included it mainly in the hope that people could eyeball it and tell me if there is anything dumb I'm doing...short of that I need to just rest a bit and give it some fresh eyeballs, and I'll use yourkit to profile it. But there's something weird going on somewhere, perhaps with the Ruby runtime environment running in parallel..

          Also, I know that you are more of a Ruby guy than I am, Jacob, so you can take a look at that end of things.

          Would there be any other features/syntax that would simplify things for poeple? As is it's pretty concise

          Non design to do:

          • make it compatible with 0.9.0
          • clean it up (I really need to organize the code and see what needs to be ripped into other classes)
          • write tests
          • figure out
          • put in logic that will opportunistically run the intermediate function on the intermediate output (how does Pig normally make this choice?)
          • add an accumulator interface (it would be minimal work with what I already have)
          • need to have a better way to provide the tuple/databag abstraction in Ruby. The current method is pretty nasty to work with, and you get functions that receive [[[element]]], and whatnot (you'll see that in my code)
          • tear into my code/design
          Show
          Jonathan Coveney added a comment - Ok, so I had some time (and by had some time I mean didn't stop working for the past however many hours), and I sat down and tried to do this. This code is REALLY rough, but I'd rather start a dialogue around a) the design and b) the implementation. I'm sure I did some dumb stuff and would love to hear it. GIGANTIC NOTE: the Algebraic piece is EXTREMELY buggy. Right now it's mainly just a proof of concept: it gets heap space errors right and left. I included it mainly in the hope that people could eyeball it and tell me if there is anything dumb I'm doing...short of that I need to just rest a bit and give it some fresh eyeballs, and I'll use yourkit to profile it. But there's something weird going on somewhere, perhaps with the Ruby runtime environment running in parallel.. Also, I know that you are more of a Ruby guy than I am, Jacob, so you can take a look at that end of things. Would there be any other features/syntax that would simplify things for poeple? As is it's pretty concise Non design to do: make it compatible with 0.9.0 clean it up (I really need to organize the code and see what needs to be ripped into other classes) write tests figure out put in logic that will opportunistically run the intermediate function on the intermediate output (how does Pig normally make this choice?) add an accumulator interface (it would be minimal work with what I already have) need to have a better way to provide the tuple/databag abstraction in Ruby. The current method is pretty nasty to work with, and you get functions that receive [[ [element] ]], and whatnot (you'll see that in my code) tear into my code/design
          Hide
          Jonathan Coveney added a comment -

          I made some fixes. The Algebraic stuff is still whacked, some sort of weird memory thing is going on, not sure what or why... but it works with small small data sets, so at least as a proof of concept is is ok. Hoping to use yourkit to figure it out. Will continue soon.

          Show
          Jonathan Coveney added a comment - I made some fixes. The Algebraic stuff is still whacked, some sort of weird memory thing is going on, not sure what or why... but it works with small small data sets, so at least as a proof of concept is is ok. Hoping to use yourkit to figure it out. Will continue soon.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Jonathan, can you inline the latest on how jruby udfs are supposed to be created? It's a bit confusing to read this ticket as you guys are working out the right interface. My understanding is that with the latest patch the memory woes on algebraics are solved, right?

          -D

          Show
          Dmitriy V. Ryaboy added a comment - Jonathan, can you inline the latest on how jruby udfs are supposed to be created? It's a bit confusing to read this ticket as you guys are working out the right interface. My understanding is that with the latest patch the memory woes on algebraics are solved, right? -D
          Hide
          Jonathan Coveney added a comment -

          Ok! I have made a lot of progress in this. Algebraic Ruby UDF's are WORKING, you just have to make sure to set -Xmx 1024m (or more). Pigs are not lean creatures

          My to do's:

          • Allow for in-line UDFs
          • Expose an accumulator interface (should be easy peasy)
          • Think of a better way to handle DataBags in Ruby, given that currently tuples and databags are the same structure... I think I know what to do, but need to ruminate.
          • Make it compatible with 0.9.1 at least
          • Write tests (is there a way to force a UDF which imeplements Algebraic and Accumulator to be run as an Accumulator UDF, or a normal EvalFunc? this is important for testing)
          • Anything you guys thing would be useful

          One thing that I did to get this to work was to generate an abstract "AccEvalFunc" and "AlgEvalFunc," where if you define the pieces of that interface, you get the lower levels for free. So in the case of the ruby algebraic UDF, by simply defining "initial" "intermed" and "final," you get accumulator and exec for free and don't have to putz around defining them. I think this should at least go in the piggybank, but it needs some eyeballs and suggestions.

          To allay confusion, I'm going to inline "pigruby.rb" and explain how it works.

          require 'PigUdf'
          
          Helloworld=PigUdf.evalfunc("word:chararray") do
            "Hello, world"
          end
          

          First off, it's important to require 'PigUdf' so ruby can work it's magic. In order to facilitate easy Udf definitions, you can declare functions in this way. You cannot have a schema which refers to a function, however, and the Udf name MUST begin with a capital letter.

          Complex=PigUdf.evalfunc("word:chararray,num:long") do |word|
            [word.to_s,word.length]
          end
          

          Conveniently, you can ask for as many parameters as you like. Varargs aren't supported yet, and it's not super high on the list but if you want it, let me know.

          Divbythree=PigUdf.filterfunc do |num|
            num%3==0
          end
          

          Much like an evalfunc, you can easy make a filterfunc.

          class Myudfs < PigUdf
            outputSchema "val:long"
            def cumsum num
              @x||=0
              @x+=num
            end
            
            outputSchemaFunction :squareSchema
            def square num
              return num**2
            end
          
            def squareSchema input
              return input
            end
            
            filterFunc
            def divbytwo input
              input%2==0
            end
          end
          

          For more complicated udfs, declaring a class that extends PigUdf is the way to go. In order to declare a UDF, you have to declare it's schema. There are two options: outputSchema "schema" will set that schema for the next function you define, and register it as a udf. or you can do outputSchema :funcname, "schema" to register funcname as a udf and schema as it's schema. If your function has a schema dependent on the input, then you can use outputSchemaFunction :funcname, and the next defined function will be registered as a udf with funcname as it's schema function. Or as above, you can do outputSchemaFunction :functoregister, :funcname. If you do filterFunc, the next function is a filterFunc.

          class COUNT < AlgPigUdf
            outputSchema "val:long"
          
            class Initial
              def exec item
                1
              end
            end
            
            class Intermed
              def exec items
                items.flatten.inject(:+)
              end
            end
            
            class Final < Intermed
            end
          end
          

          This is the algebraic interface. The class name will be the UDF name in Pig.

          Initial's exec function will be passed one item at a time. You do not have to deal with the tuple that contains a bag that contains a tuple that contains an item madness.

          Intermed's exec function will be passed a "databag" that contains "tuples," which means it will be a list of lists of items (which is why I flatten items before summing). you return the item, and it will be put in a tuple for you.

          Final's exec function will do the exact same thing as Intermed, and the output just won't be wrapped in a TUple. this is why in the examples, it just extends Intermed.

          class SUM < AlgPigUdf
            outputSchema "val:long"
            
            class Initial
              def exec item
                item
              end
            end
            
            class Intermed
              def exec items
                items.flatten.inject(:+)
              end
            end
            
            class Final < Intermed
            end
          end
          

          This example works exactly like the above, except it sums the values, and not the rows.

          class WORDCOUNT < AlgPigUdf
            outputSchema "val:long"
          
            class Initial
              def exec item
                item ? item.split.length : 0
              end
            end
          
            class Intermed
              def exec items
                items.flatten.inject(:+)
              end
            end
          
            class Final < Intermed
            end
          end
          

          Of course, what would any example be without a word count example.

          Please let me know if you run into bugs, or have any suggestions on the code itself, the interface, etc.

          Show
          Jonathan Coveney added a comment - Ok! I have made a lot of progress in this. Algebraic Ruby UDF's are WORKING, you just have to make sure to set -Xmx 1024m (or more). Pigs are not lean creatures My to do's: Allow for in-line UDFs Expose an accumulator interface (should be easy peasy) Think of a better way to handle DataBags in Ruby, given that currently tuples and databags are the same structure... I think I know what to do, but need to ruminate. Make it compatible with 0.9.1 at least Write tests (is there a way to force a UDF which imeplements Algebraic and Accumulator to be run as an Accumulator UDF, or a normal EvalFunc? this is important for testing) Anything you guys thing would be useful One thing that I did to get this to work was to generate an abstract "AccEvalFunc" and "AlgEvalFunc," where if you define the pieces of that interface, you get the lower levels for free. So in the case of the ruby algebraic UDF, by simply defining "initial" "intermed" and "final," you get accumulator and exec for free and don't have to putz around defining them. I think this should at least go in the piggybank, but it needs some eyeballs and suggestions. To allay confusion, I'm going to inline "pigruby.rb" and explain how it works. require 'PigUdf' Helloworld=PigUdf.evalfunc( "word:chararray" ) do "Hello, world" end First off, it's important to require 'PigUdf' so ruby can work it's magic. In order to facilitate easy Udf definitions, you can declare functions in this way. You cannot have a schema which refers to a function, however, and the Udf name MUST begin with a capital letter. Complex=PigUdf.evalfunc( "word:chararray,num: long " ) do |word| [word.to_s,word.length] end Conveniently, you can ask for as many parameters as you like. Varargs aren't supported yet, and it's not super high on the list but if you want it, let me know. Divbythree=PigUdf.filterfunc do |num| num%3==0 end Much like an evalfunc, you can easy make a filterfunc. class Myudfs < PigUdf outputSchema "val: long " def cumsum num @x||=0 @x+=num end outputSchemaFunction :squareSchema def square num return num**2 end def squareSchema input return input end filterFunc def divbytwo input input%2==0 end end For more complicated udfs, declaring a class that extends PigUdf is the way to go. In order to declare a UDF, you have to declare it's schema. There are two options: outputSchema "schema" will set that schema for the next function you define, and register it as a udf. or you can do outputSchema :funcname, "schema" to register funcname as a udf and schema as it's schema. If your function has a schema dependent on the input, then you can use outputSchemaFunction :funcname, and the next defined function will be registered as a udf with funcname as it's schema function. Or as above, you can do outputSchemaFunction :functoregister, :funcname. If you do filterFunc, the next function is a filterFunc. class COUNT < AlgPigUdf outputSchema "val: long " class Initial def exec item 1 end end class Intermed def exec items items.flatten.inject(:+) end end class Final < Intermed end end This is the algebraic interface. The class name will be the UDF name in Pig. Initial's exec function will be passed one item at a time. You do not have to deal with the tuple that contains a bag that contains a tuple that contains an item madness. Intermed's exec function will be passed a "databag" that contains "tuples," which means it will be a list of lists of items (which is why I flatten items before summing). you return the item, and it will be put in a tuple for you. Final's exec function will do the exact same thing as Intermed, and the output just won't be wrapped in a TUple. this is why in the examples, it just extends Intermed. class SUM < AlgPigUdf outputSchema "val: long " class Initial def exec item item end end class Intermed def exec items items.flatten.inject(:+) end end class Final < Intermed end end This example works exactly like the above, except it sums the values, and not the rows. class WORDCOUNT < AlgPigUdf outputSchema "val: long " class Initial def exec item item ? item.split.length : 0 end end class Intermed def exec items items.flatten.inject(:+) end end class Final < Intermed end end Of course, what would any example be without a word count example. Please let me know if you run into bugs, or have any suggestions on the code itself, the interface, etc.
          Hide
          Jacob Perkins added a comment -

          Combiners from ruby? I like it.

          Here's some stylistic nitpicking more than anything. Ruby people are very particular.

          1) Ruby people in general hate camel case for anything except class names. So,

          require 'PigUdf'
          
          # ...
          outputSchema "str:chararray"
          

          should become:

          require 'pigudf'
          
          # ...
          output_schema "str:chararray"
          

          I'm really not kidding.

          2) I would argue for 'PigUDF' as opposed to 'PigUdf' if only because UDF is an acronym.

          3) There is tab completion in the universe now. As such we should spell things like 'AlgPigUdf' out as 'AlgebraicPigUDF

          I'll look into making it compatible with 0.9.1, specifically with regards to the 'addScriptFile' issue.

          Show
          Jacob Perkins added a comment - Combiners from ruby? I like it. Here's some stylistic nitpicking more than anything. Ruby people are very particular. 1) Ruby people in general hate camel case for anything except class names. So, require 'PigUdf' # ... outputSchema "str:chararray" should become: require 'pigudf' # ... output_schema "str:chararray" I'm really not kidding. 2) I would argue for 'PigUDF' as opposed to 'PigUdf' if only because UDF is an acronym. 3) There is tab completion in the universe now. As such we should spell things like 'AlgPigUdf' out as 'AlgebraicPigUDF I'll look into making it compatible with 0.9.1, specifically with regards to the 'addScriptFile' issue.
          Hide
          Jonathan Coveney added a comment -

          I agree on the style points, and will implement accordingly.

          As far as the Algebraic UDF's, I can simplify them greatly. The nested structure was largely a vestige of an initial, poor implementation.

          class SUM < AlgebraicPigUDF
            outputSchema "val:long"
            
            def initial item
              item
            end
          
            def intermed items
              items.flatten.inject(:+)
            end
          
            def final items
              intermed items
            end
          end
          

          Extremely compact.

          Another fun thing that could be a boon to doing serious analysis in a scripting language (as something similar could no doubt be done in Jython): I am working on making a native ruby object that wraps the DataBag. There are a couple of hacky ways of doing it, but I'm hoping to do one in a real, fast way. It should ideally allow us to avoid materializing bags into memory in scripting languages.

          Show
          Jonathan Coveney added a comment - I agree on the style points, and will implement accordingly. As far as the Algebraic UDF's, I can simplify them greatly. The nested structure was largely a vestige of an initial, poor implementation. class SUM < AlgebraicPigUDF outputSchema "val: long " def initial item item end def intermed items items.flatten.inject(:+) end def final items intermed items end end Extremely compact. Another fun thing that could be a boon to doing serious analysis in a scripting language (as something similar could no doubt be done in Jython): I am working on making a native ruby object that wraps the DataBag. There are a couple of hacky ways of doing it, but I'm hoping to do one in a real, fast way. It should ideally allow us to avoid materializing bags into memory in scripting languages.
          Hide
          Jonathan Coveney added a comment -

          So! Made some new changes. There is now an accumulator interface.

          class SUM2 < AccumulatorPigUDF
            output_schema "val:long"
          
            def exec items
              @sum||=0
              @sum+=items.flatten.inject(:+)
            end
          
            def get
              @sum
            end
          end
          

          One interesting thing about the accumulator interface is that all of the state is handled inside of the ruby class...so if you want intermediate objects, it's all there. The cleanup step is just throwing away the class, and then it will be reinstantiated if the interface is invoked again.

          Algebraic UDFs are easier than ever.

          class SUM < AlgebraicPigUDF
            output_schema "val:long"
            
            def initial item
              item
            end
            
            def intermed items
              items.flatten.inject(:+)
            end
            
            def final items
              intermed items
            end
          end
          
          class WORDCOUNT < AlgebraicPigUDF
            output_schema "val:long"
          
            def initial item
              item ? item.split.length : 0
            end
          
            def intermed items
              items.flatten.inject(:+)
            end
            
            def final items
              intermed items
            end
          end
          

          One of the more exciting changes (to me...) is that I have added DataBags as a native ruby object, so it's super easy to use them. If you do include the pigudf package, you can do "DataBag.new." Examples of how to use it follow:

          jruby -J-Xmx1024m -S irb
          

          this ensures that you have enough heap space

          require 'pigudf'
          db=DataBag.new
          

          a is now a databag! to test that it spills properly, we do...

          (0..10000000).each {|x| db.add(x)}
          

          On my computer, with the heap size we specified, it spilled once. But it spills! Also, a note: arrays still convert to tuples, and a bag can either accept ONE argument, or an array of arguments. The one argument thing is a convenience function. I will probably make it a varargs for conciseness. But that means you can do

          db.add(1)
          

          or

          db.add([1])
          

          After running the each above, you get:

          ree-1.8.7-2010.02 :009 > db.size()
           => 10000001
          

          Nice! I need to look into how to get JRuby to generate better docs, but if you look at RubyDataBag.java in the patch you can see the api (anything marked with @JRubyMethod). I'll summarize here.

          DataBag.new, DataBag.new db
          

          DataBag has two initializers: the default initializer just creates an empty databag, and the second takes a databag and copies it over. There is also

          db.add_all db2, db.copy db2
          

          which pulls all of the data out of the given DataBag or RubyDataBag.

          db.to_s,db.to_string,db.inspect
          

          return a string view. if you do db.to_s(true), you'll also see the contents (useful for debugging)

          db.size,db.length
          

          number of elements in the bag

          db.add(elem) or db.add([e1,e2,e3])
          

          Add the elements to the bag

          db.distinct?, db.is_distinct?
          

          returns if the bag is distinct

          db.sorted?, db.is_sorted?
          

          returns if the bag is sorted

          db.clear
          

          clears the databag

          db.empty?
          

          returns if the bag is empty

          db.each
          

          One thing that I did with the DataBag implementation is that I had it include Enumerable, and implement each. This means that all of the fun commands you like to use in ruby like map and so on should work... also, for convenient, I implement a flatten command

          db.flatten or db.flat_each
           => #<Enumerable::Enumerator:0x8939ec3 @__args__=[], @__object__=[DataBag: size: 10000001], @__method__=:flat_each> 
          

          what this does is create an object that accepts .each

          {block}

          , but will flatten the value out of the Tuple before passing it to the block. This allows you to efficiently do things like db.flatten.inject(:+), because it is pulling the element out of the tuple on each block invocation instead of doing the naive thing which would be to create an array of the output. One thing to keep in mind though is that this only pulls out the first argument. I guess I could change that. Am undecided.

          And lastly, there is...

          db.iterator
          

          returns a BagIterator. This is basically a simplifed access point that is very similar to bag, except with less power.

          db.get, db.getNext, db.get_next
          
          db.has_next?, db.hasNext, db.has_next, db.next?
          

          and it supports the exact same map semantics as bag does.

          Phew! Ok. Definitely would love feedback. I'm going to work on making UDFs in-line, and need to write tests....

          Show
          Jonathan Coveney added a comment - So! Made some new changes. There is now an accumulator interface. class SUM2 < AccumulatorPigUDF output_schema "val: long " def exec items @sum||=0 @sum+=items.flatten.inject(:+) end def get @sum end end One interesting thing about the accumulator interface is that all of the state is handled inside of the ruby class...so if you want intermediate objects, it's all there. The cleanup step is just throwing away the class, and then it will be reinstantiated if the interface is invoked again. Algebraic UDFs are easier than ever. class SUM < AlgebraicPigUDF output_schema "val: long " def initial item item end def intermed items items.flatten.inject(:+) end def final items intermed items end end class WORDCOUNT < AlgebraicPigUDF output_schema "val: long " def initial item item ? item.split.length : 0 end def intermed items items.flatten.inject(:+) end def final items intermed items end end One of the more exciting changes (to me...) is that I have added DataBags as a native ruby object, so it's super easy to use them. If you do include the pigudf package, you can do "DataBag.new." Examples of how to use it follow: jruby -J-Xmx1024m -S irb this ensures that you have enough heap space require 'pigudf' db=DataBag. new a is now a databag! to test that it spills properly, we do... (0..10000000).each {|x| db.add(x)} On my computer, with the heap size we specified, it spilled once. But it spills! Also, a note: arrays still convert to tuples, and a bag can either accept ONE argument, or an array of arguments. The one argument thing is a convenience function. I will probably make it a varargs for conciseness. But that means you can do db.add(1) or db.add([1]) After running the each above, you get: ree-1.8.7-2010.02 :009 > db.size() => 10000001 Nice! I need to look into how to get JRuby to generate better docs, but if you look at RubyDataBag.java in the patch you can see the api (anything marked with @JRubyMethod). I'll summarize here. DataBag. new , DataBag. new db DataBag has two initializers: the default initializer just creates an empty databag, and the second takes a databag and copies it over. There is also db.add_all db2, db.copy db2 which pulls all of the data out of the given DataBag or RubyDataBag. db.to_s,db.to_string,db.inspect return a string view. if you do db.to_s(true), you'll also see the contents (useful for debugging) db.size,db.length number of elements in the bag db.add(elem) or db.add([e1,e2,e3]) Add the elements to the bag db.distinct?, db.is_distinct? returns if the bag is distinct db.sorted?, db.is_sorted? returns if the bag is sorted db.clear clears the databag db.empty? returns if the bag is empty db.each One thing that I did with the DataBag implementation is that I had it include Enumerable, and implement each. This means that all of the fun commands you like to use in ruby like map and so on should work... also, for convenient, I implement a flatten command db.flatten or db.flat_each => #<Enumerable::Enumerator:0x8939ec3 @__args__=[], @__object__=[DataBag: size: 10000001], @__method__=:flat_each> what this does is create an object that accepts .each {block} , but will flatten the value out of the Tuple before passing it to the block. This allows you to efficiently do things like db.flatten.inject(:+), because it is pulling the element out of the tuple on each block invocation instead of doing the naive thing which would be to create an array of the output. One thing to keep in mind though is that this only pulls out the first argument. I guess I could change that. Am undecided. And lastly, there is... db.iterator returns a BagIterator. This is basically a simplifed access point that is very similar to bag, except with less power. db.get, db.getNext, db.get_next db.has_next?, db.hasNext, db.has_next, db.next? and it supports the exact same map semantics as bag does. Phew! Ok. Definitely would love feedback. I'm going to work on making UDFs in-line, and need to write tests....
          Hide
          Alan Gates added a comment -

          I've been reviewing and playing with this patch over the last week. I have a few comments, and then questions since I can't get it to work in the e2e tests. I'll attach my changes in a new patch.

          Comments:

          1. In JRubyScriptEngine.registerFunctions, line 130, why is canonicalName forced to the result from LongJRubyAccumulatorEvalFunc and the following switch statement commented out? This looks like a mistake.
          2. In JrubyUtils.pigToRuby, converting a databytearray to a String will mangle binary data. I don't know if Ruby has a binary type or supports a byte type.

          Issue I'm seeing:

          When I try to run the new e2e tests I've added, they abort with the following error message:

          Pig Stack Trace
          ---------------
          ERROR 2999: Unexpected internal error. (NoMethodError) undefined method `outputSchemaFunction' for main:Object
          
          org.jruby.embed.EvalFailedException: (NoMethodError) undefined method `outputSchemaFunction' for main:Object
              at org.jruby.embed.internal.EmbedEvalUnitImpl.run(EmbedEvalUnitImpl.java:127)
              at org.apache.pig.scripting.jruby.JrubyScriptEngine$RubyFunctions.getFromCache(JrubyScriptEngine.java:80)
              at org.apache.pig.scripting.jruby.JrubyScriptEngine$RubyFunctions.getEvalFunctions(JrubyScriptEngine.java:95)
              at org.apache.pig.scripting.jruby.JrubyScriptEngine.registerFunctions(JrubyScriptEngine.java:115)
              at org.apache.pig.PigServer.registerCode(PigServer.java:525)
              at org.apache.pig.tools.grunt.GruntParser.processRegister(GruntParser.java:421)
              at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:419)
              at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
              at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
              at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
              at org.apache.pig.Main.run(Main.java:589)
              at org.apache.pig.Main.main(Main.java:148)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
          Caused by: org.jruby.exceptions.RaiseException: (NoMethodError) undefined method `outputSchemaFunction' for main:Object
          

          You can run the same tests by doing:

          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local
          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -D-Dtests.to.run="-t RubyUDFs_1" test-e2e-local

          This will run just the first RubyUDF test in local mode.

          Show
          Alan Gates added a comment - I've been reviewing and playing with this patch over the last week. I have a few comments, and then questions since I can't get it to work in the e2e tests. I'll attach my changes in a new patch. Comments: In JRubyScriptEngine.registerFunctions, line 130, why is canonicalName forced to the result from LongJRubyAccumulatorEvalFunc and the following switch statement commented out? This looks like a mistake. In JrubyUtils.pigToRuby, converting a databytearray to a String will mangle binary data. I don't know if Ruby has a binary type or supports a byte type. Issue I'm seeing: When I try to run the new e2e tests I've added, they abort with the following error message: Pig Stack Trace --------------- ERROR 2999: Unexpected internal error. (NoMethodError) undefined method `outputSchemaFunction' for main: Object org.jruby.embed.EvalFailedException: (NoMethodError) undefined method `outputSchemaFunction' for main: Object at org.jruby.embed.internal.EmbedEvalUnitImpl.run(EmbedEvalUnitImpl.java:127) at org.apache.pig.scripting.jruby.JrubyScriptEngine$RubyFunctions.getFromCache(JrubyScriptEngine.java:80) at org.apache.pig.scripting.jruby.JrubyScriptEngine$RubyFunctions.getEvalFunctions(JrubyScriptEngine.java:95) at org.apache.pig.scripting.jruby.JrubyScriptEngine.registerFunctions(JrubyScriptEngine.java:115) at org.apache.pig.PigServer.registerCode(PigServer.java:525) at org.apache.pig.tools.grunt.GruntParser.processRegister(GruntParser.java:421) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:419) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84) at org.apache.pig.Main.run(Main.java:589) at org.apache.pig.Main.main(Main.java:148) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: org.jruby.exceptions.RaiseException: (NoMethodError) undefined method `outputSchemaFunction' for main: Object You can run the same tests by doing: ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -D-Dtests.to.run="-t RubyUDFs_1" test-e2e-local This will run just the first RubyUDF test in local mode.
          Hide
          Jonathan Coveney added a comment -

          Alan,

          Thank you very much for your comments, and for adding tests (which it was sorely lacking). Point 1 is indeed an error...this is an artifact of when I was getting the whole canonicalName piece working, and wanted to force it to a Long to isolate errors. The switch should be working now.

          To point 2, there are a couple of ways that we can handle it. One would be to create a DataByteArray class in Ruby which would allow us to do whatever we wanted. Ruby, natively, does not have a binary type...what people usually do, AFAIK (someone with deeper knowledge of ruby can chime in), is pack data into strings. Since we're using JRuby, we can expose things however we like. I'll think on an elegant solution but you are right, we need to do something else.

          It is a goal of mine to get momentum going again to push this through, so thank you very much for taking a look and adding the tests. I will get to work figuring out why they are failing and update these comments.

          Show
          Jonathan Coveney added a comment - Alan, Thank you very much for your comments, and for adding tests (which it was sorely lacking). Point 1 is indeed an error...this is an artifact of when I was getting the whole canonicalName piece working, and wanted to force it to a Long to isolate errors. The switch should be working now. To point 2, there are a couple of ways that we can handle it. One would be to create a DataByteArray class in Ruby which would allow us to do whatever we wanted. Ruby, natively, does not have a binary type...what people usually do, AFAIK (someone with deeper knowledge of ruby can chime in), is pack data into strings. Since we're using JRuby, we can expose things however we like. I'll think on an elegant solution but you are right, we need to do something else. It is a goal of mine to get momentum going again to push this through, so thank you very much for taking a look and adding the tests. I will get to work figuring out why they are failing and update these comments.
          Hide
          Jonathan Coveney added a comment -

          Phew! Ok, the tests now run. There was quite a bit of bug fixing that went in, both in the ruby library 'pigudf', as well as in the UDFs themselves (they are cleaned up a bit). I added support for DataByteArrays, as well as support making it a lot easier to concatenate them together and whatnot.

          As far as things I'd like to do...

          • I'd like to make a much more natural way to interact with Schema objects from the Ruby side. Right now you can return a schema and stuff, but goooood luck actually trying to manipulate the object (it's possibly but it'd be ugly) I imagine the jython scripting suffers from this as well
          • Need to fix the enumerator interface on the DataBag and BagIterator. This is pending some help from the JRuby folk.
          • In general, need to balance making the API's look like their Java counterparts, but with the nice conveniences that JRuby can offer so that UDFs can be really concise and easy
          • I'd like to get rid of the need for pigudf.rb and implement it all in Java. This is the least priority, though, because the current implementation works, and it is absolutely idiomatic in jruby to implement libraries in jruby, so the above is more important.

          Thanks again for the tests, Alan. They were really key. It should pass them now. I'll ruminate on other good ones to add.

          Show
          Jonathan Coveney added a comment - Phew! Ok, the tests now run. There was quite a bit of bug fixing that went in, both in the ruby library 'pigudf', as well as in the UDFs themselves (they are cleaned up a bit). I added support for DataByteArrays, as well as support making it a lot easier to concatenate them together and whatnot. As far as things I'd like to do... I'd like to make a much more natural way to interact with Schema objects from the Ruby side. Right now you can return a schema and stuff, but goooood luck actually trying to manipulate the object (it's possibly but it'd be ugly) I imagine the jython scripting suffers from this as well Need to fix the enumerator interface on the DataBag and BagIterator. This is pending some help from the JRuby folk. In general, need to balance making the API's look like their Java counterparts, but with the nice conveniences that JRuby can offer so that UDFs can be really concise and easy I'd like to get rid of the need for pigudf.rb and implement it all in Java. This is the least priority, though, because the current implementation works, and it is absolutely idiomatic in jruby to implement libraries in jruby, so the above is more important. Thanks again for the tests, Alan. They were really key. It should pass them now. I'll ruminate on other good ones to add.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4091/
          -----------------------------------------------------------

          Review request for pig, Julien Le Dem and Alan Gates.

          Summary
          -------

          This is the reviewboard for the following:

          https://issues.apache.org/jira/browse/PIG-2317

          This addresses bug PIG-2317.
          https://issues.apache.org/jira/browse/PIG-2317

          Diffs


          trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/ScriptEngine.java 1294756
          trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION
          trunk/bin/pig 1294756
          trunk/ivy.xml 1294756
          trunk/ivy/libraries.properties 1294756
          trunk/pigudf.rb PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyUtils.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyBagIterator.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyExampleA.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyExampleB.java PRE-CREATION
          trunk/test/e2e/pig/build.xml 1294756
          trunk/test/e2e/pig/conf/default.conf 1294756
          trunk/test/e2e/pig/drivers/TestDriverPig.pm 1294756
          trunk/test/e2e/pig/tests/nightly.conf 1294756
          trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION
          trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION
          trunk/test/org/apache/pig/test/TestUDF.java 1294756
          trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION

          Diff: https://reviews.apache.org/r/4091/diff

          Testing
          -------

          e2e tests can be run as follows:

          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local
          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4091/ ----------------------------------------------------------- Review request for pig, Julien Le Dem and Alan Gates. Summary ------- This is the reviewboard for the following: https://issues.apache.org/jira/browse/PIG-2317 This addresses bug PIG-2317 . https://issues.apache.org/jira/browse/PIG-2317 Diffs trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/ScriptEngine.java 1294756 trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION trunk/bin/pig 1294756 trunk/ivy.xml 1294756 trunk/ivy/libraries.properties 1294756 trunk/pigudf.rb PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyUtils.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyBagIterator.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyExampleA.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyExampleB.java PRE-CREATION trunk/test/e2e/pig/build.xml 1294756 trunk/test/e2e/pig/conf/default.conf 1294756 trunk/test/e2e/pig/drivers/TestDriverPig.pm 1294756 trunk/test/e2e/pig/tests/nightly.conf 1294756 trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION trunk/test/org/apache/pig/test/TestUDF.java 1294756 trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION Diff: https://reviews.apache.org/r/4091/diff Testing ------- e2e tests can be run as follows: ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local Thanks, Jonathan
          Hide
          Alan Gates added a comment -

          Once PIG-2548 is checked in we should modify this patch to have the same behavior of accepting command line arguments. Or we can do it as a separate JIRA if this one gets in first.

          Show
          Alan Gates added a comment - Once PIG-2548 is checked in we should modify this patch to have the same behavior of accepting command line arguments. Or we can do it as a separate JIRA if this one gets in first.
          Hide
          Jonathan Coveney added a comment -

          Alan,

          Does that ticket really apply? It seems only to deal with jython as a control flow language, not jython as a UDF language.

          That said, if we want to use jruby as a control language I think that that's a great idea, but one that should probably be explored separately

          Am I misunderstanding?

          Show
          Jonathan Coveney added a comment - Alan, Does that ticket really apply? It seems only to deal with jython as a control flow language, not jython as a UDF language. That said, if we want to use jruby as a control language I think that that's a great idea, but one that should probably be explored separately Am I misunderstanding?
          Hide
          Alan Gates added a comment -

          Jonathan, you're right, I'm confusing the two. So this ticket doesn't apply.

          Show
          Alan Gates added a comment - Jonathan, you're right, I'm confusing the two. So this ticket doesn't apply.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4091/#review5601
          -----------------------------------------------------------

          Thanks Jonathan for this contribution. A lot of good stuff in there.
          Please add comments when using Ruby specific mechanisms in there (child classes notification, Class naming after the variable it is affected to, ...) so that less Ruby savvy contributors can follow.

          trunk/pigudf.rb
          <https://reviews.apache.org/r/4091/#comment12179>

          maybe you need to separate the top level ruby script to be used outside of pig and the one used inside using includes?

          trunk/pigudf.rb
          <https://reviews.apache.org/r/4091/#comment12208>

          please add doc on public functions

          trunk/pigudf.rb
          <https://reviews.apache.org/r/4091/#comment12180>

          @function_to_register["GETCLASSFROMOBJECT"] could be replaced by a member @current_get_class_from_object

          trunk/pigudf.rb
          <https://reviews.apache.org/r/4091/#comment12214>

          you could define a class for this

          trunk/pigudf.rb
          <https://reviews.apache.org/r/4091/#comment12209>

          • please add a comment explain the mechanism that names the class after the variable it is assigned to.

          trunk/pigudf.rb
          <https://reviews.apache.org/r/4091/#comment12210>

          Add a comment to explain that classes inheriting PigUdf will automatically get registered here.

          trunk/pigudf.rb
          <https://reviews.apache.org/r/4091/#comment12211>

          I would add a separate field for schema func instead of this convention.

          trunk/pigudf.rb
          <https://reviews.apache.org/r/4091/#comment12181>

          I would throw an exception if both @schema and @functions_to_register[id] are set.

          trunk/pigudf.rb
          <https://reviews.apache.org/r/4091/#comment12182>

          You could call it RegisterDescendentsUDF or a similar name as this is a based class for UDFs.
          The descendents registration is common to the PigUDF as well. Do you want to unify the class hierarchy?

          trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java
          <https://reviews.apache.org/r/4091/#comment12184>

          ruby.getClass("DataBag") could be done only once

          trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java
          <https://reviews.apache.org/r/4091/#comment12183>

          throw an Exception here

          trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java
          <https://reviews.apache.org/r/4091/#comment12185>

          I would override OutputSchema instead. In the case of a bag you are losing the schema of the Tuple inside the bag.

          trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java
          <https://reviews.apache.org/r/4091/#comment12212>

          If you can add the name of the function in the message that would help with debugging.

          trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java
          <https://reviews.apache.org/r/4091/#comment12186>

          you could factor out the function calling and bag instantiated code.

          trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java
          <https://reviews.apache.org/r/4091/#comment12215>

          some of the ruby calling logic and exception catching could be factored in this class

          trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java
          <https://reviews.apache.org/r/4091/#comment12187>

          Override outputSchema instead like you did for EvalFunc

          trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java
          <https://reviews.apache.org/r/4091/#comment12213>

          private

          trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java
          <https://reviews.apache.org/r/4091/#comment12189>

          check for the varargs case

          trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java
          <https://reviews.apache.org/r/4091/#comment12190>

          you should throw an exception here

          trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java
          <https://reviews.apache.org/r/4091/#comment12217>

          It is very similar to JrubyEvalFunc. You can factor out more of this code.

          trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java
          <https://reviews.apache.org/r/4091/#comment12216>

          private

          trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java
          <https://reviews.apache.org/r/4091/#comment12218>

          turn these comments into javadoc

          trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java
          <https://reviews.apache.org/r/4091/#comment12192>

          this will strip the inner schema of bags and tuples. See comments about overriding outputSchema

          trunk/src/org/apache/pig/scripting/jruby/JrubyUtils.java
          <https://reviews.apache.org/r/4091/#comment12219>

          please add javadoc to public methods

          trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java
          <https://reviews.apache.org/r/4091/#comment12193>

          add javadoc

          trunk/src/org/apache/pig/scripting/jruby/RubyBagIterator.java
          <https://reviews.apache.org/r/4091/#comment12194>

          please make it private and at the top

          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java
          <https://reviews.apache.org/r/4091/#comment12195>

          add javadoc

          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java
          <https://reviews.apache.org/r/4091/#comment12196>

          make it private
          1 is a perfectly good value. it does not need to be unique.
          Changing the serialVersionUID value will make the serialized data format incompatible.

          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java
          <https://reviews.apache.org/r/4091/#comment12197>

          in this case addAll would replace the current content. Is it intended ?

          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java
          <https://reviews.apache.org/r/4091/#comment12220>

          want to remove it ?

          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java
          <https://reviews.apache.org/r/4091/#comment12198>

          internalDB.add(JrubyUtils.rubyToPig((RubyArray)elem));

          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java
          <https://reviews.apache.org/r/4091/#comment12199>

          you could call runtime.getClass("DataBag").getClass("BagIterator") once at the beginning

          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java
          <https://reviews.apache.org/r/4091/#comment12200>

          add spaces around operator and prefer the normalized "if" with { }

          if ()

          { // }

          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java
          <https://reviews.apache.org/r/4091/#comment12201>

          flattten with 3 Ts? It's going to be really flat.

          trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java
          <https://reviews.apache.org/r/4091/#comment12202>

          make this a javadoc comment

          trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java
          <https://reviews.apache.org/r/4091/#comment12203>

          private

          trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java
          <https://reviews.apache.org/r/4091/#comment12204>

          space after comma

          trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java
          <https://reviews.apache.org/r/4091/#comment12205>

          give meaningful names to parameters

          trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb
          <https://reviews.apache.org/r/4091/#comment12221>

          What about keeping only the one parameter ouptutSchema and outputSchemaFunctions?
          Users would just use the function right before defining the UDF.

          outputSchema "word:chararray"
          def concat input, input2
          return input + input2
          end

          trunk/test/org/apache/pig/test/TestUDF.java
          <https://reviews.apache.org/r/4091/#comment12222>

          if this value is known you may as well compare each value to the known constant instead of comparing to the previous value.

          • Julien

          On 2012-02-28 22:02:52, Jonathan Coveney wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/4091/

          -----------------------------------------------------------

          (Updated 2012-02-28 22:02:52)

          Review request for pig, Julien Le Dem and Alan Gates.

          Summary

          -------

          This is the reviewboard for the following:

          https://issues.apache.org/jira/browse/PIG-2317

          This addresses bug PIG-2317.

          https://issues.apache.org/jira/browse/PIG-2317

          Diffs

          -----

          trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION

          trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/ScriptEngine.java 1294756

          trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION

          trunk/bin/pig 1294756

          trunk/ivy.xml 1294756

          trunk/ivy/libraries.properties 1294756

          trunk/pigudf.rb PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/JrubyUtils.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/RubyBagIterator.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/RubyExampleA.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/RubyExampleB.java PRE-CREATION

          trunk/test/e2e/pig/build.xml 1294756

          trunk/test/e2e/pig/conf/default.conf 1294756

          trunk/test/e2e/pig/drivers/TestDriverPig.pm 1294756

          trunk/test/e2e/pig/tests/nightly.conf 1294756

          trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION

          trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION

          trunk/test/org/apache/pig/test/TestUDF.java 1294756

          trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION

          Diff: https://reviews.apache.org/r/4091/diff

          Testing

          -------

          e2e tests can be run as follows:

          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local

          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4091/#review5601 ----------------------------------------------------------- Thanks Jonathan for this contribution. A lot of good stuff in there. Please add comments when using Ruby specific mechanisms in there (child classes notification, Class naming after the variable it is affected to, ...) so that less Ruby savvy contributors can follow. trunk/pigudf.rb < https://reviews.apache.org/r/4091/#comment12179 > maybe you need to separate the top level ruby script to be used outside of pig and the one used inside using includes? trunk/pigudf.rb < https://reviews.apache.org/r/4091/#comment12208 > please add doc on public functions trunk/pigudf.rb < https://reviews.apache.org/r/4091/#comment12180 > @function_to_register ["GETCLASSFROMOBJECT"] could be replaced by a member @current_get_class_from_object trunk/pigudf.rb < https://reviews.apache.org/r/4091/#comment12214 > you could define a class for this trunk/pigudf.rb < https://reviews.apache.org/r/4091/#comment12209 > please add a comment explain the mechanism that names the class after the variable it is assigned to. trunk/pigudf.rb < https://reviews.apache.org/r/4091/#comment12210 > Add a comment to explain that classes inheriting PigUdf will automatically get registered here. trunk/pigudf.rb < https://reviews.apache.org/r/4091/#comment12211 > I would add a separate field for schema func instead of this convention. trunk/pigudf.rb < https://reviews.apache.org/r/4091/#comment12181 > I would throw an exception if both @schema and @functions_to_register [id] are set. trunk/pigudf.rb < https://reviews.apache.org/r/4091/#comment12182 > You could call it RegisterDescendentsUDF or a similar name as this is a based class for UDFs. The descendents registration is common to the PigUDF as well. Do you want to unify the class hierarchy? trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java < https://reviews.apache.org/r/4091/#comment12184 > ruby.getClass("DataBag") could be done only once trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java < https://reviews.apache.org/r/4091/#comment12183 > throw an Exception here trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java < https://reviews.apache.org/r/4091/#comment12185 > I would override OutputSchema instead. In the case of a bag you are losing the schema of the Tuple inside the bag. trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java < https://reviews.apache.org/r/4091/#comment12212 > If you can add the name of the function in the message that would help with debugging. trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java < https://reviews.apache.org/r/4091/#comment12186 > you could factor out the function calling and bag instantiated code. trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java < https://reviews.apache.org/r/4091/#comment12215 > some of the ruby calling logic and exception catching could be factored in this class trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java < https://reviews.apache.org/r/4091/#comment12187 > Override outputSchema instead like you did for EvalFunc trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java < https://reviews.apache.org/r/4091/#comment12213 > private trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java < https://reviews.apache.org/r/4091/#comment12189 > check for the varargs case trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java < https://reviews.apache.org/r/4091/#comment12190 > you should throw an exception here trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java < https://reviews.apache.org/r/4091/#comment12217 > It is very similar to JrubyEvalFunc. You can factor out more of this code. trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java < https://reviews.apache.org/r/4091/#comment12216 > private trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java < https://reviews.apache.org/r/4091/#comment12218 > turn these comments into javadoc trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java < https://reviews.apache.org/r/4091/#comment12192 > this will strip the inner schema of bags and tuples. See comments about overriding outputSchema trunk/src/org/apache/pig/scripting/jruby/JrubyUtils.java < https://reviews.apache.org/r/4091/#comment12219 > please add javadoc to public methods trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java < https://reviews.apache.org/r/4091/#comment12193 > add javadoc trunk/src/org/apache/pig/scripting/jruby/RubyBagIterator.java < https://reviews.apache.org/r/4091/#comment12194 > please make it private and at the top trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java < https://reviews.apache.org/r/4091/#comment12195 > add javadoc trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java < https://reviews.apache.org/r/4091/#comment12196 > make it private 1 is a perfectly good value. it does not need to be unique. Changing the serialVersionUID value will make the serialized data format incompatible. trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java < https://reviews.apache.org/r/4091/#comment12197 > in this case addAll would replace the current content. Is it intended ? trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java < https://reviews.apache.org/r/4091/#comment12220 > want to remove it ? trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java < https://reviews.apache.org/r/4091/#comment12198 > internalDB.add(JrubyUtils.rubyToPig((RubyArray)elem)); trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java < https://reviews.apache.org/r/4091/#comment12199 > you could call runtime.getClass("DataBag").getClass("BagIterator") once at the beginning trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java < https://reviews.apache.org/r/4091/#comment12200 > add spaces around operator and prefer the normalized "if" with { } if () { // } trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java < https://reviews.apache.org/r/4091/#comment12201 > flattten with 3 Ts? It's going to be really flat. trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java < https://reviews.apache.org/r/4091/#comment12202 > make this a javadoc comment trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java < https://reviews.apache.org/r/4091/#comment12203 > private trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java < https://reviews.apache.org/r/4091/#comment12204 > space after comma trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java < https://reviews.apache.org/r/4091/#comment12205 > give meaningful names to parameters trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb < https://reviews.apache.org/r/4091/#comment12221 > What about keeping only the one parameter ouptutSchema and outputSchemaFunctions? Users would just use the function right before defining the UDF. outputSchema "word:chararray" def concat input, input2 return input + input2 end trunk/test/org/apache/pig/test/TestUDF.java < https://reviews.apache.org/r/4091/#comment12222 > if this value is known you may as well compare each value to the known constant instead of comparing to the previous value. Julien On 2012-02-28 22:02:52, Jonathan Coveney wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4091/ ----------------------------------------------------------- (Updated 2012-02-28 22:02:52) Review request for pig, Julien Le Dem and Alan Gates. Summary ------- This is the reviewboard for the following: https://issues.apache.org/jira/browse/PIG-2317 This addresses bug PIG-2317 . https://issues.apache.org/jira/browse/PIG-2317 Diffs ----- trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/ScriptEngine.java 1294756 trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION trunk/bin/pig 1294756 trunk/ivy.xml 1294756 trunk/ivy/libraries.properties 1294756 trunk/pigudf.rb PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyUtils.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyBagIterator.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyExampleA.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyExampleB.java PRE-CREATION trunk/test/e2e/pig/build.xml 1294756 trunk/test/e2e/pig/conf/default.conf 1294756 trunk/test/e2e/pig/drivers/TestDriverPig.pm 1294756 trunk/test/e2e/pig/tests/nightly.conf 1294756 trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION trunk/test/org/apache/pig/test/TestUDF.java 1294756 trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION Diff: https://reviews.apache.org/r/4091/diff Testing ------- e2e tests can be run as follows: ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local Thanks, Jonathan
          Hide
          Daniel Dai added a comment -

          I tried the latest patch. Find couple of issues in e2e tests in mapreduce mode. I attach PIG-2317-8_plus.patch on top of Jonathan's patch. Here is what I did:
          1. The original patch ship way too much files to the backend, my job.jar is larger than 20G. I fix it by only ship necessary ruby files
          2. ruby.jar will be registered automatically only when ruby UDF is used, same mechanism as jython UDF
          3. drop the dependency of pigudf.rb on pig.jar. All pig classes is assumed to be in classpath
          4. pigudf.rb will be shipped automatically if ruby UDF is used
          5. bin/pig and e2e test changes to support the above changes

          Other than these, patch looks good. +1 for commit once Julien's comments are addressed.

          Show
          Daniel Dai added a comment - I tried the latest patch. Find couple of issues in e2e tests in mapreduce mode. I attach PIG-2317 -8_plus.patch on top of Jonathan's patch. Here is what I did: 1. The original patch ship way too much files to the backend, my job.jar is larger than 20G. I fix it by only ship necessary ruby files 2. ruby.jar will be registered automatically only when ruby UDF is used, same mechanism as jython UDF 3. drop the dependency of pigudf.rb on pig.jar. All pig classes is assumed to be in classpath 4. pigudf.rb will be shipped automatically if ruby UDF is used 5. bin/pig and e2e test changes to support the above changes Other than these, patch looks good. +1 for commit once Julien's comments are addressed.
          Hide
          Jonathan Coveney added a comment -

          Awesome Daniel, those were some key changes. Going to try and get to Julien's review soon. Excited to get this in.

          Show
          Jonathan Coveney added a comment - Awesome Daniel, those were some key changes. Going to try and get to Julien's review soon. Excited to get this in.
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > Thanks Jonathan for this contribution. A lot of good stuff in there.

          > Please add comments when using Ruby specific mechanisms in there (child classes notification, Class naming after the variable it is affected to, ...) so that less Ruby savvy contributors can follow.

          I'm going to update the JIRA with some of the broader comments.

          My todo at this point is:

          • Polish up the Javadocs (I should probably add @params and whatnot)
          • Clean up the naming convention in the Ruby API defined in Java
          • Add more test coverage (I added a couple big features, they need to be used more in the e2e tests especially)
          • Add varargs support

          That said, the tests pass and it seems to work. The API feels quite clean to me.

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java, line 41

          > <https://reviews.apache.org/r/4091/diff/1/?file=86324#file86324line41>

          >

          > ruby.getClass("DataBag") could be done only once

          It could, but there is a caveat: the Ruby runtime HAS to be identical within calls, or you're going to run into huge issues. The getClass call isn't particularly heinous, and I think it's cleaner to leave in. For the time being, all of the Ruby runtimes are identical, but in the future there may be multiple (a different runtime per UDF or per script registered), so this is safer.

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/pigudf.rb, line 110

          > <https://reviews.apache.org/r/4091/diff/1/?file=86320#file86320line110>

          >

          > You could call it RegisterDescendentsUDF or a similar name as this is a based class for UDFs.

          > The descendents registration is common to the PigUDF as well. Do you want to unify the class hierarchy?

          Agreed, I renamed it

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/pigudf.rb, line 87

          > <https://reviews.apache.org/r/4091/diff/1/?file=86320#file86320line87>

          >

          > I would add a separate field for schema func instead of this convention.

          I split it out into another class like you suggested

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/pigudf.rb, line 41

          > <https://reviews.apache.org/r/4091/diff/1/?file=86320#file86320line41>

          >

          > - please add a comment explain the mechanism that names the class after the variable it is assigned to.

          added a ton of comments

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/pigudf.rb, line 26

          > <https://reviews.apache.org/r/4091/diff/1/?file=86320#file86320line26>

          >

          > you could define a class for this

          Agreed

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/pigudf.rb, line 6

          > <https://reviews.apache.org/r/4091/diff/1/?file=86320#file86320line6>

          >

          > maybe you need to separate the top level ruby script to be used outside of pig and the one used inside using includes?

          I was able to clean this up significantly in multiple ways. Will document how in the JIRA.

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java, line 65

          > <https://reviews.apache.org/r/4091/diff/1/?file=86324#file86324line65>

          >

          > I would override OutputSchema instead. In the case of a bag you are losing the schema of the Tuple inside the bag.

          Wisdom of the ages. Done

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java, line 86

          > <https://reviews.apache.org/r/4091/diff/1/?file=86325#file86325line86>

          >

          > some of the ruby calling logic and exception catching could be factored in this class

          Agreed...the win felt small, I will take this into account, but it's not a priority

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java, line 60

          > <https://reviews.apache.org/r/4091/diff/1/?file=86326#file86326line60>

          >

          > check for the varargs case

          I'll make varargs a top priority for the next round of changes... didn't seem super pressing atm, but maybe it is.

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java, line 15

          > <https://reviews.apache.org/r/4091/diff/1/?file=86327#file86327line15>

          >

          > It is very similar to JrubyEvalFunc. You can factor out more of this code.

          I suppose this begs the question of why we even have FilterFunc anymore, given that it just extends EvalFunc<Boolean> now. I treated it separately because I wasn't sure about that (I could just make FilterFuncs EvalFunc<Boolean>s and be done with it)... it seems like something we should change in Pig (ie abolish FilterFuncs). But if FilterFunc is treater separately, I want the two classes... I suppose I could pull out the functionality, and should, but I'd like to have some resolution on that, because it'll guide how I do so.

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java, line 107

          > <https://reviews.apache.org/r/4091/diff/1/?file=86328#file86328line107>

          >

          > this will strip the inner schema of bags and tuples. See comments about overriding outputSchema

          Can you avoid this with Algebraic UDFs? Fixed for Accumulators

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/scripting/jruby/RubyBagIterator.java, line 67

          > <https://reviews.apache.org/r/4091/diff/1/?file=86331#file86331line67>

          >

          > please make it private and at the top

          I don't even ned it, I think

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java, line 40

          > <https://reviews.apache.org/r/4091/diff/1/?file=86332#file86332line40>

          >

          > make it private

          > 1 is a perfectly good value. it does not need to be unique.

          > Changing the serialVersionUID value will make the serialized data format incompatible.

          you're correct. I think this was an artifact from a previous implementation and isn't actually necessary. good to know, though...

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java, line 104

          > <https://reviews.apache.org/r/4091/diff/1/?file=86332#file86332line104>

          >

          > in this case addAll would replace the current content. Is it intended ?

          Good call. It was a bug. I removed addAll and just made the functionality a part of add.

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java, line 149

          > <https://reviews.apache.org/r/4091/diff/1/?file=86332#file86332line149>

          >

          > you could call runtime.getClass("DataBag").getClass("BagIterator") once at the beginning

          I eliminated BagIterator, it didn't seem necessary. That said, re: the getClass calls, see above.

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java, line 98

          > <https://reviews.apache.org/r/4091/diff/1/?file=86333#file86333line98>

          >

          > give meaningful names to parameters

          the issue with ruby parameters is that a lot of the functions can take multiple types, etc, and the meaning changes depending. I could handle this a couple of ways.

          1) in the case where there is only one option, use that
          2) make better use of @param javadoc
          3) in the case where there are multiple possibilities, make sure to explicitly create an object that is named to reflect what it is supposed to be

          Would appreciate your thoughts on the matter

          On 2012-03-03 22:26:58, Julien Le Dem wrote:

          > trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb, line 32

          > <https://reviews.apache.org/r/4091/diff/1/?file=86341#file86341line32>

          >

          > What about keeping only the one parameter ouptutSchema and outputSchemaFunctions?

          > Users would just use the function right before defining the UDF.

          >

          > outputSchema "word:chararray"

          > def concat input, input2

          > return input + input2

          > end

          it'd be good to test both, agreed

          • Jonathan

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4091/#review5601
          -----------------------------------------------------------

          On 2012-02-28 22:02:52, Jonathan Coveney wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/4091/

          -----------------------------------------------------------

          (Updated 2012-02-28 22:02:52)

          Review request for pig, Julien Le Dem and Alan Gates.

          Summary

          -------

          This is the reviewboard for the following:

          https://issues.apache.org/jira/browse/PIG-2317

          This addresses bug PIG-2317.

          https://issues.apache.org/jira/browse/PIG-2317

          Diffs

          -----

          trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION

          trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/ScriptEngine.java 1294756

          trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION

          trunk/bin/pig 1294756

          trunk/ivy.xml 1294756

          trunk/ivy/libraries.properties 1294756

          trunk/pigudf.rb PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/JrubyUtils.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/RubyBagIterator.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/RubyExampleA.java PRE-CREATION

          trunk/src/org/apache/pig/scripting/jruby/RubyExampleB.java PRE-CREATION

          trunk/test/e2e/pig/build.xml 1294756

          trunk/test/e2e/pig/conf/default.conf 1294756

          trunk/test/e2e/pig/drivers/TestDriverPig.pm 1294756

          trunk/test/e2e/pig/tests/nightly.conf 1294756

          trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION

          trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION

          trunk/test/org/apache/pig/test/TestUDF.java 1294756

          trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION

          Diff: https://reviews.apache.org/r/4091/diff

          Testing

          -------

          e2e tests can be run as follows:

          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local

          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - On 2012-03-03 22:26:58, Julien Le Dem wrote: > Thanks Jonathan for this contribution. A lot of good stuff in there. > Please add comments when using Ruby specific mechanisms in there (child classes notification, Class naming after the variable it is affected to, ...) so that less Ruby savvy contributors can follow. I'm going to update the JIRA with some of the broader comments. My todo at this point is: Polish up the Javadocs (I should probably add @params and whatnot) Clean up the naming convention in the Ruby API defined in Java Add more test coverage (I added a couple big features, they need to be used more in the e2e tests especially) Add varargs support That said, the tests pass and it seems to work. The API feels quite clean to me. On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java, line 41 > < https://reviews.apache.org/r/4091/diff/1/?file=86324#file86324line41 > > > ruby.getClass("DataBag") could be done only once It could, but there is a caveat: the Ruby runtime HAS to be identical within calls, or you're going to run into huge issues. The getClass call isn't particularly heinous, and I think it's cleaner to leave in. For the time being, all of the Ruby runtimes are identical, but in the future there may be multiple (a different runtime per UDF or per script registered), so this is safer. On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/pigudf.rb, line 110 > < https://reviews.apache.org/r/4091/diff/1/?file=86320#file86320line110 > > > You could call it RegisterDescendentsUDF or a similar name as this is a based class for UDFs. > The descendents registration is common to the PigUDF as well. Do you want to unify the class hierarchy? Agreed, I renamed it On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/pigudf.rb, line 87 > < https://reviews.apache.org/r/4091/diff/1/?file=86320#file86320line87 > > > I would add a separate field for schema func instead of this convention. I split it out into another class like you suggested On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/pigudf.rb, line 41 > < https://reviews.apache.org/r/4091/diff/1/?file=86320#file86320line41 > > > - please add a comment explain the mechanism that names the class after the variable it is assigned to. added a ton of comments On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/pigudf.rb, line 26 > < https://reviews.apache.org/r/4091/diff/1/?file=86320#file86320line26 > > > you could define a class for this Agreed On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/pigudf.rb, line 6 > < https://reviews.apache.org/r/4091/diff/1/?file=86320#file86320line6 > > > maybe you need to separate the top level ruby script to be used outside of pig and the one used inside using includes? I was able to clean this up significantly in multiple ways. Will document how in the JIRA. On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java, line 65 > < https://reviews.apache.org/r/4091/diff/1/?file=86324#file86324line65 > > > I would override OutputSchema instead. In the case of a bag you are losing the schema of the Tuple inside the bag. Wisdom of the ages. Done On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java, line 86 > < https://reviews.apache.org/r/4091/diff/1/?file=86325#file86325line86 > > > some of the ruby calling logic and exception catching could be factored in this class Agreed...the win felt small, I will take this into account, but it's not a priority On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java, line 60 > < https://reviews.apache.org/r/4091/diff/1/?file=86326#file86326line60 > > > check for the varargs case I'll make varargs a top priority for the next round of changes... didn't seem super pressing atm, but maybe it is. On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java, line 15 > < https://reviews.apache.org/r/4091/diff/1/?file=86327#file86327line15 > > > It is very similar to JrubyEvalFunc. You can factor out more of this code. I suppose this begs the question of why we even have FilterFunc anymore, given that it just extends EvalFunc<Boolean> now. I treated it separately because I wasn't sure about that (I could just make FilterFuncs EvalFunc<Boolean>s and be done with it)... it seems like something we should change in Pig (ie abolish FilterFuncs). But if FilterFunc is treater separately, I want the two classes... I suppose I could pull out the functionality, and should, but I'd like to have some resolution on that, because it'll guide how I do so. On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java, line 107 > < https://reviews.apache.org/r/4091/diff/1/?file=86328#file86328line107 > > > this will strip the inner schema of bags and tuples. See comments about overriding outputSchema Can you avoid this with Algebraic UDFs? Fixed for Accumulators On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/src/org/apache/pig/scripting/jruby/RubyBagIterator.java, line 67 > < https://reviews.apache.org/r/4091/diff/1/?file=86331#file86331line67 > > > please make it private and at the top I don't even ned it, I think On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java, line 40 > < https://reviews.apache.org/r/4091/diff/1/?file=86332#file86332line40 > > > make it private > 1 is a perfectly good value. it does not need to be unique. > Changing the serialVersionUID value will make the serialized data format incompatible. you're correct. I think this was an artifact from a previous implementation and isn't actually necessary. good to know, though... On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java, line 104 > < https://reviews.apache.org/r/4091/diff/1/?file=86332#file86332line104 > > > in this case addAll would replace the current content. Is it intended ? Good call. It was a bug. I removed addAll and just made the functionality a part of add. On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java, line 149 > < https://reviews.apache.org/r/4091/diff/1/?file=86332#file86332line149 > > > you could call runtime.getClass("DataBag").getClass("BagIterator") once at the beginning I eliminated BagIterator, it didn't seem necessary. That said, re: the getClass calls, see above. On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java, line 98 > < https://reviews.apache.org/r/4091/diff/1/?file=86333#file86333line98 > > > give meaningful names to parameters the issue with ruby parameters is that a lot of the functions can take multiple types, etc, and the meaning changes depending. I could handle this a couple of ways. 1) in the case where there is only one option, use that 2) make better use of @param javadoc 3) in the case where there are multiple possibilities, make sure to explicitly create an object that is named to reflect what it is supposed to be Would appreciate your thoughts on the matter On 2012-03-03 22:26:58, Julien Le Dem wrote: > trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb, line 32 > < https://reviews.apache.org/r/4091/diff/1/?file=86341#file86341line32 > > > What about keeping only the one parameter ouptutSchema and outputSchemaFunctions? > Users would just use the function right before defining the UDF. > > outputSchema "word:chararray" > def concat input, input2 > return input + input2 > end it'd be good to test both, agreed Jonathan ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4091/#review5601 ----------------------------------------------------------- On 2012-02-28 22:02:52, Jonathan Coveney wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4091/ ----------------------------------------------------------- (Updated 2012-02-28 22:02:52) Review request for pig, Julien Le Dem and Alan Gates. Summary ------- This is the reviewboard for the following: https://issues.apache.org/jira/browse/PIG-2317 This addresses bug PIG-2317 . https://issues.apache.org/jira/browse/PIG-2317 Diffs ----- trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/ScriptEngine.java 1294756 trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION trunk/bin/pig 1294756 trunk/ivy.xml 1294756 trunk/ivy/libraries.properties 1294756 trunk/pigudf.rb PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyUtils.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyBagIterator.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyExampleA.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyExampleB.java PRE-CREATION trunk/test/e2e/pig/build.xml 1294756 trunk/test/e2e/pig/conf/default.conf 1294756 trunk/test/e2e/pig/drivers/TestDriverPig.pm 1294756 trunk/test/e2e/pig/tests/nightly.conf 1294756 trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION trunk/test/org/apache/pig/test/TestUDF.java 1294756 trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION Diff: https://reviews.apache.org/r/4091/diff Testing ------- e2e tests can be run as follows: ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local Thanks, Jonathan
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4091/
          -----------------------------------------------------------

          (Updated 2012-03-20 20:42:48.606378)

          Review request for pig, Julien Le Dem and Alan Gates.

          Summary
          -------

          This is the reviewboard for the following:

          https://issues.apache.org/jira/browse/PIG-2317

          This addresses bug PIG-2317.
          https://issues.apache.org/jira/browse/PIG-2317

          Diffs (updated)


          trunk/bin/pig 1303128
          trunk/build.xml 1303128
          trunk/ivy.xml 1303128
          trunk/ivy/libraries.properties 1303128
          trunk/src/PigService.java PRE-CREATION
          trunk/src/main/jruby/pigudf.rb PRE-CREATION
          trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/ScriptEngine.java 1303128
          trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubySchema.java PRE-CREATION
          trunk/test/e2e/pig/build.xml 1303128
          trunk/test/e2e/pig/conf/default.conf 1303128
          trunk/test/e2e/pig/drivers/TestDriverPig.pm 1303128
          trunk/test/e2e/pig/tests/nightly.conf 1303128
          trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION
          trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION
          trunk/test/org/apache/pig/test/TestUDF.java 1294756
          trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION

          Diff: https://reviews.apache.org/r/4091/diff

          Testing
          -------

          e2e tests can be run as follows:

          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local
          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4091/ ----------------------------------------------------------- (Updated 2012-03-20 20:42:48.606378) Review request for pig, Julien Le Dem and Alan Gates. Summary ------- This is the reviewboard for the following: https://issues.apache.org/jira/browse/PIG-2317 This addresses bug PIG-2317 . https://issues.apache.org/jira/browse/PIG-2317 Diffs (updated) trunk/bin/pig 1303128 trunk/build.xml 1303128 trunk/ivy.xml 1303128 trunk/ivy/libraries.properties 1303128 trunk/src/PigService.java PRE-CREATION trunk/src/main/jruby/pigudf.rb PRE-CREATION trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/ScriptEngine.java 1303128 trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubySchema.java PRE-CREATION trunk/test/e2e/pig/build.xml 1303128 trunk/test/e2e/pig/conf/default.conf 1303128 trunk/test/e2e/pig/drivers/TestDriverPig.pm 1303128 trunk/test/e2e/pig/tests/nightly.conf 1303128 trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION trunk/test/org/apache/pig/test/TestUDF.java 1294756 trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION Diff: https://reviews.apache.org/r/4091/diff Testing ------- e2e tests can be run as follows: ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local Thanks, Jonathan
          Hide
          Jonathan Coveney added a comment -

          OK! I just uploaded a new diff (which incorporates Daniel's changes). It may be possible to undo some of that, actually... I'll explain some of the big new changes. First, a todo list:

          • Need to add more e2e tests
          • Need to add more traditional tests
          • Need to make the Javadocs more robust with @params and whatnot
          • Need to add varargs support (this is the only feature that is missing, AFAIK)
          • I have some TODO's littered about...need to clean those up

          In general, there is a LOT more commenting, and I tried to be super explicit on the Ruby side of things.

          I significantly cleaned up and simplified pigudf.rb, taking into account comments from Julien. I simplified the mechanisms at play as far as I could.

          pigudf.rb is in src/main/jruby/

          Now, in order to get access to the Pig library, all you have to do is "require 'pig'", which imho is awesome: you just require pig, and you get everything! It's super clean. The unclean part of it is the way it works. If you do "require 'name.jar'", then JRuby looks for NameService.java in the base of the jar. If you do "require 'path/to/name.jar'", it'll look for path.to.NameService.java. Either way, this is the reason why I had to add src/PigService.java. IMHO the win is worth it, as it is super clean. In JRuby 1.7.0 there is a proposal to use the jar manifest to deal with this, and it's something I've brought up with them and something that will happen. 1.7 should also remove the need for a hack described below.

          I got rid of the BagIterator, as it didn't make much sense. In this implementation, it makes more sense just to iterate on the DataBag object in Ruby directly, as it hides the pain (this pattern is repeated in Schema).

          HACK ALERT: for people who know ruby, generally if you include 'Enumerable', and implement each, you can do "obj.each" and it will give you an enumerator object. This is useful for chaining together functions that enumerate over the object and change it in some way. Either way, JRuby 1.6.7 has a method that provides exactly this functionality...but they forgot to give it public permissions (it's just static enumeratorize(Blahblahblah)). I worked hard to try and get around the need for this, but it does it so cleanly and doing it any other way is such a pain (I haven't found a good one), that I used reflection to get around the permissions. I felt ok doing this because the 1.7.0 branch makes this explicitly public – it was just an oversight.

          Accumulator now uses outputSchema, as it always should have.

          One (surprisingly long) addition is a Ruby interface for Schema objects! It protects the user from the Schema/FieldSchema divide, and makes it really easy to mix String schema declarations and a Schema object that is input. I will post more depth about this later, but I think my time would be better served fixing the javadocs and the tests atm.

          Show
          Jonathan Coveney added a comment - OK! I just uploaded a new diff (which incorporates Daniel's changes). It may be possible to undo some of that, actually... I'll explain some of the big new changes. First, a todo list: Need to add more e2e tests Need to add more traditional tests Need to make the Javadocs more robust with @params and whatnot Need to add varargs support (this is the only feature that is missing, AFAIK) I have some TODO's littered about...need to clean those up In general, there is a LOT more commenting, and I tried to be super explicit on the Ruby side of things. I significantly cleaned up and simplified pigudf.rb, taking into account comments from Julien. I simplified the mechanisms at play as far as I could. pigudf.rb is in src/main/jruby/ Now, in order to get access to the Pig library, all you have to do is "require 'pig'", which imho is awesome: you just require pig, and you get everything! It's super clean. The unclean part of it is the way it works. If you do "require 'name.jar'", then JRuby looks for NameService.java in the base of the jar. If you do "require 'path/to/name.jar'", it'll look for path.to.NameService.java. Either way, this is the reason why I had to add src/PigService.java. IMHO the win is worth it, as it is super clean. In JRuby 1.7.0 there is a proposal to use the jar manifest to deal with this, and it's something I've brought up with them and something that will happen. 1.7 should also remove the need for a hack described below. I got rid of the BagIterator, as it didn't make much sense. In this implementation, it makes more sense just to iterate on the DataBag object in Ruby directly, as it hides the pain (this pattern is repeated in Schema). HACK ALERT: for people who know ruby, generally if you include 'Enumerable', and implement each, you can do "obj.each" and it will give you an enumerator object. This is useful for chaining together functions that enumerate over the object and change it in some way. Either way, JRuby 1.6.7 has a method that provides exactly this functionality...but they forgot to give it public permissions (it's just static enumeratorize(Blahblahblah)). I worked hard to try and get around the need for this, but it does it so cleanly and doing it any other way is such a pain (I haven't found a good one), that I used reflection to get around the permissions. I felt ok doing this because the 1.7.0 branch makes this explicitly public – it was just an oversight. Accumulator now uses outputSchema, as it always should have. One (surprisingly long) addition is a Ruby interface for Schema objects! It protects the user from the Schema/FieldSchema divide, and makes it really easy to mix String schema declarations and a Schema object that is input. I will post more depth about this later, but I think my time would be better served fixing the javadocs and the tests atm.
          Hide
          Jonathan Coveney added a comment -

          This is it!

          Could it have better test coverage? Perhaps, good sir. But this is it. Full javadoc coverage (let me know if its lacking anywhere), real feature completeness (most of the aforementioned tests are for features people will only find out about when I die and they are revealed in my well), including...

          • native and super easy schemas!
          • varargs!
          • ruby 1.9!
          • accumulators! algebraics!
          • customs schemas everywhere!
          • and much much more!

          Now I need some eyes on it, and I'll work on more tests and more testing... but yeah.

          Show
          Jonathan Coveney added a comment - This is it! Could it have better test coverage? Perhaps, good sir. But this is it. Full javadoc coverage (let me know if its lacking anywhere), real feature completeness (most of the aforementioned tests are for features people will only find out about when I die and they are revealed in my well), including... native and super easy schemas! varargs! ruby 1.9! accumulators! algebraics! customs schemas everywhere! and much much more! Now I need some eyes on it, and I'll work on more tests and more testing... but yeah.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4091/
          -----------------------------------------------------------

          (Updated 2012-03-24 01:33:02.389407)

          Review request for pig, Julien Le Dem and Alan Gates.

          Changes
          -------

          I incorporated a ton of Julien's comments, added a lot of javadocs, and rounded out the completeness. More fully featured and tastier than ever.

          Summary
          -------

          This is the reviewboard for the following:

          https://issues.apache.org/jira/browse/PIG-2317

          This addresses bug PIG-2317.
          https://issues.apache.org/jira/browse/PIG-2317

          Diffs (updated)


          trunk/bin/pig 1304676
          trunk/build.xml 1304676
          trunk/ivy.xml 1304676
          trunk/ivy/libraries.properties 1304676
          trunk/src/PigudfService.java PRE-CREATION
          trunk/src/main/jruby/pigudf.rb PRE-CREATION
          trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/ScriptEngine.java 1304676
          trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubySchema.java PRE-CREATION
          trunk/src/org/apache/pig/tools/counters/PigCounterHelper.java PRE-CREATION
          trunk/test/e2e/pig/build.xml 1304676
          trunk/test/e2e/pig/conf/default.conf 1304676
          trunk/test/e2e/pig/drivers/TestDriverPig.pm 1304676
          trunk/test/e2e/pig/tests/nightly.conf 1304676
          trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION
          trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION
          trunk/test/org/apache/pig/test/TestUDF.java 1294756
          trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION

          Diff: https://reviews.apache.org/r/4091/diff

          Testing
          -------

          e2e tests can be run as follows:

          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local
          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4091/ ----------------------------------------------------------- (Updated 2012-03-24 01:33:02.389407) Review request for pig, Julien Le Dem and Alan Gates. Changes ------- I incorporated a ton of Julien's comments, added a lot of javadocs, and rounded out the completeness. More fully featured and tastier than ever. Summary ------- This is the reviewboard for the following: https://issues.apache.org/jira/browse/PIG-2317 This addresses bug PIG-2317 . https://issues.apache.org/jira/browse/PIG-2317 Diffs (updated) trunk/bin/pig 1304676 trunk/build.xml 1304676 trunk/ivy.xml 1304676 trunk/ivy/libraries.properties 1304676 trunk/src/PigudfService.java PRE-CREATION trunk/src/main/jruby/pigudf.rb PRE-CREATION trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/ScriptEngine.java 1304676 trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubySchema.java PRE-CREATION trunk/src/org/apache/pig/tools/counters/PigCounterHelper.java PRE-CREATION trunk/test/e2e/pig/build.xml 1304676 trunk/test/e2e/pig/conf/default.conf 1304676 trunk/test/e2e/pig/drivers/TestDriverPig.pm 1304676 trunk/test/e2e/pig/tests/nightly.conf 1304676 trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION trunk/test/org/apache/pig/test/TestUDF.java 1294756 trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION Diff: https://reviews.apache.org/r/4091/diff Testing ------- e2e tests can be run as follows: ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local Thanks, Jonathan
          Hide
          Jonathan Coveney added a comment -

          Small update to RubySchema. Allows setting the name of a Schema from Ruby.

          Show
          Jonathan Coveney added a comment - Small update to RubySchema. Allows setting the name of a Schema from Ruby.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4091/
          -----------------------------------------------------------

          (Updated 2012-03-26 17:28:16.350869)

          Review request for pig, Julien Le Dem and Alan Gates.

          Changes
          -------

          A small update that gives RubySchema's the ability to manipulate the alias

          Summary
          -------

          This is the reviewboard for the following:

          https://issues.apache.org/jira/browse/PIG-2317

          This addresses bug PIG-2317.
          https://issues.apache.org/jira/browse/PIG-2317

          Diffs (updated)


          trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION
          trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION
          trunk/test/org/apache/pig/test/TestUDF.java 1294756
          trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION
          trunk/src/org/apache/pig/tools/counters/PigCounterHelper.java PRE-CREATION
          trunk/test/e2e/pig/build.xml 1304676
          trunk/test/e2e/pig/conf/default.conf 1304676
          trunk/test/e2e/pig/drivers/TestDriverPig.pm 1304676
          trunk/test/e2e/pig/tests/nightly.conf 1304676
          trunk/src/org/apache/pig/scripting/jruby/RubySchema.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/ScriptEngine.java 1304676
          trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION
          trunk/src/main/jruby/pigudf.rb PRE-CREATION
          trunk/src/PigudfService.java PRE-CREATION
          trunk/bin/pig 1304676
          trunk/build.xml 1304676
          trunk/ivy.xml 1304676
          trunk/ivy/libraries.properties 1304676

          Diff: https://reviews.apache.org/r/4091/diff

          Testing
          -------

          e2e tests can be run as follows:

          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local
          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4091/ ----------------------------------------------------------- (Updated 2012-03-26 17:28:16.350869) Review request for pig, Julien Le Dem and Alan Gates. Changes ------- A small update that gives RubySchema's the ability to manipulate the alias Summary ------- This is the reviewboard for the following: https://issues.apache.org/jira/browse/PIG-2317 This addresses bug PIG-2317 . https://issues.apache.org/jira/browse/PIG-2317 Diffs (updated) trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION trunk/test/org/apache/pig/test/TestUDF.java 1294756 trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION trunk/src/org/apache/pig/tools/counters/PigCounterHelper.java PRE-CREATION trunk/test/e2e/pig/build.xml 1304676 trunk/test/e2e/pig/conf/default.conf 1304676 trunk/test/e2e/pig/drivers/TestDriverPig.pm 1304676 trunk/test/e2e/pig/tests/nightly.conf 1304676 trunk/src/org/apache/pig/scripting/jruby/RubySchema.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/ScriptEngine.java 1304676 trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION trunk/src/main/jruby/pigudf.rb PRE-CREATION trunk/src/PigudfService.java PRE-CREATION trunk/bin/pig 1304676 trunk/build.xml 1304676 trunk/ivy.xml 1304676 trunk/ivy/libraries.properties 1304676 Diff: https://reviews.apache.org/r/4091/diff Testing ------- e2e tests can be run as follows: ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local Thanks, Jonathan
          Hide
          Daniel Dai added a comment -

          Seems I still need a small patch to make e2e tests work. I attach PIG-2317-10_plus.patch. Also seems you miss AppendIndex.java in the patch.

          Show
          Daniel Dai added a comment - Seems I still need a small patch to make e2e tests work. I attach PIG-2317 -10_plus.patch. Also seems you miss AppendIndex.java in the patch.
          Hide
          Jonathan Coveney added a comment -

          I've included the missing file.

          Daniel,

          Can you include what error you're getting? As is, there should be no issue as long as the pig udf has "require 'pigudf'". We are now packaging pigudf.rb in the jar file, and when they do "require 'pigudf'" the PigudfService.java class will pull in all of the required files.

          I have no problem running the tests without your patch so I'd like to get to the bottom of that. It's possible I have something else on my classpath or something.

          Thanks!
          Jon

          Show
          Jonathan Coveney added a comment - I've included the missing file. Daniel, Can you include what error you're getting? As is, there should be no issue as long as the pig udf has "require 'pigudf'". We are now packaging pigudf.rb in the jar file, and when they do "require 'pigudf'" the PigudfService.java class will pull in all of the required files. I have no problem running the tests without your patch so I'd like to get to the bottom of that. It's possible I have something else on my classpath or something. Thanks! Jon
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4091/
          -----------------------------------------------------------

          (Updated 2012-03-26 23:42:23.147374)

          Review request for pig, Julien Le Dem and Alan Gates.

          Summary
          -------

          This is the reviewboard for the following:

          https://issues.apache.org/jira/browse/PIG-2317

          This addresses bug PIG-2317.
          https://issues.apache.org/jira/browse/PIG-2317

          Diffs (updated)


          trunk/src/PigudfService.java PRE-CREATION
          trunk/bin/pig 1304676
          trunk/build.xml 1304676
          trunk/ivy.xml 1304676
          trunk/ivy/libraries.properties 1304676
          trunk/src/main/jruby/pigudf.rb PRE-CREATION
          trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/ScriptEngine.java 1304676
          trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubySchema.java PRE-CREATION
          trunk/src/org/apache/pig/tools/counters/PigCounterHelper.java PRE-CREATION
          trunk/test/e2e/pig/build.xml 1304676
          trunk/test/e2e/pig/conf/default.conf 1304676
          trunk/test/e2e/pig/drivers/TestDriverPig.pm 1304676
          trunk/test/e2e/pig/tests/nightly.conf 1304676
          trunk/test/e2e/pig/udfs/java/org/apache/pig/test/udf/evalfunc/AppendIndex.java PRE-CREATION
          trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION
          trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION
          trunk/test/org/apache/pig/test/TestUDF.java 1294756
          trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION

          Diff: https://reviews.apache.org/r/4091/diff

          Testing
          -------

          e2e tests can be run as follows:

          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local
          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4091/ ----------------------------------------------------------- (Updated 2012-03-26 23:42:23.147374) Review request for pig, Julien Le Dem and Alan Gates. Summary ------- This is the reviewboard for the following: https://issues.apache.org/jira/browse/PIG-2317 This addresses bug PIG-2317 . https://issues.apache.org/jira/browse/PIG-2317 Diffs (updated) trunk/src/PigudfService.java PRE-CREATION trunk/bin/pig 1304676 trunk/build.xml 1304676 trunk/ivy.xml 1304676 trunk/ivy/libraries.properties 1304676 trunk/src/main/jruby/pigudf.rb PRE-CREATION trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/ScriptEngine.java 1304676 trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubySchema.java PRE-CREATION trunk/src/org/apache/pig/tools/counters/PigCounterHelper.java PRE-CREATION trunk/test/e2e/pig/build.xml 1304676 trunk/test/e2e/pig/conf/default.conf 1304676 trunk/test/e2e/pig/drivers/TestDriverPig.pm 1304676 trunk/test/e2e/pig/tests/nightly.conf 1304676 trunk/test/e2e/pig/udfs/java/org/apache/pig/test/udf/evalfunc/AppendIndex.java PRE-CREATION trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION trunk/test/org/apache/pig/test/TestUDF.java 1294756 trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION Diff: https://reviews.apache.org/r/4091/diff Testing ------- e2e tests can be run as follows: ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local Thanks, Jonathan
          Hide
          Daniel Dai added a comment -

          The error I see is Pig does not ship jruby.jar. We shall put it into scriptJars if jruby is used. Also putting pigudf.rb into pig.jar does not mean it will ship to backend, we need to explicitly add it.

          Show
          Daniel Dai added a comment - The error I see is Pig does not ship jruby.jar. We shall put it into scriptJars if jruby is used. Also putting pigudf.rb into pig.jar does not mean it will ship to backend, we need to explicitly add it.
          Hide
          Jonathan Coveney added a comment -

          Daniel,

          As always, thanks for your input. This patch reflects your comments.

          Show
          Jonathan Coveney added a comment - Daniel, As always, thanks for your input. This patch reflects your comments.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4091/
          -----------------------------------------------------------

          (Updated 2012-03-27 00:11:05.270583)

          Review request for pig, Julien Le Dem and Alan Gates.

          Summary
          -------

          This is the reviewboard for the following:

          https://issues.apache.org/jira/browse/PIG-2317

          This addresses bug PIG-2317.
          https://issues.apache.org/jira/browse/PIG-2317

          Diffs (updated)


          trunk/bin/pig 1304676
          trunk/build.xml 1304676
          trunk/ivy.xml 1304676
          trunk/ivy/libraries.properties 1304676
          trunk/src/PigudfService.java PRE-CREATION
          trunk/src/main/jruby/pigudf.rb PRE-CREATION
          trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/ScriptEngine.java 1304676
          trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubySchema.java PRE-CREATION
          trunk/src/org/apache/pig/tools/counters/PigCounterHelper.java PRE-CREATION
          trunk/test/e2e/pig/build.xml 1304676
          trunk/test/e2e/pig/conf/default.conf 1304676
          trunk/test/e2e/pig/drivers/TestDriverPig.pm 1304676
          trunk/test/e2e/pig/tests/nightly.conf 1304676
          trunk/test/e2e/pig/udfs/java/org/apache/pig/test/udf/evalfunc/AppendIndex.java PRE-CREATION
          trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION
          trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION
          trunk/test/org/apache/pig/test/TestUDF.java 1294756
          trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION

          Diff: https://reviews.apache.org/r/4091/diff

          Testing
          -------

          e2e tests can be run as follows:

          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local
          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4091/ ----------------------------------------------------------- (Updated 2012-03-27 00:11:05.270583) Review request for pig, Julien Le Dem and Alan Gates. Summary ------- This is the reviewboard for the following: https://issues.apache.org/jira/browse/PIG-2317 This addresses bug PIG-2317 . https://issues.apache.org/jira/browse/PIG-2317 Diffs (updated) trunk/bin/pig 1304676 trunk/build.xml 1304676 trunk/ivy.xml 1304676 trunk/ivy/libraries.properties 1304676 trunk/src/PigudfService.java PRE-CREATION trunk/src/main/jruby/pigudf.rb PRE-CREATION trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/ScriptEngine.java 1304676 trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyFilterFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubySchema.java PRE-CREATION trunk/src/org/apache/pig/tools/counters/PigCounterHelper.java PRE-CREATION trunk/test/e2e/pig/build.xml 1304676 trunk/test/e2e/pig/conf/default.conf 1304676 trunk/test/e2e/pig/drivers/TestDriverPig.pm 1304676 trunk/test/e2e/pig/tests/nightly.conf 1304676 trunk/test/e2e/pig/udfs/java/org/apache/pig/test/udf/evalfunc/AppendIndex.java PRE-CREATION trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION trunk/test/org/apache/pig/test/TestUDF.java 1294756 trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION Diff: https://reviews.apache.org/r/4091/diff Testing ------- e2e tests can be run as follows: ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local Thanks, Jonathan
          Hide
          Jonathan Coveney added a comment -

          Incorporated Daniel's wisdom, and removed the separate FilterFunc (unnecessary now that Boolean is a type). Some other random small fixes.

          Show
          Jonathan Coveney added a comment - Incorporated Daniel's wisdom, and removed the separate FilterFunc (unnecessary now that Boolean is a type). Some other random small fixes.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4091/
          -----------------------------------------------------------

          (Updated 2012-03-27 07:11:17.441864)

          Review request for pig, Julien Le Dem and Alan Gates.

          Summary
          -------

          This is the reviewboard for the following:

          https://issues.apache.org/jira/browse/PIG-2317

          This addresses bug PIG-2317.
          https://issues.apache.org/jira/browse/PIG-2317

          Diffs (updated)


          trunk/bin/pig 1305733
          trunk/build.xml 1305733
          trunk/ivy.xml 1305733
          trunk/ivy/libraries.properties 1305733
          trunk/src/PigudfService.java PRE-CREATION
          trunk/src/main/jruby/pigudf.rb PRE-CREATION
          trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/ScriptEngine.java 1305733
          trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION
          trunk/src/org/apache/pig/scripting/jruby/RubySchema.java PRE-CREATION
          trunk/src/org/apache/pig/tools/counters/PigCounterHelper.java PRE-CREATION
          trunk/test/e2e/pig/build.xml 1305733
          trunk/test/e2e/pig/conf/default.conf 1305733
          trunk/test/e2e/pig/conf/local.conf 1305733
          trunk/test/e2e/pig/drivers/TestDriverPig.pm 1305733
          trunk/test/e2e/pig/drivers/Util.pm 1305733
          trunk/test/e2e/pig/tests/nightly.conf 1305733
          trunk/test/e2e/pig/udfs/java/org/apache/pig/test/udf/evalfunc/AppendIndex.java PRE-CREATION
          trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION
          trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION
          trunk/test/org/apache/pig/test/TestUDF.java 1294756
          trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION

          Diff: https://reviews.apache.org/r/4091/diff

          Testing
          -------

          e2e tests can be run as follows:

          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local
          ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4091/ ----------------------------------------------------------- (Updated 2012-03-27 07:11:17.441864) Review request for pig, Julien Le Dem and Alan Gates. Summary ------- This is the reviewboard for the following: https://issues.apache.org/jira/browse/PIG-2317 This addresses bug PIG-2317 . https://issues.apache.org/jira/browse/PIG-2317 Diffs (updated) trunk/bin/pig 1305733 trunk/build.xml 1305733 trunk/ivy.xml 1305733 trunk/ivy/libraries.properties 1305733 trunk/src/PigudfService.java PRE-CREATION trunk/src/main/jruby/pigudf.rb PRE-CREATION trunk/src/org/apache/pig/AccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/AlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/ScriptEngine.java 1305733 trunk/src/org/apache/pig/scripting/jruby/JrubyAccumulatorEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyAlgebraicEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyEvalFunc.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/JrubyScriptEngine.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/PigJrubyLibrary.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataBag.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubyDataByteArray.java PRE-CREATION trunk/src/org/apache/pig/scripting/jruby/RubySchema.java PRE-CREATION trunk/src/org/apache/pig/tools/counters/PigCounterHelper.java PRE-CREATION trunk/test/e2e/pig/build.xml 1305733 trunk/test/e2e/pig/conf/default.conf 1305733 trunk/test/e2e/pig/conf/local.conf 1305733 trunk/test/e2e/pig/drivers/TestDriverPig.pm 1305733 trunk/test/e2e/pig/drivers/Util.pm 1305733 trunk/test/e2e/pig/tests/nightly.conf 1305733 trunk/test/e2e/pig/udfs/java/org/apache/pig/test/udf/evalfunc/AppendIndex.java PRE-CREATION trunk/test/e2e/pig/udfs/ruby/morerubyudfs.rb PRE-CREATION trunk/test/e2e/pig/udfs/ruby/scriptingudfs.rb PRE-CREATION trunk/test/org/apache/pig/test/TestUDF.java 1294756 trunk/test/org/apache/pig/test/utils/HelperEvalFuncUtils.java PRE-CREATION Diff: https://reviews.apache.org/r/4091/diff Testing ------- e2e tests can be run as follows: ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> test-e2e-deploy-local ant -Dharness.hadoop.home=<path_to_hadoop> -Dharness.old.pig=<path_to_old_pig> -Dtests.to.run="-t RubyUDFs" test-e2e-local Thanks, Jonathan
          Hide
          Daniel Dai added a comment -

          e2e tests run for me. However, one other test which runs before fail:

          register 'udf2.rb' using jruby as myfuncs;
          a = load 'student.txt' using PigStorage() as (name, age:int, gpa:double);
          b = foreach a generate myfuncs.squaresquare(age);
          dump b;

          udf2.rb:
          require 'pigudf'
          require 'morerubyudfs'

          class TestUDFs < PigUdf
          def squaresquare(num)
          return Myudfs.square(num)**2 if num
          end
          end

          morerubyudfs.rb:
          require 'pigudf'
          class Myudfs < PigUdf
          def Myudfs.square(num)
          return num**2 if num
          end

          def Myudfs.cube(num)
          return num**3 if num
          end
          end

          libsToShip() get morerubyudfs.rb as relative path before, but now I get absolute path. I will trace more tomorrow.

          Show
          Daniel Dai added a comment - e2e tests run for me. However, one other test which runs before fail: register 'udf2.rb' using jruby as myfuncs; a = load 'student.txt' using PigStorage() as (name, age:int, gpa:double); b = foreach a generate myfuncs.squaresquare(age); dump b; udf2.rb: require 'pigudf' require 'morerubyudfs' class TestUDFs < PigUdf def squaresquare(num) return Myudfs.square(num)**2 if num end end morerubyudfs.rb: require 'pigudf' class Myudfs < PigUdf def Myudfs.square(num) return num**2 if num end def Myudfs.cube(num) return num**3 if num end end libsToShip() get morerubyudfs.rb as relative path before, but now I get absolute path. I will trace more tomorrow.
          Hide
          Jonathan Coveney added a comment -

          Daniel,

          a) Is that from a separate test you're running? I don't see that in my directory.
          b) What is your command for running it? Are you running it locally, or in m/r mode?

          I'm just curious why we get different results when running... but I really appreciate your help in getting this production ready.

          Show
          Jonathan Coveney added a comment - Daniel, a) Is that from a separate test you're running? I don't see that in my directory. b) What is your command for running it? Are you running it locally, or in m/r mode? I'm just curious why we get different results when running... but I really appreciate your help in getting this production ready.
          Hide
          Russell Jurney added a comment -

          Lots of patches... what is the process to use if people want to help test this?

          Show
          Russell Jurney added a comment - Lots of patches... what is the process to use if people want to help test this?
          Hide
          Jonathan Coveney added a comment -

          Russell: each pass (excluding a couple of the _plus patches), is the full patch. So you just download the patch, put it in the base directory of a clean-esque pig folder, and do "patch -p0 < PIG-2317_13.patch"

          At that point, you can register a script as:

          register 'udf2.rb' using jruby as myfuncs;

          And so on. Check out: test/e2e/pig/udfs/ruby/ for two scripts used in the e2e tests.

          Let me rebase a version for a clean apply against 0.10... I wasn't sure if we were going to have this be in the RC so I've been working against trunk. It should apply fine against 0.10, just a little offset stuff, so I'll do that in a little and attach it...but let me know if what I said above doesn't work or make sense once I do.

          Show
          Jonathan Coveney added a comment - Russell: each pass (excluding a couple of the _plus patches), is the full patch. So you just download the patch, put it in the base directory of a clean-esque pig folder, and do "patch -p0 < PIG-2317 _13.patch" At that point, you can register a script as: register 'udf2.rb' using jruby as myfuncs; And so on. Check out: test/e2e/pig/udfs/ruby/ for two scripts used in the e2e tests. Let me rebase a version for a clean apply against 0.10... I wasn't sure if we were going to have this be in the RC so I've been working against trunk. It should apply fine against 0.10, just a little offset stuff, so I'll do that in a little and attach it...but let me know if what I said above doesn't work or make sense once I do.
          Hide
          Jonathan Coveney added a comment -

          Russell, I've attached a patch that should apply cleanly to 0.10. From the base pig directory, do "patch -p0 < PIG-2317_13_0.10.patch" and let me know how it works for you!

          Show
          Jonathan Coveney added a comment - Russell, I've attached a patch that should apply cleanly to 0.10. From the base pig directory, do "patch -p0 < PIG-2317 _13_0.10.patch" and let me know how it works for you!
          Hide
          Daniel Dai added a comment -

          a) Is that from a separate test you're running? I don't see that in my directory.

          This is a separate test I create, which has recursive inclusion

          b) What is your command for running it? Are you running it locally, or in m/r mode?

          bin/pig rubytest.pig
          m/r

          Show
          Daniel Dai added a comment - a) Is that from a separate test you're running? I don't see that in my directory. This is a separate test I create, which has recursive inclusion b) What is your command for running it? Are you running it locally, or in m/r mode? bin/pig rubytest.pig m/r
          Hide
          Jonathan Coveney added a comment -

          Any desire to include that test in the e2e? I can take it on me to do so, if you like.

          Thanks Daniel

          Show
          Jonathan Coveney added a comment - Any desire to include that test in the e2e? I can take it on me to do so, if you like. Thanks Daniel
          Hide
          Daniel Dai added a comment -

          That will be great!

          Show
          Daniel Dai added a comment - That will be great!
          Hide
          Jonathan Coveney added a comment -

          Cool, just attach it and I'll go to town.

          Show
          Jonathan Coveney added a comment - Cool, just attach it and I'll go to town.
          Hide
          Daniel Dai added a comment -

          It's in my previous comments. I attach recursiveinclusion.tar.gz to make it more clear.

          Show
          Daniel Dai added a comment - It's in my previous comments. I attach recursiveinclusion.tar.gz to make it more clear.
          Hide
          Daniel Dai added a comment -

          Seems it is related to ruby 1.9 change. If I use 1.8, I can ship the correct ruby dependencies (though fail due to syntax incompatibility later)

          Show
          Daniel Dai added a comment - Seems it is related to ruby 1.9 change. If I use 1.8, I can ship the correct ruby dependencies (though fail due to syntax incompatibility later)
          Hide
          Jonathan Coveney added a comment -

          This includes Daniel's test, and it works on my computer. the 0.10 patch is nearly identical, just rebased on 0.10.

          Show
          Jonathan Coveney added a comment - This includes Daniel's test, and it works on my computer. the 0.10 patch is nearly identical, just rebased on 0.10.
          Hide
          Daniel Dai added a comment -

          Here is what I get for RubyUDFs_14:
          org.jruby.embed.EvalFailedException: (LoadError) no such file to load – ./libexec/ruby/morerubyudfs
          at org.jruby.embed.internal.EmbedEvalUnitImpl.run(EmbedEvalUnitImpl.java:132)
          at org.jruby.embed.ScriptingContainer.runUnit(ScriptingContainer.java:1263)
          at org.jruby.embed.ScriptingContainer.runScriptlet(ScriptingContainer.java:1294)
          at org.apache.pig.scripting.jruby.JrubyScriptEngine$RubyFunctions.getFromCache(JrubyScriptEngine.java:103)
          at org.apache.pig.scripting.jruby.JrubyScriptEngine$RubyFunctions.getFunctions(JrubyScriptEngine.java:119)
          at org.apache.pig.scripting.jruby.JrubyEvalFunc.initialize(JrubyEvalFunc.java:87)
          at org.apache.pig.scripting.jruby.JrubyEvalFunc.exec(JrubyEvalFunc.java:103)
          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:225)
          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:284)
          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:320)
          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:271)
          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
          at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
          at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
          at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:396)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
          at org.apache.hadoop.mapred.Child.main(Child.java:249)
          Caused by: org.jruby.exceptions.RaiseException: (LoadError) no such file to load – ./libexec/ruby/morerubyudfs

          Show
          Daniel Dai added a comment - Here is what I get for RubyUDFs_14: org.jruby.embed.EvalFailedException: (LoadError) no such file to load – ./libexec/ruby/morerubyudfs at org.jruby.embed.internal.EmbedEvalUnitImpl.run(EmbedEvalUnitImpl.java:132) at org.jruby.embed.ScriptingContainer.runUnit(ScriptingContainer.java:1263) at org.jruby.embed.ScriptingContainer.runScriptlet(ScriptingContainer.java:1294) at org.apache.pig.scripting.jruby.JrubyScriptEngine$RubyFunctions.getFromCache(JrubyScriptEngine.java:103) at org.apache.pig.scripting.jruby.JrubyScriptEngine$RubyFunctions.getFunctions(JrubyScriptEngine.java:119) at org.apache.pig.scripting.jruby.JrubyEvalFunc.initialize(JrubyEvalFunc.java:87) at org.apache.pig.scripting.jruby.JrubyEvalFunc.exec(JrubyEvalFunc.java:103) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:225) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:284) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:320) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:271) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083) at org.apache.hadoop.mapred.Child.main(Child.java:249) Caused by: org.jruby.exceptions.RaiseException: (LoadError) no such file to load – ./libexec/ruby/morerubyudfs
          Hide
          Jonathan Coveney added a comment -

          Hmm, that's vexing. I was getting that information, but now I am not. Are you running the test in local mode, or in M/R mode?

          Show
          Jonathan Coveney added a comment - Hmm, that's vexing. I was getting that information, but now I am not. Are you running the test in local mode, or in M/R mode?
          Hide
          Daniel Dai added a comment -

          MR mode. Are you able to check what's in your job.jar? Do you have morerubyudfs.rb in job.jar or /home/xxx/pig/morerubyudfs.rb?

          Show
          Daniel Dai added a comment - MR mode. Are you able to check what's in your job.jar? Do you have morerubyudfs.rb in job.jar or /home/xxx/pig/morerubyudfs.rb?
          Hide
          Daniel Dai added a comment -

          I think it's fine to check in the patch minus RubyUDFs_14, we can open a Jira to track recursive inclusion later.

          Show
          Daniel Dai added a comment - I think it's fine to check in the patch minus RubyUDFs_14, we can open a Jira to track recursive inclusion later.
          Hide
          Jonathan Coveney added a comment -

          I agree, Daniel. We can document any such limitations and open JIRA's to enhance them. Currently I can think of this...anything else we should clean up later?

          And separately, in JRuby news, I should probably open up a JIRA about doing control flow in JRuby

          Show
          Jonathan Coveney added a comment - I agree, Daniel. We can document any such limitations and open JIRA's to enhance them. Currently I can think of this...anything else we should clean up later? And separately, in JRuby news, I should probably open up a JIRA about doing control flow in JRuby
          Hide
          Daniel Dai added a comment -

          Sounds good. Let's try to commit it today.

          Show
          Daniel Dai added a comment - Sounds good. Let's try to commit it today.
          Hide
          Daniel Dai added a comment -

          What is PigudfService.java doing? Who's using it?

          Show
          Daniel Dai added a comment - What is PigudfService.java doing? Who's using it?
          Hide
          Jonathan Coveney added a comment -

          PigudfService.java is what allows "require 'pigudf'" to work. Given that pig.jar is on the classpath, when in JRuby you type "require x" it looks for x.jar in the classpath, and will load everything. This makes it so that user doesn't have to do "required 'long/path/to/pigudf.rb'"

          Show
          Jonathan Coveney added a comment - PigudfService.java is what allows "require 'pigudf'" to work. Given that pig.jar is on the classpath, when in JRuby you type "require x" it looks for x.jar in the classpath, and will load everything. This makes it so that user doesn't have to do "required 'long/path/to/pigudf.rb'"
          Hide
          Daniel Dai added a comment -

          Do we need to tell JRuby to lookup PigudfService? Or JRuby will automatically pickup all BasicLibraryService?

          Show
          Daniel Dai added a comment - Do we need to tell JRuby to lookup PigudfService? Or JRuby will automatically pickup all BasicLibraryService?
          Hide
          Daniel Dai added a comment -

          PIG-2317_15.patch fix some minor issues and findbugs/javac warnings. All tests pass for me. I am going to commit the patch soon.

          Show
          Daniel Dai added a comment - PIG-2317 _15.patch fix some minor issues and findbugs/javac warnings. All tests pass for me. I am going to commit the patch soon.
          Hide
          Daniel Dai added a comment -

          Another minor fix to make rpm distribution work.

          Show
          Daniel Dai added a comment - Another minor fix to make rpm distribution work.
          Hide
          Daniel Dai added a comment -

          The same patch for 0.10.

          Show
          Daniel Dai added a comment - The same patch for 0.10.
          Hide
          Daniel Dai added a comment -

          test-patch:
          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 18 new or modified tests.
          [exec]
          [exec] -1 javadoc. The javadoc tool appears to have generated 1 warning messages.
          [exec]
          [exec] -1 javac. The applied patch generated 915 javac compiler warnings (more than the trunk's current 0 warnings).
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          All relavent javac warnings are fixed. javadoc warning is not related.

          Patch committed to 0.10/trunk.

          Great job, Jonathan!

          We need to open additional Jira:
          1. make recursive inclusion work
          2. Add main function for JRuby

          Show
          Daniel Dai added a comment - test-patch: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 18 new or modified tests. [exec] [exec] -1 javadoc. The javadoc tool appears to have generated 1 warning messages. [exec] [exec] -1 javac. The applied patch generated 915 javac compiler warnings (more than the trunk's current 0 warnings). [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. All relavent javac warnings are fixed. javadoc warning is not related. Patch committed to 0.10/trunk. Great job, Jonathan! We need to open additional Jira: 1. make recursive inclusion work 2. Add main function for JRuby
          Hide
          Dmitriy V. Ryaboy added a comment -

          Congrats Jonathan! Huge.

          To add to Daniel's list:

          3. Documentation.

          Show
          Dmitriy V. Ryaboy added a comment - Congrats Jonathan! Huge. To add to Daniel's list: 3. Documentation.
          Hide
          Daniel Dai added a comment -

          Document is in PIG-2601. Can anyone review it?

          Show
          Daniel Dai added a comment - Document is in PIG-2601 . Can anyone review it?

            People

            • Assignee:
              Jonathan Coveney
              Reporter:
              Jacob Perkins
            • Votes:
              2 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development