Pig
  1. Pig
  2. PIG-2643

Use bytecode generation to make a performance replacement for InvokeForLong, InvokeForString, etc

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels:
    • Patch Info:
      Patch Available

      Description

      This is basically to cut my teeth for much more ambitious code generation down the line, but I think it could be performance and useful.

      the new syntax is:

      a = load 'thing' as (x:chararray);
      define concat InvokerGenerator('java.lang.String','concat','String');
      define valueOf InvokerGenerator('java.lang.Integer','valueOf','String');
      define valueOfRadix InvokerGenerator('java.lang.Integer','valueOf','String,int');
      
      b = foreach a generate x, valueOf(x) as vOf;
      c = foreach b generate x, vOf, valueOfRadix(x, 16) as vOfR;
      d = foreach c generate x, vOf, vOfR, concat(concat(x, (chararray)vOf), (chararray)vOfR);
      
      dump d;
      

      There are some differences between this version and Dmitriy's implementation:

      • it is no longer necessary to declare whether the method is static or not. This is gleaned via reflection.
      • as per the above, it is no longer necessary to make the first argument be the type of the object to invoke the method on. If it is not a static method, then the type will implicitly be the type you need. So in the case of concat, it would need to be passed a tuple of two inputs: one for the method to be called against (as it is not static), and then the 'string' that was specified. In the case of valueOf, because it IS static, then the 'String' is the only value.
      • The arguments are type sensitive. Integer means the Object Integer, whereas int (or long, or float, or boolean, etc) refer to the primitive. This is necessary to properly reflect the arguments. Values passed in WILL, however, be properly unboxed as necessary.
      • The return type will be reflected.

      This uses the ASM API to generate the bytecode, and then a custom classloader to load it in. I will add caching of the generated code based on the input strings, etc, but I wanted to get eyes and opinions on this. I also need to benchmark, but it should be native speed (excluding a little startup time to make the bytecode, but ASM is really fast).

      Another nice benefit is that this bypasses the need for the JDK, though it adds a dependency on ASM (which is a super tiny dependency).

      Patch incoming.

      1. PIG-2643-0.patch
        13 kB
        Jonathan Coveney
      2. PIG-2643-1.patch
        16 kB
        Jonathan Coveney

        Issue Links

          Activity

          Hide
          Jonathan Coveney added a comment -

          Oh, I also should add tests. Can leverage the tests Dmitriy used for his.

          Show
          Jonathan Coveney added a comment - Oh, I also should add tests. Can leverage the tests Dmitriy used for his.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Jon, the problem with non-static methods was that what people really wanted was to be able to construct the object to call methods on (as opposed to just invoking the no-arg constructor for every invocation). Any thoughts on how to achieve that?

          Show
          Dmitriy V. Ryaboy added a comment - Jon, the problem with non-static methods was that what people really wanted was to be able to construct the object to call methods on (as opposed to just invoking the no-arg constructor for every invocation). Any thoughts on how to achieve that?
          Hide
          Jonathan Coveney added a comment -

          I'm not quite sure what you mean...could you illustrate it with a quick code snippet?

          Show
          Jonathan Coveney added a comment - I'm not quite sure what you mean...could you illustrate it with a quick code snippet?
          Hide
          Jonathan Coveney added a comment -

          I am thinking about this and what I think you mean, is that if you say

          define concat InvokerGenerator('java.lang.String','concat','String')
          

          you don't want it to do:

          return new String().concat((String)input.get(0));
          

          So what it does is it uses the argument list (the 3rd parameter) to find a matching method. If it is not static, then it will expect n+1 arguments, where the 1st one will be an instance of the object on the left (in the case java.lang.String) which will serve as a method receiver. So in the case of concat, it would expect 2 Strings. If it is static, it only expects the argument list.

          Is that what you were talking about, or something else?

          Show
          Jonathan Coveney added a comment - I am thinking about this and what I think you mean, is that if you say define concat InvokerGenerator('java.lang. String ','concat',' String ') you don't want it to do: return new String ().concat(( String )input.get(0)); So what it does is it uses the argument list (the 3rd parameter) to find a matching method. If it is not static, then it will expect n+1 arguments, where the 1st one will be an instance of the object on the left (in the case java.lang.String) which will serve as a method receiver. So in the case of concat, it would expect 2 Strings. If it is static, it only expects the argument list. Is that what you were talking about, or something else?
          Hide
          Dmitriy V. Ryaboy added a comment -

          Not quite – what I mean is that people want to be able to say, effectively:

          Foo foo = new Foo(1, 2, 'some_string');
          
          for (Tuple t : tuples) {
            foo.someMethod(t);
          }
          

          I think the problem with invokers isn't that they are slow. It's that they are cumbersome to use. Not sure making them faster is really the place to focus.. integrating them deeper would be interesting (why can't we just see that you are saying java.lang.String.concat('foo', $1) and auto-generate or auto-wrap a UDF?).

          Show
          Dmitriy V. Ryaboy added a comment - Not quite – what I mean is that people want to be able to say, effectively: Foo foo = new Foo(1, 2, 'some_string'); for (Tuple t : tuples) { foo.someMethod(t); } I think the problem with invokers isn't that they are slow. It's that they are cumbersome to use. Not sure making them faster is really the place to focus.. integrating them deeper would be interesting (why can't we just see that you are saying java.lang.String.concat('foo', $1) and auto-generate or auto-wrap a UDF?).
          Hide
          Jonathan Coveney added a comment -

          Ah, ok. Well, first, since we have a more performance, less verbose syntax (mainly not having to say "InvokeForLong" vs "InvokeForString" and so on), I think it's worth including it because it IS faster and cleaner, though I agree that the focus now should be on filling a niche that doesn't currently exist. As I said before, the work so far was in a big part to be a small project to begin to work with ASM with, and benefit pig on the side.

          I do like the idea of potentially supporting math function syntax, and then behind the scenes generating the scaffolding. I like that idea a lot. Will mull on how it'd be implemented. Perhaps a first pass would be to support a MATH keyword where if you do MATH.operator(stuff, stuff) it generates the scaffolding, and then it can get more generic? I don't really know how to do this without adding keywords... hmm hmm hmm. Would love thoughts in that vein.

          But what you bring up is an interesting use case...sort of the generation of UDF's based on some class that exists. What your proposing sounds like we could generate an accumulator UDF that would apply to any case where you have an object that you instantiate on the mapper, then stream everything through and return. Ideally we'd provide an interface that objects could implement that would serve as the bridge. Perhaps something like

          public Object eval(Object... o) throws IOException;
          

          that way they don't even have to depend on pig in the object?

          Show
          Jonathan Coveney added a comment - Ah, ok. Well, first, since we have a more performance, less verbose syntax (mainly not having to say "InvokeForLong" vs "InvokeForString" and so on), I think it's worth including it because it IS faster and cleaner, though I agree that the focus now should be on filling a niche that doesn't currently exist. As I said before, the work so far was in a big part to be a small project to begin to work with ASM with, and benefit pig on the side. I do like the idea of potentially supporting math function syntax, and then behind the scenes generating the scaffolding. I like that idea a lot. Will mull on how it'd be implemented. Perhaps a first pass would be to support a MATH keyword where if you do MATH.operator(stuff, stuff) it generates the scaffolding, and then it can get more generic? I don't really know how to do this without adding keywords... hmm hmm hmm. Would love thoughts in that vein. But what you bring up is an interesting use case...sort of the generation of UDF's based on some class that exists. What your proposing sounds like we could generate an accumulator UDF that would apply to any case where you have an object that you instantiate on the mapper, then stream everything through and return. Ideally we'd provide an interface that objects could implement that would serve as the bridge. Perhaps something like public Object eval( Object ... o) throws IOException; that way they don't even have to depend on pig in the object?
          Hide
          Dmitriy V. Ryaboy added a comment -

          I'm a bit confused about the math keyword stuff. I meant just generically being able to invoke (or generate code that wraps) methods on objects, the way one would initialize an object inside a UDF constructor and then call methods on that method during UDF exec methods.

          Show
          Dmitriy V. Ryaboy added a comment - I'm a bit confused about the math keyword stuff. I meant just generically being able to invoke (or generate code that wraps) methods on objects, the way one would initialize an object inside a UDF constructor and then call methods on that method during UDF exec methods.
          Hide
          Julien Le Dem added a comment -

          what about this?

          • call of a static method the same way a non-defined UDF can be:
            a = load 'thing' as (x:long);
            b = foreach a generate java.lang.Math.sqrt(x) as sqrt;
            
          • definition of a UDF through an expression (for a strict definition of those expressions).
            c = load 'thing' as (x:bag{t:(v:chararray)});
            DEFINE join com.google.common.base.Joiner.on('-').skipNulls().join;
            d = foreach c generate join(x);
            

            An expression being defined as follows:

            {class}.{static method|constructor}([{primitive value}[,{primitive value}]*])[.{method}([{primitive value}[,{primitive value}]*])]*.{method}
            

            The exact method to use can be evaluated through reflection using the expression and input schema.

          • For methods called on the incoming objects (ex: concat) there is a finite set of those as they are the methods available on the types supported by Pig, they can just all be defined as builtins once and for all. That would simplify the other cases above.
          Show
          Julien Le Dem added a comment - what about this? call of a static method the same way a non-defined UDF can be: a = load 'thing' as (x: long ); b = foreach a generate java.lang. Math .sqrt(x) as sqrt; definition of a UDF through an expression (for a strict definition of those expressions). c = load 'thing' as (x:bag{t:(v:chararray)}); DEFINE join com.google.common.base.Joiner.on('-').skipNulls().join; d = foreach c generate join(x); An expression being defined as follows: {class}.{ static method|constructor}([{primitive value}[,{primitive value}]*])[.{method}([{primitive value}[,{primitive value}]*])]*.{method} The exact method to use can be evaluated through reflection using the expression and input schema. For methods called on the incoming objects (ex: concat) there is a finite set of those as they are the methods available on the types supported by Pig, they can just all be defined as builtins once and for all. That would simplify the other cases above.
          Hide
          Jonathan Coveney added a comment -

          I don't know that I like including all of the methods for types supported by Pig...if anything, I think we should try and move towards a solution that requires less code like that. Why cruft up the code base with a bunch of wrappers?

          In that vein, I really like your first proposal. I think it's definitely the direction we should go. As far as a syntax to allow people to call methods that require an object, how about:

          a = load 'thing' as (a:chararray, b:chararray);
          b = foreach a generate a:sqrt(b) as sqrt;
          

          I'd love to use $, but I see the potential for namespace collision to be too great, not to mention ambiguity on the parser. It doesn't have to be : of course, but I don't think . will work. But perhaps I'm wrong? Either way, I think this is one of those cases where we should shoot for a really usable syntax. In both of these cases, we could use the InvokerGenerater and reflection to figure out all of the necessary code.

          As far as the defintion of the UDF, I think that the mentioned approach is not a bad one. Another possibility:

          c = load 'thing' as (x:bag{t:(v:chararray)});
          DEFINE joiner NEW com.google.common.base.Joiner.on('-').skipNulls();
          d = foreach c generate joiner:join(x);
          

          This would introduce a new keyword which would allow us to more succinctly reference one object with various methods we want to use. The define would register this string in the namespace, and then when we see a :, first we see if what is to the left is a relation, then we see if it is in the object space. If it is, then we can build up the UDF.

          I think we're heading in the right direction, whatever we choose.

          Show
          Jonathan Coveney added a comment - I don't know that I like including all of the methods for types supported by Pig...if anything, I think we should try and move towards a solution that requires less code like that. Why cruft up the code base with a bunch of wrappers? In that vein, I really like your first proposal. I think it's definitely the direction we should go. As far as a syntax to allow people to call methods that require an object, how about: a = load 'thing' as (a:chararray, b:chararray); b = foreach a generate a:sqrt(b) as sqrt; I'd love to use $, but I see the potential for namespace collision to be too great, not to mention ambiguity on the parser. It doesn't have to be : of course, but I don't think . will work. But perhaps I'm wrong? Either way, I think this is one of those cases where we should shoot for a really usable syntax. In both of these cases, we could use the InvokerGenerater and reflection to figure out all of the necessary code. As far as the defintion of the UDF, I think that the mentioned approach is not a bad one. Another possibility: c = load 'thing' as (x:bag{t:(v:chararray)}); DEFINE joiner NEW com.google.common.base.Joiner.on('-').skipNulls(); d = foreach c generate joiner:join(x); This would introduce a new keyword which would allow us to more succinctly reference one object with various methods we want to use. The define would register this string in the namespace, and then when we see a :, first we see if what is to the left is a relation, then we see if it is in the object space. If it is, then we can build up the UDF. I think we're heading in the right direction, whatever we choose.
          Hide
          Jonathan Coveney added a comment -

          Also, as an aside, I think it'd be awesome to make it so people didn't have to do the fully qualified classname, at least for stuff in java.lang... is there a reason why java.lang can't be added to the list returned by PigContext.getPackageImportList()? I suppose people can add it to the pig properties or whatever, but it seems like an intelligent base, at least?

          Show
          Jonathan Coveney added a comment - Also, as an aside, I think it'd be awesome to make it so people didn't have to do the fully qualified classname, at least for stuff in java.lang... is there a reason why java.lang can't be added to the list returned by PigContext.getPackageImportList()? I suppose people can add it to the pig properties or whatever, but it seems like an intelligent base, at least?
          Hide
          Dmitriy V. Ryaboy added a comment -

          Love Julien's proposal, that reads like what I expect it to read.
          Except at some point down that road we discover ourselves writing java without IDE support..
          Is supporting chaining method calls a bit much?

          Jon, I'm not following your bit of sample syntax with the ":" – possibly due to variable collision (is "a" a defined object, or a relation?)

          Show
          Dmitriy V. Ryaboy added a comment - Love Julien's proposal, that reads like what I expect it to read. Except at some point down that road we discover ourselves writing java without IDE support.. Is supporting chaining method calls a bit much? Jon, I'm not following your bit of sample syntax with the ":" – possibly due to variable collision (is "a" a defined object, or a relation?)
          Hide
          Jonathan Coveney added a comment -

          I was worried about the "writing java without IDE support" issue, but I think as long as our scope is narrow, the win is worth it.

          I like Julien's proposal as well, but I guess I feel like we might as well push it to the next level?

          DEFINE join com.google.common.base.Joiner.on('-').skipNulls().join;
          d = foreach c generate join(x);
          

          the hanging "join" at the end of the define statement seems odd to me. Why not just let people call wahtever method they want?

          And Dmitriy, I guess the ":" syntax is a little awkward, but the idea is that if you had "relation:method(relation*)", it invoke that method on the relation with the appropriate arguments. Or, in the same vein, if you had "joiner:join(relation*)", it'd invoke the method on the object that will be created viz. the DEFINE statement.

          I think some sort of syntax allowing us to call methods of various pig types directly would be pretty neat, though. The syntax could be something bigger to highlight that it's kind of a big thing. "joiner=>join(relation*)", I dunno.

          Show
          Jonathan Coveney added a comment - I was worried about the "writing java without IDE support" issue, but I think as long as our scope is narrow, the win is worth it. I like Julien's proposal as well, but I guess I feel like we might as well push it to the next level? DEFINE join com.google.common.base.Joiner.on('-').skipNulls().join; d = foreach c generate join(x); the hanging "join" at the end of the define statement seems odd to me. Why not just let people call wahtever method they want? And Dmitriy, I guess the ":" syntax is a little awkward, but the idea is that if you had "relation:method(relation*)", it invoke that method on the relation with the appropriate arguments. Or, in the same vein, if you had "joiner:join(relation*)", it'd invoke the method on the object that will be created viz. the DEFINE statement. I think some sort of syntax allowing us to call methods of various pig types directly would be pretty neat, though. The syntax could be something bigger to highlight that it's kind of a big thing. "joiner=>join(relation*)", I dunno.
          Hide
          Scott Carey added a comment -

          Another thought for this sort of thing:

          This might be achievable without bytecode generation and good performance with Java 7 MethodHandles [1][2]. Of course, that would require Java 7, but Java 6 support ends later year [3], about the time Pig 0.11 would be out anyway.

          [1] http://docs.oracle.com/javase/7/docs/api/java/lang/invoke/MethodHandle.html
          [2] http://stackoverflow.com/questions/8823793/methodhandle-what-is-it-all-about
          [3] https://blogs.oracle.com/henrik/entry/updated_java_6_eol_date

          Show
          Scott Carey added a comment - Another thought for this sort of thing: This might be achievable without bytecode generation and good performance with Java 7 MethodHandles [1] [2] . Of course, that would require Java 7, but Java 6 support ends later year [3] , about the time Pig 0.11 would be out anyway. [1] http://docs.oracle.com/javase/7/docs/api/java/lang/invoke/MethodHandle.html [2] http://stackoverflow.com/questions/8823793/methodhandle-what-is-it-all-about [3] https://blogs.oracle.com/henrik/entry/updated_java_6_eol_date
          Hide
          Dmitriy V. Ryaboy added a comment -

          Scott, good idea but moving Pig to java 7 would require running on a Hadoop that's running on Java 7... that might take a while. Though I think Oracle's working on figuring out what that would take, now that they have a hadoop integration team.

          Show
          Dmitriy V. Ryaboy added a comment - Scott, good idea but moving Pig to java 7 would require running on a Hadoop that's running on Java 7... that might take a while. Though I think Oracle's working on figuring out what that would take, now that they have a hadoop integration team.
          Hide
          Julien Le Dem added a comment -

          this will go in a future version

          Show
          Julien Le Dem added a comment - this will go in a future version
          Hide
          Jonathan Coveney added a comment -

          I don't know why this never got +1'd, I think we got derailed be the conversation near the end. I have updated it and added some tests. I don't see why we shouldn't commit this? It's strictly better than what we have, and I will make a new JIRA for the broader issue of trying to get rid of having to make a builtin for everything.

          Show
          Jonathan Coveney added a comment - I don't know why this never got +1'd, I think we got derailed be the conversation near the end. I have updated it and added some tests. I don't see why we shouldn't commit this? It's strictly better than what we have, and I will make a new JIRA for the broader issue of trying to get rid of having to make a builtin for everything.
          Hide
          Jonathan Coveney added a comment -

          This stands alone, but has been greatly updated in PIG-3198, including some bug fixes...

          Show
          Jonathan Coveney added a comment - This stands alone, but has been greatly updated in PIG-3198 , including some bug fixes...
          Hide
          Jonathan Coveney added a comment -

          This was resolved as part of PIG-3198

          Show
          Jonathan Coveney added a comment - This was resolved as part of PIG-3198

            People

            • Assignee:
              Jonathan Coveney
              Reporter:
              Jonathan Coveney
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development