Pig / PIG-2359

Support more efficient Tuples when schemas are known

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels: None
    • Release Note:
      newTupleForSchema(Schema s) and newTupleForSchema(byte... types) methods are introduced to TupleFactory, allowing Pig to choose optimized Tuple implementations for specific schemas, when those are available.

      Two optimized implementations are introduced:
      - single-value Tuples (tuples which hold only a single int, double, String, etc.)
      - primitive-value Tuples (tuples which only hold values that can be represented by a fixed-width datatype in Java: int, long, double, boolean, float).

      Using such tuples can result in significant memory utilization improvements.

      Description

      Pig Tuples have significant overhead due to the fact that all the fields are Objects.
      When a Tuple only contains primitive fields (ints, longs, etc), it's possible to avoid this overhead, which would result in significant memory savings.
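To make the overhead concrete, here is a minimal sketch (hypothetical class names, not code from the attached patch) contrasting a generic Object-backed tuple with a schema-specialized one. In the generic case every int field becomes a boxed Integer, paying for an object header and a pointer indirection that the primitive field avoids:

```java
// Hypothetical sketch: a generic tuple stores every field as an Object,
// so an int field becomes a java.lang.Integer (object header + padding
// + the 4-byte value, reached through a pointer).
class GenericTuple {
    final Object[] fields;
    GenericTuple(int size) { this.fields = new Object[size]; }
    void set(int i, Object v) { fields[i] = v; }
    Object get(int i) { return fields[i]; }
}

// A schema-specialized tuple can hold the value directly: 4 bytes,
// no per-field object header, no indirection.
class IntTuple {
    private int value;
    private boolean isNull = true;
    void set(int v) { value = v; isNull = false; }
    int getInt() { return value; }                  // primitive accessor, no boxing
    Object get() { return isNull ? null : value; }  // boxes only on demand
}
```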

      1. PIG-2359.4.patch
        100 kB
        Dmitriy V. Ryaboy
      2. PIG-2359.3.patch
        98 kB
        Dmitriy V. Ryaboy
      3. PIG-2359.2.patch
        78 kB
        Dmitriy V. Ryaboy
      4. PIG-2359.1.patch
        64 kB
        Dmitriy V. Ryaboy

        Issue Links

          Activity

          Dmitriy V. Ryaboy added a comment -

          The attached patch is a first cut at adding this support.

          Note that it changes the TupleFactory interface by adding a couple new methods for creating optimized tuples.

          Two flavors of optimized tuples are provided:

          1) For single-field tuple, we provide a PrimitiveFieldTuple, which simply wraps a primitive value (or a string).

          2) For multi-field tuples, we provide an implementation that uses a single bytebuffer to hold the data in memory, and deserializes the appropriate field on read. This incurs a bit of a read-time penalty, but I believe it's a good trade-off, since (a) most of the time we only read once, and the allocation costs are much lower than for regular tuples, and (b) the memory overhead is several times lower than for regular tuples, so we'll save on GC.
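A rough sketch of how such a byte-buffer-backed tuple could work (simplified; the type codes, widths, and class names here are assumptions for illustration, not the patch's actual layout). Because every type in the schema has a known fixed width, field offsets into the buffer can be precomputed, and get() decodes a field only when it is read:

```java
import java.nio.ByteBuffer;

// Sketch of a multi-field primitive tuple: one flat buffer, fields
// decoded only on read.
class PackedTuple {
    static final byte T_INT = 0, T_LONG = 1, T_DOUBLE = 2;
    private final byte[] types;
    private final int[] offsets;   // byte offset of each field
    private final ByteBuffer data;

    PackedTuple(byte... schema) {
        types = schema;
        offsets = new int[schema.length];
        int off = 0;
        for (int i = 0; i < schema.length; i++) {
            offsets[i] = off;
            off += width(schema[i]);
        }
        data = ByteBuffer.allocate(off);
    }

    static int width(byte t) {
        switch (t) {
            case T_INT:    return 4;
            case T_LONG:   return 8;
            case T_DOUBLE: return 8;
            default: throw new IllegalArgumentException("type " + t);
        }
    }

    void setInt(int i, int v)       { data.putInt(offsets[i], v); }
    void setLong(int i, long v)     { data.putLong(offsets[i], v); }
    void setDouble(int i, double v) { data.putDouble(offsets[i], v); }

    // get() pays a small decode (and boxing) cost on every read --
    // the read-time penalty described above.
    Object get(int i) {
        switch (types[i]) {
            case T_INT:  return data.getInt(offsets[i]);
            case T_LONG: return data.getLong(offsets[i]);
            default:     return data.getDouble(offsets[i]);
        }
    }
}
```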

          Microbenchmark results can be found in the javadoc for PrimitiveTuple.

          Note that so far I haven't changed any behavior in existing Pig code, other than changing one interface. The next step would be to start using these Tuples when possible.

          One complication is that since we don't push much metadata around with tuples, we can only deserialize them into standard tuples; so all savings are lost once we hit an MR boundary. Changing this would require a pretty significant refactor, I'd love to hear ideas from folks who worked on BinInterSedes on how to do this.

          So far, I've played with using these in some UDFs that generate large bags of tuples, and the difference in both speed and memory use is fairly dramatic.

          Gianmarco De Francisci Morales added a comment -

          Neat!
          The patch is quite long, I will have a look at it later.
          For the BinInterSedes, I think that adding a new byte type for each specific tuple and modifying readTuple() should do the trick.
          I am a bit concerned about the access to PrimitiveTuple.
          Specifically, do we always have the schema available when we need to read a field?

          Dmitriy V. Ryaboy added a comment -

          Nope; it's more efficient to call the type-specific getInteger, getLong, etc., but get() works regardless.

          Daniel Dai added a comment -

          Great first step! Some initial observations:
          1. We can pass the schema into JobConf and read it back in BinInterSedes.setConf()
          2. Supporting strings in PrimitiveTuple would be super helpful; is that in the plan?
          3. BinInterSedes should take advantage of it to serialize/deserialize the PrimitiveTuple in a more efficient way

          Worth mentioning: we need to put fastutil_small.jar (http://www.java2s.com/Code/Jar/f/Downloadfastutilsmalljar.htm) into lib in order to compile.

          Dmitriy V. Ryaboy added a comment -

          Oops, fastutil wasn't supposed to make it into this patch, that's an experiment for later. I'll get rid of the PrimitiveBags in the next version of the patch.

          Putting the schemas into the JobConf is possible. I am a little worried about having binary data that's not readable without an ephemeral job conf, though. Also, it'll be hard to figure out which tuple the schema is supposed to apply to; we'll need to prefix the tuple with something anyway. Maybe a header?

          The header would look like:
          magic_header_start;
          byte tuple_schema_id; magic_schema_start; byte type; byte type; ... magic_schema_end;
          byte tuple_schema_id; magic_schema_start; byte type; byte type; ... magic_schema_end;
          byte tuple_schema_id; magic_schema_start; byte type; byte type; ... magic_schema_end;
          magic_header_end;

          Then, we introduce a SCHEMATUPLE value, like we did with TINY_TUPLE et al, and the serialized tuple would look like:
          SCHEMATUPLE tuple_schema_id bytes.....
          SCHEMATUPLE tuple_schema_id bytes.....

          I suppose we could also write the header in a separate tuple with high rep, to avoid a hotspot.

          Gianmarco De Francisci Morales added a comment -

          For sure we need to represent the tuple like this:

          SCHEMATUPLE tuple_schema_id bytes...

          Because we do not want to put the schema in every tuple, we need a single place where there is a Map: tuple_schema_id -> tuple_schema
          This map is anyway job specific, because tuple_schema_ids will be generated on the fly for the specific schema of the tuple used in the job.
          JobConf looks like a good place to put this piece of information exactly because it is job specific.
          I personally do not much like the idea of having metadata in a tuple with high replication; it looks a bit hacky to me.
          I don't understand your concern; could you detail it a bit more? In any case, the BinInterSedes tuple is as ephemeral as the JobConf, and is not used for persisting data. So I assume your concern is not for users, but for developers and debugging, right?
          It is true that the tuple will no longer be self-describing, but this is the price to pay for a more efficient serialization format. What kind of problems do you think could arise?

          We can then discuss on the best way to represent this Map.

          Dmitriy V. Ryaboy added a comment -

          It occurs to me that in effect, we already have the schema next to every tuple in BinSedes, since we precede each field with a byte that describes the type. If for primitive tuples we simply put the type bytes first, and then write the bytes as they are stored in the byte buffer, we will have a super efficient read, and won't take up any more space than we do now.

          Gianmarco De Francisci Morales added a comment -

          Yes, we already do that.
          But that's because a byte is negligible when compared to the size of the rest of the tuple.
          I think that if we really want a more efficient tuple implementation when schemas are known, we need to strip the schema from the data. What's the point of repeating the schema in each tuple, apart from ease of implementation?
          This modification might be done in a different Jira, while we can keep this one for the bytearray implementation.

          For the PrimitiveFieldTuple implementation, should we create a different byte for each tuple?
          This way we can save on the size and schema byte and make it really compact.
          Otherwise we could use a byte to indicate PRIMITIVE_FIELD_TUPLE and then a second one to indicate the schema (Double, Float, etc.)

          For the PrimitiveTuple, we would use PRIMITIVE_TUPLE, then the size as a byte (I assume we don't really use schemas with more than 255 primitives?), the schema (1 byte per type) and finally the data in the bytearray.
          Actually, given that we are changing the serialization format, we don't need the schema to be 1 byte per type, but we could multiplex several fields in the same byte. We have 8 primitive types in Pig (by the way, should we also implement PByteTuple, PBooleanTuple?), so 3 bits will suffice. We can use 4 for alignment and expandability. This cuts by 50% the overhead due to schema.
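The nibble-packing idea can be sketched as follows (an illustrative helper, not part of the patch): two 4-bit type codes share each schema byte, halving the per-field schema overhead compared to one byte per field:

```java
// Sketch of multiplexing two 4-bit type codes per byte: 8 primitive
// types need only 3 bits, so a 4-bit nibble per field (one spare bit
// for expandability) halves the schema overhead versus 1 byte/field.
class NibbleSchema {
    static byte[] pack(byte[] types) {
        byte[] out = new byte[(types.length + 1) / 2];
        for (int i = 0; i < types.length; i++) {
            int shift = (i % 2 == 0) ? 4 : 0;   // high nibble first
            out[i / 2] |= (types[i] & 0x0F) << shift;
        }
        return out;
    }

    static byte unpack(byte[] packed, int i) {
        int shift = (i % 2 == 0) ? 4 : 0;
        return (byte) ((packed[i / 2] >> shift) & 0x0F);
    }
}
```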

          Thoughts?

          Ashutosh Chauhan added a comment -

          I think that if we want really a more efficient tuple implementation when schemas are known, we need to strip the schema from the data. What's the point of repeating the schema in each tuple apart from ease of implementation?

          Be careful with the assumption that the schema is going to be the same for all the rows in a dataset. Currently, Pig doesn't make this assumption and is thus able to work with tuples of varying schemas in the data. See PIG-1131, where a related optimization was attempted (and also PIG-1188).

          Gianmarco De Francisci Morales added a comment -

          Be careful with the assumption that schema is going to be same for all the rows in a data. Currently, Pig doesn't make this assumption and is thus able to work with tuples of varying schema in data. See, PIG-1131 where a related optimization was attempted (and also PIG-1188).

          Yes Ashutosh, you are right.
          But when a user specifies a schema, Pig enforces it on the data, so all the tuples have the same schema anyway.

          grunt> sh cat file.txt
          1	2
          1	2	3
          1
          	2
          
          grunt> a = load 'file.txt' AS (x1:int, x2:int);
          grunt> dump a
          (1,2)
          (1,2)
          (1,)
          (,2)
          
          

          If we don't have a schema then we don't use this new kind of tuples.
          Instead, I see a more general problem of how to handle the serialization of null fields with these new Tuple implementations.
          The schema I proposed needs to be augmented, either with a NULL_TYPE, which makes us lose track of the original type in the tuple, or by using 1 bit of each type byte as a null flag.

          Dmitriy V. Ryaboy added a comment -

          Ashutosh, your comment is correct in the general case, but in this case, we only use these special tuples when we know the schema from the get-go.

          I think 1 new byte for each of possible primitive single-field tuples (serialization: tuple_type, null_bit, serialized value), and 1 new primitive tuple (serialization: tuple_type, size, types, n_null_bits, bytearray) should work.

          To get around the limitation of 1 byte for expressing the # of fields in a tuple (just in case), we can limit ourselves to 127 per byte and reserve the high bit to indicate that the next byte should be interpreted to contain the rest of the number (so it would take 2 bytes to express 256, but we could). Most of the time this won't matter, as the vast majority of tuples is going to be well under 128 fields.
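This variable-length size encoding is essentially a protobuf-style varint; a sketch (illustrative, not the patch's actual encoding):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch of the length encoding described above: the low 7 bits of
// each byte carry the value, and a set high bit means "more bytes
// follow". Small sizes (< 128, the common case) cost a single byte.
class VarLen {
    static void write(OutputStream out, int n) throws IOException {
        while ((n & ~0x7F) != 0) {
            out.write((n & 0x7F) | 0x80);  // continuation bit set
            n >>>= 7;
        }
        out.write(n);                      // final byte, high bit clear
    }

    static int read(InputStream in) throws IOException {
        int n = 0, shift = 0, b;
        do {
            b = in.read();
            n |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return n;
    }
}
```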

          There's some nastiness in Tuples where they all have an internal "isNull" field that might be true to indicate the whole tuple is null. I am not sure what the deal is there. Do we need to write an extra bit per tuple to differentiate "all fields are null" from "the tuple is null"? Or should we just write the NULL byte (currently used to express null tuples) and let it be deserialized into a normal, non-primitive tuple?

          Daniel Dai added a comment -

          We don't even deal with isNull in BinInterSedes.

          Gianmarco De Francisci Morales added a comment -

          I do not know about the semantics of isNull() in Tuple.
          Is a Tuple null when all its fields are null?
          In this case we are missing a piece in BinInterSedes.
          I don't even know whether isNull() is used at all.
          I would vote for the "just write the NULL byte and be done with it" solution.

          Finally, I don't see where you reconstruct the "nulls" field from the serialized PrimitiveTuple, I am probably missing a piece.

          Dmitriy V. Ryaboy added a comment -

          fwiw - I ripped all of the isNull stuff out of the tuple interface and implementations and things compile (so I guess it's never, ever called?). Extricating this change from my primitive tuples branch is a bit painful... you guys cool with that change?

          Gianmarco De Francisci Morales added a comment -

          Totally cool!
          Do tests pass?

          Daniel Dai added a comment -

          It seems Pig core is not using isNull, so not dealing with it in the schema tuple implementation is totally fine. But I would like to keep the interface since it is marked public stable.

          Gianmarco De Francisci Morales added a comment -

          Should we at least mark the method as deprecated and remove it in a future release?
          This would also prevent someone from starting to use it, and serves as a form of documentation.

          Scott Carey added a comment -

          One approach is to create a class on the fly, extending a base abstract tuple class and adding the fields for the schema.

          Then a read is just a field access. Nulls can be handled with a BitSet for objects with more than 32 fields, or a bitmask on an int otherwise. That can be baked into an abstract class so that code-gen only needs to add fields and map them to indexes.
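The null-tracking scheme could be sketched like this (a hypothetical class for illustration): an int bitmask covers tuples of up to 32 fields with zero allocation, falling back to a BitSet for wider ones:

```java
import java.util.BitSet;

// Sketch of the suggested null tracking: an int bitmask for tuples
// of up to 32 fields (no allocation), a BitSet for wider tuples.
class NullTracker {
    private final int size;
    private int smallMask;     // used when size <= 32
    private BitSet largeMask;  // used otherwise

    NullTracker(int size) {
        this.size = size;
        if (size > 32) largeMask = new BitSet(size);
    }

    void setNull(int i) {
        if (size <= 32) smallMask |= 1 << i;
        else largeMask.set(i);
    }

    boolean isNull(int i) {
        return size <= 32 ? (smallMask & (1 << i)) != 0 : largeMask.get(i);
    }
}
```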

          Dmitriy V. Ryaboy added a comment -

          Scott, you are suggesting code-gen and compile on the fly during runtime?

          Scott Carey added a comment -

          Yes, using something like ASM (http://asm.ow2.org/) to fill out the fields and get() methods by extending an abstract class that is written in Java and has most of the work already done.

          A custom TupleFactory could have a static ThreadLocal<WeakHashMap<Schema, TupleBuilder>> or something similar, where the TupleBuilder and the Tuple it creates is created by dynamic class generation.
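The caching structure could look roughly like this sketch, with a plain Supplier standing in for the generated TupleBuilder and the actual bytecode generation omitted (the names here are illustrative, not Pig's API):

```java
import java.util.WeakHashMap;
import java.util.function.Supplier;

// Sketch of a per-thread cache mapping a schema key to a builder, so
// the (expensive) class generation runs once per schema per thread.
// Supplier<Object[]> stands in for the generated TupleBuilder; real
// codegen (e.g. via ASM) is omitted.
class BuilderCache {
    private static final ThreadLocal<WeakHashMap<String, Supplier<Object[]>>>
            CACHE = ThreadLocal.withInitial(WeakHashMap::new);

    static Supplier<Object[]> forSchema(String schemaKey, int numFields) {
        // computeIfAbsent runs the "codegen" only on a cache miss
        return CACHE.get().computeIfAbsent(
                schemaKey, k -> () -> new Object[numFields]);
    }
}
```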

          The resulting objects would be memory compact and very fast. I am tempted to try and implement it if I have some spare time. I have been thinking about something similar for Avro, but the use case for Pig is significantly simpler.

          It would re-use the other changes above to the Tuple contract – getLong(), getInteger(), etc. I really like the idea of moving Tuple to have intrinsic getters. There are some questions I have about the contract, however – what should be done if getLong is called for a field index containing an Integer? Promote? Should it follow the rules in the Pig specification for type casts, or throw an exception?

          Dmitriy V. Ryaboy added a comment -

          That's a sexy idea, I like it – especially since it will let us handle strings in addition to numbers.
          We'll have to implement the same codegen on the deserialization side, or somehow serialize generated class names; that could get somewhat ugly. Doable, though. Could also try to serialize the codegenned classes using kryo.

          I'll finish up this patch and run some timing tests; unless you are ready to work on this right now, let's open a separate ticket for the codegen approach.

          Scott Carey added a comment -

          Yes, codegen should probably be a different ticket.

          I'm not sure I understand why there is much to do on the serialization/deserialization side, but I have not looked into it in detail.

          Conceptually, one can build an API for building pig tuples of this sort that is independent of the serialization used. Given a Schema, get a TupleBuilder. Fill it with data as you deserialize and create the Tuple. The tuple implementation details (e.g. code gen vs old Object[] style Tuple) can both be supported. Am I missing something?
          If the schema is not known or varies per record, then a completely different code path must be taken.

          Dmitriy V. Ryaboy added a comment -

          Attaching patch with better serialization, deprecation of isNull, boolean support, and slightly better tests.

          Dmitriy V. Ryaboy added a comment -

          Very rough (read: invalid, probably) speed test: modified SUM / LongSum to use a PLongTuple, and ran this code on excite.log from the tutorial:

          l = load 'tutorial/data/excite-big.log' as (id:chararray, val:long, query:chararray);
          x = foreach (group l all) generate SUM(l.val);
          
          store x into '/tmp/foo';
          

          Before optimization:

          
          real	0m14.785s
          user	0m22.516s
          sys	0m1.203s
          
          real	0m15.323s
          user	0m22.605s
          sys	0m1.182s
          
          real	0m14.841s
          user	0m22.600s
          sys	0m1.176s
          

          after:

          real	0m14.347s
          user	0m20.442s
          sys	0m1.095s
          
          real	0m14.344s
          user	0m20.241s
          sys	0m1.064s
          
          real	0m14.577s
          user	0m20.671s
          sys	0m1.087s
          
          Dmitriy V. Ryaboy added a comment -

          Found a place where this breaks. InternalCachedBag (and presumably other cached bags) uses <code>t.write(out)</code> to spill to disk, and <code>t = factory.newTuple(); t.readFields(in)</code> to read. This is a problem as it assumes t will write itself in a format the default tuple returned by factory.newTuple() will read. Seems like a straightforward fix would be to use InterSedes to read, right? Any reason that wouldn't work?
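To illustrate the bug pattern being described: if a specialized tuple spills itself in a compact format but is read back by a generic reader that expects the default layout, the data is corrupted. Routing both sides through one shared serializer avoids the mismatch. The sketch below is hypothetical (the `SedesSketch` class and its `writeDatum`/`readDatum` methods stand in for a shared serializer like InterSedes; they are not Pig's actual API):

```java
import java.io.*;

// Hypothetical sketch: a shared serializer always writes a one-byte type
// tag first, so any reader can reconstruct the value regardless of which
// tuple implementation produced it. Not Pig's actual InterSedes code.
final class SedesSketch {
    static void writeDatum(DataOutput out, Object v) throws IOException {
        if (v instanceof Long) { out.writeByte(1); out.writeLong((Long) v); }
        else { out.writeByte(0); out.writeUTF(v.toString()); }
    }

    static Object readDatum(DataInput in) throws IOException {
        return in.readByte() == 1 ? (Object) in.readLong() : in.readUTF();
    }

    // "Spill" a value to bytes and read it back through the same serializer.
    static byte[] spill(Object v) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        writeDatum(new DataOutputStream(bos), v);
        return bos.toByteArray();
    }

    static Object read(byte[] bytes) throws IOException {
        return readDatum(new DataInputStream(new ByteArrayInputStream(bytes)));
    }
}
```

Because writer and reader agree on the tag scheme, the round trip works even when the in-memory representations on the two sides differ.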

          Dmitriy V. Ryaboy added a comment -

          Fixed the bags. Added a test.

          Dmitriy V. Ryaboy added a comment -

          Ok I think this is ready for review.

          Gianmarco De Francisci Morales added a comment -

          I will try to have a look at it during the weekend if nobody else volunteers.

          Alan Gates added a comment -

          Comments:

          In PrimitiveTuple.get(), I wonder if you'd get faster access if you removed the array bounds check. Java is going to do that for you anyway. You can catch the IndexOutOfBoundsException and rethrow it with a nicer error message.

          The same comment applies to checking whether the buffer capacity will be exceeded by reading the requested field.

          Also applies to set()

          Does append ever make sense for these types of tuples? Should it just throw NotSupportedException?

          In the P*Tuple classes, when a user calls set(int pos, Object o), you are forcing o into the type of the tuple (e.g., for PIntTuple you are forcing it into an int). This is a change of semantics from the general tuple contract where whatever you pass to set is taken to be the value for that field. I would like to understand more about the use case when you would expect to see this used. Is it that you want to force this to int because the data may or may not be all ints (like there may be some floats?). I think it would be better to just take an int, and return a null and issue a warning if what you get isn't an int. This still violates the semantic, but at least it doesn't silently produce a different result. If the use case is only for the internal use of passing data between map and reducer or between MR jobs, then I definitely think we should forget all the checks and just assume the data is correct.

          You added new methods to the TupleFactory class, which is marked as Stable. You'll need to provide default implementations of those to avoid breaking backward compatibility.

          Why is this patch changing http libraries? (See the changes to ivy/library.properties.)

          Dmitriy V. Ryaboy added a comment -

          Thanks for the close read, Alan!

          In PrimitiveTuple.get(), I wonder if you'd get faster access if you removed the array bounds check. Java is going to do that for you anyway. You can catch the IndexOutOfBoundsException and rethrow it with a nicer error message.

          Will do.

          The same comment applies to checking whether the buffer capacity will be exceeded by reading the requested field.

          Will do.

          Also applies to set()

          Will do

          Does append ever make sense for these types of tuples? Should it just throw NotSupportedException?

          I think it makes sense, painful as it is – users can get a PTuple handed to their UDF and unwittingly call append on it. I don't want existing scripts to crash, so trying to make things degrade nicely.

          In the P*Tuple classes, when a user calls set(int pos, Object o), you are forcing o into the type of the tuple (e.g., for PIntTuple you are forcing it into an int). This is a change of semantics from the general tuple contract where whatever you pass to set is taken to be the value for that field. I would like to understand more about the use case when you would expect to see this used. Is it that you want to force this to int because the data may or may not be all ints (like there may be some floats?). I think it would be better to just take an int, and return a null and issue a warning if what you get isn't an int. This still violates the semantic, but at least it doesn't silently produce a different result. If the use case is only for the internal use of passing data between map and reducer or between MR jobs, then I definitely think we should forget all the checks and just assume the data is correct.

          The use case isn't just internal, I started this in the first case because I needed to construct large tuple bags in a UDF. My reasoning for taking int value was that this is what we do when people "cast" a float to an int in pig. If you declare the schema to be an int, and put in a float... seems to me like having an int come out is ok. Could also die abruptly. I think null would be most surprising of the available choices.
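The coercion semantics being argued here can be made concrete with a small sketch. This is illustrative only (the `PIntSketch` class is hypothetical, not the patch's actual PIntTuple): a fixed-type int field coerces any Number it is handed, the way a Pig cast from float to int would, rather than storing the Object as-is.

```java
// Hypothetical sketch of a fixed-type int field with coercing set()
// semantics, as discussed above. Not the patch's actual PIntTuple.
final class PIntSketch {
    private int value;

    void set(Object o) {
        if (o instanceof Number) {
            // Coerce like a Pig cast: 3.9f becomes 3.
            value = ((Number) o).intValue();
        } else {
            throw new IllegalArgumentException("expected a number, got " + o);
        }
    }

    int get() { return value; }
}
```

The design trade-off is exactly the one in the thread: coercing silently changes the value (3.9f comes back as 3), while returning null or throwing would surprise in different ways.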

          You added new methods to the TupleFactory class, which is marked as Stable. You'll need to provide default implementations of those to avoid breaking backward compatibility.

          Good call, will do.

          Why is this patch changing http libraries? (See the changes to ivy/library.properties.)

          I did that when I was going to use ByteArrayBuffer, offered by httpcore. The nice thing about it is that it's resizable, but then again it doesn't have the r/wLong, r/wInt, etc methods, so I reverted to regular nio.ByteBuffer. There is no strict reason to change the libs, but it's a safe bump – and the libs we are currently using are deprecated and replaced by the ones I bumped to (it's the same project, which got moved inside Apache). See http://hc.apache.org/httpclient-3.x/

          Alan Gates added a comment -

          The use case isn't just internal, I started this in the first case because I needed to construct large tuple bags in a UDF. My reasoning for taking int value was that this is what we do when people "cast" a float to an int in pig. If you declare the schema to be an int, and put in a float... seems to me like having an int come out is ok. Could also die abruptly. I think null would be most surprising of the available choices.

          When will these specialized tuple types get used? Pig will use them internally when we expect a bag (or whatever) to contain that type. Users can use them in UDFs they construct. Are there are other cases where we envision them being used? I agree my "push it to null" is just as arbitrary as your "push to the type I expected". I shy away from failing jobs on these kinds of errors because you hate for one row in a billion to fail an entire job. I guess I'm ok with your approach, though I think it should issue a warning (since it seems clear the user expected to find only one type in the data), and I think the Javadoc comments on the set() functions should clearly declare that this instance of the function bends the semantics of the interface.

          Performance wise, this looks very exciting.

          Dmitriy V. Ryaboy added a comment -

          I have another branch-in-progress that pushes this into a bunch of Pig UDFs including max/min/avg. Just doing that shaved about 10 out of 84 seconds off each mapper in tpc-h Q1 in my (non-conclusive) experiments.

          I'll add the javadoc in the next version of the patch, thanks. Issuing a warning would potentially lead to log spam... is there a handy counter?

          Alan Gates added a comment -

          Use the aggregated warnings, PigLogger.warn rather than log4j warn, to avoid the log spam.

          Dmitriy V. Ryaboy added a comment -

          New patch addresses Alan's comments.

          Dmitriy V. Ryaboy added a comment -

          Bump

          Thejas M Nair added a comment -

          I will review the patch by this weekend.

          Scott Carey added a comment -

          Performance comments:

          In PrimitiveTuple.get(), I wonder if you'd get faster access if you removed the array bounds check. Java is going to do that for you anyway. You can catch the IndexOutOfBoundsException and rethrow it with a nicer error message.

          That is generally slower.
          1. The JVM will detect your checks and not do its own bounds checks if yours are sufficient.
          2. The JVM will profile the method, and compile the checks with the right CPU branch hints and instruction layout based on the odds that the branch is taken.
          3. If it is out of bounds, it is a hundred times faster to find out via an if statement than a try/catch.

          All of the above are much more noticeable if in a loop than a single access, so it may not help here much.
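The two styles under discussion can be sketched as follows. This is a hedged illustration (the `BoundsSketch` class is hypothetical, not code from the patch): an explicit guard raises a descriptive error up front, which is what the JVM can recognize and fold into its own check, instead of catching the JVM's IndexOutOfBoundsException after the fact.

```java
// Sketch of the explicit-guard style for a fixed-width field array.
// The manual check gives a descriptive message; per the discussion above,
// the JIT can often prove its own bounds check redundant when guarded.
final class BoundsSketch {
    private final long[] fields = new long[4];

    long getChecked(int i) {
        if (i < 0 || i >= fields.length) {
            throw new IndexOutOfBoundsException(
                "field " + i + " requested, tuple has " + fields.length + " fields");
        }
        return fields[i];
    }
}
```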

          I did that when I was going to use ByteArrayBuffer, offered by httpcore. The nice thing about it is that it's resizable, but then again it doesn't have the r/wLong, r/wInt, etc methods, so I reverted to regular nio.ByteBuffer.

          Note, nio.ByteBuffer is 'slow' (but very handy). Unfortunately, all calls to it are virtual method calls and not inlined. This is because of its dual heap / direct nature. If serializing data to a byte[], writing your own private method to swizzle the int/long into the bytes can have significant performance gains if it is a hot-spot in time spent, since it will be inlined at critical call sites while ByteBuffer's methods will not.
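The hand-rolled swizzling Scott suggests looks roughly like this sketch (the `Swizzle` class is illustrative, not the patch's code): small static methods that pack a long into a byte[] big-endian, which the JIT can readily inline at hot call sites.

```java
// Sketch of manual big-endian swizzling of a long into a byte[], the
// ByteBuffer alternative suggested above. Small static methods like
// these are easy for the JIT to inline.
final class Swizzle {
    static void putLong(byte[] b, int off, long v) {
        // Write low byte last so b[off] holds the most significant byte.
        for (int i = 7; i >= 0; i--) { b[off + i] = (byte) v; v >>>= 8; }
    }

    static long getLong(byte[] b, int off) {
        long v = 0;
        for (int i = 0; i < 8; i++) { v = (v << 8) | (b[off + i] & 0xFFL); }
        return v;
    }
}
```

The same pattern extends to int, double (via Double.doubleToLongBits), and the other fixed-width types a primitive tuple stores.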

          Dmitriy V. Ryaboy added a comment -

          Sigh. I originally had my own swizzling code and ripped it out because nio.ByteBuffer made the code look nicer. I'll put it back in...

          Out of bounds crashes the job, so I don't think the speed of try/catch matters.

          Scott Carey added a comment -

          The JVM will detect your checks and not do its own bounds checks if yours are sufficient

          More info:
          The JVM tries to eliminate array bounds checks. It can do this in a few ways.
          There is the loop predication work that is in JRE 6_u22 or so and later that will move bound checks from inside a loop to the outside of the loop when it can. Older JVMs can do similar, but in fewer situations. This is both the intrinsic Java check and any you write yourself. In fact, it tries to hoist all sorts of code outside the loop if it can, not just array bounds checks.
          If it can prove that the value passed in is within range it may eliminate the bounds check. This can be due to the index variable having a known range (0 to arr.length, for example) or a few other conditions.

          For a public virtual method like Tuple.get() it will almost never be able to inline the call at the call site, and so it may not ever be able to prove that it can remove the bounds checks. In this sort of situation, there are two fast ways: don't check yourself and let the exception bubble up, or check yourself and handle the out of bound condition yourself. In general, catching an index out of bounds exception is slower than checking yourself since the JVM can prove that its own checks are useless with yours guarding them and exception handling is much slower than a code branch.
          In the condition that the method may be inlined aggressively (small private or effectively final methods especially) leaving manual checks out can be very fast since the JVM may be able to prove that none are necessary at all at a given call site.

          Variants can be performance tested and refined over time. It doesn't have to be perfect now.

          Alan Gates added a comment -

          +1, latest patch looks good. Looking forward to Pig flying a little faster.

          Dmitriy V. Ryaboy added a comment -

          Committed to trunk.

          Dmitriy V. Ryaboy added a comment -

          I now turn your attention to PIG-2454

          Gianmarco De Francisci Morales added a comment -

          Great job Dmitriy!


            People

            • Assignee: Dmitriy V. Ryaboy
            • Reporter: Dmitriy V. Ryaboy
            • Votes: 1
            • Watchers: 4