Hadoop Common
HADOOP-1986

Add support for a general serialization mechanism for Map Reduce

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Programs that implement the raw Mapper or Reducer interfaces will need modification to compile with this release. For example,

      class MyMapper implements Mapper {
        public void map(WritableComparable key, Writable val,
          OutputCollector out, Reporter reporter) throws IOException {
          // ...
        }
        // ...
      }

      will need to be changed to refer to the parameterized type. For example:

      class MyMapper implements Mapper<WritableComparable, Writable, WritableComparable, Writable> {
        public void map(WritableComparable key, Writable val,
          OutputCollector<WritableComparable, Writable> out, Reporter reporter) throws IOException {
          // ...
        }
        // ...
      }

      Similarly implementations of the following raw interfaces will need modification: InputFormat, OutputCollector, OutputFormat, Partitioner, RecordReader, RecordWriter

      Description

      Currently Map Reduce programs have to use WritableComparable-Writable key-value pairs. While it's possible to write Writable wrappers for other serialization frameworks (such as Thrift), this is not very convenient: it would be nicer to be able to use arbitrary types directly, without explicit wrapping and unwrapping.

      1. serializer-v7.patch
        103 kB
        Tom White
      2. serializer-v6.patch
        94 kB
        Tom White
      3. SequenceFileWriterBenchmark.java
        3 kB
        Tom White
      4. serializer-v5.patch
        94 kB
        Tom White
      5. serializer-v4.patch
        93 kB
        Tom White
      6. serializer-v3.patch
        97 kB
        Tom White
      7. hadoop-serializer-v2.tar.gz
        1.36 MB
        Tom White
      8. serializer-v2.patch
        70 kB
        Tom White
      9. serializer-v1.patch
        70 kB
        Tom White
      10. SerializableWritable.java
        3 kB
        Dennis Kubes

        Activity

        Tom White added a comment -

        Here's an initial plan:

        1. Remove the requirement for Map Reduce types to extend WritableComparable/Writable. So for example Mapper would become:

        public interface Mapper<K1, V1, K2, V2>
        

        2. Create a serialization class that can turn objects into byte streams and vice versa. Something like:

        public interface Serializer<T> {
          void serialize(T t, OutputStream out) throws IOException;
          void deserialize(T t, InputStream in) throws IOException;
        }
        

        3. Add a configuration property to specify the Serializer to use. This would default to WritableSerializer (an implementation of Serializer<Writable>).

        4. Change the type of the output key comparator to be Comparator<T>, with default WritableComparator (an implementation of Comparator<Writable>).

        5. In MapTask use a Serializer to write map outputs to the output files.

        6. In ReduceTask use a Serializer to read sorted map outputs for the reduce phase.

        I've played with some of this and it looks like it would work; however, there is a problem with the Serializer interface above as it stands. The serialize method of WritableSerializer looks like this:

        public void serialize(Writable w, OutputStream out) throws IOException {
          w.write(new DataOutputStream(out));
        }
        

        Clearly it is not acceptable to create a new object on every write. This is a general problem - Writables write to a DataOutputStream, Thrift objects write to a TProtocol, etc. So the solution is probably to make the Serializer stateful, having the same lifetime as the wrapped stream. Something like:

        public interface Serializer<T> {
          void open(OutputStream out);
          void serialize(T t);
          void close();
        }
        

        (There would be a similar Deserializer interface.)

        Could this work?
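        As a rough, self-contained sketch of the stateful design (the Writable interface below is a simplified stand-in for Hadoop's, and IntRecord is a made-up record type, not anything in the codebase):

```java
import java.io.*;

// Simplified stand-in for Hadoop's Writable; illustration only.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Stateful serializer per the proposal: open() wraps the stream once,
// so serialize() avoids allocating a DataOutputStream per record.
interface Serializer<T> {
    void open(OutputStream out) throws IOException;
    void serialize(T t) throws IOException;
    void close() throws IOException;
}

class WritableSerializer implements Serializer<Writable> {
    private DataOutputStream dataOut;
    public void open(OutputStream out) {
        dataOut = new DataOutputStream(out);  // created once per stream
    }
    public void serialize(Writable w) throws IOException {
        w.write(dataOut);                     // no per-record allocation
    }
    public void close() throws IOException {
        dataOut.close();
    }
}

public class StatefulSerializerSketch {
    // A toy Writable carrying one int.
    static class IntRecord implements Writable {
        int value;
        IntRecord(int v) { value = v; }
        public void write(DataOutput out) throws IOException { out.writeInt(value); }
        public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        Serializer<Writable> s = new WritableSerializer();
        s.open(buf);
        s.serialize(new IntRecord(1));
        s.serialize(new IntRecord(2));
        s.close();
        System.out.println(buf.size());  // two 4-byte ints
    }
}
```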

        Owen O'Malley added a comment -

        Actually, I'd probably set it up so that you could configure the list of Serializers with something like:

        <property>
          <name>hadoop.serializers</name>
          <value>org.apache.hadoop.io.WritableSerializer,org.apache.hadoop.io.ThriftSerializer</value>
        </property>
        

        and serializer could also have a target class:

        public interface Serializer<T> {
          void serialize(T t, OutputStream out) throws IOException;
          void deserialize(T t, InputStream in) throws IOException;
          // Get the base class that this serializer will work on
          Class getTargetClass();
        }
        
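        The target-class idea could be sketched as a first-match lookup over the configured list; all class names below are illustrative stand-ins, not actual Hadoop API:

```java
import java.util.*;

public class SerializerRegistrySketch {
    // Minimal stand-in for the proposed interface; only the lookup concern.
    interface Serializer {
        Class<?> getTargetClass();
    }
    interface Writable {}  // marker stand-in for Hadoop's Writable

    static class WritableSerializer implements Serializer {
        public Class<?> getTargetClass() { return Writable.class; }
    }
    static class JavaSerializer implements Serializer {
        public Class<?> getTargetClass() { return java.io.Serializable.class; }
    }

    // First serializer (in configured order) whose target class accepts c wins,
    // mirroring a comma-separated hadoop.serializers property.
    static Serializer forClass(List<Serializer> configured, Class<?> c) {
        for (Serializer s : configured)
            if (s.getTargetClass().isAssignableFrom(c)) return s;
        throw new IllegalArgumentException("no serializer for " + c);
    }

    static class MyRecord implements Writable {}

    public static void main(String[] args) {
        List<Serializer> configured =
            Arrays.asList(new WritableSerializer(), new JavaSerializer());
        System.out.println(forClass(configured, MyRecord.class).getClass().getSimpleName());
        System.out.println(forClass(configured, String.class).getClass().getSimpleName());
    }
}
```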
        Tom White added a comment -

        Is the idea here to use the target class to decide which Serializer to use? If so, it might not work too well if the serialization framework doesn't have a base class or marker interface (e.g. Thrift).

        I was thinking that the Serializer would be specified per MR job - which should be simpler.

        Owen O'Malley added a comment -

        First, I'd argue that Thrift should have a top level serialization interface... but clearly that belongs on their development list.

        But if the serializer is specific to the job, you wouldn't be able to mix Writables and Thrift objects. If you wanted to translate, for instance, you'd like:

        map input: MyWritableKey, MyWritableValue
        map output: MyThriftKey, MyThriftValue

        If I have to have a single serializer for my job, that is a pain. Of course, without a Thrift record super class, you really can't write ThriftSerializable anyways. (You need a standard interface to generate the bytes...)

        Doug Cutting added a comment -

        > But if the serializer is specific to the job, you wouldn't be able to mix Writables and Thrift objects.

        We need a serializer and deserializer specified per job so that the mapred kernel can store intermediate data. Then the InputFormat may use a deserializer, and the OutputFormat may use a serializer. So I don't see that Tom's proposal (at least not as I interpret it) prohibits such intermixing. The job's serializer only applies to the map output. The InputFormat's deserializer would apply to the map input, and the OutputFormat's deserializer would apply to reduce output. Does that make sense?

        Owen O'Malley added a comment -

        But it is strictly more powerful to allow a serializer per class (or hierarchy). Furthermore, it means you only have to configure a small number of serializers rather than worry about which context you need to set which serializer for. I think it is more confusing if you have to say:

        FileInputFormat.setSerializer(conf, Bar.class);
        job.setMapOutputSerializer(Foo.class);
        FileOutputFormat.setSerializer(conf, Baz.class);

        and it still would prevent you from mixing serializers between keys and values. Unless you are proposing the even more verbose:

        FileInputFormat.setKeySerializer(conf, BarKey.class);
        FileInputFormat.setValueSerializer(conf, BarValue.class);
        job.setMapOutputKeySerializer(FooKey.class);
        job.setMapOutputValueSerializer(FooValue.class);
        FileOutputFormat.setKeySerializer(conf, BazKey.class);
        FileOutputFormat.setValueSerializer(conf, BazValue.class);

        I think the Serializers for a given type are constant, rather than the Serializers for a given context being constant.

        Doug Cutting added a comment -

        > FileInputFormat.setSerializer(conf, Bar.class);

        I think what we'd ideally have is something like:

        job.setInputFormat(SequenceFile<Foo,Bar>);

        Which isn't Java. So, yes, binding serializers by key/value class would be nice. But how we get there from here is the question. In particular, how can we handle something like Thrift, whose instances don't all implement some interface?

        > I think the Serializers for a given type are constant, rather than the Serializers for a given context being constant.

        That sounds mostly reasonable. Do you have a proposal for how to implement this?

        On a related note, I think that, if we go this way then, within Hadoop, we should deprecate Writable and directly implement the serializer API. Moving away from implementing serialization directly to the class permits us to work with other serialization systems, but there's no need for us to take the indirection hit internally.

        Tom White added a comment -

        > Do you have a proposal for how to implement this?

        If we follow Owen's suggestion then we can construct a map of types to Serializer classes. Then, when running
        MapTask or ReduceTask we can use the map to instantiate an appropriate Serializer for each of the key and the value types.

        > In particular, how can we handle something like Thrift, whose instances don't all implement some interface?

        The target class would have to be Object. However, for this to work we would need to have some notion of precedence so more specific subtypes (like Writable) match first. Also, this wouldn't allow you to use two different serialization frameworks whose instances only have a common type of Object. I'm not sure how much of a problem this would be in practice though.

        (I just had a look at a Thrift class, generated with release 20070917, and it is tagged with java.io.Serializable. It would be more useful, though, if it implemented an interface that defined the read/write fields.)
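        The type-to-serializer map with most-specific-first precedence could look something like the following; the registry contents and class names are hypothetical:

```java
import java.util.*;

public class TypeToSerializerMapSketch {
    // Hypothetical registry mapping a type to a serializer name; a more
    // specific registered type wins over a catch-all Object entry.
    static Map<Class<?>, String> registry = new LinkedHashMap<>();

    static String lookup(Class<?> c) {
        // Walk up the class hierarchy, checking each class and its
        // directly implemented interfaces, ending at Object.
        for (Class<?> k = c; k != null; k = k.getSuperclass()) {
            if (registry.containsKey(k)) return registry.get(k);
            for (Class<?> i : k.getInterfaces())
                if (registry.containsKey(i)) return registry.get(i);
        }
        return registry.get(Object.class);
    }

    interface Writable {}                       // marker stand-in
    static class MyWritable implements Writable {}
    static class PlainPojo {}

    public static void main(String[] args) {
        registry.put(Writable.class, "WritableSerializer");
        registry.put(Object.class, "ThriftSerializer"); // catch-all, as discussed
        System.out.println(lookup(MyWritable.class));
        System.out.println(lookup(PlainPojo.class));
    }
}
```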

        Owen O'Malley added a comment -

        I think to actually do something other than use Java serialization, you pretty much need a superclass that supports conversion to and from bytes. Realistically, with the current Thrift implementation, you'd need to register a serializer for each specific class. Yuck.

        However, according to the Thrift developers adding a base class would be easy and they are considering doing it.

        Joydeep Sen Sarma added a comment -

        I have been working on putting Thrift structs into HDFS. I have been a happy camper so far (at least as far as Hadoop/HDFS are concerned). Just for reference - this is what it ended up looking like:

        • use BytesWritable to wrap Thrift structs (and store the same in SequenceFiles)
        • for writing structs - I haven't had to allocate TTransport and TProtocol objects every time. Resetting the buffer in a ByteArrayOutputStream works. I expect a similar strategy to work for reading (might need to extend ByteArrayInputStream)
        • as far as invoking the right serializer/deserializer - it's easy to do this with reflection:
        • When loading data into HDFS - the name of the Thrift class is encoded before the serialized struct. (This is a property of TTransport.) The function signatures for serialize/deserialize are constant, allowing easy use of reflection
        • Data is loaded into HDFS in a way that also allows us to know the class name for any serialized struct - so again we use reflection to deserialize while processing. (There are different ways of arranging this.)

        Of course - if the data is homogeneous - then reflection is not required. But that's not the case with our data set.

        I haven't found the Writable interface to be a serious constraint in any way. There are some inefficiencies in the above process:
        1. the use of an extra length field from using BytesWritable
        2. the use of reflection (don't know what the overhead's like)

        But I am not sure these are big burdens. (I don't even understand how to conceivably avoid #2.)

        One of the good things about Thrift is cross-language code generation. What we would do at some point is allow Python (or Perl) code to work on binary serialized data. The Streaming library already seems to allow this (it will pass BytesWritable keys and values to external map-reduce handlers) - where the byte array can be deserialized by a Thrift-generated Python deserializer in the Python mapper.

        As far as the discussions on this thread go - I am not sure what the final proposal is. One thing I would opine is that the map-red job is best placed to understand what serialization library to invoke (instead of the map-reduce infrastructure). For example - we are segregating different data types in different files - and the Thrift class is implicit in the path name (which is made available as a key). If I understand it correctly - Pig takes a similar stance (input path is mapped to a serialization library).
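        A toy, Thrift-free sketch of the class-name-plus-reflection layout described above (the Event class and its deserialize method are made-up stand-ins; real Thrift structs would go through TProtocol):

```java
import java.io.*;
import java.lang.reflect.Method;

public class ReflectiveDeserializeSketch {
    // Hypothetical record type standing in for a Thrift-generated class.
    public static class Event {
        public String payload;
        public void deserialize(byte[] bytes) { payload = new String(bytes); }
    }

    public static void main(String[] args) throws Exception {
        // Write: class name first, then the serialized struct, mirroring
        // the "class name encoded before the serialized struct" layout.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(Event.class.getName());
        byte[] body = "click".getBytes();
        out.writeInt(body.length);
        out.write(body);

        // Read: recover the class name, then deserialize via reflection;
        // this works because the method signature is the same for every struct.
        DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(buf.toByteArray()));
        Class<?> cls = Class.forName(in.readUTF());
        byte[] read = new byte[in.readInt()];
        in.readFully(read);
        Object struct = cls.getDeclaredConstructor().newInstance();
        Method m = cls.getMethod("deserialize", byte[].class);
        m.invoke(struct, (Object) read);
        System.out.println(((Event) struct).payload);
    }
}
```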

        Vivek Ratan added a comment -

        I'm reading this discussion in a slightly different way. [I'm deviating a bit from the discussion that has been going on, but I will address that in the end.] The main issue seems to be (or perhaps, should be), how do we use different serializers/deserializers within Hadoop (Thrift, Record I/O, java.io.Serializable, whatever - I refer to them as serializing platforms). One clear benefit for supporting various options is to avoid having people write their own code to serialize/deserialize and let it be handled automatically, no matter what object is being serialized/deserialized. Seems to me like everyone agrees that we need (or at least we should think about) a way for Hadoop to support multiple serializing platforms and automatic serialization/deserialization.

        How do we do it? There are typically two steps, or two issues, to handle when serializing/deserializing - how does one walk through the member variables of an arbitrary object (which themselves can be objects) till we're down to basic types, and how does one serialize/deserialize the basic types.

        The former you typically do in two ways: you use a DDL/IDL to statically generate stubs which contain generated code that walks through objects of a particular type, or you use a generic 'walker' that can dynamically walk through any object (I can only think of using reflection to do this, so this approach seems restricted to languages that support reflection). Thrift and Record I/O both use DDLs, mostly because they need to support languages that do not have reflection, while Java's serialization uses reflection.

        For the second step (that of serializing/deserializing basic types), serialization platforms like Thrift and Record I/O provide support for serializing/deserializing basic types using a protocol (binary, text, etc) and a stream (file, socket, etc). Same with Java. This is typically invoked through an interface which contains methods to read/write basic types into a stream.

        Now, how does all this apply to Hadoop? I'm thinking about serialization not just for key-value pairs for Map/Reduce, but also in other places - Hadoop RPC, reading & writing data into HDFS, etc. For walking through an arbitrary object in Hadoop, one option is to modify compilers of Record I/O or Thrift to spit out Hadoop-compatible classes (which implement Writable, for example). A better option, since Hadoop's written in Java, is to have a generic walker that uses reflection to walk through any object. A user would invoke a generic Hadoop serializer class, which in turn would call a walker object or itself walk through any object using reflection.

        class HadoopSerializer {
          public static void init(...);  // initialize with the serialization platform you prefer
          static void serialize(Object obj, OutputStream out);
          static void deserialize(Object obj, InputStream in);
        }
        

        This walker object (for lack of a better name) now needs to link to a serialization platform of choice - Record I/O, Thrift, whatever. All it does is invoke the serialize/deserialize for individual types. In HadoopSerializer.init(), the user can say what platform they want to use, along with what format and what transport, or some such thing. Or you could make HadoopSerializer an interface, have implementations for each serialization platform we support, and let the user configure that when running a job.

        Now, for the walker object to invoke any of the serialization platforms, the platforms probably need to implement a common interface (which contains methods for serializing/deserializing individual types such as ints or longs into a stream). The other option is to register classes for each type (similar to what has been discussed earlier), but this, IMO, is a pain as most people probably want to use the same platform for all types. You probably would not want to serialize ints using Record I/O and strings using Thrift.

        So, in order to integrate a serialization platform into Hadoop, we'll need a wrapper for each platform which implements the generic interface invoked by the walker. This is quite easy to do - the wrapper is likely to be thin and will simply delegate to the appropriate Thrift or Record I/O object. Having a base class for Thrift would make the Thrift wrapper for Hadoop quite simple, but we can probably still implement a wrapper without that base class.

        Let me get back to the existing discussion now. Seems to me like we haven't addressed the bigger issue of how to make it easy on a user to serialize/deserialize data, and that we're instead just moving around the functionality (and I don't mean that in a dismissive way). I don't think you want a serializer/deserializer per class. Someone still needs to implement the code for serializing/deserializing that class, and I don't see any discussion on Hadoop support for Thrift or Record I/O which the user can just invoke. Plus, if you think of using this mechanism for Hadoop RPC, we will have so many instances of the Serializer<T> interface. You're far better off having a HadoopSerializer class that takes in any object and automatically serializes/deserializes it. All a user has to do is decide which serialization platform to use. There is a bit of one-time work in integrating a platform into Hadoop (writing the wrapper class that implements the interface called by HadoopWalker), but it's not much. What I'm suggesting also matches, I think, Joydeep's experience with using Thrift with HDFS.

        Or maybe I've completely gone on a tangent. I've been neck deep in Record I/O code and everything looks like a serializable Jute Record to me.
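        For illustration, a minimal reflection-based walker along the lines described above; it handles only int and String fields, while a real walker would cover all basic types, arrays, collections, and cycles. The Person/Address classes are made-up examples:

```java
import java.io.*;
import java.lang.reflect.Field;

public class ReflectionWalkerSketch {
    // Toy walker: serialize primitives directly, recurse into other objects.
    static void walk(Object obj, DataOutput out) throws Exception {
        for (Field f : obj.getClass().getDeclaredFields()) {
            f.setAccessible(true);
            Object v = f.get(obj);
            if (f.getType() == int.class)  out.writeInt((Integer) v);
            else if (v instanceof String)  out.writeUTF((String) v);
            else                           walk(v, out);  // nested object
        }
    }

    static class Address { String city = "Pune"; }
    static class Person  { int id = 7; Address addr = new Address(); }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        walk(new Person(), new DataOutputStream(buf));
        // 4 bytes for the int + writeUTF's 2-byte length + 4 bytes of "Pune"
        System.out.println(buf.size());
    }
}
```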

        Show
        Vivek Ratan added a comment - I'm reading this discussion in a slightly different way. [I'm deviating a bit from the discussion that has been going on, but I will address that at the end.] The main issue seems to be (or perhaps should be): how do we use different serializers/deserializers within Hadoop (Thrift, Record I/O, java.io.Serializable, whatever - I refer to them as serializing platforms)? One clear benefit of supporting various options is to avoid having people write their own code to serialize/deserialize, and to let it be handled automatically no matter what object is being serialized/deserialized. Seems to me like everyone agrees that we need (or at least should think about) a way for Hadoop to support multiple serializing platforms and automatic serialization/deserialization.

        How do we do it? There are typically two steps, or two issues, to handle when serializing/deserializing: how does one walk through the member variables of an arbitrary object (which themselves can be objects) till we're down to basic types, and how does one serialize/deserialize the basic types. The former is typically done in two ways: you use a DDL/IDL to statically generate stubs which contain generated code that walks through objects of a particular type, or you use a generic 'walker' that can dynamically walk through any object (I can only think of using reflection to do this, so this approach seems restricted to languages that support reflection). Thrift and Record I/O both use DDLs, mostly because they need to support languages that do not have reflection, while Java's serialization uses reflection. For the second step (that of serializing/deserializing basic types), serializing platforms like Thrift and Record I/O provide support for serializing/deserializing basic types using a protocol (binary, text, etc.) and a stream (file, socket, etc.). Same with Java. This is typically invoked through an interface which contains methods to read/write basic types into a stream.

        Now, how does all this apply to Hadoop? I'm thinking about serialization not just for key-value pairs for Map/Reduce, but also in other places - Hadoop RPC, reading & writing data into HDFS, etc. For walking through an arbitrary object in Hadoop, one option is to modify the compilers of Record I/O or Thrift to spit out Hadoop-compatible classes (which implement Writable, for example). A better option, since Hadoop is written in Java, is to have a generic walker that uses reflection to walk through any object. A user would invoke a generic Hadoop serializer class, which in turn would call a walker object or itself walk through any object using reflection:

        class HadoopSerializer {
          static public void init(...); // initialize with the serialization platform you prefer
          static void serialize(Object o, OutputStream out);
          static void deserialize(Object o, InputStream in);
        }

        This walker object (for lack of a better name) needs to now link to a serializing platform of choice - Record I/O, Thrift, whatever. All it does is invoke the serialize/deserialize for individual types. In HadoopSerializer.init(), the user can say what platform they want to use, along with what format and what transport, or some such thing. Or you could make HadoopSerializer an interface, have implementations for each serializing platform we support, and let the user configure that when running a job. Now, for the walker object to invoke any of the serializing platforms, the platforms probably need to implement a common interface (which contains methods for serializing/deserializing individual types such as ints or longs into a stream). The other option is to register classes for each type (similar to what has been discussed earlier), but this, IMO, is a pain as most people probably want to use the same platform for all types. You probably would not want to serialize ints using Record I/O and strings using Thrift.

        So, in order to integrate a serializing platform into Hadoop, we'll need a wrapper for each platform which implements the generic interface invoked by the walker. This is quite easy to do - the wrapper is likely to be thin and will simply delegate to the appropriate Thrift or Record I/O object. Having a base class for Thrift would make the Thrift wrapper for Hadoop quite simple, but we can probably still implement a wrapper without that base class.

        Let me get back to the existing discussion now. Seems to me like we haven't addressed the bigger issue of how to make it easy on a user to serialize/deserialize data, and that we're instead just moving around the functionality (and I don't mean that in a dismissive way). I don't think you want a serializer/deserializer per class. Someone still needs to implement the code for serializing/deserializing that class, and I don't see any discussion of Hadoop support for Thrift or Record I/O which the user can just invoke. Plus, if you think of using this mechanism for Hadoop RPC, we will have so many instances of the Serializer<T> interface. You're far better off having a HadoopSerializer class that takes in any object and automatically serializes/deserializes it. All a user has to do is decide which serialization platform to use. There is a bit of one-time work in integrating a platform into Hadoop (writing the wrapper class that implements the interface called by HadoopWalker), but it's not much. What I'm suggesting also matches, I think, Joydeep's experience with using Thrift with HDFS. Or maybe I've completely gone on a tangent. I've been neck deep in Record I/O code and everything looks like a serializable Jute Record to me.
        Dennis Kubes added a comment -

        Here is a SerializableWritable implementation that we currently use. Hope this helps.

        Tom White added a comment -

        Vivek,

        > I'm thinking about serialization not just for key-value pairs for Map/Reduce, but also in other places

        I agree that it would be useful to have a common serialization mechanism for all parts of Hadoop. The serialization mechanism proposed so far is likely to be applicable more widely since it is so general - it talks in terms of input/output streams and parameterized types.

        This Jira issue is confined to the MapReduce part, since we have to start somewhere. I think it would be a useful exercise to think through the implications of the design for other parts of Hadoop before committing any changes though.

        > I don't think you want a serializer/deserializer per class.

        Not per concrete class, agreed. But per base class (e.g. Writable, Serializable, Thriftable, etc).

        > Someone still needs to implement the code for serializing/deserializing that class and I don't see any
        > discussion on Hadoop support for Thrift or Record which the user can just invoke. plus, if you think of
        > using this mechanism for Hadoop RPC, we will have so many instances of the Serializer<T> interface. You're
        > far better off having a HadoopSerializer class that takes in any object and automatically
        > serializes/deserializes it. All a user has to do is decide which serialization platform to use.

        I think you pretty much describe where I would like to get to. If people are using Thrift for example (and there is a common Thrift interface) then there would be a ThriftSerializer that would just work for people, with little or no configuration. While it should still be relatively easy to write a custom serializer/deserializer, most people will use the standard ones for the standard serializing platforms.

        There is a question about where these serializers would go - e.g. would ThriftSerializer go in core Hadoop?

        Tom White added a comment -

        Dennis,

        > Here is a SerializableWritable implementation that we currently use.

        Thanks. This is an example of a wrapper that I hope we can move away from - instead of writing MapReduce jobs with SerializableWritable types (and wrapping/unwrapping using setSerializable/getSerializable), you would be able to write them directly using Serializable types.

        Owen O'Malley added a comment -

        Vivek,
        No one was suggesting a serializer per a concrete class, except in the case of Thrift if they don't implement a generic interface. Your proposal doesn't address how the mapping from an Object to a Serializer is managed. I think my suggestion provides the most flexibility since you only need one serializer per a root class and they don't have any requirements on the implementation classes at all. Basically, each serialization library that someone wanted to use with Hadoop would have a single generic serializer, and a library routine would do the lookups at the first level:

        public interface Serializer<T> {
          void serialize(T t, OutputStream out) throws IOException;
          void deserialize(T t, InputStream in) throws IOException;
          // Get the base class that this serializer will work on
          Class<T> getTargetClass();
        }
        

        org.apache.hadoop.io.serializer.WritableSerializer would be coded to read and write any Writable, while org.apache.hadoop.io.serializer.ThriftSerializer would read and write any Thrift type.

        I'd probably make a utility class:

        class org.apache.hadoop.io.serializer.SerializerFactory extends Configured {
          <T> Serializer<T> getSerializer(Class<? extends T> cls);
        }
        

        and presumably the SerializerFactory would include a cache from the class to serializer class (hopefully with weak references to allow garbage collection). This would allow you to remove all references to Writable in SequenceFile and the map/reduce classes. Any object could be written into sequence files or passed around in map/reduce jobs. It would be cool and should result in only a modest amount of confusion to the users.
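        As a sketch of how that factory and cache could fit together (illustrative code only, assuming the Serializer<T> interface above; the hierarchy walk is my own guess at the "lookups at the first level", and a production version would hold the cache via weak references as noted):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

// The Serializer interface from the proposal above.
interface Serializer<T> {
  void serialize(T t, OutputStream out) throws IOException;
  void deserialize(T t, InputStream in) throws IOException;
  Class<T> getTargetClass();
}

// Illustrative factory: registered serializers are keyed by their target
// (root) class; a lookup walks the class hierarchy once, then is cached.
class SerializerFactory {
  private final Map<Class<?>, Serializer<?>> registry = new HashMap<Class<?>, Serializer<?>>();
  private final Map<Class<?>, Serializer<?>> cache = new HashMap<Class<?>, Serializer<?>>();

  void register(Serializer<?> s) {
    registry.put(s.getTargetClass(), s);
  }

  @SuppressWarnings("unchecked")
  <T> Serializer<T> getSerializer(Class<? extends T> cls) {
    if (cache.containsKey(cls)) {
      return (Serializer<T>) cache.get(cls);
    }
    Serializer<?> found = null;
    // walk up superclasses and direct interfaces to find a registered root
    for (Class<?> c = cls; c != null && found == null; c = c.getSuperclass()) {
      found = registry.get(c);
      if (found == null) {
        for (Class<?> i : c.getInterfaces()) {
          found = registry.get(i);
          if (found != null) break;
        }
      }
    }
    cache.put(cls, found);  // cache misses too, so the walk happens once
    return (Serializer<T>) found;
  }
}
```

        Registering a WritableSerializer would then make the factory resolve every Writable subclass to it without any per-class configuration.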

        Furthermore, since it makes only relatively minor use of reflection, a C++ implementation along similar lines should be feasible. (Although it would be a lot more expensive to evaluate, because dynamic_cast is outrageously costly given C++'s multiple inheritance semantics.)

        Owen O'Malley added a comment -

        It can also be noted that if the serialization frameworks wanted to reference Hadoop, which granted is the anti-goal of this exercise, they could support "foreign" types that were serializable using Hadoop's framework. Thus Thrift records could contain Jute records or vice versa. I haven't thought about the ramifications of that, but it is amusing.

        Joydeep Sen Sarma added a comment -

        Looks great. Can we also incorporate Tom's suggestion of an explicit open call, to cache any wrappers that the serializer wants to create around input/output streams? Thrift will need a base class. It allows different serialization formats - that can be picked up from the configuration. Compared to serializing through a BytesWritable intermediary, we would save 4 bytes per serialized record, as well as some memory copies. Not bad.
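        As a sketch of that explicit open call (hypothetical names, not part of any patch here): open() wraps the raw stream once and caches the wrapper, so serialize() can be called per record without re-allocating wrappers or copying through an intermediate buffer:

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical variant of the Serializer interface with an explicit open()
// call. open() caches any wrapper built around the raw stream; serialize()
// then reuses the cached wrapper for every record.
interface StreamSerializer<T> {
  void open(OutputStream out) throws IOException;
  void serialize(T t) throws IOException;
  void close() throws IOException;
}

// Toy implementation for ints, to show where the cached wrapper lives.
class IntStreamSerializer implements StreamSerializer<Integer> {
  private DataOutputStream out;            // wrapper cached by open()

  public void open(OutputStream raw) {
    out = new DataOutputStream(raw);       // wrap once, not per record
  }

  public void serialize(Integer t) throws IOException {
    out.writeInt(t);                       // 4 bytes, no length prefix
  }

  public void close() throws IOException {
    out.close();
  }
}
```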

        Vivek Ratan added a comment -

        Owen, your details helped me understand your proposal a little bit more, but I'm still unsure why we need serializers for various base types. Let me describe in some detail what I was thinking about, and then let's see how different it is from what you're proposing. It's a relatively long writeup (again), but I couldn't think of a better way.

        I think we all agree with Tom's proposal that we want the Mapper to be instantiated for any kind of key and value classes, so let's start with that requirement. As I mentioned in my earlier comment, there are two approaches.

        Approach 1
        --------------------

        Let's say we have a Serializer interface as follows:

        public interface Serializer {
          public void serialize(Object o, OutputStream out) throws IOException;
          public void deserialize(Object o, InputStream in) throws IOException;
        }
        

        A key difference that I'm proposing is that as far as a user is concerned, a Serializer is able to serialize any kind of object, not an object of a particular base type.

        Next, we implement Serializer for each kind of serialization platform that we want supported in Hadoop. Let's take Record I/O and Thrift, for example. We would have:

        public class RecordIOSerializer implements Serializer {
        
          // this is our link to the Record I/O platform
          private org.apache.hadoop.record.RecordOutput out;
        
          public void init(output_type, stream) {
            switch (output_type) {
              case CSV: out = new CsvRecordOutput(stream);
              ...
            }
          }
        
          public void serialize(Object o, OutputStream out) throws IOException {
            // use reflection to walk through class members
            for each field {
              switch (field type) {
                case byte:
                  out.writeByte(field value);
                  break;
                case string:
                  out.writeString(...);
                ...
              }
            }
          }
        }
        

        Similarly, we would have a ThriftSerializer class that implements Serializer. It would probably have a Thrift TProtocol object as a member variable (this could be a TBinaryProtocol object if we want binary format, or a corresponding one for another format). ThriftSerializer.serialize() would be similar to RecordIOSerializer.serialize() in that the code would use reflection to walk through the class members and call TProtocol methods to serialize base types.

        Whenever we need to serialize/deserialize an object in MapReduce, we simply obtain the right serializer object (RecordIOSerializer or ThriftSerializer or whatever) using a factory or through a configuration file, then simply serialize/deserialize any object. In my earlier comment, I'd mentioned two phases when serializing/deserializing. In this approach, the first phase, that of walking through the class structure, is done in a generic manner, while we invoke a specific serialization platform to do the second phase (that of writing/reading base types).

        This approach is useful in that clients do not have to write DDLs for any class they want to serialize and there is no compilation of DDLs either. You also do not need a generic interface that all serialization platforms have to implement (so there is no need to change anything in Record I/O or Thrift). The tradeoff is the potential expense of reflection. A deserializer may be a little harder to write because you have to ensure that you walk through the class in the same order as you do when you serialized it.
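        To make the reflection-based walk concrete, here is a toy sketch (my own illustration, not proposed code) that handles only int and String fields via java.lang.reflect and DataOutput; a real walker would cover all basic types, nested objects, and a pluggable protocol such as Record I/O or Thrift:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.lang.reflect.Field;

// Toy reflection walker for Approach 1: serialize any object's int and
// String fields in getDeclaredFields() order. Deserialization must walk
// the fields in exactly the same order - the fragility noted above.
class ReflectionSerializer {
  public void serialize(Object o, OutputStream stream) {
    try {
      DataOutputStream out = new DataOutputStream(stream);
      for (Field f : o.getClass().getDeclaredFields()) {
        f.setAccessible(true);
        if (f.getType() == int.class) {
          out.writeInt(f.getInt(o));
        } else if (f.getType() == String.class) {
          out.writeUTF((String) f.get(o));
        } // a real walker would handle all types, including nested objects
      }
      out.flush();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public void deserialize(Object o, InputStream stream) {
    try {
      DataInputStream in = new DataInputStream(stream);
      for (Field f : o.getClass().getDeclaredFields()) {
        f.setAccessible(true);
        if (f.getType() == int.class) {
          f.setInt(o, in.readInt());
        } else if (f.getType() == String.class) {
          f.set(o, in.readUTF());
        }
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}

// example user class: no DDL, no stub, no special base type required
class MyKey {
  int id;
  String name;
}
```

        Note that getDeclaredFields() order is not guaranteed by the language spec, which is one reason a production walker would need a more deliberate field-ordering scheme.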

        Approach 2
        -----------------

        Another approach is to have the serialization platform handle both phases (walking through the class structure and writing/reading basic types). This approach, I think, is closer to what Owen and Tom are suggesting. For this approach, we have to assume that any serialization platform generates classes that can read/write themselves and that all have a common base type. In Record I/O, this base type would be org.apache.hadoop.record.Record, while for Thrift, it would be the interface that Thrift has just added for all generated classes (based on their conversations today). So you would have a type-based Serializer interface, as Tom describes in the first comment in this discussion, and you would have an implementation for org.apache.hadoop.record.Record and one for the Thrift interface. Something like this:

        class RecordIOSerializer2 implements Serializer<org.apache.hadoop.record.Record> {
          void serialize(org.apache.hadoop.record.Record t, OutputStream o) {
            RecordOutput out = getRecordOutput(o, ...);
            t.serialize(out);
          }
          ...
        }
        

        or

        class ThriftSerializer2 implements Serializer<com.facebook.Thrift> {
          void serialize(com.facebook.Thrift t, OutputStream o) {
            ThriftProtocol prot = getThriftProtocol(o, ...);
            t.write(prot);
          }
          ...
        }
        

        In this approach, any class that needs to be serialized/deserialized would have to be described in a DDL (if using Thrift or Record I/O), and a stub generated. So each Key or Value class would have to be generated through its DDL. The stub contains the code for walking through the class and reading/writing it. For example, if I have a class MyClass that I want to use as a Key in a Map/Reduce task, I would write a DDL for it and run the appropriate compiler. In Record I/O, I would get a Java class MyClass, which would extend org.apache.hadoop.record.Record, and this is the class I would use for my Map/Reduce job. I'd also have to make sure that I associate Serializer<org.apache.hadoop.record.Record> with this class.

        The benefit here is that the classes generated by the DDL compiler are optimized for performance, but on the flip side, you need to define a DDL for each class you want serialized, and generate its stub through a DDL compiler. This may be OK for classes in core Hadoop, but it can be a cumbersome step for user-defined classes.

        Summary
        ---------------

        These are the two approaches I can think of, and I suspect your approach maps closely with Approach 2. I personally think that Approach 1 is significantly easier to use and should be preferred unless reflection comes with an unacceptable run-time cost. No DDLs, no stubs, nothing for the user to do. We could also offer both, or use Approach 2 for core Hadoop classes and Approach 1 for classes the user defines. Approach 1 is also cleaner, IMO. You want to give the user a serializer/deserializer that reads/writes any Object. The serialize() and deserialize() methods should accept any object, or should be methods available off of any class that can be serialized/deserialized.

        Owen/Tom, I'm not sure I fully understand your approach (though I do think it's similar to Approach 2), so perhaps you can provide some more details. If a user wants to use their own class for a Key or Value, and they want to use Thrift or Record I/O or their own custom code for serialization/deserialization, do they have to define a DDL for that class and run it through a compiler? Would you also instantiate Serializer for any type other than the base types for Record I/O or Thrift? At a higher level, I still don't like the idea of binding serializers to types. You may want to do that for implementation's sake, but a serializer's interface to a user should accept any object.

        I haven't talked about implementation details regarding how you would configure serializers for various Map/Reduce steps (deserializing Map input, serializing Reduce output, storing intermediate data...). Those seem fairly easy (relatively) to resolve. We can use config files, or class factories, or whatever.

        Vivek Ratan added a comment -

        I forgot to add - using introspection to walk through a class has some more problems. Things like versioning and other kinds of optimization will be difficult with general introspection. As someone mentioned in the Thrift discussion, it also doesn't handle field IDs in DDLs, and other features (such as default values and optional fields) that serialization platforms will want to support. For such cases, DDL-generated classes are better. By offering both approaches, we can let users and developers decide on the tradeoff: ease of use versus more complicated functionality.

        Owen O'Malley added a comment - - edited

        laugh

        Option 1 is precisely what I was proposing, except that you keep blurring how this part works:

        we simply obtain the right serializer object (RecordIOSerializer or ThriftSerializer or whatever) using a factory or through a configuration file

        My proposal spells out how that happens by saying that the configuration has a map between root classes and the corresponding serializer class. When the factory is given an object, it consults the map and constructs the correct serializer. Clearly the factory will cache the information so that it doesn't have to traverse the class hierarchy for constructing each serializer.

        It just simplifies things a bit if the serializer class specifies which class it works on, instead of configuring a value like:

        org.apache.hadoop.io.Writable->org.apache.hadoop.io.WritableSerializer 
        

        If we just provide a list of serializers, the factory can automatically determine what they apply to. Now that I think about it, just having

        interface Serializer<T> {
          void serialize(T t, OutputStream out) throws IOException;
          void deserialize(T t, InputStream in) throws IOException;
        }
        

        should be enough, because we can use reflection to find the value of T for a given serializer class. So, the serializers would just look like:

        class ThriftSerializer implements Serializer<ThriftRecord> {
          void serialize(ThriftRecord t, OutputStream out) throws IOException {...}
          void deserialize(ThriftRecord t, InputStream in) throws IOException {...}
        }
        
        class WritableSerializer implements Serializer<Writable> {
          void serialize(Writable t, OutputStream out) throws IOException {...}
          void deserialize(Writable t, InputStream in) throws IOException {...}
        }
        

        and in the config put:

        <property>
          <name>hadoop.serializers</name>
          <value>org.apache.hadoop.io.WritableSerializer,com.facebook.hadoop.ThriftSerializer</value>
          <description>The list of serializers available to Hadoop</description>
        </property>
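        A sketch of that reflection trick (illustrative only): given a serializer class from the configured list, inspect its generic interfaces to recover the concrete value of T, so the factory can build its class-to-serializer map from a plain list of class names:

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Minimal stand-in for the Serializer<T> interface above.
interface Serializer<T> {
}

class SerializerTypes {
  // Recover T from a class declared as "implements Serializer<SomeType>".
  // Only direct, non-generic implementations are handled in this sketch.
  static Class<?> getTargetClass(Class<?> serializerClass) {
    for (Type t : serializerClass.getGenericInterfaces()) {
      if (t instanceof ParameterizedType) {
        ParameterizedType pt = (ParameterizedType) t;
        if (pt.getRawType() == Serializer.class) {
          return (Class<?>) pt.getActualTypeArguments()[0];
        }
      }
    }
    return null;  // not a direct Serializer<T> implementation
  }
}

// e.g. a serializer declared for any CharSequence
class CharSequenceSerializer implements Serializer<CharSequence> {
}
```

        This works because the type argument of a directly implemented generic interface is retained in the class file, even though instances are erased at runtime.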
        
        Doug Cutting added a comment -

        > No one was suggesting a serializer per a concrete class [ ... ]

        Actually, I thought we might, and would like to preserve that option. I worry about performance of introspection. And, for very simple objects, the overhead of having WritableSerializer#serialize(o,out) call o.write(out) rather than just being o.write(out) may even be significant. Or it may not be. In any case, if record code is generated from a DDL, then we can implement this either way, with per-class serializers or per-baseclass serializers. If we discard the DDL and code-generation, then we're stuck with introspection, no?

        I wonder if we might permit both by having the configuration name not serializers but serializer factories. So one could specify the availability of a WritableSerializerFactory that would be constructed, cached and used to construct serializers for Writables. That could then potentially return a different serializer for each kind of Writable, or the same serializer for all Writables.
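A rough sketch of the factory indirection described above (the names here, like WritableLikeSerializerFactory and SomeWritableBase, are hypothetical stand-ins, not the eventual Hadoop API):

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: the configuration names factories rather than
// serializers, and each factory decides which serializer (if any) to
// hand back for a given class.
interface Serializer<T> {
    void serialize(T t, OutputStream out) throws IOException;
}

interface SerializerFactory {
    // Returns a serializer for c, or null if this factory does not apply.
    Serializer<?> getSerializer(Class<?> c);
}

interface SomeWritableBase { }  // stand-in for org.apache.hadoop.io.Writable

class WritableLikeSerializerFactory implements SerializerFactory {
    // Could return one shared serializer for all Writables, or a
    // specialized one per concrete class -- the factory gets to choose.
    private final Serializer<Object> shared = (t, out) -> { /* t.write(out) */ };

    public Serializer<?> getSerializer(Class<?> c) {
        return SomeWritableBase.class.isAssignableFrom(c) ? shared : null;
    }
}
```

The point of the extra level is visible in getSerializer: the per-class decision happens once, inside the factory, instead of being fixed by the configuration mapping.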

        Finally, if we keep the DDL and generate only the class, not its serializers, then there could theoretically be compatibility issues with other languages. If, for example, the DDL defines different types that map to the same type in Java (short versus character?) then using introspection could cause problems. This is improbable, but another thing to watch out for.

        Do I worry too much?

        Vivek Ratan added a comment -

        No, Approach 1, as I've defined it in my previous comment, is NOT what you're proposing. Approach 1 does not require any DDLs, it does not instantiate Serializer objects for different types; there is no Writable, no ThriftRecord. The confusion/disagreement perhaps stems from the fact that there are two different but related issues being discussed here (maybe we need separate Jiras for each, but I think they're related enough to be discussed together). One issue has to do with how we integrate various serialization platforms into the system, i.e., what the interface looks like to the user; the other has more to do with implementation/configuration (this is probably not a very clean demarcation, but it seems pretty valid to me).

        A lot of initial comments assumed that we would have a serialization interface based on type, so that you could have serializers for different types. These types would usually be base types for each serialization platform, but they also might be more concrete types. What I'm suggesting in Approach 1 is a different way to look at this. Approach 1 does not care whether your platform has a base class or not. It uses reflection to walk through a class structure and interacts with the serialization platform at the level of serializing/deserializing basic types (ints, longs, strings, etc), which each serialization platform provides. Approach 2 is the one that needs you to perhaps create Serializers for base classes for each platform (one for ThriftRecord, one for Jute record, and so on), and that seems closer to your examples.
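A minimal sketch of what the reflection-based Approach 1 might look like. Everything here is illustrative: "Platform" is a hypothetical stand-in for a serialization platform's basic-type writers (Record I/O, Thrift, etc.), and only int and String fields are handled:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.lang.reflect.Field;

// Hypothetical basic-type API each serialization platform would provide.
interface Platform {
    void writeInt(int v, DataOutput out) throws IOException;
    void writeString(String v, DataOutput out) throws IOException;
}

class ReflectionSerializer {
    private final Platform platform;
    ReflectionSerializer(Platform p) { this.platform = p; }

    // Walk the object's fields with reflection, handing each basic type
    // to the platform and recursing into nested record-like objects.
    void serialize(Object o, DataOutput out) {
        try {
            for (Field f : o.getClass().getDeclaredFields()) {
                f.setAccessible(true);
                Class<?> t = f.getType();
                if (t == int.class) {
                    platform.writeInt(f.getInt(o), out);
                } else if (t == String.class) {
                    platform.writeString((String) f.get(o), out);
                } else if (!t.isPrimitive()) {
                    Object child = f.get(o);
                    if (child != null) serialize(child, out);
                }
            }
        } catch (IOException | ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}

// A toy platform and record, just to exercise the walker.
class CountingPlatform implements Platform {
    int ints, strings;
    public void writeInt(int v, DataOutput out) throws IOException { ints++; out.writeInt(v); }
    public void writeString(String v, DataOutput out) throws IOException { strings++; out.writeUTF(v); }
}

class ToyRecord { int id = 7; String name = "x"; }
```

The per-field reflection cost in the inner loop is exactly the performance worry raised in the surrounding comments.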

        [I've sorta waved my hand at how you would configure, or have the user choose between, various serializers, especially in Approach 2. A lot of your comments, and those of Tom and Doug, seem to me to focus on this issue.]

        The reason I harp on the two approaches (Approach 1 and Approach 2) is that they are, to me, quite different. There is a clear tradeoff between usability and performance. Approach 1 favors the former, Approach 2 the latter. Approach 1 is really easy to use. No DDLs and very little for the user to do. However, as I had mentioned earlier, and as Doug's comments seem to indicate, there is a real danger of its performance being slow. I don't have an idea of how slow. Anybody know how expensive introspection can be (I'm sure it also depends on how deeply nested a class is or how many member variables it has, and so on)?

        I think we should support both approaches. It seems quite reasonable to me that there will be users who want to define their own key or value classes, don't want to write serialization/deserialization code for them, don't want to define DDLs or install Thrift or run the Jute compiler, and don't mind paying the extra penalty for introspection. All they need to do is define their Key or Value class, and pick a serialization platform (Record I/O or Thrift or Writable or whatever) through some simple config option. Wherever we serialize/deserialize in the Map/Reduce code (in SequenceFile, or in the Output Collector), the code simply calls Serializer.serialize(), which accepts any Object type. Again, no DDL, nothing. But if you need better performance, or you want to use some fancy DDL feature (such as marking fields as optional or having default values, or even versioning), then you have to support Approach 2, which requires the key and value classes to be defined in DDLs, compiled, and integrated. We don't use these extra DDL features for basic serialization yet, but it's quite reasonable to expect users to want support for them in the near future.

        Maybe what we should do is actually measure the performance implication of introspection. A generic serializer/deserializer for Approach 1 shouldn't be hard to write, and we can compare its performance to that for a DDL-generated class. If the difference is acceptable, it's much simpler to provide just Approach 1. If not, we could either provide both approaches or just provide Approach 2, wait till enough people complain that it's hard, and optionally provide Approach 1.

        If we do use Approach 2, we will need something that handles the mapping between a class and its serializer, and I think your (Owen's) suggestion is fine. I haven't offered any alternate solution.

        On to Doug's comments:

        > If we discard the DDL and code-generation, then we're stuck with introspection, no?

        Yes. No DDLs and no code generation implies Approach 1, and hence introspection. DDLs and code-generation implies Approach 2, and hence no introspection.

        > Finally, if we keep the DDL and generate only the class, not its serializers, then there could theoretically be compatibility issues with other languages. If, for example, the DDL defines different types that map to the same type in Java (short versus character?) then using introspection could cause problems.

        Why would you want to do this? The only benefit of DDL is serializers. I don't understand the use case here.

        > Do I worry too much?

        Introspection performance is a real worry, but we should be able to test it out, and perhaps also get enough anecdotal evidence.

        Owen O'Malley added a comment -

        Actually, I thought we might, and would like to preserve that option.

        With my proposal, of course that is an option, since you could define a Serializer class for each specific class. I don't see the value of it, but it wouldn't be hard.

        I worry about performance of introspection.

        That does worry me too. I think that if you are just doing the look up once, and caching the result, it won't be a problem.
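The look-up-once-and-cache idea can be sketched in a few lines; the class and method names below are illustrative only, with a counter added to make the "slow lookup happens once per class" property visible:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: resolve the serializer for a class once, then serve every
// later request for the same class from a cache, keeping the expensive
// hierarchy walk off the per-record path.
class SerializerCache {
    interface Serializer { /* serialize/deserialize elided */ }

    private final Map<Class<?>, Serializer> cache = new ConcurrentHashMap<>();
    private int slowLookups = 0;  // instrumentation for the sketch

    Serializer forClass(Class<?> c) {
        return cache.computeIfAbsent(c, k -> {
            slowLookups++;  // walk superclasses/interfaces here
            return new Serializer() { };
        });
    }

    int slowLookups() { return slowLookups; }
}
```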

        And, for very simple objects, the overhead of having WritableSerializer#serialize(o,out) call o.write(out) rather than just being o.write(out) may even be significant.

        Laugh I can't see it ever being anything but minor noise. JVMs are really good at making method calls cheap.

        Or it may not be. In any case, if record code is generated from a DDL, then we can implement this either way, with per-class serializers or per-baseclass serializers. If we discard the DDL and code-generation, then we're stuck with introspection, no?

        I assume that each serialization framework generates methods that convert objects to bytes and vice versa. The problem is just that the interfaces aren't the same. So the Hadoop serializers never have to worry about DDLs; they just worry about using each serialization framework's APIs to do the conversion.

        I wonder if we might permit both by having the configuration name not serializers but serializer factories.

        I think that 2 levels of indirection are overkill, but I'm open to discussing it.

        Doug Cutting added a comment -

        > I think that 2 levels of indirection are overkill, but I'm open to discussing it.

        In most scenarios we've discussed, this isn't actually removing a level of indirection but moving it out of the inner loop, where it might be a problem. JVMs can do wonderful things, but removing a method call from an inner loop still frequently makes things faster. It could become significant in, e.g., a <String,String> mapreduce job that's, e.g., inverting links. If we can't make that significant, it might even argue that we've not optimized things enough elsewhere!

        This is an area where we want both transparent simplicity, supporting things like introspection, and also want to permit the highest performance, since it is the innermost loop for many applications.

        Tom White added a comment -

        > Do I worry too much?

        I don't think so - I was actually worrying about the cost of introspection too. I think it'll be OK - as Owen says, the serializer will be cached - but I think we should do some benchmarking before any code is accepted.

        Speaking of code, I hope to have something to show fairly soon, and I hope it will help focus on the issues that we're not sure we're agreeing on.

        Vivek Ratan added a comment -

        After some offline discussions, I think there is some convergence. It seems like most folks have implicitly assumed that DDLs are involved. If a user defines a class that they want to use as a key or value, they (or somebody) would either implement Writable or define it using a DDL, run a Thrift or Record I/O compiler, and use the generated stub. They would also pick the right Serializable<T> implementation. I've been arguing that there is an additional way where someone may not want to go through the pain of writing a DDL and compiling it. I don't have any real use cases beyond the assumption that some folks might find DDLs and compilers to be a pain. A fair compromise is to stick with the Serializable<T> interface that Tom and Owen have been talking about. There will be classes that implement Serializable<RecordI/O record> and Serializable<Thrift record>. There can be an additional implementation for a general-purpose reflection-based serializer, something like class ReflectionSerializer implements Serializable<Object>. I don't know if such an implementation is required right away: it's useful only if someone doesn't want to deal with DDL hassles. So maybe we can keep this as an option for later, if required.

        Tom White added a comment -

        I've now run into WritableComparator which needs to be generalized too. I think what's needed is a RawComparator interface (I'm not too attached to the name) which has the pertinent method of WritableComparator:

        public interface RawComparator<T> extends Comparator<T> {
        
          public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
        
        }
        

        Users will be free to implement optimized versions of RawComparator for their serialization framework, just as Writables do today.
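As a concrete illustration of such an optimized implementation, here is a hypothetical raw comparator for 4-byte big-endian int keys; it orders the serialized bytes directly, never deserializing (IntRawComparator is an invented name, not part of the patch):

```java
import java.util.Comparator;

// The interface from the comment above, restated so this sketch is
// self-contained.
interface RawComparator<T> extends Comparator<T> {
    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

// Hypothetical optimized comparator for fixed-width big-endian ints.
class IntRawComparator implements RawComparator<Integer> {
    public int compare(Integer a, Integer b) { return a.compareTo(b); }

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    private static int readInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }
}
```

This is the same trick today's optimized WritableComparators use: sorting happens on serialized bytes in the shuffle, so skipping deserialization there matters.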

        Then WritableComparator will implement RawComparator, and there'll be a new class DeserializerComparator, also implementing RawComparator, which will use a Deserializer to implement the compare method above. In JobConf the WritableComparators will become RawComparators.

        Does this sound about right?

        Tom White added a comment -

        Here's a patch (serializer-v1.patch) for discussion. It's still quite incomplete: there are no tests or javadoc, it produces lots of warnings and I haven't looked at changing pipes or aggregate to not depend on Writable.

        That said, I've managed to run a MapReduce job that doesn't use Writables. (I created a wrapper round Writable that didn't actually implement Writable, even though the underlying serialization mechanism used Writable.) The next test is probably to try using Thrift types.

        Thoughts?

        Doug Cutting added a comment -

        I think I'd opt for another level of indirection, for two reasons:

        • to bind together serializers and deserializers, which are almost always paired;
        • to permit generation of more specialized serializers and deserializers.

        E.g.,

        <property>
          <name>io.serialization</name>
          <value>WritableSerialization</value>
        </property>
        
        public interface Serialization {
          Serializer getSerializer();
          Deserializer getDeserializer();
        }
        public class SerializationFactory {
          public Serializer getSerializer(Class c) { return getSerialization(c).getSerializer(); }
          public Deserializer getDeserializer(Class c) { return getSerialization(c).getDeserializer(); }
          public Serialization getSerialization(Class c) { ... infer from c's superclasses & interfaces ... }
        }
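The "infer from c's superclasses & interfaces" step could look something like the following sketch (registration by base class stands in for reading the configured list; all names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

interface Marker { }                 // stand-in for e.g. Writable
class Impl implements Marker { }     // a user class using that framework

// Sketch: walk up the class hierarchy (superclasses and interfaces)
// until a registered Serialization matches.
class SerializationFactory {
    interface Serialization { /* getSerializer()/getDeserializer() elided */ }

    private final Map<Class<?>, Serialization> registered = new LinkedHashMap<>();

    void register(Class<?> base, Serialization s) { registered.put(base, s); }

    Serialization getSerialization(Class<?> c) {
        for (Class<?> k = c; k != null; k = k.getSuperclass()) {
            Serialization s = lookup(k);
            if (s != null) return s;
        }
        return null;
    }

    private Serialization lookup(Class<?> k) {
        Serialization s = registered.get(k);
        if (s != null) return s;
        for (Class<?> i : k.getInterfaces()) {
            s = lookup(i);               // interfaces extend interfaces too
            if (s != null) return s;
        }
        return null;
    }
}
```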
        
        Doug Cutting added a comment -

        Also, instead of using introspection, we might use a method, e.g.:

        public interface Serialization {
          boolean accept(Class c);
          Serializer getSerializer();
          Deserializer getDeserializer();
        }
        

        SerializationFactory#getSerialization(Class) would then just iterate through the defined serializations calling accept(). It could cache an instance of each defined serialization. Note also that, for primitive types, we could pass things like Integer.TYPE. So then one might even, e.g., define a mapper that takes <int,long> pairs.
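The accept()-based dispatch described above amounts to a first-match scan over the configured serializations; a hedged sketch (AcceptFactory is an invented name):

```java
import java.util.List;

// The interface from the comment above, restated for a self-contained
// sketch; accept() replaces introspection on the Serialization's type
// parameter.
interface Serialization {
    boolean accept(Class<?> c);
    /* getSerializer()/getDeserializer() elided */
}

class AcceptFactory {
    private final List<Serialization> serializations;
    AcceptFactory(List<Serialization> s) { this.serializations = s; }

    // First serialization to accept the class wins; null if none do.
    Serialization getSerialization(Class<?> c) {
        for (Serialization s : serializations)
            if (s.accept(c)) return s;
        return null;
    }
}
```

Because accept() takes a Class, primitive types work too: a serialization can match int.class (Integer.TYPE), as suggested above.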

        Owen O'Malley added a comment -

        Doug,
        The boolean accept method would be fine. Your proposal would let you split apart serializers from deserializers, which doesn't seem very useful to me. If I use serializer X, I pretty much always want to use deserializer X too. I don't see the value in saying for type X use serializer Y and deserializer Z.

        Doug Cutting added a comment -

        Owen, sure, maybe something like:

        <property>
          <name>io.serialization.factories</name>
          <value>WritableSerializerFactory</value>
        </property>
        
        public interface Serializer {
          void serialize(Object o, OutputStream out);
          Object deserialize(Object reuse, InputStream in);
        }
        
        public interface SerializerFactory {
          Serializer getSerializer(Class c);
        }
        
        public class Serialization {
          public Serializer getSerializer(Class c, Configuration conf) {
            for (factory defined in conf) {
              Serializer s = factory.getSerializer(c);
              if (s != null)
                return s;
            }
            return null;
          }
        }
        
        Joydeep Sen Sarma added a comment -

        question - what's the compatibility path for all applications written with 'Writable'?

        Tom White added a comment -

        Doug/Owen

        These changes generally look good - I'll try to work them into a new patch.

        In the current patch Serializers and Deserializers are stateful with open/close methods, and that was the reason that led me to separate them. We could combine them in a single object, but this would be at the expense of muddying the method names (e.g. closeSerializer and closeDeserializer), so I'm reluctant to do that - I would stick with Doug's first SerializationFactory proposal (plus the accept method).

        Another aspect that the current patch doesn't address is who instantiates objects during deserialization. (Doug - I think you're alluding to this in the "reuse" object in the Serializer class above?) For Writables and Thrift the serialization framework does not instantiate objects - it merely populates the supplied object with the representation from the stream. For Java Serialization the serialization framework reads the type from the stream and instantiates an object for that type. To cater for this difference we need to make the Deserializer expose whether it can reuse types so that the client (for example ReduceTask) knows whether to hand it an object or not. This is needed for efficiency (so the client doesn't needlessly create objects that aren't used) and also since some serialization frameworks don't require classes to have no-arg constructors (so the client would not be able to create the required object in any case).
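The reuse-or-allocate distinction described above can be sketched as a capability flag on the deserializer, with the client only pre-creating an object when it will actually be used. Streams are elided and every name here (canReuse, readNext, the two toy deserializers) is hypothetical:

```java
import java.util.function.Supplier;

// Hypothetical: a Deserializer advertises whether it fills a
// caller-supplied object (Writable/Thrift style) or allocates its own
// (Java serialization style).
interface Deserializer<T> {
    boolean canReuse();
    T deserialize(T reuse);  // InputStream argument elided for brevity
}

class Client {
    // Only construct a reuse object for frameworks that can fill one,
    // avoiding needless allocation -- and avoiding requiring a no-arg
    // constructor when the framework allocates for itself.
    static <T> T readNext(Deserializer<T> d, Supplier<T> ctor) {
        return d.deserialize(d.canReuse() ? ctor.get() : null);
    }
}

// Writable-style: populates and returns the supplied object.
class FillingDeserializer implements Deserializer<int[]> {
    public boolean canReuse() { return true; }
    public int[] deserialize(int[] reuse) { reuse[0] = 42; return reuse; }
}

// Java-serialization style: ignores 'reuse' and allocates.
class AllocatingDeserializer implements Deserializer<int[]> {
    public boolean canReuse() { return false; }
    public int[] deserialize(int[] reuse) { return new int[] { 42 }; }
}
```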

        Tom White added a comment -

        > what's the compatibility path for all applications written with 'Writable'?

        Applications written with Writable should just work with the default configuration. No recompilation would be needed either since going from (e.g.) <V extends Writable> to <V> is backwards compatible. That's the plan anyway.

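        Tom's compatibility claim rests on Java's raw-type rules: client code written against a raw interface still compiles once that interface gains type parameters. A minimal sketch of the general rule being relied on (the names here are illustrative, not Hadoop's actual API):

```java
// A parameterized interface standing in for the newly generified
// Mapper/OutputCollector signatures (illustrative names only).
interface Collector<V> {
    void collect(V value);
}

// Old-style client code that uses the raw type still compiles (with an
// "unchecked" warning), which is the mechanism that lets existing
// Writable-based programs keep working against parameterized interfaces.
class RawStyleClient {
    @SuppressWarnings({"rawtypes", "unchecked"})
    static void emit(Collector out, Object value) {
        out.collect(value); // unchecked call through the raw type
    }
}
```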
        Vivek Ratan added a comment -

        >> Another aspect that the current patch doesn't address is who instantiates objects during deserialization.

        Good point. Maybe this is what Doug and/or you mean, but you could have something like this:

        boolean acceptObjectReference(); // returns true if the framework deserializes into an object, false if it creates an object and returns it
        Object deserialize(Object reuse, InputStream in);
        

        Frameworks that expect the caller to create an object before deserializing it (Thrift, Record I/O) would return null, but others that create their own objects would accept a null value for the 'reuse' parameter. The acceptObjectReference() method tells a client which option the deserializer prefers, though most clients would already know beforehand. Is this similar to what you were thinking about?

        Another option is to have separate deserialize methods:

        Object deserialize(InputStream in);
        void deserialize(Object obj, InputStream in);
        

        Most frameworks would implement only one of these methods, perhaps throwing an exception for the one they don't implement (you'd also need a way for a client to dynamically find out which framework implements which method).

        I lean towards the former - it's more compact, though the latter is a little cleaner (less confusion).

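        Vivek's first option might look like the following in Java; the interface shape, method names, and the toy implementation are all illustrative, not from any actual patch:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of the first option: one deserialize method whose "reuse"
// parameter may be null, plus a probe for the deserializer's preference.
interface Deserializer {
    // true if this framework populates a caller-supplied object,
    // false if it creates and returns new objects itself
    boolean acceptObjectReference();

    Object deserialize(Object reuse, InputStream in) throws IOException;
}

// Toy implementation for java.lang.Long: since Long is immutable it
// ignores "reuse" and always creates a fresh object.
class LongDeserializer implements Deserializer {
    public boolean acceptObjectReference() { return false; }

    public Object deserialize(Object reuse, InputStream in) throws IOException {
        return new DataInputStream(in).readLong();
    }
}
```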
        Tom White added a comment -

        Vivek,

        I was thinking of the first way.

        > Frameworks that expect the caller to create an object before deserializing it (Thrift, Record I/O), would return NULL,
        > but others that create their own objects would accept a NULL value for the 'reuse' parameter.

        It's a small point but I think the return value would always be non-null for convenience. The contract would be that the deserialize method always returns a deserialized object. Deserializers for Thrift, Record I/O etc. would just return the "reuse" object.

        Doug Cutting added a comment -

        I don't see the need for 'acceptObjectReference()'. Clients which wish to reuse objects can, the first time, pass null. For subsequent calls they can pass the value returned from prior call. This will work well with implementations which reuse and with implementations that don't, no?

        Note also that we don't need an explicit 'accept()' method: if factory.getSerializer(c) returns null, then that factory does not know how to serialize instances of the class.

        Finally, if we have stateful serializers, then we have to make it clear that they must not do any buffering. An application should be able to call 'serializer.setOutput(out)', serialize an instance or two, do other, non-serializer output to 'out', and then serialize more instances, without calling 'setOutput' again, right? In the case of Writable, setOutput would just set the protected 'out' field of a DataOutputStream, and this would all work fine. That could be instead done on each call to a 'serialize(Object, OutputStream)' method, but perhaps its better to factor it out of inner loops. Is that the intent?

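        Doug's convention can be sketched with a mutable value type standing in for a Writable (all names here are illustrative, not the actual patch API):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Mutable holder standing in for a Writable-style value (illustrative).
class LongBox {
    long value;
}

// Deserializer following the suggested convention: allocate a LongBox only
// when "reuse" is null, otherwise fill in and return the object passed.
class LongBoxDeserializer {
    public Object deserialize(Object reuse, InputStream in) throws IOException {
        LongBox box = (reuse != null) ? (LongBox) reuse : new LongBox();
        box.value = new DataInputStream(in).readLong();
        return box;
    }
}
```

        A client loop then passes null on the first call and the previous return value afterwards; the same loop works unchanged with deserializers that ignore `reuse` and always allocate.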
        Tom White added a comment -

        > Clients which wish to reuse objects can, the first time, pass null.

        Except there might not be enough type information to construct an object. For example if a WritableSerializer were deserializing a LongWritable how would it know to create a LongWritable object?

        > In the case of Writable, setOutput would just set the protected 'out' field of a DataOutputStream, and this
        > would all work fine. That could be instead done on each call to a 'serialize(Object, OutputStream)' method,
        > but perhaps its better to factor it out of inner loops. Is that the intent?

        From an API point of view I prefer serialize(Object, OutputStream), but it's not clear to me that you can implement this efficiently for any serialization framework. For example, I don't think the technique you describe of setting the 'out' field would work for Java Serialization. And creating a new ObjectOutputStream for every call to the serialize(Object, OutputStream) method would be prohibitive. So unless there's another way of getting round this then I think we're stuck with stateful serializers. (I'd love to be proved wrong on this!)

        Doug Cutting added a comment -

        > Except there might not be enough type information to construct an object.

        The factory can keep that around. So, if deserializer depends on the type of the instance passed in, then the deserializer your factory builds should include the class and create an instance of it when the instance is null. Java Serialization would not need to do this, but Thrift would. I'm trying to avoid client code that differs depending on the serializer.

        > So unless there's another way of getting round this then I think we're stuck with stateful serializers.

        Back to my previous comment: with a stateful serializer, is it permitted to intersperse other i/o on the stream that's passed to the serializer, or must the serializer's input or output be reset each time this is done? If it must be reset, then I see little point to this optimization, as writing raw data between serialized data is common (e.g., SequenceFile writes record lengths, RPC writes request numbers, etc.).

        A related issue is synchronization. If one saves a file position, calls serialize(), can one seek back to that position and call deserialize()? Java Serialization does not generally permit this. Should we add a sync() method that must be called whenever you wish to have a point in the stream that you can seek to? Or would you use open/setOutput for this too?

        Tom White added a comment -

        >> Except there might not be enough type information to construct an object.

        > The factory can keep that around.

        Good idea. This is a very neat solution I think.

        > Back to my previous comment: with a stateful serializer, is it permitted to intersperse other i/o on the stream that's passed to the serializer, or must the serializer's
        > input or output be reset each time this is done?

        I think it is permitted to intersperse - in other words the serializer is not allowed to buffer the stream.

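        The no-buffering contract means something like the following must work, where the caller writes its own framing data between serialized values (record lengths, say, as SequenceFile does). The class names are illustrative:

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Stateful serializer that writes straight through to the stream handed
// to open(), so callers may freely intersperse their own raw output.
class LongSerializer {
    private DataOutputStream out;

    public void open(OutputStream stream) {
        this.out = new DataOutputStream(stream);
    }

    public void serialize(long value) throws IOException {
        out.writeLong(value); // no internal buffering
    }
}
```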
        Vivek Ratan added a comment -

        >> The factory can keep that around. So, if deserializer depends on the type of the instance passed in, then the deserializer your factory builds should include the class and create an instance of it when the instance is null. Java Serialization would not need to do this, but Thrift would. I'm trying to avoid client code that differs depending on the serializer.

        This might be difficult if you have a serializer that can handle lots of classes. Take the example of Record I/O: every class that can be serialized inherits from Record. There is only one serializer, that for Record I/O, but it can handle any Record class (and there are an infinite number of such classes). You may want to create a singleton Record I/O serializer to handle more than one class that inherits from Record, and it won't know which class to deserialize (or, it will have to handle a huge number of classes). I understand that you're trying to avoid extra client code, but you may end up unnecessarily complicating the platform code. Furthermore, conceptually you do want the client to distinguish between serializers that create objects and those that expect the client to create them. This is not so relevant in Java, with its memory management, but for other languages, you do want to make it explicit as to who is responsible for memory management.

        Serializers that create their own objects and pass them back to the client are, in many ways, fundamentally different from those that expect clients to pass in an object to deserialize. The former expect deserialized objects to have a constructor with no parameters, and the objects are quite simple wrappers around data. In the latter case, the objects are usually much more than simple wrappers around member variables and their constructors can be quite complicated. I guess what I'm saying here is that these two types of serializers are different enough, and a serializer that supports both will be rare enough, that you don't want to hide that difference in your common serializer interface. I think a client will either always pass objects that it constructs itself, or get back new objects from the serializer; I don't think it will mix these calls up with the same serializer. So I think it's fine, and desirable, for clients to explicitly make different calls to the two types of serializers. In fact, it would seem likely that most clients will be written explicitly for one of these two kinds of serializers, given that a client will likely use the same platform for serialization and deserialization.

        Doug Cutting added a comment -

        > This might be difficult if you have a serializer that can handle lots of classes.

        I don't yet see the difficulty.

        public class RecordSerializerFactory implements SerializerFactory {
          public Serializer getSerializer(Class c) {
            return new RecordSerializer(c);
          }
        }
        
        public class RecordSerializer {
          private Class recordClass;
          public RecordSerializer(Class c) {
            this.recordClass = c;
          }
          public void serialize(Object o, OutputStream out) { ... }
          public Object deserialize(Object reuse, InputStream in) {
            Object instance = this.recordClass.newInstance();
            ...
            return instance;
          }
        }
        

        So record serialization can use the same method implementation to serialize all instances. That method can know the class of the instance to be serialized. And that method need not reuse an instance, but can always create a new instance. What's difficult about this?

        Vivek Ratan added a comment -

        Things get difficult if you want to use a singleton serializer for more than one class. In your example, suppose that RecordSerializer is the Record I/O serializer, and can serialize any class that derives from Record. If I want to use Record I/O to serialize all my classes (my key, my value, my intermediate key, my intermediate value, etc), then with your scheme, we'd create one RecordSerializer object per class that we want to serialize, so one for my intermediate map keys, one for my intermediate map values, and so on. As we've discussed earlier, serializer objects can contain state (an input or output stream that they keep open across serializations, for example). So having multiple RecordSerializer objects can be a problem, especially if more than one serializes to the same stream. It's quite plausible that we may want a singleton RecordSerializer object. Well, if it can only store one class in its private recordClass variable, then I can't use a singleton object to serialize multiple classes.

        All I'm saying is that if we associate one serializer object with one class, we lose the ability to share serializer objects across classes, which seems quite stifling. And I'm also arguing that we do want clients to explicitly code for the two different kinds of serializers so that memory management is clearer, as well as the performance impact (it's good to know who is responsible for creating what objects so we can minimize object creation).

        Vivek Ratan added a comment -

        Another problem I see is that RecordSerializer only works for those classes that have constructors that take no arguments (otherwise Class.newInstance() will fail, right?). This is also restrictive. As I mentioned in my previous comment, in platforms that require clients to pass in a class to deserialize, the classes can be quite complicated, and may not just be a wrapper around data. These classes need not have constructors that take in no arguments. Forcing a user to add such a constructor is unnecessarily restrictive.

        Owen O'Malley added a comment -

        > Another problem I see is that RecordSerializer only works for those classes that have constructors that take no arguments (otherwise Class.newInstance() will fail, right?). This is also restrictive.

        I don't see it Vivek. The RecordSerializer would only work with classes that have 0-argument constructors, but in general Serializers can handle arbitrary constructors for the constructed objects. So if you have some class X that doesn't have a 0-arg constructor, just write a Serializer that knows how to build it.

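        Owen's suggestion in code form: a deserializer written for a specific class can invoke whatever constructor it likes, so the no-arg requirement only applies to generic frameworks like the RecordSerializer sketch above. The names are illustrative:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// A value class with no no-arg constructor (illustrative).
class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

// A deserializer written specifically for Point: it knows how to call
// the two-argument constructor, so Class.newInstance() is never needed.
class PointDeserializer {
    public Point deserialize(InputStream in) throws IOException {
        DataInputStream d = new DataInputStream(in);
        return new Point(d.readInt(), d.readInt());
    }
}
```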
        Vivek Ratan added a comment -

        But I was talking specifically about RecordSerializer, and in the context of what interface we should have to handle serializers that create their own objects and those that take in a reference. My comments were in response to Doug's example. To paraphrase my argument, having a serializer keep track of the class it was created for is difficult/restrictive if the serializer can handle more than one class.

        Yes, serializers can handle arbitrary constructors, but a serializer that can deserialize more than one class (for example, one for Record I/O that can deserialize any class that inherits from the base Record I/O class) requires that the classes all be constructed the same way (i.e. have the same constructor signature), if it is responsible for creating deserialized objects. Or else you give it an already-constructed object (and it just fills in the fields when deserializing).

        Doug Cutting added a comment -

        > a serializer than can deserialize more than one class (for example, one for Record I/O that can deserialize any class that inherits from the base Record I/O class) requires that the classes all be constructed the same way

        Yes, that's more-or-less assumed, but I yet fail to see it as a problem. All record classes are generated from an IDL and it should be easy to generate a no-arg ctor for those. Ditto for thrift. Things that implement Writable today already must have a no-arg ctor. Can you please provide a more detailed example of something that would prove difficult and why it is important that it be easy?

        Vivek Ratan added a comment -

        >> Yes, that's more-or-less assumed, but I yet fail to see it as a problem. All record classes are generated from an IDL and it should be easy to generate a no-arg ctor for those. Ditto for thrift. Things that implement Writable today already must have a no-arg ctor. Can you please provide a more detailed example of something that would prove difficult and why it is important that it be easy?

        I guess I'm doing a poor job explaining my point because the larger issue seems to be missed. Given that there are two kinds of deserializers, those that create objects and those that take in object references, we're discussing what the deserialization interface should look like to handle both these kinds of deserializers, right? We seem to have two choices: have the client figure out which kind of deserializer it is interacting with and have it call the right deserialize method, or have a single deserialize method and let the deserializer create an object where necessary or use one provided by the client. Right? Doug provided an example of the latter, and I thought there were some issues. The biggest one is this: for a deserializer to create an object, it needs to know the type of the object that is being deserialized. Some deserializers (such as Java's serialization, and Writables, I think) know this, because the class name is part of the serialized data. Others, such as Thrift or Record I/O, do not serialize the class name (I'm pretty sure about Record I/O, and the Thrift code I saw sometime back didn't serialize class names, as far as I can remember), so they do not know which object to create. Doug suggested that the serializer store the class name, when it is created by the class factory. I said that wouldn't work because you will likely want a singleton deserializer object to handle deserializing more than one class so you cannot link a deserializer object with only one class. What this means is that, IMO, you cannot get Thrift or Record I/O deserializers to create objects, the way they work today, and for them, the client has to pass in an object. Similarly, there can be other serializers that always create an object, and they cannot use one passed in by the client. That is the crux of my argument.

        I also mentioned that for a serializer to create its own objects, it requires all deserialized objects to support constructors with no args. Yes, Thrift, Record I/O, and Writables do so, but if we want to support various other kinds of serialization platforms, we're forcing every supported platform to use no-arg constructors. This seems like an unnecessary restriction to me. I don't have an example of a deserializer where this would be an issue, but I can easily imagine situations where you have objects without no-arg constructors (there are lots of objects that we design where we don't want no-arg constructors) which you want to deserialize. Anyways, this is a minor point, and mostly theoretical (though valid, IMO). But it adds to my argument that you want to have separate deserialize methods and let the client call the right one. (There is also my argument that it's good design to have separate methods to make memory management explicit, especially for languages like C++, but I admit it's not a strong argument if we're only looking at Java.)

        Again, my point is that deserializers for Thrift and Record I/O cannot create objects themselves and will always require the client to pass in the object (or invoke the deserialize method on a known object), so they, or a layer around them, cannot support a single deserialize method that can optionally take in an object from a client or create one of its own, at least not without a lot of pain.
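For concreteness, the two deserializer shapes under discussion could look like this minimal sketch (all names here are invented for illustration, not from the patch):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Style 1: the deserializer allocates the object itself. This is possible
// when the serialized data carries the type, as in Java serialization.
interface CreatingDeserializer<T> {
  T deserialize(DataInputStream in) throws IOException;
}

// Style 2: the caller supplies a pre-constructed object and the deserializer
// only fills in its fields. This is the Thrift / Record I/O style, where the
// stream carries no class name.
interface FillingDeserializer<T> {
  void deserialize(T reusable, DataInputStream in) throws IOException;
}

public class TwoStyles {
  static class IntRecord { int value; }   // a trivial record-like class

  static int roundTrip() throws IOException {
    byte[] data = {0, 0, 0, 42};          // one big-endian int, no class name
    FillingDeserializer<IntRecord> filler =
        (rec, in) -> rec.value = in.readInt();
    IntRecord rec = new IntRecord();      // the client must allocate
    filler.deserialize(rec, new DataInputStream(new ByteArrayInputStream(data)));
    return rec.value;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(roundTrip());
  }
}
```

The crux of the disagreement is whether clients should see both interfaces, or a single one that hides the difference.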

        Doug Cutting added a comment -

        > I said that wouldn't work because you will likely want a singleton deserializer object to handle deserializing more than one class [...]

        I was with you to that point. Why must you have a singleton serializer instance that handles more than one class? If the deserializer does not need to know the class (e.g., Java serialization) then a singleton factory can be used. But if the deserializer does need to know the class, either to create an instance or for deserialization itself, then a different factory instance would need to be created per class. These could be cached by the framework, so no per-deserialized-object allocations need happen. The client (e.g., SequenceFile) can reuse serializers, so they need not be allocated per object either.

        > But it adds to my argument that you want to have separate deserialize methods and let the client call the right one.

        So would clients like SequenceFile and the mapreduce shuffle require different code to deserialize different classes? We need to have generic client code.

        > Again, my point is that deserializers for Thrift and Record I/O cannot create objects themselves and will always require the client to pass in the object [...]

        Again, I don't see why Record I/O, where we control the code generation from an IDL, cannot generate a no-arg ctor. Similarly for Thrift. The ctor does not have to be public. We already bypass protections when we create instances.
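The "bypass protections" point can be sketched with plain JDK reflection; the class name here is hypothetical, but the technique is what frameworks typically use to call a non-public no-arg constructor:

```java
import java.lang.reflect.Constructor;

public class PrivateCtor {
  static class Record {
    int value = 7;
    private Record() {}   // non-public no-arg ctor, as an IDL compiler might emit
  }

  // Instantiate a class whose no-arg constructor is not public.
  static <T> T newInstance(Class<T> c) throws Exception {
    Constructor<T> ctor = c.getDeclaredConstructor();
    ctor.setAccessible(true);   // bypass the access check
    return ctor.newInstance();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(newInstance(Record.class).value);
  }
}
```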

        Vivek Ratan added a comment -

        >> Why must you have a singleton serializer instance that handles more than one class?

        For many reasons. An easy one I can think of is that serializer instances can have state (an input or output stream that they keep open across each serialization, for example). We've been talking about stateful serializers earlier in this discussion and it seems to me quite possible we'll associate state with serializers for performance. Let's say you have a key class and a value class, both of which are generated from the Record I/O compiler, so that both inherit from the Record base class. And let's say you want to serialize a number of keys and values into a file: a key, followed by a value, followed by another key, and so on. If you have a separate serializer instance for each of the key and value classes, they need to share the same OutputStream object to the file you serialize them to. Having one serializer instance that handles both keys and values (since they're both Records) will be cleaner and easier. It's also quite possible that we have serialization platforms that contain other state (maybe they use some libraries that need to be initialized once, for example). So forcing people to not create serializers for more than one class seems restrictive. The choice of whether the serialization platform shares an instance across multiple classes should be left to the platform.
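The key/value sharing scenario could be sketched like this (class and method names are invented stand-ins for the Record I/O generated classes):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class SharedSerializer {
  // Stand-in for the Record base class that both generated classes extend.
  interface Record { void write(DataOutput out) throws IOException; }

  static class Key implements Record {
    final int id;
    Key(int id) { this.id = id; }
    public void write(DataOutput out) throws IOException { out.writeInt(id); }
  }

  static class Value implements Record {
    final String text;
    Value(String text) { this.text = text; }
    public void write(DataOutput out) throws IOException { out.writeUTF(text); }
  }

  // The shared state: a single stream kept open across serializations.
  private final DataOutputStream out;
  SharedSerializer(OutputStream raw) { this.out = new DataOutputStream(raw); }

  // One instance serializes any Record, key or value, to the same file.
  void serialize(Record r) throws IOException { r.write(out); }

  static int demoSize() throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    SharedSerializer s = new SharedSerializer(buf);
    s.serialize(new Key(1));        // key, value, key, value, ...
    s.serialize(new Value("a"));
    return buf.size();              // 4 bytes for the int + 3 for the UTF string
  }

  public static void main(String[] args) throws IOException {
    System.out.println(demoSize());
  }
}
```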

        >> So would clients like SequenceFile and the mapreduce shuffle require different code to deserialize different classes? We need to have generic client code.
        Yes, and that is the fundamental tradeoff. The flip side of what I'm suggesting is that the client has to write separate code for two kinds of serializers. That's not great, but I'm arguing that that is better than restricting the kind of serialization platforms we use, or restricting how we use them. The client will have to write something like:

        if (serializer.acceptObjectReference()) {
          <some Class> o = new <some Class>();
          serializer.deserialize(o);
          ...
        }
        else {
          <some Class> o = serializer.deserialize();
          ...
        }
        

        Yeah, it's not great, but it's not so bad either, compared to forcing serialization platforms to not create shared serializers. But it is a tradeoff. If folks think we're OK forcing serialization platforms to not share serializer instances across classes, resulting in cleaner client code, then that's fine. I personally would choose the opposite. But I hope the tradeoff and the pros and cons are clear.

        >> Again, I don't see why Record I/O, where we control the code generation from an IDL, cannot generate a no-arg ctor. Similarly for Thrift. The ctor does not have to be public. We already bypass protections when we create instances.
        Well, yes for Thrift and Record I/O but maybe not so for some other platform we may want to support in the future (and whose code we cannot control). And besides, no-arg constructors are not the main reason for supporting a single deserialize method, singleton serializers are.

        Doug Cutting added a comment -

        > serializer instances can have state (an input or output stream, that they keep open across each serialization, for example)

        Ah, stateful serializers again. Above we agreed that "stateful" serializers could not buffer, since we might wish to put raw binary values between serialized objects (as SequenceFile does). Do you dispute that? If not, then I don't see how per-class serializer instances are a problem. In the case of Writables, the serializer's "state" would just be a DataOutputStream whose output could be re-directed. We also need to permit seeks to the position where an object was written, no? So unless we permit serializers to buffer, I still don't see what problematic state a serializer can have.
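The "DataOutputStream whose output could be re-directed" idea could look roughly like this (the open/close method names are assumptions, not from the patch):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class RedirectableSerializer {
  // The serializer's only "state": a DataOutputStream over whatever
  // underlying stream the client currently points it at.
  private DataOutputStream out;

  void open(OutputStream raw) { this.out = new DataOutputStream(raw); }
  void serialize(int v) throws IOException { out.writeInt(v); }
  void close() throws IOException { out.flush(); }

  public static void main(String[] args) throws IOException {
    RedirectableSerializer s = new RedirectableSerializer();
    ByteArrayOutputStream a = new ByteArrayOutputStream();
    s.open(a); s.serialize(1); s.close();   // first destination
    ByteArrayOutputStream b = new ByteArrayOutputStream();
    s.open(b); s.serialize(2); s.close();   // re-directed to a new stream
    System.out.println(a.size() + " " + b.size());
  }
}
```

Since nothing is buffered inside the serializer, the client remains free to interleave raw binary values between serialized objects.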

        Also, don't we permit mixing of serializers in a file? Couldn't one have, e.g., a Record i/o-defined key and a Thrift-defined value? Unless we prohibit that, clients cannot reliably share serializers.

        Note that, with these restrictions, using something like Java Serialization for small objects will be very expensive. But these shortcomings of Java Serialization are the reason we're not using Java Serialization, so such pain is to be expected.

        > it's not great, but it's not so bad either

        It is bad. Client code should not have to replicate logic. The framework should encapsulate it. That's a requirement.

        > Well, yes for Thrift and Record I/O but maybe not so for some other platform we may want to support in the future [...]

        Tell me more about this supposed platform, how it works, how it constructs instances, etc. I'm having a hard time imagining one that cannot fit within the proposed framework.

        Vivek Ratan added a comment -

        >>Above we agreed that "stateful" serializers could not buffer, since we might wish to put raw binary values between serialized objects (as SequenceFile does). Do you dispute that?
        No. I agree that serializers should not buffer. But serializer instances can share output streams or other objects, and that's what I meant by 'state'.

        It seems to me that what you're saying is that if you want a serialization platform X to work with Hadoop, X should do two things, at least:

        • X should allow creation of multiple instances of its serializer (so, for example, if X's serializer instances share anything, like library handles or stream objects, they have to deal with any issues that arise from this sharing, such as initializing or destroying these shared instances; i.e., X is responsible for all this)
        • X needs to be able to both create objects before deserializing them (i.e., those objects should have no-arg constructors, or should all be constructed in a common manner) and take in a reference to an object and initialize its member variables with deserialized data.

        If X follows these, then we get client code that is generic and does not have to 'replicate logic', as you say. Correct?

        I'm all in favor of client code not replicating framework logic. It's definitely an important requirement. But I see it as coming with a price: the two constraints above that X must follow. Now, Thrift or Record I/O shouldn't have any problems with these constraints, which is quite important to know. But the constraints are non-trivial enough that some other platform might not be able to satisfy them. Unfortunately, I do not have a concrete example of such a platform. At the same time, I can realistically imagine a platform that does not force its de/serializable objects to have no-arg constructors (because that can be a severe restriction in the design of an object), and requires the caller to pass in an object reference (much like Java Serialization, but without the Java platform having to create the objects when deserializing). But yes, these are somewhat hypothetical arguments. I also understand that we should perhaps favor design that supports existing serialization platforms and not make it too general if there's a price.

        At this point, I think it's a gut call. If we feel that having clients not replicate platform logic is more important than the restrictions we're providing on serialization platforms, that's fine. I can certainly see the validity of that, and can't argue strongly against it. I lean (slightly) more towards the other side, but I don't have concrete examples to lean too far.

        Doug Cutting added a comment -

        > X should allow creation of multiple instances of its serializer [ ...]

        I don't see how this is required by anything I've proposed. If class names are serialized with class data, then a single serializer instance could be returned for a large number of different classes. If class names are not serialized with class data, then a different serializer instance could be returned for each class, but these could be cached, so that no more than a single instance is created per serialized class. If a factory creates multiple instances of its serializer, and those instances share state, then yes, they are responsible for coordinating their state. That seems reasonable and expected.
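The per-class caching Doug describes could be sketched as follows (names invented; the point is only that at most one deserializer instance is ever created per serialized class):

```java
import java.util.HashMap;
import java.util.Map;

public class DeserializerCache {
  interface Deserializer<T> { /* deserialize methods elided */ }

  // A per-class deserializer that remembers its type, needed because the
  // serialized data carries no class name.
  static class RecordDeserializer implements Deserializer<Object> {
    final Class<?> type;
    RecordDeserializer(Class<?> type) { this.type = type; }
  }

  private final Map<Class<?>, Deserializer<?>> cache = new HashMap<>();

  // Repeated lookups return the cached instance: no per-object allocation.
  @SuppressWarnings("unchecked")
  <T> Deserializer<T> get(Class<T> c) {
    return (Deserializer<T>) cache.computeIfAbsent(c, RecordDeserializer::new);
  }

  public static void main(String[] args) {
    DeserializerCache factory = new DeserializerCache();
    System.out.println(factory.get(String.class) == factory.get(String.class));
  }
}
```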

        > X needs to be able to both create objects before deserializing them and take in a reference to an object and initialize its member variables with deserialized data.

        No. It must be able to create instances, but it need not use a passed in reference. As an optimization it may use a reference passed in when optimized client code passes in non-null references. The implementation and use of references is optional.
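The optional-reuse signature being described could look like this (a sketch with invented names; the framework allocates when passed null, and an optimized client may pass a reusable instance):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;

public class OptionalReuse {
  static class IntRecord { int value; }

  // A single generic entry point: fill the passed-in object when the caller
  // supplies one, allocate a fresh one when it passes null.
  static IntRecord deserialize(IntRecord reuse, DataInput in) throws IOException {
    IntRecord r = (reuse != null) ? reuse : new IntRecord();
    r.value = in.readInt();
    return r;
  }

  static DataInput input() {
    return new DataInputStream(
        new ByteArrayInputStream(new byte[] {0, 0, 0, 5}));
  }

  public static void main(String[] args) throws IOException {
    IntRecord fresh = deserialize(null, input());    // framework allocates
    IntRecord mine = new IntRecord();
    IntRecord same = deserialize(mine, input());     // optimized client reuses
    System.out.println(fresh.value + " " + (same == mine));
  }
}
```

Client code like SequenceFile can then stay generic: it calls the one method either way.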

        > At this point, I think it's a gut call. If we feel that having clients not replicate platform logic is more important than the restrictions we're providing on serialization platforms, that's fine.

        Yes, I think having clients not replicate platform logic is a mandate. The framework should maximally encapsulate serialization logic. But I still don't see what onerous restrictions this inflicts on serialization platforms.

        Vivek Ratan added a comment -

        >> If class names are serialized with class data, then a single serializer instance could be returned for a large number of different classes. If class names are not serialized with class data, then a different serializer instance could be returned for each class, but these could be cached, so that no more than a single instance is created per serialized class.

        Agreed.

        >> If a factory creates multiple instances of its serializer, and those instances share state, then yes, they are responsible for coordinating their state. That seems reasonable and expected.

        This is what I thought can get difficult. It is up to us (Hadoop), as owners of the factory, to ensure that state is coordinated between instances, and is not X's responsibility. It may or may not be easy. It seems reasonably easy for Thrift and Record I/O, though.

        Tom White added a comment -

        I've attached a second patch, serializer-v2.patch, which has the Serialization abstraction and the object reuse facility for Deserializers that Doug suggested. Tests, warnings and javadoc still need attending to.

        I've also included some examples that show how to use different serialization frameworks for running MapReduce programs. It includes Writable, Thrift, Record IO and Java Serialization examples. It might be good to improve these so they can be included in Hadoop - they are pretty rough and ready at the moment, which is why they are a standalone tarball.

        Doug Cutting added a comment -

        Tom, this looks great to me. Bravo!

        I think including at least one non-Writable serializer in the initial commit would be best, to demonstrate the generality of the abstraction. Probably JavaSerialization would be best, since it has no external dependencies.
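A JavaSerialization-based (de)serializer could be built on nothing but the JDK's object streams, roughly like this (the actual interface in the patch may differ; this only shows why no external dependency is needed):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class JavaSer {
  static byte[] serialize(Serializable o) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
      out.writeObject(o);       // the class name travels with the data
    }
    return buf.toByteArray();
  }

  static Object deserialize(byte[] data)
      throws IOException, ClassNotFoundException {
    try (ObjectInputStream in =
             new ObjectInputStream(new ByteArrayInputStream(data))) {
      return in.readObject();   // so the deserializer can create the instance
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(deserialize(serialize("hello")));
  }
}
```

Because the stream embeds class names, this is the case where a single deserializer instance suffices for all classes.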

        Also, we don't need RecordSerialization, do we, since records currently implement Writable? However, many would prefer it if records didn't implement Writable. So a RecordSerialization that didn't rely on Writable but only on the Record base class would be great to have. Then we could deprecate the implementation of Writable by the record compiler, and make Record I/O available as a separate, standalone jar, as some have requested.

        eric baldeschwieler added a comment -

        The code seems more readable on average after this work, so I think it represents a step forward from where we were. Good stuff, Tom.

        That said...
        I just went through this whole thread. I find it quite mind bending. I remain convinced that the use of templated types at all in our lowest level API is a mistake and all of this templated foo should be an optional layer on top of a basic byte oriented API. Other famous map-reduce implementations make that choice (byte API only) and they get to bypass all of this discussion and complexity. Something I ask everyone to think about. That is not a reason to reject this patch, since I think this looks like an incremental improvement.

        The code doesn't look like it adds any appreciable overhead, a function call and a comparison per read seems likely to be in the noise, unless the stream class bites us. Have you validated that we're not taking on extra cost here? We're trying to strip out layers of copies and other framework fat.

        Tom White added a comment -

        > I think including at least one non-Writable serializer in the initial commit would be best, to
        > demonstrate the generality of the abstraction. Probably JavaSerialization would be best, since it
        > has no external dependencies.

        I'll do this.

        > Also, we don't need RecordSerialization do we, since records currently implement Writable?

        Correct - I overlooked this.

        > However many would prefer it if records didn't implement Writable. So a RecordSerialization
        > that didn't rely on Writable but only on the Record base class would be great to have. Then we
        > could deprecate the implementation of Writable by the record compiler, and make record io
        > available as a separate, standalone jar, as some have requested.

        Sounds good to me.
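The pluggable-serialization idea being agreed on above can be sketched roughly as follows. This is a hedged sketch: the interface shapes and names (`Serialization`, `Serializer`, `JavaSerialization`) are illustrative, not the exact API that was committed. The point is that a Java-serialization backend needs only the JDK's `ObjectOutputStream`, hence no external dependencies.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

public class SerializationSketch {
  interface Serializer<T> {
    void serialize(T t, OutputStream out) throws IOException;
  }

  interface Serialization<T> {
    boolean accept(Class<?> c);               // can this framework handle c?
    Serializer<T> getSerializer(Class<T> c);  // serializer for instances of c
  }

  // Java-serialization backend: accepts anything Serializable, JDK-only.
  static class JavaSerialization implements Serialization<Serializable> {
    public boolean accept(Class<?> c) {
      return Serializable.class.isAssignableFrom(c);
    }
    public Serializer<Serializable> getSerializer(Class<Serializable> c) {
      return (t, out) -> new ObjectOutputStream(out).writeObject(t);
    }
  }

  public static void main(String[] args) throws IOException {
    JavaSerialization s = new JavaSerialization();
    System.out.println(s.accept(String.class));  // true: String is Serializable
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    s.getSerializer(Serializable.class).serialize("hello", buf);
    System.out.println(buf.size() > 0);          // true: bytes were written
  }
}
```

A `RecordSerialization` in this style would accept subclasses of the `Record` base class instead of `Serializable`, decoupling record io from `Writable` as suggested.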

        Tom White added a comment -

        > I remain convinced that the
        > use of templated types at all in our lowest level API is a mistake and all of this templated foo
        > should be an optional layer on top of a basic byte oriented API. Other famous map-reduce
        > implementations make that choice (byte API only) and they get to bypass all of this discussion
        > and complexity.

        Interesting point - it sounds like such a design might be achieved by moving the serialization stuff into its own layer above core Map Reduce.

        >The code doesn't look like it adds any appreciable overhead, a function call and a comparison
        > per read seems likely to be in the noise, unless the stream class bites us.
        > Have you validated that we're not taking on extra cost here?

        Not yet, I plan to do so. Also, I need to add javadoc to the new classes and fix some warnings.

        eric baldeschwieler added a comment -

        > Interesting point - it sounds like such a design might be achieved by moving the serialization stuff into its own layer above core Map Reduce.

        EXACTLY!

        I've created HADOOP-2429 to discuss this point.

        Arun C Murthy added a comment -

        I'm moving this to 0.17.0 while we continue discussions here...

        Tom White added a comment -

        New patch that works with trunk. Includes javadoc and warnings fixes.

        Tom White added a comment -

        Updated patch to apply to trunk.

        Mukund Madhugiri added a comment -

        Tom,
        I picked up the patch to get the benchmarks going and see that the patch applies fine, but it fails to compile with trunk:

        compile-core-classes:
        [javac] Compiling 454 source files to /trunk/build/classes
        [javac] /trunk/src/java/org/apache/hadoop/util/CopyFiles.java:820: cannot find symbol
        [javac] symbol : constructor Sorter(org.apache.hadoop.fs.FileSystem,org.apache.hadoop.io.Text.Comparator,java.lang.Class<org.apache.hadoop.io.Text>,org.apache.hadoop.conf.Configuration)
        [javac] location: class org.apache.hadoop.io.SequenceFile.Sorter
        [javac] SequenceFile.Sorter sorter = new SequenceFile.Sorter(fs,
        [javac] ^
        [javac] Note: Some input files use or override a deprecated API.
        [javac] Note: Recompile with -Xlint:deprecation for details.
        [javac] 1 error

        Tom White added a comment -

        Mukund - sorry about that. Here's a new patch that compiles (and tests run too). Thanks!

        Mukund Madhugiri added a comment -

        Tom,
        I ran the Sort benchmarks on 20, 100 and 500 nodes and see that Random Writer takes twice as long on 100 nodes compared to trunk. I will do a re-run when the cluster frees up tomorrow to see if it is repeatable.

        Here is the data from the runs:

        • Sort on 20 nodes:
          Job              trunk (hrs)   trunk + patch (hrs)
          Random Writer    0.11          0.11
          Sort             0.31          0.33
          Sort Validation  0.15          0.15
        • Sort on 100 nodes:
          Job              trunk (hrs)   trunk + patch (hrs)
          Random Writer    0.13          0.29
          Sort             0.55          0.44
          Sort Validation  0.21          0.21
        • Sort on 500 nodes:
          Job              trunk (hrs)   trunk + patch (hrs)
          Random Writer    0.26          0.27
          Sort             1.1           1.2
          Sort Validation  0.21          0.23

        I see a checksum error in the JobTracker logs on the 500 node run, but see it on the trunk run as well. So, it is not due to your patch.

        Owen O'Malley added a comment -

        The performance overhead seems to be 10% or so, which is pretty high. We need to bring that down before I think it is committable. Sorry about that! It probably needs a pass over with the profiler.

        Tom White added a comment -

        Mukund - Thanks for running the benchmarks.

        Owen - I'm uncomfortable with the apparent overhead too. So I'm going to run some local benchmarks to see if I can see where it's coming from.

        Tom White added a comment -

        I've written a local benchmark to see the effect of the patch. Focusing on RandomWriter: the input to the map is not read from disk, and there are no reducers, so the bulk of the processing is writing the random output to a SequenceFile. The benchmark simulates this pattern by writing Writable keys and values to an in-memory filesystem. The file was 256MB, with 256-byte keys and values. Here are the numbers (using Java 6) averaged over 50 runs.

        Trunk: 1301912844 ns
        Patch: 1338563600 ns

        This is a 2.8% overhead. When writing to disk I get the following numbers:

        Trunk: 5431308533 ns
        Patch: 5604898533 ns

        A 3.1% overhead. I was surprised by this as I thought that the overhead would be insignificant compared to disk IO.

        I altered the patch to special-case SequenceFile.Writer.append(Writable, Writable) and the times were the same as trunk (within 0.2% in either direction).

        So it seems to me that we can avoid any overhead by special-casing Writable. Besides SequenceFile.Writer, this would also need to be done in MapTask.MapOutputBuffer and ReduceTask.ValuesIterator. I think this can be done with minimal code duplication. It is obviously not as clean a solution as the current patch, but given the performance constraints and the general desire to get this issue fixed, I think it is the best way to proceed.

        Thoughts?
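The special-casing being proposed can be sketched as below. This is a hedged, self-contained illustration, not the actual Hadoop code: `AppendSketch` and its minimal `Writable` and `Serializer` interfaces are hypothetical stand-ins for SequenceFile.Writer and the pluggable serialization machinery. The idea is a fast path that bypasses generic dispatch when both key and value are Writables, so the common case costs the same as trunk.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class AppendSketch {
  interface Writable {
    void write(DataOutput out) throws IOException;
  }

  interface Serializer {
    void serialize(Object o, DataOutputStream out) throws IOException;
  }

  private final DataOutputStream out;
  private final Serializer fallback;

  AppendSketch(OutputStream raw, Serializer fallback) {
    this.out = new DataOutputStream(raw);
    this.fallback = fallback;
  }

  void append(Object key, Object value) throws IOException {
    if (key instanceof Writable && value instanceof Writable) {
      // Fast path: direct Writable calls, no per-record serializer dispatch.
      ((Writable) key).write(out);
      ((Writable) value).write(out);
    } else {
      // General path: hand off to the pluggable serializer.
      fallback.serialize(key, out);
      fallback.serialize(value, out);
    }
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    AppendSketch sink = new AppendSketch(buf, (o, s) -> s.writeUTF(o.toString()));
    Writable one = d -> d.writeInt(1);
    sink.append(one, one);           // takes the fast path: two 4-byte ints
    System.out.println(buf.size());  // prints 8
  }
}
```

The instanceof check is a single branch per record, which is why the duplication can stay small: both paths share the same stream and record framing.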

        Doug Cutting added a comment -

        I spoke with Tom over the weekend and he reported that, on micro benchmarks, the performance impact was only a few percent. It is counter-intuitive that this would balloon to 10% on a macro benchmark. The above results are not very consistent either: sort on 100 nodes was 20% faster with the patch, while randomwriter was 100% slower with the patch, both of which show little correlation with the 20 and 500 node benchmarks. It seems possible that, rather than a consistent slowdown, we're just seeing noise that should not be used to judge this patch.

        Tom White added a comment -

        I found an overhead: WritableSerializer was unnecessarily wrapping the OutputStream. So I've replaced

          public void open(OutputStream out) {
            dataOut = new DataOutputStream(out);
          }
        

        with

          public void open(OutputStream out) {
            if (out instanceof DataOutputStream) {
              // Already a DataOutputStream: use it directly rather than
              // wrapping it again, avoiding a delegation layer per write
              dataOut = (DataOutputStream) out;
            } else {
              dataOut = new DataOutputStream(out);
            }
          }
        

        and similarly for the deserializer.

        With this change (see v6 patch) the overhead is almost completely eliminated (running on Java6 on Linux):

        1867316142 ns - trunk
        1931475429 ns - patch v5, 3.4% overhead
        1876353143 ns - patch v6, 0.5% overhead

        I think this is now ready to be committed. I'll put it in the submit queue.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12376535/serializer-v6.patch
        against trunk revision 619744.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 12 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac -1. The applied patch generated 620 javac compiler warnings (more than the trunk's current 619 warnings).

        release audit -1. The applied patch generated 197 release audit warnings (more than the trunk's current 187 warnings).

        findbugs -1. The patch appears to introduce 3 new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests -1. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1846/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1846/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1846/artifact/trunk/build/test/checkstyle-errors.html
        Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1846/artifact/trunk/current/releaseAuditDiffWarnings.txt
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1846/console

        This message is automatically generated.

        Tom White added a comment -

        New patch to fix warnings.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12376655/serializer-v7.patch
        against trunk revision 619744.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 15 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1855/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1855/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1855/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1855/console

        This message is automatically generated.

        Tom White added a comment -

        I've just committed this.

        Hudson added a comment -

        Integrated in Hadoop-trunk #415 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/415/ )

          People

          • Assignee: Tom White
          • Reporter: Tom White
          • Votes: 0
          • Watchers: 6