Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Thrift (http://incubator.apache.org/thrift/) is a cross-language serialization and RPC framework. This issue is to write a ThriftSerialization to support using Thrift types in MapReduce programs, including an example program. This should probably go into contrib.

      (There is a prototype implementation in https://issues.apache.org/jira/secure/attachment/12370464/hadoop-serializer-v2.tar.gz)
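      As a rough orientation only (a minimal sketch, not the prototype tarball and not any attached patch), a Serialization for Thrift types plugs into the org.apache.hadoop.io.serializer framework along these lines. It assumes Thrift-generated classes implement TBase, frames records with plain TBinaryProtocol over the raw streams, and nests the serializer and deserializer inside the Serialization class:

        package org.apache.hadoop.io.serializer;

        import java.io.IOException;
        import java.io.InputStream;
        import java.io.OutputStream;

        import org.apache.thrift.TBase;
        import org.apache.thrift.TException;
        import org.apache.thrift.protocol.TBinaryProtocol;
        import org.apache.thrift.protocol.TProtocol;
        import org.apache.thrift.transport.TIOStreamTransport;

        // Minimal sketch of a Thrift Serialization; not the attached patch.
        public class ThriftSerialization<T extends TBase> implements Serialization<T> {

          public boolean accept(Class<?> c) {
            return TBase.class.isAssignableFrom(c);
          }

          public Serializer<T> getSerializer(Class<T> c) {
            return new ThriftSerializer();
          }

          public Deserializer<T> getDeserializer(Class<T> c) {
            return new ThriftDeserializer(c);
          }

          class ThriftSerializer implements Serializer<T> {
            private OutputStream out;
            private TProtocol protocol;

            public void open(OutputStream out) {
              this.out = out;
              this.protocol = new TBinaryProtocol(new TIOStreamTransport(out));
            }

            public void serialize(T t) throws IOException {
              try {
                t.write(protocol);
              } catch (TException e) {
                throw new IOException(e.toString());
              }
            }

            public void close() throws IOException {
              if (out != null) {
                out.close();
              }
            }
          }

          class ThriftDeserializer implements Deserializer<T> {
            private final Class<T> tClass;
            private InputStream in;
            private TProtocol protocol;

            ThriftDeserializer(Class<T> tClass) {
              this.tClass = tClass;
            }

            public void open(InputStream in) {
              this.in = in;
              this.protocol = new TBinaryProtocol(new TIOStreamTransport(in));
            }

            public T deserialize(T t) throws IOException {
              // Reuses t when it is non-null; see the clear() discussion in the comments below.
              T object = (t == null) ? newInstance() : t;
              try {
                object.read(protocol);
              } catch (TException e) {
                throw new IOException(e.toString());
              }
              return object;
            }

            private T newInstance() throws IOException {
              try {
                return tClass.newInstance();
              } catch (Exception e) {
                throw new IOException(e.toString());
              }
            }

            public void close() throws IOException {
              if (in != null) {
                in.close();
              }
            }
          }
        }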

      Attachments

      1. ThriftSerialization.java
        2 kB
        Brian Bloniarz
      2. libthrift.jar
        57 kB
        Tom White
      3. hadoop-3787.patch
        32 kB
        Tom White

          Activity

          Brian Bloniarz added a comment -

          As the previous patch has a bug and is based on an old version of Thrift, I'm attaching a Serialization implementation which works better.

          This contains only a standalone org.apache.hadoop.io.serializer.Serialization implementation (the original patch has testcases etc). Tested against Thrift 0.7 and Hadoop 0.20.1.

          Mark Van De vyver added a comment -

          I've seen several list discussions/mentions of Thrift and serialization, but have not yet seen extprot mentioned. It is recent, but relevant to this issue (and in its own right):

          http://eigenclass.org/R2/writings/extprot-extensible-protocols-intro

          and

          http://eigenclass.org/R2/writings/protocol-extension-with-extprot

          <quote>
          At this point, you'll be thinking, "what, yet another Protocol Buffers/Thrift/ASN.1 DER/XDR/IIOP/IFF?"... Not quite: extprot differentiates itself in that it allows for more extensions and supports richer types (mainly tuples and disjoint union types aka. sum types) than Protocol Buffers or Thrift without approaching the complexity of ASN.1 DER. (Note that XDR does not define self-describing protocols, making protocol changes hard at best.)
          </quote>

          HTH?

          Pete Wyckoff added a comment -

          any update on this?

          thanks, pete

          Doug Cutting added a comment -

          Do we expect different serialization implementations to share code? If not, this should probably be in contrib/thrift-serialization. Then we'd build a separate jar for each serialization implementation, which seems appropriate.

          Pete Wyckoff added a comment -

          One minor nitpick: this implements the Serialization interfaces in 3 separate files, whereas every other Serialization implementation does it in one file, with the serializer/deserializer as public static classes in the Serialization class.

          Pete Wyckoff added a comment -

          The simplest way to fix this is to always create a new object, but that won't work well until HADOOP-1230 is done.

          You could also use a container object like in HADOOP-4065? Or require that all the Thrift fields have the required attribute, or at least add a comment?

          For this and RecordSerialization (HADOOP-4199), there's also the issue that they both use the binary format by default, whereas Thrift and Record I/O support multiple formats. If Thrift eventually implements a compact binary format, this will be even more important since people will have both.

          The other thing is that Hive has something called TCTLSeparatedProtocol, which implements the Thrift Protocol interface and allows Thrift to parse simple text files with ctrl-character separators. For us, we definitely have data in both binary and ctrl-separated form, so we would need a way to configure this.

          But I think those are add-ons and you could submit this?

          Also, can someone create a category for contrib/serialization?
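          As a hypothetical illustration of that configuration point (not part of the patch; the key name below is made up), the serialization could build its TProtocol from a TProtocolFactory named in the job configuration, defaulting to the binary protocol:

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.util.ReflectionUtils;
            import org.apache.thrift.protocol.TBinaryProtocol;
            import org.apache.thrift.protocol.TProtocol;
            import org.apache.thrift.protocol.TProtocolFactory;
            import org.apache.thrift.transport.TTransport;

            // Hypothetical helper: pick the Thrift protocol per job instead of
            // hard-coding the binary one. The configuration key is illustrative only.
            public class ThriftProtocols {
              public static final String PROTOCOL_FACTORY_KEY =
                  "thrift.serialization.protocol.factory";

              public static TProtocol newProtocol(Configuration conf, TTransport transport) {
                Class<? extends TProtocolFactory> factoryClass = conf.getClass(
                    PROTOCOL_FACTORY_KEY, TBinaryProtocol.Factory.class, TProtocolFactory.class);
                TProtocolFactory factory = ReflectionUtils.newInstance(factoryClass, conf);
                return factory.getProtocol(transport);
              }
            }

          A compact binary factory (once Thrift provides one) or a text protocol such as Hive's TCTLSeparatedProtocol could then be selected per job without changing the serialization classes.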

          Tom White added a comment -

          Pete, I think you're right about objects not being cleared out on reads. So optional fields that aren't set in later reads will retain their values from earlier reads. The simplest way to fix this is to always create a new object, but that won't work well until HADOOP-1230 is done. The alternative is to patch Thrift to give generated classes a clear() method.
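          A minimal sketch of the first option (this just mirrors the deserialize() method quoted further down the thread, minus the object reuse, and relies on its newInstance() helper): ignore the passed-in object so stale optional fields can never carry over, and rely on callers using the returned instance:

            public T deserialize(T t) throws IOException {
              // Always construct a fresh object and ignore the reused instance,
              // so optional fields from an earlier read cannot leak into this one.
              // Only helps once callers use the returned object (HADOOP-1230).
              T object = newInstance();
              try {
                object.read(protocol);
              } catch (TException e) {
                throw new IOException(e.toString());
              }
              return object;
            }

          The clear() alternative would keep the reuse pattern, but requires patching the Thrift compiler so generated classes can reset themselves before read().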

          Pete Wyckoff added a comment -

          I think I was looking at this the wrong way - this patch looks right.

          I was able to use this code with HADOOP-4065's flat-file deserializer-based record reader, and read and wrote Thrift records just fine.
          +1

          Pete Wyckoff added a comment -

          To make a generic ThriftDeserializer<TBase>, I think one needs HADOOP-4192.

          Pete Wyckoff added a comment -

          To make the ThriftDeserializer support a generic TBase Deserializer and have it get the actual class name from a config file or as a parameter, one needs the Deserializer to return the real thrift class name.
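          A rough sketch of that idea (hypothetical; the configuration key and the getRealClass() accessor are made up for illustration and are not taken from any patch): the deserializer reads the concrete Thrift class from the job configuration and exposes it to callers:

            import java.io.IOException;
            import java.io.InputStream;

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.io.serializer.Deserializer;
            import org.apache.hadoop.util.ReflectionUtils;
            import org.apache.thrift.TBase;
            import org.apache.thrift.TException;
            import org.apache.thrift.protocol.TBinaryProtocol;
            import org.apache.thrift.protocol.TProtocol;
            import org.apache.thrift.transport.TIOStreamTransport;

            // Hypothetical sketch: a TBase-generic deserializer that learns the concrete
            // Thrift class from the job configuration. Key name and getRealClass() are
            // illustrative only.
            public class GenericThriftDeserializer implements Deserializer<TBase> {
              public static final String THRIFT_CLASS_KEY = "thrift.deserializer.class";

              private final Class<? extends TBase> realClass;
              private final Configuration conf;
              private TProtocol protocol;

              public GenericThriftDeserializer(Configuration conf) {
                this.conf = conf;
                this.realClass = conf.getClass(THRIFT_CLASS_KEY, null, TBase.class);
              }

              /** The concrete Thrift class this deserializer produces. */
              public Class<? extends TBase> getRealClass() {
                return realClass;
              }

              public void open(InputStream in) {
                protocol = new TBinaryProtocol(new TIOStreamTransport(in));
              }

              public TBase deserialize(TBase t) throws IOException {
                // Fall back to constructing the configured class when no object is reused.
                TBase object = (t == null) ? ReflectionUtils.newInstance(realClass, conf) : t;
                try {
                  object.read(protocol);
                } catch (TException e) {
                  throw new IOException(e.toString());
                }
                return object;
              }

              public void close() throws IOException {
                // nothing to do; the caller owns the stream
              }
            }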

          Pete Wyckoff added a comment -

          -1 on this part, as you don't clear the object before deserializing into it. Since there's no clear() for a Thrift object now, you would have to return a new object every time, so the code should always ignore what's passed in. Given HADOOP-1230, this won't currently work, because line 75 of SequenceFileRecordReader:

          > boolean remaining = (in.next(key) != null);

          throws out the return value of SequenceFile.next, which is the result of deserialize(obj).

            public T deserialize(T t) throws IOException {
              // Reuses the passed-in object when it is non-null, so fields set by an
              // earlier read are never cleared before read() merges new values into it.
              T object = (t == null ? newInstance() : t);
              try {
                object.read(protocol);
              } catch (TException e) {
                throw new IOException(e.toString());
              }
              return object;
            }
          
          Pete Wyckoff added a comment -

          I asked about my last comment regarding more parameterized serializers on core-user.

          Pete Wyckoff added a comment -

          This looks like a good addition. But there's another use case where one does not know a priori the Thrift class (or, I would imagine, the same problem for Record I/O) that should be used for deserializing/serializing (let's not assume sequence files and/or what to do about legacy data). This is what the Hive case looks like. We may even have a Thrift serde that takes its DDL at runtime. Registering all of these doesn't seem very scalable. And even for sequence files, a serializer that needs the DDL at runtime wouldn't work.

          It seems one needs some kind of metainformation beyond the key or value classes that can be stored "somewhere" and then be used to instantiate the serializer/deserializer to make this use case work. Otherwise, one is stuck using BytesWritable and then having the application logic figure out how to instantiate the real serializer/deserializer. Somewhat more like what was proposed in: https://issues.apache.org/jira/browse/HADOOP-2429

          Joydeep Sen Sarma added a comment -

          filed https://issues.apache.org/jira/browse/HADOOP-4065

          Joydeep Sen Sarma added a comment -

          @Tom - the other thing is that SequenceFiles are self-describing, so the createKey()/createValue() methods are trivial. For flat binary files, the record type that's serialized is not implicit in the file and has to come from outside configuration.

          In Hive we have configuration per input path (or path prefix, really) that indicates the same information that's embedded inside the SequenceFile header. I am not sure whether we want to have this kind of information as part of hadoop-core.

          I will open a separate JIRA for binary flat files (opening a corresponding one for Hive as well, since this is one of the first requests we got).

          Pete Wyckoff added a comment -

          <quote>flat files with a suitable InputFormat (which would need to be written)</quote>

          See https://issues.apache.org/jira/browse/THRIFT-111. It provides a file format with fixed-size blocks that can be compressed, and can be written from Python, Perl, C++ and Java. This allows a logging framework to write directly to a format that is splittable. The records themselves need not be Thrift.

          Tom White added a comment -

          This, and HADOOP-1986 in general, does not mandate the use of SequenceFile. However, SequenceFiles are a convenient binary format, so that's what I've used here for the example.

          It would be possible to run MapReduce against Thrift records in flat files with a suitable InputFormat (which would need to be written), but such files would not be splittable (unless there is some general way to find Thrift record boundaries from an arbitrary position in the file). Unsplittable files do not in general play well with MapReduce and HDFS. Perhaps one way to fix this is to insert a special Thrift record every n records whose unique byte sequence can be scanned for to realign with the record boundaries. Could this work?
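          To picture that idea concretely, here is a sketch only: the marker bytes and interval are made up, it uses a raw byte marker rather than an actual Thrift record, and nothing like this exists in the patch. The writer emits a fixed sync marker every n records, and a reader dropped into the middle of a split scans forward to the next marker before resuming deserialization:

            import java.io.EOFException;
            import java.io.IOException;
            import java.io.InputStream;
            import java.io.OutputStream;
            import java.util.Arrays;

            // Sketch of the sync-marker idea; marker bytes and interval are illustrative.
            public class SyncMarker {
              static final byte[] MARKER = {
                  (byte) 0xC3, 0x51, 0x7A, (byte) 0x9E, 0x12, (byte) 0xB0, 0x44, 0x08,
                  (byte) 0xDD, 0x67, 0x2F, (byte) 0x81, 0x5C, 0x3A, (byte) 0xF4, 0x19 };
              static final int SYNC_INTERVAL = 100; // records between markers

              /** Writer side: call before each record; emits the marker every SYNC_INTERVAL records. */
              public static void maybeWriteSync(OutputStream out, long recordsWritten)
                  throws IOException {
                if (recordsWritten % SYNC_INTERVAL == 0) {
                  out.write(MARKER);
                }
              }

              /** Reader side: consume bytes until a full marker has been seen, then return. */
              public static void skipToNextSync(InputStream in) throws IOException {
                byte[] window = new byte[MARKER.length];
                int filled = 0;
                while (true) {
                  int b = in.read();
                  if (b == -1) {
                    throw new EOFException("no sync marker before end of stream");
                  }
                  // Slide the window of the most recent bytes left by one and append the new byte.
                  System.arraycopy(window, 1, window, 0, window.length - 1);
                  window[window.length - 1] = (byte) b;
                  if (filled < window.length) {
                    filled++;
                  }
                  if (filled == window.length && Arrays.equals(window, MARKER)) {
                    return;
                  }
                }
              }
            }

          SequenceFile handles resynchronization in essentially this way with a per-file sync marker, which is one reason it plays well with splits.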

          Joydeep Sen Sarma added a comment -

          If I understand this and HADOOP-1986 right, this all works in the context of SequenceFiles, correct?

          There may be lots of use cases for serialized records in simple flat files (we are getting some requests for Thrift-serialized records in flat files), and I was wondering what I can leverage to handle this case.

          Tom White added a comment -

          A patch for a ThriftSerialization class, including an example (in the form of a test). This uses Thrift release 20080411p1 since the incubator has not produced a release yet.


            People

            • Assignee: Unassigned
            • Reporter: Tom White
            • Votes: 3
            • Watchers: 21
