Avro
  1. Avro
  2. AVRO-1353

Configurable Hadoop serialization in-memory representations

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.5
    • Fix Version/s: 1.7.5
    • Component/s: java
    • Labels:
      None

      Description

      As discussed on the Avro Users mailing list [1], it would be useful to allow configuration of the DatumReader/Writer implementations used by Hadoop Avro serialization, especially for non-Java JVM languages.

      [1] http://mail-archives.apache.org/mod_mbox/avro-user/201305.mbox/%3C87ehdh14qn.fsf%40zeno.atl.damballa%3E

      1. avro-1353.patch
        23 kB
        Marshall Bockrath-Vandegrift
      2. avro-1353-2.patch
        52 kB
        Marshall Bockrath-Vandegrift
      3. avro-1353-3.patch
        49 kB
        Marshall Bockrath-Vandegrift
      4. AVRO-1353.patch
        44 kB
        Doug Cutting

        Issue Links

          Activity

          Hide
          Marshall Bockrath-Vandegrift added a comment -

          Initial implementation. Suggestions appreciated.

          Show
          Marshall Bockrath-Vandegrift added a comment - Initial implementation. Suggestions appreciated.
          Hide
          Marshall Bockrath-Vandegrift added a comment -

          Apparently I don't know how Jira works. Actual patch, initial implementation, etc.

          Show
          Marshall Bockrath-Vandegrift added a comment - Apparently I don't know how Jira works. Actual patch, initial implementation, etc.
          Hide
          Marshall Bockrath-Vandegrift added a comment -

          I should have dogfooded before submitting a patch. The previous patch actually hit fewer than half of the necessary integration points. The new `-2` patch works for at least the test suite of small internal project which depends on this feature.

          Show
          Marshall Bockrath-Vandegrift added a comment - I should have dogfooded before submitting a patch. The previous patch actually hit fewer than half of the necessary integration points. The new `-2` patch works for at least the test suite of small internal project which depends on this feature.
          Hide
          Doug Cutting added a comment -

          In trunk (soon to be Avro 1.7.5) GenericData already serves the purpose of your DatumMapping interface. In particular, for every DatumReader/DatumWriter implementation there's a subclass of GenericData that implements createDatumReader(), createDatumWriter() and compare().

          Perhaps these methods would be better abstracted into an abstract base class named org.apache.hadoop.data.DataModel. (Abstract base classes with many methods are easier to evolve back-compatibly than interfaces, so I prefer them when there's more than one or two methods.) Additional methods that might make sense to add here are perhaps deepCopy() and hashCode(). Adding this might be done as a separate cleanup issue and isn't required to extend the MapReduce implementations as that can be done using GenericData in the meantime.

          Show
          Doug Cutting added a comment - In trunk (soon to be Avro 1.7.5) GenericData already serves the purpose of your DatumMapping interface. In particular, for every DatumReader/DatumWriter implementation there's a subclass of GenericData that implements createDatumReader(), createDatumWriter() and compare(). Perhaps these methods would be better abstracted into an abstract base class named org.apache.hadoop.data.DataModel. (Abstract base classes with many methods are easier to evolve back-compatibly than interfaces, so I prefer them when there's more than one or two methods.) Additional methods that might make sense to add here are perhaps deepCopy() and hashCode(). Adding this might be done as a separate cleanup issue and isn't required to extend the MapReduce implementations as that can be done using GenericData in the meantime.
          Hide
          Marshall Bockrath-Vandegrift added a comment -

          I noticed the trunk GenericData and my interface were similar, but there were a few issues which kept me creating a separate interface for this patch:

          (1) GenericData only has a single-schema version of createReader(), not the separate writer & reader schemas version needed. This could be fixed by adding another method to the class, or just calling setSchema() on the new instances as needed.

          (2) In order to match the existing implementation, under the reflect data model the DatumReader needs to use the class loader of the job configuration. The obvious solution was to make the class Configurable, but that means it needs to be a new Hadoop-related class and not just a class from the base Avro package. This could be solved by having a new e.g. HadoopReflectData (ConfiguredReflectData?) class, with some complications below.

          (3) GenericData seems to be an implementation detail, and tying the data model to the GenericData subclass seems to intertwine interface with implementation. For example, my current implementation for Clojure data structures doesn't include a sub-class of GenericData. For another example, each sub-class of GenericData needs to correctly override the creatDatumReader() etc methods to return instances of the correct classes; any existing subclasses which don't override the new methods will silently produce incorrect results at runtime. I think your latter proposal is the way to fix this long term. Short term – if you're okay with the initial implementation using GenericData directly, I'm not going to argue with getting a feature I need sooner, but I'm also not excited about changing code again later to implement a new interface.

          If the short-term fixes I mention in the above points seem acceptable, I'll work up another version of the patch.

          Show
          Marshall Bockrath-Vandegrift added a comment - I noticed the trunk GenericData and my interface were similar, but there were a few issues which kept me creating a separate interface for this patch: (1) GenericData only has a single-schema version of createReader(), not the separate writer & reader schemas version needed. This could be fixed by adding another method to the class, or just calling setSchema() on the new instances as needed. (2) In order to match the existing implementation, under the reflect data model the DatumReader needs to use the class loader of the job configuration. The obvious solution was to make the class Configurable, but that means it needs to be a new Hadoop-related class and not just a class from the base Avro package. This could be solved by having a new e.g. HadoopReflectData (ConfiguredReflectData?) class, with some complications below. (3) GenericData seems to be an implementation detail, and tying the data model to the GenericData subclass seems to intertwine interface with implementation. For example, my current implementation for Clojure data structures doesn't include a sub-class of GenericData. For another example, each sub-class of GenericData needs to correctly override the creatDatumReader() etc methods to return instances of the correct classes; any existing subclasses which don't override the new methods will silently produce incorrect results at runtime. I think your latter proposal is the way to fix this long term. Short term – if you're okay with the initial implementation using GenericData directly, I'm not going to argue with getting a feature I need sooner, but I'm also not excited about changing code again later to implement a new interface. If the short-term fixes I mention in the above points seem acceptable, I'll work up another version of the patch.
          Hide
          Doug Cutting added a comment -

          > the DatumReader needs to use the class loader of the job configuration

          SpecificData has a getClassLoader() method that's currently used for this. Generalizing it to a base class sounds like the thing to do. So this is another method that should migrate to the new abstract base class or to GenericData in the meantime.

          > if you're okay with the initial implementation using GenericData directly, I'm not going to argue with getting a feature I need sooner, but I'm also not excited about changing code again later to implement a new interface.

          When we add the abstract base class we should replace references to GenericData to instead be the new base class. Doing this back-compatibly may be difficult, so this might be forced to go into Avro 1.8. If you need your feature in a 1.7.x release then it's probably best to use GenericData as the base.

          Show
          Doug Cutting added a comment - > the DatumReader needs to use the class loader of the job configuration SpecificData has a getClassLoader() method that's currently used for this. Generalizing it to a base class sounds like the thing to do. So this is another method that should migrate to the new abstract base class or to GenericData in the meantime. > if you're okay with the initial implementation using GenericData directly, I'm not going to argue with getting a feature I need sooner, but I'm also not excited about changing code again later to implement a new interface. When we add the abstract base class we should replace references to GenericData to instead be the new base class. Doing this back-compatibly may be difficult, so this might be forced to go into Avro 1.8. If you need your feature in a 1.7.x release then it's probably best to use GenericData as the base.
          Hide
          Marshall Bockrath-Vandegrift added a comment -

          Reworking, but running into a few issues:

          > SpecificData has a getClassLoader() method that's currently used for this. Generalizing it to a base class sounds like the thing to do. So this is another method that should migrate to the new abstract base class or to GenericData in the meantime.

          I'm finding that I also need to add a setClassLoader() method, because the standard reflection utilities can only invoke a no-argument constructor. That means the classLoader member becomes non-final, which impacts thread-safety...

          How bad is it to tread deeper into the reflection mire, so as to invoke the GenericData (subclass) constructor with a ClassLoader argument? Alternatively, what about including an avro-mapred DataModelFactory interface, instances of which can be Configurable? Then serialization creates an instance of the factory, which returns instances of the GenericData subclass.

          Show
          Marshall Bockrath-Vandegrift added a comment - Reworking, but running into a few issues: > SpecificData has a getClassLoader() method that's currently used for this. Generalizing it to a base class sounds like the thing to do. So this is another method that should migrate to the new abstract base class or to GenericData in the meantime. I'm finding that I also need to add a setClassLoader() method, because the standard reflection utilities can only invoke a no-argument constructor. That means the classLoader member becomes non-final, which impacts thread-safety... How bad is it to tread deeper into the reflection mire, so as to invoke the GenericData (subclass) constructor with a ClassLoader argument? Alternatively, what about including an avro-mapred DataModelFactory interface, instances of which can be Configurable? Then serialization creates an instance of the factory, which returns instances of the GenericData subclass.
          Hide
          Marshall Bockrath-Vandegrift added a comment -

          The reflection mire turned out to be more of a mud puddle. Patch using GenericData/-subclasses attached.

          Show
          Marshall Bockrath-Vandegrift added a comment - The reflection mire turned out to be more of a mud puddle. Patch using GenericData/-subclasses attached.
          Hide
          Doug Cutting added a comment -

          AVRO-1357 has been filed with similar goals. So it seems that demand is high for such a facility. This is probably too late for 1.7.5, so maybe we should aim for a solution for 1.8.0, where we can make incompatible API changes.

          Show
          Doug Cutting added a comment - AVRO-1357 has been filed with similar goals. So it seems that demand is high for such a facility. This is probably too late for 1.7.5, so maybe we should aim for a solution for 1.8.0, where we can make incompatible API changes.
          Hide
          Doug Cutting added a comment -

          I wrote the above comment before seeing the latest patch. It looks good! Now I'm inclined to commit this now, for 1.7.5. Anyone object?

          Show
          Doug Cutting added a comment - I wrote the above comment before seeing the latest patch. It looks good! Now I'm inclined to commit this now, for 1.7.5. Anyone object?
          Hide
          Doug Cutting added a comment -

          AVRO-1356 conflicted with the previous patch. Here's an updated patch that resolves the conflict.

          I'll commit this soon.

          Show
          Doug Cutting added a comment - AVRO-1356 conflicted with the previous patch. Here's an updated patch that resolves the conflict. I'll commit this soon.
          Hide
          ASF subversion and git services added a comment -

          Commit 1511535 from Doug Cutting in branch 'avro/trunk'
          [ https://svn.apache.org/r1511535 ]

          AVRO-1353. Java: Permit specification of data model (generic, specific, reflect, or other) in mapreduce job configuration. Contributed by Marshall Bockrath-Vandegrift.

          Show
          ASF subversion and git services added a comment - Commit 1511535 from Doug Cutting in branch 'avro/trunk' [ https://svn.apache.org/r1511535 ] AVRO-1353 . Java: Permit specification of data model (generic, specific, reflect, or other) in mapreduce job configuration. Contributed by Marshall Bockrath-Vandegrift.
          Hide
          Doug Cutting added a comment -

          I committed this. Thanks, Marshall!

          Show
          Doug Cutting added a comment - I committed this. Thanks, Marshall!
          Hide
          Hudson added a comment -

          SUCCESS: Integrated in AvroJava #390 (See https://builds.apache.org/job/AvroJava/390/)
          AVRO-1353. Java: Permit specification of data model (generic, specific, reflect, or other) in mapreduce job configuration. Contributed by Marshall Bockrath-Vandegrift. (cutting: rev 1511535)

          • /avro/trunk/CHANGES.txt
          • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java
          • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java
          • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroDeserializer.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroKeyComparator.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroKeyDeserializer.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerializer.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroValueDeserializer.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroJob.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroOutputFormat.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroRecordReader.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroSerialization.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroJob.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroKeyOutputFormat.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroKeyRecordWriter.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroKeyValueOutputFormat.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroKeyValueRecordWriter.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroRecordReaderBase.java
          • /avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/hadoop/io/TestAvroSerialization.java
          • /avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestAvroKeyOutputFormat.java
          • /avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestAvroKeyRecordWriter.java
          • /avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestAvroKeyValueRecordWriter.java
          Show
          Hudson added a comment - SUCCESS: Integrated in AvroJava #390 (See https://builds.apache.org/job/AvroJava/390/ ) AVRO-1353 . Java: Permit specification of data model (generic, specific, reflect, or other) in mapreduce job configuration. Contributed by Marshall Bockrath-Vandegrift. (cutting: rev 1511535) /avro/trunk/CHANGES.txt /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroDeserializer.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroKeyComparator.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroKeyDeserializer.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerializer.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroValueDeserializer.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroJob.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroOutputFormat.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroRecordReader.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroSerialization.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroJob.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroKeyOutputFormat.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroKeyRecordWriter.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroKeyValueOutputFormat.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroKeyValueRecordWriter.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroRecordReaderBase.java /avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/hadoop/io/TestAvroSerialization.java /avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestAvroKeyOutputFormat.java /avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestAvroKeyRecordWriter.java /avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestAvroKeyValueRecordWriter.java

            People

            • Assignee:
              Marshall Bockrath-Vandegrift
              Reporter:
              Marshall Bockrath-Vandegrift
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development