Hadoop Common
  1. Hadoop Common
  2. HADOOP-6323

Serialization should provide comparators

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: io
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The Serialization interface should permit one to create raw comparators.

      1. HADOOP-6323.2.patch
        14 kB
        Aaron Kimball
      2. HADOOP-6323.3.patch
        14 kB
        Aaron Kimball
      3. HADOOP-6323.4.patch
        14 kB
        Aaron Kimball
      4. HADOOP-6323.5.patch
        19 kB
        Aaron Kimball
      5. HADOOP-6323.6.patch
        19 kB
        Aaron Kimball
      6. HADOOP-6323.7.patch
        22 kB
        Aaron Kimball
      7. HADOOP-6323.8.patch
        22 kB
        Aaron Kimball
      8. HADOOP-6323.9.patch
        26 kB
        Aaron Kimball
      9. HADOOP-6323.patch
        14 kB
        Aaron Kimball

        Issue Links

          Activity

          Hide
          Doug Cutting added a comment -

          We should add a method to SerializationBase like:

          • RawComparator getRawComparator(Map<String,String> metadata);
          Show
          Doug Cutting added a comment - We should add a method to SerializationBase like: RawComparator getRawComparator(Map<String,String> metadata);
          Hide
          Hong Tang added a comment -

          What is the intended use of the metadata parameter?

          Show
          Hong Tang added a comment - What is the intended use of the metadata parameter?
          Hide
          Doug Cutting added a comment -

          > What is the intended use of the metadata parameter?

          In the current Serialization API, the container is required to maintain metadata. For Writable, the Java class name is in the metadata. For Avro, the schema is in the metadata. When a container wishes to create a Serializer or Deserializer, it must provide the metadata. This was added in HADOOP-6165.

          The proposal here is to also treat raw comparators this way, so that when a container wishes to create a raw comparator, it asks the Serialization, passing in the metadata that describes the serialized data. Does that make sense?

          Show
          Doug Cutting added a comment - > What is the intended use of the metadata parameter? In the current Serialization API, the container is required to maintain metadata. For Writable, the Java class name is in the metadata. For Avro, the schema is in the metadata. When a container wishes to create a Serializer or Deserializer, it must provide the metadata. This was added in HADOOP-6165 . The proposal here is to also treat raw comparators this way, so that when a container wishes to create a raw comparator, it asks the Serialization, passing in the metadata that describes the serialized data. Does that make sense?
          Hide
          Hong Tang added a comment -

          I see. One concern I have with this solution is that it does not regulate what key should be used in the metadata to specify the schema string. This has to be in the public contract between the caller and the Serialization framework implementations.

          Show
          Hong Tang added a comment - I see. One concern I have with this solution is that it does not regulate what key should be used in the metadata to specify the schema string. This has to be in the public contract between the caller and the Serialization framework implementations.
          Hide
          Doug Cutting added a comment -

          > what key should be used in the metadata to specify the schema string.

          This part was already implemented in HADOOP-6165. Currently the serialization system is named by the "Serialization-Class" metadata key. The currently supported values for this key are: "org.apache.hadoop.io.serializer.WritableSerialzation", "org.apache.hadoop.io.serializer.JavaSerialzation", and "org.apache.hadoop.io.serializer.avro.AvroSerialzation". Writable and Java serialization implementations use the key "Serialized-Class" to the name the serialized Java class, and Avro's serialization implementation uses the key "Avro-Schema" for the schema. These are currently stored by SequenceFile's metadata.

          So the proposal here adds no new metadata keys or values. It simply adds a new method to the existing Serialization API, extending it to also support the creation of raw comparators.

          Show
          Doug Cutting added a comment - > what key should be used in the metadata to specify the schema string. This part was already implemented in HADOOP-6165 . Currently the serialization system is named by the "Serialization-Class" metadata key. The currently supported values for this key are: "org.apache.hadoop.io.serializer.WritableSerialzation", "org.apache.hadoop.io.serializer.JavaSerialzation", and "org.apache.hadoop.io.serializer.avro.AvroSerialzation". Writable and Java serialization implementations use the key "Serialized-Class" to the name the serialized Java class, and Avro's serialization implementation uses the key "Avro-Schema" for the schema. These are currently stored by SequenceFile's metadata. So the proposal here adds no new metadata keys or values. It simply adds a new method to the existing Serialization API, extending it to also support the creation of raw comparators.
          Hide
          Hong Tang added a comment -

          Makes sense.

          On a side note, I wonder if we could extend this pattern to support raw comparator creation in general. Currently, both TFile and mapreduce creates raw comparators using comparator class names. In mapreduce's case, the comparators are created through ReflectionUtil, so in some way, you can pass more information through the Configuration object.

          Show
          Hong Tang added a comment - Makes sense. On a side note, I wonder if we could extend this pattern to support raw comparator creation in general. Currently, both TFile and mapreduce creates raw comparators using comparator class names. In mapreduce's case, the comparators are created through ReflectionUtil, so in some way, you can pass more information through the Configuration object.
          Hide
          Jacob Rideout added a comment -

          What needs to be done to implement this API? I'd be willing to take a stab at it if someone could point me in the right direction.

          Show
          Jacob Rideout added a comment - What needs to be done to implement this API? I'd be willing to take a stab at it if someone could point me in the right direction.
          Hide
          Aaron Kimball added a comment -

          Looking at this issue, it seems that the API needs to be extended with the signature mentioned in the first comment; then the various implementations (there are three listed above) need to be updated to match it.

          WritableSerialization and JavaSerialization look pretty straightforward at first glance, as their corresponding RawComparator implementations (WritableComparator and JavaSerializationComparator) already themselves have factories.

          AvroSerialization I think is a bit more complicated – the schema needs to be read in from the metadata, then a comparator instantiated over this metadata. I'm not as familiar with the avro classes involved, so I'd need to keep digging to figure out where exactly this all lies. Doug?

          Jacob, I was actually looking at trying to get this working this week myself – I finally freed up some room in my schedule and I figured a crash course in Avro was overdue. Let me know if you feel strongly about tackling it yourself though.

          Show
          Aaron Kimball added a comment - Looking at this issue, it seems that the API needs to be extended with the signature mentioned in the first comment; then the various implementations (there are three listed above) need to be updated to match it. WritableSerialization and JavaSerialization look pretty straightforward at first glance, as their corresponding RawComparator implementations (WritableComparator and JavaSerializationComparator) already themselves have factories. AvroSerialization I think is a bit more complicated – the schema needs to be read in from the metadata, then a comparator instantiated over this metadata. I'm not as familiar with the avro classes involved, so I'd need to keep digging to figure out where exactly this all lies. Doug? Jacob, I was actually looking at trying to get this working this week myself – I finally freed up some room in my schedule and I figured a crash course in Avro was overdue. Let me know if you feel strongly about tackling it yourself though.
          Hide
          Doug Cutting added a comment -

          > What needs to be done to implement this API?

          The abstract method needs to be added to SerializationBase, then it needs t o be implemented. For Writable this can use WritableComparator. For Avro, it can be implemented with BinaryData#compare().

          Show
          Doug Cutting added a comment - > What needs to be done to implement this API? The abstract method needs to be added to SerializationBase, then it needs t o be implemented. For Writable this can use WritableComparator. For Avro, it can be implemented with BinaryData#compare().
          Hide
          Aaron Kimball added a comment -

          I just realized that JavaSerialization doesn't actually extend SerializationBase, it implements the deprecated Serialization interface. Any thoughts as to whether this should be brought up to speed as part of this? Or is JavaSerialization effectively ignorable?

          Show
          Aaron Kimball added a comment - I just realized that JavaSerialization doesn't actually extend SerializationBase , it implements the deprecated Serialization interface. Any thoughts as to whether this should be brought up to speed as part of this? Or is JavaSerialization effectively ignorable?
          Hide
          Tom White added a comment -

          The reason for this was as a test that older Serializations still work with the new API. JavaSerialization is an experimental serialization, so I would leave it as-is.

          Show
          Tom White added a comment - The reason for this was as a test that older Serializations still work with the new API. JavaSerialization is an experimental serialization, so I would leave it as-is.
          Hide
          Aaron Kimball added a comment -

          Here's a patch that takes a first pass at implementing this. This implements the API in WritableSerialization and AvroSerialization.

          • LegacySerialization just returns null
          • AvroSerialization always uses the schema in metadata[AVRO_SCHEMA_KEY]. There are no special provisions for AvroSpecificSerialization – should there be any?
          • Note that BinaryData.compare() does not take l1 and l2 parameters, so those arguments to RawComparator.compare() are ignored – I guess this is ok, since Avro knows the max length it should read via the schema?
          Show
          Aaron Kimball added a comment - Here's a patch that takes a first pass at implementing this. This implements the API in WritableSerialization and AvroSerialization. LegacySerialization just returns null AvroSerialization always uses the schema in metadata [AVRO_SCHEMA_KEY] . There are no special provisions for AvroSpecificSerialization – should there be any? Note that BinaryData.compare() does not take l1 and l2 parameters, so those arguments to RawComparator.compare() are ignored – I guess this is ok, since Avro knows the max length it should read via the schema?
          Hide
          Aaron Kimball added a comment -

          also, you'll need to run:

          svn add src/test/core/org/apache/hadoop/io/serializer/TestRawComparators.java
          svn add src/java/org/apache/hadoop/io/serializer/avro/AvroComparator.java
          
          Show
          Aaron Kimball added a comment - also, you'll need to run: svn add src/test/core/org/apache/hadoop/io/serializer/TestRawComparators.java svn add src/java/org/apache/hadoop/io/serializer/avro/AvroComparator.java
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427290/HADOOP-6323.patch
          against trunk revision 887472.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 173 javac compiler warnings (more than the trunk's current 172 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/174/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/174/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/174/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/174/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427290/HADOOP-6323.patch against trunk revision 887472. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 173 javac compiler warnings (more than the trunk's current 172 warnings). +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/174/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/174/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/174/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/174/console This message is automatically generated.
          Hide
          Aaron Kimball added a comment -

          Fix javac warning

          Show
          Aaron Kimball added a comment - Fix javac warning
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427298/HADOOP-6323.2.patch
          against trunk revision 887472.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/175/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/175/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/175/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/175/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427298/HADOOP-6323.2.patch against trunk revision 887472. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/175/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/175/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/175/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/175/console This message is automatically generated.
          Hide
          Doug Cutting added a comment -

          I'd prefer we throw an exception rather than return null when a comparator is not available. This will permit code that requires comparable objects to use this API without having to worry about NullPointerExceptions or performing extra checks.

          Show
          Doug Cutting added a comment - I'd prefer we throw an exception rather than return null when a comparator is not available. This will permit code that requires comparable objects to use this API without having to worry about NullPointerExceptions or performing extra checks.
          Hide
          Aaron Kimball added a comment -

          New patch; throws UnsupportedOperationException instead of returning null.

          Show
          Aaron Kimball added a comment - New patch; throws UnsupportedOperationException instead of returning null.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427387/HADOOP-6323.3.patch
          against trunk revision 888565.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/176/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/176/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/176/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/176/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427387/HADOOP-6323.3.patch against trunk revision 888565. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/176/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/176/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/176/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/176/console This message is automatically generated.
          Hide
          Aaron Kimball added a comment -

          new patch; removes redundant log4j logging when exceptions are thrown.

          Show
          Aaron Kimball added a comment - new patch; removes redundant log4j logging when exceptions are thrown.
          Hide
          Doug Cutting added a comment -

          +1 This looks good to me. I'll commit tomorrow unless others express concerns.

          Show
          Doug Cutting added a comment - +1 This looks good to me. I'll commit tomorrow unless others express concerns.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427392/HADOOP-6323.4.patch
          against trunk revision 888565.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/177/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/177/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/177/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/177/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427392/HADOOP-6323.4.patch against trunk revision 888565. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/177/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/177/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/177/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/177/console This message is automatically generated.
          Hide
          Aaron Kimball added a comment -

          In the process of writing MAPREDUCE-1126 it has become clear that the SerializerBase instances themselves are responsible for validating that the data they are given to serialize matches their metadata, rather than relying on "client side" checks that they are passing in values of the expected classes.

          WritableSerialization now checks that the serialization class matches the objects it is given. Added a unit test case for this.

          This happens implicitly in AvroSerialization based on the schema walking. It was noted that AvroSerialization reparses the schema for each object it serializes; this is now memoized in the AvroSerializer instance. One AvroSerializer should only be used for a single schema.

          Show
          Aaron Kimball added a comment - In the process of writing MAPREDUCE-1126 it has become clear that the SerializerBase instances themselves are responsible for validating that the data they are given to serialize matches their metadata, rather than relying on "client side" checks that they are passing in values of the expected classes. WritableSerialization now checks that the serialization class matches the objects it is given. Added a unit test case for this. This happens implicitly in AvroSerialization based on the schema walking. It was noted that AvroSerialization reparses the schema for each object it serializes; this is now memoized in the AvroSerializer instance. One AvroSerializer should only be used for a single schema.
          Hide
          Tom White added a comment -

          WritableSerialization now checks that the serialization class matches the objects it is given.

          The extra check is called for every key and value that goes through the shuffle. We should be sure that this doesn't add any overhead by running a benchmark.

          Show
          Tom White added a comment - WritableSerialization now checks that the serialization class matches the objects it is given. The extra check is called for every key and value that goes through the shuffle. We should be sure that this doesn't add any overhead by running a benchmark.
          Hide
          Doug Cutting added a comment -
          • WritableSerialization should just throw an exception if the class is not found, no? The contract of Writable is to serialize instances of a given class whose class name is not written with each instance.
          • AvroSerialization should just create the schema in the ctor, rather than lazily later.
          Show
          Doug Cutting added a comment - WritableSerialization should just throw an exception if the class is not found, no? The contract of Writable is to serialize instances of a given class whose class name is not written with each instance. AvroSerialization should just create the schema in the ctor, rather than lazily later.
          Hide
          Aaron Kimball added a comment -

          @Tom this isn't a new check – MapOutputBuffer.collect() already did this exact same check. But for MapOutputBuffer to use a generic serialization API that isn't necessarily class-based, this check is just being pushed down into the SerializerBase.serialize() method.

          @Doug - Sure, I'll have WritableSerialization throw. AvroSerialization can't create the schema in the c'tor; the AvroSerialization.getSchema() method takes in the metadata as well as an example instance being serialized. That is not yet available when the AvroSerializer is being constructed. I don't see how AvroReflectSerialization would work without this.

          Show
          Aaron Kimball added a comment - @Tom this isn't a new check – MapOutputBuffer.collect() already did this exact same check. But for MapOutputBuffer to use a generic serialization API that isn't necessarily class-based, this check is just being pushed down into the SerializerBase.serialize() method. @Doug - Sure, I'll have WritableSerialization throw. AvroSerialization can't create the schema in the c'tor; the AvroSerialization.getSchema() method takes in the metadata as well as an example instance being serialized. That is not yet available when the AvroSerializer is being constructed. I don't see how AvroReflectSerialization would work without this.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427531/HADOOP-6323.5.patch
          against trunk revision 888998.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/184/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/184/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/184/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/184/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427531/HADOOP-6323.5.patch against trunk revision 888998. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/184/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/184/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/184/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/184/console This message is automatically generated.
          Hide
          Aaron Kimball added a comment -

          New patch with runtimeexception in WritableSerialization.WritableSerializer constructor

          Show
          Aaron Kimball added a comment - New patch with runtimeexception in WritableSerialization.WritableSerializer constructor
          Hide
          Doug Cutting added a comment -

          > AvroSerialization.getSchema() method takes in the metadata as well as an example instance being serialized.

          I'm not sure that's required. The class name should be in the metadata. Right, Tom?

          Also, then AvroSerializer wouldn't need the a 'metadata' field.

          Show
          Doug Cutting added a comment - > AvroSerialization.getSchema() method takes in the metadata as well as an example instance being serialized. I'm not sure that's required. The class name should be in the metadata. Right, Tom? Also, then AvroSerializer wouldn't need the a 'metadata' field.
          Hide
          Aaron Kimball added a comment -

          The metadata field is required in all SerializerBase implementations due to this method in the SerializerBase signature:

          public abstract Map<String, String> getMetadata() throws IOException;
          

          You're right though, AvroGenericSerialization will only accept() if CLASS_KEY is set and it's assignable to AvroReflectSerializable.

          Show
          Aaron Kimball added a comment - The metadata field is required in all SerializerBase implementations due to this method in the SerializerBase signature: public abstract Map< String , String > getMetadata() throws IOException; You're right though, AvroGenericSerialization will only accept() if CLASS_KEY is set and it's assignable to AvroReflectSerializable.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427534/HADOOP-6323.6.patch
          against trunk revision 889018.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/185/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/185/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/185/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/185/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427534/HADOOP-6323.6.patch against trunk revision 889018. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/185/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/185/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/185/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/185/console This message is automatically generated.
          Hide
          Aaron Kimball added a comment -

          ... but AvroSpecificSerialization calls t.getSchema() on the instance, not on the class.

          Show
          Aaron Kimball added a comment - ... but AvroSpecificSerialization calls t.getSchema() on the instance, not on the class.
          Hide
          Tom White added a comment -

          The class name should be in the metadata.

          Yes. To remove the example instance in the AvroSerialization.getSchema() method we would have to be sure that the Avro schema is passed in the metadata for each type of AvroSerialization.

          Show
          Tom White added a comment - The class name should be in the metadata. Yes. To remove the example instance in the AvroSerialization.getSchema() method we would have to be sure that the Avro schema is passed in the metadata for each type of AvroSerialization.
          Hide
          Aaron Kimball added a comment -

          Tom - is that a mandate we are comfortable enforcing?

          Show
          Aaron Kimball added a comment - Tom - is that a mandate we are comfortable enforcing?
          Hide
          Tom White added a comment -

          Could we do it using reflection in the serializer's constructor? Construct an instance of the SpecificRecord class specified in the metadata, then call getSchema() on it.

          Show
          Tom White added a comment - Could we do it using reflection in the serializer's constructor? Construct an instance of the SpecificRecord class specified in the metadata, then call getSchema() on it.
          Hide
          Doug Cutting added a comment -

          > AvroSpecificSerialization calls t.getSchema() on the instance

          SpecificData.get().getSchema(Class) will get the schema from a class.

          One question is whether, with Avro, we want to make the setting of metadata for a job (either the class name or a schema) required or optional. With Writable, it has always been required, no? You need to set the job's input and output classes. We never just guess based on the first object read or written, I think. So, we could do that differently for Avro, instead dynamically determining the schema or class name if it's not declared, but do we need to?

          Show
          Doug Cutting added a comment - > AvroSpecificSerialization calls t.getSchema() on the instance SpecificData.get().getSchema(Class) will get the schema from a class. One question is whether, with Avro, we want to make the setting of metadata for a job (either the class name or a schema) required or optional. With Writable, it has always been required, no? You need to set the job's input and output classes. We never just guess based on the first object read or written, I think. So, we could do that differently for Avro, instead dynamically determining the schema or class name if it's not declared, but do we need to?
          Hide
          Aaron Kimball added a comment -

          With Writable serialization, yes, you've always had to pre-declare the datatypes for each of the four key/value types emitted during a job dataflow. (The input types were implicit from the InputFormat/RecordReader.)

          I don't think it's necessary to dynamically determine the schema for emitted data during jobs in Avro either. I think if it were, someone would have already asked for some sort of "DelayedWritable" notion to be integrated in MapReduce already – which I don't think we've seen. It might be nice to support this eventually, but let's wait for demand.

          Show
          Aaron Kimball added a comment - With Writable serialization, yes, you've always had to pre-declare the datatypes for each of the four key/value types emitted during a job dataflow. (The input types were implicit from the InputFormat/RecordReader.) I don't think it's necessary to dynamically determine the schema for emitted data during jobs in Avro either. I think if it were, someone would have already asked for some sort of "DelayedWritable" notion to be integrated in MapReduce already – which I don't think we've seen. It might be nice to support this eventually, but let's wait for demand.
          Hide
          Aaron Kimball added a comment -

          Patch that implements this behavior.

          Show
          Aaron Kimball added a comment - Patch that implements this behavior.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427632/HADOOP-6323.7.patch
          against trunk revision 889378.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/188/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/188/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/188/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/188/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427632/HADOOP-6323.7.patch against trunk revision 889378. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/188/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/188/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/188/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/188/console This message is automatically generated.
          Hide
          Aaron Kimball added a comment -

          New patch. Per discussion with Doug, WritableSerialization should mandate CLASS_KEY is set in accept() method and again in WritableSerialization.WritableSerializer constructor.

          Show
          Aaron Kimball added a comment - New patch. Per discussion with Doug, WritableSerialization should mandate CLASS_KEY is set in accept() method and again in WritableSerialization.WritableSerializer constructor.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427655/HADOOP-6323.8.patch
          against trunk revision 889378.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/193/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/193/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/193/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/193/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427655/HADOOP-6323.8.patch against trunk revision 889378. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/193/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/193/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/193/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/193/console This message is automatically generated.
          Hide
          Doug Cutting added a comment -

          +1 This looks good to me. I'll commit it tomorrow if no one objects.

          Show
          Doug Cutting added a comment - +1 This looks good to me. I'll commit it tomorrow if no one objects.
          Hide
          Aaron Kimball added a comment -

          Realized that JavaSerialization is tested in the MapReduce side (org.apache.hadoop.mapred.TestJavaSerialization); when MapReduce uses this API (in MAPREDUCE-1126), that test suite will break.

          Canceling patch until I update JavaSerialization to the new SerializationBase interface as well.

          Show
          Aaron Kimball added a comment - Realized that JavaSerialization is tested in the MapReduce side (org.apache.hadoop.mapred.TestJavaSerialization); when MapReduce uses this API (in MAPREDUCE-1126 ), that test suite will break. Canceling patch until I update JavaSerialization to the new SerializationBase interface as well.
          Hide
          Aaron Kimball added a comment -

          Updated JavaSerialization to new API.

          Show
          Aaron Kimball added a comment - Updated JavaSerialization to new API.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427776/HADOOP-6323.9.patch
          against trunk revision 889378.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/198/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/198/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/198/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/198/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427776/HADOOP-6323.9.patch against trunk revision 889378. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/198/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/198/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/198/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/198/console This message is automatically generated.
          Hide
          Doug Cutting added a comment -

          I just committed this. Thanks, Aaron!

          Show
          Doug Cutting added a comment - I just committed this. Thanks, Aaron!
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #106 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/106/)
          . Add comparators to the serialization API. Contributed by Aaron Kimball.

          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #106 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/106/ ) . Add comparators to the serialization API. Contributed by Aaron Kimball.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #185 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/185/)
          . Add comparators to the serialization API. Contributed by Aaron Kimball.

          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk #185 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/185/ ) . Add comparators to the serialization API. Contributed by Aaron Kimball.

            People

            • Assignee:
              Aaron Kimball
              Reporter:
              Doug Cutting
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development