Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: task
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Currently the key comparator is defined as a Java class. Instead we should use the Serialization API to create key comparators. This would permit, e.g., Avro-based comparators to be used, permitting efficient sorting of complex data types without having to write a RawComparator in Java.
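          As an editor's illustration (not part of the patch): the point of a raw comparator is that keys can be ordered directly on their serialized bytes, without deserializing them first. A minimal, Hadoop-free sketch of that idea, using a simplified stand-in for the RawComparator interface:

```java
public class RawCompareSketch {
    // Simplified stand-in for Hadoop's RawComparator interface: compares
    // two keys given only their serialized byte ranges.
    interface RawComparator<T> {
        int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
    }

    // A serialization-supplied comparator: unsigned lexicographic byte
    // order, which matches the sort order of, e.g., UTF-8 encoded strings.
    static final RawComparator<String> LEXICOGRAPHIC = (b1, s1, l1, b2, s2, l2) -> {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int d = (b1[s1 + i] & 0xff) - (b2[s2 + i] & 0xff);
            if (d != 0) return d;
        }
        return l1 - l2;  // shorter key sorts first on a tie
    };

    public static void main(String[] args) {
        byte[] a = "apple".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        byte[] b = "banana".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(LEXICOGRAPHIC.compare(a, 0, a.length, b, 0, b.length) < 0);
    }
}
```

          A serialization framework like Avro can derive such byte-level comparators from a schema, which is what makes serialization-supplied comparators attractive for complex types.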

      1. m-1126-3.patch
        2 kB
        Owen O'Malley
      2. m-1126-2.patch
        2 kB
        Owen O'Malley
      3. MAPREDUCE-1126.patch
        75 kB
        Tom White
      4. MAPREDUCE-1126.6.patch
        120 kB
        Aaron Kimball
      5. MAPREDUCE-1126.5.patch
        112 kB
        Aaron Kimball
      6. MAPREDUCE-1126.4.patch
        108 kB
        Aaron Kimball
      7. MAPREDUCE-1126.3.patch
        107 kB
        Aaron Kimball
      8. MAPREDUCE-1126.2.patch
        107 kB
        Aaron Kimball
      9. MAPREDUCE-1126.patch
        90 kB
        Aaron Kimball

        Issue Links

          Activity

          Doug Cutting added a comment -

          This should use the API to be added in HADOOP-6323.

          Aaron Kimball added a comment -

          To allow Avro-based comparators during the shuffle, a few changes need to take place. Here's my survey of the scope of the problem and proposal for addressing it.

          HADOOP-6323 allows SerializationBase to provide RawComparator instances. But for Avro to be germane here, the Serializer and Deserializer instances used by MapOutputBuffer and other shuffle code need to be updated as well. MapOutputBuffer and other places where the Serializer and Deserializer interfaces are used should be changed to use SerializerBase and DeserializerBase instead. This should be done by changing the current calls to SerializationBase.getSerializer(keyClassName) to use the newer getSerializer(metadata) method.

          Currently the only metadata we store is the Serialization-Class that identifies the intermediate key and value classes. This is implicitly set in, e.g., Job.setMapOutputKeyClass(). (It actually just stores a class name, and the getSerializer(className) method converts that to the appropriate metadata map for getSerializer(metadata).) HADOOP-6165 allows an arbitrary map of metadata to be passed to the serialization framework. We should take advantage of this map throughout the shuffling process. The job's Configuration should be updated to embed a map along the lines of the outline in HADOOP-6165 (see Tom's comment, part 4(ii)) to include a "family" of configuration parameters. Given the renaming of various parameters, I propose to embed the metadata maps in mapreduce.map.output.key.metadata and mapreduce.map.output.value.metadata.

          Instead of setting mapreduce.map.output.key.class, the JobContext.setMapOutputKeyClass() method would set the appropriate metadata strings under mapreduce.map.output.key.metadata (and setMapOutputValueClass() under mapreduce.map.output.value.metadata) to use the WritableSerialization implementation. If this metadata is not present, it could be inferred from mapreduce.map.output.key/value.class, which should be marked as deprecated.
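          A hedged sketch of this configuration layout, using a plain Map in place of Hadoop's Configuration. Only the property names come from the proposal above; the helper names and the "Serialized-Class" metadata key are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataFamilySketch {
    static final String KEY_META_PREFIX = "mapreduce.map.output.key.metadata.";
    static final String DEPRECATED_KEY_CLASS = "mapreduce.map.output.key.class";

    // Store an arbitrary string -> string metadata map as a "family" of
    // prefixed configuration properties.
    static void setKeyMetadata(Map<String, String> conf, Map<String, String> meta) {
        meta.forEach((k, v) -> conf.put(KEY_META_PREFIX + k, v));
    }

    // Read the metadata family back; if absent, fall back to the deprecated
    // single class-name property by synthesizing equivalent metadata.
    static Map<String, String> getKeyMetadata(Map<String, String> conf) {
        Map<String, String> meta = new HashMap<>();
        conf.forEach((k, v) -> {
            if (k.startsWith(KEY_META_PREFIX)) {
                meta.put(k.substring(KEY_META_PREFIX.length()), v);
            }
        });
        if (meta.isEmpty() && conf.containsKey(DEPRECATED_KEY_CLASS)) {
            meta.put("Serialized-Class", conf.get(DEPRECATED_KEY_CLASS));
        }
        return meta;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(DEPRECATED_KEY_CLASS, "org.apache.hadoop.io.LongWritable");
        System.out.println(getKeyMetadata(conf));
    }
}
```

          The fallback path is what keeps legacy jobs that only set mapreduce.map.output.key.class working while the property is deprecated.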

          JobContext would also include new methods which could be used instead of setMapOutputKeyClass(), e.g. setMapOutputKeySchema() which would accept an Avro schema and configure the shuffle to use AvroGenericSerialization for this data. A generic setMapOutputKeySerializationMetadata() could be added as well, which would accept any string -> string map of metadata for use with other arbitrary SerializationBase implementations. (It might be used internally by setMapOutputKeyClass() and setMapOutputKeySchema().)

          On the internal side, JobContext already provides a method getSortComparator() which provides the appropriate RawComparator implementation. But it does not directly provide SerializerBase or DeserializerBase instances; it instead only returns the key and value class names and clients of this (e.g., MapOutputBuffer) are expected to instantiate the correct serialization objects themselves. I think that JobContext should provide getMapOutputKeySerializer() and getMapOutputKeyDeserializer() methods that correctly handle either the use of mapreduce.map.output.key.metadata or mapreduce.map.output.key.class to avoid duplication of this logic.

          All SerializerBase, DeserializerBase, and RawComparator implementations instantiated during shuffle would then be created via uses of this new API.

          Longer term, metadata maps are necessary for reducer output as well; e.g., mapreduce.reduce.output.key.metadata. This would allow full end-to-end Avro use for user data. Similar methods for JobContext would need to be added to set these parameters, and Job.setOutputKeyClass() et al. would need to be updated to use this new API as well. But I think that this may be more reasonably part of MAPREDUCE-815 than the current issue, which is concerned with just the shuffle process.

          Tom White added a comment -

          +1 This sounds good to me.

          Doug Cutting added a comment -

          I wonder if instead of adding more methods to JobContext we ought to add these to relevant serialization implementations. For example, we might have WritableSerialization.setMapOutputKeyClass(Class) and AvroSerialization.setMapOutputKeySchema(Schema). This makes the methods perhaps harder for folks to find, but it bakes less into JobContext. The serialization system is entirely user code, so it seems reasonable that the kernel should not directly support it. With this, JobContext would only have serialization agnostic methods like get/setMapOutputKeySerializationMetadata() and getMapOutputKeySerializer().
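          A rough sketch of this ownership split, again with illustrative rather than actual Hadoop names: the per-framework setter lives with the serialization implementation and funnels through a single serialization-agnostic method, so the JobContext-like class stays framework-neutral:

```java
import java.util.HashMap;
import java.util.Map;

public class SerializationSetterSketch {
    // The only serialization-agnostic entry point a JobContext-like class
    // would need to expose: write a metadata map into the configuration.
    static void setMapOutputKeySerializationMetadata(Map<String, String> conf,
                                                     Map<String, String> meta) {
        meta.forEach((k, v) -> conf.put("mapreduce.map.output.key.metadata." + k, v));
    }

    // The per-framework setter lives with the serialization, not in JobContext.
    static class WritableSerialization {
        static void setMapOutputKeyClass(Map<String, String> conf, Class<?> cls) {
            Map<String, String> meta = new HashMap<>();
            meta.put("Serialization-Class", WritableSerialization.class.getName());
            meta.put("Serialized-Class", cls.getName());
            setMapOutputKeySerializationMetadata(conf, meta);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        WritableSerialization.setMapOutputKeyClass(conf, Long.class);
        System.out.println(conf);
    }
}
```

          An AvroSerialization-style class would supply setMapOutputKeySchema(conf, schema) the same way, without JobContext knowing anything about schemas.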

          Aaron Kimball added a comment -

          Doug: That's in line with what we do in FileInputFormat / FileOutputFormat, etc.; it stands to reason that we should treat intermediate (k, v) pairs the same way we configure our inputs and outputs.

          Aaron Kimball added a comment -

          Attaching a patch that makes this work. This depends on HADOOP-6438 which adds the configuration getters/setters discussed earlier.

          This patch adds the following new files:

          svn add src/test/mapred/org/apache/hadoop/mapreduce/TestAvroSerialization.java
          svn add src/test/mapred/org/apache/hadoop/mapreduce/avro/key.avsc
          svn add src/test/mapred/org/apache/hadoop/mapreduce/avro/val.avsc
          

          Tested by running MapReduce unit tests; added new unit tests that use the Avro serialization systems (generic, specific, and reflect) for intermediate data types. Booted a pseudo-distributed cluster and ran some example jobs, which work.

          Will mark as patch-available after the dependencies are all met.

          Aaron Kimball added a comment -

          Attaching a new patch for this issue that incorporates the API changes initially proposed as HADOOP-6438.

          This adds a package org.apache.hadoop.mapreduce.lib.jobdata which contains static classes to set class-based or schema-based metadata. Explicit subclasses exist for particular serialization frameworks (e.g., WritableSerialization) to allow users to denote which framework they are using; in general their behavior is the same as their superclasses.

          This keeps the mapreduce-specific job configuration data in the MapReduce project, but does not clutter JobContext with per-serialization-framework setters for clients to use.

          Aaron Kimball added a comment -

          Per concerns raised in HADOOP-6420, jobdata metadata map setters now set a flag indicating that they were used.

          Doug Cutting added a comment -

          This is looking great. A few minor nits, and then I think we can commit it.

          • the patch is stale and does not apply cleanly
          • should ReduceTask#getClassFromMetadata be in ClassBasedJobData?
          • should we file another issue to fix the skipping feature to be serialization-savvy?
          Aaron Kimball added a comment -

          New patch synced with trunk.

          should ReduceTask#getClassFromMetadata be in ClassBasedJobData?

          There is no such literally-named method. The output types are queried with Job.getOutputKeyClass() and getOutputValueClass(). Eventually, these should migrate to ClassBasedJobData (e.g., in MAPREDUCE-815). But the scope of this issue is map-output types only.

          should we file another issue to fix the skipping feature to be serialization-savvy?

          Yes. This requires establishing a well-defined embedding of serialization metadata for key and value types/schemas in the SequenceFile metadata block.

          Aaron Kimball added a comment -

          As a followup: ReduceTask#SkippingReduceValuesIterator#getClassFromMetadata is necessary due to the dependency on WritableSerialization for supporting skipping. When record skipping works more generally, that method will disappear and the correct jobdata methods will be used.

          Aaron Kimball added a comment -

          Marking patch-available now that dependencies have been checked in.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12429576/MAPREDUCE-1126.4.patch
          against trunk revision 896781.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The patch appears to cause tar ant target to fail.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/252/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/252/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/252/console

          This message is automatically generated.

          Aaron Kimball added a comment -

          New patch resynced with trunk.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12429676/MAPREDUCE-1126.5.patch
          against trunk revision 896781.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 2342 javac compiler warnings (more than the trunk's current 2330 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/253/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/253/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/253/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/253/console

          This message is automatically generated.

          Aaron Kimball added a comment -

          Fixes javac warnings in examples and tools based on deprecated JobConf methods (these now use the new jobdata API).

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12429808/MAPREDUCE-1126.6.patch
          against trunk revision 897118.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 15 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/370/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/370/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/370/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/370/console

          This message is automatically generated.

          Doug Cutting added a comment -

          I just committed this. Thanks, Aaron!

          Owen O'Malley added a comment -

          -1 to this massive API change without much more dialog. The scope of the patch was much larger than the description.

          Doug Cutting added a comment -

          Owen, do you have specific concerns?

          Owen O'Malley added a comment -

          I reverted this until we can discuss the API changes. In particular, the framework can not depend on the stuff in mapreduce/lib. sigh

          Also deprecating the {get,set}MapOutput{Key,Value}Class methods seems unnecessary.

          Clearly, getting Avro to flow through the shuffle is important, but the changes to the API were poorly considered.

          Aaron Kimball added a comment -

          Do you think that it's okay to add methods like Job.setMapOutputKeySchema() then? In the limit, if another serialization framework (e.g., Hessian, Protocol Buffers) comes into common use in Hadoop, we would then need to add a mechanism to set its serialization-system-specific metadata on Job as well. We factored out InputFormat/OutputFormat-specific getters and setters (cf. FileInputFormat.addInputPath()) a while back, and deprecated JobConf.addInputPath(); this seems like a logical next step.

          Furthermore, what specific framework dependencies are you referring to? The jobdata package provides getters and setters that allow users to configure serialization system-specific metadata keys and values, but they are put into well-defined "system wide" metadata locations (e.g. JobContext.MAP_OUTPUT_KEY_METADATA) in the Configuration itself. The SerializerBase/DeserializerBase classes are instantiated in JobConf without touching the jobdata package at all (they rely only on the system-wide Configuration names).

          The only dependency on jobdata classes in Job/JobContext/JobConf is to push-down the now-deprecated getter/setter methods that the user would call in legacy code, but the framework no longer makes any calls to JobConf.getMapOutputKeyClass(). It instead calls JobConf.getMapOutputKeySerializer() and JobConf.getMapOutputKeyDeserializer() directly.

          Doug Cutting added a comment -

          > Also deprecating the {get,set}MapOutput{Key,Value}Class methods seems unnecessary.

          The preferred style is to keep setters and getters that are not directly used by the framework to library code. These methods are specific to certain serialization systems, mostly Writable. The properties these methods set are no longer read by the framework itself, but only by library code. Thus their setters also belong in library code, not in Job.

          Jeff Hammerbacher added a comment -

          Hey Owen,

          We'd like to address your concerns on this issue as soon as possible, as there are many other tasks which depend upon this particular issue. Could you please address Aaron and Doug's responses?

          Thanks,
          Jeff

          Owen O'Malley added a comment -

          The problem is highlighted by this part of the patch:

          -    job.setMapOutputKeyClass(LongWritable.class);
          -    job.setMapOutputValueClass(BytesWritable.class);
          +    WritableJobData.setMapOutputKeyClass(job.getConfiguration(),
          +        LongWritable.class);
          +    WritableJobData.setMapOutputValueClass(job.getConfiguration(),
          +        BytesWritable.class);
          

          That is a really ugly change to the user's application. If anything, I want the api to infer the map output types from the Mapper's type parameters by default.

          We used to infer the serialization from the types. That was a clean model. Why are you trying to change it? The type should imply a schema almost always. Am I missing something? Why would someone want to set a schema separately from the type?

          Additionally, you have the framework depending on a library. That is a problem. The right solution is to have a interface in mapreduce and have the library code implement it.
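          The inference Owen suggests — deriving the map output types from the Mapper's type parameters by default — can, in principle, be done with plain Java reflection. The sketch below uses a hypothetical Mapper stand-in (not Hadoop's actual class); it works only when the subclass binds the type parameters directly, since type erasure hides them otherwise:

          ```java
          import java.lang.reflect.ParameterizedType;
          import java.lang.reflect.Type;

          // Hypothetical stand-in for Hadoop's Mapper; names are illustrative only.
          abstract class Mapper<KIN, VIN, KOUT, VOUT> { }

          class WordCountMapper extends Mapper<Long, String, String, Integer> { }

          public class TypeInference {
            // Read the map output key/value classes off the Mapper subclass's
            // generic superclass declaration (works only for direct bindings).
            static Class<?>[] mapOutputTypes(Class<?> mapperClass) {
              ParameterizedType superType =
                  (ParameterizedType) mapperClass.getGenericSuperclass();
              Type[] args = superType.getActualTypeArguments();
              return new Class<?>[] { (Class<?>) args[2], (Class<?>) args[3] };
            }

            public static void main(String[] args) {
              Class<?>[] out = mapOutputTypes(WordCountMapper.class);
              System.out.println(out[0].getName() + " " + out[1].getName());
              // prints "java.lang.String java.lang.Integer"
            }
          }
          ```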

          Doug Cutting added a comment -

          > That is a really ugly change to the user's application.

          It is moving something from kernel to library code, using the encouraged pattern, to make the kernel data-representation agnostic. Avro applications will be on equal footing with Writable applications. They'll do something like AvroJobData.setMapInputSchema(job, mySchema). With generic data, the base class isn't germane, but the schema is. Should we instead add that method directly to Job, then do the same for Thrift, Protobuf, Hessian, Java Serialization, etc?

          This is analogous to the change from job.addInputPath() to FileInputFormat.addInputPath().

          These methods could be a bit more friendly, accepting the job directly rather than requiring folks to call job.getConfiguration(). Perhaps Aaron can provide a revised version that does this?

          > We used to infer the serialization from the types. That was a clean model. Why are you trying to change it?

          The model was changed a few months back, in HADOOP-6165. A class does not uniquely identify a serialization. A class might implement Writable, but one might wish to use Avro Reflection to serialize it. Avro also works for built-in types like Long, null, etc. directly, as do other serialization systems. Moreover, with union types (as, e.g., supported by Avro, Thrift and perhaps others) there is no single class that corresponds to every input or output datum.

          There could, for example, be an Avro-based inputformat and outputformat that represents all data types as byte[], containing the Avro binary data. Since Avro comparators operate efficiently on binary data, this would perhaps be an efficient way to use MapReduce as a sorting engine.

          > Additionally, you have the framework depending on a library. That is a problem.

          That's only for back-compatibility: what used to be the default is now in a library. The only calls to library code are in deprecated Job setter methods, invoked by user code. The framework never calls these setters.

          If this is a problem, then we could duplicate the code from WritableJobData in these deprecated methods. That's what was done when setInputPath() was deprecated. Is that preferred?

          Doug Cutting added a comment -

          Would this change be more palatable if we used a shorter term than "JobData", instead just calling it "Data"? Combined with changing the setters to accept a JobContext directly, this would look like:

          -    job.setMapOutputKeyClass(LongWritable.class);
          -    job.setMapOutputValueClass(BytesWritable.class);
          +    WritableData.setMapOutputKeyClass(job, LongWritable.class);
          +    WritableData.setMapOutputValueClass(job, BytesWritable.class);
          

          Is that better?
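          The pattern under debate — a library class whose static setter writes a well-known property into the shared configuration, which the framework later reads back generically — can be sketched in isolation. All class and key names below are stand-ins for illustration, not the actual Hadoop API:

          ```java
          import java.util.HashMap;
          import java.util.Map;

          // Minimal stand-in for Hadoop's Configuration.
          class Configuration {
            private final Map<String, String> props = new HashMap<>();
            void set(String key, String value) { props.put(key, value); }
            String get(String key) { return props.get(key); }
          }

          // Library-side setter: serialization-specific knowledge stays out of Job.
          class WritableData {
            // Well-known property name the framework reads back generically.
            static final String MAP_OUTPUT_KEY_CLASS = "mapreduce.map.output.key.class";

            static void setMapOutputKeyClass(Configuration conf, Class<?> cls) {
              conf.set(MAP_OUTPUT_KEY_CLASS, cls.getName());
            }
          }

          public class LibrarySetterDemo {
            public static void main(String[] args) {
              Configuration conf = new Configuration();
              WritableData.setMapOutputKeyClass(conf, Long.class);
              System.out.println(conf.get(WritableData.MAP_OUTPUT_KEY_CLASS));
              // prints "java.lang.Long"
            }
          }
          ```

          The framework itself would read only the generic property name, never compiling against the library class — which is the separation Doug describes above.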

          Owen O'Malley added a comment - edited

          Not really. I see now what you are trying to accomplish, but I think it is the wrong model. While the FileInputFormat is similar in structure, the issues are:
          1. the analogy isn't precise because you aren't setting WritableJobData as a parameter some where.
          2. the users care about the types that come out of their map
          3. the users aren't likely to change the serialization format from the default for the types they are using
          4. we end up with 3 configuration knobs per class instead of the current 1. There are 6 classes in the MR pipeline, which means before this is done we have 18 methods to set the class/schema/serializer for the various types. That is way too complicated.

          I thought that you were going to create a marker interface for AvroRecord that has a getSchema method. That would have enabled the AvroWriter to get the schema from the types rather than get the types from the schema.

          I also think that removing the type checks from the collector and ifile code is a bad plan and will allow a lot of errors to reach much further into the system.

          Let's consider the proposal that Arun has been discussing. Instead of doing:

             FileInputFormat.setInputPath(job, new Path("/foo"));
             job.setInputFormatClass(TextInputFormat.class);
          

          you do:

            TextInputFormat input = new TextInputFormat();
            input.setInputPath(new Path("/foo"));
            job.setInputFormat(input);
          

          clearly the job needs to serialize the InputFormat object and reconstruct it on the other side. This is much, much easier for users to understand than the current model and can probably be done in a backwards compatible manner. Notice that this adds another 5 types that we are going to want to serialize (InputFormat, Mapper, Combiner, Reducer, OutputFormat) per job. With the current proposal that means that we get 11 * 3 = 33 serialization methods.

          I think that:

          • we need to use the global serialization/deserialization factory that we already have.
          • moving the {set,get}MapOutput{Key,Value}Class methods is a non-starter. As a general rule, if you need to modify all of the examples, we need to carefully discuss the issues.

          • the metadata should not be user visible and it would be better if it was just used to communicate within the serializer. Why is the metadata a Map? I'd rather have it be an opaque blob that is serializable.
          • we can debate whether the type restrictions on map outputs should be loosened, but certainly we need to check on the map side whether the type the map is outputting is correct. If you are going to loosen it, the class methods should become deprecated and vestigial and you need to support union types in Writable too.
          • i'm not wild about having Configuration.setMap. Having a function in StringUtils that takes a Map<String,String> map into and from a String seems more appropriate.
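          The StringUtils-style helper Owen suggests — packing a Map<String,String> into a single String and back — might look like the following sketch. Escaping is deliberately omitted, so keys and values must not contain ',' or '='; a real implementation would need to handle that:

          ```java
          import java.util.LinkedHashMap;
          import java.util.Map;
          import java.util.stream.Collectors;

          public class MapCodec {
            // Join entries as "k=v" pairs separated by commas.
            static String mapToString(Map<String, String> map) {
              return map.entrySet().stream()
                  .map(e -> e.getKey() + "=" + e.getValue())
                  .collect(Collectors.joining(","));
            }

            // Inverse of mapToString; preserves insertion order.
            static Map<String, String> stringToMap(String s) {
              Map<String, String> map = new LinkedHashMap<>();
              if (s.isEmpty()) return map;
              for (String pair : s.split(",")) {
                int i = pair.indexOf('=');
                map.put(pair.substring(0, i), pair.substring(i + 1));
              }
              return map;
            }

            public static void main(String[] args) {
              Map<String, String> m = new LinkedHashMap<>();
              m.put("schema", "long");
              m.put("class", "java.lang.Long");
              String packed = mapToString(m);
              System.out.println(packed); // prints "schema=long,class=java.lang.Long"
              System.out.println(stringToMap(packed).equals(m)); // prints "true"
            }
          }
          ```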
          Doug Cutting added a comment -

          > the users aren't likely to change the serialization format from the default for the types they are using

          There is no default for Long, ByteBuffer, Map<URL,byte[]>, unions, etc. These are reasonable values for mapreduce processing. Moreover, a given class might be serialized and compared in different ways by different jobs.

          > we end up with 3 configuration knobs per a class instead of the current 1. There are 6 classes in the MR pipeline, that means before this is done we have 18 methods to set the class/schema/serializer for the various types.

          I don't follow your math here. An application need call no more methods than it did before to configure a job.

          > I thought that you were going to create a marker interface for AvroRecord that has a getSchema method.

          That only works for the case when a single record is the top-level type. It does not work for arrays, maps, enums, unions or primitives, all reasonable values for mapreduce. Nor does it work where one might have a legacy Writable that one also sometimes wishes to process with Avro reflection. In the general case, how you serialize something is independent of its in-memory representation.

          We use the marker interface when we can, but we cannot always. When a marker interface is appropriate, job configuration looks much as it did before: one can still set key and value classes.

          > Notice that this adds another 5 types that we are going to want serialize (InputFormat, Mapper, Combiner, Reducer, OutputFormat) per a job. With the current proposal that means that we get 11 * 3 = 33 serialization methods.

          I do not understand what you're counting here. Currently we specify job parameters with a configuration. In the future, we might change this to instead use serialized instances of five interfaces. If we do that, then we would declare a single mechanism by which these instances are serialized, just as we now declare the single mechanism by which a configuration is serialized.

          > we need to use the global serialization/deserialization factory that we already have.

          That's precisely what this patch does. It updates the shuffle to no longer use the serialization factory in a now-deprecated manner.

          > moving the {set,get}MapOutput{Key,Value}Class methods is a non-starter.

          This is an odd statement. We're forever forbidden from making the MapReduce framework more abstract?

          > Why is the metadata a Map? I'd rather have it be an opaque blob that is serializable.

          That was established in HADOOP-6165. If you'd like to revisit that representation for metadata, that should be done as a separate issue. That API has not been released, and so would be easy to change if there is consensus to do so. This current issue builds on the serialization metadata mechanism currently in trunk and should be evaluated independently of the serialization metadata representation.

          > we need to check on the map side whether the type the map is outputting is correct.

          That check is now done by the serializer.

          > you need to support union types in Writable too.

          I don't understand this suggestion. A serialization system that implements unions should check unions, but a serialization system that does not support unions need not.

          > i'm not wild about having Configuration.setMap. Having a function in StringUtils that takes a Map<String,String> map into and from a String seems more appropriate.

          That was established in HADOOP-6420. This API has not been released, and would be easy to change if there is consensus. If you wish to revisit that API, and have a better implementation idea, then that should be done in a separate issue, but that question should be considered separately from the current issue.

          Tom White added a comment -

          This patch is making changes that make it possible to take advantage of the more general serialization mechanism introduced in HADOOP-6165 in MapReduce. Until now the serialization used for a key or value was driven by the type of the key or value. This is not sufficiently general for Avro, which is what motivated the work in HADOOP-6165. (Note that HADOOP-6165 did not have any effect on user APIs, since users don't typically interact with serialization classes directly.) However, it is true that many serialization frameworks are type driven - Writables, Thrift, Java Serialization, Avro Specific, to name a few - so I think there may be an argument to retain job.setMapOutputKeyClass() as it currently stands. The advantage is that existing Writable-based jobs do not have to be changed, which I think is at the heart of Owen's criticism.

          For Avro Generic, or serializations where the schema for the types needs to be specified, we can use the AvroGenericJobData class in this patch. (BTW Aaron, why does SchemaBasedJobData exist? It seems to reference Avro internally, even though its name suggests it is general.) In this case, there would be no need to call job.setMapOutputKeyClass().

          Would this address folks' concerns?

          Doug Cutting added a comment -

          Owen, I would like to figure out what we need to do to address your concerns without losing functionality.

          Some notes on functionality:

          We need to support non-record top-level types. A Python or Ruby program might generate data whose top-level type is a union, array, map, long, string, etc. A serialization-specific class is not a natural Java representation for such data. We should not create AvroSpecificLong, AvroReflectLong, etc. wrappers as we have been forced to do with Writable. Rather applications should be able to use built-in Java types for such data.

          A class alone is not sufficient to specify a serialization. One must be able to specify both:

          • the binary format of the data
          • the mapping from binary to in-memory

          These are distinct. In Avro a schema defines the binary format and a DatumReader usually defines the mapping from binary to in-memory. Avro includes three different DatumReader implementations: generic, specific and reflect. These map built-in types differently. For example, ReflectDatumReader maps Avro strings to java.lang.String while SpecificDatumReader maps Avro strings to org.apache.avro.util.Utf8.

          Moreover, an application can use a lower-level event-based API to map its in-memory Java data structures to Avro. For example, Pig data is a union of Bag, Tuple and some built-in types. It's straightforward to define an Avro schema for this, but none of Avro's provided DatumReader implementations might be optimal. Reflection is slow. Specific and generic would require copying data to-and-from generated classes. So Pig might best declare its schema and then read and write its own data structures directly as Avro data. Avro includes tools to efficiently make this type-safe. So Pig might define its own serialization for Long, null, etc.

          In the current patch one would use AvroGenericData.setMapOutputKeySchema(conf, schema) to specify that Avro's generic in-memory representation should be used for data whose binary format corresponds to a given schema. One would use WritableData.setMapOutputKeyClass(conf, class) to specify that a Writable class should be used to define both the binary format and its in-memory representation. Both of these set configuration parameters used by the serialization factory to produce an appropriate serializer. A third party can define new serializers for, e.g., Pig data, and configure their jobs to use it with something like PigData.setMapOutputFormat(conf).
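          The metadata-driven dispatch described above — a factory that returns a serializer based on configuration entries rather than on a single key class — can be illustrated with stand-in types. This is not the actual SerializationBase API, and the metadata keys are invented for the example:

          ```java
          import java.util.Map;

          // Stand-in functional interface; a real serializer would expose
          // serialize/deserialize methods rather than a descriptive string.
          interface Serializer {
            String describe();
          }

          public class MetadataFactory {
            // Pick a serializer from metadata, not from a key class alone.
            static Serializer getSerializer(Map<String, String> metadata) {
              String kind = metadata.get("serialization.name");
              if ("writable".equals(kind)) {
                return () -> "writable for " + metadata.get("key.class");
              }
              if ("avro-generic".equals(kind)) {
                return () -> "avro-generic for schema " + metadata.get("key.schema");
              }
              throw new IllegalArgumentException("no serializer for: " + metadata);
            }

            public static void main(String[] args) {
              Serializer s = getSerializer(
                  Map.of("serialization.name", "avro-generic", "key.schema", "\"long\""));
              System.out.println(s.describe());
              // prints: avro-generic for schema "long"
            }
          }
          ```

          The point of the indirection is that the same factory call serves both the class-driven (Writable) and schema-driven (Avro generic) cases, with the library setters merely populating different metadata entries.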

          We cannot easily get this from the InputFormat. We could instead configure the input format with such information, and have it use the serialization factory. Applications would still need to set the same number of parameters, just in a different place: the binary format still needs to be declared, as does the mapping from binary to in-memory, these cannot be inferred automatically. That's a more substantial change that we did not wish to make in this patch.

          Moreover, the inputformat is irrelevant to the current patch, since the serialization used during the shuffle can be different from that used for input and output. We don't force input and output serializations to be identical, nor should we force intermediate serialization to match one or the other. The map input might be legacy data, and the map function might convert it to a different representation that requires a different serialization. This patch concerns intermediate data. MAPREDUCE-815 makes the corresponding changes for input and output data.

          The tests for MAPREDUCE-815 include an end-to-end Avro job whose key uses Avro's generic data representation and whose value is simply null. Have a look at the code that creates such a job: it looks much like our job creation today, the API doesn't appear fundamentally different to me. It's not currently possible to specify null as a MapReduce value, since null has no class. NullWritable is a crude, serialization-specific manner for doing this that, long-term, I hope we can deprecate in favor of simply declaring and passing null when that's appropriate.

          I hope these observations can help us reach consensus without extensive delay.

          Show
          Doug Cutting added a comment -

          Owen, I would like to figure out what we need to do to address your concerns without losing functionality. Some notes on functionality:

          We need to support non-record top-level types. A Python or Ruby program might generate data whose top-level type is a union, array, map, long, string, etc. A serialization-specific class is not a natural Java representation for such data. We should not create AvroSpecificLong, AvroReflectLong, etc. wrappers as we have been forced to do with Writable. Rather, applications should be able to use built-in Java types for such data.

          A class alone is not sufficient to specify a serialization. One must be able to specify both:
          1. the binary format of the data
          2. the mapping from binary to in-memory

          These are distinct. In Avro a schema defines the binary format and a DatumReader usually defines the mapping from binary to in-memory. Avro includes three different DatumReader implementations: generic, specific and reflect. These map built-in types differently. For example, ReflectDatumReader maps Avro strings to java.lang.String while SpecificDatumReader maps Avro strings to org.apache.avro.util.Utf8.

          Moreover, an application can use a lower-level event-based API to map its in-memory Java data structures to Avro. For example, Pig data is a union of Bag, Tuple and some built-in types. It's straightforward to define an Avro schema for this, but none of Avro's provided DatumReader implementations might be optimal. Reflection is slow. Specific and generic would require copying data to and from generated classes. So Pig might best declare its schema and then read and write its own data structures directly as Avro data. Avro includes tools to efficiently make this type-safe. So Pig might define its own serialization for Long, null, etc.

          In the current patch one would use AvroGenericData.setMapOutputKeySchema(conf, schema) to specify that Avro's generic in-memory representation should be used for data whose binary format corresponds to a given schema. One would use WritableData.setMapOutputKeyClass(conf, class) to specify that a Writable class should be used to define both the binary format and its in-memory representation. Both of these set configuration parameters used by the serialization factory to produce an appropriate serializer. A third party can define new serializers for, e.g., Pig data, and configure their jobs to use it with something like PigData.setMapOutputFormat(conf).

          We cannot easily get this from the InputFormat. We could instead configure the input format with such information, and have it use the serialization factory. Applications would still need to set the same number of parameters, just in a different place: the binary format still needs to be declared, as does the mapping from binary to in-memory; these cannot be inferred automatically. That's a more substantial change that we did not wish to make in this patch. Moreover, the InputFormat is irrelevant to the current patch, since the serialization used during the shuffle can be different from that used for input and output. We don't force input and output serializations to be identical, nor should we force the intermediate serialization to match one or the other. The map input might be legacy data, and the map function might convert it to a different representation that requires a different serialization. This patch concerns intermediate data. MAPREDUCE-815 makes the corresponding changes for input and output data. The tests for MAPREDUCE-815 include an end-to-end Avro job whose key uses Avro's generic data representation and whose value is simply null. Have a look at the code that creates such a job: it looks much like our job creation today; the API doesn't appear fundamentally different to me.

          It's not currently possible to specify null as a MapReduce value, since null has no class. NullWritable is a crude, serialization-specific way of doing this that, long-term, I hope we can deprecate in favor of simply declaring and passing null when that's appropriate. I hope these observations can help us reach consensus without extensive delay.
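The distinction Doug draws here (one binary format, several in-memory mappings) can be sketched in plain Java. The class and method names below are invented for illustration and are not Avro's actual DatumReader API; the point is that both readers consume identical bytes, so the in-memory type is configuration, not something a class name alone can express.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// One binary format (length-prefixed UTF-8), two in-memory mappings:
// a String-producing reader and a reusable-buffer reader, loosely
// analogous to how Avro's reflect and specific DatumReaders map the
// same "string" schema to java.lang.String and Utf8 respectively.
public class TwoMappings {

  // The binary format: 4-byte big-endian length, then UTF-8 bytes.
  static byte[] write(String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.allocate(4 + utf8.length);
    buf.putInt(utf8.length).put(utf8);
    return buf.array();
  }

  // Mapping 1: decode to an immutable java.lang.String.
  static String readAsString(byte[] data) {
    ByteBuffer buf = ByteBuffer.wrap(data);
    byte[] utf8 = new byte[buf.getInt()];
    buf.get(utf8);
    return new String(utf8, StandardCharsets.UTF_8);
  }

  // Mapping 2: decode into a reusable mutable holder (in the spirit of
  // Avro's Utf8), avoiding a fresh String allocation per record.
  static final class Utf8Holder {
    byte[] bytes = new byte[0];
    int length;
    @Override public String toString() {
      return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }
  }

  static Utf8Holder readInto(byte[] data, Utf8Holder reuse) {
    ByteBuffer buf = ByteBuffer.wrap(data);
    reuse.length = buf.getInt();
    if (reuse.bytes.length < reuse.length) reuse.bytes = new byte[reuse.length];
    buf.get(reuse.bytes, 0, reuse.length);
    return reuse;
  }
}
```

Either reader can be chosen per job without changing the bytes on disk, which is why a (format, mapping) pair rather than a single class is needed to specify a serialization.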
          Owen O'Malley added a comment -

          I'm very disappointed that this jira went in with a title and description that completely misrepresented the content and scope of the patch. This patch completely revamps the type system and semantics of the map/reduce framework. Changing that without a large discussion is uncool.

          I disagree with the fundamental approach taken here. The details are also problematic, but we need to find an acceptable model before any progress on this or any related patches can be made.

          My concerns are:
          1. We should use the current global serializer factory for all contexts of a job. We have 7 serialized types already (map in key, map in value, map out key, map out value, reduce out key, reduce out value, input split). We will likely end up with more types later. Having a separate serializer and metadata for each type will be extremely confusing to the users.
          2. Defining the schema should be an Avro specific function and not part of the framework.
          3. I don't see any reason to support union types at the top level of the shuffle. There are already libraries that handle this without changing the framework. Furthermore, an Avro record on top of the schema is free in serialization size.
          4. Only the default comparator should come from the serializer. The user has to be able to override it in the framework (not change the serializer factory).

          That said, I think that it is perfectly reasonable for the Avro serializer to accept all types. So if you have a Mapper<String,String,String,String> it will use Avro serialization.
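Owen's fourth point, that a job-level comparator must override whatever the serializer supplies, amounts to a simple resolution rule. A minimal sketch, with invented interface and method names (this is not Hadoop's actual SerializationBase API):

```java
import java.util.Comparator;

// Sketch of the resolution order argued for above: a comparator set
// explicitly on the job wins; otherwise the serialization's default
// comparator is used.
public class ComparatorResolution {

  // Stand-in for the part of a serialization that supplies a
  // default comparator.
  interface Serialization<T> {
    Comparator<T> defaultComparator();
  }

  // The framework consults the job-level override first, and falls
  // back to the serialization's default only when none is set.
  static <T> Comparator<T> resolve(Comparator<T> jobOverride,
                                   Serialization<T> serialization) {
    return jobOverride != null ? jobOverride
                               : serialization.defaultComparator();
  }
}
```

Under this rule the serializer factory never needs to change just to customize sort order, which is the separation of concerns Owen is asking for.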

          Arun C Murthy added a comment -

          1. We should use the current global serializer factory for all contexts of a job.
          4. Only the default comparator should come from the serializer. The user has to be able to override it in the framework (not change the serializer factory).

          +1, strongly agree.

          Doug Cutting added a comment -

          Owen, I am sorry you did not realize earlier the magnitude of the
          changes implied by this issue. As Aaron started work on it it became
          apparent that the changes required were substantial. A number of
          people were following the issue as it progressed, and I assumed that
          it had good oversight.

          > 1. We should use the current global serializer factory for all contexts of a job.

          I don't see that as an acceptable solution. Job inputs and outputs
          can reasonably differ in their serialization methods. Specifying
          seven different serializations can be done in seven lines of code that
          replace the seven lines of code we currently use to specify these as
          classes. There are now more possibilities for what those seven lines
          can contain, but I don't see this as a huge increase in complexity for
          end users.

          WritableData.setMapOutputKeyClass(job, class) is not much more
          complicated than the current job.setMapOutputKeyClass(class). The
          natural way for an Avro generic data job to specify its map output key
          is AvroGenericData.setMapOutputKeySchema(job, schema). In Avro, a
          schema specifies classes, binary format, and sort order. There is in
          general no single class that represents all of these aspects of a
          schema. Different serialization systems can reasonably differ in what
          kind of metadata they require.

          (A mapreduce simplification that might make sense long-term is to
          eliminate the key/value distinction. Each map input could be a single
          object, each map output could be a single comparable object, and each
          reduce output a single object. This would eliminate three settings
          per job, and I can think of no use cases where keys and values might
          use different serialization systems.)

          > 2. Defining the schema should be an Avro specific function.

          It is Avro-specific in the current patch, so we agree on this point.

          > 3. I don't see any reason to support union types at the top-level of
          the shuffle.

          We should not force folks to define wrapper classes not required or
          generated by their serialization system just to pass data through
          mapreduce. The Java class namespace is a poor mechanism to represent
          mappings between all Java objects and all their possible serialized
          forms. Serialization is not completely determined by a class, and a
          class does not completely determine a serialization. Java Strings
          longs and floats can be serialized in different ways. A given job
          might take String data from a file using one serialization system, map
          it to a union type using another serialization system that provides
          efficient, structured binary comparison, then write the final output
          to a database as String using yet another serialization system. Why
          should we require folks to define wrapper classes to achieve this?

          > 4. Only the default comparator should come from the serializer.

          That would make sense if we only permit a single, global serializer.
          If however the shuffle has its own serializer, then it could be done
          in either place: the job could define a shuffle comparator, or it
          could use the comparator from the shuffle's serializer. In either
          case, users should be able to override the comparator. Since
          comparators are a part of the serialization API, it seems better
          modularization to use the comparator specified by the shuffle's
          serializer, no?

          > That said, I think it is perfectly acceptable for the Avro serializer to accept all types.

          That would give the Avro serializer privileged status. One could not also use another serializer (e.g., a Pig, Thrift or Hessian serializer) that also accepts String. Applications should be able to specify which serializations they intend.

          A primary design goal of the Avro project is improving the flexibility
          of serialization APIs. Mapreduce is a primary target application for
          Avro. We should not hobble Avro in Mapreduce. The Writable model,
          where classes define their serialization, has served us well, but that
          model is limited. Avro permits flexible mappings between in-memory
          representations and serializations. We can easily support this in
          Mapreduce without either giving Avro privileged status or making the
          Mapreduce API overly complex. I hope you will not block this effort.
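For readers following the comparator thread: a serialization-supplied raw comparator typically compares keys byte-by-byte in their serialized form, without deserializing. Below is a standalone memcmp-style sketch; it is not Hadoop's WritableComparator, though it takes the same approach as that class's compareBytes helper:

```java
// Minimal unsigned byte-by-byte comparison over serialized key bytes,
// the kind of default comparator a serialization could supply to the
// shuffle. Standalone illustration only.
public class RawBytes {
  static int compare(byte[] b1, int s1, int l1,
                     byte[] b2, int s2, int l2) {
    int n = Math.min(l1, l2);
    for (int i = 0; i < n; i++) {
      int a = b1[s1 + i] & 0xff;  // treat bytes as unsigned
      int b = b2[s2 + i] & 0xff;
      if (a != b) return a - b;
    }
    return l1 - l2;  // on a common prefix, the shorter key sorts first
  }
}
```

Whether the framework obtains such a comparator from a global factory or from a shuffle-specific serializer is exactly the modularization question being debated above.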

          Doug Cutting added a comment -

          For folk's consideration, here's what creation of an Avro-based Job looks like after this and MAPREDUCE-815 are committed:

              // Create a MapReduce job to sort this input.
              Job job = new Job();
              job.setInputFormatClass(AvroInputFormat.class);
              job.setOutputFormatClass(AvroOutputFormat.class);
              job.setNumReduceTasks(1);
          
              Configuration conf = job.getConfiguration();
              SchemaBasedJobData.setOutputKeySchema(conf, schema);
              SchemaBasedJobData.setOutputValueSchema(conf,
                  Schema.create(Schema.Type.NULL));
          
              Path outPath = new Path(workDir, "out");
          
              FileInputFormat.addInputPath(job, inPath);
              FileOutputFormat.setOutputPath(job, outPath);
          
              boolean result = job.waitForCompletion(true);
          

          Is this API markedly more complex than what we have now? It is markedly more powerful, since, e.g., nulls, unions, arrays, maps, and built-in types may be passed directly, without creating a wrapper class. A custom input serialization (e.g., for log files or Pig data) can be easily and freely intermixed in a job with a library of standard serializations (Avro in various flavors, Writable, etc.).

          Jay Booth added a comment -

          For what it's worth, I really like setOutputKeyClass and while I understand the distinction that a given type might map to more than one serialization method, could we find a compromise that keeps the simplicity of a general setOutputKeyClass while providing the flexibility that Doug's looking for? This could have the added benefit of decoupling the class declaration from the serialization declaration, since they're coupled but kind of separate.

          Maybe something like:

          // defaults to class-based introspection, if there are conflicts, Writable wins for legacy reasons
          job.setOutputKeyClass(Text.class)
          job.setOutputValueClass(Text.class)
          ...

          // explicitly use avro
          job.setOutputKeyClass(AvroText.class)
          job.setOutputValueClass(AvroText.class)
          job.setOutputKeySerialization(new AvroSerializationSpec(conf,schema))
          job.setOutputValueSerialization(new AvroSerializationSpec(conf, Schema.create(Schema.Type.NULL)))

          Under this setup, the explicit Avro situation will be slightly wordier, but all serialization formats could go in through a common object type (SerializationSpec or something) and we wouldn't have to have a different static setter for every different serialization type. It also ensures that keyClass and valueClass are set the same way regardless of serialization type, which I'd think is worth having if possible.

          Another option would be to enforce the 4-liner version in all cases so that people don't get unpredictable behavior based on class introspection, I could see going either way on that one.. it's more robust but would break legacy code and is more typing for a lot of common cases where there's no chance for confusion.

          For what it's worth, I cringe a little bit when doing Class.staticMethod(Job), I'm always worried in the back of my mind about squashing already-resident values. Yeah, I know, namespacing, but I generally feel like if something is core functionality then it should be accounted for in the base Job class.
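Jay's SerializationSpec idea could look roughly like the following. Every name here is hypothetical, sketched only to make the proposal concrete: one common carrier object for per-slot serialization settings, regardless of serialization system.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical shape of the proposal above: the job holds one spec
// per slot (map output key, map output value, ...), and each spec
// names a serialization plus whatever metadata it needs.
public class SerializationSpecSketch {

  static final class SerializationSpec {
    final String serializationName;
    final Map<String, String> metadata;  // e.g. an Avro schema string
    SerializationSpec(String name, Map<String, String> metadata) {
      this.serializationName = name;
      this.metadata = metadata;
    }
  }

  private final Map<String, SerializationSpec> slots = new HashMap<>();

  void setOutputKeySerialization(SerializationSpec spec) {
    slots.put("mapout.key", spec);
  }

  SerializationSpec getOutputKeySerialization() {
    return slots.get("mapout.key");
  }
}
```

The trade-off debated in the thread is visible here: the setter is uniform across serialization systems, but each system still has to define what goes into the metadata map.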

          Doug Cutting added a comment -

          Jay, there's been an effort to move job accessor methods away from Job.java unless they're required by the mapreduce framework. For example, Job#setInputPath(Path) moved to FileInputFormat#setInputPath(Job, Path), since not all jobs have files for input, and the file-specific code is all user-replaceable, not built into the mapreduce system. Similarly, after the current patch for this issue is applied, the framework no longer contains calls to any particular serialization system (except in deprecated back-compatibility methods).

          Jay Booth added a comment -

          Well, ok, guess I'm in the minority on that instinct

          Should we still consider separating the concerns of key/value class type and key/value serialization method? They're sort of linked but sort of separate, for example in a unit testing scenario you might want the classes regardless of the serialization method. Too much setter spam then?

          Owen O'Malley added a comment -

          Well, ok, guess I'm in the minority on that instinct

          No, that is pretty constant from the users.

          My assertion is that leaving the type as the primary instrument of the user in defining the job is correct. I haven't talked to any users that care about using a non-default serializer for a given type.

          I certainly do want Avro to go through the MapReduce pipeline. I guess I'll need to implement a prototype to make it clear how Avro-based MapReduce could work without a massively incompatible user-facing change.

          Ted Dunning added a comment -

          Isn't there a middle ground available (at least from the user's point of view)?

          My thought would be that if the user specifies types in the current style, they would be limited to Writables in the current fashion. That could be marked as old-fashioned, but I wouldn't necessarily deprecate it. It does leave Writable in a privileged position relative to other serialization frameworks, but it is in a privileged position since it existed first.

          Alternately, the user could specify a serialization framework specific configuration much like Doug suggests. It should be true that if any non-standard serialization is used that specifying a type is an error and vice versa. This should be easy to check.

          From the user's point of view, they could use old-style job configuration or the new style that Doug suggests. I strongly prefer the new style, but I wouldn't be anxious to have to change all my old style programs.

          Under the covers, almost anything could happen, but the important thing that would happen is that if any special serialization is invoked, the job config would need to know about it which might affect many other components like the shuffle.

          Is there any technical reason why this cannot be made to work?

          Is there really any philosophical reason that old programs must be broken?

          If no and no, why is there a problem here? I think that this middle ground would satisfy Owen's (and my own) needs for backwards compatibility as well as Doug's (and my own) desire for flexibility for serialization.

          Tom White added a comment -

          I'm working on a modified patch which hopefully will find a middle ground. It avoids any change to the existing API (so existing jobs will run without change or deprecation warnings), and it permits generic Avro types in the shuffle.

          That said, I'm also interested in seeing your prototype Owen.

          Doug Cutting added a comment -

          > Should we still consider separating the concerns of key/value class type and key/value serialization method?

          Jay, with union types there is no single class save Object that all keys and values in a job must share. So the key/value classes for some jobs are not very useful or interesting. Consider, e.g., Pig's data model, where things are either a bag, tuple, or a built-in type (int, long, String, float, double, etc.). For tests, a job that uses Avro would be better off validating the data against the schema than checking whether a particular class is used.

          Owen O'Malley added a comment -

          For union types, it is a trivial restriction to require map/reduce users to wrap records around the unions:

          record(x: union(int, string))

          Doug Cutting added a comment -

          It is also "trivial" to wrap everything in Writable. Should we just do that?

          Alan Gates added a comment -

          [From Owen] My assertion is that leaving the type as the primary instrument of the user in defining the job is correct. I haven't talked to any users that care about using a non-default serializer for a given type.

          Pig would like to. For scalar types Pig uses Java String, Long, Integer, etc. But default Java serialization is slow. So currently we convert these to and from Writables as we go across the Map and Reduce boundaries to get the faster Writable serialization. If we could instead define an alternate serializer and avoid these conversions it would make our code simpler and should perform better.

          Ted Dunning added a comment -

          > [From Owen] My assertion is that leaving the type as the primary instrument of the user in defining the job is correct.
          > I haven't talked to any users that care about using a non-default serializer for a given type.

          Pig would like to. For scalar types Pig uses Java String, Long, Integer, etc. But default Java serialization is slow. So currently we convert these to and from Writables as we go across the Map and Reduce boundaries to get the faster Writable serialization. If we could instead define an alternate serializer and avoid these conversions it would make our code simpler and should perform better.

          I would like to. I would like to start using Avro for greater expressive power as soon as possible. I also can't change all of my legacy code right away so I will have some code that implements both Writable and Avro serialization. I need to be able to use writable for old code and Avro for new code.

          Owen O'Malley added a comment -

          For scalar types Pig uses Java String, Long, Integer, etc. But default Java serialization is slow

I think the default configuration should use a WritableSerializer for Writables and an AvroSerializer for everything else. Java serialization was a great experiment, but it was never performant for serious use. So the question is not whether you want different serializers, but rather whether a job needs different serializers for the same class.

          Doug Cutting added a comment -

Which Avro serializer? Avro includes three different mappings from in-memory to binary representations, and applications can add more. In the generic mapping, a java.lang.String represents an enum symbol, while in reflect it represents a string.

And do we really want to privilege Avro here? It should be possible to use Thrift too, and to intermix the two within a single job. A Long in the input might be part of a Thrift union, while a Long in the output may use Avro.

          Tom White added a comment -

          Here's a much-simplified patch. To show how it works with nested types I've added an example mapper with signature Mapper<LongWritable, Text, Utf8, Map<Utf8, Long>> which uses the generic Avro serialization for the intermediate key and value. It is configured by calling

          Schema keySchema = Schema.create(Schema.Type.STRING);
          Schema valSchema = Schema.parse("{\"type\":\"map\", \"values\":\"long\"}");
          AvroGenericData.setMapOutputKeySchema(job, keySchema);
          AvroGenericData.setMapOutputValueSchema(job, valSchema);
          

          This replaces the calls to job.setMapOutputKeyClass() and job.setMapOutputValueClass().

          I'm interested in hearing people's thoughts about this.

          Chris Douglas added a comment -

          Replacing the type driven serialization with an explicitly specified, context-sensitive factory is 1) throwing away all Java type hierarchies, 2) asserting that the serialization defines the user types, and 3) implying that these types- and relationships between them- should remain opaque to the MapReduce framework.

          It's making a tradeoff discussed in HADOOP-6323: all the type checks are removed from the framework, but enforced by the serializer. So WritableSerialization- appropriately- requires an exact match for the configured class, but other serializers may not. The MapReduce framework can't do any checks of its own- neither, notably, may Java- to verify properties of the types users supply; their semantics are defined by the serialization. For example, a job using related Writable types may pass a compile-time type check, work with explicit Avro serialization in the intermediate data, but fail if it were run with implicit Writable serialization.

          This is a huge shift. It means the generic, Java types for the Mapper, Reducer, collector etc. literally don't matter; they're effectively all Object (relying on autoboxing to collect primitive types). This means that every serialization has its own type semantics which need not look anything like what Java can enforce, inspect, or interpret. Given this, that the patch puts the serialization as the most prominent interface to MapReduce is not entirely surprising.

          It's also powerful functionality. By allowing any user type to be serialized/deserialized per context, the long-term elimination of the key/value distinction doesn't change collect(K,V) to collect(Object) as proposed, but rather collect(Object...): the serializer transforms the record into bytes, and the comparator works on that byte range, determining which bytes are relevant per the serialization contract. Especially for frameworks written on top of MapReduce, less restrictive interfaces here would surely be fertile ground for performance improvements.

          That said: I hate this API for users. Someone writing a MapReduce job is writing a transform of data; how these data are encoded in different contexts is usually irrelevant to their task. Forcing the user to pick a serialization to declare their types to- rather than offering their types to MapReduce- is backwards for the vast majority of cases. Consider the Writable subtype example above: one is tying the correctness of the Mapper to the intermediate serialization declared in the submitter code, whose semantics are inscrutable. That's just odd.

          If one's map is going to emit data without a common type, then doesn't it make sense to declare that instead of leaving the signature as Object? That is, particularly given MAPREDUCE-1411, wouldn't the equivalent of Mapper<Text,Text,Text,AvroRecord> be a more apt signature than Mapper<Text,Text,Text,Object> for an implementation emitting int and String as value types?

          I much prefer the semantics of the global serializer, but wouldn't object to adding an inconspicuous knob in support of context-sensitive serialization. Would a Job::setSerializationFactory(CTXT, SerializationFactory...) method, such that CTXT is an enumerated type of framework-hooks (i.e. DEFAULT, MAP_OUTPUT_KEY, MAP_OUTPUT_VALUE, etc.) be satisfactory? This way, one can instruct the framework to use/prefer a particular serialization in one context without requiring most users to change their jobs. It also permits continued use of largely type-based serialization which- as Tom notes- is a very common case. Writing wrappers can be irritating, but for the MR API, I'd rather make it easier on common cases and users than on advanced uses and framework authors.

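The `Job::setSerializationFactory(CTXT, SerializationFactory...)` knob proposed above can be sketched in miniature. Everything here is hypothetical — none of these names exist in Hadoop, and the real SerializationBase/accept contract is richer — but it shows the intended resolution order: a context-specific list first, falling back to DEFAULT, while keeping the type-driven accept() check:

```java
import java.util.*;

// Hypothetical sketch of context-sensitive serialization lookup; illustrative only.
public class ContextSerializationSketch {
  enum Ctxt { DEFAULT, MAP_OUTPUT_KEY, MAP_OUTPUT_VALUE }

  interface Serialization {
    boolean accept(Class<?> c);
  }

  private final Map<Ctxt, List<Serialization>> registry = new EnumMap<>(Ctxt.class);

  // Analogue of Job::setSerializationFactory(CTXT, SerializationFactory...):
  // register an ordered list of serializations for one framework hook.
  void setSerialization(Ctxt ctxt, Serialization... serializations) {
    registry.put(ctxt, Arrays.asList(serializations));
  }

  // Try the context-specific list first, then fall back to DEFAULT,
  // preserving the usual type-based accept() check at each step.
  Serialization find(Ctxt ctxt, Class<?> c) {
    for (Ctxt probe : new Ctxt[] { ctxt, Ctxt.DEFAULT }) {
      for (Serialization s : registry.getOrDefault(probe, Collections.emptyList())) {
        if (s.accept(c)) {
          return s;
        }
      }
    }
    return null;
  }
}
```

Most jobs would set only DEFAULT and never notice the extra hooks, which is the "inconspicuous knob" property argued for above.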
          Jeff Hammerbacher added a comment -

          Especially for frameworks written on top of MapReduce, less restrictive interfaces here would surely be fertile ground for performance improvements.

          Writing wrappers can be irritating, but for the MR API, I'd rather make it easier on common cases and users than on advanced uses and framework authors.

          Great points, Chris. Yahoo! has stated that a significant majority of their MapReduce jobs are written in Pig, and Facebook says the same of Hive. Among our many customers at Cloudera, it's far more common to target the MapReduce execution engine with a higher level language rather than the Java API. What you propose as the common case, then, appears to be uncommon in practice. Perhaps we should adjust our design criteria to match the usage data reported by the users of the project?

          Thanks,
          Jeff

          Scott Carey added a comment -

          I am neck deep in building stuff on Avro. I've also got a custom Pig reader that reads only my Avro record types as a prototyping stopgap until there is a general solution.

Ted's idea of a middle ground sounds useful. Special-casing Writables is OK as long as they don't have to be used. Making all the new stuff harder to use sounds like a bad idea. Ideally, a schema system means that as a user I never have to write a Writable again. Those are annoying enough.

          For union types, it is a trivial restriction to limit map/reduce users to wrap records around the unions

Wrappers are trivial to deal with in a schema declaration. They are not trivial on the other side. First you have Foo and then Bar, and neither needs a wrapper. But then they might, because sometimes you want to serialize Bar in columnar order and sometimes you don't. Now you have Bar and ColumnarBar. Then you realize you need a union, so you make FooBarUnion. Then in another use case you need BarFooUnion (different order, same equals() and hashCode() – fun with permutations when the union is large).
          Then you have composite types. FooBarStringNullUnion, and FooStringLongUnion.
          Mapping serialization options to classes is not fun. Wrappers are a useful design pattern for many purposes, but not for encapsulating one to many relationships.

          In general, users are moving away from writing directly to the MapReduce API and using various frameworks. Making these frameworks high performance, powerful, and expressive is more important IMO than preventing the low level MapReduce API from getting a bit more complicated.
As for end-users writing MapReduce, the current situation is not all that pretty anyway: 4 generic type arguments that must be perfectly aligned with several type-setting calls on a job configuration to avoid a runtime error? The Map and Reduce classes have compile-time checking on a few types, and that's it.

          Choosing a serialization is a declarative task, not a procedural one. Annotations are what Java has right now for declarative metadata. Unfortunately, very few people are experts at building annotation based frameworks or using tools like ASM to enrich the capabilities. Has there been any proposals to allow Annotations to handle this in a way cleaner than these declarative setter methods?

          I haven't thought through it that far, but here's a quick annotation based idea that can bolt on to the work above or the current API. This is just a point of contrast for this discussion, not a proposal to change – perhaps it sparks some ideas to simplify the user experience while also making the framework more powerful. With the right tools those can go hand in hand.

          WordCount with configurable Input/Output formats via annotations:

          public class WordCount {
          
            // if missing, each of these has a default, or can be set with the traditional setters
            @Input(TextInputFormat.class)
            @Output(AvroOutputFormat.class)
            @InputWritables(key = LongWritable.class, val = Text.class)
            @SchemaConfig(SchemaBasedJobData.class)
            public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Integer> {
               public void map(LongWritable key, Text value, OutputCollector<Text, Integer> output, Reporter reporter) throws IOException {
                . . .
               }
             }
          
            @Input(AvroInputFormat.class)
            @Output(AvroOutputFormat.class)
            @SchemaConfig(SchemaBasedJobData.class)
            public static class Reduce extends MapReduceBase implements Reducer<Text, Integer, Text, Integer> {
              public void reduce(Text key, Iterator<Integer> values, OutputCollector<Text, Integer> output, Reporter reporter) throws IOException {
              . . .
              }
            }
          
            public static void main(String[] args) throws Exception {
              JobConf conf = new JobConf(WordCount.class);
              conf.setJobName("wordcount");
          
              // infer inputs and outputs from annotations on the Map, Reduce, and Combiner setters;
              // throw an error if these are not compatible.
              conf.setMapperClass(Map.class);
              conf.setCombinerClass(Reduce.class);
              conf.setReducerClass(Reduce.class);
          
              // the "old way" of setting individual input and output classes still works, but it is
              // not necessary when annotated, and incompatible with schema-based systems.
              FileInputFormat.setInputPaths(conf, new Path(args[0]));
              FileOutputFormat.setOutputPath(conf, new Path(args[1]));
              JobClient.runJob(conf);
            }
          }
          

          There are some things missing above.
          For frameworks, configuration by method call isn't as big of a deal, but for hand-written classes it is a good thing to keep the generic type declaration and key/value class/type/schema declaration in the same place – JobConf won't tell you at compile time that you have screwed it up and misaligned types in your Map. And generic types aren't available at runtime, increasing the number of times the same information must be set.

          Annotations can only set primitive types, Class, Enum, and String, as well as arrays of these, so configuring a Schema becomes more difficult.
          Schemas can only be stored in annotations if they are in String form – which may be too verbose to store in an annotation. I have one Schema that is over 4k in its Avro json form.

          Anyhow, its just an idea for contrast. The debate is about how to deal with declarative configuration in Java, so it seems relevant to bring up Annotations.

          Arun C Murthy added a comment -

          Great points, Chris. Yahoo! has stated that a significant majority of their MapReduce jobs are written in Pig, and Facebook says the same of Hive. Among our many customers at Cloudera, it's far more common to target the MapReduce execution engine with a higher level language rather than the Java API. What you propose as the common case, then, appears to be uncommon in practice.

Uh, no. That is precisely the point - making it slightly harder on framework authors is better than making it harder for the average users of the Map-Reduce API. Only the framework authors pay the cost...


          Along similar sentiments I'd like to re-state:

          1. We should use the current global serializer factory for all contexts of a job.
          4. Only the default comparator should come from the serializer. The user has to be able to override it in the framework (not change the serializer factory).

I'm not convinced we need to allow multiple serialization mechanisms for the same job. I'm also much less convinced that we need to allow a serializer per map-in-key, map-in-value, map-out-key, map-out-value, reduce-out-key, reduce-out-value, etc.

I can see that we might have some phase of transition where people move from Writables to Avro as the preferred serialization mechanism. For example, they might have SequenceFiles with Writables as input records and produce SequenceFiles with Avro output records. However, even with a single serializer factory for all contexts of a job, it is trivial to write wrappers, provide bridges in libraries or other frameworks, etc. to cross the chasm.


At a later point, iff we get to a world where we need to reconcile multiple serialization mechanisms for the same job on a regular basis, e.g. a world where we have a lot of data in Writables and Avro and Thrift etc., I'd like to propose a slightly less involved version of Chris's proposal.

The simplification is that we view 4 separate 'record contexts':

          1. INPUT (map-in-key, map-in-value)
          2. INTERMEDIATE (map-out-key, map-out-value)
          3. OUTPUT (reduce-out-key, reduce-out-value)
          4. JOB_DEFINITION (currently only InputSplit, possibly more in future via MAPREDUCE-1183)

          Then we have Chris's proposal:

          enum Context {
            INPUT,
            INTERMEDIATE,
            OUTPUT,
            JOB_DEFINITION
          }
          
          Job::setSerializationFactory(Context context, SerializationFactory...)
          

Thus we allow serializers to be specified for the 'records' flowing through the Map-Reduce framework... allowing map-in-key and map-in-value to have different serialization mechanisms seems like overkill. Do we have use-cases for such requirements?

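Point 4 above ("only the default comparator should come from the serializer") is the crux of this issue's title. What such a serializer-provided default does can be sketched without any Hadoop dependency: order two keys directly from their serialized byte ranges, never deserializing into objects. The 8-byte big-endian long format below is an assumption for illustration — it happens to match what LongWritable writes — and the class mirrors LongWritable.Comparator in spirit only:

```java
// Sketch of a "raw" comparator: compares serialized keys in place.
// Assumes each key is an 8-byte big-endian two's-complement long.
public class RawLongComparator {
  public static int compare(byte[] b1, int s1, byte[] b2, int s2) {
    // Decode just enough to compare numerically; no object allocation.
    return Long.compare(readLong(b1, s1), readLong(b2, s2));
  }

  static long readLong(byte[] b, int off) {
    long v = 0;
    for (int i = 0; i < 8; i++) {
      v = (v << 8) | (b[off + i] & 0xffL);
    }
    return v;
  }
}
```

Letting the serialization supply this default keeps sort fast for complex types (an Avro serialization can compare per its schema's byte layout), while the framework override preserves user control over sort order.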
          Aaron Kimball added a comment -

          Why are proposals now focusing on allowing users to specify different serialization factories?

          If allowing users to specify the use of a particular SerializationBase via a flexible metadata map is considered too obscure, then I feel like the notion of having separate SerializationFactory instances seems to be an unnecessary level of abstraction. The current SerializationFactory implemented in hadoop-common allows access to all SerializationBase instances. If the focus is on user-accessibility of API, asking users to define a SerializationFactory which will only produce a single SerializationBase (e.g., WritableSerialization or AvroGenericSerialization) requires needless one-off code, and clutters the class hierarchy.

          Instead, I might understand adding an API such as Job.setSerializationBase(ctxt, SerializationBase) where users directly set the SerializationBase instance to use in a given context, and disregard the SerializationFactory entirely.

          For what it's worth, the patches that I and Tom have produced all make use of the default SerializationFactory in Hadoop; this API then uses the metadata map as defined in HADOOP-6165 to acquire the user's desired SerializationBase instance as appropriate for each of map output key, value, etc.

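The metadata-map lookup Aaron describes can be sketched as follows. The names are illustrative, not the real HADOOP-6165/HADOOP-6323 API: the point is that a single factory polls each registered serialization with a metadata map (e.g. a serialization class name, or an Avro schema string) instead of dispatching on a Java class name alone:

```java
import java.util.*;

// Hypothetical sketch of a metadata-driven serialization factory.
public class MetadataFactorySketch {
  interface SerializationBase {
    boolean accept(Map<String, String> metadata);
  }

  private final List<SerializationBase> serializations = new ArrayList<>();

  void register(SerializationBase s) {
    serializations.add(s);
  }

  // The first registered serialization whose accept() matches the metadata wins,
  // so registration order doubles as precedence.
  SerializationBase getSerialization(Map<String, String> metadata) {
    for (SerializationBase s : serializations) {
      if (s.accept(metadata)) {
        return s;
      }
    }
    throw new IllegalArgumentException("no serialization accepts " + metadata);
  }
}
```

Under this scheme the framework builds a separate metadata map per context (map output key, map output value, etc.) and reuses one factory, which is why no per-context factory class is needed.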
          Chris Douglas added a comment -

          Why are proposals now focusing on allowing users to specify different serialization factories?

          Yes, you're right; I mixed up the naming. s/SerializationFactory/SerializationBase/g. The setter formal with a vararg is still attractive, setSerialization(ctxt, SerializationBase...), so the semantics are the same for configuring the SerializationFactory whether there are one or many. Setting it with one is equivalent to setting the base w/o a SerializationFactory, but it keeps the accept check. The purpose is to configure a context-specific SerializationFactory; I'm liking Arun's suggestion of limiting these hooks to record contexts, rather than all the user types.
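
          For illustration, that vararg setter shape could be sketched as follows. Every name here is invented for the sketch and is not from an actual patch.

```java
import java.util.EnumMap;
import java.util.Map;

// A per-context factory configured through one vararg setter; a single
// argument is equivalent to setting the base directly, but the accept()
// check is kept either way.
enum SerializationContext { MAP_OUTPUT_KEY, MAP_OUTPUT_VALUE }

interface Serialization {
    boolean accept(Class<?> c);
}

class JobConfSketch {
    private final Map<SerializationContext, Serialization[]> factories =
        new EnumMap<>(SerializationContext.class);

    // Same call whether the context gets one serialization or many.
    void setSerialization(SerializationContext ctxt, Serialization... bases) {
        factories.put(ctxt, bases);
    }

    // The context-specific factory: first serialization to accept wins.
    Serialization getSerialization(SerializationContext ctxt, Class<?> c) {
        for (Serialization s :
                 factories.getOrDefault(ctxt, new Serialization[0])) {
            if (s.accept(c)) {
                return s;
            }
        }
        return null;
    }
}
```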

          [...] it's far more common to target the MapReduce execution engine with a higher level language rather than the Java API. What you propose as the common case, then, appears to be uncommon in practice.

          As Arun points out, framework authors are uncommon, even if their users are not. The possible performance improvements from open hooks in the framework are compelling (if most are theoretical at the moment), but the user-facing API for MapReduce shouldn't be the Job's serialization. A more abstract collector may be an interesting extension/interface when the use cases are better understood.


          @Tom
          The much-simplified patch is a considerable improvement. Losing the *JobData classes effected a cleaner integration. Thanks for this.

          Philip Zeyliger added a comment -

          @Scott: the annotations for Input/OutputFormat seem to be misplaced. It seems desirable to be able to write a single Map function that does wordcount on Strings, regardless of whether those strings are stored in newline-delimited text, sequence files, avro data files, or whatever.

          @Chris: "1) throwing away all Java type hierarchies". Only sometimes, no? This is only in the case where you explicitly want to do unions (and Java's union support is either Object, type hierarchies, or wrappers). In the typical case, your map functions on SomeSpecificRecordType, outputs SomeSpecificMapOutputKey/ValueType, and so forth. You still get type safety in many of the recommended use cases.

          Allen Wittenauer added a comment -

          Perhaps a dumb question and I recognize I'm totally out of my scope of knowledge (never stopped me before :) ), but... what happens if my serialization code is not written in Java and I have to use JNI to get to it?

          Chris Douglas added a comment -

          "1) throwing away all Java type hierarchies". Only sometimes, no? This is only in the case where you explicitly want to do unions (and Java's union support is either Object, type hierarchies, or wrappers). In the typical case, your map functions on SomeSpecificRecordType, outputs SomeSpecificMapOutputKey/ValueType, and so forth. You still get type safety in many of the recommended use cases.

          Sure, but this doesn't cite the negative side. The patch changes the collection from a 1:1 class match- more restricted than the Java types- to a model unrelated to the declared types. So if a job accepts Short, Integer, and Long it may declare its type as Numeric but- again, depending on the serialization details- may reject (or fail to reject) Double and Float. So instead of being forced to declare a union type, whether this is reasonable is decided between the serialization and the user. This is a contrived example, but one can easily imagine other scenarios where an accepted subset of the supertypes are not only unenforced by the framework, but unenforceable. The even more interesting case is when one has a type hierarchy the serializer cares about that isn't expressed in Java types (e.g., valid keys contain a field named "dingo" whose supertype is Yak). The proposed model doesn't make it impossible to write type-checked code, but it does make it easier to write code that isn't (which, again: great for frameworks, arguably not as good for users). As I said earlier, it's a powerful, but dramatic shift from the current model that should be carefully considered.
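
          A tiny, self-contained illustration of that point (names invented here): the declared Java type admits every Number, so the narrower accepted subset is visible only to the serialization itself, and only at run time.

```java
import java.util.Set;

// The declared type says "any Number", but only some subtypes are really
// supported. The compiler cannot see the difference, so rejection happens
// (or silently fails to happen) only when accept() runs.
class NumericSerialization {
    private static final Set<Class<?>> SUPPORTED =
        Set.of(Short.class, Integer.class, Long.class);

    boolean accept(Number n) {
        return SUPPORTED.contains(n.getClass());
    }
}
```

          Passing a Double compiles without complaint; only the run-time accept() call rejects it, which is exactly the kind of contract the framework cannot enforce.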

          what happens if my serialization code is not written in Java and I have to use JNI to get to it?

          I don't think any model yet proposed would make this harder than it is today.

          Chris Douglas added a comment -

          s/supertype is Yak/subtype is Yak/

          Scott Carey added a comment -

          @Scott: the annotations for Input/OutputFormat seem to be misplaced. It seems desirable to be able to write a single Map function that does wordcount on Strings, regardless of whether those strings are stored in newline-delimited text, sequence files, avro data files, or whatever.

          Philip, yes, they are not in the right place. I just wanted to bring into the conversation that 'SomeObject.setSomeBinding()' is not the only way to do these sort of things. Annotations, unlike setter methods, can be moved around and adapted to work in various ways without breaking APIs. For example, the Input/OutputFormat annotation could go on either a Map class, OR some other more specific annotation site, and with defaults and priority (set on configuration > annotated on configuration > annotated on map > default) determining which applies.
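
          For concreteness, a minimal, hypothetical example of such an annotation site follows; neither the annotation nor the binding mechanism exists in Hadoop, this only shows the shape a declarative binding could take.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical: declaratively bind an input format to a map class. A
// framework could read this at job-submission time, with procedural
// configuration taking priority when both are present.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface InputFormatBinding {
    String value();
}

@InputFormatBinding("TextInputFormat")
class WordCountMap { }

public class AnnotationSiteDemo {
    static String boundInputFormat(Class<?> mapClass) {
        InputFormatBinding b = mapClass.getAnnotation(InputFormatBinding.class);
        return b == null ? "default" : b.value();
    }

    public static void main(String[] args) {
        // prints TextInputFormat
        System.out.println(boundInputFormat(WordCountMap.class));
    }
}
```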

          After thinking about it a bit more, and doing some research into how other APIs do some tricky things with Annotations, there are a few things to consider.

          • It is possible in some situations to infer the generic types of a class at runtime by constructing an instance of an object with the same type arguments. Example: http://wiki.fasterxml.com/JacksonInFiveMinutes#Data_Binding_with_Generics
          • Annotations on class A can be applied to class B ("Mix-In Annotations"): http://wiki.fasterxml.com/JacksonMixInAnnotations
          • Post-compile time checks via an annotation processor can validate code before run time in cases where the current M/R framework only breaks at run time.

          What I think is most important to this discussion is that some layers of configuration complexity can be hidden from users, and some of it deferred to the future.
          The 'site' of the configuration can be moved around with Annotations, opening up ways to simplify the steps required to do declarative configuration.

          With this in mind, some additional complexity to the procedural configuration methods is more acceptable if there are good defaults and a later (backwards compatible) API addition simplifies things. Likewise, some elements of complexity can be skipped for now if it can be seen that those could be available through a configuration extension later. Perhaps the procedural API would never allow configuring a key and value to use different serializers to avoid API complexity, but an annotation extension in the future allows that.
          Tom White added a comment -

          Yesterday I spoke to Owen offline about his design for this JIRA. Briefly, it works as follows. (I apologize in advance for any errors due to misunderstandings on my part! Owen, please correct anything I've got wrong.)

          The Serialization classes change as follows. SerializationFactory becomes RootSerializationFactory, while Serialization becomes SerializationFactory, and Serializer/Deserializer pairs are combined into a single Serialization per pair. Serialization has the ability to write itself to and read itself from a stream, using its own serialization format. There is a subclass of Serialization called TypedSerialization, which is subclassed by WritableSerialization. AvroGenericSerialization would not be a TypedSerialization. RootSerializationFactory can map types into Serializations in the usual manner (via io.serializations). Non-typed serializations, such as AvroGenericSerialization, would be set explicitly as described below.

          Instead of the metadata map, each job context (map input key, map output key, etc) has a serialized serialization that is deserialized for the Task (by the framework) and used to carry out the serialization for that context. The serialized serializations are not stored in the job configuration, but rather are stored in a new file in the job directory which has the format (context, serialization class, serialization bytes)+.

          In terms of configuration for the user, the API looks like the one described by Arun and Chris. That is, Job#setSerialization(MapReduceSerializationContext, Serialization).
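
          A minimal sketch of that "serialization that writes itself" shape, with method signatures guessed for illustration rather than taken from an actual patch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch only: a serialization can persist and restore its own
// configuration, so the framework can ship it to tasks opaquely.
abstract class SelfDescribingSerialization {
    abstract void write(DataOutput out) throws IOException;
    abstract void readFields(DataInput in) throws IOException;
}

// A typed serialization is configured by a Java class name...
class TypedSerializationSketch extends SelfDescribingSerialization {
    String typeName;
    void write(DataOutput out) throws IOException { out.writeUTF(typeName); }
    void readFields(DataInput in) throws IOException { typeName = in.readUTF(); }
}

// ...while an untyped one (e.g. Avro generic) carries a schema instead.
class AvroGenericSketch extends SelfDescribingSerialization {
    String schemaJson;
    void write(DataOutput out) throws IOException { out.writeUTF(schemaJson); }
    void readFields(DataInput in) throws IOException { schemaJson = in.readUTF(); }
}

public class SerializationRoundTrip {
    static byte[] toBytes(SelfDescribingSerialization s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        s.write(new DataOutputStream(buf));
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        AvroGenericSketch out = new AvroGenericSketch();
        out.schemaJson = "{\"type\":\"string\"}";
        AvroGenericSketch in = new AvroGenericSketch();
        in.readFields(new DataInputStream(
            new ByteArrayInputStream(toBytes(out))));
        // prints the round-tripped schema
        System.out.println(in.schemaJson);
    }
}
```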

          Some comments:

          • The changes to the serialization API are not backwards compatible, so a new package of serializer types would need creating.
          • I'm not sure why we need to serialize serializations. The current patch avoids the need for this by using a simple string mechanism for configuration. Having an opaque binary format also makes it difficult to retrieve and use the serialization from other languages (e.g. C++ or other Pipes languages). The current patch is language-neutral in this regard.
          • Adding a side file for the context-serializer mapping complicates the implementation. It's not clear what container file would be used for the side file (Avro container, custom). I understand that putting framework configuration in the job configuration may not be desirable, but it has been done in the past so I don't know why it is being ruled out here. I would rather have a separate effort (and discussion) to create a "private" job configuration (not accessible by user code) for such configuration.
          • The user API is no shorter than the one in the current patch. Compare:
          Schema keySchema = ...
          AvroGenericSerialization serialization = new AvroGenericSerialization();
          serialization.setSchema(keySchema);
          job.set(MAP_OUTPUT_KEY, serialization);
          

          with

          Schema keySchema = ...
          AvroGenericData.setMapOutputKeySchema(job, keySchema);
          

          I was hoping it might help reach consensus if we could incorporate some of Owen's ideas with the existing patch. However it is not clear to me how to do this. In the meantime, I would appreciate it if someone would review the latest patch. Thanks!

          Doug Cutting added a comment -

          Arun> allowing map-in-key and map-in-value to have different serialization mechanisms seems like an over-kill. Do we have use-cases for such requirements?

          I don't think this is a requirement. But neither am I convinced, long-term, that separating keys and values is a requirement. But it is a legacy. We already permit different class-specified serializations for keys and values. This issue proposes to consistently alter how key and value serializations are specified, to not be restricted to class-only, but rather to be consistent with the serialization API. Doing this without permitting keys and values to have different serializations would be harder than simply allowing it, as it would require the introduction of a new concept, the serialization context, while the current patch simply is the logical combination of the existing Mapreduce API and the existing serialization API.

          Perhaps one of the causes of this present tension is that Job.java is striving to be two things at once: a high-level API for folks directly building Java mapreduce applications and a low-level API for folks building higher-level abstractions based on mapreduce. Perhaps we should instead add a new low-level API. This might look something like:

          final class Split {
            String[] locations;
            ByteBuffer data;
          }
          interface MapEmitter { // implemented by kernel
            void emit(ByteBuffer datum, int partition);
          }
          interface CombineEmitter { // implemented by kernel
            void emit(ByteBuffer datum);
          }
          interface Job {  // implemented by user code
            Split[] getSplits();
            void map(ByteBuffer splitData, MapEmitter out);
            int compare(ByteBuffer x, ByteBuffer y);
            boolean hasCombiner();
            void combine(Iterator<Iterator<ByteBuffer>> values, CombineEmitter out);
            void reduce(Iterator<Iterator<ByteBuffer>> values, String attemptName);
            void abandon(String[] attemptNames);
            void commit(String[] attemptNames);
          }
          

          This is of course over-simplified. We'd need to pass a Context through to job methods, so that they can report progress, etc. We'd also need to figure out how the Job implementation itself is specified, serialized, reconstituted. Etc.

          So perhaps a way around the current impasse is to devise a new, lower-level, mapreduce API that will then support multiple higher-level APIs, including the current API. Since there is resistance to evolving the existing API, it might remain untouched as a legacy API. Folks who, e.g., wish to use Avro would use a new higher-level MapReduce API that we'd develop on top of this shared low-level runtime.

          Jay Booth added a comment -

          +1 for the general concept of a lower-level API, great idea

          Any thoughts regarding explicitly setting a Mapper per Split? Joins between different formats are a pretty primary use case, and it's always awkward using MultipleInputs to shoehorn the different classes into a single conf. As I understand it now, with MultipleInputs, the MapTask wakes up, looks at its input split, compares that to a magic configuration field mapping splits to mapper classes, and instantiates that mapper class. Which leads to trouble if you need to mix it with, say, CombineFileInputFormat or anything else that relies on configuration, since the different static setConfigValue(conf) methods set a single value assuming a single mapper class.

          If we set a specific mapper class per split, and then a specific config per mapper class, I think it would be a lot more flexible to shoehorn different types of functionality if you're a framework author – if you're just a user, maybe you don't want to deal with the extra environment setup for simple jobs but if this is a lower level API, maybe it could be useful? It would certainly be cleaner if a single-input job is just a N=1 multiple inputs job, rather than the current situation where a multiple inputs job is a configuration-level hack on top of the single-input framework.

          Aaron Kimball added a comment -

          Jay,

          Right now Mappers are configured directly by the user, and the InputSplits are allocated by the InputFormat. A system that combines the two is intriguing, but would represent a pretty big shift in API. I think this is something you might want to file as a separate JIRA; this thread has enough ideas floating around on it already, and something like that would require its own separate discussion.

          - Aaron
          Arun C Murthy added a comment -

          I agree with Aaron, let us please keep the discussions here focussed on one issue.

          Jay Booth added a comment -

          Yeah, I'm definitely not a stakeholder in this issue or in low-level mapreduce in general, but was just saying if we're throwing around the idea of an entirely new underlying API anyways, maybe now would be the time to consider that – it'd be pretty simple to wrap the existing API around the concept if it were baked into the lower-level API, but it'd be harder to add the concept to the lower-level API once it's already solidified.

          Either way, I don't have the mojo to make a big push for it, was just throwing the idea out there if it's decided to go forward with a different underlying API.

          Tom White added a comment -

          I've created MAPREDUCE-1452 to discuss details about a new low-level API.

          Doug Cutting added a comment -

          Since Arun closed MAPREDUCE-1452 as a duplicate of MAPREDUCE-326, I've moved discussion of a new low-level API to MAPREDUCE-326.

          Owen O'Malley added a comment -

          Here is the 1.5k subset of Tom's patch that addresses the issue of this jira. I propose we commit some variant of it and move the discussion of generalizing serialization over to a new jira MAPREDUCE-1462.

          Doug Cutting added a comment -

          Owen, your patch uses a deprecated means to create a serialization that does not work for all serializations. As such, it is not a complete patch for this issue. The intended way to use the serialization API is for the application to use Map<String,String> metadata. This was established by HADOOP-6165 and was the rationale for HADOOP-6420. The method SerializationBase.getMetadataFromClass should not be called by the framework, but is a utility for class-based serialization implementations.

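[Editorial note: a minimal sketch, not taken from any of the attached patches, illustrating the distinction Doug draws above between class-based and Map&lt;String,String&gt;-based serialization lookup. The class, field, and key names here are assumptions for illustration, not real Hadoop identifiers.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two lookup styles under discussion.
// Class-based lookup (legacy): the serialization is chosen from the key
// class alone. Metadata-based lookup (the HADOOP-6165/HADOOP-6420 style):
// the framework passes an arbitrary Map<String,String>, so serializations
// that cannot be determined from a Java class (e.g. one driven by an Avro
// schema) can still be selected.
public class SerializationLookupSketch {
  // Simplified stand-ins for metadata keys; names are invented.
  static final String CLASS_KEY = "Serialized-Class";
  static final String SCHEMA_KEY = "Avro-Schema";

  static String chooseSerialization(Map<String, String> metadata) {
    // A schema entry wins: no Java class is needed at all.
    if (metadata.containsKey(SCHEMA_KEY)) {
      return "avro";
    }
    // Class-based fallback, as with Writable keys.
    if (metadata.containsKey(CLASS_KEY)) {
      return "writable";
    }
    throw new IllegalArgumentException("no usable serialization metadata");
  }

  public static void main(String[] args) {
    Map<String, String> classBased = new HashMap<>();
    classBased.put(CLASS_KEY, "org.apache.hadoop.io.Text");

    Map<String, String> schemaBased = new HashMap<>();
    schemaBased.put(SCHEMA_KEY, "{\"type\":\"string\"}");

    System.out.println(chooseSerialization(classBased));
    System.out.println(chooseSerialization(schemaBased));
  }
}
```

The point of the map-based form is that a getSerializer(metadata)-style call can carry whatever the serialization needs (a schema, a class name, or something else), whereas a getSerializer(keyClass)-style call can only ever express class-determinable serializations.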
          Owen O'Malley added a comment -

          I was a little too aggressive in my trimming. Here is a better patch.

Doug, neither this nor my previous patch uses any deprecated APIs.

          Doug Cutting added a comment -

> Doug, neither this nor my previous patch uses any deprecated APIs.

          They use a method (SerializationBase.getMetadataFromClass) that's not intended as a primary, serialization-independent means for the framework to create a serialization. That's a utility intended to be shared by serialization systems that use a class to create their metadata. The only other calls to it are from deprecated, back-compatibility methods and contexts where serialization is class-based. It should have been deprecated. New framework code should not use this method as a primary means to create a serialization, as it does not support all serializations.

          Owen O'Malley added a comment -

          Tom's patch adds two calls to that exact method. Neither of them is in a deprecated method.

          Doug Cutting added a comment -

          > Tom's patch adds two calls to that exact method.

One call is in a test method called getWritableSerializationMap, clearly serialization-specific code. The other is in reading input splits, which has not been converted to use the new serialization API, and still uses the split's class alone to determine serialization.

A primary goal of this issue is to update the shuffle to use the new Map<String,String> means of naming serializations, so that any serialization may be used, not just those determinable from the class of the instance to be serialized. Your patch uses the map output key class alone to determine the serialization, and hence does not meet this goal.

          Owen O'Malley added a comment -

My primary point is that this issue got badly out of scope. I was trying to find the subset of the patch that addresses "shuffle should use serialization to get comparator", and to move the much wider-ranging discussion of changing the MapReduce type and serialization system to the follow-on jira MAPREDUCE-1462.

          Doug Cutting added a comment -

          > I was trying to find the subset of the patch that would address the issue of "shuffle should use serialization to get comparator"

          The subject of the issue was meant to be interpreted as "any serialization" not just "some serializations". If the key class were sufficient in all cases to determine the serialization then this issue would indeed be a lot simpler, but it's not.

          Doug Cutting added a comment -

          If we elect to abandon MAPREDUCE-815 in favor of AVRO-493, and since all of the underpinnings of this issue have been reverted, perhaps we should now close this as "won't fix"?


  People

  • Assignee:
    Aaron Kimball
  • Reporter:
    Doug Cutting
  • Votes:
    0
  • Watchers:
    43

  Dates

  • Created:
  • Updated: