Hadoop Common / HADOOP-6685

Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Incompatible change

      Description

      Currently, the generic serialization framework uses Map<String,String> for the serialization specific configuration. Since this data is really internal to the specific serialization, I think we should change it to be an opaque binary blob. This will simplify the interface for defining specific serializations for different contexts (MAPREDUCE-1462). It will also move us toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).
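
       For illustration, here is a minimal, hypothetical Java sketch of the shape of the proposed change; the interface and method names below are illustrative only and are not taken from any patch on this issue.

         // Hypothetical sketch only -- names are illustrative, not from the patch.
         import java.util.Map;

         // Current shape: serialization-specific settings travel as generic
         // string properties, typically dumped into the job Configuration.
         interface OldStyleSerialization<T> {
           Map<String, String> getMetadata();
           void setMetadata(Map<String, String> metadata);
         }

         // Proposed shape: the framework carries an opaque blob of bytes that
         // only the specific serialization knows how to produce and interpret.
         interface NewStyleSerialization<T> {
           byte[] serializeSelf();
           void deserializeSelf(byte[] metadata);
         }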

      1. serial.patch
        362 kB
        Owen O'Malley
      2. serial4.patch
        369 kB
        Owen O'Malley
      3. serial6.patch
        370 kB
        Owen O'Malley
      4. serial7.patch
        371 kB
        Owen O'Malley
      5. serial9.patch
        376 kB
        Owen O'Malley
      6. SerializationAtSummit.pdf
        297 kB
        Owen O'Malley


          Activity

          Doug Cutting added a comment -

          Different serializations currently use common keys to share parts of their configuration data. For example, many name a class that implements the serialization. Several name a class that represents the instances. The code that accesses these properties is shared, either through a common base class or a utility class.

          Binary blobs can also be considerably more fragile than a Map<String,String>. If we wish the versions of clients and servers to be able to vary, then blobs stored in configurations must be version-tolerant, i.e., written and read with a fairly sophisticated serialization system.
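
           For illustration only (the key names below are made up, not the ones Hadoop actually uses), the kind of shared, inspectable configuration being described looks roughly like this:

             import java.util.HashMap;
             import java.util.Map;

             // Illustrative only: a Map<String,String> exposes common, human-readable
             // keys that shared code (a base class or utility) can read regardless of
             // which serialization produced them -- something an opaque blob cannot.
             class SerializationMetadataExample {
               static Map<String, String> exampleMetadata() {
                 Map<String, String> metadata = new HashMap<String, String>();
                 metadata.put("serialization.class", "org.example.MySerialization");
                 metadata.put("serialization.represented.type", "org.example.MyRecord");
                 return metadata;
               }
             }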

          Owen O'Malley added a comment -

          Ok, here is a preliminary patch.

          It includes support for Avro, Thrift, ProtocolBuffers, Writables, Java serialization, and an adaptor for the old style serializations. One of the features of the Avro serialization is that the kind ("reflection", "specific", "generic") is a parameter that can be changed between writing and reading the file.

          All of the types can be put into SequenceFiles, MapFiles, BloomFilterMapFiles, SetFile, and ArrayFile.

          In a separate issue, I'll upload the OFile wrapper that goes on top of TFile to allow all of the types into TFiles as well.

           It creates a new package, o.a.h.io.serial, that defines the new interfaces. The new serializations save their metadata in a framework-specific format. To make the format extensible, I've used protocol buffers to encode this information. This will allow us to make arbitrary compatible extensions later.
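
           As a rough illustration of that approach (the .proto message and its generated SerializationMetadata class below are hypothetical, not the ones in the patch), a serialization could encode and decode its metadata as an opaque blob along these lines:

             // Hypothetical sketch. Suppose a .proto definition such as:
             //
             //   message SerializationMetadata {
             //     optional string typeName = 1;
             //     optional bytes  schema   = 2;
             //     // new optional fields can be added later without breaking old readers
             //   }
             //
             // protoc would generate a SerializationMetadata class with the usual
             // builder/parser API, which the serialization could use like this:
             import java.io.IOException;

             class MetadataCodecSketch {
               static byte[] encode(String typeName) {
                 return SerializationMetadata.newBuilder()   // hypothetical generated class
                     .setTypeName(typeName)
                     .build()
                     .toByteArray();                         // opaque bytes for the framework
               }

               static String decode(byte[] blob) throws IOException {
                 return SerializationMetadata.parseFrom(blob).getTypeName();
               }
             }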

          Doug Cutting added a comment -

          > It includes support for Avro, Thrift, ProtocolBuffers, Writables, Java serialization, and an adaptor for the old style serializations.
          > All of the types can be put into SequenceFiles, MapFiles, BloomFilterMapFiles, SetFile, and ArrayFile.

           Could you please explain the motivation for extending these file formats to support all of these serialization systems? The patch changes the APIs for these classes, deprecating methods and adding new methods to support new serializations. We know from experience that changing APIs has a cost, so we ought to justify that cost.

           To my thinking, a priority for the project is to support file formats that can be processed by other programming languages. Avro, Thrift, and ProtocolBuffers are implemented in other languages, but SequenceFile, MapFile, BloomFilterMapFile, SetFile, ArrayFile, and TFile are not. Unless we intend to implement these formats in a variety of other programming languages, I don't see a big advantage in supporting so many different serialization systems from Java only. It doesn't greatly increase the expressive power available to Java developers, and the added variety introduces more potential support issues.

           It would be useful if the shuffle could process things besides Writable (MAPREDUCE-1126), and it would be useful to have InputFormats and OutputFormats for language-independent file formats like Avro's (MAPREDUCE-815). Much of this patch seems like it could help implement these, but parts of it (e.g., the metadata serialization, enhancements to SequenceFile, etc.) don't seem relevant to these goals. I don't see supporting multiple Java serialization APIs as a goal in and of itself.

          Luke Lu added a comment -

           It'd be nice to have some documentation for it. I don't see a package.html or package-info.java (preferred) for the top-level package (only a package.html for lib.avro is included) providing an overview of the system and how it could be used (preferably via some examples).

           @Doug, being able to process different file formats in other programming languages opens up many integration possibilities and even performance improvements if a faster language is used.

          Chris Douglas added a comment -

           > Much of this patch seems like it could help implement these, but parts of it (e.g., the metadata serialization, enhancements to SequenceFile, etc.) don't seem relevant to these goals. I don't see supporting multiple Java serialization APIs as a goal in and of itself.

          If one's records don't implement the Writable interface, then there's no reasonable binary container in Hadoop. Adding these capabilities to SequenceFile and TFile is an improvement. It's not about cross-language capability.

           > It would be useful if the shuffle could process things besides Writable (MAPREDUCE-1126) and it would be useful to have InputFormats and OutputFormats for language-independent file formats like Avro's (MAPREDUCE-815).

          Importing past issues will not move this forward. You recognize that this makes progress toward pushing other serializations through the data pipeline. That is obviously the next step.

          Hong Tang added a comment -

           > Could you please explain the motivation for extending these file formats to support all of these serialization systems?

           Being able to support so many different serialization frameworks is in itself a good exercise to ensure that the API is carefully designed to be future-proof.

          Tom White added a comment -

          > If one's records don't implement the Writable interface, then there's no reasonable binary container in Hadoop.

          SequenceFile supports non-Writable types already. The limitation today is that there must be a one-to-one mapping between Java class and serialized data type. I think that can be satisfied by both Thrift and Protocol Buffers. For Avro, I don't think we want to support it in SequenceFile, as we should instead encourage use of Avro Data File, which is like SequenceFile but interoperable with other languages.
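
           For reference, the existing mechanism looks roughly like the sketch below; it assumes the stock JavaSerialization plugin is registered via io.serializations, and the output path is arbitrary.

             import org.apache.hadoop.conf.Configuration;
             import org.apache.hadoop.fs.FileSystem;
             import org.apache.hadoop.fs.Path;
             import org.apache.hadoop.io.SequenceFile;

             public class NonWritableSeqFileSketch {
               public static void main(String[] args) throws Exception {
                 Configuration conf = new Configuration();
                 // Register a non-Writable serialization; SequenceFile picks the
                 // serializer for the declared key/value classes from this list.
                 conf.setStrings("io.serializations",
                     "org.apache.hadoop.io.serializer.JavaSerialization",
                     "org.apache.hadoop.io.serializer.WritableSerialization");
                 FileSystem fs = FileSystem.get(conf);
                 SequenceFile.Writer writer = SequenceFile.createWriter(
                     fs, conf, new Path("/tmp/pairs.seq"), String.class, Long.class);
                 try {
                   writer.append("answer", 42L);  // plain Serializable Java types, no Writables
                 } finally {
                   writer.close();
                 }
               }
             }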

          A process question: given how we failed to gain consensus last time, what could we do differently this time round? A design document to motivate the use cases? Any other suggestions?

          Chris Douglas added a comment -

           > SequenceFile supports non-Writable types already. The limitation today is that there must be a one-to-one mapping between Java class and serialized data type.

          Of course; I misspoke.

           > I think that can be satisfied by both Thrift and Protocol Buffers. For Avro, I don't think we want to support it in SequenceFile, as we should instead encourage use of Avro Data File

          Let's leave it up to the user. Leaving Avro out doesn't make sense, either; I'm sure there are more efficient formats for protocol buffers, too.

           > A process question: given how we failed to gain consensus last time, what could we do differently this time round?

          If the JIRA stays simple and on-topic, I think consensus is achievable. If the argument strays into the order of issues, philosophical musings on the future of MapReduce, and rehashing past debates, then failure is also achievable. If participants behave as if its success were more important than "victory," I think the purpose is clear enough to anyone hoping to contribute work. What questions do you want answered in a design document?

          Owen O'Malley added a comment -

           Here are the presentation slides from June giving an overview of the serialization changes.

          Doug Cutting added a comment -

          Owen, thanks for the slides. I don't see a direct relation between this issue and the issue of simplifying the implementation of efficient map-side joins (MAPREDUCE-1183, more or less). Am I missing the connection, or is this a distinct issue?

           File formats are forever. More variations add significant, long-term compatibility burdens to the project. We badly need to add support for a higher-level object serialization system than Writable. But I'm not convinced it's wise to add such support to the existing Java-only container file formats. So I'm all for a more generic serialization API that can be used by MapReduce applications. I don't, however, see that it follows that we should provide implementations of file formats with a large number of different serialization systems, as that invites multiplicative long-term support issues. I'd prefer that we instead direct users towards a single preferred high-level serialization system and a single preferred container. Historically that's been Writable and SequenceFile. We now need to migrate from these to a more expressive, language-independent serialization system and container file. Our APIs should of course be general enough that it's possible to incorporate different serialization systems and different file formats, but we needn't provide implementations of all combinations of these; we should rather direct folks towards a primary implementation.

          Google benefits tremendously by having a single standard serialization system and container file format. The Dremel paper (http://sergey.melnix.com/pub/melnik_VLDB10.pdf) argues that this is an essential enabler of their wide variety of interoperable systems. The further we depart from this the harder we make it to build systems like Dremel that multiply the utility of stored data.

          Changing serialization systems or file formats is a major imposition for many applications. They cannot afford to do it frequently. We should provide a clear path forward from Writable+SequenceFile to a new system that's easier to use, less fragile, and language-independent to better facilitate a rich ecosystem of tools.

          Owen O'Malley added a comment -

          Owen, thanks for the slides

          You're welcome. Everyone had seen them before, but I wanted to make sure they were easily available for this conversation.

           > I don't see a direct relation between this issue and the issue of simplifying the implementation of efficient map-side joins (MAPREDUCE-1183, more or less). Am I missing the connection, or is this a distinct issue?

           It is related because we want to support context-specific serializations. That support is much easier if the metadata for each serialization is in a separate structure and not dumped into the Configuration. This is the same problem that comes from MAPREDUCE-1183 for InputFormats, Mappers, etc. They are similar issues and it would be nice to have a consistent solution.

           > File formats are forever.

           I'm adding no new file formats. I'm just making the ones that we've had for years more useful.

           > We badly need to add support for a higher-level object serialization system than Writable.

           I obviously agree enough that I'm working on supporting it. Providing customer choice over the serialization is much richer than forcing them into a single one. Each serialization reflects different design decisions; by making the choice pluggable, the user can decide. I understand that you want Avro everywhere. Other users have other priorities.

           > But I'm not convinced it's wise to add such support to the existing Java-only container file formats.

          I'm supporting the containers we have. I'd love for someone to implement SequenceFiles or TFiles in C. That is an orthogonal issue. Any file format that only supports one serialization doesn't meet my needs.

          This change should have no impact on any current applications. Very few of them depend on the serialization library directly. My hope is that by extending the library and supporting a wider range of serializations, users will be able to code their applications using the types that they find convenient.

          Doug Cutting added a comment -

          > That support is much easier if the metadata for each serialization is in a separate structure and not dumped into the Configuration.

           Got it. Thanks for clarifying. As I've commented earlier in this issue, I prefer the use of simple textual formats (properties, XML, JSON, etc.) for metadata and configuration data, as in HTTP, SMTP, and most config file formats, rather than binary data. Such textual formats seem to me to be more natural when bootstrapping interoperable systems. Metadata and configuration data are not usually performance- or size-sensitive, the normal motivation for the use of binary.

          > Providing customer choice over the serialization is much richer than forcing them into a single one.

          I agree that we should provide a general-purpose API that does not force a particular serialization, but we should encourage a primary serialization to provide better interoperability.

          > Any file format that only supports one serialization doesn't meet my needs.

          I certainly don't think we should mandate a single file format, and we don't at present. But I think we should focus our support around a single format. A format that contains multiple serializations is harder to support across multiple languages and greatly increases the chance that you'll have data that cannot be processed by another system. As an existence proof, Google seems to get a lot of mileage with a single preferred serialization.

          Thanks for responding to my concerns. I am -0 on this patch as currently implemented: I think we could do better but I will not block progress.

          Owen O'Malley added a comment -

          This patch creates javadoc for the new package and marks the old package as deprecated.

          I've also fixed two minor problems that we exposed in MapReduce testing.

          Owen O'Malley added a comment -

          Thrift isn't available from Maven yet, so this copy of thrift-0.5.0 should be put into lib.

          Owen O'Malley added a comment -

          Re-uploading to allow hudson to run.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12459663/serial4.patch
          against trunk revision 1035353.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 35 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The patch appears to cause tar ant target to fail.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          -1 system test framework. The patch failed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/103//testReport/
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/103//console

          This message is automatically generated.

          Owen O'Malley added a comment -

          I fixed some findbugs warnings.

          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 38 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] -1 javac. The applied patch generated 1120 javac compiler warnings (more than the trunk's current 1033 warnings).
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] -1 release audit. The applied patch generated 13 release audit warnings (more than the trunk's current 1 warnings).
          [exec]
          [exec] +1 system test framework. The patch passed system test framework compile.

          The Javac warnings are for deprecations and the release audit warnings are for generated files.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12459688/serial6.patch
          against trunk revision 1035353.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 38 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The patch appears to cause tar ant target to fail.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          -1 system test framework. The patch failed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/104//testReport/
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/104//console

          This message is automatically generated.

          Luke Lu added a comment -

           THRIFT-363 addresses the thrift maven deploy, even though it's not completely resolved. We should help resolve it, as they're voting for the 0.6.0 release. Checking in an external dependency should be the last resort, as any projects (hdfs, mapreduce, etc., along with their contrib srcs) that depend on hadoop-common artifacts will have to be modified manually to include the jar in their source trees. libthrift.jar needs a version string (e.g. libthrift-0.5.0.jar) as well, if we decide to check it in.

          Tom White added a comment -

          Here's my feedback on the patch:

           1. I think the new serializations should be optional dependencies. Mandating a particular version of Thrift, Protocol Buffers, and Avro is going to cause problems for folks down the line, since we would be tying the version to Hadoop's release cycle, which is infrequent. Making the serializations libraries (or contrib modules, as in MAPREDUCE-376, MAPREDUCE-377) makes them independent, and will make it easier to support the version of the serialization library the user wants.
          2. I preferred the version where the Serialization could choose the way it serialized itself. In the current patch, if you wrote Avro data in a SequenceFile you would have Writables for the file container, and a PB-encoded Avro schema for the serialization. Having so many serialization mechanisms is potentially brittle.
           3. I'm not sure we need the full generality of PB for serializing serializations. If the serialization could choose its self-serialization mechanism, then TypedSerialization could just write its type as a Text object (see the sketch below). Doing this would remove the core dependency on PB, and allow item 1.
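
           A minimal sketch of what item 3 might look like; the class and method names here are hypothetical and not taken from the patch.

             import java.io.DataInput;
             import java.io.DataOutput;
             import java.io.IOException;
             import org.apache.hadoop.io.Text;

             // Hypothetical: a typed serialization whose only metadata is the record
             // class name, written as a Text-encoded string rather than a protobuf message.
             abstract class TypedSerializationSketch<T> {
               protected Class<? extends T> specificType;

               void serializeSelf(DataOutput out) throws IOException {
                 Text.writeString(out, specificType == null ? "" : specificType.getName());
               }

               @SuppressWarnings("unchecked")
               void deserializeSelf(DataInput in) throws IOException {
                 String name = Text.readString(in);
                 try {
                   specificType = name.isEmpty()
                       ? null : (Class<? extends T>) Class.forName(name);
                 } catch (ClassNotFoundException e) {
                   throw new IOException("Unknown type: " + name, e);
                 }
               }
             }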
          Owen O'Malley added a comment -

           @Luke - I agree it is a last resort. They haven't been able to figure out how to push to Maven. I asked, and they aren't sure what is going wrong. I spent a couple of hours trying to see if I could get it pushed to org.apache.hadoop.thrift and didn't succeed.

           @Tom - But they aren't optional dependencies. You could make the same argument for log4j, commons logging, or any other package we use. Part of the power of Maven is that it catches these problems early and gives you tools for resolving them. Making them optional would leave the users in the lurch.

           Serializations have complete control over how they serialize their metadata. Any or all of them could use JSON or any other format. Using protobuf is very convenient and provides lots of extensibility. I believe that using a non-extensible format would be a mistake we'd regret for a long time.

          Doug Cutting added a comment -

           Owen> they aren't optional dependencies.

           Indeed they aren't optional dependencies in the patch you have posted. But most are not Hadoop dependencies at all before this patch. Tom seems to be suggesting that this patch could be reduced to only introduce a serialization framework, not to add a number of implementations of it, nor, I would add, to extend existing container file formats to incorporate new serializations. That would bring the patch more in line with the description of this issue. Perhaps the serialization implementations and the extension of the containers could be moved to different issues and this issue could focus on the serialization API itself?

          Owen O'Malley added a comment -

           > Indeed they aren't optional dependencies in the patch you have posted. But most are not Hadoop dependencies at all before this patch.

           Avro is already a dependency. Thrift is already a dependency for HDFS (see HDFS-1484). I'm only adding ProtocolBuffers, which is a commonly used serialization format that many users, including me, find extremely useful.

           > only introduce a serialization framework, not to add a number of implementations of it,

          The implementations are relatively small and are required for showing that the system actually works.

           > nor, I would add, to extend existing container file formats to incorporate new serializations

          SequenceFile was updated to use the new io.serial interface. The other containers had never been updated to use anything but Writable. Bringing them up to the new interface was part of the work. Again, updating SequenceFile shows that this code actually works. Without such a demonstration, this patch would be incomplete.

          Tom White added a comment -

           > Avro is already a dependency. Thrift is already a dependency for HDFS (see HDFS-1484).

           There's a difference. Avro is used to implement an internal Hadoop format, and Thrift, which is used in the thriftfs contrib module, is also an internal detail. So users don't care about the version numbers for these libraries. (Also, the latter is a contrib module, so users can elect not to include it.) The case I'm thinking about here is that if a user has a Thrift file definition that uses a feature of a later version of Thrift than is included in Hadoop, then they can't use it. If we make it a library then it becomes possible to update the library independently.

          Owen O'Malley added a comment -

           > So users don't care about the version numbers for these libraries.

          That doesn't follow. In each application there can only be one instance of Thrift, Avro, or ProtocolBuffers. How the library is used doesn't matter.

           > The case I'm thinking about here is that if a user has a Thrift file definition that uses a feature of a later version of Thrift than is included in Hadoop, then they can't use it.

           The user can always choose not to import the dependency from Hadoop. But they should make that choice explicitly, because if they override the version they need to make sure that both Hadoop and their application work with the version they choose.

          Doug Cutting added a comment -

          > I'm only adding ProtocolBuffers, which is a commonly used serialization format that many users including me find extremely useful.

          Is there a strong reason to use ProtocolBuffers here rather than Avro, which is already a dependency and provides similar functionality?

          > The implementations are relatively small and are required for showing that the system actually works.

          Then they could be included as tests. Also, if these uses are not the primary, essential purpose of this functionality, perhaps patches for the issues that are its primary use can be posted and they can be reviewed together.

          Owen O'Malley added a comment -

           > Is there a strong reason to use ProtocolBuffers here rather than Avro, which is already a dependency and provides similar functionality?

          It isn't clear what context you mean here:

          • Giving the user the ability to use ProtocolBuffers
          • Using protocol buffers for the metadata

           For the first point, ProtocolBuffers is an extremely well-engineered and well-documented project. The fit and finish are excellent. It has been finely honed by years of extensive use in production systems. Providing the capability to natively run ProtocolBuffers through the pipeline without third-party add-ons is a big win. See Kevin's presentation about using Protocol Buffers at Twitter: http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter.

           For the second point, Avro is completely unsuitable for that context. For the serializer's metadata, I need to encode a singleton object. With Avro, I would need to encode the schema and then the metadata information. To add insult to injury, the schema will be substantially larger than the data. With ProtocolBuffers, I just encode the data; the metadata is part of the record. In other contexts, where there are a lot of the same objects being serialized, Avro is more efficient. This context is very different. As a final point, as I've told you previously, the Avro setup is very expensive. Writing a 2-row sequence file is 10x slower using Avro compared to ProtocolBuffers.

           I understand that you'd like Avro to be the one and only serialization format that Hadoop supports, especially since that will help you push the development of Avro forward. Your forcing Avro on the users is unhealthy for Hadoop.

          Doug Cutting added a comment -

          > you'd like Avro to be the one and only serialization format that Hadoop supports

          Not quite. I'd like Hadoop to encourage a primary persistent data file format, to better enable a rich ecosystem. Currently we promote Writables in SequenceFiles. The Avro project was launched with the goal of providing a second-generation alternative to this. But if there's another candidate that would serve better I'd gladly entertain it. The success of Avro is secondary to this goal.

          Neither Thrift nor ProtocolBuffers define a standard data file format. If someone implemented a cross-platform container file format based on Protocol Buffers or Thrift that supported compression and was splittable, it would be a strong contender. Avro's primary advantage over these is that it's more dynamic, e.g., supporting easy creation of new datatypes without a code generate/load cycle, but Thrift and Protocol Buffers are more mature and support more programming languages, so could present compelling alternatives if they offered an appropriate file format.

          We should not force a single persistent data format. We should continue to include flexible APIs that permit arbitrary data formats.

          Owen O'Malley added a comment -

           > We should not force a single persistent data format. We should continue to include flexible APIs that permit arbitrary data formats.

          Great. This patch is moving that forward.

           If you want to create a JIRA and patch for making a C SequenceFile or TFile, which is a totally orthogonal issue, that would be great.

          Tom White added a comment -

          My point was about modularity: the core should not depend on a particular set of serialization libraries with particular versions, since it hampers users' ability to use later versions of serialization libraries. The concrete change I would make is to move the serializations to contrib modules so that they can be more easily updated than core.

          Owen O'Malley added a comment -

          Tom, I understood your point. You aren't seeing mine.

          There are three use cases (using Thrift as the library in question):

          • The job is not using Thrift. In this case it doesn't matter whether the thrift jar and the plugin class are on the classpath.
           • The job is using the same version of Thrift. The two solutions look like:
             • If the plugin is built into Hadoop, the user just uses their class.
             • If the plugin is a separate jar, the application must copy both the thrift jar and the thrift plugin into HDFS and put them in their distributed cache. They also need to put them on the task's (and the launching node's) classpath.
          • The job is using a different version of Thrift. This is the 1% case. The two solutions look like:
            • If the plugin is built into Hadoop, the application must put the thrift jar into the distributed cache.
            • If the plugin is separate, the application must put the thrift jar and the plugin jar into the distributed cache.

Also note that when using the distributed cache, it is easy for user mistakes to leave a copy of thrift and the plugin per job or per user on all of the slave nodes.

          In summary, the user can always override the version distributed with Hadoop. The question is just how convenient we can make the standard use cases. We have added many dependencies over the years and they've never provoked this kind of objection.
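For the override case, a minimal sketch of what the job-submission side could look like (the jar paths are made up, and the old-API DistributedCache helper is assumed):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class OverrideThriftVersion {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The jars must already have been copied into HDFS by the user.
    // These calls only record the paths in the job configuration; the
    // tasktrackers localize the jars and add them to the task classpath.
    DistributedCache.addFileToClassPath(new Path("/libs/libthrift.jar"), conf);
    // Only needed when the plugin is shipped separately from Hadoop:
    DistributedCache.addFileToClassPath(new Path("/libs/thrift-serialization-plugin.jar"), conf);
    // ... build and submit the job with this configuration ...
  }
}
{code}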

          Chris Douglas added a comment -

          Feedback on v6:

• Text::readRawString needs to reset newBytes on each iteration. It's not used in the patch; if useful, unit tests for the new Text methods would probably be a good idea.
          • Has this been tested by reading SequenceFiles and TFiles written before the patch? Tests in MapReduce pass?
• The hadoop.serializations key should probably be added to core-default.xml and the (sigh) CommonConfigurationKeysPublic class.

          Aside from these, the patch looks great.

          Luke Lu added a comment -

          Regarding patch v6:

          About thrift:

Since the maven artifact for thrift is not yet ready and no existing code (other than new test code) depends on thrift, maybe we can split the thrift portion out into a second patch, which could also serve as an example of how to add a new serialization scheme. Otherwise, we'd have to make changes to all the projects that depend on hadoop-common now and later. Basically 1 additional patch versus 2n patches (where n is the number of projects depending on hadoop-common).

          Code nits:

• I see code throwing RuntimeException("Error in deserialization") etc. (e.g. in DeserializationRawComparator.java). It'd be nicer to have a more specific runtime exception, i.e. SerializationException, for easier matching in various situations (e.g. @Test(expected=SerializationException.class)); see the sketch after this list.
          • in WritableSerialization.java#getRawComparator, you can use Class<WritableComparable<Writable>> to avoid the rawtype warning.
          • I didn't see any tests for Text#readRawString and writeRawString.
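A minimal sketch of the SerializationException idea, purely illustrative (no such class exists in the patch as reviewed here):

{code:java}
import java.io.IOException;
import org.junit.Test;

public class TestSerializationException {

  /** Illustrative unchecked exception dedicated to serialization failures. */
  public static class SerializationException extends RuntimeException {
    public SerializationException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** With a dedicated type, tests can match the failure precisely. */
  @Test(expected = SerializationException.class)
  public void deserializationFailureIsMatchable() {
    try {
      // Stand-in for a real decode error inside a deserializer.
      throw new IOException("truncated input");
    } catch (IOException e) {
      throw new SerializationException("Error in deserialization", e);
    }
  }
}
{code}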
          Eli Collins added a comment -

It seems like the high-level discussion got sidetracked on the use of specific serialization frameworks, instead of reaching consensus on the main change this jira proposes: whether the framework should use an opaque binary blob for storing metadata, i.e. whether serialization metadata needs to be serialized (vs. a map of strings, which is really just a different form of serialization).

          Reading the thread and presentation it sounds like the primary motivation is:

          1. Using a binary blob will simplify the interfaces for defining specific serializations (MR-1462)
          2. The blob interface allows for contents that are more easily type checked

Is there consensus on these points? If not, what are the specific objections? E.g., is the patch for MR-1462 not significantly simplified vs. what it would take with the current API?

          It doesn't seem to me that this particular plumbing decision needs to block the more valuable goal of pushing serializations through the pipeline.

          Owen O'Malley added a comment -

          It'd be nicer to have a more specific runtime exception

          Different types for the exceptions only helps if the user has a reasonable chance of catching the exception and handling it appropriately. In this case, there really isn't any useful recovery the user can do.

          in WritableSerialization.java#getRawComparator, you can use Class<WritableComparable<Writable>> to avoid the rawtype warning.

          I don't want to change the signature of WritableComparator.get.

          I didn't see any tests for Text#readRawString and writeRawString.

Those should have been moved to the OFile patch. I'll do that now.

          Doug Cutting added a comment -

          Eli, thanks for refocusing discussion on the original objective of this issue. We've gotten distracted by other aspects not essential to that, and I don't think that primary objective has ever been agreed on.

          I do not agree that an array of bytes is a better way to represent serialization metadata. (I stated this in the first comment on this issue.) I prefer the solutions that were in HADOOP-6165 and HADOOP-6420. My objections are:

          • inheritance is used in serialization implementations, and inheritance is harder to implement with binary objects
          • binary encodings are less transparent and create binary serialization bootstrap problems
          • serialization metadata is not large nor read/written in inner loops, so binary is not required
• using a binary encoding for serialization metadata will require substantial changes to serialization clients. The Map<String,String> approach is easily embedded in existing metadata, like configurations, jobs, SequenceFile, etc.; a binary encoding requires changes to all serialization clients, with little to compensate. Changes to public APIs and persistent data formats should be made only when there is clear end-user value, which I don't see that a change from Map<String,String> to byte[] provides.

          I also will re-voice my objection that the current patch makes a large number of changes beyond changing the format of serialization metadata. We should restrict the patch to the description, and change other things in other issues.

          Owen O'Malley added a comment -

          This patch addresses the feedback that has been given.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12459961/serial7.patch
          against trunk revision 1036694.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 38 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The patch appears to cause tar ant target to fail.

          -1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:

          -1 contrib tests. The patch failed contrib unit tests.

          -1 system test framework. The patch failed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/112//testReport/
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/112//console

          This message is automatically generated.

          Arun C Murthy added a comment -

          I do not agree that an array of bytes is a better way to represent serialization metadata. (I stated this in the first comment on this issue.) I prefer the solutions that were in HADOOP-6165 and HADOOP-6420. My objections are:

          Doug, thanks for re-voicing your concerns.

The primary objection to HADOOP-6165 was its implementation, and we had a very, very drawn out discussion in MAPREDUCE-1126. The primary reason for taking the current approach was to support serialization in a more explicit, type-safe manner.

          We have had several discussions on this same topic and I do not see how we can make progress on this anymore. I think we should either agree on the approach or veto this and stop work on this altogether.

          As has been noted before, progress on this work will greatly benefit Map-Reduce by supporting serialized job descriptions: MAPREDUCE-1183.

          I also will re-voice my objection that the current patch makes a large number of changes beyond changing the format of serialization metadata. We should restrict the patch to the description, and change other things in other issues.

Fair point. Using a development branch for this would have been the right approach. We can still do that. However, this patch has progressed to the point that it would be ready to merge into the mainline.

          Again, I think we should, as a community, decide on the direction and either commit this or stop working on it.
          I do not see how arguing further helps either the atmosphere or the community.

          Eli Collins added a comment -

          For those of us not as familiar with the relevant code, could someone summarize how changing this particular interface from Map<String,String> to byte[] significantly simplifies the implementation of MAPREDUCE-1462 and MAPREDUCE-1183? This isn't readily apparent from reading this jira or MAPREDUCE-1126, which seems largely orthogonal.

          Arun C Murthy added a comment -

          For those of us not as familiar with the relevant code, could someone summarize how changing this particular interface from Map<String,String> to byte[] significantly simplifies the implementation of MAPREDUCE-1462 and MAPREDUCE-1183? This isn't readily apparent from reading this jira or MAPREDUCE-1126, which seems largely orthogonal.

          This discussion is about interface design, especially on an interface which has significant impact on MR user-facing APIs, not necessarily about easing implementation of MAPREDUCE-1183 or MAPREDUCE-1126.

          MAPREDUCE-1126 should have been orthogonal, but it made significant changes to Map-Reduce APIs which led to this debate.

          I understand it's hard to cold-start on MAPREDUCE-1126, maybe you can start with: http://s.apache.org/MR1126. Hopefully that provides some context.

Owen O'Malley added a comment (edited) -

          There were a few driving goals:

          • Getting ProtocolBuffers, Thrift, and Avro types through MapReduce end to end. Obviously this includes supporting SequenceFiles, which are where the bulk of Hadoop data is currently stored.
• Supporting context-specific serializations (input key, input value, shuffle key, shuffle value, output key, output value, etc.) so that different serialization options can be chosen depending on the application's requirements. (MAPREDUCE-1462)
          • Using serialization of the "active" objects themselves (input format, mapper, reducer, output format, etc.) to simplify making compound objects. This will allow us to get rid of the static methods to define properties like input and output directory without pushing them into the framework. (MAPREDUCE-1183)
          • Clean up the serialization interface to make it clear that each object has to be serialized independently. The current API gives the impression that the Serializer and Deserializer can hold state, which is incorrect, and led to a bug in the first implementation of the Java serialization plugin.

The first attempt to generalize the serialization metadata was done via string to string maps. Since MapReduce already has a configuration, which is a string to string map, they used that. However, they needed to nest the serialization maps into the configuration. So for each context there was a prefix string, and the values under that prefix were the metadata for that serialization. This worked, but was very ugly. It led to "stringly-typed" interfaces where you needed to read all of the code to figure out what the legal values for the configuration were. The code was full of static methods for each serialization in each context that updated the configuration. Further, since it was never clear what was intended to be "visible" versus "opaque", the user ended up being responsible for all of it.

Therefore, I decided to use another approach. Instead of using string to string maps, we use bytes to capture the metadata. The bytes are opaque except to the serialization itself. This allows the serialization to define what data is important to it and handle it in a type- and version-safe manner. It is also symmetric to the solution of MR-1183, where you use component-specific metadata to save their parameters. That is the framework that has been laid out in this patch. It includes the work on the container files to show that it can be used to write and read the different serializations. It includes the serializations to show that they work correctly when used together. By making the framework use typed metadata instead of the very generic, but type-less, string to string map, many user errors will be avoided.
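To make the contrast concrete, here is a caricature of the two styles; every key name and type below is illustrative, not the actual API of either patch:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class MetadataStyles {

  // Old style: serialization metadata nested into the job Configuration under a
  // per-context prefix. Nothing checks the key names or the values until runtime.
  static void configureStringly(Configuration conf) {
    conf.set("example.output.key.serialization", "thrift");
    conf.set("example.output.key.serialization.class", "com.example.Page");
  }

  // New style: the serialization owns one opaque, self-describing blob that only
  // it knows how to encode and decode, so typing and versioning live inside it.
  interface SelfDescribingSerialization<T> {
    byte[] serializeSelf();                 // capture this serialization's metadata
    void deserializeSelf(byte[] metadata);  // restore it, checking type and version
  }
}
{code}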

Part of the lesson learned from the train wreck of MAPREDUCE-1126 was that implementing sweeping changes to the API and framework by writing and committing little patches spread out over 6 months is not a healthy way of working. The reviewer needs to understand why they care and how the parts are going to work together. I should have done this jira in a public development branch, but that wouldn't have lessened this debate. Doug and I just disagree about the design of the interface. The indication that he gave when I gave the presentation on my plan 5 months ago was that he didn't like it, but wouldn't block it. He reiterated that position on this jira 6 days ago. Have you changed your mind, Doug?

          To Doug's specific points:

          inheritance is used in serialization implementations, and inheritance is harder to implement with binary objects

          Actually handling extensions is quite easy using protocol buffers, which is part of why I chose to use them for storing the metadata. Inheritance in string to string maps is quite tricky and must be managed completely by the plugin writer.

          binary encodings are less transparent and create binary serialization bootstrap problems

I will grant you they are less transparent and require a tool to dump their contents. Bootstrapping wasn't a problem at all. (Granted, it would have been a problem with Avro, as I discussed here: http://bit.ly/cJ1tVp)

          serialization metadata is not large nor read/written in inner loops, so binary is not required

          It isn't required, but it isn't a problem either.

          using a binary encoding for serialization metadata will require substantial changes to serialization clients.

          The change to the clients is the same size, regardless of whether the metadata is encoded in binary or string to string maps. It is extra information that needs to be available. The data is smaller and type-safe if it is done in binary compared to string to string maps.

          Doug Cutting added a comment -

          > Getting ProtocolBuffers, Thrift, and Avro types through MapReduce end to end. Obviously this includes supporting SequenceFiles, which are where the bulk of Hadoop data is currently stored.

          This does not follow. We cannot currently pass an object that does not implement Writable through the shuffle without wrapping it in a Writable. However we can and do currently support input and output of objects that do not implement Writable: RecordReader and RecordWriter do not require Writable. So no modifications to SequenceFile are required to permit end-to-end passage of non-Writables in mapreduce.
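For reference, the new-API reader/writer classes place no Writable bound on their type parameters, which is what that claim rests on; a sketch with a made-up value class:

{code:java}
import java.io.IOException;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// A plain object: no Writable anywhere.
class Point {
  double x, y;
}

// RecordWriter's type parameters are unbounded, so this compiles as-is.
class PointWriter extends RecordWriter<String, Point> {
  @Override
  public void write(String key, Point value) throws IOException {
    // encode the point however the output format chooses
  }

  @Override
  public void close(TaskAttemptContext context) throws IOException {
  }
}
{code}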

          > Supporting context-specific serializations (input key, input value, shuffle key, shuffle value, output key, output value, etc) so that different serialization options can chosen depending on the application's requirements.

          This does not require a binary format, only a metadata format that can be somehow nested. HADOOP-6420 made this possible.

          > This worked, but was very ugly. It lead to "stringly-typed" interfaces where you needed to read all of the code to figure out what the legal values for the configuration were.

          This sounds like a documentation issue, not a functional deficiency. This style is used consistently throughout Hadoop. If we seek to replace Configuration that should perhaps be considered wholesale rather than piecemeal.

          > By making the framework use typed metadata instead of the very generic, but type-less, string to string map many user errors will be avoided.

          The current style is to provide methods to access configurations and metadata. These methods prevent such type errors. I have not seen a large number of complaints from end users about this aspect of Hadoop.

          > The indication that he gave when I gave the presentation on my plan 5 months ago was that he didn't like it, but wouldn't block it. He reiterated that position on this jira 6 days ago. Have you changed your mind, Doug?

          I had hoped that not threatening a veto but rather providing strong criticism would elicit compromise and collaboration. It seems to have unfortunately achieved the opposite. I am sorry to learn that this strategy has failed and, yes, I am now leaning towards a veto of this issue.

          > Bootstrapping wasn't a problem at all.

          Bootstrapping a generic serialization system by requiring a particular serialization system is a bootstrapping problem.

          > The change to the clients is the same size, regardless of whether the metadata is encoded in binary or string to string maps.

          That's not true. If clients already use a Map<String,String> like Configuration (as jobs do) then no change is required.

          Owen O'Malley added a comment -

          This sounds like a documentation issue, not a functional deficiency. This style is used consistently throughout Hadoop. If we seek to replace Configuration that should perhaps be considered wholesale rather than piecemeal.

Not at all. Requiring the metadata to be string to string maps forced the implementation to have a large number of related static functions. I should also point out that my proposed framework can support text-based metadata; it just doesn't require it.

          I have not seen a large number of complaints from end users about this aspect of Hadoop.

          That demonstrates a lack of involvement with users, rather than the benefits of the current system. Users often complain about the current configuration system. In particular, typos that result in subtly wrong results are very user unfriendly. String to string maps are very useful, but there are also a lot of benefits to having strongly typed interfaces.
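A tiny illustration of that failure mode (the key names are made up):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class SilentConfigTypo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("example.map.memory.mb", "2048");

    // One transposed letter in the key: get() quietly falls back to the
    // default, the job runs with the wrong value, and nothing complains.
    int mb = conf.getInt("example.map.memoyr.mb", 1024);
    System.out.println(mb); // prints 1024
  }
}
{code}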

          Users are also unhappy with the static methods that update configurations for doing things like setting the input and output directories.

          Chris Douglas added a comment -

          I had hoped that not threatening a veto but rather providing strong criticism would elicit compromise and collaboration. It seems to have unfortunately achieved the opposite. I am sorry to learn that this strategy has failed and, yes, I am now leaning towards a veto of this issue.

          What would a compromise solution look like? You've asserted that you preferred the previous, stringly-typed solution, which is known. But your comments, so far, have not been actionable. You wrote a missive on the virtues of Avro and standard file formats. When Tom brought it up, you objected to packaging. Your only consistent theme has been the scope of the patch, where you vacillate between (0) splitting up work that sprawls across too many domains and (1) importing large, related issues (project direction, redesigning Configuration, etc.). Owen has responded to all of these respectfully, by fully outlining his reasoning, but cannot respond in code because most of your objections are too general or diffuse to implement. That is, without adopting the patches you prefer instead of these.

Short of adopting your solution, which was vetoed as unacceptable not only by Owen but by others, where are you open to compromise? If the diff were spread around different issues, would you restrict your veto to a subset of them? Which of your objections do you consider important enough to block the progress you grudgingly agreed to months/days ago?

          I'm +1 on the patch as-is (with the minor caveat that the leftover imports in Text should be omitted). I like the direction, the type safety, and its independence of Configuration. The way things are configured in Hadoop today is horrific. I think we can do better and the progress made here should be committed.

          Arun C Murthy added a comment -

          > By making the framework use typed metadata instead of the very generic, but type-less, string to string map many user errors will be avoided.

          The current style is to provide methods to access configurations and metadata. These methods prevent such type errors. I have not seen a large number of complaints from end users about this aspect of Hadoop.

          Again, this is exactly the stuff we are trying to improve!

          Pardon my frustration, but if you carefully read MAPREDUCE-1183, Configuration just isn't working!

Pig (and probably Hive too, I don't know Hive well enough) has gone through significant hoops to work around its limitations.

Adding more 'features' to Configuration will exacerbate the issues we have; we need to move away from it, and this is a start.

          Doug Cutting added a comment -

          > What would a compromise solution look like?

          I would be willing to entertain non-Map<String,String> representations of serialization metadata, if:

          • it didn't create a core dependency on ProtocolBuffers
          • it didn't extend SequenceFile

          I am -1 on this patch as-is.

          Arun C Murthy added a comment -

          I would be willing to entertain non-Map<String,String> representations of serialization metadata, if:

          • it didn't create a core dependency on ProtocolBuffers
          • it didn't extend SequenceFile

          I am -1 on this patch as-is.

          Doug, thanks for clarifying.

          * it didn't extend SequenceFile

So, the patch as it stands allows SequenceFiles to use the new serialization framework, i.e. adds a feature. Are you against this feature? Can you please explain why?

          Or, do you just prefer it as a separate patch?

          Doug Cutting added a comment -

          > So, the patch, as it stands allows SequenceFiles to use the new serialization framework i.e. adds a feature. Are you against this feature? Can you please explain why?

          Yes, I am against this feature. I've explained why several times above, and will try again now. Creating new concrete data formats that are functionally equivalent to other concrete formats decreases ecosystem interoperability, flexibility and maintainability. Above I cited the Dremel paper, whose section 2 outlines a scenario that they argue is only possible because all of the systems involved share a single common serialization and file format.

          Arun C Murthy added a comment -

          A couple of fresh pairs of eyes on this jira asked me for some more context on this discussion, here goes:

          The historical context is that HADOOP-6420 added a feature whose import did not hit home until MAPREDUCE-1126. MAPREDUCE-1126 itself seemed innocuous until we saw it change user-facing Map-Reduce apis. Along the way we built a better understanding of issues faced by frameworks like Hive/Pig and came up with MAPREDUCE-1183. We then conceived of HADOOP-6685 which we strongly feel improves Hadoop serialization, and in turn Map-Reduce, in a significant manner.

          Hope this helps.

          Chris Douglas added a comment -

          Creating new concrete data formats that are functionally equivalent to other concrete formats decreases ecosystem interoperability, flexibility and maintainability. Above I cited the Dremel paper, whose section 2 outlines a scenario that they argue is only possible because all of the systems involved share a single common serialization and file format.

          Your argument is: Avro and its file format must be maximally attractive to realize the benefits of a common serialization and file format across all installations of Hadoop. Someone at Google wrote a paper about how well that worked inside their company. SequenceFiles cannot support multiple serializations, because that would make adoption of those components (or "alternatives" with exactly its same design points) less attractive.

          Users are going to put their data in whatever form is convenient, often for legacy/interoperability reasons. Companies like Yahoo are interested in ubiquitous data formats, because they have a lot of it and many groups building on top of it. And it employs teams of people to curate its data. Others may be less enthusiastic about your "one serialization/data format to rule them all." Using your veto in Hadoop to prohibit SequenceFile or TFile from evolving into a viable competitor/alternative to Avro's preferred format is invalid. If SequenceFile didn't support Avro, would you drop your veto on this point?

          it didn't create a core dependency on ProtocolBuffers

          Anywhere? Or as it's used in the current patch?

          Arun C Murthy added a comment -

          > So, the patch, as it stands allows SequenceFiles to use the new serialization framework i.e. adds a feature. Are you against this feature? Can you please explain why?

          Yes, I am against this feature. I've explained why several times above, and will try again now. Creating new concrete data formats that are functionally equivalent to other concrete formats decreases ecosystem interoperability, flexibility and maintainability. Above I cited the Dremel paper, whose section 2 outlines a scenario that they argue is only possible because all of the systems involved share a single common serialization and file format.

          Thanks for laying it out again.

          Your objections seem very unreasonable to me.

          I understand you prefer to have a single Avro-based data format, but Hadoop is a software framework used by many people and organizations. People and organizations already have data in different formats.

Dremel is implemented and used by a single organization with a specific technical and historical context. What they need and use isn't something everyone on the planet can adopt.

          Hadoop as a framework should not be in the business of dictating formats.

          We should facilitate and encourage users and organizations to use inter-operable formats, not necessarily the one format.

          IAC, this seems like a discussion which belongs elsewhere - I just don't see how blocking a feature is useful. You can ask for it to be done in a separate jira, which we can, but this specific objection of yours is very unreasonable, IMO.

          Doug Cutting added a comment -

          > Hadoop as a framework should not be in the business of dictating formats.

          I agree. So let's create new formats outside of Hadoop's kernel, rather than add new ones in.

          Arun C Murthy added a comment -

          I agree. So let's create new formats outside of Hadoop's kernel, rather than add new ones in.

          Great. No one is suggesting new formats. We need to maintain the ones we already have for legacy reasons.

There are petabytes of data in SequenceFile format in Hadoop clusters everywhere. We cannot drop it; we need to maintain it and keep it up to date. We also need to improve it to continue to support existing users.

          Arun C Murthy added a comment -

          Now, can you please retract your veto on this issue? Thanks.

          Doug Cutting added a comment -

> There are petabytes of data in SequenceFile format in Hadoop clusters everywhere. We cannot drop it; we need to maintain it and keep it up to date. We also need to improve it to continue to support existing users.

          I have never proposed dropping SequenceFile. I have proposed that we not extend it. I have proposed that if we introduce a new concrete binary object data file format (container+serialization) then we should only introduce a single such second-generation format. If we cannot agree on such a format, then we will be stuck adding no new formats to the kernel but rather creating new formats in external projects.

          Arun C Murthy added a comment -

          I have never proposed dropping SequenceFile. I have proposed that we not extend it.

          I don't understand that.

As I mentioned, there is too much data already in SequenceFile format - objecting to maintaining it or extending it in minor, compatible ways for legacy users is unreasonable, no?

          If we cannot agree on such a format, then we will be stuck adding no new formats to the kernel but rather creating new formats in external projects.

          Agreed. However, again, no one is proposing new formats here.

          Owen O'Malley added a comment -

          I question the validity of Doug's veto. His objection to the patch has nothing to do with the merits of the patch and everything to do with his wish to push Avro into Hadoop at the cost of the users.

          it didn't extend SequenceFile

Why is no extension of SequenceFile acceptable? Would it be valid if someone on Subversion said that no further enhancements to one of the primary repository access modules were acceptable? It is unacceptable to prevent future enhancements to a critical part of Hadoop's library code.

          it didn't create a core dependency on ProtocolBuffers

          Why is depending on Avro in the framework acceptable, but not ProtocolBuffers? I could have implemented the versioning by hand, but that is ridiculous. ProtocolBuffers provides exactly the capabilities that I need. Avro does not. I've discussed why above. The projects are not interchangeable. I understand that you have been working almost exclusively on Avro for the last 2 years, but that does not mean that Hadoop must use Avro exclusively.

          Doug Cutting added a comment -

          > I question the validity of Doug's veto. His objection to the patch has nothing to do with the merits of the patch and everything to do with his wish to push Avro into Hadoop at the cost of the users.

I have withdrawn my hope of adding Avro as a data format into Hadoop, since the Avro project now provides a data format layered on top of Hadoop. To my knowledge there are currently no Hadoop issues pushing Avro into Hadoop as a data format besides this one, and I do not currently intend to file any new such issues. (Avro's layer would be more easily implemented if the shuffle better supported non-Writable data, but, as it stands, it is adequate.)

We should refrain from adding any new data formats to the Hadoop kernel. More generally, we should refrain from adding code that could be implemented as user code to the kernel. At present, the kernel must contain some framework code that runs in a user's tasks, e.g., sorting code that calls the user's comparator. Beyond that required framework code, however, code that runs in user tasks should not be provided with the system, but should rather be supplied by the user. User tasks should ideally be able to, e.g., run a different version of the HDFS client code.

We have a fair amount of legacy code, like SequenceFile, that is currently provided with the system and that we cannot immediately remove for compatibility reasons. But new user-level functionality should be provided as external packages, not shipped with the kernel. If we wish to enhance the SequenceFile data format, then that should be done in a separate project.

The line between user and system code is currently blurred, and we should work to clarify it and reduce the amount of user code in this project, providing a level playing field for user code libraries.

          Owen O'Malley added a comment -

          We should refrain from adding any new data formats to the Hadoop kernel. More generally, we should refrain from adding code that could be implemented as user code to the kernel.

          This is a major change in the direction of Hadoop that has never been discussed by the Hadoop PMC. You are welcome to start such a thread on general@, but until the Hadoop PMC approves it, it is not the plan of record.

          SequenceFile, far from being deprecated, stores the majority of the world's Hadoop information. There are no current replacements for its functionality and implicitly deprecating it would be very expensive. Without the ability to extend SequenceFile to use the new generic serialization api, users will have no way of using the new api for storing their data.

The part of the bylaws (http://bit.ly/9olNWB ) on vetoes says that if the validity of the veto is questioned, it must be confirmed by someone with a binding vote. Does someone want to stand up and confirm the veto's validity?

          Doug Cutting added a comment -

          > There are no current replacements for its functionality and implicitly deprecating it would be very expensive. Without the ability to extend SequenceFile to use the new generic serialization api, users will have no way of using the new api for storing their data.

          No user code will be broken if this patch is not committed. You are free to create replacements or extensions to SequenceFile in an external project. I started doing that a year ago when you vetoed initial efforts in this area. This is a way we can all productively move forward. Hadoop need only provide APIs for data formats, mappers, reducers, etc., not implementations of these.

          Tom White added a comment -

I have two serious issues with the current patch, which I have mentioned above. However, given that they have not been adequately addressed, I feel I have no option but to vote -1.

          The first is that no change is needed in SequenceFile unless we want to support Avro, but, given that Avro data files were designed for this, and are multi-lingual, why change the SequenceFile format solely to support Avro? Are Avro data files insufficient? Note that Thrift and Protocol Buffers can be stored in today's SequenceFiles.

          The second is that this patch adds new serializations which introduce into the core a new dependency on a particular version of each of Avro, Thrift, and PB, in a non-pluggable way.

          This type of dependency is qualitatively different to other dependencies. Hadoop depends on log4j for instance, so if a user's code does too, then it needs to use the same version. A recent JIRA made it possible to specify a different version of log4j in the job, but this only works if the version the user specifies is compatible with both their code and the Hadoop kernel code.

However, in the case of a PB serialization, for example, the PB library is not used in Hadoop except in the serialization code for serializing the user's data type. So it's a user-level concern, and should be compiled as such - putting it in core Hadoop is asking for trouble in the future, since the Hadoop releases won't keep pace with the union of PB, Thrift, and Avro releases. These serialization plugins should be standalone, or at least easily re-compilable in a way that doesn't involve recompiling all of Hadoop, such as a contrib module. The user just treats the plugin JAR as another code dependency.
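
As a rough sketch of what that plugin model could look like under the existing (old-style) framework, where a serialization is registered purely through configuration and shipped in the user's job JAR; com.example.ThriftSerialization is a hypothetical user-supplied class, not anything in Hadoop:

    import org.apache.hadoop.conf.Configuration;

    public class RegisterPluginSerialization {
      public static void main(String[] args) {
        // Hypothetical plugin class com.example.ThriftSerialization lives in the
        // user's job JAR; only its name goes into the configuration, so nothing
        // in the Hadoop kernel needs to link against Thrift.
        Configuration conf = new Configuration();
        conf.set("io.serializations",
            "org.apache.hadoop.io.serializer.WritableSerialization,"
            + "com.example.ThriftSerialization");
      }
    }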

          To move forward on this issue it's clear that compromise is needed. I actually prefer strings in serialization (HADOOP-6420), but am prepared to compromise over it, in the interests of finding consensus.

          Doug Cutting added a comment -

          Tom> I actually prefer strings in serialization (HADOOP-6420), but am prepared to compromise over it [ ... ]

          I wonder if JSON might be a good nestable format for serialization metadata? JSON supports nesting, and distinguishes numeric, boolean and string types. With Jackson, one can serialize and deserialize Java objects as JSON, to get compile-time type checking.
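
A rough sketch of what that could look like with the Jackson that Hadoop already bundles; the SerializationMetadata bean here is hypothetical, standing in for whatever fields a given serialization needs:

    import java.io.IOException;
    import org.codehaus.jackson.map.ObjectMapper;

    // Hypothetical bean describing one serialization; Jackson binds its public
    // fields to and from JSON without any hand-written parser.
    class SerializationMetadata {
      public String serializationClass;
      public String schema;   // e.g. an Avro schema, when applicable
    }

    class MetadataCodec {
      private static final ObjectMapper MAPPER = new ObjectMapper();

      static String toJson(SerializationMetadata md) throws IOException {
        return MAPPER.writeValueAsString(md);                        // object -> JSON text
      }

      static SerializationMetadata fromJson(String json) throws IOException {
        return MAPPER.readValue(json, SerializationMetadata.class);  // JSON -> typed object
      }
    }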

          Konstantin Boudnik added a comment -

However, in the case of a PB serialization, for example, the PB library is not used in Hadoop except in the serialization code for serializing the user's data type. So it's a user-level concern, and should be compiled as such - putting it in core Hadoop is asking for trouble in the future, since the Hadoop releases won't keep pace with the union of PB, Thrift, and Avro releases. These serialization plugins should be standalone, or at least easily re-compilable in a way that doesn't involve recompiling all of Hadoop, such as a contrib module. The user just treats the plugin JAR as another code dependency.

+1 on Tom's point: having a variety of serialisation frameworks in a product is a good thing, as long as it doesn't come at the cost of the mess they might cause if their public APIs start deviating in a way that forces core Hadoop to change just to keep user applications working. Testing those is another thing: if Hadoop claims to support something explicitly, somebody needs to make the effort to guarantee that it is so.

Having a clean abstraction for serialisation, with pluggable frameworks supplied as the user wishes, sounds like a reasonable compromise.

          Steve Loughran added a comment -

I've been doing lots of JSON work recently: net.sf.jsonobject, gson, jackson, etc. So many parsers, so many other dependencies. Any of those that try to reimplement the dream of WS-* (seamless serialisation between native objects) are repeating the same mistakes. But JSON is good for shoving stuff around, serving up over HTTP, and parsing in different languages. XML's one edge over it is that Xerces ships on all Hadoop-compatible JVMs, an edge that DOM takes away with its pain of use. I'm =0 on using JSON internally. Less painful than XML, but the extra dependencies and the time I waste converting between different Java models of the graph hurt. And like XML, you end up escaping and base-64-ing stuff.

The ASF would veto any release of Hadoop that depended on an unreleased in-incubation artifact. This would complicate any plan to branch to 0.22, or at least release it, unless the build file was set up to exclude thrift-specific code. But if HDFS already depends on that, that's something in the schedule plans anyway, and Hadoop core + HDFS will depend on a specific Thrift version.

+1 to Tom's suggestion of keeping PB out of core in a contrib package; the same for Thrift if HDFS can remove its dependencies.

=0 to binary config data vs. Map<String,String>. Binary is efficient but brittle; a map is easier to debug. The question is: what would the performance cost of staying with string maps be?

          Nigel Daley added a comment -

          +1 to JSON.

          Eli Collins added a comment -

unless the build file was set up to exclude thrift-specific code. But if HDFS already depends on that, that's something in the schedule plans anyway, and Hadoop core + HDFS will depend on a specific Thrift version.

To clarify, the Thrift dependency referenced by HDFS-1484 is in contrib, and broken. It's not a core HDFS dependency.

          Owen O'Malley added a comment -

          The first is that no change is needed in SequenceFile unless we want to support Avro, but, given that Avro data files were designed for this, and are multi-lingual, why change the SequenceFile format solely to support Avro? Are Avro data files insufficient? Note that Thrift and Protocol Buffers can be stored in today's SequenceFiles.

This isn't true. SequenceFile needs to be changed to support the new serialization API. The class name isn't sufficient to determine the serialization. Furthermore, you can't implement context-sensitive serializations (MAPREDUCE-1462) without the changes to SequenceFile.

          Are Avro data files insufficient?

          Yes. They don't support indices. They don't support key, value pairs. They don't support other types like Writables. Furthermore, our users already heavily use SequenceFiles and don't want to port to a new file format. Extending SequenceFile gives them more flexibility.

          I wonder if JSON might be a good nestable format for serialization metadata? JSON supports nesting, and distinguishes numeric, boolean and string types. With Jackson, one can serialize and deserialize Java objects as JSON, to get compile-time type checking.

          In MAPREDUCE-980, you took out the custom JSON parser and replaced it with calls into Avro. Using ProtoBuf is efficient and meant that I wrote 2 lines of code. If I used JSON, I would need to write a parser and printer.

          Doug Cutting added a comment -

          Owen> [Avro data files] don't support indices. They don't support key, value pairs.

          SequenceFile does not support indexes either. Indexes are provided by the MapFile API, etc., which are implemented as a pair of SequenceFiles. The same can be done with a pair of Avro data files.

          Avro data files do support key/value pairs, for example:

          http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/Pair.html
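
For what it's worth, a key/value record is also easy to model directly with the generic API; a minimal sketch, where the KeyValue schema is just an illustration rather than anything shipped with Avro:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroKeyValueExample {
      public static void main(String[] args) throws Exception {
        // A record with "key" and "value" fields plays the role of a
        // SequenceFile key/value pair.
        Schema schema = Schema.parse(
            "{\"type\":\"record\",\"name\":\"KeyValue\",\"fields\":["
            + "{\"name\":\"key\",\"type\":\"string\"},"
            + "{\"name\":\"value\",\"type\":\"long\"}]}");

        DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("pairs.avro"));

        GenericRecord pair = new GenericData.Record(schema);
        pair.put("key", "apple");
        pair.put("value", 3L);
        writer.append(pair);
        writer.close();
      }
    }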

          Again, we're driven off topic, though. The declared topic of this issue is the representation of serialization metadata, not data file formats.

          Owen O'Malley added a comment -

          SequenceFile does not support indexes either. Indexes are provided by the MapFile API, etc., which are implemented as a pair of SequenceFiles. The same can be done with a pair of Avro data files.

It could be, but it hasn't been done. If the user doesn't want to implement their own, the two choices are the OFile implementation from HADOOP-6684 or the MapFile from this patch.

          The declared topic of this issue is the representation of serialization metadata, not data file formats.

          You vetoed the patch because it included changes to SequenceFile. Therefore the limitations of the Avro file format are completely germane.

          Tom White added a comment -

          > This isn't true. SequenceFile needs to be changed to support the new serialization API. The class name isn't sufficient to determine the serialization.

          Class name is sufficient for Thrift and Protocol Buffers, so you could try these serializations out for this issue without changing SequenceFile.
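
As a sketch of why the class name alone is enough for Protocol Buffers (the class name is what the file would record; the helper class below is purely illustrative), the generated static parseFrom() can be reached by reflection:

    import java.lang.reflect.Method;

    class ProtoByClassName {
      // Given only a class name and the raw bytes, a PB deserializer can
      // recover the message through the generated static parseFrom() method;
      // no additional metadata is required.
      static Object deserialize(String className, byte[] bytes) throws Exception {
        Class<?> messageClass = Class.forName(className);
        Method parseFrom = messageClass.getMethod("parseFrom", byte[].class);
        return parseFrom.invoke(null, (Object) bytes);   // static, so no receiver
      }
    }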

> Furthermore, you can't implement context-sensitive serializations (MAPREDUCE-1462) without the changes to SequenceFile.

I think they are orthogonal. See, for example, the patches I posted to MAPREDUCE-1462, which implement context-sensitive serializations with no changes to SequenceFile.

          Scott Carey added a comment -

          Tom brought up a lot of good points.

          I have two serious issues with the current patch, which I have mentioned above. However, given that they have not been adequately addressed I feel I have no option but to vote -1.

          The first is that no change is needed in SequenceFile unless we want to support Avro, but, given that Avro data files were designed for this, and are multi-lingual, why change the SequenceFile format solely to support Avro? Are Avro data files insufficient? Note that Thrift and Protocol Buffers can be stored in today's SequenceFiles.

          The second is that this patch adds new serializations which introduce into the core a new dependency on a particular version of each of Avro, Thrift, and PB, in a non-pluggable way.

          This type of dependency is qualitatively different to other dependencies. Hadoop depends on log4j for instance, so if a user's code does too, then it needs to use the same version. A recent JIRA made it possible to specify a different version of log4j in the job, but this only works if the version the user specifies is compatible with both their code and the Hadoop kernel code.

This is HUGE. Hadoop should not place ANY jar in its classpath that is likely to cause such conflicts.

The answer to this problem is actually simple, but not mentioned here. Hadoop should re-package such dependencies.

Use the Maven shade plugin (http://maven.apache.org/plugins/maven-shade-plugin/index.html) or its Ant equivalent, jarjar (http://code.google.com/p/jarjar/), to embed such dependencies in a way that won't conflict with user libraries. There are a half-dozen other jars sitting in Hadoop's /lib directory that should get the same treatment. If these jars are NOT part of the API that Hadoop is exposing, it should NOT stuff them on the user's classpath.

          -------------
As for Avro file format performance and features related to Hadoop – I'd be happy to work on some of them. There haven't been any significant requests on the user or dev mailing lists for Avro so far.

          I'm not sure how important the overhead of writing a two-record file is (what use case is that for?) but surely there is more room for performance gains there since the initial focus was all about streaming bulk data. I have a lot of ideas that can improve performance for many things in Avro, as do other committers.

          Commenting on the side about issues rather than filing bugs or requests for improvement isn't constructive. Please file JIRA issues against Avro.

          Owen O'Malley added a comment -

          Tom brought up a lot of good points.

          All of which had been answered previously. http://bit.ly/fXNJpN

          The answer to this problem is actually simple, but not mentioned here. Hadoop should re-package such dependencies.

That is interesting. I had never seen shade before. What are the tradeoffs between when it makes sense and when it doesn't? Clearly putting the dependencies into shade is a completely independent issue.

          Owen O'Malley added a comment -

          I should have been clearer. The metadata is completely defined by the serialization plugin. As such, it can choose to use JSON, XML, or an arbitrary binary format. Any of them would work.

          For the plugins that I wrote, I considered and rejected JSON. The reasons were:

          • JSON would require me to write a parser and dumper.
          • JSON would be less efficient in time and space.

          I don't see any reason to use a less efficient solution, but others are welcome to implement serialization plugins that represent the metadata as they wish.
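
As a rough illustration of what the protobuf-encoded metadata amounts to in practice; TypedSerializationMetadata and its class_name field are hypothetical stand-ins for whatever a plugin's .proto actually defines:

    // Hypothetical: assume the plugin's .proto declares
    //   message TypedSerializationMetadata { optional string class_name = 1; }
    // and protoc generated TypedSerializationMetadata with the usual builder API.
    class ProtobufMetadataSketch {
      static byte[] write(String className) {
        return TypedSerializationMetadata.newBuilder()
            .setClassName(className)        // serialization-specific fields
            .build()
            .toByteArray();                 // opaque bytes stored by the framework
      }

      static TypedSerializationMetadata read(byte[] metadata) throws Exception {
        return TypedSerializationMetadata.parseFrom(metadata);
      }
    }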

          Luke Lu added a comment -

I see the discussion mired in implementation details and tangential issues. The key contribution of the patch is a framework defining interfaces to enable type-safe and context-sensitive serialization (cf. MAPREDUCE-1462). The code is clean and modular by itself. The lib package happens to contain a bunch of specific serialization plugins, which is good to show that the whole thing works and serves as a basis for future refactoring (mostly packaging) in a separate JIRA issue.

          +1 for progress.

          Scott Carey added a comment -

That is interesting. I had never seen shade before. What are the tradeoffs between when it makes sense and when it doesn't? Clearly putting the dependencies into shade is a completely independent issue.

          There are two things in shade / jarjar:
          1. Making a 'super-jar' and pulling dependencies into your jar.
2. Re-mapping and embedding dependencies so they don't collide – e.g. moving com.google.protobuf to org.apache.hadoop.com.google.protobuf inside the Hadoop jar.

          The first is mostly orthogonal to the second.
          A super-jar is useful if you need the jar file to be your application package, but a lib directory is usually a better idea. For distributing code, independent jars are IMO always preferred. For packaging applications, a shaded 'super-jar' can be useful.

The second is what I would like to see projects like Hadoop (and Avro, for that matter) do in selected cases where dependency collision is likely and dangerous.
If you re-map a dependency and embed it, users will never accidentally have a jar collision. It's easier than building custom classloader infrastructure, and makes sense in some cases. Users could 'find' it if they knew to look in the relocated package, so it's not as secure or safe as classloader-based library management, but the relocated package is not part of the public API and doesn't need to be supported.
          The caveats are:

• If the dependency is part of your API and exposed to users, re-mapped dependencies become part of your API and that's usually a bad idea.
          • If the dependency is a very large jar or has many dependencies of its own, your jar can get large with various consequences due to the size.

          The questions I ask myself over this are:
          What is the likelihood that a user will have their own, different version for their own use?
          What are the consequences of multiple version collisions?
          Is this part of the API that users depend on?

If it's not part of the API, has complicated consequences on collision, and there is a reasonable expectation that users will have their own version, then remapping and embedding deserve strong consideration. There are cases where even if it is part of the API it may be preferable. Probably not for 'end user' APIs, but for 'contributor' APIs – such as something that contrib modules or frameworks would use but not end users.

          • JSON would require me to write a parser and dumper.
          • JSON would be less efficient in time and space.

          The latter is very true, but I believe the former can be avoided. With Jackson for example, you can just use the object binding API and throw a bean or annotated class at it and it will serialize/deserialize. Most languages have some sort of JSON API that binds data structures to JSON without having to use a parser.

          Ryan Holmes added a comment -

          Avro is already a dependency. Thrift is already a dependency for HDFS (see HDFS-1484). I'm only adding ProtocolBuffers, which is a commonly used serialization format that many users including me find extremely useful.

          This line of reasoning is overly general and could be used to support the addition of literally any dependency (i.e. dependency x already exists, so it's OK to add y).

          Hadoop should focus on providing a pluggable API for serialization rather than providing specific internal implementations (optional implementations would be fine). I also think Hadoop will benefit greatly in the long term by promoting a single, default serialization and file format for new users. I was under the impression that this was a shared goal and that the chosen format was Avro. Adding a direct dependency on Protocol Buffers and increasing the scope of dependency on Thrift seems to directly contradict that goal.

          In MAPREDUCE-980, you took out the custom JSON parser and replaced it with calls into Avro. Using ProtoBuf is efficient and meant that I wrote 2 lines of code. If I used JSON, I would need to write a parser and printer.

          Can't you use Jackson, which is already a dependency?

          Chris Douglas added a comment -

          This line of reasoning is overly general and could be used to support the addition of literally any dependency (i.e. dependency x already exists, so it's OK to add y).

That wasn't how I read the argument. There exists a set of serialization frameworks, and Hadoop depends on some of them. Arguing that the existing set is final is a decision requiring consensus. If adding serializations as default dependencies was permitted without this level of debate in the past, it seems fair to ask why this particular dependency is worthy of a veto. I don't know if everyone realizes that, by changing the list of serializations, the Thrift dependency becomes optional. Not optional for building Hadoop, but optional for the serialization context. The protocol buffer dependency could be linked against the "lite" runtime, if that would address some of the concerns.

          Hadoop will benefit greatly in the long term by promoting a single, default serialization and file format for new users. I was under the impression that this was a shared goal and that the chosen format was Avro.

The project never voted to adopt Avro as the one and only serialization, but nobody has challenged the assertion that it should be supported as a first-class form. AFAIK, every contributor advocates it. As Tom noted, the type is sufficient for Protocol Buffers, Thrift, and Writables, but Avro requires the schema. This patch is general enough to support Avro (or anything else). The interest in the bytes used to encode the type (granted, relying on protocol buffers, which is a new dependency) is out of proportion to its importance. IMO, just tossing these data into JSON only solves half the problem; protocol buffers allow for extensions that could open fruitful paths for experimentation.

It's news to me that Doug has decided not to integrate Avro into Hadoop. Some API changes were vetoed months ago, but many, including "opponents" in this limited context, still discuss it as a natural part of core data formats, as an RPC candidate, etc. Frankly, the PMC hasn't discussed anything like a policy. It's disappointing to learn that he's withdrawn from this community and elected to work on forwarding his vision in Avro, but the consequences of that should be limited to his engagement. Part of what has fueled the worst of this thread has been the complementary freezing of the formats and serializations hosted in Hadoop; such a division between framework/"user" code makes sense given how Doug wants to engage, but that model has not been agreed on by the PMC. Given a choice and not a mandate, I'm not convinced he would find this community antagonistic to his goals, but he can spend his time and talent as he chooses. If the Avro data file were included alongside TFile and SequenceFile, personally I see no reason why we could not write the documentation that would help downstream developers (including frameworks) realize the benefits he foresees.

          SequenceFile is an experimental format. As long as changes are compatible with existing data, changes/enhancements to it have been tolerated to date. The concern that it distracts from the format that ships with Avro is overblown; most users interact with data through frameworks, and those frameworks are managing most of the compatibility headaches. The patch reaches no further than previous modifications.


          The existential question of the scope of Hadoop needs to be answered by the PMC, not navigated by vetoes. The packaging question is part of a larger dependency problem the project needs to answer more directly; the solution Scott advocates could make progress toward that, but it's a separate issue. The dependency on protocol buffers for encoding the type is the only substantial objection, as it commits that dependency to persistent data.

After taking a deep breath, this issue feels pretty big-endian/little-endian. How jobs are configured in Hadoop is relevant, and this step moves away from the XML Configuration Hadoop has used for years, but no direction proposed will cause the project to fail as spectacularly as the mutual veto. Tom, Doug: if you're willing to drop the conditions of your vetoes on everything but the protocol buffers vs. JSON question, could a technical discussion on the merits of those formats for metadata get us around this impasse?

          Chris Douglas added a comment -

          The ASF would veto any release of Hadoop that depended on an unreleased in-incubation artifact. This would complicate any plan to branch to 0.22, or at least release it, unless the build file was set up to exclude thrift-specific code.

          Thrift is a TLP: http://s.apache.org/e4f

          Nigel Daley added a comment -

          Part of what has fueled the worst of this thread has been the complementary freezing of the formats and serializations hosted in Hadoop

          Hmm, reading thru this and related issue threads, I came to different conclusions on the fuel:

          • historical bitterness
          • lack of compromise on this patch
          • thinly veiled attempts to snub others
          • posting this patch on such a contentious issue so close to feature freeze of 0.22

          The existential question of the scope of Hadoop needs to be answered by the PMC, not navigated by vetoes.

          Nor, then, by unilateral reverting of patches like MAPREDUCE-1126 (the ultimate veto). Maybe it should have been reverted, but the way it was done still leaves me shaking my head.

          Doug Cutting added a comment -

          Chris, while I no longer seek direct support in Hadoop for Avro's file format, I do still hope that Hadoop will move to Avro RPC. Progress stopped on HADOOP-6659 when Eric vetoed my incremental approach. I hope to resume progress on this in a branch when some collaborators volunteer, perhaps this coming winter.

          Owen O'Malley added a comment -

          lack of compromise on this patch

          All of the technical feedback for this patch has been addressed, including technical feedback that Tom gave me offline.

          Making "compromises" that would destroy the usability of the patch, such as not changing SequenceFile or moving the plugins to a contrib module, isn't really compromise. They are just thinly veiled attempts to make it difficult to use ProtoBufs or Thrift in users' applications.

          Furthermore, Doug is vetoing this patch based on his personal view about how Hadoop should change direction. Apache doesn't allow dictators and no one is allowed to decide a new direction without discussion and reaching consensus.

          posting this patch on such a contentious issue so close to feature freeze of 0.22

          This is a non-issue. If you want to fork the current 0.22 branch and become the release manager for it, that is great. If we can make some progress, serialization can go into 0.23 the following month.

          Chris Douglas added a comment -

          Nigel> There's some truth in what you say (I qualified it as "part of" the problem), but this is the wrong forum. There exist technical disagreements impeding this particular issue. To identify and resolve them, participants must focus on the code and on none of the items you raise. Please start a thread on private@ if you want to discuss personnel issues, where you can be more explicit about your concerns.

          Doug> On HADOOP-6659: Eric doesn't have a veto. Jeff was the first to ask for a design document. This issue adds not only the APIs but also, as Owen explained, changes to SequenceFile to illustrate its completeness to reviewers. This approach, as agreed in the contributor meetings, was to avoid outcomes like MAPREDUCE-1126, where large changes trickle in over months, making review and participation impossible. Upfront design and consensus followed by more complete patches was supposed to enable progress. For a few days, that was in place here.

          You did not respond to my proposal. Can we narrow the point at issue to our differences on the serialization format of serializer metadata? Agreement on project direction/scope, programming taste, and packaging can be hashed out elsewhere. Adding such large, contentious issues as prerequisites ensures failure. Factoring out file formats into external projects and improving the way dependencies are packaged in Hadoop could proceed even if this patch were committed.

          Doug Cutting added a comment -

          Chris> The existential question of the scope of Hadoop needs to be answered by the PMC, not navigated by vetoes.

          Our processes do not include setting of an official project direction. We consider patches and proposals that are submitted. We must reach consensus on each issue before it can be committed. Vetos are a last resort when no mutually acceptable compromise can be found. That's the process.

          Chris> Tom, Doug: if you're willing to drop conditions to your vetoes on all but the protocol buffers vs JSON [ ... ]

          I'm not clear on what you're proposing here, but I am in general open to discussion of potential compromises. If you feel the project requires some new functionality that it is not feasible to implement outside of this project, you might outline in more detail what the patch you propose would include.

          Tom White added a comment -

          Owen> All of the technical feedback for this patch has been addressed, including technical feedback that Tom gave me offline.

          Was this the feedback I gave on MAPREDUCE-1462 back in February? I haven't given any feedback offline for this issue.

          Owen> Making compromises to not change SequenceFile or moving the plugins to the contrib module that would destroy the usability of the patch, isn't really compromise. They are just thinly veiled attempts to make it difficult to use ProtoBufs or Thrift in user's applications.

          The original work for Thrift and Protocol Buffers serializations (MAPREDUCE-376 and MAPREDUCE-377) was as contrib modules, so if we want to change that approach, then we need to get consensus on doing so. That consensus hasn't been forthcoming so they should be left as optional contrib modules. Serializations in this form are easy to use by the way: users just add the relevant contrib jar and the serialization jar to the job, just like any other dependency.

          Owen O'Malley added a comment - - edited

          Was this the feedback I gave on MAPREDUCE-1462 back in February? I haven't given any feedback offline for this issue.

          I thought we went through the APIs offline after that.

          The original work for Thrift and Protocol Buffers serializations (MAPREDUCE-376 and MAPREDUCE-377) was as contrib modules, so if we want to change that approach, then we need to get consensus on doing so.

          Neither of those was ever committed, but of course we need consensus to continue.

          Serializations in this form are easy to use by the way: users just add the relevant contrib jar and the serialization jar to the job, just like any other dependency.

          It is far more efficient to use the distributed cache to distribute the jars than bundling them up into the user's jar that has to be downloaded each time.

          Konstantin Boudnik added a comment -

          Owen O'Malley added a comment - 19/Nov/10 12:44 AM - edited

          Owen O'Malley added a comment - 29/Nov/10 01:33 PM - edited

          Looks like JIRA comments can be edited after all. Hmm, I distinctly remember the thread where comment editing was declared an evil feature. Would you share why it is available to someone (you) but not to the rest of the community? Thanks

          Owen O'Malley added a comment -

          Looks like JIRA comments can be edited after all.

          If you are interested in re-enabling edits to Jira, please start a discussion thread on common-dev.

          Chris Douglas added a comment -

          Doug> I'll try to be clearer. As I understand the veto, you and Tom object to three points:

          1. Format of the serialized metadata, specifically you both prefer JSON/stringly-typed data to protocol buffers
          2. Packaging, in particular adding protocol buffers and thrift
          3. Changing SequenceFile, an experimental format packaged with Hadoop

          I'm suggesting that this issue focus on (1) exclusively. If we reach consensus on this point, then we can commit a patch and address (2) and (3) elsewhere. There is agreement on the domain of (1), so a technical debate may follow. In contrast, (2) and (3) can be deferred, as (2) is part of a larger failing and (3) can be addressed after the patch is committed. While (1) admits coherent arguments, (2) and (3) have been fought with indignation. As a tactic for consensus, it seems reasonable to stop arguing project policy by proxy through this patch and to focus on the item admitting technical, falsifiable claims.

          Doug Cutting added a comment -

          Chris, thanks for clarifying. Yes, I'd prefer to focus exclusively on (1) in this issue.

          I don't yet see an advantage for opaque binary. A common, transparent configuration data format will simplify the creation of configuration editing tools.

          I do see some advantages to using an object serialization system for configuration data: it would provide stronger typing and simplify implementation. But I don't yet see it as a clear win. A best practice for object bindings is generally to wrap a generated object in a wrapper that can host convenience methods and support a back-compatible API. This is not so different in developer cost from the static methods we use today.

          I don't yet see the urgency of this change. Serialization metadata is not complex, nor does it constitute much user code. Neither of the issues that this feature is designed to support (MAPREDUCE-1462 and MAPREDUCE-1183) requires binary data, since both have been implemented without it. We already have support for passing Properties-like data throughout the system. Adding a second channel for configuration data will complicate things.

          As we consider replacing serialization metadata we should probably look for a solution that's appropriate for replacing all configuration data. The primary feature that seems currently missing is configuration nesting. Stronger typing would also be useful. I can imagine a JSON-based configuration system. It would have advantages over XML, in that simple types (string, number, boolean) are distinguished. JSON can be bound to objects, providing stronger typing yet, and this can be done through reflection which might avoid the need for a wrapper. It would support the implementation of configuration-editing tools.

          In summary, it would be useful to be able to nest configuration data. Nesting it in properties is possible and does not disrupt existing APIs, but it's not very elegant. Stronger typing of configuration parameters would be nice. It would be nice to keep metadata transparent.

          Owen O'Malley added a comment -

          I don't yet see an advantage for opaque binary.

          The two different solutions that have been proposed for serialization metadata are:

          • string to string map (HADOOP-6165, HADOOP-6120, HADOOP-6323, HADOOP-6443, MAPREDUCE-1126)
            • type unsafe - users may put the wrong type of value into a slot
            • unchecked keys - users may misspell a key and get the default value by mistake
            • complete visibility - details of implementation are completely visible to user and impossible to change
          • opaque blob (HADOOP-6685)
            • may be encoded as binary or text
            • given a versioned format (ProtoBuf, JSON, Thrift, XML), is completely extensible
            • since the interface is via an API
              • it is type-safe
              • all of the setters and getters are checked for validity by the compiler
              • can specify the visibility to the user
              • it can be easily documented via javadoc

          In both cases, the metadata is specific to the serialization and can't be interpreted without reference to the corresponding serialization.
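
          (Editorial illustration: a minimal, hypothetical Java sketch of the two styles compared above. The keys and class names are invented for the example and are not taken from the patch; the typed variant stands in for whatever generated ProtoBuf/Thrift/Avro class a serialization might use.)

          // Hypothetical comparison only; names are not from the patch.
          import java.util.HashMap;
          import java.util.Map;

          public class MetadataStyles {

            // Map<String,String> style: nothing stops a misspelled key or a wrongly typed value.
            static void stringMapStyle() {
              Map<String, String> meta = new HashMap<>();
              meta.put("serialization.class", "org.apache.hadoop.io.IntWritable");
              meta.put("avro.schemaa", "{\"type\":\"int\"}");      // misspelled key goes unnoticed
              String schema = meta.get("avro.schema");             // silently null
              System.out.println("schema = " + schema);
            }

            // Typed-API style: the compiler checks the setters and getters, and the wire
            // form (protobuf, JSON, YAML, ...) stays an implementation detail.
            static final class AvroMetadata {
              private String kind;
              private String schema;
              AvroMetadata setKind(String k) { this.kind = k; return this; }
              AvroMetadata setSchema(String s) { this.schema = s; return this; }
              String getKind() { return kind; }
              String getSchema() { return schema; }
            }

            public static void main(String[] args) {
              stringMapStyle();
              AvroMetadata m = new AvroMetadata().setKind("SPECIFIC").setSchema("{\"type\":\"int\"}");
              System.out.println("kind = " + m.getKind() + ", schema = " + m.getSchema());
            }
          }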

          A common, transparent configuration data format will simplify the creation of configuration editing tools.

          Since the metadata is specific to each serialization, there are no common interfaces to support those editing tools. So the string to string maps give the appearance of a common format, but without the semantics it isn't possible to write tools to edit it. In both implementations, it is easy to write dumpers.

          As we consider replacing serialization metadata we should probably look for a solution that's appropriate for replacing all configuration data

          When MAPREDUCE-1183 is done, there is very little need for the string to string map of Configuration. We will have it for a long time to support old applications, but users won't need it.

          It would be nice to move the servers away from the current XML encoded string to string configurations, but that is way outside the scope of this jira.

          Doug Cutting added a comment -

          > In both cases, the metadata is specific to the serialization and can't be interpreted without reference to the corresponding serialization.

          Map<String,String> permits easy non-Java access to configuration data. For example, lots of folks use -D to specify configuration options to streaming jobs and other command-line tools. Lots of folks use text editors to edit configurations. Etc.

          Sanjay Radia added a comment -

          @Doug> Progress stopped on HADOOP-6659 when Eric vetoed my incremental approach
          Doug, at least a couple of folks asked for a design doc and fuller plan; you did not provide it.

          Doug, you added a dependency on Avro in RPC without providing a plan for how all the pieces fit together. Indeed, it is more than just a library dependency - there are Avro-related annotations. While I am a fan of Avro (being part of the team that created it at Yahoo), I see no problem in adding first-class support for PB, since it is a robust serialization framework.
          When I let the Avro dependency creep into Hadoop, I did not think that Doug would be opposed to adding support for PB in the future. Both need to be supported in a first-class way. My feeling is that Avro will prevail in the long term, but right now PB addresses the needs of many users.

          Owen O'Malley added a comment -

          Rather than stall the dependent work while this issue drags out, I'm going to commit the work into a branch to make progress on the related issues.

          Owen O'Malley added a comment -

          Map<String,String> permits easy non-Java access to configuration data. For example, lots of folks use -D to specify configuration options to streaming jobs and other command-line tools. Lots of folks use text editors to edit configurations. Etc.

          I guess I could add a method to Serialization that parses the metadata from a String in a serialization-specific, user-friendly DSL.

            public void fromString(String repr) throws IOException;
          

          That would allow the framework to take two strings (name, metadata) and recreate the Serialization. We should also require that fromString is the inverse of toString so there is a matched pair.
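
          (Editorial sketch of what such a matched pair could look like, assuming a hypothetical plugin that stores only a class name; the class and its one-line text form are invented for illustration and are not the patch's actual TypedSerialization.)

          import java.io.IOException;

          // Hypothetical serialization plugin whose toString/fromString round-trip its
          // metadata through the one-line form {class: <name>}.
          public class ClassNameSerializationSketch {
            private String className;

            @Override
            public String toString() {                                // matched pair: toString ...
              return "{class: " + className + "}";
            }

            public void fromString(String repr) throws IOException {  // ... and fromString
              String body = repr.trim();
              if (!body.startsWith("{class:") || !body.endsWith("}")) {
                throw new IOException("unparseable metadata: " + repr);
              }
              className = body.substring("{class:".length(), body.length() - 1).trim();
            }

            public static void main(String[] args) throws IOException {
              ClassNameSerializationSketch s = new ClassNameSerializationSketch();
              s.fromString("{class: org.apache.hadoop.io.IntWritable}");
              System.out.println(s);   // prints the same one-line form it was given
            }
          }

          The round-trip property (fromString applied to toString reproduces the same state) is the "matched pair" requirement described above.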

          Doug Cutting added a comment -

          > I guess I could add a method to Serialization that parses the input from a String containing a serialization-specified format that was a user-friendly DSL.

          Yes, that would be useful.

          Since configurations are meant to be nestable, a given string format would need to be able to nest other, different string formats. So each string format would probably need to escape strings accordingly. The need for escaping could be avoided if instead each serialization had a toJsonString and fromJsonString method, since nesting JSON within JSON is easy.

          Also, once either toString/fromString or toJsonString/fromJsonString methods are added to Serialization, then there's no need for each serialization to also implement serializeSelf and deserializeSelf, since these can simply become convenience methods implemented in terms of the String-based methods. For better performance, if that's a concern for configuration data, the core methods might instead be toString(Writer) and fromString(Reader).
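
          (Editorial sketch of the reduction described above, under the assumption that the String-based methods are the core contract; the interface and method names are illustrative, not the actual Serialization API.)

          import java.io.IOException;
          import java.nio.charset.StandardCharsets;

          // If the textual methods are the core contract, the byte-oriented ones can
          // become default convenience wrappers rather than per-plugin implementations.
          public interface StringBackedSerialization {
            String toStringMetadata();                                // core textual form
            void fromStringMetadata(String repr) throws IOException;

            default byte[] serializeSelf() {
              return toStringMetadata().getBytes(StandardCharsets.UTF_8);
            }

            default void deserializeSelf(byte[] metadata) throws IOException {
              fromStringMetadata(new String(metadata, StandardCharsets.UTF_8));
            }
          }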

          Owen O'Malley added a comment -

          Since configurations are meant to be nestable, a given string format would need to be able to nest other, different string formats.

          I haven't heard of any use cases where the serializations would be nested.

          I'd rather not bake JSON into the interface, although clearly a given serialization could choose to use JSON. I don't find JSON very user-friendly.

          there's no need for each serialization to also implement serializeSelf and deserializeSelf

          It is already the case that only TypedSerialization and AvroSerialization define serializeSelf and deserializeSelf.

          Doug Cutting added a comment -

          > I haven't heard of any use cases where the serializations would be nested.

          True, serializations might not be directly nested, but they might be nested in other configuration data. We're trying to devise a better configuration serialization pattern than Map<String,String> for use in MAPREDUCE-1183 and MAPREDUCE-1462 . In these cases, jobs, serializations, mapper, inputformat, reducer and outputformat configurations are nested.

          > I'd rather not bake JSON into the interface, although clearly a given serialization could choose to use JSON. I don't find JSON very user-friendly.

          Java programmers should never need to touch the JSON (except perhaps when debugging) since Java programmers can use APIs to access configurations. For Java programmers it should be equivalent to binary. But it'd be nice if configurations of all sorts, daemons, jobs, serializations, inputformats, mappers, etc. were in a common format that it was possible for non-Java programs to see and potentially modify as they can currently with -D, text-editors, etc. Currently we use Map<String,String> for this purpose, but that makes nesting awkward. I see this issue primarily as a replacement for the reverted HADOOP-6420, a previous attempt to provide configuration nesting. Is that a mis-characterization?

          Owen O'Malley added a comment -

          serializations might not be directly nested, but they might be nested in other configuration data.

          It is the responsibility of the container to quote the strings, not the other way around. This is important because the strings may be used in different text containers, each with their own quoting conventions. For example, the file that describes a job may be XML, JSON, or YAML. The serialization can not know all of the contexts that its textual representation will be used in and any quoting that it does will likely cause more noise and confusion.

          Doug Cutting added a comment -

          > It is the responsibility of the container to quote the strings, not the other way around. This is important because the strings may be used in different text containers, each with their own quoting conventions.

          If we used JSON as the standard string representation, then nesting it within other JSON would require no escaping.

          What important capability do we lose by using a uniform textual format for configuration data? For user input and output data we should not require a uniform format: that would radically narrow the scope of the system and we want the system to be able to process data in any format. But I don't yet see the advantage of supporting configuration data in a multitude of textual and/or non-textual formats, and I see some distinct disadvantages to doing this, since it impairs access to configuration data from non-Java programs.

          Consider again the -D case. With Map<String,String> this is easy to support. We create the configuration, then set any properties specified on the command line. With JSON it's a little harder, but still possible. We create the configuration, then, if someone sets output.compression.level=9 and the serialized configuration looks something like:

            { "input:": ...,
              "mapper": ....,
              "reducer": ....,
              "output": {
                "format": ... ,
                "compression": {
                    "codec": "gzip",
                    "level": 5,
                    ...  }
                ...}
          ...}
          

          Then we can rewrite this, changing the compression level, and submit that as the job instead.
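
          (Editorial sketch of that rewrite using Jackson, which is an assumed library choice here rather than anything either proposal prescribes; the dotted key is split and walked down the nested JSON before the leaf is overwritten.)

          import com.fasterxml.jackson.databind.ObjectMapper;
          import com.fasterxml.jackson.databind.node.ObjectNode;

          public class JsonConfOverride {
            public static void main(String[] args) throws Exception {
              String conf = "{\"output\":{\"compression\":{\"codec\":\"gzip\",\"level\":5}}}";
              ObjectMapper mapper = new ObjectMapper();
              ObjectNode root = (ObjectNode) mapper.readTree(conf);

              // -D output.compression.level=9
              String[] path = "output.compression.level".split("\\.");
              ObjectNode node = root;
              for (int i = 0; i < path.length - 1; i++) {
                node = node.with(path[i]);          // descend, creating objects if absent
              }
              node.put(path[path.length - 1], 9);

              System.out.println(mapper.writeValueAsString(root));
            }
          }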

          Alternately, we could require each serialization to support methods like:

          void setProperty(String[] path, String value);
          String getProperty(String[] path);
          

          But then we might as well still be using Map<String,String>.

          Owen O'Malley added a comment -

          If we used JSON as the standard string representation, then nesting it within other JSON would require no escaping.

          We don't have JSON file formats except for JobHistory. Locking the API to require JSON is a huge and unreasonable cost for very little gain.

          What important capability do we lose by using a uniform textual format for configuration data?

          We don't have one right now. We have XML and JSON. Neither is user-friendly. I think it would be extremely premature to lock JSON in at this level.

          If you'll unveto my patch, I'll add the fromString to the interface and implement it. Since the goal of the text is to be user-friendly, I'll probably use YAML instead of JSON. If you think having plugins that use JSON is critical, you can write plugins that do so.

          YAML is a more human-friendly form of JSON and the Writable metadata would look like:

          {class: org.apache.hadoop.io.IntWritable}
          

          and the Avro serialization metadata would look like:

          {kind: SPECIFIC, schema: '{"type":"record","name":"AvroKey","namespace":"org.apache.hadoop.io","fields":[{"name":"value","type":"int"}]}'}
          
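          (Editorial sketch: parsing the YAML form above. The comment only shows the YAML text; SnakeYAML is an assumed library choice for illustration, not something the patch specifies.)

          import java.util.Map;
          import org.yaml.snakeyaml.Yaml;

          public class YamlMetadataDemo {
            public static void main(String[] args) {
              String writableMeta = "{class: org.apache.hadoop.io.IntWritable}";
              Object parsed = new Yaml().load(writableMeta);   // a flow mapping parses to a Map
              Map<?, ?> map = (Map<?, ?>) parsed;
              System.out.println(map.get("class"));            // org.apache.hadoop.io.IntWritable
            }
          }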
          Doug Cutting added a comment -

          > We don't have one right now. We have XML and JSON. Neither is user-friendly.

          We don't currently use JSON for configuration data. Today we use Map<String,String> as the configuration data model. This is usually serialized as XML and sometimes in other forms (e.g., inside a SequenceFile). The simplicity of this model permits differing serializations without significant loss of transparency or interoperability. This model interoperates well with Java properties, including system properties, with environment variables, etc. Appending a prefix to keys has been demonstrated to be an effective if inelegant way to implement nesting in this model. This model does not easily map to objects, nor does it provide any type support.
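
          (Editorial illustration of the key-prefix nesting idiom mentioned above; the keys are invented for the example and are not Hadoop configuration defaults.)

          import java.util.Map;
          import java.util.TreeMap;

          public class PrefixNesting {
            public static void main(String[] args) {
              Map<String, String> conf = new TreeMap<>();
              conf.put("job.output.compression.codec", "gzip");   // "job.output.compression."
              conf.put("job.output.compression.level", "5");      // acts as the nested scope
              conf.entrySet().stream()
                  .filter(e -> e.getKey().startsWith("job.output.compression."))
                  .forEach(e -> System.out.println(e.getKey() + " = " + e.getValue()));
            }
          }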

          If we wish to use a more complex data model, that's nestable, that's more strongly typed and that can be easily mapped to objects, then a standard serialization, like JSON or YAML, is a good way to still ensure transparency and interoperability.

          YAML could work well as a data model. Nesting YAML requires adjusting indentation, while JSON permits simple string appends to nest. But if a Java API like YamlBeans is used, then indentation would be handled automatically.

          If we can read/write YAML, what reason is there to support arbitrary binary configuration data?

          Owen O'Malley added a comment -

          If we can read/write YAML, what reason is there to support arbitrary binary configuration data?

          Binary is better for:

          • parsing speed
          • size on disk
          • size in memory
          • filtering unwanted sections

          Text is better for:

          • editing

          I've added the fromString to support human-friendly encodings. We don't know what formats we are going to use in the future, but I agree human-friendly versions could be useful.

          Owen O'Malley added a comment -

          Adds toString/fromString methods that generate the metadata in YAML.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12465307/serial9.patch
          against trunk revision 1042047.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 38 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 1127 javac compiler warnings (more than the trunk's current 1048 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          -1 release audit. The applied patch generated 13 release audit warnings (more than the trunk's current 1 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/130//testReport/
          Release audit warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/130//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/130//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/130//console

          This message is automatically generated.

          Doug Cutting added a comment -

          Owen,

          The benefits of binary that you cite all seem performance-related, yet serialization metadata is not performance-sensitive. Even if it were, Jackson's JSON performance is comparable to that of binary serialization systems. I fail to see the need to support two serialization formats for metadata, one textual and one binary. Let's just pick one of Yaml or JSON, okay?

          Also, toString() and fromString() might better be named toYaml and fromYaml. We should choose a single textual format for metadata and the methods should indicate that.

          Owen O'Malley added a comment -

          Let's just pick one of Yaml or JSON, okay?

          The API is binary. You asked for a user-friendly string version and I added it, because it might be useful in the future. The framework will use the binary version and use the user-friendly version for import/export tasks. Designing an API to make efficient implementation impossible is an anti-goal.

          Furthermore, there isn't any value in forcing a particular text form, as I've discussed previously.

          Of course, you are welcome to make an alternative set of serialization plugins that only use JSON or YAML for both the binary and string versions. I don't see any value in it, but it is certainly possible.

          Doug Cutting added a comment -

          > The API is binary.

          The API is not binary today. You propose a change to binary, but have not motivated that change in a convincing manner.

          > Designing an API to make efficient implementation impossible is an anti-goal.

          There is no evidence that serialization metadata is performance critical. Even if it were, there is evidence that textual formats like JSON can be processed at comparable speeds to binary. Textual formats are useful for configuration data. Supporting multiple formats is error-prone and introduces needless complexity. We should support a minimal number of formats for configuration data to reduce confusion. To date, the external format has been XML and the internal representation has been Map<String,String>. If you wish to improve this you need to provide a compelling argument. Lack of support for nesting and strong typing are accepted weaknesses of the current approach. These might be improved by JSON, YAML or a different XML format combined with a bean-like internal representation.

          Allen Wittenauer added a comment -

          I think it is inevitable that Hadoop will need to provide support for multiple configuration types, and tying ourselves to just one (even internally) would be a huge mistake. For example, if we were to support LDAP as a configuration service, the configuration data on the user side would be in X.500 format. Converting that to JSON, YAML, or XML seems like a huge waste, especially considering the verbosity of those representations.

          It is easy to imagine cases where serialization-specific data is binary. By forcing this data to be text, doesn't that mean any future binary data would need to be base64 encoded or the like?

          Doug Cutting added a comment -

          Allen, binary might enable you to configure the components that you write more easily with LDAP or other external systems, but it works against integrating external configuration systems with existing components. If you wanted to configure existing system and library components with LDAP, then you'd need some generic way to access the configuration parameters of those components, perhaps a property-like interface (Map<String,String>) or a standard configuration interchange format like JSON or YAML.

          > By forcing this data to be text, doesn't that mean any future data that is binary would need to be base64 encoded or the like?

          Do we have much binary configuration data?

          Allen Wittenauer added a comment -

          > works against integrating external configuration systems with existing components

          I'm thinking of when we are past the limitations of the existing components. What if we don't pass files around for configuration information at all? Then does making sure that everything can be represented as a UTF-16 string make sense? I don't think it does.

          > Do we have much binary configuration data?

          Given that it is currently impossible, the answer is obviously no.

          But this seems like a major flaw of the existing system. Who are we to dictate what the user can/can't put in what is essentially a private part of the configuration name space? Hadoop as a framework shouldn't care what the representation of that value is if it doesn't have to read it. If I want to build a mass documentation signing system and provide the binary representation of the CA cert as a configuration option to my serializer, why shouldn't I be able to do that? If I want to work in UTF-32 and pass information as a config option to my serializer, why shouldn't I be able to do that?

          Now one could argue that I could base64 encode my data or do the wacky !!binary thing that YAML does (JSON doesn't support binary, so to me, that instantly eliminates it; even crusty x.500 supports binary) ... and XML... well, you all know how I feel about it (smile). But why should I take a performance hit to support my use case?

          I don't see the value in supporting the existing system when it has what I would say is a major flaw.
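
          To make that trade-off concrete, here is a minimal sketch, assuming a hypothetical key name and certificate bytes, of what the current Map<String,String>-style Configuration forces on binary values: encode them as text on the way in and decode them on every read.

          import java.util.Base64;
          import org.apache.hadoop.conf.Configuration;

          public class BinaryConfigWorkaround {
            public static void main(String[] args) {
              Configuration conf = new Configuration();

              // Hypothetical binary payload, e.g. a DER-encoded CA certificate.
              byte[] caCert = new byte[] {0x30, (byte) 0x82, 0x01, 0x0a};

              // The text-only model forces the bytes through base64 before they can be stored...
              conf.set("my.serializer.ca.cert", Base64.getEncoder().encodeToString(caCert));

              // ...and back out again on every read, paying the extra size and CPU cost.
              byte[] roundTripped = Base64.getDecoder().decode(conf.get("my.serializer.ca.cert"));
              System.out.println(roundTripped.length + " bytes recovered");
            }
          }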

          Owen O'Malley added a comment -

          Allen is reaffirming my point that we don't know the complete set of configuration formats. Therefore, forcing all of the metadata into a format just to avoid quoting in one format is extremely premature.

          Allen Wittenauer added a comment -

          (side note: I really hate the unexpected side effects that the JIRA text editor can invoke. It clearly doesn't like exclamation points. I'd go back and fix it, but since I don't have edit privs...)

          BTW, I can't find the JIRA, but I know the HDFS guys were thinking along the lines of supporting pretty much any character they throw at it, similar to how most POSIX file systems work. It is interesting to think how that arrangement will impact the MR side of the house....

          Doug Cutting added a comment -

          Allen> If I want to build a mass documentation signing system and provide the binary representation of the CA cert as a configuration option to my serializer, why shouldn't I be able to do that?

          You should be able to do that. Support for binary and other typed configuration values would be great to have. I do not disagree with that. What I'd like to not lose is a generic way to access configuration values. A standard configuration data model would permit that.

          Owen> forcing all of the metadata into a format just to avoid quoting in one format is extremely premature.

          The lack of quoting is a feature, but not a motivating reason. The motivating reason is to continue to provide uniform access to configurations. Folks today are able to set the value of configuration parameters with -D. Folks today are able to view & edit configurations. It is possible (if awkward) today to write a single piece of code that can retrieve any job or daemon parameter from LDAP or SmartFrog. If we move to an opaque system, with no uniform data model that permits one to specify a path to a value and a (possibly typed) value, then one cannot perform such generic configuration tasks. That would be a step backwards.
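
          For reference, a short sketch of the uniform access being described, using the existing Configuration and GenericOptionsParser APIs; the property name at the end is made up for illustration.

          import java.util.Map;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.util.GenericOptionsParser;

          public class GenericConfigAccess {
            public static void main(String[] args) throws Exception {
              // -D key=value pairs from the command line land in the Configuration
              // without the components that consume them needing any special support.
              Configuration conf =
                  new GenericOptionsParser(new Configuration(), args).getConfiguration();

              // Any external system (LDAP, SmartFrog, a web UI, ...) can dump or inspect
              // every parameter through the same property-like view.
              for (Map.Entry<String, String> entry : conf) {
                System.out.println(entry.getKey() + " = " + entry.getValue());
              }

              // And one generic call fetches any parameter by name, with a default.
              System.out.println(conf.get("my.component.option", "unset"));
            }
          }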

          Allen Wittenauer added a comment -

          > You should be able to do that. Support for binary and other typed configuration values would be great to have. I do not disagree with that. What I'd like to not lose is a generic way to access configuration values. A standard configuration data model would permit that.

          So Doug, if I read this (and other things you've posted) correctly, what you are actually saying is that values should be allowed to be typed binary to enable more flexibility. But your real concern is that the configuration key would also be typed binary, because you feel that would break the currently public "text-only" interfaces (-D and hand-editing XML files) in incompatible ways. Correct?

          Doug Cutting added a comment -

          > But your real concern is that the configuration key is also typed binary [ ... ]

          My concern is that in the proposal here there are no configuration keys, textual, binary or otherwise. As configurations (like serializations, codecs, etc.) are nested inside other configurations (like inputformats, mappers, etc.), they become hidden. Values may only be specified through Java APIs. One might set the reducer's compression level in a Java program with something like the following:

          ((GzipCodec)((SequenceFileOutputFormat)myReducer.getOutputFormat()).getCodec()).setLevel(9)

          But it would be nice if one could also specify this with something like -Dreducer.outputformat.codec.level=9. A common configuration data model would make this easy to implement, as it is today. Strongly typed values, nested configurations and binary values would all be useful things to add to this data model. But discarding attribute-based access in the process seems like a mistake.
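
          As an illustrative contrast, a sketch of what the same setting could look like through a shared, attribute-based data model; the dotted key and the codec reading it are hypothetical, not existing Hadoop behavior.

          import org.apache.hadoop.conf.Configuration;

          public class DottedPathConfig {
            public static void main(String[] args) {
              Configuration conf = new Configuration();

              // Equivalent to passing -Dreducer.outputformat.codec.level=9 on the command line.
              conf.setInt("reducer.outputformat.codec.level", 9);

              // Inside the (hypothetical) codec, the value is read back generically,
              // with no knowledge of which tool or external system set it.
              int level = conf.getInt("reducer.outputformat.codec.level", 6);
              System.out.println("compression level: " + level);
            }
          }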

          Scott Carey added a comment -

          I apologize for replying to a part of the conversation from 30 days ago, but this may be useful.

          > For the second point, Avro is completely unsuitable for that context. For the serializer's metadata, I need to encode a singleton object. With Avro, I would need to encode the schema and then the metadata information. To add insult to injury, the schema will be substantially larger than the data. With ProtocolBuffers, I just encode the data.

          This is not true: all configurations could have the same Avro schema. An Avro schema that defines all possibilities is equivalent to tagging fields with type tags.
          Essentially, the schema would be a record with an array of fields, each field a union of all possible field types. The current Avro API for this use case is clunky (perhaps Avro could make this easier), but you can do dynamic typing and tagged fields in Avro.

          This means you don't have to serialize the schema, and it addresses the use case here where you just want to encode data and not a schema, akin to some PB/Thrift use cases. It adds the overhead of type tags, and the objects generated via either the Java reflect or generic APIs would be cumbersome to use. I would be willing to work on an API for Avro that makes it easier to read and write a tagged-tuple dynamic data type.
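
          A rough sketch of the schema shape being described: one fixed record whose entries each carry a name and a value drawn from a union of the allowed types, so only the data, never the schema, needs to be shipped. The schema, field names, and values are illustrative only, and the code assumes a recent Avro release's generic API.

          import java.io.ByteArrayOutputStream;
          import java.nio.ByteBuffer;
          import java.util.Arrays;
          import org.apache.avro.Schema;
          import org.apache.avro.generic.GenericData;
          import org.apache.avro.generic.GenericDatumWriter;
          import org.apache.avro.generic.GenericRecord;
          import org.apache.avro.io.BinaryEncoder;
          import org.apache.avro.io.EncoderFactory;

          public class TaggedConfigSketch {
            // One fixed schema for every serializer's metadata: a list of (name, value)
            // pairs where the value is a union over the primitive types we allow.
            private static final String SCHEMA_JSON =
                "{\"type\":\"record\",\"name\":\"Config\",\"fields\":[" +
                " {\"name\":\"entries\",\"type\":{\"type\":\"array\",\"items\":" +
                "  {\"type\":\"record\",\"name\":\"Entry\",\"fields\":[" +
                "   {\"name\":\"name\",\"type\":\"string\"}," +
                "   {\"name\":\"value\",\"type\":[\"null\",\"boolean\",\"long\",\"double\",\"string\",\"bytes\"]}" +
                "  ]}}}]}";

            public static void main(String[] args) throws Exception {
              Schema config = new Schema.Parser().parse(SCHEMA_JSON);
              Schema entry = config.getField("entries").schema().getElementType();

              // A string-valued entry and a binary-valued one, distinguished only by the union branch.
              GenericRecord kind = new GenericData.Record(entry);
              kind.put("name", "avro.kind");
              kind.put("value", "specific");

              GenericRecord cert = new GenericData.Record(entry);
              cert.put("name", "ca.cert");
              cert.put("value", ByteBuffer.wrap(new byte[] {0x30, (byte) 0x82}));

              GenericRecord record = new GenericData.Record(config);
              record.put("entries", Arrays.asList(kind, cert));

              // Because the schema is fixed and known to both sides, only the data is written.
              ByteArrayOutputStream out = new ByteArrayOutputStream();
              BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
              new GenericDatumWriter<GenericRecord>(config).write(record, enc);
              enc.flush();
              System.out.println(out.size() + " bytes, no schema attached");
            }
          }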

          Nigel Daley added a comment -

          Too late for 0.22. Moving Fix Version from 0.22 to 0.23.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12465307/serial9.patch
          against trunk revision 1071364.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 38 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/277//console

          This message is automatically generated.

          Harsh J added a comment -

          (This is currently not a blocker for any release, so setting it back down from that severity.)

          Harsh J added a comment -

          Patch isn't updated for the current development tip. Unmarking the Patch Available status.

          I also do not know where the consensus stands here - the conversation is kinda long. Would anyone be willing to summarize and continue?


            People

            • Assignee:
              Owen O'Malley
            • Reporter:
              Owen O'Malley
            • Votes:
              0
            • Watchers:
              41
