Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.5.0
    • Fix Version/s: None
    • Component/s: io
    • Labels:
      None

      Description

      Since class UTF8 is deprecated, all references of UTF8 in hadoop should be replaced with class Text if the change does not break the system.

      1. utf2text.patch
        27 kB
        Hairong Kuang

        Issue Links

          Activity

          Hide
          Doug Cutting added a comment -

          I think we should target this for 0.6, and just deprecate UTF8 in 0.5. Does that make sense?

          Show
          Doug Cutting added a comment - I think we should target this for 0.6, and just deprecate UTF8 in 0.5. Does that make sense?
          Hide
          Sameer Paranjpye added a comment -

          I agree, it seems reasonable practice to deprecate the class in 0.5 and give people enough time to prepare for removal.

          Show
          Sameer Paranjpye added a comment - I agree, it seems reasonable practice to deprecate the class in 0.5 and give people enough time to prepare for removal.
          Hide
          Hairong Kuang added a comment -

          This is a partial patch to the problem including changes to packages mapred, ipc, and test.

          The dfs is kept untouched. It is not trival to keep dfs backwards compatible because writables are not versioned. Will work on it later on.

          Show
          Hairong Kuang added a comment - This is a partial patch to the problem including changes to packages mapred, ipc, and test. The dfs is kept untouched. It is not trival to keep dfs backwards compatible because writables are not versioned. Will work on it later on.
          Hide
          Doug Cutting added a comment -

          It would be nice if RPC.Invocation were versioned, since you've changed its format. Old clients talking to new servers (and vice versa) will fail. It would be nice if these failures were accompanied by version-mismatch exceptions.

          What if we start sending #xFF at the start of an Invocation as the version? The first thing read by the old code is the method name, a UTF8. So if old client code talks to a new daemon, then, unless the method name has 2^16-1 characters, we could log a version-mismatch and close the connection. With a bit more work, we could even construct a version-mismatch error (using UTF8) that the client could read.

          If new client code talks to an old server, the old server would try to read a method name with 2^16-1 bytes, log an EOF error and close the connection. Is that acceptable?

          Or should we punt on this?

          Show
          Doug Cutting added a comment - It would be nice if RPC.Invocation were versioned, since you've changed its format. Old clients talking to new servers (and vice versa) will fail. It would be nice if these failures were accompanied by version-mismatch exceptions. What if we start sending #xFF at the start of an Invocation as the version? The first thing read by the old code is the method name, a UTF8. So if old client code talks to a new daemon, then, unless the method name has 2^16-1 characters, we could log a version-mismatch and close the connection. With a bit more work, we could even construct a version-mismatch error (using UTF8) that the client could read. If new client code talks to an old server, the old server would try to read a method name with 2^16-1 bytes, log an EOF error and close the connection. Is that acceptable? Or should we punt on this?
          Hide
          Owen O'Malley added a comment -

          There is a worse problem in that none of the hadoop.io.* Writable classes are versioned.

          In particular, we would like to move to the new method of writing Strings (Text.writeString), but can't very easily because even if the file versions and the rpc protocol versions change, that doesn't get passed down to readFields.

          We really need to find a story that works for versioning Writables. I guess we could move everything over to VersionedWritable and that would work at the cost of 1 byte/Writable.

          Thoughts?

          Show
          Owen O'Malley added a comment - There is a worse problem in that none of the hadoop.io.* Writable classes are versioned. In particular, we would like to move to the new method of writing Strings (Text.writeString), but can't very easily because even if the file versions and the rpc protocol versions change, that doesn't get passed down to readFields. We really need to find a story that works for versioning Writables. I guess we could move everything over to VersionedWritable and that would work at the cost of 1 byte/Writable. Thoughts?
          Hide
          Doug Cutting added a comment -

          > we would like to move to the new method of writing Strings (Text.writeString), but can't very easily because even if the file versions and the rpc protocol versions change, that doesn't get passed down to readFields

          I'm not sure what you're saying. If we, in a single commit, change from UTF8 to Text in RPC parameter classes and in classes written to disk, and increase the version numbers of the file formats and RPC protocols involved, then we'd be safe, no? The problem is that if we change from UTF8 to Text in the RPC system, when transmitting class names, since this has no version. It would be good to get a version number here.

          More generally, it would be useful to have a per-Writable class version number. This could be used when setting up RPC connections, and stored in file formats, etc. But that would not solve the current problem with the RPC system.

          Maybe we should punt on the RPC system for this patch for now, and just make the changes that are easy, leaving that for another day?

          Show
          Doug Cutting added a comment - > we would like to move to the new method of writing Strings (Text.writeString), but can't very easily because even if the file versions and the rpc protocol versions change, that doesn't get passed down to readFields I'm not sure what you're saying. If we, in a single commit, change from UTF8 to Text in RPC parameter classes and in classes written to disk, and increase the version numbers of the file formats and RPC protocols involved, then we'd be safe, no? The problem is that if we change from UTF8 to Text in the RPC system, when transmitting class names, since this has no version. It would be good to get a version number here. More generally, it would be useful to have a per-Writable class version number. This could be used when setting up RPC connections, and stored in file formats, etc. But that would not solve the current problem with the RPC system. Maybe we should punt on the RPC system for this patch for now, and just make the changes that are easy, leaving that for another day?
          Hide
          Hairong Kuang added a comment -

          Only versioning file formats and RPC protocols does not completely solve the problem, for example, when a Writable class contains a field of String. Because the field is deserialied by readField and readField does not have the version information, so we have no way to find out if we should use UTF8.readString or Text.readString. I guess either we should support Writable versioning or support a method readField(DataInput in, int version).

          Show
          Hairong Kuang added a comment - Only versioning file formats and RPC protocols does not completely solve the problem, for example, when a Writable class contains a field of String. Because the field is deserialied by readField and readField does not have the version information, so we have no way to find out if we should use UTF8.readString or Text.readString. I guess either we should support Writable versioning or support a method readField(DataInput in, int version).
          Hide
          Doug Cutting added a comment -

          We seem to be talking at cross purposes. Are we still talking about the patch attached to this bug? Of course, if one changes a Writable from using UTF8 to Text, and makes no other changes, then one will have problems reading old data and/or with protocol compatibility. Folks who wish to be able to do things like this in user data should use VersionedWritable. But the question I was raising was about migrating internal uses of UTF8 to Text. For most of those, incrementing protocol version numbers and container format version numbers will suffice. The only exception I know of is the one I discussed above, namely the names of parameter classes used in RPC. That's the issue I was trying to raise here, not the general issue of version compatiblity for Writable.

          Users who have data files that use UTF8 may wish to upgrade their data to use Text. That's outside the scope of this bug, I think. Here we should be concerned with replacing internal uses.

          Show
          Doug Cutting added a comment - We seem to be talking at cross purposes. Are we still talking about the patch attached to this bug? Of course, if one changes a Writable from using UTF8 to Text, and makes no other changes, then one will have problems reading old data and/or with protocol compatibility. Folks who wish to be able to do things like this in user data should use VersionedWritable. But the question I was raising was about migrating internal uses of UTF8 to Text. For most of those, incrementing protocol version numbers and container format version numbers will suffice. The only exception I know of is the one I discussed above, namely the names of parameter classes used in RPC. That's the issue I was trying to raise here, not the general issue of version compatiblity for Writable. Users who have data files that use UTF8 may wish to upgrade their data to use Text. That's outside the scope of this bug, I think. Here we should be concerned with replacing internal uses.
          Hide
          Hairong Kuang added a comment -

          I was not talking about user data. Instead I was concerned about replacing data internal uses. The following is an example that illustreates my point.

          When we load the edit log, if the operation name is OP_DATANODE_ADD, we need to read a DataNodeDescriptor, which is a Writable, by calling DataNodeDescriptor.readField. The class DataNodeDescriptor has a field "name" with type String. Although FSEditLog has version info, but DataNodeDescriptor does not. So the readFiied method does not know how to read the data node name, whether to use UTF8.readString or Text.readString. So either we should support Writable versioning or we need to pass version number to readField.

          Show
          Hairong Kuang added a comment - I was not talking about user data. Instead I was concerned about replacing data internal uses. The following is an example that illustreates my point. When we load the edit log, if the operation name is OP_DATANODE_ADD, we need to read a DataNodeDescriptor, which is a Writable, by calling DataNodeDescriptor.readField. The class DataNodeDescriptor has a field "name" with type String. Although FSEditLog has version info, but DataNodeDescriptor does not. So the readFiied method does not know how to read the data node name, whether to use UTF8.readString or Text.readString. So either we should support Writable versioning or we need to pass version number to readField.
          Hide
          Doug Cutting added a comment -

          Perhaps we should add a new abstract base class:

          public abstract class ImplicitVersionedWritable implements Writable {
          public abstract readFields(DataInputStream in, int version);
          public abstract int currentVersion();
          public readFields(DataInputStream in)

          { readFields(in, currentVersion()); }

          }

          Then change DatanodeID to extend this:

          public class DatanodeID
          implements WritableComparable
          extends ImplicitVersionedWritable {

          public int currentVersion()

          { return FSConstants.DFS_CURRENT_VERSION; }

          public void readFields(DataInputStream in, int version) {
          if (version >= -3)

          { name = UTF8.readString(); ... }

          else

          { name = Text.readString(); ... }

          }

          Would that work?

          Show
          Doug Cutting added a comment - Perhaps we should add a new abstract base class: public abstract class ImplicitVersionedWritable implements Writable { public abstract readFields(DataInputStream in, int version); public abstract int currentVersion(); public readFields(DataInputStream in) { readFields(in, currentVersion()); } } Then change DatanodeID to extend this: public class DatanodeID implements WritableComparable extends ImplicitVersionedWritable { public int currentVersion() { return FSConstants.DFS_CURRENT_VERSION; } public void readFields(DataInputStream in, int version) { if (version >= -3) { name = UTF8.readString(); ... } else { name = Text.readString(); ... } } Would that work?
          Hide
          stack added a comment -

          Added a vote to this issue. HBase spends a bunch of time rpc'ing. A loads of our time is spent in UTF8 class doing reading and writing on the wire. A primitive test has the Text.readString/writeString being a good bit faster than the UTF8 equivs so would be great if RPC moved to off UTF8 and on to Text.

          Show
          stack added a comment - Added a vote to this issue. HBase spends a bunch of time rpc'ing. A loads of our time is spent in UTF8 class doing reading and writing on the wire. A primitive test has the Text.readString/writeString being a good bit faster than the UTF8 equivs so would be great if RPC moved to off UTF8 and on to Text.
          Hide
          Jim Kellerman added a comment -

          +1

          Show
          Jim Kellerman added a comment - +1
          Hide
          stack added a comment -

          On adding a base abstract ImplicitVersionedWritable, I can see how injection of application version can be done easily for application-specific Comparables – as is done above for the DFS-specific DatanodeID class – but I'm wondering how would the version be injected when an application uses generic Comparables such as BytesWritable? Would an application have to pass its version on every instance creation? E.g. 'new BytesWritable(byte [] b, int version)'?

          Show
          stack added a comment - On adding a base abstract ImplicitVersionedWritable, I can see how injection of application version can be done easily for application-specific Comparables – as is done above for the DFS-specific DatanodeID class – but I'm wondering how would the version be injected when an application uses generic Comparables such as BytesWritable? Would an application have to pass its version on every instance creation? E.g. 'new BytesWritable(byte [] b, int version)'?
          Hide
          Luca Telloli added a comment -

          My vote also on this issue. From some profiling I did, it looks like the NameNode is spending 2% of the CPU time on UTF8, and it could save about 40% of it by using Text.

          Show
          Luca Telloli added a comment - My vote also on this issue. From some profiling I did, it looks like the NameNode is spending 2% of the CPU time on UTF8, and it could save about 40% of it by using Text.

            People

            • Assignee:
              Owen O'Malley
              Reporter:
              Hairong Kuang
            • Votes:
              4 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:

                Development