Hadoop Common
  1. Hadoop Common
  2. HADOOP-6298

BytesWritable#getBytes is a bad name that leads to programming mistakes

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.22.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Pretty much everyone at Rapleaf who has worked with Hadoop has misused BytesWritable#getBytes at some point, not expecting the byte array to be padded. I think we can completely alleviate these programming mistakes by deprecating and renaming this method (again) to be more descriptive. I propose "getPaddedBytes()" or "getPaddedValue()". It would also be helpful to have a helper method "getNonPaddedValue()" that makes a copy into a non-padded byte array.

      1. h-6298.patch
        4 kB
        Owen O'Malley

        Activity

        Hide
        Hudson added a comment -

        Integrated in Hadoop-Common-22-branch #8 (See https://hudson.apache.org/hudson/job/Hadoop-Common-22-branch/8/)

        Show
        Hudson added a comment - Integrated in Hadoop-Common-22-branch #8 (See https://hudson.apache.org/hudson/job/Hadoop-Common-22-branch/8/ )
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk #548 (See https://hudson.apache.org/hudson/job/Hadoop-Common-trunk/548/)
        HADOOP-6298. Add copyBytes to Text and BytesWritable. (omalley)

        Show
        Hudson added a comment - Integrated in Hadoop-Common-trunk #548 (See https://hudson.apache.org/hudson/job/Hadoop-Common-trunk/548/ ) HADOOP-6298 . Add copyBytes to Text and BytesWritable. (omalley)
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #458 (See https://hudson.apache.org/hudson/job/Hadoop-Common-trunk-Commit/458/)
        HADOOP-6298. Add copyBytes to Text and BytesWritable. (omalley)

        Show
        Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #458 (See https://hudson.apache.org/hudson/job/Hadoop-Common-trunk-Commit/458/ ) HADOOP-6298 . Add copyBytes to Text and BytesWritable. (omalley)
        Hide
        Owen O'Malley added a comment -

        I've committed this with the JavaDoc suggestions that Tom suggested.

        Show
        Owen O'Malley added a comment - I've committed this with the JavaDoc suggestions that Tom suggested.
        Hide
        Tom White added a comment -

        +1 This looks good to me.

        Minor nit/improvement: the javadoc references to copyBytes() could be links, also the javadoc for the new method could refer to getBytes(), using @see or a link.

        Show
        Tom White added a comment - +1 This looks good to me. Minor nit/improvement: the javadoc references to copyBytes() could be links, also the javadoc for the new method could refer to getBytes() , using @see or a link.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12466240/h-6298.patch
        against trunk revision 1044755.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/139//testReport/
        Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/139//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/139//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12466240/h-6298.patch against trunk revision 1044755. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/139//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/139//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/139//console This message is automatically generated.
        Hide
        Owen O'Malley added a comment -

        This patch creates the copyBytes methods for Text and BytesWriteable.

        Show
        Owen O'Malley added a comment - This patch creates the copyBytes methods for Text and BytesWriteable.
        Hide
        David Rosenstrauch added a comment -

        Or maybe copyBytes()?

        Show
        David Rosenstrauch added a comment - Or maybe copyBytes()?
        Hide
        Konstantin Boudnik added a comment -

        I think Tom's idea makes sense if, perhaps, complimented by a better JavaDoc. Current one is eh... suboptimal.

        Show
        Konstantin Boudnik added a comment - I think Tom's idea makes sense if, perhaps, complimented by a better JavaDoc. Current one is eh... suboptimal.
        Hide
        Tom White added a comment -

        How about adding getBytesNonPadded() which creates a copy in a non-padded byte array? By naming it like this, this name would appear next to getBytes() in IDE autocompletion lists, which hopefully would alert users to the difference between the two methods.

        Show
        Tom White added a comment - How about adding getBytesNonPadded() which creates a copy in a non-padded byte array? By naming it like this, this name would appear next to getBytes() in IDE autocompletion lists, which hopefully would alert users to the difference between the two methods.
        Hide
        David Rosenstrauch added a comment -

        Yeesh. I just got bit on this same bug, but from a different direction.

        Calling BytesWritable.getBytes() returns a reference to the BytesWritable's internal byte array. I was calling that, and then using that byte array in subsequent processing. Problem is that the BytesWritable was also still holding onto a copy of that array, and later modifying it - thus modifying my copy as well. This was a really subtle bug that was hard to find, and I wasted a lot of time on it.

        I realize there's a need to get access to a BytesWriteable's internal byte storage without performing an array copy. But again, I think there needs to be some additional safe method to retrieve a byte array that's a copy of a ByteWriteable's contents. There's just too many potential pitfalls for developers if the situation is just left as is.

        Show
        David Rosenstrauch added a comment - Yeesh. I just got bit on this same bug, but from a different direction. Calling BytesWritable.getBytes() returns a reference to the BytesWritable's internal byte array. I was calling that, and then using that byte array in subsequent processing. Problem is that the BytesWritable was also still holding onto a copy of that array, and later modifying it - thus modifying my copy as well. This was a really subtle bug that was hard to find, and I wasted a lot of time on it. I realize there's a need to get access to a BytesWriteable's internal byte storage without performing an array copy. But again, I think there needs to be some additional safe method to retrieve a byte array that's a copy of a ByteWriteable's contents. There's just too many potential pitfalls for developers if the situation is just left as is.
        Hide
        David Rosenstrauch added a comment -

        We just got hit with this problem as well, and lost several hours on this due to getBytes giving me back an 18-byte array instead of the expected 12.

        I realize this method is part of the API out in the field now, and so likely can't be changed. But I think there should be some solution implemented for this. i.e., some new method that's a variation on the name "getBytes" that provides the expected behavior.

        It's just asking for trouble leaving this little land mine out there for developers to keep stumbling upon.

        Show
        David Rosenstrauch added a comment - We just got hit with this problem as well, and lost several hours on this due to getBytes giving me back an 18-byte array instead of the expected 12. I realize this method is part of the API out in the field now, and so likely can't be changed. But I think there should be some solution implemented for this. i.e., some new method that's a variation on the name "getBytes" that provides the expected behavior. It's just asking for trouble leaving this little land mine out there for developers to keep stumbling upon.
        Hide
        Owen O'Malley added a comment -

        This would be an API change. Renaming methods means that all of the clients need to change their code. I agree the name is unfortunate, precisely because String.toBytes() has different semantics.

        Also note that this is both in Text and BytesWritable, so you really would need to do the same change on both of them.

        I still don't think the pain is worth the gain. I'd suggest adding a new method that does what you want. Maybe something like:

        /**
         * Get a copy of the bytes that is exactly the length of the data.
         */
        public byte[] getSlicedBytes() {
          byte[] result = new byte[length];
          System.arraycopy(bytes, 0, result, 0, length);
          return result;
        }
        

        Once you have the method, you can change your examples to use it and your co-workers will follow suit. Do note that it has a cost in terms of performance...

        Show
        Owen O'Malley added a comment - This would be an API change. Renaming methods means that all of the clients need to change their code. I agree the name is unfortunate, precisely because String.toBytes() has different semantics. Also note that this is both in Text and BytesWritable, so you really would need to do the same change on both of them. I still don't think the pain is worth the gain. I'd suggest adding a new method that does what you want. Maybe something like: /** * Get a copy of the bytes that is exactly the length of the data. */ public byte [] getSlicedBytes() { byte [] result = new byte [length]; System .arraycopy(bytes, 0, result, 0, length); return result; } Once you have the method, you can change your examples to use it and your co-workers will follow suit. Do note that it has a cost in terms of performance...
        Hide
        Nathan Marz added a comment -

        Yes - I'm proposing solely a rename, not an API change. The behavior of the method is exactly as it should be for performance reasons. I'm ok with leaving out "getNonPaddedValue" and leaving it up to the user to do the copy if they need it - this idea was just for programmer convenience but I can see how it could be accidentally misused.

        I understand that it's clearly documented - the problem is, no one looks at the JavaDoc for these seemingly "obvious" methods. Can we at least agree that the name is confusing? Myself and all my coworkers who have worked with Hadoop have made this mistake.

        Show
        Nathan Marz added a comment - Yes - I'm proposing solely a rename, not an API change. The behavior of the method is exactly as it should be for performance reasons. I'm ok with leaving out "getNonPaddedValue" and leaving it up to the user to do the copy if they need it - this idea was just for programmer convenience but I can see how it could be accidentally misused. I understand that it's clearly documented - the problem is, no one looks at the JavaDoc for these seemingly "obvious" methods. Can we at least agree that the name is confusing? Myself and all my coworkers who have worked with Hadoop have made this mistake.
        Hide
        Owen O'Malley added a comment -

        I think this is a very bad idea. It introduces an incompatible change in a key interface. The method is clearly documented. I propose we close this as won't fix.

        Show
        Owen O'Malley added a comment - I think this is a very bad idea. It introduces an incompatible change in a key interface. The method is clearly documented. I propose we close this as won't fix.
        Hide
        Chris Douglas added a comment -

        I don't think this proposal is about changing the API, it's about renaming the method

        That's a fine distinction. Forcing all the users of BytesWritable- and they are legion- to change their calls to the most preferred of three identical methods isn't clearing up anything.

        Besides, I disagree with the premise. Almost all non-trivial Writables reuse their storage for performance reasons, so querying which of the data are valid for an array is routine. It's consistent with how it's used in map/reduce. It makes sense.

        Show
        Chris Douglas added a comment - I don't think this proposal is about changing the API, it's about renaming the method That's a fine distinction. Forcing all the users of BytesWritable- and they are legion- to change their calls to the most preferred of three identical methods isn't clearing up anything. Besides, I disagree with the premise. Almost all non-trivial Writables reuse their storage for performance reasons, so querying which of the data are valid for an array is routine. It's consistent with how it's used in map/reduce. It makes sense.
        Hide
        Tom White added a comment -

        I don't think this proposal is about changing the API, it's about renaming the method to more accurately describe its contract. Text.getBytes() behaves differently to String.getBytes(). It is a problem that trips up users; see, for example, http://www.nabble.com/can%27t-read-the-SequenceFile-correctly-td21866960.html.

        We could deprecate getBytes() (on BinaryComparable and its subclasses BytesWritable and Text) in 0.22 and create getPaddedBytes() as Nathan suggests, which is identical in functionality. Then in the next release we would remove getBytes(). This change would not have any impact on efficiency, since it is purely a rename.

        Nathan, what's the use case for getNonPaddedValue()? It's possible that by exposing it, it becomes easy to write an inefficient program since copying in maps or reduces is normally expensive.

        Show
        Tom White added a comment - I don't think this proposal is about changing the API, it's about renaming the method to more accurately describe its contract. Text.getBytes() behaves differently to String.getBytes(). It is a problem that trips up users; see, for example, http://www.nabble.com/can%27t-read-the-SequenceFile-correctly-td21866960.html . We could deprecate getBytes() (on BinaryComparable and its subclasses BytesWritable and Text) in 0.22 and create getPaddedBytes() as Nathan suggests, which is identical in functionality. Then in the next release we would remove getBytes(). This change would not have any impact on efficiency, since it is purely a rename. Nathan, what's the use case for getNonPaddedValue()? It's possible that by exposing it, it becomes easy to write an inefficient program since copying in maps or reduces is normally expensive.
        Hide
        Chris Douglas added a comment -

        I propose "getPaddedBytes()" or "getPaddedValue()". It would also be helpful to have a helper method "getNonPaddedValue()" that makes a copy into a non-padded byte array.

        Changing BytesWritable's API is a non-starter. Keys and values are reused and allocating new backing for each record is not an endurable expense. Marking valid data is typical for Writables, e.g. BytesWritable and Text.

        If this issue modified the javadoc to make this more prominent, that would be reasonable.

        Show
        Chris Douglas added a comment - I propose "getPaddedBytes()" or "getPaddedValue()". It would also be helpful to have a helper method "getNonPaddedValue()" that makes a copy into a non-padded byte array. Changing BytesWritable's API is a non-starter. Keys and values are reused and allocating new backing for each record is not an endurable expense. Marking valid data is typical for Writables, e.g. BytesWritable and Text. If this issue modified the javadoc to make this more prominent, that would be reasonable.

          People

          • Assignee:
            Owen O'Malley
            Reporter:
            Nathan Marz
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development