Lucene - Core
  1. Lucene - Core
  2. LUCENE-3590

minimize bytesref to be a ref to a byte[]

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA, 3.6.1
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Setting this as blocker for 4.0, as this class is used in every API there.

      Currently this API is a little of a everything, sometimes its like a stringbuffer, it
      does numeric conversions, all kinds of stuff.

      We need this to be a ref to a byte[], nothing else. This other stuff can go somewhere else.

      1. LUCENE-3590.patch
        17 kB
        Robert Muir
      2. LUCENE-3590_with_clone.patch
        69 kB
        Robert Muir
      3. LUCENE-3590_offsets.patch
        9 kB
        Robert Muir
      4. LUCENE-3590_equals_init.patch
        4 kB
        Robert Muir
      5. LUCENE-3590_equals_init.patch
        5 kB
        Robert Muir
      6. LUCENE-3590_deprecate_comparators.patch
        2 kB
        Robert Muir
      7. LUCENE-3590_copyCtor.patch
        67 kB
        Robert Muir

        Activity

        Hide
        Uwe Schindler added a comment -

        Bulk close for 3.6.1

        Show
        Uwe Schindler added a comment - Bulk close for 3.6.1
        Hide
        Robert Muir added a comment -

        Here's a patch with tests to improve most of the remaining issues:

        • bugs in copy/append when offset !=0 (AIOOBE)
        • off-by-one in charsequence.subsequence, this just happened to be previously untested.
        • obey charsequence api completely (including exceptions stated in the spec)
        • assert offset == 0 in grow(), as it doesnt make sense otherwise.

        I'd like to backport these bugfixes to 3.6.1 as well.

        Show
        Robert Muir added a comment - Here's a patch with tests to improve most of the remaining issues: bugs in copy/append when offset !=0 (AIOOBE) off-by-one in charsequence.subsequence, this just happened to be previously untested. obey charsequence api completely (including exceptions stated in the spec) assert offset == 0 in grow(), as it doesnt make sense otherwise. I'd like to backport these bugfixes to 3.6.1 as well.
        Hide
        Robert Muir added a comment -

        here's a patch fixing the copyBytes() issue, but copyChars and in general unicodeutil methods are still broken in the same way.

        I also clarified some javadocs and other cleanups, trying to get this thing to the point we can remove @experimental (not there yet).

        Need to fix IntsRef and CharsRef in the same way.

        Show
        Robert Muir added a comment - here's a patch fixing the copyBytes() issue, but copyChars and in general unicodeutil methods are still broken in the same way. I also clarified some javadocs and other cleanups, trying to get this thing to the point we can remove @experimental (not there yet). Need to fix IntsRef and CharsRef in the same way.
        Hide
        Robert Muir added a comment -

        BytesRef.copyBytes and equivalents are also heavily broken if offset != 0, and the javadoc is non-intuitive:

           * NOTE: this method resets the offset to 0 and resizes the reference array
           * if needed.
        

        This is ambiguous, its not obvious that it will reset offset to 0 always and then copy, resizing if needed.

        This is in comparison to e.g. append(), which is less broken because it at least tries to append to the
        existing offset, and if it needs to resize it resets offset to 0, but thats fine as its a new byte[].

        The former will overwrite unrelated data if bytesrefs are pointing to different slices of the same array.

        Show
        Robert Muir added a comment - BytesRef.copyBytes and equivalents are also heavily broken if offset != 0, and the javadoc is non-intuitive: * NOTE: this method resets the offset to 0 and resizes the reference array * if needed. This is ambiguous, its not obvious that it will reset offset to 0 always and then copy, resizing if needed. This is in comparison to e.g. append(), which is less broken because it at least tries to append to the existing offset, and if it needs to resize it resets offset to 0, but thats fine as its a new byte[]. The former will overwrite unrelated data if bytesrefs are pointing to different slices of the same array.
        Hide
        Robert Muir added a comment -

        updated patch with instanceof checks in the equals, and removing the broken asserts in TestUnicodeUtil that called equals() on charsequence (which is undefined by definition)

        Show
        Robert Muir added a comment - updated patch with instanceof checks in the equals, and removing the broken asserts in TestUnicodeUtil that called equals() on charsequence (which is undefined by definition)
        Hide
        Robert Muir added a comment -

        patch fixes broken equals() methods and broken ctors in IntsRef and CharsRef.

        IntsRef never set ints = EMPTY_INTS in its no-arg ctor, so ints would just be null.

        IntsRef and CharsRef never handled the case that 'other == null' in equals().

        CharsRef had other brokenness in its equals() method where it would return equals = true sometimes to other CharSequences that are not actually CharsRefs.

        This breaks the contract of equals because its not symmetric, there is no guarantee of this in charsequence.

        Show
        Robert Muir added a comment - patch fixes broken equals() methods and broken ctors in IntsRef and CharsRef. IntsRef never set ints = EMPTY_INTS in its no-arg ctor, so ints would just be null. IntsRef and CharsRef never handled the case that 'other == null' in equals(). CharsRef had other brokenness in its equals() method where it would return equals = true sometimes to other CharSequences that are not actually CharsRefs. This breaks the contract of equals because its not symmetric, there is no guarantee of this in charsequence.
        Hide
        Robert Muir added a comment -

        i want to deprecate the utf8InUtf16Order and utf16InUtf8Order comparators. maybe these will get refactored out of bytesref itself and into better places like preflex codec

        especially since the one in bytesref has some funkiness particular to the surrogates dance implementation (it reserves some unassigned bytes).

        Show
        Robert Muir added a comment - i want to deprecate the utf8InUtf16Order and utf16InUtf8Order comparators. maybe these will get refactored out of bytesref itself and into better places like preflex codec especially since the one in bytesref has some funkiness particular to the surrogates dance implementation (it reserves some unassigned bytes).
        Hide
        Robert Muir added a comment -

        Don't worry, i will fix in 3.x too. Its good that its internal for now, because the api needs cleanup.

        i would like to remove deep copying, but again, as i said at least 2 times before here, the first step is to clearly mark what is deep copying and what is normal use.

        This way the uses are not intermixed confusingly in the source code. This makes it at least possible to fix.

        Show
        Robert Muir added a comment - Don't worry, i will fix in 3.x too. Its good that its internal for now, because the api needs cleanup. i would like to remove deep copying, but again, as i said at least 2 times before here, the first step is to clearly mark what is deep copying and what is normal use. This way the uses are not intermixed confusingly in the source code. This makes it at least possible to fix.
        Hide
        Uwe Schindler added a comment -

        You should also fix this in 3.x as BytesRef appeared there, too (I have no idea why it is there, but it is). The problem is we already released but as @lucene.internal, so we can still change.

        About your disagreement, I am fine with that, but then please remove all deep copying, also the new static methods On the other hand not supplying at least one simple deepCopy function makes it hard to use e.g. in MTQ when you put away terms.

        Show
        Uwe Schindler added a comment - You should also fix this in 3.x as BytesRef appeared there, too (I have no idea why it is there, but it is). The problem is we already released but as @lucene.internal, so we can still change. About your disagreement, I am fine with that, but then please remove all deep copying, also the new static methods On the other hand not supplying at least one simple deepCopy function makes it hard to use e.g. in MTQ when you put away terms.
        Hide
        Robert Muir added a comment -

        Uwe: i totally disagree, I'm sorry.

        Instead all these stupid ctors and deep-copying should be removed, and this class should be simplified to be a simple Ref API.

        Then @lucene.internal should be removed from the class, because its totally broken to call something "internal" but pass it around in every API, e.g. its required to use the terms/postings apis.

        Thats why I marked this issue blocker.

        Show
        Robert Muir added a comment - Uwe: i totally disagree, I'm sorry. Instead all these stupid ctors and deep-copying should be removed, and this class should be simplified to be a simple Ref API. Then @lucene.internal should be removed from the class, because its totally broken to call something "internal" but pass it around in every API, e.g. its required to use the terms/postings apis. Thats why I marked this issue blocker.
        Hide
        Uwe Schindler added a comment -

        In my opinion, those classes should have no ctors at all (and they don't need as they are final - so make the ctor private). I would make all methods creating new instances of XxxRefs static and named with what they do: BytesRef.newEmptyInstance(), BytesRef.deepCopyOf(), BytesRef.newInstanceFromRef(), BytesRef.copyOf(otherType),...

        Show
        Uwe Schindler added a comment - In my opinion, those classes should have no ctors at all (and they don't need as they are final - so make the ctor private). I would make all methods creating new instances of XxxRefs static and named with what they do: BytesRef.newEmptyInstance(), BytesRef.deepCopyOf(), BytesRef.newInstanceFromRef(), BytesRef.copyOf(otherType),...
        Hide
        Robert Muir added a comment -

        updated patch, changing clone() to be shallow as well.

        if consumers want a deep copy they can just call deepCopy explicitly.

        Show
        Robert Muir added a comment - updated patch, changing clone() to be shallow as well. if consumers want a deep copy they can just call deepCopy explicitly.
        Hide
        Robert Muir added a comment -

        Make clone() a shallow copy? I don't think we should support shallow copying at all - it's normally not going to be what a user really wants. One can still do a shallow copy by directly setting the members.

        No, abusers of the reference API can create a deep copy directly by setting the members. The onus is on abusers to have these extra lines of code, not for correct users of the API.

        Is there even any code in Lucene or Solr that creates a shallow copy?

        Yes, and its uglier because of the problems above.

        Show
        Robert Muir added a comment - Make clone() a shallow copy? I don't think we should support shallow copying at all - it's normally not going to be what a user really wants. One can still do a shallow copy by directly setting the members. No, abusers of the reference API can create a deep copy directly by setting the members. The onus is on abusers to have these extra lines of code, not for correct users of the API. Is there even any code in Lucene or Solr that creates a shallow copy? Yes, and its uglier because of the problems above.
        Hide
        Robert Muir added a comment -

        All I know is this class is abused because there was/is a need for bits and pieces that ByteBuffers don't fully address – you cannot change the underlying array (must rewrap) and you cannot fiddle with hash code, for example. I like access to internals sometimes.

        Right, the abuse exists... we have to accept this is just the case and deal with it appropriately, step 1 being to make sure the 'abuse use case' and the 'ordinary use case' can be clearly separated. But having a broken API that is sometimes shallow, sometimes deep, trying to support both use cases in whichever way we feel is 'natural' is unacceptable.

        Show
        Robert Muir added a comment - All I know is this class is abused because there was/is a need for bits and pieces that ByteBuffers don't fully address – you cannot change the underlying array (must rewrap) and you cannot fiddle with hash code, for example. I like access to internals sometimes. Right, the abuse exists... we have to accept this is just the case and deal with it appropriately, step 1 being to make sure the 'abuse use case' and the 'ordinary use case' can be clearly separated. But having a broken API that is sometimes shallow, sometimes deep, trying to support both use cases in whichever way we feel is 'natural' is unacceptable.
        Hide
        Robert Muir added a comment -

        Attached is a patch for FoosRef, where Foo =

        { Byte, Int, Char }

        that removes FoosRef(FoosRef other), instead adding FoosRef.deepCopyOf(FoosRef other).

        Also renames FoosRef.copy(FoosRefOther) to FoosRef.copyFoos(FoosRefOther) to make it more obvious that its copying the underlying data in, not just the reference.

        Show
        Robert Muir added a comment - Attached is a patch for FoosRef, where Foo = { Byte, Int, Char } that removes FoosRef(FoosRef other), instead adding FoosRef.deepCopyOf(FoosRef other). Also renames FoosRef.copy(FoosRefOther) to FoosRef.copyFoos(FoosRefOther) to make it more obvious that its copying the underlying data in, not just the reference.
        Hide
        Dawid Weiss added a comment -

        it's normally not going to be what a user really wants

        You're right, this would be even more confusing.

        All I know is this class is abused because there was/is a need for bits and pieces that ByteBuffers don't fully address – you cannot change the underlying array (must rewrap) and you cannot fiddle with hash code, for example. I like access to internals sometimes.

        Show
        Dawid Weiss added a comment - it's normally not going to be what a user really wants You're right, this would be even more confusing. All I know is this class is abused because there was/is a need for bits and pieces that ByteBuffers don't fully address – you cannot change the underlying array (must rewrap) and you cannot fiddle with hash code, for example. I like access to internals sometimes.
        Hide
        Robert Muir added a comment -

        I don't think we should be supporting deep copying at all, thats why i opened this issue.

        This class is NOT a stringbuilder class!

        Show
        Robert Muir added a comment - I don't think we should be supporting deep copying at all, thats why i opened this issue. This class is NOT a stringbuilder class!
        Hide
        Yonik Seeley added a comment -

        If so then I'd remove copying entirely then and leave a covariant ByteRef clone() (as in: clone the reference only).

        Make clone() a shallow copy? I don't think we should support shallow copying at all - it's normally not going to be what a user really wants. One can still do a shallow copy by directly setting the members.

        Is there even any code in Lucene or Solr that creates a shallow copy?

        Show
        Yonik Seeley added a comment - If so then I'd remove copying entirely then and leave a covariant ByteRef clone() (as in: clone the reference only). Make clone() a shallow copy? I don't think we should support shallow copying at all - it's normally not going to be what a user really wants. One can still do a shallow copy by directly setting the members. Is there even any code in Lucene or Solr that creates a shallow copy?
        Hide
        Robert Muir added a comment -

        well here i'm just trying to prevent BytesRef/IntsRef/CharsRef from being crazy classes where its confusing what the methods do... oh wait, its too late.

        I think the key issue is sometimes they are used like array/stringbuilder, but other times they are used like bytebuffer?

        I'm not going to push the issue so far to say we should remove BytesRef entirely and just use ByteBuffer but i also don't think that should be totally out of the question?

        For now, since we support being a ByteBuffer and also a ByteArrayBuilder, i want to first just distinguish the deep copying from the shallow copying before trying to unbreak stuff.

        Show
        Robert Muir added a comment - well here i'm just trying to prevent BytesRef/IntsRef/CharsRef from being crazy classes where its confusing what the methods do... oh wait, its too late. I think the key issue is sometimes they are used like array/stringbuilder, but other times they are used like bytebuffer? I'm not going to push the issue so far to say we should remove BytesRef entirely and just use ByteBuffer but i also don't think that should be totally out of the question? For now, since we support being a ByteBuffer and also a ByteArrayBuilder, i want to first just distinguish the deep copying from the shallow copying before trying to unbreak stuff.
        Hide
        Dawid Weiss added a comment -

        If so then I'd remove copying entirely then and leave a covariant ByteRef clone() (as in: clone the reference only). Although I also admit I did abuse BytesRef as a byte manipulation buffer slightly, my bad.

        Show
        Dawid Weiss added a comment - If so then I'd remove copying entirely then and leave a covariant ByteRef clone() (as in: clone the reference only). Although I also admit I did abuse BytesRef as a byte manipulation buffer slightly, my bad.
        Hide
        Robert Muir added a comment -

        But thats my whole problem with BytesRef, its not supposed to be a StringBuilder, or an array (which are holder classes that you manipulate).

        Its a ref to a byte[]... thus the name. If i have a reference or a pointer to something, and i copy that reference, its unexpected to copy what it refers to as well.

        Show
        Robert Muir added a comment - But thats my whole problem with BytesRef, its not supposed to be a StringBuilder, or an array (which are holder classes that you manipulate). Its a ref to a byte[]... thus the name. If i have a reference or a pointer to something, and i copy that reference, its unexpected to copy what it refers to as well.
        Hide
        Dawid Weiss added a comment -

        Maybe, although this clearly corresponds with Arrays.copyOf(). If you make the contract clear in JavaDocs then I see no problem. People will always make mistakes, no matter how explicitly you name methods and deepCopyOf sounds verbose to me (even if only by 4 keystrokes).

        Show
        Dawid Weiss added a comment - Maybe, although this clearly corresponds with Arrays.copyOf(). If you make the contract clear in JavaDocs then I see no problem. People will always make mistakes, no matter how explicitly you name methods and deepCopyOf sounds verbose to me (even if only by 4 keystrokes).
        Hide
        Robert Muir added a comment -

        While we're at it, can we get rid of the final that prevents subclassing?
        Mainly I want to be able to change how hashCode works (either the hash itself, or be able to cache like String does).

        I think if we clean up the API this would be a safe thing to do. Maybe instead of
        the whole class being final, only a few methods really need to be final (or maybe none at all).

        Show
        Robert Muir added a comment - While we're at it, can we get rid of the final that prevents subclassing? Mainly I want to be able to change how hashCode works (either the hash itself, or be able to cache like String does). I think if we clean up the API this would be a safe thing to do. Maybe instead of the whole class being final, only a few methods really need to be final (or maybe none at all).
        Hide
        Robert Muir added a comment -

        BytesRef copy = BytesRef.copyOf(original);

        I still think this is slightly confusing, because in my opinion this method should do:

        BytesRef copy = new BytesRef();
        copy.bytes = other.bytes;
        copy.length = other.length;
        copy.offset = other.offset;
        

        I think we should clearly separate these 'deep copies', e.g. deepCopyOf would be better.

        Show
        Robert Muir added a comment - BytesRef copy = BytesRef.copyOf(original); I still think this is slightly confusing, because in my opinion this method should do: BytesRef copy = new BytesRef(); copy.bytes = other.bytes; copy.length = other.length; copy.offset = other.offset; I think we should clearly separate these 'deep copies', e.g. deepCopyOf would be better.
        Hide
        Yonik Seeley added a comment -

        BytesRef copy = BytesRef.copyOf(original);

        +1

        While we're at it, can we get rid of the final that prevents subclassing?
        Mainly I want to be able to change how hashCode works (either the hash itself, or be able to cache like String does).

        Show
        Yonik Seeley added a comment - BytesRef copy = BytesRef.copyOf(original); +1 While we're at it, can we get rid of the final that prevents subclassing? Mainly I want to be able to change how hashCode works (either the hash itself, or be able to cache like String does).
        Hide
        Dawid Weiss added a comment -

        How about:

        BytesRef copy = BytesRef.copyOf(original);
        
        Show
        Dawid Weiss added a comment - How about: BytesRef copy = BytesRef.copyOf(original);
        Hide
        Robert Muir added a comment -

        First step is to remove the ctor BytesRef(BytesRef other), which is not intuitive
        since it looks like a copy constructor, but doesn't copy the ref, it creates
        a brand new byte array pointing to a complete copy of 'other's bytes, and then
        creates a pointer to that.

        I think its not obvious from the code either when i see new BytesRef(otherBytesRef).
        Instead we should do:

        {copy}
        BytesRef copy = new BytesRef();
        copy.copyBytes(otherBytesRef);{copy}

        This makes it totally clear exactly what is going on. I'll work up a patch

        Show
        Robert Muir added a comment - First step is to remove the ctor BytesRef(BytesRef other), which is not intuitive since it looks like a copy constructor, but doesn't copy the ref, it creates a brand new byte array pointing to a complete copy of 'other's bytes, and then creates a pointer to that. I think its not obvious from the code either when i see new BytesRef(otherBytesRef). Instead we should do: {copy} BytesRef copy = new BytesRef(); copy.copyBytes(otherBytesRef);{copy} This makes it totally clear exactly what is going on. I'll work up a patch

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development