Lucene - Core
  1. Lucene - Core
  2. LUCENE-362

[PATCH] Extension to binary Fields that allows fixed byte buffer

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

      Description

      This is a very simple patch that supports storing binary values in the index
      more efficiently. A new Field constructor accepts a length argument, allowing a
      fixed byte[] to be reused acrossed multiple calls with arguments of different
      sizes. A companion change to FieldsWriter uses this length when storing and/or
      compressing the field.

      There is one remaining case in Document. Intentionally, no direct accessor to
      the length of a binary field is provided from Document, only from Field. This
      is because Field's created by FieldReader will never have a specified length and
      this is usual case for Field's read from Document. It seems less confusing for
      most users.

      I don't believe any upward incompatibility is introduced here (e.g., from the
      possibility of getting a larger byte[] than actually holds the value from
      Document), since no such byte[] values are possible without this patch anyway.

      The compression case is still inefficient (much copying), but it is hard to see
      how Lucene can do too much better. However, the application can do the
      compression externally and pass in the reused compression-output buffer as a
      binary value (which is what I'm doing). This represents a substantialy
      allocation savings for storing large documents bodies (compressed) into the
      Lucene index.

      Two patch files are attached, both created by svn on 3/17/05.

        Activity

        Hide
        Chuck Williams added a comment -

        Created an attachment (id=14514)
        (1 of 2) Patch to Field.java to support lengths on binary values

        Show
        Chuck Williams added a comment - Created an attachment (id=14514) (1 of 2) Patch to Field.java to support lengths on binary values
        Hide
        Chuck Williams added a comment -

        Created an attachment (id=14515)
        (2 of 2) Patch to FieldsWriter to write binary values using length from Field

        Show
        Chuck Williams added a comment - Created an attachment (id=14515) (2 of 2) Patch to FieldsWriter to write binary values using length from Field
        Hide
        Erik Hatcher added a comment -

        Chuck - would you mind also contributing a test case that exercises this new feature?

        It'd be nice if all new code added is covered by test cases. One of these days we'll even run code
        coverage on it to see how well we're doing

        Show
        Erik Hatcher added a comment - Chuck - would you mind also contributing a test case that exercises this new feature? It'd be nice if all new code added is covered by test cases. One of these days we'll even run code coverage on it to see how well we're doing
        Hide
        Chuck Williams added a comment -

        Erik – I'd be happy to write a test case. But first, I want to fully
        understand the valid uses and dangers of this patch. Since it is reusing
        memory, it is possible to mess up seriously. I think its javadoc at least needs
        more explanation of the dangers and an "expert" caveat. It's apparent
        simplicity lulled me into submitted it too quickly. Will revise later.

        Show
        Chuck Williams added a comment - Erik – I'd be happy to write a test case. But first, I want to fully understand the valid uses and dangers of this patch. Since it is reusing memory, it is possible to mess up seriously. I think its javadoc at least needs more explanation of the dangers and an "expert" caveat. It's apparent simplicity lulled me into submitted it too quickly. Will revise later.
        Hide
        Chuck Williams added a comment -

        Created an attachment (id=14521)
        (1 of 2) Patch to Field.java to support lengths on binary values (with better
        javadoc)

        This version improves the javadoc to provide and "expert" caveat and specify
        the precise use-case that this patch supports. For this use-case (storing a
        single large binary field on each Document, e.g. the externally compressed body
        of a document), the memory allocation benefits are substantial. There are no
        code changes from the original version.

        I'll try to get to a junit test in the not too distant future.

        Show
        Chuck Williams added a comment - Created an attachment (id=14521) (1 of 2) Patch to Field.java to support lengths on binary values (with better javadoc) This version improves the javadoc to provide and "expert" caveat and specify the precise use-case that this patch supports. For this use-case (storing a single large binary field on each Document, e.g. the externally compressed body of a document), the memory allocation benefits are substantial. There are no code changes from the original version. I'll try to get to a junit test in the not too distant future.
        Hide
        cutting@apache.org added a comment -

        Things would be safer if you:

        private int fieldsDataLength = -1;

        public int binaryLength()

        { return fieldsDataLength == -1 ? binaryValue().length : fieldsDataLength; }

        public Field(String name, byte[] value, int length, Store store)

        { if (length < 0 || length > value.length) throw new IllegalArgumentException("bad length:" + length); ... }

        No?

        Show
        cutting@apache.org added a comment - Things would be safer if you: private int fieldsDataLength = -1; public int binaryLength() { return fieldsDataLength == -1 ? binaryValue().length : fieldsDataLength; } public Field(String name, byte[] value, int length, Store store) { if (length < 0 || length > value.length) throw new IllegalArgumentException("bad length:" + length); ... } No?
        Hide
        Chuck Williams added a comment -

        I don't see how the first change (initializing fieldsDataLength to -1 and
        testing it in binaryLength()) provides any additional safety. The patch always
        initializes fieldsDataLength in Field() for any binary value, so unless that was
        changed the value would never be -1 in binaryLength(). The main risk in this
        area of the code is that somebody calls binaryValue() without calling
        binaryLength() to get the length, which seems impossible to address. Another
        risk is that somebody calls binaryLength() on a non-binary field – the current
        patch returns null in that case, consistent with binaryValue(), while the
        changed version would get an NPE. I'm probably missing some other case that you
        see.

        The second change (validating the length passed to Field()) seems an
        improvement. But the biggest risk with this patch is the one I outlined in the
        Javadoc, i.e. that somebody passes the same byte array in two different calls to
        Field before the use in the first call is consumed (e.g., 2 fields in the same
        Document, or the same field in 2 Documents before either is indexed). I don't
        see a good way to protect against that (without a performance hit).

        The patch is a bit risky due to this last consideration, but the performance
        gain is substantial in the particular case where it is necessary to store large
        document bodies in Lucene. I'm indexing with this now and doing zero
        allocations by using nio ByteBuffer and CharBuffer views on a fixed byte array
        that holds the successive document bodies. It rips. I'm using compression
        outside of Lucene (using the same zlib). The combination of outside-compression
        and the patch reduces 5 allocations of each document body to 0, when compared to
        passing the bodies as text and letting Lucene do the compression (the biggest
        part of this is the outside-compression).

        I don't object to either change and would be happy to see them if it means this
        gets committed so I can eliminate my local patch!

        Chuck

        Show
        Chuck Williams added a comment - I don't see how the first change (initializing fieldsDataLength to -1 and testing it in binaryLength()) provides any additional safety. The patch always initializes fieldsDataLength in Field() for any binary value, so unless that was changed the value would never be -1 in binaryLength(). The main risk in this area of the code is that somebody calls binaryValue() without calling binaryLength() to get the length, which seems impossible to address. Another risk is that somebody calls binaryLength() on a non-binary field – the current patch returns null in that case, consistent with binaryValue(), while the changed version would get an NPE. I'm probably missing some other case that you see. The second change (validating the length passed to Field()) seems an improvement. But the biggest risk with this patch is the one I outlined in the Javadoc, i.e. that somebody passes the same byte array in two different calls to Field before the use in the first call is consumed (e.g., 2 fields in the same Document, or the same field in 2 Documents before either is indexed). I don't see a good way to protect against that (without a performance hit). The patch is a bit risky due to this last consideration, but the performance gain is substantial in the particular case where it is necessary to store large document bodies in Lucene. I'm indexing with this now and doing zero allocations by using nio ByteBuffer and CharBuffer views on a fixed byte array that holds the successive document bodies. It rips. I'm using compression outside of Lucene (using the same zlib). The combination of outside-compression and the patch reduces 5 allocations of each document body to 0, when compared to passing the bodies as text and letting Lucene do the compression (the biggest part of this is the outside-compression). I don't object to either change and would be happy to see them if it means this gets committed so I can eliminate my local patch! Chuck
        Hide
        Chuck Williams added a comment -

        (Thanks Eric for correcting my mistaken posting to the old issue tracking system)

        Better late than never I hope. FixedBufferBinaryFields.patch is revised to
        apply against the latest source and now includes a test case (extension of
        TestBinaryDocument). This is my last current local patch to Lucene, so it
        would be great if it gets committed. The value again is to eliminate copying
        of large binary values to be stored in the Lucene index. For a compressed
        document, for example, if the documents are read and compressed externally in a
        fixed buffer and the compressed buffer is passed in, all copying can be
        eliminated.

        Chuck

        Show
        Chuck Williams added a comment - (Thanks Eric for correcting my mistaken posting to the old issue tracking system) Better late than never I hope. FixedBufferBinaryFields.patch is revised to apply against the latest source and now includes a test case (extension of TestBinaryDocument). This is my last current local patch to Lucene, so it would be great if it gets committed. The value again is to eliminate copying of large binary values to be stored in the Lucene index. For a compressed document, for example, if the documents are read and compressed externally in a fixed buffer and the compressed buffer is passed in, all copying can be eliminated. Chuck
        Hide
        Shai Erera added a comment -

        Field already has ctor and setter which accept a byte[], offset and length, allowing you to reuse the byte[].

        Show
        Shai Erera added a comment - Field already has ctor and setter which accept a byte[], offset and length, allowing you to reuse the byte[].

          People

          • Assignee:
            Unassigned
            Reporter:
            Chuck Williams
          • Votes:
            1 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development