Lucene - Core
LUCENE-652

Compressed fields should be "externalized" (from Fields into Document)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9, 2.0.0, 2.1
    • Fix Version/s: 2.9
    • Component/s: core/index
    • Labels:
      None

      Description

      Right now, as of 2.0 release, Lucene supports compressed stored fields. However, after discussion on java-dev, the suggestion arose, from Robert Engels, that it would be better if this logic were moved into the Document level. This way the indexing level just stores opaque binary fields, and then Document handles compress/uncompressing as needed.

      This approach would have prevented issues like LUCENE-629 because merging of segments would never need to decompress.

      See this thread for the recent discussion:

      http://www.gossamer-threads.com/lists/lucene/java-dev/38836

      When we do this we should also work on related issue LUCENE-648.

      1. LUCENE-652.patch
        5 kB
        Uwe Schindler
      2. LUCENE-652.patch
        4 kB
        Michael McCandless
      3. LUCENE-652.patch
        13 kB
        Michael McCandless

        Activity

        Grant Ingersoll added a comment -

        Implementing this would mean deprecating Field.Store.COMPRESS and the various other places that use/set bits marking a field as compressed.

        Seems like a reasonable thing to do. I will mark this as a 2.9 issue, so that we make sure we deprecate it at or before that time.

        Michael McCandless added a comment -

        I added o.a.l.document.CompressionTools, with static methods to
        compress & decompress, and deprecated Field.Store.COMPRESS.

        I also found two separate bugs:

        • With Field.Store.COMPRESS we were running compression twice
          (unnecessarily); I've fixed that.
        • If you try to make a Field(byte[], int offset, int length,
          Store.COMPRESS), you'll hit an AIOOBE
          (ArrayIndexOutOfBoundsException). I think we don't need to
          fix this one since it's in now-deprecated code, and with 2.9,
          users can migrate to CompressionTools.

        I plan to commit in a day or two.
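For context, CompressionTools boils down to a deflate-based round trip over java.util.zip. The sketch below is a hypothetical stand-in (the class name CompressionSketch and buffer sizes are mine, not Lucene's actual code), showing the shape of the static compress/decompress pair:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch of the compress/decompress pair that
// o.a.l.document.CompressionTools exposes as static methods.
public class CompressionSketch {

    public static byte[] compress(byte[] value) {
        Deflater compressor = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            compressor.setInput(value);
            compressor.finish();
            ByteArrayOutputStream bos = new ByteArrayOutputStream(value.length);
            byte[] buf = new byte[1024];
            while (!compressor.finished()) {
                bos.write(buf, 0, compressor.deflate(buf));
            }
            return bos.toByteArray();
        } finally {
            compressor.end(); // release native zlib resources
        }
    }

    public static byte[] decompress(byte[] value) {
        Inflater decompressor = new Inflater();
        try {
            decompressor.setInput(value);
            ByteArrayOutputStream bos = new ByteArrayOutputStream(value.length);
            byte[] buf = new byte[1024];
            while (!decompressor.finished()) {
                bos.write(buf, 0, decompressor.inflate(buf));
            }
            return bos.toByteArray();
        } catch (DataFormatException e) {
            throw new RuntimeException("malformed compressed field", e);
        } finally {
            decompressor.end();
        }
    }

    public static void main(String[] args) {
        byte[] original = "stored field payload".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        byte[] roundTrip = decompress(compress(original));
        System.out.println(new String(roundTrip, java.nio.charset.StandardCharsets.UTF_8));
    }
}
```

Because the indexing layer now only sees opaque byte[] values, merging never needs to run this code, which is the point of the issue.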

        Uwe Schindler added a comment -

        Is an index compressed with Store.COMPRESS still readable? Can I uncompress fields compressed using the old tools by retrieving the byte array and using CompressionTools? There should be some documentation about that.

        Another question: compression was also used for string fields, so maybe CompressionTools also supplies a method to compress strings (converting them to UTF-8 in the process, to be backwards compatible). This would prevent people from calling String.getBytes() without a charset and then wondering why they cannot read their index again...

        Michael McCandless added a comment -

        Good questions!

        Is an index compressed with Store.COMPRESS still readable?

        Yes, we have to support that until Lucene 4.0. But
        Field.Store.COMPRESS will be removed in 3.0 (ie you can read previous
        compressed fields, interact w/ an index that has compressed fields in
        it, etc., just not add docs with Field.Store.COMPRESS to an index as
        of 3.0).

        Can I uncompress fields compressed using the old tools by retrieving the byte array and using CompressionTools?

        Well... yes, but: you can't actually get the compressed byte[]
        (because Lucene will decompress it for you).

        Compression was also used for string fields, so maybe CompressionTools also supplies a method to compress strings (converting them to UTF-8 in the process, to be backwards compatible). This would prevent people from calling String.getBytes() without a charset and then wondering why they cannot read their index again...

        OK I'll add them. I'll name them compressString and decompressString.
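The UTF-8 concern above can be illustrated with a sketch: by pinning the charset explicitly, the compressed bytes in the index are platform-independent, whereas String.getBytes() with no charset uses the platform default encoding. StringCompressionSketch is a hypothetical stand-in, not the actual CompressionTools implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch of compressString/decompressString:
// encode as UTF-8 first, then deflate; the reverse on the way out.
public class StringCompressionSketch {

    static byte[] compressString(String value) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DeflaterOutputStream out = new DeflaterOutputStream(bos);
            out.write(value.getBytes(StandardCharsets.UTF_8)); // charset pinned explicitly
            out.close();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen on in-memory streams
        }
    }

    static String decompressString(byte[] value) {
        try {
            InflaterInputStream in = new InflaterInputStream(new ByteArrayInputStream(value));
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
            return new String(bos.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Non-ASCII text survives the round trip regardless of platform encoding.
        System.out.println(decompressString(compressString("héllo wörld")));
    }
}
```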

        Uwe Schindler added a comment -

        OK I'll add them. I'll name them compressString and decompressString.

        Maybe it is better to use the new UTF-8 tools to encode/decode (instead of toBytes()). This would be consistent with the rest of Lucene.
        But for the old deprecated Field.Store.COMPRESS, keep it how it is (backwards compatibility).

        Michael McCandless added a comment -

        Maybe it is better to use the new UTF-8 tools to encode/decode (instead of toBytes()). This would be consistent with the rest of Lucene.
        But for the old deprecated Field.Store.COMPRESS, keep it how it is (backwards compatibility).

        You mean UnicodeUtil? I think we can leave that to future optimization (I'd rather focus on the other 2.9 issues, realtime search, etc. at this point).

        Uwe Schindler added a comment -

        Yes, should I prepare a patch for trunk and add these methods?

        Michael McCandless added a comment -

        Added compress/decompressString, and improved javadocs to say this compression format matches Field.Store.COMPRESS.

        Michael McCandless added a comment -

        Yes, should I prepare a patch for trunk and add these methods?

        You mean to switch to UnicodeUtil? That would be great!

        Michael McCandless added a comment -

        If we switch to UnicodeUtil we may want to allow instantiation of CompressionTools, since UnicodeUtil is optimized for re-use.

        And if we do that we have to think about thread safety & concurrency, probably using CloseableThreadLocal under the hood, and then add a close() method.
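The reuse idea might look roughly like the sketch below, using a plain ThreadLocal as a stand-in for Lucene's CloseableThreadLocal; all names here are hypothetical, not a proposed patch. Each thread keeps one Deflater, reset between calls, and close() releases the native resources explicitly:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Hypothetical sketch: per-thread reuse of a Deflater, with an
// explicit close() standing in for CloseableThreadLocal's lifecycle.
public class ReusableCompressor {

    private final ThreadLocal<Deflater> perThread = new ThreadLocal<Deflater>() {
        @Override
        protected Deflater initialValue() {
            return new Deflater(Deflater.BEST_COMPRESSION);
        }
    };

    public byte[] compress(byte[] value) {
        Deflater d = perThread.get();
        d.reset(); // reuse the same native object across calls on this thread
        d.setInput(value);
        d.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream(value.length);
        byte[] buf = new byte[1024];
        while (!d.finished()) {
            bos.write(buf, 0, d.deflate(buf));
        }
        return bos.toByteArray();
    }

    public void close() {
        Deflater d = perThread.get();
        if (d != null) {
            d.end(); // free native zlib state
        }
        perThread.remove();
    }
}
```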

        Uwe Schindler added a comment - - edited

        This is a first version using UnicodeUtils. The deprecated Store.COMPRESS part still uses String.getBytes() for backwards compatibility (otherwise it would be a change in index format).
        This version currently creates a new UTFxResult per call, because there is no state and no close method. It could also be synchronized without a ThreadLocal, but that may not be so good.
        The current version has a small performance impact because of array copying.

        Michael McCandless added a comment -

        OK thanks Uwe, it looks good. We can leave the other changes I
        suggested to future optimizations. I'll commit soon!

        Uwe Schindler added a comment -

        Fine! In my opinion the small overhead of UnicodeUtils is far lower than the one from compression and the ByteArrayStreams.

        Can I uncompress fields compressed using the old tools by retrieving the byte array and using CompressionTools?

        Well... yes, but: you can't actually get the compressed byte[]
        (because Lucene will decompress it for you).

        You can: with a FieldSelector that loads the fields for merge, you get the raw binary values (found out from the code of FieldsReader).

        Michael McCandless added a comment -

        Fine! In my opinion the small overhead of UnicodeUtils is far lower than the one from compression and the ByteArrayStreams.

        Good point...

        You can: with a FieldSelector that loads the fields for merge, you get the raw binary values (found out from the code of FieldsReader).

        Ahh, true! In fact, I will go and deprecate that FieldSelectorResult option.


          People

          • Assignee:
            Michael McCandless
          • Reporter:
            Michael McCandless
          • Votes: 1
          • Watchers: 0
