  1. Lucene - Core
  2. LUCENE-274

[PATCH] to store binary fields with compression

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

    • Bugzilla Id:
      31149

      Description

      hi all,

      as promised here is the enhancement for the binary field patch with optional
      compression. The attachment includes all necessary diffs based on the latest
      version from CVS. There is also a small junit test case to test the core
      functionality for binary field compression. The base implementation for binary
      fields, on which this patch relies, can be found in patch #29370. The existing
      unit tests pass fine.

      To test binary fields and compression, I created an index from 2700 plain
      text files (avg. 6 KB per file) and stored all file content within that index.
      The test was created using the IndexFiles class from the demo distribution.
      Setting up the index and storing all content without compression took about
      60 secs and the final index size was 21 MB. Running the same test with
      compression switched on, the time to index increased to 75 secs, but the final
      index size shrank to 13 MB. This is less than the plain text files themselves
      need in the file system (15 MB).

      Hopefully this patch helps people who deal with huge indexes and want to store
      more than just 300 bytes per document to display a well-formed summary.

      regards
      Bernhard

        Activity

        bernhard.messer@intrafind.de Bernhard Messer added a comment -

        Created an attachment (id=12686)
        [PATCH] to store binary fields with compression

        goller@detego-software.de Christoph Goller added a comment -

        Hi Bernhard,

        I reviewed your patch. Looks great to me. However, I wonder why we need
        isCompressed in FieldInfo? Being compressed or not seems to be a property of an
        individual field rather than of all fields in the index with a given name.
        Furthermore, the isCompressed flag in FieldInfo is currently not used anywhere
        outside FieldInfo and FieldInfos. Is it really needed?

        Further idea: Wouldn't it be great to have a stored stringValued field that
        has the property "compressed" meaning that if the field is written with
        FieldsWriter, it automatically is compressed and if it's read by FieldsReader,
        it is automatically decompressed and transformed into a String? The field could
        but does not have to be indexed/tokenized. This would mean that compressed
        becomes a property of stored fields (binary or stringValued ones).

        With your current implementation a field that is indexed has to be duplicated
        if it is stored in compressed form.

        regards,
        Christoph
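
        The idea above (compress a string-valued field when it is written and
        transparently decompress it back into a String when it is read) can be
        sketched roughly as follows. This is only an illustration using
        java.util.zip, not code from the attached patch; the class name
        CompressedFieldSketch and its methods are hypothetical, and the use of
        UTF-8 and BEST_COMPRESSION is an assumption.

        ```java
        import java.io.ByteArrayOutputStream;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.DataFormatException;
        import java.util.zip.Deflater;
        import java.util.zip.Inflater;

        // Hypothetical helper, not part of Lucene or of the patch.
        public class CompressedFieldSketch {

            /** Compress a string value, as a writer could do before storing it. */
            public static byte[] compress(String value) {
                byte[] input = value.getBytes(StandardCharsets.UTF_8);
                Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
                deflater.setInput(input);
                deflater.finish();
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[1024];
                while (!deflater.finished()) {
                    int n = deflater.deflate(buffer);
                    out.write(buffer, 0, n);
                }
                deflater.end();
                return out.toByteArray();
            }

            /** Decompress stored bytes back into a String, as a reader could do. */
            public static String decompress(byte[] stored) {
                Inflater inflater = new Inflater();
                inflater.setInput(stored);
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[1024];
                try {
                    while (!inflater.finished()) {
                        int n = inflater.inflate(buffer);
                        // No progress and not finished: the input is truncated or invalid.
                        if (n == 0 && !inflater.finished()
                                && (inflater.needsInput() || inflater.needsDictionary())) {
                            throw new IllegalArgumentException("truncated compressed data");
                        }
                        out.write(buffer, 0, n);
                    }
                } catch (DataFormatException e) {
                    throw new IllegalArgumentException("invalid compressed data", e);
                } finally {
                    inflater.end();
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }

            public static void main(String[] args) {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < 200; i++) {
                    sb.append("stored field content ");
                }
                String original = sb.toString();
                byte[] stored = compress(original);
                System.out.println(original.length() + " chars -> " + stored.length + " bytes");
                System.out.println("round trip ok: " + original.equals(decompress(stored)));
            }
        }
        ```

        With a scheme like this the caller only ever sees the String value, so
        an indexed field would not need a second, compressed copy.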

        bernhard.messer@intrafind.de Bernhard Messer added a comment -

        Created an attachment (id=12895)
        [PATCH] all diffs and one additional testcase for the compression enhancement

        bernhard.messer@intrafind.de Bernhard Messer added a comment -

        hi,

        the zip file added today contains the improved version of the compression patch,
        based on the latest source from CVS, with the new features discussed on the
        mailing list implemented. The patch contains three diff files (Field.diff,
        FieldsReader.diff and FieldsWriter.diff) and one new test case to test the
        compression functionality.

        This patch now allows compression on either binary or string-valued fields.

        There is also a small cleanup in FieldsReader and FieldsWriter: static
        members now name the bit values, which makes the code more readable (Doug
        asked for it).
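
        As an illustration of what naming the bit values with static members
        can look like, here is a minimal sketch. The constant names and values
        and the class FieldFlagsSketch are assumptions made for this example,
        not taken from the diffs.

        ```java
        // Hypothetical sketch of per-field property bits; not the patch's code.
        public class FieldFlagsSketch {
            // One bit per property, so several can be packed into a single byte.
            static final byte FIELD_IS_TOKENIZED = 0x1;
            static final byte FIELD_IS_BINARY = 0x2;
            static final byte FIELD_IS_COMPRESSED = 0x4;

            /** Pack the three properties into one byte, as a writer might do. */
            public static byte encode(boolean tokenized, boolean binary, boolean compressed) {
                byte bits = 0;
                if (tokenized) bits |= FIELD_IS_TOKENIZED;
                if (binary) bits |= FIELD_IS_BINARY;
                if (compressed) bits |= FIELD_IS_COMPRESSED;
                return bits;
            }

            /** Test single bits by name instead of by magic number, as a reader might do. */
            public static boolean isTokenized(byte bits) {
                return (bits & FIELD_IS_TOKENIZED) != 0;
            }

            public static boolean isBinary(byte bits) {
                return (bits & FIELD_IS_BINARY) != 0;
            }

            public static boolean isCompressed(byte bits) {
                return (bits & FIELD_IS_COMPRESSED) != 0;
            }

            public static void main(String[] args) {
                byte bits = encode(false, true, true);
                System.out.println("binary=" + isBinary(bits) + " compressed=" + isCompressed(bits));
            }
        }
        ```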

        The Field class now stores all 3 possible data values (reader, string or byte[])
        within a single member. This change was also requested by Doug and makes things
        easier to handle, but is not directly related to compression.
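
        Storing the three possible value types in a single member could look
        roughly like the sketch below. The class FieldValueSketch, the member
        name fieldsData and the accessor names are assumptions chosen for
        illustration; the actual change is in Field.diff.

        ```java
        import java.io.Reader;

        // Hypothetical sketch: one Object member holds a Reader, a String or a byte[].
        public class FieldValueSketch {
            private final Object fieldsData;

            public FieldValueSketch(String value) {
                this.fieldsData = value;
            }

            public FieldValueSketch(Reader value) {
                this.fieldsData = value;
            }

            public FieldValueSketch(byte[] value) {
                this.fieldsData = value;
            }

            /** Returns the String value, or null if this field holds another type. */
            public String stringValue() {
                return fieldsData instanceof String ? (String) fieldsData : null;
            }

            /** Returns the Reader value, or null if this field holds another type. */
            public Reader readerValue() {
                return fieldsData instanceof Reader ? (Reader) fieldsData : null;
            }

            /** Returns the binary value, or null if this field holds another type. */
            public byte[] binaryValue() {
                return fieldsData instanceof byte[] ? (byte[]) fieldsData : null;
            }

            public static void main(String[] args) {
                FieldValueSketch f = new FieldValueSketch(new byte[] {1, 2, 3});
                System.out.println("binary length: " + f.binaryValue().length);
                System.out.println("string value: " + f.stringValue());
            }
        }
        ```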

        With compression activated, the index size can be reduced to 60% of the
        original size when storing whole documents within the index, at the cost of
        increasing indexing time by roughly 50-70%. Regarding query performance, I saw
        no difference between a compressed and an uncompressed index. What may take a
        bit longer is fetching the hit documents.

        All Lucene test cases pass. So maybe the Lucene committers can have a look
        at it and decide if it will be part of the next version.

        If there are any questions regarding the changes, leave a note on the developer
        list.

        regards and fun with it
        bernhard

        goller@detego-software.de Christoph Goller added a comment -

        Thank you very much for the excellent patch.
        It's reviewed and committed.

        Christoph


          People

          • Assignee:
            java-dev@lucene.apache.org Lucene Developers
            Reporter:
            bernhard.messer@intrafind.de Bernhard Messer