LUCENE-648: Allow changing of ZIP compression level for compressed fields

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.9, 2.0.0, 2.1
    • Fix Version/s: None
    • Component/s: core/index
    • Labels: None

      Description

      In response to this thread:

      http://www.gossamer-threads.com/lists/lucene/java-user/38810

      I think we should allow changing the compression level used in the call to java.util.zip.Deflater in FieldsWriter.java. Right now it's hardwired to "best":

      compressor.setLevel(Deflater.BEST_COMPRESSION);

      Unfortunately, this can apparently cause the zip library to take a very long time (10 minutes for 4.5 MB in the above thread) and so people may want to change this setting.
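
      For reference, here is a minimal sketch of the compression step in question, using the standard java.util.zip.Deflater API but with the level taken as a parameter instead of being hardwired. The class and helper names are illustrative only, not the actual FieldsWriter code:

      import java.io.ByteArrayOutputStream;
      import java.util.zip.Deflater;

      // Illustrative sketch only -- not the actual FieldsWriter code.
      public class CompressionLevelSketch {

        // Compress 'input' at the given level (0 = no compression .. 9 = BEST_COMPRESSION).
        static byte[] compress(byte[] input, int level) {
          Deflater compressor = new Deflater();
          compressor.setLevel(level);  // FieldsWriter currently always passes Deflater.BEST_COMPRESSION
          compressor.setInput(input);
          compressor.finish();

          ByteArrayOutputStream bos = new ByteArrayOutputStream(input.length);
          byte[] buf = new byte[1024];
          while (!compressor.finished()) {
            bos.write(buf, 0, compressor.deflate(buf));
          }
          compressor.end();
          return bos.toByteArray();
        }
      }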

      One approach would be to read the default from a Java system property, but it seems there was a recent effort (pre-2.0, I think) to stop relying on Java system properties (many were removed).

      A second approach would be to add static methods (and a static class attribute) to globally set the compression level.

      A third approach would be a setCompressLevel/getCompressLevel on the document.Field class. But then every time a document is created with this field you'd have to call setCompressLevel, since Lucene doesn't have a global Field schema (like Solr).

      Any other ideas / preferences for any of these approaches?

        Activity

        Grant Ingersoll added a comment -

        Just curious, have you tried other values in here to see what kind of difference it makes before we go looking for a solution? Could you maybe put together a little benchmark that tries out the various levels and report back?

        It could be possible to add another addDocument method to IndexWriter so you could change it per document; we could make it part of the IndexWriter constructor, or we could do it as mentioned above. I am not sure what the best way is just yet.

        I think this also may fall under the notion of the Flexible Indexing thread that we have been talking about (someday it will get implemented).
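
        A rough sketch of the kind of stand-alone micro-benchmark being asked for here, using only current java.util.zip and java.nio APIs (the input file path is whatever sample document you want to test with):

        import java.io.ByteArrayOutputStream;
        import java.nio.file.Files;
        import java.nio.file.Paths;
        import java.util.zip.Deflater;

        // Compress one sample file at each ZIP level and print elapsed time and
        // compressed size, to see how much the level actually matters.
        public class CompressionLevelBench {
          public static void main(String[] args) throws Exception {
            byte[] data = Files.readAllBytes(Paths.get(args[0]));
            for (int level = 0; level <= 9; level++) {
              long start = System.currentTimeMillis();
              Deflater d = new Deflater(level);
              d.setInput(data);
              d.finish();
              ByteArrayOutputStream out = new ByteArrayOutputStream(data.length);
              byte[] buf = new byte[8192];
              while (!d.finished()) {
                out.write(buf, 0, d.deflate(buf));
              }
              d.end();
              long elapsed = System.currentTimeMillis() - start;
              System.out.println("level=" + level + " ms=" + elapsed + " bytes=" + out.size());
            }
          }
        }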

        Michael McCandless added a comment -

        Good question! I will try to get the original document if possible and also run some simple tests to see the variance of CPU time consumed vs % compressed.

        Michael Busch added a comment -

        I think the compression level is only one part of the performance problem. Another drawback of the current implementation is how compressed fields are merged: the FieldsReader uncompresses the fields, the SegmentMerger concatenates them, and the FieldsWriter compresses the data again. The uncompress/compress steps are completely unnecessary and result in a large overhead. Before a document is written to disk, the data of its fields is even compressed twice: first when the DocumentWriter writes the single-document segment to the RAMDirectory, and again when the SegmentMerger merges the segments inside the RAMDirectory to write the merged segment to disk.

        Please check out LUCENE-629 (http://issues.apache.org/jira/browse/LUCENE-629), where I recently posted a patch that fixes this problem and increases indexing speed significantly. I also included some performance test results which quantify the improvement. Mike, it would be great if you could also try out the patched version for your tests with the compression level.

        Jason Polites added a comment -

        If you find that compression level has a meaningful impact (which, as suggested, it may not), one approach for a low-impact fix would be to allow the end user to specify their own Inflater/Deflater when creating the IndexWriter. If not specified, behaviour remains as is. If the user specifies a different compression level when retrieving the document, that's their bad luck.
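
        The caller side of that suggestion might look roughly like the sketch below. Only the Deflater setup is real java.util.zip API; an IndexWriter constructor or setter that accepts a user-supplied Deflater does not exist and is the hypothetical part:

        import java.util.zip.Deflater;

        // Hypothetical sketch of the suggestion above. The Deflater setup is plain
        // java.util.zip; the IndexWriter hook that would accept it is imaginary.
        public class CustomDeflaterIdea {
          public static void main(String[] args) {
            Deflater deflater = new Deflater(Deflater.BEST_SPEED); // trade index size for speed
            // hypothetically: IndexWriter writer = new IndexWriter(dir, analyzer, true, deflater);
            deflater.end();
          }
        }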

        Michael McCandless added a comment -

        OK I ran some basic benchmarks to test the effect on indexing of
        varying the ZIP compression level from 0-9.

        Lucene currently hardwires compression level at 9 (= BEST).

        I found a decent text corpus here:

        http://people.csail.mit.edu/koehn/publications/europarl

        I ran all tests on the "Portuguese-English" data set, which is a total
        of 327.5 MB of plain text across 976 files.

        I just ran the demo IndexFiles, modified to add the file contents as
        a compressed stored field only (i.e. not indexed). Note that this
        "amplifies" the cost of compression because in a real setting there
        would also be a number of indexed fields.

        I didn't change any of the default merge factor settings. I'm running
        on Ubuntu Linux 6.06, single-CPU (2.4 GHz Pentium 4) desktop machine with
        index stored on an internal ATA hard drive.

        I first tested indexing time with and without the patch from
        LUCENE-629 here:

        old version: 648.7 sec

        patched version: 145.5 sec

        We clearly need to get that patch committed & released! Compressed
        fields are far more costly than they ought to be, and people are now
        using this feature (as of the 1.9 release).

        So, then I ran all subsequent tests with the above patch applied. All
        numbers are avg. of 3 runs:

        Level   Index time (sec)   Index size (MB)

        None          65.3              322.3
        0             92.3              322.3
        1             80.8              128.8
        2             80.6              122.2
        3             81.3              115.8
        4             89.8              111.3
        5            104.0              106.2
        6            121.8              103.6
        7            131.7              103.1
        8            144.8              102.9
        9            145.5              102.9

        Quick conclusions:

        • There is indeed a substantial variance when you change the compression
          level.
        • The "sweet spot" above seems to be around 4 or 5 – should we
          change the default from 9?
        • I would still say we should make it possible for Lucene users to
          change the compression level?
        Otis Gospodnetic added a comment -

        I agree. I like the idea of externalizing this, too, as suggested by Robert on the mailing list.

        Grant Ingersoll added a comment -

        Won't Fix, as I think it is agreed that compression should be handled outside of Lucene and the result stored as a binary value.

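        For completeness, a minimal sketch of the recommended route: compress the value yourself with whatever Deflater level you like and store the result as a plain binary stored field. This assumes the byte[] Field constructor of the 2.x line; names and buffer sizes are illustrative:

        import java.io.ByteArrayOutputStream;
        import java.util.zip.Deflater;

        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;

        // Compress outside of Lucene at a caller-chosen level, then store the
        // compressed bytes as a binary (not indexed) field.
        public class ExternalCompressionSketch {

          static Document makeDoc(String text, int level) {
            Deflater d = new Deflater(level);     // e.g. 5 for the "sweet spot" above
            d.setInput(text.getBytes());
            d.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            while (!d.finished()) {
              out.write(buf, 0, d.deflate(buf));
            }
            d.end();

            Document doc = new Document();
            doc.add(new Field("body", out.toByteArray(), Field.Store.YES));
            return doc;
          }
        }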

          People

          • Assignee: Unassigned
          • Reporter: Michael McCandless
          • Votes: 1
          • Watchers: 0
