Lucene - Core
LUCENE-4226

Efficient compression of small to medium stored fields

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, Trunk
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I've been doing some experiments with stored fields lately. It is very common for an index with stored fields enabled to have most of its space used by the .fdt file. To prevent this .fdt file from growing too much, one option is to compress stored fields. Although compression works rather well for large fields, this is not the case for small fields: the compression ratio (compressed size relative to original size) can be very close to 100%, even with efficient compression algorithms.

      In order to improve the compression ratio for small fields, I've written a StoredFieldsFormat that compresses several documents in a single chunk of data. To see how it behaves in terms of document deserialization speed and compression ratio, I've run several tests with different index compression strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text were indexed and stored):

      • no compression,
      • docs compressed with deflate (compression level = 1),
      • docs compressed with deflate (compression level = 9),
      • docs compressed with Snappy,
      • using the compressing StoredFieldsFormat with deflate (level = 1) and chunks of 6 docs,
      • using the compressing StoredFieldsFormat with deflate (level = 9) and chunks of 6 docs,
      • using the compressing StoredFieldsFormat with Snappy and chunks of 6 docs.
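
      To illustrate why compressing several small documents together helps, here is a minimal sketch that compares per-document deflate with deflate over chunks of 6 documents. It is not the code from the patch: the class and method names are hypothetical, it uses java.util.zip.Deflater directly on raw bytes, and it ignores the per-document offsets a real format would have to store.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Hypothetical illustration, not the StoredFieldsFormat from the patch.
public class ChunkCompressionSketch {

    /** Compresses a byte array with deflate at the given level. */
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Total compressed size when every document is compressed on its own. */
    static long perDocSize(List<byte[]> docs, int level) {
        long total = 0;
        for (byte[] doc : docs) {
            total += deflate(doc, level).length;
        }
        return total;
    }

    /** Total compressed size when documents are concatenated into chunks of chunkSize docs. */
    static long perChunkSize(List<byte[]> docs, int chunkSize, int level) {
        long total = 0;
        for (int start = 0; start < docs.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, docs.size());
            ByteArrayOutputStream chunk = new ByteArrayOutputStream();
            for (byte[] doc : docs.subList(start, end)) {
                chunk.write(doc, 0, doc.length);
            }
            total += deflate(chunk.toByteArray(), level).length;
        }
        return total;
    }

    public static void main(String[] args) {
        // Synthetic small documents that share vocabulary (an assumption for the demo).
        List<byte[]> docs = new ArrayList<>();
        for (int i = 0; i < 60; i++) {
            String doc = "title: article " + i + "; text: stored fields are small and share vocabulary";
            docs.add(doc.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("per doc:   " + perDocSize(docs, 1) + " bytes");
        System.out.println("per chunk: " + perChunkSize(docs, 6, 1) + " bytes");
    }
}
```

      With small documents that share vocabulary, the compressor finds far more repeated sequences inside a chunk than inside any single document, and the fixed per-stream overhead is paid once per chunk instead of once per document.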

      For those who don't know Snappy, it is a compression algorithm from Google that does not compress as tightly as deflate (its compression ratios are higher), but compresses and decompresses data very quickly.

      Format           Compression ratio     IndexReader.document time
      ————————————————————————————————————————————————————————————————
      uncompressed     100%                  100%
      doc/deflate 1     59%                  616%
      doc/deflate 9     58%                  595%
      doc/snappy        80%                  129%
      index/deflate 1   49%                  966%
      index/deflate 9   46%                  938%
      index/snappy      65%                  264%
      

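      (doc = doc-level compression, index = index-level compression, i.e. the compressing StoredFieldsFormat with chunks of 6 docs; both columns are relative to the uncompressed index, lower is better)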

      I find it interesting because it allows trading speed for space (with deflate, the .fdt file shrinks by a factor of 2, much better than with doc-level compression). Another interesting point is that index/snappy is almost as compact as doc/deflate (65% vs. 58-59%) while being more than 2x faster at retrieving documents from disk (264% vs. roughly 600% of the uncompressed time).
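
      The speed cost of chunking comes from having to decompress the whole chunk to reach one document. The following sketch shows that read path under simplifying assumptions: the Chunk class, writeChunk and readDoc are hypothetical names, the chunk lives in memory rather than in the .fdt file, and deflate from java.util.zip stands in for whichever algorithm is used.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration of reading one document out of a compressed chunk.
public class ChunkReadSketch {

    /** In-memory chunk: per-document lengths plus the deflated concatenation of the docs. */
    static final class Chunk {
        final int[] docLengths;
        final byte[] compressed;

        Chunk(int[] docLengths, byte[] compressed) {
            this.docLengths = docLengths;
            this.compressed = compressed;
        }
    }

    /** Concatenates the documents and compresses them as a single deflate stream. */
    static Chunk writeChunk(byte[][] docs, int level) {
        int[] lengths = new int[docs.length];
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        for (int i = 0; i < docs.length; i++) {
            lengths[i] = docs[i].length;
            raw.write(docs[i], 0, docs[i].length);
        }
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(raw.toByteArray());
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            return new Chunk(lengths, out.toByteArray());
        } finally {
            deflater.end();
        }
    }

    /** Returns document number index; note that the whole chunk is inflated first. */
    static byte[] readDoc(Chunk chunk, int index) {
        int total = 0;
        int offset = 0;
        for (int i = 0; i < chunk.docLengths.length; i++) {
            if (i < index) {
                offset += chunk.docLengths[i]; // start of the requested doc
            }
            total += chunk.docLengths[i];
        }
        byte[] raw = new byte[total];
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(chunk.compressed);
            int read = 0;
            while (read < total && !inflater.finished()) {
                int n = inflater.inflate(raw, read, total - read);
                if (n == 0) {
                    break; // no progress: truncated or corrupt input
                }
                read += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt chunk", e);
        } finally {
            inflater.end();
        }
        return Arrays.copyOfRange(raw, offset, offset + chunk.docLengths[index]);
    }
}
```

      Every call to readDoc pays for inflating all the documents in the chunk, which is consistent with the index/* rows being slower than the corresponding doc/* rows in the table above.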

      These tests were run with a hot OS cache, which is the worst case for compressed fields (one can expect better results for formats that compress well, since they probably require fewer read/write operations from disk).

      1. LUCENE-4226.patch
        111 kB
        Adrien Grand
      2. LUCENE-4226.patch
        110 kB
        Adrien Grand
      3. LUCENE-4226.patch
        114 kB
        Adrien Grand
      4. LUCENE-4226.patch
        114 kB
        Adrien Grand
      5. LUCENE-4226.patch
        109 kB
        Adrien Grand
      6. LUCENE-4226.patch
        83 kB
        Adrien Grand
      7. CompressionBenchmark.java
        8 kB
        Adrien Grand
      8. LUCENE-4226.patch
        85 kB
        Adrien Grand
      9. CompressionBenchmark.java
        11 kB
        Adrien Grand
      10. SnappyCompressionAlgorithm.java
        4 kB
        Adrien Grand
      11. LUCENE-4226.patch
        55 kB
        Adrien Grand

          Activity

          Adrien Grand created issue -
          Adrien Grand made changes -
          Field Original Value New Value
          Attachment LUCENE-4226.patch [ 12536682 ]
          Attachment SnappyCompressionAlgorithm.java [ 12536683 ]
          Attachment CompressionBenchmark.java [ 12536684 ]
          Adrien Grand made changes -
          Attachment LUCENE-4226.patch [ 12542866 ]
          Attachment CompressionBenchmark.java [ 12542867 ]
          Adrien Grand made changes -
          Link This issue is related to LUCENE-2810 [ LUCENE-2810 ]
          Adrien Grand made changes -
          Attachment LUCENE-4226.patch [ 12544420 ]
          Adrien Grand made changes -
          Assignee Adrien Grand [ jpountz ]
          Adrien Grand made changes -
          Fix Version/s 4.1 [ 12321140 ]
          Fix Version/s 5.0 [ 12321663 ]
          Adrien Grand made changes -
          Attachment LUCENE-4226.patch [ 12546615 ]
          Adrien Grand made changes -
          Attachment LUCENE-4226.patch [ 12547607 ]
          Adrien Grand made changes -
          Attachment LUCENE-4426.patch [ 12547635 ]
          Adrien Grand made changes -
          Attachment LUCENE-4426.patch [ 12547635 ]
          Adrien Grand made changes -
          Attachment LUCENE-4226.patch [ 12547638 ]
          Adrien Grand made changes -
          Attachment LUCENE-4226.patch [ 12547815 ]
          Adrien Grand made changes -
          Attachment LUCENE-4226.patch [ 12547856 ]
          Adrien Grand made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Adrien Grand
              Reporter:
              Adrien Grand
            • Votes:
              3
              Watchers:
              9

              Dates

              • Created:
                Updated:
                Resolved:
