  1. Lucene - Core
  2. LUCENE-9447

Make BEST_COMPRESSION compress more aggressively?

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.7
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The Lucene86 codec supports setting a "Mode" for stored fields compression, which is either "BEST_SPEED" (blocks of 16kB or 128 documents, whichever is hit first, compressed with LZ4) or "BEST_COMPRESSION" (blocks of 60kB or 512 documents, compressed with DEFLATE at the default compression level, 6).
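
      To make the "whichever is hit first" rule concrete, here is a minimal sketch of such a flush policy. It is an illustration only, not Lucene's actual stored fields writer: the class name `BlockFlushSketch` and the method `blockDocCounts` are hypothetical, and only the limits (16kB/128 documents, 60kB/512 documents) come from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of a size-or-count flush policy; not Lucene code. */
final class BlockFlushSketch {

  /** Returns how many documents end up in each block for the given limits. */
  static List<Integer> blockDocCounts(int[] docSizes, int maxBlockBytes, int maxDocsPerBlock) {
    List<Integer> blocks = new ArrayList<>();
    int bufferedBytes = 0;
    int bufferedDocs = 0;
    for (int size : docSizes) {
      bufferedBytes += size;
      bufferedDocs++;
      // Flush as soon as either limit is reached, whichever comes first.
      if (bufferedBytes >= maxBlockBytes || bufferedDocs >= maxDocsPerBlock) {
        blocks.add(bufferedDocs);
        bufferedBytes = 0;
        bufferedDocs = 0;
      }
    }
    if (bufferedDocs > 0) {
      blocks.add(bufferedDocs); // final, partially filled block
    }
    return blocks;
  }

  public static void main(String[] args) {
    int[] docs = new int[1000];
    Arrays.fill(docs, 2048); // assumed workload: 1000 documents of 2kB each
    System.out.println("16kB/128 docs: " + blockDocCounts(docs, 16 * 1024, 128).size() + " blocks");
    System.out.println("60kB/512 docs: " + blockDocCounts(docs, 60 * 1024, 512).size() + " blocks");
  }
}
```

      With 2kB documents the byte limit is always hit long before the document limit, so in this sketch the document cap only matters for very small documents.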

      After recently looking at indices that spent most of their disk space on stored fields, I noticed that there was considerable room for improvement by increasing the block size even further:

      Block size    Stored fields size
      60kB          168412338
      128kB         130813639
      256kB         113587009
      512kB         104776378
      1MB           100367095
      2MB            98152464
      4MB            97034425
      8MB            96478746

      For this specific dataset, I had 1M documents, each with about 2kB of stored fields and quite a lot of redundancy.
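
      The effect of block size can be reproduced in miniature with the JDK alone. The sketch below is not the benchmark used for the numbers above: it builds an assumed synthetic dataset of redundant ~2kB "documents", compresses it in independent blocks with `java.util.zip.Deflater` at level 6, and sums the compressed sizes. The class name `BlockSizeExperiment`, the record template, and the seed are all made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

/** Hypothetical experiment: total DEFLATE output when data is compressed in independent blocks. */
final class BlockSizeExperiment {

  /** Builds redundant synthetic data: each "document" repeats a fixed template up to about 2kB. */
  static byte[] syntheticData(int numDocs, long seed) {
    Random random = new Random(seed);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int doc = 0; doc < numDocs; doc++) {
      StringBuilder sb = new StringBuilder();
      while (sb.length() < 2048) {
        sb.append("{\"timestamp\":").append(random.nextInt(1_000_000))
          .append(",\"status\":\"ok\",\"host\":\"server-").append(random.nextInt(50))
          .append("\",\"message\":\"request completed\"}\n");
      }
      byte[] bytes = sb.toString().getBytes(StandardCharsets.UTF_8);
      out.write(bytes, 0, bytes.length);
    }
    return out.toByteArray();
  }

  /** Compresses each block independently (fresh Deflater per block) and returns the total size. */
  static long compressedSize(byte[] data, int blockSize) {
    long total = 0;
    byte[] buffer = new byte[8192];
    for (int offset = 0; offset < data.length; offset += blockSize) {
      int length = Math.min(blockSize, data.length - offset);
      Deflater deflater = new Deflater(6); // level 6 is zlib's default level
      deflater.setInput(data, offset, length);
      deflater.finish();
      while (!deflater.finished()) {
        total += deflater.deflate(buffer);
      }
      deflater.end();
    }
    return total;
  }

  public static void main(String[] args) {
    byte[] data = syntheticData(200, 42L);
    for (int blockSize : new int[] {4 << 10, 16 << 10, 64 << 10, 256 << 10}) {
      System.out.println((blockSize >> 10) + "kB blocks: " + compressedSize(data, blockSize) + " bytes");
    }
  }
}
```

      Because every block starts with an empty history, small blocks pay the cost of re-learning the same redundancy many times. The absolute numbers depend entirely on the synthetic data and say nothing about real indices.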

      This makes me want to look into bumping this block size to maybe 256kB. It would be interesting to re-do the experiments we did on LUCENE-6100 to see how this affects the merging speed. That said, I don't think it would be terrible if the merging time increased a bit, given that we already offer the BEST_SPEED option for CPU-savvy users.
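
      As a quick check on how much of the available gain a 256kB block captures, the savings relative to the 60kB baseline can be computed directly from the table. This is plain arithmetic on the reported sizes; the class name `SavingsSketch` and the helper `savingsPercent` are hypothetical.

```java
/** Hypothetical helper: relative savings of each block size versus the 60kB baseline. */
final class SavingsSketch {

  /** Percentage by which {@code size} is smaller than {@code baseline}. */
  static double savingsPercent(long baseline, long size) {
    return 100.0 * (baseline - size) / baseline;
  }

  public static void main(String[] args) {
    // Stored fields sizes copied from the table in the description.
    String[] labels = {"60kB", "128kB", "256kB", "512kB", "1MB", "2MB", "4MB", "8MB"};
    long[] sizes = {
      168412338L, 130813639L, 113587009L, 104776378L,
      100367095L, 98152464L, 97034425L, 96478746L
    };
    long baseline = sizes[0];
    for (int i = 0; i < sizes.length; i++) {
      System.out.printf("%-6s %5.1f%% smaller than 60kB%n", labels[i], savingsPercent(baseline, sizes[i]));
    }
  }
}
```

      On these figures, 256kB blocks are about 32.6% smaller than the 60kB baseline, while 8MB blocks are about 42.7% smaller, so 256kB already captures most of the gain.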

            People

            • Assignee:
              Unassigned
              Reporter:
              jpountz Adrien Grand

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                • Original Estimate: Not Specified
                • Remaining Estimate: 0h
                • Time Spent: 50m