Lucene - Core / LUCENE-6100

Further tuning of Lucene50Codec(BEST_COMPRESSION)

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Currently this codec has two options: BEST_SPEED and BEST_COMPRESSION. But for highly compressible data, the ratio for BEST_COMPRESSION is not much better than for BEST_SPEED, because both share the same underlying format, which is not optimized for this case.

      The block size is currently 24576 bytes (the 32KB sliding window size minus an 8KB "grace" to avoid going over it). We compress this in a stateless manner: each block is its own stream, and blocks don't share a preset dictionary or anything else. So there is a lot of waste in many cases, since zlib has to start from scratch for each block, and we generally throw away 1/4 of the window and start over.
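
      The effect of stateless per-block compression can be tried outside Lucene. The following is a minimal sketch using java.util.zip.Deflater, not the codec's actual code: the class name, the helper methods, and the synthetic log lines are all made up for illustration. It compresses the same input in independent blocks of 24576 bytes and of 60KB, creating a fresh Deflater for every block so that no history carries over.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical illustration class; not part of Lucene.
class BlockCompressionSketch {

    // 32KB sliding window minus the 8KB "grace" described in the issue.
    static final int CURRENT_BLOCK_SIZE = 32 * 1024 - 8 * 1024; // 24576
    // Block size used by the "patch (60KB)" row.
    static final int PROPOSED_BLOCK_SIZE = 60 * 1024;

    /** Compresses data in independent blocks and returns the total compressed size in bytes. */
    static long compressedSize(byte[] data, int blockSize, int level) {
        byte[] buffer = new byte[8192];
        long total = 0;
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            // A fresh raw-deflate stream per block: nothing is shared between blocks.
            Deflater deflater = new Deflater(level, true);
            try {
                deflater.setInput(data, off, len);
                deflater.finish();
                while (!deflater.finished()) {
                    total += deflater.deflate(buffer);
                }
            } finally {
                deflater.end();
            }
        }
        return total;
    }

    /** Builds repetitive, log-like sample input (assumed shape, for illustration only). */
    static byte[] sampleLogs(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2014-12-08 12:00:").append(i % 60)
              .append(" INFO request id=").append(i)
              .append(" status=200 path=/index.html\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleLogs(20000);
        System.out.println("raw bytes:      " + data.length);
        System.out.println("24576B blocks:  "
            + compressedSize(data, CURRENT_BLOCK_SIZE, Deflater.BEST_COMPRESSION));
        System.out.println("60KB blocks:    "
            + compressedSize(data, PROPOSED_BLOCK_SIZE, Deflater.BEST_COMPRESSION));
    }
}
```

      The absolute numbers depend entirely on the input, so this only shows the mechanism (fewer restarts with larger blocks), not the results reported in this issue.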

      I ran some experiments with highly compressible log data:

      method             indexing time (ms)   merging time (ms)   fdt (bytes)   fdx (bytes)
      BEST_SPEED                    101,729              15,638   372,845,282       406,964
      BEST_COMPRESSION              114,364              23,474   269,387,347       275,909
      patch (60KB)                  105,533              18,914   237,284,342       117,639
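
      To put the fdt column in relative terms, the arithmetic can be sketched as follows. The class and method names are made up for illustration; the constants are copied from the table above.

```java
// Hypothetical helper for reading the table; not part of Lucene.
class SizeReductionSketch {

    /** Fractional size reduction of candidate relative to baseline (0.25 means 25% smaller). */
    static double reduction(long baseline, long candidate) {
        return 1.0 - (double) candidate / (double) baseline;
    }

    public static void main(String[] args) {
        // fdt sizes from the table above.
        long bestSpeedFdt = 372_845_282L;
        long bestCompressionFdt = 269_387_347L;
        long patchFdt = 237_284_342L;

        System.out.printf("BEST_COMPRESSION vs BEST_SPEED: %.1f%% smaller%n",
            100 * reduction(bestSpeedFdt, bestCompressionFdt));
        System.out.printf("patch (60KB) vs BEST_SPEED: %.1f%% smaller%n",
            100 * reduction(bestSpeedFdt, patchFdt));
        System.out.printf("patch (60KB) vs BEST_COMPRESSION: %.1f%% smaller%n",
            100 * reduction(bestCompressionFdt, patchFdt));
    }
}
```

      On these figures the reductions come out to about 28%, 36%, and 12% respectively.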

      The other experiments I ran were:

      method               indexing time (ms)   merging time (ms)   fdt (bytes)   fdx (bytes)
      crappy preset                   130,854              38,095   234,603,971       274,500
      64KB                            107,256              21,570   236,004,297       111,135
      crappy preset+64KB              121,503              30,030   222,422,924       110,751

      For 'crappy preset' I just use an arbitrary first 32KB of the original data as a preset dictionary for every block. This is effective, but slow because of some unnecessary overhead (such as recomputing adler32 of the preset dictionary for each block). However, this overhead shrinks with larger block sizes, and the approach still offers benefits, so maybe we can do it in the future (especially if, e.g., it's per-chunk and we can bulk merge chunks without recompressing).
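
      The preset-dictionary idea can be illustrated with the JDK's zlib bindings. This is a minimal sketch under stated assumptions, not the experiment's code: the class and method names are hypothetical, and it uses the zlib-wrapped format, in which the stream header records that a dictionary is required. Deflater.setDictionary primes the compressor, and the reader supplies the same bytes when Inflater.needsDictionary() reports that one is needed.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration class; not part of Lucene.
class PresetDictionarySketch {

    /** Compresses one block as its own zlib stream, optionally primed with a preset dictionary. */
    static byte[] compress(byte[] block, byte[] dict) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            if (dict != null) {
                deflater.setDictionary(dict); // must be set before any input is compressed
            }
            deflater.setInput(block);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Decompresses one block, supplying the dictionary when the stream asks for it. */
    static byte[] decompress(byte[] compressed, byte[] dict) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0) {
                    if (inflater.finished()) {
                        break;
                    }
                    if (inflater.needsDictionary()) {
                        if (dict == null) {
                            throw new IllegalArgumentException("stream requires a preset dictionary");
                        }
                        inflater.setDictionary(dict);
                    } else if (inflater.needsInput()) {
                        throw new IllegalArgumentException("truncated stream");
                    }
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt stream", e);
        } finally {
            inflater.end();
        }
    }
}
```

      A caller following the 'crappy preset' approach would pass something like the first 32KB of the data as dict for every block, and must keep those bytes available to the reader.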

      For 64KB, we measure removing the "grace" completely, so it spills over to another block each time. The proposed smaller "grace" amount still offers CPU savings, so I think we should keep it. But it's not terrible if you go over.
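
      The role of the "grace" can be sketched with a toy model. It assumes a writer that buffers whole documents and flushes a chunk once the buffered bytes reach a threshold, so a chunk can overshoot the threshold by up to one document. The class, the methods, and the 3000-byte document size are hypothetical, chosen only to show the overshoot.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of threshold-based chunk flushing; not part of Lucene.
class GraceSketch {

    /** Sizes of the chunks produced when a chunk is flushed once it reaches the threshold. */
    static List<Integer> chunkSizes(int[] docSizes, int threshold) {
        List<Integer> chunks = new ArrayList<>();
        int buffered = 0;
        for (int size : docSizes) {
            buffered += size; // documents are never split across chunks
            if (buffered >= threshold) {
                chunks.add(buffered);
                buffered = 0;
            }
        }
        if (buffered > 0) {
            chunks.add(buffered); // final partial chunk
        }
        return chunks;
    }

    /** Number of chunks whose size exceeds the given limit. */
    static long countOver(List<Integer> chunks, int limit) {
        return chunks.stream().filter(c -> c > limit).count();
    }

    public static void main(String[] args) {
        int[] docs = new int[100];
        java.util.Arrays.fill(docs, 3000); // assumed document size in bytes
        int limit = 64 * 1024;
        System.out.println("no grace (64KB threshold): "
            + countOver(chunkSizes(docs, 64 * 1024), limit) + " chunks over 64KB");
        System.out.println("4KB grace (60KB threshold): "
            + countOver(chunkSizes(docs, 60 * 1024), limit) + " chunks over 64KB");
    }
}
```

      In this model, a threshold equal to the limit makes every full chunk spill past it, while a threshold a few KB lower keeps chunks under the limit as long as documents are smaller than the grace.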

      Attachments

        1. LUCENE-6100.patch
          2 kB
          Robert Muir

          People

            Assignee: Unassigned
            Reporter: Robert Muir (rcmuir)