[LUCENE-6100] Further tuning of Lucene50Codec(BEST_COMPRESSION) - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 5.0, 6.0
Component/s: None
Labels: None
Lucene Fields: New

Description

Currently this codec has two options: BEST_SPEED and BEST_COMPRESSION. But for highly compressible data, the ratio BEST_COMPRESSION achieves is not much better than BEST_SPEED, because both share the same underlying format, which is not optimized for this case.

The block size is currently 24576 bytes (the 32KB sliding window size minus an 8KB "grace" to avoid going over it). We compress in a stateless manner: each block is its own stream, and blocks don't share a preset dictionary or anything else. So there is a lot of waste in many cases: zlib has to start from scratch on every block, and we generally throw away 1/4 of the window and start over.
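To make the cost of stateless per-block compression concrete, here is a minimal sketch using Python's standard zlib module rather than Lucene itself. The synthetic log lines, the helper name compress_in_blocks, and the compression level are all assumptions for illustration; the sketch only shows that restarting zlib more often (smaller blocks) tends to cost more bytes on repetitive data.

```python
import zlib

def compress_in_blocks(data: bytes, block_size: int, level: int = 6) -> int:
    """Compress every block as an independent zlib stream; return total bytes."""
    total = 0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        # A fresh stream per block: no history is shared between blocks.
        total += len(zlib.compress(block, level))
    return total

# Synthetic, highly repetitive "log" data (hypothetical, not the data from this issue).
lines = [
    f"2014-12-07 17:{i % 60:02d}:{(i * 7) % 60:02d} INFO request id={i} status=200 path=/index.html\n"
    for i in range(20000)
]
data = "".join(lines).encode("ascii")

current_block = 32 * 1024 - 8 * 1024   # 24576: 32KB window minus 8KB grace
larger_block = 60 * 1024               # 61440: a larger block size for comparison

small = compress_in_blocks(data, current_block)
large = compress_in_blocks(data, larger_block)
print(f"raw={len(data)} 24KB-blocks={small} 60KB-blocks={large}")
```

The absolute numbers depend entirely on the input; only the direction of the difference is the point here.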

I ran some experiments with highly compressible logs data:

method	indexing time (ms)	merging time (ms)	fdt (bytes)	fdx (bytes)
BEST_SPEED	101,729	15,638	372,845,282	406,964
BEST_COMPRESSION	114,364	23,474	269,387,347	275,909
patch (60KB)	105,533	18,914	237,284,342	117,639
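As a quick reading aid, the relative differences between the patch and the existing BEST_COMPRESSION can be worked out from the table above. This is plain arithmetic on the reported numbers; the helper name relative_change is hypothetical.

```python
# Indexing times (ms) and .fdt sizes (bytes), copied from the table above.
results = {
    "BEST_SPEED": {"index_ms": 101_729, "fdt": 372_845_282},
    "BEST_COMPRESSION": {"index_ms": 114_364, "fdt": 269_387_347},
    "patch (60KB)": {"index_ms": 105_533, "fdt": 237_284_342},
}

def relative_change(new: float, old: float) -> float:
    """Signed fractional change from old to new (negative means smaller)."""
    return (new - old) / old

base = results["BEST_COMPRESSION"]
patch = results["patch (60KB)"]

fdt_change = relative_change(patch["fdt"], base["fdt"])
index_change = relative_change(patch["index_ms"], base["index_ms"])
print(f"fdt size: {fdt_change:+.1%}, indexing time: {index_change:+.1%}")
```

On these numbers the patched format is roughly 12% smaller than the current BEST_COMPRESSION while indexing about 8% faster.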

The other experiments I ran were:

method	indexing time (ms)	merging time (ms)	fdt (bytes)	fdx (bytes)
crappy preset	130,854	38,095	234,603,971	274,500
64KB	107,256	21,570	236,004,297	111,135
crappy preset+64KB	121,503	30,030	222,422,924	110,751

For 'crappy preset' I just use the arbitrary first 32KB of the original data as a preset dictionary for every block. This is effective, but slow because of some unnecessary overhead (like recomputing adler32 of the preset dictionary for each block). However, this overhead shrinks with larger block sizes and the approach still offers benefits, so maybe we can do it in the future (especially if, e.g., it's per-chunk and we can bulk merge chunks without recompressing).

For 64KB, we measure removing the "grace" completely, so it spills to another block each time. The proposed smaller "grace" amount still offers CPU savings, so I think we should keep it. But it's not terrible if you go over.
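One way to picture the "grace" is with a simplified model of chunk flushing. The sketch below assumes (this is not stated in the issue) that the writer buffers whole documents and flushes once the buffered bytes reach a threshold, so each chunk overshoots the threshold by part of the last document. It also assumes the budget being protected is 64KB and that the 60KB setting therefore leaves a 4KB grace. The function name and the uniform 1000-byte documents are hypothetical.

```python
def chunk_sizes(doc_sizes: list[int], flush_threshold: int) -> list[int]:
    """Buffer documents; flush a chunk once buffered bytes reach the threshold."""
    chunks, buffered = [], 0
    for size in doc_sizes:
        buffered += size
        if buffered >= flush_threshold:
            chunks.append(buffered)  # the chunk includes the document that crossed the threshold
            buffered = 0
    if buffered:
        chunks.append(buffered)      # final partial chunk
    return chunks

BUDGET = 64 * 1024        # assumed size we would like chunks to stay under
docs = [1000] * 1000      # hypothetical uniform 1000-byte documents

with_grace = chunk_sizes(docs, 60 * 1024)  # threshold leaves 4KB of headroom
no_grace = chunk_sizes(docs, 64 * 1024)    # threshold equals the budget

over_with_grace = sum(1 for c in with_grace if c > BUDGET)
over_no_grace = sum(1 for c in no_grace if c > BUDGET)
print(f"chunks over budget: with grace={over_with_grace}, without grace={over_no_grace}")
```

In this model, every full chunk exceeds the budget once the grace is removed, while a grace larger than the typical document keeps chunks under it.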

Attachments

LUCENE-6100.patch
07/Dec/14 17:37
2 kB
Robert Muir

People

Assignee: Unassigned

Reporter: Robert Muir

Votes: 0

Watchers: 4

Dates

Created: 07/Dec/14 17:36

Updated: 28/Aug/22 14:20

Resolved: 10/Dec/14 01:36