Lucene - Core / LUCENE-4509

Make CompressingStoredFieldsFormat the new default StoredFieldsFormat impl

    Details

    • Type: Wish
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1
    • Component/s: core/store
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      What would you think of making CompressingStoredFieldsFormat the new default StoredFieldsFormat?

      Stored fields compression has many benefits:

      • it makes the I/O cache work for us,
      • file-based index replication/backup becomes cheaper.

      Things to know:

      • even with incompressible data, there is less than 0.5% overhead with LZ4,
      • LZ4 compression requires ~ 16kB of memory and LZ4 HC compression requires ~ 256kB,
      • LZ4 uncompression has almost no memory overhead,
      • on my low-end laptop, the LZ4 impl in Lucene uncompresses at ~ 300 MB/s.

      I think we could use the same default parameters as in CompressingCodec:

      • LZ4 compression,
      • in-memory stored fields index that is very memory-efficient (less than 12 bytes per block of compressed docs) and uses binary search to locate documents in the fields data file,
      • 16 kB blocks (small enough so that there is no major slowdown when the whole index would fit into the I/O cache anyway, and large enough to provide interesting compression ratios; for example, Robert got a 0.35 compression ratio with the geonames.org database).

      Any concerns?
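      For illustration, here is a minimal sketch of what opting in to these parameters could look like by wrapping the default codec. It assumes the FilterCodec(name, delegate) constructor and a CompressingStoredFieldsFormat constructor taking (format name, CompressionMode, chunk size), which matches the post-4.1 API; the class was still moving while this issue was open, so check the javadoc of your Lucene version.

      import org.apache.lucene.codecs.Codec;
      import org.apache.lucene.codecs.FilterCodec;
      import org.apache.lucene.codecs.StoredFieldsFormat;
      import org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat;
      import org.apache.lucene.codecs.compressing.CompressionMode;

      // Sketch only: a codec that keeps all defaults except stored fields,
      // which use LZ4 (CompressionMode.FAST) and 16 kB chunks.
      public class CompressedStoredFieldsCodec extends FilterCodec {
        private final StoredFieldsFormat storedFields =
            new CompressingStoredFieldsFormat("SketchCompressedStoredFields",
                CompressionMode.FAST, 1 << 14);

        public CompressedStoredFieldsCodec() {
          super("CompressedStoredFieldsCodec", Codec.getDefault());
        }

        @Override
        public StoredFieldsFormat storedFieldsFormat() {
          return storedFields;
        }
      }

      Such a codec would be set on an IndexWriterConfig via setCodec; once this issue is resolved, the default Lucene41Codec provides the same behaviour out of the box.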

      Attachments

      1. LUCENE-4509.patch
        12 kB
        Adrien Grand
      2. LUCENE-4509.patch
        12 kB
        Adrien Grand

        Activity

        Robert Muir added a comment -

        I am a strong +1 for this idea.

        I only have one concern, about the defaults. How would this work with laaaarge documents (e.g. those massive Hathitrust book-documents) that might be > 16KB in size?

        Does this mean with the default CompressingStoredFieldsIndex setting that now he pays 12 bytes/doc in RAM (because docsize > blocksize)?
        If so, let's think of ways to optimize that case.

        Yonik Seeley added a comment -

        Nice timing Adrien... I was just going to ask how we could enable this easiest in Solr (or if it should in fact be the default).

        One data point: 100GB of compressed stored fields == 6.25M index entries == 75MB RAM
        That seems decent for a default.
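        For reference, the arithmetic behind that data point, assuming 16 kB chunks and the ~12 bytes per index entry upper bound discussed above:

        100 GB / 16 kB per chunk == 6.25M chunks (index entries)
        6.25M entries * 12 bytes/entry == 75 MB of RAM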

        Robert Muir added a comment -

        I think it's OK too. I just didn't know if we could do something trivial like store the offsets-within-the-blocks as packed ints,
        so that it optimizes for this case anyway (offset=0) and only takes 8 bytes + 1 bit instead of 12 bytes.

        But I don't have a real understanding of what this thing does when docsize > blocksize; I haven't dug in that much.

        In any case I think it should be the default: it's fast and works also for tiny documents with lots of fields.
        I think people expect the index to be compressed in some way, and the stored fields are really wasteful today.

        Robert Muir added a comment -

        I'd say that to make progress on the default we want to look at the following:

        • make a concrete impl of CompressingStoredFieldsFormat called Lucene41, hardwired to the defaults, and add file format docs?
          This way, we don't have to support all of the Compression options/layouts in the default codec (if someone wants that,
          encourage them to make their own codec with the Compressed settings they like). Back compat is much
          less costly as the parameters are fixed, and file format docs are easier.
        • should we s/uncompression/decompression/ across the board?
        • tests already look pretty good. I can try to work on some additional ones to try to break it like we did with BlockPF.
        • there is some scary stuff (literal decompressions etc.) uncovered by the clover report:
          https://builds.apache.org/job/Lucene-Solr-Clover-4.x/49/clover-report/org/apache/lucene/codecs/compressing/CompressionMode.html
          We should make sure any special cases are tested.
        Adrien Grand added a comment -

        How would this work with laaaarge documents that might be > 16KB in size?

        Actually, 16 kB is the minimum size of an uncompressed chunk of documents. CompressingStoredFieldsWriter fills a buffer with documents until its size is >= 16 kB, compresses it, and then flushes it to disk. If all documents are larger than 16 kB, then all chunks will contain exactly one document.

        It also means you could end up having a chunk that is made of 15 documents of 1 kB and 1 document of 256 kB. (And in this case there is no performance problem for the first 15 documents, given that uncompression stops as soon as enough data has been uncompressed.)

        Does this mean with the default CompressingStoredFieldsIndex setting that now he pays 12 bytes/doc in RAM (because docsize > blocksize)? If so, let's think of ways to optimize that case.

        Probably less than 12. The default CompressingStoredFieldsIndex impl uses two packed ints arrays of size numChunks (the number of chunks, <= numDocs). The first array stores the doc ID of the first document of the chunk while the second array stores the start offset of the chunk of documents in the fields data file.

        So if your fields data file is fdtBytes bytes, the actual memory usage is ~ numChunks * (ceil(log2(numDocs)) + ceil(log2(fdtBytes))) / 8.

        For example, if there are 10M documents of 16kB (fdtBytes ~= 160GB), we'll have numChunks == numDocs and a memory usage per document of (24 + 38) / 8 = 7.75 => ~ 77.5 MB of memory overall.
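        To make the estimate concrete, here is a small stand-alone sketch (the method name and signature are illustrative, not Lucene API):

        // memory ~= numChunks * (ceil(log2(numDocs)) + ceil(log2(fdtBytes))) / 8
        static long estimatedFieldsIndexBytes(long numChunks, long numDocs, long fdtBytes) {
          int docIdBits  = 64 - Long.numberOfLeadingZeros(numDocs - 1);  // ceil(log2(numDocs))
          int offsetBits = 64 - Long.numberOfLeadingZeros(fdtBytes - 1); // ceil(log2(fdtBytes))
          return numChunks * (docIdBits + offsetBits) / 8;
        }

        // e.g. 10M docs of 16 kB each (fdtBytes ~= 160 GB):
        // estimatedFieldsIndexBytes(10000000L, 10000000L, 160L << 30) ~= 77,500,000 bytes (~77.5 MB),
        // matching the figure above.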

        100GB of compressed stored fields == 6.25M index entries == 75MB RAM

        Thanks for the figures, Yonik! Did you use RamUsageEstimator to compute the amount of used memory?

        Adrien Grand added a comment -

        But if we worry about this worst-case (numDocs == numChunks), maybe we should just increase the chunk size (for example, ElasticSearch uses 65 kB by default).

        (Another option would be to change the compress+flush trigger to something like: chunk size >= 16 kB AND number of documents in the chunk >= 4.)
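        A sketch of what that alternative trigger could look like (the field names are illustrative, not the actual CompressingStoredFieldsWriter members):

        private boolean triggerFlush() {
          return bufferedBytes >= 16 * 1024  // chunk size reached...
              && numBufferedDocs >= 4;       // ...and at least 4 docs buffered, so huge docs share an index entry
        }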

        Robert Muir added a comment -

        Well, you say you use a separate packed ints structure for the offsets, right? So these would all be zero?

        Adrien Grand added a comment -

        should we s/uncompression/decompression/ across the board?

        If decompression sounds better, let's do this!

        there is some scary stuff (literal decompressions etc.) uncovered by the clover report. We should make sure any special cases are tested.

        I can work on it next week.

        Adrien Grand added a comment -

        Well, you say you use a separate packed ints structure for the offsets, right? So these would all be zero?

        These are absolute offsets in the fields data file. For example, when looking up a document, it first performs a binary search in the first array (the one that contains the first document IDs of every chunk). The resulting index is used to find the start offset of the chunk of compressed documents thanks to the second array. When you read data starting at this offset in the fields data file, there is first a packed ints array that stores the uncompressed length of every document in the chunk, and then the compressed data. I'll add file formats docs soon...
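        A minimal sketch of that lookup, using plain arrays in place of the packed ints structures (illustrative only, not the actual reader code):

        static long chunkStartPointer(int docID, int[] docBases, long[] startPointers) {
          // binary search for the last chunk whose first doc ID is <= docID
          int lo = 0, hi = docBases.length - 1;
          while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (docBases[mid] <= docID) {
              lo = mid + 1;
            } else {
              hi = mid - 1;
            }
          }
          return startPointers[hi]; // file pointer of the compressed chunk in the fields data file
        }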

        Robert Muir added a comment -

        No, I'm referring to the second packed ints structure (the start offset within a block).

        Adrien Grand added a comment -

        Committed:

        • trunk r1404215
        • branch 4.x r1404216
        Robert Muir added a comment -

        I think Adrien accidentally resolved the wrong issue.

        Adrien Grand added a comment -

        Here is a patch that adds a new Lucene41StoredFieldsFormat class with file format docs.

        Adrien Grand added a comment -

        I forgot to say: oal.codecs.compressing needs to be moved to lucene-core before applying this patch.

        Robert Muir added a comment -

        Do we know what's happening with the recent test failure?

        ant test -Dtestcase=TestCompressingStoredFieldsFormat -Dtests.method=testBigDocuments -Dtests.seed=37812FE503010D20 -Dtests.multiplier=3 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/hudson/lucene-data/enwiki.random.lines.txt -Dtests.locale=es_PR -Dtests.timezone=America/Sitka
        Adrien Grand added a comment -

        I think I abuse atLeast to generate document sizes, and because the test ran with tests.multipliers=true and tests.nightly=true, documents got too big, hence the OOME. I'll commit a fix shortly.

        Robert Muir added a comment -

        In the fdt we write the docBase of the first document in the chunk: can you explain why this is needed?

        We already write this redundantly in the fdx, right? (Or in the DISK_DOC case it's implicit.)

        It seems to me in visitDocument() we should be getting the docBase and startPointer too from the index,
        since it knows both.

        Robert Muir added a comment -

        Actually I guess we don't know it for DISK_DOC. But it seems unnecessary for MEMORY_CHUNK?

        Adrien Grand added a comment -

        Right, the docBase could be known from the index with MEMORY_CHUNK, but on the other hand duplicating the information helps validate that we are at the right place in the fields data file (there are corruption tests that use this docBase). Given that the chunk starts with a doc base and the number of docs in the chunk, it gives the range of documents it contains. The overhead should be very small given that this VInt is repeated only once per chunk, i.e. once per compressed 16 kB of documents. But I have no strong feeling about it; if you think we should remove it, then let's do it.
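        As an illustration of the kind of check this enables, here is a hypothetical fragment (names are made up, not the actual CompressingStoredFieldsReader code), assuming fieldsData is positioned at the start of a chunk and that both header values are VInts:

        static void checkChunkContains(org.apache.lucene.store.IndexInput fieldsData, int docID)
            throws java.io.IOException {
          int chunkDocBase = fieldsData.readVInt(); // docBase duplicated in the chunk header
          int chunkDocs = fieldsData.readVInt();    // number of docs in the chunk
          if (docID < chunkDocBase || docID >= chunkDocBase + chunkDocs) {
            throw new org.apache.lucene.index.CorruptIndexException(
                "doc " + docID + " is outside chunk range ["
                + chunkDocBase + ", " + (chunkDocBase + chunkDocs) + ")");
          }
        }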

        Robert Muir added a comment -

        I don't feel strongly about it either... I was just reading the docs and noticed the redundancy.

        But you are right: it's just per-chunk anyway. And I like the corruption check...!

        Adrien Grand added a comment -

        Updated file format docs, you need to move lucene/codecs/src/java/org/apache/lucene/codecs/compressing to lucene/core/src/java/org/apache/lucene/codecs in addition to applying the patch.

        Robert Muir added a comment -

        Docs look good, +1 to commit.

        A few suggestions:

        • under known limitations, maybe replace "documents" with "individual documents" to make it clear you are talking about 2-gigabyte documents and not files? I think someone was a little confused about that already.
        • rather than repeating the formulas for signed vlong (zigzag), we could link to it? https://developers.google.com/protocol-buffers/docs/encoding#types
        • separately, if we find ourselves using this more often, maybe we should just add it to DataOutput/Input (the vlong version would be enough). We
          already use this in kuromoji's ConnectionCosts.java too...
        Adrien Grand added a comment -

        Thanks Robert for your comments, I replaced "documents" with "individual documents" and added a link to the protobuf docs.

        Committed:

        • trunk r1408762
        • branch 4.x r1408796
        Commit Tag Bot added a comment -

        [branch_4x commit] Adrien Grand
        http://svn.apache.org/viewvc?view=revision&revision=1416082

        Move oal.codec.compressing tests from lucene/codecs to lucene/core (should have been done as part of LUCENE-4509 when I moved the src folder).

        Commit Tag Bot added a comment -

        [branch_4x commit] Adrien Grand
        http://svn.apache.org/viewvc?view=revision&revision=1408796

        LUCENE-4509: Enable stored fields compression in Lucene41Codec (merged from r1408762).

        Commit Tag Bot added a comment -

        [branch_4x commit] Adrien Grand
        http://svn.apache.org/viewvc?view=revision&revision=1404276

        LUCENE-4509: New tests to try to break CompressingStoredFieldsFormat... (merged from r1404275)

        Commit Tag Bot added a comment -

        [branch_4x commit] Adrien Grand
        http://svn.apache.org/viewvc?view=revision&revision=1403032

        LUCENE-4509: improve test coverage of CompressingStoredFieldsFormat (merged from r1403027).


          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            1
            Watchers:
            3
