What would you think of making CompressingStoredFieldsFormat the new default StoredFieldsFormat?
Stored fields compression has many benefits:
- it makes the I/O cache work for us, since more documents fit into the same amount of cached data,
- file-based index replication/backup becomes cheaper.
Things to know:
- even with incompressible data, there is less than 0.5% overhead with LZ4,
- LZ4 compression requires ~16 kB of memory and LZ4 HC compression requires ~256 kB,
- LZ4 decompression has almost no memory overhead,
- on my low-end laptop, the LZ4 impl in Lucene decompresses at ~300 MB/s.
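To put these figures in perspective, here is a rough back-of-the-envelope calculation. The class and method names are made up for illustration, and the 0.5% and ~300 MB/s inputs are simply the numbers quoted above (taking 1 MB = 10^6 bytes), not new measurements.

```java
class Lz4BackOfEnvelope {
    // Upper bound on the extra bytes written for incompressible input,
    // assuming the "less than 0.5%" figure quoted above (0.5% = 1/200).
    static long worstCaseOverheadBytes(long rawBytes) {
        return rawBytes / 200;
    }

    // Time in microseconds to decompress rawBytes at a given throughput
    // expressed in bytes per second.
    static double decompressMicros(long rawBytes, double bytesPerSecond) {
        return rawBytes / bytesPerSecond * 1_000_000.0;
    }

    public static void main(String[] args) {
        // Worst case for 1 GB of incompressible stored fields.
        System.out.println(worstCaseOverheadBytes(1_000_000_000L) + " bytes of overhead at most");
        // Cost of decompressing a single 16 kB block at ~300 MB/s.
        System.out.println(decompressMicros(16 * 1024, 300e6) + " microseconds per 16 kB block");
    }
}
```

Under these assumptions, 1 GB of incompressible data costs at most about 5 MB extra, and decompressing one 16 kB block takes on the order of tens of microseconds.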
I think we could use the same default parameters as in CompressingCodec:
- LZ4 compression,
- in-memory stored fields index that is very memory-efficient (less than 12 bytes per block of compressed docs) and uses binary search to locate documents in the fields data file,
- 16 kB blocks (small enough that there is no major slowdown when the whole index would fit into the I/O cache anyway, and large enough to provide worthwhile compression ratios; for example Robert got a 0.35 compression ratio with the geonames.org database).
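To illustrate the binary-search lookup mentioned above, here is a minimal sketch of an in-memory index that maps a doc ID to the start of its block in the fields data file. This is not the actual CompressingStoredFieldsFormat index code: the class `BlockIndexSketch`, its fields and its method names are hypothetical. It stores a plain int plus a long per block (exactly 12 bytes), so staying under 12 bytes per block as stated above would presumably require a more compact encoding.

```java
import java.util.Arrays;

class BlockIndexSketch {
    private final int[] firstDocIds;    // first doc ID of each block, ascending
    private final long[] startPointers; // offset of each block in the fields data file

    BlockIndexSketch(int[] firstDocIds, long[] startPointers) {
        if (firstDocIds.length != startPointers.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        this.firstDocIds = firstDocIds;
        this.startPointers = startPointers;
    }

    // Returns the index of the block that contains docId.
    int blockFor(int docId) {
        int idx = Arrays.binarySearch(firstDocIds, docId);
        if (idx < 0) {
            // Not a block start: binarySearch returned -(insertionPoint) - 1,
            // and the containing block is the one just before the insertion point.
            idx = -idx - 2;
        }
        if (idx < 0) {
            throw new IllegalArgumentException("docId precedes the first block: " + docId);
        }
        return idx;
    }

    // Returns the file offset at which to start reading the block for docId.
    long startPointerFor(int docId) {
        return startPointers[blockFor(docId)];
    }

    public static void main(String[] args) {
        // Three made-up blocks starting at doc IDs 0, 120 and 250.
        BlockIndexSketch index = new BlockIndexSketch(
            new int[] {0, 120, 250},
            new long[] {0L, 5800L, 11300L});
        System.out.println(index.startPointerFor(130)); // prints 5800
    }
}
```

Once the block is located, a reader would seek to that offset, decompress the block and skip to the requested document within it; that part is left out of the sketch.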