Ideally we should be able to use a biggish chunk_size during compression to improve the compression ratio, yet decompress only a single document (regardless of chunk_size), just like you proposed here (if I understood it correctly?)
Exactly, this is one of the two proposed options. The only overhead would be that you would need to read the shared dictionary and keep it in memory (but that is a single call to readBytes, and its size can be controlled, so that should be no issue).
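Roughly, the extra work at read time would look like this (a minimal sketch, assuming the dictionary is stored length-prefixed in front of the block; the layout and names are made up, not the actual format used by the patch):

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical layout: [dictLength][dictBytes][compressed documents ...].
// Reading the shared dictionary is a single length-prefixed read, and
// dictLength is bounded at write time so the buffer stays small.
static byte[] readSharedDictionary(DataInputStream in) throws IOException {
  int dictLength = in.readInt();
  byte[] dictionary = new byte[dictLength];
  in.readFully(dictionary); // the one extra read mentioned above
  return dictionary;        // kept in memory and reused for every document in the block
}
```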
Usually such data is highly compressible (think of all those low-cardinality fields, like the color of something...), and even some basic compression works wonders.
Agreed, this is the reason why I'd prefer the "low-overhead" option to be something cheap rather than no compression at all: data usually has lots of patterns and even something as simple as LZ4 manages to reach interesting compression ratios.
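For instance, even a fast/cheap compression level already gets a big win on that kind of repetitive data. Here is a rough illustration using the JDK's Deflater as a stand-in (no extra dependency needed; the same observation holds for LZ4, and the documents are made up for the demo):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// 1000 tiny JSON documents whose "color" field only takes 3 distinct values.
public class LowCardinalityCompression {
  public static void main(String[] args) {
    String[] colors = {"red", "green", "blue"};
    StringBuilder docs = new StringBuilder();
    for (int i = 0; i < 1000; i++) {
      docs.append("{\"id\":").append(i)
          .append(",\"color\":\"").append(colors[i % colors.length]).append("\"}");
    }
    byte[] input = docs.toString().getBytes(StandardCharsets.UTF_8);

    Deflater deflater = new Deflater(Deflater.BEST_SPEED); // cheapest Deflate setting
    deflater.setInput(input);
    deflater.finish();
    byte[] buffer = new byte[input.length];
    int compressedLength = 0;
    while (!deflater.finished()) {
      compressedLength += deflater.deflate(buffer, compressedLength, buffer.length - compressedLength);
    }
    deflater.end();

    System.out.println("raw: " + input.length + " bytes, compressed: " + compressedLength + " bytes");
  }
}
```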
Conclusion: compression is great, and anything that helps tweak the balance between CPU effort and IO effort smoothly across the indexing and retrieval phases broadens Lucene's use-case coverage (e.g. "I can afford more CPU during indexing and less CPU during retrieval", a static coder being the extreme case of this...).
I am not sure I understand exactly if and how this patch is going to help in such cases (how do we achieve reasonable compression if we compress small documents individually? By reusing dictionaries from previous chunks? Static dictionaries?...).
The trade-off that this patch makes is:
- keep indexing fast enough in all cases
- allow trading random access speed to documents for index compression
The current patch provides two options:
- either we compress documents in blocks like today, but with Deflate instead of LZ4: this gives good compression ratios but makes random access quite slow, since you need to decompress a whole block of documents every time you want to access a single document (see the sketch after this list)
- or we still group documents into blocks but compress them individually, using the compressed representation of the previous documents as a dictionary.
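In the first option, random access looks roughly like this (a minimal sketch with made-up names and a made-up per-document offsets array, not the actual code of the patch):

```java
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Option 1: the whole block is a single Deflate stream, so reading one document
// means inflating everything that was compressed along with it.
static byte[] readDocFromDeflateBlock(byte[] compressedBlock, int uncompressedBlockSize,
                                      int[] docOffsets, int docId) throws DataFormatException {
  Inflater inflater = new Inflater();
  inflater.setInput(compressedBlock);
  byte[] block = new byte[uncompressedBlockSize];
  inflater.inflate(block); // decompress the entire block ...
  inflater.end();
  int from = docOffsets[docId];
  int to = docOffsets[docId + 1];
  byte[] doc = new byte[to - from];
  System.arraycopy(block, from, doc, 0, doc.length); // ... just to slice out one document
  return doc;
}
```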
I'll try to explain the 2nd option better: it works well because LZ4 mostly deduplicates sequences of bytes in a stream. So imagine that you have the following 3 documents in a block:
We will first compress document 1. Given that it is the first document in the block, there is no shared dictionary, so the compressed representation looks like this (`literals` means that bytes are copied as-is, and `ref` means a reference to a previous sequence of bytes. This is how LZ4 works: it just replaces repeated sequences of bytes with references to previous occurrences of the same bytes. The more references you have and the longer they are, the better the compression ratio.).
Now we are going to compress document 2. It doesn't contain any repetition of bytes, so if we wanted to compress it individually, we would just have <literals:abcdefghijkl>, which doesn't compress at all (and is even slightly larger due to the overhead of the format). However, we are using the compressed representation of document 1 as a dictionary, and "abcdefgh" exists in its literals, so we can make a reference to it!
And again for document 3, using the literals of document 1 for "abcd" and the literals of document 2 for "ijkl":
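To make the literals/ref mechanism concrete, here is a toy sketch of the shared-dictionary idea. It is not LZ4 and not the code of the patch: it is a naive greedy matcher, the document contents are invented, and for simplicity it references the raw bytes of the previous documents rather than their compressed representation:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of option 2: each document may reference bytes of the documents
// that precede it in the block. NOT real LZ4, just a naive greedy matcher that
// prints <literals:...> and <ref:...> tokens. Document contents are made up.
public class SharedDictionaryToy {

  // Tokenize `doc`, allowing references into `dict` plus the already-processed prefix of `doc`.
  static List<String> compress(String dict, String doc) {
    List<String> tokens = new ArrayList<>();
    StringBuilder literals = new StringBuilder();
    int i = 0;
    while (i < doc.length()) {
      String searchable = dict + doc.substring(0, i); // bytes a reference may point to
      int bestLen = 0, bestOff = -1;
      for (int off = 0; off < searchable.length(); off++) {
        int len = 0;
        while (i + len < doc.length()
            && off + len < searchable.length()
            && searchable.charAt(off + len) == doc.charAt(i + len)) {
          len++;
        }
        if (len > bestLen) { bestLen = len; bestOff = off; }
      }
      if (bestLen >= 4) { // like LZ4, only encode matches of at least 4 bytes
        if (literals.length() > 0) {
          tokens.add("<literals:" + literals + ">");
          literals.setLength(0);
        }
        tokens.add("<ref:offset=" + bestOff + ",len=" + bestLen + ">");
        i += bestLen;
      } else {
        literals.append(doc.charAt(i++));
      }
    }
    if (literals.length() > 0) tokens.add("<literals:" + literals + ">");
    return tokens;
  }

  public static void main(String[] args) {
    // Hypothetical block of 3 tiny documents (document 2 matches the "abcdefghijkl" example).
    String[] docs = {"abcdefgh abcdefgh", "abcdefghijkl", "abcdijkl"};
    String dict = "";
    for (String doc : docs) {
      System.out.println(compress(dict, doc));
      dict += doc; // the bytes of previous documents become the shared dictionary
    }
  }
}
```

With these invented contents it prints a literals+ref stream for document 1, a reference to "abcdefgh" plus literals for "ijkl" for document 2, and two references for document 3, which is the shape of the example above: tiny documents that would not compress at all on their own still shrink once they can reference their neighbours.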