Here is a patch that should improve memory usage for:
- variable-length binary fields
- multi-valued sorted numeric fields
- multi-valued sorted set fields
On the other hand, the BINARY_PREFIX_COMPRESSED format still uses MonotonicBlockPackedReader/Writer.
I wrote the patch by changing Lucene50DocValuesFormat to make it easier to review, but when it's ready I plan to make it a whole new format (with a new Lucene54Codec, etc.).
Compared to before, only per-block metadata is kept in memory; the data itself is written to disk using the DirectWriter/slice APIs. Out of curiosity, I wrote all entries of my /usr/share/dict/words file into a binary dv field to see how the patch compares to trunk:
trunk:
.dvd: 992334 bytes
.dvm: 128 bytes
memory usage: 153124 bytes

patch:
.dvd: 1038100 bytes
.dvm: 165 bytes
memory usage: 232 bytes
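For reference, here is a minimal, self-contained sketch of the DirectWriter/DirectReader round-trip the patch relies on. This is toy code written against the public packed-ints APIs, not an excerpt from the patch:

```java
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.store.RandomAccessInput;
import org.apache.lucene.util.LongValues;
import org.apache.lucene.util.packed.DirectReader;
import org.apache.lucene.util.packed.DirectWriter;

public class DirectPackedRoundTrip {
  public static void main(String[] args) throws Exception {
    long[] values = {3, 1, 4, 1, 5, 9, 2, 6};
    // bitsRequired rounds up to a width DirectWriter supports (here: 4 bits)
    int bitsPerValue = DirectWriter.bitsRequired(9);

    try (Directory dir = new RAMDirectory()) {
      // Write: values go straight to the index output, nothing stays on the heap.
      try (IndexOutput out = dir.createOutput("packed", IOContext.DEFAULT)) {
        DirectWriter writer = DirectWriter.getInstance(out, values.length, bitsPerValue);
        for (long v : values) {
          writer.add(v);
        }
        writer.finish();
      }

      // Read: DirectReader gives random access over a slice of the file.
      try (IndexInput in = dir.openInput("packed", IOContext.DEFAULT)) {
        RandomAccessInput slice = in.randomAccessSlice(0, in.length());
        LongValues reader = DirectReader.getInstance(slice, bitsPerValue);
        System.out.println(reader.get(5)); // prints 9
      }
    }
  }
}
```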
One important thing is that I had to introduce some per-thread memory usage: each thread needs its own array of DirectReader instances (one per block). This is why I raised the block size from 16K to 64K, in order to have fewer blocks. Maybe it would need to be raised even further, but that would also hurt compression a bit.

In the worst case of a segment with 2B documents, there would be 32K blocks of 64K values each, so each thread would need about 1.2MB of memory. In my opinion this is acceptable: apps should query their Lucene indices from a reasonable number of threads, and it would probably still be much better than today, since even a single bit of memory per document (with the current MonotonicBlockPackedReader) would use 256MB.
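To make the per-thread state concrete, here is a hypothetical sketch of a reader that keeps one DirectReader per 64K block. The class name, the BlockMeta layout, and the lazy-initialization strategy are all illustrative assumptions, not the patch's actual code:

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.RandomAccessInput;
import org.apache.lucene.util.LongValues;
import org.apache.lucene.util.packed.DirectReader;

// Hypothetical sketch: each thread owns one instance of this reader
// (IndexInput clones cannot be shared across threads), and each instance
// lazily opens one DirectReader per 64K block. Worst case, a 2B-doc
// segment has 2^31 / 2^16 = 32768 blocks, hence ~32K small reader
// objects (~1.2MB) per thread.
class PerThreadBlockReader extends LongValues {
  static final int BLOCK_SHIFT = 16; // 64K values per block
  static final int BLOCK_MASK = (1 << BLOCK_SHIFT) - 1;

  /** Per-block metadata; this is all that stays on the heap globally. */
  static class BlockMeta {
    final long offset;      // start of the block's packed data in the data file
    final long length;      // slice length, including DirectWriter's padding
    final int bitsPerValue; // packed width for this block
    BlockMeta(long offset, long length, int bitsPerValue) {
      this.offset = offset;
      this.length = length;
      this.bitsPerValue = bitsPerValue;
    }
  }

  private final IndexInput data;      // this thread's clone of the data file
  private final BlockMeta[] blocks;
  private final LongValues[] readers; // lazily filled, one per block

  PerThreadBlockReader(IndexInput data, BlockMeta[] blocks) {
    this.data = data;
    this.blocks = blocks;
    this.readers = new LongValues[blocks.length];
  }

  @Override
  public long get(long index) {
    final int block = (int) (index >>> BLOCK_SHIFT);
    LongValues reader = readers[block];
    if (reader == null) { // first access to this block from this thread
      try {
        final BlockMeta meta = blocks[block];
        final RandomAccessInput slice = data.randomAccessSlice(meta.offset, meta.length);
        reader = readers[block] = DirectReader.getInstance(slice, meta.bitsPerValue);
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
    return reader.get(index & BLOCK_MASK);
  }
}
```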