Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-9877

Explore increasing the allowable exceptions in PForUtil


    • Task
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • 9.0
    • 8.9
    • core/codecs
    • None
    • New


      Piggybacking a little off of the investigation I was doing over in LUCENE-9850 I thought it might also be worth-while exploring the impact of increasing the number of allowable exceptions in PForUtil. The aim of this investigation is to see if we could reduce index size by allowing for more exceptions without significant negative impact to performance.

      PForUtil currently allows for up to 3 exceptions, and it only uses 3 bits to encode the number of exceptions (using the remaining 3 bits of the byte used to also encode the number of bits-per-value, which requires 5 bits). Each exception used is encoded with a two full bytes, using a maximum of 6 bytes per block.

      It seems to me like 7 might be a more ideal number of exceptions if index size is the driving motivation. My thought process is that, in the worst-case, 7 exceptions would be used to save only a single bit-per-value in the corresponding block. With 128 entries per block, this would save 16 bytes. So with 14 bytes used to encode the exception values (7 x 2 bytes per exception), we would save a two bytes in total (just slightly better than breaking even). If we need fewer than the 7 exceptions, or if we're able to save more than 1 bit-per-value, it's all additional savings. I suppose the question is what kind of performance hit we might observe due to decoding more exceptions.

      Also note that 7 exceptions is the max we can encode with the 3 bits we currently have available for the number of exceptions. So moving to 8 exceptions would not only take 16 bytes to encode the exceptions (if using all of them), but we'd need one more byte per block to encode the exception count. So in the worst case of using all 8 exceptions to save 1 bit per value, we'd actually be worse off.

      I'll post some results here for discussion or at least for public record of my work for future reference.




            Unassigned Unassigned
            gsmiller Greg Miller
            0 Vote for this issue
            6 Start watching this issue



              Time Tracking

                Original Estimate - Not Specified
                Not Specified
                Remaining Estimate - 0h
                Time Spent - 1h