Lucene - Core / LUCENE-7485

Better storage for `docsWithField` in Lucene70NormsFormat

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently Lucene70NormsFormat uses a bit set to store documents that have a norm, and counts one bits using Long.bitCount in order to know the index of the current document in the set of docs that have a norm value.

      I think this works fairly well when a field is moderately sparse (somewhere between 5% and 99% of documents have a norm), but it still has issues: advancing by large deltas is slow, since every word in between must be visited to accumulate the number of one bits and recover the index of a document, and it handles the case where very few bits are set poorly.
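The index computation described above can be sketched as follows. This is a minimal in-memory illustration, not the actual Lucene70NormsFormat code; the class `BitSetRank` and its methods are hypothetical names. It shows why the cost of finding a document's index grows with the number of words that precede it.

```java
/** Hypothetical sketch of a bit set with rank lookup; not the Lucene implementation. */
final class BitSetRank {
    private final long[] words;

    BitSetRank(int maxDoc) {
        // One bit per document, 64 documents per long word.
        this.words = new long[(maxDoc + 63) >>> 6];
    }

    /** Marks doc as having a value. */
    void set(int doc) {
        words[doc >>> 6] |= 1L << (doc & 63);
    }

    /** Returns true if doc has a value. */
    boolean get(int doc) {
        return (words[doc >>> 6] & (1L << (doc & 63))) != 0;
    }

    /**
     * Returns the number of documents with a value that come strictly before doc.
     * When doc itself has a value, this is its index among the docs with a value.
     * Every preceding word is visited, so the cost is linear in doc / 64.
     */
    int rank(int doc) {
        int count = 0;
        int wordIndex = doc >>> 6;
        for (int i = 0; i < wordIndex; i++) {
            count += Long.bitCount(words[i]);
        }
        // Only count the bits below doc inside its own word.
        long mask = (1L << (doc & 63)) - 1;
        return count + Long.bitCount(words[wordIndex] & mask);
    }
}
```

An iterator that moves forward sequentially can keep a running count instead of rescanning from the start, but a jump over many words still has to apply `Long.bitCount` to each skipped word.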

      I have been working on a disk-based adaptation of RoaringDocIdSet that would still give the ability to know the index of the current document. It seems to be only a bit slower than the current implementation on moderately sparse fields. However, it also comes with benefits:

      • it is faster in the sparse case, where it uses the sparse encoding that stores doc IDs as shorts (when the density is 6% or less)
      • it has a faster advance() by large deltas (still linear, but cheaper by a factor of 65536, which should always be fine in practice since doc IDs are bounded by 2B)
      • it uses O(numDocsWithField) storage rather than O(maxDoc); the worst case is 6 bytes per document that has the field, which occurs when each range of 65k docs contains exactly one document
      • it is faster when some ranges of documents that share the same 16 upper bits are full. This is useful, e.g., when a single document in the whole index is missing a field, or for use cases that store multiple types of documents (with different fields) within a single index and use index sorting to put documents of the same type together
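To make the storage argument concrete, here is a rough size estimate for a block layout of this kind. It is a sketch under stated assumptions, not the format from the attached patch: the class `RoaringLayoutSketch`, the 4-byte per-block header, and the 4096-document threshold for the sparse encoding (about 6% of a 65536-document block) are all assumptions chosen to agree with the figures quoted above.

```java
/** Hypothetical size estimate for a roaring-style block layout; not the patch's format. */
final class RoaringLayoutSketch {
    static final int BLOCK_SIZE = 1 << 16;  // docs that share the same 16 upper bits
    static final int MAX_SPARSE = 1 << 12;  // assumed threshold, roughly 6% density
    static final int HEADER_BYTES = 4;      // assumed per-block header size

    enum Encoding { SPARSE, DENSE, ALL }

    /** Picks an encoding from the number of docs with a value in one block. */
    static Encoding chooseEncoding(int cardinality) {
        if (cardinality == BLOCK_SIZE) {
            return Encoding.ALL;     // full block: no payload needed
        } else if (cardinality <= MAX_SPARSE) {
            return Encoding.SPARSE;  // low 16 bits of each doc ID, stored as shorts
        } else {
            return Encoding.DENSE;   // bit set covering the whole block
        }
    }

    /** Payload size in bytes for one block under the assumed encodings. */
    static long payloadBytes(int cardinality) {
        switch (chooseEncoding(cardinality)) {
            case SPARSE:
                return 2L * cardinality;
            case DENSE:
                return BLOCK_SIZE / 8;
            default:
                return 0L;
        }
    }

    /** Estimated total size for sorted doc IDs; empty blocks cost nothing. */
    static long estimateBytes(int[] sortedDocs) {
        long total = 0;
        int i = 0;
        while (i < sortedDocs.length) {
            int block = sortedDocs[i] >>> 16;
            int start = i;
            while (i < sortedDocs.length && (sortedDocs[i] >>> 16) == block) {
                i++;
            }
            total += HEADER_BYTES + payloadBytes(i - start);
        }
        return total;
    }
}
```

Under these assumptions, one document per block costs 4 header bytes plus 2 payload bytes, which matches the 6-byte worst case, and a full block costs only its header.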

        Attachments

        1. LUCENE-7485.patch
          28 kB
          Adrien Grand
        2. LUCENE-7485.patch
          29 kB
          Adrien Grand

            People

            • Assignee:
              jpountz Adrien Grand
              Reporter:
              jpountz Adrien Grand