  1. Lucene - Core
  2. LUCENE-4498

pulse docfreq=1 DOCS_ONLY for 4.1 codec

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 6.0
    • Component/s: core/codecs
    • Labels: None
    • Lucene Fields: New

    Description

      We have the pulsing codec, but currently it has some downsides:

      • it's very general, wrapping an arbitrary PostingsFormat and pulsing everything in the postings for an arbitrary docFreq/totalTermFreq cutoff
      • reuse is hairy: because it specializes its enums based on these cutoffs, when walking through terms (e.g. during merging) there is a lot of sophisticated logic to avoid the worst cases, where we would clone IndexInputs for tons of terms.

      On the other hand, the way the 4.1 codec encodes "primary key" fields is pretty silly: we write the docStartFP vlong in the term dictionary metadata, which only tells us where to seek in the .doc file to read our one lonely vint.

      I think it's worth investigating whether, in the DOCS_ONLY docfreq=1 case, we can just write the lone doc delta where we would otherwise write docStartFP.
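
      To make the proposal concrete, here is a minimal sketch of the term metadata encoding, assuming a simplified model. The class SingletonTermMetadata and its methods are hypothetical names for illustration, not the actual 4.1 postings writer, and the vlong helpers only mimic Lucene's variable-length encoding. Since the term has a single document, its "doc delta" is just the doc ID itself.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

/** Hypothetical sketch of the proposed term metadata layout; not Lucene's real writer. */
final class SingletonTermMetadata {

    /** Decoded metadata: exactly one of the two fields is meaningful, the other is -1. */
    static final class Metadata {
        final long docStartFP;  // file pointer into .doc, or -1 when the doc is inlined
        final int singletonDoc; // the lone doc ID, or -1 when not inlined

        Metadata(long docStartFP, int singletonDoc) {
            this.docStartFP = docStartFP;
            this.singletonDoc = singletonDoc;
        }
    }

    /** Variable-length encoding of a non-negative long: 7 bits per byte, high bit = "more". */
    static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    static long readVLong(ByteBuffer in) {
        long v = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return v;
    }

    /** Writes one vlong per term: the lone doc ID if DOCS_ONLY and docFreq == 1, else docStartFP. */
    static byte[] encode(int docFreq, boolean docsOnly, long docStartFP, int firstDoc) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (docsOnly && docFreq == 1) {
            writeVLong(out, firstDoc);   // inlined: nothing needs to be read from .doc
        } else {
            writeVLong(out, docStartFP); // usual case: pointer into the .doc file
        }
        return out.toByteArray();
    }

    /** The reader applies the same condition, so no extra flag is stored. */
    static Metadata decode(byte[] bytes, int docFreq, boolean docsOnly) {
        long v = readVLong(ByteBuffer.wrap(bytes));
        if (docsOnly && docFreq == 1) {
            return new Metadata(-1, (int) v);
        }
        return new Metadata(v, -1);
    }
}
```

      In this sketch the slot is the same size or smaller than before, and the reader can tell the two cases apart from docFreq and the field's index options alone.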

      We can avoid the hairy reuse problem too, by just supporting this in refillDocs() in BlockDocsEnum instead of specializing.
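
      A rough sketch of what "supporting this in refillDocs()" could look like, under simplifying assumptions: SketchDocsEnum is a hypothetical stand-in for BlockDocsEnum, an int[] of doc deltas stands in for the .doc file, and fileReads is an illustrative counter. The point is that one enum class serves both cases, so the same instance can be reused across terms.

```java
/** Hypothetical stand-in for BlockDocsEnum; an int[] of deltas plays the role of the .doc file. */
final class SketchDocsEnum {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    static final int BLOCK_SIZE = 128;

    private final int[] docDeltaBuffer = new int[BLOCK_SIZE];
    private int[] docDeltas;   // simulated .doc contents for the current term (unused if inlined)
    private int docFreq;
    private int singletonDoc;  // lone doc ID taken from term metadata, or -1 if not inlined
    private int docUpto;       // number of docs returned so far
    private int bufferUpto;    // next position in docDeltaBuffer
    private int bufferLen;     // number of valid entries in docDeltaBuffer
    private int filePos;       // read position in the simulated .doc file
    private int accum;         // running sum of deltas = current doc ID
    int fileReads;             // how many times the simulated .doc file was read

    /** Re-targets this same instance at a new term; no specialized subclass is needed. */
    SketchDocsEnum reset(int docFreq, int singletonDoc, int[] docDeltas) {
        this.docFreq = docFreq;
        this.singletonDoc = singletonDoc;
        this.docDeltas = docDeltas;
        docUpto = bufferUpto = bufferLen = filePos = accum = fileReads = 0;
        return this;
    }

    private void refillDocs() {
        if (docFreq == 1 && singletonDoc != -1) {
            // Inlined case: fill the buffer from term metadata, no seek and no read.
            docDeltaBuffer[0] = singletonDoc;
            bufferLen = 1;
        } else {
            // Usual case: read the next block of deltas from the .doc file.
            bufferLen = Math.min(docFreq - docUpto, BLOCK_SIZE);
            System.arraycopy(docDeltas, filePos, docDeltaBuffer, 0, bufferLen);
            filePos += bufferLen;
            fileReads++;
        }
        bufferUpto = 0;
    }

    int nextDoc() {
        if (docUpto == docFreq) {
            return NO_MORE_DOCS;
        }
        if (bufferUpto == bufferLen) {
            refillDocs();
        }
        accum += docDeltaBuffer[bufferUpto++];
        docUpto++;
        return accum;
    }
}
```

      Because the special case lives only in the refill step, the iteration logic in nextDoc() is unchanged, which is the property that avoids per-cutoff enum specialization in this sketch.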

      This would remove the additional seek for "primary key" fields without really any of the downsides of pulsing today.

      Attachments

        1. LUCENE-4498.patch
          9 kB
          Robert Muir
        2. LUCENE-4498.patch
          10 kB
          Robert Muir
        3. LUCENE-4498_lazy.patch
          12 kB
          Robert Muir
        4. LUCENE-4498.patch
          12 kB
          Robert Muir
        5. LUCENE-4498.patch
          13 kB
          Robert Muir

          People

            Assignee: Unassigned
            Reporter: Robert Muir (rcmuir)