Lucene - Core
  1. Lucene - Core
  2. LUCENE-2929

all postings enums must explicitly declare what they need up-front.

    Details

    • Type: Task Task
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: None
    • Labels:
    • Lucene Fields:
      New

      Description

      Currently, the DocsEnum api assumes you might consumes freqs at any time.
      Additionally the DocsAndPositionsEnum api assumes you might consume a payload at any time.

      High level things such as queries know what kinds of data they need from the index up-front,
      and the current APIs are limiting to codecs (other than Standard, which has these intertwined).

      So, we either need DocsAndFreqsEnum, DocsPositionsAndPayloadsEnum, or at least booleans
      in the methods that create these to specify whether you want freqs or payloads.

      we did this for freqs in the bulkpostings API, which is good, but these DocsEnum apis
      are also new in 4.0 and there's no reason to introduce non-performant APIs.

      additionally when/if we add payloads to the bulkpostings API, we should make sure we keep
      the same trend and require you to specify you want payloads or not up-front.

      1. LUCENE-2929.patch
        164 kB
        Michael McCandless
      2. LUCENE-2929.patch
        162 kB
        Michael McCandless
      3. LUCENE-2929.patch
        150 kB
        Michael McCandless

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          And: if you ask for docs and freqs but the field omitTFAP, we should return
          an error in some fashion similar to what the bulk API does.

          I think we do this now if you ask for a DocsAndPositionsEnum when the field omitTFAP,
          but we continue to lie and return TF=1 freqs in this case, which is not true,
          there is simply no freq data at all.

          Show
          Robert Muir added a comment - And: if you ask for docs and freqs but the field omitTFAP, we should return an error in some fashion similar to what the bulk API does. I think we do this now if you ask for a DocsAndPositionsEnum when the field omitTFAP, but we continue to lie and return TF=1 freqs in this case, which is not true, there is simply no freq data at all.
          Hide
          Michael McCandless added a comment -

          Patch, splitting out DocsAndFreqsEnum from DocsEnum, and fixing all
          places to pull the actual enum they need.

          I also added helper methods in _TestUtil to pull Docs/AndFreqsEnum,
          randomly sometimes "upgrading" the enum pulled (eg, if you wanted
          DocsEnum you might get a DocsAndFreqsEnum instead).

          The codec is allowed to return null from TermsEnum.docsAndFreqs, which
          means term freqs were not indexed for that field (ie IndexOptions.DOCS_ONLY).

          Just like the bulk branch, I broke out a MatchOnlyTermScorer, and also
          MatchOnlyConjunctionTermScorer.

          Still have to make "and payloads" strongly typed as well...

          Show
          Michael McCandless added a comment - Patch, splitting out DocsAndFreqsEnum from DocsEnum, and fixing all places to pull the actual enum they need. I also added helper methods in _TestUtil to pull Docs/AndFreqsEnum, randomly sometimes "upgrading" the enum pulled (eg, if you wanted DocsEnum you might get a DocsAndFreqsEnum instead). The codec is allowed to return null from TermsEnum.docsAndFreqs, which means term freqs were not indexed for that field (ie IndexOptions.DOCS_ONLY). Just like the bulk branch, I broke out a MatchOnlyTermScorer, and also MatchOnlyConjunctionTermScorer. Still have to make "and payloads" strongly typed as well...
          Hide
          Michael McCandless added a comment -

          New patch; I think it's ready.

          Instead of making a new class for every combination of attrs, I think
          we should have "class per dimension", ie DocsEnum is used if you only
          iterate over docs, no matter which attrs you pull. So now you pass in
          an up front boolean needsFreqs to TermsEnum.docs.

          Show
          Michael McCandless added a comment - New patch; I think it's ready. Instead of making a new class for every combination of attrs, I think we should have "class per dimension", ie DocsEnum is used if you only iterate over docs, no matter which attrs you pull. So now you pass in an up front boolean needsFreqs to TermsEnum.docs.
          Hide
          Michael McCandless added a comment -

          Another patch... last one had "svn rm'd" DocsEnum for some reason...

          Show
          Michael McCandless added a comment - Another patch... last one had "svn rm'd" DocsEnum for some reason...
          Hide
          Robert Muir added a comment -

          Instead of making a new class for every combination of attrs, I think
          we should have "class per dimension"

          +1

          Show
          Robert Muir added a comment - Instead of making a new class for every combination of attrs, I think we should have "class per dimension" +1
          Hide
          Michael McCandless added a comment -

          Still need "requiresPayloads" boolean.

          Show
          Michael McCandless added a comment - Still need "requiresPayloads" boolean.
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Hide
          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Show
          Uwe Schindler added a comment - Move issue to Lucene 4.9.

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:

                Development