Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I think should explore making start/end offsets a first-class attr in the
      postings APIs, and fixing the indexer to index them into postings.

      This will make term vector access cleaner (we now have to jump through
      hoops w/ non-first-class offset attr). It can also enable efficient
      highlighting without term vectors / reanalyzing, if the app indexes
      offsets into the postings.

      1. LUCENE-3684.patch
        182 kB
        Michael McCandless
      2. LUCENE-3684.patch
        194 kB
        Michael McCandless
      3. LUCENE-3684.patch
        211 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Initial patch:

        • Adds needsPositions/needsOffsets to TermsEnum.d&pEnum().
        • Cuts over all term vectors impls/usages to the new API.
        • Fixed one codec (SimpleText!) to be able to store offsets in the
          postings; added initial test to cover this.

        It's an early patch... I still need to fix the other codecs to index
        offsets.

        Show
        Michael McCandless added a comment - Initial patch: Adds needsPositions/needsOffsets to TermsEnum.d&pEnum(). Cuts over all term vectors impls/usages to the new API. Fixed one codec (SimpleText!) to be able to store offsets in the postings; added initial test to cover this. It's an early patch... I still need to fix the other codecs to index offsets.
        Hide
        Simon Willnauer added a comment -

        +1 this is awesome. seems we really need PositionIterators too

        Show
        Simon Willnauer added a comment - +1 this is awesome. seems we really need PositionIterators too
        Hide
        Michael McCandless added a comment -

        New patch; I changed it so you only pass "needsOffets" bool when pulling D&PEnum. I think it's ready!

        Show
        Michael McCandless added a comment - New patch; I changed it so you only pass "needsOffets" bool when pulling D&PEnum. I think it's ready!
        Hide
        Robert Muir added a comment -

        I agree, this patch looks good. I think we just need two assertEquals() in TestDuelingCodecs for the future.

        I agree about checkIndex, we shouldnt check anything about the offsets. The user could store something crazy
        in the offsets if they want.

        For checkindex, long term i think we should really consider adding a (slow, not by default) option to verify
        the term vectors against the postings. we could at least turn it on in tests..., but thats another separate issue.

        Show
        Robert Muir added a comment - I agree, this patch looks good. I think we just need two assertEquals() in TestDuelingCodecs for the future. I agree about checkIndex, we shouldnt check anything about the offsets. The user could store something crazy in the offsets if they want. For checkindex, long term i think we should really consider adding a (slow, not by default) option to verify the term vectors against the postings. we could at least turn it on in tests..., but thats another separate issue.
        Hide
        Robert Muir added a comment -

        Actually: just looking at FieldInfos i think i found another annoyance, we have this assertion everywhere (not just that class either):

         assert this.indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS || !this.storePayloads;
        

        But i think this is wrong, we must use compareTo >= 0?

        Show
        Robert Muir added a comment - Actually: just looking at FieldInfos i think i found another annoyance, we have this assertion everywhere (not just that class either): assert this.indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS || !this.storePayloads; But i think this is wrong, we must use compareTo >= 0?
        Hide
        Simon Willnauer added a comment -

        do we really need

        static final int FORMAT_CURRENT = FORMAT_OFFSETS_IN_POSTINGS;

        I mean the FORMAT_FLEX was introduced with flexible indexing and since we didn't release since then I think its confusing. We already changed this format since flex (ie. DocValues) without adding a new format. In LUCENE-3687 I separated the 3x and 4.0 format and removed all the previous formats from 4.0 I think this code should start with a min3xFormat-1.

        Show
        Simon Willnauer added a comment - do we really need static final int FORMAT_CURRENT = FORMAT_OFFSETS_IN_POSTINGS; I mean the FORMAT_FLEX was introduced with flexible indexing and since we didn't release since then I think its confusing. We already changed this format since flex (ie. DocValues) without adding a new format. In LUCENE-3687 I separated the 3x and 4.0 format and removed all the previous formats from 4.0 I think this code should start with a min3xFormat-1.
        Hide
        Michael McCandless added a comment -

        I think we just need two assertEquals() in TestDuelingCodecs for the future.

        Good – I added that, it found a bug in SimpleText (only codec that can index offsets currently...) and I fixed that.

        For checkindex, long term i think we should really consider adding a (slow, not by default) option to verify
        the term vectors against the postings. we could at least turn it on in tests..., but thats another separate issue.

        I added that, in one direction (for each TV it seeks the Terms/Docs/AndPositionsEnum to verify everything is the same)... and it uncovered a sneaky bug in Lucene3x codec (not present in 3.x) where we were failing to make a deep copy of the Term before using it as a key in the terms cache... I fixed it.

        But i think this is wrong, we must use compareTo >= 0?

        Right – I fixed several places that were still doing == or !=. I left ones in non-SimpleText codecs – they are still OK since they refuse to index offsets.

        I think this code should start with a min3xFormat-1.

        Ahh right! I removed that and just kept FORMAT_FLEX.

        Thanks for the reviews Robert and Simon!

        Show
        Michael McCandless added a comment - I think we just need two assertEquals() in TestDuelingCodecs for the future. Good – I added that, it found a bug in SimpleText (only codec that can index offsets currently...) and I fixed that. For checkindex, long term i think we should really consider adding a (slow, not by default) option to verify the term vectors against the postings. we could at least turn it on in tests..., but thats another separate issue. I added that, in one direction (for each TV it seeks the Terms/Docs/AndPositionsEnum to verify everything is the same)... and it uncovered a sneaky bug in Lucene3x codec (not present in 3.x) where we were failing to make a deep copy of the Term before using it as a key in the terms cache... I fixed it. But i think this is wrong, we must use compareTo >= 0? Right – I fixed several places that were still doing == or !=. I left ones in non-SimpleText codecs – they are still OK since they refuse to index offsets. I think this code should start with a min3xFormat-1. Ahh right! I removed that and just kept FORMAT_FLEX. Thanks for the reviews Robert and Simon!
        Hide
        Michael McCandless added a comment -

        New patch w/ above feedback incorporated...

        Show
        Michael McCandless added a comment - New patch w/ above feedback incorporated...

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development