Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I think should explore making start/end offsets a first-class attr in the
      postings APIs, and fixing the indexer to index them into postings.

      This will make term vector access cleaner (we now have to jump through
      hoops w/ non-first-class offset attr). It can also enable efficient
      highlighting without term vectors / reanalyzing, if the app indexes
      offsets into the postings.

      1. LUCENE-3684.patch
        182 kB
        Michael McCandless
      2. LUCENE-3684.patch
        194 kB
        Michael McCandless
      3. LUCENE-3684.patch
        211 kB
        Michael McCandless

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Initial patch:

        • Adds needsPositions/needsOffsets to TermsEnum.d&pEnum().
        • Cuts over all term vectors impls/usages to the new API.
        • Fixed one codec (SimpleText!) to be able to store offsets in the
          postings; added initial test to cover this.

        It's an early patch... I still need to fix the other codecs to index
        offsets.

        Show
        mikemccand Michael McCandless added a comment - Initial patch: Adds needsPositions/needsOffsets to TermsEnum.d&pEnum(). Cuts over all term vectors impls/usages to the new API. Fixed one codec (SimpleText!) to be able to store offsets in the postings; added initial test to cover this. It's an early patch... I still need to fix the other codecs to index offsets.
        Hide
        simonw Simon Willnauer added a comment -

        +1 this is awesome. seems we really need PositionIterators too

        Show
        simonw Simon Willnauer added a comment - +1 this is awesome. seems we really need PositionIterators too
        Hide
        mikemccand Michael McCandless added a comment -

        New patch; I changed it so you only pass "needsOffets" bool when pulling D&PEnum. I think it's ready!

        Show
        mikemccand Michael McCandless added a comment - New patch; I changed it so you only pass "needsOffets" bool when pulling D&PEnum. I think it's ready!
        Hide
        rcmuir Robert Muir added a comment -

        I agree, this patch looks good. I think we just need two assertEquals() in TestDuelingCodecs for the future.

        I agree about checkIndex, we shouldnt check anything about the offsets. The user could store something crazy
        in the offsets if they want.

        For checkindex, long term i think we should really consider adding a (slow, not by default) option to verify
        the term vectors against the postings. we could at least turn it on in tests..., but thats another separate issue.

        Show
        rcmuir Robert Muir added a comment - I agree, this patch looks good. I think we just need two assertEquals() in TestDuelingCodecs for the future. I agree about checkIndex, we shouldnt check anything about the offsets. The user could store something crazy in the offsets if they want. For checkindex, long term i think we should really consider adding a (slow, not by default) option to verify the term vectors against the postings. we could at least turn it on in tests..., but thats another separate issue.
        Hide
        rcmuir Robert Muir added a comment -

        Actually: just looking at FieldInfos i think i found another annoyance, we have this assertion everywhere (not just that class either):

         assert this.indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS || !this.storePayloads;
        

        But i think this is wrong, we must use compareTo >= 0?

        Show
        rcmuir Robert Muir added a comment - Actually: just looking at FieldInfos i think i found another annoyance, we have this assertion everywhere (not just that class either): assert this.indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS || !this.storePayloads; But i think this is wrong, we must use compareTo >= 0?
        Hide
        simonw Simon Willnauer added a comment -

        do we really need

        static final int FORMAT_CURRENT = FORMAT_OFFSETS_IN_POSTINGS;

        I mean the FORMAT_FLEX was introduced with flexible indexing and since we didn't release since then I think its confusing. We already changed this format since flex (ie. DocValues) without adding a new format. In LUCENE-3687 I separated the 3x and 4.0 format and removed all the previous formats from 4.0 I think this code should start with a min3xFormat-1.

        Show
        simonw Simon Willnauer added a comment - do we really need static final int FORMAT_CURRENT = FORMAT_OFFSETS_IN_POSTINGS; I mean the FORMAT_FLEX was introduced with flexible indexing and since we didn't release since then I think its confusing. We already changed this format since flex (ie. DocValues) without adding a new format. In LUCENE-3687 I separated the 3x and 4.0 format and removed all the previous formats from 4.0 I think this code should start with a min3xFormat-1.
        Hide
        mikemccand Michael McCandless added a comment -

        I think we just need two assertEquals() in TestDuelingCodecs for the future.

        Good – I added that, it found a bug in SimpleText (only codec that can index offsets currently...) and I fixed that.

        For checkindex, long term i think we should really consider adding a (slow, not by default) option to verify
        the term vectors against the postings. we could at least turn it on in tests..., but thats another separate issue.

        I added that, in one direction (for each TV it seeks the Terms/Docs/AndPositionsEnum to verify everything is the same)... and it uncovered a sneaky bug in Lucene3x codec (not present in 3.x) where we were failing to make a deep copy of the Term before using it as a key in the terms cache... I fixed it.

        But i think this is wrong, we must use compareTo >= 0?

        Right – I fixed several places that were still doing == or !=. I left ones in non-SimpleText codecs – they are still OK since they refuse to index offsets.

        I think this code should start with a min3xFormat-1.

        Ahh right! I removed that and just kept FORMAT_FLEX.

        Thanks for the reviews Robert and Simon!

        Show
        mikemccand Michael McCandless added a comment - I think we just need two assertEquals() in TestDuelingCodecs for the future. Good – I added that, it found a bug in SimpleText (only codec that can index offsets currently...) and I fixed that. For checkindex, long term i think we should really consider adding a (slow, not by default) option to verify the term vectors against the postings. we could at least turn it on in tests..., but thats another separate issue. I added that, in one direction (for each TV it seeks the Terms/Docs/AndPositionsEnum to verify everything is the same)... and it uncovered a sneaky bug in Lucene3x codec (not present in 3.x) where we were failing to make a deep copy of the Term before using it as a key in the terms cache... I fixed it. But i think this is wrong, we must use compareTo >= 0? Right – I fixed several places that were still doing == or !=. I left ones in non-SimpleText codecs – they are still OK since they refuse to index offsets. I think this code should start with a min3xFormat-1. Ahh right! I removed that and just kept FORMAT_FLEX. Thanks for the reviews Robert and Simon!
        Hide
        mikemccand Michael McCandless added a comment -

        New patch w/ above feedback incorporated...

        Show
        mikemccand Michael McCandless added a comment - New patch w/ above feedback incorporated...

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development