Lucene - Core
  1. Lucene - Core
  2. LUCENE-6034

MemoryIndex should be able to wrap TermVector Terms

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The default highlighter has a "WeightedSpanTermExtractor" that uses MemoryIndex for certain queries – basically phrases, SpanQueries, and the like. For lots of text, this aspect of highlighting is time consuming and consumes a fair amount of memory. What also consumes memory is that it wraps the tokenStream in CachingTokenFilter in this case. But if the underlying TokenStream is actually from TokenSources (wrapping TermVector Terms), this is all needless! Furthermore, MemoryIndex doesn't support payloads.

      The patch here has 3 aspects to it:

      • Internal refactoring to MemoryIndex to simplify it by maintaining the fields in a sorted state using a TreeMap. The ramifications of this led to reduced LOC for this file, even with the other features I added. It also puts the FieldInfo on the Info, and thus there's one less data structure to keep around. I suppose if there are a huge variety of fields in MemoryIndex, the aggregated N*Log(N) field lookup could add up, but that seems very unlikely. I also brought in the MemoryIndexNormDocValues as a simple anonymous inner class - it's super-simple after all, not worth having in a separate file.
      • New MemoryIndex.addField(String fieldName, Terms) method. In this case, MemoryIndex is providing the supporting wrappers around the underlying Terms so that it appears as an Index. In so doing, MemoryIndex supports payloads for such fields.
      • WeightedSpanTermExtractor now detects TokenSources' wrapping of Terms and it supplies this to MemoryIndex.
      1. LUCENE-6034_Simplify_MemoryIndex.patch
        21 kB
        David Smiley
      2. LUCENE-6034.patch
        39 kB
        David Smiley
      3. LUCENE-6034.patch
        13 kB
        David Smiley
      4. LUCENE-6034.patch
        32 kB
        David Smiley
      5. LUCENE-6034.patch
        32 kB
        David Smiley
      6. LUCENE-6034.patch
        27 kB
        David Smiley

        Issue Links

          Activity

          Hide
          David Smiley added a comment -

          Here's an updated patch. Continuing the internal refactor, I eliminated the separate hashmap cache for Norms and used Info for that purpose.

          Alan Woodward I know you've worked in MemoryIndex lately; it'd be great if you could have a look.

          I plan to commit this Tuesday night, subject to feedback.

          Show
          David Smiley added a comment - Here's an updated patch. Continuing the internal refactor, I eliminated the separate hashmap cache for Norms and used Info for that purpose. Alan Woodward I know you've worked in MemoryIndex lately; it'd be great if you could have a look. I plan to commit this Tuesday night, subject to feedback.
          Hide
          Alan Woodward added a comment -

          +1, this is a nice cleanup

          On the question of what to do if you try and add a TermVectors field with no stored offsets when the MemoryIndex is expecting them, should we just throw an IllegalArgumentException? Better to get an error when you add the field rather than further down the line when you try and use the offsets.

          Show
          Alan Woodward added a comment - +1, this is a nice cleanup On the question of what to do if you try and add a TermVectors field with no stored offsets when the MemoryIndex is expecting them, should we just throw an IllegalArgumentException? Better to get an error when you add the field rather than further down the line when you try and use the offsets.
          Hide
          David Smiley added a comment -

          Thanks for the review, Alan.

          I updated the patch to throw IAE if offsets are expected but not present in the term vector.

          Show
          David Smiley added a comment - Thanks for the review, Alan. I updated the patch to throw IAE if offsets are expected but not present in the term vector.
          Hide
          Robert Muir added a comment -

          Does it really make sense to involve MemoryIndex here? If you want to search against a term vectors, i think it should just be an AtomicReader implementation?

          Show
          Robert Muir added a comment - Does it really make sense to involve MemoryIndex here? If you want to search against a term vectors, i think it should just be an AtomicReader implementation?
          Hide
          David Smiley added a comment -

          That thought crossed my mind as I was working on it... but then I thought it might have a lot of duplication of infrastructure with MemoryIndex (e.g. getNormValues...). But then maybe most of that isn't needed? I could go either way I guess.

          Show
          David Smiley added a comment - That thought crossed my mind as I was working on it... but then I thought it might have a lot of duplication of infrastructure with MemoryIndex (e.g. getNormValues...). But then maybe most of that isn't needed? I could go either way I guess.
          Hide
          Robert Muir added a comment -

          I think it would be much cleaner to keep it separate. making an atomicreader backed by vectors to me is totally different task than a memoryindex.

          Show
          Robert Muir added a comment - I think it would be much cleaner to keep it separate. making an atomicreader backed by vectors to me is totally different task than a memoryindex.
          Hide
          David Smiley added a comment -

          Sure; I'll do that. I'll keep the refactorings to MemoryIndex but drop the TermVector part, of course.

          Show
          David Smiley added a comment - Sure; I'll do that. I'll keep the refactorings to MemoryIndex but drop the TermVector part, of course.
          Hide
          David Smiley added a comment -

          The attached patch is just for simplifying MemoryIndex (incl. inlining separate MemoryIndexNormDocValues.java file). The net LOC is ~ -90.

          Show
          David Smiley added a comment - The attached patch is just for simplifying MemoryIndex (incl. inlining separate MemoryIndexNormDocValues.java file). The net LOC is ~ -90.
          Hide
          David Smiley added a comment -

          This patch introduces TermVectorLeafReader which I put in the same package as WeightedSpanTermExtractor and now used by that class.

          After having implemented that LeafReader subclass, I do tend to think LeafReader should have fewer abstract methods. It would be nice if the only abstract methods were were fields(), getFieldInfos(), and maxDoc(). FieldInfos feels like something that should be retrieved from Fields, not LeafReader.

          I intend to commit both patches tomorrow.

          Show
          David Smiley added a comment - This patch introduces TermVectorLeafReader which I put in the same package as WeightedSpanTermExtractor and now used by that class. After having implemented that LeafReader subclass, I do tend to think LeafReader should have fewer abstract methods. It would be nice if the only abstract methods were were fields(), getFieldInfos(), and maxDoc(). FieldInfos feels like something that should be retrieved from Fields, not LeafReader. I intend to commit both patches tomorrow.
          Hide
          Robert Muir added a comment -

          After having implemented that LeafReader subclass, I do tend to think LeafReader should have fewer abstract methods. It would be nice if the only abstract methods were were fields(), getFieldInfos(), and maxDoc(). FieldInfos feels like something that should be retrieved from Fields, not LeafReader.

          How can you think that? You act as if the inverted index is the only thing going on. Maybe we should just remove term vectors then if they aren't very important? and stored fields too? and docvalues and norms? This would certainly be less code to maintain. And we wouldnt have to store all that stuff in fieldinfos thats unrelated to postings lists either.

          Show
          Robert Muir added a comment - After having implemented that LeafReader subclass, I do tend to think LeafReader should have fewer abstract methods. It would be nice if the only abstract methods were were fields(), getFieldInfos(), and maxDoc(). FieldInfos feels like something that should be retrieved from Fields, not LeafReader. How can you think that? You act as if the inverted index is the only thing going on. Maybe we should just remove term vectors then if they aren't very important? and stored fields too? and docvalues and norms? This would certainly be less code to maintain. And we wouldnt have to store all that stuff in fieldinfos thats unrelated to postings lists either.
          Hide
          David Smiley added a comment -

          Updated patch.:

          • WeightedSpanTermExtractor should detect MatchAllDocsQuery and do nothing (an optimization)
          • Randomized HighlighterTest to sometimes use term vectors.
          • Added test to validate that with term vectors, payloads, and payload sensitive queries, the highlighter honors those queries and doesn't highlight if there isn't a payload match. This is the only highlighter with this feature and I'll be sure to mention it as a new feature in CHANGES.txt.
          Show
          David Smiley added a comment - Updated patch.: WeightedSpanTermExtractor should detect MatchAllDocsQuery and do nothing (an optimization) Randomized HighlighterTest to sometimes use term vectors. Added test to validate that with term vectors, payloads, and payload sensitive queries, the highlighter honors those queries and doesn't highlight if there isn't a payload match. This is the only highlighter with this feature and I'll be sure to mention it as a new feature in CHANGES.txt.
          Hide
          ASF subversion and git services added a comment -

          Commit 1643298 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1643298 ]

          LUCENE-6034: Simplify MemoryIndex state via TreeMap<Info> etc.

          Show
          ASF subversion and git services added a comment - Commit 1643298 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1643298 ] LUCENE-6034 : Simplify MemoryIndex state via TreeMap<Info> etc.
          Hide
          ASF subversion and git services added a comment -

          Commit 1643300 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1643300 ]

          LUCENE-6034: Simplify MemoryIndex state via TreeMap<Info> etc.

          Show
          ASF subversion and git services added a comment - Commit 1643300 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1643300 ] LUCENE-6034 : Simplify MemoryIndex state via TreeMap<Info> etc.
          Hide
          ASF subversion and git services added a comment -

          Commit 1643316 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1643316 ]

          LUCENE-6034: Highlighter QueryScorer/WeightedSpanTermExtractor shouldn't re-invert a term vector based TokenStream. It can now highlight payload-sensitive queries.

          Show
          ASF subversion and git services added a comment - Commit 1643316 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1643316 ] LUCENE-6034 : Highlighter QueryScorer/WeightedSpanTermExtractor shouldn't re-invert a term vector based TokenStream. It can now highlight payload-sensitive queries.
          Hide
          ASF subversion and git services added a comment -

          Commit 1643321 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1643321 ]

          LUCENE-6034: Highlighter QueryScorer/WeightedSpanTermExtractor shouldn't re-invert a term vector based TokenStream. It can now highlight payload-sensitive queries.

          Show
          ASF subversion and git services added a comment - Commit 1643321 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1643321 ] LUCENE-6034 : Highlighter QueryScorer/WeightedSpanTermExtractor shouldn't re-invert a term vector based TokenStream. It can now highlight payload-sensitive queries.
          Hide
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Show
          Anshum Gupta added a comment - Bulk close after 5.0 release.

            People

            • Assignee:
              David Smiley
              Reporter:
              David Smiley
            • Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development