Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: modules/highlighter
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      TokenSources.java, in the highlight module, is a facade that returns a TokenStream for a field, either by un-inverting & converting the TermVector Terms, or by text re-analysis if TermVectors are unavailable or don't have the right options. TokenSources is used by the default highlighter, which is the most accurate highlighter we've got. When documents are large (say, hundreds of kilobytes on up), I found that most of the highlighter's time was spent up-front un-inverting & converting the term vector to a TokenStream, not on the actual highlighting that follows. Much of that time went into a huge sort of hundreds of thousands of Tokens. Time was also spent on lots of String conversion and char copying, and it used a lot of memory, too.
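      For orientation, here is a minimal sketch of how a consumer such as the default highlighter typically obtains a stream through this facade, assuming the era's TokenSources.getAnyTokenStream entry point; the helper class and its arguments are illustrative context, not part of the patch:

        import java.io.IOException;

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.search.highlight.TokenSources;

        class TokenSourcesUsage {
          // Prefers the field's term vector when available; otherwise falls back
          // to re-analyzing the document's text with the given analyzer.
          static TokenStream streamFor(IndexReader reader, int docId, String field,
                                       Analyzer analyzer) throws IOException {
            return TokenSources.getAnyTokenStream(reader, docId, field, analyzer);
          }
        }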

      In this patch, I overhauled TokenStreamFromTermPositionVector.java and removed similar logic from TokenSources that was used when positions weren't available but offsets were. The class can un-invert term vectors that have positions and/or offsets (at least one of the two). It doesn't sort. It places tokens directly into an array indexed by position; when positions aren't available, startOffset/8 serves as a rough position substitute (roughly assuming 8 characters per token). A lighter-weight Token inner class replaces the former, now-deprecated Token, and the tokens ultimately form a linked list once initialization completes. There is no String conversion, and character copying is minimized. The Token array is only needed during construction and is GC'ed afterwards.
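      To make the data-structure description concrete, here is an illustrative sketch of the position-indexed placement and the final linking pass; all names here are invented, and the real TokenStreamFromTermVector differs in detail:

        // Hypothetical sketch of the un-inversion described above.
        final class UninvertSketch {
          static final class TokenLL {            // light-weight token; no String conversion
            final char[] term;
            final int startOffset, endOffset;
            TokenLL next;                         // chains same-position tokens, later the full list
            TokenLL(char[] term, int so, int eo) {
              this.term = term; this.startOffset = so; this.endOffset = eo;
            }
          }

          // Each token was dropped into slots[position] (or slots[startOffset / 8]
          // when positions are absent); one pass threads them into a linked list.
          static TokenLL linkByPosition(TokenLL[] slots) {
            TokenLL head = null, tail = null;
            for (TokenLL slot : slots) {          // visits every slot; hence the gap caveat below
              while (slot != null) {
                TokenLL nextInSlot = slot.next;   // detach before re-linking
                if (head == null) head = slot; else tail.next = slot;
                tail = slot;
                slot.next = null;
                slot = nextInSlot;
              }
            }
            return head;                          // slots[] is unreferenced now and can be GC'ed
          }
        }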

      Misc:

      • It implements reset() efficiently, so it need not be wrapped in CachingTokenFilter (I'll supply a patch for that separately); see the replay sketch after this list.
      • It only fetches payloads if you ask for them by adding the attribute (the default highlighter won't add the attribute).
      • It exposes the underlying TermVector terms via a getter too, which is needed by another patch to follow later.
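
      A hedged sketch of what the efficient reset() buys a consumer: the stream can be replayed without a CachingTokenFilter wrapper. The consumeTwice helper and the double pass are illustrative only, relying on the replay-friendly reset() noted above (a general TokenStream doesn't promise this):

        import java.io.IOException;

        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
        import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

        class ReplayExample {
          // Two full passes over the same stream; ordinarily this would require
          // wrapping in CachingTokenFilter, but a cheap reset() makes it unnecessary.
          static void consumeTwice(TokenStream ts) throws IOException {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
            for (int pass = 0; pass < 2; pass++) {
              ts.reset();                       // cheap for this implementation
              while (ts.incrementToken()) {
                System.out.println(term + " @ " + offset.startOffset());
              }
              ts.end();
            }
            ts.close();
          }
        }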

      A key assumption is that the position increment gap or first position isn't gigantic, as that would create wasted space, and the linked-list formation ultimately has to visit all the slots. We also assume there aren't a ton of tokens at the same position, since inserting new tokens at a position in sorted order is O(N^2), where 'N' is the average number of co-occurring tokens.

      My performance testing with Lucene's benchmark module on a one-megabyte document showed a >5x speedup, in conjunction with some other patches to be posted separately; this patch made the most difference.


          Activity

          David Smiley added a comment -

          Here's the patch.

          There are a couple no-commits:

          1. I want to rename TokenStreamFromTermPositionVector to TokenStreamFromTermVector
          2. I think TokenSources.getTokenStreamWithOffsets should relax its insistence that the term vector have positions. If you have control of your index options (and you do!), then you can choose not to put in positions (see the field-type sketch below) and highlight with the consequences of that decision, which is that highlighting will ignore stop-words: a query of "Sugar and Spice" would not match "sugar space", while a query of "sugar spice" would match an indexed "sugar and spice". If you don't even have stop-words, why put positions in the term vector at all?
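
          As a concrete illustration of the index options in question, a field type storing term vector offsets while deliberately omitting positions might be configured as follows; the class and constant names are only illustrative:

            import org.apache.lucene.document.FieldType;
            import org.apache.lucene.document.TextField;

            class OffsetsOnlyVectors {
              // Term vectors with offsets but no positions, per point 2 above.
              static final FieldType TYPE = new FieldType(TextField.TYPE_STORED);
              static {
                TYPE.setStoreTermVectors(true);
                TYPE.setStoreTermVectorOffsets(true);   // positions deliberately omitted
                TYPE.freeze();
              }
            }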
          David Smiley added a comment -

          Here's an updated patch.

          • Ensures nobody calls the now-deprecated TokenSources overload that took the tokenPositionsGuaranteedContiguous argument. The only callers were tests.
          • Relaxes the requirement for the term vector to have positions. It may be worth adding an entry in CHANGES.txt for upgraders, but it seems it would only affect someone who had offsets yet forgot to add positions to term vectors (why else would they not be there?), and they may now see less ideal position-related highlighting. The fix is to add positions.
          • Removes the "nocommit" for the rename I want to do. I'll have to remember to do the rename in a second commit.

          I'll commit in a couple days or so to trunk & 5x.

          ASF subversion and git services added a comment -

          Commit 1642294 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1642294 ]

          LUCENE-6031: Optimize TokenSources term vector to TokenStream

          ASF subversion and git services added a comment -

          Commit 1642295 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1642295 ]

          LUCENE-6031: Rename TokenStreamFromTermPositionVector to TokenStreamFromTermVector

          ASF subversion and git services added a comment -

          Commit 1642297 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1642297 ]

          LUCENE-6031: Optimize TokenSources term vector to TokenStream

          ASF subversion and git services added a comment -

          Commit 1642298 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1642298 ]

          LUCENE-6031: Rename TokenStreamFromTermPositionVector to TokenStreamFromTermVector

          Yonik Seeley added a comment -

          Reopening - Solr highlighting tests do not pass after this commit.

          David Smiley added a comment -

          Ouch; so sorry I failed the build! In my checkout I have several pending issues related to highlighting, and apparently the Solr one, SOLR-6680, is a dependency. I should have monitored the dev list more closely; I recall getting a nastygram from Jenkins when I broke the build in the past, and I thought I was in the clear since I didn't get one this time.

          The coupling between this and SOLR-6680 is that TokenSources, prior to my commit here, did not require that you call reset(). Not calling reset() is, of course, a violation of the TokenStream contract, which is unacceptable. The patch on SOLR-6680 does several things to DefaultSolrHighlighter, one of which is ensuring reset() is called appropriately. Since I posted SOLR-6680 some time ago, I will commit it within an hour or so and thus fix the build. I will also add a note in Lucene's upgrading section in case someone else forgets to reset the stream returned from TokenSources.
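
          In consumer terms, the fix on the Solr side amounts to following the standard TokenStream contract, roughly as sketched here; the countTokens helper is a hypothetical stand-in for what DefaultSolrHighlighter does, not code from SOLR-6680:

            import java.io.IOException;

            import org.apache.lucene.analysis.TokenStream;

            class ContractExample {
              // reset() must precede incrementToken(); the stream from TokenSources
              // previously happened to work without it, masking the missing call.
              static int countTokens(TokenStream ts) throws IOException {
                int n = 0;
                ts.reset();                 // the call Solr was omitting
                while (ts.incrementToken()) {
                  n++;
                }
                ts.end();
                ts.close();
                return n;
              }
            }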

          ASF subversion and git services added a comment -

          Commit 1642512 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1642512 ]

          LUCENE-6031: don't call deprecated TokenSources method (in tests)

          ASF subversion and git services added a comment -

          Commit 1642513 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1642513 ]

          LUCENE-6031: don't call deprecated TokenSources method (in tests)

          David Smiley added a comment -

          Hmm; Solr's CHANGES.txt is the one with an upgrading section, not Lucene's. Anyway, I think I won't add the note after all; Solr was in error by not calling reset() in the first place. Though if you or anyone else thinks a CHANGES.txt note that reset() wasn't previously required but now is would be worthwhile, then I will add one.

          David Smiley added a comment -

          The attached patch further improves this by:

          • delaying lazy init to the first incrementToken() call instead of doing it in reset(). It seems people call addAttribute() both before and after reset(), so the only way to know whether a consumer wants payloads is to wait until incrementToken(). See the payload sketch after this list.
            • a consequence of the implementation is that reset() is no longer strictly required, which is how it used to be for this TokenStream. It should still generally be called; it's just that this one will still work if you don't.
          • managing payload data in a BytesRefArray, which reduces overall memory use and object count (reducing GC burden).
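
          A sketch of the payload opt-in described in the first bullet; the withPayloads helper is illustrative, and with init deferred to incrementToken(), the addAttribute call works whether it happens before or after reset():

            import java.io.IOException;

            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
            import org.apache.lucene.util.BytesRef;

            class PayloadOptIn {
              static void withPayloads(TokenStream ts) throws IOException {
                // Payloads are only un-inverted because the consumer asked for them.
                PayloadAttribute payloadAtt = ts.addAttribute(PayloadAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                  BytesRef payload = payloadAtt.getPayload();   // may be null for a token
                  if (payload != null) {
                    // ... use payload.bytes ...
                  }
                }
                ts.end();
                ts.close();
              }
            }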
          David Smiley added a comment -

          This patch further improves things slightly by using an AttributeFactory with PackedTokenAttributeImpl, reducing the memory overhead of captureState() since there will be only one attribute impl (two if payloads are used).

          // This attribute factory uses less memory when captureState() is called.
          public static final AttributeFactory ATTRIBUTE_FACTORY =
              AttributeFactory.getStaticImplementation(
                  AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, PackedTokenAttributeImpl.class);

          I pass that to the constructor via super(ATTRIBUTE_FACTORY).
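
          For illustration, the wiring looks roughly like the following; the class name and constructor signature are assumed for the sketch, not taken from the patch:

            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl;
            import org.apache.lucene.index.Terms;
            import org.apache.lucene.util.AttributeFactory;

            // Hypothetical sketch: one PackedTokenAttributeImpl instance backs all
            // the standard attributes, so captureState() copies a single impl.
            final class TokenStreamFromTermVectorSketch extends TokenStream {
              static final AttributeFactory ATTRIBUTE_FACTORY =
                  AttributeFactory.getStaticImplementation(
                      AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, PackedTokenAttributeImpl.class);

              private final Terms vector;

              TokenStreamFromTermVectorSketch(Terms vector) {
                super(ATTRIBUTE_FACTORY);   // the factory shown above
                this.vector = vector;
              }

              @Override
              public boolean incrementToken() {
                return false;               // un-inversion elided in this sketch
              }
            }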

          I'll commit in a couple days or so.

          Robert Muir added a comment -

          This issue is marked resolved. Can you please open a new issue for other changes?

          David Smiley added a comment -

          Hi Rob.
          Since it hasn't been released yet, I'll simply reopen. The patch is an incremental improvement; there's no change in scope.

          ASF subversion and git services added a comment -

          Commit 1647479 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1647479 ]

          LUCENE-6031: TokenStreamFromTermVector: move lazy init to incrementToken, hold payloads more efficiently; use PackedTokenAttributeImpl (save mem)

          ASF subversion and git services added a comment -

          Commit 1647480 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1647480 ]

          LUCENE-6031: TokenStreamFromTermVector: move lazy init to incrementToken, hold payloads more efficiently; use PackedTokenAttributeImpl (save mem)

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.


            People

            • Assignee: David Smiley
            • Reporter: David Smiley
            • Votes: 0
            • Watchers: 5
