Solr
  1. Solr
  2. SOLR-6680

DefaultSolrHighlighter can sometimes avoid CachingTokenFilter

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: highlighter
    • Labels:
      None

      Description

      The DefaultSolrHighlighter (the most accurate one) is a bit over-eager to wrap the token stream in a CachingTokenFilter when hl.usePhraseHighlighter=true. This wastes memory, and it interferes with other optimizations – LUCENE-6034. Furthermore, the internal TermOffsetsTokenStream (used when TermVectors are used with this) wasn't properly delegating reset().

      1. SOLR-6680_Solr_Highligher,_part_2,_OffsetWindowTokenFilter.patch
        1 kB
        David Smiley
      2. SOLR-6680.patch
        16 kB
        David Smiley
      3. SOLR-6680.patch
        10 kB
        David Smiley

        Issue Links

          Activity

          Hide
          David Smiley added a comment -

          Solr's tests are already pretty good; I rely on that.

          Note that this issue might depend on LUCENE-6031 since the current TokenSources don't implement reset() (they should have been made to but whoever wrote them forgot to).

          Show
          David Smiley added a comment - Solr's tests are already pretty good; I rely on that. Note that this issue might depend on LUCENE-6031 since the current TokenSources don't implement reset() (they should have been made to but whoever wrote them forgot to).
          Hide
          David Smiley added a comment -

          I should point out that the benefit of LUCENE-6033 won't be realized for a multi-valued field because of the way the offset adjusting works (TermOffsetsTokenStream). I'm not concerned with optimizing for this case but should someone else want to take this further then consider this approach: Don't wrap the TokenStream from the TermVectors. Instead, grab all the values of this field and wrap them in a CharSequence implementation that reads from each value in sequence. But Highlighter expects a String for the value; it could be modified to deal with a CharSequence instead.

          Show
          David Smiley added a comment - I should point out that the benefit of LUCENE-6033 won't be realized for a multi-valued field because of the way the offset adjusting works (TermOffsetsTokenStream). I'm not concerned with optimizing for this case but should someone else want to take this further then consider this approach: Don't wrap the TokenStream from the TermVectors. Instead, grab all the values of this field and wrap them in a CharSequence implementation that reads from each value in sequence. But Highlighter expects a String for the value; it could be modified to deal with a CharSequence instead.
          Hide
          David Smiley added a comment -

          Updated patch. Added check to detect if, for the current document, if the field to highlight is actually multi-valued. If it isn't we can avoid TermOffsetsTokenStream, which defeats the optimization in SOLR-6034.

          Show
          David Smiley added a comment - Updated patch. Added check to detect if, for the current document, if the field to highlight is actually multi-valued. If it isn't we can avoid TermOffsetsTokenStream, which defeats the optimization in SOLR-6034 .
          Hide
          ASF subversion and git services added a comment -

          Commit 1642510 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1642510 ]

          SOLR-6680: DefaultSolrHighlighter: avoid CachingTokenFilter when not needed

          Show
          ASF subversion and git services added a comment - Commit 1642510 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1642510 ] SOLR-6680 : DefaultSolrHighlighter: avoid CachingTokenFilter when not needed
          Hide
          ASF subversion and git services added a comment -

          Commit 1642511 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1642511 ]

          SOLR-6680: DefaultSolrHighlighter: avoid CachingTokenFilter when not needed

          Show
          ASF subversion and git services added a comment - Commit 1642511 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1642511 ] SOLR-6680 : DefaultSolrHighlighter: avoid CachingTokenFilter when not needed
          Hide
          David Smiley added a comment -

          Re-opening for incremental improvement:

          The next patch further reduces token caching in DefaultSolrHighlighter, this time by "TermOffsetsTokenStream", which is used for multi-valued fields with term vectors to provide an offset based view/window into the token stream. I found the name unclear so I also renamed it to OffsetWindowTokenFilter with a comment to clarify it's used for multi-valued term vectors. I found the variable names unclear so I renamed them too. It used to call captureState & restoreState for every token; now it only does it for the first token leading into the next value. It used to use a cloned AttributeSource but I found there to be no point to that, plus it interferes with TokenStreamFromTermVector's ability to detect if payloads are desired.

          Show
          David Smiley added a comment - Re-opening for incremental improvement: The next patch further reduces token caching in DefaultSolrHighlighter, this time by "TermOffsetsTokenStream", which is used for multi-valued fields with term vectors to provide an offset based view/window into the token stream. I found the name unclear so I also renamed it to OffsetWindowTokenFilter with a comment to clarify it's used for multi-valued term vectors. I found the variable names unclear so I renamed them too. It used to call captureState & restoreState for every token; now it only does it for the first token leading into the next value. It used to use a cloned AttributeSource but I found there to be no point to that, plus it interferes with TokenStreamFromTermVector's ability to detect if payloads are desired.
          Hide
          David Smiley added a comment -

          I'll commit Monday if there's no feedback.

          Show
          David Smiley added a comment - I'll commit Monday if there's no feedback.
          Hide
          ASF subversion and git services added a comment -

          Commit 1647481 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1647481 ]

          SOLR-6680: refactor DefaultSolrHighlighter.TermOffsetsTokenStream (from term vectors) to avoid buffering the token.

          Show
          ASF subversion and git services added a comment - Commit 1647481 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1647481 ] SOLR-6680 : refactor DefaultSolrHighlighter.TermOffsetsTokenStream (from term vectors) to avoid buffering the token.
          Hide
          ASF subversion and git services added a comment -

          Commit 1647482 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1647482 ]

          SOLR-6680: refactor DefaultSolrHighlighter.TermOffsetsTokenStream (from term vectors) to avoid buffering the token.

          Show
          ASF subversion and git services added a comment - Commit 1647482 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1647482 ] SOLR-6680 : refactor DefaultSolrHighlighter.TermOffsetsTokenStream (from term vectors) to avoid buffering the token.
          Hide
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Show
          Anshum Gupta added a comment - Bulk close after 5.0 release.

            People

            • Assignee:
              David Smiley
              Reporter:
              David Smiley
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development