Solr
  1. Solr
  2. SOLR-6692

hl.maxAnalyzedChars should apply cumulatively on a multi-valued field

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2
    • Component/s: highlighter
    • Labels:
      None

      Description

      in DefaultSolrHighlighter, the hl.maxAnalyzedChars figure is used to constrain how much text is analyzed before the highlighter stops, in the interests of performance. For a multi-valued field, it effectively treats each value anew, no matter how much text it was previously analyzed for other values for the same field for the current document. The PostingsHighlighter doesn't work this way – hl.maxAnalyzedChars is effectively the total budget for a field for a document, no matter how many values there might be. It's not reset for each value. I think this makes more sense. When we loop over the values, we should subtract from hl.maxAnalyzedChars the length of the value just checked. The motivation here is consistency with PostingsHighlighter, and to allow for hl.maxAnalyzedChars to be pushed down to term vector uninversion, which wouldn't be possible for multi-valued fields based on the current way this parameter is used.

      Interestingly, I noticed Solr's use of FastVectorHighlighter doesn't honor hl.maxAnalyzedChars as the FVH doesn't have a knob for that. It does have hl.phraseLimit which is a limit that could be used for a similar purpose, albeit applied differently.

      Furthermore, DefaultSolrHighligher.doHighlightingByHighlighter should exit early from it's field value loop if it reaches hl.snippets, and if hl.preserveMulti=true

        Issue Links

          Activity

          Hide
          David Smiley added a comment -

          Furthermore, DefaultSolrHighligher.doHighlightingByHighlighter should exit early from it's field value loop if it reaches hl.snippets.

          Upon further inspection, it can only do that if hl.preserveMulti=true

          Show
          David Smiley added a comment - Furthermore, DefaultSolrHighligher.doHighlightingByHighlighter should exit early from it's field value loop if it reaches hl.snippets. Upon further inspection, it can only do that if hl.preserveMulti=true
          Hide
          David Smiley added a comment -

          This patch includes several things. I'm sorry if it's doing too many things; I will break it down into separate patches and/or JIRA issue if advised to. I will certainly point out the separate parts in CHANGES.txt so users know what's going on. Nevertheless the patch is small IMO and the net LOC is practically 0.

          • hl.maxAnalyzedChars is budgeted across all values for the field being highlighted. The budget is decremented by the field length at each iteration, and so progressively smaller limits are passed on through until it's <= 0 at which we we exit the loop. I added a test for this which randomly chooses between a field with term vectors or not.
          • Refactor/extensibility:
            • All methods that were private are now protected. This widens the scope of possibilities for subclassing without having to fork this code.
            • The no-arg constructor isn't used; I removed it and made the SolrCore field final as a clarification. If anyone ever tried to use the no-arg constructor (I have), they would have soon realized that was not an option since an NPE would be thrown from init().
            • I extracted a method getFieldValues whose sole job is to get the field values (Strings) to be highlighted given the provided field name & some other parameters. This is a useful extension point so that a subclass can get the field values from another field (i.e. the source of a copyField). Till now, people had to use hl.requireFieldMatch=false which had its drawbacks in terms of highlight precision. A side-benefit is that this method is aware of hl.maxMultiValuedToMatch and hl.maxAnalyzedChars, which will limit the values it returns. This aids the term-vector code path which can now in more circumstances see when there is only one value to highlight, and thus forgo wrapping the term vector stream with a OffsetWindow filter, which is a big penalty to avoid.
          • hl.usePhraseHighlighter can now be toggled per-field.
          • It includes a nocommit to confirm from SOLR-4656 (Erick) that the intention of hl.maxMultiValuedToMatch is to limit fragments, not matching values, despite the parameter name hinting otherwise.
          • I fixed a bug with hl.maxMultiValuedToMatch in which it would decrement its counter when in fact the fragment didn't match. This bug would only occur when hl.preserveMulti=true.
          • I fixed a small bug in ordering the fragments by score. It used Math.round() which will coalesce values close to 0 to appear as the same weighting. Now it simply uses Float.compare(a,b).
          • note: the code changes below the field value loop, except for the small score order bug I just mentioned, are purely code clean-up and don't change behavior. The code was more complex due to it thinking a fragment could be null when in fact by that point it's impossible.
          Show
          David Smiley added a comment - This patch includes several things. I'm sorry if it's doing too many things; I will break it down into separate patches and/or JIRA issue if advised to. I will certainly point out the separate parts in CHANGES.txt so users know what's going on. Nevertheless the patch is small IMO and the net LOC is practically 0. hl.maxAnalyzedChars is budgeted across all values for the field being highlighted. The budget is decremented by the field length at each iteration, and so progressively smaller limits are passed on through until it's <= 0 at which we we exit the loop. I added a test for this which randomly chooses between a field with term vectors or not. Refactor/extensibility: All methods that were private are now protected. This widens the scope of possibilities for subclassing without having to fork this code. The no-arg constructor isn't used; I removed it and made the SolrCore field final as a clarification. If anyone ever tried to use the no-arg constructor (I have), they would have soon realized that was not an option since an NPE would be thrown from init(). I extracted a method getFieldValues whose sole job is to get the field values (Strings) to be highlighted given the provided field name & some other parameters. This is a useful extension point so that a subclass can get the field values from another field (i.e. the source of a copyField). Till now, people had to use hl.requireFieldMatch=false which had its drawbacks in terms of highlight precision. A side-benefit is that this method is aware of hl.maxMultiValuedToMatch and hl.maxAnalyzedChars, which will limit the values it returns. This aids the term-vector code path which can now in more circumstances see when there is only one value to highlight, and thus forgo wrapping the term vector stream with a OffsetWindow filter, which is a big penalty to avoid. hl.usePhraseHighlighter can now be toggled per-field. It includes a nocommit to confirm from SOLR-4656 (Erick) that the intention of hl.maxMultiValuedToMatch is to limit fragments , not matching values, despite the parameter name hinting otherwise. I fixed a bug with hl.maxMultiValuedToMatch in which it would decrement its counter when in fact the fragment didn't match. This bug would only occur when hl.preserveMulti=true. I fixed a small bug in ordering the fragments by score. It used Math.round() which will coalesce values close to 0 to appear as the same weighting. Now it simply uses Float.compare(a,b). note: the code changes below the field value loop, except for the small score order bug I just mentioned, are purely code clean-up and don't change behavior. The code was more complex due to it thinking a fragment could be null when in fact by that point it's impossible.
          Hide
          David Smiley added a comment -

          Furthermore, DefaultSolrHighligher.doHighlightingByHighlighter should exit early from it's field value loop if it reaches hl.snippets.

          Actually it can never do that because it takes the top hl.snippets from every value and then takes the top hl.snippets of that. Anyway; there's multiple mechanisms to exit early now – hl.maxAnalyzedChars, and both hl.multiValued* options.

          Show
          David Smiley added a comment - Furthermore, DefaultSolrHighligher.doHighlightingByHighlighter should exit early from it's field value loop if it reaches hl.snippets. Actually it can never do that because it takes the top hl.snippets from every value and then takes the top hl.snippets of that. Anyway; there's multiple mechanisms to exit early now – hl.maxAnalyzedChars, and both hl.multiValued* options.
          Hide
          ASF subversion and git services added a comment -

          Commit 1673183 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1673183 ]

          SOLR-6692: Made standard highlighter more extensible
          Private methods are now protected, removed bad constructor, and refactored out getFieldValues() and getResponseForFragments() for both clarity and sub-class extension utility. Some refactoring of doHighlightingByHighlighter to clarify logic. These changes should have ZERO effect on highlights.

          Show
          ASF subversion and git services added a comment - Commit 1673183 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1673183 ] SOLR-6692 : Made standard highlighter more extensible Private methods are now protected, removed bad constructor, and refactored out getFieldValues() and getResponseForFragments() for both clarity and sub-class extension utility. Some refactoring of doHighlightingByHighlighter to clarify logic. These changes should have ZERO effect on highlights.
          Hide
          ASF subversion and git services added a comment -

          Commit 1673184 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1673184 ]

          SOLR-6692: Made standard highlighter more extensible
          Private methods are now protected, removed bad constructor, and refactored out getFieldValues() and getResponseForFragments() for both clarity and sub-class extension utility. Some refactoring of doHighlightingByHighlighter to clarify logic. These changes should have ZERO effect on highlights.

          Show
          ASF subversion and git services added a comment - Commit 1673184 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1673184 ] SOLR-6692 : Made standard highlighter more extensible Private methods are now protected, removed bad constructor, and refactored out getFieldValues() and getResponseForFragments() for both clarity and sub-class extension utility. Some refactoring of doHighlightingByHighlighter to clarify logic. These changes should have ZERO effect on highlights.
          Hide
          David Smiley added a comment -

          I'm committing this in pieces to clarify each change.
          One delta from my patch file here is that I refactored out getResponseForFragments so that a subclass could add scores and/or source offsets.

          Show
          David Smiley added a comment - I'm committing this in pieces to clarify each change. One delta from my patch file here is that I refactored out getResponseForFragments so that a subclass could add scores and/or source offsets.
          Hide
          ASF subversion and git services added a comment -

          Commit 1673185 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1673185 ]

          SOLR-6692: Don't round highlight fragment scores for ordering purposes

          Show
          ASF subversion and git services added a comment - Commit 1673185 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1673185 ] SOLR-6692 : Don't round highlight fragment scores for ordering purposes
          Hide
          ASF subversion and git services added a comment -

          Commit 1673187 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1673187 ]

          SOLR-6692: Don't round highlight fragment scores for ordering purposes

          Show
          ASF subversion and git services added a comment - Commit 1673187 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1673187 ] SOLR-6692 : Don't round highlight fragment scores for ordering purposes
          Hide
          ASF subversion and git services added a comment -

          Commit 1673200 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1673200 ]

          SOLR-6692: hl.maxAnalyzedChars should apply cumulatively on a multi-valued field

          Show
          ASF subversion and git services added a comment - Commit 1673200 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1673200 ] SOLR-6692 : hl.maxAnalyzedChars should apply cumulatively on a multi-valued field
          Hide
          ASF subversion and git services added a comment -

          Commit 1673201 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1673201 ]

          SOLR-6692: hl.maxAnalyzedChars should apply cumulatively on a multi-valued field

          Show
          ASF subversion and git services added a comment - Commit 1673201 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1673201 ] SOLR-6692 : hl.maxAnalyzedChars should apply cumulatively on a multi-valued field
          Hide
          ASF subversion and git services added a comment -

          Commit 1673216 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1673216 ]

          SOLR-6692: When using hl.maxMultiValuedToMatch with hl.preserveMulti, only count matched snippets.

          Show
          ASF subversion and git services added a comment - Commit 1673216 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1673216 ] SOLR-6692 : When using hl.maxMultiValuedToMatch with hl.preserveMulti, only count matched snippets.
          Hide
          ASF subversion and git services added a comment -

          Commit 1673237 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1673237 ]

          SOLR-6692: When using hl.maxMultiValuedToMatch with hl.preserveMulti, only count matched snippets.

          Show
          ASF subversion and git services added a comment - Commit 1673237 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1673237 ] SOLR-6692 : When using hl.maxMultiValuedToMatch with hl.preserveMulti, only count matched snippets.
          Hide
          Varun Thacker added a comment -

          Hi David Smiley,

          Are these two failures related? I can reproduce it on my machine but haven't looked into it in detail.

          ant test  -Dtestcase=TestWriterPerf -Dtests.method=testPerf -Dtests.seed=88D84B0068AE130 -Dtests.slow=true -Dtests.locale=es_AR -Dtests.timezone=US/East-Indiana -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
          
          ant test  -Dtestcase=TestWriterPerf -Dtests.seed=88D84B0068AE130 -Dtests.slow=true -Dtests.locale=es_AR -Dtests.timezone=US/East-Indiana -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
          
          Show
          Varun Thacker added a comment - Hi David Smiley , Are these two failures related? I can reproduce it on my machine but haven't looked into it in detail. ant test -Dtestcase=TestWriterPerf -Dtests.method=testPerf -Dtests.seed=88D84B0068AE130 -Dtests.slow= true -Dtests.locale=es_AR -Dtests.timezone=US/East-Indiana -Dtests.asserts= true -Dtests.file.encoding=US-ASCII ant test -Dtestcase=TestWriterPerf -Dtests.seed=88D84B0068AE130 -Dtests.slow= true -Dtests.locale=es_AR -Dtests.timezone=US/East-Indiana -Dtests.asserts= true -Dtests.file.encoding=US-ASCII
          Hide
          ASF subversion and git services added a comment -

          Commit 1673281 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1673281 ]

          SOLR-6692: Highlighter NPE bugfix when highlight nonexistent field.

          Show
          ASF subversion and git services added a comment - Commit 1673281 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1673281 ] SOLR-6692 : Highlighter NPE bugfix when highlight nonexistent field.
          Hide
          ASF subversion and git services added a comment -

          Commit 1673283 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1673283 ]

          SOLR-6692: Highlighter NPE bugfix when highlight nonexistent field.

          Show
          ASF subversion and git services added a comment - Commit 1673283 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1673283 ] SOLR-6692 : Highlighter NPE bugfix when highlight nonexistent field.
          Hide
          ASF subversion and git services added a comment -

          Commit 1673328 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1673328 ]

          SOLR-6692: highlighter refactorings...

          • extract method getDocPrefetchFieldNames
          • trim field names in getHighlightFields instead of later on
          • lazy-create FVH (could be expensive for wildcard queries)
          Show
          ASF subversion and git services added a comment - Commit 1673328 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1673328 ] SOLR-6692 : highlighter refactorings... extract method getDocPrefetchFieldNames trim field names in getHighlightFields instead of later on lazy-create FVH (could be expensive for wildcard queries)
          Hide
          ASF subversion and git services added a comment -

          Commit 1673332 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1673332 ]

          SOLR-6692: highlighter refactorings...

          • extract method getDocPrefetchFieldNames
          • trim field names in getHighlightFields instead of later on
          • lazy-create FVH (could be expensive for wildcard queries)
          Show
          ASF subversion and git services added a comment - Commit 1673332 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1673332 ] SOLR-6692 : highlighter refactorings... extract method getDocPrefetchFieldNames trim field names in getHighlightFields instead of later on lazy-create FVH (could be expensive for wildcard queries)
          Hide
          Anshum Gupta added a comment -

          Bulk close for 5.2.0.

          Show
          Anshum Gupta added a comment - Bulk close for 5.2.0.

            People

            • Assignee:
              David Smiley
              Reporter:
              David Smiley
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development