Solr
  1. Solr
  2. SOLR-4656

Add hl.maxMultiValuedToExamine to limit the number of multiValued entries examined while highlighting

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.3, 6.0
    • Fix Version/s: 4.3, 6.0
    • Component/s: highlighter
    • Labels:
      None

      Description

      I'm looking at an admittedly pathological case of many, many entries in a multiValued field, and trying to implement a way to limit the number examined, analogous to maxAnalyzedChars, see the patch.

      Along the way, I noticed that we do what looks like unnecessary copying of the fields to be examined. We call Document.getFields, which copies all of the fields and values to the returned array. Then we copy all of those to another array, converting them to Strings. Then we actually examine them. a> this doesn't seem very efficient and b> reduces the benefit from limiting the number of mv values examined.

      So the attached does two things:
      1> attempts to fix this
      2> implements hl.maxMultiValuedToExamine

      I'd really love it if someone who knows the highlighting code takes a peek at the fix to see if I've messed things up, the changes are actually pretty minimal.

      1. SOLR-4656.patch
        5 kB
        Erick Erickson
      2. SOLR-4656-4x.patch
        13 kB
        Erick Erickson
      3. SOLR-4656-4x.patch
        14 kB
        Erick Erickson
      4. SOLR-4656-trunk.patch
        14 kB
        Erick Erickson

        Issue Links

          Activity

          Hide
          Erick Erickson added a comment - - edited

          Trunk patch for comment. Haven't run through the unit tests yet.

          Is there a better way to get all the StorableFields in a document for a particular field than testing for the fieldName? I.e.
          if (! thisField.name().equals(fieldName))
          seems inefficient.

          Patch will apply to trunk only, StorableField will need to be changed to IndexableField for 4x.

          Show
          Erick Erickson added a comment - - edited Trunk patch for comment. Haven't run through the unit tests yet. Is there a better way to get all the StorableFields in a document for a particular field than testing for the fieldName? I.e. if (! thisField.name().equals(fieldName)) seems inefficient. Patch will apply to trunk only, StorableField will need to be changed to IndexableField for 4x.
          Hide
          Erick Erickson added a comment -

          Patches with tests. All tests pass. If there are no objections, I'll commit these in a couple of days.

          Show
          Erick Erickson added a comment - Patches with tests. All tests pass. If there are no objections, I'll commit these in a couple of days.
          Hide
          Erick Erickson added a comment -

          P.S. I actually added two parameters,
          maxMultiValuedToExamine which governs the number of multiValued fields to examine before bailing and
          maxMultiValuedToMatch which governs the number of matches to find before bailing.

          BTW, the first patch sucks. It counts matches rather than values.....

          Show
          Erick Erickson added a comment - P.S. I actually added two parameters, maxMultiValuedToExamine which governs the number of multiValued fields to examine before bailing and maxMultiValuedToMatch which governs the number of matches to find before bailing. BTW, the first patch sucks. It counts matches rather than values.....
          Hide
          Erick Erickson added a comment -

          I plan on committing this Tuesday or so unless there are objections....

          hl.maxMultiValuedToMatch - stops looking in the values in a multiValued field after N matches are found. Default is Integer.MAX_VALUE

          hl.maxMultiValuedToExamine - stops looking in the values in a multiValued field after N values are examined, regardless of how many have been found (no matches is perfectly reasonable). Defaults to Integer.MAX_VALUE

          If both are specified, the first condition met stops the comparisons.

          The patch also restructures traversing the fields in the document so we aren't copying things around so much, I'd particularly like someone to glance at that code. All tests pass, but a second set of eyes would be welcome.

          Also along the way I found this parameter that I'd never seen before: hl.preserveMulti and added it to the highlight parameter page (http://wiki.apache.org/solr/HighlightingParameters) with the explanation from a comment in the code, some clarification there might be a good thing.

          Fortunately, the changes are actually relatively minor, most of the bulk of the patch is additional tests.

          Show
          Erick Erickson added a comment - I plan on committing this Tuesday or so unless there are objections.... hl.maxMultiValuedToMatch - stops looking in the values in a multiValued field after N matches are found. Default is Integer.MAX_VALUE hl.maxMultiValuedToExamine - stops looking in the values in a multiValued field after N values are examined, regardless of how many have been found (no matches is perfectly reasonable). Defaults to Integer.MAX_VALUE If both are specified, the first condition met stops the comparisons. The patch also restructures traversing the fields in the document so we aren't copying things around so much, I'd particularly like someone to glance at that code. All tests pass, but a second set of eyes would be welcome. Also along the way I found this parameter that I'd never seen before: hl.preserveMulti and added it to the highlight parameter page ( http://wiki.apache.org/solr/HighlightingParameters ) with the explanation from a comment in the code, some clarification there might be a good thing. Fortunately, the changes are actually relatively minor, most of the bulk of the patch is additional tests.
          Hide
          Erick Erickson added a comment -

          When I was reconciling the patch for 4x I decremented the mvToMatch outside the for loop. Harmless since I wasn't looking at it any more, but unnecessary.

          Show
          Erick Erickson added a comment - When I was reconciling the patch for 4x I decremented the mvToMatch outside the for loop. Harmless since I wasn't looking at it any more, but unnecessary.
          Hide
          Erick Erickson added a comment -

          Committed trunk: r - 1463543

          Show
          Erick Erickson added a comment - Committed trunk: r - 1463543
          Hide
          Erick Erickson added a comment -

          4x: r - 1463627

          Show
          Erick Erickson added a comment - 4x: r - 1463627
          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.
          Hide
          David Smiley added a comment -

          I saw the results of the modifications here during my work on SOLR-6680. It's not clear to me there needed to be new parameters. Shouldn't the field value lengths be accumulated, approaching maxAnalyzedChars and then exit at that point? And furthermore, shouldn't this field value loop exit early once it sees fragTexts.size() >= numFragments (i.e. hl.snippets is reached) ?

          Show
          David Smiley added a comment - I saw the results of the modifications here during my work on SOLR-6680 . It's not clear to me there needed to be new parameters. Shouldn't the field value lengths be accumulated, approaching maxAnalyzedChars and then exit at that point? And furthermore, shouldn't this field value loop exit early once it sees fragTexts.size() >= numFragments (i.e. hl.snippets is reached) ?
          Hide
          Erick Erickson added a comment -

          David:

          bq: Shouldn't the field value lengths be accumulated,
          I see where you're going, and I have I admit I didn't originate this code so all things are possible.
          It's a little different sense than maxAnayzedChars in that the unit of measurement is the number of MV entries rather than the number of characters analyzed, but I could argue either way.

          bq: shouldn't this field value loop exit early once ...
          I have no objection.

          Although it sees kind of late to take away this parameter, should we deprecate it insteas?

          Show
          Erick Erickson added a comment - David: bq: Shouldn't the field value lengths be accumulated, I see where you're going, and I have I admit I didn't originate this code so all things are possible. It's a little different sense than maxAnayzedChars in that the unit of measurement is the number of MV entries rather than the number of characters analyzed, but I could argue either way. bq: shouldn't this field value loop exit early once ... I have no objection. Although it sees kind of late to take away this parameter, should we deprecate it insteas?
          Hide
          David Smiley added a comment -

          It's a little different sense than maxAnayzedChars in that the unit of measurement is the number of MV entries rather than the number of characters analyzed, but I could argue either way.

          Sure... but was there per-value overhead involved that was a bit heavy for the particular client you did this for (i.e. massive number of values) or was it just a matter of not accumulating value lengths?

          Although it sees kind of late to take away this parameter, should we deprecate it instead?

          If there are a large number of values, I guess it has some value.

          In my last comment to SOLR-6680 I stated I think multi-value handling should be done a bit differently in which each value should be virtually concatenated/iterated via a CharSequence wrapper and handed to the highlighter. Likewise the TokenStreams of each value could be wrapped into a concatenating wrapper. If that were done, then I think these parameters would be completely obsolete as it would handle the case of massive number of values.

          I'll create a separate issue to accumulate maxAnalyzedChars per value and exit early.

          Show
          David Smiley added a comment - It's a little different sense than maxAnayzedChars in that the unit of measurement is the number of MV entries rather than the number of characters analyzed, but I could argue either way. Sure... but was there per-value overhead involved that was a bit heavy for the particular client you did this for (i.e. massive number of values) or was it just a matter of not accumulating value lengths? Although it sees kind of late to take away this parameter, should we deprecate it instead? If there are a large number of values, I guess it has some value. In my last comment to SOLR-6680 I stated I think multi-value handling should be done a bit differently in which each value should be virtually concatenated/iterated via a CharSequence wrapper and handed to the highlighter. Likewise the TokenStreams of each value could be wrapped into a concatenating wrapper. If that were done, then I think these parameters would be completely obsolete as it would handle the case of massive number of values. I'll create a separate issue to accumulate maxAnalyzedChars per value and exit early.
          Hide
          David Smiley added a comment -

          While working on SOLR-6692 I noticed this again, and I'm wondering two things:

          • Is the semantics of maxMultiValuedToMatch intentional with respect to that it counts snippets (i.e. fragments), as opposed to values? It's unfortunate the parameter name doesn't make this clear, which is suggestive that it counts values (maxMultiValuedToExamine counts values). There's a difference when hl.snippets isn't 1.
          • I don't believe mvToMatch should be decremented when bestTextFragment.getScore() is <= 0 since there actually was no match. This can happen often when hl.preserveMulti=true. I think this is a bug.

          I can fix but I'd like your thoughts, Erick Erickson.

          Show
          David Smiley added a comment - While working on SOLR-6692 I noticed this again, and I'm wondering two things: Is the semantics of maxMultiValuedToMatch intentional with respect to that it counts snippets (i.e. fragments), as opposed to values? It's unfortunate the parameter name doesn't make this clear, which is suggestive that it counts values (maxMultiValuedToExamine counts values). There's a difference when hl.snippets isn't 1. I don't believe mvToMatch should be decremented when bestTextFragment.getScore() is <= 0 since there actually was no match . This can happen often when hl.preserveMulti=true. I think this is a bug. I can fix but I'd like your thoughts, Erick Erickson .
          Hide
          Erick Erickson added a comment -

          David Smiley You're right on both counts. The intent of maxMultiValuedToMatch is, indeed, it should stop after matching N fragments, so the name is unfortunate. It should trip if it was set to, say, 3 and a single MV entry had 3 snippets. Maybe maxSnippetsToMatch? Deprecate and usenew terms IMO, but up to you.

          Right, if there is no snippet it shouldn't be decremented.

          Good catch!

          Erick

          Show
          Erick Erickson added a comment - David Smiley You're right on both counts. The intent of maxMultiValuedToMatch is, indeed, it should stop after matching N fragments , so the name is unfortunate. It should trip if it was set to, say, 3 and a single MV entry had 3 snippets. Maybe maxSnippetsToMatch? Deprecate and usenew terms IMO, but up to you. Right, if there is no snippet it shouldn't be decremented. Good catch! Erick
          Hide
          David Smiley added a comment -

          Maybe maxSnippetsToMatch? Deprecate and usenew terms IMO, but up to you.

          Naming is hard, and this parameter especially. I don't think it's worth trying to change a parameter's name after release. I may enhance the ref guide to make it more clear what this does.

          Right, if there is no snippet it shouldn't be decremented.

          Ok; I'll fix this in SOLR-6692 with a test.

          Show
          David Smiley added a comment - Maybe maxSnippetsToMatch? Deprecate and usenew terms IMO, but up to you. Naming is hard, and this parameter especially. I don't think it's worth trying to change a parameter's name after release. I may enhance the ref guide to make it more clear what this does. Right, if there is no snippet it shouldn't be decremented. Ok; I'll fix this in SOLR-6692 with a test.

            People

            • Assignee:
              Erick Erickson
              Reporter:
              Erick Erickson
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development