Solr
  1. Solr
  2. SOLR-1221

Change Solr Highlighting to use the SpanScorer with MultiTerm expansion by default

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: highlighter
    • Labels:
      None

      Description

      To improve the out of the box experience of Solr 1.4, I really think we should make this change. You will still be able to turn both off.

      Comments?

      1. SOLR-1221.patch
        2 kB
        Mark Miller
      2. SOLR-1221.patch
        2 kB
        Mark Miller
      3. SOLR-1221.patch
        3 kB
        Mark Miller
      4. SOLR-1221.patch
        0.9 kB
        Mark Miller
      5. SOLR-1221.patch
        4 kB
        Yonik Seeley
      6. SOLR-1221.patch
        7 kB
        Yonik Seeley
      7. SOLR-1221.patch
        8 kB
        Mark Miller

        Issue Links

          Activity

          Hide
          Yonik Seeley added a comment -

          Seems time to me... does this affect the HTTP API?

          Show
          Yonik Seeley added a comment - Seems time to me... does this affect the HTTP API?
          Hide
          Michael McCandless added a comment -

          +1 Solr's defaults are important, too.

          Show
          Michael McCandless added a comment - +1 Solr's defaults are important, too.
          Hide
          Mark Miller added a comment -

          does this affect the HTTP API?

          No, it won't. Just changes the default setting for two parameters.

          Show
          Mark Miller added a comment - does this affect the HTTP API? No, it won't. Just changes the default setting for two parameters.
          Hide
          Grant Ingersoll added a comment -

          Is this going to make it into 1.4?

          Show
          Grant Ingersoll added a comment - Is this going to make it into 1.4?
          Hide
          Grant Ingersoll added a comment -

          Looks good to me, the only oddity is the name of the testPhraseHighlighter now tests the non-phrase highlighter.

          Also, testing wise, do we now risk losing tests for the other highlighter? I'm not that familiar with the phrase highlighter, so maybe all it does is subsume the functionality of the old one and add phrase support, in which case the existing tests are fine. If it does not do that, though, it seems like we might want to duplicate the tests, with use phrase highlighter set to false for the dups. Unless, of course, you think the existing single test that turns off the phrase highlighter is sufficient.

          Show
          Grant Ingersoll added a comment - Looks good to me, the only oddity is the name of the testPhraseHighlighter now tests the non-phrase highlighter. Also, testing wise, do we now risk losing tests for the other highlighter? I'm not that familiar with the phrase highlighter, so maybe all it does is subsume the functionality of the old one and add phrase support, in which case the existing tests are fine. If it does not do that, though, it seems like we might want to duplicate the tests, with use phrase highlighter set to false for the dups. Unless, of course, you think the existing single test that turns off the phrase highlighter is sufficient.
          Hide
          Mark Miller added a comment -

          the only oddity is the name of the testPhraseHighlighter now tests the non-phrase highlighter.

          Okay, will address.

          Also, testing wise, do we now risk losing tests for the other highlighter?

          Yeah - before the phrase highlighter was "undertested" and the default had more, and now
          its the reverse. I suppose ideally, every test would hit both - that gets ugly though - its what
          we do in Lucene. Thats why I am confident in the tests I suppose though - Lucene runs
          both through every test - so the PhraseHighlighter matches the other highlighters functionality
          and adds support for more query types. I think between Lucene and Solr, the tests are sufficient -
          but I couldn't argue against more coverage here - hard to argue against testing in any form

          Show
          Mark Miller added a comment - the only oddity is the name of the testPhraseHighlighter now tests the non-phrase highlighter. Okay, will address. Also, testing wise, do we now risk losing tests for the other highlighter? Yeah - before the phrase highlighter was "undertested" and the default had more, and now its the reverse. I suppose ideally, every test would hit both - that gets ugly though - its what we do in Lucene. Thats why I am confident in the tests I suppose though - Lucene runs both through every test - so the PhraseHighlighter matches the other highlighters functionality and adds support for more query types. I think between Lucene and Solr, the tests are sufficient - but I couldn't argue against more coverage here - hard to argue against testing in any form
          Hide
          Grant Ingersoll added a comment -

          Sounds good, Mark. Please commit when you are ready.

          Show
          Grant Ingersoll added a comment - Sounds good, Mark. Please commit when you are ready.
          Hide
          Mark Miller added a comment -

          to trunk

          Show
          Mark Miller added a comment - to trunk
          Hide
          Mark Miller added a comment -

          the only oddity is the name of the testPhraseHighlighter now tests the non-phrase highlighter.

          Thats actually not true - just looked at the test again - testPhraseHighlighter is the same as it was - it just used to default to off, so for the portion that expected it to be off, I now explicitly turn it off - thats the only change in testPhraseHighlighter.

          Show
          Mark Miller added a comment - the only oddity is the name of the testPhraseHighlighter now tests the non-phrase highlighter. Thats actually not true - just looked at the test again - testPhraseHighlighter is the same as it was - it just used to default to off, so for the portion that expected it to be off, I now explicitly turn it off - thats the only change in testPhraseHighlighter.
          Hide
          Mark Miller added a comment -

          fix a compile issue in last patch - param needs to be "true" rather than true.
          also adds changes entry.

          will commit very soon.

          Show
          Mark Miller added a comment - fix a compile issue in last patch - param needs to be "true" rather than true. also adds changes entry. will commit very soon.
          Hide
          Mark Miller added a comment -

          There is a bug to take care of, and a missing piece that Koji pointed out.

          Show
          Mark Miller added a comment - There is a bug to take care of, and a missing piece that Koji pointed out.
          Hide
          Mark Miller added a comment -

          Grrr ... darn it. Its at the Lucene level. NumericQuery doesn't set the Term for getTerm(), so its not supported by the Span Highlighter. Grrr. Not sure what to do at the moment ...

          Show
          Mark Miller added a comment - Grrr ... darn it. Its at the Lucene level. NumericQuery doesn't set the Term for getTerm(), so its not supported by the Span Highlighter. Grrr. Not sure what to do at the moment ...
          Hide
          Mark Miller added a comment -

          Easy enough to side step this bug in Lucene for now - but if we release with pure Lucene 2.9, not sure what I'm going to do here...

          Show
          Mark Miller added a comment - Easy enough to side step this bug in Lucene for now - but if we release with pure Lucene 2.9, not sure what I'm going to do here...
          Hide
          Mark Miller added a comment -

          What do people thing about putting in an updated Highlighter jar with a fix?

          Show
          Mark Miller added a comment - What do people thing about putting in an updated Highlighter jar with a fix?
          Hide
          Yonik Seeley added a comment -

          What do people thing about putting in an updated Highlighter jar with a fix?

          +1, If there's not an easy way to work around it in Solr.

          Show
          Yonik Seeley added a comment - What do people thing about putting in an updated Highlighter jar with a fix? +1, If there's not an easy way to work around it in Solr.
          Hide
          Mark Miller added a comment -

          I don't see one -

          ugly options I can think of -

          1. extend highlighter and duplicate a ton of code with a fix - obviously not really an option
          2. run through the query first and rebuild it without the numericrangequery part
          3. use an overridden version of NumericQuery in Solr that returns a placeholder term from getTerm

          None of these are really viable IMO though.

          Show
          Mark Miller added a comment - I don't see one - ugly options I can think of - 1. extend highlighter and duplicate a ton of code with a fix - obviously not really an option 2. run through the query first and rebuild it without the numericrangequery part 3. use an overridden version of NumericQuery in Solr that returns a placeholder term from getTerm None of these are really viable IMO though.
          Hide
          Yonik Seeley added a comment -

          How can one reproduce this bug? The query parser produces a normal term query for numeric queries.

          Something like this does not cause an error on trunk:
          http://localhost:8983/solr/select?q=popularity:6&hl=true&hl.fl=popularity

          Show
          Yonik Seeley added a comment - How can one reproduce this bug? The query parser produces a normal term query for numeric queries. Something like this does not cause an error on trunk: http://localhost:8983/solr/select?q=popularity:6&hl=true&hl.fl=popularity
          Hide
          Mark Miller added a comment - - edited

          Sorry - should have been more clear.

          The bug is hidden on trunk at the moment - Koji pointed out that I missed a spot, and on trunk right now, queries are being rewritten by default when they shouldnt be.

          Now that you mention it though, the issue is not just with NumericRange (I just assumed that was being gen'd cause I saw the problem with it), but its also going to be with the
          deprecated RangeQuery it looks. edit actually - since RangeQuery no longer extends MultiTermQuery, it won't cause the error - it just won't highlight.

          edit

          which really argues for a jar update - I can fix all causes with one simple fix - checking if getTerm returns null.

          Show
          Mark Miller added a comment - - edited Sorry - should have been more clear. The bug is hidden on trunk at the moment - Koji pointed out that I missed a spot, and on trunk right now, queries are being rewritten by default when they shouldnt be. Now that you mention it though, the issue is not just with NumericRange (I just assumed that was being gen'd cause I saw the problem with it), but its also going to be with the deprecated RangeQuery it looks. edit actually - since RangeQuery no longer extends MultiTermQuery, it won't cause the error - it just won't highlight. edit which really argues for a jar update - I can fix all causes with one simple fix - checking if getTerm returns null.
          Hide
          Mark Miller added a comment -

          this patch needs to be applied to finish the issue - but for now it will cause tests to fail.

          Show
          Mark Miller added a comment - this patch needs to be applied to finish the issue - but for now it will cause tests to fail.
          Hide
          Mark Miller added a comment -

          The query parser produces a normal term query for numeric queries.

          Also, I shouldn't have said NumericQuery - NumericRangeQuery - sorry again.

          Show
          Mark Miller added a comment - The query parser produces a normal term query for numeric queries. Also, I shouldn't have said NumericQuery - NumericRangeQuery - sorry again.
          Hide
          Uwe Schindler added a comment -

          deprec ConstantScoreRangeQuery (if ever used in Solr) would also have the problem... (but it extends TermRangeQuery, so should be catched before).

          3. use an overridden version of NumericQuery in Solr that returns a placeholder term from getTerm

          would not work (final and no ctors).

          Show
          Uwe Schindler added a comment - deprec ConstantScoreRangeQuery (if ever used in Solr) would also have the problem... (but it extends TermRangeQuery, so should be catched before). 3. use an overridden version of NumericQuery in Solr that returns a placeholder term from getTerm would not work (final and no ctors).
          Hide
          Mark Miller added a comment -

          deprec ConstantScoreRangeQuery (if ever used in Solr) would also have the problem... (but it extends TermRangeQuery, so should be catched before).

          Right - anything that extends TermRangeQuery should be fine. It has special handling.

          would not work (final and no ctors).

          Thanks - I actually saw that when I was messing with the Highlighter earlier today, but I didn't put 2 and 2 together.
          Sounds like the updated Highlighter jar is the solution.

          Show
          Mark Miller added a comment - deprec ConstantScoreRangeQuery (if ever used in Solr) would also have the problem... (but it extends TermRangeQuery, so should be catched before). Right - anything that extends TermRangeQuery should be fine. It has special handling. would not work (final and no ctors). Thanks - I actually saw that when I was messing with the Highlighter earlier today, but I didn't put 2 and 2 together. Sounds like the updated Highlighter jar is the solution.
          Hide
          Uwe Schindler added a comment -

          Does RangeQuery really need to be highlighted? It is not used anywhere in Solr (I removed all occurences in the issue about TermRangeQuery), so why handle it?

          In Lucene, the fix would only be needed for highlighting in 2.9.1, 3.0 will have no RangeQuery anymore.

          Show
          Uwe Schindler added a comment - Does RangeQuery really need to be highlighted? It is not used anywhere in Solr (I removed all occurences in the issue about TermRangeQuery), so why handle it? In Lucene, the fix would only be needed for highlighting in 2.9.1, 3.0 will have no RangeQuery anymore.
          Hide
          Mark Miller added a comment -

          In Lucene, the fix would only be needed for highlighting in 2.9.1, 3.0 will have no RangeQuery anymore.

          You can have custom queryparsers in Solr, so you can't say definitely its not used.

          Whether I'm concerned about it depends on if we a 2.9.1 - if we do, I'll fix it. If we don't I won't.

          Show
          Mark Miller added a comment - In Lucene, the fix would only be needed for highlighting in 2.9.1, 3.0 will have no RangeQuery anymore. You can have custom queryparsers in Solr, so you can't say definitely its not used. Whether I'm concerned about it depends on if we a 2.9.1 - if we do, I'll fix it. If we don't I won't.
          Hide
          Yonik Seeley added a comment -

          OK, here's a potential workaround... simply wraps the NumericRangeQuery in another query and delegates everything. Since it won't be an instanceof MultiTermQuery, it shouldn't trigger any bugs in the highlighters.

          I tried starting off with a MultiTermQuery that delegated to the contained NRQ, but it turned out to be impossible because MTQ.getEnum is protected - so can't delegate.

          The easiest solution would have been just subclassing NRQ - Lucene shouldn't make stuff like this so hard... we should simply document that most of the classes are not meant to be overridden - and back compat does not extend to subclassing unless the class was specifically designed for it.

          Show
          Yonik Seeley added a comment - OK, here's a potential workaround... simply wraps the NumericRangeQuery in another query and delegates everything. Since it won't be an instanceof MultiTermQuery, it shouldn't trigger any bugs in the highlighters. I tried starting off with a MultiTermQuery that delegated to the contained NRQ, but it turned out to be impossible because MTQ.getEnum is protected - so can't delegate. The easiest solution would have been just subclassing NRQ - Lucene shouldn't make stuff like this so hard... we should simply document that most of the classes are not meant to be overridden - and back compat does not extend to subclassing unless the class was specifically designed for it.
          Hide
          Uwe Schindler added a comment -

          Does the highlighter rewrite before checking the query? In this is not the case the simpliest thing to do would be the following: Just wrap it into a Query subclass and rewrite it to NRQ.

          Show
          Uwe Schindler added a comment - Does the highlighter rewrite before checking the query? In this is not the case the simpliest thing to do would be the following: Just wrap it into a Query subclass and rewrite it to NRQ.
          Hide
          Uwe Schindler added a comment -

          A even simplier workaround:
          Instead of using a NRQ, wrap a NRF with ConstantScoreQuery (just change TrieField.getRangeQuery()). You will loose auto-rewrite if only few terms are affected, but for precSteps>4/6, MTQ default would also use ConstantScoreQuery.

          Show
          Uwe Schindler added a comment - A even simplier workaround: Instead of using a NRQ, wrap a NRF with ConstantScoreQuery (just change TrieField.getRangeQuery()). You will loose auto-rewrite if only few terms are affected, but for precSteps>4/6, MTQ default would also use ConstantScoreQuery.
          Hide
          Mark Miller added a comment -

          I was about to merge that last changed needed (rewrite fix) with Yoniks patch - but it looks like Yoniks patch is missing the SolrQueryWrapper class ...

          Show
          Mark Miller added a comment - I was about to merge that last changed needed (rewrite fix) with Yoniks patch - but it looks like Yoniks patch is missing the SolrQueryWrapper class ...
          Hide
          Yonik Seeley added a comment -

          missing "svn add" strikes again.... attaching new patch.

          Show
          Yonik Seeley added a comment - missing "svn add" strikes again.... attaching new patch.
          Hide
          Yonik Seeley added a comment -

          Instead of using a NRQ, wrap a NRF with ConstantScoreQuery

          Yep, that would work too. Your call Mark

          Show
          Yonik Seeley added a comment - Instead of using a NRQ, wrap a NRF with ConstantScoreQuery Yep, that would work too. Your call Mark
          Hide
          Mark Miller added a comment -

          On first blush, I've got to think the wrapper is better.

          We don't lose the few terms -> faster booleanquery that way, and I'm not sure I see any advantage to using CSQ. So its not a huge weight towards the wrapper, but we now have it, and it does weigh that way it would seem to me ...

          Uwe?

          Show
          Mark Miller added a comment - On first blush, I've got to think the wrapper is better. We don't lose the few terms -> faster booleanquery that way, and I'm not sure I see any advantage to using CSQ. So its not a huge weight towards the wrapper, but we now have it, and it does weigh that way it would seem to me ... Uwe?
          Hide
          Uwe Schindler added a comment -

          I have no preference...

          But we fix the highlighter bug in lucene trunk, too?

          Show
          Uwe Schindler added a comment - I have no preference... But we fix the highlighter bug in lucene trunk, too?
          Hide
          Mark Miller added a comment -

          But we fix the highlighter bug in lucene trunk, too?

          Yup, def. The only reason we are trying to sidestep here is to avoid having to update the jar in Solr. Its just a hassle. What do we call it and how do we track it?
          Back against the wall, I think it makes sense, but not if we can sidestep and go pure 2.9 release.

          I'll commit a fix in Lucene over the weekend - super easy fix there anyway.

          Show
          Mark Miller added a comment - But we fix the highlighter bug in lucene trunk, too? Yup, def. The only reason we are trying to sidestep here is to avoid having to update the jar in Solr. Its just a hassle. What do we call it and how do we track it? Back against the wall, I think it makes sense, but not if we can sidestep and go pure 2.9 release. I'll commit a fix in Lucene over the weekend - super easy fix there anyway.
          Hide
          Uwe Schindler added a comment -

          I would also stay with 2.9 in Solr. Just mark the removal of the wrapper as a TODO item after the next lucene update.

          Show
          Uwe Schindler added a comment - I would also stay with 2.9 in Solr. Just mark the removal of the wrapper as a TODO item after the next lucene update.
          Hide
          Grant Ingersoll added a comment -

          Bulk close for Solr 1.4

          Show
          Grant Ingersoll added a comment - Bulk close for Solr 1.4

            People

            • Assignee:
              Mark Miller
              Reporter:
              Mark Miller
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development