Lucene - Core
LUCENE-3234

Provide limit on phrase analysis in FastVectorHighlighter

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept
      less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze.

      The patch includes an artificial test case that shows > 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.

      With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.
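
      As an illustration, a minimal usage sketch (hedged: getFieldQuery and getBestFragment are existing FVH methods, but the phraseLimit-taking constructor is the one this patch introduces, so its exact parameter order is an assumption; the field must be indexed with term vectors including positions and offsets):

          import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
          import org.apache.lucene.search.vectorhighlight.FieldQuery;
          import org.apache.lucene.search.vectorhighlight.ScoreOrderFragmentsBuilder;
          import org.apache.lucene.search.vectorhighlight.SimpleFragListBuilder;

          FastVectorHighlighter highlighter = new FastVectorHighlighter(
              true,                               // phraseHighlight
              true,                               // fieldMatch
              new SimpleFragListBuilder(),
              new ScoreOrderFragmentsBuilder(),
              1);                                 // phraseLimit: take the first possible snippet
          FieldQuery fieldQuery = highlighter.getFieldQuery(query);
          String snippet = highlighter.getBestFragment(
              fieldQuery, reader, docId, "body", 100);  // 100-char fragments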

      1. LUCENE-3234.patch
        8 kB
        Koji Sekiguchi
      2. LUCENE-3234.patch
        8 kB
        Koji Sekiguchi
      3. LUCENE-3234.patch
        7 kB
        Mike Sokolov
      4. LUCENE-3234.patch
        7 kB
        Mike Sokolov
      5. LUCENE-3234.patch
        5 kB
        Mike Sokolov

        Activity

        Koji Sekiguchi added a comment -

        Ok, thank you, David. I opened SOLR-2794.

        David Smiley added a comment -

        Koji: you didn't react to my comment that an app would be incredibly unlikely to actually depend on the unlimited behavior; thus there is no back-compat concern. Again, I think it should be 5000.

        Koji Sekiguchi added a comment -

        It could be the suggested 5000. I don't have a strong preference on it. The current default is just for back-compat.

        David Smiley added a comment -

        Why is the default unlimited; shouldn't it be the suggested 5000? I doubt it could be for backwards compatibility, since I can't see how an app might depend on the unlimited behavior. I think Solr should have good defaults that protect against pathological cases. Other parameters in Lucene/Solr have such protective defaults, in general (e.g. hl.maxAnalyzedChars).

        Koji Sekiguchi added a comment -

        Thank you again for checking the commit, Mike! The javadoc has been fixed.

        Mike Sokolov added a comment -

        Thank you, Koji - it's nice to have my first patch committed!

        One small comment: since you made the default MAX_VALUE, there is a javadoc comment that still says 5000 and should be updated.

        Koji Sekiguchi added a comment -

        trunk: Committed revision 1139995.
        3x: Committed revision 1139997.

        Thanks, Mike!

        Koji Sekiguchi added a comment -

        Oops, wrong patch. This one is correct.

        Koji Sekiguchi added a comment -

        Updated patch attached. I added CHANGES.txt entries for Lucene and Solr, used Integer.MAX_VALUE for the default and added @param for phraseLimit in the new constructor javadoc. Will commit soon.

        Digy added a comment -

        I am not sure how much it is related to this issue, but there was a similar issue in Lucene.Net:
        https://issues.apache.org/jira/browse/LUCENENET-350

        Mike Sokolov added a comment -

        Sure - the test is fragile. It was just meant to illustrate the use case, not really to serve as a good regression test. The last patch has it commented out.

        Robert Muir added a comment -

        Oh I see - I think I'm nervous about testRepeatedTerms too.
        Maybe we can comment it out and just mention it's more of a benchmark?

        The problem could be that the test is timing-based... in general a machine could suddenly get busy at any time,
        especially since we run many tests in parallel, so I'm worried it could fail intermittently.

        Mike Sokolov added a comment - edited

        Added Solr parameter hl.phraseLimit (default=5000).

        Koji - I'm not sure what the issue with assertTrue is. It looked to me as if the test case ultimately inherits from org.junit.Assert, which defines the method. Is there a different version of JUnit on Jenkins without that method?
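
        For reference, the new parameter would be passed like any other highlighting parameter in a Solr request (core and field names here are hypothetical; when hl.phraseLimit is omitted, the default above applies):

            http://localhost:8983/solr/select?q=body:foo&hl=true&hl.fl=body&hl.phraseLimit=5000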

        Koji Sekiguchi added a comment -

        Mike, thank you for your continued interest in FVH! Can you add the parameter for Solr, with an appropriate default value, if you would like? I'm not sure whether the assertTrue test in testManyRepeatedTerms() is OK for Jenkins.

        Mike Sokolov added a comment -

        I did go back and look at the original case that made me worried; in that case the "bad" document is 650K, and the matched term occurs 23000 times in it. The search still finishes in 24 sec or so on my desktop, which isn't too bad, I guess, considering.

        After looking at that and measuring the change in the test case in the patch as the number of terms increases, I don't think there actually is an n^2 blow-up - just linear growth - but the growth is still enough that the patch has value. The test case in the patch is closely targeted at the method that takes all the time when you have large numbers of matching terms in a single document.
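
        To make the pathological case concrete, here is a sketch of the kind of synthetic document such a test builds (sizes, terms, and field names are illustrative, not the patch's actual test code):

            import org.apache.lucene.document.Document;
            import org.apache.lucene.document.Field;

            // One large document repeating a matching term many times; FVH needs
            // term vectors with positions and offsets to highlight the field.
            StringBuilder text = new StringBuilder();
            for (int i = 0; i < 23000; i++) {
              text.append("filler target filler ");
            }
            Document doc = new Document();
            doc.add(new Field("body", text.toString(), Field.Store.YES,
                Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));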

        Robert Muir added a comment -

        Oh, that's OK - I just meant a little tiny benchmark, hitting the nasty case that we think might be n^2.
        If the little test case does that... then that will work; I just wasn't sure whether it did.

        Either way, just something to look at in the profiler, etc.

        Mike Sokolov added a comment -

        I don't think I can share the test documents I have - they belong to someone else. I can look at trying to make something bad happen with the Wikipedia data, but I'm curious why a benchmark is preferable to a test case?

        Robert Muir added a comment -

        You can change it if you don't mind. However, I agree it would be good to figure out whether there is an n^2 here. This might have some effect on what the default value should be... ideally there is some way we could fix the n^2.

        Is there a way to turn your test case into a benchmark, or do you have a separate benchmark (the example you mentioned where it blows up really badly)? That could help in looking at what's going on.

        Mike Sokolov added a comment -

        Yes, that makes sense to me - default to 5000, say, and set it explicitly to either MAX_VALUE or -1 to get the unlimited behavior (I prefer to allow -1, since otherwise you should probably treat a negative value as an error). Do you want me to change the patch, or should I just leave that to the committer?
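
        A minimal sketch of that convention, assuming the highlighter simply normalizes a negative limit (variable names are hypothetical):

            // Treat any negative phraseLimit as "unlimited", per the -1 convention.
            int effectiveLimit = phraseLimit < 0 ? Integer.MAX_VALUE : phraseLimit;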

        Robert Muir added a comment -

        Yeah, you are right... but seeing as positions are ints too, I think it might be easier to use Integer.MAX_VALUE rather than the -1 parameter.

        Mike Sokolov added a comment -

        Yes, although a smaller number might be fine. Maybe Koji will comment: I don't completely understand the scaling here, but it seemed to me that I had a case with around 2000 occurrences of a term that led to a 15-20 sec evaluation time on my desktop. The max value will be an int, sure, although I think the number is going to scale like positions, not offsets, FWIW.

        Robert Muir added a comment -

        I like this tradeoff, Mike - thanks!

        Should we consider setting some kind of absurd default like 10,000 to really prevent some pathological cases with huge documents?
        We could document in CHANGES.txt that if you want the old behavior, set it to -1 or Integer.MAX_VALUE (I think we can use that here? Offsets are ints?)


          People

          • Assignee: Koji Sekiguchi
          • Reporter: Mike Sokolov
          • Votes: 0
          • Watchers: 3
