Lucene - Core / LUCENE-5182

FVH can end in very very long running recursion on phrase highlight

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.4, 6.0
    • Fix Version/s: 4.5, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Due to the nature of the FVH extract logic, a simple phrase query can put the FVH into a super long running recursion. I had documents literally taking days to return from the extract phrases logic. I have a test that reproduces the problem and a possible fix. The reason is that the FVH never tries to terminate early when a phrase is already way beyond the slop coming from the phrase query. If there is a document with a lot of occurrences of two or more terms in the phrase, this literally tries to match all possible combinations of the terms in the doc.
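To make the blow-up concrete, here is a minimal sketch of the idea, not the FVH code or the attached patch. It assumes a simplified model in which each phrase term has a sorted array of positions and a match requires in-order positions whose total extra gap is at most the slop. The class `PhraseWindowSketch`, its methods, and the `explored` counter are hypothetical names used only for illustration.

```java
import java.util.List;

/** Illustration only: a simplified stand-in for recursive phrase extraction. */
final class PhraseWindowSketch {
    /** Number of complete position combinations reached by the last call. */
    static long explored;

    /**
     * Counts in-order phrase matches. positions.get(i) holds the sorted
     * positions of the i-th phrase term in one document.
     */
    static int countMatches(List<int[]> positions, int slop, boolean prune) {
        explored = 0;
        if (positions.isEmpty()) {
            return 0;
        }
        int matches = 0;
        for (int start : positions.get(0)) {
            matches += extend(positions, 1, start, start, slop, prune);
        }
        return matches;
    }

    private static int extend(List<int[]> positions, int term, int first, int prev,
                              int slop, boolean prune) {
        if (term == positions.size()) {
            explored++;
            // Extra gap beyond a perfectly adjacent phrase must fit in the slop.
            return (prev - first) - (term - 1) <= slop ? 1 : 0;
        }
        int matches = 0;
        for (int p : positions.get(term)) {
            if (p <= prev) {
                continue; // terms must appear in order
            }
            // Early termination: even if all remaining terms were adjacent,
            // the window already exceeds the slop. Positions are sorted,
            // so every later candidate is worse as well.
            if (prune && (p - first) - term > slop) {
                break;
            }
            matches += extend(positions, term + 1, first, p, slop, prune);
        }
        return matches;
    }
}
```

With two terms that each occur 50 times in alternating positions and a slop of 0, both variants find the same 50 matches, but the unpruned variant reaches 1275 complete combinations while the pruned one reaches only 50. The gap widens quickly with more terms and more occurrences.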

      1. LUCENE-5182.patch
        5 kB
        Simon Willnauer

        Activity

        Simon Willnauer added a comment -

        here is a patch and a test

        Simon Willnauer added a comment -

        this patch really doesn't fix the actual issue, which is that this alg is freaking crazy and somehow n! in all the positions etc. I am not even sure what the Big-O of this is, but this patch just tries to prevent this thing from going totally nuts.

        Robert Muir added a comment -

        It seems to me this patch will solve the issue for low slop values, but for higher slop values there might be the same trouble, right?

        Maybe there can be a hard upper bound on this: is there some existing limit in this highlighter that can bound the slop (e.g. like the maximum number of words that can be in a snippet or something?) Failing that, maybe a separate configurable limit?

        Simon Willnauer added a comment -

        I agree, Robert, we don't really fix the problem for high slops. I am not sure what a good default is for that, but maybe it's just enough to make it configurable?

        Robert Muir added a comment -

        Yeah, I'm not sure either: maybe just a Math.min and a default of Integer.MAX_VALUE. Sure, it's still trappy, but at least it's an improvement.
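The Math.min suggestion could look roughly like the sketch below. It is an assumption about shape only: `SlopLimitSketch`, `setMaxSlop`, and `effectiveSlop` are hypothetical names and do not correspond to any setting in the highlighter.

```java
/** Illustration only: a hypothetical configurable upper bound on slop. */
final class SlopLimitSketch {
    // Default of Integer.MAX_VALUE means "no cap", so behavior is unchanged
    // unless the user opts in.
    private int maxSlop = Integer.MAX_VALUE;

    void setMaxSlop(int maxSlop) {
        if (maxSlop < 0) {
            throw new IllegalArgumentException("maxSlop must be >= 0");
        }
        this.maxSlop = maxSlop;
    }

    /** Slop actually used for highlighting: the query's slop, capped. */
    int effectiveSlop(int querySlop) {
        return Math.min(querySlop, maxSlop);
    }
}
```

With the default, `effectiveSlop(1000)` stays 1000, which is why the behavior is still trappy; after `setMaxSlop(10)` the same call yields 10.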

        Another idea (if the user is using the IDF-weighted fragments) might be to somehow not process terms where docFreq/maxDoc > foo%, realizing they won't contribute much to the score anyway.
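A rough sketch of that frequency check follows. The helper `tooFrequent` and the `maxRatio` parameter are hypothetical; the comment leaves the threshold open ("foo%"), so no value is implied here. In Lucene the two inputs would typically come from `IndexReader.docFreq(Term)` and `IndexReader.maxDoc()`.

```java
/** Illustration only: skip terms that occur in too large a share of documents. */
final class FrequentTermFilterSketch {
    /**
     * Returns true when docFreq/maxDoc exceeds maxRatio (a fraction in 0..1).
     * An empty index (maxDoc <= 0) never marks a term as too frequent.
     */
    static boolean tooFrequent(int docFreq, int maxDoc, double maxRatio) {
        if (maxDoc <= 0) {
            return false;
        }
        return (double) docFreq / maxDoc > maxRatio;
    }
}
```

For example, a term present in 900 of 1000 documents would be skipped at a ratio of 0.5, while one present in 10 of 1000 would still be processed.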

        But in general I feel like the problem will still exist without an algorithmic change.

        anyway +1 to the patch

        Simon Willnauer added a comment -

        I kind of feel that we can make a lot of things configurable but eventually we need to get rid of it. It's really a can of worms and really fixing it means rewriting it from my point of view.

        I think I will go with what I have for now (the patch) which at least fixes the larger issue.

        ASF subversion and git services added a comment -

        Commit 1515847 from Simon Willnauer in branch 'dev/trunk'
        [ https://svn.apache.org/r1515847 ]

        LUCENE-5182: Terminate phrase searches early if max phrase window is exceeded

        ASF subversion and git services added a comment -

        Commit 1515850 from Simon Willnauer in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1515850 ]

        LUCENE-5182: Terminate phrase searches early if max phrase window is exceeded

        Simon Willnauer added a comment -

        I committed to trunk and 4x - really I want to get LUCENE-2878 in soon (will start working on it in the near future) and then re-visit all the highlighters

        ASF subversion and git services added a comment -

        Commit 1515986 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1515986 ]

        LUCENE-5182: don't stack overflow jenkins

        ASF subversion and git services added a comment -

        Commit 1515988 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1515988 ]

        LUCENE-5182: don't stack overflow jenkins

        Adrien Grand added a comment -

        4.5 release -> bulk close


          People

          • Assignee: Simon Willnauer
          • Reporter: Simon Willnauer
          • Votes: 0
          • Watchers: 5
