Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New
    1. LUCENE-6537.patch
      11 kB
      Paul Elschot
    2. LUCENE-6537.patch
      9 kB
      Alan Woodward

      Issue Links

        Activity

        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Bulk close for 5.3.0 release
        Hide
        romseygeek Alan Woodward added a comment -

        Turned out to be simpler to do this separately.

        Show
        romseygeek Alan Woodward added a comment - Turned out to be simpler to do this separately.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 1685517 from Alan Woodward in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1685517 ]

        LUCENE-6537: NearSpansOrdered should be lazy

        Show
        jira-bot ASF subversion and git services added a comment - Commit 1685517 from Alan Woodward in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1685517 ] LUCENE-6537 : NearSpansOrdered should be lazy
        Hide
        romseygeek Alan Woodward added a comment -

        I started back-porting this to 5x, but I think it makes more sense to backport LUCENE-6490 and LUCENE-6371 first.

        Show
        romseygeek Alan Woodward added a comment - I started back-porting this to 5x, but I think it makes more sense to backport LUCENE-6490 and LUCENE-6371 first.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 1684600 from Alan Woodward in branch 'dev/trunk'
        [ https://svn.apache.org/r1684600 ]

        LUCENE-6537: NearSpansOrdered should use lazy iteration

        Show
        jira-bot ASF subversion and git services added a comment - Commit 1684600 from Alan Woodward in branch 'dev/trunk' [ https://svn.apache.org/r1684600 ] LUCENE-6537 : NearSpansOrdered should use lazy iteration
        Hide
        paul.elschot@xs4all.nl Paul Elschot added a comment -

        Same patch with two extra test methods showing that repeated matches occur only when the first term repeats and there is enough slop:
        For ordered span near t1 t2 with slop 1:
        t1 t1 t2 matches twice,
        t1 t2 t2 matches once.

        Show
        paul.elschot@xs4all.nl Paul Elschot added a comment - Same patch with two extra test methods showing that repeated matches occur only when the first term repeats and there is enough slop: For ordered span near t1 t2 with slop 1: t1 t1 t2 matches twice, t1 t2 t2 matches once.
        Hide
        romseygeek Alan Woodward added a comment -

        Thanks Mike. So I think this is good to go then?

        Show
        romseygeek Alan Woodward added a comment - Thanks Mike. So I think this is good to go then?
        Hide
        mikemccand Michael McCandless added a comment -

        If it's expected that scores would have changed, since we don't test for that but luceneutil does catch it, then we can ignore it when benching (I think there is an option per competition to do so).

        I'm not sure what the point of doing the score-grouping is though? It seems a pretty arbitrary thing to be checking?

        I think the idea here was not to fail if the docIDs different in sort order when they had identical scores, maybe ...

        Show
        mikemccand Michael McCandless added a comment - If it's expected that scores would have changed, since we don't test for that but luceneutil does catch it, then we can ignore it when benching (I think there is an option per competition to do so). I'm not sure what the point of doing the score-grouping is though? It seems a pretty arbitrary thing to be checking? I think the idea here was not to fail if the docIDs different in sort order when they had identical scores, maybe ...
        Hide
        paul.elschot@xs4all.nl Paul Elschot added a comment -

        Do no tests fail? Maybe it requires beasting for them to trip, but i am depressed if the benchmarks find bugs like this and our tests don't...

        On this document:
        w1 w1 w2
        an ordered span near query with slop 1 should provide one span (w1 w2) without the patch here,
        and two spans (w1 .. w2) and (w1 w2) with the patch.

        Show
        paul.elschot@xs4all.nl Paul Elschot added a comment - Do no tests fail? Maybe it requires beasting for them to trip, but i am depressed if the benchmarks find bugs like this and our tests don't... On this document: w1 w1 w2 an ordered span near query with slop 1 should provide one span (w1 w2) without the patch here, and two spans (w1 .. w2) and (w1 w2) with the patch.
        Hide
        paul.elschot@xs4all.nl Paul Elschot added a comment -

        Lots of deleted code lines and scoring behaviour changed a little as expected, LGTM.
        Do I understand correctly that with this patch the span queries are also faster?

        Show
        paul.elschot@xs4all.nl Paul Elschot added a comment - Lots of deleted code lines and scoring behaviour changed a little as expected, LGTM. Do I understand correctly that with this patch the span queries are also faster?
        Hide
        rcmuir Robert Muir added a comment -
                     MedSpanNear       78.96      (2.4%)       81.41      (2.8%)    3.1% (  -2% -    8%)
                     LowSpanNear      251.16      (3.8%)      264.27      (4.4%)    5.2% (  -2% -   14%)
                    HighSpanNear       10.13      (4.2%)       10.68      (5.2%)    5.4% (  -3% -   15%)
        
        Show
        rcmuir Robert Muir added a comment - MedSpanNear 78.96 (2.4%) 81.41 (2.8%) 3.1% ( -2% - 8%) LowSpanNear 251.16 (3.8%) 264.27 (4.4%) 5.2% ( -2% - 14%) HighSpanNear 10.13 (4.2%) 10.68 (5.2%) 5.4% ( -3% - 15%)
        Hide
        rcmuir Robert Muir added a comment -

        We may have to get Michael McCandless to comment on that. I dont know what they python is telling me, except that scores are changed. If i disable score verification, performance in the patch is fine.

        Show
        rcmuir Robert Muir added a comment - We may have to get Michael McCandless to comment on that. I dont know what they python is telling me, except that scores are changed. If i disable score verification, performance in the patch is fine.
        Hide
        romseygeek Alan Woodward added a comment -

        I think it's scoring changes. The benchmark is getting the top ten hits, ranking them by score, merging any docs that have the same score into a group, and then counting the groups. What's happened here is that doc 3979685's score has increased (presumably because NSO is now finding an extra Span in that document that was being discarded by the eager shrink-to-smallest-fit algorithm before), and it has pushed doc 85504 out of the top 10. But 85504 was part of a group of three docs with identical scores, so the number of score groups has increased by one.

        I'm not sure what the point of doing the score-grouping is though? It seems a pretty arbitrary thing to be checking?

        Show
        romseygeek Alan Woodward added a comment - I think it's scoring changes. The benchmark is getting the top ten hits, ranking them by score, merging any docs that have the same score into a group, and then counting the groups. What's happened here is that doc 3979685's score has increased (presumably because NSO is now finding an extra Span in that document that was being discarded by the eager shrink-to-smallest-fit algorithm before), and it has pushed doc 85504 out of the top 10. But 85504 was part of a group of three docs with identical scores, so the number of score groups has increased by one. I'm not sure what the point of doing the score-grouping is though? It seems a pretty arbitrary thing to be checking?
        Hide
        rcmuir Robert Muir added a comment -

        Do no tests fail? Maybe it requires beasting for them to trip, but i am depressed if the benchmarks find bugs like this and our tests don't...

        Show
        rcmuir Robert Muir added a comment - Do no tests fail? Maybe it requires beasting for them to trip, but i am depressed if the benchmarks find bugs like this and our tests don't...
        Hide
        jpountz Adrien Grand added a comment -

        I think we only have a slight difference in the top hits: 3979685 enters the top hits with its new score of 1.077948 and moves 85504 out. If we had a difference in the number of hits, luceneutil would complain earlier with the following message: "wrong hit count: X vs Y".

        Show
        jpountz Adrien Grand added a comment - I think we only have a slight difference in the top hits: 3979685 enters the top hits with its new score of 1.077948 and moves 85504 out. If we had a difference in the number of hits, luceneutil would complain earlier with the following message: "wrong hit count: X vs Y".
        Hide
        romseygeek Alan Woodward added a comment -

        It looks as though there's an extra hit in there, on document 3979685.

        Show
        romseygeek Alan Woodward added a comment - It looks as though there's an extra hit in there, on document 3979685.
        Hide
        romseygeek Alan Woodward added a comment -

        It should return the same document matches, but it will return some extra Span hits within each document, so scores might be different. Is that benchmark result saying that there are unexpected document matches, or that the scores have changed?

        Show
        romseygeek Alan Woodward added a comment - It should return the same document matches, but it will return some extra Span hits within each document, so scores might be different. Is that benchmark result saying that there are unexpected document matches, or that the scores have changed?
        Hide
        rcmuir Robert Muir added a comment -

        Is it supposed to behave differently? I tried to benchmark the patch:

        RuntimeError: results differ: ([], ["query=spanNear([body:image_flag, body:image_seal], 10, true) filter=None sort=None groupField=None hitCount=36851: self=spanNear([body:image_flag, body:image_seal], 10, true): wrong collapsed hit count: 7 vs 8\n  [([1142287], '1.2319405'), ([9531898], '1.1640743'), ([9087082], '1.1405553'), ([9531891], '1.0431801'), ([2841408, 5764103], '1.0081179'), ([8760783], '0.99798584'), ([16470, 22367, 85504], '0.93125945')] vs [([1142287], '1.2319405'), ([9531898], '1.1640743'), ([9087082], '1.1405553'), ([3979685], '1.077948'), ([9531891], '1.0431801'), ([2841408, 5764103], '1.0081179'), ([8760783], '0.99798584'), ([16470, 22367], '0.93125945')]\n  [(1142287, '1.2319405'), (9531898, '1.1640743'), (9087082, '1.1405553'), (9531891, '1.0431801'), (2841408, '1.0081179'), (5764103, '1.0081179'), (8760783, '0.99798584'), (16470, '0.93125945'), (22367, '0.93125945'), (85504, '0.93125945')] vs [(1142287, '1.2319405'), (9531898, '1.1640743'), (9087082, '1.1405553'), (3979685, '1.077948'), (9531891, '1.0431801'), (2841408, '1.0081179'), (5764103, '1.0081179'), (8760783, '0.99798584'), (16470, '0.93125945'), (22367, '0.93125945')]"])
        
        Show
        rcmuir Robert Muir added a comment - Is it supposed to behave differently? I tried to benchmark the patch: RuntimeError: results differ: ([], ["query=spanNear([body:image_flag, body:image_seal], 10, true) filter=None sort=None groupField=None hitCount=36851: self=spanNear([body:image_flag, body:image_seal], 10, true): wrong collapsed hit count: 7 vs 8\n [([1142287], '1.2319405'), ([9531898], '1.1640743'), ([9087082], '1.1405553'), ([9531891], '1.0431801'), ([2841408, 5764103], '1.0081179'), ([8760783], '0.99798584'), ([16470, 22367, 85504], '0.93125945')] vs [([1142287], '1.2319405'), ([9531898], '1.1640743'), ([9087082], '1.1405553'), ([3979685], '1.077948'), ([9531891], '1.0431801'), ([2841408, 5764103], '1.0081179'), ([8760783], '0.99798584'), ([16470, 22367], '0.93125945')]\n [(1142287, '1.2319405'), (9531898, '1.1640743'), (9087082, '1.1405553'), (9531891, '1.0431801'), (2841408, '1.0081179'), (5764103, '1.0081179'), (8760783, '0.99798584'), (16470, '0.93125945'), (22367, '0.93125945'), (85504, '0.93125945')] vs [(1142287, '1.2319405'), (9531898, '1.1640743'), (9087082, '1.1405553'), (3979685, '1.077948'), (9531891, '1.0431801'), (2841408, '1.0081179'), (5764103, '1.0081179'), (8760783, '0.99798584'), (16470, '0.93125945'), (22367, '0.93125945')]"])
        Hide
        jpountz Adrien Grand added a comment -

        +1

        Show
        jpountz Adrien Grand added a comment - +1
        Hide
        romseygeek Alan Woodward added a comment -

        Patch moving NearSpansOrdered to a lazy algorithm. This makes span collection much simpler, and it will also slightly improve performance when scores are not required, as nextStartPosition() can return as soon as it finds a match, without having to shrink the spans.

        Show
        romseygeek Alan Woodward added a comment - Patch moving NearSpansOrdered to a lazy algorithm. This makes span collection much simpler, and it will also slightly improve performance when scores are not required, as nextStartPosition() can return as soon as it finds a match, without having to shrink the spans.
        Hide
        romseygeek Alan Woodward added a comment -

        NearSpansOrdered uses an eager algorithm to only return the shortest possible matches from a document. This means that its subspans are out of position when nextStartPosition() returns, making it difficult to collect any information about the subspans.

        Show
        romseygeek Alan Woodward added a comment - NearSpansOrdered uses an eager algorithm to only return the shortest possible matches from a document. This means that its subspans are out of position when nextStartPosition() returns, making it difficult to collect any information about the subspans.

          People

          • Assignee:
            romseygeek Alan Woodward
            Reporter:
            romseygeek Alan Woodward
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development