Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3
    • Component/s: None
    • Labels: None
    • Lucene Fields: New
    Attachments
    1. LUCENE-6537.patch (11 kB) by Paul Elschot
    2. LUCENE-6537.patch (9 kB) by Alan Woodward

      Activity

      Alan Woodward added a comment -

      NearSpansOrdered uses an eager algorithm to return only the shortest possible matches from a document. This means that its subspans are out of position when nextStartPosition() returns, making it difficult to collect any information about the subspans.

      Alan Woodward added a comment -

      Patch moving NearSpansOrdered to a lazy algorithm. This makes span collection much simpler, and it will also slightly improve performance when scores are not required, as nextStartPosition() can return as soon as it finds a match, without having to shrink the spans.
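
      To sketch the difference (toy Java, not the actual NearSpansOrdered code; lazyMatches and eagerMatches are made-up helpers over sorted term positions for a two-clause ordered query):

      import java.util.ArrayList;
      import java.util.List;

      public class OrderedNearSketch {

        // Lazy: for each occurrence of the first term, report the first
        // in-order occurrence of the second term if it is within slop.
        static List<int[]> lazyMatches(int[] first, int[] second, int slop) {
          List<int[]> matches = new ArrayList<>();
          int j = 0;
          for (int start : first) {
            while (j < second.length && second[j] <= start) {
              j++;                        // second term must follow the first
            }
            if (j == second.length) {
              break;                      // no candidates left
            }
            int gap = second[j] - start - 1;
            if (gap <= slop) {
              matches.add(new int[] { start, second[j] });
            }
          }
          return matches;
        }

        // Eager: shrink each candidate to the smallest window before
        // reporting it, discarding the wider overlapping matches.
        static List<int[]> eagerMatches(int[] first, int[] second, int slop) {
          List<int[]> matches = new ArrayList<>();
          int i = 0, j = 0;
          while (i < first.length && j < second.length) {
            while (j < second.length && second[j] <= first[i]) {
              j++;
            }
            if (j == second.length) {
              break;
            }
            while (i + 1 < first.length && first[i + 1] < second[j]) {
              i++;                        // shrink to the closest first-term position
            }
            int gap = second[j] - first[i] - 1;
            if (gap <= slop) {
              matches.add(new int[] { first[i], second[j] });
            }
            i++;
            j++;
          }
          return matches;
        }

        public static void main(String[] args) {
          // document "w1 w1 w2": w1 at positions {0, 1}, w2 at {2}, slop 1
          System.out.println(lazyMatches(new int[] {0, 1}, new int[] {2}, 1).size());  // prints 2
          System.out.println(eagerMatches(new int[] {0, 1}, new int[] {2}, 1).size()); // prints 1
        }
      }

      The lazy version can return a match as soon as it finds one, which is where the speedup for score-free iteration comes from; the eager version has to keep advancing subspans to shrink the window before it can return anything.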

      Adrien Grand added a comment -

      +1

      Robert Muir added a comment -

      Is it supposed to behave differently? I tried to benchmark the patch:

      RuntimeError: results differ: ([], ["query=spanNear([body:image_flag, body:image_seal], 10, true) filter=None sort=None groupField=None hitCount=36851: self=spanNear([body:image_flag, body:image_seal], 10, true): wrong collapsed hit count: 7 vs 8\n  [([1142287], '1.2319405'), ([9531898], '1.1640743'), ([9087082], '1.1405553'), ([9531891], '1.0431801'), ([2841408, 5764103], '1.0081179'), ([8760783], '0.99798584'), ([16470, 22367, 85504], '0.93125945')] vs [([1142287], '1.2319405'), ([9531898], '1.1640743'), ([9087082], '1.1405553'), ([3979685], '1.077948'), ([9531891], '1.0431801'), ([2841408, 5764103], '1.0081179'), ([8760783], '0.99798584'), ([16470, 22367], '0.93125945')]\n  [(1142287, '1.2319405'), (9531898, '1.1640743'), (9087082, '1.1405553'), (9531891, '1.0431801'), (2841408, '1.0081179'), (5764103, '1.0081179'), (8760783, '0.99798584'), (16470, '0.93125945'), (22367, '0.93125945'), (85504, '0.93125945')] vs [(1142287, '1.2319405'), (9531898, '1.1640743'), (9087082, '1.1405553'), (3979685, '1.077948'), (9531891, '1.0431801'), (2841408, '1.0081179'), (5764103, '1.0081179'), (8760783, '0.99798584'), (16470, '0.93125945'), (22367, '0.93125945')]"])
      
      Alan Woodward added a comment -

      It should return the same document matches, but it will return some extra Span hits within each document, so scores might be different. Is that benchmark result saying that there are unexpected document matches, or that the scores have changed?

      Alan Woodward added a comment -

      It looks as though there's an extra hit in there, on document 3979685.

      Adrien Grand added a comment -

      I think we only have a slight difference in the top hits: 3979685 enters the top hits with its new score of 1.077948 and moves 85504 out. If we had a difference in the number of hits, luceneutil would complain earlier with the following message: "wrong hit count: X vs Y".

      Robert Muir added a comment -

      Do no tests fail? Maybe it requires beasting for them to trip, but I am depressed if the benchmarks find bugs like this and our tests don't...

      Alan Woodward added a comment -

      I think it's scoring changes. The benchmark is getting the top ten hits, ranking them by score, merging any docs that have the same score into a group, and then counting the groups. What's happened here is that doc 3979685's score has increased (presumably because NSO is now finding an extra Span in that document that was being discarded by the eager shrink-to-smallest-fit algorithm before), and it has pushed doc 85504 out of the top 10. But 85504 was part of a group of three docs with identical scores, so the number of score groups has increased by one.

      I'm not sure what the point of doing the score-grouping is though? It seems a pretty arbitrary thing to be checking?
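
      The grouping described above amounts to something like this (a hypothetical Java sketch; the real check is luceneutil's Python, and the class and method names here are made up):

      public class CollapsedHitCount {
        // Count groups of equal scores among the top hits, in rank order.
        static int countScoreGroups(float[] scoresInRankOrder) {
          int groups = 0;
          for (int i = 0; i < scoresInRankOrder.length; i++) {
            if (i == 0 || scoresInRankOrder[i] != scoresInRankOrder[i - 1]) {
              groups++;  // a new group starts whenever the score changes
            }
          }
          return groups;
        }

        public static void main(String[] args) {
          // top-10 scores from Robert's output, baseline vs patched
          float[] base = { 1.2319405f, 1.1640743f, 1.1405553f, 1.0431801f, 1.0081179f,
                           1.0081179f, 0.99798584f, 0.93125945f, 0.93125945f, 0.93125945f };
          float[] patched = { 1.2319405f, 1.1640743f, 1.1405553f, 1.077948f, 1.0431801f,
                              1.0081179f, 1.0081179f, 0.99798584f, 0.93125945f, 0.93125945f };
          System.out.println(countScoreGroups(base));    // 7
          System.out.println(countScoreGroups(patched)); // 8
        }
      }

      On the scores above this gives 7 before the patch (the three 0.93125945 docs collapse into one group) and 8 after (doc 3979685's new 1.077948 score adds a group and pushes one 0.93125945 doc out of the top 10), matching "wrong collapsed hit count: 7 vs 8".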

      Robert Muir added a comment -

      We may have to get Michael McCandless to comment on that. I don't know what the Python is telling me, except that scores are changed. If I disable score verification, performance in the patch is fine.

      Robert Muir added a comment -
                           Task    QPS base    StdDev    QPS patch    StdDev      Pct diff
                    MedSpanNear       78.96      (2.4%)       81.41      (2.8%)    3.1% (  -2% -    8%)
                    LowSpanNear      251.16      (3.8%)      264.27      (4.4%)    5.2% (  -2% -   14%)
                   HighSpanNear       10.13      (4.2%)       10.68      (5.2%)    5.4% (  -3% -   15%)
      
      Paul Elschot added a comment -

      Lots of code lines deleted, and scoring behaviour changed a little as expected. LGTM.
      Do I understand correctly that with this patch the span queries are also faster?

      Paul Elschot added a comment -

      Do no tests fail? Maybe it requires beasting for them to trip, but I am depressed if the benchmarks find bugs like this and our tests don't...

      On this document:
      w1 w1 w2
      an ordered span near query with slop 1 should provide one span (w1 w2) without the patch here,
      and two spans (w1 .. w2) and (w1 w2) with the patch.
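
      For illustration, that query in the 5.x API would look something like this (a sketch; the field name "body" is assumed):

      import org.apache.lucene.index.Term;
      import org.apache.lucene.search.spans.SpanNearQuery;
      import org.apache.lucene.search.spans.SpanQuery;
      import org.apache.lucene.search.spans.SpanTermQuery;

      // slop 1, in order: w1 must precede w2 with at most one position between
      SpanQuery q = new SpanNearQuery(
          new SpanQuery[] {
              new SpanTermQuery(new Term("body", "w1")),
              new SpanTermQuery(new Term("body", "w2")) },
          1, true);
      // against "w1 w1 w2" (span ends are exclusive):
      //   without the patch: one span  (1,3)           -- the minimal match
      //   with the patch:    two spans (0,3) and (1,3)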

      Michael McCandless added a comment -

      If it's expected that the scores change, then since we don't test for that but luceneutil does catch it, we can ignore it when benching (I think there is a per-competition option to do so).

      I'm not sure what the point of doing the score-grouping is though? It seems a pretty arbitrary thing to be checking?

      I think the idea here was not to fail if the docIDs differ in sort order when they have identical scores, maybe ...

      Alan Woodward added a comment -

      Thanks Mike. So I think this is good to go then?

      Paul Elschot added a comment -

      Same patch with two extra test methods showing that repeated matches occur only when the first term repeats and there is enough slop:
      For ordered span near t1 t2 with slop 1:
      t1 t1 t2 matches twice,
      t1 t2 t2 matches once.
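
      Run through the toy lazyMatches sketch from earlier in this thread (0-based positions; a made-up helper, not the committed test code), the two cases come out as described:

      // t1 t1 t2: t1 at {0, 1}, t2 at {2}
      lazyMatches(new int[] {0, 1}, new int[] {2}, 1);   // two matches: (0,2) and (1,2)
      // t1 t2 t2: t1 at {0}, t2 at {1, 2}
      lazyMatches(new int[] {0}, new int[] {1, 2}, 1);   // one match: (0,1); the second t2
                                                         // never pairs with a new t1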

      ASF subversion and git services added a comment -

      Commit 1684600 from Alan Woodward in branch 'dev/trunk'
      [ https://svn.apache.org/r1684600 ]

      LUCENE-6537: NearSpansOrdered should use lazy iteration

      Alan Woodward added a comment -

      I started back-porting this to 5x, but I think it makes more sense to backport LUCENE-6490 and LUCENE-6371 first.

      ASF subversion and git services added a comment -

      Commit 1685517 from Alan Woodward in branch 'dev/branches/branch_5x'
      [ https://svn.apache.org/r1685517 ]

      LUCENE-6537: NearSpansOrdered should be lazy

      Alan Woodward added a comment -

      Turned out to be simpler to do this separately.

      Shalin Shekhar Mangar added a comment -

      Bulk close for 5.3.0 release


  People

  • Assignee: Alan Woodward
  • Reporter: Alan Woodward
  • Votes: 0
  • Watchers: 7
