Lucene - Core
LUCENE-3120

span query matches too many docs when two query terms are the same unless inOrder=true

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, Trunk
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff of the user list discussion "SpanNearQuery - inOrder parameter".

      With 3 documents:

      • "a b x c d"
      • "a b b d"
      • "a b x b y d"

      Here are a few queries (the number in parentheses indicates the expected number of hits):

      These ones work as expected:

      • (1) in-order, slop=0, "b", "x", "b"
      • (1) in-order, slop=0, "b", "b"
      • (2) in-order, slop=1, "b", "b"

      These ones match too many hits:

      • (1) any-order, slop=0, "b", "x", "b"
      • (1) any-order, slop=1, "b", "x", "b"
      • (1) any-order, slop=2, "b", "x", "b"
      • (1) any-order, slop=3, "b", "x", "b"

      These ones match too many hits as well:

      • (1) any-order, slop=0, "b", "b"
      • (2) any-order, slop=1, "b", "b"

      Each of the above passes when using a phrase query (applying the slop, no in-order indication in phrase query).

      This seems related to a known overlapping spans issue - non-overlapping Span queries - as indicated by Hoss, so we might decide to close this bug after all, but I would like to at least have the junit that exposes the behavior in JIRA.

      1. LUCENE-3120.patch
        4 kB
        Doron Cohen
      2. LUCENE-3120.patch
        4 kB
        Doron Cohen
      3. LUCENE-3120.patch
        0.9 kB
        Steve Davids

          Activity

          Doron Cohen added a comment -

          Attached test case demonstrating the bug.

          Greg Tarr added a comment -

          Thanks for raising this.

          Doron Cohen added a comment -

          Updated patch with fixed test to not depend on analysis module.

          Hoss Man added a comment -

          Comment I made on the mailing list regarding this topic:

          The crux of the issue (as I recall) is that there is really no conceptual reason why a query for "'john' near 'john', in any order, with slop of Z" shouldn't match a doc that contains only one instance of "john". The first SpanTermQuery says "I found a match at position X", the second SpanTermQuery says "I found a match at position Y", and the SpanNearQuery says "the difference between X and Y is less than Z", therefore I have a match. (The SpanNearQuery can't fail just because X and Y are the same: they might be two distinct term instances, with different payloads perhaps, that just happen to have the same position.)

          However, the inOrder==true case works because the SpanNearQuery enforces that "X must be less than Y", so the same term can't ever match twice.
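
The reasoning above can be sketched as a small standalone Java model, applied to the three documents and the two-term "b", "b" queries from the description. This is a simplified illustration only, not Lucene's span implementation: the class `NearModel`, its methods, and the gap-based slop computation are assumptions made for this sketch, and it handles single-token terms and two clauses only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of a two-clause "near" check; not Lucene code.
class NearModel {
    // Positions at which `term` occurs in a space-separated document.
    static List<Integer> positions(String doc, String term) {
        List<Integer> out = new ArrayList<>();
        String[] tokens = doc.split(" ");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) out.add(i);
        }
        return out;
    }

    // True if some position X of t1 and some position Y of t2 are "near".
    // Assumption: slop is compared against the number of positions strictly
    // between X and Y.
    static boolean matches(String doc, String t1, String t2, int slop, boolean inOrder) {
        for (int x : positions(doc, t1)) {
            for (int y : positions(doc, t2)) {
                if (inOrder) {
                    // Ordered: X must be strictly less than Y, so one
                    // occurrence can never satisfy both clauses.
                    if (x < y && y - x - 1 <= slop) return true;
                } else {
                    // Unordered: X == Y is not rejected, so a single
                    // occurrence satisfies both clauses.
                    int gap = Math.max(0, Math.abs(x - y) - 1);
                    if (gap <= slop) return true;
                }
            }
        }
        return false;
    }

    // Number of documents that match.
    static int hits(String[] docs, String t1, String t2, int slop, boolean inOrder) {
        int n = 0;
        for (String d : docs) {
            if (matches(d, t1, t2, slop, inOrder)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String[] docs = {"a b x c d", "a b b d", "a b x b y d"};
        System.out.println(hits(docs, "b", "b", 0, true));   // 1, as expected
        System.out.println(hits(docs, "b", "b", 1, true));   // 2, as expected
        System.out.println(hits(docs, "b", "b", 0, false));  // 3, expected 1
        System.out.println(hits(docs, "b", "b", 1, false));  // 3, expected 2
    }
}
```

In this model the unordered case counts every document that contains at least one "b", which is one way to read the "too many hits" reported in the description.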

          Hoss Man added a comment -

          What we might want to consider is a new option on SpanNearQuery that would mandate that the spans not overlap.

          Paul Elschot once described the general form of this idea as a numeric option to specify a minimum distance between the subspans (so the default, as implemented today, for inOrder==true would be minPositionDistance=1, and the default for inOrder==false would be minPositionDistance=0).
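
A minimal sketch of how such a minimum-distance option could behave, written as plain Java rather than against the Lucene API. No such option exists in SpanNearQuery as far as this ticket shows: the class `MinDistanceModel`, the `near` method and its signature are hypothetical, and only the parameter name `minPositionDistance` comes from the comment above. Slop is again assumed to bound the number of positions strictly between the two subspans.

```java
// Hypothetical model of a minimum-distance option; not a Lucene API.
class MinDistanceModel {
    // xs and ys are the match positions of the two single-token clauses.
    static boolean near(int[] xs, int[] ys, int slop,
                        int minPositionDistance, boolean inOrder) {
        for (int x : xs) {
            for (int y : ys) {
                // Signed distance when ordered, absolute distance otherwise.
                int distance = inOrder ? y - x : Math.abs(y - x);
                // Reject pairs that are closer than the required minimum.
                // With minPositionDistance=1 this excludes x == y.
                if (distance < minPositionDistance) continue;
                int gap = Math.max(0, distance - 1);
                if (gap <= slop) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] oneJohn = {3};       // a doc with a single "john"
        int[] twoJohns = {3, 5};   // a doc with two "john" occurrences

        // Today's unordered default (minPositionDistance=0): one "john" matches.
        System.out.println(near(oneJohn, oneJohn, 2, 0, false));   // true
        // Requiring a distance of at least 1 rejects the single occurrence...
        System.out.println(near(oneJohn, oneJohn, 2, 1, false));   // false
        // ...while two distinct occurrences still match.
        System.out.println(near(twoJohns, twoJohns, 2, 1, false)); // true
    }
}
```

Under these assumptions, setting minPositionDistance=1 for an unordered query would give the non-overlapping behavior without forcing the clauses to appear in order.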

          Robert Muir added a comment -

          bulk move 3.2 -> 3.3

          Hoss Man added a comment -

          Bulk changing fixVersion 3.6 to 4.0 for any open issues that are unassigned and have not been updated since March 19.

          Email spam suppressed for this bulk edit; search for hoss20120323nofix36 to identify all issues edited

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Steve Davids added a comment -

          A user came across this "odd" behavior. Attached is a simple test case, written before I came across this ticket, that demonstrates the discrepancy.


            People

            • Assignee: Unassigned
            • Reporter: Doron Cohen
            • Votes: 0
            • Watchers: 3
