Lucene - Core / LUCENE-3120

span query matches too many docs when two query terms are the same unless inOrder=true

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff of the user list discussion "SpanNearQuery - inOrder parameter".

      With 3 documents:

      • "a b x c d"
      • "a b b d"
      • "a b x b y d"

      Here are a few queries (the number in parentheses is the expected number of hits):

      These ones work as expected:

      • (1) in-order, slop=0, "b", "x", "b"
      • (1) in-order, slop=0, "b", "b"
      • (2) in-order, slop=1, "b", "b"

      These ones match too many docs:

      • (1) any-order, slop=0, "b", "x", "b"
      • (1) any-order, slop=1, "b", "x", "b"
      • (1) any-order, slop=2, "b", "x", "b"
      • (1) any-order, slop=3, "b", "x", "b"

      These ones match too many docs as well:

      • (1) any-order, slop=0, "b", "b"
      • (2) any-order, slop=1, "b", "b"

      Each of the above passes when using a phrase query (applying the slop, no in-order indication in phrase query).
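
The expected counts listed above can be reproduced with a small stand-alone model. This is only a sketch, not Lucene code: it assumes that every query term must occupy its own position, and that slop is the number of unmatched positions inside the matched window. The names `doc_matches` and `expected_hits` are hypothetical.

```python
from itertools import product

# The three documents from the description.
DOCS = ["a b x c d", "a b b d", "a b x b y d"]


def doc_matches(doc, terms, slop, in_order):
    """Model only: every query term must sit at its own position."""
    tokens = doc.split()
    # Candidate positions for each query term, in query order.
    candidates = [[i for i, tok in enumerate(tokens) if tok == term]
                  for term in terms]
    for combo in product(*candidates):
        if len(set(combo)) != len(combo):
            continue  # one token may not serve two query terms
        if in_order and list(combo) != sorted(combo):
            continue  # positions must follow the query order
        # Unmatched positions inside the window count against the slop.
        gaps = (max(combo) - min(combo) + 1) - len(combo)
        if gaps <= slop:
            return True
    return False


def expected_hits(terms, slop, in_order):
    """Number of documents in DOCS that match under the model."""
    return sum(doc_matches(doc, terms, slop, in_order) for doc in DOCS)


print(expected_hits(["b", "x", "b"], 0, in_order=True))   # 1
print(expected_hits(["b", "b"], 1, in_order=True))        # 2
print(expected_hits(["b", "x", "b"], 3, in_order=False))  # 1
print(expected_hits(["b", "b"], 1, in_order=False))       # 2
```

Under these assumptions the any-order counts equal the in-order ones for these documents, which is the behavior the issue asks for.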

      This seems related to a known overlapping spans issue, "non-overlapping Span queries", as indicated by Hoss, so we might decide to close this bug after all, but I would like to at least have the JUnit test that exposes the behavior in JIRA.

      1. LUCENE-3120.patch
        4 kB
        Doron Cohen
      2. LUCENE-3120.patch
        4 kB
        Doron Cohen

        Activity

        Doron Cohen added a comment -

        Attached test case demonstrating the bug.

        Greg Tarr added a comment -

        Thanks for raising this.

        Doron Cohen added a comment -

        Updated patch with fixed test to not depend on analysis module.

        Hoss Man added a comment -

        A comment I made on the mailing list regarding this topic:

        The crux of the issue (as I recall) is that there is really no conceptual reason why a query for "'john' near 'john', in any order, with slop of Z" shouldn't match a doc that contains only one instance of "john". The first SpanTermQuery says "I found a match at position X", the second SpanTermQuery says "I found a match at position Y", and the SpanNearQuery says "the difference between X and Y is less than Z", therefore I have a match. (The SpanNearQuery can't fail just because X and Y are the same: they might be two distinct term instances, with different payloads perhaps, that just happen to have the same position.)

        However, the inOrder==true case works because the SpanNearQuery enforces that "X must be less than Y", so the same term can't ever match twice.
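
The reasoning in this comment can be illustrated with a toy model of two independent position streams. It is a sketch under assumptions, not Lucene's implementation: the helper names are hypothetical, and the slop test (`gap - 1 <= slop`) is a simplification of the check described above.

```python
def term_positions(tokens, term):
    """Positions at which `term` occurs (stands in for one SpanTermQuery)."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def near_any_order(tokens, first, second, slop):
    # Each sub-query reports its positions independently,
    # so nothing prevents X and Y from being the same position.
    xs = term_positions(tokens, first)
    ys = term_positions(tokens, second)
    return any(abs(x - y) - 1 <= slop for x in xs for y in ys)


def near_in_order(tokens, first, second, slop):
    # Requiring X < Y means one occurrence can never be used twice.
    xs = term_positions(tokens, first)
    ys = term_positions(tokens, second)
    return any(x < y and (y - x) - 1 <= slop for x in xs for y in ys)


doc = "john smith".split()  # contains a single "john"
print(near_any_order(doc, "john", "john", 0))  # True
print(near_in_order(doc, "john", "john", 0))   # False
```

In the any-order case the single "john" is counted for both clauses (X == Y), which is the over-matching reported in the description.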

        Hoss Man added a comment -

        What we might want to consider is a new option on SpanNearQuery that would mandate that the spans not overlap.

        Paul Elschot once described the general form of this idea as a numeric option specifying a minimum distance between the subspans (so the default, as implemented today, would be minPositionDistance=1 for inOrder==true and minPositionDistance=0 for inOrder==false).
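
A rough sketch of how such an option might behave for two term clauses follows. The `min_position_distance` parameter mirrors the minPositionDistance idea from this comment; it is a proposal, not an existing SpanNearQuery option, and the function `near` with its slop test is a hypothetical model rather than Lucene code.

```python
def near(tokens, first, second, slop, in_order, min_position_distance=None):
    """Toy two-clause 'near' check with a minimum distance between clauses."""
    if min_position_distance is None:
        # Defaults that mimic the behavior described in the comment.
        min_position_distance = 1 if in_order else 0
    xs = [i for i, tok in enumerate(tokens) if tok == first]
    ys = [i for i, tok in enumerate(tokens) if tok == second]
    for x in xs:
        for y in ys:
            # Signed distance when order matters, absolute otherwise.
            distance = (y - x) if in_order else abs(y - x)
            if distance >= min_position_distance and distance - 1 <= slop:
                return True
    return False


one_john = "john smith".split()
two_johns = "john john".split()

# Default any-order behavior: the single "john" matches itself.
print(near(one_john, "john", "john", 0, in_order=False))   # True
# Any-order with a minimum distance of 1: overlap is rejected.
print(near(one_john, "john", "john", 0, in_order=False,
           min_position_distance=1))                        # False
# Two distinct occurrences still match.
print(near(two_johns, "john", "john", 0, in_order=False,
           min_position_distance=1))                        # True
```

With a single numeric option, callers could request non-overlapping matches in the any-order case without changing the existing defaults.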

        Robert Muir added a comment -

        bulk move 3.2 -> 3.3

        Hoss Man added a comment -

        Bulk changing fixVersion 3.6 to 4.0 for any open issues that are unassigned and have not been updated since March 19.

        Email spam suppressed for this bulk edit; search for hoss20120323nofix36 to identify all issues edited

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Uwe Schindler added a comment -

        Move issue to Lucene 4.9.


          People

          • Assignee:
            Unassigned
            Reporter:
            Doron Cohen
          • Votes:
            0
            Watchers:
            1
