Lucene - Core / LUCENE-3371

Support for a "SpanAndQuery" / "SpanAllNearQuery"

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I would like to parse queries like this:

      a WITHIN 5 WORDS OF (b AND c)
      

      This would match when both a "b" span and a "c" span lie within 5 words of the same "a" span.

      As far as I can tell, the existing span query classes cannot express this, however they are combined. Replacing the AND with "WITHIN 10 OF" (the general rule is to double the first number) at least ensures that no hits are lost, but it returns too many.

      I'm not sure how the class would work, but it might be like this:

        Query q = new SpanAllNearQuery(a, new SpanQuery[] { b, c }, 5, false);
      

      The difference from SpanNearQuery is that SpanNearQuery considers the entire collection of terms as a single set to be found near each other, whereas this query would consider each of the terms in the array relative to the first.

        Activity

        paul.elschot@xs4all.nl Paul Elschot added a comment -

        This could be tested by comparing the results against:

        (a WITHIN 5 WORDS OF b) AND (a WITHIN 5 WORDS OF c)

        The above query is a boolean one, and does not provide a Spans.
        Would this SpanAllNearQuery provide a Spans?
        When it provides a Spans it can be nested inside other span queries.

        trejkaz Trejkaz added a comment -

        Yeah, it would provide a span much like a normal WITHIN would.

        The subtle difference from the above example is that a plain boolean does not enforce that the same "a" span is used in both cases. For instance, that plain AND query above would match "b a x x x x x a c", but "all within" would not.
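
The distinction can be sketched in plain Java without any Lucene classes. This is only a reference model under stated assumptions: every span is a single token, the query is unordered, and "within N" means at most N intervening tokens (modelled on the slop of SpanNearQuery). The class and method names (AllNearModel, pairwiseAnd, allNear) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class AllNearModel {

    /** Positions at which the given term occurs in the token array. */
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                result.add(i);
            }
        }
        return result;
    }

    /** True if some occurrence of term has at most slop tokens between it and position p (assumed slop meaning). */
    static boolean nearPosition(String[] tokens, int p, String term, int slop) {
        for (int q : positions(tokens, term)) {
            if (q != p && Math.abs(q - p) - 1 <= slop) {
                return true;
            }
        }
        return false;
    }

    /** Boolean form: each clause may be satisfied by a different occurrence of main. */
    static boolean pairwiseAnd(String[] tokens, String main, String[] others, int slop) {
        for (String other : others) {
            boolean found = false;
            for (int p : positions(tokens, main)) {
                if (nearPosition(tokens, p, other, slop)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    /** Proposed form: a single occurrence of main must be near every other term. */
    static boolean allNear(String[] tokens, String main, String[] others, int slop) {
        for (int p : positions(tokens, main)) {
            boolean all = true;
            for (String other : others) {
                if (!nearPosition(tokens, p, other, slop)) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] text = "b a x x x x x a c".split(" ");
        String[] others = {"b", "c"};
        System.out.println("pairwise AND: " + pairwiseAnd(text, "a", others, 5)); // true
        System.out.println("all near:     " + allNear(text, "a", others, 5));     // false
    }
}
```

In this model the first "a" is near "b" and the second "a" is near "c", so the pairwise boolean form matches, while no single "a" is near both.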

        paul.elschot@xs4all.nl Paul Elschot added a comment -

        So the main difference with the current unordered SpanNear(a, b, c) would be that when "b" and "c" are further apart, the single "a" should be in the middle.

        Is that enough difference to write program code for?

        trejkaz Trejkaz added a comment -

        That sounds about right.

        I had another thought connected to LUCENE-3370 - if SpanNotNearQuery or an equivalent existed, you could write it like this:

        new SpanNotQuery(
          new SpanTermQuery(new Term("text", "a")),
          new SpanOrQuery(
            new SpanNotNearQuery(new Term("text", "a"), new Term("text", "b"), 4, false),
            new SpanNotNearQuery(new Term("text", "a"), new Term("text", "c"), 4, false)
          )
        )
        

        Which is to say that once you remove all the instances which aren't near one of the other spans, you end up with the ones which are near all of them.

        trejkaz Trejkaz added a comment (edited) -

        To summarise my investigation into using SpanNotQuery with the new pre and post parameters: it doesn't work, and I can't immediately see why.

        My rewrite:

            @Override
            public SpanQuery rewrite(IndexReader reader) throws IOException
            {
                int nearQueriesCount = nearQueries.size();
                SpanQuery[] notNearClauses = new SpanQuery[nearQueriesCount];
                int pre = inOrder ? slop : 0;
                int post = slop;
                for (int i = 0; i < nearQueriesCount; i++)
                {
                    notNearClauses[i] = new SpanNotQuery(mainQuery, nearQueries.get(i), pre, post);
                }
                return new SpanNotQuery(mainQuery, new SpanOrQuery(notNearClauses));
            }
        

        i.e., for each query, create a "not near" clause, and then subtract the "not near" clauses from the main query clause to get the "near all" result.

        This logic is apparently wrong, because this query:

          mainQuery = SpanTerm("content", "a")
          nearQueries = [
            SpanTerm("content", "b"),
            SpanTerm("content", "c")
          ]
          slop = 2,
          inOrder = false
        

        Is expected to match this text:

        a x b c x x x a
        

        But instead, it does not match.
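
One possible reading, offered as an assumption and not a confirmed diagnosis: if pre and post in SpanNotQuery widen the include span by that many token positions, while slop in SpanNearQuery counts intervening positions, then a window of post = slop is one position narrower than the slop it is meant to mirror, and pre = 0 in the unordered case ignores everything before "a". The plain-Java model below encodes exactly that assumed behaviour for single-token spans. NotNearRewriteModel and its methods are hypothetical names, and no Lucene classes are used.

```java
import java.util.ArrayList;
import java.util.List;

public class NotNearRewriteModel {

    /** Positions at which the given term occurs in the token array. */
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                result.add(i);
            }
        }
        return result;
    }

    /**
     * Assumed model of SpanNotQuery(main, other, pre, post) for single-token spans:
     * keeps each position p of main that has no occurrence of other in [p - pre, p + post].
     */
    static List<Integer> notNear(String[] tokens, String main, String other, int pre, int post) {
        List<Integer> kept = new ArrayList<>();
        for (int p : positions(tokens, main)) {
            boolean excluded = false;
            for (int q : positions(tokens, other)) {
                if (q >= p - pre && q <= p + post) {
                    excluded = true;
                    break;
                }
            }
            if (!excluded) {
                kept.add(p);
            }
        }
        return kept;
    }

    /** Models the rewrite: positions of main minus every position that is "not near" some other term. */
    static List<Integer> allNear(String[] tokens, String main, String[] others, int pre, int post) {
        List<Integer> result = new ArrayList<>(positions(tokens, main));
        for (String other : others) {
            result.removeAll(notNear(tokens, main, other, pre, post));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] text = "a x b c x x x a".split(" ");
        String[] others = {"b", "c"};
        int slop = 2;
        // Parameters as posted for the unordered case: pre = 0, post = slop.
        System.out.println(allNear(text, "a", others, 0, slop));               // []
        // Hypothetical adjustment: widen the window by one position on both sides.
        System.out.println(allNear(text, "a", others, slop + 1, slop + 1));    // [0]
    }
}
```

Under these assumptions the posted parameters leave "c" at position 3 just outside the window around the first "a", so that "a" is subtracted and nothing matches. Whether the real SpanNotQuery behaves this way would need to be checked against its implementation.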


          People

          • Assignee:
            Unassigned
            Reporter:
            trejkaz Trejkaz