Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      PhraseQuery behaves quite inconsistently when the position of the first term is greater than 0. Here is an example:

          Directory dir = newDirectory();
          RandomIndexWriter iw = new RandomIndexWriter(random(), dir);
          FieldType customType = new FieldType(TextField.TYPE_NOT_STORED);
          customType.setOmitNorms(true);
          Field f = new Field("body", "", customType);
          Document doc = new Document();
          doc.add(f);
          f.setStringValue("one quick fox");
          iw.addDocument(doc);
          IndexReader ir = iw.getReader();
          iw.close();
          IndexSearcher is = newSearcher(ir);
          
          PhraseQuery pq = new PhraseQuery();
          pq.add(new Term("body", "quick"), 0);
          pq.add(new Term("body", "fox"), 1);
          System.out.println(is.search(pq, 1).totalHits); // 1
      
          pq = new PhraseQuery();
          pq.add(new Term("body", "quick"), 10);
          pq.add(new Term("body", "fox"), 11);
          System.out.println(is.search(pq, 1).totalHits); // 0
          
          pq = new PhraseQuery();
          pq.add(new Term("body", "quick"), 10);
          System.out.println(is.search(pq, 1).totalHits); // 1
          
          pq = new PhraseQuery();
          pq.add(new Term("body", "quick"), 10);
          pq.add(new Term("body", "fox"), 11);
          pq.setSlop(1);
          System.out.println(is.search(pq, 1).totalHits); // 1
          
          ir.close();
          dir.close();
      

      The reason is that when you add a term with position P on a PhraseQuery, ExactPhraseScorer ignores all positions for this term which are less than P.

      But this is inconsistent:

      • if you have a single term, it does not work anymore since we rewrite to a term query regardless of the position of the term (3rd query)
      • if you increase the slop, we will use SloppyPhraseScorer which does not have this behaviour. (4th query)
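
The behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Lucene's ExactPhraseScorer: the class `ExactPhraseToyModel`, the method `matches`, and the `ignoreSmaller` flag are hypothetical names introduced for illustration. It shows why "quick fox" at query positions 10 and 11 finds nothing in "one quick fox" when occurrences below the query position are skipped, yet would match if the positions were treated as purely relative.

```java
import java.util.Arrays;

public class ExactPhraseToyModel {

    /**
     * Toy exact-phrase check (hypothetical, not Lucene code).
     * docPositions[i] holds the sorted positions of query term i in one document;
     * queryPositions[i] is the position given to the query for that term.
     * When ignoreSmaller is true, occurrences of the first term that lie below
     * its query position are skipped, mimicking the behaviour described in this issue.
     */
    public static boolean matches(int[][] docPositions, int[] queryPositions, boolean ignoreSmaller) {
        for (int first : docPositions[0]) {
            if (ignoreSmaller && first < queryPositions[0]) {
                continue; // occurrence is "before" the requested position: ignored
            }
            int start = first - queryPositions[0]; // where the phrase would begin
            boolean all = true;
            for (int i = 1; i < queryPositions.length && all; i++) {
                int wanted = start + queryPositions[i];
                all = Arrays.binarySearch(docPositions[i], wanted) >= 0;
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "one quick fox": quick is at position 1, fox at position 2.
        int[][] doc = {{1}, {2}};
        System.out.println(matches(doc, new int[] {0, 1}, true));    // true
        System.out.println(matches(doc, new int[] {10, 11}, true));  // false
        System.out.println(matches(doc, new int[] {10, 11}, false)); // true
    }
}
```

With `ignoreSmaller` set to false the positions act as relative offsets only, which corresponds to the first of the two options listed below.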

      So I think we have two options:

      • either remove this behaviour and make the positions that are provided to PhraseQuery only relative (i.e. fix ExactPhraseScorer)
      • or make it work this way across the board (which means not rewriting to a term query when the position is not 0 and fixing SloppyPhraseScorer).

        Activity

        Adrien Grand added a comment -

        I am not even sure what the behaviour should be for sloppy phrases if we decide on the second option. And I'm concerned it might make the implementation more complicated and/or slower.

        Michael McCandless added a comment -

        I think exact PhraseQuery shouldn't support this 'leading wildcards' case? Throw an exc if the user tries to do that?

        Robert Muir added a comment -

        Can we avoid throwing an exception to the user?

        I don't think it's their fault if they type "the query", and the search engine has a stopword filter in the chain. It will confuse them: they don't get an error with "query the".
        I mean, it's still possible to throw it if we really want from the query side, but it just makes query parsers more complicated, because any sane parser will want to avoid this explicitly. I really don't think it's the right response, and I think it's rare enough that people will see that response as a bug.

        Adrien Grand added a comment -

        Here is a middle ground proposal:

        • enforce that terms are added in order of positions
        • enforce that positions are all positive
        • PhraseQuery still accepts that the first position is greater than 0 but PhraseWeight does not
        • PhraseQuery.rewrite takes care of rebasing positions if the first one is not 0

        This way, PhraseQuery would still be friendly to query parsers that create phrase queries from a token stream.
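
The validation and rebasing steps of this proposal could be sketched roughly as follows. This is a standalone illustration, not the code committed for this issue: the class `PhrasePositionSketch` and the method `rebase` are hypothetical names. It assumes "positive" means non-negative (since position 0 must remain valid) and that two terms may share a position, neither of which is stated explicitly in the proposal.

```java
import java.util.Arrays;

public class PhrasePositionSketch {

    /**
     * Hypothetical helper: checks that positions are non-negative and
     * non-decreasing, then shifts them so that the first position is 0.
     * Relative distances between terms are preserved.
     */
    public static int[] rebase(int[] positions) {
        for (int i = 0; i < positions.length; i++) {
            if (positions[i] < 0) {
                throw new IllegalArgumentException("Positions must be >= 0, got " + positions[i]);
            }
            if (i > 0 && positions[i] < positions[i - 1]) {
                throw new IllegalArgumentException("Positions must be added in order");
            }
        }
        if (positions.length == 0 || positions[0] == 0) {
            return positions.clone(); // already based at 0, nothing to shift
        }
        int first = positions[0];
        int[] rebased = new int[positions.length];
        for (int i = 0; i < positions.length; i++) {
            rebased[i] = positions[i] - first;
        }
        return rebased;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(rebase(new int[] {10, 11}))); // [0, 1]
        System.out.println(Arrays.toString(rebase(new int[] {0, 2})));   // [0, 2]
    }
}
```

Under this sketch, a query built with positions 10 and 11 is treated the same as one built with positions 0 and 1, so a parser that feeds token-stream positions straight into the query does not need to special-case a removed leading stopword.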

        Robert Muir added a comment -

        +1

        ASF subversion and git services added a comment -

        Commit 1660910 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1660910 ]

        LUCENE-6255: Fix PhraseQuery inconsistencies.

        ASF subversion and git services added a comment -

        Commit 1660915 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1660915 ]

        LUCENE-6255: Fix PhraseQuery inconsistencies.

        Timothy Potter added a comment -

        Bulk close after 5.1 release


          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0
            Watchers:
            4
