Lucene - Core
  1. Lucene - Core
  2. LUCENE-6801

PhraseQuery incorrectly advertises it supports terms at the same position

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The following in PhraseQuery has been here since Sept 15th 2004 (by "goller"):

          /**
           * Adds a term to the end of the query phrase.
           * The relative position of the term within the phrase is specified explicitly.
           * This allows e.g. phrases with more than one term at the same position
           * or phrases with gaps (e.g. in connection with stopwords).
           * 
           */
          public Builder add(Term term, int position) {
      

      Of course this isn't true; it's why we have MultiPhraseQuery. Yet we even allow you to have consecutive terms with the same positions. We shouldn't allow that; we should throw an exception. For my own sanity, I modified a simple MultiPhraseQuery test to use PhraseQuery instead and of course it didn't work.

      1. LUCENE_6801.patch
        11 kB
        David Smiley
      2. LUCENE_6801.patch
        6 kB
        David Smiley

        Activity

        Hide
        David Smiley added a comment -

        The following patch enhances the javadocs on PhraseQuery and MultiPhraseQuery. I also made trivial code refactorings in MultiPhraseQuery that my IDE pointed out (like using Java 5 for loop where possible or removing explicit primitive boxing/unboxing)

        Show
        David Smiley added a comment - The following patch enhances the javadocs on PhraseQuery and MultiPhraseQuery. I also made trivial code refactorings in MultiPhraseQuery that my IDE pointed out (like using Java 5 for loop where possible or removing explicit primitive boxing/unboxing)
        Hide
        Adrien Grand added a comment -

        +1 overall. I'm just bit worried about the following comment you added to getPositions():

        It's a copy of the underlying data.
        

        It might sound like it's ok to modify it since it will not mutate internal data of the phrase query?

        Show
        Adrien Grand added a comment - +1 overall. I'm just bit worried about the following comment you added to getPositions() : It's a copy of the underlying data. It might sound like it's ok to modify it since it will not mutate internal data of the phrase query?
        Hide
        David Smiley added a comment -

        Thanks for the review, Adrien!

        RE getPositions: Yes – it would be okay. I guess maybe we shouldn't advertise that fact?

        Show
        David Smiley added a comment - Thanks for the review, Adrien! RE getPositions: Yes – it would be okay. I guess maybe we shouldn't advertise that fact?
        Hide
        Adrien Grand added a comment -

        Right, this is what I meant. So maybe for this method we should just leave the javadocs untouched?

        Show
        Adrien Grand added a comment - Right, this is what I meant. So maybe for this method we should just leave the javadocs untouched?
        Hide
        David Smiley added a comment -

        Cool. I'll commit today.

        Show
        David Smiley added a comment - Cool. I'll commit today.
        Hide
        Adrien Grand added a comment -

        Oh, I think I just understood the intention of these javadocs: imagine you have a token stream that sometimes puts terms at the same position and you want to find all documents that have both terms A and B at the same position. You can do it with PhraseQuery. On the other hand, what MultiPhraseQuery allows you to do is to search for documents that have either A or B at a given position (relatively to some other terms). Then maybe the javadoc change should be to say that PhraseQuery will try to match all terms at a given position, and that for the synonym use-case MultiPhraseQuery should be used instead?

        Show
        Adrien Grand added a comment - Oh, I think I just understood the intention of these javadocs: imagine you have a token stream that sometimes puts terms at the same position and you want to find all documents that have both terms A and B at the same position. You can do it with PhraseQuery. On the other hand, what MultiPhraseQuery allows you to do is to search for documents that have either A or B at a given position (relatively to some other terms). Then maybe the javadoc change should be to say that PhraseQuery will try to match all terms at a given position, and that for the synonym use-case MultiPhraseQuery should be used instead?
        Hide
        David Smiley added a comment -

        Hmm; looks like you're right! Although it's only tested with respect to it's toString() in TestPhraseQuery.testToString(). I'll add a test of the functionality.

        Show
        David Smiley added a comment - Hmm; looks like you're right! Although it's only tested with respect to it's toString() in TestPhraseQuery.testToString(). I'll add a test of the functionality.
        Hide
        David Smiley added a comment -

        Here's an updated patch. I added a simple test, ported from one in MultiPhraseQuery. I adjusted the class javadocs of both query classes a bit, as well as their methods that add terms at specified positions.

        Show
        David Smiley added a comment - Here's an updated patch. I added a simple test, ported from one in MultiPhraseQuery. I adjusted the class javadocs of both query classes a bit, as well as their methods that add terms at specified positions.
        Hide
        David Smiley added a comment -

        I'll commit this in a few hours. It'll be nice to have these docs finally reflect what's going on here.

        My only lingering concern is that this "feature" of PhraseQuery seems so hidden (not documented till now) and weird (would anyone actually want this behavior?), that I wonder if it should be supported at all. If this feature is brought to light then it makes MultiPhraseQuery seem like a misnomer... PhraseQuery also supports multiple terms at the same position but simply doesn't have the same convenience method and has different semantics. Shrug. Any way, at least document and test the behavior than pretend it's no there.

        Show
        David Smiley added a comment - I'll commit this in a few hours. It'll be nice to have these docs finally reflect what's going on here. My only lingering concern is that this "feature" of PhraseQuery seems so hidden (not documented till now) and weird (would anyone actually want this behavior?), that I wonder if it should be supported at all. If this feature is brought to light then it makes MultiPhraseQuery seem like a misnomer... PhraseQuery also supports multiple terms at the same position but simply doesn't have the same convenience method and has different semantics. Shrug. Any way, at least document and test the behavior than pretend it's no there.
        Hide
        Jason Gerlowski added a comment -

        LGTM

        Show
        Jason Gerlowski added a comment - LGTM
        Hide
        ASF subversion and git services added a comment -

        Commit 1715908 from David Smiley in branch 'dev/trunk'
        [ https://svn.apache.org/r1715908 ]

        LUCENE-6801: Clarify javadocs of PhraseQuery, MultiPhraseQuery RE terms at same position. Add test.

        Show
        ASF subversion and git services added a comment - Commit 1715908 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1715908 ] LUCENE-6801 : Clarify javadocs of PhraseQuery, MultiPhraseQuery RE terms at same position. Add test.
        Hide
        ASF subversion and git services added a comment -

        Commit 1715911 from David Smiley in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1715911 ]

        LUCENE-6801: Clarify javadocs of PhraseQuery, MultiPhraseQuery RE terms at same position. Add test.

        Show
        ASF subversion and git services added a comment - Commit 1715911 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1715911 ] LUCENE-6801 : Clarify javadocs of PhraseQuery, MultiPhraseQuery RE terms at same position. Add test.
        Hide
        Adrien Grand added a comment -

        I'm a bit late to the party but the change looks good to me too. Thanks David.

        Show
        Adrien Grand added a comment - I'm a bit late to the party but the change looks good to me too. Thanks David.

          People

          • Assignee:
            David Smiley
            Reporter:
            David Smiley
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development