Lucene - Core
  1. Lucene - Core
  2. LUCENE-1095

StopFilter should have option to incr positionIncrement after stop word

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      I've seen this come up on the mailing list a few times in the last month, so i'm filing a known bug/improvement arround it...

      StopFilter should have an option that if set, records how many stop words are "skipped" in a row, and then sets that value as the positionIncrement on the "next" token that StopFilter does return.

      1. lucene-1095-pos-incr.patch
        23 kB
        Doron Cohen
      2. lucene-1095-pos-incr.patch
        25 kB
        Doron Cohen
      3. lucene-1095-pos-incr.patch
        22 kB
        Doron Cohen

        Issue Links

          Activity

          Hide
          Steve Rowe added a comment -

          From <http://mail-archives.apache.org/mod_mbox/lucene-java-user/200510.mbox/%3c435D44FA.6050702@syr.edu%3e>:

          A patch implementing this approach was actually applied to
          StopFilter.java in late 2003, but was reverted shortly afterward,
          because this approach conflicts with the QueryParser and PhraseQuery
          implementations.

          See Doug Cutting's description of the problem with the position
          increment modification approach here:
          <http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200312.mbox/%3c3FCFB3CA.9000103@lucene.com%3e>

          See a colored diff of StopFilter.java, just before and after the
          position increment modification patch was reverted, here:
          <http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/analysis/StopFilter.java?rev=150152&r1=150150&r2=150152&diff_format=h>

          Show
          Steve Rowe added a comment - From < http://mail-archives.apache.org/mod_mbox/lucene-java-user/200510.mbox/%3c435D44FA.6050702@syr.edu%3e >: A patch implementing this approach was actually applied to StopFilter.java in late 2003, but was reverted shortly afterward, because this approach conflicts with the QueryParser and PhraseQuery implementations. See Doug Cutting's description of the problem with the position increment modification approach here: < http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200312.mbox/%3c3FCFB3CA.9000103@lucene.com%3e > See a colored diff of StopFilter.java, just before and after the position increment modification patch was reverted, here: < http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/analysis/StopFilter.java?rev=150152&r1=150150&r2=150152&diff_format=h >
          Hide
          Doron Cohen added a comment -

          The problem Doug refers to is the effect of this on a Phrase Query.
          I believe that if same analyzer with same stop filter and same option enables is used at search, the PhrasePositions would reflect the same increments due to stop word removals, and so phrase search should work.
          I didn't check this though.
          To be also verified for Span Queries.

          Show
          Doron Cohen added a comment - The problem Doug refers to is the effect of this on a Phrase Query. I believe that if same analyzer with same stop filter and same option enables is used at search, the PhrasePositions would reflect the same increments due to stop word removals, and so phrase search should work. I didn't check this though. To be also verified for Span Queries.
          Hide
          Erik Hatcher added a comment -

          I believe QueryParser has been fixed since that first change I made mentioned by Steven to account for positions returned from an Analyzer. So maybe all is well with fixing StopFilter now. Unit tests needed

          Show
          Erik Hatcher added a comment - I believe QueryParser has been fixed since that first change I made mentioned by Steven to account for positions returned from an Analyzer. So maybe all is well with fixing StopFilter now. Unit tests needed
          Hide
          Hoss Man added a comment -

          I knew there must have been a good reason why it hadn't been done before, but there are some distinctions between then and now:

          1) then the position incrementing wasn't optional, i wouldn't suggest making that some change, just making it easier for people who want to do this without having to write an entirely new Filter.

          2) now PhraseQuery actually supports the notion of relative positions, so people can do the kind of thing Doug was talking about in the email you linked to ... but i won't go so far as to suggest we should make those changes by default in the QueryParser

          Show
          Hoss Man added a comment - I knew there must have been a good reason why it hadn't been done before, but there are some distinctions between then and now: 1) then the position incrementing wasn't optional, i wouldn't suggest making that some change, just making it easier for people who want to do this without having to write an entirely new Filter. 2) now PhraseQuery actually supports the notion of relative positions, so people can do the kind of thing Doug was talking about in the email you linked to ... but i won't go so far as to suggest we should make those changes by default in the QueryParser
          Hide
          Doron Cohen added a comment -

          I checked further.

          Query parser currently ignores the position increment - it is calling PhraseQuery.add(Term) rather than add(Term,position).

          So without changing this - either by default or optionally - the whole thing will not work.

          (I think current QP behavior is wrong in this regard.)
          Is there a good reason not to change QP to always pass the position to the PhraseQuery?

          Show
          Doron Cohen added a comment - I checked further. Query parser currently ignores the position increment - it is calling PhraseQuery.add(Term) rather than add(Term,position). So without changing this - either by default or optionally - the whole thing will not work. (I think current QP behavior is wrong in this regard.) Is there a good reason not to change QP to always pass the position to the PhraseQuery?
          Hide
          Steve Rowe added a comment - - edited

          Is there a good reason not to change QP to always pass the position to the PhraseQuery?

          Backward compatibility In all seriousness, if QP's default behavior is changed, regardless of correctness, shouldn't it wait until 3.0?

          Hoss's suggestion changes no defaults, so caveat user applies - no problem applying to trunk immediately. Maybe a different patch to change the default behavior in StopFilter and in QP for 3.0?

          Show
          Steve Rowe added a comment - - edited Is there a good reason not to change QP to always pass the position to the PhraseQuery? Backward compatibility In all seriousness, if QP's default behavior is changed, regardless of correctness, shouldn't it wait until 3.0? Hoss's suggestion changes no defaults, so caveat user applies - no problem applying to trunk immediately. Maybe a different patch to change the default behavior in StopFilter and in QP for 3.0?
          Hide
          Doron Cohen added a comment -

          This patch enables increments optionally in QueryParser and StopFilter, as discussed above.
          The default behavior has not changed, so there should be no back-comp issues.

          The implementation in StopFilter differs slightly from the "older" reverted code in
          that it does not assume incoming tokens to have position-increment of 1. So it
          should work ok when stop filters and "expanders" are concatenated.

          Added API:

             QueryParser {
                 public boolean getApplyPositionIncrements();
                 public void setApplyPositionIncrements(boolean applyPositionIncrements);
             }  
          

          And

              StopFilter {
                  public boolean getApplyPositionIncrements();
                  public void setApplyPositionIncrements(boolean applyPositionIncrements);
                  public static boolean getDefaultApplyPositionIncrements()
                  public static void setDefaultApplyPositionIncrements(boolean defaultApplyPositionIncrements)
             }
          

          Note the two new static methods in StopFilter - these allow to control behavior system wide, and this way affect the behavior of classes that construct a StopFilter internally, like StopAnalyzer and StandardAnalyzer, without needing to pollute the API of many classes (all Snowball analyzers for instance) with new API.

          Parse tests and search tests were added.
          All tests pass.

          Feedback is required:

          • Comments on the proposed API:
            • the static methods approach?
            • perhaps "enable" instead of "apply"?
          • Do you think this should be added now (it is ready I think) or would you rather to hold this for after 2.3?
          Show
          Doron Cohen added a comment - This patch enables increments optionally in QueryParser and StopFilter, as discussed above. The default behavior has not changed, so there should be no back-comp issues. The implementation in StopFilter differs slightly from the "older" reverted code in that it does not assume incoming tokens to have position-increment of 1. So it should work ok when stop filters and "expanders" are concatenated. Added API: QueryParser { public boolean getApplyPositionIncrements(); public void setApplyPositionIncrements( boolean applyPositionIncrements); } And StopFilter { public boolean getApplyPositionIncrements(); public void setApplyPositionIncrements( boolean applyPositionIncrements); public static boolean getDefaultApplyPositionIncrements() public static void setDefaultApplyPositionIncrements( boolean defaultApplyPositionIncrements) } Note the two new static methods in StopFilter - these allow to control behavior system wide, and this way affect the behavior of classes that construct a StopFilter internally, like StopAnalyzer and StandardAnalyzer, without needing to pollute the API of many classes (all Snowball analyzers for instance) with new API. Parse tests and search tests were added. All tests pass. Feedback is required: Comments on the proposed API: the static methods approach? perhaps "enable" instead of "apply"? Do you think this should be added now (it is ready I think) or would you rather to hold this for after 2.3?
          Hide
          Doron Cohen added a comment -

          ... it does not assume incoming tokens to have position-increment of 1.
          So it should work ok when stop filters and "expanders" are concatenated.

          Well, chaining stop filters didn't work in previous patch.
          Attached patch fixed this.

          Problem was with tokenizers reuse of tokens (next(Token)).
          CharTokenizer, KeywordTokenizer and StandardTokenizer
          were fixed to reset positionIncrement to 1.

          The way I see it Tokenizers must do this and Filters must not do this.
          If there's no objection to this observation I'll javadoc it in Tokenizer
          for future implementors.

          Show
          Doron Cohen added a comment - ... it does not assume incoming tokens to have position-increment of 1. So it should work ok when stop filters and "expanders" are concatenated. Well, chaining stop filters didn't work in previous patch. Attached patch fixed this. Problem was with tokenizers reuse of tokens (next(Token)). CharTokenizer, KeywordTokenizer and StandardTokenizer were fixed to reset positionIncrement to 1. The way I see it Tokenizers must do this and Filters must not do this. If there's no objection to this observation I'll javadoc it in Tokenizer for future implementors.
          Hide
          Doron Cohen added a comment -

          Ok I separated the tokenizers fix to LUCENE-1101.
          Patch here stands-alone (not depending on 1101), will
          update it once 1101 gets committed.

          Show
          Doron Cohen added a comment - Ok I separated the tokenizers fix to LUCENE-1101 . Patch here stands-alone (not depending on 1101), will update it once 1101 gets committed.
          Hide
          Doron Cohen added a comment -

          Updated patch:

          • Token reuse fixes omitted (handled by LUCENE-1101).
          • new feature methods renamed from "apply" to "enable".

          All tests pass, default behavior unmodified, ready to commit.

          Show
          Doron Cohen added a comment - Updated patch: Token reuse fixes omitted (handled by LUCENE-1101 ). new feature methods renamed from "apply" to "enable". All tests pass, default behavior unmodified, ready to commit.
          Hide
          Doron Cohen added a comment -

          Committed . (already yesterday actually, I was sure that I already resolved it...)

          Show
          Doron Cohen added a comment - Committed . (already yesterday actually, I was sure that I already resolved it...)

            People

            • Assignee:
              Doron Cohen
              Reporter:
              Hoss Man
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development