Lucene - Core
  1. Lucene - Core
  2. LUCENE-1253

LengthFilter and other TokenFilters that skip tokens ignore relative positionIncrement

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      See for reference:
      http://www.nabble.com/WordDelimiterFilter%2BLenghtFilter-results-in-termPosition%3D%3D-1-td16306788.html
      and http://www.nabble.com/Lucene---Java-f24284.html

      It seems that LengthFilter (at least) could produce a stream in which the first Token has a positionIncrement of 0, which make CheckIndex and Luke function "Reconstruct&Edit" to generate exception.

      Should something be done to avoid this situation, or could the error be ignored (by allowing Term with a position of -1, and relaxing CheckIndex checks?)

      1. LUCENE-1253-3x.patch
        20 kB
        Uwe Schindler
      2. LUCENE-1253-3x.patch
        20 kB
        Uwe Schindler
      3. LUCENE-1253-trunk.patch
        18 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Hoss Man added a comment -

          the more general question is: should LengthFilter have an option to (or by default) change the position of the Tokens it lets through to be realative the positions of the tokens it strips out.

          ie given a stream of tokens expressed as <term,positionIncrement> ...

          <a,1> <b,1> <c,1> <ddddd,0> <e,0> <f,2> <ggggg,0> <hhhhhh,1>

          should the resulting stream after using a LengthFilter with min=3 be...

          <ddddd,0> <ggggg,0> <hhhhhh,1>

          ...(which i believe is the current behavior) or should it be...

          <ddddd,3> <ggggg,2> <hhhhhh,1>

          FWIW: StopFilter seems to have code to handle this (but I haven't tested that it works correctly)

          The question of whether or not it's legal for the first token of a stream to have a positionIncrement of "0" is being discussed on the list, most likely if it needs changed, that would be done in IndexWriter DocumentsWriter

          Show
          Hoss Man added a comment - the more general question is: should LengthFilter have an option to (or by default) change the position of the Tokens it lets through to be realative the positions of the tokens it strips out. ie given a stream of tokens expressed as <term,positionIncrement> ... <a,1> <b,1> <c,1> <ddddd,0> <e,0> <f,2> <ggggg,0> <hhhhhh,1> should the resulting stream after using a LengthFilter with min=3 be... <ddddd,0> <ggggg,0> <hhhhhh,1> ...(which i believe is the current behavior) or should it be... <ddddd,3> <ggggg,2> <hhhhhh,1> FWIW: StopFilter seems to have code to handle this (but I haven't tested that it works correctly) The question of whether or not it's legal for the first token of a stream to have a positionIncrement of "0" is being discussed on the list, most likely if it needs changed, that would be done in IndexWriter DocumentsWriter
          Hide
          Hoss Man added a comment -

          tweaking issue Summary to describe the more general problem with LengthFilter

          Show
          Hoss Man added a comment - tweaking issue Summary to describe the more general problem with LengthFilter
          Hide
          Uwe Schindler added a comment - - edited

          EDIT: Fixed in LUCENE-1542.

          Show
          Uwe Schindler added a comment - - edited EDIT: Fixed in LUCENE-1542 .
          Hide
          Uwe Schindler added a comment -

          After some discussion with Robert, we realized, that all TokenFilters that remove tokens from the stream must preserve the position increment like StopFilter. Else it could also happen that synonyms of a removed token appear as synonyms of the token before the removed one. If the removed one has posIncr=1, this would produce wrong synonyms.

          Two filters that remove tokanes need to be fixed in the same way like StopFilter:

          • LengthFilter
          • KeepWordFilter
          • ... (find more and add here)
          Show
          Uwe Schindler added a comment - After some discussion with Robert, we realized, that all TokenFilters that remove tokens from the stream must preserve the position increment like StopFilter. Else it could also happen that synonyms of a removed token appear as synonyms of the token before the removed one. If the removed one has posIncr=1, this would produce wrong synonyms. Two filters that remove tokanes need to be fixed in the same way like StopFilter: LengthFilter KeepWordFilter ... (find more and add here)
          Hide
          Uwe Schindler added a comment -

          Patch for the two broekn TokenFilters and refactoring of StopFilter. Also tests improved. No Version was added, instead the whole thing was done like for StopFilter by a ctor boolean. Tests were added for both Lucene and Solr. Also Solr's KeepWordFilter(+Factory) were changed to respect matchVersion for the CharArraySet. Also synced this filter's factory with the StopFilterFactory (params, parsing,...).

          After commit to 3.x I will port to trunk - oh no

          Show
          Uwe Schindler added a comment - Patch for the two broekn TokenFilters and refactoring of StopFilter. Also tests improved. No Version was added, instead the whole thing was done like for StopFilter by a ctor boolean. Tests were added for both Lucene and Solr. Also Solr's KeepWordFilter(+Factory) were changed to respect matchVersion for the CharArraySet. Also synced this filter's factory with the StopFilterFactory (params, parsing,...). After commit to 3.x I will port to trunk - oh no
          Hide
          Uwe Schindler added a comment -

          After discussion with Robert, we optimized the code of incrementToken() in the base class a little bit to reduce number of branches and method calls for the common cases.

          Show
          Uwe Schindler added a comment - After discussion with Robert, we optimized the code of incrementToken() in the base class a little bit to reduce number of branches and method calls for the common cases.
          Hide
          Robert Muir added a comment -

          +1, this looks really good.

          I looked for other filters, the only other one this could apply to is ChineseFilter, but this one is deprecated
          and we should just leave it alone (it exists for index backwards compatibility only).

          Show
          Robert Muir added a comment - +1, this looks really good. I looked for other filters, the only other one this could apply to is ChineseFilter, but this one is deprecated and we should just leave it alone (it exists for index backwards compatibility only).
          Hide
          Uwe Schindler added a comment -

          Committed 3.x revision: 1065324

          Now porting to trunk, damn!

          Show
          Uwe Schindler added a comment - Committed 3.x revision: 1065324 Now porting to trunk, damn!
          Hide
          Uwe Schindler added a comment -

          Merged patch for trunk, Robert can you review, I am totally clumsy now?

          Show
          Uwe Schindler added a comment - Merged patch for trunk, Robert can you review, I am totally clumsy now?
          Hide
          Uwe Schindler added a comment -

          Committed trunk revision: 1065343

          Show
          Uwe Schindler added a comment - Committed trunk revision: 1065343
          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Walter Ferrara
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development